Submodularity-inspired Data Selection for Goal-oriented Chatbot Training based on Sentence Embeddings

02/02/2018 · by Mladen Dimovski, et al.

Goal-oriented (GO) dialogue systems rely on an initial natural language understanding (NLU) module to determine the user's intention and its parameters, also known as slots. Since these systems, also known as bots, help users solve problems in relatively narrow domains, they require training data within those domains. This leads to significant data availability issues that inhibit the development of successful bots. To alleviate this problem, we propose a data selection technique for the low-data regime that allows training with significantly fewer labeled sentences, and thus smaller labeling costs. We create a submodularity-inspired data ranking function, the ratio penalty marginal gain, to select the data points to label based solely on information extracted from the textual embedding space. We show that the distances in the embedding space are a viable source of information for data selection. This method outperforms several known active learning techniques without using the label information, allowing for cost-efficient training of NLU units for goal-oriented bots. Moreover, our proposed selection technique does not require retraining the model between selection steps, making it time-efficient as well.




1 Introduction

In their most useful form, goal–oriented dialogue systems need to understand the user’s need in great detail. A typical way of structuring this understanding is the separation of intents and slots, which can also be seen as parameters of the intents. Slot filling, sometimes known as entity extraction, is the problem of finding the relevant information in a user query that is needed for its further processing. As an example, in a restaurant reservation scenario, given the sentence Are there any French restaurants in Toronto downtown?, the task is to correctly output, or fill, the following slots: {cuisine: French} and {location: Toronto downtown}.

The slot filling task is usually seen as a sequence tagging problem where the goal is to tag each relevant word token with the corresponding slot name using the B–I–O (Begin, Inside, Outside) convention. The table below shows how we would correctly tag the previous example.

Are  there  any  French     restaurants  in  Toronto     downtown
O    O      O    B-Cuisine  O            O   B-Location  I-Location

Most methods created for slot filling are supervised and require large amounts of labeled in–domain sentences to perform well. However, it is often the case that very little or no training data is available. Annotating new sentences is an expensive process that requires considerable human effort; as a result, achieving good performance with as little data as possible becomes an important concern.

Our solution to the data availability problem relies on a better way of selecting the training samples to label. If a limited amount of resources is available, we want to enable the user to spend it in the most efficient way. More precisely, we propose a method to rank the unlabeled sentences according to their utility. We measure the latter with the help of a ranking function satisfying the principles of submodularity, a concept known to capture the intuition present in data selection.

We run experiments on three different publicly available slot filling datasets: MIT Restaurant, MIT Movie and ATIS [Liu et al.2013a, Liu et al.2013b]. We are interested in measuring the model’s performance when trained with only a few dozen labeled sentences, a situation we refer to as the low–data regime. We compare our proposed selection method to several standard baselines, including two flavours of active learning.

We identify the following three main contributions:

  • We show that the space of raw, unlabeled sentences contains information that we can use to cleverly choose which ones to label.

  • We create a submodularity–inspired ranking function to select the potentially most useful sentences to label.

  • We apply this data selection method to the problem of slot filling and show that the model’s performance can be considerably better when the training samples are chosen in an intelligent way.

In Section 3, we provide some background and detail the novel data selection technique. Section 4 describes the datasets that we work with in our experiments and the baselines that we used as a comparison reference point. Finally, in Section 5, we present and discuss the obtained results.

2 Related Work

2.1 Slot filling

Numerous models have been proposed to tackle the slot filling task; a comprehensive review is given by [Mesnil et al.2015]. However, the most successful methods to have emerged are neural network architectures, in particular recurrent neural network schemes based on word embeddings [Kurata et al.2016, Ma and Hovy2016, Liu and Lane2016, Zhang and Wang2016, Zhai et al.2017, Zhu and Yu2017]. The number of proposed model variants in the recent literature is abundant, with architectures ranging from encoder–decoder models [Kurata et al.2016], to models tying the slot filling problem to the closely related intent detection task [Zhang and Wang2016], and even models that use the attention mechanism, originally introduced in [Bahdanau et al.2014], to boost performance [Liu and Lane2016]. The final model that we adopted in our study is a bi–directional LSTM network that uses character–sensitive word embeddings, with a fully connected dense layer and a linear-chain CRF layer on top [Huang et al.2015, Lample et al.2016]. We provide the model’s specifications and more details in the appendix.

2.2 Low–data regime challenges

Typically, machine learning models work reasonably well when trained with a sufficient amount of data; for example, reported results on the popular ATIS benchmark go beyond a 95% F1 score [Liu and Lane2016, Zhang and Wang2016]. However, performance degrades significantly when little training data is available, a common scenario when a new domain of user queries is introduced. There are two major approaches to handling the challenges presented by the scarcity of training data:

  • The first strategy is to train a multi–task model whose goal is to deliver better performance on the new domain by leveraging patterns learned from other, closely related domains for which sufficient training data exists [Jaech et al.2016, Hakkani-Tür et al.2016].

  • The second strategy is to cleverly select the few training instances that one can afford to label. This includes active learning strategies [Fu et al.2013, Angeli et al.2014] that identify data points close to the separation manifold of an imperfectly trained model.

In our work, we focus on the latter scenario, as experience has shown that gathering high–quality in–domain data can hardly be replaced by other techniques. We assume that we can choose the sentences we wish to label; consequently, the main problem is to make this choice in a way that yields the model’s best performance.

3 Data selection

3.1 Submodularity and rankings

Let V be a ground set of elements and f : 2^V → ℝ a function that assigns a real value to each subset of V. f is called submodular if the incremental benefit of adding an element to a set diminishes as the context in which it is considered grows. Formally, let A and B be subsets of V with A ⊆ B. Then f is submodular if Δ(x | A) ≥ Δ(x | B) for every x ∈ V \ B, where Δ(x | A) := f(A ∪ {x}) − f(A) is the benefit, or marginal gain, of adding the element x to the set A. The concept of submodularity captures the idea of diminishing returns, which is inherently present in data selection for training machine learning models.

In the case of the aforementioned slot filling problem, the ground set V is the set of all available unlabeled sentences in a dataset, and the value f(X) of a set X ⊆ V is a score measuring the utility of the sentences in X. Submodular functions have already been used in document summarization [Lin and Bilmes2011, Lin and Bilmes2012] and in various data subset selection tasks [Kirchhoff and Bilmes2014, Wei et al.2014, Wei et al.2015]. However, to the best of our knowledge, they have not yet been studied in the slot filling context.

An important hypothesis made when submodular functions are used for data selection is that if A and B are two sets of data points for which f(A) ≥ f(B), then using A for training would give an overall better performance. If we have a predefined size for our training set, i.e. a maximum number k of samples that we can afford to label, then we need to find the set X that maximizes f subject to the cardinality constraint |X| ≤ k. Cardinality–constrained submodular function maximization is an NP–hard problem, and one usually has to resort to greedy maximization techniques (see [Krause and Golovin2014]). If the function is monotone, then the greedy solution, in which one iteratively adds the element having the largest marginal gain with respect to the already picked ones, gives a (1 − 1/e)–approximation guarantee [Nemhauser et al.1978]. Consequently, what is important in practice is a way to rank the sentences, i.e. to find a ranking permutation that orders them according to their usefulness. Notice that a monotone submodular function naturally defines a selection criterion: the order in which the elements are picked by the greedy optimization procedure. In this slot filling data selection study, we explore different selection criteria without limiting ourselves to properly defined submodular functions.

3.2 Sentence similarity

A simple way to perform sentence selection can be based on some intrinsic attribute of the sentences, such as their length or the presence of an important token. However, in the absence of informative intrinsic elements, similarity to already known sentences can provide additional insight. An important metric we thus need to define is the similarity between a pair of sentences. The need to introduce such a measure of similarity appeared simultaneously with the development of active learning techniques for text classification. For example, in [McCallum et al.1998], the similarity between two documents (their analogue of our sentences) is measured by an exponentiated Kullback–Leibler divergence between the smoothed empirical word-occurrence distributions of the two documents.

In our study, we use a recently developed technique for producing sentence embeddings, sent2vec [Pagliardini et al.2017], and use the resulting Euclidean-space vectors to define the similarity. More precisely, for every sentence s_i, the sent2vec algorithm outputs an embedding v_i, which we then use to define the similarity between sentences s_i and s_j as

    sim(s_i, s_j) = exp(−γ · ‖v_i − v_j‖)

Here γ is a scaling constant measuring the concentration of the cloud of points in the embedding space: it is the inverse of the average distance between all pairs of embeddings. Note that γ will in general depend on the dataset, but it is not a hyper–parameter that needs to be tuned. Finally, the exponential function condenses the similarity to the interval (0, 1].
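As an illustrative sketch (not the authors' code), the similarity matrix can be computed from precomputed sentence embeddings in a few lines of NumPy; the function name is ours, and the sent2vec vectors are assumed to be given:

```python
import numpy as np

def similarity_matrix(embeddings):
    # Pairwise sim(i, j) = exp(-gamma * ||v_i - v_j||), where gamma is the
    # inverse of the average distance over all (ordered) pairs of embeddings.
    dists = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    n = len(embeddings)
    gamma = 1.0 / (dists.sum() / (n * (n - 1)))  # 1 / mean pairwise distance
    return np.exp(-gamma * dists)
```

The diagonal is 1 and all entries lie in (0, 1], matching the exponential squashing described above.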

For a fixed sentence, the most similar candidates are the ones whose embedding vectors are the closest to its own in the embedding space. Tables 1 and 2 present two example sentences (the first row of each table), one from the MIT Restaurant and one from the MIT Movie dataset, together with their closest neighbors. We see that closeness in the embedding space is in line with human judgment. Moreover, selecting any one of those sentences for training should largely diminish the usefulness of the others, because they contain very little new information with respect to the already chosen phrase.

whats a cheap mexican restaurant here
we are looking for a cheap mexican restaurant
i am looking for a cheap mexican restaurant
is ixtapa mexican restaurant considered cheap
Table 1: An example sentence (first row) from the MIT Restaurant domain and the sentences corresponding to the three closest points to its embedding
what was the movie that featured that song over the rainbow
find me the movie with the song over the rainbow
what movie was the song somewhere out there featured in
what movie features the song hakuna matata
Table 2: An example sentence (first row) from the MIT Movie domain and the sentences corresponding to the three closest points to its embedding

A 2–dimensional t–SNE projection of the cloud of points of the MIT Movie domain, together with an isolated example cluster, is shown in Figure 1. The large black triangle in the cluster is an example sentence and the darker dots are its closest neighbors. Closeness is measured in the original, high–dimensional embedding space, and distances are not exactly preserved under the 2–dimensional t–SNE projection portrayed here. The cluster shown in Figure 1 corresponds to the sentences in Table 2.

Figure 1: A 2–dimensional t-SNE projection of the cloud of points representing the embeddings of the sentences of the MIT Movie domain. The overlapping plot gives a closer view of the small cluster at the bottom left; the corresponding sentences are shown in Table 2

3.3 Coverage score

The similarity metric defined in the previous section can be used to define a simple submodular function that evaluates a subset of sentences X ⊆ V [Lin and Bilmes2011]:

    f_cov(X) = Σ_{x ∈ X} Σ_{y ∈ V} sim(x, y)

Intuitively, the inner sum measures the total similarity of a sentence x to the whole dataset; it is a score of how well x covers V (hence the name coverage score). The marginal gain of a sentence x with respect to a set X is given by

    Δ_cov(x | X) = Σ_{y ∈ V} sim(x, y)

and, as we see, it does not depend on X. That makes the function f_cov, strictly speaking, modular, and the greedy optimization exactly optimal. Table 3 presents the top three coverage score sentences from the MIT Restaurant dataset and Figure 2 shows that these points tend to be centrally positioned.

The top three sentences with highest coverage score
1. i need to find somewhere to eat something close by im really hungry can you see if theres any buffet style restaurants within five miles of my location
2. i am trying to find a restaurant nearby that is casual and you can sit outside to eat somewhere that has good appetizers and food that is a light lunch
3. i need to find a restaurant that serves sushi near 5 th street one that doesnt have a dress code
Table 3: The top three coverage score sentences from the MIT Restaurant domain
Figure 2: The position of the top forty points in the 2–dimensional t–SNE projection of the cloud corresponding to the MIT Restaurant dataset, according to two different rankings
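Under the same assumptions as before (a precomputed similarity matrix; the function name is ours), the coverage ranking reduces to a single sort, precisely because the marginal gain is independent of the already picked set:

```python
import numpy as np

def coverage_ranking(sim):
    # cov(x) = sum_y sim(x, y); since this gain never changes as sentences
    # are picked, greedy selection is equivalent to one sort by coverage.
    cov = sim.sum(axis=1)
    return np.argsort(-cov)  # most "central" sentences first
```

Taking the first k indices of this ranking gives the coverage-score training set of size k.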

The coverage score function suffers from the fact that it only considers how useful a sentence is in general, not in the context of the other sentences in the set. Hence, it may pick candidates that cover a lot of the space but happen to be very similar to each other. To deal with this problem, an additional penalty, or diversity reward, term is introduced, and the marginal gain takes the form

    Δ_lin(x | X) = Σ_{y ∈ V} sim(x, y) − λ · Σ_{y ∈ X} sim(x, y)

where λ is a parameter that controls the trade–off between the coverage of x and its similarity to the set X of already picked sentences. The additional term makes the function submodular, as the gain is decreasing in X; however, experiments with both the plain coverage score and this penalized variant, including variations thereof (see Figure 6 in the results section), yielded mediocre results. This suggests that linear penalization can be improved upon. In the next section, we introduce a new selection method which features a non–linear penalization.
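A greedy sketch of the linearly penalized selection could look as follows; the helper name and the default trade-off value are ours, and the similarity matrix is assumed precomputed:

```python
import numpy as np

def greedy_linear_penalty(sim, k, lam=1.0):
    # Each step maximizes cov(x) - lam * sum_{y in picked} sim(x, y):
    # coverage of the whole dataset minus similarity to picked sentences.
    cov = sim.sum(axis=1)
    picked, penalty = [], np.zeros(len(sim))
    for _ in range(k):
        gain = cov - lam * penalty
        if picked:
            gain[picked] = -np.inf      # never pick a sentence twice
        best = int(np.argmax(gain))
        picked.append(best)
        penalty += sim[:, best]         # update similarity to the picked set
    return picked
```

With lam = 0 this degenerates to the coverage ranking and happily picks near-duplicates; a positive lam enforces diversity.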

3.4 Ratio penalty marginal gain

We propose a direct definition of the marginal gain of an element with respect to a set. This is an alternative to first providing a submodular function, deriving its marginal gain expression, and then using that expression to find the maximizing set of a given size. We use this marginal gain to rank and select the sentences that are likely the most valuable to label. We call it the ratio penalty marginal gain and define it as:

    Δ_RP(x | X) = log ( Σ_{y ∈ V} sim(x, y) / (1 + Σ_{y ∈ X} sim(x, y)) )

We cannot uniquely define a submodular function that generates the ratio penalty marginal gain: a generating function is pinned down only up to the choice of f(∅), so the definition of f itself is ambiguous. Nevertheless, the above expression satisfies the submodularity condition, as it is decreasing in X.

The ratio penalty marginal gain is again a trade–off between the coverage score of a sentence and its similarity to the already picked ones. An important change is that the penalty is introduced by division instead of by subtraction, as in the linear penalty of the previous section. To gain more intuition, notice that, as the logarithm is an increasing function, the induced ranking is the same as the ranking produced by the ratio itself, cov(x) / (1 + pen(x, X)), where cov(x) = Σ_{y ∈ V} sim(x, y) is the coverage of x and pen(x, X) = Σ_{y ∈ X} sim(x, y) is its similarity to the picked set.

In summary, we start with the distances between the embeddings of the sentences, re–scale them by dividing by the average distance, and squash them exponentially. We then aggregate in this new space by summing the similarities, and finally revert to the original scale by taking the logarithm.
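The resulting selection procedure can be sketched as follows, under our reading of the definition (the +1 guard in the denominator, which keeps the gain well defined when no sentence has been picked yet, is our assumption, as are the names):

```python
import numpy as np

def ratio_penalty_selection(sim, k):
    # Greedy ranking by log( cov(x) / (1 + sum_{y in picked} sim(x, y)) ).
    # Note that no model retraining is needed between selection steps.
    cov = sim.sum(axis=1)
    picked, penalty = [], np.zeros(len(sim))
    for _ in range(k):
        gain = np.log(cov / (1.0 + penalty))
        if picked:
            gain[picked] = -np.inf      # never pick a sentence twice
        best = int(np.argmax(gain))
        picked.append(best)
        penalty += sim[:, best]
    return picked
```

Because the logarithm is monotone, ranking by the ratio itself at each step would select exactly the same sentences.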

4 Experiments

4.1 Datasets and method

We experiment with three different publicly available datasets whose details are shown in Table 4. Each one contains a few thousand training samples, out of which we select only a few dozen according to a given selection criterion. More precisely, we are interested in the model’s performance when trained with only a small number of labeled sentences. This selection simulates the behaviour of a system that needs to be trained for a newly available domain. We measure performance by the best F1 score achieved during training, evaluated on the separate test set that we have for each domain. The final column of Table 4 shows the performance of our adopted model when trained on the full datasets; these are the best results we can hope to achieve when training with a proper subset of the training samples.

Domain #train #test #slots F1 score
MIT Restaurant 7661 1522 17 80.11
MIT Movie 9776 2444 25 87.86
ATIS 4978 893 127 95.51
Table 4: The three domains and their basic information (number of training samples, number of samples in the test set and number of different slots). The MIT Restaurant and MIT Movie datasets contain user queries about restaurant and movie information. The Airline Travel Information Services (ATIS) dataset mainly contains questions about flight booking and transport information

We will denote by D the full training set of a particular domain, assuming that each element of D is a (sentence, tags) pair, and by D_k the training set comprising the k samples that rank highest under a given selection criterion.

4.2 Baselines

To the best of our knowledge, data selection techniques have not yet been applied to the slot filling problem. As a result, there is no direct benchmark to which we can compare our method. Instead, we use three baselines employed in similar situations: random data selection, classic active learning, and an adaptation of a state–of–the–art selection method, randomized active learning [Angeli et al.2014].

4.2.1 Random selection of data

To select k sentences out of the total, we uniformly pick a random permutation of the training set and take its first k elements as our training set. While random data selection may occasionally pick sentences which are not relevant, it guarantees diversity, leading to a reasonable performance. We show that some complex methods fail to outperform the random baseline, or improve upon it only by a small margin.

4.2.2 Classic active learning framework

As our second baseline, we choose the standard active learning selection procedure: we iteratively train the model and select new data based on the model’s uncertainty about the unlabeled points. To calculate the uncertainty for a sentence s consisting of n word tokens, we adapt the least confidence (LC) measure [Fu et al.2013, Culotta and McCallum2005]. We define the uncertainty of a trained model M about the sample s as the average least-confidence uncertainty across its labels:

    U_M(s) = (1/n) · Σ_{i=1..n} (1 − P_M(t_i* | w_i))

where t_i* is the most likely tag for the word w_i as predicted by M, and P_M(t_i* | w_i) is its softmax–normalized score output by the last dense layer of the network.

We proceed by selecting batches of ten sentences, thus performing mini–batch adaptive active learning [Wei et al.2015], as follows. The first ten sentences, used for the initial training of the model, are picked uniformly at random. Then, at each iteration, we pick the new batch of ten sentences about which the current model is most uncertain and retrain the model on the augmented training set.
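The uncertainty measure and the batch-selection step can be sketched as follows (the per-token softmax scores are assumed to come from the trained tagger; the helper names are ours):

```python
import numpy as np

def least_confidence(tag_probs):
    # Average least-confidence uncertainty over the tokens of one sentence:
    # mean of (1 - probability of the most likely tag per token).
    # tag_probs has shape (n_tokens, n_tags).
    return float(np.mean(1.0 - tag_probs.max(axis=1)))

def next_batch(uncertainties, picked, batch_size=10):
    # Indices of the batch_size most uncertain sentences not yet labeled.
    banned = set(picked)
    order = np.argsort(-np.asarray(uncertainties))
    return [int(i) for i in order if int(i) not in banned][:batch_size]
```

After each retraining round, the uncertainties are recomputed with the updated model before the next batch is drawn.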

4.2.3 Randomized active learning

In the text processing scenario, the classic active learning algorithm has the drawback of picking sentences that are not good representatives of the whole dataset. The model is often least certain about samples that lie in sparsely populated regions of the input space, and this blindness to input-space density often leads to poor performance. To address this problem, data is usually selected by a weighted combination of the uncertainty about a sample and its correlation with the other samples, which measures the density of the region around it [Fu et al.2013, Culotta and McCallum2005].

This approach requires finding a good correlation metric and tuning a trade–off parameter between confidence and representativeness. The latter is not applicable in our scenario, as it would require access to a validation set, which deviates from our selection principle of labeling the least data possible. Instead, we adapt a well–performing technique proposed by [Angeli et al.2014] in which samples are selected randomly, each with probability proportional to the model’s uncertainty about it. Although uncertainty will again be highest for the poor samples, their number is small, so they carry only a tiny fraction of the total uncertainty mass across the whole dataset. Consequently, they have very little chance of being selected.
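A minimal sketch of this sampling step (the function name and the fixed seed are ours):

```python
import numpy as np

def sample_randomized_al(uncertainties, batch_size=10, seed=0):
    # Draw a batch without replacement, each sentence with probability
    # proportional to the model's uncertainty about it.
    u = np.asarray(uncertainties, dtype=float)
    rng = np.random.default_rng(seed)
    return rng.choice(len(u), size=batch_size, replace=False, p=u / u.sum())
```

Sentences carrying near-zero uncertainty mass are almost never drawn, while high-uncertainty outliers no longer dominate the selection deterministically.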

Figure 3: MIT Restaurant dataset (3 baselines + RPS)
Figure 4: MIT Movie dataset (3 baselines + RPS)

5 Results

Figures 3, 4 and 5 show the resulting curves of the three baselines and of the ratio penalty selection (RPS) for the MIT Restaurant, MIT Movie and ATIS datasets, respectively. The x–axis shows the number of samples used for training and the y–axis the best F1 score obtained during training, measured on the separate test set. As the three baselines rely on random sampling, we repeat each procedure five times and plot the mean together with a confidence interval. As expected, the classic active learning (ALC) algorithm performs poorly, as it tends to pick uninformative samples at the boundary of the cloud. The randomized active learning (ALR) gives a much better score but, surprisingly, remains comparable to the random data selection strategy. Finally, the ratio penalty selection (RPS) yields the best results, outperforming the baselines by a significant margin across all three domains. For example, in the MIT Restaurant domain, RPS maintains a consistent average gap in F1 score over the best-performing baseline, ALR. In both the MIT Restaurant and the MIT Movie domain, RPS matches the performance of ALR with noticeably fewer labeled sentences.

Figure 6 presents the resulting curves of different selection strategies for the MIT Restaurant dataset; the results were similar for the remaining two domains. The linear penalty selection introduced in Section 3.3, shown only for the best-performing parameter λ, yields results that are better than random choice and, in some regions, comparable to RPS. However, the disadvantage of this method is that it requires tuning an additional hyper–parameter, which will differ from one domain to another. We also show the performance when the model is trained with the longest sentences first (Length Score). This does fairly well in the very low data regime, but it soon starts to degrade, eventually becoming worse than all the other methods.

Figure 5: ATIS dataset (3 baselines + RPS)
Figure 6: Comparison of various additional selection techniques on the MIT Restaurant dataset

6 Conclusion

In this paper, we explored the utility of existing data selection approaches in the slot filling scenario and introduced a novel submodularity–inspired selection technique. We showed that a good choice of selection criterion can have a large influence on the model’s performance in the low–data regime. Moreover, we showed that one does not necessarily have to limit this search to properly defined submodular functions: since such functions are optimized in practice by greedily applying the derived marginal gain, defining only the latter is sufficient to produce a ranking of the sentences. In addition, we defined a similarity metric between pairs of sentences based on their continuous vector representations, which is in line with human intuition. Finally, we showed that the space of raw samples already contains a lot of information that can be exploited; this information is potentially more useful than the output-space information used in the active learning paradigm.


Adopted model specifications

We present here the specifications of our BiLSTM–CRF model and the parameter values used in our experiments. The model uses pre–trained GloVe embeddings at both the word and the character level. The final representation of a word is the concatenation of its GloVe embedding and a character-level representation obtained by a separate mini–BiLSTM network; hence, whenever a previously unseen word appears, its character-level representation still provides partial information. The main BiLSTM and the character BiLSTM each use a fixed hidden-layer size. Both the word and the character embeddings are fine–tuned during the training phase. Gradients are clipped to a maximum norm, and the learning rate of the Adam optimizer is geometrically decayed at every epoch by a constant factor until it reaches a minimum set value. Training uses a fixed mini–batch size.