Intelligent virtual assistants (IVAs) with natural language understanding (NLU), such as Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana, are becoming increasingly popular. In an IVA, NLU is a distinct component of spoken language understanding (SLU), working in conjunction with automatic speech recognition (ASR) and dialog management (DM). ASR produces a token sequence from speech, which is passed to NLU both for classifying the action or “intent” that the user wants to invoke (e.g. PlayMusicIntent, TurnOnIntent, BuyItemIntent) and for recognizing named entities (e.g. Artist, Genre, City). Based on the NLU output, DM decides the appropriate action and response, such as starting a song playback or turning off the lights.
Several components of NLU, such as intent classification (IC) and named entity recognition (NER), use machine learning in order to build models that are robust to the diverse patterns of natural language. Commonly used IC models include maximum entropy (MaxEnt), support vector machines (SVM), and deep neural networks (DNN). Popular NER models include conditional random fields (CRF) and long short-term memory (LSTM) networks. These machine learning methods perform best when trained on diverse annotated data collected from actual IVA user utterances. We refer to this data as live utterances.
NLU systems support functionality in a wide range of domains, such as music, weather, and traffic. As user needs expand, a major requirement is the ability to add support for new domains. A bottleneck in introducing new domains to an NLU system is that a new domain does not have a large number of annotated live utterances. For new domains, a dataset can be constructed using grammar generated utterances based on anticipated usage. Still, achieving good accuracy requires annotated live utterances from actual users because of the unexpected discrepancies between anticipated and actual usage. Thus, a mechanism is required to select live utterances that can be manually annotated and used to enrich the training dataset.
Random sampling is a common method for selecting utterances for annotation from the live data. However, in a production IVA system with high traffic, the number of available live utterances is vast. Meanwhile, due to the high cost of manual annotation, only a small sample of the live traffic can be annotated. Thus, in a random sample of live data, the number of utterances relevant to the new domain may be small. A desired sampling procedure should be able to select informative utterances that are relevant to the new domains. Informative utterances are those that, if annotated and added to the training data, reduce the error rates of the NLU model.
Active learning (AL) refers to machine learning methods that can interact with the data sampling procedure and guide the selection of informative instances for annotation. In this work, we explore AL algorithms for targeted utterance selection for new, underrepresented NLU domains. We compare least-confidence and query-by-committee AL approaches. Moreover, we propose an AL algorithm called Majority-CRF that uses an ensemble of classification and sequence labeling models to guide utterance selection for annotation. We designed the Majority-CRF algorithm to improve both the IC and NER components of an NLU system. Simulation experiments on three different new domains show that Majority-CRF achieves 6.6%-9% relative improvement compared to random sampling, as well as significant improvements compared to other active learning approaches.
2 Related Work
Tur et al. studied two AL approaches for reducing the annotation effort in SLU [12, 10]: a least confidence approach, and a query-by-committee disagreement. Both performed better than random sampling, and the authors concluded that the overall annotation effort could be reduced by half.
To improve AL, Kuo and Goel proposed to exploit the similarity between instances. Their results show improvements over simple confidence-based selection for data sizes of less than 5,000 utterances, but beyond 5,000 utterances the accuracy is the same. A computational limitation of the approach is that it requires computing the pairwise utterance similarity, an operation that is slow for millions of utterances (as available in a production IVA). However, their approach could potentially be sped up with techniques like locality-sensitive hashing.
Expected model change and expected error reduction are AL approaches based on decision theory. Expected model change tries to select utterances that will cause the greatest change to the model. Similarly, expected error reduction tries to select utterances that will maximally reduce the generalization error. Both methods provide more sophisticated ways of estimating the value of annotating any particular utterance. However, they require computing an expectation across all possible ways to label the utterance, which is computationally expensive for models such as NER and IC models with many labels and millions of parameters.
Schütze et al. showed that AL is susceptible to the missed cluster effect when selection focuses only on examples around the existing decision boundary and misses important clusters of data that receive high confidence. They conclude that AL may produce a sub-optimal classifier compared to random sampling with a large budget. To address this problem, Osugi et al. proposed an AL algorithm that balances exploitation (sampling around the decision boundary) and exploration (random sampling) by reallocating the annotation budget between the two. In our case, we iteratively select and annotate small batches of data that are used as feedback in subsequent selections, so extensive exploration is not required. Moreover, we always dedicate part of the annotation budget to random sampling to avoid missed clusters, although in this paper we do not discuss the effects of the additional randomly sampled data.
3 Active Learning For New Domains
In this section, we first discuss the random sampling baselines and standard active learning approaches. Then, we describe the proposed Majority-CRF algorithm, and other AL algorithms that we tested.
3.1 Random Sampling Baselines
A common strategy to choose a subset of live traffic utterances for manual annotation is random sampling. We consider two random sampling baselines: uniform random sampling and domain random sampling.
Uniform random sampling is popular because it provides an unbiased estimate of the live utterance distribution. However, this approach samples fewer utterances for newly-released domains because of their low usage frequency. Thus, especially under a limited annotation budget and a high data volume, accuracy improvements from randomly sampling live data for new domains are slow.
Domain random sampling uses the NLU hypothesized domain to sample live utterances for annotation that are potentially relevant to a target domain. For new domains, domain random sampling provides live data quickly compared to uniform random sampling. However, this approach still does not focus on the most informative utterances. Also, we risk both missing representative utterances due to false negatives as well as sampling irrelevant utterances due to false positives.
3.2 Active Learning Baselines
AL algorithms can improve domain random sampling by selecting the most informative utterances for annotation. Two popular AL approaches are least confidence and query-by-committee.
Least-confidence AL involves processing unlabeled data with the current NLU models and prioritizing the selection of the instances on which the model exhibits the least confidence. The intuition here is that utterances with low confidence are currently difficult for the model, and so “teaching” the models how they should be labeled is informative. However, a major weakness of this method is that out-of-domain or unactionable utterances are likely to receive low confidence. This weakness can be alleviated by looking at instances with medium confidence, using measures such as the least margin between the top two hypotheses or the highest Shannon entropy.
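The three uncertainty measures mentioned above can be sketched as follows. This is a minimal illustration with helper names of our own choosing, not the production implementation:

```python
import math

def least_confidence(probs):
    """Uncertainty as 1 minus the top class probability."""
    return 1.0 - max(probs)

def least_margin(probs):
    """Uncertainty as the negated margin between the top two hypotheses:
    a small margin means the model is torn between two labels."""
    top2 = sorted(probs, reverse=True)[:2]
    return -(top2[0] - top2[1])

def shannon_entropy(probs):
    """Shannon entropy of the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction scores low on all three measures, while an
# ambiguous one scores high and would be prioritized for annotation.
confident = [0.9, 0.05, 0.05]
ambiguous = [0.4, 0.35, 0.25]
```

Under all three measures, the `ambiguous` distribution would be selected before the `confident` one; they differ only in how they treat the non-top hypotheses.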
Query-by-committee (QBC) uses a variety of different classifiers (e.g. SVMs, MaxEnt, Random Forests) that are trained on the existing annotated data. Each classifier is applied independently to every unlabeled candidate instance. To select an instance, we prioritize annotation of the top candidates that are assigned the most diverse labels. Then, the classifiers are retrained and the process is repeated. One problem with this approach is that, depending on the models and the size of the committee, it can be computationally expensive to apply on large datasets.
3.3 Majority-CRF Algorithm
Majority-CRF is a confidence-based AL algorithm that uses selection models trained on the available data and does not rely on the full NLU system. We choose selection models that are faster and simpler than the full NLU system, which has several advantages: first, fast incremental training with the selected annotated data; second, fast predictions on millions of utterances; third, the selected data is not biased toward the current NLU models, which makes our approach reusable even if those models change.
Algorithm 1 shows a generic AL procedure that we use to implement Majority-CRF and the other variants of the algorithm that we tested. In the procedure, we train an ensemble of models on positive data from the domain of interest (e.g. Books) and negative data that is everything not in the domain of interest (e.g. Music, Videos, etc.). Then, we use the models to filter and prioritize a batch of utterances for annotation. After the batch is annotated, we retrain the models with the new data and repeat the process. To alleviate the tendency of least-confidence approaches to select irrelevant data, we add unsupported utterances and sentence fragments to the negative-class training data of the AL models. This helps keep noisy utterances on the negative side of the decision boundary, so that they can be eliminated during filtering. Note that, when selecting for several domains at a time, we run the selection procedure independently per domain and then deduplicate the utterances before sending them for annotation.
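The selection loop described above can be sketched as follows. This is a simplified rendering with hypothetical callback names, not the authors' exact Algorithm 1:

```python
def active_learning_loop(candidate_pool, train_models, filter_fn, score_fn,
                         annotate, batch_size, iterations):
    """Generic AL selection loop: train an ensemble on the data annotated
    so far, filter and rank candidates, annotate a batch, retrain, repeat."""
    annotated = []
    for _ in range(iterations):
        models = train_models(annotated)
        # Keep only candidates accepted by the filtering function
        # (e.g. majority-positive predictions).
        kept = [u for u in candidate_pool if filter_fn(models, u)]
        # Prioritize the lowest-scoring (least confident) candidates.
        kept.sort(key=lambda u: score_fn(models, u))
        batch = kept[:batch_size]
        annotated.extend(annotate(batch))
        candidate_pool = [u for u in candidate_pool if u not in batch]
    return annotated
```

The concrete algorithms in this section are instances of this loop with different choices of ensemble, filtering function, and scoring function.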
Models. We experimented with
-gram linear binary classifiers trained to minimize different loss functions:logistic, hinge, and squared. Each one is trained to distinguish between positive and negative data and learns a different decision boundary. Note that we use the classifiers raw unnormalized prediction scores (no sigmoid applied) that can be interpreted as distances between the utterance and the classifiers decision boundaries which is . We use the implementation of these classifiers from the Vowpal Wabbit package . Also, to directly target the NER task, we experimented with CRF, trained on the NER labels of the positive data.
Filtering function. We experimented with two filtering functions: keeping only utterances that a majority of the binary classifiers predict as positive (majority filtering), and keeping only utterances on which at least one classifier disagrees with the others (disagreement filtering).
Scoring function. When the committee consists of only binary classifiers with scores $c_i(x)$, we combine the scores using either the sum of absolutes, $s_{SA}(x) = \sum_i |c_i(x)|$, or the absolute sum, $s_{AS}(x) = |\sum_i c_i(x)|$. $s_{SA}$ prioritizes utterances where all scores are small (i.e., close to all decision boundaries), and $s_{AS}$ prioritizes utterances where either all scores are small or there is large disagreement between classifiers (e.g., one score is large negative, another is large positive, and the third is small). Both $s_{SA}$ and $s_{AS}$ can be seen as generalizations of least confidence to a committee of classifiers. In algorithms including a CRF model, we compute the score as $s_{CRF}(x) = P_{CRF}(y|x) \cdot \sigma(c_{log}(x))$, i.e., the CRF probability multiplied by the logistic classifier probability, where $\sigma$ is the sigmoid function. Note that we ignore the outputs of the squared and hinge classifiers for scoring, though they may still be used for filtering.
Given Algorithm 1 and the choice of model committee, filtering function, and scoring function, we compare the following variants of the generic AL algorithm:
AL-Logistic: a single logistic classifier, filtering for only positive utterances, and least-confidence prioritization.
QBC-SA and QBC-AS: a committee of logistic, hinge, and squared classifiers with disagreement filtering, scored with $s_{SA}$ and $s_{AS}$ correspondingly. Similar to query-by-committee with our generalized scoring.
Majority-SA and Majority-AS: the same committee with majority filtering, scored with $s_{SA}$ and $s_{AS}$ correspondingly.
QBC-CRF and Majority-CRF: the committee of binary classifiers plus a CRF, with disagreement and majority filtering correspondingly, scored with $s_{CRF}$.
The AL-Logistic and QBC variants serve as baseline AL algorithms. The QBC-CRF and Majority-CRF models combine the IC-focused binary classifier scores with NER-focused sequence labeling scores, and use majority filtering to select informative utterances from a large dataset. To the best of our knowledge, this is a novel architecture for active learning in NLU.
Abe and Mamitsuka proposed using bagging to build classifier committees for AL. Here, bagging refers to random sampling with replacement from the original training data to create diverse classifiers. We experimented with the bagging technique for the Majority and QBC AL algorithms, leading to the following bagging variants:
QBC-AS-Bagging-Logistic and Majority-AS-Bagging-Logistic: a committee of logistic classifiers trained using bagging, with disagreement and majority filtering correspondingly, scored with $s_{AS}$.
QBC-AS-Bagging and Majority-AS-Bagging: same as QBC-AS and Majority-AS, except with the classifiers trained using bagging.
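The bagging step used by these variants amounts to bootstrap resampling of the annotated data, which can be sketched as a generic helper (not the authors' code):

```python
import random

def bagging_committee(train_fn, data, n_models, seed=0):
    """Build a committee by training each model on a bootstrap resample
    (random sampling with replacement) of the annotated data.
    train_fn is any callable that fits a model on a list of examples."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]
        committee.append(train_fn(resample))
    return committee
```

Because each resample omits roughly a third of the examples and duplicates others, the resulting models learn slightly different decision boundaries even when they share the same loss function.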
4 Experimental Results
4.1 Evaluation Metrics
We use Slot Error Rate (SER), which includes the intent as a slot, to evaluate the overall predictive performance of the NLU models. SER is defined as the ratio of the number of slot prediction errors to the total number of reference slots. Errors can be insertions, substitutions, or deletions. We treat the intent as another slot and count intent misclassifications as substitutions.
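As a worked example of this definition, a simplified SER computation might look as follows. Matching slots by name is a simplification of full alignment-based scoring, and the example utterance is made up:

```python
def slot_error_rate(reference, hypothesis):
    """SER = (substitutions + deletions + insertions) / reference slots.
    Slots are (name, value) pairs; the intent is treated as just another
    slot, so an intent misclassification counts as a substitution."""
    ref, hyp = dict(reference), dict(hypothesis)
    substitutions = sum(1 for k in ref if k in hyp and ref[k] != hyp[k])
    deletions = sum(1 for k in ref if k not in hyp)
    insertions = sum(1 for k in hyp if k not in ref)
    return (substitutions + deletions + insertions) / len(ref)

ref = [("Intent", "PlayMusicIntent"), ("Artist", "sting"), ("Genre", "rock")]
hyp = [("Intent", "PlayMusicIntent"), ("Artist", "string")]
# One substitution (Artist) and one deletion (Genre) over three reference
# slots gives SER = 2/3.
```

Lower SER is better; the experiments below report SER as a relative reduction against a baseline model.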
4.2 Simulated active learning
Real-world application of AL requires manual annotations, which are costly for experimentation purposes. Therefore, to conduct multiple controlled experiments with different selection algorithms, we simulated AL by taking a subset of the available annotated training data as the unannotated candidate pool. By “hiding” these annotations, the AL algorithm started with only a small pool of annotated utterances for the simulated “new” domains. The algorithm was then allowed to choose relevant utterances from the simulated candidate pool, and once it selected the utterances, their annotations were revealed to it.
Dataset. We conducted experiments using an internal test dataset of 750K randomly sampled live utterances, and a training dataset of 42M utterances containing a combination of grammar generated and randomly sampled live utterances. The datasets cover 24 domains, including Music, Shopping, Local Search, Sports, Books, Cinema and Calendar.
Experimental Design. We split the full annotated training data into a 12M-utterance initial training set for IC and NER, and a 30M-utterance candidate pool for selection. This experimental setup attempts to mimic the real-world use case where the candidate pool has a large proportion of negative examples (utterances that belong to different domains). To further increase the similarity to a real-world candidate pool, we also added 100K sentence fragments and out-of-domain utterances to the candidate pool, which allows us to compare the susceptibility of different algorithms to noisy or irrelevant data.
We chose Books, Local Search, and Cinema as target domains for simulating the AL algorithms; see Table 1. Each target domain had 550-650K utterances in the candidate pool, and the remaining 21 non-target domains had 28.5M utterances in the candidate pool. After choosing the target domains, we employed the different AL algorithms to select 12K utterances per domain from the candidate pool, for a total annotation budget of 36K utterances.
We experimented with Majority-CRF and the rest of the configurations listed in Section 3.3, as well as:
Rand-Uniform – uniform random selection of 36K utterances, the same total annotation budget as for AL
Rand-Domain – random selection of 12K per target domain, the domain comes from the NLU hypothesis
AL-NLU – selection based on initial NLU model hypothesis domain and prioritization by lowest hypothesis confidence, all utterances are selected in a single run
We ran each AL configuration twice and averaged the SER scores to account for fluctuations in selection due to stochastic model training. For the random sampling configurations, we ran each configuration five times and averaged the SER scores.
Table 1: Target domain statistics and example utterances.
|Domain||#Utt||#Utt Test||Example utterances|
|Books||290K||13K||“search in mystery books”, “read me a book”|
|Local Search||260K||16K||“mexican food nearby”, “pick the top bank”|
|Cinema||270K||9K||“more about hulk”, “what’s playing in theaters”|
Table 2 (fragment): #Utt selected and relative SER reduction per target domain.
|Algorithm Group||Algorithm||Overall #Utt||Books #Utt||Books SER||Local Search #Utt||Local Search SER||Cinema #Utt||Cinema SER||Non-Target #Utt|
|Committee Bagging Models||QBC-AS-Bagging-Logistic||35.1K||5599||7.65||7990||8.51||5420||7.60||16.0K|
|Committee Models and CRF||QBC-CRF||35.1K||3653||7.44||6593||9.78||4064||10.26||20.7K|
4.2.1 SER Results
Table 2 shows the experimental results for the target domains Books, Local Search, and Cinema. We report domain error rates as SER percent relative reduction compared to the initial model (higher is better). Note that we are unable to disclose the initial model's absolute metric values. For each experiment, we add both the target and non-target selected data to the initial model's training data, and then run an evaluation to compute SER.
We test for statistically significant improvements using the Wilcoxon test with 1000 bootstrap resamples and p-value < 0.05.
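A hedged sketch of such a paired significance check using SciPy's Wilcoxon signed-rank test; the per-bucket SER values below are made up for illustration, and the paper additionally combines the test with 1000 bootstrap resamples:

```python
from scipy.stats import wilcoxon

# Hypothetical paired SER measurements (e.g. per bootstrap resample of the
# test set) for a baseline sampler and an AL-based sampler. AL is
# consistently lower, so the paired test should reject the null hypothesis.
baseline_ser = [10.2, 11.5, 9.8, 12.1, 10.9, 11.2, 10.4, 11.8]
al_ser       = [9.6, 10.9, 9.1, 11.4, 10.2, 10.8, 9.9, 11.0]

stat, p_value = wilcoxon(baseline_ser, al_ser)
significant = p_value < 0.05
```

Because the Wilcoxon test only uses the signs and ranks of the paired differences, it makes no normality assumption about the SER fluctuations across resamples.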
Random Baselines. All active learning configurations achieved statistically significant SER gains compared to Rand-Uniform. As expected, Rand-Uniform selected few relevant utterances for the target domains due to their low frequency. The Rand-Domain algorithm selects a high number of utterances for the target domains and achieves statistically significant SER improvements compared to Rand-Uniform, but the overall gains are small, around 1% relative per target domain. A major factor in Rand-Domain's reduced performance in our experiments is that it tends to capture many utterances that the NLU models can already recognize without errors, a common problem when randomly sampling live NLU traffic.
Single Model Algorithms. With a single selection iteration, AL-NLU improves SER over the Rand-Uniform and Rand-Domain baselines, and the improvements are statistically significant. The SER differences between single-iteration AL-Logistic and AL-NLU are not statistically significant. However, AL-Logistic with 6 iterations obtained a statistically significant 1%-2% SER improvement over both single-iteration algorithms. In addition, AL-Logistic selected 200 fewer unactionable and unsupported utterances. This result demonstrates the importance of incremental utterance selection for iteratively refining the selection model.
Committee Algorithms. AL algorithms incorporating a committee of models outperformed those based on single models by a statistically significant 1-2% SER. The Majority algorithms performed slightly better than the QBC algorithms, and were able to collect more target utterances. The absolute-sum scoring function performed slightly better than the sum of absolutes for both QBC and Majority. Among all committee algorithms, Majority-AS performed the best, but the differences with the other committee algorithms are not statistically significant.
Committee Bagging Algorithms. The ensemble logistic bagging algorithm performed statistically significantly better than the single logistic algorithm, but worse than the algorithm using different (logistic, squared, and hinge) classifiers without bagging. When we use bagging for training the logistic, squared, and hinge classifiers, the results are slightly worse compared to not using bagging.
Committee and CRF Algorithms. The Majority-CRF algorithm with $s_{CRF}$ scoring achieves a statistically significant SER improvement of 1-2% compared to Majority-AS (the best configuration without the CRF). QBC-CRF is worse than Majority-CRF on Books, Local Search, and Cinema; this difference was statistically significant on Books, but not on Cinema or Local Search.
In summary, AL yields more rapid improvements not only by selecting utterances relevant to the target domain, but also by trying to select the most informative utterances. For instance, although the various AL algorithms selected 40-50% false positive utterances from non-target domains, whereas Rand-Domain selected only around 20% false positives, the AL algorithms still outperformed the Rand-Domain sampler. This indicates that labeling ambiguous false positives helps resolve existing confusions between domains. Another important observation is that majority filtering performs better than QBC disagreement filtering across all of our experiments. A possible reason is that majority filtering selects a better balance of boundary utterances for classification and in-domain utterances for NER. Finally, the Majority-CRF results also indicate that incorporating the CRF model improves the performance of the committee models, most likely because a CRF-based confidence directly targets the NER task.
4.3 Human-in-the-loop active learning
We also performed AL for six new NLU domains with human-in-the-loop annotators and actual unannotated user data. For simplicity, we used the Majority-SA configuration in these case studies. We ran the AL selection for 5-10 iterations with batch sizes varying between 1000 and 2000.
Table 3: SER, #Utt selected, and #Utt test set per domain.
Table 3 shows the SER results from AL with human annotators. We observe significant 4.6%-9% improvements compared to the existing NLU model. On average, 25% of the selected utterances are false positives, which is lower than the 50% in simulation because the starting model has more data for the negative class. Around 10% of the active learning data is lost to unactionable or out-of-domain utterances, which in our case is similar to random sampling.
While working with human annotators on new domains, we observed two major challenges that limit the improvements from AL. First, annotators make more mistakes on AL-selected utterances because they are more ambiguous. Second, new domains may have a limited amount of test data, so the impact of AL cannot be fully measured. Currently, we address the annotation mistakes with manual data clean-up and transformations, but further research is needed to develop an automated solution. To improve the coverage of the test dataset for new domains, we are exploring selection using stratified sampling.
5 Conclusion
Active learning research usually focuses on selecting data to improve overall machine learning model performance. In this paper, we instead focused on AL methods designed to select data for underrepresented domains, a common bottleneck in NLU systems. Our proposed Majority-CRF algorithm leads to statistically significant performance gains over standard AL and random sampling methods while working with a limited annotation budget. In simulations, Majority-CRF showed relative SER gains of 6.6%-9% compared to random sampling, as well as significant improvements over other AL algorithms with the same annotation budget. Similarly, human-in-the-loop AL results show statistically significant improvements of 4.6%-9% compared to the existing NLU system.
-  Renato De Mori, Frédéric Bechet, Dilek Hakkani-Tur, Michael McTear, Giuseppe Riccardi, and Gokhan Tur, “Spoken language understanding,” 2008.
-  Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, 1996.
-  Corinna Cortes and Vladimir Vapnik, “Support-vector networks,” Machine learning, 1995.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
-  John Lafferty, Andrew McCallum, Fernando Pereira, et al., “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the eighteenth international conference on machine learning, ICML, 2001.
-  Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
-  Burr Settles, “Active learning literature survey,” Computer sciences technical report, University of Wisconsin–Madison, 2009.
-  David D Lewis and Jason Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in Proceedings of the eleventh international conference on machine learning, 1994.
-  Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby, “Selective sampling using the query by committee algorithm,” Machine learning, 1997.
-  Gokhan Tur, Robert E Schapire, and Dilek Hakkani-Tur, “Active learning for spoken language understanding,” in Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on, 2003.
-  Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan, “Multi-criteria-based active learning for named entity recognition,” in Proceedings of the 42Nd Annual Meeting on Association for Computational Linguistics, 2004.
-  Gokhan Tur, Dilek Hakkani-Tür, and Robert E Schapire, “Combining active and semi-supervised learning for spoken language understanding,” Speech Communication, 2005.
-  Hong-Kwang Jeff Kuo and Vaibhava Goel, “Active learning with minimum expected error for spoken language understanding,” in INTERSPEECH, 2005.
-  Burr Settles, Mark Craven, and Soumya Ray, “Multiple-instance active learning,” in Advances in neural information processing systems, 2008.
-  Nicholas Roy and Andrew McCallum, “Toward optimal active learning through monte carlo estimation of error reduction,” ICML, Williamstown, 2001.
-  Hinrich Schütze, Emre Velipasaoglu, and Jan O Pedersen, “Performance thresholding in practical text classification,” 2006.
-  Thomas Osugi, Deng Kim, and Stephen Scott, “Balancing exploration and exploitation: A new algorithm for active machine learning,” in Data Mining, Fifth IEEE International Conference on, 2005.
-  Tobias Scheffer, Christian Decomain, and Stefan Wrobel, “Active hidden markov models for information extraction,” in International Symposium on Intelligent Data Analysis, 2001.
-  Burr Settles and Mark Craven, “An analysis of active learning strategies for sequence labeling tasks,” in Proceedings of the conference on empirical methods in natural language processing, 2008.
-  John Langford, Lihong Li, and Alex Strehl, “Vowpal Wabbit,” 2007.
-  Naoki Abe and Hiroshi Mamitsuka, “Query learning strategies using boosting and bagging,” in Machine Learning: Proceedings of the Fifteenth International Conference (ICML’98), Morgan Kaufmann, 1998, vol. 1.
-  John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel, “Performance measures for information extraction,” in In Proceedings of DARPA Broadcast News Workshop, 1999.
-  Myles Hollander, Douglas A Wolfe, and Eric Chicken, Nonparametric statistical methods, John Wiley & Sons, 2013.