Our commercial voice-assistant natural language understanding (NLU) system uses linear MaxEnt models Berger et al. (1996) for text classification, and linear-chain CRF models Lafferty et al. (2001); Sutton et al. (2012) for named-entity recognition (NER). There are several desirable properties that make these models suitable for NLU:
They achieve high accuracy given the right set of features Wang and Manning (2012).
They produce probabilistic output, which allows combining multiple models in a recognition pipeline Su et al. (2018).
They can be compressed to run on hardware-constrained devices Strimel et al. (2018).
To train MaxEnt and CRF models, we minimize an objective function that includes the negative log-likelihood of the training dataset and elastic-net regularization. The elastic-net regularization combines the L1 and L2 penalties on the model weights. The L1 penalty, the sum of absolute weights, serves as a feature selection mechanism by forcing the unimportant model weights to zero. The L2 penalty, the sum of squared weights, helps avoid overfitting by preventing excessively large weights.
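As a minimal illustration of the two penalties, the sketch below computes an elastic-net term for a small weight vector. The function name `elastic_net_penalty`, the strength parameters `l1` and `l2`, and the conventional factor of 1/2 on the squared term are assumptions made for illustration.

```python
def elastic_net_penalty(weights, l1, l2):
    """Return l1 * sum(|w|) + (l2 / 2) * sum(w^2) for a list of weights.

    The 1/2 factor on the L2 term is a common convention and an assumption here.
    """
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = 0.5 * l2 * sum(w * w for w in weights)
    return l1_term + l2_term


# A weight that is exactly zero contributes nothing to either term.
example_weights = [0.0, -2.0, 0.5]
print(elastic_net_penalty(example_weights, l1=0.1, l2=0.01))
```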
A popular way to train elastic-net linear models is the Orthant-Wise Limited-memory Quasi-Newton Optimizer (OWL-QN) Andrew and Gao (2007). OWL-QN is a variant of the L-BFGS Zhu et al. (1997) quasi-Newton optimization method that supports the L1 penalty. OWL-QN produces compact and accurate models, and it can be sped up using multiple CPU threads.
In our voice-assistant’s NLU, we classify user utterance text into hundreds of classes. We have a hierarchical classification system, where we use MaxEnt to first perform domain classification (DC) and then intent classification (IC). DC predicts the general domain of an utterance. IC predicts the user intent within a domain. For example, for the utterance “play music”, DC predicts Music and IC predicts PlayMusicIntent, and for the utterance “how is the weather”, DC predicts Weather and IC predicts GetWeatherIntent. This NLU architecture, where intents are separated by domain, allows us to efficiently train the IC models in parallel on their respective domain-specific training data. However, the MaxEnt DC training still has to be done on the entire training dataset. A DC model training with OWL-QN using 28 CPU threads on 50 million utterances takes around 2 hours.
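The two-stage flow can be sketched as follows. This is only a structural illustration: `classify`, `toy_domain_model`, and `toy_intent_models` are hypothetical names, and the keyword rules stand in for trained MaxEnt models.

```python
def classify(utterance, domain_model, intent_models):
    """Run DC first, then the IC model that belongs to the predicted domain."""
    domain = domain_model(utterance)
    intent = intent_models[domain](utterance)
    return domain, intent


# Toy keyword rule standing in for a trained DC model (hypothetical).
def toy_domain_model(utterance):
    return "Music" if "play" in utterance else "Weather"


# One IC model per domain; each can be trained independently of the others.
toy_intent_models = {
    "Music": lambda utterance: "PlayMusicIntent",
    "Weather": lambda utterance: "GetWeatherIntent",
}

print(classify("play music", toy_domain_model, toy_intent_models))
print(classify("how is the weather", toy_domain_model, toy_intent_models))
```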
For NER, we use one CRF model per domain to recognize domain-specific named-entities, such as artist name and song name for the Music domain. For example, in the sentence “play desert rose by sting”, NER labels “desert rose” as a song and “sting” as an artist. The CRF models are more complex and slower to train compared to the MaxEnt models. An NER model training with OWL-QN using 28 CPU threads on a large domain with 6 million utterances takes around 2 hours 40 min.
Continuous improvement of the NLU system requires frequent retraining of the DC, IC, and NER models. However, the long training times for DC and NER are a bottleneck. In this work, we focus on machine learning optimization methods to improve training efficiency. We develop a fast optimizer called F10-SGD which on large internal datasets trains MaxEnt and CRF models 4 times faster compared to OWL-QN without loss of accuracy or increase in the model size.
Our contributions are the following:
We combine Stochastic Gradient Descent (SGD) optimization techniques for parallel training and elastic-net regularization into a fast and accurate trainer.
We improve the accuracy of the NER model using biased sampling that prioritizes harder examples towards the end of the training.
We perform evaluations on internal and public datasets. On internal datasets, F10-SGD is 4x faster with half the CPU threads compared to OWL-QN. On external datasets, we speed up training by 22% and 3.4-3.7x against FastText and CRFSuite respectively.
Why not GPU? An important constraint for our use case is to develop a fast optimizer for CPU instead of GPU. Our model building platform trains many models per week, and allocating a GPU for each training is expensive.
2 Related Work
Bottou (2010, 2012) advocated for using SGD in large scale machine learning optimization. He developed CRFSGD (http://leon.bottou.org/projects/sgd) and demonstrated the effectiveness of SGD for training CRF models. Okazaki (2007) developed the popular CRFSuite toolkit, which includes an SGD trainer based on CRFSGD, and an optimized implementation of the OWL-QN trainer. A limitation of both CRFSGD and CRFSuite is that they do not use multiple CPU threads to speed up training. Also, their SGD implementations do not support L1 regularization.
Vowpal Wabbit Langford et al. (2007) and scikit-learn Pedregosa et al. (2011) provide fast SGD implementations for logistic regression models (note that logistic regression is equivalent to MaxEnt). Both support L1 regularization using cumulative penalty Tsuruoka et al. (2009) and truncated gradient Langford et al. (2009). Scikit-learn SGD does not support parallel SGD. Vowpal Wabbit does not support parallel SGD either, though it can be simulated by sending raw training examples over sockets to multiple Vowpal Wabbit processes that share the same memory.
FastText Joulin et al. (2016b) is a linear embedding model for text classification. It supports asynchronous multi-threaded SGD training via Hogwild Recht et al. (2011), which makes training fast. However, FastText does not support L2 or dropout regularization, leading to suboptimal performance on small datasets. Also, it does not support L1 for feature selection, but it does have a quantization option Joulin et al. (2016a) to reduce the model size after training.
3.1 Elastic-net Linear Models
Probabilistic linear models have the general form shown in Equation 1, where $x$ is an input example, $y$ is a prediction target, $w_i$ is a weight for the feature function $f_i(x, y)$, and $Z(x)$ is a partition function that normalizes the probability. If the prediction target $y$ is a class, as in IC and DC, the model is a MaxEnt. If the prediction target is a sequence of labels, as in NER, the model is a CRF.

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i w_i f_i(x, y)\Big) \qquad (1)$$
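For the MaxEnt case, the normalized probability can be sketched in a few lines. This is an illustrative sketch only: it assumes binary indicator features keyed by hypothetical `(feature, class)` pairs, and `maxent_probs` is not a name from the paper. The CRF case needs a sequence-level partition function and is not shown.

```python
import math


def maxent_probs(active_features, weights, classes):
    """Return p(y | x) for every class y, given the active features of x."""
    # Linear score per class: the sum of weights of the active (feature, class) pairs.
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in active_features)
        for y in classes
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {y: math.exp(s - max_score) for y, s in scores.items()}
    z = sum(exp_scores.values())  # partition function of the shifted scores
    return {y: e / z for y, e in exp_scores.items()}


toy_weights = {("play", "Music"): 2.0, ("weather", "Weather"): 2.0}
print(maxent_probs(["play"], toy_weights, ["Music", "Weather"]))
```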
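Equation 2 shows the optimization objective $O(w)$ for MaxEnt and CRF. It includes the negative log-likelihood over the training set of size $N$, and the L1 and L2 penalties with strengths $\lambda_1$ and $\lambda_2$.

$$O(w) = -\sum_{j=1}^{N} \log p(y_j \mid x_j) + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 \qquad (2)$$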
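To optimize $O(w)$, we use the gradient in Equation 4. The function $\mathrm{sign}(w)$ returns, for every weight of $w$, -1 if the weight is negative, 1 if it is positive, or 0 if it is zero. $\mathrm{sign}$ is necessary to define the sub-gradient of $O(w)$ for zero weights, since the absolute value function in the L1 penalty is not differentiable at zero.

$$\nabla O(w) = -\sum_{j=1}^{N} \nabla \log p(y_j \mid x_j) + \lambda_1\, \mathrm{sign}(w) + \lambda_2 w \qquad (4)$$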
The definitions of $Z(x)$ and $\nabla \log p(y \mid x)$ differ for MaxEnt and CRF, and are not covered in this paper.
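The regularization part of the gradient is the same for both model types and can be sketched directly. The names `sign` and `penalty_subgradient` are illustrative, and the sketch assumes the L2 term carries a factor of 1/2 in the objective, so its derivative is `l2 * w`.

```python
def sign(w):
    """Return -1 for a negative weight, 1 for a positive weight, and 0 for zero."""
    return (w > 0) - (w < 0)


def penalty_subgradient(weights, l1, l2):
    """Sub-gradient of l1 * sum(|w|) + (l2 / 2) * sum(w^2) for each weight."""
    return [l1 * sign(w) + l2 * w for w in weights]


# The zero weight gets a zero sub-gradient from both penalties.
print(penalty_subgradient([0.0, -2.0, 0.5], l1=0.1, l2=0.01))
```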
OWL-QN is an optimization method that iteratively adjusts the model parameters towards the optimal value. Algorithm 1 provides a high-level simplified description of OWL-QN that helps explain its computational bottlenecks, but does not describe the algorithm in detail.
OWL-QN approximates the inverse Hessian using the last $m$ weight differences and gradient differences. The quality of the approximation depends on the history size $m$; a recommended value for $m$ is between 4 and 7.
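To make the role of the stored differences concrete, here is a sketch of the standard L-BFGS two-loop recursion, which turns a gradient and the last few difference pairs into an approximate Newton direction. It is plain L-BFGS and deliberately omits the orthant-wise handling of the L1 penalty that OWL-QN adds; the names `lbfgs_direction`, `s_list`, and `y_list` are assumptions for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lbfgs_direction(grad, s_list, y_list):
    """Return -H^{-1} grad, with H^{-1} approximated from the stored pairs.

    s_list holds weight differences and y_list holds gradient differences,
    ordered from oldest to newest.
    """
    q = list(grad)
    history = []
    # First loop: walk from the newest pair to the oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        history.append((s, y, rho, alpha))
    # Initial scaling of the inverse Hessian from the newest pair.
    gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1]) if s_list else 1.0
    r = [gamma * qi for qi in q]
    # Second loop: walk from the oldest pair to the newest.
    for s, y, rho, alpha in reversed(history):
        beta = rho * dot(y, r)
        r = [ri + si * (alpha - beta) for ri, si in zip(r, s)]
    return [-ri for ri in r]


# For a quadratic with Hessian diag(2, 4), two exact pairs recover the Newton step.
print(lbfgs_direction([2.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```

With an empty history the function falls back to the negative gradient, which is why the first few iterations behave like gradient descent.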
In step 3, OWL-QN computes an update direction $d$ using the inverse Hessian approximation and the gradient $\nabla O(w)$ over all $N$ training examples. Then, in step 4 it performs a line search in the direction $d$. For a large $N$ even one epoch takes significant time, and OWL-QN typically needs $m$ epochs to find a good inverse Hessian approximation, and tens or even hundreds of epochs to converge.
Parallelization. We can speed up OWL-QN by using multiple CPU threads to compute the objective $O(w)$ and the gradient $\nabla O(w)$. Each CPU thread receives a subset of the training dataset and computes the forward scores and the gradients for its examples. Then, the results are aggregated to compute $O(w)$ and $\nabla O(w)$.
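The shard-and-aggregate pattern can be sketched as below. This is a structural illustration under stated assumptions: a squared-error loss stands in for the MaxEnt or CRF likelihood, the function names are hypothetical, and in CPython the global interpreter lock prevents pure-Python threads from giving a real speed-up, so a native implementation would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_loss_and_grad(shard, w):
    """Loss and gradient of a toy squared-error objective on one data shard."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += 0.5 * err * err
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return loss, grad


def parallel_loss_and_grad(data, w, num_threads):
    """Split the data into shards, evaluate each in a thread, then sum the results."""
    shards = [data[k::num_threads] for k in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda shard: shard_loss_and_grad(shard, w), shards))
    total_loss = sum(loss for loss, _ in results)
    total_grad = [sum(grad[i] for _, grad in results) for i in range(len(w))]
    return total_loss, total_grad


toy_data = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0), ([3.0, 0.0], 2.0)]
print(parallel_loss_and_grad(toy_data, [0.1, -0.2], num_threads=2))
```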
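We propose using SGD as a fast alternative to OWL-QN for training MaxEnt and CRF models. Similar to OWL-QN, SGD is an iterative optimization method. It uses the gradient approximation $\hat{\nabla} O(w)$ in Equation 5. $\hat{\nabla} O(w)$ is calculated from $\nabla O(w)$ for a single random training example $(x_j, y_j)$. The weights $w$ are then updated with learning rate $\eta$, without relying on the inverse Hessian. Since $w$ is updated many times during a single epoch, SGD typically needs fewer epochs to converge compared to OWL-QN.

$$\hat{\nabla} O(w) = -\nabla \log p(y_j \mid x_j) + \frac{\lambda_1}{N}\, \mathrm{sign}(w) + \frac{\lambda_2}{N} w, \qquad w \leftarrow w - \eta\, \hat{\nabla} O(w) \qquad (5)$$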
In $\hat{\nabla} O(w)$ we divide $\lambda_1$ and $\lambda_2$ by the number of examples $N$, since $w$ is updated $N$ times per epoch.
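A dense version of a single SGD step with per-example penalty scaling might look as follows. The function `sgd_step` and its parameter names are illustrative assumptions; `example_grad` stands for the gradient of the negative log-likelihood of one example, which depends on the model type and is not computed here.

```python
def sgd_step(w, example_grad, lr, l1, l2, n):
    """Return w - lr * (g + (l1 / n) * sign(w) + (l2 / n) * w) for one example.

    Dividing l1 and l2 by n spreads each penalty over the n updates of an epoch.
    """
    return [
        wi - lr * (gi + (l1 / n) * ((wi > 0) - (wi < 0)) + (l2 / n) * wi)
        for wi, gi in zip(w, example_grad)
    ]


print(sgd_step([1.0, 0.0], [0.5, -1.0], lr=0.1, l1=2.0, l2=4.0, n=2))
```

This dense form touches every weight on every step, which is the cost that lazy updates are meant to avoid.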
F10-SGD in Algorithm 2 is a parallel version of SGD, also known as Hogwild. Each CPU thread receives a random subset of the training dataset and updates the shared weights $w$. The weight updates are not synchronized, so it is possible for threads to overwrite each other's updates. However, the algorithm has a nearly optimal rate of convergence for models where many training examples have non-overlapping features Recht et al. (2011), which is the case for linear MaxEnt and CRF models.
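The lock-free update pattern can be sketched with Python threads. This is a toy illustration, not the trainer described here: each example activates one feature and pulls its weight towards a target value, all function names are hypothetical, and CPython's global interpreter lock means the threads interleave rather than run truly in parallel.

```python
import random
import threading


def hogwild_worker(shard, w, lr, epochs, seed):
    """Update the shared weight list without any locking."""
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(shard)
        for i, target in shard:
            old = w[i]                         # unsynchronized read
            w[i] = old - lr * (old - target)   # unsynchronized write, may clobber others


def hogwild_train(data, num_features, num_threads, lr=0.5, epochs=5):
    w = [0.0] * num_features
    # Each thread gets its own shard (a new list), but all threads share w.
    shards = [data[k::num_threads] for k in range(num_threads)]
    threads = [
        threading.Thread(target=hogwild_worker, args=(shard, w, lr, epochs, k))
        for k, shard in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


# Feature i has target i + 1; every weight should end up close to its target.
toy_data = [(i % 4, float(i % 4 + 1)) for i in range(400)]
print(hogwild_train(toy_data, num_features=4, num_threads=3))
```

Because each example touches a single weight, a lost update only costs a little progress on that weight, which mirrors the sparse, mostly non-overlapping setting where Hogwild works well.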
Learning Rate. For SGD to converge, it is critical that the learning rate is neither too large nor too small. We start with an initial learning rate $\eta_0$, and decay it linearly during training, following Joulin et al. (2016b). We tune $\eta_0$ on a held-out development set using grid search.
The learning rate $\eta_t$ depends on the iteration counter $t$. To have the same $\eta_t$ with multiple threads across identical runs, we have to update $t$ atomically. Otherwise, $\eta_t$ is going to differ, which causes fluctuation of the weights and makes it harder to reproduce previous models. The same atomicity consideration applies to the L1 and L2 algorithms below.
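One way to picture a linearly decayed learning rate with an atomically updated counter is the sketch below. The exact decay formula is an assumption (a linear ramp from the initial rate down to zero over a fixed number of updates), and `LinearDecaySchedule` is a hypothetical name; a lock plays the role of an atomic increment.

```python
import threading


class LinearDecaySchedule:
    """Hand out learning rates initial_lr * (1 - t / total_updates), t = 0, 1, ..."""

    def __init__(self, initial_lr, total_updates):
        self.initial_lr = initial_lr
        self.total_updates = total_updates
        self._t = 0
        self._lock = threading.Lock()

    def next_lr(self):
        # Read and increment the shared counter as one atomic step, so that
        # no two threads ever receive the same iteration number.
        with self._lock:
            t = self._t
            self._t += 1
        return self.initial_lr * (1.0 - t / self.total_updates)


schedule = LinearDecaySchedule(initial_lr=0.5, total_updates=4)
print([schedule.next_lr() for _ in range(4)])
```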
4.1 Lazy Updates
For linear MaxEnt and CRF models, $w$ typically has millions of weights for features corresponding to different words and phrases in the training data. Fortunately, a single example has only a small set of active features, so we can compute $\nabla \log p(y_j \mid x_j)$ efficiently using only the weights of those features. However, the computation of $\hat{\nabla} O(w)$ requires applying the L1 and L2 penalties to the full weight vector $w$, which is slow. Thus, it is essential to perform “lazy” L1 and L2 updates only on the active features.
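Lazy L2. To make the L2 update lazy, we use the weight rescaling approach in Algorithm 3 Shalev-Shwartz et al. (2011). The key observation is that the L2 penalty can be seen as rescaling $w$ by $(1 - \eta \lambda_2 / N)$ at every update, Equations 6-7. Thus, we can represent $w$ as $\alpha v$ and update the unscaled weights $v$ and the scaler $\alpha$ independently. At the end of every epoch, we multiply the weights $v$ by $\alpha$ and reset $\alpha = 1$.

$$w \leftarrow w - \eta \Big( -\nabla \log p(y_j \mid x_j) + \frac{\lambda_2}{N} w \Big) \qquad (6)$$

$$w \leftarrow \Big( 1 - \frac{\eta \lambda_2}{N} \Big) w + \eta\, \nabla \log p(y_j \mid x_j) \qquad (7)$$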
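The rescaling trick can be sketched as a small class that keeps a global scale next to the unscaled weights. The class name `LazyL2Weights` and its fields are assumptions for illustration; `sparse_grad` maps active feature indices to the gradient of the negative log-likelihood for one example.

```python
class LazyL2Weights:
    """Weights stored as scale * v, so that L2 shrinkage costs O(1) per update."""

    def __init__(self, size):
        self.v = [0.0] * size  # unscaled weights
        self.scale = 1.0       # shared scaler

    def weight(self, i):
        return self.scale * self.v[i]

    def update(self, sparse_grad, lr, l2, n):
        # Shrinking every weight by (1 - lr * l2 / n) is a single multiplication.
        self.scale *= 1.0 - lr * l2 / n
        # Only active features are touched; dividing by the new scale makes the
        # effective weight change equal to -lr * g.
        for i, g in sparse_grad.items():
            self.v[i] -= lr * g / self.scale

    def flush(self):
        """Fold the scaler into the weights, e.g. at the end of an epoch."""
        self.v = [self.scale * vi for vi in self.v]
        self.scale = 1.0


model = LazyL2Weights(3)
model.update({0: -1.0}, lr=0.1, l2=1.0, n=10)
model.update({1: 2.0}, lr=0.1, l2=1.0, n=10)
print([model.weight(i) for i in range(3)])
```

Folding the scaler back in periodically also keeps it from drifting towards zero, which would otherwise hurt numerical precision.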
Lazy L1. To make the L1 update lazy, we use the cumulative L1 penalty approach Tsuruoka et al. (2009) in Algorithm 4. First, from Equation 8, we note that the L1 update rarely makes weights exactly zero, so it does not reduce model size. Thus, we modify the L1 update to clip weights to zero if they cross zero after applying the L1 penalty. Second, to lazily update a weight $w_i$, we have to maintain the cumulative L1 penalty $u$ and the L1 penalty $q_i$ that is already applied to the weight $w_i$, and update the weight with the difference, steps 6 and 8 of Algorithm 4. At the end of every epoch, we run ApplyL1 for every weight and reset $u$ and $q_i$.

$$w_i \leftarrow w_i - \eta \frac{\lambda_1}{N}\, \mathrm{sign}(w_i) \qquad (8)$$
Note that in step 3 we have to divide the L1 penalty by the L2 scaler, which boosts the penalty in order to account for the L2 regularization.
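The cumulative penalty with clipping can be sketched as follows, in the spirit of Tsuruoka et al. (2009). For simplicity this sketch leaves out the interaction with the L2 scaler, and the class `LazyL1Weights` with its fields `u` and `q` is an illustrative assumption rather than a transcription of Algorithm 4.

```python
class LazyL1Weights:
    """L1 penalty applied only when a weight is touched, with clipping at zero."""

    def __init__(self, size):
        self.w = [0.0] * size
        self.q = [0.0] * size  # penalty already applied to each weight
        self.u = 0.0           # total penalty each weight could have received

    def apply_l1(self, i):
        z = self.w[i]
        if z > 0:
            self.w[i] = max(0.0, z - (self.u + self.q[i]))  # clip at zero
        elif z < 0:
            self.w[i] = min(0.0, z + (self.u - self.q[i]))  # clip at zero
        self.q[i] += self.w[i] - z

    def update(self, sparse_grad, lr, l1, n):
        self.u += lr * l1 / n
        for i, g in sparse_grad.items():
            self.w[i] -= lr * g
            self.apply_l1(i)  # catch up on the penalty this weight has missed

    def flush(self):
        """Bring every weight up to date, e.g. at the end of an epoch."""
        for i in range(len(self.w)):
            self.apply_l1(i)
        self.q = [0.0] * len(self.w)
        self.u = 0.0


model = LazyL1Weights(2)
model.update({0: -10.0}, lr=0.1, l1=1.0, n=1)  # weight 0 grows, then is penalized
model.update({1: -0.5}, lr=0.1, l1=1.0, n=1)   # weight 1 is clipped back to zero
model.flush()
print(model.w)
```

The clipping step is what produces exact zeros, and therefore a smaller model.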
4.2 Active Bias Sampling
Active bias sampling Chang et al. (2017) is a technique that can speed up SGD convergence by emphasizing harder examples towards the end of the training. This is useful for NLU since there are many frequent training examples that are easy and can be deprioritized, e.g. “play music”, “how is the weather”, “stop”.
In Algorithm 5, we add active bias sampling to F10-SGD by replacing the random shuffle with sampling from a discrete distribution $P_s$ at a specified epoch. $P_s$ assigns higher probability to medium confidence examples, and lower probability to high and low confidence examples. We prioritize medium confidence examples and not low confidence ones, as the latter are more likely to be outliers.
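We define the sampling probability $P_s(j)$ of example $j$ in Equation 9. $\bar{p}_j$ is the average of the past iteration probabilities $p(y_j \mid x_j)$. The term $1 - \bar{p}_j$ can be interpreted as the union of the alternative prediction probabilities. And $\epsilon$ is a smoothing prior that makes sure we assign non-zero probability to training examples with confidence close to one, otherwise the model may “forget” to recognize them.

$$P_s(j) \propto \bar{p}_j (1 - \bar{p}_j) + \epsilon \qquad (9)$$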
In addition to sampling from $P_s$, we change the learning rate of the examples that we sampled. This way, we unbias the gradient estimation and ensure convergence Zhao and Zhang (2015). For each sample $j$, we have an importance weight, inversely proportional to $P_s(j)$, that is multiplied with the learning rate $\eta$. A normalization factor keeps the average importance weight equal to one, i.e. we do not change the global learning rate.
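The sampling and reweighting steps can be sketched together. The score `p * (1 - p) + epsilon` follows the description of favoring medium confidence examples, and the importance weight `1 / (N * P)` is an assumed normalization chosen so that its expected value under the sampling distribution is one; all function names are hypothetical.

```python
import random


def sampling_distribution(avg_probs, epsilon):
    """Turn average correct-label probabilities into sampling probabilities."""
    scores = [p * (1.0 - p) + epsilon for p in avg_probs]
    total = sum(scores)
    return [s / total for s in scores]


def importance_weights(sample_probs):
    """Learning-rate multipliers 1 / (N * P) that undo the sampling bias."""
    n = len(sample_probs)
    return [1.0 / (n * p) for p in sample_probs]


def sample_epoch(sample_probs, seed=0):
    """Draw one epoch of example indices, with replacement, from the distribution."""
    rng = random.Random(seed)
    indices = range(len(sample_probs))
    return rng.choices(indices, weights=sample_probs, k=len(sample_probs))


# One medium, one high, and one low confidence example.
probs = sampling_distribution([0.5, 0.99, 0.01], epsilon=0.01)
weights = importance_weights(probs)
for j in sample_epoch(probs):
    print(j, round(probs[j], 3), round(weights[j], 3))
```

Frequently sampled examples receive a multiplier below one and rarely sampled examples a multiplier above one, so the expected update matches uniform sampling.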
We evaluated F10-SGD on internal and public datasets. We used a Linux machine with Intel Xeon E5-2670 2.60 GHz 32 cores processor and 244 GB memory. The code is compiled with GCC 4.9 and -O3 optimization flags.
Evaluation metrics. We use accuracy for MaxEnt and F1 score for NER (github.com/sighsmile/conlleval). We test for statistically significant differences using the Wilcoxon test Hollander et al. (2013) with 1000 bootstrap resamples and p-value < 0.05.
5.1 Internal Datasets
We evaluated F10-SGD against OWL-QN for DC, and the NER domains Music, Shopping and Cinema. The dataset details are in Table 1. We use the development dataset for tuning the learning rate and the L1/L2 hyper-parameters.
The reported speed improvements are caused only by the change in the optimizer, since the feature extraction, decoding, and model serialization are unchanged.
MaxEnt DC. Table 2 shows the MaxEnt DC results. The accuracy difference between F10-SGD and OWL-QN is not statistically significant, although F10-SGD accuracy is slightly higher. For DC, F10-SGD AB does not bring additional accuracy improvements. The reason is that the model learns to assign high confidence to most of the sentences in the first 7 epochs of the training, and there are not many medium confidence sentences for the active bias sampling to prioritize. The training speeds of F10-SGD and F10-SGD AB are approximately 4x faster with half the number of CPU threads compared to OWL-QN. It takes only 10 epochs for SGD to reach the same accuracy level of OWL-QN with 250 epochs. Also, we get an additional 20% speed up if we use 28 CPU threads.
|F10-SGD AB||14||7 + 3||200MB||31m||0.01|
Note that the Hogwild SGD does not scale linearly with the number of CPU threads Zhang et al. (2016). Training with 28 CPU threads is only 20% faster than with 14 CPU threads. That is because of memory contention between the different threads for updating the shared parameter vector. When using more CPU threads, we get increased accuracy fluctuations from 0.02 to 0.04 across training runs.
|F10-SGD AB||14||7 + 3||17MB||41m||0.42|
|F10-SGD AB||2||7 + 3||17MB||5m||0.10|
|F10-SGD AB||2||7 + 3||22MB||5m||1.06|
CRF Domain NER. In Table 3, the F1 score difference between OWL-QN and F10-SGD is not statistically significant for Music NER and Cinema NER, and statistically significantly worse for Shopping NER. F10-SGD AB statistically significantly improves F1 score for the Music and Cinema NER models, and for Shopping the F10-SGD AB difference with OWL-QN is not statistically significant. Similar to DC, the training speed of F10-SGD and F10-SGD AB is about 4x faster with half the number of CPU threads compared to OWL-QN. It takes only 10 epochs for F10-SGD AB to reach a comparable or better level of F1 score as 200 epochs with OWL-QN. Also, for Music NER, we get an additional 36% speed up if we use 28 CPU threads. When training with 28 CPU threads, the F1 score fluctuation increases from 0.02 to 0.05.
We performed additional experiments on 24 domains, but for lack of space we do not present the results. The average training time reduction was 4x and the average relative F1 score improvement was 0.5%.
5.2 Public Datasets
We tested F10-SGD on public datasets against FastText (version from November 2018) for text classification and against CRFSuite (version 0.12) for NER. Both packages provide one of the fastest public implementations for their respective tasks.
MaxEnt. For the text classification evaluations, we used 8 datasets prepared by Zhang et al. (2015). We tuned the hyper-parameters for MaxEnt and FastText on a part of the training data, and used 5 epochs and 4 threads for training. For FastText we used a word embedding dimension of 10, which is the same as Joulin et al. (2016b).
Table 4 shows the results. F10-SGD MaxEnt achieves a 22% relative speed-up compared to FastText, and it is faster to train on 7 out of 8 datasets. The MaxEnt models are faster to train on datasets with a smaller number of classes, and the FastText models are comparable or faster with a large number of classes because the dense output layer of FastText is CPU cache efficient. Furthermore, the MaxEnt models are smaller than the FastText models because the former’s vocabulary is pruned by the L1 regularization. The accuracy is comparable at 0.28% average relative difference.
CRF. For the NER evaluations, we used the CoNLL-2003 English NER dataset Tjong Kim Sang and De Meulder (2003), and the Ontonotes 5.0 English NER dataset Weischedel et al. (2013). We tuned the CRF hyper-parameters on the dedicated development sets, and we used one thread for training because CRFSuite does not support training with multiple threads.
Table 5 shows the results. F10-SGD CRF achieves a 3.4x-3.7x relative speed-up compared to CRFSuite with the OWL-QN optimizer. We trained with 10 and 20 epochs for F10-SGD and 50 and 100 epochs for CRFSuite. For both algorithms, fewer epochs degrade F1 and more epochs do not improve F1. The F1 score differences between the best configurations for F10-SGD and CRFSuite are not statistically significant. F10-SGD AB does not improve the F1 score on the public datasets because there are fewer “easy” examples to deprioritize compared to the internal NLU datasets.
|CoNLL 2003 #Labels: 4 #Train: 14986|
|F10-SGD AB||7 + 3||9.6MB||0.8m||83.46|
|Ontonotes 5.0 #Labels: 18 #Train: 107973|
|F10-SGD AB||7 + 3||52MB||31m||79.98|
Fast model training is important for continuous improvement of a voice assistant’s NLU. However, training times are increasing as our datasets are growing. To reduce model training time, we developed the F10-SGD optimizer that can train elastic-net linear models for text classification and NER up to 4x faster compared to OWL-QN. This is accomplished without loss of accuracy or increase in model size. In addition, F10-SGD with active bias sampling provides small but statistically significant improvements in NER F1 scores on NLU datasets.
- Andrew and Gao (2007) Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th international conference on Machine learning.
- Berger et al. (1996) Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational linguistics.
- Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010.
- Bottou (2012) Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade.
- Chang et al. (2017) Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. 2017. Active bias: Training a more accurate neural network by emphasizing high variance samples. Advances in NIPS.
- Hollander et al. (2013) Myles Hollander, Douglas A Wolfe, and Eric Chicken. 2013. Nonparametric statistical methods.
- Joulin et al. (2016a) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016a. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
- Joulin et al. (2016b) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016b. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
- Langford et al. (2007) John Langford, Lihong Li, and Alex Strehl. 2007. Vowpal wabbit online learning project.
- Langford et al. (2009) John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research.
- Okazaki (2007) Naoaki Okazaki. 2007. Crfsuite: a fast implementation of conditional random fields (crfs).
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. Journal of machine learning research.
- Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in NIPS.
- Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. 2011. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming.
- Strimel et al. (2018) Grant P Strimel, Kanthashree Mysore Sathyendra, and Stanislav Peshterliev. 2018. Statistical model compression for small-footprint natural language understanding. Proc. Interspeech 2018, pages 571–575.
- Su et al. (2018) Chengwei Su, Rahul Gupta, Shankar Ananthakrishnan, and Spyros Matsoukas. 2018. A re-ranker scheme for integrating large scale nlu models. arXiv preprint arXiv:1809.09605.
- Sutton et al. (2012) Charles Sutton, Andrew McCallum, et al. 2012. An introduction to conditional random fields. Foundations and Trends® in Machine Learning.
- Tjong Kim Sang and De Meulder (2003) Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics.
- Tsuruoka et al. (2009) Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In Proceedings of the ACL.
- Wang and Manning (2012) Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94. Association for Computational Linguistics.
- Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA.
- Zhang et al. (2016) Huan Zhang, Cho-Jui Hsieh, and Venkatesh Akella. 2016. Hogwild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In Data Mining (ICDM), 2016 IEEE 16th International Conference on.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
- Zhao and Zhang (2015) Peilin Zhao and Tong Zhang. 2015. Stochastic optimization with importance sampling for regularized loss minimization. In international conference on machine learning.
- Zhu et al. (1997) Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. 1997. Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS).