1 Introduction
Our commercial voice-assistant natural language understanding (NLU) system uses linear MaxEnt models (Berger et al., 1996) for text classification, and linear-chain CRF models (Lafferty et al., 2001; Sutton et al., 2012) for named-entity recognition (NER). There are several desirable properties that make these models suitable for NLU:


They achieve high accuracy given the right set of features (Wang and Manning, 2012).

They produce probabilistic output, which allows combining multiple models in a recognition pipeline (Su et al., 2018).

They can be compressed to run on hardware-constrained devices (Strimel et al., 2018).
To train MaxEnt and CRF models, we minimize an objective function that includes the negative log-likelihood of the training dataset and elastic-net regularization. The elastic-net regularization combines the L1 and L2 penalties on the model weights. The L1 penalty, the sum of absolute weights, serves as a feature selection mechanism by forcing unimportant model weights to zero. The L2 penalty, the sum of squared weights, helps avoid overfitting by preventing excessively large weights.
A popular way to train elastic-net linear models is the Orthant-Wise Limited-memory Quasi-Newton optimizer (OWL-QN) (Andrew and Gao, 2007). OWL-QN is a variant of the L-BFGS (Zhu et al., 1997) quasi-Newton optimization method that supports the L1 penalty. OWL-QN produces compact and accurate models, and it can be sped up using multiple CPU threads.
In our voice-assistant’s NLU, we classify user utterance text into hundreds of classes. We have a hierarchical classification system, where we use MaxEnt to first perform domain classification (DC) and then intent classification (IC). DC predicts the general domain of an utterance. IC predicts the user intent within a domain. For example, for the utterance “play music”, DC predicts Music and IC predicts PlayMusicIntent, and for the utterance “how is the weather”, DC predicts Weather and IC predicts GetWeatherIntent. This NLU architecture, where intents are separated by domain, allows us to efficiently train the IC models in parallel on their respective domain-specific training data. However, the MaxEnt DC training still has to be done on the entire training dataset. A DC model training with OWL-QN using 28 CPU threads on 50 million utterances takes around 2 hours.
For NER, we use one CRF model per domain to recognize domain-specific named entities, such as artist name and song name for the Music domain. For example, in the sentence “play desert rose by sting”, NER labels “desert rose” as a song and “sting” as an artist. The CRF models are more complex and slower to train than the MaxEnt models. An NER model training with OWL-QN using 28 CPU threads on a large domain with 6 million utterances takes around 2 hours and 40 minutes.
Continuous improvement of the NLU system requires frequent retraining of the DC, IC, and NER models. However, the long training times for DC and NER are a bottleneck. In this work, we focus on machine learning optimization methods to improve training efficiency. We develop a fast optimizer called F10-SGD, which on large internal datasets trains MaxEnt and CRF models 4 times faster than OWL-QN without loss of accuracy or increase in model size.
Our contributions are the following:


We combine Stochastic Gradient Descent (SGD) optimization techniques for parallel training and elastic-net regularization into a fast and accurate trainer.

We improve the accuracy of the NER model using biased sampling that prioritizes harder examples towards the end of the training.

We perform evaluations on internal and public datasets. On internal datasets, F10-SGD is 4x faster with half the CPU threads compared to OWL-QN. On public datasets, we speed up training by 22% and 3.4-3.7x against FastText and CRFSuite, respectively.
Why not GPU? An important constraint for our use case is to develop a fast optimizer for CPU instead of GPU. Our model building platform trains many models per week, and allocating a GPU for each training job is expensive.
2 Related Work
Bottou (2010, 2012) advocated for using SGD in large-scale machine learning optimization. He developed CRFSGD (http://leon.bottou.org/projects/sgd) and demonstrated the effectiveness of SGD for training CRF models. Okazaki (2007) developed the popular CRFSuite toolkit, which includes an SGD trainer based on CRFSGD, and an optimized implementation of the OWL-QN trainer. A limitation of both CRFSGD and CRFSuite is that they do not use multiple CPU threads to speed up training. Also, their SGD implementations do not support L1 regularization.
Scikit-learn (Pedregosa et al., 2011) and Vowpal Wabbit (Langford et al., 2007) provide fast SGD implementations for logistic regression models (note that logistic regression is equivalent to MaxEnt). Both support L1 regularization using the cumulative penalty (Tsuruoka et al., 2009) and truncated gradient (Langford et al., 2009). Scikit-learn does not support parallel SGD; Vowpal Wabbit does not support it either, though it can be simulated by sending raw training examples over sockets to multiple Vowpal Wabbit processes that share the same memory.
FastText (Joulin et al., 2016b) is a linear embedding model for text classification. It supports asynchronous multi-threaded SGD training via Hogwild (Recht et al., 2011), which makes training fast. However, FastText does not support L2 or dropout regularization, leading to suboptimal performance on small datasets. Also, it does not support L1 for feature selection, but it does have a quantization option (Joulin et al., 2016a) to reduce the model size after training.
3 Background
3.1 Elastic-net Linear Models
Probabilistic linear models have the general form shown in Equation 1, where $x$ is an input example, $y$ is a prediction target, $w_i$ is the weight for the feature function $f_i$, and $Z(x)$ is a partition function that normalizes the probability. If the prediction target $y$ is a class, as in IC and DC, the model is a MaxEnt. If the prediction target $y$ is a sequence of labels, as in NER, the model is a CRF.

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i w_i f_i(x, y)\Big) \qquad (1)$$
Equation 2 shows the optimization objective $L(w)$ for MaxEnt and CRF. It includes the negative log-likelihood $NLL(w)$ over the training set of size $N$ (Equation 3), and the L1 and L2 penalties with regularization strengths $\lambda_1$ and $\lambda_2$.

$$L(w) = NLL(w) + \lambda_1 \sum_i |w_i| + \frac{\lambda_2}{2} \sum_i w_i^2 \qquad (2)$$

$$NLL(w) = -\sum_{n=1}^{N} \log p(y_n \mid x_n) \qquad (3)$$
To optimize $L(w)$, we use the gradient in Equation 4. The function $\operatorname{sign}(w)$ returns $-1$ for negative, $1$ for positive, or $0$ for zero, for every weight of $w$. $\operatorname{sign}$ is necessary to define the subgradient of $L(w)$ for zero weights, since the absolute value function in the L1 penalty is not differentiable at zero.

$$\nabla L(w) = \nabla NLL(w) + \lambda_1 \operatorname{sign}(w) + \lambda_2 w \qquad (4)$$

The definitions of $p(y \mid x)$ and $\nabla NLL(w)$ differ for MaxEnt and CRF, and are not covered in this paper.
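To make the MaxEnt case concrete, the sketch below (our own illustration, not the paper's implementation) computes Equation 1 and the per-example contribution to the gradient of Equation 4, assuming sparse indicator features indexed by (feature, class); all names are hypothetical.

```python
# Illustrative sketch (not the paper's code): Equation 1 and the per-example
# gradient of the negative log-likelihood for a MaxEnt model with sparse
# indicator features. weights has shape (num_features, num_classes).
import numpy as np

def maxent_probs(weights, active_feats):
    """p(y|x) for one example whose active feature indices are active_feats."""
    scores = weights[active_feats].sum(axis=0)   # sum_i w_i f_i(x, y) per class
    scores -= scores.max()                       # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()         # normalize by Z(x)

def nll_gradient(weights, active_feats, true_class):
    """Gradient of -log p(y_n|x_n) with respect to the weights."""
    probs = maxent_probs(weights, active_feats)
    grad = np.zeros_like(weights)
    for f in active_feats:
        grad[f] += probs          # expected feature counts under the model
        grad[f, true_class] -= 1  # minus the observed feature counts
    return grad
```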
3.2 OWL-QN
OWL-QN is an optimization method that iteratively adjusts the model parameters towards the optimal value. Algorithm 1 provides a high-level, simplified description of OWL-QN that helps explain its computational bottlenecks, but does not describe the algorithm in detail.
OWL-QN approximates the inverse Hessian using the last $m$ weight differences and gradient differences. The quality of the approximation depends on the size $m$; a recommended value for $m$ is between 4 and 7.
In step 3, OWL-QN computes an update direction using the inverse Hessian approximation and the gradient over all $N$ training examples. Then, in step 4, it performs a line search in that direction. For a large $N$, even one epoch takes significant time, and OWL-QN typically needs $m$ epochs to find a good inverse Hessian approximation, and tens or even hundreds of epochs to converge.
Parallelization. We can speed up OWL-QN by using multiple CPU threads to compute the objective and the gradient. Each CPU thread receives a subset of the training dataset and computes the forward scores and gradients for its subset. Then, the results are aggregated to compute the full objective and gradient.
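A rough sketch of this data-parallel aggregation pattern follows; it reuses the hypothetical maxent_probs and nll_gradient helpers from the previous sketch and is an illustration, not the production trainer.

```python
# Illustrative data-parallel objective/gradient: each worker processes one shard
# and the partial sums are aggregated, mirroring the multi-threaded OWL-QN setup.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def shard_nll_and_grad(args):
    weights, shard = args
    nll, grad = 0.0, np.zeros_like(weights)
    for active_feats, true_class in shard:
        probs = maxent_probs(weights, active_feats)
        nll -= np.log(probs[true_class])
        grad += nll_gradient(weights, active_feats, true_class)
    return nll, grad

def full_nll_and_grad(weights, dataset, num_workers=4):
    shards = [dataset[i::num_workers] for i in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(shard_nll_and_grad, [(weights, s) for s in shards])
    total_nll, total_grad = 0.0, np.zeros_like(weights)
    for nll, grad in partials:
        total_nll, total_grad = total_nll + nll, total_grad + grad
    return total_nll, total_grad
```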
4 F10-SGD
We propose using SGD as a fast alternative to OWL-QN for training MaxEnt and CRF models. Similar to OWL-QN, SGD is an iterative optimization method. It uses the gradient approximation $\nabla L_n(w)$ in Equation 5, calculated from a single random training example $(x_n, y_n)$. The weights are then updated as $w \leftarrow w - \eta \nabla L_n(w)$ with learning rate $\eta$, without relying on the inverse Hessian. Since $w$ is updated many times during a single epoch, SGD typically needs fewer epochs to converge compared to OWL-QN.

$$\nabla L_n(w) = \nabla NLL_n(w) + \frac{\lambda_1}{N} \operatorname{sign}(w) + \frac{\lambda_2}{N} w, \quad NLL_n(w) = -\log p(y_n \mid x_n) \qquad (5)$$

In Equation 5, we divide $\lambda_1$ and $\lambda_2$ by the number of examples $N$ since $w$ is updated $N$ times per epoch.
F10-SGD in Algorithm 2 is a parallel version of SGD, also known as Hogwild. Each CPU thread receives a random subset of the training dataset and updates the shared weights $w$. The weight updates are not synchronized, so it is possible for threads to overwrite each other's updates. However, the algorithm has a nearly optimal rate of convergence for models where many training examples have non-overlapping features (Recht et al., 2011), which is the case for linear MaxEnt and CRF models.
Learning Rate. For SGD to converge, it is critical that the learning rate is neither too large nor too small. We start with an initial learning rate $\eta_0$ and decay it linearly during training (Joulin et al., 2016b), so that at update $t$ the learning rate is $\eta_t = \eta_0 (1 - t / (E \cdot N))$, where $E$ is the number of epochs. We tune $\eta_0$ on a held-out development set using grid search.
The learning rate depends on the iteration counter $t$. To obtain the same learning rate schedule with multiple threads across identical runs, we have to update $t$ atomically. Otherwise, $t$ is going to differ between runs, which causes fluctuation of the weights and makes it harder to reproduce previous models. The same atomicity consideration applies to the L1 and L2 algorithms below.
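The sketch below illustrates the Hogwild-style loop with the linearly decayed learning rate and an atomically updated iteration counter. It assumes the hypothetical nll_gradient helper from the earlier sketch; in pure Python the GIL limits actual parallelism, so this is a structural illustration rather than a performant implementation.

```python
# Illustrative Hogwild-style loop: unsynchronized weight updates, an atomically
# updated iteration counter t, and a linearly decayed learning rate.
import threading
import random
import numpy as np

def hogwild_train(weights, dataset, epochs=10, lr0=0.1, num_threads=4):
    N = len(dataset)
    total_updates = epochs * N
    t = 0                             # shared iteration counter
    t_lock = threading.Lock()         # only the counter update is synchronized

    def worker(shard):
        nonlocal t
        for _ in range(epochs):
            random.shuffle(shard)
            for active_feats, true_class in shard:
                with t_lock:          # atomic t update keeps the schedule reproducible
                    t += 1
                    lr = lr0 * (1.0 - t / total_updates)
                grad = nll_gradient(weights, active_feats, true_class)
                for f in active_feats:            # update only the active features,
                    weights[f] -= lr * grad[f]    # with no synchronization (Hogwild)

    shards = [list(dataset[i::num_threads]) for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return weights
```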
4.1 Lazy Updates
For linear MaxEnt and CRF models, the weight vector $w$ typically has millions of weights for features corresponding to different words and phrases in the training data. Fortunately, a single example has only a small set of active features, so we can compute $\nabla NLL_n(w)$ efficiently using only the corresponding weights. However, the update in Equation 5 requires applying the L1 and L2 penalties to the full weight vector $w$, which is slow. Thus, it is essential to perform “lazy” L1 and L2 updates only on the active features.
Lazy L2. To make the L2 update lazy, we use the weight rescaling approach in Algorithm 3 (Shalev-Shwartz et al., 2011). The key observation is that the L2 penalty can be seen as rescaling $w$ by $(1 - \eta \lambda_2 / N)$ at every update (Equations 6-7). Thus, we can represent $w$ as $s \cdot v$ and update the unscaled weights $v$ and the scaler $s$ independently. At the end of every epoch, we multiply the weights by $s$ and reset $s = 1$.
$$w \leftarrow w - \eta \Big( \nabla NLL_n(w) + \frac{\lambda_2}{N} w \Big) \qquad (6)$$

$$\phantom{w \leftarrow} = \Big(1 - \frac{\eta \lambda_2}{N}\Big) w - \eta \nabla NLL_n(w) \qquad (7)$$

$$w \leftarrow w - \eta \nabla NLL_n(w) - \frac{\eta \lambda_1}{N} \operatorname{sign}(w) \qquad (8)$$
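A minimal sketch of the scaler-based lazy L2 update in Equations 6-7 follows; the class and method names are our own illustration, not Algorithm 3 verbatim.

```python
# Illustrative lazy L2 via weight rescaling: w is stored as s * v, so the L2
# penalty touches only the scalar s, and the gradient step touches only the
# active rows of v.
import numpy as np

class ScaledWeights:
    def __init__(self, num_features, num_classes):
        self.v = np.zeros((num_features, num_classes))  # unscaled weights
        self.s = 1.0                                    # global scaler

    def value(self, f):
        return self.s * self.v[f]                       # w_f = s * v_f

    def sgd_step(self, active_feats, grad, lr, lam2_over_N):
        self.s *= 1.0 - lr * lam2_over_N                # Equation 7: rescale all of w
        for f in active_feats:                          # gradient step on active rows,
            self.v[f] -= lr * grad[f] / self.s          # compensating for the scaler

    def end_of_epoch(self):
        self.v *= self.s                                # fold the scaler back in
        self.s = 1.0
```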
Lazy L1. To make the L1 update lazy, we use the cumulative L1 penalty approach (Tsuruoka et al., 2009) in Algorithm 4. First, from Equation 8, we note that the L1 update rarely makes weights exactly zero, so it does not reduce the model size. Thus, we modify the L1 update to clip weights to zero if they cross the zero threshold after applying the L1 penalty. Second, to lazily update a weight $w_i$, we have to maintain the cumulative L1 penalty $u$ and the L1 penalty $q_i$ that is already applied to the weight $w_i$, and update the weight with the difference (steps 6 and 8 of Algorithm 4). At the end of every epoch, we run ApplyL1 for every weight and reset $u$ and $q$.
Note that in step 3 we have to divide by the scaler $s$ to boost the L1 penalty in order to account for the L2 weight rescaling.
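A sketch of the cumulative-penalty bookkeeping, in the spirit of Algorithm 4 and Tsuruoka et al. (2009), treating the weights as a flat vector indexed by feature-class pair; the names are illustrative, not the exact implementation.

```python
# Illustrative lazy L1 with a cumulative penalty: u is the total L1 penalty
# accrued so far, q[i] the (signed) penalty already applied to weight i.
import numpy as np

def apply_l1(w, q, u, i):
    """Apply the outstanding cumulative L1 penalty to weight i, clipping at zero."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z                       # record the penalty actually applied

def lazy_l1_step(w, q, u, active_feats, grad, lr, lam1_over_N, s=1.0):
    u += lr * lam1_over_N / s              # grow the cumulative penalty; dividing by
    for i in active_feats:                 # the scaler s accounts for lazy L2 rescaling
        w[i] -= lr * grad[i]               # plain gradient step on the active weight
        apply_l1(w, q, u, i)               # then apply the pending L1 penalty lazily
    return u

# At the end of every epoch, run apply_l1 on every weight, then reset u and q to zero.
```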
4.2 Active Bias Sampling
Active bias sampling (Chang et al., 2017) is a technique that can speed up SGD convergence by emphasizing harder examples towards the end of training. This is useful for NLU since there are many frequent training examples that are easy and can be deprioritized, e.g. “play music”, “how is the weather”, “stop”.
In Algorithm 5, we add active bias sampling to F10-SGD by replacing the random shuffle with sampling from a discrete distribution $P$ starting at a specified epoch. $P$ assigns higher probability to medium-confidence examples, and lower probability to high- and low-confidence examples. We prioritize medium-confidence examples and not low-confidence ones, as the latter are more likely to be outliers.
We define $P$ in Equation 9, where $\bar{p}_n$ is the average of the past iteration prediction probabilities for example $n$. The term $(1 - \bar{p}_n)$ can be interpreted as the union of the alternative prediction probabilities. And $\varepsilon$ is a smoothing prior that makes sure we assign non-zero probability to training examples with confidence close to one; otherwise the model may “forget” to recognize them.
$$P_n = \frac{\bar{p}_n (1 - \bar{p}_n) + \varepsilon}{\sum_{m=1}^{N} \big( \bar{p}_m (1 - \bar{p}_m) + \varepsilon \big)} \qquad (9)$$
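A small sketch of how the sampling distribution could be computed and used (our own illustration, based on Equation 9 as reconstructed above; names are hypothetical):

```python
# Illustrative computation of the sampling distribution P: medium-confidence
# examples (p_bar near 0.5) get the highest probability; eps keeps a floor for
# high-confidence examples so the model does not "forget" them.
import numpy as np

def active_bias_distribution(p_bar, eps=0.05):
    scores = p_bar * (1.0 - p_bar) + eps
    return scores / scores.sum()

# Usage sketch: replace the uniform shuffle with sampling from P for one epoch.
# p_bar = running average of past prediction probabilities per training example
# P = active_bias_distribution(p_bar)
# order = np.random.default_rng(0).choice(len(P), size=len(P), replace=True, p=P)
```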
In addition to sampling from $P$, we change the learning rate of the sampled examples. This way, we unbias the gradient estimation and ensure convergence (Zhao and Zhang, 2015). For each sampled example $n$, we multiply the learning rate $\eta$ by an importance weight proportional to $1 / P_n$, with a normalization factor chosen so that the average importance weight is one, i.e. we do not change the global learning rate.
5 Results
We evaluated F10-SGD on internal and public datasets. We used a Linux machine with an Intel Xeon E5-2670 2.60 GHz 32-core processor and 244 GB of memory. The code is compiled with GCC 4.9 and -O3 optimization flags.
Evaluation metrics. We use accuracy for MaxEnt and F1 score for NER (github.com/sighsmile/conlleval). We test for statistically significant differences using the Wilcoxon test (Hollander et al., 2013) with 1000 bootstrap resamples and p-value < 0.05.
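For illustration, such a test might be run as sketched below, under the assumption that per-example scores are bootstrap-resampled and a paired Wilcoxon signed-rank test is applied to the resampled metrics; the exact protocol may differ.

```python
# Illustrative significance test: bootstrap-resample the test set, compute the
# metric for both systems on each resample, and compare with a paired Wilcoxon
# signed-rank test. scores_a and scores_b are per-example scores (e.g. 0/1
# correctness) as NumPy arrays, aligned by test example.
import numpy as np
from scipy.stats import wilcoxon

def bootstrap_wilcoxon(scores_a, scores_b, resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    metric_a, metric_b = [], []
    for _ in range(resamples):
        idx = rng.integers(0, n, size=n)         # resample test examples with replacement
        metric_a.append(scores_a[idx].mean())    # e.g. accuracy of system A
        metric_b.append(scores_b[idx].mean())
    _, p_value = wilcoxon(metric_a, metric_b)    # paired test over the resampled metrics
    return p_value < alpha, p_value
```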
5.1 Internal Datasets
We evaluated F10-SGD against OWL-QN for DC, and for the NER domains Music, Shopping, and Cinema. The dataset details are in Table 1. We use the development dataset for tuning the learning rate and the L1/L2 hyperparameters.
The reported speed improvements are caused only by the change in the optimizer, since the feature extraction, decoding, and model serialization are unchanged.
Dataset | #Labels | #Train | #Test | #Dev
DC | 24 | 50M | 600K | 600K
Music NER | 67 | 6M | 140K | 140K
Shopping NER | 26 | 3.4M | 16K | 16K
Cinema NER | 27 | 3M | 3K | 3K
MaxEnt DC. Table 2 shows the MaxEnt DC results. The accuracy difference between F10-SGD and OWL-QN is not statistically significant, although F10-SGD accuracy is slightly higher. For DC, F10-SGD AB does not bring additional accuracy improvements. The reason is that the model learns to assign high confidence to most of the sentences in the first 7 epochs of training, and there are not many medium-confidence sentences for the active bias sampling to prioritize. The training speeds of F10-SGD and F10-SGD AB are approximately 4x faster with half the number of CPU threads compared to OWL-QN. It takes only 10 epochs for F10-SGD to reach the same accuracy level as OWL-QN with 250 epochs. Also, we get an additional 20% speed-up if we use 28 CPU threads.
Optimizer | CPU | Epoch | Size | Time | %Acc
OWL-QN | 28 | 250 | 202MB | 120m | 0
F10-SGD | 14 | 10 | 192MB | 30m | 0.05
F10-SGD | 28 | 10 | 192MB | 24m | 0.04
F10-SGD AB | 14 | 7 + 3 | 200MB | 31m | 0.01
Note that Hogwild SGD does not scale linearly with the number of CPU threads (Zhang et al., 2016). Training with 28 CPU threads is only 20% faster than with 14 CPU threads because of memory contention between the different threads updating the shared parameter vector. When using more CPU threads, the accuracy fluctuation across training runs also increases from 0.02 to 0.04.
Optimizer | CPU | Epoch | Size | Time | %F1
Music
OWL-QN | 28 | 200 | 16MB | 164m | 0
F10-SGD | 14 | 10 | 17MB | 41m | 0.02
F10-SGD | 28 | 10 | 17MB | 26m | 0.02
F10-SGD AB | 14 | 7 + 3 | 17MB | 41m | 0.42
Shopping
OWL-QN | 4 | 200 | 16MB | 20m | 0
F10-SGD | 2 | 10 | 18MB | 5m | 0.24
F10-SGD AB | 2 | 7 + 3 | 17MB | 5m | 0.10
Cinema
OWL-QN | 4 | 200 | 23MB | 19m | 0
F10-SGD | 2 | 10 | 22MB | 5m | 0.10
F10-SGD AB | 2 | 7 + 3 | 22MB | 5m | 1.06
CRF Domain NER. In Table 3, the F1 score difference between OWL-QN and F10-SGD is not statistically significant for Music NER and Cinema NER, and is statistically significantly worse for Shopping NER. F10-SGD AB statistically significantly improves the F1 score for the Music and Cinema NER models, and for Shopping the difference between F10-SGD AB and OWL-QN is not statistically significant. Similar to DC, the training speed of F10-SGD and F10-SGD AB is about 4x faster with half the number of CPU threads compared to OWL-QN. It takes only 10 epochs for F10-SGD AB to reach a comparable or better F1 score than 200 epochs with OWL-QN. Also, for Music NER, we get an additional 36% speed-up if we use 28 CPU threads. When training with 28 CPU threads, the F1 score fluctuation increases from 0.02 to 0.05.
We performed additional experiments on 24 domains, but for lack of space we do not present the results. The average training time reduction was 4x, with a relative F1 score improvement of 0.5%.
5.2 Public Datasets
We tested F10-SGD on public datasets against FastText (version from November 2018) for text classification and against CRFSuite (version 0.12) for NER. Both packages provide one of the fastest public implementations for their respective tasks.
MaxEnt. For the text classification evaluations, we used 8 datasets prepared by Zhang et al. (2015). We tuned the hyperparameters for MaxEnt and FastText on a part of the training data, and used 5 epochs and 4 threads for training. For FastText, we used a word embedding dimension of 10, the same as Joulin et al. (2016b).
Table 4 shows the results. F10-SGD MaxEnt achieves a 22% relative speed-up compared to FastText, and it is faster to train on 7 out of 8 datasets. The MaxEnt models are faster to train on datasets with a smaller number of classes, and the FastText models are comparable or faster with a large number of classes because the dense output layer of FastText is CPU cache efficient. Furthermore, the MaxEnt models are smaller than the FastText models because the former's vocabulary is pruned by the L1 regularization. The accuracy is comparable, with a 0.28% average relative difference.
Dataset | FastText Time | FastText Size | FastText Acc | F10-SGD Time | F10-SGD Size | F10-SGD Acc
AG | 7s | 387MB | 92.44 | 3s | 11MB | 92.46
Sogou | 93s | 402MB | 96.82 | 69s | 31MB | 96.49
DBP | 26s | 427MB | 98.61 | 29s | 17MB | 98.62
Yelp P. | 44s | 408MB | 95.62 | 21s | 27MB | 95.69
Yelp F. | 51s | 411MB | 63.94 | 46s | 147MB | 63.12
Yah. A. | 95s | 494MB | 72.44 | 100s | 275MB | 71.83
Amzn. F. | 142s | 461MB | 60.33 | 127s | 300MB | 60.35
Amzn. P. | 164s | 471MB | 94.59 | 82s | 75MB | 94.66
CRF. For the NER evaluations, we used the CoNLL-2003 English NER dataset (Tjong Kim Sang and De Meulder, 2003) and the OntoNotes 5.0 English NER dataset (Weischedel et al., 2013). We tuned the CRF hyperparameters on the dedicated development sets, and we used one thread for training because CRFSuite does not support training with multiple threads.
Table 5 shows the results. F10-SGD CRF achieves a 3.4-3.7x relative speed-up compared to CRFSuite with the OWL-QN optimizer. We trained with 10 and 20 epochs for F10-SGD and with 50 and 100 epochs for CRFSuite. For both algorithms, fewer epochs degrade F1 and more epochs do not improve F1. The F1 score differences between the best configurations of F10-SGD and CRFSuite are not statistically significant. F10-SGD AB does not improve the F1 score on the public datasets because there are fewer “easy” examples to deprioritize compared to the internal NLU datasets.
Optimizer | Epoch | Size | Time | F1
CoNLL-2003 (#Labels: 4, #Train: 14986)
CRFSuite | 50 | 12MB | 2.8m | 83.39
CRFSuite | 100 | 9.5MB | 4.9m | 83.34
F10-SGD | 10 | 9.7MB | 0.75m | 83.35
F10-SGD AB | 7 + 3 | 9.6MB | 0.8m | 83.46
F10-SGD | 20 | 8.8MB | 1.4m | 83.34
OntoNotes 5.0 (#Labels: 18, #Train: 107973)
CRFSuite | 50 | 68MB | 102m | 79.49
CRFSuite | 100 | 49MB | 187m | 79.99
F10-SGD | 10 | 50MB | 30m | 79.50
F10-SGD AB | 7 + 3 | 52MB | 31m | 79.98
F10-SGD | 20 | 40MB | 52m | 79.60
6 Conclusions
Fast model training is important for continuous improvement of a voice-assistant's NLU. However, training times are increasing as our datasets grow. To reduce model training time, we developed the F10-SGD optimizer, which can train elastic-net linear models for text classification and NER up to 4x faster than OWL-QN. This is accomplished without loss of accuracy or increase in model size. In addition, F10-SGD with active bias sampling provides small but statistically significant improvements in NER F1 scores on NLU datasets.
References
Andrew and Gao (2007) Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning.
Berger et al. (1996) Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics.
Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010.
Bottou (2012) Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade.
Chang et al. (2017) Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. 2017. Active bias: Training a more accurate neural network by emphasizing high variance samples. Advances in NIPS.
Hollander et al. (2013) Myles Hollander, Douglas A. Wolfe, and Eric Chicken. 2013. Nonparametric Statistical Methods.
Joulin et al. (2016a) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016a. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Joulin et al. (2016b) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016b. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Langford et al. (2007) John Langford, Lihong Li, and Alex Strehl. 2007. Vowpal Wabbit online learning project.
Langford et al. (2009) John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research.
Okazaki (2007) Naoaki Okazaki. 2007. CRFsuite: A fast implementation of conditional random fields (CRFs).
Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.
Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in NIPS.
Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming.
Strimel et al. (2018) Grant P. Strimel, Kanthashree Mysore Sathyendra, and Stanislav Peshterliev. 2018. Statistical model compression for small-footprint natural language understanding. Proc. Interspeech 2018, pages 571–575.
Su et al. (2018) Chengwei Su, Rahul Gupta, Shankar Ananthakrishnan, and Spyros Matsoukas. 2018. A re-ranker scheme for integrating large scale NLU models. arXiv preprint arXiv:1809.09605.
Sutton et al. (2012) Charles Sutton, Andrew McCallum, et al. 2012. An introduction to conditional random fields. Foundations and Trends® in Machine Learning.
Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.
Tsuruoka et al. (2009) Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proceedings of the ACL.
Wang and Manning (2012) Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 90–94. Association for Computational Linguistics.
Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
Zhang et al. (2016) Huan Zhang, Cho-Jui Hsieh, and Venkatesh Akella. 2016. Hogwild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In Data Mining (ICDM), 2016 IEEE 16th International Conference on.
Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
Zhao and Zhang (2015) Peilin Zhao and Tong Zhang. 2015. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning.
Zhu et al. (1997) Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS).