State-of-the-art neural models for NLP are heavily parameterized, requiring hundreds of millions Devlin et al. (2019) and even billions Radford et al. (2019) of parameters. While over-parameterized models can sometimes be easier to train Livni et al. (2014), they may also introduce memory problems on small devices and lead to increased carbon emission Strubell et al. (2019); Schwartz et al. (2019).
In feature-based NLP, structured-sparse regularization, in particular the group lasso Yuan and Lin (2006), has been proposed as a method to reduce model size while preserving performance Martins et al. (2011). But, in neural NLP, some of the most widely used models—LSTMs Hochreiter and Schmidhuber (1997) and GRUs Cho et al. (2014)—do not have an obvious, intuitive notion of “structure” in their parameters (other than, perhaps, division into layers), so the use of structured sparsity at first may appear incongruous.
In this paper we show that group lasso can be successfully applied to neural NLP models. We focus on a family of neural models for which the hidden state exhibits a natural structure: rational RNNs Peng et al. (2018). In a rational RNN, the value of each hidden dimension is the score of a weighted finite-state automaton (WFSA) on (a prefix of) the input vector sequence. This property offers a natural grouping of the transition function parameters for each WFSA. As shown by Schwartz et al. (2018) and Peng et al. (2018), a variety of state-of-the-art neural architectures are rational (Lei et al., 2017; Bradbury et al., 2017; Foerster et al., 2017, inter alia), so learning parameter-efficient rational RNNs is of practical value. We also take advantage of the natural interpretation of rational RNNs as “soft” patterns Schwartz et al. (2018).
We apply a group lasso penalty to the WFSA parameters of rational RNNs during training, where each group is comprised of the parameters associated with one state in one WFSA (Fig. 1; §2). This penalty pushes the parameters in some groups to zero, effectively eliminating them, and making the WFSA smaller. When all of the states for a given WFSA are eliminated, the WFSA is removed entirely, so this approach can be viewed as learning the number of WFSAs (i.e., the RNN hidden dimension) as well as their size. We then retain the sparse structure, which results in a much smaller model in terms of parameters.
We experiment with four text classification benchmarks (§3), using both GloVe and BERT embeddings. As we increase the regularization strength, we end up with smaller models. These models have a better tradeoff between the number of parameters and model performance compared to setting the number of WFSAs and their lengths by hand or using hyperparameter search. In almost all cases, our approach results in models with fewer parameters and similar or better performance compared to our baselines.
Unlike neural architecture search, which can take several GPU years to learn an appropriate neural architecture, our approach requires only two training runs: one to learn the structure, and the other to estimate its parameters. Other approaches either ignore the structure of the model and only look at the values of individual weights Liu et al. (2019); LeCun et al. (1990); Lee et al. (2019); Frankle and Carbin (2019), or only use high-level structures like the number of layers of the network Wen et al. (2016); Scardapane et al. (2017); Gordon et al. (2018).
Finally, our approach touches on another appealing property of rational RNNs: their interpretability. Each WFSA captures a “soft” version of patterns like “such a great X”, and can be visualized as such Schwartz et al. (2018). By retaining a small number of WFSAs, model structures learned using our method can be visualized succinctly. In §4 we show that some of our sentiment analysis models rely exclusively on as few as three WFSAs (that is, a rational RNN with hidden size 3).
We describe the proposed method. At a high level, we follow the standard practice for using regularization for sparsification Wen et al. (2016):
1. Fit a model on the training data, with the group lasso regularizer added to the loss during training (the parameters associated with one state comprise one group).
2. After convergence, eliminate the states whose parameters are zero.
3. Finetune the resulting, smaller model by minimizing the unregularized loss with respect to its parameters.
In this work, we assume a single-layer rational RNN, but our approach is equally applicable to multi-layer models. For clarity of the discussion, we start with a one-dimensional rational RNN (i.e., one based on a single WFSA only). We then generalize to the d-dimensional case (computing the scores of d WFSAs in parallel).
Rational recurrent networks
Following Peng et al. (2018), we parameterize the transition functions of WFSAs with neural networks, such that each transition (main path or self loop) defines a weighted function over the input word vector. We consider a 5-state WFSA, diagrammed in Fig. 2.
A path starts at the initial state; at least four tokens must be consumed to reach the last state, and in this sense it captures 4-gram “soft” patterns (Peng et al., 2018; Schwartz et al., 2018). In addition to the last state, we also designate the three intermediate states as final states, allowing for interpolation between patterns of different lengths. (We found this to be more stable than using only the last state as final.) The self-loop transitions over the four non-starting states aim to allow, but downweight, nonconsecutive patterns, as the self-loop transition functions yield values between 0 and 1 (using a sigmoid function). The recurrent function is equivalent to applying the Forward dynamic programming algorithm (Baum and Petrie, 1966).
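The recurrence above can be illustrated with a minimal sketch of the Forward algorithm for a single linear-chain 5-state WFSA. This is not the authors' implementation: it assumes the plus-times semiring, takes the per-token transition weights as plain numbers (in the model they are computed from the word vectors), fixes the self loop on the initial state to weight 1, and sums the values of the four final states rather than interpolating them. The function names `wfsa_forward` and `wfsa_score` are hypothetical.

```python
def wfsa_forward(main_weights, loop_weights, n_states=5):
    """Run the Forward recurrence for one linear-chain WFSA.

    main_weights[t][i-1]: weight of the main-path transition into state i at token t.
    loop_weights[t][i-1]: weight of the self loop on state i at token t (in (0, 1)).
    Returns the per-state scores after the last token.
    """
    # All mass starts on the initial state.
    alpha = [1.0] + [0.0] * (n_states - 1)
    for main, loop in zip(main_weights, loop_weights):
        # Assumption: the initial state has a self loop with weight 1.
        new_alpha = [alpha[0]]
        for i in range(1, n_states):
            stay = alpha[i] * loop[i - 1]        # self loop on state i
            advance = alpha[i - 1] * main[i - 1]  # main-path transition into state i
            new_alpha.append(stay + advance)
        alpha = new_alpha
    return alpha


def wfsa_score(main_weights, loop_weights):
    """Assumption: the WFSA score is the plain sum over the four final states."""
    return sum(wfsa_forward(main_weights, loop_weights)[1:])
```

With four input tokens, the last state receives mass only from the path that takes four consecutive main-path transitions, which matches the 4-gram reading of the automaton.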
Promoting sparsity with group lasso
We aim to learn a sparse rational model with fewer WFSA states. This can be achieved by penalizing the parameters associated with a given state, specifically the parameters associated with entering that state, by a transition from another state or a self-loop on that state. For example, the parameters of the WFSA in Fig. 2 (excluding the word embedding parameters) are assigned to four nonoverlapping groups, one for each non-starting state.
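As a minimal sketch of the penalty described above, the group lasso term can be computed as the regularization strength times the sum of the L2 norms of the groups. The layout of `groups` (one flat list of parameters per non-starting state) and the function name are assumptions made for illustration; some formulations also scale each norm by the square root of the group size, which is omitted here.

```python
import math


def group_lasso_penalty(groups, strength):
    """Return strength * sum of L2 norms, one norm per parameter group.

    groups: list of lists of floats; each inner list holds the parameters
            associated with entering one WFSA state (hypothetical layout).
    """
    total = 0.0
    for params in groups:
        total += math.sqrt(sum(w * w for w in params))
    return strength * total
```

Because the norm is taken per group rather than per weight, the penalty favors solutions in which entire groups shrink toward zero together.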
During training, the regularization term will push all parameters toward zero, and some will converge close to zero. (There are optimization methods that achieve “strong” sparsity Parikh and Boyd (2013), where some parameters are set exactly to zero during training. Recent work has shown these approaches can converge in nonconvex settings Reddi et al. (2016), but our experiments found them to be unstable.) After convergence, we remove groups for which the norm falls below a threshold. (We use 0.1. This threshold was lightly tuned in preliminary experiments on the validation set and found to reliably remove those parameters which converged around zero without removing others.) The resulting smaller model is then finetuned by continuing training without regularizing. With our linear-structured WFSA, zeroing out the group associated with a state in the middle effectively makes later states inaccessible. While our approach offers no guarantee to remove states from the end first (thus leaving no unreachable states), in our experiments it always did so.
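The pruning step can be sketched as follows, assuming each WFSA is summarized by the norms of its four state groups in main-path order. The 0.1 threshold is the value reported in the text; the function names and the input layout are hypothetical. Since the WFSA is a linear chain, the sketch keeps only the prefix of states before the first pruned group, because later states would be unreachable.

```python
def surviving_transitions(group_norms, threshold=0.1):
    """Count main-path transitions kept for one linear-chain WFSA.

    group_norms: norms of the state groups, ordered along the main path.
    States after the first pruned group are unreachable, so counting stops there.
    """
    kept = 0
    for norm in group_norms:
        if norm < threshold:
            break
        kept += 1
    return kept


def prune_structure(model_norms, threshold=0.1):
    """Return the learned structure: one transition count per surviving WFSA."""
    sizes = [surviving_transitions(norms, threshold) for norms in model_norms]
    # A WFSA with no reachable non-starting state is dropped entirely.
    return [size for size in sizes if size > 0]
```

The returned list gives both the number of remaining WFSAs (its length) and their sizes, which is the structure that is then finetuned without the regularizer.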
To construct a rational RNN with d WFSAs (a d-dimensional model), we stack d one-dimensional models, each of them separately parameterized. The parameters of a d-dimensional rational model derived from the WFSA in Fig. 2 are organized into 4d groups, four for each dimension. Since there is no direct interaction between different dimensions (e.g., through an affine transformation), group lasso sparsifies each dimension/WFSA independently. Hence the resulting rational RNN can consist of WFSAs of different sizes, the number of which could be smaller than d if any of the WFSAs have all states eliminated.
One can treat the numbers and sizes of WFSAs as hyperparameters Oncina et al. (1993); Ron et al. (1994); de la Higuera (2010); Schwartz et al. (2018). By eliminating states from WFSAs with group lasso, we learn the WFSA structure while fitting the models’ parameters, reducing the number of training cycles by reducing the number of tunable hyperparameters.
We run sentiment analysis experiments. We train the rational RNN models (§2) with group lasso regularization, using increasingly large regularization strengths, resulting in increasingly compact models. As the goal of our experiments is to demonstrate the ability of our approach to reduce the number of parameters, we only consider rational baselines: the same rational RNNs trained without group lasso. (Rational RNNs have shown strong performance on the dataset we experiment with: a 2-layer rational model with between 100–300 hidden units obtained 92.7% classification accuracy, substantially outperforming an LSTM baseline Peng et al. (2018). The results of our models, which are single-layered and capped at 24 hidden units, are not directly comparable to these baselines, but are still within two points of the best result from that paper.) We manually tune the number and sizes of the baseline WFSAs, and then compare the tradeoff curve between model size and accuracy. We describe our experiments below. For more details, see Appendix A.
We experiment with the Amazon reviews binary sentiment classification dataset Blitzer et al. (2007), composed of 22 product categories. We examine the standard dataset (original_mix) comprised of a mixture of data from the different categories Johnson and Zhang (2015) (available at http://riejohnson.com/cnn_data.html). We also examine three of the largest individual categories as separate datasets (kitchen, dvd, and books), following Johnson and Zhang (2015). The three category datasets do not overlap with each other (though they do with original_mix), and are significantly different in size (see Appendix A), so we can see how our approach behaves with different amounts of training data.
To classify text, we concatenate the scores computed by each WFSA, then feed this vector of scores (one entry per WFSA) into a linear binary classifier. We use log loss. We experiment with both type-level word embeddings (GloVe.6B.300d; Pennington et al., 2014) and contextual embeddings (BERT large; Devlin et al., 2019; we use the implementation at https://github.com/huggingface/pytorch-pretrained-BERT). In both cases, we keep the embeddings fixed, so the vast majority of the learnable parameters are in the WFSAs. We train models using GloVe embeddings on all datasets. Due to memory constraints we evaluate BERT embeddings (frozen, not fine-tuned) only on the smallest dataset (kitchen).
As baselines, we train five versions of each rational architecture without group lasso, using the same number of WFSAs as our regularized models (24 for GloVe, 12 for BERT). Four of the baselines each use the same number of transitions for all WFSAs (1, 2, 3, and 4, corresponding to 2–5 states, and to 24, 48, 72, and 96 total transitions). The fifth baseline has an equal mix of all lengths (6 WFSAs of each size for GloVe, leading to 60 total transitions, and 3 WFSAs of each size for BERT, leading to 30 total transitions).
Each transition in our model is independently parameterized, so the total number of transitions linearly controls the number of learnable parameters (in addition to the parameters in the embedding layer).
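The transition counts quoted for the baselines follow directly from the number of WFSAs and their lengths. The short check below reproduces that arithmetic; the helper name `total_transitions` is hypothetical.

```python
def total_transitions(lengths):
    """Total number of main-path transitions, given one length per WFSA."""
    return sum(lengths)


# Uniform GloVe baselines: 24 WFSAs, all with 1, 2, 3, or 4 transitions.
glove_uniform = [total_transitions([k] * 24) for k in (1, 2, 3, 4)]

# Mixed baselines: an equal number of WFSAs of each length.
glove_mixed = total_transitions([k for k in (1, 2, 3, 4) for _ in range(6)])
bert_mixed = total_transitions([k for k in (1, 2, 3, 4) for _ in range(3)])

print(glove_uniform, glove_mixed, bert_mixed)
```

Since each transition is independently parameterized, these totals are proportional to the number of learnable WFSA parameters.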
Figure 3 (beginning of caption missing): […] directly varying the number of transitions. Vertical lines encode one standard deviation for accuracy, while horizontal lines encode one standard deviation in the number of transitions (applicable only to our method).
Fig. 3 shows our classification test accuracy as a function of the total number of WFSA transitions in the model. We first notice that, as expected, the performance of our unregularized baselines improves as models are trained with more transitions (i.e., more parameters).
Compared to the baselines, training with group lasso provides a better tradeoff between performance and number of transitions. In particular, our heavily regularized models perform substantially better than the unigram baselines, with absolute improvements of 1–2% in four out of five cases. As the regularization strength decreases, the gain over our baselines naturally shrinks, although our models remain similar to or better than the best baselines in four out of five cases.
Using our method with a high regularization strength, the resulting sparse structures often contain only a handful of WFSAs. In such cases, building on the interpretability of individual WFSAs, we are able to visualize every hidden unit, i.e., the entire model. To visualize a single WFSA, we follow Schwartz et al. (2018) and compute its score on every phrase in the training corpus. We then select the top- and bottom-scoring phrases for that WFSA (as each WFSA score is used as a feature that is fed to a linear classifier, negative scores are also meaningful) and get a prototype-like description of the pattern it represents. (While the WFSA scores are the sum of all paths deriving a document, i.e., the plus-times semiring, here we search for the max- or min-scoring one. Despite the mismatch, a WFSA scores every possible path, and thus the max/min scoring path selection is still valid. As our examples show, many of these extracted paths are meaningful.)
|Patt. 1||Top||not||worth||the time /s|
| || ||not||worth||the 30 /s|
|Patt. 3||Top||bad||… ltd||… buyer|
| || ||bad||… ltd||… buyer|
| || ||horrible||… hl4040cn||… expensive|
| || ||left||… ltd||… lens|
| ||Bottom||favorite||… ltd||… lens|
| || ||really||… ltd||… buyer|
| || ||really||… ltd||… buyer|
| || ||best||… hl4040cn||… expensive|
Table 1 visualizes a sparse rational RNN trained on original_mix with only three WFSAs (8 main-path transitions in total). (The test performance of this model is 88%, 0.6% absolute below the average of the five models reported in Fig. 3.) The table shows that, looking at the top scores of each WFSA, two of the patterns respectively capture the phrases “not worth X /s” and “miserable/returned X /s”. Pattern 3 is not as coherent, but most examples do contain sentiment-bearing words such as bad, horrible, or best. This might be the result of the tuning process of the sparse rational structure simply learning a collection of words, rather than coherent phrases. As a result, this WFSA is treated as a unigram pattern rather than a trigram. The lowest scoring phrases show a similar trend. Appendix B shows the same visualization for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings.
We observe another interesting trend: two of the three patterns prefer expressions that appear near the end of the document. This could result from the nature of the datasets (e.g., many reviews end with a summary, containing important sentiment information), and/or our rational models’ recency preference. More specifically, the first self loop has weight 1, and hence the model is not penalized for taking self loops before the match; in contrast, the weights of the last self loop take values in (0, 1) due to the sigmoid, forcing a penalty for earlier phrase matches. (Changing this behavior could be easily done by fixing the final self-loop to 1 as well.)
We presented a method for learning parameter-efficient RNNs. Our method applies group lasso regularization on rational RNNs, which are strongly connected to weighted finite-state automata, and thus amenable to learning with structured sparsity. Our experiments on four text classification datasets, using both GloVe and BERT embeddings, show that our sparse models provide a better performance/model size tradeoff. We hope our method will facilitate the development of “thin” NLP models that are faster, consume less memory, and are interpretable Schwartz et al. (2019).
This work was completed while the first author was an intern at the Allen Institute for Artificial Intelligence. The authors thank Pradeep Dasigi, Gabriel Stanovsky, and Elad Eban for their discussion. In addition, the authors thank the members of the Noah’s ARK group at the University of Washington, the researchers at the Allen Institute for Artificial Intelligence, and the anonymous reviewers for their valuable feedback. This work was supported in part by a hardware gift from NVIDIA Corporation.
References
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
- Baum and Petrie (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37 (6).
- Blitzer et al. (2007). Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In Proc. of ACL.
- Bradbury et al. (2017). Quasi-recurrent neural network. In Proc. of ICLR.
- Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP.
- de la Higuera (2010). Grammatical inference: learning automata and grammars. Cambridge University Press.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
- Foerster et al. (2017). Intelligible language modeling with input switched affine networks. In Proc. of ICML.
- Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In Proc. of ICLR.
- Gordon et al. (2018). MorphNet: fast & simple resource-constrained structure learning of deep networks. In Proc. of CVPR.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8).
- Johnson and Zhang (2015). Effective use of word order for text categorization with convolutional neural networks. In Proc. of NAACL.
- Jozefowicz et al. (2015). An empirical exploration of recurrent network architectures. In Proc. of ICML.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In Proc. of ICLR.
- LeCun et al. (1990). Optimal brain damage. In NeurIPS.
- Lee et al. (2019). SNIP: single-shot network pruning based on connection sensitivity. In Proc. of ICLR.
- Lei et al. (2017). Training RNNs as fast as CNNs. arXiv:1709.02755.
- Liu et al. (2019). DARTS: differentiable architecture search. In Proc. of ICLR.
- Livni et al. (2014). On the computational efficiency of training neural networks. In NeurIPS.
- Martins et al. (2011). Structured sparsity in structured prediction. In Proc. of EMNLP.
- Oncina et al. (1993). Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Trans. Pattern Anal. Mach. Intell. 15, pp. 448–458.
- Parikh and Boyd (2013). Proximal algorithms. Foundations and Trends in Optimization.
- Peng et al. (2018). Rational recurrences. In Proc. of EMNLP.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proc. of EMNLP.
- Radford et al. (2019). Language models are unsupervised multitask learners.
- Reddi et al. (2016). Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In NeurIPS.
- Ron et al. (1994). Learning probabilistic automata with variable memory length. In Proc. of COLT.
- Scardapane et al. (2017). Group sparse regularization for deep neural networks. Neurocomputing 241.
- Schwartz et al. (2019). Green AI. arXiv:1907.10597.
- Schwartz et al. (2018). SoPa: bridging CNNs, RNNs, and weighted finite-state machines. In Proc. of ACL.
- Strubell et al. (2019). Energy and policy considerations for deep learning in NLP. In Proc. of ACL.
- Wen et al. (2016). Learning structured sparsity in deep neural networks. In NeurIPS.
- Yuan and Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1).
- Zoph and Le (2017). Neural architecture search with reinforcement learning. In Proc. of ICLR.
Appendix A Experiment Details
Table 2 shows the sizes of the datasets experimented with.
As preprocessing for the data for each individual category, we tokenize using the NLTK word tokenizer. We remove reviews with text shorter than 5 tokens.
We binarize the review score using the standard procedure, assigning 1- and 2-star reviews as negative, and 4- and 5-star reviews as positive (discarding 3-star reviews). Then, if there are more than 25,000 negative reviews, we downsample to 25,000 (otherwise we keep them all), and then downsample the positive reviews to the same number as the negative ones, to have a balanced dataset. We match the train, development, and test set proportions of 4:1:5 from the original mixture.
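The binarization and balancing procedure above can be sketched as follows. This is an illustrative reading, not the authors' preprocessing script: the `(text, stars)` input layout, the function names, and the fixed random seed are assumptions, and the sketch further assumes there are at least as many positive as negative reviews (otherwise it simply keeps all positives).

```python
import random


def binarize(stars):
    """Map a star rating to a label: 0 (negative), 1 (positive), None (discard)."""
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    return None  # 3-star reviews are discarded


def balance(reviews, cap=25000, seed=0):
    """Return (negatives, positives) with equal sizes where possible.

    reviews: iterable of (text, stars) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    negatives = [text for text, stars in reviews if binarize(stars) == 0]
    positives = [text for text, stars in reviews if binarize(stars) == 1]
    if len(negatives) > cap:
        negatives = rng.sample(negatives, cap)
    # Downsample positives to match the number of negatives.
    positives = rng.sample(positives, min(len(positives), len(negatives)))
    return negatives, positives
```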
We generate the BERT embeddings using the sum of the last four hidden layers of the large uncased BERT model, so our embedding size is 1024. Summing the last four layers was the best-performing approach in the ablation of Devlin et al. (2019) among those with an embedding size below 4096 (an embedding of size 4096 was too large to fit in memory). We embed each sentence individually (there can be multiple sentences within one example).
For GloVe, we train rational models with 24 5-state WFSAs, each corresponding to a 4-gram soft-pattern (Fig. 2). For BERT, we train models with 12 WFSAs. (The BERT embedding dimension is significantly larger than GloVe, 1024 compared to 300, so we used a smaller number of WFSAs. As our results show, the BERT models still substantially outperform the GloVe ones.)
For each model (regularized or baseline), we run random search to select our hyperparameters (evaluating 20 uniformly sampled hyperparameter configurations). For the hyperparameter configuration that leads to the best development result, we train the model again 5 times with different random seeds, and report the mean and standard deviation of the models’ test performance.
The models are trained with Adam Kingma and Ba (2015). During training with group lasso we turn off the learning rate schedule (so the learning rate stays fixed), similarly to Gordon et al. (2018). This leads to improved stability in the learned structure for a given hyperparameter assignment.
Regularization strength search
We searched for model structures that were regularized down to close to 20, 40, 60, or 80 transitions (10, 20, 30, and 40 for BERT experiments). For a particular goal size, we uniformly sampled 20 hyperparameter assignments from the ranges in Table 4, then sorted the samples by increasing learning rate. For each hyperparameter assignment, we trained a model with the current regularization strength. If the resulting learned structure was too large (small), we doubled (halved) the regularization strength, repeating until we were within 10 transitions of our goal (5 for BERT experiments). (If the regularization strength became too large or too small, we threw out the hyperparameter assignment and resampled; this happened when, e.g., the learning rate was too small for any of the weights to actually make it to zero.)
Finally, we finetuned the appropriately-sized learned structure by continuing training without the regularizer, and computed the result on the development set. For the best model on the development set, we retrained (first with the regularizer to learn a structure, then finetuned) five times, and plotted the mean and variance of the test accuracy and learned structure size.
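The doubling/halving search over the regularization strength can be sketched as below. The callable `train_and_count` (train with a given strength and return the number of transitions in the learned structure) is hypothetical, as are the starting strength, the bounds `min_strength` and `max_strength`, and the `max_steps` safeguard; the source does not give the bounds in recoverable form, so they are left as required arguments.

```python
def search_strength(train_and_count, goal, tolerance, strength,
                    min_strength, max_strength, max_steps=50):
    """Double or halve the strength until the structure is near the goal size.

    Returns (strength, size) on success, or None if the strength leaves the
    allowed range (the caller would then resample the hyperparameters).
    """
    for _ in range(max_steps):
        if not (min_strength <= strength <= max_strength):
            return None
        size = train_and_count(strength)
        if abs(size - goal) <= tolerance:
            return strength, size
        if size > goal:
            strength *= 2.0   # structure too large: regularize more
        else:
            strength /= 2.0   # structure too small: regularize less
    return None
```

A toy usage, with a stand-in for training in which stronger regularization yields fewer transitions:

```python
toy = lambda s: int(96 / (1.0 + s))
print(search_strength(toy, goal=20, tolerance=10, strength=0.5,
                      min_strength=1e-6, max_strength=1e3))
```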
Appendix B Visualization
|Patt. 1||Top||are||perfect||… [CLS]|
|Bottom||not||… [SEP]||… [CLS]|
|very||disappointing||! [SEP] [CLS]|
|[CLS]||it does||it heat|
|[CLS]||evenly||, withstand heat|
|Patt. 3||Top||‘||pops||’ ’ escape|
Table 3 shows the same visualization shown in §4 for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings on kitchen. It also shows a few clear patterns (e.g., Patt. 2). Interpretation here is more challenging though, as contextual embeddings make every token embedding depend on the entire context. (Indeed, contextual embeddings raise problems for interpretation methods that work by targeting individual words, e.g., attention Bahdanau et al. (2015), as these embeddings also depend on other words. Interpretation methods for contextual embeddings are an exciting direction for future work.) A particular example of this is the excessive use of the start token ([CLS]), whose contextual embedding has been shown to capture the sentiment information at the sentence level Devlin et al. (2019).
Regularization strength recommendation
If a practitioner wishes to learn a single small model, we recommend they start with a regularization strength such that the loss and the regularization term are equal at initialization (before training). We found that having equal contribution led to eliminating approximately half of the states, though this varies with data set size, learning rate, and gradient clipping, among other variables.
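One way to apply this recommendation, sketched under the assumption that the regularization term is the strength times the sum of per-group L2 norms, is to divide the initial training loss by the unscaled penalty at initialization. The function name and the `groups` layout are hypothetical.

```python
import math


def initial_strength(initial_loss, groups):
    """Strength at which strength * sum_g ||w_g||_2 equals the initial loss.

    initial_loss: training loss measured before any update.
    groups: list of lists of floats, the initial parameters of each group.
    """
    raw_penalty = sum(math.sqrt(sum(w * w for w in params)) for params in groups)
    if raw_penalty == 0.0:
        raise ValueError("all groups are zero at initialization")
    return initial_loss / raw_penalty
```

This gives only a starting point; as noted above, the fraction of states eliminated still depends on dataset size, learning rate, and gradient clipping.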
|Learning rate||[, ]|
|Vertical dropout||[0, 0.5]|
|Recurrent dropout||[0, 0.5]|
|Embedding dropout||[0, 0.5]|
|Weight decay||[, ]|