1 Introduction
State-of-the-art neural models for NLP are heavily parameterized, requiring hundreds of millions (Devlin et al., 2019) and even billions (Radford et al., 2019) of parameters. While overparameterized models can sometimes be easier to train (Livni et al., 2014), they may also introduce memory problems on small devices and lead to increased carbon emissions (Strubell et al., 2019; Schwartz et al., 2019).
In feature-based NLP, structured-sparse regularization, in particular the group lasso (Yuan and Lin, 2006), has been proposed as a method to reduce model size while preserving performance (Martins et al., 2011). But in neural NLP, some of the most widely used models, LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014), do not have an obvious, intuitive notion of “structure” in their parameters (other than, perhaps, division into layers), so the use of structured sparsity may at first appear incongruous.
In this paper we show that group lasso can be successfully applied to neural NLP models. We focus on a family of neural models for which the hidden state exhibits a natural structure: rational RNNs (Peng et al., 2018). In a rational RNN, the value of each hidden dimension is the score of a weighted finite-state automaton (WFSA) on (a prefix of) the input vector sequence. This property offers a natural grouping of the transition function parameters for each WFSA. As shown by Schwartz et al. (2018) and Peng et al. (2018), a variety of state-of-the-art neural architectures are rational (Lei et al., 2017; Bradbury et al., 2017; Foerster et al., 2017, inter alia), so learning parameter-efficient rational RNNs is of practical value. We also take advantage of the natural interpretation of rational RNNs as “soft” patterns (Schwartz et al., 2018).
We apply a group lasso penalty to the WFSA parameters of rational RNNs during training, where each group comprises the parameters associated with one state in one WFSA (Fig. 1; §2). This penalty pushes the parameters in some groups to zero, effectively eliminating them and making the WFSA smaller. When all of the states for a given WFSA are eliminated, the WFSA is removed entirely, so this approach can be viewed as learning the number of WFSAs (i.e., the RNN hidden dimension) as well as their size. We then retain the sparse structure, which results in a much smaller model in terms of parameters.
We experiment with four text classification benchmarks (§3), using both GloVe and BERT embeddings. As we increase the regularization strength, we end up with smaller models. These models have a better trade-off between the number of parameters and model performance compared to setting the number of WFSAs and their lengths by hand or using hyperparameter search. In almost all cases, our approach results in models with fewer parameters and similar or better performance compared to our baselines.
In contrast to neural architecture search (Jozefowicz et al., 2015; Zoph and Le, 2017), which can take several GPU years to learn an appropriate neural architecture, our approach requires only two training runs: one to learn the structure, and the other to estimate its parameters. Other approaches either ignore the structure of the model and only look at the value of individual weights (Liu et al., 2019; LeCun et al., 1990; Lee et al., 2019; Frankle and Carbin, 2019) or only use high-level structures like the number of layers of the network (Wen et al., 2016; Scardapane et al., 2017; Gordon et al., 2018).
Finally, our approach touches on another appealing property of rational RNNs: their interpretability. Each WFSA captures a “soft” version of patterns like “such a great X”, and can be visualized as such (Schwartz et al., 2018). By retaining a small number of WFSAs, model structures learned using our method can be visualized succinctly. In §4 we show that some of our sentiment analysis models rely exclusively on as few as three WFSAs. [Footnote 2: That is, a rational RNN with hidden size 3.]
2 Method
We describe the proposed method. At a high level, we follow the standard practice for using regularization for sparsification (Wen et al., 2016):

1. Fit a model on the training data, with the group lasso regularizer added to the loss during training (the parameters associated with one state comprise one group).

2. After convergence, eliminate the states whose parameters are zero.

3. Fine-tune the resulting, smaller model by minimizing the unregularized loss with respect to its parameters.
In this work, we assume a single-layer rational RNN, but our approach is equally applicable to multi-layer models. For clarity of the discussion, we start with a one-dimensional rational RNN (i.e., one based on a single WFSA only). We then generalize to the d-dimensional case (computing the scores of d WFSAs in parallel).
Rational recurrent networks
Following Peng et al. (2018), we parameterize the transition functions of WFSAs with neural networks, such that each transition (main path or self-loop) defines a weighted function over the input word vector. We consider a 5-state WFSA, diagrammed in Fig. 2. A path starts at the initial state; at least four tokens must be consumed to reach the last state, and in this sense it captures 4-gram “soft” patterns (Peng et al., 2018; Schwartz et al., 2018). In addition to the last state, we also designate the three intermediate states as final states, allowing for the interpolation between patterns of different lengths. [Footnote 3: We found this to be more stable than using only the last state as final.] The self-loop transitions over the four non-initial states aim to allow, but downweight, nonconsecutive patterns, as the self-loop transition functions yield values between 0 and 1 (using a sigmoid function). The recurrent function is equivalent to applying the Forward dynamic programming algorithm (Baum and Petrie, 1966).
Promoting sparsity with group lasso
We aim to learn a sparse rational model with fewer WFSA states. This can be achieved by penalizing the parameters associated with a given state, specifically the parameters associated with entering that state, whether by a transition from another state or by a self-loop on that state. For example, the parameters of the WFSA in Fig. 2 (excluding the word embedding parameters) are assigned to four non-overlapping groups, one for each non-starting state.
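As a sketch of the objective this grouping implies, the regularized training loss can be written as below. The notation is ours and is not taken from the paper: $\mathcal{L}$ is the training loss, $\lambda$ the regularization strength, $\mathcal{G}$ the set of groups (one per non-starting state of each WFSA), and $\mathbf{w}_g$ the parameters in group $g$. We show the common unweighted form of the group lasso; some formulations additionally scale each term by the square root of the group size.

```latex
\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) \;+\; \lambda \sum_{g \in \mathcal{G}} \lVert \mathbf{w}_g \rVert_2
```

Because each group norm is not squared, the penalty tends to drive whole groups to zero together rather than shrinking every parameter a little, which is what lets entire states be removed.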
During training, the regularization term will push all parameters toward zero, and some will converge close to zero. [Footnote 4: There are optimization methods that achieve “strong” sparsity (Parikh and Boyd, 2013), where some parameters are exactly set to zero during training. Recent work has shown these approaches can converge in nonconvex settings (Reddi et al., 2016), but our experiments found them to be unstable.] After convergence, we remove groups whose norm falls below a fixed threshold. [Footnote 5: We use 0.1. This threshold was lightly tuned in preliminary experiments on the validation set and found to reliably remove those parameters which converged around zero without removing others.] The resulting smaller model is then fine-tuned by continuing training without regularization. With our linear-structured WFSA, zeroing out the group associated with a state in the middle effectively makes later states inaccessible. While our approach offers no guarantee to remove states from the end first (thus leaving no unreachable states), in our experiments it always did so.
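The pruning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the array layout (one row of parameters per state of each WFSA), the use of the Euclidean norm, and the function names are all assumptions made for the example. Only the 0.1 threshold comes from the text.

```python
import numpy as np


def group_lasso_penalty(params):
    """Sum of Euclidean norms, one per (WFSA, state) group.

    params has shape (num_wfsas, num_states, group_size); this layout is an
    assumption made for the example.
    """
    return np.linalg.norm(params, axis=-1).sum()


def surviving_states(params, threshold=0.1):
    """Boolean mask of groups whose norm is at or above the threshold."""
    return np.linalg.norm(params, axis=-1) >= threshold


def reachable_lengths(mask):
    """Number of states still reachable in each linear-chain WFSA.

    A pruned state cuts off every later state, so we count kept states only
    up to the first pruned one (cumprod zeroes everything after it).
    """
    return np.cumprod(mask.astype(int), axis=1).sum(axis=1)


# Two WFSAs with four non-starting states each and three parameters per group.
params = np.zeros((2, 4, 3))
params[0, 0] = [1.0, 0.0, 0.0]
params[0, 1] = [0.0, 2.0, 0.0]
params[1, 1] = [0.0, 0.0, 3.0]  # kept, but unreachable: its predecessor is pruned

penalty = group_lasso_penalty(params)
mask = surviving_states(params)
lengths = reachable_lengths(mask)
```

In this toy input the first WFSA keeps two states, while the second keeps none that are reachable and would be dropped entirely, which is the case in which the hidden dimension itself shrinks.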
d-dimensional case
To construct a rational RNN with d WFSAs (a d-dimensional model), we stack d one-dimensional models, each of them separately parameterized. The parameters of a d-dimensional rational model derived from the WFSA in Fig. 2 are organized into 4d groups, four for each dimension. Since there is no direct interaction between different dimensions (e.g., through an affine transformation), group lasso sparsifies each dimension/WFSA independently. Hence the resulting rational RNN can consist of WFSAs of different sizes, the number of which could be smaller than d if any of the WFSAs have all states eliminated.
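To make the stacked recurrence concrete, the sketch below scores d independent linear-chain WFSAs on one sequence with the Forward algorithm. It rests on several assumptions that are ours: transition weights are passed in as precomputed arrays rather than produced from word vectors, the start state carries a self-loop of weight 1, and the score of a WFSA is the sum of the path weights over all of its non-start (final) states. The names `forward_scores`, `main`, and `loops` are hypothetical.

```python
import numpy as np


def forward_scores(main, loops):
    """Score d linear-chain WFSAs on one sequence (plus-times semiring).

    main[t, j, i]  : weight of the main-path transition into state i+1 of
                     WFSA j when reading token t
    loops[t, j, i] : weight of the self-loop on state i+1 of WFSA j at token t
    Both arrays have shape (T, d, K), where K is the number of non-start states.
    """
    T, d, K = main.shape
    # h[j, i] is the total weight of paths currently in state i of WFSA j.
    h = np.zeros((d, K + 1))
    h[:, 0] = 1.0  # assumed: a match may start at any position at no cost
    for t in range(T):
        stay = h[:, 1:] * loops[t]      # remain in the same state
        advance = h[:, :-1] * main[t]   # move one state forward
        h = np.concatenate([np.ones((d, 1)), stay + advance], axis=1)
    # All non-start states are treated as final, so sum their weights.
    return h[:, 1:].sum(axis=1)


T, d, K = 4, 2, 4
main = np.full((T, d, K), 0.5)
loops = np.zeros((T, d, K))
main[:, 1, 2:] = 0.0  # second WFSA pruned down to two states
scores = forward_scores(main, loops)
```

Zeroing the transitions into the later states of the second WFSA leaves the first untouched, which mirrors the point that each dimension is sparsified independently.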
One can treat the numbers and sizes of WFSAs as hyperparameters (Oncina et al., 1993; Ron et al., 1994; de la Higuera, 2010; Schwartz et al., 2018). By eliminating states from WFSAs with group lasso, we learn the WFSA structure while fitting the model's parameters, reducing the number of training cycles by reducing the number of tunable hyperparameters.
3 Experiments
We run sentiment analysis experiments. We train the rational RNN models (§2) with group lasso regularization, using increasingly large regularization strengths, resulting in increasingly compact models. As the goal of our experiments is to demonstrate the ability of our approach to reduce the number of parameters, we only consider rational baselines: the same rational RNNs trained without group lasso. [Footnote 6: Rational RNNs have shown strong performance on the dataset we experiment with: a 2-layer rational model with between 100–300 hidden units obtained 92.7% classification accuracy, substantially outperforming an LSTM baseline (Peng et al., 2018). The results of our models, which are single-layered and capped at 24 hidden units, are not directly comparable to these baselines, but are still within two points of the best result from that paper.] We manually tune the number and sizes of the baselines' WFSAs, and then compare the trade-off curve between model size and accuracy. We describe our experiments below. For more details, see Appendix A.
Data
We experiment with the Amazon reviews binary sentiment classification dataset (Blitzer et al., 2007), composed of 22 product categories. We examine the standard dataset (original_mix), composed of a mixture of data from the different categories (Johnson and Zhang, 2015). [Footnote 7: http://riejohnson.com/cnn_data.html] We also examine three of the largest individual categories as separate datasets (kitchen, dvd, and books), following Johnson and Zhang (2015). The three category datasets do not overlap with each other (though they do with original_mix), and are significantly different in size (see Appendix A), so we can see how our approach behaves with different amounts of training data.
Implementation details
To classify text, we concatenate the scores computed by each WFSA, then feed this d-dimensional vector of scores (one score per WFSA) into a linear binary classifier. We use log loss. We experiment with both type-level word embeddings (GloVe.6B.300d; Pennington et al., 2014) and contextual embeddings (BERT large; Devlin et al., 2019). [Footnote 8: https://github.com/huggingface/pytorchpretrainedBERT] In both cases, we keep the embeddings fixed, so the vast majority of the learnable parameters are in the WFSAs. We train models using GloVe embeddings on all datasets. Due to memory constraints, we evaluate BERT embeddings (frozen, not fine-tuned) only on the smallest dataset (kitchen).
Baselines
As baselines, we train five versions of each rational architecture without group lasso, using the same number of WFSAs as our regularized models (24 for GloVe, 12 for BERT). Four of the baselines each use the same number of transitions for all WFSAs (1, 2, 3, and 4, corresponding to 2–5 states, and to 24, 48, 72, and 96 total transitions). The fifth baseline has an equal mix of all lengths (6 WFSAs of each size for GloVe, leading to 60 total transitions, and 3 WFSAs of each size for BERT, leading to 30 total transitions).
Each transition in our model is independently parameterized, so the total number of transitions linearly controls the number of learnable parameters (in addition to the parameters in the embedding layer).
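The transition totals quoted for the baselines follow from simple counting, reproduced here as a short check. The helper name `total_transitions` is ours; the counts of WFSAs and their lengths are the ones stated above.

```python
def total_transitions(lengths):
    """Total main-path transitions, given the number of transitions per WFSA."""
    return sum(lengths)


# Four uniform GloVe baselines: 24 WFSAs that all share the same length.
glove_uniform = {n: total_transitions([n] * 24) for n in (1, 2, 3, 4)}

# Mixed baselines: an equal number of WFSAs of each length.
glove_mixed = total_transitions([n for n in (1, 2, 3, 4) for _ in range(6)])
bert_mixed = total_transitions([n for n in (1, 2, 3, 4) for _ in range(3)])
```

Since every transition is parameterized independently, these totals are proportional to the number of learnable WFSA parameters for a fixed embedding size.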
[Figure 3: test accuracy against the total number of WFSA transitions. Caption, partially recovered: “… directly varying the number of transitions). Vertical lines encode one standard deviation for accuracy, while horizontal lines encode one standard deviation in the number of transitions (applicable only to our method).”]
Results
Fig. 3 shows our classification test accuracy as a function of the total number of WFSA transitions in the model. We first notice that, as expected, the performance of our unregularized baselines improves as models are trained with more transitions (i.e., more parameters).
Compared to the baselines, training with group lasso provides a better trade-off between performance and number of transitions. In particular, our heavily regularized models perform substantially better than the unigram baselines, gaining 1–2% absolute in four out of five cases. As the regularization strength decreases, we naturally gain less over our baselines, although performance remains similar to or better than the best baselines in four out of five cases.
4 Visualization
Using our method with a high regularization strength, the resulting sparse structures often contain only a handful of WFSAs. In such cases, building on the interpretability of individual WFSAs, we are able to visualize every hidden unit, i.e., the entire model. To visualize a single WFSA, we follow Schwartz et al. (2018) and compute its score on every phrase in the training corpus. We then select the top and bottom scoring phrases for that WFSA [Footnote 9: As each WFSA score is used as a feature that is fed to a linear classifier, negative scores are also meaningful.], and get a prototype-like description of the pattern it represents. [Footnote 10: While the WFSA scores are the sum of all paths deriving a document (plus-times semiring), here we search for the max (or min) scoring one. Despite the mismatch, a WFSA scores every possible path, and thus the max/min scoring path selection is still valid. As our examples show, many of these extracted paths are meaningful.]
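The selection step can be sketched as below, assuming the phrase scores for one WFSA have already been computed. The function name and the example scores are invented for illustration; only the idea of keeping the highest and lowest scoring phrases comes from the text.

```python
import heapq


def top_and_bottom(scored_phrases, k=4):
    """Return the k highest- and k lowest-scoring phrases for one WFSA.

    scored_phrases is an iterable of (score, phrase) pairs.
    """
    pairs = list(scored_phrases)
    top = heapq.nlargest(k, pairs, key=lambda p: p[0])
    bottom = heapq.nsmallest(k, pairs, key=lambda p: p[0])
    return [phrase for _, phrase in top], [phrase for _, phrase in bottom]


# Made-up scores, chosen only to show the shape of the output.
scored = [
    (2.1, "not worth it"),
    (-1.7, "extremely pleased"),
    (0.1, "the box"),
    (1.5, "not worth the time"),
    (-0.9, "highly pleased"),
]
top, bottom = top_and_bottom(scored, k=2)
```

Keeping both ends of the ranking matters here because the downstream classifier is linear, so strongly negative scores carry as much signal as strongly positive ones.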
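| Pattern | Rank | transition 1 | transition 2 | transition 3 |
| --- | --- | --- | --- | --- |
| Patt. 1 | Top | not | worth | the time /s |
| | | not | worth | the 30 /s |
| | | not | worth | it /s |
| | | not | worth | it /s |
| | Bottom | extremely | pleased | … /s |
| | | highly | pleased | … /s |
| | | extremely | pleased | … /s |
| | | extremely | pleased | … /s |
| Patt. 2 | Top | miserable | /s | |
| | | miserable | … /s | |
| | | miserable | … /s | |
| | | returned | /s | |
| | Bottom | superb | /s | |
| | | superb | /s | |
| | | superb | /s | |
| | | superb | choice /s | |
| Patt. 3 | Top | bad | … ltd | … buyer |
| | | bad | … ltd | … buyer |
| | | horrible | … hl4040cn | … expensive |
| | | left | … ltd | … lens |
| | Bottom | favorite | … ltd | … lens |
| | | really | … ltd | … buyer |
| | | really | … ltd | … buyer |
| | | best | … hl4040cn | … expensive |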
Table 1 visualizes a sparse rational RNN trained on original_mix with only three WFSAs (8 main-path transitions in total). [Footnote 11: The test performance of this model is 88%, 0.6% absolute below the average of the five models reported in Fig. 3.] The table shows that, looking at the top scores of each WFSA, two of the patterns respectively capture the phrases “not worth X /s” and “miserable/returned X /s”. Pattern 3 is not as coherent, but most examples do contain sentiment-bearing words such as bad, horrible, or best. This might be the result of the tuning process of the sparse rational structure simply learning a collection of words, rather than coherent phrases. As a result, this WFSA is treated as a unigram pattern rather than a trigram. The lowest scoring phrases show a similar trend. Appendix B shows the same visualization for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings.
We observe another interesting trend: two of the three patterns prefer expressions that appear near the end of the document. This could result from the nature of the datasets (e.g., many reviews end with a summary, containing important sentiment information), and/or our rational models' recency preference. More specifically, the first self-loop has weight 1, and hence the model is not penalized for taking self-loops before the match; in contrast, the weights of the last self-loop take values in (0, 1) due to the sigmoid, forcing a penalty for earlier phrase matches. [Footnote 12: Changing this behavior could be easily done by fixing the final self-loop to 1 as well.]
5 Conclusion
We presented a method for learning parameter-efficient RNNs. Our method applies group lasso regularization on rational RNNs, which are strongly connected to weighted finite-state automata, and thus amenable to learning with structured sparsity. Our experiments on four text classification datasets, using both GloVe and BERT embeddings, show that our sparse models provide a better trade-off between performance and model size. We hope our method will facilitate the development of “thin” NLP models that are faster, consume less memory, and are interpretable (Schwartz et al., 2019).
Acknowledgments
This work was completed while the first author was an intern at the Allen Institute for Artificial Intelligence. The authors thank Pradeep Dasigi, Gabriel Stanovsky, and Elad Eban for their discussion. In addition, the authors thank the members of the Noah’s ARK group at the University of Washington, the researchers at the Allen Institute for Artificial Intelligence, and the anonymous reviewers for their valuable feedback. This work was supported in part by a hardware gift from NVIDIA Corporation.
References
Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
Baum and Petrie (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37 (6).
Blitzer et al. (2007). Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In Proc. of ACL.
Bradbury et al. (2017). Quasi-recurrent neural network. In Proc. of ICLR.
Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP.
de la Higuera (2010). Grammatical inference: learning automata and grammars. Cambridge University Press.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
Foerster et al. (2017). Intelligible language modeling with input switched affine networks. In Proc. of ICML.
Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In Proc. of ICLR.
Gordon et al. (2018). MorphNet: fast & simple resource-constrained structure learning of deep networks. In Proc. of CVPR.
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8).
Johnson and Zhang (2015). Effective use of word order for text categorization with convolutional neural networks. In Proc. of NAACL.
Jozefowicz et al. (2015). An empirical exploration of recurrent network architectures. In Proc. of ICML.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In Proc. of ICLR.
LeCun et al. (1990). Optimal brain damage. In NeurIPS.
Lee et al. (2019). SNIP: single-shot network pruning based on connection sensitivity. In Proc. of ICLR.
Lei et al. (2017). Training RNNs as fast as CNNs. arXiv:1709.02755.
Liu et al. (2019). DARTS: differentiable architecture search. In Proc. of ICLR.
Livni et al. (2014). On the computational efficiency of training neural networks. In NeurIPS.
Martins et al. (2011). Structured sparsity in structured prediction. In Proc. of EMNLP.
Oncina et al. (1993). Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Trans. Pattern Anal. Mach. Intell. 15, pp. 448–458.
Parikh and Boyd (2013). Proximal algorithms. Foundations and Trends in Optimization.
Peng et al. (2018). Rational recurrences. In Proc. of EMNLP.
Pennington et al. (2014). GloVe: global vectors for word representation. In Proc. of EMNLP.
Radford et al. (2019). Language models are unsupervised multitask learners.
Reddi et al. (2016). Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In NeurIPS.
Ron et al. (1994). Learning probabilistic automata with variable memory length. In Proc. of COLT.
Scardapane et al. (2017). Group sparse regularization for deep neural networks. Neurocomputing 241.
Schwartz et al. (2019). Green AI. arXiv:1907.10597.
Schwartz et al. (2018). SoPa: bridging CNNs, RNNs, and weighted finite-state machines. In Proc. of ACL.
Strubell et al. (2019). Energy and policy considerations for deep learning in NLP. In Proc. of ACL.
Wen et al. (2016). Learning structured sparsity in deep neural networks. In NeurIPS.
Yuan and Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1).
Zoph and Le (2017). Neural architecture search with reinforcement learning. In Proc. of ICLR.
Appendix A Experiment Details
Dataset statistics
Table 2 shows the sizes of the datasets experimented with.
| Dataset | Training | Dev. | Test |
| --- | --- | --- | --- |
| kitchen | 3,298 | 822 | 4,118 |
| dvd | 14,066 | 3,514 | 17,578 |
| books | 20,000 | 5,000 | 25,000 |
| original_mix | 20,000 | 5,000 | 25,000 |
Preprocessing
As preprocessing for the data for each individual category, we tokenize using the NLTK word tokenizer. We remove reviews with text shorter than 5 tokens.
We binarize the review score using the standard procedure, treating 1- and 2-star reviews as negative and 4- and 5-star reviews as positive (discarding 3-star reviews). Then, if there are more than 25,000 negative reviews, we downsample them to 25,000 (otherwise we keep them all), and then downsample the positive reviews to the same number as the negative ones, to obtain a balanced dataset. We match the train, development, and test set proportions of 4:1:5 from the original mixture.
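The binarize, balance, and split steps can be sketched as follows. This is an illustration under our own assumptions rather than the authors' script: the input is a list of `(text, stars)` pairs, shuffling uses a fixed seed, the function names are hypothetical, and the tokenization and length filter are left out.

```python
import random


def binarize(stars):
    """Map a star rating to a label: 0 negative, 1 positive, None to discard."""
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    return None


def balance_and_split(reviews, cap=25000, seed=0):
    """Return (train, dev, test) lists of (text, label) in 4:1:5 proportions.

    Assumes there are at least as many positive reviews as negative ones
    after capping, as the description in the text implies.
    """
    rng = random.Random(seed)
    neg = [(text, 0) for text, stars in reviews if binarize(stars) == 0]
    pos = [(text, 1) for text, stars in reviews if binarize(stars) == 1]
    rng.shuffle(neg)
    rng.shuffle(pos)
    neg = neg[:cap]           # keep at most `cap` negative reviews
    pos = pos[:len(neg)]      # match the number of positives to negatives
    data = neg + pos
    rng.shuffle(data)
    n_train = len(data) * 4 // 10
    n_dev = len(data) // 10
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]


# Toy input: 30 negative, 50 positive, and 10 neutral reviews.
reviews = [("neg", 1)] * 30 + [("pos", 5)] * 50 + [("meh", 3)] * 10
train, dev, test = balance_and_split(reviews, cap=20)
```

With the cap of 25,000 from the text, a category that has enough reviews yields 50,000 examples, which splits into 20,000, 5,000, and 25,000, the sizes listed for books in Table 2.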
We generate the BERT embeddings using the sum of the last four hidden layers of the large uncased BERT model, so our embedding size is 1024. Summing the last four layers was the best performing approach in the ablation of Devlin et al. (2019) among those with an embedding size below 4096 (which was too large to fit in memory). We embed each sentence individually (there can be multiple sentences within one example).
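The difference between summing and concatenating the last four layers can be seen in a small sketch. It assumes the per-layer hidden states have already been extracted into an array of shape (layers, tokens, hidden size); the function name is ours, and the array of ones merely stands in for real BERT outputs.

```python
import numpy as np


def sum_last_four(hidden_states):
    """Sum the last four layers: (layers, tokens, hidden) -> (tokens, hidden)."""
    return np.asarray(hidden_states)[-4:].sum(axis=0)


# Stand-in for BERT large: 24 layers, 7 tokens, hidden size 1024.
hidden = np.ones((24, 7, 1024))
summed = sum_last_four(hidden)

# Concatenating the same four layers quadruples the embedding size instead.
concatenated = np.concatenate(list(hidden[-4:]), axis=-1)
```

Summing keeps the embedding at 1024 dimensions per token, whereas concatenation produces the 4096-dimensional vectors that did not fit in memory.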
Implementation details
For GloVe, we train rational models with 24 5-state WFSAs, each corresponding to a 4-gram soft pattern (Fig. 2). For BERT, we train models with 12 WFSAs. [Footnote 13: The BERT embedding dimension is significantly larger than GloVe (1024 compared to 300), so we used a smaller number of WFSAs. As our results show, the BERT models still substantially outperform the GloVe ones.]
Experiments
For each model (regularized or baseline), we run random search to select our hyperparameters (evaluating 20 uniformly sampled hyperparameter configurations). For the hyperparameter configuration that leads to the best development result, we train the model again 5 times with different random seeds, and report the mean and standard deviation of the models’ test performance.
Parameters
Regularization strength search
We searched for model structures that were regularized down to close to 20, 40, 60, or 80 transitions (10, 20, 30, and 40 for BERT experiments). For a particular goal size, we uniformly sampled 20 hyperparameter assignments from the ranges in Table 4, then sorted the samples by increasing learning rate. For each hyperparameter assignment, we trained a model with the current regularization strength. If the resulting learned structure was too large (small), we doubled (halved) the regularization strength, repeating until we were within 10 transitions of our goal (5 for BERT experiments). [Footnote 14: If the regularization strength grew above a fixed upper bound or fell below a fixed lower bound, we threw out the hyperparameter assignment and resampled (this happened when, e.g., the learning rate was too small for any of the weights to actually make it to zero).]
Finally, we fine-tuned the appropriately sized learned structure by continuing training without the regularizer, and computed the result on the development set. For the best model on the development set, we retrained (first with the regularizer to learn a structure, then fine-tuned) five times, and plot the mean and variance of the test accuracy and learned structure size.
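The doubling and halving loop can be sketched as below. Everything named here is hypothetical: `train_and_count` stands in for a full regularized training run that returns the number of transitions left, the bounds are passed in by the caller because their values are not given here, and `max_steps` is a guard we add so that a search oscillating around the goal terminates.

```python
def search_strength(train_and_count, goal, tolerance, start, lower, upper,
                    max_steps=20):
    """Double or halve the regularization strength until the learned structure
    is within `tolerance` transitions of `goal`.

    Returns the accepted strength, or None if the strength leaves
    [lower, upper] or the step budget runs out (the caller would then
    resample the other hyperparameters).
    """
    strength = start
    for _ in range(max_steps):
        if not lower <= strength <= upper:
            return None
        size = train_and_count(strength)
        if abs(size - goal) <= tolerance:
            return strength
        # Too many transitions: regularize harder. Too few: regularize less.
        strength = strength * 2 if size > goal else strength / 2
    return None


def toy_run(strength):
    """Made-up stand-in for training: stronger regularization, fewer transitions."""
    return round(96 / (1 + 50 * strength))


found = search_strength(toy_run, goal=40, tolerance=10, start=1.0,
                        lower=1e-6, upper=1e3)
rejected = search_strength(toy_run, goal=40, tolerance=10, start=1.0,
                           lower=0.1, upper=10.0)
```

Whether "within 10 transitions" is inclusive is not specified in the text; the sketch treats it as inclusive.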
Appendix B Visualization
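| Pattern | Rank | transition 1 | transition 2 | transition 3 |
| --- | --- | --- | --- | --- |
| Patt. 1 | Top | are | perfect | … [CLS] |
| | | definitely | recommend | … [CLS] |
| | | excellent | product | … [CLS] |
| | | highly | recommend | … [CLS] |
| | Bottom | not | … [SEP] | … [CLS] |
| | | very | disappointing | ! [SEP] [CLS] |
| | | was | defective | … had |
| | | would | not | … [CLS] |
| Patt. 2 | Top | [CLS] | mine | broke |
| | | [CLS] | it | … heat |
| | | [CLS] | thus | it |
| | | [CLS] | it does | it heat |
| | Bottom | [CLS] | perfect | … cold |
| | | [CLS] | sturdy | … cooks |
| | | [CLS] | evenly | , withstand heat |
| | | [CLS] | it | is |
| Patt. 3 | Top | ‘ | pops | ’ ’ escape |
| | | ‘ | gave | out |
| | | that | had | escaped |
| | | ‘ | non | |
| | Bottom | simply | does | not |
| | | [CLS] | useless | equipment ! |
| | | unit | would | not |
| | | [CLS] | poor | to no |
| Patt. 4 | Top | [CLS] | after | |
| | | [CLS] | our | |
| | | mysteriously | jammed | |
| | | mysteriously | jammed | |
| | Bottom | [CLS] | i | |
| | | [CLS] | i | |
| | | [CLS] | i | |
| | | [CLS] | we | |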
Table 3 shows the same visualization shown in §4 for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings on kitchen. It also shows a few clear patterns (e.g., Patt. 2). Interpretation here is more challenging, though, as contextual embeddings make every token embedding depend on the entire context. [Footnote 15: Indeed, contextual embeddings raise problems for interpretation methods that work by targeting individual words, e.g., attention (Bahdanau et al., 2015), as these embeddings also depend on other words. Interpretation methods for contextual embeddings are an exciting direction for future work.] A particular example of this is the excessive use of the start token ([CLS]), whose contextual embedding has been shown to capture the sentiment information at the sentence level (Devlin et al., 2019).
Regularization strength recommendation
If a practitioner wishes to learn a single small model, we recommend they start with a regularization strength such that the loss and the regularization term are equal at initialization (before training). We found that having equal contribution led to eliminating approximately half of the states, though this varies with dataset size, learning rate, and gradient clipping, among other variables.
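One way to read this recommendation, assuming the regularization term is the strength multiplied by the sum of the group norms, is to divide the initial loss by that sum. The sketch below does exactly that; the function name and the example numbers (96 groups with norm 0.5, and an initial log loss of ln 2, the value for an uninformative binary classifier) are illustrative only.

```python
import math


def initial_strength(loss_at_init, group_norms):
    """Strength that makes strength * sum(group norms) equal the initial loss."""
    total = sum(group_norms)
    if total == 0:
        raise ValueError("all groups are zero at initialization")
    return loss_at_init / total


# Illustrative values only: 96 groups, each with norm 0.5 at initialization.
norms = [0.5] * 96
strength = initial_strength(math.log(2), norms)
```

This gives only a starting point: as the text notes, the fraction of states eliminated still depends on the data and on the other training settings.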
| Type | Range |
| --- | --- |
| Learning rate | [, ] |
| Vertical dropout | [0, 0.5] |
| Recurrent dropout | [0, 0.5] |
| Embedding dropout | [0, 0.5] |
| regularization | [0, 0.5] |
| Weight decay | [, ] |