Adaptive Pruning of Neural Language Models for Mobile Devices

09/27/2018 ∙ by Raphael Tang, et al. ∙ University of Waterloo

Neural language models (NLMs) exist in an accuracy–efficiency tradeoff space where better perplexity typically comes at the cost of greater computational complexity. In a software keyboard application on mobile devices, this translates into higher power consumption and shorter battery life. This paper represents the first attempt, to our knowledge, at exploring accuracy–efficiency tradeoffs for NLMs. Building on quasi-recurrent neural networks (QRNNs), we apply pruning techniques to provide a "knob" to select different operating points. In addition, we propose a simple technique to recover some perplexity using a negligible amount of memory. Our empirical evaluations consider both perplexity and energy consumption on a Raspberry Pi, where we demonstrate which methods provide the best perplexity–power consumption operating point. At one operating point, one of the techniques is able to provide energy savings of 40% over the state of the art with only a 17% relative increase in perplexity.




1 Introduction

An emerging application of neural language models (NLMs) is smart software keyboards on mobile devices such as smartphones and tablets, which provide next-word prediction and allow users to input entire words with a single tap. For example, the apps SwiftKey and Swype both advertise the use of neural networks for predictions. According to the Google Play Store, SwiftKey has more than 100 million downloads, demonstrating its popularity.

Based on standard metrics such as perplexity, neural techniques represent an advance in the state of the art in language modeling (Merity et al., 2018b). Better models, however, come at a cost in computational complexity, which translates to higher power consumption. In the context of mobile devices, energy efficiency is, of course, an important optimization objective. A casual web search, for example, reveals numerous complaints from users of the above apps about battery drain, indicating that this is not a hypothetical concern.

In reality, neural language models exist in an accuracy–efficiency tradeoff space. Although this fact has been recognized for applications such as image recognition (Canziani et al., 2016) and keyword spotting (Tang et al., 2018), to our knowledge no one in the NLP community has explored these tradeoffs. All previous papers on NLMs simply report single-point perplexity figures. In contrast, the high-level goal of our work is to understand the tradeoffs between neural modeling accuracy and real-world efficiency constraints: in addition to perplexity, NLMs should be evaluated in terms of FLOPs (by convention in the literature, the total number of additions and multiplications), millijoules per query (mJ/q), and inference latency. We conduct exactly such experiments, using the Raspberry Pi (which shares the same architecture as most mobile devices today) as a more convenient hardware platform.

Ideally, NLMs should provide a “knob” that allows developers to tune accuracy–efficiency tradeoffs. In this paper, we explore pruning approaches that take a pre-trained quasi-recurrent neural network (QRNN; Bradbury et al., 2017), representing the state of the art in NLMs today, and provide exactly such a knob. Furthermore, our techniques allow these tradeoffs to be tuned at inference time, which allows a mobile device to adaptively control its behavior, e.g., to favor efficiency at the cost of accuracy when the battery is low.

Thus, this paper makes the following contributions: First, to our knowledge, we are the first to comprehensively explore accuracy–efficiency tradeoffs for NLMs with experimental evaluation of energy consumption on a Raspberry Pi. Second, we evaluate a number of inference-time pruning techniques that take any pre-trained QRNN and provide a tunable accuracy–efficiency “knob”.

2 Background and Related Work

2.1 Quasi-recurrent Neural Networks

Figure 1: An illustration of the first QRNN layer for language modeling. In this visualization, a QRNN layer with a window size of two convolves and pools using embeddings from the input. Note the absence of recurrent weights.

Quasi-recurrent neural networks (QRNNs;  Bradbury et al., 2017) achieve highly competitive perplexity on word-level language modeling datasets, including state-of-the-art perplexity on WikiText-103 (Merity et al., 2018b). Although applying such techniques as dynamic evaluation (Krause et al., 2017), Hebbian softmax (Rae et al., 2018), and mixture of softmaxes (Yang et al., 2017) can produce lower perplexity, our focus is on the recurrent architecture. Thus, we explore the task of pruning QRNNs without using any other extensions.

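Each word is encoded as a one-hot vector and then fed into a linear layer, which produces lower-dimensional word embeddings for the QRNN layers. A single QRNN layer consists of two distinct components, convolution and recurrent pooling, that alternate to imitate an LSTM (Hochreiter & Schmidhuber, 1997). Given a stacked sequence of inputs $\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^{k \times n}$ (e.g., word embeddings in language modeling), the one-dimensional convolution layer is defined as

$$\mathbf{Z} = \tanh(\mathbf{W}_z * \mathbf{X}), \qquad \mathbf{F} = \sigma(\mathbf{W}_f * \mathbf{X}), \qquad \mathbf{O} = \sigma(\mathbf{W}_o * \mathbf{X})$$

where $\mathbf{W}_z$, $\mathbf{W}_f$, $\mathbf{W}_o$ are the weights associated with the input, forget, and output gates, respectively, $*$ represents a masked convolution along time, and $\sigma$ denotes the sigmoid function. For $\mathbf{W}_{\{z,f,o\}} \in \mathbb{R}^{m \times k \times r}$, $m$ is the number of output channels, $k$ is the number of input channels, and $r$ the window size across time. Without loss of generality, we henceforth represent $\mathbf{W}_{\{z,f,o\}}$ as two-dimensional matrices $\mathbf{W}_{\{z,f,o\}} \in \mathbb{R}^{m \times s}$, where $s = k \times r$. The outputs are fed into a recurrent pooling layer:

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + (1 - \mathbf{f}_t) \odot \mathbf{z}_t, \qquad \mathbf{h}_t = \mathbf{o}_t \odot \mathbf{c}_t$$

where $\odot$ denotes element-wise product. Altogether, these two layers define a single QRNN layer (Bradbury et al., 2017; see Figure 1). Multiple layers can be stacked for greater expressiveness, where the output of the previous layer is the input to the current layer.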
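As a rough illustration of how convolution and recurrent pooling fit together, the following pure-Python sketch runs one QRNN layer on a toy input. The function names (`masked_conv`, `qrnn_layer`), the zero initial state, and the way a window is flattened into a row of length $k \times r$ are assumptions made for this example, not details taken from the paper or from any QRNN library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masked_conv(weights, inputs, window):
    """Masked 1-D convolution along time.

    weights: m rows, each of length k * window (the 2-D form of W).
    inputs:  list of n input vectors, each of length k.
    Position t only sees inputs t - window + 1 .. t (zero-padded on the left),
    so no future token leaks into the output.
    """
    k = len(inputs[0])
    outputs = []
    for t in range(len(inputs)):
        patch = []
        for offset in range(window - 1, -1, -1):
            src = t - offset
            patch.extend(inputs[src] if src >= 0 else [0.0] * k)
        outputs.append([sum(w * x for w, x in zip(row, patch)) for row in weights])
    return outputs

def qrnn_layer(w_z, w_f, w_o, inputs, window):
    """One QRNN layer: three gate convolutions followed by recurrent pooling."""
    z_pre = masked_conv(w_z, inputs, window)
    f_pre = masked_conv(w_f, inputs, window)
    o_pre = masked_conv(w_o, inputs, window)
    c = [0.0] * len(w_z)  # assumed initial cell state c_0 = 0
    hidden = []
    for zp, fp, op in zip(z_pre, f_pre, o_pre):
        z = [math.tanh(v) for v in zp]
        f = [sigmoid(v) for v in fp]
        o = [sigmoid(v) for v in op]
        # c_t = f_t * c_{t-1} + (1 - f_t) * z_t, all element-wise
        c = [f_i * c_i + (1.0 - f_i) * z_i for f_i, c_i, z_i in zip(f, c, z)]
        # h_t = o_t * c_t
        hidden.append([o_i * c_i for o_i, c_i in zip(o, c)])
    return hidden

# Toy example: k = 2 input channels, m = 2 output channels, window r = 2 (s = 4).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_z = [[0.5, -0.5, 0.25, 0.25], [0.0, 0.0, 0.0, 0.0]]
w_f = [[0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2]]
w_o = [[0.3, 0.3, 0.3, 0.3], [0.4, 0.4, 0.4, 0.4]]
h = qrnn_layer(w_z, w_f, w_o, x, window=2)
```

Note that the recurrence in the pooling step involves only element-wise operations on already-computed gates; all learned weights sit in the convolutions, which is why removing a filter is a matter of deleting rows and columns of dense matrices.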

We tie the weights between the input and output layers, as used by Merity et al. (2018a) and proposed by Inan et al. (2017). In addition to improving perplexity, weight tying reduces the number of parameters and hence the memory footprint, which is beneficial to our task.

2.2 Pruning

Weight pruning is an effective strategy for reducing the computational footprint of a model. In an influential pioneering work, LeCun et al. (1990) propose to discard weights using an error-approximation approach based on Hessian diagonals. More recent work suggests pruning weights with small magnitudes (Han et al., 2016), with quantization and Huffman coding as additional steps. However, these approaches introduce irregular sparsity to the weights, and they assume that re-training the weights is feasible.

In this work, we take a different approach and focus on techniques that eliminate entire filters. This is because modern implementations of feedforward evaluation (e.g., im2col and particularly NEON instructions on ARM processors) take advantage of dense matrix multiplications. Pruning individual weights without changing the dimensions of the weight matrices has minimal effect on power consumption, which is confirmed by our initial exploratory studies on the Raspberry Pi. Hence, we only examine pruning techniques that discard entire filters of the convolutional layers:

Random pruning. A simple baseline (Mittal et al., 2018) is random filter pruning, where $n\%$ of the filters are randomly pruned, layer-by-layer. Interestingly, Mittal et al. (2018) find that random pruning is competitive with more advanced methods.

Filter norm. Li et al. (2017) propose ranking filters by their $L_1$-norms, and then dropping off $n\%$ of the smallest filters on a layer-by-layer basis. Mittal et al. (2018) have previously found that $L_1$-norm filter pruning (Li et al., 2017) outperforms a multitude of competing approaches.

Mean activation norm. Among other approaches, Molchanov et al. (2016) suggest pruning filters whose mean activations are small. This approach is especially effective on ReLU, which both creates sparse activations and forces them to be non-negative.

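$L_0$ regularization. Louizos et al. (2018) apply $L_0$ regularization to neural networks in order to learn sparse, efficient structures. Formally, define an objective

$$\mathcal{R}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_0, \qquad \theta^* = \arg\min_{\theta} \mathcal{R}(\theta)$$

where $\mathcal{L}$ is the original loss function and $\theta$ the weights. The dependence on the hypothesis and training examples has been omitted for brevity. The optimal solution entails a non-differentiable objective and iteration over all $2^{|\theta|}$ possibilities to choose the best $\theta^*$; hence, Louizos et al. (2018) propose the following relaxation of the objective:

$$\hat{\mathcal{R}}(\theta, \phi) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z} \mid \phi)}\left[\mathcal{L}(\theta \odot \mathbf{z})\right] + \lambda \sum_{j=1}^{|\theta|} \left(1 - Q(z_j \le 0; \phi_j)\right)$$

where $\mathbf{z}$ is a binary discrete random mask parameterized by $\phi$, and $Q$ is the CDF. Intuitively, for some choice of $\phi$, the number of active parameters (on average) is penalized. Inspired by the Concrete distribution (Maddison et al., 2016), Louizos et al. (2018) propose the hard concrete distribution for $\mathbf{z}$, further relaxing the discrete random mask into a continuous one:

$$\mathbf{s} = \sigma\left(\frac{\log \mathbf{u} - \log(1 - \mathbf{u}) + \log \alpha}{\beta}\right), \qquad \mathbf{z} = \min(1, \max(0, (\zeta - \gamma)\,\mathbf{s} + \gamma))$$

where $\mathbf{u}$ is a continuous random vector such that $\mathbf{u} \sim U(0, 1)$, $\phi = \log \alpha$ are the mask parameters, and $\gamma$, $\zeta$, and $\beta$ are scaling hyperparameters. Note that $\beta$ can also be included as part of the mask parameters $\phi$; we follow Louizos et al. (2018) and fix $\beta = 2/3$. Louizos et al. (2018) then apply the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) and make a Monte Carlo approximation to the objective:

$$\hat{\mathcal{R}}(\theta, \phi) \approx \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\theta \odot \mathbf{z}^{(i)}) + \lambda \sum_{j=1}^{|\theta|} \left(1 - Q(z_j \le 0; \phi_j)\right)$$

A closed form expression is derived for the penalty, $1 - Q(z_j \le 0; \phi_j) = \sigma\left(\log \alpha_j - \beta \log(-\gamma / \zeta)\right)$. At test time, the following estimator is used:

$$\hat{\mathbf{z}} = \min(1, \max(0, \sigma(\log \alpha)(\zeta - \gamma) + \gamma))$$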

3 Inference-Time Pruning

In this section, we explain how the various techniques in Section 2.2 can be adapted to QRNNs. For the following methods, we assume that a pre-trained model is provided. We denote the weights at QRNN layer $l$ as $\mathbf{W}^{(l)}$. In all methods, we tie the indices across $\mathbf{W}_z$, $\mathbf{W}_f$, and $\mathbf{W}_o$. For example, if filter $i$ is selected for pruning at layer $l$, then $\mathbf{W}^{(l)}_{\{z,f,o\}} := \mathbf{W}^{(l)}_{\{z,f,o\}}[-i, :]$, where $-i$ denotes exclusion of index $i$. This allows the removal of column $i$ of $\mathbf{W}^{(l+1)}_{\{z,f,o\}}$ in the next layer as well.
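The index tying can be sketched with plain nested lists. The helper `prune_filter` below is hypothetical, and it assumes the next layer uses a window size of one, so that output channel $i$ of one layer corresponds to exactly one input column of the next.

```python
def prune_filter(layer, next_layer, index):
    """Drop filter `index` from every gate of `layer` and the matching
    input column from every gate of `next_layer`.

    Both arguments map a gate name ("z", "f", "o") to a 2-D weight matrix
    stored as a list of rows. New matrices are returned; inputs are untouched.
    """
    pruned = {
        gate: [row for r, row in enumerate(w) if r != index]
        for gate, w in layer.items()
    }
    pruned_next = {
        gate: [[v for c, v in enumerate(row) if c != index] for row in w]
        for gate, w in next_layer.items()
    }
    return pruned, pruned_next

# Layer l has 3 filters over 2 inputs; layer l + 1 has 2 filters over 3 inputs.
layer = {g: [[float(10 * r + c) for c in range(2)] for r in range(3)] for g in "zfo"}
next_layer = {g: [[float(10 * r + c) for c in range(3)] for r in range(2)] for g in "zfo"}
pruned, pruned_next = prune_filter(layer, next_layer, index=1)
```

Because whole rows and columns disappear, the pruned matrices stay dense and simply become smaller, which is the property the dense matrix multiplication kernels benefit from.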

Random pruning. We apply random pruning to $\mathbf{W}_z$, $\mathbf{W}_f$, and $\mathbf{W}_o$. That is, we randomly prune filters associated with the same indices across the three weights.

Filter norm. We apply filter norm pruning (Li et al., 2017), with the filter norms of $\mathbf{W}_z$ acting as the criteria. We find $\mathbf{W}_z$ most helpful, since small filter norms should result in small hidden outputs, which is not necessarily the case for $\mathbf{W}_f$ and $\mathbf{W}_o$.

Mean activation norm. The hidden output $\mathbf{H} = \mathbf{h}_1, \ldots, \mathbf{h}_n$ is a natural candidate for collecting mean activation statistics. Intuitively, if channel $i$ of $\mathbf{h}_t$ is small in magnitude on average, then the filters $\mathbf{W}_{\{z,f,o\}}[i, :]$ are less important. Statistics are collected using a single pass of the entire training set. For inference-time pruning, we store the collected statistics.

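$L_0$ regularization. Since we are given a pre-trained model and are prohibited from altering the weights $\theta$, we learn the mask parameters only: $\phi^* = \arg\min_{\phi} \hat{\mathcal{R}}(\theta, \phi)$. We also enforce the sparsity on entire rows of $\mathbf{W}_{\{z,f,o\}}$, which corresponds to “group sparsity” in Louizos et al. (2018). Specifically, we formulate the regularization on a feature map level instead, with $\mathbf{Z}$ as the target:

$$\mathbf{Z}^{(l)} := \operatorname{diag}(\mathbf{z}^{(l)}) \tanh(\mathbf{W}^{(l)}_z * \mathbf{X})$$

where $\mathbf{z}^{(l)}$ is the hard concrete mask for layer $l$, with one entry per filter. $\mathbf{Z}$ is chosen for the property that channel $i$ of the feature map $\mathbf{h}_t$ is zero for all $t$ if channel $i$ of $\mathbf{Z}$ is zero.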

This approach entails training and storing extra mask parameters for each operating point. However, we find this to be a non-issue for our task, since there are few operating points (three or four at most, out of which we use two for $L_0$ regularization), so the extra storage is negligible.
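The three score-based criteria in this section (random, filter norm, and mean activation) all reduce to choosing a set of filter indices to drop. The sketch below shows one way to express them; the function names are hypothetical, the filter norm is assumed to be the $L_1$ norm of each row of the input-gate weights, and the mean activation score is assumed to be the mean absolute hidden output per channel.

```python
import random

def tied_random_indices(num_filters, fraction, seed=0):
    """Pick filter indices uniformly at random; the same indices would then
    be removed from the input, forget, and output gate weights."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_filters), int(num_filters * fraction)))

def lowest_scoring_indices(scores, fraction):
    """Return the indices of the `fraction` lowest-scoring filters."""
    n_drop = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_drop])

def filter_norm_scores(w_z):
    """L1 norm of each row (filter) of the input-gate weight matrix."""
    return [sum(abs(w) for w in row) for row in w_z]

def mean_activation_scores(hidden_outputs):
    """Mean absolute value of each channel of h_t over all collected steps."""
    steps = len(hidden_outputs)
    channels = len(hidden_outputs[0])
    return [sum(abs(h[i]) for h in hidden_outputs) / steps for i in range(channels)]

# Toy layer with 4 filters; prune half of them under each criterion.
w_z = [[0.9, -0.8], [0.01, 0.02], [-0.5, 0.4], [0.0, -0.05]]
hidden = [[0.2, 0.0, -0.9, 0.1], [-0.4, 0.1, 0.7, 0.0]]

by_random = tied_random_indices(len(w_z), fraction=0.5)
by_norm = lowest_scoring_indices(filter_norm_scores(w_z), fraction=0.5)
by_activation = lowest_scoring_indices(mean_activation_scores(hidden), fraction=0.5)
```

In this toy case the norm and activation criteria happen to agree on filters 1 and 3; on real models they generally select different filters.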

3.1 With Single-Rank Update

At specific operating points (e.g., 40% and 80% FLOPs), pre-trained weight updates can be stored and applied at inference time to recover some perplexity. Suppose $\mathbf{W}$ is a weight matrix in a neural network, and $\mathbf{W}^*$ is some known set of weights that results in a lower loss. Clearly, $\Delta\mathbf{W} := \mathbf{W}^* - \mathbf{W}$ can be stored and added at inference time to obtain a better neural network. However, this scheme is wasteful, since $\mathbf{W}^*$ could have directly substituted $\mathbf{W}$ in the first place.

Sacrificing a negligible amount of storage to recover some perplexity, we propose learning a single-rank weight matrix update $\Delta\mathbf{W} := \mathbf{u}\mathbf{v}^\top$ to each weight in the convolution layers. Specifically, the process is as follows, beginning with a pre-trained model:

  1. Prune a pre-determined set of filters for some operating point (e.g., 40% FLOPs).

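  2. Initialize the weight updates $\Delta\mathbf{W}^{(l)} = \mathbf{u}^{(l)} \mathbf{v}^{(l)\top}$ for each convolution layer $l$, in our case for $\mathbf{W}^{(l)}_{\{z,f,o\}}$.

  3. Fixing the existing weights $\mathbf{W}^{(l)}$ for each convolution layer, train a single-rank update such that $\mathbf{W}^{(l)*} := \mathbf{W}^{(l)} + \Delta\mathbf{W}^{(l)}$, where $\mathbf{W}^{(l)*}$ is used as the new weight.

  4. Store $\Delta\mathbf{W}^{(l)}$ for use at inference time on the same operating point.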
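The storage argument behind the procedure above can be illustrated with a small sketch: a single-rank update for an $m \times s$ matrix is the outer product of a length-$m$ vector and a length-$s$ vector, so only $m + s$ numbers are stored instead of $m \times s$. The helper names and the example sizes are hypothetical, and the training of the two vectors (step 3) is deliberately left out.

```python
def apply_single_rank_update(weight, u, v):
    """Return weight + u v^T as a new matrix, leaving `weight` untouched."""
    return [
        [w + u_i * v_j for w, v_j in zip(row, v)]
        for row, u_i in zip(weight, u)
    ]

def update_storage(m, s):
    """Number of stored values for a single-rank update vs. a full update."""
    return {"single_rank": m + s, "full_matrix": m * s}

# A 2 x 2 weight matrix and an update that only touches its first row.
weight = [[1.0, 2.0], [3.0, 4.0]]
u = [1.0, 0.0]
v = [0.5, -0.5]
updated = apply_single_rank_update(weight, u, v)

# Hypothetical layer with 1550 output channels and 1550 inputs.
cost = update_storage(1550, 1550)
```

For the hypothetical 1550 by 1550 layer, the single-rank update needs 3,100 stored values where a full replacement matrix would need 2,402,500.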

4 Experimental Setup

We evaluate the aforementioned pruning techniques for word-level language modeling on Penn Treebank (PTB) (Marcus et al., 1993; as preprocessed by Mikolov et al., 2010) and WikiText-103 (WT103) (Merity et al., 2017). We denote the models for PTB and WT103 as ptb-qrnn and wt103-qrnn, respectively.

4.1 Datasets and Tasks

For each model, we report word-level perplexity and recall-at-three (R@3), defined as the percentage of top three token–logit outputs that contain the true next token. For example, if {“cat”, “dog”, “baby”} are the top three predicted tokens for “I adopted a ___,” with “dog” being the ground truth, then the prediction is correct, regardless of the rank of “dog”.
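A minimal sketch of the R@3 computation, assuming the model exposes one list of logits per prediction step and that targets are given as vocabulary indices (both assumptions of this example, as is the function name):

```python
def recall_at_k(logits_per_step, targets, k=3):
    """Percentage of steps whose true next token is among the k largest logits."""
    hits = 0
    for logits, target in zip(logits_per_step, targets):
        top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += target in top_k
    return 100.0 * hits / len(targets)

vocab = ["cat", "dog", "baby", "car"]
logits_per_step = [
    [2.0, 1.5, 1.0, -1.0],  # top three: cat, dog, baby
    [0.1, 0.2, 0.3, 3.0],   # top three: car, baby, dog
]
targets = [vocab.index("dog"), vocab.index("cat")]
score = recall_at_k(logits_per_step, targets)  # first step hits, second misses
```

The rank of the true token inside the top three does not matter, which is why the first step counts as correct even though “dog” is ranked second.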

Penn Treebank. Built from Wall Street Journal articles, Penn Treebank (PTB) is a small yet popular word-level dataset for language modeling. In the standard pre-processed version (Mikolov et al., 2010), the dataset contains roughly 887K, 70K, and 78K training, validation, and testing tokens, respectively. The number of unique tokens is capped at 10,000, yielding a relatively large 4.8% out-of-vocabulary (OOV) rate.

WikiText-103. Merity et al. (2017) introduce WikiText-2 and WikiText-103, datasets based on freely available Wikipedia articles. We use only WikiText-103, since WikiText-2 was designed to be similar to Penn Treebank. With 103 million training tokens, WikiText-103 is more than 100 times as large as PTB. WikiText-103 contains around 217K tokens for validation, and 245K for testing. The number of unique tokens is 267K, resulting in a 0.4% OOV rate, significantly lower than that of PTB.

4.2 Hyperparameters and Training

In all of the models, we chose the hyperparameters as suggested in Merity et al.’s codebase. For ptb-qrnn, we used a four-layer QRNN with 1550 hidden units for each layer and a 400-dimensional embedding. For wt103-qrnn, we used a four-layer QRNN with 2500 hidden units and 400-dimensional embeddings, along with a tied adaptive softmax (Merity et al., 2018b). In both models, the first layer uses a window size of two, while the rest use a window size of one.

Following Merity et al. (2018a), we also adopted the regularization techniques randomized backpropagation through time, embedding dropout, temporal activation regularization (TAR), activation regularization (AR), and variational dropout. We followed the same training process as well, with non-monotonically triggered ASGD (NT-ASGD) as the optimizer. We use the same hyperparameters as Merity et al. (2018a) and Merity et al. (2018b) for each model–dataset pair.

During the training of wt103-qrnn, we follow Merity et al. (2018b), using a tied adaptive softmax (Grave et al., 2017; Merity et al., 2018b) layer. At inference time, we use a regular softmax instead, since we require R@3.

Pruning. We selected a number of distinct operating points that represent discrete points in the accuracy–efficiency tradeoff space. Based on previous work (Tang et al., 2018), the number of floating-point operations (FLOPs) is a good proxy for both energy usage and latency, so we use FLOPs to select our operating points. In $L_0$ regularization, the decay strength $\lambda$ was selected so that the resulting model roughly matches the FLOPs targets: we used one value of $\lambda$ each to reach 80% and 60% FLOPs for the model on PTB, and another to reach about 70% FLOPs on WT103.
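To see how a filter-keep ratio translates into a FLOPs percentage, the sketch below counts multiplications and additions in the gate convolutions only. It is a rough estimate under several assumptions that are not from the paper: the layer sizes are hypothetical (loosely modeled on ptb-qrnn, with a final layer that outputs the embedding size and is left unpruned), and the embedding, pooling, and softmax costs are ignored.

```python
def qrnn_conv_flops(sizes, windows, keep):
    """Approximate per-token FLOPs of the gate convolutions.

    sizes:   [input_size, hidden_1, ..., hidden_L]
    windows: window size of each of the L layers
    keep:    fraction of filters kept in each of the L layers
    """
    flops = 0
    in_channels = sizes[0]
    for out_channels, window, fraction in zip(sizes[1:], windows, keep):
        m = round(out_channels * fraction)  # filters remaining in this layer
        s = in_channels * window            # flattened input width
        flops += 3 * 2 * m * s              # 3 gates, one multiply and one add per weight
        in_channels = m                     # pruned filters shrink the next layer's input
    return flops

sizes = [400, 1550, 1550, 1550, 400]  # hypothetical layer sizes
windows = [2, 1, 1, 1]
full = qrnn_conv_flops(sizes, windows, keep=[1.0, 1.0, 1.0, 1.0])
pruned = qrnn_conv_flops(sizes, windows, keep=[0.8, 0.8, 0.8, 1.0])
ratio = pruned / full
```

Because removing a filter shrinks both its own layer and the input of the next, the FLOPs ratio falls faster than the keep ratio, so a FLOPs target has to be translated into a somewhat higher fraction of retained filters.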

We trained the hard concrete mask parameters for roughly 5000 steps using Adam. Since the weight decay penalty is incompatible with the $L_0$ objective, we removed it while training the mask.

For mean activation pruning, which requires some training examples to collect statistics, we used the entire training set for ptb-qrnn. Since WikiText-103 is large, we used roughly the first 10% of the training examples for collecting statistics on wt103-qrnn.

Single-rank update (SRU). For the PTB model, the single-rank update was trained for 10 epochs using NT-ASGD (Merity et al., 2018a) with a non-monotonic interval of three. For WikiText-103, the update was trained for 2000 steps using Adam. All other hyperparameters were the same as those used during the training stage.

4.3 Infrastructure Details

We trained all of our models on a commodity machine with a Titan V GPU, i7-4790k CPU, and 16 GB of RAM. We used PyTorch 0.4.0 (commit 1807bac) for developing and running our models. We deployed our models on a Raspberry Pi (RPi) 3 Model B (ARM Cortex-A53) running Raspbian Stretch (4.9.41-v7+). Specifically, we copied the trained models over to the RPi, and ran them at the same operating points accordingly.

We plugged the RPi into a Watts Up Pro meter, a wattmeter that reports power usage at the rate of 1 Hz via a USB cable, which is connected back to the RPi. Evaluating on the test set, we collected power draw statistics on 350 next-word predictions, which were averaged to produce a millijoule per query (mJ/q) estimate. We obtained latency estimates in a similar manner by averaging the milliseconds per query (ms/q). Finally, we subtracted off the idle power usage of the RPi to obtain a better estimate of the actual power for each query.
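One plausible reading of this measurement procedure, written out as arithmetic: average the wattmeter samples taken while the queries run, subtract the idle power, and multiply by the time spent per query. The function name and all numbers below are hypothetical; the sketch is not the authors' measurement script.

```python
def per_query_cost(power_samples_w, idle_power_w, total_seconds, num_queries):
    """Return (latency in ms/q, energy in mJ/q) for a batch of queries.

    power_samples_w: wattmeter readings (watts) taken while the queries ran
    idle_power_w:    power draw (watts) of the idle device
    """
    active_power_w = sum(power_samples_w) / len(power_samples_w) - idle_power_w
    seconds_per_query = total_seconds / num_queries
    latency_ms = 1000.0 * seconds_per_query
    energy_mj = 1000.0 * active_power_w * seconds_per_query  # W * s = J
    return latency_ms, energy_mj

# Hypothetical run: 20 queries in 4 seconds, sampled once per second.
latency_ms, energy_mj = per_query_cost(
    power_samples_w=[3.2, 3.3, 3.1, 3.2],
    idle_power_w=1.9,
    total_seconds=4.0,
    num_queries=20,
)
```

With a 1 Hz meter, a run has to span many queries for the average to be meaningful, which is consistent with averaging over several hundred predictions.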

Although our final application is NLMs running on mobile devices such as smartphones and tablets, there are many challenges to directly evaluating on such hardware. The Raspberry Pi is a convenient stand-in since it uses exactly the same ARM processor architecture as nearly all mobile devices today. Evaluation on the RPi is widely adopted for research on efficient NNs today (Amato et al., 2017; Tang et al., 2018).

5 Results and Discussion

In our results for PTB and WT-103, we compare against previously published state-of-the-art results. In general, we find that QRNNs are strong competitors to LSTM approaches, and achieve state-of-the-art perplexity on WikiText-103 (Merity et al., 2018b).

                           Model Quality             Footprint              w/SRU
#   Method             Val.    Test    R@3       % FLOPs   ms/q    mJ/q     Test    R@3
1   Skip LSTM          60.9    58.3    -         -         -       -        -       -
2   AWD-LSTM           60.0    57.3    -         -         223     295      -       -
3   Orig.              59.0    56.8    44.7%     100%      224     296      -       -
4   $L_0$ reg.         63.0    60.7    43.6%     80%       185     227      59.3    44.1%
5   $L_0$ reg.         69.2    66.8    42.1%     60%       142     183      64.0    42.7%
6   Random             68.2    66.0    42.9%     80%       182     238      61.1    43.8%
7   Filter norm        76.1    72.7    42.4%     80%       182     238      66.1    43.1%
8   Mean activation    68.3    66.1    42.6%     80%       182     238      61.0    43.5%

Table 1: Select pruning results on Penn Treebank using a 4-layer QRNN, along with past results drawn from the original papers. Skip LSTM refers to the four-layer skip LSTM from Melis et al. (2018), and AWD-LSTM is from Merity et al. (2018a). The four-layer QRNN (Merity et al., 2018b) is the same model that we use, but we achieve better perplexity following the same methodology. The best results of each category are bolded. “w/SRU” denotes the results after applying an SRU.

For PTB, we note that a 20-point increase in perplexity may only correspond to a decrease of a few points in R@3, showing that perplexity changes on a much different scale than accuracy does (see Table 1, rows 3 and 7). Furthermore, lower perplexity does not necessarily imply higher accuracy (see rows 5 and 7), confirming that perplexity alone cannot completely determine the recall. In Table 1, we chose 75 as the cutoff point for perplexity; further results are illustrated in Figure 2. For WT-103, we observe trends similar to those of PTB: a large increase in perplexity corresponds to a much smaller decrease in R@3 (see Table 2, rows 3 and 4).
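The difference in scale can be checked directly from the Table 1 figures. The short calculation below uses the test perplexity, R@3, and mJ/q values of rows 3, 5, and 7; the variable names are only for this example.

```python
def relative_change(new, old):
    """Relative change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old

# Values copied from Table 1 (rows 3, 7, and 5).
orig = {"test_ppl": 56.8, "r3": 44.7, "mj": 296}
filter_norm_80 = {"test_ppl": 72.7, "r3": 42.4, "mj": 238}
l0_60 = {"test_ppl": 66.8, "r3": 42.1, "mj": 183}

# Row 3 vs. row 7: perplexity moves far more than recall.
ppl_rise = relative_change(filter_norm_80["test_ppl"], orig["test_ppl"])
r3_drop_points = orig["r3"] - filter_norm_80["r3"]

# Row 3 vs. row 5: energy saved against perplexity given up.
energy_saving = -relative_change(l0_60["mj"], orig["mj"])
l0_ppl_rise = relative_change(l0_60["test_ppl"], orig["test_ppl"])
```

For rows 3 and 7 this gives a relative rise of about 28% in test perplexity against a drop of 2.3 points in R@3. For rows 3 and 5, the energy per query falls by about 38% while test perplexity rises by about 18%.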

                           Model Quality             Footprint              w/SRU
#   Method             Val.    Test    R@3       % FLOPs   sec/q   J/q      Test    R@3
1   Rae-LSTM           36.0    36.4    -         -         -       -        -       -
2   4-layer QRNN       32.0    33.0    -         -         1.24    1.48     -       -
3   Orig.              31.9    32.8    51.5%     100%      1.24    1.48     -       -
4   $L_0$ reg.         65.8    65.4    43.1%     69%       0.912   1.06     56.9    44.7%
5   Mean activation    89.8    92.9    38.9%     70%       0.942   1.10     55.7    46.0%
6   Filter norm        85.9    88.2    41.7%     70%       0.942   1.10     59.2    45.4%
7   Random             80.9    81.4    42.9%     70%       0.942   1.10     54.2    46.1%

Table 2: Select pruning results on WikiText-103 using a 4-layer QRNN, along with past results, drawn directly from the original papers. Note that Rae et al. (2018) primarily explore Hebbian softmax; Rae-LSTM refers to their LSTM model without any extensions. Bolded are the best results for each category.

5.1 Accuracy–Efficiency Tradeoffs

We illustrate the accuracy–efficiency tradeoff space of the PTB and WT103 models in Figure 2. For each model, we tabulate the results at fixed intervals according to the approximated percentage of FLOPs, relative to that of the unpruned model. We omit results that exceed 100 in test perplexity, since they are insufficient for language modeling in practice.

Figure 2: Full experimental results on Penn Treebank and WikiText-103. We illustrate the perplexity–efficiency tradeoff space on the test set obtained before applying the single-rank update.

Surprisingly, random filter pruning is a strong baseline, which supports the findings from Mittal et al. (2018). Random pruning not only outperforms filter norm and mean activation pruning, but also regains perplexity more easily with a single-rank update. From Table 1 (rows 6–8) and Table 2 (rows 5–7), we see that random pruning displays equivalent or superior performance to filter norm and mean activation pruning. Interestingly, random pruning achieves the lowest perplexity with a single-rank update (Table 2, rows 4–7), out of all the baseline approaches on WT103.

On the other hand, filter norm pruning is relatively weak, doing worse than random pruning in all cases, with or without a single-rank update, which suggests that filter norm pruning has no practical benefit over random pruning. $L_0$ regularization (Louizos et al., 2018) works best, as shown in rows 4–5 in Table 1 and row 4 in Table 2.

In general, testing on Penn Treebank and WikiText-103, two very different datasets, gives us consistent results, thus demonstrating the robustness of $L_0$ regularization (Louizos et al., 2018) compared to the other pruning approaches.

Figure 3: Illustration depicting pruning on a truncated subset of the first layer’s weights from the PTB model, where each row corresponds to a different technique, and each column a different operating point. From left to right, the operating points are 100%, 80%, 70%, 60%, and 50% FLOPs. For each of the subfigures, we concatenate from top to bottom the first 25 filters of $\mathbf{W}_z$, $\mathbf{W}_f$, and $\mathbf{W}_o$, and from left to right the first 75 elements in each filter, yielding square visualizations. All the pruning techniques appear to be dropping weights differently; we note that, for $L_0$ regularization (row 4), the dropped weights remain largely constant throughout.

5.2 Power Usage and Latency

On the Raspberry Pi, the PTB models are relatively fast, while the WT103 models have high latency, taking over one second per query (Table 2, rows 2–3 and 8) for the full models. For type-ahead prediction on a mobile device, the WT103 models are unsuitable as-is; further steps (e.g., more pruning then re-training, vocabulary reduction, quantization) would be required to deploy the models for practical use. Supporting the findings from Tang et al. (2018), latency and power scale linearly with the number of FLOPs: the Pearson correlation coefficients computed over the full experimental results from Figure 2, for both latency–FLOPs and power–FLOPs measurements, suggest a strong linear relationship between the number of FLOPs and both latency and power.
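As an illustration of the kind of check involved, the sketch below computes Pearson's correlation between the FLOPs percentage and the latency and energy columns of Table 1 (rows 3–8). It uses only those six rows, not the full results behind Figure 2, so the resulting coefficients are not the ones reported in the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Table 1, rows 3-8: % FLOPs, ms/q, and mJ/q.
flops_pct = [100, 80, 60, 80, 80, 80]
latency_ms = [224, 185, 142, 182, 182, 182]
energy_mj = [296, 227, 183, 238, 238, 238]

r_latency = pearson_r(flops_pct, latency_ms)
r_energy = pearson_r(flops_pct, energy_mj)
```

Even on this small subset both coefficients come out close to one, in line with using FLOPs as a proxy for latency and energy.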

In terms of extra parameters, a single-rank update costs less than 74 KB for ptb-qrnn, and less than 120 KB for wt103-qrnn. Mean activation statistics require 20 KB for ptb-qrnn, and 30 KB for wt103-qrnn. Mask parameters for $L_0$ regularization cost about 20 KB at each power level for ptb-qrnn, and 30 KB for wt103-qrnn. Filter norm pruning and random pruning do not require any extra storage.

6 Conclusion

Motivated by the mass adoption of smart software keyboards on mobile devices, we explore the task of inference-time pruning on QRNNs, state-of-the-art neural language models. Starting with existing training-time pruning methods, we extend their usability to QRNNs at run-time, obtaining multiple operating points in the accuracy–efficiency tradeoff space. To recover some perplexity using a negligible amount of memory, we propose to train and store single-rank weight updates at desired operating points.


We are grateful for Meng Dong’s work on power measurements and debugging for the RPi experiments, and we thank the reviewers for their time and feedback.