Bayesian Compression for Natural Language Processing

10/25/2018 ∙ by Nadezhda Chirkova, et al. ∙ Higher School of Economics

In natural language processing, many tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, whose size grows proportionally to the vocabulary size. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN by a factor of dozens or hundreds without time-consuming hyperparameter tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.




1 Introduction

Recurrent neural networks (RNNs) are among the most powerful models for natural language processing, speech recognition and question-answering systems Chan et al. (2016); Ha et al. (2017); Wu et al. (2016); Ren et al. (2015). For complex tasks such as machine translation Wu et al. (2016), modern RNN architectures incorporate a huge number of parameters. To use these models on portable devices with limited memory, model compression is desired.

There are many RNN compression methods based on specific weight matrix representations Tjandra et al. (2017); Le et al. (2015) or sparsification Narang et al. (2017); Wen et al. (2018). In this paper we focus on RNN compression via sparsification. One way to sparsify an RNN is pruning, where the weights with a small absolute value are eliminated from the model. Such methods are heuristic and require time-consuming hyperparameter tuning. Another group of sparsification techniques is based on the Bayesian approach.

Molchanov et al. (2017) describe a model called SparseVD in which parameters controlling sparsity are tuned automatically during neural network training. However, this technique was not previously investigated for RNNs. In this paper, we apply SparseVD to RNNs, taking into account the specifics of the recurrent network structure (Section 3.2). More precisely, we use the insight about using the same sample of weights for all timesteps in the sequence Gal and Ghahramani (2016); Fortunato et al. (2017). This modification makes the local reparametrization trick Kingma et al. (2015); Molchanov et al. (2017) not applicable and changes the SparseVD training procedure.

In natural language processing tasks, the majority of weights in RNNs are often concentrated in the first layer, which is connected to the vocabulary, for example in the embedding layer. However, for some tasks most of the words are unnecessary for accurate predictions. In our model we introduce multiplicative weights for the words to perform vocabulary sparsification (Section 3.3). These multiplicative weights are zeroed out during training, which filters the corresponding unnecessary words out of the model. This boosts the RNN sparsification level even further.

To sum up, our contributions are as follows: (i) we adapt SparseVD to RNNs explaining the specifics of the resulting model and (ii) we generalize this model by introducing multiplicative weights for words to purposefully sparsify the vocabulary. Our results show that Sparse Variational Dropout leads to a very high level of sparsity in recurrent models without a significant quality drop. Models with additional vocabulary sparsification boost compression rate on text classification tasks but do not help that much on language modeling tasks. In classification tasks the vocabulary is compressed dozens of times, and the choice of words is interpretable.

2 Related work

Reducing RNN size is an important and rapidly developing area of research. There are three research directions: approximation of weight matrices Tjandra et al. (2017); Le et al. (2015), reducing the precision of the weights Hubara et al. (2016) and sparsification of the weight matrices Narang et al. (2017); Wen et al. (2018). We focus on the last one. The most popular approach here is pruning: weights of the RNN whose absolute value falls below some threshold are cut off. Narang et al. (2017) choose the threshold using several hyperparameters that control the frequency, the rate and the duration of weight elimination. Wen et al. (2018) propose to prune the weights in LSTM by groups corresponding to each neuron, which allows accelerating the forward pass through the network.

Another group of sparsification methods relies on Bayesian neural networks Molchanov et al. (2017); Neklyudov et al. (2017); Louizos et al. (2017). In Bayesian NNs the weights are treated as random variables, and our preference for sparse weights is expressed in a prior distribution over them. During training, the prior distribution is transformed into the posterior distribution over the weights, which is used to make predictions at the testing phase.

Neklyudov et al. (2017) and Louizos et al. (2017) also introduce group Bayesian sparsification techniques that allow eliminating neurons from the model.

The main advantage of the Bayesian sparsification techniques is that they have a small number of hyperparameters compared to pruning-based methods. Also, they lead to a higher sparsity level Molchanov et al. (2017); Neklyudov et al. (2017); Louizos et al. (2017).

There are several works on Bayesian recurrent neural networks Gal and Ghahramani (2016); Fortunato et al. (2017), but these methods are hard to extend to achieve sparsification. We apply sparse variational dropout to RNNs taking into account their recurrent specifics, including some insights highlighted by Gal and Ghahramani (2016) and Fortunato et al. (2017).

3 Proposed method

3.1 Notations

In the rest of the paper $x = [x_0, \dots, x_T]$ is an input sequence, $y$ is a true output and $\hat{y}$ is an output predicted by the RNN ($y$ and $\hat{y}$ may be single vectors, sequences, etc.), and $X, Y$ denotes a training set $\{(x^1, y^1), \dots, (x^N, y^N)\}$. All weights of the RNN except biases are denoted by $\omega$, while a single weight (an element of any weight matrix) is denoted by $w_{ij}$. Note that we detach biases and denote them by $B$ because we do not sparsify them.

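For definiteness, we will illustrate our model on an example architecture for the language modeling task, where $y = [x_1, \dots, x_T]$:

$$\text{embedding: } \tilde{x}_t = w^e_{x_t}; \qquad \text{recurrent: } h_{t+1} = \sigma(W^h h_t + W^x \tilde{x}_{t+1} + b^r); \qquad \text{fully-connected: } \hat{y}_t = \mathrm{softmax}(W^d h_t + b^d).$$

In this example $\omega = \{W^e, W^x, W^h, W^d\}$, $B = \{b^r, b^d\}$. However, the model may be directly applied to any recurrent architecture.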
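To make the layer structure concrete, the following is a minimal NumPy sketch of such a language model (embedding, one recurrent layer, softmax output). It assumes a plain tanh recurrence rather than an LSTM, and the names `rnn_lm_forward`, `W_e`, `W_x`, `W_h`, `W_d`, `b_r`, `b_d` as well as the toy sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # stabilise the exponent
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def rnn_lm_forward(tokens, W_e, W_x, W_h, W_d, b_r, b_d):
    """Return one next-token distribution per input token id."""
    h = np.zeros(W_h.shape[0])
    probs = []
    for t in tokens:
        x_emb = W_e[t]                             # embedding lookup (rows = words here)
        h = np.tanh(W_h @ h + W_x @ x_emb + b_r)   # recurrent update (tanh is an assumption)
        probs.append(softmax(W_d @ h + b_d))       # distribution over the vocabulary
    return np.stack(probs)

rng = np.random.default_rng(0)
V, E, H = 10, 4, 6  # toy vocabulary, embedding and hidden sizes
params = dict(
    W_e=rng.normal(size=(V, E)), W_x=rng.normal(size=(H, E)),
    W_h=rng.normal(size=(H, H)), W_d=rng.normal(size=(V, H)),
    b_r=np.zeros(H), b_d=np.zeros(V),
)
probs = rnn_lm_forward([1, 5, 2, 7], **params)  # shape (4, V)
```

In this sketch only the four weight matrices would be sparsified; the two bias vectors are kept dense, matching the split between weights and biases described above.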

3.2 Sparse variational dropout for RNNs

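Following Kingma et al. (2015) and Molchanov et al. (2017), we put a fully-factorized log-uniform prior over the weights:

$$p(\omega) = \prod_{w_{ij} \in \omega} p(w_{ij}), \qquad p(|w_{ij}|) \propto \frac{1}{|w_{ij}|},$$

and approximate the posterior with a fully factorized normal distribution:

$$q(\omega \mid \theta, \sigma) = \prod_{w_{ij} \in \omega} \mathcal{N}(w_{ij} \mid \theta_{ij}, \sigma_{ij}^2).$$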

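The task of posterior approximation is equivalent to variational lower bound optimization Molchanov et al. (2017):

$$-\sum_{i=1}^{N} \int q(\omega \mid \theta, \sigma) \log p(y^i \mid x^i, \omega, B)\, d\omega + \sum_{w_{ij} \in \omega} KL\big(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})\big) \to \min_{\theta, \sigma, B}. \tag{1}$$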


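Here the first term, a task-specific loss, is approximated with one sample from $q(\omega \mid \theta, \sigma)$. The second term is a regularizer that moves the posterior closer to the prior and induces sparsity. This regularizer can be very closely approximated analytically Molchanov et al. (2017):

$$KL\big(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})\big) \approx k\!\left(\frac{\sigma_{ij}^2}{\theta_{ij}^2}\right), \qquad k(\alpha) = 0.5 \log(1 + \alpha^{-1}) - k_1\, \mathrm{sigm}(k_2 + k_3 \log \alpha) + k_1, \tag{2}$$

where $\mathrm{sigm}$ is the logistic sigmoid and $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$.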
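As an illustration, the analytic approximation of Molchanov et al. (2017) can be evaluated elementwise as in the sketch below. The parametrisation through `theta` and `log_sigma`, the function name `kl_approx` and the small constant added inside the logarithm are assumptions made here for the example.

```python
import numpy as np

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants from Molchanov et al. (2017)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_approx(theta, log_sigma):
    """Elementwise approximate KL between N(theta, sigma^2) and the log-uniform prior."""
    # log(alpha) with alpha = sigma^2 / theta^2; 1e-8 guards against log(0)
    log_alpha = 2.0 * log_sigma - np.log(theta ** 2 + 1e-8)
    # logaddexp(0, -log_alpha) = log(1 + 1/alpha), computed stably
    return 0.5 * np.logaddexp(0.0, -log_alpha) - K1 * sigmoid(K2 + K3 * log_alpha) + K1

theta = np.array([1.0, 1.0])
log_sigma = np.array([-3.0, 10.0])  # a low-noise weight and a very noisy weight
kl = kl_approx(theta, log_sigma)    # large penalty for the first, near zero for the second
```

The penalty vanishes as the noise dominates the mean, which is why minimising it pushes uninformative weights towards a low signal-to-noise ratio.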


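To make the estimate of the integral unbiased, sampling from the posterior is performed with the reparametrization trick Kingma and Welling (2014):

$$w_{ij} = \theta_{ij} + \sigma_{ij} \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}(0, 1). \tag{3}$$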


An important difference between RNNs and feed-forward networks is that the same weight variable is shared between different timesteps. Thus, we should use the same sample of weights for each timestep while computing the likelihood Gal and Ghahramani (2016); Fortunato et al. (2017).

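Kingma et al. (2015) and Molchanov et al. (2017) also use the local reparametrization trick (LRT), which samples preactivations instead of individual weights. For example,

$$(W^x x_t)_i = \sum_j \theta^x_{ij} x_{tj} + \epsilon_i \sqrt{\sum_j (\sigma^x_{ij})^2 x_{tj}^2}, \qquad \epsilon_i \sim \mathcal{N}(0, 1).$$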

Tied weight sampling makes LRT not applicable to weight matrices that are used in more than one timestep in the RNN.

For the hidden-to-hidden matrix $W^h$ the linear combination $W^h h_t$ is not normally distributed because $h_t$ depends on $W^h$ from the previous timestep. As a result, the rule about the sum of independent normal distributions with constant coefficients is not applicable. In practice, a network with LRT on hidden-to-hidden weights cannot be trained properly.

For the input-to-hidden matrix $W^x$ the linear combination $W^x x_t$ is normally distributed. However, sampling the same $W^x$ for all timesteps and sampling the same noise on preactivations for all timesteps are not equivalent. The same sample of $W^x$ corresponds to different samples of preactivation noise at different timesteps because of the different $x_t$. Hence, theoretically, LRT is not applicable here. In practice, networks with LRT on input-to-hidden weights may give the same results, and in some experiments they even converge slightly faster.
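For a layer that is applied only once per object, where LRT is valid, the trick can be sketched as follows: the preactivation is sampled directly from the normal distribution induced by a factorized normal posterior over the weights. The function name `lrt_preactivation` and the array layout (`theta` and `log_sigma` of shape hidden × input) are assumptions of this example.

```python
import numpy as np

def lrt_preactivation(x, theta, log_sigma, rng):
    """Sample preactivations for inputs x of shape (batch, D) and weights of shape (H, D)."""
    mean = x @ theta.T                              # sum_j theta_ij * x_j
    var = (x ** 2) @ np.exp(2.0 * log_sigma).T      # sum_j sigma_ij^2 * x_j^2
    eps = rng.standard_normal(mean.shape)           # independent noise per object and unit
    return mean + np.sqrt(var) * eps

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 5))
log_sigma = np.full((3, 5), -1.0)
x = np.tile(rng.normal(size=(1, 5)), (20000, 1))    # the same input repeated many times
pre = lrt_preactivation(x, theta, log_sigma, rng)   # empirical moments match mean and var
```

Note that the noise here is drawn independently for every row of `x`; reusing a layer across timesteps with one shared weight sample would correlate these draws, which is the situation discussed above.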

Since the training procedure is efficient only with a 2D noise tensor, we propose to sample the noise on the weights per mini-batch, not per individual object.

To sum up, the training procedure is as follows. To perform the forward pass for a mini-batch, we first sample all weights following (3) and then apply the RNN as usual. Then the gradients of (1) are computed w.r.t. $\theta$, $\log \sigma$ and $B$.
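The forward pass of this procedure can be sketched in NumPy as below: each weight matrix is sampled once per mini-batch, so the noise tensor has the same 2D shape as the matrix, and the sample is reused at every timestep and for every object. The tanh recurrence (instead of an LSTM), the function names and the `params` dictionary layout are assumptions of this example; gradient computation is left to an autodiff framework and is not shown.

```python
import numpy as np

def sample_weights(theta, log_sigma, rng):
    """Reparametrized sample w = theta + sigma * eps with eps ~ N(0, 1)."""
    return theta + np.exp(log_sigma) * rng.standard_normal(theta.shape)

def rnn_forward_minibatch(X, params, rng):
    """X: (batch, time, D); params maps a name to a (theta, log_sigma) pair."""
    W_x = sample_weights(*params["W_x"], rng)   # (H, D), one sample for the whole mini-batch
    W_h = sample_weights(*params["W_h"], rng)   # (H, H), shared across all timesteps
    batch, steps, _ = X.shape
    h = np.zeros((batch, W_h.shape[0]))
    for t in range(steps):
        h = np.tanh(h @ W_h.T + X[:, t] @ W_x.T)
    return h

rng = np.random.default_rng(0)
B, T, D, H = 4, 6, 3, 5  # toy sizes
params = {
    "W_x": (rng.normal(size=(H, D)), np.full((H, D), -3.0)),
    "W_h": (rng.normal(size=(H, H)), np.full((H, H), -3.0)),
}
X = rng.normal(size=(B, T, D))
h_last = rnn_forward_minibatch(X, params, rng)  # shape (B, H)
```

Sampling per object instead would require a noise tensor of shape (batch, H, D) for `W_x`, which is what the per-mini-batch scheme avoids.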

During the testing stage, we use the mean weights $\theta$ Molchanov et al. (2017). Regularizer (2) causes the majority of the $\theta$ components to approach 0, and the weights are sparsified. More precisely, we eliminate weights with a low signal-to-noise ratio, $\theta_{ij}^2 / \sigma_{ij}^2 < \tau$ Molchanov et al. (2017).
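A minimal sketch of this test-time pruning step is given below. It assumes the signal-to-noise ratio is computed as the squared mean over the variance; the function name `prune_by_snr` and the threshold value used in the example are hypothetical.

```python
import numpy as np

def prune_by_snr(theta, log_sigma, threshold):
    """Zero out mean weights whose signal-to-noise ratio is below the threshold."""
    snr = theta ** 2 / np.exp(2.0 * log_sigma)   # theta^2 / sigma^2
    mask = snr >= threshold
    return theta * mask, mask

theta = np.array([0.5, 0.01, -2.0])
log_sigma = np.array([-3.0, -1.0, 0.0])
# threshold=0.05 is an illustrative value, not a setting confirmed by the text
pruned, mask = prune_by_snr(theta, log_sigma, threshold=0.05)
```

Only the second weight is removed here: its mean is small relative to its noise, while the other two weights keep their mean values.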

3.3 Multiplicative weights for vocabulary sparsification

One of the advantages of Bayesian sparsification is that it generalizes easily to the sparsification of any groups of weights without complicating the training procedure Louizos et al. (2017). To do so, one should introduce a shared multiplicative weight for each group; eliminating this multiplicative weight then eliminates the corresponding group. In our work we utilize this approach to achieve vocabulary sparsification.

Precisely, we introduce multiplicative probabilistic weights $z \in \mathbb{R}^V$ for words in the vocabulary (here $V$ is the size of the vocabulary). The forward pass with $z$ looks as follows:

  1. sample a vector $z^i$ from the current approximation of the posterior for each input sequence $x^i$ from the mini-batch;

  2. multiply each one-hot encoded token $x^i_t$ from the sequence $x^i$ by $z^i$ (here both $x^i_t$ and $z^i$ are $V$-dimensional);

  3. continue the forward pass as usual.

We work with $z$ in the same way as with other weights $W$: we use a log-uniform prior and approximate the posterior with a fully-factorized normal distribution with trainable mean and variance. However, since $z$ is a one-dimensional vector, we can sample it individually for each object in a mini-batch to reduce the variance of the gradients. After training, we prune elements of $z$ with a low signal-to-noise ratio; subsequently, we do not use the corresponding words from the vocabulary and drop the corresponding columns of weights from the embedding or input-to-hidden weight matrices.
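The three steps above can be sketched as follows. Multiplying a one-hot token by the gate vector and then applying the embedding is equivalent to scaling the embedding of that token by its gate value, which is what the code does. The names `embed_with_vocab_gates` and `kept_words`, the row-per-word layout of `W_e` and the threshold value are assumptions of this example.

```python
import numpy as np

def embed_with_vocab_gates(tokens, W_e, theta_z, log_sigma_z, rng):
    """tokens: (batch, time) word ids; W_e: (V, E). One gate sample per sequence."""
    batch = tokens.shape[0]
    eps = rng.standard_normal((batch, theta_z.shape[0]))
    z = theta_z + np.exp(log_sigma_z) * eps          # (batch, V), sampled per object
    gates = np.take_along_axis(z, tokens, axis=1)    # (batch, time): gate of each token
    return W_e[tokens] * gates[..., None]            # (batch, time, E)

def kept_words(theta_z, log_sigma_z, threshold):
    """Ids of words whose gate has a signal-to-noise ratio of at least the threshold."""
    snr = theta_z ** 2 / np.exp(2.0 * log_sigma_z)
    return np.flatnonzero(snr >= threshold)

rng = np.random.default_rng(0)
V, E = 6, 3
W_e = rng.normal(size=(V, E))
tokens = np.array([[0, 2, 2], [5, 1, 0]])
theta_z = np.array([1.0, 0.0, 1.0, 1.0, 1.0, 0.5])
log_sigma_z = np.full(V, -20.0)                      # almost deterministic gates
emb = embed_with_vocab_gates(tokens, W_e, theta_z, log_sigma_z, rng)
kept = kept_words(theta_z, log_sigma_z, threshold=0.05)  # hypothetical threshold
```

In this toy setting word 1 has a zero gate, so its embedding contributes nothing and it is the only word removed from the vocabulary.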

4 Experiments

We perform experiments with the LSTM architecture on two types of problems: text classification and language modeling. Three models are compared here: a baseline model without any regularization, the SparseVD model and the SparseVD model with multiplicative weights for vocabulary sparsification (SparseVD-Voc).

To measure the sparsity level of our models we calculate the compression rate of individual weights as follows: $|\omega| / |\omega \neq 0|$, the total number of weights divided by the number of non-zero weights. The sparsification of weights may lead not only to compression but also to acceleration of RNNs through group sparsity. Hence, we report the number of remaining neurons in all layers: input (vocabulary), embedding and recurrent. To compute this number for the vocabulary layer in SparseVD-Voc we use the introduced variables $z_v$. For all other layers in SparseVD and SparseVD-Voc, we drop a neuron if all weights connected to this neuron are eliminated.
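Both quantities are simple to compute from the pruned weight matrices. The sketch below assumes that a neuron's connected weights are one row of its incoming matrix and one column of its outgoing matrix; the function names and the toy matrices are illustrative.

```python
import numpy as np

def compression_rate(weights):
    """Total number of weights divided by the number of non-zero weights."""
    total = sum(w.size for w in weights)
    nonzero = sum(int(np.count_nonzero(w)) for w in weights)
    return total / nonzero if nonzero else float("inf")

def remaining_neurons(W_in, W_out):
    """Count neurons with at least one non-zero connected weight.

    W_in: (n, fan_in), row j holds the incoming weights of neuron j.
    W_out: (fan_out, n), column j holds the outgoing weights of neuron j.
    """
    alive = np.any(W_in != 0, axis=1) | np.any(W_out != 0, axis=0)
    return int(alive.sum())

W1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])  # 3 neurons, 2 inputs
W2 = np.array([[0.0, 0.0, 2.0]])                      # 1 output, 3 neurons
rate = compression_rate([W1, W2])     # 9 weights, 2 non-zero
neurons = remaining_neurons(W1, W2)   # neuron 0 has no non-zero weights
```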

We optimize our networks using Adam Kingma and Ba (2015). Baseline networks overfit on all our tasks; therefore, we present results for them with early stopping. For all weights that we sparsify, we initialize $\log \sigma$ with -3. We eliminate weights with a signal-to-noise ratio less than a threshold $\tau$. More details about the experimental setup are presented in Appendix A.

| Task | Method | Accuracy % | Compression | Vocabulary | Neurons |
|---|---|---|---|---|---|
| IMDb | Original | 84.1 | 1x | 20000 | |
| IMDb | SparseVD | 85.1 | 1135x | 4611 | |
| IMDb | SparseVD-Voc | 83.6 | 12985x | 292 | |
| AGNews | Original | 90.6 | 1x | 20000 | |
| AGNews | SparseVD | 88.8 | 322x | 5727 | |
| AGNews | SparseVD-Voc | 89.2 | 469x | 2444 | |

Table 1: Results on text classification tasks. Compression is equal to $|\omega| / |\omega \neq 0|$. The last two columns report the number of remaining neurons in the input, embedding and recurrent layers.
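| Task | Method | Valid | Test | Compression | Vocabulary | Neurons |
|---|---|---|---|---|---|---|
| Char PTB (bits-per-char) | Original | 1.498 | 1.454 | 1x | 50 | 1000 |
| Char PTB (bits-per-char) | SparseVD | 1.472 | 1.429 | 7.6x | 50 | 431 |
| Char PTB (bits-per-char) | SparseVD-Voc | 1.4584 | 1.4165 | 5.8x | 48 | 510 |
| Word PTB (perplexity) | Original | 135.6 | 129.5 | 1x | 10000 | 256 |
| Word PTB (perplexity) | SparseVD | 115.0 | 109.0 | 14.0x | 9985 | 153 |
| Word PTB (perplexity) | SparseVD-Voc | 126.3 | 120.6 | 11.1x | 4353 | 207 |

Table 2: Results on language modeling tasks. Compression is equal to $|\omega| / |\omega \neq 0|$. The last two columns report the number of remaining neurons in the input and recurrent layers.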

4.1 Text Classification

We evaluated our approach on two standard datasets for text classification: IMDb dataset Maas et al. (2011) for binary classification and AGNews dataset Zhang et al. (2015) for four-class classification. We set aside 15% and 5% of training data for validation purposes respectively. For both datasets, we use the vocabulary of 20,000 most frequent words.

We use networks with one embedding layer of 300 units, one LSTM layer of 128 / 512 hidden units for IMDb / AGNews, and finally, a fully connected layer applied to the last output of the LSTM. The embedding layer is initialized with word2vec Mikolov et al. (2013) / GloVe Pennington et al. (2014), and the SparseVD and SparseVD-Voc models are trained for 800 / 150 epochs on IMDb / AGNews.

The results are shown in Table 1. SparseVD leads to a very high compression rate without a significant quality drop. SparseVD-Voc boosts the compression rate even further while still preserving the accuracy. Such high compression rates are achieved mostly because of the sparsification of the vocabulary: to classify texts we need to read only some important words from them. The remaining words in our models are mostly interpretable for the task (see Appendix B for the list of remaining words for IMDb). Figure 1 shows the only kept embedding component for remaining words on IMDb. This component reflects the sentiment score of the words.

Figure 1: IMDB: remaining embedding component vs sentiment score ((#pos. - #neg.) / #all texts with the word).
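The sentiment score from the caption of Figure 1 can be computed per word as in the sketch below, assuming binary labels (1 for positive, 0 for negative) and texts represented as sets of tokens; the function name and the toy corpus are illustrative and not the IMDb data.

```python
def sentiment_score(word, texts, labels):
    """(#positive - #negative) / #all texts containing the word."""
    pos = sum(1 for toks, y in zip(texts, labels) if word in toks and y == 1)
    neg = sum(1 for toks, y in zip(texts, labels) if word in toks and y == 0)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

texts = [{"great", "film"}, {"awful", "film"}, {"great", "acting"}, {"boring"}]
labels = [1, 0, 1, 0]
scores = {w: sentiment_score(w, texts, labels) for w in ("great", "film", "awful")}
```

The score lies in [-1, 1]: words occurring only in positive texts get 1, words occurring only in negative texts get -1, and words spread evenly across both classes get 0.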

4.2 Language Modeling

We evaluate our models on the task of character-level and word-level language modeling on the Penn Treebank corpus Marcus et al. (1993) according to the train/valid/test partition of Mikolov et al. (2011). The dataset has a vocabulary of 50 characters or 10,000 words.

To solve character / word-level tasks we use networks with one LSTM layer of 1000 / 256 hidden units and fully-connected layer with softmax activation to predict next character or word. We train SparseVD and SparseVD-Voc models for 250 / 150 epochs on character-level / word-level tasks.

The results are shown in Table 2. To obtain these results we employ LRT on the last fully-connected layer. In our language modeling experiments, LRT on the last layer accelerates training without harming the final result. Here we do not get such extreme compression rates as in the previous experiment, but we are still able to compress the models several times while achieving better quality than the baseline because of the regularization effect of SparseVD. The vocabulary is not sparsified in the character-level task because there are only 50 characters and all of them matter. In the word-level task more than half of the words are dropped. However, since in language modeling almost all words are important, the sparsification of the vocabulary makes the task more difficult for the network and leads to a drop in quality and in the overall compression (the network needs more complex dynamics in the recurrent layer).


Results on SparseVD for RNNs shown in Section 3.2 have been supported by Russian Science Foundation (grant 17-71-20072). Results on multiplicative weights for vocabulary sparsification shown in Section 3.3 have been supported by Samsung Research, Samsung Electronics.


Appendix A Experimental setup

Initialization for text classification. Hidden-to-hidden weight matrices are initialized orthogonally and all other matrices are initialized uniformly using the method of Glorot and Bengio (2010).

We train our networks using batches of size 128 and a learning rate of 0.0005.

Initialization for language modeling. All weight matrices of the networks are initialized orthogonally and all biases are initialized with zeros. Initial values of hidden and cell elements are not trainable and equal to zero.

For the character-level task, we train our networks on non-overlapping sequences of 100 characters in mini-batches of 64 using a learning rate of 0.002 and clip gradients with threshold 1.

For the word-level task, networks are unrolled for 35 steps. We use the final hidden states of the current mini-batch as the initial hidden state of the subsequent mini-batch (successive mini batches sequentially traverse the training set). The size of each mini-batch is 32. We train models using a learning rate of 0.002 and clip gradients with threshold 10.
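One common way to arrange the data so that the final hidden state of one mini-batch is a valid initial state for the next is to split the corpus into as many contiguous streams as there are rows in a mini-batch and to cut every stream into consecutive windows. The sketch below illustrates this layout; the function name `batchify` and the toy sizes are assumptions, and the exact data pipeline used in the experiments is not specified in the text.

```python
import numpy as np

def batchify(token_ids, batch_size, unroll):
    """Yield (inputs, targets) windows; row r of successive windows continues stream r."""
    n = (len(token_ids) - 1) // batch_size            # usable length of each stream
    ids = np.asarray(token_ids)
    data = ids[: n * batch_size].reshape(batch_size, n)
    targets = ids[1 : n * batch_size + 1].reshape(batch_size, n)  # next-token targets
    for start in range(0, n, unroll):
        yield data[:, start:start + unroll], targets[:, start:start + unroll]

# Toy corpus of 21 token ids, 2 parallel streams, windows of 5 steps
batches = list(batchify(list(range(21)), batch_size=2, unroll=5))
```

With the settings from the text, `batch_size` would be 32 and `unroll` would be 35; the hidden state returned for row r after one window is then passed in as the initial state of row r for the following window.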

Appendix B A list of remaining words on IMDB

SparseVD with multiplicative weights retained the following words on the IMDB task (sorted by descending frequency in the whole corpus):

start, oov, and, to, is, br, in, it, this, was, film, t, you, not, have, It, just, good, very, would, story, if, only, see, even, no, were, my, much, well, bad, will, great, first, most, make, also, could, too, any, then, seen, plot, acting, life, over, off, did, love, best, better, i, If, still, man, something, m, re, thing, years, old, makes, director, nothing, seems, pretty, enough, own, original, world, series, young, us, right, always, isn, least, interesting, bit, both, script, minutes, making, 2, performance, might, far, anything, guy, She, am, away, woman, fun, played, worst, trying, looks, especially, book, DVD, reason, money, actor, shows, job, 1, someone, true, wife, beautiful, left, idea, half, excellent, 3, nice, fan, let, rest, poor, low, try, classic, production, boring, wrong, enjoy, mean, No, instead, awful, stupid, remember, wonderful, often, become, terrible, others, dialogue, perfect, liked, supposed, entertaining, waste, His, problem, Then, worse, definitely, 4, seemed, lives, example, care, loved, Why, tries, guess, genre, history, enjoyed, heart, amazing, starts, town, favorite, car, today, decent, brilliant, horrible, slow, kill, attempt, lack, interest, strong, chance, wouldn, sometimes, except, looked, crap, highly, wonder, annoying, Oh, simple, reality, gore, ridiculous, hilarious, talking, female, episodes, body, saying, running, save, disappointed, 7, 8, OK, word, thriller, Jack, silly, cheap, Oscar, predictable, enjoyable, moving, Unfortunately, surprised, release, effort, 9, none, dull, bunch, comments, realistic, fantastic, weak, atmosphere, apparently, premise, greatest, believable, lame, poorly, NOT, superb, badly, mess, perfectly, unique, joke, fails, masterpiece, sorry, nudity, flat, Good, dumb, Great, D, wasted, unless, bored, Tony, language, incredible, pointless, avoid, trash, failed, fake, Very, Stewart, awesome, garbage, pathetic, genius, glad, neither, laughable, beautifully, excuse, disappointing, disappointment, 
outstanding, stunning, noir, lacks, gem, F, redeeming, thin, absurd, Jesus, blame, rubbish, unfunny, Avoid, irritating, dreadful, skip, racist, Highly, MST3K

An interesting observation is that low and high ratings of films are kept while the middle values (5, 6) of the ratings are dropped.