Pyramidal Recurrent Units (PRUs): A New LSTM Unit
LSTMs are powerful tools for modeling contextual information, as evidenced by their success at the task of language modeling. However, modeling contexts in very high dimensional space can lead to poor generalizability. We introduce the Pyramidal Recurrent Unit (PRU), which enables learning representations in high dimensional space with more generalization power and fewer parameters. PRUs replace the linear transformation in LSTMs with more sophisticated interactions including pyramidal and grouped linear transformations. This architecture gives strong results on word-level language modeling while reducing the number of parameters significantly. In particular, PRU improves the perplexity of a recent state-of-the-art language model Merity et al. (2018) by up to 1.3 points while learning 15-20% fewer parameters. For a similar number of parameters, PRU outperforms all previous RNN models that exploit different gating mechanisms and transformations. We provide a detailed examination of the PRU and its behavior on the language modeling tasks. Our code is open-source and available at https://sacmehta.github.io/PRU/
Long short-term memory (LSTM) units Hochreiter and Schmidhuber (1997) are popular for many sequence modeling tasks and are used extensively in language modeling. A key to their success is their articulated gating structure, which allows for more control over the information passed along the recurrence. However, despite the sophistication of the gating mechanisms employed in LSTMs and similar recurrent units, the input and context vectors are treated with simple linear transformations prior to gating. Non-linear transformations such as convolutions Kim et al. (2016) have been used, but these have not achieved the performance of well regularized LSTMs for language modeling Melis et al. (2018).
A natural way to improve the expressiveness of linear transformations is to increase the number of dimensions of the input and context vectors, but this comes with a significant increase in the number of parameters, which may limit generalizability. An example is shown in Figure 1, where LSTM performance decreases as the dimensions of the input and context vectors increase. Moreover, the semantics of the input and context vectors are different, suggesting that each may benefit from specialized treatment.
Guided by these insights, we introduce a new recurrent unit, the Pyramidal Recurrent Unit (PRU), which is based on the LSTM gating structure. Figure 2 provides an overview of the PRU. At the heart of the PRU is the pyramidal transformation (PT), which uses subsampling to effect multiple views of the input vector. The subsampled representations are combined in a pyramidal fusion structure, resulting in richer interactions between the individual dimensions of the input vector than is possible with a linear transformation. Context vectors, which have already undergone this transformation in the previous cell, are modified with a grouped linear transformation (GLT) which allows the network to learn latent representations in high dimensional space with fewer parameters and better generalizability (see Figure 1).
We show that PRUs can better model contextual information and demonstrate performance gains on the task of language modeling. The PRU improves the perplexity of the current state-of-the-art language model Merity et al. (2018) by up to 1.3 points, reaching perplexities of 56.56 and 64.53 on the Penn Treebank (PTB) and WikiText2 (WT-2) datasets, respectively, while learning 15-20% fewer parameters. Replacing an LSTM with a PRU results in improvements in perplexity across a variety of experimental settings. We provide detailed ablations which motivate the design of the PRU architecture, as well as detailed analysis of the effect of the PRU on other components of the language model.
Multiple methods, including a variety of gating structures and transformations, have been proposed to improve the performance of recurrent neural networks (RNNs). We first describe these approaches and then provide an overview of recent work in language modeling.
The performance of RNNs has been greatly improved by gating mechanisms such as LSTMs Hochreiter and Schmidhuber (1997), GRUs Chung et al. (2014), peep-hole connections Gers and Schmidhuber (2000), SRUs Lei et al. (2018), and RANs Lee et al. (2017). In this paper, we extend the gating architecture of LSTMs Hochreiter and Schmidhuber (1997), a widely used recurrent unit across different domains.
Apart from the widely used linear transformation for modeling the temporal data, another transformation that has gained popularity is convolution LeCun et al. (1995). Convolution-based methods have gained attention in computer vision tasks Krizhevsky et al. (2012) as well as some of the natural language processing tasks including machine translation Gehring et al. (2017). Convolution-based methods for language modeling, such as CharCNN Kim et al. (2016), have not yet achieved the performance of well regularized LSTMs Melis et al. (2018). We inherit ideas from convolution-based approaches, such as sub-sampling, to learn richer representations Krizhevsky et al. (2012); Han et al. (2017).
Recently, there has been an effort to improve the efficiency of RNNs. These approaches include quantization Xu et al. (2018), skimming Seo et al. (2018); Yu et al. (2017), skipping Campos et al. (2018), and query reduction Seo et al. (2017). Because these approaches extend standard RNNs, they are complementary to our work.
Language modeling is a fundamental task for NLP and has garnered significant attention in recent years (see Table 1 for comparison with state-of-the-art methods). Merity et al. (2018) introduce regularization techniques such as weight dropping which, coupled with a non-monotonically triggered ASGD optimization, achieve strong performance improvements. Yang et al. (2018) extend Merity et al. (2018) with the mixture of softmaxes (MoS) technique, which increases the rank of the matrix used to compute next-token probabilities. Further, Merity et al. (2017) and Krause et al. (2018) propose methods to improve inference by adapting models to recent sequence history. Our work is complementary to these recent softmax layer and inference procedure improvements.
We introduce Pyramidal Recurrent Units (PRUs), a new RNN architecture which improves modeling of context by allowing for higher dimensional vector representations while learning fewer parameters. Figure 2 provides an overview of PRU. We first elaborate on the details of the pyramidal transformation and the grouped linear transformation. We then describe our recurrent unit, PRU.
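The basic transformation in many recurrent units is a linear transformation $\mathcal{F}_L$ defined as:
$$\mathbf{y} = \mathcal{F}_L(\mathbf{x}) = \mathbf{W} \cdot \mathbf{x}, \tag{1}$$
where $\mathbf{W} \in \mathbb{R}^{N \times M}$ are learned weights that linearly map $\mathbf{x} \in \mathbb{R}^{N}$ to $\mathbf{y} \in \mathbb{R}^{M}$. To simplify notation, we omit the biases.
Motivated by successful applications of sub-sampling in computer vision (e.g., Burt and Adelson (1987); Lowe (1999); Krizhevsky et al. (2012); Mehta et al. (2018)), we subsample the input vector $\mathbf{x}$ into $K$ pyramidal levels to achieve a representation of the input vector at multiple scales. This sub-sampling operation produces $K$ vectors, represented as $\mathbf{x}^{k} \in \mathbb{R}^{N / 2^{k-1}}$, where $2^{k-1}$ is the sampling rate and $k = \{1, \cdots, K\}$. We learn scale-specific transformations $\mathbf{W}^{k} \in \mathbb{R}^{\frac{N}{2^{k-1}} \times \frac{M}{K}}$ for each $k = \{1, \cdots, K\}$. The transformed subsamples are concatenated to produce the pyramidal analog to $\mathbf{y}$, here denoted as $\bar{\mathbf{y}} \in \mathbb{R}^{M}$:
$$\bar{\mathbf{y}} = \mathcal{F}_P(\mathbf{x}) = \left[\mathbf{W}^{1} \cdot \mathbf{x}^{1}, \cdots, \mathbf{W}^{K} \cdot \mathbf{x}^{K}\right], \tag{2}$$
where $[\cdot, \cdot]$ indicates concatenation. We note that the pyramidal transformation with $K = 1$ is the same as the linear transformation.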
To improve gradient flow inside the recurrent unit, we combine the input and output using an element-wise sum (when the dimensions match) to produce a residual analog of the pyramidal transformation He et al. (2016), as shown in Figure 2.
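The pyramidal transformation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`avg_pool_subsample`, `init_pyramid`, `pyramidal_transform`), the zero padding at the vector edges, the size-3 stride-2 averaging window, and the uniform weight initialization are all assumptions made for the example. Each level halves the vector length and maps it to an equal share of the output dimensions, and the per-level outputs are concatenated.

```python
import random


def avg_pool_subsample(x):
    """Halve the length of x with a size-3, stride-2 average.

    Zero padding at both ends is an assumption made for this sketch.
    """
    padded = [0.0] + list(x) + [0.0]
    return [(padded[j] + padded[j + 1] + padded[j + 2]) / 3.0
            for j in range(0, len(x), 2)]


def matvec(W, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def init_pyramid(n_in, n_out, levels, rng):
    """Create one weight matrix per level.

    Level k sees an input of roughly n_in / 2**(k-1) elements and
    produces n_out / levels outputs.
    """
    assert n_out % levels == 0
    weights, size = [], n_in
    for _ in range(levels):
        weights.append([[rng.uniform(-0.1, 0.1) for _ in range(size)]
                        for _ in range(n_out // levels)])
        size = (size + 1) // 2  # length after one sub-sampling step
    return weights


def pyramidal_transform(x, weights, residual=False):
    """Concatenate the per-level linear maps of the sub-sampled input."""
    outputs, level_input = [], list(x)
    for k, W in enumerate(weights):
        if k > 0:
            level_input = avg_pool_subsample(level_input)
        outputs.extend(matvec(W, level_input))
    # Residual sum is only applied when input and output sizes match.
    if residual and len(outputs) == len(x):
        outputs = [o + v for o, v in zip(outputs, x)]
    return outputs


rng = random.Random(0)
W = init_pyramid(n_in=8, n_out=8, levels=2, rng=rng)
y = pyramidal_transform([1.0] * 8, W, residual=True)
print(len(y))
```

With a single level the function reduces to an ordinary matrix-vector product, which mirrors the remark that one pyramidal level recovers the linear transformation.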
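The number of parameters learned by the linear transformation and the pyramidal transformation with $K$ pyramidal levels to map $\mathbf{x} \in \mathbb{R}^{N}$ to $\bar{\mathbf{y}} \in \mathbb{R}^{M}$ are $NM$ and $\frac{NM}{K} \sum_{k=1}^{K} 2^{1-k}$, respectively. Thus, the pyramidal transformation reduces the parameter count of a linear transformation to a fraction $\frac{1}{K} \sum_{k=1}^{K} 2^{1-k}$ of its original size. For example, the pyramidal transformation (with $K = 4$ and $N = M = 600$) learns 53% fewer parameters than the linear transformation.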
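The parameter comparison can be checked with a few lines of arithmetic. The sketch below assumes that level k of the pyramid maps N / 2^(k-1) inputs to M / K outputs and that biases are ignored; the function names and the sizes used in the call are illustrative only.

```python
def linear_params(n, m):
    """Weights in a dense map from n inputs to m outputs (biases omitted)."""
    return n * m


def pyramidal_params(n, m, levels):
    """Weights in a pyramid whose level k maps n / 2**(k-1) inputs
    to m / levels outputs (assumed shapes, biases omitted)."""
    return sum((n // 2 ** (k - 1)) * (m // levels)
               for k in range(1, levels + 1))


n = m = 600  # illustrative sizes
dense = linear_params(n, m)
pyramid = pyramidal_params(n, m, levels=4)
print(dense, pyramid, pyramid / dense)  # 360000 168750 0.46875
```

Under these assumptions a four-level pyramid keeps about 47% of the dense weights, so it learns roughly 53% fewer parameters, and the ratio does not depend on the particular values of N and M as long as they divide evenly.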
Many RNN architectures apply linear transformations to both the input and context vector. However, this may not be ideal due to the differing semantics of each vector. In many NLP applications including language modeling, the input vector is a dense word embedding which is shared across all contexts for a given word in a dataset. In contrast, the context vector is highly contextualized by the current sequence. The differences between the input and context vector motivate their separate treatment in the PRU architecture.
The weights learned using the linear transformation (Eq. 1) are reused over multiple time steps, which makes them prone to over-fitting Gal and Ghahramani (2016). To combat over-fitting, various methods, such as variational dropout Gal and Ghahramani (2016) and weight dropout Merity et al. (2018), have been proposed to regularize these recurrent connections. To further improve generalization while simultaneously enabling the recurrent unit to learn representations in very high dimensional space, we propose to use a grouped linear transformation (GLT) Kuchaiev and Ginsburg (2017) instead of the standard linear transformation for recurrent connections. While pyramidal and linear transformations can also be applied to transform context vectors, our experimental results in Section 4.4 suggest that GLTs are more effective.
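The linear transformation $\mathcal{F}_L : \mathbb{R}^{N} \to \mathbb{R}^{M}$ maps $\mathbf{h} \in \mathbb{R}^{N}$ linearly to $\mathbf{z} \in \mathbb{R}^{M}$. Grouped linear transformations break the linear interactions by factoring the linear transformation into two steps. First, a GLT splits the input vector $\mathbf{h} \in \mathbb{R}^{N}$ into $g$ smaller groups such that $\mathbf{h} = \{\mathbf{h}^{1}, \cdots, \mathbf{h}^{g}\}$, with each $\mathbf{h}^{i} \in \mathbb{R}^{N/g}$. Second, a linear transformation $\mathcal{F}_L : \mathbb{R}^{N/g} \to \mathbb{R}^{M/g}$ is applied to map $\mathbf{h}^{i}$ linearly to $\mathbf{z}^{i} \in \mathbb{R}^{M/g}$, for each $i = \{1, \cdots, g\}$. The $g$ resultant output vectors $\mathbf{z}^{i}$ are concatenated to produce the final output vector $\mathbf{z} \in \mathbb{R}^{M}$.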
GLTs learn representations at low dimensionality. Therefore, a GLT requires $g$ times fewer parameters than the linear transformation. We note that GLTs are a subset of linear transformations. In a linear transformation, each neuron receives an input from each element in the input vector, while in a GLT, each neuron receives an input from a subset of the input vector. Therefore, a GLT is the same as a linear transformation when $g = 1$.
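The two steps of a grouped linear transformation (split, then transform each group independently) can be sketched as follows. The function names and the toy weights are assumptions for illustration; biases are omitted, as in the text.

```python
def grouped_linear(h, group_weights):
    """Apply one small linear map per group and concatenate the results.

    group_weights holds one matrix (list of rows) per group; each matrix
    only sees its own slice of h.
    """
    groups = len(group_weights)
    assert len(h) % groups == 0
    size = len(h) // groups
    out = []
    for i, W in enumerate(group_weights):
        chunk = h[i * size:(i + 1) * size]
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in W)
    return out


def glt_params(n, m, groups):
    """Weights in a GLT: `groups` blocks of size (n / groups) x (m / groups)."""
    return groups * (n // groups) * (m // groups)


# Two groups: the first block is the identity, the second doubles its slice.
weights = [[[1.0, 0.0], [0.0, 1.0]],
           [[2.0, 0.0], [0.0, 2.0]]]
print(grouped_linear([1.0, 2.0, 3.0, 4.0], weights))  # [1.0, 2.0, 6.0, 8.0]
print(glt_params(600, 600, 4), 600 * 600 // 4)        # 90000 90000
```

Because each block is smaller by the number of groups in both dimensions, the total weight count drops by that same factor, and a single group gives back the dense linear map.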
We extend the basic gating architecture of LSTM with the pyramidal and grouped linear transformations outlined above to produce the Pyramidal Recurrent Unit (PRU), whose improved sequence modeling capacity is evidenced in Section 4.
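At time $t$, the PRU combines the input vector $\mathbf{x}_t$ and the previous context vector (or previous hidden state vector) $\mathbf{h}_{t-1}$ using the following transformation function:
$$\mathcal{G}_v(\mathbf{x}_t, \mathbf{h}_{t-1}) = \mathcal{F}_P(\mathbf{x}_t) + \mathcal{F}_G(\mathbf{h}_{t-1}),$$
where $v \in \{f, i, c, o\}$ indexes the gates of the LSTM, and $\mathcal{F}_P$ and $\mathcal{F}_G$ denote the pyramidal (with its residual connection) and grouped linear transformations described above.
We now incorporate $\mathcal{G}_v(\cdot, \cdot)$ into the LSTM gating architecture to produce the PRU. At time $t$, a PRU cell takes $\mathbf{x}_t \in \mathbb{R}^{N}$, $\mathbf{h}_{t-1} \in \mathbb{R}^{M}$, and $\mathbf{c}_{t-1} \in \mathbb{R}^{M}$ as inputs to produce forget $\mathbf{f}_t$, input $\mathbf{i}_t$, output $\mathbf{o}_t$, and content $\hat{\mathbf{c}}_t$ gate signals. The inputs are combined with these gate signals to produce the context vector $\mathbf{h}_t \in \mathbb{R}^{M}$ and the cell state $\mathbf{c}_t \in \mathbb{R}^{M}$. Mathematically, the PRU with the LSTM gating architecture can be defined as:
$$\begin{aligned}
\mathbf{f}_t &= \sigma\left(\mathcal{G}_f(\mathbf{x}_t, \mathbf{h}_{t-1})\right) \\
\mathbf{i}_t &= \sigma\left(\mathcal{G}_i(\mathbf{x}_t, \mathbf{h}_{t-1})\right) \\
\hat{\mathbf{c}}_t &= \tanh\left(\mathcal{G}_c(\mathbf{x}_t, \mathbf{h}_{t-1})\right) \\
\mathbf{o}_t &= \sigma\left(\mathcal{G}_o(\mathbf{x}_t, \mathbf{h}_{t-1})\right) \\
\mathbf{c}_t &= \mathbf{f}_t \otimes \mathbf{c}_{t-1} + \mathbf{i}_t \otimes \hat{\mathbf{c}}_t \\
\mathbf{h}_t &= \mathbf{o}_t \otimes \tanh(\mathbf{c}_t)
\end{aligned}$$
where $\otimes$ represents the element-wise multiplication operation, and $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions. We note that the LSTM is a special case of the PRU when $g = K = 1$.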
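A single PRU step, combining a pyramidal transformation of the input with a grouped linear transformation of the previous context and feeding the sum through standard LSTM gating, might look like the sketch below. It is a simplified illustration under several assumptions: biases and the residual connection are omitted, sub-sampling is a zero-padded size-3 stride-2 average, weights are drawn uniformly at random, and all function and key names (`init_pru`, `pru_step`, `"pt"`, `"glt"`) are invented for the example.

```python
import math
import random


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def subsample(x):
    """Size-3, stride-2 average pooling with zero padding (assumed choice)."""
    p = [0.0] + list(x) + [0.0]
    return [(p[j] + p[j + 1] + p[j + 2]) / 3.0 for j in range(0, len(x), 2)]


def pyramidal(x, level_weights):
    """Concatenate per-level linear maps of progressively sub-sampled x."""
    out, cur = [], list(x)
    for k, W in enumerate(level_weights):
        if k > 0:
            cur = subsample(cur)
        out.extend(matvec(W, cur))
    return out


def grouped(h, group_weights):
    """Apply an independent linear map to each equal-sized slice of h."""
    size = len(h) // len(group_weights)
    out = []
    for i, W in enumerate(group_weights):
        out.extend(matvec(W, h[i * size:(i + 1) * size]))
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_pru(n, m, levels, groups, seed=0):
    """One pyramid (input) and one GLT (context) per gate f, i, c, o."""
    rng = random.Random(seed)
    sizes = [n]
    for _ in range(levels - 1):
        sizes.append((sizes[-1] + 1) // 2)
    return {gate: {"pt": [rand_matrix(m // levels, s, rng) for s in sizes],
                   "glt": [rand_matrix(m // groups, m // groups, rng)
                           for _ in range(groups)]}
            for gate in "fico"}


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pru_step(x, h_prev, c_prev, params):
    """One time step: gate pre-activations, then the usual LSTM update."""
    pre = {gate: [a + b for a, b in zip(pyramidal(x, p["pt"]),
                                        grouped(h_prev, p["glt"]))]
           for gate, p in params.items()}
    f = [sigmoid(v) for v in pre["f"]]
    i = [sigmoid(v) for v in pre["i"]]
    c_hat = [math.tanh(v) for v in pre["c"]]
    o = [sigmoid(v) for v in pre["o"]]
    c = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


params = init_pru(n=8, m=8, levels=2, groups=2)
h, c = pru_step([1.0] * 8, [0.0] * 8, [0.0] * 8, params)
print(len(h), len(c))  # 8 8
```

Setting `levels=1` and `groups=1` in this sketch leaves two dense maps per gate, which is the ordinary LSTM cell without biases.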
To showcase the effectiveness of the PRU, we evaluate the performance on two standard datasets for word-level language modeling and compare with state-of-the-art methods. Additionally, we provide a detailed examination of the PRU and its behavior on the language modeling tasks.
We extend the language model AWD-LSTM Merity et al. (2018) by replacing LSTM layers with PRUs. Our model uses 3 layers of PRU with an embedding size of 400. The number of parameters learned by state-of-the-art methods varies from 18M to 66M, with the majority of the methods learning about 22M to 24M parameters on the PTB dataset. For a fair comparison with state-of-the-art methods, we fix the model size to 19M and vary the value of $g$ and the hidden layer sizes so that the total number of learned parameters is similar across different configurations. We use 1000, 1200, and 1400 as hidden layer sizes for $g$ = 1, 2, and 4, respectively. We use the same settings for the WT-2 dataset. We set the number of pyramidal levels $K$ to two in our experiments and use average pooling for sub-sampling. These values are selected based on our ablation experiments on the validation set (Section 4.4). We measure the performance of our models in terms of word-level perplexity. We follow the same training strategy as in Merity et al. (2018).
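For reference, word-level perplexity is the exponential of the average negative log-likelihood per token. The short sketch below shows the computation on made-up log-probabilities; the function name and the sample values are illustrative and unrelated to the reported results.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_guess = [math.log(0.25)] * 10
print(round(perplexity(uniform_guess), 6))  # 4.0
```

Lower values mean the model spreads its probability over fewer plausible next words on average.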
To understand the effect of regularization methods on the performance of PRUs, we perform experiments under two different settings: (1) Standard dropout: We use standard dropout Srivastava et al. (2014) with a probability of 0.5 after the embedding layer, on the output between LSTM layers, and on the output of the final LSTM layer. (2) Advanced dropout: We use the same dropout techniques with the same dropout values as in Merity et al. (2018). We call this model AWD-PRU.
Table 1 compares the performance of the PRU with state-of-the-art methods. We can see that the PRU achieves the best performance with fewer parameters.
| Model | WT-2 # Params | WT-2 Val | WT-2 Test | PTB # Params | PTB Val | PTB Test |
|---|---|---|---|---|---|---|
| Variational LSTM Gal and Ghahramani (2016) | – | – | – | 20 M | – | 78.6 |
| CharCNN Kim et al. (2016) | – | – | – | 19 M | – | 78.9 |
| Pointer Sentinel-LSTM Merity et al. (2017) | – | – | – | 19 M | 72.4 | 70.9 |
| RHN Zilly et al. (2016) | – | – | – | 23 M | 67.9 | 65.4 |
| NAS Cell Zoph and Le (2017) | – | – | – | 25 M | – | 64.0 |
| Variational LSTM - Inan et al. (2017) | 28 M | 91.5 | 87 | 24 M | 75.7 | 73.2 |
| SRU - 6 layers Lei et al. (2018) | – | – | – | 24 M | 63.4 | 60.3 |
| QRNN Bradbury et al. (2017) | – | – | – | 18 M | 82.1 | 78.3 |
| RAN Lee et al. (2017) | – | – | – | 22 M | – | 78.5 |
| 4-layer skip-connection LSTM Melis et al. (2018) | – | – | – | 24 M | 60.9 | 58.3 |
| AWD-LSTM - Merity et al. (2018) | 33 M | 69.1 | 66 | 24 M | 60.7 | 58.8 |
| AWD-LSTM - Merity et al. (2018)-finetuned | 33 M | 68.6 | 65.8 | 24 M | 60 | 57.3 |
| Variational LSTM Gal and Ghahramani (2016) | – | – | – | 66 M | – | 73.4 |
| NAS Cell Zoph and Le (2017) | – | – | – | 54 M | – | 62.4 |
| Quantized LSTM - Full precision Xu et al. (2018) | – | – | 100.1 | – | – | 89.8 |
| Quantized LSTM - 2 bit Xu et al. (2018) | – | – | 106.1 | – | – | 95.8 |
| With standard dropout | | | | | | |
| LSTM ($M = 1000$) | 29 M | 78.93 | 75.08 | 20 M | 68.57 | 66.29 |
| LSTM ($M = 1200$) | 35 M | 77.93 | 74.48 | 26 M | 69.17 | 67.16 |
| LSTM ($M = 1400$) | 42 M | 77.55 | 74.44 | 33 M | 70.88 | 68.55 |
| Ours - PRU ($g = 1$, $K = 2$, $M = 1000$) | 28 M | 79.15 | 76.59 | 19 M | 69.8 | 67.78 |
| Ours - PRU ($g = 2$, $K = 2$, $M = 1200$) | 28 M | 76.62 | 73.79 | 19 M | 67.17 | 64.92 |
| Ours - PRU ($g = 4$, $K = 2$, $M = 1400$) | 28 M | 75.46 | 72.77 | 19 M | 64.76 | 62.42 |
| With advanced dropouts | | | | | | |
| Ours - AWD-PRU ($g = 1$, $K = 2$, $M = 1000$) | 28 M | 71.84 | 68.6 | 19 M | 61.72 | 59.54 |
| Ours - AWD-PRU ($g = 2$, $K = 2$, $M = 1200$) | 28 M | 68.57 | 65.7 | 19 M | 60.81 | 58.65 |
| Ours - AWD-PRU ($g = 4$, $K = 2$, $M = 1400$) | 28 M | 68.17 | 65.3 | 19 M | 60.62 | 58.33 |
| Ours - AWD-PRU ($g = 4$, $K = 2$, $M = 1400$)-finetuned | 28 M | 67.19 | 64.53 | 19 M | 58.46 | 56.56 |
PRUs achieve either the same or better performance than LSTMs. In particular, the performance of PRUs improves as the value of $g$ increases. At $g = 4$, PRUs outperform LSTMs by about 4 points on the PTB dataset and by about 3 points on the WT-2 dataset. This is explained in part by the regularization effect of the grouped linear transformation (Figure 1). With grouped linear and pyramidal transformations, PRUs learn rich representations in very high dimensional space while learning fewer parameters. On the other hand, LSTMs overfit to the training data at such high dimensions and learn more parameters than PRUs.
With the advanced dropouts, the performance of PRUs improves by about 4 points on the PTB dataset and 7 points on the WT-2 dataset. This further improves with finetuning on the PTB (about 2 points) and WT-2 (about 1 point) datasets.
For a similar number of parameters, the PRU with standard dropout outperforms most of the state-of-the-art methods by a large margin on the PTB dataset (e.g. RAN Lee et al. (2017) by 16 points with 4M fewer parameters, QRNN Bradbury et al. (2017) by 16 points with 1M more parameters, and NAS Zoph and Le (2017) by 1.58 points with 6M fewer parameters). With advanced dropouts, the PRU delivers the best performance. On both datasets, the PRU improves the perplexity by about 1 point while learning 15-20% fewer parameters.
Because the PRU is a drop-in replacement for the LSTM, it can improve language models that use modern inference techniques such as dynamic evaluation Krause et al. (2018). When we evaluate PRU-based language models (only with standard dropout) with dynamic evaluation on the PTB test set, the perplexity of the PRU ($g = 4$, $K = 2$, $M = 1400$) improves from 62.42 to 55.23, while the perplexity of an LSTM ($M = 1000$) with similar settings improves from 66.29 to 58.79. This suggests that modern inference techniques are equally applicable to PRU-based language models.
It is shown above that the PRU can learn representations at higher dimensionality with more generalization power, resulting in performance gains for language modeling. A closer analysis of the impact of the PRU in a language modeling system reveals several factors that help explain how the PRU achieves these gains.
As exemplified in Table 2(a), the PRU tends toward more confident decisions, placing more of the probability mass on the top next-word prediction than the LSTM. To quantify this effect, we calculate the entropy of the next-token distribution for both the PRU and the LSTM using 3687 contexts from the PTB validation set. Figure 3 shows a histogram of the entropies of the distribution, where bins of size 0.23 are used to effect categories. We see that the PRU more often produces lower entropy distributions, corresponding to higher confidence in next-token choices. This is evidenced by the mass of the red PRU curve lying in the lower entropy ranges compared to the blue LSTM curve. The PRU can produce confident decisions in part because more information is encoded in the higher dimensional context vectors.
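The entropy analysis can be reproduced in outline as follows. This sketch assumes entropies are measured in nats (the text does not state the unit) and uses made-up distributions; only the bin size of 0.23 comes from the description above, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def entropy_histogram(distributions, bin_size=0.23):
    """Count how many next-token distributions fall into each entropy bin."""
    return Counter(int(entropy(p) // bin_size) for p in distributions)


# Toy next-token distributions, from fully confident to uniform over 4 words.
toy = [[1.0, 0.0, 0.0, 0.0],
       [0.5, 0.5, 0.0, 0.0],
       [0.25, 0.25, 0.25, 0.25]]
print(sorted(entropy_histogram(toy).items()))  # [(0, 1), (3, 1), (6, 1)]
```

A model whose histogram mass sits in the low-numbered bins concentrates its probability on fewer candidate words, which is the sense of "more confident" used in this analysis.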
The PRU has the ability to model individual words at different resolutions through the pyramidal transform, which provides multiple paths for the gradient to the embedding layer (similar to multi-task learning) and improves the flow of information. When considering the embeddings by part of speech, we find that the pyramid level 1 embeddings exhibit higher variance than the LSTM across all POS categories (Figure 4), and that pyramid level 2 embeddings show extremely low variance (POS categories are computed using the NLTK toolkit). We hypothesize that the LSTM must encode both coarse group similarities and individual word differences into the same vector space, reducing the space between individual words of the same category. The PRU can rely on the subsampled embeddings to account for coarse-grained group similarities, allowing for finer individual word distinctions in the embedding layer. This hypothesis is strengthened by the entropy results described above: a model which can make finer distinctions between individual words can more confidently assign probability mass. A model that cannot make these distinctions, such as the LSTM, must spread its probability mass across a larger class of similar words.
Saliency analysis using gradients helps identify relevant words in a test sequence that contribute to the prediction Gevrey et al. (2003); Li et al. (2016); Arras et al. (2017). These approaches compute the relevance as the squared norm of the gradients obtained through back-propagation. Table 2(a) visualizes the heatmaps for different sequences. PRUs, in general, give more relevance to contextual words than LSTMs, such as southeast (sample 1), cost (sample 2), face (sample 4), and introduced (sample 5), which help in making more confident decisions. Furthermore, when gradients during back-propagation are visualized Selvaraju et al. (2017) (Table 2(b)), we find that PRUs have better gradient coverage than LSTMs, suggesting that PRUs use more features than LSTMs that contribute to the decision. This also suggests that PRUs update more parameters at each iteration, which results in faster training. The language model in Merity et al. (2018) takes 500 and 750 epochs to converge with the PRU and the LSTM as the recurrent unit, respectively.
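The relevance score used in this kind of saliency analysis, the squared norm of the gradient of a prediction score with respect to each input word embedding, can be illustrated without an autodiff library. The sketch below substitutes central finite differences for back-propagation, which is a deliberate simplification; the scoring function, its weights, and the function names are hypothetical.

```python
def gradient_saliency(score_fn, embeddings, eps=1e-5):
    """Squared gradient norm per token, estimated by central differences.

    score_fn maps a list of embedding vectors to a scalar score.
    """
    saliencies = []
    for t in range(len(embeddings)):
        squared_norm = 0.0
        for d in range(len(embeddings[t])):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            grad = (score_fn(plus) - score_fn(minus)) / (2.0 * eps)
            squared_norm += grad * grad
        saliencies.append(squared_norm)
    return saliencies


def toy_score(embeddings):
    """Hypothetical linear scorer: the second token is weighted 3x the first."""
    weights = [1.0, 3.0]
    return sum(w * sum(e) for w, e in zip(weights, embeddings))


tokens = [[0.2, -0.1], [0.4, 0.3]]
print([round(s, 3) for s in gradient_saliency(toy_score, tokens)])  # [2.0, 18.0]
```

In the toy scorer each embedding dimension of a token has gradient equal to that token's weight, so the second token receives nine times the relevance of the first. In practice the gradients would come from back-propagation through the trained language model.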
In this section, we provide a systematic analysis of our design choices. Our training methodology is the same as described in Section 4.1 with the standard dropouts. For a thorough understanding of our design choices, we use a language model with a single layer of PRU and fix the size of embedding and hidden layers to 600. The word-level perplexities are reported on the validation sets of the PTB and the WT-2 datasets.
The two hyper-parameters that control the trade-off between performance and number of parameters in PRUs are the number of pyramidal levels $K$ and the number of groups $g$. Figure 5 shows the trade-off between perplexity and recurrent unit (RU) parameters, where the total number of parameters is the number of embedding parameters plus the number of RU parameters.
Variable $K$ and fixed $g$: When we increase the number of pyramidal levels $K$ at a fixed value of $g$, the performance of the PRU drops by about 1 to 4 points while reducing the total number of recurrent unit parameters by up to 15%. We note that the PRU with $K = 2$ at $g = 1$ delivers similar performance to the LSTM while learning about 15% fewer recurrent unit parameters.
Fixed $K$ and variable $g$: When we vary the value of $g$ at a fixed number of pyramidal levels $K$, the total number of recurrent unit parameters decreases significantly with a minimal impact on the perplexity. For example, PRUs with $K = 2$ and $g = 4$ learn 77% fewer recurrent unit parameters while their perplexity (lower is better) increases by about 12% in comparison to LSTMs. Moreover, the decrease in the number of parameters at higher values of $g$ enables PRUs to learn representations in high dimensional space with better generalizability (Table 1).
Table 3 shows the impact of different transformations of the input vector $\mathbf{x}_t$ and the context vector $\mathbf{h}_{t-1}$. We make the following observations: (1) Using the pyramidal transformation for the input vectors improves the perplexity by about 1 point on both the PTB and WT-2 datasets while reducing the number of recurrent unit parameters by about 14% (see R1 and R4). We note that the performance of the PRU drops by up to 1 point when residual connections are not used (R4 and R6). (2) Using the grouped linear transformation for context vectors reduces the total number of recurrent unit parameters by about 75% while the performance drops by about 11% (see R3 and R4). When we use the pyramidal transformation instead of the linear transformation, the performance drops by up to 2% while there is no significant drop in the number of parameters (R4 and R5).
|Transformations||PPL||# Params||PPL||# Params|
We set the sub-sampling kernel $\kappa$ (Eq. 3) with stride $s = 2$ and size of 3 (a kernel of $2e + 1$ elements with $e = 1$) in four different ways: (1) Skip: We skip every other element in the input vector. (2) Convolution: We initialize the elements of $\kappa$ randomly from a normal distribution and learn them while training the model. We limit the output values to between -1 and 1 using the $\tanh$ activation function to make training stable. (3) Avg. pool: We initialize the elements of $\kappa$ to $1/3$. (4) Max pool: We select the maximum value in the kernel window $\kappa$.
|Dataset||Skip||Max pool||Avg. Pool||Convolution|
Table 4 compares the performance of the PRU with different sampling methods. Average pooling performs the best, while skipping gives comparable performance. Both of these methods enable the network to learn richer word representations while representing the input vector in different forms, thus delivering higher performance. Surprisingly, a convolution-based sub-sampling method does not perform as well as the averaging method. The $\tanh$ function used after convolution limits the range of output values, which are further limited by the LSTM gating structure, thereby impeding the flow of information inside the cell. Max pooling forces the network to learn representations from high magnitude elements, so distinguishing features between elements vanish, resulting in poor performance.
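The four sub-sampling choices compared above can be written as small functions over a sliding window. This is a sketch under stated assumptions: a window of size 3 with stride 2, zero padding for averaging and convolution, negative-infinity padding for max pooling so that padding never wins the maximum, and a fixed rather than learned convolution kernel. None of these details or function names are taken from the authors' code.

```python
import math


def windows(x, size=3, stride=2, pad_value=0.0):
    """Sliding windows centred on every `stride`-th element of x."""
    pad = size // 2
    padded = [pad_value] * pad + list(x) + [pad_value] * pad
    return [padded[j:j + size] for j in range(0, len(x), stride)]


def skip(x):
    """Keep every other element."""
    return list(x[::2])


def avg_pool(x):
    """Equivalent to a kernel whose three elements are all 1/3."""
    return [sum(w) / len(w) for w in windows(x)]


def max_pool(x):
    """Largest value in each window; padding can never be selected."""
    return [max(w) for w in windows(x, pad_value=float("-inf"))]


def conv_tanh(x, kernel):
    """Weighted sum per window, squashed into (-1, 1) by tanh."""
    return [math.tanh(sum(k * v for k, v in zip(kernel, w)))
            for w in windows(x, size=len(kernel))]


x = [1.0, 2.0, 3.0, 4.0]
print(skip(x))      # [1.0, 3.0]
print(avg_pool(x))  # [1.0, 3.0]
print(max_pool(x))  # [2.0, 4.0]
print(conv_tanh(x, [0.5, 0.0, 0.5]))
```

The convolution variant shows why its outputs are bounded: whatever the kernel produces, the tanh maps it into the open interval from -1 to 1 before the gates see it.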
We introduce the Pyramidal Recurrent Unit, which better models contextual information by admitting higher dimensional representations with good generalizability. When applied to the task of language modeling, PRUs improve perplexity across several settings, including recent state-of-the-art systems. Our analysis shows that the PRU improves the flow of gradient and expands the word embedding subspace, resulting in more confident decisions. Here we have shown improvements for language modeling. In the future, we plan to study the performance of PRUs on different tasks, including machine translation and question answering. In addition, we will study the performance of the PRU on language modeling with more recent inference techniques, such as dynamic evaluation and mixture of softmaxes.
This research was supported by NSF (IIS 1616112, III 1703166), Allen Distinguished Investigator Award, and gifts from Allen Institute for AI, Google, Amazon, and Bloomberg. We are grateful to Aaron Jaech, Hannah Rashkin, Mandar Joshi, Aniruddha Kembhavi, and anonymous reviewers for their helpful comments.
Arras et al. (2017). Explaining recurrent neural network predictions in sentiment analysis. In 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
International Conference on Machine Learning (ICML).
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Inan et al. (2017). Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations (ICLR).
Association for the Advancement of Artificial Intelligence (AAAI).
Zoph and Le (2017). Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR).