When exploring a new problem, a simple yet competitive off-the-shelf baseline is fundamental to new research. For instance, random_forest showed random forests to be a strong baseline for many high-dimensional supervised learning tasks. For computer vision, off-the-shelf convolutional neural networks (CNNs) have earned their reputation as a strong baseline [CNNs] and as a basic building block for more complex models such as visual question answering [Xiong2016]. For natural language processing (NLP) and other sequential modeling tasks, recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, with a linear projection layer at the end have begun to attain a similar status. However, the standard LSTM is in many ways lacking as a baseline. zaremba2015empirical, gal2015theoretically, and others show that large improvements are possible using a forget bias, inverted dropout regularization, or bidirectionality. We propose three major additions that yield similar improvements to off-the-shelf LSTMs: Monte Carlo model averaging, embed average pooling, and residual connections. We analyze these and other more common improvements.
2 LSTM Network
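LSTM networks are among the most commonly used models for tasks involving variable-length sequences of data, such as text classification. The basic LSTM layer consists of six equations, given here in their standard form:

$$
\begin{aligned}
\tilde{c}_t^l &= \tanh\left(W_c^l h_t^{l-1} + U_c^l h_{t-1}^l + b_c^l\right) && (1)\\
i_t^l &= \sigma\left(W_i^l h_t^{l-1} + U_i^l h_{t-1}^l + b_i^l\right) && (2)\\
f_t^l &= \sigma\left(W_f^l h_t^{l-1} + U_f^l h_{t-1}^l + b_f^l\right) && (3)\\
o_t^l &= \sigma\left(W_o^l h_t^{l-1} + U_o^l h_{t-1}^l + b_o^l\right) && (4)\\
c_t^l &= f_t^l \odot c_{t-1}^l + i_t^l \odot \tilde{c}_t^l && (5)\\
h_t^l &= o_t^l \odot \tanh\left(c_t^l\right) && (6)
\end{aligned}
$$

Here $W$, $U$, and $b$ denote the learned input weights, recurrent weights, and biases, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $v_t^l$ is the value of variable $v$ at timestep $t$ in layer $l$. Each layer receives $h_t^{l-1}$ from the layer that came before it and $h_{t-1}^l$ and $c_{t-1}^l$ from the previous timestep, and it outputs $h_t^l$ to the layer that comes after it and $h_t^l$ and $c_t^l$ to the next timestep. The $c$ and $h$ values jointly constitute the recurrent state of the LSTM that is passed from one timestep to the next. Since the $h$ value is completely updated at each timestep while the $c$ value maintains part of its own value through multiplication by the forget gate $f$, $h$ and $c$ complement each other very well, with $h$ forming a “fast” state that can quickly adapt to new information and $c$ forming a “slow” state that allows information to be retained over longer periods of time [zaremba2015empirical]. While various papers have systematically experimented with the six core equations constituting an LSTM [greff2015lstm, zaremba2015empirical], in general the basic LSTM equations have proven extremely resilient and, if not optimal, at least a local maximum.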
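As a minimal sketch of a single LSTM timestep in the standard formulation, the following pure-Python function computes the candidate, the three gates, the “slow” state, and the “fast” state in turn. The parameter layout (a dictionary of `(W, U, b)` triples) and all function names are assumptions made for illustration; the optional `forget_bias` argument adds a constant inside the forget gate's sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h_prev):
    """Return W x + U h_prev + b, with matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h_prev))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]

def lstm_step(params, x, h_prev, c_prev, forget_bias=0.0):
    """One timestep of one LSTM layer.

    params maps "c", "i", "f", "o" to (W, U, b) triples. This layout is
    hypothetical and chosen only to keep the sketch short.
    """
    cand = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]
    # The forget bias is added inside the sigmoid of the forget gate.
    f = [sigmoid(z + forget_bias) for z in affine(*params["f"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]
    # Slow state: keep part of the old value, add gated new information.
    c = [fj * cp + ij * gj for fj, cp, ij, gj in zip(f, c_prev, i, cand)]
    # Fast state: recomputed from scratch at every timestep.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c
```

With all weights set to zero every gate evaluates to 0.5 and the candidate to 0, so the slow state is simply halved; this makes the function easy to sanity-check by hand.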
3 Monte Carlo Model Averaging
It is common practice when applying dropout in neural networks to scale the weights up at train time (inverted dropout). This ensures that the expected magnitude of the inputs to any given layer is equivalent between train and test, allowing for an efficient computation of test-time predictions. However, for a model trained with dropout, test-time predictions generated without dropout merely approximate the ensemble of smaller models that dropout is meant to provide. A higher-fidelity method requires that test-time dropout be conducted in a manner consistent with how the model was trained. To achieve this, we sample $k$ neural nets with dropout applied for each test example and average the predictions. With sufficiently large $k$, this Monte Carlo average should approach the true model average [srivastava2014dropout]. We show in Figure 1 that this technique can yield more accurate predictions on test-time data than the standard practice. This is demonstrated over a number of datasets, suggesting its applicability to many types of sequential architectures. While running multiple Monte Carlo samples is more computationally expensive, the overall increase is minimal, as the process is only run on test-time forward passes and is highly parallelizable. We show that higher performance can be achieved with relatively few Monte Carlo samples, and that this number of samples is similar across different NLP datasets and tasks.
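The contrast between the standard test-time approximation and Monte Carlo model averaging can be sketched as follows. This is an illustrative toy, not the experimental code: the "network" is only a dropout layer followed by a linear projection and a softmax, standing in for the full LSTM, and the function names, the weight matrix `W`, and the choice to average post-softmax probabilities are assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(vec, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale the surviving units by 1 / (1 - p_drop)."""
    if p_drop == 0.0:
        return list(vec)
    keep = 1.0 - p_drop
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def project(W, features):
    return [sum(w * f for w, f in zip(row, features)) for row in W]

def predict_standard(W, features):
    """Standard practice: no dropout at test time."""
    return softmax(project(W, features))

def predict_mc(W, features, p_drop, k, rng):
    """Average class probabilities over k dropout-sampled forward passes."""
    mean = [0.0] * len(W)
    for _ in range(k):
        probs = softmax(project(W, dropout(features, p_drop, rng)))
        mean = [m + p / k for m, p in zip(mean, probs)]
    return mean
```

Each of the `k` passes is independent of the others, which is why the procedure parallelizes well: in practice the samples can be stacked into one batch rather than looped over.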
We encountered one ambiguity of Monte Carlo model averaging that to our knowledge remains unaddressed in prior literature: there is relatively little exploration of where and how the model averaging is most appropriately handled. We investigated averaging over the output of the final recurrent layer (just before the projection layer), over the output of the projection layer (the pre-softmax unnormalized logits), and over the post-softmax normalized probabilities, which is the approach taken by gal2015theoretically for language modeling. We saw no discernible difference in performance between averaging the pre-projection and post-projection outputs. Averaging over the post-softmax probabilities showed marginal improvements over these two methods, but interestingly only for bidirectional models. We also explored using majority voting among the sampled models. This involves taking the highest-probability class predicted by each sampled model and selecting the class that received the most votes. This method differs from averaging the post-softmax probabilities in the same way max-margin differs from maximum likelihood estimation (MLE), de-emphasizing the points well inside the decision boundary or the models that predicted a class with extremely high probability. With sufficiently large $k$, this voting method seemed to work best of the averaging methods we tried, and thus all of our displayed models use this technique. However, for classification problems with more classes, more Monte Carlo samples might be necessary to guarantee a meaningful plurality of class predictions. We conclude that the majority-vote Monte Carlo averaging method is preferable in the case where the ratio of Monte Carlo samples to number of classification labels is large ().
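The difference between averaging post-softmax probabilities and majority voting is easiest to see on a small hand-made example. The sketch below is illustrative only; the function names and the three sampled probability vectors are invented for the purpose.

```python
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def predict_by_averaging(samples):
    """Average the sampled probability vectors, then take the argmax."""
    k = len(samples)
    mean = [sum(col) / k for col in zip(*samples)]
    return argmax(mean)

def predict_by_voting(samples):
    """Each sampled model votes for its own argmax; the most common class wins.
    Ties go to the class that was voted for first (Counter ordering)."""
    votes = Counter(argmax(p) for p in samples)
    return votes.most_common(1)[0][0]

# Three hypothetical dropout samples for one binary example.
samples = [[0.95, 0.05], [0.40, 0.60], [0.45, 0.55]]
```

Here one very confident sample pulls the averaged probability of class 0 to 0.6, so averaging selects class 0, while two of the three samples vote for class 1, so voting selects class 1. This is the sense in which voting de-emphasizes models that predict a class with extremely high probability.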
The Monte Carlo model averaging experiments, shown in Figure 1, were conducted as follows. We drew separate test samples for each example, differentiated by their dropout masks. For each sample size $k$ (whose values, plotted on the x-axis, were in the range from to with step-size ) we selected $k$ of our samples randomly without replacement and performed the relevant Monte Carlo averaging technique for that task, as discussed above. We did this times for each point, to establish the mean and variance for that number of Monte Carlo iterations/samples. The variance is used to visualize the confidence interval in blue, while the red line denotes the test accuracy computed using the traditional approximation method (inverted dropout at train time, and no dropout at test time).
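The resampling procedure described above can be sketched as follows, assuming that the per-mask predictions have already been computed and stored as class labels and that majority voting is the averaging technique. The function names, the argument layout, and the use of population variance are assumptions; the sample sizes and repeat count are left as parameters.

```python
import random
import statistics
from collections import Counter

def vote(labels):
    return Counter(labels).most_common(1)[0][0]

def accuracy_for_k(sampled_preds, gold, k, rng):
    """sampled_preds[e] holds one predicted label per dropout mask for
    test example e; gold[e] is its true label."""
    correct = 0
    for preds, y in zip(sampled_preds, gold):
        subset = rng.sample(preds, k)  # k draws without replacement
        if vote(subset) == y:
            correct += 1
    return correct / len(gold)

def mc_curve(sampled_preds, gold, ks, repeats, rng):
    """Return {k: (mean accuracy, variance)} over `repeats` random subsets."""
    curve = {}
    for k in ks:
        accs = [accuracy_for_k(sampled_preds, gold, k, rng)
                for _ in range(repeats)]
        curve[k] = (statistics.mean(accs), statistics.pvariance(accs))
    return curve
```

Drawing subsets from a fixed pool of precomputed samples means the forward passes are paid for only once, however many sample sizes and repeats are plotted.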
4 Embed Average Pooling
Reliably retaining long-range information is a well-documented weakness of LSTM networks [Karpathy2015]. This is especially the case for very long sequences such as those in the IMDB sentiment dataset [Maas2011], where deep sequential models fail to capture uni- and bi-gram occurrences over long sequences. This is likely why $n$-gram based models, such as a bi-gram NBSVM [wang2012baselines], outperform RNN models on such datasets. It was shown by Iyyer2015 and others that for general NLP classification tasks, the use of a deep, unordered composition (or bag-of-words) of a sequence can yield strong results. Their solution, the deep averaging network (DAN), combines the observed effectiveness of depth with the unreasonable effectiveness of unordered representations of long sequences.
We suspect that the primary advantage of DANs is their ability to keep track of information that would have otherwise been forgotten by a sequential model, such as information early in the sequence for a unidirectional RNN or information in the middle of the sequence for a bidirectional RNN. Our embed average pooling supplements the bidirectional RNN with the information from a DAN at a relatively negligible computational cost.
As shown in Figure 2, embed average pooling works by averaging the sequence of word vectors and passing this average through an MLP. The averaging is similar to an average pooling layer in a CNN (hence the name), but with the averaging being done temporally rather than spatially. The output of this MLP is concatenated to the final output of the RNN, and the combined vector is then passed into the projection and softmax layer. We apply the same dropout mask to the word vectors when passing them to the RNN as when averaging them, and we apply a different dropout mask on the output of the MLP. We also experimented with applying the MLP before rather than after averaging the word vectors, but found that applying it after averaging was more effective.
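The data flow of embed average pooling can be sketched in a few lines. This is a minimal illustration under stated assumptions: the MLP has one hidden layer with a tanh activation (the activation is not specified in the text), dropout is omitted, and the dictionary keys `W1`, `b1`, `W2`, `b2` are hypothetical names.

```python
import math

def mean_pool(word_vectors):
    """Average a sequence of equal-length word vectors over time."""
    n = len(word_vectors)
    return [sum(dim) / n for dim in zip(*word_vectors)]

def dense(W, b, x, activation=None):
    out = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return [activation(v) for v in out] if activation else out

def embed_average_pooling(word_vectors, rnn_final, mlp):
    """Concatenate the RNN's final output with an MLP of the averaged
    word vectors; the result feeds the projection and softmax layer."""
    pooled = mean_pool(word_vectors)
    hidden = dense(mlp["W1"], mlp["b1"], pooled, activation=math.tanh)
    pooled_features = dense(mlp["W2"], mlp["b2"], hidden)
    return rnn_final + pooled_features  # list concatenation
```

Because the pooled branch involves one average and two small matrix products per sequence, its cost is independent of how the recurrent part is configured.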
5 Residual Connections
For feed-forward convolutional neural networks used in computer vision tasks, residual networks, or ResNets, have obtained state-of-the-art results [HeZRS15]. Rather than having each layer learn a wholly new representation of the data, as is customary for neural networks, ResNets have each layer (or group of layers) learn a residual which is added to the layer’s input and then passed on to the next layer. More formally, if the input to a layer (or group of layers) is $x$ and the output of that layer (or group of layers) is $F(x)$, then the input to the next layer (or group of layers) is $x + F(x)$, whereas it would be $F(x)$ in a conventional neural network. This architecture allows the training of far deeper models. HeZRS15 trained convolutional neural networks as deep as 152 layers, compared to 16 layers used in VGGNets [SimonyanZ14a] or 22 layers used in GoogLeNet [szegedy2015going], and won the 2015 ImageNet Challenge. Since then, various papers have tried to build upon the ResNet paradigm [stochastic_depth, SzegedyIV16], and various others have tried to create convincing theoretical reasons for ResNet’s success [LiaoP16, VeitWB16].
We explored many different ways to incorporate residual connections in an RNN. The two most successful ones, which we call Res-V1 and Res-V2, are depicted in Figure 6. Res-V1 incorporates only vertical residuals, while Res-V2 incorporates both vertical and lateral residuals. With vertical residual connections, the input to a layer is added to its output and then passed to the next layer, as is done in feed-forward ResNets. Thus, whereas the input to layer $l$ is normally the output $h_t^{l-1}$ of the previous layer, with vertical residuals the input becomes $h_t^{l-1} + x_t^{l-1}$, where $x_t^{l-1}$ is the input that the previous layer itself received. This maintains many of the attractive properties of ResNets (e.g. unimpeded gradient flow across layers, adding/averaging the contributions of each layer) and thus lends itself naturally to deeper networks. However, it can interact unpredictably with the LSTM architecture, as the “fast” state of the LSTM no longer reflects the network’s full representation of the data at that point. To mitigate this unpredictability, Res-V2 also includes lateral residual connections. With lateral residual connections, the input to a layer is added to its output and then passed to the next timestep as the fast state of the LSTM. This is equivalent to replacing equation 6 with $h_t^l = o_t^l \odot \tanh(c_t^l) + x_t^l$, where $x_t^l$ is the input to layer $l$ at timestep $t$. Thus, applying both vertical and lateral residuals ensures that the same value is passed both to the next layer as input and to the next timestep as the “fast” state.
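The two residual variants differ only in which value is handed upward to the next layer and which is handed sideways to the next timestep. The sketch below wraps an arbitrary LSTM step to make that explicit; the wrapper, its `variant` strings, and the `cell` callback signature are assumptions made for illustration, and the input and hidden sizes are assumed to match so that the addition is defined.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def residual_step(cell, x, h_prev, c_prev, variant):
    """Wrap one LSTM layer step with residual connections.

    cell(x, h_prev, c_prev) -> (h, c) is any LSTM step.
    Returns (to_next_layer, fast_state, slow_state), where fast_state and
    slow_state are what the next timestep receives.
    """
    h, c = cell(x, h_prev, c_prev)
    if variant == "vanilla":
        return h, h, c
    if variant == "res_v1":
        # Vertical residual only: the layer above sees x + h, but the
        # next timestep still receives the plain h as its fast state.
        return add(x, h), h, c
    if variant == "res_v2":
        # Vertical and lateral residuals: the same sum goes both upward
        # and to the next timestep.
        summed = add(x, h)
        return summed, summed, c
    raise ValueError(f"unknown variant: {variant}")
```

In the Res-V1 branch the value passed upward and the fast state diverge, which is the mismatch that the lateral connection in Res-V2 removes.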
In addition to these two, we explored various other, ultimately less successful, ways of adding residual connections to an LSTM, the primary one being horizontal residual connections. In this architecture, rather than adding the input from the previous layer to a layer’s output, we added the fast state from the previous timestep. The hope was that adding residual connections across timesteps would allow information to flow more effectively across timesteps and thus improve the performance of RNNs that are deep across timesteps, much as ResNets do for networks that are deep across layers. Thus, we believed horizontal residual connections could solve the problem of LSTMs not learning long-term dependencies, the same problem we also hoped to mitigate with embed average pooling. Unfortunately, horizontal residuals failed, possibly because they blurred the distinction between the LSTM’s “fast” state and “slow” state and thus prevented the LSTM from quickly adapting to new data. We also experimented with alternate combinations of horizontal, vertical, and lateral residual connections, but these yielded poor results.
6 Experimental Results
We chose two commonly used benchmark datasets for our experiments: the Stanford Sentiment Treebank (SST) [Socher2013] and the IMDB sentiment dataset [Maas2011]. This allowed us to compare the performance of our models to existing work and review the flexibility of our proposed model extensions across fairly disparate types of classification datasets. SST contains relatively well curated, short sequence sentences, in contrast to IMDB’s comparatively colloquial and lengthy sequences (some up to tokens). To further differentiate the classification tasks we chose to experiment with fine-grained, five-class sentiment on SST, while IMDB only offered binary labels. For IMDB, we randomly split the training set of examples into training and validation sets containing and examples respectively, as done in Maas2011.
Our objective is to show a series of compounding extensions to the standard LSTM baseline that enhance accuracy. To ensure scientific reliability, the addition of each feature is the only change from the previous model (see Figures 4 and 5). The baseline model is a 2-layer stacked LSTM with hidden size for SST and for IMDB, as used in Tai2015. All models in this paper used publicly available 300-dimensional word vectors, pre-trained using GloVe on 840 billion tokens of Common Crawl data [pennington2014glove], and both the word vectors and the subsequent weight matrices were trained using Adam with a learning rate of .
The first basic feature additions were a forget bias and dropout. Adding a bias of 1.0 to the forget gate (i.e. adding 1.0 to the inside of the sigmoid function in equation 3) improves results across NLP tasks, especially for learning long-range dependencies [zaremba2015empirical]. Dropout [srivastava2014dropout] is a highly effective regularizer for deep models. For SST and IMDB we used grid search to select dropout probabilities of and respectively, applied to the input of each layer, including the projection/softmax layer. While forget bias appears to hurt performance in Figure 5, the combination of dropout and forget bias yielded better results in all cases than dropout without forget bias. Our last two basic optimizations were increasing the hidden sizes and then adding shared-weight bidirectionality to the RNN. The hidden sizes for SST and IMDB were increased to and respectively; we found significantly diminishing returns to performance from increases beyond this. We chose shared-weight bidirectionality to ensure the model size did not increase any further. Specifically, the forward and backward weights are shared, and the input to the projection/softmax layer is a concatenation of the forward and backward passes’ final hidden states.
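Shared-weight bidirectionality can be sketched by running one and the same recurrent step over the sequence in both directions and concatenating the two final hidden states. For brevity this illustration uses a generic step function `step(x, h) -> h` rather than a full LSTM with its separate slow state; the function names are assumptions.

```python
def run_direction(step, inputs, h0):
    """Apply the recurrent step along the sequence and return the final state."""
    h = h0
    for x in inputs:
        h = step(x, h)
    return h

def shared_weight_bidirectional(step, inputs, h0):
    """Run the same step (hence the same weights) forward and backward,
    then concatenate the two final hidden states."""
    forward_final = run_direction(step, inputs, h0)
    backward_final = run_direction(step, list(reversed(inputs)), h0)
    return forward_final + backward_final  # list concatenation
```

Since both passes call the same `step`, the recurrent parameter count is unchanged; only the input to the projection layer doubles in width, and each sequence costs two passes instead of one.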
All of our subsequent proposed model extensions are described at length in their own sections. For both datasets, we used Monte Carlo samples, and the embed average pooling MLP had one hidden layer and both a hidden dimension and an output dimension of . Note that although the MLP weights increased the size of their respective models, this increase is negligible (equivalent to increasing the hidden size for SST from to or the hidden size of IMDB from to ), and we found that such a size increase had no discernible effect on accuracy when done without the embed average pooling.
Since each of our proposed modifications operates independently, they are well suited to use in combination as well as in isolation. In Figures 4 and 5 we compound these features on top of the more traditional enhancements. Because bidirectional models are expensive, Figure 4 also shows these compounding features on SST with and without bidirectionality. The validation accuracy distributions show that each augmentation usually provides some small but noticeable improvement on the previous model, as measured by consistent improvements in mean and median accuracy.
These box plots show the performance of compounding model features on fine-grained SST validation accuracy. The red points, red lines, blue boxes, whiskers, and plus-shaped points indicate the mean, median, quartiles, range, and outliers, respectively.
|Model||# Params (M)||Train Time / Epoch (sec)||Test Acc (%)|
|NBSVM-tri, RNN, Sentence-Vec Ensemble [mesnil2014ensemble]|
We originally suspected that Monte Carlo model averaging would provide marginal yet consistent improvements across datasets, while embed average pooling would especially excel for long sequences like those in IMDB, where $n$-gram based models and deep unordered compositions have benefited from their ability to retain information from disparate parts of the text. The former hypothesis was largely confirmed. However, while embed average pooling was generally performance-enhancing, the performance boost it yielded for IMDB was not significantly larger than the one it yielded for SST, though that may have been because the other enhancements already encompassed most of the advantages provided by deep unordered compositions.
The only evident exceptions to the positive trend are the variations of residual connections. Which of Res-V1 (vertical only) and Res-V2 (vertical and lateral) outperformed the other depended on the dataset and on whether the network was bidirectional. The Res-V2 architecture dominated in experiments 3(b) and 5, while the Res-V1 architecture performed best in Figure 3(a). This suggests that for short sequences, bidirectionality and lateral residuals conflict. Further analysis of the effect of residual connections and model depth can be found in Figure 6. In that figure, the number of parameters, and hence model size, is kept uniform by modifying the hidden size as the layer depth changes. The hidden sizes used for , , , , and layer models were , , , , and respectively, maintaining total parameters for all models. As the graph demonstrates, normal LSTMs (“Vanilla”) perform drastically worse as they become deeper and narrower, while Res-V1 and Res-V2 both see their performance stay much steadier or even briefly rise. While depth wound up being far from a panacea for the datasets we experimented on, the ability of an LSTM with residual connections to maintain its performance as it gets deeper holds promise for other domains where the extra expressive power provided by depth might prove more crucial.
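Holding the parameter count fixed while varying depth amounts to solving for the hidden size at each depth. The sketch below shows one way to do this, assuming the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and counting only the recurrent layers, not the embeddings or the projection layer. The function names are hypothetical, and the budget in the usage note is an arbitrary illustration rather than a value from the experiments.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a stacked LSTM under the standard parameterization."""
    total = 0
    d_in = input_size
    for _ in range(num_layers):
        # Four gates, each with W (hidden x d_in), U (hidden x hidden), b (hidden).
        total += 4 * (hidden_size * (d_in + hidden_size) + hidden_size)
        d_in = hidden_size  # deeper layers read the layer below
    return total

def hidden_size_for_budget(input_size, num_layers, budget):
    """Largest hidden size whose stacked LSTM fits within the budget.
    Returns 1 if even the smallest model exceeds the budget."""
    h = 1
    while lstm_params(input_size, h + 1, num_layers) <= budget:
        h += 1
    return h
```

For example, with 300-dimensional inputs, a 2-layer model with hidden size 100 has 240,800 recurrent parameters under these assumptions, and calling `hidden_size_for_budget(300, 8, 240800)` returns the narrower hidden size that an 8-layer model would need to stay within the same budget.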
Selecting the best results for each model, we see results competitive with state-of-the-art performance for both IMDB and SST, even though many state-of-the-art models use either parse-tree information [Tai2015], multiple passes through the data [Kumar2016], or tremendous train and test-time computational and memory expenses [le2014distributed]. (For IMDB, we benchmark only against results obtained from training exclusively on the labeled training set. Thus, we omit results from unsupervised models that leveraged the additional unlabeled examples, such as miyato2016virtual.) To our knowledge, our models constitute the best performance of purely sequential, single-pass, and computationally feasible models, precisely the desired features of a solid out-of-the-box baseline. Furthermore, for SST, the compounding enhancement model without bidirectionality, the final model shown in Figure 3(b), greatly exceeded the performance of the large bidirectional model ( vs ), with significantly less training time (Table 1). This suggests our enhancements could provide a similarly reasonable and efficient alternative to shared-weight bidirectionality for other such datasets.
We explore several easy-to-implement enhancements to the basic LSTM network that positively impact performance. These include both fairly well-established extensions (biasing the forget gate, dropout, increasing the model size, bidirectionality) and several more novel ones (Monte Carlo model averaging, embed average pooling, residual connections). We find that these enhancements improve the performance of the LSTM in classification tasks, both in conjunction and in isolation, with an accuracy close to the state of the art despite being more lightweight and using less information than the current state-of-the-art models. Our results suggest that these extensions should be incorporated into LSTM baselines.