Deep learning models are getting bigger, better and more computationally expensive in the quest to match or exceed human performance Wu et al. (2016); He et al. (2015); Amodei et al. (2015); Silver et al. (2016). With advances like the sparsely-gated mixture of experts Shazeer et al. (2017), pointer sentinel Merity et al. (2016), or attention mechanisms Bahdanau et al. (2015), models for natural language processing are growing more complex in order to solve harder linguistic problems.
Many of the problems these new models are designed to solve appear infrequently in real-world datasets, yet the complex model architectures motivated by such problems are employed for every example. For example, fig. 1 illustrates how a computationally cheap model (continuous bag-of-words) represents and clusters sentences. Clusters with simple syntax and semantics (“simple linguistic content”) tend to be classified correctly more often than clusters with difficult linguistic content. In particular, the BoW model is agnostic to word order and fails to accurately classify sentences with contrastive conjunctions.
This paper starts with the intuition that exclusively using a complex model leads to inefficient use of resources when dealing with more straightforward input examples. To remedy this, we propose two strategies for reducing inefficiency based on learning to classify the difficulty of a sentence. In both strategies, if we can determine that a sentence is easy, we use a computationally cheap bag-of-words (BoW) model (“skimming”). If we cannot, we default to an LSTM (“reading”). The first strategy uses the probability output of the BoW system as a confidence measure. The second strategy employs a decision network to learn the relationship between the BoW and the LSTM. Both strategies increase efficiency as measured by an area-under-the-curve (AUC) metric.
2 When to skim: strategies
To keep total computation time down, we investigate cheap strategies based on the BoW encoder. The probability thresholding strategy is a cost-free byproduct of BoW classification, whereas the decision network strategy adds a small additional network.
2.1 Probability strategy
Since we use a softmax cross-entropy loss function, our BoW model is penalized more for confident but wrong predictions. We should thus expect that confident answers are more likely to be correct. Figure 2 investigates this empirically on the SST by thresholding the BoW output probabilities and measuring accuracy, which supports the hypothesis.
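The probability strategy uses a threshold $\tau$ to determine which model to use, such that:

$$\hat{Y} = \begin{cases} \mathrm{BoW}(X) & \text{if } \max_k \Pr_{\mathrm{BoW}}(Y = k \mid X) \geq \tau \\ \mathrm{LSTM}(X) & \text{otherwise} \end{cases}$$

where $\hat{Y}$ is the prediction, $X$ the input, $\mathrm{BoW}$ the BoW model and $\mathrm{LSTM}$ the LSTM model. The LSTM is used only when the probability of the BoW is below the threshold. Figure 3 illustrates how much data is funneled to the LSTM when increasing the probability threshold, $\tau$.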
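The thresholding rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the `run_lstm` callable, and the convention that the BoW confidence is its largest class probability are assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def skim_or_read(
    bow_probs: Sequence[float],
    run_lstm: Callable[[], int],
    threshold: float,
) -> Tuple[int, str]:
    """Return (predicted class, model used) for a single sentence.

    bow_probs: softmax output of the cheap BoW model (assumed input).
    run_lstm:  hypothetical callable that runs the expensive model lazily.
    """
    # Confidence of the cheap model = its largest class probability.
    confidence = max(bow_probs)
    if confidence >= threshold:
        # Skim: keep the BoW prediction, the LSTM is never evaluated.
        best = max(range(len(bow_probs)), key=lambda k: bow_probs[k])
        return best, "bow"
    # Read: the BoW is in doubt, fall back to the LSTM.
    return run_lstm(), "lstm"


def fraction_to_lstm(all_bow_probs: Sequence[Sequence[float]], threshold: float) -> float:
    """Share of sentences whose BoW confidence falls below the threshold."""
    below = sum(1 for probs in all_bow_probs if max(probs) < threshold)
    return below / len(all_bow_probs)
```

Raising `threshold` sends more sentences to the LSTM; a threshold at or below the smallest possible confidence sends none.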
|SST Valid||BoW 82%|
2.2 Decision network
In the probability strategy we make our decision based on the expectation that the more powerful LSTM will do a better job when the bag-of-words system is in doubt, which is not necessarily the case. Table 1 shows the confusion matrix between the BoW and the LSTM. It turns out that the LSTM is strictly better only 12% of the time, whereas for 6% of the sentences neither the BoW nor the LSTM is correct. In the latter case, there is no reason to run the LSTM and we might as well save time by only using the BoW.
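A confusion matrix of this kind can be computed from per-sentence correctness flags of the two models. The sketch below is only illustrative; the function name and the dictionary keys are hypothetical, and the inputs are assumed to be boolean lists aligned by sentence.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_fractions(
    bow_correct: Sequence[bool], lstm_correct: Sequence[bool]
) -> Dict[str, float]:
    """Fraction of sentences in each (BoW correct?, LSTM correct?) cell."""
    counts = Counter(zip(bow_correct, lstm_correct))
    n = len(bow_correct)
    return {
        "both_correct": counts[(True, True)] / n,
        "only_bow_correct": counts[(True, False)] / n,
        # The only cell in which running the LSTM changes a wrong answer to a right one.
        "only_lstm_correct": counts[(False, True)] / n,
        "both_wrong": counts[(False, False)] / n,
    }
```

Only the `only_lstm_correct` cell represents sentences for which running the expensive model pays off.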
2.2.1 Learning to skim, the setup
We propose a trainable decision network that is based on a secondary supervised classification task. We use the confusion matrix between the BoW and the LSTM from Table 1 as labels. We consider the case where the LSTM is correct and the BoW is wrong as the LSTM class and all other combinations as the BoW class.
However, the confusion matrix on the train set is biased because the models overfit, which is why we cannot co-train the decision network and our models (BoW, LSTM) on the same data. Instead we create a new held-out split for training the decision network in a way that will generalize to the test set. We split the training set into a model training set (80% of training data) and a decision training set (remaining 20% of training data). We first train the BoW and the LSTM models on the model training set, then generate labels for the decision training set and train the decision network on it, and lastly fine-tune the BoW and the LSTM on the original full training set while holding the decision network fixed. We find that the decision network still generalizes to models that are fine-tuned on the full training set. The entire pipeline is illustrated in fig. 4.
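The split and the label construction can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the split is a plain random shuffle with a fixed seed, and the correctness flags are assumed to come from models trained on the model training set only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train(
    examples: Sequence[T], model_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split the training data into a model train set and a decision train set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * model_fraction)
    return shuffled[:cut], shuffled[cut:]


def decision_labels(
    bow_correct: Sequence[bool], lstm_correct: Sequence[bool]
) -> List[int]:
    """1 = LSTM class (LSTM correct and BoW wrong), 0 = BoW class (all other cases)."""
    return [int(lstm and not bow) for bow, lstm in zip(bow_correct, lstm_correct)]
```

The labels are computed on the decision train set, which neither model has seen during its first training phase.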
3 Related Work
|Model||Cost per sample|
|Bag-of-words (BoW)||0.16 ms|
The idea of penalizing computational cost is not new. Adaptive computation time (ACT) Graves (2016) employs a cost function to penalize additional computation and thereby complexity. Concurrently with our work, two similar methods have been developed to choose between computationally cheap and expensive models. Odena et al. (2017) propose the composer model, which chooses between computationally inexpensive and expensive layers. To model the compute versus accuracy tradeoff they use a test-time modifiable loss function that resembles our probability strategy. The composer model, similar to our decision network, is not restricted to a binary choice between two models. Further, their model, similar to our decision network, does not have the drawback of probability thresholding, which requires every model of interest to be evaluated sequentially. Instead, it can in theory support a multi-class setting, choosing from several networks. Bolukbasi et al. (2017) similarly use probability output to choose between increasingly expensive models. They show results on ImageNet (Deng et al., 2009) and provide encouraging time savings with a minimal drop in performance. This further suggests that the probability thresholding strategy is a viable alternative to exclusively using SoTA models.
|Strategy||Validation AUC||Test AUC|
Each model is trained ten times with different initializations, and results are reported as mean and standard deviation over the ten runs.
4.1 Model setup
The architecture and training details of all models are available in the Supplementary material. Table 2 gives an overview of the computational cost of our models. Our dataset is the binary version of the Stanford Sentiment Treebank (SST), where “very positive” is combined with “positive”, “very negative” is combined with “negative” and all “neutral” examples are removed.
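The binarization of the SST labels can be illustrated with a short sketch. It assumes the common five-class integer encoding (0 = very negative, 1 = negative, 2 = neutral, 3 = positive, 4 = very positive); that encoding and the function name are assumptions of this example, not details taken from the experimental setup.

```python
from typing import Iterable, List, Tuple


def binarize_sst(examples: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Map fine-grained labels 0..4 to binary labels and drop neutral examples."""
    binary = []
    for sentence, label in examples:
        if label == 2:
            # Neutral examples are removed from the binary task.
            continue
        # 3 and 4 become positive (1); 0 and 1 become negative (0).
        binary.append((sentence, 1 if label > 2 else 0))
    return binary
```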
4.2 Benchmark model
To compare the two decision strategies we evaluate the trade-off between speed and accuracy, shown in fig. 5. Speedup is gained by using the BoW more frequently. We vary the probability threshold in both strategies and compute the fraction of samples dispatched to each model to calculate average computation time. We measure the average value of the speed-accuracy curve, a form of the area-under-the-curve (AUC) metric.
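One possible way to trace such a curve and average it is sketched below. The exact averaging procedure is not spelled out here, so the choices in this example are assumptions: the fraction of samples handled by the BoW serves as the speed axis, and the mean value is obtained with the trapezoidal rule. All function and argument names are hypothetical.

```python
from typing import List, Sequence, Tuple


def curve_points(
    confidences: Sequence[float],
    bow_correct: Sequence[bool],
    lstm_correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float]]:
    """For each threshold, return (fraction handled by BoW, resulting accuracy)."""
    n = len(confidences)
    points = []
    for t in thresholds:
        use_bow = [c >= t for c in confidences]
        alpha = sum(use_bow) / n
        # Each sample is scored by whichever model it was dispatched to.
        correct = sum(b if u else l for u, b, l in zip(use_bow, bow_correct, lstm_correct))
        points.append((alpha, correct / n))
    return sorted(points)


def mean_curve_value(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the curve divided by the covered x-range."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area / (points[-1][0] - points[0][0])
```

The thresholds should span both extremes (everything to the BoW and everything to the LSTM) so that the covered x-range is non-zero.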
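To construct a baseline we consider a naïve ratio between the two models, i.e. let $Y$ be the random variable representing the average accuracy on an unseen sample. Then $Y$ has the following properties:

$$\Pr(Y = a) = \begin{cases} \alpha & \text{if } a = A_{\mathrm{BoW}} \\ 1 - \alpha & \text{if } a = A_{\mathrm{LSTM}} \end{cases}$$

where $A$ is the accuracy and $\alpha$ is the proportion of data used for BoW. By the definition of the expectation of a random variable, the expected accuracy is:

$$\mathbb{E}[Y] = \alpha A_{\mathrm{BoW}} + (1 - \alpha) A_{\mathrm{LSTM}}$$

We calculate the cost of our strategy and of the benchmark ratio in the following manner:

$$C_{\mathrm{strategy}} = C_{\mathrm{BoW}} + (1 - \alpha) C_{\mathrm{LSTM}}$$

$$C_{\mathrm{ratio}} = \alpha C_{\mathrm{BoW}} + (1 - \alpha) C_{\mathrm{LSTM}}$$

where $C$ is the cost. Notice that the decision network is not a byproduct of BoW classification and requires running a second MLP model, but for simplicity we consider its cost equivalent to that of the probability strategy.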
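The baseline and the cost accounting can be illustrated numerically. The sketch below assumes that the ratio baseline dispatches each sample to exactly one model, while a skimming strategy always runs the BoW first and adds the LSTM cost only for the samples it forwards. The function names and the numbers in the usage note are illustrative, not measured values.

```python
from typing import Tuple


def ratio_baseline(
    alpha: float, acc_bow: float, acc_lstm: float, cost_bow: float, cost_lstm: float
) -> Tuple[float, float]:
    """Expected accuracy and per-sample cost when a share alpha goes to the BoW at random."""
    expected_acc = alpha * acc_bow + (1 - alpha) * acc_lstm
    cost = alpha * cost_bow + (1 - alpha) * cost_lstm
    return expected_acc, cost


def strategy_cost(alpha: float, cost_bow: float, cost_lstm: float) -> float:
    """Per-sample cost of a strategy that runs the BoW on every sample (assumption)."""
    return cost_bow + (1 - alpha) * cost_lstm
```

For example, with `alpha = 0.5`, accuracies of 0.8 and 0.9, and costs of 0.16 and 1.0 (arbitrary units), the baseline reaches an expected accuracy of 0.85 at cost 0.58, while a strategy with the same split costs 0.66 because the BoW is always evaluated.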
4.3 Quantitative results
4.4 Qualitative results
One might ask why the decision network performs equivalently to the computationally simple probability thresholding technique. In Visualizations we provide illustrations for a qualitative analysis of why that might be the case. For example, fig. A1 provides a t-SNE visualization of the last hidden layer in the BoW (used by both policies), from which we can assess that the probability strategy and the decision network follow similar predictive patterns. There are a few samples where the probabilities assigned by the two strategies differ significantly; it would be interesting to inspect whether these have been clustered differently in the extra neural layers of the decision network. Towards that end, fig. A2 is a t-SNE plot of the last hidden layer of the decision network. What we hope to see is that it learns to cluster the cases where the LSTM is correct and the BoW is incorrect. However, from the visualization it does not seem to learn the tendencies of the LSTM. As we base our decision network on the last hidden state of the BoW, which is needed to reach a good solution, the decision network might not be able to discriminate where the BoW could not, or it might have found the local minimum of imitating BoW probabilities too compelling. Furthermore, learning the reasoning of the LSTM solely by observing its correctness on a slim dataset could be too weak a signal. Co-training the models in a similar fashion to Odena et al. (2017) might have yielded better results.
We have investigated whether a cheap bag-of-words model can decide when samples, in binary sentiment classification, are easy or difficult. We found that a guided strategy, based on a bag-of-words neural network, can make informed decisions on the difficulty of samples and on when to run an expensive classifier. This allows us to save computational time by only running complex classifiers on difficult sentences. In our attempts to build a more general decision network, we found that it is difficult to use a weaker network to learn the behavior of a stronger one by just observing its correctness.
- Amodei et al. (2015) D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu. 2015. Deep speech 2: End-to-end speech recognition in english and mandarin. CoRR abs/1512.02595. http://arxiv.org/abs/1512.02595.
- Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
- Bolukbasi et al. (2017) T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama. 2017. Adaptive neural networks for fast test-time prediction. CoRR abs/1702.07811. http://arxiv.org/abs/1702.07811.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
- Gers et al. (2000) F. Gers, J. Schmidhuber, and F. Cummins. 2000. Learning to forget: Continual prediction with lstm. Neural Comput. 12(10):2451–2471.
- Graves (2012) A. Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks, Springer Berlin Heidelberg, pages 15–35.
- Graves (2016) A. Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983. http://arxiv.org/abs/1603.08983.
- He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.
- Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.
- Kingma and Ba (2014) D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Merity et al. (2016) S. Merity, C. Xiong, J. Bradbury, and R. Socher. 2016. Pointer sentinel mixture models. CoRR abs/1609.07843. http://arxiv.org/abs/1609.07843.
- Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. In ICLR (workshop).
- Odena et al. (2017) A. Odena, D. Lawson, and C. Olah. 2017. Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780.
- Pennington et al. (2014) J. Pennington, R. Socher, and C. D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
- Ruck et al. (1990) D. Ruck, S. Rogers, M. Kabrisky, M. Oxley, and B. Suter. 1990. The multilayer perceptron as an approximation to a bayes optimal discriminant function. Neural Networks, IEEE Transactions on 1(4):296–298.
- Schuster and Paliwal (1997) M. Schuster and K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions.
- Shazeer et al. (2017) N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR abs/1701.06538.
- Silver et al. (2016) D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489. https://doi.org/10.1038/nature16961.
- Socher et al. (2013) R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
- Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
- van der Maaten and Hinton (2008) L. van der Maaten and G. Hinton. 2008. Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov):2579–2605.
- Wu et al. (2016) Y. Wu, M. Schuster, Z. Chen, Q. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.
t-SNE plots for the qualitative analysis section.
All models are optimized with Adam Kingma and Ba (2014) with a learning rate of . We train our models with early stopping based on maximizing accuracy of all models, except the decision network where we maximize the AUC as described in section 4.2. We use SST subtrees in both the model and decision train splits and for training both models.
Model illustration: The bag of words (BoW)
As shown in fig. A3, the BoW model’s embeddings are initialized with pretrained GloVe Pennington et al. (2014) vectors, then updated during training. The embeddings are followed by an average-pooling layer Mikolov et al. (2013) and a two layer MLP Ruck et al. (1990) with dropout of Srivastava et al. (2014). The network is first trained on the model train dataset (80% of training data, as shown in fig. 4) until convergence (early stopping, at most 50 epochs) and afterwards on the full train dataset (100% of training data) until convergence (early stopping, at most 50 epochs).
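The forward pass of such a model can be sketched in plain Python. This is only an illustration of average pooling followed by a two layer MLP: the ReLU activation, the weight layout (one row per output unit), and all function names are assumptions, and dropout is omitted because it is inactive at inference time.

```python
import math
from typing import List, Sequence

Vector = List[float]


def average_pool(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Mean of the word vectors; the result does not depend on word order."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def dense(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> Vector:
    """Affine layer: one dot product plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bow_forward(embeddings, w1, b1, w2, b2) -> Vector:
    """Class probabilities from pooled embeddings, one hidden layer and an output layer."""
    hidden = relu(dense(average_pool(embeddings), w1, b1))
    return softmax(dense(hidden, w2, b2))
```

Because the pooling step averages over positions, permuting the words of a sentence leaves the output unchanged, which is the word-order insensitivity of the BoW model.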
Model illustration: The LSTM
The LSTM is visualized in fig. A3. The LSTM’s word embeddings are initialized with GloVe Pennington et al. (2014). Instead of updating the embeddings, as is done in the BoW, we apply a trainable projection layer. We find that this reduces overfitting. After the projection layer, a bi-directional Schuster and Paliwal (1997) recurrent neural network Graves (2012) with long short-term memory cells Hochreiter and Schmidhuber (1997); Gers et al. (2000) is applied, followed by concatenated mean- and max-pooling of the hidden states across time. We then employ a two layer MLP Ruck et al. (1990) with dropout of Srivastava et al. (2014). The network is first trained on the model train dataset (80% of training data) until convergence (early stopping, at most 50 epochs) and afterwards on the full train dataset (100% of training data) until convergence (early stopping, at most 50 epochs).
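The pooling step can be illustrated in isolation. The sketch below assumes the hidden states are given as one vector per time step (for a bi-directional network, the forward and backward states already concatenated); the function name is hypothetical.

```python
from typing import List, Sequence


def mean_max_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the per-dimension mean and maximum over all time steps."""
    steps = len(hidden_states)
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    mean_part = [sum(column) / steps for column in columns]
    max_part = [max(column) for column in columns]
    return mean_part + max_part
```

The pooled vector has twice the hidden size and a fixed length regardless of sentence length, so it can be fed directly to the MLP.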
Model illustration: The Decision Network
The decision network is pictured in fig. A4. It inherits all but the output layer of the BoW model trained on the model train dataset, without dropout. The layers originating from the BoW are not updated during training, as we find that the network overfits if they are. On top of the last hidden layer of the BoW model, a two layer MLP Ruck et al. (1990) with dropout of Srivastava et al. (2014) is applied.
The network is trained on the decision train portion of the dataset (20% of training data) until convergence. We use early stopping by measuring the AUC metric between the BoW and LSTM trained only on the model train dataset.