1 Introduction
Recurrent neural networks (RNNs) are a predominantly popular architecture for modeling natural language: an RNN sequentially ‘reads’ input tokens and outputs a distributed representation for each token. By recurrently updating the hidden state with an identical function, an RNN inherently incurs the same computational cost at every time step. While this seems natural for some application domains, not all input tokens are equally important in many language processing tasks. For instance, in question answering, an efficient strategy would be to allocate less computation to parts of the text that are irrelevant to the question and reserve heavy computation for the important parts.
Attention models (Bahdanau et al., 2015) compute the importance of the words relevant to the given task via an attention mechanism, but they do not aim to improve the efficiency of inference. More recently, a variant of LSTM (Yu et al., 2017) was introduced that improves inference efficiency by skipping multiple tokens at a given time step. In this paper, we introduce Skim-RNN, which takes advantage of ‘skimming’ rather than ‘skipping’ tokens. Skimming refers to spending little time (rather than none) on parts of the text that do not affect the reader’s main objective. Skimming typically gives trained human speed readers up to a 4x speed-up, occasionally at a small cost in comprehension rate (Marcel Adam Just, 1987).
Inspired by the principles of human speed reading, we introduce Skim-RNN (Figure 1), which makes a fast decision on the significance of each input (to the downstream task) and ‘skims’ through unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, we use Gumbel-softmax (Jang et al., 2017) to estimate the gradient of the function, instead of more traditional methods such as REINFORCE (policy gradient) (Williams, 1992). The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or FlopR) when the skimming rate is high, which often leads to faster inference on CPUs (Flop reduction does not necessarily mean an equivalent speed gain; for instance, on GPUs there will be no speed gain because of parallel computation, and on CPUs the gain will not be as high as the FlopR due to overheads, see Section 4.3), a highly desirable property for large-scale products and small devices. Skim-RNN has the same input and output interfaces as standard RNNs, so it can conveniently replace RNNs in existing models to speed them up. This is in contrast to LSTM-Jump (Yu et al., 2017), which does not produce outputs for the skipped time steps. Moreover, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting the threshold for the ‘skim’ decision. Lastly, we show that skimming achieves higher accuracy than skipping tokens, implying that paying some attention to unimportant tokens is better than ignoring (skipping) them completely.
Our experiments show that Skim-RNN attains a computational advantage (float operation reduction, or FlopR) over a standard RNN, with up to 3x reduction in computations while maintaining the same level of accuracy, on four text classification tasks and two question answering tasks. Moreover, for applications more concerned with latency than throughput, Skim-RNN on a CPU can offer lower-latency inference than standard RNNs on GPUs (Section 4.3). Our experiments also show that we achieve higher accuracy and/or computational efficiency than LSTM-Jump and verify our intuition about the advantages of skimming over skipping.
2 Related Work
Fast neural networks.
As neural networks become widely integrated into real-world applications, making them faster and lighter has recently drawn much attention in machine learning communities and industry.
Mnih et al. (2014) perform hard attention instead of soft attention on image patches for caption generation, which reduces the number of computations and memory usage. Han et al. (2016) compress a trained convolutional neural network so that the model occupies less memory.
Rastegari et al. (2016) approximate 32-bit float operations with single-bit binary operations to substantially increase computational speed at the cost of a small loss of precision. Odena et al. (2017) propose to change model behavior on a per-input basis, which can allocate less computation to simpler inputs.

More relevant to ours are approaches specifically targeted at sequential data. LSTM-Jump (Yu et al., 2017) has the same goal as our model in that it aims to reduce the computational cost of recurrent neural networks. However, it is fundamentally different from Skim-RNN in that it skips some input tokens, while ours does not ignore any token and instead skims the unimportant ones. Our experiments confirm the benefits of skimming over skipping in Figure 6. In addition, LSTM-Jump does not produce LSTM outputs for skipped tokens, which often makes it non-trivial to replace a regular LSTM in existing models with LSTM-Jump if the outputs of the LSTM (instead of just the last hidden state) are used. On the other hand, Skim-RNN emits a fixed-size output at every time step, so it is compatible with any RNN-based model. We also note the existence of Skip-LSTM (Campos et al., 2017), a recent, concurrent submission to ours that shares many characteristics with LSTM-Jump.
Variable Computation in RNN (VCRNN) (Jernite et al., 2017) is also concerned with dynamically controlling the computational cost of an RNN. However, VCRNN only controls the number of units to update at each time step, while Skim-RNN contains multiple RNNs that share a common hidden state with different regions on which they operate (choosing which RNN to use at each time step). This has two important implications. First, the nested RNNs in Skim-RNN have their own weights and can thus be considered independent agents that interact with each other through the shared state. That is, Skim-RNN updates the shared portion of the hidden state differently (by using different RNNs) depending on the importance of the token, whereas the affected (first few) dimensions in VCRNN are updated identically regardless of the importance of the input. We argue that this capability of Skim-RNN could be a crucial advantage, as we demonstrate in Section 4. Second, at each time step, VCRNN needs to make a $d$-way decision (where $d$ is the hidden state size, usually hundreds), whereas Skim-RNN only requires a binary decision. This means that computing the exact gradient of VCRNN is even more intractable ($O(d^T)$ vs. $O(2^T)$ for a length-$T$ sequence) than that of Skim-RNN, and subsequently the gradient estimation is harder as well. We conjecture that this results in a higher variance in the performance of VCRNN per training run, which we also discuss in Section 4.

Choi et al. (2017)
use a CNN-based sentence classifier, which can be computed efficiently on GPUs, to select the most relevant sentence(s) to the question among hundreds of candidates, and use an RNN-based question answering model, which is relatively costly on GPUs, to obtain the answer from the selected sentence. The two models are jointly trained with REINFORCE
(Williams, 1992). Skim-RNN is inherently different from that model in that ours is generic (it replaces the RNN) rather than specific to question answering, and the model of Choi et al. (2017) focuses on reducing GPU time (maximizing parallelization), while ours focuses on reducing CPU time (minimizing Flop).

Johansen et al. (2017) have shown that, for sentiment analysis, it is possible to cheaply determine whether an entire sentence can be correctly classified with a cheap bag-of-words model or needs a more expensive LSTM classifier. Again, Skim-RNN is intrinsically different from their approach in that theirs makes a single, static decision on which model to use for the entire example, whereas ours decides at every time step.
Attention. Modeling human attention while reading has been studied in the field of cognitive psychology (Reichle et al., 2003). Neural attention mechanisms have also been widely employed and proved essential for many language tasks (Bahdanau et al., 2015), allowing the model to focus on specific parts of the text. Sigmoid attention (Mikolov et al., 2015; Balduzzi & Ghifary, 2016; Seo et al., 2017b) is especially relevant to our work in that the attention decision is binary and local (per time step) instead of global (softmax). Nevertheless, it is important to note the distinction from Skim-RNN that the neural attention mechanism is soft (differentiable) and is not intended for faster inference. More recently, Hahn & Keller (2016) have modeled the human reading pattern with neural attention in an unsupervised learning approach, leading to the conclusion that there exists a trade-off between a system’s performance on a given reading-based task and its reading speed.
RNNs with hard decisions. Our model is related to several recent works that incorporate hard decisions within recurrent neural networks (Kong et al., 2016). Dyer et al. (2016) use an RNN for transition-based dependency parsing; at each time step, the RNN unit decides among three possible choices. The architecture does not suffer from the intractability of computing the gradients, because the decision is supervised at every time step. Chung et al. (2017) dynamically construct a multiscale RNN by making a hard binary decision on whether to update the hidden state of each layer at each time step. In order to handle the intractability of computing the gradient, they use straight-through estimation (Bengio et al., 2013) with slope annealing, which can be considered an alternative to Gumbel-softmax reparameterization.
3 Model
A Skim-RNN unit consists of two RNN cells: a default (big) RNN cell with hidden state size $d$ and a small RNN cell with hidden state size $d'$, where $d$ and $d'$ are hyperparameters defined by the user and $d' \ll d$. Each RNN cell has its own weights and bias, and it can be any variant of RNN, such as GRU or LSTM. The core idea of the model is that Skim-RNN dynamically decides at each time step whether to use the big RNN (if the current token is important) or to skim by using the small RNN (if the current token is unimportant). Skipping a token can be implemented by setting $d'$, the size of the small RNN, to zero. Since the small RNN requires fewer float operations than the big RNN, the model is faster than the big RNN alone while obtaining similar or better results. Later in Section 4, we measure the speed effectiveness of Skim-RNN via three criteria: skim rate (how many words are skimmed), number of float operations, and benchmarked speed on several platforms. Figure 1 depicts the schematic of Skim-RNN on a short word sequence.

We first describe the desired inference model of Skim-RNN to be learned in Section 3.1
. The input to and output of Skim-RNN are equivalent to those of a regular RNN: a varying-length sequence of vectors goes in, and an equal-length sequence of output vectors comes out. We model the hard decision of skimming at each time step with a stochastic multinomial variable. Note that obtaining the exact gradient is intractable as the sequence becomes longer, and the loss is not differentiable due to the hard argmax; hence, in Section 3.2, we reparameterize the stochastic distribution with Gumbel-softmax (Jang et al., 2017) to approximate the inference model with a fully differentiable function, which can be efficiently trained with stochastic gradient descent.
3.1 Inference
At each time step $t$, a Skim-RNN unit takes the input $x_t$ and the previous hidden state $h_{t-1}$ as its arguments, and outputs the new state $h_t$ (we assume both the input and the hidden state are $d$-dimensional for brevity, but our arguments are valid for different sizes as well). Let $k$ represent the number of choices for the hard decision at each time step. In Skim-RNNs, $k = 2$ since the model either fully reads or skims. In general, although not explored in this paper, one can have $k > 2$ for multiple degrees of skimming.
We model the decision-making process with a multinomial random variable $Q_t$ over the probability distribution of choices $p_t$. We model $p_t$ with

$p_t = \text{softmax}(\alpha(x_t, h_{t-1})) = \text{softmax}(W[x_t; h_{t-1}] + b) \in \mathbb{R}^k$  (1)

where $W \in \mathbb{R}^{k \times 2d}$ and $b \in \mathbb{R}^k$ are weights to be learned, and $[x_t; h_{t-1}]$ indicates row concatenation. Note that one can define $\alpha$ in a different way (e.g., the dot product between $x_t$ and $h_{t-1}$), as long as its time complexity is strictly less than $O(d^2)$ to gain a computational advantage. For ease of explanation, let the first element of the vector, $p_t^1$, indicate the probability of fully reading, and the second element, $p_t^2$, the probability of skimming. Now we define the random variable $Q_t$ to make the decision to skim ($Q_t = 2$) or not ($Q_t = 1$), by sampling from the probability distribution $p_t$:

$Q_t \sim \text{Multinomial}(p_t)$  (2)

which means $Q_t = 1$ and $Q_t = 2$ will be sampled with probability $p_t^1$ and $p_t^2$, respectively. If $Q_t = 1$, then the unit applies a standard, full RNN to the input and the previous hidden state to obtain the new hidden state. If $Q_t = 2$, then the unit applies a smaller RNN to obtain a small hidden state, which replaces only a portion of the previous hidden state. More formally,
$h_t = \begin{cases} f(x_t, h_{t-1}) & \text{if } Q_t = 1 \\ [f'(x_t, h_{t-1}); h_{t-1}[d'+1:d]] & \text{if } Q_t = 2 \end{cases}$  (3)

where $f$ is a full RNN with $d$-dimensional output, while $f'$ is a smaller RNN with $d'$-dimensional output, where $d' \ll d$, and $h_{t-1}[d'+1:d]$ is vector slicing. Note that $f$ and $f'$ can be any variant of RNN such as GRU or LSTM (since an LSTM cell has two outputs, hidden state ($h$) and memory ($c$), the slicing and concatenation in Equation 3 are applied to each output). The main computational advantage of the model is that, if $d' \ll d$, then whenever the model decides to skim, it requires $O(d'd)$ computations, which is substantially less than $O(d^2)$. Also, as a side effect, the last $d - d'$ dimensions of the hidden state are updated less frequently, which we hypothesize to be a non-trivial factor for improved accuracy in some datasets (Section 4).
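To make the update rule concrete, here is a minimal NumPy sketch of one inference step (Equations 1–3). The tanh cells, parameter shapes, and decision threshold are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_small, k = 8, 2, 2  # big hidden size d, small hidden size d', number of choices k

# Illustrative parameters of the decision function alpha (Equation 1).
W = rng.normal(scale=0.1, size=(k, 2 * d))
b = np.zeros(k)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def big_rnn(x, h):    # stand-in for the full cell f with d-dimensional output
    return np.tanh(x + h)

def small_rnn(x, h):  # stand-in for the small cell f' with d'-dimensional output
    return np.tanh(x[:d_small] + h[:d_small])

def skim_rnn_step(x, h, threshold=0.5):
    p = softmax(W @ np.concatenate([x, h]) + b)  # Equation 1
    q = 1 if p[0] >= threshold else 2            # deterministic decision at inference time
    if q == 1:
        h_new = big_rnn(x, h)                    # fully read: update all d dimensions
    else:
        # Skim: update the first d' dimensions, carry the rest over (Equation 3).
        h_new = np.concatenate([small_rnn(x, h), h[d_small:]])
    return h_new, q

h = np.zeros(d)
for x in rng.normal(size=(5, d)):  # toy length-5 sequence
    h, q = skim_rnn_step(x, h)
```

Note that the output always has size $d$ regardless of the decision, which is what makes the unit a drop-in replacement for a standard RNN cell.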
3.2 Training
Since the loss is a random variable that depends on the random variables $Q_t$, we minimize the expected loss with respect to the distribution of the variables (an alternative view is that, if we let $Q_t = \arg\max_i p_t^i$ instead of sampling, which we do during inference for a deterministic outcome, then the loss is non-differentiable due to the argmax operation, i.e., the hard decision).
Suppose that we define the loss function to be minimized conditioned on a particular sequence of decisions, $L(\theta; Q)$, where $Q = Q_1 \dots Q_T$ is a sequence of decisions with length $T$. Then the expectation of the loss function over the distribution of the sequence of decisions is

$\mathbb{E}_Q[L(\theta)] = \sum_Q L(\theta; Q) P(Q)$  (4)
In order to exactly compute $\mathbb{E}_Q[L(\theta)]$, one needs to enumerate all possible $Q$, which is intractable (the number of sequences increases exponentially with the sequence length). It is possible to approximate the gradients with REINFORCE (Williams, 1992), which is an unbiased estimator, but it is known to have high variance. We instead use the Gumbel-softmax distribution (Jang et al., 2017) to approximate Equation 2 with $r_t \in \mathbb{R}^k$ (the same size as $p_t$), which is fully differentiable. Hence backpropagation can now efficiently flow to $p_t$ without being blocked by the stochastic variable $Q_t$, and the approximation can arbitrarily approach $Q_t$ by controlling hyperparameters. The reparameterized distribution is obtained by

$r_t^i = \dfrac{\exp((\log p_t^i + g_t^i)/\tau)}{\sum_j \exp((\log p_t^j + g_t^j)/\tau)}$  (5)
where $g_t^i$ is an independent sample from $\text{Gumbel}(0, 1)$ and $\tau$ is the temperature (a hyperparameter). We relax the conditional statement of Equation 3 by rewriting

$h_t = \sum_i r_t^i \tilde{h}_t^i$  (6)

where $\tilde{h}_t^i$ is the candidate hidden state if $Q_t = i$. That is,

$\tilde{h}_t^1 = f(x_t, h_{t-1}), \quad \tilde{h}_t^2 = [f'(x_t, h_{t-1}); h_{t-1}[d'+1:d]]$  (7)

as shown in Equation 3. Note that Equation 6 approaches Equation 3 as $r_t$ approaches a one-hot vector. Jang et al. (2017) have shown that $r_t$ becomes more discrete and approaches the distribution of $Q_t$ as $\tau \rightarrow 0$. Hence we start from a high temperature (smoother $r_t$) and slowly decrease it.
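The reparameterization in Equations 5–7 can be sketched in NumPy as follows; the probabilities, candidate states, and temperatures below are toy values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax(p, tau):
    """Draw a differentiable sample r from the Gumbel-softmax distribution (Equation 5)."""
    g = -np.log(-np.log(rng.uniform(size=p.shape)))  # independent Gumbel(0, 1) noise
    logits = (np.log(p) + g) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = np.array([0.7, 0.3])  # (read, skim) probabilities from Equation 1

r_smooth = gumbel_softmax(p, tau=5.0)   # high temperature: far from one-hot
r_sharp = gumbel_softmax(p, tau=0.01)   # low temperature: nearly one-hot

# Equation 6: the relaxed state is a convex combination of the candidate
# states from Equation 7 (toy 4-dimensional candidates here).
h_read, h_skim = np.ones(4), np.zeros(4)
h_new = r_smooth[0] * h_read + r_smooth[1] * h_skim
```

Because `h_new` is a smooth function of `p`, gradients flow through the decision during training, while at inference the argmax of `p` recovers the hard choice.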
Lastly, in order to encourage the model to skim when possible, in addition to minimizing the main loss function $L(\theta)$, which is application-dependent, we also jointly minimize the arithmetic mean of the negative log probability of skimming, $\frac{1}{T} \sum_t -\log(p_t^2)$, where $T$ is the sequence length. We define the final loss function by

$L'(\theta) = L(\theta) + \gamma \cdot \frac{1}{T} \sum_t -\log(p_t^2)$  (8)

where $\gamma$ is a hyperparameter controlling the ratio between the two terms.
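A sketch of the regularized objective in Equation 8; the per-step probabilities and the main loss value below are placeholders for illustration.

```python
import numpy as np

def skim_regularizer(p_seq, gamma):
    """gamma times the mean negative log probability of skimming (second term of Equation 8)."""
    p_skim = np.asarray(p_seq)[:, 1]  # p_t^2: per-step skim probabilities
    return gamma * np.mean(-np.log(p_skim))

# Toy (read, skim) probabilities for a length-4 sequence and a placeholder task loss.
p_seq = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.1, 0.9)]
main_loss = 1.25
total_loss = main_loss + skim_regularizer(p_seq, gamma=0.01)
```

The regularizer is zero only when the model always skims, so $\gamma$ trades task accuracy against skim rate.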
4 Experiments
| Dataset | Task type | Answer type | Number of examples | Avg. len | Vocab size |
|---|---|---|---|---|---|
| SST | Sentiment analysis | Pos/Neg | 6,920 / 872 / 1,821 | 19 | 13,750 |
| Rotten Tomatoes | Sentiment analysis | Pos/Neg | 8,530 / 1,066 / 1,066 | 21 | 16,259 |
| IMDb | Sentiment analysis | Pos/Neg | 21,143 / 3,857 / 25,000 | 282 | 61,046 |
| AGNews | News classification | 4 categories | 101,851 / 18,149 / 7,600 | 43 | 60,088 |
| CBT-NE | Question answering | 10 candidates | 108,719 / 2,000 / 2,500 | 461 | 53,063 |
| CBT-CN | Question answering | 10 candidates | 120,769 / 2,000 / 2,500 | 500 | 53,185 |
| SQuAD | Question answering | Span from context | 87,599 / 10,570 / – | 141 | 69,184 |
We evaluate the effectiveness of Skim-RNN in terms of accuracy and float operation reduction (FlopR) on four classification tasks and a question answering task. These language tasks have been chosen because they do not require full attention to every detail of the text, but rather ask for capturing high-level information (classification) or focusing on a specific portion (QA) of the text, which fits the principle of speed reading (‘speed reading’ would not be appropriate for many language tasks; for instance, in translation, one would not skim through the text because most input tokens are crucial for the task).
We start with the classification tasks (Section 4.1) and compare Skim-RNN against a standard RNN, LSTM-Jump (Yu et al., 2017), and VCRNN (Jernite et al., 2017), which have goals similar to ours. Then we evaluate and analyze our system on a well-studied question answering dataset, the Stanford Question Answering Dataset (SQuAD) (Section 4.2). Since LSTM-Jump does not report on this dataset, we simulate ‘skipping’ by not updating the hidden state when the decision is to ‘skim’, and show that skimming yields a better accuracy-speed trade-off than skipping. We defer the results of Skim-RNN on the Children’s Book Test to Appendix B.
Evaluation Metrics. We measure accuracy (Acc) for the classification tasks and the F1 and exact match (EM) scores of the correct span for the question answering task. We evaluate computational efficiency with the skimming rate (Sk), i.e., how frequently words are skimmed, and the reduction in float operations (FlopR). We also report the benchmarked speed gain rate (compared to a standard LSTM) for the classification tasks and CBT, since LSTM-Jump does not report Flop reduction rates (see Section 4.3 for how the benchmark is performed). Note that LSTM-Jump measures speed gain on a GPU while ours is measured on a CPU.
4.1 Text Classification
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a $d$-dimensional vector. We initialize the vectors with GloVe (Pennington et al., 2014) and use them as the inputs to an LSTM (or Skim-LSTM). We apply a linear transformation to the last hidden state of the LSTM and then a softmax function to obtain the classification probabilities. We use Adam (Kingma & Ba, 2015) for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, we anneal the Gumbel-softmax temperature with $\tau = \max(0.5, \exp(-rn))$, where $r = 10^{-4}$ and $n$ is the global training step, following Jang et al. (2017). We experiment with different sizes of the big LSTM ($d$) and the small LSTM ($d'$), and different ratios between the model loss and the skim loss ($\gamma$). We use a batch size of 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, we stop early when the validation accuracy does not increase for a fixed number of global steps.
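Skim-LSTM anneals the Gumbel-softmax temperature during training; one plausible schedule following Jang et al. (2017) is sketched below. The decay rate and floor are illustrative assumptions.

```python
import math

def gumbel_temperature(step, r=1e-4, tau_min=0.5):
    """Anneal the Gumbel-softmax temperature from 1.0 toward tau_min as training proceeds."""
    return max(tau_min, math.exp(-r * step))

# Early in training the relaxation is smooth; later it is nearly discrete.
tau_start = gumbel_temperature(0)       # 1.0
tau_late = gumbel_temperature(100_000)  # floored at tau_min = 0.5
```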

| Model | $d'$/$\gamma$ | SST (Acc / Sk / FlopR / Sp) | Rotten Tomatoes (Acc / Sk / FlopR / Sp) | IMDb (Acc / Sk / FlopR / Sp) | AGNews (Acc / Sk / FlopR / Sp) |
|---|---|---|---|---|---|
| Standard | – | 86.4 / – / 1.0x / 1.0x | 82.5 / – / 1.0x / 1.0x | 91.1 / – / 1.0x / 1.0x | 93.5 / – / 1.0x / 1.0x |
| Skim | 5/0.01 | 86.4 / 58.2 / 2.4x / 1.4x | 84.2 / 52.0 / 2.1x / 1.3x | 89.3 / 79.2 / 4.7x / 2.1x | 93.6 / 30.3 / 1.4x / 1.0x |
| Skim | 10/0.01 | 85.8 / 61.1 / 2.5x / 1.5x | 82.5 / 58.5 / 2.4x / 1.4x | 91.2 / 83.9 / 5.8x / 2.3x | 93.5 / 33.7 / 1.5x / 1.0x |
| Skim | 5/0.02 | 85.6 / 62.3 / 2.6x / 1.5x | 81.8 / 63.7 / 2.7x / 1.5x | 88.7 / 63.2 / 2.7x / 1.5x | 93.3 / 36.4 / 1.6x / 1.0x |
| Skim | 10/0.02 | 86.4 / 68.0 / 3.0x / 1.7x | 82.5 / 63.0 / 2.6x / 1.5x | 90.9 / 90.7 / 9.5x / 2.7x | 92.5 / 10.6 / 1.1x / 0.8x |
| LSTM-Jump | – | – | 79.3 / – / – / 1.6x | 89.4 / – / – / 1.6x | 89.3 / – / – / 1.1x |
| VCRNN | – | 81.9 / – / 2.6x / – | 81.4 / – / 1.9x / – | – | – |
| SOTA | – | 89.5 | 83.4 | 94.1 | 93.4 |
Positive: I liked this movie, not because Tom Selleck was in it, but because it was a good story about baseball and it also had a semi-over-dramatized view of some of the issues that a BASEBALL player coming to the end of their time in Major League sports must face. I also greatly enjoyed the cultural differences in American and Japanese baseball and the small facts on how the games are played differently. Overall, it is a good movie to watch on Cable TV or rent on a cold winter’s night and watch about the "Dog Day’s" of summer and know that spring training is only a few months away. A good movie for a baseball fan as well as a good "DATE" movie. Trust me on that one! *Wink*

Negative: No! no No NO! My entire being is revolting against this dreadful remake of a classic movie. I knew we were heading for trouble from the moment Meg Ryan appeared on screen with her ridiculous hair and clothing, literally looking like a scarecrow in that garden she was digging. Meg Ryan playing Meg Ryan, how tiresome is that?! And it got worse ... so much worse. The horribly cliché lines, the stock characters, the increasing sense I was watching a spin-off of "The First Wives Club" and the ultimate hackneyed schtick in the delivery room. How many times have I seen this movie? Only once, but it feels like a dozen times, nothing original or fresh about it. For shame!
Results.
Table 2 shows the accuracy and the computational cost of our model compared with a standard LSTM, LSTM-Jump (Yu et al., 2017), and VCRNN (Jernite et al., 2017). First, Skim-LSTM yields a significant reduction in the number of float operations compared to LSTM, as indicated by ‘FlopR’. When benchmarked in Python (‘Sp’ column), we observe a non-trivial speed-up. We expect that the gain can be further increased by an implementation in a lower-level language with less overhead. Second, our model outperforms the standard LSTM and LSTM-Jump across all tasks, and its accuracy is better than or close to that of RNN-based state-of-the-art models, which are often specifically designed for these tasks. We hypothesize that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2 shows the effect of varying the size of the small hidden state as well as the parameter $\gamma$ on the accuracy and computational cost.
Table 3 shows an example from the IMDb dataset that Skim-RNN correctly classifies with a high skimming rate (92%). The black words are skimmed, and the blue words are fully read. As expected, the model skims through unimportant words, including prepositions, and latently learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.
4.2 Question Answering
In the Stanford Question Answering Dataset, the task is to locate the answer span for a given question in a context paragraph. We evaluate the effectiveness of Skim-RNN on SQuAD with two different models: LSTM+Attention and BiDAF (Seo et al., 2017a) (we chose BiDAF because it provides a strong baseline for SQuAD, TriviaQA (Joshi et al., 2017), TextBookQA (Kembhavi et al., 2017), and multi-hop QA (Welbl et al., 2017), and achieves state of the art on WikiQA and SemEval-2016 (Task 3A) (Min et al., 2017)). The first model is inspired by most current QA systems, consisting of multiple LSTM layers and an attention mechanism. It is complex enough to reach reasonable accuracy on the dataset, and simple enough to run well-controlled analyses for the Skim-RNN. The details of the model are described in Appendix A.1. The second model is an open-source model designed for SQuAD, which we study mainly to show that Skim-RNN can replace the RNN in existing complex systems.
Training. We use Adam for optimization. For stable training, we pretrain with a standard LSTM for the first 5k steps, and then finetune with Skim-LSTM (Section A.2 shows different pretraining schemes). The other hyperparameter setup follows that of classification in Section 4.1.
Results.
Table 4 (above the double line) shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN (Jernite et al., 2017). We observe that the skimming models achieve F1 scores higher than or similar to those of the default non-skimming models (LSTM+Att) while reducing the computational cost (FlopR) by more than 1.4x. Moreover, decreasing the number of layers (1 layer) or the hidden size (d=5) improves FlopR but significantly decreases accuracy (compared to skimming). Table 4 (below the double line) demonstrates that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably reduces the computational cost without losing much accuracy (only a small drop in F1 from BiDAF to Sk-BiDAF).
Figure 4 shows the skimming rate of the different LSTM layers with varying values of $\gamma$ in the LSTM+Att model. There are four points on the x-axis of the figures, associated with the two forward and two backward layers of the model. We see two interesting trends. First, the skimming rates of the second layers (forward and backward) are higher than those of the first layers across different $\gamma$ values. A possible explanation for this trend is that the model is more confident about which tokens are important at the second layer. Second, a higher $\gamma$ value leads to a higher skimming rate, which agrees with its intended functionality.
| Model | $\gamma$ | F1 | EM | Sk | FlopR |
|---|---|---|---|---|---|
| LSTM+Att (1 layer) | – | 73.3 | 63.9 | – | 1.3x |
| LSTM+Att (d=5) | – | 74.0 | 64.4 | – | 3.6x |
| LSTM+Att | – | 75.5 | 67.0 | – | 1.0x |
| Sk-LSTM+Att () | 0.1 | 75.7 | 66.7 | 37.7 | 1.4x |
| Sk-LSTM+Att () | 0.2 | 75.6 | 66.4 | 49.7 | 1.6x |
| Sk-LSTM+Att | 0.05 | 75.5 | 66.0 | 39.7 | 1.4x |
| Sk-LSTM+Att | 0.1 | 75.3 | 66.0 | 56.2 | 1.7x |
| Sk-LSTM+Att | 0.2 | 75.0 | 66.0 | 76.4 | 2.3x |
| VCRNN | – | 74.9 | 65.4 | – | 1.0x |
| BiDAF () | – | 74.6 | 64.0 | – | 9.1x |
| BiDAF () | – | 75.7 | 65.5 | – | 3.7x |
| BiDAF | – | 77.3 | 67.7 | – | 1.0x |
| Sk-BiDAF | – | 76.9 | 67.0 | 74.5 | 2.8x |
| Sk-BiDAF | – | 77.1 | 67.4 | 47.1 | 1.7x |
| SOTA (Wang et al., 2017) | – | 79.5 | 71.1 | – | – |
Figure 6 shows the F1 scores of the LSTM+Attention model using a standard LSTM and a Skim-LSTM, sorted in ascending order by FlopR. While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms the standard LSTM (blue) at comparable computational cost. We also observe that the F1 score of Skim-LSTM is more stable across different configurations and computational costs. Moreover, increasing the value of $\gamma$ for Skim-LSTM gradually increases the skim rate and FlopR, while it also leads to reduced accuracy.
Controlling the skim rate. An important advantage of Skim-RNN is that the skim rate (and thus the computational cost) can be dynamically controlled at inference time by adjusting the threshold for the ‘skim’ decision probability (Equation 1). Figure 6 shows the trade-off between accuracy and computational cost for two settings, confirming the advantage of skimming ($d' > 0$) over skipping ($d' = 0$).
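Threshold-based control at inference time can be sketched as follows; the per-token probabilities here are invented for illustration.

```python
import numpy as np

def skim_mask(p_skim, threshold=0.5):
    """Skim every token whose skim probability p_t^2 meets the threshold."""
    return np.asarray(p_skim) >= threshold

p_skim = np.array([0.9, 0.4, 0.6, 0.2, 0.8])  # toy per-token skim probabilities

# Lowering the threshold skims more tokens (faster, possibly less accurate).
rate_default = skim_mask(p_skim, threshold=0.5).mean()  # 0.6
rate_eager = skim_mask(p_skim, threshold=0.3).mean()    # 0.8
```

No retraining is needed: the same trained decision probabilities are simply cut at a different operating point, much like moving along a precision-recall curve.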
Visualization. Figure 7 shows an example from SQuAD and visualizes which words Skim-LSTM reads (red) and skims (white). As expected, the model does not skim when the input seems relevant to answering the question. In addition, the LSTM in the second layer skims more than that in the first layer, mainly because the second layer is more confident about the importance of each token. More visualizations are shown in Appendix C.
4.3 Runtime Benchmarks
Here we briefly discuss the details of the runtime benchmarks for LSTM and Skim-LSTM, which allow us to estimate the speed-up of Skim-LSTM-based models in our experiments (corresponding to ‘Sp’ in Table 2). We assume a CPU-based benchmark by default, which correlates directly with the number of float operations (Flop); speed-up on GPUs depends hugely on parallelization, which is not relevant to our contribution. As mentioned previously, the speed-up results in Table 2 (as well as Figure 8 below) are benchmarked using Python (NumPy) instead of popular frameworks such as TensorFlow or PyTorch. In fact, we benchmarked the speed of a length-100 LSTM (batch size = 1) in all three frameworks on a single thread of CPU (averaged over 100 trials), and observed that NumPy is 1.5 and 2.8 times faster than TensorFlow and PyTorch, respectively (NumPy’s speed becomes similar to that of TensorFlow and PyTorch at larger hidden state sizes, and at even larger sizes NumPy becomes slower). This seems to be mostly because the frameworks are primarily optimized for GPUs and have larger overhead than NumPy, so they cannot take much advantage of reducing the size of the hidden state of the LSTM below 100.

Figure 8 shows the relative speed gain of Skim-LSTM compared to a standard LSTM with varying hidden state size and skim rate. We use NumPy, and the inferences are run on a single thread of CPU. We also plot the ratio between the number of float operations (FlopR) of LSTM and Skim-LSTM, which can be considered a theoretical upper bound of the speed gain on CPUs. We note two important observations. First, there is an inevitable gap between the actual gain (solid line) and the theoretical gain (dotted line). This gap grows with more framework overhead or more parallelization (e.g., multithreading). Second, the gap decreases as the hidden state size increases, because the overhead becomes negligible with very large matrix operations. Hence, the benefit of Skim-RNN is greater for larger hidden state sizes.
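The theoretical upper bound plotted in Figure 8 can be estimated with a back-of-the-envelope Flop count. The four-gate LSTM cost model below is a simplification (it counts only the gate matrix multiplications), and the sizes are examples, not the paper's exact configuration.

```python
def lstm_flops(d_hidden, d_input):
    """Approximate multiplications per LSTM step: four gates, each (d_input + d_hidden) x d_hidden."""
    return 4 * d_hidden * (d_input + d_hidden)

def theoretical_flop_reduction(d, d_small, skim_rate, d_input=None):
    """Expected FlopR of Skim-LSTM relative to a standard LSTM at a given skim rate."""
    d_input = d if d_input is None else d_input
    full, skim = lstm_flops(d, d_input), lstm_flops(d_small, d_input)
    expected = (1 - skim_rate) * full + skim_rate * skim
    return full / expected

# With d = 100, d' = 10, and a 90% skim rate, the theoretical bound is about 6.7x;
# actual CPU gains are smaller due to framework overhead (Section 4.3).
bound = theoretical_flop_reduction(d=100, d_small=10, skim_rate=0.9)
```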
Latency. A modern GPU has much higher throughput than a CPU thanks to parallel processing. However, for small networks, the CPU often has lower latency than the GPU. Comparing NumPy on a CPU with TensorFlow on a GPU (Titan X), we observe that the former has 1.5 times lower latency (75 μs vs. 110 μs per token) for the LSTM. This means that combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs. For instance, Skim-RNN on a CPU on IMDb has 4.5x lower latency than a GPU, requiring only 29 μs per token on average.
5 Conclusion
We present Skim-RNN, a recurrent neural network that can dynamically decide to use the big RNN (read) or the small RNN (skim) at each time step, depending on the importance of the input. While Skim-RNN has significantly lower computational cost than its RNN counterpart, its accuracy is still on par with or better than standard RNNs, LSTM-Jump, and VCRNN. Since Skim-RNN has the same input and output interface as an RNN, it can easily replace RNNs in existing applications. We also show that Skim-RNN can offer lower latency on a CPU than a standard RNN on a GPU. Future work involves applying Skim-RNN to applications that require a much larger hidden state, such as video understanding, and using multiple small RNN cells for varying degrees of skimming.
Acknowledgments
This research was supported by the NSF (IIS 1616112), Allen Distinguished Investigator Award, Samsung GRO award, and gifts from Google, Amazon, Allen Institute for AI, and Bloomberg. We thank the anonymous reviewers for their helpful comments.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 Balduzzi & Ghifary (2016) David Balduzzi and Muhammad Ghifary. Strongly-typed recurrent neural networks. In ICML, 2016.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Campos et al. (2017) Víctor Campos, Brendan Jou, Xavier Giró-i-Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834, 2017.
 Choi et al. (2017) Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarsetofine question answering for long documents. In ACL, 2017.
 Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. Recurrent neural network grammars. In NAACL, 2016.
 Hahn & Keller (2016) Michael Hahn and Frank Keller. Modeling human reading with neural attention. In EMNLP, 2016.
 Han et al. (2016) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
 Jernite et al. (2017) Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.
 Johansen et al. (2017) Alexander Johansen, Bryan McCann, James Bradbury, and Richard Socher. Learning when to read and when to skim, 2017. URL https://metamind.io/research/learningwhentoskimandwhentoread.
 Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017.
 Kembhavi et al. (2017) Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In CVPR, 2017.
 Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kokkinos & Potamianos (2017) Filippos Kokkinos and Alexandros Potamianos. Structural attention neural networks for improved sentiment analysis. arXiv preprint arXiv:1701.01811, 2017.
 Kong et al. (2016) Lingpeng Kong, Chris Dyer, and Noah A Smith. Segmental recurrent neural networks. In ICLR, 2016.
 Marcel Adam Just (1987) Marcel Adam Just and Patricia Anderson Carpenter. The Psychology of Reading and Language Comprehension. 1987.
 Mikolov et al. (2015) Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. Learning longer memory in recurrent neural networks. In ICLR Workshop, 2015.
 Min et al. (2017) Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. Question answering through transfer learning from large fine-grained supervision data. In ACL, 2017.
 Miyato et al. (2017) Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. In ICLR, 2017.
 Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
 Odena et al. (2017) Augustus Odena, Dieterich Lawson, and Christopher Olah. Changing model behavior at test-time using reinforcement learning. In ICLR Workshop, 2017.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 Reichle et al. (2003) Erik D Reichle, Keith Rayner, and Alexander Pollatsek. The E-Z Reader model of eye-movement control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26(4):445–476, 2003.
 Seo et al. (2017a) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.
 Seo et al. (2017b) Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. In ICLR, 2017b.
 Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In ACL, 2017.
 Welbl et al. (2017) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481, 2017.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Yang et al. (2017) Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen, and Ruslan Salakhutdinov. Words or characters? fine-grained gating for reading comprehension. In ICLR, 2017.
 Yu et al. (2017) Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.
 Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.
Appendix A Models and training details on SQuAD
A.1 LSTM+Attention details
Let $x_i$ and $q_j$ be the embeddings of the $i$-th context word and the $j$-th question word, respectively. We first obtain a $d$-dimensional representation of the entire question by computing a weighted average of the question word vectors: we obtain $\alpha_j = \mathrm{softmax}_j(w^\top q_j)$ and $\tilde{q} = \sum_j \alpha_j q_j$, where $w$ is a trainable weight vector and $\odot$ is element-wise multiplication. Then the input to the (two-layer) Bidirectional LSTMs at position $i$ is $[x_i; \tilde{q}; x_i \odot \tilde{q}]$. We use the outputs of the second-layer LSTM to independently predict the start index and the end index of the answer. We obtain the logits (to be softmaxed) of the start and end index distributions from weighted averages of the outputs (the weights are learned and are different for the start and the end). We minimize the sum of the negative log probabilities of the correct start/end indices.
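As a sketch of the input construction above (variable names and sizes are ours, not from the released code), the attention-weighted question summary and the per-word LSTM input can be computed as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def build_lstm_inputs(X, Q, w):
    """X: (T, d) context word embeddings; Q: (J, d) question word
    embeddings; w: (d,) weight vector (trainable in the real model)."""
    alpha = softmax(Q @ w)     # attention weights over question words
    q = alpha @ Q              # (d,) weighted average of question vectors
    # Input at step i: [x_i ; q ; x_i * q] (element-wise product)
    return np.concatenate([X, np.tile(q, (X.shape[0], 1)), X * q], axis=1)

rng = np.random.default_rng(0)
inp = build_lstm_inputs(rng.standard_normal((5, 8)),
                        rng.standard_normal((3, 8)),
                        rng.standard_normal(8))
# inp stacks three d-dimensional pieces per context word: shape (T, 3d)
```

Each row of `inp` would then be fed to the two-layer Bidirectional LSTM at the corresponding time step.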
A.2 Using a pre-trained model
(a) No pre-training
We train Skim LSTM from scratch. The skim rate is unstable, often too high or too low, differs substantially between the forward and backward directions of the LSTM, and comes with a significant loss in performance.
(b) Full pre-training
We fine-tune Skim LSTM from a fully pre-trained standard LSTM (F1 75.5, global step 18k). As we fine-tune the model, performance decreases while the skim rate increases.
(c) Half pre-training
We fine-tune Skim LSTM from a partially pre-trained standard LSTM (F1 70.7, pre-training stopped at 5k steps). Performance and skim rate increase together during training.
Appendix B Experiments on the Children's Book Test
In the Children's Book Test, the input is a sequence of 21 sentences, where the last sentence has one missing word (i.e., a cloze test). The system needs to predict the missing word, which is one of ten provided candidates. Following LSTM-Jump (Yu et al., 2017), we use a simple LSTM-based QA model, which lets us compare directly against LSTM and LSTM-Jump. We run a single-layer LSTM on the embeddings of the input tokens and use its last hidden state for classification, where the output distribution is obtained by applying softmax to the dot product between the embedding of each answer candidate and the hidden state. We minimize the negative log probability of the correct answers. We follow the same hyperparameter setup and evaluation metrics as in Section 4.1.
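A minimal sketch of this classification head (the embedding size and arrays are illustrative, not the paper's):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def candidate_distribution(h_last, cand_emb):
    """h_last: (d,) last LSTM hidden state over the 21 sentences;
    cand_emb: (10, d) embeddings of the ten answer candidates."""
    logits = cand_emb @ h_last      # one dot product per candidate
    return softmax(logits)

def nll_loss(h_last, cand_emb, gold_idx):
    """Negative log probability of the correct candidate."""
    return -np.log(candidate_distribution(h_last, cand_emb)[gold_idx])

rng = np.random.default_rng(1)
p = candidate_distribution(rng.standard_normal(16),
                           rng.standard_normal((10, 16)))
```

The predicted answer is simply `np.argmax(p)`; training minimizes `nll_loss` over the dataset.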
Results.
In Table 4, we first note that Skim LSTM obtains better results than both the standard LSTM and LSTM-Jump. As discussed in Section 4.1, we hypothesize that the increase in accuracy could be due to the stabilization of the recurrent hidden state over long distances. Second, with Skim LSTM we see a skim rate of up to 72.2%, a 3.6x reduction in the number of floating point operations, and a 2.2x reduction in actual benchmarked time on NumPy, with reasonable accuracy. Lastly, we note that LSTM, LSTM-Jump, and Skim LSTM all score significantly below the state-of-the-art models, which consist of a wide variety of components, such as attention mechanisms, that are often crucial for question answering.
Table 4: Results on CBT-NE and CBT-CN (Sk: skim rate in %; Flop-r: reduction in floating point operations; Sp: speed-up in benchmarked time).

Model                        Config   | CBT-NE                      | CBT-CN
                                      | Acc   Sk    Flop-r  Sp     | Acc   Sk    Flop-r  Sp
Standard LSTM                -        | 49.0  -     1.0x    1.0x   | 54.9  -     1.0x    1.0x
Skim LSTM                    10/0.01  | 50.9  35.8  1.6x    1.3x   | 56.3  43.7  1.8x    1.5x
Skim LSTM                    10/0.02  | 51.4  72.2  3.6x    2.3x   | 51.4  86.5  7.1x    3.3x
Skim LSTM                    20/0.01  | 36.4  98.8  50.0x   1.3x   | 38.0  99.1  50.0x   1.3x
Skim LSTM                    20/0.02  | 50.0  70.5  3.3x    1.2x   | 54.5  54.1  2.1x    1.1x
LSTM-Jump (Yu et al., 2017)  -        | 46.8  -     -       3.0x   | 49.7  -     -       6.1x
SOTA (Yang et al., 2017)     -        | 75.0  -     -       -      | 72.0  -     -       -
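To see roughly how a skim rate translates into a FLOP reduction, one can estimate the expected per-step cost under simplifying assumptions: gate matmuls dominate the cost, and the skim cell consumes the full previous hidden state. The sizes below are hypothetical and this is only a back-of-the-envelope estimate, not the paper's exact FLOP accounting.

```python
def flop_reduction(skim_rate, d_in, d_big, d_small):
    """Expected FLOP reduction factor of SkimRNN over the big cell alone,
    counting only the LSTM gate matmuls (a rough approximation)."""
    f_big = 4 * d_big * (d_in + d_big)      # 'read' cell, output size d_big
    f_small = 4 * d_small * (d_in + d_big)  # 'skim' cell, output size d_small
    expected = (1 - skim_rate) * f_big + skim_rate * f_small
    return f_big / expected

# Hypothetical sizes: input 100, big cell 200, small cell 10, 72.2% skimming.
r = flop_reduction(0.722, 100, 200, 10)  # roughly a 3.2x reduction
```

The estimate shows the expected pattern: the reduction grows quickly with the skim rate and with the ratio of big to small cell size, which is consistent with the near-50x Flop-r at 99% skim rates in Table 4.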