Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks

11/18/2016 ∙ by Volkan Cirik, et al. ∙ Carnegie Mellon University

Curriculum Learning emphasizes the order of training instances in a computational learning setup. The core hypothesis is that simpler instances should be learned early as building blocks to learn more complex ones. Despite its usefulness, it is still unknown how exactly the internal representations of models are affected by curriculum learning. In this paper, we study the effect of curriculum learning on Long Short-Term Memory (LSTM) networks, which have shown strong competency in many Natural Language Processing (NLP) problems. Our experiments on a sentiment analysis task and a synthetic task similar to sequence prediction tasks in NLP show that curriculum learning has a positive effect on the LSTM's internal states by biasing the model towards building constructive representations, i.e., the internal representations at the previous timesteps are used as building blocks for the final prediction. We also find that smaller models improve significantly when they are trained with curriculum learning. Lastly, we show that curriculum learning helps more when the amount of training data is limited.





Inspired by the human learning process, Curriculum Learning (Elman, 1993; Bengio et al., 2009) is an algorithm that emphasizes the order of training instances in a computational learning setup. The main idea is that learning easy instances first could be helpful for learning more complex ones later in training. The first algorithm, proposed by Bengio et al. (2009), which we refer to as one-pass curriculum, creates disjoint sets of training examples ordered by complexity and uses them separately during training. The second algorithm, called baby step curriculum, uses an incremental approach where groups of more complex examples are incrementally added to the training set (Spitkovsky, Alshawi, and Jurafsky, 2010). These curriculum learning regimens were shown to improve performance in some Natural Language Processing and Computer Vision tasks (Pentina, Sharmanska, and Lampert, 2015; Spitkovsky, Alshawi, and Jurafsky, 2010).

Despite its usefulness, it is still unknown how exactly computational models are affected internally by curriculum learning. An example of a computational model particularly relevant to Natural Language Processing is the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). LSTM networks have shown competitive performance in several domains such as handwriting recognition (Graves et al., 2009) and parsing (Vinyals et al., 2015). Surprisingly, to our knowledge, curriculum learning has not been studied in the context of LSTM networks. Detailed visualizations and analyses of curriculum learning regimens with LSTMs will allow us to better understand how models are affected and provide insights into when to use these regimens. Knowing how curriculum learning works, we can design new extensions and understand the nature of the tasks most suited for these learning regimens.

In this paper, we study the effect of curriculum learning on LSTM networks. We created experiments to directly compare two curriculum learning regimens, one-pass and baby step, with two baseline approaches that include the conventional technique of randomly ordering the training samples. We use two benchmarks for our analyses. First, we design a synthetic task that is similar to several Natural Language Processing tasks, where a sequence of symbols is observed and a particular function (e.g. one analogous to a linguistic or a semantic phenomenon) is to be learned. Second, we use sentiment analysis, a fundamental task in Natural Language Processing where the polarity of subjective opinions is classified. To our knowledge, this is the first work studying LSTM networks on sentiment analysis with curriculum learning.

Our visualizations and analyses on these two sequence tasks are designed to study three main factors. First, we compare the four learning regimens on how the LSTM network’s internal representations change as the final prediction is computed. To this end, we simply decode the representations at intermediate steps. This analysis helps us understand how a model handles the task with the help of curriculum learning. Second, we investigate how the performance of models with different complexities is affected by curriculum learning. Smaller yet accurate models are crucial in limited-resource settings. Third, we study how the performance of curriculum learning changes in low-resource setups. This analysis provides valuable information, considering that low-resource scenarios are common in several data-driven learning domains such as Natural Language Processing.

Related Work

We review a list of topics related to our work in the context of curriculum learning (CL), analysis of neural networks and sentiment analysis with neural networks.

Curriculum Learning.

Motivated by children’s language learning, Elman (1993) studies the effect of the learning regimen on a synthetic grammar task. He shows that a Recurrent Neural Network (RNN) is able to learn a grammar when training data is presented in simple-to-complex order and fails to do so when the order is random.

Bengio et al. (2009) investigate CL from an optimization perspective. Their experiments on synthetic vision and word representation learning tasks show that CL results in better generalization and faster learning. Spitkovsky, Alshawi, and Jurafsky (2010) apply a CL strategy to learn an unsupervised parser for sentences of length $n$ and initialize the next parser for sentences of length $n+1$ with the previously learned one. They show that learning a hybrid model using the parsers learned for each sentence length achieves a significant improvement over a baseline model. Pentina, Sharmanska, and Lampert (2015) investigate CL in a multi-task learning setup and propose a model to learn the order of multiple tasks. Their experiments on a set of vision tasks show that learning tasks sequentially is better than learning them jointly. Jiang et al. (2015) provide a general framework for CL and Self-Paced Learning where the model picks which instances to train on based on a simplicity metric. The proposed framework is able to combine prior knowledge of the curriculum with Self-Paced Learning in the learning objective.

Long Short-Term Memory Networks.

LSTM networks are a variant of RNNs (Elman, 1990) capable of storing information and propagating loss over long distances. By using a gating mechanism that controls the information flow into the internal representation, they avoid the problems of training RNNs (Bengio, Simard, and Frasconi, 1994; Pascanu, Mikolov, and Bengio, 2012). Several architectural variants have been proposed to improve the basic model (Cho et al., 2014; Chung et al., 2015; Yao et al., 2015; Kalchbrenner, Danihelka, and Graves, 2015; Dyer et al., 2015; Grefenstette et al., 2015).

Visualization of Neural Networks.

Although many neural network studies provide quantitative analysis, there are few qualitative analyses of neural networks. Zeiler and Fergus (2014) visualize the feature maps of a Convolutional Neural Network (CNN) (LeCun et al., 1998). They show that feature maps at different layers are sensitive to different shapes, textures, and objects. Similarly, Karpathy, Johnson, and Li (2015) analyze LSTMs on character-level language modeling. Their analysis shows that deeper models with gating mechanisms achieve better results. They show that some LSTM cells learn to detect patterns and how RNNs learn to generalize to longer sequences. More recently, Li et al. (2016) use visualization to show how neural network models handle several linguistic phenomena such as compositionality and negation using sentiment analysis and sequence auto-encoding.

Synthetic Tasks. Since the early days of neural networks, synthetic tasks have been used to test the capabilities of models (Fahlman, 1991) and often serve as unit tests for machine learning models (Weston et al., 2015). Similar to the first work on LSTM (Hochreiter and Schmidhuber, 1997), many contemporary neural network models (Graves, Wayne, and Danihelka, 2014; Kurach, Andrychowicz, and Sutskever, 2015; Sukhbaatar et al., 2015; Vinyals, Fortunato, and Jaitly, 2015) use synthetic tasks to compare and contrast several architectures. Inspired by these studies, we also use a synthetic task as one of our tasks to understand the effect of CL on LSTMs.

Sentiment Analysis with Neural Networks.

Several approaches have been proposed to solve sentiment analysis using neural networks. Socher et al. (2013) propose Recursive Neural Networks to exploit the syntactic structure of a sentence. A number of extensions of this model have been proposed in the context of sentiment analysis (Irsoy and Cardie, 2014; Tai, Socher, and Manning, 2015). Other proposed approaches use CNNs (Kalchbrenner, Grefenstette, and Blunsom, 2014; Kim, 2014) and the averaging of word vector models (Iyyer et al., 2015; Le and Mikolov, 2014).

To our knowledge, this work is the first to study how the internal representations of LSTMs change in a curriculum learning setup.

Curriculum Learning Regimens

Curriculum learning emphasizes the order of training instances, prioritizing simpler instances before the more complex ones. In this section, we describe two curriculum learning regimens: one-pass curriculum originally proposed by Bengio et al. (2009) and baby step curriculum from Spitkovsky, Alshawi, and Jurafsky (2010). For both regimens, we develop the curriculum using the same strategy proposed by Spitkovsky, Alshawi, and Jurafsky (2010) who assume that shorter sequences are easier to learn.

The following subsections describe the two curriculum learning regimens as well as two baseline learning regimens.

One-Pass Curriculum

Bengio et al. (2009) propose to use a dataset with simpler instances in the first phase of training. After some number of iterations, they switch to the harder target dataset. The intuition is that after some training on simpler data, the model is ready to handle the harder target data. Here, we name this regimen One-Pass curriculum (see Algorithm 1). The training data $D$ is sorted by a curriculum $C$ and distributed into $k$ buckets. Training starts with the easiest bucket. Unlike the previous work (Bengio et al., 2009), we use early stopping: training on a bucket stops when the loss or the task’s accuracy criterion on a held-out set does not improve for $p$ epochs. Afterward, the next bucket is used and trained on in the same way. The whole training is stopped after all buckets are used. Note that the model uses each bucket only once during training, hence the name.

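procedure OP-Curriculum($M$, $D$, $C$)
    $D' = \mathrm{sort}(D, C)$
    $\{D^1, D^2, \ldots, D^k\} = D'$ where $C(d_a) < C(d_b)$, $d_a \in D^i$, $d_b \in D^j$, $\forall i < j$
    for $s = 1 \ldots k$ do
        while not converged for $p$ epochs do
            train($M$, $D^s$)
        end while
    end for
end procedure
Algorithm 1 One-Pass Curriculum ($M$: model, $D$: training data, $C$: curriculum, $k$: number of buckets, $p$: early-stopping patience in epochs)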
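
As an illustration of the One-Pass regimen, the following Python sketch sorts the data by a difficulty function, cuts it into contiguous buckets, and trains on each bucket once with patience-based early stopping. The callbacks `train_epoch` and `validate`, the equal-size bucket split, and all function names are assumptions made for this sketch and are not taken from the paper.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def make_buckets(data: Sequence[T], difficulty: Callable[[T], float], k: int) -> List[List[T]]:
    """Sort instances from easy to hard and cut them into k contiguous buckets."""
    ordered = sorted(data, key=difficulty)
    n = len(ordered)
    # Assumption: buckets are (nearly) equal-sized slices of the sorted data.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def train_until_stale(train_epoch, validate, bucket, patience: int) -> None:
    """Train on one bucket until the held-out loss has not improved for `patience` epochs."""
    best, stale = float("inf"), 0
    while stale < patience:
        train_epoch(bucket)      # hypothetical callback: one epoch over `bucket`
        loss = validate()        # hypothetical callback: loss on the held-out set
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1


def one_pass_curriculum(train_epoch, validate, data, difficulty, k: int, patience: int) -> None:
    """Visit each bucket exactly once, easiest first."""
    for bucket in make_buckets(data, difficulty, k):
        train_until_stale(train_epoch, validate, bucket, patience)
```

With sequence length as the difficulty function, `one_pass_curriculum(train_epoch, validate, sequences, len, k, patience)` trains on short sequences first and never revisits them once their bucket has converged.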

Baby Steps Curriculum

The intuition behind Baby Steps curriculum (Bengio et al., 2009; Spitkovsky, Alshawi, and Jurafsky, 2010) is that simpler instances in the training data should not be discarded; instead, the complexity of the training data should be increased gradually. After distributing the data into buckets based on a curriculum, training starts with the easiest bucket. When the loss or the task’s accuracy criterion on a held-out set does not improve for $p$ epochs, the next bucket is merged into the current training data. The whole training is stopped after all buckets are used (see Algorithm 2).

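procedure BS-Curriculum($M$, $D$, $C$)
    $D' = \mathrm{sort}(D, C)$
    $\{D^1, D^2, \ldots, D^k\} = D'$ where $C(d_a) < C(d_b)$, $d_a \in D^i$, $d_b \in D^j$, $\forall i < j$
    $D^{train} = \emptyset$
    for $s = 1 \ldots k$ do
        $D^{train} = D^{train} \cup D^s$
        while not converged for $p$ epochs do
            train($M$, $D^{train}$)
        end while
    end for
end procedure
Algorithm 2 Baby Steps Curriculum ($M$: model, $D$: training data, $C$: curriculum, $k$: number of buckets, $p$: early-stopping patience in epochs)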
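
The Baby Steps regimen differs from One-Pass only in that earlier buckets stay in the training set. A minimal Python sketch, assuming the buckets are already ordered from easiest to hardest and that `train_epoch` and `validate` are caller-supplied callbacks (both hypothetical names), could look like this. Restarting the early-stopping state at each stage is a simplification chosen for the sketch.

```python
def baby_steps_curriculum(train_epoch, validate, buckets, patience):
    """Grow the training set one bucket at a time, easiest bucket first."""
    train_set = []
    for bucket in buckets:
        # Merge the next bucket into the data seen so far instead of replacing it.
        train_set = train_set + list(bucket)
        # Assumption: the early-stopping state restarts at every stage.
        best, stale = float("inf"), 0
        while stale < patience:
            train_epoch(train_set)   # hypothetical callback: one epoch over the merged data
            loss = validate()        # hypothetical callback: loss on the held-out set
            if loss < best:
                best, stale = loss, 0
            else:
                stale += 1
```

In the last stage the model trains on the full dataset, so the regimen ends in the same data distribution as conventional training.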

| Input Sequence | Ground Truth | Baby Steps Curriculum | One-Pass Curriculum | Sorted | No-CL (std) |
|---|---|---|---|---|---|
| 1 | 1 | 0.23 | 0.00 | 0.86 | 0.88 (0.20) |
| 10 | 1 | 1.03 | 0.00 | 1.32 | 1.10 (0.28) |
| 109 | 10 | 10.27 | 0.00 | 10.59 | 10.20 (0.78) |
| 1091 | 11 | 11.40 | 0.00 | 11.29 | 10.88 (0.95) |
| 10917 | 18 | 18.62 | 0.00 | 19.30 | 17.89 (1.49) |
| 109173 | 21 | 21.54 | 0.00 | 22.99 | 20.84 (1.94) |
| 1091735 | 26 | 26.77 | 4.29 | 29.34 | 25.90 (2.39) |
| 10917356 | 32 | 32.82 | 14.19 | 37.21 | 32.00 (2.87) |
| 109173567 | 39 | 40.56 | 29.28 | 46.38 | 39.12 (3.12) |
| 1091735670 | 39 | 40.53 | 32.20 | 48.01 | 39.26 (3.00) |
| 10917356706 | 45 | 46.73 | 52.86 | 55.50 | 45.95 (3.14) |
| 109173567064 | 49 | 50.96 | 67.70 | 60.58 | 50.32 (3.09) |
| 1091735670642 | 51 | 52.82 | 74.59 | 63.03 | 52.6 (2.94) |
| 10917356706428 | 59 | 61.01 | 83.31 | 70.91 | 61.11 (2.87) |
| 109173567064286 | 65 | 67.78 | 87.42 | 76.54 | 67.97 (2.77) |
| 1091735670642861 | 66 | 69.27 | 83.43 | 76.46 | 69.33 (2.57) |
| 10917356706428614 | 70 | 72.32 | 82.60 | 78.88 | 73.05 (2.36) |
| 109173567064286145 | 75 | 76.77 | 83.34 | 81.56 | 77.67 (2.27) |
| 1091735670642861451 | 76 | 78.57 | 80.42 | 80.44 | 78.07 (2.19) |
| 10917356706428614516 | 82 | 83.05 | 82.52 | 83.68 | 83.36 (2.03) |

Table 1: Probing of the LSTM model at intermediate timesteps for the Digit Sum dataset. The left column is the input prefix; its final digit is the most recent input digit. Ground Truth is the running sum up to that point. The remaining columns give the values predicted by the LSTM model trained with each regimen. The numbers in parentheses are standard deviations. The intermediate representation of the Baby Step curriculum model is closest to the running sum of the input sequence.

Baseline Regimens

The first baseline, named No-CL, is the common practice of shuffling the training data. For a neural network like our LSTM models, this means that training is performed as usual, where one epoch sees the whole training set in random order. For all experiments described in the following section, models learned with the No-CL regimen are trained 10 times to get a proper average performance. (Note that this favors No-CL due to lower variance in results.)

The second baseline, named Sorted, also sees all the data at each epoch, but the ordering of the training instances is based on the curriculum. This is a simplification of the two CL regimens presented in the previous subsections since we are not partitioning the data based on its complexity (i.e., based on the curriculum). We are simply reordering the training set. A comparison between the No-CL and Sorted baselines will allow us to study the importance of training instance ordering.
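
The difference between the two baselines comes down to how one epoch's instances are ordered. The sketch below is one way to express that, assuming sequence length as the difficulty measure. The function name `epoch_order` and the regimen strings are hypothetical.

```python
import random


def epoch_order(data, regimen, difficulty=len, rng=None):
    """Return the order in which one epoch visits the full training set."""
    rng = rng or random.Random()
    if regimen == "no-cl":
        order = list(data)
        rng.shuffle(order)                    # conventional random ordering
        return order
    if regimen == "sorted":
        return sorted(data, key=difficulty)   # easy-to-hard, but no bucketing
    raise ValueError(f"unknown regimen: {regimen}")
```

Both baselines see every instance in every epoch; only the CL regimens restrict which instances are visible at a given stage.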


The main goal of our experiments is to better understand how computational models, specifically LSTM networks, are affected internally by CL. We aim to observe (1) the effect of CL on the internal model representations, (2) how the number of model parameters affects the performance of CL, and (3) how the amount of training data changes the contribution of CL.

The following subsections present the LSTM network and our experimental probing methodology for analyzing the LSTM’s internal representations at different stages in the sequence modeling process.

Figure 1: Visualization of the input digit at time $t$ vs. $\Delta_t$ (the average difference between predictions at $t$ and $t-1$). Shaded areas represent the variance. We expect $\Delta_t$ to follow the input digit. The model trained with Baby Step curriculum shows the least variance around the input digit. Note that for No-CL, we plot all 10 runs (with different seeds).


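We now describe LSTM networks. Let $x_1, \ldots, x_N$ be a sequence of one-hot coded symbols of length $N$. At each time step $t$ the LSTM updates its cells as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W \begin{pmatrix} E x_t \\ h_{t-1} \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh(c_t)$$

In the above equations, $W$ is a $[4n \times (m + n)]$ matrix used to calculate the gate weights and the new memory information $g_t$, where $n$ is the number of hidden units and $m$ is the embedding size. $E$ is an embedding matrix for symbols. At time $t$, the sigmoid (sigm) and tanh (tanh) non-linearities are applied element-wise to the embedding representation of the input $E x_t$ and the output of the network from the previous time step $h_{t-1}$. Vectors $i_t, f_t, o_t \in \mathbb{R}^n$ are gates with entries in $(0, 1)$ controlling the input, forget and output respectively. The vector $g_t$ additively modifies the cell $c_t$.

We use the final hidden representation $h_N$ of the LSTM for prediction with a projection matrix. In the case of regression, we use $\hat{y} = \mathrm{relu}(W_r h_N)$ where $W_r$ is $[1 \times n]$ and relu is the rectified linear non-linearity. For classification, we predict one of $K$ class labels using $\hat{y} = \mathrm{softmax}(W_c h_N)$ where $W_c$ is $[K \times n]$ and softmax is the softmax function.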
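
To make the update concrete, here is a minimal NumPy sketch of one LSTM step in the standard formulation without peephole connections, followed by the two prediction heads. The names `W`, `E`, `w_r` and `W_c`, the gate ordering inside `W`, and the absence of bias terms are assumptions of this sketch.

```python
import numpy as np


def sigm(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(W, E, x_onehot, h_prev, c_prev):
    """One LSTM update. W: (4n, m + n), E: (m, vocab), h_prev and c_prev: (n,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([E @ x_onehot, h_prev])   # pre-activations, shape (4n,)
    i = sigm(z[:n])              # input gate
    f = sigm(z[n:2 * n])         # forget gate
    o = sigm(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:])       # candidate memory
    c = f * c_prev + i * g       # cell is modified additively
    h = o * np.tanh(c)
    return h, c


def predict_regression(w_r, h):
    """Regression head: relu of a linear projection of the hidden state."""
    return max(0.0, float(w_r @ h))


def predict_class_probs(W_c, h):
    """Classification head: softmax over K class scores. W_c: (K, n)."""
    scores = W_c @ h
    exp = np.exp(scores - scores.max())   # shift for numerical stability
    return exp / exp.sum()
```

Running `lstm_step` over a sequence from zero-initialized `h` and `c` and passing the last `h` to one of the heads gives the model's final prediction.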

Probing Internal States of LSTM

We aim to observe how the use of the internal representation of LSTM at intermediate steps changes depending on the learning regimen.

Each internal representation $h_t$ is probed using the projection function learned for regression or the one learned for classification. By moving these probes along the sequence, we can study the intermediate representation at each time $t$.
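
Probing amounts to applying the already-trained output head to every intermediate hidden state rather than only the last one. A small NumPy sketch, assuming the hidden states of one sequence are stacked row-wise in an array (the names `hidden_states`, `w_r` and `W_c` are illustrative):

```python
import numpy as np


def probe_regression(hidden_states, w_r):
    """Apply the regression head to every timestep. hidden_states: (T, n), w_r: (n,)."""
    return np.maximum(0.0, hidden_states @ w_r)


def probe_classification(hidden_states, W_c):
    """Apply the softmax head to every timestep. W_c: (K, n); returns (T, K) probabilities."""
    scores = hidden_states @ W_c.T
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)
```

No parameters are retrained for probing, so the per-timestep outputs reflect only what the final-step head can read from each intermediate state.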

Digit Sum

We aim to simulate a low-resource sequence regression problem, since many NLP tasks have only a few thousand annotated samples. To this end, the Digit Sum task is posed as follows. Given a sequence of digit symbols, the model is expected to predict the sum of the digits. For instance, given the sequence "5 0 2 4 6", the expected output is 17.

The Digit Sum task has similarities with our second sequence task, sentiment analysis: digits are analogous to the word tokens in natural language text, and the summation is analogous to the subjective position of a sequence on a topic. Our two evaluation tasks also have some interesting differences, which allow us to evaluate a broader range of sequence learning tasks. First, in sentiment analysis the order of words makes a difference, whereas in the Digit Sum the order of digits does not change the expected answer. Second, the learning setup is a classification of polarity levels for sentiment analysis, whereas it is a regression for the Digit Sum.

Dataset Details. We define the evaluation task in the Digit Sum dataset as the summation of 20 digits, a typical sentence length in natural language. Both the validation and testing sets contain 200 randomly generated sequences of 20 digits. The training set consists of 1000 sequences for each length from 2 to 20, which allows us to develop the curriculum automatically following the procedure of Spitkovsky, Alshawi, and Jurafsky (2010). This results in a dataset of 19K instances. (We experimented with a 10x smaller dataset and observed very similar results. We do not report these due to limited space.)
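
A dataset with these properties can be generated in a few lines. The sketch below assumes digits are drawn uniformly at random, which the text does not state explicitly. The function names and the seed handling are hypothetical.

```python
import random


def make_digit_sum_split(num_sequences, length, rng):
    """Generate (digits, sum) pairs for sequences of a fixed length."""
    sequences = [[rng.randrange(10) for _ in range(length)] for _ in range(num_sequences)]
    return [(seq, sum(seq)) for seq in sequences]


def make_digit_sum_training_set(per_length=1000, min_len=2, max_len=20, seed=0):
    """1000 sequences for each length from 2 to 20, i.e. 19 * 1000 = 19K instances."""
    rng = random.Random(seed)
    data = []
    for length in range(min_len, max_len + 1):
        data.extend(make_digit_sum_split(per_length, length, rng))
    return data
```

Validation and test sets would be produced by `make_digit_sum_split(200, 20, rng)`, and the curriculum is simply the length of each sequence.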

Experimental Details. We use LSTMs without peephole connections and vary the number of hidden units. For all configurations, the digit embeddings and the hidden units have the same size. We use RMSprop (Dauphin et al., 2015) with a learning rate of 0.001 and a decay rate of 0.9, with minibatches of size 128. The patience parameter for early stopping is 10. We use Dropout (Srivastava et al., 2014) with a rate in {0, 0.25, 0.5}, as suggested for LSTMs by Gal (2015).
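
For reference, the commonly used form of the RMSprop update with the stated learning rate and decay rate is sketched below. The stabilizing constant `eps` and the exact variant are assumptions, since the text does not specify them.

```python
import numpy as np


def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update; `cache` is the running average of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)   # eps is an assumed stabilizer
    return param, cache
```

Each parameter's step is scaled by the recent magnitude of its own gradient, so the effective step size stays near the learning rate regardless of the raw gradient scale.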


Probing Internal Model Representations. We analyze the behavior of the model by using the intermediate representations during the processing of a sequence. As discussed previously, we feed the hidden representation at each digit to the regression node to obtain a prediction at each timestep of the input. Table 1 shows the input sequence, the ground truth, and the predictions of the best model for each learning regimen based on validation loss. The predictions of the models trained with the One-Pass curriculum and the Sorted baseline show no correlation with the running sum of the digits. The Baby Step curriculum model is able to predict values similar to the running sum.

Figure 2: Predictions at intermediate tokens. Colors represent the polarity and heights represent the prediction probability. Baby Step curriculum shows consistent behavior: it flips the prediction only after observing a positive sub-phrase.

A sequence model can learn to solve this task in numerous ways, such as memorizing sequences due to overfitting, using a count table of the digits, or keeping a running sum at each time step. To analyze this, we report the average difference between successive predictions (we call it $\Delta_t$) against the last input digit (see Figure 1). At each timestep, the model trained with Baby Step curriculum updates the hidden representation such that it correlates with the sum of the digits observed up to that point. It is also interesting to observe that Baby Step curriculum shows lower variance than the average of 10 random starts (No-CL). We emphasize that the models are provided with the same sequences for training, yet the learning regimens result in different models.
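
The idea can be checked on the single sequence reported in Table 1. The sketch below computes the difference between successive probed predictions and measures how far it is from the digit just read. Treating the prediction before any input as 0 and using the mean absolute gap as a summary are assumptions of this sketch, and one sequence is only an illustration of the dataset-level average in Figure 1.

```python
# Input digits and probed predictions copied from Table 1.
digits = [int(ch) for ch in "10917356706428614516"]
baby_steps = [0.23, 1.03, 10.27, 11.40, 18.62, 21.54, 26.77, 32.82, 40.56, 40.53,
              46.73, 50.96, 52.82, 61.01, 67.78, 69.27, 72.32, 76.77, 78.57, 83.05]
one_pass = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 4.29, 14.19, 29.28, 32.20,
            52.86, 67.70, 74.59, 83.31, 87.42, 83.43, 82.60, 83.34, 80.42, 82.52]


def successive_differences(predictions):
    """Difference between the prediction at t and at t-1 (assumed 0 before any input)."""
    previous, deltas = 0.0, []
    for prediction in predictions:
        deltas.append(prediction - previous)
        previous = prediction
    return deltas


def mean_abs_gap(predictions, digits):
    """Average distance between each prediction increment and the digit just read."""
    deltas = successive_differences(predictions)
    return sum(abs(d - x) for d, x in zip(deltas, digits)) / len(digits)
```

A model that keeps a running sum should have increments close to the input digits, so a smaller `mean_abs_gap` indicates behavior closer to a running sum.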

Figure 3: Mean Squared Error vs. the number of LSTM units. With a much smaller model, Baby Step curriculum achieves the best results. The other regimens require the right model complexity to achieve comparable results.

Effects on Models With Different Complexities. Our next experiment studies the effect of learning regimens on models with different complexities. Figure 3 shows the Mean Squared Error (MSE) results for the Digit Sum with LSTMs of varying hidden unit sizes. Baby Step curriculum achieves consistently better results even when the model has far fewer parameters. Other regimens require the model to have the right complexity. Efficient training of small models is particularly important if we do not have large annotated datasets to train big models. In addition, from a practical perspective, smaller yet accurate models can be deployed as fast and accurate models in limited-resource settings (Buciluǎ, Caruana, and Niculescu-Mizil, 2006; Hinton, Vinyals, and Dean, 2015).

Sentiment Analysis

Sentiment analysis is an application of NLP that identifies the polarity of subjective information in a given source (Pang and Lee, 2008). We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013), an extension of an earlier dataset (Pang and Lee, 2005), which has labels for 215,154 phrases in the parse trees of 11,855 sentences sampled from movie reviews. Real-valued sentiment labels are converted to an integer ordinal label in 0,..,4 by simple thresholding for five classes: very negative, negative, neutral, positive, and very positive. The task is therefore posed as a 5-class classification problem.
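
The thresholding step can be sketched as follows. The cut-offs 0.2, 0.4, 0.6 and 0.8 on a score in [0, 1] are the ones commonly used with SST and are an assumption here, since the text only says "simple thresholding".

```python
def sentiment_class(score):
    """Map a real-valued sentiment score in [0, 1] to an ordinal label in 0..4."""
    cutoffs = (0.2, 0.4, 0.6, 0.8)   # assumed cut-offs, not stated in the text
    # Count how many cut-offs the score exceeds: 0 = very negative, 4 = very positive.
    return sum(score > c for c in cutoffs)
```

Under these cut-offs a score of 0.5 falls in the neutral class (label 2).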

| Regimen | All | Conjunctions |
|---|---|---|
| No-CL from (Tai, Socher, and Manning, 2015) | 46.4 (1.1) | - |
| No-CL (our implementation) | 46.83 (1.1) | 43.88 (1.9) |
| Curriculum Sorted | 47.42 | 42.88 |
| One-Pass Curriculum | 45.74 | 43.09 |
| Baby Steps Curriculum | 47.37 | 46.07 |

Table 2: Classification accuracies of the training regimens on the sentiment analysis task. The numbers in parentheses are standard deviations. The model gets better at conjunctions if it is trained with Baby Step curriculum.

Dataset Details. We use the standard train/dev/test splits of 8544/1101/2210 for the 5-class classification problem. We flatten the annotated tree structure into sequences of phrases to use the finer-grained annotations. We treat the words within the span of an inner phrase as a sequence and use the phrase’s annotation as the label. This results in a bigger training set of 155,019 instances.
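
The flattening step can be illustrated with a small recursive traversal. The nested-tuple tree representation below is hypothetical and chosen only for this sketch. Whether single-word phrases are kept as training instances is also an assumption.

```python
def flatten_phrases(tree):
    """Turn a labeled tree into (tokens, label) pairs, one per node.

    Hypothetical representation: a leaf is (label, word) and an inner node is
    (label, child, child, ...).
    """
    phrases = []

    def visit(node):
        label, *rest = node
        if len(rest) == 1 and isinstance(rest[0], str):
            tokens = [rest[0]]                      # leaf: a single word
        else:
            tokens = [tok for child in rest for tok in visit(child)]
        phrases.append((tokens, label))             # the span inherits the node's label
        return tokens

    visit(tree)
    return phrases
```

Because every phrase becomes its own training sequence, the flattened set contains many short sequences, which is what makes a length-based curriculum applicable to this dataset.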

Experimental Details. We follow the previous work (Tai, Socher, and Manning, 2015) for the empirical setup. We use a single-layer LSTM with 168 units for the 5-class classification task. We initialize the word embeddings using 300-dimensional GloVe vectors (Pennington, Socher, and Manning, 2014) and fine-tune them during training. For optimization, we use RMSprop (Dauphin et al., 2015) with a learning rate of 0.001 and a decay rate of 0.9, with mini-batches of size 128. The patience parameter for early stopping is 10.


As the first step of our more detailed analysis, Table 2 reports the overall performance of the four learning regimens and the original results stated by Tai, Socher, and Manning (2015). The advantage of CL is most prominent when predicting sentiment for sentences with conjunctions (last column in Table 2). For conjunctions where a span of text contradicts or supports the overall sentiment polarity, the Baby Step model achieves significantly better results than the others. We take a closer look at the LSTM modeling process using a probing technique similar to the one used for the Digit Sum dataset.

Figure 4: The effect of the regimen vs. the amount of training data. The One-Pass and Baby Steps curriculum regimens get better results when the training data is limited. All regimens converge to similar points when the amount of data increases.

Probing Intermediate Representations. In Figure 2, we qualitatively show how different models process a sentence with a contrastive conjunction originally demonstrated by Socher et al. (2013). For each model, we plot the sentiment polarity and the probability of the prediction for that polarity after observing each word token. Unlike the others, the Baby Step model changes the sentiment at the appropriate time: after observing “spice”, which constructs a positive statement with the sub-phrase “but it has just enough spice”. Handling contrastive conjunctions requires a model to accurately merge two conflicting signals (i.e., positive and negative) coming from two directions (i.e., the left phrase and the right phrase) (Socher et al., 2013). Considering the LSTM’s limited capacity due to using only the signal coming from previous timesteps (i.e., processing the sentence from left to right), this result is particularly interesting because Baby Step CL boosts the LSTM’s performance.

Effect of Training Data Size. To investigate the role of the amount of training data, we train each learning regimen on varying fractions of the training data. Figure 4 shows the results. CL regimens help when training data is limited. When the amount of training data increases, the difference between the regimens shrinks. This result suggests that in low-resource setups, which are common in NLP, CL could be useful for improving a model’s performance.


We examined curriculum learning on two sequence prediction tasks. Our analyses showed that curriculum learning regimens based on a shorter-first approach help LSTMs construct a partial representation of the sequence in a more intuitive way. We demonstrated that curriculum learning helps smaller models improve their performance and contributes more in low-resource setups. Using quantitative and qualitative analyses on sentiment analysis, we showed that a model trained with Baby Step curriculum improves significantly on sentences with conjunctions, suggesting that curriculum learning helps LSTMs learn longer sequences and the functional role of conjunctions.