Introduction
Inspired by the human learning process, Curriculum Learning (Elman, 1993; Bengio et al., 2009) is an algorithm that emphasizes the order of training instances in a computational learning setup. The main idea is that learning easy instances first could be helpful for learning more complex ones later in training. The first algorithm, proposed by Bengio et al. (2009), which we refer to as one-pass curriculum, creates disjoint sets of training examples ordered by complexity and uses them separately during training. The second algorithm, called baby step curriculum, uses an incremental approach where groups of more complex examples are incrementally added to the training set (Spitkovsky, Alshawi, and Jurafsky, 2010). These curriculum learning regimens were shown to improve performance in some Natural Language Processing and Computer Vision tasks (Pentina, Sharmanska, and Lampert, 2015; Spitkovsky, Alshawi, and Jurafsky, 2010).
Despite its usefulness, it is still unknown how exactly computational models are affected internally by curriculum learning. An example of a computational model particularly relevant to Natural Language Processing is the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). LSTM networks have shown competitive performance in several domains such as handwriting recognition (Graves et al., 2009) and parsing (Vinyals et al., 2015). Surprisingly, to our knowledge curriculum learning has not been studied in the context of LSTM networks. Detailed visualizations and analyses of curriculum learning regimens with LSTMs will allow us to better understand how models are affected and provide insights into when to use these regimens. Knowing how curriculum learning works, we can design new extensions and understand the nature of the tasks most suited for these learning regimens.
In this paper, we study the effect of curriculum learning on LSTM networks. We created experiments to directly compare two curriculum learning regimens, one-pass and baby step, with two baseline approaches that include the conventional technique of randomly ordering the training samples. We use two benchmarks for our analyses. First, we design a synthetic task similar to several Natural Language Processing tasks, where a sequence of symbols is observed and the goal is to learn a particular function (e.g., one analogous to a linguistic or semantic phenomenon). Second, we use sentiment analysis, where the polarity of subjective opinions is classified, a fundamental task in Natural Language Processing. As mentioned previously, to our knowledge this is the first work studying LSTM networks on sentiment analysis with curriculum learning.
Our visualizations and analyses on these two sequence tasks are designed to study three main factors. First, we compare the four learning regimens on how the LSTM network’s internal representations change as the final prediction is computed. To this end, we simply decode the representations at intermediate steps. This analysis helps us understand how a model handles the task with the help of curriculum learning. Second, we investigate how the performance of models with different complexities is affected by curriculum learning. Smaller yet accurate models are crucial in limited resource settings. Third, we study how the performance of curriculum learning changes in low-resource setups. This analysis provides valuable information, considering that low-resource scenarios are common in several data-driven learning domains such as Natural Language Processing.
Related Work
We review topics related to our work: curriculum learning (CL), analysis of neural networks, and sentiment analysis with neural networks.
Curriculum Learning.
Motivated by children’s language learning, Elman (1993) studies the effect of learning regimen on a synthetic grammar task. He shows that a Recurrent Neural Network (RNN) is able to learn a grammar when training data is presented in simple-to-complex order and fails to do so when the order is random.
Bengio et al. (2009) investigate CL from an optimization perspective. Their experiments on synthetic vision and word representation learning tasks show that CL results in better generalization and faster learning. Spitkovsky, Alshawi, and Jurafsky (2010) apply a CL strategy to unsupervised parsing: they learn parsers for increasingly long sentences, initializing each parser with the one learned on shorter sentences. They show that a hybrid model combining the parsers learned for the different sentence lengths achieves a significant improvement over a baseline model. Pentina, Sharmanska, and Lampert (2015) investigate CL in a multitask learning setup and propose a model to learn the order of multiple tasks. Their experiments on a set of vision tasks show that learning tasks sequentially is better than learning them jointly. Jiang et al. (2015) provide a general framework for CL and Self-Paced Learning, where the model picks which instances to train on based on a simplicity metric. The proposed framework is able to combine prior knowledge of a curriculum with Self-Paced Learning in the learning objective.
Long Short-Term Memory Networks.
LSTM networks are a variant of RNNs (Elman, 1990) capable of storing information and propagating loss over long distances. By using a gating mechanism to control the information flow into the internal representation, it is possible to avoid the problems of training RNNs (Bengio, Simard, and Frasconi, 1994; Pascanu, Mikolov, and Bengio, 2012). Several architectural variants have been proposed to improve the basic model (Cho et al., 2014; Chung et al., 2015; Yao et al., 2015; Kalchbrenner, Danihelka, and Graves, 2015; Dyer et al., 2015; Grefenstette et al., 2015).
Visualization of Neural Networks.
Although many neural network studies provide quantitative analysis, there are few qualitative analyses of neural networks. Zeiler and Fergus (2014) visualize the feature maps of a Convolutional Neural Network (CNN) (LeCun et al., 1998). They show that feature maps at different layers are sensitive to different shapes, textures, and objects. Similarly, Karpathy, Johnson, and Li (2015) analyze LSTMs on character-level language modeling. Their analysis shows that deeper models with gating mechanisms achieve better results. They show that some cells in an LSTM learn to detect patterns and how RNNs learn to generalize to longer sequences. More recently, Li et al. (2016) use visualization to show how neural network models handle several linguistic phenomena, such as compositionality and negation, using sentiment analysis and sequence autoencoding.
Synthetic Tasks.
Since the early days of neural networks, synthetic tasks have been used to test the capabilities of models (Fahlman, 1991) and often serve as unit tests for machine learning models (Weston et al., 2015). Similar to the first work on LSTMs (Hochreiter and Schmidhuber, 1997), many contemporary neural network models (Graves, Wayne, and Danihelka, 2014; Kurach, Andrychowicz, and Sutskever, 2015; Sukhbaatar et al., 2015; Vinyals, Fortunato, and Jaitly, 2015) use synthetic tasks to compare and contrast architectures. Inspired by these studies, we also use a synthetic task to understand the effect of CL on LSTMs.
Sentiment Analysis with Neural Networks.
Several approaches have been proposed to solve sentiment analysis using neural networks. Socher et al. (2013) propose Recursive Neural Networks to exploit the syntactic structure of a sentence. A number of extensions of this model have been proposed in the context of sentiment analysis (Irsoy and Cardie, 2014; Tai, Socher, and Manning, 2015). Other approaches use CNNs (Kalchbrenner, Grefenstette, and Blunsom, 2014; Kim, 2014) and the averaging of word vectors (Iyyer et al., 2015; Le and Mikolov, 2014).
To our knowledge, this work is the first to study how the internal representations of an LSTM change in a curriculum learning setup.
Curriculum Learning Regimens
Curriculum learning emphasizes the order of training instances, prioritizing simpler instances before more complex ones. In this section, we describe two curriculum learning regimens: the one-pass curriculum originally proposed by Bengio et al. (2009) and the baby step curriculum from Spitkovsky, Alshawi, and Jurafsky (2010). For both regimens, we develop the curriculum using the same strategy proposed by Spitkovsky, Alshawi, and Jurafsky (2010), who assume that shorter sequences are easier to learn.
The following subsections describe the two curriculum learning regimens as well as two baseline learning regimens.
One-Pass Curriculum
Bengio et al. (2009) propose to use a dataset with simpler instances in the first phase of training. After some number of iterations, they switch to the harder target dataset. The intuition is that after some training on simpler data, the model is ready to handle the harder target data. Here, we name this regimen One-Pass curriculum (see Algorithm 1). The training data is sorted by a curriculum and distributed into a number of buckets. Training starts with the easiest bucket. Unlike the previous work (Bengio et al., 2009), we use early stopping: training on a bucket stops when the loss or the task’s accuracy criterion on a held-out set does not improve for a fixed number of epochs (the patience). Afterward, the next bucket is used and trained on in the same way. The whole training is stopped after all buckets are used. Note that the model uses each bucket only once for training, hence the name.
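The procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_epoch` and `val_loss` are hypothetical hooks into the actual model and training code, and `buckets` is assumed to be already ordered from easiest to hardest.

```python
# A minimal sketch of the One-Pass curriculum regimen.
# `buckets`: list of training subsets, easiest first.
# `train_epoch(bucket)`: runs one epoch on the given data (hypothetical hook).
# `val_loss()`: returns the current held-out loss (hypothetical hook).

def one_pass_curriculum(buckets, train_epoch, val_loss, patience=10):
    """Train on each bucket exactly once, easiest first, with early stopping."""
    for bucket in buckets:                 # each bucket is used only one time
        best, waited = float("inf"), 0
        while waited < patience:
            train_epoch(bucket)            # one epoch over this bucket only
            loss = val_loss()
            if loss < best:
                best, waited = loss, 0     # improvement: reset patience counter
            else:
                waited += 1                # no improvement for another epoch
    return best
```

Once a bucket's patience is exhausted, its data is never revisited, which is the defining property of this regimen.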
Baby Steps Curriculum
The intuition behind the Baby Steps curriculum (Bengio et al., 2009; Spitkovsky, Alshawi, and Jurafsky, 2010) is that simpler instances in the training data should not be discarded; instead, the complexity of the training data should be increased. After distributing the data into buckets based on a curriculum, training starts with the easiest bucket. When the loss or the task’s accuracy criterion on a held-out set does not improve for a fixed number of epochs (the patience), the next bucket is merged into the current training data. The whole training is stopped after all buckets are used (see Algorithm 2).
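The key difference from the one-bucket-at-a-time scheme is the growing training set. A minimal sketch, again with hypothetical `train_epoch` and `val_loss` hooks standing in for the real training code:

```python
# A minimal sketch of the Baby Steps curriculum regimen: buckets are merged
# into a growing training set instead of being discarded after use.
# `train_epoch` and `val_loss` are hypothetical hooks into the real code.

def baby_steps_curriculum(buckets, train_epoch, val_loss, patience=10):
    """Incrementally grow the training set, starting from the easiest bucket."""
    current = []
    for bucket in buckets:
        current = current + bucket         # merge the next-hardest bucket in
        best, waited = float("inf"), 0
        while waited < patience:
            train_epoch(current)           # train on all data seen so far
            loss = val_loss()
            if loss < best:
                best, waited = loss, 0
            else:
                waited += 1
    return current                         # final set contains every bucket
```

Because every epoch after a merge revisits all earlier (easier) instances, simple examples keep regularizing the model while harder ones are introduced.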
Input Sequence  Ground Truth  Baby Steps Curriculum  One-Pass Curriculum  Sorted  NoCL
1  1  0.23  0.00  0.86  0.88 (0.20)
10  1  1.03  0.00  1.32  1.10 (0.28)
109  10  10.27  0.00  10.59  10.20 (0.78)
1091  11  11.40  0.00  11.29  10.88 (0.95)
10917  18  18.62  0.00  19.30  17.89 (1.49)
109173  21  21.54  0.00  22.99  20.84 (1.94)
1091735  26  26.77  4.29  29.34  25.90 (2.39)
10917356  32  32.82  14.19  37.21  32.00 (2.87)
109173567  39  40.56  29.28  46.38  39.12 (3.12)
1091735670  39  40.53  32.20  48.01  39.26 (3.00)
10917356706  45  46.73  52.86  55.50  45.95 (3.14)
109173567064  49  50.96  67.70  60.58  50.32 (3.09)
1091735670642  51  52.82  74.59  63.03  52.6 (2.94)
10917356706428  59  61.01  83.31  70.91  61.11 (2.87)
109173567064286  65  67.78  87.42  76.54  67.97 (2.77)
1091735670642861  66  69.27  83.43  76.46  69.33 (2.57)
10917356706428614  70  72.32  82.60  78.88  73.05 (2.36)
109173567064286145  75  76.77  83.34  81.56  77.67 (2.27)
1091735670642861451  76  78.57  80.42  80.44  78.07 (2.19)
10917356706428614516  82  83.05  82.52  83.68  83.36 (2.03)
Table 1: Probing of the LSTM model at intermediate timesteps for the Digit Sum dataset. The left column is the input, where the underlined digit emphasizes the last input digit. Ground Truth is the running sum up to that point. The remaining columns give the values predicted by the LSTM model trained under each regimen; the numbers in parentheses are standard deviations. The intermediate representations of the Baby Steps curriculum model stay closest to the running sum of the input sequence.
Baseline Regimens
The first baseline, named NoCL, is the common practice of shuffling the training data. For a neural network like our LSTM models, this means that training is performed as usual, where one epoch sees the whole training set in random order. For all experiments described in the following section, models learned with the NoCL regimen are trained 10 times to get a proper average performance (note that this favors NoCL due to lower variance in results).
The second baseline, named Sorted, also sees all the data at each epoch, but the ordering of the training instances is based on the curriculum. This is a simplification of the two CL regimens presented in the previous subsections, since we are not partitioning the data based on its complexity (i.e., based on the curriculum); we are simply reordering the training set. A comparison between the NoCL and Sorted baselines allows us to study the importance of training instance ordering.
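The data orderings used by the regimens and baselines can be sketched as follows, assuming the length-based curriculum of Spitkovsky, Alshawi, and Jurafsky (2010) in which shorter sequences are treated as easier (`data` here is a hypothetical list of token sequences):

```python
import random

def make_buckets(data, n_buckets):
    """Curriculum bucketing: sort by length, split into buckets, easiest first."""
    ordered = sorted(data, key=len)
    size = (len(ordered) + n_buckets - 1) // n_buckets   # ceil division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def sorted_baseline(data):
    """Sorted baseline: every epoch sees all data, in curriculum order."""
    return sorted(data, key=len)

def no_cl_baseline(data, seed=0):
    """NoCL baseline: conventional random shuffling of the training set."""
    order = list(data)
    random.Random(seed).shuffle(order)
    return order
```

All four regimens see the same instances; only the order (and, for the two CL regimens, the schedule over buckets) differs.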
Experiments
The main goal of our experiments is to better understand how a computational model, specifically the LSTM network, is affected internally by CL. We aim to observe (1) the effect of CL on the internal model representations, (2) how the number of model parameters affects the performance of CL, and (3) how the training data size changes the contribution of CL.
The following subsections present the LSTM network and our experimental probing methodology for analyzing the LSTM’s internal representations at different stages of the sequence modeling process.
LSTM
We now describe LSTM networks. Let $x = (x_1, \ldots, x_T)$ be a sequence of one-hot coded symbols of length $T$. At each time step $t$ the LSTM updates its cells as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W \begin{pmatrix} E x_t \\ h_{t-1} \end{pmatrix} \quad (1)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (2)$$

$$h_t = o_t \odot \tanh(c_t) \quad (3)$$

In the above equations, $W$ is a $[4n \times 2n]$ matrix used to calculate the gate weights and the new memory information $\tilde{c}_t$, and $E$ is an embedding matrix for the symbols (the embedding size equals the hidden size $n$). At time $t$, the sigmoid (sigm) and tanh nonlinearities are applied elementwise to a linear function of the embedding representation $E x_t$ of the input and the output $h_{t-1}$ of the network from the previous time step. The vectors $i_t$, $f_t$, $o_t$ are the input, forget, and output gates, respectively. The vector $\tilde{c}_t$ additively modifies the cell $c_t$.
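A single update step of this recurrence can be written out in numpy. This is a sketch of the standard equations, not the paper's code; biases are omitted, and the `[4n x 2n]` shape of `W` assumes the embedding size equals the hidden size, as in our experiments.

```python
import numpy as np

def sigm(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_onehot, h_prev, c_prev, W, E):
    """One LSTM step: W is [4n x 2n], E is [n x |vocab|], biases omitted."""
    n = h_prev.shape[0]
    # Eq. (1): joint pre-activations for the three gates and candidate memory.
    z = W @ np.concatenate([E @ x_onehot, h_prev])
    i, f, o = sigm(z[:n]), sigm(z[n:2 * n]), sigm(z[2 * n:3 * n])
    c_tilde = np.tanh(z[3 * n:])
    c = f * c_prev + i * c_tilde       # Eq. (2): additive cell update
    h = o * np.tanh(c)                 # Eq. (3): gated hidden output
    return h, c
```

The additive form of Eq. (2) is what lets the cell carry information (here, something like a running sum) across many timesteps without vanishing gradients.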
We use the final hidden representation $h_T$ of the LSTM for prediction with a projection matrix. In the case of regression, we use $y = \mathrm{relu}(W_r h_T)$, where $W_r$ is a $[1 \times n]$ matrix and relu is the rectified-linear nonlinearity. For classification, we predict one of $K$ class labels using $y = \mathrm{softmax}(W_c h_T)$, where $W_c$ is a $[K \times n]$ matrix and softmax is the softmax function.
Probing Internal States of LSTM
We aim to observe how the internal representation of the LSTM at intermediate steps changes depending on the learning regimen. Each intermediate representation $h_t$ is probed using the output function learned for regression or classification. By moving these probes along the sequence, we can study the intermediate representation at each time step $t$.
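The probing procedure itself is simple. In this sketch, `step` and `readout` are hypothetical stand-ins for the trained LSTM transition and the learned output head; the toy example in the test uses a running-sum recurrence purely for illustration.

```python
# A sketch of probing: the output function learned for the final timestep
# is applied to every intermediate hidden state, giving one prediction per
# prefix of the input. `step` and `readout` stand in for the trained model.

def probe(sequence, step, readout, h0, c0):
    """Return the decoded prediction after each input symbol."""
    h, c, preds = h0, c0, []
    for x in sequence:
        h, c = step(x, h, c)          # advance the recurrence one symbol
        preds.append(readout(h))      # decode the intermediate state
    return preds
```

For the Digit Sum task, a model that maintains an exact running sum would make the probe outputs match the Ground Truth column of Table 1.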
Digit Sum
We aim to simulate a low-resource sequence regression problem, considering that many NLP tasks only have a few thousand annotated samples. To this end, the Digit Sum task is posed as follows: given a sequence of digit symbols, the model is expected to predict the sum of the digits. For instance, given the sequence "5 0 2 4 6", the expected output is 17.
The Digit Sum task has similarities with our second sequence task, sentiment analysis: digits are analogous to the word tokens in natural language text, and the summation is analogous to the subjective position of a sequence on a topic. Our two evaluation tasks also have some interesting differences, which allows us to evaluate a broader range of sequence learning tasks. First, in sentiment analysis the order of words makes a difference, whereas in Digit Sum the order of digits does not change the expected answer. Second, the learning setup is a classification of polarity levels for sentiment analysis, whereas it is a regression for Digit Sum.
Dataset Details. We define the evaluation task in the Digit Sum dataset as the summation of 20 digits, a typical length of sentences in natural language. Both the validation and test sets contain 200 randomly generated sequences of 20 digits. The training set consists of 1000 sequences for each length from 2 to 20, allowing us to develop the curriculum automatically following the procedure of Spitkovsky, Alshawi, and Jurafsky (2010). This results in a dataset of 19K instances (we experimented with a 10x smaller dataset and observed very similar results; we do not report these due to limited space).
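The training set described above can be generated as follows; this is a sketch of the stated construction (1000 random sequences per length from 2 to 20, labeled with their sum), not the authors' released code.

```python
import random

def make_digit_sum_data(lengths=range(2, 21), per_length=1000, seed=0):
    """Generate (digit sequence, sum) pairs for the Digit Sum task."""
    rng = random.Random(seed)
    data = []
    for n in lengths:
        for _ in range(per_length):
            seq = [rng.randint(0, 9) for _ in range(n)]
            data.append((seq, sum(seq)))   # label is the sum of the digits
    return data
```

With the defaults this yields 19 lengths x 1000 = 19,000 instances, and sequence length provides the curriculum ordering directly.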
Experimental Details. We used LSTMs without peephole connections. For all configurations, the size of the digit embeddings and the number of hidden units are the same. We use RMSprop (Dauphin et al., 2015) with learning rate 0.001 and decay rate 0.9, with minibatches of size 128. The patience parameter for early stopping is 10. We use Dropout (Srivastava et al., 2014) with a rate in the range {0, 0.25, 0.5}, as suggested for LSTMs by Gal (2015).
Results
Probing Internal Model Representations. We analyze the behavior of the model using the intermediate representations computed while processing a sequence. As discussed previously, we feed the hidden representation at each timestep of the input to the regression node to obtain an intermediate prediction. Table 1 shows the input sequence, the ground truth, and the predictions of the best model for each learning regimen, selected by validation loss. The predictions of the models trained with the One-Pass curriculum and the Sorted baseline show no correlation with the running sum of the digits. The Baby Step curriculum model is able to predict values close to the running sum.
A sequence model can learn to solve this task in numerous ways, such as memorizing sequences due to overfitting, keeping a count table of the digits, or computing a running sum at each time step. To analyze this, we report the average difference between successive predictions and the last input digit (see Figure 1). At each timestep, the model trained with the Baby Step curriculum updates the hidden representation such that it correlates with the sum of the digits observed up to that point. It is also interesting to observe that the Baby Step curriculum shows lower variance than the average of 10 random starts (NoCL). We emphasize that the models are provided with the same sequences for training, yet the learning regimens result in different models.
Effects on Models With Different Complexities. Our next experiment studies the effect of learning regimens on models with different complexities. Figure 3 shows the Mean Squared Error (MSE) results for Digit Sum with LSTMs of varying hidden unit sizes. The Baby Step curriculum achieves consistently better results even when the model has much fewer parameters, while the other regimens require the model to have the right complexity. Efficient training of small models is particularly important when we do not have large annotated datasets to train big models. In addition, from a practical perspective, obtaining smaller yet accurate models enables deploying fast and accurate models in limited resource settings (Buciluǎ, Caruana, and Niculescu-Mizil, 2006; Hinton, Vinyals, and Dean, 2015).
Sentiment Analysis
Sentiment analysis is an application of NLP that identifies the polarity of subjective information in a given source (Pang and Lee, 2008). We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013), an extension of an earlier dataset (Pang and Lee, 2005), which has labels for 215,154 phrases in the parse trees of 11,855 sentences sampled from movie reviews. Real-valued sentiment labels are converted to integer ordinal labels in {0, ..., 4} by simple thresholding, giving five classes: very negative, negative, neutral, positive, and very positive. The task is therefore posed as a 5-class classification problem.
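The thresholding step can be sketched as below. The specific cut-points follow the common SST convention of five equal-width bins over [0, 1]; the paper only says "simple thresholding", so treat the exact values as an assumption.

```python
# A sketch of mapping SST's real-valued sentiment scores in [0, 1] to the
# five ordinal classes. The equal-width cut-points (0.2, 0.4, 0.6, 0.8)
# are the standard SST convention, assumed here rather than stated above.

def to_five_class(score):
    """0: very negative, 1: negative, 2: neutral, 3: positive, 4: very positive."""
    for label, cut in enumerate((0.2, 0.4, 0.6, 0.8)):
        if score <= cut:
            return label
    return 4
```
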
Regimen  All  Conjunctions
NoCL from (Tai, Socher, and Manning, 2015)  46.4 (1.1)  –
NoCL (our implementation)  46.83 (1.1)  43.88 (1.9)
Sorted  47.42  42.88
One-Pass Curriculum  45.74  43.09
Baby Steps Curriculum  47.37  46.07
Table 2: Overall performance of the learning regimens on sentiment analysis; the last column reports performance on sentences with conjunctions (standard deviations in parentheses).
Dataset Details. We use the standard train/dev/test splits of 8544/1101/2210 sentences for the 5-class classification problem. To use the finer-grained annotations, we flatten the annotated tree structure into sequences of phrases: we treat the words within the span of an inner phrase as a sequence and use the phrase’s annotation as its label. This results in a larger training set of 155,019 instances.
Experimental Details. We follow the previous work (Tai, Socher, and Manning, 2015) for the empirical setup. We use a single-layer LSTM with 168 units for the 5-class classification task. We initialized the word embeddings with 300-dimensional Glove vectors (Pennington, Socher, and Manning, 2014) and fine-tuned them during training. For optimization, we used RMSprop (Dauphin et al., 2015) with learning rate 0.001 and decay rate 0.9, with minibatches of size 128. The patience parameter for early stopping is 10.
Results
As a first step toward our more detailed analysis, Table 2 reports the overall performance of the four learning regimens together with the original result reported by Tai, Socher, and Manning (2015). The advantage of CL is most prominent when predicting the sentiment of sentences with conjunctions (last column in Table 2). For conjunctions, where a span of text contradicts or supports the overall sentiment polarity, the Baby Step model achieves significantly better results than the others. We take a closer look at the LSTM modeling process using a probing technique similar to the one used for the Digit Sum dataset.
Probing Intermediate Representations. In Figure 2, we qualitatively show how different models process a sentence with a contrastive conjunction, originally demonstrated by Socher et al. (2013). For each model, we plot the sentiment polarity and the probability of the predicted polarity after observing each word token. Unlike the others, the Baby Step model changes the sentiment at the appropriate time: after observing “spice”, which completes the positive statement in the subphrase “but it has just enough spice”. Handling contrastive conjunctions requires a model to merge two conflicting signals (i.e., positive and negative) coming from two directions (i.e., the left and right phrases) in an accurate way (Socher et al., 2013). Considering the LSTM’s limited capacity due to using only signals coming from previous timesteps (i.e., processing the sentence from left to right), this result is particularly interesting because the Baby Step curriculum boosts the LSTM’s performance.
Effect of Training Data Size. To investigate the role of the amount of training data, we train with each learning regimen on varying fractions of the training data. Figure 4 shows the results. CL regimens help when training data is limited; as the amount of training data increases, the difference between the regimens shrinks. This result suggests that in low-resource setups, common to many NLP problems, CL can improve a model’s performance.
Conclusion
We examined curriculum learning on two sequence prediction tasks. Our analyses showed that curriculum learning regimens based on a shorter-first approach help the LSTM construct a partial representation of the sequence in a more intuitive way. We demonstrated that curriculum learning helps smaller models improve performance and contributes more in low-resource setups. Using quantitative and qualitative analyses on sentiment analysis, we showed that a model trained with the Baby Step curriculum improves significantly on sentences with conjunctions, suggesting that curriculum learning helps the LSTM learn longer sequences and the functional role of conjunctions.
References
 Bengio et al. (2009) Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, 41–48. ACM.
 Bengio, Simard, and Frasconi (1994) Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning longterm dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on 5(2):157–166.
 Buciluǎ, Caruana, and NiculescuMizil (2006) Buciluǎ, C.; Caruana, R.; and NiculescuMizil, A. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 535–541. ACM.
 Cho et al. (2014) Cho, K.; van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259.
 Chung et al. (2015) Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2015. Gated feedback recurrent neural networks. arXiv preprint arXiv:1502.02367.
 Dauphin et al. (2015) Dauphin, Y. N.; de Vries, H.; Chung, J.; and Bengio, Y. 2015. Rmsprop and equilibrated adaptive learning rates for nonconvex optimization. arXiv preprint arXiv:1502.04390.
 Dyer et al. (2015) Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; and Smith, N. A. 2015. Transitionbased dependency parsing with stack long shortterm memory. arXiv preprint arXiv:1505.08075.
 Elman (1990) Elman, J. L. 1990. Finding structure in time. Cognitive science 14(2):179–211.
 Elman (1993) Elman, J. L. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48(1):71–99.
 Fahlman (1991) Fahlman, S. E. 1991. The recurrent cascadecorrelation architecture. In Advances in Neural Information Processing Systems, 190–196.
 Gal (2015) Gal, Y. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.
 Graves et al. (2009) Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; and Schmidhuber, J. 2009. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31(5):855–868.
 Graves, Wayne, and Danihelka (2014) Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
 Grefenstette et al. (2015) Grefenstette, E.; Hermann, K. M.; Suleyman, M.; and Blunsom, P. 2015. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, 1819–1827.
 Hinton, Vinyals, and Dean (2015) Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long shortterm memory. Neural computation 9(8):1735–1780.
 Irsoy and Cardie (2014) Irsoy, O., and Cardie, C. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, 2096–2104.
 Iyyer et al. (2015) Iyyer, M.; Manjunatha, V.; BoydGraber, J.; and III, H. D. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, 1681–1691.
 Jiang et al. (2015) Jiang, L.; Meng, D.; Zhao, Q.; Shan, S.; and Hauptmann, A. G. 2015. Selfpaced curriculum learning. In AAAI, volume 2, 6.
 Kalchbrenner, Danihelka, and Graves (2015) Kalchbrenner, N.; Danihelka, I.; and Graves, A. 2015. Grid long shortterm memory. arXiv preprint arXiv:1507.01526.
 Kalchbrenner, Grefenstette, and Blunsom (2014) Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
 Karpathy, Johnson, and Li (2015) Karpathy, A.; Johnson, J.; and Li, F.F. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
 Kim (2014) Kim, Y. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
 Kurach, Andrychowicz, and Sutskever (2015) Kurach, K.; Andrychowicz, M.; and Sutskever, I. 2015. Neural randomaccess machines. arXiv preprint arXiv:1511.06392.
 Le and Mikolov (2014) Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.
 LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradientbased learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 Li et al. (2016) Li, J.; Chen, X.; Hovy, E.; and Jurafsky, D. 2016. Visualizing and understanding neural models in nlp. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
 Pang and Lee (2005) Pang, B., and Lee, L. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 115–124. Association for Computational Linguistics.
 Pang and Lee (2008) Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and trends in information retrieval 2(12):1–135.
 Pascanu, Mikolov, and Bengio (2012) Pascanu, R.; Mikolov, T.; and Bengio, Y. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
 Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP, volume 14, 1532–1543.

 Pentina, Sharmanska, and Lampert (2015) Pentina, A.; Sharmanska, V.; and Lampert, C. H. 2015. Curriculum learning of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5492–5500.
 Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, 1642. Citeseer.
 Spitkovsky, Alshawi, and Jurafsky (2010) Spitkovsky, V. I.; Alshawi, H.; and Jurafsky, D. 2010. From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 751–759. Association for Computational Linguistics.
 Srivastava et al. (2014) Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
 Sukhbaatar et al. (2015) Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. Endtoend memory networks. In Advances in Neural Information Processing Systems, 2431–2439.
 Tai, Socher, and Manning (2015) Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from treestructured long shortterm memory networks. arXiv preprint arXiv:1503.00075.
 Vinyals et al. (2015) Vinyals, O.; Kaiser, Ł.; Koo, T.; Petrov, S.; Sutskever, I.; and Hinton, G. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2755–2763.
 Vinyals, Fortunato, and Jaitly (2015) Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2674–2682.
 Weston et al. (2015) Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2015. Towards aicomplete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
 Yao et al. (2015) Yao, K.; Cohn, T.; Vylomova, K.; Duh, K.; and Dyer, C. 2015. Depthgated recurrent neural networks. arXiv preprint arXiv:1508.03790.
 Zeiler and Fergus (2014) Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In Computer vision–ECCV 2014. Springer. 818–833.