On the Effective Use of Pretraining for Natural Language Inference

by Ignacio Cases et al.
Stanford University

Neural networks have excelled at many NLP tasks, but there remain open questions about the performance of pretrained distributed word representations and their interaction with weight initialization and other hyperparameters. We address these questions empirically using attention-based sequence-to-sequence models for natural language inference (NLI). Specifically, we compare three types of embeddings: random, pretrained (GloVe, word2vec), and retrofitted (pretrained plus WordNet information). We show that pretrained embeddings outperform both random and retrofitted ones on a large NLI corpus. Further experiments on more controlled datasets shed light on the contexts in which retrofitted embeddings can be useful. We also explore two principled approaches to initializing the rest of the model parameters, Gaussian and orthogonal, showing that the latter yields gains of up to 2.9% in accuracy.




1 Introduction

Unsupervised pretraining of supervised neural network inputs has proven valuable in a wide range of tasks, and there are strong theoretical reasons for expecting it to be useful, especially where the network architecture is complex and the available training data is limited [Erhan et al. 2009, Erhan et al. 2010]. However, for natural language processing tasks, the results for pretraining remain mixed, with random input initializations sometimes even appearing superior. This is perhaps an unexpected result given the comparatively small size of most labeled NLP datasets and the consistent structure of natural language data across different usage contexts.

In this work, we trace the variable performance of pretraining in NLP to the ways in which these dense, highly structured representations interact with other properties of the network, especially the learning rate parameter and the weight initialization scheme. The basis for these experiments is the Stanford Natural Language Inference (SNLI) corpus [Bowman et al. 2015a], which contains about 570k sentence pairs labeled for entailment, contradiction, and semantic independence. SNLI is an ideal choice for a number of reasons: it is one of the larger human-annotated resources available today, it is oriented towards a task that has wide applicability in NLP [Dagan et al. 2006], and its relational structure is similar to what one finds in datasets for machine translation, paraphrase, and a number of other tasks.

Figure 1: An example of an attention-based sequence-to-sequence model that classifies the relationship between a premise, “pretraining helps NLI”, and a hypothesis, “it improves NLP”, as entailment. Dotted arrows illustrate the role of attention in selecting relevant information from the premise.

Our approach utilizes attention-based sequence-to-sequence models as illustrated in Figure 1. Using SNLI, we test the performance of three embedding classes: random; pretrained – GloVe; [Pennington et al. 2014] and word2vec [Mikolov et al. 2013] – as well as retrofitted, i.e., pretrained embeddings plus WordNet information [Faruqui et al. 2015]. Via large hyperparameter searches, we find that pretrained GloVe and word2vec significantly outperform random as long as the network is properly configured. As part of these experiments, we systematically evaluate two approaches to weight initialization: Gaussian [Glorot and Bengio 2010] and orthogonal [Saxe et al. 2014], finding that the latter yields consistent and substantial gains.

Surprisingly, retrofitting degrades performance, even though the WordNet information it brings in should be well aligned with the SNLI labels, given the close association between the WordNet nominal hierarchy and the nature of the NLI task. In a second set of experiments, we show that retrofitting can be helpful on word-level NLI-like tasks; indeed, in these experiments, its performance is far superior to the other options. This suggests a hypothesis: if the inputs carry extensive information that one word entails another, might that make it difficult to adjust to the fact that, for example, negating both words reverses the entailment? To test this hypothesis, we rely on a rich theory of negation derived from the natural logic of MacCartney and Manning [2009] to incrementally add compositional complexity to the word-level task. We find that, as semantic complexity goes up, the performance of retrofitting declines.

Overall, these results indicate that working with pretrained inputs requires care but reliably improves performance on the NLI task.

2 Motivation

Unsupervised pretraining has been central to recent progress in deep learning [LeCun et al. 2015]. Hinton et al. [2006] and Bengio et al. [2007] show that greedy layer-wise pretraining of the inner layers of a deep network allows for a sensible initialization of the parameters of the network, a strategy that was central to some recent major success stories of deep networks. Dai and Le [2015] show that initializing a recurrent neural network (RNN) with an internal state obtained in a previous phase of unsupervised pretraining consistently results in better generalization and increased stability. More generally, the dense representations obtained by methods like GloVe and word2vec have, among other things, enabled the successful use of deep, complex models that could not previously have been optimized effectively.

Some recent work calls into question the importance of pretraining, however. One can train deep architectures without pretraining and achieve excellent results [Martens 2010, Sutskever et al. 2013], by using carefully chosen random initializations and advanced optimization methods, such as second order methods and adaptive gradients. It has been shown that randomly initialized models can find the same local minima found by pretraining if the models are run to convergence [Martens 2010, Chapelle and Erhan 2011, Saxe 2015]. One might conclude from these results that pretraining is mostly useful only for preventing overfitting on small datasets [Bengio et al. 2013, LeCun et al. 2015].

This conclusion would be too hasty, though. New analytical approaches to studying deep linear networks have reopened questions about the utility of pretraining. Drawing on a rich literature, Saxe et al. [2014] and Saxe [2015] show that pretraining still offers optimization and generalization advantages when it is combined with standard as well as advanced optimization methods. There are also clear practical advantages to pretraining: faster convergence (with both first- and second-order methods) and better generalization. For very large datasets (close enough to the asymptotic limit of infinite data), some of the training time will actually be spent achieving the effects of pretraining, and avoiding such redundant effort might be crucial. For small or medium-sized datasets, this is not an option, in which case pretraining is clearly the right choice.

For the most part, results achieved using random initial conditions have not been systematically compared against the use of pretrained initial conditions, even where pretraining is thought to be potentially helpful [Krizhevsky et al. 2012, Sutskever et al. 2013]. This is perhaps not surprising in light of the fact that systematic comparisons are challenging and resource-intensive to make. As Saxe [2015] notes, the inputs interact in complex ways with all the other network parameters, meaning that systematic comparisons require extensive hyperparameter searches on a variety of training-data sources. The next section defines our experimental framework for making these comparisons in the context of NLI. The findings indeed reveal complex dependencies between data, network parameters, and input structure.

3 Model architecture

Attention-based sequence-to-sequence (seq2seq) models have been effective for many NLP tasks, including machine translation [Bahdanau et al. 2014, Luong et al. 2015], speech recognition [Chorowski et al. 2015], and summarization [Rush et al. 2015]. In NLI, variations of this architecture have achieved exceptional results [Rocktäschel et al. 2015, Wang and Jiang 2015, Cheng et al. 2016]. It is therefore our architecture of choice.

A seq2seq model generally consists of two components, an encoder and a decoder [Cho et al. 2014, Sutskever et al. 2014], each of which is an RNN. In NLI, each training example is a triple (premise, hypothesis, label). The encoder builds a representation for the premise and passes the information to the decoder. The decoder reads through the hypothesis and predicts the assigned label, as illustrated in Figure 1.

Concretely, for both the encoder and the decoder, the hidden state $h_t$ at time $t$ is derived from the previous state $h_{t-1}$ and the current input $x_t$ as follows:

$$h_t = f(h_{t-1}, x_t)$$

Here $f$ can take different forms, e.g., a vanilla RNN or a long short-term memory network (LSTM) [Hochreiter and Schmidhuber 1997]. In all our models, we use the multi-layer LSTM architecture described by Zaremba et al. [2015]. At the bottom LSTM layer, $x_t$ is the vector representation of the word at time $t$, which we look up from an embedding matrix (one each for the encoder and decoder). This is where we experiment with different embedding classes (Section 4.1).

On the hypothesis side, the final state $h_T$ at the top LSTM layer is passed through a softmax layer to compute the probability of assigning a label $y$:

$$p(y \mid \text{premise}, \text{hypothesis}) = \mathrm{softmax}(W h_T + b)$$

which we use to compute the cross-entropy loss minimized during training. At test time, we extract the most likely label for each sentence pair.

One key addition to this basic seq2seq approach is the use of attention mechanisms, which have proven to be valuable for NLI [Rocktäschel et al. 2015]. The idea of attention is to maintain a random-access memory of all previously computed hidden states on the encoder side (not just the last one). This memory can be consulted as the decoder hidden states are built up, which improves learning, especially for long sentences. In our models, we follow Luong et al. [2015] in using the local-p attention mechanism and the dot-product attention scoring function.
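To make the attention step concrete, here is a minimal sketch of global dot-product attention in plain Python. It is simplified relative to the local-p variant the paper actually uses (no position window), and all names here are illustrative:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score every stored encoder state against the current decoder state
    with a dot product, normalize the scores, and return the weighted
    context vector together with the attention weights."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: the decoder state is closest to the second encoder state,
# so most of the attention mass lands there.
ctx, w = attend([0.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

The dot product serves as the scoring function, as in the models above; other scoring functions (bilinear, MLP-based) would change only the `dot` line.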

Apart from the embedding matrices, which can be either pretrained or randomly initialized, all other model parameters (LSTM, softmax, attention) are randomly initialized. We examine various important initialization techniques in Section 4.3.

4 Methods for Effective Use of Pretraining

This section describes our approach to comparatively evaluating different input embedding classes. We argue that, for each embedding class, certain preprocessing steps can be essential, and that certain hyperparameters are especially sensitive to the structure of the inputs.

4.1 Embedding Classes

Our central hypothesis is that the structure of the inputs is crucial. To evaluate this idea, we test random, GloVe [Pennington et al. 2014], and word2vec [Mikolov et al. 2013, Goldberg and Levy 2014] inputs, as well as variants of GloVe and word2vec that have been retrofitted with WordNet information [Faruqui et al. 2015].

For the definitions of the GloVe and word2vec models, we refer to the original papers. (We used the publicly released embeddings, trained on Common Crawl 840B tokens for GloVe, http://nlp.stanford.edu/projects/glove/, and Google News for word2vec, https://code.google.com/archive/p/word2vec/. Although the training data sizes are notably different, our goal is not to compare these two models directly. The retrofitting algorithm can be found at github.com/mfaruqui/retrofitting.) The further step of retrofitting is defined as follows. A $d$-dimensional vector $\hat{q}_i$ for a word $w_i$ drawn from a vocabulary $V$ is retrofitted into a vector $q_i$ by minimizing an objective that seeks to keep the retrofitted vector close to the original, but changed in a way that incorporates a notion of distance defined by a source external to the embedding space, such as WordNet [Fellbaum 1998]. The objective of Faruqui et al. [2015] is defined as follows:

$$\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]$$

where the $\alpha_i$ and $\beta_{ij}$ are parameters that control the relative strength of each contribution, and $E$ is the set of edges in the external lexical graph. When the external source is WordNet, $\alpha_i$ is set to 1 and $\beta_{ij}$ is the inverse of the degree of node $i$ in WordNet. In our experiments, we used the off-the-shelf algorithm provided by the authors, keeping the pre-established parameters, with synonyms, hypernyms, and hyponyms as connections in the graph [Faruqui et al. 2015].
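For concreteness, the following sketch implements the iterative updates that minimize this type of objective, with alpha_i = 1 and beta_ij = 1/degree(i) as in the WordNet setting; the toy vocabulary and graph are illustrative inventions, not data from the paper:

```python
def retrofit(embeddings, graph, iterations=10):
    """Jacobi-style updates for the retrofitting objective of Faruqui et
    al. (2015) with alpha_i = 1 and beta_ij = 1/degree(i): each vector is
    pulled toward its graph neighbours while staying anchored to the
    original embedding."""
    original = {w: list(v) for w, v in embeddings.items()}
    q = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        new_q = {}
        for w, vec in q.items():
            neighbours = [n for n in graph.get(w, []) if n in q]
            if not neighbours:
                new_q[w] = list(original[w])  # no graph evidence: unchanged
                continue
            beta = 1.0 / len(neighbours)
            denom = 1.0 + beta * len(neighbours)  # alpha_i + sum_j beta_ij
            new_q[w] = [(original[w][d]
                         + beta * sum(q[n][d] for n in neighbours)) / denom
                        for d in range(len(vec))]
        q = new_q
    return q

# Toy graph: 'cat' and 'feline' are linked (e.g., WordNet synonyms), so
# their retrofitted vectors move toward each other; 'rock' has no edges.
emb = {"cat": [1.0, 0.0], "feline": [0.0, 1.0], "rock": [5.0, 5.0]}
retro = retrofit(emb, {"cat": ["feline"], "feline": ["cat"]})
```

At the fixed point, connected words have moved toward each other while staying close to their originals, and words with no graph neighbours are untouched.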

4.2 Embedding Preprocessing

GloVe and word2vec embeddings can generally be used directly, with no preprocessing required. However, retrofitting generates vectors that are not centered and whose amplitudes are an order of magnitude smaller than those of GloVe or word2vec. We therefore took two simple preprocessing steps that helped performance. First, each dimension was mean-centered. This makes the distribution of inputs more similar to the distribution of weights, which is especially important because of the coupling of inputs and parameters alluded to at the end of Section 2. Second, each dimension was rescaled to have a standard deviation of one. Rescaling is usually applied to inputs showing high variance across features that should contribute equally. It is particularly useful when embeddings have either very large or very small amplitudes; the latter is what we often observe in retrofitted vectors. The overall effect of these two preprocessing steps is to make the summary statistics of all our input vectors similar.
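These two steps amount to per-dimension standardization. A minimal plain-Python sketch (a real pipeline would use vectorized array operations):

```python
def standardize(vectors):
    """Mean-center each dimension, then rescale it to unit standard
    deviation, so that (for example) retrofitted vectors match the
    scale of GloVe/word2vec inputs."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((v[d] - means[d]) ** 2 for v in vectors) / n
        std = var ** 0.5
        stds.append(std if std > 0 else 1.0)  # guard constant dimensions
    return [[(v[d] - means[d]) / stds[d] for d in range(dim)]
            for v in vectors]

# Three toy 2d "embeddings"; after the call, each dimension has mean 0, std 1.
standardized = standardize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```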

4.3 Hyperparameter Search

We expect the performance of different inputs to be sensitive not only to the data and objective, but also to the hyperparameter settings. To find the optimal settings of these parameters for each kind of input, we rely on random search through hyperparameter space. Bergstra and Bengio [2012] argue convincingly that random search is a reliable approach because the hyperparameter response function has a low effective dimensionality, i.e., it is more sensitive along some dimensions than others.

Our search proceeds in two stages. First, we perform a coarse search for the dimensions of greatest dependency in the hyperparameter space by training the model for a small number of epochs (usually just one). Second, we perform a finer search across these dimensions, holding the other parameters constant. This finer search is iteratively annealed: on each iteration, the hypercube that delimits the search is recentered on the point found in the previous iteration and rescaled by a fixed factor (the same in all our experiments).
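The annealed stage can be sketched as follows; the quadratic objective is a stand-in for one epoch of training, and all names and settings here are illustrative:

```python
import random

def annealed_random_search(score, center, widths, shrink=0.5,
                           rounds=4, samples_per_round=25, seed=0):
    """Annealed random search: sample uniformly inside a hypercube,
    recenter the cube on the best point found so far, shrink it, and
    repeat. `score` maps a parameter vector to a validation metric
    (higher is better)."""
    rng = random.Random(seed)
    best_point, best_score = list(center), score(center)
    for _ in range(rounds):
        for _ in range(samples_per_round):
            point = [c + rng.uniform(-w, w) for c, w in zip(center, widths)]
            s = score(point)
            if s > best_score:
                best_point, best_score = point, s
        center = best_point                     # recenter on the incumbent
        widths = [w * shrink for w in widths]   # anneal the search cube
    return best_point, best_score

# Toy objective with a single optimum at (0.1, 0.01), standing in for a
# (learning rate, initialization constant) response surface.
objective = lambda p: -((p[0] - 0.1) ** 2 + (p[1] - 0.01) ** 2)
point, value = annealed_random_search(objective, center=[0.5, 0.5],
                                      widths=[0.5, 0.5])
```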

Our hyperparameter searches focus on the learning rate, the initialization scheme, and the initialization constant. For the initialization scheme, we choose between Gaussian initialization, with a corrective multiplicative factor that scales the variance according to layer size [Glorot and Bengio 2010, He et al. 2015], and the orthogonal random initialization introduced by Saxe et al. [2014]. (We operate on square matrices; e.g., for an LSTM parameter matrix, we orthogonalize its 8 sub-matrices.) The initialization constant is a multiplicative constant applied to each of the initialization schemes. Other hyperparameters were set more heuristically; see Section 5.1 and Appendix A for details.
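A minimal sketch of the orthogonal scheme in the spirit of Saxe et al. [2014]: orthonormalize a random Gaussian matrix (here via Gram-Schmidt; library code would typically use a QR or singular value decomposition) and scale it by a gain that plays the role of the initialization constant:

```python
import random

def orthogonal_matrix(n, gain=1.0, seed=0):
    """Draw a random Gaussian matrix and orthonormalize its rows with
    modified Gram-Schmidt, then scale by a multiplicative gain (the
    'initialization constant' in the text)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # remove the components along earlier rows
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # re-draw near-degenerate vectors
            rows.append([a / norm for a in v])
    return [[gain * a for a in row] for row in rows]

W = orthogonal_matrix(4)
```

Because the rows are orthonormal, multiplication by `W` preserves distances and angles, which is the property the analysis in Section 5.1.2 appeals to.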

5 Experiments

(a) GloVe
(b) retrofitted GloVe
(c) random
Figure 2:

Hyperparameter landscape for Gaussian weight initialization in the subspace spanned by our initialization and learning-rate parameter ranges, for three 300d embeddings. Word2vec and retrofitted word2vec produced landscapes similar to those of GloVe and retrofitted GloVe, respectively, and are not shown for reasons of space. The figures were generated using a mesh interpolation over the random search points.

Our primary experiments are with the Stanford NLI (SNLI) corpus [Bowman et al. 2015a]. This is one of the largest purely human-annotated datasets for NLP, which means that it offers the best chance for random initializations to suffice (Section 2). In addition, to gain analytic insights into how the inputs behave, we report on a series of evaluations using a smaller word-level dataset extracted from WordNet [Bowman et al. 2015b] and a controlled extension of the dataset to include negation. These experiments help us further understand the variable performance of different pretraining methods.

Ideally, we would evaluate our full set of input designs on a wide range of corpora from a variety of tasks. Unfortunately, the hyperparameter searches that are crucial to our empirical argument are extremely computationally intensive, ruling out comparable runs on multiple large datasets. Given these constraints, the NLI task seems like an ideal choice. As noted in Section 3, it shares characteristics with paraphrase detection and machine translation, so we hope the results will transfer. More importantly, it is simply a challenging task because of its dependence on lexical semantics, compositional semantics, and world knowledge [MacCartney 2009, Dagan et al. 2010].

5.1 Stanford Natural Language Inference

In our SNLI experiments, we use the deep LSTM-based recurrent neural network with attention introduced in Section 3. The training set for SNLI contains 550k sentence pairs, and the development and test sets each contain 10k pairs, with a joint vocabulary count of 37,082 types. We use only the training and development sets here; given our goal of comparing the performance of different methods, we need not use the test set.

As described above, we evaluated five families of initial embeddings: random, GloVe, word2vec, retrofitted GloVe, and retrofitted word2vec. For each, we set the learning rate, the initialization constant, and the initialization scheme (Gaussian or orthogonal) using the search procedure described in Section 4.3. (Both the learning rate and the initialization constant were chosen log-uniformly from a fixed range.) Other hyperparameters found to be less variable are listed in Appendix A; their values were fixed based on the results of a coarse search.

Although the hyperparameter landscape provides useful insights about the regions where the models perform adequately, an appropriate assessment of the dynamics of the models and their initializations requires full training. We initialized our models with the hyperparameters obtained during the random search described above. Then we trained them until convergence, with the number of epochs varying somewhat between experiments.

5.1.1 Results

We carried out over 2,000 experiments as part of hyperparameter search. Figure 2 shows the hyperparameter response landscapes after the first epoch for random, GloVe, and retrofitted GloVe. The landscapes are similar for all the embedding families: a high-mean region at lower values of the learning rate and initialization range, followed by a large plateau with low mean and low variance at higher values of the hyperparameter pair, which yielded poorer performance. There is a boundary, roughly a constant line in each of the learning rate and the initialization range, beyond which the network enters a non-learning regime.

After a single epoch, all kinds of pretraining outperform random vectors. Surprisingly, though, GloVe and word2vec also outperform their retrofitted counterparts. Table 1 shows that this ranking holds when the models are run to convergence, independently of whether the Gaussian or orthogonal weight initialization scheme is used. Figure 3 traces the learning curves for the different inputs and weight initialization schemes. (Of the 160 values reported in Figure 3, we interpolated 8 due to an issue in our reporting system; none of these involved initial or final values.) In addition, for all but random and word2vec, orthogonal initialization is notably better than Gaussian, with a mean gain of about 0.8 accuracy points across the pretrained embeddings, including gains of 0.97 points for GloVe vectors and 2.88 points for retrofitted GloVe (Table 1).

                 Gaussian   Orthogonal
random             75.81      76.11
GloVe              81.13      82.10
word2vec           82.04      81.11
retro GloVe        78.00      80.88
retro word2vec     78.46      78.71

Table 1: Accuracy results for the SNLI dev-set experiment.

5.1.2 Analysis

The SNLI experiments confirm our expectations that pretraining helps, in that all the methods of pretraining under consideration here outperformed random initialization. Evidently, the structure of the pretrained embeddings interacts constructively with the weights of the network, resulting in faster and more effective learning. We also find that orthogonal initialization is generally better. This effect might trace to the distance invariance ensured by the orthogonal random matrices. Vector orientation is important to GloVe and word2vec, and even more important for their retrofitted counterparts. We conjecture that keeping invariant the notion of distance encoded by these embeddings is highly profitable.

Figure 3: Accuracy on SNLI dev dataset as a function of the epoch for models run until convergence.

The one unexpected result is that retrofitting hinders performance on this task. The retrofitted information is essentially WordNet hierarchy information, which seems congruent with the NLI task. Why does it appear harmful in the SNLI setting? To address this question, we conducted two simpler experiments, first with just pairs of words (sequences of length 1 for the premise and hypothesis) and then with words under increasingly complex negation sequences. These simpler experiments suggest that the complexity of the SNLI data interacts poorly with the highly structured starting point established by retrofitting.

5.2 Lexical Relations in WordNet

Retrofitting vectors according to the scheme used here means infusing them with information from the WordNet graph [Fellbaum 1998, Faruqui et al. 2015]. We thus expect this process to be helpful in predicting entailment relations between individual words. Bowman et al. [2015b] released a dataset that allows us to test this hypothesis directly. It consists of 36,772 word pairs derived from WordNet, labeled with lexical relations. This section reports on experiments with this dataset, showing that our expectation is met: retrofitting is extremely beneficial in this setting.

The model used the same settings as in the SNLI experiment, and we followed the same procedure for hyperparameter search. To keep the presentation simple, our discussion focuses on random, GloVe, and retrofitted GloVe vectors. The dataset was split into training, development, and test sets, and the runs to convergence used a fixed number of epochs.

5.2.1 Results

As Table 2 shows, retrofitting is extremely helpful in this setting, in terms of both learning speed and overall accuracy. GloVe and random performed similarly, as in Bowman et al. [2015b].

random            94.32
GloVe             94.45
word2vec          94.26
retro GloVe       95.68
retro word2vec    95.49

Table 2: Accuracy results for the WordNet experiment.

5.2.2 Analysis

These results show clearly that retrofitted vectors can be helpful, and they also provide some clues as to why retrofitting hurts with SNLI. Retrofitting implants information from external lexical sources into the representations by adjusting the magnitudes and directions of the original embeddings (while trying to keep them close to the original vectors). For example, the vector for cat is modified so that it incorporates the fact that cat entails animal. These adjustments, so helpful for word-level comparisons, might actually make it harder to deal with the complexities of semantic composition. Our next experiment explores this hypothesis.

5.3 Lexical Relations with Negation

To begin to explore the hypothesis that semantic composition is the root cause of the variable performance of retrofitting, we conducted an experiment in which we introduced different amounts of semantic complexity in the form of negation into a word-level dataset and carried out our usual batch of assessments. The expectation is that, if the hypothesis is true, the performance of retrofitted embeddings with respect to a baseline should decrease.

Inspired by work in natural logic [MacCartney and Manning 2009, Icard 2012], we created a novel dataset based on a rich theory of negation. The dataset begins from a set of 145 words extracted from a subgraph of WordNet. We verified that each pair satisfies one of the relations equal, hyponym, hypernym, or disjoint. The label set is larger than in our previous experiments in order to provide sufficient logical space for a multifaceted theory of negation. For instance, if we begin from ‘p hypernym q’, then negating just ‘p’ yields ‘not p disjoint q’, negating just ‘q’ yields ‘p neutral not q’, and negating them both reverses the original relation, yielding ‘not p hyponym not q’. The full table of relations is given in Table 3. Crucially, the variables ‘p’ and ‘q’ in this table need not be atomic; they can themselves be negated, allowing for automatic recursive application of negation to obtain ever larger datasets. (The category ‘neutral’ does not appear in the original dataset, but emerges through the application of the theory of negation and quickly becomes the dominant category.)

Following the methodology of Bowman et al. [2015c], we train on shorter formulae and test exclusively on longer formulae in order to see how well the networks generalize to data more complex than any they saw in training. More specifically, we train on the dataset that results from two complete negations of the original word-level data. Our first test set was created by applying negation three times to the original word-level relations. Successive test sets were obtained in the same way, up to six levels of negation from the original dataset. Because the application of negation introduces a huge bias towards the category ‘neutral’, the test datasets were downsampled so that their label distributions match that of the training dataset.

                  not-p, not-q    p, not-q    not-p, q
p disjoint q      neutral         hyponym     hypernym
p equal q         equal           disjoint    disjoint
p neutral q       neutral         neutral     neutral
p hyponym q       hypernym        disjoint    neutral
p hypernym q      hyponym         neutral     disjoint

Table 3: The theory of negation used to define the dataset for the negation experiment. ‘p’ and ‘q’ can be either simple words or potentially multiply negated, multi-word terms.
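Table 3 translates directly into lookup tables, which is all that is needed to generate the recursively negated datasets described above (a sketch; the seed triple is illustrative):

```python
# One lookup table per negation pattern, copied from Table 3.
NEGATE_BOTH = {"disjoint": "neutral", "equal": "equal", "neutral": "neutral",
               "hyponym": "hypernym", "hypernym": "hyponym"}   # not-p, not-q
NEGATE_Q = {"disjoint": "hyponym", "equal": "disjoint", "neutral": "neutral",
            "hyponym": "disjoint", "hypernym": "neutral"}      # p, not-q
NEGATE_P = {"disjoint": "hypernym", "equal": "disjoint", "neutral": "neutral",
            "hyponym": "neutral", "hypernym": "disjoint"}      # not-p, q

def expand(triples):
    """Apply one level of negation to every (p, relation, q) triple,
    producing the three negation patterns of Table 3."""
    out = []
    for p, rel, q in triples:
        out.append(("not " + p, NEGATE_BOTH[rel], "not " + q))
        out.append((p, NEGATE_Q[rel], "not " + q))
        out.append(("not " + p, NEGATE_P[rel], q))
    return out

# Each call to expand() adds one negation level; iterating from the seed
# relations gives the level-2 training set and the level-3..6 test sets.
level1 = expand([("cat", "hyponym", "animal")])
# yields ("not cat", "hypernym", "not animal"),
#        ("cat", "disjoint", "not animal"),
#        ("not cat", "neutral", "animal")
```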

5.3.1 Results

The results are presented in Figure 4. The conclusion is very clear: the performance of retrofitted vectors drops more sharply than that of the other embeddings, suggesting that compositional complexity is indeed a problem for retrofitting. GloVe vectors again generalize best. (We expect word2vec to behave similarly.)

Figure 4: Accuracy on the complex negation dataset. The x-axis values correspond to levels of negation: For example, l3 contains terms like ‘not not not p’.

5.3.2 Analysis

The drop in performance suggests that the introduction of semantic complexity can degrade the performance of retrofitted vectors. It is thus perhaps not surprising that the highly complex SNLI sentences cause even deeper problems for these inputs. More generally, these results show that retrofitting high-dimensional embeddings is a tricky process. As it is defined, the retrofitting algorithm seeks to keep the retrofitted vectors close to the original vectors. However, in high-dimensional spaces, these small changes can significantly impact the path of optimization. It seems likely that these problems can be mitigated by modifying the retrofitting algorithm so that it makes more global and uniform modifications to the embedding space.

6 Conclusion

The central finding of this paper is that pretraining inputs with methods like GloVe and word2vec can lead to substantial performance gains as long as the model is configured and optimized so as to take advantage of this initial structure. In addition, we found that, for pretrained inputs, orthogonal random initialization is superior to Gaussian initialization. Random input initialization remains a common choice for some tasks. This is perhaps justified where the training set is massive; the datasets used for neural machine translation (NMT) are often large enough to support this approach. However, while the datasets for other NLP tasks are growing, NMT is still unique in this sense. The dataset used for our main evaluations (SNLI) is large by NLP standards, and yet pretraining delivered gains beyond what random initialization achieved.

Our results did present one puzzle: retrofitting the vectors with entailment-like information hindered performance for the entailment-based NLI task. With a series of controlled experiments, we traced this problem to the complexity of the SNLI data. These results show that fine-tuning high-dimensional embeddings is a delicate task, as the process can alter dimensions that are essential for downstream tasks. Nonetheless, even this disruptive form of pretraining led to better overall results than random initialization, thereby providing clear evidence that pretraining is an effective step in learning distributed representations for NLP.

Appendix A Appendix

Fixed values of less variable hyperparameters, based on the results of a coarse search:

  • Gradient clipping [Pascanu et al. 2013]: initially chosen from a range during the coarse search, and finally fixed to a single value.

  • Number of layers: we searched over several depths and obtained close results among the deeper settings. Although it is generally believed that more layers help, we fixed the depth to a smaller value to save computation time [Erhan et al. 2009, Erhan et al. 2010].

  • Dimensionality of the embeddings: 300. In our experience, increasing the dimensionality is generally better. We chose 300 because it is the largest dimensionality available in the off-the-shelf distributions of GloVe and word2vec.

  • Retraining schedule: the epoch at which to start retraining the embeddings was originally part of the search; it was later fixed for non-random embeddings, while random embeddings were trained from the first iteration to maximize plasticity.

  • Dropout [Zaremba et al. 2015]: set to a fixed probability of dropping the connection.

  • Learning rate schedule: the epoch at which the learning rate starts its decay was fixed, along with the fine-tuning rate.

  • Batch size: preset and not tuned.

The hyperparameter search during the first epoch led us to the optimal hyperparameters in Table 4.

                 Gaussian        Orthogonal
                 L      IR       L      IR
random           1.31   1.86     0.99   0.34
GloVe            1.12   1.42     0.85   0.23
word2vec         1.16   0.44     0.98   2.06
retro GloVe      1.57   1.91     0.80   1.35
retro word2vec   0.64   2.43     0.44   2.45

Table 4: Optimal hyperparameters found for the SNLI experiments. L: learning rate; IR: initialization range.


We thank Lauri Karttunen and Dan Lassiter for their insights during the early phase of the research for this paper, and Quoc V. Le and Sam Bowman for their valuable comments and advice. This research was supported in part by NSF BCS-1456077 and the Stanford Data Science Initiative.


  • [Bahdanau et al. 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Bengio et al. 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153.
  • [Bengio et al. 2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell, 2.
  • [Bergstra and Bengio 2012] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305.
  • [Bowman et al. 2015a] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • [Bowman et al. 2015b] Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Learning distributed word representations for natural logic reasoning. In Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches: Papers from the 2015 AAAI Spring Symposium, pages 10–13. AAAI Publications, March.
  • [Bowman et al. 2015c] Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015c. Recursive neural networks can learn logical semantics. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, Stroudsburg, PA. Association for Computational Linguistics.
  • [Chapelle and Erhan 2011] Olivier Chapelle and Dumitru Erhan. 2011. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201.
  • [Cheng et al. 2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
  • [Cho et al. 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • [Chorowski et al. 2015] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28.
  • [Dagan et al. 2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d’Alché Buc, editors, Machine Learning Challenges, Lecture Notes in Computer Science, volume 3944, pages 177–190. Springer-Verlag.
  • [Dagan et al. 2010] Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 16(01):105–105.
  • [Dahl et al. 2012] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42.
  • [Dai and Le 2015] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3061–3069.
  • [Erhan et al. 2009] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. 2009. The difficulty of training deep architectures and the effect of unsupervised pre-training. In International Conference on Artificial Intelligence and Statistics, pages 153–160.
  • [Erhan et al. 2010] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660.
  • [Faruqui et al. 2015] Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.
  • [Fellbaum 1998] Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
  • [Glorot and Bengio 2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, pages 249–256.
  • [Goldberg and Levy 2014] Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. Technical Report arXiv:1402.3722v1, Bar Ilan University, February.
  • [He et al. 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.
  • [Hinton et al. 2006] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
  • [Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • [Icard 2012] Thomas F. Icard. 2012. Inclusion and exclusion in natural language. Studia Logica, 100(4):705–725.
  • [Krizhevsky et al. 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
  • [LeCun et al. 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.
  • [Luong et al. 2015] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421. Association for Computational Linguistics, September.
  • [MacCartney and Manning 2009] Bill MacCartney and Christopher D. Manning. 2009. An extended model of natural logic. In International Conference on Computational Semantics (IWCS).
  • [MacCartney 2009] Bill MacCartney. 2009. Natural language inference. Ph.D. thesis, Department of Computer Science. Stanford University.
  • [Martens 2010] James Martens. 2010. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742.
  • [Mikolov et al. 2013] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations.
  • [Mohamed et al. 2012] Abdel-Rahman Mohamed, George E. Dahl, and Geoffrey Hinton. 2012. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22.
  • [Pascanu et al. 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013, volume 28 of JMLR: W&CP.
  • [Pennington et al. 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • [Rocktäschel et al. 2015] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kočisky, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
  • [Rush et al. 2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
  • [Saxe et al. 2014] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations.
  • [Saxe 2015] Andrew Michael Saxe. 2015. Deep Linear Neural Networks: A Theory of Learning in the Brain and Mind. Ph.D. thesis, Stanford University.
  • [Sermanet et al. 2013] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. 2013. Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3626–3633.
  • [Sutskever et al. 2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147.
  • [Sutskever et al. 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • [Wang and Jiang 2015] Shuohang Wang and Jing Jiang. 2015. Learning natural language inference with LSTM. arXiv preprint arXiv:1512.08849.
  • [Zaremba et al. 2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
  • [Zaremba et al. 2015] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In ICLR 2015.