The recent confluence of data availability and strong sequence-to-sequence learning algorithms has the potential to lead to practical tools for writing support. Grammatical error identification is one such application, of potential utility as a component of a writing support tool. Much of the recent work in grammatical error identification and correction has made use of hand-tuned rules and features that augment data-driven approaches, or individual classifiers for human-designated subsets of errors. Given a large, annotated dataset of scientific journal articles, we propose a fully data-driven approach for this problem, inspired by recent work in neural machine translation and, more generally, sequence-to-sequence learning [Sutskever et al.2014, Bahdanau et al.2014, Cho et al.2014].
The Automated Evaluation of Scientific Writing (AESW) 2016 dataset is a collection of nearly 10,000 scientific journal articles (over 1 million sentences) published between 2006 and 2013 and annotated with corrections by professional, native English-speaking editors. The goal of the associated AESW Shared Task is to identify whether or not a given unedited source sentence was corrected by the editor (that is, whether a given source sentence has one or more grammatical errors, broadly construed).
This system report describes our approach and submission to the AESW 2016 Shared Task, which establishes the current highest-performing public baseline for the binary prediction task. Our primary contribution is to demonstrate the utility of an attention-based encoder-decoder model for the binary prediction task. We also provide evidence of tangible performance gains using a character-aware version of the model, building on the character-aware language modeling work of Kim et al. (2016). In addition to sentence-level classification, the models are capable of intra-sentence error identification and the generation of possible corrections. We also obtain additional gains by using an ensemble of a generative encoder-decoder and a discriminative CNN classifier.
The goal of the AESW shared task is to identify whether a particular sentence needs to be edited (contains a “grammatical” error, broadly construed). Some insertions and deletions in the shared task data represent stylistic choices, not all of which are necessarily recoverable given the sentence or paragraph context; for the purposes here, we refer to all such edits as “grammatical” errors. The dataset consists of sentences taken from academic articles annotated with corrections by professional editors. Annotations are described via insertions and deletions, which are marked with start and end tags. Tokens to be deleted are surrounded with the deletion start and end tags, and tokens to be inserted are surrounded with the insertion start and end tags. Replacements (as shown in Figure 1) are represented as deletion-insertion pairs. Unlike the related CoNLL-2014 Shared Task [Ng et al.2014] data, errors are not labeled with fine-grained types (article or determiner error, verb tense error, etc.).
More formally, we assume a vocabulary $\mathcal{V}$ of natural language word types (some of which have orthographic errors) and a set $\mathcal{Q}$ of annotation tags. Given a sentence $\mathbf{s} = [s_1, \ldots, s_I]$, where $s_i \in \mathcal{V}$ is the $i$-th token of the sentence of length $I$, we seek to predict whether or not the gold, annotated target sentence $\mathbf{t} = [t_1, \ldots, t_J]$, where $t_j \in \mathcal{Q} \cup \mathcal{V}$ is the $j$-th token of the annotated sentence of length $J$, is identical to $\mathbf{s}$. We are given both $\mathbf{s}$ and $\mathbf{t}$ for supervised training. At test time, we are only given access to sequence $\mathbf{s}$. We learn to predict sequence $\mathbf{t}$.
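As a concrete illustration of this setup, the following minimal sketch derives the unedited source tokens and the binary label from an annotated target sequence. The literal tag strings (`<del>`, `</del>`, `<ins>`, `</ins>`), the function names, and the example sentence are assumptions made for illustration only.

```python
# Tag strings are assumptions for illustration; the text above names the
# four tags but does not give their literal form.
DEL_START, DEL_END = "<del>", "</del>"
INS_START, INS_END = "<ins>", "</ins>"


def derive_source(target_tokens):
    """Recover the unedited source tokens from an annotated target sequence.

    Deleted tokens belong to the source; inserted tokens do not.
    """
    source, inside_insertion = [], False
    for token in target_tokens:
        if token == INS_START:
            inside_insertion = True
        elif token == INS_END:
            inside_insertion = False
        elif token in (DEL_START, DEL_END):
            continue  # drop the tag itself, keep the deleted tokens
        elif not inside_insertion:
            source.append(token)
    return source


def has_error(target_tokens):
    """Binary label: True when the annotated target differs from its source."""
    return derive_source(target_tokens) != list(target_tokens)


# Hypothetical annotated sentence containing one replacement.
annotated = ["The", "models", DEL_START, "works", DEL_END,
             INS_START, "work", INS_END, "well", "."]
print(derive_source(annotated))  # ['The', 'models', 'works', 'well', '.']
print(has_error(annotated))      # True
```

A replacement appears here as a deletion immediately followed by an insertion, so the source keeps "works" and drops "work".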
Evaluation of this binary prediction task is via the $F_1$-score, where the positive class indicates that an error is present in the sentence (that is, where the source and annotated target differ, $\mathbf{s} \neq \mathbf{t}$). The 2016 Shared Task also included a probabilistic estimation track; we leave the adaptation of our approach to that task for future work.
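The metric can be sketched as follows, treating "sentence contains an error" as the positive class. The function name and the toy labels are illustrative assumptions, not part of the Shared Task tooling.

```python
def f1_score(gold, predicted):
    """F1 for the positive class (True = the sentence contains an error)."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy labels: two true positives, one false positive, one false negative.
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
print(round(f1_score(gold, predicted), 3))  # 0.667
```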
Evaluation is at the sentence level, but the paragraph-level context for each sentence is also provided. The paragraphs, themselves, are shuffled so that full article context is not available. A coarse academic field category is also provided for each paragraph. Our models described below do not make use of the paragraph context nor the field category, and they treat each sentence independently.
Further information about the task is available in the Shared Task report [Daudaravicius et al.2016].
3 Related Work
While this is the first year for a shared task focusing on sentence-level binary error identification, previous work and shared tasks have focused on the related tasks of intra-sentence identification and correction of errors. Until recently, standard hand-annotated grammatical error datasets were not available, complicating comparisons and limiting the choice of methods used. Given the lack of a large hand-annotated corpus at the time, Park and Levy (2011) demonstrated the use of the EM algorithm for parameter learning of a noise model using error data without corrections, performing evaluation on a much smaller set of sentences hand-corrected by Amazon Mechanical Turk workers.
More recent work has emerged as a result of a series of shared tasks, starting with the Helping Our Own (HOO) Pilot Shared Task run in 2011, which focused on a diverse set of errors in a small dataset [Dale and Kilgarriff2011], and the subsequent HOO 2012 Shared Task, which focused on the automated detection and correction of preposition and determiner errors [Dale et al.2012]. The CoNLL-2013 Shared Task [Ng et al.2013] (http://www.comp.nus.edu.sg/~nlp/conll13st.html) focused on the correction of a limited set of five error types in essays by second-language learners of English at the National University of Singapore. The follow-up CoNLL-2014 Shared Task [Ng et al.2014] (http://www.comp.nus.edu.sg/~nlp/conll14st.html) focused on the full generation task of correcting all errors in essays by second-language learners.
As with machine translation (MT), evaluation of the full generation task is still an open research area, but a subsequent human evaluation ranked the output from the CoNLL-2014 Shared Task systems [Napoles et al.2015]. The system of Felice et al. (2014) ranked highest, utilizing a combination of a rule-based system and phrase-based MT, with re-ranking via a large web-scale language model. Of the non-MT based approaches, the Illinois-Columbia system was a strong performer, combining several classifiers trained for specific types of errors [Rozovskaya et al.2014].
We use an end-to-end approach that does not have separate components for candidate generation or re-ranking that make use of hand-tuned rules or explicit syntax, nor do we employ separate classifiers for human-differentiated subsets of errors, unlike some previous work for the related task of grammatical error correction.
We next introduce two approaches for the task of sentence-level grammatical error identification: A binary classifier and a sequence-to-sequence model that is trained for correction but can also be used for identification as a side-effect.
4.1 Baseline Convolutional Neural Net
To establish a baseline, we follow past work that has shown strong performance with convolutional neural nets (CNNs) across various domains for sentence-level classification [Kim2014, Zhang and Wallace2015]. We utilize the one-layer CNN architecture of Kim (2014) with the publicly available word vectors (https://code.google.com/archive/p/word2vec/) trained on the Google News dataset, which contains about 100 billion words [Mikolov et al.2013]. We experiment with keeping the word vectors static (CNN-static) and fine-tuning the vectors (CNN-nonstatic). The CNN models only have access to sentence-level labels and are not given correction-level annotations.
While it may seem more natural to utilize models trained for binary prediction, such as the aforementioned CNN, or for example, the recurrent network approach of Dai and Le (2015), we hypothesize that training at the lowest granularity of annotations may be useful for the task. We also suspect that the generation of corrections is of sufficient utility for end-users to further justify exploring models that produce corrections in addition to identification. We thus use the Shared Task as a means of assessing the utility of a full generation model for the binary prediction task.
We propose two encoder-decoder architectures for this task. Our word-based architecture (Word) is similar to that of Luong et al. (2015). Our character-based models (Char) still make predictions at the word-level, but use a CNN and a highway network over characters instead of word embeddings as the input to the encoder and decoder, as depicted in Figure 1. We follow past work [Sutskever et al.2014, Luong et al.2015] in stacking multiple recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] networks, in both the encoder and decoder.
Here, we model the probability of the target given the source, $p(\mathbf{t} \mid \mathbf{s})$, with an encoder neural network that summarizes the source sequence and a decoder neural network that generates a distribution over the target words and tags at each step given the source.
We start by describing the basic encoder and decoder architectures in terms of the Word model, and then we describe the Char model departures from Word.
The encoder reads the source sentence and outputs a sequence of vectors, one associated with each word in the sentence, which will be selectively accessed during decoding via a soft attentional mechanism. We use an LSTM network to obtain the hidden states $\mathbf{h}^s_i$ for each time step $i$,

$$\mathbf{h}^s_i = \mathrm{LSTM}(\mathbf{h}^s_{i-1}, \mathbf{x}^s_i).$$

For the Word models, $\mathbf{x}^s_i$ is the word embedding for $s_i$, the $i$-th word in the source sentence. (The analogue for the Char models is discussed below.) The output of the encoder is the sequence of hidden state vectors $[\mathbf{h}^s_1, \ldots, \mathbf{h}^s_I]$ for a source sentence of length $I$. The initial hidden state of the encoder is set to zero (i.e. $\mathbf{h}^s_0 \leftarrow \mathbf{0}$).
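The decoder is another LSTM that produces a distribution over the next target word/tag given the source vectors $[\mathbf{h}^s_1, \ldots, \mathbf{h}^s_I]$ and the previously generated target words/tags $\mathbf{t}_{\le j} = [t_1, \ldots, t_j]$. Let

$$\mathbf{h}^t_j = \mathrm{LSTM}(\mathbf{h}^t_{j-1}, \mathbf{x}^t_j)$$

be the summary of the target sentence up to the $j$-th word, where $\mathbf{x}^t_j$ is the word embedding for $t_j$ in the Word models. The current target hidden state $\mathbf{h}^t_j$ is combined with each of the memory vectors in the source to produce attention weights as follows,

$$u_{j,i} = \mathbf{h}^s_i \cdot \mathbf{W}_\alpha \mathbf{h}^t_j, \qquad \alpha_{j,i} = \frac{\exp u_{j,i}}{\sum_{k=1}^{I} \exp u_{j,k}}.$$

The source vectors are multiplied with the respective attention weights, summed, and interacted with the current decoder hidden state $\mathbf{h}^t_j$ to produce a context vector $\mathbf{c}_j$,

$$\mathbf{v}_j = \sum_{i=1}^{I} \alpha_{j,i} \mathbf{h}^s_i, \qquad \mathbf{c}_j = \tanh(\mathbf{W} [\mathbf{v}_j; \mathbf{h}^t_j]).$$

The probability distribution over the next word/tag is given by applying an affine transformation to $\mathbf{c}_j$ followed by a softmax,

$$p(t_{j+1} \mid \mathbf{s}, \mathbf{t}_{\le j}) = \mathrm{softmax}(\mathbf{U} \mathbf{c}_j + \mathbf{b}).$$

Finally, as in Luong et al. (2015), we feed $\mathbf{c}_j$ as additional input to the decoder for the next time step by concatenating it with $\mathbf{x}^t_{j+1}$, so the decoder equation is modified to,

$$\mathbf{h}^t_j = \mathrm{LSTM}(\mathbf{h}^t_{j-1}, [\mathbf{x}^t_j; \mathbf{c}_{j-1}]).$$

This allows the decoder to have knowledge of previous (soft) alignments at each time step. The decoder hidden state is initialized with the final hidden state of the encoder (i.e. $\mathbf{h}^t_0 \leftarrow \mathbf{h}^s_I$).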
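One decoder step of the soft attention described above might be sketched as follows. This is a minimal pure-Python illustration, not the Torch implementation used for the experiments: the bilinear score (following the "general" form of Luong et al. (2015)), the tanh combination, and all function and parameter names are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(source_states, target_state, w_alpha, w_context):
    """Compute attention weights and the context vector for one decoder step.

    source_states: encoder hidden vectors, one per source token
    target_state:  current decoder hidden vector
    w_alpha:       n x n score matrix (bilinear score is an assumption)
    w_context:     n x 2n matrix applied to [weighted source; target_state]
    """
    projected = matvec(w_alpha, target_state)
    scores = [sum(h * p for h, p in zip(state, projected))
              for state in source_states]
    weights = softmax(scores)
    size = len(target_state)
    weighted = [sum(a * state[d] for a, state in zip(weights, source_states))
                for d in range(size)]
    combined = weighted + list(target_state)
    context = [math.tanh(x) for x in matvec(w_context, combined)]
    return weights, context


# Toy example: two source states of dimension 2.
source_states = [[1.0, 0.0], [0.0, 1.0]]
target_state = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_context = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
weights, context = attention_context(source_states, target_state,
                                     identity, w_context)
print(weights)  # the first source state receives the larger weight
```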
Character Convolutional Neural Network
For the Char models, instead of a word embedding, our input for each word in the source/target sentence is an output from a character-level convolutional neural network (CharCNN) (depicted in Figure 1). Our character model closely follows that of Kim et al. (2016).
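Suppose word $s_i$ is composed of characters $[p_1, \ldots, p_l]$. We concatenate the character embeddings to form the matrix $\mathbf{P} \in \mathbb{R}^{d_c \times l}$, where the $k$-th column corresponds to the character embedding for $p_k$ (of dimension $d_c$).
We then apply a convolution between $\mathbf{P}$ and a filter $\mathbf{H} \in \mathbb{R}^{d_c \times w}$ of width $w$, after which we add a bias and apply a nonlinearity to obtain a feature map $\mathbf{f} \in \mathbb{R}^{l-w+1}$. The $k$-th element of $\mathbf{f}$ is given by,

$$\mathbf{f}[k] = \tanh(\langle \mathbf{P}[*, k:k+w-1], \mathbf{H} \rangle + b),$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^\top)$ is the Frobenius inner product and $\mathbf{P}[*, k:k+w-1]$ is the $k$-to-$(k+w-1)$-th column of $\mathbf{P}$. Finally, we take the max-over-time

$$z = \max_k \mathbf{f}[k]$$

as the feature corresponding to filter $\mathbf{H}$. We use multiple filters $\mathbf{H}_1, \ldots, \mathbf{H}_h$ to obtain a vector $\mathbf{z}_i \in \mathbb{R}^h$ as the representation for a given source/target word or tag. We have separate CharCNNs for the encoder and decoder.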
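The convolution and max-over-time pooling can be sketched in a few lines of pure Python. The tanh nonlinearity (as in Kim et al. (2016)), the omission of padding (the word is assumed to have at least as many characters as the filter is wide), and the function names are assumptions for illustration.

```python
import math


def char_cnn_feature(char_embeddings, filt, bias):
    """Max-over-time feature for a single filter.

    char_embeddings: list of l character vectors, each of dimension d
    filt:            list of w vectors of dimension d (a filter of width w)
    """
    length, width = len(char_embeddings), len(filt)
    feature_map = []
    for k in range(length - width + 1):
        window = char_embeddings[k:k + width]
        # Frobenius inner product: sum of element-wise products.
        inner = sum(c * h
                    for column, filt_column in zip(window, filt)
                    for c, h in zip(column, filt_column))
        feature_map.append(math.tanh(inner + bias))
    return max(feature_map)


def char_cnn(char_embeddings, filters, biases):
    """Word representation: one max-pooled feature per filter."""
    return [char_cnn_feature(char_embeddings, f, b)
            for f, b in zip(filters, biases)]


# Toy word of three characters with 2-dimensional character embeddings.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
width_two_filter = [[1.0, 0.0], [0.0, 1.0]]
print(char_cnn(chars, [width_two_filter], [0.0]))
```

In the toy example the two windows have inner products 2 and 1, so the pooled feature is tanh(2).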
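Instead of replacing the word embedding $\mathbf{x}_i$ with $\mathbf{z}_i$, we feed $\mathbf{z}_i$ through a highway network [Srivastava et al.2015]. Whereas a multilayer perceptron produces a new set of features via the following transformation (given input $\mathbf{z}$),

$$\hat{\mathbf{z}} = f(\mathbf{W}\mathbf{z} + \mathbf{b}),$$

a highway network instead computes,

$$\hat{\mathbf{z}} = \mathbf{r} \odot f(\mathbf{W}\mathbf{z} + \mathbf{b}) + (\mathbf{1} - \mathbf{r}) \odot \mathbf{z},$$

where $f$ is a non-linearity (in our models, ), $\odot$ is the element-wise multiplication operator, and $\mathbf{r} = \sigma(\mathbf{W}_r \mathbf{z} + \mathbf{b}_r)$ and $\mathbf{1} - \mathbf{r}$ are called the transform and carry gates. We feed $\mathbf{z}_i$ into the highway network to obtain $\hat{\mathbf{z}}_i$, which is used to replace the input word embeddings in both the encoder and the decoder.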
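A single highway layer can be sketched as below. The choice of ReLU as the non-linearity is an assumption made for this illustration (the value used in the experiments is not recoverable from the text above), as are the function and parameter names.

```python
import math


def highway_layer(z, w, b, w_r, b_r):
    """Mix a transformed input with the untransformed input via a gate."""
    def affine(matrix, bias):
        return [sum(m * x for m, x in zip(row, z)) + c
                for row, c in zip(matrix, bias)]

    # Transform gate r = sigmoid(W_r z + b_r); the carry gate is 1 - r.
    transform = [1.0 / (1.0 + math.exp(-a)) for a in affine(w_r, b_r)]
    # Candidate features; ReLU is an assumption for this sketch.
    candidate = [max(0.0, a) for a in affine(w, b)]
    return [r * c + (1.0 - r) * x
            for r, c, x in zip(transform, candidate, z)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With a zero gate pre-activation the gate is 0.5, so the output is the
# average of ReLU(z) and z.
print(highway_layer([1.0, -2.0], identity, [0.0, 0.0], zeros, [0.0, 0.0]))
```

A strongly negative gate bias makes the layer behave close to the identity, which is the usual motivation for the carry path.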
Exact inference is computationally infeasible for the encoder-decoder models given the combinatorial explosion of possible output sequences, so we follow past work in NMT and use beam search. We do not constrain the generation process of words outside insertion tags to words in the source, and each low-frequency placeholder token generated in the target sentence is replaced with the source token associated with the maximum attention weight. We use a beam size of 10 for all models, with the exception of one of the models in the final system combination, for which we use a beam of size 5, as noted in Section 6.
Note that this model generates corrections, but we are only interested in determining the existence of any error at the sentence level. As such, after beam decoding, we simply check whether there were any corrections in the target. However, we found that decoding in this way under-predicts sentence-level errors. It is therefore important to calibrate the weights associated with corrections, which we discuss in Section 5.3.
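The two post-decoding steps described above (placeholder replacement via the maximum attention weight, then the sentence-level check) can be sketched as follows. The placeholder spelling `<unk>`, the tag strings, the function names, and the example data are assumptions for illustration; beam search itself is omitted.

```python
UNK = "<unk>"  # assumed spelling of the low-frequency placeholder token
TAGS = {"<del>", "</del>", "<ins>", "</ins>"}  # assumed tag strings


def replace_placeholders(target_tokens, source_tokens, attention):
    """Swap each placeholder for the source token with the largest weight.

    attention[j][i] is the weight on source token i when target token j
    was generated.
    """
    output = []
    for token, weights in zip(target_tokens, attention):
        if token == UNK:
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            token = source_tokens[best]
        output.append(token)
    return output


def predicts_error(target_tokens):
    """Sentence-level decision: any annotation tag means an error."""
    return any(token in TAGS for token in target_tokens)


# Hypothetical source sentence and decoded output.
source = ["The", "Zygmund", "spaces", "is", "studied"]
decoded = ["The", UNK, "spaces", "<del>", "is", "</del>",
           "<ins>", "are", "</ins>", "studied"]
attention = [[0.2] * len(source) for _ in decoded]
attention[1] = [0.05, 0.8, 0.05, 0.05, 0.05]  # peaked on "Zygmund"

print(replace_placeholders(decoded, source, attention))
print(predicts_error(decoded))  # True
```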
The AESW task data differs from previous grammatical error datasets in terms of scale and genre. To the best of our knowledge, the AESW dataset is the first large-scale, publicly available professionally edited dataset of academic, scientific writing. The training set consists of 466,672 sentences with edits and 722,742 sentences without edits, and the development set contains 57,340 sentences with edits and 90,106 sentences without. The raw training and development datasets are provided as annotated sentences, $\mathbf{t}$, from which the $\mathbf{s}$ sequences may be deterministically derived. There are 143,802 sentences in the Shared Task test set with hidden gold labels, which serve directly as $\mathbf{s}$ sequences.
As part of pre-processing, we treat each sentence independently, discarding paragraph context (which sentences, if any, were present in the same paragraph) and domain information, which is a coarse grouping by the field of the original journal (Engineering, Computer Science, Mathematics, Chemistry, Physics, etc.). We generate Penn Treebank style tokenizations of the input. Case is maintained and digits are not replaced with placeholder symbols. The vocabulary is restricted to the 50,000 most common tokens, with remaining low-frequency tokens replaced with a special placeholder token. The Char model can encode but not decode over open vocabularies, and hence we do not have any such placeholder tokens on the source side of those models. For all of the encoder-decoder models, we replace the low-frequency target symbols during inference as discussed above in Section 4.2.
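The vocabulary restriction can be sketched with the standard library as follows. The placeholder spelling `<unk>` and the helper names are assumptions; the toy call uses a vocabulary size of 2 rather than 50,000 so that the effect is visible.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the special low-frequency token


def build_vocab(sentences, max_size=50000):
    """Keep the max_size most frequent tokens across tokenized sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, _ in counts.most_common(max_size)}


def apply_vocab(sentence, vocab):
    """Replace out-of-vocabulary tokens with the placeholder."""
    return [token if token in vocab else UNK for token in sentence]


sentences = [["the", "model", "works"],
             ["the", "model", "fails"],
             ["the", "baseline"]]
vocab = build_vocab(sentences, max_size=2)
print(apply_vocab(["the", "baseline", "works"], vocab))
# ['the', '<unk>', '<unk>']
```

Case and digits are left untouched here, matching the pre-processing described above.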
For development against the provided data with labels, we set aside a 10,000 sentence sample from the original development set for tuning, and use the remaining 137,446 sentences for validation. (Note that the number of sentences in the final development set without labels posted on CodaLab (http://codalab.org) differed from that originally posted on the AESW 2016 Shared Task website with labels.) The encoder-decoder models are given all 466,672 pairs of $\mathbf{s}$ and $\mathbf{t}$ sequences with edits, augmented with varying numbers of pairs without edits. The Char+sample and Word+sample models are given a random sample of 200,000 pairs without edits for a total of 666,672 pairs of $\mathbf{s}$ and $\mathbf{t}$ sequences. The Char+all and Word+all models are given all 722,742 sentences without edits for a total of 1,189,414 pairs of $\mathbf{s}$ and $\mathbf{t}$ sequences. For some of the final testing models, we also train with the development sentences. In these latter cases, all sequence pairs are used. In training all of the encoder-decoder models, as indicated in Section 5.2, we drop sentences exceeding 50 tokens in length.
We also experimented with creating corrected versions of sentences for the CNN. The binary CNN classifiers are given 1,656,086 single-sentence training examples, of which 722,742 are error-free examples (in which $\mathbf{s} = \mathbf{t}$), and the remaining examples are constructed by removing the tags from the annotated sentences, $\mathbf{t}$, to create tag-free examples that contain errors (466,672 instances) and additional error-free examples (466,672 instances).
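The construction of the two tag-free versions of an edited sentence can be sketched as follows, along with a check of the example counts quoted above. The tag strings, the function name, and the example sentence are assumptions for illustration.

```python
def tag_free_versions(annotated_tokens):
    """Return (uncorrected, corrected) token lists with all tags removed.

    The uncorrected version keeps deleted tokens and drops inserted ones;
    the corrected version does the opposite. Tag strings are assumed.
    """
    uncorrected, corrected, mode = [], [], None
    for token in annotated_tokens:
        if token in ("<del>", "<ins>"):
            mode = token
        elif token in ("</del>", "</ins>"):
            mode = None
        else:
            if mode != "<ins>":
                uncorrected.append(token)
            if mode != "<del>":
                corrected.append(token)
    return uncorrected, corrected


annotated = ["The", "models", "<del>", "works", "</del>",
             "<ins>", "work", "</ins>", "well", "."]
uncorrected, corrected = tag_free_versions(annotated)
# Each edited sentence yields one positive and one negative CNN example.
cnn_examples = [(uncorrected, 1), (corrected, 0)]
print(cnn_examples)

# Count check: unedited sentences plus two examples per edited sentence.
print(722742 + 2 * 466672)  # 1656086
```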
Training (along with testing) of all models was conducted on GPUs. Our models were implemented with the Torch framework (http://torch.ch).
A limited grid search on the development set determined our use of filter windows of width 3, 4, and 5 and 1000 feature maps. We trained for 10 epochs. Training otherwise followed the approach of the correspondingly named CNN-static and CNN-nonstatic models of Kim (2014).
Initial parameter settings (including architecture decisions such as the number of layers and embedding and hidden state sizes) were informed by concurrent work in neural machine translation and existing work such as that of Sutskever et al. (2014) and Luong et al. (2015). We used -layer LSTMs with hidden units in each layer. We trained for 14 epochs with a batch size of 64 and a maximum sequence length of 50. The parameters for the Word model were uniformly initialized in , and those of the Char model were uniformly initialized in . The -normalized gradients were constrained to be . Our learning rate schedule started the learning rate at 1 and halved the learning rate after each epoch beyond epoch 10, or once the validation set perplexity no longer improved. The Word model used -dimensional word embeddings. For Char, the character embeddings were -dimensional, the filter width was , the number of feature maps was , and highway layers were used. The maximum word length was 35 characters for training Char. Note that we do not reverse the source ($\mathbf{s}$) sequences, unlike some previous NMT work. Following the work of Zaremba et al. (2014), we employed dropout with a probability of between the LSTM layers.
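One possible reading of the learning rate schedule is sketched below. Two points are assumptions rather than facts from the text: that the first halved rate applies to epoch 11, and that once a stall in validation perplexity triggers decay, halving continues after every subsequent epoch. The function name and the perplexity values are hypothetical.

```python
def learning_rates(validation_perplexities, initial_rate=1.0,
                   decay=0.5, start_decay_after=10):
    """Return the learning rate used in each epoch (epochs are 1-indexed).

    validation_perplexities[e - 1] is measured at the end of epoch e.
    """
    rates = []
    rate, best, decaying = initial_rate, float("inf"), False
    for epoch, perplexity in enumerate(validation_perplexities, start=1):
        rates.append(rate)
        if perplexity >= best:
            decaying = True  # assumed: decay persists once triggered
        best = min(best, perplexity)
        if decaying or epoch >= start_decay_after:
            rate *= decay
    return rates


# Hypothetical run of 14 epochs with steadily improving perplexity.
improving = [100.0 - epoch for epoch in range(14)]
print(learning_rates(improving))
```

With steadily improving perplexity the rate stays at 1 through epoch 10 and reaches 0.0625 in epoch 14.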
Training both Word and Char on the training set took on the order of a few days using GPUs, with the former being more efficient than the latter. In practice, we used two GPUs for training Char due to memory requirements.
Post-hoc tuning was performed on the 10k held-out portion of the development set. In terms of maximizing the $F_1$-score, this post-hoc tuning was important for these models, without which precision was high and recall was low. We leave to future work alternative approaches to this type of post-hoc tuning.
For the CNN models, after training, we tuned the decision boundary to maximize the $F_1$-score on the held-out tuning set. Analogously, for the encoder-decoder models, after training the models, we tuned the bias weights (given as input to the final softmax layer generating the words/tags distribution) associated with the four annotation tags via a simple grid search by iteratively running beam search on the tuning set. Due to the relatively high expense of decoding, we employed a coarse grid search in which the bias weights of the four annotation tags were uniformly varied.
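The decision-boundary tuning for the classifier case can be sketched as a small grid search. The scores, labels, candidate grid, and function names are hypothetical. The encoder-decoder analogue would replace the threshold with a shared bias added to the tag logits and would re-run beam search for each candidate, which is not shown here.

```python
def positive_f1(gold, predicted):
    """F1 for the positive class, written as 2TP / (2TP + FP + FN)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def tune_threshold(scores, gold, candidates):
    """Pick the candidate threshold with the highest F1 on tuning data."""
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        predicted = [score >= threshold for score in scores]
        f1 = positive_f1(gold, predicted)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical classifier scores on a tiny tuning set.
scores = [0.9, 0.4, 0.35, 0.2, 0.6]
gold = [True, True, False, False, True]
best_threshold, best_f1 = tune_threshold(scores, gold, [0.3, 0.5, 0.7])
print(best_threshold, round(best_f1, 3))  # 0.3 0.857
```

In this toy case lowering the threshold trades one false positive for one recovered error, which raises the score.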
Results on the development set, excluding the 10k tuning set, appear in Table 1. Here (and elsewhere) Random is the result of randomly assigning a sentence to one of the binary classes. For the CNN classifiers, fine-tuning the word2vec embeddings improves performance. The encoder-decoder models improve over the CNN classifiers, even though the latter are provided with additional data (via word2vec). The character-based models yield tangible improvements over the word-based models.
For consistency here, we kept the beam size at 10 across models, but subsequent analysis revealed that increasing the beam from 5 to 10 had a negligible effect on overall performance.
Figure 2 illustrates the importance of adjusting the bias weights associated with the annotation tags in balancing precision and recall to maximize the $F_1$ score. The models trained on all sequence pairs without edits, Char+all and Word+all, perform particularly poorly without tuning these bias weights, yielding $F_1$ scores near that of Random before tuning, which corresponds to a weight of 0.0 in Figure 2.
The official development set posted on CodaLab differed slightly from the original development set provided with labels, so we include those results in Table 2 for the encoder-decoder models. Here, evaluation is performed on the CodaLab server, as with the final test submission. The overall relative performance pattern is similar to that of the original development set.
A comparison of our results with other shared task submissions appears in Table 3. (Teams were allowed to submit up to two results.) Our submission, Combination, was a simple majority vote at the system level (for each test sentence) of 5 models (the choice of models was limited to those that were trained and tuned in time for the Shared Task deadline): (1) a CNN-nonstatic model trained with the concatenation of the training and development sets (and using word2vec); (2) a Word model trained on all sequence pairs in the training and development sets with a beam size of 10 for decoding; (3,4) a Char+sample model trained on the training set, decoding the test set twice, each time with different bias weights (the two highest performing via the grid search over the tuning set) with a beam size of 10; and (5) a Char model trained on all sequence pairs in the training and development sets, with training suspended at epoch 9 (out of 14) and a beam size of 5 to meet the Shared Task deadline. For reference, we also include the CodaLab evaluation for just the Char+sample model trained on the training set with a beam size of 10, with the bias weights being those that generated the highest $F_1$-score on the 10k tuning set.
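The system-level majority vote can be sketched as follows. The function name and the toy predictions are illustrative assumptions.

```python
def majority_vote(system_predictions):
    """Combine aligned per-sentence predictions from several systems.

    system_predictions: one list of booleans per system, where True means
    the system predicts that the sentence contains an error.
    """
    n_systems = len(system_predictions)
    return [2 * sum(votes) > n_systems
            for votes in zip(*system_predictions)]


# Five hypothetical systems voting on three test sentences.
predictions = [
    [True, False, True],
    [True, False, False],
    [False, False, True],
    [True, True, False],
    [False, False, True],
]
print(majority_vote(predictions))  # [True, False, True]
```

With an odd number of systems, as in the five-model combination, ties cannot occur.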
Of particular interest, the Char+sample model performs well, both in terms of performance on the test set relative to other submissions, as well as on the development set relative to the Word models and the CNN classifiers. It is possible this is due to the ability of the Char models to capture some types of orthographic errors.
The empirical results suggest that simply adding additional already correct source-target pairs when training the encoder-decoder models may not boost performance, ceteris paribus, as seen in comparing the performance of Char+sample vs. Char+all, and Word+sample vs. Word+all. We leave to future work alternative approaches for introducing additional correct (target) sentences, as has been examined for neural machine translation models [Sennrich et al.2015, Gülçehre et al.2015].
Our results provide initial evidence to support the hypothesis that training at the lowest granularity of annotation is a more efficient use of data than training against the binary label. In future work, we plan to compare against sentence classification using LSTMs [Dai and Le2015] and convolutional models that use correction-level annotations.
Another benefit of the encoder-decoder models is that they can be used to generate corrections (and identify locations of intra-sentence errors) for end-users. However, the added generation capabilities of the encoder-decoder models come at the expense of considerably longer training and testing times compared to the CNN classifiers.
We found that post-hoc tuning provides a straightforward means of tuning the precision-recall trade-off for these models, and we speculate (but leave to future work for investigation) that in practice, end-users might prefer greater emphasis placed on precision over recall.
We have presented our submission to the AESW 2016 Shared Task, suggesting, in particular, the utility of a neural attention-based model for sentence-level grammatical error identification. Our models do not make use of hand-tuned rules, are not trained with explicit syntactic annotations, and do not make use of individual classifiers designed for human-designated subsets of errors.
For the encoder-decoder models, modeling at the sub-word level was beneficial, even though predictions were still made at the word level. It would be of interest to push this further to eliminate the need for an initial tokenization step, in order to generalize the approach to other languages, such as Chinese and Japanese.
We plan to examine alternative approaches for training with additional correct (target) sentences. Inducing artificial errors to generate more incorrect (source) sentences is also a direction we intend to pursue.
We leave for future work an analysis of the generation quality of our encoder-decoder models on the AESW dataset and the CoNLL-2014 Shared Task data, as well as user studies to assess whether performance is sufficient in practice to be useful, including the utility of correction vs. identification.
We consider this to be just the beginning of the development of data-driven support tools for writers, and many areas remain to be explored.
We would like to thank the organizers of the Shared Task for coordinating the task and making the unique AESW dataset available for research purposes. The Institute for Quantitative Social Science (IQSS) and the Harvard Initiative for Learning and Teaching (HILT) supported earlier, related research that led to our participation in the Shared Task. Jeffrey Ling graciously contributed a Torch-based CNN implementation of Kim (2014).
- [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.
- [Dai and Le2015] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. CoRR, abs/1511.01432.
- [Dale and Kilgarriff2011] Robert Dale and Adam Kilgarriff. 2011. Helping our own: The HOO 2011 pilot shared task. In Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, pages 242–249, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Dale et al.2012] Robert Dale, Ilya Anisimoff, and George Narroway. 2012. HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 54–62, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Daudaravicius et al.2016] Vidas Daudaravicius, Rafael E. Banchs, Elena Volodina, and Courtney Napoles. 2016. A report on the automatic evaluation of scientific writing shared task. In Proceedings of the Eleventh Workshop on Innovative Use of NLP for Building Educational Applications, San Diego, CA, USA, June. Association for Computational Linguistics.
- [Felice et al.2014] Mariano Felice, Zheng Yuan, Øistein E. Andersen, Helen Yannakoudakis, and Ekaterina Kochmar. 2014. Grammatical error correction using hybrid systems and type filtering. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 15–24, Baltimore, Maryland, June. Association for Computational Linguistics.
- [Gülçehre et al.2015] Çağlar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
- [Kim et al.2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-Aware Neural Language Models. In Proceedings of AAAI.
- [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.
- [Luong et al.2015] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
- [Napoles et al.2015] Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China, July. Association for Computational Linguistics.
- [Ng et al.2013] Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria, August. Association for Computational Linguistics.
- [Ng et al.2014] Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland, June. Association for Computational Linguistics.
- [Park and Levy2011] Y. Albert Park and Roger Levy. 2011. Automated whole sentence grammar correction using a noisy channel model. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 934–944, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Rozovskaya et al.2014] Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, Dan Roth, and Nizar Habash. 2014. The Illinois-Columbia system in the CoNLL-2014 shared task. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 34–42, Baltimore, Maryland, June. Association for Computational Linguistics.
- [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. CoRR, abs/1511.06709.
- [Srivastava et al.2015] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. CoRR, abs/1507.06228.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- [Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
- [Zhang and Wallace2015] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. CoRR, abs/1510.03820.