Grammatical error correction (GEC) is the task of automatically correcting all types of errors in text, e.g. [In a such situaction → In such a situation]. Using neural models for GEC is becoming increasingly popular (Xie et al., 2016; Yuan and Briscoe, 2016; Ji et al., 2017; Sakaguchi et al., 2017; Schmaltz et al., 2017; Chollampatt and Ng, 2018; Ge et al., 2018a,b), possibly combined with phrase-based SMT (Chollampatt et al., 2016; Chollampatt and Ng, 2017; Grundkiewicz and Junczys-Dowmunt, 2018). A potential challenge for purely neural GEC models is their vast output space, since they assign non-zero probability mass to any sequence. Compared to machine translation, GEC is a highly constrained problem: corrections tend to be very local, and lexical choices are usually limited. Finite state transducers (FSTs) are an efficient way to represent large structured search spaces. In this paper, we propose to construct a hypothesis space using standard FST operations like composition, and then constrain the output of a neural GEC system to that space. We study two scenarios. In the first, we do not have access to annotated training data and only use a small development set for tuning; here we construct the hypothesis space using word-level context-independent confusion sets (Bryant and Briscoe, 2018) based on spell checkers and morphology databases, and rescore it with count-based and neural language models (NLMs). In the second, we assume enough training data is available to train SMT and neural machine translation (NMT) systems; here we make additional use of the SMT lattice and rescore with an NLM-NMT ensemble. Our contributions are:
- We present an FST-based adaptation of the work of Bryant and Briscoe (2018) which allows exact inference and does not require annotated training data. We report large gains from rescoring with a neural language model.
- When applied to SMT lattices, our technique beats the best published result with comparable amounts of training data on the CoNLL-2014 test set (Ng et al., 2014). Our combination strategy yields larger gains over the SMT baselines than the simpler rescoring or pipelining used in prior work on hybrid systems (Grundkiewicz and Junczys-Dowmunt, 2018).
2 Constructing the Hypothesis Space
Constructing the set of hypotheses
The core idea of our approach is to first construct a (weighted) hypothesis space F which is large enough to be likely to contain good corrections, but constrained enough to reflect the highly structured nature of GEC. Then, we use F to constrain a neural beam decoder. We make extensive use of the FST operations available in OpenFST (Allauzen et al., 2007) like composition (denoted with the ∘-operator) and projection (denoted with Π_input and Π_output) to build F. The process starts with an input lattice I. In our experiments without annotated training data, I is an FST which simply maps the input sentence x to itself, as shown in Fig. 1(a). If we do have access to enough annotated data, we train an SMT system on it and derive I from the SMT n-best list (in the rare cases in which the n-best list did not contain the source sentence, we added it in a postprocessing step). For each hypothesis y_i we compute the Levenshtein distance d_i to the source sentence x. We construct a string y'_i by prepending d_i ⟨mcorr⟩ tokens to y_i, and construct I such that:

I(y'_i, y'_i) = λ_SMT · S_SMT(y_i)

We adapt the notation of Mohri (2003) and denote the cost an FST T assigns to mapping a string a to a string b as T(a, b), and set T(a, b) = ∞ if T does not accept the mapping. S_SMT(y_i) is the SMT score of y_i. In other words, I represents the weighted SMT n-best list after adding d_i ⟨mcorr⟩ tokens to each hypothesis, as illustrated in Fig. 1(c). The factor λ_SMT scales the SMT scores for tuning.
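As a concrete (non-FST) illustration of this construction, the sketch below computes the word-level Levenshtein distance of each SMT hypothesis to the source and prepends one marker token per edit. The helper names, the `<mcorr>` token spelling, and the list-based representation are illustrative; the paper encodes this lattice as a weighted FST rather than as explicit strings.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def augment_nbest(source, nbest, lam_smt):
    """Prepend one <mcorr> token per Levenshtein edit to each SMT
    hypothesis and scale its SMT score (nbest: list of (hyp, score))."""
    src = source.split()
    out = []
    for hyp, smt_score in nbest:
        toks = hyp.split()
        d = levenshtein(src, toks)
        out.append((" ".join(["<mcorr>"] * d + toks), lam_smt * smt_score))
    return out
```

In the FST setting the marker tokens survive into the final lattice, so later scoring stages can count and penalize them.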
Bryant and Briscoe (2018) addressed substitution errors such as non-words and morphology, article, and preposition errors by creating confusion sets C(w) that contain possible (context-independent) 1:1 corrections for each input word w. Specifically, they relied on CyHunspell for spell checking (Rodriguez and Seal, 2014), the AGID morphology database for morphology errors (Atkinson, 2011), and manually defined confusion sets for determiner and preposition errors, hence avoiding the need for annotated training data. We use the same confusion sets as Bryant and Briscoe (2018) to augment our hypothesis space via the edit flower transducer E shown in Fig. 2. E can map any sequence to itself via its self-loops. Additionally, it allows the mapping w ↦ ⟨corr⟩ w' for each w' ∈ C(w). For example, for the misspelled word ‘situaction’ and the confusion set C(situaction) = {situation, acquisition}, E allows mapping ‘situaction’ to ‘⟨corr⟩ situation’ and ‘⟨corr⟩ acquisition’, and to itself via the self-loop. The additional ⟨corr⟩ token will help us keep track of the edits. We obtain our base lattice H, which defines the set of possible hypotheses, by composition and projection:

H = Π_output(I ∘ E)

Fig. 1(d) shows H for our running example.
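As a toy stand-in for composing the input lattice with the edit flower transducer, the sketch below enumerates the hypothesis space of a single sentence from context-independent confusion sets. The function name and `<corr>` token spelling are illustrative, and explicit enumeration is only viable for short sentences, which is exactly why FST composition is used instead:

```python
from itertools import product

def expand(sentence, confusions):
    """Enumerate the hypothesis space: each word maps either to itself
    (the self-loop) or to '<corr> w' for every w in its confusion set."""
    options = []
    for w in sentence.split():
        options.append([w] + ["<corr> " + c for c in confusions.get(w, [])])
    # Cartesian product over per-word options = all hypotheses.
    return [" ".join(p) for p in product(*options)]
```

For example, `expand("a such situaction", {"situaction": ["situation", "acquisition"]})` yields three hypotheses: the unchanged sentence and one per proposed correction.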
Scoring the hypothesis space
We apply multiple scoring strategies to the hypotheses in H. First, we penalize ⟨corr⟩ and ⟨mcorr⟩ tokens with two further parameters, λ_corr and λ_mcorr, by composing H with the penalization transducer P shown in Fig. 3 (rather than using ⟨corr⟩ and ⟨mcorr⟩ tokens and the transducer P, we could directly incorporate the costs into the transducers E and I, respectively; we chose explicit correction tokens for clarity). The λ_corr and λ_mcorr parameters control the trade-off between the number and the quality of the proposed corrections, since high values bias the system towards fewer corrections.
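In the tropical semiring, composing with the penalization transducer simply adds a fixed cost per correction token along each path. A minimal sketch of that per-path contribution (token spellings and parameter names are illustrative):

```python
def correction_penalty(hyp, lam_corr, lam_mcorr):
    """Cost added by the penalization transducer: every correction
    token on the path contributes one tunable penalty (tropical
    semiring, so path costs are sums)."""
    toks = hyp.split()
    return (lam_corr * toks.count("<corr>")
            + lam_mcorr * toks.count("<mcorr>"))
```

Raising either parameter makes every edit more expensive, which biases the shortest path towards hypotheses with fewer corrections.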
To incorporate word-level language model scores we train a 5-gram count-based LM with KenLM (Heafield, 2011) on the One Billion Word Benchmark dataset (Chelba et al., 2014), and convert it to an FST L using the OpenGrm NGram Library (Roark et al., 2012). For tuning purposes we scale the weights in L with λ_LM:

L_λ(a, a) = λ_LM · L(a, a)
Our combined word-level scores can be expressed with the following transducer:

W = H ∘ P ∘ L_λ
Since we operate in the tropical semiring, path scores in W are linear combinations of correction penalties, LM scores and, if applicable, SMT scores, weighted with the λ-parameters. Note that exact inference in W is possible using FST shortest path search. This is an improvement over the work of Bryant and Briscoe (2018), who selected correction options greedily. Our ultimate goal, however, is to rescore with neural models such as an NLM and, if annotated training data is available, an NMT model. Since our neural models use subword units (Sennrich et al., 2016, BPEs), we compose W with a transducer T which maps word sequences to BPE sequences. Our final transducer F, which we use to constrain the neural beam decoder, can be written as:

F = Π_output(W ∘ T)
To help downstream beam decoding we apply ε-removal, determinization, minimization, and weight pushing (Mohri, 1997; Mohri and Riley, 2001) to F. We search for the best hypothesis ŷ with beam search, using a combined score of the word-level symbolic models (represented by F) and the subword-unit-based neural models:

ŷ = argmax_y ( F(y, y) + λ_NLM log P_NLM(y) + λ_NMT log P_NMT(y | x) )
| 1 | Best published (B&B, 2018) | 40.56 | 20.81 | 34.09 | 59.35 | 76.23 | 28.48 | 57.08 | 48.75 |
| 1 | Best published (G&J-D, 2018) | 66.77 | 34.49 | 56.25 | n/a | n/a | n/a | n/a | 61.50 |
| 2 | Unconstrained single NMT | 54.98 | 22.20 | 42.45 | 67.19 | 67.49 | 38.47 | 58.64 | 50.71 |
The final decoding pass can be seen as an ensemble of a neural LM and an NMT model which is constrained and scored at each time step by the set of possible tokens in F.
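The constrained decoding loop can be sketched as follows. A dict-based prefix trie stands in for the determinized constraint FST, and `neural_logprob` stands in for the NLM/NMT ensemble; all names and the data layout are illustrative:

```python
def constrained_beam_search(trie, neural_logprob, beam=4, max_len=20):
    """Beam search in which expansions at each step are restricted to
    the tokens permitted by the constraint automaton. The trie maps
    token -> (fst_cost, subtrie); '</s>' marks an accepted path.
    Each path accumulates the (negated) FST cost plus the neural
    log-probability of the chosen token."""
    hyps = [([], 0.0)]          # (token sequence, combined score)
    finished = []
    for _ in range(max_len):
        new = []
        for seq, score in hyps:
            node = trie
            for t in seq:       # walk to the state reached by this prefix
                node = node[t][1]
            for tok, (fst_cost, _) in node.items():
                s = score - fst_cost + neural_logprob(seq, tok)
                if tok == "</s>":
                    finished.append((seq, s))
                else:
                    new.append((seq + [tok], s))
        if not new:
            break
        hyps = sorted(new, key=lambda h: -h[1])[:beam]  # beam pruning
    return max(finished, key=lambda h: h[1]) if finished else None
```

Because every expansion is drawn from the trie, the decoder can never leave the hypothesis space; the neural scores only re-rank the paths inside it.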
We have introduced three λ-parameters (λ_corr, λ_mcorr, and λ_LM), and three additional parameters (λ_SMT, λ_NLM, and λ_NMT) if we make use of annotated training data. We also use a word insertion penalty for our SMT-based experiments. We tune all these parameters on the development sets using Powell search (Powell, 1964). Similarly to Bryant and Briscoe (2018), even in our experiments without annotated training data we do need a very small amount of annotated sentences for tuning.
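Powell search performs repeated derivative-free one-dimensional minimizations. The sketch below is a simplified coordinate-descent variant (it omits Powell's direction updates) that illustrates how such a tuning loop works; in practice `objective` would wrap a decoding run scored on the development set, and the vector would hold the λ-parameters:

```python
def coordinate_search(objective, x0, step=0.5, iters=20):
    """Derivative-free minimization by trying +/- step moves along
    each coordinate axis, halving the step when no move helps.
    Simplified stand-in for Powell's method (no direction updates)."""
    x = list(x0)
    best = objective(x)
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[i] += delta
                val = objective(cand)
                if val < best:
                    best, x, improved = val, cand, True
        if not improved:
            step /= 2.0     # refine once no axis move improves
    return x, best
```

Like Powell search, this needs only objective evaluations, which is what makes it usable when the objective is a full decoding pass scored with a non-differentiable metric such as M2 or GLEU.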
In our experiments with annotated training data we use the SMT system of Junczys-Dowmunt and Grundkiewicz (2016) (https://github.com/grammatical/baselines-emnlp2016) to create 1000-best lists from which we derive the input lattices I. All our LMs are trained on the One Billion Word Benchmark dataset (Chelba et al., 2014). Our neural LM is a Transformer decoder architecture in the transformer_base configuration trained with Tensor2Tensor (Vaswani et al., 2018). Our NMT model is a Transformer model (transformer_base) trained on the concatenation of the NUCLE corpus (Dahlmeier et al., 2013) and the Lang-8 Corpus of Learner English v1.0 (Mizumoto et al., 2012). We only keep sentences with at least one correction (659K sentences in total). Both the NMT and NLM models use byte pair encoding (Sennrich et al., 2016, BPE) with 32K merge operations. We delay SGD updates by 2 on four physical GPUs as suggested by Saunders et al. (2018). We decode with beam size 12 using the SGNMT decoder (Stahlberg et al., 2017). We evaluate on CoNLL-2014 (Ng et al., 2014) and JFLEG-Test (Napoles et al., 2017), using CoNLL-2013 (Ng et al., 2013) and JFLEG-Dev as development sets. Our evaluation metrics are GLEU (Napoles et al., 2015) and M2 (Dahlmeier and Ng, 2012). We generated M2 files using ERRANT (Bryant et al., 2017) for JFLEG and for Tab. 1 to be comparable to Bryant and Briscoe (2018), but used the official M2 files in Tab. 2 to be comparable to Grundkiewicz and Junczys-Dowmunt (2018).
Our LM-based GEC results without annotated training data are summarized in Tab. 1. Even when we use the same resources (same LM and same confusion sets) as Bryant and Briscoe (2018), we see gains on JFLEG (rows 1 vs. 2), probably because we avoid search errors in our FST-based scheme. Adding an NLM yields significant gains across the board. Tab. 2 shows that adding confusion sets to SMT lattices is effective even without neural models (rows 3 vs. 4). Rescoring with neural models also benefits from the confusion sets (rows 5 vs. 6). With our ensemble systems (rows 7 and 8) we are able to outperform prior work (row 1) on CoNLL-2014 and come within 3 GLEU on JFLEG. (We compare our systems to the work of Grundkiewicz and Junczys-Dowmunt (2018) as they used similar training data. We note, however, that Ge et al. (2018b) reported even better results with much more non-public training data; comparing Ge et al. (2018a) and Ge et al. (2018b) suggests that most of their gains come from the larger training set.) Since the baseline SMT systems of Grundkiewicz and Junczys-Dowmunt (2018) were better than the ones we used, we achieve even higher relative gains over the respective SMT baselines (Tab. 3).
| G&J-D (2018) | This work |
Error type analysis
We also carried out a more detailed error type analysis of the best CoNLL-2014 M2 system with/without training data using ERRANT (Tab. 4). Specifically, this table shows that while the trained system was consistently better than the untrained system, the degree of the improvement differs significantly depending on the error type. In particular, since the untrained system was only designed to handle Replacement word errors, much of the improvement in the trained system comes from the ability to correct Missing and Unnecessary word errors. The trained system nevertheless still improves upon the untrained system in terms of replacement errors by 10 F (45.53 vs. 55.63).
In terms of more specific error types, the trained system was also able to capture a wider variety of error types, including content word errors (adjectives, adverbs, nouns, and verbs) and other categories such as pronouns and punctuation. Since the untrained system only targets spelling, orthographic, and morphological errors, however, it is interesting to note that the difference in scores for these categories tends to be smaller than for others; e.g. noun number (53.43 vs. 64.96), orthography (62.77 vs. 74.07), spelling (67.91 vs. 75.21) and subject-verb agreement (66.67 vs. 68.39). This suggests that an untrained system is already able to capture the majority of these error types.
| Configuration | Oracle SER |
| Expanded input sentence (Tab. 1) | 61.28% |
| SMT lattice (Tab. 2, rows 3, 5) | 55.64% |
| Expanded SMT lattice (Tab. 2, rows 4, 6-8) | 48.17% |
Our FST-based composition cascade is designed to enrich the search space to allow the neural models to find better hypotheses. Tab. 5 reports the oracle sentence error rate for different configurations, i.e. the fraction of reference sentences in the test set which are not in the FSTs. Expanding the SMT lattice significantly reduces the oracle error rate from 55.64% to 48.17%.
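The oracle sentence error rate in Tab. 5 can be computed directly once the hypothesis spaces are materialized (or via FST membership tests); a minimal sketch with illustrative names:

```python
def oracle_ser(references, hypothesis_sets):
    """Fraction of reference corrections not contained in the
    corresponding hypothesis space (lower is better): an oracle
    decoder could recover exactly the references that are covered."""
    misses = sum(ref not in hyps
                 for ref, hyps in zip(references, hypothesis_sets))
    return misses / len(references)
```

A lower oracle SER means the constrained search space is less often to blame when the decoder fails to produce the reference.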
We demonstrated that our FST-based approach to GEC outperforms prior work on LM-based GEC significantly, especially when combined with a neural LM. We also applied our approach to SMT lattices and reported much better relative gains over the SMT baselines than previous work on hybrid systems. Our results suggest that FSTs provide a powerful and effective framework for constraining neural GEC systems.
This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC grant EP/L027623/1).
- Allauzen et al. (2007) Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Implementation and Application of Automata, pages 11–23. Springer.
- Atkinson (2011) Kevin Atkinson. 2011. Automatically generated inflection database (AGID). http://wordlist.aspell.net/other/. [Online; accessed 24-December-2018].
- Bryant and Briscoe (2018) Christopher Bryant and Ted Briscoe. 2018. Language model based grammatical error correction without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 247–253. Association for Computational Linguistics.
- Bryant et al. (2017) Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805. Association for Computational Linguistics.
- Chelba et al. (2014) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH-2014), pages 2635–2639.
- Chollampatt and Ng (2017) Shamil Chollampatt and Hwee Tou Ng. 2017. Connecting the dots: Towards human-level grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 327–333. Association for Computational Linguistics.
- Chollampatt and Ng (2018) Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA.
- Chollampatt et al. (2016) Shamil Chollampatt, Kaveh Taghipour, and Hwee Tou Ng. 2016. Neural network translation models for grammatical error correction. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2768–2774. AAAI Press.
- Dahlmeier and Ng (2012) Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572. Association for Computational Linguistics.
- Dahlmeier et al. (2013) Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31. Association for Computational Linguistics.
- Ge et al. (2018a) Tao Ge, Furu Wei, and Ming Zhou. 2018a. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055–1065. Association for Computational Linguistics.
- Ge et al. (2018b) Tao Ge, Furu Wei, and Ming Zhou. 2018b. Reaching human-level performance in automatic grammatical error correction: An empirical study. arXiv preprint arXiv:1807.01270.
- Grundkiewicz and Junczys-Dowmunt (2018) Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 284–290. Association for Computational Linguistics.
- Heafield (2011) Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
- Ji et al. (2017) Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. 2017. A nested attention neural hybrid model for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 753–762. Association for Computational Linguistics.
- Junczys-Dowmunt and Grundkiewicz (2016) Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1546–1556. Association for Computational Linguistics.
- Mizumoto et al. (2012) Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2012. The effect of learner corpus size in grammatical error correction of ESL writings. In Proceedings of COLING 2012: Posters, pages 863–872. The COLING 2012 Organizing Committee.
- Mohri (1997) Mehryar Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2).
- Mohri (2003) Mehryar Mohri. 2003. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(06):957–982.
- Mohri and Riley (2001) Mehryar Mohri and Michael Riley. 2001. A weight pushing algorithm for large vocabulary speech recognition. In Seventh European Conference on Speech Communication and Technology.
- Napoles et al. (2015) Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593. Association for Computational Linguistics.
- Napoles et al. (2017) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229–234. Association for Computational Linguistics.
- Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14. Association for Computational Linguistics.
- Ng et al. (2013) Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12. Association for Computational Linguistics.
- Powell (1964) Michael JD Powell. 1964. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The computer journal, 7(2):155–162.
- Roark et al. (2012) Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66. Association for Computational Linguistics.
- Rodriguez and Seal (2014) Tim Rodriguez and Matthew Seal. 2014. CyHunspell. https://github.com/MSeal/cython_hunspell. [Online; accessed 24-December-2018].
- Sakaguchi et al. (2017) Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 366–372. Asian Federation of Natural Language Processing.
- Saunders et al. (2018) Danielle Saunders, Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2018. Multi-representation ensembles and delayed SGD updates improve syntax-based NMT. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 319–325. Association for Computational Linguistics.
- Schmaltz et al. (2017) Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting sequence models for sentence correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2807–2813. Association for Computational Linguistics.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
- Stahlberg et al. (2017) Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. 2017. SGNMT – A flexible NMT decoding platform for quick prototyping of new models and search strategies. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 25–30. Association for Computational Linguistics.
- Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199. Association for Machine Translation in the Americas.
- Xie et al. (2016) Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y Ng. 2016. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727.
- Yuan and Briscoe (2016) Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–386. Association for Computational Linguistics.