Tree-Transformer: A Transformer-Based Method for Correction of Tree-Structured Data

Jacob Harer et al.
Boston University
Draper, Inc.

Many common sequential data sources, such as source code and natural language, have a natural tree-structured representation. These trees can be generated by fitting a sequence to a grammar, yielding a hierarchical ordering of the tokens in the sequence. This structure encodes a high degree of syntactic information, making it ideal for problems such as grammar correction. However, little work has been done to develop neural networks that can operate on and exploit tree-structured data. In this paper we present the Tree-Transformer, a novel neural network architecture designed to translate between arbitrary input and output trees. We applied this architecture to correction tasks in both the source code and natural language domains. On source code, our model achieved an improvement of 25% F0.5 over the best sequential method. On natural language, we achieved results comparable to the most complex state-of-the-art systems, obtaining a 10% improvement in recall on the CoNLL 2014 benchmark and the highest F0.5 score to date (50.43) on the AESW benchmark.





1 Introduction

Most machine learning approaches to correction tasks operate on sequential representations of input and output data. Generally this is done as a matter of convenience: sequential data is readily available and requires minimal effort to be used as training data for many machine learning models. Sequence-based machine learning models have produced prominent results in both translation and correction of natural language (Bahdanau et al., 2015; Vaswani et al., 2017; Xie et al., 2016; Yuan and Briscoe, 2016; Ji et al., 2017; Schmaltz et al., 2017; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018).

While algorithms that use sequential data have produced good results, most of these sequential data types can be more informatively represented using a tree structure. One common method to obtain tree-structured data is to fit sequential data to a grammar, such as a Context Free Grammar (CFG). The use of a grammar ensures that generated trees encode the higher-order syntactic structure of the data in addition to the information contained by sequential data.

In this work, we trained a neural network to operate directly on trees, teaching it to learn the syntax of the underlying grammar and to leverage this syntax to produce outputs which are grammatically correct. Our model, the Tree-Transformer, handles correction tasks in a tree-based encoder and decoder framework. The Tree-Transformer leverages the popular Transformer architecture Vaswani et al. (2017), modifying it to incorporate the tree structure of the data by adding a parent-sibling tree convolution block. To show the power of our model, we focused our experiments on two common data types and their respective tree representations: Abstract Syntax Trees (ASTs) for code and Constituency Parse Trees (CPTs) for natural language.

2 Related Work

2.1 Tree Structured Neural Networks

Existing work on tree-structured neural networks can largely be grouped into two categories: encoding trees and generating trees. Several types of tree encoders exist (Tai et al., 2015; Zhu et al., 2015; Socher et al., 2011; Eriguchi et al., 2016). The seminal work of Tai et al. (2015) laid the groundwork for these methods, using a variant of a Long Short-Term Memory (LSTM) network to encode an arbitrary tree. A large body of work has also focused on how to generate trees (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017; Vinyals et al., 2015; Aharoni and Goldberg, 2017; Rabinovich et al., 2017; Parisotto et al., 2017; Yin and Neubig, 2017; Zhang et al., 2016). The works of Dong and Lapata (2016) and Alvarez-Melis and Jaakkola (2017) each extend the LSTM decoder popular in Neural Machine Translation (NMT) systems to arbitrary trees. This is done by labeling some outputs as parent nodes and then forking off additional sequence generations to create their children. Only a small amount of work has combined encoding and generation of trees into a tree-to-tree system (Chen et al., 2018; Chakraborty et al., 2018). Of note is Chakraborty et al. (2018), who use an LSTM-based tree-to-tree method for source code completion.

To our knowledge, our work is the first to use a Transformer-based network on trees and to apply tree-to-tree techniques to natural language and code correction tasks.

2.2 Code Correction

There is extensive existing work on automatic repair of software. However, the majority of this work consists of rule-based systems that make use of small datasets (see Monperrus (2018) for a more extensive review of these methods). Two successful recent approaches in this category are those of Le et al. (2016) and Long and Rinard (2016). Le et al. mine a history of bug fixes across multiple projects and attempt to reuse common bug fix patterns on newly discovered bugs. Long and Rinard learn and use a probabilistic model to rank potential fixes for defective code. Unfortunately, the small datasets used in these works are not suitable for training a large neural network like ours.

Neural network-based approaches for code correction are less common. Devlin et al. (2017) generate repairs with a rule-based method and then rank them using a neural network. Gupta et al. (2017) were the first to train an NMT model to directly generate repairs for incorrect code. Additionally, Harer et al. (2018) use a Generative Adversarial Network to train an NMT model for code correction in the absence of paired data. The works of Gupta et al. and Harer et al. are the closest to our own, since they directly correct code using an NMT system.

2.3 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting grammatically incorrect sentences. This task is similar in many ways to machine translation. However, initial attempts to apply NMT systems to GEC were outperformed by phrase-based or hybrid systems (Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2017; Dahlmeier et al., 2013).

Initial, purely neural systems for GEC largely copied NMT systems. Yuan and Briscoe (2016) produced the first NMT-style system for GEC, using the popular attention method of Bahdanau et al. (2015). Xie et al. (2016) trained a novel character-based model with attention. Ji et al. (2017) proposed a hybrid character-word level model, using a nested character-level attention model to handle rare words. Schmaltz et al. (2017) used a word-level bidirectional LSTM network. Chollampatt and Ng (2018) created a convolution-based encoder-decoder network that was the first to beat state-of-the-art phrase-based systems. Finally, Junczys-Dowmunt et al. (2018) treated GEC as a low-resource machine translation task, utilizing the combination of a large monolingual language model and a specifically designed correction loss function.

3 Architecture

Our Tree-Transformer architecture is based on the Transformer architecture of Vaswani et al. (2017), modified to handle tree-structured data. Our major change to the Transformer is the replacement of the feed-forward sublayer in both the encoder and decoder with a Tree Convolution Block (TCB). The TCB gives each node direct access to its parent and left sibling, allowing the network to understand the tree structure. We follow the same overall architecture as Vaswani et al., consisting of self-attention, encoder-decoder attention, and TCB sublayers in each layer. Our models follow the 6-layer architecture of the base Transformer model, with sublayer outputs of size 512.

Figure 1: Tree-Transformer model architecture.

3.1 Parent-Sibling Tree Convolution

Tree convolution is computed for each node as:

TCB(x_n, x_p, x_s) = f(W_n x_n + W_p x_p + W_s x_s + b)

where f is the non-linearity. The inputs x_n, x_p, and x_s all come from the previous sublayer: x_n from the same node, x_p from the parent node, and x_s from its left sibling. In cases where a node does not have either a parent (e.g., the root node) or a left sibling (e.g., a parent's first child), the inputs x_p and x_s are replaced with learned vectors v_p and v_s, respectively.

In addition to the TCB used in each sublayer, we also use a TCB at the input to both the encoder and decoder. In the encoder, this input block combines the embeddings of the parent, the sibling, and the current node (x_p, x_s, and x_n). In the decoder, the current node is unknown since it has not yet been produced. Therefore, this block only combines the parent and sibling embeddings, leaving the x_n term out of the equation above.

The overall structure of the network is shown in Figure 1. The inputs to each TCB come from the network embeddings for the input blocks, and from the previous sublayer for all other blocks.
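As a concrete sketch, the per-node parent-sibling tree convolution described above can be written out in plain Python. The separate weight matrices, bias, and ReLU non-linearity here are illustrative assumptions rather than the paper's exact parameterization:

```python
def tcb(x_n, x_p, x_s, W_n, W_p, W_s, b):
    """Parent-sibling tree convolution for a single node (a sketch).

    x_n, x_p, x_s: vectors for the current node, its parent, and its left
    sibling (learned placeholder vectors stand in when either is absent).
    W_n, W_p, W_s: weight matrices given as lists of rows; b: bias vector.
    """
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    pre = [n + p + s + bi for n, p, s, bi in
           zip(matvec(W_n, x_n), matvec(W_p, x_p), matvec(W_s, x_s), b)]
    # ReLU non-linearity (an assumption; the text only states a non-linearity)
    return [max(0.0, v) for v in pre]
```

For a root node, x_p would be the learned vector v_p; for a first child, x_s would be v_s.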

Figure 2: Tree-Transformer State Transfer

3.2 Top-Down Encoder and Decoder

Both our encoder and decoder use a top-down approach in which information flows from the tree's root node down toward the leaf nodes, as shown in Figure 2. Thus, leaf nodes have access to a large sub-tree, while parent nodes have a more limited view. An alternative approach would be to use a bottom-up encoder, where each node has access to its children, and a top-down decoder which can disseminate this information. This bottom-up/top-down model is intuitive because information flows up the tree in the encoder and then back down the decoder. However, we found that using the same top-down ordering for both encoder and decoder performed better, likely because the symmetry between encoder and decoder allows decoder nodes to easily attend to their corresponding nodes in the encoder. This symmetry trivializes copying from encoder to decoder, which is particularly useful in correction tasks where large portions of the trees remain unchanged between input and output.

3.3 Generating Tree Structure

In order to generate each tree's structure, we treat each set of siblings as a sequence. For each set of siblings, we generate one node at a time, ending with the generation of an end-of-sequence token. The vocabulary defines a set of leaf and parent nodes. When a parent node is generated, we begin creating that node's children as another set of siblings.
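To make the scheme concrete, the following sketch decodes a depth-first token stream (with a hypothetical parent-node vocabulary and end-of-sequence marker) back into a nested tree; all names here are illustrative, not the paper's implementation:

```python
EOS = "<eos>"
PARENTS = {"S", "NP", "VP"}  # hypothetical parent-node vocabulary

def read_children(tokens):
    """Consume one sibling sequence, recursing whenever a parent token
    appears, until the end-of-sequence token is seen."""
    children = []
    for tok in tokens:
        if tok == EOS:
            break
        if tok in PARENTS:
            children.append((tok, read_children(tokens)))
        else:
            children.append(tok)  # leaf node
    return children

def decode_tree(stream):
    """Rebuild a (label, children) tree from a depth-first token stream."""
    it = iter(stream)
    return (next(it), read_children(it))
```

For example, the stream ["S", "NP", "My", "dog", "<eos>", "VP", "dug", "<eos>", "<eos>"] decodes to ("S", [("NP", ["My", "dog"]), ("VP", ["dug"])]).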

3.4 Depth First Ordering

As with any NMT system, each output value is produced from the decoder and fed back to subsequent nodes as input during evaluation. Ensuring that the parent and sibling inputs are available to each node requires that parents be produced before their children and that siblings be produced in left-to-right order. To enforce this constraint, we order the nodes in a depth-first manner. This ordering is shown by the numbering on nodes in Figure 3. The self-attention mechanism in the decoder is also masked according to this order, so that each node only has access to previously produced ones.
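A small sketch of this ordering and the corresponding decoder mask, assuming trees stored as (label, children) tuples with string leaves (an illustration, not the paper's code):

```python
def dfs_order(tree):
    """List node labels in the depth-first (parent-before-children,
    left-to-right sibling) order used for generation."""
    order = []
    def visit(node):
        if isinstance(node, tuple):   # parent node: (label, children)
            order.append(node[0])
            for child in node[1]:
                visit(child)
        else:                         # leaf node: plain label
            order.append(node)
    visit(tree)
    return order

def causal_mask(n):
    """mask[i][j] is True when node i may attend to node j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]
```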

3.5 No Positional Encoding

The Transformer architecture utilizes a positional encoding to help the attention mechanisms localize. Positional encoding is less important in our model because the TCB allows each node to easily locate its parent and sibling in the tree. In fact, we found that including a positional encoding caused the network to overfit, likely due to the relatively small size of the correction datasets we use. Given this, our Tree-Transformer networks do not include a positional encoding.

4 Why the Tree-Transformer

In this section we motivate the design of our Tree-Transformer model over other possible tree-based architectures. Our choice to build upon the Transformer model was twofold. First, Transformer-based models have significantly reduced time complexity relative to Recurrent Neural Network (RNN) based approaches. Second, many of the building blocks required for a tree-to-tree translation system, including self-attention, are already present in the Transformer architecture.

4.1 Recurrent vs Attention Tree Networks

Many previous works on tree-structured networks used RNN-based tree architectures, where nodes in one layer are given access to their parent or children in the same layer (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017; Chakraborty et al., 2018). This state transfer requires an ordering of the nodes during training, where earlier nodes in a layer must be computed before later ones. This ordering requirement leads to poor time complexity for tree-structured RNNs, since each node in a tree needs access to multiple prior nodes (e.g., parent and sibling). Accessing the states of prior nodes thus requires a gather operation over all previously produced nodes. These gather operations are slow, and performing them serially for each node in the tree can be prohibitively expensive.

An alternative to the RNN-type architecture is a convolutional or attention-based one, where nodes in one layer are given access to prior nodes in the previous layer. With the dependence within the same layer removed, the gather operation can be batched over all nodes in a tree, resulting in one large gather operation instead of one per node. In our experiments, this batching reduced training time by two orders of magnitude on our largest dataset: from the order of months to less than a day.
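The batching argument can be illustrated with a toy gather: because a node's parent state lives in the previous layer, every node's parent state can be fetched in a single pass (one gather per layer) rather than serially per node. The list-based representation below is an illustrative stand-in for a real tensor gather:

```python
def gather_parents(prev_layer_states, parent_idx, root_vec):
    """Fetch each node's parent state from the previous layer in one pass.

    parent_idx[i] is the index of node i's parent, or -1 for the root,
    which uses a learned placeholder vector instead."""
    return [root_vec if p < 0 else prev_layer_states[p] for p in parent_idx]
```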

4.2 Conditional Probabilities and Self Attention

The Tree-Transformer’s structure helps the network produce grammatically correct outputs. However, for translation/correction tasks we must additionally ensure that each output y_i is conditionally dependent on both the input x and on the previous outputs y_1, ..., y_{i-1}. Conditioning the output on the input is achieved using an encoder-decoder attention mechanism (Bahdanau et al., 2015). Conditioning each output on previous outputs is more difficult in a tree-based system. In a model like ours, with only parent and sibling connections, the leaf nodes in one branch do not have access to leaf nodes in other branches. This leads to potentially undesired conditional independence between branches. Consider the example constituency parse tree shown in Figure 3. Given the initial noun phrase "My dog", the following verb phrase "dug a hole" is far more likely than "gave a speech". However, in a tree-based model the verb phrases do not have direct access to the sampled noun phrase, meaning both possible sentences would be considered roughly equally probable by the model.

We address the above limitation with the inclusion of a self-attention mechanism which allows nodes access to all previously produced nodes. This mechanism, along with the depth-first ordering of the nodes described in Section 3.4, gives each leaf node access to all previously produced leaf nodes. Our model fits the standard probabilistic language model:

P(y | x) = ∏_i P(y_i | y_1, ..., y_{i-1}, x)    (1)

where i is the index of the node in the depth-first ordering.
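Under this factorization, the log-probability of a generated tree is simply the sum of per-node log-probabilities taken in depth-first order, which is also the quantity beam search scores:

```python
import math

def tree_log_prob(node_probs):
    """Log-probability of a tree: sum of log P(y_i | y_1..y_{i-1}, x)
    over nodes in depth-first order (per-node probabilities precomputed)."""
    return sum(math.log(p) for p in node_probs)
```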

Figure 3: Example Constituency Parse Tree. The index of the node in depth-first ordering is shown in the bottom left of each node. Note: leaf nodes in the verb phrase do not have access to leaf nodes in the left noun phrase without self-attention

5 Training

This section describes the training procedure and parameter choices used in our model. We trained our models in parallel on Nvidia Tesla V100 GPUs. Trees were batched together based on size, with a cap on the total number of words per batch. We used the Adam optimizer with inverse square root decay and warm-up. A full list of hyperparameters for each run is included in the Appendix.

5.1 Regularization

The correction datasets we used in this paper are relatively small compared to typical NMT datasets. As such, we found a high degree of regularization was necessary. We included dropout of 0.3 before the residual connection of each sublayer and attention dropout of 0.1. We also added dropout of 0.3 to each TCB after the non-linearity. We applied dropout to words in both source and target embeddings, as per Junczys-Dowmunt et al. (2018), with probabilities 0.2 and 0.1 respectively. We also included label smoothing.

5.2 Beam-Search

Because of the depth-first ordering of nodes in our model, we can use beam search in the same way as traditional NMT systems. Following equation 1, we can compute the probability of a generated sub-tree of nodes simply as the product of probabilities for each node. We utilize beam-search during testing with a beam width of 6. Larger beam widths did not produce improved results.

6 Experiments/Results

6.1 Code Correction

We trained our Tree-Transformer on code examples taken from the NIST SATE IV dataset (Okun et al., 2013). SATE IV contains C and C++ files covering 116 different Common Weakness Enumerations (CWEs), and was originally designed to test static analyzers. Each file contains a bad function with a known security vulnerability and at least one good function which fixes the vulnerability. We generate Abstract Syntax Trees (ASTs) from these functions using the Clang AST framework (Lattner and Adve, 2004) [9].

To provide a representation usable by our network, we tokenize the AST over a fixed vocabulary in three ways. First, high-level AST nodes and data types are represented by individual tokens. Second, character and numeric literals are represented by a sequence of ASCII characters with a parent node defining the kind of literal (e.g., Int Literal, Float Literal). Finally, we use variable renaming to assign per-function unique tokens to each variable and string. Our vocabulary consists of AST tokens, data type tokens, ASCII tokens, and variable tokens.
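The variable-renaming step can be sketched as a simple first-occurrence mapping; the VAR token naming and the assumption that identifier tokens are already known from the AST are illustrative:

```python
def rename_variables(tokens, identifier_tokens):
    """Replace each distinct identifier with a per-function token
    (VAR0, VAR1, ...) in order of first occurrence; all other tokens
    pass through unchanged."""
    mapping = {}
    renamed = []
    for tok in tokens:
        if tok in identifier_tokens:
            mapping.setdefault(tok, "VAR%d" % len(mapping))
            renamed.append(mapping[tok])
        else:
            renamed.append(tok)
    return renamed
```

This makes functions that differ only in identifier names identical at the token level, which is what enables the deduplication described below.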

Using the SATE IV dataset requires pre-processing, which we do during data generation. First, many of the SATE IV functions contain large amounts of dead code. In these cases, the bad and good functions contain largely the same code, but one path is executed in the bad function and another in the good one. To make these cases more realistic, we removed the dead code. Second, although each file for a particular CWE contains unique functions at the text level, many of them are identical once converted to ASTs with renamed variables. Identical cases come in two flavors: one where a bad function is identical to its good counterpart, and one where multiple bad functions from different files are identical. The first occurs commonly in SATE IV where the bad and good functions are identical except for differing function calls. Since we operate at the function level, examples of this case are not useful and are removed. The second case occurs when bad functions differ only in variable names, strings, or function calls. To handle this, we compare the tree representations of all bad functions and combine identical bad functions into a single bad tree, paired with the good trees of all its component functions, with duplicate good trees removed. After pre-processing, the remaining bad and good functions are split 80/10/10 into training/validation/testing sets.

To our knowledge, this processing of the SATE IV dataset is new. As such, we compare our network to two NMT systems operating on a sequence-based representation of the data: a 4-layer LSTM with attention and a base Transformer model. These sequence-based models use an almost identical vocabulary and tokenization to our tree representation, but they operate over the tokenized sequence output of the Clang lexer instead of the AST.

During testing, we utilized Clang's source-to-source compilation framework to return the tree output of our networks to source code. We then compute precision, recall, and F0.5 scores of source code edits using the MaxMatch algorithm [29]. Results are given in Table 1. Our Tree-Transformer model performs better than either of the other two models we considered. We believe this is because source code is more naturally structured as a tree than as a sequence, lending itself to our tree-based model.

Architecture      Precision  Recall  F0.5
4-layer LSTM      51.3       53.4    51.7
Transformer       59.6       86.1    63.5
Tree-Transformer  84.5       85.7    84.7
Table 1: SATE IV results

6.2 Grammar Error Correction

We applied our tree-based model as an alternative to NMT and phrase-based methods for GEC. Specifically, we encoded incorrect sentences using their constituency parse trees and then generated corrected parse trees. Constituency parse trees represent sentences based on their syntactic structure by fitting them to a phrase-structured grammar. Words from the input sentence become leaf nodes of their respective parse trees, and these nodes are combined into increasingly complex phrases as we progress up the tree (Goller and Kuchler, 1996; Socher et al., 2011).

A large amount of research has focused on the generation of constituency parse trees (Chen and Manning, 2014; Socher et al., 2013; Klein and Manning, 2003). We utilize the Stanford NLP group's shift-reduce constituency parser [37] to generate trees for both incorrect and correct sentences in our datasets. These are represented to our network with a combined vocabulary consisting of word-level tokens and parent tokens. The parent tokens come from the part-of-speech tags originally defined by the Penn Treebank [34], plus a root token. Following recent work in GEC (Junczys-Dowmunt et al., 2018; Chollampatt and Ng, 2018), the word-level tokens are converted into sub-words using a Byte Pair Encoding (BPE) trained on the large Wikipedia dataset (Heinzerling and Strube, 2018). The BPE segments rare words into multiple subwords, avoiding the post-processing of unknown words used in many existing GEC techniques.

We test our network on two GEC benchmarks: the commonly used NUCLE CoNLL 2014 task (Ng et al., 2014) and the AESW dataset (Daudaravicius et al., 2016). The CoNLL 2014 training data contains sentences extracted from essays written by non-native English learners. Following the majority of existing GEC work, we augment the small CoNLL dataset with the larger Lang-8 corpus. The Lang-8 data is crowd-sourced from the Lang-8 website, making it noisy and of lower quality than the CoNLL data. We test on the CoNLL 2014 test set and use the CoNLL 2013 test set as validation data. For evaluation we use the official CoNLL M2scorer algorithm to determine edits and compute precision and recall.
We also explore the large AESW dataset. AESW was designed to train grammar error identification systems. However, it includes both incorrect and corrected versions of sentences, making it useful for GEC as well. AESW provides separate training, validation, and testing sets. The AESW data was taken from scientific papers authored by non-native speakers, and as such contains far more formal language than CoNLL.

6.2.1 GEC Training

For GEC we include a few additions to the training procedure described in Section 5. First, we pre-train the network in two ways on sentences from the large monolingual dataset provided by Junczys-Dowmunt and Grundkiewicz (2016). We pre-train the decoder in our network as a language model, including all layers in the decoder except the encoder-decoder attention mechanism. We also pre-train the entire model as a denoising autoencoder, using a source embedding word dropout of 0.4.

For the loss we use the edit-weighted MLE objective defined by Junczys-Dowmunt et al. (2018):

L(x, y) = -∑_i λ_i log P(y_i | y_1, ..., y_{i-1}, x)

where (x, y) are a training pair, and λ_i = Λ if y_i is part of an edit and 1 otherwise. We compute which tokens are part of an edit using the Python APTED graph matching library [2] (Pawlik and Augsten, 2015, 2016).
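A sketch of this objective in plain Python; the default edit weight and the per-token probability inputs are illustrative assumptions:

```python
import math

def edit_weighted_nll(token_probs, is_edit, edit_weight=3.0):
    """Edit-weighted MLE: negative log-likelihood in which tokens that
    are part of an edit are up-weighted by edit_weight; others weigh 1."""
    return -sum((edit_weight if e else 1.0) * math.log(p)
                for p, e in zip(token_probs, is_edit))
```

Up-weighting edited tokens pushes the model to prioritize actually making corrections over simply copying the input.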

During beam-search we ensemble our networks with the monolingual language model used for pre-training, as per Xie et al. (2016):

s(y) = log P(y | x) + α log P_LM(y)

where α is chosen based on the validation set. Typically, we found an α of 0.15 performed best. Networks ensembled with the language model are labeled +Mon-Ens.
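The ensemble score used during beam search is then just a weighted sum of the two models' log-probabilities; a minimal sketch:

```python
def ensemble_score(log_p_model, log_p_lm, alpha=0.15):
    """Beam-search score s(y) = log P(y|x) + alpha * log P_LM(y),
    combining the correction model with the monolingual language model."""
    return log_p_model + alpha * log_p_lm
```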

6.2.2 CoNLL 2014 Analysis

Results for CoNLL 2014 are provided in Table 2. Our Tree-Transformer achieves significantly higher recall than existing approaches, meaning we successfully repair more of the grammar errors. However, our precision is also lower, which implies we make additional unnecessary edits. We attribute this drop to the fact that our method tends to generate examples which fit a structured grammar, so sentences with uncommon grammar tend to be rewritten in a more common way. An example of this effect is provided in Table 3.

6.2.3 AESW Analysis

Results for AESW are provided in Table 4. We achieve the highest F0.5 score to date on AESW, beating even our own sequence-based Transformer model. We attribute this to the fact that AESW is composed of samples taken from submitted papers; the more formal language used in this context may be a better fit for the structured grammar used by our model.

Architecture                             Precision  Recall  F0.5
Prior State-of-the-Art Approaches
Chollampatt and Ng (2017)                62.74      32.96   53.14
Junczys-Dowmunt and Grundkiewicz (2016)  61.27      27.98   49.49
Prior Neural Approaches
Ji et al. (2017)                         -          -       45.15
Schmaltz et al. (2017)                   -          -       41.37
Xie et al. (2016)                        49.24      23.77   40.56
Yuan and Briscoe (2016)                  -          -       39.90
Chollampatt and Ng (2018)                65.49      33.14   54.79
Junczys-Dowmunt et al. (2018)            63.0       38.9    56.1
This Work
Tree-Transformer                         57.39      28.12   47.50
Tree-Transformer +Mon                    58.45      30.42   49.35
Tree-Transformer +Mon +Mon-Ens           57.84      33.26   50.39
Tree-Transformer +Auto                   65.22      30.38   53.05
Tree-Transformer +Auto +Mon-Ens          59.14      43.23   55.09
Table 2: CoNLL 2014 results
Input In conclusion , we could tell the benefits of telling genetic risk to the carriers relatives overweights the costs .
Labels In conclusion , we can see that the benefits of telling genetic risk to the carrier’s relatives outweighs the costs .
In conclusion , we can see that the benefits of disclosing genetic risk to the carriers relatives outweigh the costs.
In conclusion , we can see that the benefits of revealing genetic risk to the carrier’s relatives outweigh the costs .
Network In conclusion , it can be argued that the benefits of revealing genetic risk to one ’s relatives outweighs the costs .
Table 3: CoNLL 2014 Example Output
Architecture                         Precision  Recall  F0.5
Prior Approaches
Schmaltz et al. 2017 (Phrase-based)  -          -       38.31
Schmaltz et al. 2017 (Word LSTM)     -          -       42.78
Schmaltz et al. 2017 (Char LSTM)     -          -       46.72
This Work
Transformer (Seq-to-Seq)             52.3       36.2    48.03
Tree-Transformer                     55.4       37.1    50.43
Table 4: AESW results

7 Conclusion

In this paper we introduced the Tree-Transformer architecture for tree-to-tree correction tasks. We applied our method to correction datasets for both code and natural language and showed an increase in performance over existing sequence-based methods. We believe our model achieves its success by taking advantage of the strong grammatical structure inherent in tree-structured representations. In the future, we hope to apply our approach to other tree-to-tree tasks, such as natural language translation. Additionally, we intend to extend our approach into a more general graph-to-graph method.


  • R. Aharoni and Y. Goldberg (2017)

    Towards String-to-Tree Neural Machine Translation

    Association of Computational Linguistics (ACL). Cited by: §2.1.
  • [2] (2015) Apted python library. Note: Cited by: §6.2.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations (ICLR). Cited by: §1, §2.3, §4.2.
  • S. Chakraborty, M. Allamanis, and B. Ray (2018) Tree2Tree neural translation model for learning source code changes. arXiv pre-print. Cited by: §2.1, §4.1.
  • D. Chen and C. D. Manning (2014) A fast and accurate dependency parser using neural networks.

    Emperical Methods in Natural Language Processing (EMNLP)

    Cited by: §6.2.
  • X. Chen, C. Liu, and D. Song (2018) Tree-to-tree Neural Networks for Program Translation.. Neural Information Processing Systems (NeurIPS). Cited by: §2.1.
  • S. Chollampatt and H. T. Ng (2017) Connecting the Dots: Towards Human-Level Grammatical Error Correction. The 12th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics (ACL). Cited by: §2.3.
  • S. Chollampatt and H. T. Ng (2018) A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction..

    Association for the Advancement of Artificial Intelligence (AAAI)

    Cited by: §1, §2.3, §6.2.
  • [9] (2011) Clang library. Note: Cited by: §6.1.
  • D. Dahlmeier, H. T. Ng, and S. M. Wu (2013) Building a Large Annotated Corpus of Learner English - The NUS Corpus of Learner English.. North American Chapter of the Association of Computational Linguistics (NAACL). Cited by: §2.3.
  • V. Daudaravicius, R. Banchs, E. Volodine, and C. Napoles (2016) A report on the automatic evaluation of scientific writing shared task. 11th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics (ACL). Cited by: §6.2.
  • Devlin, Jacob, Uesato, Jonathan, Singh, Rishabh, and Kohli, Pushmeet (2017) Semantic Code Repair using Neuro-Symbolic Transformation Networks. arXiv:1710.11054. Cited by: §2.2.
  • L. Dong and M. Lapata (2016) Language to Logical Form with Neural Attention.. Association of Computational Linguistics (ACL). Cited by: §2.1, §4.1.
  • A. Eriguchi, K. Hashimoto, and Y. Tsuruoka (2016) Tree-to-Sequence Attentional Neural Machine Translation.. Association of Computational Linguistics (ACL). Cited by: §2.1.
  • C. Goller and A. Kuchler (1996)

    Learning task-dependent distributed representations by backpropagation through structure

    International Conference on Neural Networks (ICNN’96). Cited by: §6.2.
  • R. Gupta, S. Pal, A. Kanade, and S. Shevade (2017)

    DeepFix: fixing common c language errors by deep learning.

    Association for the Advancement of Artifical Intelligence (AAAI), pp. 1345–1351. Cited by: §2.2.
  • J. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y. Kim, and P. Chin (2018) Learning to Repair Software Vulnerabilities with Generative Adversarial Networks. Neural Information Processing Systems (NeuroIPS). Cited by: §2.2.
  • B. Heinzerling and M. Strube (2018) BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §6.2.
  • D. A. &. T. S. Jaakkola (2017) Tree-structured decoding with doubly-recurrent neural networks. International Conference on Learning Representations (ICLR). Cited by: §2.1, §4.1.
  • J. Ji, Q. Wang, K. Toutanova, Y. Gong, S. Truong, and J. Gao (2017) A Nested Attention Neural Hybrid Model for Grammatical Error Correction. Association of Computational Linguistics (ACL). Cited by: §1, §2.3.
  • M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield (2018) Approaching neural grammatical error correction as a low-resource machine translation task. North American Chapter of the Association for Computational Linguistics (NAACL-HLT). Cited by: 19th item, §1, §2.3, §5.1, §6.2.1, §6.2.
  • M. Junczys-Dowmunt and R. Grundkiewicz (2016) Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction. Empirical Methods in Natural Language Processing (EMNLP). Cited by: §2.3, §6.2.1.
  • D. Klein and C. Manning (2003) Accurate unlexicalized parsing. Association for Computational Linguistics (ACL). Cited by: §6.2.
  • C. Lattner and V. S. Adve (2004) LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. International Symposium on Code Generation and Optimization (CGO). Cited by: §6.1.
  • X. B. D. Le, D. Lo, and C. Le Goues (2016) History driven program repair. Software Analysis, Evolution, and Reengineering (SANER). Cited by: §2.2.
  • F. Long and M. Rinard (2016) Automatic patch generation by learning correct code. Principles of Programming Languages (POPL). Cited by: §2.2.
  • M. Monperrus (2018) Automatic software repair: a bibliography. ACM Computing Surveys (CSUR). Cited by: §2.2.
  • H. Ng, S. Wu, T. Briscoe, C. Hadiwinoto, R. Susanto, and C. Bryant (2014) The CoNLL-2014 shared task on grammatical error correction. Conference on Computational Natural Language Learning, Association for Computational Linguistics (ACL). Cited by: §6.2.
  • [29] (2014) Official scorer for the CoNLL-2014 shared task. Cited by: §6.1, §6.2.
  • V. Okun, A. Delaitre, and P. Black (2013) Report on the Static Analysis Tool Exposition (SATE) IV. Technical Report. Cited by: §6.1.
  • E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli (2017) Neuro-Symbolic Program Synthesis. International Conference on Learning Representations (ICLR). Cited by: §2.1.
  • M. Pawlik and N. Augsten (2015) Efficient computation of the tree edit distance. ACM Transactions on Database Systems. Cited by: §6.2.1.
  • M. Pawlik and N. Augsten (2016) Tree edit distance: robust and memory-efficient. Information Systems 56. Cited by: §6.2.1.
  • [34] (2016) Penn Treebank II tags. Cited by: §6.2.
  • M. Rabinovich, M. Stern, and D. Klein (2017) Abstract Syntax Networks for Code Generation and Semantic Parsing. Association for Computational Linguistics (ACL). Cited by: §2.1.
  • A. Schmaltz, Y. Kim, A. M. Rush, and S. M. Shieber (2017) Adapting Sequence Models for Sentence Correction. Empirical Methods in Natural Language Processing (EMNLP). Cited by: §1, §2.3.
  • [37] (2014) Shift-reduce constituency parser. Cited by: §6.2.
  • R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng (2013) Parsing with compositional vector grammars. Association for Computational Linguistics (ACL). Cited by: §6.2.
  • R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning (2011) Parsing Natural Scenes and Natural Language with Recursive Neural Networks. International Conference on Machine Learning (ICML). Cited by: §2.1, §6.2.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. Association for Computational Linguistics (ACL). Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. Neural Information Processing Systems (NIPS). Cited by: 13th item, 2nd item, 8th item, §1, §1, §3.
  • O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton (2015) Grammar as a Foreign Language. Neural Information Processing Systems (NIPS). Cited by: §2.1.
  • Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng (2016) Neural Language Correction with Character-Based Attention. arXiv:1603.09727. Cited by: 18th item, §1, §2.3, §6.2.1.
  • P. Yin and G. Neubig (2017) A Syntactic Neural Model for General-Purpose Code Generation. Association for Computational Linguistics (ACL). Cited by: §2.1.
  • Z. Yuan and T. Briscoe (2016) Grammatical error correction using neural machine translation. North American Chapter of the Association for Computational Linguistics (NAACL). Cited by: §1, §2.3.
  • X. Zhang, L. Lu, and M. Lapata (2016) Top-down Tree Long Short-Term Memory Networks. North American Chapter of the Association for Computational Linguistics (NAACL). Cited by: §2.1.
  • X. Zhu, P. Sobhani, and H. Guo (2015) Long Short-Term Memory Over Tree Structures. International Conference on Machine Learning (ICML). Cited by: §2.1.

Appendix A Hyperparameters

The hyperparameters used are listed in Tables 5 and 6. Default values appear at the top of each table; a blank entry means the run used the default. An explanation of each hyperparameter follows.

  • N - number of layers

  • d_model - size of sub-layer outputs - see Vaswani et al. (2017)

  • d_ff - size of the inner layer in the TDB/FF for Tree-Transformer/Transformer

  • h - number of attention heads

  • d_k - size of keys in the attention mechanism

  • d_v - size of values in the attention mechanism

  • P_drop - dropout probability between sub-layers

  • P_attn - dropout probability on the attention mechanism - see Vaswani et al. (2017)

  • P_ff - dropout probability on the inner layer of the TDB/FF for Tree-Transformer/Transformer

  • P_src - source embedding word dropout probability

  • P_tgt - target embedding word dropout probability

  • ε_ls - label smoothing

  • lr - learning rate. We use the inverse-square-root (ISR) learning-rate schedule as per Vaswani et al. (2017); as such, this nominal rate is never fully reached, and the maximum learning rate depends on the warmup.

  • warmup - number of steps of linear LR warmup

  • train steps - total number of steps for training

  • Mon - Initialized from monolingual pre-trained network

  • Auto - Initialized from autoencoder pre-trained network

  • Mon-Ens - Ensembled with the monolingual network during beam search, as per Xie et al. (2016)

  • EW-MLE - Use edit-weight MLE objective function as per Junczys-Dowmunt et al. (2018)

  • Time (Hours) - Total training time
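
The ISR schedule ties the lr and warmup settings above together. A minimal sketch, using the formula from Vaswani et al. (2017) with the default d_model = 512 and warmup = 4000 from the tables (the function name is ours):

```python
def isr_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Inverse-square-root schedule: linear warmup, then step**-0.5 decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises linearly for `warmup` steps, peaks at `(d_model * warmup) ** -0.5`, and then decays as `step ** -0.5`; a larger warmup therefore gives a lower maximum learning rate, which is why the nominal lr is never fully met.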

Architecture N d_model d_ff h d_k d_v P_drop P_attn P_ff P_src P_tgt ε_ls
default 6 512 2048 8 64 64 0.3 0.1 0.3 0.2 0.1 0.1
SATE IV
LSTM 4 1024 N/A
GEC Pretraining
Autoencoder 0.4
CoNLL 2014
Tree-Transformer +Mon
Tree-Transformer +Mon +Mon-Ens
Tree-Transformer +Auto
Tree-Transformer +Auto +Mon-Ens
Table 5: Model Parameters
Architecture lr warmup train steps Mon Auto Mon-Ens EW-MLE Time (Hours)
default 4000 100k - - - -
SATE IV
Transformer 18
Tree-Transformer 22
GEC Pretraining
Monolingual 500k 38
Autoencoder 500k 50
CoNLL 2014
Tree-Transformer 16000 3 26
Tree-Transformer +Mon 16000 3 26
Tree-Transformer +Mon +Mon-Ens 16000 3 26
Tree-Transformer +Auto 16000 3 26
Tree-Transformer +Auto +Mon-Ens 16000 3 26
Transformer 16000 3 19
Tree-Transformer 16000 3 25
Table 6: Training Parameters