Program translation is an important tool to migrate legacy code in one language into an ecosystem built in a different language. In this work, we are the first to consider employing deep neural networks toward tackling this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network as an encoder-decoder architecture to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source tree to guide the expansion of the decoder. We evaluate the program translation capability of our tree-to-tree model against several state-of-the-art approaches. Compared against other neural translation models, we observe that our approach is consistently better than the baselines with a margin of up to 15 points. Further, our approach can improve the previous state-of-the-art program translation approaches by a margin of 20 points on the translation of real-world projects.
Programs are the main tool for building computer applications, the IT industry, and the digital world. Various programming languages have been invented to help programmers develop programs for different applications. At the same time, this variety of programming languages introduces a burden when programmers want to combine programs written in different languages. Therefore, there is a tremendous need to enable program translation between different programming languages.
Nowadays, to translate programs between different programming languages, typically programmers would manually investigate the correspondence between the grammars of the two languages, then develop a rule-based translator. However, this process can be inefficient and error-prone. In this work, we make the first attempt to examine whether we can leverage deep neural networks to build a program translator automatically.
Intuitively, the program translation problem is similar in form to a natural language translation problem. Some previous works propose to adapt phrase-based statistical machine translation (SMT) for code migration nguyen2013lexical ; karaivanov2014phrase ; nguyen2015divide . Recently, neural network approaches, such as sequence-to-sequence-based models, have achieved state-of-the-art performance on machine translation bahdanau2014neural ; chousing2015on ; eriguchi2016tree ; he2016dual ; vaswani2017attention . In this work, we study neural machine translation methods to handle the program translation problem. However, a major challenge that makes a sequence-to-sequence-based model ineffective is that, unlike natural languages, programming languages have rigorous grammars and do not tolerate typos or grammatical mistakes. It has been demonstrated that it is very hard for an RNN-based sequence generator to generate syntactically correct programs when the programs grow long karpathy2015visualizing .
In this work, we observe that the main issue of an RNN that makes it hard to produce syntactically correct programs is that it entangles two sub-tasks together: (1) learning the grammar; and (2) aligning the sequence with the grammar. When these two tasks can be handled separately, performance typically improves. For example, Dong et al. employ a tree-based decoder to separate the two tasks dong2016language . In particular, the decoder in dong2016language leverages the tree structural information to (1) generate the nodes at the same depth of the parse tree using an LSTM decoder; and (2) expand a non-terminal and generate its children in the parse tree. Such an approach has been demonstrated to achieve state-of-the-art results on several semantic parsing tasks.
Based on this observation, we hypothesize that the structural information of both source and target parse trees can be leveraged to enable such a separation. Following this intuition, we propose tree-to-tree neural networks that combine a tree encoder and a tree decoder. In particular, we observe that in the program translation problem, both source and target programs have their parse trees. In addition, a cross-language compiler typically follows a modular procedure: it translates the individual sub-components in the source tree into their corresponding target ones, and then composes them to form the final target tree. Therefore, we design the workflow of a tree-to-tree neural network to align with this procedure: when the decoder expands a non-terminal, it locates the corresponding sub-tree in the source tree using an attention mechanism, and uses the information of the sub-tree to guide the non-terminal expansion. A tree encoder is helpful in this scenario, since it can aggregate all information of a sub-tree into the embedding of its root, so that the embedding can be used to guide the non-terminal expansion of the target tree.
Some existing works socher2011semi ; kusner2017grammar propose tree-based autoencoder architectures. However, in these models, the decoder can only access a single hidden vector representing the source tree, so they do not perform well on the translation task. In our evaluation, we demonstrate that without an attention mechanism, the translation performance is poor in most cases, while using an attention mechanism boosts the performance substantially. Another work bradbury2017towards proposes a tree-based attentional encoder-decoder architecture for natural language translation, but their model performs even worse than the attentional sequence-to-sequence baseline model. One main reason is that their attention mechanism calculates the attention weights of each node independently, which does not capture the hierarchical structure of the parse trees well. In our work, we design a parent attention feeding mechanism that formulates the dependence of attention maps between different nodes, and show that this attention mechanism further improves the performance of our tree-to-tree model considerably, especially when the size of the parse trees grows large. To the best of our knowledge, this is the first successful demonstration of a tree-to-tree neural network architecture proposed for translation tasks in the literature.
To test our hypothesis, we develop two novel program translation tasks, and employ a Java to C# benchmark used by existing program translation works nguyen2015divide ; nguyen2013lexical . First, we compare our approach against several neural network approaches on our proposed two tasks. Experimental results demonstrate that our tree-to-tree model outperforms other state-of-the-art neural networks on the program translation tasks by a margin of up to 15 points. Further, we compare our approach with previous program translation approaches on the Java to C# benchmark, and the results show that our tree-to-tree model outperforms the previous state-of-the-art by a large margin of 20 points on program accuracy. These results demonstrate that our tree-to-tree model is promising toward tackling the program translation problem. Meanwhile, we believe that our proposed tree-to-tree neural network could also be adapted to other tree-to-tree tasks, and we consider it as future work.
In this work, we consider the problem of translating a program in one language into another. One approach is to model the problem as a machine translation problem between two languages, and thus numerous neural machine translation approaches can be applied.
For the program translation problem, however, a unique property is that each input program unambiguously corresponds to a unique parse tree. Thus, rather than modeling the input program as a sequence of tokens, we can consider the problem as translating a source tree into a target tree. Note that most modern programming languages are accompanied by a well-developed parser, so we can assume that the parse trees of both the source and the target programs can be easily obtained.
The main challenge of the problem in our consideration is that the cross-compiler for translating programs typically does not exist. Therefore, even if we assume the existence of parsers for both the source and the target languages, the translation problem itself is still non-trivial. We formally define the problem as follows.
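Consider two programming languages $\mathcal{L}_s$ and $\mathcal{L}_t$, each being a set of instances $(p_k, T_k)$, where $p_k$ is a program and $T_k$ is its corresponding parse tree. We assume that there exists a translation oracle $\pi$, which maps instances in $\mathcal{L}_s$ to instances in $\mathcal{L}_t$. Given a dataset of instance pairs $(i_s, i_t)$ such that $i_s \in \mathcal{L}_s$, $i_t \in \mathcal{L}_t$, and $\pi(i_s) = i_t$, our problem is to learn a function $F$ that maps each $i_s \in \mathcal{L}_s$ into $i_t = \pi(i_s)$.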
In this work, we focus on the problem setting that we have a set of paired source and target programs to learn the translator. Note that all existing program translation works karaivanov2014phrase ; nguyen2015divide ; nguyen2013lexical also study the problem under such an assumption. When such an alignment is lacking, the program translation problem is more challenging. Several techniques for NMT have been proposed to handle this issue, such as dual learning he2016dual , which have the potential to be extended for the program translation task. We leave these more challenging problem setups as future work.
In this section, we present our design of the tree-to-tree neural network. We first motivate the design, and then present the details.
Inspired by the above motivation, we design the tree-to-tree neural network, which follows an encoder-decoder framework to encode the source tree into an embedding, and decode the embedding into the target tree. To capture the intuition of the modular translation process, the decoder employs an attention mechanism to locate the corresponding source sub-tree when expanding the non-terminal. We illustrate the workflow of a tree-to-tree model in Figure 2, and present each component of the model below.
Note that the source and target trees may contain multiple branches. Although we can design tree encoders and tree decoders to handle trees with an arbitrary number of branches, we observe that encoders and decoders for binary trees can be more effective. Thus, the first step is to convert both the source tree and the target tree into binary trees. To this end, we employ the Left-Child Right-Sibling representation for this conversion.
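The Left-Child Right-Sibling conversion can be illustrated with a minimal sketch. The `Node` and `BinNode` classes and the `to_lcrs` helper are hypothetical names introduced here for illustration, not part of the paper's implementation: the first child of a node becomes its left child, and its next sibling becomes its right child.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of an ordinary parse tree with any number of children."""
    value: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class BinNode:
    """A node of the binarized tree."""
    value: str
    left: Optional["BinNode"] = None   # first child in the original tree
    right: Optional["BinNode"] = None  # next sibling in the original tree


def to_lcrs(node: Node, siblings: Optional[List[Node]] = None) -> BinNode:
    """Convert `node`, followed by its remaining right siblings, to LCRS form."""
    siblings = siblings or []
    out = BinNode(node.value)
    if node.children:
        out.left = to_lcrs(node.children[0], node.children[1:])
    if siblings:
        out.right = to_lcrs(siblings[0], siblings[1:])
    return out


# Toy parse tree: Assign(x, Add(1, 2))
tree = Node("Assign", [Node("x"), Node("Add", [Node("1"), Node("2")])])
binary = to_lcrs(tree)
```

In the toy tree, `x` has no children of its own, yet it still receives a right child (`Add`) because that slot encodes its sibling.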
The encoder employs a Tree-LSTM tai2015improved to compute embeddings for both the entire source tree and each of its sub-trees. In particular, consider a node $N$ with the value $t_s$ in its one-hot encoding representation, and suppose it has two children $N_L$ and $N_R$, which are its left child and right child respectively. The encoder recursively computes the embedding for $N$ from the bottom up.
Assume that the left child and the right child maintain the LSTM states $(h_L, c_L)$ and $(h_R, c_R)$ respectively, and the embedding of $t_s$ is $x$. Then the LSTM state $(h, c)$ of $N$ is computed as
$$(h, c) = \mathrm{LSTM}\big(([h_L; h_R], [c_L; c_R]), x\big) \tag{1}$$
where $[a; b]$ denotes the concatenation of $a$ and $b$. Note that a node may lack one or both of its children. In this case, the encoder sets the LSTM state of the missing child to be zero.
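The bottom-up recursion of the encoder can be sketched as follows. This is only a structural illustration under stated assumptions: `encode`, `BinNode`, and `toy_cell` are hypothetical names, and `toy_cell` is a deliberately simple stand-in for the Tree-LSTM update, not an LSTM. What the sketch does show is the traversal order, the concatenation of the children's states, and the zero state used for a missing child.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
State = Tuple[Vector, Vector]  # (hidden state h, cell state c)


class BinNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def encode(node: BinNode, embed: Callable[[str], Vector], cell, dim: int,
           out: list) -> State:
    """Post-order traversal; appends (node, state) for every sub-tree root."""
    zero = ([0.0] * dim, [0.0] * dim)  # state used for a missing child
    left = encode(node.left, embed, cell, dim, out) if node.left else zero
    right = encode(node.right, embed, cell, dim, out) if node.right else zero
    h_cat = left[0] + right[0]  # concatenated hidden states of the children
    c_cat = left[1] + right[1]  # concatenated cell states of the children
    state = cell((h_cat, c_cat), embed(node.value))
    out.append((node, state))
    return state


def toy_cell(children: State, x: Vector) -> State:
    """Placeholder update (NOT an LSTM): average child cells, add the input."""
    _, c_cat = children
    d = len(x)
    c = [0.5 * (c_cat[i] + c_cat[d + i]) + x[i] for i in range(d)]
    h = [math.tanh(v) for v in c]
    return h, c


vocab = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # toy embeddings
root = BinNode("a", left=BinNode("b"))
states = []
root_h, root_c = encode(root, vocab.__getitem__, toy_cell, 2, states)
```

The list `states` keeps one state per source node, which is what an attention mechanism over sub-trees would later consume.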
The decoder generates the target tree starting from a single root node. The decoder first copies the LSTM state of the root of the source tree, and attaches it to the root node of the target tree. Then the decoder maintains a queue of all nodes to be expanded, and recursively expands each of them. In each iteration, the decoder pops one node from the queue, and expands it. In the following, we call the node being expanded the expanding node.
First, the decoder predicts the value of the expanding node. To this end, the decoder computes the embedding $e_t$ of the expanding node, and then feeds it into a softmax regression network for prediction:
$$t_t = \arg\max \, \mathrm{softmax}(W e_t) \tag{2}$$
Here, $W$ is a trainable matrix of size $V_t \times d$, where $V_t$ is the vocabulary size of the outputs and $d$ is the embedding dimension. Note that $e_t$ is computed using the attention mechanism, which we will explain later.
The value of each node $t_t$ is a non-terminal, a terminal, or a special $\langle\mathrm{EOS}\rangle$ token. If $t_t = \langle\mathrm{EOS}\rangle$, then the decoder finishes expanding this node. Otherwise, the decoder generates one new node as the left child and another new node as the right child of the expanding one. Let $(h_t, c_t)$ be the LSTM state of the expanding node, and let $(h', c')$ and $(h'', c'')$ be the LSTM states of its left child and right child respectively; they are computed as:
$$(h', c') = \mathrm{LSTM}_L\big((h_t, c_t), B t_t\big) \tag{3}$$
$$(h'', c'') = \mathrm{LSTM}_R\big((h_t, c_t), B t_t\big) \tag{4}$$
Here, $B$ is a trainable word embedding matrix of size $d \times V_t$. Note that the generation of the left child and the right child uses two different sets of parameters, for $\mathrm{LSTM}_L$ and $\mathrm{LSTM}_R$ respectively. These new children are pushed into the queue of all nodes to be expanded. When the queue is empty, the target tree generation process terminates.
Notice that although the sets of terminals and non-terminals are disjoint, it is necessary to include the $\langle\mathrm{EOS}\rangle$ token for the following reasons. First, due to the left-child-right-sibling encoding, although a terminal does not have a child, it could have a right child representing its sibling in the original tree, so $\langle\mathrm{EOS}\rangle$ is still needed for predicting the right branch. Meanwhile, we combine the terminal and non-terminal sets into a single vocabulary for the decoder, and do not incorporate the knowledge of grammar rules into the model, thus the model needs to infer by itself whether a predicted token is a terminal or a non-terminal. In our evaluation, we find that a well-trained model never generates a left child for a terminal, which indicates that the model can learn to distinguish between terminals and non-terminals correctly.
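The queue-driven expansion described above can be sketched with stub components. Everything named here is an assumption for illustration: `predict`, `step_left`, and `step_right` stand in for the softmax prediction and the two child LSTMs, the string `"<EOS>"` stands in for the special stop token, and `max_nodes` is a hypothetical safeguard against runaway generation. In the toy run, a decoder state is simply the path from the root, and predictions come from a lookup table.

```python
from collections import deque

EOS = "<EOS>"  # assumed name for the special stop token


class TargetNode:
    def __init__(self, state):
        self.state = state
        self.value = None
        self.left = None
        self.right = None


def decode(root_state, predict, step_left, step_right, max_nodes=1000):
    """Expand nodes in queue order until every frontier node predicts EOS."""
    root = TargetNode(root_state)
    queue = deque([root])
    expanded = 0
    while queue and expanded < max_nodes:
        node = queue.popleft()
        expanded += 1
        node.value = predict(node.state)
        if node.value == EOS:
            continue  # this node is finished; it gets no children
        node.left = TargetNode(step_left(node.state, node.value))
        node.right = TargetNode(step_right(node.state, node.value))
        queue.extend([node.left, node.right])
    return root


# Toy stand-ins: the "state" is the path from the root ("L" = left, "R" = right).
script = {"": "Add", "L": "1", "LR": "2"}  # any other path predicts EOS
root = decode(
    "",
    predict=lambda s: script.get(s, EOS),
    step_left=lambda s, v: s + "L",
    step_right=lambda s, v: s + "R",
)
```

The result is the left-child-right-sibling form of `Add(1, 2)`: the terminal `1` predicts the stop token on its left branch but still carries its sibling `2` on the right.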
Now we consider how to compute $e_t$. One straightforward approach is to compute $e_t$ as $h_t$, which is the hidden state attached to the expanding node. However, in doing so, the embedding will soon forget the information about the source tree when generating deep nodes in the target tree, and thus the model yields very poor performance.
To make better use of the information of the source tree, our tree-to-tree model employs an attention mechanism to locate the source sub-tree corresponding to the sub-tree rooted at the expanding node. Specifically, we compute the following probability:
$$P(N_s \text{ is the source sub-tree corresponding to } N_t \mid N_t)$$
where $N_t$ is the expanding node. We denote this probability as $P(N_s \mid N_t)$, and we compute it as
$$P(N_s \mid N_t) \propto \exp\big(h_s^\top W_0 h_t\big) \tag{5}$$
where $h_s$ is the hidden state computed by the encoder for $N_s$, and $W_0$ is a trainable matrix of size $d \times d$.
To leverage the information from the source tree, we compute the expectation of the hidden state value across all $N_s$ conditioned on $N_t$, i.e.,
$$e_s = \mathbb{E}\big[h_s \mid N_t\big] = \sum_{N_s} h_s \cdot P(N_s \mid N_t) \tag{6}$$
This embedding can then be combined with $h_t$, the hidden state of the expanding node, to compute $e_t$ as follows:
$$e_t = \tanh(W_1 e_s + W_2 h_t) \tag{7}$$
where $W_1$, $W_2$ are trainable matrices of size $d \times d$ respectively.
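As a worked illustration of the attention step, the sketch below scores every source node against the expanding node, normalizes the scores, takes the expected source hidden state, and combines it with the decoder state. It assumes a bilinear score normalized with a softmax and a tanh combination; these functional forms, and the names `attention_embedding`, `W0`, `W1`, and `W2`, are assumptions of this sketch rather than a statement of the authors' exact implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_embedding(source_states: List[Vector], h_t: Vector,
                        W0: Matrix, W1: Matrix, W2: Matrix
                        ) -> Tuple[Vector, Vector, Vector]:
    """Return (attention weights, expected source state e_s, embedding e_t)."""
    projected = matvec(W0, h_t)
    scores = [dot(h_s, projected) for h_s in source_states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(h_t)
    e_s = [sum(p * h_s[i] for p, h_s in zip(probs, source_states))
           for i in range(dim)]
    mixed = [a + b for a, b in zip(matvec(W1, e_s), matvec(W2, h_t))]
    e_t = [math.tanh(v) for v in mixed]
    return probs, e_s, e_t


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
source_states = [[1.0, 0.0], [0.0, 1.0]]  # one hidden state per source node
probs, e_s, e_t = attention_embedding(source_states, [1.0, 0.0],
                                      identity, identity, zeros)
```

With these toy values the first source node aligns with the decoder state, so it receives the larger attention weight and dominates `e_s`.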
In the above approach, the attention vectors $e_t$ are computed independently of each other, since once $e_t$ is used for predicting the node value $t_t$, $e_t$ is no longer used for further predictions. However, intuitively, the attention decisions for the prediction of each node should be related to each other. For example, for a non-terminal node $N_t$ in the target tree, suppose that it is related to $N_s$ in the source tree; then it is very likely that the attention weights of its children should focus on the descendants of $N_s$. Therefore, when predicting the attention vector of a node, the model should leverage the attention information of its parent as well.
Following this intuition, we propose a parent attention feeding mechanism, so that the attention vector of the expanding node is taken into account when predicting the attention vectors of its children. Formally, besides the embedding of the node value $t_t$, we modify the inputs to $\mathrm{LSTM}_L$ and $\mathrm{LSTM}_R$ of the decoder in Equations (3) and (4) as below:
$$(h', c') = \mathrm{LSTM}_L\big((h_t, c_t), [B t_t; e_t]\big) \tag{8}$$
$$(h'', c'') = \mathrm{LSTM}_R\big((h_t, c_t), [B t_t; e_t]\big) \tag{9}$$
Notice that these formulas coincide in form with the input-feeding method for sequential neural networks luong2015effective , but their meanings are different. For sequential models, the input attention vector belongs to the previous token, while here it belongs to the parent node. In our evaluation, we will show that such a parent attention feeding mechanism significantly improves the performance of our tree-to-tree model.
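A minimal sketch of the parent attention feeding input, assuming (as in the input-feeding method for sequential models) that the two vectors are simply concatenated. The helper name `child_lstm_input` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def child_lstm_input(value_embedding: Sequence[float],
                     parent_attention: Sequence[float]) -> List[float]:
    """Input fed to both child LSTMs: the embedding of the parent's predicted
    value concatenated with the parent's attention vector."""
    return list(value_embedding) + list(parent_attention)


value_embedding = [0.1, 0.2]    # embedding of the parent's predicted value
parent_attention = [0.7, -0.3]  # attention vector of the parent node
x = child_lstm_input(value_embedding, parent_attention)
```

Both the left-child and the right-child LSTM would receive the same vector `x`; only their parameters differ.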
In this section, we evaluate our tree-to-tree neural network against several baseline approaches on the program translation task. To do so, we first describe three benchmark datasets in Section 4.1 for evaluating different aspects; then we evaluate our tree-to-tree model against several baseline approaches, including the state-of-the-art neural network approaches and program translation approaches.
For the evaluation on Java to C#, we tried to contact the authors of nguyen2015divide for their dataset, but received no response. Thus, we employ the same approach as in nguyen2015divide to crawl several open-source projects that have both a Java and a C# implementation. As in nguyen2015divide , we pair the methods in Java and C# based on their file names and method names. The statistics of the dataset are summarized in Appendix B. Because the versions of these projects have changed, the concrete dataset in our evaluation may differ from that of nguyen2015divide . For each project, we apply ten-fold validation on matched method pairs, as in nguyen2015divide .
The main metric in our evaluation is program accuracy, which is the percentage of predicted target programs that are exactly the same as the ground truth in the dataset. Note that program accuracy underestimates the true accuracy based on semantic equivalence; this metric has also been used in nguyen2015divide . It is more meaningful than other previously proposed metrics, such as syntax-correctness and dependency-graph-accuracy, which are not directly comparable to semantic equivalence. We also measure another metric called token accuracy, and we defer the details to Appendix C.
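Program accuracy as described here amounts to an exact-match rate. A minimal sketch, assuming each program is represented as a list of tokens; the function name `program_accuracy` is hypothetical.

```python
from typing import List, Sequence


def program_accuracy(predictions: Sequence[List[str]],
                     references: Sequence[List[str]]) -> float:
    """Percentage of predictions that match the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    exact = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * exact / len(references)


predictions = [["x", "=", "1"], ["y", "=", "2"]]
references = [["x", "=", "1"], ["y", "=", "3"]]
accuracy = program_accuracy(predictions, references)
```

A single wrong token makes the whole program count as incorrect, which is why this metric is a lower bound on accuracy under semantic equivalence.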
We evaluate our tree-to-tree model against a sequence-to-sequence model bahdanau2014neural ; vinyals2015grammar , a sequence-to-tree model dong2016language , and a tree-to-sequence model eriguchi2016tree . Note that for a sequence-to-sequence model, there can be four variants to handle different input-output formats. For example, given a program, we can simply tokenize it into a sequence of tokens. We call this format raw program, denoted as P. We can also use the parser to parse the program into a parse tree, and then serialize the parse tree as a sequence of tokens. Our serialization of a tree follows its depth-first traversal order, which is the same as in vinyals2015grammar . We call this format parse tree, denoted as T. For both input and output formats, we can choose either P or T. For a sequence-to-tree model, we have two variants based on its input format being either P or T; note that the sequence-to-tree model generates a tree as output, and thus requires its output format to be T (unserialized). Similarly, the tree-to-sequence model has two variants, and our tree-to-tree model has only one form. Therefore, we have 9 different models in our evaluation.
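The parse tree format T can be illustrated with a small depth-first serializer. The text only states that serialization follows depth-first order; the use of parentheses to mark sub-tree boundaries, the `(label, children)` tuple representation, and the name `serialize` are assumptions of this sketch.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)


def serialize(tree: Tree) -> List[str]:
    """Serialize a parse tree into tokens in depth-first (pre-order) order."""
    label, children = tree
    if not children:
        return [label]
    tokens = ["(", label]
    for child in children:
        tokens.extend(serialize(child))
    tokens.append(")")
    return tokens


# Toy parse tree: Assign(x, Add(1, 2))
tree = ("Assign", [("x", []), ("Add", [("1", []), ("2", [])])])
tokens = serialize(tree)
```

The serialized sequence is longer than the raw program, which matches the observation in the dataset statistics that lengths in format T exceed those in format P.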
The hyper-parameters used in different models can be found in Appendix A. The baseline models have employed their own input-feeding or parent-feeding method that is analogous to our parent attention feeding mechanism.
The program accuracy results are presented in Table 1. We observe that our tree2tree model outperforms all baseline models on all datasets. In particular, on the dataset with longer programs, its program accuracy exceeds that of all seq2seq models by a large margin, and it also keeps a clear margin over the seq2tree model. These results demonstrate that the tree2tree model is more capable of learning the correspondence between the source and the target programs; in particular, it is significantly better than other baselines at handling longer inputs.
Meanwhile, we perform an ablation study to compare the full tree2tree model with (1) tree2tree without parent attention feeding (TT (-PF)) and (2) tree2tree without attention (TT (-Attn)). We observe that the full tree2tree model significantly outperforms the other alternatives. In particular, on JC-BL, the full tree2tree model's program accuracy is higher than that of the tree2tree model without parent attention feeding.
More importantly, we observe that the program accuracy of the tree2tree model without the attention mechanism is very low. Note that such a model is similar to a tree-to-tree autoencoder architecture. This result shows that our novel architecture can significantly outperform previous tree-to-tree-like architectures on the program translation task.
However, although our tree2tree model performs better than other baselines, it still could not achieve perfect accuracy. After investigating the predictions, we find that the main reason is that the translation may introduce temporary variables. Because such temporary variables appear very rarely in the training set, it can be hard for a neural network to infer them correctly in these cases. In fact, the longer the programs are, the more temporary variables the cross-compiler may introduce, which makes the prediction harder. We consider further improving the model to handle this problem as future work.
We compare our approach with three previous program translation approaches, J2C#, 1pSMT, and mppSMT, on the real-world benchmark from Java to C#. Here, J2C# is a rule-based system, 1pSMT directly applies the phrase-based SMT on sequential programs, and mppSMT is a multi-phase phrase-based SMT approach that leverages both the raw programs and their parse trees.
The results are summarized in Table 2. For previous approaches, we report the results from nguyen2015divide . We observe that our tree2tree approach significantly outperforms the previous state-of-the-art on all projects except Antlr.
On Antlr, the tree2tree model performs worse. We attribute this to the fact that Antlr contains too few data samples for training. We test our hypothesis by constructing another training and validation set from all the other 5 projects, and testing our model on the entire Antlr project. We observe that our tree2tree model achieves a test accuracy that is 9 points higher than the state-of-the-art. Therefore, we conclude that our approach can significantly outperform previous program translation approaches when there is sufficient training data.
Some recent works have applied statistical machine translation techniques to program translation allamanis2017survey ; karaivanov2014phrase ; nguyen2015divide ; nguyen2013lexical ; nguyen2016mapping ; oda2015learning . For example, several works propose to adapt phrase-based statistical machine translation models and leverage grammatical structures of programming languages for code migration karaivanov2014phrase ; nguyen2015divide ; nguyen2013lexical . In nguyen2016mapping , Nguyen et al. propose to use Word2Vec representations for APIs in libraries used in different programming languages, and then learn a transformation matrix for API mapping. In contrast, our work is the first to employ deep learning techniques for program translation.
Recently, various neural networks with tree structures have been proposed to employ the structural information of the data dong2016language ; rabinovich2017abstract ; parisotto2016neuro ; yin2017syntactic ; alvarez2016tree ; tai2015improved ; zhu2015long ; socher2011parsing ; eriguchi2016tree ; zhang2016top ; socher2011semi ; kusner2017grammar ; bradbury2017towards . In these works, different tree-structured encoders are proposed for embedding the input data, and different tree-structured decoders are proposed for predicting the output trees. In particular, socher2011semi ; kusner2017grammar propose tree-structured autoencoders to learn vector representations of trees, and show better performance on tree reconstruction and other tasks such as sentiment analysis. Another work bradbury2017towards proposes to use a tree-structured encoder-decoder architecture for natural language translation, where both the encoder and the decoder are variants of the RNNG model dyer2016recurrent ; however, the performance of their model is slightly worse than the sequence-to-sequence model with attention, which is mainly due to the fact that their attention mechanism cannot condition the future attention weights on previously computed ones. In this work, we are the first to demonstrate a successful design of a tree-to-tree neural network for translation tasks.
Other works study using neural networks to generate parse trees from input-output examples dong2016language ; vinyals2015grammar ; aharoni2017towards ; rabinovich2017abstract ; yin2017syntactic ; alvarez2016tree ; dyer2016recurrent ; chen2018towards ; chen2016latent . In dong2016language , Dong et al. propose a seq2tree model that allows the decoder RNN to generate the output tree recursively in a top-down fashion. This approach achieves state-of-the-art results on several semantic parsing tasks. Some other works incorporate the knowledge of the grammar into the architecture design yin2017syntactic ; rabinovich2017abstract to achieve better performance on specific tasks. However, these approaches are hard to generalize to other tasks. Again, none of them is designed for program translation or proposes a tree-to-tree architecture.
A recent line of research studies using neural networks for code generation balog2016deepcoder ; devlin2017robustfill ; parisotto2016neuro ; ling2016latent ; rabinovich2017abstract ; yin2017syntactic . The works ling2016latent ; rabinovich2017abstract ; yin2017syntactic study generating code in a DSL from inputs in natural language or in another DSL. However, their designs require additional manual effort to adapt to each new DSL under consideration. In our work, we consider the tree-to-tree model as a generic approach that can be applied to any grammar.
In this work, we are the first to consider neural network approaches for the program translation problem, and are the first to demonstrate a successful design of a tree-to-tree neural network combining both a tree-RNN encoder and a tree-RNN decoder for translation tasks. Extensive evaluation demonstrates that our tree-to-tree neural network outperforms several state-of-the-art models. This makes our tree-to-tree model a promising tool toward tackling the program translation problem. In addition, we believe that our proposed tree-to-tree neural network has the potential to generalize to other tree-to-tree tasks, and we consider it as future work.
At the same time, we observe many challenges in program translation that existing techniques are not capable of handling. For example, the models are hard to generalize to programs longer than the training ones; it is unclear how to handle an infinite vocabulary set that may be employed in real-world applications; further, the training requires a dataset of aligned input-output pairs, which may be lacking in practice. We consider all these problems as important future work in the research agenda toward solving the program translation problem.
We thank the anonymous reviewers for their valuable comments. This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, Berkeley DeepDrive, and DARPA D3M under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Tree-structured decoding with doubly-recurrent neural networks. In ICLR, 2017.
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2015.
|Number of RNN layers||3||1||1||1|
|Encoder RNN cell||LSTM||LSTM||Tree LSTM||Tree LSTM|
|Decoder RNN cell||LSTM|
|Initial learning rate||0.005|
|Learning rate decay schedule||Decay the learning rate by a fixed factor when the validation loss does not decrease for 500 mini-batches|
|Hidden state size||256|
|Gradient clip threshold||5.0|
|Weights initialization||Uniformly random from [-0.1, 0.1]|
We present the hyper-parameters of different neural networks in Table 3. These hyper-parameters are chosen to achieve the best accuracy on the development set through a grid search.
|Average input length (P)||10||20|
|Minimal output length (P)||23||33|
|Maximal output length (P)||151||311|
|Average output length (P)||44||69|
|Minimal input length (T)||34||69|
|Maximal input length (T)||61||111|
|Average input length (T)||48||85|
|Minimal output length (T)||38||73|
|Maximal output length (T)||251||531|
|Average output length (T)||71||129|
|Project||# of matched methods|
Besides the program accuracy, we also measure the token accuracy of different approaches, which is the percentage of the tokens that are exactly the same as the ground truth. This metric is a finer-grained measurement of correctness, and thus provides additional insight into the performance of different models.
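A minimal sketch of token accuracy, under the assumption that predicted and reference tokens are compared position by position and that the count is normalized by the total number of reference tokens. The text does not spell out these details, so both choices, and the name `token_accuracy`, are assumptions.

```python
from typing import List, Sequence


def token_accuracy(predictions: Sequence[List[str]],
                   references: Sequence[List[str]]) -> float:
    """Percentage of reference tokens reproduced at the same position."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        total += len(ref)
        # zip stops at the shorter sequence, so missing tokens count as wrong
        correct += sum(p == r for p, r in zip(pred, ref))
    return 100.0 * correct / total if total else 0.0


predictions = [["a", "b", "c"]]
references = [["a", "x", "c", "d"]]
accuracy = token_accuracy(predictions, references)
```

Unlike the exact-match rate, this measure gives partial credit: here two of the four reference tokens are matched even though the program as a whole is wrong.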
In the following, we discuss our synthetic translation task from an imperative language to a functional language.
For the synthetic task, we design an imperative source language and a functional target language. With this design, the source and target languages use different programming paradigms, which makes the translation challenging. Figure 4 illustrates an example of the translation, in which a for-loop is translated into a recursive function. We manually implement a translator, which is used to acquire the ground truth. The grammar specifications of the source language (FOR language) and the target language (LAMBDA language) are provided in Figure 5 and Figure 6 respectively. The Python source code that implements the translator from a FOR program to a LAMBDA program is provided in Figure 7.
We create two datasets for the synthetic task: one with an average length of 20 (SYN-S) and the other with an average length of 50 (SYN-L). Here, the length of a program indicates the number of tokens in the source program.
|Source program||Target program|
|for i=1; i10; i+1 do||letrec f i =|
|if x1 then||if i10 then|
|y=1||let _ = if x1 then|
|else||let y=1 in ()|
|y=2||else let y=2 in ()|
|endfor||in f i+1|
|in f 1|
|Statistic||SYN-S||SYN-L|
|Average input length (P)||20||50|
|Minimal output length (P)||22||46|
|Maximal output length (P)||44||96|
|Average output length (P)||30||71|
|Minimal input length (T)||40||100|
|Maximal input length (T)||56||134|
|Average input length (T)||49||111|
|Minimal output length (T)||41||90|
|Maximal output length (T)||82||177|
|Average output length (T)||55||133|