1 Introduction
Neural networks have been applied successfully to many generative modeling tasks, from images with pixel-level detail [1] to strokes corresponding to crude sketches [2] to natural language in automated responses to user questions [3]. Less extensively studied are neural models for the generation of highly structured artifacts, such as the source code of programs. Program generation has many potential applications, including automatically testing programming tools [4] and assisting humans as they solve programming tasks [5, 6]. However, a fundamental difficulty in this problem domain is that, to be acceptable to a compiler, a program must satisfy a rich set of constraints, such as "never use undeclared variables" or "only use variables in a type-safe way". Learning such constraints automatically from data is a difficult task.
In this paper, we present a new generative model, called Neural Attribute Machines (NAMs), for programs that satisfy constraints like the above. The key insight of our approach is that the constraints enforced by a programming language are known in full detail a priori. Accordingly, they can be incorporated into training, and we propose a framework for doing so. We demonstrate that this approach has significant benefits: training existing architectures on samples that unfailingly abide by a constraint still produces a generative model that often violates the constraint; in contrast, the NAM model significantly outperforms these models at sampling from the space of constrained programs.
We use the formalism of attribute grammars [7] as the language for expressing rich structural constraints over programs. Our model composes such a grammar with a recurrent neural network (RNN) that generates the nodes of a program's abstract-syntax tree, and uses the grammar to constrain the output of the RNN at each point in time. Our training framework builds on the observation that in the setting of generating constrained samples, there is a correct prediction, and then there are two categories of incorrect predictions: incorrect predictions that are nevertheless legal under the constraint are more desirable than predictions that violate the constraint. The NAM addresses this multifaceted problem in two ways. First, it augments the input sequence with a fixed-length representation of a structural context that the attribute grammar uses to check constraints. The context provides information that the current input sequence is just one particular instantiation of a more general structural constraint. Second, NAMs optimize a three-valued loss function that penalizes correct, incorrect-but-legal, and incorrect-and-illegal predictions differently.
The main contributions of this paper can be summarized as follows:

We present a new neural network and logical architecture that incorporates background knowledge in the form of attribute-grammar constraints.

We give a framework to train this new architecture that uses a three-valued loss function to penalize correct, incorrect-but-legal, and incorrect-and-illegal predictions differently.

Our experiments show the difficulties existing neural models have with learning in constrained domains, and the advantages of using the proposed framework.
2 Methodology
2.1 Background on Attribute Grammars
Definition 2.1
An attribute grammar (AG) is a context-free grammar extended by attaching attributes to the terminal and nonterminal symbols of the grammar, and by supplying attribute equations to define attribute values [7]. Each production can also be equipped with an attribute constraint to specify that some relationship must hold among the values of the production's attributes.
In every production p : X0 → X1 … Xk, each Xi denotes an occurrence of one of the grammar symbols; associated with each such symbol occurrence is a set of attribute occurrences corresponding to the symbol's attributes.
The attributes of a symbol X, denoted by A(X), are divided into two disjoint classes: synthesized attributes and inherited attributes. A production's output attributes are the synthesized-attribute occurrences of the left-hand-side nonterminal plus the inherited-attribute occurrences of the right-hand-side symbols; its input attributes are the inherited-attribute occurrences of the left-hand-side nonterminal plus the synthesized-attribute occurrences of the right-hand-side symbols.
Each production has a set of attribute equations, each of which defines the value of one of the production's output attributes as a function of the production's input attributes. We assume that the terminal symbols of the grammar have no synthesized attributes, and that the root symbol of the grammar has no inherited attributes. Noncircularity is a decidable property of AGs [8], and hence we can assume that no derivation tree exists in which an attribute instance is defined transitively in terms of itself.
An AG is L-attributed [9] if, in each production p : X0 → X1 … Xk, the attribute equation for each inherited-attribute occurrence of a right-hand-side symbol Xi only depends on (i) inherited-attribute occurrences of X0, and (ii) synthesized-attribute occurrences of X1, …, X(i−1).
With reasonable assumptions about the computational power of attribute equations and attribute constraints, L-attributed AGs capture the class NP [10] (modulo a technical "padding" adjustment).


Figure 1: (a) attribute equations for left-to-right threading of bit positions; (b) the left-to-right pattern of attribute dependences in a six-node tree.
Example 2.2
To illustrate L-attributed AGs, we use a simple language of binary numerals. The abstract-syntax trees for binary numerals are defined using the following operator/operand declarations. (The notation is a variant of context-free grammars in which the operator names Numeral, Pair, Zero, and One serve to identify the productions uniquely. For example, the declaration "numeral: Numeral(bits);" is the analog of the production "numeral → bits." The notation is adapted from the Synthesizer Generator [11].)
numeral : Numeral(bits);
bits    : Pair(bits bits) | Zero() | One();
Two integer-valued attributes, positionIn and positionOut, are used to determine a bit's position in a numeral:
bits  {  inherited int positionIn; synthesized int positionOut; }; 
These attributes are used to define a left-to-right pattern of information flow through each derivation tree (also known as "left-to-right threading"). In particular, with the attribute equations given in Fig. 1(a), at each leaf of the tree, the value of bits.positionOut is the position of the bit that the leaf represents with respect to the left end of the numeral (where the leftmost bit is considered to be at position 1). (In the attribute equations for a production such as "bits: Pair(bits bits);" we distinguish between the three different occurrences of nonterminal "bits" via the symbols "bits$1," "bits$2," and "bits$3," which denote the leftmost occurrence, the next-to-leftmost occurrence, etc.; in this case, the leftmost occurrence is the left-hand-side occurrence.) Fig. 1(b) shows the left-to-right pattern of dependences among attribute instances in a six-node tree.
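One plausible reading of the threading equations can be sketched in Python. This is a minimal illustration, not the actual equations of Fig. 1(a); the tree encoding and function names are our own:

```python
# Hypothetical sketch of left-to-right threading for the binary-numeral grammar.
# positionIn is the inherited attribute; the return value plays the role of the
# synthesized attribute positionOut.

def eval_bits(node, position_in):
    """Return positionOut for a `bits` subtree given its inherited positionIn."""
    op = node[0]
    if op in ("Zero", "One"):       # leaf: this bit occupies the next position
        return position_in + 1
    if op == "Pair":                # thread left to right through both children:
        mid = eval_bits(node[1], position_in)  # bits$2.positionIn = bits$1.positionIn
        return eval_bits(node[2], mid)         # bits$3.positionIn = bits$2.positionOut
    raise ValueError("unknown operator: " + op)

def eval_numeral(tree):
    """numeral: Numeral(bits); the root passes positionIn = 0 to its child."""
    assert tree[0] == "Numeral"
    return eval_bits(tree[1], 0)

# The numeral 101 as Pair(Pair(One, Zero), One); its rightmost leaf is at position 3.
tree = ("Numeral", ("Pair", ("Pair", ("One",), ("Zero",)), ("One",)))
print(eval_numeral(tree))  # 3
```

With the root supplying positionIn = 0, the leftmost leaf receives positionOut = 1, matching the convention stated above.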
Our ultimate goal is the creation of new trees de novo, with generation proceeding top-down, left-to-right. The latter characteristic motivated the choice that constraints be expressible using an L-attributed AG, because such grammars propagate information left-to-right in an AST.
L-attributed AGs also offer substantially increased expressive power over context-free grammars; in particular, an L-attributed AG can express non-context-free constraints on the trees in the language L(G). Many such constraints are involved in any grammar that produces only programs that pass a C compiler. In the study presented in Section 3, we experimented with two constraints in isolation:
Declared-variable constraint: Each use of a variable must be preceded by a declaration of the variable.

Type-safe-variable constraint: Each use of a variable must satisfy the type requirements of the position of the variable in the tree.
Because we work with a corpus of C programs that all compile, all of the training examples satisfy both constraints.
2.2 From an AST to a sequence
Other attempts to learn models that generate trees include performing convolutions over nodes in binary trees [12] and stacking multiple RNNs in fixed directions [13]; however, a natural paradigm for presenting a tree to a neural model is a single-traversal, top-down, left-to-right sequentialization of the tree. An AST is represented by a depth-first sequence of pairs: each pair (n_t, p_t), for t = 1, …, T, consists of a nonterminal n_t and a production p_t that has n_t on the left-hand side. Depth information in the tree is conveyed by interspersing pop indicators when moving up the tree. As a preprocessing step, all variable names are aliased by their order of use in the program, so that the i-th distinct variable used is named Var_i. This approach prevents difficulties associated with rarely used token names, and does not lose any meaningful information about structural constraints. Fig. 2 depicts a trivial example to illustrate the process.
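A minimal sketch of this linearization is given below. The token names ("POP", "Var0", …) and the tuple encoding of the AST are our own illustrative choices, not the paper's concrete representation:

```python
# Depth-first, left-to-right serialization with pop indicators, plus aliasing
# of variable names by their order of first use (x -> Var0, y -> Var1, ...).

def linearize(node, var_ids=None, out=None):
    """Emit (nonterminal, production) pairs depth-first, appending a POP
    indicator each time the traversal moves back up the tree."""
    var_ids = {} if var_ids is None else var_ids
    out = [] if out is None else out
    nonterminal, production, children = node
    if production.startswith("Var:"):          # alias variables by order of use
        name = production.split(":", 1)[1]
        production = var_ids.setdefault(name, "Var%d" % len(var_ids))
    out.append((nonterminal, production))
    for child in children:
        linearize(child, var_ids, out)
    out.append("POP")
    return out

# A two-node statement that uses the same variable twice:
ast = ("stmt", "Assign", [("lval", "Var:x", []), ("expr", "Var:x", [])])
print(linearize(ast))
```

Both occurrences of x are mapped to the same alias Var0, so the structural constraint "the same variable is used twice" survives the renaming.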
Let I and O denote the input and output spaces, respectively. If N is a nonterminal in the grammar G, we use R(N) to denote the subset of the productions with N on the left-hand side; R(N) is the set of legal outputs under the context-free constraint. The two context-sensitive constraints considered here further narrow the set of legal outputs. We use L_dv(N) and L_ty(N) to denote the sets of legal outputs at some (unspecified) point in the linearized tree under the declared-variable constraint C_dv and the type constraint C_ty, respectively. Let L(N) be either L_dv(N) or L_ty(N); at a prediction step at an instance of nonterminal N with correct output y, the possible outputs can be partitioned into three sets: {y}, L(N) \ {y}, and O \ L(N).
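For concreteness, the three-way partition at one prediction step can be sketched as follows; the set contents here are invented purely for illustration:

```python
# Partition the output space at one prediction step into
# {correct}, legal-but-incorrect, and illegal outputs.

def partition(outputs, productions_of_nt, legal_under_constraint, correct):
    """Split `outputs` into ({correct}, legal \ {correct}, outputs \ legal)."""
    legal = productions_of_nt & legal_under_constraint
    assert correct in legal
    return {correct}, legal - {correct}, outputs - legal

O   = {"p1", "p2", "p3", "p4"}   # all outputs
R_N = {"p1", "p2", "p3"}         # productions with N on the left-hand side
L_N = {"p1", "p2"}               # outputs also allowed by the constraint
correct, legal_wrong, illegal = partition(O, R_N, L_N, "p1")
```

Here "p2" is a legal-but-incorrect prediction, while "p3" (context-free-legal but constraint-violating) and "p4" both land in the illegal partition.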
2.3 Challenges
Large and variably sized AST sequences present several problems for traditional neural sequence-learners. The first is the existence of distant non-local dependences, such as a declaration of a variable v near the beginning of a file that might be many hundreds of steps before the point at which the RNN needs to predict a use of v. Another is the existence of complex relationships between nodes that are difficult to express in a linear sequentialization, as when the distinction between a great-grandparent and a great-uncle node is important. Third, because the learned model only approximates the sequence distribution, it is very likely that, while generating large ASTs under randomized sampling, novel contexts will be encountered. In such a case, there is no information explicitly contained in the training set to guide further generation after entering "uncharted territory." With other approaches, one relies on the ability of the learner to generalize from the training set; however, imperfect learning of the constraints of the language could cause poor generalization. Because NAMs are able to learn from the constraints, they have the potential to generalize based on the constraints, and hence to perform much better when they enter uncharted territory. The new idea pursued here is to leverage constraints that are defined unambiguously even in novel situations: if (an approximation to) the constraints can be learned, they will provide additional guidance about what to do in these situations.
2.4 NAM
NAMs are equipped with a deterministic automaton, referred to hereafter as the logical machine. The logical machine assists the NAM in two ways: it augments the input vector with a fixed-length vector that represents the context, and it imposes its knowledge of the three-way output partition described in Section 2.2 to add an extra loss term for constraint-violating predictions.

Augmented input.
The logical machine outputs a fixed-length binary vector c that represents the context of the current node with respect to the desired constraint. All variable names are known from the grammar G, and each corresponds to one production rule. Let r_v be the production rule that chooses variable name v, and let R_var be the collection of all production rules for variables. In the context vector for the declared-variable constraint, which is binary and of length |R_var|, the 1-valued entries are the positions of declared variables.
In the context vector for the type-safe-variable constraint, each combination of a variable v and a type t has an entry, which is 1 if v is of type t. There is then one additional entry for each type t, which is 1 if the current prediction must be of type t.
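As an illustration, the declared-variable context vector might be built as follows. The variable names and indexing scheme are our own assumptions; only the shape of the encoding follows the description above:

```python
# Binary context vector for the declared-variable constraint:
# one entry per variable-choosing production, set to 1 exactly when
# that variable has already been declared at the current point.

def declared_context(var_rules, declared):
    """var_rules: ordered list of variable productions R_var;
    declared: set of variables declared so far."""
    return [1 if v in declared else 0 for v in var_rules]

var_rules = ["Var0", "Var1", "Var2"]                  # all variable productions in G
print(declared_context(var_rules, {"Var0", "Var2"}))  # [1, 0, 1]
```

The vector's length is fixed at |R_var| regardless of how much of the tree has been generated, which is what lets it be concatenated onto the RNN's input at every step.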
Developing fixed-length context vectors that are informative for the constraint at hand could be daunting for conceptually more complex constraints, but even these simple representations had a profound positive impact on the learned model. We did not try different representations of the same constraint; testing how robust the NAM framework is to different ways of building the context vector is an interesting direction for future work.
Three-level loss function.
In addition to receiving augmented input from the logical machine, NAMs are also trained with additional reinforcement from it. The standard cross-entropy loss function, which measures the distance between the model's predicted probability distribution and the true observation y, has an undesirable consequence in the setting of learning to generate constrained sequences: the one-hot encoding of y means that the probabilities assigned to all incorrect outputs are penalized equally. However, in the presence of constraints, there are really three categories of predictions: the partitions previously mentioned. A three-level loss function that punishes the partition of illegal predictions more than the legal-but-incorrect ones can be interpreted in the vein of methods that artificially increase the training-set size, such as left-right reversal of images of scenes in a classification task where it is assumed a priori that orientation cannot affect the classification. These methods are most effective in situations, like images or trees, where the input is high-dimensional but lies on a lower-dimensional manifold. In our work, the logical machine provides feedback to the NAM that certain sequences, even though not actually present in the training data, have the possibility of existing, while others do not.
The objective that the NAM optimizes can be written as follows:

    L = L_xent + λ · L_C    (1)

where L_xent is the traditional cross-entropy loss function and L_C is the additional penalty for violating constraint C, whose magnitude is controlled by the hyperparameter λ. We say that Equation 1 defines a three-level loss function because predictions that are both wrong and illegal are penalized by both terms, while predictions that are wrong but legal are penalized only by the first. The trade-off between NAMs learning the specific training sequences versus the constraint more generally (without caring which legal sequences are more realistic) can be controlled with λ.
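One way Equation 1 could be realized at a single prediction step is sketched below. The exact form of the constraint penalty L_C is our assumption (here, the probability mass assigned to illegal outputs); the paper's concrete penalty may differ:

```python
import numpy as np

def three_level_loss(probs, correct_idx, legal_mask, lam=1.0):
    """Cross-entropy on the correct output, plus a lambda-weighted penalty
    on the probability mass assigned to constraint-violating outputs.
    legal_mask[i] = 1 if output i is legal under the constraint."""
    xent = -np.log(probs[correct_idx])
    illegal_mass = float(np.sum(probs * (1 - legal_mask)))
    return xent + lam * illegal_mass

# Outputs ordered as: correct, legal-but-wrong, illegal.
probs_bad  = np.array([0.5, 0.3, 0.2])   # puts mass on the illegal output
probs_good = np.array([0.5, 0.5, 0.0])   # same mass on the correct output
legal = np.array([1, 1, 0])
# Shifting mass from the illegal output to a legal one lowers the loss,
# even though the probability of the correct output is unchanged.
print(three_level_loss(probs_bad, 0, legal) > three_level_loss(probs_good, 0, legal))  # True
```

This exhibits the three levels: the correct output is unpenalized, a legal-but-wrong output is penalized only by the cross-entropy term, and an illegal output is penalized by both terms.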
Algorithms.
The algorithms for training and generation are given as Algs. 1 and 2, respectively. During training, the trees in the corpus are traversed and the NAM's parameters are updated via stochastic gradient descent. Generation then samples trees from the learned model.
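A runnable toy sketch of the general shape of such a sampling loop is given below. All names here are our own inventions, and the stub "model" is a fixed distribution; Alg. 2 operates over the trained NAM and logical machine instead:

```python
import random

def generate(predict, advance, done, state, max_steps=100):
    """Sample one linearized tree: repeatedly query a model for a distribution
    over outputs, sample an output, and advance the logical machine's state."""
    tokens = []
    for _ in range(max_steps):
        probs = predict(state)                 # {output: probability}
        y = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(y)
        state = advance(state, y)
        if done(state):
            break
    return tokens

# Toy instantiation: the "model" always prefers One; stop after three bits.
bits = generate(lambda s: {"One": 0.9, "Zero": 0.1},
                lambda s, y: s + 1,
                lambda s: s >= 3,
                state=0)
print(len(bits))  # 3
```

In the real system, predict would combine the RNN's hidden state with the logical machine's context vector, and advance would update the attribute values used to check the constraint.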
3 Experiments
All of our experiments used the following models: a vanilla RNN, a NAM with just the augmented input, a NAM with just the new objective function, and the full NAM with both. The chosen version of RNN is the Long Short-Term Memory (LSTM) cell [14], which has been favored in recent literature for its ability to learn long-term dependences, although the NAM framework is general to any type of RNN cell [15, 16, 17]. For the entire experiment, two stacked LSTMs are used, and backpropagation through time is truncated. The Adam optimizer [18] is used; dropout [19] is applied to all layers except the softmax output; and both ℓ1- and ℓ2-norm regularization are applied to weights, but not biases, in all matrices. The NAM's λ values were chosen so that the gradients for each term were of roughly equal magnitude. As de novo generation is the goal, all models were trained until their generation performance no longer improved, as measured by the evaluation criteria discussed below.

An artificial corpus created for this work is a set of 1,500 simple C programs containing elementary arithmetic operations, variable manipulations, and control-flow operations, 15% of which were held out for testing. There are an average of 7.01 unique variables, 3.29 unique types, 6.47 procedures, and 101.82 lines of code per program, providing a challenge for both constraints by having numerous declarations, multiple types in the same program, and changing scopes. The full corpus is available in the supplementary material. Programs in the corpus are translated to an abstract-syntax tree (AST) in the C Intermediate Language (CIL) [20]. The set of these ASTs can be interpreted as a sample from an L-attributed attribute grammar G.
3.1 Evaluation criteria
Our experiments were designed to answer the following questions:


What is the quality of simulated samples?

How well do the models represent the training data?

At what rate are constraints violated while sampling?
Methods for evaluating the quality of the learned model in generation tasks can be more subjective than in prediction tasks, where performance on held-out test sets is relevant. In our case, however, we can make various measurements of error rates when internal nodes are generated, as well as test whether a generated tree satisfies the constraint as a whole, which provides an overall measure of success.
Three measurements are used throughout:

The ability to learn the corpus, as measured by the average negative log-likelihood of the training samples under each model.

The number of predictions made that violate the constraint while generating new samples.

The number of trees that are entirely legal under the specific constraint under consideration. (In each generated tree, one constraint violation is sufficient to make the whole tree illegal.)
3.2 Declaredvariable constraint
Our first experiment imposed the constraint that every variable used must be declared (see columns 2–5 of Fig. 4, and Fig. 5). As shown by Fig. 4, even though the vanilla RNN gets to see the whole tree prior to the node requiring a prediction, it still makes many mistakes. For comparison, a stochastic context-free grammar that has been given the same augmented input is shown in Fig. 5. Since it now includes the context vector, it is referred to hereafter as a stochastic grammar with context (SGWC). This model will never choose to use a variable whose corresponding entry in the context vector is not set to 1, and thus never violates the constraint. However, it does not specialize to the corpus very well, which motivates the use of neural models that capture richer patterns in the training set. The NAM's modified objective function offers a modest improvement, but augmenting the input with the context vector provides a much more significant improvement. Moreover, the latter effect does not dominate the former: the full NAM with both improvements performs best overall.
Some insight into how the models differ can be gleaned from the average negative log-likelihoods (see Fig. 5). As expected, the NAM's extra loss term acts as a regularizer: even though training-set trees become less likely under the model, the result is improved generation ability (see Fig. 4). The better generation ability strongly suggests that the excess fitting to the training set by the vanilla RNN, compared with the NAM w/ 3-level loss, is best described as overfitting.
The average number of unique variables and the average number of procedures in each generated program give one measurement of each model's fidelity to the training corpus. Training-set trees varied, but averaged 7.0 unique variables and 6.5 procedures. The vanilla RNN uses more unique variables and fewer procedures than programs in the training set, corroborating the likelihood numbers' indication that it did not learn the corpus as well as the other models. The NAM with context and the NAM with both improvements yielded samples that resembled the corpus much more faithfully by these measurements (see Fig. 4).
Augmenting the input with the context vector makes the representation of the input at each step much richer. Thus, the NAM with context is able to learn valuable patterns of the training data that exist in this higher-dimensional space. The number of legal trees increases 2.8-fold over the baseline vanilla RNN. This drastic improvement still exhibits some degree of overfitting, which can be alleviated with the regularizing loss term: the full NAM thus has a slightly higher average negative log-likelihood than the NAM with context alone, but it achieves the best results during generation (see Fig. 4).
Figure 4: Generation results for each model under the two constraints.

                        Declared-variable constraint           Type-safe-variable constraint
Model                 Avg.   Avg.    Constraint   Legal      Avg.   Avg.    Constraint   Legal
                      Vars.  Procs.  Violations   Trees      Vars.  Procs.  Violations   Trees
Vanilla RNN           8.5    4.1     9426         187        8.4    4.3     7707         116
NAM w/ 3-level loss   8.2    4.3     8119         203        8.0    4.4     6673         177
NAM w/ context        6.7    6.7     2105         532        6.6    6.7     902          665
NAM w/ both           6.8    6.7     1846         582        6.5    7.0     697          674
Figure 5: Average negative log-likelihoods (train/test).

Model                 Declared-variable (train/test)   Type-safe-variable (train/test)
SGWC                  1.856 / 1.813                    1.406 / 7.178
Vanilla RNN           .231 / .253                      1.366 / 1.425
NAM w/ 3-level loss   .246 / .257                      1.375 / 1.471
NAM w/ context        .181 / .188                      .782 / .779
NAM w/ both           .194 / .208                      .819 / .794
3.3 Type-safe-variable constraint
The same relative performance is seen when we work with the second constraint. The SGWC generalizes to the test set especially poorly in this case, because the context vector varies more and thus there are rare or completely novel situations that the SGWC struggles with. The vanilla RNN is the same model as in Section 3.2, because it does not take the constraint into consideration in any way. The raw number of violations is lower in this setting because the tests for this constraint occur less frequently: not all variable uses involve multiple types that must agree. Even so, the simpler models produce fewer legal trees than in the experiment in Section 3.2.
4 Related Work
The problem of corpus-driven program generation has been studied before [6, 21, 22, 23, 24]. Statistical models used in this task include n-gram topic models [22], probabilistic tree-substitution grammars [25], a generalization of probabilistic grammars known as probabilistic higher-order grammars [24], and recurrent neural networks [6]. The most closely related piece of work is by Maddison and Tarlow [21], who use log-bilinear tree-traversal models, a class of probabilistic pushdown automata, for program generation. Their model also addresses the problems of declarations of names and type consistency, and they use "traversal variables" to propagate information from parts of the already-produced tree to influence what production is selected at a node. However, the state-transition function of their generator admits a simple "tabular parameterization," whereas memory updates in our approach involve complex interactions between a neural and a logical machine. Also, their training process has no analog of our three-valued loss function.
Program generation is closely related to program synthesis: the problem of producing programs in a high-level language that implement a user-given specification. A recent body of work uses neural techniques to solve this problem [26, 27, 28]. Of these efforts, Balog et al. [28] and Murali et al. [27] use combinatorial search, guided by a neural network, to generate programs that satisfy language-level constraints. However, this literature has not studied neural architectures whose training predisposes them toward satisfying such constraints.
The work presented here relates to several key concepts in the theory of grammars. Breaking down the generation of a tree into a series of nonterminals, terminals, and production rules is the same methodology used with stochastic context-free grammars. As an automaton that sees an input stream containing occurrences of "pop" and produces an output, a NAM is a form of transducer, namely a visibly-pushdown transducer [29].
Neural stack machines like those in [30] augment an RNN with a stack, which the RNN must learn how to operate through differentiable approximations. In contrast, a NAM only needs to learn how to make use of data values generated by the logical machine, rather than additionally needing to learn how to mimic the logical machine's operations.
The new term introduced in the objective function can be thought of as a form of regularization; many customized regularization schemes have been demonstrated [31, 32]. Our regularization term allows NAMs to learn the set of all legal production rules without penalty, but regularizes the learning of the specific singleton relative to the set of legal production rules.
5 Discussion
Learning to generate sequences with strong structural constraints would ideally be as easy as presenting an RNN exclusively with sequences that are members of the constrained space. As our experiments show, this can be difficult to achieve in practice. In some cases, though, aspects of the structure can be explicitly represented. NAMs provide a framework for incorporating knowledge of these constraints into RNNs; when trained on the same data, they significantly outperform RNNs that lack this knowledge.
We have demonstrated the utility of NAMs for two L-attributed AG constraints. The work here allows for the possibility of creating a generator of NAM systems: from a specification of a desired constrained language, one could generate the corresponding form for training. Moreover, these are just two of the many types of mistakes that would prevent a program from passing a C compiler's checks. A topic for future work is to incorporate enough of C's constraints so that generated programs would have a high probability of being compilable.
Because other kinds of sequences have a natural underlying parse tree and associated constraints that can be expressed using an L-attributed AG, another topic for future work is to explore the application of NAMs to other structured sequences, such as the proof trees of a logic.
References
 [1] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 [2] David Ha and Douglas Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.

 [3] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
 [4] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments. In USENIX Security Symposium, pages 445–458, 2012.
 [5] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on, pages 837–847. IEEE, 2012.
 [6] Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. In PLDI, 2014.
 [7] D.E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127–145, June 1968.
 [8] D.E. Knuth. Semantics of context-free languages: Correction. Mathematical Systems Theory, 5(1):95–96, March 1971.
 [9] P.M. Lewis, D.J. Rosenkrantz, and R.E. Stearns. Attributed translations. J. Comput. Syst. Sci., 9(3):279–307, December 1974.
 [10] Sophocles Efremidis, Christos H. Papadimitriou, and Martha Sideris. Complexity characterizations of attribute grammar languages. Inf. and Comp., 78(3):178–186, 1988.
 [11] T. Reps and T. Teitelbaum. The Synthesizer Generator: A System for Constructing Language-Based Editors. Springer-Verlag, NY, 1988.
 [12] Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. Discriminative neural sentence modeling by tree-based convolution. arXiv preprint arXiv:1504.01106, 2015.
 [13] Bing Shuai, Zhen Zuo, Bing Wang, and Gang Wang. DAG-recurrent neural networks for scene labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3620–3629, 2016.
 [14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

 [15] Horia Margarit and Raghav Subramaniam. A batch-normalized recurrent network for sentiment classification. Advances in Neural Information Processing Systems, 2016.
 [16] Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. In Advances in Neural Information Processing Systems, pages 2998–3006, 2015.
 [17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 [19] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [20] George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In International Conference on Compiler Construction, pages 213–228. Springer, 2002.
 [21] C.J. Maddison and D. Tarlow. Structured generative models of natural source code. In ICML, 2014.
 [22] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532–542, New York, NY, USA, 2013. ACM.
 [23] Anh Tuan Nguyen and Tien N. Nguyen. Graph-based statistical language model for code. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE '15, pages 858–868, Piscataway, NJ, USA, 2015. IEEE Press.
 [24] Pavol Bielik, Veselin Raychev, and Martin Vechev. PHOG: Probabilistic model for code. In ICML, pages 19–24, 2016.
 [25] Miltiadis Allamanis and Charles Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014. ACM.
 [26] Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neurosymbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
 [27] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Bayesian sketch learning for program synthesis. arXiv preprint arXiv:1703.05698, 2017.
 [28] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.
 [29] J.F. Raskin and F. Servais. Visibly pushdown transducers. In ICALP, 2008.
 [30] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1828–1836. Curran Associates, Inc., 2015.
 [31] JenTzung Chien and YuanChu Ku. Bayesian recurrent neural network for language modeling. IEEE transactions on neural networks and learning systems, 27(2):361–374, 2016.

 [32] Jing Bai and Yan Wu. SAE-RNN deep learning for RGB-D based object recognition. In International Conference on Intelligent Computing, pages 235–240. Springer, 2014.