Analyzing large software artifacts (so-called “Big Code”) has proven extremely challenging for logic-based approaches, which quickly force a trade-off between accuracy and scalability. Applying deep learning architectures to this task is a promising direction that has flourished in the last few years, with a number of techniques proposed around a variety of features extracted from software programs, ranging from syntactic features, to features statically derived from source code, to features dynamically inferred from execution traces. Since these models operate in separate application domains, it is hard to draw comparisons. Our goal in this paper is to measure how precisely models can capture program semantics, and to this end we propose a framework, CoSet, to standardize the evaluation of neural program embeddings.
The idea of reusing the knowledge distilled from existing code repositories to simplify the future development of software, and in turn improve product quality, is quite powerful. Code completion, for example, is a significant boost to programmer productivity and as such is a worthwhile task to pursue. Early methods in the field applied NLP techniques to discover the textual patterns that exist in source code [5, 6, 7]; later approaches proposed learning syntactic program embeddings from Abstract Syntax Trees (ASTs) [8, 9, 10]. Such approaches addressed code completion, but fell short of more complex tasks such as program synthesis or repair, where a thorough understanding and precise representation of program semantics are required. A number of new deep-learning architectures developed specifically to address this issue [2, 3, 4, 11, 12] tried a variety of improvements in terms of features: static features based on program source, dynamic features from program executions, abstract features from symbolic program traces, and features from the graph representation of a program. This diversity also makes it hard to evaluate these approaches, as they do not uniformly or universally improve on existing work, but rather exhibit trade-offs across the natural characteristics of programs.
We propose a new benchmarking framework, called CoSet, that aims to provide a consistent baseline for evaluating the performance of any neural program embedding. CoSet captures both the human-induced variations in software (due to, e.g., coding style, algorithmic choices, code layout and structure) and algorithmic transformations to the software (due to, e.g., software refactoring, code optimization). CoSet consists of a dataset of almost eighty-five thousand programs, developed by a large number of users, each solving one of ten coding problems, and labeled with a number of characteristics of interest (e.g., running time, pre- and post-conditions, loop invariants). This dataset is paired with a number of program transformations, which, when applied to any program in the dataset, produce variations of that program. These two elements (the dataset and the transformations) give CoSet the power to distinguish between various neural program embeddings with high precision.
One use of CoSet is to provide a classification task for the neural program embeddings under evaluation. The benchmark measures the accuracy and the stability of the embedding, giving the user a precise answer on how various embeddings compare. A second use of CoSet is as a debugging tool: once a user determines that a neural program embedding falls short of some goal, the CoSet program transformations can be used (in delta-debugging fashion) to identify the characteristic of the programming language, the program code, or the program execution that causes the accuracy drop.
We have conducted a pilot study with this benchmarking framework on four of the latest deep learning models. We selected DyPro, TreeLSTM, the gated graph neural network (GGNN), and the AST-Path neural network (APNN) as evaluation subjects for CoSet. Through our comprehensive evaluation, we find that DyPro achieves the best result in the semantics prediction task and is significantly more stable in its predictions than the static models, which have difficulty capturing program semantics; as a result, the static models are much less stable against syntactic variations even when semantics are preserved. On the other hand, the generalization from static program features performed by GGNN and APNN, while less accurate, scales to large or long-running programs, which overwhelm DyPro's dynamic model. Through careful use of CoSet's debugging features, we then identify a number of specific shortcomings in the tested models, from a lack of support for variable types, to confusion about logging and other ancillary program aspects, to limitations in representing APIs.
We make the following contributions:
We design CoSet, a novel benchmark framework that expresses both human and algorithmic variations in software artifacts.
We propose CoSet for evaluating how precisely models can learn to interpret program semantics and for identifying specific program characteristics that are sources of misclassification.
We present our evaluation results, including the strengths and weaknesses of each model, followed by an in-depth discussion of our findings.
2 Motivation and Running Example
For an intuition on why programs are difficult to learn, consider the program of Figure 1. The method Program.Difference() takes as input an array, sorts it, and returns the difference between the smallest value and the largest value.
To learn the semantics of this program sufficiently well to classify it correctly as a sorting routine using the Bubblesort strategy requires bridging the gap between the syntactic representation and its runtime execution. This gap is typically expressed in abstract interpretation by considering a program to be the sum total of all its possible execution traces, and by considering a program's source code to be an abstraction of this program-as-traces model. Intuitively, this distinction implies that learning to classify programs may be limited in accuracy when using only the source code, and correspondingly benchmarks should focus on samples whose source code and execution traces are not trivially related.
The gap between syntax and (runtime) semantics is characterized by several properties, which become the requirements for our CoSet benchmark. In other words, we wish for CoSet to have enough training and test samples to cover the whole range of distinctions between syntax and semantics.
First, the gap between program syntax and semantics is manifested in that programs of similar syntax may have vastly different semantics. For example, consider two sorting implementations, both sorting an array via two nested loops, both comparing the current element to its successor, and both swapping them if the order is incorrect. Yet the two functions could implement different sorting strategies, namely Bubblesort and Insertionsort. Minor syntactic discrepancies can therefore lead to significant semantic differences. Our first requirement for the benchmark is to include sufficient samples that vary both syntactically and semantically, while solving the same software task.
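As a concrete illustration of this requirement (our own hypothetical sketch, not a sample from the CoSet dataset), the following two Python functions share nearly identical syntax: two nested loops, a comparison of adjacent elements, and a conditional swap. Yet they implement the two different strategies named above:

```python
def bubblesort(a):
    # Repeatedly compare each element with its successor and swap if out of order.
    for i in range(len(a) - 1):
        for j in range(len(a) - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def insertionsort(a):
    # Syntactically near-identical, but the inner loop walks the current
    # element backwards into the sorted prefix: a different sorting strategy.
    for i in range(1, len(a)):
        for j in range(i, 0, -1):
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
    return a
```

Both functions produce a sorted array, yet a model that captures semantics should distinguish their strategies despite the surface similarity.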
Second, program statements are almost never interpreted in the order in which the corresponding token sequence appears in the source code (the only exception being straight-line programs, i.e., ones without any control-flow statements). For example, a conditional statement only executes one branch each time, but its token sequence is expressed sequentially as multiple branch bodies. Similarly, when iterating over a looping structure at runtime, it is unclear in which order any two tokens are executed when considering different loop iterations. In Figure 1 the statement on line 23 may or may not execute after the statement on line 25 from a previous loop iteration. A related characteristic is that the order of independent statements (e.g., lines 19 and 20) is arbitrary and the learning model needs to be evaluated for syntactic-ordering biases. The second requirement for our benchmark is to include samples that vary only in the syntactic order of statements, without changing semantics.
Third, many tokens in the source code are not relevant to program semantics. For example, variable names do not affect the results computed by a program; rather, the dependencies between variables (both data and control dependencies) play the essential role in defining program semantics. In Figure 1 the variables swappbit, i, temp1, temp2, b, and a could have any other names, distinct from each other, and the program would behave the same way. The third requirement for our benchmark is to include samples with arbitrary values for syntactic tokens that do not affect semantics.
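To make this point concrete, here is a minimal Python sketch (our own illustration, echoing the Program.Difference() example of Figure 1): the two functions differ only in identifier names and behave identically:

```python
def difference(arr):
    # Sort the input, then return the largest value minus the smallest.
    s = sorted(arr)
    return s[-1] - s[0]

def difference_renamed(values):
    # Alpha-renamed copy: every identifier differs, the semantics do not.
    ordered = sorted(values)
    return ordered[-1] - ordered[0]
```

A semantics-based classifier should assign both functions the same label, since they denote the same computation.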
3 Benchmark Design
The requirements from the previous section determine the need for a wide range and wide diversity of programs, both in terms of source code and in terms of execution behavior. Before describing CoSet's design, we add an additional requirement: a non-adversarial, natural style for our dataset samples. The goal of CoSet is to provide the capability of testing learning models on their accuracy in capturing program semantics, and we wish to explicitly exclude adversarial examples from this benchmark. The rationale is that adversarial examples, which underwent program obfuscation in order to make their analysis and understanding exceedingly hard, are an orthogonal problem that requires separate solutions and, accordingly, a distinct benchmark. For example, a common obfuscation used by malicious programs is that of embedding an interpreter. The original (malicious) program is replaced by a new program that consists of an interpreter and a reimplementation of the original program in a randomly generated language, to be handled by the interpreter. Understanding any adversarial examples obfuscated this way requires addressing three tasks: (1) identifying the presence of an interpreter, (2) learning the semantics of that interpreter, and (3) learning the semantics of the source code embedded alongside the interpreter. CoSet as a benchmark focuses only on task (3) and thus explicitly excludes adversarially generated samples.
With this additional requirement in mind, we design the benchmark to include samples that capture normal variation in terms of syntax, style, algorithms, etc. arising in day-to-day programming work. This means using program samples from a diverse set of programmers, all solving the same coding problem, as well as including tools to transform these samples according to two common steps in the software-development lifecycle, refactoring and optimization.
3.1 Program Dataset
The dataset consists of 84,165 programs in total, obtained from a popular online coding platform. Programs are written in several different languages: Java, C#, and Python. Each program solves one of the coding problems, which we hand-picked to ensure the diversity of the programs in the dataset. Specifically, the dataset contains introductory programming exercises for beginners, coding puzzles that exhibit considerable algorithmic complexity, and challenging problems frequently appearing in coding interviews. We split the whole dataset into a training set containing 64,165 programs, a validation set of 10,000 programs, and a test set of the remaining 10,000 programs.
3.2 Program Labels
The dataset has been manually analyzed and labeled on the basis of operational semantics (e.g., Bubblesort, Insertionsort, Mergesort, etc., for a sorting routine). We allow certain kinds of variation to keep the total number of labels manageable. For example, we ignore local variables allocated for temporary storage, whether looping is expressed iteratively or recursively, whether the sort order is ascending or descending, etc.
The labeling was done by fourteen PhD students and exchange scholars at the University of California, Davis. They come from different research backgrounds, such as programming languages, databases, security, graphics, and machine learning. All of them were interviewed and tested for their knowledge of program semantics. To reduce labeling error, we mostly distributed programs solving the same coding problem to a single person and had the labelers cross-check the results for validation. The whole process took more than three months to complete. In the end, we defined 38 labels in total, with more than 2,000 programs on average for each label. (Readers are encouraged to consult the supplemental material.)
3.3 Program Transformations
In this section, we introduce the transformations we apply to CoSet for generating program variants. They serve two purposes in our experiments: debugging the model and gauging model stability. The former refers to identifying the reason for a misclassification; the latter measures how stable the models are against software variations. Both are explained in detail in Section 4. CoSet considers three types of transformations: semantics-preserving, semantics-approximating, and semantics-changing.
Semantics-preserving transformations can be further split into compiler optimizations (i.e., for improving the performance of a program during compilation) and software refactorings (i.e., for improving code readability and maintainability during development). Even though they are designed for different settings and purposes, all of them preserve the semantics of the original program. In this paper we choose Constant and Variable Propagation (CVP), Dead Code Elimination (DCE), Loop Unrolling (LU), and Hoisting as compiler optimizations, and Variable Renaming (VR), Nested Condition Simplification (NCS), Control Flag Removal (CFR), and Control Statement Unification (CSU) as software refactorings. Interested readers are invited to consult Aho et al.'s compiler textbook and the Eclipse IDE documentation [14, 15] for examples of each transformation.
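For a flavor of what these transformations do, the following minimal Python sketch (our own example, not taken from CoSet's tooling) shows a function before and after applying constant/variable propagation (CVP) and dead code elimination (DCE); both versions compute the same result on every input:

```python
def before(n):
    # A constant that CVP can propagate, plus a branch DCE can remove.
    k = 2
    if False:
        n = -n          # dead code: this branch never executes
    return n * k

def after(n):
    # CVP folds k into the expression; DCE removes the dead branch.
    return n * 2
```

A stable model should assign the same semantic label to both versions, since the transformation preserves behavior.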
Semantics-approximating transformations change a program in minor ways, producing a new program with semantics close to that of the original. In particular, we adopt two kinds of transformations. First, we change type names to others within the same group; for example, we replace int with long in the group of integer types, or double with float in the group of floating-point types. Second, we change an API in the program to another with similar semantics, such as changing Array.Sort(a:array) to Array.Sort(a:array,c:comparer) from Microsoft's .NET Framework for C#.
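A Python analogue of the API substitution (our own sketch; CoSet's actual transformations target C# APIs such as the Array.Sort overloads) replaces a one-argument sort call with an overload that takes an explicit key, leaving the observable behavior essentially unchanged:

```python
def top_k_original(a, k):
    # Uses the one-argument form of sorted().
    return sorted(a)[-k:]

def top_k_variant(a, k):
    # Semantics-approximating variant: the same API with an explicit key
    # argument, analogous to swapping Array.Sort(a) for
    # Array.Sort(a, comparer) in the C# example above.
    return sorted(a, key=lambda x: x)[-k:]
```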
We also include transformations that change the semantics of a program. These transformations are needed solely for debugging a model. Specifically, CoSet can remove error-handling code from a program and can fix a buggy program with SARFGEN, a productized repair tool often used in Massive Open Online Courses (MOOCs).
4 Evaluation

In this section, we describe in detail how we use CoSet to evaluate the strengths and weaknesses of each deep neural architecture by measuring two characteristics: accuracy and stability.
4.1 Evaluating Classification Accuracy Using CoSet
We train GGNN, APNN, TreeLSTM and DyPro on CoSet to predict the semantic label of a test program. In this experiment we use prediction accuracy and F1 score to measure classification results.
Figure 1(a) depicts the prediction accuracy of all models on CoSet. Overall, DyPro comes out on top as the most accurate model at 84.7%, with GGNN achieving 75.2% prediction accuracy. In terms of F1 score (Figure 1(b)), DyPro and GGNN keep their top positions, while APNN and TreeLSTM swap places at the bottom of the ranking. The other metric of interest is scalability in terms of program size. In order to compare GGNN, APNN, and TreeLSTM (which operate on source code, typically measured in lines) with DyPro (which operates on execution traces, measured in state changes), we unify the program-size dimension as a number of bytes, with a line of code being on average 15 bytes long and a trace state change 20 bytes. As shown in Figures 1(c) and 1(d), accuracy quickly drops as program size increases, with the static models (GGNN, APNN, and TreeLSTM) losing all accuracy once a program exceeds 100 lines.
Discussion of Misclassification Types.
CoSet is able to trigger many types of misclassifications. To gain a better understanding of the errors each model makes, we conducted a large-scale manual inspection and summarize our findings below.
Types and Variable Names: Type names can cause issues. For example, Figure 2(a) shows a program in CoSet that causes APNN to misclassify; replacing float with double fixes this error. As a more subtle example, the program in Figure 2(b) is misclassified by all of the static models. The error is due to the variables col and row. Since the vast majority of programs in CoSet use row (resp. col) as the induction variable in the outer (resp. inner) loop, static models do not recognize that a program denotes the same semantics after some of its variables (i.e., row and col) are interchanged. It suffices to conclude that static models weigh variable names in their predictions, whereas variable names should not be considered in any semantics-based classification task.
Syntactic Structure: To capture human programming habits in terms of code style and structure, CoSet exposes an uneven distribution of code samples in terms of program syntax (i.e., some constructs are used more frequently than others), which helps trigger misclassifications. For example, the switch statement in the program of Figure 2(g) is the reason for the error. In other words, the models would have correctly classified the program if the switch were converted to an if-else, a far more popular selection statement.
Scalability: CoSet also reveals that scalability is a major issue for all models, both static and dynamic. As the size of programs/execution traces increases, all models suffer notable drops in accuracy (resp. F1 score), as depicted in Figure 1(c) (resp. Figure 1(d)). APNN in particular is shown to be the most vulnerable.
API Usage: CoSet discloses issues among static models in generalizing API calls. Figures 2(c) and 2(d) show two examples: the error in Figure 2(c) (resp. Figure 2(d)) can be fixed by replacing the underlined API with Array.Sort(a) (resp. the indexing operator). Granted, CoSet bears some blame, as it contains very few programs of the same label as the two examples with the exact same API signatures; but expecting sufficient coverage of all API prototypes (e.g., 17 in total for Array.Sort according to Microsoft .NET Framework 4.7) is a challenging task for any dataset. Therefore, generalizing to largely similar yet partially unseen APIs is both necessary and urgent. In addition, in both cases such API variations account for a small portion of the whole program. In particular, we find that 86.2% (resp. 68.9%) of the programs in the training set have a very similar body (i.e., differing by fewer than eight tokens) to Figure 2(c) (resp. Figure 2(d)) according to DECKARD, arguably the state-of-the-art clone detection algorithm. As a result, it is reasonable to expect static models to be immune to such minor API changes.
Error Handling: CoSet shows that error-handling code can be another cause of misclassification. Figure 2(e) depicts an example that is misclassified by APNN and TreeLSTM; after removing the if clause at the beginning of the function, both models produce the correct result. This is another example in which CoSet demonstrates that static models are unstable against syntax changes that barely affect program semantics.
Buggy Code: CoSet reveals the DyPro model to be the most susceptible to buggy programs. The reason is the dynamic nature of the program representation DyPro adopts: if there is a bug in the program, its runtime traces will likely exhibit greater discrepancies than its syntactic representations (e.g., tokens, ASTs, etc.). Take the program in Figure 2(f) for example. The bug lies in the API call int val = char.GetNumericValue(‘a’), which converts a digit character to an integer, whereas the programmer intended to directly get the ASCII/Unicode code of the character using int val = ‘a’. Unfortunately, this bug leads to a chaotic dynamic representation (char.GetNumericValue() evaluates to -1 for any non-digit character) that DyPro is not able to overcome. In comparison, all static models display stronger fault tolerance by correctly classifying the buggy program.
Semantic Property: CoSet gives insights into the limitations of static models in learning semantic program properties. The program in Figure 2(h) triggers a misclassification of this kind in all static models. It is worth mentioning that the program does not implement the standard Bubblesort strategy, in which the inner loop systematically shifts its bound towards the beginning of the array as bigger numbers are bubbled up to the end. Instead, the inner loop in this program keeps sorting the entire array until it reaches an iteration in which no items are swapped, at which point the array is sorted.
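The following Python sketch of this non-standard variant is our own reconstruction from the description above, not the literal program of Figure 2(h):

```python
def bubblesort_until_stable(a):
    # Non-standard Bubblesort: instead of shrinking the inner loop's bound
    # after each pass, keep passing over the whole array until a full pass
    # performs no swap, at which point the array is sorted.
    swapped = True
    while swapped:
        swapped = False
        for j in range(len(a) - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
    return a
```

Despite the unusual loop structure, the function still exhibits the bubbling behavior that defines the Bubblesort label.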
Our methodology for analyzing this misclassification starts with formally defining the semantic label of Bubblesort to apply to programs that have a function satisfying the following properties:
(1) a time complexity of O(n²);
(2) an input type signature of int[] → int[] and a post-condition that the output array a is sorted, i.e., ∀i. a[i] ≤ a[i+1];
(3) a nested loop structure, with an outer-loop invariant stating that, if t is the number of outer-loop iterations completed so far, then the last t elements of the array are in their final sorted positions.
Similarly, we give definitions of all other labels in CoSet. To identify the source of the misclassification (i.e., the program not being classified as Bubblesort), we repeat the classification task for a broader set of classes, using one property at a time as a new label definition. For example, using property (1) alone, we collect all programs with O(n²) time complexity in CoSet as the training data for a new label. Similarly, for property (2) we use all programs that implement a sorting routine, and for property (3) we collect all programs that display the bubbling behavior. The results show that the static models incorrectly predict the program in Figure 2(h) only when using property (3) as the label definition. In other words, the static models GGNN, APNN, and TreeLSTM fail to learn a sufficiently expressive discriminator to understand the non-standard bubbling strategy. This methodology demonstrates CoSet's utility in discovering the limitations of models in learning semantic properties.
Table 1: Percentage of changed predictions per model, under compiler optimization and software refactoring transformations.
4.2 Evaluating Model Stability Using CoSet
In this experiment we evaluate how stable the models are in their predictions. Stability is a relevant and important metric because changes to source code are inevitable, and a model that is resilient to changes, especially semantics-preserving ones, will yield many benefits. As mentioned in the benchmark design (Section 3), CoSet applies normal, standard, and widely adopted program transformations to simulate how software evolves during its normal lifecycle (e.g., code optimization or software refactoring).
Given the models trained for the experiment in Section 4.1 and CoSet's test programs, we apply code transformations to generate a new test set, preserving the semantics of all samples; programs with no applicable transformations are excluded from the new test set. We then examine whether the models hold on to their prior predictions. Table 1 depicts the results; the number in each cell denotes the percentage of programs in the new test set on which a model changed its prediction. Overall, all models display decent stability against software transformations: no transformation was able to sway 30% or more of the predictions of any model. Among the static models, TreeLSTM is shown to be the most stable deep neural architecture, while APNN is the most sensitive to the transformations. Unsurprisingly, DyPro handles almost all semantics-preserving transformations, as its features are extracted from execution traces and are thus largely unaffected by these source-code transformations.
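The per-cell numbers of Table 1 can be computed with a simple routine. The sketch below is our own formulation, with hypothetical model and transform callables standing in for a trained classifier and a CoSet transformation; it returns the fraction of test programs whose predicted label changes after the transformation:

```python
def prediction_change_rate(model, programs, transform):
    """Fraction of programs whose predicted label differs after `transform`.

    `model` maps a program to a label; `transform` maps a program to a
    (semantics-preserving) variant, or None when it is not applicable.
    Both callables are hypothetical stand-ins for illustration.
    """
    # Exclude programs with no applicable transformation, as in Section 4.2.
    applicable = [p for p in programs if transform(p) is not None]
    changed = sum(1 for p in applicable if model(transform(p)) != model(p))
    return changed / len(applicable)
```

For instance, a perfectly stable model yields 0.0 under the identity transformation, while a model whose prediction flips on every transformed program yields 1.0.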
5 Related Work
MNIST (Modified National Institute of Standards and Technology database) is a database of handwritten digits that is widely used for training and testing machine learning algorithms. The MNIST database contains 60,000 training images and 10,000 testing images. There have been many attempts to achieve the lowest error rate, some of which have indeed accomplished "near-human performance". ImageNet is a visual database designed for visual object recognition research. More than 14 million images have been hand-annotated to indicate what objects are pictured. ImageNet contains more than 20,000 categories, each of which consists of several hundred images. Since 2010, ImageNet has been used in an annual contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where machine learning models compete to correctly classify and detect objects and scenes. CIFAR-10 (Canadian Institute For Advanced Research) is another database of images commonly used to train machine learning and computer vision algorithms, and is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes, e.g., cars, birds, cats, etc. Labeled Faces in the Wild is a database of face photographs designed for studying the problem of unconstrained face recognition. The dataset contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured; 1,680 of the people pictured have two or more distinct photos in the dataset. The only constraint on these faces is that they were detected by the Viola-Jones face detector.
Unlike these existing databases, designed almost exclusively for training and testing machine learning models in the computer vision and image processing domains, we present CoSet to evaluate how precisely deep neural architectures can learn the semantics of a program. In addition, we introduce a suite of program transformations for assessing the stability of model predictions and for debugging classification mistakes.
6 Conclusion

We introduced a benchmark framework called CoSet for evaluating the accuracy and stability of neural program embeddings and related neural network architectures proposed for software classification tasks. CoSet consists of a set of natural, non-adversarial source-code samples from a variety of programmers tasked with well-defined coding challenges, and supplements this set with source-code transformations representative of common software optimizations and refactorings. In our evaluation we observed that CoSet was effective in measuring differences among various models proposed in the literature and provided debugging capabilities to identify the root causes of misclassifications.
- Tai et al.  Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015. URL http://arxiv.org/abs/1503.00075.
- Allamanis et al.  Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740, 2017.
- Alon et al.  Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. arXiv preprint arXiv:1803.09473, 2018.
- Wang  Ke Wang. Learning scalable and precise representation of program semantics. arXiv preprint arXiv:1905.05251, 2019.
- Hindle et al.  Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on, pages 837–847. IEEE, 2012.
- Gupta et al.  Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. Deepfix: Fixing common c language errors by deep learning. 2017.
- Pu et al.  Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay. sk_p: a neural program corrector for moocs. In Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity, pages 39–40. ACM, 2016.
- Maddison and Tarlow  Chris Maddison and Daniel Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning, pages 649–657, 2014.
- Bielik et al.  Pavol Bielik, Veselin Raychev, and Martin Vechev. Phog: probabilistic model for code. In International Conference on Machine Learning, pages 2933–2942, 2016.
- Mou et al.  Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree structures for programming language processing. 2016.
- Henkel et al.  Jordan Henkel, Shuvendu Lahiri, Ben Liblit, and Thomas Reps. Code vectors: Understanding programs through embedded abstracted symbolic traces. arXiv preprint arXiv:1803.06686, 2018.
- DeFreez et al.  Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. Path-based function embeddings. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, ICSE ’18, pages 430–431, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5663-3. doi: 10.1145/3183440.3195042. URL http://doi.acm.org/10.1145/3183440.3195042.
- Aho et al.  Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN 0321486811.
- A brief history of Eclipse. https://www.ibm.com/developerworks/rational/library/nov05/cernosek/index.html. Accessed: 2019-05-22.
- Goth  Greg Goth. Beware the march of this ide: Eclipse is overshadowing other tool technologies. IEEE software, 22(4):108–111, 2005.
- Wang et al.  Ke Wang, Rishabh Singh, and Zhendong Su. Search, align, and repair: Data-driven feedback generation for introductory programming exercises. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 481–495, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5698-5. doi: 10.1145/3192366.3192384. URL http://doi.acm.org/10.1145/3192366.3192384.
- Jiang et al.  Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, pages 96–105, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2828-7. doi: 10.1109/ICSE.2007.30. URL https://doi.org/10.1109/ICSE.2007.30.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Deng et al.  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Krizhevsky  Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Huang et al.  Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.