Context-Aware Parse Trees

by Fangke Ye, et al.

The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program structure rather than program syntax. This is a departure from the classical abstract syntax tree, which is principally driven by programming language syntax. While we believe a semantics-driven representation is desirable, the specifics of an SPT's construction can impact its performance. We analyze these nuances and present a new tree structure, heavily influenced by Aroma's SPT, called a context-aware parse tree (CAPT). CAPT enhances SPT by providing a richer level of semantic representation. Specifically, CAPT provides additional binding support for language-specific techniques for adding semantically-salient features, and language-agnostic techniques for removing syntactically-present but semantically-irrelevant features. Our research quantitatively demonstrates the value of our proposed semantically-salient features, enabling a specific CAPT configuration to be 39% more accurate than SPT across the 48,610 programs we analyzed.






1. Introduction

Machine programming (MP), as defined by Gottschlich et al. in "The Three Pillars of Machine Programming," is any system that automates some aspect of software development (Gottschlich et al., 2018). An open research challenge in MP is how to build effective automated code similarity systems. The potential use cases for such code similarity systems range from code recommendation to automated software bug patching, to name a few (Dinella et al., 2020; Luan et al., 2019; Allamanis et al., 2018; Pradel and Sen, 2018; Bhatia et al., 2018; Bader et al., 2019; Barman et al., 2016). Yet, as others have noted, the correct structural representation for such a code similarity system remains unclear (Luan et al., 2019; Odena and Sutton, 2020; Ben-Nun et al., 2018; Alon et al., 2018, 2019; Tufano et al., 2018; Zhang et al., 2019; Allamanis et al., 2018).

Figure 1. Example of the Differences in the SPT and CAPT.

In this work, we present a new structural representation, the context-aware parse tree (CAPT). In contrast to syntactic representations, such as the abstract syntax tree (AST), CAPT is designed to capture the semantic meaning of the user's code. That is, CAPT provides information relevant to whether two code snippets (for the purposes of this work, we define a code snippet precisely as a C/C++ function, discussed in more detail in Section 3) are semantically convergent, even if they are syntactically divergent. Code similarity is informally defined as the process of determining whether two code snippets are semantically similar. In this work, we show that CAPTs are competitive with simplified parse trees (SPTs), the representation used by Aroma (Luan et al., 2019), a state-of-the-art code recommendation system. From the point of view of machine programming (Gottschlich et al., 2018), this work seeks to improve our ability to automatically recognize a user's intent from code; thus, it principally falls in the "intention pillar" (Gottschlich et al., 2018). In addition, once such intention is understood, it may be augmented or transferred from language to language, thereby advancing the "adaptation pillar."

Why Not An AST?

While the abstract syntax tree (AST) has had tremendous, demonstrable value in cases where syntactic structure is of primary importance (e.g., source code compilation (Aho et al., 2006)), the utility of the AST in the space of extracting semantic meaning from code (i.e., lifting intention (Kamil et al., 2016; Ginsbach et al., 2018; Ahmad et al., 2019)) may be less clear. The AST contains a complete syntactic representation of the program, which can contain many details relevant to program compilation, but that may be less salient for semantic analysis. For example, an AST for a single line of C/C++ code like,

int x = (y+3);

may contain separate nodes for a variable declaration, a compound statement, and three implicit casts. While such details are critical for correctly implementing a compiler, they may not be relevant for semantic analysis. For example,

int x = y;

may be considered semantically similar to the prior code snippet, even though the second snippet has no implicit casts or compound statements. Unlike an AST, a CAPT can omit these details to help improve similarity analysis. We note that the Aroma authors have previously illustrated some of these AST limitations as well (Luan et al., 2019).

Why Not An SPT?

The Aroma team (Luan et al., 2019), inspired, at least partially, by weaknesses in the AST (as stated directly by the authors), introduced the simplified parse tree (SPT). An SPT is a new structural representation for code similarity, which intentionally departs from the AST. By design, the SPT reduces the syntactic information collected from source code: each node in the SPT is strictly a token from the original program (no other nodes are introduced). The Aroma authors demonstrate that this reduction, or in some cases elimination, of lower-level syntactic information can be helpful for code similarity systems. Such a reduction may be especially salient in the context of type 3 and 4 similarity, where code may be syntactically different, but is semantically similar (Roy et al., 2009).
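The token-only property of an SPT can be sketched as follows. This is a toy illustration of the concept under our own simplified encoding (nested lists whose leaves are source tokens), not Aroma's implementation:

```python
# Toy illustration of the SPT idea: every node in the tree is a token from
# the original program (internal nodes group tokens; no synthetic "VarDecl"
# or "ImplicitCast" nodes are introduced). The encoding is ours.

def leaf_tokens(node):
    """Collect the program tokens at the leaves of an SPT-like tree."""
    if isinstance(node, str):
        return [node]
    tokens = []
    for child in node:
        tokens.extend(leaf_tokens(child))
    return tokens

# A hand-built SPT-like tree for `int x = (y + 3);`
spt = ["int", ["x", "=", ["(", ["y", "+", "3"], ")"]], ";"]
print(leaf_tokens(spt))
# -> ['int', 'x', '=', '(', 'y', '+', '3', ')', ';']
```

Note that every leaf is a token that literally appears in the source, which is the property the SPT enforces by construction.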

While the AST may contain too much low-level syntactic information, the SPT may omit semantically-relevant information. For example, because the SPT contains only tokens, the SPT effectively omits whether or not a token binds to a local or a global variable. On the other hand, the SPT may also include too much specificity. For example, the exact name of a variable or function is not always relevant to its semantic meaning (e.g., two global variables with the same name but from different programs might not imply similar semantics). In light of this, we identified two areas where an SPT can be modified such that the resulting tree yields improved code similarity accuracy. Those areas are: (i) language-agnostic modifications, which remove potentially irrelevant syntactic information, and (ii) language-specific modifications, which introduce new syntactic information that may be semantically salient. We believe CAPTs can capture the essence of these two design elements. Our paper provides the following technical contributions:

  • We introduce the context-aware parse tree (CAPT), a novel modification of the simplified parse tree (SPT) intended to improve code similarity analysis.

  • We illustrate and discuss the two flexibility enhancements CAPT has compared to Aroma’s simplified parse tree: (i) language-agnostic and (ii) language-specific.

  • Our research quantitatively demonstrates the value of our proposed semantically-salient features, enabling a specific CAPT configuration to be 39% more accurate than SPT across the 48,610 programs we analyzed.

2. System Design

Before discussing the specifics of our approach, we first provide some background on how both CAPTs and SPTs are used in code similarity systems. Figure 2 presents an abbreviated overview of our code similarity system, MISIM (we illustrate MISIM using CAPTs, but any process that transforms a code snippet into a feature vector could be used, including SPTs). Figure 2 illustrates the process of transforming source code (e.g., a C function) into a feature vector. Once a feature vector is generated, a code similarity measurement (e.g., vector dot product (Lipschutz, 1968), cosine similarity (Singhal, 2001), or machine-learned similarity (Zhao and Huang, 2018)) calculates the similarity score between the input program and other programs stored in a database. (In the context of this work, we perform code similarity analysis on an entire C/C++ program, where we differentiate code snippets by uniquely defined C/C++ function bodies. Although the current system only supports C/C++ code, our design is agnostic to the underlying programming language.) While we have built a prototype of the entire system, CAPT is the emphasis of this paper. As such, we omit a deeper dissection of the system's other components, which we consider outside the scope of this paper.

Figure 2. The MISIM Code Similarity System.

To generate a CAPT, the code is first parsed into a language-agnostic parse tree. Next, the system performs language-specific transformations, constructing an initial intermediate form of the CAPT by adding pertinent information used to disambiguate code. The system then performs language-agnostic transformations (e.g., abstracting the number of code statements in a function) by potentially pruning or modifying the CAPT's nodes. Finally, the CAPT is featurized into a vector using the same procedure as SPT's featurization process (Luan et al., 2019).
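The pipeline just described can be sketched as a composition of stages. The function names and stage interfaces below are our own placeholders, not the system's actual API; the toy stand-ins exist only to show the data flow:

```python
# Sketch of the CAPT construction pipeline: parse, apply language-specific
# transforms, apply language-agnostic transforms, then featurize.
# All names here are illustrative placeholders.

def build_feature_vector(source, parse, lang_specific, lang_agnostic, featurize):
    tree = parse(source)                  # language-agnostic parse tree
    for transform in lang_specific:       # add disambiguating annotations
        tree = transform(tree)
    for transform in lang_agnostic:       # prune or abstract nodes
        tree = transform(tree)
    return featurize(tree)                # same featurization as for an SPT

# Toy stand-ins, operating on a flat token list for simplicity:
tokens = build_feature_vector(
    "f(x)",
    parse=lambda src: src.replace("(", " ( ").replace(")", " ) ").split(),
    lang_specific=[lambda t: ["(@call" if tok == "(" else tok for tok in t]],
    lang_agnostic=[lambda t: [tok for tok in t if tok != ")"]],
    featurize=lambda t: t,
)
print(tokens)  # -> ['f', '(@call', 'x']
```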

3. The Context-Aware Parse Tree (CAPT)

In this section, we describe the fundamental design of CAPT, including the key differences between CAPT and Aroma's simplified parse tree (SPT). Some of these details are visually illustrated in Figure 1. Fundamentally, a CAPT is the result of transforming an SPT in specific ways according to a configuration; the configuration options give CAPT a greater degree of flexibility than an SPT. Different configurations may result in better performance in some domains, but worse performance in others. Here, we focus on describing the intuition behind these options and on evaluating every possible configuration in one domain. In this preliminary work, we do not address how to choose a configuration for a particular domain, but plan to investigate this in future work.

CAPT Configuration Categories (see Table 1)

CAPT’s configuration categories come in two general forms: language-specific configurations and language-agnostic configurations, listed in Table 1. We next give an intuitive overview of both.

Table 1 lists the current types and options for the language-specific and language-agnostic categories in CAPT. (While our exploration into CAPT is still early, we believe our categories may be exhaustive, that is, fully encompassing; however, we do not believe our configuration types or options are exhaustive.) Each configuration type has multiple options associated with it to afford the user flexibility in exploring a number of CAPT configurations. For all configuration types, option 0 always corresponds to the Aroma system's original SPT. Each of the types in Table 1 is described in greater detail in the following sections.

Language-specific configurations, described in Section 3.1, are designed to resolve syntactic ambiguity present in the SPT. For example, in Figure 1, the SPT treats the parenthetical expression (global1 + global2) identically to the parenthetical expression init(global1), whereas the CAPT configuration shown disambiguates these two terms (the first is a parenthesized expression, the second is an argument list). Such a disambiguation may be useful to a code similarity system, as the CAPT representation makes the presence of a function call clearer.

Language-agnostic configurations, described in Section 3.2, can improve code similarity analysis by unbinding overly-specific semantics that may be present in the original SPT structure. For example, in Figure 1, the SPT includes the literal names global1, global2, etc. The CAPT variant, on the other hand, unbinds these names and replaces them with a generic string (#GVAR). This could improve code similarity analysis if the exact token names are irrelevant, and the semantically-salient feature is simply that there is a global variable.

We note that these examples are not universal. One specific CAPT configuration is unlikely to work in all scenarios: sometimes disambiguating parenthetical expressions may help; other times it may hurt. This work seeks to explore and analyze these possible configurations. We provide a formalization and concrete examples of both language-agnostic and language-specific configurations later in this section.
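As a concrete illustration of syntax-unbinding, a global-variable transform like the one shown in Figure 1 might look as follows. This is a sketch over a flat token list; the actual transform operates on CAPT nodes, and the set of global names would come from analyzing the parse tree:

```python
def unbind_globals(tokens, global_names, label="#GVAR"):
    """Replace global-variable names with a generic label (Table 1, option C.2).

    A sketch over a flat token list; in the real system `global_names`
    would be derived from scope analysis of the parse tree.
    """
    return [label if t in global_names else t for t in tokens]

print(unbind_globals(["(", "global1", "+", "global2", ")"],
                     {"global1", "global2"}))
# -> ['(', '#GVAR', '+', '#GVAR', ')']
```

After this transform, two snippets that differ only in the names of their global variables featurize identically, which is exactly the unbinding described above.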

Category           Type                     Options
Language-specific  A. Node Annotations      0. No change (Aroma's original configuration)
                                            1. Annotate all internal nodes
                                            2. Annotate parenthesis nodes (C/C++-specific)
Language-agnostic  B. Compound Statements   0. No change (Aroma's original configuration)
                                            1. Remove all features relevant to compound statements
                                            2. Replace with '{#}'
                   C. Global Variables      0. No change (Aroma's original configuration)
                                            1. Remove all features relevant to global variables
                                            2. Replace with '#GVAR'
                                            3. Replace with '#VAR' (the label for local variables)
                   D. Global Functions      0. No change (Aroma's original configuration)
                                            1. Remove all features relevant to global functions
                                            2. Replace with '#EXFUNC'
Table 1. CAPT Configuration Options.
Figure 3. Language Ambiguity in Simplified Parse Tree.

3.1. Language-Specific Configurations

Language-specific configurations are meant to capture semantic meaning by resolving ambiguity and introducing specificity related to the specific underlying programming language. Intuitively, these configurations can be thought of as syntax-binding: they capture semantic information that is bound to the particular syntactic structure of the program. In some cases, these specifications may capture relevant semantic information; in other cases, they may capture irrelevant details.

Node Annotations.

We define a node annotation as a modification to a tree's node that incorporates more information. Node annotations are generally used to resolve ambiguity caused by language-specific syntax. These ambiguous scenarios tend to arise when certain code constructs and/or operators have been overloaded in a specific language. In such cases, the original SPT structure may be insufficient to properly disambiguate between them, potentially reducing its ability to evaluate semantic code similarity (see Figure 3). CAPT's node annotation options are meant to help resolve this. As we incorporate more language-specific syntax into CAPT nodes, we run the risk of overloading the tree with syntactic details. This could potentially undo the general reasoning behind Aroma's SPT and our CAPT structure. We discuss this in greater detail in Section 3.3.

3.1.1. C and C++ Node Annotations

For our first embodiment of CAPT, we have focused solely on C and C++ programs. We have found that programs in C/C++ present at least two interesting challenges. (We do not claim that these challenges are unique to C/C++; they may be present in other languages as well.)

(Lack of) Compilation.

We have found that, unlike programs in higher-level programming languages (e.g., Python (Van Rossum and Drake, 2009), JavaScript (Flanagan, 2006)), C/C++ programs found "in the wild" tend to not immediately compile from a source repository (e.g., GitHub (Cosentino et al., 2017)). Thus, code similarity analysis may need to be performed without relying on successful compilation.

Many Solutions.

The C and C++ programming languages provide multiple diverse ways to solve the same problem (e.g., searching a list with a for loop vs. using std::find). Because of this, C/C++ enables programmers to create semantically similar programs that are syntactically divergent. In extreme cases, such semantically similar (or identical) solutions may differ in computation time by orders of magnitude (Satish et al., 2012). This requires code similarity techniques to be robust in their ability to identify semantic similarity in the presence of syntactic dissimilarity (i.e., a type 4 code similarity exercise (Roy et al., 2009)).

We believe that analytically deriving the optimal selection of node annotations across all C/C++ code may be untenable. To accommodate this, we currently provide two levels of granularity for C/C++ node annotations in CAPT. (This is still early work, and we expect to identify further refinement options in C/C++ and other languages as the research progresses.)

  • Option 0: original Aroma SPT configuration.

  • Option 1: annotation of all nodes with their language-specific node type.

  • Option 2: annotation of all nodes containing parentheticals with their language-specific node type.

Option 1 corresponds to an extreme case of a concrete syntax embedding (every node contains syntactic information, and all syntactic information is represented in some node). Since such an embedding may "overload" the code similarity system with irrelevant syntactic details, Option 2 can be used to annotate only parentheticals, which we have empirically observed to often have notably divergent semantic meaning depending on context.

An example is shown in Figure 1. In one case the parentheses are applied as a mathematical precedence operator; in the other they denote a function call. If left unresolved, such ambiguity would cause the subtree rooted at node 7 of function f1 to be classified identically to the subtree rooted at node 5 of function f2. The intended purpose of the parenthesis operator is context sensitive and is disambiguated by encoding the contextual information into two distinct node annotations, i.e., the parenthesized expression and the argument list, respectively.
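Under option 2, this disambiguation amounts to tagging parenthesis tokens with their syntactic role. A minimal sketch follows; the role names mimic tree-sitter-style node types, and the `token@role` encoding is our own, not the system's actual annotation format:

```python
def annotate_paren(token, role):
    """Option 2 sketch: annotate parenthesis nodes with their syntactic role.

    `role` would come from the parser (e.g. 'parenthesized_expression' for a
    precedence grouping vs. 'argument_list' for a call). The `token@role`
    encoding is illustrative only.
    """
    return f"{token}@{role}" if token in ("(", ")") else token

print(annotate_paren("(", "parenthesized_expression"))  # precedence grouping
print(annotate_paren("(", "argument_list"))             # function call
print(annotate_paren("y", "identifier"))                # non-parenthesis: unchanged
```

With this annotation in place, the two subtrees from Figure 1 no longer featurize identically, because their parenthesis nodes carry different roles.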

3.2. Language-Agnostic Configurations

Unlike language-specific configurations, language-agnostic configurations are not restricted to the syntax of a specific language. Instead, they are meant to be applied generally across multiple languages. Intuitively, these configurations can be thought of as syntax-unbinding in nature: they generally abstract (or, in some cases, entirely eliminate) syntactic information in an attempt to improve the system's ability to derive semantic meaning from the code.

Compound Statements.

The compound statements configuration is a language-agnostic option that enables the user to control how much non-terminal node information is incorporated into the CAPT. Again, Option 0 corresponds to the original Aroma SPT. Option 1 omits separate features for compound statements altogether. Option 2 does not discriminate between compound statements of different lengths and uses a special label to denote the presence of a compound statement. For example, with this option the for loop construct in C/C++ is represented with a single label instead of three separate labels for the loop initialization, test condition, and increment.

Global Variables.

The global variables configuration specifies the degree of global variable-specific information contained in a CAPT. In addition to Aroma's original configuration (Option 0), which annotates nodes by including the precise global variable name, CAPT provides three additional configurations. Option 1 specifies the extreme case of eliding all information on global variables. Option 2 annotates all global variables with the special label '#GVAR', omitting the names of the global variable identifiers. Option 3 designates global variables with the label '#VAR', rendering them indistinguishable from local variables.

Intuitively, including the precise global variable names (Option 0) may be appropriate if code similarity is being performed on a single code-base, where two references to a global variable with the same name necessarily refer to the same global variable. Options 1 through 3, which remove global variable information to varying degrees, may be appropriate when performing code similarity between unrelated code-bases, where two different global variables named (for example) foo are most likely unrelated.

Global Functions.

The global functions configuration serves the dual purpose of (i) controlling the amount of function-specific information to featurize and (ii) disambiguating between the usage of global functions and global variables in CAPT, a feature that is curiously absent in the original SPT design: the SPT shown in Figure 1 makes no distinction between init (a function) and global1 (a variable). Option 1 removes all features pertaining to global functions. Option 2 annotates all global function references with the special label '#EXFUNC' while eliminating the function identifier. Intuitively, these options behave similarly to the global variable options. Our current prototype, which handles only single C/C++ functions, does not differentiate between external functions. In future work, we plan to investigate CAPT variants that differentiate between local, global, and library functions.

3.3. Discussion

We believe there is no silver-bullet solution for code similarity across all programs and programming languages. Based on this belief, a key intuition of CAPT's design is to provide a representation whose semantic richness derives from program structure, with heavy inspiration from Aroma's SPT, while simultaneously providing a range of customizable parameters to accommodate a wide variety of scenarios. CAPT's language-agnostic and language-specific configurations, and their associated options, allow exploration of a series of tree variants, each differing in the granularity of its abstractions.

For instance, the compound statements configuration provides three levels of abstraction. Option 0 is Aroma’s baseline configuration and is the finest level of abstraction, as it featurizes the number of constituents in a compound statement node. Option 2 reduces compound statements to a single token and represents a slightly higher level of abstraction. Option 1 eliminates all features related to compound statements and is the coarsest level of abstraction. The same trend applies to the global variables and global functions configurations. It is our belief, based on early evidence, that the appropriate level of abstraction in CAPT is likely based on many factors such as (i) code similarity purpose, (ii) programming language expressiveness, and (iii) application domain.

Aroma's original SPT seems to work well for a common code base where global variables have consistent semantics and global functions are standard API calls, also with consistent semantics (e.g., a single code-base). However, for cases outside of such spaces, some questions about applicability arise. For example, assumptions about consistent semantics for global variables and functions may not hold in cases of non-common code-bases or non-standardized global function names (Wulf and Shaw, 1973; Gellenbeck and Cook, 1991; Feitelson et al., 2020). The capacity to differentiate between these cases, and others, is a key motivation for CAPT.

We do not believe that CAPT’s current structure is exhaustive. With this in mind, we have designed CAPT to be extensible, enabling a seamless mechanism to add new configurations and options (described in Section 4). Our intention with this paper is to present initial findings in exploring CAPT’s structure. Based on our early experimental analysis, presented in Section 4, CAPT seems to be a promising research direction for code similarity.

An Important Weakness

While CAPT provides added flexibility over SPT, such flexibility may be misused. With CAPT, system developers are free to add or remove as much syntactic differentiation detail as they choose for a given language or code body. Such overspecification (or underspecification) may result in syntactic overload (or underload), which may reduce code similarity accuracy relative to the original SPT design, as we illustrate in Section 4.

4. Experimental Results

In this section, we discuss our experimental setup and analyze the performance of CAPT compared to Aroma's simplified parse tree (SPT). In Section 4.1, we describe the code corpus used with CAPT, which includes hundreds of unique programming solutions for each of 104 different programming problems. In Section 4.2, we explain the dataset grouping and enumeration for our experiments. We also discuss the metrics used to quantitatively rank the different CAPT configurations and those chosen for evaluation of code similarity. Section 4.3 demonstrates that a code similarity system built using CAPT (i) has a greater frequency of improved accuracy across the problems and (ii) is, on average, more accurate than SPT. For completeness, we also include cases where CAPT configurations perform poorly.

4.1. Dataset

Our experiments use the POJ-104 dataset, which is the result of educationally-inspired programming questions and consists of student-written programs for 104 problems (Mou et al., 2016). Each problem has 500 unique solutions written in C/C++, each of which has been validated for correctness. We categorize all solutions for a given POJ-104 problem as being in the same semantic similarity equivalence class. We make no claims about the semantic similarity or dissimilarity of solutions to two or more different POJ-104 problems.

Using this approach, we treat the problem of code similarity analysis as a classification problem. We classify two programs as semantically similar if they originate from the same equivalence class (i.e., the same POJ-104 problem). Using this approach, the labels for these classifications can be implicitly lifted using the problem’s unique identifier.
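Lifting labels from problem identifiers can be sketched as follows; the pairing includes a solution paired with itself, matching the pairing described in Section 4.2, and the function name is our own:

```python
from itertools import combinations_with_replacement

def labeled_pairs(solutions):
    """solutions: list of (problem_id, code). Yields (code_a, code_b, label),
    where label is True iff both solutions come from the same POJ-104 problem
    (the same equivalence class). Self-pairs are included."""
    for (pa, ca), (pb, cb) in combinations_with_replacement(solutions, 2):
        yield ca, cb, pa == pb

pairs = list(labeled_pairs([(1, "s1"), (1, "s2"), (2, "s3")]))
print(len(pairs))                            # 3 solutions -> 6 pairs
print(sum(label for _, _, label in pairs))   # 4 pairs labeled similar
```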

Eliminated Programs From POJ-104.

Some of the coding solutions in the POJ-104 dataset were marked as illegal by the parser we used (Tree-sitter). After investigating, we found this to be due to the code using non-standard coding conventions (e.g., unspecified return types, lack of semicolons at the end of structure definitions, etc.). Because they could not be properly parsed, we pruned them from the dataset. We also eliminated all of the solutions that we could find that had hard-coded answers to problems. (Unfortunately, due to the size of the dataset, we cannot guarantee that all such programs were eliminated.) The resulting dataset consists of 48,610 programming solutions, with 370 to 499 uniquely coded solutions per problem.

(a) Breakdown of the Number of Groups with AP Greater or Less than SPT.
(b) Average Precision for the Group Containing the Best Case.
(c) Mean of Average Precision Over All Program Groups.
Figure 4. Comparison of CAPT and SPT. The blue bars in (a) and (b), and all the bars in (c), from left to right, correspond to the best two, the median, and the worst two CAPT configurations, ranked by the metric displayed in each subfigure.

4.2. Experimental Setup

In this section, we describe our experimental setup. At the highest level, we compare the performance of various configurations of CAPT to Aroma's SPT. The list of possible CAPT configurations is shown in Table 1.

Problem Group Selection.

Given that POJ-104 consists of 104 unique problems and 48,610 programs, depending on how we analyze the data, we might face intractability problems in both computational and combinatorial complexity. With this in mind, our initial approach is to construct 1000 sets of five unique, pseudo-randomly selected problems for code similarity analysis. Using this approach, we evaluate every configuration of CAPT and Aroma’s original SPT on each pair of solutions for each problem set. We then aggregate the results across all the groups to estimate their overall performance. While this approach is not exhaustive of possible combinations (in set size or set combinations), we aim for it to be a reasonable starting point. As our research with CAPT matures, we plan to explore a broader variety of set sizes and a more exhaustive number of combinations.
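The group construction can be sketched as follows; the use of Python's random module, the fixed seed, and the function name are our own choices for illustration, not the paper's exact sampling procedure:

```python
import random

def sample_groups(problem_ids, n_groups=1000, group_size=5, seed=0):
    """Pseudo-randomly draw groups of unique problems, as in Section 4.2.
    Each group contains `group_size` distinct problem IDs."""
    rng = random.Random(seed)
    return [rng.sample(problem_ids, group_size) for _ in range(n_groups)]

groups = sample_groups(list(range(1, 105)))  # POJ-104 problem IDs 1..104
print(len(groups))              # 1000 groups
print(len(set(groups[0])))      # 5 unique problems in each group
```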

Code Similarity Performance Evaluation.

For each problem group, we exhaustively calculate code similarity scores for all unique solution pairs, including pairs constructed from the same program solution (i.e., a program compared to itself). We use G to refer to the set of groups and g to indicate a particular group in G. We denote |G| as the number of groups in G (i.e., its cardinality) and |g| as the number of solutions in group g. For a group g with |g| = n, the total number of unique program pairs in g (denoted P_g) is n(n+1)/2, since self-pairs are included.

To compute the similarity score of a solution pair, we use Aroma’s approach. This includes calculating the dot product of two feature vectors (i.e., a program pair), each of which is generated from a CAPT or SPT structure. The larger the magnitude of the dot product, the greater the similarity.
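The scoring step can be sketched with sparse feature vectors; the bag-of-features representation below is a simplification of the actual featurization, used only to make the dot product concrete:

```python
from collections import Counter

def similarity(u, v):
    """Dot product of two sparse feature vectors (Counter: feature -> count),
    as in Aroma's scoring. Larger values indicate greater similarity."""
    return sum(count * v.get(feat, 0) for feat, count in u.items())

# Toy feature bags for two snippets (feature sets are illustrative):
a = Counter(["#VAR", "=", "#VAR", "+"])
b = Counter(["#VAR", "=", "#VAR", "-"])
print(similarity(a, b))  # 2*2 (#VAR) + 1*1 (=) = 5
```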

We evaluate the quality of the recommendation based on average precision. Precision is the ratio of true positives to the sum of true positives and false positives. Here, true positives denote solution pairs correctly classified as similar and false positives refer to solution pairs incorrectly classified as similar. Recall is the ratio of true positives to the sum of true positives and false negatives, where false negatives are solution pairs incorrectly classified as different. As we monotonically increase the threshold from the minimum value to the maximum value, precision generally increases while recall generally decreases. The average precision (AP) summarizes the performance of a binary classifier under different thresholds for categorizing whether the solutions are from the same equivalence class (i.e., the same POJ-104 problem) (Liu, 2009). AP is calculated using the following formula over all thresholds.

  1. All unique values of the similarity scores for the solution pairs are gathered and sorted in descending order. Let T be the number of unique scores and (t_1, ..., t_T) be the sorted list of such scores.

  2. For each i in {1, ..., T}, the precision P_i and recall R_i of the classifier with threshold t_i are computed.

  3. Let R_0 = 0. The average precision is computed as: AP = sum_{i=1}^{T} (R_i - R_{i-1}) * P_i.
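These steps can be sketched directly; this is a straightforward implementation of the standard AP definition, not the paper's exact code:

```python
def average_precision(scores, labels):
    """Average precision over all score thresholds.

    scores: similarity score per solution pair; labels: True iff the pair is
    from the same equivalence class. Assumes at least one positive pair.
    """
    thresholds = sorted(set(scores), reverse=True)        # step 1
    positives = sum(labels)
    ap, prev_recall = 0.0, 0.0
    for t in thresholds:                                  # step 2
        tp = sum(s >= t and l for s, l in zip(scores, labels))
        fp = sum(s >= t and not l for s, l in zip(scores, labels))
        precision = tp / (tp + fp)
        recall = tp / positives
        ap += (recall - prev_recall) * precision          # step 3
        prev_recall = recall
    return ap

print(average_precision([0.9, 0.8, 0.7], [True, False, True]))  # ~0.833
```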

Configuration Identifier.

In the following sections, we refer to a configuration of CAPT by its unique identifier (ID). A configuration ID is formatted as A-B-C-D. Each of the four letters corresponds to a configuration type in the second column of Table 1, and will be replaced by an option number specified in the third column of the table. Configuration 0-0-0-0 corresponds to Aroma’s SPT.
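A configuration ID maps onto the four types of Table 1 as follows; the helper and its field names are our own, paraphrased from the table:

```python
def parse_config_id(config_id):
    """Split an 'A-B-C-D' CAPT configuration ID into its four option numbers."""
    a, b, c, d = (int(part) for part in config_id.split("-"))
    return {"node_annotations": a, "compound_statements": b,
            "global_variables": c, "global_functions": d}

print(parse_config_id("0-0-0-0"))  # Aroma's SPT baseline: all options 0
print(parse_config_id("2-1-2-2"))
```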

4.3. Results

Figure 3(a) depicts the number of problem groups where a particular CAPT variant performed better (blue) or worse (orange) than SPT. For example, the CAPT configuration 2-0-0-0 outperformed SPT in 774 of the 1000 problem groups and underperformed in 226; that is, it outperformed SPT in 54.8% more of the problem groups than it underperformed. Figure 3(a) shows the two best (2-0-0-0 and 2-1-2-2), the median (1-1-0-1), and the two worst (1-0-1-0 and 1-2-1-0) configurations with respect to SPT. Configuration 2-1-2-2 exercises, in every category, an option that is not tunable in SPT. It performs better than SPT on 695 of the 1000 problem groups, i.e., on 69.5% of the problem groups. Although it is less performant than the 2-0-0-0 configuration, it exercises all of CAPT's unique tunable parameters. We speculate that these configuration results may vary based on programming language, domain, and problem type, among other parameters.

Figure 3(b) shows the group containing the problems for which CAPT achieved the best performance relative to SPT, among all 1000 problem groups. In other words, Figure 3(b) shows the performance of SPT and CAPT for the single problem group with the greatest difference between a CAPT configuration and SPT. In this group, two CAPT configurations achieve the maximum improvement of more than 30% over SPT. We note that, since we tested 108 CAPT configurations across 1000 different problem groups, there is a reasonable chance of observing such a large difference even if CAPT performed identically to SPT in expectation. We do not intend for this result to demonstrate statistical significance, but simply to illustrate the outcome of our experiments.

Figure 3(c) compares the mean of AP over all 1000 problem groups. In it, the blue bars, moving left to right, depict the CAPT configurations that are (i) the two best, (ii) the median, and (iii) the two worst in terms of average precision. Aroma’s baseline SPT configuration is highlighted in orange. The best two CAPT configurations show an average improvement of more than 1% over SPT, while the others degraded performance relative to the baseline SPT configuration.

These results illustrate that certain CAPT configurations can outperform the SPT on average by a small margin, and can outperform the SPT on specific problem groups by a large margin. However, we also note that choosing a good CAPT configuration for a domain is essential. We leave automating this configuration selection to future work.

4.3.1. Analysis of Configurations

Figures 4(a)-4(d) serve to illustrate the performance variation for individual configurations. Figure 4(a) shows the effect of varying the options for the node annotation configuration. Applying the annotations for the parentheses operator (option 2) results in the best overall performance while annotating every internal node (option 1) results in a concrete syntax tree and the worst overall performance. This underscores the trade-offs in incorporating syntax-binding transformations in CAPT. In Figure 4(b) we observe that removing all features relevant to compound statements (option 1) leads to the best overall performance when compared with other options. This indicates that adding separate features for compound statements obscures the code’s intended semantics when the constituent statements are also individually featurized.

Figure 4. The Distributions of Performance for Configurations with a Fixed Option Type: (a) Node Annotations, (b) Compound Statements, (c) Global Variables, (d) Global Functions.

Figure 4(c) shows that removing all features relevant to global variables (option 1) degrades performance. We also observe that eliminating the global variable identifiers and assigning a tag to signal their presence (option 2) performs best overall, possibly because global variables appearing in similar contexts may not use the same variable identifiers. Further, option 2 performs better than the case where global variables are indistinguishable from local variables (option 3). Figure 4(d) indicates that removing features relevant to identifiers of global functions, but flagging their presence with a special tag as done in option 2, generally gives the best performance. This result is consistent with the intuitions for eliminating features of function identifiers in CAPT as discussed in Section 3.3.

A Subtle Observation.

A more nuanced and subtle observation is that our results seem to indicate that the optimal granularity of abstraction detail differs across configuration types. For compound statements, the best option seems to correspond to the coarsest level of abstraction detail, while for node annotations, global variables, and global functions, the best option seems to correspond to one of the intermediate levels of abstraction detail. In future work, we aim to perform a deeper analysis of this behavior and, hopefully, learn such configurations automatically, to reduce (or eliminate) the overhead of manually discovering them.

5. Related Work

Research into code representations in the space of code similarity is still in its infancy; yet there is a large and growing body of work to consider. A classical approach to inferring code semantics is to utilize an intermediate representation (IR), such as LLVM (Lattner and Adve, 2004). However, these representations were originally designed with the purpose of mapping efficiently to low-level instruction set architectures (ISAs). As such, they might not be ideal candidates for code similarity.

Still, advances in ML seem to have stimulated a number of approaches (Ben-Nun et al., 2018; Zhao and Huang, 2018) that rely on such IRs to infer high-level semantics for the purpose of code similarity. Nevertheless, such approaches suffer from the disadvantage of requiring compilation to determine the validity of the input code, hence limiting their applicability. Other research avoids this reliance on compilation by representing a program using its dynamic execution trace (Wang et al., 2018). While such approaches enable the encoding of concrete details of the program semantics, the collection of dynamic traces can be costly (program execution is required).

The idea of utilizing the compiler’s IR has been extended further by more recent ML-based approaches (Alon et al., 2019, 2019; Zhang et al., 2019) that use an abstract syntax tree (AST), which is at a higher level of abstraction than some IRs, such as LLVM. These AST approaches tend to rely on featurizing the AST to include some of its meta-properties, such as paths in the AST, to discover structural similarities. A key intuition for these approaches is that structural similarity of an AST may correlate with code similarity. Other recent research has focused on constructing code representations from raw source code tokens or sequences (Sachdev et al., 2018; Sajnani et al., 2016) with some success. However, these rely on certain strict assumptions about the input code and might be challenging to generalize.

To our knowledge, Aroma’s simplified parse tree (SPT) represents a state-of-the-art structural representation for code similarity (Luan et al., 2019). Aroma extends all the aforementioned approaches in that it uses a customized parse tree representation, SPT, which encapsulates high-level code semantics. The SPT is at a higher level of abstraction than previous AST-based approaches as it avoids representing irrelevant syntactic information. Our work is inspired by Aroma’s SPT and aims to take it one step further by allowing for systematic exploration of a range of customizable configuration parameters that control CAPT’s construction.

6. Future Work and Conclusion

In this paper, we presented the context-aware parse tree (CAPT), a novel tree structure that we have developed principally for the purpose of code similarity analysis. CAPT is heavily inspired by Aroma’s simplified parse tree (SPT).

Our research quantitatively demonstrates the value of our proposed semantically-salient features, enabling a specific CAPT configuration to be 39% more accurate than SPT across the 48,610 programs we analyzed. We believe CAPT is able to produce improved code similarity accuracy because it provides a more flexible semantic configuration across (i) language-specific ambiguity resolution and (ii) unbinding support via language-agnostic techniques for removal of syntactic features that are semantically irrelevant. Our exploration into CAPT is still in its infancy.


  • M. B. S. Ahmad, J. Ragan-Kelley, A. Cheung, and S. Kamil (2019) Automatically Translating Image Processing Libraries to Halide. ACM Trans. Graph. 38 (6). External Links: Document, ISSN 0730-0301, Link Cited by: §1.
  • A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman (2006) Compilers: principles, techniques, and tools (2nd edition). Addison-Wesley Longman Publishing Co., Inc., USA. External Links: ISBN 0321486811 Cited by: §1.
  • M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton (2018) A survey of machine learning for big code and naturalness. ACM Computing Surveys 51 (4). Cited by: §1.
  • M. Allamanis, M. Brockschmidt, and M. Khademi (2018) Learning to represent programs with graphs. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • U. Alon, O. Levy, and E. Yahav (2019) Code2seq: generating sequences from structured representations of code. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2018) A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, New York, NY, USA, pp. 404–419. External Links: ISBN 9781450356985, Link, Document Cited by: §1.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019) code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3 (POPL), pp. 40:1–40:29. External Links: Document, ISSN 2475-1421, Link Cited by: §1, §5.
  • J. Bader, A. Scott, M. Pradel, and S. Chandra (2019) Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. 3 (OOPSLA). External Links: Link, Document Cited by: §1.
  • S. Barman, S. Chasins, R. Bodik, and S. Gulwani (2016) Ringer: web automation by demonstration. SIGPLAN Not. 51 (10), pp. 748–764. External Links: ISSN 0362-1340, Link, Document Cited by: §1.
  • T. Ben-Nun, A. S. Jakobovits, and T. Hoefler (2018) Neural code comprehension: a learnable representation of code semantics. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 3585–3597. Cited by: §1, §5.
  • S. Bhatia, P. Kohli, and R. Singh (2018) Neuro-symbolic program corrector for introductory programming assignments. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 60–70. External Links: ISBN 9781450356381, Link, Document Cited by: §1.
  • V. Cosentino, J. L. Cánovas Izquierdo, and J. Cabot (2017) A Systematic Mapping Study of Software Development With GitHub. IEEE Access 5, pp. 7173–7192. External Links: Document, ISSN 2169-3536 Cited by: §3.1.1.
  • E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang (2020) Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • D. Feitelson, A. Mizrahi, N. Noy, A. Ben Shabat, O. Eliyahu, and R. Sheffer (2020) How developers choose names. IEEE Transactions on Software Engineering, pp. 1–1. External Links: Document, ISSN 2326-3881 Cited by: §3.3.
  • D. Flanagan (2006) JavaScript: the definitive guide. ”O’Reilly Media, Inc.”. Cited by: §3.1.1.
  • E. M. Gellenbeck and C. R. Cook (1991) An investigation of procedure and variable names as beacons during program comprehension. In Empirical studies of programmers: Fourth workshop, pp. 65–81. Cited by: §3.3.
  • P. Ginsbach, T. Remmelg, M. Steuwer, B. Bodin, C. Dubach, and M. F. P. O’Boyle (2018) Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’18, New York, NY, USA, pp. 139–153. External Links: Document, ISBN 9781450349116, Link Cited by: §1.
  • J. Gottschlich, A. Solar-Lezama, N. Tatbul, M. Carbin, M. Rinard, R. Barzilay, S. Amarasinghe, J. B. Tenenbaum, and T. Mattson (2018) The Three Pillars of Machine Programming. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, New York, NY, USA, pp. 69–80. External Links: Document, ISBN 978-1-4503-5834-7, Link Cited by: §1, §1.
  • S. Kamil, A. Cheung, S. Itzhaky, and A. Solar-Lezama (2016) Verified Lifting of Stencil Computations. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’16, New York, NY, USA, pp. 711–726. External Links: Document, ISBN 978-1-4503-4261-2, Link Cited by: §1.
  • C. Lattner and V. Adve (2004) LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO ’04, USA, pp. 75. External Links: ISBN 0769521029 Cited by: §5.
  • S. Lipschutz (1968) Schaum’s outline of theory and problems of linear algebra. McGraw-Hill, New York. External Links: ISBN 0070379890 Cited by: §2.
  • T. Liu (2009) Learning to rank for information retrieval. Found. Trends Inf. Retr. 3 (3), pp. 225–331. External Links: ISSN 1554-0669, Link, Document Cited by: §4.2.
  • S. Luan, D. Yang, C. Barnaby, K. Sen, and S. Chandra (2019) Aroma: Code Recommendation via Structural Code Search. Proc. ACM Program. Lang. 3 (OOPSLA), pp. 152:1–152:28. External Links: Document, ISSN 2475-1421, Link Cited by: §1, §1, §1, §1, §2, §5.
  • L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin (2016) Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI ’16, pp. 1287–1293. Cited by: §4.1.
  • A. Odena and C. Sutton (2020) Learning to Represent Programs with Property Signatures. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • M. Pradel and K. Sen (2018) DeepBugs: a learning approach to name-based bug detection. Proc. ACM Program. Lang. 2 (OOPSLA). External Links: Link, Document Cited by: §1.
  • C. K. Roy, J. R. Cordy, and R. Koschke (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74 (7), pp. 470–495. External Links: ISSN 0167-6423, Link, Document Cited by: §1, §3.1.1.
  • S. Sachdev, H. Li, S. Luan, S. Kim, K. Sen, and S. Chandra (2018) Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, New York, NY, USA, pp. 31–41. External Links: ISBN 9781450358347, Link, Document Cited by: §5.
  • H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes (2016) SourcererCC: scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, New York, NY, USA, pp. 1157–1168. External Links: ISBN 9781450339001, Link, Document Cited by: §5.
  • N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey (2012) Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications?. In 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp. 440–451. External Links: Document, ISSN 1063-6897 Cited by: §3.1.1.
  • A. Singhal (2001) Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24, pp. 35–43. Cited by: §2.
  • M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk (2018) Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, New York, NY, USA, pp. 542–553. External Links: ISBN 9781450357166, Link, Document Cited by: §1.
  • G. Van Rossum and F. L. Drake (2009) Python 3 reference manual. CreateSpace, Scotts Valley, CA. External Links: ISBN 1441412697 Cited by: §3.1.1.
  • K. Wang, Z. Su, and R. Singh (2018) Dynamic neural program embeddings for program repair. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • W. Wulf and M. Shaw (1973) Global variable considered harmful. SIGPLAN Not. 8 (2), pp. 28–34. External Links: ISSN 0362-1340, Link, Document Cited by: §3.3.
  • J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu (2019) A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, pp. 783–794. External Links: Link, Document Cited by: §1, §5.
  • G. Zhao and J. Huang (2018) DeepSim: Deep Learning Code Functional Similarity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, New York, NY, USA, pp. 141–151. External Links: ISBN 9781450355735, Link, Document Cited by: §2, §5.