MISIM: An End-to-End Neural Code Similarity System

by Fangke Ye, et al.
Georgia Institute of Technology

Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).




1 Introduction

The field of machine programming (MP) is concerned with the automation of software development Gottschlich et al. (2018). In recent years, there has been an emergence of many MP systems, due, in part, to advances in machine learning, formal methods, data availability, and computing efficiency Luan et al. (2019); Odena and Sutton (2020); Ben-Nun et al. (2018); Alon et al. (2018, 2019); Tufano et al. (2018); Zhang et al. (2019); Allamanis et al. (2018); Cosentino et al. (2017). One open challenge in MP is the construction of accurate code similarity systems, which generally try to determine whether two code snippets are semantically similar (i.e., have similar characteristics under some analysis). Accurate code similarity systems may assist in many programming tasks, from code recommendation systems that improve development productivity to automated bug detection and mitigation systems that improve debugging productivity Dinella et al. (2020); Luan et al. (2019); Allamanis et al. (2018); Pradel and Sen (2018); Bhatia et al. (2018); Bader et al. (2019); Barman et al. (2016). Yet, as others have noted, code similarity systems tend to contain many complex components, and the selection of even the most basic components, such as the structural representation of code, remains unclear Luan et al. (2019); Odena and Sutton (2020); Ben-Nun et al. (2018); Alon et al. (2018, 2019); Tufano et al. (2018); Zhang et al. (2019); Allamanis et al. (2018).

In this paper, we attempt to address some of the open questions around code similarity with our novel end-to-end code similarity system called Machine Inferred Code Similarity (MISIM). We principally focus on two main novelties of MISIM and how they may improve code similarity analysis: (i) its structural representation of code, called context-aware semantic structure (CASS), and (ii) its neural-based learned code similarity scoring algorithm. These components can be used individually or together as we have chosen to do.

This paper makes the following technical contributions:


  • We present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system.

  • We present the context-aware semantic structure (CASS), a structural representation of code designed specifically to lift semantic meaning from code syntax.

  • We present a novel neural network-driven architecture to learn code similarity scoring, which uses pairs of code as input.

  • We compare MISIM to three other state-of-the-art code similarity systems, code2vec Alon et al. (2019), Neural Code Comprehension Ben-Nun et al. (2018), and Aroma Luan et al. (2019). Our experimental evaluation, across 45,780 programs, shows that MISIM outperforms these systems, often by a large factor (upwards of 40.6x).

2 Related Work

Although research in the space of code similarity is still in the early stages, there is a growing body of exploratory work in this area Alon et al. (2019); Luan et al. (2019); Ben-Nun et al. (2018); Kim et al. (2018); Liu et al. (2018). In this section, we briefly examine three state-of-the-art systems in this domain: code2vec Alon et al. (2019), Neural Code Comprehension Ben-Nun et al. (2018), and the Aroma system Luan et al. (2019). In Section 4, we perform an experimental analysis of MISIM compared to each of these systems.

code2vec. A core goal of code2vec Alon et al. (2019) is to learn a code embedding for representing snippets of code. The code embedding can be an enabler for automating various programming related tasks including code similarity. A code embedding is trained through the task of semantic labeling of code snippets (i.e., predicting the function name for a function body). As input, code2vec uses abstract syntax tree (AST) paths to represent a code snippet. The AST is a tree structure that represents the syntactic information of the source code Baxter et al. (1998). code2vec incorporates an attention-based neural network to automatically identify AST paths that are more relevant to deriving code semantics.

Neural Code Comprehension (NCC). NCC Ben-Nun et al. (2018) attempts to learn code semantics based on an intermediate representation (IR) Lattner and Adve (2004) of the code. NCC processes source code at the IR level in an attempt to extract additional semantic meaning. It transforms IR into a contextual flow graph (XFG) that incorporates both data- and control-flow of the code, and uses XFG to train a neural network for learning a code embedding. One of the constraints of NCC is that its IR requires code compilation. The MISIM system does not use an IR and thus does not have this requirement, which may be helpful in some settings (e.g., live programming environments).

Aroma. Aroma Luan et al. (2019)

is a code recommendation system that takes a partially-written code snippet and recommends extensions for it. The intuition behind Aroma is that programmers often write code that may have already been written. Aroma leverages a code base of functions and quickly recommends extensions, which may improve programmer productivity. One of the core code similarity components of the Aroma system is the simplified parse tree (SPT), a tree structure that represents a code snippet. Unlike an AST, an SPT is not language-specific and thus allows code similarity comparison across various programming languages. To compute the similarity score of two code snippets, Aroma extracts binary feature vectors from their SPTs and calculates their dot product.
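Aroma's featurize-and-dot-product scoring can be sketched as follows. This is a minimal illustration, not Aroma's implementation: the feature strings below are hypothetical stand-ins for the structural features Aroma extracts from SPTs.

```python
# Sketch: encode each snippet's feature set as a binary vector over a
# shared vocabulary, then score similarity by dot product (i.e., the
# number of shared features).
def binary_vector(features, vocabulary):
    """1 if the feature is present in the snippet, else 0."""
    return [1 if f in features else 0 for f in vocabulary]

def dot_similarity(vec_a, vec_b):
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical features extracted from two snippets' parse trees.
snippet_a = {"for-loop", "int-decl", "call:printf"}
snippet_b = {"for-loop", "int-decl", "call:scanf"}
vocab = sorted(snippet_a | snippet_b)

score = dot_similarity(binary_vector(snippet_a, vocab),
                       binary_vector(snippet_b, vocab))
print(score)  # 2 shared features
```

The dot product of binary vectors counts feature overlap, which is why the later Aroma-Cos variant (cosine similarity) differs only in normalizing by vector lengths.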

3 MISIM System

Figure 1 shows an overview of MISIM, our end-to-end code similarity system. A core component of MISIM is the novel context-aware semantic structure (CASS), which aims to capture semantically salient properties of the input code. CASS is also context-aware, as it can capture information that describes the context of the code (e.g., whether a term is a function call or an operation). Once these CASSes are constructed, they are vectorized and used as input to a neural network, which produces a feature vector. Once a feature vector is generated, a code similarity measurement (e.g., vector dot product Lipschutz (1968), cosine similarity Baeza-Yates and Ribeiro-Neto (1999)) calculates the similarity score between the input program and any other program that has undergone the same CASS transformation process.

Figure 1: Overview of the MISIM System.

3.1 Context-Aware Semantic Structure

For this paper, we use complete C and C++ programs as input to MISIM. Each program may consist of multiple functions. We have found that, unlike programs in higher-level programming languages (e.g., Python Van Rossum and Drake (2009), JavaScript Flanagan (2006)), C/C++ programs found “in the wild” may not be well-formed nor exhaustively include all of their dependencies. As a byproduct, such programs might not be compilable. Thus, we believe it may be useful for code similarity systems to not require successful compilation as a core property of the code they analyze. We have designed CASS with this in mind, such that it does not require compilable code. We define CASS formally in Definition 1.

Definition 1 (Context-aware semantic structure (CASS))

A CASS consists of one or more CASS trees and an optional global attributes table (GAT). A CASS tree is a collection of nodes N and edges E, denoted T = (N, E). Each edge e ∈ E is directed from a parent node p to a child node c, written e = (p, c), where p, c ∈ N. The root node r ∈ N of the tree signifies the beginning of the code snippet and has no parent node. A child node is either an internal node or a leaf node. An internal node has at least one child node, while a leaf node has no child nodes. A CASS tree can be empty, in which case it has no nodes. The CASS GAT contains exactly one entry per unique function definition in the code snippet. A GAT entry includes the input and output cardinality values for the corresponding function.
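Definition 1 can be rendered as a small set of containers. This is only a sketch: the field names (`label`, `prefix`, `children`, `gat`) and the GAT entry layout as an (input cardinality, output cardinality) pair are our own assumptions; the paper does not prescribe an implementation.

```python
# Sketch of the CASS containers from Definition 1 and the node prefix
# label from Definition 3. Field names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CassNode:
    label: str
    prefix: str = ""            # optional node prefix label (Definition 3)
    children: list = field(default_factory=list)

@dataclass
class CassTree:
    root: Optional[CassNode] = None  # an empty tree has no nodes

@dataclass
class Cass:
    trees: list                               # one or more CASS trees
    gat: dict = field(default_factory=dict)   # function name -> (inputs, outputs)

# One tree for a function `sum` taking one input and returning one value.
root = CassNode("int $ ;", children=[CassNode("#VAR")])
cass = Cass(trees=[CassTree(root)], gat={"sum": (1, 1)})
print(cass.gat["sum"])  # (1, 1)
```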

Definition 2 (Node labels)

Every CASS node n has an associated label, label(n). During the construction of a CASS tree, the program tokens at each node are mapped to the node's corresponding label. This is depicted with an expression grammar for node labels below. (Note: the expression grammar we provide is non-exhaustive due to space limitations. The complete set of standard C/C++ tokens, and of binary and unary operators, is collectively denoted in shorthand as ‘…’.)

<bin-op> ::= ‘+’ | ‘-’ | ‘*’ | ‘/’ | …

<unary-op> ::= ‘++’ | ‘--’ | …

<leaf-node-label> ::= "LITERAL" | "IDENT" | ‘#VAR’ | ‘#GVAR’ | ‘#EXFUNC’ | ‘#LIT’ | …

<exp> ::= ‘$’ | ‘$’ <bin-op> ‘$’ | <unary-op> ‘$’ | …

<internal-node-label> ::= ‘for’ ‘(’ <exp> ‘;’ <exp> ‘;’ <exp> ‘)’ <exp> ‘;’ | ‘int’ <exp> ‘;’ | ‘return’ <exp> ‘;’ | <exp> | …

Definition 3 (Node prefix label)

A node prefix label is a string prefixed to a node label. A node prefix label may or may not be present.

In this section, we describe the fundamental design of CASS, emphasizing its configurable options for capturing semantically salient code properties. A simple example of this is illustrated in Figure 2, where a function and its corresponding CASS tree and GAT are shown (defined in Definition 1).

Fundamentally, a CASS tree is the result of transforming a concrete syntax tree in specific ways according to a configuration, which includes options that give CASS a greater degree of flexibility. Different configurations may result in better accuracy in some domains and worse in others.

Figure 2: A Summation Function and One Variant of Its Context-Aware Semantic Structure.

CASS Configuration Categories.

Here, we briefly describe the intuition behind CASS configurations. In general, these configurations can be broadly classified into two categories:

language-specific configurations and language-agnostic configurations.

Language-specific configurations (LSCs). LSCs are designed to resolve syntactic ambiguity present in the concrete syntax tree. For example, the parentheses operator is overloaded in many programming languages to enforce the order of evaluation of operands in an expression as well as to enclose a list of function arguments. CASS disambiguates these two terms by explicitly embedding this context-specific information in the CASS tree nodes (the first is a parenthesized expression and the second is an argument list), using the node prefix label described in Definition 3. Such a disambiguation could be useful to a code similarity system, as it makes the presence of a function call clearer.

Language-agnostic configurations (LACs). LACs can improve code similarity analysis by unbinding overly-specific semantics that may be present in the original concrete syntax tree structure. For example, in Figure 2, a standard concrete syntax tree construction might include the variable names sum, i, etc. While they are necessary for the compilation of the code, they might be extraneous to the detection of code similarity. Some CASS variants, on the other hand, unbind these names and replace them with a generic string (#VAR). This could improve code similarity analysis if the exact identifier names are irrelevant, and the semantically salient feature is simply that there is a variable in that code context. CASS includes other language-agnostic configurations for global variables and global functions, which eliminate the featurization of function names and distinguish global function references from global variable references. Additionally, CASS provides a language-agnostic configuration called compound statements to control whether the number of constituent statements of a compound statement is reflected in its label.
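The unbinding idea can be illustrated with a toy token-level transformation. Real CASS construction operates on a concrete syntax tree, not a token stream, and the keyword list here is a small assumed subset; this sketch only shows how identifier names are replaced with the generic ‘#VAR’ string.

```python
# Toy illustration of a language-agnostic configuration: replace local
# variable names with '#VAR' so that programs differing only in naming
# featurize identically.
import re

KEYWORDS = {"int", "for", "return"}  # assumed, tiny subset of C/C++ keywords

def unbind_identifiers(tokens):
    out = []
    for tok in tokens:
        if tok not in KEYWORDS and re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("#VAR")
        else:
            out.append(tok)
    return out

tokens = ["int", "sum", "=", "0", ";"]
print(unbind_identifiers(tokens))  # ['int', '#VAR', '=', '0', ';']
```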

We have found that the specific context in which code similarity is performed seems to provide some indication of the optimal specificity of the CASS configuration. In other words, one specific CASS configuration is unlikely to work in all scenarios. Sometimes disambiguating parenthetical expressions may be helpful (e.g., divergent behavior of parentheticals, such as mathematical ordering and function calls). Other times it may not (e.g., convergent behavior of parentheticals, such as function initializers and class constructors). To address this, CASS provides a number of options to control the language-specific and/or language-agnostic configurations to enable tailored CASS representations for particular application contexts. The current exhaustive list of CASS configurations and their associated options, as well as their experimental evaluation, are presented in Appendix A.

3.2 Neural Scoring Algorithm

MISIM’s neural scoring algorithm aims to compute the similarity score of two input programs. The algorithm consists of two phases. The first phase involves a neural network model that maps a featurized CASS to a real-valued code vector. The second phase generates a similarity score between a pair of code vectors using a similarity metric. (For this work, we have chosen cosine similarity as the similarity metric used within MISIM.) For the remainder of this section, we describe the details of the scoring model, its training strategy, and other neural network model choices.

3.2.1 Model

We investigated three neural network approaches for MISIM’s scoring algorithm: (i) a graph neural network (GNN) Zhou et al. (2018), (ii) a recurrent neural network (RNN), and (iii) a bag of manual features (BoF) neural network. We name these models MISIM-GNN, MISIM-RNN, and MISIM-BoF, respectively. MISIM-GNN performs the best overall in our experiments; therefore, we describe it in detail in this section. Details of the MISIM-RNN and MISIM-BoF models can be found in Appendix B.

MISIM-GNN. MISIM-GNN’s architecture is shown in Figure 3. For this approach, an input program’s CASS representation is transformed into a graph. Then, each node in the graph is embedded into a trainable vector, serving as the node’s initial state. Next, a GNN is used to update each node’s state iteratively. Finally, a global readout function is applied to extract a vector representation of the entire graph from the nodes’ final states. We describe each of these steps in more detail below.

Figure 3: MISIM-GNN Architecture.

Input Graph Construction. We represent each program as a single CASS instance. Each instance can contain one or more CASS trees, where each tree corresponds to a unique function of the program. The CASS instance is converted into a single graph representation to serve as the input to the model. The graph is constructed by first transforming each CASS tree and its GAT entry into an individual graph. These graphs are then merged into a single (disjoint) graph (i.e., the set of nodes/edges of the merged graph is the union of the node/edge sets of the individual graphs). For a CASS consisting of a CASS tree T and a GAT entry g, we transform it into a directed graph G = (V, R, E), where V is the set of graph nodes, R is the set of edge types, and E is the set of typed graph edges. The two edge types represent edges from CASS tree nodes to their child nodes and to their parent nodes, respectively.
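The disjoint-union merge can be sketched as follows, under the assumption that nodes are integer ids and edges are (source, type, destination) triples; the paper does not specify this encoding.

```python
# Sketch: merge per-function graphs into one disjoint graph by offsetting
# node ids, so the merged node/edge sets are the unions of the parts.
# Edge types: 0 = parent->child, 1 = child->parent.
def merge_graphs(graphs):
    """graphs: list of (num_nodes, edges) with edges as (src, type, dst)."""
    merged_edges, offset = [], 0
    for num_nodes, edges in graphs:
        for src, etype, dst in edges:
            merged_edges.append((src + offset, etype, dst + offset))
        offset += num_nodes
    return offset, merged_edges  # total node count, merged edge list

g1 = (2, [(0, 0, 1), (1, 1, 0)])   # a two-node tree, both edge directions
g2 = (2, [(0, 0, 1), (1, 1, 0)])
n, edges = merge_graphs([g1, g2])
print(n, edges)
```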

Graph Neural Network. MISIM embeds each node in the input graph into a vector by assigning a trainable vector to each unique node label (with the optional prefix) and GAT attribute. The node embeddings are then used as initial node states h_v^(0) by a relational graph convolutional network (R-GCN Schlichtkrull et al. (2018)) specified as the following:

h_v^(l+1) = ReLU( W_0^(l) h_v^(l) + Σ_{r∈R} Σ_{u∈N_v^r} (1/|N_v^r|) W_r^(l) h_u^(l) ),  l = 0, …, L−1,

where L is the number of GNN layers, N_v^r is the set of neighbors of v that connect to v through an edge of type r, and W_0^(l), W_r^(l) are weight matrices to be learned.
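One layer of this update rule can be sketched in numpy as follows: a self-transform plus, per edge type, a neighbor-count-normalized message with a relation-specific weight matrix. The actual MISIM-GNN uses trained PyTorch weights; the random weights here are only for shape illustration.

```python
# Minimal numpy sketch of one R-GCN layer: per-edge-type weights,
# normalization by neighbor count, ReLU nonlinearity.
import numpy as np

def rgcn_layer(h, edges_by_type, w_self, w_rel):
    """h: (num_nodes, d) states; edges_by_type[r]: list of (u, v) edges."""
    out = h @ w_self.T                     # W_0 h_v term for every node
    for r, edges in edges_by_type.items():
        neigh = {}
        for u, v in edges:                 # u is a neighbor of v via type r
            neigh.setdefault(v, []).append(u)
        for v, us in neigh.items():
            msg = sum(h[u] @ w_rel[r].T for u in us) / len(us)
            out[v] += msg                  # normalized relational message
    return np.maximum(out, 0.0)            # ReLU

rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))
w_self = rng.normal(size=(4, 4))
w_rel = {0: rng.normal(size=(4, 4)), 1: rng.normal(size=(4, 4))}
edges = {0: [(0, 1), (0, 2)], 1: [(1, 0), (2, 0)]}  # child/parent directions
h1 = rgcn_layer(h0, edges, w_self, w_rel)
print(h1.shape)  # (3, 4)
```

Stacking L such layers, each with its own weights, gives the final node states used by the readout.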

Code Vector Generation. To obtain a vector representing the entire input graph, we apply both an average pooling and a max pooling over the graph nodes’ final states h_v^(L). The resulting two vectors are concatenated and fed into a fully-connected layer, yielding the code vector for the input program.
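The readout described above can be sketched in a few lines; the layer here is randomly initialized rather than trained, and the vector sizes are arbitrary choices for illustration.

```python
# Sketch of the graph readout: average-pool and max-pool the final node
# states, concatenate, then apply a fully-connected layer.
import numpy as np

def readout(node_states, w, b):
    avg = node_states.mean(axis=0)
    mx = node_states.max(axis=0)
    return w @ np.concatenate([avg, mx]) + b

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))       # 5 nodes, state size 8
w = rng.normal(size=(16, 16))          # maps concat (2*8) to code vector (16)
b = np.zeros(16)
code_vec = readout(states, w, b)
print(code_vec.shape)  # (16,)
```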

3.2.2 Training

We train the neural network model following the setting of metric learning Schroff et al. (2015); Hermans et al. (2017); Musgrave et al. (2020); Sun et al. (2020), which tries to map input data to a vector space where, under a distance (or similarity) metric, similar data points are close together (or have large similarity scores) and dissimilar data points are far apart (or have small similarity scores). The metric we use is cosine similarity in the code vector space. As shown in the lower half of Figure 1, we use pair-wise labels to train the model. Each pair of input programs is mapped to two code vectors by the model, from which a similarity score is computed and optimized using a metric learning loss function.
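The pair-wise setup can be sketched as follows. MISIM actually uses the Circle loss (Section 4.1); the simple squared-error loss below is only an illustrative stand-in that pushes cosine similarity toward the pair label.

```python
# Sketch: cosine similarity between two code vectors and a toy pair-wise
# loss (1.0 = similar pair, 0.0 = dissimilar pair).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_loss(u, v, label):
    return (cosine(u, v) - label) ** 2

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
print(pair_loss(a, b, 1.0))  # 0.0 for identical similar vectors
```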

4 Experimental Evaluation

In this section, we present our experimental analysis of MISIM. First, we describe the experimental setup. Next, we analyze the performance of MISIM compared to three state-of-the-art systems, code2vec Alon et al. (2019), Neural Code Comprehension (NCC) Ben-Nun et al. (2018), and Aroma Luan et al. (2019), on a dataset containing more than 45,000 programs. Overall, we find that MISIM has better performance than these systems across three metrics. Lastly, we compare the performance of two MISIM variants, each trained with a different CASS configuration.

4.1 Experimental Setup

In this subsection, we describe the experimental setup we used to evaluate MISIM, including details on the datasets and the training procedure we used.

Dataset. Our experiments use the POJ-104 dataset Mou et al. (2016). It consists of student-written C/C++ programs that solve the 104 problems (numbered from 1 to 104) in the dataset. (POJ-104 is available at https://sites.google.com/site/treebasedcnn.) Each problem has 500 unique solutions that have been validated for correctness. We label two programs as similar if they are solutions to the same problem. After a filtering step, as specified in Appendix B.3, we split the dataset by problem into sets for training (problems 1–64), validation (problems 65–80), and testing (problems 81–104). Detailed statistics of the dataset partitioning are shown in Table 1.

Dataset Split #Problems #Programs
Training 64 28,137
Validation 16 7,193
Test 24 10,450
Total 104 45,780
Table 1: POJ-104 dataset partitioning, consisting of 104 problems with 500 C/C++ programs per problem, verified for correctness.

Method MAP@R (%) AP (%) AUPRG (%)
code2vec 1.98 (-0.24/+0.29) 5.57 (-0.98/+0.65) 37.24 (-22.98/+15.72)
NCC 39.95 (-2.29/+1.64) 50.81 (-2.94/+1.59) 98.88 (-0.19/+0.10)
NCC-w/o-inst2vec 54.19 (-3.18/+3.52) 63.06 (-5.43/+4.37) 99.40 (-0.22/+0.17)
Aroma-Dot 52.08 45.72 98.43
Aroma-Cos 55.12 55.72 99.08
MISIM-GNN 82.45 (-0.61/+0.40) 82.15 (-2.74/+1.63) 99.86 (-0.04/+0.03)
MISIM-RNN 74.01 (-2.00/+3.81) 81.78 (-1.51/+2.71) 99.84 (-0.03/+0.04)
MISIM-BoF 74.38 (-1.04/+1.04) 83.07 (-0.69/+0.50) 99.87 (-0.01/+0.01)
Table 2: Code similarity system accuracy. Results are shown as the average and min/max values, relative to the average, over 3 runs.


Unless otherwise specified, we use the same training procedure in all experiments. The models are built and trained using PyTorch Paszke et al. (2019). To train the models, we use the Circle loss Sun et al. (2020), a state-of-the-art loss function that has been effective in various similarity learning tasks. Following the P-K sampling strategy Hermans et al. (2017), we construct a batch of programs by first randomly sampling 16 problems, and then randomly sampling 5 solutions for each problem. The loss function takes the similarity scores of all intra-batch pairs and their pair-wise labels as input. Further details about the training procedure and hyperparameters are discussed in the appendix.
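The P-K batch construction described above (P=16 problems, K=5 solutions each) can be sketched as follows, with smaller numbers and made-up program names for brevity.

```python
# Sketch of P-K sampling: draw P problems, then K solutions per problem,
# forming a batch in which every intra-batch pair has a similarity label
# (same problem = similar).
import random

def pk_sample(solutions_by_problem, p, k, rng):
    problems = rng.sample(sorted(solutions_by_problem), p)
    batch = []
    for prob in problems:
        batch.extend((prob, s) for s in rng.sample(solutions_by_problem[prob], k))
    return batch

rng = random.Random(0)
data = {i: [f"prog_{i}_{j}" for j in range(10)] for i in range(8)}
batch = pk_sample(data, p=4, k=2, rng=rng)
print(len(batch))  # 4 problems x 2 solutions = 8
```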


4.2 Experimental Results

In this subsection, we describe the evaluation of MISIM against three state-of-the-art systems. This includes details on (i) the evaluation metrics, (ii) the adaptation of the state-of-the-art systems to our experimental setting, and (iii) the results and analysis.


The accuracy metrics we use for evaluation are Mean Average Precision at R (MAP@R) Musgrave et al. (2020), Average Precision (AP) Baeza-Yates and Ribeiro-Neto (1999), and Area Under Precision-Recall-Gain Curve (AUPRG) Flach and Kull (2015). MAP@R measures how accurately a model can retrieve similar (or relevant) items from a database given a query. MAP@R rewards a ranking system (e.g., a search engine, a code recommendation engine, etc.) for correctly ranking relevant items such that more relevant items are ranked higher than less relevant items. It is defined as the mean of average precision scores, each of which is evaluated for retrieving the R most similar samples given a query. In our case, the set of queries is the set of all test programs. For a program, R is the number of other programs in the same class (i.e., a POJ-104 problem). MAP@R is applied to both validation and testing. We use AP and AUPRG to measure performance in a binary classification setting, in which the models are viewed as binary classifiers that determine whether a pair of programs is similar by comparing their similarity score with a threshold. AP and AUPRG are only used for testing. They are computed from the similarity scores of all program pairs in the test set, as well as their pair-wise labels. For the systems that require training (i.e., systems with ML-learned similarity scoring), we train and evaluate them three times with different random seeds. We report the average and min/max values of each accuracy metric.
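The per-query average precision at R described above can be sketched as follows, following the Musgrave et al. formulation: precision is accumulated at each rank where a same-class item appears within the top R, and the sum is divided by R.

```python
# Sketch of average precision at R for one query. MAP@R is the mean of
# this quantity over all queries.
def average_precision_at_r(ranked_labels, query_label, r):
    hits, precision_sum = 0, 0.0
    for i, label in enumerate(ranked_labels[:r], start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / i
    return precision_sum / r if r else 0.0

# Query of class 'A' with R=2 relevant items; only one lands in the top R.
print(average_precision_at_r(['A', 'B', 'A'], 'A', r=2))  # 0.5
```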

Modifications to code2vec, NCC, and Aroma.

To compare with code2vec, NCC, and Aroma, we adapt them to our experimental setting in the following ways. The original code2vec takes a function as input, extracts its AST paths to form the input to its neural network, and trains the network using the function name prediction task. In our experiments, we feed the AST paths from all function(s) in a program into the neural network and train it using the metric learning task described in Section 3.2.2. NCC contains a pre-training phase, named inst2vec, on a large code corpus for generating instruction embeddings, and a subsequent phase that trains an RNN for a downstream task using the pre-trained embeddings. We train the downstream RNN model on our metric learning task in two ways. The first uses the pre-trained embeddings (labeled as NCC in our results). The second trains the embeddings from scratch on our task in an end-to-end fashion (labeled as NCC-w/o-inst2vec). In addition, following NCC’s experimental procedure, data augmentation is applied to the training set by compiling the code with different optimization flags. For both code2vec and NCC, we use the same model architectures and embedding/hidden sizes suggested in their papers and open-source implementations. The dimension of their output vectors (i.e., code vectors) is set to the same as our MISIM models. Aroma extracts manual features from the code and computes the similarity score of two programs by taking the dot product of their binary feature vectors. We experiment with both its original scoring mechanism (labeled: Aroma-Dot) and a variant that uses cosine similarity (labeled: Aroma-Cos).

(a) Mean Average Precision at R.
(b) Average Precision.
(c) Area Under Precision-Recall-Gain Curve.
Figure 4: Summarized accuracy results on the POJ-104 test set for code2vec, NCC, Aroma, and MISIM. Bar heights are the averages of the measurements over 3 runs, and error bars are bounded by the minimum and maximum of measured values.


Table 2 and Figure 4 show the accuracy of MISIM, code2vec, NCC, and Aroma. The blue bars show the results of the MISIM system variants trained using the baseline CASS configuration. (CASS configuration identifiers are as follows: the baseline configuration is 0-0-0-0-0 and the non-baseline configuration is 2-2-3-2-1. More details about configuration notation are described in Appendix A.5.) The orange bars show the results of code2vec, NCC, and Aroma. We observe that MISIM-GNN results in the best performance for MAP@R, yielding 1.5x to 40.6x improvements over the other systems. In some cases, MISIM-BoF achieves the best AP and AUPRG scores. In summary, as shown in Table 2, MISIM systems have better accuracy than the other systems we compared against across the three evaluation metrics.

We provide a brief intuition for these results. The code2vec system uses paths in an abstract syntax tree (AST), a purely syntactical code representation, as the input to its neural network. We speculate that such a representation may (i) keep excessive fine-grained syntactical details while (ii) omitting structural (i.e., semantic) information. This may explain why code2vec exhibits underwhelming accuracy in our experiments. The Aroma system employs manual features derived from the simplified parse tree and computes the number of overlapping features from two programs as their similarity score. The selection of manual features appears to be heuristic-based and might result in a loss of semantic information. Neural Code Comprehension (NCC) tries to learn code semantics from LLVM IR, a low-level code representation designed for compilers. The lowering process from source code to LLVM IR may discard semantically relevant information, such as identifier names and syntactic patterns, that is usually not utilized by compilers but might be useful for inferring code semantics. The absence of such information from NCC’s input may limit its code similarity accuracy.

4.3 Specialized Experimental Result: CASS Configurations

In this subsection, we provide early anecdotal evidence indicating that no CASS configuration is invariably the best for all code snippets. In the abbreviated analysis we provide here, we find that configurations may need to be chosen based on the characteristics of the code that MISIM will be trained on and, eventually, used for. To illustrate this, we conducted a series of experiments that train MISIM-GNN models with two CASS configurations on several randomly sampled sub-training sets and compared their test accuracy. The two configurations used were the CASS baseline configuration (0-0-0-0-0) and a non-baseline configuration (2-2-3-2-1) that provides a higher level of abstraction over the source code than the baseline (e.g., it replaces global variable names with a unified string). Table 3 shows the results from four selected sub-training sets from POJ-104, referred to here as S1, S2, S3, and S4. When trained on S1 or S2, the system using the non-baseline configuration performs better than the baseline configuration in all three accuracy metrics. However, using the training sets S3 or S4, the results are inverted.

To better understand this divergence, we compared the semantic features of the sub-training sets. We observed that some CASS-defined semantically salient features (e.g., global variables; see Appendix A) that the non-baseline configuration had been customized to abstract away occurred less frequently in S1 and S2 than in S3 and S4. We speculate that, in the context of the POJ-104 dataset, when global variables are used more frequently, they are more likely to have consistent meaning across different programs. As a result, abstracting them away, as the non-baseline configuration does for S3 and S4, leads to a loss of semantic information salient to code similarity. Conversely, when global variables are used less frequently, there is an increased likelihood that their semantics are specific to a single program. As such, retaining their names in a CASS may increase syntactic noise, thereby reducing model performance. Therefore, when the non-baseline configuration eliminates them for S1 and S2, there is an improvement in accuracy.

Sub-Training Set Configuration MAP@R (%) AP (%) AUPRG (%)
S1 baseline 69.78 (-0.42/+0.21) 76.39 (-1.68/+1.51) 99.78 (-0.03/+0.03)
S1 non-baseline 71.99 (-0.26/+0.45) 79.89 (-1.20/+0.71) 99.83 (-0.02/+0.01)
S2 baseline 63.45 (-1.58/+1.92) 68.58 (-2.51/+2.85) 99.63 (-0.06/+0.06)
S2 non-baseline 67.40 (-1.85/+1.23) 69.86 (-3.34/+1.79) 99.65 (-0.10/+0.05)
S3 baseline 63.53 (-1.08/+1.53) 72.47 (-0.95/+1.24) 99.70 (-0.04/+0.03)
S3 non-baseline 61.23 (-2.04/+1.57) 69.83 (-1.03/+1.60) 99.65 (-0.03/+0.03)
S4 baseline 61.78 (-0.46/+0.47) 66.86 (-2.31/+2.81) 99.56 (-0.06/+0.07)
S4 non-baseline 60.86 (-1.59/+0.90) 63.86 (-3.06/+3.43) 99.46 (-0.14/+0.11)
Table 3: Test accuracy of MISIM-GNN trained on different subsets (S1–S4) of the training set, with the baseline (0-0-0-0-0) and non-baseline (2-2-3-2-1) CASS configurations. The results are shown as the average and min/max values relative to the average over 3 runs.

5 Conclusion

In this paper, we presented MISIM, an end-to-end code similarity system. MISIM has two core novelties. The first is the context-aware semantic structure (CASS), designed specifically to lift semantic meaning from code syntax. The second is a neural-based code similarity scoring algorithm that learns code similarity scoring using CASS. Our experimental evaluation showed that MISIM outperforms three other state-of-the-art code similarity systems, usually by a large factor (up to 40.6x). We also provided anecdotal evidence illustrating that there may not be one universally optimal CASS configuration. An open research question for MISIM is how to automatically derive the proper configuration of its various components, specifically the CASS and neural scoring algorithms, for a given code corpus. To realize this, we may first need to design a new semantic analysis system that can automatically characterize a given code corpus in some meaningful way. Such characterizations may then be useful to guide the learning process and help identify optimal MISIM components.

6 Broader Impact

To discuss the broader impact of our project, we categorize impacts by their degree of influence. By the phrase “first-degree negative impact” we refer to a scenario where a given research idea can be directly used for harm (e.g., DeepFake Floridi (2018), DeepNude (we have intentionally not included a citation to this work; we do not want to be seen, in any way, as endorsing or promoting it, and we believe such an act would be ethically irresponsible), and so on). Similarly, by “second-degree negative impact” we refer to a scenario where a research idea may have a direct negative or positive impact based on how it is used (e.g., facial recognition for security vs. oppressing minorities, or GPT Radford et al. (2019) to create an empathetic chatbot vs. malicious fake news). We say a research idea has a “third-degree negative impact” if the idea by itself represents an abstract concept (e.g., a similarity metric) and cannot cause harm on its own, but can be used to build a second application which can then have a negative impact based on its use.

We envision the following positive broader impacts of the research idea presented in this paper. As briefly mentioned in the introduction, an end-to-end code similarity system can be incorporated into programming tools (e.g., Visual Studio, Eclipse, etc.) to improve a programmer’s productivity by offering them a similar but “known to be more efficient” code snippet. It can be used in coding education by displaying better (e.g., more concise, faster, more space-efficient) code for a given code snippet, in assisting program debugging by identifying potentially missing parts, for plagiarism detection, for automated bug detection and fixing, in automatic code transformations (e.g., replacing a Python function with an equivalent C function), and so on. If used wisely with proper control and governance, we believe it can create many positive impacts.

We can envision the following third-degree negative impacts. If a tool that uses code similarity becomes mature enough to automatically generate correct, compilable code, it could potentially be used to automatically translate code from one language to another or to replace slow code with fast code. A malicious actor could leverage the code similarity tool to crawl the web and steal code, find common patterns and security flaws in publicly available code, and then find ways to attack at a massive scale. Code generated from the same code generators is likely to be more vulnerable to such attacks. If systems allow automatic code patching/fixing based on code similarity without proper testing, a compromised system might introduce security flaws. If programmers become accustomed to getting help from a programming tool, that might reduce their ability to learn unless the tool also offers explainability; explainability would be required to understand what the tool is learning about code similarity and to educate programmers about it.

To summarize, code similarity is an abstract concept that is likely to have numerous positive applications. However, if used in other tools, it might also play a role in creating a third-degree negative impact: it may be used to develop tools and applications which, if mature enough, may cause unacceptable or dangerous situations. To mitigate the negative impacts, we would need to ensure that proper policy and security measures are in place to prevent negative usage. In particular, such secure systems may require a human in the loop so that any such tool is used to enhance the capability and productivity of programmers.


  • [1] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton (2018-09) A Survey of Machine Learning for Big Code and Naturalness. ACM Computing Surveys 51 (4). Cited by: §1.
  • [2] M. Allamanis, M. Brockschmidt, and M. Khademi (2018) Learning to Represent Programs with Graphs. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [3] U. Alon, O. Levy, and E. Yahav (2019) code2seq: Generating Sequences from Structured Representations of Code. In International Conference on Learning Representations, External Links: Link Cited by: §A.7, §1.
  • [4] U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2018) A General Path-Based Representation for Predicting Program Properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, New York, NY, USA, pp. 404–419. External Links: ISBN 9781450356985, Link, Document Cited by: §1.
  • [5] U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019-01) code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3 (POPL), pp. 40:1–40:29. External Links: Document, ISSN 2475-1421, Link Cited by: §A.7, 4th item, §1, §2, §2, §4.
  • [6] J. Bader, A. Scott, M. Pradel, and S. Chandra (2019-10) Getafix: Learning to Fix Bugs Automatically. Proc. ACM Program. Lang. 3 (OOPSLA). External Links: Link, Document Cited by: §1.
  • [7] R. A. Baeza-Yates and B. Ribeiro-Neto (1999) Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., USA. External Links: ISBN 020139829X Cited by: §3, §4.2.
  • [8] S. Barman, S. Chasins, R. Bodik, and S. Gulwani (2016-10) Ringer: Web Automation by Demonstration. SIGPLAN Not. 51 (10), pp. 748–764. External Links: ISSN 0362-1340, Link, Document Cited by: §1.
  • [9] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier (1998) Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), pp. 368–377. Cited by: §2.
  • [10] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler (2018) Neural Code Comprehension: A Learnable Representation of Code Semantics. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 3585–3597. Cited by: 4th item, §1, §2, §2, §4.
  • [11] S. Bhatia, P. Kohli, and R. Singh (2018) Neuro-Symbolic Program Corrector for Introductory Programming Assignments. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 60–70. External Links: ISBN 9781450356381, Link, Document Cited by: §1.
  • [12] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014-10) Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. External Links: Document, Link Cited by: §B.2.
  • [13] V. Cosentino, J. L. Cánovas Izquierdo, and J. Cabot (2017) A Systematic Mapping Study of Software Development With GitHub. IEEE Access 5, pp. 7173–7192. External Links: Document, ISSN 2169-3536 Cited by: §A.1.1, §1.
  • [14] E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang (2020) Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [15] D. Feitelson, A. Mizrahi, N. Noy, A. Ben Shabat, O. Eliyahu, and R. Sheffer (2020) How Developers Choose Names. IEEE Transactions on Software Engineering, pp. 1–1. External Links: Document, ISSN 2326-3881 Cited by: §A.3.
  • [16] P. A. Flach and M. Kull (2015) Precision-recall-gain curves: pr analysis done right. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 838–846. Cited by: §4.2.
  • [17] D. Flanagan (2006) JavaScript: the definitive guide. "O’Reilly Media, Inc.". Cited by: §A.1.1, §3.1.
  • [18] L. Floridi (2018-08) Artificial Intelligence, Deepfakes and a Future of Ectypes. Philosophy & Technology 31. External Links: Document Cited by: §6.
  • [19] E. M. Gellenbeck and C. R. Cook (1991) An Investigation of Procedure and Variable Names as Beacons During Program Comprehension. In Empirical studies of programmers: Fourth workshop, pp. 65–81. Cited by: §A.3.
  • [20] J. Gottschlich, A. Solar-Lezama, N. Tatbul, M. Carbin, M. Rinard, R. Barzilay, S. Amarasinghe, J. B. Tenenbaum, and T. Mattson (2018) The Three Pillars of Machine Programming. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, New York, NY, USA, pp. 69–80. External Links: Document, ISBN 978-1-4503-5834-7, Link Cited by: §1.
  • [21] A. Hermans, L. Beyer, and B. Leibe (2017) In Defense of the Triplet Loss for Person Re-Identification. CoRR abs/1703.07737. External Links: Link, 1703.07737 Cited by: §3.2.2, §4.1.
  • [22] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2018) Deep Code Comment Generation. In Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, New York, NY, USA, pp. 200–210. External Links: Document, ISBN 9781450357142, Link Cited by: §B.2.
  • [23] K. Kim, D. Kim, T. F. Bissyandé, E. Choi, L. Li, J. Klein, and Y. L. Traon (2018) FaCoY: A Code-to-Code Search Engine. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 946–957. External Links: ISBN 9781450356381, Link, Document Cited by: §2.
  • [24] C. Lattner and V. Adve (2004) LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO ’04, USA, pp. 75. External Links: ISBN 0769521029 Cited by: §2.
  • [25] S. Lipschutz (1968) Schaum’s outline of theory and problems of linear algebra. McGraw-Hill, New York. External Links: ISBN 0070379890 Cited by: §3.
  • [26] B. Liu, W. Huo, C. Zhang, W. Li, F. Li, A. Piao, and W. Zou (2018) Diff: Cross-Version Binary Code Similarity Detection with DNN. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 667–678. External Links: ISBN 9781450359375, Link, Document Cited by: §2.
  • [27] T. Liu (2009-03) Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3 (3), pp. 225–331. External Links: ISSN 1554-0669, Link, Document Cited by: §A.5.
  • [28] I. Loshchilov and F. Hutter (2019) Decoupled Weight Decay Regularization. In International Conference on Learning Representations, External Links: Link Cited by: §B.4.
  • [29] S. Luan, D. Yang, C. Barnaby, K. Sen, and S. Chandra (2019-10) Aroma: Code Recommendation via Structural Code Search. Proc. ACM Program. Lang. 3 (OOPSLA), pp. 152:1–152:28. External Links: Document, ISSN 2475-1421, Link Cited by: §A.7, §B.1, 4th item, §1, §2, §2, §4.
  • [30] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin (2016) Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI ’16, pp. 1287–1293. Cited by: §4.1.
  • [31] K. Musgrave, S. Belongie, and S. Lim (2020) A Metric Learning Reality Check. External Links: 2003.08505 Cited by: §3.2.2, §4.2.
  • [32] A. Odena and C. Sutton (2020) Learning to Represent Programs with Property Signatures. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.1.
  • [34] M. Pradel and K. Sen (2018-10) DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM Program. Lang. 2 (OOPSLA). External Links: Link, Document Cited by: §1.
  • [35] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language Models are Unsupervised Multitask Learners. Cited by: §6.
  • [36] C. K. Roy, J. R. Cordy, and R. Koschke (2009-05) Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach. Sci. Comput. Program. 74 (7), pp. 470–495. External Links: ISSN 0167-6423, Link, Document Cited by: §A.1.1.
  • [37] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey (2012-06) Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications?. In 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp. 440–451. External Links: Document, ISSN 1063-6897 Cited by: §A.1.1.
  • [38] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Lecture Notes in Computer Science, Vol. 10843, pp. 593–607. External Links: Document, Link Cited by: §3.2.1.
  • [39] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: A Unified Embedding for Face Recognition and Clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.2.
  • [40] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei (2020) Circle Loss: A Unified Perspective of Pair Similarity Optimization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.2, §4.1.
  • [41] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk (2018) Deep Learning Similarities from Different Representations of Source Code. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, New York, NY, USA, pp. 542–553. External Links: ISBN 9781450357166, Link, Document Cited by: §1.
  • [42] G. Van Rossum and F. L. Drake (2009) Python 3 reference manual. CreateSpace, Scotts Valley, CA. External Links: ISBN 1441412697 Cited by: §A.1.1, §3.1.
  • [43] W. Wulf and M. Shaw (1973-02) Global Variable Considered Harmful. SIGPLAN Not. 8 (2), pp. 28–34. External Links: ISSN 0362-1340, Link, Document Cited by: §A.3.
  • [44] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu (2019) A Novel Neural Source Code Representation Based on Abstract Syntax Tree. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, pp. 783–794. External Links: Link, Document Cited by: §1.
  • [45] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun (2018) Graph Neural Networks: A Review of Methods and Applications. CoRR abs/1812.08434. External Links: Link, 1812.08434 Cited by: §3.2.1.

Appendix A Context-Aware Semantics Structure Details

Table 4 lists the current types and options for the language-specific and language-agnostic categories in CASS.666While our exploration into CASS is still early, we believe our categories may be exhaustive (that is, fully encompassing). Yet, we do not believe our configuration types or options are exhaustive. Each configuration type has multiple options associated with it to afford the user flexibility in exploring a number of CASS configurations. For all configuration types, option 0 always corresponds to the Aroma system’s original simplified parse tree (SPT). Each of the types in Table 4 is described in greater detail in the following sections.

Language-specific configurations, described in Section A.1, are designed to resolve syntactic ambiguity present in the SPT. For example, in Figure 5, the SPT treats the parenthetical expression (global1 + global2) identically to the parenthetical expression init(global1), whereas the CASS configuration shown disambiguates these two terms (the first is a parenthesized expression, the second is an argument list). Such disambiguation may be useful to a code similarity system, as the CASS representation makes the presence of a function call clearer.

Figure 5: Example SPT and CASS tree.

Language-agnostic configurations, described in Section A.2, can improve code similarity analysis by unbinding overly-specific semantics that may be present in the original SPT structure. For example, in Figure 5, the SPT includes the literal names global1, global2, etc. The CASS variant, on the other hand, unbinds these names and replaces them with a generic string (#GVAR). This could improve code similarity analysis if the exact token names are irrelevant, and the semantically-salient feature is simply that there is a global variable.

We note that these examples are not universal. One specific CASS configuration is unlikely to work in all scenarios: sometimes disambiguating parenthetical expressions may help; other times it may hurt. This work seeks to explore and analyze these possible configurations. We provide a formalization and concrete examples of both language-agnostic and language-specific configurations later in this section.

Figure 6: Language Ambiguity in Simplified Parse Tree.

a.1 Language-Specific Configurations

Language-specific configurations are meant to capture semantic meaning by resolving ambiguity and introducing specificity related to the specific underlying programming language. Intuitively, these configurations can be thought of as syntax-binding, capturing semantic information that is bound to the particular syntactic structure of the program. In some cases, these specifications may capture relevant semantic information, whereas in other cases they may capture irrelevant details.

Node Prefix Label.

We define a node prefix label as a prefix string added to a tree node’s label to incorporate more information. Node prefix labels are generally used to resolve language-specific syntax ambiguity. These ambiguous scenarios tend to arise when certain code constructs and/or operators have been overloaded in a specific language. In such cases, the original SPT structure may be insufficient to properly disambiguate between them, potentially reducing its ability to evaluate code semantic similarity (see Figure 6). CASS’s node prefix label options are meant to help resolve this. As we incorporate more language-specific syntax into CASS nodes, we run the risk of overloading the tree with syntactic details. This could potentially undo the general reasoning behind Aroma’s SPT and our CASS structure. We discuss this in greater detail in Section A.3.

Language-specific:
  A. Node Prefix Label
     0. No change (Aroma’s original configuration)
     1. Add a prefix to each internal node label
     2. Add a prefix to parenthesis node label (C/C++ specific)
Language-agnostic:
  B. Compound Statements
     0. No change (Aroma’s original configuration)
     1. Remove all features relevant to compound statements
     2. Replace with ‘{#}’
  C. Global Variables
     0. No change (Aroma’s original configuration)
     1. Remove all features relevant to global variables
     2. Replace with ‘#GVAR’
     3. Replace with ‘#VAR’ (the label for local variables)
  D. Global Functions
     0. No change (Aroma’s original configuration)
     1. Remove all features relevant to global functions
     2. Replace with ‘#EXFUNC’
  E. Function Cardinality
     0. No change
     1. Include the input and output cardinality per function
Table 4: CASS Configuration Options.

a.1.1 C and C++ Node Prefix Label

For our first embodiment of CASS, we have focused solely on C and C++ programs. We have found that programs in C/C++ present at least two interesting challenges.777We do not claim that these challenges are unique to C/C++: these challenges may be present in other languages as well.

(Lack of) Compilation.

We have found that, unlike programs in higher-level programming languages (e.g., Python [42], JavaScript [17]), C/C++ programs found "in the wild" tend to not immediately compile from a source repository (e.g., GitHub [13]). Thus, code similarity analysis may need to be performed without relying on successful compilation.

Many Solutions.

The C and C++ programming languages provide multiple diverse ways to solve the same problem (e.g., searching a list with a for loop vs. using std::find). Because of this, C/C++ enables programmers to create semantically similar programs that are syntactically divergent. In extreme cases, such semantically similar (or identical) solutions may differ in computation time by orders of magnitude [37]. This requires code similarity techniques to be robust in their ability to identify semantic similarity in the presence of syntactic dissimilarity (i.e., a type-4 code similarity exercise [36]).

We believe that analytically deriving the optimal selection of node prefix labels across all C/C++ code may be untenable. To accommodate this, we currently provide two levels of granularity for C/C++ node prefix labels in CASS.888This is still early work and we expect to identify further refinement options in C/C++ and other languages as the research progresses.

  • Option 0: original Aroma SPT configuration.

  • Option 1: node prefix label of all nodes with their language-specific node type.

  • Option 2: node prefix label of all nodes containing parentheticals with their language-specific node type.

Option 1 corresponds to an extreme case of a concrete syntax embedding (i.e., every node contains syntactic information, and all syntactic information is represented in some node). Since such an embedding may "overload" the code similarity system with irrelevant syntactic details, Option 2 can be used to annotate only parentheticals, which we have empirically identified to often have notably divergent semantic meaning based on context.

An example is shown in Figure 6. In one case the parentheses are used as a mathematical precedence operator; in the other they enclose a function call’s argument list. If left unresolved, such ambiguity would cause the subtree rooted at node 7 of function f1 to be classified identically to the subtree rooted at node 5 of function f2. The intended purpose of the parenthesis operator is context sensitive and is disambiguated by encoding the contextual information into two distinct node prefix labels, i.e., the parenthesized expression and the argument list, respectively.
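The prefix options can be viewed as a tree relabeling pass. The sketch below is illustrative only, not the MISIM implementation: the `Node` type, the node-type names (`paren_expr`, `arg_list`), and the `type:label` prefix format are all assumptions.

```python
# Sketch of CASS node prefix labels (Table 4, type A). Node/type names
# and the "type:label" prefix format are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str          # SPT label, e.g. "(#)" for a parenthesized child
    node_type: str      # language-specific parser node type
    children: list = field(default_factory=list)

def apply_prefix_option(node, option):
    """Return a relabeled copy of the tree under prefix options 0-2."""
    prefixed = node.label
    if option == 1 and node.children:                 # all internal nodes
        prefixed = f"{node.node_type}:{node.label}"
    elif option == 2 and node.node_type in ("paren_expr", "arg_list"):
        prefixed = f"{node.node_type}:{node.label}"   # parentheticals only
    return Node(prefixed, node.node_type,
                [apply_prefix_option(c, option) for c in node.children])

# Two SPT subtrees that share the ambiguous label "(#)":
math_paren = Node("(#)", "paren_expr", [Node("#VAR", "identifier")])
call_args  = Node("(#)", "arg_list",  [Node("#VAR", "identifier")])

# Option 0 cannot tell them apart; option 2 can.
assert apply_prefix_option(math_paren, 0).label == apply_prefix_option(call_args, 0).label
assert apply_prefix_option(math_paren, 2).label != apply_prefix_option(call_args, 2).label
```

Option 2 thus disambiguates the two subtrees from Figure 6 without prefixing every node the way option 1 does.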

a.2 Language-Agnostic Configurations

Unlike language-specific configurations, language-agnostic configurations are not restricted to the syntax of a specific language. Instead, they are meant to be applied generally across multiple languages. Intuitively, these configurations can be thought of as syntax-unbinding in nature: they generally abstract (or, in some cases, entirely eliminate) syntactic information in an attempt to improve the system’s ability to derive semantic meaning from the code.

Compound Statements.

The compound statements configuration is a language-agnostic option that enables the user to control how much non-terminal node information is incorporated into the CASS. Again, Option 0 corresponds to the original Aroma SPT. Option 1 omits separate features for compound statements altogether. Option 2 does not discriminate between compound statements of different lengths and specifies a special label to denote the presence of a compound statement. For example, the for loop construct in C/C++ is represented with a single label with this option instead of constructing three separate labels for the loop initialization, test condition and increment.
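The three options reduce to a small choice over labels. This is a minimal sketch with illustrative label strings; Aroma's actual feature encoding differs.

```python
def compound_label(num_children, option):
    """Compound-statement feature under options 0-2 (Table 4, type B).
    Label strings are illustrative, not Aroma's exact encoding."""
    if option == 0:
        # original SPT: label reflects the number of constituents
        return "{" + ",".join(["#"] * num_children) + "}"
    if option == 1:
        return None          # feature omitted entirely
    if option == 2:
        return "{#}"         # single label regardless of length

assert compound_label(3, 0) == "{#,#,#}"   # e.g., for-loop: init, test, increment
assert compound_label(7, 2) == "{#}"       # length no longer distinguishes
assert compound_label(2, 1) is None        # no compound-statement feature at all
```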

Global Variables.

The global variables configuration specifies the degree of global-variable-specific information contained in a CASS. In addition to Aroma’s original configuration (Option 0), which annotates nodes with the precise global variable name, CASS provides three additional configurations. Option 1 specifies the extreme case of eliding all information on global variables. Option 2 annotates all global variables with the special label ‘#GVAR’, omitting the names of the global variable identifiers. Option 3 designates global variables with the label ‘#VAR’, rendering them indistinguishable from local variables.

Intuitively, including the precise global variable names (Option 0) may be appropriate if code similarity is being performed on a single code-base, where two references to a global variable with the same name necessarily refer to the same global variable. Options 1 through 3, which remove global variable information to varying degrees, may be appropriate when performing code similarity between unrelated code-bases, where two different global variables named (for example) foo are most likely unrelated.
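The four options amount to a relabeling function over global-variable nodes. The option semantics follow the text above; the function name is ours.

```python
def global_var_label(name, option):
    """Global-variable feature under options 0-3 (Table 4, type C)."""
    if option == 0:
        return name        # keep the precise identifier (single code-base)
    if option == 1:
        return None        # drop the feature entirely
    if option == 2:
        return "#GVAR"     # anonymous global variable
    if option == 3:
        return "#VAR"      # indistinguishable from a local variable

assert global_var_label("global1", 0) == "global1"
assert global_var_label("global1", 2) == "#GVAR"
# Under option 3, a global and a local receive the same label:
assert global_var_label("global1", 3) == "#VAR"
```

The global functions configuration (type D) behaves analogously, with ‘#EXFUNC’ playing the role of ‘#GVAR’.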

Global Functions.

The global functions configuration serves the dual purpose of (i) controlling the amount of function-specific information to featurize and (ii) disambiguating between the usage of global functions and global variables in CASS, a feature that is curiously absent in the original SPT design: the SPT shown in Figure 5 makes no distinction between init (a function) and global1 (a variable). Option 1 removes all features pertaining to global functions. Option 2 annotates all global function references with the special label ‘#EXFUNC’ while eliminating the function identifier. Intuitively, these options behave similarly to the global variable options. Our current prototype, which handles only single C/C++ functions, does not differentiate between external functions. In future work, we plan to investigate CASS variants that differentiate between local, global, and library functions.

Function Cardinality.

The function cardinality configuration aims to abstract the semantics of certain groups of functions through input and output cardinality (e.g., the number of input parameters). As an example, a function that returns the size of a container may have zero input parameters and one output parameter, where the output parameter returns the number of elements in the container. Likewise, another function that checks whether a container is empty may also have zero input parameters and one output parameter, where the output parameter returns a boolean indicating whether the container contains any element. Both functions are semantically similar in that they both perform non-mutating state checks on the container. Moreover, this is captured in the identical representation of their function input and output cardinality.999This is not meant to claim that all such semantically similar functions will have identical input and output cardinality. However, we do believe it can be used as guidance to extract additional semantic meaning. CASS can be configured to rely on this characteristic, along with others, to capture the semantics of the source code.
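The cardinality feature can be sketched as below; the `#FN(in->out)` label format is an assumption of ours, not the paper's notation.

```python
def cardinality_feature(num_inputs, num_outputs):
    """Function cardinality feature (Table 4, type E, option 1):
    abstract a function by its input/output arity alone.
    The label format is illustrative."""
    return f"#FN({num_inputs}->{num_outputs})"

# size() and empty() on a container: both zero inputs, one output,
# so they receive the identical cardinality feature.
assert cardinality_feature(0, 1) == cardinality_feature(0, 1)
# A binary function gets a distinct feature.
assert cardinality_feature(0, 1) != cardinality_feature(2, 1)
```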

a.3 Discussion

We believe there is no silver-bullet solution for code similarity across all programs and programming languages. Based on this belief, a key intuition of CASS’s design is to provide a structure that is semantically rich based on its structure, with inspiration from Aroma’s SPT, while simultaneously providing a range of customizable parameters to accommodate a wide variety of scenarios. CASS’s language-agnostic and language-specific configurations and their associated options enable exploration of a series of tree variants, each differing in its granularity of abstraction.

For instance, the compound statements configuration provides three levels of abstraction. Option 0 is Aroma’s baseline configuration and is the finest level of abstraction, as it featurizes the number of constituents in a compound statement node. Option 2 reduces compound statements to a single token and represents a slightly higher level of abstraction. Option 1 eliminates all features related to compound statements and is the coarsest level of abstraction. The same trend applies to the global variables and global functions configurations. It is our belief, based on early evidence, that the appropriate level of abstraction in CASS is likely based on many factors such as (i) code similarity purpose, (ii) programming language expressiveness, and (iii) application domain.

Aroma’s original SPT seems to work well for a common code base where global variables have consistent semantics and global functions are standard API calls, also with consistent semantics (e.g., a single code-base). However, for cases outside of such spaces, some questions about applicability arise. For example, assumptions about consistent semantics for global variables and functions may not hold for non-common code-bases or non-standardized global function names [43, 19, 15]. Having the capability to differentiate between these cases, and others, is a key motivation for CASS.

We do not believe that CASS’s current structure is exhaustive. With this in mind, we have designed CASS to be extensible, enabling a seamless mechanism to add new configurations and options (described in Section A.4). Our intention with this paper is to present initial findings in exploring CASS’s structure. Based on our early experimental analysis, presented in Section A.4, CASS seems to be a promising research direction for code similarity.

An Important Weakness.

While CASS provides added flexibility over the SPT, such flexibility may be misused. With CASS, system developers are free to add or remove as much syntactic differentiation detail as they choose for a given language or body of code. Such overspecification (or underspecification) may result in syntactic overload (or underload), which may reduce code similarity accuracy relative to the original SPT design, as we illustrate in Section A.4.

a.4 CASS Experimental Results

In this section, we discuss our experimental setup and analyze the performance of CASS compared to Aroma’s simplified parse tree (SPT). In Section A.5, we explain the dataset grouping and enumeration for our experiments. We also discuss the metrics used to quantitatively rank the different CASS configurations and those chosen for the evaluation of code similarity. Section A.6 demonstrates that a code similarity system built using CASS (i) has a greater frequency of improved accuracy across problems and (ii) is, on average, more accurate than SPT. For completeness, we also include cases where CASS configurations perform poorly.

a.5 Experimental Setup

In this section, we describe our experimental setup. At the highest level, we compare the performance of various configurations of CASS to Aroma’s SPT. The list of possible CASS configurations are shown in Table 4.


The experiments use the same POJ-104 dataset introduced in Section 4.1. The filtering step described in Appendix B.3 is also applied, except that programs that cannot be compiled to LLVM IR are not removed.

Problem Group Selection.

Given that POJ-104 consists of 104 unique problems and nearly 50,000 programs, depending on how we analyze the data, we might face intractability problems in both computational and combinatorial complexity. With this in mind, our initial approach is to construct 1000 sets of five unique, pseudo-randomly selected problems for code similarity analysis. Using this approach, we evaluate every configuration of CASS and Aroma’s original SPT on each pair of solutions for each problem set. We then aggregate the results across all the groups to estimate their overall performance. While this approach is not exhaustive of possible combinations (in set size or set combinations), we aim for it to be a reasonable starting point. As our research with CASS matures, we plan to explore a broader variety of set sizes and a more exhaustive number of combinations.
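The group-selection procedure can be sketched as follows. The sampling scheme and seed are assumptions, since the paper does not specify them.

```python
import random

def select_groups(num_problems=104, group_size=5, num_groups=1000, seed=0):
    """Pseudo-randomly draw problem groups as in the CASS evaluation:
    1000 sets of five unique problems out of the 104 POJ-104 problems.
    The concrete RNG and seed are assumptions."""
    rng = random.Random(seed)
    problems = list(range(num_problems))
    # sample() draws without replacement, so problems within a group are unique
    return [rng.sample(problems, group_size) for _ in range(num_groups)]

groups = select_groups()
assert len(groups) == 1000
assert all(len(set(g)) == 5 for g in groups)
```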

Code Similarity Performance Evaluation.

For each problem group, we exhaustively calculate code similarity scores for all unique solution pairs, including pairs constructed from the same program solution (i.e., a program compared to itself). We use G to refer to the set of groups and g to indicate a particular group in G. We denote |G| as the number of groups in G (i.e., its cardinality) and |g| as the number of solutions in group g. For |g| = n, the total number of unique program pairs in g is n(n+1)/2.

To compute the similarity score of a solution pair, we use Aroma’s approach: we calculate the dot product of the two feature vectors of a program pair, each of which is generated from a CASS or SPT structure. The larger the magnitude of the dot product, the greater the similarity.
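Aroma-style scoring on sparse feature-count vectors might look like the following sketch; representing a feature vector as a `{feature: count}` dict is our assumption.

```python
def similarity(fv1, fv2):
    """Aroma-style similarity: dot product of two sparse feature-count
    vectors, represented here as {feature: count} dicts (assumed)."""
    if len(fv2) < len(fv1):
        fv1, fv2 = fv2, fv1          # iterate over the smaller vector
    return sum(count * fv2.get(feat, 0) for feat, count in fv1.items())

# Feature vectors of two hypothetical CASS trees:
a = {"#VAR": 2, "(#)": 1}
b = {"#VAR": 1, "{#}": 3}
assert similarity(a, b) == 2         # only "#VAR" overlaps: 2 * 1
assert similarity(a, a) == 5         # self-similarity: 2*2 + 1*1
```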

We evaluate the quality of the recommendation based on average precision. Precision is the ratio of true positives to the sum of true positives and false positives. Here, true positives denote solution pairs correctly classified as similar and false positives refer to solution pairs incorrectly classified as similar. Recall is the ratio of true positives to the sum of true positives and false negatives, where false negatives are solution pairs incorrectly classified as different. As we monotonically increase the threshold from the minimum value to the maximum value, precision generally increases while recall generally decreases. The average precision (AP) summarizes the performance of a binary classifier under different thresholds for categorizing whether the solutions are from the same equivalence class (i.e., the same POJ-104 problem) [27]. AP is calculated using the following formula over all thresholds.

  1. All unique values of the similarity scores over the solution pairs are gathered and sorted in descending order. Let T be the number of unique scores and (s_1, ..., s_T) be the sorted list of those scores.

  2. For each k in {1, ..., T}, the precision P_k and recall R_k of the classifier with threshold s_k are computed.

  3. Let R_0 = 0. The average precision is computed as AP = sum_{k=1}^{T} (R_k - R_{k-1}) * P_k.
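The three steps above can be implemented directly as a short sketch (a direct, unoptimized translation of the procedure; `scores` holds one similarity score per solution pair and `labels` marks whether the pair comes from the same POJ-104 problem):

```python
def average_precision(scores, labels):
    """AP over all thresholds: unique scores sorted descending,
    precision/recall at each threshold, then the step-wise sum
    (R_k - R_{k-1}) * P_k with R_0 = 0."""
    thresholds = sorted(set(scores), reverse=True)   # step 1
    total_pos = sum(labels)
    ap, prev_recall = 0.0, 0.0                       # R_0 = 0
    for t in thresholds:                             # step 2
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision     # step 3
        prev_recall = recall
    return ap
```

With a perfect ranking (all similar pairs scored above all dissimilar pairs), AP is 1.0.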

Configuration Identifier.

In the following sections, we refer to a configuration of CASS by its unique identifier (ID). A configuration ID is formatted as A-B-C-D-E. Each of the five letters corresponds to a configuration type in the second column of Table 4, and will be replaced by an option number specified in the third column of the table. Configuration 0-0-0-0-0 corresponds to Aroma’s SPT.
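For concreteness, a configuration ID can be split into its five option numbers with a trivial parser (a hypothetical helper; the mapping of positions to configuration types follows Table 4 as described above):

```python
def parse_config_id(config_id):
    """Split an A-B-C-D-E configuration ID into its five option
    numbers; 0-0-0-0-0 denotes Aroma's original SPT."""
    options = tuple(int(x) for x in config_id.split("-"))
    if len(options) != 5:
        raise ValueError("expected five dash-separated options")
    return options
```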

A.6 Results

Figure 6(a) depicts the number of problem groups where a particular CASS variant performed better (blue) or worse (orange) than SPT. For example, the CASS configuration 2-0-0-0-1 outperformed SPT in 859 of the 1000 problem groups and underperformed in 141, a net margin of 71.8% of the groups in CASS’s favor. Figure 6(a) shows the two best (2-0-0-0-1 and 0-0-0-0-1), the median (2-2-3-0-0), and the two worst (1-0-1-0-0 and 1-2-1-0-0) configurations with respect to SPT. Although certain configurations perform better than SPT, others perform worse. We observed that the better-performing configurations use option 1 for function cardinality, while the worse-performing ones use option 0. These observations indicate that function cardinality seems to improve code similarity accuracy, at least for the data we are considering. We speculate that these configuration results may vary based on programming language, problem domain, and other constraints.

Figure 6(b) shows the problem group for which CASS achieved its best performance relative to SPT among all 1000 problem groups, i.e., the single group with the greatest difference between a CASS configuration and SPT. For this group, two CASS configurations achieve an improvement of more than 30% over SPT. We note that, since we tested 216 CASS configurations across 1000 different problem groups, there is a reasonable chance of observing such a large difference even if CASS performed identically to SPT in expectation. We do not intend for this result to demonstrate statistical significance, but simply to illustrate the outcome of our experiments.

Figure 6(c) compares the mean of AP over all 1000 problem groups. In it, the blue bars, moving left to right, depict the CASS configurations that are (i) the two best, (ii) the median, and (iii) the two worst in terms of average precision. Aroma’s baseline SPT configuration is highlighted in orange. The best two CASS configurations show an average improvement of more than 1% over SPT, while the others degraded performance relative to the baseline SPT configuration.

These results illustrate that certain CASS configurations can outperform the SPT on average by a small margin, and can outperform the SPT on specific problem groups by a large margin. However, we also note that choosing a good CASS configuration for a domain is essential. We leave automating this configuration selection to future work.

(a) Breakdown of the Number of Groups with AP Greater or Less than SPT.
(b) Average Precision for the Group Containing the Best Case.
(c) Mean of Average Precision Over All Program Groups.
Figure 7: Comparison of CASS and SPT. The blue bars in (a) and (b), and all the bars in (c), from left to right, correspond to the best two, the median, and the worst two CASS configurations, ranked by the metric displayed in each subfigure.

A.6.1 Analysis of Configurations

Figures 7(a)-7(e) serve to illustrate the performance variation for individual configurations. Figure 7(a) shows the effect of varying the options for the node prefix label configuration. Applying the node prefix label for the parentheses operator (option 2) results in the best overall performance while annotating every internal node (option 1) results in a concrete syntax tree and the worst overall performance. This underscores the trade-offs in incorporating syntax-binding transformations in CASS. In Figure 7(b) we observe that removing all features relevant to compound statements (option 1) leads to the best overall performance when compared with other options. This indicates that adding separate features for compound statements obscures the code’s intended semantics when the constituent statements are also individually featurized.

Figure 7(c) shows that removing all features relevant to global variables (option 1) degrades performance. We also observe that eliminating the global variable identifiers and assigning a label to signal their presence (option 2) performs best overall, possibly because global variables appearing in similar contexts may not use the same variable identifiers. Further, option 2 performs better than the case where global variables are indistinguishable from local variables (option 3). Figure 7(d) indicates that removing features relevant to identifiers of global functions, but flagging their presence with a special label as done in option 2, generally gives the best performance. This result is consistent with the intuitions for eliminating features of function identifiers in CASS as discussed in Section A.3. Figure 7(e) shows that capturing the input and output cardinality improves the average performance. This aligns with our assumption that function cardinality may abstract the semantics of certain groups of functions.

A Subtle Observation.

A more nuanced observation is that the optimal granularity of abstraction differs across configuration types. For compound statements, the best option corresponds to the coarsest level of abstraction, while for node prefix labels, global variables, and global functions, the best option corresponds to one of the intermediate levels. For function cardinality, the best option has the finest level of detail. In future work, we aim to analyze this more deeply and, ideally, learn such configurations automatically, to reduce (or eliminate) the overhead of manually discovering them.

(a) Node Prefix Labels.
(b) Compound Statements.
(c) Global Variables.
(d) Global Functions.
(e) Function Cardinality.
Figure 8: The Distributions of Performance for Configurations with a Fixed Option Type.

A.7 Visualization of CASS and Other Tree Representations

In this section, we provide a visualization of several structures that have been used in code similarity studies. Each structure represents a summation function written in C, as shown in Figure 9. Figure 10 shows a possible CASS of the summation function. The CASS presents the structure of the code body as a tree and captures input and output cardinality in the global attributes table. Figure 11 shows the parse tree that the CASS tree is derived from. Figure 12 shows an abstract syntax tree representation that much existing research [3, 5] is based on. Note that the abstract syntax tree contains only syntactic information from the original code. Figure 13 shows the simplified parse tree used by the Aroma code recommendation engine [29].

int summation(int start_val, int end_val) {
    int sum = 0;
    for (int i = start_val; i <= end_val; ++i) {
        sum += i;
    }
    return sum;
}
Figure 9: A Summation Function in C.
Figure 10: Context-Aware Semantic Structure.
Figure 11: Parse Tree.
Figure 12: Abstract Syntax Tree.
Figure 13: Simplified Parse Tree.

Appendix B Models and Experimental Details

In this section, we describe the models evaluated in our experiments other than MISIM-GNN, and discuss the details of the experimental procedure.

B.1 Model: MISIM-BoF

The MISIM-BoF model takes a bag of manual features extracted from a CASS as its input. The features include those extracted from CASS trees, using the same procedure described in Aroma [29], as well as the entries in CASS GATs. As shown in Figure 14, the output code vector is computed by taking the elementwise mean of the feature embeddings and projecting it into the code vector space with a fully connected layer.
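The mean-pooling-plus-projection step can be sketched as follows. The embedding table and projection weights are random stand-ins (the real model learns them); the 128-dimensional embeddings and code vectors match Appendix B.4, while the vocabulary size and feature IDs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, code_dim = 1000, 128, 128

# Stand-ins for learned parameters (random here, for illustration only).
embedding = rng.normal(size=(vocab_size, embed_dim))
W = rng.normal(size=(embed_dim, code_dim))
b = np.zeros(code_dim)

def misim_bof_code_vector(feature_ids):
    """Elementwise mean of the feature embeddings, projected into
    the code vector space by a fully connected layer."""
    feats = embedding[np.array(feature_ids)]   # (num_features, embed_dim)
    pooled = feats.mean(axis=0)                # elementwise mean
    return pooled @ W + b                      # code vector

vec = misim_bof_code_vector([3, 17, 42])
```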

Figure 14: MISIM-BoF Architecture.

B.2 Model: MISIM-RNN

The input to the MISIM-RNN model is a serialized representation of a CASS. Each CASS tree, representing a function in the program, is converted to a sequence using the technique proposed in [22]. The GAT entry associated with a CASS tree is both prepended and appended to the tree’s sequence, forming the sequence of the corresponding function. As illustrated in Figure 15, each function’s sequence first has its tokens embedded and is then summarized into a function-level vector by a bidirectional GRU layer [12]. The code vector for the entire program is subsequently computed by taking the mean and max pooling of the function-level vectors, concatenating these two vectors, and passing the resulting vector through a fully connected layer.
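The final pooling stage can be sketched as follows. The per-function vectors are assumed to already come from the bidirectional GRU (not implemented here), and the projection weights are random stand-ins for learned parameters; the 128-dimensional hidden states and code vectors match Appendix B.4.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, code_dim = 128, 128

# Stand-in for the learned projection of the concatenated [mean; max]
# pooling (random here, for illustration only).
W = rng.normal(size=(2 * hidden_dim, code_dim))
b = np.zeros(code_dim)

def program_code_vector(function_vectors):
    """Combine per-function vectors (assumed GRU outputs) into one
    program-level code vector: mean pooling and max pooling,
    concatenation, then a fully connected layer."""
    fv = np.stack(function_vectors)            # (num_functions, hidden_dim)
    pooled = np.concatenate([fv.mean(axis=0), fv.max(axis=0)])
    return pooled @ W + b                      # program code vector

vec = program_code_vector([rng.normal(size=hidden_dim) for _ in range(3)])
```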

Figure 15: MISIM-RNN Architecture.

B.3 Dataset Filtering

We eliminate some of the programs in the dataset because (i) they are not parsable, (ii) they have hard-coded answers (unfortunately, due to the size of the dataset, we cannot guarantee that all such programs were eliminated), or (iii) they cannot be compiled to LLVM IR. After investigating, we found that the unparsable programs use non-standard coding conventions (e.g., unspecified return types, missing semicolons at the end of structure definitions), which are not recognized by the parser we use (Tree-sitter: http://tree-sitter.github.io/tree-sitter). We remove programs in category (iii) to facilitate a fair comparison with NCC, which takes LLVM IR as input. The resulting dataset consists of 45,780 programs, with 231 to 491 unique solution programs per problem.

B.4 Training Procedure and Hyperparameters

We use the AdamW optimizer [28] with a learning rate of . The training runs for 100 epochs, each containing 1,000 iterations, and the model that gives the best validation accuracy is used for testing. (We observed that the validation accuracy stops increasing before the 100th epoch in all experiments.) The hyperparameters used for the Circle loss are and . For all of our MISIM models, we use 128-dimensional embedding vectors, hidden states, and code vectors. We also apply dropout with a probability of 0.5 to the embedding vectors. To handle rare or unknown tokens, a token that appears fewer than 5 times in the training set is replaced with a special UNKNOWN token.
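The rare-token replacement can be sketched as follows; the helper names are hypothetical, while the count threshold of 5 and the UNKNOWN token come from the text above.

```python
from collections import Counter

def build_vocab(training_tokens, min_count=5):
    """Keep only tokens that appear at least `min_count` times in
    the training set; everything else will map to UNKNOWN."""
    counts = Counter(training_tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def encode(tokens, vocab):
    """Replace out-of-vocabulary tokens with the UNKNOWN token."""
    return [tok if tok in vocab else "UNKNOWN" for tok in tokens]

# "for" occurs 6 times (kept); "goto" occurs twice (replaced).
vocab = build_vocab(["for"] * 6 + ["goto"] * 2)
```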