With the growth of computing comes a growth in computations. These computations are the product of well-defined algorithms (Cormen:2009:IAT:1614191) implemented as programs. Thanks to the World Wide Web, these programs have become more accessible on various online platforms, ranging from programming tutorials to production-quality code.
Code snippets from open-source hosting sites (github; bitbucket), Community Question and Answer sites (stackoverflow), and downloadable binaries, along with the relevant information, give rise to “Big Code” (prog-big-code). A good part of Big Code consists of several implementations of the same algorithm that differ in a multitude of ways, such as their complexity, their theoretical metrics of execution, and the efficiency of their implementation. While Big Data processing itself poses many challenges, analyzing large volumes of code poses significantly harder problems, because every non-trivial semantic property of programs is undecidable, as stated by Rice’s theorem (Rice-10.2307/1990888).
There is an increased need for understanding the syntax and semantics of programs and categorizing them for various purposes. These include algorithm classification (tbcnn-aaai16; who-wrote-this-code), code search (source-forager; sourcerer), code synthesis (abstract-syn-net), bug detection (wang2016bugram), code summarization (iyer2016summarizing), and software maintenance (Allamanis:learning-nat-coding-conv; Allamanis:suggesting-acc-meth-names; conv-attn-net-extreme-summ; gupta2017deepfix).
Many existing works on representing programs use a form of Natural Language Processing for modeling; all of them exploit the statistical properties of the code and adhere to the Naturalness hypothesis (allamanis2018survey). These works primarily use Word2Vec methods like skip-gram and CBOW (mikolov2013word2vec), or encoder-decoder models like Seq2Seq (seq2seq), to encode programs as distributed vectors.
We extend the above, and propose an Extended Naturalness hypothesis for Programs to consider the properties of the program by posing the problem of obtaining embeddings as a Data and Control flow problem.
Extended Naturalness hypothesis. Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties, along with static and dynamic program analysis information, can be exploited to build effective software engineering tools.
In this paper, we propose IR2Vec, an agglomerative approach for constructing a continuous, distributed vector to represent source code at different (and increasing) levels of IR hierarchy - Instruction, Function and Program. The vectors that are formed lower down the (program abstraction) hierarchy are used to build the vectors at the higher levels.
The initial seed vectors that represent entities are learned by considering statistical properties in a Representational learning framework. Using these seed entity vectors, hierarchical vectors for the input program are formed by considering the static and dynamic analysis information obtained from data and control flow analysis.
We make use of the LLVM compiler infrastructure (Lattner:2004:llvm) to process and analyze the code. The input program is converted to LLVM’s Intermediate Representation (IR), and the IR constructs form the entities whose representation is learnt. This makes our approach to representing programs independent of both the source language and the target architecture.
We show that the embeddings obtained by IR2Vec provide superior results when compared to the previous works (alon2019code2vec; cummins2017end2end; o2013portable-grewe; magni2014automatic), even though these earlier works were designed to solve specialized tasks, and IR2Vec is generic. We also compare our IR2Vec results with the ones of Ben-Nun et al. (ncc); both have similar motivation in generating generic embeddings using LLVM IR.
We demonstrate the effectiveness of the obtained encodings by answering the following Research Questions (RQs) in the later sections:
RQ1: How well do the seed embeddings capture the semantics of the entities in LLVM IR?
As the seed embeddings play a significant role in forming embeddings at higher levels of Program abstraction, it is of paramount importance that they capture the semantic meaning of the entities to differentiate the different programs. We show the effectiveness of the obtained seed embeddings in Sec. LABEL:subsection:seedEmbeddings-eval.
RQ2: How good are the obtained embeddings for solving diverse applications? We show the richness of the embeddings by applying them to different tasks: Program classification, Heterogeneous device mapping, and Prediction of thread coarsening factor in Sec. LABEL:subsection:program-classification, LABEL:subsection:dev-map and LABEL:subsection:thread-coarsening.
RQ3: How scalable is the proposed methodology, and how likely is our method to encounter Out of Vocabulary (OOV) words when compared to other methods? We discuss various aspects in which our encoding is more scalable than the others, and also show that it does not encounter OOV words in Sec. LABEL:subsection:scalability.
RQ4: What is the contribution of data and control flow information to the final embeddings? We repeat the experimentation with which RQ2 was answered (in Sec. LABEL:sec:experimentation) after peeling off the data and control flow information. We thereby show the importance of data flow and control flow information in Sec. LABEL:sec:ablationStudy.
The following are our contributions:
A unique flow-analysis (program theoretic) based encoding to represent programs as vectors using Data and Control flow Information.
Proposal of a Concise and Scalable encoding infrastructure using Agglomerative Methodology which in turn is built from the entities of LLVM IR.
Hierarchy of encodings for Instruction, Function and Program.
Testing of the effectiveness on a variety of tasks involving Program classification, Heterogeneous device mapping and Thread coarsening.
The paper is organized as follows: In Sec. 3, we give some basic background information. In Sec. 4, we explain the methodology followed to form the data and control flow based encodings at various levels. In Sec. LABEL:sec:experimentation, we describe the experimentation, followed by a discussion of the results. In Sec. LABEL:sec:ablationStudy, we compare our model with several of its variants so as to show the strength of the proposed encoding. Finally, in Sec. LABEL:sec:conclusions, we conclude the paper.
2. Related Works
Modeling code as a distributed vector involves representing the program as a vector whose individual dimensions cannot be distinctly labeled. Such a vector is an approximation of the original program, whose semantic meaning is “distributed” across multiple components.
In this section, we categorize some existing works that model code, based on their representations, the applications that they handle, and the embedding techniques that they use.
Programs are represented using standard formats like lexical tokens (Allamanis:suggesting-acc-meth-names; conv-attn-net-extreme-summ; cummins2017end2end), Abstract Syntax Trees (ASTs) (path-based-rep:Alon:2018:GPR:3192366.3192412; Raychev:pred-prog-prop), Program Dependence Graphs (allamanis2018learning), etc. Then, a neural network model like an RNN or one of its variants is trained on the representation to form distributed vectors.
We use LLVM IR (LLVM-LangRef) as the base representation for learning the embeddings in high dimensional space. To the best of our knowledge, we are the first ones to model the entities of the IR—Opcodes, Operands and Types—in the form of relationships and to use a translation based model (transe-Bordes:2013:TEM:2999792.2999923) to capture such multi-relational data in higher dimensions.
In the earlier works, the training to generate embeddings was application specific and programming language specific: Allamanis et al. (Allamanis:suggesting-acc-meth-names) propose a token based neural probabilistic model for suggesting meaningful method names in Java; Cummins et al. (cummins2017end2end) propose the DeepTune framework to create a distributed vector from the tokens obtained from code to solve optimization problems like thread coarsening and device mapping in OpenCL; Alon et al. (alon2019code2vec) propose code2vec, a methodology to represent code using information from the AST paths coupled with attention networks to determine the importance of a particular path to form the code vector for predicting method names in Java; Mou et al. (tbcnn-aaai16) propose a tree based CNN model to classify C++ programs; Gupta et al. (gupta2017deepfix) propose a token based multi-layer sequence to sequence model to fix common C program errors made by students. Other applications like learning syntactic program fixes from examples (Rolim:2017:LSP:3097368.3097417), bug detection (Pradel:2018:DLA:3288538.3276517; DBLP:conf/iclr/WangSS18) and program repair (xiong:2017:ProgramRepair) model the code as an embedding in a high dimensional space, followed by using RNN-like models to synthesize fixes. The survey by Allamanis et al. (allamanis2018survey) covers more such application specific approaches.
On the other hand, our approach is more generic, and both application and programming language independent. We show the effectiveness of our embeddings on both a software engineering task (classifying programs on a real time dataset) and optimization tasks (device mapping and thread coarsening) in Sec. LABEL:sec:experimentation.
Several attempts (Bimodal:Allamanis:2015:BMS:3045118.3045344; ncc; prog-big-code) have been made to represent programs as distributed vectors in continuous space using word embedding techniques for diverse applications. They generate and expose embeddings for the program (ncc), or the embeddings themselves become part of the training for the specific downstream task (alon2019code2vec; Pradel:2018:DLA:3288538.3276517) without being directly exposed.
Our framework exposes a hierarchy of representations at the various levels of the program: Instruction, Function and Program level. We also believe that we are the first to use program analysis based approaches to form these vectors agglomeratively from the base seed encodings obtained from the IR, without using machine learning approaches.
The closest to our work is Ben-Nun et al.’s Neural Code Comprehension (NCC) (ncc), which represents programs using LLVM IR. They use the skip-gram model (skipgram-NIPS2013_5021) on the conteXtual Flow Graph (XFG), which models the data/control flow of the program, to represent the IR. The skip-gram model is trained to generate embeddings for every IR instruction. To avoid Out Of Vocabulary (OOV) statements, they maintain a large vocabulary, one which uses a large number of XFG statement pairs. A more thorough comparison of our work with NCC (ncc) is given in Sec. LABEL:subsection:scalability.
3.1. LLVM and Program semantics
LLVM is a compiler infrastructure that translates source code to machine code by performing various optimizations on its Intermediate Representation (LLVM IR) (Lattner:2004:llvm). LLVM IR is a typed, well-formed, low-level, universal IR that can represent any high-level language and translate it to a wide spectrum of targets (LLVM-LangRef). Being a successful compiler, LLVM provides easy access to existing control and data flow analyses and lets new passes be added seamlessly.
The building blocks of LLVM IR include: Instruction, Basic block, Function and Module (Fig. 1). Every instruction contains an opcode, a type and operands, and every instruction is statically typed. A basic block is a maximal sequence of LLVM instructions without any jumps. A collection of basic blocks forms a function, and a module is a collection of functions. This hierarchical nature of the LLVM IR representation helps in obtaining embeddings at the corresponding levels of the program.
Analyzing the control flow structure of a program involves building a control flow graph (CFG), a directed graph in which each basic block is represented as a vertex, and the flow of control from one basic block to another is represented by an edge. Within a basic block, the flow of execution is sequential. Characterizing the flow of information which flows into (and out of) each basic block constitutes the data flow analysis. As the combination of data flow and control flow analyses information helps to describe the program flow, we use it for formation of embeddings to represent the program.
The control flow of a program is primarily described by its branches. The prediction of probabilities of conditional branches, either statically or dynamically, is called branch prediction (muchnick1997advanced). Static information is obtained by estimating the program profiles statically (Wu:1994:SBF:192724.192725). On the other hand, dynamic information is more accurate and involves dynamic profiling methods (Ball:1992:OPT:143165.143180).
Given a source and destination basic block, the probability with which the branch would be taken can be predicted using the block frequency information generated by profiling the code. This probability is called the branch probability. We use this data for modelling the control flow information.
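As a rough illustration of the definition above, the following sketch derives branch probabilities from edge execution counts of the kind a profiler might report. The block names, the counts and the helper `branch_probabilities` are hypothetical and chosen only for illustration; this is not how LLVM computes its estimates.

```python
from collections import defaultdict


def branch_probabilities(edge_counts):
    """Map (src, dst) -> probability by normalising counts per source block."""
    totals = defaultdict(int)
    for (src, _dst), count in edge_counts.items():
        totals[src] += count
    return {
        (src, dst): count / totals[src]
        for (src, dst), count in edge_counts.items()
        if totals[src] > 0
    }


# Hypothetical profile: how often each CFG edge was taken.
edge_counts = {
    ("entry", "then"): 30,
    ("entry", "else"): 10,
    ("then", "exit"): 30,
    ("else", "exit"): 10,
}

probs = branch_probabilities(edge_counts)
print(probs[("entry", "then")])  # 0.75
print(probs[("entry", "else")])  # 0.25
```

The probabilities of the outgoing edges of each block sum to one by construction.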
3.2. Representational Learning
The effectiveness of a machine learning algorithm depends on the choice of data representation and on the specific features used. Representational Learning is a branch of machine learning that learns the representations of data by automatically extracting the useful features (replearning-review).
In a similar spirit, Knowledge Graph embedding models try to model entities (nodes) and relations (edges) of a knowledge graph in a continuous vector space of $n$ dimensions (knowledge-graph-embedding-survey). In a broad sense, the input to these algorithms are triplets $\langle h, r, t \rangle$, where $h$, $r$, $t$ are $n$-dimensional vectors, with $h$ and $t$ being Entities, and $r$ being a Relation in the observed Knowledge Graph.
Of the many varieties available, we use TransE (transe-Bordes:2013:TEM:2999792.2999923), a translational representational learning model which tries to learn the relationship of the form $h + r \approx t$, given the triplet $\langle h, r, t \rangle$.
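To make the translational idea concrete, here is a minimal sketch of a TransE-style score and the margin-based ranking loss that TransE minimises over observed and corrupted triplets. The three-dimensional vectors, the margin value and the function names are assumptions made for this example, not values used in this work.

```python
import math


def transe_distance(h, r, t):
    """L2 distance between the translated head (h + r) and the tail t."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive, negative, margin=1.0):
    """Ranking loss for one (observed, corrupted) pair of triplets."""
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))


# Hypothetical embeddings: t equals h + r, so the observed triplet fits exactly.
h = [0.25, 0.5, 0.0]
r = [0.25, 0.0, 0.5]
t = [0.5, 0.5, 0.5]
t_corrupt = [0.5, 0.5, 2.5]  # corrupted tail, far from h + r

print(transe_distance(h, r, t))                    # 0.0
print(transe_distance(h, r, t_corrupt))            # 2.0
print(margin_loss((h, r, t), (h, r, t_corrupt)))   # 0.0
```

A loss of zero means the observed triplet already scores better than the corrupted one by at least the margin, so no update would be needed for this pair.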
4. Code Embeddings
In this section, we explain our methodology for obtaining code embeddings at various hierarchy levels of the IR. We first give an overview of the methodology, and then describe the process of embedding instructions and basic blocks (BB) by considering the data flow and control flow information to form a cumulative BB vector. We then explain the process to represent the functions and modules by combining the individual BB vectors to form the final Code Vector.
The overview of the proposed methodology is shown in Fig. 2. Instructions in IR can be represented as a Relationship Graph, with the instruction entities as nodes, and the relations between the entities as edges. A translational learning model is used to learn these relations (Sec. 4.2). The output of this learning is a dictionary containing the embeddings of the entities, called the Seed embedding vocabulary.
The above dictionary is looked up to form the embeddings at various levels of the input program. The Use-Def and Reaching definition (Hecht:1977:FAC:540175; muchnick1997advanced) information is used to form the instruction vector. In this process we also weigh the contribution of each Reaching definition by the probability with which it reaches the current instruction. The instructions which are live are used to form the Basic block Vector. This process of forming the basic block vector using the flow analysis information is explained in Sec. 4.3. The vector to represent a function is obtained by using the basic block vectors of the function. The Code vector is obtained by propagating the vectors obtained at the function level with the call graph information, as explained in Sec. 4.3.3.
4.2. Modelling LLVM IR as relations
4.2.1. Generic tuples
The opcode, type of operation (int, float, etc.) and arguments are extracted from the LLVM IR (Fig. 3a). This extracted IR is preprocessed in the following way: first, the identifier information is abstracted out with more generic information as shown in Tab. 1. Next, the Type information is abstracted to represent a base type ignoring its width.
| Identifier | Generic representation |
| Address of a basic block | LABEL |
4.2.2. Code triplets
From this preprocessed data, three major relations are formed: (1) TypeOf: relation between the opcode and the type of an instruction; (2) NextInst: relation between the opcode of the current instruction and the opcode of the next instruction; (3) Arg: relation between an opcode and its i-th operand. This transformation from actual IR to relations ($\langle h, r, t \rangle$ triplets) is shown in Fig. 3. These triplets form the input to the representation learning model.
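The three relations can be sketched as a small extraction routine over already-preprocessed instructions. The tuple layout, the type name `IntegerTy`, the relation labels `Arg1`, `Arg2` and the helper `build_triplets` are assumptions made for this illustration; the actual extraction operates on LLVM IR.

```python
def build_triplets(instructions):
    """instructions: list of (opcode, type, [operands]) after preprocessing."""
    triplets = []
    for idx, (opcode, ty, operands) in enumerate(instructions):
        # TypeOf: opcode -> type of the instruction
        triplets.append((opcode, "TypeOf", ty))
        # NextInst: opcode -> opcode of the following instruction, if any
        if idx + 1 < len(instructions):
            triplets.append((opcode, "NextInst", instructions[idx + 1][0]))
        # Arg<i>: opcode -> its i-th operand (1-based)
        for pos, operand in enumerate(operands, start=1):
            triplets.append((opcode, f"Arg{pos}", operand))
    return triplets


# Hypothetical preprocessed instructions with abstracted identifiers and types.
instructions = [
    ("alloca", "IntegerTy", []),
    ("store", "IntegerTy", ["VAR", "PTR"]),
    ("load", "IntegerTy", ["PTR"]),
]

triplets = build_triplets(instructions)
for head, relation, tail in triplets:
    print(head, relation, tail)
```

Each printed line is one head, relation, tail triplet of the kind that would be fed to the representation learning model.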
4.3. Instruction Vector
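The output of the learning model is the Seed embedding vocabulary, containing the vector representation of every entity of the relations. Let the entities of instruction $l$ be represented as $O^{(l)}$, $T^{(l)}$ and $A_i^{(l)}$, corresponding to the Opcode, the Type and the $i$-th Argument of the opcode, and let their corresponding vector representations be $[\![O^{(l)}]\!]$, $[\![T^{(l)}]\!]$ and $[\![A_i^{(l)}]\!]$.
Then, an instruction of format $\langle O^{(l)}\ T^{(l)}\ A_1^{(l)}\ A_2^{(l)} \ldots A_n^{(l)} \rangle$ is represented as
\[ W_o\,[\![O^{(l)}]\!] + W_t\,[\![T^{(l)}]\!] + W_a \left( [\![A_1^{(l)}]\!] + [\![A_2^{(l)}]\!] + \cdots + [\![A_n^{(l)}]\!] \right) \]
The weights $W_o$, $W_t$ and $W_a$ are chosen heuristically so as to give more weightage to opcode than type, and more weightage to type than arguments:
\[ W_o > W_t > W_a \]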
This resultant vector representing an instruction is the Instruction vector.
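A minimal sketch of this construction is given below, assuming the instruction vector is a weighted sum of the seed vectors of the opcode, the type and the arguments. The weight values, the two-dimensional seed vectors and the names `W_O`, `W_T`, `W_A` and `instruction_vector` are hypothetical; only the ordering opcode > type > argument is taken from the text.

```python
# Hypothetical weights; only the ordering W_O > W_T > W_A follows the text.
W_O, W_T, W_A = 1.0, 0.5, 0.25


def scale(vector, weight):
    return [weight * x for x in vector]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def instruction_vector(seed, opcode, ty, args):
    """Weighted sum of the seed vectors of the opcode, type and arguments."""
    vec = add(scale(seed[opcode], W_O), scale(seed[ty], W_T))
    for arg in args:
        vec = add(vec, scale(seed[arg], W_A))
    return vec


# Hypothetical 2-dimensional seed embedding vocabulary.
seed = {
    "store": [1.0, 0.0],
    "IntegerTy": [0.0, 2.0],
    "VAR": [4.0, 4.0],
    "PTR": [0.0, 4.0],
}

print(instruction_vector(seed, "store", "IntegerTy", ["VAR", "PTR"]))  # [2.0, 3.0]
```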
4.3.1. Embedding Data flow information
An instruction in LLVM IR may define a variable or a pointer that could be used in another section of the program. The set of uses of a variable (or pointer) gives rise to the use-def (UD) information of that particular variable (or pointer) in LLVM IR (Hecht:1977:FAC:540175; muchnick1997advanced).
In imperative languages, a variable can be redefined; that is, in the flow of the program, it has a different set of lifetimes. Such a redefinition is said to kill the earlier definition of that variable. During the flow of program execution, only a subset of such live definitions would reach the use of that particular variable. This subset of live definitions that reach an instruction is called the Reaching Definitions of the variable for that instruction. We model the instruction vector using such data flow analysis information. Each argument which has already been defined is represented using the embeddings of its Reaching definitions. The Instruction Vector for a reaching definition, if not yet calculated, is computed at this instant.
4.3.2. Branches, Control flow and Profile information
Branches are the key in determining the program structure; which in turn, defines the control flow of the program. But, treating all the definitions which reach the current instruction by different paths equally may be misleading. So as to get a precise meaning of the program’s control flow, each path through which the definitions could reach a use should be treated differently.
The key to doing the above is to take into account the probability of the path determined by each branch. This information can be obtained via static analysis techniques which are readily available in compilers. Using BranchProbabilityInfo (BPI), an analysis pass in LLVM, it is possible to get an estimate of the probability with which the control flows from one basic block to another (llvm-blockfreq; llvm-bpi). The probabilities of all the outgoing edges from a block naturally sum up to one. In program analysis, this information is highly valued and aids other optimization passes, for example in determining the hotness of a basic block, and similar instrumentation purposes. We use this BPI information to find the probability with which a definition could reach its use.
4.3.3. Using BPI to construct Instruction Vector
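If $RD_1, RD_2, \ldots, RD_n$ are the reaching definitions of an argument $A_j^{(l)}$, and $[\![RD_i]\!]$ are their corresponding encodings, then,
\[ [\![A_j^{(l)}]\!] = \sum_{i=1}^{n} p_i\, [\![RD_i]\!] \]
Let $p_i^{(k)}$ be the probability of the definition in instruction $RD_i$ reaching instruction $l$ via path $k$. Then, $p_i = \sum_k p_i^{(k)}$ is the sum of the probabilities of all such paths reaching the instruction $l$ from instruction $RD_i$. The branch probability between two instructions corresponds to the probability of reaching the latter from the former.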
For the cases where the definition is not available (for example, function parameters), the generic entity representation of “VAR” or “PTR” from the seed embedding vocabulary is used. An illustration is shown in Fig. 4, where the reaching definitions reach the instruction as arguments, each with the probabilities labeled on the respective edges.
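The weighting of reaching definitions can be sketched as follows, under two assumptions that are not stated in the text: the probability of a path is taken to be the product of the branch probabilities along its edges, and the CFG is acyclic with no intervening kill of the definition. The block names, probabilities, encodings and helper names are hypothetical.

```python
def reach_probability(cfg, src, dst):
    """Sum, over all paths src -> dst in an acyclic CFG, of the product of
    the edge probabilities along the path (assumed path probability)."""
    if src == dst:
        return 1.0
    return sum(
        prob * reach_probability(cfg, successor, dst)
        for successor, prob in cfg.get(src, {}).items()
    )


def argument_vector(cfg, use_block, reaching_defs):
    """reaching_defs: list of (defining block, encoding of the definition)."""
    result = [0.0] * len(reaching_defs[0][1])
    for def_block, encoding in reaching_defs:
        p = reach_probability(cfg, def_block, use_block)
        result = [acc + p * x for acc, x in zip(result, encoding)]
    return result


# Hypothetical CFG: block -> {successor: branch probability}.
cfg = {
    "A": {"M": 0.5, "U": 0.25, "X": 0.25},
    "M": {"U": 0.5, "X": 0.5},
    "B": {"U": 1.0},
}

# Two definitions, one in block A and one in block B, reach a use in block U.
reaching_defs = [("A", [2.0, 0.0]), ("B", [0.0, 4.0])]

print(reach_probability(cfg, "A", "U"))          # 0.5
print(argument_vector(cfg, "U", reaching_defs))  # [1.0, 4.0]
```

Block A reaches U directly with probability 0.25 and through M with probability 0.5 × 0.5, which gives 0.5 in total under the stated assumption.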
An instruction is said to be killed when the return value of that instruction is redefined. As LLVM IR is in SSA form (ssa:cytron1991efficiently; Lattner:2004:llvm), each variable has a single definition and the memory gets (re-)defined. Based on this, we categorize the instructions into two classes: those that define memory, and those that do not. The first class of instructions is Write instructions, and the second class is Read instructions.
Embeddings are formed for each instruction as explained above. If an embedding corresponds to a write instruction, future uses of the redefined value will take the embedding of this current instruction, instead of the embedding corresponding to its earlier definition, until it gets redefined. This process of Kill and Update, along with the use of reaching definitions for forming the instruction vectors within (and across) the basic block, is illustrated in Fig. 5 (and Fig. 6) for the corresponding Control Flow Graph (CFG).
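The Kill and Update step can be sketched for straight-line code within a single basic block. This is a simplification: the instruction tuples, the location names, the one-dimensional vectors and the helper `propagate` are all hypothetical, and handling across basic blocks would additionally need the reaching-definition information described earlier.

```python
def propagate(instructions, default):
    """Return the embedding each read instruction sees for its location.

    instructions: list of (name, kind, location, vector), where kind is
    "write" or "read" and vector is the embedding of a write instruction.
    """
    current = {}       # location -> embedding of its latest (live) write
    seen_by_read = {}  # read instruction name -> embedding it uses
    for name, kind, location, vector in instructions:
        if kind == "write":
            # Kill the earlier definition of this location and update it.
            current[location] = vector
        else:
            # No definition available: fall back to a generic representation.
            seen_by_read[name] = current.get(location, default)
    return seen_by_read


# Hypothetical straight-line block with two writes to the same location.
block = [
    ("I1", "write", "x", [1.0]),
    ("I2", "read", "x", None),
    ("I3", "write", "x", [2.0]),
    ("I4", "read", "x", None),
    ("I5", "read", "y", None),
]

print(propagate(block, default=[0.0]))
# {'I2': [1.0], 'I4': [2.0], 'I5': [0.0]}
```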
4.3.4. Resolving Circular Dependencies
During the formation of the instruction vectors, circular dependencies between two write instructions may arise if both of them write to the same location and the (re-)definitions are reachable from each other.
Consider calculating the encoding of a write instruction in the CFG shown in Fig. 7. (Though this CFG is the classic irreducibility pattern (Hecht:1977:FAC:540175; muchnick1997advanced), it is easy to construct reducible CFGs with circular dependencies.) Its reaching definitions include a second write instruction whose encoding is needed but is yet to be computed, and which in turn depends on the first. This results in a circular dependency.
This problem of circular dependencies can be solved by posing the embedding equations as a set of simultaneous equations to a solver. For example, the embedding equations of and shown in Fig. 7 would be:
If , and are functions of , then Eqn. 3 becomes:
The above equations need to be solved to get the embeddings for and .
One possible issue that can arise on resolving the circular dependency is that the system may not always have a unique solution (not reach a fixed point (Hecht:1977:FAC:540175; Kam:1976:GDF:321921.321938)). For the example shown in the Fig. 7, if and , then Eqn. (4) would result in a system without a unique solution.
This problem can be overcome by randomly picking one of the equations in the system and slightly perturbing the probability value of that equation, so that the modified system converges to a solution. (Choosing infinitesimally small values could lead to an unbounded increase of the magnitude of the elements in the encoding vector.)
In our entire experimentation shown in Sec. LABEL:sec:experimentation, we however did not encounter such a system.
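To illustrate how a circular dependency can be resolved as a set of simultaneous equations, consider a hypothetical pair of mutually dependent encodings of the form x = a + p·y and y = b + q·x, where a and b collect the terms that do not depend on the other instruction and p, q are probabilities. This form, the values and the helper `solve_pair` are assumptions made for the sketch and are not the equations of Fig. 7.

```python
def solve_pair(a, b, p, q):
    """Solve x = a + p*y and y = b + q*x component-wise.

    Returns (x, y), or None when 1 - p*q == 0 and the system therefore
    has no unique solution.
    """
    det = 1.0 - p * q
    if det == 0.0:
        return None
    x = [(ai + p * bi) / det for ai, bi in zip(a, b)]
    y = [bi + q * xi for bi, xi in zip(b, x)]
    return x, y


a = [1.0, 2.0]
b = [2.0, 0.0]

print(solve_pair(a, b, p=0.5, q=1.0))  # ([4.0, 4.0], [6.0, 4.0])
print(solve_pair(a, b, p=1.0, q=1.0))  # None
```

In this toy system the degenerate case arises when both probabilities are one; perturbing one of them slightly makes `1 - p*q` non-zero and restores a unique solution, at the cost of large magnitudes when the perturbation is very small.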
4.4. Construction of Code vector from Instruction vector
After computing the instruction vector for every instruction of the basic block, we calculate the Basic Block vector by using the embeddings of those instructions which are not killed. If $LIVE[BB_i]$ corresponds to the live instruction set of the basic block $BB_i$ containing the set of instructions $I^{(l)}$, and its kill set is represented as $KILL[BB_i]$, then the corresponding basic block vector is computed as
\[ [\![BB_i]\!] = \sum_{I^{(l)} \in LIVE[BB_i]} [\![I^{(l)}]\!], \qquad I^{(l)} \notin KILL[BB_i] \]
The vector to represent a function $F$ with basic blocks $BB_1, BB_2, \ldots, BB_b$ is calculated as
\[ [\![F]\!] = \sum_{i=1}^{b} [\![BB_i]\!] \]
Our encoding and propagation also take care of programs with function calls; the embeddings are obtained by using the call graph information. For every function call, the function vector of the callee is calculated, and this value is used to represent the call instruction. For functions that are resolved at link time, we just use the embeddings obtained for the call instruction. This final vector represents the function. This process of obtaining the instruction vector and function vector is summarized in Algorithm LABEL:algorithm:inst-vec.
If $[\![F_1]\!], [\![F_2]\!], \ldots, [\![F_f]\!]$ are the embeddings of the functions in the program, then the code vector representing the program $P$ is
\[ [\![P]\!] = \sum_{i=1}^{f} [\![F_i]\!] \]
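The bottom-up aggregation can be sketched as below, assuming that each level is formed by summing the vectors of the level beneath it and that killed instructions are excluded from the basic block vector. The instruction names, vectors, kill sets and helper names are hypothetical, and call-graph propagation is left out.

```python
def vec_sum(vectors, dim):
    total = [0.0] * dim
    for vector in vectors:
        total = [t + x for t, x in zip(total, vector)]
    return total


def basic_block_vector(inst_vectors, kill, dim):
    """Sum the vectors of the instructions that are not in the kill set."""
    live = [vec for name, vec in inst_vectors.items() if name not in kill]
    return vec_sum(live, dim)


def function_vector(blocks, dim):
    """blocks: list of (instruction vectors, kill set), one per basic block."""
    return vec_sum([basic_block_vector(iv, kill, dim) for iv, kill in blocks], dim)


def program_vector(functions, dim):
    return vec_sum([function_vector(blocks, dim) for blocks in functions], dim)


DIM = 2
# Hypothetical program: function f has two basic blocks, function g has one.
f = [
    ({"I1": [1.0, 0.0], "I2": [0.0, 1.0], "I3": [2.0, 2.0]}, {"I1"}),
    ({"I4": [1.0, 1.0]}, set()),
]
g = [({"I5": [0.5, 0.5]}, set())]

print(function_vector(f, DIM))      # [3.0, 4.0]
print(program_vector([f, g], DIM))  # [3.5, 4.5]
```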