Learning the Relation between Code Features and Code Transforms with Structured Prediction

07/22/2019 ∙ by Zhongxing Yu, et al. ∙ KTH Royal Institute of Technology ∙ University of Luxembourg ∙ Université de Valenciennes et du Hainaut-Cambrésis (UVHC)

We present in this paper the first approach for structurally predicting code transforms at the level of AST nodes using conditional random fields. Our approach first learns offline a probabilistic model that captures how certain code transforms are applied to certain AST nodes, and then uses the learned model to predict transforms for new, unseen code snippets. We implement our approach in the context of repair transform prediction for Java programs. Our implementation contains a set of carefully designed code features, deals with the training data imbalance issue, and comprises transform constraints that are specific to code. We conduct a large-scale experimental evaluation based on a dataset of 4,590,679 bug fixing commits from real-world Java projects. The experimental results show that our approach predicts the code transforms with a success rate varying from 37.1% (for the harder, multiple-transform case) to 61.1% (for the single-transform case).


1. Introduction

The source code of a computer program evolves under a sequence of edits. Those edits are not random: they capture the evolution of the specification and they reflect the semantic constraints of the programming language. In other words, there is a probability distribution over the edit space. This probability distribution is conditioned by two factors: the programming language and the current version of the program. In this paper, we make a contribution to better capture this probability distribution.

This is an important problem. It has implications in two research areas: program synthesis and program repair. In program synthesis, capturing the probability distribution of code evolution is one technique to steer the search in the immense synthesis search space (Gulwani et al., 2017). In program repair, this probability distribution can not only enable a more focused exploration of the search space, but can also result in better patches (Long and Rinard, 2016). If one contemplates the umbrella topic of “automated code evolution”, which also includes for example super-optimization (Joshi et al., 2002), refactoring (Mens and Tourwé, 2004), and diversification (Homescu et al., 2013), one can consider that the probability distribution of edits over a given program is the foundation to achieve effective search.

Computing the probability distribution of edits given a program is an unsolved problem. Yet, it is being investigated, especially with the idea of approximating the distribution with a data-driven approach by analyzing past software evolution (Le et al., 2016; Wang et al., 2018; Chen et al., 2018). For instance, in the context of automated program repair, the Prophet system (Long and Rinard, 2016) analyses a set of past commits extracted from version control systems in order to compute the likelihood of a given patch. The fundamental problem is a representation problem: we need to identify the proper representation for both the program and the edit. If the representation is too fine-grained (e.g., at the character or token level), one would need a tremendous amount of data or memory to capture the probability distribution. If the representation is too coarse-grained, the relation between code and edits becomes too vague to be useful for driving automated evolution.

In this paper, we propose an approach to learn the relation between code and edits. This approach is novel and effective. Its novelty lies in the representation of both programs and edits.

Representing programs. Programs are represented with a rich combination of abstract syntax trees and carefully engineered features. This representation has two advantages: 1) the learning algorithm has access to the full program information (all tokens), as well as to the AST tree structure and the AST node types; 2) the learning algorithm does not have to extract the probability distribution from scratch; it can leverage the human knowledge encoded in the features to better and faster separate the signal from the noise in the learning data.

Representing edits. In our work, edits are defined as follows. An ‘edit’ is a basic operation performed on the abstract syntax tree (AST) of the program before the change. A ‘diff’ is the complete set of edits done in an atomic code change, as captured by a commit in a version control system. A code transform is an abstract view over a diff, capturing a single edit or a group of several conceptually related edits. In this paper, the core unit of representation for edits is the code transform. We carefully design those code transforms with two main requirements: 1) they must be automatically extractable from the commit history with high accuracy, to minimize noise in the input data, and 2) they must be precise enough to be automatically applied and to yield a well-formed program. In this paper, we present sixteen code transforms that are specifically designed for program repair.

Learning algorithm. The learning machinery is provided by structured prediction, in particular conditional random fields (CRFs) (Lafferty et al., 2001). Structured prediction is a branch of machine learning that is well suited for tree-based data, which is our case with ASTs. In recent research, it has been successfully used on programs, including automatic deobfuscation (Raychev et al., 2015) and automatic renaming (Alon et al., 2018). Overall, our system takes as input ASTs annotated with additional information, and predicts the most likely code transforms to be applied given the input AST. This means that our learned model captures the probability distribution of edits at the level of the considered transforms.

We fully implement our approach in the context of Java programs and program repair. Our prototype system takes as input a set of past commits and produces a probabilistic model. This model is then used to predict the code transforms to be used for repairing a defect.

We evaluate our approach on 4,590,679 bug fixing commits. With an original experimental methodology, we measure to what extent our approach correctly predicts the code transforms to be applied on the program version before the commit. The core idea of the evaluation is known as ‘cross-validation’ and works as follows: we first collect the ground-truth code transforms on all commits and split the data into a training part and a testing part; we then train the model with the ground-truth labels on the training data set; finally, we compare the predicted code transforms on the testing data set against the ground-truth ones that were held out. We perform two series of experiments, one on “single-transform” diffs and one on “multiple-transform” diffs. For “single-transform” diffs, our overall best performing model achieves 61.1% accuracy. For “multiple-transform” diffs, which is arguably a harder prediction problem, our best performing model achieves 37.1% accuracy.

To sum up, our contributions are:

  • A novel approach to predict code transforms based on structured prediction.

  • An implementation of the approach for repair transform prediction for Java programs. Our implementation contains a set of carefully designed code features, deals with the training data imbalance issue, and takes into account constraints on admissible transforms over certain AST nodes.

  • A large-scale experimental evaluation on 4,590,679 bug fixing commits. The evaluation results show that our overall best performance model achieves 61.1% and 37.1% accuracy for “single-transform” and “multiple-transform” cases respectively.

2. Overview

Figure 1. An Overview of Our Approach for Predicting Code Transforms.

In this section, using a working example, we provide an informal overview of our approach for predicting code transforms on AST nodes. Figure 1 gives a graphical overview. It shows the diff of Git commit ecc184b in project Jmist (https://github.com/bwkimmel/jmist/commit/ecc184bc08ee08159cdd79045c2ed0c4245ba59c). The bug consists of two wrong method invocations that should be replaced by invocations to another method. The code transform behind this diff is the replacement of one method invocation by another, which is called a “Meth-RW-Meth” repair transform in this paper. In the diff of Figure 1, there are two instances of this code transform and they are applied to two different AST nodes.

The table at the top right hand side of Figure 1 shows the predictions of the model. The first prediction, with the highest score 0.6, is composed of two “Meth-RW-Meth” code transforms on the nodes identified with indexes 11 and 13. Since these are the actual repair changes to be made, our approach successfully predicts the code transforms for this example. Note that the prediction includes the locations at which to apply the code transforms: “Meth-RW-Meth” points to the two AST nodes corresponding to the wrong invocations. We now outline how our approach achieves this.

Feature Extraction. Given the buggy code snippet, our approach first parses it to construct an AST and then extracts the following two types of features:

  • The first type of feature is based on the characteristics of program elements. For instance, they can be whether the invoked method has overloaded methods and whether the invocation is wrapped with an if-check when called in other parts of the program. These features are engineered, and they relate to code idioms, semantics not directly captured by the AST (such as method overloading), and common usage patterns.

  • The second type of feature is based on the abstract syntax tree. All AST nodes are represented with special vertices, edges, and triangles that are used for structured prediction. For the specific example in Figure 1, an excerpt of this tree structure is shown on the left hand side.
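The first, engineered kind of feature can be sketched as simple boolean predicates computed over the program. The helper below is our own illustration, not the paper's implementation; the method names and the flat symbol table are hypothetical.

```python
def has_overload(method_name, declared_methods):
    """Engineered feature: does the invoked method have overloaded siblings?

    A method is considered overloaded here if its name is declared more
    than once in the (toy) symbol table.
    """
    return sum(1 for m in declared_methods if m == method_name) > 1

# hypothetical symbol table of the enclosing class
declared = ["render", "render", "transform"]
assert has_overload("render", declared) is True
assert has_overload("transform", declared) is False
```

In a real implementation this predicate would be computed by static analysis over the resolved type hierarchy rather than by name counting.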

Offline Model Learning. After extracting the features for all samples in a training dataset of patches, our approach feeds them to a probabilistic model. The model is learned offline and captures how certain repair transforms are applied to certain AST nodes. The model is only learned once and then can be used to do predictions for arbitrary buggy code snippets. The model is parametrized by a set of code transforms.

Code Transforms. The code transforms are defined in terms of their changes on the AST structure. The AST nodes of the buggy code snippet are annotated with labels that indicate the presence of a code transform. We then use a conditional random field (CRF) to learn from these annotated data points. The learning process makes use of the two types of features mentioned above, and establishes the relation between the considered features and the code transforms on AST nodes.

Prediction. Finally, using the extracted features for the new, unseen buggy code snippet, the already learned model assigns likely code transforms to AST nodes. Each assignment comes with a score representing the probability of the transform. For the buggy code snippet shown in Figure 1, the top-3 predictions by our trained model are shown in the table at the right hand side. The most likely prediction, with the score of 0.6, says there is a need to apply two transforms “Meth-RW-Meth” (replace one invocation by another one) to AST nodes with indexes 11 and 13 respectively. This prediction is indeed correct. To repair this bug, we exactly need those two repair transforms suggested by the most likely prediction.

Key Points. We now emphasize the key points of our approach. First, our model performs structured prediction and predicts the repair transforms for all AST nodes of the buggy code snippet given as input. We have a special repair transform called EMPTY which means no repair transform, capturing the absence of change, and we call the other transforms actual repair transforms. The EMPTY predictions are not shown in Figure 1, yet they are indeed outputs of the model. Second, our model predicts transforms on specific AST nodes. For instance, although there are three method invocations involved in the buggy code snippet, the most likely prediction attaches the repair transforms to the actual buggy invocations (i.e., the two wrong calls). Finally, our model can effectively deal with the case where multiple actual repair transforms on different AST nodes are needed. Those joint transforms are learned at training time and given as outputs at prediction time, as shown in Figure 1.

3. Preliminaries

Before describing our approach in detail, we first provide the necessary background and the terminology that will be used throughout the paper. We do structured prediction over code transforms on AST nodes, so we start by defining the AST.

Definition 3.1. (Abstract Syntax Tree). The abstract syntax tree (AST) for a code snippet is a tuple (N, T, r, ch, L, l, V, v) where N is a set of nonterminal nodes, T is a set of terminal nodes, r is the root node, ch is a function that maps a nonterminal node to its children nodes, L is a set of node labels, l is a function that maps a node to its label, V is a set of node values, and v is a function that maps a node to its value (which can be empty).

For simplicity, we use some notations related to the AST in the remainder of this paper. We use ast to refer to a certain AST for a code snippet, n to refer to a certain AST node (either nonterminal or terminal), n(lab, val) to refer to a certain node with label lab and value val, ch(n) to refer to the children nodes of n, and finally l(n) and v(n) to denote the label and value of node n respectively.
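The notation above can be made concrete with a minimal AST data structure. This is our own sketch (the class and field names are ours, not the paper's): `label` plays the role of l(n), `value` of v(n), and `children` of the children function.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A minimal AST node: l(n) = label, v(n) = value, ch(n) = children."""
    label: str
    value: str = ""                               # may be empty, as in Def. 3.1
    children: list = field(default_factory=list)  # empty for terminal nodes

# toy AST for the expression f(x): an invocation node with one child
ast = Node("Invocation", "f", [Node("VariableAccess", "x")])
assert ast.label == "Invocation"
assert ast.children[0].value == "x"
```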

We use conditional random field (CRF) (Lafferty et al., 2001) to do the learning. Before describing CRF in detail, we first give the definition of clique and maximal clique, which are key for understanding CRF.

Definition 3.2. (Clique and Maximal Clique). For an undirected graph G = (V, E), a clique is a subset of vertices of G such that every two distinct vertices in the subset are adjacent. A maximal clique is a clique that cannot be extended by including one more adjacent vertex.

Imagine a simple undirected triangle graph, then there are three 1-vertex cliques (the vertices), three 2-vertex cliques (the edges), and one 3-vertex clique (the triangle), and the 3-vertex clique is the only maximal clique.
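The clique counts for the triangle graph can be checked with a small brute-force enumeration (our own sketch, not part of the paper):

```python
from itertools import combinations

def cliques(vertices, edges, k):
    """All k-vertex cliques: every pair of vertices in the subset is an edge."""
    edge_set = {frozenset(p) for p in edges}
    return [s for s in combinations(vertices, k)
            if all(frozenset(p) in edge_set for p in combinations(s, 2))]

V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c"), ("a", "c")]  # undirected triangle graph
assert len(cliques(V, E, 1)) == 3  # the three vertices
assert len(cliques(V, E, 2)) == 3  # the three edges
assert len(cliques(V, E, 3)) == 1  # the triangle, the only maximal clique
```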

We can now give the definition of conditional random fields (CRFs). CRFs are a kind of discriminative probabilistic graphical model that models the conditional distribution directly, and they have been successfully used in areas such as information retrieval (Pinto et al., 2003) and computer vision (He et al., 2004). The formal definition of a CRF is as follows (Lafferty et al., 2001).

Definition 3.3. (Conditional Random Field) Let X = {X_1, …, X_n} and Y = {Y_1, …, Y_m} be two sets of random variables, G = (V, E) be an undirected graph over Y such that Y is indexed by the vertices of G, and C be the set of all cliques in G. Then (X, Y) is a conditional random field if for any value x of X (i.e., conditioned on X), the distribution p(y|x) factorizes according to G and is represented as:

p(y|x) = (1/Z(x)) ∏_{c∈C} Ψ_c(x, y_c)   (1)

where Z(x) is a normalization function that ensures the probabilities over all possible assignments to y sum to 1, and is defined as:

Z(x) = Σ_{y′} ∏_{c∈C} Ψ_c(x, y′_c)   (2)

Here the sum ranges over the set of possible assignments of y for x. In the above definition, each Ψ_c is a local function that depends on the whole X but only on the subset y_c of variables which belong to the clique c, and its value can be deemed a measure of how compatible the values y_c are with each other. Like Hidden Markov Models (HMMs), each local function has a special log-linear form over a prespecified set of feature functions {f_k}:

Ψ_c(x, y_c) = exp( Σ_k θ_k f_k(x, y_c) )   (3)

Each feature function f_k(x, y_c) is used to score assignments of the subset of variables y_c, and θ_k is the model parameter to learn, representing the weight for feature function f_k. Typically, the same set of feature functions with the same parameters is used for every clique in the graph.
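Equations (1)-(3) can be checked numerically on a toy CRF with two output variables joined by a single edge clique. The weights, labels, and feature below are invented purely for illustration.

```python
import itertools
import math

labels = ["EMPTY", "Meth-RW-Meth"]

theta = [1.5]  # invented weight for the single feature below

def f0(x, y1, y2):
    """Invented edge feature: fires when both nodes keep the EMPTY transform."""
    return 1.0 if y1 == y2 == "EMPTY" else 0.0

def psi(x, y1, y2):
    """Eq. (3): log-linear local function over the feature set {f0}."""
    return math.exp(sum(t * f(x, y1, y2) for t, f in zip(theta, [f0])))

x = "some observed AST"
# Eq. (2): partition function over all output assignments
Z = sum(psi(x, a, b) for a, b in itertools.product(labels, repeat=2))
# Eq. (1): normalized conditional distribution
p = {(a, b): psi(x, a, b) / Z for a, b in itertools.product(labels, repeat=2)}

assert abs(sum(p.values()) - 1.0) < 1e-9        # Z normalizes the distribution
assert max(p, key=p.get) == ("EMPTY", "EMPTY")  # the positively weighted assignment wins
```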

4. Structured Prediction of Source Code Transforms

In this section, we introduce our approach for structured prediction of source code transforms. Our approach works in two phases. Using the historical code transform information for a set of m examples D = {(n_i, t_i)}, where n_i corresponds to the AST of the original code (i.e., before transform) of example i and t_i corresponds to the transforms applied to the nodes of n_i, we first use an offline training phase to learn a probabilistic model that captures the conditional probability p(t | n). Once the model is learned, we then use it to structurally predict the most likely transforms needed for the AST nodes of a new, unseen code snippet.

We first define a conditional random field for the transform prediction task, and then discuss how the approach can be realized in a step-by-step manner. We give a detailed, specific instantiation of the approach in the next section, but the individual steps can be instantiated in other ways.

4.1. CRF for Transform Prediction

To effectively guide the code transformation process, our aim is to precisely predict the needed transforms at the granularity of AST node. We thus first define the following two random fields.

Definition 4.1. (Random field of AST nodes and transforms). Let P = {1, 2, 3, …, Q} be the set of AST nodes where each integer is a unique identifier denoting the nodes traversed in pre-order. We associate a random field N = {N_p | p ∈ P} over the node symbol (a node symbol is a unique combination of node label and node value) at each position p in the AST, and another random field T = {T_p | p ∈ P} over the code transform applied at each position p in the AST.

The realizations of N (denoted by n) will be the actual nodes for a specific input AST, and the realizations of T (denoted by t) will be the actual transforms applied to nodes of the specific input AST. Figure 2 (a) and (b) give an example of the realizations of the two random fields where the transforms are applied to repair the bug.

Figure 2. An example of input tree, output tree, and the corresponding CRF graph

Given the two random fields N and T, we now discuss the choice of the undirected graph over the random variables in T, i.e., the choice of CRF structure for the transform prediction purpose. In a CRF, the more complex the graph, the more kinds of feature functions relating all variables in a clique can be defined, which in turn leads to a larger class of conditional probability distributions. At the same time, however, a complex graph makes exact inference algorithms intractable (Sutton and McCallum, 2012). For the transform prediction problem, we define the undirected graph in the following way:

Definition 4.2. (Graph for random field T) The undirected graph for the random field T = {T_p | p ∈ P} is G = (V, E) such that (1) V = {T_p | p ∈ P} and (2) E = {(T_i, T_j) | pc(i, j) or sib(i, j)}, where pc(i, j) and sib(i, j) denote that there exists a parent-child or an immediate-sibling relationship, respectively, between positions i and j in the input AST.

This undirected graph is chosen because (1) it enables exploring dependencies between transforms applied to different AST nodes in a recursive manner, both vertically and horizontally, and the hierarchical nature of the AST implies dependencies along these two recursions; (2) the maximal cliques in the graph are triangles, and an exact inference algorithm is available for this kind of graph. Figure 2 (c) shows the undirected graph for the random field T in Figure 2 (b).

The defined undirected graph contains three kinds of cliques: (1) node cliques, i.e., the individual vertices of the graph; (2) edge cliques, i.e., the connected pairs of vertices; and (3) triangle cliques, i.e., the connected triangles. For the graph in Figure 2 (c), there are seven node cliques, nine edge cliques, and three triangle cliques. In the remainder of this paper, unless explicitly specified, we refer to cliques of any of these kinds when we talk about cliques.
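Building the graph of Definition 4.2 and collecting its three kinds of cliques can be sketched as follows. The tree below is our own toy example given as a children map in pre-order, not the tree of Figure 2.

```python
from itertools import combinations

# toy AST in pre-order: node 1 has children 2 and 3; node 3 has child 4
children = {1: [2, 3], 2: [], 3: [4], 4: []}

V = set(children)
E = set()
for parent, kids in children.items():
    for k in kids:
        E.add(frozenset((parent, k)))      # parent-child edges
    for a, b in zip(kids, kids[1:]):
        E.add(frozenset((a, b)))           # immediate-sibling edges

node_cliques = sorted(V)
edge_cliques = sorted(tuple(sorted(e)) for e in E)
tri_cliques = [t for t in combinations(sorted(V), 3)
               if all(frozenset(p) in E for p in combinations(t, 2))]

assert edge_cliques == [(1, 2), (1, 3), (2, 3), (3, 4)]
assert tri_cliques == [(1, 2, 3)]  # a parent and two adjacent siblings form a triangle
```

Note how the triangle cliques arise exactly where a parent-child edge pair meets a sibling edge, which is why the maximal cliques of this graph are triangles.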

According to the above definitions of the random fields N and T and of the undirected graph for T, we define a CRF over the transforms t given the observable AST nodes n of the code snippet. As in most applications of CRFs, the observable input variables and the predicted output variables have the same structure, and thus each feature function for a certain clique c typically depends on the subset n_c of n in the same clique. Considering this, our established CRF for transform prediction is as follows:

p(t|n) = (1/Z(n)) ∏_{c∈C} Ψ_c(n_c, t_c)   (4)

where Ψ_c(n_c, t_c) = exp( Σ_k θ_k f_k(n_c, t_c) ) and Z(n) = Σ_{t′} ∏_{c∈C} Ψ_c(n_c, t′_c).

The feature functions are the key components of the CRF, and the learned weight for each of them is critical for controlling the probability of a certain assignment t given the observable n. For instance, to favor a specific assignment t_c for a certain clique c, a feature function f_k can be defined that takes a large numerical value for that assignment. With a learned weight θ_k > 0, the assignment t_c for the clique c will then receive a high conditional probability. We will discuss in more detail how feature functions can be defined for our established CRF in the next section.

4.2. Structured Prediction of Source Code Transforms Using CRF

Having established the CRF for transform prediction, we next describe how our approach is realized. We first show how to define transforms on AST nodes, then illustrate how to define the feature functions used in the CRF, and finally describe how to train the CRF model and use the trained CRF to do prediction.

Transform on AST node. Our approach first needs to define a set of targeted transforms and specify how they are attached to AST nodes. We achieve this based on the concept of basic tree edit operations (Valiente, 2002).

  • Update–UPD(x, val)–Update the value of a node x with the value val.

  • Add–ADD(x, y, i)–Add a new node x. If the parent y is specified, x is inserted as the child of y, otherwise x is added as the new root node.

  • Delete–DEL(x)–Remove the leaf node x from the tree.

  • Move–MOV(x, y, i)–Move the subtree having node x as root to make it the child of a parent node y.

To attach a certain transform t to a certain node n, we require that there exists one of the 4 basic tree edit operations on n and/or other basic tree edit operations on other nodes related to n. Using repair transforms as the example, the next section gives detailed examples of how to define transforms on AST nodes. Note that due to the existence of effective tree differencing tools (Fluri et al., 2007; Falleri et al., 2014), this manner of using basic tree edit operations as the basis for defining transforms on AST nodes also facilitates the preparation of training data.
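The four basic operations can be encoded as simple records, and an edit script is then a list of such operations. This is a sketch under our own naming; the node identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Upd:
    """UPD(x, val): update the value of node x with val."""
    node: str
    val: str

@dataclass
class Add:
    """ADD(x, y, i): insert node x as the i-th child of parent y."""
    node: str
    parent: str
    pos: int

@dataclass
class Del:
    """DEL(x): remove the leaf node x."""
    node: str

@dataclass
class Mov:
    """MOV(x, y, i): move the subtree rooted at x under parent y at position i."""
    node: str
    parent: str
    pos: int

# a toy edit script: renaming one invocation needs only a single UPD
script = [Upd(node="call#11", val="newMethodName")]
assert isinstance(script[0], Upd) and script[0].val == "newMethodName"
```

In practice such scripts are produced by tree differencing tools (e.g., GumTree-style differencers) rather than written by hand.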

Feature functions for CRF. Feature functions are key to controlling the likelihood predictions in CRF. Similar to the feature functions that have been proven useful on other application areas of CRF (He et al., 2004; Pinto et al., 2003), we can consider two types of feature functions for our transform prediction problem.

The first type of feature function is called an observation-based feature function. For a certain clique c, observation-based feature functions typically have the form:

f_k(n_c, t_c) = 1_{t_c = t̃_c} · q_k(n_c)

Here 1_{t_c = t̃_c} is an indicator function that takes the value 1 when t_c = t̃_c and 0 otherwise, and q_k(n_c) is a function on the input which we call the observation function. In other words, the feature function is nonzero only for the single output configuration t̃_c; but as long as this constraint is met, the feature value depends only on the input observation n_c. Put another way, we can think of observation-based features as depending only on the input n_c, but with a separate set of weights (after learning) for each output configuration. In this case, for a certain clique c, we can establish the relation between n_c and t_c by using program analysis to analyze the characteristics of the input nodes n_c. For example, for a node clique which involves a node whose label is invocation (i.e., method call), we can establish an observation function which analyzes whether the called method has overloaded method(s) (returning 1 if yes, and 0 otherwise) and associate the function with the different possible transforms that can be applied to this node. If we are focusing on transforms that repair bugs, the feature function pairing this observation with the “Meth-RW-Meth” transform will hopefully have a relatively large weight after learning from large data. As mentioned in Section 2, “Meth-RW-Meth” denotes a repair transform that replaces a method invocation by another one, including overloaded methods.

The other type of feature function is called an indicator-based feature function, whose construction can be viewed as a pre-processing step before the launch of the learning phase. For a certain clique c, it typically has the form:

f_k(n_c, t_c) = 1_{n_c = ñ_c} · 1_{t_c = t̃_c}

Here ñ_c and t̃_c denote the input and output of a clique observed in any of the training data. The basic idea behind indicator-based feature functions is that for each kind of clique (node clique, edge clique, or triangle clique), we collect from the training data all possible input-output tuples and turn each input-output tuple into a feature function. By learning from a large, representative data set, a weight can then be associated with each input-output tuple, which in turn can be used to predict outputs for new, unseen inputs.

Using repair transforms as the example, we show particular instantiations of observation-based and indicator-based feature functions in the next section.

Learning and Prediction. Given the training data D of m samples, we assume the samples are drawn independently from the underlying joint distribution p(t, n) and are identically distributed, i.e., the training data are IID. The training goal is to automatically compute the optimal weights θ = {θ_k} for the feature functions in a way that achieves generalization. In other words, for the set of AST nodes n of a new code snippet drawn from the same distribution (but not contained in the training data set D), its needed transforms t should be predicted correctly by the learned model. The typical way to train a CRF model is classical penalized maximum (log-)likelihood:

θ* = argmax_θ Σ_{i=1}^{m} log p(t_i | n_i; θ) − (λ/2) Σ_k θ_k²

That is, the weights for the feature functions are chosen such that the training data has the highest probability under the model, with an L2 penalty term to avoid overfitting.
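For illustration, the penalized maximum-likelihood objective can be examined on a one-weight toy model and optimized by a coarse grid search (real CRF training uses gradient-based methods; the data and the logistic form below are our own invention, not the paper's model).

```python
import math

# toy data: (feature value, observed transform), 2 positives and 1 negative
data = [(1.0, 1), (1.0, 1), (1.0, 0)]

def objective(theta, lam=0.1):
    """Penalized log-likelihood of a single-feature logistic model."""
    ll = 0.0
    for x, t in data:
        p1 = 1.0 / (1.0 + math.exp(-theta * x))  # p(t = 1 | x; theta)
        ll += math.log(p1 if t == 1 else 1.0 - p1)
    return ll - lam / 2.0 * theta ** 2           # L2 penalty

# coarse grid search over theta in [-5, 5]
best = max((t / 10.0 for t in range(-50, 51)), key=objective)
assert best > 0.0  # a positive weight: the feature correlates with the transform
```

The L2 penalty keeps the optimum finite even when the data are separable, which is the same role it plays in CRF training.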

Based on the defined feature functions and the learned weight for each feature function, for the AST nodes n of a new code snippet, the conditional probability of each possible transform assignment t can be calculated by substituting the feature functions and the learned weights into formula (4). CRFs typically output the single most likely prediction by using the following query (also known as the MAP or Maximum A Posteriori query (Koller and Friedman, 2009)):

t* = argmax_t p(t | n)
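For small outputs the MAP query can be answered by exhaustive enumeration over assignments (real inference uses dynamic programming over the tree structure). The weights and node labels below are invented for illustration; since Z(n) does not depend on t, maximizing the unnormalized log-score suffices.

```python
import itertools

labels = ["EMPTY", "Meth-RW-Meth"]
theta = {("Invocation", "Meth-RW-Meth"): 2.0}  # invented learned weights

def score(nodes, transforms):
    """Unnormalized log-score: sum of node-clique feature weights."""
    return sum(theta.get((n, t), 0.0) for n, t in zip(nodes, transforms))

nodes = ["Invocation", "Literal"]
# MAP query: argmax over all joint assignments of transforms to nodes
best = max(itertools.product(labels, repeat=len(nodes)),
           key=lambda t: score(nodes, t))
assert best == ("Meth-RW-Meth", "EMPTY")
```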

For code transform prediction problems, there are some specific issues that need to be taken into account during the learning and prediction process. First, for a code snippet, in general only a few actual transforms are applied to certain AST nodes. As a result, the training data is skewed in terms of the number of AST nodes that are associated with actual transforms. If the skew is too large, the learned weights will be biased. Second, there typically exist constraints on the admissible transforms assigned to a certain node n. In particular, the admissible transforms for a node n depend highly on the label of the node itself and of its neighbour nodes.

We will show in detail how to obtain the maximum likelihood estimate and how to do prediction in the next section. In particular, we will present how to deal with the training data imbalance issue and with the constraints on admissible transforms.

5. Repair Transform Prediction

In this section, using repair transforms as an example, we give a full realization of our approach for structured prediction of source code transforms. We first give the definitions of repair transforms on AST nodes, then describe how feature functions are constructed for this specific transform prediction problem, and finally give the full CRF learning and prediction algorithms, which take the specific issues associated with transform prediction into consideration.

5.1. Repair Transforms on AST Nodes

Repair transforms are transforms used to change buggy code into correct code, and they are at the heart of many repair techniques (Kim et al., 2013; Saha et al., 2017; Long et al., 2017). The repair transforms used by these techniques are not at the level of AST nodes and are tried in a fixed order during repair. Our approach, instead, is able to predict the needed repair transforms at the granularity of the AST node. To achieve this, we first give the definitions of repair transforms on AST nodes. Our definitions are built on top of the 4 basic tree edit operations: ADD, DEL, MOV, and UPD.

We now give detailed definitions for 16 repair transforms, which cover some of the most widely used transforms for repairing bugs. Our definitions target the Java language and use Eclipse JDT style ASTs, but can be extended to other languages and other AST implementations. We first give some notations and basic definitions to facilitate the definition of the repair transforms.

Figure 3 presents the notations we use to define the repair transforms on AST nodes. Here a logical expression denotes an expression made up of a set of atomic boolean expressions (through the use of logical operators), such that the logical expression cannot be extended with other atomic boolean expressions. The function P maps a node n to its parent node; root maps a code snippet C to the root node of its AST; d maps a node n to the root node of the definition code of its value when the label of n is Variable Access or Method Call; rl maps a node n to the root node of the related logical expression when the label of n is Conditional If, Logical Operator, or Ternary Operator; st maps a node n to the subtree whose root node is n; another function maps a node n to the related code block when the label of n is Conditional If or Try-Catch; a further function maps an AST to the set of all of its nodes; TE(e, n) is an indicator function for whether there exists a specific basic tree edit operation e on node n; o maps a set of statements in a code block CB to the subset which contains only moved statements; and finally, a last function maps a code block CB to the single statement with the smallest code line index. In addition, when the tree edit operation is UPD, we use a dedicated notation for the new value of the node n after the edit operation.

LogicalOperator = {||, &&}     BinaryOperator = {||, &&, |, ^, &, ==, !=, <, >, <=, >=, <<, >>, +, -, *, /, %}

LogicalExpression = BoolExp ||(&&) … ||(&&) BoolExp     AO = {ADD, DEL, MOV, UPD}

Literal = {Number, String, null}

Figure 3. Definitions and notations used for defining repair transforms on AST nodes

Each repair transform is specified as a tuple (n, name) which denotes that a certain repair transform name is attached to an AST node n. Note that we do not claim that our definitions are complete and cover every case of each transform, but we believe the typical case of each transform is covered. When defining the repair transforms, we focus on the essential tree edit operations. To make the repair transforms valid with regard to the language grammar, some other tree edit operations following up on these essential tree edit operations are implicit. For instance, when there exists an ADD operation which inserts an AST node whose label is Conditional If, some other ADD operations which insert nodes for the condition predicate must accompany it. To avoid clutter, we do not explicitly show these follow-up tree edit operations.

Figure 4 gives the definitions for some repair transforms that target the inner nodes of a statement. We first have two repair transforms which move an expression into (Wrap-Meth) and out of (Unwrap-Meth) a method call. To avoid the case that the move operations arise because of changes in the signature of the involved method call, we add a constraint that there are no tree edit operations on the root node of the method definition and the children nodes of the root node. Note when our definition explicitly involves a certain variable access or method call, we in general have constraints about the tree edit operations on the definitions of the accessed variable or method. We then have two repair transforms about replacing a variable access by another variable access (Var-RW-Var) or a method call (Var-RW-Meth). Similarly, we also have repair transforms that replace a method call by a variable access (Meth-RW-Var) or another method call (Meth-RW-Meth). In particular, there are two sub-cases for Meth-RW-Meth repair transform, corresponding to the case that the name of the replaced method call is different from (Case1) or same with (Case2, i.e., use another overloaded method) that of the original method call respectively. The BinOperator-Rep and Constant-Rep repair transforms replace a binary operator with another binary operator and replace a constant literal with another constant literal respectively. Finally, we have two repair transforms which expand (LogExp-Exp) and reduce (LogExp-Red) an existing logical expression respectively. For these two repair transforms, we add the constraint that for the node corresponding to the inserted (deleted) logical operator, one child node is not subject to any tree edit operations and for the other child , the nodes of all the sub-tree rooted at are subject to ADD (DEL) tree edit operation. 
Besides, as a logical expression can typically be expanded in different ways when it contains several atomic boolean expressions, we associate both of these repair transforms with the root node of the logical expression.


Figure 4. Definitions of the repair transforms targeting inner AST nodes of a statement

Figure 5 gives the definitions for some other repair transforms that mainly target the "virtual root" node of a statement. The "virtual root" node is inserted between the actual root node of a statement and its parent node; we view its label and value as 'virtualroot' and 'null' respectively. We introduce it because it is more reasonable to view some repair transforms as attached to it rather than to the actual root node. For instance, for an assignment statement whose root node is the binary operator '=', it is preferable to view the repair transform that wraps the statement with an 'If' condition check as attached to the virtual root node. Note that this also facilitates the construction of the CRF model, as it typically has the single-label per time step assumption (Sutton and McCallum, 2012) (i.e., a single repair transform per AST node for our problem).

We first have the repair transform which adds an 'If' conditional check to an existing statement, where the added 'If' check does not have an 'Else' block. Note that some other tree edit operations on the 'Then' block can accompany the addition of the 'If' conditional check, and we view the first statement in the 'Then' block whose actual root node is subject to a 'MOV' operation as the 'old' statement and the target of the conditional check. Depending on whether the AST of the added logical expression contains a node with value 'null', we further split the transform into Wrap-IF-N (as shown in Figure 5) and Wrap-IF-O (other, non-null-related check, not shown in Figure 5 for space reasons). When the added 'If' check has an 'Else' block, we have the "Wrap-IfElse" related repair transform, which has three cases: case1, where the 'old' statement is in the 'Then' block; case2, where the 'old' statement is in the 'Else' block; and case3, the addition of a ternary expression. Similarly, we also split it into Wrap-IFELSE-N (as shown in Figure 5) and Wrap-IFELSE-O based on the existence of a node with value 'null'. We then have the Unwrap-IF repair transform which removes the conditional check, where the conditional check can be in the form of an 'If' expression (case1) or a ternary expression (case2). Finally, the Wrap-TRY repair transform wraps an existing statement with "Try-Catch" exception handling.


Figure 5. Definitions of the repair transforms targeting the virtual root node of a statement

5.2. Feature Functions for CRF

We now describe how feature functions are constructed for the repair transform prediction problem. The design and choice of the feature functions are important, because they strongly impact the performance of the CRF model (Sutton and McCallum, 2012). We consider two types of feature functions for our established CRF model: observation-based feature functions and indicator-based feature functions.

5.2.1. Observation-based Feature Functions

Observation-based feature functions analyze the characteristics of input nodes and establish the correlation between those characteristics and the repair transforms applied to nodes. We design observation-based feature functions related with different kinds of program elements reflected in AST nodes, including variable access, method call, logical expression, binary operator, and the whole statement. We first present the characteristics we analyze and then present the observation-based feature functions on top of them.

Node Characteristics.

Depending on the label of the AST node, we analyze different kinds of characteristics associated with it; the characteristics can generally be classified into three kinds based on their nature.

Type Related. For a node n whose label is variable access, we have six characteristics related to the type of the accessed variable var. (V1) The type of var is primitive; (V2) The type of var is an object type; (V3) var is an instance of the class in which it resides; (V4) There exist variables in scope (i.e., accessible) that are type compatible with var; (V5) There exist method definitions or method calls in the class for which at least one of their parameters is type compatible with var; (V6) There exist method definitions or method calls in the class whose return type is type compatible with var.

For a node n whose label is method call, we have five type-related characteristics of the accessed method m. (M1) The return type of m is primitive; (M2) The return type of m is an object type; (M3) The types of some parameters of m are compatible with the return type of m; (M4) There exist variables in scope that are type compatible with the return type of m; (M5) There exist method definitions or method calls in the class whose return type is type compatible with the return type of m.

Usage Related. For the variable var accessed by an AST node n, we have eight characteristics related to the usage of var in other statements. (V7) When var is a local variable, it has not been referenced in other statements before the statement in which var resides; (V8) When var is a local variable, it has not been assigned before the statement in which var resides and it is not initialized with a non-null expression at declaration; (V9) When var is a field, it has not been referenced in other statements of the class besides the statement in which var resides; (V10) When var is a field, it has not been assigned in other statements of the class besides the statement in which var resides and it is not initialized with a non-null expression at declaration; (V11) There exist other statements (besides the statement in which var resides) in the class that use some variables of the same type as var, but have a null check guard; (V12) There exist other statements (besides the statement in which var resides) in the class that use some variables of the same type as var, but have a normal check guard; (V13) When var is a parameter of a method call, replacing var with another variable yields another method call used in the class; (V14) When var is a parameter of a method call, replacing var with a method call yields another method call used in the class.

For the method m accessed by an AST node, we have five characteristics concerning the usage of m in other statements. (M6) There exist other statements in the class that use a method call whose signature is the same as that of m, but have a null check guard; (M7) There exist other statements in the class that use a method call whose signature is the same as that of m, but have a normal check guard; (M8) There exist other statements in the class that use a method call whose signature is the same as that of m, but are wrapped with try-catch exception handling; (M9) When m is a parameter of a method call, replacing m with a variable yields another method call used in the class; (M10) When m is a parameter of a method call, replacing m with another method call yields another method call used in the class.

For a node n that is the root of a logical expression e, we have three characteristics concerning the usage of constituent elements of e in other statements. (LE1) There exists a variable var referenced by an atomic boolean expression of e, and there exists another atomic boolean expression in other statements that references variables of the same type as var but differs after identifier substitution; (LE2) There exists a variable var in scope which is not referenced by e, but there exist atomic boolean expressions in other statements which reference variables of the same type as var; (LE3) There exists a boolean variable in scope that is not referenced by e.

Syntax Related. We have two characteristics related to the syntax of the variable var accessed by a node. (V15) There exist other variables in scope whose identifier names are similar to that of var; (V16) There exist method definitions or method calls in the class whose identifier names are similar to that of var. We use the Levenshtein distance metric to measure the difference between two string sequences.

For the method m accessed by a node, we have four syntax-related characteristics. (M11) The identifier name of method m starts with 'get'; (M12) The method m has an overloaded method; (M13) There exist variables in scope whose identifier names are similar to that of m; (M14) There exist method definitions or method calls in the class whose identifier names are similar to that of m. The similarity is also established using Levenshtein distance.

For a statement s, we have five syntax-related characteristics, and these characteristics are attached to the virtual root node of the statement. (S1) The statement kind of s; (S2) The statement kind of the previous statement in the same block as s; (S3) The statement kind of the next statement in the same block as s; (S4) The statement kind of the parent statement of s; (S5) The associated method of s throws an exception or the associated class of s extends an exception type.

For a node n whose label is binary operator, we have four characteristics related to the accessed binary operator bo. (BO1) The operator kind of bo; (BO2) When bo is a logical operator, its operands contain the exclamation mark ! (i.e., the not operator); (BO3) When bo is a logical operator, its operands contain the literal 'null'; (BO4) When bo is a mathematical operator, its operands contain the number '0' or '1'.

We also have three syntax-related characteristics for the root node n of a logical expression e. (LE4) There exists an atomic boolean expression that contains the exclamation mark !; (LE5) There exists an atomic expression which is simply a boolean variable; (LE6) There exists an atomic boolean expression of e that is a null check and there also exists another atomic boolean expression of e that is not a null check (i.e., there exists a mixed check).

For the characteristics related to statement kind (S1-S4) or binary operator kind (BO1), we enumerate all possible kinds and establish a sub-characteristic of the form "The kind is X" for each possible kind X. Typical binary operator kinds in mainstream languages include logical relation, bit operation, equality comparison, shift operation, and mathematical operation, as shown in Figure 3. After doing this, each of the characteristics can be viewed as a boolean-valued function, i.e., a predicate on the node. We use n.c to denote the boolean evaluation result of a certain characteristic c on n.

As there exist structural dependencies between different AST nodes, the characteristic of a certain node can possibly imply repair transforms on other nodes. For instance, the characteristic V7 for a variable access node can be related to a "Wrap-If" related transform on the virtual root node of the statement. To take this into account, we propagate some characteristics of certain child nodes to their parent nodes. First, we propagate the characteristics V7, V8, V9, V10, V11, and V12 of a variable access node n to the virtual root node of the statement and to any method access node that is an ancestor of n. Second, we propagate the characteristics M6, M7, and M8 of a method access node to the virtual root node of the statement. When propagating a certain characteristic c from a node n to another node n', the corresponding characteristic for n' is "There exists at least one child node that has characteristic c", and the predicate value is calculated as the logical OR n'.c = n₁.c ∨ … ∨ nₖ.c, where n₁ to nₖ represent the k children of n' on which characteristic c is defined, including n.
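As a minimal sketch of this propagation rule (not the paper's implementation; the dict-based node representation is an assumption for illustration), the propagated predicate is a logical OR over the relevant children:

```python
def propagated_predicate(children, characteristic):
    """True iff at least one child node exhibits the given characteristic.

    children: list of dicts mapping characteristic names (e.g. "V7") to booleans.
    """
    return any(child.get(characteristic, False) for child in children)
```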

Observation-based Feature Functions. After analyzing the characteristics, the observation functions can be designed as indicator functions of the characteristics. Let C_L denote the set of characteristics we have established for nodes whose label is L. For an AST node n and a certain characteristic c ∈ C_{l(n)}, we define the direct observation function, which evaluates to 1 if n.c holds and to 0 otherwise, and the inverse observation function, which evaluates to 1 if n.c does not hold and to 0 otherwise.

Then, we need to correlate the observation functions with repair transforms. For different types of nodes, the sets of possible repair transforms on them are different. To see which transforms are possible for a certain node, we make use of information from the training data. For the repair transform prediction problem, the training data consists of m buggy programs where, for each program, the repair transforms required to fix the bug are associated with the appropriate AST nodes. For an AST node n, we use t(n) to denote the repair transform associated with it, and we define the viable transform set for nodes whose label is L as the set of transforms t for which some node n with l(n) = L and t(n) = t occurs in the training data. That is, we deem a repair transform possible for a certain type of node when we have observed its occurrence at least once in the training data.
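The viable transform sets can be collected in a single pass over the training data. A sketch, assuming a simple (label, transform) pair representation of training nodes (the label strings are hypothetical):

```python
from collections import defaultdict

def viable_transform_sets(observed_nodes):
    """Map each node label L to the set of transforms seen at least once on
    nodes with label L in the training data.

    observed_nodes: iterable of (label, transform) pairs.
    """
    viable = defaultdict(set)
    for label, transform in observed_nodes:
        viable[label].add(transform)
    return dict(viable)
```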

We finally define observation-based feature functions on top of the observation functions and the viable transform sets. For each label L, each repair transform t in the viable transform set of L, and each characteristic c ∈ C_L, the observation-based direct feature function returns 1 on a node n exactly when l(n) = L, the transform on n is t, and n.c holds; the observation-based inverse feature function returns 1 exactly when l(n) = L, the transform on n is t, and n.c does not hold.

Intuitively speaking, for each possible repair transform t on a node whose label is L, we pair it with each of the characteristics we have designed for nodes with label L to form a feature function. Through training on big data, we obtain weights for the different transform-characteristic pairs. The direct and inverse feature functions explore the correlation in different directions.
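A rough sketch of one direct/inverse feature-function pair (the node representation and transform names are assumptions for illustration, not the paper's code):

```python
def make_observation_features(target_transform, characteristic):
    """Build the (direct, inverse) feature functions for one
    transform-characteristic pair; a node is a dict of boolean characteristics."""
    def direct(node, transform):
        return 1 if transform == target_transform and node.get(characteristic, False) else 0

    def inverse(node, transform):
        return 1 if transform == target_transform and not node.get(characteristic, False) else 0

    return direct, inverse
```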

Note that in this paper we define observation-based feature functions on node cliques; it is possible to define more complex observation-based feature functions on edge cliques and triangle cliques by analyzing characteristics involving all nodes in those cliques. In addition, as we do not analyze the characteristics of all node labels, the observation-based feature functions accordingly do not target all types of nodes.

5.2.2. Indicator-based Feature Functions

The other type of feature function is called the indicator-based feature function, which can be viewed as a pre-processing step over the training data. Recall that l(n) and t(n) are used to denote the label of and the repair transform on an AST node n, respectively. Here, we also use nᵢ to represent the i-th child node of a node n. Given the training dataset D, for each tuple (n, t) in D, we record, for each node clique, edge clique, and triangle clique in n, the tuple of (label, transform) pairs observed on its nodes.

In other words, we enumerate the repair transforms on nodes for the different cliques observed in the specific training example (n, t). For the entire training data set D, we then obtain all observed sets of transforms on nodes by taking the union over all training examples.

Based on the observed sets of transforms on nodes, we finally define the indicator-based feature functions for the different kinds of cliques: each such feature function returns 1 on a clique exactly when the joint assignment of labels and transforms tᵢ on its nodes nᵢ has been observed in the training data. For an edge clique, n₁ and n₂ are the parent node and child node, respectively. For a triangle clique, n₁, n₂, and n₃ are the parent node, left child node, and right child node, respectively. Note that in the remainder of this paper, we use the same notation.
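A sketch of the idea, with cliques encoded as tuples of (label, transform) pairs (a representation we assume for illustration; the label and transform strings are hypothetical):

```python
def observed_assignments(training_cliques):
    """Collect every joint (label, transform) assignment seen on a clique.

    training_cliques: iterable of tuples of (label, transform) pairs —
    1-tuples for node cliques, 2-tuples for edges, 3-tuples for triangles.
    """
    return set(training_cliques)

def indicator_feature(observed, assignment):
    """Fire (return 1) iff this joint assignment occurred in the training data."""
    return 1 if assignment in observed else 0
```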

The indicator-based feature functions defined above do not take the values of the nodes into account. To capture the repair transforms associated with triangle cliques when the value of the left child is the same as that of the right child (e.g., the same variable access), we define another indicator-based feature function for triangle cliques that additionally requires the two child values to be equal.

In the learning phase, we learn a corresponding weight for each of the indicator-based feature functions defined above. Note that the weights and the indicator-based feature functions can vary depending on the training data D, but they are independent of the buggy code snippet for which we are trying to predict repair transforms.

5.3. Learning and Prediction

5.3.1. Learning

Recall that the learning problem is to determine the optimal weights λ = {λₖ} for the feature functions from the training data D. The typical way to train CRFs is by penalized maximum (log-)likelihood, which optimizes the following log-likelihood objective function with respect to the model p(t | n, λ):
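In standard CRF notation, this objective can be written as follows (a reconstruction, assuming m training examples (n⁽ʲ⁾, t⁽ʲ⁾) and weight vector λ):

```latex
\ell(\lambda) = \sum_{j=1}^{m} \log p\left(t^{(j)} \mid n^{(j)}, \lambda\right)
```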

The above objective function treats each training example as equally important. However, one significant characteristic of the repair transform prediction problem (and of transform prediction problems in general) is that the buggy code snippet is nearly correct, and in general only a few actual repair transforms need to be applied to certain AST nodes. Note that we attach a virtual 'EMPTY' repair transform to those nodes which are not associated with any repair transform. Consequently, the training data is skewed in terms of the number of AST nodes that are associated with actual repair transforms. If the skew is too large, the learned weights will be dominated by those training examples with few repair transforms on few nodes, which in turn will lead to predicting that few nodes need to be subject to repair transforms for most new, unseen buggy code snippets. However, correctly predicting those instances which need relatively more repair transforms on nodes is extremely important, as those bugs are much harder to deal with. We call this issue the "transform number imbalance issue".

Addressing the Imbalance Issue. To deal with the training data imbalance problem, there are typically three major groups of solutions: sampling methods (Chawla et al., 2002), cost-sensitive learning (Elkan, 2001), and one-class learning (Tax and Duin, 2004). For our (repair) transform prediction problem, inspired by the approach for dealing with the training data imbalance issue in (Song et al., 2013), we propose a method called "transform distribution aware learning". The method is similar to sampling methods, but does not have the disadvantages of removing important examples in under-sampling and adding redundant examples in over-sampling (which can cause overfitting). The method analyzes the training data before the launch of training and gives more weight to those training examples which have relatively more nodes associated with actual repair transforms. Formally, for the training data set D, we define the set U = { S(t) | (n, t) ∈ D } which contains all the observed numbers of actual repair transforms used for repairing a bug, where S(t) denotes the number of actual repair transforms in t.

We then define a distribution-aware prior for a training example with u actual repair transforms, where aᵤ is the number of training examples in D that have u actual repair transforms, ā is the average number of training examples per element of U, and q is a coefficient that controls the magnitude of the distribution-aware prior.

We then multiply the log probability of each training example in the objective function by its distribution-aware prior, obtaining a new objective function:

Note that when the training data set D has a uniform distribution (i.e., the number of training examples is equal for each u in U) or when the coefficient q equals 0, the new objective function reduces to the typical objective function. Through the use of the distribution-aware prior, more weight can be put on those training examples which have relatively more actual repair transforms attached to nodes and are scarce in D. The larger the coefficient q, the more weight we put on those types of training examples which are scarce in D. Overall, by using the distribution-aware prior, all the training examples in D can be adjusted to have a balanced impact on the learning process.
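One plausible concrete form of the prior, consistent with the properties just stated (it equals 1 everywhere when the distribution is uniform or q = 0, and up-weights scarce transform counts), is (ā / aᵤ)^q, where aᵤ is the number of examples with u actual transforms and ā the average per element of U. A sketch under that assumed form:

```python
from collections import Counter

def distribution_aware_priors(transform_counts, q):
    """transform_counts: S(t) for each training example, i.e. its number of
    actual (non-EMPTY) repair transforms. Returns {u: prior(u)}."""
    per_u = Counter(transform_counts)            # a_u for each u in U
    avg = len(transform_counts) / len(per_u)     # average examples per element of U
    return {u: (avg / a_u) ** q for u, a_u in per_u.items()}
```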

After substituting the CRF model into the new objective function, the objective becomes:

Regularization. As our CRF contains a large number of feature functions, we use regularization to avoid over-fitting of the learned weights. Regularization can be viewed as a penalty on weight vectors whose norm is too large. We use the typical penalty based on the Euclidean norm of the weight vector, with the strength of the penalty determined by the regularization parameter. The regularized objective function is then:

The regularized objective function is concave, and thus every local optimum is also a global optimum. However, it cannot in general be maximized in closed form, so numerical optimization is used; a particularly successful method is L-BFGS (Liu and Nocedal, 1989). L-BFGS belongs to the family of quasi-Newton methods and can be used as a black-box optimization routine by feeding it the value and first derivative of the objective function. The first derivative of the objective function for each parameter is:
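Following the standard CRF gradient, and writing wⱼ for the distribution-aware prior of example j and σ² for the regularization parameter (the symbols are our notation), the derivative has the form (a reconstruction):

```latex
\frac{\partial \ell'}{\partial \lambda_k}
  = \sum_{j} w_j \sum_{c} f_k\left(t_c^{(j)}, n_c^{(j)}\right)
  - \sum_{j} w_j \sum_{c} \sum_{t'_c} p\left(t'_c \mid n^{(j)}, \lambda\right) f_k\left(t'_c, n_c^{(j)}\right)
  - \frac{\lambda_k}{\sigma^2}
```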

where the innermost sum ranges over all assignments to t in the clique c.

The computation of the first term is straightforward (i.e., summing the feature function values over the training data set), but calculating the second term requires computing marginal probabilities over cliques, which is an inference task that we discuss below.

5.3.2. Inference

A CRF typically outputs the most likely prediction using the MAP query. As noted in the section about learning, the learning process also needs to calculate the marginal probability for a certain clique c. These are the two inference problems that arise in CRFs, and they can be seen as fundamentally the same operation on two different semirings (Sutton and McCallum, 2012): to change the marginalization problem into the maximization problem, we just need to substitute maximization for addition.

When the undirected graph associated with a CRF has cycles, approximate inference algorithms typically have to be used. However, one advantage of our CRF model is that the maximal clique in the undirected graph is a triangle, for which efficient exact inference algorithms are available. The process first uses the junction tree algorithm to transform the graph into a tree, and then belief propagation is used to do the inference (Sutton and McCallum, 2012). We refer readers to (Jensen and Jensen, 1994) for details about the junction tree algorithm and to (Pearl, 1982) for details about the belief propagation algorithm.

The belief propagation algorithm is also called the message-passing algorithm, and the marginal distributions are recursively computed using messages exchanged between the nodes of the junction tree. In terms of the original undirected graph of the CRF, let c be a maximal clique and let N(c) be its neighbour maximal clique set, i.e., the set of maximal cliques that share nodes with c. One can informally interpret message passing as follows: the marginal distribution of c is determined by summing over all the admissible label assignments for the nodes of each clique in N(c), and the marginal distribution of each such clique in turn relies on all the admissible label assignments of its own neighbouring maximal cliques.
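The semiring view can be illustrated on a toy two-variable model: swapping the aggregation from sum to max turns marginalization into maximization. A self-contained sketch, unrelated to the tool's actual code:

```python
def two_node_inference(unary, pairwise, combine, aggregate):
    """Exhaustive inference on a two-variable model under a generic semiring.

    aggregate = addition -> partition function (marginalization);
    aggregate = max      -> MAP score (maximization).
    """
    result = None
    for a, ua in enumerate(unary[0]):
        for b, ub in enumerate(unary[1]):
            score = combine(combine(ua, ub), pairwise[a][b])
            result = score if result is None else aggregate(result, score)
    return result
```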

Constraint on Valid Repair Transforms. One important characteristic of the (repair) transform prediction problem is that the admissible repair transforms assigned to a certain node n are highly dependent on the label of n itself and of its neighbour nodes. For instance, among our 16 defined repair transforms, a repair transform t on the virtual root node of a statement is valid only when t is one of the transforms defined in Figure 5. To accommodate this, we define a set of constraints as follows:

where N, E, and T refer to the node clique, edge clique, and triangle clique respectively, and each constraint is a boolean function indicating whether the joint assignment (t, n) for a clique of kind k violates the i-th constraint established for that clique kind.

Constraints restrict the sets of admissible label assignments for cliques, and the message-passing algorithm can easily be modified to take constraints into account: only admissible label assignments are considered in the messages. Constraints can be established according to domain knowledge about what kinds of repair transforms are possible for certain nodes in a clique. In this paper, we define constraints based on the training data set D. The idea behind this is that the training data set D is large and representative enough, so for an assignment to be admissible, we should have observed its occurrence in the training data. Using the notations of Section 5.2.2, for the different types of cliques, we define the set of admissible repair transforms for different node labels as follows:

where the admissible sets for node, edge, and triangle cliques are sets whose elements are 1-tuples, 2-tuples, and 3-tuples, respectively.

Based on the established admissible repair transforms for different node labels, we can determine whether the assignments for different cliques violate the constraint as follows:

where the value 1 means the constraint is violated and the assignment is not admissible.
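Constraint checking then reduces to membership tests against the observed assignments, which also prunes the enumeration during message passing. A sketch with assumed data shapes (label and transform strings are hypothetical):

```python
from itertools import product

def admissible_assignments(labels, transforms, observed):
    """Enumerate only the transform assignments for a clique that were
    observed in the training data.

    labels: tuple of node labels in the clique;
    observed: set of tuples of (label, transform) pairs seen in training.
    """
    admissible = []
    for combo in product(transforms, repeat=len(labels)):
        if tuple(zip(labels, combo)) in observed:
            admissible.append(combo)
    return admissible
```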

The use of constraints in (repair) transform prediction is motivated by two reasons. First, they significantly reduce the time complexity of inference, as only admissible transform assignments are considered in the messages. Second, they eliminate incorrect transform assignments according to domain knowledge, resulting in accuracy improvement.

6. Experimental Evaluation

In this section, we present the implementation, the experimental methodology, and the empirical results of predicting repair transforms on Java code.

6.1. Implementation.

We implemented our repair transform prediction approach in a tool called TEZEN. The tool is written in Java and it learns and predicts code transforms for Java source code. It consists of two parts: repair transform extraction, and CRF learning and prediction. For repair transform extraction, we use GumTree (Falleri et al., 2014) to extract the AST tree edit script. GumTree is an off-the-shelf, state-of-the-art tree differencing tool that computes AST-level program modifications and outputs them as the 4 basic tree edit operations: UPD, ADD, DEL, and MOV. We also use Spoon (Pawlak et al., 2015) to analyze the code that surrounds the AST nodes affected by tree edit operations. Besides, PPD (Madeiral et al., 2018) is employed to facilitate the detection of certain repair transforms. Our CRF model is implemented on top of the XCRF library (Jousse et al., 2006), a framework for building CRFs to label XML data. We extend XCRF to incorporate our specific feature functions, learning, and prediction algorithms. In particular, a major modification at both the conceptual and the implementation level is the support for computing the top-k predictions, i.e., the predictions with the k highest conditional probabilities.

6.2. Experimental Setup.

Dataset. We use the data set of Soto et al. (Soto et al., 2016) as the source from which to derive our data set of repair transforms on AST nodes. This data set contains 4,590,679 bug fixing commits. When we use GumTree to analyze the diffs of those bug-fixing commits, we find that when the diff is relatively large, the edit scripts are not accurate enough to reflect the real code changes in a significant number of cases. Thus, we limit our repair transform extraction to those bug-fixing commits with relatively small diffs. To achieve this, we first give the definition of a root tree edit operation.

Definition 6.1. (Root Tree Edit Operation). A basic tree edit operation on a node x is a root tree edit operation if the parent node of x already exists in the AST.

For example, for the code change that inserts a method call "call(a,b)", there are three ADD operations: one inserting the node for call, one inserting the node for a, and one inserting the node for b. Of the three operations, only the one inserting call is a root tree edit operation, as the parent node of the other two ADD operations (the call node itself) does not exist in the AST before the change.
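The definition can be checked mechanically given an edit script and the set of pre-change node identifiers. A sketch (the dict field names are assumptions for illustration, not GumTree's API):

```python
def root_edit_operations(edit_script, existing_nodes):
    """Keep only operations whose parent node already exists in the
    pre-change AST (Definition 6.1).

    edit_script: list of dicts with 'node' and 'parent' keys.
    """
    return [op for op in edit_script if op["parent"] in existing_nodes]
```

For the call(a,b) example, only the insertion of the call node qualifies, since the parents of a and b are themselves newly inserted.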

We experimented with different thresholds on the number of root tree edit operations and found that when the threshold is set to 10, the GumTree outputs of the 4 basic tree edit operations are accurate enough to reflect the real code changes in most cases, and we can correctly attach repair transforms to nodes. After setting the threshold to 10, we find that the tree edit operations in a file typically target a single statement. Consequently, we use the AST of the targeted statement as the AST of the buggy code. Finally, we have 261,263 pairs (n, t) extracted from the original 4,590,679 bug fixing commits, where n is the AST of a changed statement and t is the set of repair transforms associated with the nodes of n.

Table 1 shows the numbers for the different repair transforms in our data set. "Single" (resp. "Multiple") refers to the case where the number of actual (i.e., non-empty) repair transforms for a certain instance is one (resp. larger than one). We can see from the table that the majority of the data involves just one actual repair transform applied to a certain node. To check the quality of the data set, we randomly sample 200 examples for each repair transform (in particular, also 200 examples for the case of multiple transforms) and manually check whether the correct repair transforms are attached to the correct AST nodes. The result is shown in the column #Correct (S). Overall, we can see the precision is satisfactory, with a correctness rate of at least 88%.

Repair Transform #Number #Correct (S) Repair Transform #Number #Correct (S)
Wrap-Meth 6,044 95% LogExp-Exp 21,045 98%
Unwrap-Meth 2,613 92% LogExp-Red 2,987 97%
Var-RW-Var 28,861 96% Wrap-IF-N 9,290 93%
Var-RW-Meth 3,638 93% Wrap-IF-O 10,611 95%
Meth-RW-Var 3,801 91% Wrap-IFELSE-N 4,178 90%
Meth-RW-Meth 43,539 92% Wrap-IFELSE-O 10,742 92%
BinOperator-Rep 5,972 97% Constant-Rep 82,295 96%
Unwrap-IF 6,478 96% Wrap-TRY 6,016 96%
Single 250,702 Multiple 10,561 88%
Table 1. Descriptive statistics of the data set. The manual check of the extracted repair transforms verifies the quality of the data.

Cross-validation. We use cross-validation to select parameters of the CRF model and evaluate the performance of the trained CRF model.

First, we randomly select 300 instances for each of the 16 studied repair transforms from the "single-transform" category and 1,000 instances from the "multiple-transform" category as the test set. For the "single-transform" category, the number 300 is chosen because it is around 10 percent of the total number of instances for the studied repair transforms Unwrap-Meth, LogExp-Red, Var-RW-Meth, and Meth-RW-Var (2,613, 2,987, 3,638, and 3,801 respectively), which have fewer instances than the other 12 studied repair transforms in our data set. To enable a direct and fair comparison of the model performance for different repair transforms, we make the number of test instances the same for each repair transform from the "single-transform" category. For the "multiple-transform" category, the number 1,000 is chosen because it is around 10 percent of the total number of instances for that category. To evaluate model performance, we compare the predicted code transforms on the test data set against the ground-truth ones that we extract from the bug-fixing commits.

The remaining instances in the data set are used as the training data set, and we use 10-fold cross-validation to investigate the impact of the parameters and select them accordingly. There are three parameters involved in our CRF model: the regularization parameter, the parameter q that indicates the magnitude of the distribution-aware prior, and finally the L-BFGS iteration parameter G that specifies the number of gradient computations used by the optimization procedure. Higher values of the regularization parameter mean a larger penalty on large weights associated with feature functions, and higher values of G result in higher execution cost for inference.

We split the training data set into 10 equal folds and evaluate the error rate on each fold by training a model on the other 9 folds. The error rate is established using the top-3 evaluation metric described in the next paragraph. We repeat this process with a set of different parameter values and try to identify the parameters with the lowest error rate. The procedure determines that 500 and 200 are good values for the regularization parameter and G respectively, and the following results are based on these two values. The parameter q significantly impacts the prediction results for the "single-transform" and "multiple-transform" categories, and we discuss its impact in detail in the results section.

Evaluation Metric. For each tuple (n, t) in the test data set, we view a prediction by our trained CRF as correct if, for each node of n, the predicted repair transform associated with it is exactly the same as the repair transform from the ground-truth data. We use both top-1 and top-3 to evaluate the model performance. For top-3, we consider the prediction a success if at least one of the 3 predicted sets of transforms exactly matches the expected one.
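The top-k check itself is simple; a sketch where each prediction is a node-to-transform mapping (the node names are hypothetical):

```python
def top_k_success(predictions, ground_truth, k=3):
    """Success iff one of the k most probable predicted transform
    assignments matches the ground truth exactly."""
    return any(pred == ground_truth for pred in predictions[:k])
```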

Baseline. We compare the model performance against a baseline that assigns repair transforms to nodes according to basic statistics on past commits. Using the same data set employed for training the model, we establish a set of tuples <L, T, P>, where L is a node label, T is a repair transform, and P represents the probability of assigning T to nodes with label L. Basically, let N_L denote the number of nodes with label L in the training data set and N_{L,T} denote the number of nodes with label L that are associated with repair transform T; then P = N_{L,T} / N_L. The baseline first assumes that only one actual repair transform is needed and prioritizes each pair of node n and repair transform T according to the established probability for <l(n), T>. When several nodes have the same label, the baseline breaks ties according to their order during pre-order traversal. After all possible single-node transforms have been explored, based on the established set of tuples <L, T, P>, the baseline gradually explores all possible two-node transforms, three-node transforms, and so on.
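The frequency-based probabilities and the single-transform ranking of the baseline can be sketched as follows. This is a minimal sketch under the assumption that training nodes are given as (label, transform) pairs with `None` for untouched nodes; the function names are hypothetical.

```python
from collections import Counter

def baseline_probabilities(training_nodes):
    """Compute P = count(label, transform) / count(label) from training data.

    training_nodes: list of (label, transform) pairs; transform is None for
    nodes not touched by any repair transform.
    Returns a dict mapping (label, transform) -> probability P.
    """
    label_counts = Counter(label for label, _ in training_nodes)
    pair_counts = Counter((label, t) for label, t in training_nodes
                          if t is not None)
    return {(l, t): c / label_counts[l] for (l, t), c in pair_counts.items()}

def rank_single_transforms(preorder_labels, probs):
    """Enumerate single-node candidates, highest probability first.

    preorder_labels: node labels of the snippet in pre-order, so ties between
    nodes of the same label are broken by pre-order position (the index).
    """
    cands = [(p, i, t)
             for i, label in enumerate(preorder_labels)
             for (l, t), p in probs.items() if l == label]
    return [(i, t) for p, i, t in sorted(cands, key=lambda c: (-c[0], c[1]))]
```

Multi-node exploration would then combine these ranked single-node candidates into pairs, triples, and so on, in decreasing order of joint probability.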

In addition, to see the impact of observation-based and indicator-based feature functions, we also compare our approach (“Full Model”) against a “Partial Model” that uses only observation-based feature functions (denoted ‘O’) or only indicator-based feature functions (denoted ‘I’).

Model Statistics. Our full CRF model contains 12,977 feature functions, including 5,910 observation-based feature functions and 7,067 indicator-based feature functions. We store them along with their weights in a JSON file, which is 3.2MB.

6.3. Experimental Results

Our training is run on a cluster node with 24 cores and 512 GB RAM, running Ubuntu 16.04. For the full model, training takes 41 hours for each value of parameter q when λ and G are set to 500 and 200.

Table 2 and Table 3 show the performance of the model on the test set when top-1 and top-3 are used as the evaluation metric, respectively. Recall that the test set contains 1,000 instances for the “multiple-transform” category and 300 instances for each of the 16 studied repair transforms from the “single-transform” category. The numbers in the cells of the tables give the number of test instances that are correctly predicted by a certain approach (either the baseline or one of the trained models) under a certain evaluation metric (either top-1 or top-3). The percentages in the two rows “Overall Accuracy” and “Accuracy” summarize the performance for “single-transform” and “multiple-transform” instances respectively. The “Full Model” refers to the model that uses both observation-based and indicator-based feature functions, and the “Partial Model” refers to the model that uses only observation-based feature functions (denoted ‘O’) or only indicator-based feature functions (denoted ‘I’). For space reasons, we only show the result obtained with the overall optimal value of q (that is, 0.5) for the partial model.

Overall, all of our models (the full model as well as the partial models, under different values of q) perform consistently better than the baseline under both the top-1 and top-3 evaluation metrics. The baseline achieves some prediction accuracy only for the few most widely used repair transforms; it is unable to correctly predict the less used repair transforms and cannot perform “multiple-transform” prediction at all. Our model, in contrast, not only keeps high prediction accuracy for the few most widely used repair transforms, but also provides reasonable prediction accuracy for the less used repair transforms and for “multiple-transform” instances. When top-1 is used as the evaluation metric, the overall best-performing model (q = 0.7) achieves 22.5% and 13.0% accuracy for “single-transform” and “multiple-transform” instances respectively, whereas the baseline achieves only 11.4% and 0%. When top-3 is used as the evaluation metric, our best-performing model (q = 0.5) achieves 61.1% and 37.1% accuracy for “single-transform” and “multiple-transform” instances respectively, whereas the baseline achieves 25.2% and 0%.

Repair Transform | Baseline | Full Model                            | Partial Model
                 |          | q=0.0  q=0.3  q=0.5  q=0.7  q=1.0  q=1.5 | O    I
Wrap-Meth        | 0        | 5      4      4      7      5      15    | 0    0
Unwrap-Meth      | 0        | 28     29     29     31     34     37    | 4    3
Var-RW-Var       | 0        | 92     91     96     92     91     87    | 41   80
Var-RW-Meth      | 0        | 40     36     33     42     31     27    |