An idiom is a syntactic fragment that recurs frequently across software projects. Idiomatic code is usually the most natural way to express a certain computation, which explains its frequent recurrence in code. An idiomatic imperative code fragment often has a single semantic purpose that, in principle, can be replaced with API calls or functional operators.
To illustrate the motivation for this work, consider the imperative code examples in the Hack programming language (Hack is a programming language for the HipHop Virtual Machine, created by Facebook as a dialect of PHP: https://hacklang.org/) in Figure 1(a), (c) and (e). These examples, adapted from the codebase at Facebook, loop over a vector and accumulate some value in the loop body. To capture this idiom, Hack supports a more functional Vec\map_with_key API, and we do find instances where a developer refactored code to replace a loop with a map call; for instance, see (b), (d) and (f). This kind of refactoring is of course not unique to Hack; examples in other programming languages abound, such as using LINQ APIs in C# (Allamanis et al., 2018) or Python’s list comprehensions.
Why do imperative idioms continue to linger in code? This can be attributed to several reasons: (1) developers being unaware of the API that can replace the imperative code, or (2) a new API construct being introduced and imperative locations not being updated consistently to use this construct, or (3) an API that would simplify this idiomatic pattern has not been included in the language yet. Identifying such idiomatic patterns and replacing the idiomatic imperative code with corresponding API calls or operators can help in maintainability and comprehensibility of the codebase. Additionally, identification of common idioms may provide data-driven guidance to language developers for new language constructs (this purpose is outside this paper’s scope).
1.1. Finding missed refactoring opportunities
Suppose we want a tool that looks at past instances of refactorings and then identifies additional opportunities for similar refactorings in either existing or new code. For example, the code in Figure 1(a), 1(c), and 1(e) was replaced, respectively, with the code in Figure 1(b), 1(d), and 1(f), where each of the examples was rewritten using the map API. We want the tool to learn a general pattern that, when present in the code, calls for refactoring. (An even more powerful tool would also suggest refactored code that is a drop-in replacement for the existing code, but the design of semantically correct code transformations is a separate hard problem, outside the scope of this work.)
At first blush, this may look like a pattern matching or clone detection problem, where a code fragment that is a candidate for refactoring might be a clone of code that was refactored in the past to introduce an API call. Another candidate approach might aim to extract generalizable code transformations from a small set of specific examples of transformations (Bader et al., 2019; Meng et al., 2013; Miltner et al., 2019; Polozov and Gulwani, 2015; Rolim et al., 2017, 2018; Gao et al., 2020).
However, identifying the pattern for the Vec\map_with_key API (let alone the transformation) from the examples in Figure 1 is non-trivial for the following reasons:
Code that maps values to an accumulator may have different types. For example, in Figure 1(a) we have string accumulation whereas in Figure 1(c) we have vector accumulation. Therefore, any tool that can identify this pattern would need to identify the semantic pattern that each of the examples in Figure 1 is accumulating to a collection variable.
Code may be interleaved with other code, as in Figure 1(c). In this example, although the accumulation is to a vector variable, the loop contains additional code that operates on functions and variables other than the key and value. Additionally, the code in Figure 1(a) interleaves iteration and string concatenation.
Therefore, a naive syntactic-pattern-based approach would fail to identify a common pattern that matches all examples in Figure 1. A different potential solution is to define a domain-specific language and let developers handcraft custom rules to identify semantic patterns. This requires significant manual effort, and less experienced programmers might not know how to generalize these patterns.
We turn to the statistical pattern mining work by Allamanis and Sutton (Allamanis and Sutton, 2014), which has shown the possibility of finding patterns in abstract syntax trees (ASTs) that correspond to idioms based on their frequency of occurrence. They proposed to use probabilistic tree substitution grammars (pTSG), a non-parametric Bayesian machine learning technique, to find idioms (we give an overview of this technique in Section 3.2). While this is an exciting idea, in practice it does not work very well out of the box, as Allamanis et al. report in their follow-up work (Allamanis et al., 2018) (and as we found as well, see Section 4). Because purely syntactic idioms are oblivious to semantics, they capture only shallow patterns that are not useful for our end purpose.
In subsequent work, Allamanis et al. (Allamanis et al., 2018) propose to use ASTs augmented with variable mutability and function purity information (we give an overview in Section 2). They found that this worked well for refactoring loops with functional constructs in LINQ.
Unfortunately, we found practical issues with the enhancement in (Allamanis et al., 2018) when it comes to our setting: (1) the technique in (Allamanis et al., 2018) requires manual annotations or dynamic analysis to infer them, neither of which was an option for us; (2) it can only match patterns with the exact lexical order of appearance of variables; and (3) it cannot detect patterns interleaved with other code. Based on our pencil-and-paper simulation with known mutability and purity information, their approach fails to learn a pattern that matches the examples in Figure 1. We discuss this limitation further in Section 2.
We propose Jezero, an approach that works around the practical complexities of the work of Allamanis et al. (Allamanis et al., 2018). Our approach adds semantic information to ASTs in a different way: approximate dataflow information represented as an extension of the AST. Jezero automatically learns semantic patterns from a large codebase over tree structures (dataflow augmented ASTs) that are generated by leveraging a cheap, syntactic dataflow analysis. Our key insight is that semantic patterns can be captured as canonical dataflow trees. This observation is inspired by the idea from the seminal Programmer’s Apprentice paper (Rich and Waters, 1988) that high-level concepts can be identified as dataflow patterns. In fact, recent works in the areas of code search (Premtoon et al., 2020), code clone detection (Wang et al., 2020), and refactoring (Meng et al., 2015) also build on this insight and use dataflow analysis to identify semantically similar code.
For instance, a desirable dataflow pattern that summarizes the examples in Figure 1 is the following: foreach contains a datawrite to a collection variable with dataread from first and second primitive variables and the first collection variable in the order of their definition. Concretely, to learn such a pattern, given a set of methods, we collect approximate dataflow information from their abstract syntax trees and construct a dataset of canonicalized trees as described in Section 3.1. We then mine for dataflow patterns using a non-parametric Bayesian ML technique (see Section 3.2) (Allamanis and Sutton, 2014), to which our representation looks just like any other tree. Figure 5 shows the tree pattern that Jezero mined and that matches all three examples of Figure 1. (Due to the statistical process, the patterns mined sometimes may not cover all relevant aspects of the desirable pattern.)
The process of using Jezero involves the following steps: (1) point Jezero at a corpus that is likely to contain instances of the idiomatic pattern we want to uncover; (2) let Jezero mine the patterns and come up with the most suitable one(s) using its ranking heuristics; and (3) use Jezero further to point out locations in the code where similar refactoring can be carried out.
This paper makes the following contributions:
We present a new canonicalized tree representation based on inexpensive dataflow to mine semantic idioms. The approach takes as input a code corpus, generates a tree for each method augmented with dataflow information, and similar to (Allamanis and Sutton, 2014), uses Bayesian learning methods to mine idiomatic patterns. Our tree representation overcomes the practical problems in adopting the closest previous work (Allamanis et al., 2018).
We present Jezero, a tool that implements both idiom learning and identification of refactoring opportunities and that works at the scale of Facebook’s codebase.
We evaluate Jezero on the task of mining idioms for loopy map/filter code. The mining is done over (refactoring) instances per API taken from Facebook’s Hack codebase. Each instance contains two commit versions, one version with the imperative code and the other with the code refactored to use a functional operator. On an evaluation set, we found Jezero’s F1 score to be 0.57, significantly better than a baseline technique without the dataflow enhancement.
We also evaluated Jezero for identifying new, hitherto unknown opportunities for refactoring code to introduce APIs. Using the top-ranked idioms, we then found matches by matching these idioms against the Facebook code base containing Hack methods; the average precision of finding real opportunities was . The baseline, without dataflow, matched a mere 23 locations.
To our knowledge, Jezero is unique in its ability to find refactoring opportunities from legacy code, based on purely unsupervised learning, and without requiring annotations or dynamic analysis.
Moreover, we expect the ideas in Jezero to carry over to other languages such as Python, which over time have provided more succinct ways to express idiomatic code.
2. Background on Mining Idioms
Allamanis and Sutton (Allamanis and Sutton, 2014) have addressed the problem of idiom mining as an unsupervised learning problem and proposed probabilistic tree substitution grammar (pTSG) inference to mine idiomatic patterns. In this work, they mine syntactic idioms from ASTs; however, in their follow-up work (Allamanis et al., 2018) they show that syntactic idioms tend to capture shallow, uninterpretable patterns and fail to capture widely used idioms. Data sparsity and extreme code variability are cited as the reasons for shallow idioms. Therefore, to mine interesting idioms and to avoid sparsity, the authors introduce semantic idioms. Semantic idioms improve upon syntactic idioms through a coiling process (Allamanis et al., 2018). Coiling is a graph transformation that augments standard ASTs with semantic information to yield coiled ASTs (CASTs). These CASTs are then mined using probabilistic tree substitution grammars (pTSG), a machine learning technique for finding salient (and not just frequent) patterns (Cohn et al., 2010). They infer semantic properties such as variable mutability and function purity using a testing-based analysis. For libraries that do not have test suites, the authors manually annotate the required properties. The lower path in Figure 3 shows the overall process.
Using the semantic information, the coiling rewrite phase augments the nodes with variable mutability and distinguishes collections from other variables. The pruning phase retains only subtrees rooted at loop headers and abstracts expressions and control-free statement sequences to regions to reduce sparsity. Specifically, they abstract loop expressions into a single EXPR node, labeled with variable references. Additionally, they use REGION nodes to capture code blocks that do not contain any control statements. These nodes encode the purity of variables used in the code block. The purity node types include read (R), write (W), and read-write (RW), and these nodes further differentiate between primitive (prefixed by U) and collection (prefixed by C) variables. Note that region nodes in this work only consider variable mutability, i.e., whether variables, be they collection or primitive, are read from or written to. While this representation effectively captures a class of semantic idioms, it is not sufficient to capture refactoring idioms that require additional flow information between variables. Figure 2 shows the CASTs for the examples in Figure 1.
Despite the effectiveness of the proposed methods in identifying semantic loop idioms, they suffer from limitations that prohibit direct application for identifying code patterns such as those in Figure 1, which do occur in realistic and large codebases. Specifically,
To construct REGION nodes, prior work relied on manual annotations or testing-based analysis. Both these efforts are expensive and might not be available for all codebases. Certainly, this is not available in legacy codebases.
The augmented trees contain variable references which are numbered based on their lexicographical ordering. Hence, two loops with the same semantic concept but with a different number of variable references will have different patterns. Looking at Figure 2, at this level of abstraction, it is not clear that these trees are about the same idiom.
Further, due to the lexicographical ordering, loops with the same concept but with additional interleaved code statements will most likely have different patterns.
Following these limitations, despite the code in Figure 1 being arguably about very similar constructs, Allamanis et al.’s approach (Allamanis et al., 2018) would fail to consider the examples as being part of the same idiom. The desired idioms, shown in Figure 2, albeit similar, are too distinct (e.g., in the ordering of variables) to be considered the same.
While this is the state of the art in idiom mining, the fact that it requires dynamic analysis makes it impractical for use in our codebase. As such, in Section 4, we will instead compare our approach with the AST-based tree representation for idiom mining proposed by Allamanis and Sutton (Allamanis and Sutton, 2014).
3. Proposed Approach: Jezero
In this work, we propose a new canonicalized dataflow tree representation that overcomes the limitations of prior work listed in Section 2. The upper path of Figure 3 provides an overview of Jezero, the tool implementing this approach; as is clear, the difference with (Allamanis et al., 2018) is that we eschew coiling, and instead work with dataflow augmented trees. Section 3.1 describes the construction of dataflow augmented trees, which is our new technical contribution, and Section 3.2 gives an overview of the unsupervised idiom learning and sampling approach, which is the same as in previous work (Allamanis and Sutton, 2014). Note that (Allamanis and Sutton, 2014) goes directly from code ASTs to pTSG, without any tree augmentation.
3.1. Dataflow Augmented Trees
The key insight of Jezero is that high-level concepts can be identified as dataflow patterns. Furthermore, these patterns can be captured and represented as canonical trees using an inexpensive dataflow analysis procedure. The problem with representing dataflow information in detail is that there is not enough commonality across specific dataflow graphs for a useful semantic pattern to emerge using a statistical process of mining patterns. This challenge is often referred to as the sparsity problem (Allamanis and Sutton, 2014).
Jezero combats the sparsity problem with an abstraction that relies on dataflow information structured in a canonicalized way to capture high-level semantic concepts only. Construction of a dataflow augmented tree entails the following steps. First, we extract approximate type information from the AST. Next, we propose a lightweight algorithm that uses the extracted types and information present in the AST to derive flow information as dataflow tables. To mine useful idiomatic patterns, we propose a new tree representation that is amenable to the unsupervised pattern mining algorithm. This canonicalized tree is constructed using information from the dataflow tables. Additionally, we make the following assumptions: (1) mining trees at method level is sufficient to capture refactoring idioms, (2) it is sufficient to keep the control flow structure and collapse control-free sequence code to a region (an approximation that is often used in static analyses (Rosen et al., 1988)), and (3) the side effects, if any, of function calls are inconsequential to the dataflow information that we intend to capture.
Static Extraction of Type Information.
To encode semantic information, we need to capture the data type of each variable. However, precise type information of the variables is often not necessary and can be counterproductive. For example, the type of variable output in Figure 1(a) is string whereas the type of variable call_stack_nodes in Figure 1(c) is vector. Therefore, having precise type information would lead to different patterns. If, however, the role of those variables in both cases is to act as a collection, we want to ascribe only a collection type to both.
We overcome the need for an expensive analysis algorithm by proposing an approximate type analysis based on information available in the ASTs. In our approach, variables are assigned as either collection, object or primitive type. Each variable is assigned primitive as the default type and, based on hints from the syntax tree, the type may be modified to collection or object. For example, if a variable contains a subscript operator, it is assigned a collection type. Similarly, if a variable contains the arrow operator (or equivalent operators in other programming languages) it is assigned an object type. At the end of this procedure, we have a type table that assigns types for each variable in the method.
As an example, Table 1 summarizes the types for the code example in Figure 1(c). Note that these types are particularly useful for identifying map/filter APIs, and can be tweaked when looking for other API-related patterns. For instance, string types would be necessary when searching for patterns using string APIs.
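The approximate type analysis described above can be sketched in a few lines. The following is only an illustration, not Jezero's implementation: it runs on Python source through the standard `ast` module rather than on Hack, treats attribute access as the stand-in for the arrow operator, and the function name `approximate_types` and the sample snippet are hypothetical.

```python
import ast

def approximate_types(source: str) -> dict[str, str]:
    """Assign 'primitive', 'collection', or 'object' to each variable from syntax hints."""
    tree = ast.parse(source)
    types: dict[str, str] = {}
    # Every variable starts with the default type, primitive.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            types.setdefault(node.id, "primitive")
    # Syntactic hints upgrade the default (assumption: the last hint visited wins).
    for node in ast.walk(tree):
        if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
            types[node.value.id] = "collection"  # subscript operator
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            types[node.value.id] = "object"      # stand-in for the arrow operator
    return types

# Hypothetical snippet used only to exercise the sketch.
SNIPPET = """
first = keys[0]
for key in keys:
    out[key] = user.name + suffix
"""
TYPE_TABLE = approximate_types(SNIPPET)
```

Because the analysis is purely syntactic, a collection that is never subscripted would keep the default primitive type; this is the kind of imprecision the approach accepts in exchange for low cost.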
Static Extraction of Dataflow Information.
We use dataflow tables to capture data writes and data reads that happen in a code block. To construct these tables, we propose a lightweight analysis that derives dataflow information based on the AST and the type table collected in the previous step. We mention at the very outset that this dataflow representation is not intended to be sound in the way that compiler optimizations require. This lets us get away with specific choices that are effective for the purpose at hand. The analysis computes a dataflow table at each control-flow point such as if, foreach, etc. The table encodes the data reads that a variable depends on. Formally, it maps each data write to a set of read references, where a data write is a tuple containing a canonicalized identifier (id) and the variable being referenced.
The identifier for a data write is generated using the variable type and the order of appearance of the write in the current control-flow block. In the case of Figure 1(c), the unique identifier for the variable call_stack_nodes would be collection_write_0 since it is the first collection variable being written to, although it is the second collection variable in the order of appearance. The identifier for a data read is generated using the variable type and the order of appearance in the control-flow block. For the example in Figure 1(c), the read reference for the variable call_stack_nodes would be collection_1. Three examples of flow operations are described next.
The first flow function handles the assignment of an expression to a variable. We compute a unique write reference for the variable and compute read references for each variable in the expression. We then take the union of these references to update the dataflow table; the write reference now maps to this set of read references. The second flow function is for the assignment of the return value of a function call to a variable. We compute a unique identifier and read references as in the previous flow function. Note that we assume that, for the purposes of idiom mining, we can ignore side effects from function calls. The third flow function is for a foreach statement with an empty body. Since we have data writes to two variables (the loop key and value), we create two unique write references. We further compute the read references for each of these variables and update the dataflow table. For a given data write, we identify all read dependencies using a fix-point computation. Table 2 illustrates the state of the dataflow table in two iterations of the fix-point computation for Figure 1(c).
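The construction of a dataflow table with canonicalized identifiers and a fix-point over read dependencies can be sketched as follows. This is a minimal sketch under stated assumptions: writes are given as (target, reads) pairs, the order of first appearance is passed explicitly, and the helper names, the type assignments for nodes and identifier_to_count, and the exact label strings are hypothetical rather than taken from Jezero.

```python
from collections import defaultdict

def canonical_read_ids(variables_in_order, types):
    """Number read references per type, by order of first appearance in the block."""
    counters, ids = defaultdict(int), {}
    for var in variables_in_order:
        if var not in ids:
            kind = types.get(var, "primitive")
            ids[var] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
    return ids

def build_dataflow_table(writes, variables_in_order, types):
    """Map (write id, variable) to the set of read references it depends on."""
    read_ids = canonical_read_ids(variables_in_order, types)
    counters, write_ids, deps = defaultdict(int), {}, {}
    for target, reads in writes:
        if target not in write_ids:
            kind = types.get(target, "primitive")
            write_ids[target] = f"{kind}_write_{counters[kind]}"  # per-type write numbering
            counters[kind] += 1
        deps.setdefault(target, set()).update(reads)
    changed = True
    while changed:  # fix-point: pull in the reads of every variable that is read
        changed = False
        for reads in deps.values():
            extra = set()
            for var in reads:
                extra |= deps.get(var, set())
            if not extra <= reads:
                reads |= extra
                changed = True
    return {(write_ids[t], t): {read_ids[v] for v in r} for t, r in deps.items()}

# Variable names follow the description of Figure 1(c); the types are assumed.
TYPES = {"identifier_to_id": "collection", "call_stack_nodes": "collection",
         "nodes": "collection", "identifier_to_count": "collection"}
ORDER = ["identifier_to_id", "identifier", "id",
         "call_stack_nodes", "nodes", "identifier_to_count"]
WRITES = [("identifier", ["identifier_to_id"]),   # loop header writes
          ("id", ["identifier_to_id"]),
          ("call_stack_nodes", ["identifier", "id", "nodes", "identifier_to_count"])]
TABLE = build_dataflow_table(WRITES, ORDER, TYPES)
```

In this sketch the write to call_stack_nodes receives the label collection_write_0 while its read reference is collection_1, and the fix-point adds collection_0 (the iterated collection) to its read set through the loop key and value.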
A key aspect of this representation of dataflow tables, essential to overcome the limitations of the approaches detailed in Section 2, is the fact that we propose a new canonicalized label (id) for each dataflow operation. Each label is obtained by concatenating the data type of the variable with a number. This canonicalized label helps to overcome the lexical ordering issues of previous approaches. In particular, we propose that each type of data write has its own numbering. For example, primitive writes have their own numbering, which is incremented whenever there is a data write to a primitive variable. This canonicalization allows for interleaved data writes to different types of variables. While the data writes have special numbering, the data read references are computed based on their order of appearance. Hence, the same variable can have different data read and write references.
Nested control-flow blocks require the construction of the dataflow table to take into account the direction of information flow. There are two choices regarding the information flow: (1) top-down, where the dataflow table is carried from the outer to the inner code block, and (2) bottom-up, where the dataflow table is carried from the inner to the outer code block. The choice of information flow influences the type of idioms that we can mine. The top-down information flow makes it possible to capture idioms that require context information. For instance, consider a nested-loop code snippet where, in the inner loop, a collection variable results is populated with the result of function calls on the items in the values collection.
We cannot directly assign the result of Vec\map_with_key to results; rather, we have to Vec\concat it with the items already in results. While the top-down information flow can lead to richer idioms, it suffers from two problems: (1) tree sparsity, since no two code blocks share nearly identical context information, and (2) the learning algorithm proposed by (Allamanis et al., 2018) learns context-free grammars; if we want to use context information, a mining approach that can learn context-sensitive grammars is needed.
On the other hand, bottom-up information flow allows us to identify local patterns, which helps avoid the sparsity problem and is amenable to the grammar learning algorithm. Therefore, in Jezero’s construction, information flow is bottom-up, i.e., from the inner to the outer code block. In this approach, for each control-flow node, we recursively compute the dataflow table for the inner block and update the outer block using a merge operator. The operator takes as input two dataflow tables, one for the outer block and one for the inner block, and returns a merged dataflow table. The merge operator applies three update rules, illustrated next.
Table 3 illustrates the working of the merge operator for the example in Listing 1. Jezero starts by computing the information for each inner control-flow block. Table 3(b) illustrates the dataflow table for the inner foreach loop. Table 3(c) computes the partial dataflow table for the outer block without the nested loop. Now we carry out the merge using the rules of the merge operator. We retain the first three entries of the outer-block table as per the first rule. Next, we discard the first row of the inner-block table as per the second rule. For the remaining write, we need to merge and update the dataflow from the inner block. We use the third rule to collect all read references whose variables are declared in the outer context and update the identifier to reflect their position in the outer code block (see Table 3(c)).
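The bottom-up merge can be sketched as follows. The exact update rules are not reproduced in the text above, so this sketch assumes the three rules as they are paraphrased in the walkthrough: keep outer entries, drop inner writes to variables that are not visible in the outer block, and merge inner writes to outer variables while keeping only reads of outer variables. Tables here map plain variable names to read sets (canonical relabeling is omitted), and the function name, variable names, and example are hypothetical.

```python
def merge_tables(outer, inner, outer_vars):
    """Merge an inner-block dataflow table into the outer-block table (bottom-up)."""
    # Rule 1 (assumed): entries of the outer table are retained as they are.
    merged = {write: set(reads) for write, reads in outer.items()}
    for write, reads in inner.items():
        # Rule 2 (assumed): writes to variables local to the inner block are discarded.
        if write not in outer_vars:
            continue
        # Rule 3 (assumed): keep only reads of variables declared in the outer context.
        visible = {var for var in reads if var in outer_vars}
        merged.setdefault(write, set()).update(visible)
    return merged

# Hypothetical nested loop:
#   foreach ($groups as $group) {
#     $values = load($group);
#     foreach ($values as $item) { $results[] = f($item); }
#   }
OUTER = {"group": {"groups"}, "values": {"group", "groups"}}
INNER = {"item": {"values"}, "results": {"item", "values"}}
OUTER_VARS = {"groups", "group", "values", "results"}
MERGED = merge_tables(OUTER, INNER, OUTER_VARS)
```

After the merge, the write to results depends on values only, and the inner-loop variable item no longer appears, which keeps the pattern local to the outer block.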
Despite the fact that this bottom-up-only dataflow computation is incomplete, and that traditional context-sensitive analysis may provide sound semantic information, we make this choice purposely as it works well for finding idioms.
The pattern mining algorithm we use (see Section 3.2) learns tree fragments given a context free grammar. Therefore, having as a starting point the dataflow information captured as tables, we need a suitable tree representation that is amenable to (tree) pattern mining. In this work, we replace a control-free sequence of statements with trees that capture dataflow information. This tree contains information about the data writes and data reads that happen in a code block. To ensure that these trees are compatible with the underlying learning technique, we propose the following canonicalized tree representation.
To be capable of differentiating between distinct data writes, our proposed tree representation always contains a set of (distinct) child nodes that represent data writes to collection, primitive, or object types. This design choice helps in overcoming the limitation with different numbers of write statements, since the learning algorithm will learn to retain only the common data write pattern (i.e., the common subtree). Additionally, to account for the dataflow in the loop header, we add primitive write dataflow nodes, which show the flow from the collection being iterated over to the loop header variables. Jezero models dataflow tables as trees using the following grammar:
Figure 4 illustrates the region node for the code example in Figure 1(c). For this particular example, we have data writes to two primitive variables in the loop header from the collection variable identifier_to_id, which is captured in the primitive_write subtree. In the body of the loop, there is a write to a collection variable call_stack_nodes. This dataflow is represented as a child node in the collection_write subtree. Since this is the first collection write in the loop body, it is referenced with collection_write_0, and the data reads from identifier, id, nodes, and identifier_to_count are represented as a right-balanced subtree. The fix-point operation identifies that the variable write to call_stack_nodes also depends on the data read from identifier_to_id, since the identifier and id data writes depend on it. The canonicalization can be seen in the data read reference for call_stack_nodes, which is collection_1, whereas the data write reference for the same variable is collection_write_0. Construction of similar canonicalized trees for the other code snippets in Figure 1 helps identify the common pattern mentioned in Section 1. The pattern identified from these trees is that the foreach region contains a collection_write to a collection_0 variable with data reads from primitive_0, primitive_1 and collection_0.
The proposed representation captures information flow between variables in addition to variable mutability, whereas CASTs (Allamanis et al., 2018) capture only variable mutability in their region nodes. Canonicalized labels for each type of data write map the first data write in Figure 1(a), 1(c), and 1(e) to the same variable reference, whereas CASTs map them to different references (see Figure 2). Jezero’s tree representation arranges the dataflow information such that a top-down mining algorithm (Section 3.2) can efficiently extract frequent subtrees (i.e., frequent flow patterns) from large sets of trees.
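To make the shape of such a region node concrete, the sketch below turns a small dataflow table into nested tuples. The grammar itself is not reproduced here, so the node labels, the fixed order of the three write subtrees, and the helper names are assumptions made for illustration only.

```python
WRITE_KINDS = ("collection", "primitive", "object")

def right_balanced(reads):
    """Fold an ordered list of read references into a right-leaning chain."""
    tree = None
    for ref in reversed(reads):
        tree = (ref,) if tree is None else (ref, tree)
    return tree

def region_tree(table):
    """Build a region node that always has one child per write kind.

    `table` maps write ids such as 'collection_write_0' to ordered read references.
    """
    children = []
    for kind in WRITE_KINDS:
        writes = tuple(
            (write_id, right_balanced(reads))
            for write_id, reads in table.items()
            if write_id.startswith(f"{kind}_write_")
        )
        children.append((f"{kind}_write", writes))
    return ("region", tuple(children))

# Reduced version of the pattern described for the examples in Figure 1.
EXAMPLE_TABLE = {
    "primitive_write_0": ["collection_0"],
    "primitive_write_1": ["collection_0"],
    "collection_write_0": ["primitive_0", "primitive_1", "collection_0"],
}
TREE = region_tree(EXAMPLE_TABLE)
```

Keeping an (empty) object_write child even when no object is written gives every region node the same top-level shape, so trees with different numbers of writes can still share a common prefix.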
3.2. Mining Idioms
Allamanis and Sutton (Allamanis and Sutton, 2014) propose probabilistic tree substitution grammars to infer and capture code idioms. A tree substitution grammar (TSG) is an extension to a context-free grammar (CFG), in which productions expand into tree fragments. Formally, a TSG is a tuple $G = (\Sigma, N, S, R)$, where $\Sigma$ is a set of terminal symbols, $N$ is a set of nonterminal symbols, $S \in N$ is the root nonterminal symbol, and $R$ is a set of productions. In a TSG, each production $r \in R$ takes the form $X \rightarrow T_X$, where $T_X$ is a tree fragment rooted at the nonterminal $X$.
The way to produce a string from a TSG is to begin with a tree containing $S$ only, and recursively expand the tree top-to-bottom, left-to-right as in CFGs; the difference is that some rules can increase the height of the tree by more than 1. A pTSG augments a TSG with probabilities, in an analogous way to a probabilistic CFG (pCFG). Each tree fragment in the pTSG can be thought of as describing a set of context-free rules that are used in a sequence. Formally, a pTSG is $G = (\Sigma, N, S, R, \Pi)$, which augments a TSG with $\Pi$, a set of distributions $P_{TSG}(T_X \mid X)$, for all $X \in N$, each of which is a distribution over the set of all rules $X \rightarrow T_X$ in $R$ that have left-hand side $X$.
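The generative view of a pTSG can be illustrated with a toy grammar. The sketch below is not the grammar Jezero learns: the nonterminals, fragments, and probabilities are invented, and fragments are encoded as nested tuples in which a bare string that names a nonterminal marks a substitution site.

```python
# Toy pTSG: each nonterminal maps to (fragment, probability) pairs.
# All symbols and probabilities here are hypothetical.
PTSG = {
    "S": [(("foreach", "HEADER", "BODY"), 1.0)],
    "HEADER": [(("header", "collection_0"), 1.0)],
    "BODY": [
        (("region", ("collection_write", "READS")), 0.7),  # grows the tree by two levels
        (("region",), 0.3),
    ],
    "READS": [
        (("primitive_0",), 0.6),
        (("primitive_0", ("collection_0",)), 0.4),
    ],
}

def expand(symbol, choices, grammar):
    """Expand top-to-bottom, left-to-right; return (tree, derivation probability).

    `choices` is an iterator of rule indices, consumed in expansion order.
    """
    def walk(node):
        if isinstance(node, str):
            if node in grammar:  # substitution site: apply the chosen rule
                fragment, prob = grammar[node][next(choices)]
                tree, sub_prob = walk(fragment)
                return tree, prob * sub_prob
            return node, 1.0     # terminal symbol
        label, *children = node
        out, prob = [label], 1.0
        for child in children:
            tree, sub_prob = walk(child)
            out.append(tree)
            prob *= sub_prob
        return tuple(out), prob
    return walk(symbol)

TREE, PROB = expand("S", iter([0, 0, 0, 1]), PTSG)
```

The first BODY rule inserts a two-level fragment in one step, which is the property that distinguishes a TSG from a CFG; the derivation probability is the product of the probabilities of the rules used.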
The goal of our mining problem is to infer a pTSG in which every tree fragment represents a code idiom. Given a set of trees ($T_1, T_2, \ldots, T_N$) for pTSG learning, the key factor that determines model complexity is the number of fragment rules associated with each nonterminal. If the model assigns too few fragments to a non-terminal, it will not be able to identify useful patterns (underfitting); on the other hand, if it assigns too many fragments, then it can simply memorize the corpus (overfitting) (Allamanis et al., 2018). Furthermore, we do not know in advance how many fragments are associated with each non-terminal. Non-parametric Bayesian statistics (Murphy, 2012; Gelman et al., 2013) provide a simple, yet powerful, method to manage this trade-off for cases where the number of parameters is unknown. In this work, we use the nonparametric Bayesian inference methods proposed by Allamanis and Sutton (Allamanis and Sutton, 2014) to mine refactoring idioms. To infer a pTSG $G$ using Bayesian inference, we first compute a probability distribution over probabilistic grammars, $P(G)$. This distribution is bootstrapped by estimating the maximum likelihood from our training corpus. While this gives a distribution over full trees, we require the distribution over fragments. This is defined as $P_0(T) = \prod_{r \in T} P_{ML}(r)$, where $r$ ranges over the set of productions that are used within $T$. The specific prior distribution that we use is a Dirichlet process. The Dirichlet process is specified by a base measure, which is the fragment distribution $P_0$, and a concentration parameter $\alpha$ that controls the rich-get-richer effect. Given the training trees and the prior distribution, we apply Bayes’ rule to obtain the posterior distribution. The posterior Dirichlet process pTSG is characterized by a finite set of tree fragments for each non-terminal. To compute this distribution, we resort to approximate inference based on Markov Chain Monte Carlo (MCMC) (Liang et al., 2010). Specifically, we use Gibbs sampling to sample the posterior distribution over grammars.
At each sampling iteration, Jezero samples the trees from the corpus, and for each node in the tree it decides whether it is a root or not based on the posterior probability. Jezero adds trees to the sampling corpus and adds tree fragments to the sample grammar based on whether the fragments are roots (denoted by $z_t = 1$ for a node $t$) or not. Next, for each tree node $t$, Jezero identifies the parent $p$ whose $z_p = 1$. Based on the current node and its root parent, Jezero samples whether to merge them into a single fragment or to separate them into different fragments. To do this, Jezero computes the probability of the joint tree (node $t$ and parent $p$), and the split probabilities. Based on a threshold, it either splits the node and tree into fragments or merges them into one fragment. The probability of a fragment $T$ is $P_{post}(T) = \frac{\mathrm{count}(T) + \alpha P_0(T)}{\mathrm{count}(h(T)) + \alpha}$, where $h(T)$ is the root of the fragment, and count is the number of times that a tree occurs as a fragment in the corpus, as determined by the current values of $z$. Once the sampling is complete, Jezero orders the grammar based on the production probability and filters out those rules whose probability is less than a fixed threshold.
In MCMC, it is essential that the samples mix well; hence, Jezero visits the trees in the corpus and their nodes in different orders to introduce further randomness. We seed the sampling process by randomly annotating 90% of the nodes (other annotation values, namely 40% and 60%, yield similar results at the cost of a 3x slowdown in execution time). Furthermore, Jezero incrementally adds trees to the corpus to compute the grammar. Jezero repeats this process for iterations to identify the posterior distribution over fragments, which are then returned as idioms. We further experimented with iterations but no changes in the evaluation scores for the APIs were observed.
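The per-node decision and the randomized visiting order can be pictured with the sketch below. It assumes the textbook Gibbs update, in which the merge decision is drawn with probability proportional to the joint-fragment probability, rather than the exact threshold rule used by Jezero. The functions `merge_probability`, `sample_is_root`, and `gibbs_sweep`, as well as the fixed probabilities passed in, are hypothetical; a real sampler would recompute the probabilities from the current fragment counts.

```python
import random


def merge_probability(p_joint, p_split_parent, p_split_child):
    """Probability of keeping a node and its root parent in one fragment."""
    p_split = p_split_parent * p_split_child
    return p_joint / (p_joint + p_split)


def sample_is_root(p_joint, p_split_parent, p_split_child, rng):
    """Return True if the node becomes the root of its own fragment (a split)."""
    return rng.random() >= merge_probability(p_joint, p_split_parent, p_split_child)


def gibbs_sweep(nodes, is_root, probabilities, rng):
    """One sweep: visit the nodes in a fresh random order and resample each root flag."""
    order = list(nodes)
    rng.shuffle(order)
    for node in order:
        is_root[node] = sample_is_root(*probabilities(node), rng)
    return is_root


rng = random.Random(0)
nodes = ["primitive_write", "loop_var", "if"]
# Random seeding of the root flags. The 0.9 fraction mirrors the 90% annotation in
# the text; treating that annotation as "is a root" is an assumption of this sketch.
is_root = {n: rng.random() < 0.9 for n in nodes}
# Hypothetical fixed probabilities (joint, split parent, split child) for every node.
is_root = gibbs_sweep(nodes, is_root, lambda n: (0.02, 0.1, 0.1), rng)
```

Shuffling the visiting order in every sweep is one simple way to obtain the additional randomness that good mixing calls for.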
Figure 5 shows the top-1 idiom mined by Jezero for the Vec\map_with_key API. Note how this is a prefix of the tree shown in Figure 4, which is the key advantage of the canonicalized representation. Despite the idiom's shallow nature, we show in Section 4 that it is very effective in identifying refactoring opportunities. In particular, according to this idiom, the most common dataflow pattern is a loop with a write to the first collection variable, where the write depends on the loop iteration variable. Jezero returns shallow idioms because of the Gibbs sampling process. Finding deeper idioms is possible by tuning the Gibbs sampling hyper-parameters; we leave this for future work.
3.3. Idiom Ranking
The mining process may compute a large number of idioms; therefore, we need a mechanism to rank the idioms based on their usefulness in identifying refactoring patterns. We identify three different ranking schemes that could help surface interesting patterns. Before ranking the idioms, we first prune away trivial idioms based on two rules: (1) remove idioms that have been seen fewer than a minimum number of times (c_min); and (2) remove idioms that have fewer than a minimum number of nodes (n_min) (Allamanis and Sutton, 2014).
Ranking based on Coverage (C)
Idiom coverage (Allamanis and Sutton, 2014) is computed as the percent of source code dataflow trees that are matched to the mined idiom. Coverage ranges between and , indicating the extent to which the mined idioms exist in a piece of code.
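The pruning rules and the coverage measure can be sketched as follows. The tree encoding (a `(label, children)` pair), the prefix-style matching in `matches_at`, and all example trees and counts are assumptions made for illustration; coverage is returned here as a fraction rather than a percentage.

```python
# A tree is a pair (label, [children]); an idiom is a tree fragment of the same shape.

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def matches_at(idiom, node):
    """True if the idiom is a prefix of the subtree rooted at node."""
    (i_label, i_children), (n_label, n_children) = idiom, node
    if i_label != n_label or len(i_children) > len(n_children):
        return False
    return all(matches_at(i, n) for i, n in zip(i_children, n_children))


def occurs_in(idiom, tree):
    """True if the idiom matches at any node of the tree."""
    return matches_at(idiom, tree) or any(occurs_in(idiom, c) for c in tree[1])


def prune(idioms_with_counts, c_min, n_min):
    """Drop idioms seen fewer than c_min times or having fewer than n_min nodes."""
    return [i for i, count in idioms_with_counts if count >= c_min and size(i) >= n_min]


def coverage(idiom, trees):
    """Fraction of trees in which the idiom occurs."""
    return sum(occurs_in(idiom, t) for t in trees) / len(trees)


idiom = ("foreach", [("primitive_write", [("loop_var", [])])])
tiny = ("primitive_write", [])
trees = [
    ("method", [("foreach", [("primitive_write", [("loop_var", [])]), ("read", [])])]),
    ("method", [("if", [("read", [])])]),
]
kept = prune([(idiom, 12), (tiny, 40)], c_min=5, n_min=2)
idiom_coverage = coverage(idiom, trees)
```

Here `tiny` is frequent but is pruned because it has a single node, while `idiom` survives and covers one of the two example trees.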
Ranking based on Information-theoretic Measure (CE)
To maximize information content and coverage, Allamanis et al. (Allamanis et al., 2018) score an idiom by multiplying its coverage by its cross-entropy gain. Cross-entropy gain measures the expressivity of an idiom: it averages the log-ratio of the posterior pTSG probability of the idiom to the probability yielded by the basic pCFG. This ratio measures how much each idiom “improves” upon the base pCFG distribution.
Ranking based on Jaccard Similarity (IOU)
In addition to the two ranking schemes from prior work, we propose a scheme based on the average area an idiom covers when matching a code location. The score for an idiom is computed as follows:
where is a heuristic function that returns a value between and based on the pre-assigned weight of the root of the idiom (e.g. foreach, if have higher weight than primitive_write). Function computes the number of trees whose subtree is the idiom . The term (intersection over union) measures the average number of nodes that overlap an idiom and its supports.
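One plausible reading of this scheme is sketched below: the score is the root weight multiplied by the mean intersection over union between the idiom and each subtree it matches. Since a matched idiom is a prefix of the subtree, the intersection is taken to be the idiom's node count and the union the subtree's node count. The weight table `ROOT_WEIGHT`, the default weight, the tree encoding, and the exact way the terms are combined are assumptions of this sketch and may differ from the formula used by Jezero.

```python
# A tree is a pair (label, [children]); an idiom is a tree fragment of the same shape.

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def matches_at(idiom, node):
    """True if the idiom is a prefix of the subtree rooted at node."""
    (i_label, i_children), (n_label, n_children) = idiom, node
    if i_label != n_label or len(i_children) > len(n_children):
        return False
    return all(matches_at(i, n) for i, n in zip(i_children, n_children))


def matched_subtrees(idiom, tree):
    """Yield every subtree of tree at whose root the idiom matches."""
    if matches_at(idiom, tree):
        yield tree
    for child in tree[1]:
        yield from matched_subtrees(idiom, child)


# Hypothetical root weights: control-flow roots count more than primitive writes.
ROOT_WEIGHT = {"foreach": 1.0, "if": 1.0, "primitive_write": 0.5}


def iou_score(idiom, trees, default_weight=0.1):
    """Root weight times the mean IoU between the idiom and its matched subtrees."""
    supports = [s for t in trees for s in matched_subtrees(idiom, t)]
    if not supports:
        return 0.0
    mean_iou = sum(size(idiom) / size(s) for s in supports) / len(supports)
    return ROOT_WEIGHT.get(idiom[0], default_weight) * mean_iou


idiom = ("foreach", [("primitive_write", [("loop_var", [])])])
trees = [
    ("method", [("foreach", [("primitive_write", [("loop_var", [])]), ("read", [])])]),
    ("method", [("foreach", [("primitive_write", [("loop_var", [])])])]),
]
score = iou_score(idiom, trees)
```

Under this reading, an idiom that accounts for most of the nodes of the loops it matches scores higher than one that covers only a small corner of much larger loops.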
4. Evaluation
This section details the empirical evaluation of Jezero on the task of learning refactoring idioms for a diverse set of APIs from the Facebook Hack codebase. Further, for each API, we measure the effectiveness of the mined idiom in identifying known and potential refactoring locations.
Dataset for Mining
Because distinct APIs have different refactoring patterns, we construct a separate dataset of patterns for each API using historical change data. We call these changes edits; each edit contains a before and an after version of the changed source file(s). We first scrape edits for a given time interval and API from Facebook’s code repository and construct a dataset for mining refactoring patterns from the before versions of the edits. However, edits collected this way would most likely contain excessive noise, i.e., changes that are not relevant to the refactoring. Hence, we opted for a relatively inexpensive way to prune irrelevant edits. We propose the following three-fold heuristic filter, as shown in Figure 6: (1) we compute a “treediff” of each edit using the GumTree algorithm (Falleri et al., 2014) and remove method trees that were not modified or did not see an introduction of the API we are investigating; (2) we keep only the code edits in which the API keyword occurs more often in the after version than in the before version; (3) we further keep only the edits in which the cyclomatic complexity of the before version is greater than that of the after version. These heuristics are not meant to isolate actual refactorings exclusively, but to increase the chances that each before-after pair is a valid refactoring. We believe that the resulting diversity in the dataset helps prevent overfitting. On average, for the APIs in Table 4 the initial dataset contains trees rooted at the method level (i.e., number of methods, averaged per API, found before the three-fold heuristic). After the pruning stage, on average, the dataset contains training trees rooted at method level ( trees for Vec\map_with_key; trees for Vec\gen_filter; trees for Dict\filter; trees for Dict\map_with_key).
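The three-fold filter can be summarized in a few lines of Python. The `Edit` record and its fields are hypothetical stand-ins for information that would come from the treediff, a keyword count, and a cyclomatic-complexity tool; the sketch only illustrates how the three conditions combine.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical summary of one before/after method pair."""
    api_introduced: bool      # treediff shows the method was modified and the API was introduced
    api_count_before: int     # occurrences of the API keyword in the before version
    api_count_after: int      # occurrences of the API keyword in the after version
    complexity_before: int    # cyclomatic complexity of the before version
    complexity_after: int     # cyclomatic complexity of the after version


def keep_edit(edit):
    """Apply the three heuristics in order; an edit must pass all of them."""
    if not edit.api_introduced:
        return False
    if edit.api_count_after <= edit.api_count_before:
        return False
    return edit.complexity_before > edit.complexity_after


edits = [
    Edit(True, 0, 1, 4, 2),    # loop replaced by an API call: kept
    Edit(False, 0, 0, 3, 3),   # method untouched by the change: dropped
    Edit(True, 1, 2, 2, 5),    # API added but the code got more complex: dropped
]
training_candidates = [e for e in edits if keep_edit(e)]
```

Only the before versions of the surviving edits would then be turned into training trees.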
We evaluate the effectiveness of Jezero in two settings. First, we measure the accuracy of the proposed approach on a manually constructed validation set, containing true refactoring opportunities and non-opportunities.
Second, we measure the performance of Jezero in identifying refactoring opportunities in the entire codebase. For each API, we sample idioms for iterations and with a concentration parameter value of . Furthermore, we have pruned rare (c_min ) and small (n_min ) idioms. The observations reported are the result of running Jezero for 88 hours on an Intel Core i7-6700 CPU @2.39GHz with 57GB RAM. Note that this is the one-time cost of training for the 4 different APIs. Moreover, training takes about the same amount of time for each API (approximately 22 hours). At prediction time, identifying matching locations takes just a few milliseconds.
Effectiveness in identifying known refactoring.
To measure the accuracy of the proposed approach, for each API, we manually construct a ground truth dataset (1) using manually confirmed refactoring locations in the historical change data and (2) manually identified potential refactoring locations from a set of files sampled from the current version of the codebase. These locations from the current codebase are included to get a wider variety of code samples.
The constructed evaluation dataset contains, on average, 431 trees (this is the average number of trees randomly sampled from the trees mentioned before, as well as from files sampled from the current codebase), of which 27 are true refactoring locations (27 is the average number per API of trees that were manually verified to be true refactoring locations); see Table 4. Note that not all trees are true refactorings. This happens because, other than the true refactoring method, an edit’s before version in our dataset may contain several methods with loopy code that is similar to the idiom. Hence, we manually checked the trees to determine which of them were actual refactorings.
We compare Jezero with Haggis, the AST-based idiom mining approach proposed by Allamanis and Sutton (Allamanis and Sutton, 2014). Haggis does not have the dataflow augmentation that Jezero has, but is otherwise identical to Jezero. We could not compare with semantic idiom mining based on coiling (Allamanis et al., 2018) because the latter requires annotations or dynamic analysis.
We measure the effectiveness of the proposed approach in identifying refactoring locations by comparing the precision, recall, and F1 scores of idioms produced by the baseline AST-based approach (Haggis) and by Jezero. Table 4 shows the accuracy results for four API refactorings when using the top-1 idiom under the intersection-over-union ranking scheme. On average, Jezero’s F1 score is significantly better than the baseline’s on all APIs. This indicates the effectiveness of the proposed static-analysis-based tree representation. Additionally, the differences between the APIs reflect how well the new tree representation can identify diverse patterns.
|Average||27 / 431||0.08/0.22/0.05||0.57/0.55/0.62|
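For reference, the three accuracy measures can be computed from a set of predicted locations and a set of ground-truth locations as in the sketch below. The location identifiers are made up for illustration and do not come from the evaluation dataset.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for a set of predicted vs. true refactoring locations."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical location ids: the idiom matched four trees, three trees are true refactorings.
predicted = {1, 2, 3, 4}
actual = {2, 3, 5}
precision, recall, f1 = precision_recall_f1(predicted, actual)
```

In this toy case two of the four matches are correct and two of the three true locations are found, giving a precision of 1/2, a recall of 2/3, and an F1 score of 4/7.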
Effectiveness in identifying refactoring opportunities.
In this experiment, we measure the performance of Jezero in identifying potential refactoring locations on Facebook’s codebase with Hack methods, spread over files. For each API, we identify matching locations using the top-1 idiom from Jezero and Haggis. Table 5 summarizes the number of matching locations for each API. Jezero matches 807 locations (202 on average; i.e., % of the trees rooted at loop headers), whereas Haggis matches only 23 locations in all (6 on average); Haggis fails to identify any refactoring opportunities in the case of Vec\gen_filter and Dict\filter.
Furthermore, to identify the precision of the matched locations, we manually inspect all locations of Haggis and a random subset of () locations for each API returned by Jezero.
| API | Matching locations (Jezero / Haggis) | Precision (Jezero / Haggis) |
|---|---|---|
| Vec\map_with_key | 260 / 1 | 0.91 / 1.00 |
| Vec\gen_filter | 247 / 0 | 0.41 / 0.00 |
| Dict\filter | 134 / 0 | 0.39 / 0.00 |
| Dict\map_with_key | 166 / 22 | 0.68 / 0.91 |
| Average | 202 / 6 | 0.60 / - |
Note that since we do not have a dataset of locations that should match in the internal codebase, no measures of recall are reported. On average, Jezero has a precision of 0.60, which is an encouraging number. Haggis’s average precision is inconsequential because of the low absolute numbers.
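The averages in Table 5 can be re-derived from the per-API rows with a few lines of arithmetic. The values below are copied from the table; the variable names are ours.

```python
# Per-API values from Table 5: (Jezero matches, Haggis matches, Jezero precision).
table5 = {
    "Vec\\map_with_key": (260, 1, 0.91),
    "Vec\\gen_filter": (247, 0, 0.41),
    "Dict\\filter": (134, 0, 0.39),
    "Dict\\map_with_key": (166, 22, 0.68),
}

n_apis = len(table5)
jezero_total = sum(row[0] for row in table5.values())
haggis_total = sum(row[1] for row in table5.values())
jezero_avg_matches = jezero_total / n_apis
haggis_avg_matches = haggis_total / n_apis
jezero_avg_precision = sum(row[2] for row in table5.values()) / n_apis
```

The totals are 807 and 23 matching locations, the per-API averages are 201.75 and 5.75 (reported as 202 and 6), and the mean precision is 0.5975 (reported as 0.60).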
In summary, Jezero not only mined patterns in an unsupervised way, but the mined patterns were also highly productive in locating opportunities in the “wild”.
Effectiveness of different ranking strategies.
For each API, we rank the mined idioms using the ranking schemes discussed in Section 3.3 and compute the F1 score based on the top-1 idiom. Table 6 summarizes the results. The F1 scores are almost identical across the different ranking schemes, and this observation holds for both Haggis and Jezero. Therefore, any of the ranking strategies is well suited to both approaches.
5. Threats to Validity
Regarding internal validity, the effectiveness of the parameters may depend on the extent and nature of the codebase used. To mitigate this risk, we have experimented with a combination of parameters and ranking schemes. However, we have not systematically explored every combination of parameters in our experiments. Hence, other combinations may work better for other systems. In terms of external validity, the proposed approach has only been evaluated using a codebase developed by a single company (albeit a large codebase). Furthermore, the approach has been evaluated on the Hack programming language, so the results may generalize to other programming languages only with caution. To mitigate this threat, as future work, we will investigate the effectiveness of our approach on other languages and codebases.
6. Limitations and Future Work
The dataflow trees we generate are type agnostic. Therefore, different APIs could have similar idiomatic patterns. For example, we observe that the top-3 idioms of Vec\gen_filter and Dict\filter are identical. Other APIs that are expected to yield identical idioms are Vec\map_with_key and Dict\map_with_key. To improve the precision of the proposed approach, we can (1) add type information while generating the tree representation, or (2) use it to disambiguate APIs at prediction time. Moreover, the current dataflow trees are rather general; e.g., no information about if-expressions is captured. Adding further information, e.g., the variable references in the if condition, will likely help mine better idioms.
Additionally, the proposed tree canonicalization was influenced by the idiom mining machinery, which identifies contiguous patterns from trees. Capturing information about variables outside a local code block makes it a graph mining problem. To overcome this, we can introduce predicates, such as contains, before, and after, and construct a tree based on this grammar. However, this might lead to a computationally expensive sampling approach.
Furthermore, the dataflow trees are generated using a bottom-up approach where the information flow is in one direction, from the inner to the outer code block. The design choice we made to capture local patterns is not suitable, for example, for patterns that depend on context information from the outer region. This has been discussed in Section 3.1, and illustrated in Listing 1.
Finally, refactoring opportunities identified by Jezero may not be good candidates for actual refactoring, due to the replacement API being less performant or readable. Therefore, we do not plan to automatically apply refactorings detected by Jezero, and instead surface suggestions to developers during the code review process.
7. Related Work
We have discussed the work of Allamanis et al. (Allamanis and Sutton, 2014; Allamanis et al., 2018) on idiom mining throughout the paper. To recap, Jezero differs from their work in two ways. First, prior work uses annotations or dynamic analysis to capture semantic properties, while Jezero instead uses lightweight static analysis, which is more readily available. Second, Jezero constructs a canonicalized tree representation that captures dataflow in a control-flow block, which provides the richer information that we need for our setting of identifying latent refactorings.
Code clone detection (Li et al., 2004; Kamiya et al., 2002; Sajnani et al., 2016; Jiang et al., 2007; Roy and Cordy, 2008) techniques are related to idiom mining, as the goal is to identify highly similar code blocks. Rattan et al. (Rattan et al., 2013) identify several clone detection techniques that use the syntax and semantics of a program (Baxter et al., 1998; Koschke et al., 2006). The code idiom mining proposed in this work searches for frequent subtrees, as opposed to the maximally identical subtrees sought by clone detection techniques. Semantic code search techniques (Premtoon et al., 2020; David and Yahav, 2014; Stolee et al., 2014; Sivaraman et al., 2019) are also related to idiom mining, since they utilize type (Reiss, 2009), data, and control flow (Premtoon et al., 2020; Wang et al., 2010) information for identifying clones. Our approach differs in two ways: (1) code search requires the user to provide a search pattern, whereas Jezero infers such a pattern; and (2) search techniques that infer a pattern (Sivaraman et al., 2019) leverage active learning, while we use nonparametric Bayesian methods. Another related area is API mining (Zhong et al., 2009; Wang et al., 2013; Acharya et al., 2007; Galenson et al., 2014; Mandelin et al., 2005). However, this problem is significantly different from idiom mining since it tries to mine sequences or graphs (Gu et al., 2016; Mou et al., 2016; Nguyen et al., 2009) of API method calls, usually ignoring most features of the language. API protocols can be considered a type of semantic idiom; therefore, idiom mining is a general technique for pattern matching and can be specialized to API mining by devising appropriate tree representations.
In recent years, we have seen an emerging trend of tools and techniques that synthesize program transformations from examples of code edits (Bader et al., 2019; Meng et al., 2013; Miltner et al., 2019; Polozov and Gulwani, 2015; Rolim et al., 2017, 2018; Gao et al., 2020). The synthesized transformation should satisfy the given examples while producing correct edits on unseen inputs. Existing approaches have addressed this in different ways. Sydit (Meng et al., 2011) and LASE (Meng et al., 2013) are only able to generalize variable names, methods, and fields. Moreover, the former accepts only one example and synthesizes the transformation using the most general generalization, whereas the latter accepts multiple examples and synthesizes the transformation using the most specific generalization. This is also the approach taken by Revisar (Rolim et al., 2018) and Getafix (Bader et al., 2019). ReFazer (Rolim et al., 2017; Gao et al., 2020) learns a set of transformations consistent with the examples and uses a set of heuristics to rank the transformations by their likelihood of being correct. While these techniques learn transformations from the provided examples, Jezero’s main focus is on detecting statistically significant patterns directly from a corpus and then pointing out likely opportunities for refactoring. On a different note, many of these tools could also benefit from the dataflow-augmented tree structure that we introduced, which makes the common semantic pattern manifest.
8. Conclusion
We propose Jezero, a scalable, lightweight technique that is capable of surfacing semantic idioms from large codebases. Under the hood, Jezero extends the abstract syntax tree with canonicalized dataflow trees and leverages a well-suited nonparametric Bayesian method to mine the semantic idioms.
Our experiments on Facebook’s Hack code show Jezero’s clear advantage: it was significantly more effective at finding refactoring opportunities in unannotated legacy code than a baseline that did not have the dataflow augmentation. On a randomly drawn sample containing Hack methods, Jezero found matches at 1.5% of the locations, among which the precision of finding a real refactoring opportunity was 0.60.
We expect the ideas in Jezero to carry over to other languages such as Python, as it provides ways to express idiomatic code.
References
- Acharya et al. (2007) Mining API patterns as partial orders from source code: from usage scenarios to specifications. In Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp. 25–34.
- Allamanis et al. (2018) Mining semantic loop idioms. IEEE Transactions on Software Engineering 44 (7), pp. 651–668.
- Allamanis and Sutton (2014) Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 472–483.
- Bader et al. (2019) Getafix: learning to fix bugs automatically. Proceedings of the ACM on Programming Languages 3 (OOPSLA), pp. 1–27.
- Baxter et al. (1998) Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), pp. 368–377.
- Cohn et al. (2010) Inducing tree-substitution grammars. The Journal of Machine Learning Research 11, pp. 3053–3096.
- David and Yahav (2014) Tracelet-based code search in executables. ACM SIGPLAN Notices 49 (6), pp. 349–360.
- Falleri et al. (2014) Fine-grained and accurate source code differencing. In Proceedings of the International Conference on Automated Software Engineering, Västeras, Sweden, pp. 313–324.
- Galenson et al. (2014) CodeHint: dynamic and interactive synthesis of code snippets. In Proceedings of the 36th International Conference on Software Engineering, pp. 653–663.
- Gao et al. (2020) Feedback-driven semi-supervised synthesis of program transformations. Proceedings of the ACM on Programming Languages 4 (OOPSLA), pp. 1–30.
- Gelman et al. (2013) Bayesian data analysis. CRC Press.
- Gu et al. (2016) Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 631–642.
- Jiang et al. (2007) DECKARD: scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, Washington, DC, USA, pp. 96–105.
- Kamiya et al. (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28 (7), pp. 654–670.
- Koschke et al. (2006) Clone detection using abstract syntax suffix trees. In 2006 13th Working Conference on Reverse Engineering, pp. 253–262.
- Li et al. (2004) CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In OSDI, Vol. 4, pp. 289–302.
- Liang et al. (2010) Type-based MCMC. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 573–581.
- Mandelin et al. (2005) Jungloid mining: helping to navigate the API jungle. ACM SIGPLAN Notices 40 (6), pp. 48–61.
- Meng et al. (2015) Does automated refactoring obviate systematic editing?. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1, pp. 392–402.
- Meng et al. (2011) Systematic editing: generating program transformations from an example. ACM SIGPLAN Notices 46 (6), pp. 329–342.
- Meng et al. (2013) LASE: locating and applying systematic edits by learning from examples. In 2013 35th International Conference on Software Engineering (ICSE), pp. 502–511.
- Miltner et al. (2019) On the fly synthesis of edit suggestions. Proceedings of the ACM on Programming Languages 3 (OOPSLA), pp. 1–29.
- Mou et al. (2016) Convolutional neural networks over tree structures for programming language processing. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
- Murphy (2012) Machine learning: a probabilistic perspective. MIT Press.
- Nguyen et al. (2009) Graph-based mining of multiple object usage patterns. In Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT symposium on the Foundations of Software Engineering, pp. 383–392.
- Polozov and Gulwani (2015) FlashMeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 107–126.
- Premtoon et al. (2020) Semantic code search via equational reasoning. In PLDI, pp. 1066–1082.
- Rattan et al. (2013) Software clone detection: a systematic review. Information and Software Technology 55 (7), pp. 1165–1199.
- Reiss (2009) Semantics-based code search. In 2009 IEEE 31st International Conference on Software Engineering, pp. 243–253.
- Rich and Waters (1988) The programmer’s apprentice: a research overview. Computer 21 (11), pp. 10–25.
- Rolim et al. (2017) Learning syntactic program transformations from examples. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 404–415.
- Rolim et al. (2018) Learning quick fixes from code repositories. arXiv preprint arXiv:1803.03806.
- Rosen et al. (1988) Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 12–27.
- Roy and Cordy (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on, pp. 172–181.
- Sajnani et al. (2016) SourcererCC: scaling code clone detection to big-code. In Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on, pp. 1157–1168.
- Sivaraman et al. (2019) Active inductive logic programming for code search. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 292–303.
- Stolee et al. (2014) Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM) 23 (3), pp. 1–45.
- Wang et al. (2013) Mining succinct and high-coverage API usage patterns from source code. In 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 319–328.
- Wang et al. (2020) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271.
- Wang et al. (2010) Matching dependence-related queries in the system dependence graph. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pp. 457–466.
- Zhong et al. (2009) MAPO: mining and recommending API usage patterns. In European Conference on Object-Oriented Programming, pp. 318–343.