Copying and pasting from existing code is a coding practice that refuses to die out in spite of much expert disapproval (Kim et al., 2004; Juergens et al., 2009). The approach is vilified for good reason: it is easy to write buggy programs using blind copy-and-paste. At the same time, the widespread nature of the practice indicates that programmers often have to write code that substantially overlaps with existing code, and that they find it tedious to write this code from scratch.
In spite of its popularity, copying and pasting code is not always easy. To copy and paste effectively, the programmer has to identify a piece of code that is relevant to their work. After pasting this code, they have to modify it to fit the requirements of their task and the code that they have already written. Many of the bugs introduced during copying and pasting come from the low-level, manual nature of the task.
In this paper, we present a programming methodology, called program splicing, that aims to offer the benefits of copy-and-paste without some of its pitfalls. Here, the programmer writes code with the assistance of a program synthesizer (Alur et al., 2015a; Solar-Lezama et al., 2006) that is able to query a large, searchable database of program snippets extracted from online open-source repositories. Operationally, the programmer starts by writing a “draft” that is a mix of unfinished code and natural language comments, along with an incomplete correctness requirement, for example in the form of test cases or API call sequence constraints. The synthesizer completes the “holes” in the draft by instantiating them with code extracted from the database, such that the resulting program meets its correctness requirement. The programmer may then further modify the program and possibly proceed to perform additional rounds of synthesis.
In more detail, our synthesis algorithm operates as follows. First, it identifies and retrieves from the database a small number of program snippets that are relevant to the code in the draft. These search results are viewed as pieces of knowledge relevant to the synthesis task at hand, and are used to guide the synthesis algorithm. Specifically, from each result, the algorithm extracts a set of codelets: expressions and statements that are conceivably related to the synthesis task. Next, it systematically enumerates over possible instantiations of holes in the draft with codelets, using a number of heuristics to prune the space of instantiations.
The primary distinction between our synthesis algorithm and existing search-based approaches to synthesis lies in the use of pre-existing code. A key benefit of such a data-driven approach is that it helps with the problem of underspecification. Because synthesis involves the discovery of programs, the specification for a synthesis problem may be incomplete. This means that even if a synthesizer finds a solution that meets the specification, this solution may in fact be nonsensical. This problem is especially common in traditional synthesis tools, which explore a space of candidate programs without significant human guidance. In contrast, the codelets in our approach are sourced from pre-existing code that humans wrote when solving related programming tasks. This means that our search for programs is biased towards programs that are human-readable and likely to follow the common-sense constraints that humans assume.
The use of pre-existing code also has a positive effect on scalability. Without codelets, the synthesizer would have to instantiate holes in the draft with expressions built entirely from scratch. In contrast, in program splicing, the synthesizer searches the more limited space of ways in which codelets can be “merged” with pre-existing code.
We present an implementation of program splicing that uses a corpus of approximately 3.5 million methods, extracted from the Sourcerer (Sajnani et al., 2014; Ossher et al., 2012; Bajracharya et al., 2014) source code repository, to perform synthesis of Java programs. We evaluate our approach on a suite of Java programming tasks, including the implementation of scripts useful in everyday computing, modifications of well-known algorithms, and initial prototypes of software components such as GUIs, HTML parsers, and HTTP servers. Our evaluation includes a comparison with Scalpel (Barr et al., 2015), a state-of-the-art programming system that can “transplant” code across programs, as well as a user study with 18 participants. The evaluation shows our system to outperform Scalpel and indicates that it can significantly boost overall programmer productivity.
Now we summarize the contributions of the paper:
We propose program splicing, a methodology where programmers use a program synthesizer that can query a large database of existing code, as a more robust proxy for copying and pasting code.
We present an implementation of program splicing for the Java language that is driven by a corpus of 3.5 million Java methods.
We present an extensive empirical evaluation of our system on a range of everyday programming tasks. The evaluation, which includes a user study, shows that our method outperforms a state-of-the-art competing approach and increases overall programmer productivity.
The rest of the paper is organized as follows. In Section 2, we give an overview of our method. In Section 3, we formally state our synthesis problem. Section 4 describes the approach of program splicing. Section 5 presents our evaluation. Related work is described in Section 6. We conclude with some discussion in Section 7.
2. Overview
In this section, we describe program splicing from a user’s perspective using a few motivating examples.
2.1. Primality Testing
Consider a programmer who would like to implement a primality testing function using the Sieve of Eratosthenes algorithm. The programmer knows that the function must build an array prime of bits, with the i-th bit set to true if the number i is prime. However, they do not recall in detail how to initialize the array, or the algorithm for populating it.
In current practice, the programmer would search the web for a Sieve of Eratosthenes algorithm, copy code from one of the search results, and modify this code manually. In contrast, in program splicing, they write a draft program in a notation inspired by the Sketch system for program synthesis (Solar-Lezama et al., 2006; Solar-Lezama, 2009) (Figure 1). This draft program declares the array prime; however, in place of the code to fill this array, it simply leaves a hole represented by a special symbol “??”. A hole in a program serves as a placeholder for external codelets that will be filled in by our system. In this example, the external snippets will come from a Sieve of Eratosthenes implementation. The programmer describes the forms of external code that are relevant to the task using natural language comments. In this example, the comments at line 1 contain words such as “sieve”, “eratosthenes” and “primality”, suggesting a Sieve of Eratosthenes implementation. The system uses these words as a hint to search the code database. This is similar to a text-based web search, but performed in a programming scenario. Finally, to ensure that the synthesized code is compatible with the code they have already written, the programmer needs to provide some correctness requirements. The requirements for our example are shown in the “TEST” section at the top of the draft.
Given the draft, our program synthesizer issues a query to a searchable database of code snippets. The code database then returns a set of functions relevant to the current programming task, including at least one Sieve of Eratosthenes implementation (such an implementation is shown in Figure 3). The system now extracts a set of codelets — expressions and statements — from these functions, and uses a composition of these codelets to fill in the hole in the draft. The completed draft is shown in Figure 2.
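To make the example concrete, the spliced-in code might resemble the following sketch of the Sieve of Eratosthenes (a hypothetical illustration; the actual snippet retrieved by the system, shown in Figure 3, may differ in detail):

```java
import java.util.Arrays;

public class Sieve {
    static boolean[] prime;

    // Build the sieve: prime[k] is true iff k is a prime number.
    static void buildSieve(int n) {
        prime = new boolean[n + 1];
        Arrays.fill(prime, 2, n + 1, true); // 0 and 1 are not prime
        for (int p = 2; (long) p * p <= n; p++) {
            if (prime[p]) {
                for (int q = p * p; q <= n; q += p) {
                    prime[q] = false; // mark multiples of p as composite
                }
            }
        }
    }

    static boolean isPrime(int k) {
        return prime[k];
    }

    public static void main(String[] args) {
        buildSieve(100);
        System.out.println(isPrime(97)); // true
        System.out.println(isPrime(91)); // false: 91 = 7 * 13
    }
}
```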
2.2. Reading a Matrix from a CSV File
Now we show an example where external code snippets are used to complete a draft with multiple holes, through an interactive process. Suppose the programmer would like to read a matrix from a comma-separated values (CSV) file into a 2-dimensional array and then square the matrix. This programming task has two major pieces: reading from the CSV file and matrix multiplication. In the beginning, the programmer focuses on the first task and, accordingly, writes the draft program shown in Figure 4(a). In this draft, the programmer simply declares a 2-dimensional array. She then leaves a hole as a proxy for the code that reads the matrix from the CSV file, and provides some comments and requirements to guide the instantiation of the hole. Our system then searches the code database for relevant external code; an example of such a program is shown in Figure 4. Snippets from this code are then merged into the existing draft.
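The retrieved CSV-reading code might look roughly like the following (a sketch under the assumption that the matrix dimensions are known in advance; the identifiers are illustrative, not those of the actual retrieved snippet):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class CsvMatrix {
    // Read an n-by-n matrix of doubles from a CSV source, one row per line.
    static double[][] readMatrix(BufferedReader reader, int n) throws IOException {
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            String[] fields = reader.readLine().split(",");
            for (int j = 0; j < n; j++) {
                m[i][j] = Double.parseDouble(fields[j].trim());
            }
        }
        return m;
    }

    public static void main(String[] args) throws IOException {
        String csv = "1,2\n3,4\n";
        double[][] m = readMatrix(new BufferedReader(new StringReader(csv)), 2);
        System.out.println(m[1][0]); // 3.0
    }
}
```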
After getting the code that reads a matrix from a CSV file, the user now focuses on the second part of the task, which is matrix multiplication. They extend the previous code into a new draft, which has a hole for the matrix multiplication code, along with some comments and requirements. This draft is shown in Figure 4(b). Our system now searches the code database for codelets that perform matrix multiplication and merges these codelets into the existing code, while ensuring that all requirements are met. The complete program resulting from this process is shown in Figure 6.
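The matrix-multiplication codelet, specialized to squaring, might take the familiar triple-loop form (an illustration of the kind of code involved, not necessarily what the system retrieves):

```java
public class MatSquare {
    // Square an n-by-n matrix using the textbook triple loop.
    static double[][] square(double[][] a) {
        int n = a.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    r[i][j] += a[i][k] * a[k][j];
                }
            }
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] a = { {1, 2}, {3, 4} };
        double[][] s = square(a);
        System.out.println(s[0][0] + " " + s[1][1]); // 7.0 22.0
    }
}
```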
As shown in the example, our system can be used in an iterative and interactive manner. A programmer can start writing code as usual, and bring external resources from the web into the existing codebase as needed. In this respect our approach is similar to copy-and-paste. The difference is that our system automates the process of finding and modifying relevant code, and guarantees a certain level of reliability by ensuring that the output program meets all its requirements.
2.3. Face Detection using OpenCV
In the previous examples, we relied on input-output tests to verify the correctness of a solution. Now we consider the use of program splicing in the implementation of face detection, a computer vision task in which input-output tests are hard to specify, requiring an alternative form of correctness requirement. Specifically, the requirements that we use are constraints on the sequences of API calls that a program makes, given in the form of a finite automaton.
Figure 7 shows a draft program for this task. In this example, a user wants to use a CascadeClassifier object from OpenCV to detect faces in an input image; the output image, named faceDetection.png, should contain the same picture with a rectangle drawn around each face. The API call constraint for the task is shown in Figure 9. This requirement describes a sequence of object creation and API invocation actions performed during face detection. While the requirement is lower-level than unit tests, we note that it frees users from specifying small details such as which configuration file to use, the color of the rectangles drawn around faces, and the order in which the four corners of a rectangle are specified. Our synthesizer uses this requirement to filter out many of the candidate programs that it considers during synthesis. Only a few solutions satisfy the requirement, and the user can easily pick the correct one, shown in Figure 8.
3. Problem formulation
In this section, we formulate the program synthesis problem that is at the heart of the program splicing methodology.
As mentioned earlier, a draft program in our setting consists of incomplete code and a set of natural language comments. We start by spelling out the language of code permitted in our drafts.
Our approach accepts code in a subset of Java, abstractly represented by the following grammar. In summary, the grammar permits standard imperative expressions and statements over base and array types, as well as a special symbol ?? that represents holes.
⟨expr⟩ ::= id | c | ⟨expr⟩ binop ⟨expr⟩ | unaryop ⟨expr⟩ | f(⟨expr⟩, …, ⟨expr⟩) | id := ⟨expr⟩ | ??
⟨stmt⟩ ::= let id = ⟨expr⟩ | if ⟨expr⟩ ⟨stmt⟩ ⟨stmt⟩ | while ⟨expr⟩ ⟨stmt⟩ | ⟨stmt⟩ ; ⟨stmt⟩ | ??
⟨program⟩ ::= id (⟨expr⟩, …, ⟨expr⟩) ⟨stmt⟩
In this grammar, c represents a constant, id represents an identifier, f represents external functions (API calls), and binop and unaryop respectively represent binary and unary operators. We assume that a standard type system is used to assign types to expressions and statements in this grammar. The actual language handled by our implementation goes somewhat beyond this grammar, permitting arrays, objects, data structure definitions, a limited form of recursion, and syntactic sugar such as for-loops.
The special symbol ?? in the grammar represents two kinds of holes. An expression hole is a placeholder for a missing expression. A statement hole is a placeholder for a missing statement.
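For example, a small draft in this language might contain both kinds of holes (a hypothetical illustration; the identifiers are ours):

```
sumTo(n)
  let s = 0 ;
  let i = 0 ;
  while (i < ??)    // expression hole: the loop bound
    ??              // statement hole: code to update s and i
```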
The semantics of a program with holes can be defined as a set of complete (hole-free) programs obtained by instantiating the holes with expressions and statements. The semantics of a complete program is defined in the standard way. We skip the formal definitions of these semantics for brevity.
Aside from a draft, an input to a program splicing problem includes a requirement. This requirement is not expected to be a full correctness specification. Specifically, our implementation permits two classes of requirements: input-output tests, and finite automata that constrain the sequences of API calls that a program can make. We assume a procedure to conservatively check whether a given complete program satisfies a given set of requirements. For requirements that are input-output tests, this procedure simply evaluates the program on the tests. The procedure for automaton constraints is based on a standard, sound program analysis.
Let P be a draft program with one or more holes, and let D be a database containing programs with no holes. Our objective is to use the programs from D to complete the holes in P. Specifically, we use the expressions (similarly, statements) from D to complete the expression holes (similarly, statement holes) in P. Naturally, such an instantiation of the holes can be performed in many different ways. Our goal is to perform this instantiation such that the resulting program passes the requirement.
More precisely, consider the set C of all codelets — subexpressions and statements — that appear in programs from D. Let S be the set of complete programs obtained by instantiating the holes of P with appropriately typed codelets in C. Let check be a function that maps a complete program in S to a boolean value indicating whether that program passes the requirement accompanying P. Our problem is to find a program p ∈ S such that check(p) is true.
4. Program Splicing
In this section, we present our solution to the synthesis problem formulated in the previous section.
Our synthesis problem has two key subproblems: code search and hole substitution.
Code Search: Given a draft program, search a large corpus of programs for a set of relevant programs that contain the codelets we want to synthesize. The code search technique should offer both high precision and high efficiency. Here, precision is the number of retrieved programs that contain the codelet we want to synthesize, and efficiency is the runtime required for each search. In summary, we need to fetch, within a short period of time, a set of programs that contain the exact codelets we want.
Hole Substitution: Given multiple database programs, we would like to find the correct codelets to substitute for the holes. Together, these programs contain a large number of codelets. The key challenge is to prune the search space so that we can efficiently find the exact codelets we want, while ensuring that the desired codelets are not dropped by our heuristics.
In general, solving the synthesis problem requires us to find a sweet spot between expressiveness and efficiency. In traditional program synthesis, a highly expressive algorithm is more capable of generating various kinds of code, but usually requires more time to search. In contrast, an efficient algorithm tends to generate a very limited range of code, and sometimes its search space fails to cover the solution.
In our case, expressiveness corresponds to the number of codelets from the database programs that we consider during synthesis, and efficiency corresponds to the time a synthesis task takes, or the number of incorrect programs we must filter out in the end. Ideally, we want a synthesis algorithm that is sufficiently expressive and efficient to complete any draft program within a short amount of time. In the rest of this section, we explain each component of our method in detail. We first discuss how our code search method gives us sufficient expressiveness, and then how the synthesis algorithm achieves efficiency without sacrificing too much expressiveness.
4.1. Searching for programs
In this section, we describe the code search techniques employed to query a large database of programs effectively. This is the first step in our workflow: finding candidate functionality from the program database to complete the draft program. Given a draft program with a hole, the code search returns relevant code based on the context around the hole (such as comments, the function signature, and parameter names).
An important goal of the code search component is to have quick response when searching large amounts of code to ensure efficiency of our synthesis algorithm. To accomplish this, various code features are extracted from a large corpus of open source code. These code features—along with the corresponding source code—are stored in a program database. The program database is a scalable object-store database that allows for fast similarity-based queries.
A query issued to the program database includes code features extracted from the draft program, along with associated weights indicating the relative importance of the code features. The program database computes the k-nearest neighboring corpus elements to the query, using the code features stored, associated weights, and similarity metrics defined on each code feature. The result of the query is presented as a ranked list of source code corresponding to the k-nearest neighbors.
Expressiveness can be easily guaranteed, since we control the number of neighbors k that we consider. We can increase k until we have retrieved enough programs containing the codelets we want to synthesize. Note, however, that the more programs we retrieve, the larger the search space becomes, and thus the more time synthesis requires. Ultimately, we need to target a small k while still ensuring that we retrieve the desired programs.
Below we describe the features extracted and the associated similarity metrics.
Natural language terms. For this feature, we extract the function name, comments, local variable names, and parameter names of a function. Such extracted natural language (NL) terms are then subjected to a series of standard NL pre-processing steps, such as splitting words with underscores or camel-case, removing stop words (including typical English stop words, and those specialized for Java code), stemming, lemmatization, and removing single character strings.
Additionally, we use a greedy algorithm (Feild et al., 2006) for splitting terms into multiple words, based on dictionary lookup. This is to handle the case where programmers combine multiple words, without separating the words with underscores or camel-case, when naming functions and variables.
After NL pre-processing, we compute a tf-idf (term frequency-inverse document frequency) score for each NL term. Each function is considered a document, and the tf-idf is computed per project. We give the function name term an inflated score (weighted more heavily than other terms) because it often provides significant information about a function’s purpose.
The similarity between two functions is measured by taking the cosine similarity of their NL term vectors, weighted by their tf-idf values. Below is an example of the NL term features for the draft shown in Figure 1.
"primal":0.10976425998969035, "siev":0.658585559938142, "test":0.10976425998969035, "prime":0.658585559938142, ...
Variable and function names. For this feature, we extract all the variable names and the name of the function, and perform basic normalization such as splitting camel case and underscores. The similarity metric used is the Jaccard index on sets of names.
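A minimal sketch of the Jaccard index on name sets (illustrative; the names shown are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class NameSim {
    // Jaccard index between two sets of identifier names: |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> draftNames = Set.of("matrix", "row", "col");
        Set<String> candidateNames = Set.of("matrix", "row", "line");
        System.out.println(jaccard(draftNames, candidateNames)); // 0.5 (2 shared of 4 total)
    }
}
```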
The similarity search is primarily driven by the natural language term features, with variable names and function names providing additional context around the hole in the query code. We give more weight to the natural language term features and less weight to variable and function names. The reason is that the most important hints in the source code are the comments, because users are required to describe the code they want to synthesize there. Variable names and function names cannot be treated as equally important, because they may be entirely irrelevant to the code the user wants to synthesize. For example, users might leave comments saying that they want code that reads a matrix from a CSV file, while the surrounding context is all about matrix calculation.
4.2. Program completion
After we have retrieved a set of programs from the program database, our next step is to complete the draft. For each database program paired with the given partial program, we spawn a thread to perform the code completion task, parallelizing the process for efficiency. A code completion task consists of the following steps:
4.2.1. Hole substitution
The first step is to use the codelets from the database program to substitute for the holes in the draft. Procedure 1 shows the algorithm. We start by checking whether there is any hole in the draft at line 1. If not, we move on to the merging step. Otherwise, we start injecting codelets into the draft. For each hole, we iterate over all the codelets, starting from the smallest, and check whether the injection is valid using our heuristics at line 6. If so, we substitute the hole with the codelet at line 7 and continue injecting more codelets through a recursive call at line 8, until all holes are filled. When no more holes exist in the draft program, we merge the codelets into the existing codebase, which is explained in detail in a later section. If at some point injecting a codelet is not successful, we backtrack and try another codelet.
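The backtracking structure of this procedure can be sketched on a toy representation, in which a draft is a token list with null marking holes and the requirement is an arbitrary predicate (a simplified model, not our actual implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class HoleFill {
    // Fill holes left to right with candidate codelets, backtracking
    // whenever a choice cannot be extended to a passing completion.
    static List<String> complete(List<String> draft, List<String> codelets,
                                 Predicate<List<String>> requirement) {
        int hole = draft.indexOf(null);
        if (hole < 0) {
            // No holes left: accept only if the requirement passes.
            return requirement.test(draft) ? draft : null;
        }
        for (String c : codelets) {
            List<String> next = new ArrayList<>(draft);
            next.set(hole, c);                    // substitute the hole
            List<String> result = complete(next, codelets, requirement);
            if (result != null) return result;    // success; otherwise backtrack
        }
        return null;                              // no codelet works for this hole
    }

    public static void main(String[] args) {
        List<String> draft = new ArrayList<>(Arrays.asList("a", null, "c", null));
        List<String> found = complete(draft, List.of("x", "b", "d"),
                p -> p.equals(List.of("a", "b", "c", "d")));
        System.out.println(found); // [a, b, c, d]
    }
}
```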
Our search space is then over a finite set of codelets, giving us guaranteed termination. However, we would still like to apply some heuristics to make the synthesis more efficient, because the search space is still quite large given that we need to have a substantial amount of database program to guarantee expressiveness. Next, we discuss our heuristics used in the step of hole substitution.
Synthesizing expressions If we are searching for substitutions for an expression hole h, we can first infer the type of h from the surrounding context. Before using an expression codelet e from a database program, we first ensure that e and h have the same type; otherwise, we ignore e. This heuristic gives us efficiency at no cost in expressiveness.
In addition, we can also consider the roles of h and e. The intuition is that we only consider codelets that serve the same role, which we determine by looking at the parent of h and the parent of e in the parse tree. If the parents of h and e are not of the same kind, then we discard e and look for another codelet. Figure 10 illustrates the idea. If we are looking for a codelet to replace a hole representing the right-hand side of an assignment statement, our target codelets are more likely to be the rvals of other assignment statements. We can then consider just those codelets as substitutions and ignore the others. The same applies if we want to synthesize the code for the guard of a condition.
This heuristic also gives us better efficiency, but some expressiveness is lost, because an expression with a different role might still be the desired one. In addition, an algorithm can have many different implementations, so it is possible that we throw away useful expressions because of differing program structures. However, we can increase the number of database programs to cover enough variations that the expressions we want are likely to be present. It is therefore safe to sacrifice some expressiveness for efficiency.
Synthesizing statements When we are searching for substitutions for a statement hole, we need to consider sequences of statements from the database program. We first define sliding windows of various lengths and use them to scan the database program, identifying the statement sequences that could substitute for the hole. We also scan the sequences under loops and conditionals. We then use each such codelet to substitute for the hole.
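The window enumeration can be sketched as follows (a simplified model in which statements are treated as opaque tokens):

```java
import java.util.ArrayList;
import java.util.List;

public class Windows {
    // Enumerate candidate statement sequences from a database method body
    // using sliding windows of every length, shortest first.
    static List<List<String>> slidingWindows(List<String> statements, int maxLen) {
        List<List<String>> result = new ArrayList<>();
        for (int len = 1; len <= Math.min(maxLen, statements.size()); len++) {
            for (int start = 0; start + len <= statements.size(); start++) {
                result.add(statements.subList(start, start + len));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> body = List.of("s1", "s2", "s3");
        // Windows of length 1 (three of them) and length 2 (two of them).
        System.out.println(slidingWindows(body, 2).size()); // 5
    }
}
```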
We considered using a heuristic similar to the role matching used for expression holes, but we discovered that role matching cannot be applied when synthesizing statement sequences for statement holes. Typically, the semantics of an expression depend on its surrounding context, and its meaning varies as its context or role changes. An expression by itself tends not to have any useful meaning until it is tied to a specific role in a semantic structure. Therefore, the surrounding context can be indicative when selecting a target expression.
However, that is not the case for statement holes. Most statement sequences serve as stand-alone functionality with relatively complete semantics, and they can appear anywhere regardless of their surrounding context. The surrounding context in this case does not provide useful information and can even be misleading; using it tends to undermine the synthesis algorithm. Consider two programs in which we want to generate a codelet under a loop, but the code we want to synthesize appears at the top level of a function. If we checked the parents of these two codelets, we would throw out the code we want, because one sits under a loop and the other directly under a function.
Overall, it is quite hard to achieve high efficiency when synthesizing statements. The target codelet could appear anywhere in the database program, and a sequence of statements tends to have more references that need renaming. Therefore, we expect synthesizing statements to be a more difficult task. Although role matching cannot be applied when synthesizing stand-alone functionality, it can still be useful when the surrounding context is necessary for a codelet, especially for an incomplete piece of functionality.
4.2.2. Code merging and Testing
One problem with using codelets from the database programs is that their naming schemes differ from those in the original draft program. Therefore, after we have completed the draft program, we search for a reference substitution such that the resulting program refers back to the data defined in the draft program. This step is quite similar to code transplantation (Barr et al., 2015).
The algorithm is shown in Procedure 2. The task is essentially to search for a mapping between the references of the two programs. We first check whether there are undefined references in the program at line 1. If not, we check the program’s correctness against the requirement at line 2; if it passes, we have a solution. If there is still an undefined reference in the program, we try to rename each undefined reference to a defined reference at line 10, and repeat through a recursive call until no undefined names remain in the program. We guide the search using types: when considering renaming a reference x to a reference y, we perform the renaming only if their types are the same. Again, types give us better efficiency. If at any point the algorithm cannot rename a reference due to a lack of available target references, it backtracks and tries another renaming for a previous reference.
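The type-guided renaming search can be modeled abstractly as follows (a toy sketch; the `requirement` predicate stands in for running the tests on the renamed program):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class Rename {
    // Map each undefined reference to a defined reference of the same type,
    // backtracking until a mapping satisfying the requirement is found.
    static Map<String, String> rename(List<String> undefined,
                                      Map<String, String> definedTypes,   // name -> type
                                      Map<String, String> undefinedTypes, // name -> type
                                      Predicate<Map<String, String>> requirement,
                                      Map<String, String> mapping) {
        if (undefined.isEmpty()) {
            return requirement.test(mapping) ? mapping : null;
        }
        String u = undefined.get(0);
        for (Map.Entry<String, String> d : definedTypes.entrySet()) {
            if (!d.getValue().equals(undefinedTypes.get(u))) continue; // type filter
            mapping.put(u, d.getKey());
            Map<String, String> r = rename(undefined.subList(1, undefined.size()),
                    definedTypes, undefinedTypes, requirement, mapping);
            if (r != null) return r;
            mapping.remove(u);                                          // backtrack
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> defined = Map.of("matrix", "double[][]", "n", "int");
        Map<String, String> undef = Map.of("arr", "double[][]");
        Map<String, String> result = rename(new ArrayList<>(List.of("arr")), defined,
                undef, m -> "matrix".equals(m.get("arr")), new HashMap<>());
        System.out.println(result); // {arr=matrix}
    }
}
```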
This reference substitution step is performed every time we complete a draft, and thus the whole algorithm can suffer from exponential blowup. However, since the expressions and statements all come from the finite set of database programs, the search space is finite. Moreover, we also set a time limit on the entire search process, and thus our main synthesis algorithm is guaranteed to terminate.
After we have finished renaming all references in a completed program, we validate the solution against the requirement, given either as a predefined input-output test suite or as a predefined API call sequence constraint in the form of a finite automaton. If the user provides IO tests, we run the solution on the provided test suite to validate its correctness. If an API call sequence constraint is given instead, we encode the constraint into Java source code in which API calls are captured and new variables are defined to keep track of the current state of the finite automaton. When the complete program is run, the constraint is checked automatically, which determines correctness. We also set a time limit on program execution to ensure termination. Note that we can let the synthesis algorithm produce multiple solutions by letting it continue the search after a correct completion is found. If there are multiple correct completions, we rank them in the order they appear and return as many solutions as required.
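As an illustration of this encoding, consider a toy constraint requiring that a hypothetical "load" call precede a "detect" call; the generated checking code would resemble the following (a simplified sketch, since our actual instrumentation is generated automatically from the automaton):

```java
public class ApiMonitor {
    // A two-state automaton: state 0 is the initial state, state 1 is
    // reached after "load"; "detect" is only legal in state 1.
    private int state = 0;

    void onCall(String api) {
        switch (state) {
            case 0:
                if (api.equals("load")) { state = 1; return; }
                break;
            case 1:
                if (api.equals("detect")) { return; } // stay in accepting state
                break;
        }
        throw new IllegalStateException("API call constraint violated at: " + api);
    }

    boolean accepted() {
        return state == 1;
    }

    public static void main(String[] args) {
        ApiMonitor m = new ApiMonitor();
        m.onCall("load");
        m.onCall("detect");
        System.out.println(m.accepted()); // true
    }
}
```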
Note that it is easy to add a selection function to choose the best solution among all the completions. This is useful when requirements that cannot be represented as tests or API call sequence constraints are desirable. For example, we added a simple filter that ignores solutions with two consecutive, redundant return statements. Potentially, one or more layers of selection could be applied after validation to ensure additional program properties.
5. Evaluation
Our goal is to evaluate the performance of program splicing: its ability to complete a draft program using a large code corpus such that the resulting program meets a correctness requirement within a reasonable amount of time. The experiment consists of completing a set of draft programs given a code database, recording a set of relevant statistics for each run. In addition, we test the performance of our code search method, and we present the results of a user study we conducted to test whether our synthesis tool can increase programming productivity.
In this section, we briefly describe our benchmark problems, followed by the experiments and their results. To evaluate the performance of program splicing, we select a set of benchmark problems with corresponding draft programs that model the process of users bringing external resources from the web into an existing codebase. We ensure that the code we synthesize is common in popular online code repositories, so that it is likely to be found in our program database.
It would be desirable to compare our method with existing synthesis methods, including Sketch (Solar-Lezama et al., 2006), syntax-guided synthesis (Alur et al., 2015b), code reuse tools such as S6 (Reiss, 2009), Code Conjure (Hummel et al., 2008), CodeGenie (Lazzarini Lemos et al., 2009), and Hunter (Wang et al., 2016), or other statistical methods (Raghothaman et al., 2015; Gvero and Kuncak, 2015). However, none of these methods is directly comparable, because (1) traditional synthesis methods such as Sketch do not search for or use existing source code, (2) code reuse methods only consider programs at the granularity of functions rather than at a finer level such as statements and expressions, and (3) some methods such as SWIM (Raghothaman et al., 2015) and anyCode (Gvero and Kuncak, 2015) only aim to synthesize API-specific code snippets. As a data point, we fed the draft shown in Figure 12 to Sketch, which was not able to complete the draft within 30 minutes. In contrast, our splicing system generated the correct expressions within 5 seconds after the code search completed. Moreover, our splicing system can synthesize statement sequences, which Sketch cannot handle at all.
Code transplantation, implemented in Scalpel (Barr et al., 2015), is the most similar prior work, so we compare against Scalpel with the correct donor programs provided to it. Note that we cannot apply Scalpel to some of our system-related benchmark problems, because Scalpel targets C programs rather than Java. Systems programming in C and in Java can differ substantially; we therefore only compare our tool with Scalpel on benchmark problems where the solutions do not differ significantly between the two languages.
Our 15 benchmark problems consist of synthesizing components from online repositories. The draft program for most benchmark problems contains one or two statement holes along with expression holes. Each draft program has its own comments and correctness requirements. Most benchmark problems use typical input-output tests, except for “Echo Server”, “Face Detection” and “Hello World GUI”, where an API call sequence constraint is used to check correctness. Here, we highlight two draft programs from the benchmark problems, shown in Figure 11. All the draft programs are listed in Appendix A.1.
LCS Table Building: A user calculates the longest common subsequence of two integer arrays, and she has written a draft program with the code snippets to extract the subsequence from the table and display the result. A hole is left for the code that builds the table for the dynamic programming algorithm.
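As an illustration, the kind of codelet the synthesizer must find for this hole might look like the following. This is a minimal sketch we wrote for exposition; the actual snippets in the database vary in names and style.

```java
// Hypothetical codelet for the "build the LCS table" hole: the standard
// dynamic-programming table for the longest common subsequence.
public class LcsTable {
    static int[][] buildTable(int[] a, int[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1] == b[j - 1]) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;  // extend a common subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp;  // dp[a.length][b.length] is the LCS length
    }
}
```

The draft's remaining snippets would then walk this table backwards to extract and display the subsequence.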
HTTP Server: A user would like to set up an HTTP server that serves the content of a text file. She wrote a draft program that has an HTTP request handler, but she does not remember how to read from a text file or how to set up an HTTP server. Two holes are left for the code that reads from a text file and the code that sets up an HTTP server. In addition, she leaves a hole for the response status code in the request handler.
| Benchmarks | Synthesis Time | No Roles | No Types | LOC | Var | Holes (expr-stmt) | Test | Scalpel |
|---|---|---|---|---|---|---|---|---|
| Hello World GUI | 16.0 | timeout | timeout | 24-33 | 4 | 1-2 | C | N/A |
| Prim’s Distance Update | 61.1 | 66.4 | timeout | 53-58 | 11 | 1-1 | 4 | timeout |
We implemented program splicing in Scala 2.12.1 on 64-bit OpenJDK 8, and we used BeanShell (bea, 2017) and Nailgun (nai, 2017) to test all the completed draft programs. For each benchmark problem, we ran the system on the draft program we derived. These experiments were conducted on a 2.2GHz Intel Xeon CPU with 12 cores and 64GB RAM. For each program, we set a time limit of 5 minutes and recorded the synthesis runtime. To give a rough sense of the search-space size, we list the number of variables and holes in each draft program, the number of lines, and the number of database programs used for synthesis. Finally, we list the LOC of the draft program and its completed version. Our corpus comes from the Maven 2012 dataset from Sourcerer (Sajnani et al., 2014; Ossher et al., 2012; Bajracharya et al., 2014), from which we extracted over 3.5 million methods with features.
5.2.1. Synthesis Algorithm Evaluation
Table 1 shows the results for each benchmark problem with N = 5, where N is the number of database programs we retrieve. We set N = 5 because five programs are usually sufficient to ensure that the retrieved programs contain the target codelet we want to synthesize. In addition, when performing the k-nearest-neighbor search over the database, we put more weight on features that consider comments and variable names. The choice of weights is explained in Section 4.1.
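The weighted nearest-neighbor retrieval over feature vectors can be sketched as follows. The feature layout, weights, and all names here are our own illustrative assumptions, not the system's actual configuration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative weighted k-nearest-neighbor code search: each database method is
// a feature vector, and features derived from comments and variable names can be
// given higher weights than other features via the weight vector w.
public class WeightedKnn {
    static double weightedDistance(double[] q, double[] p, double[] w) {
        double sum = 0;
        for (int i = 0; i < q.length; i++) {
            double d = q[i] - p[i];
            sum += w[i] * d * d;  // per-feature weight
        }
        return Math.sqrt(sum);
    }

    // Return the indices of the k database vectors closest to the query.
    static int[] nearest(double[] query, double[][] db, double[] w, int k) {
        Integer[] idx = new Integer[db.length];
        for (int i = 0; i < db.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(
                (Integer i) -> weightedDistance(query, db[i], w)));
        int[] top = new int[k];
        for (int i = 0; i < k; i++) top[i] = idx[i];
        return top;
    }
}
```

With N = 5, the synthesizer would call `nearest(query, db, w, 5)` and treat the five returned methods as the candidate pool for codelet extraction.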
According to the results shown in Table 1, data-driven synthesis works for all benchmark problems. The code search, which is based on k-nearest-neighbor search, takes approximately 15 seconds for most benchmarks; this is very efficient, given that the database contains millions of functions. For most benchmark problems, our method completed the draft program in under two minutes, and the number of tests required was no more than five, so users of our system are not burdened with writing many tests. Note that for “Echo Server”, “Face Detection” and “Hello World GUI”, the letter “C” signals that an API call sequence constraint is used to test correctness. The Java testing infrastructure we used eliminated a large amount of program-execution overhead. We can also see that synthesis takes more time as the number of holes and variables increases: more holes, more variables, and sometimes more lines lead to a larger combinatorial search space for hole substitutions with codelets, and more variables enlarge the search space for code merging and renaming.
Impact of type matching and role matching. Types guide the search during hole substitution and code merging, eliminating candidate solutions that do not type-check. In addition, role matching eliminates expression substitutions in which the role of a candidate expression differs from the role of the hole. To understand their impact, we recorded the synthesis time without the type heuristic, shown in the “No Types” column of Table 1; the “No Roles” column shows the runtime without the role-matching heuristic. Types and roles both prune a large amount of the search space, although types appear to be more effective. These heuristics become increasingly important for larger draft programs as the number of variables grows; without them, our synthesis algorithm even timed out on some of the harder benchmark problems. Note that role matching applies only when synthesizing expressions, not statement sequences, which is why there is no difference for the “LCS” benchmark problem.
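Schematically, both heuristics amount to pruning candidate expressions before any expensive enumeration or test execution. The sketch below reduces types and roles to plain strings purely for illustration; all names are our own assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Schematic pruning of candidate expressions for a hole: discard candidates
// whose static type or syntactic role disagrees with the hole's, shrinking the
// combinatorial search space before any test is ever run.
public class Pruning {
    static final class Candidate {
        final String expr, type, role;
        Candidate(String expr, String type, String role) {
            this.expr = expr;
            this.type = type;
            this.role = role;
        }
    }

    static List<Candidate> prune(List<Candidate> cands, String holeType, String holeRole) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : cands) {
            // type matching + role matching
            if (c.type.equals(holeType) && c.role.equals(holeRole)) kept.add(c);
        }
        return kept;
    }
}
```

Because the surviving candidates multiply across holes, even a modest per-hole reduction shrinks the overall combinatorial search space substantially, which matches the timeouts observed in the “No Types” and “No Roles” columns.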
Scalpel Comparison. Code transplantation (Barr et al., 2015) is very similar to our work, except that it does not use a large code corpus. It is still worthwhile to compare performance, since Scalpel also extracts code snippets from external (donor) programs. We ran Scalpel multiple times on some of our benchmark problems with correct donors specified. Note that Scalpel has an advantage over our system in this setting, because it does not need to search a large code corpus for relevant programs. Nevertheless, even with this advantage, most of the runs could not finish within 5 minutes, except for “Sieve Prime”, one of the easy problems. Although we did not run Scalpel on all benchmark problems, it is reasonable to conclude that Scalpel, which is based on genetic programming, is not as efficient as our system, which is based on enumerative search.
5.2.2. Code Search Evaluation
The synthesis result depends heavily on the quality of the programs retrieved from the database. A high-quality program is one that contains the exact codelet we want to synthesize; without such programs, no solution exists in the search space. It is therefore important to study the quality of the database programs used during synthesis. We define precision as the ratio of high-quality programs to the total number of programs used for synthesis.
Calculating this precision directly is technically difficult, because checking whether a portion of a database program satisfies a certain property requires deep, expensive analysis: searching a program for a codelet that satisfies a specification is essentially another code search and verification problem. We therefore use a proxy to approximate precision. To see whether a program contains the codelet we want, we run the synthesis algorithm using that particular program and remove the time constraint. If the program contains the codelet, synthesis will eventually succeed, because the search space is finite. In effect, our synthesis algorithm searches each database program for the target codelet and checks its property using the correctness requirements.
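The proxy boils down to: run synthesis against each retrieved program individually and count the successes. Schematically (the `synthesizesWith` predicate is a hypothetical stand-in for a full, time-unlimited synthesis run):

```java
import java.util.function.Predicate;

// Schematic precision computation: a retrieved program is "high-quality" if a
// per-program synthesis run succeeds with it alone, so precision is the number
// of successes divided by the number of retrieved programs.
public class Precision {
    static <P> double precision(Iterable<P> retrieved, Predicate<P> synthesizesWith) {
        int total = 0, good = 0;
        for (P p : retrieved) {
            total++;
            if (synthesizesWith.test(p)) good++;  // proxy: unconstrained synthesis run
        }
        return total == 0 ? 0.0 : (double) good / total;
    }
}
```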
We calculate precision as the number of database programs, N, increases, recording the number of high-quality programs out of the total number of programs used for synthesis. Again, we put more weight on natural-language features and less on other features for the k-nearest-neighbor search. It is also interesting to see whether well-known code search engines could benefit our synthesis algorithm, so we compare our code search method with the GitHub (git, 2016) code search engine. We chose GitHub because its search method is similar enough to ours to be comparable; code search methods that rely on code patterns or other syntactic elements (Jiang et al., 2007; Keivanloo et al., 2014) are quite different from our method and thus not comparable. We searched GitHub for similar programs using comments and keywords from the draft programs and measured their precision. We restricted the search to files with the .java extension that were also tagged as written in Java, used “best match” ordering, and selected the top N entries from the search results. When a source file contains multiple functions, we simply pick the function most likely to contain the target codelets.
Figure 13 shows the numerical difference between the precision of our code search method and that of the GitHub search engine, with various values of N on the x-axis. A positive percentage indicates that our code search method finds more high-quality programs; a negative percentage means the opposite. The figure shows that our code search method is quite effective at fetching high-quality programs across different benchmark problems and values of N. Moreover, within the first five programs our method always finds more high-quality programs than the well-known search engine, meaning that our method has better precision. In some cases, such as “Collision Detection” and “LCS”, GitHub tends to be more precise; after inspecting the programs from our database and those from GitHub, we believe the reason is that our database happens to contain more repetitive and testing functions in the “LCS” case, and different implementations in the “Collision Detection” case.
While calculating precision, we discovered that finding high-quality programs is actually quite difficult, because even a small change can completely alter a program's usability. Besides programs that contain syntactic features our system does not support, the following factors can affect a program's usability, even for a human programmer:
Different implementation. If an algorithm has more than one implementation, we might retrieve an implementation that cannot be used for synthesis; in most cases, different implementations tend not to include the codelets we want. For example, if we want to synthesize the main loop of an iterative binary search, a recursive version is unusable. Multi-threaded versions and versions that use different libraries are also quite common, and we cannot use those either.
In addition, some database programs use different constants for variable initializations and array sizes, and loop ranges may differ slightly as well. These differences usually lead to crashes or logic errors if the codelets are used to complete the draft without modification, yet it is typically trivial for a user to change such constants and operators so that the retrieved programs can be used. We therefore performed simple transformations on the database programs to reflect this, preventing precision from being unfairly undermined.
Repetitive, irrelevant and invalid programs. With an enormous number of programs in a database, ensuring the quality of every single program is difficult. It is easy to retrieve garbage programs when searching a large corpus, especially since we search with natural language, which is ambiguous by nature. For example, when we search for quick sort, we get back other sort functions, driver functions for sorting algorithms, and test functions with “sort” in their names. Repetitive and empty functions are also common: repetitive programs can come from repository cloning and duplicated commits, and empty functions appear to have been created and later abandoned by programmers.
5.2.3. User Study
It is unclear whether our system will be beneficial in the wild, for use by actual developers. In this subsection, we describe a user study aimed at answering this question.
Study setup. We recruited twelve graduate students and six professional programmers and developed four programming problems. Each participant was asked to complete all four programming problems using a web-based programming environment. Per person, two problems were completed using program splicing (we subsequently call this a “with” task), and two without (a “without” task). “With” and “without” tasks were assigned to participants randomly.
In order to simulate an industrial setting in which an engineer is asked to develop code meeting a provided specification, for each task participants were given a description of the target program they need to implement, as well as a description of the test cases they need to write to verify the program's correctness. Figure 13(a) shows an example skeleton program for the “without” task on the Sieve of Eratosthenes programming problem, and Figure 13(b) shows a draft program where participants need to fill in comments and requirements to complete the “with” task.
When completing both “with” and “without” tasks, participants were encouraged to find and use relevant code snippets from the Internet. For the “with” tasks, participants were asked to use our system to produce at least one candidate solution to the programming problem, but they could then choose whether or not to use that candidate. Before using the web-based programming environment and our system, participants finished a warm-up problem in order to become familiar with both.
To evaluate whether our system could boost programming productivity, we recorded the amount of time participants took to correctly complete each programming problem. To determine whether there is a statistically significant difference in completion time for “with” versus “without” tasks on the same programming problem, we define the following null hypothesis:
H_i = “For programming problem i, the expected ‘without’ task completion time is no greater than the expected ‘with’ task completion time.”
If this hypothesis is rejected at a sufficiently small p-value for a specific programming problem, then the average completion time is likely smaller for the “with” task than for the “without” task, and hence program splicing likely has some benefit on the problem.
Given the times recorded for each problem and task, we use the bootstrap (Efron, 1982) to calculate the p-value for each programming problem. The bootstrap simulates a large number of datasets by re-sampling the original data with replacement many times; the p-value is approximated by the fraction of simulated datasets in which the null hypothesis holds.
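The bootstrap test just described can be sketched as follows. This is a simplified version we wrote for exposition; the paper's exact resampling scheme may differ in detail.

```java
import java.util.Random;

// Simplified bootstrap test of the null hypothesis
// "mean('without' time) <= mean('with' time)": resample both samples with
// replacement many times and report the fraction of resamples in which the
// null holds.
public class Bootstrap {
    static int[] resample(int n, Random rng) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = rng.nextInt(n);  // with replacement
        return idx;
    }

    static double mean(double[] xs, int[] pick) {
        double s = 0;
        for (int i : pick) s += xs[i];
        return s / pick.length;
    }

    static double pValue(double[] without, double[] with, int reps, long seed) {
        Random rng = new Random(seed);
        int nullHolds = 0;
        for (int r = 0; r < reps; r++) {
            int[] a = resample(without.length, rng);
            int[] b = resample(with.length, rng);
            if (mean(without, a) <= mean(with, b)) nullHolds++;
        }
        return (double) nullHolds / reps;
    }
}
```

A small p-value means that in almost no resampled dataset was the “without” mean below the “with” mean, which is evidence that splicing reduced completion time.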
In addition to measuring time, we also recorded the number of times that “with” task participants for each problem asked the program splicing system for help. Typically the participants would stop using our system after they have received a useful codelet, and so a large number of requests may indicate an inability of the system to produce a useful result.
Programming Problems. The four programming problems were as follows:
Sieve of Eratosthenes: Implement the Sieve of Eratosthenes to test the primality of an integer. This is an interesting programming problem because it is purely algorithmic, involves no systems programming or API calls, and code solving it is ubiquitous on the Internet. Going into our study, we expected program splicing to be of little use on this problem, because an Internet search should return many Sieve programs that are trivial to tailor into a solution. Given this, and the fact that test code is so easy to write, we expected participants to take the least time on this problem, whether assigned a “with” or a “without” task.
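For reference, a solution to this task might look like the following sketch (names and the exact interface are our own assumptions):

```java
// Sieve of Eratosthenes primality test: mark all composites up to n, then
// check whether n itself survived as prime.
public class Sieve {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        boolean[] composite = new boolean[n + 1];
        for (int i = 2; (long) i * i <= n; i++) {
            if (!composite[i]) {
                for (int j = i * i; j <= n; j += i) {
                    composite[j] = true;  // j has factor i
                }
            }
        }
        return !composite[n];
    }
}
```

The deceptive part, as the study found, is the inner-loop bookkeeping (starting at i*i, stepping by i), which participants writing from scratch often get subtly wrong.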
File Name Collection: Collect all file names under a directory tree recursively and return a list of file names. We chose this problem because it represents an easy systems programming problem. Further, there is no standard solution to this problem, while it is still quite easy to write tests. Therefore we expected Internet search to be less useful, whereas program splicing might be quite helpful.
CSV Matrix Multiplication: Read a matrix from a CSV file, square the matrix, and return it as a 2d-array. This problem combines systems programming and algorithmic programming. We chose it expecting that “with” task programmers would need to use our system multiple times, interactively, to generate two independent code snippets; given this, we expected the time gap between the “with” and “without” task participants to be smaller.
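One plausible solution shape for this task is sketched below; it reads from a `Reader` so the parsing part can be exercised without a real file, and the comma delimiter and numeric format are assumptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Sketch of the CSV Matrix Multiplication task: parse a comma-separated square
// matrix, then return the matrix multiplied by itself.
public class CsvSquare {
    static double[][] readMatrix(Reader src) throws IOException {
        List<double[]> rows = new ArrayList<>();
        BufferedReader br = new BufferedReader(src);
        String line;
        while ((line = br.readLine()) != null) {
            String[] cells = line.split(",");
            double[] row = new double[cells.length];
            for (int i = 0; i < cells.length; i++) {
                row[i] = Double.parseDouble(cells[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    static double[][] square(double[][] m) {
        int n = m.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    out[i][j] += m[i][k] * m[k][j];  // standard triple loop
        return out;
    }
}
```

The two halves (CSV parsing and matrix squaring) correspond to the two independent code snippets that “with” participants were expected to synthesize in separate interactions.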
HTML Parsing: Read and parse an HTML document from a text file, store all links containing a given word into a result list, and return that list. This is the most difficult of the four problems. Not only would “with” task participants need to use our system multiple times, but they must also write tests for HTML manipulation. Specifically, program splicing requires participants to manually provide HTML to build the test cases that validate the code for extracting links from the parsed HTML document. At the same time, the JSoup (jso, 2017) HTML parsing library that we asked participants to use has rather comprehensive and straightforward documentation. Hence, we expected the time gap between “with” and “without” task participants to be the smallest among the four problems.
Figure 16 shows the p-values for each programming problem, as well as the number of times code splicing was invoked for each problem's “with” task. Figure 15 shows the time spent on each submission with and without splicing, including the average time and box plots. For every programming problem except HTML, the average time to finish the “with” task is significantly lower than the time required to finish the “without” task. The p-values in Figure 16 are also small enough for us to reject the null hypotheses (stating that there is no utility to program splicing) with over 99% confidence.
Note that the average number of program splicing invocations for most problems (except HTML Parsing) is very close to one, meaning that program splicing could return codelets that the participants could use to complete the problem with only one try. We argue that this also indicates that the system is rather easy to use, and is indeed able to boost programming productivity in many cases. As the level of difficulty of the problem increases, so does the benefit of using our system.
It is, however, useful to consider the HTML Parsing programming problem, the one case where program splicing was not useful. Why is this? What is the failure mode? After careful investigation, we believe there are two reasons program splicing did not help. First, the documentation of the HTML parsing library JSoup (jso, 2017) is very comprehensive and well done, so the problem was easy. Second, it is very easy to make mistakes when writing tests, which required developing correct HTML code and inserting it into a test. Participants had a difficult time with this: they typically forgot to escape quote characters within a string when loading a variable containing even very simple HTML. Although a better programming environment would help, the difficulty of writing tests meant that program splicing was less helpful. That said, writing tests has independent value, and if the difficulty of writing tests was the key impediment to using splicing, it is not a strong argument against the tool.
| Problem | p-value | Avg. Number of Invocations |
We close this subsection by asking: when is program splicing likely to be most useful for programmers? One surprising case is programming problems that are deceptively simple, containing intricate algorithmics (loops and recursion) that programmers tend to struggle with. Sieve of Eratosthenes falls into this category: it appears very simple, so we initially expected splicing to be of little use. However, because of this perceived simplicity, “without” participants tended to write their own solutions without consulting the Internet (even though we encouraged Internet use), and this over-confidence resulted in buggy programs and longer development times. We were especially surprised to find this for the Sieve of Eratosthenes and for the matrix multiplication part of CSV Matrix Multiplication, where many solutions are available on the Internet. Program splicing protected “with” participants from such difficulties.
We also found splicing to be useful when documentation is lacking and there is no standard way of doing things. Consider CSV Matrix Multiplication and File Name Collection, where the official Java documentation does not provide any code snippets on how to parse a CSV file or how to collect file names under a directory subtree. “Without” participants had to comb through solutions from StackOverflow (sta, 2017), where multiple solutions exist, using different libraries, each with various pros and cons. Program splicing removes the need for manually searching through and understanding many possible solutions: if splicing succeeds and the result passes the provided test cases, the user can be reasonably confident that the solution is correct.
6. Related Work
The problem of program synthesis has been studied for a long time (Pnueli and Rosner, 1989; Alur et al., 2015a), and recently it has been successfully applied to domain-specific tasks (Yaghmazadeh et al., 2016; Feng et al., 2017; Feng et al., 2016; Polozov and Gulwani, 2015). In particular, the notion of drafts used in program splicing is inspired by “sketches” (Solar-Lezama et al., 2006) and “templates” from previous approaches to synthesis (Srivastava et al., 2012). The sort of combinatorial search that our synthesis algorithm uses has parallels in enumerative approaches to synthesis (Udupa et al., 2013; Feser et al., 2015). However, the key difference between program splicing and other synthesis techniques is that our method reuses existing source code from the web instead of generating programs from scratch. This gives our method a scalability advantage; moreover, traditional synthesis methods such as Sketch (Solar-Lezama et al., 2006) cannot synthesize statements efficiently.
Code transplantation and other methods based on genetic programming (Barr et al., 2015; Petke et al., 2014; Harman et al., 2014; Jia et al., 2015; Marginean et al., 2015) are very similar to our work: they transplant external code snippets, functionalities, or organs across programs. However, they do not consider searching a large code corpus, and genetic-programming-based methods suffer from serious efficiency issues according to our experimental results. CodePhage (Sidiroglou-Douskos et al., 2015) also transplants arbitrary code snippets across applications, but the transplantations are done exclusively on binary programs. In addition, their experiments only consider adding checks for program repair, and it is unclear whether the approach applies to arbitrary code.
Code reuse tools such as CodeConjure (Hummel et al., 2008), S (Reiss, 2009), CodeGenie (Lazzarini Lemos et al., 2009), and Hunter (Wang et al., 2016) are quite beneficial to programmers and have been shown to increase programming efficiency. Some are similar to our work in that they use natural language and tests to search for relevant programs. The major difference is that these tools usually consider programs at the granularity of functions, and sometimes require a piece of type-adapter code (Wang et al., 2016). Our method, in contrast, provides a Sketch-like interface where programmers can write holes; it must therefore dig into functions and look for statements and expressions, which opens the door to an exponentially larger code database for reuse.
Program analysis and synthesis using a large pool of existing source code, or “Big Code”, has recently gained a lot of attention. Mishne et al. (2012) focus on mining specifications for API calls from a large body of code snippets. Statistical methods such as graphical models (Raychev et al., 2015), language models (Raychev et al., 2014; Hindle et al., 2012; Nguyen and Nguyen, 2015), and learning from noisy data (Raychev et al., 2016) have been shown to be effective at inferring program properties and code completion. Our work, however, does not depend heavily on statistical methods. SWIM (Raghothaman et al., 2015) and related work (Gvero and Kuncak, 2015) also use natural-language queries and the web to search for code snippets such as API usages, but they do not consider draft programs that provide a context with which new programs must be merged. DeepCoder (Balog et al., 2016) has gained popularity recently: it successfully applies deep learning to program synthesis and shows that learning can reduce the search space. Our work differs in that we do not use any statistical method, and DeepCoder targets a simpler language.
Copy-and-paste has long been seen as a problematic approach to programming. However, a recent paper (Narasimhan and Reichenbach, 2015) seeks to redeem copy-and-paste through a method that finds clones of a fully written program and automatically merges the cloned code with the new code. That work only considers extremely similar programs whose parse trees are a few edits apart and does not consider draft programs. In contrast, our approach can bring arbitrary programs from the Internet into a programming context, using a combinatorial synthesizer to splice this code into the context. Gilligan (Holmes and Walker, 2013) serves as a code reuse assistant, which is similar to our approach, but it does not consider using external source code.
Much of the related work on finding similar code has focused on clone detection: finding syntactically exact or nearly-exact copies of source code fragments. See (Roy et al., 2009) for a relatively recent survey of these techniques. Code clone detection usually assumes that two fragments of code were derived from the same original code, flagging fragments with certain types or quantities of edits between them as clones. Syntax elements have been shown to be effective in code search (Jiang et al., 2007; Keivanloo et al., 2014), but our code search mainly relies on natural language with minor syntax elements. We also differ from other token-based code search engines in that we do not only rely on natural language tokens in code.
Code search has been performed in various other settings, using different code features: TRACY (David and Yahav, 2014) uses k-limited static program paths (called tracelets) to search similar stripped binary code snippets, Exemplar (Grechanik et al., 2010) uses the Java APIs exercised to aid code search, and Portfolio (McMillan et al., 2011) uses function call graphs to improve code search and ranking. See Table 3 in (McMillan et al., 2011) for a comprehensive comparison of various code search engines; most of these target user-facing search engines that find relevant code from a user-specified query. SMT solvers (Stolee and Elbaum, 2012) have also been used for semantic code search, and the combination of SMT solvers and semantic code search has been shown to be effective at repairing programs (Ke et al., 2015).
In this paper, we introduce program splicing, a synthesis-based approach to programming that can serve as a principled and automated substitute for copying and pasting code from the internet. The main technology behind program splicing is a program synthesizer that can query a database containing a large number of code snippets mined from open-source software repositories. Our experiments show that it is possible to synthesize code snippets by fruitfully combining such database queries with combinatorial exploration of a space of expressions and statements. We also conducted a user study and the results show that our method could indeed boost programming productivity.
An important direction for future work is ensuring the high quality of database programs, since their quality significantly affects synthesis results. We could develop better features and similarity metrics in order to increase code search precision.
Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. Grant #3. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
- git (2016) 2016. GitHub. (2016). https://github.com Accessed: 2016-08-25.
- bea (2017) 2017. BeanShell. (2017). https://github.com/beanshell/beanshell Accessed: 2017-04-04.
- jso (2017) 2017. JSoup. (2017). https://jsoup.org Accessed: 2017-04-02.
- nai (2017) 2017. Nailgun. (2017). http://martiansoftware.com/nailgun/ Accessed: 2017-04-04.
- sta (2017) 2017. Stackoverflow. (2017). https://stackoverflow.com Accessed: 2017-04-02.
- Alur et al. (2015a) Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015a. Syntax-guided synthesis. Dependable Software Systems Engineering 40 (2015), 1–25.
- Alur et al. (2015b) Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015b. Syntax-guided synthesis. Dependable Software Systems Engineering 40 (2015), 1–25.
- Bajracharya et al. (2014) Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Science of Computer Programming 79 (2014), 241 – 259. DOI:http://dx.doi.org/10.1016/j.scico.2012.04.008
- Balog et al. (2016) Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. DeepCoder: Learning to Write Programs. arXiv preprint arXiv:1611.01989 (2016).
- Barr et al. (2015) Earl T. Barr, Mark Harman, Yue Jia, Alexandru Marginean, and Justyna Petke. 2015. Automated Software Transplantation. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA 2015). ACM, New York, NY, USA, 257–269. DOI:http://dx.doi.org/10.1145/2771783.2771796
- David and Yahav (2014) Yaniv David and Eran Yahav. 2014. Tracelet-based code search in executables. In ACM SIGPLAN Notices, Vol. 49. ACM, 349–360.
- Efron (1982) Bradley Efron. 1982. The jackknife, the bootstrap and other resampling plans. SIAM.
- Feild et al. (2006) Henry Feild, David Binkley, and Dawn Lawrie. 2006. An empirical comparison of techniques for extracting concept abbreviations from identifiers. In Proceedings of IASTED International Conference on Software Engineering and Applications (SEA’06). Citeseer.
- Feng et al. (2016) Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2016. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. CoRR abs/1611.07502 (2016). http://arxiv.org/abs/1611.07502
- Feng et al. (2017) Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017. Component-based Synthesis for Complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2017). ACM, New York, NY, USA, 599–612. DOI:http://dx.doi.org/10.1145/3009837.3009851
- Feser et al. (2015) John K. Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing Data Structure Transformations from Input-output Examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2015). ACM, New York, NY, USA, 229–239. DOI:http://dx.doi.org/10.1145/2737924.2737977
- Grechanik et al. (2010) Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby. 2010. A search engine for finding highly relevant applications. In ACM/IEEE International Conference on Software Engineering. ACM Press, New York, New York, USA.
- Gvero and Kuncak (2015) Tihomir Gvero and Viktor Kuncak. 2015. Interactive synthesis using free-form queries. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 689–692.
- Harman et al. (2014) Mark Harman, Yue Jia, and William B Langdon. 2014. Babel pidgin: SBSE can grow and graft entirely new functionality into a real world system. In International Symposium on Search Based Software Engineering. Springer, 247–252.
- Hindle et al. (2012) Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 837–847.
- Holmes and Walker (2013) Reid Holmes and Robert J. Walker. 2013. Systematizing Pragmatic Software Reuse. ACM Trans. Softw. Eng. Methodol. 21, 4, Article 20 (Feb. 2013), 44 pages. DOI:http://dx.doi.org/10.1145/2377656.2377657
- Hummel et al. (2008) Oliver Hummel, Werner Janjic, and Colin Atkinson. 2008. Code conjurer: Pulling reusable software out of thin air. IEEE Software 25, 5 (2008).
- Jia et al. (2015) Yue Jia, Mark Harman, William B Langdon, and Alexandru Marginean. 2015. Grow and serve: Growing Django citation services using SBSE. In International Symposium on Search Based Software Engineering. Springer, 269–275.
- Jiang et al. (2007) Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 96–105.
- Juergens et al. (2009) Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, and Stefan Wagner. 2009. Do code clones matter? In Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 485–495.
- Ke et al. (2015) Yalin Ke, Kathryn T Stolee, Claire Le Goues, and Yuriy Brun. 2015. Repairing programs with semantic code search (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 295–306.
- Keivanloo et al. (2014) Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. ACM, 664–675.
- Kim et al. (2004) Miryung Kim, Lawrence Bergman, Tessa Lau, and David Notkin. 2004. An ethnographic study of copy and paste programming practices in OOPL. In Empirical Software Engineering, 2004. ISESE’04. Proceedings. 2004 International Symposium on. IEEE, 83–92.
- Lazzarini Lemos et al. (2009) Otávio Augusto Lazzarini Lemos, Sushil Bajracharya, Joel Ossher, Paulo Cesar Masiero, and Cristina Lopes. 2009. Applying test-driven code search to the reuse of auxiliary functionality. In Proceedings of the 2009 ACM symposium on Applied Computing. ACM, 476–482.
- Marginean et al. (2015) Alexandru Marginean, Earl T Barr, Mark Harman, and Yue Jia. 2015. Automated transplantation of call graph and layout features into Kate. In International Symposium on Search Based Software Engineering. Springer, 262–268.
- McMillan et al. (2011) Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: finding relevant functions and their usage. In International conference on Software engineering. ACM Press, New York, New York, USA.
- Mishne et al. (2012) Alon Mishne, Sharon Shoham, and Eran Yahav. 2012. Typestate-based Semantic Code Search over Partial Programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’12). ACM, New York, NY, USA, 997–1016. DOI:http://dx.doi.org/10.1145/2384616.2384689
- Narasimhan and Reichenbach (2015) Krishna Narasimhan and Christoph Reichenbach. 2015. Copy and Paste Redeemed. In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 630–640.
- Nguyen and Nguyen (2015) Anh Tuan Nguyen and Tien N Nguyen. 2015. Graph-based statistical language model for code. In Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 858–868.
- Ossher et al. (2012) J. Ossher, H. Sajnani, and C. Lopes. 2012. Astra: Bottom-up Construction of Structured Artifact Repositories. In Reverse Engineering (WCRE), 2012 19th Working Conference on. 41–50. DOI:http://dx.doi.org/10.1109/WCRE.2012.14
- Petke et al. (2014) Justyna Petke, Mark Harman, William B Langdon, and Westley Weimer. 2014. Using genetic improvement and code transplants to specialise a C++ program to a problem class. In European Conference on Genetic Programming. Springer, 137–149.
- Pnueli and Rosner (1989) Amir Pnueli and Roni Rosner. 1989. On the Synthesis of an Asynchronous Reactive Module. In Proceedings of the 16th International Colloquium on Automata, Languages and Programming (ICALP ’89). Springer-Verlag, London, UK, UK, 652–671. http://dl.acm.org/citation.cfm?id=646243.681607
- Polozov and Gulwani (2015) Oleksandr Polozov and Sumit Gulwani. 2015. Flashmeta: A framework for inductive program synthesis. ACM SIGPLAN Notices 50, 10 (2015), 107–126.
- Raghothaman et al. (2015) Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2015. SWIM: Synthesizing What I Mean. arXiv preprint arXiv:1511.08497 (2015).
- Raychev et al. (2016) Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In ACM SIGPLAN Notices, Vol. 51. ACM, 761–774.
- Raychev et al. (2015) Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '15). ACM, New York, NY, USA, 111–124. DOI:http://dx.doi.org/10.1145/2676726.2677009
- Raychev et al. (2014) Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, New York, NY, USA, 419–428. DOI:http://dx.doi.org/10.1145/2594291.2594321
- Reiss (2009) Steven P. Reiss. 2009. Semantics-based Code Search. In Proceedings of the 31st International Conference on Software Engineering (ICSE ’09). IEEE Computer Society, Washington, DC, USA, 243–253. DOI:http://dx.doi.org/10.1109/ICSE.2009.5070525
- Roy et al. (2009) Chanchal K Roy, James R Cordy, and Rainer Koschke. 2009. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74, 7 (2009), 470–495.
- Sajnani et al. (2014) H. Sajnani, V. Saini, J. Ossher, and C.V. Lopes. 2014. Is Popularity a Measure of Quality? An Analysis of Maven Components. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on. 231–240. DOI:http://dx.doi.org/10.1109/ICSME.2014.45
- Sidiroglou-Douskos et al. (2015) Stelios Sidiroglou-Douskos, Eric Lahtinen, Fan Long, and Martin Rinard. 2015. Automatic error elimination by horizontal code transfer across multiple applications. In ACM SIGPLAN Notices, Vol. 50. ACM, 43–54.
- Solar-Lezama (2009) Armando Solar-Lezama. 2009. The sketching approach to program synthesis. In Asian Symposium on Programming Languages and Systems. Springer, 4–13.
- Solar-Lezama et al. (2006) Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. 2006. Combinatorial Sketching for Finite Programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII). ACM, New York, NY, USA, 404–415. DOI:http://dx.doi.org/10.1145/1168857.1168907
- Srivastava et al. (2012) Saurabh Srivastava, Sumit Gulwani, and Jeffrey S. Foster. 2012. Template-based program verification and program synthesis. International Journal on Software Tools for Technology Transfer 15, 5 (2012), 497–518. DOI:http://dx.doi.org/10.1007/s10009-012-0223-4
- Stolee and Elbaum (2012) Kathryn T Stolee and Sebastian Elbaum. 2012. Toward semantic search via SMT solver. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, 25.
- Udupa et al. (2013) Abhishek Udupa, Arun Raghavan, Jyotirmoy V Deshmukh, Sela Mador-Haim, Milo MK Martin, and Rajeev Alur. 2013. TRANSIT: specifying protocols with concolic snippets. ACM SIGPLAN Notices 48, 6 (2013), 287–296.
- Wang et al. (2016) Yuepeng Wang, Yu Feng, Ruben Martins, Arati Kaushik, Isil Dillig, and Steven P. Reiss. 2016. Hunter: Next-generation Code Reuse for Java. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 1028–1032. DOI:http://dx.doi.org/10.1145/2950290.2983934
- Yaghmazadeh et al. (2016) Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing Transformations on Hierarchically Structured Data. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 508–521. DOI:http://dx.doi.org/10.1145/2908080.2908088
Appendix A
A.1. Draft Programs in Benchmarks
- Binary Search Draft:
- Collision Detection Draft:
- CSV Draft:
- Echo Server Draft:
- Face Detection Draft:
- Collecting Files Draft:
- Hello World GUI Draft:
- HTTP Server Draft:
- LCS Draft:
- Matrix Multiplication Draft:
- Prim’s Distance Update Draft:
- Quick Sort Draft:
- Sieve Prime Draft:
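The draft listings themselves were not recovered in this version of the document. For illustration only, the following is a hypothetical sketch in the style the paper describes: a skeleton with a natural-language comment marking a hole, and test cases serving as the incomplete correctness requirement. The `binarySearch` body below stands in for synthesizer-produced code and is our own illustration, not the actual "Binary Search" benchmark draft.

```java
// Hypothetical illustration of a draft program in the style described in
// the introduction: the programmer writes unfinished code with a
// natural-language hole and test cases; the synthesizer fills the hole
// with code adapted from snippets in the database.
public class BinarySearchDraft {
    static int binarySearch(int[] a, int key) {
        // DRAFT HOLE: "return index of key in sorted array a, or -1 if absent"
        // (the body below stands in for the synthesizer's completion)
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;  // avoids overflow of (lo + hi)
            if (a[mid] == key) return mid;
            else if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Incomplete correctness requirement: test cases the completed
        // draft must satisfy.
        int[] a = {1, 3, 5, 7, 9};
        if (binarySearch(a, 7) != 3) throw new AssertionError();
        if (binarySearch(a, 1) != 0) throw new AssertionError();
        if (binarySearch(a, 4) != -1) throw new AssertionError();
        System.out.println("draft tests pass");
    }
}
```

Under this workflow, the programmer could then revise the completed code by hand and run further rounds of synthesis on any remaining holes.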