Deep API Programmer: Learning to Program with APIs

04/14/2017 ∙ by Surya Bhupatiraju, et al. ∙ MIT ∙ Microsoft

We present DAPIP, a Programming-By-Example system that learns to program with APIs to perform data transformation tasks. We design a domain-specific language (DSL) that allows for arbitrary concatenations of API outputs and constant strings. The DSL consists of three families of APIs: regular expression-based APIs, lookup APIs, and transformation APIs. We then present a novel neural synthesis algorithm to search for programs in the DSL that are consistent with a given set of examples. The search algorithm uses recently introduced neural architectures to encode input-output examples and to model the program search in the DSL. We show that the synthesis algorithm outperforms baseline methods for synthesizing programs on both synthetic and real-world benchmarks.


1 Introduction

The ability to discover a program consistent with a given user intent (specification) is considered one of the central problems in artificial intelligence [Green1969]. While significant progress has been made in synthesizing programs in different domains [Alur et al.2013], current synthesis techniques do not scale to larger and more complex programs. Moreover, the state-of-the-art synthesis techniques [Gulwani et al.2012] require a great deal of domain expertise with manually designed heuristics and rules to develop an efficient search procedure. In this paper, we present Dapip, or Deep API Programmer, a system that aims to overcome some of these shortcomings by automatically learning a synthesis algorithm in the domain of data transformation tasks.

The process of transforming data from raw data into usable formats (also known as data wrangling) is a key problem faced by data scientists for any data analysis task. Some studies have reported that this data wrangling process can sometimes take up to 80% of the total data analysis time [Dasu and Johnson2003, Kandel et al.2011]. Recently, Programming-By-Example (PBE) techniques such as FlashFill [Gulwani2011, Gulwani et al.2012] and BlinkFill [Singh2016] were developed to help users perform data transformation tasks using examples instead of having to write complex programs. These techniques encode the space of programs using a domain-specific language (DSL), and then develop algorithms based on version-space algebra (VSA) [Polozov and Gulwani2015, Lau et al.2003] to efficiently search the space of programs. There are two key shortcomings of these approaches. First, the DSL is limited to only certain low-level syntactic regular expression-based operators that allow for an efficient structuring of search space. This limits the expressiveness of the PBE systems; for example, they do not allow semantic data transformations using arbitrary transformation functions such as obtaining month names from a date or abbreviating the state name in an input address. Second, building an efficient synthesizer using VSA requires a large engineering effort with manually designed heuristic rules.

We tackle the first shortcoming by designing Dapip’s DSL to have function APIs as the core element, which allows for composition of APIs with constant strings. The DSL consists of three kinds of APIs: regular expression-based APIs, lookup APIs, and transformation APIs. The regular expression-based APIs perform a regular expression-based transformation on the input strings, which are needed for syntactic data transformations. The lookup APIs search for a particular string in the input data based on a dictionary of strings, and the transformation APIs perform some transformation on top of a lookup operation based on a predefined mapping between two sets of strings. The lookup and transformation APIs allow for semantic data transformations.

The second shortcoming is handled by learning the synthesis algorithm in Dapip automatically from data using two recently introduced neural modules [Parisotto et al.2016]. The first module, called the cross-correlational encoder, computes a fixed-dimension vector representation of the input-output examples by using tensor representations obtained by running two bi-directional LSTMs [Hochreiter and Schmidhuber1997, Graves and Schmidhuber2005] on the input and output strings and computing their cross correlation. The second module, the recursive-reverse-recursive neural network, or R3NN, encodes a partial derivation in the DSL and, given the example encoding vector, returns a distribution over the space of possible expansions to the partial derivation. The R3NN incrementally builds a program in the DSL that is consistent with the input-output examples. The input-output encoder and the R3NN modules are trained end-to-end using thousands of programs and corresponding input-output examples, which are automatically sampled from the DSL.

We evaluate Dapip on a set of synthetic and 238 real-world FlashFill benchmarks. Our experiments indicate that our deep-learning-based approach is able to effectively model and predict the presence of different types of APIs. It is able to solve 45% of the FlashFill benchmarks and significantly outperforms the enumerative-search-based baseline.

To summarize, the key contributions of this paper are:

  • We design an expressive DSL with APIs that can encode both syntactic and semantic data transformation tasks.

  • We automatically learn a synthesis algorithm for synthesizing programs in the DSL using neural architectures.

  • We evaluate our system Dapip on 238 real-world FlashFill benchmarks and thousands of synthetic benchmarks.

2 Motivating Examples

We present a few real-world examples to motivate the DSL.

Example 1.

An Excel user wanted to transform names to first initial followed by last name as shown in Figure 1. Since some input examples had optional middle names, the user was struggling to find a macro to perform the desired task.

#   Input               Output
1   John S. Henry       J. Henry
2   Mike Stanley        M. Stanley
3   Bernie John Smith   B. Smith
4   Martha S Johnson    M. Johnson
Figure 1: An example FlashFill task of abbreviating names. The user provides the first two outputs, and the remaining outputs (bold in the original figure) are then automatically generated by the learned program.

The program that Dapip learns for this task uses the GetFirstChar and GetLastWord APIs, which belong to the class of regex APIs that extract substrings from the input string based on regular expressions.
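To make the behavior of such a program concrete, the following Python sketch mimics it on the rows of Figure 1. The concatenation shape (first character, a constant string, last word) and the regular expression are assumptions made for illustration, not the program that Dapip actually learned.

```python
import re

def get_first_char(s: str) -> str:
    """Assumed semantics of GetFirstChar: the first character of the input."""
    return s[:1]

def get_last_word(s: str) -> str:
    """Assumed semantics of GetLastWord: the last run of letters in the input."""
    words = re.findall(r"[A-Za-z]+", s)
    return words[-1] if words else ""

def abbreviate_name(s: str) -> str:
    # Hypothetical program shape: Concat(GetFirstChar(input), ". ", GetLastWord(input))
    return get_first_char(s) + ". " + get_last_word(s)

for name in ["John S. Henry", "Mike Stanley", "Bernie John Smith", "Martha S Johnson"]:
    print(name, "->", abbreviate_name(name))
```

Because the last word is selected by pattern rather than by position, the optional middle name or initial does not affect the result.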

#   Input                            Output
1   500 Mem Dr., Cambridge, 02139    Cambridge, MA
2   22 NE Street, Redmond, USA       Redmond, WA
3   Seattle, 98002                   Seattle, WA
4   21 Peace Ave., Kirkland, 98034   Kirkland, WA
Figure 2: Transforming addresses to City and State.

Example 2.

An Excel user had a list of addresses and wanted to extract the city and state values as shown in Figure 2.

This is an example of a very common task that cannot be performed by systems such as FlashFill. Since the data is in many different formats, there is no consistent regular expression that can be used to extract the city names. Moreover, to obtain the state name, the system needs to use a transformation API, GetStateFromCity. Dapip learns a program for this task that relies on this API.

More examples of real-world FlashFill tasks can be found in Appendices C and D.

Figure 3: Using the trained R3NN model to sample programs from the DSL given a set of input-output examples; even in the inference use case, nodes are expanded in a particular, discrete order.

3 Overview of Approach

Figure 4: Training the R3NN network to learn distributions over DSL expansions conditioned on the input-output examples; expansions are performed in a particular order as dictated by the conditional distribution.

We now present an overview of our end-to-end system that learns to synthesize programs in a DSL that are consistent with a set of examples. The training phase of our system is shown in Figure 4 and the test phase is shown in Figure 3. We first design a DSL that allows for composition of nested API calls with constant strings. We designed this DSL after studying a large family of real-world string transformation tasks so that it is expressive enough to encode these tasks. During the training phase, we use a program sampler to uniformly sample a large number of programs from this DSL. For each program, we use a rule-based approach to construct 5 input strings for the program such that the prerequisites of the program are met. We obtain the output strings by executing the program on the input strings.
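The data-generation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three APIs, their token-based semantics, the constant strings, and the fragment generators are all hypothetical stand-ins, and programs are simplified to flat concatenations.

```python
import random

CONSTANTS = [", ", "-", "."]  # assumed universe of constant strings

def first_word(s):
    return [t for t in s.split() if t.isalpha()][0]

def last_word(s):
    return [t for t in s.split() if t.isalpha()][-1]

def first_number(s):
    return [t for t in s.split() if t.isdigit()][0]

# name -> (function, generator of an input fragment that satisfies the prerequisite)
APIS = {
    "GetFirstWord": (first_word, lambda rng: rng.choice(["red", "green", "blue"])),
    "GetLastWord": (last_word, lambda rng: rng.choice(["cat", "dog", "owl"])),
    "GetFirstNumber": (first_number, lambda rng: str(rng.randint(1, 999))),
}

def sample_program(rng, max_pieces=3):
    """Pick the number of pieces uniformly, then each piece uniformly."""
    choices = [("api", n) for n in APIS] + [("const", c) for c in CONSTANTS]
    return [rng.choice(choices) for _ in range(rng.randint(1, max_pieces))]

def make_input(program, rng):
    """Rule-based input: one fragment per API call so that the program can run."""
    fragments = [APIS[v][1](rng) for kind, v in program if kind == "api"]
    fragments.append(rng.choice(["x", "y", "z"]))  # filler word, so a word always exists
    rng.shuffle(fragments)
    return " ".join(fragments)

def run(program, s):
    """Execute a flat concatenation of API calls and constants on input s."""
    return "".join(APIS[v][0](s) if kind == "api" else v for kind, v in program)

def make_benchmark(rng, n_examples=5):
    program = sample_program(rng)
    inputs = [make_input(program, rng) for _ in range(n_examples)]
    return program, [(s, run(program, s)) for s in inputs]

program, examples = make_benchmark(random.Random(0))
print(program)
for inp, out in examples:
    print(repr(inp), "->", repr(out))
```

The outputs are never written by hand; they are always obtained by executing the sampled program, which keeps every training example consistent with its program by construction.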

During training, each sampled program together with the corresponding input-output examples is used to train the R3NN model, a neural architecture that learns distributions over the expansions in the DSL conditioned on the examples. The examples are encoded using a second neural architecture called the cross-correlational encoder, which produces a fixed-dimensional vector. The R3NN system takes as input the input-output conditioning vector, the DSL, and the training program, and is trained to predict a conditional distribution over the set of DSL expansions. The next expansion is sampled from this conditional distribution, leading to the partial tree, and the procedure repeats; one can observe a potential order of the nodes growing in the respective figures.

The trained R3NN model can then be used to synthesize programs in the DSL given a set of examples. The trained model takes the input-output conditioning vector as input, and generates a distribution over the set of DSL expansions that are likely to be the expansions required to construct the desired program. The distribution is then sampled to derive programs in the DSL, where the order of expansions is specified by the distribution, as shown in the respective figure, and the system returns the first program that is consistent with the input-output examples.
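The sample-and-check loop at test time can be sketched as below. The trained model is replaced by a hypothetical weighted sampler over three hand-written candidate programs; only the control flow (sample, test for consistency, return the first consistent program) reflects the description above.

```python
import random
from typing import Callable, List, Optional, Tuple

Program = Callable[[str], str]
Examples = List[Tuple[str, str]]

def is_consistent(program: Program, examples: Examples) -> bool:
    """A program is consistent if it maps every input to its expected output."""
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        return False  # a program that crashes on an input is not consistent

def synthesize(sample_program: Callable[[], Program],
               examples: Examples,
               num_samples: int) -> Optional[Program]:
    """Return the first sampled program consistent with the examples, if any."""
    for _ in range(num_samples):
        candidate = sample_program()
        if is_consistent(candidate, examples):
            return candidate
    return None

# Hypothetical stand-in for the learned distribution over programs.
CANDIDATES = [
    lambda s: s.split()[0],   # first token
    lambda s: s.split()[-1],  # last token
    lambda s: s.lower(),      # lowercase the whole input
]
WEIGHTS = [0.2, 0.7, 0.1]

rng = random.Random(0)
sampler = lambda: rng.choices(CANDIDATES, weights=WEIGHTS, k=1)[0]

examples = [("John Smith", "Smith"), ("Ada Lovelace", "Lovelace")]
found = synthesize(sampler, examples, num_samples=100)
print(found("Alan Turing") if found else "no consistent program found")
```

A larger sample budget gives the search more chances to recover from a low-probability expansion, which is why results are reported for several sample counts.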

4 Domain-Specific Language

The syntax of the domain-specific language for API-based string transformations is shown in Figure 5. The top-level construct of the language is the Concat function, which returns the concatenation of its argument substrings. A substring expression can be a constant string, the input string, or the result of an API function applied to its argument. The Concat operator allows for composition of API calls with constant strings. The DSL consists of 3 types of APIs: regex APIs, lookup APIs, and transformation APIs.

Regex API: The regex APIs search for certain regular expression-based patterns in the input string and return the matched string. Some examples of regex APIs are GetFirstNum, GetBetFirstAndSecondCommas, etc. Our DSL consists of 104 such regex APIs.

Lookup API: The lookup APIs look for the presence of certain strings in the input string and return the lookup string. Each lookup API consists of a dictionary of a finite collection of strings, which are used for searching input substrings. Some examples of lookup APIs are GetCity, GetState, GetStockSymbol, etc. For example, the GetState API contains a dictionary of US state names, whereas the GetCity API contains a dictionary of US cities. Our DSL consists of 18 such lookup APIs.

Transformation API: The transformation APIs consist of a dictionary that maps a finite collection of strings to another finite collection of strings. These APIs search for a string from the first collection in the input string and return the corresponding output string from the second collection. Some examples of such APIs include GetStateFromCity, GetFirstNameInitial, etc. For example, the transformation API GetStateFromCity consists of a dictionary mapping a collection of US cities to the corresponding US states. Our DSL consists of 13 such functions.

The full list of all functions is provided in Appendix A.
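The three API families and the Concat construct can be illustrated with a small Python interpreter. This is a sketch under stated assumptions: the class names, the tiny dictionaries, and the choice to map cities to state abbreviations (to match the outputs in Figure 2) are all hypothetical, and only three of the APIs are modeled.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Const:          # a constant string
    value: str

@dataclass(frozen=True)
class Input:          # the input string
    pass

@dataclass(frozen=True)
class Api:            # an API call applied to an argument expression
    name: str
    arg: "Expr"

@dataclass(frozen=True)
class Concat:         # top-level concatenation of substring expressions
    parts: Tuple["Expr", ...]

Expr = Union[Const, Input, Api, Concat]

CITIES = ["Cambridge", "Redmond", "Seattle", "Kirkland"]   # toy lookup dictionary
CITY_TO_STATE = {"Cambridge": "MA", "Redmond": "WA",       # toy transformation mapping
                 "Seattle": "WA", "Kirkland": "WA"}

def get_first_number(s: str) -> str:
    """Regex API: return the first run of digits, or "" if there is none."""
    m = re.search(r"\d+", s)
    return m.group(0) if m else ""

def get_city(s: str) -> str:
    """Lookup API: return the first dictionary entry found in the input."""
    return next((c for c in CITIES if c in s), "")

def get_state_from_city(s: str) -> str:
    """Transformation API: find a city in the input and return its mapped state."""
    return next((st for c, st in CITY_TO_STATE.items() if c in s), "")

API_TABLE = {
    "GetFirstNumber": get_first_number,
    "GetCity": get_city,
    "GetStateFromCity": get_state_from_city,
}

def evaluate(expr: Expr, inp: str) -> str:
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Input):
        return inp
    if isinstance(expr, Api):
        return API_TABLE[expr.name](evaluate(expr.arg, inp))
    if isinstance(expr, Concat):
        return "".join(evaluate(p, inp) for p in expr.parts)
    raise TypeError(f"unknown expression: {expr!r}")

program = Concat((Api("GetCity", Input()), Const(", "), Api("GetStateFromCity", Input())))
print(evaluate(program, "500 Mem Dr., Cambridge, 02139"))
```

The transformation API returns a string that need not occur anywhere in the input, which is what distinguishes it from the regex and lookup families.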

Figure 5: The syntax of the DSL for API compositions.

5 Neural Architecture for Search

The neural search over the programs in the DSL conditioned on the input-output examples is performed using the model outlined in [Parisotto et al.2016]. First, the input-output examples are encoded into a fixed-length feature vector that aims to capture shared patterns between the input and output strings. This example representation is then passed to a neural tree-based generative model over program trees, called R3NN, to generate the desired hidden program. We provide a high-level overview of both architectures.

5.1 Neural Input-Output Encoder

The cross-correlational encoder generates a fixed-dimensional vector representation of a set of input-output (I/O) examples. Intuitively, the encoder needs to capture three key pieces of information: parts of the output strings that are likely to be constant strings, parts of the output strings that can be computed from input strings, and some characteristics of the example strings that will help the program generator module identify the set of useful APIs for the given task. To simplify the DSL, we assume a fixed universe of possible constant strings so that we can focus on training the encoder to produce the likely set of APIs.

The I/O encoder first runs two bidirectional LSTM networks separately on the input and output strings in each example pair, which produces two matrices whose size is determined by the LSTM hidden dimension and the maximum length of the I/O string. The encoder then slides the output matrix over the input matrix for each time step and computes the dot product between respective matrix columns. Each position of the output matrix relative to the input matrix gives one alignment, and the dot products for that alignment form one vector. Finally, the encoder concatenates the values for overlapping time steps to obtain a fixed-dimensional vector encoding for each example pair.
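One possible reading of the slide-and-concatenate step is sketched below in plain Python. The small hand-written matrices stand in for LSTM outputs, and the column layout, the order of the alignments, and the resulting feature length are properties of this sketch rather than details confirmed by the text.

```python
from typing import List

Matrix = List[List[float]]  # T column vectors, one per time step, each of length d

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def cross_correlate(inp: Matrix, out: Matrix) -> List[float]:
    """Slide `out` over `inp` and concatenate dot products of overlapping columns."""
    T = len(inp)
    assert len(out) == T, "both matrices are assumed padded to the same length"
    features = []
    for shift in range(-(T - 1), T):       # 2T - 1 relative positions in this sketch
        for t in range(T):
            u = t - shift                   # output column aligned with input column t
            if 0 <= u < T:                  # keep only overlapping time steps
                features.append(dot(inp[t], out[u]))
    return features

inp = [[1, 0], [0, 1], [1, 1]]   # stand-in for the input-string LSTM states
out = [[1, 1], [1, 0], [0, 1]]   # stand-in for the output-string LSTM states
print(cross_correlate(inp, out))  # [0, 1, 1, 1, 0, 1, 1, 1, 2]
```

Large values at a given shift indicate that the output resembles the input displaced by that many positions, which is the kind of substring evidence the regex APIs depend on.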

5.2 Tree-Structured Generation Model

The tree generation model incrementally constructs a program tree, starting from the start symbol of the DSL grammar and expanding the tree one derivation at a time until it obtains a tree whose leaf nodes are all terminals. The R3NN network assigns posterior probabilities to every valid expansion of a partial tree to guide the search algorithm. In other words, given a partial program tree, the R3NN network decides which non-terminal node to expand in the tree and with which expansion rule in the grammar.

The R3NN is defined by the following parameters: i) an $M$-dimensional representation $\phi(s) \in \mathbb{R}^M$ for every symbol $s$ in the grammar, ii) an $M$-dimensional representation $\omega(r) \in \mathbb{R}^M$ for each grammar rule $r$, iii) a deep neural network $f_r$ for each grammar rule $r$ that takes as input a vector $x \in \mathbb{R}^{Q \cdot M}$ (where $Q$ is the number of RHS symbols of $r$) and outputs a vector $y \in \mathbb{R}^M$, and iv) a deep neural network $g_r$ (reverse of $f_r$) which takes as input a vector $x' \in \mathbb{R}^M$ and outputs a vector $y' \in \mathbb{R}^{Q \cdot M}$.

Given a partial program tree, R3NN first assigns the representation $\phi(S(l))$ to each leaf node $l$, where $S(l)$ denotes the grammar symbol of node $l$. It then performs a standard recursive pass over the tree from bottom to top, by recursively applying $f_{R(n)}$ for every non-leaf node $n$ on its RHS node representations to compute the representation of $n$, where $R(n)$ denotes the rule associated with node $n$. This pass continues until we reach the root node. The root representation $\phi(\mathrm{root})$ contains information about all tree nodes, but does not encode any notion of the node positions in the tree. To solve this issue, R3NN performs a reverse-recursive pass starting from the root node to compute updated representations of all child nodes using the reverse deep network $g_{R(n)}$. After performing the reverse-recursive pass, each leaf node $l$ is assigned a new distributed representation $\phi'(l)$, which intuitively captures the global information about every other node in the tree.

The scores for each expansion can now be obtained from the global leaf representations $\phi'(l)$. Let $e.r$ be the expansion type (the production rule that applies) and let $e.l$ be the leaf node that $e.r$ is applied to for an expansion $e$. The score of an expansion is calculated as $z_e = \phi'(e.l) \cdot \omega(e.r)$, and the probability of the expansion is obtained by an exponentiated, normalized sum over the scores: $\pi(e) = \frac{e^{z_e}}{\sum_{e' \in E} e^{z_{e'}}}$.
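The scoring step can be sketched as a softmax over (leaf, rule) pairs. The sketch assumes, following [Parisotto et al.2016], that the score of an expansion is the dot product between the global leaf representation and the rule representation; the two-dimensional vectors and the rule names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Expansion = Tuple[str, str]  # (leaf node id, production rule name)

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def expansion_probabilities(leaf_reprs: Dict[str, List[float]],
                            rule_reprs: Dict[str, List[float]],
                            valid: List[Expansion]) -> Dict[Expansion, float]:
    """Softmax over the scores of all valid expansions of a partial tree."""
    scores = {(l, r): dot(leaf_reprs[l], rule_reprs[r]) for l, r in valid}
    top = max(scores.values())                       # subtract the max for stability
    exps = {e: math.exp(z - top) for e, z in scores.items()}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}

leaf_reprs = {"leaf0": [1.0, 0.0], "leaf1": [0.0, 1.0]}            # toy global leaf vectors
rule_reprs = {"e -> Const": [2.0, 0.0], "e -> Api(e)": [0.0, 1.0]}  # toy rule vectors
valid = [(l, r) for l in leaf_reprs for r in rule_reprs]

probs = expansion_probabilities(leaf_reprs, rule_reprs, valid)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Normalizing over all (leaf, rule) pairs at once means a single distribution decides both which leaf to expand and which rule to apply.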

6 Evaluation

We now present results from two major sets of experiments and analyze the model in more detail with the goal of assessing its expressiveness. We demonstrate that our model is capable of learning to synthesize simple programs when provided with a library of over 100 API functions. We also show that the model is capable of strong generalization, where it can not only generalize across different I/O examples for a given program, but also across new, unseen programs.

6.1 Experimental Setup and Training Details

We use both synthetic benchmarks and real-world FlashFill benchmarks for evaluation. The synthetic benchmarks are obtained by sampling the programs in the DSL uniformly, and then using a rule-based approach to generate corresponding input-output examples. For example, if we sample a program consisting of GetThirdNum and GetState APIs, the rule-based approach would ensure that the input strings in the example contain at least three numbers and one state string. For each benchmark, we sample five input strings and the corresponding output strings are obtained by executing the sampled program on the input strings. Several examples of training data are shown in Appendix B.

We first train the R3NN on a DSL consisting of only one family of APIs to evaluate its effectiveness at learning an individual API family. We call the models trained on only the regex APIs (and constant strings) the FF models and call the corresponding DSL the regex-only DSL. We then train the R3NN with all APIs to evaluate the effectiveness of learning programs in the DSL consisting of different APIs and their composition with constant strings; we call this DSL the full DSL. The models trained on the full DSL are called the FF++ models. Since the FlashFill benchmarks can be solved using only the regex APIs and the set of constant strings, we also evaluate the FF model on the FlashFill benchmarks.

We train the cross-correlation encoder and R3NN jointly with the principle of maximum likelihood; the model produces posterior probabilities over possible expansions and we backpropagate an error signal based on the ground truth programs. We use the Adam optimizer [Kingma and Ba2014], with an initial learning rate of 0.001 and clipping gradients at 10 for both modules. We found that small learning rates are crucial for R3NN to prevent unstable learning. Every epoch consists of 1000 training batches of 10 instances, where each instance contains a ground truth program and 5 input-output pairs. The evaluation on synthetic data is performed on programs that are not seen during training. We report results when evaluating with both 1-best inference and with stochastic search (10, 50, or 100 samples), where we resample a program conditioned on the same input-output examples multiple times. This way, we allow the model to have small errors in its final posterior probabilities for selecting an expansion.
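The maximum-likelihood objective for a single training program can be sketched as a negative log-likelihood over its expansion steps. The sketch assumes that the ground truth program is given as a sequence of expansions and that the model's output at each step is a dictionary of probabilities; the rule names and numbers are made up for illustration.

```python
import math
from typing import Dict, List

def negative_log_likelihood(step_distributions: List[Dict[str, float]],
                            ground_truth: List[str]) -> float:
    """Sum of -log p(correct expansion) over the expansion steps of one program."""
    return -sum(math.log(dist[choice])
                for dist, choice in zip(step_distributions, ground_truth))

# Hypothetical model outputs for a program built in two expansion steps.
step_distributions = [
    {"e -> Api(e)": 0.5, "e -> Const": 0.5},
    {"e -> Api(e)": 0.25, "e -> Input": 0.75},
]
ground_truth = ["e -> Api(e)", "e -> Input"]

loss = negative_log_likelihood(step_distributions, ground_truth)
print(round(loss, 4))  # 0.9808
```

Minimizing this quantity raises the probability assigned to the ground-truth expansion at every step; it also shows why one poorly scored step can derail 1-best inference while stochastic search may still recover.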

6.2 Learning API Types

Each of the three classes of API functions, while much more interpretable, still poses nontrivial challenges for the model to learn to compose. The lookup API functions contain large dictionaries and the model must learn when to call such APIs given the input-output examples. For example, while the difference between names and cities may seem trivial to human practitioners, the model must learn to disambiguate each of these entities. The transformation API functions pose an additional challenge; with programs that require these types of API calls, not only does the model need to learn some encoding of the hidden dictionary, but the output string may not contain any obvious matching substring in the input string because of the nature of the API function. As a result, a simple string matching algorithm between the inputs and outputs will not solve this problem, and the input-output encoder must learn useful representations of input-output pairs and be expressive enough to capture the implicit string transformation. Lastly, the regex API functions do not encode dictionaries but represent syntactic substring operations, and the model must learn to recognize which API functions to call based on which parts of the output are present in the input.

We first present an ablative study of which class of APIs is the easiest to learn in isolation, and which one is the most challenging in the full DSL.

Regex APIs

In Table 1, we report the training and validation set accuracies of different models trained on the regex-only DSL (FF model). The length column denotes the maximum length of programs that each model was trained on. The length 7 model was trained with 9000 programs, length 8 with 16000, length 9 with 616510, and length 10 with 1263000 programs. For validation, we select 1000 randomly chosen held-out programs from this set and generate new I/O examples to test the generalization power of the trained model.

Table 1: Best FF model performance by max program length.

Of particular note is the performance on programs of length 10. At this length, the DSL can generate programs with API nesting, API composition, and concatenation with a constant string; this represents all possible constructs in our DSL.

Lookup and Transform APIs

In this experiment, we fix the maximum size of the programs in the training and validation set to size 10 and only include the lookup and transform APIs in the DSL. The results are shown in Table 2. We find that when the DSL is restricted to these APIs, the trained models achieve a very high accuracy and are able to identify composition of APIs with very high precision.

Table 2: Learnability of other APIs.

All APIs: Regex + Transform + Lookup

We now evaluate the models trained on the full DSL. Recall that these models are referred to as the FF++ models.

Table 3: The performance of best FF++ model on synthetic dataset by max length of programs.

The performance of the FF++ models is shown in Table 3. We observe that both training and validation accuracies decreased as compared to the FF models, which is expected since we now have an increased set of APIs that also include more complex APIs encoding large dictionaries. However, the length 10 model is still able to get 44% accuracy.

We analyze these results further to understand the learnability of different APIs when trained together, as shown in Table 4. The regex APIs seem to be the easiest for the network to learn, which may be attributed to the specific nature of the I/O encoder, as it was designed to detect patterns in substrings between the input and output examples. Interestingly, the lookup APIs are harder to learn than the transformation APIs, which can be attributed to the fact that they encode larger dictionaries than the transform APIs do.

Table 4: Ablative analysis of FF++ model performance.

6.3 FlashFill using API Compositions

We now present the results of the best FF and FF++ models on the FlashFill benchmarks obtained from the authors of FlashFill [Gulwani et al.2012]. These benchmarks correspond to real-world string transformation tasks in Excel, where each benchmark comprises 5 input-output string examples.

6.3.1 FF models

Baseline performance with uniform search

We first present the results we obtain with a baseline uniform search model on the FlashFill benchmarks in Table 5. The baseline model performs a uniform search over the DSL expansions and is biased towards small programs. We also present stochastic sampling results for a fair comparison with the performance of the FF models.

Table 5: Uniform search on FF benchmarks

The uniform search does surprisingly well considering the large space of all possible programs because the DSL we designed with APIs allows many of the benchmarks to be solved with a single call, e.g. GetFirstWord, and the uniform search sampler is biased towards shorter programs.

FF model performance on FlashFill benchmarks

We now evaluate the trained models whose accuracies on synthetic data are reported in Table 1. Note that unlike in Table 1, each of the models is evaluated on the same dataset and so the results are comparable across rows. In this case, we not only report the results with stochastic sampling, but also report the 1-best programs under the 1 column in Table 6.

Table 6: FF model performance on FF benchmarks

In this case, we observe that with 100 samples, the length-10 model is able to solve of the benchmarks. It surpasses the performance of Neural FlashFill [Parisotto et al.2016], which achieves an accuracy of 23% with 100 samples and 34% with 1000 samples. On further inspection of the benchmarks, we find that only of the benchmarks can be solved with programs of length in our DSL. If we normalize across this, we see that we can solve of all solvable benchmarks. This indicates that our model is capable of learning to synthesize realistic programs.

6.3.2 FF++ model

Baseline Performance with uniform search

We first present the baseline results of uniform search. Since the DSL has expanded, the uniform search performs slightly worse and can only achieve an accuracy of about 11% with 100 samples.

Table 7: Uniform search on FF benchmarks with full DSL

FlashFill benchmark performance

The results of evaluating the FF++ model on the FlashFill benchmarks are shown in Table 8. Remarkably, the length 10 models can still solve 37% of the benchmarks even with the extended DSL.

Table 8: FF++ model performance on FF benchmarks

7 Related Work

We describe the related work from the domains of VSA-based programming by example systems and neural program induction and synthesis systems.

Programming By Example for String Manipulations

There has been much recent work on designing version space algebra-based PBE systems for performing data transformation and extraction. FlashFill [Gulwani2011, Gulwani et al.2012] is a PBE system that performs regular expression-based string transformations using examples. Given an input-output example string, FlashFill first searches over all possible ways to decompose the output string and represents the set of those sub-programs concisely using a DAG data structure. This VSA-based approach has since been extended to build PBE systems for number transformations [Singh and Gulwani2012b], table joins [Singh and Gulwani2012a], data extraction [Le and Gulwani2014], and data reshaping [Barowy et al.2015]. While these methods are interpretable and tractable, they do not scale easily to additions of new functionality. Dapip, unlike the VSA-based PBE systems, is trained automatically using the R3NN network by sampling several thousands of programs from arbitrary DSLs.

Neural Program Induction and Synthesis

There has been a plethora of recent work in both neural program induction and neural program synthesis. The goal in neural program induction is to teach neural networks the functional behavior of a program by augmenting the neural networks with additional computational modules such as Neural GPU [Kaiser and Sutskever2015], Neural Turing Machine [Graves et al.2014], and stack-augmented RNNs [Joulin and Mikolov2015]. One limitation of these architectures is that although they are able to learn the functional behavior, they do not expose an interpretable program back to the user. In addition, they need to be trained separately for each task, representing a lack of strong generality. More recent work, such as Terpret [Gaunt et al.2016] and Neural-RAM [Kurach et al.2015], seeks to mitigate the interpretability issue, but these systems need to be trained for each individual benchmark problem, which is prohibitively expensive.

A recent approach was proposed to use the R3NN-based neural architectures to synthesize programs in a DSL similar to that of FlashFill [Parisotto et al.2016]. We employ the same architecture but in a different DSL consisting of APIs at the core level of expressions. The APIs allow the program depth to be shallower than programs in a DSL with more primitives, and we investigate whether that can make the task of automatically learning a search strategy easier for the R3NN. We argue that imposing higher-order functions is much more extensible and more akin to human-like programming.

8 Future Work

There are a number of ways that we can extend the results and techniques presented in this paper to yield both improvements in the current numbers as well as allow us to scale to larger programs.

8.1 Function embeddings

We rely on the R3NN and the input-output encoder to implicitly encode the semantics of each function, and we have shown through a number of experiments that the tree model is capable of doing so. This is impressive in its own right, but in order to improve the performance further, we should extend the model to support explicit, continuous representations of each function. This can be achieved in a number of ways, the simplest of which involves encoding each function as a randomly initialized vector and allowing the model to attend to API functions that may be relevant to the input-output examples. We can freeze the embeddings, or we can elect to backpropagate errors through both the attention mechanism and the embeddings, and jointly learn these representations. This represents a principled approach to adding new functions, and the method is easy to extend to additional API functions that we may choose to add.

8.2 Divide and conquer

Function embeddings allow us to perform better on existing problems by giving the model more information as to what choices to make when generating the tree. However, this does not resolve the issue of scalability. Even with function embeddings, as the inputs and outputs grow in size and complexity, we have no scalable method of performing inference over which programs to synthesize. However, instead of viewing the problem as a whole, we can break up the problem into smaller pieces and try to solve each subpiece and concatenate the answers together. This divide-and-conquer approach allows us to treat larger problems as conglomerations of a number of smaller problems. This procedure requires two general mechanisms: one module will need to predict how to split the output string into smaller, meaningful chunks, and the second module will consume each input-output piece, synthesize the correct program, and each piece will eventually be concatenated together. This is especially convenient in this problem setting because the FlashFill language is one that is focused on concatenations, so we lose no generality in being able to solve the problem.

8.3 Extending the DSL

An interesting extension of the DSL is to add multi-argument API function calls. This could yield more general API functions, such as GetNthObj(n, o), and could replace functions like GetFirstWord, GetSecondNumber. In addition, we can also add multi-argument Concat functions; this idea goes neatly with the divide-and-conquer approach and can be used to help scale the model to synthesize larger programs.
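A multi-argument API of this kind might look like the following Python sketch. The signature (with the input string passed explicitly as a third argument), the two object kinds, and their regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical object kinds and the patterns that recognize them.
PATTERNS = {"word": r"[A-Za-z]+", "number": r"\d+"}

def get_nth_obj(n: int, kind: str, s: str) -> str:
    """Return the n-th match of `kind` in s (1-based; negative n counts from the end).

    Returns "" when the input has fewer than |n| matches.
    """
    if n == 0:
        raise ValueError("n is 1-based; use a negative n to count from the end")
    matches = re.findall(PATTERNS[kind], s)
    index = n - 1 if n > 0 else n
    try:
        return matches[index]
    except IndexError:
        return ""

address = "22 NE Street 98052"
print(get_nth_obj(1, "word", address))     # plays the role of GetFirstWord
print(get_nth_obj(2, "number", address))   # plays the role of GetSecondNumber
print(get_nth_obj(-1, "word", address))    # plays the role of GetLastWord
```

A single parameterized function of this form would stand in for a whole family of fixed-position APIs, at the cost of the model having to predict the arguments as well as the function.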

8.4 Batching Trees

While the divide-and-conquer approach is an algorithmic improvement to speed up the process of training the model, we can also take advantage of the model to incorporate faster batching protocols. Using a tree-based generative model allows us to batch together operations that occur at the same depth in the tree, because each operation is independent of all of its siblings. Moreover, we can also batch multiple trees together for increased performance.

9 Conclusion

In this paper, we presented Dapip, a system that tries to automatically learn a synthesis algorithm given a DSL. In particular, we designed a DSL consisting of APIs as first class constructs that allows the system to perform richer tasks using small sized programs. We used the recently introduced R3NN neural architecture to automatically learn a synthesis algorithm for our DSL. The preliminary results suggest that the system is able to efficiently learn programs up to size 10 with about 45% accuracy on real-world benchmarks. We believe this direction of using neural architectures to automatically develop synthesis algorithms for PBE systems can lead to big advancements in program synthesis techniques and make it more generally applicable to many new domains.

References

Appendix

Appendix A The Complete Set of APIs

LookUp (18): GetStreetNum, GetStreetName, GetAptNum, GetCityName, GetStateName, GetStateAbbr, GetZipcode, GetFirstName, GetLastName, GetTitle, GetSuffix, GetCompany, GetCEO, GetStockSymbol, GetWeekday, GetMonth, GetYear, GetDate

Transform (13): GetStateFromCity, GetCityFromZipcode, GetStateAbbrFromState, GetStateFromStateAbbr, GetFirstInitial, GetLastInitial, GetStockSymbolFromCEO, GetCEOFromCompany, GetCompanyFromStockSymbol, GetOrdinalFromDate, GetMonthFromDate, GetWeekdayFromDate, GetYearFromDate

Regex (104):
GetFirstWord, GetSecondWord, GetThirdWord, GetFourthWord, GetFifthWord, GetLastWord, GetSecondToLastWord, GetThirdToLastWord, GetFourthToLastWord, GetFifthToLastWord, GetFirstNumber, GetSecondNumber, GetThirdNumber, GetFourthNumber, GetFifthNumber, GetLastNumber, GetSecondToLastNumber, GetThirdToLastNumber,
GetFourthToLastNumber, GetFifthToLastNumber, GetFirstAlpha, GetSecondAlpha, GetThirdAlpha, GetFourthAlpha, GetFifthAlpha, GetLastAlpha, GetSecondToLastAlpha, GetThirdToLastAlpha, GetFourthToLastAlpha, GetFifthToLastAlpha, GetFirstWS, GetSecondWS, GetThirdWS, GetFourthWS, GetFifthWS, GetLastWS,
GetSecondToLastWS, GetThirdToLastWS, GetFourthToLastWS, GetFifthToLastWS, TrimSpaces, TrimLeadingZeros, GetIdentity, ReplaceSpacesWithDashes, ReplaceSpacesWithCommas, ReplaceSpacesWithUnderscores, ToLowercase, ToUppercase, ToPropercase, GetWordBetweenStartAndAt, GetWordBetweenAtAndEnd, GetWordBetweenStartAndDot, GetWordBetweenDotAndEnd, GetStartToFirstSpace,
GetFirstSpaceToEnd, GetStartToLastSpace, GetLastSpaceToEnd, GetStartToDash, GetFirstDashToSecondDash, GetLastDashToEnd, GetStartToFirstComma, GetWordBetweenFirstAndSecondComma, GetWordBetweenSecondAndThirdComma, GetLastCommaToEnd, GetWordBetweenCommaSpaceAndEnd, GetStartToParan, GetStartToFirstColon, GetStartToSecondColon, GetStringBetweenLastColonToEnd, GetStringBetweenLastFirstAndSecondQuote, GetStartToEndOfFirstNumber, GetFirstChar,
GetFirstTwoChar, GetFirstThreeChar, GetFirstFourChar, GetFirstFiveChar, GetFirstDigit, GetFirstTwoDigit, GetFirstThreeDigit, GetFirstFourDigit, GetFirstFiveDigit, GetFirstCapsWord, GetSecondCapsWord, GetThirdCapsWord, GetFourthCapsWord, GetFifthCapsWord, GetLastCapsWord, GetSecondToLastCapsWord, GetThirdToLastCapsWord, GetFourthToLastCapsWord,
GetFifthToLastCapsWord, GetFirstPropercaseWord, GetSecondPropercaseWord, GetThirdPropercaseWord, GetFourthPropercaseWord, GetFifthPropercaseWord, GetAllPropercaseWords, GetLastPropercaseWord, GetSecondToLastPropercaseWord, GetThirdToLastPropercaseWord, GetFourthToLastPropercaseWord, GetFifthToLastPropercaseWord, GetAllLetters, GetAllNumbers
Figure 6: Full list of functions used by our model: 18 lookup functions, 13 transform functions, and 104 regex functions. Every function takes a string as input and produces a single output string, which can then be concatenated or nested with other function calls. Recall that we provide the model with no semantics for any of these functions; it implicitly learns a latent representation of each one. Therefore, any number of functions can be added to this library without modifying the underlying learning algorithms.
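The uniform string-to-string interface is what makes the library extensible. The following Python sketch shows one way such a library could be organized; the registry, the decorator, and the two function bodies are hypothetical and are not taken from the system itself.

```python
LIBRARY = {}  # name -> callable; a hypothetical registry for illustration

def api(name):
    """Register a string -> string function under its DSL name."""
    def register(fn):
        LIBRARY[name] = fn
        return fn
    return register

@api("GetFirstWord")
def get_first_word(s):
    parts = s.split()
    return parts[0] if parts else ""

@api("ToUppercase")
def to_uppercase(s):
    return s.upper()

def function_ids():
    """Integer ids a model could embed; the function bodies stay opaque to it."""
    return {name: i for i, name in enumerate(sorted(LIBRARY))}

def call(name, s):
    """Invoke a library function by name."""
    return LIBRARY[name](s)

# Nesting works because every function maps a string to a string.
print(call("ToUppercase", call("GetFirstWord", "logan smith")))  # LOGAN
```

Adding a new API only adds a dictionary entry and a new id, so the learner would need at most one more embedding row and no other structural change.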

Appendix B Samples of Training Data

DAPIP Prediction: (Concat (ConstStr CONST10) (GetStreetName (arg inp)))
Inputs | Outputs | Predictions
} summer Impulse St. Pellerin | Mr.Impulse St. | Mr.Impulse St.
Hensley Bag St. HI Rinaldo Nolan @ | Mr.Bag St. | Mr.Bag St.
hook Gertha % Plate St. hobbies MT | Mr.Plate St. | Mr.Plate St.
discussion Mcfarlin . Straw St. | Mr.Straw St. | Mr.Straw St.
hobbies Anger St. Twitty Downing ? | Mr.Anger St. | Mr.Anger St.
DAPIP Prediction: (Concat (ConstStr CONST36) (GetStateName (arg inp)))
Inputs | Outputs | Predictions
MA , North Carolina Zehr Gilma |  North Carolina |  North Carolina
Utah Evelia % Nancy |  Utah |  Utah
Josh skin . Missouri Agudelo |  Missouri |  Missouri
yarn drawer ‘ Indiana |  Indiana |  Indiana
Sandidge ) key Indiana |  Indiana |  Indiana
DAPIP Prediction: (Concat (GetStateAbbrFromState (arg inp)) (ConstStr CONST25))
Inputs | Outputs | Predictions
Elza Foot Locker Illinois @bo.com Mollett | IL * | IL *
$ can Sound St. mist Nevada | NV * | NV *
Harpin Utah . Reali RI Laurinda Borden | UT * | UT *
) Connecticut Belt Mortimer | CT * | CT *
Danita   Tennessee throat | TN * | TN *
DAPIP Prediction: (GetSecondToLastWS (arg (GetCEO (arg inp))))
Inputs | Outputs | Predictions
Eldora John Thain Marotta | John | John
Marya clover Sundar Pichai | Sundar | Sundar
327 drawer Gregory Wasson Kristian | Gregory | Gregory
! AOL Inc. Rinaldo quicksand James Gorman | James | James
Richard Johnson Barbie Gasaway | Richard | Richard
Figure 7: Selected samples of correct model predictions on the FlashFill++ synthetic test set. These samples also illustrate the nature of the programmatically generated training data for the FF and FF++ models. Because these programs rely on the semantic Lookup and Transform APIs, no reference FlashFill program is provided: no FlashFill program could solve these synthetic benchmarks.

Appendix C Samples of Solved FlashFill Benchmarks

FlashFill Program: (SubStr (RegPos (RegexStr (ConstStr "0-0")) (k 1) (dir End)) (RegPos (RegexStr REGEX4) (k 4) (dir End)))
DAPIP Prediction: (TrimLeadingZeros (arg (GetFirstDashToSecondDash (arg Inp))))
Inputs | Outputs | Predictions
09:40-09:50 | 9:50 | 9:50
09:50-08:30 | 8:30 | 8:30
09:50-07:30 | 7:30 | 7:30
09:50-09:55 | 9:55 | 9:55
05:50-06:30 | 6:30 | 6:30
FlashFill Program: (Concat (SubStr (RegPos (RegexStr REGEX8) (k 1) (dir End)) (RegPos (RegexStr REGEX1) (k 1) (dir End))) (ConstStr "@"))
DAPIP Prediction: (Concat (ToLowercase (arg (GetFirstWord (arg Inp)))) (ConstStr CONST13))
Inputs | Outputs | Predictions
Sophia Underwood | sophia@ | sophia@
Logan Smith | logan@ | logan@
Lucas Janckle | lucas@ | lucas@
Audrey Bennette | audrey@ | audrey@
Amelia Ford | amelia@ | amelia@
FlashFill Program: (Concat (SubStr (RegPos (RegexStr REGEX8) (k 1) (dir End)) (RegPos (RegexStr REGEX4) (k 1) (dir End))) (ConstStr "]"))
DAPIP Prediction: (Concat (GetStartToEndOfFirstNumber (arg (ToUppercase (arg Inp)))) (ConstStr CONST12))
Inputs | Outputs | Predictions
[CPT-00350 | [CPT-00350] | [CPT-00350]
[CPT-11523] | [CPT-11523] | [CPT-11523]
[CPT-23412] | [CPT-23412] | [CPT-23412]
[CPT-23412 | [CPT-23412] | [CPT-23412]
[CPT-2422] | [CPT-2422] | [CPT-2422]
FlashFill Program: (SubStr (RegPos (RegexStr REGEX8) (k 1) (dir End)) (RegPos (RegexStr REGEX4) (k 1) (dir End)))
DAPIP Prediction: (GetLastNumber (arg (TrimSpaces (arg (GetFirstAlpha (arg Inp))))))
Inputs | Outputs | Predictions
1:42:00 AM | 1 | 1
4:18:00 AM | 4 | 4
6:54:00 PM | 6 | 6
11:06:00 AM | 11 | 11
9:12:00 AM | 9 | 9
Figure 8: Selected samples of correct model predictions on the FlashFill test set. We additionally provide the program that FlashFill inferred from the input-output pairs and contrast it with DAPIP's prediction. Note that DAPIP programs have a much higher level of expressivity and interpretability.
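To make the interpretability claim concrete, the Python sketch below parses and evaluates one DAPIP-style program on the name-to-email-prefix benchmark shown above. The two API implementations and the value of CONST13 are assumptions inferred from the example outputs, and the program string is written with balanced parentheses.

```python
import re

# Assumed semantics for the two APIs used by the program.
APIS = {
    "GetFirstWord": lambda s: (s.split() or [""])[0],
    "ToLowercase": str.lower,
}
CONSTANTS = {"CONST13": "@"}  # assumed from the outputs in the benchmark

def tokenize(src):
    """Split an S-expression into parentheses and symbols."""
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    """Recursive-descent parser: returns nested lists of symbols."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # drop the closing parenthesis
    return node

def evaluate(node, inp):
    """Evaluate a parsed program on one input string."""
    if isinstance(node, str):  # bare symbol: the input variable
        return inp
    head, *args = node
    if head == "arg":
        return evaluate(args[0], inp)
    if head == "ConstStr":
        return CONSTANTS[args[0]]
    if head == "Concat":
        return "".join(evaluate(a, inp) for a in args)
    return APIS[head](evaluate(args[0], inp))

PROGRAM = "(Concat (ToLowercase (arg (GetFirstWord (arg Inp)))) (ConstStr CONST13))"
tree = parse(tokenize(PROGRAM))
for name in ["Sophia Underwood", "Logan Smith"]:
    print(name, "->", evaluate(tree, name))
```

The program reads directly as "lowercase the first word, then append a constant", whereas the equivalent FlashFill program expresses the same transformation through regular-expression positions.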

Appendix D Samples of Unsolved FlashFill Benchmarks

FlashFill Program: (Concat (Concat (Concat (Concat (Concat (Concat (Concat (Concat (Concat (Concat (SubStr (RegPos (RegexStr REGEX8) (k 1) (dir End)) (RegPos (RegexStr REGEX1) (k 1) (dir End))) (ConstStr ",")) (SubStr (RegPos (RegexStr REGEX7) (k 1) (dir End)) (RegPos (RegexStr REGEX1) (k 2) (dir End)))) (ConstStr ",")) (SubStr (RegPos (RegexStr REGEX7) (k 2) (dir End)) (RegPos (RegexStr REGEX1) (k 3) (dir End)))) (ConstStr ",")) (SubStr (RegPos (RegexStr REGEX7) (k 3) (dir End)) (RegPos (RegexStr REGEX1) (k 4) (dir End)))) (ConstStr ".")) (ConstStr "and")) (ConstStr ".")) (SubStr (RegPos (RegexStr REGEX7) (k 4) (dir End)) (RegPos (RegexStr REGEX10) (k 1) (dir End))))
Inputs | Outputs
Tom Mickey Minnie Donald Daffy | Tom,Mickey,Minnie,Donald.and.Daffy
Ben Bill Jerry Meyer Rahul | Ben,Bill,Jerry,Meyer.and.Rahul
Shahrukh Aamir Salman Amitabh Ajay | Shahrukh,Aamir,Salman,Amitabh.and.Ajay
Kobe Lebron Dwayne Chris Kevin | Kobe,Lebron,Dwayne,Chris.and.Kevin
Earth Fire Wind Water Sun | Earth,Fire,Wind,Water.and.Sun
FlashFill Program: (Concat (Concat (Concat (SubStr (RegPos (RegexStr REGEX8) (k 1) (dir End)) (RegPos (RegexStr REGEX4) (k 1) (dir End))) (SubStr (RegPos (RegexStr (ConstStr "1-")) (k 1) (dir End)) (RegPos (RegexStr REGEX4) (k 2) (dir End)))) (SubStr (RegPos (RegexStr (ConstStr "-")) (k 2) (dir End)) (RegPos (RegexStr REGEX4) (k 3) (dir End)))) (SubStr (RegPos (RegexStr (ConstStr "-")) (k 3) (dir End)) (RegPos (RegexStr REGEX4) (k 4) (dir End))))
Inputs | Outputs
1-452-789-4567 | 14527894567
1-503-897-4567 | 15038974567
1-408-789-4561 | 14087894561
1-406-789-1562 | 14067891562
1-845-456-7891 | 18454567891
Figure 9: Selected samples from the FlashFill benchmarks that could not be solved by our model; benchmarks such as these constitute the 55% of FlashFill benchmarks that our model cannot solve. At present, the maximum length of programs that DAPIP can produce is only 10, and these particular benchmarks would require much longer programs. However, if the batching of trees can be done more efficiently, the system can be trained on longer programs and it is conceivable that these benchmarks can be solved.