Latent Attention For If-Then Program Synthesis

11/07/2016 ∙ by Xinyun Chen, et al. ∙ 0

Automatic translation from natural language descriptions into programs is a longstanding challenging problem. In this work, we consider a simple yet important sub-problem: translation from textual descriptions to If-Then programs. We devise a novel neural network architecture for this task which we train end-to-end. Specifically, we introduce Latent Attention, which computes multiplicative weights for the words in the description in a two-stage process with the goal of better leveraging the natural language structures that indicate the relevant parts for predicting program elements. Our architecture reduces the error rate by 28.57% compared to prior art. We also propose a one-shot learning scenario of If-Then program synthesis and simulate it with our existing dataset. We demonstrate a variation on the training procedure for this scenario that outperforms the original procedure, significantly closing the gap to the model trained with all data.




1 Introduction

A touchstone problem for computational linguistics is to translate natural language descriptions into executable programs. Over the past decade, there has been an increasing number of attempts to address this problem from both the natural language processing community and the programming language community. In this paper, we focus on a simple but important subset of programs containing only one If-Then statement.

An If-Then program, which is also called a recipe, specifies a trigger and an action function, representing a program which will take the action when the trigger condition is met. On websites such as IFTTT, a user often provides a natural language description of the recipe’s functionality as well. Recent work quirk2015language ; beltagy-quirk:2016:P16-1 ; dong2016language studied the problem of automatically synthesizing If-Then programs from their descriptions. In particular, LSTM-based sequence-to-sequence approaches dong2016language and an approach of ensembling a neural network and logistic regression beltagy-quirk:2016:P16-1 were proposed to deal with this problem. In beltagy-quirk:2016:P16-1 , however, the authors claim that the diversity of vocabulary and sentence structures makes it difficult for an RNN to learn useful representations, and their ensemble approach indeed shows better performance than the LSTM-based approach dong2016language on the function prediction task (see Section 2).

In this paper, we introduce a new attention architecture, called Latent Attention, to overcome this difficulty. With Latent Attention, a weight is learned on each token to determine its importance for prediction of the trigger or the action. Unlike standard attention methods, Latent Attention computes the token weights in a two-step process, which aims to better capture the sentence structure. We show that by employing Latent Attention over outputs of a bi-directional LSTM, our new Latent Attention model can improve over the best prior result beltagy-quirk:2016:P16-1 by 5 percentage points, from 82.5% to 87.5%, when predicting the trigger and action functions together, reducing the error rate of beltagy-quirk:2016:P16-1 by 28.57%.

Besides the If-Then program synthesis task proposed by quirk2015language , we are also interested in a new scenario. When a new trigger or action is released, the training data will contain few corresponding examples. We refer to this case as a one-shot learning problem. We show that our Latent Attention model on top of dictionary embedding, combined with a new training algorithm, can achieve reasonably good performance on the one-shot learning task.

2 If-Then Program Synthesis

If-Then Recipes.

In this work, we consider an important class of simple programs called If-Then “recipes” (or recipes for short), which are very small programs for event-driven automation of tasks. Specifically, a recipe consists of a trigger and an action, indicating that the action will be executed when the trigger is fulfilled.

The simplicity of If-Then recipes makes them a great tool for users who may not know how to code. Even non-technical users can specify their goals using recipes, instead of writing code in a more full-fledged programming language. A number of websites have embraced the If-Then programming paradigm and have been hugely successful, with tens of thousands of personal recipes created. In this paper, we focus on data crawled from IFTTT, which allows users to share their recipes publicly, along with short natural language descriptions to explain the recipes’ functionality. A recipe on IFTTT consists of a trigger channel, a trigger function, an action channel, an action function, and arguments for the functions. There is a wide range of channels, which can represent entities such as devices, web applications, and IFTTT-provided services. Each channel has a set of functions representing events (i.e., trigger functions) or action executions (i.e., action functions).

For example, an IFTTT recipe with the following description

Autosave your Instagram photos to Dropbox

has the trigger channel Instagram, trigger function Any_new_photo_by_you, action channel Dropbox, and action function Add_file_from_URL. Some functions may take arguments. For example, the Add_file_from_URL function takes three arguments: the source URL, the name for the saved file, and the path to the destination folder.

Problem Setup.

Our task is similar to that in quirk2015language . In particular, for each description, we focus on predicting the channel and function for trigger and action respectively. Synthesizing a valid recipe also requires generating the arguments. As argued by beltagy-quirk:2016:P16-1 , however, the arguments are not crucial for representing an If-Then program. Therefore, we defer our treatment for arguments generation to Appendix B, where we show that a simple frequency-based method can outperform all existing approaches. In this way, our task turns into two classification problems for predicting the trigger and action functions (or channels).

Besides the problem setup in quirk2015language , we also introduce a new variation of the problem, a one-shot learning scenario: when some new channels or functions are initially available, there are very few recipes using these channels and functions in the training set. We explore techniques to still achieve a reasonable prediction accuracy on labels with very few training examples.

3 Related Work

Recently there has been increasing interest in executable code generation. Existing works have studied generating domain-specific code, such as regular expressions kushman2013using , code for parsing input documents lei2013natural , database queries zelle1996learning ; berant2013semantic , commands to robots kate2005learning , operating systems branavan2009reinforcement , smartphone automation le2013smartsynth , and spreadsheets gulwani2014nlyze . A recent effort considers translating a mixed natural language and structured specification into programming code DBLP:journals/corr/LingGHKSWB16 . Most of these approaches rely on semantic parsing wong2006learning ; jones2012semantic ; artzi2009broad ; quirk2015language . In particular, quirk2015language introduces the problem of translating IFTTT descriptions into executable code, and provides a semantic parsing-based approach. Two recent works studied approaches using a sequence-to-sequence model dong2016language and an ensemble of a neural network and a logistic regression model beltagy-quirk:2016:P16-1 to deal with this problem, and showed better performance than quirk2015language . We show that our Latent Attention method outperforms all prior approaches. Recurrent neural networks zaremba2014recurrent ; chung2014empirical along with attention bahdanau2014neural have demonstrated impressive results on tasks such as machine translation bahdanau2014neural , generating image captions xu2015show , syntactic parsing vinyals2015grammar and question answering memn2n .

4 Latent Attention Model

4.1 Motivation

To translate a natural language description into a program, we would like to locate the words in the description that are the most relevant for predicting desired labels (trigger/action channels/functions). For example, in the following description

Autosave Instagram photos to your Dropbox folder

the phrase “Instagram photos” is the most relevant for predicting the trigger. To capture this information, we can adapt the attention mechanism bahdanau2014neural ; memn2n : first compute a weight for the importance of each token in the sentence, and then output a weighted sum of the embeddings of these tokens.

However, our intuition suggests that the weight for each token depends not only on the token itself, but also the overall sentence structure. For example, in

Post photos in your Dropbox folder to Instagram

“Dropbox” determines the trigger, even though in the previous example, which contains almost the same set of tokens, “Instagram” should play this role. In this example, the prepositions such as “to” hint that the trigger channel is specified in the middle of the description rather than at the end. Taking this into account allows us to select “Dropbox” over “Instagram”.

Latent Attention is designed to exploit such clues. We use the usual attention mechanism for computing a latent weight for each token to determine which tokens in the sequence are more relevant to the trigger or the action. These latent weights determine the final attention weights, which we call active weights. As an example, given the presence of the token “to”, we might look at the tokens before “to” to determine the trigger.

Figure 1: Network Architecture

4.2 The network

The Latent Attention architecture is presented in Figure 1. We follow the convention of using lower-case letters to indicate column vectors, and capital letters for matrices. Our model takes as input a sequence of symbols $x_1, \dots, x_J$, with each coming from a dictionary of $N$ words. We denote $x = [x_1, \dots, x_J]$. Here, $J$ is the maximal length of a description. We illustrate each layer of the network below.

Latent attention layer.

We assume each symbol $x_i$ is encoded as a one-hot vector of $N$ dimensions. We can embed the input sequence $x$ into a $d$-dimensional embedding sequence using $E = \mathrm{Embed}_{\theta_1}(x)$, where $\theta_1$ is a set of parameters. We will discuss different embedding methods in Section 4.3. Here $E$ is of size $d \times J$.

The latent attention layer’s output is computed as a standard softmax on top of $E$. Specifically, let $l$ be the $J$-dimensional output vector and $u$ a $d$-dimensional trainable vector; then

$$l = \mathrm{softmax}\left(u^{T}\, \mathrm{Embed}_{\theta_1}(x)\right)$$

Active attention layer.

The active attention layer computes each token’s weight based on its importance for the final prediction. We call these weights active weights. We first embed $x$ into $D$ using another set of parameters $\theta_2$, i.e., $D = \mathrm{Embed}_{\theta_2}(x)$ is of size $d \times J$. Next, for each token $D_i$, we compute its active attention input $A_i$ through a softmax:

$$A_i = \mathrm{softmax}(V D_i)$$

Here, $A_i$ and $D_i$ denote the $i$-th column vectors of $A$ and $D$ respectively, and $V$ is a trainable parameter matrix of size $J \times d$. Since $V D_i = (V D)_i$, we can compute $A$ by performing a column-wise softmax over $V D$. Here, $A$ is of size $J \times J$.

The active weights $w$ are computed as the sum of the columns $A_i$, weighted by the latent attention output $l$:

$$w = \sum_{i=1}^{J} l_i A_i = A\, l$$

Output representation.

We use a third set of parameters $\theta_3$ to embed $x$ into a $d \times J$ embedding matrix, and the final output $o$, a $d$-dimensional vector, is the sum of the embeddings weighted by the active weights $w$:

$$o = \mathrm{Embed}_{\theta_3}(x)\, w$$

We use a softmax to make the final prediction: $\hat{f} = \mathrm{softmax}(P o)$, where $P$ is an $M \times d$ parameter matrix, and $M$ is the number of classes.
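The three layers above can be sketched as a single forward pass. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it uses the dictionary embedding (so embedding a one-hot token is a column lookup), the symbol names `theta1`, `theta2`, `theta3`, `u`, `V`, `P` are chosen for this sketch, the class count `M = 7` is arbitrary, and all parameters are random rather than trained.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention_forward(tokens, theta1, theta2, theta3, u, V, P):
    """tokens: (J,) word ids; theta_k: (d, N); u: (d,); V: (J, d); P: (M, d)."""
    E = theta1[:, tokens]            # (d, J) embedding for the latent layer
    l = softmax(u @ E)               # (J,)  latent weights
    D = theta2[:, tokens]            # (d, J) embedding for the active layer
    A = softmax(V @ D, axis=0)       # (J, J) column-wise softmax over V D
    w = A @ l                        # (J,)  active weights: sum_i l_i A_i
    o = theta3[:, tokens] @ w        # (d,)  output representation
    return softmax(P @ o), l, w      # class probabilities, plus both weight vectors

# Toy run: N, d, J follow the settings reported in the paper; M is hypothetical.
rng = np.random.default_rng(0)
N, d, J, M = 4001, 50, 25, 7
theta1, theta2, theta3 = (rng.normal(scale=0.1, size=(d, N)) for _ in range(3))
u = rng.normal(scale=0.1, size=d)
V = rng.normal(scale=0.1, size=(J, d))
P = rng.normal(scale=0.1, size=(M, d))
tokens = rng.integers(0, N, size=J)
probs, l, w = latent_attention_forward(tokens, theta1, theta2, theta3, u, V, P)
```

Because every column of `A` and the vector `l` each sum to one, the active weights `w` also sum to one before any further normalization.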

4.3 Details


We consider two embedding methods for representing words in the vector space. The first is a straightforward word embedding, i.e., $\mathrm{Embed}_{\theta}(x) = \theta x$, where $\theta$ is a $d \times N$ matrix and the columns of $x$ are one-hot vectors over the vocabulary of size $N$. We refer to this as “dictionary embedding” later in the paper. $\theta$ is not pretrained with a different dataset or objective, but initialized randomly and learned at the same time as all other parameters. We observe that when using Latent Attention, this simple method is effective enough to outperform some recent results quirk2015language ; dong2016language .

The other approach is to take the word embeddings, run them through a bi-directional LSTM (BDLSTM) zaremba2014recurrent , and then use the concatenation of two LSTMs’ outputs at each time step as the embedding. This can take into account the context around a token, and thus the embeddings should contain more information from the sequence than from a single token. We refer to such an approach as “BDLSTM embedding”. The details are deferred to Appendix A. In our experiments, we observe that with the help of this embedding method, Latent Attention can outperform the prior state-of-the-art.

In Latent Attention, we have three sets of embedding parameters, i.e., $\theta_1$, $\theta_2$, and $\theta_3$. In practice, we find that we can set the three equal without loss of performance. Later, we will show that keeping them separate is helpful for our one-shot learning setting.

Normalizing active weights.

We find that normalizing the active weights before computing the output helps to improve the performance. Specifically, we compute the output as

$$o = \mathrm{Embed}_{\theta_3}(x)\, \frac{w}{\|w\|}$$

where $\|w\|$ is the $L_2$-norm of $w$. In our experiments, we observe that this normalization can improve the performance by 1 to 2 points.
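As a small illustration of this step, the sketch below rescales the active weights before the weighted sum. It assumes the norm in question is the Euclidean ($L_2$) norm; the function name and the small `eps` guard against division by zero are additions for this sketch and are not taken from the paper.

```python
import numpy as np

def normalized_output(emb, w, eps=1e-12):
    """emb: (d, J) output embedding; w: (J,) active weights.

    Rescales w to unit Euclidean norm (an assumption of this sketch),
    then returns the weighted sum of the embedding columns.
    """
    return emb @ (w / (np.linalg.norm(w) + eps))

# Uniform weights over J = 25 tokens sum to 1 but have Euclidean norm 1/5,
# so normalization scales each weight up by a factor of 5.
J, d = 25, 50
w_uniform = np.full(J, 1.0 / J)
o = normalized_output(np.ones((d, J)), w_uniform)
```

Weights produced by a softmax already sum to one, so the rescaling changes the magnitude of the output vector rather than the relative importance of the tokens.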

Padding and clipping.

Latent Attention requires a fixed-length input sequence. To handle inputs of variable lengths, we perform padding and clipping. If an input’s length is smaller than $J$, then we pad it with null tokens at the end of the sequence. If an input’s length is greater than $J$ (which is 25 in our experiments), we keep the first 12 and the last 13 tokens, and discard the rest.


We tokenize each sentence by splitting on whitespace and punctuation, and convert all characters into lowercase. We keep all punctuation symbols as tokens too. We map each of the 4,000 most frequent tokens to itself, and all the rest to a special token UNK. Therefore our vocabulary size is 4,001. Our implementation has no special handling for typos.
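The preprocessing described above (tokenization, vocabulary truncation, and padding or clipping to 25 tokens) might look roughly like the following. The regular expression, the `<pad>` symbol used as the null token, and the function names are assumptions of this sketch; the paper does not specify them.

```python
import re
from collections import Counter

MAX_LEN = 25                     # maximal description length used in the paper
PAD, UNK = "<pad>", "UNK"        # "<pad>" is a hypothetical name for the null token

def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(descriptions, size=4000):
    """Keep the `size` most frequent tokens; everything else maps to UNK."""
    counts = Counter(tok for text in descriptions for tok in tokenize(text))
    return {tok for tok, _ in counts.most_common(size)}

def encode(text, vocab, max_len=MAX_LEN):
    """Map rare tokens to UNK, then clip or pad to exactly max_len tokens."""
    toks = [t if t in vocab else UNK for t in tokenize(text)]
    if len(toks) > max_len:
        head = max_len // 2                  # 12 when max_len is 25
        tail = max_len - head                # 13 when max_len is 25
        toks = toks[:head] + toks[-tail:]
    return toks + [PAD] * (max_len - len(toks))

# A 30-token input keeps its first 12 and last 13 tokens.
long_text = " ".join(f"w{i}" for i in range(30))
vocab = build_vocab([long_text, "Autosave your Instagram photos to Dropbox"])
clipped = encode(long_text, vocab)
padded = encode("Autosave your Instagram photos to Dropbox!", vocab)
```

In the second call the trailing "!" is kept as its own token but falls outside the toy vocabulary, so it is replaced by UNK before padding.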

5 If-Then Program Synthesis Task Evaluation

In this section, we evaluate our approaches against several baselines and previous work quirk2015language ; beltagy-quirk:2016:P16-1 ; dong2016language . We use the same crawler as Quirk et al. quirk2015language to crawl recipes from IFTTT. Unfortunately, many recipes are no longer available. We crawled all remaining recipes, ultimately obtaining 68,083 recipes for the training set. quirk2015language also provides a list of 5,171 recipes for validation, and 4,294 recipes for test. All test recipes come with labels from Amazon Mechanical Turk workers. We found that only 4,220 validation recipes and 3,868 test recipes remain available. quirk2015language defines a subset of test recipes, where each recipe has at least 3 workers agreeing on its labels from IFTTT, as the gold test set. We find that 584 out of the 758 gold test recipes used in quirk2015language remain available. We refer to these recipes as the gold test set. We present the data statistics in Appendix C.

Evaluated methods.

We evaluate two embedding methods as well as the effectiveness of different attention mechanisms. In particular, we compare no attention, standard attention, and Latent Attention. Therefore, we evaluate six architectures in total. When using dictionary embedding with no attention, for each sentence, we sum the embedding of each word, then pass it through a softmax layer for prediction. For convenience, we refer to such a process as standard softmax. For BDLSTM with no attention, we concatenate the final states of the forward and backward LSTMs, then pass the concatenation through a softmax layer for prediction. The two embedding methods with the standard attention mechanism memn2n are described in Appendix A. The Latent Attention models have been presented in Section 4.

Training details.

Architectures with no attention were trained using an initial learning rate of 0.01, which is multiplied by 0.9 every 1,000 time steps. Gradients with norm greater than 5 were scaled down to have norm 5. Architectures with either the standard attention mechanism or Latent Attention were trained using a learning rate of 0.001 without decay, and gradients with norm greater than 40 were scaled down to have norm 40. All models were trained using Adam kingma2014adam . All weights were initialized uniformly at random. Mini-batches were randomly shuffled during training. The mini-batch size is 32 and the embedding vector size is 50.
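The learning-rate schedule and gradient scaling can be sketched as follows. Two readings are assumed here and are not confirmed by the text: the decay is applied as a staircase (once per full block of 1,000 steps), and the norm is taken jointly over all gradient tensors rather than per tensor. The function names are illustrative.

```python
import numpy as np

def learning_rate(step, uses_attention):
    """Learning rate at a given training step for the two groups of architectures."""
    if uses_attention:
        return 0.001                          # constant, no decay
    return 0.01 * 0.9 ** (step // 1000)       # staircase decay (assumed reading)

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so that their joint Euclidean norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total <= max_norm:
        return grads
    return [g * (max_norm / total) for g in grads]

# No-attention models clip at 5; attention models clip at 40.
grads = [np.array([30.0, 40.0])]              # joint norm is 50
clipped_small = clip_by_global_norm(grads, max_norm=5.0)
clipped_large = clip_by_global_norm(grads, max_norm=40.0)
```

Scaling by `max_norm / total` preserves the gradient direction, which is what "scaled down to have norm 5" suggests, as opposed to clipping each coordinate independently.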


Figure 2 and Figure 3 present the prediction accuracy on channel and function respectively. Three previous works’ results are presented as well. In particular, quirk2015language is the first work introducing the If-Then program synthesis task. dong2016language investigates approaches using sequence-to-sequence models, while beltagy-quirk:2016:P16-1 proposes an approach to ensemble a feed-forward neural network and a logistic regression model. The numerical values for all data points can be found in Appendix C.


For each of our six architectures, we use 10 different random initializations to train 10 different models. To ensemble $k$ models, we choose the best $k$ models on the validation set among the 10 models, and average their softmax outputs as the ensembled output. For the three existing approaches quirk2015language ; dong2016language ; beltagy-quirk:2016:P16-1 , we report the best results from these papers.
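The ensembling procedure can be illustrated with a short sketch: rank the trained models by validation accuracy, keep the top few, and average their softmax outputs. The array layout and function name are assumptions made for this example, and the numbers are toy values rather than results from the paper.

```python
import numpy as np

def ensemble_predict(probs_per_model, val_acc, k):
    """Average the softmax outputs of the k models with the best validation accuracy.

    probs_per_model: (n_models, n_examples, n_classes) softmax outputs
    val_acc:         (n_models,) validation accuracy of each model
    """
    best = np.argsort(val_acc)[::-1][:k]          # indices of the k best models
    averaged = probs_per_model[best].mean(axis=0)  # (n_examples, n_classes)
    return averaged.argmax(axis=1)

# Toy example: 3 models, 2 test examples, 2 classes.
probs = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # model 0
    [[0.2, 0.8], [0.3, 0.7]],   # model 1
    [[0.6, 0.4], [0.6, 0.4]],   # model 2
])
val_acc = np.array([0.8, 0.5, 0.9])
single = ensemble_predict(probs, val_acc, k=1)   # model 2 alone
pair = ensemble_predict(probs, val_acc, k=2)     # models 2 and 0 averaged
```

In this toy case the second example flips from class 0 to class 1 once model 0 is added, which is the kind of correction averaging is meant to provide.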

Figure 2: Accuracy for Channel
Figure 3: Accuracy for Channel+Function

We train the model to optimize for function prediction accuracy. The channel accuracy in Figure 2 is computed in the following way: to predict the channel, we first predict the function (from a list of all functions in all channels), and the channel that the function belongs to is returned as the predicted channel. We observe that

  • Latent Attention steadily improves over standard attention architectures and no attention ones using either embedding method.

  • In our six evaluated architectures, ensembling improves upon using only one model significantly.

  • When ensembling more than one model, BDLSTM embeddings perform better than dictionary embeddings. We attribute this to the fact that, for each token, BDLSTM can encode the information of its surrounding tokens, e.g., phrases, into its embedding, which is thus more effective.

  • For the channel prediction task in Figure 2, all architectures except dictionary embedding with no attention (i.e., Dict) can outperform quirk2015language . Ensembling only 2 BDLSTM models with either standard attention or Latent Attention is enough to achieve better performance than prior art dong2016language . By ensembling 10 BDLSTM+LA models, we can improve the latest results dong2016language and beltagy-quirk:2016:P16-1 by 1.9 points and 2.5 points respectively.

  • For the function prediction task in Figure 3, all our six models (including Dict) outperform quirk2015language . Further, ensembling 9 BDLSTM+LA models can improve the previous best result beltagy-quirk:2016:P16-1 by 5 points. In other words, our approach reduces the error rate of beltagy-quirk:2016:P16-1 by 28.57%.
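The error-rate figure in the last item follows directly from the function accuracies listed in Appendix C (82.5 for beltagy-quirk:2016:P16-1 and 87.5 for the ensemble of 9 BDLSTM+LA models), as this short calculation shows:

```latex
\begin{align*}
\text{error}_{\text{prior}} &= 100 - 82.5 = 17.5 \\
\text{error}_{\text{ours}}  &= 100 - 87.5 = 12.5 \\
\text{relative reduction}   &= \frac{17.5 - 12.5}{17.5} = \frac{5}{17.5} \approx 28.57\%
\end{align*}
```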

6 One-Shot Learning

We consider the scenario when websites such as IFTTT release new channels and functions. In such a scenario, for a period of time, there will be very few recipes using the newly available channels and functions; however, we would still like to enable synthesizing If-Then programs using these new functions. The rarity of such recipes in the training set creates a challenge similar to the one-shot learning setting. In this scenario, we want to leverage the large number of recipes for existing functions, and the goal is to achieve a good prediction accuracy for the new functions without significantly compromising the overall accuracy.

6.1 Datasets to simulate one-shot learning

To simulate this scenario with our existing dataset, we build two one-shot variants of it as follows. We first split the set of trigger functions into two sets, based on their frequency. The top100 set contains the top 100 most frequently used trigger functions, while the non-top100 set contains the rest.

Given a set of trigger functions $S$, we can build a skewed training set to include all recipes using functions in $S$, and 10 randomly chosen recipes for each function not in $S$. We denote this skewed training set created based on $S$ as $(S, \bar{S})$, and refer to functions in $S$ as majority functions and functions in $\bar{S}$ as minority functions. In our experiments, we construct two new training sets by choosing $S$ to be the top100 set and the non-top100 set respectively. We refer to these two training sets as SkewTop100 and SkewNonTop100.
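The construction of a skewed training set can be sketched as below. Recipes are represented as `(description, trigger_function)` pairs, which is a simplification for this example; the function names and the fixed random seed are likewise assumptions.

```python
import random
from collections import Counter, defaultdict

def top_functions(recipes, k=100):
    """Return the k most frequently used trigger functions."""
    counts = Counter(fn for _, fn in recipes)
    return {fn for fn, _ in counts.most_common(k)}

def build_skewed_set(recipes, majority, per_minority=10, seed=0):
    """Keep every recipe of a majority function, and at most
    `per_minority` randomly chosen recipes of each other function."""
    rng = random.Random(seed)
    by_function = defaultdict(list)
    for recipe in recipes:
        by_function[recipe[1]].append(recipe)
    skewed = []
    for fn, group in by_function.items():
        if fn in majority:
            skewed.extend(group)
        else:
            skewed.extend(rng.sample(group, min(per_minority, len(group))))
    return skewed

# Toy data: function "a" is the only majority function.
recipes = [("d", "a")] * 50 + [("d", "b")] * 30 + [("d", "c")] * 4
majority = top_functions(recipes, k=1)
skewed = build_skewed_set(recipes, majority)
```

Choosing `majority` as the 100 most frequent functions gives a set in the style of SkewTop100, while passing the complement gives one in the style of SkewNonTop100. A function with fewer than 10 recipes simply contributes all of them.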

The motivation for creating these datasets is to mimic two different scenarios. On one hand, SkewTop100 simulates the case that at the startup phase of a service, popular recipes are first published, while less frequently used recipes are introduced later. On the other hand, SkewNonTop100 captures the opposite situation. The statistics for these two training sets are presented in Appendix C. While SkewTop100 is more common in real life, the SkewNonTop100 training set is only about 15.7% of the entire training set (10,707 of 68,083 recipes), and thus is more challenging.

6.2 Training

We evaluate three training methods as follows, where the last one is specifically designed for attention mechanisms. In all methods, the training data is either SkewTop100 or SkewNonTop100.

Standard training.

We do not modify the training process.

Naïve two-step training.

We do standard training first. Since the data is heavily skewed, the model may behave poorly on the minority functions. From a training set $(S, \bar{S})$, we create a rebalanced dataset by randomly choosing 10 recipes for each function in $S$ and taking all recipes using functions in $\bar{S}$. Therefore, the numbers of recipes using each function are similar in this rebalanced dataset. We resume training using this rebalanced training dataset in the second step.

Two-step training.

We still do standard training first, and then create the rebalanced dataset in the same way as in naïve two-step training. However, in the second step, instead of training the entire network, we keep the attention parameters fixed, and train only the parameters in the remaining part of the model. Take the Latent Attention model depicted in Figure 1 as an example. In the second step, we keep parameters $\theta_1$, $\theta_2$, $u$, and $V$ fixed, and only update $\theta_3$ and $P$ while training on the rebalanced dataset. We based this procedure on the intuition that since the rebalanced dataset is very small, fewer trainable parameters enable easier training.
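The second step of two-step training can be sketched as two small pieces: rebalancing the data and updating only the non-attention parameters. This is an illustration under assumptions, not the authors' code. The parameter names follow the notation used for the Latent Attention model, a plain gradient step stands in for Adam, and scalar values stand in for the real tensors.

```python
import random
from collections import defaultdict

ATTENTION_PARAMS = frozenset({"theta1", "theta2", "u", "V"})  # kept fixed in step two

def rebalance(train_set, per_function=10, seed=0):
    """Keep at most `per_function` recipes per trigger function.

    Minority functions already have at most 10 recipes in a skewed set,
    so in effect only the majority functions are subsampled.
    """
    rng = random.Random(seed)
    by_function = defaultdict(list)
    for recipe in train_set:
        by_function[recipe[1]].append(recipe)
    balanced = []
    for group in by_function.values():
        balanced.extend(rng.sample(group, min(per_function, len(group))))
    return balanced

def gradient_step(params, grads, lr, frozen=frozenset()):
    """One plain gradient step that leaves the frozen parameters untouched."""
    return {name: value if name in frozen else value - lr * grads[name]
            for name, value in params.items()}

# Toy parameters (scalars stand in for the real matrices and vectors).
params = {name: 1.0 for name in ("theta1", "theta2", "theta3", "u", "V", "P")}
grads = {name: 0.5 for name in params}
updated = gradient_step(params, grads, lr=0.1, frozen=ATTENTION_PARAMS)
balanced = rebalance([("d", "a")] * 50 + [("d", "b")] * 10 + [("d", "c")] * 4)
```

Naïve two-step training corresponds to calling `gradient_step` with an empty `frozen` set on the same rebalanced data.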

(a) Trigger Function Accuracy (SkewTop100)
(b) Trigger Function Accuracy (SkewNonTop100)
Figure 4: One-shot learning experiments. For each column XY-Z, X from {B, D} represents whether the embedding is BDLSTM or Dictionary; Y is either empty, or is from {A, L}, meaning that either no attention is used, or standard attention or Latent Attention is used; and Z is from {S, 2N, 2}, denoting standard training, naïve two-step training or two-step training.

6.3 Results

We compare the three training strategies using our proposed models. We omit the no attention models, which do not perform better than attention models and cannot be trained using two-step training. We only train one model per strategy, so the results are without ensembling. The results are presented in Figure 4. The concrete values can be found in Appendix C. For reference, the best single BDLSTM+LA model can achieve trigger function accuracy: on top100 functions, and on non-top100 functions. We observe that

  • Using two-step training, both the overall accuracy and the accuracy on the minority functions are generally better than using standard training and naïve two-step training.

  • Latent Attention outperforms standard attention when using the same training method.

  • The best Latent Attention model (Dict+LA) with two-step training can achieve and accuracy for trigger function on the gold test set, when trained on the SkewTop100 and SkewNonTop100 datasets respectively. For comparison, when using the entire training dataset, trigger function accuracy of Dict+LA is . Note that the SkewNonTop100 dataset accounts for only of the entire training dataset.

  • For SkewTop100 training set, Dict+LA model can achieve accuracy on minority functions in gold test set. This number for using the full training dataset is , although the non-top100 recipes in SkewTop100 make up only of those in the full training set.

7 Empirical Analysis of Latent Attention

Figure 5: Examples of attention weights output by Dict+LA. latent, trigger, and action indicate the latent weights and the active weights for the trigger and the action respectively. Values below a small threshold are omitted.

We show some correctly classified and misclassified examples in Figure 5 along with their attention weights. The weights are computed from a Dict+LA model. We choose Dict+LA instead of BDLSTM+LA because the BDLSTM embedding of each token does not correspond to the token itself only: it also contains information passed from previous and subsequent tokens in the sequence. Therefore, the attention of BDLSTM+LA is not as easy to interpret as that of Dict+LA.

The latent weights are those used to predict the action functions. In correctly classified examples, we observe that the latent weights are assigned to the prepositions that determine which parts of the sentence are associated with the trigger or the action. An interesting example is (b), where a high latent weight is assigned to “,”. This indicates that LA considers “,” as informative as other English words such as “to”. We observe the similar phenomenon in Example (c), where token “” has the highest latent weight.

In several misclassified examples, we observe that some attention weights may not be assigned correctly. In Example (e), although there is nowhere explicitly showing the trigger should be using a Facebook channel, the phrase “photo of me” hints that “me” should be tagged in the photo. Therefore, a human can infer that this should use a function from the Facebook channel, called “You_are_tagged_in_a_photo”. The Dict+LA model does not learn this association from the training data. In this example, we expect that the model should assign high weights onto the phrase “of me”, but this is not the case, i.e., the weights assigned to “of” and “me” are 0.01 and 0.007 respectively. This shows that the Dict+LA model does not correlate these two words with the You_are_tagged_in_a_photo function. BDLSTM+LA, on the other hand, can jointly consider the two tokens, and make the correct prediction.

Example (h) is another example where outside knowledge might help: Dict+LA predicts the trigger function to be Create_a_post since it does not learn that Instagram only consists of photos (and low weight was placed on “Instagram” when predicting the trigger anyway). Again, BDLSTM+LA can predict this case correctly.


We thank the anonymous reviewers for their valuable comments. This material is based upon work partially supported by the National Science Foundation under Grant No. TWC-1409915, and a DARPA grant FA8750-15-2-0104. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation and DARPA.


Appendix A BDLSTM and attention model details

A.1 BDLSTM embedding

Recurrent neural networks have become popular for natural language processing tasks due to their suitability for processing sequential data. Given inputs $x_1$ to $x_J$, an RNN computes

$$h_j = \tanh(W x_j + U h_{j-1} + b)$$

where $h_0$ is a zero vector, $W$ and $U$ are trained parameter matrices, and $b$ is used as a bias.

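Long Short-Term Memory (LSTM) is an RNN variant which is better suited for learning long-term dependencies. Although several versions of it have been described in the literature, we use the version in Zaremba et al. [21] and borrow their notation here:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}$$

$$c_t^{l} = f \odot c_{t-1}^{l} + i \odot g, \qquad h_t^{l} = o \odot \tanh(c_t^{l})$$

Here $T_{2n,4n}$ denotes an affine transform from $\mathbb{R}^{2n}$ to $\mathbb{R}^{4n}$, $\mathrm{sigm}$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. The memory cells $c_t^{l}$ are designed to store information for longer periods of time than the hidden state.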

We construct the bi-directional model with a forward LSTM which receives the input sequence in the original order, and a backward LSTM which receives the input sequence in the reverse order. The BDLSTM embedding is the concatenation of the output of the two. This structure is illustrated in Figure 6.
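The bi-directional construction can be sketched independently of the cell type. To keep the example short, the code below uses a plain tanh recurrent cell as a stand-in for the LSTM, which is a deliberate simplification; the function names, shapes, and random parameters are assumptions of this sketch.

```python
import numpy as np

def run_rnn(X, W, U, b):
    """Run a simple tanh RNN over the columns of X (d_in, J); h_0 is a zero vector.

    Returns the hidden states as a (d_h, J) matrix. A plain RNN cell is used
    here only as a stand-in for the LSTM cell.
    """
    h = np.zeros(U.shape[0])
    states = []
    for t in range(X.shape[1]):
        h = np.tanh(W @ X[:, t] + U @ h + b)
        states.append(h)
    return np.stack(states, axis=1)

def bidirectional_embed(X, forward_params, backward_params):
    """Concatenate forward and backward hidden states at every position."""
    H_fwd = run_rnn(X, *forward_params)
    # Feed the sequence in reverse, then flip back so columns align with tokens.
    H_bwd = run_rnn(X[:, ::-1], *backward_params)[:, ::-1]
    return np.concatenate([H_fwd, H_bwd], axis=0)       # (2 * d_h, J)

rng = np.random.default_rng(0)
d_in, d_h, J = 50, 50, 25
make = lambda: (rng.normal(scale=0.1, size=(d_h, d_in)),
                rng.normal(scale=0.1, size=(d_h, d_h)),
                np.zeros(d_h))
fwd, bwd = make(), make()
X = rng.normal(size=(d_in, J))
H = bidirectional_embed(X, fwd, bwd)
```

Each column of `H` then plays the role of the embedding of one token, carrying context from both directions.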

A.2 Standard attention model

The standard attention model differs from Latent Attention in that there is only one layer of active attention, as described below.

The attention layer.

We compute the attention over the tokens with the following:

$$w = \mathrm{softmax}\left(u^{T}\, \mathrm{Embed}_{\theta_1}(x)\right)$$

$\mathrm{Embed}_{\theta_1}(x)$ has dimensions $d \times J$ and $u$ is a $d$-dimensional trainable vector.

Output representation.

We use a second set of parameters $\theta_2$ to embed $x$, and then the final output $o$, a $d$-dimensional vector, is the weighted sum of these embeddings using the active weights $w$.

We compute probabilities over the output class labels by a matrix multiplication followed by softmax:

$$\hat{f} = \mathrm{softmax}(P o)$$

Figure 7: F1 score for argument prediction

Appendix B Predicting Arguments

We provide a frequency-based method for predicting the function arguments as a baseline, and show that this can outperform existing approaches dramatically when combined with our higher-performance function name prediction. In particular, for each description, we first predict the (trigger and action) functions $f_t, f_a$. For each function $f$, for each argument $a$, and for each possible argument value $v$, we compute the frequency with which $f$’s argument $a$ takes the value $v$. We denote this frequency as $\Pr(v \mid f, a)$. Our prediction is made by computing

$$\hat{v} = \arg\max_{v} \Pr(v \mid f, a)$$

Note that the prediction is made entirely based on the predicted function $f$, without using any information from the description.

We found that for a given function, some arguments may not appear in all recipes using this function. In this case, we assign the value a special token; this is distinct from the case where the argument exists but its value has zero length (i.e., “”).
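The frequency-based baseline can be sketched as a lookup table built from the training recipes. The argument names, the example values, and the `<missing>` marker used for absent arguments are all hypothetical; only the function name `Add_file_from_URL` comes from the text.

```python
from collections import Counter, defaultdict

MISSING = "<missing>"   # hypothetical marker for an argument absent from a recipe

def fit_argument_table(recipes):
    """recipes: list of (function, {argument_name: value}) pairs.

    Returns {(function, argument_name): most frequent value}, where a recipe
    that omits an argument counts as a vote for MISSING.
    """
    arg_names = defaultdict(set)
    for fn, args in recipes:
        arg_names[fn].update(args)
    counts = defaultdict(Counter)
    for fn, args in recipes:
        for name in arg_names[fn]:
            counts[(fn, name)][args.get(name, MISSING)] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def predict_arguments(table, fn):
    """Predicted arguments depend only on the predicted function."""
    return {name: value for (f, name), value in table.items() if f == fn}

# Toy training data with illustrative argument names and values.
train = [
    ("Add_file_from_URL", {"url": "PHOTO_URL", "folder_path": "photos"}),
    ("Add_file_from_URL", {"url": "PHOTO_URL", "folder_path": "photos", "file_name": ""}),
    ("Add_file_from_URL", {"url": "PHOTO_URL", "folder_path": "backup"}),
]
table = fit_argument_table(train)
prediction = predict_arguments(table, "Add_file_from_URL")
```

Here `file_name` is predicted as missing because two of the three recipes omit it, while the single empty-string value is counted separately, mirroring the distinction drawn above.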

We use the same setup as in Section 5. The results are presented in Figure 7. beltagy-quirk:2016:P16-1 does not present results for argument prediction, so we do not include it in Figure 7. We observe that the results are broadly consistent with the results for channel and function accuracy.

Appendix C Data statistics and numerical results

                          Training   Test (Gold)
# of trigger channels          112            59
# of trigger functions         443           136
# of action channels            87            41
# of action functions          161            56
# of recipes                68,083           584

Table 1: Statistics for IFTTT dataset

Channel Accuracy for Ensembled Models (Fig. 2)

Ensemble                 1     2     3     4     5     6     7     8     9    10
Dict                  71.9  72.8  73.5  74.1  74.7  80.1  80.5  79.6  80.5  81.3
Dict+A                82.4  83.0  83.6  83.2  83.2  83.2  83.7  83.9  83.6  83.7
Dict+LA               87.3  87.7  88.5  87.7  87.7  87.3  87.0  86.4  86.4  87.5
BDLSTM                84.8  89.2  90.1  90.4  90.6  90.8  90.4  90.9  91.4  91.6
BDLSTM+A              89.2  90.4  90.4  89.7  90.4  90.4  90.8  90.9  90.8  91.1
BDLSTM+LA             89.6  89.9  90.2  90.4  90.8  90.8  90.9  90.9  91.4  91.6
Quirk et al. [16]     81.4
Dong et al. [7]       89.7
Beltagy et al. [3]    89.1

Function Accuracy for Ensembled Models (Fig. 3)

Ensemble                 1     2     3     4     5     6     7     8     9    10
Dict                  71.6  74.7  74.7  75.9  76.0  76.0  75.7  75.7  76.0  76.4
Dict+A                74.0  76.0  75.9  76.0  76.4  75.7  76.5  77.6  77.2  77.2
Dict+LA               79.6  78.4  78.0  78.9  78.0  79.9  79.9  79.9  81.3  82.2
BDLSTM                78.6  81.8  81.5  82.4  84.1  85.4  85.6  86.0  85.8  85.4
BDLSTM+A              80.3  83.6  84.6  84.4  84.6  84.4  84.4  84.6  85.1  84.8
BDLSTM+LA             82.4  83.7  85.3  86.0  85.8  85.6  86.0  86.8  87.5  87.3
Quirk et al. [16]     71.0
Dong et al. [7]       78.4
Beltagy et al. [3]    82.5

F1 Score for Arguments for Ensembled Models (Fig. 7)

Ensemble                 1     2     3     4     5     6     7     8     9    10
Dict                  70.9  72.6  72.4  72.6  72.7  72.7  72.6  72.4  72.9  72.9
Dict+A                72.6  73.2  73.1  73.2  73.2  73.0  73.4  73.4  73.4  73.5
Dict+LA               73.1  73.8  74.5  74.2  74.9  74.8  74.7  75.0  75.1  75.1
BDLSTM                73.2  75.0  75.8  76.0  76.0  76.1  76.5  76.4  76.4  76.4
BDLSTM+A              74.4  75.8  75.9  75.9  76.0  76.0  75.8  76.0  76.1  76.0
BDLSTM+LA             74.7  76.0  76.0  76.3  76.2  76.2  76.3  76.8  76.7  76.8
Quirk et al. [16]     66.5
Dong et al. [7]       74.2

Table 2: Numerical Results for Figures 2, 3, and 7

In this section, we provide concrete data statistics and results. The statistics for the IFTTT dataset that we evaluated are presented in Table 1. The numerical values corresponding to Figures 2, 3, and 7 are presented in Table 2. The statistics for the data used in one-shot learning are presented in Table 3. The numerical results corresponding to Figures 4(a) and 4(b) are presented in Table 4.

                                              SkewTop100   SkewNonTop100
# of recipes                                      61,341          10,707
# of recipes in S (majority functions)            58,376           9,707
# of recipes not in S (minority functions)         2,965           1,000

Table 3: Statistics for unbalanced training sets

Table 4: Numerical Results for Figures 4(a) and 4(b)