Training models that can understand natural language instructions and execute them in a real-world environment is of paramount importance for communicating with virtual assistants and robots, and therefore has attracted considerable attention Branavan et al. (2009); Vogel and Jurafsky (2010); Chen and Mooney (2011). A prominent approach is to cast the problem as semantic parsing, where instructions are mapped to a high-level programming language Artzi and Zettlemoyer (2013); Long et al. (2016); Guu et al. (2017). Because annotating programs at scale is impractical, it is desirable to train a model from instructions, an initial world state, and a target world state only, letting the program itself be a latent variable.
Learning from such weak supervision results in a difficult search problem at training time. The model must search for a program that, when executed, leads to the correct target state. Early work employed lexicons and grammars to constrain the search space Clarke et al. (2010); Liang et al. (2011); Krishnamurthy and Mitchell (2012); Berant et al. (2013); Artzi and Zettlemoyer (2013), but the recent success of sequence-to-sequence models Sutskever et al. (2014) shifted most of the burden to learning. Search is often performed simply using beam search, where program tokens are emitted from left to right, or program trees are generated top-down Cheng et al. (2017) or bottom-up Liang et al. (2017). Nevertheless, when instructions are long and complex and reward is sparse, the model may never find enough correct programs, and training will fail.
In this paper, we propose a beam-search algorithm for mapping a sequence of natural language instructions to a program. First, we capitalize on the target world state being available at training time and train a critic network that, given the language instructions, current world state, and target world state, estimates the expected future reward for each search state. In contrast to traditional beam search, where states representing partial programs are scored based on their likelihood only, we also consider expected future reward, leading to a more targeted search at training time. Second, rather than search in the space of programs, we search in a more compressed execution space, where each state is defined by the result of executing a partial program, while tracking recent entities and actions. This makes the search space smaller, while allowing us to handle references to past entities and actions.
We evaluate on the SCONE dataset, which includes three different domains where long sequences of 5 instructions are mapped to programs. We show that while standard beam search gets stuck in a local optimum and is unable to discover good programs for many examples, our model is able to bootstrap in this challenging setup, improving final performance by more than 25 points on average. We also perform extensive analysis and show that both value-based search as well as searching in execution space contribute to the final performance.
Mapping instructions to programs invariably involves a context, such as a database or a robotic environment, in which the program (or logical form) is executed. The goal is to train a model given a training set $\{(x^{(j)} = (c^{(j)}, u^{(j)}), y^{(j)})\}_{j=1}^{N}$, where $c$ is the context, $u$ is a sequence of natural language instructions, and $y$ is the target state of the environment after following the instructions, which we refer to as the denotation. The model is trained to map the instructions $u$ to a program $z$ such that executing $z$ in the context $c$ results in the denotation $y$, which we denote by $[\![z]\!]_c = y$. Thus, the program $z$ is a latent variable we must search for at both training time and test time. When the sequence of instructions is long, search becomes hard, particularly in the early stages of training.
Recent work tackled the training problem using variants of reinforcement learning (RL) Suhr and Artzi (2018); Liang et al. (2018) or maximum marginal likelihood (MML) Guu et al. (2017); Goldman et al. (2018). We now briefly describe MML training, on which we base our training procedure, and which outperformed RL in past work under comparable conditions Guu et al. (2017).
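We denote by $\pi_\theta$ a model, parameterized by $\theta$, that generates the program $z$ token by token from left to right. The model $\pi_\theta$ receives the context $c$, instructions $u$, and previously predicted program tokens $z_{1 \dots t-1}$, and returns a distribution over the next program token $z_t$. The probability of a program prefix is defined to be:
$$p_\theta(z_{1 \dots t} \mid u, c) = \prod_{i=1}^{t} \pi_\theta(z_i \mid u, c, z_{1 \dots i-1}).$$
The model is trained using stochastic gradient descent to maximize the MML objective:
$$J_{\mathrm{MML}} = \log \prod_{(c,u,y)} \sum_{z} p_\theta(z \mid u, c) \cdot R(z) = \sum_{(c,u,y)} \log \sum_{z} p_\theta(z \mid u, c) \cdot R(z),$$
where $R(z) = 1$ if $[\![z]\!]_c = y$, and 0 otherwise (for brevity, we omit $c$ and $y$ from $R(\cdot)$).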
The search problem arises because it is impossible to exhaustively enumerate the countably infinite set of all programs, and thus the sum over programs is approximated by a small set of high-probability programs (introducing some bias). Search is commonly done with beam search, an iterative algorithm that builds an approximation of the highest-probability programs according to the model. At each time step $t$, the algorithm constructs a beam $B_t$ of at most $K$ program prefixes of length $t$. Given a beam $B_{t-1}$, $B_t$ is constructed by selecting the $K$ most likely continuations of prefixes in $B_{t-1}$, according to $p_\theta(z_{1 \dots t} \mid \cdot)$. The algorithm runs for a fixed number of iterations $T$, and returns all the complete programs that were discovered.
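The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: `next_token_probs`, `toy_model`, the `<end>` token, and the reward function are hypothetical stand-ins for the trained model and the executor.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def beam_search(next_token_probs: Callable[[Prefix], Dict[str, float]],
                beam_size: int, max_steps: int,
                end_token: str = "<end>") -> List[Tuple[Prefix, float]]:
    """Return the complete programs found, paired with their log-probabilities."""
    beam: List[Tuple[Prefix, float]] = [((), 0.0)]
    complete: List[Tuple[Prefix, float]] = []
    for _ in range(max_steps):
        # Expand every prefix on the beam by every possible next token.
        candidates = []
        for prefix, logp in beam:
            for token, prob in next_token_probs(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + (token,), logp + math.log(prob)))
        # Keep only the K most likely continuations.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for prefix, logp in candidates[:beam_size]:
            if prefix[-1] == end_token:
                complete.append((prefix, logp))
            else:
                beam.append((prefix, logp))
        if not beam:
            break
    return complete


def mml_objective(programs: List[Tuple[Prefix, float]],
                  reward: Callable[[Prefix], bool]) -> float:
    """Approximate log sum_z p(z) R(z) using only the programs found by search."""
    mass = sum(math.exp(logp) for z, logp in programs if reward(z))
    return math.log(mass) if mass > 0.0 else float("-inf")


def toy_model(prefix: Prefix) -> Dict[str, float]:
    # Hypothetical model: two free tokens, then the program must end.
    if len(prefix) < 2:
        return {"a": 0.6, "b": 0.4}
    return {"<end>": 1.0}


programs = beam_search(toy_model, beam_size=2, max_steps=3)
# Assumed reward for the toy task: a program is correct if its second token is "b".
objective = mml_objective(programs, reward=lambda z: z[1] == "b")
```

If no program on the beam is correct, the approximated objective is minus infinity and the example contributes no learning signal, which is the failure mode discussed in this section.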
In this paper, our focus is the search problem that arises at training time when training from denotations, namely, we must find programs that execute to the right denotation. Thus, we would like to focus on scenarios where programs are long, and consequently vanilla beam search fails as a search algorithm. We next describe the SCONE dataset, which provides such an opportunity.
The SCONE dataset
Long et al. (2016) presented the SCONE dataset, where a sequence of instructions needs to be mapped to a program consisting of a sequence of commands executed in an environment. The dataset comprises three domains, where each domain includes several objects (such as people or beakers), each with different properties (such as shirt color or chemical color). SCONE provides a good environment for stress-testing search algorithms because a sequence of instructions needs to be mapped to a sequence of commands that result in the correct denotation. Figure 1 shows an example input from the Scene domain.
Formally, the context in SCONE is a world that is specified by a list of positions, where each position may contain an object with certain properties. A formal language is defined to interact with and manipulate the world. The formal language contains constants (e.g., numbers and colors), functions that query the world and refer to various objects and intermediate computations, and actions, which are functions that modify the world state. Each command is composed of a single action and several arguments constructed recursively from constants and functions. For example, the command move(hasHat(yellow), leftOf(hasShirt(blue))) contains the action move, which moves a person to a specified position. The person is computed by hasHat(yellow), which queries the world for the position of the person with a yellow hat, and the target position is computed by leftOf(hasShirt(blue)), which queries the world for the position to the left of the person with a blue shirt. We refer to Guu et al. (2017) for a full description of the formal language.
Our goal is to train a model that, given an initial world $w_0$ and a sequence of natural language utterances $u = (u_1, \dots, u_M)$, maps each utterance $u_i$ to a command $z_i$ such that applying the program $z = (z_1, \dots, z_M)$ to $w_0$ results in the target world, i.e., $[\![z]\!]_{w_0} = w_M = y$.
3 Markov Decision Process Formulation
To present our search algorithm, we first formulate the problem as a Markov Decision Process (MDP). We show in Section 4 that the graph induced by an MDP when searching in the space of programs is a tree. In contrast, our algorithm will search in a more compact execution space, where the MDP corresponds to a graph.
We represent the MDP as a tuple $(\mathcal{S}, \mathcal{A}, R, \delta)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R$ is the reward function, and $\delta : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is a deterministic transition function. To define $\mathcal{S}$, we first assume all program prefixes are executable, which can be easily done as we show below. The execution result of a prefix $\tilde{z}$ in the context $c$, denoted by $[\![\tilde{z}]\!]^{ex}_c$, contains its denotation, but also additional information stored in the internal state of the executor, as we elaborate below. Let $Z_{pre}$ be the set of all valid program prefixes. The set of states is defined to be $\mathcal{S} = \{(x, [\![\tilde{z}]\!]^{ex}_c) \mid \tilde{z} \in Z_{pre}\}$, i.e., the input paired with the result of executing all possible program prefixes. Because many program prefixes have the same execution result, the execution space is much smaller than the space of programs.
The action set $\mathcal{A}$ includes all possible program tokens (decoding is grammar-constrained at training and test time to output syntactically valid program tokens only), and the transition function $\delta$ is computed by the environment executor. Last, the reward $R(s, a) = 1$ iff the action $a$ ends the program and leads to a state $\delta(s, a)$ where the denotation is equal to the target world $y$. The model $\pi_\theta$ is a parameterized neural policy over the MDP that provides a distribution over the program vocabulary at each time step (details in Appendix A). Next, we precisely define the execution result $[\![\cdot]\!]^{ex}_c$.
We assume every program prefix can be executed, which can be done by writing programs in postfix notation, as described by Guu et al. (2017). For example, the command move(hasHat(yellow), leftOf(hasShirt(blue))) is written as yellow hasHat blue hasShirt leftOf move. With this notation, a partial program can be executed left-to-right by maintaining a program stack. At each time step, the executor pushes predicted constants (yellow) to the stack, and applies functions (hasHat) by popping their arguments from the stack and pushing the computed result. Actions (move) are applied by popping arguments from the stack and performing the action in the current world. (Left-to-right decoding is equivalent to bottom-up decoding of the program tree in this case.)
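To make the stack-based execution concrete, here is a minimal sketch of a postfix executor for a toy version of the Scene domain. It is an illustration under simplifying assumptions, not the SCONE executor: the world is a list of positions, the semantics of `hasHat`, `hasShirt`, `leftOf`, and `move` are approximated, and the helper names `find` and `execute_prefix` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Person = Dict[str, str]
World = List[Optional[Person]]  # one slot per position; None means empty
StackItem = Union[str, int]

CONSTANTS = {"yellow", "blue", "red"}  # assumed toy vocabulary of colors


def find(world: World, attribute: str, color: str) -> int:
    """Return the position of the first person whose attribute has this color."""
    for position, person in enumerate(world):
        if person is not None and person.get(attribute) == color:
            return position
    raise ValueError(f"no person with {attribute}={color}")


def execute_prefix(tokens: Sequence[str],
                   world: World) -> Tuple[World, List[StackItem]]:
    """Execute a postfix program prefix left to right; return (world, stack)."""
    world = list(world)  # copy so the caller's world is left untouched
    stack: List[StackItem] = []
    for token in tokens:
        if token in CONSTANTS:
            stack.append(token)  # constants are pushed as-is
        elif token == "hasHat":
            stack.append(find(world, "hat", stack.pop()))
        elif token == "hasShirt":
            stack.append(find(world, "shirt", stack.pop()))
        elif token == "leftOf":
            stack.append(stack.pop() - 1)
        elif token == "move":
            # Actions pop their arguments and modify the world.
            target = stack.pop()
            source = stack.pop()
            world[target], world[source] = world[source], None
            stack.clear()  # the stack is emptied once a command completes
        else:
            raise ValueError(f"unknown token: {token}")
    return world, stack


start: World = [{"hat": "yellow", "shirt": "red"}, None, None,
                {"hat": "red", "shirt": "blue"}]
program = "yellow hasHat blue hasShirt leftOf move".split()
partial_world, partial_stack = execute_prefix(program[:3], start)
final_world, final_stack = execute_prefix(program, start)
```

Every prefix of the program is executable: after the first three tokens the world is unchanged and the stack holds a position and a pending constant, while the full command moves the person with the yellow hat next to the person with the blue shirt.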
Because the input is a sequence of utterances, each mapping to a command, whenever an action is predicted, the model moves to processing the next utterance (and the stack is emptied), until all utterances have been processed. To handle references to previous utterances, we must track previous actions and arguments, and so the executor maintains an execution history, which is a list of previously executed actions and their arguments (the formal language provides functions to query the history). Thus, the execution result of a program prefix includes the current utterance, the current denotation, the program stack, and the execution history.
Figure 2 illustrates an example transition from one state to another in the Scene domain. Importantly, the state does not store the full program generated (only the execution history), and thus many different programs lead to the same state. Next, we describe a search algorithm in the state space of this MDP.
4 Searching in Execution Space
Model improvement relies on generating correct programs given a possibly weak model. Standard beam search explores the space of all program token sequences up to some fixed length. We propose two technical contributions to improve search: (a) we simplify the search problem by searching for correct executions rather than correct programs, and (b) we use the target denotation at training time to better estimate partial program scores in the search space. We describe those next.
4.1 Execution space
Our task can be represented as a search problem in a graph. The space of programs can be formalized as a directed tree $T = (V_T, E_T)$, where vertices $V_T$ represent all program prefixes, and labeled edges $E_T$ represent all prefix continuations: an edge $e = (\tilde{z}, \tilde{z}')$ labeled with the token $a$ represents a continuation of the prefix $\tilde{z}$ with the token $a$ ($\tilde{z}' = \tilde{z} \,\|\, a$), where $\|$ denotes concatenation. The root of the graph represents the empty sequence. Similarly, execution space is a directed graph $G = (V_G, E_G)$ induced from the MDP described in Section 3. Graph vertices $V_G$ represent MDP states, which express execution results, and labeled edges $E_G$ represent transitions. An edge $(s_1, s_2)$ labeled by program token $a$ means that $\delta(s_1, a) = s_2$. Since multiple programs have the same execution result, execution space is a compressed representation of program space: multiple vertices in program space map to a single vertex in execution space. Figure 3 illustrates a set of programs in both program space and execution space.
Each path in execution space represents a different program prefix, and the path’s final state represents its execution result. Program search can therefore be reduced to execution search: given an example $(c, u, y)$ and a model $\pi_\theta$, we can use $\pi_\theta$ to explore in execution space, discover terminal states, i.e., states corresponding to a full program, and extract paths that lead to those states.
Clearly, the advantage of this reduction is that as execution space is smaller, the search problem becomes easier. Moreover, we can score each state based on many paths that lead to that node rather than based on a single path only.
Our approach is similar to the DPD algorithm Pasupat and Liang (2016), where CKY-style search is performed in denotation space, followed by search in a pruned space of programs for question answering. However, this search method was used without learning, and so the search was not guided by a trained model, which is a major part of our algorithm as we describe next.
4.2 Value-based Beam Search in Execution Space
We propose Value-based Beam Search in eXecution space (VBSiX), a variant of beam search modified for searching in execution space.
Standard beam search is a breadth-first traversal of the program space tree, where a fixed number of vertices are kept in the beam at every level of the tree. The selection of vertices is done by scoring their corresponding prefixes according to $p_\theta(z_{1 \dots t})$. The same traversal can be applied in execution space as well. However, since each vertex in execution space represents an execution result and not a particular prefix, we need to extend the scoring function to estimate the model probability for a vertex in execution space.
Let $s$ be a vertex discovered in iteration $t$ of the search. The state $s$ will be scored according to $p^t_\theta(s)$, the probability of reaching vertex $s$ after $t$ iterations according to the model $\pi_\theta$ (we score paths in different iterations independently to avoid bias towards shorter paths; an MDP state that appears in multiple iterations will get a different score in each iteration). This probability is the sum of probabilities of all prefixes of length $t$ that reach $s$:
$$p^t_\theta(s) = \sum_{z_{1 \dots t} \,:\, [\![z_{1 \dots t}]\!]^{ex}_c = s} p_\theta(z_{1 \dots t}).$$
VBSiX approximates $p^t_\theta(s)$ by performing this computation over the states in the beam rather than all states, which can be done efficiently using dynamic programming (see Algorithm 1). This is a lower bound on the true model probability for this state, since there might be prefixes of length $t$ that reach $s$ that were not discovered.
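The dynamic-programming update can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: `expand_actor_scores`, `toy_transition`, and `uniform_policy` are hypothetical names, and states are plain integers standing in for execution results.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable

State = Hashable


def expand_actor_scores(chart: Dict[State, float],
                        actions: Iterable[str],
                        policy: Callable[[State, str], float],
                        transition: Callable[[State, str], State]
                        ) -> Dict[State, float]:
    """One DP step: map scores of states at step t-1 to scores at step t.

    The score of a new state is the sum, over all (state, action) pairs on
    the beam that lead to it, of the predecessor score times the action
    probability. Prefixes with the same execution result are merged.
    """
    next_chart: Dict[State, float] = defaultdict(float)
    for state, score in chart.items():
        for action in actions:
            prob = policy(state, action)
            if prob > 0.0:
                next_chart[transition(state, action)] += score * prob
    return dict(next_chart)


def toy_transition(state: int, action: str) -> int:
    # Toy executor: "inc" adds one, anything else doubles the state.
    return state + 1 if action == "inc" else state * 2


def uniform_policy(state: int, action: str) -> float:
    return 0.5


ACTIONS = ["inc", "double"]
chart_0 = {1: 1.0}  # the empty program has probability 1
chart_1 = expand_actor_scores(chart_0, ACTIONS, uniform_policy, toy_transition)
chart_2 = expand_actor_scores(chart_1, ACTIONS, uniform_policy, toy_transition)
```

In the toy run, both one-token prefixes ("inc" and "double") execute to the state 2, so that state receives the combined probability of the two prefixes instead of occupying two beam slots.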
Searching in execution space has significant advantages over standard beam search. First, since each vertex in execution space compactly represents multiple prefixes, a beam in VBSiX effectively holds more prefixes than standard beam search. Second, running search over a graph rather than a tree is less greedy, because the same vertex can surface back even if it fell out of the beam. Third, the scoring function of VBSiX is an aggregation over multiple paths in program space and therefore scores for states are more robust.
4.3 Value-based Ranker
Above, we scored a vertex $s$ based on the sum of prefix probabilities leading to $s$. Alternatively, $s$ can be scored based on the sum of suffix probabilities from $s$ to correct terminal states, i.e., the expected reward of $s$:
$$V(s) = \sum_{\tau(s)} p_\theta(\tau(s)) \cdot R(\tau(s)),$$
where $\tau(s)$ are all possible trajectories starting from $s$ and $R(\tau(s))$ is the reward observed when taking the trajectory $\tau(s)$ from $s$. Enumerating all trajectories is intractable and so we will approximate $V(s)$ with a trained value network $V_\phi(s)$ parameterized by $\phi$.
This value-based approach has several advantages: First, evaluating the probability of outgoing trajectories provides look-ahead that is missing from standard beam search. Second (and importantly), because we are focusing on search at training time, we can use the target denotation as input to the value network and define it as $V_\phi(s, y)$. This lets the value network compare the denotation in the current state to the target denotation and get a better estimate for expected reward, especially in the later utterances of the sequence. This is similar in spirit to the actor-critic model for sequence-to-sequence proposed by Bahdanau et al. (2017), where the full output sequence is observed, and to Suhr and Artzi (2018), who used the denotation to define a reward function.
Following Bahdanau et al. (2017), we call the probability $p^t_\theta(s)$ of a state after $t$ steps the actor score, and the expected reward estimate $V_\phi(s, y)$ the critic score. We define the actor-critic scoring function as the sum of both scores. While conceptually the critic score is a natural scoring function in VBSiX, the two contributions are orthogonal: the critic score can be used in program space, and VBSiX can use the actor score only.
Figure 4 visualizes scoring and pruning the beam with the actor-critic scoring function in iteration $t$. Vertices in the beam $B_t$ are discovered by expanding vertices in $B_{t-1}$, and each vertex is ranked by the sum of the actor and critic scores. In the figure, the score of the highlighted vertex is the sum of its actor score and its critic score. The actor score is the sum of the probabilities of its discovered incoming prefixes, and the critic score is a value network estimate of the sum of probabilities of outgoing trajectories reaching correct terminal states. Only the top-$K$ states are kept in the beam.
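The pruning step can be illustrated with a short sketch. The state names and all scores below are made-up values chosen for the example (they are not the numbers shown in Figure 4), and `prune_beam` is a hypothetical helper.

```python
from typing import Callable, Dict, Hashable, List, Optional

State = Hashable


def prune_beam(actor_scores: Dict[State, float],
               beam_size: int,
               critic: Optional[Callable[[State], float]] = None
               ) -> List[State]:
    """Keep the top-K states, ranked by actor score plus optional critic score."""
    def score(state: State) -> float:
        bonus = critic(state) if critic is not None else 0.0
        return actor_scores[state] + bonus

    return sorted(actor_scores, key=score, reverse=True)[:beam_size]


# Illustrative numbers only: the actor prefers s1, but the critic, which
# sees the target denotation, estimates that s3 is far more likely to
# reach a correct terminal state.
actor = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
critic_estimates = {"s1": 0.0, "s2": 0.25, "s3": 0.5}

actor_only_beam = prune_beam(actor, beam_size=2)
actor_critic_beam = prune_beam(actor, beam_size=2, critic=critic_estimates.get)
```

With the actor score alone, `s3` falls off the beam. Adding the critic score keeps it and drops `s1`, whose estimated future reward is zero.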
Algorithm 1 summarizes our final proposed search algorithm at training time: VBSiX with an actor-critic scoring function. At a high level, the function ProgramSearch() uses the VBSiX procedure to construct a small search graph and a set of terminal states. Then, standard beam search (in program space, with actor score) is applied on this small graph to extract a set of programs (lines 2-4). Full details of program extraction from the search graph are given in Appendix C. The programs are used to update the model and the value network (Section 5).
The VBSiX function receives as input a context, an utterance sequence, a target denotation, an actor model, and a value network, and returns a search graph and a set of terminal states. VBSiX begins exploration from the state of the empty program (lines 7-9). At every step $t$, the algorithm constructs the beam, a set of states that can be reached from the initial state in $t$ steps; a dynamic programming (DP) chart mapping each state to its actor score in iteration $t$; and the search graph (lines 10-21). The DP chart and the graph are updated for every newly discovered state (lines 18-19), and terminal states are added to the set of terminal states (line 16). The beam is scored and pruned according to the sum of the actor score and the value function (line 24). Since the value function and DP chart are used for efficient ranking, the asymptotic run-time complexity of VBSiX is the same as that of standard beam search. The beam search in the path-extraction step (line 3) can be done with a small beam size, since it operates over a small graph, and so its contribution to algorithm complexity is negligible.
We train the model and value network jointly, where the model is trained using MML as described in Section 2, and the value network is trained as we explain next (Algorithm 2). Given a training example, we first generate a set of programs with VBSiX. We then construct a set of training examples for the value network, where each example labels a state encountered while generating programs with the probability mass of the correct program suffixes that extend it. Finally, we train the value network to maximize the objective:
Similar to the estimation of the actor score, labeling examples for the value network is affected by beam-search errors: the labels are a lower bound for the true expected reward, since correct programs might fall off the beam. However, since search is guided by the model, those programs are likely to have low probability and their contribution to the true expected reward is likely negligible. Moreover, estimates from the value network are based on training over multiple examples, while model probability estimates are based on a DP chart for one particular example. This makes the value network more robust to beam search errors.
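The labeling scheme can be sketched as below. This is one possible reading of the description, not Algorithm 2: the trajectory format `(state, token, probability)` and the function `value_labels` are assumptions, as is the choice to count a suffix only once when several discovered programs share it from the same state.

```python
from typing import Dict, Hashable, List, Set, Tuple

State = Hashable
Step = Tuple[State, str, float]  # (state before the token, token, model probability)
Key = Tuple[int, State]          # (time step, state)


def value_labels(programs: List[Tuple[List[Step], bool]]) -> Dict[Key, float]:
    """Label each visited (step, state) with the mass of correct suffixes found.

    Each program is a list of steps plus a flag saying whether it executes to
    the target denotation. States seen only on incorrect programs get label 0.
    """
    labels: Dict[Key, float] = {}
    seen: Set[Tuple[Key, Tuple[str, ...]]] = set()
    for steps, is_correct in programs:
        suffix_prob = 1.0
        suffix_tokens: Tuple[str, ...] = ()
        # Walk backwards so the suffix probability can be accumulated.
        for t in range(len(steps) - 1, -1, -1):
            state, token, prob = steps[t]
            suffix_prob *= prob
            suffix_tokens = (token,) + suffix_tokens
            key = (t, state)
            labels.setdefault(key, 0.0)
            if is_correct and (key, suffix_tokens) not in seen:
                seen.add((key, suffix_tokens))
                labels[key] += suffix_prob
    return labels


# Toy search result: two correct programs reach state "s1" and share the
# suffix "c"; a third program continues with "d" and is incorrect.
found = [
    ([("s0", "a", 0.5), ("s1", "c", 0.5)], True),
    ([("s0", "b", 0.5), ("s1", "c", 0.5)], True),
    ([("s0", "a", 0.5), ("s1", "d", 0.5)], False),
]
labels = value_labels(found)
```

Because only programs discovered by search contribute, each label is a lower bound on the true expected reward, as noted above.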
Neural network architecture:
We adapt the model proposed by Guu et al. (2017) for SCONE. The model receives the current utterance and the program stack, and returns a distribution over the valid next tokens. Our value network receives the same input, but also the next utterance, the current world state, and the target world state, and outputs a scalar. Appendix A provides a full description of the architecture.
Table 1: Test accuracy and standard deviation of VBSiX compared to multiple baselines. We evaluate the same model over the first 3 and 5 utterances in each domain. (Columns: 3 utt and 5 utt for each of the three domains.)
Table 2 (columns: Search space, Value, then 3 utt and 5 utt for each of the three domains).
6.1 Experimental setup
We evaluate our method on the three domains of SCONE, evaluating with the standard accuracy metric, i.e., the proportion of test examples where the program generated by the model produced the correct denotation. We train with VBSiX, and use standard beam search at test time to generate a set of programs and pick the one with highest model probability. Each test example contains 5 utterances, and similar to prior work we report the accuracy of each model on all 5 utterances as well as the first 3 utterances. We run each experiment 6 times with different random seeds and report the average accuracy and standard deviation.
In contrast to prior work on SCONE Long et al. (2016); Guu et al. (2017); Suhr and Artzi (2018), where models were trained on all sequences of 1 or 2 utterances, and thus were exposed during training to all gold intermediate states, we train from longer sequences keeping intermediate states latent. This leads to a harder search problem that was not addressed previously, but makes our results incomparable to previous results. In Scene and Tangram, we used the first 4 and 5 utterances as examples. In Alchemy, we used the first utterance and the full 5 utterances.
To warm-start the value network, we train for a few thousand steps, and only then start re-ranking with its predictions. Moreover, we gain efficiency by first returning 128 states with the actor score only, and then re-ranking those with the actor-critic score and returning 32 states. Last, we use the value network only in the last two utterances of every example since we found it has less effect in earlier utterances where future uncertainty is large.
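The two-stage pruning described here can be sketched as follows. The sizes 128 and 32 come from the text; the function `two_stage_prune` and the toy scores are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

State = Hashable


def two_stage_prune(actor_scores: Dict[State, float],
                    critic: Callable[[State], float],
                    shortlist_size: int = 128,
                    beam_size: int = 32) -> List[State]:
    """Shortlist by actor score, then re-rank the shortlist by actor + critic."""
    # Stage 1: cheap ranking with the actor score from the DP chart.
    shortlist = sorted(actor_scores, key=actor_scores.get,
                       reverse=True)[:shortlist_size]
    # Stage 2: the value network is evaluated on the shortlist only.
    critic_scores = {state: critic(state) for state in shortlist}
    reranked = sorted(shortlist,
                      key=lambda s: actor_scores[s] + critic_scores[s],
                      reverse=True)
    return reranked[:beam_size]


# Toy run with small sizes; the critic records which states it was asked about.
critic_calls: List[str] = []
toy_values = {"a": 0.0, "b": 0.3, "c": 0.5, "d": 0.9}


def toy_critic(state: str) -> float:
    critic_calls.append(state)
    return toy_values[state]


toy_beam = two_stage_prune({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1},
                           toy_critic, shortlist_size=3, beam_size=2)
```

The trade-off is visible in the toy run: state `d` has the highest critic value but is never evaluated, because it does not survive the actor-only shortlist.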
We evaluate the following models:
MML: Our main baseline, where search is done with beam search and training with MML. We use randomized beam search, which adds $\epsilon$-greedy exploration to beam search, as proposed by Guu et al. (2017), since it performed better. (We did not include meritocratic updates Guu et al. (2017), since they performed worse in initial experiments.)
VBSiX: Our full model that ranks states with an actor-critic score and searches in execution space.
The full list of hyper-parameters, such as beam size and learning rate, is reported in Appendix B.
Table 1 reports test accuracy of VBSiX compared to the baselines. First, VBSiX outperforms all baselines in all cases. Similar to recent findings (http://www.argmin.net/2018/02/20/reinforce/), REINFORCE is unable to find good programs to bootstrap from and training fails. MML performs much better, especially on Alchemy and Tangram, but is still almost 30 points lower than VBSiX on average.
On top of the improvement in accuracy, the standard deviation of VBSiX is lower than that of the baselines across the 6 random seeds, showing the robustness of our model. One exception is the Scene domain, where the language is more complex and absolute accuracies are lower. There, the performance of the other baselines is quite low, and consequently so is their standard deviation.
We perform ablation tests to examine the benefit of our two main technical contributions: (a) execution space and (b) value-based beam search. Table 2 presents the accuracy results on the validation set when each component is used separately, when both of them are used (VBSiX), and when neither is used (beam search). We find that both contributions are important for the final performance, as the full system achieves the highest accuracy across all domains. In Scene, each component separately has only a slight advantage over beam search, and therefore both components are required to achieve significant improvement. However, in Alchemy and Tangram most of the gain is due to the value network.
In addition to validation accuracy, we can also directly measure hit accuracy at training time, i.e., the proportion of training examples where the beam produced by the search algorithm contains a program with the correct denotation. This measures the effectiveness of search at training time directly. In Figure 5, we show the train hit accuracy in each training step, averaged across the 6 random seeds. The graphs illustrate the performance of each search algorithm in every domain, and the improvement of the model during training. We observe that validation accuracy results are well-correlated with the improvement in hit accuracy, showing that the better performance can be mostly attributed to the search algorithm.
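As a small illustration of the metric, hit accuracy can be computed as below. The function `hit_accuracy` and the toy executor (which simply sums a tuple of numbers) are hypothetical stand-ins for the search output and the SCONE executor.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Program = Tuple[int, ...]


def hit_accuracy(beams: Sequence[List[Program]],
                 targets: Sequence[Hashable],
                 execute: Callable[[Program], Hashable]) -> float:
    """Fraction of examples whose beam contains at least one correct program."""
    hits = sum(
        any(execute(program) == target for program in beam)
        for beam, target in zip(beams, targets)
    )
    return hits / len(targets)


# Toy data: three training examples, one beam of candidate programs each.
toy_beams: List[List[Program]] = [[(1, 2), (2, 2)], [(0, 1)], []]
toy_targets = [4, 5, 0]
toy_hit_accuracy = hit_accuracy(toy_beams, toy_targets, execute=sum)
```

Only the first example counts as a hit: its beam contains a program that executes to the target, whereas the second beam holds only an incorrect program and the third is empty.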
We analyze the ability of the value network to predict expected reward. The reward of a state depends on two properties: (a) connectivity, i.e., whether there is a trajectory from this state to a correct terminal state, and (b) model likelihood, i.e., the probability the model assigns to those trajectories. We collected a random set of 120 states in the Scene domain where the real expected reward was either very high or very low, and where the value network predicted either well (small deviation) or poorly (large deviation). For ease of analysis we only look at states from the final utterance.
To analyze connectivity, we looked at states that cannot reach a correct terminal state with a single action (since states in the last utterance can perform one action only, the expected reward is 0). Those are states where either the current and target worlds differ in too many ways, or the stack content is not relevant to the differences between the worlds. We find that when there are many differences between the current and target world, the value network tends to correctly estimate low expected reward. However, when there is just one mismatch between the current and target world, the value network tends to ignore it and erroneously predicts high reward in 78.9% of the cases.
To analyze whether the value network can predict the success of the trained policy, we consider states from which there is an action that leads to the target world. While it is challenging to fully interpret the value network, we notice that the network predicts a high value in 86.1% of the cases where the number of people in the world is no more than 2, and a low value in 82.1% of the cases where the number of people in the world is more than 2. This indicates that the value network believes more complex worlds, involving many people, are harder for the policy.
7 Related Work
Training from denotations has been extensively investigated in both question answering Clarke et al. (2010); Liang et al. (2011); Berant et al. (2013); Kwiatkowski et al. (2013); Pasupat and Liang (2015) and instruction mapping Branavan et al. (2009); Artzi and Zettlemoyer (2013); Bisk et al. (2016), with a recent emphasis on neural encoder-decoder models Guu et al. (2017); Cheng et al. (2017); Neelakantan et al. (2016); Krishnamurthy et al. (2017); Liang et al. (2017, 2018); Goldman et al. (2018). Understanding instructions has been studied with the SAIL corpus MacMahon et al. (2006); Chen and Mooney (2011); Andreas and Klein (2015); Mei et al. (2016) and recently with other environments, focusing on single sentence instructions Janner et al. (2018); Misra et al. (2018); Anderson et al. (2018); Tan and Bansal (2018).
Tackling the limitations of beam search has been investigated recently by proposing objectives suitable for beam search Wiseman and Rush (2016), by using continuous relaxations that afford differentiability Goyal et al. (2018), and by developing specialized stopping criteria Yang et al. (2018).
Similar to our work, Bahdanau et al. (2017) and Suhr and Artzi (2018) proposed ways to evaluate the predictions at intermediate steps from a sparse reward signal. Bahdanau et al. (2017) used a critic network to estimate the expected BLEU score in translation, while Suhr and Artzi (2018) used the edit distance between the current world state and the goal world state in the SCONE task. However, in those works stronger supervision was assumed: Bahdanau et al. (2017) utilized the gold sequences, and Suhr and Artzi (2018) used intermediate world states. Moreover, in their work intermediate evaluations were used to compute the gradient updates, rather than for guiding search.
Guiding search with both policy and value networks was done in recent work on Monte-Carlo Tree Search (MCTS) for sparse-reward tasks Silver et al. (2017); Anthony et al. (2017); Shen et al. (2018). In MCTS, the value network evaluations are refined with backup updates in order to gain an advantage over the policy scores. However, in our implementation we gain this advantage by simply feeding the denotation to the value network. We also note that the additive interpolation between an actor and a critic is reminiscent of the A* algorithm, where states are scored by adding past cost and an admissible heuristic for future cost Klein and Manning (2003); Pauls and Klein (2009); Lee et al. (2016).
In this work, we propose a new training algorithm for mapping instructions to programs given denotation supervision only. Our algorithm exploits the denotation at training time to train a critic network that is used to rank search states on the beam, and performs search in a compact space of execution results rather than in the space of programs. We evaluate our algorithm on three different domains from the SCONE dataset, and find that it dramatically improves performance compared to strong baselines across all domains.
- Anderson et al. (2018) P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Computer Vision and Pattern Recognition (CVPR).
- Andreas and Klein (2015) J. Andreas and D. Klein. 2015. Alignment-based compositional semantics for instruction following. In Empirical Methods in Natural Language Processing (EMNLP).
- Artzi and Zettlemoyer (2013) Y. Artzi and L. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.
- Bahdanau et al. (2017) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. 2017. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations (ICLR).
- Berant et al. (2013) J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).
- Bisk et al. (2016) Y. Bisk, D. Yuret, and D. Marcu. 2016. Natural language communication with robots. In North American Association for Computational Linguistics (NAACL).
- Branavan et al. (2009) S. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 82–90.
- Chen and Mooney (2011) D. L. Chen and R. J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Association for the Advancement of Artificial Intelligence (AAAI), pages 859–865.
- Cheng et al. (2017) J. Cheng, S. Reddy, V. Saraswat, and M. Lapata. 2017. Learning structured natural language representations for semantic parsing. In Association for Computational Linguistics (ACL).
- Clarke et al. (2010) J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27.
- Goldman et al. (2018) O. Goldman, V. Latcinnik, U. Naveh, A. Globerson, and J. Berant. 2018. Weakly-supervised semantic parsing with abstract examples. In Association for Computational Linguistics (ACL).
- Goyal et al. (2018) K. Goyal, G. Neubig, C. Dyer, and T. Berg-Kirkpatrick. 2018. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Association for the Advancement of Artificial Intelligence (AAAI).
- Guu et al. (2017) K. Guu, P. Pasupat, E. Z. Liu, and P. Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).
- Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Janner et al. (2018) M. Janner, K. Narasimhan, and R. Barzilay. 2018. Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics (TACL), 6.
- Kingma and Ba (2014) D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Klein and Manning (2003) D. Klein and C. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).
- Krishnamurthy et al. (2017) J. Krishnamurthy, P. Dasigi, and M. Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Empirical Methods in Natural Language Processing (EMNLP).
- Krishnamurthy and Mitchell (2012) J. Krishnamurthy and T. Mitchell. 2012. Weakly supervised training of semantic parsers. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 754–765.
- Kwiatkowski et al. (2013) T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).
- Lee et al. (2016) K. Lee, M. Lewis, and L. Zettlemoyer. 2016. Global neural CCG parsing with optimality guarantees. In Empirical Methods in Natural Language Processing (EMNLP).
- Liang et al. (2017) C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Association for Computational Linguistics (ACL).
- Liang et al. (2018) C. Liang, M. Norouzi, J. Berant, Q. Le, and N. Lao. 2018. Memory augmented policy optimization for program synthesis with generalization. In Advances in Neural Information Processing Systems (NIPS).
- Liang et al. (2011) P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.
- Long et al. (2016) R. Long, P. Pasupat, and P. Liang. 2016. Simpler context-dependent logical forms via model projections. In Association for Computational Linguistics (ACL).
- MacMahon et al. (2006) M. MacMahon, B. Stankiewicz, and B. Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. In National Conference on Artificial Intelligence.
- Mei et al. (2016) H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Association for the Advancement of Artificial Intelligence (AAAI).
- Misra et al. (2018) D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi. 2018. Mapping instructions to actions in 3D environments with visual goal prediction. In Empirical Methods in Natural Language Processing (EMNLP).
- Neelakantan et al. (2016) A. Neelakantan, Q. V. Le, and I. Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations (ICLR).
- Pasupat and Liang (2015) P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).
- Pasupat and Liang (2016) P. Pasupat and P. Liang. 2016. Inferring logical forms from denotations. In Association for Computational Linguistics (ACL).
- Pauls and Klein (2009) A. Pauls and D. Klein. 2009. K-best A* parsing. In Association for Computational Linguistics (ACL), pages 958–966.
- Pennington et al. (2014) J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Shen et al. (2018) Y. Shen, J. Chen, P. Huang, Y. Guo, and J. Gao. 2018. ReinforceWalk: Learning to walk in graph with Monte Carlo tree search. In International Conference on Learning Representations (ICLR).
- Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359.
- Suhr and Artzi (2018) A. Suhr and Y. Artzi. 2018. Situated mapping of sequential instructions to actions with single-step reward observation. In Association for Computational Linguistics (ACL).
- Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.
- Sutton et al. (1999) R. Sutton, D. McAllester, S. Singh, and Y. Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS).
- Anthony et al. (2017) T. Anthony, Z. Tian, and D. Barber. 2017. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems (NIPS).
- Tan and Bansal (2018) H. Tan and M. Bansal. 2018. Source-target inference models for spatial instruction understanding. In Association for the Advancement of Artificial Intelligence (AAAI).
- Vogel and Jurafsky (2010) A. Vogel and D. Jurafsky. 2010. Learning to follow navigational directions. In Association for Computational Linguistics (ACL), pages 806–814.
- Williams (1992) R. J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256.
- Wiseman and Rush (2016) S. Wiseman and A. M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Empirical Methods in Natural Language Processing (EMNLP).
- Yang et al. (2018) Y. Yang, L. Huang, and M. Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP).
Appendix A Neural Network Architecture
We adopt the model proposed by Guu et al. (2017). The model receives the current utterance and the program stack. A bidirectional LSTM Hochreiter and Schmidhuber (1997) is used to embed the utterance, while the program stack is embedded by concatenating the embeddings of its elements. The embedded input is then fed to a feed-forward network with attention over the LSTM hidden states, followed by a softmax layer that predicts a program token. Our value network shares the input layer of the policy network. In addition, it receives the next utterance, the current world state, and the target world state. The next utterance is embedded with an additional BiLSTM, and world states are embedded by concatenating embeddings of SCONE elements. The inputs are concatenated and fed to a feed-forward network, followed by a sigmoid layer that outputs a scalar.
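The value head described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the four inputs have already been embedded as plain vectors, uses a single hidden layer with a ReLU activation (the text does not specify the depth or the non-linearity), and all names and dimensions (`value_head`, `dims`, `params`) are hypothetical.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    # Assumed non-linearity; the paper does not name one.
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def value_head(shared, next_utt, world, target, params):
    """Concatenate the embedded inputs, apply a feed-forward layer, then a sigmoid."""
    x = shared + next_utt + world + target  # list concatenation of the embeddings
    hidden = relu(linear(x, *params["hidden"]))
    (logit,) = linear(hidden, *params["out"])
    return sigmoid(logit)  # scalar in (0, 1)

rng = random.Random(0)
# Hypothetical embedding sizes for: shared input layer, next utterance,
# current world state, target world state.
dims = {"shared": 4, "next_utt": 3, "world": 5, "target": 5}
params = {
    "hidden": init_layer(rng, sum(dims.values()), 8),
    "out": init_layer(rng, 8, 1),
}
inputs = {name: [rng.uniform(-1, 1) for _ in range(n)] for name, n in dims.items()}
value = value_head(inputs["shared"], inputs["next_utt"],
                   inputs["world"], inputs["target"], params)
```

The sigmoid keeps the output in (0, 1), which suits an estimate of expected future reward when the reward is a binary indicator of reaching the target state.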
Appendix B Hyper-parameters
Table 3 contains the hyper-parameter settings for each experiment. Hyper-parameters of REINFORCE and MML were taken from Guu et al. (2017). In all experiments, the learning rate was 0.001 and the mini-batch size was 8. We explicitly define the following hyper-parameters, which are not self-explanatory:
Sample size: Number of samples drawn from the model in REINFORCE.
Baseline: A constant subtracted from the reward for variance reduction.
Execution beam size: The beam size in Algorithm 1.
Value ranking start step: Step when we start ranking states using the critic score.
Value re-rank size: Size of beam returned by the actor score before re-ranking with the actor-critic score.
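To illustrate how the last two hyper-parameters interact, here is a small sketch of one beam-pruning step. It is an assumption-laden illustration rather than the procedure of Algorithm 1: the actor-critic score is taken to be the sum of the actor score and the critic value, and the names `State`, `rank_beam`, `start_step`, and `rerank_size` are hypothetical.

```python
from typing import List, NamedTuple

class State(NamedTuple):
    name: str
    actor_score: float   # score assigned by the actor (policy)
    critic_value: float  # critic estimate of expected future reward

def rank_beam(states: List[State], step: int, beam_size: int,
              rerank_size: int, start_step: int) -> List[State]:
    """Prune candidate states to `beam_size` for the given decoding step."""
    by_actor = sorted(states, key=lambda s: s.actor_score, reverse=True)
    if step < start_step:
        # Before the value ranking start step, only the actor score is used.
        return by_actor[:beam_size]
    # Keep the top `rerank_size` states by actor score, then re-rank them.
    shortlist = by_actor[:rerank_size]

    def actor_critic(s: State) -> float:
        # Assumed combination rule: a plain sum of the two scores.
        return s.actor_score + s.critic_value

    return sorted(shortlist, key=actor_critic, reverse=True)[:beam_size]

candidates = [
    State("a", 0.50, 0.1),
    State("b", 0.30, 0.9),
    State("c", 0.15, 0.2),
    State("d", 0.05, 1.0),
]
early = rank_beam(candidates, step=0, beam_size=2, rerank_size=3, start_step=2)
late = rank_beam(candidates, step=2, beam_size=2, rerank_size=3, start_step=2)
print([s.name for s in early])  # ['a', 'b']
print([s.name for s in late])   # ['b', 'a']
```

In this toy example, state `d` has the highest critic value but never reaches the re-ranking stage, because it falls outside the top `rerank_size` states by actor score.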
Appendix C Programs’ Extraction
Execution search produces a search graph. The paths in this graph that lead to terminal states represent the discovered programs. A program’s correctness is determined by the correctness of its terminal state, i.e., whether the state’s world matches the target world.
We extract correct and incorrect programs separately. Correct programs are built with standard beam search (guided by our model) over prefixes in the search graph whose states are connected to correct terminal states. The search is therefore restricted to the space of correct programs that were found. Incorrect programs are built by selecting, for each incorrect terminal state in the search graph, a single path that leads to it.
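The extraction step can be sketched as follows, under several assumptions that are not taken from the paper: the search graph is a small hand-written dictionary, each edge carries a made-up model log-probability, and the single path to each incorrect terminal state is simply the first one found by breadth-first search (the text does not say how that path is chosen). All names (`GRAPH`, `can_reach`, `correct_programs`, `one_path_to`) are hypothetical.

```python
from collections import deque

# Hypothetical search graph: state -> list of (program_token, next_state, log_prob).
GRAPH = {
    "s0": [("A", "s1", -0.2), ("B", "s2", -1.8)],
    "s1": [("C", "good", -0.5), ("D", "bad", -0.9)],
    "s2": [("E", "good", -0.1), ("F", "bad", -2.0)],
}
CORRECT = {"good"}   # terminal states whose world matches the target world
INCORRECT = {"bad"}  # terminal states whose world does not match

def can_reach(graph, targets):
    """Return the states from which some path leads to a state in `targets`."""
    reach = set(targets)
    changed = True
    while changed:
        changed = False
        for state, edges in graph.items():
            if state not in reach and any(nxt in reach for _, nxt, _ in edges):
                reach.add(state)
                changed = True
    return reach

def correct_programs(graph, root, correct, beam_size):
    """Beam search restricted to prefixes that can still reach a correct terminal."""
    allowed = can_reach(graph, correct)
    beam, done = [((), root, 0.0)], []
    while beam:
        expanded = []
        for tokens, state, score in beam:
            if state in correct:
                done.append((tokens, score))
                continue
            for tok, nxt, log_prob in graph.get(state, []):
                if nxt in allowed:
                    expanded.append((tokens + (tok,), nxt, score + log_prob))
        beam = sorted(expanded, key=lambda h: h[2], reverse=True)[:beam_size]
    return sorted(done, key=lambda p: p[1], reverse=True)

def one_path_to(graph, root, target):
    """Return the tokens of a single root-to-target path (first found by BFS)."""
    queue, seen = deque([(root, ())]), {root}
    while queue:
        state, tokens = queue.popleft()
        if state == target:
            return tokens
        for tok, nxt, _ in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, tokens + (tok,)))
    return None

correct = correct_programs(GRAPH, "s0", CORRECT, beam_size=2)
incorrect = {t: one_path_to(GRAPH, "s0", t) for t in INCORRECT}
print([tokens for tokens, _ in correct])  # [('A', 'C'), ('B', 'E')]
print(incorrect)                          # {'bad': ('A', 'D')}
```

Because several programs can lead to the same execution state, the incorrect terminal state `bad` is reachable by two programs here, yet only one of them is kept.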