
The Fast and the Flexible: training neural networks to learn to follow instructions from small data

Learning to follow human instructions is a challenging task because, while interpreting instructions requires discovering arbitrary algorithms, humans typically provide very few examples to learn from. For learning from this data to be possible, strong inductive biases are necessary. Past work has relied on hand-coded components or manually engineered features to provide such biases. In contrast, here we seek to establish whether this knowledge can be acquired automatically by a neural network system through a two-phase training procedure: a (slow) offline learning stage where the network learns about the general structure of the task, and a (fast) online adaptation phase where the network learns the language of a new given speaker. Controlled experiments show that when the network is exposed to familiar instructions containing novel words, the model adapts very efficiently to the new vocabulary. Moreover, even for human speakers whose language usage can depart significantly from our artificial training language, our network can still make use of its automatically acquired inductive bias to learn to follow instructions more effectively.





1 Introduction

Learning to follow instructions from human speakers is a long-pursued goal in artificial intelligence that can be traced back at least to Terry Winograd's work on SHRDLU [Winograd1972]. This system was capable of interpreting and following human instructions written in natural language and issued in a world composed of geometric figures. While this first system relied on a set of hand-coded rules to process natural language, most work that followed aimed at using machine learning to map linguistic utterances into their semantic interpretations [Chen and Mooney2011, Artzi and Zettlemoyer2013, Andreas and Klein2015, Wang, Liang, and Manning2016]. Earlier work assumed that all users would speak the same natural language, and thus the systems could be trained offline once and for all. However, Wang, Liang, and Manning (2016) recently departed from this assumption by proposing SHRDLURN, a language game in which users can issue instructions in any arbitrary language to a system that must incrementally learn that language. In this game, users communicate linguistic instructions to a learning system in order to manipulate piles of coloured blocks (see Figure 1 for an example). Importantly, each user can use any type of language (natural or not) to communicate with the system.

One constraint that makes this learning problem particularly challenging is that human users typically provide only a handful of examples for the system to learn from. This requires the learning algorithms to incorporate strong inductive biases in order to learn effectively. That is, they need to complement the scarce input with priors that help the model make the right inferences even in the absence of positive data. One way of giving the models a powerful inductive bias is to hand-code features or operations that are specific to the domain in which the instructions must be interpreted. For example, Wang, Liang, and Manning (2016) propose a log-linear semantic parser that crucially relies on a set of hand-coded functional primitives. While effective, this strategy severely curtails the portability of a system: for every new domain, human technical expertise is required to adapt the system. Instead, we would like these inductive biases to be learned automatically, without human intervention.

Figure 1: Illustration of the SHRDLURN task of [Wang, Liang, and Manning2016]

In this paper, we introduce a neural network system that learns domain-specific rules directly from data. This system can then be used to quickly learn the language of new users online. It uses a two-phase regime: first, the network is trained offline on artificially created data, which is cheap to produce, to learn the mechanics of a given task; next, the network is deployed to real human users who train it online with just a handful of examples.

The work in this paper is organized as follows. We first create a large artificially generated dataset to train the systems in the offline phase. We then experiment with different neural network architectures to find which general learning system adapts best to this task. Next, we propose how to adapt this network by training it online, and demonstrate its effectiveness at recovering the meaning of scrambled words and at learning to process the language of human users, using the dataset introduced by Wang, Liang, and Manning (2016). Our experiments show that our system can quickly learn to recover the meaning of words from their usage. Moreover, we demonstrate that the offline training phase allows it to recover some of the inductive biases that the log-linear semantic parser was manually endowed with. This allows the model to learn more effectively than a neural network system that did not go through this pre-training phase.

2 Related Work

Learning to follow human natural language instructions has a long tradition in NLP, dating back at least to the work of Terry Winograd [Winograd1972], who developed a rule-based system for this endeavour. Subsequent work centered around automatically learning the rules to process language [Shimizu and Haas2009, Chen and Mooney2011, Artzi and Zettlemoyer2013, Vogel and Jurafsky2010, Andreas and Klein2015]. This line of work assumes that all users speak the same language, and thus a system can be trained on a set of dialogs pertaining to some speakers and then generalize to new ones.

Instead, Wang, Liang, and Manning (2016) describe a block manipulation game in which a system must learn to follow natural language instructions produced by human users, using the correct outcome of each instruction as feedback. What distinguishes this from other work is that every user can speak in their own (natural or invented) language. For the game to remain engaging, the system needs to adapt quickly to the user's language, thus requiring a system that can learn much faster from small data. The system they propose is composed of a set of hand-coded functions that can manipulate the state of the block piles and a log-linear learning model that maps n-gram features of the linguistic instructions to expressions in this programming language. Our work departs from theirs in that we provide no hand-coded primitives to solve this task, but instead aim at learning an end-to-end system that follows natural language instructions from human feedback.

Another line of research closely related to ours is that of fast mapping [Lake et al.2011, Trueswell et al.2013, Herbelot and Baroni2017], where the goal is to acquire a new concept from a single example of its usage in context. While we do not aim at learning new concepts here, we do want to learn, from few examples, to draw an analogy between a new term and a previously acquired concept.

Our work can be seen as an instance of the transfer learning paradigm [Pan and Yang2010], which has been successful in both linguistic [Mikolov et al.2013, Peters et al.2018] and visual processing [Oquab et al.2014]. Rather than transferring knowledge from one task to another, we transfer between artificial and natural data.

3 Method

A model aimed at following natural language instructions must master at least two skills. First, it needs to process the language of the human user. Second, it must act on the target domain in sensible ways (and not attempt actions that a human user would probably never ask for).

Whereas the first aspect depends on each specific user's language, the second requirement is not tied to a particular user and could, as illustrated by the successes of Wang, Liang, and Manning (2016)'s log-linear model, be learned beforehand.

To allow a system to acquire these skills automatically from data, we introduce a two-step training regime. First, we train the neural network model “offline” on a dataset that mimics the target task. Next, we allow this model to independently adapt to the language of each particular human user by training it online with the examples that each user provides.

In the following two subsections we describe, first, the data and neural network architectures used to model the task, and next, the online training procedure used to adapt to single speakers.

3.1 Offline learning phase

The first step of our method involves training a neural network model to perform the task at hand. We used supervised learning to train the system on a dataset that we constructed by simulating a user playing the game. In this way, we did not require any real data to kick-start our model. Below we describe, first, the procedure used to generate the dataset and, second, the neural network models that were explored in this phase.


The data for the SHRDLURN task takes the form of triples: a start configuration of colored blocks grouped into piles, a natural language instruction given by a user, and the resulting configuration of colored blocks that complies with the given instruction (Figure 1). (The original paper produces a ranked list of candidate configurations to show to a human annotator in a first stage. Since here we focus on pre-annotated data where only the expected target configuration is given, we restrict our evaluation to top-1 accuracy.) We generated 88 natural language instructions following the grammar in Figure 2. The language of the grammar was kept as minimal as possible, with just enough variation to capture the simplest possible actions in this game. Furthermore, we created the initial block configurations by building 6 piles containing a maximum of 3 randomly sampled colored blocks each. The piles were serialized into a sequence by encoding them as 6 lists delimited by a special symbol, each containing a sequence of color tokens or a special empty symbol. We then computed the resulting target configuration using a rule-based interpretation of our grammar. An example of our generated data is depicted in Figure 3.

VERB  → add | remove
COLOR → red | cyan | brown | orange
POS   → 1st | 2nd | 3rd | 4th | 5th | 6th | even | odd | leftmost | rightmost | every
Figure 2: Grammar of our artificially generated language.

Instruction     remove red at 3rd tile
Initial Config. BROWN X X # RED X X # ORANGE RED X
Target Config.  BROWN X X # RED X X # ORANGE X X
Figure 3: Example of an entry in our dataset. We show three rather than six columns for conciseness.
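As an illustration of the serialization just described, the following sketch (our own; the function name `serialize` and the `X`/`#` symbols are assumptions based on Figure 3, not code from the paper) flattens a list of colour piles into a single token sequence:

```python
EMPTY, SEP = "X", "#"   # empty-slot and column-delimiter symbols (as in Figure 3)
MAX_HEIGHT = 3          # each pile holds at most 3 blocks

def serialize(piles):
    """Flatten colour-token piles into one delimited token sequence."""
    tokens = []
    for i, pile in enumerate(piles):
        # pad every pile to a fixed height with the empty symbol
        tokens.extend(pile + [EMPTY] * (MAX_HEIGHT - len(pile)))
        if i < len(piles) - 1:
            tokens.append(SEP)
    return tokens

# Reproduces the initial configuration of Figure 3 (three columns shown):
serialize([["BROWN"], ["RED"], ["ORANGE", "RED"]])
# → ['BROWN', 'X', 'X', '#', 'RED', 'X', 'X', '#', 'ORANGE', 'RED', 'X']
```

The target configuration is encoded the same way, so both input and output of the decoder are fixed-length token sequences.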


To model this task we used an encoder-decoder [Sutskever, Vinyals, and Le2014] architecture: the encoder reads the natural language utterance $w_1, \dots, w_M$ and transforms it into a sequence of feature vectors $h_1, \dots, h_M$, which are then read by the decoder through an attention layer. The decoder reads the sequence describing the input block configuration and produces a new sequence that is construed as the resulting block configuration. To pass information from the encoder to the decoder, we equipped the decoder with an attention mechanism [Bahdanau, Cho, and Bengio2014, Luong, Pham, and Manning2015], which allows the decoder, at every timestep, to extract a convex combination of the hidden vectors $h_1, \dots, h_M$. We trained the system to match the target block configuration (represented as one-hot vectors) using a cross-entropy loss, $\mathcal{L} = -\sum_{t} \log p(y_t \mid y_{<t}, w_{1:M})$, where $y_t$ is the target token at position $t$.

Both the encoder and decoder modules are sequence models: they read a sequence of inputs and compute, in turn, a sequence of outputs, and can be trained end-to-end. We experimented with two state-of-the-art sequence models: a standard recurrent LSTM [Hochreiter and Schmidhuber1997] and a convolutional sequence model [Gehring et al.2016, Gehring et al.2017], which has been shown to outperform the former on a range of different tasks [Bai, Kolter, and Koltun2018]. For the convolutional model we used a fixed kernel size, with padding chosen so that the size of the output matches the size of the input sequence. Because of the invariant structure of the block configuration, organized into lists of columns, we expected the convolutional model (as a decoder) to be particularly well suited to process it. We explored all possible combinations of architectures for the encoder and decoder components. Furthermore, as a simple baseline, we also considered a bag-of-words encoder that computes the average of trainable word embeddings.
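The convex combination computed by the attention layer can be sketched as follows. This is a minimal dot-product attention in NumPy, purely illustrative: the actual models use trainable attention parameters [Bahdanau, Cho, and Bengio2014], which are omitted here.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Return a convex combination of the encoder vectors.

    encoder_states: (T, d) matrix, one row per input position.
    decoder_state:  (d,) query vector for the current timestep.
    """
    scores = encoder_states @ decoder_state   # one score per input position
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()         # positive weights summing to 1
    context = weights @ encoder_states        # the convex combination
    return context, weights
```

Because the weights are a softmax, the context vector always lies in the convex hull of the encoder states, which is exactly the property the prose above describes.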

3.2 Online learning phase

Once the model has been trained to follow a specific set of instructions given by a simulated user, we want it to serve a new, real user, who does not know anything about how the model was trained and is encouraged to communicate with the system in her own language. To do so, the model has to adapt to follow instructions given in a potentially very different language from the one seen during offline training. One of the first challenges it encounters is to quickly master the meaning of new words. The problem of inferring the meaning of a word from a single exposure goes by the name of "fast mapping" [Lake et al.2011, Trueswell et al.2013]. Here, we take inspiration from the method proposed by Herbelot and Baroni (2017), who learn the embeddings of new words with gradient descent while freezing all the other network weights. We further develop it by experimenting with different variations of this method: like them, we try learning only the new word embeddings, but we also try learning the full embedding layer (thus allowing words seen during offline training to shift their meaning). Additionally, we test what happens when the full encoder weights are unfrozen, allowing the model to adapt not only the embeddings but also how they are processed sequentially. In the latter two cases, we incorporate regularization over the embeddings and the model weights.
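The first variant, updating only the embedding rows of new words while everything else stays frozen, can be sketched as follows (our own NumPy illustration of a masked SGD step; in practice the gradients come from backpropagation through the full network):

```python
import numpy as np

def adapt_new_words(embeddings, grads, new_word_ids, lr=0.1):
    """One SGD step that touches only the embedding rows of new words.

    All other rows (and, implicitly, the rest of the network) stay
    frozen at their offline-trained values.
    """
    mask = np.zeros(embeddings.shape[0], dtype=bool)
    mask[new_word_ids] = True
    updated = embeddings.copy()
    updated[mask] -= lr * grads[mask]   # update only the masked rows
    return updated
```

The other two variants described above correspond to widening the mask: to the whole embedding matrix, or to all encoder parameters.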

Training algorithm

Human users interact with the system by asking it, in their own language, to perform transformations on the colored block piles, providing immediate feedback on the intended target configuration. (In our experiments, we use pre-recorded data from Wang, Liang, and Manning (2016).) Each new example that the model observes is added to a buffer $\mathcal{B}$. Then, the model is further trained with a fixed number of gradient descent steps using examples drawn from a subset of this buffer.

In order to reduce the impact of local minima that the model could encounter when learning from just a handful of examples, we train $K$ different model copies (rather than a single model), each with a set of differently initialized embeddings for the new words. Then, to make predictions, we must pick which model to use, based on the evidence available in $\mathcal{B}$. We experimented with two model selection strategies: greedy, which picks the model with the lowest loss computed over the full training buffer; and 1-out, where we hold out the last example for validation and pick the model with the lowest loss on that example. (Beyond these, there is a wealth of model selection methods in the literature; see, e.g., Claeskens and Hjort (2008). To limit the scope of this work, we leave this exploration for future work.) Algorithm 1 summarizes our approach.

1:  Initialize models m_1, …, m_K
2:  Let B be an empty training buffer
3:  for t = 1, 2, …, T do
4:      Observe the input x_t
5:      m* ← Select({m_1, …, m_K}, B)
6:      Predict ŷ_t = m*(x_t)
7:      Observe feedback y_t
8:      Add (x_t, y_t) to B
9:      Train({m_1, …, m_K}, B)
10: procedure Select(M, B)
11:     return the m ∈ M that minimizes the loss L(m, B)
12: procedure Train(M, B)
13:     for m ∈ M, step s = 1, …, S do
14:         Draw a batch b from B
15:         Compute the loss L(m, b)
16:         Update m
Algorithm 1: Online training
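Algorithm 1 can be sketched in Python as follows. This is a schematic with placeholder `loss`, `predict` and `train_step` callables (our own names, not the paper's code); `select` implements both the greedy and the 1-out strategies described above:

```python
def select(models, buffer, loss, strategy="greedy"):
    """Pick the model with the lowest loss on the selection set:
    'greedy' scores the full buffer, '1-out' only the last example."""
    data = buffer if strategy == "greedy" else buffer[-1:]
    return min(models, key=lambda m: sum(loss(m, ex) for ex in data))

def online_loop(models, stream, loss, predict, train_step, strategy="greedy"):
    """Observe, predict, receive feedback, store, retrain (Algorithm 1)."""
    buffer, predictions = [], []
    for x, y in stream:                  # y is the user's feedback
        best = select(models, buffer, loss, strategy) if buffer else models[0]
        predictions.append(predict(best, x))
        buffer.append((x, y))
        for m in models:                 # a fixed number of update steps
            train_step(m, buffer)
    return predictions
```

Note that prediction always happens before the new example enters the buffer, which is what makes the online accuracy defined below a fair measure of first-attempt performance.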

4 Experiments

Our experimental question is whether we can train a neural network system to learn the rules and structure of a task while communicating with a scripted teacher, and then have it adapt to the particular nuances of each human user. We tackled this question incrementally. First, we explored the best architectural choice for solving the SHRDLURN task on our large artificially constructed dataset. Next, we ran multiple controlled experiments to investigate the adaptation skills of our online learning system. In particular, we first tested whether the model was able to recover the original meaning of a word that had been replaced with a new arbitrary symbol (e.g., "red" becomes "roze") in an online training regime. Finally, we proceeded to learn from real human utterances using the dataset collected by Wang, Liang, and Manning (2016).

4.1 Offline training

We used the data generation method described in the previous section to construct a dataset to train our neural network systems. To evaluate the models in a challenging compositional setting [Lake and Baroni2017], we created validation and test sets that have no overlap with training instructions or block configurations, rather than producing a random split of the data. To this end, we split the 88 possible utterances that can be generated from our grammar into 66 utterances for training, 11 for validation and 11 for testing. Similarly, we split the 85 possible combinations that make a valid column of blocks into 69 combinations for training, 8 for validation and 8 for testing, sampling input block configurations using combinations of 6 columns pertaining only to the relevant set. In this way, we generated 42000 instances for training, 4000 for validation and 4000 for testing.
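The disjoint partition of utterances (and, analogously, of column configurations) can be sketched with a small helper; this is our own code, since the paper does not show the sampling routine:

```python
import random

def disjoint_split(items, sizes, seed=0):
    """Partition items into disjoint pools (e.g. train/val/test) so that
    no utterance or column configuration is shared across splits."""
    items = list(items)
    random.Random(seed).shuffle(items)
    pools, start = [], 0
    for n in sizes:
        pools.append(items[start:start + n])
        start += n
    return pools

# 88 utterances → 66 train / 11 validation / 11 test, as described above
train_utts, val_utts, test_utts = disjoint_split(range(88), [66, 11, 11])
```

Because the pools are disjoint by construction, any instance built only from the test pools is guaranteed to be unseen during training.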

We explored all possible combinations of encoder and decoder models: LSTM encoder and LSTM decoder (seq2seq), LSTM encoder and convolutional decoder (seq2conv), convolutional encoder and LSTM decoder (conv2seq), and both convolutional encoder and decoder (conv2conv). Furthermore, we explored a bag-of-words encoder with an LSTM decoder (bow2seq). We trained these 5 models on our generated dataset and used the best-performing one for the following experiments.

We conducted a hyperparameter search for all these models, exploring the number of layers (1 or 2 for LSTMs, 4 or 5 for the convolutional network), the size of the hidden layer (32, 64, 128, 256) and the dropout rate (0, 0.2, 0.5). For each model, we picked the hyperparameters that maximized accuracy on our validation set, and report validation and test accuracy in Table 1. As can be noticed, seq2conv is the best model for this task by a large margin, performing perfectly or almost perfectly on this challenging test split featuring only unseen utterances and block configurations. Furthermore, this validates our hypothesis that the convolutional decoder is better suited to process the structure of the block piles.

Model   Val. Accuracy   Test Accuracy
Table 1: Model accuracies on block configurations and utterances that were completely unseen during offline training. Results expressed as percentages.

4.2 Recovering corrupted words

Next, we asked whether our system can adapt quickly to controlled variations in the language. To test this, we presented the model with a simulated user producing utterances drawn from the same grammar as before, but where some words have been corrupted so that the model cannot recognize them anymore. We then evaluated whether the model can recover the meaning of these words during online training. For this experiment, we combined the validation and test sections of our dataset, containing 22 different utterances in total, to make sure that the presented utterances were completely unseen at training time. We then split the vocabulary of these utterances into two disjoint sets of words to corrupt, one for validation and one for testing. For validation, we took one verb ("add"), 2 colors ("orange" and "red"), and 4 positions ("1st", "3rd", "5th" and "even"). We then extracted a set of 15 utterances containing these words and corrupted each occurrence by replacing it with a new token (consistently keeping the same new token for each occurrence of a word). We further extracted, for each of these utterances, 3 block configurations to pair them with, resulting in a simulated game with 45 instruction examples. For testing, we created controlled games where we corrupted one single word, two words, three words and, finally, all words from the test set vocabulary. (We also experimented with corrupting different types of words, i.e., verbs, colors or position numerals, but we found no obvious differences between them.) By keeping the two vocabularies disjoint, we make sure that optimizing the hyperparameters of our online training scheme does not simply make us good at recovering words from one particular subset.
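The corruption procedure amounts to a consistent token substitution; a minimal sketch (our own helper, using the replacement tokens that appear later in the Analysis section):

```python
def corrupt(utterance, replacements):
    """Replace each occurrence of a corrupted word with its novel token,
    consistently using the same token for every occurrence."""
    return " ".join(replacements.get(tok, tok) for tok in utterance.split())

corrupt("remove brown at every tile",
        {"brown": "braun", "remove": "rmv", "every": "evr"})
# → 'rmv braun at evr tile'
```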

We used the validation set to calibrate the hyperparameters of the online training routine. In particular, we varied the optimization algorithm (Adam or SGD), the number of training steps (100, 200 or 500), the regularization weight, the learning rate, and the model selection strategy (greedy or 1-out), while keeping the number of model copies trained in parallel fixed. For this particular experiment, we considered learning only the embeddings for the new words, leaving all the remaining weights frozen (model 1). To assess the relative merits of this model, we compared it with ablated versions in which the encoder has been randomly initialized but the decoder is kept fixed (model 2), and a fully randomly initialized model (model 3). Furthermore, we evaluated the impact of having multiple concurrently trained model copies by comparing it with training just a single set of parameters (model 4). We report the best hyperparameters for each model in the supplementary materials. We use online accuracy as our figure of merit, computed as $\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}[\hat{y}_t = y_t]$, where $T$ is the length of the game and $\hat{y}_t$, $y_t$ are the predicted and target configurations at turn $t$. We report the results in Table 2.
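Concretely, online accuracy is the fraction of instructions answered correctly on the first attempt over a whole game; a small helper (our own) makes this explicit:

```python
def online_accuracy(predictions, targets):
    """Fraction of instructions whose target configuration was predicted
    correctly on the first attempt, over a game of length T."""
    assert len(predictions) == len(targets)
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```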

Transfer Adapt 1 word 2 words 3 words all words
1. Enc+Dec Emb.
2. Dec. Enc.
3. Enc + Dec
4. Enc+Dec Emb.
Table 2: Online accuracies (in percentages) for the word recovery task averaged over 7 games for 1 word, 17 for 2 words, 10 for 3 words and a single interaction for the all words condition. “Transfer” stands for the components whose weights were saved (and not reinitialized) from the offline training. “Adapt” stands for the components whose weights get updated during the online training.

First, we can see that, perhaps not too surprisingly, the model that adapts only the word embeddings performs best. Notably, it reaches 73% accuracy even when all words have been corrupted (whereas, for example, the model of Wang, Liang, and Manning (2016) obtains 55% on the same task). The only exception comes in the single-corrupted-word condition, where re-learning the full encoder seems to perform even better. Despite this intriguing last result, it remains encouraging that the model can quickly recover the meaning of corrupted words even in the most challenging setting where all words have been changed. In addition, we can observe the usefulness of training multiple sets of parameters by comparing the "Embeddings" model trained by default with multiple copies (row 1) against the variant trained with a single copy (row 4), observing that the former is consistently better.

To gain further understanding, in the Analysis section we examine the representations that our model learns for this task.

4.3 Adapting to human speakers

Having established in controlled experiments the ability of our model to adapt to situations in which it encounters novel words for already-seen concepts, albeit with a distribution comparable to the one seen during training, we moved to the more challenging setting in which the model needs to adapt to real human speakers whose language can depart significantly from the one seen during the offline learning phase, both in surface form and in its underlying semantics. For this we used the dataset made available by Wang, Liang, and Manning (2016), collected from Mechanical Turk workers playing the SHRDLURN game in collaboration with their log-linear/symbolic model. The dataset contains 100 games with nearly 10k instruction examples.

We first selected three games in this dataset to produce a validation set, which we used to tune the online learning hyperparameters. To select these games, we visually inspected games that were in the top 5% in terms of Wang et al.'s model performance and chose three in which the language of the users looked reasonably close to our artificial one, even if the words were completely different. We based this criterion on the hypothesis that similar languages may provide a stronger learning signal, making it better to focus our hyperparameter exploration on cases with a higher chance of success. As our experiments will show, this turned out not to matter much: the model can completely ignore the information stored in the encoder, which is in charge of processing the user language, and still perform at its best, suggesting that surface features do not play a big role in the model's learning capabilities. This was a pleasant surprise, because it means that the specific details of our artificial grammar were mostly irrelevant and what mattered most were the mechanics of the task that the model learned to capture. All the remaining 97 games were left for testing.

(1) Embeddings (2) Encoder (3) Encoder+Decoder
acc. acc. acc.


(c) Nothing (Random) - - - -
(b) Decoder (Random Encoder) - -
(a) Encoder + Decoder
Table 3: For each (valid) combination of weights to re-use and weights to adapt online, we report the average online accuracy on the [Wang, Liang, and Manning2016] dataset and the Pearson correlation between the online accuracies obtained by our model and those reported by the authors.


We explored 6 different variants of adapting our model to new speakers, which can be classified according to two factors of variation. First, we varied which set of pre-trained weights was carried through to the online training phase: (a) all the weights of the encoder plus all the weights of the decoder; (b) only the decoder weights, randomly initializing the encoder; or (c) no weights re-used from the offline learning phase (the latter serving as a baseline for our method). Second, we varied which subset of weights we adapt, leaving all the rest frozen: (1) only the word embeddings (here we report adapting the full embedding layer, which performed better than adapting only the embeddings for new words); (2) the full weights of the encoder; or (3) the full network (both encoder and decoder). Among the 9 possible combinations, we restricted ourselves to the 6 that would not result in random components never being updated (for example, (c-2) would result in a model with a randomly initialized decoder that is never trained), thus leaving out (c-1), (c-2) and (b-1).

For each of the remaining 6 valid training regimes, we ran an independent hyperparameter search, choosing from the same pool of candidate values as in the word recovery task (described in the previous subsection). We picked the hyperparameter configuration that maximized the average online accuracy on the three validation games. The best hyperparameters are reported in the supplementary materials.

We then evaluated each of the model variants on the 97 interactions in our test set and measured the average online accuracy obtained by each system. Our working hypothesis is that a system's performance strongly depends on the quality of the inductive bias it has acquired, or been programmed with, for the task at hand. Considering that the log-linear model originally introduced with this dataset [Wang, Liang, and Manning2016] incorporates this bias in the form of hand-coded functions, we construe it as a good proxy for a useful bias for learning this task. Therefore, the more similarly a model behaves to this system, the more likely it is to encode similar inductive biases. For this reason, we also measured the correlation between the online accuracy obtained by our model on every single game and that obtained by Wang, Liang, and Manning (2016)'s system. The results of these experiments are displayed in Table 3.

In the first place, we confirm that models using the knowledge acquired in the offline training phase (rows a and b) perform better than a randomly initialized model (c-3). Second, perhaps surprisingly, a randomly initialized encoder performs slightly better than the pre-trained one. (Recall that the encoder is the component that reads and interprets the user language, while the decoder processes the block configurations conditioned on the information extracted by the encoder.) This result suggests that the model is better off ignoring the specifics of our artificial grammar and learning the language from scratch, even from very few examples. We deem this a positive outcome because it suggests that no manual effort is required to match the specific surface form of the user's language when training the system offline. Finally, we observe that the models that perform best are those in column (2), which adapt the encoder weights while freezing the decoder ones. Interestingly, they are also those that correlate most with the symbolic system. Indeed, the performance scores are almost perfectly aligned with the correlation coefficients. (As a matter of fact, the 7 entries of online accuracy and Pearson correlation are themselves strongly correlated, which is highly significant even for these few data points.) This result is compatible with our hypothesis that the symbolic system carries useful learning biases: the better a model captures them, the better it performs on the end task.

Furthermore, the results overall indicate that all the useful bias in the system is contained in the decoder, because the model can dispense with the encoder's initialization. This observation suggests that the knowledge carried from the offline phase to the online learning phase captures the mechanics of the task rather than relying on similarities between our artificial language and the actual language of human speakers. To test this hypothesis further, we exchanged all words in our artificial grammar for new scrambled words and shuffled all sentences in an arbitrary but consistent way, thus destroying any existing similarity at the vocabulary and syntactic levels. We then retrained our model on this shuffled data and repeated the online training procedure keeping the decoder weights, obtaining 20.7% mean online accuracy, which is much closer to the results of the models trained on the original grammar than to those of the randomly initialized model. We therefore conclude that a large part of the knowledge the model exploits comes from the mechanics of the task rather than from specifics of the language used.

Finally, we note that the symbolic model attains a higher average online accuracy on this dataset, showing that there is still room for improvement on this task. Yet, it is important to note that since this model features hand-coded domain knowledge, it is expected to have an advantage over a model that has to learn these rules from data alone; the results are thus not directly comparable, but rather serve as a reference point.

5 Analysis

Figure 4: Cosine similarities of the newly learned word embedding for the corrupted version of the word “brown” with the rest of the vocabulary.

Word recovery

To gain further understanding of what our model learns, we examined the word embeddings learned by our model in the word recovery task. In particular, we wanted to see whether the embedding that the model had re-learned for the corrupted word was similar to the embedding of the original word. We analyzed a game in which 3 words had been corrupted: “brown”, “remove” and “every”. We then evaluated how close each of the corrupted versions of these words (“braun”, “rmv” and “evr”) was to its original counterpart in terms of cosine similarity. For conciseness, we focus here just on the word “braun”. In this particular game, the model encounters this word for the first time in the utterance “rmv braun at evr tile”, where, as expected, it fails to predict its meaning correctly. Yet, afterwards it correctly resolves every new occurrence of the word, as in “rmv braun at 5th tile” or “add braun to 4th tile”. In Figure 4 we show the similarity of this word to every other word in the vocabulary. While the network correctly identified that this word bore some similarity to the word “brown”, it judged it even more similar to “remove” and “4th”. We link this result to the observation of [Lake and Baroni2017], who show that neural network systems struggle to capture systematic compositionality. Interestingly, we also explored a mechanism that allowed the model to re-use already known word embeddings rather than learning them from scratch in order to alleviate this problem, but it did not improve the model's performance.
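The analysis above amounts to ranking the vocabulary by cosine similarity to the re-learned embedding. A small self-contained sketch, using toy 2-dimensional vectors purely for illustration (the real embeddings are higher-dimensional and learned):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(query_vec, embeddings, k=3):
    """Return the k vocabulary words most similar to a query embedding."""
    ranked = sorted(embeddings.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [word for word, _ in ranked[:k]]
```

Applied to the re-learned embedding of “braun”, such a ranking reveals whether the model has placed the corrupted word closest to its true counterpart “brown” or to distractors like “remove” and “4th”.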

Human data

In the previous section we showed that the performance of our system correlates strongly with the symbolic system of Wang et al. Yet, this correlation is not perfect, and thus there are games in which our system performs comparatively better or worse on a normalized scale. We looked for examples of such games in the dataset. Figure 5 shows a particular case that our system fails to learn. Notably, the speaker uses other blocks as referring expressions to indicate positions, a mechanism that the model had not seen during offline training, and it therefore struggled to quickly assign a meaning to it.

In more realistic settings, language learning does not take the form of our idealized two-phase learning process, but is an ongoing learning cycle in which new communicative strategies can be proposed or discovered on the fly, as this example of using colors as referring expressions teaches us. Tackling this learning process requires advances that are well beyond the current state of the art and, of course, beyond the scope of this work (an easy fix would have been adding instances of this mechanism to our dataset, possibly improving our final performance; yet this bypasses the core issue we attempt to illustrate here, namely that humans can creatively come up with a potentially infinite number of strategies to communicate, and our systems should be able to cope with that). However, we see these challenges as exciting problems to pursue in the future.

Figure 5: Example of a failure case for our system. During offline training it had not seen other colored blocks used as referring expressions for locations.

6 Conclusions

Learning to follow human instructions is a challenging task because humans typically (and rightfully so) provide very few examples to learn from. For learning from this data to be possible, it is necessary to make use of some inductive bias. Whereas work in the past has relied on hand-coded components or manually engineered features, here we sought to establish whether this knowledge can be acquired automatically by a neural network system through a two-phase training procedure: a (slow) offline learning stage where the network learns about the general structure of the task, and a (fast) online adaptation phase where the network needs to learn the language of a new specific speaker. Controlled experiments demonstrate that when the network is exposed to a language that is very similar to the one it was trained on, except for some changed words, the model adapts very efficiently to the new vocabulary. Moreover, even for human speakers whose language usage can depart considerably from our artificial language, our network can still make use of the inductive bias that was automatically learned from the data. Interestingly, a randomly initialized encoder performs equally well or better on this task than the pre-trained encoder, hinting that the knowledge the network learns to re-use is specific to the task mechanics rather than reflecting language universals. This is not too surprising given the minimalism of our artificial grammar.

To conclude, we have shown that our system can extract useful inductive bias that generalizes to following the linguistic instructions of new speakers using unconstrained language. With this, we are the first to develop a neural model that plays the SHRDLURN task without any hand-coded components. We believe that an interesting direction to explore in the future is adopting meta-learning techniques [Finn, Abbeel, and Levine2017, Ravi and Larochelle2017], where the network parameters are tuned with the explicit objective of serving as a starting point for adaptation. We hope that by bringing together these techniques with the ones presented here, we can move closer to having fast and flexible human assistants.


  • [Andreas and Klein2015] Andreas, J., and Klein, D. 2015. Alignment-based compositional semantics for instruction following. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1165–1174.
  • [Artzi and Zettlemoyer2013] Artzi, Y., and Zettlemoyer, L. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association of Computational Linguistics 1:49–62.
  • [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR2015).
  • [Bai, Kolter, and Koltun2018] Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR abs/1803.01271.
  • [Chen and Mooney2011] Chen, D. L., and Mooney, R. J. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI, volume 2, 1–2.
  • [Claeskens, Hjort, and others2008] Claeskens, G.; Hjort, N. L.; et al. 2008. Model selection and model averaging. Cambridge Books.
  • [Finn, Abbeel, and Levine2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 1126–1135.
  • [Gehring et al.2016] Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. N. 2016. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 123–135.
  • [Gehring et al.2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1243–1252. International Convention Centre, Sydney, Australia: PMLR.
  • [Herbelot and Baroni2017] Herbelot, A., and Baroni, M. 2017. High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 304–309.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Lake and Baroni2017] Lake, B. M., and Baroni, M. 2017. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. CoRR abs/1711.00350.
  • [Lake et al.2011] Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.
  • [Luong, Pham, and Manning2015] Luong, M.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–1421. Lisbon, Portugal: Association for Computational Linguistics.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (NIPS), 3111–3119.
  • [Oquab et al.2014] Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1717–1724. IEEE.
  • [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22(10):1345–1359.
  • [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. Proceedings of NAACL 2018.
  • [Ravi and Larochelle2017] Ravi, S., and Larochelle, H. 2017. Optimization as a model for few-shot learning. In Proceedings of the International Conference of Learning Representations(ICLR).
  • [Shimizu and Haas2009] Shimizu, N., and Haas, A. R. 2009. Learning to follow navigational route instructions. In IJCAI, volume 9, 1488–1493.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (NIPS), 3104–3112.
  • [Trueswell et al.2013] Trueswell, J. C.; Medina, T. N.; Hafri, A.; and Gleitman, L. R. 2013. Propose but verify: Fast mapping meets cross-situational word learning. Cognitive psychology 66(1):126–156.
  • [Vogel and Jurafsky2010] Vogel, A., and Jurafsky, D. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 806–814. Association for Computational Linguistics.
  • [Wang, Liang, and Manning2016] Wang, S. I.; Liang, P.; and Manning, C. D. 2016. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2368–2378.
  • [Winograd1972] Winograd, T. 1972. Understanding natural language. Cognitive psychology 3(1):1–191.