Zero-shot Task Adaptation using Natural Language

06/05/2021 · Prasoon Goyal et al., The University of Texas at Austin

Imitation learning and instruction-following are two common approaches to communicate a user's intent to a learning agent. However, as the complexity of tasks grows, it could be beneficial to use both demonstrations and language to communicate with an agent. In this work, we propose a novel setting where an agent is given both a demonstration and a description, and must combine information from both modalities. Specifically, given a demonstration for a task (the source task), and a natural language description of the differences between the demonstrated task and a related but different task (the target task), our goal is to train an agent to complete the target task in a zero-shot setting, that is, without any demonstrations for the target task. To this end, we introduce Language-Aided Reward and Value Adaptation (LARVA) which, given a source demonstration and a linguistic description of how the target task differs, learns to output a reward / value function that accurately describes the target task. Our experiments show that on a diverse set of adaptations, our approach is able to successfully complete more than 95% of target tasks when using template-based descriptions, and more than 70% when using free-form natural language.



1 Introduction

Teaching learning agents how to perform a new task is a central problem in artificial intelligence. One paradigm, namely imitation learning argall2009survey, involves showing demonstration(s) of the desired task to the agent, which can then be used by the agent to infer the demonstrator's intent, and hence, learn a policy for the task. However, for each new task, the agent must be given a new set of demonstrations, which is not scalable as the number of tasks grows, particularly because providing demonstrations is often a cumbersome process. On the other hand, techniques in instruction-following macmahon2006walk; vogel2010learning; chen2011learning communicate the target task to a learning agent using natural language. As the complexity of tasks grows, providing intricate details using natural language could become challenging.

Figure 1: Example of the setting: The top row shows the source task, while the bottom shows the target task. Given a demonstration of the source task, and a natural language description of the difference between the two tasks such as “In the third step, move the green flat block from bottom left to top left.”, our goal is to train an agent to perform the target task without any demonstrations of the target task.

This motivates a new paradigm to teach agents, that allows scaling up learning from demonstration to multiple related tasks with a single (or a few) demonstration(s) by using a much more natural modality, namely, language. At the same time, intricate details which are harder to communicate using language alone can be provided using demonstration(s). To this end, we propose a novel setting—given a demonstration of a task (the source task), we want an agent to complete a somewhat different task (the target task) in a zero-shot setting, that is, without access to any demonstrations for the target task. The difference between the source task and the target task is communicated using language. The proposed setting requires combining information from both the demonstration and the language, and can therefore serve as an important step towards building systems for more complex tasks which are difficult to communicate using demonstrations or language alone.

For example, consider an environment consisting of objects on different shelves of an organizer, as shown in Figure 1. Suppose the source task (top row) requires moving the green flat block from bottom-right to bottom-left, the blue flat block from middle-left to bottom-right, and then the green flat block from bottom-left to middle-left. The target task (bottom row) is similar, but in the final step, the green flat block should be moved to top-left instead. We posit that given a demonstration for the source task, and a free-form natural language description of the difference between the source and the target tasks, such as “In the third step, move the green flat block from bottom left to top left”, an agent should be able to infer the goal for the target task. We propose a framework that can handle a diverse set of adaptations between the source and the target tasks, such as a missing step, an extra step, and swapping the final positions of two objects.

The environment has a similar structure to several real-world applications where task adaptation using language could be particularly beneficial. For instance, consider the goal of building service robots for home environments. These robots must be able to learn a wide variety of tasks from non-expert users. Many tasks, such as cooking or assembly, involve a sequence of discrete steps, and such tasks could have several variations, like different cooking recipes or assembling different kinds of furniture. Being able to demonstrate one (or a few) of these tasks, and then communicate the difference between the demonstrated task(s) and other similar tasks, could significantly reduce the burden on users of teaching new skills.

These problems involve planning/control at two levels: high-level planning over the steps, and low-level control for executing each step. Since our proposed algorithm focuses on high-level planning, we illustrate our approach on the simple environment shown in Figure 1, where the low-level control is abstracted away. However, our framework is general, and can be combined with approaches that perform low-level control.

The proposed setting is challenging for several reasons. First, most existing approaches in imitation learning and instruction-following infer the goal for a target task from a demonstration or an instruction, respectively. However, in our setting, neither of these modalities is sufficient by itself, and the agent must combine complementary information from the source demonstration(s) and the natural language description in order to infer the goal for the target task. Second, in order to understand the natural language description, the agent must be able to map concepts in the description to objects and actions, a problem known as symbol grounding harnad1990symbol. Finally, in order to be scalable, we intend to learn a purely data-driven model that does not require engineering features for the language or the environment, and can learn to infer the goal for the target task directly from data.

We introduce the Language-Aided Reward and Value Adaptation (LARVA) model that takes in a dataset of source demonstrations, linguistic descriptions, and either the reward or optimal value function for the target task, and is trained to predict the reward or optimal value function of the target task given a source demonstration and a linguistic description. The choice between reward and value prediction could be problem-dependent—for domains with complex transition dynamics, learning a value function requires reasoning about these dynamics, and therefore, it might be better to use LARVA for reward prediction, with a separate policy learning phase using the predicted rewards; for domains with simpler dynamics, a value function could be directly predicted using LARVA, thereby removing the need for a separate policy learning phase.

We experiment with a diverse set of adaptations, and show that the model successfully completes over 95% of target tasks when using synthetically generated language, and about 75% of target tasks when using unconstrained natural language collected using Amazon Mechanical Turk.

2 Related Work

Imitation Learning.

Imitation learning is one of the standard paradigms for teaching new skills to learning agents. The agent is provided with demonstration(s) of a task, and must infer the demonstrator's intent, and hence, learn to complete the task argall2009survey. Approaches to imitation learning can broadly be classified into behavior cloning pomerleau1989alvinn; ross2010efficient; ross2011reduction, inverse reinforcement learning abbeel2004apprenticeship; ramachandran2007bayesian; ziebart2008maximum; finn2016guided, and adversarial learning ho2016generative; fu2017learning. Our proposed setting differs from standard imitation learning, since the agent is provided with demonstration(s) of the source task, but needs to infer the reward for a related but different target task, the difference being communicated using language.

Transfer Learning.

Another problem closely related to our proposed setting is transfer learning, wherein an agent trained on a source task needs to complete a related but different target task. In transfer learning, the agent is finetuned on data from the target task, and the objective is to reduce the amount of data needed for the target task by effectively transferring experience from the source task pan2009survey. This is different from our proposed setting: we do not transfer experience from the source task to the target task; instead, the demonstration(s) for the source task must be combined with the description to infer the goal for the target task.

Meta-learning and Few-shot Learning.

Our setting is also related to the meta-learning and few-shot learning settings vanschoren2018meta; wang2020generalizing. In meta-learning, an agent is given training data from multiple tasks sampled from a distribution of tasks, and it must use these data to generalize to new tasks sampled from the distribution. The data from these tasks could be used to extract useful features, build models as a pretraining step, or learn a training routine (e.g. how to search over the space of neural network architectures), which can be transferred to the new task to learn from fewer datapoints, learn a more robust model, or converge to a solution faster. Few-shot learning is a subcategory of techniques within meta-learning, where the goal is to learn a new task from a few training examples. Our approach infers the reward function for a new target task given a source demonstration and a description in a zero-shot setting, which is a special case of few-shot learning.

Language as Task Description.

In a large class of problems, which can broadly be termed instruction-following, language is used to communicate the task to a learning agent. In this setting, the agent is given a natural language command for a task, and is trained to take a sequence of actions that completes the task. A well-studied problem in this setting is Vision-and-Language Navigation, where the tasks consist of navigating to a desired location in an environment, given a natural language command and an egocentric view from the agent's current position anderson2018vision; fried2018speaker; wang2019reinforced. Another subcategory of instruction-following involves instructing an embodied agent using natural language tellex2011understanding; hemachandra2015learning; arkin2017contextual; shridhar2020alfred; stepputtis2020language; misra2016tell; sung2018robobarista. Our proposed setting is different from instruction-following, in that the goal of the target task is not communicated using language alone; instead, a demonstration for a related task (the source task) is available, and language is used to communicate the difference between the demonstrated task and the target task. Thus, the information in the demonstration and the language complement each other.

Language to Aid Learning.

Several approaches have been proposed in the past that use language to aid the learning process of an agent. In a reinforcement learning setting, this could take the form of a language-based reward, in addition to the extrinsic reward from the environment luketina2019survey; goyal2019using; goyal2020pixl2r; kaplan2017beating, or using language to communicate information about the environment to the agent wang2021grounding; branavan2012learning; narasimhan2018grounding. andreas2017learning use language in a meta-learning setting for few-shot transfer to new tasks. While related to our setting, in that work, language is provided for each task independently, and tasks are deemed similar if their linguistic descriptions are related. In our setting, however, language is used to explicitly describe the difference between two tasks.

3 Problem Definition

Consider a goal-based task, which can be defined as a task where the objective is to reach a designated goal state in as few steps as possible. It can be expressed using the standard Markov Decision Process (MDP) formalism as $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma, g \rangle$, where $\mathcal{S}$ is the set of all states, $\mathcal{A}$ is the set of all actions, $\mathcal{T}$ is the transition function, $\mathcal{R}$ is the reward function, $\gamma$ is the discount factor, and $g \in \mathcal{S}$ is the unique goal state.

At timestep $t$, the agent observes a state $s_t \in \mathcal{S}$, and takes an action $a_t \in \mathcal{A}$ according to some policy $\pi$. The environment transitions to a new state $s_{t+1}$, and the agent receives a reward $\mathcal{R}(s_{t+1})$. The objective is to learn a policy $\pi$ such that the expected future return, $\mathbb{E}\big[\sum_t \gamma^t \mathcal{R}(s_t)\big]$, is maximized. Further, $V^*_{\mathcal{R}}$ denotes the optimal value function under the reward function $\mathcal{R}$, and can be used to act optimally.

The reward function for a goal-based task can be defined as $\mathcal{R}(s) = \mathbb{1}[s = g]$, where $\mathbb{1}$ is the indicator function. Thus, for $\gamma < 1$, an optimal policy for a goal-based task must reach the goal state from any state in as few steps as possible.
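To make this concrete, below is a minimal Python sketch of the goal-based reward and the discounted return it induces; the function names and the discount value are illustrative, not taken from the paper.

```python
def goal_reward(state, goal):
    """Goal-based reward: 1 if the state is the goal state, 0 otherwise."""
    return 1.0 if state == goal else 0.0

def discounted_return(rewards, gamma=0.95):
    """Return sum_t gamma^t * r_t; with gamma < 1, reaching the goal in
    fewer steps yields a strictly higher return."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```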

Let $\mathbb{T}$ be a family of goal-based tasks $T_i$, each with a distinct goal $g_i$ and the reward function $\mathcal{R}_i$ defined as above. The sets of states $\mathcal{S}_i$ and actions $\mathcal{A}_i$, the transition functions $\mathcal{T}_i$, and the discount factors $\gamma_i$ across different tasks may be related or unrelated kroemer2019review.

For instance, in the environment shown in Figure 1, a goal-based task consists of arranging the objects in a specific configuration defined by a goal state $g$, while $\mathbb{T}$ is the set of all multi-step rearrangement tasks in the environment.

Let $T_{src}, T_{tgt} \in \mathbb{T}$ be two tasks, and let $l$ be a natural language description of the difference between the tasks. Given a demonstration $d_{src}$ for the source task $T_{src}$, and the natural language description $l$, our objective is to train an agent to complete the target task $T_{tgt}$ in a zero-shot setting, i.e., without access to the reward function or demonstrations for the target task.

4 LARVA: Language-Aided Reward and Value Adaptation

We propose Language-Aided Reward and Value Adaptation (LARVA), a neural network that takes in a source demonstration $d_{src}$, a natural language description $l$ of the difference between the source and target tasks, and a state $s$ from the target task, and is trained to predict either $\mathcal{R}_{tgt}(s)$, the reward for the state in the target task, or $V^*_{\mathcal{R}_{tgt}}(s)$, the optimal value of the state under the target reward function $\mathcal{R}_{tgt}$.

We assume access to a training set $\{(d_{src}^i, l^i, g_{tgt}^i)\}_{i=1}^N$, where $d_{src}^i$ is a demonstration for the source task of the $i$-th datapoint, $l^i$ is the natural language description of the difference between the source and target tasks for the $i$-th datapoint, and $g_{tgt}^i$ is the goal state for the target task. The details of the dataset and the data collection process are described in Section 5.

Next, we describe the network architecture, and training details of LARVA.

(a) Full model: The target goal predictor takes in a source demonstration and a description to predict the goal state for the target task. The reward / value network takes this predicted goal state, and another state from the target task, to predict the reward or value of the state under the target reward function.
(b) Target goal predictor
Figure 2: Neural network architecture for LARVA

4.1 Network Architecture

We decompose the learning problem into two subproblems: (1) predicting the goal state for the target task given the source demonstration and the language, and (2) predicting the reward / value of a state $s$ given the goal state of the target task. As such, we propose a neural network architecture that consists of two modules: (1) Target Goal Predictor, and (2) Reward / Value Network (see Figure 2). This decomposition allows for additional supervision of the target goal predictor, using the ground-truth goal state for the target task.

4.1.1 Target Goal Predictor

Given a sequence of images representing a source demonstration ($d_{src}$), and a natural language description of the difference between the source and the target task ($l$), the target goal predictor module is trained to predict the goal state of the target task ($g_{tgt}$).

Demonstration Encoder.

Each image in the source demonstration is first passed through a convolutional neural network to obtain a $d$-dimensional vector representation, where $d$ is a hyperparameter tuned using the validation set. The resulting sequence of vectors is padded to the maximum demonstration length ($L_{max}$) in the training data, and the vectors are then concatenated to obtain a single $(L_{max} \cdot d)$-dimensional vector.[1]

[1] We also experimented with an LSTM and a transformer to encode the sequence of vectors, but obtained significantly worse performance compared to simple concatenation. A possible explanation for this behavior is that encoding the frames into a single vector independently of the language may make it harder to associate information in the language with that in individual frames, suggesting that cross-attention between the language and the frames might be required. Our preliminary experiments with attention-based models did not work well, but a more thorough analysis is needed.
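A possible PyTorch sketch of this demonstration encoder is shown below; the convolutional architecture, image resolution, embedding width $d$, and maximum demonstration length are placeholder choices, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class DemoEncoder(nn.Module):
    """Encodes a sequence of demonstration frames by applying a shared CNN
    to each frame, zero-padding to a maximum length, and concatenating."""
    def __init__(self, d=64, max_len=8, in_channels=3):
        super().__init__()
        self.max_len = max_len
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, d),
        )

    def forward(self, frames):                  # frames: (B, L, C, H, W), L <= max_len
        b, l, c, h, w = frames.shape
        z = self.cnn(frames.reshape(b * l, c, h, w)).reshape(b, l, -1)  # (B, L, d)
        pad = z.new_zeros(b, self.max_len - l, z.shape[-1])             # pad to max length
        return torch.cat([z, pad], dim=1).flatten(1)                    # (B, max_len * d)
```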

Language Encoder.

The natural language description is first converted into a sequence of one-hot token representations using a vocabulary (see Section 5.2 for details about the vocabulary), which is then passed through a two-layer LSTM module to obtain a vector representation of the description. The hidden size of the LSTM is a hyperparameter tuned using the validation set.
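The language encoder could look roughly as follows in PyTorch; for brevity, an embedding layer stands in for the explicit one-hot encoding described above, and the vocabulary and hidden sizes are placeholders.

```python
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Two-layer LSTM over the tokenized description; returns the final hidden state."""
    def __init__(self, vocab_size, embed_dim=64, hidden_size=128):
        super().__init__()
        # An embedding layer replaces the explicit one-hot vectors for simplicity.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=2, batch_first=True)

    def forward(self, token_ids):               # token_ids: (B, T) integer indices
        x = self.embed(token_ids)               # (B, T, embed_dim)
        _, (h_n, _) = self.lstm(x)              # h_n: (num_layers, B, hidden_size)
        return h_n[-1]                          # (B, hidden_size), last layer's final state
```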

Target Goal Decoder.

The vector representations of the source demonstration and the natural language description are concatenated, and the resulting vector is passed through a deconvolution network to obtain an image representation of the target goal state.
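A rough sketch of such a decoder is below, assuming a small square output image; the layer sizes and output resolution are illustrative only.

```python
import torch
import torch.nn as nn

class GoalDecoder(nn.Module):
    """Decodes the concatenated demonstration and language embeddings into an
    image of the predicted target goal state (32x32 output is an assumption)."""
    def __init__(self, demo_dim, lang_dim, out_channels=3):
        super().__init__()
        self.fc = nn.Linear(demo_dim + lang_dim, 128 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),                       # pixel values in [0, 1]
        )

    def forward(self, demo_emb, lang_emb):
        z = self.fc(torch.cat([demo_emb, lang_emb], dim=-1))
        return self.deconv(z.view(-1, 128, 4, 4))   # (B, out_channels, 32, 32)
```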

4.1.2 Reward / Value Network

The reward or value network takes the predicted target goal state $\hat{g}_{tgt}$ and another state $s$ from the target task as inputs, and is trained to predict the reward or the value, respectively, of state $s$ under the target reward function. The predicted goal state and the state $s$ are encoded using the same CNN encoder (i.e., shared weights) used for encoding the demonstration frames in the target goal predictor, to obtain $d$-dimensional vector representations. The reward or value of state $s$ is computed as the cosine similarity between the vector representations of $\hat{g}_{tgt}$ and $s$. We denote the ground-truth reward or value as $y$, and the network prediction as $\hat{y}$.
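The cosine-similarity scoring step can be written compactly; `cnn` below stands for the shared frame encoder, and the tensor shapes are assumptions.

```python
import torch.nn.functional as F

def reward_or_value(cnn, predicted_goal, state):
    """Score a target-task state against the predicted goal image.

    `cnn` is the shared frame encoder (same weights as in the target goal
    predictor); both inputs are image batches of shape (B, C, H, W). The output
    is the cosine similarity of their embeddings, one score per batch element."""
    g = cnn(predicted_goal)                     # (B, d)
    s = cnn(state)                              # (B, d)
    return F.cosine_similarity(g, s, dim=-1)    # (B,), values in [-1, 1]
```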

4.2 Training

To train the model, we assume access to a dataset $\mathcal{D} = \{(d_{src}^i, l^i, g_{tgt}^i)\}_{i=1}^N$. Using the goal state $g_{tgt}^i$ for the target task, the reward function and the optimal value function for the target task can be computed, and these are used to supervise the model training as described below.

4.2.1 Training Objectives

Mean Squared Error.

Since we want the model to regress to the true reward or value of states in the target task, a natural choice for the loss function is the mean squared error (MSE), $\mathcal{L}_{RV} = (y - \hat{y})^2$.

Target Goal Supervision.

Further, we additionally supervise the target goal predictor using the true goal state for the target task, with an MSE loss, $\mathcal{L}_{goal} = \lVert g_{tgt} - \hat{g}_{tgt} \rVert_2^2$.

Thus, the overall training loss is given by $\mathcal{L} = \mathcal{L}_{RV} + \lambda \mathcal{L}_{goal}$, where $\lambda$ is a hyperparameter tuned using a validation set.
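A minimal sketch of this combined objective, using the notation reconstructed above ($\mathcal{L}_{RV}$, $\mathcal{L}_{goal}$, $\lambda$); the default weight is a placeholder.

```python
import torch.nn.functional as F

def larva_loss(y_pred, y_true, goal_pred, goal_true, lam=1.0):
    """Overall training loss: MSE on the predicted reward / value plus a
    lambda-weighted MSE between the predicted and ground-truth goal images."""
    loss_rv = F.mse_loss(y_pred, y_true)          # L_RV
    loss_goal = F.mse_loss(goal_pred, goal_true)  # L_goal
    return loss_rv + lam * loss_goal
```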

4.2.2 Optimization

For training the model, a datapoint $(d_{src}, l, g_{tgt})$ is sampled uniformly at random from $\mathcal{D}$. When predicting the value function, a target state $s$ is sampled uniformly at random from $\mathcal{S}$ at each step of the optimization process. When predicting the reward function, the goal state $g_{tgt}$ is sampled with 50% probability, while a non-goal state is sampled uniformly at random otherwise; this is required because the reward function is sparse. We use the Adam optimizer kingma2014adam to train the network end-to-end for 50 epochs, with weights initialized randomly according to glorot2010understanding. A validation set is used to tune hyperparameters via random search.
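The state-sampling scheme for one optimization step might look as follows; `value_table` is a hypothetical precomputed mapping from states to ground-truth optimal values, and the 50% goal-sampling probability follows the description above.

```python
import random

def sample_target_state(all_states, goal_state, value_table=None):
    """Sample a target state and its supervision signal for one optimization step.

    If `value_table` (state -> optimal value) is given, we are predicting values
    and sample any state uniformly. Otherwise we are predicting the sparse
    reward, so the goal state is returned with 50% probability to balance labels."""
    if value_table is not None:                   # value prediction
        s = random.choice(all_states)
        return s, value_table[s]
    if random.random() < 0.5:                     # reward prediction: goal state
        return goal_state, 1.0
    s = random.choice([x for x in all_states if x != goal_state])
    return s, 0.0
```

The network parameters would then be updated with `torch.optim.Adam` on the loss from Section 4.2.1.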

5 Environment and Dataset

In this section, we describe the environment we use in our experiments. While the framework described above is fairly general, in this work, we focus on a simpler setting that is more amenable to analysis. Specifically, we assume discrete state and action spaces $\mathcal{S}$ and $\mathcal{A}$, and deterministic transitions, i.e., $\mathcal{T}(s' \mid s, a) \in \{0, 1\}$.

5.1 The Organizer Environment

We propose the Organizer Environment, which consists of an organizer with 3 shelves. There are 8 distinct objects, and each object can take one of 3 colors (red, blue, or green), giving a total of 24 distinct colored objects (see Figure 4 in the appendix). Objects can be placed in each shelf, either to the left or the right, resulting in a total of 6 distinct positions: top-left, top-right, middle-left, middle-right, bottom-left, and bottom-right.

Objects can be placed in different configurations to create different states. In our experiments, we use tasks with 2 or 3 objects; the total number of states with 2 or 3 objects (i.e. $|\mathcal{S}|$) is 285,120. The action space is common across all tasks, and consists of 30 move actions, one for each ordered pair of distinct positions. Finally, there is a STOP action that indicates the termination of an episode.
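The position and action sets can be enumerated directly, as in the sketch below; the exact transition rules (e.g., whether the destination must be empty) are our assumptions rather than the paper's specification.

```python
from itertools import permutations

POSITIONS = ["top-left", "top-right", "middle-left",
             "middle-right", "bottom-left", "bottom-right"]

# 30 move actions (ordered pairs of distinct positions), plus STOP.
MOVE_ACTIONS = [("MOVE", src, dst) for src, dst in permutations(POSITIONS, 2)]
ACTIONS = MOVE_ACTIONS + [("STOP",)]
assert len(MOVE_ACTIONS) == 30      # 6 positions * 5 other positions

def step(state, action):
    """Deterministic transition. A state maps each position to an object id or
    None. MOVE(src, dst) relocates the object at src to dst when src is occupied
    and dst is free; otherwise (and for STOP) the state is unchanged."""
    if action[0] == "STOP":
        return state
    _, src, dst = action
    if state.get(src) is None or state.get(dst) is not None:
        return state
    new_state = dict(state)
    new_state[dst], new_state[src] = new_state[src], None
    return new_state
```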

5.2 Language Data

In this work, we experiment with 6 types of adaptations: (1) moving the same object in the source and target tasks, but to different final positions; (2) moving a different object in the source and target tasks, but to the same final position; (3) moving two objects in the source and target tasks, with their final positions swapped in the target task; (4) deleting a step from the source task; (5) inserting a step to the source task; and (6) modifying a step in the source task. Examples for each of these adaptations are shown in the appendix (Table 3).

For each pair of source and target tasks in the dataset, we need a linguistic description of the difference between the tasks. We start by generating linguistic descriptions using a set of templates, such as, “Move obj1 instead of obj2 to the same position.” We ensure that for most of these templates, the target task cannot be inferred from the description alone, and thus, the model must use both the demonstration of the source task and the linguistic description to infer the goal for the target task.

Next, we collected natural language for a subset of these synthetic (i.e. template-generated) descriptions using Amazon Mechanical Turk (AMT). Workers were provided with the template-generated descriptions, and were asked to paraphrase these descriptions. Importantly, our initial data-collection experiments suggested that providing the workers with the task images resulted in inferior descriptions, wherein, many descriptions would describe the target task completely, instead of how it differs from the source task. As such, the workers were first shown some examples of source and target tasks to familiarize them with the domain, and were then only provided with template-generated descriptions, without the images for the source and target tasks, to obtain the paraphrases. See Section A in the appendix for more details about the data collection process.

We applied a basic filtering process to remove clearly bad descriptions, such as those with two words or fewer, and those that were identical to the given template descriptions. We did not make any edits to the descriptions, like correcting for grammatical or spelling errors. Some examples of template-generated and natural language descriptions obtained using AMT are shown in Table 1.

It can be observed that while the first four rows in the table are valid paraphrases, the remaining three paraphrases could be ambiguous depending on the source and target tasks. For instance, in row 5, the target task involves an extra step after the first step, while the natural language paraphrase could be interpreted as modifying the second step. In row 6, the natural language description is not a valid paraphrase, while in row 7, the paraphrase is difficult to comprehend. We manually analysed a small subset of the collected paraphrases, and found that about 15% of the annotations were ambiguous / non-informative. Some of this noise could be addressed by modifying the data-collection pipeline, for example, by providing more information about the source and target tasks to humans, and filtering out non-informative / difficult to comprehend paraphrases.

Template | Natural language paraphrase
1. Move the cylinder to middle left. | Place the cylinder in the middle left
2. Move the red tall block to the final position of green long block. | Place the red tall block in the green longs blocks final position
3. Skip the third step. | do not do the third step
4. In the first step, move the green tall cylinder from bottom right to bottom left. | for the first step, put the green tall cylinder in the bottom left position
5. Move blue tall cylinder from bottom left to middle left after the first step. | For the second step move the blue tall cylinder to the middle left
6. Move the blue cube to the final position of blue tall cylinder. | Swap the blue cube with the red cube on bottom shelf.
7. Move blue tall block from top right to bottom left after the second step. | Move blue tall square from upper option to base left after the subsequent advance.
Table 1: Examples of template-generated and natural language descriptions collected using AMT.

A vocabulary was created using the training split of the synthetic and natural language descriptions, discarding words that occurred fewer than 10 times in the corpus. While encoding a description using the resulting vocabulary, out-of-vocabulary tokens were represented using the <unk> symbol.
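A simple sketch of this vocabulary construction and encoding is shown below, assuming whitespace tokenization and lowercasing (tokenization details are not specified in the paper).

```python
from collections import Counter

def build_vocab(descriptions, min_count=10):
    """Build a token vocabulary from training descriptions; tokens occurring
    fewer than `min_count` times are dropped and map to <unk> at encoding time."""
    counts = Counter(tok for text in descriptions for tok in text.lower().split())
    vocab = {"<unk>": 0}
    for tok, c in counts.items():
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map a description to token indices, using <unk> for out-of-vocabulary tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
```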

6 Experiments

6.1 Details about the setup

Dataset.

For each adaptation, 6,600 pairs of source and target tasks were generated along with template-based descriptions. Of these, 600 template-based descriptions per adaptation were used to collect natural language descriptions using Amazon Mechanical Turk. Thus, our dataset consisted of 6,600 examples in total for each adaptation, of which 6,000 examples used synthetic language, and 600 used natural language. Of the 6,000 synthetic examples per adaptation, 5,000 were used for training, 500 for validation, and the remaining 500 for testing. Similarly, of the 600 natural language examples per adaptation, 500 were used for training, 50 for validation, and 50 for testing. This gave us a training dataset with 33,000 examples, and validation and test datasets with 3,300 examples each, across all adaptation types.

Evaluation Metrics.

For each experiment, the trained model predicts the reward or value of a given state $s$ in the target task. When using the value function, the trained network is used to predict the values of all states $s \in \mathcal{S}$. When using the reward function, the trained network is used to predict the rewards of all states $s \in \mathcal{S}$, from which the optimal value function is computed using dynamic programming. In both cases, if the state with the maximum value matches the goal state for the target task, $g_{tgt}$, the task is deemed successful. We train the models using the entire training set (i.e. both synthetic and natural language examples across all adaptations), and report the percentage of successfully completed target tasks for both synthetic and natural language descriptions. For each experiment, we tune the hyperparameters on the validation set, and report the success rate on the test set corresponding to the setting that yields the maximum success rate on the validation set.
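When the model predicts rewards, the optimal value function can be recovered with standard value iteration over the discrete, deterministic environment; the discount factor, iteration count, and terminal-state handling in the sketch below are assumptions.

```python
def value_iteration(states, actions, step_fn, reward_fn, gamma=0.95, iters=100):
    """Standard value iteration for a discrete, deterministic MDP:
    V(s) <- R(s) + gamma * max_a V(step(s, a)).
    States are assumed hashable; `reward_fn` returns the network's predicted reward."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: reward_fn(s) + gamma * max(V[step_fn(s, a)] for a in actions)
             for s in states}
    return V

def task_success(V, goal_state):
    """The task is deemed successful if the highest-value state is the true goal."""
    return max(V, key=V.get) == goal_state
```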

Experiment | Synthetic | Natural
1. LARVA; reward prediction | 97.8 | 75.7
2. LARVA; value prediction | 97.7 | 73.3
3. LARVA; no target goal supervision | 20.0 | 2.7
4. LARVA; no language | 20.7 | 22.3
5. LARVA; no source demonstration | 4.2 | 3.3
6. NN without decomposition | 1.8 | 1.0
7. LARVA: Compositionality – red box | 87.6 | 62.4
8. LARVA: Compositionality – blue cylinder | 89.4 | 65.9
Table 2: Success rates (%) of different models on synthetic and natural language test sets.

6.2 Results

In this section, we describe the performance of our full model, along with various ablations. Our results are summarized in Table 2.

First, we evaluate our full LARVA model, with both reward and value prediction (rows 1 and 2 in Table 2). In both cases, the models result in successful completion of the target task more than 97% of the time with synthetic language, and more than 73% of the time with natural language. The drop in performance when using natural language can partly be attributed to the roughly 15% of paraphrases that are ambiguous or uninformative, as discussed in Section 5.2, while the remaining performance gap is likely because natural language has more variation than synthetic language. Better data collection and more complex models could be explored to bridge this gap further. Our experiments analyzing the impact of the quantity of data suggest that increasing the amount of synthetic or natural language data is unlikely to provide a significant improvement on the natural language test set (see Section D in the appendix).

The similar performance when predicting rewards and values is not unexpected: we observed in our experiments that training the target goal prediction module was more challenging than training the reward or value networks. Since the target goal prediction module is identical for both reward and value prediction, the performance in both cases is upper-bounded by the performance of the target goal prediction module. For domains with complex dynamics, reward and value prediction might result in significantly different success rates.

Next, we discuss ablations of our model. We present the results only with value prediction, since as noted, both reward and value predictions perform similarly.

  1. To study the effect of target goal supervision for training the target goal predictor, we remove $\mathcal{L}_{goal}$, optimizing the network using the ground-truth values only. Row 3 in Table 2 shows that this drastically degrades performance, confirming the efficacy of target goal supervision.

  2. To ensure that most tasks require using information from both the source demonstration and the natural language description, we run unimodal baselines, wherein the network is provided with only the source demonstration (row 4) or only the language (row 5). As expected, both settings result in a substantial drop in performance. Interestingly, using only the source demonstration results in over 20% successful completions. This is because, given the set of adaptations, the source demonstration substantially constrains the space of target configurations (e.g. if the source task consists of three steps, the target task must contain at least two of those steps, since the source and target tasks differ by only one step for all adaptations).

  3. Finally, we experiment with an alternate neural network architecture that does not decompose the learning problem into target goal prediction and value prediction. The source demonstration, the language, and the target state are all encoded independently and concatenated, from which the value of state $s$ is predicted. Row 6 shows that the resulting model achieves a very low success rate on target tasks, demonstrating the importance of decomposing the learning problem as in LARVA.

In the experiments so far, the data were randomly partitioned into training, validation, and test splits. However, a key aspect of using language is the ability to compose concepts. For instance, humans can learn concepts like “blue box” and “red cylinder” from independent examples, and can recognize a “blue cylinder” by composing these concepts without ever having seen examples of the new concept.

To evaluate whether our proposed model can compose concepts seen during training, we create two new splits of our data. In both splits, the training data consists of all datapoints that do not contain any blue cylinders or red boxes. In the first split, the validation set consists of all datapoints containing blue cylinders, while the test set consists of all datapoints containing red boxes. In the second split, the validation and test sets are swapped.[2]

[2] Datapoints containing both a blue cylinder and a red box are discarded.

We train LARVA on these new splits (using value prediction), and report the success rates on the test sets in Table 2, rows 7 and 8. As can be observed, our model is able to successfully complete a large fraction of the target tasks by composing concepts seen during training; however, there is room for further improvement using richer models.

7 Conclusions and Future Work

Conclusions.

We proposed a novel framework that allows teaching agents using a combination of demonstrations and language. Given a demonstration of a source task, and a natural language description of the difference between the source task and a target task, we introduced Language-Aided Reward and Value Adaptation (LARVA), a model that can perform the target task in a zero-shot setting. We experimented with a diverse set of adaptations on a simple discrete environment, and showed that the model is able to complete nearly all target tasks successfully when using synthetic language, and more than 70% of target tasks when using free-form natural language. A key component of LARVA, as demonstrated by the ablation experiments, is decomposing the full problem into two subproblems (target goal prediction and reward / value prediction), which allows for intermediate supervision.

Limitations and Future work.

First, our experimental evaluation involved a fairly simple setup. While a simple domain allows for faster experimentation and better analysis, richer domains with more visual diversity, complex dynamics, and continuous states and actions would allow for a more thorough evaluation in future work. Similarly, a relatively simple family of tasks was considered in this work, and generalization to more complex families of tasks needs further experimentation. It is worth noting that our general framework is applicable to all these variations. Second, our approach requires about 30,000 pairs of source and target tasks, along with natural language descriptions, to learn the model. While this is comparable to related approaches (e.g. stepputtis2020language use 40,000 training examples for instruction-following), we believe that on more realistic tasks, using pretrained vision and language encoders could significantly reduce the data requirements. Further, training the model is a one-time process that can be performed before deployment; after deployment, the system can be provided with demonstrations and descriptions to perform new tasks without any further training. Finally, when using natural language, there is a significant performance gap relative to template-generated language. A data collection pipeline with better noise filtering, and richer language models (e.g. ones that use attention), could help bridge this gap.

Appendix A Data Collection using Amazon Mechanical Turk

The interface used for collecting natural language descriptions using Amazon Mechanical Turk is shown in Figure 3. The total amount of money spent on data collection was $985, with an average pay of $6.78 per hour.

Figure 3: Interface for collecting paraphrases using Amazon Mechanical Turk

Appendix B Details about the Domain

Table 3: Types of adaptations used in our experiments. For each type, an example of source and target tasks is shown (images omitted here): (1) same object, different place location; (2) different object, same place location; (3) move two objects, with swapped final locations; (4) delete a step; (5) insert a step; (6) modify a step.
Figure 4: Objects in the Organizer Environment

Figure 4 shows all the objects in the Organizer environment. Table 3 shows an example for each type of adaptation described in Section 5.2.

Appendix C Compute Resources

The experiments were primarily run on a 64-CPU machine, with a Quadro RTX 6000 GPU. Each experimental run took on average 5 hours to complete.

Appendix D Additional Experiments

(a) Synthetic language test set
# natural \ # synthetic | 0 | 7,500 | 15,000 | 30,000
0 | - | 83.8 | 93.2 | 97.3
750 | 3.0 | 85.8 | 93.4 | 97.2
1,500 | 5.9 | 85.8 | 91.9 | 96.8
3,000 | 30.1 | 88.1 | 93.6 | 97.7

(b) Natural language test set
# natural \ # synthetic | 0 | 7,500 | 15,000 | 30,000
0 | - | 48.7 | 46.3 | 51.0
750 | 1.7 | 60.7 | 58.7 | 65.7
1,500 | 4.3 | 63.0 | 64.0 | 73.3
3,000 | 29.3 | 68.7 | 72.0 | 73.3

Table 4: Success rates (%) when using varying amounts of synthetic and natural language data to train LARVA. The row labels show the number of natural language examples used, while the column labels show the number of synthetic language examples used.

In order to better understand the amount of data needed, we trained LARVA with varying amounts of synthetic and natural language training examples (using value prediction). The results are summarized in Table 4.

Unsurprisingly, on the synthetic language test set, the number of natural language examples only makes a difference when the number of synthetic language examples in the training set is small. The results on the natural language test set are more informative. In particular, if no natural language examples are used for training, the model is only able to successfully complete about 50% of the tasks, even as the amount of synthetic language data is increased. Furthermore, using 1,500 natural language examples instead of 3,000 with 30,000 synthetic language examples results in performance comparable to the full set. Similarly, halving the amount of synthetic language data (i.e. 15,000 examples instead of 30,000) when using the full natural language set results in only a small reduction in performance. However, it is clear that some amount of natural language training data is needed to generalize successfully to natural language test cases.

These results suggest that using additional synthetic language or natural language data compared to our full set will likely not result in a significant performance improvement, and thus, improving the performance when using natural language requires filtering out low quality natural language data, and using richer models.

Appendix E Societal Impacts

Our approach is a step towards building agents that can be trained more conveniently by non-experts, which is a crucial property for AI that can work alongside humans to accomplish everyday tasks.

While such AI can have wide-reaching positive impact, for instance, in taking care of the elderly or differently-abled, and accomplishing simple tasks in small-scale businesses freeing up valuable human time, such technology can also be used for unethical objectives. First, such a system could be taught malicious behavior more conveniently, using demonstrations and language. Thus, the downstream system that executes these behaviors in the real-world (e.g. a mobile robot) must be designed with safety as a top priority. Further, there should be a system in place (either manual or automated) which should filter out potentially malicious data, both while training and predicting from a model like LARVA. One way to do this could be to check if the language contains any words that could cause harm (e.g. “attack”). Second, research in natural language processing is largely based on English. Care must be taken to ensure that as these systems are scaled, support for other languages is provided, as well as variations in accents, social backgrounds, etc. are handled, so that the technology can be used to benefit all sections of the society.