Reinforcement learning (RL) solves sequential decision-making problems through trial and error, guided by a reward signal. RL has achieved tremendous successes, especially in beating humans at games (Silver et al., 2018; Jaderberg et al., 2019) and in robotics (Levine et al., 2016). However, RL also suffers from various open problems, such as sample inefficiency. This sample inefficiency is often caused by how the reward function is specified: a sparse and delayed reward signal makes it difficult for the agent to experience, and learn from, meaningful rewards.
Designing tasks that are suitable for RL algorithms is often challenging (Ng et al., 1999), and mostly involves designing a task-specific reward function. A recent line of research, surveyed by Luketina et al. (2019), has proposed methods that allow task descriptions to be specified using natural language. However, such methods (Chevalier-Boisvert et al., 2018) have proven to still be very sample inefficient, requiring up to 50 GPUs for weeks in order to learn relatively simple tasks.
One promising approach is that of Jiang et al. (2019), who proposed tackling this sample inefficiency by decomposing the problem into a hierarchical structure, guided by the compositional nature of natural language.
Humans follow a similar strategy: when confronted with a new problem, they are generally capable of forming intuitive theories about how to tackle it. These intuitive theories often consist of sequences of high-level actions (e.g. first go to store x, then stop for gas before driving home). An interesting way to make RL more sample efficient would be to combine high-level human intuitive theories, expressed in natural language, with low-level automated trial-and-error learning.
An essential part of such a symbiosis is the ability of an agent to quickly adapt from one task to a similar task. A human does not need to learn each individual task from scratch, but has a set of base strategies from which a new strategy can be quickly formed.
Current algorithms capable of quickly adapting their control policies to related tasks mostly rely on intensive training on a diverse set of tasks, often guided by a curriculum of increasingly difficult and diverse tasks (Bengio et al., 2009).
In this paper, we take a different approach and examine whether we can facilitate fast task adaptation by utilizing the semantic meaning of task descriptions formulated in natural language. Given a set of pre-trained control policies and a new, previously unseen task, our method decides, solely from the task's instruction, which of the pre-trained control policies will adapt best to the new task.
In the following sections, we first briefly review key research related to ours (Section 2). Section 3 describes the environment and the tasks we use to demonstrate our method. In Section 4 we describe the proposed method. Section 5 demonstrates experimentally how well our method performs task adaptation in a simple environment.
2 Related work
Our proposed method is situated at the intersection of transfer learning and natural language usage in RL. In this section, we briefly review how our method relates to key research on transfer learning in RL, how natural language has been used in RL, and what research has been conducted at this intersection.
Transfer learning in reinforcement learning
Transferring knowledge gained from learning one task to another task is a widely studied problem. The goal of this field is to make RL more sample efficient (Konidaris, 2006; Taylor and Stone, 2009). Common approaches include training the agent on multiple tasks (Hessel et al., 2019) and constructing parameterized policies (Schaul et al., 2015; Andreas et al., 2017; Oh et al., 2017), which can be configured to perform new tasks. An alternative approach consists of learning inter-task mappings (Taylor, 2007) based on task similarities. Our method is similarly capable of detecting task similarities, using additional information captured in task descriptions.
Language instructions in reinforcement learning
Recent advances in RL, surveyed by Luketina et al. (2019), have demonstrated the use of natural language to build models capable of capturing domain knowledge.
A commonly used approach consists of directly embedding both the visual observation and the language instruction in order to train a control policy (Hermann et al., 2017; Misra et al., 2017; Chevalier-Boisvert et al., 2018). Alternatively, Goyal et al. (2019) use natural language for reward shaping, by predicting whether an action in a trajectory matches a task description. Jiang et al. (2019) explore the compositional structure of natural language in order to train a hierarchical algorithm capable of discovering abstractions that generalize over different sub-tasks using language instructions. However, current approaches often depend heavily on large amounts of human-labeled data and hand-designed policies. In this context, our method can reduce the dependency on expensive human labeling by providing fast task adaptation.
Transfer learning guided by language in reinforcement learning
Co-Reyes et al. (2019) proposed a meta-learning algorithm capable of utilizing corrective instructions formulated in natural language in order to facilitate task adaptation.
Most similar to our research is the work of Narasimhan et al. (2018), which uses entity descriptions in natural language as a layer of abstraction in order to facilitate the transfer of an RL policy to a new environment.
3 BabyAI environment
In order to demonstrate the capabilities of our method, we make use of the BabyAI environment proposed by Chevalier-Boisvert et al. (2018). In this environment, the agent must complete various tasks in a 2D gridworld. The environment supports multiple rooms, but for our preliminary experiments we only consider a single room and use the goto and pickup tasks. The task assigned to the agent is described using a synthetic baby language. The pixels of the screen, together with this instruction, form the observation of the agent. The environment supports partial observability of the state; however, for our experiments we use the fully observable configuration. The action space we consider consists of moving forward, turning left or right, picking up or dropping an object, opening doors, and a finish action. Note that solving the goto and pickup tasks requires only a subset of the action space.
In this environment, the reward signal is sparse, as the agent only receives a reward upon task completion. A few example tasks are presented in Figure 1.
The instructions used in the BabyAI environment are all generated using the proposed Baby Language. This language consists of a small vocabulary, but can be used combinatorially to express a relatively rich set of different tasks.
The instructions we use in our transfer experiments all follow the same pattern: verb, object color, object type (e.g. pickup the yellow box). The following words make up the vocabulary used in our experiments:
Verbs: pickup, goto
Objects: box, key, ball
Colors: blue, red, green, yellow
In total, this allows us to express 24 different tasks. While the BabyAI platform is a great platform to demonstrate the qualities of our method, our method is not environment-specific, and we plan to extend this research to multiple environments.
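As a minimal illustration of how the vocabulary above yields 24 tasks, the following Python sketch enumerates every combination of verb, color, and object. The function name `all_instructions` and the exact string format (including the article "the") are assumptions made for illustration; they are not part of the BabyAI platform.

```python
from itertools import product

# Vocabulary as listed above.
VERBS = ["pickup", "goto"]
COLORS = ["blue", "red", "green", "yellow"]
OBJECTS = ["box", "key", "ball"]


def all_instructions():
    """Enumerate every 'verb the color object' instruction (hypothetical helper)."""
    return [
        f"{verb} the {color} {obj}"
        for verb, color, obj in product(VERBS, COLORS, OBJECTS)
    ]


instructions = all_instructions()
print(len(instructions))  # 2 verbs * 4 colors * 3 objects = 24
```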
4 Method
The main idea of our approach is to utilize a limited set of pre-trained base control policies. When confronted with a new task, described using natural language (the transfer instruction), the best base policy is selected and the new task is learned starting from this base policy.
As such our method consists of two parts: the first part is a pre-training step, while the second part deals with the effective task-adaptation. A pseudo-code summary of our method can be found below in Algorithm 1.
4.1 Pre-training base control policies
In this pre-training phase, we first train a set of base control policies. A control policy determines the action an agent takes, based on the state the agent resides in.
Each base control policy should be able to reliably perform one instruction. This task instruction is expressed in natural language (e.g. go to the blue ball or pickup the yellow key). Base control policies can be trained using any RL algorithm. In this preliminary research, the set of possible instructions is limited by the fixed vocabulary described in Section 3. The number of pre-trained control policies should be sufficiently large, but smaller than the total number of possible instructions.
For a base control policy to facilitate efficient task adaptation, it is beneficial to introduce slight variations in the environment during pre-training. One example of such a variation is spawning the agent in a different position after each iteration.
Our method can be used with a fixed number of base control policies, which are trained during a single pre-training phase. However, our method can also be extended to work in an iterative fashion. In this iterative approach, the agent starts with a small set of pre-trained base control policies. When confronted with a new task, our method is used to determine the best base control policy to facilitate task adaptation. After training the new policy by adapting the selected base control policy, the new policy can be added to the set of base control policies. This allows more efficient task adaptations as more base control policies become available.
In the proposed method, we select the instructions used to train base control policies uniformly at random. However, an interesting extension might be to select base control policies according to a more advanced objective, for example by maximizing the distance between the task instructions in a language embedding.
4.2 Sampling task-adaptations
The second phase of our method consists of utilizing the developed base control policies in order to sample a limited number of task adaptations. A single task adaptation sample consists of taking a fully developed base control policy and using it to perform a new instruction, different from the one it was trained on. For example, such a sample could start from a policy trained on the instruction go to the yellow box and ask it to perform a different task, such as pickup the yellow box.
A task adaptation from one policy to a new one is done by loading the parameters of the base policy as the initialization of the new policy we want to develop. Training can be performed using any RL algorithm. During this sampling phase the policy does not need to converge; training only needs to happen for a limited number of steps, significantly fewer than are required to fully develop the policy. After the sampled task adaptation has been executed for this fixed number of steps, we measure the performance. This can be done by, for example, calculating the success rate of the agent in satisfying the instruction over the last 100 iterations. Table 1 contains a few examples of this sampling process.
|Base control policy instruction||Transfer instruction||Measured performance|
|Pickup the red ball||Goto the green key||0.91|
|Pickup the red ball||Goto the red ball||0.76|
|Goto the yellow box||Goto the green key||0.86|
|Goto the yellow box||Goto the red ball||0.86|
For each base control policy, we randomly select a fixed number of different tasks from the set of possible instructions to sample task adaptations. In summary, the number of task adaptation samples our method requires is the number of base control policies multiplied by the number of tasks sampled per policy, with each sample consisting of a fixed number of training steps. As with the selection of the base control policies, we leave a more advanced sampling strategy as future work.
This sampling method generates a dataset that can be used to generalize expected task adaptation to unseen tasks. Furthermore, the resulting policies, which were partially developed during the sampled task adaptations, could be developed further by the agent when it is tasked with the corresponding instruction.
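The sampling phase described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `success_rate` and `sample_transfer_pairs` are hypothetical, the RL training itself is omitted, and we assume that a transfer instruction may be any instruction other than the one the base policy was trained on.

```python
import random


def success_rate(episode_successes, window=100):
    """Fraction of successful episodes among the last `window` episodes."""
    recent = list(episode_successes)[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent)


def sample_transfer_pairs(base_instructions, all_instructions, samples_per_policy, seed=0):
    """Pick random transfer instructions for every base policy (hypothetical helper).

    Returns (base_instruction, transfer_instruction) pairs. Each pair would then
    be trained for a limited number of steps and scored with `success_rate`.
    """
    rng = random.Random(seed)
    pairs = []
    for base in base_instructions:
        # Assumption: only the base policy's own instruction is excluded.
        candidates = [i for i in all_instructions if i != base]
        for transfer in rng.sample(candidates, samples_per_policy):
            pairs.append((base, transfer))
    return pairs


tasks = ["goto the red ball", "goto the green key",
         "pickup the red ball", "goto the yellow box"]
pairs = sample_transfer_pairs(
    ["pickup the red ball", "goto the yellow box"], tasks, samples_per_policy=2
)
```

With two base policies and two samples per policy, `pairs` contains four task adaptation samples, matching the product described in the text.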
4.3 Training the transfer-model
In the next stage of our method, we train a binary classification model in order to generalize the perceived task adaptation.
The input of the proposed model is the concatenation of the sampled transfer instruction with the instructions attached to two sampled base policies. The model produces a single binary output. This output is trained to be positive if the first base policy performed better during transfer sampling than the second base policy. An example dataset is presented in Table 2.
|Transfer instruction||Instruction of first base policy||Instruction of second base policy||Class|
|Goto the green key||Pickup the red ball||Goto the yellow box||1|
|Goto the red ball||Pickup the red ball||Goto the yellow box||0|
In order to work directly with instructions in natural language, a language embedding is used. This embedding is trained end-to-end, and thus is specifically trained to encode instructions based on their transfer capabilities.
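The pairwise labels described in this section can be derived from the measured transfer performances, as the following sketch illustrates using the values of Table 1. The function name `build_pairwise_dataset` is hypothetical, and two choices are assumptions rather than details given in the text: both orderings of every pair of base policies are emitted, and ties are skipped.

```python
from itertools import permutations


def build_pairwise_dataset(measurements):
    """Turn measured transfer performances into binary comparison rows.

    `measurements` maps (base_instruction, transfer_instruction) to the
    measured performance. Each returned row is
    (transfer_instruction, first_base, second_base, label), where label is 1
    if the first base policy transferred better than the second.
    """
    by_transfer = {}
    for (base, transfer), performance in measurements.items():
        by_transfer.setdefault(transfer, []).append((base, performance))

    rows = []
    for transfer, results in by_transfer.items():
        for (base_a, perf_a), (base_b, perf_b) in permutations(results, 2):
            if perf_a == perf_b:
                continue  # assumption: ties carry no preference and are skipped
            rows.append((transfer, base_a, base_b, int(perf_a > perf_b)))
    return rows


# Measured performances taken from Table 1.
table_1 = {
    ("pickup the red ball", "goto the green key"): 0.91,
    ("pickup the red ball", "goto the red ball"): 0.76,
    ("goto the yellow box", "goto the green key"): 0.86,
    ("goto the yellow box", "goto the red ball"): 0.86,
}
rows = build_pairwise_dataset(table_1)
```

The resulting `rows` include the two examples of Table 2, together with their mirrored counterparts.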
4.4 Transfer-model usage
The resulting transfer model can be used when the agent is confronted with a new task for which it currently has no developed base control policy. Given a set of labeled base policies and a task instruction, the model can compare the candidate base policies in order to assess which base policy will result in the fastest task adaptation.
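One way to turn the pairwise classifier into a selection rule is a round-robin comparison in which the base policy with the most pairwise wins is chosen. The sketch below is an assumption about how such a selection could be implemented; the text does not prescribe an aggregation rule. The comparator `toy_prefers_first` is a hand-written stand-in for the trained transfer model and is purely illustrative.

```python
def select_base_policy(transfer_instruction, base_instructions, prefers_first):
    """Return the base instruction that wins the most pairwise comparisons.

    `prefers_first(transfer, base_a, base_b)` should return True when the
    model predicts that base_a adapts better to `transfer` than base_b.
    """
    wins = {base: 0 for base in base_instructions}
    for base_a in base_instructions:
        for base_b in base_instructions:
            if base_a != base_b and prefers_first(transfer_instruction, base_a, base_b):
                wins[base_a] += 1
    return max(base_instructions, key=wins.get)


def toy_prefers_first(transfer, base_a, base_b):
    """Illustrative stand-in for the trained model: favors a matching verb."""
    def score(base):
        t_words, b_words = transfer.split(), base.split()
        verb_bonus = 3 if t_words[0] == b_words[0] else 0
        return verb_bonus + len(set(t_words[1:]) & set(b_words[1:]))
    return score(base_a) > score(base_b)


best = select_base_policy(
    "goto the red ball",
    ["pickup the red ball", "goto the yellow box"],
    toy_prefers_first,
)
print(best)  # goto the yellow box
```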
5 Experiments
5.1 Task-adaptation in the BabyAI environment
In order to find out whether patterns can be discovered in task adaptations using instructions expressed in natural language, we performed a large set of transfer experiments in the BabyAI environment. In these experiments, we wanted to find out which parts of the instructions (verb, object, color) matter in making efficient task adaptation decisions.
The results of this experiment are summarized in Figures 3, 4 and 5. Each of these plots shows results averaged over 636 task adaptations. The green line represents performance when training a policy from scratch, while the blue line shows transfer performance averaged over all performed transfer experiments.
We see some clear patterns. The verb seems to be the most important part of the task instruction. For example, when confronted with a new task that has a goto verb, base control policies that were also trained on a goto instruction seem to transfer best. This is an expected result, as the verb of the instruction also determines the set of primitive actions required to solve the task.
5.2 Transfer model
As demonstrated in the previous experiment, the various parts of a task instruction have a different impact on task adaptation performance.
In this second experiment we trained different numbers of randomly sampled base control policies. While training can be done using any RL algorithm, we used DQN (Mnih et al., 2015) in our experiments. A base control policy is trained for at least 1 million steps, and training ends when the policy achieves a success rate of at least 95%, measured over the previous 100 iterations. The full set of training hyperparameters is described in Appendix A.
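The stopping rule for base policy training can be expressed as a small predicate. This is a minimal sketch, assuming that "iterations" correspond to entries in a list of per-episode success flags; the function name `should_stop` and its parameter names are hypothetical.

```python
def should_stop(total_steps, episode_successes,
                min_steps=1_000_000, window=100, threshold=0.95):
    """Return True once both stopping conditions described above hold.

    Training continues until at least `min_steps` environment steps have been
    taken and the success rate over the last `window` episodes reaches
    `threshold`.
    """
    if total_steps < min_steps:
        return False
    recent = episode_successes[-window:]
    if len(recent) < window:
        return False  # not enough episodes yet to measure the success rate
    return sum(recent) / window >= threshold
```

For example, a policy that succeeded in 95 of its last 100 episodes stops training only once the 1 million step minimum has also been reached.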
After developing different base control policies, we sampled task adaptations for each base control policy. The results gathered from these task adaptations were used to train the transfer model.
In Table 3, we show the performance of our model for various numbers of base control policies (k) and various numbers of task adaptation samples per base control policy. We measure model accuracy on a holdout set consisting of all expressible task adaptations not seen during sampling.
|k=8||0.61 ± 0.03||0.62 ± 0.03||0.61 ± 0.05||0.64 ± 0.05||0.65 ± 0.02||0.66 ± 0.03|
|k=10||0.62 ± 0.03||0.62 ± 0.05||0.64 ± 0.06||0.62 ± 0.04||0.66 ± 0.03||0.67 ± 0.02|
|k=12||0.67 ± 0.02||0.67 ± 0.01||0.66 ± 0.02||0.67 ± 0.02||0.68 ± 0.02||0.66 ± 0.04|
|k=14||0.64 ± 0.04||0.66 ± 0.02||0.67 ± 0.03||0.69 ± 0.01||0.69 ± 0.03||0.68 ± 0.01|
|k=18||0.67 ± 0.03||0.68 ± 0.02||0.68 ± 0.03||0.71 ± 0.01||0.70 ± 0.02||0.71 ± 0.02|
|k=20||0.69 ± 0.01||0.68 ± 0.05||0.70 ± 0.02||0.69 ± 0.04||0.71 ± 0.03||0.71 ± 0.03|
Table 3: Accuracy of the binary task adaptation classifier model. Rows correspond to the number of base control policies (k) used during training; columns correspond to the number of task adaptations sampled for each base control policy. Results are averaged over 5 runs.
Our preliminary results show that a transfer model can be developed even with a limited number of base control policies and sampled task adaptations. There is still room for improvement regarding the accuracy of the model; however, the stochastic nature of RL makes task transfer inherently noisy.
Nevertheless, the increased sample efficiency that results from the efficient task adaptation provided by our method is an essential building block in a lifelong learning setting (Silver et al., 2013).
6 Discussion and future work
In this paper, we presented a method capable of predicting, given a set of base control policies, which of these base control policies will adapt the fastest to a new, previously unseen task. In order to make assessments about task adaptation, our method uses a language embedding of the task instructions that is trained specifically for this purpose.
Our preliminary results show that a binary classification approach can make assessments about task adaptation by utilizing the semantic meaning of task instructions formulated in natural language. When confronted with an expanding set of tasks in a lifelong learning setting, our method has the potential to vastly improve sample efficiency.
However, our method still relies on a set of randomly selected base control policies and task transfer samples. Future research could optimize our method by introducing an iterative sampling method based on a more advanced selection criterion, such as instruction diversity. Another interesting extension to our method is the use of an open vocabulary.
References
- Andreas et al. (2017). Modular Multitask Reinforcement Learning with Policy Sketches. Proceedings of the 34th International Conference on Machine Learning.
- Bengio et al. (2009). Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada.
- Chevalier-Boisvert et al. (2018). BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop. In ICLR19.
- Co-Reyes et al. (2019). Meta-learning language-guided policy learning. In International Conference on Learning Representations.
- Goyal et al. (2019). Using Natural Language for Reward Shaping in Reinforcement Learning. In IJCAI19.
- Hermann et al. (2017). Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551.
- Hessel et al. (2019). Multi-task Deep Reinforcement Learning with PopArt. Proceedings of the AAAI Conference on Artificial Intelligence.
- Jaderberg et al. (2019). Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364 (6443).
- Jiang et al. (2019). Language as an Abstraction for Hierarchical Deep Reinforcement Learning. arXiv:1906.07343 [cs, stat].
- Konidaris (2006). A Framework for Transfer in Reinforcement Learning. In ICML Workshop on Structural Knowledge Transfer for Machine Learning.
- Levine et al. (2016). End-to-End Training of Deep Visuomotor Policies. The Journal of Machine Learning Research.
- Luketina et al. (2019). A Survey of Reinforcement Learning Informed by Natural Language. In IJCAI19.
- Misra et al. (2017). Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518.
- Narasimhan et al. (2018). Grounding Language for Transfer in Deep Reinforcement Learning. Journal of Artificial Intelligence Research 63.
- Ng et al. (1999). Policy invariance under reward transformations: theory and application to reward shaping. In ICML99.
- Oh et al. (2017). Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning.
- Schaul et al. (2015). Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning.
- Silver et al. (2013). Lifelong machine learning systems: beyond learning algorithms. In 2013 AAAI spring symposium series.
- Silver et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362 (6419).
- Taylor and Stone (2009). Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10.
- Taylor (2007). Representation Transfer for Reinforcement Learning. In AAAI Fall Symposium: Computational Approaches to Representation Change during Learning and Development.
Appendix A Training hyperparameters
The network architecture corresponds to the one proposed by Mnih et al. (2015). Training uses the following hyperparameters:
Experience replay size: 100,000
Discount factor γ: 0.99
Adam learning rate: 0.0000625
Adam ε: 0.00015
Target-network update steps: 8,000
Random exploration steps: 10,000
Exploration decay steps: 1,000,000
Minimum exploration ε: 0.01
During base policy training, the policy is trained until convergence. For the task adaptation sampling, 100,000 training steps are used for each task adaptation sample.
Appendix B Transfer model architecture and hyperparameters
Layers, listed from input to output:
Embedding(num_embeddings: 10, embedding_dim: 1)
Dropout(p=0.2)
Relu:Linear(in_features: 9, out_features: 24)
Relu:Linear(in_features: 24, out_features: 24)
Sigmoid:Linear(in_features: 24, out_features: 1)
Training steps: 1,000,000
Adam learning rate: 0.001
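To make the layer sizes above concrete, the sketch below shows one plausible way in which three instructions become the 9 input features of the first linear layer, and counts the resulting trainable parameters. Several points are assumptions not stated in the text: each instruction is reduced to three tokens (verb, color, object) by dropping the article "the", the verb is written as the single token "goto", and the tenth embedding index is reserved for unknown words. The names `VOCAB`, `encode`, and `model_input` are hypothetical.

```python
# Assumed token vocabulary: 9 words from Section 3 plus one reserved index.
VOCAB = ["<unk>", "pickup", "goto", "blue", "red", "green", "yellow",
         "box", "key", "ball"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(instruction):
    """Map an instruction to token ids, dropping the article 'the'."""
    words = [w for w in instruction.lower().split() if w != "the"]
    return [TOKEN_TO_ID.get(w, 0) for w in words]


def model_input(transfer, base_a, base_b):
    """Concatenate the three encoded instructions (3 x 3 = 9 token ids)."""
    return encode(transfer) + encode(base_a) + encode(base_b)


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


PARAM_COUNTS = {
    "embedding(10, 1)": 10 * 1,
    "linear(9, 24)": linear_params(9, 24),
    "linear(24, 24)": linear_params(24, 24),
    "linear(24, 1)": linear_params(24, 1),
}

example = model_input("Goto the green key", "Pickup the red ball", "Goto the yellow box")
total_params = sum(PARAM_COUNTS.values())
print(len(example), total_params)  # 9 875
```

With an embedding dimension of 1, each of the 9 token ids maps to a single scalar, which is consistent with the 9 input features of the first linear layer.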