The BabyAI platform (https://github.com/mila-iqia/babyai) is an environment designed to evaluate how well an agent follows grounded-language instructions. The quality of an agent is measured with two metrics: its success rate at following instructions and the number of episodes or demonstrations required to train it. BabyAI 1.0 [babyai_iclr19] presents results for a baseline agent trained by reinforcement learning and imitation learning (RL and IL) methods. In this technical report, we present three modifications that significantly improve on the baseline results.
Two of the modifications are to the network's architecture; the third is to the representation of the visual input. The network is modified by removing max-pooling at the lower levels of the visual encoder and by adding residual connections around the FiLM layers [perez_film:_2017]. The visual representation is modified to use learned embeddings in a Bag-of-Words fashion [mikolov2013efficient].
2 Proposed Architectural and Representational Modifications
This section describes the network architecture and BabyAI 1.0 visual representation before detailing the two architectural modifications and two alternate visual representations.
The BabyAI platform has nineteen levels, which can be categorised into two types: small and big [babyai_iclr19]. Small levels are single-room, whereas big levels usually consist of multiple rooms arranged in a maze. The BabyAI 1.0 baseline agent has two architectures, used on the small and big levels respectively. These architectures have the same overall structure, illustrated in Figure 1.a. The architecture takes two inputs: a visual input and a linguistic instruction. We use FiLM to combine the output of a convolutional ‘visual encoder’ with a GRU [cho_learning_2014] embedding of the instruction. We refer the reader to Appendix A for more details on the distinction between ‘big’ and ‘small’.
Figures 1.b and 1.c respectively present the two architectural modifications: removing pooling in the visual encoder, and adding residual connections around the image convolution and the FiLM layers. To keep the shape of the visual encoder's output consistent after pooling is removed, we change the filter size from 2×2 to 3×3. We expect these changes to improve sample efficiency because they allow more information to reach the higher layers.
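A minimal NumPy sketch of the residual FiLM block in Figure 1.c follows. The layer sizes are toy values and the modulation parameters are passed in directly rather than predicted from the instruction embedding, so this is an illustration of the residual connection, not the exact BabyAI implementation.

```python
import numpy as np

def film_residual_block(visual, gamma, beta):
    """Apply FiLM modulation with a residual (skip) connection.

    visual: (H, W, C) feature map from the visual encoder.
    gamma, beta: (C,) modulation parameters, in the full model predicted
    from the instruction embedding (passed in directly here for brevity).
    """
    modulated = gamma * visual + beta       # feature-wise affine transform
    modulated = np.maximum(modulated, 0.0)  # ReLU non-linearity
    return visual + modulated               # residual connection around FiLM

# Toy example: a 7x7 feature map with 4 channels.
feats = np.ones((7, 7, 4))
gamma = np.array([0.5, 1.0, 2.0, 0.0])
beta = np.array([0.0, -2.0, 1.0, 3.0])
out = film_residual_block(feats, gamma, beta)
```

Because the block's output is added back to its input, gradients can bypass the FiLM transform entirely, which is the usual motivation for residual connections.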
At every timestep, the agent receives visual information about a 7×7 grid of tiles immediately in the direction it is facing. BabyAI 1.0 represents a tile by a triple of integers. The first integer describes the type of object in the tile and the second the object's color. The third integer is only used if the object is a door, and describes whether the door is open, closed or locked. BabyAI 1.0 represents the visual input by concatenating all tile representations together, resulting in a tensor of size 7×7×3.
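This encoding can be sketched as follows. The integer codes below are illustrative placeholders, not BabyAI's actual encoding tables.

```python
import numpy as np

# Each tile is a triple: (object type, color, door state).
# These integer codes are placeholders for illustration only.
EMPTY, WALL, DOOR, KEY = 0, 1, 2, 3
GREY, RED = 0, 1
OPEN, CLOSED, LOCKED = 0, 1, 2

def encode_view(tiles):
    """Stack a 7x7 grid of (type, color, state) triples into a 7x7x3 tensor."""
    return np.array(tiles, dtype=np.int64)

view = [[(EMPTY, GREY, 0)] * 7 for _ in range(7)]
view[3][6] = (DOOR, RED, LOCKED)   # a locked red door ahead of the agent
obs = encode_view(view)
```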
A gridworld tile (and thus the visual input) can also be represented in two other ways: as a “bag of words” or by an RGB image. In the Bag-of-Words (“BOW”) approach, a set of symbols describing the tile is embedded in a trainable lookup table. This approach is commonly used in gridworlds such as [leike2017ai], [rajendran2015attend] and [SchraderSokoban2018]. Because a tile in BabyAI is represented by three integers, we use three lookup tables and use each integer as a key. A tile is then represented by the mean of the three looked-up feature vectors. As with the BabyAI 1.0 visual representation, the BOW representation is formed by combining all tile representations into a 3D tensor. As we set the dimensionality of a feature vector to 128, the dimensionality of the BOW visual representation is 7×7×128. This is depicted in Figure 1.d.
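A minimal sketch of the BOW lookup follows. The per-channel vocabulary size and the random tables (standing in for learned embeddings) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 128
NUM_CODES = 16  # assumed vocabulary size per integer channel

# One trainable lookup table per integer channel (type, color, state);
# random values here stand in for learned embeddings.
tables = [rng.normal(size=(NUM_CODES, EMBED_DIM)) for _ in range(3)]

def bow_embed(obs):
    """Map a 7x7x3 integer observation to a 7x7x128 BOW representation.

    Each of the three integers indexes its own table; a tile's embedding
    is the mean of the three looked-up feature vectors.
    """
    looked_up = [tables[c][obs[..., c]] for c in range(3)]  # 3 arrays of (7,7,128)
    return np.mean(looked_up, axis=0)

obs = rng.integers(0, NUM_CODES, size=(7, 7, 3))
rep = bow_embed(obs)
```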
The contents of a tile can also be represented by a 3-channel Red-Green-Blue (RGB) image. As we choose the size of a tile image to be 8×8 pixels, a tile is thus represented by an 8×8×3 tensor, and the entire visual input by a 56×56×3 RGB image. An architecture using this visual representation is illustrated in Figure 1.
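The shape arithmetic can be sketched as below; the solid-colour tile fill is a stand-in for BabyAI's actual tile renderer.

```python
import numpy as np

TILE_PX = 8   # each tile rendered as an 8x8 RGB patch
GRID = 7      # the agent sees a 7x7 grid of tiles

def render_view(tile_colours):
    """Render a 7x7 grid of per-tile RGB colours as a 56x56x3 image.

    tile_colours: (7, 7, 3) array of RGB values. Each tile is filled with
    a solid colour here; the real renderer draws object sprites.
    """
    img = np.zeros((GRID * TILE_PX, GRID * TILE_PX, 3), dtype=np.uint8)
    for i in range(GRID):
        for j in range(GRID):
            img[i * TILE_PX:(i + 1) * TILE_PX,
                j * TILE_PX:(j + 1) * TILE_PX] = tile_colours[i, j]
    return img

colours = np.zeros((7, 7, 3), dtype=np.uint8)
colours[3, 6] = (255, 0, 0)   # a red object ahead of the agent
img = render_view(colours)
```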
Architecture names have two parts. The first part is either “original”, “bow” or “pixels”, indicating which visual representation is used. The second part is optional and consists of the suffixes “_endpool” (the only pooling is at the end of the visual encoder) and/or “_res” (residual connections are added).
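The naming convention can be expressed as a small helper; this function is purely illustrative and is not part of the BabyAI codebase.

```python
def arch_name(representation, endpool=False, res=False):
    """Compose an architecture name following the convention in the text."""
    assert representation in ("original", "bow", "pixels")
    name = representation
    if endpool:
        name += "_endpool"
    if res:
        name += "_res"
    return name

# e.g. the strongest configuration studied below:
assert arch_name("bow", endpool=True, res=True) == "bow_endpool_res"
```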
To determine the best architecture and visual representation, we follow BabyAI 1.0 and experiment on the six easiest BabyAI levels: five single-room levels and one multi-room level. We then present IL performance benchmarks on all levels.
3.1 Finding the Best Architecture
We measure RL sample and computational efficiency, as well as IL performance with a varying number of demonstrations. The RL experiments were structured in two stages.
The first set of experiments investigated the architectural modifications. Results in Table 1(a) showed that removing pooling in the visual encoder significantly improved sample efficiency, whereas adding residual connections helped on some levels and hurt on others. Nevertheless, we adopted residual connections for further experiments because the sample-efficiency gain on PutNextLocal greatly outweighed the combined losses on GoToLocal and GoTo.
The second set of experiments investigated the visual representations. Results in Table 1(b) do not show wide variation in sample efficiency. Because training from pixels was difficult on the two hardest of these levels (GoTo, PutNextLocal), we halved the learning rate (α in Adam [kingma_adam:_2015], from 1e-4 to 5e-5) and reran the second set of experiments. The resulting statistics in Table 1(c) do not show much variation between the three visual representations.
We now consider the computational efficiency of training each of these five architectures. Training from pixels has a slower throughput than the other visual representations (Table 2). Because of this, and because pixels showed no clear advantage in RL sample efficiency (Tables 1(b), 1(c)), we did not experiment further with pixels.
We now investigate whether changing the visual representation to BOW and the two architectural modifications improve IL performance. babyai_iclr19 measures sample efficiency using an interpolated function fitted with a Gaussian Process (GP) [rasmussen_gaussian_2005]. In our experiments, we found that an infeasibly large number of training runs would be required to obtain a sufficiently confident sample-efficiency estimate from the GP. Instead, we follow Table 6 of [zolna2020combating], who evaluate IL by observing the success rate of agents trained with a varying number of demonstrations. zolna2020combating use small fractions of, as well as all of, 1 million demonstrations. We use 5, 10, 50, 100 and 500 thousand demonstrations, which correspond to 1/200th, 1/100th, 1/20th, 1/10th and 1/2 of the total 1 million demonstrations.
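The correspondence between demonstration counts and fractions of the full dataset is simple arithmetic:

```python
TOTAL = 1_000_000
demo_counts = [5_000, 10_000, 50_000, 100_000, 500_000]

# Denominator of each fraction of the full dataset:
# 5k, 10k, 50k, 100k and 500k demos are 1/200, 1/100,
# 1/20, 1/10 and 1/2 of the 1 million demonstrations.
fractions = [TOTAL // n for n in demo_counts]
```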
IL results in Table 3 show that training from BOW is advantageous compared with the original BabyAI 1.0 visual representation. Interestingly, we find that on hard levels with few demonstrations, the architectural modifications alone are not beneficial for training. This is offset by changing the visual representation to BOW.
3.2 Benchmarking the Best Modifications
Having constructed the BabyAI 1.1 agent, we benchmark its performance over all nineteen BabyAI levels.
Table 4 shows that the modifications found in the previous section yielded performance improvements on all levels. Four more levels (Unlock, PutNext, Synth and SynthLoc) were solved, and the success rate on the hardest level (BossLevel) increased substantially.
As BabyAI was intended to be a lightweight experimental platform, BabyAI 1.0 used a specific hand-crafted representation rather than a more realistic pixel-based one. We have shown that training from other visual representations (BOW and pixels) is feasible, and is sometimes more sample efficient (Table 3). However, learning from pixels took longer to compute (Table 2) and was more sensitive to hyperparameters. Moreover, using pixel representations for the tiles still does not bridge the reality gap between the gridworld and the 3D real world, since almost all challenging aspects of visual perception, such as occlusion, illumination and varying viewpoints, remain unmodelled. For this reason we keep the BabyAI input representation symbolic, though we switch to the more standard BOW approach for encoding the symbolic input.
This report introduces BabyAI 1.1, the latest version of the BabyAI platform. It modifies the previous version with minor changes to the baseline agent but major improvements to the baseline statistics. We hope these changes encourage researchers to (re-)use similar architectures within novel agents, so that research into grounded language learning may be conducted more computationally efficiently.
This research was mostly performed at Mila with funding by the Government of Quebec and CIFAR, and enabled by Compute Canada (www.computecanada.ca).
|RL (FPS)||1139 ± 128||927 ± 72||907 ± 69||855 ± 58||540 ± 67|
Each figure is an average over ten training runs, each initialised with a randomly chosen seed. Standard deviations are omitted for clarity. The first five experiments were run with the ‘small’ architecture; the GoTo experiments were run with the ‘big’ architecture. More experimental details are given in Appendix C.
|Level||Success Rate (%)||Demo Length (Mean ± Std)|
|BabyAI 1.0||BabyAI 1.1|
Appendix A Agent Architecture
At every timestep, an agent receives a visual input and a linguistic command of variable length, which together determine the action the agent should execute. The BabyAI baseline agent is implemented as a deep neural network that processes the visual input and the linguistic command and produces an action. The visual input and linguistic instruction are respectively encoded by a convolutional network and a Gated Recurrent Unit (GRU). These two encodings are then combined by two batch-normalised FiLM layers. A Long Short-Term Memory cell (LSTM) [hochreiter_long_1997] integrates the output of FiLM across timesteps. Finally, the integrated output is passed to the policy and value heads. The agent can be trained by RL or IL methods in conjunction with Backpropagation Through Time (BPTT) [werbos_bptt].
The ‘small’ configuration uses a unidirectional GRU for the instruction and an LSTM of dimensionality 128 for memory. The ‘big’ configuration uses a 128-dimensional bidirectional GRU with attention [bahdanau_neural_2015] and a memory LSTM of dimensionality 2048.
Appendix B Reinforcement Learning Experiments
Sample efficiency is defined as the number of RL training episodes needed to train an agent to a 99% success rate. babyai_iclr19 defines success as whether an agent can follow an instruction within a maximum number of steps, pre-defined for each level.
We use Advantage Actor-Critic (A2C) [wu_a2c] with Proximal Policy Optimisation (PPO) [schulman_proximal_2017] and Generalised Advantage Estimation (GAE) [schulman_high-dimensional_2015]. Data for A2C is collected in batches of 64 rollouts of length 40, which were used in 4 epochs of PPO. λ = 0.99 was used in GAE. If an agent completes a task after n steps, it is rewarded with 1 − 0.9·n/n_max, where n_max is the level's step budget; otherwise, no reward is given. Returns were discounted by γ = 0.99. Results in Tables 1(a) and 1(b) were optimised by Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999 and ε = 1e-5. Results in Table 1(c) used a learning rate of 5e-5.
Appendix C Imitation Learning Experiments
Different hyperparameters were used for training ‘small’ and ‘big’ models.
The small model was trained with a batch size of 256 and an epoch consisting of 25,600 demonstrations. BPTT was truncated at 20 steps.
The big model had a batch size of 128 and an epoch of 102,400 demonstrations. BPTT was truncated at 80 steps. In addition, the big model was trained with an entropy regulariser.
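The two IL configurations differ in batch size, epoch size and BPTT truncation; the number of gradient updates per epoch follows directly. A minimal sketch:

```python
# IL training configurations from the text.
configs = {
    "small": {"batch_size": 256, "epoch_demos": 25_600, "bptt_steps": 20},
    "big":   {"batch_size": 128, "epoch_demos": 102_400, "bptt_steps": 80},
}

# Gradient updates per epoch = demonstrations per epoch / batch size.
updates_per_epoch = {
    name: c["epoch_demos"] // c["batch_size"] for name, c in configs.items()
}
```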
These were optimised by Adam, with different learning rates for the small and big architectures.