BabyAI 1.1

07/24/2020 · by David Yu-Tung Hui, et al.

The BabyAI platform is designed to measure the sample efficiency of training an agent to follow grounded-language instructions. BabyAI 1.0 presents baseline results of an agent trained by deep imitation or reinforcement learning. BabyAI 1.1 improves the agent's architecture in three minor ways. This increases reinforcement learning sample efficiency by up to 3 times and improves imitation learning performance on the hardest level from 77% to 90.4%. We hope that these improvements increase the computational efficiency of BabyAI experiments and help users design better agents.

1 Introduction

The BabyAI platform (https://github.com/mila-iqia/babyai) is an environment designed to evaluate how well an agent follows grounded-language instructions. The quality of an agent is measured with two metrics: its success rate at following instructions and the number of episodes or demonstrations required to train it. BabyAI 1.0 [babyai_iclr19] presents results for a baseline agent trained by reinforcement and imitation learning (RL and IL) methods. In this technical report, we present three modifications that significantly improve the baseline results.

Two modifications are to the network’s architecture and the third is to the representation of the visual input. The network is modified by removing max-pooling at the lower levels of the visual encoder and by adding residual connections around the FiLM layers [perez_film:_2017]. The visual representation is modified to use learned embeddings in a Bag-of-Words fashion [mikolov2013efficient].

2 Proposed Architectural and Representational Modifications

This section describes the network architecture and BabyAI 1.0 visual representation before detailing the two architectural modifications and two alternate visual representations.

Figure 1: The five architectures. (a), (b) and (c): BabyAI 1.0 architecture, first modification removing pooling in the visual encoder and second modification introducing residual connections around FiLM. (d), (e): respective architectural changes needed to use BOW and pixel visual representations. Arrows at the bottom of architectures (d) and (e) feed into an architecture identical from the first convolutional layer of (c) onwards.

The BabyAI platform has nineteen levels, which can be categorised into two types: small and big [babyai_iclr19]. Small levels consist of a single room, whereas big levels are usually mazes of several rooms. The BabyAI 1.0 baseline agent has two architectures, used on the small and big levels respectively. These architectures have the same structure and are illustrated in Figure 1.a. The architecture takes two inputs: a visual input and a linguistic instruction. We use FiLM to combine the outputs of a convolutional ‘visual encoder’ with a GRU [cho_learning_2014] embedding of the instruction. We refer the reader to Appendix A for more details concerning the distinction between ‘big’ and ‘small’.
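For concreteness, the sketch below shows how a FiLM layer can modulate convolutional visual features with an instruction embedding. It is a minimal PyTorch illustration under assumed layer sizes, not the exact BabyAI implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Minimal FiLM layer: scales and shifts image features using
    parameters predicted from the instruction embedding."""

    def __init__(self, instr_dim: int, n_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(n_channels, n_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(n_channels)
        # gamma (scale) and beta (shift) are linear functions of the instruction
        self.gamma = nn.Linear(instr_dim, n_channels)
        self.beta = nn.Linear(instr_dim, n_channels)

    def forward(self, x, instr):
        # x: (batch, n_channels, H, W); instr: (batch, instr_dim)
        out = self.bn(self.conv(x))
        gamma = self.gamma(instr).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(instr).unsqueeze(-1).unsqueeze(-1)
        return torch.relu(gamma * out + beta)
```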

Figures 1.b and 1.c respectively present the two architectural modifications: removing pooling in the visual encoder and adding residual connections around the image convolution and the FiLM layers. To keep the shape of the visual encoder's output consistent after pooling is removed, we adjust the filter size of its convolutions. We expect these changes to improve sample efficiency because they allow more information to be transmitted to higher layers.
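A rough sketch of the second modification, reusing the hypothetical FiLM module from the previous sketch: the skip connection adds the block's input back to its output, so lower-level visual information can bypass the FiLM transformation.

```python
class ResidualFiLMBlock(nn.Module):
    """FiLM layer wrapped with a residual (skip) connection, so that
    information from lower layers can propagate directly upwards."""

    def __init__(self, instr_dim: int, n_channels: int):
        super().__init__()
        self.film = FiLM(instr_dim, n_channels)  # hypothetical module from the previous sketch

    def forward(self, x, instr):
        # the FiLM layer above preserves spatial shape (padding, no pooling),
        # so its input and output can be summed element-wise
        return x + self.film(x, instr)
```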

At every timestep, the agent receives visual information about a 7 × 7 grid of tiles immediately in the direction it is facing. BabyAI 1.0 represents a tile by a triple of integers: the first describes the type of object in the tile and the second the object’s colour. The third integer is only used if the object is a door, and describes whether the door is open, closed or locked. BabyAI 1.0 represents a visual input by concatenating all tile representations together, resulting in a tensor of size 7 × 7 × 3.
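As an illustration, the snippet below inspects an observation from a BabyAI environment. It is a minimal sketch assuming the babyai package (and its registered gym level IDs) is installed; the printed instruction is only an example.

```python
import gym
import babyai  # noqa: F401  (importing registers the BabyAI gym environments)

env = gym.make("BabyAI-GoToRedBall-v0")
obs = env.reset()

# Symbolic visual input: one (object type, colour, door state) triple per tile.
print(obs["image"].shape)  # (7, 7, 3)

# The grounded-language instruction accompanying the observation.
print(obs["mission"])      # e.g. "go to the red ball"
```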

A gridworld tile (and thus the visual input) can also be represented in two other ways: as a “bag of words” or by an RGB image. In the Bag-of-Words (“BOW”) approach, a set of symbols describing the tile is embedded with a trainable lookup table. This approach is commonly used in gridworlds such as those of [leike2017ai], [rajendran2015attend] and [SchraderSokoban2018]. Because a tile in BabyAI can be represented by three integers, we use three lookup tables, with each integer acting as a key. A tile is then represented by the mean of the three looked-up feature vectors. As with the BabyAI 1.0 visual representation, the BOW representation is formed by combining all tile representations into a 3D tensor. As we set the dimensionality of a feature vector to 128, the dimensionality of the BOW visual representation is 7 × 7 × 128. This is depicted in Figure 1.d.
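A minimal sketch of the BOW embedding, assuming the standard MiniGrid vocabulary sizes (roughly 11 object types, 6 colours and 3 door states) and the 128-dimensional feature vectors mentioned above; the actual BabyAI 1.1 code may differ in details.

```python
import torch
import torch.nn as nn

class BOWImageEmbedding(nn.Module):
    """Embeds each of the three integers describing a tile with its own
    trainable lookup table and averages the looked-up vectors."""

    def __init__(self, vocab_sizes=(11, 6, 3), embedding_dim: int = 128):
        super().__init__()
        # one trainable lookup table per channel of the symbolic input
        self.tables = nn.ModuleList(
            [nn.Embedding(v, embedding_dim) for v in vocab_sizes]
        )

    def forward(self, image):
        # image: (batch, 7, 7, 3) tensor of integer (long) symbols
        vectors = [table(image[..., i]) for i, table in enumerate(self.tables)]
        # mean of the three looked-up feature vectors -> (batch, 7, 7, 128)
        tile_features = torch.stack(vectors).mean(dim=0)
        # channels-first layout expected by the convolutional encoder
        return tile_features.permute(0, 3, 1, 2)
```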

The contents of a tile can also be represented by a 3-channel Red-Green-Blue (RGB) image. Each tile is rendered as a small square image, so the full 7 × 7 grid is represented by a single 3-channel RGB image formed by tiling these renderings. An architecture using this visual representation is illustrated in Figure 1.e.
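One simple way to feed the RGB rendering into the rest of architecture (c) is to collapse each rendered tile back into a single feature vector with a strided convolution. The stem below is only a hypothetical sketch, and the tile size is an assumption; the report does not specify the exact layers used.

```python
import torch
import torch.nn as nn

class PixelStem(nn.Module):
    """Hypothetical stem: one convolution whose kernel and stride both span a
    full rendered tile, mapping the RGB image back to a 7x7 grid of feature
    vectors for the downstream convolutional encoder."""

    def __init__(self, tile_size: int = 8, out_channels: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels,
                              kernel_size=tile_size, stride=tile_size)

    def forward(self, rgb):
        # rgb: (batch, 3, 7 * tile_size, 7 * tile_size)
        return torch.relu(self.conv(rgb))  # -> (batch, out_channels, 7, 7)
```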

Architecture names are structured in two parts. The first part is either “original”, “bow” or “pixels”, indicating which visual representation is used. The second, optional part indicates whether “_endpool” (the only source of pooling is at the end) and/or “_res” (residual connections are added) apply to the architecture.

3 Experiments

To determine the best architecture and visual representation, we follow BabyAI 1.0 and experiment on the six easiest BabyAI levels, which consist of five single-room levels and one multi-room level. We then present IL performance benchmarks on all levels.

3.1 Finding the Best Architecture

We measure RL sample efficiency, RL computational efficiency and IL performance with varying numbers of demonstrations. The RL experiments were structured in two stages.

The first set of experiments investigated the architectural modifications. Results in Table 1(a) showed that removing pooling in the visual encoder significantly improved sample efficiency, whereas adding residual connections produced both increases and decreases. Nevertheless, we adopted residual connections for further experiments because the sample efficiency increase on PutNextLocal greatly outweighed the combined decrease on GoToLocal and GoTo.

The second set of experiments investigated visual representations. Results in Table 1(b) do not show a wide variation in sample efficiency. Because training from pixels was difficult on the two hardest levels (GoTo, PutNextLocal), we halved the learning rate of Adam [kingma_adam:_2015] from 1e-4 to 5e-5 and reran the second set of experiments. The resulting statistics in Table 1(c) do not show much variation between the three visual representations.

We now consider the computational efficiency of training each of these five architectures. Training from pixels has a lower throughput than the other visual representations (Table 2). Because of this, and because there is no clear advantage in RL sample efficiency (Tables 1(b), 1(c)), we do not experiment further with pixels.

Next, we investigate whether changing the visual representation to BOW and applying the two architectural modifications improve IL performance. babyai_iclr19 measures sample efficiency using an interpolated function fitted with a Gaussian Process (GP) [rasmussen_gaussian_2005]. In our experiments we found that an infeasibly large number of training runs would be required to obtain a sufficiently confident sample-efficiency estimate from the GP. Instead, we follow Table 6 of [zolna2020combating], who evaluate IL by observing the success rate of agents trained with varying numbers of demonstrations. zolna2020combating use fractions of, as well as all of, 1 million demonstrations. We use 5, 10, 50, 100 and 500 thousand demonstrations, corresponding to 1/200th, 1/100th, 1/20th, 1/10th and one half of the total 1 million demonstrations.

IL results in Table 3 show that training from BOW is advantageous compared with the original BabyAI 1.0 visual representation. Interestingly, we find that for hard levels with a small number of demonstrations, the architectural modifications alone are not beneficial for training; this is offset by changing the visual representation to BOW.

3.2 Benchmarking the Best Modifications

Having constructed the BabyAI 1.1 agent, we benchmark its performance over all nineteen BabyAI levels.

Table 4 shows that the modifications found in the previous section yielded performance improvements on nearly all levels. Four more levels (Unlock, PutNext, Synth and SynthLoc) were solved, and the success rate on the hardest level (BossLevel) increased by 13.4 percentage points, from 77% to 90.4%.

4 Conclusion

As BabyAI was intended to be a lightweight experimental platform, BabyAI 1.0 used a specific hand-crafted representation rather than a more realistic pixel-based representation. We have shown that training from other visual representations (BOW and pixels) is feasible and is sometimes more sample efficient (Table 3). However, learning from pixels took longer to compute (Table 2) and was more sensitive to hyperparameters. Moreover, using pixel representations for the tiles still does not bridge the reality gap between the gridworld and the 3D real world, since almost all challenging aspects of visual perception, such as occlusion, illumination and differing viewpoints, are still not modelled. For this reason we keep the BabyAI input representation symbolic, though we switch to the more standard BOW approach for encoding the symbolic input.

This report introduces BabyAI 1.1, the latest version of the BabyAI platform. It modifies the previous version with minor changes to the baseline agent but major improvements to the baseline statistics. We hope that these changes encourage researchers to (re-)use similar architectures within novel agents, so that research into grounded language learning can be conducted in more computationally efficient ways.

Acknowledgements

This research was mostly performed at Mila with funding by the Government of Quebec and CIFAR, and enabled by Compute Canada (www.computecanada.ca).

Shading legend for Table 1: an unshaded cell means we accept the null hypothesis that there is no significant difference; one shading marks rejection of the null hypothesis at 1% significance due to a significant increase in sample efficiency (the number is smaller); the other marks rejection at 1% significance due to a significant decrease in sample efficiency (the number is bigger). A cell is shaded depending on whether its values are statistically significantly different from those of the cell to its left. Statistical significance is computed using a two-tailed T-test with unequal variance.

Level | original | original_endpool | original_endpool_res
GoToRedBallGrey | 21 ± 5 | 21 ± 6 | 21 ± 5
GoToRedBall | 273 ± 27 | 200 ± 16 | 179 ± 17
GoToLocal | 1311 ± 251 | 381 ± 30 | 437 ± 45
PickupLoc | 1797 ± 290 | 743 ± 132 | 710 ± 166
PutNextLocal | 2984 ± 172 | 2169 ± 739 | 1009 ± 128
GoTo | 1601 ± 463 | 454 ± 69 | 813 ± 278
(a) RL sample efficiency for different architectures. See the last paragraph of Section 2 for the explanation of architecture names.
Level | original_endpool_res | bow_endpool_res | pixels_endpool_res
GoToRedBallGrey | 21 ± 5 | 24 ± 2 | 34 ± 4
GoToRedBall | 179 ± 17 | 177 ± 2 | 172 ± 2
GoToLocal | 437 ± 45 | 611 ± 760 | 242 ± 15
PickupLoc | 710 ± 166 | 982 ± 266 | 1082 ± 385
PutNextLocal | 1092 ± 143 | 876 ± 104 | Not Trainable
GoTo | 813 ± 278 | 817 ± 502 | Not Trainable
(b) RL sample efficiency for different visual representations with learning rate 1e-4. See the last paragraph of Section 2 for the explanation of architecture names.
Level | original_endpool_res | bow_endpool_res | pixels_endpool_res
GoToRedBallGrey | 35 ± 5 | 30 ± 4 | 44 ± 11
GoToRedBall | 263 ± 22 | 164 ± 3 | 155 ± 5
GoToLocal | 606 ± 81 | 449 ± 176 | 336 ± 28
PickupLoc | 1732 ± 579 | 1461 ± 422 | 1308 ± 421
PutNextLocal | 1277 ± 252 | 876 ± 104 | 1301 ± 320
GoTo | 984 ± 484 | 803 ± 525 | 845 ± 329
(c) RL sample efficiency for different visual representations with learning rate 5e-5. See the last paragraph of Section 2 for the explanation of architecture names.
Table 1: RL sample efficiency (mean ± std, thousands of episodes) in the six easiest BabyAI levels with respect to different architectures (Table 1(a)) and visual representations with a 1e-4 learning rate (Table 1(b)) and a 5e-5 learning rate (Table 1(c)). Each figure is an average over ten training runs, each initialised with a randomly chosen seed. All experiments were run with the ‘small’ architecture. More experimental details are in Appendix B.
Architecture | original | original_endpool | original_endpool_res | bow_endpool_res | pixels_endpool_res
RL throughput (FPS) | 1139 ± 128 | 927 ± 72 | 907 ± 69 | 855 ± 58 | 540 ± 67
Table 2: Frames Per Second (mean ± std) of RL training with different architectures, averaged across the six easiest BabyAI levels. Inter-level differences are negligible.
Shading legend for Table 3: an unshaded cell means we accept the null hypothesis that there is no significant difference; one shading marks rejection of the null hypothesis at 1% significance due to a significant increase in performance (the number is bigger); the other marks rejection at 1% significance due to a significant decrease in performance (the number is smaller). A cell is shaded depending on whether its values are statistically significantly different from those of the cell to its left. Statistical significance is computed using a two-tailed T-test with unequal variance.
Level | Number of Demos (thousands) | original | original_endpool_res | bow_endpool_res
GoToRedBallGrey | 5 | 99.5 ± 0.1 | 99.5 ± 0.1 | 99.7 ± 0.1
 | 10 | 99.7 | 99.8 ± 0.1 | 99.9
 | 50 | 100 | 100 | 100
 | 100 | 100 | 100 | 100
 | 500 | 100 | 100 | 100
GoToRedBall | 5 | 89.6 ± 0.3 | 91.3 ± 0.6 | 99.3 ± 0.3
 | 10 | 93.1 ± 0.8 | 95.6 ± 0.7 | 99.8 ± 0.1
 | 50 | 99.2 ± 0.2 | 99.9 | 100
 | 100 | 99.7 | 100 | 100
 | 500 | 99.9 | 100 | 100
GoToLocal | 5 | 72.5 ± 1.0 | 71.6 ± 1.4 | 84.2 ± 2.0
 | 10 | 79.9 ± 1.2 | 79.7 ± 1.8 | 94.2 ± 0.8
 | 50 | 95.3 ± 0.5 | 99.6 ± 0.1 | 99.8 ± 0.1
 | 100 | 97.8 ± 0.3 | 99.9 | 99.9
 | 500 | 99.6 ± 0.1 | 100 | 100
PutNextLocal | 5 | 22.3 ± 1.7 | 12.0 ± 1.8 | 12.5 ± 1.2
 | 10 | 39.1 ± 3.5 | 16.2 ± 2.3 | 24.9 ± 3.2
 | 50 | 80.8 ± 1.4 | 90.6 ± 3.5 | 88.6 ± 11.0
 | 100 | 93.9 ± 0.4 | 99.5 ± 0.1 | 99.5 ± 0.5
 | 500 | 99.3 ± 0.2 | 100 | 100
PickupLoc | 5 | 53.0 ± 1.3 | 35.9 ± 1.5 | 60.3 ± 1.8
 | 10 | 65.3 ± 1.5 | 53.7 ± 1.2 | 74.9 ± 3.9
 | 50 | 90.8 ± 1.5 | 96.2 ± 0.5 | 97.0 ± 0.3
 | 100 | 96.4 ± 0.5 | 98.5 ± 0.4 | 98.6 ± 0.3
 | 500 | 99.5 ± 0.2 | 99.8 ± 0.1 | 99.8 ± 0.1
GoTo | 10 | 70.4 ± 1.1 | 76.3 ± 5.0 | 96.1 ± 0.4
 | 100 | 94.9 ± 0.3 | 99.3 ± 0.1 | 99.4
Table 3: IL success rate (%) in the six easiest BabyAI levels with respect to varying numbers of demonstrations, architectures and visual representations. Experiments with a success rate of at least 99% are considered successful and are bolded. Each figure is an average over ten training runs, each initialised with a randomly chosen seed. Negligible standard deviations are omitted for clarity. The first five levels were run with the ‘small’ architecture; GoTo experiments were run with the ‘large’ architecture. More experimental details are given in Appendix C.
Level | Success Rate (%), BabyAI 1.0 | Success Rate (%), BabyAI 1.1 | Demo Length (mean ± std)
GoToObj | 100 | 100 | 5.18 ± 2.38
GoToRedBallGrey | 100 | 100 | 5.81 ± 3.29
GoToRedBall | 100 | 100 | 5.38 ± 3.13
GoToLocal | 99.8 | 100 | 5.04 ± 2.76
PutNextLocal | 99.2 | 100 | 12.4 ± 4.54
PickupLoc | 99.4 | 100 | 6.13 ± 2.97
GoToObjMaze | 99.9 | 100 | 70.8 ± 48.9
GoTo | 99.4 | 100 | 56.8 ± 46.7
Pickup | 99 | 100 | 57.8 ± 46.7
UnblockPickup | 99 | 100 | 57.2 ± 50
Open | 100 | 100 | 31.5 ± 30.5
Unlock | 98.4 | 100 | 81.6 ± 61.1
PutNext | 98.8 | 99.6 | 89.9 ± 49.6
Synth | 97.3 | 100 | 50.4 ± 49.3
SynthLoc | 97.9 | 100 | 47.9 ± 47.9
GoToSeq | 95.4 | 96.7 | 72.7 ± 52.2
SynthSeq | 87.7 | 93.9 | 81.8 ± 61.3
GoToImpUnlock | 87.2 | 84.0 | 110 ± 81.9
BossLevel | 77 | 90.4 | 84.3 ± 64.5
Table 4: Comparison of baseline IL results for all BabyAI levels. Experiments with a success rate of at least 99% are considered successful and are bolded. On all levels, the ‘big’ configuration was trained on 1 million demonstrations until the loss converged. As running these experiments is computationally expensive, we present results for a single seed. Success rate is calculated over 512 trials once the loss has converged.

References

Appendix A Agent Architecture

At every timestep, an agent receives a visual input and a linguistic command of variable length, and must execute an action. The BabyAI baseline agent is implemented as a deep neural network which processes the visual input and the linguistic command and produces an action. The visual input and the linguistic instruction are respectively encoded by a convolutional network and a Gated Recurrent Unit (GRU). These two encodings are then combined by two batch-normalised FiLM layers. A Long Short-Term Memory cell (LSTM) [hochreiter_long_1997] integrates the output of FiLM across timesteps. Finally, the integrated output is passed to policy and value heads. The agent can be trained by RL or IL methods in conjunction with Backpropagation Through Time (BPTT) [werbos_bptt].

The ‘small’ configuration uses a unidirectional GRU and an LSTM memory, both of dimensionality 128. The ‘big’ configuration uses a 128-dimensional bidirectional GRU with attention [bahdanau_neural_2015] and a memory LSTM of dimensionality 2048.
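For reference, the two configurations can be summarised as below; the field names are ours, while the values are those quoted above.

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    instr_rnn_dim: int    # hidden size of the instruction GRU
    bidirectional: bool   # bidirectional GRU with attention?
    memory_dim: int       # hidden size of the memory LSTM

SMALL = AgentConfig(instr_rnn_dim=128, bidirectional=False, memory_dim=128)
BIG = AgentConfig(instr_rnn_dim=128, bidirectional=True, memory_dim=2048)
```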

Appendix B Reinforcement Learning Experiments

Sample efficiency is defined as the number of RL training episodes needed to train an agent to a 99% success rate. babyai_iclr19 defines success as whether an agent can follow an instruction within a maximum number of steps that is pre-defined for each level.

We use Advantage Actor-Critic (A2C) [wu_a2c] with Proximal Policy Optimisation (PPO) [schulman_proximal_2017] and Generalised Advantage Estimation (GAE) [schulman_high-dimensional_2015]. Data for A2C is collected in batches of 64 rollouts of length 40, which were used in 4 epochs of PPO with GAE. If an agent completes a task within the step limit, it receives a reward that decreases with the number of steps taken; otherwise, no reward is given. Returns were discounted. Results in Table 1(a) and Table 1(b) were optimised by Adam with a learning rate of 1e-4; results in Table 1(c) used a learning rate of 5e-5.
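For readers unfamiliar with GAE, the sketch below computes advantages and returns for a single rollout. The discount and smoothing coefficients shown are placeholders rather than the values used in the report, and episode-termination handling is omitted for brevity.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one rollout.

    rewards:    per-step rewards collected by the agent
    values:     per-step value estimates from the critic
    last_value: bootstrap value of the state following the rollout
    gamma, lam: placeholder discount / smoothing coefficients
    """
    advantages = [0.0] * len(rewards)
    next_value, running_advantage = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # one-step temporal-difference error
        delta = rewards[t] + gamma * next_value - values[t]
        # exponentially weighted sum of TD errors
        running_advantage = delta + gamma * lam * running_advantage
        advantages[t] = running_advantage
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```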

Appendix C Imitation Learning Experiments

Different hyperparameters were used for training ‘small’ and ‘big’ models.

The small model was trained with a batch size of 256 and an epoch consisting of 25600 demos. Backpropagation Through Time (BPTT) was truncated at 20 steps.

The large model had a batch size of 128 and an epoch of 102400 demonstrations. BPTT was truncated at 80 steps. In addition, the model was trained with an entropy regulariser.

Both models were optimised by Adam. The learning rate differed between the small and large architectures, while the remaining Adam hyperparameters were shared.
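To make the training setup concrete, here is a minimal sketch of one behavioural-cloning epoch with truncated BPTT. The `model`, its `initial_memory` method and the batch format are hypothetical, and the truncation length matches the small configuration above.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, demo_batches, optimiser, bptt_steps=20):
    """One imitation learning epoch: behavioural cloning on demonstrations,
    with backpropagation through time truncated every `bptt_steps` steps."""
    for obs_seq, instr, action_seq in demo_batches:  # hypothetical time-major batch format
        memory = model.initial_memory(batch_size=obs_seq.shape[1])
        for start in range(0, obs_seq.shape[0], bptt_steps):
            # detach the memory so gradients stop at the truncation boundary
            memory = tuple(m.detach() for m in memory)
            loss = 0.0
            for t in range(start, min(start + bptt_steps, obs_seq.shape[0])):
                logits, memory = model(obs_seq[t], instr, memory)
                loss = loss + F.cross_entropy(logits, action_seq[t])
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```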