Task-Oriented Language Grounding for Language Input with Multiple Sub-Goals of Non-Linear Order

10/27/2019 · by Vladislav Kurenkov, et al.

In this work, we analyze the performance of general deep reinforcement learning algorithms for a task-oriented language grounding problem, where the language input contains multiple sub-goals and their order of execution is non-linear. We generate a simple instructional language for the GridWorld environment, built around three language elements (order connectors) defining the order of execution: one linear – "comma" – and two non-linear – "but first" and "but before". We apply one of the deep reinforcement learning baselines – Dueling DQN with frame stacking – and ablate several extensions such as Prioritized Experience Replay and the Gated-Attention architecture. Our results show that the introduction of non-linear order connectors improves the success rate on instructions with a higher number of sub-goals by a factor of 2-3, but it still does not exceed 20%. Gated-Attention provides no competitive advantage over concatenation in this setting. Source code and experiment results are available at https://github.com/vkurenkov/language-grounding-multigoal




1 Introduction

Task-oriented language grounding refers to the process of extracting semantically meaningful representations of language by mapping it to visual elements and actions in the environment in order to perform the task specified by the instruction Chaplot et al. (2018).

Recent works in this paradigm focus on a wide spectrum of natural language instruction semantics: different characteristics of referents (colors, relative positions, relative sizes), multiple tasks, and multiple sub-goals Chaplot et al. (2018); Hermann et al. (2017); Yu et al. (2017, 2018, 2018); Chevalier-Boisvert et al. (2018); Oh et al. (2017).

In this work, we are interested in language input with the semantics of multiple sub-goals, focusing on the order of execution, as natural language contains elements that may lead to a non-linear order of execution (e.g. “Take the garbage out, but first wash the dishes“). We refer to such elements as non-linear order connectors in the setting of task-oriented language grounding.

In particular, we want to answer: what is the performance of general deep reinforcement learning algorithms on this kind of language input? Can they successfully learn all of the training instructions? Can they generalize to an unseen number of sub-goals?

To answer these questions, we generate an instructional language for a modified GridWorld environment, where the agent needs to visit items specified in a given instruction. The language is built around three order connectors: one linear – “comma“, and two non-linear – “but first“ and “but before“, producing instructions like “Go to the red, go to the blue, but first go to the green“. In contrast to Oh et al. (2017), where the sub-goals are separated in advance and the order is known, in this work, we specifically aim to study whether the agent can learn to determine the order of execution based on the connectors present in the language input.

We apply one of the off-policy deep reinforcement learning baselines – Dueling DQN – and examine the impact of several extensions such as the Gated-Attention architecture Chaplot et al. (2018) and Prioritized Experience Replay Schaul et al. (2016).

First, we discover that training on both non-linear and linear order connectors at the same time improves the performance on the latter when compared to training on linear connectors alone. Second, we observe that generalization to an unseen number of sub-goals using general methods of deep reinforcement learning is possible but still very limited for both linear and non-linear connectors. Third, we find that there is no advantage of using gated-attention over simple concatenation in this setting. And fourth, we observe that the usage of prioritized experience replay in this setup may be enough to achieve near-perfect training performance. (Source code and experiment results are available at https://github.com/vkurenkov/language-grounding-multigoal)

2 Problem Formulation

2.1 Environment

The target environment is a modified version of GridWorld. Figure 1 illustrates one of the possible layouts. The environment is episodic, bounded by a maximum number of steps, and can be restarted indefinitely. The goal of the agent is to visit every object in the right order. For example, in Figure 1, if the provided linguistic instruction is “Go to the blue object, but first go to the green object“, then the agent must visit the green object first and then move to the blue object. If the agent violates the order of visitation or arrives at the wrong object, the episode terminates.




Figure 1: The modified GridWorld environment. The goal of the agent (A) is to visit items in the right order as specified in a given language input. R – Red, B – Blue, G – Green.

2.2 Language

Figure 2: The finite state automaton describing relations between the order connectors in the target language.

The semantics of interest are the presence of multiple sub-goals and the order of their execution. To capture them, we generate an instructional language, where every instruction describes what objects (referents) the agent should visit and in what order.

The generated language operates on only one task – “Go to the …“, and three referents: red, blue, and green. An instruction may contain multiple sub-goals, and these sub-goals are connected in special ways. In the setting of task-oriented language grounding, we denote the connections between sub-goals that define the order of execution as the order connectors.

We distinguish between linear and non-linear order connectors. The former preserve the order of the sub-goals as they appear in the language input: in “Go to the red, go to the blue“, “comma“ is a linear connector, as the order of execution matches the order of the sub-goals in the input. Non-linear connectors may change the order of execution arbitrarily: in “Go to the red, go to the green, but first go to the blue“, “but first“ is a non-linear connector, as the last sub-goal in the language input should be executed first.

The generated language contains three order connectors: one linear – “comma“, and two non-linear – “but first“ and “but before“. The connector “but before“ swaps the order of two consecutive sub-goals, e.g. “Go to the red, go to the green, but before go to the blue“ resolves to [red, blue, green]. Figure 2 depicts how the language is generated on the level of order connectors. If a “but before“ or “but first“ connector is present, we restrict it to be the last order connector in the instruction to avoid ambiguity in the language. The generated language excludes all instructions that resolve to visiting the same item two or more times in a row (e.g. “Go to the red, go to the blue, but first go to the red“ is not allowed).
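The resolution of connectors into an execution order can be sketched as follows (a minimal sketch; the function name and representation are ours, not from the paper's code):

```python
def resolve_order(subgoals, connectors):
    """Resolve an instruction's sub-goals into an execution order.

    subgoals:   referents in the order they appear in the instruction,
                e.g. ["red", "green", "blue"]
    connectors: the connector preceding each sub-goal after the first,
                e.g. ["comma", "but first"] (len(subgoals) - 1 items);
                the paper restricts "but first"/"but before" to be last.
    """
    order = [subgoals[0]]
    for subgoal, connector in zip(subgoals[1:], connectors):
        if connector == "comma":
            order.append(subgoal)       # linear: keep the input order
        elif connector == "but first":
            order.insert(0, subgoal)    # non-linear: execute before everything
        elif connector == "but before":
            order.insert(-1, subgoal)   # non-linear: swap with the previous sub-goal
        else:
            raise ValueError(f"unknown connector: {connector}")
    return order
```

For instance, “Go to the red, go to the blue, but first go to the green“ resolves to [green, red, blue], matching Table 1.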

2.3 Evaluation

Language Subset  | Example Instruction                                        | Example Order
Comma            | Go to the red, go to the blue, go to the green             | red, blue, green
Comma-ButFirst   | Go to the red, go to the blue, but first go to the green   | green, red, blue
Comma-ButBefore  | Go to the red, go to the blue, but before go to the green  | red, green, blue
Table 1: Example instructions of the target language subsets.
Language Subset  | Order Connectors   | # Training | # Testing
Comma            | comma              | 21         | 168
Comma-ButFirst   | comma, but first   | 39         | 336
Comma-ButBefore  | comma, but before  | 39         | 336
Table 2: Description of the target language subsets. Comma is a subset of both Comma-ButFirst and Comma-ButBefore.

The evaluation is built around the allocation of instructions in order to assess the effect of different order connectors. We extract three subsets of the language: Comma, Comma-ButFirst, and Comma-ButBefore. The first subset is designated to assess the performance on language input with linear order connectors only. The goal of the last two subsets is to measure the performance in the presence of non-linear connectors of different types: an absolute position change (Comma-ButFirst) and a relative position change (Comma-ButBefore).

For every subset, we make two splits: one for the evaluation of training performance, the other for the evaluation of generalization to a higher number of sub-goals. The splits are obtained as depicted in Figure 3. We limit every subset to instructions with at most six sub-goals. Then we split it into training and testing parts. The training part contains all the instructions bounded by 3 sub-goals, and the testing part contains the rest.

Table 2 describes what order connectors are present in every subset, and how many instructions are in the training and testing splits. The Comma is a subset of both Comma-ButFirst and Comma-ButBefore.

To measure the training performance, we vary the proportion of training instructions from 0.1 to 0.9 with a step of 0.1. The performance on the training instructions is quantified as the success rate on these instructions. To measure the testing performance, we train the models on the biggest proportion of the training instructions.

As for the environment’s layout, we randomly sample only one instance and use it in both training and testing scenarios for all of the algorithms to reduce computing time, as generalization over layouts is not in the scope of this study.

Figure 3: The split of instructions within the generated language subsets. In the end, we obtain two splits: training and testing. N% of the instructions with 1-3 sub-goals are sampled for training.

3 Methods

3.1 Algorithms

We focus on one of the off-policy reinforcement learning baseline algorithms – single-actor Dueling DQN (DDQN) Wang et al. (2016). In particular, we vary two components: the network’s architecture and the form of the experience replay.

For the latter, we examine the Prioritized Experience Replay Schaul et al. (2016), which is hypothesized to provide a form of implicit curriculum.
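Proportional prioritization can be sketched as follows (a minimal O(n) sketch of Schaul et al. (2016), without the sum-tree used in efficient implementations; hyperparameter values are illustrative assumptions):

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay (after Schaul et al., 2016).

    alpha controls how strongly TD-error magnitudes skew sampling,
    beta the strength of the importance-weight correction.
    """
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so they are seen at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.buffer[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

The implicit-curriculum hypothesis is that transitions near sub-goal completions carry large TD errors early in training and are therefore replayed more often.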

For the former, we experiment with the Gated-Attention (GA) and Concatenation (Cat) architectures Chaplot et al. (2018). GA was shown to be advantageous for language input in the ViZDoom environment Kempka et al. (2016); Chaplot et al. (2018). However, the language input in Chaplot et al. (2018) primarily concentrated on different object attributes, so we are interested in whether this mechanism has a positive impact on instructions with multiple sub-goals as well.
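The two fusion mechanisms can be contrasted in a few lines (a numpy sketch; the gate projection W stands in for a learned linear layer, and the shapes are illustrative assumptions):

```python
import numpy as np

def concat_fusion(image_feats, instr_embedding):
    """Concatenation: flatten conv features and append the instruction vector."""
    return np.concatenate([image_feats.ravel(), instr_embedding])

def gated_attention_fusion(image_feats, instr_embedding, W):
    """Gated-Attention fusion in the spirit of Chaplot et al. (2018).

    image_feats:     (C, H, W) conv feature maps
    instr_embedding: (D,) instruction embedding
    W:               (C, D) projection producing one gate per channel
                     (in the paper this is a learned linear layer)
    """
    gate = 1.0 / (1.0 + np.exp(-W @ instr_embedding))  # sigmoid, shape (C,)
    # Broadcast the per-channel gates over the spatial dimensions.
    return image_feats * gate[:, None, None]
```

Concatenation leaves the burden of relating language to vision entirely to later layers, while gated attention multiplicatively suppresses feature channels irrelevant to the instruction.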

The network architecture is as in Chaplot et al. (2018), but instead of a recurrent part, we stack the 4 previous observations as in Misra et al. (2017).

The training loop is organized such that the number of target-network updates is kept constant across all instructions: instead of sampling instructions at random, we iterate over all training instructions, similarly to Misra et al. (2017), and only then update the target network.

3.2 Reward Shaping

The problem is tackled under the most informative reward-shaping scheme: it incorporates information on how many steps remain until the successful execution of the current sub-goal.

In order to preserve the optimal policy of the original Markov Decision Process, we apply a potential-based reward transformation Ng et al. (1999).
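A potential-based transformation adds F(s, s') = γΦ(s') − Φ(s) to the environment reward, which Ng et al. (1999) show leaves the optimal policy unchanged. A sketch with the potential Φ(s) = −(steps remaining to the current sub-goal) — our reading of the "most informative" scheme; the exact potential used is an assumption:

```python
def shaped_reward(base_reward, steps_left, next_steps_left, gamma=0.99):
    """Potential-based reward shaping (Ng et al., 1999):
        F(s, s') = gamma * phi(s') - phi(s)
    with phi(s) = -steps_left(s), the negated number of steps remaining
    to complete the current sub-goal (an assumed potential).
    """
    phi_s = -steps_left
    phi_next = -next_steps_left
    return base_reward + gamma * phi_next - phi_s
```

With this potential, every step that brings the agent closer to the current sub-goal yields a positive shaping term, and every step away yields a negative one.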

4 Results and Discussion

Language Subset  | DDQN + Cat    | DDQN + GA     | DDQN + Cat + PER | DDQN + GA + PER
Comma            | 81.1%         | 73.3%         | 89.5%            | 95.5%
Comma-ButFirst   | 82.0% (84.5%) | 75.1% (78.1%) | 100.0% (100.0%)  | 96.2% (96.6%)
Comma-ButBefore  | 89.9% (88.7%) | 72.9% (76.3%) | 100.0% (100.0%)  | 94.4% (95.0%)
Table 3: Success rates on training instructions, averaged over different training proportions. Each main value shows the performance on all training instructions of the corresponding language subset. The values in parentheses for Comma-ButFirst and Comma-ButBefore show the performance on the Comma instructions when trained on Comma-ButFirst or Comma-ButBefore.
Language Subset  | DDQN + Cat    | DDQN + GA   | DDQN + Cat + PER | DDQN + GA + PER
Comma            | 10.7%         | 3.6%        | 6.5%             | 4.8%
Comma-ButFirst   | 8.9% (8.3%)   | 7.4% (6.5%) | 17.6% (16.7%)    | 13.4% (13.1%)
Comma-ButBefore  | 13.4% (13.7%) | 7.7% (8.3%) | 17.6% (20.8%)    | 12.5% (12.5%)
Table 4: Success rates on testing instructions (higher number of sub-goals). The scores are taken from the runs with the biggest training proportions. The values in parentheses for Comma-ButFirst and Comma-ButBefore show the performance on the Comma instructions when trained on Comma-ButFirst or Comma-ButBefore.

The training and testing performance of the Dueling DQN algorithm with different extensions can be found in Tables 3 and 4.

Training on the language subsets Comma-ButFirst and Comma-ButBefore substantially improves the training and generalization performance on the Comma subset when compared to training only on the Comma subset. This is an unexpected outcome, suggesting that exposure to non-linear order connectors may improve the performance on linear connectors.

We notice that concatenation consistently outperforms the gated-attention mechanism on both training and testing instructions. We suspect that gated attention is useful in scenarios where objects are described in terms of multiple attributes, but has no effect, or even a harmful one, when it comes to order connectors.

Frame stacking was enough to succeed on the training instructions and on part of the testing instructions. The reason is not clear, but we hypothesize that it can be explained by the lack of layout variability and by an offloading mechanism Carvalho and Nolfi (2016). This requires further investigation.

The usage of prioritized experience replay outperforms the simple replay buffer by a large margin – a 10% to 20% improvement in success rate. This effect is well established in the Atari domain Schaul et al. (2016), but has not been explored much in multi-task reinforcement learning or task-oriented language grounding.

5 Conclusion

In this work, we applied baseline methods of general deep reinforcement learning to the problem of task-oriented language grounding, where language input contains linear and non-linear order connectors.

We found that even baseline models can capture the semantics of linear and non-linear order connectors on the training instructions, but this is not enough to achieve high generalization performance even up to six sub-goals. The best results are achieved with the prioritized experience replay mechanism, which suggests its potential application to general multi-task reinforcement learning and task-oriented language grounding. Most importantly, we found that training on both linear and non-linear order connectors helps to capture the semantics of the order connectors better than training on linear connectors alone – the generalization performance increases by a factor of 2-3.

These findings suggest that we should look for a model that would generalize better even in such a simple setting: for instance, by introducing recurrence Chaplot et al. (2018); Mei et al. (2016); Hermann et al. (2017) or hindsight experience replay Andrychowicz et al. (2017). But as the training performance is shown to be near-perfect, it would be interesting to investigate the order connectors in visually rich environments and/or in the presence of other natural language instruction semantics: multiple tasks, multiple referent attributes, etc.


  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In NIPS, Cited by: §5.
  • J. Carvalho and S. Nolfi (2016) Cognitive offloading does not prevent but rather promotes cognitive development. PLOS ONE 11, pp. e0160679. External Links: Document Cited by: §4.
  • D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. R. Salakhutdinov (2018) Gated-attention architectures for task-oriented language grounding. In AAAI, Cited by: §1, §1, §1, §3.1, §3.1, §5.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) BabyAI: first steps towards grounded language learning with a human in the loop. CoRR abs/1810.08272. External Links: Link, 1810.08272 Cited by: §1.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, M. Wainwright, C. Apps, D. Hassabis, and P. Blunsom (2017) Grounded language learning in a simulated 3d world. CoRR abs/1706.06551. External Links: Link, 1706.06551 Cited by: §1, §5.
  • M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski (2016) ViZDoom: a doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), Vol. , pp. 1–8. External Links: Document, ISSN 2325-4289 Cited by: §3.1.
  • H. Mei, M. Bansal, and M. R. Walter (2016) Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In AAAI, Cited by: §5.
  • D. Misra, J. Langford, and Y. Artzi (2017) Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1004–1015. External Links: Document, Link Cited by: §3.1, §3.1.
  • A. Y. Ng, D. Harada, and S. J. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, San Francisco, CA, USA, pp. 278–287. External Links: ISBN 1-55860-612-2, Link Cited by: §3.2.
  • J. Oh, S. P. Singh, H. Lee, and P. Kohli (2017) Zero-shot task generalization with multi-task deep reinforcement learning. In ICML, Cited by: §1, §1.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. CoRR abs/1511.05952. Cited by: §1, §3.1, §4.
  • Z. Wang, N. de Freitas, and M. Lanctot (2016) Dueling network architectures for deep reinforcement learning. In ICML, Cited by: §3.1.
  • H. Yu, X. Lian, H. Zhang, and W. Xu (2018) Guided feature transformation (gft): a neural language grounding module for embodied agents. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, , pp. 81–98. External Links: Link Cited by: §1.
  • H. Yu, H. Zhang, and W. Xu (2017) A deep compositional framework for human-like language acquisition in virtual environment. CoRR abs/1703.09831. Cited by: §1.
  • H. Yu, H. Zhang, and W. Xu (2018) Interactive grounded language acquisition and generalization in a 2d world. ICLR. Cited by: §1.