Language-Conditioned Goal Generation: a New Approach to Language Grounding for RL

06/12/2020 ∙ by Cédric Colas, et al.

In the real world, linguistic agents are also embodied agents: they perceive and act in the physical world. The notion of Language Grounding questions the interactions between language and embodiment: how do learning agents connect or ground linguistic representations to the physical world? This question has recently been approached by the Reinforcement Learning community under the framework of instruction-following agents. In these agents, behavioral policies or reward functions are conditioned on the embedding of an instruction expressed in natural language. This paper proposes another approach: using language to condition goal generators. Given any goal-conditioned policy, one could train a language-conditioned goal generator to generate language-agnostic goals for the agent. This method makes it possible to decouple sensorimotor learning from language acquisition and enables agents to demonstrate a diversity of behaviors for any given instruction. We propose a particular instantiation of this approach and demonstrate its benefits.







1 Introduction

Language Grounding describes the idea that language acquisition is strongly shaped by one's experience of the physical world. This idea emerged in Cognitive Science (harnad1990symbol; Glenberg2002; Zwaan05) and quickly inspired Artificial Intelligence approaches in natural language processing (roy2005connecting), human-machine interactions (Dominey2005; Madden2010) and, more recently, deep Reinforcement Learning (deep RL) (luketina2019survey). In the RL community, this has taken the form of language-conditioned agents (hermann2017grounded; chan2019actrce; bahdanau2018learning; cideron2019self; jiang2019language; zhong2019rtfm; waytowich2019grounding; colas2020language). Language can be used to build representations (waytowich2019grounding) or to characterize the dynamics of the environment (zhong2019rtfm). However, it is mostly used to represent instructions or goals (hermann2017grounded; chan2019actrce; bahdanau2018learning; cideron2019self; jiang2019language; colas2020language). In these approaches, natural language (nl) sentences are embedded through recurrent networks and merged with the agent's state to form the input of the policy or reward function. The language encoder is then trained jointly with either the former (hermann2017grounded; chan2019actrce; jiang2019language; cideron2019self; hill2019emergent) or the latter (bahdanau2018learning; colas2020language). These approaches demonstrate benefits over traditional goal-conditioned methods:

  1. Targeting abstract goals. Language can express abstract goals characterized by sets of properties the scene should verify (e.g. block A above block B). This contrasts with previous goal-as-state approaches, where goals are specific states of the agent (e.g. target pixel image, target block positions).

  2. Systematic generalization. Previous language-conditioned approaches demonstrate strong generalization capabilities including generalizations to new combinations of action verbs and object attributes (colors, shapes, categories etc.) (hermann2017grounded; bahdanau2018learning; hill2019emergent; colas2020language).

However, this comes with drawbacks:

  1. Language becomes a prerequisite for sensorimotor learning. Pre-verbal infants are known to demonstrate goal-oriented behavior (wood1976role). Language-conditioned agents require language inputs to act in the world and, thus, cannot account for this decoupling.

  2. Lack of behavioral diversity. Because policies are directly conditioned on language embeddings, an instruction in a given context only generates a low diversity of behaviors: a unique behavior for a deterministic policy and minor noise-induced behavioral variations for stochastic policies.

Figure 1: Language-conditioned policy, goal-conditioned policy and language-conditioned goal generator.

This paper proposes to leverage the abstraction and generalization capacities of language while 1) decoupling language acquisition from sensorimotor learning; 2) allowing behavioral diversity. This is achieved via language-conditioned goal generation (Figure 1). Instead of using language to condition policies directly, we use it to condition a generator of language-agnostic goals. Language-agnostic goals can be represented by specific states (e.g. pixel inputs) (pere2018unsupervised; laversanne2018curiosity; nair2018visual; warde2018unsupervised; nair2019contextual; pong2019skew) or handcrafted representations (e.g. target block coordinates) (schaul2015universal; andrychowicz2017hindsight; florensa2018automatic; racaniere2019automated; fournier2019clic; colas2019curious). These can come from any source: uniform sampling of the goal space (schaul2015universal), sampling from low-density areas (pong2019skew), from high learning-progress areas (fournier2019clic; colas2019curious), using generative models of states (nair2018visual; nair2019contextual) or generative models targeting intermediate difficulty (florensa2018automatic; racaniere2019automated). In our approach, language-conditioned goal generators are one source of language-agnostic goals among others. As a result, pure sensorimotor training does not require any language input: agents can simply use any other source of goals. Language-conditioned goal generators are generative models of language-agnostic goals conditioned on instructions (e.g. Variational Auto-Encoders, Generative Adversarial Networks). For any given instruction, agents can thus sample a set of matching goals. This results in behavioral diversity: a set of diverse behaviors matching any given instruction.


This paper introduces a novel approach to the problem of language grounding in RL agents: language-conditioned goal generation. This leverages the capability of language to represent abstract goals and to generalize, while avoiding the lack of behavioral diversity and the dependence on language inputs usually associated with language-conditioned agents. This paper presents a particular implementation of this approach and disentangles language acquisition from policy learning by assuming access to pre-trained goal-conditioned policies. Our policies are pre-trained to reach high-level configurations characterizing spatial relations between objects in the scene.
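The decoupling described above can be sketched in a few lines of code. This is a hypothetical sketch: the names `goal_generator` and `policy` are placeholders for the learned models, not the paper's actual implementation.

```python
import random

def follow_instruction(instruction, current_config, goal_generator, policy, n_goals=5):
    """Sample several language-agnostic goals matching the instruction,
    then hand one of them to a goal-conditioned policy.

    The policy itself never sees language: it only consumes goals, which
    is what allows sensorimotor learning without language inputs."""
    # The generator maps (instruction, initial configuration) to candidate goals.
    candidates = [goal_generator(instruction, current_config) for _ in range(n_goals)]
    goal = random.choice(candidates)   # any matching goal will do: behavioral diversity
    return policy(current_config, goal)  # language-free sensorimotor control
```

Because the policy only consumes goals, any other source of goals (uniform sampling, generative models of states, etc.) can be plugged in during pure sensorimotor training.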

2 Methods

This paper presents a particular implementation of language-conditioned goal generation. Section 2.1 describes the learning environment and the pre-trained goal-conditioned policies. Assuming these, Section 2.2 describes an implementation of a language-conditioned goal generator and how it is used for language grounding.

2.1 Behavioral Policies Conditioned on Abstract Semantic Representations

The agents considered in this paper were pre-trained without language input by an algorithm presented in a companion paper. (The companion paper is not cited here to preserve anonymity; the citation will be added after review.)

The Fetch Manipulate environment

Agents evolve in the Fetch Manipulate environment: a robotic manipulation domain based on mujoco (todorov2012mujoco) and adapted from the Fetch tasks (plappert2018multi). Agents control a robotic arm that faces three colored blocks on a table (see Figure 2). They are given innate representations called semantic configurations that characterize spatial relations between blocks. During sensorimotor training, agents discovered all reachable configurations in that space and learned to master them.

Semantic configurations.

In contrast to traditional approaches, goals are not defined as particular targets for each block but as high-level semantic configurations. These configurations are based on two spatial predicates infants demonstrate early in their development (mandler2012spatial): the close and the above binary predicates. These two predicates are applied to all ordered pairs of objects, i.e. 6 pairs for the 3 objects we consider. Because the close predicate is order-invariant, we only need to evaluate it on the 3 object combinations. The above predicate being order-dependent, we need all 6 permutations. The resulting binary vector of size 9 (3 close bits and 6 above bits) forms the semantic configuration. It represents the spatial relations between objects in the scene. In the resulting semantic configuration space, the agent can reach physically valid configurations, including stacks of 2 or 3 blocks and pyramids. Supplementary Section 5 provides formal definitions and properties of predicates and semantic configurations. Figure 2 displays visual representations of some example configurations.
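A minimal sketch of how a 9-bit semantic configuration could be computed from block positions. The threshold values and the geometric test for above are illustrative assumptions; the paper hard-codes its own oracle predicate functions.

```python
import itertools
import numpy as np

# Hypothetical thresholds (the paper uses predefined fixed thresholds, values unspecified here).
CLOSE_THRESHOLD = 0.1   # max Euclidean distance for "close"
ABOVE_XY_RADIUS = 0.05  # horizontal alignment required for "above"
ABOVE_Z_MIN = 0.03      # minimum vertical offset for "above"

def close(p1, p2):
    """Symmetric predicate: true when blocks are within a distance threshold."""
    return float(np.linalg.norm(np.array(p1) - np.array(p2)) < CLOSE_THRESHOLD)

def above(p1, p2):
    """Asymmetric predicate: true when block 1 sits vertically above block 2."""
    p1, p2 = np.array(p1), np.array(p2)
    horizontally_aligned = np.linalg.norm(p1[:2] - p2[:2]) < ABOVE_XY_RADIUS
    return float(horizontally_aligned and (p1[2] - p2[2]) > ABOVE_Z_MIN)

def semantic_configuration(positions):
    """3 close bits (combinations) + 6 above bits (permutations) = 9 bits."""
    close_bits = [close(positions[i], positions[j])
                  for i, j in itertools.combinations(range(3), 2)]
    above_bits = [above(positions[i], positions[j])
                  for i, j in itertools.permutations(range(3), 2)]
    return np.array(close_bits + above_bits)
```

For example, with the red block stacked on the green one and the blue block away, only the close(red, green) and above(red, green) bits are set.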

Figure 2: Some semantic configurations in Fetch Manipulate.

2.2 Language-Conditioned Goal Generation

The language-conditioned goal generator, or language module, generates semantic configurations matching the agent’s initial configuration and a sentence describing an expected transformation of a relation between a pair of objects. This section explains its training and use for language grounding.

Training the language module.

The language-conditioned goal generator is implemented by a conditional Variational Auto-Encoder (c-vae) (sohn2015learning) trained in a supervised setting. The training data is collected via interactions between a trained agent and a social partner. For each goal-directed trajectory the agent performs, the social partner provides the description of one of the resulting transformations in the object relations. The set of possible descriptions contains a fixed list of sentences, each describing, in a simplified language, a positive or negative shift for one of the predicates (e.g. get red above green). nl descriptions are encoded via a recurrent network that is jointly trained with the c-vae. Supplementary Section 6 provides the list of sentences and implementation details.

Language grounding.

At test time, agents are instructed by one of the sentences to transform a relation between objects. The trained language module acts as a translator: agents can sample a goal configuration matching their current state and instruction. This module effectively enables agents to ground nl in their internal semantic representations and set of sensorimotor skills. We consider three evaluation settings: 1) performing a single instruction; 2) performing a sequence of instructions; 3) performing a logical combination of instructions. As the agent can generate a set of goals matching any instruction, it can easily combine these sets to perform logical functions of instructions: and is an intersection, or is a union and not is the complement within the set of goals the agent discovered during sensorimotor training. Given sets of compatible goal configurations, agents can also try again: find other goal configurations that match the required instruction when previous attempts failed.
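The logical combinations described above reduce to plain set algebra over goal sets. The toy goal identifiers and instruction-to-goal-set mapping below are hypothetical; in the paper, these sets come from sampling the c-vae.

```python
# All configurations discovered during sensorimotor training (toy identifiers).
discovered = {"g1", "g2", "g3", "g4"}

# Each instruction maps to the set of discovered goal configurations matching it.
goals = {
    "put red above green": {"g1", "g2"},
    "put blue close_to red": {"g2", "g3"},
}

# Logical combinations of instructions are set operations over goal sets:
and_goals = goals["put red above green"] & goals["put blue close_to red"]  # intersection
or_goals = goals["put red above green"] | goals["put blue close_to red"]   # union
not_goals = discovered - goals["put red above green"]                      # complement
```

A direct language-conditioned policy offers no such compositional handle, since it produces a behavior rather than a set of candidate goals.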

3 Experiments

We first train a language-conditioned goal generator from a training dataset collected via interactions between the agent and a social partner (Section 2.2). For a given initial configuration and a given sentence, we want the language module to generate all compatible final configurations, and only these.

Language-conditioned goal generation performance.

To evaluate the language module, we construct a synthetic, oracle dataset of triplets (c_i, s, C_f), where c_i is the initial configuration, s is the sentence describing the expected transformation and C_f is the set of all final configurations compatible with (c_i, s). Note that, on average, C_f contains many more compatible configurations than the training dataset does for the same input pair. We are interested in two metrics: 1) the Compatibility Probability (CP) is the probability that a goal sampled from the generator belongs to C_f; 2) the Coverage (Cov) is the fraction of C_f recovered by repeatedly sampling the language-conditioned generator. We compute these metrics on 5 different sets of input pairs (c_i, s), each calling for a different type of generalization:

  1. Pairs found in the training set, except pairs removed to form the following test sets. This calls for the extrapolation of known initialization-effect pairs to new final configurations (the training set contains only 20% of the compatible final configurations on average).

  2. Pairs that were removed from the training set, calling for a recombination of known effects on known initial configurations.

  3. Pairs for which the initial configuration was entirely removed from the training set. This calls for the transfer of known effects to unknown initial configurations.

  4. Pairs for which the sentence was entirely removed from the training set. This calls for generalization in the language space: inferring unknown effects from related sentences and transposing them to known initial configurations.

  5. Pairs for which both the initial configuration and the sentence were entirely removed from the training set. This calls for generalizations 3 and 4 combined.

Our language module demonstrates these types of generalization (see Table 1). Agents can generate goals from situations they never encountered (Test 3). They can generalize the meaning of sentences they never heard (Test 4) and even apply the latter to unknown situations (Test 5). We detail the testing sets in Supplementary Section 6.
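The two metrics above can be computed as follows. This is a sketch under the assumption that goal configurations are represented as hashable values; the function names are illustrative.

```python
def compatibility_probability(sampled_goals, compatible_set):
    """CP: fraction of sampled goals that belong to the compatible set C_f."""
    return sum(g in compatible_set for g in sampled_goals) / len(sampled_goals)

def coverage(sampled_goals, compatible_set):
    """Cov: fraction of C_f recovered by repeated sampling of the generator."""
    return len(set(sampled_goals) & compatible_set) / len(compatible_set)
```

A perfect generator scores 1 on both: every sample is compatible (CP) and repeated sampling eventually produces every compatible configuration (Cov).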

Metr. Test 1 Test 2 Test 3 Test 4 Test 5
CP 0.93 0.94 0.95 0.90 0.92
Cov 0.97 0.93 0.98 0.99 0.98
Table 1: Language module average metrics over seeds. Standard deviations are low for both Cov and CP.

Grounding language in sensorimotor behavior.

We investigate how the language module interacts with the sensorimotor skills of the agent. We consider three evaluation settings. In the transition setup, we look at the average success rate of the agent when asked to perform each of the instructions, resetting the environment each time. In the expression setup, we evaluate the agent on randomly generated logical functions of sentences. In both setups, we give the agent up to 5 attempts, enabling it to resample new compatible goals when the previous ones failed (without reset). Success rates after 1 (SR1) and 5 (SR5) attempts are reported in Table 2. In the sequence setup, we ask the agent to execute random sequences of instructions without reset and report the average number of successes before the agent fails. The RL agents are evaluated with the c-vae models evaluated above. These results show that the language module efficiently implements language grounding. Agents achieve instructed transitions almost every time, resampling alternative goals when previous ones failed. They only fail when a previous trajectory kicked the blocks out of reach.
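The SR5 retry mechanism can be sketched as a simple loop. The names `sample_goal` and `attempt` are hypothetical stand-ins for sampling the c-vae and rolling out the goal-conditioned policy.

```python
def instruct_with_retries(instruction, config, sample_goal, attempt, max_attempts=5):
    """Sample a compatible goal, attempt it with the policy, and resample an
    alternative goal on failure, up to max_attempts tries (no environment reset)."""
    for n in range(1, max_attempts + 1):
        goal = sample_goal(instruction, config)  # e.g. drawn from the c-vae
        if attempt(goal):                        # run the goal-conditioned policy
            return True, n                       # success after n attempts
    return False, max_attempts
```

Because different samples can correspond to different compatible configurations, each retry may try a genuinely different strategy rather than repeating the failed one.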

 Metr. Transition Expression
SR1
SR5
Table 2: Language grounding performance metrics over 10 seeds (mean ± std).

4 Discussion

This paper introduces language-conditioned goal generation: a new approach to the problem of language grounding in RL agents. It shows the following advantages over traditional language-conditioned RL agents:

  • Decoupled language grounding. Our approach enables agents to decouple language acquisition from sensorimotor learning, as observed in infants (wood1976role). However, nothing in the architecture prevents language from being grounded during sensorimotor learning. This would result in "overlapping waves" of sensorimotor and linguistic development (siegler1998emerging).

  • Behavioral diversity. The language module generates a diversity of goals matching any instruction (see Coverage metrics in Table 1). This results in a behavioral diversity that language-conditioned agents cannot demonstrate. Indeed, while a language-conditioned agent trained on put red close_to green would only push the red block towards the green one, our agent can generate many matching goal configurations: it could build a pyramid, make a blue-green-red pile or target a dozen other compatible configurations.

  • Trying again. Generating a diversity of goals matching any instruction enables agents to try again: to find alternative approaches to satisfy the same instruction when previous attempts failed. Table 2 shows that benefiting from several attempts significantly improves the chances of success. This could also improve robustness to perturbed environments where formerly successful behaviors fail.

  • Logical expressions. Generating sets of compatible goals makes it easy to scale language understanding to any logical combination of instructions. This cannot be achieved with language-conditioned policies.

One strength of language-conditioned approaches is their capability of systematic generalization (hermann2017grounded; bahdanau2018learning; hill2019emergent; colas2020language). Our language module also demonstrates generalization abilities: agents can generate goals from situations they never encountered (Type 3). They can generalize the meaning of sentences they never heard (Type 4) and even apply the latter to unknown situations (Type 5). Type 4, especially, involves recombinations of action verbs and attributes similar to hill2019emergent; colas2020language. c-vaes were already used to generate goals matching initial states in nair2019contextual. However, the additional condition on instructions brings the benefits of language use: 1) the representation of abstract goals; 2) systematic generalization capacities. In addition, instructions enable agents to control their goal generation. Because image-based goal generation was shown to work in nair2018visual; pong2019skew; nair2019contextual, we believe our language-conditioned goal generator could be trained to generate image-based goals. Further work could investigate the extension of language-conditioned goal generators to diverse goal-conditioned settings such as, for example, Quality-Diversity algorithms. These algorithms train populations of diverse and high-performing solutions to a problem (cully_qd). Each solution is associated with its behavioral characterization: a low-dimensional description of its behavior. Our language-conditioned goal generator could be used to map language instructions to that behavioral space, enabling a language-based control of these diversity-seeking evolutionary algorithms (mapelite; nslc; nses; colas2020scaling).


Cédric Colas is partly funded by the French Ministère des Armées - Direction Générale de l’Armement.


Supplementary Material

This supplementary material includes:

  • [leftmargin=0.6cm, nolistsep]

  • Section 5: a formal definition of semantic configurations and the definition of the semantic configurations used in the Fetch Manipulate environment.

  • Section 6: further details about the training of our language-conditioned goal generation module.

5 Formal Definition of Semantic Configurations

Semantic configurations are based on a collection of formal systems known as predicate logic or first-order logic. They use n-ary relations to describe possible connections between quantified variables. This paper focuses on spatial binary predicates characterizing spatial relations between pairs of physical objects. We provide formal definitions, properties and examples below.

Binary predicates

Consider a finite set of objects {o_1, ..., o_N}. A binary predicate p associated with a semantic relation r is an expression that takes as input any ordered pair of objects (o_i, o_j). p(o_i, o_j) is said to be true if and only if "o_i r o_j" is verified. For simplicity, we refer to p and r interchangeably.

Examples of binary predicates.

We consider the objects o_1 and o_2.

  • The expression "o_1 is close to o_2" describes the predicate close evaluated on (o_1, o_2).

  • The expression "o_1 is above o_2" describes the predicate above evaluated on (o_1, o_2).

Semantic mapping functions.

To achieve symbol grounding into non-symbolic sensorimotor interactions using predicates, we define the semantic mapping function f_p associated with the binary predicate p as the probability that p is true given the states of the considered objects. Formally, if we consider the objects o_1, o_2 and their respective states s_1, s_2, then:

f_p(s_1, s_2) = P(p(o_1, o_2) is true | s_1, s_2).

This paper assumes oracle deterministic semantic mapping functions, i.e. f_p is a Boolean function with values in {0, 1}. Practically, we hard-code a function, assumed internal to the agent, that uses predefined fixed thresholds to determine whether a predicate is true or false given the states of the considered objects. For example, for the close predicate, it outputs 1 if and only if the Euclidean distance between the two considered objects is below a predefined threshold. For the sake of simplicity, we omit the word deterministic.

Symmetry and asymmetry.

Consider a finite set of objects and a binary predicate p. The predicate p is said to be symmetric if and only if, for any ordered pair of objects (o_i, o_j), "o_i r o_j" and "o_j r o_i" are equivalent. As a result, the corresponding semantic mapping function needs to be symmetric, i.e. f_p(s_i, s_j) = f_p(s_j, s_i). The predicate p is said to be asymmetric iff, for any ordered pair (o_i, o_j), "o_i r o_j" implies not "o_j r o_i".


We consider the objects o_1 and o_2.

  • close is symmetric: "o_1 is close to o_2" is equivalent to "o_2 is close to o_1". The corresponding semantic mapping function is based on the Euclidean distance, which is symmetric.

  • above is asymmetric: "o_1 is above o_2" implies not "o_2 is above o_1". The corresponding semantic mapping function evaluates the sign of the difference of the objects' z-axis coordinates.

Effective number of predicate relations.

Consider a finite set of N objects and a binary predicate p.

  • If p is not symmetric, then the effective number of relations that can be described without redundancy is equal to the number of permutations of 2 objects among N, i.e. N!/(N-2)! = N(N-1).

  • If p is symmetric, then the effective number of relations is equal to the number of combinations of 2 objects among N, i.e. N(N-1)/2.
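The counts above follow directly from standard combinatorics; a minimal check (the function name is illustrative):

```python
from math import comb, perm

def effective_relations(n_objects, symmetric):
    """Number of non-redundant binary relations among n_objects:
    combinations for a symmetric predicate, permutations otherwise."""
    return comb(n_objects, 2) if symmetric else perm(n_objects, 2)

# For the 3 blocks of Fetch Manipulate:
n_close = effective_relations(3, symmetric=True)   # close is symmetric: 3 bits
n_above = effective_relations(3, symmetric=False)  # above is asymmetric: 6 bits
config_size = n_close + n_above                    # 9-bit semantic configuration
```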

Semantic configurations based on spatial relations.

Let {p_1, ..., p_K} be a list of binary predicates. The concatenation of the evaluations of the corresponding semantic mapping functions on the object pairs forms a semantic configuration. It is an abstract representation of a scene which characterizes all relations defined by the predicates among the objects. This defines a binary semantic configuration space {0, 1}^D, where D is the total number of effective relations. While any world configuration can be mapped to this space, not all semantic configurations are reachable (e.g. o_1 cannot be above and below o_2 at the same time).

Semantic representation space in Fetch Manipulate.

In the Fetch Manipulate environment, we restrict semantic representations to the close and above binary predicates applied to 3 objects. The resulting semantic configurations are formed by:

c = [c(o_1, o_2), c(o_1, o_3), c(o_2, o_3), a(o_1, o_2), a(o_2, o_1), a(o_1, o_3), a(o_3, o_1), a(o_2, o_3), a(o_3, o_2)],

where c(·) and a(·) refer to the close and above predicates respectively and o_1, o_2, o_3 are the red, green and blue blocks respectively.

6 Language-Conditioned Goal Generator Training

We use a conditional Variational Auto-Encoder (c-vae) (sohn2015learning). Conditioned on the initial configuration and a sentence describing the expected transformation of one object relation, it generates compatible goal configurations. After the first phase of goal-directed sensorimotor training, the agent interacts with a hard-coded social partner as described in Main Section 2.2. From these interactions, we obtain a dataset of triplets: initial configuration, final configuration and sentence describing one change of predicate from the initial to the final configuration. The list of sentences used by the synthetic social partner is provided in Table 3. Note that red, green and blue refer to the objects o_1, o_2 and o_3 respectively.

Content of test sets.

We describe the test sets:

  1. Test set 1 is made of input pairs from the training set, but tests the coverage of all compatible final configurations, 80% of which are not found in the training set. In that sense, it is only partly a training set.

  2. Test set 2 contains two input pairs: one with the sentence put blue close_to green and one with the sentence put green below red, each paired with a specific initial configuration.

  3. Test set 3 corresponds to all pairs including a held-out initial configuration.

  4. Test set 4 corresponds to all pairs including one of the held-out sentences put green on_top_of red and put blue far_from red.

  5. Test set 5 is made of all pairs that include both the initial configuration of test set 3 and one of the sentences of test set 4. Note that pairs of set 5 are removed from sets 3 and 4.

Type Sentences
Close (false → true) Put block A close_to block B,
Bring block B and block A together,
Put block B close_to block A,
Bring block A and block B together,
Get block B and block A close_from each_other,
Get block A close_to block B
Get block A and block B close_from each_other,
Get block B close_to block A.
Close (true → false) Put block A far_from block B,
Get block A far_from block B,
Put block B far_from block A,
Get block B far_from block A,
Get block A and block B far_from each_other,
Bring block A and block B apart,
Get block B and block A far_from each_other,
Bring block B and block A apart.
Above (false → true) Put block A above block B,
Put block A on_top_of block B,
Put block B under block A,
Put block B below block A.
Above (true → false) Remove block A from_above block B,
Remove block A from block B,
Remove block B from_below block A,
Put block B and block A on_the_same_plane,
Put block A and block B on_the_same_plane.
Table 3: List of instructions. Each of them specifies a shift of one predicate, either from false to true or from true to false. block A and block B represent two different blocks from {red, blue, green}.

Testing on logical expressions of instructions.

To evaluate our agents on logical functions of instructions, we generate three types of expressions:

  1. instructions of the form "A and B", where A and B are basic instructions corresponding to shifts of the above predicate from false to true (see Table 3). These intersections correspond to stacks of 3 or pyramids.

  2. instructions of the form "A and B" where A and B are above and close instructions respectively. B can be replaced by "not B" with probability 0.5.

  3. instructions of the form "(A and B) or (C and D)", where A, B, C and D are basic instructions: A and C are above instructions while B and D are close instructions. Here also, any instruction can be replaced by its negation with probability 0.5.


The encoder is a fully-connected neural network with two hidden layers and ReLU activations. It takes as input the concatenation of the final binary configuration and its two conditions: the initial binary configuration and an embedding of the nl sentence. The nl sentence is embedded with a recurrent network with tanh non-linearities and biases. The encoder outputs the mean and log-variance of the latent distribution. The decoder is also a fully-connected network with two hidden layers and ReLU activations. It takes as input the latent code and the same conditions as the encoder. As it generates binary vectors, the last layer uses sigmoid activations. We train the architecture with a mixture of a Kullback-Leibler divergence loss L_KL w.r.t. a standard Gaussian prior and a binary cross-entropy loss L_BCE. The combined loss is L = L_BCE + β L_KL. We use an Adam optimizer. As training is fast (minutes on a single cpu), we conducted a quick hyperparameter search over β, layer sizes, learning rates and latent sizes (see Table 4). We found robust results for a range of layer sizes, β values and latent sizes.
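The architecture can be sketched in PyTorch as follows. This is a simplified sketch, not the paper's exact implementation: the layer sizes, latent size, β weight and sentence count are placeholder values, and the recurrent sentence encoder is reduced to an index-embedding over the fixed sentence list.

```python
import torch
import torch.nn as nn

class GoalCVAE(nn.Module):
    """Conditional VAE: generates a 9-bit goal configuration conditioned on
    the initial configuration and an embedded instruction."""
    def __init__(self, config_size=9, n_sentences=25, embed_size=16,
                 hidden_size=128, latent_size=16):
        super().__init__()
        self.sentence_embedding = nn.Embedding(n_sentences, embed_size)
        cond_size = config_size + embed_size  # initial configuration + sentence
        self.encoder = nn.Sequential(
            nn.Linear(config_size + cond_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 2 * latent_size))  # mean and log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_size + cond_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, config_size), nn.Sigmoid())  # binary outputs
        self.latent_size = latent_size

    def condition(self, c_init, sentence_idx):
        return torch.cat([c_init, self.sentence_embedding(sentence_idx)], dim=-1)

    def forward(self, c_final, c_init, sentence_idx):
        cond = self.condition(c_init, sentence_idx)
        mu, log_var = self.encoder(torch.cat([c_final, cond], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        return self.decoder(torch.cat([z, cond], -1)), mu, log_var

    def sample_goal(self, c_init, sentence_idx):
        """At test time: sample the prior and decode a binary goal configuration."""
        cond = self.condition(c_init, sentence_idx)
        z = torch.randn(c_init.shape[0], self.latent_size)
        return (self.decoder(torch.cat([z, cond], -1)) > 0.5).float()

def loss_fn(recon, target, mu, log_var, beta=0.5):
    """Binary cross-entropy plus beta-weighted KL divergence to a unit Gaussian."""
    bce = nn.functional.binary_cross_entropy(recon, target, reduction="sum")
    kld = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return bce + beta * kld
```

Sampling the latent prior several times for the same (configuration, sentence) condition yields the set of candidate goals used for retries and logical combinations.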

Hyperparam. Values
layer sizes
learning rate
latent sizes
Table 4: Language module hyperparameter search. The selected hyperparameters are in bold.