Probing Emergent Semantics in Predictive Agents via Question Answering

by Abhishek Das, et al.

Recent work has shown how predictive modeling can endow agents with rich knowledge of their surroundings, improving their ability to act in complex environments. We propose question-answering as a general paradigm to decode and understand the representations that such agents develop, applying our method to two recent approaches to predictive modeling: action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). After training agents with these predictive objectives in a visually-rich, 3D environment with an assortment of objects, colors, shapes, and spatial configurations, we probe their internal state representations with synthetic (English) questions, without backpropagating gradients from the question-answering decoder into the agent. The performance of different agents when probed this way reveals that they learn to encode factual, and seemingly compositional, information about objects, properties and spatial relations from their physical environment. Our approach is intuitive, i.e. humans can easily interpret responses of the model as opposed to inspecting continuous vectors, and model-agnostic, i.e. applicable to any modeling approach. By revealing the implicit knowledge of objects, quantities, properties and relations acquired by agents as they learn, question-conditional agent probing can stimulate the design and development of stronger predictive learning objectives.



1 Introduction

Since the time of Plato, philosophers have considered the apparent distinction between “knowing how” (procedural knowledge or skills) and “knowing what” (propositional knowledge or facts). It is uncontroversial that deep reinforcement learning (RL) agents can effectively acquire procedural knowledge as they learn to play games or solve tasks. Such knowledge might manifest in an ability to find all of the green apples in a room, or to climb all of the ladders while avoiding snakes. However, the capacity of such agents to acquire factual knowledge about their surroundings – of the sort that can be readily hard-coded in symbolic form in classical AI – is far from established. Thus, even if an agent successfully climbs ladders and avoids snakes, we have no certainty that it ‘knows’ that ladders are brown, that there are five snakes nearby, or that the agent is currently in the middle of a three-level tower with one ladder left to climb.

The acquisition of knowledge about objects, properties, relations and quantities by learning-based agents is desirable for several reasons. First, such knowledge should ultimately complement procedural knowledge when forming plans that enable execution of complex, multi-stage cognitive tasks. Second, there seems (to philosophers at least) to be something fundamentally human about having knowledge of facts or propositions (Stich, 1979). If one of the goals of AI is to build machines that can engage with, and exhibit convincing intelligence to, human users (e.g. justifying their behaviour so humans understand/trust them), then a need for uncovering and measuring such knowledge in learning-based agents will inevitably arise.

Figure 1: We train predictive agents to explore a visually-rich 3D environment with an assortment of objects of different shapes, colors and sizes. As the agent navigates (trajectory shown in white on the top-down map), an auxiliary network (labeled ‘Simulation Network’) learns to simulate representations of future observations $k$ steps into the future, self-supervised by a loss against the ground-truth egocentric observation at time $t+k$. Simultaneously, another decoder network is trained to extract answers to a variety of questions about the environment, conditioned on the agent’s internal state but without affecting it (notice ‘stop gradient’ – gradients from the QA decoder are not backpropagated into the agent). We use this question-answering paradigm to decode and understand the internal representations that such agents develop. Note that the top-down map is only shown for illustration and is not available to the agent.

Here, we propose the question-conditional probing of agent internal states as a means to study and quantify the knowledge about objects, properties, relations and quantities encoded in the internal representations of neural-network-based agents. Couching an analysis of such knowledge in terms of question-answering has several pragmatic advantages. First, question-answering provides a general purpose method for agent-analysis and an intuitive investigative tool for humans – one can simply ask an agent what it knows about its environment and get an answer back, without having to inspect internal activations. Second, the space of questions is essentially open-ended – we can pose arbitrarily complex questions to an agent, enabling a comprehensive analysis of the current state of its propositional knowledge. Question-answering has previously been studied in textual (Rajpurkar et al., 2016, 2018), visual (Malinowski and Fritz, 2014; Antol et al., 2015; Das et al., 2017) and embodied (Gordon et al., 2018; Das et al., 2018a) settings. Crucially, however, these systems are trained end-to-end for the goal of answering questions. Here, we utilize question-answering simply to probe an agent’s internal representation, without backpropagating gradients from the question-answering decoder into the agent. That is, we view question-answering as a general purpose (conditional) decoder of environmental information designed to assist the development of agents by revealing the extent (and limits) of their knowledge.

Many techniques have been proposed for endowing agents with general (i.e. task-agnostic) knowledge, based on both hard-coding and learning. Here, we specifically focus on the effect of self-supervised predictive modeling – a learning-based approach – on the acquisition of propositional knowledge. Inspired by learning in humans (Elman, 1990; Rao and Ballard, 1999; Clark, 2016; Hohwy, 2013), predictive modeling, i.e. predicting future sensory observations, has emerged as a powerful method to learn general-purpose neural network representations (Elias, 1955; Atal and Schroeder, 1970; Schmidhuber, 1991; Schaul and Ring, 2013; Schaul et al., 2015; Silver et al., 2017; Wayne et al., 2018; Guo et al., 2018; Gregor et al., 2019; Recanatesi et al., 2019). These representations can be learned while exploring in and interacting with an environment in a task-agnostic manner, and later exploited for goal-directed behavior.

We evaluate predictive vs. non-predictive agents (both trained for exploration) on our question-answering testbed to investigate how much knowledge of object shapes, quantities, and spatial relations they acquire solely by egocentric prediction. The set includes a mix of questions that can plausibly be answered from a single observation or a few consecutive observations, and those that require the agent to integrate global knowledge of its entire surroundings.

Concretely, we make the following contributions:

  • In a visually-rich 3D room environment developed in the Unity engine, we develop a set of questions designed to probe a diverse body of factual knowledge about the environment – from identifying shapes and colors (‘What shape is the red object?’) to counting (‘How many blue objects are there?’) to spatial relations (‘What is the color of the chair near the table?’), exhaustive search (‘Is there a cushion?’), and comparisons (‘Are there the same number of tables as chairs?’).

  • We train RL agents augmented with predictive loss functions – 1) action-conditional CPC (Guo et al., 2018) and 2) SimCore (Gregor et al., 2019) – for an exploration task and analyze the internal representations they develop by decoding answers to our suite of questions. Crucially, the QA decoder is trained independent of the predictive agent and we find that QA performance is indicative of the agent’s ability to capture global environment structure and semantics solely through egocentric prediction. We compare these predictive agents to strong non-predictive LSTM baselines as well as to an agent that is explicitly optimized for the question-answering task.

  • We establish generality of the encoded knowledge by testing zero-shot generalization of a trained QA decoder to compositionally novel questions (unseen combinations of seen attributes), suggesting a degree of compositionality in the internal representations captured by predictive agents.

2 Background and related work

Our work builds on studies of predictive modeling and auxiliary objectives in reinforcement learning as well as grounded language learning and embodied question answering.

Propositional knowledge is knowledge that a statement, expressed in natural or formal language, is true (Truncellito, 2007). Since at least Plato, epistemologist philosophers have contrasted propositional knowledge with procedural knowledge (knowledge of how to do something), and some (but not all) distinguish this from perceptual knowledge (knowledge obtained by the senses that cannot be translated into a proposition) (Dretske, 1995). An ability to exhibit this sort of knowledge in a convincing way is likely to be crucial for the long-term goal of having agents achieve satisfying interactions with humans, since an agent that cannot express its knowledge and beliefs in human-interpretable form may struggle to earn the trust of users.

Predictive modeling and auxiliary loss functions in RL. The power of predictive modeling for representation learning has been known since at least the seminal work of Elman (1990) on emergent language structures. More recent examples include Word2Vec (Mikolov et al., 2013), Skip-Thought vectors (Kiros et al., 2015), and BERT (Devlin et al., 2019) in language, while in vision similar principles have been applied to context prediction (Doersch et al., 2015; Noroozi and Favaro, 2016), unsupervised tracking (Wang and Gupta, 2015), inpainting (Pathak et al., 2016) and colorization (Zhang et al., 2016). More closely related to our work is the use of such techniques in designing auxiliary loss functions for training model-free RL agents, such as successor representations (Dayan, 1993; Zhu et al., 2017a), value and reward prediction (Jaderberg et al., 2016; Hermann et al., 2017; Wayne et al., 2018), contrastive predictive coding (CPC) (Oord et al., 2018; Guo et al., 2018), and SimCore (Gregor et al., 2019).

| Question type   | Template                                                            | Level codename  | QA pairs |
|-----------------|---------------------------------------------------------------------|-----------------|----------|
| Attribute       | What is the color of the <shape>?                                   | color           |          |
| Attribute       | What shape is the <color> object?                                   | shape           |          |
| Count           | How many <shape> are there?                                         | count_shape     |          |
| Count           | How many <color> objects are there?                                 | count_color     |          |
| Exist           | Is there a <shape>?                                                 | existence_shape |          |
| Compare + Count | Are there the same number of <color1> objects as <color2> objects?  | compare_n_color |          |
| Compare + Count | Are there the same number of <shape1> as <shape2>?                  | compare_n_shape |          |
| Relation + Attribute | What is the color of the <shape1> near the <shape2>?           | near_color      |          |
| Relation + Attribute | What is the <color> object near the <shape>?                   | near_shape      |          |

Table 1: QA task templates. In every episode, objects and their configurations are randomly generated, and these templates get translated to QA pairs for all unambiguous shape, color combinations. There are shapes and colors in total. See A.4 for details.

Grounded language learning. Inspired by the work of (Winograd, 1972) on SHRDLU, several recent works have explored linguistic representation learning by grounding language into actions and pixels in physical environments – in 2D gridworlds (Andreas et al., 2017; Yu et al., 2018; Misra et al., 2017), 3D (Chaplot et al., 2018; Das et al., 2018a; Gordon et al., 2018; Cangea et al., 2019; Puig et al., 2018; Zhu et al., 2017a; Anderson et al., 2018; Gupta et al., 2017; Zhu et al., 2017b; Oh et al., 2017; Shu et al., 2018; Vogel and Jurafsky, 2010; Hill et al., 2020) and textual (Matuszek et al., 2013; Narasimhan et al., 2015) environments. Closest to our work is the task of Embodied Question Answering (Gordon et al., 2018; Das et al., 2018a, b; Yu et al., 2019; Wijmans et al., 2019) – where an embodied agent in an environment (e.g. a house) is asked to answer a question (e.g. “What color is the piano?”). Typical approaches to EmbodiedQA involve training agents to move for the goal of answering questions. In contrast, our focus is on learning a predictive model in a goal-agnostic exploration phase and using question-answering as a post-hoc testbed for evaluating the semantic knowledge that emerges in the agent’s representations from predicting the future.

Neural population decoding. Probing an agent with a QA decoder can be viewed as a variant of neural population decoding, used as an analysis tool in neuroscience (Georgopoulos et al., 1986; Bialek et al., 1991; Salinas and Abbott, 1994) and more recently in deep learning (Guo et al., 2018; Gregor et al., 2019; Azar et al., 2019; Alain and Bengio, 2016; Conneau et al., 2018; Tenney et al., 2019). The idea is to test whether specific information is encoded in a learned representation, by feeding the representation as input to a probe network, generally a classifier trained to extract the desired information. In RL, this is done by training a probe to predict parts of the ground-truth state of the environment, such as an agent’s position or orientation, without backpropagating through the agent’s internal state.

Prior work has required a separate network to be trained for each probe, even for closely related properties such as position vs. orientation (Guo et al., 2018) or grammatical features of different words in the same sentence (Conneau et al., 2018). Moreover, each probe is designed with property-specific inductive biases, such as convnets for top-down views vs. MLPs for position (Gregor et al., 2019). In contrast, we train a single, general-purpose probe network that covers a variety of question types, with an inductive bias for language processing. This generality is possible because of the external conditioning, in the form of the question, supplied to the probe. External conditioning moreover enables agent analysis using novel perturbations of the probe’s training questions.

Neuroscience. Predictive modeling is thought to be a fundamental component of human cognition (Elman, 1990; Hohwy, 2013; Seth, 2015). In particular, it has been proposed that perception, learning and decision-making rely on the minimization of prediction error (Rao and Ballard, 1999; Clark, 2016). A well-established strand of work has focused on decoding predictive representations in brain states (Nortmann et al., 2013; Huth et al., 2016). The question of how prediction of sensory experience relates to higher-order conceptual knowledge is complex and subject to debate (Williams, 2018; Roskies and Wood, 2017), though some have proposed that conceptual knowledge, planning, reasoning, and other higher-order functions emerge in deeper layers of a predictive network. We focus on the emergence of propositional knowledge in a predictive agent’s internal representations.

3 Environment & Tasks

Environment. We use a Unity-based visually-rich 3D environment (see Figure 1). It is a single L-shaped room that can be programmatically populated with an assortment of objects of different colors at different spatial locations and orientations. In total, we use a library of different objects, referred to as ‘shapes’ henceforth (e.g. chair, teddy, glass, etc.), in different colors (e.g. red, blue, green, etc.). For a complete list of environment details, see Sec. A.4.

At every step, the agent gets a first-person RGB image as its observation, and the action space consists of movements (move-{forward, back, left, right}), turns (turn-{up, down, left, right}), and object pick-up and manipulation (4 DoF: yaw, pitch, roll, and movement along the axis between the agent and object). See Table 5 in the Appendix for the full set of actions.

Question-Answering Tasks. We develop a range of question-answering tasks of varying complexity that test the agent’s local and global scene understanding, visual reasoning, and memory skills. Inspired by Johnson et al. (2017), Das et al. (2018a) and Gordon et al. (2018), we programmatically generate a dataset of questions (see Table 1). These questions ask about the presence or absence of objects (existence_shape), their attributes (color, shape), counts (count_color, count_shape), quantitative comparisons (compare_n_color, compare_n_shape), and elementary spatial relations (near_color, near_shape). Unlike the fully-observable setting in CLEVR (Johnson et al., 2017), the agent does not get a global view of the environment, and must answer these questions from a sequence of partial egocentric observations. Moreover, unlike prior work on EmbodiedQA (Gordon et al., 2018; Das et al., 2018a), the agent is not being trained end-to-end to move to answer questions. It is being trained to explore, and answers are being decoded (without backpropagating gradients) from its internal representation. Thus, in order to answer these questions, the agent must learn to encode relevant aspects of the environment in a representation amenable to easy decoding into symbols (e.g. what does the word “chair” mean? or what representations does computing “how many” require?).
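To make the programmatic generation concrete, the following is a minimal sketch of how a few of the Table 1 templates could be expanded into QA pairs for one episode. The function name, the `(color, shape)` tuple format, and the rule that a question is "unambiguous" when the referenced attribute occurs exactly once in the room are assumptions made for illustration; the paper's actual generator is described in its appendix and may differ (for example in pluralization).

```python
from collections import Counter


def generate_qa_pairs(objects):
    """Expand attribute and count templates for one episode.

    `objects` is a hypothetical list of (color, shape) tuples describing
    the room. Returns a list of (codename, question, answer) triples.
    """
    color_counts = Counter(color for color, _ in objects)
    shape_counts = Counter(shape for _, shape in objects)
    qa_pairs = []
    for color, shape in objects:
        # Assumed ambiguity rule: only ask about a shape (or color)
        # that refers to exactly one object in the room.
        if shape_counts[shape] == 1:
            qa_pairs.append(("color", f"What is the color of the {shape}?", color))
        if color_counts[color] == 1:
            qa_pairs.append(("shape", f"What shape is the {color} object?", shape))
    for shape, n in sorted(shape_counts.items()):
        qa_pairs.append(("count_shape", f"How many {shape} are there?", str(n)))
    for color, n in sorted(color_counts.items()):
        qa_pairs.append(("count_color", f"How many {color} objects are there?", str(n)))
    return qa_pairs
```

For a room containing a red chair, a blue table and a blue chair, this sketch asks for the shape of the red object and the color of the table, but skips "What is the color of the chair?" because two chairs are present.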

4 Approach

Figure 2: Approach: at every timestep $t$, the agent receives an RGB observation $x_t$ as input and processes it with a convolutional neural network to produce $z_t$, which is then processed by an LSTM to select action $a_t$. The agent learns to explore: it receives a reward for navigating to each new object. As it explores the environment, it builds up an internal representation $h_t$, which receives pressure from an auxiliary predictive module to capture environment semantics so as to accurately predict consequences of its actions multiple steps into the future. We experiment with a vanilla LSTM agent and two recent predictive approaches – CPCA (Guo et al., 2018) and SimCore (Gregor et al., 2019). The internal representations are then probed via a question-answering decoder whose gradients are not backpropagated into the agent. The QA decoder is an LSTM initialized with the agent’s internal representation $h_t$ and receiving the question at every timestep.

Learning an exploration policy. Predictive modeling has proven to be effective for an agent to develop general knowledge of its environment as it explores and behaves towards its goal, typically maximising environment returns (Gregor et al., 2019; Guo et al., 2018). Since we wish to evaluate the effectiveness of predictive modeling independent of the agent’s specific goal, we define a simple task that stimulates the agent to visit all of the ‘important’ places in the environment (i.e. to acquire an exploratory but otherwise task-neutral policy). This is achieved by giving the agent a reward of every time it visits an object in the room for the first time. After visiting all objects, rewards are refreshed and available to be consumed by the agent again (i.e. re-visiting an object the agent has already been to will now again lead to a reward), and this process continues for the duration of each episode ( seconds or steps).
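The reward scheme described above can be sketched as a small bookkeeping class. This is an illustrative sketch only: the class name, the `step` interface and the default `reward_per_object` value are assumptions, not the paper's implementation.

```python
class ExplorationReward:
    """First-visit reward with a refresh once every object has been visited."""

    def __init__(self, object_ids, reward_per_object=1.0):
        # reward_per_object is a placeholder value chosen for illustration.
        self.all_objects = set(object_ids)
        self.reward_per_object = reward_per_object
        self.unvisited = set(self.all_objects)

    def step(self, visited_object=None):
        """Return the reward for this step.

        `visited_object` is the id of the object the agent reached on
        this step, or None if it reached no object.
        """
        if visited_object not in self.unvisited:
            return 0.0
        self.unvisited.remove(visited_object)
        if not self.unvisited:
            # Every object has been visited: make all rewards available again.
            self.unvisited = set(self.all_objects)
        return self.reward_per_object
```

Because the set of unvisited objects is reset only when it becomes empty, the agent is pushed to cover the whole room repeatedly instead of oscillating between two nearby objects.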

During training on this exploration task, the agent receives a first-person RGB observation $x_t$ at every timestep $t$, and processes it using a convolutional neural network to produce $z_t$. This is input to an LSTM policy whose hidden state is $h_t$ and which outputs a discrete action $a_t$. The agent optimizes the discounted sum of future rewards using an importance-weighted actor-critic algorithm (Espeholt et al., 2018).
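As a reminder of the quantity being optimized, the sketch below computes the discounted sum of future rewards for every step of one episode. The discount factor `gamma` is an assumed value, and the sketch deliberately omits the off-policy importance weighting used by the actor-critic algorithm cited above.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```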

Training the QA-decoder. The question-answering decoder is operationalized as an LSTM that is initialized with the agent’s internal representation and receives the question as input at every timestep (see Fig. 2). The question is a string that we tokenise into words and then map to learned embeddings. The question decoder LSTM is then unrolled for a fixed number of computation steps after which it predicts a softmax distribution over the vocabulary of one-word answers to questions in Table 1, and is trained via a cross-entropy loss. Crucially, this QA decoder is trained independent of the agent policy; i.e. gradients from this decoder are not allowed to flow back into the agent. We evaluate question-answering performance by measuring top-1 accuracy at the end of the episode – we consider the agent’s top predicted answer at the last time step of the episode and compare that with the ground-truth answer.

The QA decoder can be seen as a general purpose decoder trained to extract object-specific knowledge from the agent’s internal state without affecting the agent itself. If this knowledge is not retained in the agent’s internal state, then this decoder will not be able to extract it. This is an important difference with respect to prior work (Gordon et al., 2018; Das et al., 2018a) – wherein agents were trained to move to answer questions, i.e. all parameters had access to linguistic information. Recall that the agent’s navigation policy has been trained for exploration, and so the visual information required to answer a question need not be present in the observation at the end of the episode. Thus, through question-answering, we are evaluating the degree to which agents encode relevant aspects of the environment (object colors, shapes, counts, spatial relations) in their internal representations and maintain this information in memory beyond the point at which it was initially received. See A.1.3 for more details about the QA decoder.
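The probing setup can be illustrated with a deliberately simplified stand-in. The sketch below swaps the question-conditioned LSTM decoder for a linear softmax probe, and treats recorded agent states as constant feature vectors, which plays the role of the stop gradient: only the probe's weights are updated by the cross-entropy loss. All names, the learning rate and the number of epochs are assumptions for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe_logits(weights, x):
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def train_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax probe on frozen agent states with cross-entropy.

    `features` are plain lists of floats, i.e. constants: nothing here
    can change the agent that produced them.
    """
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(probe_logits(weights, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == y].
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
    return weights


def top1_accuracy(weights, features, labels):
    """Fraction of examples whose highest-scoring answer is the true one."""
    correct = 0
    for x, y in zip(features, labels):
        logits = probe_logits(weights, x)
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)
```

If the frozen states do not carry the relevant information, no amount of probe training recovers it, which is the property the evaluation relies on.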

4.1 Auxiliary Predictive Losses

We augment the baseline architecture described above with an auxiliary predictive head consisting of a simulation network (operationalized as an LSTM) that is initialized with the agent’s internal state and deterministically simulates future latent states in an open-loop manner, receiving the agent’s action sequence as input. We evaluate two predictive losses – action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). See Fig. 2 for overview, A.1.2 for details.

Action-conditional CPC (CPCA; Guo et al., 2018) makes use of a noise contrastive estimation model to discriminate between true observations processed by the convolutional neural network, $z_{t+k}$ ($k$ steps into the future), and negatives $z^{-}_{t+k}$ randomly sampled from the dataset, in our case from other episodes in the minibatch. Specifically, at each timestep $t+k$ (up to a maximum), the output of the simulation core $s_{t+k}$ and $z_{t+k}$ are fed to an MLP to predict 1, and $s_{t+k}$ and $z^{-}_{t+k}$ are used to predict 0.
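The contrastive objective can be sketched as a binary cross-entropy over critic scores. In this sketch the MLP is abstracted away: `positive_score` and `negative_scores` stand for its scalar outputs (logits) on a true pair and on pairs built from other episodes. The function names and this interface are assumptions for illustration, not the authors' code.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def contrastive_loss(positive_score, negative_scores):
    """Binary cross-entropy for one prediction step.

    positive_score: logit for (simulated state, true future encoding), target 1.
    negative_scores: logits for (simulated state, encoding sampled from
        another episode), each with target 0.
    """
    loss = -log_sigmoid(positive_score)
    # log(1 - sigmoid(s)) equals log_sigmoid(-s).
    loss -= sum(log_sigmoid(-s) for s in negative_scores)
    return loss
```

The loss falls as the critic scores true future encodings higher and sampled negatives lower, which is what pressures the simulated state to carry information about the actual future observation.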

SimCore (Gregor et al., 2019) uses the simulated state to condition a generative model based on ConvDRAW (Gregor et al., 2016) and GECO (Rezende and Viola, 2018) that predicts the distribution of true observations in pixel space.

Figure 3: Left: reward in an episode. Right: top-1 QA accuracy. Averaged over seeds. Shaded region is SD.

Baselines. We evaluate and compare the above approaches with 1) a vanilla RL agent without any auxiliary predictive losses (referred to as ‘LSTM’), and 2) a question-only agent that receives zero-masked observations as input and is useful to measure biases in our question-answering testbed. Such a baseline is critical, particularly when working with simulated environments, as it can uncover biases in the environment’s generation of tasks that can result in strong but uninteresting performance from agents capable of powerful function approximation (Thomason et al., 2019).

No stop gradient. We also compare against an agent without blocking the QA decoder gradients (labeled ‘No SG’). This model differs from the above in that it is trained end-to-end – with supervision – to answer the set of questions in addition to the exploration task. Hence, it represents an agent receiving privileged information about how to answer and its performance provides an upper bound for how challenging these question-answering tasks are in this context.

Table 2: Top-1 accuracy on question-answering tasks. Columns: Overall, shape, color, exist, count_shape, count_color, compare_n_color, compare_n_shape, near_shape, near_color. Row groups: Baseline (Question-only) and Oracle (No SG).

5 Experiments & Results

5.1 Question-Answering Performance

We begin by analyzing performance on a single question type, shape, whose questions are of the form “what shape is the color object?”. Figure 3 shows the average reward accumulated by the agent in one episode (left) and the QA accuracy at the last timestep of the episode (right) for all approaches over the course of training. We make the following observations:

Figure 4: (Left) Sample trajectory and QA decoding predictions (for the top 5 most probable answers) for the question ‘What shape is the green object?’ from SimCore. Note that the top-down map is not available to the agent. (Right) QA accuracy on disjoint train and test splits.

  • All agents learn to explore. With the exception of ‘question-only’, all agents achieve high reward on the exploration task. This means that they visited all objects in the room more than once each and therefore, in principle, have been exposed to sufficient information to answer all questions.

  • Predictive models aid navigation. Agents equipped with auxiliary predictive losses – CPCA and SimCore – collect the most rewards, suggesting that predictive modeling helps navigate the environment efficiently. This is consistent with findings in (Gregor et al., 2019).

  • QA decoding from LSTM and CPCA representations is no better than chance.

  • SimCore’s representations lead to best QA accuracy. SimCore gets to a QA accuracy of indicating that its representations best capture propositional knowledge and are best suited for decoding answers to questions. Figure 4 (Left) shows example predictions.

  • Wide gap between SimCore and No SG. There is a gap between SimCore and the No SG oracle, suggesting scope for better auxiliary predictive losses.

It is worth emphasizing that answering this shape question from observations is not a challenging task in and of itself. The No SG agent, which is trained end-to-end to optimize both for exploration and QA, achieves almost-perfect accuracy (). The challenge arises from the fact that we are not training the agent end-to-end – from pixels to navigation to QA – but decoding the answer from the agent’s internal state, which is learned agnostic to the question. The answer can only be decoded if the agent’s internal state contains relevant information represented in an easily-decodable way.

Figure 5: (Left) DeepMind Lab environment (Beattie et al., 2016): Rectangular-shaped room with 6 randomly selected objects out of a pool of 20 different objects of different colors. (Right) QA accuracy for color questions (What is the color of the shape?) in DeepMind Lab. Consistent with results in the main paper, internal representations of the SimCore agent lead to the highest accuracy while CPCA and LSTM perform worse and similar to each other.

Decoder complexity. To explore the possibility that answer-relevant information is present in the agent’s internal state but requires a more powerful decoder, we experiment with QA decoders of a range of depths. As detailed in Figure 7 in the appendix, we find that using a deeper QA decoder with SimCore does lead to higher QA accuracy (from layers), although greater decoder depths become detrimental after layers. Crucially, however, in the non-predictive LSTM agent, the correct answer cannot be decoded irrespective of QA decoder capacity. This highlights an important aspect of our question-answering evaluation paradigm – that while the absolute accuracy at answering questions may also depend on decoder capacity, relative differences provide an informative comparison between internal representations developed by different agents.

Table 2 shows QA accuracy for all QA tasks (see Figure 8 in appendix for training curves). The results reveal large variability in difficulty across question types. Questions about attributes (color and shape), which can be answered from a single well-chosen frame of visual experience, are the easiest, followed by spatial relationship questions (near_color and near_shape), and the hardest are counting questions (count_color and count_shape). We further note that:

  • All agents perform better than the question-only baseline, which captures any biases in the environment or question distributions (enabling strategies such as constant prediction of the most-common answer).

  • CPCA representations are not better than LSTM on most question types.

  • SimCore representations achieve higher QA accuracy than other approaches, substantially above the question-only baseline on count_color ( vs. ), near_shape ( vs. ) and near_color ( vs. ), demonstrating a strong tendency for encoding and retaining information about object identities, properties, and both spatial and temporal relations.

Finally, as before, the No SG agent trained to answer questions without stopped gradients achieves highest accuracy for most questions, although not all – perhaps due to trade-offs between simultaneously optimizing performance for different QA losses and the exploration task.
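One bias that a question-only model can exploit is the answer prior for each question. The sketch below measures how far that strategy alone gets by always predicting the most frequent training answer for a given question string. It is a simple stand-in for the learned question-only baseline, and the function name and data format are assumptions.

```python
from collections import Counter, defaultdict


def most_common_answer_accuracy(train_pairs, test_pairs):
    """Accuracy of predicting, per question, its most frequent training answer.

    Both arguments are lists of (question, answer) string pairs.
    Questions never seen in training count as errors.
    """
    answers = defaultdict(Counter)
    for question, answer in train_pairs:
        answers[question][answer] += 1
    correct = 0
    for question, answer in test_pairs:
        if question in answers:
            prediction = answers[question].most_common(1)[0][0]
            correct += int(prediction == answer)
    return correct / len(test_pairs)
```

An agent whose accuracy does not clearly exceed this number on a question type has not demonstrably encoded anything about the current room.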

5.2 Compositional Generalization

While there is a high degree of procedural randomization in our environment and QA tasks, overparameterized neural-network-based models in limited environments are always prone to overfitting or rote memorization. We therefore constructed a test of the generality of the information encoded in the internal state of an agent. The test involves a variant of the shape question type (i.e. questions like “what shape is the color object?”), but in which the possible question-answer pairs are partitioned into mutually exclusive training and test splits. Specifically, the test questions are constrained such that they are compositionally novel – the color, shape combination involved in the question-answer pair is never observed during training, but both attributes are observed in other contexts. For instance, a test question-answer pair “Q: what shape is the blue object?, A: table” is excluded from the training set of the QA decoder, but “Q: what shape is the blue object?, A: car” and “Q: What shape is the green object?, A: table” are part of the training set (but not the test set).
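One way to build such a split is sketched below: hold out a "diagonal" of color, shape combinations so that every held-out pair is novel while each color and each shape still appears in training. The paper does not specify how its partition was constructed, so this scheme and the function name are assumptions; it requires at least two colors and two shapes.

```python
def compositional_split(colors, shapes):
    """Partition all (color, shape) pairs into disjoint train and test sets.

    For the color at index i, the shape at index i % len(shapes) is held
    out, so each test pair combines attributes seen separately in training.
    """
    train, test = [], []
    for i, color in enumerate(colors):
        for j, shape in enumerate(shapes):
            if j == i % len(shapes):
                test.append((color, shape))
            else:
                train.append((color, shape))
    return train, test
```

With colors blue, green, red and shapes table, car, chair, the pair (blue, table) is held out while (blue, car) and (green, table) remain in training, mirroring the example in the text.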

We evaluate the SimCore agent on this test of generalization (since other agents perform poorly on the original task). Figure 4 (right) shows that the QA decoder applied to SimCore’s internal states performs substantially above chance (and above all baselines) on the held-out test questions (although somewhat below training performance). This indicates that the QA decoder extracts and applies information in a comparatively factorized (or compositional) manner, and suggests (circumstantially) that the knowledge acquired by the SimCore agent may also be represented in this way.

5.3 Robustness of the results

To check whether our results are robust to the choice of environment, we developed a similar setup using the DeepMind Lab environment (Beattie et al., 2016) and ran the same experiments without any change in hyperparameters.

The environment consists of a rectangular room that is populated with a random selection of objects of different shapes and colors in each episode. There are 6 distinct objects in each room, selected from a pool of 20 objects and 9 different colors. We use a similar exploration reward structure as in our earlier environment to train the agents to navigate and observe all objects. Finally, in each episode, we introduce a question of the form ‘What is the color of the shape?’ where shape is replaced by the name of an object present in the room.
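As an illustration of this episode structure, the following Python sketch samples a room and a color question. It is not the environment code: the object and color names are placeholders, the function name `sample_episode` is hypothetical, and we assume that the six objects have distinct shapes so that each question has a unique answer.

```python
import random

NUM_OBJECT_TYPES = 20   # size of the object pool stated in the text
NUM_COLORS = 9          # number of colors stated in the text
OBJECTS_PER_ROOM = 6    # distinct objects per episode stated in the text

# Placeholder names; the actual object and color names are not listed here.
OBJECT_POOL = [f"object_{i}" for i in range(NUM_OBJECT_TYPES)]
COLOR_POOL = [f"color_{i}" for i in range(NUM_COLORS)]


def sample_episode(rng):
    """Sample a room layout plus one color question and its answer."""
    shapes = rng.sample(OBJECT_POOL, OBJECTS_PER_ROOM)  # distinct shapes (assumption)
    room = {shape: rng.choice(COLOR_POOL) for shape in shapes}
    target = rng.choice(shapes)
    question = f"What is the color of the {target}?"
    return room, question, room[target]


room, question, answer = sample_episode(random.Random(0))
```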

Figure 5 shows question-answering accuracies in the DeepMind Lab environment. Consistent with the results presented above, internal representations of the SimCore agent lead to the highest answering accuracy while CPCA and the vanilla LSTM agent perform worse and similar to each other. Crucially, for running experiments in DeepMind Lab, we did not change any hyperparameters from the experimental setup described before. This demonstrates that our approach is not specific to a single environment and that it can be readily applied in a variety of settings.

6 Discussion

Developing agents with world models of their environments is an important problem in AI. To do so, we need tools to evaluate and diagnose the internal representations forming these world models in addition to studying task performance. Here, we marry together population or glass-box decoding techniques with a question-answering paradigm to discover how much propositional (or declarative) knowledge agents acquire as they explore their environment.

We started by developing a range of question-answering tasks in a visually-rich 3D environment, serving as a diagnostic test of an agent’s scene understanding, visual reasoning, and memory skills. Next, we trained agents to optimize an exploration objective with and without auxiliary self-supervised predictive losses, and evaluated the representations they form as they explore an environment, via this question-answering testbed. We compared model-free RL agents alongside agents that make egocentric visual predictions and found that the latter (in particular SimCore (Gregor et al., 2019)) are able to reliably capture detailed propositional knowledge in their internal states, which can be decoded as answers to questions, while non-predictive agents do not, even if they optimize the exploration objective well.

Interestingly, not all predictive agents are equally good at acquiring knowledge of objects, relations and quantities. We compared a model learning the probability distribution of future frames in pixel space via a generative model (SimCore (Gregor et al., 2019)) with a model based on discriminating frames through contrastive estimation (CPC (Guo et al., 2018)). We found that while both learned to navigate well, only the former developed representations that could be used for answering questions about the environment. Gregor et al. (2019) previously showed that the choice of predictive model has a significant impact on the ability to decode an agent’s position and top-down map reconstructions of the environment from its internal representations. Our experiments extend this result to decoding factual knowledge, and demonstrate that the question-answering approach has utility for comparing agents.

Finally, the fact that we can even decode answers to questions from an agent’s internal representations learned solely from egocentric future predictions, without exposing the agent itself directly to knowledge in propositional form, is encouraging. It indicates that the agent is learning to form and maintain invariant object identities and properties (modulo limitations in decoder capacity) in its internal state without explicit supervision.

It is 30 years since Elman (1990) showed how syntactic structures and semantic organization can emerge in the units of a neural network as a consequence of the simple objective of predicting the next word in a sequence. This work corroborates Elman’s findings, showing that language-relevant general knowledge can emerge in a situated neural-network agent that predicts future low-level visual observations via a sufficiently powerful generative mechanism. The result also aligns with perspectives that emphasize the importance of interaction between sensory modalities in supporting the development of conceptual or linguistic knowledge (McClelland et al., 2019). Our study is a small example of how language can be used as a channel to probe and understand what exactly agents can learn from their environments. We hope it motivates future research in evaluating predictive agents using natural linguistic interactions.


  • G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644. Cited by: §2.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Cited by: §2.
  • J. Andreas, D. Klein, and S. Levine (2017) Modular multitask reinforcement learning with policy sketches. In ICML, Cited by: §2.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In ICCV, Cited by: §1.
  • B. S. Atal and M. R. Schroeder (1970) Adaptive predictive coding of speech signals. Bell System Technical Journal. Cited by: §1.
  • M. G. Azar, B. Piot, B. A. Pires, J. Grill, F. Altché, and R. Munos (2019) World discovery models. arXiv preprint arXiv:1902.07685. Cited by: §2.
  • C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen (2016) DeepMind lab. CoRR abs/1612.03801. External Links: Link, 1612.03801 Cited by: Figure 5, §5.3.
  • W. Bialek, F. Rieke, R. D. R. Van Steveninck, and D. Warland (1991) Reading a neural code. Science 252 (5014), pp. 1854–1857. Cited by: §2.
  • C. Cangea, E. Belilovsky, P. Liò, and A. Courville (2019) VideoNavQA: bridging the gap between visual and embodied question answering. arXiv preprint arXiv:1908.04950. Cited by: §2.
  • D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov (2018) Gated-attention architectures for task-oriented language grounding. In AAAI, Cited by: §2.
  • A. Clark (2016) Surfing uncertainty. Oxford University Press, Oxford. Cited by: §1, §2.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of ACL, Cited by: §2, §2.
  • A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018a) Embodied Question Answering. In CVPR, Cited by: §1, §2, §3, §4.
  • A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018b) Neural Modular Control for Embodied Question Answering. In CORL, Cited by: §2.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M.F. Moura, D. Parikh, and D. Batra (2017) Visual Dialog. In CVPR, Cited by: §1.
  • P. Dayan (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §2.
  • F. Dretske (1995) Meaningful perception. An Invitation to Cognitive Science: Visual Cognition, pp. 331–352. Cited by: §2.
  • P. Elias (1955) Predictive coding – I. IRE Transactions on Information Theory. Cited by: §1.
  • J. L. Elman (1990) Finding structure in time. Cognitive science. Cited by: §1, §2, §2, §6.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §A.1.1, §4.
  • A. P. Georgopoulos, A. B. Schwartz, and R. E. Kettner (1986) Neuronal population coding of movement direction. Science 233 (4771), pp. 1416–1419. Cited by: §2.
  • D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) IQA: Visual Question Answering in Interactive Environments. In CVPR, Cited by: §1, §2, §3, §4.
  • K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra (2016) Towards conceptual compression. In NeurIPS, Cited by: §A.1.2, §4.1.
  • K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. v. d. Oord (2019) Shaping Belief States with Generative Environment Models for RL. In NeurIPS, Cited by: 5(a), §A.1.2, Probing Emergent Semantics in Predictive Agents via Question Answering, 2nd item, §1, §2, §2, §2, Figure 2, §4.1, §4.1, §4, 2nd item, §6, §6.
  • Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos (2018) Neural predictive belief representations. arXiv preprint arXiv:1811.06407. Cited by: §A.1.2, Probing Emergent Semantics in Predictive Agents via Question Answering, 2nd item, §1, §2, §2, §2, Figure 2, §4.1, §4.1, §4, §6.
  • S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In CVPR, Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §A.1.2.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. Czarnecki, M. Jaderberg, D. Teplyashin, et al. (2017) Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551. Cited by: §2.
  • F. Hill, S. Mokra, N. Wong, and T. Harley (2020) Human instruction-following with deep reinforcement learning via transfer-learning from text. External Links: 2005.09382 Cited by: §2.
  • J. Hohwy (2013) The predictive mind. Oxford University Press, Oxford. Cited by: §1, §2.
  • A. G. Huth, T. Lee, S. Nishimoto, N. Y. Bilenko, A. T. Vu, and J. L. Gallant (2016) Decoding the semantic content of natural movies from human brain activity. Frontiers in systems neuroscience. Cited by: §2.
  • M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §2.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, Cited by: §3.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In NIPS, Cited by: §2.
  • M. Malinowski and M. Fritz (2014) A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, Cited by: §1.
  • C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox (2013) Learning to parse natural language commands to a robot control system. In Experimental Robotics, Cited by: §2.
  • J. L. McClelland, F. Hill, M. Rudolph, J. Baldridge, and H. Schütze (2019) Extending machine language models toward human-level language understanding. External Links: 1912.05877 Cited by: §6.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
  • D. Misra, J. Langford, and Y. Artzi (2017) Mapping instructions and visual observations to actions with reinforcement learning. In ACL, Cited by: §2.
  • K. Narasimhan, T. Kulkarni, and R. Barzilay (2015) Language understanding for text-based games using deep reinforcement learning. In EMNLP, Cited by: §2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • N. Nortmann, S. Rekauzke, S. Onat, P. König, and D. Jancke (2013) Primary visual cortex represents the difference between past and present. Cerebral Cortex. Cited by: §2.
  • J. Oh, S. Singh, H. Lee, and P. Kohli (2017) Zero-shot task generalization with multi-task deep reinforcement learning. In ICML, Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §2.
  • X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba (2018) VirtualHome: simulating household activities via programs. In CVPR, Cited by: §2.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In ACL, Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP, Cited by: §1.
  • R. P. Rao and D. H. Ballard (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience. Cited by: §1, §2.
  • S. Recanatesi, M. Farrell, G. Lajoie, S. Deneve, M. Rigotti, and E. Shea-Brown (2019) Predictive learning extracts latent space representations from sensory observations. bioRxiv. Cited by: §1.
  • D. J. Rezende and F. Viola (2018) Taming VAEs. arXiv preprint arXiv:1810.00597. Cited by: §4.1.
  • A. Roskies and C. Wood (2017) Catching the prediction wave in brain science. Analysis 77, pp. 848–857. Cited by: §2.
  • E. Salinas and L. Abbott (1994) Vector reconstruction from firing rates. Journal of computational neuroscience 1 (1-2), pp. 89–107. Cited by: §2.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In ICML, Cited by: §1.
  • T. Schaul and M. Ring (2013) Better generalization with forecasts. In IJCAI, Cited by: §1.
  • J. Schmidhuber (1991) Curious model-building control systems. In IJCNN, Cited by: §1.
  • A. K. Seth (2015) The cybernetic Bayesian brain: from interoceptive inference to sensorimotor contingencies. In Open MIND: 35(T), T. Metzinger and J. M. Windt (Eds.), Cited by: §2.
  • T. Shu, C. Xiong, and R. Socher (2018) Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. In ICLR, Cited by: §2.
  • D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, et al. (2017) The predictron: end-to-end learning and planning. In ICML, Cited by: §1.
  • S. P. Stich (1979) Do animals have beliefs?. Australasian Journal of Philosophy 57 (1), pp. 15–28. Cited by: §1.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019) What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR, Cited by: §2.
  • J. Thomason, D. Gordon, and Y. Bisk (2019) Shifting the baseline: single modality performance on visual navigation & QA. In NAACL, Cited by: §4.1.
  • D. Truncellito (2007) Epistemology. Internet Encyclopedia of Philosophy. Cited by: §2.
  • A. Vogel and D. Jurafsky (2010) Learning to follow navigational directions. In ACL, Cited by: §2.
  • X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In ICCV, Cited by: §2.
  • G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, et al. (2018) Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760. Cited by: §1, §2.
  • E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra (2019) Embodied Question Answering in Photorealistic Environments with Point Cloud Perception. In CVPR, Cited by: §2.
  • D. Williams (2018) Predictive coding and thought. Synthese, pp. 1–27. Cited by: §2.
  • T. Winograd (1972) Understanding natural language. Cognitive Psychology. Cited by: §2.
  • H. Yu, H. Zhang, and W. Xu (2018) Interactive Grounded Language Acquisition and Generalization in a 2D World. In ICLR, Cited by: §2.
  • L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra (2019) Multi-target embodied question answering. In CVPR, Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §2.
  • Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi (2017a) Visual Semantic Planning using Deep Successor Representations. In ICCV, Cited by: §2, §2.
  • Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017b) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, Cited by: §2.

Appendix A Appendix

A.1 Network architectures and Training setup

A.1.1 Importance Weighted Actor-Learner Architecture

Agents were trained using the IMPALA framework (Espeholt et al., 2018). Briefly, there are N parallel ‘actors’ collecting experience from the environment in a replay buffer and one learner taking batches of trajectories and performing the learning updates. During one learning update the agent network is unrolled, all the losses (RL and auxiliary ones) are evaluated and the gradients computed.

A.1.2 Agents

Input encoder To process the frame input, all models in this work use a residual network (He et al., 2016) of 6 64-channel ResNet blocks with rectified linear activation functions and a bottleneck channel of size 32. We use strides of (2, 1, 2, 1, 2, 1) and do not use batch-norm. Following the convnet, we flatten the output and use a linear layer to reduce the size to 500 dimensions. Finally, we concatenate this encoding of the frame together with a one-hot encoding of the previous action and the previous reward.

Core architecture The recurrent core of all agents is a 2-layer LSTM with 256 hidden units per layer. At each time step this core consumes the input embedding described above and updates its state. We then use a single-layer MLP with 200 units to compute a value baseline and an equivalent network to compute action logits, from which one discrete action is sampled.

Simulation Network Both predictive agents have a simulation network with the same architecture as the agent’s core. This network is initialized with the agent state at some random time from the trajectory and unrolled forward for a random number of steps up to 16, receiving only the actions of the agent as inputs. We then use the resulting LSTM hidden state as conditional input for the prediction loss (SimCore or CPCA).
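The choice of starting point and horizon for one simulation rollout can be sketched as below. This is an illustrative Python sketch, not the training code: the function name `sample_overshoot` is hypothetical, uniform sampling of the start time and horizon is our assumption, and we assume the convention that the action at time t leads to the frame at time t + 1.

```python
import random

UNROLL_LENGTH = 50      # trajectory length from the hyperparameter table
MAX_OVERSHOOT = 16      # maximum number of simulation steps


def sample_overshoot(actions, rng, max_overshoot=MAX_OVERSHOOT):
    """Pick a start time t and a horizon k for one simulation rollout.

    Returns (t, k, simulated_actions). The simulation network would start
    from the agent state at time t and consume only these k actions; the
    prediction target would be the frame at time t + k.
    """
    length = len(actions)
    t = rng.randrange(length - 1)  # leave at least one future step
    k = rng.randint(1, min(max_overshoot, length - 1 - t))
    return t, k, actions[t:t + k]


# Stand-in action sequence; real actions would come from the agent's trajectory.
t, k, simulated = sample_overshoot(list(range(UNROLL_LENGTH)), random.Random(0))
```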

SimCore We use the same architecture and hyperparameters described in (Gregor et al., 2019). The output of the simulation network is used to condition a Convolutional DRAW (Gregor et al., 2016). This is a conditional deep variational auto-encoder with recurrent encoder and decoder using convolutional operations and a canvas that accumulates the results at each step to compute the distribution over inputs. It features a recurrent prior network that receives the conditioning vector and computes a prior over the latent variables. See more details in (Gregor et al., 2019).

Action-conditional CPC We replicate the architecture used in (Guo et al., 2018). CPCA uses the output of the simulation network as input to an MLP that is trained to discriminate true versus false future frame embeddings. Specifically, the simulation network outputs a conditioning vector after a number of simulation steps, which is concatenated with the frame embedding produced by the image encoder on a candidate frame and sent through the MLP discriminator. The discriminator has one hidden layer of 512 units, ReLU activations and a linear output of size 1, which is trained as a binary classifier to assign true embeddings to one class and false embeddings to the other. We take the negative examples from random time points in the same batch of trajectories.
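The binary classification objective applied to the discriminator output can be sketched as follows. This Python sketch only shows the loss on scalar logits; the function names are hypothetical, and weighting the true and false examples equally is our assumption rather than a detail given above.

```python
import math


def binary_cross_entropy_with_logits(logit, label):
    """Numerically stable -[y log s(z) + (1 - y) log(1 - s(z))], s = sigmoid."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def cpca_loss(positive_logit, negative_logits):
    """Average classification loss over one true and several false embeddings."""
    losses = [binary_cross_entropy_with_logits(positive_logit, 1.0)]
    losses += [binary_cross_entropy_with_logits(z, 0.0) for z in negative_logits]
    return sum(losses) / len(losses)


# A discriminator that scores the true embedding high and the negatives low
# incurs a small loss; the reverse incurs a large one.
good = cpca_loss(5.0, [-5.0, -5.0])
bad = cpca_loss(-5.0, [5.0, 5.0])
```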

A.1.3 QA network architecture

Question encoding The question string is first tokenized into words, which are then mapped to integers corresponding to vocabulary indices. These are used to look up 32-dimensional embeddings for each word. We then unroll a single-layer LSTM with 64 units for a fixed number of 15 steps. The language representation is computed by summing the hidden states over all time steps.
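The first two steps (tokenization and index lookup, padded to the fixed length of 15) can be sketched in Python as below. The tokenization rule, the padding and out-of-vocabulary indices, and all function names are our own assumptions for illustration; the embedding lookup and LSTM are omitted.

```python
MAX_QUESTION_LENGTH = 15   # fixed number of encoder steps
PAD_ID = 0                 # assumed padding index
UNK_ID = 1                 # assumed out-of-vocabulary index


def tokenize(question):
    """Lowercase, split on whitespace, and treat '?' as its own token."""
    return question.lower().replace("?", " ?").split()


def build_vocab(questions):
    """Assign an integer index to every word seen in the given questions."""
    words = sorted({w for q in questions for w in tokenize(q)})
    return {w: i + 2 for i, w in enumerate(words)}  # 0 and 1 are reserved


def encode_question(question, vocab, max_len=MAX_QUESTION_LENGTH):
    """Map a question to a fixed-length list of vocabulary indices."""
    ids = [vocab.get(w, UNK_ID) for w in tokenize(question)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab(["what shape is the blue object?"])
ids = encode_question("What shape is the green object?", vocab)
```

Here "green" is not in the toy vocabulary and therefore maps to the out-of-vocabulary index, while the remaining positions up to length 15 are padded.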

QA decoder. To decode answers from the internal state of the agents we use a second LSTM initialized with the internal state of the agent’s LSTM and unroll it for a fixed number of steps, consuming the question embedding at each step. The results reported in the main section were computed using 12 decoding steps. The terminal state is sent through a two-layer MLP (sizes 256, 256) to compute a vector of answer logits with the size of the vocabulary and output the top-1 answer.

A.1.4 Hyper-parameters

The hyper-parameter values used in all the experiments are in Table 3.

Parameter                                  Value
Learning rate                              1e-4
Unroll length                              50
Adam β1                                    0.90
Adam β2                                    0.95
Policy entropy regularization              0.0003
Discount factor                            0.99
No. of ResNet blocks                       6
No. of channels in ResNet block            64
Frame embedding size                       500
No. of LSTM layers                         2
No. of units per LSTM layer                256
No. of units in value MLP                  200
No. of units in policy MLP                 200

Simulation Network
Overshoot length                           16
No. of LSTM layers                         2
No. of units per LSTM layer                256
No. of simulations per trajectory          6
No. of evaluations per overshoot           2
No. of ConvDRAW steps                      8
GECO kappa                                 0.0015
MLP discriminator size                     64

QA network
Vocabulary size                            1000
Maximum question length                    15
No. of units in Text LSTM encoder          64
Question embedding size                    32
No. of LSTM layers in question decoder     2
No. of units per LSTM layer                256
No. of units in question decoder MLP       200
No. of decoding steps                      12
Table 3: Hyperparameters.

A.1.5 Negative sampling strategies for CPCA

We experimented with multiple sampling strategies for the CPCA agent (whether or not negative examples are sampled from the same trajectory, the number of contrastive prediction steps, the number of negative examples). We report the best results in the main text. The CPCA agent did provide better representations of the environment than the LSTM-based agent, as shown by the top-down view reconstruction loss (Figure 6(a)). However, none of the CPCA agent variations that we tried led to better-than-chance question-answering accuracy. As an example, in Figure 6(b) we compare sampling negatives from the same trajectory or from any trajectory in the training batch.

(a) To test whether the CPCA loss provided improved representations we reconstructed the environment top-down view, similar to (Gregor et al., 2019). Indeed the reconstruction loss is lower for CPCA than for the LSTM agent.
(b) QA accuracy for the CPCA agent is not better than the LSTM agent, for both sampling strategies of negatives.
Figure 6:

A.2 Effect of QA network depth

To study the effect of the QA network capacity on the answer accuracy, we tested decoders of different depths applied to both the SimCore and the LSTM agent’s internal representations (Figure 7). The QA network is an LSTM initialized with the agent’s internal state that we unroll for a fixed number of steps, feeding the question as input at each step. We found that, indeed, the answering accuracy increased with the number of unroll steps from 1 to 12, while a greater number of steps became detrimental. We performed the same analysis on the LSTM agent and found that regardless of the capacity of the QA network, we could not decode the correct answer from its internal state, suggesting that the limiting factor is not the capacity of the decoder but the lack of useful representations in the LSTM agent state.

Figure 7: Answer accuracy over training for increasing QA decoder depth. The left subplot shows the results for the SimCore agent and the right subplot for the LSTM baseline. For SimCore, the QA accuracy increases with the decoder depth, up to 12 layers. For the LSTM agent, QA accuracy is not better than chance regardless of the capacity of the QA network.

A.3 Answering accuracy during training for all questions

The QA accuracy over training for all questions is shown in Figure 8.

Figure 8: QA accuracy over training for all questions and all models.

A.4 Environment

Our environment is a single L-shaped 3D room, procedurally populated with an assortment of objects.

Actions and Observations. The environment is episodic, and runs at frames per second. Each episode takes seconds (or steps). At each step, the environment provides the agent with two observations: a x RGB image with the first-person view of the agent and the text containing the question.

The agent can interact with the environment by providing multiple simultaneous actions to control movement (forward/back, left/right), looking (up/down, left/right), picking up and manipulating objects (4 degrees of freedom: yaw, pitch, roll + movement along the axis between agent and object).

Rewards. To allow training using cross-entropy, as described in Section 2, the environment provides the ground-truth answer instead of the reward to the agent.

Object creation and placement. We generate between and objects, depending on the task, with the type of the object, its color and size being uniformly sampled from the set described in Table 4.

Objects are placed at a random location and in a random orientation. Some tasks require additional constraints: for example, if the question is “What is the color of the cushion near the bed?”, we need to ensure that only one cushion is close to the bed. This was done by checking the constraints and regenerating the placement whenever they were not satisfied.
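This check-and-regenerate procedure amounts to rejection sampling, which can be sketched in Python as follows. The sketch is illustrative only: it uses a square room rather than the L-shaped one, and the function names, the room size and the distance threshold `near` are all hypothetical values chosen for the example.

```python
import math
import random


def constraint_ok(layout, near):
    """True if exactly one cushion lies within `near` of the (single) bed."""
    bed = next((x, y) for name, x, y, _ in layout if name == "bed")
    close = [1 for name, x, y, _ in layout
             if name == "cushion" and math.hypot(x - bed[0], y - bed[1]) <= near]
    return len(close) == 1


def place_objects(names, rng, room_size=10.0, near=2.0, max_tries=1000):
    """Resample random placements until the constraint is satisfied."""
    for _ in range(max_tries):
        # Each entry is (name, x, y, orientation in degrees).
        layout = [(name, rng.uniform(0, room_size), rng.uniform(0, room_size),
                   rng.uniform(0, 360)) for name in names]
        if constraint_ok(layout, near):
            return layout
    raise RuntimeError("no valid placement found")


layout = place_objects(["bed", "cushion", "cushion", "chair"], random.Random(0))
```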

Attribute  Options
Object     basketball, cushion, carriage, train, grinder, candle, teddy, chair,
           scissors, stool, book, football, rubber duck, glass, toothpaste, arm chair,
           robot, hairdryer, cube block, bathtub, TV, plane, cuboid block,
           car, tv cabinet, plate, soap, rocket, dining table, pillar block,
           potted plant, boat, tennisball, tape dispenser, pencil, wash basin,
           vase, picture frame, bottle, bed, helicopter, napkin, table lamp,
           wardrobe, racket, keyboard, chest, bus, roof block, toilet
Color      aquamarine, blue, green, magenta, orange, purple, pink, red,
           white, yellow
Size       small, medium, large
Table 4: Randomization of objects in the Unity room: 50 different types, 10 different colors and 3 different scales.
Body movement actions Movement and grip actions Object manipulation
Table 5: Environment action set.