World Discovery Models

02/20/2019
by   Mohammad Gheshlaghi Azar, et al.

As humans, we are driven by a strong desire to seek novelty in our world. Upon observing a novel pattern, we are also capable of refining our understanding of the world based on the new information: humans can discover their world. The outstanding ability of the human mind for discovery has led to many breakthroughs in science, art and technology. Here we investigate the possibility of building an agent capable of discovering its world using modern AI technology. In particular, we introduce NDIGO, Neural Differential Information Gain Optimisation, a self-supervised discovery model that aims at seeking new information to construct a global view of its world from partial and noisy observations. Our experiments on some controlled 2-D navigation tasks show that NDIGO outperforms state-of-the-art information-seeking methods in terms of the quality of the learned representation. The improvement in performance is particularly significant in the presence of white or structured noise, where other information-seeking methods follow the noise instead of discovering their world.


1 Introduction

Modern AI has been remarkably successful in solving complex decision-making problems such as Go (silver2016mastering; silver2017mastering), simulated control tasks (schulman2015trust), robotics (levine2016end), poker (moravvcik2017deepstack) and Atari games (mnih2015human; hessel2018rainbow). Despite these successes, the agents developed by those methods are specialists: they perform extremely well at the tasks they were trained on but are not very successful at generalising their task-dependent skills in the form of a general domain understanding. Also, the success of existing AI agents often depends strongly on the availability of external feedback from their world in the form of reward signals or labelled data, for which some level of supervision is required. This is in contrast to the human mind, which is a general and self-supervised learning system that discovers the world around it even when no external reinforcement is available. Discovery is the ability to obtain knowledge of a phenomenon for the first time (merriam2004merriam). As discovery entails the process of learning of and about new things, it is an integral part of what makes humans capable of understanding their world in a task-independent and self-supervised fashion.

The underlying process of discovery in humans is complex and multifaceted (hohwy2013predictive). However, one can identify two main mechanisms for discovery (clark2017nice). The first mechanism is active information seeking. One of the primary behaviours of humans is their attraction to novelty (new information) in their world (litman2005curiosity; kidd2015psychology). The human mind is very good at distinguishing between the novel and the known, and this ability is partially due to the extensive internal reward mechanisms of surprise, curiosity and excitement (schmidhuber2009simple). The second mechanism is building a statistical world model. Within cognitive neuroscience, the theory of the statistical predictive mind states that the brain, like a scientist, constructs and maintains a set of hypotheses over its representation of the world (friston2014computational). Upon perceiving a novelty, our brain is able to validate the existing hypotheses, reinforce the ones that are compatible with the new observation and discard the incompatible ones. This self-supervised process of hypothesis building is essentially how humans consolidate their ever-growing knowledge in the form of an accurate and global model. Inspired by these insights from cognitive neuroscience, information-seeking algorithms have received significant attention as a way to improve the exploration capability of artificial learning agents (schmidhuber1991possibility; houthooft2016vime; achiam2017surprise; pathak2017curiosity; burda2018large; shyam2018model). However, the scope of the existing information-seeking algorithms is often limited to fully observable and deterministic environments. One problem with existing novelty-seeking algorithms is that agents trained by these methods tend to become attracted to random patterns in their world and stop exploring upon encountering them, despite the fact that these random patterns contain no actual information about the world (burda2018large). Moreover, the performance of existing agents is often evaluated based on their ability to solve a reinforcement learning (RL) task with extrinsic reward, and not on the quality of the learned world representation, which is the actual goal of discovery. Thus, it is not clear whether the existing algorithms are capable of using novel information to discover their world. Therefore, the problem of discovery in the general case of partially observable and stochastic environments remains open.

The main contribution of this paper is to develop a practical and end-to-end algorithm for discovery in stochastic and partially observable worlds using modern AI technology. We achieve this goal by designing a simple yet effective information-seeking algorithm, called NDIGO (Neural Differential Information Gain Optimisation), designed specifically for stochastic, partially observable domains. NDIGO identifies novelty by measuring the increment of information provided by a new observation in predicting future observations, compared to a baseline prediction for which this observation is withheld. We show that this measure can be estimated using the difference of prediction losses of two estimators, one of which has access to the complete set of observations while the other does not receive the latest observation. We then use this measure of novelty as the intrinsic reward to train the policy using a state-of-the-art reinforcement learning algorithm (kapturowski2018recurrent). One of the key features of NDIGO is its robustness to noise, as the process of subtracting prediction losses cancels out errors that the algorithm cannot improve on. Moreover, NDIGO is well suited for discovery in partially observable domains, as its measure of novelty drives the agent to the unobserved areas of the world where new information can be gained from the observations. Our experiments show that NDIGO produces robust performance in the presence of noise in partially observable environments: NDIGO not only finds true novelty without being distracted by the noise, but it also incorporates this information into its world representation without forgetting previous observations.

2 Related Work

It has been argued for decades in developmental psychology (white1959motivation; deci1985intrinsic; csikszentmihalyi1992optimal), neuroscience (dayan2002reward; kakade2002dopamine; horvitz2000mesolimbocortical) and machine learning (oudeyer2008can; gottlieb2013information; schmidhuber1991curious) that an agent maximising a simple intrinsic reward based on patterns that are both novel and learnable could exhibit essential aspects of intelligence such as autonomous development (oudeyer2016evolution).

More specifically, in his survey on the theory of creativity and intrinsic motivation, schmidhuber2010formal explains how to build an agent that can discover and understand its environment in a self-supervised way. He establishes that four components are necessary: i) a world model (ha2018world) that encodes what is currently known; it can be a working-memory component such as a Long Short-Term Memory network (LSTM, hochreiter1997long) or a Gated Recurrent Unit network (GRU, cho2014learning). ii) A learning algorithm that improves the world model. For instance, guo2018 have shown that a GRU trained with a Contrastive Predictive Coding (CPC, oord2018representation) loss on future frames can learn a representation of the agent's current and past position and orientation, as well as the positions of objects in the environment. iii) An intrinsic reward generator, based on the world model, that produces rewards for patterns that are both novel and learnable. Different types of intrinsic rewards can be used, such as the world model's prediction error (stadie2015incentivizing; pathak2017curiosity), the improvement of the model's prediction error, also known as prediction gain (achiam2017surprise; schmidhuber1991curious; lopes2012exploration), and finally information gain (shyam2018model; itti2009bayesian; little2013learning; frank2014curiosity; houthooft2016vime). iv) An RL algorithm that finds an optimal policy with respect to the intrinsic rewards.

Recently, several implementations of intrinsically motivated agents have been attempted using modern AI technology. Most of them use the concept of prediction error as an intrinsic reward (stadie2015incentivizing; pathak2017curiosity; burda2018large; haber2018learning). However, it has been argued that agents optimising the prediction error are susceptible to being attracted to white noise (oudeyer2007intrinsic), and this should therefore be avoided. To solve the white-noise problem, different types of random or learned projections (burda2018large) of the original image into a smaller feature space less susceptible to white noise have been considered. Other implementations rely on approximations of the concept of information gain (houthooft2016vime; achiam2017surprise) via a variational lower-bound argument: as they try to train a probabilistic model over the set of possible dynamics, the computation of the posterior of that distribution is intractable (houthooft2016vime). Finally, models based on prediction gain are fundamentally harder to train than models based on prediction error (achiam2017surprise; lopes2012exploration; pathak2017curiosity; ostrovski2017count), and it is not entirely clear how effective they are at seeking novelty in comparison with methods that rely on information gain (schmidhuber2010formal).

3 Setting

We consider a partially observable environment where an agent is shown an observation $o_t$ at time $t$, then selects an action $a_t$ which generates a new observation $o_{t+1}$ at the next time step. We assume observations are generated by an underlying state process $(s_t)_{t \ge 0}$ following Markov dynamics, i.e., $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, where $P$ is the dynamics of the underlying process. Although we do not explicitly use the corresponding terminology, this process can be formalised in terms of Partially Observable Markov Decision Processes (POMDPs; lovejoy1991survey; cassandra1998exact).

The future observation in a POMDP can also be seen as the output of a stochastic mapping whose input is the current history. Indeed, at any given time $t$, let the current history $h_t$ be all past actions and observations, $h_t := (o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t)$. Then we define $p(\cdot \mid h_t, a_t)$, the probability distribution of $o_{t+1}$ knowing the history $h_t$ and the action $a_t$. One can generalise this notion to $k$-step prediction: for any integers $t$ and $t' \ge t$, let us denote by $t{:}t'$ the integer interval $\{t, t+1, \ldots, t'\}$, and let $a_{t:t'}$ and $o_{t:t'}$ be the sequences of actions and observations from time $t$ up to time $t'$, respectively. Then $o_{t+k}$ can be seen as a sample drawn from the probability distribution $p(\cdot \mid h_t, a_{t:t+k-1})$, which is the $k$-step open-loop prediction model of the observation $o_{t+k}$. We also use the short-hand notation $p_{t+k \mid t}$ for the probability distribution of $o_{t+k}$ given the history $h_t$ and the sequence of actions $a_{t:t+k-1}$.
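In display form, and under the notation reconstructed above (the symbol names are our choice, not necessarily those of the original paper):

\[
  h_t := (o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t), \qquad
  o_{t+k} \sim p(\cdot \mid h_t, a_{t:t+k-1}), \qquad
  p_{t+k \mid t} := p(\cdot \mid h_t, a_{t:t+k-1}).
\]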

4 Learning the World Model

The world model should capture what the agent currently knows about the world, so that it can make predictions based on what it knows. We thus build a model of the world by predicting future observations given the past (see, e.g., schmidhuber1991curious; guo2018). More precisely, we build an internal representation $b_t$ by making predictions of future frames $o_{t+k}$ conditioned on a sequence of actions $a_{t:t+k-1}$ and given the past history $h_t$. This is similar to the approach of Predictive State Representations (littman2002predictive), from which we know that if the learnt representation is able to predict the probability of any future observation conditioned on any sequence of actions and history, then this representation contains all information about the belief state (i.e., the distribution over the ground-truth state $s_t$).

4.1 Architecture

We propose to learn the world model using a recurrent neural network (RNN) fed with the concatenation of the observation features $z_t$ and the previous action $a_{t-1}$ (encoded as a one-hot vector). The observation features $z_t$ are obtained by applying a convolutional neural network (CNN) to the observation $o_t$. The RNN is a Gated Recurrent Unit (GRU) and the internal representation is the hidden state of the GRU, that is, $b_t = \mathrm{GRU}(b_{t-1}, [z_t, a_{t-1}])$, as shown in Figure 1. We initialise this GRU by setting its hidden state to the null vector, $b_{-1} = \mathbf{0}$, and using $b_0 = \mathrm{GRU}(b_{-1}, [z_0, a_{-1}])$, where $a_{-1}$ is a fixed, arbitrary action and $z_0$ are the features corresponding to the original observation $o_0$. We train this representation with future-frame prediction tasks conditioned on sequences of actions and on the representation $b_t$. These frame-prediction tasks consist in estimating the probability distribution, for various $k \in 1{:}K$ (with $K$ to be specified later), of the future observation $o_{t+k}$ conditioned on the internal representation $b_t$ and the sequence of actions $a_{t:t+k-1}$. We denote these estimates by $\hat p(o_{t+k} \mid b_t, a_{t:t+k-1})$, or simply by $\hat p_{t+k \mid t}$ for conciseness when no confusion is possible. As the notation suggests, we use $\hat p_{t+k \mid t}$ as an estimate of $p_{t+k \mid t}$. The neural architecture consists of $K$ different neural nets $f_1, \ldots, f_K$. Each neural net $f_k$ receives as input the concatenation of the internal representation $b_t$ and the sequence of actions $a_{t:t+k-1}$, and outputs the distribution over observations $\hat p_{t+k \mid t}$ ($k \in 1{:}K$). For a fixed $t$ and a fixed $k$, the loss function at time step $t$ associated with the network $f_k$ is a cross-entropy loss: $\mathcal{L}_{t+k \mid t} := -\ln \hat p_{t+k \mid t}(o_{t+k})$. We finally define, for any given sequence of actions and observations, the representation loss function as the sum of these cross-entropy losses: $\mathcal{L}_t := \sum_{k=1}^{K} \mathcal{L}_{t+k \mid t}$.

Figure 1: World Model: a CNN and a GRU encode the history $h_t$ into an internal representation $b_t$. Then, frame-prediction tasks are trained in order to shape the representation $b_t$.
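To make the architecture concrete, below is a minimal PyTorch sketch of such a world model, assuming for simplicity that each observation can be summarised by a single categorical label (in the paper the prediction is over the full multi-channel local-view frame); all layer sizes, the number of heads K and the class count are illustrative choices, not the hyperparameters of the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WorldModel(nn.Module):
    """CNN features z_t -> GRU belief b_t -> K frame-prediction heads (illustrative sketch)."""

    def __init__(self, obs_channels=3, n_actions=5, belief_dim=128, n_obs_classes=26, K=4):
        super().__init__()
        self.K, self.n_actions = K, n_actions
        self.cnn = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(64), nn.ReLU(),
        )
        self.gru = nn.GRUCell(64 + n_actions, belief_dim)
        # One head per horizon k; head k sees b_t and the k future actions a_{t:t+k-1}.
        self.heads = nn.ModuleList(
            [nn.Linear(belief_dim + k * n_actions, n_obs_classes) for k in range(1, K + 1)]
        )

    def step(self, obs, prev_action, belief):
        """Compute b_t = GRU(b_{t-1}, [z_t, a_{t-1}])."""
        z = self.cnn(obs)                                    # observation features z_t
        a = F.one_hot(prev_action, self.n_actions).float()   # previous action a_{t-1}, one-hot
        return self.gru(torch.cat([z, a], dim=-1), belief)   # new belief b_t

    def prediction_losses(self, belief, future_actions, future_obs_labels):
        """Cross-entropy losses L_{t+k|t}, k = 1..K (future observations given as class labels)."""
        losses = []
        for k, head in enumerate(self.heads, start=1):
            acts = F.one_hot(future_actions[:, :k], self.n_actions).float().flatten(1)
            logits = head(torch.cat([belief, acts], dim=-1))
            losses.append(F.cross_entropy(logits, future_obs_labels[:, k - 1]))
        return losses   # their sum is the representation loss at time t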


4.2 Evaluation of the learnt representation

In the POMDP setting, the real state $s_t$ represents all there is to know about the world at time $t$. By constructing a belief state, which is a distribution over the possible states $s_t$ conditioned on the history $h_t$, the agent can assess its uncertainty about the real state given the history. Therefore, in order to assess the quality of the learnt representation $b_t$, we use the glass-box approach described in Figure 12 to build a belief state of the world. It consists simply in training a neural network $f_s$, fed with the internal representation $b_t$, to predict a distribution $\hat p_s(\cdot \mid b_t)$ over the possible real states $s_t$. This kind of approach is only possible in artificial or controlled environments where the real state is available to the experimenter but not given to the agent. We also make sure that no gradient from $f_s$ is back-propagated to the internal representation $b_t$, so that the evaluation does not influence the learning of the representation or the behaviour of the agent. For a fixed $t$, the loss used to train $f_s$ is a cross-entropy loss (for a more detailed description of the approach see guo2018): $\mathcal{L}^{\mathrm{d}}_t := -\ln \hat p_s(s_t \mid b_t)$. We call this loss the discovery loss, and use it as a measure of how much information about the whole world the agent is able to encode in its internal representation $b_t$, i.e., how much of the world has been discovered by the agent.
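A minimal sketch of this glass-box evaluation, under the simplifying assumption that the real state can be summarised by a single categorical label (e.g., the cell index of a hidden object); the detach call is what prevents the evaluation gradient from reaching the representation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlassBoxEvaluator(nn.Module):
    """Predicts the (privileged) world state from the agent's belief b_t."""

    def __init__(self, belief_dim=128, n_states=25):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(belief_dim, 64), nn.ReLU(), nn.Linear(64, n_states))

    def discovery_loss(self, belief, true_state):
        # Detach so that no gradient from the evaluator is back-propagated
        # into the representation (or the behaviour) of the agent.
        logits = self.net(belief.detach())
        return F.cross_entropy(logits, true_state)

# usage sketch: loss = evaluator.discovery_loss(b_t, object_cell_index)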

5 NDIGO Agent

Our NDIGO agent is a discovery agent that learns to seek new information in its environment and then incorporate this information into a world representation. Inspired by the intrinsic-motivation literature (schmidhuber2010formal), the NDIGO agent achieves this information-seeking behaviour by optimising an intrinsic reward. Therefore, the agent's exploratory skills depend critically on designing an appropriate reward signal that encourages discovering the world. Ideally, we want this reward signal to be high when the agent gets an observation containing new information about the real state $s_t$. As we cannot access $s_t$ at training time, we rely on the accuracy of our future-observation predictions to estimate the information we have about $s_t$.

Intuitively, for a fixed horizon $k$, the prediction-error loss $\mathcal{L}_{t+k \mid t}$ is a good measure of how much information is lacking about the future observation $o_{t+k}$: the higher the loss, the more uncertain the agent is about the future observation, so the less information it has about it. Therefore, one could define an intrinsic reward directly as the prediction-error loss, thus encouraging the agent to move towards states for which it is least capable of predicting future observations. The hope is that the less information we have in a certain belief state, the easier it is to gain new information. Although this approach may give good results in deterministic environments, it is not suitable in certain stochastic environments. For instance, consider the extreme case in which the agent can observe white noise such as a TV displaying static. An agent motivated by the prediction-error loss would continually receive a high intrinsic reward simply by staying in front of this TV, as it cannot improve its predictions of future observations, and would effectively remain fascinated by this noise.

5.1 The NDIGO intrinsic reward

The reason why the naive prediction-error reward fails in such a simple example is that the agent identifies that a lot of information is lacking, but does not acknowledge that no progress is made towards acquiring this lacking information. To overcome this issue, we introduce the NDIGO reward, for a fixed horizon $H \ge 1$, as follows:

$r^{\mathrm{NDIGO}}_{t+H-1} := \mathcal{L}_{t+H \mid t-1} - \mathcal{L}_{t+H \mid t},$    (1)

where $o_{t+H}$ represents the future observation considered and $H$ is the horizon of NDIGO. The two terms on the right-hand side of Equation 1 measure how much information the agent lacks about the future observation $o_{t+H}$ knowing all past observations prior to $o_t$, with $o_t$ either excluded (left term) or included (right term). Intuitively, we take the difference between the information we have at time $t-1$ and the information we have at time $t$; this way we get an estimate of how much information the agent gained about $o_{t+H}$ by observing $o_t$. Note that the reward is attributed at time $t+H-1$ in order to make it dependent on the history $h_{t+H-1}$ and on $o_{t+H}$ only (and not on the policy), once the prediction model has been learnt. If the reward had been assigned at time $t$ instead (the time of prediction), it would have depended on the policy used to generate the action sequence $a_{t:t+H-1}$, which would have violated the Markovian assumption required to train the RL algorithm. Coming back to our broken-TV example, the white noise in $o_t$ does not help in predicting the future observation $o_{t+H}$. The NDIGO reward is then the difference of two large terms of similar amplitude, leading to a small reward: while acknowledging that a lot of information is missing (large prediction-error loss), NDIGO also realises that no more of it can be extracted (small difference of prediction-error losses). Our experiments show that using NDIGO allows the agent to avoid being stuck in the presence of noise, as presented in Section 6, thus confirming these theoretical considerations.
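The following toy simulation illustrates this point in isolation (it is an illustration of the principle with $H = 1$ and simple frequency-count predictors standing in for the learned models, not the paper's setup): on a purely random binary observation stream, the prediction-error reward stays close to $\ln 2$ forever, while the NDIGO-style reward, being the difference of two equally large losses, hovers around zero.

import numpy as np

rng = np.random.default_rng(0)
T = 5000
obs = rng.integers(0, 2, size=T)  # white-noise binary observation stream

# Counts for a marginal model p(o_{t+1}) ("without o_t") and a
# conditional model p(o_{t+1} | o_t) ("with o_t"); Laplace smoothing.
marg = np.ones(2)
cond = np.ones((2, 2))

pe_rewards, ndigo_rewards = [], []
for t in range(T - 1):
    o_t, o_next = obs[t], obs[t + 1]
    # Negative log-likelihood ("prediction loss") of o_{t+1} under each model.
    loss_without = -np.log(marg[o_next] / marg.sum())
    loss_with = -np.log(cond[o_t, o_next] / cond[o_t].sum())
    pe_rewards.append(loss_with)                    # prediction-error reward: stays near ln 2
    ndigo_rewards.append(loss_without - loss_with)  # NDIGO-style reward: near 0 for pure noise
    marg[o_next] += 1
    cond[o_t, o_next] += 1

print(f"mean PE reward    = {np.mean(pe_rewards[-1000:]):.3f}  (ln 2 = {np.log(2):.3f})")
print(f"mean NDIGO reward = {np.mean(ndigo_rewards[-1000:]):.3f}")

On a predictable stream (for instance, an alternating pattern), the conditional predictor improves on the marginal one and the same difference becomes positive, which is exactly when observing $o_t$ carries information about $o_{t+1}$.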

5.2 Algorithm

Given the intrinsic reward $r^{\mathrm{NDIGO}}_{t+H-1}$, we use the state-of-the-art RL algorithm R2D2 (kapturowski2018recurrent) to optimise the policy. The NDIGO agent interacts with its world using the NDIGO policy to obtain new observations, which are used to train the world model by minimising the future-prediction loss $\mathcal{L}_t$. The prediction losses are then used to compute the intrinsic reward at the next time step, and the process is repeated. An in-depth description of the complete NDIGO algorithm can be found in Section B.5.

5.3 Relation to information gain

Information gain has been widely used as a novelty signal in the literature (houthooft2016vime; little2013learning). A very broad definition of the information gain (schmidhuber2010formal) is the distance (or divergence) between the distributions of any random event of interest before and after a new sequence of observations. Choosing the random event to be the future observations (or actions) and the divergence to be the Kullback-Leibler divergence, the $k$-step predictive information gain of the future event $o_{t+k}$ with respect to the sequence of observations $o_{\tau:t}$ is defined as $\mathrm{IG}(o_{t+k}, o_{\tau:t}) := \mathrm{KL}\big( p(\cdot \mid h_t, a_{t:t+k-1}) \,\|\, p(\cdot \mid h_{\tau-1}, a_{\tau-1:t+k-1}) \big)$, and it measures how much information can be gained about the future observation $o_{t+k}$ from the sequence of past observations $o_{\tau:t}$, given the whole history up to time step $\tau-1$ and the sequence of actions from $\tau-1$ up to $t+k-1$. In the case of $\tau = t$ and $k = 1$ we recover the 1-step information gain on the next observation $o_{t+1}$ due to $o_t$. We also use the short-hand notation $\mathrm{IG}_{t+k \mid t} := \mathrm{IG}(o_{t+k}, o_t)$ for the information gain due to the single observation $o_t$, for every $t$ and $k$. Also, by convention, we define the information gain with respect to an empty sequence of observations to be zero.

We now show that the NDIGO intrinsic reward can be expressed as the difference of the information gains due to $o_{t-1:t}$ and $o_{t-1:t-1}$. For a given horizon $H$ and time $t$, the intrinsic reward for time step $t+H-1$ is:

$r^{\mathrm{NDIGO}}_{t+H-1} = \mathcal{L}_{t+H \mid t-1} - \mathcal{L}_{t+H \mid t}$    (2)
$\qquad\qquad\;\;\, = -\ln \hat p_{t+H \mid t-1}(o_{t+H}) + \ln \hat p_{t+H \mid t}(o_{t+H}).$    (3)

Given that $\hat p_{t+H \mid t-1}$ and $\hat p_{t+H \mid t}$ are respectively estimates of $p_{t+H \mid t-1}$ and $p_{t+H \mid t}$, and based on the fact that these estimates become more accurate as the number of samples increases, we have:

$\mathbb{E}\big[ r^{\mathrm{NDIGO}}_{t+H-1} \big] \;\longrightarrow\; \mathrm{IG}(o_{t+H}, o_{t-1:t}) - \mathrm{IG}(o_{t+H}, o_{t-1:t-1}).$    (4)

The first term in Equation 4 measures how much information can be gained about $o_{t+H}$ from the sequence of past observations $o_{t-1:t}$, whereas the second term measures how much information can be gained about $o_{t+H}$ from the sequence $o_{t-1:t-1} = o_{t-1}$ alone. Therefore, in the limit of accurate estimates, the expected value of the NDIGO reward at step $t+H-1$ is equal to the amount of additional information that can be gained by the observation $o_t$ when trying to predict $o_{t+H}$.

6 Experiments

We evaluate the performance of NDIGO qualitatively and quantitatively on five experiments, which demonstrate different aspects of discovery with NDIGO. In all experiments there are some hidden objects which the agent seeks to discover, but the underlying dynamics of the objects differ. In the simplest case, the location of the objects only changes at the beginning of every episode, whereas in the most complex case the objects change their locations throughout the episode according to a random-walk strategy. We investigate (i) whether the agent can efficiently search for novelty, i.e., find the location of the objects; and (ii) whether the agent can encode the location of the objects in its representation of the world such that the discovery loss of predicting the objects is as small as possible.

6.1 Baselines

We compare our algorithm NDIGO-$H$, where $H$ denotes the horizon (taking the values reported with each experiment), to different information-seeking and exploration baselines considered to be state of the art in the intrinsic-motivation literature. Prediction Error (PE) (haber2018learning; achiam2017surprise): the PE model uses the same architecture and the same losses as NDIGO; the only difference is that the intrinsic reward is the prediction-error loss itself. Prediction Gain (PG) (achiam2017surprise; ostrovski2017count): our version of PG uses the same architecture and the same losses as NDIGO; in addition, at regular intervals of learner steps we save a copy of the prediction network into a fixed target network. The intrinsic reward is the difference in prediction error between the up-to-date network and the target network, where the target prediction is computed with the frozen weights of the target network. Intrinsic Curiosity Module (ICM) (pathak2017curiosity; burda2018large): this method trains the internal representation to be less sensitive to noise using a self-supervised inverse-dynamics model. A forward model is then used to predict the future internal representation from the current representation and the action (more details on this model are given in Appendix D); the intrinsic reward is the prediction error of this forward model.
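For concreteness, here is a sketch of how the PE and PG rewards can be computed from the same kind of prediction losses used by NDIGO; the loss aggregation and the target-network refresh period are illustrative assumptions rather than the exact choices of the paper.

import torch.nn.functional as F

def pe_reward(online_logits, target_obs):
    """Prediction Error: the intrinsic reward is the prediction loss itself."""
    return F.cross_entropy(online_logits, target_obs, reduction="none")

def pg_reward(online_logits, frozen_logits, target_obs):
    """Prediction Gain: loss under a frozen snapshot minus loss under the up-to-date network."""
    loss_online = F.cross_entropy(online_logits, target_obs, reduction="none")
    loss_frozen = F.cross_entropy(frozen_logits, target_obs, reduction="none")
    return loss_frozen - loss_online

# The frozen snapshot would be refreshed periodically from the online predictor,
# e.g. frozen_net = copy.deepcopy(online_net) every fixed number of learner steps.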

6.2 Test environments

The 5 rooms environment.

The 5 rooms environment (see Figure 2) is a local-view 2D environment composed of rooms, implemented using the pycolab library (https://github.com/deepmind/pycolab). In pycolab, the environment is composed of cells that contain features such as walls, objects or agents. In the 5 rooms environment, there is one central room and four peripheral rooms (each a small grid of cells), which we refer to as the upper, lower, left and right rooms. Each of the four peripheral rooms may contain different types of "objects", each of which occupies a cell exclusively. At every episode, the agent starts in the middle of the central room and the starting position of each object is randomised. The objects may or may not move, but as a general rule they never leave the room they started in during an episode. Finally, we only place objects in the peripheral rooms, and each room never contains more than one object.

The maze environment.

The maze environment (see Figure 3) is also a pycolab local-view 2D environment. It is set up as a maze composed of six different rooms connected by corridors. The agent starts at a fixed position in an otherwise empty room; the remaining rooms are numbered from 1 to 5 based on the order in which they can be reached, i.e., the agent cannot reach room $n$ without going through rooms $1, \ldots, n-1$ in this order. A white noise object is always present in room 1, and there is a single fixed object in each of rooms 2, 3 and 4. Room 5 contains a special movable object, which should attract the agent even when the environment is completely learned.

Figure 2: The 5 rooms environment: in this instance, we can see the agent in white, fixed objects in each of the peripheral rooms, and the impenetrable walls in grey. The shaded area around the agent represents its local-view region.
Figure 3: The maze environment: in this instance, we can see the agent in white and fixed objects in blue, green, pink and red. The white noise object, also in green, is the closest object to the agent's location.

Objects.

We consider five different types of objects: fixed, bouncing, Brownian, white noise and movable. fixed objects do not move during an episode, but change position from episode to episode. They provide information gain about their position when it is not already encoded in the agent's representation. bouncing objects bounce in a straight line from wall to wall inside a room. In addition to providing information gain similar to fixed objects, they allow us to test the capacity of the representation to encode a predictable object after the object is no longer in the agent's view. Brownian objects follow a Brownian motion within a room, moving uniformly at random in one of the four directions. white noise objects change location instantly, to any position inside the same room chosen uniformly at random, at each time step, and are therefore unpredictable. Finally, movable objects do not move by themselves, but the agent can cause them to move to a random location by attempting to move into their cells. Interacting with these objects allows more information gain to be generated.

Agent’s observations and actions.

The observation at time $t$ consists of a concatenation of images (called channels) representing the different features of the local view of the agent. This can be represented as a multidimensional array whose first dimension is the number of channels. The first channel represents the walls in the local view: a value of 1 indicates the presence of a wall and 0 its absence. Each of the remaining channels represents the position of an object, with a one-hot array if the object is present in the local view and a null array otherwise. The possible actions are stay, up, down, right and left, and are encoded as a one-hot vector of size 5.
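A sketch of how such an observation tensor and action encoding could be constructed with NumPy; the local-view size, the channel ordering and the helper names are illustrative assumptions:

import numpy as np

ACTIONS = ["stay", "up", "down", "right", "left"]   # one-hot of size 5

def encode_observation(local_walls, object_positions, n_objects, view=5):
    """Build a (1 + n_objects, view, view) observation array.

    local_walls: (view, view) boolean array, True where a wall is visible.
    object_positions: dict mapping object index -> (row, col) in the local view,
                      containing only the objects currently visible.
    """
    obs = np.zeros((1 + n_objects, view, view), dtype=np.float32)
    obs[0] = local_walls.astype(np.float32)          # channel 0: walls (1 = wall, 0 = free)
    for obj_idx, (r, c) in object_positions.items():
        obs[1 + obj_idx, r, c] = 1.0                 # one-hot channel per visible object
    return obs

def encode_action(action_name):
    a = np.zeros(len(ACTIONS), dtype=np.float32)
    a[ACTIONS.index(action_name)] = 1.0
    return a

# usage sketch:
# obs_t = encode_observation(walls_5x5, {0: (2, 3)}, n_objects=2)
# act_t = encode_action("up")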

6.3 Performance evaluation

The agent's performance is measured by its capacity to estimate the underlying state of the world from its internal representation (the discovery loss, see Section 4.2). In pycolab, it is possible to compute a discovery loss for each aspect of the world state (the location of each object, for instance), so it is easy to understand which aspects of the world the agent can understand and keep in its internal representation. Once again we stress that no gradient is back-propagated from this evaluation procedure to the internal representation. In addition, we provide other statistics, such as the average first-visit time and visit count of a given object, to describe the behaviour of the agent. The first-visit time is the number of episode time steps the agent needs before first observing a given object; the visit count is the total number of time steps during which the agent observes the object. Finally, we also provide more qualitative results with videos of the agent discovering its worlds (see https://www.youtube.com/channel/UC5OPHK7pvsZE-jVclZMvhmQ).

6.4 Experimental results

In this section we evaluate the performance of NDIGO on several controlled navigation tasks (for the implementation details and the specification of the prediction and policy networks and the training algorithms, see Appendix B).

Experiment 1.

We evaluate the discovery skills of NDIGO by testing how effectively it can ignore the white noise, from which there is nothing to learn, and discover the location of the fixed object. Here, we use a 5 rooms setting with a fixed object in the upper room, and a white noise object in the lower room.

Figure 4: Experiment 1: Average discovery loss of the fixed object. The results are averaged over 10 seeds.
Visit count First visit time
fixed w. noise fixed w. noise
Random
PE
PG
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 1: Experiment 1: Average values of the visit counts and first visit time of the trained agent for the fixed and white noise objects in one episode.

We report in Figure 4 the learning curves for the discovery loss of the fixed object. This result shows the quality of the learned representation in terms of encoding the location of the fixed object. We observe that the long-horizon variant of NDIGO (NDIGO-4) outperforms the best baseline (ICM) by more than an order of magnitude. Also, the asymptotic performance of NDIGO-4 is significantly better than that of NDIGO-1 and NDIGO-2.

In Table 1 we also report the average value and standard deviation of the visit count and first-visit time of the trained agents for the fixed object and the white noise object in an episode (each episode ends after a fixed number of time steps; if an agent does not find the object by the end of the episode, the first-visit time is set to the episode length). We observe that the different variants of NDIGO are driven towards the fixed object and manage to find it faster than the baselines, while avoiding the white noise object. While ICM is also attracted by the fixed object, it does not reach it as fast as NDIGO. PE, as expected, is only attracted by the white noise object, where its reward is the highest. We also observe that the performance of NDIGO improves as we increase the prediction horizon. From now on, in the tables, we report only the ICM results as it is the only competitive baseline. Exhaustive results are reported in Section E.1.

Experiment 2.

To better demonstrate the information-seeking behaviour of our algorithm, we randomly place a fixed object in either the upper, left or right room, and a white noise object in the lower room. Thus, to discover the object, the agent must actively look for it in all but the lower room.

Similar to Experiment 1, we report the average discovery loss of the fixed object in Figure 5. We observe that all variants of NDIGO perform better than the baselines by a clear margin, though the ICM performance is not far behind NDIGO (less than two times worse than NDIGO-4). We also observe no significant difference between the different variants of NDIGO in this case. We also report in Table 2 the first-visit times and visit counts for the fixed object and the white noise object in an episode. NDIGO again demonstrates a performance superior to the baselines. We also observe that NDIGO is in most cases not attracted towards the white noise object. An interesting observation is that, as we increase the prediction horizon of NDIGO, it takes more time for the agent to find the fixed object, but at the same time the visit count increases as well, i.e., the agent stays close to the object for longer after the first visit.

As a qualitative result, we also report top-down-view snapshots of the behaviour of NDIGO-4 up to the time of discovery of the fixed object in the right room in Figure 6, together with the predicted view of the world from the agent's representation. As the location of the object is unknown to the agent, we observe that the agent searches the top-side, left-side and right-side rooms until it discovers the fixed object in the right-side room. It also successfully avoids the bottom-side room containing the white noise object. Also, as soon as the agent finds the fixed object, the uncertainty about the location of the fixed object completely vanishes (as the agent has learned that only one fixed object exists in the world).

Figure 5: Experiment 2: Average discovery loss of the fixed object. The results are averaged over 10 seeds.
(a)
(b)
(c)
Figure 6: Experiment 2: top-down-view snapshots of the behavior of the NDIGO-4 agent. (a) after entering the top-side room (b) after entering the right-side room (c) after discovering the fixed object in the left-side room. In each subpanel the left-side image depicts the ground-truth top-down-view of the world and the right-side image depicts the predicted view from the agent’s representation. All times are in seconds.
Visit count First visit time
fixed w. noise fixed w. noise
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 2: Average values of the visit counts and first visit time of the trained agent for the fixed and white noise objects in Experiment 2.

Experiment 3.

We investigate whether NDIGO is able to discover and retain the dynamics of moving (but still predictable) objects, even when they are not in its field of view. For this, we use a 5 rooms setting with two bouncing objects in the upper and lower rooms and a white noise object in the right room.

Figure 7: Experiment 3: Average discovery loss of bouncing objects. The results are averaged over 10 seeds.
Visit count First visit time
upper obj. lower obj. upper obj. lower obj.
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 3: Average values of the visit counts and first visit time of the trained agent for the bouncing objects in Experiment 3.

We report the discovery loss in Figure 7. We observe that all variants of NDIGO outperform the baselines by a large margin in terms of the discovery loss of the bouncing objects. As the discovery loss for both bouncing objects is small, this indicates that NDIGO can encode the dynamics of bouncing objects in its representation. We report the first-visit times and visit counts for the bouncing objects in Table 3. NDIGO performs better than the baselines both in terms of visit counts and first-visit times to the bouncing objects, except for the visit count of the lower object, for which ICM produces the best performance. Finally, as a qualitative result, we also report top-down-view snapshots of the behaviour of NDIGO-1 after the discovery of each bouncing object in Figure 8. We observe that the agent can estimate the location of both bouncing objects on the first visit. Also, after departing from the green bouncing object and moving towards the red bouncing object, it can still track the dynamics of the green bouncing object with some small error, despite the fact that the green bouncing object is no longer observed by the agent.

(a)
(b)
Figure 8: Experiment 3: top-down-view snapshots of the behavior of the NDIGO-1 agent. (a) after discovering the green bouncing object in the bottom-side room (b) after discovering the red bouncing object in the top-side room. In each subpanel the left-side image depicts the ground-truth top-down-view of the world and the right-side image depicts the predicted view from the agent’s representation. All times are in seconds.

Experiment 4.

We investigate whether the horizon affects the performance of the agents in terms of their sensitivity to structured noise. For this, we evaluate which objects the agent seeks in a 5 rooms setting with a Brownian object in the upper room and a fixed object in the lower room. In the upper room, the Brownian object moves at every time step. For the Brownian object, unlike white noise, it is not guaranteed that the reward of NDIGO is zero. However, by increasing the horizon, one may expect the intrinsic reward due to the Brownian object to become negligible, because the object becomes harder to predict at higher horizons.

Figure 9: Experiment 4: Average discovery loss of the fixed object. The results are averaged over 10 seeds.
Visit count First visit time
Brownian fixed Brownian fixed
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 4: Average values of the visit counts and first visit time of the trained agent for the Brownian and fixed objects in Experiment 4, with all baselines.
Figure 10: Experiment 5: Average discovery loss of the fixed and movable objects. The results are averaged over 10 seeds.
(a)
(b)
(c)
(d)
Figure 11: Experiment 5: top-down-view snapshots of the behavior of the NDIGO-1 agent in the maze problem: (a) at the beginning of the episode (b) after discovering the fixed objects in room 3 and 4 (c) after discovering the movable object in room 5 (d) after discovering the fixed object in room 2. In each subpanel the left-side image depicts the ground-truth top-down-view of the world and the right-side image depicts the predicted view from the agent’s representation. All times are in seconds.
Visit frequency
Room 1 Room 2 Room 3 Room 4 Room 5
white noise fixed fixed fixed movable
ICM
NDIGO-1
NDIGO-2
NDIGO-5
NDIGO-10
Table 5: Average frequency of visits to each room for the trained agents.
First visit time
Room 1 Room 2 Room 3 Room 4 Room 5
white noise fixed fixed fixed movable
ICM -
NDIGO-1
NDIGO-2
NDIGO-5 -
NDIGO-10 -
Table 6: Average time of first visit to each room for the trained agents.

We report the results in Figure 9. We observe that the ICM baseline, as well as the variants of NDIGO with a short horizon, are attracted to the structured randomness generated by the Brownian object. Only NDIGO-4 can ignore the Brownian object and discover the fixed object; as a result, NDIGO-4 is the only algorithm capable of minimising the discovery loss of the fixed object.

Experiment 5.

We now compare the discovery ability of the agents in a complex maze environment (see Figure 3) with no extrinsic reward. Here, the agent starts at a fixed position in the maze environment, and is given no incentive to explore other than its intrinsic reward. This setting is challenging for discovery and exploration, since to get to the end of the maze the agent needs to take a very long and specific sequence of actions. This highlights the importance of intrinsic rewards that encourage discovery. We report the learning curves of NDIGO as well as the baselines in Figure 10. We observe that in this case the different variants of NDIGO outperform the baselines by a wide margin in terms of discovery loss, with NDIGO-1 and NDIGO-2 outperforming NDIGO-5. Note that due to the presence of the movable object, which is unpredictable upon re-spawning, the average loss in this experiment is higher than in the prior fixed-object experiments. We also evaluate the discovery performance of an agent as the number of rooms it is capable of exploring within the duration of the episode. We present the average visit frequency and first-visit time for each room for the trained agents (see Tables 5 and 6). NDIGO-1 and NDIGO-2 appear to be the only agents capable of reaching the final room, whereas NDIGO-5 explores only four of the five rooms. The rest cannot go beyond the white noise object.

As a qualitative result, we also report top-down-view snapshots of the behaviour of NDIGO-1 up to the time of discovery of the last fixed object in room 2 in Figure 11, together with the predicted view of the world from the agent's representation. We observe that the agent drives across the maze all the way from room 1 to room 5, and in the process discovers the fixed objects in rooms 3 and 4 (see Figure 11(b)) and the movable object in room 5 (see Figure 11(c)). It then chases the movable object until the movable object gets fixated in the top-left corner of the world. The agent then moves back to room 2 (see Figure 11(d)) and discovers the last, blue, fixed object there, while maintaining its knowledge of the other objects. The reason for initially ignoring the blue fixed object in room 2 might be that the agent can obtain more intrinsic reward by chasing the movable object, so it tries to reach room 5 as fast as possible at the expense of ignoring the blue fixed object in room 2.

7 Conclusion

We aimed at building a proof of concept for a world discovery model by developing the NDIGO agent and comparing its performance with state-of-the-art information-seeking algorithms in terms of its ability to discover the world. Specifically, we considered a variety of simple local-view 2D navigation tasks with hidden, randomly placed objects, and looked at whether the agent can discover its environment and the location of the objects. We evaluated the ability of our agent for discovery through the glass-box approach, which measures how accurately the location of objects can be predicted from the internal representation. Our results showed that in all these tasks NDIGO produces an effective information-seeking strategy capable of discovering the hidden objects without being distracted by the white noise, whereas the baseline information-seeking methods in most cases failed to discover the objects due to the presence of noise.

There remains much interesting future work to pursue. The ability of our agent to discover its world can be very useful for improving performance in multi-task and transfer settings, as the NDIGO model can be used to discover the new features of new tasks. Also, in this paper we focused on visually simple tasks. To scale up our model to more complex visual tasks we need to consider more powerful prediction models, such as PixelCNN (van2016conditional), VAE (kingma2013auto), InfoGAN (chen2016infogan) and DRAW (gregor2015draw), capable of providing high-accuracy predictions for high-dimensional visual scenes. We can also go beyond predicting only visual observations to other modalities of sensory input, such as proprioception and touch sensors (amos2018learning).

Acknowledgements

We would like to thank Daniel Guo, Theophane Weber, Caglar Gulcehre, Toby Pohlen, Steven Kapturowski and Tom Stepleton for insightful discussions, comments and feedback on this work.

References

Appendix A NDIGO Global Network Architecture

Figure 12: Global Architecture of the NDIGO agent

Appendix B NDIGO Agent Implementation Details

b.1 World Model

  • Convolutional Neural Network (CNN): observations $o_t$ are fed through a two-layer CNN (edges are padded if needed), then through a fully connected single-layer perceptron with 256 units, then through a ReLU activation, resulting in a transformed observation $z_t$.

  • Gated Recurrent Unit (GRU): a single-layer GRU whose hidden state is the internal representation $b_t$.

  • Frame predictors $f_k$: multilayer perceptrons (MLPs) with one hidden layer followed by a ReLU activation, and an output layer whose size is the size of the local view (which is also the size of the observation $o_t$), followed by a ReLU activation.

  • Optimiser for frame predictions: Adam optimiser (kingma2014adam) with a fixed batch size and learning rate.

b.2 Reward Generator

  • The horizon $H$ takes one of the values reported with each experiment.

b.3 Evaluation

  • The evaluation MLP $f_s$: one hidden layer followed by a ReLU activation, and an output layer whose size is the size of the global view of the 5 rooms environment (which is also the size of the real state $s_t$), followed by a ReLU activation.

  • Optimiser for evaluation: Adam optimiser (kingma2014adam) with a fixed batch size and learning rate.

b.4 RL Agent

We use the Recurrent Replay Distributed DQN (R2D2) (kapturowski2018recurrent) with the following parameters:

  • Replay: fixed replay period, trace length and replay size; we use uniform prioritisation.

  • Network architecture: R2D2 uses a two-layer CNN, followed by a GRU which feeds into the advantage and value heads of a dueling network (wang2015dueling), each with one hidden layer (a sketch of such a dueling head is given after this list). The CNN and GRU of the RL agent have the same architecture and parameters as those described for the World Model (see Sec. B.1) but do not share the same weights.

  • Algorithm: Retrace learning update (munos2016safe) with a fixed discount factor and eligibility-traces coefficient, a target network with a fixed update period, no reward clipping and signed-hyperbolic rescaling (pohlen2018observe).

  • Distributed training: multiple actors and a single learner; actors update their parameters at a fixed period of learner steps.

  • Optimiser for RL: Adam optimiser (kingma2014adam) with a fixed batch size and learning rate.

  • The intrinsic rewards are provided directly to the RL agent without any scaling.
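A minimal PyTorch sketch of a dueling head of the kind used above (wang2015dueling); all sizes are illustrative assumptions:

import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling decomposition Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, in_dim=128, n_actions=5, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, gru_state):
        v = self.value(gru_state)                      # V(s), shape (B, 1)
        a = self.advantage(gru_state)                  # A(s, a), shape (B, n_actions)
        return v + a - a.mean(dim=-1, keepdim=True)    # Q(s, a)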

b.5 Training loop pseudocode

0:  Policy $\pi$, history $(o_0, a_0, o_1, \ldots, o_T)$, horizon $H$, number of predictors $K$, world-model weights $\theta$
1:  $b_{-1} \leftarrow \mathbf{0}$
2:  for $t = 0, \ldots, T$ do
3:       $z_t \leftarrow$ Observation CNN$_\theta(o_t)$
4:       $b_t \leftarrow$ Belief GRU$_\theta(b_{t-1}, [z_t, a_{t-1}])$
5:     $\mathcal{L}_t \leftarrow 0$
6:     for $k = 1, \ldots, K$ do
7:          $\hat p_{t+k \mid t} \leftarrow$ Prediction MLP$_{k,\theta}(b_t, a_{t:t+k-1})$
8:        $\mathcal{L}_{t+k \mid t} \leftarrow -\ln \hat p_{t+k \mid t}(o_{t+k})$
9:        $\mathcal{L}_t \leftarrow \mathcal{L}_t + \mathcal{L}_{t+k \mid t}$
10:     end for
11:     $r^{\mathrm{NDIGO}}_{t+H-1} \leftarrow \mathcal{L}_{t+H \mid t-1} - \mathcal{L}_{t+H \mid t}$
12:  end for
13:  Update $\theta$ to minimise $\sum_t \mathcal{L}_t$
14:  Update $\pi$ using the set of rewards $\{r^{\mathrm{NDIGO}}_t\}$ with the RL Algo.
Algorithm 1 NDIGO training loop.

Appendix C NDIGO Alternative Architecture

An alternative architecture for NDIGO consists in encoding the sequence of actions $a_{t:t+k-1}$ into a representation $c_k$ using a GRU $g$. The hidden state of this GRU is $c_k = g(c_{k-1}, a_{t+k-1})$, with the initialisation $c_0 = \mathbf{0}$. Then we use a single neural network $f$ to output, for any $k$, the probability distribution $\hat p_{t+k \mid t}$ when given the input $b_t$ concatenated with $c_k$. The loss function for the network $f$ at time step $t$ is a cross-entropy loss:

$\mathcal{L}_{t+k \mid t} := -\ln \hat p_{t+k \mid t}(o_{t+k}).$    (5)
Figure 13: Alternative architecture of NDIGO for the frame prediction tasks.
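A minimal PyTorch sketch of this alternative head, with a GRU summarising the action sequence and a single shared predictor; layer sizes and the categorical-observation simplification are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFramePredictor(nn.Module):
    """Single prediction head shared across horizons: the action sequence a_{t:t+k-1}
    is summarised by an action-GRU, then concatenated with the belief b_t."""

    def __init__(self, belief_dim=128, n_actions=5, action_dim=32, n_obs_classes=26):
        super().__init__()
        self.action_gru = nn.GRU(n_actions, action_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(belief_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, n_obs_classes),
        )

    def loss(self, belief, action_seq, target_obs):
        """Cross-entropy loss for predicting o_{t+k} given b_t and a_{t:t+k-1}."""
        one_hot = F.one_hot(action_seq, self.action_gru.input_size).float()  # (B, k, A)
        _, h_k = self.action_gru(one_hot)          # final hidden state summarises the k actions
        logits = self.head(torch.cat([belief, h_k.squeeze(0)], dim=-1))
        return F.cross_entropy(logits, target_obs)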

Appendix D pathak2017curiosity’s ICM Model for Partially Observable Environments

The method consists in training the internal representation to be less sensitive to noise using a self-supervised inverse-dynamics model. To do so, an inverse-dynamics model $g_{\mathrm{inv}}$, fed with $[b_t, z_{t+1}]$ (the concatenation of the internal representation and the transformed observation), outputs a distribution $\hat\pi(\cdot \mid b_t, z_{t+1})$ over actions that predicts the action $a_t$. This network is trained with the loss $\mathcal{L}^{\mathrm{inv}}_t := -\ln \hat\pi(a_t \mid b_t, z_{t+1})$. Then a forward model $g_{\mathrm{fwd}}$, fed with $[b_t, a_t]$ (the concatenation of the internal representation and the action), outputs a vector $\hat b_{t+1}$ that directly predicts the future internal representation $b_{t+1}$. The forward model is trained with a regression loss $\mathcal{L}^{\mathrm{fwd}}_t := \lVert \hat b_{t+1} - b_{t+1} \rVert_2^2$. The neural architecture is shown in Fig. 14. Finally, the intrinsic reward is defined as:

$r^{\mathrm{ICM}}_t := \lVert \hat b_{t+1} - b_{t+1} \rVert_2^2.$    (6)

This is slightly different from the architecture proposed by pathak2017curiosity in order to be compatible with partially observable environments.

Figure 14: pathak2017curiosity’s ICM Model for Partially Observable Environments

d.1 Details of the ICM Model’s Architecture

The ICM agent shares exactly the same architecture as the NDIGO agent, except that the frame predictors $f_k$ are replaced by an inverse model $g_{\mathrm{inv}}$ and a forward model $g_{\mathrm{fwd}}$.

  • The inverse model $g_{\mathrm{inv}}$ is an MLP with one hidden layer and an output layer of size equal to the one-hot action size.

  • The forward model $g_{\mathrm{fwd}}$ is an MLP with one hidden layer and an output layer of size equal to the size of the GRU hidden state.
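A minimal PyTorch sketch of this partially observable ICM variant; layer sizes are illustrative assumptions, and the forward-model error doubles as the per-step intrinsic reward:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Inverse model (b_t, z_{t+1}) -> a_t and forward model (b_t, a_t) -> b_{t+1} (sketch)."""

    def __init__(self, belief_dim=128, feat_dim=64, n_actions=5):
        super().__init__()
        self.n_actions = n_actions
        self.inverse = nn.Sequential(nn.Linear(belief_dim + feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))
        self.forward_model = nn.Sequential(nn.Linear(belief_dim + n_actions, 128), nn.ReLU(),
                                           nn.Linear(128, belief_dim))

    def losses_and_reward(self, belief_t, feat_next, action_t, belief_next):
        # Inverse-dynamics loss: predict a_t from (b_t, z_{t+1}); shapes b_t away from noise.
        action_logits = self.inverse(torch.cat([belief_t, feat_next], dim=-1))
        inverse_loss = F.cross_entropy(action_logits, action_t)
        # Forward loss: predict b_{t+1} from (b_t, a_t); its error is the intrinsic reward.
        a = F.one_hot(action_t, self.n_actions).float()
        pred_next = self.forward_model(torch.cat([belief_t, a], dim=-1))
        forward_error = ((pred_next - belief_next.detach()) ** 2).mean(dim=-1)
        return inverse_loss, forward_error.mean(), forward_error  # last term: per-step reward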

Appendix E Additional results

e.1 Additional results for Experiment 2-4

Tables 7, 8 and 9 contain the results (including all baselines) for Experiments 2 to 4.

Visit count First visit time
fixed white noise fixed white noise
Random
PE
PG
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 7: Average values of the visit counts and first visit time of the trained agent for the fixed and white noise objects in Experiment 2, with all baselines.
Visit count First visit time
upper obj. lower obj. upper obj. lower obj.
PE
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 8: Average values of the visit counts and first visit time of the trained agent for the bouncing objects in Experiment 3, with the PE and ICM baselines.
Visit count First visit time
Brownian fixed Brownian fixed
Random
PE
PG
ICM
NDIGO-1
NDIGO-2
NDIGO-4
Table 9: Average values of the visit counts and first visit time of the trained agent for the Brownian and fixed objects in Experiment 4, with all baselines.

e.2 Additional results for Experiment 5

Tables 10 and 11 present the complete results of Experiment 5; note that a room is considered visited when the agent has actually seen the object inside that room. As the object in Room 2 can be missed by the agent if it appears in the lower-right corner of the maze, the reported frequency of visits to Room 2 can be lower than that of Rooms 3 and beyond, as is the case for the reported figures of the NDIGO-1 and NDIGO-2 agents. A dash symbol indicates that the agent never visits the corresponding room.

Visit frequency
Room 1 Room 2 Room 3 Room 4 Room 5
white noise fixed fixed fixed movable
Random
PE
PG
ICM
NDIGO-1
NDIGO-2
NDIGO-5
NDIGO-10
Table 10: Average frequency of visits to each room for the trained agents.
First visit time
Room 1 Room 2 Room 3 Room 4 Room 5
white noise fixed fixed fixed movable
Random - - -
PE - - - -
PG - - - -
ICM -
NDIGO-1
NDIGO-2
NDIGO-5 -
NDIGO-10 -
Table 11: Average time of first visit to each room for the trained agents.