Learning Shared Dynamics with Meta-World Models

11/05/2018 ∙ by Lisheng Wu, et al. ∙ UCL

Humans have consciousness as the ability to perceive events and objects: a mental model of the world developed from the most impoverished of visual stimuli, enabling humans to make rapid decisions and take actions. Although spatial and temporal aspects of different scenes are generally diverse, the underlying physics among environments still works the same way; thus, learning an abstract description of shared physical dynamics helps humans understand the world. In this paper, we explore building this mental world with neural network models through multi-task learning, namely the meta-world model. We show through extensive experiments that our proposed meta-world models successfully capture the common dynamics over the compact representations of visually different environments from Atari Games. We also demonstrate that agents equipped with our meta-world model possess the ability of visual self-recognition, i.e., they recognize themselves in a mirrored environment derived from the classic mirror self-recognition test (MSR).





Building machines that mimic the abilities of the human brain has drawn considerable attention amid recent progress in artificial intelligence (AI) [Lake2014, Lake, Salakhutdinov, and Tenenbaum2015, Graves et al.2016, Baker et al.2017]. Founders of modern computational science considered the possibility that machines would ultimately possess consciousness [Turing1950], which is the ability to perceive events and objects as humans do [Van Gulick2018]. However, many of the current advances rely on the statistical pattern recognition paradigm, which treats learning as making good predictions by discovering patterns correlated with high-value rewards (or low errors) directly from the environments [Lake et al.2017]. These approaches do not draw inspiration from aspects of human cognition, i.e., learning and thinking like a person [Dehaene, Lau, and Kouider2017], and are considered model-free. At the early stage of learning in high-dimensional environments with sparse rewards, model-free methods cannot find a good learning direction due to the lack of reward signals, and thus require large amounts of data to explore and learn a good policy.

In contrast, humans can achieve comparable performance on a range of tasks with much less experience. For example, a human player needs only two hours of practice before achieving reasonable performance on one Atari game and can quickly adapt to different games, while DQN needs several days of training using large amounts of computational resources [Mnih et al.2015].

Humans develop a mental model of the world from the most impoverished of visual stimuli [Heider and Simmel1944], and use this world to make rapid decisions and take actions [Forrester1971]. To build truly human-like AI machines, we consider an engineering approach in the first step, and focus on developing the world model to support explanation and understanding. One of the key ingredients for building this model is the cognitive capability of understanding the underlying physics and dynamics of the environment [Wellman and Gelman1992]. For example, in the Atari Pong game, the ball and paddles follow principles of persistence, continuity, cohesion and solidity [Bellemare et al.2015]. Through mental models of the world, humans can reconstruct a perceptual scene following these principles to support mental simulations that can predict the future movement of the objects [Spelke1990]. Equipped with a world model understanding these intuitive theories of physics, agents can simulate the real experiences and learn the structured properties of the environment. By exploiting the underlying dynamics learned by the model, we can reduce the dimension of the input and generalize features across states and actions in high-dimensional environments [Watter et al.2015, Wahlström, Schön, and Deisenroth2015, Levine and Abbeel2014]. With one transition model, agents can attend to the dynamics of the states by modeling how the environment evolves with specific action. These approaches, regarded as model-based learning, have been investigated by several previous works [Sutton1990, Levine and Abbeel2014, Watter et al.2015, Wahlström, Schön, and Deisenroth2015, Schmidhuber2015, Gu et al.2016, Leibfried, Kushman, and Hofmann2016, Ha and Schmidhuber2018]. However, the learned underlying dynamics are often restricted to be effective in a single world environment.

Although spatial and temporal aspects of different scenes are generally diverse, the underlying physics among environments still works the same way. As such, humans can easily adapt to different environments with the help of their understanding of the underlying physics. For example, humans can still play the Pong game when the observation is mirrored or transposed (rotated by ninety degrees and mirrored). On the other hand, recent computational approaches cannot solve such scene understanding problems, e.g., because they are not rotation-invariant. Human-like world models should understand these shared physical dynamics and use them to rapidly generalize knowledge to new tasks and environments. This common knowledge serves as a prior for learning new tasks, helping agents to adapt efficiently to new environments. For example, the knowledge gained from Atari Pong can be beneficial for learning to play Breakout and Video Pinball, which share the common concept of the ball and paddles [Parisotto, Ba, and Salakhutdinov2015]. Multi-task learning (or meta-learning) [Duan et al.2016, Finn, Abbeel, and Levine2017, Parisotto, Ba, and Salakhutdinov2015] has been used to learn the common knowledge among different tasks. However, most meta-learning algorithms have trouble searching for common parameters or feature spaces directly in high-dimensional environments.

In this paper, we explore building meta mental world models to establish the common physical dynamics over the compact representations from visually different environments. Without the guidance of explicit rewards, our meta-world models learn about the relationships among different worlds through multi-task learning. The key concept is to maintain a recurrent neural network (RNN) using self-supervised learning to capture the underlying dynamics of each environment and find a common latent structure across different environments. As a result, we obtain a low-dimensional common latent structure among multiple environments which share the underlying dynamics learned by the RNN. We demonstrate the performance of our meta-world using five variants of the Atari Pong game, which are completely different from the original Pong in the transition function and even the state space.

As the only premise we have is that all training environments share the same physical dynamics, building meta-world models poses two challenges. First, the transformation pattern and the corresponding states between environments are unknown, so we cannot simply obtain the common representations by direct transformation or supervised learning. Second, learning shared dynamics is not guaranteed, as the capacity of neural networks can be large enough to treat the dynamics of all training environments as different even within a single model. Through extensive experiments, we find that it is possible to unify two visually different environments using the shared dynamics. We also demonstrate that agents equipped with our meta-world model possess the ability of visual self-recognition, i.e., they pass the classic mirror self-recognition test (MSR) [Gallup1970]. To the best of our knowledge, this is the first work to build a meta-world model to learn shared dynamics.

Related Work

Model-based deep reinforcement learning algorithms have been shown to be more effective than model-free alternatives in certain tasks [Watter et al.2015, Wahlström, Schön, and Deisenroth2015, Levine and Abbeel2014]. One of the classical model-based algorithms is Dyna-Q [Sutton1990], which learns the policy from both the model and the environment by supplementing real-world on-policy experiences with simulated trajectories. However, using trajectories from a non-optimal or biased model can lead to learning a poor policy [Gu et al.2016]. To model the world environment of Atari Games, an autoencoder has been used to predict the next observation and environment rewards [Leibfried, Kushman, and Hofmann2016]. Some previous works [Schmidhuber2015, Ha and Schmidhuber2018, Leibfried, Kushman, and Hofmann2016] maintain a recurrent architecture to model the world using unsupervised learning and have shown its efficiency in helping RL agents outperform previous methods in complex environments. The work of [Ha and Schmidhuber2018] also demonstrates the model's capability of helping the agent act in the real world by learning from the “dream world”, i.e., the mental model of the world. However, these models can only be applied to a single environment and need to be built from scratch for new environments. Although it uses a similar recurrent architecture, our work differs from the above works by learning the common underlying dynamics over multiple environments.

To achieve multi-task learning, recurrent architectures [Duan et al.2016, Wang et al.2016] have also been used to learn to reinforcement learn by adapting to different MDPs automatically, which is shown to be comparable to the UCB1 algorithm [Auer, Cesa-Bianchi, and Fischer2002] on bandit problems. Meta-learning shared hierarchies (MLSH) [Frans et al.2017] shares sub-policies among different tasks to achieve the goal in the training process, where high-level actions are obtained and reused in other tasks. The model-agnostic meta-learning algorithm (MAML) [Finn, Abbeel, and Levine2017] minimizes the total error across multiple tasks by locally conducting few-shot learning to find the optimal parameters for both supervised learning and RL. Actor-mimic [Parisotto, Ba, and Salakhutdinov2015] distills multiple pre-trained DQNs on different tasks into one single network to accelerate the learning process by initializing the learning model with the learned parameters of the distilled network. To achieve promising results, these pre-trained DQNs have to be expert policies. Distral [Teh et al.2017] learns multiple tasks jointly and trains a shared policy as the “centroid” by distillation. Most meta-learning approaches consider the problems within the model-free RL paradigm and focus on finding the common structure in the policy space. However, learning in high-dimensional environments with model-free approaches suffers from sparse and high-variance reward signals, and thus requires large amounts of data to explore. In contrast, we explicitly maintain a meta-world model to capture the latent structures and dynamics of the environment, and thus obtain more stable correlation signals.

In terms of the meta-learning for model-based algorithms, [Al-Shedivat et al.2017, Clavera et al.2018] focused on model adaptation when the model is incomplete or the underlying MDPs are evolving. By taking the unlearned model as a new task and continuously learning new structures, the agent can keep its model up to date. Different from above approaches, we focus on how to establish the common physical dynamics over the compact representations from visually different environments.


In this section, we will introduce the notation and formalize the meta-world learning.

Environment Setting

We first define the environment as a Markov Decision Process (MDP) represented by the tuple $\langle S, P, A, R \rangle$, where $S$, $P$, $A$, and $R$ are the state space, the transition probability function, the action space, and the reward function respectively. At each time step $t$, a state $s_t \in S$ is provided by the environment. The agent in this environment receives this state information and chooses an action $a_t \in A$ according to its own strategy. The environment then receives this action and moves to a new state $s_{t+1}$ following the transition function $P(s_{t+1} \mid s_t, a_t)$. The initial state $s_0$ of the environment is determined by a distribution $\rho_0$. In the following sections, we also use the apostrophe to indicate quantities at the next time step when $t$ is omitted for simplicity, e.g., $s'$ instead of $s_{t+1}$.

To help agents learn compact representations of the environment, [Ha and Schmidhuber2018] introduced the world model. Inspired by the human cognitive system, the world model consists of two components: a Vision Model (V) and a Memory Model (M), each with its own parameters. At each time step $t$, the vision model receives the real-time high-dimensional observation $s_t$ (e.g., an image frame) from the environment and compresses the input into a compact but informative representation $z_t$. The memory model serves as the history encoder and future predictor by: (1) compressing the abstract representation $z_t$ from V and producing the history information $h_t$ over time; (2) combining the history information $h_t$ with the current abstract observation $z_t$ to predict the future $z_{t+1}$. Because the agent affects the environment state with its own action, the agent action $a_t$ is also fed into the memory model to better predict the future. Because it can reconstruct environment observations from low-dimensional representations, a variational autoencoder (VAE) [Kingma and Welling2013] is normally used as the vision model, consisting of an encoder and a decoder with separate parameters. Agents can then use the information from V and M, i.e., $z_t$ and $h_t$, to make a decision $a_t$.


Meta-World Model

To train a world model that captures the common underlying dynamics of multiple environments, we propose the meta-world model and formalize the problem of meta-world learning. The goal of meta-world learning is to build one memory model M that is shared across multiple environments. During the training stage, a set of trajectories is sampled from each environment. The meta-world model consists of the shared memory model M together with one vision model V per environment, so that the world model of each environment is formed by its own V and the shared M. For each environment, the V model with its own parameters encodes input states to latent vectors, which are passed to M to represent the dynamics learned by the meta-world model. Our expectation is to learn similar latent vectors for corresponding states, i.e., shared underlying dynamics across multiple environments.


In this section, we first provide the meta-learning algorithm for meta-world models. Then we present the method to validate the performance of meta-world models. To find the underlying dynamics of different environments, we describe some further constraints applied to the intermediate output distribution of the V model.

Figure 1: Meta-learning for meta-world models. (a) Reconstruction. (b) Prediction.

Meta-Learning for Meta-World Models

To train a meta-world model, we adopt the variational autoencoder (VAE) as the vision model V and the LSTM [Hochreiter and Schmidhuber1997] model with variant output as the memory model M.

As illustrated in Fig. 1, the training process consists of two stages: the reconstruction of V models and the prediction of M models. In the reconstruction stage, the V model compresses the input state $s$ to the latent vector $z$ and decompresses $z$ to an output state $\hat{s}$ that closely matches the original $s$. In the prediction stage, the M model takes both the action $a$ and the latent vector $z$ as inputs to predict $\hat{z}'$. Then, the predicted state $\hat{s}'$ is obtained by decoding $\hat{z}'$ using the decoder of the V model to approximate the true next state $s'$, which is the result of applying $a$ to the environment under the current state $s$. The reconstruction loss $L_r$ is the distance between the state approximation $\hat{s}$ and the true environment state $s$, and the prediction loss $L_p$ is the distance between the predicted state $\hat{s}'$ and the actual state of the next time step $s'$:

$$L_r = D(\hat{s}, s), \qquad L_p = D(\hat{s}', s'),$$

where $D$ is a distance measurement function. During the training process, the updating rules for both models are defined as

$$\theta_V \leftarrow \theta_V - \nabla_{\theta_V} (\alpha L_r + \beta L_p), \qquad \theta_M \leftarrow \theta_M - \nabla_{\theta_M} L_p,$$

where $\theta_V$ and $\theta_M$ are the parameters of the V and M models, and $\alpha$ and $\beta$ are the ratios of $L_r$ and $L_p$ at one training step respectively. The V model is updated with gradients of both $L_r$ and $L_p$, and the M model is updated with gradients of $L_p$ only.
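As a rough illustration of how the reconstruction and prediction losses described above combine into the objective of the V model, the sketch below evaluates them on tiny hand-written states. It assumes mean squared error as the distance function; the names `distance` and `vision_loss` and all numeric values are hypothetical, since the source does not specify them.

```python
def distance(a, b):
    """Mean squared error between two flat states (an assumed choice of distance)."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def vision_loss(recon_loss, pred_loss, alpha, beta):
    """Weighted objective for the V model; the M model would use pred_loss alone."""
    return alpha * recon_loss + beta * pred_loss


# Toy flattened states (illustrative values only).
s = [0.0, 1.0, 0.0, 1.0]            # current state
s_hat = [0.0, 1.0, 0.0, 0.5]        # reconstruction of s by the V model
s_next = [1.0, 0.0, 0.0, 1.0]       # true next state
s_next_hat = [1.0, 0.0, 0.5, 1.0]   # decoded prediction of the M model

recon_loss = distance(s_hat, s)            # reconstruction loss
pred_loss = distance(s_next_hat, s_next)   # prediction loss

# Reconstruction-only weighting versus prediction-only weighting.
loss_recon_stage = vision_loss(recon_loss, pred_loss, alpha=1.0, beta=0.0)
loss_pred_stage = vision_loss(recon_loss, pred_loss, alpha=0.0, beta=1.0)
```

Setting one weight to zero is only one way to switch between the two stages; any other ratio of the two weights fits the same function.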

In the meta-learning for meta-world models, we alternate the training process for V models between minimizing the reconstruction loss $L_r$ and the prediction loss $L_p$ by adjusting their weights $\alpha$ and $\beta$ accordingly. For the training of V models, minimizing $L_p$ helps to adjust V models of different environments with respect to the shared underlying dynamics learned by the memory model M. However, adjusting V models to satisfy the learning of shared dynamics will in turn affect their reconstruction ability. The optimizations of $L_r$ and $L_p$ influence each other, similar to a general-sum game, where an equilibrium is hard to reach when the two optimization processes run simultaneously. Thus, we use the alternating training scheme, minimizing $L_r$ and $L_p$ for V models in turn throughout the training process, to avoid being trapped in a local minimum. To capture the shared underlying dynamics across multiple environments, the following constraints are considered in our model:

  1. The Memory model M should make use of the abstract representation of the current state and its own memory to predict the next state with specific action.

  2. The Vision model V should be able to reconstruct the original state from its abstract representation.

Validating the Model Performance

One may doubt whether the meta-world model learns the common dynamics, because a neural network with high capacity could represent multiple dynamics simultaneously. Despite the strict constraints listed above, it is still possible for our meta-world model to learn multiple dynamics simultaneously or to be trapped in a local minimum. In this section, we present the method to validate that meta-world models actually learn the common dynamics of multiple environments.

Suppose we have environment $E_i$ and environment $E_j$ sharing the same underlying physical dynamics but with different visual observations. For the sake of simplicity, we assume the visual observation of $E_j$ is the transpose of that of $E_i$, i.e., rotated clockwise by 90° and flipped horizontally. For most advanced neural networks, which lack rotation invariance, these two environments are totally different. Although they might look different to humans at first glance, humans will come to realize that $E_i$ and $E_j$ are almost identical because they consist of the same components and share the same underlying physical dynamics, even though the dynamics are presented in different directions. Thus, humans have the ability to unify different observations sharing the same underlying dynamics.

To demonstrate that our meta-world model is also capable of capturing the common dynamics of two such environments, instead of treating them as different dynamics and absorbing them simultaneously into the neural network, the world models of $E_i$ and $E_j$ should have very similar abstract representations $z_i$ and $z_j$ of corresponding observations $s_i$ and $s_j$. However, directly measuring the distance between the abstract representations in the vector space is not feasible, due to the high variance of V models. Instead, we validate our model by encoding the observation $s_i$ from $E_i$ to the corresponding abstract representation $z_i$, which is then decoded with the decoder of the V model of environment $E_j$. Then, we compare the decoded result $\hat{s}_{i \to j}$ with the corresponding observation $s_j$. If $\hat{s}_{i \to j}$ and $s_j$ are almost identical, we can confirm that meta-world models actually capture the underlying physical dynamics of different environments and unify their abstract representations as the shared dynamics, rather than simply memorizing and absorbing them into the neural networks.
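The cross-decoding check described above can be sketched with two hand-built, idealized vision models. This is only an illustration under strong assumptions: the "encoders" and "decoder" below are hypothetical lossless functions rather than trained VAEs, the second environment is assumed to be the transpose of the first, and mean squared error stands in for the distance function.

```python
def transpose(frame):
    """Return the transpose of a frame given as a list of rows."""
    return [list(row) for row in zip(*frame)]


def flatten(frame):
    return [pixel for row in frame for pixel in row]


def unflatten(vec, rows, cols):
    return [vec[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical ideal vision models that map corresponding states to one latent.
def encode_first(frame):
    """Encoder of the first environment: row-major flattening."""
    return flatten(frame)


def encode_second(frame):
    """Encoder of the second environment: undo the transpose, then flatten."""
    return flatten(transpose(frame))


def decode_second(z, rows, cols):
    """Decoder of the second environment: rebuild the frame, then transpose it."""
    return transpose(unflatten(z, rows, cols))


def cross_decoding_loss(frame_first, frame_second):
    """Encode with the first environment, decode with the second, and compare."""
    rows, cols = len(frame_first), len(frame_first[0])
    decoded = decode_second(encode_first(frame_first), rows, cols)
    diffs = [(x - y) ** 2 for x, y in zip(flatten(decoded), flatten(frame_second))]
    return sum(diffs) / len(diffs)


frame_first = [[0, 1, 0], [0, 0, 1]]
frame_second = transpose(frame_first)   # the corresponding state
loss_matched = cross_decoding_loss(frame_first, frame_second)
loss_mismatched = cross_decoding_loss(frame_first, [[1, 0], [1, 0], [0, 1]])
```

A loss near zero for corresponding states, as in `loss_matched`, is the signal the validation looks for; trained models would only approximate it.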

Adding Constraints on Abstract Representations

As described in the above section, we want to unify the abstract representations of different environments sharing the same underlying physical dynamics. However, the abstract representation generated by each vision model might not be the same without further constraints on the learning period. Concretely, because we adopt the VAE as the V model, the element-wise similarity of the abstract representation is determined by the distribution (both the mean and the variance) of the intermediate output of VAE. Thus, it is difficult for randomly initialized V models to find the common dynamics through the coordination of abstract representations.

To add further constraints on the distribution of the abstract representations, we introduce the Maximum Mean Discrepancy (MMD) [Gretton et al.2007] to regularize the channels of the latent vector by constraining the distance among the means and variances of the intermediate outputs of the V models. Concretely, we force the world models of all environments to have similar distributions of the abstract representation by adding a regularizer to the total training loss. The regularizer penalizes the distance between the means and between the log variances of the VAEs of the different environments, with two ratios weighting the error of the mean and the error of the log variance respectively. The total training loss of the V model is then its original training loss plus this regularizer.

During the training process, we add the MMD loss in the reconstruction stage, while continuing to train with the prediction loss in the prediction stage. In this way, the distributions of abstract representations can also be influenced by other environments through the participation of the memory model M. That is, the V models of other environments can cause M to change its parameters through the joint training in the prediction stage, thus indirectly influencing the distribution of the abstract representations of each environment.
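The following sketch shows one simple way such a regularizer could be computed from the VAE outputs of two environments. It is an assumption, not the paper's formula: it matches only the means and log variances with squared Euclidean distances (not a kernel-based MMD estimate), and the function names and default weights are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    assert len(u) == len(v)
    return sum((x - y) ** 2 for x, y in zip(u, v))


def moment_matching_penalty(mu_a, logvar_a, mu_b, logvar_b,
                            lam_mean=1.0, lam_logvar=1.0):
    """Penalize differences in mean and log variance between two VAE outputs.

    lam_mean and lam_logvar play the role of the two ratios weighting the
    mean error and the log-variance error (hypothetical defaults).
    """
    mean_term = squared_distance(mu_a, mu_b)
    logvar_term = squared_distance(logvar_a, logvar_b)
    return lam_mean * mean_term + lam_logvar * logvar_term


# Identical statistics give zero penalty; differing ones are penalized.
same = moment_matching_penalty([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])
different = moment_matching_penalty([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0],
                                    lam_mean=0.5, lam_logvar=0.25)
```

In a training loop, a term of this kind would be added to the V model loss during the reconstruction stage only, following the description above.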

Figure 2: Meta-world environments.


This section starts with the introduction of our experiment environment (the Atari Pong game), problem setup, and evaluation criteria. We then present the experiment results and analysis of different training strategies, i.e., using corresponding states from different environments as training data or not. We also demonstrate that agents equipped with our meta-world model possess the ability of visual self-recognition as being self-aware.

The Pong World

Atari Pong is an environment with two paddles that can only move up or down to hit the ball. The dynamics of the Pong game involve an opponent that the player's action cannot control and an agent that the player can control. To verify the ability of our models to capture shared physical dynamics among different visual observations, we generate five variants of the Atari Pong game while keeping the physical dynamics of the environment. As shown in Fig. 2, each variant corresponds to one transformation of the original environment: (a) the transposed variant, whose state observation is obtained from that of the original by rotating it clockwise by 90° and flipping it horizontally; (b) the horizontal-swapped variant, generated by splitting the original observation frame vertically at the center and swapping the left part with the right part; (c) the inverse variant, created by exchanging the background color with the paddle/ball color of the original; (d) the mirror-symmetric variant, which reflects the original like a mirror by flipping the observation horizontally; and (e) the vertical-swapped variant, obtained by splitting the original observation horizontally at the center and swapping the upper part with the lower part. Note that the split line in the vertical-swapped transformation may cut a paddle into two halves, e.g., part of the paddle could disappear from the top of the frame and reappear at the bottom when the paddle is moving across the center. We refer to this situation as paddle teleportation.
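The five transformations can be written down compactly. The sketch below is a minimal illustration that assumes binary frames stored as lists of rows with even height and width; the function names are hypothetical labels for the variants (a) to (e).

```python
def transposed(frame):
    """(a) Rotate 90 degrees clockwise, then flip horizontally (a matrix transpose)."""
    rotated = [list(row) for row in zip(*frame[::-1])]   # clockwise rotation
    return [row[::-1] for row in rotated]                # horizontal flip


def horizontal_swapped(frame):
    """(b) Split vertically at the center and swap the left and right halves."""
    half = len(frame[0]) // 2
    return [row[half:] + row[:half] for row in frame]


def inverse(frame):
    """(c) Exchange background and foreground values of a binary frame."""
    return [[1 - pixel for pixel in row] for row in frame]


def mirror_symmetric(frame):
    """(d) Reflect the frame like a mirror (reverse every row)."""
    return [row[::-1] for row in frame]


def vertical_swapped(frame):
    """(e) Split horizontally at the center and swap the upper and lower halves."""
    half = len(frame) // 2
    return [list(row) for row in frame[half:] + frame[:half]]


# Toy 4x4 frame: a two-pixel "paddle" in the left column and two other pixels.
frame = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
variants = {
    "transposed": transposed(frame),
    "horizontal_swapped": horizontal_swapped(frame),
    "inverse": inverse(frame),
    "mirror_symmetric": mirror_symmetric(frame),
    "vertical_swapped": vertical_swapped(frame),
}
```

In the toy frame, the vertical swap moves the left-column paddle from the top rows to the bottom rows; a paddle straddling the center line would be cut in two, which is the paddle teleportation case.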

These variants can be divided into two groups by how their state space differs from that of the original environment. The first group, including the transposed, horizontal-swapped, and inverse variants, has a totally different state space from the original, either by putting the paddles in different positions or by expressing the corresponding states with different pixel values. Consequently, actions in this group of environments appear different because of the new state space. The second group, consisting of the mirror-symmetric and vertical-swapped variants, has either the same state space as the original or at least some overlap with it. However, the actions still appear different because of the transformation.

Unlike the original Atari Pong observations, we (1) transform each image frame to a binary matrix; (2) remove the scoreboard to focus on the dynamics of the paddles; and (3) resize each frame. The result serves as the state observation of the original Pong world environment. The action space is formed by all six available discrete actions of the original Atari game environment.

Figure 3: Training meta-world models with corresponding inputs. Panels (a) to (e) each pair the original environment with one of the five variants. Legend: prediction loss; reconstruction loss; transformation loss; predicted transformation loss.

Problem Setup

Although the presented environment variants respond to actions in visually different ways, they share the same underlying physical dynamics, as actions move the paddle in corresponding ways. By observing the shared dynamics in the environment variants, humans could recall the dynamics seen in the original environment, look for the similarities, and finally consider them highly related environments. However, artificial neural networks (ANNs) are not able to generalize across such transformations with ease. For example, ANNs are not rotation-invariant and thus regard the transposed environment as a totally different world from the original. Our target is to unify the latent representations between the environment variants and the original environment by utilizing their shared dynamics. For simplicity, we build meta-world models on the original Pong environment and one of the five environment variants at a time. We explore training the shared dynamics on two environments with corresponding inputs and with non-corresponding inputs, i.e., with or without using corresponding states from different environments as training data, to verify the performance of meta-world models.

To collect training data covering most dynamics of the Pong environment, we use an agent with a random policy to play the game for 10,000 episodes and limit the episode length to 1000 steps. The vision model V adopts the same architecture as the World Model [Ha and Schmidhuber2018], with the latent vector $z$ of size 32. The memory model M is an LSTM with 32 hidden units that predicts the parameters of a diagonal Gaussian distribution. The predicted latent representation of the next state, $\hat{z}'$, is then sampled from this Gaussian distribution. Since the latent vector encoded by the VAE is likewise sampled from a Gaussian distribution, an LSTM with a single Gaussian prediction is sufficient for our experiments. We set the batch size for each task to 16 and the sequence length of the LSTM to 25. As described before, we alternate the training process for V models between minimizing the reconstruction loss and the prediction loss by composing each training iteration of 20 prediction iterations and 10 reconstruction iterations.
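The alternating schedule can be pictured as a simple generator of stage labels. This is a minimal sketch: the 20 and 10 iteration counts come from the text, while the function name and the choice to run the prediction stage first within each training iteration are assumptions.

```python
def training_schedule(num_rounds, prediction_iters=20, reconstruction_iters=10):
    """Yield the stage name for every step, alternating the two stages.

    Each round consists of `prediction_iters` prediction steps followed by
    `reconstruction_iters` reconstruction steps (the order is an assumption).
    """
    for _ in range(num_rounds):
        for _ in range(prediction_iters):
            yield "prediction"
        for _ in range(reconstruction_iters):
            yield "reconstruction"


stages = list(training_schedule(num_rounds=2))
num_prediction = stages.count("prediction")
num_reconstruction = stages.count("reconstruction")
```

A training loop could iterate over this generator and, at each step, apply the prediction loss or the reconstruction loss depending on the yielded label.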

Evaluation Criteria

Each experiment involves two environments (denoted by $E_i$ and $E_j$), as we want to validate whether the latent representations of corresponding states are identical or at least close enough. Because of the variance introduced by the VAE, we compare them indirectly: the latent representation $z_i$ of the state $s_i$ from $E_i$ is decoded by the decoder of $E_j$ to $\hat{s}_{i \to j}$. The performance is evaluated by the transformation loss

$$L_t = D(\hat{s}_{i \to j}, s_j),$$

which is the distance between $\hat{s}_{i \to j}$ and the real corresponding state $s_j$.

We also investigate whether the RNN predictions of the latent representation for each environment share the same vector space. We decode the predicted latent representation $\hat{z}'_i$ of environment $E_i$ using the decoder of $E_j$ to get $\hat{s}'_{i \to j}$. Similarly, the performance is evaluated by the predicted transformation loss

$$L_{pt} = D(\hat{s}'_{i \to j}, s'_j),$$

which is the distance between $\hat{s}'_{i \to j}$ and the corresponding next state $s'_j$.

Meta-World Experiments

Corresponding Inputs

In the first experiment, we train on two environments with corresponding states as input. However, we do not provide meta-world models with any information about the type of transformation between environments, which prevents them from simply obtaining the common representations by direct transformation or supervised learning. For simplicity, we focus on pair-wise training, i.e., we fix the original environment and choose one of the five variants for each training process. At each time step, we randomly sample 16 trajectories of length 25 from the dataset as the training data for the original environment, and transform these samples to the corresponding states as the training data for the variant, so that the training inputs for the two environments already show the same type of dynamics.

We present training results with respect to the prediction loss, the reconstruction loss, the transformation loss, and the predicted transformation loss for each of the five environment variants to illustrate the performance of meta-world models. As shown in Fig. 3, the transformation loss in all experiments converges to a relatively low value, which is nearly the same as the reconstruction loss of the corresponding environment variant; the same holds for the predicted transformation loss and the prediction loss. At each training step of the prediction stage, the corresponding-inputs setting provides meta-world models with the same types of dynamics from different environments, which are easier for the memory model to learn. Meanwhile, vision models can learn to adapt to these shared dynamics simultaneously, which makes meta-world models converge more easily. With the help of the memory model predicting corresponding states, vision models of different environments can understand corresponding states in the same way, i.e., meta-world models can unify the latent representations of different vision models and thus capture the underlying shared dynamics of different environments.

Figure 4: Training meta-world models without corresponding inputs. Panels (a) to (e) each pair the original environment with one of the five variants. Legend: prediction loss; reconstruction loss; transformation loss; predicted transformation loss.

Non-corresponding Inputs

In the second experiment, we explore a more human-like but more difficult learning strategy: learning shared dynamics without knowing the corresponding states, i.e., randomly choosing training data instead of providing the model with trajectories showing the same types of dynamics as input. Concretely, we split the training set into two parts and apply the corresponding transformation to one part to prepare the training data for the variant. At each time step, we randomly sample 16 trajectories from each part and feed them as input to meta-world models. These trajectories normally present different types of dynamics, which prevents the memory model from learning corresponding parts of the dynamics simultaneously from both environments and thus requires the models to figure out the exact corresponding states and then learn the shared dynamics.
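The difference between the two data regimes can be made concrete with a small sampler. This is a sketch under assumptions: episodes are lists of states, `transform` stands for whichever variant transformation is used, and the function names are hypothetical. Batch size 16 and sequence length 25 follow the text.

```python
import random


def sample_windows(episodes, batch_size, seq_len, rng):
    """Sample `batch_size` contiguous windows of `seq_len` states."""
    batch = []
    for _ in range(batch_size):
        episode = rng.choice(episodes)
        start = rng.randrange(len(episode) - seq_len + 1)
        batch.append(episode[start:start + seq_len])
    return batch


def corresponding_batches(episodes, transform, batch_size, seq_len, rng):
    """Variant batch is the transformed copy of the sampled original batch."""
    base = sample_windows(episodes, batch_size, seq_len, rng)
    variant = [[transform(state) for state in window] for window in base]
    return base, variant


def non_corresponding_batches(episodes, transform, batch_size, seq_len, rng):
    """Original and variant batches come from disjoint halves of the data."""
    half = len(episodes) // 2
    base = sample_windows(episodes[:half], batch_size, seq_len, rng)
    source = sample_windows(episodes[half:], batch_size, seq_len, rng)
    variant = [[transform(state) for state in window] for window in source]
    return base, variant


# Toy data: each state is an (episode_id, step) pair instead of an image frame.
episodes = [[(ep, t) for t in range(40)] for ep in range(6)]
tag = lambda state: ("T",) + state   # stand-in for a frame transformation
rng = random.Random(0)
base_c, variant_c = corresponding_batches(episodes, tag, 16, 25, rng)
base_n, variant_n = non_corresponding_batches(episodes, tag, 16, 25, rng)
```

In the corresponding regime every variant state is the image of the original state at the same batch position, whereas in the non-corresponding regime the two batches never share an underlying trajectory.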

Similar to the experiments with corresponding inputs, we also present training results with respect to the prediction loss, the reconstruction loss, the transformation loss, and the predicted transformation loss for the original environment and each of the five variants. As illustrated in Fig. 4, the convergence results demonstrate the effectiveness of meta-world models in unifying the latent representations of different environments when training with non-corresponding inputs, although they are slightly worse than those obtained with corresponding inputs. This is mainly because vision models also try to adapt to the memory model with previously learned dynamics when the memory model is updated during the prediction stage. During the reconstruction stage, the memory model is in turn prevented from overfitting because it needs to predict a proper latent representation that can be interpreted by the decoder of the corresponding environment. Thus, meta-world models are able to capture the underlying shared dynamics even though the training process is unstable, i.e., both the memory model and the vision models are optimized along fluctuating directions.

Meanwhile, the meta-world model built on the original environment and the vertical-swapped variant finds it hard to capture the shared dynamics when training with non-corresponding inputs. We believe this may be attributed to paddle teleportation: in the vertical-swapped variant, a paddle can be divided into two parts, where one part disappears at one side of the frame and reappears at the other side, resulting in new properties of split and teleported paddles that do not exist in other environments. Fig. 4(e) illustrates the learning curve of a successful attempt to learn the meta-world model on the original environment and the vertical-swapped variant.


Figure 5: Difference between latent vectors and of corresponding states from vision models of and . (a) log standard deviations; (b) mean distance.

Analyzing the Shared Dynamics

To explicitly demonstrate the ability of meta-world models to unify the representation space of different environments, we provide visualized results on the evaluation set in Appendix A. To further investigate the shared dynamics among different environments, we analyze the difference between latent representations and from two environments and , as shown in Fig. 5.

Fig. 5(a) illustrates the log standard deviations of the vision models of and respectively, where a small group of elements (indexed as 7, 13, 19, 20 and 21) has relatively low variance in both environments and therefore keeps relatively stable values across different observations. Fig. 5(b) shows the mean distance of each element from latent representations of corresponding states between and , where the elements in the same group are close to each other in the representation space across the two environments. Being stable across different observations, these elements also share high similarity between both environments and can be interpreted mutually by both vision models. We consider these elements the critical part in representing the shared dynamics and define them as key elements.
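The two statistics discussed above can be computed along the following lines. This is a hedged sketch on synthetic data, not the analysis code behind the figure: it assumes the log standard deviation is the empirical one taken across observations (the plotted quantity may instead be the log standard deviation predicted by the encoder), it assumes the distance is the mean absolute difference, and the function names and the toy latent size of 8 are hypothetical.

```python
import numpy as np

def log_std_per_element(latents):
    """Log of the empirical standard deviation of each latent element."""
    # latents: (num_states, latent_dim); the epsilon avoids log(0).
    return np.log(latents.std(axis=0) + 1e-8)

def mean_distance_per_element(latents_a, latents_b):
    """Mean absolute difference per element over corresponding states."""
    return np.abs(latents_a - latents_b).mean(axis=0)

def candidate_key_elements(latents_a, latents_b, k):
    """Indices of the k elements with the lowest summed log std in both environments."""
    score = log_std_per_element(latents_a) + log_std_per_element(latents_b)
    return np.sort(np.argsort(score)[:k])

# Synthetic latents: elements 1 and 4 are stable and shared, the rest are not.
rng = np.random.default_rng(0)
n, d, key = 500, 8, [1, 4]
latents_a = rng.normal(size=(n, d))
latents_b = rng.normal(size=(n, d))
latents_a[:, key] = 0.05 * rng.normal(size=(n, len(key)))
latents_b[:, key] = latents_a[:, key] + 0.01 * rng.normal(size=(n, len(key)))

found = candidate_key_elements(latents_a, latents_b, k=2)
distance = mean_distance_per_element(latents_a, latents_b)
```

On this toy data the selected indices coincide with the planted stable elements, and those elements also have the smallest mean distance, mirroring the pattern reported for the key elements.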

Figure 6: Validating the importance of key elements. (a) sum of absolute weights connected to in ; (b) mean absolute gradients of output to in .

To validate the importance of key elements in extracting the underlying dynamics, we show the weights of the decoder neural network connected to the latent representations in Fig. 6(a). As the sum of absolute weights connected to key elements is much larger than that of the others, a change in key elements has a greater influence on the decoder output , which illustrates their significance in the latent representations. Also, the value of each element of the decoder output ranges from 0 to 1, where pixels of paddles correspond to high values. We then compute the gradients of the output with respect to the latent vector to observe which part of the latent vector contributes more to the paddle. As shown in Fig. 6(b), the mean absolute gradients of key elements are significantly larger than those of the others. Indeed, other elements have nearly zero gradients, meaning they make almost no contribution to the decoder output . Consequently, the learned unified representations mainly concentrate on key elements.
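Both importance measures can be illustrated with a toy decoder. The sketch below assumes a single sigmoid layer, y = sigmoid(Wz + b), so that the gradient has the closed form dy_i/dz_j = y_i (1 - y_i) W_ij. The actual decoder is a deeper network, for which the gradients would be obtained by automatic differentiation. The weight values, sizes and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weight_importance(W):
    """Sum of absolute weights attached to each latent element."""
    # W: (output_dim, latent_dim); column j holds the weights of element j.
    return np.abs(W).sum(axis=0)

def mean_abs_gradient(W, b, latents):
    """Mean absolute gradient of the outputs with respect to each latent element."""
    y = sigmoid(latents @ W.T + b)        # (num_states, output_dim), values in (0, 1)
    slope = y * (1.0 - y)                 # derivative of the sigmoid
    grads = slope[:, :, None] * W[None]   # (num_states, output_dim, latent_dim)
    return np.abs(grads).mean(axis=(0, 1))

# Toy decoder: elements 2 and 5 play the role of key elements (assumption).
W = np.full((16, 6), 0.01)
W[:, [2, 5]] = 1.0
b = np.zeros(16)
latents = np.random.default_rng(0).normal(size=(100, 6))

weights_sum = weight_importance(W)
grads_mean = mean_abs_gradient(W, b, latents)
```

In this construction the two planted elements dominate both measures, while the remaining elements contribute gradients that are two orders of magnitude smaller.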


One of the simple definitions of self-awareness is the ability to recognize oneself as an individual separate from the environment. The mirror self-recognition test (MSR) is the classical method for attempting to measure self-awareness [Gallup1970], where an animal is exposed to mirrors for the first time to test whether it can recognize the reflected image as itself, rather than as that of another animal. Humans are known to have self-consciousness, as they can learn to recognize their own images after prolonged confrontation with mirrors and stop responding socially to the reflection. Although the reason why animals can achieve this without having seen themselves before is still under debate, we propose a possible direction: we think animals with self-consciousness should have the ability to correlate their own actions with those in the mirror and perceive the reflection as themselves. Their actions have the same impact in both the mirrored environment and the real world even with different visual observations, which means the two environments share the same underlying dynamics.

In our experiments, we define the self-awareness of agents as the ability to unify the concepts of the agent itself in two different environments, e.g., and . Our first experiment with corresponding inputs can be seen as the Atari Pong version of the mirror test. When taking an action in , an agent with meta-world models can predict how the mirrored world will change under the influence of this action. Being able to predict the next observation in a mirrored Pong game, an agent possesses the ability to distinguish between the controlled paddle and the opponent paddle by observing their actions and dynamics, thus recognizing itself in the reflected mirrored environment.

Meanwhile, the experiment with non-corresponding inputs goes a step further: agents try to figure out the correspondence between two environments without knowing the exact transformations. This can be seen as the process of reasoning about the differences between observations and searching for the common rules among them. Humans normally understand the world in this way, where the process of integrating knowledge involves finding the shared aspects of different observations that progress independently. Once the common points are found, the concepts behind these observations can be unified, thus helping humans to enhance their knowledge and better understand the world.


In this paper, we propose meta-world models for learning shared underlying dynamics among visually different environments. To propose an engineering approach that mimics the activity of human brains, which keep a mental model of the world developed from the most impoverished of visual stimuli, we explore building a meta-world model with a basic form of consciousness. We show through extensive experiments that our meta-world model can successfully learn an abstract description of shared physical dynamics among different variants of the Atari Pong game, which can be totally different in both state space and transition functions. We also demonstrate that agents equipped with our meta-world model possess the ability of visual self-recognition, i.e., they pass the classic mirror self-recognition test (MSR). For future directions, we would like to explore the ability of our meta-world model to understand the world by applying it to more diverse environments. We would also like to combine the meta-world model with "model-free" methods to learn through experience and make inference and planning more computationally efficient.


Appendix A. Unifying the representation space of different environments

Figure 7: The visualized results on the evaluation set. The first column is the observation of environment , which is encoded to by V model . The remaining columns correspond to the decoding results of from vision models of , , , , .