The ability to perceive visual sensory data and represent it as useful and concise descriptions is considered a fundamental cognitive capability in humans [1, 2], and thus crucial for building intelligent agents [3]. Representations that succinctly reflect the true state of the environment should allow agents to learn to act in those environments with fewer interactions, and to transfer knowledge effectively across different tasks in the environment.
Recently, deep representation learning has led to tremendous progress in a variety of machine learning problems across numerous domains [4, 5, 6, 7, 8]. Such representations are typically learned end-to-end using the signal from labels or rewards, which often makes these techniques very sample-inefficient. In contrast, human learning in the natural world appears to require little to no explicit supervision for perception [9].
Unsupervised [10, 11, 12] and self-supervised representation learning [13, 14, 15] have emerged as alternatives to supervised learning that can yield useful representations with reduced sample complexity. In the context of learning state representations [16], current unsupervised methods rely on generative decoding of the data using either VAEs [17, 18, 19, 20] or prediction in pixel space [21, 22]. Since these objectives are based on reconstruction error in pixel space, they are not incentivized to capture abstract latent factors and often default to capturing pixel-level details.
In this work, we leverage recent advances in self-supervision that rely on scalable estimation of mutual information [23, 24, 25, 26], and propose a new contrastive state representation learning method named Spatiotemporal Deep Infomax (ST-DIM), which maximizes the mutual information across both the spatial and temporal axes.
To systematically evaluate the ability of different representation learning methods to capture the true underlying factors of variation, we propose a benchmark based on Atari 2600 games using the Arcade Learning Environment [ALE, 27]. A simulated environment provides access to the underlying generative factors of the data, which we extract using the source code of the games. These factors include variables such as the location of the player character, the location of various items of interest (keys, doors, etc.), and the location of various non-player characters, such as enemies (see figure 1). The performance of a representation learning technique on the Atari representation learning benchmark is then evaluated using linear probing [28], i.e., the accuracy of linear classifiers trained to predict the latent generative factors from the learned representations.
Our contributions are the following:
- We propose a new self-supervised state representation learning technique which exploits the spatiotemporal nature of visual observations in a reinforcement learning setting.
- We propose a new state representation learning benchmark using 22 Atari 2600 games based on the Arcade Learning Environment (ALE).
- We conduct extensive evaluations of existing representation learning techniques on the proposed benchmark and compare them with our proposed method.
2 Related Work
Unsupervised representation learning via mutual information estimation:
Recent works in unsupervised representation learning have focused on extracting latent representations by maximizing a lower bound on the mutual information between the representation and the input. Belghazi et al. [23] estimate the mutual information with neural networks using the Donsker-Varadhan representation of the KL divergence, while Chen et al. [30] use the variational bound from Barber and Agakov [31] to learn discrete latent representations. Hjelm et al. [25] learn representations by maximizing the Jensen-Shannon divergence between the joint and the product of marginals of an image and its patches. van den Oord et al. [24] maximize mutual information using a multi-sample version of noise contrastive estimation [32, 33]. See [34] for a review of different variational bounds for mutual information.
State representation learning:
Learning better state representations is an active area of research within robotics and reinforcement learning. Jonschkowski and Brock [35] and Jonschkowski et al. [36] propose to learn representations using a set of handcrafted robotic priors. Several prior works use a VAE and its variations to learn a mapping from observations to state representations [37, 17, 38]. Thomas et al. [39] aim to learn representations that maximize the causal relationship between the distributed policies and the representation of changes in the state. Recently, Cuccu et al. [40] show that visual processing and policy learning can be effectively decoupled in Atari games. Nachum et al. [41] connect mutual information estimators to representation learning in hierarchical RL. Our work is also closely related to recent work in learning object-oriented representations [42].
Evaluation frameworks of representations:
Evaluating representations is an open problem, and doing so is usually domain specific. In vision tasks, it is common to evaluate based on the presence of linearly separable label-relevant information, either in the domain the representation was learned on [43] or in transfer learning tasks [44, 45]. In NLP, the SentEval [46] and GLUE [47] benchmarks provide a more linguistics-specific understanding of what the model has learned, and these have become standard tools in NLP research. Our evaluation framework can be thought of as a GLUE-like benchmarking tool for RL, providing a fine-grained understanding of how well the RL agent perceives the objects in the scene. Analogous to GLUE in NLP, we anticipate that our benchmarking tool will be useful in RL research for better designing components of agent learning.
3 Spatiotemporal Deep Infomax
We assume a setting where an agent interacts with an environment and observes a set of high-dimensional observations across several episodes. Our goal is to learn an abstract representation of the observation that captures the underlying latent generative factors of the environment.
These representations should focus on high-level semantics (e.g., the concept of agents, enemies, objects, score, etc.) and ignore low-level details such as the precise texture of the background, which warrants a departure from the class of methods that rely on a generative decoding of the full observation. Prior work in neuroscience [48, 49] has suggested that the brain maximizes predictive information [50] at an abstract level to avoid sensory overload. Predictive information, or the mutual information between consecutive states, has also been shown to be the organizing principle of retinal ganglion cells in salamander brains [51]. Thus our representation learning approach relies on maximizing an estimate based on a lower bound on the mutual information over consecutive observations $x_t$ and $x_{t+1}$.
3.1 Maximizing mutual information across space and time
Given a mutual information estimator, we follow DIM [25] and maximize a sum of patch-level mutual information objectives. The global objective maximizes the mutual information between the full observation at time $t$ and small patches of the observation at time $t+1$. The representations of the small image patches are taken to be the hidden activations of the convolutional encoder applied to the full observation. The layer is picked so that its hidden activations have only a limited receptive field relative to the size of the full observation. The local objective maximizes the mutual information between the local feature at time $t$ and the corresponding local feature at time $t+1$. Figure 2 is a visual depiction of our model, which we call Spatiotemporal Deep Infomax (ST-DIM).
Mutual information lower bounds can be loose [52, 53] and may then fail to capture all the relevant features in the data when used to learn representations. To alleviate this issue, our approach constructs multiple small mutual information objectives (rather than a single large one), which are easier to estimate via lower bounds; this has concurrently been found to work well in the context of semi-supervised learning.
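Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a paired dataset of $N$ samples from some joint distribution $p(x, y)$. For any index $i$, $(x_i, y_i)$ is a sample from the joint $p(x, y)$, which we refer to as a positive example, and for any $j \neq i$, $(x_i, y_j)$ is a sample from the product of marginals $p(x)p(y)$ (for convenience, ignoring those that are in the support of the joint), which we refer to as a negative example. The InfoNCE objective learns a score function $f(x, y)$ which assigns large values to positive examples and small values to negative examples by maximizing the following bound [see 24, 34, for more details on this bound],
$$\mathcal{I}_{NCE}\left(\{(x_i, y_i)\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \log \frac{\exp f(x_i, y_i)}{\sum_{j=1}^{N} \exp f(x_i, y_j)}.$$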
Following van den Oord et al. [24], we use a bilinear model for the score function, $f(x, y) = \phi(x)^{\top} W \phi(y)$, where $\phi$ is the representation encoder. The bilinear model combined with the InfoNCE objective forces the encoder to learn linearly predictable representations, which we believe helps in learning representations at the semantic level. In our context, the positive examples correspond to pairs of consecutive observations $(x_t, x_{t+1})$ and the negative examples correspond to pairs of non-consecutive observations $(x_t, x_{t^*})$, where $t^*$ is a randomly sampled time index from the episode. For ST-DIM, the final score function for the global objective is $g_{m,n}(x_t, x_{t+1}) = \phi(x_t)^{\top} W_{\mathrm{g}} \, \phi^{l}_{m,n}(x_{t+1})$ and the score function of the local objective is $f_{m,n}(x_t, x_{t+1}) = \phi^{l}_{m,n}(x_t)^{\top} W_{\mathrm{loc}} \, \phi^{l}_{m,n}(x_{t+1})$, where $\phi^{l}_{m,n}$ is the feature map at the $l$-th layer at the $(m, n)$ spatial location.
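As a rough illustration of the InfoNCE objective with a bilinear score, the following pure-Python sketch evaluates the sum of log-softmax terms for a batch of paired feature vectors. The function names (`bilinear_score`, `info_nce`) and the toy one-hot features are assumptions made for this example and are not taken from the authors' implementation. In ST-DIM the same computation would presumably be applied to global-local and local-local feature pairs, each with its own weight matrix.

```python
import math


def bilinear_score(u, W, v):
    """Bilinear score u^T W v for two feature vectors (plain lists)."""
    Wv = [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]
    return sum(u_i * wv_i for u_i, wv_i in zip(u, Wv))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce(feats_t, feats_next, W):
    """Sum over i of log( exp f(x_i, y_i) / sum_j exp f(x_i, y_j) ).

    feats_t[i] and feats_next[i] form a positive pair (e.g. consecutive
    frames); feats_next[j] with j != i act as negatives for feats_t[i].
    The value is always <= 0 and is meant to be maximized.
    """
    total = 0.0
    for i, u in enumerate(feats_t):
        scores = [bilinear_score(u, W, v) for v in feats_next]
        total += scores[i] - log_sum_exp(scores)
    return total


# Toy usage: three one-hot "representations" paired with themselves.
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative_W = [[0.0] * 3 for _ in range(3)]
aligned_W = [[5.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(info_nce(feats, feats, uninformative_W))  # equals -3 * log(3)
print(info_nce(feats, feats, aligned_W))        # closer to 0
```

With an all-zero weight matrix every candidate receives the same score, so each term equals $-\log N$; a weight matrix that scores positive pairs higher moves the objective toward zero.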
4 The Atari Annotated RAM Interface (AARI)
Measuring the usefulness of a representation is still an open problem, as a core utility of representations is their use as feature extractors in tasks that are different from those used for training (e.g., transfer learning). Measuring classification performance, for example, may only reveal the amount of class-relevant information in a representation, but may not reveal other information useful for segmentation. It would be useful, then, to have a more general set of measures on the usefulness of a representation, such as ones that may indicate more general utility across numerous real-world tasks. In this vein, we assert that in the context of dynamic, visual, interactive environments, the capability of a representation to capture the underlying high-level factors of the state of an environment will be generally useful for a variety of downstream tasks such as prediction, control, and tracking.
We find video games to be a useful candidate for evaluating visual representation learning algorithms primarily because they are spatiotemporal in nature, which (1) is more realistic than static i.i.d. datasets and (2) matters because prior work [57, 58] has argued that without temporal structure, recovering the true underlying latent factors is undecidable. Video games also provide ready access to the underlying ground truth states, unlike real-world datasets, which we need in order to evaluate the performance of different techniques.
Annotating Atari RAM:
ALE does not explicitly expose any ground truth state information. However, ALE does expose the RAM state (128 bytes per timestep), which is used by the game programmer to store important state information such as the location of sprites, the state of the clock, or the current room the agent is in. To extract these variables, we consulted commented disassemblies (or source code) of Atari 2600 games which were made available by Engelhardt and by Jentzsch and CPUWIZ. We were able to find and verify important state variables for a total of 22 games. Once this information is acquired, combining it with the ALE interface produces a wrapper that can automatically output a state label for every example frame generated from the game. We make this available as an easy-to-use gym wrapper, which returns this information with no change to existing code that uses gym interfaces. Table 1 lists the 22 games along with the categories of variables for each game. We describe the meaning of each category in the next section.
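The annotation step can be pictured as a lookup from variable names to RAM offsets. The sketch below is only an illustration: the offsets in `EXAMPLE_RAM_MAP` and the function name `label_ram` are hypothetical and do not correspond to any particular game or to the released wrapper.

```python
RAM_SIZE = 128  # ALE exposes 128 bytes of Atari 2600 RAM per timestep

# Hypothetical offsets for illustration only; real offsets would come from
# the commented disassembly of each game.
EXAMPLE_RAM_MAP = {
    "player_x": 10,
    "player_y": 16,
    "score": 20,
}


def label_ram(ram, ram_map):
    """Return {variable name: byte value} for one RAM snapshot."""
    if len(ram) != RAM_SIZE:
        raise ValueError(f"expected {RAM_SIZE} bytes of RAM, got {len(ram)}")
    return {name: int(ram[index]) for name, index in ram_map.items()}


# Toy usage with a synthetic snapshot whose byte i simply holds the value i.
snapshot = bytes(range(RAM_SIZE))
print(label_ram(snapshot, EXAMPLE_RAM_MAP))
```

A wrapper built this way only needs to call such a function on the RAM after every environment step and attach the resulting dictionary to the information it already returns.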
State variable categories:
We categorize the state variables of all the games into five major categories: agent localization, small object localization, other localization, score/clock/lives/display, and miscellaneous. Agent Loc. (agent localization) refers to state variables that represent the $x$ or $y$ coordinates on the screen of any sprite controllable by actions. Small Loc. (small object localization) variables refer to the $x$ or $y$ screen position of small objects, like balls or missiles. Prominent examples include the ball in Breakout and Pong, and the torpedo in Seaquest. Other Loc. (other localization) denotes the $x$ or $y$ location of any other sprites, including enemies or large objects to pick up, for example the location of ghosts in Ms. Pacman or the ice floes in Frostbite. Score/Clock/Lives/Display refers to variables that track the score of the game, the clock, the number of remaining lives the agent has, or some other display variable, like the oxygen meter in Seaquest. Misc. (miscellaneous) consists of state variables that are largely specific to a game and don’t fall within one of the above-mentioned categories. Examples include the existence of each block or pin in Breakout and Bowling, the room number in Montezuma’s Revenge, or Ms. Pacman’s facing direction.
Evaluating representation learning methods is a challenging open problem. The notion of disentanglement [62, 63] has emerged as a way to measure the usefulness of a representation [64, 37]. In this work, we focus only on explicitness, i.e., the degree to which underlying generative factors can be recovered using a linear transformation from the learned representation. This is standard methodology in the self-supervised representation learning literature [14, 24, 65, 15, 25]. Specifically, to evaluate a representation we train linear classifiers predicting each state variable, and we report the mean F1 score.
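The reported metric can be sketched as follows. This section does not say how F1 is averaged over the classes of a single state variable, so the support-weighted average used here is an assumption, as are the function names `weighted_f1` and `mean_f1`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 for one state variable."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        # 2*tp + fp + fn >= n > 0, so the division is always defined.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


def mean_f1(results):
    """Mean F1 over state variables; results maps name -> (y_true, y_pred)."""
    scores = [weighted_f1(t, p) for t, p in results.values()]
    return sum(scores) / len(scores)


# Toy usage with two hypothetical state variables.
results = {
    "player_x": ([0, 0, 1, 1], [0, 0, 1, 1]),
    "score": ([0, 0, 1, 1], [0, 1, 1, 1]),
}
print(mean_f1(results))
```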
5 Experimental Setup
We evaluate the performance of different representation learning methods on our benchmark. Our experimental pipeline consists of first training an encoder, then freezing its weights and evaluating its performance on linear probing tasks. For each identified generative factor in each game, we construct a linear probing task in which a linear classifier on top of the representation is trained to predict the ground truth value of that factor. Note that the gradients are not backpropagated through the encoder network; they are only used to train the linear classifier on top of the representation.
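A linear probe on frozen features can be sketched in a few lines. In the example below the "encoder outputs" are fixed lists of numbers that are never modified, which stands in for the frozen encoder; only the probe's weights and biases are trained. The function names, learning rate, epoch count, and toy data are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def logits_for(W, b, x):
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(b))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Train a softmax linear classifier with SGD on fixed features."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(W, b, x))
            for k in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. logit k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for d in range(dim):
                    W[k][d] -= lr * grad * x[d]
                b[k] -= lr * grad
    return W, b


def predict(W, b, x):
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=lambda k: scores[k])


# Toy usage: four frozen 2-d representations with binary labels.
frozen_features = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
labels = [0, 0, 1, 1]
W, b = train_linear_probe(frozen_features, labels, num_classes=2)
print([predict(W, b, x) for x in frozen_features])
```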
5.1 Data preprocessing and acquisition
We consider two different modes for collecting the data: (1) using a random agent (which steps through the environment by selecting actions randomly), and (2) using a PPO agent trained for 50M timesteps. For both modes, we ensure there is enough data diversity by collecting data using 8 differently initialized workers. We also add stochasticity to the pretrained PPO agent by using an $\epsilon$-greedy-like mechanism wherein at each timestep we take a random action with probability $\epsilon$; the same value of $\epsilon$ is used for all our experiments.
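The random-action mechanism for the pretrained agent can be sketched as below. The function name and signature are assumptions for illustration, and `epsilon` is left as a parameter because its value is not given in this section.

```python
import random


def epsilon_greedy_action(policy_action, num_actions, epsilon, rng=random):
    """Return a uniformly random action with probability epsilon,
    otherwise the action proposed by the (pretrained) policy."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return policy_action


# Toy usage: the policy always proposes action 3 out of 6 possible actions.
rng = random.Random(0)
actions = [epsilon_greedy_action(3, 6, epsilon=0.5, rng=rng) for _ in range(10)]
print(actions)
```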
In our evaluations, we compare the following methods:
- Randomly-initialized CNN encoder (random-cnn).
- Next-step pixel prediction model (pixel-pred) inspired by the "No-action Feedforward" model from prior work.
- Contrastive Predictive Coding (cpc), which maximizes the mutual information between current latents and latents at a future timestep.
- Supervised model, which learns the encoder and the linear probe using the labels. The gradients are backpropagated through the encoder in this case, so this provides a best-case performance bound.
All methods use the same base encoder architecture, a CNN taken from prior work but adapted for the full 160x210 Atari frame size. To ensure a fair comparison, we use a representation size of 256 for each method. As a sanity check, we include a blind majority classifier (maj-clf), which predicts label values based on the mode of the train set. More details are given in Appendix A.
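The blind majority baseline is simple enough to sketch directly. The class name and its `fit`/`predict` methods below are illustrative choices, not the authors' code.

```python
from collections import Counter


class MajorityClassifier:
    """Ignores the input and always predicts the mode of the training labels."""

    def fit(self, train_labels):
        # most_common(1) returns [(label, count)] for the most frequent label.
        self.mode_ = Counter(train_labels).most_common(1)[0][0]
        return self

    def predict(self, num_examples):
        return [self.mode_] * num_examples


# Toy usage: the value 7 is the most frequent training label.
clf = MajorityClassifier().fit([7, 7, 3, 7, 12])
print(clf.predict(4))
```

Any learned representation should beat this baseline on a probe for the same variable; otherwise the probe has not extracted information beyond the label prior.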
We train a different 256-way linear classifier for each state variable, with the representation under consideration as input (each RAM variable is a single byte and thus has 256 possible values, ranging from 0 to 255). We ensure the distribution of realizations of each state variable has high entropy by pruning any variable with entropy less than 0.6. We also ensure there are no duplicates between the train and test sets. We train each linear probe with 35,000 frames and use 5,000 and 10,000 frames for validation and test, respectively. We use early stopping and a learning rate scheduler based on plateaus in the validation loss.
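The entropy-based pruning can be sketched as follows. The text does not state the logarithm base or whether the entropy is normalized, so the natural-log Shannon entropy used here is an assumption, as are the function names and the toy labels.

```python
import math
from collections import Counter


def label_entropy(values):
    """Shannon entropy (natural log, an assumption) of the empirical
    distribution of one state variable's observed values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def prune_low_entropy(labels, threshold=0.6):
    """Keep only state variables whose entropy is at least the threshold."""
    return {
        name: vals
        for name, vals in labels.items()
        if label_entropy(vals) >= threshold
    }


# Toy usage with hypothetical variables.
labels = {
    "lives": [3] * 10,                   # constant, entropy 0
    "room": [0] * 9 + [1],               # rarely changes, low entropy
    "player_x": [10, 20, 30, 40] * 3,    # uniform over 4 values
}
print(sorted(prune_low_entropy(labels)))
```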
We report the F1 score averaged across all categories for each method and for each game in Table 2 for data collected by a random agent. In addition, we provide a breakdown of probe results in each category, such as small object localization or score/lives classification, in Table 3 for the random agent. We include the corresponding tables for data collected by a pretrained PPO agent in tables 6 and 7. The results in table 2 show that ST-DIM largely outperforms other methods in terms of mean F1 score. In general, contrastive methods (ST-DIM and CPC) seem to perform better than generative methods (VAE and PIXEL-PRED) at these probing tasks. We find that RandomCNN is a strong prior in Atari games, as has been observed before, possibly due to the inductive bias captured by the CNN architecture, which has also been observed empirically. We find similar trends to hold on results with data collected by a PPO agent. Despite contrastive methods performing well, there is still a sizable gap between ST-DIM and the fully supervised approach, leaving room for improvement from new unsupervised representation learning techniques for the benchmark.
We investigate two ablations of our ST-DIM model: Global-T-DIM, which only maximizes the mutual information between the global representations, and JSD-ST-DIM, which uses the NCE loss instead of the InfoNCE loss, which is equivalent to maximizing the Jensen-Shannon divergence between representations. We report results from these ablations in Figure 3. We see from the results that 1) the InfoNCE loss performs better than the JSD loss and 2) contrasting spatiotemporally (and not just temporally) is important across the board for capturing all categories of latent factors.
We found that ST-DIM has two main advantages which explain its superior performance over other methods and over its own ablations. It captures small objects much better than other methods, and it is more robust to the presence of easy-to-exploit features, which hurt other contrastive methods. Both of these advantages are due to ST-DIM maximizing the mutual information of patch representations.
Capturing small objects:
As we can see in Table 3, ST-DIM performs better at capturing small objects than other methods, especially generative models like VAE and pixel prediction methods. This is likely because generative models try to model every pixel, so they are not penalized much if they fail to model the few pixels that make up a small object. Similarly, ST-DIM holds this same advantage over Global-T-DIM (see Table 9), which is likely due to the fact that Global-T-DIM is not penalized if its global representation fails to capture features from some patches of the frame.
Robust to presence of easy-to-exploit features:
Representation learning with mutual information or contrastive losses often fails to capture all salient features if a few easy-to-learn features are sufficient to saturate the objective. This phenomenon has been linked to the looseness of mutual information lower bounds [52, 53] and to gradient starvation. We see the most prominent example of this phenomenon in Boxing. The observations in Boxing have a clock showing the time remaining in the round. A representation which encodes the shown time can perform near-perfect predictions without learning any other salient features in the observation. Table 4 shows that CPC, Global-T-DIM, and ST-DIM perform well at predicting the clock variable. However, only ST-DIM does well at encoding the other variables, such as the score and the position of the boxers.
We also observe that the best generative model (PIXEL-PRED) does not suffer from this problem. It performs its worst on high-entropy features such as the clock and player score (where ST-DIM excels), and does slightly better than ST-DIM on low-entropy features which have a large contribution in the pixel space such as player and enemy locations. This sheds light on the qualitative difference between contrastive and generative methods: contrastive methods prefer capturing high-entropy features (irrespective of contribution to pixel space) while generative methods do not, and generative methods prefer capturing large objects which have low entropy. This complementary nature suggests hybrid models as an exciting direction of future work.
We present a new representation learning technique which maximizes the mutual information of representations across spatial and temporal axes. We also propose a new benchmark for state representation learning based on the Atari 2600 suite of games to emphasize learning multiple generative factors. We demonstrate that the proposed method excels at capturing the underlying latent factors of a state even for small objects or when a large number of objects are present, which prove difficult for generative and other contrastive techniques, respectively. We have shown that our proposed benchmark can be used to study qualitative and quantitative differences between representation learning techniques, and hope that it will encourage more research in the problem of state representation learning.
We are grateful for the collaborative research environment provided by Mila and Microsoft Research. We thank Aaron Courville, Chris Pal, Remi Tachet, Eric Yuan, Chin-Wei Huang, Khimya Khetarpal and Tristan Deleu for helpful discussions and feedback during the course of this work. We would also like to thank the developers of PyTorch and Weights&Biases.
- Marr  David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., 1982. ISBN 0716715678.
- Gordon and Irwin  Robert D Gordon and David E Irwin. What’s in an object file? evidence from priming studies. Perception & Psychophysics, 58(8):1260–1277, 1996.
- Lake et al.  Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Amodei et al.  Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016.
- Wu et al.  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Silver et al.  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Gross  Charles G Gross. Learning, perception, and the brain, 1968.
- Dumoulin et al.  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. International Conference on Learning Representations (ICLR), 2017.
- Kingma and Welling  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.
- Dinh et al.  Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. International Conference on Learning Representations (ICLR), 2017.
- Pathak et al.  Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In
- Doersch and Zisserman  Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017.
- Kolesnikov et al.  Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
- Lesort et al.  Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 2018.
- Watter et al.  Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
- Higgins et al.  Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1480–1490. JMLR. org, 2017.
- Ha and Schmidhuber  David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018.
- Duan  Wuyang Duan. Learning state representations for robotic control: Information disentangling and multi-modal learning. Master’s thesis, Delft University of Technology, 2017.
- Oh et al.  Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.
- Finn et al.  Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016.
- Belghazi et al.  Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, pages 531–540, 2018. URL http://proceedings.mlr.press/v80/belghazi18a.html.
- van den Oord et al.  Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Hjelm et al.  R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.
- Veličković et al.  Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
- Bellemare et al.  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Alain and Bengio  Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. International Conference on Learning Representations (Workshop Track), 2017.
- Donsker and Varadhan  Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
- Chen et al.  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
- Barber and Agakov  David Barber and Felix Agakov. The im algorithm: A variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, pages 201–208, Cambridge, MA, USA, 2003. MIT Press. URL http://dl.acm.org/citation.cfm?id=2981345.2981371.
- Gutmann and Hyvärinen  Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
- Ma and Collins  Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812, 2018.
- Poole et al.  Ben Poole, Sherjil Ozair, Aäron Van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019.
- Jonschkowski and Brock  Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
- Jonschkowski et al.  Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller. Pves: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.
- Higgins et al.  Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
- van Hoof et al.  Herke van Hoof, Nutan Chen, Maximilian Karl, Patrick van der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3928–3934. IEEE, 2016.
- Thomas et al.  Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable factors. arXiv preprint arXiv:1708.01289, 2017.
- Cuccu et al.  Giuseppe Cuccu, Julian Togelius, and Philippe Cudré-Mauroux. Playing atari with six neurons. International Conference on Autonomous Agents and Multiagent Systems, 2019.
- Nachum et al.  Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. International Conference on Learning Representations (ICLR), 2019.
- Burgess et al.  Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
- Coates et al.  Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
- Xian et al.  Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2018. ISSN 1939-3539. doi: 10.1109/tpami.2018.2857768. URL http://dx.doi.org/10.1109/TPAMI.2018.2857768.
- Triantafillou et al.  Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens, 2017.
- Conneau and Kiela  Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
- Wang et al.  Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7.
- Friston  Karl Friston. A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences, 360(1456):815–836, 2005.
- Rao and Ballard  Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79, 1999.
- Bialek and Tishby  William Bialek and Naftali Tishby. Predictive information. arXiv preprint cond-mat/9902341, 1999.
- Palmer et al.  Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015.
- McAllester and Stratos  David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.
- Ozair et al.  Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre Sermanet. Wasserstein dependency measure for representation learning. arXiv preprint arXiv:1903.11780, 2019.
- Bachman et al.  Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
- Sohn  Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
- Sermanet et al.  Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.
- Hyvärinen et al.  Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.
- Locatello et al.  Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. International Conference on Machine Learning, 2019.
- Whalen and Taylor  Zach Whalen and Laurie N Taylor. Playing the past. History and Nostalgia in Video Games. Nashville, TN: Vanderbilt University Press, 2008.
- Engelhardt  Steve Engelhardt. BJARS.com Atari Archives. http://bjars.com, 2019. [Online; accessed 1-March-2019].
- Jentzsch and CPUWIZ  Thomas Jentzsch and CPUWIZ. Atariage atari 2600 forums, 2019. URL http://atariage.com/forums/forum/16-atari-2600/.
- Bengio  Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- Bengio et al.  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
- Eastwood and Williams  Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. International Conference on Learning Representations (ICLR), 2018.
- Caron et al.  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop. MIT Press, 2013.
- Burda et al.  Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. International Conference on Learning Representations (ICLR), 2019.
- Ulyanov et al.  Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
- Hyvärinen and Pajunen  Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
- Combes et al.  Remi Tachet des Combes, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and Yoshua Bengio. On the learning dynamics of deep neural networks. arXiv preprint arXiv:1809.06848, 2018.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Kansky et al.  Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1809–1818. JMLR. org, 2017.
- Zhang et al.  Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018.
Appendix A Architecture Details
Linear Probe:
The linear probe is a linear layer of width 256 with a softmax activation, trained with a cross-entropy loss.
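As a rough illustration of the probe described above, the following minimal sketch computes a linear map, a softmax, and the cross-entropy loss using only the Python standard library. The function names, the toy feature size of 4, and the 3 classes are assumptions made for readability; they are not taken from the paper's code.

```python
import math

def linear(x, weights, bias):
    """Affine map producing one logit per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

# Toy sizes (assumption): 4 input features and 3 classes.
features = [0.5, -1.0, 2.0, 0.0]
weights = [[0.0] * 4 for _ in range(3)]  # zero-initialized probe weights
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(features, weights, bias))
loss = cross_entropy(probs, label=1)
```

With zero-initialized weights the probe predicts a uniform distribution, so the loss equals the log of the number of classes. Training would then adjust only `weights` and `bias` while the encoder that produced `features` stays fixed.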
Majority Classifier (maj-clsf):
The majority classifier is parameterless: for each state variable, it takes the mode of the class distribution in the training set and predicts that mode for every example in the test set.
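The majority baseline can be sketched in a few lines. This is an illustrative version, not the authors' implementation; the state-variable names `player_x` and `lives` and the label values are hypothetical.

```python
from collections import Counter

def fit_majority(train_labels_by_variable):
    """Return the most frequent training label (the mode) per state variable."""
    return {name: Counter(labels).most_common(1)[0][0]
            for name, labels in train_labels_by_variable.items()}

def predict_majority(modes, n_examples):
    """Predict the stored mode for every test example."""
    return {name: [mode] * n_examples for name, mode in modes.items()}

# Hypothetical training labels for two state variables.
train_labels = {
    "player_x": [3, 1, 3, 2, 3, 1],
    "lives": [2, 2, 1, 2],
}
modes = fit_majority(train_labels)
predictions = predict_majority(modes, n_examples=4)
```

Because the prediction ignores the input frame entirely, any probe that scores no better than this baseline has extracted nothing useful from the representation.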
Random-CNN:
The Random-CNN is the base encoder with randomly initialized weights and no training.
VAE and Pixel-Pred:
The VAE and Pixel Prediction models use the base encoder, and each has an extra 256-wide fully connected layer: in the VAE it parameterizes the log variance, and in the Pixel Prediction model it makes the architecture more closely resemble the No Action Feed Forward model from . In addition, both models have a deconvolutional network as a decoder, which is the exact transpose of the base encoder in figure 4.
ST-DIM (and its ablations):
ST-DIM and the two ablations, JSD-ST-DIM and Global-T-DIM, all use the same architecture, which is the base encoder plus a xx bilinear layer.
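For readers unfamiliar with bilinear layers, the sketch below shows the generic operation: a pair of vectors x and y is scored as x^T W y for a learned matrix W. This appendix does not say how the layer enters the contrastive objective, so the example shows only the scoring function; the 2-dimensional inputs and the identity matrix are assumptions chosen for clarity.

```python
def bilinear_score(x, W, y):
    """Compute x^T W y for vectors x (length m), y (length n), and an m-by-n matrix W."""
    Wy = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in W]
    return sum(x_i * v for x_i, v in zip(x, Wy))

# Toy inputs (assumption): with W set to the identity, the score is a dot product.
x = [1.0, 2.0]
y = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
score = bilinear_score(x, identity, y)
```

A learned W generalizes the dot product by letting each feature of one vector interact with every feature of the other.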
Supervised:
The supervised model is our base encoder plus our linear probe, trained end-to-end with the ground truth labels.
PPO Features (section E):
The PPO model is our base encoder plus two linear layers for the policy and the value function, respectively.
Appendix B Preprocessing and Hyperparameters
We preprocess frames largely as described in , with the key difference that we use the full 160x210 images for all our experiments instead of downsampling to 84x84. Table 5 lists the hyperparameters we use across all games. For all our experiments, we use a learning rate scheduler based on plateaus in the validation loss (for both contrastive training and probing).
|Parameter|Value|
|---|---|
|Max-pool over last N action repeat frames|2|
|End of episode when life lost|Yes|
|No-Op action reset|Yes|
|Sequence Length (CPC)|100|
|Learning Rate (Training)|3e-4|
|Learning Rate (Probing, non supervised)|5e-2|
|Learning Rate (Probing, supervised)|3e-4|
|Encoder training steps|70000|
|Probe training steps|35000|
|Probe test steps|10000|
Table 5: Preprocessing steps and hyperparameters
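The plateau-based learning rate schedule mentioned above can be sketched as follows. The reduction factor, patience, and minimum learning rate are assumptions for illustration, since the text does not report them; only the initial rate of 3e-4 comes from the table. PyTorch offers `torch.optim.lr_scheduler.ReduceLROnPlateau` for the same purpose, though the text does not state which implementation was used.

```python
class PlateauScheduler:
    """Shrink the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # multiplicative decay (assumed value)
        self.patience = patience  # non-improving checks tolerated (assumed value)
        self.min_lr = min_lr      # lower bound on the rate (assumed value)
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_checks = 0
        return self.lr

scheduler = PlateauScheduler(lr=3e-4)
history = [scheduler.step(loss) for loss in [1.0, 0.8, 0.8, 0.81, 0.8, 0.7]]
```

In this toy run the rate is halved only after the loss fails to improve for more than `patience` consecutive checks.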
We run our experiments on an autoscaling cluster with multiple P100 and V100 GPUs. We use 8 cores per machine to distribute data collection across different workers.
Appendix C Results with Probes Trained on Data Collected By a Pretrained RL agent
In addition to evaluating on data collected by a random agent, we also evaluate the representation learning methods on data collected by a pretrained PPO agent. Specifically, we use a PPO agent trained for 50M steps on each game. We choose actions stochastically by sampling from the PPO agent’s action distribution at every time step, and inject additional stochasticity by using an ε-greedy mechanism with . Table 6 shows the game-by-game breakdown of mean F1 probe scores obtained by each method in this evaluation setting. Table 7 additionally shows the category-wise breakdown of results for each method. We observe a trend in performance similar to the one observed earlier with a random agent.
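The action-selection scheme described above, sampling from the policy's distribution with an extra ε-greedy step, might look like the sketch below. The value of `EPSILON`, the three-action policy, and the function name are placeholders chosen for illustration and should not be read as the settings used in the experiments.

```python
import random

def sample_action(action_probs, epsilon, rng):
    """Sample from the policy, but act uniformly at random with probability epsilon."""
    n_actions = len(action_probs)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return rng.choices(range(n_actions), weights=action_probs, k=1)[0]

rng = random.Random(0)
policy = [0.7, 0.2, 0.1]  # hypothetical action distribution from the agent
EPSILON = 0.1             # placeholder value (assumption)
actions = [sample_action(policy, EPSILON, rng) for _ in range(1000)]
```

Both sources of randomness widen the set of states visited, which matters when the collected frames serve as training data for representation learning.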
Appendix D More Detailed Ablation Results
Appendix E Probing Pretrained RL Agents
We make a first attempt at examining the features that RL agents learn. Specifically, we train linear probes on the representations from PPO agents that were trained for 50 million frames. The architecture of the PPO agent is described in section A. As table 10 shows, these features perform poorly on the probing tasks compared to the baselines. Kansky et al. and Zhang et al. have also argued that model-free agents have trouble encoding high-level state information. However, we note that these results are preliminary and require thorough investigation over different policies and models.