End-to-end learning of feature representations has led to advances in image classification Krizhevsky et al. (2012), generative modeling of images Goodfellow et al. (2014) and agents which outperform expert humans at game play Mnih et al. (2015); Silver et al. (2016). However, this training procedure induces task-specific representations, especially in the case of reinforcement learning, making it difficult to re-purpose the learned knowledge for future unseen tasks. On the other hand, humans explicitly learn notions of objects, relations, geometry and cardinality in a task-agnostic manner Spelke and Kinzler (2007) and re-purpose this knowledge to future tasks. Deep generative models aim to learn task-agnostic features which have been shown to be useful for tasks such as object classification and semantic segmentation. We argue that, despite their success on supervised learning tasks, such unsupervised learning methods have not found wide applicability in reinforcement learning and control because they were not designed for control as the downstream task.
For instance, there has been extensive research inspired by psychology and cognitive science on explicitly learning object-centric representations from pixels. Both instance and semantic segmentation have been approached using supervised Long et al. (2015); Pinheiro et al. (2016) and unsupervised learning Burgess et al. (2019); Greff et al. (2019); Ionescu et al. (2018); Grundmann et al. (2010); Ji et al. (2018); Li et al. (2018); Goel et al. (2018) methods. However, the representations learned by these methods do not explicitly encode fine-grained locations and orientations of object parts, and thus they have not been extensively used in the control and reinforcement learning literature. We argue that being able to precisely control objects and object parts is at the root of many complex sensorimotor behaviors.
In recent work, object keypoint or landmark discovery methods Zhang et al. (2018); Jakab et al. (2018) have been proposed to learn representations that precisely represent locations of object parts. These methods predict a set of Cartesian co-ordinates of keypoints denoting the salient locations of objects given image frame(s). However, as we will show, the existing methods struggle to accurately track keypoints under the variability in number, size, and motion of objects present in common RL domains.
We propose Transporter, a novel architecture to explicitly discover spatially, temporally and geometrically aligned keypoints given only videos. After training, each keypoint represents and tracks the co-ordinate of an object or object part even as it undergoes deformations (see fig. 1 for illustrations). As we will show, Transporter learns more accurate and more consistent keypoints on standard RL domains than existing methods. We will then showcase two ways in which the learned keypoints can be used for control and reinforcement learning. First, we show that using keypoints as inputs to policies instead of RGB observations leads to drastically more data efficient reinforcement learning on Atari games. Second, we show that by learning to control the Cartesian coordinates of the keypoints in the image plane we are able to learn skills or options Sutton et al. (1999) grounded in pixel observations, which is an important problem in reinforcement learning. We evaluate the learned skills by using them for exploration and show that they lead to much better exploration than primitive actions, especially on sparse reward tasks. Crucially, the learned skills are task-agnostic because they are learned without access to any rewards.
In summary, our key contributions are:
- Transporter learns state-of-the-art object keypoints across a variety of commonly used RL environments. Our proposed architecture is robust to varying number, size and motion of objects.
- Using learned keypoints as state input leads to policies that perform better than state-of-the-art model-free and model-based reinforcement learning methods on several Atari environments, while using only up to 100k environment interactions.
- Learning skills to manipulate the most controllable keypoints provides an efficient action space for exploration. We demonstrate drastic reductions in the search complexity for exploring challenging Atari environments. Surprisingly, our action space enables random agents to play several Atari games without rewards and any task-dependent learning.
2 Related Work
Our work is related to the recently proposed literature on unsupervised object keypoint discovery Zhang et al. (2018); Jakab et al. (2018). Most notably, Jakab et al. (2018) proposed an encoder-decoder architecture with differentiable bottlenecks in the intermediate layer. We reuse their bottleneck architecture but add a crucial new inductive bias, the feature transport mechanism, to constrain the representation to be more spatially aligned compared to all baselines. The approach in Zhang et al. (2018) discovers keypoints using single images and requires privileged information about temporal transformations between frames in the form of optical flow. This approach also requires multiple loss and regularization terms to converge. In contrast, our approach does not require access to these transformations and learns keypoints with a simple pixel-wise L2 loss function. Other approaches for learning object structure have similar limitations Thewlis et al. (2017); Shu et al. (2018); Suwajanakorn et al. (2018); Wiles et al. (2018). Deep generative models with structured bottlenecks have recently seen a lot of advances Chen et al. (2016); Kulkarni et al. (2015); Whitney et al. (2016); Xue et al. (2016); Higgins et al. (2017) but they do not explicitly reason about geometry.
Unsupervised learning of object keypoints has not been widely explored in the control literature, with the notable exception of Finn et al. (2016). However, this model uses a fully-connected layer for reconstruction and therefore can learn non-spatial latent embeddings similar to a baseline we consider Jakab et al. (2018). Moreover, similar to Zhang et al. (2018), their auto-encoder reconstructs single frames and hence does not learn to factorize geometry. Object-centric representations have also been studied in the context of intrinsic motivation, hierarchical reinforcement learning and exploration. However, existing approaches either require hand-crafted object representations Kulkarni et al. (2016) or have not been shown to capture fine-grained representations over long temporal horizons Ionescu et al. (2018).
In section 3.1 we first detail our model for unsupervised discovery of object keypoints from videos. Next, we describe two applications of the learned object keypoints to control: (1) data-efficient reinforcement learning (section 3.2.1), and (2) learning keypoint-based options for efficient exploration (section 3.2.2).
3.1 Feature Transport for learning Object Keypoints
Given an image, our objective is to extract 2-dimensional image locations, or keypoints, which correspond to locations of objects or object parts, without any manual labels for locations. We follow the formulation of Jakab et al. (2018) and assume access to frame pairs, a source frame and a target frame, collected from some trajectories such that the frames differ only in objects’ pose / geometry or appearance. The learning objective is then to reconstruct the target frame from the source frame. This is achieved by computing ConvNet (CNN) feature maps for both frames and extracting 2D keypoint locations by marginalising the keypoint-detector feature maps along the image dimensions (as proposed in Jakab et al. (2018)). A transported feature map is generated by suppressing both sets of keypoint locations in the source feature map and compositing in the feature maps around the keypoints from the target frame.
The heatmap image used for this compositing contains fixed-variance isotropic Gaussians around each of the keypoint locations. A final CNN with a small receptive field refines the transported feature map to regress the target frame. We use a pixel-wise squared-L2 reconstruction error for end-to-end learning.
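To make the keypoint extraction step concrete, the following is a minimal sketch of how a single 2D location can be read off one detector feature map by marginalising along each image dimension. The spatial softmax normalisation, the function name `keypoint_from_map` and the toy 5x5 map are assumptions made for illustration; they are not taken from the released implementation.

```python
import math


def keypoint_from_map(score_map):
    """Return the expected (row, col) location encoded by one detector map."""
    # Assumption: a softmax over all pixels turns raw scores into a
    # spatial probability distribution.
    peak = max(max(row) for row in score_map)
    weights = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in weights)
    probs = [[w / total for w in row] for row in weights]

    # Marginalise the distribution along each image dimension.
    row_marginal = [sum(row) for row in probs]
    col_marginal = [sum(col) for col in zip(*probs)]

    # The keypoint is the expected coordinate under each marginal.
    row = sum(i * p for i, p in enumerate(row_marginal))
    col = sum(j * p for j, p in enumerate(col_marginal))
    return row, col


# Toy detector map with one strongly activated pixel at row 1, column 3.
score_map = [[0.0] * 5 for _ in range(5)]
score_map[1][3] = 50.0
row, col = keypoint_from_map(score_map)
```

Because the location is an expectation rather than an argmax, it remains differentiable with respect to the detector scores, which is what allows the bottleneck to be trained end-to-end.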
In words, (i) the features in the source image at the target positions are replaced with the features from the target image – this is the transportation; and (ii) the features at the source position are set to zero. The refine net (which maps from the transported feature map to an image) then has two tasks: (i) to inpaint the missing features at the source position; and (ii) to clean up the image around the target positions. Refer to Figure 1 for a concise description of our method.
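The two steps described in words above can be sketched on toy single-channel feature maps. This is an illustrative reading under stated assumptions, not the reference implementation: suppression is written as multiplication by (1 - source heatmap)(1 - target heatmap), multiple Gaussians are combined with a pixel-wise max, and the names `gaussian_heatmap`, `transport` and the value of `sigma` are hypothetical.

```python
import math


def gaussian_heatmap(keypoints, height, width, sigma=1.0):
    """Render fixed-variance isotropic Gaussians around (row, col) keypoints."""
    heat = [[0.0] * width for _ in range(height)]
    for ky, kx in keypoints:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((y - ky) ** 2 + (x - kx) ** 2) / (2 * sigma ** 2))
                # Assumption: overlapping Gaussians are merged with a max.
                heat[y][x] = max(heat[y][x], g)
    return heat


def transport(feat_src, feat_tgt, heat_src, heat_tgt):
    """Suppress source features at both keypoint sets, paste in target features."""
    out = []
    for y in range(len(feat_src)):
        row = []
        for x in range(len(feat_src[0])):
            keep = (1.0 - heat_src[y][x]) * (1.0 - heat_tgt[y][x])
            row.append(keep * feat_src[y][x] + heat_tgt[y][x] * feat_tgt[y][x])
        out.append(row)
    return out


H = W = 5
feat_src = [[1.0] * W for _ in range(H)]  # source features are all 1
feat_tgt = [[2.0] * W for _ in range(H)]  # target features are all 2
heat_src = gaussian_heatmap([(1, 1)], H, W, sigma=0.5)
heat_tgt = gaussian_heatmap([(3, 3)], H, W, sigma=0.5)
moved = transport(feat_src, feat_tgt, heat_src, heat_tgt)
```

In this toy case the value at the source keypoint (1, 1) is driven to roughly zero, which is the hole the refine net has to inpaint, while the value at the target keypoint (3, 3) is taken from the target features. Locations far from both keypoints keep their source features.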
3.2 Object Keypoints for Control
Given learned keypoints, we want to use them within the context of control and exploration. Consider a Markov Decision Process (MDP) with visual observations as states, a set of actions and a transition function. We use a Transporter model which is pretrained in an unsupervised fashion without extrinsic rewards. The agents output actions and receive rewards as normal.
3.2.1 Data-efficient reinforcement learning
Our first hypothesis is that task-agnostic learning of object keypoints can enable fast learning of goal-directed policies. This is because once we learn keypoints, the control policy can be much simpler and does not have to relearn all visual features using temporal difference learning. In order to test this hypothesis, we use a variant of the neural fitted Q-learning framework Riedmiller (2005) with learned keypoints as input and a recurrent neural network Q function to output behaviors. The agent observes the keypoint positions and the feature maps only at the corresponding masked keypoint locations. We encode one-hot vectors to denote the positions of keypoints, together with their corresponding (keypoint-mask averaged) feature vectors at those locations. Transporter is trained by collecting data using a random policy and without any reward functions (see supplementary material for details). The Transporter network weights are fixed during behavior learning given environment rewards.
3.2.2 Keypoint-based options for efficient exploration
Our second hypothesis is that learned keypoints can enable significantly better task-independent exploration. Typically, raw actions are randomly sampled to bootstrap goal-directed policy learning. This exploration strategy is notoriously inefficient. We leverage the Transporter representation to learn a new action space. The actions are now skills grounded in the control of co-ordinate values of each keypoint. This idea has been explored in the reinforcement learning community Kulkarni et al. (2016); Ionescu et al. (2018) but it has been hard to learn spatial features with long temporal consistency. Here we show that Transporter is particularly amenable to this task. We show that randomly exploring in this space leads to significantly more rewards compared to raw actions. Our learned action space is agnostic to the control algorithm and hence other exploration algorithms Pathak et al. (2017); Ecoffet et al. (2019); Plappert et al. (2017) can also benefit from using it.
To do this, we define intrinsic reward functions using the keypoint locations, similar to Ionescu et al. (2018). Each reward function corresponds to how much a keypoint moves in one of the 4 spatial directions (up, down, left, right) between consecutive observations. We learn a set of Q functions, one to maximise each of these reward functions for each keypoint; the rewards correspond to increasing or decreasing the horizontal and vertical keypoint coordinates respectively. The Q functions are trained using n-step returns.
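A minimal sketch of such directional rewards for a single keypoint is given below. It assumes each reward is the signed displacement of one coordinate between consecutive observations; the function name and dictionary keys are hypothetical, and the keys refer to coordinates rather than to up or down because the image axis convention is not fixed here.

```python
def directional_rewards(kp_prev, kp_next):
    """Four intrinsic rewards for one keypoint, given (x, y) at t and t+1."""
    dx = kp_next[0] - kp_prev[0]
    dy = kp_next[1] - kp_prev[1]
    return {
        "x_increase": dx,   # rewarded when the x coordinate grows
        "x_decrease": -dx,  # rewarded when the x coordinate shrinks
        "y_increase": dy,   # rewarded when the y coordinate grows
        "y_decrease": -dy,  # rewarded when the y coordinate shrinks
    }


# A keypoint that moved from (10, 4) to (12, 3) between two observations.
rewards = directional_rewards((10.0, 4.0), (12.0, 3.0))
```

With K keypoints this yields 4K reward signals, each of which would be paired with its own Q function.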
During training, we randomly sample a particular Q function to act with and commit to this choice for a fixed number of timesteps before resampling. All Q functions are trained using experiences generated from all policies via a shared replay buffer. Randomly exploring in this Q space can already reduce the search space as compared to raw actions. We further reduce this search space by learning to predict the most controllable keypoint. For instance, in many Atari games there is an avatar that is directly controllable on the screen. We infer this abstraction via a fixed controllability policy that selects the single “most controllable” keypoint.
This procedure picks the keypoint for which one action leads to more prospective change in all spatial directions than for all other keypoints. Given this keypoint, we randomly sample one of its Q functions with a fixed temporal commitment as the random exploration policy. Consider a sequence of 100 actions with 18 choices before receiving rewards, which is typically the case in hard exploration Atari games (e.g. Montezuma’s Revenge). A random action agent would need to search in a space of 18^100 raw action sequences. In contrast, with 5 keypoints, 4 directions per keypoint and a fixed temporal commitment, the agent makes far fewer decisions over the same horizon, giving a large search space reduction. The search space reduces further when we explore with the most controllable keypoints. Since our learned action space is agnostic to the control mechanism, we evaluate it by randomly searching in this space versus raw actions. We measure extrinsically defined game score as the metric to evaluate the effectiveness of both search procedures.
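The size of the two search spaces can be worked through with a few lines of arithmetic. The commitment length of 20 timesteps used below is a hypothetical value chosen for illustration, as are the function names; only the horizon of 100, the 18 raw actions, the 5 keypoints and the 4 directions come from the text.

```python
def raw_action_sequences(num_actions=18, horizon=100):
    """Number of distinct raw action sequences over the horizon."""
    return num_actions ** horizon


def option_sequences(num_keypoints=5, num_directions=4, horizon=100,
                     commitment=20):
    """Number of distinct option sequences with a fixed temporal commitment.

    `commitment` is a hypothetical value, not one stated in the text.
    """
    decisions = horizon // commitment
    return (num_keypoints * num_directions) ** decisions


raw = raw_action_sequences()
opts = option_sequences()
reduction = raw // opts  # how many times smaller the option space is
```

The reduction comes almost entirely from the temporal commitment: the number of choices per decision is similar (20 options versus 18 actions), but only a handful of decisions are made over the same horizon.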
In section 4.1 we first evaluate the long-term tracking ability of our object keypoint detector. Next, in section 4.2 we evaluate the application of the keypoint detector on two control tasks – comparison against state-of-the-art model-based and model-free methods for data-efficient learning on Atari ALE games Bellemare et al. (2013) in section 4.2.1, and in section 4.2.2 examine efficient exploration by learning to control the discovered keypoints; we demonstrate reaching states otherwise unreachable through random explorations on raw-actions, and also recover the agent self as the most-controllable keypoint. For implementation details, please refer to the supplementary material.
We evaluate our method on Atari ALE Bellemare et al. (2013) and Manipulator Tassa et al. (2018) domains. We chose representative levels with large variations in the type and number of objects. (1) For evaluating long-term tracking of object keypoints (section 4.1) we use pong, frostbite, ms_pacman, and stack_4 (manipulator with blocks). (2) For data-efficient reinforcement learning (section 4.2.1) we train on diverse data collected using random exploration on the Atari games indicated in fig. 5. (3) For keypoint-based efficient exploration (section 4.2.2) we evaluate on one of the most difficult exploration games, montezuma_revenge, along with ms_pacman and seaquest.
A random policy executes actions and we collect a trajectory of images before the environment resets; details for data generation are presented in the supplementary material. We sample the source and target frames randomly within a temporal offset of 1 to 20 frames, corresponding to small or significant changes in the configuration between these two frames respectively. For Atari, ground-truth object locations are extracted from the emulated RAM using hand-crafted per-game rules, and for Manipulator they are extracted from the simulator geoms.
4.1 Evaluating Object Keypoint Predictions
We compare our method against state-of-the-art methods for unsupervised discovery of object landmarks: (1) Jakab et al. (2018) and (2) Zhang et al. (2018). For (1) we use exactly the same architecture as ours for the corresponding networks; for (2) we use the implementation released online by the authors, with a fixed image size. We train all the methods and pick the best model checkpoint based on a validation set.
We measure the precision and recall of the detected keypoint trajectories, varying their lengths from 1 to 200 frames (200 frames ≈ 13 seconds at 15 fps with action-repeat of 4) to evaluate the long-term consistency of the keypoint detections, which is crucial for control. The average Euclidean distance between each detected and ground-truth trajectory is computed. The time-steps where a ground-truth object is absent are ignored in the distance computation. Distances above a threshold are excluded as potential matches. One-to-one assignments between the trajectories are then computed using min-cost linear sum assignment, and the matches are used for reporting precision and recall.
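The evaluation protocol above can be sketched as follows. To stay dependency-free, this sketch replaces the linear sum assignment solver with a brute-force search over permutations, which is only feasible for a handful of trajectories; a library solver such as `scipy.optimize.linear_sum_assignment` would be the practical choice. The threshold value, the function names and the toy trajectories are assumptions.

```python
import math
from itertools import permutations

BIG = 1e9  # cost standing in for "excluded as a potential match"


def mean_distance(detected, truth):
    """Average Euclidean distance, skipping steps where truth is None."""
    dists = [math.hypot(d[0] - t[0], d[1] - t[1])
             for d, t in zip(detected, truth) if t is not None]
    return sum(dists) / len(dists) if dists else BIG


def precision_recall(detected, truths, threshold):
    """Match trajectories one-to-one at minimum total cost."""
    n = max(len(detected), len(truths))
    # Pad to a square matrix; padded or over-threshold entries cost BIG.
    cost = [[BIG] * n for _ in range(n)]
    for i, d in enumerate(detected):
        for j, t in enumerate(truths):
            c = mean_distance(d, t)
            if c <= threshold:
                cost[i][j] = c
    # Brute-force stand-in for a min-cost linear sum assignment solver.
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    matches = sum(1 for i in range(n) if cost[i][best[i]] < BIG)
    return matches / len(detected), matches / len(truths)


# Two ground-truth objects (the first is absent at the last step)
# and three detected keypoint trajectories, one of them spurious.
truths = [[(0, 0), (1, 0), None], [(5, 5), (5, 6), (5, 7)]]
detected = [[(5, 5), (5, 6), (5, 8)],
            [(0, 1), (1, 1), (9, 9)],
            [(20, 20), (20, 20), (20, 20)]]
precision, recall = precision_recall(detected, truths, threshold=2.0)
```

Here two of the three detected trajectories are matched to the two ground-truth objects, so precision is 2/3 and recall is 1. The large error of the second detection at the last step is ignored because the corresponding object is absent there.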
Figure 2 visualises the detections while fig. 4 presents precision and recall for varying trajectory lengths. Transporter consistently tracks the salient object keypoints over long time horizons and outperforms the baseline methods on all environments, with the notable exception of Jakab et al. (2018) on ms_pacman, where our method is slightly worse but achieves similar performance for long trajectories.
4.2 Using Keypoints for Control
4.2.1 Data-efficient Reinforcement Learning on Atari
We demonstrate that using the learned keypoints and corresponding features within a reinforcement learning context can lead to data-efficient learning in Atari games. Following Kaiser et al. (2019), we trained our Keypoint Q-Network (KeyQN) architecture for 100k interactions, which corresponds to 400k frames with an action repeat of 4. As shown in Figure 5, our approach is better than the state-of-the-art model-based SimPLe architecture Kaiser et al. (2019) and model-free Rainbow architecture Hessel et al. (2018) on four out of five games. Applying this approach to all Atari games will require training Transporter inside the reinforcement learning loop because pretraining keypoints on data from a random policy is insufficient for games where new objects or screens can appear. However, these experiments provide evidence that the right visual abstractions and simple control algorithms can produce highly data-efficient reinforcement learning algorithms.
Figure 6: (first row) We perform random actions in the environment for all methods (without reward) and record the mean and standard deviation of episodic returns across 4 billion frames. With the same frame budget, we simultaneously learn the most controllable keypoint and randomly explore in the space of its co-ordinates (to move it left, right, up, down). The options model becomes better with training (using only intrinsic rewards) and this leads to higher extrinsically defined episodic returns. Surprisingly, our learned options model is able to play several Atari games via random sampling of options. This is possible by learning skills to move the discovered game avatar as far as possible without dying. (second row) We measure the percentile episodic return reached for all methods. Our approach outperforms the baseline, both in terms of efficient and robust exploration of rare and rewarding states.
4.2.2 Efficient Exploration with Keypoints
How do we learn skills using object keypoints for efficient exploration? We use a distributed off-policy learner similar to Horgan et al. (2018) using 128 actors and 4 GPUs. The agent network follows the standard architecture of Mnih et al. (2015), with an LSTM with 256 hidden units which feeds into a linear layer whose number of units depends on the number of actions. Our Transporter model and all control policies are trained simultaneously. The data is generated by randomly sampling keypoints and coordinates, and then following the resulting policy for a fixed number of timesteps before resampling. We use a log-scale epsilon distribution for all policies (0.4 to 1e-4). During evaluation we use the controllability policy to select the keypoint to control and then randomly sample a coordinate at a fixed interval of timesteps. The quantitative results are shown in Figure 6. We also show qualitative results of the most controllable keypoint in Figure 3 and the supplementary material.
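One plausible reading of a log-scale epsilon distribution is a geometric spacing of exploration rates between the two stated endpoints. The sketch below assumes one epsilon per actor across the 128 actors; this per-actor assignment and the function name are assumptions, since the text only gives the range and the log scale.

```python
def log_scale_epsilons(num_actors=128, eps_max=0.4, eps_min=1e-4):
    """Geometrically spaced epsilons from eps_max down to eps_min."""
    ratio = eps_min / eps_max
    return [eps_max * ratio ** (i / (num_actors - 1))
            for i in range(num_actors)]


epsilons = log_scale_epsilons()
```

Under this spacing the logarithm of epsilon decreases linearly with the actor index, so many actors act almost greedily while a few keep exploring heavily.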
Our experiments clearly validate our hypothesis that using keypoints enables temporally extended exploration. As shown in Figure 6, our learned keypoint options consistently outperform the random actions baseline by a large margin. Encouragingly, our random options policy is able to play some Atari games by moving around the avatar (most controllable keypoint) in different parts of the state space without dying. For instance, the agent explores multiple rooms in Montezuma’s Revenge, a classical hard exploration environment in the reinforcement learning community. Similarly, our keypoint exploration learns to consistently move around the submarine in Seaquest and the avatar in Ms. Pacman. Most notably, this is achieved without rewards or (extrinsic) task-directed learning. Therefore our learned keypoints are stable enough to learn complex object-oriented skills in the Atari domain.
We demonstrate that it is possible to learn stable object keypoints across thousands of environment steps, without having access to task-specific reward functions. Therefore, object keypoints could provide a flexible and re-purposable representation for efficient control and reinforcement learning. Scaling keypoints to work reliably on richer datasets and environments is an important future area of research. Further, tracking objects over long temporal sequences can enable learning object dynamics and affordances which could be used to inform learning policies. A limitation of our model is that we do not currently handle moving backgrounds. Recent work Gordon et al. (2019) that explicitly reasons about camera / ego motion could be integrated to globally transport features between source and target frames. In summary, our experiments provide clear evidence that it is possible to learn visual abstractions and use simple algorithms to produce highly data efficient control policies and exploration procedures.
Acknowledgements. We thank Loic Matthey and Relja Arandjelović for valuable discussions and comments.
- Bellemare et al.  M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Burgess et al.  C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
- Chen et al.  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
- Ecoffet et al.  A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- Finn et al.  C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
- Goel et al.  V. Goel, J. Weng, and P. Poupart. Unsupervised video object segmentation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5683–5694, 2018.
- Goodfellow et al.  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Gordon et al.  A. Gordon, H. Li, R. Jonschkowski, and A. Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. arXiv preprint arXiv:1904.04998, 2019.
- Greff et al.  K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner. Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.
- Grundmann et al.  M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2141–2148. IEEE, 2010.
- Hessel et al.  M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Higgins et al.  I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
- Horgan et al.  D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver. Distributed prioritized experience replay. International Conference on Learning Representations (ICLR), 2018.
- Ioffe and Szegedy  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
- Ionescu et al.  C. Ionescu, T. Kulkarni, A. van den Oord, A. Mnih, and V. Mnih. Learning to Control Visual Abstractions for Structured Exploration in Deep Reinforcement Learning. In NeurIPS Deep Reinforcement Learning Workshop, 2018.
- Jakab et al.  T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems, 2018.
- Ji et al.  X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
- Kaiser et al.  L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Kulkarni et al.  T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
- Kulkarni et al.  T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
- Li et al.  S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6526–6535, 2018.
- Long et al.  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Pathak et al.  D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
- Pinheiro et al.  P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
- Plappert et al.  M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
- Riedmiller  M. Riedmiller. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
- Schulman et al.  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shu et al.  Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 650–665, 2018.
- Silver et al.  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- Spelke and Kinzler  E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007.
- Sutton et al.  R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Suwajanakorn et al.  S. Suwajanakorn, N. Snavely, J. J. Tompson, and M. Norouzi. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems, pages 2059–2070, 2018.
- Tassa et al.  Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
- Thewlis et al.  J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 5916–5925, 2017.
- Whitney et al.  W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.
- Wiles et al.  O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. arXiv preprint arXiv:1808.06882, 2018.
- Xue et al.  T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
- Zhang et al.  Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2694–2703, 2018.
Appendix A Implementation Details
for the rest with number of filters doubled after every two layers. The stride was set to 2 for layers 3 and 5 (1 for the rest). PointNet has a similar architecture but includes a final regressor to feature-maps corresponding to keypoints. 2D coordinates are extracted from these maps as described in Jakab et al. (2018). The architecture of RefineNet is the transpose of with bilinear-upsampling to undo striding. We specify for each environment but keep all other hyper-parameters of the network fixed across experiments. We used the Adam optimizer Kingma and Ba (2014) with a learning rate of (decayed by 0.95 every steps) and batch size of 64 across all experiments.
Appendix B Diverse Data Generation
To train the Transporter, a dataset of observation pairs is constructed from environment trajectories. It is important that this dataset contains a diverse range of situations, and unconditionally storing a pair from all trajectories generated by a random policy may contain many similar pairs. To mitigate this, we use a diverse data generation procedure as follows.
We generate trajectories of up to length 100 (action repeat is set to 4, so these trajectories represent up to 400 environment frames) using a uniform random policy, and uniformly sample one observation from the first half of the trajectory and one frame from the second half. Trajectories are shorter than 100 only when the end of an episode is reached. A buffer containing the maximum number of pairs we want to generate (in these experiments, 100k) is populated unconditionally from a number of environment actor threads until it is full. More frame pairs are generated, up to some defined maximum budget, and are conditionally written into the buffer as follows.
First, some number of indices of existing pairs are sampled from the buffer, and for each of them we compute the nearest neighbor by L2 distance to other elements of the buffer. We take the same number of newly generated frame pairs, and also compute their nearest neighbor in the buffer. For corresponding pairs of (existing frame pair, new frame pair) we overwrite the existing pair with the new pair whenever the new pair has a greater nearest neighbor distance, or otherwise with a fixed probability (decided by drawing a uniform random number). We continue this procedure until the maximum budget is reached, and write out the final buffer as our training set. Note that the reward function is not used at all in this procedure.
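One round of this conditional overwrite can be sketched as below. Frame pairs are represented as flat lists of numbers, the nearest neighbor search is a plain linear scan, and `accept_prob` is a hypothetical value for the random acceptance; none of these details are taken from the actual implementation, which performs the search on a GPU.

```python
import random


def l2(a, b):
    """Euclidean distance between two flattened frame pairs."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_neighbor_distance(item, buffer, skip_index=None):
    """Distance from item to its closest buffer element (optionally skipping one)."""
    return min(l2(item, other)
               for k, other in enumerate(buffer) if k != skip_index)


def diverse_update(buffer, new_items, rng, accept_prob=0.05):
    """Overwrite sampled buffer entries with new items that are more isolated.

    `accept_prob` is a hypothetical probability for the random overwrite.
    The buffer is modified in place and also returned.
    """
    indices = rng.sample(range(len(buffer)), len(new_items))
    for idx, new in zip(indices, new_items):
        old_nn = nearest_neighbor_distance(buffer[idx], buffer, skip_index=idx)
        new_nn = nearest_neighbor_distance(new, buffer)
        if new_nn > old_nn or rng.random() < accept_prob:
            buffer[idx] = new
    return buffer


# Four near-duplicate entries and one very different candidate.
buffer = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
diverse_update(buffer, [[5.0, 5.0]], random.Random(0), accept_prob=0.0)
```

With the random acceptance disabled, the distant candidate always replaces whichever near-duplicate was sampled, while a candidate identical to an existing entry is never written. Repeating such rounds pushes the buffer towards a more spread-out set of frame pairs.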
For efficiency, we store a lower resolution copy of the buffer (64x64, grayscale) on the GPU to perform efficient nearest neighbor calculations, keeping corresponding higher res (128x128 RGB) copies on CPU RAM. Using a single consumer GPU and a 56 core desktop machine, with many actor subprocesses, this approach can perform 10 million environment steps (40 million total frames) in approximately 1 hour.
Appendix C Videos
Videos visualising various aspects of the model are available at:
Appendix D Pixel Transport versus Feature Transport
We investigated whether learning features is as important as spatially transporting them between frames. As shown in Figure 7, transporting learned features significantly outperforms transporting pixels. Transporting pixels gives rise to ambiguous intermediate pixel representations, making it difficult for the final CNN decoder network to solve the downstream pixel prediction task. On the other hand, the features encode higher level information and the decoder network learns a more abstract function to solve the prediction problem.
Appendix E Temporal consistency of keypoints
Figures 8, 9, 10, and 11 show the inferred keypoints on frames selected from a single episode on Atari ALE Bellemare et al. (2013) (Pong, Frostbite and Ms. Pac-Man) and Manipulator Tassa et al. (2018) (stack_4) domains. The selected frames are each 10 time steps apart. The first frame has been explicitly chosen to ensure there is enough diversity in the shown frames. The colours are time consistent: a specific colour corresponds to the same keypoint throughout the episode. Thus, if a given game ‘object’ is assigned the same coloured keypoint throughout the episode, that keypoint is temporally consistent for that ‘object’.
Videos showing the inferred keypoints by the three methods for entire episodes can be accessed at: https://www.youtube.com/playlist?list=PL3LT3tVQRpbvGt5fgp_bKGvW23jF11Vi2
Appendix F Reconstructions
We visualise the reconstructed images on Atari ALE Bellemare et al. (2013) (Figures 12, 13, and 14) and Manipulator Tassa et al. (2018) domains (Figure 15) for randomly selected frames. The rows in the figures correspond respectively to our model, Jakab et al. (2018) and Zhang et al. (2018). The first two columns are the inputs given to the models. Whereas our model requires a pair of input frames (image and future_image), the remaining two models only require a single frame (future_image). The third column (reconstruction) shows the reconstructed target image. The final column (keypoints) shows the inferred keypoints for the given inputs.