Unsupervised Learning of Object Keypoints for Perception and Control

by Tejas Kulkarni, et al.

The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our method learns from raw video frames in a fully unsupervised manner, by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent similar methods. Furthermore, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards.




1 Introduction

End-to-end learning of feature representations has led to advances in image classification Krizhevsky et al. (2012), generative modeling of images Goodfellow et al. (2014) and agents which outperform expert humans at game play Mnih et al. (2015); Silver et al. (2016). However, this training procedure induces task-specific representations, especially in the case of reinforcement learning, making it difficult to re-purpose the learned knowledge for future unseen tasks. On the other hand, humans explicitly learn notions of objects, relations, geometry and cardinality in a task-agnostic manner Spelke and Kinzler (2007) and re-purpose this knowledge to future tasks. Deep generative models aim to learn task-agnostic features which have been shown to be useful for tasks such as object classification and semantic segmentation. We argue that, despite their success on supervised learning tasks, such unsupervised learning methods have not found wide applicability in reinforcement learning and control because they were not designed with control as the downstream task.

For instance, there has been extensive research inspired by psychology and cognitive science on explicitly learning object-centric representations from pixels. Both instance and semantic segmentation have been approached using supervised Long et al. (2015); Pinheiro et al. (2016) and unsupervised learning Burgess et al. (2019); Greff et al. (2019); Ionescu et al. (2018); Grundmann et al. (2010); Ji et al. (2018); Li et al. (2018); Goel et al. (2018) methods. However, the representations learned by these methods do not explicitly encode fine-grained locations and orientations of object parts, and thus they have not been extensively used in the control and reinforcement learning literature. We argue that being able to precisely control objects and object parts is at the root of many complex sensory motor behaviors.

In recent work, object keypoint or landmark discovery methods Zhang et al. (2018); Jakab et al. (2018) have been proposed to learn representations that precisely represent locations of object parts. These methods predict a set of Cartesian co-ordinates of keypoints denoting the salient locations of objects given image frame(s). However, as we will show, the existing methods struggle to accurately track keypoints under the variability in number, size, and motion of objects present in common RL domains.

Figure 1: Transporter. Our model leverages object motion to discover keypoints by learning to transform a source video frame ($x_s$) into another target frame ($x_t$) by transporting image features at the discovered object locations. During training, spatial feature maps $\Phi(x_s)$ and $\Phi(x_t)$ and keypoint co-ordinates $\Psi(x_s)$ and $\Psi(x_t)$ are predicted for both frames using a ConvNet and the fully-differentiable PointNet Jakab et al. (2018) respectively. The keypoint co-ordinates are transformed into Gaussian heatmaps $H_{\Psi(x_s)}$ and $H_{\Psi(x_t)}$ (same spatial dimensions as the feature maps). We perform two operations in the transport phase: (1) the features of the source frame are set to zero at both sets of keypoint locations, $H_{\Psi(x_s)}$ and $H_{\Psi(x_t)}$; (2) the features in the source image at the target positions are replaced with the corresponding features from the target frame, $H_{\Psi(x_t)} \cdot \Phi(x_t)$. The final refinement ConvNet (which maps from the transported feature map to an image) then has two tasks: (i) to inpaint the missing features at the source position; and (ii) to clean up the image around the target positions. During inference, keypoints can be extracted for a single frame via a feed-forward pass through the PointNet ($\Psi$).

We propose Transporter, a novel architecture to explicitly discover spatially, temporally and geometrically aligned keypoints given only videos. After training, each keypoint represents and tracks the co-ordinate of an object or object part even as it undergoes deformations (see fig. 1 for illustrations). As we will show, Transporter learns more accurate and more consistent keypoints on standard RL domains than existing methods. We will then showcase two ways in which the learned keypoints can be used for control and reinforcement learning. First, we show that using keypoints as inputs to policies instead of RGB observations leads to drastically more data efficient reinforcement learning on Atari games. Second, we show that by learning to control the Cartesian coordinates of the keypoints in the image plane we are able to learn skills or options Sutton et al. (1999) grounded in pixel observations, which is an important problem in reinforcement learning. We evaluate the learned skills by using them for exploration and show that they lead to much better exploration than primitive actions, especially on sparse reward tasks. Crucially, the learned skills are task-agnostic because they are learned without access to any rewards.

Figure 2: Keypoint visualisation. Visualisations from our method and state-of-the-art unsupervised object keypoint discovery methods, Jakab et al. (2018) and Zhang et al. (2018), on Atari ALE Bellemare et al. (2013) and Manipulator Tassa et al. (2018) domains. Our method learns more spatially aligned keypoints, e.g. frostbite and stack_4 (see section 4.1). Quantitative evaluations are given in fig. 4 and further visualisations in the supplementary material.

In summary, our key contributions are:

  • Transporter learns state of the art object keypoints across a variety of commonly used RL environments. Our proposed architecture is robust to varying number, size and motion of objects.

  • Using learned keypoints as state input leads to policies that perform better than state-of-the-art model-free and model-based reinforcement learning methods on several Atari environments, while using only up to 100k environment interactions.

  • Learning skills to manipulate the most controllable keypoints provides an efficient action space for exploration. We demonstrate drastic reductions in the search complexity for exploring challenging Atari environments. Surprisingly, our action space enables random agents to play several Atari games without rewards and any task-dependent learning.

2 Related Work

Our work is related to the recently proposed literature on unsupervised object keypoint discovery Zhang et al. (2018); Jakab et al. (2018). Most notably, Jakab et al. (2018) proposed an encoder-decoder architecture with differentiable bottlenecks in the intermediate layer. We reuse their bottleneck architecture but add a crucial new inductive bias – the feature transport mechanism – to constrain the representation to be more spatially aligned compared to all baselines. The approach in Zhang et al. (2018) discovers keypoints using single images and requires privileged information about temporal transformations between frames in the form of optical flow. This approach also requires multiple loss and regularization terms to converge. In contrast, our approach does not require access to these transformations and learns keypoints with a simple pixel-wise L2 loss function. Other approaches for learning object structure have similar limitations Thewlis et al. (2017); Shu et al. (2018); Suwajanakorn et al. (2018); Wiles et al. (2018). Deep generative models with structured bottlenecks have recently seen a lot of advances Chen et al. (2016); Kulkarni et al. (2015); Whitney et al. (2016); Xue et al. (2016); Higgins et al. (2017) but they do not explicitly reason about geometry.

Unsupervised learning of object keypoints has not been widely explored in the control literature, with the notable exception of Finn et al. (2016). However, this model uses a fully-connected layer for reconstruction and therefore can learn non-spatial latent embeddings, similar to a baseline we consider Jakab et al. (2018). Moreover, similar to Zhang et al. (2018), their auto-encoder reconstructs single frames and hence does not learn to factorize geometry. Object-centric representations have also been studied in the context of intrinsic motivation, hierarchical reinforcement learning and exploration. However, existing approaches either require hand-crafted object representations Kulkarni et al. (2016) or have not been shown to capture fine-grained representations over long temporal horizons Ionescu et al. (2018).

3 Method

In section 3.1 we first detail our model for unsupervised discovery of object keypoints from videos. Next, we describe two applications of the learned object keypoints to control: (1) data-efficient reinforcement learning (section 3.2.1), and (2) learning keypoint-based options for efficient exploration (section 3.2.2).

3.1 Feature Transport for learning Object Keypoints

Given an image $x$, our objective is to extract $K$ 2-dimensional image locations or keypoints, $\Psi(x) \in \mathbb{R}^{K \times 2}$, which correspond to locations of objects or object-parts without any manual labels for locations. We follow the formulation of Jakab et al. (2018) and assume access to frame pairs $(x_s, x_t)$ collected from some trajectories such that the frames differ only in objects’ pose / geometry or appearance. The learning objective then is to reconstruct the second frame $x_t$ from the first $x_s$. This is achieved by computing ConvNet (CNN) feature maps $\Phi(x_s)$ and $\Phi(x_t)$, and extracting $K$ 2D locations $\Psi(x_s)$ and $\Psi(x_t)$ by marginalising the keypoint-detector feature-maps along the image dimensions (as proposed in Jakab et al. (2018)). A transported feature map $\hat{\Phi}(x_s, x_t)$ is generated by suppressing both sets of keypoint locations in $\Phi(x_s)$ and compositing in the feature maps around the keypoints from $x_t$:

$$\hat{\Phi}(x_s, x_t) = (1 - H_{\Psi(x_s)}) \cdot (1 - H_{\Psi(x_t)}) \cdot \Phi(x_s) + H_{\Psi(x_t)} \cdot \Phi(x_t)$$

where $H_{\Psi}$ is a heatmap image containing fixed-variance isotropic Gaussians around each of the $K$ points specified by $\Psi$. A final CNN with small receptive field refines the transported reconstruction $\hat{\Phi}(x_s, x_t)$ to regress the target frame $x_t$. We use pixel-wise squared-$\ell_2$ reconstruction error for end-to-end learning.

In words, (i) the features in the source image at the target positions are replaced with the features from the target image – this is the transportation; and (ii) the features at the source position are set to zero. The refine net (which maps from the transported feature map to an image) then has two tasks: (i) to inpaint the missing features at the source position; and (ii) to clean up the image around the target positions. Refer to Figure 1 for a concise description of our method.
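The transport step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `[-1, 1]` coordinate convention, the value of `sigma`, and the choice to merge the per-keypoint Gaussians with a pixel-wise maximum are all hypothetical.

```python
import numpy as np

def gaussian_heatmaps(keypoints, height, width, sigma=0.1):
    """Render one fixed-variance isotropic Gaussian per keypoint.

    keypoints: (K, 2) array of (x, y) in [-1, 1] (assumed convention).
    Returns a (K, height, width) array with values in (0, 1].
    """
    ys = np.linspace(-1.0, 1.0, height)[None, :, None]  # (1, H, 1)
    xs = np.linspace(-1.0, 1.0, width)[None, None, :]   # (1, 1, W)
    kx = keypoints[:, 0][:, None, None]                 # (K, 1, 1)
    ky = keypoints[:, 1][:, None, None]
    sq_dist = (xs - kx) ** 2 + (ys - ky) ** 2           # (K, H, W)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def transport(src_feat, tgt_feat, src_kp, tgt_kp, sigma=0.1):
    """Suppress both keypoint sets in the source features, then
    composite in the target features around the target keypoints.

    src_feat, tgt_feat: (H, W, D) feature maps.
    src_kp, tgt_kp: (K, 2) keypoint co-ordinates.
    """
    h, w, _ = src_feat.shape
    # Assumption: the K Gaussians are merged into one mask by a max.
    src_mask = gaussian_heatmaps(src_kp, h, w, sigma).max(axis=0)[..., None]
    tgt_mask = gaussian_heatmaps(tgt_kp, h, w, sigma).max(axis=0)[..., None]
    return (1.0 - src_mask) * (1.0 - tgt_mask) * src_feat + tgt_mask * tgt_feat
```

Away from every keypoint both masks are close to zero, so the source features pass through unchanged; at a source keypoint the output is close to zero (the region the refinement network must inpaint); at a target keypoint the output is taken from the target features.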

Note that, unlike Jakab et al. (2018), who regress the target frame from stacked target keypoint heatmaps and source image features, we enforce explicit spatial transport for stronger correlation with image locations, leading to more robust long-term tracking (section 4.1).

Figure 3: Temporal Consistency of Keypoints. Our learned keypoints are temporally consistent across hundreds of environment steps, as demonstrated in this classical hard exploration game called montezuma’s revenge Bellemare et al. (2013). Additionally, we also predict the most controllable keypoint denoted by the triangular markers, without using any environment rewards. This prediction often corresponds to the avatar in the game and it is consistently tracked across different parts of the state space. See section 4.2.2 for further discussion.

3.2 Object Keypoints for Control

Given learned keypoints, we want to use them within the context of control and exploration. Consider a Markov Decision Process (MDP) with visual observations as states, actions, and a transition function. We use a Transporter model which is pretrained in an unsupervised fashion without extrinsic rewards. The agents output actions and receive rewards as normal.

3.2.1 Data-efficient reinforcement learning

Our first hypothesis is that task-agnostic learning of object keypoints can enable fast learning of goal-directed policies. This is because once we learn keypoints, the control policy can be much simpler and does not have to relearn all visual features using temporal difference learning. In order to test this hypothesis, we use a variant of the neural fitted Q-learning framework Riedmiller (2005) with learned keypoints as input and a recurrent neural network Q function to output behaviors. The agent observes the image features only at the corresponding masked keypoint locations. We encode one-hot vectors to denote the positions of keypoints, together with their corresponding (keypoint mask averaged) feature vectors at those locations.
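One possible reading of this input encoding is sketched below. The text does not specify the exact layout, so the choices here are assumptions: the one-hot vector is taken over the flattened feature-map grid at the peak of each keypoint heatmap, the feature vector is the heatmap-weighted average of the feature map, and the name `keypoint_inputs` is hypothetical.

```python
import numpy as np

def keypoint_inputs(features, heatmaps):
    """Build per-keypoint policy inputs.

    features: (H, W, D) feature map from the (frozen) feature extractor.
    heatmaps: (K, H, W) non-negative keypoint masks.
    Returns a (K, H*W + D) array: one-hot position, then pooled features.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, h * w).astype(float)
    # One-hot position: the grid cell where each mask peaks (assumed encoding).
    one_hot = np.zeros((k, h * w))
    one_hot[np.arange(k), flat.argmax(axis=1)] = 1.0
    # Mask-averaged features: weights sum to one for each keypoint.
    weights = flat / flat.sum(axis=1, keepdims=True)
    pooled = weights @ features.reshape(h * w, -1)  # (K, D)
    return np.concatenate([one_hot, pooled], axis=1)
```

The resulting `(K, H*W + D)` array could then be flattened or fed keypoint by keypoint to a recurrent Q network; which of these the original agent does is not stated here.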

Transporter is trained by collecting data using a random policy and without any reward functions (see supplementary material for details). The Transporter network weights are fixed during behavior learning given environment rewards.

3.2.2 Keypoint-based options for efficient exploration

Our second hypothesis is that learned keypoints can enable significantly better task-independent exploration. Typically, raw actions are randomly sampled to bootstrap goal-directed policy learning. This exploration strategy is notoriously inefficient. We leverage the Transporter representation to learn a new action space. The actions are now skills grounded in the control of co-ordinate values of each keypoint. This idea has been explored in the reinforcement learning community Kulkarni et al. (2016); Ionescu et al. (2018) but it has been hard to learn spatial features with long temporal consistency. Here we show that Transporter is particularly amenable to this task. We show that randomly exploring in this space leads to significantly more rewards compared to raw actions. Our learned action space is agnostic to the control algorithm and hence other exploration algorithms Pathak et al. (2017); Ecoffet et al. (2019); Plappert et al. (2017) can also benefit from using it.

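To do this, we define intrinsic reward functions using the keypoint locations, similar to Ionescu et al. (2018). Each reward function corresponds to how much each keypoint moves in the 4 spatial directions between consecutive observations (up, down, left, right). Writing $\Psi^i_x(x_t)$ and $\Psi^i_y(x_t)$ for the $x$ and $y$ co-ordinates of keypoint $i$ in observation $x_t$, we learn a set of Q functions $\{Q_{i,j} \mid i \in \{1, \dots, K\},\ j \in \{1, 2, 3, 4\}\}$ to maximise each of the following reward functions: $r_{i,1} = \Psi^i_x(x_{t+1}) - \Psi^i_x(x_t)$, $r_{i,2} = \Psi^i_x(x_t) - \Psi^i_x(x_{t+1})$, $r_{i,3} = \Psi^i_y(x_{t+1}) - \Psi^i_y(x_t)$, $r_{i,4} = \Psi^i_y(x_t) - \Psi^i_y(x_{t+1})$. These functions correspond to increasing/decreasing the $x$ and $y$ coordinates respectively. The Q functions are trained using n-step Q($\lambda$).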
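As a small illustration of these intrinsic rewards, the sketch below computes, for every keypoint, the signed displacement along each of the four directions between two consecutive observations. The function name and the column ordering (increase x, decrease x, increase y, decrease y) are assumptions; which column means "up" depends on the image coordinate convention.

```python
import numpy as np

def directional_rewards(kp_prev, kp_next):
    """Intrinsic rewards from keypoint motion between consecutive frames.

    kp_prev, kp_next: (K, 2) arrays of (x, y) keypoint co-ordinates.
    Returns a (K, 4) array with columns (+x, -x, +y, -y).
    """
    delta = kp_next - kp_prev
    dx, dy = delta[:, 0], delta[:, 1]
    return np.stack([dx, -dx, dy, -dy], axis=1)
```

Each column is the reward signal for one Q function, so a policy trained on column 0 of keypoint `i` learns to push that keypoint towards larger x.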

During training, we randomly sample a particular Q function to act with and commit to this choice for $T$ timesteps before resampling. All Q functions are trained using experiences generated from all policies via a shared replay buffer. Randomly exploring in this Q space can already reduce the search space as compared to raw actions. We further reduce this search space by learning to predict the most controllable keypoint. For instance, in many Atari games there is an avatar that is directly controllable on the screen. We infer this abstraction via a fixed controllability policy that selects the single “most controllable” keypoint: $\arg\max_i \frac{1}{4} \sum_{j=1}^{4} \left[ \max_a Q_{i,j}(s, a) - \min_a Q_{i,j}(s, a) \right]$.
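A controllability score of this kind can be sketched as follows. The scoring rule used here is an assumption made for illustration: for each keypoint, take the gap between the best and worst action value of each of its four directional Q functions and average the four gaps, then pick the keypoint with the largest average. The function name and array layout are hypothetical.

```python
import numpy as np

def most_controllable_keypoint(q_values):
    """Select the keypoint whose motion depends most on the chosen action.

    q_values: (K, 4, A) array, one row of action values per keypoint
              and spatial direction, evaluated at the current state.
    Returns the index of the selected keypoint.
    """
    # Action gap per keypoint and direction: best minus worst action value.
    gap = q_values.max(axis=2) - q_values.min(axis=2)  # (K, 4)
    # Average over the four directions, then take the largest.
    return int(gap.mean(axis=1).argmax())
```

A keypoint on a static background object has nearly identical values for every action, so its gap is close to zero, while the avatar's keypoint has a large gap.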

This procedure picks keypoints where one action leads to more prospective change in all spatial directions than all other keypoints. Given this keypoint, we randomly sample one of its Q functions with a fixed temporal commitment as the random exploration policy. Consider a sequence of 100 actions with 18 choices before receiving rewards, which is typically the case in hard exploration Atari games (e.g. montezuma’s revenge). A random action agent would need to search in the space of $18^{100}$ raw action sequences. However, an agent that observes 5 keypoints, each with 4 directional Q functions, and commits to each choice for $T$ timesteps only has $(5 \times 4)^{100/T}$ sequences to search, giving a search space reduction by a factor of $18^{100} / 20^{100/T}$. The search space reduces further when we explore with the most controllable keypoints. Since our learned action space is agnostic to the control mechanism, we evaluate it by randomly searching in this space versus raw actions. We measure the extrinsically defined game score as the metric to evaluate the effectiveness of both search procedures.
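The size of this reduction is easiest to see in orders of magnitude. The sketch below works in log10 to avoid huge integers; the commitment length of 20 timesteps is an assumed value chosen only for illustration, and the variable names are hypothetical.

```python
import math

def log10_search_space(num_choices, num_decisions):
    """log10 of the number of distinct sequences of independent choices."""
    return num_decisions * math.log10(num_choices)

horizon = 100        # actions before any reward is received
raw_actions = 18     # primitive action choices per step
keypoints = 5        # observed keypoints
directions = 4       # directional Q functions per keypoint
commitment = 20      # ASSUMED commitment length in timesteps

raw = log10_search_space(raw_actions, horizon)
options = log10_search_space(keypoints * directions, horizon // commitment)

print(f"raw actions: about 10^{raw:.1f} sequences")
print(f"options:     about 10^{options:.1f} sequences")
print(f"reduction:   about {raw - options:.0f} orders of magnitude")
```

Under this assumption the raw action space has about 10^125.5 sequences and the option space about 10^6.5, a reduction of roughly 119 orders of magnitude. A shorter commitment gives a smaller but still very large reduction.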

Figure 4: Long-term tracking evaluation. We compare the long-term tracking ability of our keypoint detector against Jakab et al. (2018) and Zhang et al. (2018) (visualisations in fig. 2 and supplementary material). We report precision and recall for trajectories of varying lengths (1–200 frames; each frame corresponds to 4 action repeats) against ground-truth keypoints on Atari ALE Bellemare et al. (2013) and Manipulator Tassa et al. (2018) domains. Our method significantly outperforms the baselines on all games, except for ms_pacman where we perform similarly, especially for long trajectories. See section 4.1 for further discussion.

4 Experiments

In section 4.1 we first evaluate the long-term tracking ability of our object keypoint detector. Next, in section 4.2 we evaluate the application of the keypoint detector on two control tasks – comparison against state-of-the-art model-based and model-free methods for data-efficient learning on Atari ALE games Bellemare et al. (2013) in section 4.2.1, and in section 4.2.2 examine efficient exploration by learning to control the discovered keypoints; we demonstrate reaching states otherwise unreachable through random explorations on raw-actions, and also recover the agent self as the most-controllable keypoint. For implementation details, please refer to the supplementary material.

Figure: Agent architecture for data-efficient reinforcement learning. Transporter is trained off-line with data collected using a random policy. A recurrent variant of the neural-fitted Q-learning algorithm Riedmiller (2005) rapidly learns control policies using keypoint co-ordinates and features at the corresponding locations given game rewards.

Game         KeyQN (ours)  SimPLe  Rainbow  PPO (500k)    Human  Random
breakout             19.3    12.7      3.3        66.1     31.8     1.7
frostbite           388.3   254.7    120.1       214.0   4334.7    65.2
ms_pacman           999.4   762.8    364.3       306.0  15693.0   307.3
pong                 10.8     5.2    -19.5        -8.6      9.3   -20.7
seaquest            236.7   370.9    206.3       692.0  20182.0   -20.7

Figure 5: Atari Mean Scores. Mean scores obtained by our method in comparison with Rainbow Hessel et al. (2018) and SimPLe Kaiser et al. (2019) trained on 100K steps (400K frames), and PPO Schulman et al. (2017) trained on 500K steps (2 million frames). See Section 4.2.1 for details. Numbers (except for KeyQN) taken from Kaiser et al. (2019).

We evaluate our method on Atari ALE Bellemare et al. (2013) and Manipulator Tassa et al. (2018) domains. We chose representative levels with large variations in the type and number of objects. (1) For evaluating long-term tracking of object keypoints (section 4.1) we use pong, frostbite, ms_pacman, and stack_4 (manipulator with blocks). (2) For data-efficient reinforcement learning (section 4.2.1) we train on diverse data collected using random exploration on the Atari games indicated in fig. 5. (3) For keypoint-based efficient exploration (section 4.2.2) we evaluate on one of the most difficult exploration games, montezuma’s revenge, along with ms_pacman and seaquest.

A random policy executes actions and we collect a trajectory of images before the environment resets; details for data generation are presented in the supplementary material. We sample the source and target frames randomly within a temporal offset of 1 to 20 frames, corresponding to small or significant changes in the configuration between these two frames respectively. For Atari, ground-truth object locations are extracted from the emulated RAM using hand-crafted per-game rules, and for Manipulator they are extracted from the simulator geoms.

4.1 Evaluating Object Keypoint Predictions


We compare our method against state-of-the-art methods for unsupervised discovery of object landmarks – (1) Jakab et al. Jakab et al. (2018) and (2) Zhang et al. Zhang et al. (2018). For (1) we use exactly the same architecture for and as ours; for (2) we use the implementation released online by the authors where the image-size is set to pixels. We train all the methods for optimization steps and pick the best model checkpoint based on a validation set.


We measure the precision and recall of the detected keypoint trajectories, varying their lengths from 1 to 200 frames (200 frames ≈ 13 seconds at 15 fps with action-repeat of 4), to evaluate the long-term consistency of the keypoint detections that is crucial for control. The average Euclidean distance between each detected and ground-truth trajectory is computed. The time-steps where a ground-truth object is absent are ignored in the distance computation. Pairs whose distance exceeds a fixed threshold are excluded as potential matches. One-to-one assignments between the trajectories are then computed using min-cost linear sum assignment, and the matches are used for reporting precision and recall.
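The matching procedure can be sketched with SciPy's `linear_sum_assignment`. This is an illustrative reading rather than the evaluation code used for the reported numbers: the function name, the array layout, the large-cost trick for excluding distant pairs, and the choice of denominators (all detected trajectories for precision, all ground-truth trajectories for recall) are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_trajectories(detected, ground_truth, present, threshold):
    """Precision and recall of detected keypoint trajectories.

    detected:     (N, T, 2) detected keypoint trajectories.
    ground_truth: (M, T, 2) ground-truth object trajectories.
    present:      (M, T) boolean, False where the object is absent.
    threshold:    maximum mean distance for a valid match.
    """
    n, m = detected.shape[0], ground_truth.shape[0]
    # Per-step Euclidean distance for every (detected, ground-truth) pair.
    dist = np.linalg.norm(detected[:, None] - ground_truth[None], axis=-1)
    weights = present[None].astype(float)               # (1, M, T)
    steps = np.maximum(weights.sum(axis=-1), 1.0)       # avoid divide by zero
    mean_dist = (dist * weights).sum(axis=-1) / steps   # (N, M)
    # Objects that never appear cannot be matched.
    mean_dist = np.where(present.any(axis=-1)[None], mean_dist, np.inf)
    # Pairs above the threshold get a prohibitive cost.
    big = 1e6
    cost = np.where(mean_dist <= threshold, mean_dist, big)
    rows, cols = linear_sum_assignment(cost)
    matches = int((cost[rows, cols] < big).sum())
    return matches / n, matches / m
```

Because every excluded pair costs far more than any valid pair, the min-cost assignment first maximises the number of valid matches and then minimises their total distance.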


Figure 2 visualises the detections while fig. 4 presents precision and recall for varying trajectory lengths. Transporter consistently tracks the salient object keypoints over long time horizons and outperforms the baseline methods on all environments, with the notable exception of Jakab et al. (2018) on pacman where our method is slightly worse but achieves similar performance for long-trajectories.

4.2 Using Keypoints for Control

4.2.1 Data-efficient Reinforcement Learning on Atari

We demonstrate that using the learned keypoints and corresponding features within a reinforcement learning context can lead to data-efficient learning in Atari games. Following Kaiser et al. (2019), we trained our Keypoint Q-Network (KeyQN) architecture for 100k interactions, which corresponds to 400k frames. As shown in Figure 5, our approach is better than the state-of-the-art model-based SimPLe architecture Kaiser et al. (2019) and model-free Rainbow architecture Hessel et al. (2018) on four out of five games. Applying this approach to all Atari games will require training Transporter inside the reinforcement learning loop because pretraining keypoints on data from a random policy is insufficient for games where new objects or screens can appear. However, these experiments provide evidence that the right visual abstractions and simple control algorithms can produce highly data-efficient reinforcement learning algorithms.

Figure 6: Exploration using random actions versus random (most controllable) keypoint options/skills. (first row) We perform random actions in the environment for all methods (without reward) and record the mean and standard deviation of episodic returns across 4 billion frames. With the same frame budget, we simultaneously learn the most controllable keypoint and randomly explore in the space of its co-ordinates (to move it left, right, up, down). The options model becomes better with training (using only intrinsic rewards) and this leads to higher extrinsically defined episodic returns. Surprisingly, our learned options model is able to play several Atari games via random sampling of options. This is possible by learning skills to move the discovered game avatar as far as possible without dying. (second row) We measure the percentile episodic return reached for all methods. Our approach outperforms the baseline, both in terms of efficient and robust exploration of rare and rewarding states.

4.2.2 Efficient Exploration with Keypoints

How do we learn skills using object keypoints for efficient exploration? We use a distributed off-policy learner similar to Horgan et al. (2018) using 128 actors and 4 GPUs. The agent network is a standard convolutional network Mnih et al. (2015) followed by an LSTM with 256 hidden units, which feeds into a linear layer with $K \times 4 \times a$ units, where $K$ is the number of keypoints and $a$ is the number of actions. Our Transporter model and all control policies are trained simultaneously. The data is generated by randomly sampling keypoints and coordinates, and then following the resulting policy for $T$ timesteps before resampling. We use a log-scale epsilon distribution for all policies (0.4 to 1e-4). During evaluation we use the fixed controllability policy to select the keypoint to control and then randomly sample a coordinate every $T$ timesteps. The quantitative results are shown in Figure 6. We also show qualitative results of the most controllable keypoint in Figure 3 and the supplementary material.
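Two pieces of this setup are simple enough to sketch: the per-actor exploration rates and the resampling of options with a fixed commitment. The sketch assumes that "log-scale" means evenly spaced in log space across the 128 actors; the function name and the commitment length used in the example are hypothetical.

```python
import numpy as np

# One epsilon per actor, evenly spaced in log space from 0.4 down to 1e-4
# (assumed interpretation of a log-scale epsilon distribution).
num_actors = 128
epsilons = np.geomspace(0.4, 1e-4, num=num_actors)

def sample_option_schedule(rng, num_keypoints, episode_len, commitment):
    """Sample a (keypoint, direction) option for every timestep.

    A new option is drawn uniformly at random every `commitment` steps
    and held fixed in between.
    """
    schedule = []
    option = None
    for t in range(episode_len):
        if t % commitment == 0:
            option = (int(rng.integers(num_keypoints)), int(rng.integers(4)))
        schedule.append(option)
    return schedule

rng = np.random.default_rng(0)
schedule = sample_option_schedule(rng, num_keypoints=5, episode_len=10, commitment=5)
```

At evaluation time the keypoint index would be fixed to the most controllable keypoint, and only the direction would be resampled.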

Our experiments clearly validate our hypothesis that using keypoints enables temporally extended exploration. As shown in Figure 6, our learned keypoint options consistently outperform the random actions baseline by a large margin. Encouragingly, our random options policy is able to play some Atari games by moving around the avatar (most controllable keypoint) in different parts of the state space without dying. For instance, the agent explores multiple rooms in Montezuma’s Revenge, a classical hard exploration environment in the reinforcement learning community. Similarly, our keypoint exploration learns to consistently move around the submarine in Seaquest and the avatar in Ms. Pacman. Most notably, this is achieved without rewards or (extrinsic) task-directed learning. Therefore our learned keypoints are stable enough to learn complex object-oriented skills in the Atari domain.

5 Conclusion

We demonstrate that it is possible to learn stable object keypoints across thousands of environment steps, without having access to task-specific reward functions. Therefore, object keypoints could provide a flexible and re-purposable representation for efficient control and reinforcement learning. Scaling keypoints to work reliably on richer datasets and environments is an important future area of research. Further, tracking objects over long temporal sequences can enable learning object dynamics and affordances which could be used to inform learning policies. A limitation of our model is that we do not currently handle moving backgrounds. Recent work Gordon et al. (2019) that explicitly reasons about camera / ego motion could be integrated to globally transport features between source and target frames. In summary, our experiments provide clear evidence that it is possible to learn visual abstractions and use simple algorithms to produce highly data efficient control policies and exploration procedures.

Acknowledgements. We thank Loic Matthey and Relja Arandjelović for valuable discussions and comments.


Appendix A Implementation Details

The feature extractor is a convolutional neural network with six


layers Ioffe and Szegedy [2015]. The filter size for the first layer is with filters, and

for the rest with number of filters doubled after every two layers. The stride was set to 2 for layer 3 and 5 (1 for the rest).

PointNet has a similar architecture but includes a final regressor to feature-maps corresponding to keypoints. 2D coordinates are extracted from these maps as described in Jakab et al. [2018]. The architecture of RefineNet is the transpose of with bilinear-upsampling to undo striding. We specify for each environment but keep all other hyper-parameters of the network fixed across experiments. We used the Adam optimizer Kingma and Ba [2014] with a learning rate of (decayed by 0.95 every steps) and batch size of 64 across all experiments.
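The coordinate extraction step can be illustrated with a spatial softmax followed by marginalisation along each image dimension. This is one common variant, written as a sketch under assumptions: the order of the softmax and the marginalisation, the `[-1, 1]` coordinate range, and the function name are not taken from the text.

```python
import numpy as np

def keypoints_from_maps(maps):
    """Convert K detector feature maps into K (x, y) co-ordinates.

    maps: (K, H, W) raw detector outputs, one map per keypoint.
    Returns a (K, 2) array of (x, y) in [-1, 1].
    """
    k, h, w = maps.shape
    flat = maps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs = (probs / probs.sum(axis=1, keepdims=True)).reshape(k, h, w)
    p_x = probs.sum(axis=1)  # marginal over rows    -> (K, W)
    p_y = probs.sum(axis=2)  # marginal over columns -> (K, H)
    # Expected co-ordinate under each marginal distribution.
    x = p_x @ np.linspace(-1.0, 1.0, w)
    y = p_y @ np.linspace(-1.0, 1.0, h)
    return np.stack([x, y], axis=1)
```

Because the output is an expectation over a softmax, it is differentiable with respect to the detector maps, which is what allows the keypoint bottleneck to be trained end to end.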

Appendix B Diverse Data Generation

To train the Transporter, a dataset of observation pairs is constructed from environment trajectories. It is important that this dataset contains a diverse range of situations, but a dataset built by unconditionally storing a pair from every trajectory generated by a random policy may contain many similar pairs. To mitigate this, we use a diverse data generation procedure as follows.

We generate trajectories of up to length 100 (action repeat is set to 4, so these trajectories represent up to 400 environment frames) using a uniform random policy, and uniformly sample one observation from the first half of the trajectory and one frame from the second half. Trajectories are shorter than 100 only when the end of an episode is reached. A buffer containing the maximum number of pairs we want to generate (in these experiments, 100k) is populated unconditionally from a number of environment actor threads until it is full. More frame pairs are generated, up to some defined maximum budget, and are conditionally written into the buffer as follows.

First some number of indices of existing pairs are sampled from the buffer, and for each of them we compute the nearest neighbor by L2 distance to other elements of the buffer. We take the same number of new generated frame pairs, and also compute their nearest neighbor in the buffer. For corresponding pairs of (existing frame pair, new frame pair) we overwrite the existing pair with the new pair whenever the new pair has a greater nearest neighbor distance, or if a uniform random number . We continue this procedure until the maximum budget is reached, and write out the final buffer as our training set. Note that the reward function is not used at all in this procedure.
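One conditional-write step of this procedure might look like the sketch below. It operates on small in-memory arrays of flattened frame pairs and computes all pairwise distances directly, which would not scale to a 100k buffer without batching. The function names are hypothetical, and `accept_prob` stands in for the random-overwrite threshold, whose value is an assumption here.

```python
import numpy as np

def nearest_neighbor_dist(queries, buffer, exclude=None):
    """L2 distance from each query to its nearest element of the buffer."""
    d = np.linalg.norm(queries[:, None, :] - buffer[None, :, :], axis=-1)
    if exclude is not None:
        # Ignore each query's own slot in the buffer.
        d[np.arange(len(queries)), exclude] = np.inf
    return d.min(axis=1)

def diversify_step(buffer, candidates, rng, accept_prob=0.05):
    """Overwrite sampled buffer entries with more isolated candidates.

    buffer:      (B, F) flattened frame pairs, modified in place.
    candidates:  (C, F) newly generated frame pairs, with C <= B.
    accept_prob: ASSUMED probability of overwriting regardless of distance.
    Returns the number of entries overwritten.
    """
    idx = rng.choice(len(buffer), size=len(candidates), replace=False)
    old_nn = nearest_neighbor_dist(buffer[idx], buffer, exclude=idx)
    new_nn = nearest_neighbor_dist(candidates, buffer)
    replace = (new_nn > old_nn) | (rng.random(len(candidates)) < accept_prob)
    buffer[idx[replace]] = candidates[replace]
    return int(replace.sum())
```

Repeating this step until the frame budget is exhausted gradually replaces near-duplicate pairs with pairs that lie further from everything else in the buffer.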

For efficiency, we store a lower resolution copy of the buffer (64x64, grayscale) on the GPU to perform efficient nearest neighbor calculations, keeping corresponding higher res (128x128 RGB) copies on CPU RAM. Using a single consumer GPU and a 56 core desktop machine, with many actor subprocesses, this approach can perform 10 million environment steps (40 million total frames) in approximately 1 hour.

Appendix C Videos

Videos visualising various aspects of the model are available at:

Appendix D Pixel Transport versus Feature Transport

We investigated whether learning features is as important as spatially transporting them between frames. As shown in Figure 7, transporting learned features significantly outperforms transporting pixels. Transporting pixels gives rise to ambiguous intermediate pixel representations, making it difficult for the final CNN decoder network to solve the downstream pixel prediction task. On the other hand, the features encode higher level information and the decoder network learns a more abstract function to solve the prediction problem.

Figure 7: Transporting features is significantly better than transporting pixels. Given a sawyer arm with tabletop toys, Transporter discovers keypoints at the joint locations of the robot and object centroids (left two columns). In this experiment we investigate whether it is important to transport learned features or pixels. In case of pixel transport, the refinement network has to perform difficult and ambiguous computations to predict the target frame. Therefore the final pixel reconstruction error is significantly higher for pixel transporter (right most column).

Appendix E Temporal consistency of keypoints

Figures 8, 9, 10, and 11 show the inferred keypoints on frames selected from a single episode on Atari ALE Bellemare et al. [2013] (Pong, Frostbite and Ms. Pac-Man) and Manipulator Tassa et al. [2018] (stack_4) domains. The selected frames are each 10 time steps apart. The first frame has been explicitly chosen to ensure there is enough diversity in the shown frames. The colours are time consistent – a specific colour corresponds to the same keypoint throughout the episode. Thus, if a given game ‘object’ is assigned the same coloured keypoint throughout the episode, that keypoint is temporally consistent for that ‘object’.

Videos showing the inferred keypoints by the three methods for entire episodes can be accessed at: https://www.youtube.com/playlist?list=PL3LT3tVQRpbvGt5fgp_bKGvW23jF11Vi2

Figure 8: Atari Bellemare et al. [2013]: pong
Figure 9: Atari Bellemare et al. [2013]: frostbite
Figure 10: Atari Bellemare et al. [2013]: ms pacman
Figure 11: Manipulator Tassa et al. [2018]: stacker with 4 objects

Appendix F Reconstructions

We visualise the reconstructed images on Atari ALE Bellemare et al. [2013] (Figures 12, 13, and 14) and Manipulator Tassa et al. [2018] domains (Figure 15) for randomly selected frames. The rows in the figures correspond respectively to our model, Jakab et al. [2018] and Zhang et al. [2018]. The first two columns are the inputs given to the models. Whereas our model requires a pair of input frames (image and future_image), the remaining two models only require a single frame (future_image). The third column (reconstruction) shows the reconstructed target image. The final column (keypoints) shows the inferred keypoints for the given inputs.

Figure 12: Reconstruction: pong
Figure 13: Reconstruction: frostbite
Figure 14: Reconstruction: ms pacman
Figure 15: Reconstruction: stacker with 4 objects