In order for robots to truly become generalists, they must be readily taskable by humans, handle raw sensory inputs without instrumentation, and be equipped with a range of skills that generalize effectively to new situations. Reinforcement learning autonomously learns policies that maximize a reward function and is a promising approach towards such generalist robots. However, in a general setting involving diverse objects and tasks, prior information about what tasks to learn is hard to come by without manually designing object detectors and reward functions. How can a robot explore the world in order to learn and fine-tune useful skills on diverse objects, only from acting and observing how its actions affect its sensory stream?
Prior methods have proposed to let an agent learn from its sensor stream by automatically generating plausible goals during an unsupervised training phase, and then learning policies that reach those goals [nair2018rig, nachum2018hiro, wadefarley2019discern, pong2019skewfit]. Such goals can be defined in a variety of ways, but a simple choice is to use goal observations, such that each proposed task requires reaching a different observation. When the robot observes the world via raw camera images, this corresponds to using images as goals. At test time, a user then provides the robot with a new goal image.
While such methods have been demonstrated in both simulated and real-world settings, they are typically used to learn behaviors in domains with relatively little visual diversity. In the real world, a robot might interact with highly diverse scenes and objects, and the tasks that it can perform from each of the many possible initial states will be different. If the robot is presented with an object, it can learn to pick up or grasp it, and when it is presented with a door, it can learn to open it. However, it must generate and practice goals that are suitable for each scene. In this paper, we propose and evaluate a self-supervised policy learning method that learns to propose goals that are suitable to the current scene via a conditional goal generation model, allowing it to learn in visually varied settings that prove challenging for prior algorithms.
The key idea in our work is that representing every element in a visually complex scene is often not necessary for control. A scene is a visual form of context that can be factored out, while only the controllable entities in the environment need to be captured for goal setting and representing the state. To this end, we propose learning a context-conditioned generative model that learns a smooth, compressed latent variable with an information bottleneck, while allowing the context, in the form of the initial state image, to be used freely to reconstruct other images during the task. This context-conditioned generative model architecture is shown in Figure 1.
The main contribution in this paper builds on this context-conditioned generative model to devise a complete self-supervised goal-conditioned reinforcement learning algorithm, which can handle visual variability in the scene via context-conditioned goal setting. Our method can learn policies that reach visually indicated goals without any additional supervision during training, using the context-conditioned generative model to set goals that are appropriate to the current scene. We show that our approach learns coherent representations of visually varied environments, capturing controllable dimensions of variation while ignoring dimensions that vary but cannot be influenced by the agent, such as lighting and object appearance. We further show that our approach can learn policies to solve tasks in visually varied environments, including in a real-world robotic pushing task with a wide variety of distinct objects.
2 Related Work
While many practical robots today perform tasks by executing hand-engineered sequences of motor commands, machine learning is opening up a new avenue to train a wide variety of robotic tasks from interaction. This body of work includes grasping [ekvall2004interactive, kroemer2010grasping, bohg2010learning], general tasks [PETERS2008682, kober2013reinforcement], multi-task learning, baseball [peters2008baseball], ping-pong [peters2010reps], and various other tasks [deisenroth2011pilco]. More recently, using expressive function approximators such as neural networks has reduced manual feature engineering and has increased task complexity and diversity, finding use in decision-making domains, such as solving Atari games [mnih2013atari] and Go [silver2016alphago]. Deep learning for robotics has proved to be difficult due to a host of challenges including noisy state estimation, specifying reward functions, and handling continuous action spaces, but has been used to investigate grasping [pinto2015supersizing], pushing [agrawal2016poking], manipulation of 3D object models [krainin2011autonomous, martinez2014active], and pouring liquids [schenck2017visual]. Deep reinforcement learning, which autonomously maximizes a given reward function, has been used to solve precise manipulation tasks [levine2016gps], grasping [pinto2017robust, levine2017grasping], door opening [gu2016naf], and navigation [kahn2018navigation]. These methods have succeeded on specific tasks, often with hard-coded reward functions. However, to scale to general task learning, robots may need methods that can handle significant environment variation and require relatively little external supervision.
Several works have investigated self-supervised robotic interaction with varied objects in the deep learning setting with the goal of generalizing between objects. For example, in the domain of robotic grasping, several works have studied autonomous data collection to learn to grasp from a hand-specified grasping reward [pinto2015supersizing, levine2017grasping]. However, hand-specifying such rewards in general settings and for arbitrary manipulation skills is very cumbersome. Other work has focused on self-supervised learning with visual forward models, either by enforcing a simplified dynamical structure [watter2015embed, zhang2019solar] or with pixel transformer architectures [finn2016visualforesight, ebert2017videoprediction, ebert2018retrying, lee2018videoprediction, ebert2018journal]. However, these methods rely on accurate visual forward modelling, which is itself a very challenging problem. Instead, we build on self-supervised model-free approaches, which allow the agent to efficiently reach visual goals without planning with a visual forward model.
Prior work has also sought to perform self-supervised learning with model-free approaches. Using visual inverse models [agrawal2016poking] is one such approach, but may not work well for complex interaction dynamics or longer horizon planning. Most closely related to our approach are prior methods on goal-conditioned reinforcement learning [kaelbling1993goals, schaul2015uva, andrychowicz2017her]. These methods have been extended to frame self-supervised RL as learning goal reaching with automatically proposed goals, including visually-specified goals [nair2018rig, pong2019skewfit, wadefarley2019discern, florensa2019self, lin2019rlwithoutstate]. However, they generally focus on learning in narrow environments with little between-trial variability. In this setting, any previously visited state represents a valid goal. In the general case, this is no longer true: when the robot is presented with a different scene or different objects on each trial, it must only set those goals that can be accomplished in the current scene. In contrast, we focus on enabling self-supervised learning from off-policy data in heterogeneous environments with increased factors of variability.
In this section, we provide an overview of relevant prior work necessary to understand our method, including goal-conditioned reinforcement learning and representation learning with variational auto-encoders.
3.1 Goal-Conditioned Reinforcement Learning
In an MDP consisting of states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, dynamics $p(s_{t+1} \mid s_t, a_t)$, rewards $r_t$, horizon $H$, and discount factor $\gamma$, reinforcement learning addresses optimizing a policy $a_t = \pi_\theta(s_t)$ to maximize expected return $\mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r_t\right]$. To learn a variety of skills, it is instead convenient to optimize over a family of reward functions parametrized by goals, as in the framework of goal-conditioned RL [kaelbling1993goals]. A variety of RL algorithms exist, but as we are primarily interested in sample-efficient off-policy learning, we consider goal-conditioned Q-learning algorithms [schaul2015uva]. These algorithms learn a parametrized Q-function $Q_w(s, a, g)$ that estimates the expected return of taking action $a$ from state $s$ with goal $g$. Q-learning methods rely on minimizing the Bellman error:
$$\mathcal{L}(w) = \frac{1}{2} \left\lVert Q_w(s, a, g) - \left( r + \gamma \max_{a'} Q_{\bar{w}}(s', a', g) \right) \right\rVert^2$$
This objective can be optimized using standard actor-critic algorithms using a set of transitions $(s, a, s', g, r)$, which can be collected off-policy [lillicrap2015continuous]. In practice, a target network $Q_{\bar{w}}$ is often used for the second Q-function.
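As a rough illustration of the Bellman error described above, the following minimal sketch computes the squared error for a single goal-conditioned transition. It assumes a small discrete action set so that the maximum can be enumerated; with continuous actions, as in actor-critic methods, the maximum would instead be approximated by the actor. The names `bellman_error`, `toy_q`, and `toy_q_target` are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical signature: Q(state, action, goal) -> scalar value estimate.
QFunction = Callable[[Sequence[float], int, Sequence[float]], float]
# A transition is (s, a, r, s_next, g).
Transition = Tuple[Sequence[float], int, float, Sequence[float], Sequence[float]]


def bellman_error(q: QFunction, q_target: QFunction, transition: Transition,
                  actions: Iterable[int], gamma: float = 0.99) -> float:
    """Squared goal-conditioned Bellman error for one transition."""
    s, a, r, s_next, g = transition
    # The bootstrap term is evaluated with the target network, not the online one.
    target = r + gamma * max(q_target(s_next, a_next, g) for a_next in actions)
    td_error = q(s, a, g) - target
    return 0.5 * td_error ** 2


# Toy usage with made-up Q-functions over three discrete actions.
toy_q = lambda s, a, g: 1.0
toy_q_target = lambda s, a, g: float(a)
loss = bellman_error(toy_q, toy_q_target, ([0.0], 1, 0.5, [1.0], [2.0]),
                     actions=[0, 1, 2], gamma=0.5)
```

In the toy call, the target is 0.5 + 0.5 * 2 = 1.5, so the temporal-difference error is -0.5 and the loss is 0.125.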
3.2 Variational Auto-Encoders
Above, the goal description $g$ can take many forms. To handle high-dimensional goals, in this work, we learn a latent representation of the state using a variational auto-encoder (VAE). A VAE is a probabilistic generative model that has been shown to successfully learn structured representations of high-dimensional data [kingma2014vae]. It is trained by reconstructing states $s$ with a parametrized encoder that converts the state into a normal random distribution $q_\phi(z \mid s)$ and a parametrized decoder that converts the latent variable $z$ into a predicted state distribution $p_\psi(s \mid z)$, while keeping the latent $z$ close (in KL divergence) to its prior $p(z)$, a standard normal distribution. To train the encoder and decoder parameters $\phi$ and $\psi$, we jointly optimize both objectives when minimizing the negative evidence lower bound, where $\beta$ weights the KL term:
$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid s)}\left[ \log p_\psi(s \mid z) \right] + \beta \, D_{\text{KL}}\left( q_\phi(z \mid s) \,\|\, p(z) \right)$$
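The two terms of the negative evidence lower bound can be made concrete with a small sketch. It assumes a diagonal Gaussian encoder, for which the KL divergence to a standard normal prior has a closed form, and a unit-variance Gaussian decoder, so the reconstruction term reduces to a squared error up to an additive constant. The function names and the `beta` weighting argument are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def negative_elbo(s: List[float], s_recon: List[float],
                  mu: List[float], logvar: List[float], beta: float = 1.0) -> float:
    """Reconstruction error plus beta-weighted KL term for one sample."""
    # Assumption: unit-variance Gaussian decoder, so -log p(s|z) is a
    # squared error up to a constant that does not affect optimization.
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(s, s_recon))
    return reconstruction + beta * kl_to_standard_normal(mu, logvar)
```

A posterior equal to the prior (zero mean, zero log-variance) contributes no KL penalty, so with a perfect reconstruction the loss is zero; moving the mean away from zero is what the bottleneck charges for.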
3.3 Conditional Variational Auto-Encoders
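Instead of a generative model that learns to generate the dataset distribution, one might instead desire a more structured generative model that can generate samples based on structured input. One example of this is a conditional variational auto-encoder (CVAE) that conditions the output on some input variable $c$ and samples from $p(s \mid c)$ [sohn2015cvae]. For example, to train a model that generates images of digits given the desired digit, the input variable $c$ might be a one-hot encoding of the desired digit.
A CVAE trains $q_\phi(z \mid s, c)$ and $p_\psi(s \mid z, c)$, where both the encoder and decoder have access to the input variable $c$. The CVAE then minimizes:
$$\mathcal{L}_{\text{CVAE}} = -\mathbb{E}_{q_\phi(z \mid s, c)}\left[ \log p_\psi(s \mid z, c) \right] + \beta \, D_{\text{KL}}\left( q_\phi(z \mid s, c) \,\|\, p(z) \right)$$
Samples are generated by first sampling a latent $z \sim p(z)$. Based on $c$, we can then decode $z$ with $p_\psi(s \mid z, c)$ and visualize the output, which is in our case an image. In our framework, the input variable is the initial state image $s_0$ of a rollout.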
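The sampling procedure can be sketched as follows. The decoder here is a made-up stand-in (`toy_decode`) that copies appearance from the context and reads position from the latent; a trained CVAE decoder would be a neural network, and all names in this block are hypothetical.

```python
import random
from typing import Callable, Dict, List


def sample_goals(decode: Callable[[List[float], Dict], Dict], context: Dict,
                 latent_dim: int, num_samples: int, rng: random.Random) -> List[Dict]:
    """Draw latents from the standard normal prior and decode each with the same context."""
    goals = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        goals.append(decode(z, context))
    return goals


def toy_decode(z: List[float], context: Dict) -> Dict:
    # Stand-in decoder: appearance comes from the context, position from the latent.
    return {"puck_color": context["puck_color"], "puck_xy": (z[0], z[1])}


goals = sample_goals(toy_decode, {"puck_color": "red"}, latent_dim=2,
                     num_samples=5, rng=random.Random(0))
```

Every sampled goal shares the conditioning information, while the latent alone determines what differs between samples.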
3.4 Reinforcement Learning with Imagined Goals
To learn skills from raw observations in a self-supervised manner, reinforcement learning with imagined goals (RIG) proposed to use representation learning combined with goal-conditioned RL [nair2018rig]. The aim of RIG is to choose actions in order to make the state $s$ reach a goal image $s_g$ at test time. RIG first collects an interaction dataset $\{s_t\}$ and learns a latent representation by training a VAE on this data. Then, a goal-conditioned policy $\pi(z, z_g)$ is trained to act in the environment in order to reach a given goal $z_g$. Exploration data is collected by rolling out the policy with goals that are “imagined” from the VAE prior; at test time, the policy can take in a goal image as input, encode the image to the latent space, and act to reach the goal latent. A key limitation of this method is that sampling goals from the VAE prior during training time assumes that every state in the dataset is reachable at any time. However, in general, this assumption may not be true.
4 Self-Supervised Learning with Context-Conditioned Representations
In this work, our goal is to enable the learning of flexible goal-conditioned policies that can be used to successfully perform a variety of tasks in a variety of contexts – e.g., with different objects in the scene. Such policies must learn from large amounts of experience, and although it is in principle possible to use random exploration to collect this experience, this quickly becomes impractical in the real world. It is, therefore, necessary for the robot to set its own goals during self-supervised training, to collect meaningful experience. However, in diverse settings, many randomly generated goals may not be feasible – e.g., the robot cannot push a red puck if the red puck is not present in the scene. We propose to extend off-policy goal-conditioned reinforcement learning with a conditional goal setting model, which proposes only those goals that are currently feasible. This enables a learning regime with imagined goals that is more realistic for real-world robotic systems that must generalize effectively to a range of objects and settings.
4.1 Context-Conditioned VAEs
To train a generative model that can improve the generation of feasible goals in varied scenes, we use a modified CVAE that uses the initial state $s_0$ in a rollout as the input, which we call the “context” for that rollout. The modified CVAE, which we call a context-conditioned VAE (CC-VAE), is shown in Figure 1. While most CVAE applications use a one-hot vector as the input, we use an image $s_0$. This image is encoded with a convolutional encoder $e_0$ into a compact representation $c = e_0(s_0)$. Note that by design, the state encoder $e$ and the context encoder $e_0$ do not share weights, as they are intended to encode different factors of variation in the images. The context $c$ is used to output the latent representation $z$, as well as the reconstruction of the state $\hat{s}$. In addition, $c$ is used alone to (deterministically) decode $\hat{s}_0$. The objective is given by
$$\mathcal{L}_{\text{CC-VAE}} = -\mathbb{E}_{q_\phi(z \mid s, c)}\left[ \log p_\psi(s \mid z, c) \right] + \beta \, D_{\text{KL}}\left( q_\phi(z \mid s, c) \,\|\, p(z) \right) - \log p_\psi(s_0 \mid c)$$
Due to the information bottleneck on $z$, this loss function penalizes information passing through $z$ but allows for unrestricted information flow from $c$. Therefore, the optimal solution would encode as much information as possible in $c$, while only including the state information that changes within a trajectory in the latent variable $z$. These are precisely the features of most interest for control.
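To make the data flow of the CC-VAE concrete, here is a minimal sketch of the loss for one (state, initial state) pair. It is an illustration under stated assumptions rather than the authors' implementation: the encoders and decoders are passed in as plain callables, reconstruction terms are squared errors (a unit-variance Gaussian decoder), the posterior mean is used in place of a sampled latent, and `beta` is an assumed weight on the KL term. All function names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def squared_error(x: Vector, y: Vector) -> float:
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))


def cc_vae_loss(s: Vector, s0: Vector,
                encode_context: Callable[[Vector], Vector],
                encode_state: Callable[[Vector, Vector], Tuple[Vector, Vector]],
                decode_state: Callable[[Vector, Vector], Vector],
                decode_context: Callable[[Vector], Vector],
                beta: float = 1.0) -> float:
    c = encode_context(s0)            # context path: deterministic, no KL penalty
    mu, logvar = encode_state(s, c)   # posterior over z given state and context
    z = mu                            # illustration only: training would sample z
    state_term = squared_error(s, decode_state(z, c))
    kl_term = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))
    context_term = squared_error(s0, decode_context(c))  # s0 decoded from c alone
    return state_term + beta * kl_term + context_term


# Toy usage: the context stores s0 exactly, and z only stores the change from s0.
toy_loss = cc_vae_loss(
    s=[1.0, 3.0], s0=[1.0, 2.0],
    encode_context=lambda s0: list(s0),
    encode_state=lambda s, c: ([a - b for a, b in zip(s, c)], [0.0] * len(s)),
    decode_state=lambda z, c: [a + b for a, b in zip(z, c)],
    decode_context=lambda c: list(c),
    beta=2.0,
)
```

In the toy call, both reconstructions are exact, so the only cost is the KL term for the one latent dimension that differs from the context: 2.0 * 0.5 = 1.0. Information carried by the context is free, while information carried by the latent is penalized.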
4.2 Context-Conditioned Reinforcement Learning with Imagined Goals
We propose to use our context-conditioned VAE in the RIG framework to learn policies over environments with visual diversity, where each episode might involve interacting with a different scene and different objects. We first collect a dataset of trajectories by executing random actions in the environment. We then learn a CC-VAE, as detailed in Section 4.1, to learn a factored representation of the image observations. To use the CC-VAE for self-supervised learning, we save the first image $s_0$ when starting a rollout. We compute the encoding of $s_0$, $c = e_0(s_0)$, where $e_0$ is the context encoder. Let $\bar{z}$ denote the context concatenated vector $(z, c)$, and let $\mu(s, c)$ denote the mean of $q_\phi(z \mid s, c)$. We then use RIG in the latent space by encoding observations with $\mu$, meaning that we train a goal-conditioned policy $\pi(\bar{z}, \bar{z}_g)$ and a goal-conditioned Q-function $Q(\bar{z}, a, \bar{z}_g)$.
To collect data, we sample a latent goal for each rollout from the prior $z_g \sim p(z)$, as in RIG. For every observation $s_t$, we compute the mean encoding $z_t = \mu(s_t, c)$. We then obtain a rollout of the policy by executing $a_t = \pi(\bar{z}_t, \bar{z}_g)$. The reward at each timestep is the negative latent distance $r_t = -\lVert z_t - z_g \rVert$.
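The data collection procedure described above can be sketched as a single rollout loop. This is a simplified illustration: the environment, encoders, and policy are hypothetical callables, the reward is taken as the negative Euclidean distance in latent space computed on the next observation (an assumption about the exact timestep), and a one-dimensional toy environment stands in for the robot.

```python
import math
import random
from typing import Callable, Dict, List

Vector = List[float]


def collect_rollout(env_reset: Callable[[], Vector],
                    env_step: Callable[[Vector, float], Vector],
                    encode_context: Callable[[Vector], Vector],
                    encode_mean: Callable[[Vector, Vector], Vector],
                    policy: Callable[[Vector, Vector], float],
                    latent_dim: int, horizon: int, rng: random.Random) -> List[Dict]:
    s = env_reset()
    c = encode_context(s)                                        # context from first image
    z_goal = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]    # goal from the prior
    transitions = []
    for _ in range(horizon):
        z = encode_mean(s, c)
        action = policy(z + c, z_goal + c)     # policy sees context-concatenated vectors
        s_next = env_step(s, action)
        z_next = encode_mean(s_next, c)
        reward = -math.dist(z_next, z_goal)    # negative latent distance to the goal
        transitions.append({"z": z, "action": action, "z_next": z_next,
                            "z_goal": z_goal, "context": c, "reward": reward})
        s = s_next
    return transitions


# Toy 1-D usage: the state is a position and the policy steps toward the goal.
rollout = collect_rollout(
    env_reset=lambda: [0.0],
    env_step=lambda s, a: [s[0] + a],
    encode_context=lambda s0: [7.0],
    encode_mean=lambda s, c: list(s),
    policy=lambda z_bar, g_bar: max(-0.5, min(0.5, g_bar[0] - z_bar[0])),
    latent_dim=1, horizon=10, rng=random.Random(0),
)
```

Storing the context alongside each transition is a design choice of this sketch; it keeps later goal relabeling tied to the scene in which the transition was collected.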
The policy and Q-function can be trained with any off-policy reinforcement learning algorithm. We use TD3 in our implementation [fujimoto2018td3]. Our policy and Q-function are goal-conditioned, and we take advantage of being able to relabel the goals for each transition to improve sample efficiency [andrychowicz2017her, nair2018rig, pong2019skewfit]. Importantly, when relabeling a goal with a random goal from the environment, the context-conditioning is still preserved. That is, if $z_g'$ is the new sampled goal, we use $\bar{z}_g' = (z_g', c)$, where $c$ is the context of the transition being relabeled. This ensures that the relabeled goal is compatible with the scene for the corresponding transition.
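A minimal sketch of context-preserving relabeling follows. It assumes each stored transition is a dictionary that keeps its own context vector (a hypothetical layout chosen for illustration) and recomputes the reward as the negative latent distance to the new goal.

```python
import math
from typing import Dict, List


def relabel_goal(transition: Dict, new_latent_goal: List[float]) -> Dict:
    """Swap in a new latent goal while keeping the transition's original context."""
    context = transition["context"]
    relabeled = dict(transition)  # shallow copy; the stored transition is untouched
    relabeled["z_goal"] = list(new_latent_goal)
    # The goal fed to the policy and Q-function is the new latent plus the OLD context.
    relabeled["goal_with_context"] = list(new_latent_goal) + list(context)
    relabeled["reward"] = -math.dist(transition["z_next"], new_latent_goal)
    return relabeled


stored = {"z": [0.0, 0.0], "action": 0.1, "z_next": [3.0, 4.0],
          "z_goal": [9.0, 9.0], "context": [0.2, 0.8], "reward": -7.8}
relabeled = relabel_goal(stored, [0.0, 0.0])
```

Only the latent part of the goal changes; the context half of the concatenated goal vector always comes from the transition itself.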
After training, we can use the learned policy to reach a visually indicated goal. Given a goal image $s_g$, we encode it into a latent goal $z_g = \mu(s_g, c)$, the mean of the CC-VAE encoder given the current context $c$. Then, we execute the policy with the latent goal $z_g$, just as during the training phase. The complete algorithm is presented in Algorithm 1.
In our experiments, we aim to answer the following questions:
How does our method compare to prior work at learning self-supervised skills in visually diverse environments?
Do context-conditioned VAEs learn an image representation that produces coherent and diverse goals that are suitable for the current scene?
Can our proposed context-conditioned RIG method handle diverse real-world data and learn effective policies under visual variation in the real world?
5.1 Self-Supervised Learning in Simulation
In simulation, we can conduct controlled experiments and evaluate against known underlying state values to measure the performance of our approach and prior methods. As a simulation test-bed, we use a multi-color pusher environment simulated in MuJoCo [todorov12mujoco]. In this environment (shown on the left in Figure 3), a simulated Sawyer arm is tasked with pushing a circular puck to a target position, specified by a goal image at test time. On each rollout, the puck color is set to a random RGB value. Therefore, the goal proposals for each method must adequately account for the color of the puck – a goal that requires moving a red puck to a given location is impossible if only a blue puck is present in the scene.
We compare the following algorithms:
CC-RIG. Our method using a CC-VAE for representation learning, as described in Section 4.2.
RIG. Reinforcement learning with imagined goals [nair2018rig] using a standard VAE, as described in Section 3.4.
Oracle. The oracle agent runs goal-conditioned RL with direct access to state information. Achieving performance similar to the oracle indicates that an algorithm loses little from using raw image observations over ground truth state.
Learning curves comparing these methods are presented in the plot on the left in Figure 2. CC-RIG outperforms RIG significantly, and standard RIG is not able to improve beyond the initial random policy. The performance of CC-RIG approaches that of the oracle policy, which has access to the true state. This suggests that, in visually varied environments, self-supervised learning is possible so long as the visual complexity is factored out with representation learning, and the proposed goals are consistent with the appearance of the current scene.
5.2 Generalizing to Varying Appearance and Dynamics with Self-Supervised Learning
In this experiment, we study how well our method generalizes when both the visual appearance and the physical dynamics of the environment change. We use a simulated 2D navigation task, where the goal is to navigate a point robot around an obstacle. The arrangement of the obstacles is chosen from a set of 15 possible configurations, and the color of the point robot is generated from a random RGB value. Learning curves obtained by training the different methods above in this environment are presented in Figure 2. CC-RIG requires more samples to learn, but eventually approaches the oracle performance. RIG, in comparison, plateaus with poor performance. This environment is explained further in the supplementary material, in Section 7 and Figure 5.
5.3 Context-Conditioned VAE Goal Sampling
To better understand why CC-RIG outperforms RIG, we compare the samples from our CC-VAE to a standard VAE. Samples from both models are shown in Figure 3. The quality of the samples reveals why the CC-VAE provides better goal setting for self-supervised learning. In all environments, the samples from the CC-VAE maintain the background, object shape, and object color from the initial state. Therefore, the goals are more meaningful in the CC-VAE latent space.
This kind of visualization is a good indicator for the suitability of the representation for self-supervised learning. Diverse, coherent samples indicate that the latent space captures the appropriate factors of change in the environment and can be useful for self-supervised policy learning. Good samples also suggest that the latent space is well-structured, and therefore distances in the latent space should provide a good reward function for goal-reaching. In practice, we also look at the quality of the reconstructions. Good reconstructions confirm that the latent variables capture sufficient information about the image to be used in place of the image itself as a state representation.
5.4 Real-World Robotic Evaluation
In this experiment, we evaluate whether our method can handle manipulating visually varied objects in the real world. We use CC-RIG to train a Sawyer robot to manipulate a variety of objects, placed one at a time in a bin in front of the robot. As before, the training phase is self-supervised, and the robot must match a given goal image at test time. The robot setup is shown in Figure 1.
We first collect a large dataset with random actions and train a CC-VAE on the data. Samples from the model are shown in Figure 2. The CC-VAE learns to generate goals with the correct object. To handle varying brightness at different times of the day, we added data augmentation by applying a color jitter filter to $(s_0, s_t)$ pairs. As seen in the figure, the model is robust to this factor of variation. Each sample contains the same type of object, brightness level, and background as the initial state that it is conditioned on. However, crucially, these factors of variation are not present in $z$, as evidenced by the fact that they do not vary within each column of Figure 2, but the object position does.
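One way such paired augmentation could look is sketched below. The key assumption, which the text does not spell out, is that a single jitter factor is drawn per pair and applied to both the context image and the state image, so that brightness stays a property of the context and does not vary within a pair. Images are represented as flat lists of intensities in [0, 1], only brightness is jittered, and `jitter_pair` and `max_delta` are hypothetical names.

```python
import random
from typing import List, Tuple

Image = List[float]  # flat list of pixel intensities in [0, 1]


def jitter_pair(s0: Image, st: Image, rng: random.Random,
                max_delta: float = 0.2) -> Tuple[Image, Image]:
    """Apply one shared random brightness factor to both images of a pair."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)

    def scale(image: Image) -> Image:
        # Clip so that the augmented image stays a valid intensity map.
        return [min(1.0, max(0.0, pixel * factor)) for pixel in image]

    return scale(s0), scale(st)


jittered_s0, jittered_st = jitter_pair([0.5, 0.5], [0.5, 0.25], random.Random(0))
```

Drawing independent factors for the two images would instead make brightness look like something that changes within a rollout, which would push it into the latent variable.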
Next, we run CC-RIG with the trained CC-VAE to learn to reach visually indicated goals in a self-supervised manner. We first conduct fully off-policy training using the same dataset as was used to train the CC-VAE, consisting of 50,000 samples (about 3 hours) of total interaction with 20 objects. Then, we collect a small amount of additional on-policy data to finetune the policy, analogous to recent work on large-scale vision-based robotic reinforcement learning [pmlr-v87-kalashnikov18a]. The robot learns to push objects to target locations, indicated by a goal image. The real-world results are presented in Figure 2. Because it is difficult to automatically detect the positions of objects, we show some representative rollout examples, and we compute several distance metrics between the final state of a rollout and the goal:
CVAE distance. CC-VAE latent space distance between final image and goal.
VAE distance. VAE latent space distance between final image and goal.
Pixel distance. We manually label the center of mass of the object in the final image and goal image, and compute the distance between them.
Object distance. We measure the distance between the physical goal position of the object and the final position.
In each metric, CC-RIG outperforms RIG.
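The distance metrics and their summary statistics are simple to compute once the positions or latents are available. The sketch below assumes Euclidean distance and the sample standard deviation; whether the reported numbers use the sample or population standard deviation is not stated, so that choice is an assumption. Function names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def pixel_distance(center_final: Sequence[float], center_goal: Sequence[float]) -> float:
    """Euclidean distance between manually labeled object centers, in pixels."""
    return math.dist(center_final, center_goal)


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(values), statistics.stdev(values)


def median_and_std(values: List[float]) -> Tuple[float, float]:
    return statistics.median(values), statistics.stdev(values)
```

The same `math.dist` call applies to the latent-space metrics, with encoded final and goal images in place of pixel coordinates.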
At training time, the dataset consists of interaction with 20 objects. The results of running CC-RIG on novel objects that were not included in the dataset are shown in the table as “CC-RIG, novel” and in the rollouts in Figure 2. These results show that our method can also generalize its experience to push novel objects it has not seen before.
The table above shows the performance of our method in the real world, evaluated with four different evaluation metrics. (The first three metrics are computed on 40 trajectories per method, and we report mean ± standard deviation. Object distance is computed on 10 trajectories per method, and we report median ± standard deviation.) CC-RIG outperforms RIG in each one, even when tested on novel objects that it has not been trained on. Test rollouts of our method are shown on training objects on the left and unseen novel objects on the right. Successful rollouts where the object is pushed to the goal location are shown in the top row, and failure modes are shown in the bottom row.
We presented a method for sample-efficient, flexible self-supervised task learning for environments with visual diversity. Our method can learn effective behavior without external supervision in simulated environments with randomized colors and layout, and in a real-world pushing task with differently colored pucks. Each environment contains an axis of visual variation that requires our algorithm to utilize an intelligent goal-setting strategy, to ensure that the self-proposed goals are consistent with the tasks and feasible in the current scene.
The main idea behind our method is to devise a context-conditioned goal proposal mechanism, allowing our self-supervised reinforcement learning algorithm to propose goals for itself that are feasible to reach. This context-conditioned VAE model factors out the unchanging context of a rollout, such as which objects are present in the scene, from the controllable aspects, such as the object positions, to construct a more generalizable goal proposal model.
We believe this contribution will enable scalable learning in the real world. An agent manipulating objects in the real world must handle many forms of variation: different manipulation skills to learn, objects to manipulate, as well as variation in lighting, textures, etc. Methods that learn from data must be able to represent these variations while at the same time taking advantage of common structure across objects and tasks in order to achieve practical sample efficiency. Future work will address the remaining challenges to achieve this vision.
This research was supported in part by the National Science Foundation under IIS-1651843, IIS-1700697, and IIS-1700696, the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, DARPA, Berkeley DeepDrive, Google, Amazon, and NVIDIA.
7 Multi-Color 2D Navigation Experiments
In order to study generalizing to varying appearance and dynamics with CC-RIG, we introduced the multi-color 2D point navigation environment shown in Figure 5. The goal is to navigate a point robot around the central walls. The arrangement of the walls is randomly chosen from a set of 15 possible configurations in each rollout, and the color of the circle indicating the position of the point robot is generated from a random RGB value. Thus at test time, the agent sees new colors it has never trained on.
First, we see from the samples in Figure 5 that the learned latent space for the CC-VAE is more reasonable than that of a VAE: it preserves color and wall information in samples and represents only the colored circle position in the latent variable $z$. This improves the capability of our algorithm to learn in several ways: it provides a more informative reward function, and gives us better goal sampling for both exploration rollouts and experience relabeling when training the Q-function.
Learning curves obtained by training the different methods above in this environment are presented in the main paper Figure 2. This task is trivial for the oracle method to learn, as it directly receives state information and does not need to generalize between different object appearances. CC-RIG requires more samples to learn, but eventually approaches the oracle performance. RIG plateaus with poor performance in comparison.
8 Off-Policy Experiments
Because we use off-policy RL methods, one major benefit is that we can bootstrap training from large interaction datasets rather than requiring on-policy data collection. This is particularly vital in the real world, where on-policy data collection is expensive in terms of human effort, and repeatedly tuning on-policy methods for complex tasks is likely to be impractical. Our robot experiments are therefore run by starting with a fixed initial dataset of 50,000 samples (about 3 hours) of random interaction with 20 objects, which is used for both training the CC-VAE and for RL. Our simulated experiments are conducted with online data collection to make comparison with prior work clearer, but in this section we show that bootstrapping with off-policy training is possible in these settings as well.
In our simulated experiments, we first collect 100,000 samples (1000 trajectories) with random actions. This data is used both to train the CC-VAE and as off-policy data. When we begin RL, we load these samples into the replay buffer and perform 100,000 gradient updates of RL. As shown in Figure 6, this allows us to begin online data collection with a reasonably good policy. Online data collection then improves only slightly on this initial policy. In dynamically sensitive environments or environments where random actions do not provide meaningful interaction, this online data collection may still be very valuable.
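The bootstrapping schedule described above (preload random data, run offline updates, then alternate online collection and updates) can be sketched as follows. The replay buffer, the `update_fn` and `collect_fn` callables, and the small counts in the toy usage are hypothetical stand-ins; the actual learner in the text is an off-policy RL algorithm.

```python
import random
from typing import Callable, List


class ReplayBuffer:
    """Fixed-capacity buffer that overwrites the oldest entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.storage: List = []
        self.next_index = 0

    def add(self, transition) -> None:
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size: int, rng: random.Random) -> List:
        return [rng.choice(self.storage) for _ in range(batch_size)]


def bootstrap_then_finetune(buffer: ReplayBuffer, update_fn: Callable[[List], None],
                            collect_fn: Callable[[], List], offline_updates: int,
                            online_epochs: int, updates_per_epoch: int,
                            batch_size: int, rng: random.Random) -> None:
    # Phase 1: purely off-policy updates on the preloaded random-interaction data.
    for _ in range(offline_updates):
        update_fn(buffer.sample(batch_size, rng))
    # Phase 2: alternate online data collection with further updates.
    for _ in range(online_epochs):
        for transition in collect_fn():
            buffer.add(transition)
        for _ in range(updates_per_epoch):
            update_fn(buffer.sample(batch_size, rng))


# Toy usage with a counter in place of a real gradient step.
toy_buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    toy_buffer.add(("random", i))
update_count = []
bootstrap_then_finetune(toy_buffer, update_fn=lambda batch: update_count.append(len(batch)),
                        collect_fn=lambda: [("online", j) for j in range(10)],
                        offline_updates=50, online_epochs=3, updates_per_epoch=5,
                        batch_size=4, rng=random.Random(0))
```

In the toy run, 50 offline updates are followed by 3 epochs of 5 updates each, and the buffer grows from 100 to 130 transitions.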