Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

by Negin Heravi, et al.
Stanford University

Perceptual understanding of the scene and the relationships between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task that are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines for the task of object localization in multi-object scenes.





1 Introduction

General purpose robots need to encode information about their environment and themselves in a way that is not task specific and can be easily transferred to new situations and tasks. Current techniques for learning representations of the robot’s environment are often trained in a supervised manner on task-specific datasets. However, collecting and labeling a new dataset per task is not scalable. Self-supervised learning methods that aim to learn representations from unlabeled data hold the potential to help robots learn about their environments without manual annotation.

Prior work in robotics has shown that the performance and sample efficiency of policy learning improve with self-supervised scene representations. These methods use compact representations in the form of a global embedding [Sermanet2017TCN], sparse keypoint coordinates [finn2016deep, kulkarni2019unsupervised, florence2019self, manuelli2020keypoints], or other object embeddings [yuan2021sornet] as input to policy-learning modules. In this work, we explore a class of self-supervised models called Slot Attention [locatello2020objectcentric] for representation learning in a robotic setup. We differ from these prior works by learning the policy directly on the dense per-pixel features and object masks produced by the Slot Attention model. This is particularly important in the context of multi-object manipulation, as the representations need to encode the locations of multiple objects in the scene along with the end-effector. Additionally, learning slot-based representations does not require multi-view cameras [Sermanet2017TCN, florence2019self, manuelli2020keypoints] or canonical images of objects [kulkarni2019unsupervised, yuan2021sornet].

Slot Attention models use a sequential attention-based mechanism to group low-level features in a scene, where each group falls into a slot bin [locatello2020objectcentric]. The authors show that the model can segment objects in an unsupervised manner. Inspired by this architecture, we propose to use these models, which can learn the extent of multiple objects in a scene, and apply the resulting representations to a variety of downstream robotic tasks. Our hypothesis is that the information abstracted by Slot Attention can improve sample efficiency and performance in downstream training: because it is object-aware, it is well suited to extracting information from multi-object scenes. We test this hypothesis on the tasks of object localization and multi-object goal-conditioned policy learning, discussed in detail in the next sections. Our model is trained in multiple stages. First, we train Slot Attention in object discovery mode in a self-supervised manner. Then, we freeze the weights and use these learned representations to train a small downstream network for each task. We train our models on data of a robot interacting with blocks of different shapes and colors placed on a table in a simulation environment. Using this setup, we study the performance gained by using these representations in different data regimes. We observe that the masks and features learned by our model boost performance on both object localization and behavioral cloning. Particularly in the low-data regime, our features result in a 20% improvement in task completion success rate.

To summarize, the contributions of our work are as follows:

  1. We show that our Slot Attention-inspired representations encode the locations and properties of all objects in the scene, while object-agnostic self-supervised methods such as MoCo [he2019moco] tend to focus on only a few objects.

  2. We show that our method needs fewer supervised action labels to learn policies (i.e. it is more sample efficient) and learns policies that have faster training convergence than alternative state-of-the-art methods.

2 Related Work

Self-supervised Representation Learning. Self-supervised learning has been immensely successful in training large models without any labels across different visual modalities: images [chen2020simple, he2019moco], optical flow [jonschkowski2020matters], and videos [wang2015unsupervised, han2020coclr]. As these methods do not rely on manual labeling, they are well suited for learning features from videos of robots interacting with objects. Self-supervised losses tend to be either contrastive or reconstruction-based. In this work, we compare features learned with these two losses as input to a policy-learning network. In particular, we focus on a class of models that uses a reconstruction loss but also a notion of slots/objects that induces grouping of similar pixels into slots in a self-supervised way [locatello2020objectcentric].

Self-supervised Learning for Robotics. Several previous papers have explored self-supervised learning methods for robotics. [finn2016deep] learned features as input to policies using a spatial-softmax bottleneck layer and a reconstruction loss. [Sermanet2017TCN, dwibedi2018learning] proposed a self-supervised approach for learning embeddings based on metric learning using videos from multiple cameras; these embeddings were shown to be useful for reward calculation or as input to reinforcement learning methods. [jang2018grasp2vec] also used self-supervised embeddings to improve grasping. [lee2019making] used self-supervised learning of multimodal representations for contact-rich manipulation tasks. [deng2020self] proposed a self-supervised approach to improve 6D pose estimation of objects by using a robot that interacts with them. [zakka2021xirl] explored how features from self-supervised learning methods can be used as a reward for training unseen reinforcement learning agents.

Object-centric Representations for Robotics. Various types of object-centric representations have been explored as inputs to policies in robotics; although these sometimes use supervised learning [devin2018deep, wang2019deep, manuelli2019kpam], they can be self-supervised [jang2018grasp2vec] as well. In this respect, our work is similar in spirit to works that use self-supervision to acquire embeddings interpreted as keypoints, learned either through autoencoding bottlenecks [finn2016deep, kulkarni2019unsupervised] or through multi-view consistency [florence2018dense, florence2019self, manuelli2020keypoints]. In contrast to these prior papers, however, we directly use dense features (at the same resolution as the input image) rather than sparsifying the representation down to keypoints. Another closely related work is APEX [wu2021apex], which shows how segmentation masks can improve performance when using a heuristic policy. In contrast, we study the utility of Slot Attention for end-to-end behavioral cloning, without the need for heuristics in policy learning. We find that these features are crucial for task success as the number of objects in the scene increases.

Imitation Learning. Past work [florence2019self] has explored the use of self-supervised features to improve the performance of behavior cloning models. In this work, we also investigate the effectiveness of self-supervised features, but in the context of a modern behavior cloning algorithm, Implicit Behavior Cloning (IBC) [florence2021implicit], which was shown to outperform standard behavior cloning on a number of robotic tasks.

3 Approach

Our overall framework learns object-aware representations from unlabeled videos using Slot Attention [locatello2020objectcentric]. We then freeze the weights of the representation architecture and use features from this model for the downstream robotic tasks of object and end-effector localization (Section 4.3) as well as policy learning (Section 4.5). We show an overview of our approach in Figure 1.

Figure 2: Qualitative example of masks learned by Slot Attention. Slot Attention is able to localize objects and the end-effector by observing interaction videos without using any mask-level supervision.

As our representation method, we use an image encoder that takes an RGB image as input and outputs an embedded feature representation. Given an input batch of N images, we train using a variation of the Slot Attention network [locatello2020objectcentric], which consists of a convolution-based encoding architecture followed by an attention mechanism [vaswani2017attention] with Gated Recurrent Units (GRUs). This architecture groups the features of an image into K slots, where K is a hyperparameter. The attention mechanism [vaswani2017attention] is normalized over the slots. This makes the slots compete with each other, so that each specializes in explaining a different component of the image. This property results in self-supervised decomposition of low-level image features into abstract groups. Given these slots, an upsampling convolutional decoder spatially broadcasts and reconstructs the image slots as well as the corresponding spatial alpha masks. The masks are then used as weights to sum the per-slot reconstructions into a single combined reconstructed image for each input. We train this network in a self-supervised manner using the L2 pixel-wise difference between the reconstructed image $\hat{x}_n$ and the original image $x_n$:

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left\lVert \hat{x}_n - x_n \right\rVert_2^2$$
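The slot competition described above can be sketched in a few lines. The following is a simplified NumPy illustration of a single attention iteration, omitting the GRU and MLP updates of the full model; the weight names (`W_q`, `W_k`, `W_v`) are illustrative placeholders, not the authors' code.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats, W_q, W_k, W_v, eps=1e-8):
    """One simplified Slot Attention iteration (no GRU/MLP update).

    slots: (K, D) current slot vectors
    feats: (N, D) flattened per-pixel encoder features
    Returns updated slot vectors (K, D) and the attention map (K, N).
    """
    q = slots @ W_q                    # (K, D) queries from slots
    k = feats @ W_k                    # (N, D) keys from pixels
    v = feats @ W_v                    # (N, D) values from pixels
    scale = q.shape[1] ** -0.5
    logits = q @ k.T * scale           # (K, N)
    # Key difference from standard attention: the softmax is normalized
    # over the SLOT axis, so slots compete to explain each pixel.
    attn = softmax(logits, axis=0)     # (K, N); each column sums to 1
    # Weighted mean of values per slot.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    updates = weights @ v              # (K, D)
    return updates, attn
```

Normalizing over slots rather than pixels is what yields the exclusive per-pixel grouping that the masks in Figure 2 exhibit.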


We modified the Slot Attention architecture in two ways. First, like [yang2021self], we initialize the slots to be learnable fixed vectors instead of samples from a learned Gaussian distribution. We found this change to mitigate slot swapping between objects, as the noise in the Gaussian setting could lead to permutations. Second, we use convolutions followed by upsampling instead of transposed convolutions to prevent checkerboard artifacts [odena2016deconvolution]. Please refer to [locatello2020objectcentric] for more training details of the Slot Attention algorithm. Figure 2 shows qualitative examples of the performance of this model trained on our robotic dataset.
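The alpha-mask blending and L2 objective used by the decoder can be illustrated with a short sketch. This is a minimal NumPy version assuming per-slot RGB reconstructions and unnormalized alpha logits; the function names are ours, not the authors' implementation.

```python
import numpy as np

def combine_slots(slot_recons, alpha_logits):
    """Blend per-slot reconstructions into one image.

    slot_recons:  (K, H, W, 3) per-slot RGB reconstructions
    alpha_logits: (K, H, W) unnormalized alpha masks from the decoder
    The alphas are softmax-normalized across slots so the per-pixel
    weights sum to one, then used to blend the slot images.
    """
    a = np.exp(alpha_logits - alpha_logits.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)             # (K, H, W)
    return (a[..., None] * slot_recons).sum(axis=0)  # (H, W, 3)

def recon_loss(recon, image):
    """Mean squared (L2) pixel-wise reconstruction loss."""
    return float(((recon - image) ** 2).mean())
```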

After training this network, we freeze the weights and use the output of the convolution-based encoder as the representation in our policy learning tasks. The pre-trained frozen Slot Attention models used in the following experiments were trained for steps with a batch size of on images of size by and slots, with random seed initialization unless otherwise noted. For the experiments in which we vary the fraction of the data available, the slot representations are also trained on only the selected fraction of data. For the localization task, we use the dense masks as input to the downstream network. These masks encode the locations of objects in pixel space in the form of slots.

4 Experiments and Discussion

The primary goal of our experiments is to evaluate whether self-supervised object-aware representations learned using Slot Attention provide a performance gain for robotic tasks. To quantify this benefit, we compare various representation learning techniques on tasks such as object localization and multi-object goal-conditioned policy learning.

4.1 Baselines

We compare our method with the following baselines:

MoCo [he2019moco]. We train an encoder using a contrastive loss in a self-supervised manner as outlined in [chen2020improved]. The output of the encoder is spatially averaged to produce an embedding. MoCo uses this contrastive loss to learn invariances to various data augmentations of an image. In the context of this loss, positives are provided by the embedding from a momentum encoder given an augmented version of the same image, and negatives are sampled from a queue of past embeddings kept in memory. We use this method because we want to compare the performance of contrastive losses and reconstruction losses (like that used in Slot Attention) for robotic tasks. We use an embedding size of 128, a queue size of 16384, a softmax temperature of 0.1, and a batch size of 16 to train the MoCo encoder.
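As a sketch of the contrastive objective underlying MoCo, the following computes an InfoNCE-style loss for a single query against one positive key and a queue of negatives. The function name and the exact normalization details are illustrative, not MoCo's actual code.

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, temperature=0.1):
    """Contrastive (InfoNCE) loss for one query, MoCo-style.

    q:      (D,) embedding of an augmented view (query encoder)
    k_pos:  (D,) embedding of another augmentation of the SAME image
            (momentum encoder) -- the positive
    queue:  (M, D) past embeddings serving as negatives
    Returns the cross-entropy with the positive at index 0.
    """
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits = np.concatenate([[q @ k_pos], queue @ q]) / temperature
    logits = logits - logits.max()  # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))
```

Note that nothing in this objective forces the embedding to encode all objects; it suffices to capture whatever most reliably distinguishes augmented views, which helps explain the localization results in Section 4.4.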

Autoencoder. We train an encoder-decoder architecture using a reconstruction loss [pmlr-v27-baldi12a]. While this method shares the same loss function as the slot-encoder architecture, it differs in not using any notion of slots or objects. Comparing against this baseline isolates the importance of the Slot Attention module relative to the reconstruction loss when learning representations.

4.2 Environment

We use a robotic environment implemented in PyBullet [coumans2016pybullet]. In our setup, a robot arm is attached to a fixed base such that it can manipulate objects in front of it on a table. A cylinder is attached to the end of the arm which serves as the end-effector to push around blocks of different shapes and colors. The robot arm is constrained to move on a 2D plane. We use this environment for collecting the data for all the experiments in the following sections.

4.3 Object Localization: Task and Metrics

In this experiment, we investigate whether the Slot Attention architecture can encode information about all the objects present in a scene. This property is important for learning policies on datasets with multiple objects and for tasks that depend on full scene information, such as object localization. We evaluate the representations on the object localization task in a simulated environment that provides ground truth object locations. We use this ground truth only during downstream task training, not for representation learning.

We learn representations from a dataset of demonstrations collected in the simulation environment described in Section 4.2. Then, we freeze the weights of the representation network and train an MLP to predict the location of the center of each block in robot coordinates. This MLP consists of two fully connected layers of size 256 and outputs the 2D locations of all 8 blocks and the end-effector. For Slot Attention, the input to the downstream MLP is the center of mass of each predicted slot mask. For the MoCo baseline, we use the output of MoCo's encoder as input to this MLP. For the Autoencoder baseline, since the MLP requires a vector input, we apply a global average pooling layer to the output of the encoder CNN and use the result as the input to the MLP.
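The center-of-mass computation used to convert slot masks into MLP inputs can be sketched as follows; this is a minimal NumPy version and the function name is ours.

```python
import numpy as np

def mask_centers_of_mass(masks, eps=1e-8):
    """Per-slot center of mass of alpha masks.

    masks: (K, H, W) non-negative attention masks (need not be normalized).
    Returns a (K, 2) array of (row, col) centers in pixel coordinates.
    """
    K, H, W = masks.shape
    rows = np.arange(H)[:, None]  # (H, 1) row indices
    cols = np.arange(W)[None, :]  # (1, W) column indices
    total = masks.sum(axis=(1, 2)) + eps          # (K,) mask mass
    r = (masks * rows).sum(axis=(1, 2)) / total   # weighted mean row
    c = (masks * cols).sum(axis=(1, 2)) / total   # weighted mean col
    return np.stack([r, c], axis=1)
```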

To have an interpretable evaluation metric, we use Percentage of Correct Keypoints (PCK) [yang2012articulated], which captures the percentage of times an object's location is predicted correctly. This metric considers an object correctly localized if the predicted coordinates are within a given threshold of the ground truth coordinates. It is commonly used in computer vision research to evaluate the localization of human and object keypoints [yang2012articulated]. We chose a PCK threshold of 0.1 of the length of the table (about 5 cm).
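PCK itself is straightforward to compute; a minimal sketch under the definition above:

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints.

    pred, gt:  (N, 2) predicted and ground-truth 2D locations
    threshold: Euclidean distance below which a prediction is correct
    Returns the fraction of correctly localized points in [0, 1].
    """
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist < threshold).mean())
```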

We test the performance of our method with 1, 4, and 8 blocks. For the Slot Attention models, we use 7, 11, and 11 slots for the 1-, 4-, and 8-block cases respectively. The number of slots was chosen based on the reconstruction loss when pretraining the model, independent of the downstream task. We train models on a dataset of 160k trajectories with an image size of for 150k steps each, with a batch size of 16, on 1 V100 GPU. We use the checkpoint with the lowest loss during representation learning for downstream training of object localization.

4.4 Object Localization: Results

Slot Attention outperformed the baselines in all object localization experiments, especially in the multi-object cases, as seen in Tables 1-3 and Figure 4. We further made the following observations:

Baselines struggle with localizing multiple objects. Tables 1 and 2 show the performance of our method for the 1- and 4-block cases respectively, and Table 3 shows performance in the 8-block scenario. We observe that while MoCo and the Autoencoder are able to predict object location fairly accurately on a dataset with a single object, they struggle to encode object locations in multi-object scenes. In Figure 4, we also show how average object localization performance varies with the number of objects in the scene for the different representations. Figure 3 shows a qualitative example of the performance of these models on an example evaluation image.

Slot Attention struggles with localizing objects of similar color and shape. We also observe that the Slot Attention model finds it difficult to localize blocks of the same color with only fine-grained differences in shape. In Table 3, the Slot Attention model gives poor localization performance for the two yellow and two green objects that have similar shapes. This is because Slot Attention is trained with a pixel-wise image reconstruction loss: subtle shape differences contribute only a small number of pixels to the loss, so the model struggles to differentiate them when the colors match. Nevertheless, Slot Attention still outperforms the other methods on this task.

Figure 3: Qualitative example of the model’s performance on object location prediction. As a baseline, the performance of a representation learned using MoCo is shown in the middle column. In a perfect prediction, shapes should overlap with the corresponding matching circles of those objects. It can be seen that this is the case for Slot Attention in the 1- and 4-block cases. For the 8-block case, the predicted locations are closer to the ground truth than those of the MoCo baseline.
Figure 4: Performance comparison on object localization versus the number of blocks present in the scene. The baselines are able to learn the location of only one object in the scene (the end-effector), resulting in a low average performance that decreases as the number of blocks increases. Slot Attention is able to localize multiple objects but sometimes struggles with objects of the same color but different shapes in the 8-block scenario.
Input to object localizer Mean End Effector Blue Cube
MoCo 61.5 83.1 39.9
Autoencoder 49.0 82.5 15.5
Slot Attention 95.8 97.2 94.4
Table 1: Performance of different self-supervised representations on downstream task of object location prediction (1 block case)
Input to object localizer   Mean   End Effector   Red Moon   Blue Cube   Green Star   Yellow Pentagon
MoCo 22.9 66.3 15.2 14.7 10.3 8.2
Autoencoder 21.2 57.6 12.4 11.1 11.8 13.2
Slot Attention 95.1 94.7 94.6 96.4 94.5 95.2
Table 2: Performance of different self-supervised representations on downstream task of object location prediction (4 blocks case)
Input to object localizer   Red Moon   Blue Cube   Green Star   Yellow Pentagon   Red Pentagon   Blue Moon   Green Cube   Yellow Star   End Effector   Mean
MoCo   9.9   13.7   10.5   7.4   11.9   13.1   9.8   9.1   58.4   16.0
Autoencoder   9.1   14.4   7.3   9.9   12.0   13.7   8.5   8.3   41.0   13.8
Slot Attention   79.8   79.4   20.2   18.4   86.9   83.0   23.0   15.6   60.1   51.8
Table 3: Performance of different self-supervised representations on downstream task of object location prediction (8 blocks case)
No. of training episodes   1000   2000   3000   10000
Input to Policy   Mean SD   Mean SD   Mean SD   Mean SD
RGB 36.9 5.8 78.9 8.4 88.8 3.2 95.1 2.8
RGB+Groundtruth Segmentation 78.5 5.0 93.5 1.3 94.8 1.3 95.4 0.8
Autoencoder 46.0 3.2 61.3 9.8 83.5 3.8 85.8 6.1
RGB+Autoencoder 49.0 13.1 76.9 5.3 66.5 32.3 92.6 1.9
Slot Attention 57.1 1.5 86.0 1.2 92.4 1.9 95.0 1.7
RGB + Slot Attention 53.3 9.9 87.8 4.6 92.6 2.6 95.0 1.3
Table 4: Performance comparison on policy learning using the rate of successful task completion as metric. Bold values show the method with maximum performance without access to ground truth information. The method with access to ground truth segmentation provides an upper bound for this task.
Figure 5: Example of a task demonstration. Here the task is moving the red pentagon to the pole.
Figure 6: Qualitative example of masks learned by Slot Attention on real-world data. Slot Attention correctly groups the pixels corresponding to the two blocks in the scene as well as the end-effector. For visualization purposes, a white background was chosen for slot number 2.

4.5 Multi-object Goal-conditioned Policy Learning: Task and Metrics

In this experiment, we compare different representation learning techniques by studying their effectiveness as inputs to a policy learning method. We consider only the imitation learning setup and learn a policy from a dataset of expert demonstrations. The task is to manipulate one of 8 blocks (the target block) on the plane to a target location marked by a purple rod. When the block is within 0.05 units of the rod, the episode is considered a success; if the robot fails to move the target block to the rod within 200 steps, the episode is considered a failure. Figure 5 shows an example demonstration of the task in the environment described in Section 4.2. We use Implicit Behavior Cloning [florence2021implicit] to learn policies. During evaluation, we run the policy from 200 different initial configurations and measure how often the policy successfully moves the required block to within the tolerated distance of the target location. We run policy training with 4 random seeds and report the mean and standard deviation of the success rates over the 4 runs. We present the results in Table 4. To compare different methods, we keep the policy learning method fixed while varying the input representation between RGB, RGB + Ground Truth (GT) Segmentation, Slot Attention, RGB + Slot Attention, Autoencoder, and RGB + Autoencoder. To compare the representation learning techniques fairly, we take the penultimate layer of the CNN encoder (before the spatial average pooling) and resize it to match the input RGB image; we optionally concatenate these features to the RGB image that is provided as input to the policy learning network. We also experimented with using MoCo features as input to IBC but were not able to train it to convergence. Recall that MoCo is trained to minimize a contrastive loss, and succeeding at this does not require the final representation to capture information about all objects in the scene: the loss can be minimized by focusing on objects that move frequently, such as the robot arm. The lack of convergence we observed for MoCo applied to IBC is likely due to MoCo features not reliably localizing the purple rod, which is needed to solve the task. We refer the reader to [florence2021implicit] for additional details on IBC training.
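For context, an implicit policy selects actions by minimizing a learned energy over actions rather than regressing them directly. The following is a heavily simplified, derivative-free sketch of such inference by uniform sampling; the actual IBC method [florence2021implicit] uses more sophisticated stochastic optimization, and all names here are illustrative.

```python
import numpy as np

def ibc_action(energy_fn, obs, action_low, action_high,
               n_samples=256, rng=None):
    """Implicit-policy inference sketch: argmin over sampled actions.

    energy_fn:  callable (obs, action) -> scalar energy; lower is better
    obs:        observation passed through to the energy function
    action_low, action_high: per-dimension bounds of the action space
    Returns the lowest-energy candidate among uniform samples.
    """
    rng = rng or np.random.default_rng(0)
    dim = len(action_low)
    candidates = rng.uniform(action_low, action_high,
                             size=(n_samples, dim))
    energies = np.array([energy_fn(obs, a) for a in candidates])
    return candidates[np.argmin(energies)]
```

A toy quadratic energy centered on a target action illustrates that the sampled argmin approaches the true minimizer as the sample budget grows.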

4.6 Multi-object Goal-conditioned Policy Learning: Results

We make the following observations based on results in Figure 7 and Table 4:

Figure 7: (a) Validation performance of different methods during IBC training in the low-data regime (1000 episodes). Using Slot Attention leads to a 20 percent performance increase; as an upper bound, using ground truth segmentation masks results in about a 40 percent improvement. Solid lines show the mean across 4 seeds and the shaded area indicates 1 standard deviation on each side. (b) Performance comparison on policy learning versus the number of episodes in the training data. Slot Attention-based representations provide a performance boost in the low-data regimes.

Better perception inputs lead to more sample-efficient policies. As shown in Table 4, using the GT semantic segmentation of all blocks as input, the policy learning method can learn high-performing policies (over 90 percent success rate) with few samples (2000 episodes). The policy with raw RGB, however, needs somewhere between 3000 and 10000 episodes to achieve the same performance. This motivates looking for better input representations than raw RGB for IBC. In the following experiments, we used a slot model trained with 16 bins, since it had the lowest evaluation reconstruction loss during Slot Attention training. Both models were trained to convergence.

Slot Attention provides a performance boost in low data regimes. We observe that Slot Attention models improve the task completion success rate over using raw RGB as input. Slot Attention models are object-aware, and by using this prior we are able to learn representations from demonstration videos that improve performance without collecting object bounding boxes or segmentation masks from humans. We also note that the performance gain of Slot Attention over the baselines decreases as the number of samples available for policy learning increases, as shown in Figure 7.

Slot Attention performs better than the Autoencoder. Slot Attention outperforms the Autoencoder, which uses the same loss as the Slot Attention model but lacks the object/slot prior in its architecture. This shows that the object/slot prior is important for the performance gains.

Figure 8: Effect of chosen slot bin numbers on IBC policy performance. Here we are using Slot Attention features in 1000 episodes data regime.

4.7 Ablation: effect of number of slot bins

The number of slots is an important design choice in the Slot Attention architecture. This hyperparameter can be set by examining the reconstruction loss during representation pretraining as well as by evaluating on a validation set for the downstream tasks. Figure 8 shows the effect of varying the number of slots (K in Section 3). In a scene of about 12 components (8 blocks, 1 pole, 1 robot arm, 1 table, 1 background), it is reasonable to expect that at least 12 bins are required to consistently detect all objects. However, we notice performance improvements up to a larger number of slots, followed by a drastic decrease, a known shortcoming [locatello2020objectcentric] of the original algorithm, which struggles with high slot counts.

5 Limitations

General information about the approximate number of objects in the scene is needed to find the optimal number of slots. As observed in the original Slot Attention paper, using too few or too many slots degrades performance: with too few slots the model cannot properly segment the scene, and with too many it divides single objects across multiple slots. However, the reconstruction loss during Slot Attention training can serve as a general guide for choosing this number. Furthermore, it might be possible to determine the optimal number of slots in simulation and use that information when learning slots on real data, which we will investigate in future work. Increasing the number of slots also increases training time. Like other self-supervised methods, Slot Attention sometimes struggles when objects share the same color but differ in shape. It can also confuse multiple instances of the same-looking object, leading to slot swapping; however, the spatial information in the slot masks can potentially be used to differentiate such objects when only one specific instance is the intended target.

Similar to the baseline methods, the algorithm will also struggle with generalization to unseen objects with significantly different shape/color than those in training. However, the original Slot Attention paper showed evidence of the model being capable of generalizing to scenes with more objects than it was originally trained on at test time using the CLEVR6 dataset [locatello2020objectcentric].

Slot Attention based models are able to push the performance of RGB-only models closer to the upper bound of models that have access to the ground truth segmentation masks. This suggests that self-supervised object-aware representations are a promising sample-efficient direction for augmenting visual input when learning policies.

6 Conclusion and Future Work

In this paper, we presented a method to improve the performance of multi-object goal-conditioned behavior cloning policies using the Slot Attention architecture. We find that the features and masks from this model are especially useful in the low-data regime, which is particularly pertinent to deploying machine learning models on real-world robots. In future work, we would like to explore the advantages of this method in real-robot experiments. As preliminary evidence, Figure 6 shows a qualitative example of the Slot Attention algorithm successfully localizing objects in a real scene, trained with no labels on data from [florence2021implicit]. Even though shadows and the wooden pattern of the table complicate the task, Slot Attention is still able to group the pixels of each moving object successfully.

7 Acknowledgements

Toyota Research Institute (”TRI”) provided partial funds to support this work, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.