A crucial question for complex multi-step robotic tasks is how to represent relationships between entities in the world, particularly as they pertain to preconditions for the various skills a robot might employ. In goal-directed sequential manipulation tasks with long-horizon planning, it is common to use a state estimator followed by a task and motion planner or another model-based system [7, 8, 1, 24, 21, 26]. A variety of powerful approaches exist for explicitly estimating the state of objects in the world, e.g. [29, 27, 17, 2], but it is challenging to generalize them to arbitrary collections of objects. In addition, objects are often in contact in manipulation scenarios, a setting where even works that explicitly address generalization to unseen objects [31, 30] still struggle.
Fortunately, knowing the exact poses of objects may not be necessary for manipulation. End-to-end methods [18, 16, 5] leverage this fact and build networks that generate actions directly from sensor inputs without explicitly representing objects. However, these networks are highly specific to the tasks they are trained on; for example, it is non-trivial to reuse a network trained to stack blocks for unstacking them.
In this work, we take an important step towards a manipulation framework that generalizes zero-shot to novel tasks with unseen objects. Specifically, we propose a neural network that extracts object-centric embeddings directly from raw RGB images, which we call SORNet (Spatial Object-Centric Representation Network). The design of SORNet allows it to generalize to novel objects and tasks without retraining or finetuning. Trained on large amounts of simulated robotic manipulation data, the object-centric embeddings produced by SORNet can provide a task and motion planner with implicit object state information relevant to goal-directed sequential manipulation, e.g. logical preconditions for primitive skills or continuous 3D directions between entities in the scene.
To summarize, our contributions are: (1) a method for extracting object-centric embeddings from RGB images that generalizes zero-shot to different numbers and types of objects; (2) a framework for learning object embeddings that capture continuous spatial relationships from only logical supervision; (3) a dataset containing sequences of RGB observations labeled with spatial predicates during various tabletop rearrangement manipulation tasks.
We empirically evaluate the object-centric embeddings produced by SORNet on three different downstream tasks: 1) classification of logical predicates serving as action preconditions; 2) visual question answering about spatial relationships; 3) prediction of the relative 3D direction between entities in manipulation scenes. In all three tasks, the models are tested on held-out objects that did not appear in the training data, and SORNet obtains significant improvements over the baseline methods. Finally, we evaluate SORNet in real-world robot experiments to showcase the transfer of the learned object-centric embeddings to real-world observations.
2 Related Work
Sequential Manipulation In goal-directed sequential manipulation tasks with long-horizon planning, one class of work uses a pipeline approach with a state estimator followed by a task and motion planner [7, 8, 1, 24, 21, 26], but a state estimator with an explicit state representation is often restricted to a particular environment or set of objects.
Another class of work employs neural networks to learn motor controls directly from raw sensor data, such as RGB images and joint encoder readings [16, 10, 5, 12, 20, 34], but they are often not transferable to new tasks, especially tasks involving long-horizon planning. Our work combines a network that estimates implicit states with a symbolic planner to obtain the best of both worlds: generalization to new objects and to new tasks.
Learning Spatial Relationships Learning spatial relationships between object entities has been studied in 3D vision and robotics. Methods such as [23, 6, 25] predict discrete or continuous pairwise object relations from 3D inputs such as point clouds or voxels, assuming complete observation of the scene and segmented objects with known identities. In contrast, our approach makes no assumptions about the observability of the objects and requires no pre-processing of the sensor data. The learning framework of Kase et al. is most closely related to ours: it takes a sequence of sensor observations and classifies a set of pre-defined relational predicates, which are then used by a symbolic planner to produce a suitable operator. Compared to our approach, theirs is limited to a fixed number of objects in the scene and a fixed set of spatial predicates.
Spatio-Temporal Reasoning Spatio-temporal visual reasoning has also been studied using transformer networks. Toward solving spatio-temporal reasoning tasks from CLEVRER and CATER, Ding et al. proposed an object-based attention mechanism and Zhou et al. proposed a multi-hop transformer model. Both works assume a segmentation model that produces object segments and perform language grounding on the segments to carry out reasoning. Our SORNet architecture is simpler and can solve spatial-reasoning tasks for unseen object instances without requiring a segmentation or object detection module. Furthermore, our work focuses on a relatively complex manipulation domain with a manipulator present in the observations. Although our current work predicts spatial relations from a single RGB frame, the object-centric embeddings could potentially be used for solving temporal-reasoning tasks as well.
3.1 SORNet: Spatial Object-Centric Representation Network
Our object embedding network, SORNet (Fig. 2), takes an RGB image and an arbitrary number of canonical object views and outputs an embedding vector corresponding to each input object patch. The architecture is based on the Vision Transformer (ViT). The input image is broken into a list of fixed-size patches, which we call context patches. The context patches are concatenated with the canonical object views to form a patch sequence. Each patch is flattened and linearly projected into a token vector, and a positional embedding is added to the sequence of tokens. Following ViT, we use a set of learnable vectors with the same dimension as the token vectors as positional embeddings. The positional-embedded tokens are then passed through a transformer encoder consisting of multiple layers of multi-head self-attention, which outputs a sequence of embedding vectors. We discard the embeddings for the context patches and keep those for the canonical object views.
We apply the same positional embedding to the canonical object views to make the output embeddings permutation equivariant. We also mask out the attention among canonical object views and the attention from context patches to canonical object views to ensure the model uses information from the context patches to make predictions. In this way, we can pass in an arbitrary number of canonical object views in arbitrary order without changing model parameters during inference.
Intuitively, the canonical object views act as queries while the context patches serve as keys from which values encoding spatial relationships are extracted. Note that the canonical object views are not crops from the input image, but arbitrary views of the objects that may not match the objects’ appearance in the scene. Our model learns to identify objects even under drastic changes in lighting, pose and occlusion. Fig. 3 shows some examples of canonical object views used in our experiments.
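The token layout and masking described above can be sketched as a boolean attention mask. This is a simplified NumPy illustration under our own naming, not the paper's implementation: True marks an allowed query-to-key pair, object-view tokens attend only to context patches (and themselves), and context patches never attend to object views.

```python
import numpy as np

def build_attention_mask(n_context, n_views):
    """Boolean attention mask over the token sequence
    [context patches..., canonical object views...].
    True = attention allowed."""
    n = n_context + n_views
    mask = np.zeros((n, n), dtype=bool)
    # Context patches attend to all context patches.
    mask[:n_context, :n_context] = True
    # Each canonical object view attends to the context patches...
    mask[n_context:, :n_context] = True
    # ...and to itself only, never to the other object views.
    mask[n_context:, n_context:] = np.eye(n_views, dtype=bool)
    return mask

mask = build_attention_mask(n_context=4, n_views=2)
```

Because the mask treats every object-view token identically and independently, views can be appended in any number and order without changing model parameters, which is what enables the permutation equivariance discussed above.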
3.2 Predicate Classifier
The predicate classifier (Fig. 4) predicts a list of predicates from the object embeddings. The predicates can be logical statements about spatial relationships, e.g. whether the blue block is on the left part of the table, or continuous quantities, such as the direction the end effector should move to reach the red block. The predicate classifier consists of a collection of 2-layer MLPs, one per type of relationship. Here we focus on unary and binary predicates. Unary predicates involve a single object, or an object and the environment, which could be the robot or a region on the table. Binary predicates involve two objects and, optionally, the environment. In principle, our framework extends to predicates involving more than two objects, but we leave that for future work.
The MLP for unary predicates takes the list of object embeddings produced by SORNet and outputs predicates pertaining to the object each embedding is conditioned on. Taking the top_is_clear classifier as an example, if the input embedding is conditioned on the blue block, the MLP outputs whether any object is on top of the blue block; if conditioned on the red block, it outputs whether any object is on top of the red block.
The MLP for binary predicates takes a list of binary object embeddings, created by concatenating pairs of object embeddings produced by SORNet, and outputs predicates corresponding to a pair of objects, e.g., whether the blue block is on top of the red block. Thus, with n object embeddings, there will be n(n−1) binary object embeddings.
Parameters of the predicate classifier are independent of the number of objects; the number of output predicates changes dynamically with the number of input object embeddings. For example, with 7 unary MLPs and 2 binary MLPs, the model predicts 7×4 + 2×4×3 = 52 predicates for 4 objects and 7×5 + 2×5×4 = 75 predicates for 5 objects. In this way, our model generalizes zero-shot to scenes with an arbitrary number of objects.
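The predicate-count arithmetic and the pairing of embeddings for the binary MLPs can be sketched as follows. This is an illustrative sketch under the assumptions in the text (7 unary and 2 binary MLPs, ordered pairs of distinct objects); the function names and the 768-dimensional embedding size are our own.

```python
import numpy as np
from itertools import permutations

def predicate_count(n_objects, n_unary=7, n_binary=2):
    """Each unary MLP fires once per object; each binary MLP fires
    once per ordered pair of distinct objects."""
    return n_unary * n_objects + n_binary * n_objects * (n_objects - 1)

def binary_inputs(embeddings):
    """Concatenate every ordered pair of object embeddings to form
    the inputs of the binary-predicate MLPs."""
    return np.stack([np.concatenate([embeddings[i], embeddings[j]])
                     for i, j in permutations(range(len(embeddings)), 2)])

emb = np.random.rand(4, 768)   # 4 object embeddings (dimension assumed)
pairs = binary_inputs(emb)     # 4*3 = 12 ordered pairs, each of dim 1536
```

Note that the count for 6 objects, 7×6 + 2×6×5 = 102, matches the 102-predicate setting reported in the experiments.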
4 The Leonardo Dataset
To test the generalization of our model to new objects and tasks, we created a simulated tabletop environment named Leonardo, where a Franka Panda robot manipulates a set of randomly colored blocks. The robot is given a goal formulated as a list of predicates to be satisfied, then uses a simple task planner to find a sequence of actions achieving that goal and a sampling-based motion planner to generate trajectories and choose grasps. Since we know the ground-truth poses of the blocks in the simulator, we can compute ground-truth logical predicates at every step of the planning process. We used NVISII to render the RGB observations. Domain randomization, including random lighting, backgrounds and perturbations of the camera position, is applied during rendering.
The training data contains 133,796 sequences of a single task: stacking 4 blocks into a tower. The block colors are randomly chosen from 405 xkcd colors (https://xkcd.com/color/rgb/). The testing data contains 9,526 sequences with 4–6 blocks whose colors are not included in the training data: red, green, blue, yellow, aqua, pink, purple. The test set consists of 7 tasks different from training. Fig. 5 shows some examples. Please see the supplement for a full description of the testing tasks.
5 Experiments
5.1 Spatial Relationship Classification on CLEVR-CoGenT
We first evaluate our approach on a variant of the CLEVR dataset, a well-established benchmark for visual reasoning. CLEVR contains rendered RGB images with at most 10 objects per image. There are 96 different objects in total (2 sizes, 8 colors, 2 materials, 3 shapes). Each image is labeled with 4 types of spatial relationships (right, front, left, behind) for each pair of objects.
Specifically, we use the CoGenT version of the dataset, which stands for Compositional Generalization Test, where the data is generated in two different conditions. In condition A, cubes are gray, blue, brown, or yellow and cylinders are red, green, purple, or cyan. Condition B is the opposite: cubes are red, green, purple, or cyan and cylinders are gray, blue, brown, or yellow. Spheres can be any color in both conditions. The models are trained on condition A and evaluated on condition B. The training set (trainA) contains 70K images and the evaluation set (valB) contains 15K images. Several prior works [33, 14] show significant generalization gap on CLEVR-CoGenT caused by the visual model learning strong spurious biases between shape and color.
We generate a question for each spatial relationship in the image, e.g. “Is the large red rubber cube in front of the small blue metal sphere?” We filter out any query that is ambiguous, e.g. when there are two large red cubes, one in front of and one behind the small blue sphere. This results in around 2 million questions each for the valA and valB sets. We compare against MDETR, which reports the state-of-the-art zero-shot result on CLEVR-CoGenT, i.e. without fine-tuning on any example from condition B. The results are summarized in Table 1. Our model performs drastically better at classifying spatial relationships of unseen objects and shows a much smaller generalization gap between the valA and valB sets.
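The ambiguity filter above can be sketched as a uniqueness check over object descriptions. This is a simplified, hypothetical illustration (attribute names and the subset-match formulation are our own; the actual filter also accounts for the spatial relation itself): a query is kept only when exactly one object in the scene matches the description used in the question.

```python
def unambiguous(objects, description):
    """Keep a query only if exactly one object matches the attribute
    description (subset match on attribute dicts)."""
    matches = [o for o in objects if description.items() <= o.items()]
    return len(matches) == 1

scene = [{"size": "large", "color": "red", "shape": "cube"},
         {"size": "large", "color": "red", "shape": "cube"},
         {"size": "small", "color": "blue", "shape": "sphere"}]
```

Here `{"color": "red", "shape": "cube"}` would be rejected because two objects match it.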
Unlike MDETR, which takes text queries, our model takes visual queries in the form of canonical object views (i.e. two canonical views for the objects mentioned in the question). To gauge the effect of this difference in query modality, we also report the performance of an MDETR model trained on the full CLEVR dataset, denoted MDETR-oracle. Although SORNet is trained only on condition A, it achieves performance similar to MDETR-oracle. The zero-shot generalization ability of our model can potentially be combined with other reasoning pipelines to improve generalization on other types of queries as well.
[Table 1 — columns: MDETR, MDETR-oracle, SORNet (ours)]
5.2 Predicate Classification
Next, we evaluate the task of predicate classification on the Leonardo dataset. We compare against 3 baselines that do not use object conditioning. The first two baselines use a ResNet18 and a ViT-B/32, respectively, to directly predict the 52 predicates. The last baseline uses the same architecture as ours, but the embedding tokens come from 4 fixed class embedding vectors. We report 3 metrics: accuracy, F-1 score and all-match accuracy, computed across all predicates and within each predicate category. Accuracy and F-1 score are computed per predicate and averaged. For all-match accuracy, the predicates are treated as a single vector, which most faithfully represents how the predicates are treated by a planner: if any predicate is classified incorrectly, the prediction for the entire category is considered wrong. The models are tested on images whose objects are completely unseen during training.
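The two accuracy variants can be made concrete with a short sketch (an illustration with our own function names, assuming boolean prediction/target matrices of shape examples × predicates):

```python
import numpy as np

def mean_per_predicate_accuracy(pred, target):
    """Accuracy computed per predicate (column), then averaged."""
    return float(np.mean(np.mean(pred == target, axis=0)))

def all_match_accuracy(pred, target):
    """Fraction of examples where the entire predicate vector is
    correct; one wrong predicate makes the whole example wrong."""
    return float(np.mean(np.all(pred == target, axis=1)))

pred = np.array([[True, True], [True, False]])
target = np.array([[True, True], [False, False]])
```

On this toy input the averaged accuracy is 0.75 while the stricter all-match accuracy is 0.5, which is why the all-match columns in the results are systematically lower.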
The results are summarized in Table 2. Models labeled M-View are trained on 3 different views of the scene as data augmentation; during testing, we aggregate the predictions from the 3 views by adding the logits. The M-View models handle occlusions better by leveraging multi-view information. In the SORNet variant denoted (G) in Table 2, the gripper state is concatenated to the object embedding, which improves classification of predicates correlated with the opening/closing of the gripper, e.g. has_obj. The non-object-conditioned baselines fail drastically when applied zero-shot to unseen objects. Even after fine-tuning on 100 examples, they still significantly underperform zero-shot SORNet. This demonstrates the generalizability our model gains from conditioning on canonical object views.
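The multi-view aggregation is a simple sum over per-view logits before thresholding. A minimal sketch (variable names are our own; the threshold at 0 corresponds to a sigmoid output of 0.5 on the summed logits):

```python
import numpy as np

def aggregate_views(view_logits):
    """Combine per-view predicate logits by summation, then
    threshold the summed logit at 0."""
    return np.sum(view_logits, axis=0) > 0

views = np.array([[ 1.0, -2.0],
                  [ 0.5, -0.1],
                  [-0.2,  3.0]])   # 3 views x 2 predicates
pred = aggregate_views(views)
```

A view in which an object is occluded tends to produce low-magnitude logits, so confident views dominate the sum.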
Further, we tested our model on scenes with 5 to 6 objects, while it has only been trained on 4-object scenes. We first treat the additional objects as distractors, which shows that it is not necessary to acquire canonical views of every object in the scene if they are not of interest. We then treat all objects as objects of interest. In this case, the number of binary predicates increases quadratically with the number of objects in the scene, so the all-match accuracy naturally drops even if the average accuracy and F-1 score remain the same. Note that none of the baselines can even be applied to these scenes with more objects without introducing additional model parameters and retraining the model.
Finally, we run our best-performing model on 30 real-world images of a robot performing various manipulation tasks. The quantitative results are in Table 2, and Fig. 6 shows two qualitative examples of the predicates predicted by our model. Our model transfers to the real world without losing much accuracy. It does make some mistakes in novel scenarios never seen during training, such as one block stacked in between two blocks (right plot in Fig. 6).
5.3 Skill Executability
We further evaluate predicate prediction in the context of task planning. Each frame in the Leonardo dataset is labeled with the primitive skill that will be executed next, e.g. grasp the red block. Each skill has a list of preconditions that need to be satisfied before it can be executed, which can be formulated as a vector of predicate values. In other words, if all predicates in the preconditions are classified correctly, we can correctly determine the executability of that skill. In Table 3, we report the accuracy of classifying the executability of skills in the Leonardo test set using the predicate predictions.
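The executability check reduces to comparing the predicted state against a skill's precondition vector. An illustrative sketch (predicate strings here are hypothetical, not the dataset's exact predicate set):

```python
def skill_executable(predicted_state, preconditions):
    """A skill is executable iff every predicate in its precondition
    list takes the required truth value in the predicted state."""
    return all(predicted_state.get(p) == v
               for p, v in preconditions.items())

# Hypothetical predicted state and the preconditions of a place skill.
state = {"has_obj(robot, red_block)": True,
         "top_is_clear(blue_block)": True}
can_place = skill_executable(
    state, {"has_obj(robot, red_block)": True,
            "top_is_clear(blue_block)": True})
```

Because one misclassified predicate flips the result, this evaluation is closely related to the all-match accuracy reported earlier.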
This evaluation puts more emphasis on the predicates relevant to the manipulation of objects, e.g. approach, aligned, which are rarely true in the training data. Our model is able to identify these predicates correctly whereas the baselines fail completely on skills relevant to object manipulation, i.e. grasp, align, place and lift.
| ResNet18 M-Head 0-shot | 52 | 88.5 | 80.0 | 88.4 | 76.2 | 91.7 | 99.2 | 92.9 |
| ViT-B/32 M-View 0-shot | 52 | 88.5 | 80.2 | 88.4 | 76.2 | 91.7 | 99.2 | 92.9 |
| ViT-B/32 M-Head M-View 0-shot | 52 | 88.5 | 80.2 | 88.4 | 76.0 | 91.7 | 99.2 | 92.9 |
| ResNet18 M-Head 100-shot | 52 | 88.5 | 80.0 | 88.4 | 76.4 | 91.7 | 99.1 | 92.9 |
| ViT-B/32 M-View 100-shot | 52 | 88.6 | 79.9 | 88.3 | 78.7 | 91.7 | 99.1 | 92.8 |
| ViT-B/32 M-Head M-View 100-shot | 52 | 92.8 | 89.9 | 90.4 | 87.8 | 92.4 | 99.2 | 93.5 |
| SORNet M-View 0-shot | 52 | 98.9 | 99.0 | 95.9 | 99.2 | 99.6 | 99.4 | 97.3 |
| SORNet M-View (G) 0-shot | 52 | 98.9 | 98.9 | 98.8 | 98.5 | 99.5 | 99.3 | 96.6 |
| SORNet M-View 1 distractor 0-shot | 52 | 98.3 | 98.6 | 96.6 | 95.2 | 98.6 | 99.5 | 97.5 |
| SORNet M-View 2 distractors 0-shot | 52 | 97.6 | 98.2 | 96.7 | 89.9 | 97.7 | 99.5 | 97.3 |
| SORNet M-View (G) 5 obj 0-shot | 70 | 98.5 | 98.5 | 99.4 | 95.8 | 98.2 | 99.6 | 97.4 |
| SORNet M-View (G) 6 obj 0-shot | 102 | 98.0 | 98.3 | 99.6 | 93.9 | 96.8 | 99.7 | 97.7 |
| SORNet M-View (G) real-world 0-shot | 52 | 96.3 | 96.4 | 96.7 | 93.3 | 97.1 | 96.7 | 95.6 |
| ResNet18 M-Head 0-shot | 52 | 0.0 | 0.3 | 53.5 | 30.4 | 30.1 | 89.8 | 71.6 |
| ViT-B/32 M-View 0-shot | 52 | 0.0 | 0.4 | 53.5 | 30.3 | 30.1 | 89.8 | 71.6 |
| ViT-B/32 M-Head M-View 0-shot | 52 | 0.0 | 0.3 | 53.5 | 30.3 | 30.1 | 89.8 | 71.6 |
| ResNet18 M-Head 100-shot | 52 | 0.0 | 0.3 | 53.5 | 30.9 | 30.1 | 89.8 | 71.6 |
| ViT-B/32 M-View 100-shot | 52 | 0.2 | 3.3 | 55.0 | 40.5 | 31.0 | 89.8 | 72.5 |
| ViT-B/32 M-Head M-View 100-shot | 52 | 4.1 | 21.7 | 62.4 | 63.8 | 40.0 | 89.8 | 75.0 |
| SORNet M-View 0-shot | 52 | 60.2 | 86.3 | 83.5 | 96.8 | 95.8 | 92.4 | 89.5 |
| SORNet M-View (G) 0-shot | 52 | 63.6 | 84.5 | 95.8 | 94.0 | 94.3 | 92.0 | 86.6 |
| SORNet M-View 1 distractor 0-shot | 52 | 50.1 | 80.9 | 86.3 | 82.7 | 85.9 | 94.4 | 91.0 |
| SORNet M-View 2 distractors 0-shot | 52 | 39.2 | 76.3 | 87.2 | 68.2 | 78.3 | 95.0 | 90.4 |
| SORNet M-View (G) 5 obj 0-shot | 70 | 45.5 | 74.8 | 97.6 | 81.1 | 72.6 | 92.0 | 87.5 |
| SORNet M-View (G) 6 obj 0-shot | 102 | 29.7 | 68.9 | 97.7 | 70.2 | 52.4 | 92.0 | 87.5 |
| SORNet M-View (G) real-world 0-shot | 52 | 26.7 | 63.3 | 93.3 | 80.0 | 76.7 | 90.0 | 80.0 |
| ResNet18 M-Head 0-shot | 52 | 9.7 | 23.6 | 0.0 | 31.5 | 0.0 | 0.0 | 0.0 |
| ViT-B/32 M-View 0-shot | 52 | 11.9 | 37.4 | 0.0 | 28.9 | 0.0 | 0.0 | 0.1 |
| ViT-B/32 M-Head M-View 0-shot | 52 | 12.2 | 32.5 | 0.0 | 27.7 | 0.0 | 0.0 | 0.0 |
| ResNet18 M-Head 100-shot | 52 | 0.0 | 21.9 | 0.0 | 32.6 | 0.0 | 0.0 | 0.0 |
| ViT-B/32 M-View 100-shot | 52 | 0.0 | 37.7 | 6.3 | 46.5 | 0.0 | 0.0 | 7.3 |
| ViT-B/32 M-Head M-View 100-shot | 52 | 0.0 | 70.5 | 31.0 | 73.2 | 27.2 | 0.0 | 23.2 |
| SORNet M-View 0-shot | 52 | 88.9 | 97.5 | 82.0 | 98.4 | 97.3 | 70.5 | 81.7 |
| SORNet M-View (G) 0-shot | 52 | 89.5 | 97.1 | 94.7 | 96.8 | 96.4 | 69.9 | 76.7 |
| SORNet M-View 1 distractor 0-shot | 52 | 85.1 | 96.3 | 81.8 | 90.1 | 88.8 | 67.5 | 80.1 |
| SORNet M-View 2 distractors 0-shot | 52 | 80.2 | 95.2 | 80.2 | 79.4 | 81.4 | 60.7 | 75.3 |
| SORNet M-View (G) 5 obj 0-shot | 70 | 85.3 | 96.0 | 96.7 | 91.3 | 83.6 | 69.8 | 78.1 |
| SORNet M-View (G) 6 obj 0-shot | 102 | 79.9 | 95.5 | 97.0 | 87.5 | 69.2 | 70.0 | 77.9 |
| SORNet M-View (G) real-world 0-shot | 52 | 76.5 | 90.7 | 85.7 | 80.3 | 69.1 | 33.3 | 68.9 |
| ViT-B/32 M-Head M-View | 27.3 | 90.4 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 |
| SORNet M-View (G) | 76.3 | 98.7 | 68.1 | 99.9 | 99.9 | 69.7 | 100.0 |
5.4 Open Loop Planning
In this demo, we incorporate SORNet into an open-loop planning pipeline in a real-world manipulation scenario. Given an initial frame, we use the predicates predicted by SORNet M-View (G) to populate a state vector. A task and motion planner takes the state vector and a desired goal (formulated as a list of predicate values to be satisfied) and outputs a sequence of primitive skills, which the robot then executes in an open-loop fashion. This demonstrates how SORNet can be applied to sequential manipulation of unseen objects zero-shot, i.e. without any fine-tuning on the test objects. Please refer to our online video for the demo.
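The planning step above can be sketched with a toy STRIPS-style breadth-first planner over predicate sets. This is a stand-in for the actual task planner used in the pipeline, and the skill and predicate names below are hypothetical:

```python
from collections import deque

def plan(init, goal, skills):
    """Breadth-first search over sets of true predicates.
    skills: name -> (preconditions, add effects, delete effects)."""
    start = frozenset(init)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:                 # goal predicates all satisfied
            return steps
        for name, (pre, add, delete) in skills.items():
            if pre <= state:              # preconditions satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None

skills = {
    "grasp(a)": ({"clear(a)", "on_table(a)", "hand_empty"},
                 {"holding(a)"},
                 {"clear(a)", "on_table(a)", "hand_empty"}),
    "place(a,b)": ({"holding(a)", "clear(b)"},
                   {"on(a,b)", "clear(a)", "hand_empty"},
                   {"holding(a)", "clear(b)"}),
}
init = {"clear(a)", "clear(b)", "on_table(a)", "on_table(b)", "hand_empty"}
steps = plan(init, {"on(a,b)"}, skills)
```

In the real pipeline, the initial predicate set comes from SORNet's classifications rather than ground truth, which is why predicate accuracy directly determines plan quality.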
| Method | ResNet18 | ResNet18 (MV) | ResNet18 (P) | CLIP-ViT | CLIP-ViT (P) | SORNet (P) | SORNet (P MV) |
5.5 Relative Direction Prediction
Although the training objective of SORNet is purely logical, with large-scale pre-training the object embeddings learned by SORNet contain continuous spatial information. We demonstrate this by using SORNet embeddings to predict the relative 3D direction between entities. Specifically, we trained a regressor (same architecture as the classifier in Sec. 3.2) on top of frozen SORNet embeddings to predict the continuous direction between two objects (Obj-Obj) or the direction the end effector should move to reach a certain object (EE-Obj). The regressor is trained with an L2 loss on 1,000 examples with unseen objects and tested on 3,000 examples with the same objects.
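The regression targets and a simplified read-out can be sketched as follows. This is an illustration under our own naming; the least-squares fit stands in for the 2-layer MLP regressor trained with an L2 loss, and is not the paper's implementation:

```python
import numpy as np

def direction_targets(src, dst):
    """Ground-truth 3D unit direction and distance from a source
    position (e.g. the end effector) to a target object position."""
    d = np.asarray(dst, float) - np.asarray(src, float)
    dist = np.linalg.norm(d, axis=-1, keepdims=True)
    return d / dist, dist

def fit_linear_readout(embeddings, targets):
    """Least-squares linear read-out on frozen embeddings; a simple
    stand-in for the MLP regressor."""
    w, *_ = np.linalg.lstsq(embeddings, targets, rcond=None)
    return w

unit, dist = direction_targets([[0.0, 0.0, 0.0]], [[0.0, 3.0, 4.0]])
```

That a cheap read-out trained on only 1,000 examples suffices is the point of the experiment: the spatial information is already present in the frozen embeddings.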
The results are summarized in Table 4. SORNet outperforms models trained from scratch with pose supervision (ResNet18) as well as models initialized with weights from large-scale language-image pretraining (CLIP-ViT). This demonstrates that our representation learning technique is better suited to manipulation scenarios, where precise spatial information is crucial. In our online video, we show that by predicting the movement direction online, our model can be used for visual servoing to guide the robot to a target object.
6 Conclusion
We proposed SORNet (Spatial Object-Centric Representation Network), which learns object-centric representations from RGB images. We show that the object embeddings produced by SORNet capture spatial relationships and can be used in downstream tasks such as spatial relationship classification, skill precondition classification and relative direction regression. Our method works zero-shot on scenes with an arbitrary number of unseen objects. With real-world robot experiments, we demonstrate how SORNet can be used in the manipulation of novel objects.
This work was supported in part by the National Science Foundation under Contract NSF-NRI-2024057 and in part by Honda Research Institute as part of the Curious Minded Machine initiative.
References

- (2012) An industrial robotic knowledge representation for kit building applications. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1365–1370.
- (2021) PoseRBPF: a Rao–Blackwellized particle filter for 6-D object pose tracking. IEEE Transactions on Robotics.
- (2020) Object-based attention for spatio-temporal reasoning: outperforming neuro-symbolic models with flexible distributed architectures. arXiv preprint arXiv:2012.08508.
- (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- One-shot imitation learning. In NIPS.
- (2014) Learning spatial relationships from 3D vision using histograms. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 501–508.
- (1971) STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3–4), pp. 189–208.
- (2003) PDDL2.1: an extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20, pp. 61–124.
- (2020) PDDLStream: integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30, pp. 440–448.
- Deep predictive policy training using reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2351–2358.
- (2020) CATER: a diagnostic dataset for Compositional Actions and TEmporal Reasoning. In ICLR.
- (2019) Neural task graphs: generalizing to unseen tasks from a single video demonstration. pp. 8565–8574.
- (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
- (2021) MDETR: modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763.
- (2020) Transferable task execution from pixels through deep planning domain learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10459–10465.
- End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
- DeepIM: deep iterative matching for 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698.
- (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- (2020) ViSII: virtual scene imaging interface. https://github.com/owl-project/ViSII/
- (2019) Prospection: interpretable plans from language by predicting the future. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6942–6948.
- (2019) Representing robot task plans as robust logical-dynamical systems. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5588–5595.
- (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
- (2011) Learning spatial relationships between objects. The International Journal of Robotics Research 30 (11), pp. 1328–1342.
- (2017) Extended behavior trees for quick definition of flexible robotic tasks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6793–6800.
- (2020) Relational learning for skill preconditions. arXiv preprint arXiv:2012.01693.
- (2017) Goal-directed robot manipulation through axiomatic scene estimation. The International Journal of Robotics Research 36 (1), pp. 86–104.
- (2018) Implicit 3D orientation learning for 6D object detection from RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715.
- (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
- PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199.
- (2020) Learning RGB-D feature embeddings for unseen object instance segmentation. arXiv preprint arXiv:2007.15157.
- (2020) The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation. In Conference on Robot Learning (CoRL), pp. 1369–1378.
- (2020) CLEVRER: collision events for video representation and reasoning. In ICLR.
- (2018) Neural-symbolic VQA: disentangling reasoning from vision and language understanding. arXiv preprint arXiv:1810.02338.
- (2018) One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043.
- (2021) Hopper: multi-hop transformer for spatiotemporal reasoning. arXiv preprint arXiv:2103.10574.
Here we provide technical details in support of our main paper. Below is a summary of the contents.
Qualitative examples of relationship classification on CLEVR-CoGenT;
Qualitative examples of relative direction regression on Leonardo;
Visualizations of the attention learned by SORNet;
More details on the predicates and tasks in the Leonardo dataset;
Additional details on model architecture and training.
B Qualitative Results on CLEVR-CoGenT
Fig. 8 shows qualitative results of spatial relationship classification on CLEVR-CoGenT (Sec. 5.1 in main paper). These examples demonstrate that SORNet is able to identify objects not only using color cues, but also shape (e.g. blue sphere vs blue cylinder in the topmost example), size (e.g. small cyan cube vs big cyan cube in the second from top example) and material (e.g. small purple metal cube vs small purple rubber cube in the third from top example). We also visualize the relevant canonical object views provided to the model. As we can see, the canonical object views can have very different appearance from the corresponding objects in the input image. It is more appropriate to consider these canonical views as a visual replacement for natural language, rather than the result of object detection or segmentation.
C Qualitative Results on Relative Direction Prediction
In Fig. 9 we visualize the results of relative direction prediction (Sec. 5.5 in main paper). Specifically, we train regressors on top of frozen SORNet embeddings to predict the relative direction (a 3D unit vector) and distance (a 1D scalar) between each pair of objects, as well as between the end effector and each object. We visualize the predicted direction as arrows, scaled by the predicted distance. Trained on only a thousand examples, the regressor predicts continuous spatial relationships accurately enough to guide robot execution (see our online video), thanks to the spatial information encoded in the SORNet embeddings. Note that SORNet is never trained on explicit object poses, yet it captures relative locations. As a result, the object-centric embedding can be quickly finetuned for downstream tasks that require accurate spatial information.
D Attention Visualization
Fig. 10 visualizes the attention weights learned by the visual transformer model in order to obtain the object-centric embeddings. More specifically, we take the normalized attention weights from the tokens corresponding to the canonical object views to the tokens corresponding to the context patches and convert them into a colormap, where the intensity corresponds to the magnitude of the attention weight over that patch. We then overlay the colormap on the input image. We can see that while the model puts the highest attention on the patch containing the object of interest, it also learns to attend to the robot arm and other objects while ignoring irrelevant background. We also visualize the canonical object patches given to the model, which can look drastically different from the same objects in the image; the model must associate the canonical view with the object's appearance in the input under different lighting conditions and occlusions.
E The Leonardo Dataset
Here we include additional details for the Leonardo dataset.
E.1 List of Predicates
Table 2 shows the 52 predicates for 4-object test scenes.
E.2 Training and Test Tasks
Fig. 11 shows initial frame, final frame and goal conditions for a sample from each task in the Leonardo dataset.
Fig. 7 shows the 3 camera views used to train the multi-view models. Each camera view is also slightly perturbed around the base camera pose during training.
F Model Architecture and Training
F.1 ResNet Baseline
The ResNet baseline uses the ResNet18 backbone implemented in torchvision. The feature map before average pooling is flattened into a 512-dimensional vector and passed through 4 two-layer MLPs with 512 hidden units, each outputting the predicates relevant to one of the 4 objects (19 per object). For binary predicates (e.g. stacked(red_block, blue_block)), during inference we add the logits from the MLPs responsible for both objects to make the final prediction.
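The head structure of this baseline can be sketched as follows. This is a shape-level illustration only: the weights are random, the function names are our own, and only the per-object heads and the logit addition for binary predicates mirror the description above.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """2-layer MLP with ReLU, as used for each per-object head."""
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

rng = np.random.default_rng(0)
feat = rng.normal(size=512)                       # flattened backbone feature
heads = [(rng.normal(size=(512, 512)) * 0.01, np.zeros(512),
          rng.normal(size=(512, 19)) * 0.01, np.zeros(19))
         for _ in range(4)]                       # one head per object
logits = np.stack([mlp(feat, *h) for h in heads]) # (4 objects, 19 predicates)

def binary_prediction(logits, i, j, k):
    """Binary predicate k involving objects i and j: add the logits
    from both objects' heads before thresholding."""
    return logits[i, k] + logits[j, k] > 0
```

Note that, unlike SORNet, the number of heads here is fixed at 4, which is why this baseline cannot be applied to scenes with more objects without retraining.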
F.2 ViT Baseline
The ViT baseline uses ViT-B/32 backbone, with 12 layers of 12-head self-attention layers. The width of the model (dimension of token embeddings) is set to 768. This ViT model passes the embedding from a single trainable classification token through a 2-layer MLP with 512 hidden units to predict all 52 predicates.
F.3 ViT Multihead Baseline
The ViT multihead baseline also uses the ViT-B/32 backbone with the same architecture as the ViT baseline, except that it has 4 trainable classification tokens, yielding 4 embedding vectors. The same predicate classifier as in SORNet is then used to predict unary and binary predicate values.
F.4 SORNet
SORNet uses the same backbone architecture as the two baselines above. The canonical object views are flattened and linearly projected just like the context patches from the input image. The predicate classifier takes the token vectors corresponding to the canonical object views from the topmost layer of the transformer and outputs predicate values. Each MLP in the predicate classifier has 512 hidden units and outputs a single scalar.
F.5 Training Hyperparameters
All models are trained with a binary cross-entropy loss using the SGD optimizer (momentum 0.9) on 4 GPUs with 32 GB of memory each. The per-GPU batch size is 512. The ResNet baseline uses a learning rate of 0.01, while the ViT baselines and SORNet use a learning rate of 0.0001. All models are trained for 80 epochs.