Egocentric affordance detection with the one-shot geometry-driven Interaction Tensor

06/13/2019 ∙ by Eduardo Ruiz, et al. ∙ University of Bristol 0

In this abstract we describe recent [4,7] and latest work on the determination of affordances in visually perceived 3D scenes. Our method builds on the hypothesis that geometry on its own provides enough information to enable the detection of significant interaction possibilities in the environment. The motivation behind this is that geometric information is intimately related to the physical interactions afforded by objects in the world. The approach uses a generic representation for the interaction between everyday objects such as a mug or an umbrella with the environment, and also for more complex affordances such as humans Sitting or Riding a motorcycle. Experiments with synthetic and real RGB-D scenes show that the representation enables the prediction of affordance candidate locations in novel environments at fast rates and from a single (one-shot) training example. The determination of affordances is a crucial step towards systems that need to perceive and interact with their surroundings. We here illustrate output on two cases for a simulated robot and for an Augmented Reality setting, both perceiving in an egocentric manner.



There are no comments yet.


page 1

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Agents that need to act on their surroundings can significantly benefit from the perception of their interaction possibilities or affordances. The concept of affordance is innately interlinked and founded for egocentric perception, and the term coined by James J. Gibson [5] within the field of ecological perception. For Gibson, affordances are action opportunities in the environment that are directly perceived by the observer. According to this, the goal of vision is to recognise the affordances rather than the elements or objects in the scene. The concept of affordances calls for an approach to visual perception that is free from non-action representations, and that is there to help the agent to interact with the world. Following Gibson’s call for a direct perception of affordances, and motivated by studies in neuroscience showing that affordance detection does not require semantic reasoning [8]; we hypothesise that geometric information or shape provides enough information for an agent to directly perceive the interaction opportunities in its surroundings. Examples of our geometry-driven affordance detection approach are shown in Fig. 1. As detailed later on, the detection is agnostic to semantics and complex representations of the input scene.

Figure 1: Affordance detections (green objects and their poses) determined by our geometry-based method. We are able to determine over 80 affordances in real-time on never before seen RGB-D scenes.

Related work  Much of the attention given to the problem of affordances has focused on the classification of object instances in the world[3], internal symbolic relationships [10] or semantic category information[1], which strongly undermines the idea of direct and economical perception of affordances proposed by Gibson. But to directly determine affordances has faced many dilemmas, namely the challenging problems of visually recovering the relevant properties of the environment in a robust and accurate manner.

We argue that in order to truly perceive affordances in a way that is most useful for agents, there is a need for methods that are agnostic to object categories and free from complex feature representations; methods of a generic nature that allow for the simple yet robust description of multiple affordances. We hypothesise that geometry on its own provides enough information to robustly and generically characterise affordances.

2 Our approach

We concentrate on the subclass of affordances between rigid objects. Affordances such as “where can I hang this?”, place this, ride, fill, and similar. We do this by specifying a geometry-driven interaction tensor that aims to capture the way in which the affordance manifests between a pair of objects. In contrast with previous approaches, our algorithm is able to generalise from a single training example to completely novel environments, i.e. one-shot learning. Here we describe the core of our approach, namely the affordance representation (Interaction Tensor) and the algorithm that allows for fast one-shot detections.

The Interaction Tensor  The Interaction Tensor () [4]

is a vector field representation able to characterise the static interactions between 2 generic entities (e.g. objects) in 3D space. This proposed representation builds on the Interaction Bisector Surface (IBS)

[9] concept and extends its robustness by three main factors:

  1. [topsep=4pt,itemsep=2pt]

  2. Proposing a representation suitable for visually generated data, e.g. pointclouds

  3. Increases robustness by encoding the locations in the interacting entities that contributed to the computation of their bisector —provenance vectors

  4. Introduces a straight forward descriptor that allows for real-time prediction of affordance candidate locations on RGB-D data —affordance keypoints

Using direct, sparse sampling over the allows for the determination of geometrically similar interactions from a single training example; this sampling comprises what we call affordance keypoints, which serve to more quickly judge the likelihood of an affordance at a test point in a scene. The

is straightforward to compute and tolerates well changes in geometry, which provides a good generalisation to unseen scenes from a single example. The key steps include an example affordance from a simulated interaction, the computation of the IBS between an object (query-object) and scene (or scene-object), and estimating provenance vectors which are the vectors used in the computation of points on the bisector surface. Top row (Training example) in Fig.

2 shows the elements and the process involved in computing an affordance for Placing a bowl.

Fast one-shot affordance detection

Figure 2: Top row (Training 2D example) illustrates the computation of an interaction tensor for an affordance of interest, in this particular case Placing a bowl. Figure on the bottom (Detection) shows the approach followed to perform affordance detection on novel scenarios. For clarity purposes the scene shown is a synthetic kitchen; however, the detection algorithm allows to follow the same approach for RGB-D scenes.

In order to make fast affordance detection in a novel scenario without computing the full descriptor, we perform an approximation of the descriptor via a Nearest Neighbour (NN) search. This can be done by taking advantage of the provenance vectors from the training example; these vectors account for regions in the scene that contributed to the computation of the bisector surface. The proposed algorithm uses this information to investigate whether those regions exist in a novel scenario, these regions would allow computing the same or a similar . In this sense, the NN-search is used to investigate if the point in the scene required to compute a point on the exists; or more precisely, if the point in the scene is where is expected to be. The detection pipeline is illustrated in the bottom diagram (Detection) of Fig. 2. Full description of our methods is available at [4] and [7].

Figure 3: Top row shows a demonstration for Augmented Reality. The input to the system is the RGB and depth images of the current scene. The red circle in the RGB image on the left illustrates the ROI used for detections. Bottom rows show example predictions for Sitting, Placing and Hanging objects in an office environment.
Figure 4: The fast rates at which detections are made allows the integration of the affordance detection algorithm into the perception pipeline of a robotic system. For this simulation, the input to the detection algorithm is directly the pointcloud as captured by the RGB-D sensor. Top row images show a robot in a kitchen environment where Filling (left) and Placing(right) affordances have been detected. Images on the bottom row show Sitting affordance detections performed while the robot navigates in a living-room. In all images query-objects are shown in green.

3 For Robots and AR

We leverage a state-of-the-art and publicly available dense mapping system [6] paired with an RGB-D sensor to recover a 3D representation of the scene in front of the camera. Our implementation of the affordance detection algorithm leverages the parallelisation capabilities of commodity desktop hardware. As a reference, our algorithm running in a PC with a Titan X GPU allows for the simultaneous detection of up to 84 affordances at 10 point locations of the input scene in under 1 second. Fig. 3 summarises the computation times involved in the current implementation.

Our experiments include multiple affordances of everyday objects such as cups, mugs, bowls, etc. and also detections of human affordances such as Sitting or Riding. Here we show qualitative results of the algorithm of its application in robotics systems (Fig. 4) as well as for augmented reality (Fig. 3). The scenes used for our experiments include publicly available data such as [2], amongst others of our own. Code and data of our core affordance detection approach have been made publicly available111

4 Conclusion

We have developed a tensor field representation to characterise the interactions between pairs of objects. This representation and the proposed algorithm allow for real-time and multiple affordance detections in novel environments training from a single example. In this abstract we showed results of the application of the proposed approach for robotic perception and scene augmentation in mixed reality systems. Overall, we see this work as an effort to motivate further advancing of approaches in Vision that are more ecological in nature and consider the relationship between the scene and the perceiving agent.


  • [1] C. Chuang, J. Li, A. Torralba, and S. Fidler. Learning to act properly: Predicting and explaining affordances from images. In

    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

    , pages 975–983, June 2018.
  • [2] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [3] T. Do, A. Nguyen, and I. Reid.

    Affordancenet: An end-to-end deep learning approach for object affordance detection.

    In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–5, May 2018.
  • [4] Ruiz Eduardo and Mayol-Cuevas Walterio. Where can i do this? geometric affordances from a single example with the interaction tensor. In Robotics and Automation (ICRA), 2018 IEEE International Conference on, May 2018.
  • [5] James J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
  • [6] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127–136. IEEE, 2011.
  • [7] Eduardo Ruiz and Walterio W. Mayol-Cuevas. What can I do here? leveraging deep 3d saliency and geometry for fast and scalable multiple affordance detection. CoRR, abs/1812.00889, 2018.
  • [8] G. Vingerhoets, K. Vandamme, and A. Vercammen. Conceptual and physical object qualities contribute differently to motor affordances. Brain and Cognition, 69(3):481 – 489, 2009.
  • [9] Xi Zhao, He Wang, and Taku Komura. Indexing 3d scenes using the interaction bisector surface. ACM Trans. Graph., 33:1–14, 2014.
  • [10] Yuke Zhu, Alireza Fathi, and Li Fei-Fei. Reasoning about Object Affordances in a Knowledge Base Representation, volume 8690 of Lecture Notes in Computer Science, book section 27, pages 408–424. Springer International Publishing, 2014.