Where is My Stuff? An Interactive System for Spatial Relations

09/13/2019 ∙ by E. Akin Sisbot, et al. ∙ IBM

In this paper we present a system that detects and tracks objects and agents, computes spatial relations, and communicates those relations to the user using speech. Our system is able to detect multiple objects and agents at 30 frames per second using an RGB-D camera. It is able to extract the spatial relations in, on, next to, near, and belongs to, and to communicate these relations using natural language. The notion of belonging is particularly important for Human-Robot Interaction since it allows the robot to ground the language and reason about the right objects. Although our system is currently static and targeted to a fixed location in a room, we are planning to port it to a mobile robot, thus allowing it to explore the environment and create a spatial knowledge base.





With the recent developments in Artificial Intelligence and the adoption of smart devices, social robots are one step closer to finding a place in our daily lives. From doing repetitive mundane tasks to helping us solve problems, they will be a major contributor to improving our quality of life. One problem that social robots/agents can help with is keeping track of the objects that we use in our homes or workplaces. Forgetting where an object is, or having somebody else move an object without our knowledge, are two very common scenarios that are sources of frustration. We believe that an interactive camera system, along with adequate reasoning capabilities, can be very helpful in solving this problem.

In this paper we present an integrated system that constantly watches the scene, detecting and tracking objects and people while inferring ownership and spatial relationships. The system allows the user to ask where an object is, and answers accordingly. Although the system is currently installed as a static setup as part of a smart room, our vision is to port it to a mobile robot.

Spatial relationships between objects and people have been studied extensively in computer vision and developmental psychology. Piaget [6] found that the notion of spatial relationships between objects emerges in early infancy. This ability is crucial for building an abstract representation of the world and communicating that representation to others.

In robotics, spatial representations are related to affordances. In [7], a contact point network is used to segment out objects using a Kinect camera, and the relationships are learned using supervised learning methods. [3] use a 3D simulation environment to generate a large volume of training data. Some works, such as [1], use 2D images instead of 3D data to calculate spatial relationships. [5] described a system where the robot is able to use its previous knowledge to create new relationships between objects. In a recent work, [4] employ a neural network to generalize spatial relations to previously unknown objects. Recent VQA systems such as [2] employ deep neural networks to answer open-ended questions about a scene.

In this paper we present an interactive system that uses a simplified approach to calculate spatial relations. Our system is able to compute the belonging relationship, which simplifies the language the user can use to refer to objects. Moreover, our system offers an end-to-end solution, starting from the user's speech and ending with the system's answer in natural language.

Use Cases

There are two use cases that inspire and guide our work:

Elder Care

About 40% of people aged 65 or older have age-associated memory impairment [8]. The older we get, the easier it becomes to forget where we placed our belongings. This use case focuses on keeping track of relevant items that an elderly person might need but cannot find. An example scenario for this use case is as follows:

  • Mr. Jones comes home, leaves his wallet in an uncommon location (next to the vase)

  • Ms. Jones places a couple of magazines on the wallet

  • Later Mr. Jones asks “Where is my wallet?”

  • The system answers: “It is next to the vase, under the magazines”

Workshops and Factories

In workplaces, factories, and workshops a missing or misplaced object may have a direct impact on the efficiency of the processes underway. In this use case the system not only answers explicit queries but also proactively watches and warns the user of misplaced/missing items. An example for this use case might be:

  • Mr. Jones is working on a repair project

  • The current project involves a defined set of tools

  • The system tracks the location of those tools

  • The system detects that one tool is missing from its usual location

    • The system warns Mr. Jones that the tool is missing

  • The system detects that one tool is situated in an unusual location

    • When the time comes for Mr. Jones to use that tool, the system proactively tells him its location relative to a landmark: “The wrench is behind the toolbox”

System Overview

Our system uses a Microsoft Kinect RGB-D camera and an array microphone mounted on the ceiling. As seen in Figure 1, the camera is focused on a target work area where we detect and track the objects. The work area can easily be extended by using multiple cameras.

Figure 1: The system uses a MS Kinect sensor and an array microphone to detect objects and capture speech.

There are three major components of the system:

Object and Person Detection

People are essentially modeled as stalagmites coming up from the floor, with constraints on head height, head size, shoulder width, etc. Objects are similarly modeled as bumps coming up from the work surface. Both people and objects are tracked over time allowing continuity in properties such as names or types that may have been assigned. The full set of person and object data is streamed in JSON format via a ZeroMQ pub-sub channel approximately 30 times a second.
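Consuming this stream amounts to decoding one JSON frame at a time into person and object detections. The paper does not publish the message schema, so the field names below ("people", "objects", "center", etc.) are assumptions for illustration; in the live system the raw string would arrive on a ZeroMQ SUB socket at roughly 30 Hz.

```python
import json

# Hypothetical example of one detection frame; the real schema of the
# ZeroMQ stream is not specified in the paper.
sample_frame = json.dumps({
    "people":  [{"id": 7, "head": [0.1, 0.4, 1.7]}],
    "objects": [{"id": 3, "type": "wallet", "center": [0.5, 0.2, 0.9]}],
})

def parse_detections(raw):
    """Decode one JSON frame into lists of person and object detections."""
    frame = json.loads(raw)
    return frame.get("people", []), frame.get("objects", [])

people, objects = parse_detections(sample_frame)
print(len(people), objects[0]["type"])  # prints: 1 wallet
```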

The system allows objects to be tracked even when they are being held (and thus have a wildly different shape when combined with an arm) by merging the point cloud associated with the user’s hand with the point cloud of the object. The system also tracks the position of people’s hands relative to the work area, focusing on the end points of the protrusions in the point cloud created by the presence of arms. In addition to providing information about hands, the system can infer pointing directions using the principal component analysis (PCA) method. This can be used by itself to, for instance, select a particular object for querying.
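The pointing-direction idea can be sketched as follows: the dominant principal axis of the arm's point cloud approximates the direction the arm points. The text only names PCA; the pure-Python power iteration on the 3x3 covariance matrix below is our own implementation choice, not the paper's.

```python
def principal_direction(points, iters=200):
    """Return the dominant principal axis of a set of 3D points (unit vector)."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]  # power iteration converges to the top eigenvector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    return v

# Synthetic arm points stretched along the x-axis with slight y jitter:
arm = [(0.1 * i, 0.01 * (i % 2), 0.0) for i in range(10)]
direction = principal_direction(arm)
print(round(abs(direction[0]), 2))  # close to 1.0: points mainly along x
```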

Figure 2 shows a snapshot of the environment and the corresponding person and object detection results. The results are then passed to the Spatial Relations step.

Figure 2: A snapshot of the environment and the ZeroMQ detection stream data.

Spatial Relations

In order to generate human understandable dialog, we extract the topology of the environment by calculating relationships between agents, objects, and locations from the geometric data. For every detection result that we receive we compute a number of spatial relationships.

We consider 4 types of observer-independent spatial relations:

  • Object-Object Relations: in, on, near, next to

  • Object-Agent Relations: belongs, last touched by

  • Agent-Location Relations: in

  • Object-Location Relations: in

These spatial relations are computed based on a set of predefined rules and a number of geometric properties. Figure 3 illustrates Object-Object relations.

Figure 3: Illustration of Object to Object Relations


The in relation between object and is defined as:

is in if 80% of ’s volume is in .


The on relation between object and is defined as:

is on if the bottom of is above the top of .


The near relation between object and is defined as: is near if the distance between and is not greater than the .

Next to

The next to relation between object and is defined as:

is next to if is near and there no object between and .


The object belongs to the agent if has not been seen before and appears with the agent

The spatial relations are computed for every frame and for every object and agent. The system is also able to compute only a subset of these relations to focus on relevant objects and agents, if desired. The relationships, especially belongs to, are kept and maintained in a database, allowing the user to ask about his/her/others’ objects.
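The belongs to bookkeeping can be sketched as follows: an object is assigned to an agent when it has not been seen before and first appears together with that agent. The class below is a hypothetical illustration; the paper's database layer is replaced by an in-memory dict, and the single-agent condition is our assumption about how ambiguity is avoided.

```python
class OwnershipTracker:
    """Infer and store 'belongs to' relations from per-frame detections."""

    def __init__(self):
        self.owner = {}    # object id -> agent id
        self.seen = set()  # object ids observed so far

    def update(self, object_ids, agent_ids):
        """Process one frame: new objects appearing with exactly one agent
        are assigned to that agent."""
        for obj in object_ids:
            if obj not in self.seen and len(agent_ids) == 1:
                self.owner[obj] = next(iter(agent_ids))
            self.seen.add(obj)

    def objects_of(self, agent):
        """Answer queries like 'where are my things?' for one agent."""
        return [o for o, a in self.owner.items() if a == agent]

t = OwnershipTracker()
t.update({"wallet"}, {"mr_jones"})              # wallet first seen with Mr. Jones
t.update({"wallet", "magazine"}, {"ms_jones"})  # wallet already known; magazine is new
print(t.owner["wallet"], t.owner["magazine"])   # prints: mr_jones ms_jones
```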

Dialog

The Dialog system is the third major component of the overall architecture. It is in charge of understanding the user’s spoken questions and assertions, and providing intelligible answers using synthesized speech. This system is also responsible for detecting when the user is addressing the system (versus chatting with a workmate) by watching for either of two events coming from the user:

  • Using the keyword “Celia”

  • Looking directly at the camera

Once the user gets the system’s attention using one of the above methods, he/she has 2 seconds to start phrasing his/her request. If nothing is heard within this interval, the system times out and resumes waiting for a new attention trigger.
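This attention mechanism is a small state machine: a trigger opens a listening window that closes after the stated 2-second timeout. The class and method names below are hypothetical; only the timeout value comes from the text, and a wall-clock timestamp is passed in explicitly so the logic is testable.

```python
class AttentionGate:
    """Listening window opened by a wake word ('Celia') or direct gaze."""

    TIMEOUT = 2.0  # seconds the system listens after a trigger (per the text)

    def __init__(self):
        self.triggered_at = None

    def trigger(self, now):
        """Record that an attention event occurred at time `now`."""
        self.triggered_at = now

    def accepts_speech(self, now):
        """True while the window is open; times out and resets otherwise."""
        if self.triggered_at is None:
            return False
        if now - self.triggered_at > self.TIMEOUT:
            self.triggered_at = None  # timed out; wait for a new trigger
            return False
        return True

gate = AttentionGate()
gate.trigger(now=10.0)
print(gate.accepts_speech(now=11.0), gate.accepts_speech(now=12.5))
# prints: True False  (1.0 s is within the window, 2.5 s is past it)
```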

Conclusion and Future Work

In this paper we detailed our preliminary work on keeping track of objects and people, and on informing the user, upon request, of various relations between them. Our system is able to resolve belonging and allows the user to ask about his/her/others’ objects. While the system is currently part of a room, we plan to port it to a mobile robot, which we believe raises fewer privacy concerns than a static camera mounted on the ceiling. A mobile robot can also look at objects from different points of view, thus reducing possible blind spots and occlusions.

We are also planning to enable a proactive behavior where the robot follows people and watches which objects they use. This way the robot will build a broader knowledge base of activities and be able to answer questions about locations that are not in its current field of view.


  • [1] A. Belz, A. Muscat, M. Aberton, and S. Benjelloun (2015) Describing spatial relationships between objects in images in English and French. In Proceedings of the Fourth Workshop on Vision and Language, Lisbon, Portugal, pp. 104–113.
  • [2] S. Cho, W. Lee, and J. Kim (2017) Implementation of human-robot VQA interaction system with dynamic memory networks. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 495–500.
  • [3] S. Fichtl, J. Alexander, F. Guerin, W. Mustafa, D. Kraft, and N. Krüger (2013) Learning spatial relations between objects from 3D scenes. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–2.
  • [4] P. Jund, A. Eitel, N. Abdo, and W. Burgard (2018) Optimization beyond the convolution: generalizing spatial relations with end-to-end metric learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7.
  • [5] O. Mees, N. Abdo, M. Mazuran, and W. Burgard (2017) Metric learning for generalizing spatial relations to new objects. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3175–3182.
  • [6] J. Piaget (1954) The Construction of Reality in the Child. Ballantine Books.
  • [7] B. Rosman and S. Ramamoorthy (2011) Learning spatial relationships between objects. The International Journal of Robotics Research 30 (11), pp. 1328–1342.
  • [8] G. Small (2002) What we need to know about age related memory loss. BMJ 324 (7352), pp. 1502–1505.