With recent developments in Artificial Intelligence and the adoption of smart devices, social robots are one step closer to finding a place in our daily lives. From doing repetitive, mundane tasks to helping us solve problems, they will be a major contributor to improving our quality of life. One problem that social robots/agents can help with is keeping track of the objects we use in our homes or workplaces. Forgetting where an object is, or somebody else moving an object without our knowledge, are two very common scenarios that are sources of frustration. We believe that an interactive camera system, along with adequate reasoning capabilities, can be very helpful in solving this problem.
In this paper we present an integrated system that constantly watches the scene, detecting and tracking objects and people while inferring ownership and spatial relationships. The system allows the user to ask where an object is and answers accordingly. Although the system is currently installed as a static setup as part of a smart room, our vision is to port it to a mobile robot.
Spatial relationships between objects and people have been studied extensively in the fields of computer vision and developmental psychology. Piaget found that the notion of spatial relationships between objects develops in early infancy. This ability is crucial for building an abstract representation of the world and communicating this representation to others.
In robotics, spatial representations are related to affordances. In one line of work, a contact point network is used to segment out objects using a Kinect camera, and the relationships are learned using supervised learning methods. Others use a 3D simulation environment to generate a large volume of training data, or compute spatial relationships from 2D images instead of 3D. Elsewhere, a neural network is employed to generalize spatial relations to previously unknown objects. Recent VQA systems employ deep neural networks to answer open-ended questions about a scene.
In this paper we present an interactive system that uses a simplified approach to calculate spatial relations. Our system calculates the belonging relationship, simplifying the language the user employs to refer to objects. Nevertheless, our system offers an end-to-end solution, starting from the user's speech and ending with the system's answer in natural language.
There are two use cases that inspire and guide our work:
Elder Care
About 40% of people aged 65 or older have age-associated memory impairment. The older we get, the easier it becomes to forget where we placed our belongings. This use case focuses on keeping track of relevant items that an elderly person might need but has forgotten. An example scenario for this use case is as follows:
Mr. Jones comes home, leaves his wallet in an uncommon location (next to the vase)
Ms. Jones places a couple of magazines on the wallet
Later Mr. Jones asks “Where is my wallet?”
The system answers: “It is next to the vase, under the magazines”
Workshops and Factories
In workplaces, factories, and workshops, a missing or misplaced object may have a direct impact on the efficiency of the processes underway. In this use case the system not only answers explicit queries but also proactively watches and warns the user of misplaced/missing items. An example for this use case might be:
Mr. Jones is working on a repair project
The current project involves a defined set of tools
The system tracks the location of those tools
The system detects that one tool is missing from its usual location
The system warns Mr. Jones that the tool is missing
The system detects that one tool is situated in an unusual location
When the time comes for Mr. Jones to use that tool, the system proactively tells him its location relative to a landmark: “The wrench is behind the toolbox”
Our system uses a Microsoft Kinect RGB-D camera and an array microphone mounted on the ceiling. As seen in Figure 1, the camera is focused on a target work area where we detect and track the objects. The work area can easily be extended by using multiple cameras.
There are three major components of the system:
Object and Person Detection
People are essentially modeled as stalagmites coming up from the floor, with constraints on head height, head size, shoulder width, etc. Objects are similarly modeled as bumps coming up from the work surface. Both people and objects are tracked over time, allowing continuity in properties such as names or types that may have been assigned. The full set of person and object data is streamed in JSON format via a ZeroMQ pub-sub channel approximately 30 times a second.
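To illustrate, a downstream consumer of this stream only needs to decode each JSON message and index the detections. The field names below are hypothetical, since the paper does not specify the message schema; a real consumer would read the raw string from a ZeroMQ SUB socket instead of a local variable.

```python
import json

# Hypothetical example of one streamed detection message; the actual
# JSON schema used by the system is not given in the paper.
frame_msg = json.dumps({
    "timestamp": 1234.56,
    "people": [{"id": 1, "head": [0.2, 0.4, 1.7], "hands": [[0.3, 0.6, 1.0]]}],
    "objects": [{"id": 7, "type": "wallet", "centroid": [0.5, 0.1, 0.9]}],
})

def handle_frame(raw):
    """Decode one pub-sub message and index detections by id."""
    frame = json.loads(raw)
    objects = {o["id"]: o for o in frame["objects"]}
    people = {p["id"]: p for p in frame["people"]}
    return objects, people

objects, people = handle_frame(frame_msg)
```

At roughly 30 messages a second, a consumer like this can simply keep the latest frame and discard older ones.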
The system allows objects to be tracked even when they are being held (and thus have a wildly different shape when combined with an arm) by merging the point cloud associated with the user's hand with the point cloud of the object. The system also tracks the position of people's hands relative to the work area, focusing on the end points of the protrusions in the point cloud created by the presence of arms. In addition to providing information about hands, the system can infer pointing directions using principal component analysis. This can be used by itself to, for instance, select a particular object for querying.
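A minimal sketch of the pointing-direction idea: the first principal component of the arm's point cloud is its axis of greatest variance, i.e. the direction the arm extends. The paper does not give implementation details, so the power-iteration approach and the function name below are assumptions.

```python
def pointing_direction(points):
    """Estimate a pointing direction as the first principal component
    of the 3D points in an arm protrusion (sketch, not the paper's code)."""
    n = len(points)
    mean = [sum(c) / n for c in zip(*points)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum(p[i] * p[j] for p in centered) / n for j in range(3)]
           for i in range(3)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, i.e. the arm's long axis.
    v = [1.0, 1.0, 1.0]
    for _ in range(50):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Note that the sign of the returned axis is ambiguous; in practice the hand end point of the protrusion disambiguates which way the user is pointing.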
Figure 2 shows a snapshot of the environment and the corresponding person and object detection results. The results are then passed to the Spatial Relations step.
Spatial Relations
In order to generate human-understandable dialog, we extract the topology of the environment by calculating relationships between agents, objects, and locations from the geometric data. For every detection result that we receive, we compute a number of spatial relationships.
We consider 4 types of observer-independent spatial relations:
Object-Object Relations: in, on, near, next to
Object-Agent Relations: belongs, last touched by
Agent-Location Relations: in
Object-Location Relations: in
These spatial relations are computed based on a set of predefined rules and a number of geometric properties. Figure 3 illustrates Object-Object relations.
The in relation between objects A and B is defined as:
A is in B if 80% of A's volume is inside B.
The on relation between objects A and B is defined as:
A is on B if the bottom of A is above the top of B.
The near relation between objects A and B is defined as:
A is near B if the distance between A and B is not greater than a predefined threshold.
The next to relation between objects A and B is defined as:
A is next to B if A is near B and there is no object between A and B.
An object belongs to an agent if the object has not been seen before and first appears with that agent.
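The rules above can be sketched as predicates over axis-aligned bounding boxes. The paper states the rules only informally, so the geometry tests, the near threshold value, and the "no object between" approximation below are all assumptions.

```python
import math
from dataclasses import dataclass

NEAR_THRESHOLD = 0.3  # meters; the actual threshold is not given in the paper

@dataclass
class Box:
    x0: float; y0: float; z0: float  # min corner (z is up)
    x1: float; y1: float; z1: float  # max corner

def volume(b):
    return (b.x1 - b.x0) * (b.y1 - b.y0) * (b.z1 - b.z0)

def intersection_volume(a, b):
    dx = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    dy = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    dz = max(0.0, min(a.z1, b.z1) - max(a.z0, b.z0))
    return dx * dy * dz

def centroid(b):
    return ((b.x0 + b.x1) / 2, (b.y0 + b.y1) / 2, (b.z0 + b.z1) / 2)

def is_in(a, b):
    # A is in B if 80% of A's volume lies inside B.
    return intersection_volume(a, b) >= 0.8 * volume(a)

def is_on(a, b):
    # A is on B if the bottom of A is at or above the top of B; requiring
    # an overlapping footprint is an extra assumption to make the rule usable.
    footprint = (min(a.x1, b.x1) > max(a.x0, b.x0)
                 and min(a.y1, b.y1) > max(a.y0, b.y0))
    return footprint and a.z0 >= b.z1

def is_near(a, b):
    return math.dist(centroid(a), centroid(b)) <= NEAR_THRESHOLD

def is_next_to(a, b, others=()):
    # "No object between A and B" is approximated here by checking whether
    # any other object's centroid falls in the region spanned by their centroids.
    if not is_near(a, b):
        return False
    ca, cb = centroid(a), centroid(b)
    lo = [min(p, q) for p, q in zip(ca, cb)]
    hi = [max(p, q) for p, q in zip(ca, cb)]
    return not any(all(l <= c <= h for l, c, h in zip(lo, centroid(o), hi))
                   for o in others)
```

For example, a book whose bottom rests at the table's top height satisfies is_on, and two cups 15 cm apart satisfy is_near and is_next_to unless a third object sits between them.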
The spatial relations are computed for every frame, for every object and agent. If desired, the system can also compute only a subset of these relations to focus on relevant objects and agents. The relationships, especially belongs to, are kept and maintained in a database, allowing the user to ask about his/her own or others' objects.
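The ownership bookkeeping can be sketched as follows: an object is assigned to the agent it first appears with, and the record persists across frames even if someone else later handles it. The class and method names are illustrative assumptions, not the paper's code, and a real system would back this with a database rather than in-memory dictionaries.

```python
class OwnershipStore:
    """Minimal sketch of the 'belongs to' bookkeeping described above."""

    def __init__(self):
        self._owner = {}   # object id -> agent id
        self._seen = set()

    def update(self, object_ids, agent_ids):
        """Process one frame of detected object and agent ids."""
        for oid in object_ids:
            if oid not in self._seen and len(agent_ids) == 1:
                # A new object appearing with exactly one agent:
                # infer that the object belongs to that agent.
                self._owner[oid] = next(iter(agent_ids))
            self._seen.add(oid)

    def owner_of(self, oid):
        return self._owner.get(oid)

store = OwnershipStore()
store.update({"wallet"}, {"mr_jones"})  # wallet first seen with Mr. Jones
store.update({"wallet"}, {"ms_jones"})  # later frames do not change ownership
```

A query such as "Where is my wallet?" can then filter tracked objects by the asking agent's id before generating the spatial answer.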
Dialog System
The dialog system is the third major component of the overall architecture. It is in charge of understanding the user's spoken questions and assertions, and of providing intelligible answers using synthesized speech. This system is also responsible for detecting when the user is addressing the system (versus chatting with a workmate) by watching for two types of events coming from the user:
Using the keyword “Celia”
Looking directly at the camera
Once the user gets the system's attention using one of the above methods, he/she has two seconds to start phrasing the request. If nothing is heard within this interval, the system times out and resumes waiting for a new attention trigger.
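The attention logic above amounts to a small timed gate: a trigger (keyword or gaze) opens a short window in which speech is accepted, and the window closes on timeout. The class below is an illustrative sketch, not the paper's implementation; the injectable clock just makes the behavior checkable without real waiting.

```python
import time

ATTENTION_WINDOW = 2.0  # seconds, per the timeout described above

class AttentionGate:
    """Sketch of the attention logic: after a trigger, speech is
    accepted only within a short window (names are assumptions)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._triggered_at = None

    def trigger(self):
        """Called on the keyword 'Celia' or a direct look at the camera."""
        self._triggered_at = self._clock()

    def accepts_speech(self):
        if self._triggered_at is None:
            return False
        if self._clock() - self._triggered_at > ATTENTION_WINDOW:
            self._triggered_at = None  # timed out; wait for a new trigger
            return False
        return True

# Simulated clock so the timeout can be exercised instantly.
now = [0.0]
gate = AttentionGate(clock=lambda: now[0])
gate.trigger()
now[0] = 1.5
within = gate.accepts_speech()   # still inside the 2 s window
now[0] = 4.0
expired = gate.accepts_speech()  # window has passed
```

Resetting the trigger on timeout means a stale trigger can never accept speech later, which matches the "resumes waiting" behavior described above.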
Conclusion and Future Work
In this paper we detailed our preliminary work on keeping track of objects and people, and on informing the user of various relations between them upon request. Our system is able to resolve the belonging relation and allows the user to ask about his/her own or others' objects. While the system is currently part of a room, we believe a mobile robot raises fewer privacy concerns than a static camera mounted on the ceiling. It can also look at objects from different points of view, thus reducing possible blind spots and occlusions.
We are also planning to enable a proactive behavior where the robot follows people and watches which objects they are using. This way the robot will have a broader knowledge base of activities, and be able to answer questions about locations that are not in its current field of view.
- (2015-09) Describing spatial relationships between objects in images in English and French. In Proceedings of the Fourth Workshop on Vision and Language, Lisbon, Portugal, pp. 104–113.
- (2017-10) Implementation of human-robot VQA interaction system with dynamic memory networks. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 495–500.
- (2013-08) Learning spatial relations between objects from 3D scenes. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–2.
- (2018-05) Optimization beyond the convolution: generalizing spatial relations with end-to-end metric learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7.
- (2017-09) Metric learning for generalizing spatial relations to new objects. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3175–3182.
- (1954) The Construction of Reality in the Child. Ballantine Books.
- (2011) Learning spatial relationships between objects. The International Journal of Robotics Research 30 (11), pp. 1328–1342.
- (2002) What we need to know about age related memory loss. BMJ 324 (7352), pp. 1502–1505.