Interactive Robot Learning of Gestures, Language and Affordances

A growing field in robotics and Artificial Intelligence (AI) research is human–robot collaboration, whose target is to enable effective teamwork between humans and robots. However, in many situations human teams are still superior to human–robot teams, primarily because human teams can easily agree on a common goal through language, and because the individual members observe each other effectively, leveraging their shared motor repertoire and sensorimotor resources. This paper shows that for cognitive robots it is possible, and indeed fruitful, to combine knowledge acquired from interacting with elements of the environment (affordance exploration) with the probabilistic observation of another agent's actions. We propose a model that unites (i) learning robot affordances and word descriptions with (ii) statistical recognition of human gestures with vision sensors. We discuss theoretical motivations and possible implementations, and we show initial results which highlight that, after acquiring knowledge of its surrounding environment, a humanoid robot can generalize this knowledge to the case in which it observes another agent (a human partner) performing the same motor actions previously executed during training.

1 Introduction

Robotics is progressing fast, with a steady and systematic shift from the industrial domain to domestic, public and leisure environments [siciliano:2016:handbook2, ch. 65, Domestic Robotics]. Application areas that are particularly relevant and actively researched by the scientific community include robots for people’s health and active aging, mobility, and advanced manufacturing (Industry 4.0): in short, all domains that require direct and effective human–robot interaction and communication, including language and gestures [matuszek:2014:aaai].

However, robots have not reached the level of performance that would enable them to work with humans in routine activities in a flexible and adaptive way, for example in the presence of sensor noise or of unexpected events not previously seen during the training or learning phase. One of the reasons for this performance gap between human–human and human–robot teamwork lies in the collaboration aspect, i. e., whether the members of a team understand one another. Humans have the ability to work successfully in groups. They can agree on common goals (e. g., through verbal and non-verbal communication), work towards the execution of these goals in a coordinated way, and understand each other’s physical actions (e. g., body gestures) towards the realization of the final target. Human team coordination and mutual understanding are effective [ramnani:2004:natureneuro] because of (i) the capacity to adapt to unforeseen events in the environment, and to re-plan one’s actions in real time if necessary, and (ii) a common motor repertoire and action model, which permits us to understand a partner’s physical actions and manifested intentions as if they were our own [saponaro:2013:crhri].

In neuroscience research, visuomotor neurons (i. e., neurons that are activated by visual stimuli) have been a subject of ample study [rizzolatti:2001:nrn]. Mirror neurons are one class of such neurons that respond to action–object interactions, both when the agent acts and when it observes the same action performed by others, hence the name “mirror”.

Figure 1: Experimental setup, consisting of an iCub humanoid robot and a human user performing a manipulation gesture on a shared table with different objects on top. The depth sensor in the top-left corner is used to extract human hand coordinates for gesture recognition. Depending on the gesture and on the target object, the resulting effect will differ.

This work takes inspiration from the theory of mirror neurons, and contributes towards applying it to humanoid and cognitive robots. We show that a robot can first acquire knowledge by sensing and self-exploring its surrounding environment (e. g., by interacting with available objects and building up an affordance representation of the interactions and their outcomes) and, as a result, is capable of generalizing this acquired knowledge while observing another agent (e. g., a human) performing physical actions similar to the ones executed during prior robot training. Fig. 1 shows the experimental setup.

2 Related Work

A large and growing body of research is directed towards having robots learn new cognitive skills, or improving their capabilities, by interacting autonomously with their surrounding environment. In particular, robots operating in an unstructured scenario may understand available opportunities conditioned on their body, perception and sensorimotor experiences: the intersection of these elements gives rise to object affordances (action possibilities), as they are called in psychology [gibson:2014]. The usefulness of affordances in cognitive robotics is in the fact that they capture essential properties of environment objects in terms of the actions that a robot is able to perform with them [montesano:2008, jamone:2016:tcds]. Some authors have suggested an alternative computational model called Object–Action Complexes [kruger:2011:ras], which links low-level sensorimotor knowledge with high-level symbolic reasoning hierarchically in autonomous robots.

In addition, several works have demonstrated how combining robot affordance learning with language grounding can provide cognitive robots with new and useful skills, such as learning the association of spoken words with sensorimotor experience [salvi:2012:smcb, morse:2016:cogsci] or sensorimotor representations [stramandinoli:2016:icdl], learning tool use capabilities [goncalves:2014:icarsc, goncalves:2014:icdl], and carrying out complex manipulation tasks expressed in natural language instructions which require planning and reasoning [antunes:2016:icra].

In [salvi:2012:smcb], a joint model is proposed to learn robot affordances (i. e., relationships between actions, objects and resulting effects) together with word meanings. The data contain robot manipulation experiments, each of them associated with a number of alternative verbal descriptions uttered by two speakers, for a total of 1270 recordings. That framework assumes that the robot action is known a priori during the training phase (e. g., the label “grasping” is given during a grasping experiment), and the resulting model can be used at testing time to make inferences about the environment, including estimating the most likely action, based on evidence from other pieces of information.

Several neuroscience and psychology studies build upon the theory of mirror neurons which we brought up in the Introduction. These studies indicate that perceptual input can be linked with the human action system for predicting future outcomes of actions, i. e., the effect of actions, particularly when the person possesses concrete personal experience of the actions being observed in others [aglioti:2008:basketball, knoblich:2001:psychsci]. This has also been exploited under the deep learning paradigm [kim:2017:nn], by using a Multiple Timescales Recurrent Neural Network (MTRNN) to have an artificial simulated agent infer human intention from joint information about object affordances and human actions. One difference between this line of research and ours is that we use real, noisy data acquired from robots and sensors to test our models, rather than virtual simulations.

3 Proposed Approach

In this paper, we combine (1) the robot affordance model of [salvi:2012:smcb], which associates verbal descriptions to the physical interactions of an agent with the environment, with (2) the gesture recognition system of [saponaro:2013:crhri], which infers the type of action from human user movements. We consider three manipulative gestures corresponding to physical actions performed by agent(s) onto objects on a table (see Fig. 1): grasp, tap, and touch. We reason on the effects of these actions onto the objects of the world, and on the co-occurring verbal description of the experiments. In the complete framework, we will use Bayesian Networks (BNs), a probabilistic model that represents random variables and their conditional dependencies on a graph, such as in Fig. 2. One of the advantages of using Bayesian Networks is that their expressive power allows the marginalization over any set of variables given any other set of variables.
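As a minimal illustration of this property (a toy sketch with random placeholder probabilities, not the network learned in this work), the following Python snippet stores a small joint distribution over the action and two of the discrete variables introduced in Sec. 3.1 as a dense table, and answers a conditional query by summing over the variables that are neither observed nor queried:

```python
import numpy as np

# Toy joint distribution p(Action, Size, ObjVel); the numbers are
# random placeholders, not values learned from robot data.
actions = ["grasp", "tap", "touch"]
sizes = ["small", "medium", "big"]
vels = ["slow", "medium", "fast"]

rng = np.random.default_rng(0)
joint = rng.random((len(actions), len(sizes), len(vels)))
joint /= joint.sum()

# Query: p(Action | ObjVel = "fast"), obtained by fixing the evidence
# and marginalizing (summing) over the unobserved variable Size.
v = vels.index("fast")
unnormalized = joint[:, :, v].sum(axis=1)
posterior = unnormalized / unnormalized.sum()

for action, prob in zip(actions, posterior):
    print(f"p(Action={action} | ObjVel=fast) = {prob:.3f}")
```

A Bayesian Network encodes the same distribution in factorized form, which is more compact while still supporting exactly this kind of query.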

Our main contribution is that of extending [salvi:2012:smcb] by relaxing the assumption that the action is known during the learning phase. This assumption is acceptable when the robot learns through self-exploration and interaction with the environment, but must be relaxed if the robot needs to generalize the acquired knowledge through the observation of another (human) agent. We estimate the action performed by a human user during a human–robot collaborative task by employing statistical inference methods and Hidden Markov Models. This provides two advantages. First, we can infer the executed action during training. Second, at testing time we can merge the action information obtained from gesture recognition with the information about affordances.
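As a sketch of how such an action estimate can be obtained (assuming, as is standard in HMM-based gesture classification, one model $\lambda_a$ trained per gesture class; the symbols $G$ for the observed hand-trajectory features and $\lambda_a$ are our notation, with $A$ the action variable formally defined in Sec. 3.1), the recognizer yields a posterior over actions:

```latex
p(A = a \mid G) \;\propto\; p(G \mid \lambda_a)\, p(A = a),
\qquad a \in \{\text{grasp}, \text{tap}, \text{touch}\}.
```

It is this (generally soft) posterior, rather than a single hard decision, that can be merged with the affordance model at testing time.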

Figure 2: Abstract representation of the probabilistic dependencies in the model. Shaded nodes are observable or measurable in the present study, and edges indicate Bayesian dependency.

3.1 Bayesian Network for Affordance–Words Modeling

Following the method adopted in [salvi:2012:smcb], we use a Bayesian probabilistic framework to allow a robot to ground the basic world behavior and the verbal descriptions associated to it. The world behavior is defined by random variables describing: the actions $A$, defined over the set $\mathcal{A}$, object properties $F$, defined over $\mathcal{F}$, and effects $E$, defined over $\mathcal{E}$. We denote by $X = \{A, F, E\}$ the state of the world as experienced by the robot. The verbal descriptions are denoted by the set of words $W$. Consequently, the relationships between words and concepts are expressed by the joint probability distribution $p(X, W)$ of actions, object features, effects, and words in the spoken utterance. The symbolic variables and their discrete values are listed in Table 1. In addition to the symbolic variables, the model also includes word variables, describing the probability of each word co-occurring in the verbal description associated to a robot experiment in the environment.
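In this notation (a worked rendering of the quantities defined above, not an additional modeling assumption), the model captures the joint distribution over the world state and the words, and supports queries such as recovering the most likely action from the remaining evidence, as in [salvi:2012:smcb]:

```latex
p(X, W) = p(A, F, E, W), \qquad X = \{A, F, E\},
\qquad
\hat{a} = \arg\max_{a \in \mathcal{A}} \, p(A = a \mid F = f,\, E = e,\, W = w).
```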

Name     Description       Values
Action   action            grasp, tap, touch
Shape    object shape      sphere, box
Size     object size       small, medium, big
ObjVel   object velocity   slow, medium, fast

Table 1: The symbolic variables of the Bayesian Network which we use in this work (a subset of the ones from [salvi:2012:smcb]), with the corresponding discrete values obtained from clustering during previous robot exploration of the environment.

This joint probability distribution, illustrated by the part of Fig. 2 enclosed in the dashed box, is estimated by the robot in an ego-centric way through interaction with the environment, as in [salvi:2012:smcb]. As a consequence, during learning the robot knows with certainty which action it is performing, and the variable $A$ assumes a deterministic value. This assumption is relaxed in the present study by extending the model to the observation of external (human) agents, as explained below.
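A minimal sketch of this distinction (toy conditional probabilities and a hypothetical helper function, not the learned model): during self-exploration the evidence on the action variable $A$ is a one-hot, deterministic distribution, whereas when observing a human it is the soft posterior produced by the gesture recognizer; the same marginalization over $A$ then serves both regimes.

```python
import numpy as np

ACTIONS = ["grasp", "tap", "touch"]

# Toy table p(E | A, F) for one fixed object F: rows index actions,
# columns index three discretized effect (object velocity) values.
p_effect_given_action = np.array([
    [0.80, 0.15, 0.05],   # grasp
    [0.10, 0.30, 0.60],   # tap
    [0.60, 0.30, 0.10],   # touch
])

def predict_effect(action_belief):
    """Return p(E | F, evidence on A) = sum_a p(E | A=a, F) * p(A=a)."""
    return action_belief @ p_effect_given_action

# Self-exploration: the robot knows its own action with certainty.
print(predict_effect(np.array([0.0, 1.0, 0.0])))      # deterministic "tap"

# Observation of a human partner: soft posterior from the gesture HMM.
print(predict_effect(np.array([0.15, 0.70, 0.15])))   # probably "tap"
```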

3.2 Hidden Markov Models for Gesture Recognition