Robotics is progressing fast, with a steady and systematic shift from the industrial domain to domestic, public and leisure environments [siciliano:2016:handbook2, ch. 65, Domestic Robotics]. Application areas that are particularly relevant and actively researched by the scientific community include robots for people's health and active aging, mobility, and advanced manufacturing (Industry 4.0): in short, all domains that require direct and effective human–robot interaction and communication (including language and gestures [matuszek:2014:aaai]).
However, robots have not yet reached the level of performance that would enable them to work with humans in routine activities in a flexible and adaptive way, for example in the presence of sensor noise or of unexpected events not seen during the training or learning phase. One reason for this performance gap between human–human teamwork and human–robot teamwork lies in the collaboration aspect, i.e., whether the members of a team understand one another. Humans have the ability to work successfully in groups. They can agree on common goals (e.g., through verbal and non-verbal communication), work towards the execution of these goals in a coordinated way, and understand each other's physical actions (e.g., body gestures) towards the realization of the final target. Human team coordination and mutual understanding are effective [ramnani:2004:natureneuro] because of (i) the capacity to adapt to unforeseen events in the environment and to re-plan one's actions in real time if necessary, and (ii) a common motor repertoire and action model, which permits us to understand a partner's physical actions and manifested intentions as if they were our own [saponaro:2013:crhri].
In neuroscience research, visuomotor neurons (i.e., neurons that are activated by visual stimuli) have been the subject of ample study [rizzolatti:2001:nrn]. Mirror neurons are one class of such neurons: they respond to action and object interaction, both when the agent acts and when it observes the same action performed by others, hence the name “mirror”.
This work takes inspiration from the theory of mirror neurons, and contributes towards using it on humanoid and cognitive robots. We show that a robot can first acquire knowledge by sensing and self-exploring its surrounding environment (e.g., by interacting with available objects and building up an affordance representation of the interactions and their outcomes) and, as a result, is capable of generalizing this acquired knowledge while observing another agent (e.g., a human person) who performs physical actions similar to the ones executed during prior robot training. Fig. 1 shows the experimental setup.
2 Related Work
A large and growing body of research is directed towards having robots learn new cognitive skills, or improving their capabilities, by interacting autonomously with their surrounding environment. In particular, robots operating in an unstructured scenario may understand available opportunities conditioned on their body, perception and sensorimotor experiences: the intersection of these elements gives rise to object affordances (action possibilities), as they are called in psychology [gibson:2014]. The usefulness of affordances in cognitive robotics is in the fact that they capture essential properties of environment objects in terms of the actions that a robot is able to perform with them [montesano:2008, jamone:2016:tcds]. Some authors have suggested an alternative computational model called Object–Action Complexes [kruger:2011:ras], which links low-level sensorimotor knowledge with high-level symbolic reasoning hierarchically in autonomous robots.
In addition, several works have demonstrated how combining robot affordance learning with language grounding can provide cognitive robots with new and useful skills, such as learning the association of spoken words with sensorimotor experience [salvi:2012:smcb, morse:2016:cogsci] or sensorimotor representations [stramandinoli:2016:icdl], learning tool use capabilities [goncalves:2014:icarsc, goncalves:2014:icdl], and carrying out complex manipulation tasks expressed in natural language instructions which require planning and reasoning [antunes:2016:icra].
In [salvi:2012:smcb], a joint model is proposed to learn robot affordances (i.e., relationships between actions, objects and resulting effects) together with word meanings. The data contain robot manipulation experiments, each associated with a number of alternative verbal descriptions uttered by two speakers, for a total of 1270 recordings. That framework assumes that the robot action is known a priori during the training phase (e.g., the label “grasping” is given during a grasping experiment), and the resulting model can be used at testing time to make inferences about the environment, including estimating the most likely action, based on evidence from the other pieces of information.
Several neuroscience and psychology studies build upon the theory of mirror neurons, which we brought up in the Introduction. These studies indicate that perceptual input can be linked with the human action system for predicting future outcomes of actions, i.e., the effect of actions, particularly when the person possesses concrete personal experience of the actions being observed in others [aglioti:2008:basketball, knoblich:2001:psychsci]. This has also been exploited under the deep learning paradigm [kim:2017:nn], by using a Multiple Timescales Recurrent Neural Network (MTRNN) to have a simulated artificial agent infer human intention from joint information about object affordances and human actions. One difference between that line of research and ours is that we use real, noisy data acquired from robots and sensors to test our models, rather than virtual simulations.
3 Proposed Approach
In this paper, we combine (1) the robot affordance model of [salvi:2012:smcb], which associates verbal descriptions to the physical interactions of an agent with the environment, with (2) the gesture recognition system of [saponaro:2013:crhri], which infers the type of action from human user movements. We consider three manipulative gestures corresponding to physical actions performed by an agent onto objects on a table (see Fig. 1): grasp, tap, and touch. We reason on the effects of these actions onto the objects of the world, and on the co-occurring verbal description of the experiments. In the complete framework, we will use Bayesian Networks (BNs), a probabilistic model that represents random variables and their conditional dependencies on a graph, such as in Fig. 2. One of the advantages of Bayesian Networks is that their expressive power allows the marginalization over any set of variables given any other set of variables.
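To make the marginalization property concrete, the following minimal numpy sketch works directly on a discrete joint distribution; the variables mirror some of those in Table 1, but the shape and the probabilities are made up for illustration, not taken from the learned model.

```python
import numpy as np

# Toy joint distribution P(Action, Size, ObjVel): 3 actions x 3 sizes x
# 3 velocities. The numbers are random and for illustration only.
rng = np.random.default_rng(0)
joint = rng.random((3, 3, 3))
joint /= joint.sum()                      # normalize to a valid distribution

# Marginalize out Size and ObjVel to obtain P(Action).
p_action = joint.sum(axis=(1, 2))

# Condition on evidence ObjVel = fast (index 2): P(Action | ObjVel = fast).
slice_fast = joint[:, :, 2].sum(axis=1)   # unnormalized P(Action, ObjVel=fast)
p_action_given_fast = slice_fast / slice_fast.sum()

print(p_action)              # a valid distribution over the 3 actions
print(p_action_given_fast)   # posterior over actions given the evidence
```

Any subset of variables can be summed out or clamped as evidence in this way; a BN simply stores this joint compactly through its conditional-dependency factorization instead of as one dense table.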
Our main contribution is to extend [salvi:2012:smcb] by relaxing the assumption that the action is known during the learning phase. This assumption is acceptable when the robot learns through self-exploration and interaction with the environment, but it must be relaxed if the robot is to generalize the acquired knowledge through the observation of another (human) agent. We estimate the action performed by a human user during a human–robot collaborative task by employing statistical inference methods and Hidden Markov Models (HMMs). This provides two advantages. First, we can infer the executed action during training. Second, at testing time we can merge the action information obtained from gesture recognition with the information about affordances.
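As a rough sketch of the HMM-based gesture recognition step, one discrete-observation HMM per gesture class can be scored with the forward algorithm, and the class with the highest likelihood selected. All parameters below (two hidden states, three observation symbols, the probability values) are illustrative placeholders, not the models trained in this work.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.
    pi: initial state probs, A: state transitions, B: emissions (states x symbols)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    logp = np.log(c)
    alpha /= c                            # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        logp += np.log(c)
        alpha /= c
    return logp

ACTIONS = ["grasp", "tap", "touch"]

# One illustrative 2-state HMM per gesture (pi, A, B); numbers are made up.
models = {
    "grasp": (np.array([0.9, 0.1]),
              np.array([[0.8, 0.2], [0.2, 0.8]]),
              np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])),
    "tap":   (np.array([0.5, 0.5]),
              np.array([[0.5, 0.5], [0.5, 0.5]]),
              np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]])),
    "touch": (np.array([0.1, 0.9]),
              np.array([[0.8, 0.2], [0.2, 0.8]]),
              np.array([[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]])),
}

obs = [1, 1, 1, 1]   # a quantized motion-feature sequence (symbol ids)
scores = {a: forward_loglik(obs, *models[a]) for a in ACTIONS}
best = max(scores, key=scores.get)
print(best)          # the gesture class with the highest log-likelihood
```

In practice the observation symbols would come from quantized hand/arm trajectory features, and the per-step likelihoods (rather than only the final argmax) can be kept as a soft posterior over actions for the fusion step described next.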
3.1 Bayesian Network for Affordance–Words Modeling
Following the method adopted in [salvi:2012:smcb], we use a Bayesian probabilistic framework to allow a robot to ground the basic world behavior and the verbal descriptions associated to it. The world behavior is defined by random variables describing: the actions $A$, defined over the set $\mathcal{A}$, the object properties $F$, defined over $\mathcal{F}$, and the effects $E$, defined over $\mathcal{E}$. We denote by $X = \{A, F, E\}$ the state of the world as experienced by the robot. The verbal descriptions are denoted by the set of words $W$. Consequently, the relationships between words and concepts are expressed by the joint probability distribution $p(X, W)$ of actions, object features, effects, and words in the spoken utterance. The symbolic variables and their discrete values are listed in Table 1. In addition to the symbolic variables, the model also includes word variables, describing the probability of each word co-occurring in the verbal description associated to a robot experiment in the environment.
| Symbol | Description | Values |
|---|---|---|
| Action | action | grasp, tap, touch |
| Shape | object shape | sphere, box |
| Size | object size | small, medium, big |
| ObjVel | object velocity | slow, medium, fast |
This joint probability distribution, illustrated by the part of Fig. 2 enclosed in the dashed box, is estimated by the robot in an ego-centric way through interaction with the environment, as in [salvi:2012:smcb]. As a consequence, during learning, the robot knows with certainty what action it is performing, and the action variable assumes a deterministic value. This assumption is relaxed in the present study, by extending the model to the observation of external (human) agents, as explained below.
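When the acting agent is a human rather than the robot itself, the action node no longer receives a hard value; instead, the gesture recognizer's posterior can be treated as soft evidence on that node. The following sketch fuses the two information sources by elementwise product and renormalization, under a naive conditional-independence assumption between the two sources; both distributions below are made-up numbers, not learned values.

```python
import numpy as np

ACTIONS = ["grasp", "tap", "touch"]

# Posterior over actions predicted by the affordance network from
# object features and observed effects (illustrative numbers).
p_affordance = np.array([0.20, 0.70, 0.10])

# Posterior over actions produced by the gesture HMMs from the
# observed human motion (illustrative numbers).
p_gesture = np.array([0.30, 0.60, 0.10])

# Soft-evidence fusion: multiply the two sources and renormalize.
fused = p_affordance * p_gesture
fused /= fused.sum()

print(dict(zip(ACTIONS, fused.round(3))))
```

This simple product rule sharpens the estimate when the two sources agree and flattens it when they conflict; a full BN treatment would instead attach the gesture likelihoods as virtual evidence on the action node, which reduces to the same computation in this single-node case.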