Interactive reinforcement learning (IRL) has received increasing attention as a way of teaching autonomous robotic agents to perform a task. In traditional reinforcement learning (RL), an agent interacts with the environment, performing actions and observing new situations, in order to learn an optimal policy that allows it to perform a task autonomously [Sutton98]. Although RL has been shown to be effective for learning agents, one open issue is the excessive time required by the agent to find a proper policy [Griffith13]. IRL speeds up this learning process by extending traditional RL with a parent-like teacher who is allowed to advise the agent in selected episodes.
Caregivers interact with infants through different multi-sensory stimuli such as speech and gestures. These stimuli can also be seen as guidance from a teacher who provides a set of instructions on how to achieve a specific goal. Although IRL approaches have been implemented in robotic scenarios, an open issue is that the communication interface between the teacher and the robot may not be straightforward for non-expert trainers in a domestic environment. This motivates the development of simpler interactive scenarios where parent-like teachers can provide instructions using their natural communication skills such as speech and gestures. In this setting, however, the feedback provided by the user may be incongruent or noisy. From a computational perspective, the use of multiple sensory modalities has been shown to attenuate the ambiguity of incoming stimuli [Bauer15], providing the means to enhance perception-driven behavior. However, the process of multi-sensory integration must also take into account the case in which the information from the multiple sources is in conflict (incongruent). Owing to uncertainty in their integration, multi-sensory instructions may be unclear and misunderstood, thereby leading to decreased performance of the apprentice agent when solving a task [Griffith13].
In previous research [CruzIROS], we presented a multi-modal IRL approach using dynamic audio-visual input as trainer-like feedback. Multi-modal feedback can be provided to the agent through a set of vocal commands and hand gestures using a microphone and a depth sensor, respectively. Our IRL algorithm integrates audio-visual information with the aim of providing robust feedback along with a confidence value that indicates the level of trustworthiness of the feedback based on sensory cues. However, a limitation of this approach is that multi-modal feedback is taken into account on the basis of the confidence value of the integrated feedback predictions, thus relying on the ability of our architecture to robustly predict speech and gestures from sensory cues (which is not always the case in real-world scenarios).
In this paper, we extend our architecture to modulate the influence of sensory-driven feedback in the IRL task using goal-oriented knowledge. This is motivated by neurobehavioral studies on multi-modal processing in which human subjects exposed to audio-visual stimuli integrate multiple sources of information driven by a combination of sensory representations and prior expectations [Odegaard15]. In our approach, we integrate task-specific knowledge in terms of contextual affordances, which represent an effective method to anticipate the effect of actions performed by an agent interacting with objects [CruzESANN]. We train a multi-layer neural network to predict the effect of the performed actions with different objects in order to avoid failed-states, i.e., states from which it is not possible for the agent to complete the task. Therefore, multi-modal processing driven by audio-visual sensory representations is integrated with knowledge about the task so that, e.g., predicted feedback with a high confidence value can be bypassed if this prediction leads to a failed-state.
We conduct a set of experiments to explore the interplay of external feedback and contextual (task-specific) affordances in a cleaning scenario in which a humanoid robot can interact with two objects with the goal of cleaning the surface of a table. We compare the learning performance in terms of speed of convergence and accumulated reward under four different conditions: traditional RL, multi-modal IRL, and each of these two setups with the use of affordances. For this purpose, we vary the percentage of available feedback and contextual affordances during the learning process. The obtained results demonstrate that multi-modal sensory processing integrated with affordance-driven IRL yields the best learning performance for the proposed IRL task.
II Related Work
II-A Interactive Reinforcement Learning
RL is a behavioral approach used by autonomous agents to learn new tasks [Sutton98]. The agent interacts with the environment in order to find a proper policy which determines how to act to accomplish a given task effectively. During a learning episode, an RL agent performs an action obtaining a reward (or a punishment) and a new state from the environment (Fig. 1). These actions are selected according to a policy $\pi$, which in psychology is referred to as a set of stimulus–response rules or associations [Kornblum90]. The value of taking an action $a$ in a state $s$ under a policy $\pi$ is denoted as $Q^{\pi}(s,a)$, which is also called the action-value function for a policy $\pi$.
The process of learning in humans and animals has been widely studied in neuroscience, yielding a better understanding of how the brain is able to acquire new cognitive skills. RL is associated with cognitive memory and decision-making mechanisms in the biological brain in terms of how behavior is generated [Niv09]. RL is a method used to address optimal decision-making, attempting to maximize the collected reward and minimize the punishment over time, and it has been shown to be successful in terms of acquiring new skills in robotics [Kober13][Kormushev13].
To learn a task, an RL agent has to interact with the environment in order to collect and refine knowledge over time. Nevertheless, it may sometimes be inefficient to give the agent full autonomy when learning a task since this may lead to excessive training time. IRL speeds up the learning process by using a parent-like advisor (Fig. 1) who supports the learning by delivering useful advice in selected episodes through either reward- or policy-shaping [Griffith13]. This approach reduces the search space and allows the agent to learn the task faster in comparison to a fully autonomous agent [Suay11, Knox13]. Therefore, an apprentice agent can be taught by a parent-like trainer in a similar way as caregivers assist infants during the learning of new tasks. A parent-like trainer can be either a human user or another artificial agent.
In robotics scenarios, IRL has been applied using different communication interfaces between the trainer and the apprentice agent. For example, Suay and Chernova [Suay11] proposed an IRL task where a humanoid robot receives advice from an external trainer using a graphical user interface built on top of the visual input from the robot’s camera. The trainer may provide feedback through the use of a tablet to deliver directions to the robot. In another IRL approach by Knox et al. [Knox13], the trainer can send feedback to the apprentice robot using a presenter control. However, these communication interfaces lack usability in home scenarios, where non-expert trainers should be able to train assistive robots in a more natural way.
II-B Multi-modal Integration
In our daily life, we are constantly subject to external stimuli through different sensory modalities (e.g., vision, audition, and touch) that together provide a coherent and robust perceptual experience [Stein09]. Similarly, robot perception may be driven by an array of sensors that in concert contribute to the efficient and robust interaction with the environment. In human-robot interaction (HRI) scenarios, robots may take advantage of multi-sensory information to reliably operate in highly dynamic environments and in situations of sensory uncertainty.
From a computational perspective, multiple studies have shown the advantages of integrating multi-sensory information. For instance, Lacheze et al. [Lacheze09] presented a multi-modal integration approach for the classification of static patterns using audio-visual input. This work uses the auditory information to improve the classification of objects when they are partially obstructed. Kimura et al. [Kimura15] proposed a multi-modal approach to estimate the characteristics of unknown objects using an RGB-D camera, a stereo microphone, and sensors of pressure and weight. Ozasa et al. [Ozasa12] presented an approach to recognize unknown objects using multi-modal integration of images and speech through logistic regression. They also integrated a confidence value to improve the accuracy of the recognition. However, this integrated confidence value did not take into account the case of conflict of uni-modal information. Such a conflict may occur due to either sensor noise or incongruent advice provided by the user. In previous research [CruzIROS], we proposed an integration function that considers incongruent audio-visual input so that in the case of conflict, the predicted label with the highest confidence is preferred. Both of these approaches yield exclusively sensory-driven integration and thus do not consider task-specific knowledge, which plays an important role in multi-modal integration [Odegaard15].
Affordances are the possible actions available to an agent operating in its environment [Gibson79]. They represent characteristics of the relation between the agent and an object in terms of the operational opportunities the object offers to the agent. The original idea of affordances by Gibson included many practical examples but lacked a formal definition, leading to significant differences among cognitive psychologists about the definition and use of affordances [Horton12], and also in the field of artificial intelligence [Chemero07]. Horton et al. [Horton12] recognized three main attributes of an affordance: i) its existence is associated with the capabilities of an agent, ii) it exists regardless of whether the agent is able to perceive it, and iii) it does not change, unlike necessities or goals of the agent. Different approaches for learning affordances in robotics have been proposed. For instance, Lopes et al. [Lopes07] addressed the imitation learning problem using affordance-based action sequences. The affordance model was extended to allow the robot to work with a second object using an enlarged Bayesian network to represent affordances [Moldovan12], studying the use of functional affordances, which also leads to a reduced action space. In this regard, an object can be used in a restricted manner by not considering all its operational opportunities but only the socially acceptable ones.
In robotics, affordances have been represented as a triplet $\langle object, action, effect \rangle$ which encodes relationships between its components [Montesano08]. Therefore, it is possible to predict the effect using objects and actions as domain variables as $effect = f(object, action)$. Nevertheless, although this model has been shown to be suitable for many scenarios, it does not include the contextual information that is needed to anticipate effects properly in all situations [Kammer11]. For instance, let us consider a scenario where an agent interacts with a set of given objects. If the agent has both hands occupied with objects, then the agent cannot grasp a new object. In other words, the affordance of graspability is temporarily unavailable until the agent drops one object or places it back onto the table (which in turn may modify the context). This aspect must be considered for agents operating in real-world scenarios for an efficient and effective interaction with the environment.
III Our Method
A diagram of our architecture is illustrated in Fig. 2. Audio-visual input is processed by two distinct modules that recognize feedback commands from speech and gestures. Both recognition modules predict a command label and a confidence value that indicates the level of trustworthiness of such predictions. The multi-modal integration (MMI) module computes a joint label and a confidence value on the basis of the uni-modal predictions. For this purpose, we use a mathematical transformation that also takes into account incongruent information, i.e., predicted command labels from speech and vision that do not match. The labels of the integrated feedback are used as the input to compute the contextual affordances, i.e., the effects of an action given the current state. Thus, the IRL algorithm may or may not take the feedback into account, e.g., bypassing feedback that leads the agent to a failed-state from which the task cannot be successfully carried out. In the following sections, we describe in detail the above-mentioned modules and their implementation.
III-A Speech and Gesture Recognition
To process speech from auditory information, we used a cloud-based automatic speech recognition (ASR) system with local audio signals. The ASR system is based on DOCKS [Twiefel14], which selects the best hypothesis given a domain-specific sentence list associated with our robot task. Our domain-specific language comprises robot commands represented by a list of sentences which can be interpreted as advice for the agent. To determine which sentence fits best, we use the Levenshtein distance to compare each obtained hypothesis to the sentences in our domain-specific language using a phonemic representation. Given the set of the 10-best hypotheses and the set of the in-domain sentences, the predicted class label is computed as:
where is the Levenshtein distance in our ASR system. The confidence value is computed as:
with and , both represented as phonemes.
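The matching step described above can be illustrated with a minimal sketch. It is not the DOCKS implementation: the function names, the toy phoneme lists, and in particular the confidence formula (one minus the length-normalized distance, clipped at zero) are assumptions made only for illustration.

```python
from typing import Dict, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two token sequences (here: phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(hypotheses: Sequence[Sequence[str]],
               sentences: Dict[str, Sequence[str]]) -> Tuple[str, float]:
    """Return the in-domain label closest to any hypothesis, with a confidence."""
    if not hypotheses or not sentences:
        raise ValueError("need at least one hypothesis and one sentence")
    best_label, best_dist, best_len = "", -1, 1
    for hyp in hypotheses:
        for label, phonemes in sentences.items():
            dist = levenshtein(hyp, phonemes)
            if best_dist < 0 or dist < best_dist:
                best_label, best_dist, best_len = label, dist, max(len(phonemes), 1)
    # ASSUMPTION: length-normalized distance as confidence, not the paper's formula.
    confidence = max(0.0, 1.0 - best_dist / best_len)
    return best_label, confidence
```

With a hypothesis that is one phoneme short of the stored sentence for "go left", this sketch would return that label with a confidence slightly below one.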
For recognizing gestures, we used an extended version of neural network learning for gesture recognition [ParisiHandSOM] that extracts hand-independent gesture features from depth map sequences. The learning model consists of a set of two hierarchically arranged self-organizing networks that learn the spatiotemporal structure of the input sequences in terms of gesture features. Along with a predicted label, we also estimate a confidence value that expresses the degree of belief that the prediction is correct based on sensory-driven observations. Training videos were recorded with an ASUS Xtion depth sensor from which we estimated the 3D skeleton model.
A label prediction is carried out every 3 frames in a sliding window scheme. We consider the last 5 observations and compute the statistical mode that returns the most frequent value given the set of predictions , from which we compute the gesture class as . Let be the number of occurrences of in so that the confidence value can be defined as , thus yielding a confidence value in the range between and .
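The sliding-window vote can be sketched as follows. The class name is hypothetical, and the confidence is assumed here to be the fraction of observations in the window that agree with the mode, which may differ from the exact definition used in this work.

```python
from collections import Counter, deque
from typing import Deque, Tuple


class GestureVoter:
    """Keeps the last `window` frame-level predictions and votes on them."""

    def __init__(self, window: int = 5) -> None:
        self.observations: Deque[str] = deque(maxlen=window)

    def update(self, predicted_label: str) -> Tuple[str, float]:
        """Add one prediction and return (mode label, assumed confidence)."""
        self.observations.append(predicted_label)
        label, count = Counter(self.observations).most_common(1)[0]
        # ASSUMPTION: confidence = occurrences of the mode / window content size.
        return label, count / len(self.observations)
```

Feeding the sequence wipe, wipe, grasp, wipe, place into such a voter yields the label "wipe" with three of five observations in agreement.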
III-B Audio-visual Integration
To integrate the multi-sensory input, we propose a non-linear transformation function. This function defines the relationship between the predicted labels and the confidence values: for audio and for vision. We compute the integrated label using the highest confidence value:
Therefore, when and are equal, any of them is assigned to . In case that and are different, then the label with a greater confidence is assigned to . The integrated confidence value is computed by the function , with being a dynamic parameter that depends on each congruent or incongruent pair of predicted labels and their confidence values. We refer to this parameter as the likeliness parameter which we compute as follows:
The likeliness parameter strengthens the integrated confidence value when the predicted labels for audio and vision are congruent, whereas it diminishes this value if the predicted uni-modal labels are incongruent. After applying the transformation function, the confidence value is rescaled between and .
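The qualitative behavior of the integration (prefer the more confident label, strengthen the confidence for congruent predictions, weaken it for incongruent ones) can be sketched as below. The label selection follows the rule stated in the text; the confidence computation is a deliberately simple stand-in, and the `boost` and `penalty` parameters are hypothetical and do not reproduce the non-linear transformation or the likeliness parameter of this work.

```python
from typing import Tuple


def integrate(label_audio: str, conf_audio: float,
              label_vision: str, conf_vision: float,
              boost: float = 0.2, penalty: float = 0.2) -> Tuple[str, float]:
    """Fuse two uni-modal predictions into one label and one confidence."""
    # Label rule from the text: the prediction with the higher confidence wins.
    label = label_audio if conf_audio >= conf_vision else label_vision
    stronger = max(conf_audio, conf_vision)
    weaker = min(conf_audio, conf_vision)
    if label_audio == label_vision:
        # HYPOTHETICAL stand-in: congruent cues reinforce each other.
        confidence = stronger + boost * weaker
    else:
        # HYPOTHETICAL stand-in: conflicting cues reduce the confidence.
        confidence = stronger - penalty * weaker
    # Keep the result inside the unit interval.
    return label, min(1.0, max(0.0, confidence))
```

A downstream threshold on the returned confidence would then decide whether the fused advice is passed on to the learner.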
III-C Affordance-driven IRL
To introduce task-specific information to the available affordances, we use contextual affordances where an additional variable is considered denoting the current state [CruzESANN]. In this case, the affordance triplet is now extended to $\langle state, object, action, effect \rangle$, yielding:
$effect = f(state, object, action)$
where $state$ is the agent’s current state, $object$ is one of the entities that the agent can interact with, $action$ represents motor behavior that can be triggered by the interaction with the objects, and $effect$ is the outcome of a specific action involving agent-object interaction [Atil10].
Because it includes the agent’s current state, the model of contextual affordances provides knowledge about actions that lead to failed-states from which it is not possible to accomplish the given task; the action space is therefore reduced by avoiding these states.
Formally, the problem can be stated as follows. Consider an agent performing the same action $a$ with the same object $o$ but from two different agent states $s_1 \neq s_2$: when the action is performed, different effects $e_1 \neq e_2$ could be generated since the initial states $s_1$ and $s_2$ are different. It is then unfeasible to establish differences in the final effect when we use affordances to represent it because $e_1 = f(o, a)$ and $e_2 = f(o, a)$. Thus, to deal with the current states $s_1 \neq s_2$, an agent must learn to distinguish each case by using contextual affordances defined by $e_1 = f(s_1, o, a)$ and $e_2 = f(s_2, o, a)$, which establishes clear differences between the final effects.
The described model of contextual affordances allows us to predict the effects of specific actions, in our case with the use of a feed-forward neural network that learns the relationship of the states, the actions, and the objects. In our IRL approach, we use different levels of availability for contextual affordances to modulate the learning process. Contextual affordances are used in both autonomous RL and IRL, i.e., the selected action (either by the agent autonomously or by the parent-like trainer as feedback) may be bypassed if the effect of performing such an action leads the agent to a failed-state.
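The bypass mechanism can be sketched as an action filter wrapped around any effect predictor. Here a hand-written rule (`toy_effect`) stands in for the trained feed-forward network, and its single rule, the two-variable state, the `availability` argument, and all function names are hypothetical choices made only for illustration.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

State = Tuple[str, str]  # (object in hand, hand position): a reduced toy state
FAILED = object()        # sentinel returned when an action leads to a failed-state


def toy_effect(state: State, action: str) -> object:
    """HYPOTHETICAL stand-in for the affordance network: same action,
    different effect depending on the current state."""
    hand_object, hand_position = state
    if action == "grasp" and hand_object != "free":
        return FAILED  # toy rule: grasping with an occupied hand fails
    return (hand_object, hand_position)  # placeholder successor state


def choose_action(state: State, proposed: str, actions: Sequence[str],
                  predict_effect: Callable[[State, str], object],
                  availability: float = 1.0,
                  rng: Optional[random.Random] = None) -> str:
    """Keep the proposed action unless the affordance model vetoes it."""
    rng = rng or random.Random()
    # With probability (1 - availability) the affordance is not consulted.
    if rng.random() >= availability:
        return proposed
    if predict_effect(state, proposed) is not FAILED:
        return proposed
    # Bypass: pick among the actions not predicted to fail, if any exist.
    safe = [a for a in actions if predict_effect(state, a) is not FAILED]
    return rng.choice(safe) if safe else proposed
```

The same filter applies whether `proposed` comes from the agent's own policy or from trainer feedback, which mirrors the use of contextual affordances in both the autonomous RL and the IRL setting.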
IV Experimental Results
IV-A Robot Scenario
In order to test our method, we implement a robotic home scenario where a robot interacts with a parent-like trainer to perform a cleaning task. The task consists of wiping a table with the use of one of the robot’s arms. To successfully complete the cleaning task, it is necessary to carry out additional sub-tasks such as interacting with objects on the table. The trainer is able to advise the robot on what action to perform next through the use of speech, gestures, or a combination of both. The scenario comprises three different locations:
left, the left section of the table;
right, the right section of the table;
home, an additional position that is the initial and final position of the robot’s arm.
There are two objects included in the scenario:
sponge, used to wipe both sections of the table. The sponge is placed at the home position while it is not being used by the robot;
goblet, initially placed in one of the sections of the table and, therefore, it must be moved from one section to the other during cleaning in order to end the task successfully.
The robot can perform seven different actions. These actions may be decided by the robot autonomously as well as by the parent-like trainer through multi-modal feedback in terms of audio-visual advice.
Available actions and advice classes are as follows:
go left: moves the arm to the left section of the table;
go right: moves the arm to the right section of the table;
go home: moves the arm to the home location;
grasp: takes the object at the current arm’s position;
place: drops the object at the current arm’s position;
wipe: uses the sponge to wipe the table at the current arm’s position;
abort: aborts the task and returns to the initial state.
Each state is represented by four variables:
handObject: the object which is currently in the robot’s hand (sponge, goblet, or free);
handPosition: the position of the robot’s arm (left, right, or home);
gobletPosition: the position of the goblet on the table (left or right);
sideCondition: a vector with two values indicating whether all sections of the table have been wiped.
Therefore, the state vector can be represented as follows:
$s = \langle handObject, handPosition, gobletPosition, sideCondition \rangle$
To complete the task, the robot must clean both sections of the table by moving the goblet from one section to the other during the process of wiping. After the robot has wiped both sections, the task finishes with the sponge at the home position and with free hands. Therefore, the final state can be represented as:
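As an illustration of the state description above, the four state variables and the goal condition could be encoded as follows. The field names, the boolean encoding of the wiped sections, and the `is_final` helper are assumptions for this sketch and are not taken from the original implementation.

```python
from typing import NamedTuple, Tuple


class CleaningState(NamedTuple):
    hand_object: str                   # "sponge", "goblet", or "free"
    hand_position: str                 # "left", "right", or "home"
    goblet_position: str               # "left" or "right"
    side_condition: Tuple[bool, bool]  # (left wiped, right wiped), assumed encoding


def is_final(state: CleaningState) -> bool:
    """Goal test as described in the text: both sections wiped, arm at home,
    and nothing in the hand (the sponge has been put back)."""
    return (state.hand_object == "free"
            and state.hand_position == "home"
            and all(state.side_condition))
```

The goblet position is left unconstrained in this goal test, since the text only requires that the goblet is moved during cleaning.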
Once the task has been finished and the final state has been reached, the agent receives a reward of . If the agent is not able to finish the task due to a failed-state, it receives a negative reward (or punishment) of . In all other states, the agent receives a small negative reward of for each transition in order to discourage longer paths and loops. The reward function is summarized as follows:
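The three-case structure of the reward can be written as a small function. The numeric defaults below are placeholders chosen only to respect the signs and relative magnitudes described in the text (positive goal reward, negative failure reward, small negative step cost); they are assumptions and not necessarily the values used in the experiments.

```python
def reward(reached_final: bool, reached_failed: bool,
           r_goal: float = 1.0, r_fail: float = -1.0,
           r_step: float = -0.01) -> float:
    """Three-case reward; all default values are PLACEHOLDERS."""
    if reached_final:
        return r_goal   # task completed
    if reached_failed:
        return r_fail   # task can no longer be completed
    return r_step       # small cost that discourages long paths and loops
```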
RL is performed using SARSA with a discount factor $\gamma$, a learning rate $\alpha$, and $\epsilon$-greedy action selection. IRL is carried out using a fixed probability of feedback, meaning that advice is used to assist the robot in the task execution for the corresponding fraction of the time.
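A minimal tabular sketch of SARSA with probabilistic advice is given below. The table layout, the function names, and the way advice overrides the agent's own choice are assumptions for illustration; hyperparameter values are passed in explicitly rather than fixed, since none are assumed here.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, Optional, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def epsilon_greedy(q: QTable, state: Hashable, actions: Sequence[str],
                   epsilon: float, rng: random.Random) -> str:
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q[(state, a)])


def pick_action(q: QTable, state: Hashable, actions: Sequence[str],
                epsilon: float, advice: Optional[str],
                feedback_prob: float, rng: random.Random) -> str:
    """Follow the trainer's advice with probability feedback_prob."""
    if advice is not None and rng.random() < feedback_prob:
        return advice
    return epsilon_greedy(q, state, actions, epsilon, rng)


def sarsa_update(q: QTable, s: Hashable, a: str, r: float,
                 s_next: Hashable, a_next: str,
                 alpha: float, gamma: float, terminal: bool = False) -> None:
    """On-policy update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = r if terminal else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
```

A table can be created with `defaultdict(float)`; in each step the action for the next state is picked first and then passed to `sarsa_update`, which is what distinguishes SARSA from off-policy Q-learning.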
Our architecture comprises four different modules. The interface module is in charge of receiving the parent-like advice using a depth sensor and a microphone. The control module receives the uni-sensory elements of advice to compute multi-modal feedback, which is sent to the learning algorithm and to the affordance module to predict possible failed-states. Finally, the robot module generates low-level action control using either a real or a simulated robot.
Our approach uses contextual affordances to predict the effect after an action has been performed by the robot in the cleaning-table scenario. We encode all the variables with a localist data representation for objects, locations, side conditions, and actions. We use this representation to create the training set for the contextual affordance model, which is implemented as a multi-layer perceptron. As input, we use vectors whose components contain information about the current state and the action. The current state is represented by the first 13 components of the input vector, covering the four variables that define a state, i.e., hand object, hand position, goblet position, and side condition. The output corresponds to the effect of contextual affordances, encoded as a vector with 13 components representing the next state. If the performed action leads to a failed-state, then all components of the output vector are equal to zero. The training data was created considering all possible combinations of states and actions. The resulting data samples were used to train the multi-layer perceptron, whose hidden neurons use a sigmoid transfer function, with the Levenberg-Marquardt learning algorithm [Hagan94].
IV-B Results and Evaluation
We implemented the cleaning table scenario described in Sec. IV-A in order to test our proposed method. For this purpose, we made recordings of speech and hand gesture sequences from a parent-like teacher. These recordings enabled us to better control the conducted experiments in order to repeat the process under different learning conditions. Each advice class was recorded four times. After the training was completed, our goal was to predict the feedback labels from novel multi-modal input sequences along with their confidence values . After processing each modality independently, the predictions were integrated using the multi-modal integration model to compute and . We considered with as the minimum confidence value to be considered as a valid advice. Then, we used different to verify whether smaller confidence values are still beneficial. The thresholds used were . The average convoluted rewards are shown in Fig. 3 for 100 agents and 500 episodes.
To evaluate differences in the uni-modal and multi-modal approaches, we used a threshold of . The results for agents and episodes are shown in Fig. 4 where it is possible to observe that both uni-modal approaches lead to similar learning behavior, i.e., similar convergence speed and accumulated reward. When using integrated multi-modal advice, the approach converges faster and collects greater reward in comparison with audio and visual advice only.
Afterwards, we introduced the proposed contextual affordance model to avoid failed-states. This way, we reduce not only the action space but also the likelihood of a failed-state during an episode, so that the agent is less likely to receive a punishment. Consequently, using affordance-driven IRL increases the average accumulated reward and yields faster convergence (shown in Fig. 5). For a better comparison of our results, all plots in Fig. 5 show the autonomous RL and IRL approaches without the use of affordances.
We evaluated our method using different percentages of available contextual affordances during the learning process. We defined a parameter to be the likelihood of a contextual affordance being available. We set values for with meaning that the affordance is fully available. It can be seen in Fig. 5a that even a reduced amount of affordance availability () improves the learning process. Furthermore, the autonomous affordance-driven RL approach (Fig. 5, green line) accomplishes similar performance to IRL without affordances in terms of accumulated reward. In the case of affordance-driven IRL, it reaches a better performance than IRL without affordances. Fig. 5b shows the results for an affordance availability of . In this case, even the affordance-driven autonomous RL approach obtains a higher accumulated reward in comparison to the IRL approach where affordances are not used. With , both approaches with affordances outperform the traditional RL and IRL approaches in terms of accumulated reward and convergence speed (Fig. 5c). Finally, we used an agent with full affordance availability, i.e., . Fig. 5d shows that with affordances being fully available, the agent quickly converges to its maximal possible reward in both RL and IRL approaches, with a slight difference in the maximal reward for both approaches.
We presented an IRL approach using dynamic audio-visual input as feedback in terms of vocal commands and hand gestures for a robotic cleaning task. Our approach integrates uni-sensory cues to provide multi-modal feedback. The multi-modal integration module computes a joint label and confidence value on the basis of uni-modal predictions. The integration process is of particular importance when the two modalities convey incongruent information, i.e., feedback classes predicted by the modules of speech and gesture recognition do not match. Therefore, our integration function takes into account the confidence level of the predictions to provide the IRL algorithm with consistent feedback. As an extension of previous research on multi-modal IRL [CruzIROS], we implemented a model using contextual affordances for modulating the influence of sensory-driven feedback in the IRL task through goal-oriented knowledge. We predict the effects of an action given the current state to avoid failed-states. In this way, the IRL algorithm may or may not take the feedback into account, bypassing feedback that leads the agent to a failed-state and thus speeding up the learning process in terms of the episodes required to achieve convergence.
Our results in a simulated robot environment show that although uni-sensory modalities show satisfactory prediction accuracy, the use of multi-modal feedback leads to a better performance in our domestic cleaning table scenario in terms of the accumulated reward and required learning episodes. We evaluated the learning performance under four different conditions: traditional RL, multi-modal IRL, and these two setups with the use of affordances, showing that the best performance is obtained using multi-modal feedback with affordance-driven IRL.
V-B Sensory-driven Feedback vs Goal-oriented Knowledge
The focus of our study was the interplay of multi-modal feedback with task-specific knowledge. Our previous results showed that integrated audio-visual representations yield more robust feedback for an IRL task with respect to uni-modal approaches [CruzIROS]. In particular, audio-visual integration provides the means to solve conflicts, i.e., situations in which predicted feedback labels from the auditory and the visual modules are incongruent. This supports the idea of multi-modal integration as a method to enhance robot perception and interaction [Kimura15].
In our approach, the integration is carried out taking into account the predicted labels and the confidence values from uni-modal cues. In the case of incongruent audio-visual predictions, the modality yielding the higher confidence value will be preferred. Gesture labels are predicted by the neural network processing of hand motion features, whereas vocal commands are predicted using in-domain automatic speech recognition. Consequently, these two approaches provide robust feedback predictions with confidence values computed as a function of a fully sensory-driven process, i.e., a high confidence value indicates that it is very likely that the feedback perceived by the agent matches the one actually given by the trainer. This procedure, however, does not give any information on whether the piece of feedback is correct or not in terms of the next actions required to accomplish the task.
Signals from multiple sources are combined in the brain taking into account a combination of the reliability of low-level sensor representations and the expectations of an agent in a specific situation (e.g., in terms of task-oriented knowledge) [Odegaard15]. Therefore, we integrated this aspect into our previous model in order to study the combination of sensory-driven multi-modal feedback and goal-oriented knowledge in the context of our IRL task. In our extended architecture, we integrated task-specific knowledge in terms of contextual affordances, which represent an effective method to anticipate the effect of actions performed by an agent interacting with objects based on its current state [Cruz16, Cruz14]. We trained a neural network to predict the effect of performed actions with different objects in order to avoid states from which it is not possible for the agent to complete the cleaning task. Thus, contextual affordances modulate the influence of multi-modal feedback in the IRL algorithm, i.e., if an action provided by the trainer leads to a failed-state, it may be bypassed irrespective of a high (sensory-driven) confidence value.
V-C Future Work
The obtained results motivate the extension of our approach in several directions. So far, the integration function considers two cues for predicting multi-modal feedback and computing its confidence. On the one hand, our function could be naturally extended to consider input from additional sensory sources, e.g., RGB information as an additional visual cue. It has been shown that combining depth and RGB information leads to better recognition accuracy than using a single cue [cad60_3]. In the case of our robotic task, conflicting input in terms of incongruent predictions from the auditory and visual modules may be resolved by considering multiple visual cues. On the other hand, our approach could be extended with additional modalities, e.g., haptics. In such a setting, parent-like feedback may be delivered to the agent by providing haptic feedback to its actuators, e.g., moving its arm to grasp an object.
Multi-modal IRL allows the agent to interact in a more natural way with parent-like trainers for dynamically acquiring and refining task-specific knowledge with respect to traditional IRL approaches. Together, our results demonstrate the contribution of multi-modal sensory processing integrated with goal-oriented knowledge to significantly enhance the interaction between users and agents in robotic learning tasks.
The authors gratefully acknowledge partial support by Comisión Nacional de Investigación Científica y Tecnológica (CONICYT) scholarship 5043 and the German Research Foundation DFG under project CML (TRR 169).