Deep Reinforcement Learning with Interactive Feedback in a Human-Robot Environment

07/07/2020 ∙ by Ithan Moreira, et al. ∙ UPE-Poli Deakin University

Robots are extending their presence in domestic environments every day, and it is increasingly common to see them carrying out tasks in home scenarios. In the future, robots are expected to perform increasingly complex tasks and must therefore be able to acquire experience from different sources as quickly as possible. A plausible approach to address this issue is interactive feedback, where a trainer advises a learner on which actions should be taken from specific states to speed up the learning process. Moreover, deep reinforcement learning has recently been widely used in robotics to learn the environment and acquire new skills autonomously. However, an open issue when using deep reinforcement learning is the excessive time needed to learn a task from raw input images. In this work, we propose a deep reinforcement learning approach with interactive feedback to learn a domestic task in a human-robot scenario. We compare three different learning methods using a simulated robotic arm for the task of organizing different objects; the methods are (i) deep reinforcement learning (DeepRL); (ii) interactive deep reinforcement learning using a previously trained artificial agent as an advisor (agent-IDeepRL); and (iii) interactive deep reinforcement learning using a human advisor (human-IDeepRL). We demonstrate that interactive approaches provide advantages for the learning process. The obtained results show that a learner agent, using either agent-IDeepRL or human-IDeepRL, completes the given task earlier and makes fewer mistakes compared to the autonomous DeepRL approach.




1 Introduction

Robotics has been getting more attention since new research advances have introduced significant improvements to our society. For instance, robots have been installed in the automotive industry for many years [25]. However, current technological progress has allowed the application domain of robots to expand into areas such as medicine, military, search and rescue, and entertainment. In this regard, another challenging application of robotics currently under research is its integration into domestic environments, mainly due to the presence of many dynamic variables in comparison to industrial contexts. Moreover, in domestic environments, it is expected that humans regularly interact with robots and that the robots can understand and respond appropriately to the interactions [11].

Algorithms such as reinforcement learning (RL) [28] allow a robotic agent to autonomously learn new skills in order to solve complex tasks, in a way inspired by how humans learn through trial and error [22]. RL agents interact with the environment in order to find an appropriate policy that meets the aims of the problem. To find the appropriate policy, the agent interacts with the environment by performing an action and, in turn, the environment returns a new state with a reward for the performed action, which is used to adjust the policy. However, an open issue in RL algorithms is the time and the resources required to achieve good learning outcomes [5], which is especially critical in online environments. One of the reasons for this problem is that the agent, at the beginning of the learning process, knows neither the environment nor how it responds to interactions. Thus, to address this problem, the agent must explore multiple paths to refine its knowledge about the environment.

In continuous spaces, an alternative is to recognize the agent’s state directly from raw inputs. Deep reinforcement learning (DeepRL) [18] is based on the same RL structure but also adds deep learning to perform the function approximation for the state at multiple abstraction levels. DeepRL is commonly implemented with convolutional neural networks (CNNs) [15], which can be adapted for DeepRL, e.g., DQN [18]. CNNs have brought significant progress in recent years in different areas such as image, video, audio, and speech processing [14]. Nevertheless, for a robotic agent working in highly dynamic environments, DeepRL still needs excessive time to learn a new task properly.

Moreover, although a robot may be capable of learning autonomously to sort objects in different contexts, current approaches address the problem using supervised deep learning methods with previously labeled data to recognize the objects, e.g., [33] and [27]. In this regard, the DeepRL approach allows classifying objects as well as deciding how to act with them. Additionally, if prior knowledge of the problem is transferred to the DeepRL agent, e.g., using demonstrations [31], the learning speed may also be improved. Therefore, using interactive feedback as an assistance method, we are able to advise the learner agent during the learning process, using both artificial and human trainers, and to evaluate how the DeepRL method responds to the given advice.

In this work, we present an interactive-shaping vision-based algorithm derived from DeepRL, referred to here as interactive DeepRL or IDeepRL. Our algorithm allows us to reduce the required learning time through strategic interaction with either a human or an artificial advisor. The information exchange between the learner and the advisor gives the learner a better understanding of the environment by reducing the search space [7].

We have implemented a simulated domestic scenario in which a robot has to organize objects according to color and shape. The RL agent perceives the world through RGB images and interacts through a robotic arm, while an external trainer may advise the agent to perform a different action during the first training steps. The implemented scenario is used to test and compare the DeepRL and IDeepRL algorithms, as well as to evaluate IDeepRL using two different types of trainers, namely, another previously trained artificial agent and a human advisor.

2 Related works

In this section, we review previous work in two main areas. First, we address the deep reinforcement learning approach and the use of interactive feedback. Then, we discuss the problem of vision-based object sorting using robots in order to contextualize our approach properly.

2.1 Deep reinforcement learning and interactive feedback

Deep reinforcement learning (DeepRL) combines reinforcement learning (RL) and deep neural networks (DNN). This combination has enhanced the ability of RL agents to autonomously explore a given scenario [30]. If an RL agent is learning a task, the environment gives the agent the necessary information on how good or bad the taken actions are. With this information, the agent must differentiate which actions lead to a better accomplishment of the task aims [28].

The goal of RL is to find an optimal policy ($\pi^*$) mapping states to actions. For example, an optimal action-value function $Q^*(s, a)$ takes a given state ($s$) and an action ($a$) and, with subsequent actions chosen using the policy ($\pi$), gives the maximum expected future reward ($r$) over time ($t$) with a discount rate ($\gamma$), as shown in Eq. (1). In DeepRL, an approximation function, implemented by a DNN, allows an agent to work with high-dimensional observation spaces, such as the pixels of an image [18].

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s,\ a_t = a,\ \pi \right] \tag{1}$$
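To make the discounted sum of future rewards in Eq. (1) concrete, the following minimal sketch computes the discounted return of one fixed reward sequence. The function name `discounted_return` and the sample values are illustrative assumptions and are not taken from the original work.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards so that each earlier step
    # applies exactly one more factor of gamma to everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards for three consecutive time-steps.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

The action-value function in Eq. (1) is the expectation of this quantity, maximized over policies; the sketch only evaluates a single trajectory.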


Interactive feedback is a method that reduces the learning time of an RL agent [26]. In this method, an external trainer can guide the agent’s learning so that it explores more promising areas at early learning stages. The external trainer can be a human, a robot, or another artificial agent.

There are two principal strategies for providing interactive feedback in RL scenarios, i.e., evaluative and corrective feedback [19]. In the first one, called reward-shaping, the trainer modifies or accepts the reward given by the environment in order to bias the agent’s learning [21, 3]. In the second one, called policy-shaping, the trainer may suggest a different action to perform, replacing the one proposed by the policy [12, 16]. A simple method of policy-shaping involves forcing the agent to take certain actions that are recommended by the trainer [13, 20]. For instance, a similar approach is used when a teacher guides a child’s hand to teach the child how to draw a geometric figure. In this work, we use the policy-shaping approach since it has been shown that humans using this technique to instruct an agent provide more accurate advice, are able to assist the learner agent for a longer time, and provide more advice per episode. Moreover, people using policy-shaping have reported that the agent’s ability to follow the advice is higher and, therefore, felt their own advice to be of higher accuracy when compared to people providing advice via reward-shaping [2]. The policy-shaping approach is depicted in Figure 1.

Figure 1: Policy-shaping interactive feedback approach. In this approach, the trainer may advise the agent on what actions to take in a particular given state.

There are different ways to support the agent’s learning, each of which may in turn lead to other problems. For instance, if the trainer delivers too much advice, the learner never gets to know other alternatives because most of the decisions are made by the external trainer [29]. The quality of the advice given by the external trainer must also be considered to improve the learning. It has been shown that inconsistent advice may be very detrimental during the learning process, so that when the consistency of advice is low, autonomous learning may lead to better performance [4].

One additional strategy to better distribute the given advice is to use a budget [29]. In this strategy, the trainer has a limited amount of interaction with the learner agent, similar to the limited patience of a person for teaching. There are different ways of using the budget, in terms of when to interact or give advice, namely, early advising, alternating advice, importance advising, mistake correcting, and predictive advising. In this work, we use early advising allowing us to fairly compare interactive approaches using the different kinds of trainers used in the proposed methods, i.e., humans or artificial agents as trainers.
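As a minimal sketch of how policy-shaping can be combined with an advice budget spent as early as possible, the code below wraps a trainer in a counter and lets any advice replace the action proposed by the policy. The class `EarlyAdvisor`, the function `select_action`, and the toy trainer are hypothetical names introduced only for illustration; they are not the authors' implementation.

```python
class EarlyAdvisor:
    """Trainer that spends a fixed advice budget on the earliest requests."""

    def __init__(self, advise_fn, budget=100):
        self.advise_fn = advise_fn  # maps a state to a recommended action
        self.remaining = budget     # pieces of advice still available

    def advise(self, state):
        """Return an advised action, or None once the budget is exhausted."""
        if self.remaining <= 0:
            return None
        self.remaining -= 1
        return self.advise_fn(state)


def select_action(policy_action, advice):
    """Policy-shaping: advice, when present, replaces the policy's action."""
    return policy_action if advice is None else advice


# Hypothetical trainer that always recommends action 2, with a budget of 3.
advisor = EarlyAdvisor(lambda state: 2, budget=3)
chosen = [select_action(0, advisor.advise(state=None)) for _ in range(5)]
```

With a budget of 3, the first three selections follow the trainer and the remaining ones fall back to the policy's own action, which is the behavior early advising is meant to produce.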

Although there have been some approaches addressing the interactive deep reinforcement learning problem, they have mostly been used in other scenarios. For instance, an application to serious games is presented in [10], and a dexterous robotic manipulation approach using demonstrations is presented in [23]. In the game scenario [10], the addressed task presents different environmental dynamics compared to human-robot environments. Moreover, the authors propose that the advisor can undo an action, which is not always feasible. For example, in a human-robot environment, a robot might break an object as a result of a performed action, which is impossible to undo.

2.2 Vision-based object sorting with robots

The automation of object-sorting tasks has been previously addressed using machine learning techniques. For instance, Lukka et al. [17] implemented a recycling robot for construction and demolition waste. In that work, the robot sorts the waste using a vision-based system for object recognition and an object manipulation system that controls the movement of the robot in order to properly classify the objects presented on a moving belt. The authors did not present performance results since the approach is presented as a functional industrial prototype for sorting objects through images.

Object recognition is an extensive research area that has been addressed by different techniques, including deep learning, as presented in [33]. This approach is similar to [17] in that it proposes sorting garbage from a moving belt. The authors used a convolutional neural network, called Fast R-CNN, to obtain the moving object’s class and location, and to send the information to the robotic grasping control unit to grasp the object and move it. As the authors point out, the key problem the deep learning method tries to solve is object identification. Moreover, another approach to improve the object recognition task is presented in [27], where the authors implemented a stereo vision system to recognize the material and the category of clothes. The stereo vision system creates a 3D reconstruction of the image and obtains local and global features to predict the clothing category and to control a robot that must grasp the clothing and sort it into a preestablished box. These two systems, presented in [33] and [27], use a supervised learning method that requires prior training on the items to be sorted, leading to low generalization to new objects.

Using RL to sort objects has also been previously addressed. For instance, in [6] a table-cleaning task in a simulated robotic scenario is presented. In order to complete the task, the robot needs to deal with objects such as a cup and a sponge. In that work, the RL agent used a discrete tabular RL approach complemented by interactive feedback. Therefore, the agent did not deal with the problem of continuous visual inputs for state representation. Furthermore, in [32] an approach for robotic control using DeepRL is presented. In that work, a simulated Baxter robot learned autonomous control using a DQN-based system. When the system was transferred to a real-world scenario, the approach failed. To fix this, the camera images were replaced with synthetic images so that the agent could acquire the state and decide which action to take in the real world.

3 Methodology and implementation of the agents

In this work, our focus is on assessing how interactive feedback, used as an assistance method, may affect the performance of a DeepRL agent rather than finding the solution to the underlying RL problem. To this aim, we implement three different approaches for the RL agents:

  1. DeepRL: where the agent interacts autonomously with the environment;

  2. agent-IDeepRL: where the DeepRL approach is complemented with a previously trained artificial agent to give advice; and

  3. human-IDeepRL: where the DeepRL approach is complemented with a human trainer.

The first approach includes a standard deep reinforcement learning agent, referred to here as DeepRL, and is the basis of both of the interactive agents discussed subsequently. The DeepRL agent perceives the environment information through a visual representation [9], which is processed by a convolutional neural network (CNN) that estimates the Q-values. The deep Q-learning algorithm allows the agents to learn from previously experienced actions, using the CNN as a function approximator, which allows them to generalize over states and apply Q-learning in continuous state spaces.

To save past experiences, the experience replay [1] technique is implemented. This technique saves the most useful information (experience) in memory, which is used afterward to train the RL agent. The neural network is responsible for processing the visual representation and gives the Q-value of each action to the agent, which decides what action to take. In order for the agent to balance exploration and exploitation of actions, the $\epsilon$-greedy method is used. This method includes an $\epsilon$ parameter, which allows the agent to perform either a random exploratory action or the best-known action proposed by the policy.
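The exploration and exploitation trade-off described above can be sketched as follows. This is a generic $\epsilon$-greedy selection over a list of Q-values, with the function name `epsilon_greedy` and the example Q-values chosen here purely for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.7, 0.2, 0.0]                 # hypothetical Q-values for four actions
greedy = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 never explores
```

Passing a seeded `random.Random` instance as `rng` makes the exploratory choices reproducible, which is convenient when comparing agents.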

The learning process for the autonomous agent, i.e., the DeepRL agent, is separated into two stages. The first, pretraining stage consists of 1000 random actions that the agent must perform to populate the initial memory. In the second stage, the agent’s training is carried out using the $\epsilon$-greedy policy and, after each performed action, the agent is trained using 128 tuples of state, action, reward, and next state extracted from the memory.

Both IDeepRL approaches are based on autonomous DeepRL and include the interactive feedback strategy from an external trainer to improve the DeepRL performance. Therefore, the agents have the same base algorithm, with an extra interactive stage added. For the interactive agents, the learning process is separated into three stages. The first, pretraining stage corresponds to 900 random actions that the agent must perform in order to populate the initial memory. In the second stage, the external trainer gives early advice about the environment dynamics and the intended task during 100 consecutive time-steps. In the third stage, the agent starts training using the $\epsilon$-greedy policy and, following each selected action, the agent is trained with 128 tuples previously saved in the batch memory. The learning process for both autonomous and interactive agents is depicted in Figure 2.

Figure 2: The learning process for autonomous and interactive agents. Both approaches include a pretraining stage comprising 1000 actions. For interactive agents, the final part of the pretraining is performed using external advice instead of random actions.

In the second stage of the IDeepRL approach, the learner agent receives advice either from a previously trained artificial agent or from a human trainer. The artificial trainer agent used in agent-IDeepRL is an RL agent that collected previous experience by performing the autonomous DeepRL approach using the same hyperparameters. Therefore, its knowledge is acquired by interacting with the environment in the same manner as the learner agent does. Once the trainer agent has learned the environment, it is used to provide advice in agent-IDeepRL over 100 consecutive time-steps. Both the trainer agent and the learner agent perceive the environmental information through a visual representation after an action is performed.

Algorithm 1 shows the first stage for the DeepRL and IDeepRL approaches, which corresponds to the pretraining stage that populates the batch memory. This algorithm also contains the second stage for interactive agents using IDeepRL, represented by the conditional in line 4. Moreover, Algorithm 2 shows the second stage for DeepRL, which also corresponds to the third stage for IDeepRL, i.e., the training stage.

1:Initialize memory M
2:Observe agent’s initial state $s_0$
3:while len(M) $< 1000$ do
4:     if interaction is used AND length of M $\geq 900$ then
5:         Get action $a_t$ from advisor
6:     else
7:         Choose a random action $a_t$
8:     end if
9:     Perform action $a_t$
10:     Observe $r_t$ and next state $s_{t+1}$
11:     Add ($s_t, a_t, r_t, s_{t+1}$) to M
12:     if $s_{t+1}$ is terminal OR time-steps exceed the episode limit then
13:         Reset episode
14:     end if
15:end while
Algorithm 1 Pretraining algorithm to populate the batch memory including interactive feedback.

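
The pretraining stage of Algorithm 1 can be sketched in Python as below. This is a hedged illustration rather than the authors' code: the environment interface (`env_step`, `reset`), the `advisor` callable, the explicit state update, and the `max_steps` placeholder for the per-episode time-step limit are all assumptions, while the 1000 and 900 thresholds follow the stage lengths given in the text.

```python
import random


def pretrain(env_step, reset, n_actions, advisor=None,
             memory_size=1000, advice_start=900, max_steps=50, rng=random):
    """Fill the memory with random actions, then (optionally) advised ones."""
    memory = []
    state, steps = reset(), 0
    while len(memory) < memory_size:
        if advisor is not None and len(memory) >= advice_start:
            action = advisor(state)            # interactive stage: follow advice
        else:
            action = rng.randrange(n_actions)  # random pretraining action
        reward, next_state, terminal = env_step(state, action)
        memory.append((state, action, reward, next_state))
        state, steps = next_state, steps + 1   # assumed state update
        if terminal or steps > max_steps:      # max_steps is a placeholder value
            state, steps = reset(), 0
    return memory


def toy_step(state, action):
    """Toy stand-in environment: the state is a counter, episodes end at 10."""
    next_state = state + 1
    return 0.0, next_state, next_state >= 10


memory = pretrain(toy_step, reset=lambda: 0, n_actions=4, advisor=lambda s: 3)
advised = [action for (_, action, _, _) in memory[900:]]
```

Calling `pretrain` with `advisor=None` gives the autonomous DeepRL variant, in which all 1000 stored actions are random.
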
1:Perform the pretraining Algorithm 1
2:for each episode do
3:     Observe state $s_t$
4:     repeat
5:         Choose an action $a_t$ using $\epsilon$-greedy
6:         Perform action $a_t$
7:         Observe $r_t$ and next state $s_{t+1}$
8:         Add ($s_t, a_t, r_t, s_{t+1}$) to M
9:         Populate randomly batch B from M
10:         Train CNN using data in B
11:         $s_t \leftarrow s_{t+1}$
12:         $\epsilon \leftarrow \epsilon \cdot \epsilon_{decay}$
13:     until $s_t$ is terminal OR time-steps exceed the episode limit
14:end for
Algorithm 2 Training algorithm to populate and extract information from the batch memory.
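
A minimal sketch of the training loop in Algorithm 2 follows. The CNN is abstracted behind two hypothetical callables, `q_fn` (state to Q-values) and `train_fn` (one update on a batch), so the example stays runnable without a deep learning library. The initial `epsilon` of 1.0 and the `max_steps` limit are assumed placeholder values; the batch size of 128 and the decay rate of 0.9995 are the ones stated in the text.

```python
import random


def train(env_step, reset, q_fn, train_fn, memory, n_actions, episodes,
          epsilon=1.0, epsilon_decay=0.9995, batch_size=128,
          max_steps=50, rng=random):
    """Run the training stage and return the final value of epsilon."""
    for _ in range(episodes):
        state, steps, terminal = reset(), 0, False
        while not terminal and steps <= max_steps:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)            # explore
            else:
                q_values = q_fn(state)
                action = max(range(n_actions), key=lambda a: q_values[a])
            reward, next_state, terminal = env_step(state, action)
            memory.append((state, action, reward, next_state))
            # Sample a random batch (smaller while the memory is still short).
            batch = rng.sample(memory, min(batch_size, len(memory)))
            train_fn(batch)                                  # one network update
            state, steps = next_state, steps + 1
            epsilon *= epsilon_decay                         # decay exploration
    return epsilon


def toy_step(state, action):
    """Toy stand-in environment: the state is a counter, episodes end at 10."""
    next_state = state + 1
    return 0.0, next_state, next_state >= 10


batches = []
final_epsilon = train(toy_step, lambda: 0, q_fn=lambda s: [0.0] * 4,
                      train_fn=batches.append, memory=[], n_actions=4, episodes=2)
```

In practice `memory` would be the list returned by the pretraining stage, so that the first sampled batches already contain the random and advised experiences.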

3.1 Interactive approach

As previously discussed, the IDeepRL methods include an external advisor, which can be either another already trained agent or a human. In our scenario, the advisor uses a policy-shaping approach during the decision-making process, as previously shown in Figure 1. Moreover, among the different alternatives for interactive feedback, we use teaching on a budget with early advising [8]. This technique attempts to reduce the time required for an RL agent to understand the environment by providing 100 early, consecutive pieces of advice from the trainer, thereby transferring the trainer’s knowledge of the environment as quickly as possible.

3.2 Visual representation

A visual representation is used for the deep Q-learning algorithm, which consists of Q-learning with a CNN as a function approximator for the Q-values. Additionally, it uses a memory of past experiences from which batches are drawn for training the network.

Our architecture processes RGB input images to learn the image features. The architecture is inspired by similar networks used in other DeepRL works [18, 14]. In particular, the first layer is a convolution with four filters, followed by a max-pooling layer, a convolution layer with eight filters, and another max-pooling layer with the same specification as the previous one. The network has a last convolution layer with 16 filters. After the last pooling, a flatten layer is applied, which is fully connected to a layer with 256 neurons. Finally, the 256 neurons are fully connected to the output layer, which has four neurons representing the possible actions and uses a softmax function. The full network architecture can be seen in Figure 3. Since this work compares different learning methodologies, all agents were trained with the same architecture to allow a fair comparison.

Figure 3: Neural network architecture, with an RGB image as input, composed of three convolution layers, three max-pooling layers, and two fully connected layers, including a softmax function for the output.
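
To illustrate how the feature map shrinks through the three convolution and max-pooling stages before the flatten layer, the sketch below traces only the spatial sizes. The input resolution, kernel sizes, stride, padding, and pooling window are not preserved in this text, so the values used here (a 64-pixel input, kernels of 8, 4, and 2, unpadded stride-1 convolutions, and 2x2 pooling) are hypothetical; only the filter counts (4, 8, 16) come from the description above.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Output width of non-overlapping max-pooling."""
    return size // window


def flattened_features(input_size, conv_kernels, filters, pool_window=2):
    """Trace conv -> pool for each stage and return the flattened feature count."""
    size = input_size
    for kernel in conv_kernels:
        size = pool_out(conv_out(size, kernel), pool_window)
    return size * size * filters[-1]


# Hypothetical sizes: 64 -> 57 -> 28 -> 25 -> 12 -> 11 -> 5, then 5 * 5 * 16.
features = flattened_features(input_size=64, conv_kernels=(8, 4, 2), filters=(4, 8, 16))
```

Under these assumed sizes the flatten layer would feed 400 values into the 256-neuron layer; different kernel or padding choices would change that number.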

3.3 Continuous representation

Given the task characteristics, with images as inputs used to recognize different objects in a dynamic environment, it is impractical to generate a table with all the possible state-action combinations. Therefore, we have used a continuous representation combining two methods. The first method is a function approximator for $Q(s, a)$ implemented by a neural network, which allows us to generalize over states in order to use Q-learning in continuous spaces and to select which action is carried out. The second method is the experience replay technique, which saves in memory a tuple of an experience given by $(s_t, a_t, r_t, s_{t+1})$. The data saved in memory are used afterward to train the RL agent.
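A minimal experience replay memory along the lines described above could look as follows. The class name `ReplayMemory` and the bounded `capacity` are assumptions made for this sketch; the text does not state whether the memory is bounded.

```python
import random
from collections import deque


class ReplayMemory:
    """Stores (state, action, reward, next_state) tuples for later sampling."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Return up to batch_size experiences drawn without replacement."""
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.buffer)


replay = ReplayMemory(capacity=3)
for step in range(5):
    replay.add(state=step, action=0, reward=0.0, next_state=step + 1)
batch = replay.sample(batch_size=128)
```

Sampling uniformly at random breaks the temporal correlation between consecutive experiences, which is the usual motivation for training from a replay memory instead of from the latest transition only.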

4 Experimental setup

We have designed a simulated domestic scenario focused on organizing objects. The agent aims to classify geometric figures with different shapes and colors and organize them in designated locations to optimize the collected reward. Classification tasks are widespread in domestic scenarios, e.g., organizing clothes. The object shape might represent different types of clothes, while the color might represent whether they are clean or dirty.

In order to compare the DeepRL and IDeepRL algorithms in terms of collected reward and learning time, three different agents are trained in this scenario. The experimental scenario is built in the CoppeliaSim simulator developed by Coppelia Robotics [24].

Three tables are used in the scenario. The first contains all the unsorted objects, initially placed at random on the table within nine preestablished positions. This represents the initial dynamic state. The other two tables are used to place the objects once the agent determines to which table each object belongs. To perform this sorting, a robotic manipulator arm with 7 degrees of freedom, six axes, and a suction cup grip is used. The robotic arm is placed on another table along with a camera from which we obtain RGB images. The objects to be organized are cubes, cylinders, and disks in two different colors, red and blue, as presented in Figure 4.


4.1 Actions and state representation

Four actions are available to the agent, and they can be taken autonomously or through advice given by the external trainer. The actions are the following:

  1. Grab object: the agent grabs one of the objects at random with the suction cup grip.

  2. Move right: the robotic arm is moved to the table on the right side of the scenario; if the arm is already there, it does nothing.

  3. Move left: the robotic arm is moved to the table on the left side of the scenario; if the arm is already there, it does nothing.

  4. Drop: if the robotic arm has an object in the grip and is located at one of the side tables, the arm goes down and releases the object; if the arm is positioned in the center, the arm keeps its position.

For example, the actions required to correctly organize a blue cube from the central table consist of (i) grab object, (ii) move right, and (iii) drop. The low-level robot control used to reach the different positions within the scenario is performed using inverse kinematics. Although inverse kinematics is used to reach an object, the CNN is responsible for deciding, through the Q-values, whether to perform the action of grabbing an object and, if so, where to place the object based on the classification. Grasping, motion planning, and control are crucial problems that have received significant attention lately. However, the scope of this paper is limited to assessing the effect of interactive feedback on the DeepRL approach, addressing the sorting problem through classification and decision-making.
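The four high-level actions can be summarized as a small state machine over the arm position and the gripper. The sketch below is a simplification with hypothetical names (`ArmState`, `apply`): it ignores which object is grabbed and assumes that grabbing always takes place at the central table, which the text implies but does not state explicitly.

```python
from dataclasses import dataclass, field

GRAB, MOVE_RIGHT, MOVE_LEFT, DROP = range(4)


@dataclass
class ArmState:
    position: str = "center"                    # "left", "center", or "right"
    holding: bool = False
    placed: list = field(default_factory=list)  # side tables that received an object

    def apply(self, action):
        """Update the arm according to one high-level action."""
        if action == GRAB:
            # Assumption: objects are always picked up from the central table.
            if not self.holding:
                self.position, self.holding = "center", True
        elif action == MOVE_RIGHT:
            self.position = "right"             # no effect if already there
        elif action == MOVE_LEFT:
            self.position = "left"              # no effect if already there
        elif action == DROP:
            # Releasing only happens when holding an object over a side table.
            if self.holding and self.position in ("left", "right"):
                self.placed.append(self.position)
                self.holding = False
        return self


arm = ArmState()
for action in (GRAB, MOVE_RIGHT, DROP):         # the blue-cube example
    arm.apply(action)
```

After the three actions of the blue-cube example, the arm has released one object on the right table and is no longer holding anything.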

Figure 4: The simulated domestic scenario presenting six objects in different colors and the robotic arm. The camera signal, shown in the upper left corner, is an RGB image used for the state representation of the agent.

The state comprises a high-dimensional space, represented by a raw image captured by an RGB camera. The agent perceives the environment from this image and chooses what action to take according to the network output. The input image is normalized before being presented to the convolutional neural network.

4.2 Reward

In terms of the reward function, an episode comprises different subtasks. To complete the task successfully, all the objects must be correctly organized. Organizing one object is considered a partial goal of the task. All the objects are initially located on the central table to be sorted, and once placed on the side tables, they cannot be grasped again. If all the objects are correctly sorted, the reward is equal to 1, and for correctly organizing a single object, the reward is equal to 0.4. For example, if the classification of the six objects is correct, each of the first five organized objects leads to a reward of 0.4, and the last one obtains a reward of 1, for a total reward of 3. Furthermore, to encourage the agent to accomplish the classification task in fewer steps, a small negative reward of -0.01 per step is given when the number of steps exceeds 18, which is the minimal number of time-steps needed to complete the task satisfactorily. If an object is misplaced, the current training episode ends, and the agent receives a negative reward of -1. The complete reward function is shown in Eq. (2).

$$r(s) = \begin{cases} 1 & \text{if all the objects are correctly organized} \\ 0.4 & \text{if a single object is correctly organized} \\ -1 & \text{if an object is misplaced} \\ -0.01 & \text{if steps} > 18 \end{cases} \tag{2}$$
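The reward rules described above can be written as a small function. This is a sketch under assumptions: the argument names are hypothetical, the order in which the cases are checked is a choice made here, and returning 0 when no case applies is an assumption, since the text does not specify the reward for other steps.

```python
def reward(misplaced, object_sorted, all_sorted, step):
    """Reward for one time-step, following the rules stated in the text."""
    if misplaced:
        return -1.0    # a misplaced object also ends the episode
    if all_sorted:
        return 1.0     # the last object was organized correctly
    if object_sorted:
        return 0.4     # partial goal: one object organized correctly
    if step > 18:
        return -0.01   # small penalty once the minimal step count is exceeded
    return 0.0         # assumption: no reward otherwise


# Perfect episode: one object is placed every third step, six objects in 18 steps.
total = 0.0
for step in range(1, 19):
    placed_now = step % 3 == 0
    total += reward(False, placed_now, placed_now and step == 18, step)
```

For the perfect episode the function reproduces the total of 3 given in the text (five rewards of 0.4 plus a final reward of 1).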


4.3 Human interaction

In the case of human trainers giving advice during the second stage of the IDeepRL approach, a brief three-step induction is carried out for each participant:

  1. The user reads an introduction to the scenario and the experiment, as well as the expected results.

  2. The user is given an explanation of the problem and how it can be solved.

  3. The user is taught how to use the computer interface to advise actions to the learner agent.

In general terms, the participants chosen for the experiment have not had significant exposure to artificial intelligence and are not familiar with simulated robotic environments. The solution is explained to the participants in order to give all of them an equal understanding of the problem, so that they spend less time exploring the environment and can focus on advising the agent. Each participant communicates with the learner agent using a computer interface while observing the current state and performance in the robot simulator. The user interface presents, in a menu, all the actions that can be advised. These actions are shown on the screen, and the trainer may choose any of them to be performed by the learner agent. There is no time limit for advising each action but, as mentioned, during the second stage of IDeepRL, the trainer has a maximum of 100 consecutive time-steps available for advice.

5 Experimental results

In this section, we show the experimental results obtained during the training of three different kinds of agents with the three proposed methodologies, i.e., an autonomous agent scenario, a human-agent scenario, and an agent-agent scenario, namely, DeepRL, human-IDeepRL, and agent-IDeepRL. The methodologies are tested with the same hyperparameters, which were determined experimentally for our scenario: an initial value for $\epsilon$, an $\epsilon$ decay rate of 0.9995, and a fixed learning rate, with training run for 300 episodes.

As discussed in Section 3, the first methodology is an autonomous agent using DeepRL, which must learn how the environment works and how to complete the task. Given the complexity of learning the task autonomously, the time required for the learning process is rather high. The average collected reward for ten autonomous agents is shown in Figure 5, represented by the black line. Moreover, this complexity also increases the error rate, i.e., the misclassification of the objects located on the central table.

Next, we evaluate the interactive learning approaches, agent-IDeepRL and human-IDeepRL. The average results for ten interactive agents are shown in Figure 5, represented by the blue line and the red line, respectively. The agent-IDeepRL approach performs slightly better than the human-IDeepRL approach, mainly because people needed more time to understand the setup and to react during the experiments. However, both interactive approaches obtain very similar results, achieving much faster convergence compared to the autonomous DeepRL approach. Furthermore, the learner agents getting advice from external trainers make fewer mistakes, especially at the beginning of the learning process, and are also able to learn the task in fewer episodes. On the other hand, the autonomous agent makes more mistakes at the beginning since it is still trying to learn how the environment works and the aim to be accomplished. This is not the case for the interactive agents since the advisors help them during this critical part of the learning.

Figure 5: Average collected reward for the three proposed methods. The black line represents the autonomous (DeepRL) agent, which has to discover the environment without any help. The blue and red lines are the agents with an external trainer, namely, an artificial advisor (agent-IDeepRL) and a human advisor (human-IDeepRL), respectively. The shaded area around the curves shows the standard deviation for ten agents.

Because the trainer has a budget of 100 actions to advise, the interactive feedback is consumed within the first six episodes, given that completing an episode takes at least 18 actions. Even with such a small amount of feedback, the learner agent receives important knowledge from the advisor, which is complemented by the experience replay method. For the human-IDeepRL approach, 11 people participated in the experiments to give demonstrative advice, four males and seven females, with ages between 16 and 24. All participants were told how to help the robot complete the task by giving advice, using the same script for all of them (see Section 4.3).
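The six-episode figure follows from simple arithmetic, sketched below under the assumption that every advised episode takes exactly the minimal 18 actions. The constant names are illustrative.

```python
import math

ADVICE_BUDGET = 100         # consecutive advised time-steps (from the text)
MIN_STEPS_PER_EPISODE = 18  # minimal actions to complete an episode (from the text)

# Five full episodes use 90 pieces of advice; the rest spill into a sixth one.
full_episodes = ADVICE_BUDGET // MIN_STEPS_PER_EPISODE
leftover_advice = ADVICE_BUDGET - full_episodes * MIN_STEPS_PER_EPISODE
episodes_with_advice = math.ceil(ADVICE_BUDGET / MIN_STEPS_PER_EPISODE)
```

Under this assumption, five episodes are advised completely and the remaining 10 pieces of advice are spent at the start of the sixth episode.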

Figure 6 shows, as examples, the rewards collected by an autonomous DeepRL agent and by three learner agents trained by different people. All human-IDeepRL agents perform much better than the autonomous DeepRL agent, making fewer mistakes and, therefore, collecting rewards faster. Among the interactive agents, there are some differences, especially at the beginning of the learning process, within the first 50 episodes. It is possible to observe the different strategies followed by the human trainers. For instance, the sixth trainer (yellow line, labeled as human-IDeepRL6) started by giving wrong advice, leading to less reward at the beginning, while the eighth trainer (green line, labeled as human-IDeepRL8) started by giving nearly perfect advice, with a drop in the collected reward some episodes later. Nevertheless, all the agents, even the autonomous one, managed the task well, reaching a similar reward.

Figure 6: Collected rewards for different interactive agents. The figure compares the learning process of agents trained by different people using human-IDeepRL (one autonomous agent is included as a reference). The trainer has a budget of 100 actions to advise the learner agent. Although each person has initially a different understanding of the environment considering objectives and possible movements, all the agents converge to similar behavior in terms of collected reward.

Figure 7 shows Pearson’s correlation of the collected rewards for all the interactive agents trained by the participants in the experiment. Moreover, we include an autonomous agent and an interactive agent trained by an artificial trainer agent as references. All interactive agents, including the one using an artificial trainer agent, show a high correlation in terms of the collected reward, whereas the autonomous agent shows a lower correlation with the interactive approaches.

Figure 7: Pearson’s correlation between the collected rewards for different agents. The shown agents include an autonomous agent, an interactive agent trained by an artificial trainer, and the interactive agents trained by humans. The collected reward for all the interactive approaches, including the one using the artificial trainer, presents a similar behavior, showing a high correlation. In contrast, the reward collected by the autonomous agent shows a lower correlation with that of the interactive agents.
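A correlation matrix of this kind can be computed from the per-episode reward curves of the agents. The following is a minimal sketch using only the Python standard library. The agent names and reward values are made up for illustration and are not the data behind Figure 7.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(curves):
    """Pairwise Pearson correlation between named reward curves."""
    names = list(curves)
    return {a: {b: pearson(curves[a], curves[b]) for b in names} for a in names}


# Hypothetical per-episode rewards, for illustration only.
curves = {
    "autonomous": [0.0, 0.1, 0.1, 0.5, 1.0, 2.0],
    "agent_trained": [0.5, 1.0, 1.8, 2.2, 2.6, 2.8],
    "human_trained": [0.4, 1.1, 1.7, 2.3, 2.5, 2.9],
}
matrix = correlation_matrix(curves)
```

Each entry `matrix[a][b]` lies in [-1, 1]. The diagonal is 1 because every curve correlates perfectly with itself.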

Additionally, we have computed Student’s t-test to test the statistical difference between the obtained results. When the autonomous DeepRL approach is compared to agent-IDeepRL and human-IDeepRL, it obtains a t-score t=7.6829 (p-value p=) and a t-score t=7.0192 (p-value p=), respectively, indicating that there is a statistically significant difference between the autonomous and the interactive approaches. On the other hand, comparing the two interactive approaches with each other, i.e., agent-IDeepRL and human-IDeepRL, yields a t-score t=0.8461 (p-value p=0.3978), showing that there is no statistically significant difference between the interactive methods.
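The t-score for such a comparison can be sketched as follows. The text does not state which variant of the test was used, so this example assumes the classic two-sample Student's t-test with independent samples and pooled (equal) variance. The sample values are invented for illustration. In practice, a library routine such as SciPy's `scipy.stats.ttest_ind` returns both the statistic and the p-value.

```python
import math
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Assumes independent samples with equal variances.
    Returns the t statistic and the degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, dof


# Hypothetical collected rewards for two approaches, for illustration only.
interactive = [2.0, 2.2, 2.4, 2.6, 2.8]
autonomous = [1.0, 1.2, 1.4, 1.6, 1.8]

t_score, dof = student_t(interactive, autonomous)
```

A large absolute t-score relative to the degrees of freedom corresponds to a small p-value, while a t-score near zero indicates that the two samples are statistically indistinguishable.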

In all the tested approaches, from approximately episode 150 onward, the agent performs actions mainly based on its learning or training. By that episode, the value of ε in the ε-greedy policy has decayed to 1% of exploratory actions. Moreover, in all the approaches, the maximal collected reward fluctuates between 2.5 and 3. This is because the robot, with its movements, sometimes knocks another object off the table, different from the one being manipulated.
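One way to obtain such a schedule is an exponential decay of ε per episode. The sketch below is an assumption made for illustration: the initial value of ε and the shape of the decay are not stated in this section, and only the 1% target and the approximate episode come from the text.

```python
EPS_START = 1.0       # assumed initial exploration rate (not stated here)
EPS_TARGET = 0.01     # 1% exploratory actions (from the text)
TARGET_EPISODE = 150  # approximate episode at which the target is reached

# Per-episode multiplicative factor chosen so that
# EPS_START * DECAY ** TARGET_EPISODE equals EPS_TARGET.
DECAY = (EPS_TARGET / EPS_START) ** (1 / TARGET_EPISODE)


def epsilon_at(episode: int) -> float:
    """Exploration rate after `episode` episodes, floored at EPS_TARGET."""
    return max(EPS_TARGET, EPS_START * DECAY ** episode)
```

Under these assumptions the agent still explores in about 10% of its actions halfway through (episode 75) and acts almost entirely on its learned values from episode 150 onward.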

6 Discussion

We have presented an interactive deep reinforcement learning approach to train an agent in a human-robot environment. We have also performed a comparison between three different methods for learning agents. First, we implemented an autonomous version of DeepRL, which had to interact with and learn the environment by itself. Next, we proposed an interactive version called IDeepRL, which used an external trainer to give useful advice during the decision-making process through interactive feedback delivered by early advising.
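The early-advising idea can be sketched as an action-selection step in which the trainer's advice overrides the learner's own choice until the advice budget is spent. This is a simplified illustration under stated assumptions, not the original implementation: the class and function names are hypothetical, advice is assumed to be given on every step while budget remains, and the advised action is assumed to be executed directly.

```python
import random


class EarlyAdvisor:
    """Wraps a trainer policy and hands out advice until a budget is spent."""

    def __init__(self, trainer_policy, budget):
        self.trainer_policy = trainer_policy  # callable: state -> action
        self.remaining = budget

    def advise(self, state):
        """Return an advised action, or None once the budget is exhausted."""
        if self.remaining <= 0:
            return None
        self.remaining -= 1
        return self.trainer_policy(state)


def select_action(state, q_values, epsilon, advisor, rng=random):
    """Follow advice while available, otherwise act epsilon-greedily."""
    advice = advisor.advise(state)
    if advice is not None:
        return advice
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploratory action
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy action


# Toy usage: a trainer that always advises action 2, with a budget of 3.
advisor = EarlyAdvisor(lambda state: 2, budget=3)
actions = [select_action(None, [0.9, 0.1, 0.0], 0.0, advisor) for _ in range(5)]
print(actions)  # [2, 2, 2, 0, 0]
```

The same wrapper serves both variants: for agent-IDeepRL the trainer policy would be a previously trained agent, and for human-IDeepRL it would query a person.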

We have implemented two variations of IDeepRL by using previously trained artificial agents and humans as trainers. We called these approaches agent-IDeepRL and human-IDeepRL, respectively. Our proposed IDeepRL methods considerably outperform the autonomous DeepRL version in the implemented robotic scenario. Moreover, in complex tasks, which often require more training time, having an external trainer give supportive feedback leads to great benefits in terms of time and collected reward.

Overall, the interactive deep reinforcement learning approach introduces an advantage in domestic-like environments. It speeds up the learning process of a robotic agent interacting with the environment and allows people to transfer prior knowledge about a specific task. Furthermore, using a reinforcement learning approach allows the agent to learn the task without the need for previously labeled data, as is the case for supervised learning methods. In this regard, the agent learns both to recognize the state of the environment and to act in it, deciding where to place the different objects.

As future work, we will consider the use of different kinds of artificial trainers in order to better select possible advisors. A bad teacher can negatively influence the learning process and limit the learner by teaching a strategy that is not necessarily optimal. To select a good teacher, it is necessary to take into account that the agent that obtains the best results for the task, in terms of accumulated reward, is not necessarily the best teacher [4]. Rather, a good teacher could be one with a small standard deviation in the obtained results. This would allow advising the learner agent in more specific states.
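As a rough illustration of this selection criterion, the sketch below picks, among candidate trainer agents, the one whose collected rewards vary the least rather than the one with the highest average. The candidate names and reward values are hypothetical, and a practical criterion would probably also require a minimum level of performance.

```python
from statistics import mean, stdev


def pick_teacher(candidates):
    """Return the name of the candidate whose rewards vary the least."""
    return min(candidates, key=lambda name: stdev(candidates[name]))


# Hypothetical collected rewards per candidate trainer, for illustration only.
candidates = {
    "high_reward_erratic": [3.0, 2.0, 3.0, 2.5, 3.0],
    "steady": [2.4, 2.5, 2.6, 2.5, 2.5],
}
best = pick_teacher(candidates)
print(best)  # steady
```

Here the erratic candidate has the higher mean reward, yet the steady one is selected because its behavior is more consistent.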


This research was partially funded by Universidad Central de Chile under the research project CIP2018009, the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and the Brazilian agencies FACEPE and CNPq.


  • [1] S. Adam, L. Busoniu, and R. Babuska (2012) Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42, pp. 201–212. Cited by: §3.
  • [2] A. Bignold (2019) Rule-based interactive assisted reinforcement learning. Ph.D. Thesis, Federation University Australia. Cited by: §2.1.
  • [3] T. Brys, A. Nowé, D. Kudenko, and M. E. Taylor (2014) Combining multiple correlated reward and shaping signals by measuring confidence. In AAAI, pp. 1687–1693. Cited by: §2.1.
  • [4] F. Cruz, S. Magg, Y. Nagai, and S. Wermter (2018) Improving interactive reinforcement learning: what makes a good teacher? Connection Science 30 (3), pp. 306–325. Cited by: §2.1, §6.
  • [5] F. Cruz, S. Magg, C. Weber, and S. Wermter (2014) Improving reinforcement learning with interactive feedback and affordances. In 4th International Conference on Development and Learning and on Epigenetic Robotics, pp. 165–170. Cited by: §1.
  • [6] F. Cruz, S. Magg, C. Weber, and S. Wermter (2016) Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems 8 (4), pp. 271–284. Cited by: §2.2.
  • [7] F. Cruz, G. I. Parisi, and S. Wermter (2018) Multi-modal feedback for affordance-driven interactive reinforcement learning. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.
  • [8] F. Cruz, P. Wüppen, S. Magg, A. Fazrie, and S. Wermter (2017) Agent-advising approaches in an interactive reinforcement learning scenario. In 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 209–214. Cited by: §3.1.
  • [9] N. Desai and A. Banerjee (2017) Deep reinforcement learning to play space invaders. Technical report, Stanford University. Cited by: §3.
  • [10] A. Dobrovsky, U. M. Borghoff, and M. Hofmann (2019) Improving adaptive gameplay in serious games through interactive deep reinforcement learning. In Cognitive infocommunications, theory and applications, pp. 411–432. Cited by: §2.1.
  • [11] M. A. Goodrich, A. C. Schultz, et al. (2008) Human–robot interaction: a survey. Foundations and Trends® in Human–Computer Interaction 1 (3), pp. 203–275. Cited by: §1.
  • [12] S. Griffith, K. Subramanian, J. Scholz, C. Isbell, and A. L. Thomaz (2013) Policy shaping: integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2625–2633. Cited by: §2.1.
  • [13] J. Grizou, M. Lopes, and P. Oudeyer (2013) Robot learning simultaneously a task and how to interpret human instructions. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–8. Cited by: §2.1.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §3.2.
  • [15] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [16] G. Li, R. Gomez, K. Nakamura, and B. He (2019) Human-centered reinforcement learning: a survey. IEEE Transactions on Human-Machine Systems 49 (4), pp. 337–349. Cited by: §2.1.
  • [17] T. J. Lukka, T. Tossavainen, J. V. Kujala, and T. Raiko (2014) ZenRobotics recycler–robotic sorting using machine learning. In Proceedings of the International Conference on Sensor-Based Sorting (SBS), Cited by: §2.2, §2.2.
  • [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §2.1, §3.2.
  • [19] A. Najar and M. Chetouani (2020) Reinforcement learning with human advice: a survey. arXiv preprint arXiv:2005.11016. Cited by: §2.1.
  • [20] N. Navidi (2020) Human ai interaction loop training: new approach for interactive reinforcement learning. arXiv preprint arXiv:2003.04203. Cited by: §2.1.
  • [21] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §2.1.
  • [22] Y. Niv (2009) Reinforcement learning in the brain. Journal of Mathematical Psychology 53 (3), pp. 139–154. Cited by: §1.
  • [23] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §2.1.
  • [24] E. Rohmer, S. P. Singh, and M. Freese (2013) V-rep: a versatile and scalable robot simulation framework. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326. Cited by: §4.
  • [25] S. Shepherd and A. Buchstab (2014) Kuka robots on-site. In Robotic Fabrication in Architecture, Art and Design 2014, pp. 373–380. Cited by: §1.
  • [26] H. B. Suay and S. Chernova (2011) Effect of human guidance and state space size on interactive reinforcement learning. In 2011 Ro-Man, pp. 1–6. Cited by: §2.1.
  • [27] L. Sun, G. Aragon-Camarasa, S. Rogers, R. Stolkin, and J. P. Siebert (2017) Single-shot clothing category recognition in free-configurations with application to autonomous clothes sorting. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6699–6706. Cited by: §1, §2.2.
  • [28] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: An introduction. MIT press. Cited by: §1, §2.1.
  • [29] M. E. Taylor, N. Carboni, A. Fachantidis, I. Vlahavas, and L. Torrey (2014) Reinforcement learning agents providing advice in complex video games. Connection Science 26 (1), pp. 45–63. Cited by: §2.1, §2.1.
  • [30] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, pp. 2094–2100. Cited by: §2.1.
  • [31] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817. Cited by: §1.
  • [32] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. Corke (2015) Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791. Cited by: §2.2.
  • [33] C. Zhihong, Z. Hebin, W. Yanbo, L. Binyan, and L. Yu (2017) A vision-based robotic grasping system using deep learning for garbage sorting. In 2017 36th Chinese Control Conference (CCC), pp. 11223–11226. Cited by: §1, §2.2.