Reinforcement learning (RL) enables intelligent agents to make good decisions in their environments and makes it possible for a robot to learn specific skills by interacting with its environment. However, in many robotic control tasks, the observation space and action space are continuous and high-dimensional, so traditional RL algorithms usually cannot cope with these tasks well. Recently, deep reinforcement learning (DRL) algorithms have achieved significant success in many continuous control tasks and difficult sequential decision-making problems.
In general, the original DRL algorithms can be divided into two types, i.e., policy-based algorithms such as vanilla policy gradient and value-based algorithms such as deep Q-learning. To combine the advantages of these two types of algorithms, researchers devised actor-critic style algorithms, such as advantage actor-critic (A2C), asynchronous advantage actor-critic (A3C) and actor-critic with experience replay (ACER). On the other hand, with the aim of alleviating the unstable training of policy networks, trust region policy optimization (TRPO) was proposed by Schulman et al., in which the KL divergence is employed to constrain the policy update. Subsequently, proximal policy optimization (PPO), a simpler variant, was also proposed by Schulman et al. In addition, deep deterministic policy gradient (DDPG) can be thought of as an extension of deep Q-learning to continuous action spaces. Twin delayed DDPG (TD3), an improved version of DDPG, was put forward by Fujimoto et al. to alleviate the overestimation of Q-values and make the training process more stable. Moreover, soft actor-critic (SAC), which appeared almost simultaneously with TD3, is an actor-critic style RL algorithm featuring DDPG-style critics and maximum entropy regularization. It is worth emphasizing that SAC has achieved state-of-the-art results in many robotic continuous control tasks.
Although DRL algorithms have achieved significant success, they still suffer from sample inefficiency, difficulty in escaping from local optima and unstable training, of which sample inefficiency is particularly serious. This is because agents cannot be rewarded in a timely manner. In addition, in the early stages of training, the stochastic policy cannot select good actions, which results in poor rewards. Thus, if we can guide agents to take appropriate actions and move toward high-reward areas, the sample efficiency will be greatly enhanced. Based on this idea, there has been sustained research activity on combining RL algorithms with expert demonstrations, i.e., reinforcement learning from demonstration (RLfD) [14, 15, 16, 17, 18]. Recently, it has become common practice for DRL algorithms to put demonstration data into replay buffers [19, 20, 21] or to pre-train the policy network [22, 23]. However, these methods cannot make full use of demonstration data, since they use it only in a supervised learning manner. To address this deficiency, more advanced RLfD approaches [17, 18] were proposed, absorbing useful ideas from imitation learning [24, 25, 26, 27, 28]. Many of them use reward shaping techniques to guide agents to explore high-reward areas, because rewards provide the most informative signal about an environment. However, these methods usually require a lot of expert demonstrations, and for many robotic tasks it is difficult to collect a large amount of demonstration data. Also, traditional robot control methods usually require complex dynamics modeling and sophisticated control algorithm design.
Thus, in this paper, we propose a sample-efficient algorithm, DRL-EG (DRL with efficient guidance), which can be used in robotic continuous control tasks. Our method, built on SAC, makes effective use of demonstration data in the agent's exploration phase. First, a discriminator and a guider are built from demonstration data. Then, in the training phase, the discriminator judges whether the guider can give a good action in the current state. If so, the agent takes the action output by the guider; otherwise, the agent takes an action from the policy. Finally, a better policy can be achieved through better exploration. Empirical evaluation results on continuous control tasks verify the effectiveness and performance improvement of our method over its counterparts, such as SAC, PPO, PPO with pre-training and DDPGfD. Experimental results also show that DRL-EG can help the agent escape from a local optimum.
The major contributions of this paper can be summarized as follows:
Two types of discriminator and guider, a neural network style and a functional style, are proposed to guide the agent toward better exploration of the environment.
We develop a sample-efficient DRL algorithm named DRL-EG to tackle the problems of pure DRL algorithms, such as sample inefficiency and difficulty in escaping from a local optimum.
With the guided mechanism, DRL-EG achieves consistent improvement over other RL and RLfD counterparts on several continuous robotic control benchmarks.
The rest of the paper is organized as follows. Section 2 introduces the maximum entropy RL framework. Our proposed method is detailed in Section 3. Empirical evaluation results and discussion are presented in Section 4. Finally, the conclusion is given in Section 5.
II-A Markov Decision Process (MDP)
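The proposed method is based on the framework of the infinite-horizon Markov decision process (MDP), represented by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, in which $\mathcal{S}$ and $\mathcal{A}$ represent the state space and the action space, respectively. The state transition probability $p(s_{t+1} \mid s_t, a_t)$ denotes the probability density of transitioning to the next state $s_{t+1}$ after performing action $a_t$ in state $s_t$. $r$ is a reward function that maps state-action pairs to real numbers, i.e., $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. $\gamma$ is the discount factor.
Usually, a stochastic policy $\pi(a_t \mid s_t)$ for an MDP gives the probability of taking action $a_t$ in state $s_t$. $\rho_0(s_0)$ gives the probability density of the initial state $s_0$. An episode is a trajectory of an agent's interaction with the environment, so an episode of an MDP can be defined by a sequence $\tau = (s_0, a_0, s_1, a_1, \ldots)$. The MDP used in RL is aimed at maximizing the expected return:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right] \tag{1}$$
In reinforcement learning, the state value function $V^{\pi}$ and the state-action value function $Q^{\pi}$ are introduced to evaluate the performance of a policy. $V^{\pi}$ and $Q^{\pi}$ can be given by the expressions:
$$V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l}) \,\Big|\, s_t\right] \tag{2}$$
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l}) \,\Big|\, s_t, a_t\right] \tag{3}$$
where $\mathbb{E}_{\pi}$ means the expectation is taken with respect to the policy $\pi$. According to the definitions of the state value function and the state-action value function, $V^{\pi}$ and $Q^{\pi}$ can be transformed into each other:
$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q^{\pi}(s_t, a_t)\right] \tag{4}$$
$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V^{\pi}(s_{t+1})\right] \tag{5}$$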
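As a small illustration of the quantity being maximized, the sketch below computes the discounted return of a finite reward sequence and averages it over several episodes. The function names and the truncation to finite episodes are assumptions made for this example and are not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite episode (a truncated return)."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def average_return(episodes, gamma):
    """Monte Carlo estimate of the expected return from sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


# Two hypothetical reward sequences collected with some policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
estimate = average_return(episodes, gamma=0.5)
```

Averaging such returns over many sampled episodes gives a Monte Carlo estimate of the objective for the policy that generated them.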
II-B Soft Policy Iteration
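The soft actor-critic algorithm is a maximum entropy reinforcement learning algorithm, which has achieved state-of-the-art performance in many continuous control tasks. Entropy is often used to indicate the degree of uncertainty of a distribution $P$, and can be represented by $\mathcal{H}(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]$. In maximum entropy reinforcement learning, an entropy term is added to the reward of each step, i.e., $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$, where $\alpha$ is a temperature coefficient that weights the entropy term. This helps to improve the stability and robustness of the model.
In this case, the objective is to maximize the cumulative rewards and the entropy of the policy:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right] \tag{6}$$
Moreover, equations (4) and (5) can be modified to:
$$V_{\text{soft}}^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q_{\text{soft}}^{\pi}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right] \tag{7}$$
$$Q_{\text{soft}}^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\text{soft}}^{\pi}(s_{t+1})\right] \tag{8}$$
For a fixed policy $\pi$, applying (7) and (8) repeatedly yields $Q_{\text{soft}}^{\pi}$. This process is called Q-iteration.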
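The repeated backup described above can be sketched for a small tabular MDP. This is a minimal illustration that assumes the standard soft Bellman backup with a temperature `alpha`; the data layout (`P`, `R` and `pi` as nested lists) and the function names are hypothetical.

```python
import math


def soft_state_values(Q, pi, alpha):
    """Soft V(s): expectation over a ~ pi of Q(s, a) - alpha * log pi(a | s)."""
    return [
        sum(p * (q - alpha * math.log(p)) for q, p in zip(q_row, pi_row) if p > 0.0)
        for q_row, pi_row in zip(Q, pi)
    ]


def soft_policy_evaluation(P, R, pi, gamma, alpha, sweeps=200):
    """Evaluate a fixed policy by repeating the soft V and soft Q backups.

    P[s][a][s2] is a transition probability, R[s][a] a reward and
    pi[s][a] the probability of action a in state s.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        V = soft_state_values(Q, pi, alpha)
        # Soft Q backup: reward plus discounted expected soft value of the next state.
        Q = [
            [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return Q, soft_state_values(Q, pi, alpha)


# Toy MDP: one state, two actions, reward 1 for both, uniform policy.
P = [[[1.0], [1.0]]]
R = [[1.0, 1.0]]
pi = [[0.5, 0.5]]
Q, V = soft_policy_evaluation(P, R, pi, gamma=0.5, alpha=1.0)
```

In this toy case the entropy of the uniform policy is ln 2, so the fixed point can be checked by hand: Q converges to 2 + ln 2 and V to 2 + 2 ln 2.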
III-A Demonstrations Guided Exploration Framework
When people start to learn a new skill, such as swimming or shooting a basketball, people usually will not take actions blindly. Instead, they will imitate the actions of experts, and then explore some more suitable actions for them. Therefore, it is sensible to think that for RL algorithms, it is necessary to use demonstration data to guide the exploration of agents at the beginning of training, which can help agents learn faster.
Inspired by this, we propose a demonstration-guided mechanism, which makes use of demonstration data to guide the agent's actions in the training phase, as shown in Fig. 1. In state $s_t$, the discriminator $D$ judges whether the guider $G$ can give a good action in this state. If so, the agent takes the action output by the guider $G$; otherwise, the agent takes an action output by the policy $\pi$.
Generally, the guider can be a neural network or a function, modeled using demonstration data, i.e., a series of state-action pairs $\{(s_i, a_i)\}$. The discriminator can be a distribution over states or a function, modeled using the set of demonstration states $\{s_i\}$. This mechanism is mainly useful when the average reward of the demonstration data far exceeds the performance of the agent, especially in the early stage of training. With this mechanism, the sample efficiency is enhanced, and the agent may be able to escape from a local optimum.
To take an extreme example, suppose that in a robot grasping task the reward is 1 when the end effector catches the target and 0 otherwise. At the beginning of training, the stochastic policy can hardly bring the agent close to the target. In this case, the agent cannot learn anything useful from the rewards, since they are all zero. Thus, if the guider can guide the robot to approach the target, the rewards will contain more useful information and training will be greatly accelerated.
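The switching rule of this mechanism can be sketched as a single function. Here `discriminator`, `guider` and `policy` are hypothetical callables standing in for whatever models are built from the demonstration data; this is an illustrative sketch rather than the authors' implementation.

```python
def select_action(state, discriminator, guider, policy):
    """Return (action, source) for one exploration step.

    The guider's action is used when the discriminator accepts the state;
    otherwise the action comes from the learned policy.
    """
    if discriminator(state):
        return guider(state), "guider"
    return policy(state), "policy"


# Toy stand-ins: states near 0 are treated as covered by demonstrations.
def toy_discriminator(state):
    return abs(state) < 0.5


def toy_guider(state):
    return 1.0


def toy_policy(state):
    return -1.0


guided = select_action(0.1, toy_discriminator, toy_guider, toy_policy)
unguided = select_action(2.0, toy_discriminator, toy_guider, toy_policy)
```

Returning the source alongside the action is a design choice of this sketch; it makes it easy to log how often exploration is being guided during training.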
III-B Discriminator and Guider in Practice
Two types of discriminator are proposed, i.e., a state distribution and a function. The distribution is modeled from the set of states $\{s_i\}$ in the demonstration data, using a Gaussian mixture model (GMM):
$$p(s) = \sum_{k=1}^{K} w_k\, \mathcal{N}(s \mid \mu_k, \Sigma_k) \tag{9}$$
where $w_k$, $\mu_k$ and $\Sigma_k$ are the weight, mean and covariance of the $k$-th component. The expectation-maximization (EM) algorithm can be used to fit the GMM. When the probability density of a state $s_t$ is greater than a threshold, we can conclude that this state can be guided; otherwise, it cannot be guided by the guider $G$. It is worth noting that this kind of method is only applicable when a large number of state-action pairs are available.
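A sketch of the density test follows, assuming the mixture parameters have already been fitted (for example by EM) and, for simplicity, that each component has a diagonal covariance. The function names, the diagonal-covariance restriction and the threshold value are assumptions made for this illustration.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density at x of a Gaussian with diagonal covariance (per-dimension variances)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_density(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


def can_be_guided(state, weights, means, variances, threshold):
    """Accept the state when its density under the demonstration GMM exceeds the threshold."""
    return gmm_density(state, weights, means, variances) > threshold


# One-component, one-dimensional toy mixture (a standard normal).
weights, means, variances = [1.0], [[0.0]], [[1.0]]
near = can_be_guided([0.0], weights, means, variances, threshold=0.1)
far = can_be_guided([5.0], weights, means, variances, threshold=0.1)
```

States far from every mixture component have a density close to zero and are therefore left to the policy.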
For the case of relatively little demonstration data, such as only 1000 state-action pairs, we can use a functional discriminator:
$$D(s_t) = \begin{cases} 1, & \min_{i} \lVert s_t - s_i \rVert_2 \le \delta \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
where $\delta$ is a hyperparameter. For the functional discriminator, we need to iterate through the states $s_i$ in the demonstration data and calculate the Euclidean distance between each of them and the state $s_t$. When the Euclidean distance between the state $s_t$ and some state $s_i$ in the demonstration data is short, we can assume that $s_t$ is similar to $s_i$. Thus, there is a similar state in the demonstration data, and we can use the guider to guide the agent in state $s_t$.
As before, two types of guider are proposed, i.e., a neural network and a function. For the neural network, the input is a state and the output is an action, and the network is trained on the state-action pairs in the demonstration data. In the case of a small amount of demonstration data, a function is more practical, which can be given by:
$$G(s_t) = a_k \tag{11}$$
where $k = \arg\min_{i} \lVert s_t - s_i \rVert_2$. In this paper, we use the functional $D$ and $G$, since we have collected no more than 1000 state-action pairs in each environment.
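The functional discriminator and guider can be sketched with a nearest-neighbor search over the demonstration states. The sketch assumes that the discriminator accepts a state when its nearest demonstration state lies within a distance `delta`, and that the guider returns the action paired with that nearest state; all function and variable names are hypothetical.

```python
import math


def nearest_demo(state, demo_states):
    """Index of, and Euclidean distance to, the demonstration state closest to `state`."""
    distances = [math.dist(state, s) for s in demo_states]
    k = min(range(len(distances)), key=distances.__getitem__)
    return k, distances[k]


def functional_discriminator(state, demo_states, delta):
    """True when some demonstration state lies within distance `delta` of `state`."""
    _, distance = nearest_demo(state, demo_states)
    return distance <= delta


def functional_guider(state, demo_states, demo_actions):
    """Action paired with the demonstration state nearest to `state`."""
    k, _ = nearest_demo(state, demo_states)
    return demo_actions[k]


# Two hypothetical demonstration pairs in a 2-D state space.
demo_states = [(0.0, 0.0), (1.0, 1.0)]
demo_actions = [(-0.5,), (0.5,)]

close_state = (0.9, 1.0)
if functional_discriminator(close_state, demo_states, delta=0.2):
    action = functional_guider(close_state, demo_states, demo_actions)
```

With around 1000 demonstration pairs, this linear scan is cheap; a spatial index would only become worthwhile for much larger demonstration sets.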
III-C DRL-EG Algorithm
Due to the fact that the SAC algorithm has achieved the state-of-the-art performance in many continuous control tasks, we apply the demonstration data guided exploration mechanism to the SAC algorithm in order to get better performance, which leads to the proposed DRL-EG (DRL with efficient guidance) algorithm.
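To simplify notation going forward, we drop the subscript soft in $V_{\text{soft}}$ and $Q_{\text{soft}}$. Thus, $V$ and $Q$ denote the soft versions of the value functions. For continuous state and action spaces, we usually use neural networks in place of tabular value functions, so they can be denoted by $V_{\psi}(s_t)$ and $Q_{\theta}(s_t, a_t)$, where $\psi$ and $\theta$ represent the parameters to be optimized in the neural networks. Similarly, a stochastic policy can also be represented by a neural network with independent Gaussian noise, denoted by $\pi_{\phi}(a_t \mid s_t)$. For the convenience of backpropagating the gradient, we denote the action output by the policy as:
$$a_t = f_{\phi}(\epsilon_t; s_t), \quad \epsilon_t \sim \mathcal{N}(0, I) \tag{12}$$
which means the policy network parameterized by $\phi$ with independent noise $\epsilon_t$ can simultaneously output the mean and standard deviation of the action $a_t$ under state $s_t$. In order to maximize the expected rewards plus entropy, we can directly maximize $V(s_t)$ for all $s_t$, which is more practical, so we can rewrite the objective function as:
$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[Q_{\theta}(s_t, f_{\phi}(\epsilon_t; s_t)) - \alpha \log \pi_{\phi}(f_{\phi}(\epsilon_t; s_t) \mid s_t)\right] \tag{13}$$
where $\mathcal{D}$ denotes the replay buffer. When we get a mini batch of $N$ training samples, the approximate gradient of (13) can be obtained by:
$$\hat{\nabla}_{\phi} J_{\pi}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\phi}\left[Q_{\theta}(s_i, f_{\phi}(\epsilon_i; s_i)) - \alpha \log \pi_{\phi}(f_{\phi}(\epsilon_i; s_i) \mid s_i)\right] \tag{14}$$
For the evaluation of $Q_{\theta}$ and $V_{\psi}$, the mean-square error (MSE) loss can be used to approach the targets with a mini batch of samples. For the Q-value, the target can be calculated by $\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma V_{\psi}(s_{t+1})$, and for the V-value, the target is $\hat{V}(s_t) = Q_{\theta}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t \mid s_t)$ with $a_t = f_{\phi}(\epsilon_t; s_t)$. Thus, the loss functions can be given by:
$$J_{Q}(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^{2}\right] \tag{15}$$
$$J_{V}(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\tfrac{1}{2}\left(V_{\psi}(s_t) - \hat{V}(s_t)\right)^{2}\right] \tag{16}$$
which can be optimized with stochastic gradient descent with a mini batch of samples:
$$\hat{\nabla}_{\theta} J_{Q}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} Q_{\theta}(s_i, a_i)\left(Q_{\theta}(s_i, a_i) - \hat{Q}(s_i, a_i)\right) \tag{17}$$
$$\hat{\nabla}_{\psi} J_{V}(\psi) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\psi} V_{\psi}(s_i)\left(V_{\psi}(s_i) - \hat{V}(s_i)\right) \tag{18}$$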
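For concreteness, the two regression targets and the squared-error loss can be sketched on plain numbers. This assumes the usual SAC-style targets with a temperature `alpha`; the function names are hypothetical, and `next_value`, `q_value` and `log_prob` stand for network outputs that a real implementation would compute.

```python
def q_target(reward, next_value, gamma):
    """Target for Q(s, a): reward plus the discounted soft value of the next state."""
    return reward + gamma * next_value


def v_target(q_value, log_prob, alpha):
    """Target for V(s): Q(s, a) minus alpha * log pi(a | s) for an action sampled from the policy."""
    return q_value - alpha * log_prob


def squared_error_loss(predictions, targets):
    """Mini-batch mean of 0.5 * (prediction - target)**2; the 0.5 factor is a convention."""
    return sum(0.5 * (p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Hypothetical mini batch of two transitions.
rewards = [1.0, 0.0]
next_values = [2.0, 4.0]
targets = [q_target(r, v, gamma=0.5) for r, v in zip(rewards, next_values)]
loss = squared_error_loss([1.0, 3.0], targets)
```

The targets are treated as constants when differentiating the loss, so only the predicted values contribute gradients to the critic parameters.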
Thus, the proposed method can be described in detail, as shown in Algorithm 1.
IV Experiments and Discussion
For the experiments below, we aim at investigating the following questions:
Can the proposed algorithm achieve better performance than other RL and RLfD counterparts?
Under what circumstances does this mechanism work, and under what circumstances does it fail?
Can the guided exploration mechanism help the agent to jump out of the local optimum?
IV-A Experiments Configuration
To answer the first two questions, we first applied the proposed algorithm and SAC to a series of continuous control tasks in OpenAI Gym, ranging from low-dimensional classical control tasks to challenging high-dimensional robotic control tasks, some of which are implemented using MuJoCo, a physics engine that aims to facilitate research in robotics. The environments are described in Fig. 2, and the dimensions of their action and state spaces are listed in Table 1.
Considering the randomness of neural network initialization, and for a more persuasive evaluation and a fair comparison, each experiment was run three times, and each curve shown is the average of the three runs. Our method uses the same experimental parameters as SAC. We also compared our method with other RL and RLfD algorithms, such as proximal policy optimization (PPO), PPO with pre-training and deep deterministic policy gradient from demonstration (DDPGfD).
In deep reinforcement learning, the training process is usually unstable. To dig deeper into the effect of the proposed mechanism on the training process and to answer the third question, we also ran the proposed algorithm and SAC in the Ant-v2 environment under four fixed random seeds. In each comparative experiment, the two algorithms have not only identical training parameters but also the same initial neural network parameters. The only difference within each comparative experiment is the demonstration-guided exploration mechanism.
In addition, the demonstration data is collected as follows:
For the MountainCarContinuous-v0 environment, the demonstration data was collected by manual labeling. The demonstration action is defined as follows: when the velocity is less than 0, the demonstration action is -0.5; otherwise, the demonstration action is 0.5.
For environments built on MuJoCo, the demonstration data was obtained by running the best trained SAC or TD3 (twin delayed DDPG) agent in the environment.
The demonstration data for all environments contains 1000 state-action pairs.
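The labeling rule for MountainCarContinuous-v0 is simple enough to write down directly. The sketch assumes each state is a (position, velocity) pair, as in the Gym observation for this environment; the function names are hypothetical.

```python
def demo_action(velocity):
    """Hand-written rule: push left while moving left, otherwise push right."""
    return -0.5 if velocity < 0 else 0.5


def label_states(states):
    """Pair each (position, velocity) state with its demonstration action."""
    return [(state, demo_action(state[1])) for state in states]


# Three hypothetical states visited by the car.
states = [(-0.5, -0.01), (-0.6, 0.0), (-0.4, 0.02)]
demonstrations = label_states(states)
```

Pushing in the direction of motion pumps energy into the car, which is what lets it swing up to the goal.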
IV-B Comparative Evaluations
Fig. 3 shows the performance of our method versus the SAC and other RL and RLfD counterparts under continuous control tasks. The performance is defined as the average cumulative rewards of several trajectories in the training course.
In the MountainCarContinuous environment, SAC and PPO can hardly learn anything useful from the rewards, because the rewards given by the environment are so sparse. In this environment, a positive reward is given only when the car reaches the top of the mountain, and every action the car takes reduces the reward, representing the cost of movement. Therefore, since in the early stage of exploration the agent never reaches the top of the mountain and the rewards it receives are always nonpositive, the agent learns to take no action in order to maximize its rewards. Thus, the rewards obtained by the SAC and PPO agents are 0 at the end of training. In contrast, the proposed mechanism can guide the agent toward high-reward areas to a certain extent, so once the agent reaches the top of the mountain and receives a high reward, it can learn from that reward and obtain a better policy. PPO with pre-training and DDPGfD have the same guiding ability, but their final performance is not as good as that of the proposed algorithm.
In the Hopper and HalfCheetah environments, the performance of SAC is almost indistinguishable from that of the proposed algorithm. We believe this is because SAC already achieves good performance in these environments, so the additional guidance mechanism is not needed for better performance.
In the Walker and Ant environments, our method outperforms all other RL and RLfD algorithms. Moreover, the proposed method almost matches the performance of the demonstration data at the end of training. In addition, SAC and DDPGfD encountered unstable training in the Ant environment, which is usually caused by the nonstationarity of the incoming data and the difficulty of obtaining steady improvement. Sometimes, small differences in the parameter space of the policy lead to large differences in performance. However, in these environments, the DRL-EG algorithm shows better training stability.
Based on the results in Fig. 3, we can conclude that the proposed algorithm achieves better performance than its counterparts. In addition, the DRL-EG algorithm can not only learn efficiently in a sparse-reward environment such as MountainCarContinuous, but also achieve more stable training. It is worth noting that when SAC already achieves excellent performance in an environment, such as Hopper or HalfCheetah, the guided exploration mechanism does not significantly improve the agent's performance.
Fig. 4 shows the performance of our method versus SAC in the Ant-v2 environment with four different random seeds. In each panel, the training parameters and experimental settings are identical, except that our method includes the guided exploration mechanism. For random seed 1, SAC was clearly trapped in a local optimum. In the early stage of training, our method was also trapped in a local optimum; however, owing to the guided exploration mechanism, it successfully escaped. For random seed 2, our method converged faster than SAC, and the two reached the same performance at the end of training. For random seeds 3 and 4, both algorithms experienced unstable training, but our method achieved better performance at the end of training.
Therefore, based on the results in Fig. 4, we can make the observation that our method can help agents to escape from a local optimum and get better results.
To tackle the problems of low sample efficiency, susceptibility to local optima and training instability in pure RL algorithms, in this paper we propose a sample-efficient DRL-EG algorithm, which can be used in robotic continuous control tasks. The main intuition is that expert demonstrations can be used to guide an agent toward high-reward areas, which helps to reduce ineffective exploration and improves sample efficiency. Therefore, we built a discriminator and a guider using demonstration data. Then, in the training phase, the discriminator judges whether the guider can give a good action in the current state. The guider then guides agents to take good actions. Finally, a better policy can be achieved through better exploration. The experimental results on several continuous control tasks show that DRL-EG achieves better performance than other RL and RLfD counterparts and can help pure DRL algorithms escape from a local optimum. Combining our work with computer vision to enable a real robot to grasp a target in dynamic open scenes, or to move in a real environment, could be a direction for future work.
We would like to thank Yanrui Jin and Kai Zhu for insightful discussions. We would also like to thank Duantengchuan Li and Christopher Wu for their valuable work of sample collection.
- [1] R. S. Sutton and A. G. Barto, “Reinforcement learning,” vol. 15, no. 7, pp. 665–685, 1998.
- [2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Proc. International Conference on Machine Learning (ICML), NYC, U.S., June 2016, pp. 1329–1338.
- [3] R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Neural Information Processing Systems, vol. 12, pp. 1057–1063, 1999.
- [4] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. National Conference on Artificial Intelligence, 2016, pp. 2094–2100.
- [5] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” Systems Man and Cybernetics, vol. 42, pp. 1291–1307, 2012.
- [6] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, “Reinforcement learning through asynchronous advantage actor-critic on a GPU,” in Proc. International Conference on Learning Representations (ICLR), Toulon, France, Apr. 2017, pp. 1–12.
- [7] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. De Freitas, “Sample efficient actor-critic with experience replay,” in Proc. International Conference on Learning Representations (ICLR), Toulon, France, Apr. 2017.
- [8] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. International Conference on Machine Learning (ICML), 2015, pp. 1889–1897.
- [9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv: Learning, 2017.
- [10] T. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), 2016.
- [11] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. International Conference on Machine Learning (ICML), vol. 80, 2018, pp. 1587–1596.
- [12] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. International Conference on Machine Learning (ICML), 2018, pp. 1856–1865.
- [13] Y. Yu, “Towards sample efficient reinforcement learning,” in Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 5739–5743.
- [14] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Constructing skill trees for reinforcement learning agents from demonstration trajectories,” Neural Information Processing Systems, pp. 1162–1170, 2010.
- [15] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowe, “Reinforcement learning from demonstration through shaping,” in Proc. International Conference on Artificial Intelligence, 2015, pp. 3352–3358.
- [16] G. Yang et al., “Reinforcement learning from imperfect demonstrations,” arXiv:1802.05313, 2018.
- [17] B. Kang and J. Feng, “Policy optimization with demonstrations,” in Proc. International Conference on Machine Learning (ICML), 2018, pp. 2469–2478.
- [18] W. Sun et al., “Truncated horizon policy search combining reinforcement learning and imitation learning,” arXiv:1805.11240, 2018.
- [19] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulacarnold, et al., “Deep q-learning from demonstrations,” in Proc. National Conference on Artificial Intelligence, 2018, pp. 3223–3230.
- [20] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothorl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” arXiv: Artificial Intelligence, 2017.
- [21] A. Nair, B. Mcgrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 6292–6299.
- [22] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [23] T. M. Edmund, D. L. C. J. G. Victor, and D. Yunshu, “Pre-training neural networks with human demonstrations for deep reinforcement learning,” arXiv: Learning, 2019.
- [24] C. Yang, X. Ma, W. Huang, F. Sun, H. Liu, J. Huang, and C. Gan, “Imitation learning from observations by minimizing inverse dynamics disagreement,” arXiv: Learning, 2019.
- [25] W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell, “Deeply aggrevated: differentiable imitation learning for sequential prediction,” in Proc. International Conference on Machine Learning (ICML), 2017, pp. 3309–3318.
- [26] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys, vol. 50, no. 2, p. 21, 2017.
- [27] J. Ho and S. Ermon, “Generative adversarial imitation learning,” Neural Information Processing Systems, pp. 4565–4573, 2016.
- [28] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
- [29] G. Xiang and J. Su, “Task-oriented deep reinforcement learning for robotic skill acquisition and control,” IEEE Transactions on Cybernetics, 2019.
- [30] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic algorithms and applications,” arXiv: Learning, 2018.
- [31] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 2, pp. 620–630, 1957.
- [32] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
- [33] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” CoRR, vol. abs/1606.01540, 2016. [Online]. Available: http://arxiv.org/abs/1606.01540
- [34] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 5026–5033.
- [35] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proc. International Conference on Machine Learning (ICML), 2017, pp. 2817–2826.
- [36] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proc. International Conference on Learning Representations (ICLR), 2016.