Balance Between Efficient and Effective Learning: Dense2Sparse Reward Shaping for Robot Manipulation with Environment Uncertainty

03/05/2020, by Yongle Luo, et al.

Efficient and effective learning is one of the ultimate goals of deep reinforcement learning (DRL), although a compromise between the two is made most of the time, especially in robot manipulation applications. Learning is always expensive for robot manipulation tasks, and the learning effectiveness can be degraded by system uncertainty. To address these challenges, we propose a simple but powerful reward shaping method, namely Dense2Sparse. It combines the fast convergence of a dense reward with the noise isolation of a sparse reward to achieve a balance between learning efficiency and effectiveness, which makes it suitable for robot manipulation tasks. We evaluate Dense2Sparse with a series of ablation experiments that use a state representation model subject to system uncertainty. The experimental results show that Dense2Sparse obtains a higher expected reward than standalone dense or sparse rewards, and it also exhibits a superior tolerance to system uncertainty.


I Introduction

Deep reinforcement learning (DRL) has shown its amazing ability in real robotic manipulation tasks ranging from object grasping [10, 21] and ball playing [1, 31] to door opening [14] and assembly [23, 33]. Compared with conventional control methods, DRL uses a powerful nonlinear approximator to learn a mapping from state to action, which is more robust to noise in the states when conducting complex tasks [23, 28, 27]. Most previous work on DRL-guided robot manipulation assumed that the agent can obtain accurate rewards during training [10, 23]. In real practice, however, noise from the environment or from the sensors themselves increases system uncertainty. Because of error accumulation, this noise also degrades the learning effectiveness of DRL [9, 41, 22].

There are two major factors that affect the learning efficiency and effectiveness of DRL: reward shaping and the policy. Recently, various learning policies have been studied intensively with great improvements in performance [26, 5, 24, 11, 34, 18]. However, only a few attempts have been made to improve reward shaping. In DRL, reward shaping mainly follows two schemes, the sparse reward and the dense reward [25, 29, 8, 12, 44]. A sparse reward is a straightforward scheme in which the agent obtains a reward only if the task or a sub-task has been completed. In such a case, the agent has to explore the environment without any reward feedback until it reaches the target state. This makes policies using a sparse reward difficult to optimize, since there is little gradient information to drive the agent toward the target [15]. In addition, it usually takes a long time to obtain a successful trajectory with random actions in a high-dimensional state space. To address the efficiency issue of the sparse reward, the dense reward was developed, which gives continuous feedback to the agent to accelerate learning [25, 8, 12, 44]. Although dense-reward DRL usually learns faster than its sparse counterpart, its learning effectiveness highly depends on the accuracy of the reward, which can be affected by noise and disturbance from the environment.

It is noted that at the beginning of learning with a sparse reward, convergence is relatively slow due to the lack of global information and successful experience. However, once enough successful data have been stored in the buffer, the sparse reward can lead to better training results [21, 40]. Conversely, a dense reward can learn a suboptimal policy quickly; however, because of system uncertainty such as sensor noise [29], the final result may not be as good as the one obtained with a sparse reward [2]. To combine the advantages of both schemes, in this study we propose a Dense2Sparse reward shaping approach. It first uses a dense reward to guide the agent to a relatively good policy, and then switches to a sparse reward, which uses noiseless experience to further optimize the DRL algorithm.

To evaluate the performance of the Dense2Sparse technique, we set up a typical robot manipulation environment using a 7 degree-of-freedom (DOF) robot and a monocular camera for sensing the environment. Normally, a fixed monocular camera cannot form the stereo vision that is commonly used in manipulation tasks to obtain position/orientation feedback on the target. With the help of deep learning-based image processing techniques [16, 17, 35, 38], DRL often relies on representation learning to transform images into a physically meaningful form, such as positions and distances [10]; this process is called state representation. In this work, we use a fixed monocular camera together with a ResNet34 network [17] as the state representation model to extract the location of a target block and to estimate state and reward signals. The experimental results show the superior performance of Dense2Sparse compared with standalone dense and sparse rewards. Interestingly, when we increased the perturbation level by shifting the calibrated camera by a certain angle (which increased the uncertainty of the feedback system), the Dense2Sparse approach maintained almost the same convergence speed and episode total reward, which illustrates that Dense2Sparse is not sensitive to system uncertainty. Therefore, Dense2Sparse is able to balance the effectiveness and efficiency of DRL, especially for robot manipulation tasks with system uncertainty.

The contributions of this study can be summarized as follows: (1) the proposed Dense2Sparse approach balances the efficiency and effectiveness of DRL compared with regular dense or sparse reward shaping; (2) the proposed Dense2Sparse approach is able to deal with system uncertainty, making the DRL policy less sensitive to noise, which helps extend DRL applications to more scenarios.

II Related Work

II-A Reward shaping

Reward shaping plays an important role in DRL, since the reward is the interaction signal with the environment that guarantees successful operation. Efficiency and effectiveness are two major challenges of reward shaping, which also hinder the direct application of DRL to real robot practice.

Most of the time, the dense reward is designed manually for various DRL settings [23, 13]. In general, computing this reward requires physical quantities of the object features such as position or distance information. To obtain those physical quantities, multiple sensors must be used, which may introduce extra noise and increase system complexity [36, 19]. Other than the dense reward, the sparse reward is an easier and more straightforward way to provide feedback to DRL, and it has also been widely applied in many robotic studies. For instance, in [33], a binary reward obtained by testing whether the electronics are functional or not is used to conduct industrial insertion tasks. In [21], a sparse reward is designed to judge, by a visual method, whether the target block is lifted, which is determined by subtracting the images before and after the task. In [40], the agent gets a reward only when all blocks are at their goal positions in a robotic block stacking task. These studies have shown that DRL can achieve relatively good performance even with a sparse reward.

In general, designing a dense reward function requires a certain amount of expert experience, and it leads to good but still suboptimal policy performance [26, 19, 30]. Compared with dense rewards, a number of tasks are more natural to specify with a sparse reward, and the policy can perform well if some tricks are introduced to overcome the initial exploration problem [21, 33, 40].

II-B State representation and noise filtering

Although the dense reward has the obvious advantage of faster convergence, in most cases the precise physical state is not accessible, which is common in robotic manipulation. State representation learning is an effective way to estimate the physical state in this situation [28, 27, 6]. In [20], the state representation model contains convolutional layers and an LSTM module, which maps sequences of past images and joint angles to motor velocities, gripper actions, the 3D cube position and the 3D gripper position. In [38], a 3D coordinate predictor based on VGG16 [35] is trained with a large amount of domain randomization, which keeps high accuracy when the environment changes from simulation to the real scene. The state representation adopted in this paper is based on a ResNet34 [17] network structure, and to keep its performance robust, we also introduce domain randomization tricks as in [38].

Since the state representation model has a certain estimation error [38], the dense reward calculated from the estimated state is also perturbed, which may decrease the policy's performance. A direct way to tackle the estimation error is to adopt a Kalman filter [43]. However, before using a Kalman filter to correct the estimation error, we would need to know the system dynamics (e.g., the robot-environment interaction model), which is inaccessible in model-free DRL. In [41], a confusion-matrix analysis is proposed as an unbiased reward estimator to tackle the noisy reward problem, and this strategy leads to a good improvement in learning effectiveness. However, since it can only deal with a limited number of discrete situations, it is difficult to extend to the large continuous state spaces common in robotics.

It has been found that a sparse-reward-guided policy may outperform one trained with a dense reward [28, 2]. This is because the sparse reward is a direct judgement of whether the task is completed, which is hardly affected by human factors or environment noise [2]. Considering the fast convergence of the dense reward and the robustness of the sparse reward, in this study a novel reward shaping method named Dense2Sparse is developed. In this method, a dense reward, which may include noise, is first used to learn a suboptimal strategy quickly; the training then switches to the sparse reward to continue policy learning, which guarantees fast and robust learning.

III Problem Statement and Method Overview

In this study, we propose a novel reward shape for reinforcement learning that balances the effectiveness and efficiency of learning for robot manipulation. Specifically, the manipulation is conducted in the following two stages. In the first stage, a ResNet34 network is used as the state representation model to estimate the location of the target object from single-view images captured by a camera. This stage is convenient for reward shaping and for the hardware setup (all we need is a single monocular camera), which is especially useful for robot manipulation, but it inevitably introduces estimation error. In the second stage, the estimated target location is used by a DRL algorithm equipped with the Dense2Sparse technique. Specifically, the DRL runs with the noisy dense reward for a predefined number of episodes, and then switches to the sparse reward for the rest of the training process.
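As an illustration only (not the authors' released code), this episode-indexed switch from a dense to a sparse reward can be sketched as follows; switch_episode, dense_fn, and sparse_fn are hypothetical names, since the paper only states that the switch happens after a predefined number of episodes.

```python
from typing import Callable

def dense2sparse_reward(episode: int,
                        switch_episode: int,
                        dense_fn: Callable[..., float],
                        sparse_fn: Callable[..., float],
                        *args, **kwargs) -> float:
    """Return the dense reward before the switch episode, the sparse reward after.

    switch_episode, dense_fn, and sparse_fn are hypothetical names used for
    illustration; the paper only states that the switch happens after a
    predefined number of episodes.
    """
    if episode < switch_episode:
        return dense_fn(*args, **kwargs)   # stage 1: noisy but informative shaping
    return sparse_fn(*args, **kwargs)      # stage 2: noise-isolated completion signal
```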

In general, the manipulation tasks can be modeled as a finite-horizon, discounted Markov Decision Process (MDP). To describe the MDP, let $\mathcal{S}$ be the low-dimensional state space given by the state representation model; let $\mathcal{A}$ be the action space constructed from the joint-speed and grasping commands; and let $\gamma$ and $T$ be the discount factor and horizon, respectively, which are set to fixed values in this study. The reward function can then be represented as the mapping $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, and the state transition dynamics can be denoted as $p:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$. Finally, the policy is defined as a deterministic mapping $\pi:\mathcal{S}\rightarrow\mathcal{A}$. The goal of formulating the MDP is to maximize the total expected reward, expressed as (1), by optimizing the policy $\pi$.

(1)   $J(\pi)=\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}\,r(s_{t},a_{t})\right]$

Since we adopt model-free DRL to solve the above MDP, we avoid having to access the transition dynamics. The training of the state representation model is described in Section IV, and the policy training is detailed in Section V.
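To make the objective in (1) concrete, the per-episode discounted return that the agent maximizes can be computed as below; this is a standard calculation, not code from the paper, and the discount value shown is a placeholder.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode.

    gamma=0.99 is a placeholder; the paper does not report the discount
    factor it used.
    """
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: three steps of reward
print(discounted_return([0.0, 0.5, 1.0]))  # 0.0 + 0.99*0.5 + 0.99**2*1.0
```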

IV Representation Model

It is challenging to use raw visual information directly for reinforcement learning in robotic manipulation [28]. Instead, it is more feasible to use a low-dimensional state representation as the feedback data. A state representation with a clear physical meaning is preferred, as it can serve both as the state and as the basis of the reward shape, thus reducing the number of physical sensors required. Specifically, we employ a pre-trained deep learning network, ResNet34, as our state representation model to process the raw images obtained by a fixed monocular camera. The loss function of the ResNet34 network is given as follows.

(2)   $L=\sum_{m=1}^{3}\left(p_{m}-\hat{p}_{m}\right)^{2}$

where $p_{m}$ ($m=1,2,3$) represents the actual $x$, $y$ and $z$ values in world coordinates, and $\hat{p}_{m}$ denotes the predicted $x$, $y$, $z$ values output by the ResNet34 model.
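For concreteness, a minimal PyTorch sketch of such a ResNet34-based position regressor trained with a squared-error loss is shown below; the regression head, optimizer settings, and data tensors are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet34 backbone with a 3-unit regression head for the (x, y, z) block position.
# Using ImageNet-pretrained weights mirrors the "pre-trained" backbone mentioned
# in the text; the head and training details below are assumptions.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 3)

criterion = nn.MSELoss()   # squared-error loss corresponding to Eq. (2) (mean reduction)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, positions: torch.Tensor) -> float:
    """One gradient step on a batch of camera images and ground-truth positions."""
    model.train()
    optimizer.zero_grad()
    pred = model(images)                   # (batch, 3) predicted x, y, z
    loss = criterion(pred, positions)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random data: batch of 8 RGB images at 224x224 resolution.
imgs = torch.randn(8, 3, 224, 224)
xyz = torch.randn(8, 3)
print(train_step(imgs, xyz))
```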

It has been shown that combining random data and policy-related data in the data set can lead to better policy learning performance [28]. In this study, we collected data by running the robot arm with a mixture of random actions and policy-related actions. To overcome the "sim-to-real" problem (a policy that works well in one simulation environment but worse in another scene), we adopted a simple domain randomization method [38], in which we randomly changed the position of the camera within a small range; a sketch of this camera randomization is given below Fig. 1. We collected the data over a number of episodes with a fixed number of steps per episode, and finally selected one quarter of the collected images for training. The evaluation metric for this representation model is the mean Euclidean distance between the ground truth and the predicted position of the target block. As shown in Fig. 1, the final error of the state representation model is on the order of centimeters after training.

Fig. 1: Error curve of the representation model during training.
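The camera randomization mentioned above can be sketched as follows; the set_camera_position method, the nominal pose, and the ±2 cm range are hypothetical stand-ins, since the paper only states that the camera position was varied within a small range.

```python
import random

NOMINAL_CAMERA_POS = (1.0, 0.0, 1.5)   # hypothetical nominal (x, y, z) in meters
JITTER = 0.02                          # hypothetical +/- 2 cm range

def randomize_camera(env):
    """Perturb the camera position by a small random offset at the start of an episode.

    env.set_camera_position is a hypothetical interface; the paper does not
    describe the simulator API it used for this step.
    """
    jittered = tuple(c + random.uniform(-JITTER, JITTER) for c in NOMINAL_CAMERA_POS)
    env.set_camera_position(jittered)
    return jittered
```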

V Policy Learning and Experiment Setting

V-A Policy learning

We evaluate the performance of the Dense2Sparse strategy on two common robot manipulation tasks, reaching and lifting. To save computing resources, both tasks use the same state representation model described in Section IV. As a demonstration, we use TD3 [11] as the reinforcement learning algorithm. The schematic diagram of the policy learning process is shown in Fig. 2. In the first stage, the state obtained from the representation model and the robot proprioception (including the joint angles and angular velocities as well as the opening degree of the gripper) are used to compute a dense reward. With the dense reward, at each action step the robot receives a reward calculated by a reward function whose physical quantities are obtained from the state representation model. After a certain number of episodes, the dense reward is switched to the sparse reward to continue policy learning (the second stage). In the second stage, the robot only receives a reward when the task (reaching task) or sub-task (lifting task) is completed.
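The overall two-stage loop might look like the following sketch; TD3Agent-style methods (act, store, update), the env and represent objects, and the episode threshold are hypothetical placeholders standing in for the paper's actual TD3 implementation, simulator, and state representation model.

```python
import numpy as np

SWITCH_EPISODE = 200   # hypothetical switch point; the paper only says
NUM_EPISODES = 1000    # "a certain number of episodes"
HORIZON = 100          # hypothetical episode length

def run_training(env, agent, represent, dense_reward, sparse_reward):
    """Two-stage Dense2Sparse training loop (illustrative only)."""
    for episode in range(NUM_EPISODES):
        use_dense = episode < SWITCH_EPISODE          # stage 1 vs. stage 2
        obs = env.reset()
        for t in range(HORIZON):
            # State = estimated block position (from the camera image) + proprioception.
            state = np.concatenate([represent(obs["image"]), obs["proprio"]])
            action = agent.act(state)
            next_obs, done = env.step(action)
            reward = dense_reward(next_obs) if use_dense else sparse_reward(next_obs)
            next_state = np.concatenate([represent(next_obs["image"]), next_obs["proprio"]])
            agent.store(state, action, reward, next_state, done)
            agent.update()                            # off-policy TD3 update
            obs = next_obs
            if done:
                break
```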

Fig. 2: Schematic diagram of the testing platform

V-B Task setup

Both the reaching task and the lifting task are conducted in Robosuite [10], a simulation environment based on the MuJoCo physics engine [39]. The robot used in this study is a 7-DOF robot (SCR5, Siasun Co., Shenyang, China) with a two-finger gripper. For the reaching task, the gripper is kept closed, and the task is completed only when the end of the gripper touches the target block or comes within a small threshold distance of the center of the block. For the lifting task, the robot has to grasp the block first and then lift it a specified height above the desktop to complete the task.

V-C Reward function

There are two types of reward functions in both the reaching task and the lifting task: a sparse reward and a dense reward. For the sparse reward, the robot gets a binary reward, which is 1 if the task is completed and 0 otherwise. The detailed reward settings are formulated as follows, where $d$ is the distance between the center of the gripper and the center of the block; $r^{dense}_{reach}$ and $r^{sparse}_{reach}$ represent the dense and sparse rewards adopted in the reaching task; and $r^{dense}_{lift}$ and $r^{sparse}_{lift}$ represent the dense and sparse rewards in the lifting task, respectively.
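Since the exact dense-reward formulas are not reproduced in this text, the sketch below uses a common distance-based shaping (an approach term for reaching, plus grasp and lift bonuses for lifting) purely as an illustrative assumption; the binary sparse rewards follow the completion conditions described above, with hypothetical thresholds.

```python
import numpy as np

REACH_THRESHOLD = 0.02   # hypothetical completion threshold (meters)
LIFT_HEIGHT = 0.04       # hypothetical lift height (meters)

def reach_dense(gripper_pos, block_pos):
    """Illustrative dense shaping: larger reward as the gripper approaches the block."""
    d = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(block_pos))
    return 1.0 - np.tanh(10.0 * d)      # assumed form, not the paper's exact formula

def reach_sparse(gripper_pos, block_pos):
    """Binary reward: 1 when the gripper is within the completion threshold."""
    d = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(block_pos))
    return 1.0 if d < REACH_THRESHOLD else 0.0

def lift_dense(gripper_pos, block_pos, grasped, block_height):
    """Illustrative staged shaping: approach term plus bonuses for grasping and lifting."""
    r = reach_dense(gripper_pos, block_pos)
    if grasped:
        r += 0.5
    if block_height > LIFT_HEIGHT:
        r += 1.0
    return r

def lift_sparse(grasped, block_height):
    """Binary reward: 1 only when the block has been grasped and lifted high enough."""
    return 1.0 if (grasped and block_height > LIFT_HEIGHT) else 0.0
```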

V-D Evaluation metrics

We evaluate a policy by its episode reward, defined as the total reward the agent receives in a single episode. To compare policies trained with different reward shapes, we unify the evaluation reward with the dense reward described above. To ensure fairness of the evaluation, the episode reward is calculated from the actual state (obtained directly from the simulation environment) rather than from the state predicted by the state representation model.

VI Experiment Results

We first tested the proposed scheme on a simple reaching task, followed by a further investigation on the lifting task. To make a comprehensive evaluation, we conducted a series of ablations in both tasks, detailed as follows.

Standalone sparse: the experiments use the sparse reward elaborated in Section V for the reaching or lifting task and keep it unchanged during policy learning.
Standalone dense: the experiments use the dense reward elaborated in Section V to conduct the entire reaching or lifting task.
Oracle: instead of using the state representation model, the actual state (obtained directly from the simulation environment) is employed to calculate the dense reward.
Dense2Sparse: the experiments use the proposed method elaborated in Section V as the reward.

We designed these experiments to answer the following three questions:
1) Does the policy with the Dense2Sparse reward perform better than one with a standalone dense or sparse reward when the dense reward is not accurate?
2) How much performance is lost by using the Dense2Sparse reward compared with directly using the oracle reward when there are uncertainties in the state and reward?
3) How robust is our method when the state shifts during policy training?

VI-A Reaching task

For the case with the ideal camera alignment setting (as shown in Fig. 3(a)), the training curves of the TD3 agent are shown in Fig. 4(a). The policy was evaluated at regular episode intervals during training and the mean episode reward was recorded. Each curve in Fig. 4(a) aggregates three runs with different random seeds. To make the results more intuitive, we also plot histograms of the success rate in Fig. 4(b).

Fig. 3: Comparative tests with different camera settings: (a) scenario with the ideal camera alignment, (b) scenario with a smaller camera alignment error, (c) scenario with a larger camera alignment error.
Fig. 4: Evaluation results for the reaching task. The solid lines and shaded bands in (a), (c) and (e) represent the mean and standard deviation over 3 random seeds for the three camera settings (no camera shifting and two levels of camera shifting, respectively). The histograms in (b), (d) and (f) show the corresponding mean and standard deviation of the success rate reached by the policy after the training in (a), (c) and (e) is completed. For each evaluation, we test 1000 episodes with different target locations and robot initial positions.

From Fig. 4(a) it can be seen that the policy with a standalone sparse reward converges much more slowly than the policies with the Dense2Sparse reward or the standalone dense reward. Its final episode reward is also much lower than that of the other ablations. This is reasonable, since the DRL policy is updated according to the reward: when the reward is sparse, the policy does not receive timely reward feedback, which leads to slower convergence and a poorer episode reward. The policy with the standalone dense reward converges similarly to the policy with Dense2Sparse, but its final episode reward is noticeably lower. An intuitive explanation is that the dense reward can guide the policy to converge quickly even though the reward contains errors; however, the error accumulated at each step lowers the upper limit of the policy trained with a standalone dense reward. In addition, the experimental results indicate that the Oracle achieves a slightly higher final episode reward than the policy with the Dense2Sparse reward. This is expected, because the Oracle has no noise at all, which guarantees the best possible policy performance. In practice, however, it is difficult to obtain the accurate state or reward during operation, and in such conditions our Dense2Sparse technique is the one most likely to approach the best performance of a policy. The result in Fig. 4(b) shows that although the episode rewards of the ablations differ considerably, the differences in the corresponding mean success rates are not large: the mean success rate of the policy with the standalone sparse reward is only slightly lower than that of the other ablations. This is probably because the reaching task is relatively simple.

One advantage of the Dense2Sparse strategy is that it makes the policy less sensitive to noise. To test its performance in a noisy environment, we shifted the camera away from its original position by two different angles (Fig. 3(b) and Fig. 3(c)), while all other settings were kept the same as in the reaching task above. Under these conditions, the output of the pre-trained state representation model became much more inaccurate, which increased the system uncertainty. Fig. 4(c) and Fig. 4(e) show that the final episode reward of the policy with the standalone sparse reward drops well below the level reached in the original scene (about 120) for both camera shifts. The mean success rate is also reduced for both shifts, with a large standard deviation. This shows that as the state error grows, it becomes harder and harder for the policy with the sparse reward to maintain its performance. The large standard deviation means that the policy obtained after training is quite unstable. An intuitive explanation for this phenomenon is that, apart from the sparse reward itself, the only information used to update the policy is the state predicted by the state representation model. When the error of the state representation model increases, it becomes difficult for the policy updates to follow the direction of the original policy trained with smaller state error, which leads to worse policy performance and stability (a high standard deviation).

Fig. 4 also shows an episode-reward reduction and an increase in standard deviation for the policy with the standalone dense reward. However, compared with the policy with the standalone sparse reward, it still retains a large advantage: its episode reward is considerably higher for both camera-shift angles. A reasonable explanation is that the dense reward can reduce the influence of the state error, but only up to a limit.

We can also see from Fig. 4 that for the policy with Dense2Sparse, both the episode reward (about 175) and the success rate remain nearly unchanged as the noise increases. With the Dense2Sparse reward, the agent only receives an inaccurate, biased reward during the dense-reward stage of learning, but it obtains an accurate reward for task completion after switching to the sparse reward. That is why the Dense2Sparse reward shape is naturally less sensitive to noise. In contrast, with a standalone dense reward the agent receives a biased reward at every step throughout training without any correction, and thus its learning is not as good as with the Dense2Sparse reward shape.

To summarize, the testing results show that the Dense2Sparse reward shaping method converges faster than the sparse reward method, which demonstrates its efficiency. In addition, our method is able to handle environment uncertainty; in other words, it has a good tolerance to observer noise, which increases the robustness of the entire system.

VI-B Lifting task

To test whether the proposed Dense2Sparse reward shape also works for more complicated tasks, the same ablative experiments were conducted on the lifting task, which contains three stages (reaching, grasping and lifting) and is therefore more complicated than the previous reaching task. The policy learning curves and the histogram of the evaluation results are shown in Fig. 5(a) and (b), respectively.

Fig. 5: Evaluation results of the lifting task, (a) training curve of the lifting task, (b) performance histogram of the lifting task.

Fig. 5(a) shows that for the policies with the standalone sparse reward and the standalone dense reward, the episode reward rises to a certain level but then stops increasing. For the sparse-reward policy, this may be caused by inadequate reward feedback; for the dense-reward policy, it may be caused by the accumulation of reward error, as detailed for the reaching task. For the policy with the Dense2Sparse reward, however, the final episode reward nearly equals that of the Oracle policy. Fig. 5(b) shows that the mean success rates of the policies with the standalone sparse reward and the standalone dense reward are much lower than those of the Dense2Sparse policy and the Oracle. Meanwhile, the standard deviations of both standalone policies are quite large, indicating that these policies are unstable, which may be caused by the increased task complexity. It can also be seen that the mean success rate of the policy with Dense2Sparse is high and its standard deviation is nearly zero, the same as the Oracle, which further illustrates that our method achieves performance as stable as the Oracle's. In summary, the proposed Dense2Sparse reward shape can handle state and reward uncertainty in a complex task with performance comparable to a policy that uses accurate state and reward information. This illustrates that the Dense2Sparse reward shaping method has great potential for addressing the learning effectiveness and efficiency problems of DRL and is robust to system uncertainty in practice.

VII Discussion and Conclusion

In this study, we proposed a simple but effective reward shaping strategy, Dense2Sparse, to balance the efficiency and effectiveness of DRL for systems with uncertainty. It should be noted that our Dense2Sparse strategy rests on the assumption that system uncertainty exists and that a clear task-completion judgement (for the sparse reward) is available. We believe that in real practice most robot manipulation scenarios fall into this category. Under this assumption, our method can readily be combined with any off-policy DRL algorithm and with reward shaping derived from state representation learning.

To verify the proposed method, we developed a stable state representation model to estimate the position of the target object, which then provides a dense reward for the agent. This reward guides the agent to suboptimal performance, after which the training switches to the sparse reward to rectify the agent's behavior and bring it closer to optimal performance. We verified the Dense2Sparse approach through a series of robotic reaching and lifting tasks with system uncertainty such as camera alignment offsets. The results show that, besides fast convergence, our method can also rescue the agent from misleading rewards even at relatively high noise levels. Future work will focus on combining the Dense2Sparse strategy with "sim-to-real" approaches to make DRL more robust to environment changes.

It should also be noted that we did not verify our method in the real scene, since "sim-to-real" technology is quite mature, including domain randomization [38, 32, 3], dynamics randomization [4, 37], domain adaptation [42, 7] and so on; these studies have shown that if a DRL policy works in a physics-engine-based simulation environment, it should work in real practice as long as a "sim-to-real" technique is applied.

References

  • [1] A. Amiranashvili, A. Dosovitskiy, V. Koltun, and T. Brox (2019) Motion perception in reinforcement learning with dynamic objects. arXiv preprint arXiv:1901.03162. Cited by: §I.
  • [2] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in neural information processing systems, pp. 5048–5058. Cited by: §I, §II-B.
  • [3] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §VII.
  • [4] R. Antonova, S. Cruciani, C. Smith, and D. Kragic (2017) Reinforcement learning for pivoting task. arXiv preprint arXiv:1703.00472. Cited by: §VII.
  • [5] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §I.
  • [6] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §II-B.
  • [7] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §VII.
  • [8] S. M. Devlin and D. Kudenko (2012) Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pp. 433–440. Cited by: §I.
  • [9] T. Everitt, V. Krakovna, L. Orseau, M. Hutter, and S. Legg (2017) Reinforcement learning with a corrupted reward channel. arXiv preprint arXiv:1705.08417. Cited by: §I.
  • [10] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei (2018) Surreal: open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot Learning, pp. 767–782. Cited by: §I, §I, §V-B.
  • [11] S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §I, §V-A.
  • [12] Y. Gao and F. Toni (2015) Potential based reward shaping for hierarchical reinforcement learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §I.
  • [13] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. Cited by: §II-A.
  • [14] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §I.
  • [15] K. Hartikainen, X. Geng, T. Haarnoja, and S. Levine (2019) Dynamical distance learning for unsupervised and semi-supervised skill discovery. arXiv preprint arXiv:1907.08225. Cited by: §I.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §I, §II-B.
  • [18] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, et al. (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §I.
  • [19] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana (2017) Deep reinforcement learning for high precision assembly tasks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 819–825. Cited by: §II-A, §II-A.
  • [20] S. James, A. J. Davison, and E. Johns (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. arXiv preprint arXiv:1707.02267. Cited by: §II-B.
  • [21] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §I, §I, §II-A, §II-A.
  • [22] A. Kumar et al. (2019) Enhancing performance of reinforcement learning models in the presence of noisy rewards. Ph.D. Thesis. Cited by: §I.
  • [23] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2019) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. Cited by: §I, §II-A.
  • [24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
  • [25] M. J. Mataric (1994) Reward functions for accelerated learning. In Machine Learning Proceedings 1994, pp. 181–189. Cited by: §I.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §I, §II-A.
  • [27] A. Nair, S. Bahl, A. Khazatsky, V. Pong, G. Berseth, and S. Levine (2019) Contextual imagined goals for self-supervised robotic learning. arXiv preprint arXiv:1910.11670. Cited by: §I, §II-B.
  • [28] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §I, §II-B, §II-B, §IV, §IV.
  • [29] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §I, §I.
  • [30] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8. Cited by: §II-A.
  • [31] J. Peters and S. Schaal (2008) Reinforcement learning of motor skills with policy gradients. Neural networks 21 (4), pp. 682–697. Cited by: §I.
  • [32] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel (2017) Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542. Cited by: §VII.
  • [33] G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine (2019) Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. arXiv preprint arXiv:1906.05841. Cited by: §I, §II-A, §II-A.
  • [34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I.
  • [35] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I, §II-B.
  • [36] A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine (2019) End-to-end robotic reinforcement learning without reward engineering. arXiv preprint arXiv:1904.07854. Cited by: §II-A.
  • [37] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332. Cited by: §VII.
  • [38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. Cited by: §I, §II-B, §II-B, §IV, §VII.
  • [39] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §V-B.
  • [40] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817. Cited by: §I, §II-A, §II-A.
  • [41] J. Wang, Y. Liu, and B. Li (2018) Reinforcement learning with perturbed rewards. arXiv preprint arXiv:1810.01032. Cited by: §I, §II-B.
  • [42] W. Yu, J. Tan, C. K. Liu, and G. Turk (2017) Preparing for the unknown: learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453. Cited by: §VII.
  • [43] Y. Zhou, A. Wang, P. Zhou, H. Wang, and T. Chai (2020) Dynamic performance enhancement for nonlinear stochastic systems using RBF driven nonlinear compensation with extended Kalman filter. Automatica 112. Cited by: §II-B.
  • [44] H. Zou, T. Ren, D. Yan, H. Su, and J. Zhu (2019) Reward shaping via meta-learning. arXiv preprint arXiv:1901.09330. Cited by: §I.