Disturbance-Injected Robust Imitation Learning with Task Achievement

by Hirotaka Tahara, et al.

Robust imitation learning using disturbance injections overcomes issues of limited variation in demonstrations. However, these methods assume demonstrations are optimal, and that policy stabilization can be learned via simple augmentations. In real-world scenarios, demonstrations are often of diverse quality, and disturbance injection instead learns sub-optimal policies that fail to replicate the desired behavior. To address this issue, this paper proposes a novel imitation learning framework that combines both policy robustification and optimal demonstration learning. Specifically, this combinatorial approach forces policy learning and disturbance injection optimization to focus mainly on learning from high task achievement demonstrations, while utilizing low achievement ones to reduce the number of samples needed. The effectiveness of the proposed method is verified through experiments on an excavation task in both simulation and on a real robot, resulting in high-achieving policies that are more stable and robust to diverse-quality demonstrations. In addition, this method utilizes all of the weighted sub-optimal demonstrations without eliminating them, resulting in practical data-efficiency benefits.




I Introduction

Behavior Cloning (BC) [23] is widely used in robotics as an imitation learning (IL) method [31, 21] to leverage human demonstrations for learning control policies. Learning from humans is particularly desirable from the perspective of safe interactive robot control, as learned policies are based on the demonstrator’s behavior and require few samples for training [21]. A common issue affecting learning via behavior cloning is limited variation in demonstration data, resulting in overly specific, poorly generalized policies that are not robust to deviations in behavior. Specifically, learned policies may be influenced by error compounding (also known as covariate shift [24]), where a mismatch arises between the distributions of data used for training and testing. To robustify learning, stochastic perturbations, known as disturbance injections, are added to the demonstrator’s actions, augmenting the learning space and resulting in stable policies [18]. However, a key limitation that restricts real-world control is the assumption that demonstrators are proficient in the task and can provide consistently high-quality demonstrations. Specifically, a key assumption is that error compounding can be solved by assuming demonstration data is homogeneous, and that an optimal policy can be learned simply by learning flexible generalizations of the demonstration data. In real-world scenarios, data is often heterogeneous and of varying quality, due to the difficulty of the task or human inexperience [12, 30, 14]. In addition, demonstrators may perform idiosyncratic behavior, which might not be task-optimal (e.g., unintentional drifting [6]), resulting in diverse-quality demonstrations. In this scenario, naïve application of disturbance injections does not consider the demonstration quality, and this diverse quality biases policy learning, leading to over-generalized policies. An example of this problem is shown in Fig. 1.

Fig. 1: Reaching task by imitation learning. (a) A demonstrator gives optimal and sub-optimal demonstrations (black arrows). Learned policy causes compounding errors (green arrows). (b) Augmentation of action distribution (gray shaded) mitigates the error. However, the optimality of demonstrations is not considered, leading to sub-optimal policy learning. (c) Combining task achievement weighting with robustification enables robust, optimal policy learning.

To limit the impact of poor-quality demonstrations, this paper takes influence from reinforcement learning [22] to learn policies by weighting the contribution of demonstrations, proposed as task achievement weighted disturbance injections (TAW-DI). Specifically, task achievement maximization is applied to motivate the policy learning method to selectively update based on high-quality demonstrations and to dissuade it from learning from poor ones. Importantly, utilizing weighted sub-optimal trajectories, which may contain both non-optimal and optimal parts, contributes to accelerated convergence of policy performance. This is particularly appealing, as collecting data for IL is demanding due to task complexity and limitations on demonstrator availability.

Specifically, an iterative process is proposed: disturbance-injected augmented trajectories are generated to minimize the task achievement weighted covariate shift, and the policy is updated using the task achievement weighted trajectories, concentrating trajectories around the space of those with high task achievement. Evaluation results on robotic excavation tasks, in both simulation and on a scale excavator, show the framework learns robust policies undeterred by either limited variation or diverse quality of demonstrations, and outperforms methods that explicitly account for these issues independently.

II Related Work

The proposed method combines features of robust imitation learning and reward weighting inspired by reinforcement learning. As such, both topics are discussed in the context of motivating the proposed combinatorial approach.

II-A Robust Imitation Learning

Policy learning via imitation of demonstrator behavior, such as Behavior Cloning (BC) [23], is widely used in robot control [31, 21]. However, BC suffers from covariate shift [24], where policies trained on a limited variation of demonstrations fail to control correctly when predicted actions diverge from the strict expected distribution observed during training. To address this problem, data augmentation methods such as dataset aggregation (DAgger [25]) or disturbance injection (DART [18]) can reduce covariate shift and robustify the policy around the augmented demonstration data.

While robustification mitigates distribution mismatch, there is no guarantee that demonstrators perform optimally. Diverse-quality demonstrations are common in real applications with human demonstrations [12, 30, 14]. Attempting to apply robust IL to remove this error results in learned policies that concentrate around the average of the entire demonstration data. This quality-agnostic approach is undesirable: while it prevents error compounding, the resulting dataset-centered policy is clearly different from the optimal one.

II-B Reward weighting

Diverse-quality demonstrations are antithetical to the assumptions of traditional robust IL methods, which assume high-quality, consistent demonstrations. In contrast, reinforcement learning (RL) does not assume the availability of a demonstrator, and policies are learned via trial-and-error exploration and exploitation of rewards (also known as the task achievement [16]) from the environment, resulting in successful autonomous learning [22, 1, 29].

On the other hand, RL expects immediate rewards, and is unsuitable for episodic task achievement where only the end result matters. Such a sparse task achievement cannot evaluate the random actions of RL; therefore, learning for policy improvement becomes inefficient [11]. In contrast, our method uses demonstrated actions instead of random actions, so the task is at least accomplished by the demonstrators. This enables wide applicability to other tasks where a specific indicator can evaluate the task achievement at the end of the task (e.g., a pick-and-place task). In addition, learning a policy model via trial and error in RL requires a large number of action steps to collect data over the entire environment, and this exploration phase may be unsafe for robot control [10]. This property limits the applicability of RL to real-world scenarios.

II-C Hybridisation: IL with reward weighting

While direct application of RL is problematic, reward weighting [22] can be utilized to address the issue of diverse-quality demonstrations. Prior work has explored combining IL and RL, for example, using IL as a starting policy in the exploration phase of RL to speed up performance convergence [17, 11]. Additionally, recent studies [3, 4] tackle the issue of diverse-quality demonstrations by injecting disturbances into learned IL policies and using estimated rewards. However, utilizing the IL policy, which is still in the learning process and not robustified against covariate shift, can be potentially dangerous for a real robot. Furthermore, trial-and-error exploration in RL compounds this risk. To address this fundamental problem, robustification of policy learning with DART, which requires only the safe demonstrator’s policy, is significant and should be incorporated into IL-based frameworks.

III Preliminaries

III-A Behavior Cloning (BC)

The objective of IL is to learn policies that replicate a demonstrator’s behavior, using demonstration trajectories consisting of sequences of states and actions $\xi = (x_0, u_0, x_1, u_1, \ldots, x_T)$, where $x_t \in \mathcal{X}$, $u_t \in \mathcal{U}$, and $T$ is the total number of steps in a trajectory. The trajectory distribution associated with the dynamics $p(x_{t+1}|x_t, u_t)$ and the parametric policy $\pi_\theta$ with a parameter $\theta$ is defined as:

$$ p(\xi|\pi_\theta) = p(x_0) \prod_{t=0}^{T-1} \pi_\theta(u_t|x_t)\, p(x_{t+1}|x_t, u_t). \quad (1) $$
To replicate a demonstrator’s policy $\pi_{\theta^*}$, the error between a query policy $\pi_\theta$ and the demonstrator’s policy over a trajectory $\xi$ is defined as:

$$ J(\theta, \theta^*|\xi) = \sum_{t=0}^{T-1} \ell\big( \pi_\theta(x_t), \pi_{\theta^*}(x_t) \big), \quad (2) $$

where $\ell$ is a surrogate loss (e.g., the squared error between actions).
To minimize the expected surrogate loss as the objective function, the parameter of a BC policy is acquired by solving:

$$ \theta^{BC} = \underset{\theta}{\arg\min}\; \mathbb{E}_{p(\xi|\pi_{\theta^*})} \big[ J(\theta, \theta^*|\xi) \big]. \quad (3) $$
However, error compounding caused by limited variation in demonstrations can result in policies learned by BC being unstable [24]. The difference between the trajectory distribution of the learned policy and that of the demonstrator’s policy can be formalized as the covariate shift:

$$ \big\| p(\xi|\pi_\theta) - p(\xi|\pi_{\theta^*}) \big\|_1. \quad (4) $$
In the context of executing the learned policy, a high covariate shift tends to lead policies into undemonstrated states, where it is hard to recover back to a desired trajectory.
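Eq. (3) reduces BC to supervised regression of actions on demonstrated states. A minimal sketch of that reduction, assuming a linear policy, a squared-error surrogate loss, and a synthetic linear demonstrator (all illustrative choices, not the paper’s setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrator: a fixed linear policy u = K* x.
K_star = np.array([[0.5, -0.2], [0.1, 0.8]])
states = rng.normal(size=(200, 2))           # x_t samples from demonstrations
actions = states @ K_star.T                   # u_t = pi_{theta*}(x_t)

# BC: minimize the empirical surrogate loss sum_t ||pi_theta(x_t) - u_t||^2
# over the demonstrator's state distribution (least squares in closed form).
K_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)
K_bc = K_bc.T

print(np.allclose(K_bc, K_star))  # exact recovery on noiseless linear data
```

With noiseless linear demonstrations the regression recovers the demonstrator exactly; the covariate shift of Eq. (4) only appears once the learned policy is rolled out on its own state distribution rather than the demonstrator’s.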

III-B Disturbances for Augmenting Robot Trajectories (DART)

To mitigate covariate shift, DART [18] learns policies that are robust to error compounding by generating demonstrations with disturbance injection. The learning space is thereby augmented with recovery actions, allowing for greater variation in demonstration trajectories. The level of disturbance is optimized and updated iteratively during data collection over $K$ iterations (the number of policy updates), on the disturbance distribution parametrized by $\psi$.

Initially, the disturbance-injected policy given the disturbance parameter $\psi$ is expressed as $\pi_{\theta^*}(u_t|x_t, \psi)$. By substituting this policy into the trajectory distribution (Eq. (1)), the disturbance-injected trajectory distribution is defined as:

$$ p(\xi|\pi_{\theta^*}, \psi) = p(x_0) \prod_{t=0}^{T-1} \pi_{\theta^*}(u_t|x_t, \psi)\, p(x_{t+1}|x_t, u_t). \quad (5) $$
As the covariate shift (Eq. (4)) cannot be computed explicitly, DART employs an upper bound derived by applying Pinsker’s inequality:

$$ \big\| p(\xi|\pi_\theta) - p(\xi|\pi_{\theta^*}, \psi) \big\|_1 \le \sqrt{ \tfrac{1}{2}\, \mathrm{KL}\big( p(\xi|\pi_\theta) \,\|\, p(\xi|\pi_{\theta^*}, \psi) \big) }, \quad (6) $$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence. By canceling the common factors (initial state and dynamics) in the trajectory distributions (Eq. (1) and Eq. (5)), the KL divergence can be expanded as follows:

$$ \mathrm{KL}\big( p(\xi|\pi_\theta) \,\|\, p(\xi|\pi_{\theta^*}, \psi) \big) = \mathbb{E}_{p(\xi|\pi_\theta)} \sum_{t=0}^{T-1} \log \frac{\pi_\theta(u_t|x_t)}{\pi_{\theta^*}(u_t|x_t, \psi)}. \quad (7) $$
Then, the disturbance parameter is optimized to minimize the upper bound of the covariate shift as:

$$ \psi^{*} = \underset{\psi}{\arg\max}\; \mathbb{E}_{p(\xi|\pi_\theta)} \sum_{t=0}^{T-1} \log \pi_{\theta^*}(u_t|x_t, \psi), \quad (8) $$

since the numerator $\pi_\theta(u_t|x_t)$ in Eq. (7) does not depend on $\psi$, so minimizing the KL divergence reduces to maximizing the expected log-likelihood of the noise-injected demonstrator policy.
The expectation over the trajectory distribution of the learned policy cannot be computed, since the learned policy’s trajectories are not directly observed. Therefore, DART solves Eq. (8) using the demonstrator’s trajectory distribution with the $k$-th updated disturbance parameter $\psi_k$ in place of the learned policy’s trajectory distribution:

$$ \psi_{k+1} = \underset{\psi}{\arg\max}\; \mathbb{E}_{p(\xi|\pi_{\theta^*}, \psi_k)} \big[ F(\psi|\xi) \big], \quad (9) $$

$$ F(\psi|\xi) = \sum_{t=0}^{T-1} \log \pi_{\theta^*}\big( \pi_{\theta_k}(x_t) \,\big|\, x_t, \psi \big), \quad (10) $$

where $F(\psi|\xi)$ is the objective function for disturbance optimization, with the learned policy $\pi_{\theta_k}$ supplying the query actions on the demonstrator’s states. From Eq. (3), the parameter of a DART policy is optimized using the trajectory distribution with disturbance as:

$$ \theta_{k+1} = \underset{\theta}{\arg\min}\; \mathbb{E}_{p(\xi|\pi_{\theta^*}, \psi_k)} \big[ J(\theta, \theta^*|\xi) \big]. \quad (11) $$
Note that the policy is learned using all data up to the $k$-th iteration, whereas the disturbance parameter is learned using only the demonstration data collected with the most recent $\psi_k$.

A key limitation of DART is that Eq. (6) assumes the demonstrator’s policy with disturbance injection performs optimally to induce the distribution of learned trajectories closer to the demonstrations. However, if the demonstration data is diverse-quality, naïve application of DART (i.e., optimizing the disturbance parameter over the entire space of trajectories), means that the learned policy cannot be induced towards the high reward demonstrations, and learned DART policies fail to replicate an optimal demonstrator’s behavior.
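The DART loop of Eqs. (9)–(11) can be sketched as follows: inject Gaussian noise into the demonstrator’s actions, fit the policy on all noisy demonstrations collected so far, then refit the noise covariance to the policy’s residuals on the most recent batch. The linear demonstrator, toy dynamics, and one trajectory per iteration are illustrative assumptions, not the paper’s setup:

```python
import numpy as np

rng = np.random.default_rng(1)
K_star = np.array([[0.6, -0.3], [0.2, 0.7]])   # demonstrator (assumed linear)

def rollout(Sigma, T=50):
    """One demonstration with disturbance-injected actions u ~ N(K* x, Sigma)."""
    xs, us = [], []
    x = rng.normal(size=2)
    L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(2))
    for _ in range(T):
        u = K_star @ x + L @ rng.normal(size=2)
        xs.append(x); us.append(u)
        x = 0.9 * x + 0.1 * u                   # toy linear dynamics
    return np.array(xs), np.array(us)

Sigma = 0.1 * np.eye(2)                          # initial disturbance level
data_x, data_u = [], []
for k in range(5):                               # K iterations
    xs, us = rollout(Sigma)
    data_x.append(xs); data_u.append(us)
    X = np.concatenate(data_x); U = np.concatenate(data_u)
    # Eq. (11)-style policy fit on all data collected so far (least squares).
    K_pi = np.linalg.lstsq(X, U, rcond=None)[0].T
    # Eqs. (9)-(10)-style noise refit: Gaussian MLE of the policy's error
    # on the most recent batch only.
    resid = xs @ K_pi.T - us
    Sigma = resid.T @ resid / len(resid)
```

Because the noise level is refit to the current policy’s error, the injected disturbances approximate the deviations the learned policy would make at test time, which is what robustifies the augmented demonstrations.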

IV Proposed Method

In this section, a novel imitation learning framework, task achievement weighted disturbance injections (TAW-DI), is introduced that learns a robust optimal policy by exploiting task achievement from diverse-quality demonstration data. In contrast to naïve DART, this method generates disturbances controlled by the task achievement, thereby inducing the learned policy towards the high task achievement demonstrations. The proposed method is outlined in Alg. 1.

1 for k = 1 to K do
2        for n = 1 to N do
3               Collect a disturbance-injected augmented trajectory with its task achievement: $\xi^n \sim p(\xi|\pi_{\theta^*}, \psi_k)$, $p(R_{\max}|\xi^n)$
4        end for
5       policy $\theta_{k+1}$ is set with Eq. (15); disturbance $\psi_{k+1}$ is set with Eq. (14)
6 end for
Alg. 1 TAW-DI

IV-A TAW-DI

Initially, a binary optimality variable $R_{\max}$, indicating the maximum possible task achievement, is introduced. Maximizing the task achievement is equivalent to maximizing the likelihood of $R_{\max}$ (optimality) [19], and this enables learning optimal policies from diverse-quality demonstrations as a probabilistic operation. The conditional probability of receiving the maximum task achievement given a cost function is defined as:

$$ p(R_{\max}|\xi) = \frac{\exp\big( -c(\xi) \big)}{Z}, \quad (12) $$

where $c(\xi)$ is any cost function that calculates task achievement and $Z$ is a normalizing constant. By incorporating this probability into Eq. (5), the task achievement weighted trajectory distribution collected with disturbance injection is defined as:

$$ p(\xi|\pi_{\theta^*}, \psi, R_{\max}) \propto p(R_{\max}|\xi)\, p(\xi|\pi_{\theta^*}, \psi). \quad (13) $$
By substituting this weighted distribution into the trajectory distribution of Eq. (9), the update scheme of the disturbance parameter considering the task achievement is given as:

$$ \psi_{k+1} = \underset{\psi}{\arg\max}\; \mathbb{E}_{p(\xi|\pi_{\theta^*}, \psi_k)} \big[ p(R_{\max}|\xi)\, F(\psi|\xi) \big]. \quad (14) $$
From Eq. (11), the update scheme of the policy parameter considering the task achievement is given as:

$$ \theta_{k+1} = \underset{\theta}{\arg\min}\; \mathbb{E}_{p(\xi|\pi_{\theta^*}, \psi_k)} \big[ p(R_{\max}|\xi)\, J(\theta, \theta^*|\xi) \big]. \quad (15) $$
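With the cost-based weight of Eq. (12), a batch of demonstrations is effectively reweighted by a softmax over negative costs, so high-achievement (low-cost) trajectories dominate the updates of Eqs. (14)–(15). A tiny illustrative computation (the cost values are made up):

```python
import numpy as np

costs = np.array([0.2, 0.5, 3.0])   # c(xi) per trajectory; lower = better
weights = np.exp(-costs)
weights /= weights.sum()             # Z normalizes over the batch

print(weights.round(3))              # high-achievement demos dominate
```

Note that sub-optimal trajectories keep a small but nonzero weight, so they still contribute gradient signal instead of being discarded.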
IV-B TAW-DI with deep neural network policy model

As an implementation of TAW-DI, a deterministic deep neural network model $\pi_\theta(x_t)$ is employed as the policy, and the disturbance distribution follows Gaussian noise parametrized by the covariance $\Sigma$, i.e., $\pi_{\theta^*}(u_t|x_t, \Sigma) = \mathcal{N}\big(u_t \,\big|\, \pi_{\theta^*}(x_t), \Sigma\big)$. By substituting the policy and disturbance parameter into the objective function (Eq. (10)) of the disturbance update (Eq. (14)), we obtain the disturbance optimization as:

$$ \Sigma_{k+1} = \underset{\Sigma}{\arg\max}\; \mathbb{E}_{p(\xi|\pi_{\theta^*}, \Sigma_k)} \Big[ p(R_{\max}|\xi) \sum_{t=0}^{T-1} \log \mathcal{N}\big( \pi_{\theta_k}(x_t) \,\big|\, \pi_{\theta^*}(x_t), \Sigma \big) \Big]. \quad (16) $$
By applying a Monte Carlo approximation to the expectation over the trajectory distribution, the solution is approximated as:

$$ \Sigma_{k+1} \approx \frac{ \displaystyle\sum_{n=1}^{N} p(R_{\max}|\xi^n) \sum_{t=0}^{T-1} \big( \pi_{\theta_k}(x_t^n) - \pi_{\theta^*}(x_t^n) \big) \big( \pi_{\theta_k}(x_t^n) - \pi_{\theta^*}(x_t^n) \big)^{\top} }{ T \displaystyle\sum_{n=1}^{N} p(R_{\max}|\xi^n) }, \quad (17) $$

where $N$ is the total number of episodes (trajectories). The policy parameter optimization of Eq. (15) is solved by applying gradient descent methods.
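The weighted covariance update of Eq. (17) can be approximated from sampled episodes as below; the residual-based form (learned-policy action minus demonstrator action) and the normalization by the weight sum follow the Gaussian maximum-likelihood view in the text, while the synthetic residuals are purely illustrative:

```python
import numpy as np

def taw_covariance(residuals, weights):
    """Task-achievement-weighted MLE of the disturbance covariance.

    residuals: list of (T, d) arrays, pi_theta_k(x_t) - pi_theta*(x_t) per episode
    weights:   per-episode task-achievement weights p(R_max | xi^n)
    """
    num = sum(w * r.T @ r for w, r in zip(weights, residuals))
    den = sum(w * len(r) for w, r in zip(weights, residuals))
    return num / den

rng = np.random.default_rng(2)
# Four episodes of synthetic residuals with std 0.1 per dimension.
resids = [rng.normal(scale=0.1, size=(100, 2)) for _ in range(4)]
# Down-weight two (hypothetically low-achievement) episodes.
Sigma = taw_covariance(resids, weights=[1.0, 1.0, 0.1, 0.1])
print(Sigma.shape)  # (2, 2)
```

The estimate stays close to the residuals’ true covariance, with low-achievement episodes contributing proportionally less.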

Fig. 2: Simulation environment on VortexStudio. Intensity of color indicates depth of excavations.

V Evaluation

To investigate the effectiveness of learning robust policies given limited variation and diverse-quality demonstration data, an autonomous excavation task is presented in both simulation and a scale robotic excavator. Specifically, a soil stripping excavation task is performed, where an area is cleared of soil by three consecutive scooping motions, as shown in Fig. 2(a).

Excavation is a dangerous and strenuous activity for operators, and robotic automation is actively researched [7], with previous studies [2, 13] investigating automation via modeling of excavator kinematics, or learning data-driven policies using deep learning [9, 28, 8]. RL is unsuitable for this task, due to the danger of exploration, and previous studies have instead investigated imitating human behavior for automation [9, 28]. This paper addresses the challenging task of learning from diverse-quality demonstrations, which is a natural consequence of the difficulty of excavation.

Fig. 3: Visualization of environment, showing excavator’s bucket position (gray shaded) and soil area (brown shaded, with topsoil starting at 0.6m). Results of demonstration trajectories are plotted, either without disturbances (BC) or with disturbances (TAW-DI and DART).
Fig. 4: Result of disturbance updates by Eq. (9) and Eq. (14). TAW-DI and DART trained on all demonstrations learn from sub-optimal trajectories, while the optimal-only DART variant removes them and updates its disturbance using only the optimal trajectory.
(a) TAW-DI
(b) TAW-BC
(c) DART
(d) BC
Fig. 5: Soil stripping excavation task by learned policies. TAW-DI obtained the best result, close to the optimal trajectory, by applying disturbance injection and task achievement weighting. TAW-BC failed to excavate at the proper position due to error compounding. DART performed soil stripping robustly, but the excavation is shallow due to the effect of learning from sub-optimal trajectories. BC was affected by both problems, and its result is the worst.
Fig. 6: Test performance. All methods utilize both optimal and sub-optimal trajectories. While the performance of the comparisons (TAW-BC, DART, and BC) deteriorated due to the limited variation and diverse quality of demonstrations, TAW-DI showed the best performance.

V-A Simulation experiment

To evaluate the proposed TAW-DI method, comparisons are made with the baselines BC and DART, and with TAW-BC (task achievement weighting without disturbance injections) as an ablation. BC and DART conventionally assume demonstrations contain only optimal trajectories defined via a strict threshold, and variants trained only on such optimal trajectories are included. However, since choosing this threshold a priori can be challenging, BC and DART variants that utilize both optimal and sub-optimal trajectories are also compared. The simulation environment is developed on the VortexStudio simulator [5], which enables real-time simulation of excavation with soil dynamics.

V-A1 Task Setting

The diverse quality of the demonstration data is defined according to domain knowledge [26]: deep excavation of the ground should be avoided, as it may cause trench or excavator instability; shallow excavation (Fig. 2(b)) is considered sub-optimal, as it necessitates additional work and time to clear the soil; and an appropriate depth of excavation is considered optimal (Fig. 2(c)). A custom controller is designed that automatically performs soil stripping, either optimally or sub-optimally. The controller automates the excavator to emulate an excavation trajectory following the idea of previous work [15], where the excavation motion is divided into several simple trajectories by passing through a set of predefined points (known as support points). To make the demonstrations more diverse, uniform noise is added to the support points, with a magnitude chosen to give a reasonable divergence given the large spatial scale of the excavator and workspace. Uniform noise is also added to the initial robot position as task uncertainty.

The state is described by 33 dimensions: the excavator’s state, consisting of the joint angles and angular velocities of four joints, and the soil’s shape information, as seen by an overhead depth camera, consisting of a vectorized grid of the excavation space, used to determine the next excavation point by considering the soil shape. The action is the velocity of each joint. Since it is difficult to measure the exact depth of the excavated soil in a real situation, this task defines the task achievement (the cost in Eq. (12)) as the total amount of moved soil, which can easily be evaluated at the end of the task and is suitable for our method as discussed in §II-B.

Following the DART architecture [18] with the same parameterization, a four-layer deep neural network consisting of an input layer, two 64-unit hidden layers, and an output layer is used to train the policies. In the demonstration phase, demonstration data is collected over a number of iterations of episodes (as in DART [18]), and the data is used to update the policies and disturbances. Each iteration contains an optimal and a sub-optimal trajectory. Since variation in demonstrations may induce performance disparity in learned policies, the demonstration phase is repeated twice to validate robustness. In the testing phase, each of the policies learned in each iteration is evaluated three times. One episode takes about 600 steps to complete the task.

Fig. 7: Test performance. TAW-DI and TAW-BC utilize an optimal and a sub-optimal trajectory, while DART and BC use only an optimal trajectory. The performance convergence of TAW-DI and TAW-BC is faster than that of DART and BC.
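The policy network described above (33-dimensional state input, two 64-unit hidden layers, joint-velocity output) can be sketched as a plain forward pass; the tanh activations and the 4-dimensional action output (one velocity per joint) are assumptions, as the text does not state them explicitly:

```python
import numpy as np

rng = np.random.default_rng(3)

def init_mlp(sizes=(33, 64, 64, 4)):
    """Four-layer policy: input layer, two 64-unit hidden layers, output layer."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy(params, x):
    """Deterministic policy: state (33,) -> joint velocities (4,)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)   # assumed hidden activation
    W, b = params[-1]
    return h @ W + b             # linear output layer

params = init_mlp()
u = policy(params, rng.normal(size=33))
print(u.shape)  # (4,)
```

Training would minimize the task-achievement-weighted surrogate loss of Eq. (15) over this network’s parameters by gradient descent, as stated in §IV-B.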

V-A2 Results

The demonstration trajectories collected from the last iteration are shown in Fig. 3. DART trained on the diverse-quality demonstrations exhibits dangerous oscillation and undesired drift in Fig. 3, causing the disturbance level to grow continuously over training iterations, as seen in Fig. 4. In contrast, the DART variant that uses only optimal trajectories does not experience this phenomenon. In comparison, TAW-DI generates demonstration trajectories that are more stable and do not drift (Fig. 3). TAW-DI prevents excessive disturbance-level updates by weighting the trajectories, as shown in Fig. 4, where the disturbance level converges.

After policy learning, the learned policies are evaluated, as seen in Fig. 5, with performance shown in Fig. 6. The proposed method, TAW-DI, which applies both disturbance injection and task achievement weighting, performs the best and achieves a performance close to the optimal. In contrast, BC performs poorly, as it suffers from both error compounding and the diverse quality of the demonstration data, resulting in undesired behavior such as excavating at an improper position. DART performs better than BC, as it explicitly accounts for error compounding through robustification by disturbance injection. However, it is similarly unable to deal with the problem of diverse-quality demonstrations, resulting in sub-optimal policy learning. TAW-BC is robust to diverse-quality demonstrations thanks to task achievement weighting, but is less stable than TAW-DI due to error compounding.

In addition, there are clear benefits to utilizing all demonstrations (optimal and sub-optimal), as seen in Fig. 7. BC fails to reach optimal performance even when only using optimal demonstrations, due to the aforementioned inability to deal with error compounding (similarly seen with TAW-BC). DART overcomes this and reaches optimal performance, however at the cost of a slow convergence rate due to the limited number of optimal samples. In comparison, TAW-DI, which additionally utilizes sub-optimal trajectories, reaches this optimal level of performance much faster. As such, the proposed task achievement weighting method is not only more generalizable, as there is no need to define an optimal threshold, but is also more suitable for data-limited environments.

Fig. 8: A scale robotic excavator environment.

V-B Real robot experiment

In this section, all methods utilize both optimal and sub-optimal demonstrations, including the baselines BC and DART, and the proposed method TAW-DI. A scale robotic excavator is used for this experiment, as shown in Fig. 8.

V-B1 Task setting

This experiment employs human demonstrators. To verify the effectiveness of the proposed method, the demonstrators are instructed to collect an optimal and a sub-optimal trajectory in each episode, to balance the quality of demonstrations. The additional variation in the diverse-quality demonstrations is induced by the uncertainty of the demonstrator’s operation, instead of the uniform noise used in the simulation. The demonstrators manipulate the excavator using twin joystick controllers, where the joints are controlled independently.

State, action, and task achievement are consistent with §V-A, and a motion capture system (OptiTrack Flex3) detects markers on each arm of the excavator and estimates the joint angles and angular velocities simultaneously. A depth camera (Intel RealSense D455) attached overhead captures soil depth. A mass sensor measures the total soil mass. Plastic beads are used instead of real soil, corresponding to soil dynamics with low shear strength and high flow velocity.

In the demonstration phase, demonstration data is collected over a number of iterations of episodes (an optimal and a sub-optimal trajectory each) by two subjects. The policies are learned with the same network as in §V-A. In the testing phase, the learned policies from the final iteration are evaluated three times. One episode takes 200 steps to complete the task.

V-B2 Results

The performance of the test trajectories by the two subjects is shown in Fig. 9. As expected, BC performs poorly due to error compounding and the diverse-quality demonstrations, and the diverse-quality demonstrations also deteriorate the performance of DART. In contrast, TAW-DI attains the same performance as the optimal trajectories through disturbance injection and task achievement weighting, as in the simulation experiments. These results were obtained for both subjects, and significant differences by t-test were observed between the proposed method and the baselines.

(a) Subject 1
(b) Subject 2
Fig. 9: Test performance of two subjects. Significant differences by t-test were observed between the proposed method and baselines ().

VI Discussion

In this section, important discussions of the experimental results are presented.

(1) What effect does combining disturbance injections with task achievement weighting have on policy learning? When demonstrations contain sub-optimal trajectories, DART is unstable: its disturbance update grows continuously (Fig. 4) in an attempt to apply disturbances that minimize the difference between the optimal and sub-optimal trajectories. However, this large disturbance adversely deteriorates the performance of the demonstrations and, consequently, the policy’s performance during testing. Under the assumption of diverse-quality demonstrations, as is common in real-world problems, combining disturbance injection with task achievement weighting overcomes these issues by focusing mainly on learning from high task achievement demonstrations.

(2) What is the advantage of utilizing weighted sub-optimal trajectories instead of simply eliminating them? Experimental results show the use of sub-optimal trajectories can accelerate the convergence of policy performance. In real-world scenarios where data collection costs are high, due to the complexity of the operation and the long duration of tasks, the concept of weighting and utilizing a small number of demonstration trajectories without removing them is significant. This concept is applicable to various real-world robotics problems like excavation, which demands a high cost in human operation and the use of real robots.

VII Conclusion

This paper presents a novel imitation learning framework for addressing real-world robotics problems that suffer from the dual problem of limited variation and diverse quality of demonstrations. While previous studies have investigated these problems independently, the proposed method consistently outperforms methods that explicitly address them independently, in both simulation and real robot experiments. As future work, the scalability of this study can be improved by making the policy multi-modal [27, 20] for further applicability to more complex tasks.


  • [1] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep reinforcement learning: a brief survey. Signal Processing Magazine 34 (6), pp. 26–38. Cited by: §II-B.
  • [2] D. A. Bradley and D. W. Seward (1998) The development, control and operation of an autonomous robotic excavator. Journal of Intelligent and Robotic Systems 21 (1), pp. 73–97. Cited by: §V.
  • [3] D. S. Brown, W. Goo, and S. Niekum (2020) Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on robot learning, pp. 330–359. Cited by: §II-C.
  • [4] L. Chen, R. Paleja, and M. Gombolay (2020) Learning from suboptimal demonstration via self-supervised reward regression. In Conference on robot learning, Cited by: §II-C.
  • [5] CM Labs (2019) Vortex studio. Note: Accessed: 2021-8-22 External Links: Link Cited by: §V-A.
  • [6] A. Coates, P. Abbeel, and A. Y. Ng (2008) Learning for control from multiple demonstrations. In International Conference on Machine Learning, pp. 144–151. Cited by: §I.
  • [7] S. Dadhich, U. Bodin, and U. Andersson (2016) Key challenges in automation of earth-moving machines. Automation in Construction 68, pp. 212–222. Cited by: §V.
  • [8] P. Egli and M. Hutter (2020) Towards rl-based hydraulic excavator automation. In International Conference on Intelligent Robots and Systems, pp. 2692–2697. Cited by: §V.
  • [9] R. Fukui, T. Niho, M. Nakao, and M. Uetake (2017) Imitation-based control of automated ore excavator: improvement of autonomous excavation database quality using clustering and association analysis processes. Advanced Robotics 31 (11), pp. 595–606. Cited by: §V.
  • [10] J. Garcıa and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480. Cited by: §II-B.
  • [11] V. G. Goecks, G. M. Gremillion, V. J. Lawhern, J. Valasek, and N. R. Waytowich (2020) Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments. In International Conference on Autonomous Agents and MultiAgent Systems, pp. 465–473. Cited by: §II-B, §II-C.
  • [12] D. H. Grollman and A. Billard (2011) Donut as i do: learning from failed demonstrations. In International Conference on Robotics and Automation, pp. 3804–3809. Cited by: §I, §II-A.
  • [13] Q. Ha, M. Santos, Q. Nguyen, D. Rye, and H. Durrant-Whyte (2002) Robotic excavation in construction automation. Robotics & Automation Magazine 9 (1), pp. 20–28. Cited by: §V.
  • [14] M. Hamaya, F. von Drigalski, T. Matsubara, K. Tanaka, R. Lee, C. Nakashima, Y. Shibata, and Y. Ijiri (2020) Learning soft robotic assembly strategies from successful and failed demonstrations. In International Conference on Intelligent Robots and Systems, pp. 8309–8315. Cited by: §I, §II-A.
  • [15] D. Jud, P. Leemann, S. Kerscher, and M. Hutter (2019) Autonomous free-form trenching using a walking excavator. Robotics and Automation Letters 4 (4), pp. 3208–3215. Cited by: item (iii).
  • [16] A. Kinose and T. Taniguchi (2020) Integration of imitation learning using gail and reinforcement learning using task-achievement rewards via probabilistic graphical model. Advanced Robotics 34 (16), pp. 1055–1067. Cited by: §II-B.
  • [17] J. Kober and J. Peters (2010) Imitation and reinforcement learning. Robotics & Automation Magazine 17 (2), pp. 55–62. Cited by: §II-C.
  • [18] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg (2017) Dart: noise injection for robust imitation learning. In Conference on Robot Learning, pp. 143–156. Cited by: §I, §II-A, §III-B, item (i).
  • [19] K. S. Luck, J. Pajarinen, E. Berger, V. Kyrki, and H. B. Amor (2016) Sparse latent space policy search. In AAAI Conference on Artificial Intelligence, pp. 1911–1918. Cited by: §IV-A.
  • [20] H. Oh, H. Sasaki, B. Michael, and T. Matsubara (2021) Bayesian Disturbance Injection: robust imitation learning of flexible policies. In International Conference on Robotics and Automation, pp. 8629–8635. Cited by: §VII.
  • [21] T. Osa, J. Pajarinen, G. Neumann, J. Bagnell, P. Abbeel, and J. Peters (2018) An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7 (1-2), pp. 1–179. Cited by: §I, §II-A.
  • [22] J. Peters and S. Schaal (2007) Reinforcement learning by reward-weighted regression for operational space control. In International conference on Machine learning, pp. 745–750. Cited by: §I, §II-B, §II-C.
  • [23] D. A. Pomerleau (1991) Efficient training of artificial neural networks for autonomous navigation. Neural computation 3 (1), pp. 88–97. Cited by: §I, §II-A.
  • [24] S. Ross and D. Bagnell (2010) Efficient reductions for imitation learning. In International conference on artificial intelligence and statistics, pp. 661–668. Cited by: §I, §II-A, §III-A.
  • [25] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics, pp. 627–635. Cited by: §II-A.
  • [26] SafeWork NSW (2020) CODE of practice excavation work. Note: Accessed: 2021-8-22 External Links: Link Cited by: §V-A1.
  • [27] H. Sasaki and T. Matsubara (2021) Variational policy search using sparse gaussian process priors for learning multimodal optimal actions. Neural Networks 143, pp. 291–302. Cited by: §VII.
  • [28] B. Son, C. Kim, C. Kim, and D. Lee (2020) Expert-emulating excavation trajectory planning for autonomous robotic industrial excavator. In International Conference on Intelligent Robots and Systems, pp. 2656–2662. Cited by: §V.
  • [29] V. Tangkaratt, B. Han, M. E. Khan, and M. Sugiyama (2020) Variational imitation learning with diverse-quality demonstrations. In International Conference on Machine Learning, pp. 9407–9417. Cited by: §II-B.
  • [30] Y. Wu, N. Charoenphakdee, H. Bao, V. Tangkaratt, and M. Sugiyama (2019) Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pp. 6818–6827. Cited by: §I, §II-A.
  • [31] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In International Conference on Robotics and Automation, pp. 5628–5635. Cited by: §I, §II-A.