I Introduction
Behavior Cloning (BC) [23] is widely used in robotics as an imitation learning (IL) method [31, 21] to leverage human demonstrations for learning control policies. Learning from humans is particularly desirable from the perspective of safe interactive robot control, as learned policies are based on the demonstrator's behavior and require few samples for training [21]. A common issue affecting learning via behavior cloning is limited variation in demonstration data, resulting in overly specific, poorly generalized policies that are not robust to deviations in behavior. Specifically, learned policies may be affected by error compounding (also known as covariate shift [24]), where a mismatch arises between the distributions of data used for training and testing. To robustify learning, stochastic perturbations, known as disturbance injections, are added to the demonstrator's actions, augmenting the learning space and resulting in stable policies [18]. However, a key limitation that restricts real-world control is the assumption that demonstrators are proficient in the task and can provide consistently high-quality demonstrations. Specifically, it is assumed that demonstration data is homogeneous, so that error compounding can be resolved simply by learning flexible generalizations of the demonstration data. In real-world scenarios, data is often heterogeneous and of varying quality, due to the difficulty of the task or human inexperience [12, 30, 14]. In addition, demonstrators may exhibit idiosyncratic behavior, which might not be task-optimal (e.g., unintentional drifting [6]), resulting in diverse-quality demonstrations. In this scenario, naïve application of disturbance injections does not consider the demonstration quality, and this diverse-quality data biases policy learning, leading to over-generalized policies. An example of this problem is shown in Fig. 1.
To limit the impact of poor-quality demonstrations, this paper takes inspiration from reinforcement learning [22] to learn policies by weighting the contribution of demonstrations, proposed as task achievement weighted disturbance injections (TAWDI). Specifically, maximizing the task achievement motivates the policy learning method to selectively update based on high-quality demonstrations, and discourages learning from poor ones. Importantly, utilizing weighted suboptimal trajectories, which may contain both non-optimal and optimal parts, contributes to accelerated convergence of the policy performance. This is particularly appealing, as collecting data for IL is demanding due to task complexity and limitations on demonstrator availability. Specifically, an iterative process is proposed: disturbance-injected augmented trajectories are generated to minimize the task achievement weighted covariate shift, and the policy is updated using the task achievement weighted trajectories, concentrating trajectories around the space of those with high task achievement. Evaluation results on robotic excavation tasks, in both simulation and on a scale excavator, show that the framework learns robust policies undeterred by either limited variation or diverse quality of demonstrations, and outperforms methods that explicitly account for these issues independently.
II Related Work
The proposed method combines features of robust imitation learning and reward weighting inspired by reinforcement learning. As such, both topics are discussed in the context of motivating the proposed combined approach.
II-A Robust Imitation Learning
Policy learning via imitation of demonstrator behavior, such as Behavior Cloning (BC) [23], is widely used in robot control [31, 21]. However, BC suffers from covariate shift [24], where policies trained on a limited variation of demonstrations fail to act correctly in states that diverge from the distribution observed during training. To address this problem, data augmentation methods such as dataset aggregation (DAgger [25]) or disturbance injection (DART [18]) can reduce covariate shift and robustify the policy around the augmented demonstration data.
While robustification mitigates distribution mismatch, there is no guarantee that demonstrators perform optimally. Diverse-quality demonstrations are common in real applications with human demonstrators [12, 30, 14]. Attempting to apply robust IL to remove this error results in learned policies that concentrate around the average of the entire demonstration data. This quality-agnostic approach is undesirable: while it prevents error compounding, the resulting dataset-centered policy is clearly different from the optimal one.
II-B Reward weighting
Diverse-quality demonstrations are antithetical to the assumptions of traditional robust IL methods, which assume high-quality, consistent demonstrations. In contrast, reinforcement learning (RL) does not assume the availability of a demonstrator, and policies are learned via trial-and-error exploration and exploitation of rewards (also known as the task achievement [16]) from the environment, resulting in successful autonomous learning [22, 1, 29].
On the other hand, RL typically expects immediate rewards, and is ill-suited to episodic task achievement, where only the end result matters. Such sparse task achievement cannot evaluate the random actions of RL, and learning for policy improvement therefore becomes inefficient [11]. In contrast, our method uses demonstrated actions instead of random actions, so the task is at least accomplished by the demonstrators. This enables wide applicability to other tasks where a specific indicator can evaluate the task achievement at the end of the task (e.g., pick-and-place tasks). In addition, learning a policy model via trial and error in RL requires a large number of action steps to collect data over the entire environment, and this exploration phase may be unsafe for robot control [10]. This property limits the applicability of RL in real-world scenarios.
II-C Hybridisation: IL with reward weighting
While direct application of RL is problematic, reward weighting [22] can be utilized to address the issue of diverse-quality demonstrations. Prior work has explored combining IL and RL, for example, using IL as a starting policy in the exploration phase of RL to speed up performance convergence [17, 11]. Additionally, recent studies [3, 4] tackle the issue of diverse-quality demonstrations by injecting disturbances into learned IL policies and using estimated rewards. However, utilizing an IL policy that is still being learned and has not been robustified against covariate shift can be dangerous for a real robot, and trial-and-error exploration in RL amplifies this risk. To address this fundamental problem, robustification of policy learning with DART, which requires only the safe demonstrator's policy, is significant and should be incorporated into IL-based frameworks.
III Preliminaries
III-A Behavior Cloning (BC)
The objective of IL is to learn policies that replicate a demonstrator's behavior, using demonstration trajectories consisting of sequences of states and actions $\tau = (s_1, a_1, \dots, s_T, a_T)$, where $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, and $T$ is the total number of steps of a trajectory. The trajectory distribution associated with the dynamics $p(s_{t+1} \mid s_t, a_t)$ and the parametric policy $\pi_\theta(a_t \mid s_t)$ with a parameter $\theta$ is defined as:

$p(\tau \mid \pi_\theta) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$  (1)
To replicate a demonstrator's policy $\pi_{\theta^*}$, the error between a query policy $\pi_\theta$ and the demonstrator's policy along a trajectory $\tau$ is defined as:

$J(\theta, \theta^* \mid \tau) = \sum_{t=1}^{T} \left\| \pi_\theta(s_t) - \pi_{\theta^*}(s_t) \right\|_2^2.$  (2)
To minimize the expected surrogate loss as the objective function, the parameter of a BC policy is acquired by solving:

$\theta^{\mathrm{BC}} = \operatorname{argmin}_{\theta}\, \mathbb{E}_{p(\tau \mid \pi_{\theta^*})} \left[ J(\theta, \theta^* \mid \tau) \right].$  (3)
However, error compounding caused by limited variation in demonstrations can result in policies learned by BC being unstable [24]. The difference between the trajectory distribution of the learned policy and that of the demonstrator's policy can be formalized as the covariate shift:

$\mathbb{E}_{p(\tau \mid \pi_\theta)} \left[ J(\theta, \theta^* \mid \tau) \right] - \mathbb{E}_{p(\tau \mid \pi_{\theta^*})} \left[ J(\theta, \theta^* \mid \tau) \right].$  (4)
In the context of executing the learned policy, a high covariate shift tends to lead policies into undemonstrated states, where it is hard to recover back to a desired trajectory.
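The BC objective above amounts to regression of the demonstrator's actions from demonstrated states. As a minimal sketch (not the paper's implementation: the linear policy class, learning rate, and toy data here are illustrative assumptions), a least-squares behavior-cloning fit could look like:

```python
import numpy as np

def bc_fit(states, actions, n_iters=300, lr=0.1):
    """Minimal behavior-cloning sketch: fit a linear policy
    pi_theta(s) = s @ W by gradient descent on the mean squared
    action-prediction error (a stand-in for Eq. (3))."""
    n = len(states)
    W = np.zeros((states.shape[1], actions.shape[1]))
    for _ in range(n_iters):
        pred = states @ W
        grad = states.T @ (pred - actions) / n  # gradient of the MSE loss
        W -= lr * grad
    return W

# Toy demonstrator: a = [2*s0, -s1]; BC should recover this mapping.
rng = np.random.default_rng(0)
S = rng.normal(size=(256, 2))
A = np.stack([2.0 * S[:, 0], -S[:, 1]], axis=1)
W = bc_fit(S, A)
```

Because the toy demonstrations are noiseless and linear, gradient descent converges to the demonstrator's exact mapping; with real diverse-quality data, the fit instead lands on the dataset average discussed in §II-A.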
III-B Disturbances for Augmenting Robot Trajectories (DART)
To mitigate covariate shift, DART [18] learns policies that are robust to error compounding by generating demonstrations with disturbance injection. The learning space is thereby augmented with recovery actions, allowing for greater variation in demonstration trajectories. The level of disturbance is optimized and updated iteratively during data collection over $K$ iterations (the number of policy updates), on the disturbance distribution parametrized by $\Sigma$.

Initially, the disturbance-injected policy given the disturbance parameter $\Sigma$ is expressed as $\pi_{\theta^*}(a_t \mid s_t, \Sigma)$. By substituting this policy into the trajectory distribution (Eq. (1)), the disturbance-injected trajectory distribution is defined as:

$p(\tau \mid \pi_{\theta^*}, \Sigma) = p(s_1) \prod_{t=1}^{T} \pi_{\theta^*}(a_t \mid s_t, \Sigma)\, p(s_{t+1} \mid s_t, a_t).$  (5)
As the covariate shift (Eq. (4)) cannot be computed explicitly, DART employs an upper bound of the covariate shift derived by applying Pinsker's inequality:

$\mathbb{E}_{p(\tau \mid \pi_\theta)} \left[ J(\theta, \theta^* \mid \tau) \right] - \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma)} \left[ J(\theta, \theta^* \mid \tau) \right] \leq T \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}\left( p(\tau \mid \pi_\theta) \,\|\, p(\tau \mid \pi_{\theta^*}, \Sigma) \right)},$  (6)

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. By canceling common factors in the trajectory distributions (Eq. (1) and Eq. (5)), the KL divergence can be expanded as follows:

$D_{\mathrm{KL}}\left( p(\tau \mid \pi_\theta) \,\|\, p(\tau \mid \pi_{\theta^*}, \Sigma) \right) = \mathbb{E}_{p(\tau \mid \pi_\theta)} \left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta^*}(a_t \mid s_t, \Sigma)} \right].$  (7)
Then, the disturbance parameter is optimized to minimize the upper bound of the covariate shift; dropping the terms independent of $\Sigma$ gives:

$\Sigma^{*} = \operatorname{argmin}_{\Sigma}\, \mathbb{E}_{p(\tau \mid \pi_\theta)} \left[ - \sum_{t=1}^{T} \log \pi_{\theta^*}(a_t \mid s_t, \Sigma) \right].$  (8)
The expectation over the trajectory distribution of the learned policy cannot be evaluated, since the learned policy's trajectories are not directly observed. Therefore, DART solves Eq. (8) using the demonstrator's trajectory distribution with the $k$-th updated disturbance parameter $\Sigma_k$ instead of the learned policy's trajectory distribution as:

$\Sigma_{k+1} = \operatorname{argmin}_{\Sigma}\, \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma_k)} \left[ J_{\Sigma}(\Sigma \mid \tau) \right],$  (9)

$J_{\Sigma}(\Sigma \mid \tau) = - \sum_{t=1}^{T} \log \pi_{\theta^*}\left( \pi_{\theta_k}(s_t) \mid s_t, \Sigma \right),$  (10)

where $J_{\Sigma}(\Sigma \mid \tau)$ is an objective function for disturbance optimization. From Eq. (3), the parameter of a DART policy is optimized using the trajectory distribution with disturbance as:

$\theta_{k+1} = \operatorname{argmin}_{\theta}\, \sum_{k'=1}^{k} \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma_{k'})} \left[ J(\theta, \theta^* \mid \tau) \right].$  (11)
Note that the policy is learned using all data up to the $k$-th iteration; however, the disturbance parameter is learned using only the demonstration data collected with the most recent $\Sigma_k$.
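For a Gaussian disturbance, the negative log-likelihood objective for the disturbance reduces to a sample covariance of the gap between the learned policy's actions and the recorded demonstration actions. The following is a hedged sketch under that assumption: `dart_sigma_update`, the toy demonstrator, and the noise scales are all hypothetical, not DART's released implementation.

```python
import numpy as np

def dart_sigma_update(policy, states, demo_actions):
    """Sketch of the Gaussian disturbance update: the maximizing
    covariance is the sample covariance of the gap between the
    current learned policy's action and the recorded demonstration
    action, pooled over all steps."""
    diffs = policy(states) - demo_actions  # (N*T, action_dim) action gaps
    return diffs.T @ diffs / len(diffs)    # MLE covariance estimate

# Toy check: a policy matching the demonstrator's mean behavior
# recovers the covariance of the injected noise.
rng = np.random.default_rng(1)
true_cov = np.diag([0.04, 0.01])
S = rng.normal(size=(4000, 3))
demo_policy = lambda s: s[:, :2]  # hypothetical demonstrator mean action
noise = rng.multivariate_normal(np.zeros(2), true_cov, size=4000)
demo_actions = demo_policy(S) + noise  # disturbance-injected demonstrations
sigma = dart_sigma_update(demo_policy, S, demo_actions)
```

The toy check illustrates the fixed point of the iteration: once the learned policy matches the demonstrator, the estimated disturbance settles at the injected noise level rather than growing.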
A key limitation of DART is that Eq. (6) assumes the demonstrator's policy with disturbance injection performs optimally, inducing the distribution of learned trajectories closer to the demonstrations. However, if the demonstration data is of diverse quality, naïve application of DART (i.e., optimizing the disturbance parameter over the entire space of trajectories) means that the learned policy cannot be induced towards the high-reward demonstrations, and learned DART policies fail to replicate an optimal demonstrator's behavior.
IV Proposed Method
In this section, a novel imitation learning framework is introduced, task achievement weighted disturbance injections (TAWDI), that learns a robust optimal policy by exploiting task achievement from diverse-quality demonstration data. In contrast to naïve DART, this method generates disturbances controlled by the task achievement, thereby inducing the learned policy towards the high task achievement demonstrations. The proposed method is outlined in Alg. 1.
IV-A TAWDI
Initially, a binary optimality variable $o \in \{0, 1\}$ indicating the maximum possible task achievement is introduced. Maximizing the task achievement is equivalent to maximizing the likelihood of $o = 1$ (optimal) [19], and this enables learning optimal policies from diverse-quality demonstrations as a probabilistic operation. The conditional probability of receiving maximum task achievement given a cost function is defined as:

$p(o = 1 \mid \tau) = \frac{1}{Z} \exp\left( -C(\tau) \right),$  (12)

where $C(\tau)$ is any cost function that calculates task achievement and $Z$ is a normalizing constant. By incorporating this probability into Eq. (5), the task achievement weighted trajectory distribution collected with disturbance injection is defined as:

$p(\tau \mid \pi_{\theta^*}, \Sigma, o) \propto p(o = 1 \mid \tau)\, p(\tau \mid \pi_{\theta^*}, \Sigma).$  (13)
By substituting this into the trajectory distribution of Eq. (9), the update scheme of the disturbance parameter considering the task achievement is given as:

$\Sigma_{k+1} = \operatorname{argmin}_{\Sigma}\, \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma_k, o)} \left[ J_{\Sigma}(\Sigma \mid \tau) \right].$  (14)
From Eq. (11), the update scheme of the policy parameter considering the task achievement is given as:

$\theta_{k+1} = \operatorname{argmin}_{\theta}\, \sum_{k'=1}^{k} \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma_{k'}, o)} \left[ J(\theta, \theta^* \mid \tau) \right].$  (15)
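In practice, the task achievement weighting can be sketched as a self-normalized softmax over per-trajectory costs, so that the normalizing constant cancels. This is an illustrative assumption of how Eq. (12) could be computed; the function name and the temperature `beta` are hypothetical, not part of the paper:

```python
import numpy as np

def achievement_weights(costs, beta=1.0):
    """Sketch of Eq. (12): convert per-trajectory costs C(tau) into
    self-normalized weights proportional to exp(-beta * C(tau)).
    The normalizing constant Z is absorbed by the final division."""
    logits = -beta * np.asarray(costs, dtype=float)
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Lower cost (higher task achievement) -> larger weight.
w = achievement_weights([0.5, 2.0, 4.0])
```

The weights sum to one, so a trajectory's influence on the updates in Eqs. (14) and (15) is relative to the achievement of the other trajectories in the batch.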
IV-B TAWDI with deep neural network policy model
As an implementation of TAWDI, a deterministic deep neural network model $\pi_\theta(s_t)$ is employed as the policy, and the disturbance distribution follows a Gaussian noise $\mathcal{N}(0, \Sigma)$ parametrized by the covariance $\Sigma$. By substituting the policy and disturbance parameter into the objective function (Eq. (10)) of the disturbance update (Eq. (14)), we obtain the disturbance optimization as:

$\Sigma_{k+1} = \operatorname{argmin}_{\Sigma}\, \mathbb{E}_{p(\tau \mid \pi_{\theta^*}, \Sigma_k, o)} \left[ - \sum_{t=1}^{T} \log \mathcal{N}\left( \pi_{\theta_k}(s_t) \mid a_t, \Sigma \right) \right].$  (16)

By applying the Monte Carlo method to the expectation over the trajectory distribution, the solution is approximated as:

$\Sigma_{k+1} \approx \sum_{n=1}^{N} w_n \frac{1}{T} \sum_{t=1}^{T} \left( \pi_{\theta_k}(s_t^n) - a_t^n \right) \left( \pi_{\theta_k}(s_t^n) - a_t^n \right)^{\top}, \quad w_n = \frac{p(o = 1 \mid \tau^n)}{\sum_{m=1}^{N} p(o = 1 \mid \tau^m)},$  (17)

where $N$ is the total number of episodes (trajectories). The policy parameter optimization of Eq. (15) is solved by applying gradient descent methods.
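The weighted Monte Carlo estimate can be sketched as per-trajectory action-gap covariances pooled with the achievement weights. The function below is an illustrative assumption, not the paper's implementation; the zero policy and noise scales in the toy check are arbitrary choices:

```python
import numpy as np

def tawdi_sigma_update(policy, trajectories, weights):
    """Sketch of the Monte Carlo disturbance estimate: a
    task-achievement-weighted sample covariance of the gap between
    the learned policy's action and the demonstrated action.
    `trajectories` is a list of (states, actions) arrays and
    `weights` the normalized per-trajectory achievement weights."""
    da = trajectories[0][1].shape[1]
    sigma = np.zeros((da, da))
    for w, (S, A) in zip(weights, trajectories):
        d = policy(S) - A                # per-step action gap
        sigma += w * (d.T @ d) / len(d)  # weighted per-trajectory covariance
    return sigma

# Toy check: weighting toward the low-noise (high-achievement)
# trajectory shrinks the estimated disturbance level.
rng = np.random.default_rng(2)
policy = lambda s: np.zeros((len(s), 1))  # hypothetical learned policy
good = (rng.normal(size=(500, 1)), rng.normal(scale=0.1, size=(500, 1)))
bad = (rng.normal(size=(500, 1)), rng.normal(scale=1.0, size=(500, 1)))
sig_uniform = tawdi_sigma_update(policy, [good, bad], [0.5, 0.5])
sig_weighted = tawdi_sigma_update(policy, [good, bad], [0.9, 0.1])
```

Down-weighting the poor trajectory keeps the disturbance estimate small, which mirrors how the weighting prevents the runaway disturbance growth reported for naïve DART in §V.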
V Evaluation
To investigate the effectiveness of learning robust policies given limited variation and diverse-quality demonstration data, an autonomous excavation task is presented in both simulation and on a scale robotic excavator. Specifically, a soil stripping excavation task is performed, where an area is cleared of soil by three consecutive scooping motions, as shown in Fig. 2(a).
Excavation is a dangerous and strenuous activity for operators, and robotic automation is actively researched [7], with previous studies [2, 13] investigating automation via modeling of excavator kinematics, or learning data-driven policies using deep learning [9, 28, 8]. RL is unsuitable for this task, due to the danger of exploration, and previous studies have instead investigated imitating human behavior for automation [9, 28]. This paper addresses the challenging task of learning from diverse-quality demonstrations, which is a natural consequence of the difficulty of excavation.

V-A Simulation experiment
To evaluate the proposed TAWDI method, comparisons are made against the baselines BC and DART, and TAWBC (task achievement weighting without disturbance injections) is used as an ablation study. Generally, BC and DART assume demonstrations contain only optimal trajectories, defined via a strict threshold, and variants trained only on such optimal trajectories are denoted accordingly. However, since choosing this threshold a priori can be challenging, variants of BC and DART that utilize both optimal and suboptimal trajectories are also compared. The simulation environment is developed on the Vortex Studio simulator [5], which enables real-time simulation of excavation with soil dynamics.
V-A1 Task Setting
The diverse quality of the demonstration data is defined according to domain knowledge [26]: deep excavation of the ground should be avoided, as it may cause trench or excavator instability; shallow excavation (Fig. 2(b)) is considered suboptimal, as it necessitates additional work and time to clear the soil; excavation at the appropriate depth is considered optimal (Fig. 2(c)). A custom controller is designed that automatically performs soil stripping, either optimally or suboptimally. The controller automates the excavator to emulate an excavation trajectory following the idea of previous work [15], where the excavation motion is divided into several simple trajectories passing through a set of predefined points (known as support points). To make the demonstrations more diverse, uniform noise is added to the support points; its magnitude is chosen as a reasonable divergence, given the large spatial scale of the excavator and workspace. Uniform noise is also added to the initial robot position as task uncertainty. The state is described by 33 dimensions: the excavator's state, consisting of the joint angles and angular velocities of four joints, and the soil's shape information, seen by an overhead depth camera as a vectorized grid of the excavation space, used to determine the next excavation point by considering the soil shape. The action is the velocity of each joint. Since it is difficult to measure the exact depth of the excavated soil in a real situation, this task defines the task achievement ($C(\tau)$ in Eq. (12)) as the total amount of moved soil, which can easily be evaluated at the end of the task and is therefore suitable to our method, as discussed in §II-B. Following the DART architecture [18] with the same parameterization, a four-layer deep neural network, consisting of an input layer, two 64-unit hidden layers, and an output layer, is used to train the policies. In the demonstration phase, demonstration data is collected over several iterations of episodes (as in DART [18]), and the data is used to update the policies and disturbances. Each iteration contains an optimal and a suboptimal trajectory. Variations in demonstrations may induce performance disparity in learned policies, so the demonstration phase is repeated twice to validate robustness. In the testing phase, each of the policies learned in each iteration is evaluated three times. One episode takes about 600 steps to complete the task.

V-A2 Results
The demonstration trajectories collected in the last iteration are shown in Fig. 3. DART trained on all demonstrations exhibits dangerous oscillation and undesired drift due to the diverse-quality demonstrations (Fig. 3), causing the disturbance level to continuously grow over training iterations, as seen in Fig. 4. In contrast, DART trained only on optimal trajectories does not experience this phenomenon. In comparison, TAWDI generates demonstration trajectories that are more stable and do not drift (Fig. 3). TAWDI prevents excessive disturbance level updates through the weighting of trajectories, as shown in Fig. 4, where the disturbance level converges.
After policy learning, the learned policies are evaluated, as seen in Fig. 5, with performance shown in Fig. 6. The proposed method, TAWDI, which applies both disturbance injection and task achievement weighting, performs the best and achieves a performance close to the optimal. In contrast, BC performs poorly, as it suffers from both error compounding and the diverse quality of demonstration data, resulting in undesired behavior such as excavating at improper positions. DART performs better than BC, as it explicitly accounts for error compounding through robustification using disturbance injection. However, it is similarly unable to deal with the problem of diverse-quality demonstrations, resulting in suboptimal policy learning. TAWBC is robust against diverse-quality demonstrations through task achievement weighting, but is less stable than TAWDI due to error compounding.
In addition, there are clear benefits to utilizing all demonstrations (optimal and suboptimal), as seen in Fig. 7. BC fails to reach optimal performance even when using only optimal demonstrations, due to the aforementioned inability to deal with error compounding (similarly seen with TAWBC). DART overcomes this and reaches optimal performance, however at the cost of a slow convergence rate due to the limited number of optimal samples. In comparison, TAWDI, which additionally utilizes suboptimal trajectories, reaches this optimal level of performance much faster. As such, the proposed task achievement weighting method is not only more generalizable, as there is no need to define an optimality threshold, but is also more suitable to data-limited environments.
V-B Real robot experiment
In this section, all methods utilize both optimal and suboptimal demonstrations, including the baselines, BC and DART, and the proposed method, TAWDI. A scale robotic excavator is used for this experiment, as shown in Fig. 8.
V-B1 Task setting
This experiment employs human demonstrators. To verify the effectiveness of the proposed method, the demonstrators are instructed to collect an optimal and a suboptimal trajectory in each episode, to keep the quality of the demonstrations balanced. The additional variation in diverse-quality demonstrations is induced by the uncertainty of the demonstrators' operation, instead of the uniform noise used in the simulation. The demonstrators manipulate the excavator using twin joystick controllers, where joints are controlled independently.
State, action, and task achievement are consistent with §V-A. A motion capture system (OptiTrack Flex3) detects markers on each arm of the excavator and estimates the joint angles and angular velocities simultaneously. A depth camera (Intel RealSense D455) attached overhead captures the soil depth, and a mass sensor measures the total soil mass. Plastic beads are used instead of real soil, corresponding to soil dynamics with low shear strength and high flow velocity.
In the demonstration phase, demonstration data is collected over several iterations of episodes (one optimal and one suboptimal per iteration) by two subjects. The policies are learned with the same network as in §V-A. In the testing phase, the learned policies from the final iteration are evaluated three times. One episode takes 200 steps to complete the task.
V-B2 Results
The performance of the test trajectories for the two subjects is shown in Fig. 9, where it is seen that, as expected, BC performs poorly due to error compounding and the diverse-quality demonstrations, which also deteriorate the performance of DART. In contrast, TAWDI obtains the same performance as the optimal trajectories through disturbance injection and task achievement weighting, similar to the simulation experiments. Such results are obtained for both subjects, and significant differences (t-test) were observed between the proposed method and the baselines.
VI Discussion
In this section, key questions arising from the experimental results are discussed.
(1) What effect does combining disturbance injections with task achievement weighting have on policy learning? When demonstrations contain suboptimal trajectories, DART is unstable: the disturbance update grows continuously (Fig. 4) in an attempt to apply disturbances that minimize the difference between the optimal and suboptimal trajectories. However, this large disturbance deteriorates the performance of the demonstration and, consequently, the policy's performance during testing. Under the assumption of diverse-quality demonstrations, as is common in real-world problems, combining disturbance injection with task achievement weighting overcomes these issues by focusing learning mainly on high task achievement demonstrations.
(2) What is the advantage of utilizing weighted suboptimal trajectories instead of simply eliminating them? Experimental results show that the use of suboptimal trajectories can accelerate the convergence of the policy performance. In real-world scenarios where the data collection cost is high, due to the complexity of the operation and the use of long-term tasks, the concept of weighting and utilizing a small number of demonstration trajectories, without removing any, is significant. This concept is applicable to various real-world robotics problems like excavation, which demand a high cost in human operation and the use of real robots.
VII Conclusion
This paper presents a novel imitation learning framework for addressing real-world robotics problems that suffer from the dual problem of limited variation and diverse quality of demonstrations. While previous studies have investigated these problems independently, the proposed method consistently outperforms methods that explicitly address them independently, in both simulation and real robot experiments. As future work, the scalability of this study can be improved by making the policy multimodal [27, 20], for further applicability to more complex tasks.
References
[1] (2017) Deep reinforcement learning: a brief survey. Signal Processing Magazine 34 (6), pp. 26–38.
[2] (1998) The development, control and operation of an autonomous robotic excavator. Journal of Intelligent and Robotic Systems 21 (1), pp. 73–97.
[3] (2020) Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, pp. 330–359.
[4] (2020) Learning from suboptimal demonstration via self-supervised reward regression. In Conference on Robot Learning.
[5] (2019) Vortex Studio. Accessed: 2021-8-22.
[6] (2008) Learning for control from multiple demonstrations. In International Conference on Machine Learning, pp. 144–151.
[7] (2016) Key challenges in automation of earth-moving machines. Automation in Construction 68, pp. 212–222.
[8] (2020) Towards RL-based hydraulic excavator automation. In International Conference on Intelligent Robots and Systems, pp. 2692–2697.
[9] (2017) Imitation-based control of automated ore excavator: improvement of autonomous excavation database quality using clustering and association analysis processes. Advanced Robotics 31 (11), pp. 595–606.
[10] (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
[11] (2020) Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments. In International Conference on Autonomous Agents and Multi-Agent Systems, pp. 465–473.
[12] (2011) Donut as I do: learning from failed demonstrations. In International Conference on Robotics and Automation, pp. 3804–3809.
[13] (2002) Robotic excavation in construction automation. Robotics & Automation Magazine 9 (1), pp. 20–28.
[14] (2020) Learning soft robotic assembly strategies from successful and failed demonstrations. In International Conference on Intelligent Robots and Systems, pp. 8309–8315.
[15] (2019) Autonomous free-form trenching using a walking excavator. Robotics and Automation Letters 4 (4), pp. 3208–3215.
[16] (2020) Integration of imitation learning using GAIL and reinforcement learning using task-achievement rewards via probabilistic graphical model. Advanced Robotics 34 (16), pp. 1055–1067.
[17] (2010) Imitation and reinforcement learning. Robotics & Automation Magazine 17 (2), pp. 55–62.
[18] (2017) DART: noise injection for robust imitation learning. In Conference on Robot Learning, pp. 143–156.
[19] (2016) Sparse latent space policy search. In AAAI Conference on Artificial Intelligence, pp. 1911–1918.
[20] (2021) Bayesian disturbance injection: robust imitation learning of flexible policies. In International Conference on Robotics and Automation, pp. 8629–8635.
[21] (2018) An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7 (1–2), pp. 1–179.
[22] (2007) Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning, pp. 745–750.
[23] (1991) Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3 (1), pp. 88–97.
[24] (2010) Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, pp. 661–668.
[25] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635.
[26] (2020) Code of practice: excavation work. Accessed: 2021-8-22.
[27] (2021) Variational policy search using sparse Gaussian process priors for learning multimodal optimal actions. Neural Networks 143, pp. 291–302.
[28] (2020) Expert-emulating excavation trajectory planning for autonomous robotic industrial excavator. In International Conference on Intelligent Robots and Systems, pp. 2656–2662.
[29] (2020) Variational imitation learning with diverse-quality demonstrations. In International Conference on Machine Learning, pp. 9407–9417.
[30] (2019) Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pp. 6818–6827.
[31] (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In International Conference on Robotics and Automation, pp. 5628–5635.