Adaptive t-Momentum-based Optimization for Unknown Ratio of Outliers in Amateur Data in Imitation Learning

08/02/2021, by Wendyam Eric Lionel Ilboudo, et al.

Behavioral cloning (BC) bears a high potential for safe and direct transfer of human skills to robots. However, demonstrations performed by human operators often contain noise or imperfect behaviors that can affect the efficiency of the imitator if left unchecked. In order to allow the imitators to effectively learn from imperfect demonstrations, we propose to employ the robust t-momentum optimization algorithm. This algorithm builds on the Student's t-distribution in order to deal with heavy-tailed data and reduce the effect of outlying observations. We extend the t-momentum algorithm to allow for an adaptive and automatic robustness and show empirically how the algorithm can be used to produce robust BC imitators against datasets with unknown heaviness. Indeed, the imitators trained with the t-momentum-based Adam optimizers displayed robustness to imperfect demonstrations on two different manipulation tasks with different robots and revealed the capability to take advantage of the additional data while reducing the adverse effect of non-optimal behaviors.

I Introduction

The ultimate goal of the machine learning framework has always been to generate algorithms that perform at least as well as a human being, and robotics in particular aims at building mechanical machines that can mimic human or animal behaviors. With this objective in mind, the imitation learning (IL) approach has received increasing attention, due to its ability to infer the hidden intention (policy) of an expert, which can be a human operator, through the observation of his/her demonstrations. In the literature, two types of IL are predominant: behavioral cloning (BC) [4, 19], which reproduces the sequences of the expert's actions based on the environment state, and inverse reinforcement learning, which maximizes a reward function inferred from the expert's demonstrations [14, 9]. These algorithms have been shown to yield near-optimal policies when trained on high-quality demonstrations performed by experts, highlighting their potential for the production of advanced task-oriented robots that can naturally learn from demonstrations [7, 20].

Unfortunately, these studies, in both their theoretical and applied aspects, have assumed the presence of experts who always perform optimally, and of sophisticated operating interfaces that adequately reflect the experts' intentions without introducing mistakes. In practice, however, the demonstrators may lack expertise, either at the task itself or because of a non-intuitive operating interface, which means that they may need to be trained to become familiar with the setting before any demonstration can be recorded. This wastes both time and data, and constitutes an impractical constraint for crowdsourced data collection [13]. Furthermore, even after being trained, a human operator may be subject to distractions due to limited attention, tiredness or boredom, making the assumption of optimal and mistake-free demonstrations uncertain. For all these reasons, real-world demonstrations are highly likely to contain unintentional noise and outliers, which makes it difficult for IL agents to extract an optimal policy. Therefore, in general, such demonstrations containing wrong actions would implicitly be excluded from the dataset used to train the agent, even when some parts of the demonstration may be informative. Here, we define such a partially optimal demonstration as an amateur demonstration.

To tackle this issue and allow imitation from amateur demonstrations, several methods have been proposed. For inverse reinforcement learning, we can cite the works [5] and [21], where additional labels provided by the experts are employed to discriminate amateur demonstrations, the work [18], which assumes the amateur actions and states to be Gaussian-distributed noise, and the recent work [17], where a pseudo-labeling technique is used to estimate the data density of the non-expert demonstrations and a classification risk is then optimized on the whole demonstration dataset using a symmetric loss function.

In this study, we focus on neural-network-based BC and, observing that amateur demonstrations include noise and outliers, employ the robust t-momentum [10] optimization algorithm to train the imitator. With the t-momentum strategy, the adverse effect of noise and outliers can be implicitly removed, according to its robustness hyperparameter, during the stochastic gradient descent (SGD) updates. However, in the original version of the t-momentum, the robustness hyperparameter must be specified before training and is therefore incapable of adapting automatically to the unknown actual ratio of noise and outliers inside amateur data. To address this issue, we extend the t-momentum with a method that automatically adjusts the algorithm's robustness, in order to deal with the uncertainty on the ratio of wrong demonstration data in real-world robotics applications.

II Preliminaries

II-A Behavioral cloning

Behavioral cloning (BC) [4] is an imitation learning technique which uses a supervised learning approach to capture and reproduce the behavior of a demonstrator, usually referred to as the expert. As the expert performs the task, his/her actions are recorded along with the states that gave rise to them. The sequence of these state-action records, called a behavior trace or trajectory, is then used as the supervised signal for the imitator, whose goal is to uncover a set of rules that reproduce the observed behavior. BC is powerful in the sense that the imitator is capable of immediately imitating the demonstrator without having to interact with the environment, making it particularly attractive for robotics applications and for the safe and direct transfer of humans' sub-cognitive skills or behaviors to machines.

Formally, BC is concerned with the problem of finding a good imitation policy from a set of state-action demonstration trajectories $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, where each $\tau_i = \{(s_1, a_1), \ldots, (s_{T_i}, a_{T_i})\}$ is a trajectory of state-action pairs. This set of state-action pairs is used to seek the parameters of an imitation policy that best fits the set. This decision problem is usually solved by employing the maximum-likelihood estimation method. Indeed, assuming each pair $(s, a)$ in $\mathcal{D}$ is independently and identically distributed (i.i.d.) and with $\pi_\theta(a \mid s)$ defined as the imitator's policy parameterized by $\theta$, BC solves for an optimal solution $\theta^*$ such that:

$$\theta^* = \arg\max_\theta \prod_{(s,a) \in \mathcal{D}} \pi_\theta(a \mid s) \qquad (1)$$
$$\phantom{\theta^*} = \arg\max_\theta \sum_{(s,a) \in \mathcal{D}} \log \pi_\theta(a \mid s) \qquad (2)$$

With this objective, the imitator's policy $\pi_{\theta^*}$ eventually converges to the unknown policy that produced the dataset $\mathcal{D}$.
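As a concrete illustration of Eq. (2), below is a minimal PyTorch sketch of a single BC training step; the `policy` (assumed to return a `torch.distributions` object, as in the policy model of Section IV-A2), the `optimizer` and the mini-batch tensors are placeholders, and the function name is purely illustrative.

```python
import torch

def bc_training_step(policy, optimizer, states, actions):
    # Negative log-likelihood of the demonstrated actions under the imitator
    # policy: a mini-batch Monte-Carlo estimate of the objective in Eq. (2).
    dist = policy(states)                  # pi_theta(. | s) for the mini-batch
    loss = -dist.log_prob(actions).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```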

II-B Robust optimization with the t-momentum

II-B1 Student's t-based momentum

Under the deep learning framework, complicated functions such as the policies $\pi_\theta$ can be approximated using neural networks, where the parameters $\theta$ are given by the weights and biases of the networks. With neural networks, the optimization problem depicted in Eq. (2) is solved by first-order gradient-based optimization methods. Most of the recent and popular first-order gradient-based methods build upon the momentum strategy [11], where an average of the past gradients is employed in the stochastic gradient descent updates.

At the heart of the momentum methods' success lies the Exponential Moving Average (EMA), which allows recent gradients to have a greater impact on the average due to higher weights, while slowly forgetting observations that are far in the past and possess exponentially smaller weights. Let $\ell_t(\theta_t)$ be the objective function evaluated on a random sample from the training dataset, e.g. a sub-sampled mini-batch of state-action pairs in BC, with $\theta_t$ the parameters (weights and biases) at time $t$. With $g_t = \nabla_\theta \ell_t(\theta_t)$ the stochastic gradient of $\ell_t$ with respect to the parameters $\theta_t$, the regular EMA-based first-order momentum is defined as:

$$m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t \qquad (3)$$

where $\beta \in [0, 1)$, the exponential decay coefficient, is a fixed value that controls how fast past gradients $g_i$, $i < t$, are forgotten.

However, EMA-based momentum methods lack robustness to aberrant values because every new observation is given the same weight $1 - \beta$. This led to the proposition of the t-EMA, a new EMA algorithm derived from the Student's t-distribution likelihood estimator, and of its corresponding momentum, the t-momentum [10]. The particularity of the t-momentum lies in the fact that the decay coefficient is no longer fixed, but adaptive, and depends on the squared Mahalanobis distance $D_t$:

$$m_t = \frac{W_{t-1}}{W_{t-1} + w_t}\, m_{t-1} + \frac{w_t}{W_{t-1} + w_t}\, g_t \qquad (4)$$

where

$$w_t = \frac{\nu + d}{\nu + D_t} \qquad (5)$$
$$W_t = \beta \left( W_{t-1} + w_t \right) \qquad (6)$$
$$D_t = \sum_{j=1}^{d} \frac{\left( g_t^{(j)} - m_{t-1}^{(j)} \right)^2}{v_{t-1}^{(j)}} \qquad (7)$$

where $\nu$ is the Student's t-distribution degrees-of-freedom parameter which controls the robustness, $d$ is the dimension of the gradient, $(j)$ in the superscript refers to the $j$-th component of the vector, and $v_{t-1}^{(j)}$ is an exponential moving variance estimate at step $t-1$, which is computed by default in recent methods. When integrated into momentum-based optimization methods such as Adam (Adaptive moment estimation) [11], the t-momentum has been shown to improve the robustness of the underlying optimizer and therefore increase the performance of the learning process on heavy-tailed datasets.

II-B2 t-EMA with modified weight decay

The decay strategy of the accumulated weights in Eq. (6) implies that, at time step $t$, the past value $W_{t-1}$ is not decayed with respect to the new value $w_t$, and that both have the same importance in the value of $W_t$.

In order to ensure that the past value $W_{t-1}$ is decayed and has less importance than the new value $w_t$, Eq. (6) has been modified in [12] to yield instead:

$$W_t = \frac{2\beta - 1}{\beta}\, W_{t-1} + w_t \qquad (8)$$

which remains consistent with the maximum-likelihood derivation of the t-momentum algorithm as described in [10], and where the change of the decay factor's value, from $\beta$ in Eq. (6) to $\frac{2\beta - 1}{\beta}$, is set by the requirement that the t-EMA reverts to the EMA in the limit $\nu \to \infty$. With this modification, the value of $W_t$ at time step $t$ is given by (assuming $W_0 = 0$):

$$W_t = \sum_{k=1}^{t} \left( \frac{2\beta - 1}{\beta} \right)^{t-k} w_k \qquad (9)$$

where the value of each past weight $w_k$, $k < t$, is effectively reduced with respect to $w_t$.

In this study, this modified version of the t-EMA is the one we employ for the t-momentum.
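To make the reversion requirement concrete, here is a short check, under the notation above, of how the decay factor in Eq. (8) follows from demanding that the t-EMA reduce to the plain EMA. In the limit $\nu \to \infty$, every weight satisfies $w_t \to 1$, and Eq. (4) coincides with the EMA of Eq. (3) only if $\frac{W_{t-1}}{W_{t-1} + 1} = \beta$, i.e. if the weights' sum sits at the fixed point

$$W^* = \frac{\beta}{1 - \beta}.$$

Requiring $W^*$ to be a fixed point of the modified update $W_t = c\, W_{t-1} + w_t$ with $w_t = 1$ then fixes the decay factor used in Eq. (8):

$$c\, W^* + 1 = W^* \;\Rightarrow\; c = 1 - \frac{1}{W^*} = \frac{2\beta - 1}{\beta}.$$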

III Robust Behavioral Cloning With Adaptive t-Momentum Optimization

III-A The imperfect demonstrations issue in behavioral cloning

Because BC relies solely on the provided demonstrations to find the imitation policy through a supervised learning approach, it requires all trajectories in the dataset to be optimal (i.e. perfect demonstrations) or near-optimal. Due to this fact, human operators, when given a control interface and tasked with performing demonstrations, must first be trained to become highly efficient at using the interface before they can start demonstrating for the imitator; and even after having been trained, distractions, mistakes and limited attention spans make it difficult, and nearly impossible, for a human to always follow an optimal policy. This leads to trajectories where some state-action pairs are not optimal, causing the imitator to be biased away from the optimal policy.

We again refer to these imperfect demonstrations as amateur demonstrations, so that the dataset is generated from a mixture of the expert policy and the amateur policy:

$$\rho(s, a) = (1 - \alpha)\, \rho_E(s, a) + \alpha\, \rho_A(s, a) \qquad (10)$$

where $\rho_E$ and $\rho_A$ are respectively the state-action densities of the expert policy $\pi_E$ and the amateur policy $\pi_A$, and $\alpha$ represents the proportion of amateur state-action pairs in the dataset, assumed to lie in $[0, 1)$.

In the original setting of behavioral cloning, all of the amateur demonstrations are simply discarded so that the policy that produced the dataset is only the expert's, i.e. $\alpha = 0$; however, this results in a loss of valuable data, since not all of the amateur pairs are necessarily wrong. Because BC typically requires a lot of data in order to produce an optimal policy [16], a strategy that takes advantage of the good parts of the amateur demonstrations (state-action pairs that are similar to the expert's ones), while ignoring wrong or misleading actions in the imperfect demonstrations, is desirable. In this study, we propose to treat the amateur's imperfect demonstrations as outliers, and we show empirically how the t-momentum, a robust optimization algorithm, extended to allow adaptive robustness, can produce robust imitators in the face of the resultant heavy-tailed dataset.

III-B Adaptive t-momentum for automatic robustness

The robustness of the Student's t-distribution, and therefore of the t-momentum derived from it, is controlled by the degrees-of-freedom parameter $\nu$. Indeed, as can be seen in Eq. (5), if $\nu \to \infty$, then $w_t \to 1$ for all time steps $t$ and every gradient is given the same weight independently of the value of $D_t$, leading back to the non-robust EMA derived from the Gaussian distribution. In contrast, if $\nu \to 0$, then each value is weighted by $d / D_t$, leading to a strong sensitivity to the squared Mahalanobis distance and therefore to a very strong filtering effect. In the formulation of the original t-momentum paper [10], the degrees of freedom $\nu$ is treated as a hyperparameter whose value must be set before starting the optimization process, meaning that the robustness of the t-momentum is fixed throughout the learning process.
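As a rough numerical illustration of this trade-off, using the weight of Eq. (5) with illustrative values (gradient dimension $d = 100$ and an outlying gradient whose squared Mahalanobis distance is ten times the dimension, $D_t = 10\, d = 1000$):

$$\nu \to \infty:\; w_t \to 1 \;\;\text{(no filtering)}, \qquad \nu = d:\; w_t = \frac{100 + 100}{100 + 1000} \approx 0.18, \qquad \nu \to 0:\; w_t \to \frac{d}{D_t} = 0.1 \;\;\text{(strongest filtering)}.$$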

In practice, the proportion $\alpha$ introduced previously is unknown (due to the difficulty of keeping track of all imperfect pairs). Although one may analyze the dataset to infer its heaviness before starting to train the imitator, in this section a method for automatically adjusting the robustness of the t-momentum, based on the amount of outlying gradients encountered during training, is introduced.

This mechanism exploits the batch approximation algorithm developed in [1], in particular its incremental version, which is an efficient set of formulas capable of iteratively estimating the degrees of freedom of a given set of data points. Thanks to its incremental nature, the data do not need to be saved in memory and are instead treated sequentially as they are observed. This feature is of prime importance for optimization methods, where the gradients are observed one at a time and can be arbitrarily many, rendering it difficult to store every one of them in memory. In the following, we refer to this algorithm as Aeschliman's algorithm.

III-B1 Direct incremental degrees of freedom estimation algorithm

In order to compute an estimate of the degrees of freedom $\nu$, Aeschliman's direct incremental algorithm proceeds as follows: at each step $t$,

  1. Compute a robust estimate of the mean $\mu_t$, such as the median.

  2. Compute the logarithm of the squared Euclidean norm of the difference between the newly observed data point $x_t$ and the robust mean: $z_t = \log \| x_t - \mu_t \|^2$.

  3. Update the arithmetic variance and mean of the variable $z$:

    $$\sigma_{z,t}^2 = \frac{t-1}{t}\, \sigma_{z,t-1}^2 + \frac{t-1}{t^2}\, (z_t - \bar{z}_{t-1})^2 \qquad (11)$$
    $$\bar{z}_t = \bar{z}_{t-1} + \frac{1}{t}\, (z_t - \bar{z}_{t-1}) \qquad (12)$$

  4. Compute a new estimate of the degrees of freedom:

    $$\hat{\nu}_t = 2\, \psi_1^{-1}\!\left( \sigma_{z,t}^2 - \psi_1\!\left( \tfrac{d}{2} \right) \right) \qquad (13)$$

    where $\psi_1$ is the trigamma function and $d$ is the dimension of the data points.
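For illustration, below is a minimal NumPy/SciPy sketch of this incremental estimator, under the reconstruction of Eqs. (11)-(13): the robust mean `mu` is assumed to be maintained separately (e.g. by an online median), Eq. (13) is solved numerically through the relation $\mathrm{Var}(z) = \psi_1(\nu/2) + \psi_1(d/2)$, and all names are illustrative rather than taken from Aeschliman et al.'s implementation.

```python
import numpy as np
from scipy.special import polygamma
from scipy.optimize import brentq

def incremental_dof(samples, mu):
    """Incrementally estimate the Student's t degrees of freedom of a data stream."""
    z_mean, z_var, nu = 0.0, 0.0, np.inf
    d = np.size(mu)
    for t, x in enumerate(samples, start=1):
        z = np.log(np.sum((x - mu) ** 2) + 1e-12)        # step 2: log squared norm
        # step 3: incremental (arithmetic) variance and mean of z, Eqs. (11)-(12)
        z_var = (t - 1) / t * z_var + (t - 1) / t ** 2 * (z - z_mean) ** 2
        z_mean = z_mean + (z - z_mean) / t
        # step 4: solve psi_1(nu / 2) = Var(z) - psi_1(d / 2) for nu, Eq. (13)
        b = z_var - polygamma(1, d / 2.0)
        if b > 1e-6:
            nu = 2.0 * brentq(lambda s: polygamma(1, s) - b, 1e-8, 1e8)
    return nu
```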

III-B2 t-momentum with adaptive degrees of freedom

In order to integrate this algorithm into the t-momentum, a few changes are made to Aeschliman's algorithm, mainly to reduce the computational cost as much as possible. Namely,

  • The t-momentum is directly used as the estimate of the robust mean, instead of computing the median as Aeschliman et al. did in their paper. Since the t-momentum is considered to be a robust mean estimate, this modification remains consistent with the original algorithm and it avoids the burden of estimating the gradient median, removing the need for a new variable.

  • Secondly, the squared norm in the computation of the variable $z$ is replaced by the squared Mahalanobis distance from Eq. (7), i.e. $z_t = \log D_t$. This modification remains consistent with the original algorithm and can be understood as replacing the data point by a standardized alternative which has mean $0$ and variance equal to $1$.

  • Finally, the arithmetic estimates of the variance and mean of the variable $z$ are replaced by exponential moving averages, i.e. Eqs. (11) and (12) become:

    $$\sigma_{z,t}^2 = \beta_z\, \sigma_{z,t-1}^2 + (1 - \beta_z)\, (z_t - \bar{z}_{t-1})^2 \qquad (14)$$
    $$\bar{z}_t = \beta_z\, \bar{z}_{t-1} + (1 - \beta_z)\, z_t \qquad (15)$$

    with $\beta_z \in [0, 1)$ the corresponding decay parameter. This particular modification is necessary in order to take into account the fact that machine learning tasks may be non-stationary, which requires the estimated mean and variance of $z$ to adapt to the changing data distribution.

The new algorithm is named adaptive Student's t-distribution based momentum, or At-momentum for short, and its pseudo-code is given in Algorithm 1.

Input: Gradient $g_t$; previous t-momentum $m_{t-1}$; previous weights' sum $W_{t-1}$; previous variance estimate $v_{t-1}$
Input: Previous mean and variance of $z$: $\bar{z}_{t-1}$, $\sigma_{z,t-1}^2$

1: $\beta$, $\beta_z$: EMA decay parameters
2: $D_t \leftarrow \sum_{j=1}^{d} (g_t^{(j)} - m_{t-1}^{(j)})^2 / v_{t-1}^{(j)}$
3: $z_t \leftarrow \log D_t$
4: $\sigma_{z,t}^2 \leftarrow \beta_z\, \sigma_{z,t-1}^2 + (1 - \beta_z)(z_t - \bar{z}_{t-1})^2$
5: $\bar{z}_t \leftarrow \beta_z\, \bar{z}_{t-1} + (1 - \beta_z)\, z_t$
6: $b_t \leftarrow \sigma_{z,t}^2 - \psi_1(d/2)$
7: $\hat{k}_t \leftarrow 2\, \psi_1^{-1}(b_t)$
8: $\nu_t \leftarrow \hat{k}_t\, d$
9: $w_t \leftarrow (\nu_t + d) / (\nu_t + D_t)$
10: $m_t \leftarrow (W_{t-1}\, m_{t-1} + w_t\, g_t) / (W_{t-1} + w_t)$
11: $W_t \leftarrow \frac{2\beta - 1}{\beta}\, W_{t-1} + w_t$ (decay the weights' sum)

Output: $m_t$, $W_t$, $\bar{z}_t$, $\sigma_{z,t}^2$

Algorithm 1 At-momentum

Note that, for the practical implementation, the modified Aeschliman's algorithm is employed to estimate a degrees-of-freedom scale factor $\hat{k}$, and the degrees of freedom is obtained through $\nu = \hat{k}\, d$, with $d$ the gradient dimension, as suggested in the original t-momentum paper [10]. This is necessary in order to keep the updates from being overly robust, since Aeschliman's algorithm tends to produce small values for the degrees of freedom which, when compared to the dimension of the neural-network gradients, can be negligible.
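To make the procedure concrete, here is a minimal NumPy/SciPy sketch of one At-momentum step, following Algorithm 1 as laid out above; the function and variable names, the default decay values and the numerical safeguards are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.special import polygamma
from scipy.optimize import brentq

def inv_trigamma(y):
    # Numerically invert the trigamma function psi_1 (strictly decreasing on (0, inf)).
    return brentq(lambda x: polygamma(1, x) - y, 1e-8, 1e8)

def at_momentum_step(g, m, W, v, z_bar, z_var, beta=0.9, beta_z=0.999, eps=1e-8):
    """One At-momentum update for a flattened gradient g (cf. Algorithm 1)."""
    d = g.size
    # Squared Mahalanobis distance of the new gradient w.r.t. the current momentum,
    # standardized by the moving variance estimate v, cf. Eq. (7).
    D = np.sum((g - m) ** 2 / (v + eps))
    z = np.log(D + eps)
    # Exponential moving variance and mean of z, cf. Eqs. (14)-(15).
    z_var = beta_z * z_var + (1.0 - beta_z) * (z - z_bar) ** 2
    z_bar = beta_z * z_bar + (1.0 - beta_z) * z
    # Degrees-of-freedom scale factor from the modified Aeschliman estimate, then
    # nu = k_hat * d; b is clamped so that the trigamma inversion stays well defined.
    b = max(z_var - polygamma(1, d / 2.0), 1e-6)
    k_hat = 2.0 * inv_trigamma(b)
    nu = k_hat * d
    # Robust weight and t-momentum update, cf. Eqs. (4)-(5).
    w = (nu + d) / (nu + D)
    m = (W * m + w * g) / (W + w)
    # Decay the accumulated weights' sum (modified t-EMA, Eq. (8)).
    W = (2.0 * beta - 1.0) / beta * W + w
    return m, W, z_bar, z_var
```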

IV Experiments

IV-A Algorithm setup

IV-A1 Optimization algorithm's choice

In the following, we employ the t-Adam [10] optimizer, which is the Adam [11] optimizer augmented with the t-momentum. The adaptive t-momentum version is called At-Adam and, in order to investigate the effect of the decay parameter $\beta_z$ used for the mean and variance of $z$ in Eqs. (14) and (15), two values are defined:

  • one that takes the same value as the decay factor of the considered momentum (here the first-order momentum of Adam), i.e. $\beta_z = \beta_1$, and

  • a larger value, which is set to be equal to the decay factor of the Adam second moment, i.e. $\beta_z = \beta_2$.

The results of training with Adam, without the t-momentum’s robustness, are also included for reference.
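For reference, the following rough sketch shows how the At-momentum step above might replace Adam's first moment while keeping Adam's second moment and bias correction; the `at_momentum_step` function is the illustrative one from Section III-B2, and the state handling and bias correction shown here are assumptions for illustration, not the authors' t-Adam/At-Adam code.

```python
import numpy as np

def at_adam_update(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m, W, v, z_bar, z_var, t = state
    t += 1
    # Robust (adaptive) first moment via the At-momentum step sketched earlier.
    m, W, z_bar, z_var = at_momentum_step(g, m, W, v, z_bar, z_var,
                                          beta=beta1, beta_z=beta2)
    v = beta2 * v + (1.0 - beta2) * g ** 2       # Adam's second moment
    m_hat = m / (1.0 - beta1 ** t)               # bias correction, as in Adam
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, W, v, z_bar, z_var, t)
```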

IV-A2 Policy model description

For all experiments, the imitator agent's policy model is implemented as a PyTorch [15] neural network with five hidden linear layers of 100 neurons each, fitted with layer normalization [3] and the ReLU activation function. The outputs are the actions' mean and the diagonal elements of the covariance matrix of a multivariate Gaussian distribution. Different random seeds are used for each model, but all optimizers share the same set of seeds.
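A minimal PyTorch sketch of such a policy network is given below; the constructor arguments, the softplus used to keep the diagonal covariance positive and the class name are illustrative choices that are not specified in the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=100, n_layers=5):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            in_dim = hidden
        self.body = nn.Sequential(*layers)
        self.mean_head = nn.Linear(hidden, action_dim)
        self.var_head = nn.Linear(hidden, action_dim)   # diagonal covariance

    def forward(self, state):
        h = self.body(state)
        mean = self.mean_head(h)
        var = nn.functional.softplus(self.var_head(h)) + 1e-6
        return torch.distributions.MultivariateNormal(mean, torch.diag_embed(var))
```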

IV-A3 Performance measure

For all experiments, we run each of the trained models on the real robot a certain number of times (most often 5), and count the number of times the imitator is capable of solving the given task. This performance measure is then represented by the success rate:

$$\text{success rate} = \frac{\text{number of successful runs}}{\text{total number of runs}} \qquad (16)$$

IV-B Robots and interface setup

(a) Leap Motion device
(b) Qbchain Yaw-Pitch-Pitch-Pitch (YPPP)
(c) D’Claw robot
Fig. 1: Robots and interface used in the BC experiments.

IV-B1 Leap Motion hand tracking device

Leap Motion (see Fig. 1(a)) is a hand tracking device that captures the movement of the hands and fingers using optical sensors and infrared light. The field of view (FOV) of the sensors is about 150 degrees and the detection range goes roughly from 25 to 600 millimeters above the device. Each object (arm, hand or finger) detected in the FOV of the device is represented by a program class that encodes various information such as the position, velocity, direction and other characteristics of the object.

IV-B2 Qbchain robot and control interface

The qbmove [6] is a one-degree-of-freedom (1-DoF) modular actuator with a cubic shape approximately 66 millimeters wide. Its stiffness can also be controlled at the hardware level, but is fixed in the following experiments for simplicity. As can be seen in Fig. 1(b), the robotic arm employed in this section's experiments is made of 4 cubes assembled such that the first joint axis is vertical, while the three others are horizontal, allowing an up-and-down and circular motion of the end effector, which consists of a gripper.

The interface developed between the Leap Motion device and the qbmove robotic arm, which allows a human operator to control the robot, uses the palm position and grab strength of the first hand detected by the Leap Motion. The palm position is used as the target position of the robot's end effector, and an Inverse Kinematics (IK) algorithm is employed to compute the first three joints' angular positions. In the experiment, ikpy is employed: a Python inverse kinematics library that can import the kinematic chain of the robot from a URDF file and quickly approximate the IK solution with an iterative optimizer. The obtained joint position values are then sent to the qbchain to move the tip of the fixed part of the gripper. The grab strength is mapped to the last joint in order to open and close the gripper.

The schematic of the interface is depicted in Fig. 2.

Fig. 2: Qbchain-Leap Motion control interface; the outputs of the interface are the desired joint angular positions.
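The following is a rough sketch of this mapping, assuming ikpy version 3 or later (where `inverse_kinematics` accepts a 3-D target position), a hypothetical URDF file name, and a hypothetical helper `read_palm_and_grab()` standing in for the Leap Motion SDK calls; only the ikpy calls reflect an actual library API.

```python
import numpy as np
from ikpy.chain import Chain

# Kinematic chain of the arm, loaded from a (hypothetical) URDF file.
chain = Chain.from_urdf_file("qbchain.urdf")

def leap_to_joint_targets(read_palm_and_grab):
    """Map one Leap Motion reading to desired joint angles for the qbchain."""
    palm_xyz, grab_strength = read_palm_and_grab()     # hypothetical Leap helper
    # Iterative IK so that the gripper tip tracks the operator's palm position.
    ik_solution = chain.inverse_kinematics(np.asarray(palm_xyz))
    arm_joints = ik_solution[1:4]    # assumed indices of the three positioning joints
    # Grab strength in [0, 1] mapped onto the last joint (identity map for simplicity).
    return np.concatenate([arm_joints, [grab_strength]])
```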

IV-B3 D'Claw robot and control interface

D'Claw is a platform introduced by project ROBEL (RObotics BEnchmarks for Learning) [2] for studying and benchmarking dexterous manipulation. It is a nine-degrees-of-freedom (9-DoF) platform that consists of three identical fingers mounted symmetrically on a base, as shown in Fig. 1(c).

Its control interface also uses the Leap Motion device. In particular, the positions of the operator's fingers (the index, ring and thumb) are used to control the three fingers of the robot, again through the ikpy library.

IV-C Qbchain robot experiment

IV-C1 Conditions of the experimentation

A simple pick-and-drop task is defined, where the goal is to pick an object, here a soft cube, and drop it inside a box. The observation consists of a direct state measure containing the angle, the angular velocity and the torque (effort) of each of the four joints (hence, the state space dimension equals 12). The action space dimension, on the other hand, equals 4 and corresponds to the desired next angles of the joints (i.e. a position controller).

During training, Gaussian white noise scaled by a small factor is added to the states in order to augment the dataset and improve the generalization ability of the models. A small batch size is used to reduce the computational cost and to improve the ability of the gradient updates to escape from local optima.

IV-C2 Dataset description

The collected trajectories are divided into expert trajectories, which are almost perfect, and amateur trajectories, which contain hesitant or poor demonstrations. The expert trajectories are then further split into two sets: one for training and another for validation, comprised of the remaining trajectories.

IV-C3 Results

The test results on the robot are given by the success rate over all trained models and summarized in Fig. 3, where the error bars correspond to the confidence interval. This success rate is computed by running each trained model a fixed number of times and applying Eq. (16), counting the number of times the model is able to solve the task (i.e. pick the object and drop it in the box). Each episode is run with a fixed budget of steps, and a model is said to have failed if it is not able to complete the task within this number of steps.

Fig. 3: Success rates on the Qbchain robot of the models trained with both amateur and expert demonstrations.

The success rates in Fig. 3 show that, using a robust optimization method such as the t-momentum-based Adam algorithm, it is possible to efficiently train a behavioral cloning agent with datasets that contain not only expert demonstrations, but also amateur performances.

Fig. 4, which summarizes the success rate of the trained models over a fixed number of runs per model, displays the contribution of the amateur demonstrations. Indeed, we can see that, when only a small number of expert demonstration trajectories is considered, the addition of the demonstrations containing imperfect pairs increases the success rate of the models trained with the robust t-momentum optimizer. This result highlights the fact that amateur demonstrations are useful and can be used to augment the size of the training dataset, instead of being discarded as is usually done in BC.

Fig. 4: How the amateur data can be useful: Success rates of the trained models on the Qbchain robot with various amateur data proportion.

However, in Fig. 5, after removing the amateur demonstrations and setting the noise scale factor to zero, we computed the success rates by again running the trained models several times each. With this modification, we can see that, in the absence of imperfect demonstrations and without the Gaussian noise for state augmentation, the Adam optimizer performs better than t-Adam, due to the fixed high robustness of the latter.

This result allows us to display the importance of the adaptive robustness feature of At-Adam. Indeed, in the same Fig. 5, we see how the adaptive t-momentum optimizer improves the success rates of the imitators and performs even better than Adam. Hence, the adaptive robustness unarguably allows it to extract more optimal information from the expert dataset than what is allowed with non-robust methods. At-Adam, thanks to its automatic robustness adjustment, is able to find a compromise between the too-robust t-Adam, with its fixed degrees of freedom, and the non-robust Adam, which corresponds to $\nu \to \infty$, outperforming both methods. Fig. 6 shows the median of the adaptive degrees-of-freedom factor during learning. We can see that At-Adam maintains a median robustness parameter higher than the fixed value used by t-Adam.

Fig. 5: At-Adam Advantage: Success rates on the Qbchain robot of the models trained without noise and amateur data.
Fig. 6: At-Adam: median of the adaptive degrees-of-freedom factor during learning, for each parameter of the network.

IV-D D'Claw robot experiment

To further confirm the ability and limitations of robust BC with the adaptive t-momentum algorithm to adapt to different ratios of imperfect demonstrations, we conducted the following experiments using the D'Claw robot.

IV-D1 Conditions of the experimentation

In the experiments, we define the task to consist in rotating a passive DoF (the object located in the middle of the base in Fig. 1(c)) to a fixed target angle. Specifically, the object must be turned from its initial angle to the target angle, with success being achieved if the object's position falls within a fixed tolerance range around the target. The state space is given by the angular positions and velocities of the fingers' nine joints, the target position and the current angular position of the object along with their cosine and sine values, the object's velocity, and finally a success flag and the error between the current position and the target position. The actions' dimension is set to 9, corresponding to the positions of the fingers' joints. A small batch size is again used, but this time no noise is added to the states during training.

IV-D2 Dataset description

For this task, only 34 demonstrations are recorded, consisting of 14 amateur demonstrations with imperfect state-action pairs and 20 expert demonstrations. The expert data is then split in half: one half is used for training and the other for validation. All the demonstrations were successful, i.e. the operator was able to solve the task in each of them.

IV-D3 Results

Fig. 7 shows the average performance of the trained models over several runs each. Each run is given a fixed budget of steps, and success is achieved if the imitator is capable of bringing the object's position within the tolerance range of the target position. The success rate of Adam drops as expected with the addition of imperfect demonstrations, but that of one of the At-Adam variants also suffered a significant decrease. On the other hand, the other At-Adam variant maintains its performance when half of the amateur demonstrations are added, but then deteriorates when all the amateur trajectories are given. Since the success rate of t-Adam, with its robustness fixed, increased when adding the amateur trajectories, it is likely that the proposed adjustment rule for the t-momentum's degrees of freedom was incomplete, or that the simultaneous optimization of $\theta$ and $\nu$ caused the policy to fall into one of the local solutions when updating with a temporarily high $\nu$.

Fig. 7: Success rates of the trained models on the D’Claw robot with varying amateur data proportion.

For further investigation, Fig. 8 shows the success rates of the models trained using only the amateur data. As we can see, despite being affected by the presence of imperfect demonstrations in the previous result, At-Adam is capable of altering its robustness to extract the most useful information from this imperfect dataset. Interestingly, with only amateur trajectories, the success rate in Fig. 8 is higher than that in Fig. 7. This suggests that the decrease in the success rate of At-Adam may be due to a cause outside the proposed method. That is, BC is poor at learning multimodal policies [8], and if the policy underlying the amateur demonstrations and the expert's one are different but both can solve the task, learning from both demonstrations will fail due to the nature of BC.

Fig. 8: Success rates of the trained models on the D’Claw robot with only amateur data and no expert data.

V Conclusions

In this study, we showed how the t-momentum can be used to produce robust imitators under the BC framework. Taking advantage of Aeschliman's algorithm [1], we introduced a mechanism to automatically adjust the robustness of the t-momentum strategy, in order to deal with different proportions of imperfect and noisy pairs in the demonstrations. The application to two different robots, with different tasks of different degrees of difficulty, displayed the effectiveness of the proposed approach.

As implied by the experiments, the amateur demonstrations may make the policy multimodal; hence, this reaffirms that the standard BC and/or the policy model should be modified in order to resolve this multimodality. In addition, the proposed method can be regarded as a kind of safety net, because it removes outliers at the final stage of optimization. An unsupervised classification of demonstrations and/or a robust design of the loss function would be required to actively utilize amateur demonstrations and further bring forth their potential for wide-ranging imitation learning applications. In future work, the proposed method will be integrated with such algorithms.

References

  • [1] C. Aeschliman, J. Park, and A. C. Kak (2010) A novel parameter estimation algorithm for the multivariate t-distribution and its application to computer vision. In European Conference on Computer Vision, pp. 594–607. Cited by: §III-B, §V.
  • [2] M. Ahn, H. Zhu, K. Hartikainen, H. Ponte, A. Gupta, S. Levine, and V. Kumar (2020) ROBEL: robotics benchmarks for learning with low-cost robots. In Conference on Robot Learning, pp. 1300–1313. Cited by: §IV-B3.
  • [3] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §IV-A2.
  • [4] M. Bain and C. Sammut (1995) A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129. Cited by: §I, §II-A.
  • [5] D. Brown, W. Goo, P. Nagarajan, and S. Niekum (2019) Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pp. 783–792. Cited by: §I.
  • [6] M. G. Catalano, G. Grioli, M. Garabini, F. Bonomo, M. Mancini, N. Tsagarakis, and A. Bicchi (2011) VSA-cubebot: a modular variable stiffness platform for multiple degrees of freedom robots. In IEEE international conference on robotics and automation, pp. 5090–5095. Cited by: §IV-B2.
  • [7] J. S. Dyrstad, E. R. Øye, A. Stahl, and J. R. Mathiassen (2018) Teaching a robot to grasp real fish by imitation learning from a human supervisor in virtual reality. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 7185–7192. Cited by: §I.
  • [8] S. K. S. Ghasemipour, R. Zemel, and S. Gu (2020) A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259–1277. Cited by: §IV-D3.
  • [9] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. Advances in neural information processing systems 29, pp. 4565–4573. Cited by: §I.
  • [10] W. E. L. Ilboudo, T. Kobayashi, and K. Sugimoto (2020) Robust stochastic gradient descent with student-t distribution based first-order momentum. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I, §II-B1, §II-B2, §III-B2, §III-B, §IV-A1.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §II-B1, §II-B1, §IV-A1.
  • [12] T. Kobayashi and W. E. L. Ilboudo (2021) T-soft update of target network for deep reinforcement learning. Neural Networks 136, pp. 63–71. Cited by: §II-B2.
  • [13] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. (2018) Roboturk: a crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879–893. Cited by: §I.
  • [14] A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2. Cited by: §I.
  • [15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §IV-A2.
  • [16] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §III-A.
  • [17] V. Tangkaratt, N. Charoenphakdee, and M. Sugiyama (2021) Robust imitation learning from noisy demonstrations. In International Conference on Artificial Intelligence and Statistics, pp. 298–306. Cited by: §I.
  • [18] V. Tangkaratt, B. Han, M. E. Khan, and M. Sugiyama (2020) Variational imitation learning with diverse-quality demonstrations. In International Conference on Machine Learning, pp. 9407–9417. Cited by: §I.
  • [19] F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §I.
  • [20] Y. Tsurumine, Y. Cui, K. Yamazaki, and T. Matsubara (2019) Generative adversarial imitation learning with deep p-network for robotic cloth manipulation. In IEEE-RAS International Conference on Humanoid Robots, pp. 274–280. Cited by: §I.
  • [21] Y. Wu, N. Charoenphakdee, H. Bao, V. Tangkaratt, and M. Sugiyama (2019) Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pp. 6818–6827. Cited by: §I.