1 Introduction
Recent years have witnessed the success of using autonomous agents to learn and adapt to complex tasks and environments in a range of applications such as playing games [e.g. 43, 61, 70], autonomous driving [e.g. 28, 4], robotics [19], medical treatment [e.g. 76], and recommendation systems and advertising [e.g. 37, 67].
Previous successes in sequential decision making often require two key components: (1) a carefully designed reward function that provides the supervision signal during learning, and (2) an unlimited number of online interactions with the real-world environment (or a carefully designed simulator) to explore new, unseen regions. However, in many scenarios, neither component is available. For example, it is hard to define the reward signal for the uncountably many extreme situations in autonomous driving [30], and it is dangerous and risky to directly deploy a learning policy on humans to gather information in autonomous medical treatment [76]. Therefore, offline sequential decision making algorithms that do not require reward signals are in demand.
Imitation Learning (IL) [48] offers an elegant way to train intelligent agents for complex tasks without knowledge of reward functions. In order to guide intelligent agents toward correct behaviors, it is crucial to have high-quality expert demonstrations. Well-known imitation learning algorithms such as Behavior Cloning (BC, [49]) and Generative Adversarial Imitation Learning (GAIL, [21]) assume that the demonstrations given for training are all presumably optimal, and aim to learn the optimal policy from the expert demonstration data set. More specifically, BC only uses offline demonstration data without any interaction with the environment, whereas GAIL requires online interactions.
However, in real-world scenarios, since demonstrations are often collected from humans, we cannot guarantee that all the demonstrations we collect have high quality. This has been addressed in a line of research [71, 66, 65, 6, 59]. A human expert can make mistakes by accident or due to the difficulty of a complicated scenario (e.g., medical diagnosis). Furthermore, even when an expert demonstrates a successful behavior, the recorder or the recording system may contaminate the data by accident or on purpose [e.g. 46, 15, 78].
This leads to the central question of the paper:
Can the optimality assumption on expert demonstrations be weakened, or even replaced by tolerance to arbitrary outliers, in the offline imitation learning setting?
More concretely, we consider the corrupted demonstrations setting, where the majority of the demonstration data is collected by an expert policy (presumably optimal), and the remaining data can be arbitrary outliers (the formal definition is presented in Definition 2.1).
Such a definition, which allows arbitrary outliers among the corrupted samples, has a rich history in robust statistics [23, 24], yet it has not been widely used in imitation learning. It has great significance in many applications, such as automated medical diagnosis for healthcare [76] and autonomous driving [42], where the historical data (demonstrations) are often complicated and noisy, which requires robustness considerations.
However, classical offline imitation learning approaches such as Behavior Cloning (BC) fail drastically under this corrupted demonstrations setting. We illustrate this phenomenon in Figure 1. We run BC on the Hopper environment (a continuous control environment from PyBullet [12]), and the performance of the policy learned by BC drops drastically as the fraction of corruptions in the offline demonstration data set increases.
In this paper, we propose a novel robust imitation learning algorithm, Robust Behavior Cloning (RBC, Algorithm 1), which is resilient to corruptions in the offline demonstrations. In particular, our RBC does not require potentially costly or risky interaction with the real-world environment, nor any human annotations. In Figure 1, our RBC on corrupted demonstrations has nearly the same performance as BC on expert demonstrations (the case when $\epsilon = 0$), which achieves expert level, and its performance barely changes when $\epsilon$ grows to 20%. The detailed experimental setup and comparisons with existing methods (e.g., [59]) are included in Section 5.
1.1 Main Contributions

(Algorithm) We consider robustness in offline imitation learning where we have corrupted demonstrations. Our definition of corrupted demonstrations significantly weakens the presumably optimal assumption on demonstration data, and can tolerate a constant fraction of state-action pairs being arbitrarily corrupted. We refer to Definition 2.1 for the precise statement.
To deal with this issue, we propose a novel algorithm, Robust Behavior Cloning (Algorithm 1), for robust imitation learning. Our algorithm works in the offline setting, without any further interaction with the environment or any human annotations. The core ingredient of our robust algorithm is a novel Median-of-Means objective in policy estimation, in place of the classical Behavior Cloning objective. Hence, it is simple to implement and computationally efficient.

(Theoretical guarantees) We analyze our Robust Behavior Cloning algorithm when a constant fraction of the demonstrations are outliers in the offline setting. To the best of our knowledge, we provide the first theoretical guarantee that is robust to a constant fraction of arbitrary outliers in offline imitation learning. We show that our RBC achieves nearly the same error scaling and sample complexity as vanilla BC with expert demonstrations. To this end, our algorithm guarantees robustness to corrupted demonstrations at no cost in statistical estimation error. This is the content of Section 4.

(Empirical support) We validate the predicted robustness and show the effectiveness of our algorithm on a number of different high-dimensional continuous control benchmarks. Vanilla BC is indeed fragile with corrupted demonstrations, yet our Robust Behavior Cloning is computationally efficient and achieves nearly the same performance as vanilla BC with expert demonstrations. Section 5 also shows that our algorithm achieves competitive results compared to existing imitation learning methods.
Notation.
Throughout this paper, we use $c$ and $C$ to denote universal positive constants. We use the big-$O$ notation $f(n) = O(g(n))$ to denote that there exist a positive constant $C$ and a natural number $n_0$ such that, for all $n \ge n_0$, we have $f(n) \le C\, g(n)$.
Outline.
The rest of this paper is organized as follows. In Section 2, we formally define the setup and the corrupted demonstrations. In Section 3, we introduce our RBC and the computationally efficient algorithm (Algorithm 1). We provide the theoretical analysis in Section 4 and experimental results in Section 5. We leave the detailed discussion and related work to Section 6. All proofs and experimental details are collected in the Appendix.
2 Problem Setup
2.1 Reinforcement Learning and Imitation Learning
Markov Decision Process and Reinforcement Learning.
We start the problem setup by introducing the Markov decision process (MDP). An MDP
$(\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$ consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, an unknown reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, an unknown transition kernel $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, an initial state distribution $\rho_0 \in \Delta(\mathcal{S})$, and a discount factor $\gamma \in (0, 1)$. Here $\Delta(\mathcal{S})$ denotes the set of probability distributions on $\mathcal{S}$.
An agent acts in an MDP following a policy $\pi(\cdot \mid s)$, which prescribes a distribution over the action space given each state $s \in \mathcal{S}$. Running the policy starting from the initial distribution $\rho_0$ yields a stochastic trajectory $\{(s_t, a_t, r_t)\}_{t \ge 0}$, where $s_t$, $a_t$, $r_t$ represent the state, action, and reward at time $t$ respectively, with $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and the next state $s_{t+1}$ following the unknown transition kernel, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. We denote $d_t^{\pi}(s, a)$ as the marginal joint distribution of state and action at time step $t$, and we define $d^{\pi}(s, a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d_t^{\pi}(s, a)$ as the visitation distribution for policy $\pi$. For simplicity, we reuse the notation $d^{\pi}(s)$ to denote the marginal distribution over states.
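For small tabular problems, the discounted visitation distribution defined above can be computed directly by truncating the infinite sum. The following is a minimal illustrative sketch (the function and argument names are ours, not from the paper):

```python
import numpy as np

def visitation_distribution(P_pi, rho0, gamma, horizon=2000):
    """Truncated evaluation of d_pi = (1 - gamma) * sum_t gamma^t * d_t for a
    tabular Markov chain. P_pi[s, s'] is the state-to-state transition matrix
    induced by running the policy; rho0 is the initial state distribution."""
    d_t = np.asarray(rho0, dtype=float)
    d = np.zeros_like(d_t)
    for t in range(horizon):
        d += (1.0 - gamma) * (gamma ** t) * d_t
        d_t = d_t @ P_pi  # advance one step of the induced chain
    return d
```

For instance, for a two-state chain that mixes to the uniform distribution in one step, starting deterministically at state 0 with $\gamma = 0.9$, the visitation distribution puts weight $(1-\gamma) + \gamma/2 = 0.55$ on state 0.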
The goal of reinforcement learning (RL) is to find the best policy $\pi$ to maximize the expected cumulative return $J(\pi) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$. Common RL algorithms (see, e.g., [62]) require online interaction and exploration with the environment. However, this is prohibited in the offline setting.

Imitation Learning.
Imitation learning (IL) aims to obtain a policy that mimics the expert's behavior using a demonstration data set $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$, where $N$ is the sample size of $\mathcal{D}$. Note that we do not need any reward signal. Traditional imitation learning assumes perfect (or near-optimal) expert demonstrations; for simplicity, we assume that each state-action pair is drawn from the joint stationary distribution of an expert policy $\pi^*$:
$$(s_i, a_i) \sim d^{\pi^*}, \qquad i = 1, \ldots, N. \qquad (1)$$
Behavior Cloning.
Behavior Cloning (BC) is a well-known algorithm [49] for IL that uses only offline demonstration data, without any interaction with the environment. More specifically, BC solves the following maximum likelihood estimation (MLE) problem, which minimizes the average negative log-likelihood (NLL) over all samples in the offline demonstrations $\mathcal{D}$:

$$\widehat{\pi}_{\mathrm{BC}} = \operatorname*{argmin}_{\pi \in \Pi} \frac{1}{N} \sum_{i=1}^{N} -\log \pi(a_i \mid s_i). \qquad (2)$$
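As a concrete toy instance of eq. 2, for a tabular (discrete) policy class the MLE has a closed form: the empirical conditional frequencies of actions given states. A minimal sketch (the function name and smoothing constant are ours, for illustration only):

```python
import numpy as np

def behavior_cloning_tabular(demos, n_states, n_actions, smoothing=1e-9):
    """MLE of pi(a|s) from (state, action) pairs: empirical conditional
    frequencies, which minimize the average negative log-likelihood (eq. 2)
    over the tabular policy class. A tiny smoothing avoids division by zero."""
    counts = np.full((n_states, n_actions), smoothing)
    for s, a in demos:
        counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

demos = [(0, 1), (0, 1), (0, 0), (1, 2)]
pi = behavior_cloning_tabular(demos, n_states=2, n_actions=3)
```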
Recent works [1, 52, 73, 74] have shown that BC is optimal in the offline setting, and can only be improved with knowledge of the transition dynamics in the worst case. Another line of research considers improving BC through further online interaction with the environment [5] or by actively querying an expert [56, 55].
2.2 Learning from corrupted demonstrations
However, it is often unrealistic to assume that the demonstration data set is collected by a presumably optimal expert policy. In this paper, we propose Definition 2.1 for corrupted demonstrations, which tolerates gross corruption or model mismatch in the offline data set.
Definition 2.1 (Corrupted Demonstrations).
Let the state-action pairs be drawn from the joint stationary distribution $d^{\pi^*}$ of a presumably optimal expert policy $\pi^*$. The corrupted demonstration data are generated by the following process: an adversary can choose an arbitrary $\epsilon$-fraction ($\epsilon < 1/2$) of the samples and modify them with arbitrary values. We note that $\epsilon$ is a constant independent of the dimensions of the problem. After the corruption, we use $\mathcal{D}$ to denote the corrupted demonstration data set.
This corruption process can represent gross corruption or model mismatch in the demonstration data set. To the best of our knowledge, Definition 2.1 is the first definition of corrupted demonstrations in imitation learning that tolerates arbitrary corruptions.
In supervised learning, the well-known Huber contamination model [23, 24] considers $(x, y) \sim (1-\epsilon) P + \epsilon Q$, where $x$ is the explanatory variable (feature) and $y$ is the response variable. Here, $P$ denotes the authentic statistical distribution, such as a Normal mean estimation or linear regression model, and $Q$ denotes the arbitrary outlier distribution. Dealing with corrupted $x$ and $y$ in high dimensions has a long history in the robust statistics community [e.g. 57, 10, 11, 75]. However, it is only recently that robust statistical methods can handle a constant fraction (independent of the dimensionality) of outliers in $x$ and $y$ [31, 50, 14, 38, 39, 60, 41, 35, 25]. We note that in imitation learning, the data collecting process for the demonstrations does not obey the i.i.d. assumption of traditional supervised learning, due to temporal dependency.
3 Our Algorithm
Equation 2 directly minimizes the empirical mean of the negative log-likelihood, and it is widely known that the mean operator is fragile to corruptions [23, 24]. Indeed, our experiment in Figure 1 demonstrates that vanilla BC fails drastically in the presence of outliers. Hence, we consider replacing the empirical average of the NLL in eq. 2 with a robust estimator: we first introduce the classical Median-of-Means (MOM) estimator for mean estimation, and then adapt it to the loss functions arising in robust imitation learning.
The vanilla MOM estimator for one-dimensional mean estimation works as follows: (1) randomly partition the $N$ samples into $k$ batches; (2) calculate the mean of each batch; (3) output the median of these $k$ batch means.
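The three steps above can be sketched directly in a few lines (a minimal one-dimensional version; the function and parameter names are ours):

```python
import numpy as np

def median_of_means(x, k, seed=0):
    """(1) randomly partition x into k batches, (2) compute each batch mean,
    (3) return the median of the k batch means."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    return float(np.median([b.mean() for b in np.array_split(x, k)]))
```

With a small fraction of gross outliers, the plain empirical mean is destroyed while the MOM estimate stays close to the true mean, provided fewer than $k/2$ batches are contaminated.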
The MOM mean estimator achieves a sub-Gaussian concentration bound for one-dimensional mean estimation even when the underlying distribution only has a bounded second moment (a heavy-tailed distribution); interested readers are referred to textbooks such as [47, 26, 3]. Very recently, MOM estimators have been used for high-dimensional robust regression [7, 22, 41, 35, 25] by applying the MOM estimator to the loss function of the empirical risk minimization process.

3.1 Robust Behavior Cloning
Inspired by the MOM estimator, a natural robust version of eq. 2 randomly partitions the $N$ samples into $k$ batches $\{B_1, \ldots, B_k\}$ of batch size $m = N/k$, and calculates

$$\widehat{\pi} = \operatorname*{argmin}_{\pi \in \Pi} \, \operatorname{Median}\big( \ell_1(\pi), \ell_2(\pi), \ldots, \ell_k(\pi) \big), \qquad (3)$$

where the loss function $\ell_j(\pi)$ is the average negative log-likelihood in batch $B_j$:

$$\ell_j(\pi) = \frac{1}{m} \sum_{(s_i, a_i) \in B_j} -\log \pi(a_i \mid s_i). \qquad (4)$$
Our idea in eq. 3 minimizes the MOM of the NLL, which extends the MOM mean estimator to loss functions for robust imitation learning. Although eq. 3 can also achieve robust results empirically, we propose Definition 3.1 for theoretical convenience, which optimizes a min-max version (the MOM tournament [34, 41, 35, 25]) to handle arbitrary outliers in the demonstration data set $\mathcal{D}$.
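To make eq. 3 concrete, the following toy sketch runs the MOM-of-losses idea on a one-parameter Gaussian "policy" (state-independent, with actions modelled as $N(\mu, 1)$, so the per-sample NLL is $(a - \mu)^2/2$ up to a constant). The setup and names are illustrative, not the paper's Algorithm 1:

```python
import numpy as np

def mom_bc_toy(actions, k, steps=300, lr=0.5, seed=0):
    """Heuristically minimize the median of per-batch NLLs (eq. 3):
    re-partition the data, pick the batch whose loss is the median,
    and take a gradient step on that batch only."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for _ in range(steps):
        batches = np.array_split(rng.permutation(actions), k)
        losses = [0.5 * np.mean((b - mu) ** 2) for b in batches]  # eq. 4
        j = int(np.argsort(losses)[k // 2])     # batch with the median loss
        mu -= lr * (mu - batches[j].mean())     # gradient of that batch's NLL
    return mu
```

On demonstrations whose actions concentrate near 0, with a small fraction pushed to a gross value, this estimate stays near 0, while the plain-average BC solution (the overall mean) is pulled away by the outliers.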
Definition 3.1 (Robust Behavior Cloning).
We randomly split the corrupted demonstrations $\mathcal{D}$ into $k$ batches $\{B_1, \ldots, B_k\}$ of batch size $m = N/k$.^{1} Robust Behavior Cloning solves the following optimization:

$$\widehat{\pi}_{\mathrm{RBC}} = \operatorname*{argmin}_{\pi \in \Pi} \, \max_{\pi' \in \Pi} \, \operatorname{Median}\big( \ell_1(\pi) - \ell_1(\pi'), \ldots, \ell_k(\pi) - \ell_k(\pi') \big). \qquad (5)$$

^{1}Without loss of generality, we assume that $k$ exactly divides the sample size $N$, so that $m = N/k$ is an integer.
The workhorse of Definition 3.1 is eq. 5, which uses a novel variant of the MOM tournament procedure for imitation learning problems.
In eq. 4, we calculate the average negative log-likelihood (NLL) over a single batch of state-action pairs, and $\widehat{\pi}_{\mathrm{RBC}}$ is the solution of a min-max formulation based on the batch losses $\ell_j(\pi)$ and $\ell_j(\pi')$. Though our algorithm minimizes a robust version of the NLL, we do not rely on the traditional i.i.d. assumption of supervised learning.
The key result of our theoretical analysis is that the min-max solution over the median batch of the loss function is robust to a constant fraction of arbitrary outliers in the demonstrations. The intuition behind this min-max formulation is that the inner variable $\pi'$ needs to get close to $\pi^*$ to maximize the term $-\ell_j(\pi')$, and the outer variable $\pi$ also needs to get close to $\pi^*$ to minimize the term $\ell_j(\pi)$. In Section 4, we show that under corrupted demonstrations, $\widehat{\pi}_{\mathrm{RBC}}$ is close to $\pi^*$. In particular, $\widehat{\pi}_{\mathrm{RBC}}$ in Definition 3.1 has the same error scaling and sample complexity as vanilla BC trained on expert demonstrations.
Algorithm design. In Section 4, we provide rigorous statistical guarantees for Definition 3.1. However, the objective function in eq. 5 is in general not convex, hence we use Algorithm 1 as a computational heuristic to solve it.
In each iteration of Algorithm 1, we randomly partition the demonstration data set into $k$ batches and calculate the batch losses by eq. 4. We then pick the batch attaining the median of the loss differences and evaluate the gradient on that batch. We use gradient descent on $\pi$ for the $\ell_j(\pi)$ term and gradient ascent on $\pi'$ for the $-\ell_j(\pi')$ term.
In Section 5, we empirically show that this gradient-based heuristic (Algorithm 1) is able to minimize the objective and has good convergence properties. As for time complexity, when using backpropagation on one batch of samples, our RBC incurs an overhead cost compared to vanilla BC, since it must evaluate the loss function on all samples via forward propagation. Empirical studies in Appendix B show that the running time of RBC is comparable to vanilla BC.
4 Theoretical Analysis
In this section, we provide theoretical guarantees for our RBC algorithm. Since our method (Definition 3.1) directly estimates the conditional probability $\pi(\cdot \mid s)$ from the offline demonstrations, our analysis first bounds the total variation distance between $\widehat{\pi}_{\mathrm{RBC}}$ and $\pi^*$ in expectation over $d^{\pi^*}$. Since the ultimate goal of the learned policy is to maximize the expected cumulative return, we then provide an upper bound on the suboptimality $J(\pi^*) - J(\widehat{\pi}_{\mathrm{RBC}})$.

We begin the theoretical analysis with Assumption 4.1, which simplifies our analysis and is common in the literature [1, 2]. By assuming that the policy class is discrete, our upper bounds depend on the quantity $\log(|\Pi|/\delta)$, which matches the error rates and sample complexity of BC with expert demonstrations [1, 2].
Assumption 4.1.
We assume that the policy class $\Pi$ is discrete and realizable, i.e., $\pi^* \in \Pi$.
4.1 The upper bound for the policy distance
We first present Theorem 4.1, which shows that minimizing the MOM objective via eq. 5 guarantees closeness of the robust policy $\widehat{\pi}_{\mathrm{RBC}}$ to the optimal policy $\pi^*$ in total variation distance.
Theorem 4.1.
Suppose we have a corrupted demonstration data set $\mathcal{D}$ with sample size $N$ generated as in Definition 2.1 with a constant corruption fraction $\epsilon < 1/2$. Under Assumption 4.1, let $\widehat{\mathcal{E}}$ be the final objective value attained by $\widehat{\pi}_{\mathrm{RBC}}$ in the optimization eq. 5 with batch size $m = N/k$. Then with probability at least $1-\delta$, we have

$$\mathbb{E}_{s \sim d^{\pi^*}}\Big[ \big\| \widehat{\pi}_{\mathrm{RBC}}(\cdot \mid s) - \pi^*(\cdot \mid s) \big\|_{\mathrm{TV}}^2 \Big] \le O\!\left( \frac{\log(|\Pi|/\delta)}{N} \right) + O\big(\widehat{\mathcal{E}}\big). \qquad (6)$$
The proof is collected in Appendix A. We note that the data collection process does not follow the i.i.d. assumption, hence we use a martingale analysis similar to [1, 2]. The first part of eq. 6 is the statistical error, which matches the error rate of vanilla BC on expert demonstrations [1, 2]. The second part is the final objective value of the optimization eq. 5, which itself consists of two parts: one scales with the fraction of corruption $\epsilon$, and the other is the suboptimality gap incurred by solving the non-convex optimization. Our main theorem, Theorem 4.1, guarantees that a small final objective value implies an accurate estimate of the policy, hence we can certify the estimation quality using the obtained final objective value.
4.2 The upper bound for the suboptimality
Next, we present Theorem 4.2, which bounds the reward performance of the learned robust policy $\widehat{\pi}_{\mathrm{RBC}}$.
Theorem 4.2.
Under the same assumptions as in Theorem 4.1, with probability at least $1-\delta$, we have
$$J(\pi^*) - J(\widehat{\pi}_{\mathrm{RBC}}) \le \frac{2}{(1-\gamma)^2} \sqrt{ O\!\left( \frac{\log(|\Pi|/\delta)}{N} \right) + O\big(\widehat{\mathcal{E}}\big) }.$$
The proof is collected in Appendix A. We note that the error scaling and sample complexity of the statistical error term in Theorem 4.2 match those of vanilla BC with expert demonstrations [1, 2].
Remark 4.1.
The quadratic dependency on the effective horizon ($\frac{1}{1-\gamma}$ in the discounted setting, or $H$ in the episodic setting) is widely known as the compounding error or distribution shift issue in the literature, and is due to an essential limitation of the offline imitation learning setting. Recent work [52, 73] shows that this quadratic dependency cannot be improved without further interaction with the environment or knowledge of the transition dynamics $P$. Hence BC is actually optimal in the no-interaction setting. Also, a line of research considers improving BC through further online interaction with the environment or even active queries to the expert [56, 5, 55]. Since our work, as a robust counterpart of BC, focuses on robustness to corruptions in the offline demonstrations, it can naturally be used in online settings such as DAgger [56] and [5].
5 Experiments
In this section, we study the empirical performance of our Robust Behavior Cloning. We evaluate the robustness of RBC on several continuous control benchmarks simulated by the PyBullet [12] simulator: HopperBulletEnv-v0, Walker2DBulletEnv-v0, HalfCheetahBulletEnv-v0 and AntBulletEnv-v0. These tasks come with true reward functions in the simulator; we use only state observations and actions for the imitation algorithms, and we use the reward solely to evaluate the obtained policy when running it in the simulator.
5.1 Experimental setup
For each task, we collect the presumably optimal expert trajectories using pretrained agents from Stable Baselines3.^{2} In the experiments, we use the Soft Actor-Critic [20] agent from the Stable Baselines3 pretrained zoo, and we consider it to be the expert. We provide the hyperparameter settings in Appendix B.

^{2}The pretrained agents were cloned from the following repositories: https://github.com/DLR-RM/stable-baselines3, https://github.com/DLR-RM/rl-baselines3-zoo.

For these continuous control environments, the action space is bounded between $-1$ and $1$. We note that Definition 2.1 allows for arbitrary corruptions, and we choose the outliers' actions to have maximum effect while not being easily detected. We generate the corrupted demonstration data set as follows: we first randomly choose an $\epsilon$-fraction of the samples and corrupt their actions. For option (1), we set the actions of the outliers to the boundary ($-1$ or $+1$). For option (2), the actions of the outliers are drawn from a uniform distribution on $[-1, 1]$.
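The corruption procedure above can be sketched as follows (a hedged reconstruction; the function and argument names are ours):

```python
import numpy as np

def corrupt_actions(actions, eps, option=1, seed=0):
    """Corrupt an eps-fraction of actions bounded in [-1, 1].
    option 1: set outlier actions to the boundary (-1 or +1);
    option 2: draw outlier actions uniformly from [-1, 1]."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(actions, dtype=float, copy=True)
    idx = rng.choice(len(corrupted), size=int(eps * len(corrupted)),
                     replace=False)
    shape = (len(idx),) + corrupted.shape[1:]
    if option == 1:
        corrupted[idx] = rng.choice([-1.0, 1.0], size=shape)
    else:
        corrupted[idx] = rng.uniform(-1.0, 1.0, size=shape)
    return corrupted, idx
```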
We compare our RBC algorithm (Algorithm 1) to a number of natural baselines: the first baseline directly uses BC on the corrupted demonstrations without any robustness consideration; the second uses BC on the expert demonstrations (which is equivalent to $\epsilon = 0$ in our corrupted demonstrations setting) with the same sample size.
We also investigate the empirical performance of a baseline that achieves state-of-the-art performance: Behavior Cloning from Noisy Observations (Noisy BC). Noisy BC is a recent offline imitation learning algorithm proposed by [59], which achieves superior performance compared to [71, 6, 5]. Similar to our RBC, Noisy BC does not require any environment interaction during training or any screening/annotation process to discard non-optimal behaviors in the demonstration data set.
Noisy BC works in an iterative fashion: in each iteration, it reuses the old policy iterate to reweight the state-action samples in a weighted negative log-likelihood objective, where each sample is weighted by its likelihood under the previous policy iterate. Intuitively, if the likelihood of a state-action sample was small in the previous iteration, its weight will be small in the current iteration. Noisy BC outputs the resulting policy after multiple iterations.
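The reweighting idea can be sketched on a toy tabular policy (a hedged sketch of the iteration described above; the weighting scheme and names here are illustrative, not the exact procedure of [59]):

```python
import numpy as np

def noisy_bc_iteration(demos, pi_old, n_states, n_actions, smoothing=1e-9):
    """One reweighting iteration: weight each (s, a) sample by the old
    policy's likelihood pi_old[s, a], then refit the policy by weighted
    maximum likelihood (closed form in the tabular case)."""
    counts = np.full((n_states, n_actions), smoothing)
    for s, a in demos:
        counts[s, a] += pi_old[s, a]  # weight = old likelihood of (s, a)
    return counts / counts.sum(axis=1, keepdims=True)
```

On a toy data set where 9 of 10 samples take action 1 and one outlier takes action 2, iterating from a uniform policy drives the outlier's weight toward zero.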
5.2 Convergence of our algorithm
We first illustrate the convergence and performance of our algorithm by tracking the cumulative reward of different algorithms vs. epoch number throughout training. More specifically, we evaluate the current policy in the simulator for 20 trials, and report the mean and standard deviation of the cumulative reward every 5 epochs. This metric corresponds to the theoretical bounds in Theorem 4.2.

We focus on four continuous control environments, where the observation spaces have around 30 dimensions and the action spaces are bounded in $[-1, 1]$. In this experiment, we adopt option (1), which sets the actions of outliers to the boundary ($-1$ or $+1$). We fix the corruption ratio at 10% and 20%, and present reward vs. epochs. Due to space limitations, we leave the plots for all environments to Figure 4 in Appendix B.
As illustrated in Figure 4, vanilla BC on corrupted demonstrations fails to converge to the expert policy, while our robust counterpart, Algorithm 1, on corrupted demonstrations has good convergence properties. Remarkably, our RBC on corrupted demonstrations attains nearly the same reward vs. epochs as directly using BC on expert demonstrations.
Computational considerations. Another important aspect of our algorithm is computational efficiency. To compare time complexity directly, we report the reward vs. wall-clock time of our RBC and "Oracle BC", which runs vanilla BC on the expert demonstrations. The experiments are conducted on a 1/2 core of an NVIDIA T4 GPU, and we leave the results to Table 2 in Appendix B due to space limitations. When using backpropagation on batches of samples, our RBC incurs an overhead cost compared to vanilla BC, since it must evaluate the loss function on all samples via forward propagation. Table 2 shows that the actual running time of RBC is comparable to vanilla BC.


5.3 Experiments under different setups
In this subsection, we compare the performance of our algorithm and existing methods under different setups.
Different fractions of corruption. We fix the sample size at 60,000 and vary the corruption fraction $\epsilon$ from 0% to 20%. Figure 1 shows that our RBC is resilient to outliers in the demonstrations for $\epsilon$ ranging from 0 to 20% in the Hopper environment. In Figure 2, the full experiments validate our theory: our Robust Behavior Cloning nearly matches expert performance across different environments and corruption ratios. By contrast, vanilla BC on corrupted demonstrations fails drastically.
Different sample sizes of $\mathcal{D}$. In this experiment, we fix the fraction of corruption $\epsilon$, set the actions of outliers to the boundary ($-1$ or $+1$), and vary the sample size of the demonstration data set. Theorem 4.2 predicts that a larger sample size of $\mathcal{D}$ leads to a smaller suboptimality gap in the value function. Figure 3 validates this: our RBC on corrupted demonstrations achieves nearly the same reward as Oracle BC (BC on expert demonstrations), and the suboptimality gap shrinks as the sample size grows. By contrast, BC on corrupted demonstrations does not improve as the sample size grows.
Oracle BC on expert demonstrations is a strong baseline, achieving expert-level reward with very few transition samples in the demonstrations. This suggests that compounding error or distribution shift may be less of an issue in these environments, which is consistent with the findings in [5].
In Figure 2 and Figure 3, our algorithm achieves superior results compared to the state-of-the-art robust imitation learning method [59] under different setups. The key difference between our method and the reweighting idea of [59] is that our objective eq. 5 provably removes the outliers, whereas outliers may mislead the reweighting process across iterations: if outliers have large weights in a previous iteration, the reweighting process will exacerbate their impact, and if an authentic, informative sample has a large training loss, it will be incorrectly down-weighted, losing sample efficiency. As shown in Figure 3, our RBC benefits from the tight theoretical bound of Theorem 4.2 and achieves superior performance even when the sample size (the number of trajectories) is small.
6 Related Work
Imitation Learning. Behavior Cloning (BC) is the most widely used imitation learning algorithm [49, 48] due to its simplicity, effectiveness, and scalability, and it has been widely deployed in practice. From a theoretical viewpoint, it has been shown that BC is optimal in the offline setting [52], where we have neither further online interactions nor knowledge of the transition dynamics $P$. With online interaction, a line of research focuses on improving BC in different scenarios: [56] proposed DAgger (Data Aggregation), which queries the expert policy in the online setting, and [5] proposed using an ensemble of BC policies as an uncertainty measure and interacting with the environment to take this uncertainty into account, without querying the expert. Very recently, [73, 51, 74] leveraged knowledge of the transition dynamics to eliminate the compounding error/distribution shift issue in BC.
Besides BC, there are other imitation learning algorithms: [21] used generative adversarial networks for distribution matching to learn a reward function; [54] provided an RL framework for IL by artificially setting the reward; [18] unified several existing imitation learning algorithms as minimizing a distribution divergence between the learned policy and the expert demonstrations, to name a few.

Offline RL. RL leverages the signal from a reward function to train the policy. Different from IL, offline RL often does not require the demonstrations to be expert demonstrations [e.g. 17, 16, 33] (interested readers are referred to [36]), and even benefits from offline data with higher coverage of different suboptimal policies [9, 27, 53]. The behavior-agnostic setting [45, 44] does not even require the collected data to come from a single policy.
The closest connection between offline RL and IL is learning the stationary visitation distribution, which, like IL, does not involve a reward signal. A line of recent research, especially on off-policy evaluation, aims to learn the stationary visitation distribution of a given target policy [e.g. 40, 45, 64, 44, 13]. In particular, [32] brings this off-policy evaluation idea to IL.
Robustness in IL and RL. Several recent papers consider corruption-robustness in either RL or IL. In RL, [79] considers adversarial corruption of whole episodes in the online setting, while a more recent work [78] considers offline RL where an $\epsilon$-fraction of the whole data set can be replaced by outliers. However, the corruption dependency in [78] scales with the dimension, whereas it can be a dimension-independent constant in this paper for robust offline IL. Many other papers consider perturbations, heavy tails, or corruptions in either the reward function [8] or the transition dynamics [72, 63, 58].
The papers most closely related to our robust IL setting are [71, 66, 65, 6, 59], which consider imperfect or noisy observations in imitation learning. However, they do not provide theoretical guarantees against arbitrary outliers in the demonstrations; to the best of our knowledge, we provide the first theoretical guarantee robust to a constant fraction of arbitrary outliers in offline imitation learning. Furthermore, [71, 66, 65] require additional online interactions with the environment, and [6, 71] require annotations for each demonstration, which costs significant human effort. Our algorithm achieves its robustness guarantee from purely offline demonstrations, without potentially costly or risky interaction with the real-world environment or human annotations.
6.1 Summary and Future Works
In this paper, we considered the corrupted demonstrations issue in imitation learning and proposed a novel robust algorithm, Robust Behavior Cloning, to deal with corruptions in the offline demonstration data set. The core technique is replacing the vanilla maximum likelihood estimation with a Median-of-Means (MOM) objective, which guarantees policy estimation and reward performance in the presence of a constant fraction of outliers. Our algorithm has strong robustness guarantees and competitive practical performance compared to existing methods.
There are several avenues for future work: since our work focuses on corruptions in offline data, any improvement in online IL that builds on BC would inherit the corruption-robustness guarantees and practical effectiveness of our offline RBC. It would also be of interest to apply our algorithm to real-world environments, such as automated medical diagnosis and autonomous driving.
References
 [1] Alekh Agarwal, Nan Jiang, S. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2019.
 [2] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. In Advances in Neural Information Processing Systems, volume 33, pages 20095–20107, 2020.
 [3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1):137–147, 1999.
 [4] Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, 2020.
 [5] Kianté Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2019.

 [6] Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792. PMLR, 2019.
 [7] Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
 [8] Sébastien Bubeck, Nicolo CesaBianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
 [9] Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixeddataset policy optimization. In International Conference on Learning Representations, 2020.
 [10] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
 [11] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
 [12] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. 2016.

 [13] Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems, 2020.
 [14] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606, 2019.

 [15] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634, 2018.
 [16] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.
 [17] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
 [18] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1259–1277. PMLR, 2020.
 [19] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR, 2017.
 [20] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
 [21] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29:4565–4573, 2016.
 [22] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
 [23] P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:492–518, 1964.
 [24] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
 [25] Ajil Jalal, Liu Liu, Alexandros G Dimakis, and Constantine Caramanis. Robust compressed sensing using generative models. Advances in Neural Information Processing Systems, 2020.
 [26] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
 [27] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
 [28] Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In 2019 International Conference on Robotics and Automation (ICRA), pages 8248–8254. IEEE, 2019.
 [29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [30] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021.
 [31] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pages 1420–1430. PMLR, 2018.
 [32] Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In International Conference on Learning Representations, 2020.
 [33] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
 [34] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 2012.
 [35] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906–931, 2020.
 [36] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
 [37] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextualbanditbased news article recommendation algorithms. In Proceedings of the 4th International Conference on Web Search and Data Mining (WSDM), pages 297–306, 2011.
 [38] Liu Liu, Tianyang Li, and Constantine Caramanis. High dimensional robust estimation: Arbitrary corruption and heavy tails. arXiv preprint arXiv:1901.08237, 2019.

 [39] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In International Conference on Artificial Intelligence and Statistics, pages 411–421. PMLR, 2020.
 [40] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
 [41] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 22(3):925–965, 2019.
 [42] Xiaobai Ma, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Improved robustness and safety for autonomous vehicle control with adversarial reinforcement learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1665–1671. IEEE, 2018.
 [43] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [44] Ali Mousavi, Lihong Li, Qiang Liu, and Denny Zhou. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.
 [45] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2318–2328, 2019.
 [46] Gina Neff and Peter Nagy. Automation, algorithms, and politics | Talking to bots: Symbiotic agency and the case of Tay. International Journal of Communication, 10:17, 2016.
 [47] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
 [48] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1–2):1–179, 2018.

 [49] Dean Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1988.
 [50] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):601–627, 2020.
 [51] Nived Rajaraman, Yanjun Han, Lin F Yang, Kannan Ramchandran, and Jiantao Jiao. Provably breaking the quadratic error compounding barrier in imitation learning, optimally. arXiv preprint arXiv:2102.12948, 2021.
 [52] Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33, 2020.
 [53] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.
 [54] Siddharth Reddy, Anca D Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards. In International Conference on Learning Representations, 2019.
 [55] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
 [56] Stéphane Ross, Geoffrey J. Gordon, and J. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
 [57] Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
 [58] Aurko Roy, Huan Xu, and Sebastian Pokutta. Reinforcement learning under model mismatch. In Advances in Neural Information Processing Systems, volume 30, 2017.
 [59] Fumihiro Sasaki and Ryota Yamashina. Behavioral cloning from noisy demonstrations. In International Conference on Learning Representations, 2021.
 [60] Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739–5748. PMLR, 2019.
 [61] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay. Science, 362(6419):1140–1144, 2018.
 [62] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
 [63] Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust mdps using function approximation. In International conference on machine learning, pages 181–189. PMLR, 2014.
 [64] Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon offpolicy estimation. In International Conference on Learning Representations (ICLR), 2020.
 [65] Voot Tangkaratt, Nontawat Charoenphakdee, and Masashi Sugiyama. Robust imitation learning from noisy demonstrations. In International Conference on Artificial Intelligence and Statistics, pages 298–306. PMLR, 2021.
 [66] Voot Tangkaratt, Bo Han, Mohammad Emtiyaz Khan, and Masashi Sugiyama. Variational imitation learning with diversequality demonstrations. In International Conference on Machine Learning, pages 9407–9417. PMLR, 2020.
 [67] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh, Ishan Durugkar, and Emma Brunskill. Predictive off-policy policy evaluation for non-stationary decision problems, with applications to digital marketing. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4740–4745, 2017.
 [68] A. Tsybakov. Introduction to nonparametric estimation. In Springer Series in Statistics, 2009.
 [69] Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
 [70] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
 [71] YuehHua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pages 6818–6827. PMLR, 2019.
 [72] Huan Xu and Shie Mannor. Distributionally robust markov decision processes. Mathematics of Operations Research, 37(2):288–300, 2012.
 [73] Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
 [74] Tian Xu, Ziniu Li, and Yang Yu. Nearly minimax optimal adversarial imitation learning with known and unknown transitions. arXiv preprint arXiv:2106.10424, 2021.
 [75] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantinerobust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 10–15 Jul 2018.
 [76] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2019.
 [77] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
 [78] Xuezhou Zhang, Yiding Chen, Jerry Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. arXiv preprint arXiv:2106.06630, 2021.
 [79] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Robust policy gradient against strong data corruption. arXiv preprint arXiv:2102.05800, 2021.
Appendix A Proofs
The analysis of maximum likelihood estimation is standard in the i.i.d. supervised learning setting [69]. In the proofs for our robust offline imitation learning algorithm, the analysis of the sequential decision making setting leverages the martingale techniques from [77, 2].
Our Robust Behavior Cloning (Definition 3.1) solves the following optimization:
(8)  π̂ ∈ argmin_{π ∈ Π} median( ℓ(π, B_1), ℓ(π, B_2), ..., ℓ(π, B_k) ),
where the loss function ℓ(π, B_j) is the average Negative Log-Likelihood in the batch B_j:
(9)  ℓ(π, B_j) = (1/m) Σ_{(s_i, a_i) ∈ B_j} −log π(a_i | s_i).
This can be understood as a robust counterpart of maximum likelihood estimation in a sequential decision process.
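To make the median-of-means objective concrete, here is a minimal NumPy sketch for a tabular softmax policy. All function and variable names are ours, and the actual Definition 3.1 may differ in details such as how batches are formed.

```python
import numpy as np

def batch_nll(policy_logits, states, actions):
    """Average negative log-likelihood of the (state, action) pairs in one
    batch under a tabular softmax policy; policy_logits: [n_states, n_actions]."""
    logits = policy_logits[states]                               # [m, n_actions]
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(actions)), actions].mean()

def rbc_objective(policy_logits, states, actions, k):
    """Median-of-means surrogate: split the demonstrations into k batches,
    compute the average NLL on each batch, and return the median batch loss.
    A few grossly corrupted batches shift the mean but not the median."""
    idx = np.array_split(np.arange(len(states)), k)
    losses = [batch_nll(policy_logits, states[i], actions[i]) for i in idx]
    return float(np.median(losses))
```

For a uniform policy over two actions, every batch has average NLL log 2, so the median objective is log 2 regardless of how actions were corrupted.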
With a slight abuse of notation, we use s and a to denote the observation and the action, and write the underlying unknown expert distributions as μ(s) and π^E(a | s). Following Assumption 4.1, we have realizability π^E ∈ Π, and the discrete function class satisfies |Π| < ∞.
Let D = {(s_i, a_i)}_{i=1}^N denote the data set and let D' = {(s_i, a'_i)}_{i=1}^N denote a tangent sequence, defined by keeping each state s_i and redrawing a'_i ∼ π^E(· | s_i). Note here that a'_i follows the expert distribution π^E(· | s_i) while s_i depends on the original sequence; hence the tangent sequence is independent conditional on D.
For this martingale process, we first introduce a decoupling Lemma from [2].
Lemma A.1.
[Lemma 24 in [2]] Let D be a dataset, and let D' be a tangent sequence. Let L(π, D) = Σ_{i=1}^N f(π, (s_i, a_i)) be any function which can be decomposed additively across samples in D. Here, f is any function of π and the sample (s_i, a_i). Let π̂ be any estimator taking the dataset D as input and with range Π. Then we have
E[ exp( L(π̂, D) − log E_{D'}[exp(L(π̂, D'))] − log|Π| ) ] ≤ 1.
Next we present a lemma which upper bounds the TV distance via a loss function closely related to the KL divergence. Such bounds between probability distributions are discussed extensively in the literature, e.g., [68].
Lemma A.2.
[Lemma 25 in [2]] For any two conditional probability densities f_1, f_2 and any state distribution D, we have
E_{s∼D} ‖f_1(·|s) − f_2(·|s)‖²_TV ≤ −2 log E_{s∼D, a∼f_2(·|s)} exp( −(1/2) log( f_2(a|s) / f_1(a|s) ) ).
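As a quick numerical sanity check of this type of TV-vs-KL bound, the snippet below (our own naming; tabular conditionals only, and possibly a looser form than the lemma as stated) compares the expected squared TV distance against minus twice the log of the expected Bhattacharyya coefficient, which is what the exponential of the half-log-ratio loss evaluates to.

```python
import numpy as np

def tv_vs_log_bc(f1, f2, d):
    """For tabular conditionals f1, f2 of shape [n_states, n_actions] and a
    state distribution d, return (E_s ||f1(.|s) - f2(.|s)||_TV^2,
    -2 log E_{s~d, a~f2(.|s)} sqrt(f1(a|s) / f2(a|s))).
    The second quantity should dominate the first."""
    tv = 0.5 * np.abs(f1 - f2).sum(axis=1)            # per-state TV distance
    lhs = float((d * tv ** 2).sum())
    bc = float((d * np.sqrt(f1 * f2).sum(axis=1)).sum())  # Bhattacharyya term
    return lhs, -2.0 * np.log(bc)
```

The inequality follows from TV² ≤ 2(1 − BC) per state and 1 − u ≤ −log u, so random tabular distributions always satisfy it.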
A.1 Proof of Theorem 4.1
With these lemmas in hand, we are now equipped to prove our main theorem (Theorem 4.1), which guarantees that the solution of eq. 5 is close to the optimal policy in TV distance.
Theorem A.1 (Theorem 4.1).
Suppose we have a corrupted demonstration data set of sample size N from Definition 2.1, with a constant corruption ratio ε. Under Assumption 4.1, let π̂ be the learned policy attaining the output objective value of the optimization eq. 5 with batch size m; then with probability at least 1 − δ, we have
Proof of Theorem 4.1.
En route to the proof of Theorem 4.1, we keep using the notation of Lemma A.1 and Lemma A.2, where the state observation is s, the action is a, and the discrete function class is Π.
Similar to [2], we first note that Lemma A.1 can be combined with a simple Chernoff bound to obtain an exponential tail bound: with probability at least 1 − δ, we have
(10)  −log E_{D'}[ exp( L(π̂, D') ) ] ≤ −L(π̂, D) + log(|Π|/δ).
Our proof proceeds by lower bounding the LHS of eq. 10 and upper bounding the RHS of eq. 10.
Let the batch size be m ≤ 1/(3ε), which is a constant by Definition 3.1, and let the number of batches be k = N/m. Since at most εN samples are corrupted, at most εN = εmk ≤ k/3 batches contain any corruption, so at least 66% of the batches are without corruptions.
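The batch-size condition can be checked by a worst-case count: each of the at most εN corrupted samples spoils at most one of the k = N/m batches. A small sketch under this assumption (names are ours):

```python
def clean_batch_fraction(n_samples, eps, batch_size):
    """Worst-case fraction of uncorrupted batches: each of the (at most
    eps * n_samples) corrupted samples lands in a distinct batch and spoils
    that entire batch. With batch_size <= 1/(3*eps), the spoiled fraction is
    at most eps * batch_size <= 1/3, leaving >= 66% of batches clean."""
    k = n_samples // batch_size                 # number of batches
    spoiled = min(k, int(eps * n_samples))      # batches touched by corruption
    return (k - spoiled) / k
```

For example, with ε = 0.1 and batch size 3 ≤ 1/(3ε), at least 2/3 of the batches stay clean even in the worst case.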
In the definition of RBC (Definition 3.1), we solve
(11) 
Lower bound for the LHS of eq. 10.
We apply the concentration bound eq. 10 to such uncorrupted batches; hence the majority of all batches satisfies eq. 10. For those batches, the LHS of eq. 10 can be lower bounded by the TV distance according to Lemma A.2.
(12) 
where (i) follows from the independence between π̂ and the tangent sequence due to the decoupling technique, and (ii) follows from Lemma A.2, which upper bounds the Total Variation distance by this quantity.
Upper bound for the RHS of eq. 10.
Note that the objective is the median of means over the batches, and that π^E is one feasible solution of the inner maximization step eq. 11. Since the output objective value of the optimization eq. 5 is attained by π̂, the median-batch loss of π̂ is no larger than that of π^E.
Hence for the median batch, the RHS of eq. 10 can be upper bounded by
(13) 
∎
A.2 Proof of Theorem 4.2
With the supervised learning guarantee of Theorem 4.1 in hand, which provides an upper bound on the TV distance between π̂ and π^E, we are now able to present the suboptimality guarantee on the reward of π̂. This bound directly corresponds to the reward performance of the policy.
Theorem A.2 (Theorem 4.2).
Proof of Theorem 4.2.
This part is similar to [1], and we have
where we use the fact that the advantage function is uniformly bounded, since the reward is always bounded between 0 and 1.
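This standard reduction can be spelled out via the performance difference lemma. The following sketch assumes an H-step episodic MDP with rewards in [0, 1], so that the advantage satisfies |A^π̂_h| ≤ H; conventions not stated explicitly here may differ from the paper's.

```latex
J(\pi^{E}) - J(\widehat{\pi})
  = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi^{E}}_h}
      \left[ \mathbb{E}_{a \sim \pi^{E}(\cdot \mid s_h)} A^{\widehat{\pi}}_h(s_h, a)
           - \mathbb{E}_{a \sim \widehat{\pi}(\cdot \mid s_h)} A^{\widehat{\pi}}_h(s_h, a) \right]
  \le \sum_{h=1}^{H} 2H \,\mathbb{E}_{s_h \sim d^{\pi^{E}}_h}
      \mathrm{TV}\!\left( \pi^{E}(\cdot \mid s_h), \widehat{\pi}(\cdot \mid s_h) \right)
  \le 2H^{2} \max_{h} \mathbb{E}_{s_h \sim d^{\pi^{E}}_h}
      \mathrm{TV}\!\left( \pi^{E}(\cdot \mid s_h), \widehat{\pi}(\cdot \mid s_h) \right),
```

where the second term in the bracket vanishes (the advantage of π̂ against itself has zero mean) and the first inequality bounds a difference of expectations by 2‖A‖_∞ times the TV distance.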
Appendix B Experimental Details
In this section, we provide the details of our algorithm RBC in different setups. All Behavior Cloning models were trained to minimize the mean-squared error regression loss on the demonstration data for 200 epochs using Adam [29]. In all settings, we fix the policy network to be a feedforward neural network with 3 hidden layers and ReLU activations. More hyperparameters are provided in Table 1.

Table 1: Hyperparameters.

Hyperparameter         Value
Parallel Environments  20
Regularization         0
Entropy coefficient    0.01
Gradient clipping      0.1
Learning rate
Reward vs. Epochs.
We illustrate the convergence of our algorithm by tracking the reward performance on different continuous control environments simulated by the PyBullet [12] simulator: HopperBulletEnv-v0, Walker2DBulletEnv-v0, HalfCheetahBulletEnv-v0 and AntBulletEnv-v0. More specifically, we evaluate the current policy in the simulator for 20 trials and report the mean and standard deviation of the cumulative reward every 5 epochs. In this experiment, we adopt option (1) for the outliers, which sets the actions of the outliers to the boundary of the action space. In Figure 4, we fix the corruption ratio at 10% and 20% with a fixed demonstration data set of size 60000, and present the reward vs. epochs. We note that the difference between the purple curves across the two settings is due to different random seeds.
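The evaluation protocol above can be sketched as a plain rollout loop. The sketch assumes the classic (pre-0.26) gym API that PyBullet environments expose, where `reset()` returns an observation and `step()` returns a 4-tuple; `policy` is any observation-to-action map, and all names are ours.

```python
import numpy as np

def evaluate_policy(env, policy, n_trials=20):
    """Roll out `policy` for n_trials episodes and report the mean and
    standard deviation of the cumulative reward, as tracked in Figure 4."""
    returns = []
    for _ in range(n_trials):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))
```

In the paper's setup, `env` would be e.g. `gym.make("HopperBulletEnv-v0")` after importing `pybullet_envs`, called every 5 training epochs.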
Convergence time.
Another important aspect of our algorithm is its practical time efficiency. To speed up RBC, we pick multiple batches around the median batch in line 7 of Algorithm 1 and evaluate the gradient via backpropagation on these batches. The experimental setup is consistent with Figure 4. To compare time complexity directly, we report the reward vs. wall-clock time of our RBC and of "Oracle BC", which optimizes BC on the clean expert demonstrations.
We measure the convergence time as the elapsed time from zero until first achieving 95% of the expert level. The experiments are conducted on half of an NVIDIA T4 GPU and presented in Table 2, which shows that the actual running time of RBC is comparable to vanilla BC.
Table 2: Convergence time.

           Hopper  HalfCheetah  Ant  Walker2D
Oracle BC  88      174          49   159
RBC        368     597          134  385
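The speed-up described in the "Convergence time" paragraph (picking several batches around the median in line 7 of Algorithm 1) can be sketched as a simple selection step; names are ours and the paper's implementation may differ.

```python
import numpy as np

def batches_near_median(batch_losses, n_select):
    """Return the indices of the n_select batches whose average losses are
    closest to the median batch loss. Back-propagating only on these batches
    keeps grossly corrupted (outlying-loss) batches out of the gradient update
    while averaging over more data than the single median batch."""
    losses = np.asarray(batch_losses, dtype=float)
    median = np.median(losses)
    return np.argsort(np.abs(losses - median))[:n_select]
```

For instance, given batch losses [0.1, 0.5, 0.9, 0.52, 10.0], the median is 0.52 and selecting 3 batches keeps indices 1, 2, 3 while excluding the outlier batch with loss 10.0.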