1 Introduction
Imitation learning (IL) has attracted great interest because obtaining demonstrations is usually easier than designing a reward function. A reward is a signal that instructs agents to complete the desired tasks. However, ill-designed reward functions usually lead to unexpected behaviors [Everitt and Hutter, 2016; Dewey, 2014; Amodei et al., 2016]. There are two main approaches to IL: behavioral cloning (BC) [Schaal, 1999], which adopts supervised learning to train an action predictor directly from demonstration data; and apprenticeship learning (AL), which attempts to find a policy that is better than the expert policy for a class of cost functions [Abbeel and Ng, 2004]. Even though BC can be trained directly with supervised learning, it has been shown that BC cannot imitate the expert policy without a large amount of demonstration data because it does not consider the transitions of the environment [Ross et al., 2011]. In contrast, AL approaches learn by interacting with the environment and optimize objectives such as maximum entropy [Ziebart et al., 2008].
A state-of-the-art approach, generative adversarial imitation learning (GAIL), was proposed by Ho and Ermon [2016]. The method learns an optimal policy by performing occupancy measure matching [Syed et al., 2008]. An advantage of the matching method is that it is robust to demonstrations generated from a stochastic policy. Based on the concept proposed in GAIL, variants have recently been developed for different problem settings [Li et al., 2017; Kostrikov et al., 2019].
Although GAIL is able to learn an optimal policy from optimal demonstrations, applying IL approaches to real-world tasks requires taking into account the difficulty of obtaining such demonstration data. The works mentioned above usually assume that demonstrations from an optimal policy (either deterministic or stochastic) are available, an assumption that is rarely fulfilled because most accessible demonstrations are imperfect or even come from different policies. For instance, to train an agent to play basketball with gameplay videos of the National Basketball Association, we should be aware that there are 14.3 turnovers per game (see https://www.basketball-reference.com/leagues/NBA_stats.html), not to mention other kinds of mistakes that may not be recorded. Optimal demonstrations are hard to obtain because limited attention and the presence of distractions make it hard for humans to follow optimal policies all the time. As a result, some parts of the demonstrations may be optimal while others are not.
To mitigate the above problem, we propose to use confidence scores, which indicate the probability that a given trajectory is optimal. In practice, obtaining confidence scores can be cheaper than collecting optimal demonstrations, because scoring requires merely knowledge of the optimal behavior, whereas performing optimally requires not only such knowledge but also strict physical conditions. For instance, playing basketball well requires the capability of making spontaneous decisions and intrinsic fingertip control. Therefore, for real-world tasks, the confidence labelers need not be experts at achieving the goal. They can be ordinary enthusiasts, such as spectators of basketball games.
To further reduce the additional cost of learning an optimal policy, we consider a more realistic setting in which only part of the given demonstrations is equipped with confidence. The goal of this work is therefore to utilize imperfect demonstrations where some are equipped with confidence while others are not (we refer to demonstrations without confidence as “unlabeled demonstrations”).
In this work, we consider the setting where the given imperfect demonstrations are a mixture of optimal and non-optimal demonstrations. The setting is common when the demonstrations are collected via crowdsourcing [Serban et al., 2017; Hu et al., 2018; Shah et al., 2018] or learned from different sources such as videos [Tokmakov et al., 2017; Pathak et al., 2017; Supancic III and Ramanan, 2017; Yeung et al., 2017; Liu et al., 2018], where demonstrations can be generated from different policies.
We propose two methods, two-step importance weighting imitation learning (2IWIL) and generative adversarial imitation learning with imperfect demonstration and confidence (IC-GAIL), both based on the idea of reweighting but from different perspectives. To utilize both confidence and unlabeled data, 2IWIL predicts confidence scores for unlabeled data by optimizing a proposed objective based on empirical risk minimization (ERM) [Vapnik, 1998], which is flexible with respect to the choice of loss function, model, and optimizer. IC-GAIL, on the other hand, instead of directly reweighting to the optimal distribution and performing GAIL with reweighting, reweights to the non-optimal distribution and matches the optimal occupancy measure based on our mixture distribution setting. Since the derived objective of IC-GAIL depends on the proportion of optimal demonstrations in the demonstration mixture, we empirically show that IC-GAIL converges slower than 2IWIL but achieves better performance, which forms a trade-off between the two methods. We show that the proposed methods are both theoretically and practically sound.
2 Related work
In this section, we provide a brief survey of work on making use of non-optimal demonstrations and on semi-supervised classification with confidence data.
2.1 Learning from non-optimal demonstrations
Learning from non-optimal demonstrations is nothing new in the IL and reinforcement learning (RL) literature, but previous works utilized different information to learn a better policy.
Distance minimization inverse RL (DM-IRL) [Burchfiel et al., 2016] utilized a feature function of states and assumed that the true reward function is linear in the features. The feedback from humans is an estimate of the accumulated reward, which is harder to provide than confidence because multiple reward functions may correspond to the same optimal policy.
Semi-supervised IRL (SSIRL) [Valko et al., 2012] extends the IRL method proposed by Abbeel and Ng [2004], where the reward function can be learned by matching the feature expectations of the optimal demonstrations. The difference from Abbeel and Ng [2004] is that in SSIRL, optimal trajectories and sub-optimal trajectories from other performers are given. Transductive SVM [Schölkopf et al., 1999] was used in place of the vanilla SVM in Abbeel and Ng [2004] to recognize optimal trajectories among the sub-optimal ones. In our setting, confidence scores are given instead of optimal demonstrations. DM-IRL and SSIRL are not suitable for high-dimensional problems due to their dependence on the linearity of reward functions and on good feature engineering.
2.2 Semi-supervised classification with confidence data
In our 2IWIL method, we train a probabilistic classifier with confidence and unlabeled data by optimizing the proposed ERM objective. There are similar settings such as semi-supervised classification [Chapelle et al., 2006], where a few hard-labeled data and some unlabeled data are given.
Zhou et al. [2014] proposed to use hard-labeled instances to estimate confidence scores for unlabeled samples using Gaussian mixture models and principal component analysis. Similarly, for an input instance $x$, Wang et al. [2013] obtained an upper bound of its confidence with hard-labeled instances and a kernel density estimator, then treated the upper bound as an estimate of probabilistic class labels.
Another related scheme was considered in El-Zahhar and El-Gayar [2010], who treated soft labels as fuzzy inputs and proposed a classification approach based on k-nearest neighbors. This method is difficult to scale to high-dimensional tasks and lacks theoretical guarantees. Ishida et al. [2018] proposed another scheme that trains a classifier only from positive data equipped with confidence. Our proposed method, 2IWIL, also trains a classifier with confidence scores of given demonstrations. Nevertheless, 2IWIL can train a classifier from fewer confidence data, with the aid of a large number of unlabeled data.
3 Background
In this section, we provide background on RL and GAIL.
3.1 Reinforcement Learning
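We consider the standard Markov decision process (MDP) [Sutton and Barto, 1998]. An MDP is represented by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the transition density of state $s_{t+1}$ at time step $t+1$ given action $a_t$ made under state $s_t$ at time step $t$, $\mathcal{R}(s, a)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.
A stochastic policy $\pi(a \mid s)$ is a density of action $a$ given state $s$. The performance of $\pi$ is evaluated in the discounted infinite-horizon setting, and its expectation can be represented with respect to the trajectories generated by $\pi$:
$$\mathbb{E}_{\pi}[\mathcal{R}(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{R}(s_t, a_t)\right], \tag{1}$$
where the expectation on the right-hand side is taken over the initial-state density $p_0(s_0)$ and the densities $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ and $\pi(a_t \mid s_t)$ for all time steps $t$. Reinforcement learning algorithms [Sutton and Barto, 1998] aim to maximize Eq. (1) with respect to $\pi$.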
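As a minimal illustration of Eq. (1), the sketch below estimates the expected discounted return by Monte Carlo averaging over sampled reward sequences. The function names are hypothetical, and truncating the infinite sum at the end of each finite rollout is an assumption made only for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for one finite reward sequence (truncated horizon)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def estimate_performance(reward_sequences, gamma):
    """Monte Carlo estimate of Eq. (1): average discounted return over rollouts."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))

# Two toy rollouts of length three with discount factor 0.5.
rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
performance = estimate_performance(rollouts, gamma=0.5)
```

With a discount factor below one, later rewards contribute geometrically less, so long rollouts approximate the infinite-horizon quantity well.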
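To characterize the distribution of state-action pairs generated by an arbitrary policy $\pi$, the occupancy measure is defined as follows.
Definition 3.1 (Puterman [1994]).
Define the occupancy measure $\rho_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$,
$$\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid \pi), \tag{2}$$
where $\Pr(s_t = s \mid \pi)$ is the probability density of state $s$ at time step $t$ following policy $\pi$.
The occupancy measure of $\pi$, $\rho_\pi(s, a)$, can be interpreted as an unnormalized density of state-action pairs. The occupancy measure plays an important role in the IL literature because of the following one-to-one correspondence with the policy.
Theorem 3.2.
(Theorem 2 of Syed et al. [2008]) Suppose $\rho$ is the occupancy measure for $\pi_\rho(a \mid s) \triangleq \rho(s, a) / \sum_{a'} \rho(s, a')$. Then $\pi_\rho$ is the only policy whose occupancy measure is $\rho$.
In this work, we also define the normalized occupancy measure $p_\pi(s, a)$,
$$p_\pi(s, a) \triangleq \frac{\rho_\pi(s, a)}{\sum_{s', a'} \rho_\pi(s', a')} = (1 - \gamma)\, \rho_\pi(s, a).$$
The normalized occupancy measure can be interpreted as a probability density of state-action pairs that an agent experiences in the environment with policy $\pi$.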
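The relationship between a policy and its occupancy measure can be checked numerically in a small tabular MDP. The sketch below is an illustration under assumptions: the two-state, two-action MDP, the array layout `P[s, a, s']`, and the function name are all hypothetical. It computes the discounted state visitation in closed form, builds the occupancy measure, and recovers the policy from it as in Theorem 3.2.

```python
import numpy as np

def occupancy_measure(P, pi, p0, gamma):
    """Occupancy measure rho[s, a] of policy pi in a tabular MDP.

    P[s, a, t] is the probability of moving to state t from (s, a),
    pi[s, a] is the policy, and p0[s] is the initial-state distribution.
    """
    n_states = P.shape[0]
    # State-to-state transition matrix under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # d(s) = sum_t gamma^t Pr(s_t = s) solves d = p0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, p0)
    return d[:, None] * pi

# Hypothetical toy MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
p0 = np.array([1.0, 0.0])
gamma = 0.9

rho = occupancy_measure(P, pi, p0, gamma)
p_normalized = (1.0 - gamma) * rho                      # normalized occupancy measure
pi_recovered = rho / rho.sum(axis=1, keepdims=True)     # policy recovered from rho
```

The total mass of `rho` equals $1/(1-\gamma)$, which is why multiplying by $(1-\gamma)$ yields a proper probability distribution.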
3.2 Generative adversarial imitation learning (GAIL)
The problem setting of IL is that, given trajectories $\{(s_i, a_i)\}_{i=1}^{n}$ generated by an expert $\pi_E$, we are interested in optimizing the agent policy $\pi_\theta$ to recover the expert policy $\pi_E$ with the trajectories and the MDP tuple without the reward function $\mathcal{R}$.
GAIL [Ho and Ermon, 2016] is a state-of-the-art IL method that performs occupancy measure matching to learn a parameterized policy. Occupancy measure matching aims to minimize the objective $d(\rho_{\pi_\theta}, \rho_{\pi_E})$, where $d$ is a distance function. The key idea behind GAIL is that it uses generative adversarial training to estimate the distance and minimize it alternately. To be precise, the distance is the Jensen-Shannon divergence (JSD), which is estimated by solving a binary classification problem. This leads to the following min-max optimization problem:
$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s,a) \sim p_\theta}[\log D_w(s, a)] + \mathbb{E}_{(s,a) \sim p_{\mathrm{opt}}}[\log(1 - D_w(s, a))], \tag{3}$$
where $p_\theta$ and $p_{\mathrm{opt}}$ are the corresponding normalized occupancy measures for $\pi_\theta$ and the optimal (expert) policy $\pi_{\mathrm{opt}}$, respectively. $D_w$ is called a discriminator, and it can be shown that if the discriminator has infinite capacity, the global optimum of Eq. (3) corresponds to the JSD up to a constant [Goodfellow et al., 2014]. To update the agent policy $\pi_\theta$, GAIL treats the loss $-\log D_w(s, a)$ as a reward signal, and the agent can be updated with RL methods such as trust region policy optimization (TRPO) [Schulman et al., 2015]. A weakness of GAIL is that if the given demonstrations are non-optimal, then the learned policy will be non-optimal as well.
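To make the min-max objective concrete, the following sketch evaluates an empirical version of the GAIL objective from discriminator outputs and derives the reward passed to the RL step. It assumes the convention that the discriminator outputs values in $(0, 1)$ that are high for agent samples and low for demonstration samples; the function names are hypothetical and no discriminator training loop is shown.

```python
import numpy as np

def gail_objective(d_agent, d_demo):
    """Empirical GAIL objective: mean log D on agent samples
    plus mean log(1 - D) on demonstration samples."""
    d_agent = np.asarray(d_agent, dtype=float)
    d_demo = np.asarray(d_demo, dtype=float)
    return float(np.mean(np.log(d_agent)) + np.mean(np.log(1.0 - d_demo)))

def gail_reward(d_agent):
    """Reward for the agent: -log D(s, a), large when the discriminator
    mistakes agent samples for demonstrations (D close to 0)."""
    return -np.log(np.asarray(d_agent, dtype=float))

# An uninformative discriminator outputs 0.5 everywhere.
value_at_chance = gail_objective([0.5, 0.5], [0.5, 0.5])
rewards = gail_reward([0.9, 0.1])
```

The discriminator ascends this objective while the policy is updated to increase the reward, which is the alternating scheme described above.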
4 Imitation learning with confidence and unlabeled data
In this section, we present two approaches to learning from imperfect demonstrations with confidence and unlabeled data. The first approach is 2IWIL, which aims to learn a probabilistic classifier to predict confidence scores of unlabeled demonstration data and then performs standard GAIL with the reweighted distribution. The second approach is IC-GAIL, which forgoes learning a classifier and learns an optimal policy by performing occupancy measure matching with unlabeled demonstration data. Details of the derivations and proofs in this section can be found in the Appendix.
4.1 Problem setting
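First, we formalize the problem setting considered in this paper. For conciseness, in what follows we use $x$ in place of $(s, a)$. Consider the case where the given imperfect demonstrations are sampled from an optimal policy $\pi_{\mathrm{opt}}$ and non-optimal policies $\Pi = \{\pi_i\}_{i=1}^{K}$. Denote the corresponding normalized occupancy measures of $\pi_{\mathrm{opt}}$ and $\Pi$ by $p_{\mathrm{opt}}$ and $p_{\mathrm{non}}$, respectively. The normalized occupancy measure $p(x)$ of a state-action pair $x$ is therefore the weighted sum of $p_{\mathrm{opt}}$ and $p_{\mathrm{non}}$,
$$p(x) = \alpha\, p_{\mathrm{opt}}(x) + \sum_{i=1}^{K} \nu_i\, p_{\pi_i}(x) = \alpha\, p_{\mathrm{opt}}(x) + (1 - \alpha)\, p_{\mathrm{non}}(x),$$
where $\alpha + \sum_{i=1}^{K} \nu_i = 1$ and $p_{\mathrm{non}}(x) = \frac{1}{1 - \alpha} \sum_{i=1}^{K} \nu_i\, p_{\pi_i}(x)$. We may further follow traditional classification notation by defining $p_{\mathrm{opt}}(x) \triangleq p(x \mid y = +1)$ and $p_{\mathrm{non}}(x) \triangleq p(x \mid y = -1)$, where $y = +1$ indicates that $x$ is drawn from the occupancy measure of the optimal policy and $y = -1$ indicates the non-optimal policies. Here, $\alpha = \Pr(y = +1)$ is the class-prior probability of the optimal policy. We further assume that an oracle labels state-action pairs in the demonstration data with confidence scores $r(x) \triangleq p(y = +1 \mid x)$. Based on this, the normalized occupancy measure of the optimal policy can be expressed by Bayes' rule as
$$p_{\mathrm{opt}}(x) = \frac{r(x)\, p(x)}{\alpha}. \tag{4}$$
We assume that labeling state-action pairs by the oracle can be costly and only some pairs are labeled with confidence. More precisely, we obtain demonstration datasets as follows,
$$\mathcal{D}_c \triangleq \{(x_{c,i}, r_i)\}_{i=1}^{n_c} \overset{\mathrm{i.i.d.}}{\sim} q(x, r), \qquad \mathcal{D}_u \triangleq \{x_{u,i}\}_{i=1}^{n_u} \overset{\mathrm{i.i.d.}}{\sim} p(x),$$
where $q(x, r) = p(x)\, p_1(r \mid x)$ and $p_1(r \mid x) = \delta(r - r(x))$ is a delta distribution. Our goal is to consider the case where $\mathcal{D}_c$ is scarce and we want to learn the optimal policy $\pi_{\mathrm{opt}}$ with $\mathcal{D}_c$ and $\mathcal{D}_u$ jointly.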
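The mixture setting and the Bayes-rule identity can be checked numerically on a one-dimensional toy problem. In the sketch below, the two Gaussian densities standing in for the optimal and non-optimal occupancy measures, the class prior of 0.3, and all variable names are assumptions chosen for illustration. It confirms two consequences of the setting: the confidence averaged under the mixture equals the class prior, and reweighting the mixture by confidence over class prior recovers statistics of the optimal density.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

alpha = 0.3                                  # assumed class prior
x = np.linspace(-12.0, 12.0, 24001)          # integration grid
dx = x[1] - x[0]

p_opt = gaussian_pdf(x, 2.0, 1.0)            # stand-in for the optimal density
p_non = gaussian_pdf(x, -1.0, 1.5)           # stand-in for the non-optimal density
p = alpha * p_opt + (1.0 - alpha) * p_non    # demonstration mixture

r = alpha * p_opt / p                        # confidence r(x) = p(y = +1 | x)

# E_p[r(x)] recovers the class prior.
alpha_from_conf = np.sum(r * p) * dx
# E_p[(r / alpha) * x] equals the mean of the optimal density (2.0 here).
mean_opt_reweighted = np.sum((r / alpha) * x * p) * dx
```

The first identity motivates estimating the class prior by the mean of the observed confidence scores.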
4.2 Two-step importance weighting imitation learning
We first propose an approach based on the importance sampling scheme. By Eq. (4), the GAIL objective in Eq. (3) can be rewritten as follows:
$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x, r \sim q}\left[\frac{r}{\alpha} \log(1 - D_w(x))\right]. \tag{5}$$
In practice, we may use the mean of the confidence scores to estimate the class prior $\alpha$. Although we can reweight the confidence data to match the optimal distribution, we have a limited number of confidence data and it is difficult to perform accurate sample estimation. To make full use of unlabeled data, the key idea is to identify confidence scores of the given unlabeled data and reweight both confidence data and unlabeled data. To achieve this, we train a probabilistic classifier from confidence data and unlabeled data; we call this learning problem semi-conf (SC) classification.
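The importance-weighted objective can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, discriminator outputs are taken as given, and the class prior is estimated by the mean confidence as suggested in the text.

```python
import numpy as np

def weighted_gail_objective(d_agent, d_demo, conf):
    """Empirical importance-weighted GAIL objective.

    d_agent: discriminator outputs on agent samples.
    d_demo:  discriminator outputs on demonstration samples.
    conf:    confidence scores (given or predicted) of the demonstration samples.
    """
    d_agent = np.asarray(d_agent, dtype=float)
    d_demo = np.asarray(d_demo, dtype=float)
    conf = np.asarray(conf, dtype=float)
    alpha_hat = conf.mean()            # class-prior estimate: mean confidence
    weights = conf / alpha_hat         # importance weights with unit mean
    agent_term = np.mean(np.log(d_agent))
    demo_term = np.mean(weights * np.log(1.0 - d_demo))
    return float(agent_term + demo_term)

# A demonstration with zero confidence drops out of the objective.
value = weighted_gail_objective([0.5], [0.2, 0.9], [1.0, 0.0])
```

Because the weights average to one, the scale of the demonstration term is unchanged; only the relative influence of each sample shifts toward those judged likely to be optimal.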
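Let us first consider a standard binary classification problem to classify samples into $p_{\mathrm{opt}}$ ($y = +1$) and $p_{\mathrm{non}}$ ($y = -1$). Let $g$ be a real-valued prediction function on state-action pairs and $\ell : \mathbb{R} \to \mathbb{R}_{+}$ be a loss function. The optimal classifier can be learned by minimizing the following risk:
$$R_{\mathrm{PN},\ell}(g) = \alpha\, \mathbb{E}_{x \sim p_{\mathrm{opt}}}[\ell(g(x))] + (1 - \alpha)\, \mathbb{E}_{x \sim p_{\mathrm{non}}}[\ell(-g(x))], \tag{6}$$
where PN stands for “positive-negative”. However, as we only have samples from the mixture distribution $p$ instead of samples separately drawn from $p_{\mathrm{opt}}$ and $p_{\mathrm{non}}$, it is not straightforward to conduct sample estimation of the risk in Eq. (6). To overcome this issue, we express the risk in an alternative way that can be estimated only from $\mathcal{D}_c$ and $\mathcal{D}_u$ in the following theorem.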
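Theorem 4.1.
The classification risk in Eq. (6) can be equivalently expressed as
$$R_{\mathrm{SC},\ell}(g) = \mathbb{E}_{x, r \sim q}\big[r\,(\ell(g(x)) - \ell(-g(x))) + (1 - \beta)\, \ell(-g(x))\big] + \mathbb{E}_{x \sim p}\big[\beta\, \ell(-g(x))\big], \tag{7}$$
where $\beta \in [0, 1]$ is an arbitrary combination coefficient.
Thus, we can obtain a probabilistic classifier by minimizing Eq. (7), which can be estimated only with $\mathcal{D}_c$ and $\mathcal{D}_u$. Once we obtain the prediction function $g$, we can use it to give confidence scores for $\mathcal{D}_u$.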
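One way to see why such a risk can be estimated from the mixture alone is the derivation sketch below. It assumes the notation of the problem setting: the mixture $p$, the confidence $r(x) = p(y = +1 \mid x)$, the class prior $\alpha$, the joint density $q(x, r)$ whose marginal over $x$ is $p$, and an arbitrary coefficient $\beta \in [0, 1]$. The second line uses $\alpha\, p_{\mathrm{opt}} = r\, p$ and $(1 - \alpha)\, p_{\mathrm{non}} = (1 - r)\, p$; the last line splits the expectation of $\ell(-g(x))$ between confidence data and unlabeled data.

```latex
\begin{align*}
R_{\mathrm{PN},\ell}(g)
  &= \alpha\, \mathbb{E}_{x \sim p_{\mathrm{opt}}}[\ell(g(x))]
   + (1 - \alpha)\, \mathbb{E}_{x \sim p_{\mathrm{non}}}[\ell(-g(x))] \\
  &= \mathbb{E}_{x \sim p}[r(x)\, \ell(g(x))]
   + \mathbb{E}_{x \sim p}[(1 - r(x))\, \ell(-g(x))] \\
  &= \mathbb{E}_{x \sim p}\big[r(x)\,(\ell(g(x)) - \ell(-g(x)))\big]
   + \mathbb{E}_{x \sim p}[\ell(-g(x))] \\
  &= \mathbb{E}_{x, r \sim q}\big[r\,(\ell(g(x)) - \ell(-g(x)))
   + (1 - \beta)\, \ell(-g(x))\big]
   + \beta\, \mathbb{E}_{x \sim p}[\ell(-g(x))].
\end{align*}
```

Every term in the last line is an expectation over either confidence data or unlabeled data, so no samples drawn separately from the optimal or non-optimal densities are needed.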
To make the prediction function $g$ estimate confidence accurately, the loss function in Eq. (7) should come from the class of strictly proper composite losses [Buja et al., 2005; Reid and Williamson, 2010]. Many losses such as the squared loss, logistic loss, and exponential loss are proper composite. For example, if we obtain $g^{*}$ that minimizes the logistic loss $\ell_{\log}(z) = \log(1 + \exp(-z))$, we can obtain confidence scores by passing prediction outputs to a sigmoid function, $\hat{r}(x) = [1 + \exp(-g^{*}(x))]^{-1}$ [Reid and Williamson, 2010]. On the other hand, the hinge loss cannot be applied since it is not a proper composite loss and cannot estimate confidence reliably [Bartlett and Tewari, 2007; Reid and Williamson, 2010]. Therefore, we can obtain a probabilistic classifier from a prediction function $g$ that is learned with a strictly proper composite loss. After obtaining a probabilistic classifier, we optimize the importance weighted objective in Eq. (5), where both $\mathcal{D}_c$ and $\mathcal{D}_u$ are used to estimate the second expectation. We summarize this training procedure in Algorithm 1.
Next, we discuss the choice of the combination coefficient $\beta$. Since we have access to the empirical unbiased estimator $\widehat{R}_{\mathrm{SC},\ell}(g)$ from Eq. (7), it is natural to find the minimum variance estimator among them. The following proposition gives the optimal $\beta$ in terms of the estimator variance.
Proposition 4.2 (variance minimality).
Let $\sigma_{\mathrm{cov}}$ denote the covariance between $n_c^{-1} \sum_{i=1}^{n_c} r_i \{\ell(g(x_{c,i})) - \ell(-g(x_{c,i}))\}$ and $n_c^{-1} \sum_{i=1}^{n_c} \ell(-g(x_{c,i}))$. For a fixed $g$, the estimator $\widehat{R}_{\mathrm{SC},\ell}(g)$ has the minimum variance when $\beta = \mathrm{clip}_{[0,1]}\left(\frac{n_u}{n_c + n_u} + \frac{\sigma_{\mathrm{cov}}}{\mathrm{Var}(\ell(-g(x)))} \frac{n_c n_u}{n_c + n_u}\right)$, where $\mathrm{clip}_{[l,u]}(v) \triangleq \max\{l, \min\{v, u\}\}$.
Thus, $\beta$ lies in $(0, 1)$ when the covariance is not so large. If $\beta \neq 0$, it means that the unlabeled data does help the classifier by reducing empirical variance when Eq. (7) is adopted. However, computing the $\beta$ that minimizes empirical variance is computationally inefficient since it involves computing $\sigma_{\mathrm{cov}}$ and $\mathrm{Var}(\ell(-g(x)))$. In practice, we use $\beta = \frac{n_u}{n_c + n_u}$ for all experiments by assuming that the covariance is small enough.
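A minimal sketch of the empirical SC risk with the logistic loss is given below. It assumes a risk made of two parts: a confidence-data term $r\,(\ell(g) - \ell(-g)) + (1 - \beta)\,\ell(-g)$ and an unlabeled-data term $\beta\,\ell(-g)$. The default $\beta$ ignores the covariance, and the function names are hypothetical. The prediction function is represented only by its outputs, so any model can be plugged in.

```python
import numpy as np

def logistic_loss(z):
    """log(1 + exp(-z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def sc_risk(g_c, r_c, g_u, beta=None):
    """Empirical semi-conf risk.

    g_c: prediction outputs g(x) on confidence data.
    r_c: confidence scores of the confidence data.
    g_u: prediction outputs g(x) on unlabeled data.
    beta: combination coefficient; defaults to n_u / (n_c + n_u).
    """
    g_c = np.asarray(g_c, dtype=float)
    r_c = np.asarray(r_c, dtype=float)
    g_u = np.asarray(g_u, dtype=float)
    if beta is None:
        beta = len(g_u) / (len(g_c) + len(g_u))
    pos_c = logistic_loss(g_c)     # loss for predicting the optimal class
    neg_c = logistic_loss(-g_c)    # loss for predicting the non-optimal class
    neg_u = logistic_loss(-g_u)
    conf_term = np.mean(r_c * (pos_c - neg_c) + (1.0 - beta) * neg_c)
    unlabeled_term = beta * np.mean(neg_u)
    return float(conf_term + unlabeled_term)

def predicted_confidence(g):
    """Sigmoid link that turns logistic-loss outputs into confidence scores."""
    return 1.0 / (1.0 + np.exp(-np.asarray(g, dtype=float)))
```

Since the first term can be negative for individual samples, the empirical value is not guaranteed to be non-negative, unlike the ordinary classification risk.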
In our preliminary experiments, we sometimes observed that the empirical estimate of Eq. (7) became negative and led to overfitting. We can mitigate this phenomenon by employing a simple yet highly effective technique from Kiryo et al. [2017], which was proposed to solve a similar overfitting problem (see the Appendix for implementation details).
4.2.1 Theoretical Analysis
Below, we show that the estimation error of Eq. (7) can be bounded. This means that its minimizer is asymptotically equivalent to the minimizer of the standard classification risk $R_{\mathrm{PN},\ell}$, which provides a consistent estimator of $p(y = +1 \mid x)$. We provide the estimation error bound with Rademacher complexity [Bartlett and Mendelson, 2002]. Let $\mathfrak{R}_n(\mathcal{G})$ denote the Rademacher complexity of the function class $\mathcal{G}$ with sample size $n$.
Theorem 4.3.
Let be the hypothesis class we use. Assume that the loss function is Lipschitz continuous, and that there exists a constant such that for any . Let and . For , with probability at least over repeated sampling of data for training ,
Thus, we may safely obtain a probabilistic classifier by minimizing the empirical estimate of Eq. (7), which gives a consistent estimator.
4.3 IC-GAIL
Since 2IWIL is a two-step approach that first gathers more confidence data and then conducts importance sampling, the error may accumulate over the two steps and degrade the performance. Therefore, we propose IC-GAIL, which can be trained in an end-to-end fashion and performs occupancy measure matching with the optimal normalized occupancy measure directly.
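Recall that $p = \alpha\, p_{\mathrm{opt}} + (1 - \alpha)\, p_{\mathrm{non}}$. Our key idea here is to minimize the divergence between $p$ and $p'$, where $p' = \alpha\, p_\theta + (1 - \alpha)\, p_{\mathrm{non}}$. Intuitively, the divergence between $p_\theta$ and $p_{\mathrm{opt}}$ is minimized if that between $p'$ and $p$ is minimized. For the Jensen-Shannon divergence, this intuition can be justified in the following theorem.
Theorem 4.4.
Denote that
$$V(\pi_\theta, D_w) = \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \mathbb{E}_{x \sim p'}[\log D_w(x)],$$
and that $C(\pi_\theta) = \max_{w} V(\pi_\theta, D_w)$. Then, $V(\pi_\theta, D_w)$ is maximized when $D_w = \frac{p'}{p + p'}$, and its maximum value is $C(\pi_\theta) = -\log 4 + 2\, \mathrm{JSD}(p \,\|\, p')$. Thus, $C(\pi_\theta)$ is minimized if and only if $p_\theta = p_{\mathrm{opt}}$ almost everywhere.
Theorem 4.4 implies that the optimal policy can be found by solving the following objective,
$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \mathbb{E}_{x \sim p'}[\log D_w(x)]. \tag{8}$$
The expectation in the first term can be approximated from $\mathcal{D}_u$, while the expectation in the second term is the weighted sum of the expectations over $p_\theta$ and $p_{\mathrm{non}}$. Data sampled from $p_\theta$ can be obtained by executing the current policy $\pi_\theta$. However, we cannot directly obtain samples from $p_{\mathrm{non}}$ since it is unknown. To overcome this issue, we establish the following theorem.
Theorem 4.5.
$V(\pi_\theta, D_w)$ can be transformed to $\widetilde{V}(\pi_\theta, D_w)$, which is defined as follows:
$$\widetilde{V}(\pi_\theta, D_w) = \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \alpha\, \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x, r \sim q}[(1 - r) \log D_w(x)]. \tag{9}$$
We can approximate Eq. (9) given finite samples $\mathcal{D}_c$, $\mathcal{D}_u$, and samples generated by the agent policy $\pi_\theta$. In practice, we perform alternating gradient descent with respect to $\theta$ and $w$ to solve this optimization problem. Below, we show that the estimation error of $\widetilde{V}$ can be bounded for a fixed agent policy $\pi_\theta$.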
4.3.1 Theoretical analysis
In this subsection, we show that the estimation error of Eq. (9) can be bounded, given a fixed agent policy $\pi_\theta$. Let $\widehat{V}(\pi_\theta, D_w)$ be the empirical estimate of Eq. (9).
Theorem 4.6.
Let be a parameter space for training the discriminator and be its hypothesis space. Assume that there exist a constant such that and for any and . Assume that both and for any have Lipschitz norms no more than . For a fixed agent policy , let be a sample generated from , , and . Then, for , the following holds with probability at least :
4.3.2 Practical implementation of IC-GAIL
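Even though Eq. (9) is theoretically supported, when the class prior $\alpha$ is low, the influence of the agent becomes marginal in the discriminator training. This issue can be mitigated by thresholding $\alpha$ in Eq. (9) as follows:
$$\min_{\theta} \max_{w} \; \mathbb{E}_{x \sim p}[\log(1 - D_w(x))] + \lambda\, \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + (1 - \lambda)\, \mathbb{E}_{x, r \sim q}\left[\frac{1 - r}{1 - \alpha} \log D_w(x)\right], \tag{10}$$
where $\lambda = \max\{\tau, \alpha\}$ and $\tau \in (0, 1]$. The training procedure of IC-GAIL is summarized in Algorithm 2. Note that Eq. (10) returns to the GAIL objective in Eq. (3) applied to the demonstration mixture and learns a suboptimal policy when $\tau = 1$.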
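The discriminator objective of IC-GAIL can be sketched as below. This is an illustration under assumptions rather than the reference implementation: it assumes the objective consists of a demonstration term $\log(1 - D)$ on unlabeled data, an agent term $\log D$ weighted by $\lambda$, and a confidence-data term $\frac{1 - r}{1 - \alpha} \log D$ weighted by $1 - \lambda$, with $\lambda = \max\{\tau, \alpha\}$. The function and argument names are hypothetical.

```python
import numpy as np

def ic_gail_objective(d_u, d_agent, d_c, r_c, alpha, tau=None):
    """Empirical IC-GAIL discriminator objective (requires alpha < 1).

    d_u:     discriminator outputs on unlabeled demonstrations.
    d_agent: discriminator outputs on agent samples.
    d_c:     discriminator outputs on confidence data.
    r_c:     confidence scores of the confidence data.
    alpha:   class prior of optimal demonstrations.
    tau:     threshold; None disables thresholding (lambda = alpha).
    """
    d_u = np.asarray(d_u, dtype=float)
    d_agent = np.asarray(d_agent, dtype=float)
    d_c = np.asarray(d_c, dtype=float)
    r_c = np.asarray(r_c, dtype=float)
    lam = alpha if tau is None else max(tau, alpha)
    demo_term = np.mean(np.log(1.0 - d_u))
    agent_term = lam * np.mean(np.log(d_agent))
    # Confidence data reweighted toward the non-optimal part of the mixture.
    non_term = (1.0 - lam) * np.mean((1.0 - r_c) / (1.0 - alpha) * np.log(d_c))
    return float(demo_term + agent_term + non_term)
```

Without thresholding the weights reduce to $\alpha$ on the agent term and $1 - r$ on the confidence data, while a threshold of one removes the confidence data entirely and leaves plain GAIL on the demonstration mixture.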
4.4 Discussion
To understand the difference between 2IWIL and IC-GAIL, we discuss it from three different perspectives: unlabeled data, confidence data, and the class prior.
Role of unlabeled data: It should be noted that unlabeled data plays different roles in the two methods. In 2IWIL, unlabeled data reduces the variance of the empirical risk estimator, as shown in Proposition 4.2.
On the other hand, in addition to making estimation more accurate, the usefulness of unlabeled data in IC-GAIL is similar to guided exploration [Kang et al., 2018]. We may liken confidence information in the imperfect demonstration setting to reward functions, since both of them allow agents to learn an optimal policy in IL and RL, respectively. Likewise, fewer confidence data can be analogous to sparse reward functions. Even though a small number of confidence data and sparse reward functions do not make objectives such as Eqs. (1) and (5) biased, they cause practical issues such as a deficiency in information for exploration. To mitigate the problem, we imitate from sub-optimal demonstrations and use confidence information to refine the learned policy, which is similar to Kang et al. [2018] in the sense that they imitate a sub-optimal policy to guide RL algorithms in the sparse reward setting.
Role of confidence data: In 2IWIL, confidence data is utilized to train a classifier and to reweight, which results in the two-step training scheme; the error therefore accumulates in the prediction phase and the occupancy measure matching phase. In contrast, IC-GAIL compensates for the $p_{\mathrm{non}}$ portion in the given imperfect demonstrations by mimicking the composition of $p$. The advantage of IC-GAIL over 2IWIL is that it avoids the prediction error by employing an end-to-end training scheme.
Influence of the class prior $\alpha$: The class prior in 2IWIL, as shown in Eq. (5), serves as a normalizing constant so that the weight $r/\alpha$ for reweighting $p$ to $p_{\mathrm{opt}}$ has unit mean. Consequently, the class prior does not affect the convergence of the agent policy. On the other hand, the term with respect to the agent $p_\theta$ is directly scaled by $\alpha$ in Eq. (9) of IC-GAIL. To comprehend the influence, we may expand the reward function from the optimal discriminator, $-\log D_{w^*}(x) = -\log \frac{\alpha p_\theta(x) + (1 - \alpha) p_{\mathrm{non}}(x)}{\alpha p_\theta(x) + (1 - \alpha) p_{\mathrm{non}}(x) + p(x)}$, which shows that the agent term is scaled by $\alpha$ and makes the reward function close to a constant when $\alpha$ is small. Therefore the agent learns slower than in 2IWIL, where the reward function is $-\log \frac{p_\theta(x)}{p_\theta(x) + p_{\mathrm{opt}}(x)}$.
5 Experiments
In this section, we aim to answer the following questions with experiments. (1) Do 2IWIL and IC-GAIL allow agents to learn near-optimal policies when limited confidence information is given? (2) Are the methods robust when the given confidence is less accurate? (3) Does more unlabeled data result in better performance in terms of average return? The discussions are given in Secs. 5.1, 5.2, and 5.3, respectively.
Setup. To collect demonstration data, we train an optimal policy ($\pi_{\mathrm{opt}}$) using TRPO [Schulman et al., 2015] and select two intermediate policies ($\pi_1$ and $\pi_2$). The three policies are used to generate the same number of state-action pairs. In real-world tasks, the confidence should be given by human labelers. We simulate such labelers by using a probabilistic classifier pre-trained with demonstration data, and randomly choose a subset of the demonstration data to be labeled with confidence scores.
We compare the proposed methods against three baselines. GAIL (U+C) takes all the state-action pairs, from both the confidence data and the unlabeled data, as input without considering confidence. To show whether reweighting using Eq. (5) makes a difference, GAIL (C) and GAIL (Reweight) use the same state-action pairs, those of the confidence data, but GAIL (Reweight) additionally utilizes reweighting with the confidence information. The baselines and the proposed methods are summarized in Table 1.
To assess our methods, we conduct experiments on MuJoCo [Todorov et al., 2012]. Each experiment is performed with five random seeds. The hyper-parameter $\tau$ of IC-GAIL is set to the same value for all tasks. To show the performance with respect to the optimal policy that we try to imitate, the accumulated reward is normalized with that of the optimal policy and a uniform random policy so that $1.0$ indicates the optimal policy and $0.0$ the random one. Due to space limitations, we defer implementation details, the performance of the optimal and the random policies, the specification of each task, and the uncropped figures of Ant-v2 to the Appendix.
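The normalization of the accumulated reward described above amounts to a linear rescaling between the random and the optimal policy. A minimal sketch follows; the function name and the example returns are hypothetical.

```python
def normalized_return(agent_return, random_return, optimal_return):
    """Rescale a return so that the random policy maps to 0 and the optimal policy to 1."""
    return (agent_return - random_return) / (optimal_return - random_return)

# Hypothetical returns used only to illustrate the scale.
score = normalized_return(agent_return=2500.0, random_return=-100.0, optimal_return=5100.0)
```

Values above one are possible if the learned policy happens to exceed the demonstrator's return, and negative values indicate performance worse than random.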
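Table 1: The proposed methods and the baselines.
| Method | Input | Objective |
| --- | --- | --- |
| IC-GAIL | confidence data and unlabeled data | Eq. (9) |
| 2IWIL | confidence data and unlabeled data | Eq. (7) |
| GAIL (U+C) | all state-action pairs, confidence scores ignored | Eq. (3) |
| GAIL (C) | state-action pairs of the confidence data, confidence scores ignored | Eq. (3) |
| GAIL (Reweight) | state-action pairs of the confidence data with confidence scores | Eq. (5) |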
5.1 Performance comparison
The average return against training iterations in Fig. 1 shows that the proposed IC-GAIL and 2IWIL outperform the other baselines by a large margin. Due to the experimental setup described above, the class prior of the optimal demonstration distribution is around $1/3$. To interpret the experimental results, we would like to emphasize that our experiments are conducted under an incomplete optimality setting in which confidence itself is not enough to learn the optimal policy, as indicated by the GAIL (Reweight) baseline. Since the difficulty of each task varies, we use a different amount of demonstration data for different tasks. Our contribution is that, in addition to the confidence, our methods are able to utilize the demonstration mixture (sub-optimal demonstrations) and learn near-optimal policies.
We can observe that IC-GAIL converges slower than 2IWIL. As discussed in Section 4.4, this can be attributed to the fact that the term with respect to the agent in Eq. (10) is scaled by $\lambda$, as specified by $\tau$, which decreases the influence of the agent policy in updating the discriminator. The faster convergence of 2IWIL can be an advantage over IC-GAIL when interactions with environments are expensive. Even though the objective of IC-GAIL becomes biased by not using the class prior $\alpha$, it still converges to near-optimal policies in four tasks.
In Walker2d-v2, the improvement in performance of our methods is not as significant as in other tasks. We conjecture that this is caused by the insufficiency of confidence information. This can be verified by observing that the GAIL (Reweight) baseline in Walker2d-v2 gradually converges to a lower normalized return than it achieves in the other tasks. In HalfCheetah-v2, we observe that the discriminator is stuck in a local maximum in the middle of learning, which influences all methods significantly.
The baseline GAIL (Reweight) surpasses GAIL (C) in all tasks, which shows that reweighting enables the agent to learn policies that obtain higher average return. However, since the number of confidence instances is small, the information is not enough to derive the optimal policies. GAIL (U+C) is the standard GAIL without considering confidence information. Although this baseline uses the same number of demonstrations as our proposed methods, the performance difference is significant due to the use of confidence.
5.2 Robustness to Gaussian noise in confidence
In practice, the oracle that gives confidence scores is typically a human labeler, who may not be able to label confidence accurately all the time. To investigate the robustness of our approaches against noise in the confidence scores, we further conduct an experiment on Ant-v2 where Gaussian noise is added to the confidence scores as follows: $r'(x) = r(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^{2})$. Fig. 2 shows the performance of our methods in this noisy confidence scenario. It reveals that both methods are quite robust to noisy confidence, which suggests that the proposed methods are robust to human labelers, who may not always assign confidence scores correctly.
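The noise injection used in this robustness experiment can be sketched as follows. The function name is hypothetical, and clipping the noisy scores back to the unit interval is an assumption made here so that they remain valid probabilities; the text does not state how out-of-range values are handled.

```python
import numpy as np

def add_confidence_noise(conf, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to confidence scores.

    Clipping to [0, 1] is an assumption of this sketch, not taken from the paper.
    """
    conf = np.asarray(conf, dtype=float)
    noisy = conf + rng.normal(0.0, sigma, size=conf.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.array([0.05, 0.5, 0.95])
noisy = add_confidence_noise(clean, sigma=0.2, rng=rng)
```

Larger values of `sigma` simulate less reliable labelers, which makes it possible to trace how performance degrades with label quality.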
5.3 Influence of unlabeled data
In this experiment, we evaluate the performance of both 2IWIL and IC-GAIL with different amounts of unlabeled data to verify whether unlabeled data is useful. As we can see in Fig. 3, the performance of both methods grows as the amount of unlabeled data increases, which confirms our motivation that using unlabeled data can improve the performance of imitation learning when confidence data is scarce. As discussed in Sec. 4.4, the different roles of unlabeled data in the two proposed methods result in dissimilar learning curves with respect to unlabeled data.
6 Conclusion
In this work, we proposed two general approaches, IC-GAIL and 2IWIL, which allow the agent to utilize both confidence and unlabeled data in imitation learning. The setting considered in this paper is common in real-world scenarios because collecting optimal demonstrations is normally costly. In 2IWIL, we utilized unlabeled data to derive a risk estimator and obtained the minimum variance with respect to the combination coefficient $\beta$. 2IWIL predicts confidence scores for unlabeled data and matches the optimal occupancy measure based on the GAIL objective with importance sampling. For IC-GAIL, we showed that the agent learns an optimal policy by matching a mixture of normalized occupancy measures $p'$ with the normalized occupancy measure of the given demonstrations $p$.
Practically, we conducted extensive experiments to show that our methods outperform baselines by a large margin, to confirm that our methods are robust to noise, and to verify that the amount of unlabeled data has a positive correlation with the performance. The proposed approaches are general and can be easily extended to other IL and IRL methods [Li et al., 2017; Fu et al., 2018; Kostrikov et al., 2019].
For future work, we may extend our methods to a variety of applications such as discrete sequence generation, because the confidence in our work can be treated as a property indicator. For instance, to generate soluble chemicals, we may not have enough soluble chemicals, whereas the Crippen function [Crippen and Snow, 1990] can easily be used to evaluate solubility, which can serve as the confidence in this work.
Acknowledgement
We thank Zhenghang Cui for helpful discussion. MS was supported by KAKENHI 17H00757, NC was supported by MEXT scholarship, and HB was supported by JST, ACT-I, Grant Number JPMJPR18UI, Japan.
References
 Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, pages 1–8, 2004.
 Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

Bartlett and Mendelson [2002]
Peter L Bartlett and Shahar Mendelson.
Rademacher and Gaussian complexities: Risk bounds and structural
results.
Journal of Machine Learning Research
, 3:463–482, 2002.  Bartlett and Tewari [2007] Peter L Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775–790, 2007.
 Buja et al. [2005] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, 2005.
 Burchfiel et al. [2016] Benjamin Burchfiel, Carlo Tomasi, and Ronald Parr. Distance minimization for reward learning from scored trajectories. In AAAI, pages 3330–3336, 2016.
 Chapelle et al. [2006] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. SemiSupervised Learning. MIT press, 2006.
 Crippen and Snow [1990] Gordon M Crippen and Mark E Snow. A 1.8 å resolution potential function for protein folding. Biopolymers: Original Research on Biomolecules, 29(1011):1479–1489, 1990.
 Dewey [2014] Daniel Dewey. Reinforcement learning and the reward engineering principle. In AAAI Spring Symposium Series, 2014.
 ElZahhar and ElGayar [2010] Mohamed M ElZahhar and Neamat F ElGayar. A semisupervised learning approach for soft labeled data. In International Conference on Intelligent Systems Design and Applications, pages 1136–1141, 2010.
 Everitt and Hutter [2016] Tom Everitt and Marcus Hutter. Avoiding wireheading with value reinforcement learning. pages 12–22, 2016.
 Fu et al. [2018] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
 Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
 Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NeurIPS, pages 4565–4573, 2016.
 Hu et al. [2018] Zehong Hu, Yitao Liang, Jie Zhang, Zhao Li, and Yang Liu. Inference aided reinforcement learning for incentive mechanism design in crowdsourcing. In NeurIPS, pages 5508–5518, 2018.
 Ishida et al. [2018] Takashi Ishida, Gang Niu, and Masashi Sugiyama. Binary classification from positive-confidence data. In NeurIPS, pages 5919–5930, 2018.
 Kang et al. [2018] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In ICML, pages 2474–2483, 2018.
 Kiryo et al. [2017] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pages 1675–1685, 2017.
 Kostrikov et al. [2019] Ilya Kostrikov, Kumar Krishna Agrawal, Sergey Levine, and Jonathan Tompson. Addressing sample inefficiency and reward bias in inverse reinforcement learning. In ICLR, 2019.
 Ledoux and Talagrand [1991] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
 Li et al. [2017] Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NeurIPS, pages 3812–3822, 2017.
 Liu et al. [2018] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In ICRA, pages 1118–1125, 2018.
 McDiarmid [1989] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
 Pathak et al. [2017] Deepak Pathak, Ross B Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
 Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.
 Reid and Williamson [2010] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.

Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
 Schaal [1999] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

Schölkopf et al. [1999] Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT press, 1999.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
 Serban et al. [2017] Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. A deep reinforcement learning chatbot. CoRR, abs/1709.02349, 2017.

Shah et al. [2018] Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and online reinforcement learning. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 41–51, 2018.
 Supancic III and Ramanan [2017] James Steven Supancic III and Deva Ramanan. Tracking as online decision-making: Learning a policy from streaming videos with reinforcement learning. In ICCV, pages 322–331, 2017.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning, volume 135. MIT press, 1998.

Syed et al. [2008] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In ICML, pages 1032–1039, 2008.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033, 2012.
 Tokmakov et al. [2017] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In CVPR, pages 531–539, 2017.
 Valko et al. [2012] Michal Valko, Mohammad Ghavamzadeh, and Alessandro Lazaric. Semi-supervised apprenticeship learning. In EWRL, pages 131–142, 2012.
 Vapnik [1998] Vladimir Vapnik. Statistical Learning Theory, volume 3. Wiley, New York, 1998.

Wang et al. [2013] Weihong Wang, Yang Wang, Fang Chen, and Arcot Sowmya. A weakly supervised approach for object detection based on soft-label boosting. In IEEE Workshop on Applications of Computer Vision, pages 331–338, 2013.
 Yeung et al. [2017] Serena Yeung, Vignesh Ramanathan, Olga Russakovsky, Liyue Shen, Greg Mori, and Li Fei-Fei. Learning to learn from noisy web videos. In CVPR, pages 7455–7463, 2017.
 Zhou et al. [2014] Dingfu Zhou, Benjamin Quost, and Vincent Frémont. Soft label based semi-supervised boosting for classification and object recognition. In International Conference on Control Automation Robotics & Vision, pages 1062–1067, 2014.
 Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.
Appendix A Proof for 2IWIL
A.1 Proof of Theorem 4.1
Theorem.
A.2 Proof of Proposition 4.2
Proposition.
Proof.
Let
We may represent in terms of and :
Similarly, we obtain . As a result,