The goal of sequential decision making is to learn a policy that makes good decisions (Puterman1994). As an important branch of sequential decision making, imitation learning (IL) (Russell1998; Schaal1999) aims to learn such a policy from demonstrations (i.e., sequences of decisions) collected from experts. However, high-quality demonstrations can be difficult to obtain in reality, since such experts may not always be available and are sometimes too costly (OsaPNBA018). This is especially true when the quality of decisions depends on specific domain knowledge that is not typically available to amateurs, e.g., in applications such as robot control (OsaPNBA018), autonomous driving (SilverBS12), and the game of Go (SilverEtAl2016).
In practice, demonstrations are often diverse in quality, since it is cheaper to collect them from mixed demonstrators containing both experts and amateurs (AudiffrenVLG15). Unfortunately, IL in such settings tends to perform poorly, since low-quality demonstrations often negatively affect performance (ShiarlisMW16; LeeCO16). For example, demonstrations for robotics can be cheaply collected via a robot simulation (MandlekarZGBSTG18), but demonstrations from amateurs who are not familiar with the robot may damage it, which is catastrophic in the real world (ShiarlisMW16). Similarly, demonstrations for autonomous driving can be collected from drivers on public roads (FridmanEtAl2017), but such low-quality demonstrations may also cause traffic accidents.
When the level of demonstrators’ expertise is known, multi-modal IL (MM-IL) may be used to learn a good policy from diverse-quality demonstrations (LiSE17; HausmanCSSL17; WangEtAl2017). More specifically, MM-IL aims to learn a multi-modal policy where each mode represents the decision making of one demonstrator. Given the level of demonstrators’ expertise, good policies can be obtained by selecting the modes that correspond to high-expertise demonstrators. In reality, however, it is difficult to determine the level of expertise beforehand. Without knowing the level of demonstrators’ expertise, it is difficult to distinguish the decision making of experts from that of amateurs, and thus learning a good policy is quite challenging.
To overcome this issue of MM-IL, existing works have proposed to estimate the quality of each demonstration using additional information from experts (AudiffrenVLG15; Wu2019; BrownGNN19). Specifically, AudiffrenVLG15 proposed a method that infers the quality using similarities between diverse-quality demonstrations and a small number of high-quality demonstrations collected from experts. In contrast, Wu2019 proposed to estimate the quality using a small number of demonstrations with confidence scores, whose values are proportional to the quality and are given by an expert. Similarly, the quality can be estimated using demonstrations that are ranked according to their relative quality by an expert (BrownGNN19). These methods rely on additional information from experts, namely high-quality demonstrations, confidence scores, and rankings. In practice, such information can be scarce or noisy, which leads to poor performance.
In this paper, we consider a novel but realistic setting of IL where only diverse-quality demonstrations are available, while the level of demonstrators’ expertise and additional information from experts are fully absent. To tackle this challenging setting, we propose a new method called variational imitation learning with diverse-quality demonstrations (VILD). The central idea of VILD is to model the level of expertise via a probabilistic graphical model and learn it along with a reward function that represents the intention behind the experts’ decision making. To scale up our model to large state and action spaces, we leverage the variational approach (Jordan1999), which can be implemented using reinforcement learning (RL) (SuttonBarto1998). To further improve data-efficiency when learning the reward function, we utilize importance sampling to re-weight the sampling distribution according to the estimated level of expertise. Experiments on continuous-control benchmarks demonstrate that VILD is robust against diverse-quality demonstrations and significantly outperforms existing methods. Empirical results also show that VILD is a scalable and data-efficient method for realistic settings of IL.
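The re-weighting idea can be illustrated with a minimal sketch: demonstrations are resampled with probability proportional to an estimated expertise weight, so high-expertise demonstrations dominate the reward-learning updates. All names here are hypothetical; this is an illustration of importance-based re-weighting, not VILD’s exact estimator.

```python
import numpy as np

def reweight_demonstrations(demos, expertise_estimates, seed=0):
    """Resample demonstrations with probability proportional to their
    estimated level of expertise (hypothetical sketch of the
    re-weighting idea; not VILD's exact estimator)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(expertise_estimates, dtype=float)
    probs = w / w.sum()  # normalize estimates into a sampling distribution
    idx = rng.choice(len(demos), size=len(demos), replace=True, p=probs)
    return [demos[i] for i in idx]
```

For example, with expertise estimates `[1.0, 0.0]`, only the first demonstration is ever resampled, so an amateur's trajectories would never enter the reward update.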
2 Related Work
In this section, we first discuss the related area of supervised learning with diverse-quality data. Then, we discuss existing IL methods that use the variational approach.
Supervised learning with diverse-quality data.
In supervised learning, diverse-quality data has been studied extensively under the setting of classification with noisy labels (Angluin1988). This setting assumes that human labelers may assign incorrect class labels to training inputs. With such labelers, the obtained dataset consists of high-quality data with correct labels and low-quality data with incorrect labels. To handle this challenging setting, many methods have been proposed (RaykarYZVFBM10; Nagarajan2013; HanYYNXHTS18). The methods most related to ours are probabilistic modeling methods, which aim to infer the correct labels and the level of labelers’ expertise (RaykarYZVFBM10; KhetanLA18). Specifically, RaykarYZVFBM10 proposed a method based on a two-coin model which enables estimating the correct labels and the level of expertise. Recently, KhetanLA18 proposed a method based on weighted loss functions, where the weights are determined by the estimated labels and level of expertise.
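The two-coin model mentioned above can be sketched concretely: each labeler flips one of two biased coins depending on the true label, a "sensitivity" coin when the true label is positive and a "specificity" coin when it is negative. The sketch below shows only the per-label likelihood under hypothesized parameters, not the full EM estimation procedure of RaykarYZVFBM10.

```python
import numpy as np

def two_coin_likelihood(noisy_labels, sensitivity, specificity, true_label):
    """Likelihood of observed noisy labels under the two-coin model:
    sensitivity[j] = P(labeler j outputs 1 | true label is 1),
    specificity[j] = P(labeler j outputs 0 | true label is 0).
    A minimal sketch; the full method estimates these via EM."""
    y = np.asarray(noisy_labels)
    a = np.asarray(sensitivity)
    b = np.asarray(specificity)
    if true_label == 1:
        per_labeler = np.where(y == 1, a, 1.0 - a)
    else:
        per_labeler = np.where(y == 0, b, 1.0 - b)
    return float(per_labeler.prod())
```

Comparing this likelihood for `true_label=1` versus `true_label=0` (weighted by a class prior) is what lets the model infer both the correct labels and each labeler's expertise.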
Methods for supervised learning with diverse-quality data may be used to learn a policy in our setting. However, they tend to perform poorly due to the issue of compounding error (Ross10a). Specifically, supervised learning methods generally assume that the data distributions during training and testing are identical. In IL, however, these distributions differ, since the data distribution depends on the policy being executed (NgR00). This discrepancy causes compounding errors during testing, where prediction errors accumulate and grow in subsequent predictions; as a result, supervised-learning-based methods often perform poorly in IL (Ross10a). The issue becomes even worse with diverse-quality demonstrations, since the data distributions of different demonstrators tend to be highly different. For these reasons, methods for supervised learning with diverse-quality data are not suitable for IL.
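The compounding-error phenomenon can be illustrated with a toy one-dimensional example: even if the learned policy's per-step action error is a small constant (the supervised-learning notion of error), the error feeds back through the dynamics, so the state deviation from the demonstrated trajectory grows with the horizon. This is only an illustrative caricature of the argument, not the formal bound of Ross10a.

```python
def accumulated_deviation(step_error, horizon):
    """Toy illustration of compounding error: a constant per-step
    action error accumulates through the dynamics, so the state
    deviation grows with the horizon even though the per-step
    (supervised) error stays fixed."""
    deviation, history = 0.0, []
    for _ in range(horizon):
        deviation += step_error  # each step's error compounds on the last
        history.append(deviation)
    return history
```

With `step_error = 0.1`, the deviation after 10 steps is ten times the per-step error, whereas a purely supervised view would predict a constant error of 0.1 at every step.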
Variational approach in IL.
The variational approach (Jordan1999) has been previously utilized in IL to perform MM-IL and to reduce over-fitting. Specifically, MM-IL aims to learn a multi-modal policy from diverse demonstrations collected by many experts (LiSE17), where each mode of the policy represents the decision making of one expert. (We emphasize that diverse demonstrations are different from diverse-quality demonstrations: the former are collected by experts who execute equally good policies, and are thus equally high-quality but diverse in behavior, while the latter are collected by mixed demonstrators and are diverse in both quality and behavior.) A multi-modal policy is commonly represented by a context-dependent policy, where each context represents one mode of the policy. The variational approach has been used to learn a distribution over such contexts, e.g., by learning a variational auto-encoder (WangEtAl2017) or by maximizing a variational lower bound of mutual information (LiSE17; HausmanCSSL17). Meanwhile, the variational information bottleneck (VIB) (alemi2017) has been used to reduce over-fitting in IL (peng2018variational). Specifically, VIB compresses the information flow by minimizing a variational bound of mutual information; this compression filters out irrelevant signals, which leads to less over-fitting. Unlike these existing works, we utilize the variational approach to aid in computing integrals over large state-action spaces, and we use neither a variational auto-encoder nor a variational bound of mutual information.
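A context-dependent policy of the kind used in MM-IL can be sketched in a few lines: the context indexes one mode's parameters, and each mode here is a simple linear map from state to mean action. The parameterization is hypothetical and only illustrates how a context selects a mode; real MM-IL methods use neural networks and learn the context distribution variationally.

```python
import numpy as np

def multimodal_policy_mean(state, context, mode_params):
    """Sketch of a context-dependent (multi-modal) policy: the context
    selects one mode's parameters; each mode is a linear map from
    state to mean action (hypothetical parameterization)."""
    W = mode_params[context]  # parameters of the selected mode
    return W @ state          # mean action under that mode
```

Selecting the context associated with a high-expertise demonstrator would recover that demonstrator's behavior, which is exactly the step that requires knowing the level of expertise.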
3 IL from Diverse-quality Demonstrations and its Challenge
Before delving into our main contribution, we first give minimal background on RL and IL. Then, we formulate a new setting of IL with diverse-quality demonstrations, discuss its challenge, and reveal the deficiencies of existing methods.
Reinforcement learning (RL) (SuttonBarto1998)
aims to learn an optimal policy for a sequential decision-making problem, which is often formulated mathematically as a Markov decision process (MDP) (Puterman1994). We consider a finite-horizon MDP with continuous state and action spaces, defined by a tuple $(\mathcal{S}, \mathcal{A}, p(\mathbf{s}_1), p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t), r)$ with a state $\mathbf{s}_t \in \mathcal{S}$, an action $\mathbf{a}_t \in \mathcal{A}$, an initial state density $p(\mathbf{s}_1)$, a transition probability density $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$, and a reward function $r(\mathbf{s}_t, \mathbf{a}_t)$, where the subscript $t$ denotes the time step. A sequence of states and actions, $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)$, is called a trajectory. The decision making of an agent is determined by a policy function $\pi(\mathbf{a}_t|\mathbf{s}_t)$, which is a conditional probability density of an action given a state. RL seeks an optimal policy $\pi^\star$ that maximizes the expected cumulative reward, i.e., $\pi^\star = \operatorname{argmax}_\pi \mathbb{E}_{p_\pi(\tau)}\left[\sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)\right]$, where $p_\pi(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T-1} p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t) \prod_{t=1}^{T} \pi(\mathbf{a}_t|\mathbf{s}_t)$ is the trajectory probability density induced by $\pi$. RL has shown great success recently, especially when combined with deep neural networks (MnihEtAl2015; SilverEtAl2017). However, a major limitation of RL is that it relies on a reward function, which may be unavailable in practice (Russell1998).
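The expected cumulative reward that RL maximizes can be estimated by Monte Carlo rollouts: sample a trajectory from the policy and the dynamics, and sum the rewards along it. The function arguments below are generic placeholders (not a specific environment API), just to make the objective concrete.

```python
import random

def rollout_return(policy, transition, reward, init_state, horizon, seed=0):
    """Monte Carlo estimate of the RL objective E[sum_t r(s_t, a_t)]
    from a single sampled trajectory. `policy`, `transition`, `reward`,
    and `init_state` are placeholder callables, not a real environment."""
    rng = random.Random(seed)
    s = init_state(rng)
    total = 0.0
    for _ in range(horizon):
        a = policy(s, rng)        # a_t ~ pi(a|s_t)
        total += reward(s, a)     # accumulate r(s_t, a_t)
        s = transition(s, a, rng) # s_{t+1} ~ p(s'|s_t, a_t)
    return total
```

Averaging this quantity over many seeded rollouts approximates the expectation over the trajectory density induced by the policy.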
To address the above limitation of RL, imitation learning (IL) was proposed (Schaal1999; NgR00). Without using the reward function, IL aims to learn the optimal policy from demonstrations that encode information about the optimal policy. A common assumption in most IL methods is that demonstrations are collected by demonstrators who execute actions drawn from the optimal policy $\pi^\star(\mathbf{a}_t|\mathbf{s}_t)$ for every state $\mathbf{s}_t$. A graphical model describing this data collection process is depicted in Figure LABEL:figure:pgm_irl2, where a random variable $k$ denotes each demonstrator's identification number and $p(k)$ denotes the probability of collecting a demonstration from the $k$-th demonstrator. Under this assumption, demonstrations (i.e., observed random variables in Figure LABEL:figure:pgm_irl2) are called expert demonstrations and are regarded as drawn independently from the trajectory probability density $p_{\pi^\star}(\tau)$. We note that the variable $k$ does not affect the trajectory density and can be omitted. In this paper, we make the common assumption that $p(\mathbf{s}_1)$ and $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ are unknown but that we can sample states from them.
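The data-collection process in the graphical model above can be sketched in two steps: first draw a demonstrator identity $k$ with probability $p(k)$, then roll out that demonstrator's policy to produce one trajectory. The names below are illustrative placeholders for the abstract process, not part of any concrete method.

```python
import random

def collect_demonstration(policies, p_k, rollout, seed=0):
    """Sketch of the assumed data-collection process: sample a
    demonstrator id k ~ p(k), then roll out the k-th demonstrator's
    policy to obtain one trajectory (illustrative names only)."""
    rng = random.Random(seed)
    k = rng.choices(range(len(policies)), weights=p_k)[0]
    return k, rollout(policies[k], rng)
```

In the expert-only setting every $k$ indexes the same optimal policy, so $k$ can be marginalized out; with diverse-quality demonstrators, in contrast, the identity $k$ determines the quality of the resulting trajectory.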
IL has shown great success in benchmark settings (HoE16; FuEtAl2018; peng2018variational). However, practical applications of IL in the real world remain relatively few (schroecker2018generative). One of the main reasons is that most IL methods aim to learn from expert demonstrations. In practice, such demonstrations are often too costly to obtain due to the limited number of experts, and even when we obtain them, they are often too few to accurately learn the optimal policy (AudiffrenVLG15; Wu2019; BrownGNN19).