Imitation learning has become increasingly important in fields, notably robotics and game AI, where it is easier for an expert to demonstrate a behavior than to translate that behavior into code [Argall09]. Perhaps surprisingly, it has also become central in developing predictors for complex output spaces, e.g. sets and lists [Ross12], parse trees [Daume09], image parsing [Munoz10, Ross11b] and natural language understanding [Duvallet13]. In these domains, a policy is trained to imitate an oracle on ground-truthed data. Iterative training procedures (e.g. DAgger, SEARN, SMILe [Ross11, Daume09, Ross10]) that interleave policy execution and learning have demonstrated impressive practical performance and strong theoretical guarantees that were not possible with batch supervised learning. Most of these approaches to imitation learning, however, neither require nor benefit from information about the cost of actions; rather they leverage only information provided about “correct” actions by the demonstrator.
While iterative training corrects the compounding-of-errors effect one sees in control and decision-making applications, it does not address all issues that arise. Consider, for instance, the problem of learning to drive near the edge of a cliff: methods like DAgger treat all failures to agree with the expert driver equally. If driving immediately off the cliff makes the expert easy to imitate, because the expert simply chooses to go straight from then on, these approaches may learn that very poor policy. More generally, a method that only reasons about agreement with a demonstrator, instead of the long-term costs of various errors, may trade off inevitable mistakes poorly. Even a crude estimate of the cost-to-go (e.g. it’s very expensive to drive off the cliff) may improve a learned policy’s performance at the user’s intended task.
SEARN, by contrast, does reason about cost-to-go, but uses rollouts from the current policy which can be impractical for imitation learning. SEARN additionally requires the use of stochastic policies.
We present a simple, general approach we term AggreVaTe (Aggregate Values to Imitate) that leverages cost-to-go information in addition to correct demonstration, and establish that previous methods can be understood as special cases of a more general no-regret strategy. The approach provides much stronger guarantees than existing methods by providing a statistical regret reduction rather than a statistical error reduction [Beygelzimer2008].
This general strategy of leveraging cost-sensitive no-regret learners can be extended to Approximate Policy Iteration (API) variants for reinforcement learning. We show that any no-regret learning algorithm can be used to develop stable API algorithms with guarantees as strong as any available in the literature. We denote the resulting algorithm NRPI. The results provide theoretical support to the commonly observed success of online policy iteration [sutton00] despite a paucity of formal results: such online algorithms often enjoy no-regret guarantees or share similar stability properties. Our approach suggests a broad new family of algorithms and provides a unifying view of existing techniques for both imitation and reinforcement learning.
2 Imitation Learning with Cost-To-Go
We consider in this work finite-horizon control problems in the form of a Markov Decision Process with states $s$ and actions $a$. (All our results can be easily extended to the infinite-horizon discounted setting.) We assume the existence of a cost function $C(s,a)$, bounded between $0$ and $1$, that we are attempting to optimize over a horizon of $T$ decisions. We denote by $\Pi$ a class of policies mapping states to actions. (More generally, policies may map features of the state, and potentially time, to actions; our derivations do not require full observability and hence carry over to the featurized state of POMDPs.)
We use $Q^{\pi}_t(s,a)$ to denote the expected future cost-to-go of executing action $a$ in state $s$, followed by executing policy $\pi$ for $t-1$ steps. We denote by $d_\pi = \frac{1}{T}\sum_{t=1}^{T} d^t_\pi$ the time-averaged distribution over states induced by executing policy $\pi$ in the MDP ($d^t_\pi$ is the distribution of states at time $t$ induced by executing policy $\pi$). The overall performance metric, the total cost of executing $\pi$ for $T$ steps, is denoted $J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s\sim d^t_\pi}[C(s,\pi(s))]$. We assume system dynamics are either unknown or complex enough that we typically have only sample access to them. The resulting setting for learning policies by demonstration, or by approximate policy iteration, is not a typical i.i.d. supervised learning problem, as the learned policy strongly influences its own test distribution, rendering optimization difficult.
2.2 Algorithm: AggreVaTe
We describe here a simple extension of the DAgger technique of [Ross11] that learns to choose actions to minimize the cost-to-go of the expert rather than the zero-one classification loss of mimicking its actions. In its simplest form, on the first iteration AggreVaTe collects data by simply observing the expert perform the task, and in each trajectory, at a uniformly random time $t$, explores an action $a$ in state $s$, and observes the cost-to-go $Q$ of the expert after performing this action. (See Algorithm 1 below.) This cost-to-go may be estimated by rollout, or provided by the expert.
Each of these steps generates a cost-weighted training example $(s,t,a,Q)$ [Mineiro10] and AggreVaTe trains a policy $\hat\pi_2$ to minimize the expected cost-to-go on this dataset. At each following iteration $n$, AggreVaTe collects data through interaction with the learner as follows: for each trajectory, begin by using the current learner’s policy $\hat\pi_n$ to perform the task, interrupt at a uniformly random time $t$, explore an action $a$ in the current state $s$, after which control is provided back to the expert to continue up to time-horizon $T$. This results in new examples of the cost-to-go of the expert $(s,t,a,Q)$, under the distribution of states visited by the current policy $\hat\pi_n$. This new data is aggregated with all previous data to train the next policy $\hat\pi_{n+1}$; more generally, this data can be used by a no-regret online learner to update the policy and obtain $\hat\pi_{n+1}$. This is iterated for some number of iterations $N$ and the best policy found is returned. We optionally allow the algorithm to continue executing the expert’s actions with small probability $\beta_n$, instead of always executing $\hat\pi_n$, up to the random time $t$ where an action is explored and control is shifted to the expert. The general AggreVaTe is detailed in Algorithm 1.
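The data-collection loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's Algorithm 1: the one-dimensional "cliff" MDP, the tabular follow-the-leader learner in `fit`, and every name below (`step`, `cost`, `expert_cost_to_go`, `aggrevate`) are hypothetical, and the optional mixing with the expert is omitted (the learner's policy is always executed up to the exploration time).

```python
import random

T = 6                 # horizon (toy value)
N_STATES = 5          # positions 0..4; position 0 is the bottom of the cliff
START = 2
ACTIONS = (-1, +1)    # steer left / steer right

def step(s, a):
    """Toy dynamics: state 0 is absorbing, otherwise move and clip to the grid."""
    if s == 0:
        return 0
    return max(0, min(N_STATES - 1, s + a))

def cost(s):
    """Cost 1 for every time step spent at the bottom of the cliff."""
    return 1.0 if s == 0 else 0.0

def expert(s):
    return +1         # hypothetical expert: always steer away from the cliff

def expert_cost_to_go(s, a, steps_left):
    """Cost of taking a in s, then following the expert for the remaining steps."""
    total = cost(s)
    s = step(s, a)
    for _ in range(steps_left - 1):
        total += cost(s)
        s = step(s, expert(s))
    return total

def total_cost(policy):
    """Total cost of running a stationary policy from START for T steps."""
    s, total = START, 0.0
    for _ in range(T):
        total += cost(s)
        s = step(s, policy(s))
    return total

def fit(data):
    """Follow-the-leader on the aggregated data: in each state, pick the action
    with the lowest mean observed cost-to-go (ties and unseen states go left)."""
    stats = {}
    for s, a, q in data:
        tot, cnt = stats.get((s, a), (0.0, 0))
        stats[(s, a)] = (tot + q, cnt + 1)
    def policy(s):
        seen = [a for a in ACTIONS if (s, a) in stats]
        if not seen:
            return ACTIONS[0]
        return min(seen, key=lambda a: stats[(s, a)][0] / stats[(s, a)][1])
    return policy

def aggrevate(n_iters=5, n_traj=200, seed=0):
    rng = random.Random(seed)
    data, learned = [], []
    current = expert                      # first iteration: observe the expert
    for _ in range(n_iters):
        for _ in range(n_traj):
            t = rng.randint(1, T)         # uniformly random exploration time
            s = START
            for _ in range(t - 1):        # run the current policy up to time t
                s = step(s, current(s))
            a = rng.choice(ACTIONS)       # explore one action uniformly
            data.append((s, a, expert_cost_to_go(s, a, T - t + 1)))
        current = fit(data)               # train on ALL data collected so far
        learned.append(current)
    best = min(learned, key=total_cost)
    return best, total_cost(best), data

if __name__ == "__main__":
    _, best_cost, _ = aggrevate()
    print("best learned total cost:", best_cost)
```

In this toy problem the expert's cost-to-go is identical for both actions in the start state, so the first learned policy may drift toward the cliff; the data gathered under that policy then exposes the high cost-to-go of the final step over the edge, which is the effect the text describes.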
Observing the expert’s cost-to-go indicates how much cost we might expect to incur in the future if we take an action now and can then behave as well (or nearly so) as the expert henceforth. Under the assumption that the expert is a good policy, and that the policy class $\Pi$ contains similarly good policies, this provides a rough estimate of what good policies in $\Pi$ will be able to achieve at future steps. By minimizing this cost-to-go at each step, we will choose policies that lead to situations where incurring low future cost-to-go is possible. For instance, we will be able to observe that if some actions put the expert in situations where falling off a cliff or crashing is inevitable, then these actions must be avoided at all costs in favor of those where the expert is still able to recover.
In AggreVaTe the problem of choosing the sequence of policies $\hat\pi_{1:N}$ over iterations is viewed as an online cost-sensitive classification problem. Our analysis below demonstrates that any no-regret algorithm on such problems can be used to update the sequence of policies and provide strong guarantees. To achieve this, when the policy class $\Pi$ is finite, randomized online learning algorithms like weighted majority [CesaBianchi06] may be used. When dealing with infinite policy classes (e.g. all linear classifiers), no-regret online cost-sensitive classification is not always computationally tractable. Instead, reductions of cost-sensitive classification to regression or ranking problems, as well as convex upper bounds [Beygelzimer2008] on the classification loss, typically lead to efficient no-regret online learning algorithms (e.g. gradient descent). The algorithm description suggests as the “default” learning strategy a (Regularized) Follow-The-Leader online learner: it attempts to learn a good classifier for all previous data. For certain loss functions (notably strongly convex surrogates to the cost-sensitive classification loss) [CesaBianchi06], and for any sufficiently stable batch learner [Ross12b, Saha2012interplay], this strategy ensures the no-regret property. It also highlights why the approach is likely to be particularly stable across rounds of interaction.
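For the finite-policy-class case mentioned above, a minimal sketch of a randomized weighted-majority (exponential weights) learner might look as follows. The function names, the fixed learning rate `eta`, and the assumption that each round supplies one loss in [0, 1] per policy are illustrative choices, not details taken from the paper.

```python
import math

def hedge(loss_rounds, eta):
    """Exponential weights over a finite set of K policies.

    loss_rounds: list of per-round loss vectors, one loss in [0, 1] per policy.
    Returns the final distribution over policies and the learner's expected
    cumulative loss when a policy is sampled from the distribution each round.
    """
    k = len(loss_rounds[0])
    weights = [1.0 / k] * k
    learner_loss = 0.0
    for losses in loss_rounds:
        # Expected loss of sampling a policy from the current distribution.
        learner_loss += sum(w * l for w, l in zip(weights, losses))
        # Multiplicative update followed by renormalization.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return weights, learner_loss

def average_regret(loss_rounds, learner_loss):
    """Average regret against the best single policy in hindsight."""
    n, k = len(loss_rounds), len(loss_rounds[0])
    best = min(sum(r[i] for r in loss_rounds) for i in range(k))
    return (learner_loss - best) / n

if __name__ == "__main__":
    n = 200
    rounds = [[0.5, float(i % 2), 0.2] for i in range(n)]
    eta = math.sqrt(8.0 * math.log(3) / n)
    dist, loss = hedge(rounds, eta)
    print("final distribution:", dist)
    print("average regret:", average_regret(rounds, loss))
```

With losses in [0, 1] and this choice of `eta`, the standard analysis of exponential weights bounds the average regret by $\sqrt{\ln K / (2N)}$, which vanishes as the number of rounds grows; this is the no-regret property the analysis relies on.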
2.3 Training the Policy to Minimize Cost-to-Go
In standard “full-information” cost-sensitive classification, a cost vector is provided for each data-point in the training data that indicates the cost of predicting each class or label for this input. In our setting, that implies that for each sampled state we receive a cost-to-go estimate/rollout for all actions. Training the policy at each iteration then simply corresponds to solving a cost-sensitive classification problem. That is, if we collect a dataset of $m$ samples, $D = \{(s_i, t_i, Q_i)\}_{i=1}^{m}$, where $Q_i$ is a cost vector of cost-to-go estimates for each action in state $s_i$ at time $t_i$, then we solve the cost-sensitive classification problem: $\arg\min_{\pi\in\Pi} \sum_{i=1}^{m} Q_i(\pi(s_i, t_i))$. Reductions of cost-sensitive classification to convex optimization problems, such as weighted multi-class Support Vector Machines or ranking [Beygelzimer2008], can be used to obtain problems that can be optimized efficiently while still guaranteeing good performance at this cost-sensitive classification task.
For instance, a simple approach is to transform this into an argmin regression problem: i.e., $\hat\pi_n(s,t) = \arg\min_{a} \hat Q_n(s,t,a)$, for $\hat Q_n$ the learned regressor at iteration $n$ that minimizes the squared loss at predicting the cost-to-go estimates: $\hat Q_n = \arg\min_{Q\in\mathcal{Q}} \sum_{(s_i,t_i,a_i,\hat q_i)\in D} (Q(s_i,t_i,a_i) - \hat q_i)^2$, where $D$ is the dataset of all collected cost-to-go estimates so far, and $\mathcal{Q}$ the class of regressors considered (e.g. linear regressors). This approach also naturally handles the more common situation in imitation learning where we only have partial information for a particular action chosen at a state. Alternate approaches include importance weighting techniques to transform the problem into a standard cost-sensitive classification problem [Horvitz52, Dudik11] and other online learning approaches meant to handle “bandit” feedback.
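The regression reduction can be illustrated with a small, self-contained sketch. It assumes a scalar state feature `x`, one ordinary-least-squares line per action as the regressor class, and partial-information examples `(x, a, q)` in which only the explored action's cost-to-go is observed; all names and the synthetic data are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares for q ~ b0 + b1 * x on a list of (x, q) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    mq = sum(q for _, q in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxq = sum((x - mx) * (q - mq) for x, q in points)
    b1 = sxq / sxx if sxx > 0 else 0.0
    return mq - b1 * mx, b1

def fit_q(data, actions):
    """Fit one cost-to-go regressor per action from (x, a, q) examples.
    Assumes every action appears at least once in the data."""
    by_action = {a: [] for a in actions}
    for x, a, q in data:
        by_action[a].append((x, q))
    return {a: fit_line(pts) for a, pts in by_action.items()}

def greedy_policy(q_hat):
    """Policy that picks the action with the lowest predicted cost-to-go."""
    def policy(x):
        return min(q_hat, key=lambda a: q_hat[a][0] + q_hat[a][1] * x)
    return policy

if __name__ == "__main__":
    # Synthetic data: action 0 costs x, action 1 costs 1 - x, and each
    # example reveals the cost of a single explored action only.
    data = [(i / 10, i % 2, (i / 10 if i % 2 == 0 else 1 - i / 10))
            for i in range(11)]
    policy = greedy_policy(fit_q(data, actions=(0, 1)))
    print(policy(0.2), policy(0.8))
```

Because each example only needs the cost-to-go of the action that was actually explored, the same code covers the partial-information setting without any importance weighting.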
Local Exploration in Partial Information Setting
In the partial information setting we must also select which action to explore for an estimate of cost-to-go. The uniform strategy is simple and effective but inefficient. The problem may be cast as a contextual bandit problem [Auer03, Beygelzimer11] where features of the current state define the context of exploration. These algorithms, by choosing more carefully than at random, may be significantly more sample efficient. In our setting, in contrast with traditional bandit settings, we care only about the final learned policy and not the cost of explored actions along the way. Recent work [Avner12] may be more appropriate for improving performance in this case. Many contextual bandit algorithms require a finite set of policies [Auer03] or full realizability [Li10], and this is an open and very active area of research that could have many applications here.
We analyze AggreVaTe, showing that the no-regret property of online learning procedures can be leveraged in this interactive learning procedure to obtain strong performance guarantees. Our analysis seeks to answer the following question: how well does the learned policy perform if we can repeatedly identify good policies that incur cost-sensitive classification loss competitive with the expert demonstrator on the aggregate dataset we collect during training?
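Our analysis of AggreVaTe relies on connecting the iterative learning procedure with the (adversarial) online learning problem [CesaBianchi06] and using the no-regret property of the underlying cost-sensitive classification algorithm choosing policies $\hat\pi_{1:N}$. Here, the online learning problem is defined as follows: at each iteration $n$, the learner chooses a policy $\hat\pi_n \in \Pi$ that incurs loss $\ell_n$ chosen by the adversary, and defined as $\ell_n(\pi) = \mathbb{E}_{t\sim U(1:T),\, s\sim d^t_{\pi_n}}[Q^*_{T-t+1}(s,\pi)]$ for $U(1:T)$ the uniform distribution on the set $\{1,2,\dots,T\}$, $\pi_n = \beta_n \pi^* + (1-\beta_n)\hat\pi_n$ the policy used to collect data, and $Q^*$ the cost-to-go of the expert $\pi^*$. We can see that AggreVaTe at iteration $n$ is exactly collecting a dataset $D_n$ that provides an empirical estimate of this loss $\ell_n$.
Let $\epsilon_{\text{class}} = \min_{\pi\in\Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{t\sim U(1:T),\, s\sim d^t_{\pi_i}}[Q^*_{T-t+1}(s,\pi) - \min_a Q^*_{T-t+1}(s,a)]$ denote the minimum expected cost-sensitive classification regret achieved by policies in the class $\Pi$ on all the data over the $N$ iterations of training. Denote the online learning average regret of the sequence of policies chosen by AggreVaTe, $\epsilon_{\text{regret}} = \frac{1}{N}\left[\sum_{i=1}^{N} \ell_i(\hat\pi_i) - \min_{\pi\in\Pi}\sum_{i=1}^{N} \ell_i(\pi)\right]$.
We provide guarantees for the “uniform mixture” policy $\bar\pi$, that at the beginning of any trajectory samples a policy $\pi$ uniformly randomly among the policies $\hat\pi_{1:N}$ and executes this policy $\pi$ for the entire trajectory. This immediately implies good performance for the best policy $\hat\pi$ in the sequence $\hat\pi_{1:N}$, i.e. $J(\hat\pi) = \min_{i\in 1:N} J(\hat\pi_i) \le J(\bar\pi)$, and the last policy $\hat\pi_N$ when the distribution of visited states converges over the iterations of learning.
Assume the cost-to-go of the expert $Q^*$ is non-negative and bounded by $Q^*_{\max}$, and $\beta_n \le (1-\alpha)^{n-1}$ for all $n$ for some constant $\alpha$. (The default parameter-free version of AggreVaTe corresponds to $\alpha = 1$, using $0^0 = 1$.) Then the following holds in the infinite sample case (i.e. if at each iteration of AggreVaTe we collected an arbitrarily large amount of data by running the current policy):
After $N$ iterations of AggreVaTe:
$$J(\hat\pi) \le J(\bar\pi) \le J(\pi^*) + T[\epsilon_{\text{class}} + \epsilon_{\text{regret}}] + O\left(\frac{Q^*_{\max} T \log T}{\alpha N}\right).$$
Thus if a no-regret online algorithm is used to pick the sequence of policies $\hat\pi_{1:N}$, then as the number of iterations $N \to \infty$:
$$\lim_{N\to\infty} J(\bar\pi) \le J(\pi^*) + T\epsilon_{\text{class}}.$$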
The proof of this result is presented in the Appendix. This theorem indicates that after sufficient iterations, AggreVaTe will find policies that perform the task nearly as well as the demonstrator if there are policies in $\Pi$ that have small cost-sensitive classification regret on the aggregate dataset (i.e. policies with cost-sensitive classification loss not much larger than that of the Bayes-optimal one on this dataset). Note that non-interactive supervised learning methods are unable to achieve a similar bound, which degrades only linearly with $T$ and the cost-sensitive classification regret [Ross11].
The analysis above abstracts away the issue of action exploration and learning from finite data. These issues come into play in a sample complexity analysis. Such analyses depend on many factors such as the particular reduction and exploration method. When reductions of cost-sensitive classification to simpler regression/ranking/classification [Beygelzimer2005] problems are used, our results can directly relate the task performance of the learned policy to the performance on the simpler problem. To illustrate how such results may be derived, we provide a result for the special case where actions are explored uniformly at random and the reduction of cost-sensitive classification to regression is used.
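In particular, if $\hat\epsilon_{\text{regret}}$ denotes the empirical average online learning regret on the training regression examples collected over the iterations, and $\hat\epsilon_{\text{class}}$ denotes the empirical regression regret of the best regressor in the class on the aggregate dataset of regression examples when compared to the Bayes-optimal regressor, we have that:
$N$ iterations of AggreVaTe, collecting $m$ regression examples $(s,a,t,Q)$ per iteration, guarantees that with probability at least $1-\delta$:
$$J(\hat\pi) \le J(\bar\pi) \le J(\pi^*) + 2\sqrt{|A|}\,T\sqrt{\hat\epsilon_{\text{class}} + \hat\epsilon_{\text{regret}} + O\left(\sqrt{\log(1/\delta)/(Nm)}\right)} + O\left(\frac{Q^*_{\max} T \log T}{\alpha N}\right),$$
where $\bar\pi$ is the uniform mixture policy and $\hat\pi$ the best policy in the sequence, as before.
Thus if a no-regret online algorithm is used to pick the sequence of regressors $\hat Q_{1:N}$, then as the number of iterations $N \to \infty$, with probability 1:
$$\lim_{N\to\infty} J(\bar\pi) \le J(\pi^*) + 2\sqrt{|A|}\,T\sqrt{\hat\epsilon_{\text{class}}}.$$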
The detailed proof is presented in the Appendix. This result demonstrates how the task performance of the learned policies may be related all the way down to the regret on the regression loss at predicting the observed cost-to-go during training. In particular, it relates task performance to the square root of the online learning regret on this regression loss, and to the regression regret of the best regressor in the class relative to the Bayes-optimal regressor on this training data. (The appearance of the square root is particular to the use of this reduction to squared-loss regression and implies relatively slow convergence to good performance. Other cost-sensitive classification reductions and regression losses (e.g. [Langford05SECOC, Beygelzimer09ECT]) do not introduce this square root and still allow efficient learning.)
2.5 Discussion
AggreVaTe as a reduction:
AggreVaTe can be interpreted as a regret reduction of imitation learning to no-regret online learning. (Unfortunately, regret here has two different meanings common in the literature: the first is in the statistical sense of doing nearly as well as the Bayes-optimal predictor [Beygelzimer2008]. The second use is in the online, adversarial, no-regret sense of competing against any hypothesis on a particular sequence without statistical assumptions [CesaBianchi06].) We present a statistical regret reduction, as here performance is related directly to the online, cost-sensitive classification regret on the aggregate dataset. By minimizing cost-to-go, we obtain a regret reduction, rather than a weaker error reduction as in [Ross11] when simply minimizing immediate classification loss.
As just mentioned, in cases where the expert is much better than any policy in $\Pi$, the expert’s cost-to-go may be a very optimistic estimate of the true future cost after taking a certain action. The approach may fail to learn policies that perform well, even if policies that can perform the task (albeit not as well as the expert) exist in the policy class. Consider again the driving scenario, where one may choose one of two roads to reach a goal: a shorter route that involves driving on a very narrow road with cliffs on either side, and a longer route that is safer and risks no cliff. If, in this example, the expert takes the short route and no policy in the class can drive on the narrow road without falling, but there exist policies that can take the longer road and safely reach the goal, this algorithm would fail to find these policies. The reason is that, as we minimize the cost-to-go of the expert, we would always favor policies that head toward the shorter narrow road. But once we are on that road, inevitably at some point we will encounter a scenario where no policy in the class can predict the same low cost-to-go actions as the expert (i.e. making $\epsilon_{\text{class}}$ large in the previous guarantee). The end result is that, in these pathological scenarios, we may learn a policy that takes the short narrow road and eventually falls off the cliff.
Comparison to SEARN:
AggreVaTe shares deep commonalities with SEARN but, by providing a reduction to online learning, allows much more general schemes to update the policy at each iteration, which may be more convenient or efficient than the particular stochastic mixing update of SEARN. These include deterministic ones that provide convex upper bounds on performance. In fact, SEARN may be thought of as a particular case of AggreVaTe, where the policy class is the set of distributions over policies, and the online coordinate descent algorithm (Frank-Wolfe) of [Hazan2012projection] is used to update the distribution over policies at each iteration. Both collect data in a similar fashion at each iteration by executing the current policy up to a random time and then collecting cost-to-go estimates for explored actions in the current state. A distinction is that SEARN collects the cost-to-go of the current policy after execution of the random action, instead of the cost-to-go of the expert. Interestingly, SEARN is usually used in practice with the approximation of collecting the cost-to-go of the expert [Daume09], rather than that of the current policy. Our approach can be seen as providing a theoretical justification for what was previously a heuristic.
3 Reinforcement Learning via No-Regret Policy Iteration
A relatively simple modification of the above approach enables us to develop a family of sample-based approximate policy iteration algorithms. Conceptually, we make a swap: from executing the current policy and then switching to the expert to observe a cost-to-go, to executing the expert policy while collecting the cost-to-go of the learner’s current policy. We denote this family of algorithms No-Regret Policy Iteration (NRPI) and detail and analyze it below.
This alternative has similar guarantees to the previous version, but may be preferred when no policy in the class is as good as the expert or when only a distribution of “important states” is available. In addition, it can be seen to address a general model-free reinforcement learning setting where we simply have a state exploration distribution we can sample from and from which we collect examples of the current policy’s cost-to-go. This is similar in spirit to how Policy Search by Dynamic Programming (PSDP) [Bagnell03, Scherrer2014approximate] proceeds, and in some sense, the algorithm we present here provides a generalization of PSDP. However, by learning a stationary policy instead of a non-stationary policy, NRPI can generalize across time-steps and potentially lead to more efficient learning and practical implementation in problems where $T$ is large or infinite.
Following [Bagnell03, Kakade02] we assume access to a state exploration distribution $\nu_t$ for all times $t$ in $1,2,\dots,T$. As will be justified by our theoretical analysis, these state exploration distributions should ideally be (close to) that of a (near-)optimal policy in the class $\Pi$. In the context where an expert is present, this may simply be the distribution of states induced by the expert policy, i.e. $\nu_t = d^t_{\pi^*}$. In general, this may be the state distributions induced by some base policy we want to improve upon, or be determined from prior knowledge of the task.
Given the exploration distributions $\nu_{1:T}$, NRPI proceeds as follows. At each iteration $n$, it collects cost-to-go examples by sampling uniformly a time $t \in \{1,2,\dots,T\}$, sampling a state $s_t$ from $\nu_t$, and then executing an exploration action $a$ in $s_t$ followed by execution of the current learner’s policy $\hat\pi_n$ for time $t+1$ to $T$, to obtain a cost-to-go estimate $(s_t, a, t, Q)$ of executing $a$ followed by $\hat\pi_n$ in state $s_t$ at time $t$. (In the particular case where $\nu_t = d^t_{\pi}$ for an exploration policy $\pi$, then to sample $s_t$, we would simply execute $\pi$ from time $1$ to $t-1$, starting from the initial state distribution.) Multiple cost-to-go estimates are collected this way and added to dataset $D_n$. After enough data has been collected, we update the learner’s policy, to obtain $\hat\pi_{n+1}$, using any no-regret online learning procedure, on the loss defined by the cost-sensitive classification examples in the new data $D_n$. This is iterated for a large number of iterations $N$. Initially, we may start with $\hat\pi_1$ being any guess of a good policy from the class $\Pi$, or use the expert’s cost-to-go at the first iteration, to avoid having to specify an initial policy. This algorithm is detailed in Algorithm 2.
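The NRPI loop just described can be sketched on a toy problem as follows. This is a hedged illustration rather than the paper's Algorithm 2: the one-dimensional cliff MDP, the choice of a fixed base policy to define the exploration distributions, the tabular follow-the-leader learner, and all function names are assumptions made for the example.

```python
import random

T, N_STATES, START = 6, 5, 2       # toy horizon, grid size and start state
ACTIONS = (-1, +1)                 # steer left / steer right

def step(s, a):
    """Toy dynamics: state 0 (bottom of the cliff) is absorbing."""
    return 0 if s == 0 else max(0, min(N_STATES - 1, s + a))

def cost(s):
    return 1.0 if s == 0 else 0.0

def base_policy(s):
    return +1                      # hypothetical policy defining the exploration states

def sample_exploration_state(t):
    """State at time t when the base policy is run from the start state."""
    s = START
    for _ in range(t - 1):
        s = step(s, base_policy(s))
    return s

def cost_to_go(s, a, steps_left, policy):
    """Cost of taking a in s, then following `policy` for the remaining steps."""
    total = cost(s)
    s = step(s, a)
    for _ in range(steps_left - 1):
        total += cost(s)
        s = step(s, policy(s))
    return total

def total_cost(policy):
    s, total = START, 0.0
    for _ in range(T):
        total += cost(s)
        s = step(s, policy(s))
    return total

def fit(data):
    """Follow-the-leader: per state, pick the lowest mean cost-to-go action."""
    stats = {}
    for s, a, q in data:
        tot, cnt = stats.get((s, a), (0.0, 0))
        stats[(s, a)] = (tot + q, cnt + 1)
    def policy(s):
        seen = [a for a in ACTIONS if (s, a) in stats]
        if not seen:
            return ACTIONS[0]
        return min(seen, key=lambda a: stats[(s, a)][0] / stats[(s, a)][1])
    return policy

def nrpi(n_iters=5, n_samples=200, seed=0):
    rng = random.Random(seed)
    data, learned = [], []
    current = lambda s: ACTIONS[0]            # initial guess: always steer left
    for _ in range(n_iters):
        for _ in range(n_samples):
            t = rng.randint(1, T)             # uniformly random time
            s = sample_exploration_state(t)   # state from the exploration distribution
            a = rng.choice(ACTIONS)           # exploration action
            # Cost-to-go of the LEARNER's current policy, not of an expert.
            data.append((s, a, cost_to_go(s, a, T - t + 1, current)))
        current = fit(data)                   # aggregate and retrain
        learned.append(current)
    best = min(learned, key=total_cost)
    return best, total_cost(best), data

if __name__ == "__main__":
    print("best learned total cost:", nrpi()[1])
```

The only structural difference from the imitation-learning loop is which policy generates the states and which policy's cost-to-go is recorded, which is the swap the text describes.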
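Consider the loss function $\ell_n$ given to the online learning algorithm within NRPI at iteration $n$. Assuming infinite data, it assigns the following loss to each policy $\pi \in \Pi$:
$$\ell_n(\pi) = \mathbb{E}_{t\sim U(1:T),\, s\sim\nu_t}\left[Q^{\hat\pi_n}_{T-t+1}(s,\pi)\right].$$
This loss represents the expected cost-to-go of executing $\pi$ immediately for one step followed by the current policy $\hat\pi_n$, under the exploration distributions $\nu_{1:T}$.
This sequence of losses over the iterations of training corresponds to an online cost-sensitive classification problem, as in the previous AggreVaTe algorithm. Let $\epsilon_{\text{regret}}$ be the average regret of the online learner on this online cost-sensitive classification problem after $N$ iterations of NRPI:
$$\epsilon_{\text{regret}} = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\hat\pi_i) - \min_{\pi\in\Pi}\frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi).$$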
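For any policy $\pi \in \Pi$, denote the average $L_1$ or variational distance between $\nu_t$ and $d^t_\pi$ over time steps $t$ as $D(\nu,\pi) = \frac{1}{T}\sum_{t=1}^{T} \|\nu_t - d^t_\pi\|_1$. Note that if $\nu_t = d^t_\pi$ for all $t$, then $D(\nu,\pi) = 0$.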
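As a small illustration of this quantity, the sketch below computes the time-averaged distance for tabular distributions, assuming the distance intended is the per-time-step L1 distance averaged over the horizon. Representing each distribution as a dictionary from state to probability is a hypothetical choice made for the example.

```python
def average_l1_distance(nu, d_pi):
    """Average over time steps of the L1 distance between two state distributions.

    nu and d_pi are equal-length lists; entry t maps each state to its
    probability at time step t (states missing from a dict have probability 0).
    """
    total = 0.0
    for nu_t, d_t in zip(nu, d_pi):
        states = set(nu_t) | set(d_t)
        total += sum(abs(nu_t.get(s, 0.0) - d_t.get(s, 0.0)) for s in states)
    return total / len(nu)

if __name__ == "__main__":
    nu = [{0: 1.0}, {1: 1.0}]
    d_pi = [{0: 1.0}, {2: 1.0}]
    print(average_l1_distance(nu, nu))     # identical distributions
    print(average_l1_distance(nu, d_pi))   # disjoint support at the second step
```

The value ranges from 0 (identical distributions at every step) to 2 (disjoint supports at every step).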
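Denote by $Q_{\max}$ a bound on cost-to-go (which is always $\le T$). Denote by $\hat\pi = \arg\min_{\pi\in\hat\pi_{1:N}} J(\pi)$ the best policy found by NRPI over the iterations, and by $\bar\pi$ the uniform mixture policy over $\hat\pi_{1:N}$ defined as before. Then NRPI achieves the following guarantee:
For any policy $\pi' \in \Pi$:
$$J(\hat\pi) \le J(\bar\pi) \le J(\pi') + T\epsilon_{\text{regret}} + T Q_{\max} D(\nu,\pi').$$
If a no-regret online cost-sensitive classification algorithm is used:
$$\lim_{N\to\infty} J(\bar\pi) \le J(\pi') + T Q_{\max} D(\nu,\pi').$$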
NRPI thus finds policies that are as good as any other policy $\pi' \in \Pi$ whose state distribution $d^t_{\pi'}$ is close to $\nu_t$ on average over time $t$. Importantly, if $\nu_{1:T}$ corresponds to the state distribution of an optimal policy in class $\Pi$, then this theorem guarantees that NRPI will find an optimal policy (within the class $\Pi$) in the limit.
This theorem provides a similar performance guarantee to the results for PSDP presented in [Bagnell03]. NRPI has the advantage of learning a single policy for test execution instead of one at each time step, allowing for improved generalization and more efficient learning. NRPI imposes stronger requirements: it uses a no-regret online cost-sensitive classification procedure instead of simply a cost-sensitive supervised learner. For finite policy classes $\Pi$, or using reductions of cost-sensitive classification as mentioned previously, we may still obtain convex online learning problems for which efficient no-regret strategies exist, or use the simple aggregation of datasets with any sufficiently stable batch learner [Ross12b, Saha2012interplay].
The result presented here can be interpreted as a reduction of model-free reinforcement learning to no-regret online learning. It is a regret reduction, as performance is related directly to the online regret at the cost-sensitive classification task. However, performance is strongly limited by the quality of the exploration distribution. (One would naturally consider adapting the exploration distributions over the iterations of training. It can be shown that if $\nu^i_{1:T}$ are the exploration distributions at iteration $i$, and we have a mechanism for making $\nu^i_{1:T}$ converge to the state distributions of an optimal policy in $\Pi$ as $i \to \infty$, then we would always be guaranteed to find an optimal policy in $\Pi$. Unfortunately, no known method can guarantee this.)
4 Discussion and Future Work
The work here provides theoretical support for two seemingly unrelated empirical observations. First, and perhaps most crucially, much anecdotal evidence suggests that approximate policy iteration, and especially online variants [sutton00], is more effective and stable than theory and counter-examples to convergence might suggest. This cries out for some explanation; we contend that it can be understood by noting that such online algorithms often enjoy no-regret guarantees or share similar stability properties that can ensure relative performance guarantees.
Similarly, practical implementations of imitation learning-for-structured-prediction methods like SEARN rely on what was previously considered a heuristic of using the expert demonstrator as an estimate of the future cost-to-go. The resulting good performance can be understood as a consequence of this heuristic being a special case of AggreVaTe where the online Frank-Wolfe algorithm [Hazan2012projection] is used to choose policies. Moreover, stochastic mixing is but one of several approaches to achieving good online performance, and deterministic variants have proven more effective in practice [Ross11b].
The resulting algorithms make suggestions for batch approaches as well: they suggest, for instance, that approximate policy iteration procedures (as well as imitation learning ones) are likely to be more stable and effective if they train not only on the cost-to-go of the most recent policy but also on previous policies. At first this may seem counter-intuitive, however, it prevents the oscillations and divergences that at times plague batch approximate dynamic programming algorithms by ensuring that each learned policy is good across many states.
From a broad point of view, this work forms a piece of a growing picture that online algorithms and no-regret analyses– in contrast with the traditional i.i.d. or batch analysis– are important for understanding learning with application to control and decision making [Ross13, Ross12, Ross11b, Ross11]. At first glance, online learning seems concerned with a very different adversarial setting. By understanding these methods as attempting to ensure both good performance and robust, stable learning across iterations [Ross12b, Saha2012interplay], they become a natural tool for understanding the dynamics of interleaving learning and execution when our primary concern is generalization performance.
It is important to note that any method relying on cost-to-go estimates can be impractical, as collecting each estimate for a single state-action pair may involve executing an entire trajectory. In many settings, minimizing imitation loss with DAgger [Ross11] is more practical, as we can observe the action chosen by the expert in every visited state along a trajectory and thus collect $T$ data points per trajectory instead of a single one. This is less crucial in structured prediction settings, where the cost-to-go of the expert may often be quickly computed, which has led to the success of the heuristic analyzed here. A potential combination of the two approaches would first use simple imitation loss minimization to provide a reasonable policy, and then refine it using AggreVaTe (e.g. through additional gradient descent steps), thus using fewer (expensive) iterations.
In the reinforcement learning setting, the bound provided is as strong as that provided by [Bagnell03, Kakade2003thesis] for an arbitrary policy class. However, as $Q_{\max}$ is generally $O(T)$, this only provides meaningful guarantees when $d_{\pi'}$ is very close to $\nu$ (on average over time $t$). Previous methods like [Bagnell03, Kakade02, Scherrer2014approximate] provide a much stronger, multiplicative error guarantee when we consider competing against the Bayes-optimal policy in a fully observed MDP. It is not obvious how the current algorithm and analysis can extend to that variant of the bound.
Much work remains to be done: there are a wide variety of no-regret learners and their practical trade-offs are almost completely open. Future work must explore this set to identify which methods are most effective in practice.
Appendix: Proofs and Detailed Bounds
In this appendix, we provide the proofs and detailed analysis of the algorithms for imitation learning and reinforcement learning provided in the main document.
We begin with a classical and useful general lemma that is needed for bounding the expected loss under different distributions. This will be used several times throughout. Here this will be useful for bounding the expected loss under the state distribution of (which optionally queries the expert a fraction of the time during its execution) in terms of the expected loss under the state distribution of :
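Let $P$ and $Q$ be any distributions over elements $x \in \mathcal{X}$, and $f: \mathcal{X} \to \mathbb{R}$ any bounded function such that $f(x) \in [a,b]$ for all $x \in \mathcal{X}$. Let the range $H = b - a$. Then
$$\left|\mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{x\sim Q}[f(x)]\right| \le \frac{H}{2}\|P - Q\|_1.$$
We provide the proof for $\mathcal{X}$ discrete; a similar argument can be carried out for $\mathcal{X}$ continuous, using integrals instead of sums.
$$\left|\mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{x\sim Q}[f(x)]\right| = \left|\sum_x (P(x) - Q(x)) f(x)\right|.$$
Additionally, since for any real $c$, $\sum_x P(x)c = \sum_x Q(x)c$, then we have for any $c$:
$$\left|\sum_x (P(x) - Q(x)) f(x)\right| = \left|\sum_x (P(x) - Q(x))(f(x) - c)\right| \le \sum_x |P(x) - Q(x)|\,|f(x) - c| \le \max_x |f(x) - c|\,\|P - Q\|_1.$$
This holds for all $c \in \mathbb{R}$. This upper bound is minimized for $c = a + \frac{H}{2}$, making $\max_x |f(x) - c| \le \frac{H}{2}$. This proves the lemma. ∎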
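The lemma is easy to sanity-check numerically. The sketch below assumes discrete distributions given as lists of probabilities and takes the range of the function to be the gap between its largest and smallest values on the support; the helper names are hypothetical.

```python
import random

def expectation(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def l1_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def range_bound(p, q, f):
    """Half the range of f times the L1 distance between p and q."""
    return 0.5 * (max(f) - min(f)) * l1_distance(p, q)

def random_distribution(k, rng):
    raw = [rng.random() + 1e-9 for _ in range(k)]
    z = sum(raw)
    return [r / z for r in raw]

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        p, q = random_distribution(4, rng), random_distribution(4, rng)
        f = [rng.uniform(-3.0, 7.0) for _ in range(4)]
        gap = abs(expectation(p, f) - expectation(q, f))
        assert gap <= range_bound(p, q, f) + 1e-12
    print("bound held on all sampled cases")
```

The bound is tight for two point masses on the states where the function attains its minimum and maximum, and it improves on the naive bound that uses the largest absolute value of the function whenever the function is shifted away from zero.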
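The distance between the distribution of states encountered by $\hat\pi_i$, the policy chosen by the online learner, and $\pi_i$, the policy used to collect data that continues to execute the expert’s actions with probability $\beta_i$, is bounded as follows:
$$\|d_{\pi_i} - d_{\hat\pi_i}\|_1 \le 2\min(1, T\beta_i).$$
Let $d$ be the distribution of states over $T$ steps conditioned on $\pi_i$ picking the expert $\pi^*$ at least once over $T$ steps. Since $\pi_i$ always executes $\hat\pi_i$ (never executes the expert action) over $T$ steps with probability $(1-\beta_i)^T$, we have $d_{\pi_i} = (1-\beta_i)^T d_{\hat\pi_i} + (1 - (1-\beta_i)^T)\, d$. Thus
$$\|d_{\pi_i} - d_{\hat\pi_i}\|_1 = (1 - (1-\beta_i)^T)\,\|d - d_{\hat\pi_i}\|_1 \le 2(1 - (1-\beta_i)^T) \le 2T\beta_i.$$
The last inequality follows from the fact that $(1-\beta)^T \ge 1 - \beta T$ for any $\beta \in [0,1]$. Finally, since for any two distributions $p$, $q$, we always have $\|p - q\|_1 \le 2$, then $\|d_{\pi_i} - d_{\hat\pi_i}\|_1 \le 2\min(1, T\beta_i)$. ∎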
Below we use the performance difference lemma [Bagnell03, Kakade02, Kakade2003thesis], which is useful for bounding the change in total cost-to-go. This general result bounds the difference in performance of any two policies. We present this result and its proof here for completeness.
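Let $\pi$ and $\pi'$ be any two policies and denote by $V^{\pi'}_t$ and $Q^{\pi'}_t$ the $t$-step value function and $Q$-value function of policy $\pi'$ respectively, then:
$$J(\pi) = J(\pi') + T\,\mathbb{E}_{t\sim U(1:T),\, s\sim d^t_\pi}\left[Q^{\pi'}_{T-t+1}(s,\pi) - V^{\pi'}_{T-t+1}(s)\right]$$
for $U(1:T)$ the uniform distribution on the set $\{1,2,\dots,T\}$.
Let $\pi_t$ denote the non-stationary policy that executes $\pi$ in the first $t$ time steps, and then switches to execute $\pi'$ at time $t+1$ to $T$. Then we have $J(\pi) = J(\pi_T)$ and $J(\pi') = J(\pi_0)$. Thus:
$$J(\pi) - J(\pi') = \sum_{t=1}^{T}\left[J(\pi_t) - J(\pi_{t-1})\right] = \sum_{t=1}^{T} \mathbb{E}_{s\sim d^t_\pi}\left[Q^{\pi'}_{T-t+1}(s,\pi) - V^{\pi'}_{T-t+1}(s)\right] = T\,\mathbb{E}_{t\sim U(1:T),\, s\sim d^t_\pi}\left[Q^{\pi'}_{T-t+1}(s,\pi) - V^{\pi'}_{T-t+1}(s)\right]. \qquad ∎$$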
AggreVaTe Reduction Analysis
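Let $\epsilon_{\text{class}} = \min_{\pi\in\Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{t\sim U(1:T),\, s\sim d^t_{\pi_i}}[Q^*_{T-t+1}(s,\pi) - \min_a Q^*_{T-t+1}(s,a)]$ denote the minimum expected cost-sensitive classification regret achieved by policies in the class $\Pi$ on all the data over the $N$ iterations of training. Denote the online learning average regret on the cost-to-go examples of the sequence of policies chosen by AggreVaTe, $\epsilon_{\text{regret}} = \frac{1}{N}\left[\sum_{i=1}^{N} \ell_i(\hat\pi_i) - \min_{\pi\in\Pi}\sum_{i=1}^{N} \ell_i(\pi)\right]$, where $\ell_i(\pi) = \mathbb{E}_{t\sim U(1:T),\, s\sim d^t_{\pi_i}}[Q^*_{T-t+1}(s,\pi)]$. Assume the cost-to-go of the expert $Q^*$ is non-negative and bounded by $Q^*_{\max}$, and that the $\beta_i$ are chosen such that $\beta_i \le (1-\alpha)^{i-1}$ for some $\alpha$. Then we have the following:
After $N$ iterations of AggreVaTe:
$$J(\hat\pi) \le J(\bar\pi) \le J(\pi^*) + T[\epsilon_{\text{class}} + \epsilon_{\text{regret}}] + O\left(\frac{Q^*_{\max} T \log T}{\alpha N}\right).$$
Thus if a no-regret online algorithm is used to pick the sequence of policies $\hat\pi_{1:N}$, then as the number of iterations $N \to \infty$:
$$\lim_{N\to\infty} J(\bar\pi) \le J(\pi^*) + T\epsilon_{\text{class}}.$$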
For every policy , we have:
Since are non-increasing, define the largest such that . Then:
Again, since the minimum is always better than the average, i.e. . Finally, we have that when , . This proves the first part of the theorem.
The second part follows immediately from the fact that as , and similarly for the extra term . ∎
Finite Sample AggreVaTe with Q-function approximation
We here consider the finite sample case where actions are explored uniformly randomly and the reduction of cost-sensitive classification to squared loss regression is used. We consider learning an estimate Q-value function $\hat Q$ of the expert’s cost-to-go, and we consider a general case where the cost-to-go predictions may depend on features $f(s,a,t)$ of the state $s$, action $a$ and time $t$, e.g. $\hat Q$ could be a linear regressor s.t. $\hat Q_{T-t+1}(s,a) = w^\top f(s,a,t)$ is the estimate of the cost-to-go $Q^*_{T-t+1}(s,a)$, and $w$ are the parameters of the linear regressor we learn. Given such estimates $\hat Q$, we consider executing the policy $\hat\pi$, such that in state $s$ at time $t$, $\hat\pi(s,t) = \arg\min_{a} \hat Q_{T-t+1}(s,a)$.
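After $N$ iterations of AggreVaTe, collecting $m$ regression examples $(s,a,t,Q)$ per iteration, guarantees that with probability at least $1-\delta$:
$$J(\hat\pi) \le J(\bar\pi) \le J(\pi^*) + 2\sqrt{|A|}\,T\sqrt{\hat\epsilon_{\text{class}} + \hat\epsilon_{\text{regret}} + O\left(\sqrt{\log(1/\delta)/(Nm)}\right)} + O\left(\frac{Q^*_{\max} T \log T}{\alpha N}\right).$$
Thus if a no-regret online algorithm is used to pick the sequence of regressors $\hat Q_{1:N}$, then as the number of iterations $N \to \infty$, with probability 1:
$$\lim_{N\to\infty} J(\bar\pi) \le J(\pi^*) + 2\sqrt{|A|}\,T\sqrt{\hat\epsilon_{\text{class}}}.$$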
Consider $\tilde\pi$, the Bayes-optimal non-stationary policy that minimizes loss on the cost-to-go examples. That is, $\tilde\pi(s,t) = \arg\min_{a} Q^*_{T-t+1}(s,a)$, i.e. it picks the action with minimum expected expert cost-to-go conditioned on being in state $s$ and time $t$. Additionally, given the observed noisy Q-values from each trajectory, the Bayes-optimal regressor is simply the Q-value function $Q^*$ of the expert that predicts the expected cost-to-go.
At each iteration $i$, we execute a policy $\hat\pi_i$, such that $\hat\pi_i(s,t) = \arg\min_{a} \hat Q^i_{T-t+1}(s,a)$, where $\hat Q^i$ is the current regressor at iteration $i$ from the base online learner. The cost-sensitive regret of policy $\hat\pi_i$, compared to $\tilde\pi$, can be related to the regression regret of $\hat Q^i$ as follows:
Consider any state $s$ and time $t$. Let $\hat a = \hat\pi_i(s,t)$ and consider the action $a'$ of any other policy. We have that:
Additionally, for any joint distribution $D$ over $(s,t)$, and $U(A)$ the uniform distribution over actions, we have that:
Thus we obtain that for every :