Continuous Online Learning and New Insights to Online Imitation Learning

12/03/2019 ∙ by Jonathan Lee, et al. ∙ Georgia Institute of Technology ∙ University of California, Berkeley

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learner's decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. At the end, we specialize these new insights into online imitation learning and show improved understanding of its learning stability.

1 Introduction

Online learning [12, 28] studies the interactions between a learner (i.e. an algorithm) and an opponent through regret minimization. It has proven to be a powerful framework for analyzing and designing iterative algorithms. However, while classic online learning setups focus on bounding the worst case, many applications are not naturally adversarial. This reality gap exists especially for iterative algorithms that are designed to solve optimization problems concerning Markov decision processes (MDPs). Because the objective is often stated in expectation for these problems, continuity properties often arise naturally from the smoothing effect of taking expectations over randomness. When such properties are ignored, theoretical analyses can be overly conservative.

To this end, we propose a new setup for online learning, called Continuous Online Learning (COL). In contrast to the standard adversarial setup that treats losses as adversarial, COL concerns online learning problems where the per-round losses change continuously with respect to the learner’s decisions, and it models adversity, such as stochasticity and bias, as corruption in the feedback signals of these continuous loss sequences. This modified setup natively captures regularity in online losses, while still being able to handle adversity that appears in common problems. As a result, certain concepts that are difficult to analyze in the classic adversarial setup (e.g. sublinear dynamic regret with an adaptive opponent) become attainable in COL.

The goal of this paper is to establish COL and to study, particularly, conditions and efficient algorithms for achieving sublinear dynamic regret. Our first result shows that achieving sublinear dynamic regret in COL, interestingly, is equivalent to solving certain equilibrium problems (EPs), which are known to be PPAD-complete [8] (in short, these are NP problems whose solutions are known to exist, but whether they can be found in polynomial time remains open). In other words, achieving sublinear dynamic regret that is polynomial in the dimension of the decision set can be extremely difficult in general. Nevertheless, based on the solution concept of EPs, we present a reduction from sublinear dynamic regret to static regret and to convergence to the solution of the associated EP. This reduction allows us to quickly derive non-asymptotic dynamic regret bounds for popular online learning algorithms based on their known static regret rates.

Using these insights from COL, we revisit online imitation learning (IL) [22] and show that it can be framed as a COL problem. We demonstrate that, by using standard analyses of COL, we are able to recover and improve existing understanding of online IL algorithms [22, 3, 17]. In particular, we characterize the existence and uniqueness of solutions, and present convergence and dynamic regret bounds for a common class of IL algorithms in deterministic and stochastic settings. A more detailed version of this paper, with additional theoretical results and proofs omitted here, can be found in the full technical report [5].

2 Continuous Online Learning

We recall that, generally, an online learning problem repeats the following steps: in round $n$, the learner plays a decision $x_n$ from a convex and compact decision set $\mathcal{X}$, the opponent chooses a loss function $\ell_n : \mathcal{X} \to \mathbb{R}$ based on the decisions of the learner, and then information about $\ell_n$ (e.g. the gradient $\nabla \ell_n(x_n)$) is revealed to the learner to inform the next decision. Classically, this abstract setup studies the adversarial setting where $\ell_n$ can be almost arbitrarily chosen except for minor restrictions like convexity [23, 13]. Often the performance is measured relatively through static regret,

$\mathrm{Regret}^s_N := \sum_{n=1}^{N} \ell_n(x_n) - \min_{x \in \mathcal{X}} \sum_{n=1}^{N} \ell_n(x). \qquad (1)$

Recently, interest has emerged in algorithms that can make nearly optimal decisions at each round. The regret is therefore measured on-the-fly and suitably named dynamic regret,

$\mathrm{Regret}^d_N := \sum_{n=1}^{N} \ell_n(x_n) - \sum_{n=1}^{N} \ell_n(x_n^*), \qquad (2)$

where $x_n^* \in \arg\min_{x \in \mathcal{X}} \ell_n(x)$. As dynamic regret by definition upper bounds static regret, minimizing dynamic regret is a more difficult problem.
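As a toy illustration of the difference between (1) and (2) (our own example, not from the paper), the snippet below computes both regrets for a lazy decision sequence against drifting quadratic losses on $[-1, 1]$:

```python
import numpy as np

# Toy illustration (not from the paper): per-round losses l_n(x) = (x - c_n)^2
# on the decision set X = [-1, 1], with drifting minimizers c_n.
N = 50
c = np.linspace(-0.5, 0.5, N)        # per-round loss "centers"
x = np.zeros(N)                      # a lazy decision sequence: always play 0

losses = (x - c) ** 2                # l_n(x_n)
grid = np.linspace(-1.0, 1.0, 2001)  # discretized decision set for the hindsight argmin

# Static regret (1): compare against the single best fixed decision in hindsight.
static_regret = losses.sum() - min(((g - c) ** 2).sum() for g in grid)

# Dynamic regret (2): compare against the per-round minimizers x_n^* = clip(c_n).
dynamic_regret = losses.sum() - ((np.clip(c, -1, 1) - c) ** 2).sum()

print(static_regret, dynamic_regret)  # dynamic regret >= static regret
```

Here playing the constant decision $0$ happens to be optimal in hindsight, so the static regret is zero, while the dynamic regret remains positive.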

At a high level, one can view online learning as a protocol to describe iterative algorithms, i.e., an algorithm receives some feedback, updates its decision, tries it out and receives a performance measure, and then repeats. Indeed, this idea has made online learning a ubiquitous tool for analyzing a wide range of problems. But often in these problems, the loss sequence has certain correlations; if the algorithm outputs the same decision, regardless of which iteration it is in, its performance will be measured similarly. This structure of regularity, however, is missing in the classic adversarial setup. While it is possible to introduce ad-hoc constraints to limit the amount of adversity in the classic setup, as in [28, 19, 26, 9, 2, 14, 27], such a scheme often leads to case-by-case analyses and can hardly model problems where the adversity depends also on the learner's decision, like the online IL problem of interest here (see Section 5). This mismatch between practice and theory makes studying certain convergence concepts difficult, such as sublinear dynamic regret, which is useful for understanding the performance of the last iterate produced by the algorithm.

COL differs from the classic setup mainly in the way the loss and the feedback are defined, so that it can inherently model regularity that shows up in the loss sequences of problems in practice. In COL, we suppose that the opponent possesses a bifunction $f : (x', x) \mapsto f_{x'}(x)$, for $x', x \in \mathcal{X}$, that is unknown to the learner. This bifunction is used by the opponent to determine the per-round losses: in round $n$, if the learner chooses $x_n$, then the opponent responds with

$\ell_n(\cdot) := f_{x_n}(\cdot). \qquad (3)$

Finally, the learner suffers $\ell_n(x_n)$ and receives feedback about $\ell_n$. For $f_{x'}(x)$, we treat $x'$ as the query argument that proposes a question (i.e. an optimization objective $f_{x'}(\cdot)$), and treat $x$ as the decision argument whose performance is evaluated. This bifunction generally can be defined online as queried, with only one limitation: the same loss function $f_{x'}(\cdot)$ must be selected by the opponent whenever the learner plays the same decision $x'$. Thus, the opponent can be adaptive, but only in response to the learner's current decision. We assume that, for all $x' \in \mathcal{X}$, the gradients of $f_{x'}$ are bounded on $\mathcal{X}$ by some constant $G < \infty$. In Section 5, we will discuss how the bifunction provides a natural interpretation for certain difficult objectives such as in online IL.
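To make the protocol concrete, the following is a minimal sketch of a COL round loop (our own toy instantiation with a quadratic bifunction, not anything prescribed by the paper):

```python
import numpy as np

# Toy COL instance (illustrative only): f_{x'}(x) = 0.5 * ||x - g(x')||^2,
# where g is a contraction, so the loss proposed by the opponent varies
# continuously with the learner's previous decision.
def grad_f(query, decision):
    """Gradient of f_{query}(.) evaluated at `decision`."""
    target = 0.5 * np.tanh(query)          # g(x'): continuous in the query argument
    return decision - target

rng = np.random.default_rng(0)
x = rng.normal(size=2)                      # x_1: initial decision
eta = 0.5                                   # stepsize for a simple first-order learner

for n in range(100):
    # Opponent fixes l_n = f_{x_n}; learner receives first-order feedback about l_n.
    feedback = grad_f(x, x)                 # nabla l_n(x_n) = nabla f_{x_n}(x_n)
    x = x - eta * feedback                  # learner updates its decision

print(x)  # iterates approach the equilibrium x* satisfying x* = 0.5 * tanh(x*)
```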

In addition to the restriction in (3), we impose regularity on $f$ to relate the losses across rounds, so that seeking sublinear dynamic regret becomes well defined. (Otherwise the opponent could define $f_{x'}$ pointwise for each query $x'$ so as to keep the per-round regret constant, making sublinear dynamic regret unattainable.)

Definition 1.

We say an online learning problem is continuous if $\ell_n$ is set as in (3) by a bifunction $f$ satisfying: for all $x \in \mathcal{X}$, $x' \mapsto \nabla f_{x'}(x)$ is a continuous map. (Here $\nabla f_{x'}(x)$ denotes the derivative with respect to the decision argument $x$.)

The continuity may appear to restrict COL to purely deterministic settings, but adversity such as stochasticity can be incorporated via an important nuance in the relationship between loss and feedback. In the classical online learning setting, the adversity is incorporated in the loss: the losses and decisions may themselves be generated adversarially or stochastically, and they then directly determine the feedback, e.g., given as full information (receiving $\ell_n$ or $\nabla \ell_n(x_n)$) or bandit feedback (receiving just $\ell_n(x_n)$). The (expected) regret is then measured with respect to these intrinsically adversarial losses. By contrast, in COL, we always measure regret with respect to the losses generated by the true underlying bifunction $f$. Instead, we give the opponent the freedom to add an additional stochastic or adversarial component into the feedback; e.g., in first-order feedback, the learner could receive $\nabla \ell_n(x_n) + \epsilon_n$, where $\epsilon_n$ is a probabilistically bounded and potentially adversarial vector, which can be used to model noise or bias in the feedback. In other words, the COL setting models a true underlying loss with regularity, but allows adversity to be modeled within the feedback, analogous to stochastic feedback oracles in convex optimization. This additional structure is especially important for studying dynamic regret, as it allows us to always consider regret with respect to the true $f$ while still incorporating the possibility of stochasticity and adversity.
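Concretely, a noisy first-order feedback oracle for COL might look as follows (a minimal sketch under our own toy conventions; `grad_f(query, decision)` is the hypothetical bifunction gradient from the previous snippet):

```python
import numpy as np

rng = np.random.default_rng(1)

def first_order_feedback(grad_f, query, decision, noise_scale=0.1):
    """Noisy first-order COL feedback: nabla f_{query}(decision) + epsilon.

    The regret is still measured against the noise-free bifunction f;
    only the information revealed to the learner is corrupted.
    """
    epsilon = noise_scale * rng.normal(size=np.shape(decision))  # zero-mean noise
    return grad_f(query, decision) + epsilon
```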

3 Equivalence and Hardness of Continuous Online Learning

We first ask what extra information the COL formulation entails. We present this result as an equivalence between achieving sublinear dynamic regret in COL and solving certain mathematical programming problems. Particularly, suppose $\mathcal{X} \subseteq \mathbb{R}^d$; we are interested in whether sublinear dynamic regret with polynomial dependency on $d$ is even possible. It turns out that, in general, this is difficult: it is at least as hard as a set of problems known to be PPAD-complete [8], even when $f_{x'}(\cdot)$ is convex and continuous.

Theorem 1.

Let $f$ be given as in Definition 1 for a convex and compact decision set $\mathcal{X}$. Suppose $f_{x'}(\cdot)$ is convex and continuous for each $x' \in \mathcal{X}$. For any $f$ satisfying the above assumption, if there is an algorithm that achieves sublinear dynamic regret that is polynomial in $d$ in the associated COL, then it solves all PPAD problems in polynomial time. In particular, achieving sublinear dynamic regret is equivalent to solving the equilibrium problem $\mathrm{EP}(\mathcal{X}, \Phi)$ with $\Phi(x', x) := f_{x'}(x) - f_{x'}(x')$ and the variational inequality $\mathrm{VI}(\mathcal{X}, F)$ with $F(x) := \nabla f_x(x)$.

Theorem 1 is an excerpt of [5, Theorem 1] in the technical report. We recall that an EP problem, $\mathrm{EP}(\mathcal{X}, \Phi)$, is defined by a variable set $\mathcal{X}$ and a bifunction $\Phi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $\Phi(x, x) \ge 0$ for all $x \in \mathcal{X}$, $\Phi(\cdot, x)$ is continuous, and $\Phi(x', \cdot)$ is convex (the convexity and continuity conditions can be relaxed further, e.g., to hemicontinuity). Its goal is to find a point $x^* \in \mathcal{X}$ such that

$\Phi(x^*, x) \ge 0, \quad \forall x \in \mathcal{X}.$

Similarly, the goal of a VI problem, $\mathrm{VI}(\mathcal{X}, F)$ with $F : \mathcal{X} \to \mathbb{R}^d$, is to find a point $x^* \in \mathcal{X}$ such that

$\langle F(x^*), x - x^* \rangle \ge 0, \quad \forall x \in \mathcal{X}.$

By definition, one can see that the VI problem is also an EP problem, with $\Phi(x', x) = \langle F(x'), x - x' \rangle$.

In other words, Theorem 1 states that, based on the identification $\Phi(x', x) = f_{x'}(x) - f_{x'}(x')$ and $F(x) = \nabla f_x(x)$, achieving sublinear dynamic regret is essentially equivalent to finding an equilibrium $x^* \in X^*$, in which $X^*$ denotes the set of solutions of the EP and the VI above (one can show these two solution sets coincide [5]). Therefore, a necessary condition for sublinear dynamic regret is that $X^*$ is non-empty, which is true when $F$ is continuous and $\mathcal{X}$ is compact [10].
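For intuition, a candidate equilibrium can be checked numerically by estimating the violation of the VI condition over sampled points; below is a small sketch (our own toy example, reusing $F(x) = \nabla f_x(x)$ for the quadratic bifunction from the earlier snippet):

```python
import numpy as np

def vi_gap(F, x_star, candidates):
    """Largest violation of <F(x*), x - x*> >= 0 over the sampled points.

    A (near-)zero value indicates x* is an (approximate) solution of VI(X, F).
    """
    Fx = F(x_star)
    return max(0.0, max(-np.dot(Fx, x - x_star) for x in candidates))

# Toy check on X = [-1, 1]^2 with F(x) = nabla f_x(x) = x - 0.5 * tanh(x).
F = lambda x: x - 0.5 * np.tanh(x)
x_star = np.zeros(2)                                    # equilibrium of the toy bifunction
samples = np.random.default_rng(2).uniform(-1, 1, size=(1000, 2))
print(vi_gap(F, x_star, samples))                       # ~0: x* solves the VI
```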

Theorem 1 also implies that extra structure on COL is necessary for designing efficient algorithms that achieve sublinear dynamic regret and find these solutions. Specifically, we are interested in algorithms whose dynamic regret is sublinear and polynomial in $d$. The requirement of polynomial dependency is important to properly define the problem. Without it, sublinear dynamic regret can already be achieved (at least asymptotically), e.g., by simply performing a grid search that discretizes $\mathcal{X}$ (as $\mathcal{X}$ is compact and $f$ is continuous), albeit with an exponentially large constant.

Based on this equivalence, we can strengthen the structural properties of COL so that they are conducive to designing such efficient algorithms.

Definition 2.

We say a COL problem with $\ell_n(\cdot) = f_{x_n}(\cdot)$ is $(\alpha, \beta)$-regular if, for some $\alpha, \beta \ge 0$ and for all $x, x' \in \mathcal{X}$,

  1. $f_{x'}(\cdot)$ is an $\alpha$-strongly convex function.

  2. $x' \mapsto \nabla f_{x'}(x)$ is a $\beta$-Lipschitz continuous map.

Leveraging these, we can identify similar structural properties in the equivalent problems.

Proposition 1.

If the COL problem with $\ell_n(\cdot) = f_{x_n}(\cdot)$ is $(\alpha, \beta)$-regular, then the map $F(x) := \nabla f_x(x)$ is $(\alpha - \beta)$-strongly monotone. That is, for all $x, x' \in \mathcal{X}$,

$\langle F(x) - F(x'),\; x - x' \rangle \;\ge\; (\alpha - \beta)\, \|x - x'\|^2.$

It is well known that strong monotonicity implies that the VI has a unique solution. It also implies that fast linear convergence is possible for deterministic feedback in VI problems [10].
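A quick numerical sanity check of Proposition 1 on a toy $(\alpha, \beta)$-regular bifunction (our own example, with $\alpha = 1$ and $\beta = 0.5$):

```python
import numpy as np

# Toy (alpha, beta)-regular bifunction: f_{x'}(x) = 0.5 * ||x - 0.5*tanh(x')||^2,
# which is 1-strongly convex in x (alpha = 1) and whose gradient is 0.5-Lipschitz
# in the query argument (beta = 0.5), so F should be (alpha - beta) = 0.5-strongly monotone.
F = lambda x: x - 0.5 * np.tanh(x)

rng = np.random.default_rng(3)
ratios = []
for _ in range(10000):
    x, y = rng.uniform(-1, 1, size=(2, 4))
    ratios.append(np.dot(F(x) - F(y), x - y) / np.dot(x - y, x - y))

print(min(ratios))   # empirically >= alpha - beta = 0.5
```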

4 Reduction by Regularity

We present a reduction from minimizing dynamic regret to minimizing static regret and converging to an equilibrium. Intuitively, this is possible because Theorem 1 suggests that achieving sublinear dynamic regret should not be harder than finding $x^*$. Throughout this section, fix $x^* \in X^*$ and define $\Delta_n := \|x_n - x^*\|$.

Theorem 2.

If $f$ is $(\alpha, \beta)$-regular with $\alpha > 0$, then for all $N$,

$\mathrm{Regret}^d_N \;\le\; \sum_{n=1}^{N} \big( f_{x^*}(x_n) - f_{x^*}(x^*) \big) \;+\; O\Big( \sum_{n=1}^{N} \Delta_n \Big).$

Theorem 2 roughly shows that when an equilibrium $x^*$ exists (e.g. under the sufficient conditions in the previous section), it provides a stabilizing effect to the problem, and the dynamic regret behaves almost like the static regret when the decisions are around $x^*$.

This relationship can be used as a powerful tool for understanding the dynamic regret of existing algorithms designed for EPs and VIs. These include, e.g., mirror descent [1], mirror-prox [20, 16], conditional gradient descent [15], Mann iteration [18], etc. Interestingly, many of those are also standard tools in online learning with static regret bounds that are well known [13].

We can apply Theorem 2 in different ways, depending on the known convergence properties of an algorithm. For algorithms whose convergence rate of $\Delta_n$ to zero is known, Theorem 2 essentially shows that their dynamic regret is at most $O\big(\sum_{n=1}^{N} \Delta_n\big)$. For algorithms with only known static regret bounds, we can use the following corollary.

Corollary 1.

If $f$ is $(\alpha, \beta)$-regular and $\alpha > \beta$, it holds that $\mathrm{Regret}^d_N = O\big(\sqrt{N \cdot \mathrm{Regret}^s_N(g)}\big)$, where $\mathrm{Regret}^s_N(g)$ is the static regret of the linear online learning problem with losses $g_n(x) := \langle \nabla \ell_n(x_n), x \rangle$.

The purpose of Corollary 1 is not to give a tight bound, but to show that for nicer problems with , achieving sublinear dynamic regret is not harder than achieving sublinear static regret under linear losses. For tighter bounds, we still refer to Theorem 2 to leverage the equilibrium convergence.
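To see the reduction at work numerically, the following toy experiment (ours, not from the paper) runs projected online gradient descent on the synthetic $(\alpha, \beta)$-regular problem from the earlier snippets and tracks both the distance to the equilibrium and the accumulated dynamic regret:

```python
import numpy as np

# Toy (alpha, beta)-regular COL problem on X = [-1, 1]^d (alpha = 1, beta = 0.5).
d, N, eta = 4, 200, 0.5
target = lambda q: 0.5 * np.tanh(q)                 # query-dependent optimum of f_q(.)
loss = lambda q, x: 0.5 * np.sum((x - target(q)) ** 2)
grad = lambda q, x: x - target(q)

x = np.ones(d)                                      # x_1
dyn_regret, dist = 0.0, []
for n in range(N):
    q = x.copy()                                    # opponent sets l_n = f_{x_n}
    dyn_regret += loss(q, x) - loss(q, target(q))   # l_n(x_n) - min_x l_n(x); min lies inside X here
    dist.append(np.linalg.norm(x))                  # distance to the equilibrium x* = 0
    x = np.clip(x - eta * grad(q, x), -1.0, 1.0)    # projected online gradient step

print(dyn_regret, dist[-1])  # dynamic regret stays O(1); x_n -> x* geometrically
```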

Finally, we remark that Theorem 2 is directly applicable to the expected dynamic regret (the right-hand side of the inequality is replaced by its expectation) when the learner only has access to stochastic feedback, because the COL setup is non-anticipating. Similarly, high-probability bounds can be obtained based on martingale convergence theorems (see [4] for a COL example). In these cases, we note that the regret is defined with respect to $f$ in COL, not the sampled losses.

5 Application to Online Imitation Learning

In this section, we investigate an application of the COL framework to the sequential decision problem of online IL [22]. We consider an episodic MDP with state space $\mathbb{S}$, action space $\mathbb{A}$, and finite horizon $T$. For any $s \in \mathbb{S}$ and $a \in \mathbb{A}$, the transition dynamics give the conditional density, denoted by $p(s' \mid s, a)$, of transitioning to state $s'$ from state $s$ and action $a$. The reward of state $s$ and action $a$ is denoted as $r(s, a)$. A policy $\pi$ is a mapping from $\mathbb{S}$ to a density over $\mathbb{A}$. We suppose the MDP starts from some fixed initial state distribution. We denote the probability of being in state $s$ at time $t$ under policy $\pi$ as $d^{\pi}_t(s)$, and we define the average state distribution under $\pi$ as $d^{\pi}(s) := \frac{1}{T} \sum_{t=1}^{T} d^{\pi}_t(s)$.

In IL, we assume that the dynamics and the reward are unknown to the learner, but, during training time, the learner is given access to an expert policy $\pi^\star$ and full knowledge of a supervised learning loss function $c(s, \pi)$, defined for each state $s$. The objective of IL is to solve

$\min_{\pi \in \Pi} \; \mathbb{E}_{s \sim d^{\pi}}\big[ c(s, \pi) \big], \qquad (4)$

where $\Pi$ is the set of allowable parametric policies, which will be assumed convex; note that it is often the case that $\pi^\star \notin \Pi$.

As $d^{\pi}$ is not known analytically, optimizing (4) directly leads to a reinforcement learning problem and therefore can be sample inefficient.

Online IL, such as the popular DAgger algorithm [22], bypasses this difficulty by reducing (4) to a sequence of supervised learning problems. Below we describe a general construction of online IL: at the $n$th iteration, (1) execute the learner's current policy $\pi_n$ in the MDP to collect state-action samples; (2) update $\pi_n$ to $\pi_{n+1}$ using a stochastic approximation of $\mathbb{E}_{s \sim d^{\pi_n}}[c(s, \pi)]$ based on the samples collected in the first step (a minimal sketch is given below). Importantly, we remark that in these empirical risks, the states are sampled according to the distribution $d^{\pi_n}$ of the learner's own policy.
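To make this construction concrete, here is a small runnable sketch (our own toy stand-in, not the paper's setup): a 1-D system whose visited states depend on the current policy parameter, an expert that provides action labels, and a DAgger-style aggregate least-squares update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D episodic system (a stand-in for the MDP): the visited states depend on
# the learner's current policy, which is what makes the online losses non-stationary.
def rollout(theta, horizon=20):
    s, states = 1.0, []
    for _ in range(horizon):
        states.append(s)
        a = theta * s                                  # learner's linear policy
        s = 0.9 * s - 0.5 * a + 0.05 * rng.normal()    # next state
    return np.array(states)

expert = lambda s: 0.8 * s                             # placeholder expert policy pi*

theta, data_s, data_a = 0.0, [], []
for n in range(25):
    # (1) Execute the current policy to collect states from d^{pi_n}.
    states = rollout(theta)
    data_s.append(states)
    data_a.append(expert(states))                      # expert labels on those states
    # (2) Supervised update (FTL / DAgger style): least squares on the aggregate
    #     dataset, i.e., minimize the empirical squared imitation loss so far.
    S, A = np.concatenate(data_s), np.concatenate(data_a)
    theta = float(S @ A / (S @ S))

print(theta)   # approaches a policy that imitates the expert on its own distribution
```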

The use of online learning to analyze online IL is well established [22]. As studied in [3, 17], these online losses can be formulated through a bifunction, $\ell_n(\pi) = f_{\pi_n}(\pi) := \mathbb{E}_{s \sim d^{\pi_n}}[c(s, \pi)]$, and the policy class $\Pi$ can be viewed as the decision set $\mathcal{X}$. Naturally, this online learning formulation results in many online IL algorithms resembling standard online learning algorithms, such as follow-the-leader (FTL), which uses the full-information feedback $\ell_n$ at each round [22], and mirror descent [24], which uses the first-order feedback $\nabla \ell_n(\pi_n)$. This feedback can also be approximated by unbiased samples. The original work of Ross et al. [22] analyzed FTL in the static regret case by immediate reductions to known static regret bounds of FTL. However, a crucial objective is understanding when these algorithms converge to useful solutions in terms of policy performance, which more recent work has attempted to address [3, 17, 7]. According to these refined analyses, dynamic regret is a more appropriate solution concept for online IL when $\pi^\star \notin \Pi$, which is the common case in practice.

Below we frame online IL in the proposed COL framework and study its properties using the results of the previous sections. We have already shown that the per-round loss can be written as the evaluation of a bifunction, $\ell_n(\pi) = f_{\pi_n}(\pi)$. This is an $(\alpha, \beta)$-regular COL problem when the expected supervised learning loss $\mathbb{E}_{s \sim d^{\pi'}}[c(s, \pi)]$ is strongly convex in $\pi$ and the state distribution $d^{\pi'}$ is Lipschitz continuous in $\pi'$ (see [22, 3, 17]). We can then leverage our results in the COL framework to immediately answer an interesting question in the online IL problem.

Proposition 2.

When $\alpha > \beta$, there exists a unique policy $\pi^* \in \Pi$ that is optimal on its own distribution:

$\pi^* = \arg\min_{\pi \in \Pi} \; \mathbb{E}_{s \sim d^{\pi^*}}\big[ c(s, \pi) \big].$

This result is immediate from the fact that $\alpha > \beta$ implies that the associated problem is an $(\alpha - \beta)$-strongly monotone VI with $F(\pi) = \nabla f_\pi(\pi)$ by Proposition 1, which is guaranteed to have a unique solution [10].

Furthermore, we can improve upon the known sufficient conditions required to find this policy through online gradient descent and give a non-asymptotic convergence guarantee through a reduction to strongly monotone VIs. We will additionally assume that $f_{\pi'}(\pi)$ is $\gamma$-smooth in $\pi$, satisfying $\|\nabla f_{\pi'}(\pi_1) - \nabla f_{\pi'}(\pi_2)\| \le \gamma \|\pi_1 - \pi_2\|$ for any fixed query argument $\pi'$.

We then apply the projection algorithm [10], which is equivalent to the online gradient descent algorithm studied in [24, 17]. Let $\mathcal{P}_\Pi$ denote the Euclidean projection onto $\Pi$. The online gradient descent algorithm can be described as computing the following at each round: $\pi_{n+1} = \mathcal{P}_\Pi\big(\pi_n - \eta \nabla \ell_n(\pi_n)\big)$, or equivalently, $\pi_{n+1} = \arg\min_{\pi \in \Pi} \big\|\pi - \big(\pi_n - \eta \nabla \ell_n(\pi_n)\big)\big\|^2$.
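As a small illustration of this update (with a Euclidean-ball policy class standing in for a generic convex $\Pi$; our own sketch, not the paper's implementation):

```python
import numpy as np

def project_ball(pi, radius=1.0):
    """Euclidean projection onto {pi : ||pi|| <= radius}, a stand-in for a generic convex Pi."""
    norm = np.linalg.norm(pi)
    return pi if norm <= radius else (radius / norm) * pi

def ogd_step(pi, grad, eta):
    """One round of projected online gradient descent: pi_{n+1} = Proj_Pi(pi_n - eta * grad)."""
    return project_ball(pi - eta * grad)
```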

Proposition 3 ([10]).

If $\alpha > \beta$ and the stepsize $\eta$ is chosen such that $0 < \eta < 2(\alpha - \beta)/(\beta + \gamma)^2$, then, under the online gradient descent algorithm with deterministic feedback $\nabla \ell_n(\pi_n)$, it holds that the iterates converge linearly to $\pi^*$: there is $\rho \in [0, 1)$ such that $\|\pi_{n+1} - \pi^*\| \le \rho \|\pi_n - \pi^*\|$ for all $n$.

By Theorem 2, the dynamic regret will therefore be sublinear (in fact, $O(1)$), and the policy converges linearly to the policy that is optimal on its own distribution, $\pi^*$. The only condition required on the problem itself is $\alpha > \beta$, whereas the state-of-the-art sufficient condition of [17] requires additional assumptions beyond $\alpha > \beta$. The result also gives a new non-asymptotic convergence rate to $\pi^*$.

The above result only considers the case when the feedback is deterministic; i.e., there is no sampling error due to executing the policy on the MDP, and the risk is known exactly at each round. While this is a standard starting point in analysis of online IL algorithms [22], we are also interested in the more realistic stochastic case, which has so far not been analyzed for the online gradient descent algorithm in online IL. It turns out that the COL framework can be easily leveraged here too to provide a sublinear dynamic regret bound.

At round $n$, we consider observing the empirical risk $\hat{\ell}_n(\pi)$ constructed from states sampled on the learner's own trajectories, i.e., from $s \sim d^{\pi_n}$. Note that $\mathbb{E}[\hat{\ell}_n] = \ell_n$, and it is easy to show that the first-order feedback can be modeled as the expected gradient with an additive zero-mean noise: $\hat{g}_n = \nabla \ell_n(\pi_n) + \epsilon_n$. For simplicity, we assume the noise $\epsilon_n$ is bounded.

Proposition 4.

If $\alpha > \beta$ and the stepsize is chosen to decay at a suitable rate in $n$, then, under online gradient descent with stochastic feedback, it holds that the expected dynamic regret $\mathbb{E}\big[\mathrm{Regret}^d_N\big]$ is sublinear in $N$.

The proof leverages the reduction to static regret in Corollary 1. It is immediate from the fact that the online IL problem is $(\alpha, \beta)$-regular (see Proposition 9 in the full technical report [5] for details). The dynamic regret is worse than that of the deterministic case, but it is still sublinear. This is the price paid for stochastically sampling from the MDP.
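A toy numerical illustration of the stochastic-feedback case (our own sketch, reusing the synthetic regular problem from the earlier snippets rather than the paper's experiments):

```python
import numpy as np

# Stochastic-feedback variant of the earlier synthetic problem (our illustration):
# the learner only sees a noisy gradient of l_n and uses a decaying stepsize.
rng = np.random.default_rng(4)
target = lambda q: 0.5 * np.tanh(q)               # optimum of f_q(.) for this toy bifunction

x = np.ones(4)
dyn_regret = 0.0
for n in range(1, 2001):
    q = x.copy()                                   # opponent sets l_n = f_{x_n}
    dyn_regret += 0.5 * np.sum((x - target(q)) ** 2)         # l_n(x_n) - min_x l_n(x)
    g = (x - target(q)) + 0.3 * rng.normal(size=x.shape)      # noisy first-order feedback
    x = np.clip(x - g / (0.5 * n), -1.0, 1.0)                 # stepsize eta_n decaying as O(1/n)

print(dyn_regret / 2000)   # the average per-round regret shrinks with N, i.e., regret is sublinear
```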

6 Conclusion

We present COL, a new class of online learning problems where the gradient of the online loss function varies continuously across rounds with respect to the learner's decisions. We show that this setting can be equated with certain equilibrium problems (EPs) and variational inequalities (VIs). Leveraging this insight, we present conditions for achieving sublinear dynamic regret. Furthermore, we show a reduction from dynamic regret to static regret and to convergence to an equilibrium point. This insight suggests that, when these conditions are met, we may employ standard algorithms from the EP literature to achieve interpretable, sublinear dynamic regret rates. Lastly, we apply our theoretical results to the online imitation learning problem, obtaining several interesting new results.

There are several directions for future research on this topic. Our current analyses focus on classical algorithms in online learning. We suspect that the use of adaptive or optimistic methods [6] can accelerate convergence to equilibria if some coarse model of the bifunction can be estimated. This is especially relevant in applications on episodic MDPs, where the expected losses are exactly determined by an underlying reward function and transition dynamics. In addition to online IL, there are several other iterative optimization problems with MDPs that are interesting to consider in the COL setting. First, the problems of online system identification and structured prediction have also been posed as adversarial online learning and analyzed under static regret [25, 22]. We also note that classic fitted Q-iteration [11, 21] for reinforcement learning uses a similar setup: in round $n$, the loss can be written as a regression loss that fits the learner's candidate Q-function to the Bellman backup of its current estimate, evaluated on $d_n$, the state-action distribution induced by running a policy based on the learner's current Q-function estimate. These problem settings can all be posed as COL problems, and it would be interesting to see how their algorithms and analyses can be reconciled with those of EP problems via this reduction.
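As one concrete (assumed, not taken from the paper) instantiation of such a per-round loss, with a discount factor $\gamma$ and a policy induced by the current estimate $Q_n$ (e.g., a softmax rather than greedy policy, so that $d_n$ varies continuously with $Q_n$), one could write

$\ell_n(Q) = \mathbb{E}_{(s,a) \sim d_n}\Big[ \big( Q(s,a) - r(s,a) - \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ \textstyle\max_{a'} Q_n(s', a') \big] \big)^2 \Big].$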

References

  • [1] A. Beck and M. Teboulle (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31 (3), pp. 167–175.
  • [2] O. Besbes, Y. Gur, and A. Zeevi (2015) Non-stationary stochastic optimization. Operations Research 63 (5), pp. 1227–1244.
  • [3] C. Cheng and B. Boots (2018) Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, pp. 1801–1809.
  • [4] C. Cheng, R. T. d. Combes, B. Boots, and G. Gordon (2019) A reduction from reinforcement learning to no-regret online learning. arXiv preprint arXiv:1911.05873.
  • [5] C. Cheng, J. Lee, K. Goldberg, and B. Boots (2019) Online learning with continuous variations: dynamic regret and reductions. arXiv preprint arXiv:1902.07286.
  • [6] C. Cheng, X. Yan, N. Ratliff, and B. Boots (2019) Predictor-corrector policy optimization. In International Conference on Machine Learning, pp. 1151–1161.
  • [7] C. Cheng, X. Yan, E. A. Theodorou, and B. Boots (2019) Accelerating imitation learning with predictive models. In International Conference on Artificial Intelligence and Statistics.
  • [8] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou (2009) The complexity of computing a Nash equilibrium. SIAM Journal on Computing 39 (1), pp. 195–259.
  • [9] R. Dixit, A. S. Bedi, R. Tripathi, and K. Rajawat (2019) Online learning with inexact proximal online gradient descent algorithms. IEEE Transactions on Signal Processing 67 (5), pp. 1338–1352.
  • [10] F. Facchinei and J. Pang (2007) Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.
  • [11] G. J. Gordon (1995) Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268.
  • [12] G. J. Gordon (1999) Regret bounds for prediction problems. In Conference on Learning Theory, Vol. 99, pp. 29–40.
  • [13] E. Hazan et al. (2016) Introduction to online convex optimization. Foundations and Trends in Optimization 2 (3–4), pp. 157–325.
  • [14] A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan (2015) Online optimization: competing with dynamic comparators. In Artificial Intelligence and Statistics, pp. 398–406.
  • [15] M. Jaggi (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427–435.
  • [16] A. Juditsky, A. Nemirovski, and C. Tauvel (2011) Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1 (1), pp. 17–58.
  • [17] J. Lee, M. Laskey, A. K. Tanwani, A. Aswani, and K. Goldberg (2018) A dynamic regret analysis and adaptive regularization algorithm for on-policy robot imitation learning. In Workshop on the Algorithmic Foundations of Robotics.
  • [18] W. R. Mann (1953) Mean value methods in iteration. Proceedings of the American Mathematical Society 4 (3), pp. 506–510.
  • [19] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro (2016) Online optimization in dynamic environments: improved regret rates for strongly convex problems. In IEEE 55th Conference on Decision and Control (CDC), pp. 7195–7201.
  • [20] A. Nemirovski (2004) Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15 (1), pp. 229–251.
  • [21] M. Riedmiller (2005) Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328.
  • [22] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635.
  • [23] S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends in Machine Learning 4 (2), pp. 107–194.
  • [24] W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell (2017) Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pp. 3309–3318.
  • [25] A. Venkatraman, M. Hebert, and J. A. Bagnell (2015) Improving multi-step prediction of learned time series models. In Conference on Artificial Intelligence.
  • [26] T. Yang, L. Zhang, R. Jin, and J. Yi (2016) Tracking slowly moving clairvoyant: optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning, pp. 449–457.
  • [27] L. Zhang, T. Yang, J. Yi, R. Jin, and Z. Zhou (2017) Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems, pp. 732–741.
  • [28] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928–936.