Online learning (Zinkevich, 2003; Shalev-Shwartz, 2012; Hazan, 2016; Mohri and Yang, 2016; Zhang et al., 2018; Jun et al., 2017; Jain et al., 2012; Zhang et al., 2017b) has been a popular research topic over the last decade, owing to its practical applications such as online recommendation (Chaudhuri and Tewari, 2016), online collaborative filtering (Liu et al., 2017; Awerbuch and Hayes, 2007), and moving object detection (Nair and Clark, 2004), as well as its close connections with other research areas such as stochastic optimization (Rakhlin et al., 2011; Liu et al., 2018), multiple kernel learning (Lu et al., 2016; Shen et al., 2018), and bandit problems (Flaxman et al., 2005; Arora et al., 2012; Kwon and Perchet, 2017; Kocák et al., 2016).
The typical objective in online learning is to minimize the (static) regret defined below:
$$\mathrm{Regret}^{s}_T := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \qquad (1)$$
where $x_t$ is the decision made at step $t$ after receiving the information available before that step (e.g., $f_1, f_2, \ldots, f_{t-1}$). The optimal reference is the single point that minimizes the sum of all component functions up to time $T$. However, this way of choosing the optimal reference may not fit some important applications in practice. For example, in the recommendation task, $f_t(x_t)$ is the loss at time $t$, determined by the $t$-th arriving customer and our recommendation strategy $x_t$. The definition of regret in (1) implicitly assumes that the optimal recommendation strategy is constant over time, which is not necessarily true for the recommendation task (or many other applications), since customers' preferences usually evolve over time.
Zinkevich (2003) proposed to use the dynamic regret as the metric for online learning, which allows the optimal strategy to change over time. More specifically, it is defined by
$$\mathrm{Regret}^{d}_T(\mathcal{A}) := \sum_{t=1}^{T} f_t(x_t) - \min_{y_{1:T} \in \Omega(\mathcal{V})} \sum_{t=1}^{T} f_t(y_t),$$
where $\mathcal{A}$ denotes the algorithm that decides $x_t$ iteratively, $y_{1:T}$ is short for the sequence $y_1, y_2, \ldots, y_T$, and the set of comparator sequences respecting the dynamics upper bound $\mathcal{V}$ is defined by
$$\Omega(\mathcal{V}) := \Big\{ y_{1:T} : \sum_{t=1}^{T-1} \|y_{t+1} - y_t\| \le \mathcal{V} \Big\}.$$
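To make the two notions concrete, the following toy sketch (illustrative only; the quadratic losses, the drift model, and the step sizes are our own choices, not from the paper) compares the static and dynamic regret of online gradient descent on one-dimensional losses $f_t(x) = (x - \theta_t)^2$ with a drifting minimizer $\theta_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
# Drifting targets: theta_t is the per-step minimizer of f_t(x) = (x - theta_t)^2.
theta = np.cumsum(0.05 * rng.standard_normal(T))

def loss(x, t):
    return (x - theta[t]) ** 2

def grad(x, t):
    return 2.0 * (x - theta[t])

# Online gradient descent with step size eta_t = 1 / sqrt(t + 1).
x, decisions = 0.0, []
for t in range(T):
    decisions.append(x)
    x -= grad(x, t) / np.sqrt(t + 1)

alg_loss = sum(loss(decisions[t], t) for t in range(T))

# Static comparator: the single best fixed point in hindsight.
best_fixed = theta.mean()                  # minimizes sum_t (x - theta_t)^2
static_comp = sum(loss(best_fixed, t) for t in range(T))

# Dynamic comparator: the per-step minimizers y_t = theta_t (zero loss each step).
dynamic_comp = 0.0
path_length = np.abs(np.diff(theta)).sum()  # the dynamics (path length)

static_regret = alg_loss - static_comp
dynamic_regret = alg_loss - dynamic_comp
print(static_regret, dynamic_regret, path_length)
```

Since the dynamic comparator minimizes every term separately, the dynamic regret is never smaller than the static regret, which is exactly why following dynamics is harder.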
This motivates a few fundamental questions:
As we know, the dependence on $T$ is tight, since online gradient (OG) is optimal for the static regret. But is the dependence on the dynamics tight? In other words, is OG also optimal for the dynamic regret?
Is this bound tight? If not, how can we design a “smarter” algorithm to follow the dynamics?
How difficult is it to follow dynamics in online learning?
Although the dynamic regret has received increasing attention recently (Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017a; Shahrampour and Jadbabaie, 2018; Hall and Willett, 2015; Jadbabaie et al., 2015), and some subsequent studies claim to improve this result by considering specific function types (e.g., strongly convex $f_t$) or by restricting the definition of the dynamic regret, these fundamental questions remain open.
In this paper, we consider a more general setup for the problem, with losses of the composite form $F_t = f_t + r$ in (5), where $f_t$ and $r$ are only required to be convex and closed, and a more general definition of the dynamic constraint in (6):
$$\sum_{t=1}^{T-1} \lambda_t \|y_{t+1} - y_t\| \le \mathcal{V}, \qquad (6)$$
where $\lambda_t > 0$ is a weight assigned to the $t$-th step of the dynamics.
We show that the upper bound of the Proximal Online Gradient (POG) algorithm, which is a generalized version of online gradient, can be improved to $O\big(\sqrt{T(1+\mathcal{V})}\big)$.
To understand the difficulty of following dynamics in online learning, we derive a lower bound (which measures the dynamic regret attained by the optimal algorithm) and show that the proved upper bound for POG matches this lower bound. This indicates that POG is optimal even for the dynamic regret, not just for the static regret.
2 Related work
We briefly review previous research on regret in static and dynamic environments.
2.1 Static regrets
2.2 Regrets bounded by other dynamics
Zinkevich (2003) obtains a regret of order $O\big(\sqrt{T}(1+\mathcal{V})\big)$ for convex functions $f_t$. Similarly, assume the dynamic constraint is defined by the inequality
$$\sum_{t=1}^{T-1} \|y_{t+1} - \Phi_t(y_t)\| \le \mathcal{V},$$
where $\Phi_t$ provides a prediction of the dynamic environment. When $\Phi_t$ predicts the dynamic environment accurately, Hall and Willett (2013, 2015) obtain a better regret than (Zinkevich, 2003), but it is still bounded by $O\big(\sqrt{T}(1+\mathcal{V})\big)$.
Additionally, assume $f_t$ is strongly convex and smooth, and the dynamic constraint is defined by
$$\mathcal{V}^* := \sum_{t=1}^{T-1} \|x^*_{t+1} - x^*_t\|, \qquad x^*_t := \operatorname*{argmin}_{x \in \mathcal{X}} f_t(x).$$
Mokhtari et al. (2016) obtains an $O(1+\mathcal{V}^*)$ regret. When querying noisy gradients, Bedi et al. (2018) obtains an $O(1+\mathcal{V}^*+E)$ regret, where $E$ is the cumulative gradient error. Yang et al. (2016) and Gao et al. (2018) extend the analysis to non-strongly convex and non-convex functions, respectively. Shahrampour and Jadbabaie (2018) extend it to the decentralized setting (the definition of $\mathcal{V}^*$ is changed slightly in that setting). Furthermore, define
$$\mathcal{S}^* := \sum_{t=1}^{T-1} \|x^*_{t+1} - x^*_t\|^2.$$
When querying multiple gradients per iteration, Zhang et al. (2017a) improves the dynamic regret to $O(\min\{\mathcal{V}^*, \mathcal{S}^*\})$. Compared with these previous works, we obtain a tight regret bound using a more general definition of the dynamic constraint, i.e., (6), and our analysis assumes neither smoothness nor strong convexity of $f_t$.
Other regularities, including the functional variation (Jenatton et al., 2016a; Zhu and Xu, 2015; Besbes et al., 2015; Zhang et al., 2018), the gradient variation (Chiang et al., 2012), and mixed regularities (Jadbabaie et al., 2015; Chen et al., 2017; Jenatton et al., 2016b), have also been investigated to bound the dynamic regret. These regularities cannot be compared directly because they measure different aspects of the variation in the dynamic environment. In this paper, we use (6) to bound the regret; extending our analysis to other regularities is left as future work.
2.3 Shifting regret
There is a close connection between the dynamic regret and the shifting regret (György and Szepesvári, 2016). The shifting regret uses the $\ell_0$ norm (the number of switches in the comparator sequence) as the metric to account for the dynamics, while the dynamic regret in our paper uses the $\ell_1$ norm (the path length). Note that the shifting regret can, in a sense, be considered a special case of the dynamic regret, and our result indeed implies an upper bound on the shifting regret. In particular, by restricting the comparator sequence to a bounded region, our upper bound for the dynamic regret implies an upper bound on the shifting regret that is consistent with the result in (Jun et al., 2017).
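As a small numerical illustration of the two metrics (our own toy example, not from the paper), consider a piecewise-constant comparator sequence on a bounded domain: the $\ell_0$-type dynamics count the switches, the $\ell_1$-type dynamics sum the movement, and on a domain of diameter $R$ the path length is at most $R$ times the number of switches:

```python
import numpy as np

# Piecewise-constant comparator sequence living in a domain of diameter R = 1.
R = 1.0
y = np.array([0.2] * 5 + [0.9] * 5 + [0.4] * 5)

diffs = np.abs(np.diff(y))
num_shifts = int((diffs > 0).sum())   # l0-type dynamics (shifting regret)
path_length = float(diffs.sum())      # l1-type dynamics (dynamic regret)

# Each of the S switches moves the comparator by at most R,
# so path_length <= S * R on a bounded domain.
print(num_shifts, path_length)
```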
3 Notations and Assumptions
In this section, we introduce notations and important assumptions for the online learning algorithm used throughout this paper.
$\mathbb{A}$ represents the family of all possible online algorithms.
$\mathcal{F}$ represents the family of loss functions available to the adversary, where every loss function in $\mathcal{F}$ satisfies the three following assumptions. $\mathcal{F}^T$ denotes the $T$-fold product space $\mathcal{F} \times \cdots \times \mathcal{F}$.
$x_{1:T}$ represents a sequence of vectors, namely $x_1, x_2, \ldots, x_T$. $f_{1:T}$ denotes a sequence of functions, namely $f_1, f_2, \ldots, f_T$.
$\mathrm{Regret}_{\mathcal{A}}(f_{1:T})$ is the regret for a loss function sequence $f_{1:T}$ under a learning algorithm $\mathcal{A}$, where $\mathcal{A}$ can be POG or OG.
$\|\cdot\|_p$ denotes the $\ell_p$ norm; $\|\cdot\|$ represents the $\ell_2$ norm by default.
$\partial$ represents the subgradient operator, and $\mathbb{E}$ represents the mathematical expectation.
We use the following assumptions to analyze the regret of the online gradient.
The functions $f_t$ for all $t \in \{1, \ldots, T\}$ and $r$ are convex and closed but possibly nonsmooth. In particular, $F_t$ is defined as $F_t := f_t + r$.
The convex compact set $\mathcal{X}$ is the domain of the decision variables, and $\|x - y\| \le R$ for any $x, y \in \mathcal{X}$.
For any $x \in \mathcal{X}$ and any function $f \in \mathcal{F}$, $\|g\| \le G$ for all $g \in \partial f(x)$.
We use the proximal online gradient (POG) method to solve the online learning problem with losses of the composite form (5). The POG algorithm is a generalized version of OG that takes care of the regularizer component $r$. The complete POG algorithm is presented in Algorithm 1. Line 4 of Algorithm 1 is the proximal gradient descent step defined by
$$x_{t+1} = \mathrm{prox}_{\eta_t r}\left(x_t - \eta_t g_t\right), \qquad g_t \in \partial f_t(x_t),$$
where the proximal operator is defined as
$$\mathrm{prox}_{\eta r}(y) := \operatorname*{argmin}_{x \in \mathcal{X}} \left( r(x) + \frac{1}{2\eta}\|x - y\|^2 \right).$$
Therefore, the update of $x_{t+1}$ is also equivalent to
$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left( \langle g_t, x \rangle + r(x) + \frac{1}{2\eta_t}\|x - x_t\|^2 \right).$$
The POG algorithm reduces to the OG algorithm when $r$ is a constant function.
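As a concrete (hypothetical) instance, the sketch below implements one POG update for the common special case $r(x) = \lambda \|x\|_1$, whose proximal operator is elementwise soft-thresholding. The function names and the unconstrained domain are our own simplifications, not the paper's:

```python
import numpy as np

def prox_l1(y, tau):
    """Proximal operator of tau * ||x||_1: elementwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def pog_step(x, g, eta, lam):
    """One POG update: a gradient step on f_t, then a prox step on r."""
    return prox_l1(x - eta * g, eta * lam)

x = np.array([1.0, -2.0, 0.5])
g = np.array([0.2, -0.4, 0.1])   # a subgradient of f_t at x
x_next = pog_step(x, g, eta=0.5, lam=0.1)
```

With $\lambda = 0$ the proximal operator is the identity, so `pog_step` reduces to a plain online gradient step, matching the remark above.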
5 Theoretical results
When $\lambda_t = 1$ for all $t$, (6) reduces to the previous definition of the dynamic constraint. Compared with that definition, when $\lambda_t$ increases with $t$, (6) allocates larger weights to the later parts of the dynamics than to the earlier parts.
In this section, we first prove an upper bound on the regret of the proximal online gradient method under our general dynamic constraint, which also slightly improves the existing upper bound. Then we present a lower bound which, to the best of our knowledge, was not well studied in the previous literature. We show that our upper bound matches the lower bound, implying the optimality of the proximal online gradient algorithm.
5.1 Upper bound
Let the positive learning rate sequence $\{\eta_t\}$ in Algorithm 1 be non-increasing. Then the following upper bound on the dynamic regret holds:
To make the dynamic regret bound more explicit, we choose the learning rate appropriately, which leads to the following result.
For any $T$, we choose an appropriate constant such that the required conditions hold, and then set the learning rate in Algorithm 1 accordingly. We have
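For intuition, the standard calculation behind bounds of this shape (a generic sketch with a constant step size $\eta$, using the diameter bound $R$ and gradient bound $G$ from the assumptions; the exact constants in Corollary 1 may differ) balances two terms:

```latex
\begin{align*}
\mathrm{Regret}^{d}_T
  &\lesssim \frac{R^2 + R\,\mathcal{V}}{\eta} + \eta\, G^2 T
  && \text{(comparator movement vs.\ step-size cost)} \\
\eta^\star
  &= \sqrt{\frac{R^2 + R\,\mathcal{V}}{G^2 T}}
  && \text{(balancing the two terms)} \\
\mathrm{Regret}^{d}_T
  &\lesssim G\sqrt{T\,(R^2 + R\,\mathcal{V})}
   = O\!\left(\sqrt{T\,(1+\mathcal{V})}\right).
\end{align*}
```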
5.2 Lower bound
Once we obtain the upper bound $O\big(\sqrt{T(1+\mathcal{V})}\big)$ for the dynamic regret of POG, a question still remains: is our upper bound's dependence on $T$ and $\mathcal{V}$ tight, or even optimal?
Unfortunately, to the best of our knowledge, this question has not been fully investigated in the existing literature, even for the dynamic regret defined with unit weights $\lambda_t = 1$.
To answer this question, we investigate the dynamic regret achieved by the optimal algorithm, formally written as $\min_{\mathcal{A} \in \mathbb{A}} \max_{f_{1:T} \in \mathcal{F}^T} \mathrm{Regret}_{\mathcal{A}}(f_{1:T})$. If a lower bound for this quantity matches the upper bound in (9), then we can conclude that POG is optimal for the dynamic regret in online learning.
For any $T$, the lower bound for our problem with dynamic regret is
$$\min_{\mathcal{A} \in \mathbb{A}} \; \max_{f_{1:T} \in \mathcal{F}^T} \mathrm{Regret}_{\mathcal{A}}(f_{1:T}) = \Omega\left(\sqrt{T(1+\mathcal{V})}\right),$$
where $\mathbb{A}$ is the set of all possible learning algorithms.
Theorem 2 shows that the lower bound matches the upper bound in terms of the order of $T$ and $\mathcal{V}$. This theoretical result implies that the proximal online gradient method is an optimal algorithm for making decisions in the dynamic environment, and that our upper bound is sufficiently tight. In addition, the lower bound also reveals the difficulty of following dynamics in online learning.
The online learning problem under the dynamic regret metric is particularly interesting for many real scenarios. Although the online gradient method has been shown to be optimal for the static regret metric, the optimal algorithm for the dynamic regret remained unknown. This paper studies the problem from a theoretical perspective. We show that proximal online gradient, a generalized version of online gradient, is optimal for the dynamic regret, by proving a lower bound that matches our upper bound, which in turn slightly improves the existing upper bound.
In this section, we present detailed proofs of the necessary lemmas and theorems in our paper. In particular, Lemma 1 and Lemma 2 support the proofs of Theorem 1 and Corollary 1, and Lemma 3 supports the proof of Theorem 2.
In our proofs, we slightly abuse notation by letting $\partial f(x)$ represent an arbitrary vector in the subdifferential of $f$ at $x$; likewise, $\partial r(x)$ represents an arbitrary vector in the subdifferential of $r$ at $x$.
We use $B_h(x, y)$ to denote the Bregman divergence with respect to a differentiable function $h$, that is, $B_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle$.
Given any sequence $y_{1:T}$, and setting any positive learning rates $\{\eta_t\}$ in Algorithm 1, we have
Define the iterates as in Algorithm 1. According to the optimality condition of the proximal step, for any $y \in \mathcal{X}$, we have
Then, we have
The first step holds due to (10). The second step holds due to the three-point identity for the Bregman divergence: for any vectors $x$, $y$, and $z$,
$$B_h(x, y) + B_h(y, z) - B_h(x, z) = \langle \nabla h(z) - \nabla h(y), \; x - y \rangle.$$
holds due to , so that . Thus, we finally obtain
This completes the proof. ∎
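As a sanity check (not part of the original proof), the three-point identity invoked above can be verified directly in the Euclidean case $h = \frac{1}{2}\|\cdot\|^2$, where $B_h(x, y) = \frac{1}{2}\|x - y\|^2$ and $\nabla h(u) = u$:

```latex
\begin{align*}
B_h(x,y) + B_h(y,z) - B_h(x,z)
  &= \tfrac{1}{2}\|x-y\|^2 + \tfrac{1}{2}\|y-z\|^2 - \tfrac{1}{2}\|x-z\|^2 \\
  &= \langle z - y,\; x - y \rangle
   = \langle \nabla h(z) - \nabla h(y),\; x - y \rangle .
\end{align*}
```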
Given any sequence $y_{1:T}$, and setting a non-increasing learning rate sequence $\{\eta_t\}$ in Algorithm 1, we have
According to the law of cosines (i.e., $\|a - c\|^2 = \|a - b\|^2 + \|b - c\|^2 - 2\langle a - b, \; c - b\rangle$ for any vectors $a$, $b$, $c$), we have
Thus, we obtain
The inequality holds due to (11). The proof is complete. ∎
Proof of Theorem 1:
For any sequence of loss functions $f_{1:T}$, we have
Proof of Corollary 1:
where is a constant, and does not depend on . According to Theorem 1, when ,
Substituting it into (8), we have
holds due to . Choosing the optimal with
This completes the proof.
Consider a sequence $\epsilon_{1:T}$. For any $t$, the dimensions of $\epsilon_t$ are sampled i.i.d. from the Rademacher distribution. We have
We consider the left-hand side:
where $\epsilon^{(i)}_t$ denotes the $i$-th dimension of $\epsilon_t$. The second equality holds because the dimensions of $\epsilon_t$ are mutually independent.
Consider the sequence $\epsilon_{1:T}$. Suppose the event that $+1$ is picked happens $k$ times; then the event that $-1$ is picked happens $T - k$ times. We have
Denote , and . Thus, we have
When is even, denote . Thus,
Here, holds due to
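Aside: the driving fact behind such Rademacher-based lower-bound constructions is that $\mathbb{E}\big|\sum_{i=1}^{n}\epsilon_i\big| = \Theta(\sqrt{n})$ (asymptotically $\sqrt{2n/\pi}$). A quick Monte Carlo sketch (our own illustration, not part of the proof) confirms this numerically:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 400, 20000

# Rademacher samples: +1 or -1 with equal probability.
eps = rng.integers(0, 2, size=(trials, n)) * 2 - 1

emp = np.abs(eps.sum(axis=1)).mean()   # Monte Carlo estimate of E|sum_i eps_i|
asym = np.sqrt(2 * n / np.pi)          # known asymptotic value, Theta(sqrt(n))
print(emp, asym)
```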