Online and stochastic optimization algorithms form the underlying machinery in much of modern machine learning. Perhaps the best-known example is Stochastic Gradient Descent (SGD) and its adaptive variants, the so-called AdaGrad algorithms (McMahan and Streeter, 2010; Duchi et al., 2011). Other special cases include multi-armed and linear bandit algorithms, as well as algorithms for online control, tracking and prediction with expert advice (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2011; Hazan et al., 2016).
There are numerous algorithmic variants in online and stochastic optimization, such as adaptive (Duchi et al., 2011; McMahan and Streeter, 2010) and optimistic algorithms (Rakhlin and Sridharan, 2013a, b; Chiang et al., 2012; Mohri and Yang, 2016; Kamalaruban, 2016), implicit updates (Kivinen and Warmuth, 1997; Kulis and Bartlett, 2010), composite objectives (Xiao, 2009; Duchi et al., 2011, 2010), or non-monotone regularization (Sra et al., 2016). Each of these variants has been analyzed under a specific set of assumptions on the problem, e.g., smooth (Juditsky et al., 2011; Lan, 2012; Dekel et al., 2012), convex (Shalev-Shwartz, 2011; Hazan et al., 2016; Orabona et al., 2015; McMahan, 2014), or strongly convex (Shalev-Shwartz and Kakade, 2009; Hazan et al., 2007; Orabona et al., 2015; McMahan, 2014) objectives. However, a useful property is typically missing from the analyses: modularity. It is often not clear from the original analysis whether the algorithmic idea can be mixed with other techniques, or whether the effect of the assumptions extends beyond the specific setting considered. For example, based on the existing analyses it is unclear to what extent AdaGrad techniques, or the effects of smoothness, or variational bounds in online learning, extend to new learning settings. Thus, for every new combination of algorithmic ideas, or under every new learning setting, the algorithms are typically analyzed from scratch.
A special new learning setting is non-convex optimization. While the bulk of results in online and stochastic optimization assume the convexity of the loss functions, online and stochastic optimization algorithms have been successfully applied in settings where the objectives are non-convex. In particular, the highly popular deep learning techniques (Goodfellow et al., 2016) are based on the application of stochastic optimization algorithms to non-convex objectives. In the face of this discrepancy between the state of the art in theory and practice, an ongoing thread of research attempts to generalize the analyses of stochastic optimization to non-convex settings. In particular, certain non-convex problems have been shown to actually admit efficient optimization methods, usually taking some form of a gradient method (one such problem is matrix completion, see, e.g., Ge et al., 2016; Bhojanapalli et al., 2016).
The goal of this paper is to provide a flexible, modular analysis of online and stochastic optimization algorithms that makes it easy to combine different algorithmic techniques and learning settings under as few assumptions as possible.
First, building on previous attempts to unify the analyses of online and stochastic optimization (Shalev-Shwartz, 2011; Hazan et al., 2016; Orabona et al., 2015; McMahan, 2014), we provide a unified analysis of a large family of optimization algorithms in general Hilbert spaces. The analysis is crafted to be modular: it decouples the contribution of each assumption or algorithmic idea from the analysis, so as to enable us to combine different assumptions and techniques without analyzing the algorithms from scratch.
The analysis depends on a novel decomposition of the optimization performance (optimization error or regret) into two parts: the first part captures the generic performance of the algorithm, whereas the second part connects the assumptions about the learning setting to the information given to the algorithm. Lemma 2.1 in Section 2.1 provides such a decomposition. [Footnote 1: This can be viewed as a refined version of the so-called “be-the-leader” style of analysis. Previous work (e.g., McMahan 2014; Shalev-Shwartz 2011) may give the impression that “follow-the-leader/be-the-leader” analyses lose constant factors while other methods such as primal-dual analysis do not. This is not the case for our analysis. In fact, we improve constants in optimistic online learning; see Section 7.] Then, in Theorem 3.3, we bound the generic (first) part, using a careful analysis of the linear regret of generalized adaptive Follow-The-Regularized-Leader (FTRL) and Mirror Descent (MD) algorithms.
Second, we use this analysis framework to provide a concise summary of a large body of previous results. Section 4 provides the basic results, and Sections 5, 6 and 7 present the relevant extensions and applications.
Third, building on the aforementioned modularity, we analyze new learning algorithms. In particular, in Section 7.4 we analyze a new adaptive, optimistic, composite-objective FTRL algorithm with variational bounds for smooth convex loss functions, which combines the best properties and avoids the limitations of the previous work. We also present a new class of optimistic MD algorithms with only one MD update per round (Section 7.2).
Finally, we extend the previous results to special classes of non-convex optimization problems. In particular, for such problems, we provide global convergence guarantees for general adaptive online and stochastic optimization algorithms. The class of non-convex problems we consider (cf. Section 8) generalizes practical classes of functions considered in previous work on non-convex optimization.
1.2 Notation and definitions
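We will work with a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$ over the reals. That is, $\mathcal{H}$ is a real vector space equipped with an inner product $\langle \cdot, \cdot \rangle$, such that $\mathcal{H}$ is complete with respect to (w.r.t.) the norm induced by $\langle \cdot, \cdot \rangle$. Examples include $\mathcal{H} = \mathbb{R}^d$ (for a positive integer $d$) where $\langle \cdot, \cdot \rangle$ is the standard dot-product, or $\mathcal{H} = \mathbb{R}^{m \times n}$, the set of $m \times n$ real matrices, where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$, or $\mathcal{H} = \ell^2(\mathcal{C})$, the set of square-integrable real-valued functions on $\mathcal{C} \subset \mathbb{R}^d$, where $\langle f, g \rangle = \int_{\mathcal{C}} f(x) g(x) \, dx$ for any $f, g \in \mathcal{H}$.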
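We denote the extended real line by $\overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$, and work with functions of the form $f: \mathcal{H} \to \overline{\mathbb{R}}$. Given a set $\mathcal{C} \subset \mathcal{H}$, the indicatrix of $\mathcal{C}$ is the function $I_{\mathcal{C}}: \mathcal{H} \to \overline{\mathbb{R}}$ given by $I_{\mathcal{C}}(x) = 0$ for $x \in \mathcal{C}$ and $I_{\mathcal{C}}(x) = +\infty$ for $x \notin \mathcal{C}$. The effective domain of a function $f$, denoted by $\mathrm{dom}(f)$, is the set $\{x \in \mathcal{H} : f(x) < +\infty\}$ where $f$ is less than infinity; conversely, we identify any function $f: \mathcal{C} \to \overline{\mathbb{R}}$ defined only on a set $\mathcal{C}$ by the function $f + I_{\mathcal{C}}$. A function $f$ is proper if $\mathrm{dom}(f)$ is non-empty and $f(x) > -\infty$ for all $x \in \mathcal{H}$.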
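Let $f$ be proper. We denote the set of all sub-gradients of $f$ at $x \in \mathcal{H}$ by $\partial f(x)$, i.e.,
\[ \partial f(x) = \{ u \in \mathcal{H} : \langle u, y - x \rangle + f(x) \le f(y) \text{ for all } y \in \mathcal{H} \}. \]
The function $f$ is sub-differentiable at $x$ if $\partial f(x) \neq \emptyset$; we use $f'(x)$ to denote any member of $\partial f(x)$. Note $\partial f(x) = \emptyset$ when $x \notin \mathrm{dom}(f)$.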
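Let $x \in \mathrm{dom}(f)$, assume that $f(x) > -\infty$, and let $z \in \mathcal{H}$. The directional derivative of $f$ at $x$ in the direction $z$ is defined as $f'(x; z) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha z) - f(x)}{\alpha}$, provided that the limit exists in $[-\infty, +\infty]$. The function $f$ is differentiable at $x$ if it has a gradient at $x$, i.e., a vector $\nabla f(x) \in \mathcal{H}$ such that $f'(x; z) = \langle \nabla f(x), z \rangle$ for all $z \in \mathcal{H}$. The function $f$ is locally sub-differentiable at $x$ if it has a local sub-gradient at $x$, i.e., a vector $g_x \in \mathcal{H}$ such that $\langle g_x, z \rangle \le f'(x; z)$ for all $z \in \mathcal{H}$. We denote the set of local sub-gradients of $f$ at $x$ by $\delta f(x)$. Note that if $f'(x; z)$ exists for all $z \in \mathcal{H}$, and $f$ is sub-differentiable at $x$, then it is also locally sub-differentiable with $g_x = u$ for any $u \in \partial f(x)$. Similarly, if $f$ is differentiable at $x$, then it is also locally sub-differentiable, with $g_x = \nabla f(x)$. The function $f$ is called directionally differentiable at $x$ if $f(x) > -\infty$ and $f'(x; z)$ exists in $[-\infty, +\infty]$ for all $z \in \mathcal{H}$; $f$ is called directionally differentiable if it is directionally differentiable at every $x \in \mathrm{dom}(f)$.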
Next, we define a generalized notion of Bregman divergence. [Footnote 2: If $f$ is differentiable at $x$, then (1) matches the traditional definition of Bregman divergence. Previous work also considered generalized Bregman divergences, e.g., the works of Telgarsky and Dasgupta (2012); Kiwiel (1997) and the references therein. However, our definition is not limited to convex functions, allowing us to study convex and non-convex functions under a unified theory; see, e.g., Section 8.]
Definition (Bregman divergence)
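Let $f$ be directionally differentiable at $x \in \mathrm{dom}(f)$. The $f$-induced Bregman divergence from $x$ is the function from $\mathcal{H}$ to $\overline{\mathbb{R}}$, given by
\[ B_f(y, x) = \begin{cases} f(y) - f(x) - f'(x; y - x), & \text{if } f(y) \text{ is finite}, \\ +\infty, & \text{otherwise}. \end{cases} \tag{1} \]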
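A function $f$ is convex if for all $x, y \in \mathrm{dom}(f)$ and all $\alpha \in (0, 1)$, $\alpha f(x) + (1 - \alpha) f(y) \ge f(\alpha x + (1 - \alpha) y)$. We can show that a proper convex function is always directionally differentiable, and the Bregman divergence it induces is always nonnegative (see Appendix E). Let $\|\cdot\|$ denote a norm on $\mathcal{H}$ and let $L, \beta > 0$. A directionally differentiable function $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ iff $B_f(x, y) \ge \frac{\beta}{2} \|x - y\|^2$ for all $x, y \in \mathrm{dom}(f)$. The function $f$ is $L$-smooth w.r.t. $\|\cdot\|$ iff for all $x, y \in \mathrm{dom}(f)$, $|B_f(x, y)| \le \frac{L}{2} \|x - y\|^2$.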
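To make the generalized Bregman divergence concrete, here is a minimal numerical sketch. It assumes the form $B_f(y, x) = f(y) - f(x) - f'(x; y - x)$ and approximates the directional derivative by a one-sided finite difference; the helper names (`directional_derivative`, `bregman`, `half_sq_norm`, `l1_norm`) are illustrative and do not come from the paper.

```python
def directional_derivative(f, x, z, alpha=1e-6):
    """One-sided finite-difference estimate of f'(x; z),
    i.e. the limit of (f(x + a*z) - f(x)) / a as a decreases to 0."""
    shifted = [xi + alpha * zi for xi, zi in zip(x, z)]
    return (f(shifted) - f(x)) / alpha


def bregman(f, y, x, alpha=1e-6):
    """Generalized Bregman divergence B_f(y, x) = f(y) - f(x) - f'(x; y - x)."""
    direction = [yi - xi for yi, xi in zip(y, x)]
    return f(y) - f(x) - directional_derivative(f, x, direction, alpha)


def half_sq_norm(x):
    """f(x) = 0.5 * ||x||^2, differentiable everywhere."""
    return 0.5 * sum(xi * xi for xi in x)


def l1_norm(x):
    """f(x) = ||x||_1, convex but not differentiable at 0."""
    return sum(abs(xi) for xi in x)


# For the half squared norm, the divergence is 0.5 * ||y - x||^2.
b_quad = bregman(half_sq_norm, [0.5, 1.0], [1.0, -2.0])

# At a kink of the l1 norm no gradient exists, but the directional
# derivative does, so the divergence is still well defined (and >= 0).
b_l1 = bregman(l1_norm, [1.0, 1.0], [0.0, 0.0])
```

The second call is the case that motivates working with directional derivatives: the divergence from the origin is defined even though the $\ell_1$ norm has no gradient there.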
We use $\{c_t\}_{t=i}^{j}$ to denote the sequence $c_i, c_{i+1}, \dots, c_j$, and $c_{i:j}$ to denote the sum $\sum_{t=i}^{j} c_t$, with $c_{i:j} = 0$ for $i > j$.
2 Problem setting: online optimization
We study a general first-order iterative optimization setting that encompasses several common optimization scenarios, including online, stochastic, and full-gradient optimization. Consider a convex set $\mathcal{X} \subset \mathcal{H}$, a sequence of directionally differentiable functions $f_1, f_2, \dots, f_T$ from $\mathcal{H}$ to $\overline{\mathbb{R}}$ with $\mathcal{X} \subset \mathrm{dom}(f_t)$ for all $t = 1, 2, \dots, T$, and a first-order iterative optimization algorithm. The algorithm starts with an initial point $x_1 \in \mathcal{X}$. Then, in each iteration $t = 1, 2, \dots, T$, the algorithm suffers a loss $f_t(x_t)$ from the latest point $x_t$, receives some feedback $g_t \in \mathcal{H}$, and selects the next point $x_{t+1} \in \mathcal{X}$. Typically, $\langle g_t, \cdot \rangle$ is supposed to be an estimate or lower bound on the directional derivative of $f_t$ at $x_t$. This protocol is summarized in Figure 1.
Unlike Online Convex Optimization (OCO), at this stage we do not assume that the $f_t$ are convex or differentiable, nor do we assume that the $g_t$ are gradients or sub-gradients. [Footnote 3: There is a long tradition of non-convex assumptions in the Stochastic Approximation (SA) literature, see, e.g., the book of Bertsekas and Shreve (1978). Our results differ in that they apply to more recent advances in online learning (e.g., AdaGrad algorithms), and we derive any-time regret bounds, rather than asymptotic convergence results, for specific non-convex function classes.] Our goal is to minimize the regret $R_T(x^*)$ against any $x^* \in \mathcal{X}$, defined as
\[ R_T(x^*) = \sum_{t=1}^{T} \bigl( f_t(x_t) - f_t(x^*) \bigr). \]
2.1 Regret decomposition
Below, we provide a decomposition of the regret $R_T(x^*)$ (proved in Appendix A) which holds for any sequence of points $x_1, x_2, \dots, x_{T+1}$ and any $x^*$. The decomposition is in terms of the forward linear regret $R_T^+(x^*)$, defined as
\[ R_T^+(x^*) = \sum_{t=1}^{T} \langle g_t, x_{t+1} - x^* \rangle. \]
Intuitively, $R_T^+$ is the regret (in linear losses) of the “cheating” algorithm that uses action $x_{t+1}$ at time $t$, and depends only on the choices of the algorithm and the feedback it receives.
Lemma (Regret decomposition)
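Let $x^*, x_1, x_2, \dots, x_{T+1}$ be any sequence of points in $\mathcal{X}$. For $t = 1, 2, \dots, T$, let $f_t$ be directionally differentiable with $\mathcal{X} \subset \mathrm{dom}(f_t)$, and let $g_t \in \mathcal{H}$. Then,
\[ R_T(x^*) = R_T^+(x^*) + \sum_{t=1}^{T} \langle g_t, x_t - x_{t+1} \rangle - \sum_{t=1}^{T} B_{f_t}(x^*, x_t) + \sum_{t=1}^{T} \delta_t, \tag{2} \]
where $\delta_t = -f_t'(x_t; x^* - x_t) + \langle g_t, x^* - x_t \rangle$.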
Intuitively, the second term captures the regret due to the algorithm’s inability to look ahead into the future. [Footnote 4: This is also related to the concept of “prediction drift”, which appears in learning with delayed feedback (Joulani et al., 2016), and to the role of stability in online algorithms (Saha et al., 2012).] The last two terms capture, respectively, the gain in regret that is possible due to the curvature of $f_t$, and the accuracy of the first-order (gradient) information $g_t$.
In light of this lemma, controlling the regret reduces to controlling the individual terms in (2). First, we provide upper bounds on the forward linear regret for a large class of online algorithms.
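The decomposition is an algebraic identity, so it can be checked numerically for any update rule. The sketch below is an illustration under assumptions that are not taken from the paper: scalar losses $f_t(x) = \frac{1}{2}(x - a_t)^2$, feedback equal to the exact derivative plus a small deterministic perturbation, and a plain gradient step as the (arbitrary) rule for choosing the next point. The function name `check_decomposition` is hypothetical.

```python
def check_decomposition(targets, x_star, step=0.1):
    """Return (regret, reconstructed regret) for scalar quadratic losses.

    The reconstructed value is
        forward regret + drift term - curvature term + feedback-accuracy term,
    mirroring the four terms described in the text.
    """
    x = 0.0
    regret = forward = drift = curvature = delta = 0.0
    for t, a in enumerate(targets, start=1):
        grad = x - a                    # exact derivative of f_t at x_t
        g = grad + 0.1 * (-1) ** t      # inexact feedback (hypothetical perturbation)
        x_next = x - step * g           # any rule works; the identity is algorithm-free
        regret += 0.5 * (x - a) ** 2 - 0.5 * (x_star - a) ** 2
        forward += g * (x_next - x_star)           # <g_t, x_{t+1} - x*>
        drift += g * (x - x_next)                  # <g_t, x_t - x_{t+1}>
        curvature += 0.5 * (x_star - x) ** 2       # B_{f_t}(x*, x_t) for a quadratic
        delta += (g - grad) * (x_star - x)         # gap between feedback and derivative
        x = x_next
    return regret, forward + drift - curvature + delta


lhs, rhs = check_decomposition([1.0, -0.5, 2.0, 0.3], x_star=0.7)
```

With exact gradient feedback the last term vanishes, and only the forward regret, the drift term, and the curvature term remain.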
3 The algorithms: Ada-FTRL and Ada-MD
In this section, we analyze Ada-FTRL and Ada-MD. These two algorithms generalize the well-known core algorithms of online optimization: FTRL (Shalev-Shwartz, 2011; Hazan et al., 2016) and MD (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003; Warmuth and Jagota, 1997; Duchi et al., 2010). In particular, Ada-FTRL and Ada-MD capture variants of FTRL and MD such as Dual-Averaging (Nesterov, 2009; Xiao, 2009), AdaGrad (Duchi et al., 2011; McMahan and Streeter, 2010), composite-objective algorithms (Xiao, 2009; Duchi et al., 2011, 2010), implicit-update MD (Kivinen and Warmuth, 1997; Kulis and Bartlett, 2010), strongly-convex and non-linearized FTRL (Shalev-Shwartz and Kakade, 2009; Hazan et al., 2007; Orabona et al., 2015; McMahan, 2014), optimistic FTRL and MD (Rakhlin and Sridharan, 2013a, b; Chiang et al., 2012; Mohri and Yang, 2016; Kamalaruban, 2016), and even algorithms like AdaDelay (Sra et al., 2016) that violate the common non-decreasing regularization assumption existing in much of the previous work.
3.1 Ada-FTRL: Generalized adaptive Follow-the-Regularized-Leader
The Ada-FTRL algorithm works with two sequences of regularizers, and , where each and is a function from to . At time , having received , Ada-FTRL uses and to compute the next point . The regularizers and can be built by Ada-FTRL in an online adaptive manner using the information generated up to the end of time step (including and ). In particular, we use to distinguish the “proximal” part of this adaptive regularization: for all , we require that (but not necessarily ) be minimized over at . [Footnote 5: Note that does not depend on , but is rather computed using only . Once is calculated, can be chosen so that (3) holds (and then used in computing ).] That is,
With the definitions above, for , Ada-FTRL selects such that
In particular, this means that the initial point satisfies [Footnote 6: The case of an arbitrary is equivalent to using, e.g., (and changing correspondingly).]
In addition, for notational convenience, we define , so that
Finally, we need to make a minimal assumption to ensure that Ada-FTRL is well-defined.
Assumption 1 (Well-posed Ada-FTRL)
Table 1 provides examples of several special cases of Ada-FTRL. In particular, Ada-FTRL combines, unifies and considerably extends the two major types of FTRL algorithms previously considered in the literature, i.e., the so-called FTRL-Centered and FTRL-Prox algorithms (McMahan, 2014) and their variants, as discussed in the subsequent sections.
|Algorithm||Regularization||Notes, Conditions and Assumptions|
|Dual Averaging||(full-matrix update)|
|and as in AdaGrad-DA|
|Composite-Objective||For adding composite-objective learning to any instance of Ada-FTRL (see also Section 5)|
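As a concrete instance of the FTRL template, the sketch below uses a single fixed quadratic regularizer $\|x\|^2 / (2\eta)$ on $\mathbb{R}^d$ and no proximal terms, which gives the familiar dual-averaging update in closed form. This is an assumed special case chosen for illustration, not a reconstruction of the table entries; the function name `ftrl_quadratic` and the step-size parameter `eta` are hypothetical.

```python
def ftrl_quadratic(gradients, eta):
    """Iterates x_1, ..., x_{T+1} of FTRL on R^d with regularizer ||x||^2 / (2*eta).

    Each iterate minimizes <g_{1:t}, x> + ||x||^2 / (2*eta) over R^d,
    whose unique minimizer is x_{t+1} = -eta * g_{1:t}.
    """
    d = len(gradients[0])
    g_sum = [0.0] * d
    iterates = [[0.0] * d]          # x_1 minimizes the regularizer alone
    for g in gradients:
        g_sum = [s + gi for s, gi in zip(g_sum, g)]   # running sum g_{1:t}
        iterates.append([-eta * s for s in g_sum])
    return iterates


xs = ftrl_quadratic([[1.0, 0.0], [1.0, -2.0]], eta=0.5)
```

Only the running sum of the feedback vectors is stored, which is why this family is called dual averaging: the iterate is a function of the accumulated linear losses, not of the previous iterate.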
3.2 Ada-MD: Generalized adaptive Mirror-Descent
As in Ada-FTRL, the Ada-MD algorithm uses two sequences of regularizer functions from to : and . Further, we assume that the domains of are non-increasing, that is, for . Again, can be created using the information generated by the end of time step . The initial point of Ada-MD satisfies [Footnote 7: The case of an arbitrary is equivalent to using, e.g., (and changing correspondingly).]
Furthermore, at time , having observed , Ada-MD uses and to select the point such that
In addition, similarly to Ada-FTRL, we define , though we do not require to be minimized at in Ada-MD. [Footnote 8: We use the convention in defining .]
Finally, we present our assumption on the regularizers of Ada-MD. Compared to Ada-FTRL, we require a stronger assumption to ensure that Ada-MD is well-defined, and that the Bregman divergences in (6) have a controlled behavior.
Assumption 2 (Well-posed Ada-MD)
The regularizers , are proper, and is directionally differentiable. In addition, for all , the sets that define in (6) are non-empty, and their optimal values are finite. Finally, for all , , and are directionally differentiable, , and is linear in the directions inside , i.e., there is a vector in , denoted by , such that for all .
Our results also hold under the weaker condition that is concave (rather than linear) on . [Footnote 9: Without such assumptions, a Bregman divergence term in appears in the regret bound of Ada-MD. Concavity ensures that this term is not positive and can be dropped, greatly simplifying the bounds.] However, in case of a convex , this weaker condition would again translate into having a linear , because a convex implies a convex (Bauschke and Combettes, 2011, Proposition 17.2). While we do not require that be convex, all of our subsequent examples in the paper use convex . Thus, in the interest of readability, we have made the stronger assumption of linear directional derivatives here.
Note that needs to be linear only in the directions inside the domain of . As such, we avoid the extra technical conditions required in previous work, e.g., that be a Legendre function to ensure remains in the interior of and is well-defined.
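For comparison with FTRL, a mirror-descent step keeps only the previous iterate and the latest feedback. The sketch below assumes the Euclidean regularizer $\|x\|^2 / (2\eta)$ and a box constraint, for which the step reduces to a clipped gradient step (projected online gradient descent). It illustrates the classical special case only, not the generalized Ada-MD update; `md_euclidean`, `eta`, `lower` and `upper` are hypothetical names.

```python
def md_euclidean(gradients, eta, lower, upper):
    """Mirror descent with the Euclidean regularizer on the box [lower, upper]^d.

    Each step solves
        x_{t+1} = argmin over the box of <g_t, x> + ||x - x_t||^2 / (2*eta),
    which for a box is the coordinate-wise clipping of x_t - eta * g_t.
    """
    d = len(gradients[0])
    start = min(max(0.0, lower), upper)   # point of the box closest to 0
    x = [start] * d
    iterates = [x]
    for g in gradients:
        x = [min(max(xi - eta * gi, lower), upper) for xi, gi in zip(x, g)]
        iterates.append(x)
    return iterates


path = md_euclidean([[1.0], [-3.0], [-3.0]], eta=0.5, lower=-1.0, upper=1.0)
```

In this example the last step would leave the box, so the iterate is clipped to the boundary; with the Euclidean regularizer the iterate may sit on the boundary of the constraint set without any difficulty.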
3.3 Analysis of Ada-FTRL and Ada-MD
Next we present a bound on the forward regret of Ada-FTRL and Ada-MD, and discuss its implications; the proof is provided in Appendix F.
Theorem (Forward regret of Ada-FTRL and Ada-MD)
For any and any sequence of linear losses , the forward regret of Ada-FTRL under Assumption 1 satisfies
whereas the forward regret of Ada-MD under Assumption 2 satisfies
Theorem 3.3 does not require the regularizers to be non-negative or (even non-strongly) convex. [Footnote 10: Nevertheless, such assumptions are useful when combining the theorem with Lemma 2.1.] Thus, Ada-FTRL and Ada-MD capture algorithmic ideas like a non-monotone regularization sequence as in AdaDelay (Sra et al., 2016), and Theorem 3.3 allows us to extend these techniques to other settings; see also Section 9.
In subsequent sections, we show that the generality of Ada-FTRL and Ada-MD, together with the flexibility of Assumptions 1 and 2, considerably facilitates the handling of various algorithmic ideas and problem settings, and allows us to combine them without requiring a new analysis for each new combination.
4 Recoveries and extensions
Lemma 2.1 and Theorem 3.3 together immediately result in generic upper bounds on the regret, given in (23) and (24) in Appendix B. Under different assumptions on the losses and regularizers, these generic bounds directly translate into concrete bounds for specific learning settings. We explore these concrete bounds in the rest of this section.
First, we provide a list of the assumptions on the losses and the regularizers for different learning settings. [Footnote 11: In fact, compared to previous work (e.g., the references listed in Section 1 and Section 3), these are typically relaxed versions of the usual assumptions.] We consider two special cases of the setting of Section 2: online optimization and stochastic optimization. In online optimization, we make the following assumption:
Assumption 3 (Online optimization setting)
For , is locally sub-differentiable, and is a local sub-gradient of at .
Note that may be non-convex, and does not need to define a global lower-bound (i.e., be a sub-gradient) of ; see Section 1.2 for the formal definition of local sub-gradients.
The stochastic optimization setting is concerned with minimizing a function , defined by . In this case the performance metric is redefined to be the expected stochastic regret, . [Footnote 12: Indeed, in stochastic optimization the goal is to find an estimate such that is small. It is well-known (e.g., Shalev-Shwartz 2011, Theorem 5.1) that for any , this equals if is selected uniformly from . Also, if is convex, if is the average of (such averaging can also be used with -star convex functions, cf. Section 8.2). Thus, analyzing the regret is satisfactory.] Typically, if is differentiable in , then , where is a random variable, e.g., sampled independently from . In parallel to Assumption 3, we summarize our assumptions for this setting as follows:
Assumption 4 (Stochastic optimization setting)
The function (defined above) is locally sub-differentiable, for all , and is, in expectation, a local sub-gradient of at : .
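For convex objectives, the reduction from optimization error to regret mentioned in this setting follows from Jensen's inequality. As a short sketch, write $f$ for the objective, $x^*$ for the comparator, and $\bar{x}_T$ for the average of the iterates $x_1, \dots, x_T$ (this notation is assumed here for illustration):

```latex
\[
  \bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t
  \quad \Longrightarrow \quad
  f(\bar{x}_T) - f(x^*)
  \;\le\; \frac{1}{T} \sum_{t=1}^{T} \bigl( f(x_t) - f(x^*) \bigr).
\]
```

Taking expectations on both sides bounds the expected optimization error of the average iterate by the expected cumulative gap $\sum_{t=1}^{T} (f(x_t) - f(x^*))$ divided by $T$, so a sublinear bound on that quantity yields convergence of the averaged iterate.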
In both settings we will rely on the non-negativity of the loss divergences at :
Assumption 5 (Nonnegative loss-divergence)
For all , .
It is well known that this assumption is satisfied when each is convex. However, as we shall see in Section 8, this condition also holds for certain classes of non-convex functions (e.g., star-convex functions and more). In the stochastic optimization setting, since , this condition boils down to , .
In both settings, the regret can be reduced when the losses are strongly convex. Furthermore, in the stochastic optimization setting, the smoothness of the loss is also helpful in decreasing the regret. The next two assumptions capture these conditions.
Assumption 6 (Loss smoothness)
The function is differentiable and -smooth w.r.t. some norm .
Assumption 7 (Loss strong convexity)
The losses are 1-strongly convex w.r.t. the regularizers, that is, for .
Note that if are convex, then it suffices to have in the condition (rather than ). Typically, if is strongly convex w.r.t. a norm , then (or ) is set to for some . Again, in stochastic optimization, Assumption 7 simplifies to , . Furthermore, if is convex, then Assumption 7 implies that is convex.
Finally, the results that we recover depend on the assumption that the total regularization, in both Ada-FTRL and Ada-MD, is strongly convex:
Assumption 8 (Strong convexity of regularizers)
For all , is -strongly convex w.r.t. some norm .
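|Setting / Algorithms||Assumptions||Regret / Expected Stochastic Regret Bound|
[Table 2: the rows are only partially recoverable. Surviving entries, in order: assumptions “(5), 8, 7”; “1, 4, 6,”; “2, 4, 6,”; setting “Smooth & strongly-convex SO”; assumptions “2, 4, 6,”; “(5), 8’, 7”.]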
Table 2 provides a summary of the standard results, under different subsets of the assumptions above, that are recovered and generalized using our framework. The derivations of these results are provided in the form of three corollaries in Appendix B. Note that the analysis is fully modular: each assumption is simply plugged into (23) or (24) to obtain the final bounds, without the need for a separate analysis of Ada-FTRL and Ada-MD for each individual setting. A schematic view of the (standard) proof ideas is given in Figure 2.
5 Composite-objective learning and optimization
Next, we consider the composite-objective online learning setting. In this setting, the functions , from which the (local sub-)gradients are generated and fed to the algorithm, comprise only part of the loss. Instead of , we are interested in minimizing the regret
using the feedback , where are proper functions. The functions are not linearized, but are passed directly to the algorithm.
Naturally, one can use the regularizers to pass the functions to Ada-FTRL and Ada-MD. Then, we can obtain the exact same bounds as in Table 2 on the composite regret ; this recovers and extends the corresponding bounds by Xiao (2009); Duchi et al. (2011, 2010); McMahan (2014). In particular, consider the following two scenarios:
Setting 1: is known before predicting .
In this case, we run Ada-FTRL or Ada-MD with (where ). Thus, we have the update
for Ada-FTRL, and
for Ada-MD. Then, we have the following result.
Suppose that the iterates are given by the Ada-FTRL update (9) or the Ada-MD update (10), and and satisfy Assumption 1 for Ada-FTRL, or Assumption 2 for Ada-MD. Then, under the conditions of each of the three corollaries in Appendix B, the composite regret enjoys the same bound as , but with in place of .
By definition, . Thus, . Upper-bounding by the aforementioned corollaries completes the proof.
Setting 2: is revealed after predicting , together with .
In this case, we run Ada-FTRL and Ada-MD with functions , , so that
for Ada-FTRL, and
for Ada-MD. Then, we have the following result, proved in Appendix C.
Suppose that the iterates are given by the Ada-FTRL update (11) or the Ada-MD update (12), and and satisfy Assumption 1 for Ada-FTRL, or Assumption 2 for Ada-MD. Also, assume that and the are non-negative and non-increasing, i.e., that . [Footnote 13: This relaxes the assumption in the literature, e.g., by McMahan (2014), that for some fixed, non-negative minimized at , and a non-increasing sequence (e.g., ); see also Setting 5.] Then, under the conditions of each of the three corollaries in Appendix B, the composite regret enjoys the same bound as , but with in place of .
In both settings, the functions are passed as part of the regularizers . Thus, if the are strongly convex, less additional regularization is needed in Ada-FTRL to ensure the strong convexity of because will already have some strongly convex components. In addition, in Ada-MD, when the are convex, the terms in (8) will be smaller than the terms found in previous analyses of MD. This is especially useful for implicit updates, as shown in the next section. This also demonstrates another benefit of the generalized Bregman divergence: the , and hence the , may be non-smooth in general.
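To illustrate what passing a non-linearized term through the regularizer buys, the sketch below runs FTRL on $\mathbb{R}^d$ with a quadratic regularizer $\|x\|^2 / (2\eta)$ and an $\ell_1$ penalty $\lambda \|x\|_1$ added once per round. These choices, the cumulative weight $t\lambda$, and the names `composite_ftrl_l1`, `eta` and `lam` are assumptions made for illustration; the exact indexing of the penalty differs between the two settings described above.

```python
import math


def composite_ftrl_l1(gradients, eta, lam):
    """FTRL with a non-linearized l1 penalty on R^d.

    Each iterate minimizes
        <g_{1:t}, x> + t * lam * ||x||_1 + ||x||^2 / (2*eta),
    which separates across coordinates and has the soft-thresholding solution
        x_i = -eta * sign(s_i) * max(|s_i| - t*lam, 0),  with s = g_{1:t}.
    """
    d = len(gradients[0])
    g_sum = [0.0] * d
    iterates = [[0.0] * d]          # x_1 minimizes the quadratic regularizer alone
    for t, g in enumerate(gradients, start=1):
        g_sum = [s + gi for s, gi in zip(g_sum, g)]
        threshold = t * lam         # cumulative l1 weight after t rounds
        x = [-eta * math.copysign(max(abs(s) - threshold, 0.0), s) for s in g_sum]
        iterates.append(x)
    return iterates


sparse = composite_ftrl_l1([[1.0, 0.2], [1.0, 0.2]], eta=0.5, lam=0.3)
```

Because the penalty is kept intact rather than replaced by a sub-gradient, coordinates whose accumulated feedback stays below the threshold are set exactly to zero, which a linearized treatment of the $\ell_1$ term would not produce.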
6 Implicit-update Ada-MD and non-linearized Ada-FTRL
Other learning settings can be captured using the idea of passing information to the algorithm using the functions. This information could include, for example, the curvature of the loss. In particular, consider the composite-objective Ada-FTRL and Ada-MD, and for , let be a differentiable loss, , and . [Footnote 14: For non-differentiable , let and to get the same effect.] Then, , , and the composite-objective Ada-FTRL update (11) is equivalent to
so the implicit-update MD is also a special case of Ada-MD.