1 Introduction
Online and stochastic optimization algorithms form the underlying machinery in much of modern machine learning. Perhaps the most well-known example is Stochastic Gradient Descent (SGD) and its adaptive variants, the so-called AdaGrad algorithms (McMahan and Streeter, 2010; Duchi et al., 2011). Other special cases include multi-armed and linear bandit algorithms, as well as algorithms for online control, tracking and prediction with expert advice (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2011; Hazan et al., 2016).
There are numerous algorithmic variants in online and stochastic optimization, such as adaptive (Duchi et al., 2011; McMahan and Streeter, 2010) and optimistic algorithms (Rakhlin and Sridharan, 2013a, b; Chiang et al., 2012; Mohri and Yang, 2016; Kamalaruban, 2016), implicit updates (Kivinen and Warmuth, 1997; Kulis and Bartlett, 2010), composite objectives (Xiao, 2009; Duchi et al., 2011, 2010), or nonmonotone regularization (Sra et al., 2016). Each of these variants has been analyzed under a specific set of assumptions on the problem, e.g., smooth (Juditsky et al., 2011; Lan, 2012; Dekel et al., 2012), convex (Shalev-Shwartz, 2011; Hazan et al., 2016; Orabona et al., 2015; McMahan, 2014), or strongly convex (Shalev-Shwartz and Kakade, 2009; Hazan et al., 2007; Orabona et al., 2015; McMahan, 2014) objectives. However, a useful property is typically missing from the analyses: modularity. It is typically not clear from the original analysis whether the algorithmic idea can be mixed with other techniques, or whether the effect of the assumptions extends beyond the specific setting considered. For example, based on the existing analyses it is unclear to what extent AdaGrad techniques, or the effects of smoothness, or variational bounds in online learning, extend to new learning settings. Thus, for every new combination of algorithmic ideas, or under every new learning setting, the algorithms are typically analyzed from scratch.
A special new learning setting is nonconvex optimization. While the bulk of results in online and stochastic optimization assume the convexity of the loss functions, online and stochastic optimization algorithms have been successfully applied in settings where the objectives are nonconvex. In particular, the highly popular deep learning techniques (Goodfellow et al., 2016) are based on the application of stochastic optimization algorithms to nonconvex objectives. In the face of this discrepancy between the state of the art in theory and practice, an ongoing thread of research attempts to generalize the analyses of stochastic optimization to nonconvex settings. In particular, certain nonconvex problems have been shown to actually admit efficient optimization methods, usually taking some form of a gradient method (one such problem is matrix completion, see, e.g., Ge et al., 2016; Bhojanapalli et al., 2016).
The goal of this paper is to provide a flexible, modular analysis of online and stochastic optimization algorithms that makes it easy to combine different algorithmic techniques and learning settings under as few assumptions as possible.
1.1 Contributions
First, building on previous attempts to unify the analyses of online and stochastic optimization (Shalev-Shwartz, 2011; Hazan et al., 2016; Orabona et al., 2015; McMahan, 2014), we provide a unified analysis of a large family of optimization algorithms in general Hilbert spaces. The analysis is crafted to be modular: it decouples the contribution of each assumption or algorithmic idea from the analysis, so as to enable us to combine different assumptions and techniques without analyzing the algorithms from scratch.
The analysis depends on a novel decomposition of the optimization performance (optimization error or regret) into two parts: the first part captures the generic performance of the algorithm, whereas the second part connects the assumptions about the learning setting to the information given to the algorithm. Lemma 2.1 in Section 2.1 provides such a decomposition.^{1} Then, in Theorem 3.3, we bound the generic (first) part, using a careful analysis of the linear regret of generalized adaptive Follow-The-Regularized-Leader (FTRL) and Mirror Descent (MD) algorithms.
^{1} This can be viewed as a refined version of the so-called “be-the-leader” style of analysis. Previous work (e.g., McMahan 2014; Shalev-Shwartz 2011) may give the impression that “follow-the-leader/be-the-leader” analyses lose constant factors while other methods such as primal-dual analysis do not. This is not the case for our analysis. In fact, we improve constants in optimistic online learning; see Section 7.
Second, we use this analysis framework to provide a concise summary of a large body of previous results. Section 4 provides the basic results, and Sections 5, 6 and 7 present the relevant extensions and applications.
Third, building on the aforementioned modularity, we analyze new learning algorithms. In particular, in Section 7.4 we analyze a new adaptive, optimistic, composite-objective FTRL algorithm with variational bounds for smooth convex loss functions, which combines the best properties and avoids the limitations of the previous work. We also present a new class of optimistic MD algorithms with only one MD update per round (Section 7.2).
Finally, we extend the previous results to special classes of nonconvex optimization problems. In particular, for such problems, we provide global convergence guarantees for general adaptive online and stochastic optimization algorithms. The class of nonconvex problems we consider (cf. Section 8) generalizes practical classes of functions considered in previous work on nonconvex optimization.
1.2 Notation and definitions
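We will work with a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$ over the reals. That is, $\mathcal{H}$ is a real vector space equipped with an inner product $\langle \cdot, \cdot \rangle$, such that $\mathcal{H}$ is complete with respect to (w.r.t.) the norm induced by $\langle \cdot, \cdot \rangle$. Examples include $\mathcal{H} = \mathbb{R}^d$ (for a positive integer $d$) where $\langle \cdot, \cdot \rangle$ is the standard dot-product, or $\mathcal{H} = \mathbb{R}^{m \times n}$, the set of $m \times n$ real matrices, where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$, or $\mathcal{H} = L^2(\mathcal{C})$, the set of square-integrable real-valued functions on $\mathcal{C} \subset \mathbb{R}^d$, where $\langle f, g \rangle = \int_{\mathcal{C}} f(x) g(x)\, dx$ for any $f, g \in \mathcal{H}$.
We denote the extended real line by $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$, and work with functions of the form $f : \mathcal{H} \to \overline{\mathbb{R}}$. Given a set $\mathcal{C} \subset \mathcal{H}$, the indicatrix of $\mathcal{C}$ is the function $I_{\mathcal{C}} : \mathcal{H} \to \overline{\mathbb{R}}$ given by $I_{\mathcal{C}}(x) = 0$ for $x \in \mathcal{C}$ and $I_{\mathcal{C}}(x) = +\infty$ for $x \notin \mathcal{C}$. The effective domain of a function $f : \mathcal{H} \to \overline{\mathbb{R}}$, denoted by $\mathrm{dom}(f)$, is the set $\{x \in \mathcal{H} : f(x) < +\infty\}$ where $f$ is less than infinity; conversely, we identify any function $f : \mathcal{C} \to \overline{\mathbb{R}}$ defined only on a set $\mathcal{C} \subset \mathcal{H}$ by the function $f + I_{\mathcal{C}}$. A function $f$ is proper if $\mathrm{dom}(f)$ is nonempty and $f(x) > -\infty$ for all $x \in \mathcal{H}$.
Let $f : \mathcal{H} \to \overline{\mathbb{R}}$ be proper. We denote the set of all subgradients of $f$ at $x \in \mathcal{H}$ by $\partial f(x)$, i.e.,
$$\partial f(x) := \{ u \in \mathcal{H} : \langle u, y - x \rangle + f(x) \le f(y) \text{ for all } y \in \mathcal{H} \}.$$
The function $f$ is subdifferentiable at $x$ if $\partial f(x) \neq \emptyset$; we use $f'(x)$ to denote any member of $\partial f(x)$. Note that $\partial f(x) = \emptyset$ when $x \notin \mathrm{dom}(f)$.
Let $x \in \mathrm{dom}(f)$, assume that $f(x) > -\infty$, and let $z \in \mathcal{H}$. The directional derivative of $f$ at $x$ in the direction $z$ is defined as $f'(x; z) := \lim_{\alpha \downarrow 0} \frac{f(x + \alpha z) - f(x)}{\alpha}$, provided that the limit exists in $[-\infty, +\infty]$. The function $f$ is differentiable at $x$ if it has a gradient at $x$, i.e., a vector $\nabla f(x) \in \mathcal{H}$ such that $f'(x; z) = \langle \nabla f(x), z \rangle$ for all $z \in \mathcal{H}$. The function $f$ is locally subdifferentiable at $x$ if it has a local subgradient at $x$, i.e., a vector $g \in \mathcal{H}$ such that $\langle g, z \rangle \le f'(x; z)$ for all $z \in \mathcal{H}$. We denote the set of local subgradients of $f$ at $x$ by $\delta f(x)$. Note that if $f'(x; z)$ exists for all $z \in \mathcal{H}$, and $f$ is subdifferentiable at $x$, then it is also locally subdifferentiable with $u \in \delta f(x)$ for any $u \in \partial f(x)$. Similarly, if $f$ is differentiable at $x$, then it is also locally subdifferentiable, with $\nabla f(x) \in \delta f(x)$. The function $f$ is called directionally differentiable at $x$ if $f(x) > -\infty$ and $f'(x; z)$ exists in $[-\infty, +\infty]$ for all $z \in \mathcal{H}$; $f$ is called directionally differentiable if it is directionally differentiable at every $x \in \mathrm{dom}(f)$.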
Next, we define a generalized^{2} notion of Bregman divergence:
Definition (Bregman divergence)
Let $f$ be directionally differentiable at $x \in \mathrm{dom}(f)$. The $f$-induced Bregman divergence from $x$ is the function from $\mathcal{H}$ to $\overline{\mathbb{R}}$, given by
$$B_f(y, x) := f(y) - f(x) - f'(x; y - x). \qquad (1)$$
^{2} If $f$ is differentiable at $x$, then (1) matches the traditional definition of Bregman divergence. Previous work also considered generalized Bregman divergences, e.g., the works of Telgarsky and Dasgupta (2012); Kiwiel (1997) and the references therein. However, our definition is not limited to convex functions, allowing us to study convex and nonconvex functions under a unified theory; see, e.g., Section 8.
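A function $f : \mathcal{H} \to \overline{\mathbb{R}}$ is convex if for all $x, y \in \mathrm{dom}(f)$ and all $\alpha \in (0, 1)$, $f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$. We can show that a proper convex function is always directionally differentiable, and the Bregman divergence it induces is always nonnegative (see Appendix E). Let $\|\cdot\|$ denote a norm on $\mathcal{H}$ and let $L, \beta > 0$. A directionally differentiable function $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ iff $B_f(x, y) \ge \frac{\beta}{2} \|x - y\|^2$ for all $x, y \in \mathrm{dom}(f)$. The function $f$ is $L$-smooth w.r.t. $\|\cdot\|$ iff for all $x, y \in \mathrm{dom}(f)$, $|B_f(x, y)| \le \frac{L}{2} \|x - y\|^2$.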
We use $\{c_t\}_{t=i}^{j}$ to denote the sequence $c_i, c_{i+1}, \dots, c_j$, and $c_{i:j}$ to denote the sum $\sum_{t=i}^{j} c_t$, with $c_{i:j} := 0$ for $i > j$.
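The generalized divergence can be explored numerically even for nonsmooth functions. The following is a minimal sketch, assuming the divergence has the form $B_f(y, x) = f(y) - f(x) - f'(x; y - x)$ and approximating the one-sided directional derivative by a forward difference. The helper names and the step size `alpha` are illustrative choices, not notation from the paper.

```python
def directional_derivative(f, x, z, alpha=1e-7):
    """Forward-difference estimate of f'(x; z) = lim_{a -> 0+} (f(x + a*z) - f(x)) / a."""
    return (f(x + alpha * z) - f(x)) / alpha


def bregman(f, y, x):
    """Assumed form of the generalized divergence: f(y) - f(x) - f'(x; y - x)."""
    return f(y) - f(x) - directional_derivative(f, x, y - x)


def square(x):
    """Smooth, strongly convex example: 0.5 * x^2."""
    return 0.5 * x * x


# For 0.5 * x^2 the divergence is 0.5 * (y - x)^2.
print(bregman(square, 3.0, 1.0))
# At the kink of |x| the directional derivative is |z|, so the divergence from 0 vanishes.
print(bregman(abs, 2.0, 0.0))
# Across the kink the divergence is strictly positive.
print(bregman(abs, 2.0, -1.0))
```

Because only one-sided limits are used, the same routine applies at points where no gradient exists, which is the situation the generalized definition is meant to cover.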
2 Problem setting: online optimization
We study a general first-order iterative optimization setting that encompasses several common optimization scenarios, including online, stochastic, and full-gradient optimization. Consider a convex set $\mathcal{X} \subset \mathcal{H}$, a sequence of directionally differentiable functions $f_1, f_2, \dots$ from $\mathcal{H}$ to $\overline{\mathbb{R}}$ with $\mathcal{X} \subset \mathrm{dom}(f_t)$ for all $t = 1, 2, \dots$, and a first-order iterative optimization algorithm. The algorithm starts with an initial point $x_1 \in \mathcal{X}$. Then, in each iteration $t = 1, 2, \dots$, the algorithm suffers a loss $f_t(x_t)$ from the latest point $x_t$, receives some feedback $g_t \in \mathcal{H}$, and selects the next point $x_{t+1} \in \mathcal{X}$. Typically, $\langle g_t, \cdot \rangle$ is supposed to be an estimate or lower bound on the directional derivative of $f_t$ at $x_t$. This protocol is summarized in Figure 1.
Unlike Online Convex Optimization (OCO), at this stage we do not assume that the $f_t$ are convex^{3} or differentiable, nor do we assume that the $g_t$ are gradients or subgradients. Our goal is to minimize the regret $R_T(x^*)$ against any $x^* \in \mathcal{X}$, defined as
$$R_T(x^*) := \sum_{t=1}^{T} \big( f_t(x_t) - f_t(x^*) \big).$$
^{3} There is a long tradition of nonconvex assumptions in the Stochastic Approximation (SA) literature, see, e.g., the book of Bertsekas and Shreve (1978). Our results differ in that they apply to more recent advances in online learning (e.g., AdaGrad algorithms), and we derive anytime regret bounds, rather than asymptotic convergence results, for specific nonconvex function classes.
2.1 Regret decomposition
Below, we provide a decomposition of $R_T(x^*)$ (proved in Appendix A) which holds for any sequence of points $x_1, x_2, \dots, x_{T+1}$ and any $x^*$. The decomposition is in terms of the forward linear regret $R_T^+(x^*)$, defined as
$$R_T^+(x^*) := \sum_{t=1}^{T} \langle g_t, x_{t+1} - x^* \rangle.$$
Intuitively, $R_T^+$ is the regret (in linear losses) of the “cheating” algorithm that uses action $x_{t+1}$ at time $t$, and depends only on the choices of the algorithm and the feedback it receives.
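Lemma (Regret decomposition)
Let $x^*, x_1, x_2, \dots, x_{T+1}$ be any sequence of points in $\mathcal{X}$. For $t = 1, 2, \dots, T$, let $f_t$ be directionally differentiable with $\mathcal{X} \subset \mathrm{dom}(f_t)$, and let $g_t \in \mathcal{H}$. Then,
$$R_T(x^*) = R_T^+(x^*) + \sum_{t=1}^{T} \langle g_t, x_t - x_{t+1} \rangle - \sum_{t=1}^{T} B_{f_t}(x^*, x_t) + \sum_{t=1}^{T} \delta_t, \qquad (2)$$
where $\delta_t := -f_t'(x_t; x^* - x_t) + \langle g_t, x^* - x_t \rangle$.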
Intuitively, the second term captures the regret due to the algorithm’s inability to look ahead into the future.^{4} The last two terms capture, respectively, the gain in regret that is possible due to the curvature of $f_t$, and the accuracy of the first-order (gradient) information $g_t$.
^{4} This is also related to the concept of “prediction drift”, which appears in learning with delayed feedback (Joulani et al., 2016), and to the role of stability in online algorithms (Saha et al., 2012).
In light of this lemma, controlling the regret reduces to controlling the individual terms in (2). First, we provide upper bounds on $R_T^+(x^*)$ for a large class of online algorithms.
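The decomposition can be checked numerically on a toy problem. The sketch below assumes the forms $R_T(x^*) = \sum_t (f_t(x_t) - f_t(x^*))$, $R_T^+(x^*) = \sum_t \langle g_t, x_{t+1} - x^* \rangle$ and $\delta_t = -f_t'(x_t; x^* - x_t) + \langle g_t, x^* - x_t \rangle$, and uses one-dimensional quadratic losses $f_t(x) = \frac{1}{2}(x - c_t)^2$ on $[-1, 1]$. The projected gradient step, the inexact feedback, and all names are illustrative assumptions; the identity itself does not depend on how the points are chosen.

```python
def clip(v, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, v))


def decomposition(targets, noise, eta, x_star):
    """Return (regret, right-hand side of the assumed decomposition)."""
    xs, gs = [0.0], []                      # xs[0] plays the role of x_1
    for c, e in zip(targets, noise):
        g = (xs[-1] - c) + e                # inexact gradient feedback g_t
        gs.append(g)
        xs.append(clip(xs[-1] - eta * g))   # next point x_{t+1}
    T = len(targets)

    def loss(c, x):
        return 0.5 * (x - c) ** 2

    regret = sum(loss(targets[t], xs[t]) - loss(targets[t], x_star) for t in range(T))
    forward = sum(gs[t] * (xs[t + 1] - x_star) for t in range(T))
    drift = sum(gs[t] * (xs[t] - xs[t + 1]) for t in range(T))
    # For a quadratic loss, B_{f_t}(x*, x_t) = 0.5 * (x* - x_t)^2.
    curvature = sum(0.5 * (x_star - xs[t]) ** 2 for t in range(T))
    # delta_t = -f_t'(x_t; x* - x_t) + g_t * (x* - x_t), with f_t'(x_t) = x_t - c_t.
    delta = sum((gs[t] - (xs[t] - targets[t])) * (x_star - xs[t]) for t in range(T))
    return regret, forward + drift - curvature + delta


regret, rhs = decomposition(
    targets=[0.8, -0.3, 0.5, -0.9, 0.1],
    noise=[0.05, -0.02, 0.0, 0.03, -0.04],
    eta=0.3,
    x_star=0.2,
)
print(regret, rhs)  # the two values agree up to floating-point rounding
```

Here the `delta` term is nonzero only because the feedback is perturbed; with exact gradients it vanishes and the curvature term is the only correction to the linearized regret.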
3 The algorithms: AdaFTRL and AdaMD
In this section, we analyze AdaFTRL and AdaMD. These two algorithms generalize the well-known core algorithms of online optimization: FTRL (Shalev-Shwartz, 2011; Hazan et al., 2016) and MD (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003; Warmuth and Jagota, 1997; Duchi et al., 2010). In particular, AdaFTRL and AdaMD capture variants of FTRL and MD such as Dual-Averaging (Nesterov, 2009; Xiao, 2009), AdaGrad (Duchi et al., 2011; McMahan and Streeter, 2010), composite-objective algorithms (Xiao, 2009; Duchi et al., 2011, 2010), implicit-update MD (Kivinen and Warmuth, 1997; Kulis and Bartlett, 2010), strongly-convex and nonlinearized FTRL (Shalev-Shwartz and Kakade, 2009; Hazan et al., 2007; Orabona et al., 2015; McMahan, 2014), optimistic FTRL and MD (Rakhlin and Sridharan, 2013a, b; Chiang et al., 2012; Mohri and Yang, 2016; Kamalaruban, 2016), and even algorithms like AdaDelay (Sra et al., 2016) that violate the common nondecreasing regularization assumption existing in much of the previous work.
3.1 AdaFTRL: Generalized adaptive Follow-the-Regularized-Leader
The AdaFTRL algorithm works with two sequences of regularizers, and , where each and is a function from to . At time , having received , AdaFTRL uses and to compute the next point . The regularizers and can be built by AdaFTRL in an online adaptive manner using the information generated up to the end of time step (including and ). In particular, we use to distinguish the “proximal” part of this adaptive regularization: for all , we require that (but not necessarily ) be minimized over at , that is,^{5}
(3) 
^{5} Note that does not depend on , but is rather computed using only . Once is calculated, can be chosen so that (3) holds (and then used in computing ).
With the definitions above, for , AdaFTRL selects such that
(4) 
In particular, this means that the initial point satisfies^{6}
^{6} The case of an arbitrary is equivalent to using, e.g., (and changing correspondingly).
In addition, for notational convenience, we define , so that
(5) 
Finally, we need to make a minimal assumption to ensure that AdaFTRL is well-defined.
Assumption 1 (Well-posed AdaFTRL)
Table 1 provides examples of several special cases of AdaFTRL. In particular, AdaFTRL combines, unifies and considerably extends the two major types of FTRL algorithms previously considered in the literature, i.e., the so-called FTRL-Centered and FTRL-Prox algorithms (McMahan, 2014) and their variants, as discussed in the subsequent sections.
Algorithm | Regularization | Notes, Conditions and Assumptions
Online Gradient Descent (OGD) | | Update: 
Dual Averaging (DA) | | 
AdaGrad Dual Averaging | | (full-matrix update); (diagonal-matrix update)
FTRL-Prox | | and as in AdaGrad-DA
Composite-Objective Online Learning | | For adding composite-objective learning to any instance of AdaFTRL (see also Section 5)
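To make the relation between FTRL and gradient descent concrete, the following minimal sketch compares two textbook updates on $\mathbb{R}^d$ without constraints: linearized FTRL with a single fixed quadratic regularizer $\|x\|^2 / (2\eta)$, whose minimizer is available in closed form, and online gradient descent with constant step size $\eta$ started at the origin. The choice of regularizer, the unconstrained domain, and the function names are assumptions made for illustration and need not coincide with the parameterization used in Table 1.

```python
def ftrl_quadratic(grads, eta):
    """Linearized FTRL on R^d with the fixed regularizer ||x||^2 / (2 * eta).

    The minimizer of <g_{1:t}, x> + ||x||^2 / (2 * eta) over R^d is -eta * g_{1:t}.
    """
    d = len(grads[0])
    g_sum = [0.0] * d
    iterates = [[0.0] * d]                  # x_1 minimizes the regularizer alone
    for g in grads:
        g_sum = [s + gi for s, gi in zip(g_sum, g)]
        iterates.append([-eta * s for s in g_sum])
    return iterates


def ogd(grads, eta):
    """Unconstrained online gradient descent started at the origin."""
    x = [0.0] * len(grads[0])
    iterates = [x]
    for g in grads:
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        iterates.append(x)
    return iterates


grads = [[1.0, -2.0], [0.5, 0.5], [-1.5, 1.0]]
for a, b in zip(ftrl_quadratic(grads, 0.1), ogd(grads, 0.1)):
    print(a, b)  # the two sequences coincide up to floating-point rounding
```

With a constraint set or time-varying regularizers the two updates generally differ, which is where the distinction between the FTRL and MD families becomes relevant.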
3.2 AdaMD: Generalized adaptive Mirror-Descent
As in AdaFTRL, the AdaMD algorithm uses two sequences of regularizer functions from to : and . Further, we assume that the domains of are nonincreasing, that is, for . Again, can be created using the information generated by the end of time step . The initial point of AdaMD satisfies^{7}
^{7} The case of an arbitrary is equivalent to using, e.g., (and changing correspondingly).
Furthermore, at time , having observed , AdaMD uses and to select the point such that
(6) 
In addition, similarly to AdaFTRL, we define , though we do not require to be minimized at in AdaMD.^{8}
^{8} We use the convention in defining .
Finally, we present our assumption on the regularizers of AdaMD. Compared to AdaFTRL, we require a stronger assumption to ensure that AdaMD is well-defined, and that the Bregman divergences in (6) have a controlled behavior.
Assumption 2 (Well-posed AdaMD)
The regularizers , are proper, and is directionally differentiable. In addition, for all , the sets that define in (6) are nonempty, and their optimal values are finite. Finally, for all , , and are directionally differentiable, , and is linear in the directions inside , i.e., there is a vector in , denoted by , such that for all .
Remark 1
Our results also hold under the weaker condition that is concave^{9} (rather than linear) on . However, in case of a convex , this weaker condition would again translate into having a linear , because a convex implies a convex (Bauschke and Combettes, 2011, Proposition 17.2). While we do not require that be convex, all of our subsequent examples in the paper use convex . Thus, in the interest of readability, we have made the stronger assumption of linear directional derivatives here.
^{9} Without such assumptions, a Bregman divergence term in appears in the regret bound of AdaMD. Concavity ensures that this term is not positive and can be dropped, greatly simplifying the bounds.
Remark 2
Note that needs to be linear only in the directions inside the domain of . As such, we avoid the extra technical conditions required in previous work, e.g., that be a Legendre function to ensure remains in the interior of and is well-defined.
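As a point of reference for the general update, the following sketch shows a classical special case of mirror descent: a fixed negative-entropy regularizer on the probability simplex, for which the update $x_{t+1} \in \arg\min_x \{ \eta \langle g_t, x \rangle + \mathrm{KL}(x \,\|\, x_t) \}$ has the closed-form "exponentiated gradient" solution. The constant step size `eta`, the time-invariant regularizer, and the function name are illustrative assumptions; the adaptive regularizer sequences of AdaMD are not modeled here.

```python
import math


def exponentiated_gradient_step(x, g, eta):
    """One mirror-descent step on the simplex with the negative-entropy regularizer.

    Each coordinate is reweighted by exp(-eta * g_i) and the result is renormalized,
    which is the closed-form minimizer of eta * <g, u> + KL(u || x) over the simplex.
    """
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    total = sum(weights)
    return [w / total for w in weights]


x = [1.0 / 3.0] * 3                       # start from the uniform distribution
for g in ([1.0, 0.0, 2.0], [0.5, 0.0, 1.0]):
    x = exponentiated_gradient_step(x, g, eta=0.5)
print(x)  # mass moves toward the coordinate with the smallest cumulative loss
```

Because the multiplicative update keeps every coordinate strictly positive, the iterates stay in the relative interior of the simplex in this special case; the remark above concerns how the general analysis avoids requiring such a property.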
3.3 Analysis of AdaFTRL and AdaMD
Next we present a bound on the forward regret of AdaFTRL and AdaMD, and discuss its implications; the proof is provided in Appendix F.
Theorem (Forward regret of AdaFTRL and AdaMD)
For any and any sequence of linear losses , the forward regret of AdaFTRL under Assumption 1 satisfies
(7) 
whereas the forward regret of AdaMD under Assumption 2 satisfies
(8) 
Remark 3
Theorem 3.3 does not require the regularizers to be nonnegative or (even non-strongly) convex.^{10} Thus, AdaFTRL and AdaMD capture algorithmic ideas like a nonmonotone regularization sequence as in AdaDelay (Sra et al., 2016), and Theorem 3.3 allows us to extend these techniques to other settings; see also Section 9.
^{10} Nevertheless, such assumptions are useful when combining the theorem with Lemma 2.1.
Remark 4
In practice, AdaFTRL and AdaMD need to pick a specific from the multiple possible optimal points in (4) and (6). The bounds of Theorem 3.3 apply irrespective of the tie-breaking scheme.
In subsequent sections, we show that the generality of AdaFTRL and AdaMD, together with the flexibility of Assumptions 1 and 2, considerably facilitates the handling of various algorithmic ideas and problem settings, and allows us to combine them without requiring a new analysis for each new combination.
4 Recoveries and extensions
Lemma 2.1 and Theorem 3.3 together immediately result in generic upper bounds on the regret, given in (23) and (24) in Appendix B. Under different assumptions on the losses and regularizers, these generic bounds directly translate into concrete bounds for specific learning settings. We explore these concrete bounds in the rest of this section.
First, we provide a list of the assumptions on the losses and the regularizers for different learning settings.^{11}
^{11} In fact, compared to previous work (e.g., the references listed in Section 1 and Section 3), these are typically relaxed versions of the usual assumptions.
We consider two special cases of the setting of Section 2: online optimization and stochastic optimization. In online optimization, we make the following assumption:
Assumption 3 (Online optimization setting)
For , is locally subdifferentiable, and is a local subgradient of at .
Note that may be nonconvex, and does not need to define a global lower bound (i.e., be a subgradient) of ; see Section 1.2 for the formal definition of local subgradients.
The stochastic optimization setting is concerned with minimizing a function , defined by . In this case the performance metric is redefined to be the expected stochastic regret, .^{12}
^{12} Indeed, in stochastic optimization the goal is to find an estimate such that is small. It is well-known (e.g., Shalev-Shwartz 2011, Theorem 5.1) that for any , this equals if is selected uniformly from . Also, if is convex, if is the average of (such averaging can also be used with star-convex functions, cf. Section 8.2). Thus, analyzing the regret is satisfactory.
Typically, if is differentiable in , then , where is a random variable, e.g., sampled independently from . In parallel to Assumption 3, we summarize our assumptions for this setting as follows:
Assumption 4 (Stochastic optimization setting)
The function (defined above) is locally subdifferentiable, for all , and is, in expectation, a local subgradient of at : .
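The remark on averaging in the stochastic setting rests on Jensen's inequality: for a convex objective $f$, the optimization error of the averaged iterate is at most the average regret measured in $f$. The sketch below checks this on a fixed, deterministic sequence of iterates; the objective, the iterates, and the function name are illustrative assumptions, and no randomness or expectation is simulated.

```python
def average_iterate_gap(f, iterates, x_star):
    """Return (f(average iterate) - f(x*), average of f(x_t) - f(x*))."""
    T = len(iterates)
    x_bar = sum(iterates) / T
    gap_of_average = f(x_bar) - f(x_star)
    average_regret = sum(f(x) - f(x_star) for x in iterates) / T
    return gap_of_average, average_regret


def objective(x):
    """Convex example with minimizer x* = 1."""
    return (x - 1.0) ** 2


gap, avg_regret = average_iterate_gap(objective, [3.0, 2.0, 1.5, 1.25, 1.125], 1.0)
print(gap, avg_regret)  # the first value does not exceed the second
```

For nonconvex objectives the inequality can fail, which is why selecting an iterate uniformly at random is the alternative mentioned for the general case.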
In both settings we will rely on the nonnegativity of the loss divergences at :
Assumption 5 (Nonnegative loss-divergence)
For all , .
It is well known that this assumption is satisfied when each is convex. However, as we shall see in Section 8, this condition also holds for certain classes of nonconvex functions (e.g., star-convex functions and more). In the stochastic optimization setting, since , this condition boils down to , .
In both settings, the regret can be reduced when the losses are strongly convex. Furthermore, in the stochastic optimization setting, the smoothness of the loss is also helpful in decreasing the regret. The next two assumptions capture these conditions.
Assumption 6 (Loss smoothness)
The function is differentiable and smooth w.r.t. some norm .
Assumption 7 (Loss strong convexity)
The losses are 1-strongly convex w.r.t. the regularizers, that is, for .
Note that if are convex, then it suffices to have in the condition (rather than ). Typically, if is strongly convex w.r.t. a norm , then (or ) is set to for some . Again, in stochastic optimization, Assumption 7 simplifies to , . Furthermore, if is convex, then Assumption 7 implies that is convex.
Finally, the results that we recover depend on the assumption that the total regularization, in both AdaFTRL and AdaMD, is strongly convex:
Assumption 8 (Strong convexity of regularizers)
For all , is strongly convex w.r.t. some norm .
Setting / Algorithms | Assumptions | Regret / Expected Stochastic Regret Bound
OO/SO, AdaFTRL | 1, 3/4, 5, 8 | 
OO/SO, AdaMD | 2, 3/4, 5, 8 | 
Strongly-convex OO/SO, AdaMD | 2, 3/4, (5), 8, 7 | 
Smooth SO, AdaFTRL | 1, 4, 6, 5, 8’ | 
Smooth SO, AdaMD | 2, 4, 6, 5, 8’ | 
Smooth & strongly-convex SO, AdaMD | 2, 4, 6, (5), 8’, 7 | 
Table 2 provides a summary of the standard results, under different subsets of the assumptions above, that are recovered and generalized using our framework. The derivations of these results are provided in the form of three corollaries in Appendix B. Note that the analysis is absolutely modular: each assumption is simply plugged into (23) or (24) to obtain the final bounds, without the need for a separate analysis of AdaFTRL and AdaMD for each individual setting. A schematic view of the (standard) proof ideas is given in Figure 2.
5 Composite-objective learning and optimization
Next, we consider the composite-objective online learning setting. In this setting, the functions , from which the (local sub)gradients are generated and fed to the algorithm, comprise only part of the loss. Instead of , we are interested in minimizing the regret
using the feedback , where are proper functions. The functions are not linearized, but are passed directly to the algorithm.
Naturally, one can use the regularizers to pass the functions to AdaFTRL and AdaMD. Then, we can obtain the exact same bounds as in Table 2 on the composite regret ; this recovers and extends the corresponding bounds by Xiao (2009); Duchi et al. (2011, 2010); McMahan (2014). In particular, consider the following two scenarios:
Setting 1: is known before predicting .
In this case, we run AdaFTRL or AdaMD with (where ). Thus, we have the update
(9) 
for AdaFTRL, and
(10) 
for AdaMD. Then, we have the following result.
Corollary
Suppose that the iterates are given by the AdaFTRL update (9) or the AdaMD update (10), and and satisfy Assumption 1 for AdaFTRL, or Assumption 2 for AdaMD. Then, under the conditions of each of the three corollaries in Appendix B, the composite regret enjoys the same bound as , but with in place of .
Proof. By definition, . Thus, . Upper-bounding by the aforementioned corollaries completes the proof.
Setting 2: is revealed after predicting , together with .
In this case, we run AdaFTRL and AdaMD with functions , , so that
(11) 
for AdaFTRL, and
(12) 
for AdaMD. Then, we have the following result, proved in Appendix C.
Corollary
Suppose that the iterates are given by the AdaFTRL update (11) or the AdaMD update (12), and and satisfy Assumption 1 for AdaFTRL, or Assumption 2 for AdaMD. Also, assume that and the are nonnegative and nonincreasing, i.e., that .^{13} Then, under the conditions of each of the three corollaries in Appendix B, the composite regret enjoys the same bound as , but with in place of .
^{13} This relaxes the assumption in the literature, e.g., by McMahan (2014), that for some fixed, nonnegative minimized at , and a nonincreasing sequence (e.g., ); see also Setting 5.
Remark 5
In both settings, the functions are passed as part of the regularizers . Thus, if the are strongly convex, less additional regularization is needed in AdaFTRL to ensure the strong convexity of because will already have some strongly convex components. In addition, in AdaMD, when the are convex, the terms in (8) will be smaller than the terms found in previous analyses of MD. This is especially useful for implicit updates, as shown in the next section. This also demonstrates another benefit of the generalized Bregman divergence: the , and hence the , may be nonsmooth in general.
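A familiar concrete instance of passing a composite term to the algorithm without linearizing it is an $\ell_1$ penalty combined with a squared-Euclidean proximity term. The sketch below assumes the composite term $\lambda \|x\|_1$, a constant step size $\eta$, and an unconstrained domain, so that the step $\arg\min_u \{ \langle g, u \rangle + \lambda \|u\|_1 + \|u - x\|^2 / (2\eta) \}$ reduces to coordinate-wise soft-thresholding. These choices and the function names are illustrative and are not the paper's general updates (9) to (12).

```python
def soft_threshold(v, tau):
    """Shrink v toward zero by tau, returning exactly zero inside [-tau, tau]."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def composite_step(x, g, eta, lam):
    """Minimize <g, u> + lam * ||u||_1 + ||u - x||^2 / (2 * eta) over u in R^d.

    The linear term and the quadratic term combine into a gradient step x - eta * g,
    and the l1 term then shrinks each coordinate by eta * lam.
    """
    return [soft_threshold(xi - eta * gi, eta * lam) for xi, gi in zip(x, g)]


print(composite_step([1.0, 0.05, -0.5], [0.5, 0.0, -0.5], eta=0.1, lam=1.0))
```

Keeping the penalty out of the linearization is what produces exact zeros in the iterate; a subgradient step on the penalty would only shrink small coordinates without zeroing them.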
6 Implicit-update AdaMD and nonlinearized AdaFTRL
Other learning settings can be captured using the idea of passing information to the algorithm using the functions. This information could include, for example, the curvature of the loss. In particular, consider the composite-objective AdaFTRL and AdaMD, and for , let be a differentiable loss, , and .^{14} Then, , , and the composite-objective AdaFTRL update (11) is equivalent to
^{14} For nondifferentiable , let and to get the same effect.
(13) 
Thus, nonlinearized FTRL, studied by McMahan (2014), is a special case of AdaFTRL. With the same , the composite-objective AdaMD update (12) is equivalent to
(14) 
so the implicit-update MD is also a special case of AdaMD.
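The difference between an implicit and a linearized update is easiest to see in one dimension. The sketch below assumes a squared loss $f(u) = \frac{1}{2}(a u - b)^2$, a squared-Euclidean proximity term, and a constant step size $\eta$, so that the implicit step $\arg\min_u \{ f(u) + (u - x)^2 / (2\eta) \}$ has the closed form $(x + \eta a b) / (1 + \eta a^2)$. The loss, the proximity term, and the function names are illustrative assumptions rather than the general update (14).

```python
def implicit_step(x, a, b, eta):
    """Implicit update: minimize 0.5 * (a*u - b)^2 + (u - x)^2 / (2 * eta) over u.

    Setting the derivative a * (a*u - b) + (u - x) / eta to zero gives the closed form.
    """
    return (x + eta * a * b) / (1.0 + eta * a * a)


def linearized_step(x, a, b, eta):
    """Explicit update: a gradient step on the loss linearized at x."""
    return x - eta * a * (a * x - b)


a, b, eta, x = 2.0, 2.0, 1.0, 0.0         # the loss is minimized at u = 1
print(implicit_step(x, a, b, eta))        # moves toward the minimizer without passing it
print(linearized_step(x, a, b, eta))      # overshoots the minimizer for this large step
```

The implicit step is a convex combination of the current point and the loss minimizer in this example, which is why it remains stable for step sizes at which the linearized step overshoots.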