# Curvature of Feasible Sets in Offline and Online Optimization

It is known that the curvature of the feasible set in convex optimization allows for algorithms with better convergence rates, and there has been renewed interest in this topic for both offline and online problems. In this paper, leveraging results on geometry and convex analysis, we further our understanding of the role of curvature in optimization:

- We first show the equivalence of two notions of curvature, namely strong convexity and gauge bodies, proving a conjecture of Abernethy et al. As a consequence, this shows that the Frank-Wolfe-type method of Wang and Abernethy has accelerated convergence rate O(1/t^2) over strongly convex feasible sets without additional assumptions on the (convex) objective function.
- In Online Linear Optimization, we show that the logarithmic regret guarantee of the algorithm Follow the Leader (FTL) over strongly convex sets recently proved by Huang et al. follows directly from a partial Lipschitzness of the support function of such sets. We believe that this provides a simpler explanation for the good performance of FTL in this context.
- We provide an efficient procedure for approximating convex bodies by curved ones, smoothly trading off approximation error and curvature, allowing one to extend the applicability of algorithms for curved sets to non-curved ones.

## 1 Introduction

Curvature is one of the most fundamental geometric notions, with fascinating connections to many different phenomena. There has been much interest in the influence of curvature on computational and statistical efficiency in optimization and machine learning, with the use of notions of curvature such as strong convexity and gauge bodies in convex optimization [32, 12, 13, 18, 1], and uniform convexity/martingale-cotype (and their dual notions uniform smoothness/martingale-type) in online and statistical learning [44, 42, 35, 15].

Our goal is to better understand the relationship between different notions of curvature and their effect in optimization. We briefly discuss some of the known results in order to point out the specific limitations of current knowledge that we address in this paper.

#### Curvature and the Frank-Wolfe method.

Consider a general convex optimization problem

$$\min f(x) \quad \text{s.t.} \quad x \in K, \tag{1}$$

where $f$ is a convex function and $K$ a convex set. An important procedure for solving such convex programs is the Frank-Wolfe method [16]: in each iteration it solves the linearized version of the problem to obtain a “direction” $v_t \in \arg\min_{v \in K} \langle \nabla f(x_{t-1}), v \rangle$, where $x_{t-1}$ is the iterate of the previous iteration, and sets the new iterate as $x_t := (1-\gamma_t) x_{t-1} + \gamma_t v_t$ for some stepsize $\gamma_t \in [0,1]$. Because this method only requires optimization of linear functions in each iteration, and in particular does not require a (non-linear) projection onto the feasible region as most other methods do, it has gained much interest in applications to large-scale problems arising in machine learning [28, 31, 21, 37, 38, 17]. This method is known to have convergence rate of order $1/t$, i.e., after $t$ iterations it produces a feasible solution whose value is within $O(1/t)$ of the optimal value, and this is tight in general [27].
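To make the iteration concrete, here is a minimal Python sketch of the Frank-Wolfe method; it is an illustration rather than the implementation of any cited work. It assumes the feasible set is the Euclidean unit ball (so the linearized problem has a closed-form solution), a simple quadratic objective, and the common stepsize $2/(t+2)$. The names `lmo_ball`, `frank_wolfe`, `grad_f`, and the vector `b` are hypothetical.

```python
import math


def lmo_ball(grad, radius=1.0):
    """Linear minimization oracle for the Euclidean ball (assumed feasible set).

    Returns a minimizer of <grad, v> over ||v||_2 <= radius.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # every feasible point is optimal; pick the center
    return [-radius * g / norm for g in grad]


def frank_wolfe(grad_f, lmo, x0, num_iters):
    """Projection-free iteration: move toward the solution of the linearized problem."""
    x = list(x0)
    for t in range(num_iters):
        v = lmo(grad_f(x))       # solve the linearized problem at the current iterate
        gamma = 2.0 / (t + 2)    # standard open-loop stepsize (an assumption here)
        x = [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical objective: f(x) = 0.5 * ||x - b||^2 with b inside the unit ball.
b = [0.3, 0.4]


def f(x):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def grad_f(x):
    return [xi - bi for xi, bi in zip(x, b)]


x_final = frank_wolfe(grad_f, lmo_ball, x0=[0.0, 0.0], num_iters=200)
```

Each iterate is a convex combination of feasible points, so no projection is ever needed; only the oracle `lmo_ball` depends on the feasible set.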

However, since the seminal work of Polyak in the 60’s it is known that when the feasible set is suitably curved much better convergence rates are possible [32, 12, 13, 18]. The common notion of curvature in this context is that of $\lambda$-strongly convex sets: for all pairs of points $x, y \in K$, the set $K$ needs to contain a large enough ball centered at $\frac{x+y}{2}$. We present a slightly generalized definition that can use another convex body $C$ instead of the Euclidean ball. Recall that given a convex body $C$ with the origin in its interior, its gauge is the function $\|\cdot\|_C$ given by

$$\|x\|_C := \inf\{\lambda > 0 : x \in \lambda C\}. \tag{2}$$
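As an illustration of definition (2), the following Python sketch approximates a gauge by bisection on the scaling factor. It assumes access to a membership test `contains` for the body (a hypothetical callable, not something the text provides) and that the body is bounded with the origin in its interior, so the search terminates.

```python
def gauge(x, contains, tol=1e-9):
    """Approximate inf{lam > 0 : x in lam * C} using a membership oracle for C."""
    if all(xi == 0 for xi in x):
        return 0.0
    hi = 1.0
    while not contains([xi / hi for xi in x]):   # grow until x / hi lies in C
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:                          # invariant: x / hi in C
        mid = 0.5 * (lo + hi)
        if contains([xi / mid for xi in x]):
            hi = mid
        else:
            lo = mid
    return hi


# Illustrative bodies: the Euclidean unit ball and the asymmetric interval [-1, 2].
def in_unit_ball(p):
    return sum(c * c for c in p) <= 1.0


def in_interval(p):
    return -1.0 <= p[0] <= 2.0
```

With the asymmetric interval, `gauge([2.0], in_interval)` and `gauge([-2.0], in_interval)` differ, which shows that a gauge need not be symmetric unless the body is.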
###### Definition 1 (Strongly Convex Sets [40]).

Let $C$ be a convex body with the origin in its interior. A convex body $K$ is $\lambda$-strongly convex with respect to $C$ if for every $x, y \in K$ we have the containment

$$\frac{x+y}{2} + \lambda \|x-y\|_C^2 \cdot C \subseteq K. \tag{3}$$

Recently, [18] showed that when the feasible set is strongly convex and the objective function is a strongly convex function (see Definition 3), the Frank-Wolfe method has accelerated convergence rate $O(1/t^2)$.

However, it seems that the curvature of the feasible set should supersede the curvature of the objective function, the latter not being required for accelerated convergence rates. In fact, [46] introduced a class of curved convex sets called gauge sets and showed that this is indeed the case for them.

###### Definition 2 (Gauge Set [1]).

A convex body $K$ with the origin in its interior is a gauge set of modulus $\lambda$ with respect to a norm $\|\cdot\|$ if its gauge function squared $\|\cdot\|_K^2$ is a $\lambda$-strongly convex function with respect to $\|\cdot\|$.

[46] showed that as long as the feasible region is a gauge set, there is a Frank-Wolfe-type algorithm with convergence rate $O(1/t^2)$. While on one hand this result removes the strong convexity requirement of the objective function, on the other it makes a possibly stronger assumption on the feasible set, since the class of gauge sets is contained in that of strongly convex sets [18]. However, all standard examples of strongly convex sets, such as $\ell_p$ balls, Schatten-$p$ balls, and group-norm balls with exponents in $(1, 2]$, are also gauge sets. This has led to the conjecture that these notions are in fact the same.

###### Conjecture 1 ([1]).

A convex body $K$ containing the origin in its interior is a gauge set w.r.t. its gauge $\|\cdot\|_K$ if and only if it is strongly convex w.r.t. itself.

This is one of the gaps in our understanding of curved sets that we address in this paper. Before additional spoilers, we also briefly discuss the role of these sets in online optimization.

#### Curvature in online optimization.

Now consider the Online Linear Optimization problem [22]: A convex set $K$ is given upfront. In each time step $t = 1, \ldots, T$, the algorithm needs to produce a point $x_t \in K$ using the information revealed up to this moment; after that, the adversary reveals a gain vector $g_t$ from a given set of allowed gain vectors, and the algorithm receives gain $\langle g_t, x_t \rangle$. The goal of the algorithm is to maximize its total gain $\sum_{t=1}^T \langle g_t, x_t \rangle$. Its regret for this instance is the missing gain compared to the best fixed action in hindsight:

$$\mathrm{Regret} := \max_{x \in K} \sum_{t=1}^T \langle g_t, x \rangle - \sum_{t=1}^T \langle g_t, x_t \rangle.$$

We are interested in designing algorithms with provable upper bounds on their worst-case regret.

This problem, and its generalization with convex objective functions, has a vast literature with applications to a host of areas, from online shortest paths and dynamic search trees [30], to portfolio optimization [33], to robust optimization [6], and many others. It is known that as long as the playing set and the gain vector set are bounded one can obtain order $\sqrt{T}$ regret, and in general this cannot be improved [22]. On the other hand, when the gain functions are curved (e.g., strongly concave or exp-concave) instead of the linear ones $\langle g_t, \cdot \rangle$, it is possible to obtain a much improved order $\log T$ regret [22]. Interestingly, [25] recently showed that one can also obtain this improved order $\log T$ regret when the playing set is curved instead; however, they require an additional “growth condition” on the gains.

###### Theorem 1 ([25]).

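Consider the Online Linear Optimization problem with playing set $K$ and a bounded set of gain vectors. If $K$ is $\lambda$-strongly convex w.r.t. the Euclidean ball and the gain vectors satisfy the growth condition $\|g_1 + \ldots + g_t\|_2 \ge tG$ for some $G > 0$ and all $t$, then the algorithm Follow the Leader has regret at most

$$\frac{M^2}{2 \lambda G} (1 + \log T),$$

where $M$ is the maximum Euclidean norm of a gain vector.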

The standard 1-dimensional bad example for Online Linear Optimization shows that an assumption like the growth condition is necessary [25]; it is perhaps less clear why this is the case.

### 1.1 Our Results

Leveraging tools from geometry and convex analysis, we further our understanding of the role of curvature in offline and online optimization.

#### Equivalence of strongly convex and gauge sets.

We first observe that Conjecture 1 of [1] on the equivalence of strongly convex and gauge sets is true for centrally symmetric sets.

###### Theorem 2.

Conjecture 1 is true for centrally symmetric sets. More precisely, consider a convex body such that . If is -strongly convex with respect to itself, then is a gauge set with respect to with modulus .

(The other direction was proved in [18]: if is a gauge set w.r.t. with modulus , then it is -strongly convex with respect to itself.)

The main idea of the proof is to use as a stepping stone another classic notion of curvature introduced by [10] in the context of geometry of Banach spaces, namely 2-convexity of norms (Definition 4).

In addition to clarifying the relationship between these two notions of curvature, it shows that the Frank-Wolfe-type algorithm of [46] is the first to achieve accelerated rates under the standard notion of strong convexity of the feasible set without any additional assumption on the objective function (besides convexity).

###### Corollary 1.

Consider the problem (1). If $K$ is a centrally symmetric strongly convex body, then the Frank-Wolfe-type algorithm of [46] has convergence rate $O(1/t^2)$. (Here the $O(\cdot)$ hides other parameters that influence the convergence of the algorithm, such as the modulus of strong smoothness of the objective function, which is always finite over bounded sets.)

#### Online Linear Optimization on curved sets.

Next, we identify two main properties that help explain why curvature helps in online optimization.

###### Theorem 3 (Informal principle).

In Online Linear Optimization, the improved regret guarantees observed in [25] for strongly convex playing sets (attained by the Follow the Leader algorithm) stem from

 Partial Lipschitzness of the support function of $K$ + no-cancellation of the gain vectors.

This principle is described and developed in detail in Section 5 (see Lemmas 6 and 7 for some formal statements). At a high level, the first property is intimately related to the stability of the Follow the Leader (FTL) algorithm, which is known to control its regret. However, this Lipschitzness only holds away from the origin. That is why the additional no-cancellation property of the gain vectors is required: it steers the cumulative gain vectors seen by FTL away from the origin.

This principle gives a simple and clean proof of Theorem 1 above from [25], where this no-cancellation is achieved through the linear growth assumption on the partial sums of the gain vectors. As another illustration of this principle, we use it to show that FTL also has logarithmic regret over strongly convex sets when the gain vectors are non-negative, without any additional growth assumption (Theorem 6). Note that the non-negativity assumption is just another way of achieving the no-cancellation property.

#### Making a convex body curved.

In order to extend results obtained for curved sets to general sets, we give an efficient way of transforming an arbitrary convex body into a curved one while controlling both its curvature and its distance to the original set. We use $B(r)$ to denote the Euclidean ball of radius $r$ of appropriate dimension.

###### Theorem 4.

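Consider a convex body $K$ and suppose $B(r) \subseteq K \subseteq B(R)$. Then for all $t \in [0,1]$, there is a convex body $K_t$ with the following properties:

1. (Approximation) $\frac{1}{\sqrt{1 + ((R/r)^2 - 1) t^2}} \cdot K \subseteq K_t \subseteq K$

2. (Curvature) $K_t$ is $\frac{t^2}{8}$-strongly convex with respect to itself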

3. (Efficiency) Given access to a weak optimization oracle for , weak optimization over can be performed in time that is polynomial in , and the desired precision (see Definition 5).

Notice that this construction smoothly interpolates between the original set $K$ when $t = 0$ and the inscribed ball $B(r)$ when $t = 1$, and the guarantees interpolate with no loss at the endpoints.

The starting element for this construction is again the equivalence between strong convexity of sets and 2-convexity of their gauge functions. Based on this, the construction of $K_t$ uses the “Asplund averaging” technique for combining (2-convex) norms into a 2-convex one [29]: $K_t$ is defined by setting its gauge to be $\|\cdot\|_{K_t} := \sqrt{(1-t^2) \|\cdot\|_K^2 + t^2 \|\cdot\|_{B(r)}^2}$. Equivalently, $K_t$ can be defined based on the so-called $L_2$ addition of the (scaled) polars of $K$ and $B(r)$, an operation introduced by [14]. In fact, in order to show that one can optimize over $K_t$ in polynomial time, we resort to an equivalent characterization of this operation given by [36].

As a concrete example of application, we consider the problem of Online Linear Optimization with hints [11] and show how Theorem 4 allows us to port the low regret algorithm Dekel et al. designed for strongly convex playing sets to general playing sets, at the expense of a small multiplicative regret. Since this is a straightforward application (simply apply the algorithm to the approximation $K_t$ of the original set $K$) we present the details in Appendix C.

### 1.2 Structure of the paper

In order to reduce context-switching, we first prove the more structural results, Theorems 2 and 4, and leave the principle from Theorem 3 to be described and developed in detail in the last section.

## 2 Preliminaries

We need some basic notions from convex analysis, for which we refer to [24].

###### Definition 3.

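A convex function $f$ is $G$-strongly convex with respect to a norm $\|\cdot\|$ if for all $x, y$ and all $\alpha \in [0,1]$

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y) - G \cdot \alpha (1-\alpha) \|x-y\|^2.$$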

#### Set operations, gauge and support functions.

Recall that $B(r)$ denotes the Euclidean ball of radius $r$ in the appropriate dimension depending on the context. Given a set $A$ and a scalar $\alpha$ we define $\alpha A := \{\alpha a : a \in A\}$, and given two sets $A, B$ we define their Minkowski sum $A + B := \{a + b : a \in A, b \in B\}$ and their difference $A - B := \{x : x + B \subseteq A\}$ (so $A - B(\delta)$ has the interpretation of the points “deep inside” $A$). A set $A$ is (centrally) symmetric if $A = -A$. By a convex body we mean a compact convex set with non-empty interior. We work almost exclusively with convex bodies in $\mathbb{R}^d$ that have 0 in their interior.

Given such a convex body $K$, its support function is

$$\sigma_K(\ell) := \max_{x \in K} \langle x, \ell \rangle,$$

and recall that its gauge is $\|x\|_K := \inf\{\lambda > 0 : x \in \lambda K\}$.

Gauge functions are generalizations of norms: every norm $\|\cdot\|$ is the gauge of its unit ball $\{x : \|x\| \le 1\}$, and gauge functions satisfy all properties of norms (as listed below) other than symmetry, which holds iff the convex body is centrally symmetric. We need the following standard facts about these operators, which can be readily verified.

###### Lemma 1.

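For convex bodies $A, B, K$ with the origin in their interior, we have the following:

1. (level set) $K$ is precisely the set of points $x$ satisfying $\|x\|_K \le 1$

2. (positive homogeneity) For every scalar $\alpha \ge 0$, $\|\alpha x\|_K = \alpha \|x\|_K$ and $\sigma_K(\alpha x) = \alpha \sigma_K(x)$

3. (subadditivity) $\|x + y\|_K \le \|x\|_K + \|y\|_K$ and $\sigma_K(x + y) \le \sigma_K(x) + \sigma_K(y)$

4. (inclusion) $A \subseteq B$ iff $\|\cdot\|_A \ge \|\cdot\|_B$ pointwise, and iff $\sigma_A \le \sigma_B$ pointwise

5. (scaling of body) For all $\alpha > 0$, $\|\cdot\|_{\alpha K} = \frac{1}{\alpha} \|\cdot\|_K$, and $\sigma_{\alpha K} = \alpha \sigma_K$ pointwise.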

#### Polarity.

The polar of a convex body $K$ is the convex body

$$K^\circ := \{y : \langle x, y \rangle \le 1, \ \forall x \in K\}.$$

We will also need the following properties of polars.

###### Lemma 2.

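For convex bodies $A, B, K$ with the origin in their interior, we have the following:

1. (polar involution) $(K^\circ)^\circ = K$

2. (polar order reversal) $A \subseteq B$ iff $A^\circ \supseteq B^\circ$

3. (duality of functionals) $\|\cdot\|_K = \sigma_{K^\circ}$ and $\sigma_K = \|\cdot\|_{K^\circ}$

4. (Euclidean balls) For all $r > 0$ we have $B(r)^\circ = B(1/r)$.

For a gauge $\|\cdot\|$, we use $\|\cdot\|_\star$ to denote its dual gauge, $\|y\|_\star := \max\{\langle x, y \rangle : \|x\| \le 1\}$. By definition, we have the generalized Cauchy-Schwarz inequality:

$$\langle x, y \rangle \le \|x\| \, \|y\|_\star. \tag{4}$$

Note that since $\sigma_K(y) = \max\{\langle x, y \rangle : \|x\|_K \le 1\}$, we see that $\sigma_K$ is the dual gauge of $\|\cdot\|_K$.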

Given a convex function $f$, its subdifferential at $x$, denoted by $\partial f(x)$, is the set of all vectors $g$ that give a linear underestimation of the function, namely

$$f(y) \ge f(x) + \langle g, y - x \rangle \quad \text{for all } y \in \mathbb{R}^d,$$

and such a vector $g$ is called a subgradient. Furthermore, if $f$ is differentiable at $x$ then $\partial f(x)$ is the singleton set consisting of the gradient $\nabla f(x)$.

## 3 Equivalence of Strongly Convex and Gauge Bodies

In this section we prove that centrally symmetric strongly convex sets are gauge sets (Theorem 2). The main stepping stone is another classic notion of curvature in Banach spaces [10]; while in this section we will only use it for norms, we state it more generally for gauge functions for later use.

###### Definition 4 (2-convexity [26]).

A gauge function $\|\cdot\|_K$ is 2-convex with modulus $D$ if for all $x, y$ satisfying $\|x\|_K \le 1$ and $\|y\|_K \le 1$ we have

$$\Big\| \frac{x+y}{2} \Big\|_K \le 1 - D \|x-y\|_K^2. \tag{5}$$

Notice that for $x, y$ as above, the subadditivity of gauges gives that $\big\| \frac{x+y}{2} \big\|_K \le 1$; thus, 2-convexity gives an improvement depending on how far $x$ and $y$ are from each other. As an example, the Euclidean norm is 2-convex with modulus $\frac{1}{8}$, and this modulus is best possible.
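For the Euclidean norm, inequality (5) with modulus $\frac{1}{8}$ can be checked by hand. The short derivation below is a standard computation added for illustration; it uses the parallelogram identity, the assumption $\|x\|_2, \|y\|_2 \le 1$, and the elementary bound $\sqrt{1-z} \le 1 - \frac{z}{2}$ for $z \in [0,1]$.

```latex
\begin{align*}
\Big\| \tfrac{x+y}{2} \Big\|_2^2
  &= \tfrac{1}{2}\|x\|_2^2 + \tfrac{1}{2}\|y\|_2^2 - \tfrac{1}{4}\|x-y\|_2^2
   \;\le\; 1 - \tfrac{1}{4}\|x-y\|_2^2, \\
\Big\| \tfrac{x+y}{2} \Big\|_2
  &\le \sqrt{1 - \tfrac{1}{4}\|x-y\|_2^2}
   \;\le\; 1 - \tfrac{1}{8}\|x-y\|_2^2 .
\end{align*}
```

For unit vectors $x, y$ the first inequality is an equality, and $\sqrt{1 - s^2/4} = 1 - s^2/8 - O(s^4)$ as $s = \|x-y\|_2 \to 0$, which is why no constant larger than $\frac{1}{8}$ can work.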

It is known that the square of a 2-convex norm is mid-point strongly convex [34, 5, 4, 9, 7]. More explicitly, since mid-point and regular strong convexity are equivalent for continuous functions [2], Lemma 1.e.10 of [34] gives the following.

###### Lemma 3.

If a norm over is 2-convex with modulus , then the function is -strongly convex w.r.t. .

Moreover, we note that a gauge $\|\cdot\|_K$ is 2-convex iff the set $K$ is strongly convex with respect to itself. Despite the extensive literature on strongly convex sets (see the survey [19]), we could not find a reference for this result. We present its simple proof for completeness.

###### Lemma 4.

A convex body $K$ is $\lambda$-strongly convex with respect to itself iff its gauge $\|\cdot\|_K$ is 2-convex with modulus $\lambda$.

###### Proof.

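($\Rightarrow$) Take $x, y$ such that $\|x\|_K, \|y\|_K \le 1$, so $x, y \in K$. Let $m := \frac{x+y}{2}$ and $t := \|x-y\|_K$. Using the $\lambda$-strong convexity of $K$ at $x, y$, and the fact that $\frac{m}{\|m\|_K} \in K$, we have that the point $m + \lambda t^2 \frac{m}{\|m\|_K}$ belongs to $K$, and hence

$$1 \ge \Big\| m + \lambda t^2 \frac{m}{\|m\|_K} \Big\|_K = \Big( 1 + \frac{\lambda t^2}{\|m\|_K} \Big) \|m\|_K = \|m\|_K + \lambda t^2.$$

Thus, $\|m\|_K \le 1 - \lambda t^2$, proving the 2-convexity of $\|\cdot\|_K$.

($\Leftarrow$) Take $x, y \in K$ with $m$ and $t$ as above, so by assumption $\|m\|_K \le 1 - \lambda t^2$. Then for any $z \in \lambda t^2 K$ we have by triangle inequality $\|m + z\|_K \le \|m\|_K + \|z\|_K \le 1$, i.e., this point belongs to $K$. This means that $m + \lambda t^2 K$ is contained in $K$. Thus, $K$ is $\lambda$-strongly convex with respect to itself. ∎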

###### Proof of Theorem 2.

Consider a centrally symmetric set that is -strongly convex w.r.t. itself. Since its gauge is a norm, chaining Lemmas 3 and 4 gives that is a -strongly convex function w.r.t. , namely is a gauge set w.r.t. with modulus . This concludes the proof. ∎

## 4 Making a Convex Body Curved

Consider an arbitrary convex body $K$. Our goal in this section is to obtain a set $K_t$ that is strongly convex with respect to itself, that approximates $K$ in the sense of being sandwiched between a scaled copy of $K$ and $K$ itself, and that can be efficiently optimized over, proving Theorem 4.

### 4.1 A First Attempt

Let $B(r)$ and $B(R)$ be respectively inscribed and circumscribed balls for $K$. Recall that intuitively a set is strongly convex if its boundary does not have flat parts. (See [25] for a formal connection between strong convexity of a set and the curvature of its boundary seen as a Riemannian manifold.)

On one hand, $K$ is a perfect approximation to itself but may not be strongly convex at all; on the other, as we just saw, $B(r)$ is $\frac{1}{8}$-strongly convex with respect to itself but (typically) gives a poor approximation to $K$. The idea is to trade off these extremes by taking a “convex combination” between $K$ and the inscribed ball $B(r)$.

The natural attempt would be to consider the convex combination $(1-t) K + t B(r)$ for $t \in [0,1]$. This operation is just placing a copy of the ball $t B(r)$ at each point of $(1-t) K$, which intuitively should give a more strongly convex set as $t$ increases. Unfortunately this is not true: if $K$ is a polytope, the set $(1-t) K + t B(r)$ is not strongly convex at all for any value $t < 1$, see Figure 2.a. This is because this operation softens the corners of $K$ instead of curving its flat faces.

But it is known that polarity maps “faces” of the set to “corners” of its polar, and vice-versa (Corollary 2.14 of [48] makes this precise for polytopes). Thus, we should soften the vertices of the polar to obtain the desired effect in the original set. More precisely, we can pass to the polar, take a convex combination with the polar of , and take the polar of the resulting object to get back to the original space:

$$K''_t := \big( (1-t) K^\circ + t B(r)^\circ \big)^\circ, \tag{6}$$

for $t \in [0,1]$; see Figure 2.b. Indeed, with a careful analysis one can show that $K''_t$ is strongly convex and approximates $K$ (with the approximation improving as $t \to 0$). However, we get a greatly simplified analysis by working with a different construction.

### 4.2 Construction via L2 Addition

The idea is to replace the construction given in (6) by one with a more “functional” flavor that gives a clean expression for its gauge function $\|\cdot\|_{K_t}$. Since Lemma 4 gives the equivalence between 2-convexity of $\|\cdot\|_{K_t}$ and strong convexity of $K_t$, we will be in good shape for controlling the latter.

For this, we replace the Minkowski sum in our previous attempt by the so-called $L_2$ addition [14, 36]. Given two convex bodies $A, B$, their $L_2$ addition is the convex body $A \oplus B$ whose support function satisfies

$$\sigma_{A \oplus B}(\cdot)^2 = \sigma_A(\cdot)^2 + \sigma_B(\cdot)^2.$$

We then define our desired approximation of $K$ as the set

$$K_t := \big( \sqrt{1-t^2} \, K^\circ \oplus t B(r)^\circ \big)^\circ.$$

To have a more transparent version of this definition, by involution of polarity (Lemma 2 Item 1), the polar of $K_t$ satisfies $K_t^\circ = \sqrt{1-t^2} \, K^\circ \oplus t B(r)^\circ$ and hence

$$\begin{aligned}
\sigma_{K_t^\circ}(\cdot)^2 &= \sigma_{\sqrt{1-t^2} K^\circ}(\cdot)^2 + \sigma_{t B(r)^\circ}(\cdot)^2 \\
&= \big( \sqrt{1-t^2} \cdot \sigma_{K^\circ}(\cdot) \big)^2 + \big( t \cdot \sigma_{B(r)^\circ}(\cdot) \big)^2 \\
&= (1-t^2) \, \sigma_{K^\circ}(\cdot)^2 + t^2 \, \sigma_{B(r)^\circ}(\cdot)^2,
\end{aligned}$$

where in the second equation we used Lemma 1 Item 5. Moreover, using that the support function of the polar is the gauge of the “primal” (Lemma 2 Item 3), we see that $K_t$ is the convex body satisfying

$$\|\cdot\|_{K_t}^2 = (1-t^2) \|\cdot\|_K^2 + t^2 \|\cdot\|_{B(r)}^2. \tag{7}$$

With this functional perspective we are in good shape for analyzing the properties of $K_t$ and proving Theorem 4.
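The functional description (7) is easy to evaluate numerically. The Python sketch below does so for one illustrative choice that is not taken from the text: $K = [-1,1]^2$, whose gauge is the max-norm, with inscribed radius $r = 1$ and circumscribed radius $R = \sqrt{2}$. It then spot-checks the two-sided comparison between the gauges of $K$ and $K_t$ and the midpoint inequality with constant $t^2/8$ on a few hand-picked points. All function names are hypothetical.

```python
import math


def gauge_box(x):
    """Gauge of K = [-1, 1]^d (illustrative choice of K): the max-norm."""
    return max(abs(c) for c in x)


def gauge_ball(x, r):
    """Gauge of the Euclidean ball B(r): ||x||_2 / r."""
    return math.sqrt(sum(c * c for c in x)) / r


def gauge_Kt(x, t, r=1.0):
    """Gauge of K_t from (7): squared gauges combined with weights 1 - t^2 and t^2."""
    return math.sqrt((1 - t * t) * gauge_box(x) ** 2 + t * t * gauge_ball(x, r) ** 2)


d, r, t = 2, 1.0, 0.5
R = math.sqrt(d)                                    # circumscribed radius of the box
stretch = math.sqrt(1 + ((R / r) ** 2 - 1) * t * t)

# Approximation: ||x||_K <= ||x||_{K_t} <= stretch * ||x||_K on sample points.
samples = [[1.0, 1.0], [1.0, 0.0], [0.3, -0.7], [-2.0, 0.5]]
sandwich_ok = all(
    gauge_box(x) - 1e-12 <= gauge_Kt(x, t, r) <= stretch * gauge_box(x) + 1e-12
    for x in samples
)

# Curvature: midpoint inequality for two boundary points of K_t.
scale = gauge_Kt([1.0, 1.0], t, r)
x = [1.0 / scale, 1.0 / scale]
y = [1.0 / scale, -1.0 / scale]
mid = [(a + b) / 2 for a, b in zip(x, y)]
diff = [a - b for a, b in zip(x, y)]
lhs = gauge_Kt(mid, t, r)
rhs = 1 - (t * t / 8) * gauge_Kt(diff, t, r) ** 2
```

The two points `x` and `y` sit on a flat face direction of the box, where the original gauge gives no midpoint improvement at all; the ball term in (7) is what produces the gap between `lhs` and 1.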

### 4.3 Proof of Theorem 4

#### Approximation.

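We first argue that $K_t$ is still sandwiched $B(r) \subseteq K_t \subseteq K$. Since $K$ contains the ball $B(r)$, Lemma 1 Item 4 gives that $\|\cdot\|_K \le \|\cdot\|_{B(r)}$. So using (7) we see that $\|\cdot\|_K \le \|\cdot\|_{K_t} \le \|\cdot\|_{B(r)}$. The same lemma then implies that $B(r) \subseteq K_t \subseteq K$.

To see that $K_t$ contains the scaled copy of $K$, notice that the inclusion $K \subseteq B(R)$, together with Lemma 1 Items 4 and 5, implies that $\|\cdot\|_{B(r)} = \frac{R}{r} \|\cdot\|_{B(R)} \le \frac{R}{r} \|\cdot\|_K$, and hence (7) gives

$$\|\cdot\|_{K_t} \le \|\cdot\|_K \cdot \sqrt{1 + \Big( \big( \tfrac{R}{r} \big)^2 - 1 \Big) t^2};$$

the same lemma then gives the desired containment. This proves the “approximation” part of Theorem 4.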

#### Curvature.

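Given the equivalence of strong convexity and 2-convexity of Lemma 4, it suffices to show that $\|\cdot\|_{K_t}$ is 2-convex with modulus $\frac{t^2}{8}$. So consider $x, y$ with $\|x\|_{K_t}, \|y\|_{K_t} \le 1$; we want to show that

$$\Big\| \frac{x+y}{2} \Big\|_{K_t} \le 1 - \frac{t^2}{8} \|x-y\|_{K_t}^2. \tag{8}$$

First, observe that the function $\|\cdot\|_K^2$ is convex: this follows because it is the composition of the convex function $\|\cdot\|_K$ (use Lemma 1 Items 2 and 3 to observe this convexity) and the function $z \mapsto z^2$, which is increasing and convex on the non-negative reals (see for example Section 3.2.4 of [8]). Using again the fact $\|\cdot\|_{B(r)} = \frac{1}{r} \|\cdot\|_2$ (Lemma 1 Item 5), we have

$$\begin{aligned}
\Big\| \frac{x+y}{2} \Big\|_{K_t}^2
&\overset{(7)}{=} (1-t^2) \Big\| \frac{x+y}{2} \Big\|_K^2 + \frac{t^2}{r^2} \Big\| \frac{x+y}{2} \Big\|_2^2 \\
&\overset{\text{conv.}}{\le} (1-t^2) \Big( \frac{\|x\|_K^2}{2} + \frac{\|y\|_K^2}{2} \Big) + \frac{t^2}{r^2} \Big\| \frac{x+y}{2} \Big\|_2^2 \\
&\overset{\text{parallel.}}{=} (1-t^2) \Big( \frac{\|x\|_K^2}{2} + \frac{\|y\|_K^2}{2} \Big) + \frac{t^2}{r^2} \Big( \frac{\|x\|_2^2}{2} + \frac{\|y\|_2^2}{2} - \frac{\|x-y\|_2^2}{4} \Big) \\
&= \frac{\|x\|_{K_t}^2}{2} + \frac{\|y\|_{K_t}^2}{2} - \frac{t^2}{4 r^2} \|x-y\|_2^2 \\
&\le 1 - \frac{t^2}{4 r^2} \|x-y\|_2^2 \;=\; 1 - \frac{t^2}{4} \|x-y\|_{B(r)}^2 \;\le\; 1 - \frac{t^2}{4} \|x-y\|_{K_t}^2,
\end{aligned}$$

where in the first inequality we used convexity of $\|\cdot\|_K^2$, the next equation uses the parallelogram identity, the second inequality uses the assumption $\|x\|_{K_t}, \|y\|_{K_t} \le 1$, and the last inequality uses $\|\cdot\|_{B(r)} \ge \|\cdot\|_{K_t}$, proved in the “approximation” part. Finally, since $\sqrt{1-z} \le 1 - \frac{z}{2}$ for all $z \in [0,1]$, taking square roots on the last displayed inequality proves (8).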

#### Efficiency.

It is not immediately clear that we can optimize a linear function over $K_t$ given access to an optimization oracle for $K$. First, let us recall the standard definition of weak optimization [20].

###### Definition 5 (Weak optimization problem).

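Given a vector $c$, a convex set $A$, and a precision parameter $\delta > 0$, either:

1. Output that $A - B(\delta)$ is empty

2. Return a point $\bar{x} \in A + B(\delta)$ such that

$$\langle c, \bar{x} \rangle \ge \max_{x \in A - B(\delta)} \langle c, x \rangle - \delta.$$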

We also recall the following result on the equivalence of weak optimization of a body and its polar (for example, chain together Theorem 4.4.7, Theorem 4.2.2, Lemma 4.4.2, and Corollary 4.2.7 of [20]).

###### Theorem 5.

Let be a convex body satisfying . Then, there is an algorithm that, given access to weak optimization oracles over , solves the weak optimization problem over in time polynomial in and .

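Given this equivalence and the involution of polarity $(K_t^\circ)^\circ = K_t$, in order to weakly optimize over $K_t$ it suffices to be able to weakly optimize over its polar $K_t^\circ$. To do that, we will need a characterization of the $L_2$ addition by [36], which when applied to $K_t^\circ$ gives the following (to simplify the notation, let $U := \sqrt{1-t^2} \, K^\circ$ and $V := t B(r)^\circ$):

$$K_t^\circ = \big\{ (1-\alpha)^{1/2} u + \alpha^{1/2} v : u \in U, \ v \in V, \ \alpha \in [0,1] \big\}.$$

Thus, given $\bar{y}$, maximizing $\langle \bar{y}, \cdot \rangle$ over $K_t^\circ$ is equivalent to the following optimization problem:

$$\max \ (1-\alpha)^{1/2} \langle \bar{y}, u \rangle + \alpha^{1/2} \langle \bar{y}, v \rangle \quad \text{s.t.} \quad u \in U, \ v \in V, \ \alpha \in [0,1].$$

Given the decomposability of this problem, we can do this in polynomial time as follows:

1. First weakly maximize $\langle \bar{y}, u \rangle$ over $u \in U$, obtaining an almost optimal solution $\bar{u}$. Again, by Theorem 5 this is equivalent to weakly optimizing over the polar $U^\circ = \frac{1}{\sqrt{1-t^2}} K$, which (since $t$ is fixed) is equivalent to weakly optimizing over $K$, which we assumed we have an oracle for.

2. Then maximize $\langle \bar{y}, v \rangle$ over $v \in V$, obtaining the optimal solution $\bar{v}$. Notice that $V = t B(r)^\circ = B(t/r)$ (Lemma 2 Item 4), so it is just the Euclidean ball of radius $t/r$. Thus, we explicitly have the maximizer $\bar{v} = \frac{t}{r} \cdot \frac{\bar{y}}{\|\bar{y}\|_2}$.

3. Finally, weakly maximize $(1-\alpha)^{1/2} \langle \bar{y}, \bar{u} \rangle + \alpha^{1/2} \langle \bar{y}, \bar{v} \rangle$ over $\alpha \in [0,1]$, obtaining an almost optimal solution $\bar{\alpha}$. We claim that this objective is concave in $\alpha$ over $[0,1]$. To see this, notice that since $U$ has the origin in its interior, the optimality of $\bar{u}$ gives that $\langle \bar{y}, \bar{u} \rangle \ge 0$, and the same is true for $\bar{v}$. Then one can easily check that the second derivative of the objective is negative in $(0,1)$, thus giving its concavity over $[0,1]$ (also notice that it is continuous at $0$ and $1$). Thus, we can weakly optimize it in polynomial time (see for example Theorem 4.3.13 of [20]).

Putting all these elements together, we can weakly optimize over $K_t$ in polynomial time using a weak optimization oracle for $K$. With this, we conclude the proof of Theorem 4.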
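The decomposition into three steps can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the procedure analyzed above: it uses exact rather than weak oracles, takes $K = [-1,1]^d$ so that $K^\circ$ is the unit $\ell_1$ ball, and replaces the generic one-dimensional concave maximization of step 3 by a closed form. Writing $a, b \ge 0$ for the two inner products, the Cauchy-Schwarz inequality gives $\sqrt{1-\alpha}\, a + \sqrt{\alpha}\, b \le \sqrt{a^2 + b^2}$, with equality at $\alpha = b^2/(a^2+b^2)$. All function names are hypothetical.

```python
import math


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def best_alpha(a, b):
    """Maximize sqrt(1 - alpha) * a + sqrt(alpha) * b over alpha in [0, 1], for a, b >= 0."""
    if a == 0.0 and b == 0.0:
        return 0.0, 0.0
    return b * b / (a * a + b * b), math.hypot(a, b)


def argmax_cross_polytope(y, scale):
    """Maximizer of <y, u> over scale * (unit l1 ball): a signed, scaled coordinate vector."""
    i = max(range(len(y)), key=lambda j: abs(y[j]))
    u = [0.0] * len(y)
    u[i] = scale * math.copysign(1.0, y[i])
    return u


def argmax_ball(y, radius):
    """Maximizer of <y, v> over the Euclidean ball of the given radius."""
    norm = math.sqrt(dot(y, y))
    if norm == 0.0:
        return [0.0] * len(y)
    return [radius * c / norm for c in y]


def optimize_over_polar_Kt(y, t, r):
    """Maximize <y, .> over the polar of K_t for K = [-1, 1]^d (illustrative K)."""
    u = argmax_cross_polytope(y, math.sqrt(1 - t * t))   # step 1: U = sqrt(1 - t^2) * polar(K)
    v = argmax_ball(y, t / r)                             # step 2: V = B(t / r)
    alpha, value = best_alpha(dot(y, u), dot(y, v))       # step 3: one-dimensional problem
    point = [math.sqrt(1 - alpha) * ui + math.sqrt(alpha) * vi for ui, vi in zip(u, v)]
    return point, value


y_bar = [3.0, 4.0]
point, value = optimize_over_polar_Kt(y_bar, t=0.6, r=1.0)
```

Since the support function of $K_t^\circ$ is the gauge of $K_t$, the returned `value` can be cross-checked against (7): here $(1-t^2)\|\bar{y}\|_\infty^2 + t^2 \|\bar{y}\|_2^2 / r^2 = 0.64 \cdot 16 + 0.36 \cdot 25$.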

## 5 Online Linear Optimization on Curved Sets

The goal of this section is to develop the informal principle stated in Theorem 3. We briefly recall the Online Linear Optimization (OLO) problem described in the introduction: a convex body (playing set) $K$ is given upfront; in each time step $t$ the algorithm first produces a point $x_t \in K$ using the information obtained thus far, sees a gain vector $g_t$, and obtains gain $\langle g_t, x_t \rangle$. The goal is to minimize the regret against the best fixed action:

$$\mathrm{Regret} := \max_{x \in K} \sum_{t=1}^T \langle g_t, x \rangle - \sum_{t=1}^T \langle g_t, x_t \rangle.$$

We are interested in the case where $K$ is strongly convex.

Follow the Leader (FTL) is arguably the simplest algorithm for this problem, being simply greedy in the previous gain vectors: letting $s_t := g_1 + \ldots + g_t$, the algorithm at time $t$ chooses an action

$$x_t \in \arg\max_{x \in K} \langle s_{t-1}, x \rangle \tag{FTL}$$

($x_1$ is chosen as an arbitrary point in $K$). It is well-known that whenever FTL is stable, namely actions $x_t$ and $x_{t+1}$ on consecutive times are “similar”, it obtains good regret guarantees; in fact, this is the basis for the analysis of most OLO algorithms. More precisely, Lemma 2.1 of [43] gives the following.

###### Lemma 5.

The regret of FTL is at most $\sum_{t=1}^T \langle g_t, x_{t+1} - x_t \rangle$.

Unfortunately in general FTL can be quite unstable: For example, consider the instance , with gain sequence and for the gains alternate between and . Even though the gain vectors are very similar across time steps, the actions of FTL alternate between and , being extremely unstable. In addition, its regret is , which up to constants is worst possible.
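The instability is easy to reproduce in simulation. The Python sketch below runs FTL on a standard one-dimensional instance; the specific choices (playing set $[-1, 1]$, first gain $0.5$, then gains alternating between $-1$ and $+1$, first action $0$) are assumptions made for illustration, and `ftl_interval` is a hypothetical helper.

```python
def ftl_interval(gains, x1=0.0):
    """Run Follow the Leader on K = [-1, 1]; return the actions played and the regret."""
    s = 0.0          # cumulative gain s_{t-1}
    total = 0.0      # total gain collected by FTL
    actions = []
    for t, g in enumerate(gains):
        if t == 0:
            x = x1
        else:
            x = 1.0 if s > 0 else -1.0   # maximizer of s * x over [-1, 1] (ties go to -1)
        actions.append(x)
        total += g * x
        s += g
    best_fixed = abs(s)                  # max over x in [-1, 1] of s * x
    return actions, best_fixed - total


T = 100
gains = [0.5] + [-1.0 if t % 2 == 1 else 1.0 for t in range(1, T)]
actions, regret = ftl_interval(gains)
```

On this instance the cumulative gain flips sign at every step, so FTL always plays the endpoint that the next gain penalizes, and the regret grows linearly in `T`.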

However, the intuition is that when $K$ is “curved”, we should have $x_t \approx x_{t+1}$ as long as the directions of $s_{t-1}$ and $s_t$ are similar, see Figure 3.a. More formally, notice that $x_t$ is the optimizer in the definition of the support function value $\sigma_K(s_{t-1})$, and because of that it is a subgradient of it: $x_t \in \partial \sigma_K(s_{t-1})$. In addition, if $K$ is strongly convex, then $\sigma_K$ is differentiable everywhere except the origin, and hence $x_t = \nabla \sigma_K(s_{t-1})$ as long as $s_{t-1} \neq 0$ [47], see Figure 3.b.

Thus, the FTL stability requirement $x_t \approx x_{t+1}$ is equivalent to $\nabla \sigma_K(s_{t-1}) \approx \nabla \sigma_K(s_t)$, namely stability of the gradient of the support function. A big problem is that since $\sigma_K$ is never differentiable at the origin, gradients are not stable around there.

But when $K$ is strongly convex, this is the only problem: $\nabla \sigma_K$ is Lipschitz over the unit sphere. This fact has been rediscovered several times [45, 39, 3, 1]; for example, this is Lemma 2.2 of [3].

###### Lemma 6 (Lipschitz gradients over the sphere).

If $K$ is $\lambda$-strongly convex with respect to a norm $\|\cdot\|$, then for all $u, v$ with $\|u\| = \|v\| = 1$ we have

$$\|\nabla \sigma_K(u) - \nabla \sigma_K(v)\|_\star \le \frac{1}{4\lambda} \|u - v\|.$$

Just using this limited “sphere-Lipschitz” property (and Lemma 5) we get a generic upper bound on the regret of FTL on strongly convex sets. (This is similar to the conclusion of Proposition 2 plus inequality (6) of [25], but arguably with a simpler and more transparent proof.)

###### Lemma 7 (FTL regret from sphere-Lipschitz).

If $K$ is such that the gradient of its support function satisfies the Lipschitz gradient condition of Lemma 6, then the regret of FTL is at most

$$\frac{1}{2\lambda} \sum_t \frac{\|g_t\|^2}{\|s_t\|},$$

as long as $s_t \neq 0$ for all $t$.

###### Proof.

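To simplify the notation we drop the subscript $K$ from $\sigma_K$. From Lemma 5 and the generalized Cauchy-Schwarz inequality (4), the regret of FTL is at most

$$\mathrm{regret} \le \sum_{t=1}^T \|g_t\| \, \|x_{t+1} - x_t\|_\star = \sum_{t=1}^T \|g_t\| \, \|\nabla \sigma(s_t) - \nabla \sigma(s_{t-1})\|_\star. \tag{9}$$

We upper bound this starred norm. By positive homogeneity of $\sigma$ we have $\nabla \sigma(\alpha u) = \nabla \sigma(u)$ for all $\alpha > 0$, so Lemma 6 implies

$$\|\nabla \sigma(s_t) - \nabla \sigma(s_{t-1})\|_\star = \Big\| \nabla \sigma \Big( \frac{s_t}{\|s_t\|} \Big) - \nabla \sigma \Big( \frac{s_{t-1}}{\|s_{t-1}\|} \Big) \Big\|_\star \le \frac{1}{4\lambda} \Big\| \frac{s_t}{\|s_t\|} - \frac{s_{t-1}}{\|s_{t-1}\|} \Big\|. \tag{10}$$

We claim that the norm on the right-hand side is at most $\frac{2 \|g_t\|}{\|s_t\|}$. To see this, since $s_t = s_{t-1} + g_t$ we can use triangle inequality to upper bound it by

$$\frac{\|g_t\|}{\|s_t\|} + \Big\| \frac{s_{t-1}}{\|s_t\|} - \frac{s_{t-1}}{\|s_{t-1}\|} \Big\| = \frac{\|g_t\|}{\|s_t\|} + \Big| \frac{\|s_{t-1}\|}{\|s_t\|} - 1 \Big| \le \frac{\|g_t\|}{\|s_t\|} + \frac{\|g_t\|}{\|s_t\|} = \frac{2 \|g_t\|}{\|s_t\|}, \tag{11}$$

where in the first equation we used the manipulation $\big\| \alpha v - \frac{v}{\|v\|} \big\| = \big| \alpha \|v\| - 1 \big|$ valid for any scalar $\alpha$ and vector $v \neq 0$, and in the inequality we again used triangle inequality to get $\big| \|s_{t-1}\| - \|s_t\| \big| \le \|g_t\|$, which implies $\big| \frac{\|s_{t-1}\|}{\|s_t\|} - 1 \big| \le \frac{\|g_t\|}{\|s_t\|}$. Combining the displayed equations gives the result. ∎

Now we just need to control the denominator of this expression, namely to bound $\|s_t\|$ away from the origin. This is what we refer to as the “no-cancellation” property. We consider two incarnations of this property.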

### 5.1 No-cancellation via growth condition on st

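We can guarantee the desired no-cancellation by assuming that there is $G > 0$ such that $\|s_t\| \ge tG$ for all $t$, precisely the assumption in Theorem 1 above of [25]. With the development above, we directly recover this result (and extend it to arbitrary norms): under this assumption, Lemma 7 together with $\sum_{t=1}^T \frac{1}{t} \le 1 + \log T$ shows that the regret of FTL is at most $\frac{M^2}{2 \lambda G} (1 + \log T)$, where $M$ is an upper bound on $\|g_t\|$.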
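A small simulation illustrates this regime. The Python sketch below runs FTL on the Euclidean unit ball with a hypothetical gain sequence whose first coordinate is always $1$, so that $\|s_t\|_2 \ge t$ and the growth condition holds with $G = 1$. The value $\lambda = 1/8$ used for the unit ball follows from the midpoint definition (3) and the Euclidean computation $\|\frac{x+y}{2}\|_2 \le 1 - \frac{1}{8}\|x-y\|_2^2$; it is an assumption of this sketch rather than a constant stated in the text.

```python
import math


def ftl_unit_ball(gains, x1):
    """Follow the Leader on the Euclidean unit ball: play s_{t-1} / ||s_{t-1}||_2."""
    s = [0.0] * len(x1)
    total = 0.0
    for t, g in enumerate(gains):
        norm_s = math.sqrt(sum(c * c for c in s))
        x = list(x1) if t == 0 or norm_s == 0.0 else [c / norm_s for c in s]
        total += sum(gi * xi for gi, xi in zip(g, x))
        s = [si + gi for si, gi in zip(s, g)]
    best_fixed = math.sqrt(sum(c * c for c in s))   # support function of the unit ball at s_T
    return best_fixed - total


T = 1000
# Hypothetical gains: constant drift in coordinate 1, alternating noise in coordinate 2.
gains = [[1.0, 0.5 if t % 2 == 0 else -0.5] for t in range(T)]
regret = ftl_unit_ball(gains, x1=[1.0, 0.0])

G = 1.0                  # growth: ||s_t||_2 >= t since the first coordinates sum to t
M = math.sqrt(1.25)      # Euclidean norm of every gain vector
lam = 1.0 / 8.0          # unit ball under definition (3); an assumption of this sketch
bound = M ** 2 / (2 * lam * G) * (1 + math.log(T))
```

In contrast with the one-dimensional alternating instance, the drift keeps the cumulative gain vector away from the origin, so consecutive FTL actions are nearly identical and the measured `regret` stays far below `bound`.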

### 5.2 No-cancellation via non-negative gain vectors

Another way of guaranteeing the no-cancellation property is by considering only non-negative gain vectors. The development above again shows that we get logarithmic regret in this case. We remark that the assumption of non-negative gains does not preclude $\|s_t\|$ from growing sublinearly, so the two assumptions are orthogonal.

###### Theorem 6.

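Consider the OLO problem with playing set $K$ and a bounded set of gain vectors. If $K$ is $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ and all gain vectors are non-negative (that is, $g_t \in \mathbb{R}^d_+$; we note that the proof directly generalizes to the case when $\mathbb{R}^d_+$ is replaced by an arbitrary pointed cone), then FTL has regret at most

$$C \cdot \frac{M}{\lambda} \cdot \log T,$$

where $M$ bounds the norm $\|\cdot\|$ of the gain vectors and $C$ only depends on $\|\cdot\|$.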

###### Proof.

Since the gain vectors are non-negative, we can assume $g_1 \neq 0$, and hence $s_t \neq 0$ for all $t$; otherwise we can just ignore the initial time steps with $g_t = 0$. The idea now is to reduce the analysis to the 1-dimensional case in order to capture more easily the property of no cancellations; for that, we will approximate $\|\cdot\|$ over $\mathbb{R}^d_+$ by a linear function.

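Let $e_i$ denote the $i$th canonical vector, and define the vector $c$ with coordinates $c_i := \|e_i\|$. Define then the linear function $f(x) := \langle c, x \rangle$. Notice that $\|x\| \le f(x)$ for all non-negative $x$: by triangle inequality $\|x\| = \|\sum_i x_i e_i\| \le \sum_i x_i \|e_i\| = f(x)$. In addition, defining $C := \max\{f(x) : x \in \mathbb{R}^d_+, \ \|x\| \le 1\}$, we have $f(x) \le C \cdot \|x\|$ for all $x \in \mathbb{R}^d_+$. Thus, we have the two-sided bound

$$\forall x \in \mathbb{R}^d_+, \quad \|x\| \le f(x) \le C \cdot \|x\|.$$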

Employing Lemma 7 with these bounds, and using the linearity of $f$, the regret of FTL over the gain vectors $g_t$ is at most

$$\frac{C}{2\lambda} \sum_{t=1}^T \frac{f(g_t)^2}{f(s_t)} = \frac{C}{2\lambda} \sum_{t=1}^T \frac{f(g_t)^2}{f(g_1) + \ldots + f(g_t)}. \tag{12}$$

To upper bound the right-hand side, we employ the following estimate, which is proved in the appendix.

###### Lemma 8.

Let be numbers in , and let . Then

Because and (since by assumption ), the previous lemma shows that the right-hand side of (12) is at most . By redefining we obtain the desired regret bound for FTL, thus concluding the proof. ∎

## Acknowledgements

We thank Jacob Abernethy for discussions on the topics of this paper.

## References

• [1] J. D. Abernethy, K. A. Lai, K. Y. Levy, and J. Wang, Faster rates for convex-concave games, in COLT, vol. 75 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1595–1625.
• [2] A. Azócar, J. Giménez, K. Nikodem, and J. L. Sánchez, On strongly midconvex functions, Opuscula Math., 31 (2011), pp. 15–26.
• [3] M. V. Balashov and D. Repovš, Uniform convexity and the splitting problem for selections, Journal of Mathematical Analysis and Applications, 360 (2009), pp. 307 – 316.
• [4] K. Ball, E. A. Carlen, and E. H. Lieb, Sharp uniform convexity and smoothness inequalities for trace norms, Inventiones mathematicae, 115 (1994), pp. 463–482.
• [5] B. Beauzamy, Introduction to Banach Spaces and Their Geometry, Mathematical Studies, North-Holland, 1985.
• [6] A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor, Oracle-based robust optimization via online learning, Operations Research, 63 (2015), pp. 628–638.
• [7] J. Borwein, A. J. Guirao, P. Hájek, and J. Vanderwerff, Uniformly convex functions on Banach spaces, Proc. Amer. Math. Soc., 137 (2009), pp. 1081–1091.
• [8] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.
• [9] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing Company, 2002.
• [10] J. Clarkson, Uniformly convex spaces, Trans. Amer. Math. Soc., 40 (1936), pp. 396–414.
• [11] O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet, Online learning with a hint, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5305–5314.
• [12] V. Demyanov and A. Rubinov, Approximate Methods in Optimization Problems, Modern analytic and computational methods in science and mathematics, 1970.
• [13] J. Dunn, Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals, SIAM Journal on Control and Optimization, 17 (1979), pp. 187–211.
• [14] W. J. Firey, p-means of convex bodies, Mathematica Scandinavica, 10 (1962), pp. 17–24.
• [15] D. J. Foster, S. Kale, M. Mohri, and K. Sridharan, Parameter-free online learning via model selection, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6022–6032.
• [16] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, 3 (1956), pp. 95–110.
• [17] R. Freund, P. Grigas, and R. Mazumder, An extended Frank-Wolfe method with “in-face” directions, and its application to low-rank matrix completion, SIAM Journal on Optimization, 27 (2017), pp. 319–346.
• [18] D. Garber and E. Hazan, Faster rates for the Frank-Wolfe method over strongly-convex sets, in ICML, vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 541–549.
• [19] V. V. Goncharov and G. E. Ivanov, Strong and Weak Convexity of Closed Sets in a Hilbert Space, Springer International Publishing, Cham, 2017, pp. 259–297.
• [20] M. Grötschel, L. Lovász, and A. Schrijver, Geometric Algorithms and Combinatorial Optimization, vol. 2, second corrected ed., 1993.
• [21] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization, Mathematical Programming, 152 (2015), pp. 75–112.
• [22] E. Hazan, Introduction to online convex optimization, Found. Trends Optim., 2 (2016), pp. 157–325.
• [23] E. Hazan and N. Megiddo, Online learning with prior knowledge, in Proceedings of the 20th Annual Conference on Learning Theory, COLT’07, Berlin, Heidelberg, 2007, Springer-Verlag, pp. 499–513.
• [24] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of convex analysis, Grundlehren Text Editions, Springer-Verlag, Berlin, 2001.
• [25] R. Huang, T. Lattimore, A. György, and C. Szepesvári, Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities, Journal of Machine Learning Research, 18 (2017), pp. 1–31.
• [26] T. Hytönen, J. van Neerven, M. Veraar, and L. Weis, Analysis in Banach Spaces : Volume I: Martingales and Littlewood-Paley Theory, Springer International Publishing, 2016.
• [27] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 427–435.
• [28] M. Jaggi and M. Sulovský, A simple algorithm for nuclear norm regularized problems, in Proceedings of the 27th International Conference on Machine Learning, ICML’10, USA, 2010, Omnipress, pp. 471–478.
• [29] K. John and V. Zizler, Shorter notes: A short proof of a version of Asplund’s norm averaging theorem, Proc. Amer. Math. Soc., 73 (1979).
• [30] A. Kalai and S. Vempala, Efficient algorithms for online decision problems, J. Comput. Syst. Sci., 71 (2005), pp. 291–307.
• [31] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, in Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013, PMLR, pp. 53–61.
• [32] E. Levitin and B. Polyak, Constrained minimization methods, USSR Computational Mathematics and Mathematical Physics, 6 (1966), pp. 1 – 50.
• [33] B. Li and S. Hoi, Online Portfolio Selection: Principles and Algorithms, CRC Press, 2015.
• [34] J. Lindenstrauss and L. Tzafriri, Classical Banach Spaces II: Function Spaces, Ergebnisse der Mathematik und ihrer Grenzgebiete. 2. Folge, Springer Berlin Heidelberg, 2013.
• [35] T. Liu, G. Lugosi, G. Neu, and D. Tao, Algorithmic stability and hypothesis complexity, in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 2159–2167.
• [36] E. Lutwak, D. Yang, and G. Zhang, The Brunn-Minkowski-Firey inequality for nonconvex sets, Advances in Applied Mathematics, 48 (2012), pp. 407–413.
• [37] C. Mu, Y. Zhang, J. Wright, and D. Goldfarb, Scalable robust matrix recovery: Frank–Wolfe meets proximal methods, SIAM Journal on Scientific Computing, 38 (2016), pp. A3291–A3317.
• [38] A. Osokin, J.-B. Alayrac, I. Lukasewitz, P. Dokania, and S. Lacoste-Julien, Minding the gaps for block Frank-Wolfe optimization of structured SVMs, in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., vol. 48 of Proceedings of Machine Learning Research, New York, New York, USA, 20–22 Jun 2016, PMLR, pp. 593–602.
• [39] E. S. Polovinkin, Strongly convex analysis, Sbornik: Mathematics, 187 (1996), pp. 259–286.
• [40] B. T. Polyak, Existence theorems and convergence of minimizing sequences in extremum problems with restrictions, Soviet Math. Dokl., 7 (1966), pp. 72–75.
• [41] A. Rakhlin and K. Sridharan, Optimization, learning, and games with predictable sequences, in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, 2013, Curran Associates Inc., pp. 3066–3074.
• [42] A. Rakhlin and K. Sridharan, On equivalence of martingale tail bounds and deterministic regret inequalities, in Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, 2017, pp. 1704–1722.
• [43] S. Shalev-Shwartz, Online learning and online convex optimization, Found. Trends Mach. Learn., 4 (2012), pp. 107–194.
• [44] N. Srebro, K. Sridharan, and A. Tewari, On the universality of online mirror descent, in Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain., 2011, pp. 2645–2653.
• [45] J.-P. Vial, Strong convexity of sets and functions, Journal of Mathematical Economics, 9 (1982), pp. 187 – 205.
• [46] J. Wang and J. Abernethy, Acceleration through Optimistic No-Regret Dynamics, ArXiv e-prints, (2018). (https://arxiv.org/pdf/1807.10455.pdf).
• [47] C. Zălinescu, On the differentiability of the support function, Journal of Global Optimization, 57 (2013), pp. 719–731.
• [48] G. Ziegler, Lectures on Polytopes, Graduate texts in mathematics, Springer-Verlag, 1995.

## Appendix A Non-midpoint Strong Convexity

The following definition of curvature was used in [18].

###### Definition 6 (Non-midpoint Strongly Convex Sets).

Consider a convex body with the origin in its interior. The convex body is -non-midpoint strongly convex with respect to if for every and every with we have the containment

$$z + 4\lambda\mu(1-\mu)\|x-y\|_C^2 \cdot C \subseteq K.$$
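As a sanity check of this containment, the sketch below tests it for the Euclidean unit ball, taking both the set and the gauge body to be that ball and reading the condition as quantifying over points $x, y$ of the set and $\mu \in [0,1]$ with $z = \mu x + (1-\mu) y$ (an assumption about the elided symbols). In that case the containment $z + r \cdot C \subseteq K$ is equivalent to $\|z\|_2 + r \le 1$, and the identity $\|\mu x + (1-\mu) y\|_2^2 = \mu\|x\|_2^2 + (1-\mu)\|y\|_2^2 - \mu(1-\mu)\|x-y\|_2^2$ shows that it holds with $\lambda = 1/8$. The function name is hypothetical.

```python
import numpy as np

def non_midpoint_condition_holds(lam, d=3, trials=20_000, seed=0, tol=1e-12):
    """Test z + 4*lam*mu*(1-mu)*||x-y||^2 * B inside B for the Euclidean unit ball B."""
    rng = np.random.default_rng(seed)
    # Sample x, y on the unit sphere, the extreme case for the containment.
    x = rng.standard_normal((trials, d))
    y = rng.standard_normal((trials, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    mu = rng.random((trials, 1))
    z = mu * x + (1 - mu) * y
    radius = 4 * lam * mu[:, 0] * (1 - mu[:, 0]) * np.linalg.norm(x - y, axis=1) ** 2
    # z + radius * B is contained in B exactly when ||z|| + radius <= 1.
    return bool(np.all(np.linalg.norm(z, axis=1) + radius <= 1 + tol))

print(non_midpoint_condition_holds(1 / 8))   # the containment holds for lam = 1/8
print(non_midpoint_condition_holds(1.0))     # too large a lam violates it
```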

It is clear that every non-midpoint strongly convex set is strongly convex: take $\mu = 1/2$ in the definition. The next lemma shows the converse.

###### Lemma 9.

-Strong convexity implies -non-midpoint strong convexity.

###### Proof.

Consider a -strongly convex set with respect to . Consider any pair of points and with . Let . By symmetry, assume without loss of generality that . Let be the midpoint of and . By assumption, .

We claim that the set is contained in the convex hull of and ; convexity of implies that also contains this set, which would conclude the proof. To prove the claim, note we can write , which equals . The convex combination between and with coefficient (recall that by assumption ) is precisely