# Faster Projection-free Online Learning

In many online learning problems the computational bottleneck for gradient-based methods is the projection operation. For this reason, in many problems the most efficient algorithms are based on the Frank-Wolfe method, which replaces projections by linear optimization. In the general case, however, online projection-free methods require more iterations than projection-based methods: the best known regret bound scales as $T^{3/4}$. Despite significant work on various variants of the Frank-Wolfe method, this bound has remained unchanged for a decade. In this paper we give an efficient projection-free algorithm that guarantees $T^{2/3}$ regret for general online convex optimization with smooth cost functions and one linear optimization computation per iteration. As opposed to previous Frank-Wolfe approaches, our algorithm is derived using the Follow-the-Perturbed-Leader method and is analyzed using an online primal-dual framework.

## 1 Introduction

In many machine learning problems the decision set is high-dimensional or otherwise so complex that even convex optimization over the set is impractical. Such is the case, for example, in matrix learning problems: performing a matrix decomposition for very large problems is computationally intensive and super-linear in the sparsity of the input. This renders common algorithms such as projected gradient descent infeasible.

An alternative methodology which has proven successful in several applications is projection-free online learning. In this model, the access of the learner to the decision set is via a linear optimization oracle, as opposed to general convex optimization. As an example, linear optimization over matrices amounts to eigenvector computations, which can be carried out in time proportional to the sparsity of the matrices.
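To make the matrix example concrete, the following is a minimal sketch of a linear optimization oracle over the spectahedron (positive semidefinite matrices of unit trace), where the maximizer of a linear objective is the outer product of a leading eigenvector. The function names, the dense list-of-lists representation, and the use of shifted power iteration are illustrative assumptions, not something taken from the paper.

```python
import math

def top_eigenvector(Y, iters=500):
    """Approximate a leading eigenvector of a symmetric matrix Y (list of lists)
    by power iteration on Y + c*I, which is positive semidefinite for the c below."""
    n = len(Y)
    # Gershgorin-style shift: c is at least |lambda| for every eigenvalue of Y.
    c = max(sum(abs(a) for a in row) for row in Y)
    # Deterministic start; assumed not orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(Y[i][j] * v[j] for j in range(n)) + c * v[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # Y is the zero matrix; any unit vector is optimal
            break
        v = [x / norm for x in w]
    return v

def spectahedron_oracle(Y):
    """Return (v, value) where v v^T maximizes <Y, X> over {X PSD, trace(X) = 1}."""
    v = top_eigenvector(Y)
    n = len(Y)
    value = sum(v[i] * Y[i][j] * v[j] for i in range(n) for j in range(n))
    return v, value

v, value = spectahedron_oracle([[2.0, 0.0], [0.0, 1.0]])
```

Each iteration is a single matrix-vector product, so with a sparse representation its cost would scale with the number of nonzero entries; the dense loops above are kept only for readability.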

We henceforth consider online algorithms that perform one (or more generally a constant number) linear optimizations and/or gradient evaluations per iteration. The reason is that if we do not restrict the number of linear optimizations, we can compute projections and run standard projected gradient descent. This defeats the purpose of creating efficient algorithms.

This running time advantage has spurred significant research in recent years on projection-free methods and the Frank-Wolfe algorithm. However, despite a decade-long search, the best known projection-free online algorithm attains a regret bound that scales as $T^{3/4}$, where $T$ is the number of iterations. (Footnote 1: We omit the $O(\cdot)$-notation in the introduction to make the exposition cleaner. Here $O(\cdot)$ hides constants that include the norm of the gradients, the diameter of the decision set, and more; see Hazan and others (2016) for details.) This method, the Online Frank-Wolfe (OFW) algorithm (Hazan and Kale, 2012), is still the fastest known in the general setting of online convex optimization (OCO).

The $T^{3/4}$ bound is particularly striking when compared to the stochastic projection-free optimization setting. In this setting, it is straightforward to obtain $T^{2/3}$ regret for smooth stochastic projection-free optimization by the so-called blocking technique: grouping several game iterations into one and thereby changing the decision less often (Merhav et al., 2002). The optimal rate of $\sqrt{T}$ is more challenging to obtain, as given in Lan and Zhou (2016). However, we are unaware of an improvement to the $T^{3/4}$ rate even in the stochastic non-smooth case, when only constantly many linear optimizations and gradient evaluations per iteration are allowed.

### 1.1 Our Results

Our main result is an efficient randomized algorithm that improves the state-of-the-art in general projection-free online optimization with smooth loss functions. The expected regret of this algorithm scales as $T^{2/3}$, with only one linear optimization computation per iteration. We then extend the analysis of this algorithm to show that it attains the same regret bound with high probability. Our main results are summarized by the informal theorem below, with the exact dependence on smoothness and other relevant problem parameters detailed in later sections.

###### Theorem 1.1.

There exists an efficient algorithm for online convex optimization with smooth loss functions (see Algorithm 2) that is projection-free, performs only one linear optimization computation per iteration, and guarantees an expected regret of $O(T^{2/3})$. Furthermore, the algorithm guarantees a regret of $\tilde{O}(T^{2/3})$ with probability at least $1 - \sigma$.

#### Techniques.

Our algorithm is not based on the Frank-Wolfe method, but rather a version of the Follow-the-Perturbed-Leader (FPL) method (Kalai and Vempala, 2005). It was already established in Hazan and others (2016) that a deterministic version of FPL works for online convex optimization. This version computes the expected point FPL plays at every iteration. In order to convert this algorithm to an efficient projection-free method, two main challenges arise:

1. Estimating the expectation by sampling FPL points via linear optimization creates time dependence between iterations, since the gradient is taken at a point which depends on all previous iterations. This means that a small error in one iteration potentially propagates to all future iterations.

2. The number of linear optimization evaluations needed to estimate the mean of the FPL algorithm up to accuracy $\varepsilon$ is $O(1/\varepsilon^2)$. This dependence is not sufficient to improve the previously best regret bound with only constantly many linear optimization computations per iteration.

To overcome the above issues we require two tools that are new to the analysis of randomized online methods. First, we use the online primal-dual methodology of Shalev-Shwartz and Singer (2007). This allows us to avoid the error propagation caused by random estimation of the mean, and could be a technique of independent interest.

The second tool is using the smoothness of the loss functions to leverage not only the estimation proximity but also the fact that the estimation is unbiased. This is executed by switching gradients at nearby points which are not too far off due to the Lipschitz property of the gradients of smooth functions.

#### Paper structure.

In the next subsection we discuss related work, and then move to describe preliminaries, including tools necessary for the online primal-dual analysis framework. In section 3, we state the main algorithm and formally state our main theorems including precise constants. In section 4 we state the deterministic FPL algorithm and analyze it using the primal-dual framework to illustrate its versatility in handling error propagation. We then use unbiased estimation and smoothness in section 5 to derive the first main theorem. In section 5.1, we provide a reduction of the algorithm to the setting of one linear optimization step per iteration. High probability bounds are given in detail in section 5.2 and derived in the appendix along with other miscellaneous proofs.

### 1.2 Related Work

In recent years the projection-free learning and optimization literature has seen a resurgence of results. We separate the related work into the broad categories below.

#### Projection-free offline optimization.

The starting point for our line of work is the seminal paper of Frank and Wolfe (1956), who apply the conditional gradient method for smooth optimization over polyhedral sets. This was extended to semi-definite programming in Hazan (2008), and to general convex optimization in Jaggi (2013). This algorithm requires $O(1/\varepsilon)$ linear optimization steps to find an $\varepsilon$-approximate solution for a $\beta$-smooth function, which is optimal with no other assumptions.

A breakthrough in projection-free methods was obtained by Garber and Hazan (2013), who give an algorithm that requires only $O(\log(1/\varepsilon))$ linear optimization steps for strongly convex and smooth functions over polyhedral sets. Data-dependent bounds for the spectahedron were obtained by Garber (2016); Allen-Zhu et al. (2017).

Projection-free optimization on non-smooth objective functions is typically performed via various smoothing techniques. The optimal complexity of linear optimization calls in this case is $O(1/\varepsilon^2)$ (Lan, 2013). Several algorithms attain nearly optimal rates as in Lan (2013); Argyriou et al. (2014); Pierucci et al. (2014).

#### Projection-free online learning.

The online variant of the Frank-Wolfe algorithm that applies to general online convex optimization was given in Hazan and Kale (2012). This method attains $T^{3/4}$ regret for the general OCO setting, with only one linear optimization step per iteration. (Footnote 2: If arbitrarily many linear optimization steps are allowed, the projections can be computed and this regret can be improved to $\sqrt{T}$.)

For OCO over polyhedral sets, an implication of the result of Garber and Hazan (2013) is an efficient $\sqrt{T}$ regret algorithm with only one linear optimization step per iteration, as well as $\log T$ regret for strongly convex online optimization. Recently, Levy and Krause (2019) gave an efficient projection-free online learning algorithm for smooth sets that devises a new fast projection operation for such sets and attains the optimal $\sqrt{T}$ regret for convex and $\log T$ regret for strongly convex online optimization.

Without further assumptions, the OFW method in Hazan and Kale (2012) attains the best known bounds for general online convex optimization. To the best of our knowledge, our $T^{2/3}$ regret bound is the first improvement in this general OCO setting for smooth functions.

#### Projection-free stochastic optimization.

An important application of projection-free optimization is in the context of supervised learning and the optimization problem of empirical risk minimization. In this setting, there are more techniques that can be applied to further accelerate optimization, as compared to the online setting, most notably variance reduction.

This requires more careful accounting of the actual operations that the algorithms perform, including counting the number of full-gradient computations, stochastic gradient evaluations, linear optimizations, and projections. There have been a multitude of algorithms suggested that attain various tradeoffs of the various computations, and have different merits/caveats. The reader is referred to the vast literature on stochastic projection-free methods, including the recent papers of Lan and Zhou (2016); Hazan and Luo (2016); Xie et al. (2019); Yurtsever et al. (2019).

## 2 Problem Setting

We consider a classical online convex optimization framework as an iterative game between a player and an adversary. At each iteration $t$, the player chooses an action $x_t \in \mathcal{K}$ from the constrained set of permissible actions $\mathcal{K} \subseteq \mathbb{R}^d$ while the adversary simultaneously chooses a loss function $f_t$ that determines the loss the player will incur for the action $x_t$. The performance metric for such settings is the notion of regret: the difference between the cumulative loss suffered throughout $T$ iterations of the online game and the overall loss for the best fixed action in hindsight:

$$
R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) \tag{2.1}
$$

For a given online algorithm $\mathcal{A}$, we denote by $R_T(\mathcal{A})$ the regret after $T$ iterations and, in the case when $\mathcal{A}$ is a randomized algorithm, use expected regret as the performance metric. In this work, the adversary has no computational or information restrictions as long as it chooses $f_t$ simultaneously with the player choosing $x_t$, i.e. we operate in the adaptive adversarial setting.
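The regret in (2.1) can be evaluated directly on a toy instance. The sketch below is only illustrative: the helper name `regret` and the finite comparator list are assumptions, and replacing the minimum over the whole set by a minimum over finitely many points is exact here only because the losses are linear on an interval, so the best fixed action is an endpoint.

```python
def regret(losses, actions, comparators):
    """Regret as in (2.1), with the minimum taken over a finite comparator list.

    losses: list of callables f_t; actions: list of played points x_t;
    comparators: candidate fixed actions standing in for the decision set.
    """
    incurred = sum(f(x) for f, x in zip(losses, actions))
    best_fixed = min(sum(f(u) for f in losses) for u in comparators)
    return incurred - best_fixed

# Linear losses f_t(x) = g_t * x on the interval [-1, 1].
slopes = [1.0, -1.0, 1.0]
losses = [lambda x, g=g: g * x for g in slopes]
actions = [0.0, 0.0, 0.0]          # the player always plays 0
toy_regret = regret(losses, actions, comparators=[-1.0, 1.0])
```

In this instance the player incurs zero loss, the best fixed action is the left endpoint with total loss of minus one, and so the regret is one.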

Before proceeding to the main results, we fix notation and assumptions used throughout the paper. We discuss and, if necessary, derive our additional or modified assumptions, made for simplicity in the analysis, without revisiting the standard conventions of the field, explanations of which can be found in the extensive literature (e.g. see Hazan and others (2016)). Throughout this work the norm $\|\cdot\|$ refers to the standard Euclidean norm unless stated otherwise and $\mathbb{B}$ denotes the unit ball. Given any sequence $a_1, \dots, a_n$, by abuse of notation we write $a_{i:j}$ as shorthand for the indexed sum $\sum_{k=i}^{j} a_k$ or the indexed union $\bigcup_{k=i}^{j} a_k$, depending on context.

###### Assumption 2.1.

The constrained action set $\mathcal{K} \subseteq \mathbb{R}^d$ is a convex and compact set. Moreover, all the points in the set have bounded norms, i.e. $\|x\| \le D$ for all $x \in \mathcal{K}$.

###### Assumption 2.2.

For each iteration $t$, the loss functions $f_t$ are convex, differentiable and have bounded gradient norms, i.e. $\|\nabla f_t(x)\| \le G$ for all $x \in \mathcal{K}$.

The convention in OCO is to simply assume a bounded diameter for the set $\mathcal{K}$ instead of the norm bound. However, it is straightforward to derive the above formulation in the following way: given an arbitrary point $x_0$ in the set $\mathcal{K}$, consider the shifted set $\mathcal{K} - x_0$; the diameter bound of $\mathcal{K}$ implies the bounded norms of the points in $\mathcal{K} - x_0$, while properties such as convexity and compactness are preserved through shifts. The convexity and bounded gradient norm assumptions for the loss functions are standard throughout the literature, while differentiability of the loss functions is assumed for simplicity and can be avoided by using subgradients instead.

###### Definition 2.3.

The Fenchel dual of a function $f$ with domain $\mathcal{K} \subseteq \mathbb{R}^d$ is defined as

$$
\forall y \in \mathbb{R}^d, \quad f^*(y) = \sup_{x \in \mathcal{K}} \{ \langle y, x \rangle - f(x) \} \tag{2.2}
$$

The definition and some properties of Fenchel duality are given for completeness, as the concept of a Fenchel dual will be crucial in the analysis of the presented algorithms. If the function $f$ is convex, then its Fenchel dual $f^*$ is also convex, and the Fenchel-Moreau theorem gives biconjugacy, i.e. the dual of the dual is equal to the function itself, $f^{**} = f$. In this case, it is essential to note that $y = \nabla f(x)$, which is well-defined when $f$ is differentiable, directly implies $f^*(y) = \langle y, x \rangle - f(x)$, since the supremum in (2.2) is then attained at $x$.
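A small numerical check can make Definition 2.3 concrete. The sketch below approximates the supremum in (2.2) on a grid for the one-dimensional function $f(x) = x^2/2$ on the interval $[-1, 1]$; the function name, the grid resolution, and the choice of example are assumptions made purely for illustration.

```python
def fenchel_conjugate(f, y, lo=-1.0, hi=1.0, n=2000):
    """Grid approximation of f*(y) = sup over x in [lo, hi] of (y * x - f(x))."""
    grid = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(y * x - f(x) for x in grid)

def half_square(x):
    return 0.5 * x * x

# Inside the interval the maximizer is x = y, so f*(y) = y^2 / 2.
inner = fenchel_conjugate(half_square, 0.5)
# For y outside [-1, 1] the maximizer is clipped to the boundary x = 1.
outer = fenchel_conjugate(half_square, 2.0)
```

At $x = 0.5$ the gradient of this $f$ equals $0.5$, and the computed conjugate matches $\langle y, x \rangle - f(x) = 0.25 - 0.125$, which is the identity for convex differentiable functions that the later analysis relies on.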

#### Linear Optimization Oracle.

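A linear optimization oracle, along with a value oracle, over the constraint set $\mathcal{K}$ is provided to the player, defined as

$$
\forall y \in \mathbb{R}^d, \quad \mathcal{O}_{\mathcal{K}}(y) = \arg\max_{x \in \mathcal{K}} \langle y, x \rangle, \qquad \mathcal{M}_{\mathcal{K}}(y) = \max_{x \in \mathcal{K}} \langle y, x \rangle \tag{2.3}
$$

The reliance on linear optimization is the key motivation of the paper. This work concerns the special case of online convex constrained optimization in which projection onto the set $\mathcal{K}$, a quadratic optimization problem, has a significantly higher computational cost than linear optimization. In such cases, projected Online Gradient Descent (OGD) (Zinkevich, 2003), which achieves a regret bound that is optimal with respect to $T$, is not always preferable to methods that bypass projection and use linear optimization instead. It is worth noting that access to $\mathcal{O}_{\mathcal{K}}$ alone is enough, since $\mathcal{M}_{\mathcal{K}}(y) = \langle y, \mathcal{O}_{\mathcal{K}}(y) \rangle$ and $\nabla \mathcal{M}_{\mathcal{K}}(y) = \mathcal{O}_{\mathcal{K}}(y)$ wherever the gradient exists. Moreover, the function $\mathcal{M}_{\mathcal{K}}$ is convex and Lipschitz, as stated below.

###### Lemma 2.4.

The linear value oracle $\mathcal{M}_{\mathcal{K}}$ is convex and $D$-Lipschitz, i.e.

$$
\forall y_1, y_2 \in \mathbb{R}^d, \quad |\mathcal{M}_{\mathcal{K}}(y_1) - \mathcal{M}_{\mathcal{K}}(y_2)| \le D \|y_1 - y_2\| \tag{2.4}
$$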
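As an illustration of the two oracles and of the Lipschitz property in Lemma 2.4, the sketch below instantiates them for the box $[-1, 1]^d$, for which every point has norm at most $\sqrt{d}$. The choice of set and the function names are assumptions for the example; the paper itself treats a general convex compact set.

```python
import math

def lin_opt_box(y):
    """Linear optimization oracle for the box [-1, 1]^d:
    a maximizer of <y, x> takes the sign of each coordinate of y."""
    return [1.0 if c >= 0 else -1.0 for c in y]

def value_box(y):
    """Value oracle computed from the optimization oracle: <y, O(y)>.
    On the box this equals the l1 norm of y."""
    return sum(c * s for c, s in zip(y, lin_opt_box(y)))

y1, y2 = [3.0, -4.0], [0.0, 1.0]
D = math.sqrt(len(y1))                      # norm bound of the box in R^2
gap = abs(value_box(y1) - value_box(y2))    # left-hand side of (2.4)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))
```

The value oracle is obtained from a single call to the optimization oracle, which mirrors the remark that only the latter needs to be available.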

## 3 Algorithm and Main Theorem

The algorithm we propose is fairly straightforward, and the main hurdle lies in the analysis. The seminal work of Kalai and Vempala (2005) analyzes the Follow-the-Perturbed-Leader (FPL) online algorithm that obtains optimal regret for linear loss functions. A more general version of FPL that applies expectations over the perturbations at each iteration extends the result to general convex functions (Hazan and others, 2016). Our algorithm mimics the expected FPL replacing the computationally expensive expectations with empirical averages of i.i.d. samples. It is presented in detail in Algorithm 1. The following theorem states the convergence guarantees for both general convex and smooth convex loss functions.

###### Theorem 3.1.

Given that Assumptions 2.1 and 2.2 hold, Algorithm 1, for general convex loss functions, obtains an expected regret of

$$
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(\tilde{x}_t) \right] \le \min_{x \in \mathcal{K}} \left\{ \sum_{t=1}^{T} f_t(x) \right\} + \frac{2D}{\delta} + \frac{\delta D G^2 \cdot dT}{2} + \frac{2GDT}{\sqrt{m}} \tag{3.1}
$$

If the convex loss functions are also $\beta$-smooth then the expected regret bound becomes

$$
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(\tilde{x}_t) \right] \le \min_{x \in \mathcal{K}} \left\{ \sum_{t=1}^{T} f_t(x) \right\} + \frac{2D}{\delta} + \frac{\delta D G^2 \cdot dT}{2} + \frac{4\beta D^2 T}{m} \tag{3.2}
$$

###### Remark 3.2.

It follows from the result above that Algorithm 1 achieves an expected regret of $O(\sqrt{T})$ with the parameter choices of $m = T$ and $m = \sqrt{T}$ for general convex and smooth convex functions respectively. In particular, this recovers the original result of the FPL method attaining $O(\sqrt{T})$ regret with $m = 1$ sample per iteration for linear, $\beta = 0$, loss functions shown in Kalai and Vempala (2005).

###### Corollary 3.3.

The expected regret bound of Algorithm 1 for general convex loss functions implies $O(T^{3/4})$ expected regret when using one linear optimization step per iteration. The analogous reduction induces the OSPF algorithm given in Algorithm 2 that attains $O(T^{2/3})$ expected regret for smooth functions with one linear optimization step per iteration.
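Algorithm 1 itself is not reproduced in this text, so the following is only a hedged sketch of the idea described above: replace the expectation in expected FPL by an empirical average of `m` perturbed linear optimization calls. The perturbation distribution (uniform on the unit ball, scaled by `1/delta`), the sign convention for maximizing against negated gradients, and all function names are assumptions for illustration; the box oracle stands in for whatever oracle the decision set admits.

```python
import math
import random

def lin_opt_box(y):
    """Stand-in linear optimization oracle: maximizer of <y, x> over [-1, 1]^d."""
    return [1.0 if c >= 0 else -1.0 for c in y]

def sample_unit_ball(d, rng):
    """Draw a point uniformly from the Euclidean unit ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * c / norm for c in g]

def sampled_fpl_action(grad_sum, delta, m, lin_opt, rng):
    """Average m oracle answers at perturbed copies of the negated gradient sum."""
    d = len(grad_sum)
    avg = [0.0] * d
    for _ in range(m):
        v = sample_unit_ball(d, rng)
        point = lin_opt([-g + vi / delta for g, vi in zip(grad_sum, v)])
        avg = [a + p / m for a, p in zip(avg, point)]
    return avg

def run_sampled_fpl(grad_fns, d, delta, m, lin_opt, seed=0):
    """Play len(grad_fns) rounds; grad_fns[t](x) returns the gradient of f_t at x."""
    rng = random.Random(seed)
    grad_sum = [0.0] * d
    actions = []
    for grad in grad_fns:
        x = sampled_fpl_action(grad_sum, delta, m, lin_opt, rng)
        actions.append(x)
        # The gradient is evaluated at the played (sampled) point.
        grad_sum = [s + g for s, g in zip(grad_sum, grad(x))]
    return actions

# Five rounds of the same linear loss with gradient (1, -1).
actions = run_sampled_fpl([lambda x: [1.0, -1.0]] * 5, d=2, delta=0.5, m=4,
                          lin_opt=lin_opt_box, seed=1)
```

Each round costs `m` oracle calls, which is why the reduction in section 5.1 is needed to reach one call per iteration. In the toy run, once the accumulated gradient exceeds the largest possible perturbation `1/delta`, every sample agrees and the played point is the vertex opposing the gradient.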

## 4 The Case of Unlimited Computation

In an ideal scenario the player would be given unrestrained computational power along with access to the linear optimization oracle. In that case the expected FPL method, as a projection-free online algorithm, is known to obtain an $O(\sqrt{T})$ regret bound, which is optimal with respect to $T$ for general convex loss functions. The exact algorithm is spelled out in Algorithm 3. The original analysis follows the standard recipe of the online learning literature coined by Kalai and Vempala (2005), which has two parts: (i) no regret for Be-The-Leader, meaning the algorithm suffers no regret if it is hypothetically one step ahead of the adversary, i.e. uses $x_{t+1}$ for the loss function $f_t$; and (ii) stability, meaning the predictions of consecutive rounds are not too far apart from each other (Hazan and others, 2016). We provide an alternative approach, developed by Shalev-Shwartz and Singer (2007), that is based on duality and enables the further analysis of Algorithm 1.

###### Theorem 4.1.

Given that Assumptions 2.1 and 2.2 hold, Algorithm 3 suffers $O(\sqrt{T})$ regret.

###### Proof.

The proof is based on duality when one considers the following optimization problem

$$
\min_{x \in \mathcal{K}} \left\{ h_\delta(x) + \sum_{t=1}^{T} f_t(x) \right\} \tag{4.1}
$$

which resembles the loss suffered by the best-in-hindsight fixed action. The dual objective, which is to be maximized, can be obtained using Lagrange multipliers and is given by (see Shalev-Shwartz and Singer (2007) for details)

$$
\mathcal{D}(\lambda_1, \dots, \lambda_T) = -h_\delta^*(-\lambda_{1:T}) - \sum_{t=1}^{T} f_t^*(\lambda_t) \tag{4.2}
$$

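The term $h_\delta$ serves as regularization and is defined implicitly through its Fenchel conjugate $h_\delta^*(y) = \mathbb{E}_{v \sim \mathbb{B}}[\mathcal{M}_{\mathcal{K}}(y + v/\delta)]$, a stochastic smoothing of the value oracle, that is $\delta d D$-smooth according to the following lemma and the fact that $\mathcal{M}_{\mathcal{K}}$ is $D$-Lipschitz.

###### Lemma 4.2.

The function $\hat{f}_\delta(x) = \mathbb{E}_{v \sim \mathbb{B}}[f(x + \delta v)]$ is $\frac{dL}{\delta}$-smooth given $f$ is $L$-Lipschitz.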

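Weak duality implies that the dual objective (4.2) is upper bounded by the primal (4.1) for any values of $\lambda_1, \dots, \lambda_T$, hence the goal is to upper bound the online cumulative loss by (4.2). To achieve this, take $\lambda_t = \nabla_t = \nabla f_t(x_t)$ for all $t$, where the action $x_t$ is chosen according to Algorithm 3. Denote the incremental difference as $\Delta_t = \mathcal{D}(\nabla_1, \dots, \nabla_t, 0, \dots, 0) - \mathcal{D}(\nabla_1, \dots, \nabla_{t-1}, 0, \dots, 0)$ and notice that the dual can be written as $\mathcal{D}(\nabla_1, \dots, \nabla_T) = \sum_{t=1}^{T} \Delta_t + \mathcal{D}(0, \dots, 0)$. For each $t$,

$$
\begin{aligned}
\Delta_t &= -\left[ h_\delta^*(-\nabla_{1:t}) - h_\delta^*(-\nabla_{1:t-1}) \right] - f_t^*(\nabla_t) + f_t^*(0) \\
&\ge \langle \nabla_t, \nabla h_\delta^*(-\nabla_{1:t-1}) \rangle - \frac{\delta d D}{2} \|\nabla_t\|^2 - f_t^*(\nabla_t) + f_t^*(0) && \text{(smoothness of } h_\delta^*\text{)} \\
&= \langle \nabla_t, x_t \rangle - f_t^*(\nabla_t) - \frac{\delta d D}{2} \|\nabla_t\|^2 + f_t^*(0) && \text{(definition of } x_t\text{)} \\
&= f_t(x_t) - \frac{\delta d D}{2} \|\nabla_t\|^2 + f_t^*(0)
\end{aligned} \tag{4.3}
$$

where we use the fact that the action from Algorithm 3 can alternatively be expressed as $x_t = \nabla h_\delta^*(-\nabla_{1:t-1})$ and the Fenchel dual identity $f_t(x_t) = \langle \nabla_t, x_t \rangle - f_t^*(\nabla_t)$ for convex $f_t$. The obtained inequality (4.3) quantifies how much regret an action contributes at a given iteration, detached from the rest of the rounds of the game. This property of the analysis ends up being crucial in showing the regret bounds later in this work. Noting that $\mathcal{D}(0, \dots, 0) = -h_\delta^*(0) - \sum_{t=1}^{T} f_t^*(0)$ and summing up (4.3) for all $t$, the online cumulative loss is bounded by

$$
\sum_{t=1}^{T} f_t(x_t) \le \sum_{t=1}^{T} \Delta_t + \frac{\delta d D}{2} \sum_{t=1}^{T} \|\nabla_t\|^2 - \sum_{t=1}^{T} f_t^*(0) = \mathcal{D}(\nabla_1, \dots, \nabla_T) + h_\delta^*(0) + \frac{\delta d D}{2} \sum_{t=1}^{T} \|\nabla_t\|^2 \tag{4.4}
$$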

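The bound given by (4.4) and weak duality with the primal (4.1) provide the necessary ingredients to conclude the regret bound. All that is left are technical details to reach the bound using the assumptions of the given setup. First, by definition $h_\delta^*(0) = \mathbb{E}_{v \sim \mathbb{B}}[\mathcal{M}_{\mathcal{K}}(v/\delta)]$, which implies, by Lipschitzness of $\mathcal{M}_{\mathcal{K}}$ and $\mathcal{M}_{\mathcal{K}}(0) = 0$, that $\mathcal{M}_{\mathcal{K}}(v/\delta) \le D/\delta$ for any $v \in \mathbb{B}$, so $h_\delta^*(0) \le D/\delta$. Second, the primal expression in (4.1) can be related to the best loss in hindsight in the following way

$$
\min_{x \in \mathcal{K}} \left\{ h_\delta(x) + \sum_{t=1}^{T} f_t(x) \right\} \le h_\delta(x^*) + \sum_{t=1}^{T} f_t(x^*) \le \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) + \max_{x \in \mathcal{K}} h_\delta(x) \tag{4.5}
$$

where $x^*$ is the optimal action in hindsight, i.e. the minimizer of $\sum_{t=1}^{T} f_t(x)$ over $\mathcal{K}$. Moreover, notice that for any $x \in \mathcal{K}$ the expression $h_\delta(x) = \sup_{y} \{ \langle y, x \rangle - \mathbb{E}_{v \sim \mathbb{B}}[\mathcal{M}_{\mathcal{K}}(y + v/\delta)] \}$ can be bounded as follows: for each $v \in \mathbb{B}$ and each $y$ the expression inside the expectation is bounded as $\langle y, x \rangle - \mathcal{M}_{\mathcal{K}}(y + v/\delta) \le -\langle v, x \rangle / \delta \le D/\delta$, hence for any $x \in \mathcal{K}$ the bound $h_\delta(x) \le D/\delta$ holds. Finally, according to our assumptions the loss gradients are bounded in norm, i.e. $\|\nabla_t\| \le G$. Combining the aforementioned properties with (4.4), along with the fact that $\mathcal{D}(\nabla_1, \dots, \nabla_T)$ is upper bounded by (4.1) due to weak duality, we conclude the desired inequality

$$
\sum_{t=1}^{T} f_t(x_t) \le \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) + \max_{x \in \mathcal{K}} h_\delta(x) + h_\delta^*(0) + \frac{\delta d D}{2} \sum_{t=1}^{T} \|\nabla_t\|^2 \le \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) + \frac{2D}{\delta} + \frac{\delta d D}{2} G^2 T
$$

yielding the regret bound $2DG\sqrt{dT}$ with the optimal choice of the regularization parameter $\delta = \frac{2}{G\sqrt{dT}}$. ∎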
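As a worked step added for the reader, the choice of the regularization parameter can be checked by minimizing the two $\delta$-dependent terms of the final inequality. The calculation uses only the constants $D$, $G$, $d$ and $T$ that appear in that bound:

$$
\frac{\mathrm{d}}{\mathrm{d}\delta} \left( \frac{2D}{\delta} + \frac{\delta d D G^2 T}{2} \right) = -\frac{2D}{\delta^2} + \frac{d D G^2 T}{2} = 0 \;\Longrightarrow\; \delta = \frac{2}{G\sqrt{dT}}, \qquad \frac{2D}{\delta} + \frac{\delta d D G^2 T}{2} = DG\sqrt{dT} + DG\sqrt{dT} = 2DG\sqrt{dT}.
$$

Both terms are equal at the minimizer, which is the usual balancing argument, and the resulting bound grows as $\sqrt{T}$.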

###### Remark 4.3.

It is essential to note how each property given in Assumptions 2.1 and 2.2 was used in the proof above. The convexity of the constraint set $\mathcal{K}$ allows the action $x_t$, as an expectation of points in the set $\mathcal{K}$, to be a permissible action as well. Given compactness of $\mathcal{K}$, we interchange the use of supremum and maximum of bounded expressions at various points throughout. The norm bound of the set is used in showing that $\mathcal{M}_{\mathcal{K}}$ is $D$-Lipschitz and in bounding several regularization terms. In terms of the loss functions, the convexity of $f_t$ allows us to use the Fenchel-Moreau theorem (continuity is implied by differentiability), while the gradient norm bound is simply used in the last stage of obtaining the regret bound.

## 5 Oracle Efficiency via Estimation

The results in section 4 suggest that Algorithm 3, known as expected FPL, possesses the features desired in this work (it is both online and projection-free) and obtains an optimal regret bound of $O(\sqrt{T})$. However, it is computationally intractable due to the expectation term in the definition of the action $x_t$. In this section, we remedy this issue and explore the scenario where the actions played during the online game are random estimators of the mean. In particular, we propose to simply take the empirical average of $m$ i.i.d. samples instead of the expectation itself, as described in Algorithm 1.

It is essential to note that Algorithm 1 makes $m$ calls to the linear optimization oracle per iteration, and the rest of the computation is negligible in comparison. The main theorem of this work, Theorem 3.1, indicates the performance of the algorithm in terms of expected regret for (i) general convex loss functions and (ii) smooth convex loss functions, respectively. Given the duality approach to analyzing online algorithms demonstrated in the previous section, the sampled FPL algorithm can now be analyzed to prove the bounds stated in Theorem 3.1. In particular, the following lemma demonstrates that each estimation from Algorithm 1 contributes to the regret in a disjoint fashion, i.e. there is no error propagation through time.

###### Lemma 5.1.

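Suppose Assumptions 2.1 and 2.2 hold and denote $\hat{x}_t = \mathbb{E}_{\xi_t}[\tilde{x}_t \mid \xi_{1:t-1}]$ for all $t = 1, \dots, T$, with $\tilde{x}_t$ and $\tilde{\nabla}_t = \nabla f_t(\tilde{x}_t)$ as defined in Algorithm 1. Then, the regret of the algorithm is bounded as follows

$$
\sum_{t=1}^{T} f_t(\tilde{x}_t) \le \min_{x \in \mathcal{K}} \left\{ \sum_{t=1}^{T} f_t(x) \right\} + R_T(\mathcal{A}_3) + \sum_{t=1}^{T} \langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \tag{5.1}
$$

###### Proof.

Follow the same proof structure as in the proof of Theorem 4.1 by considering (4.1), (4.2) as the primal and dual objectives. Consider $\lambda_t = \tilde{\nabla}_t$ and denote the incremental difference as $\Delta_t = \mathcal{D}(\tilde{\nabla}_1, \dots, \tilde{\nabla}_t, 0, \dots, 0) - \mathcal{D}(\tilde{\nabla}_1, \dots, \tilde{\nabla}_{t-1}, 0, \dots, 0)$. The main component of the proof is showing that $\Delta_t$ for each $t$ can be roughly seen as an upper bound on the loss suffered at iteration $t$.

First note that the played actions $\tilde{x}_t$ are not, in fact, unbiased estimators of the original $x_t$; instead denote the expectations by $\hat{x}_t = \mathbb{E}_{\xi_t}[\tilde{x}_t \mid \xi_{1:t-1}]$, where $\xi_t$ comprises the randomness used at iteration $t$. For all $t$, the quantity $\hat{x}_t$ is different from $x_t$ in that it uses the gradients at the points $\tilde{x}_1, \dots, \tilde{x}_{t-1}$ instead of $x_1, \dots, x_{t-1}$, and such a difference can potentially increase with $t$. In other words, one is defined as $x_t = \nabla h_\delta^*(-\nabla_{1:t-1})$ while the other is equal to $\hat{x}_t = \nabla h_\delta^*(-\tilde{\nabla}_{1:t-1})$. Hence, the action sequences of Algorithm 3 and Algorithm 1 can behave quite differently, and one cannot analyze the latter based on results about the former. However, the duality approach enables us to directly analyze the actions of Algorithm 1. In particular, lower bound the quantity $\Delta_t$ using the smoothness of $h_\delta^*$, as done in the proof of Theorem 4.1:

$$
\Delta_t \ge \langle \tilde{\nabla}_t, \hat{x}_t \rangle - \frac{\delta d D}{2} \|\tilde{\nabla}_t\|^2 - f_t^*(\tilde{\nabla}_t) + f_t^*(0) = f_t(\tilde{x}_t) - \frac{\delta d D}{2} \|\tilde{\nabla}_t\|^2 + f_t^*(0) + \langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \tag{5.2}
$$

The obtained inequality (5.2) resembles the analogous bound (4.3) in the unlimited computation case, with the extra term $\langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle$ that can be seen as accounting for the estimation error. This shows that at a given iteration the additional regret is suffered only at the expense of the current action choice $\tilde{x}_t$, instead of $\hat{x}_t$, regardless of the optimality of the previous choices $\tilde{x}_1, \dots, \tilde{x}_{t-1}$. We proceed with the proof by summing up $\Delta_t$ for all $t$ and using the following facts: by definition $\sum_{t=1}^{T} \Delta_t = \mathcal{D}(\tilde{\nabla}_1, \dots, \tilde{\nabla}_T) - \mathcal{D}(0, \dots, 0)$ and $\mathcal{D}(0, \dots, 0) = -h_\delta^*(0) - \sum_{t=1}^{T} f_t^*(0)$; as shown before, $h_\delta^*(0) \le D/\delta$ and $\max_{x \in \mathcal{K}} h_\delta(x) \le D/\delta$; according to Assumption 2.2, $\|\tilde{\nabla}_t\| \le G$ for all $t$. Combining all these properties and choosing the same optimal value of the regularization parameter $\delta$ concludes the stated bound (5.1). The use of all the assumptions is identical to that of Theorem 4.1 as described in Remark 4.3. ∎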

All that remains to reach the conclusions of Theorem 3.1 is to use Lemma 5.1 and handle the additional regret terms $\langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle$ for each $t$. The following claims about smooth functions and empirical averages of random vectors are necessary for the latter part.

###### Lemma 5.2.

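If $f$ is a $\beta$-smooth function, then for any $x, y$

$$
\langle \nabla f(y), x - y \rangle \le \langle \nabla f(x), x - y \rangle + \beta \|x - y\|^2 \tag{5.3}
$$

###### Lemma 5.3.

Let $Z_1, \dots, Z_m$ be i.i.d. samples of a bounded random variable $Z$, $\|Z\| \le D$, with mean $\bar{Z} = \mathbb{E}[Z]$. Denote $\bar{Z}_m = \frac{1}{m} \sum_{j=1}^{m} Z_j$, then

$$
\mathbb{E}_Z\left[ \|\bar{Z}_m - \bar{Z}\|^2 \right] \le \frac{4D^2}{m} \tag{5.4}
$$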
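The $1/m$ decay in Lemma 5.3 is easy to observe numerically. The following Monte Carlo sketch uses a one-dimensional variable that is uniform on $\{-D, +D\}$, so its mean is zero and its norm is exactly $D$; the distribution, the function name, and the trial counts are assumptions chosen only to illustrate the bound.

```python
import random

def mean_sq_error(m, trials, D=1.0, seed=0):
    """Monte Carlo estimate of E|average of m samples - true mean|^2
    for a variable uniform on {-D, +D}, whose true mean is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = sum(rng.choice((-D, D)) for _ in range(m)) / m
        total += avg * avg
    return total / trials

m = 10
estimate = mean_sq_error(m, trials=2000)
bound = 4 * 1.0 ** 2 / m   # right-hand side of (5.4) with D = 1
```

For this particular distribution the exact value of the expectation is $D^2/m$, so the estimate should sit well below the bound of the lemma, which holds for any bounded variable.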
###### Proof of Theorem 3.1.

First, note that according to Lemma 5.3, the bound $\mathbb{E}_{\xi_t}[\|\hat{x}_t - \tilde{x}_t\|^2 \mid \xi_{1:t-1}] \le \frac{4D^2}{m}$ holds for all $t$. In the case of general convex loss functions, use the Cauchy-Schwarz inequality along with the norm bound on the gradients and take expectation over the whole randomness in the algorithm in the reverse order to obtain for each $t$

$$
\mathbb{E}_{\xi_{1:T}}\left[ \langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \right] \le G \, \mathbb{E}_{\xi_{1:t}}\left[ \|\hat{x}_t - \tilde{x}_t\| \right] = G \, \mathbb{E}_{\xi_{1:t-1}}\left[ \mathbb{E}_{\xi_t}\left[ \|\hat{x}_t - \tilde{x}_t\| \mid \xi_{1:t-1} \right] \right] \le \frac{2DG}{\sqrt{m}} \tag{5.5}
$$

Ordering the randomness of the iterations in reverse and taking the expectation conditional on $\xi_{1:t-1}$ is necessary in order to use Lemma 5.3, since $\hat{x}_t$ is a deterministic quantity over $\xi_t$ only when conditioned on the previous randomness $\xi_{1:t-1}$. Finally, taking expectation over $\xi_{1:T}$ on the bound in (5.1) and using (5.5) for all $t$ concludes the expected regret bound of Algorithm 1 given in detail in (3.1) for general convex loss functions. It is worth mentioning that this result did not require any assumptions on how the loss function at each iteration is chosen by the adversary: in particular, the result holds for the strongest adaptive adversarial setting, where the adversary can pick $f_t$ having knowledge of the previous actions by the player, i.e. the randomness $\xi_{1:t-1}$. This is true due to the fact that all the terms containing the function $f_t$ explicitly, e.g. $\tilde{\nabla}_t$, are separated and bounded on their own.

The case of smooth convex loss functions requires a more nuanced approach in order to achieve an improvement on the general result. The key is to replace the gradient $\tilde{\nabla}_t$ at the point $\tilde{x}_t$ with a quantity that does not depend on $\xi_t$ and leverage the fact that $\tilde{x}_t$ is an unbiased estimator of $\hat{x}_t$. More formally, given that $f_t$ is a $\beta$-smooth function, use Lemma 5.2 to get $\langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \le \langle \hat{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle + \beta \|\hat{x}_t - \tilde{x}_t\|^2$, where $\hat{\nabla}_t = \nabla f_t(\hat{x}_t)$ is denoted accordingly. The quantities $\hat{x}_t$ and $f_t$ are both (potentially) dependent on previous randomness $\xi_{1:t-1}$ but are deterministic with respect to $\xi_t$ when conditioned on $\xi_{1:t-1}$, hence so is $\hat{\nabla}_t$. Thus, it holds that $\mathbb{E}_{\xi_t}[\langle \hat{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \mid \xi_{1:t-1}] = 0$ for all $t$. This fact results in the additional regret having a quadratic dependence on the estimation error instead of a linear one as before:

$$
\mathbb{E}_{\xi_{1:T}}\left[ \langle \tilde{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle \right] \le \mathbb{E}_{\xi_{1:t-1}}\left[ \mathbb{E}_{\xi_t}\left[ \langle \hat{\nabla}_t, \hat{x}_t - \tilde{x}_t \rangle + \beta \|\hat{x}_t - \tilde{x}_t\|^2 \mid \xi_{1:t-1} \right] \right] \le \frac{4\beta D^2}{m} \tag{5.6}
$$

Use the result obtained in (5.6) for all $t$ in order to bound the additional regret term in (5.1) and conclude the expected regret bound of Algorithm 1 given in detail in (3.2) for smooth convex loss functions. Since the adversary is allowed to pick a loss function $f_t$ that depends on the previous randomness $\xi_{1:t-1}$, this regret bound again holds in the strongest adaptive adversarial setting. ∎

### 5.1 Reduction to OSPF

The results given in Theorem 3.1 indicate optimal $O(\sqrt{T})$ regret bounds for both convex and smooth convex loss functions when taking $m = T$ and $m = \sqrt{T}$ respectively, as suggested by Remark 3.2. However, $m$ is not simply a parameter of the algorithm: it indicates the number of linear optimizations per iteration, so in $T$ iterations the regret is achieved with an overall linear optimization complexity of $mT$. To avoid such convoluted claims, we instead provide a reduction of Algorithm 1, named OSPF in the smooth case, to the setting of one linear optimization per iteration that gives $O(T^{2/3})$ and $O(T^{3/4})$ expected regret for smooth and general convex losses, respectively.

###### Proof of Corollary 3.3.

The reduction follows a simple blocking technique, i.e. grouping several rounds of the game into one as detailed in Algorithm 2. Consider the online optimization setting with loss functions $f_1, \dots, f_T$ chosen by the adversary after the player plays the actions $x_1, \dots, x_T$ using only one linear optimization per iteration. Let $T = nk$, where $n, k$ are assumed to be integers for simplicity, and denote

$$
f'_i = \sum_{t=(i-1) \cdot k + 1}^{i \cdot k} f_t, \quad \forall i = 1, \dots, n \tag{5.7}
$$

Since each $f'_i$ contains $k$ losses from the original problem, the player is allowed $k$ linear optimizations to handle a single loss $f'_i$. Hence, use Algorithm 1 for $n$ iterations with $m = k$ samples at each iteration to get actions $\tilde{x}'_1, \dots, \tilde{x}'_n$ and play $x_t = \tilde{x}'_i$ for all $t = (i-1)k + 1, \dots, ik$ in the original setting; call this algorithm $\mathcal{A}'_1$. The corresponding constants of the constructed game are $D' = D$ and $G' = G \cdot k$, since the constraint set remains unchanged and a loss function $f'_i$ combines $k$ original losses. Thus, the expected regret bound of $\mathcal{A}'_1$ for general convex functions, according to Theorem 3.1, is given by

$$
\mathbb{E}[R_T(\mathcal{A}'_1)] \le \frac{2D}{\delta} + \frac{\delta D (G \cdot k)^2 \cdot dn}{2} + \frac{2 (G \cdot k) D n}{\sqrt{k}} = 2DGk\sqrt{dn} + 2DGn\sqrt{k} \tag{5.8}
$$

with the parameter choice of $\delta = \frac{2}{Gk\sqrt{dn}}$. Letting $k = n = \sqrt{T}$ yields the expected regret bound $O(T^{3/4})$ for general convex functions for the algorithm that uses one linear optimization per iteration. The case of smooth convex functions is handled analogously. Note that the algorithm $\mathcal{A}'_1$ is equivalent to $\mathcal{A}_{OSPF}$ given in Algorithm 2. A sum of $k$ functions that are $\beta$-smooth is $\beta k$-smooth. Hence, the expected regret bound of $\mathcal{A}_{OSPF}$ for smooth convex functions is given by

$$
\mathbb{E}[R_T(\mathcal{A}_{OSPF})] \le \frac{2D}{\delta} + \frac{\delta D (G \cdot k)^2 \cdot dn}{2} + \frac{4 (\beta \cdot k) D^2 n}{k} = 2DGk\sqrt{dn} + 4\beta D^2 n \tag{5.9}
$$

with the same choice of $\delta$. In this case let $k = T^{1/3}$ and $n = T^{2/3}$ to attain the expected regret bound $O(T^{2/3})$ for $\mathcal{A}_{OSPF}$ with one linear optimization per iteration. ∎
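The blocking step in (5.7) is mechanical and can be sketched in a few lines. The helper names below are hypothetical, rounds and blocks are 1-indexed to match the text, and the example losses are simple linear functions chosen only so that the block sums are easy to verify by hand.

```python
def block_index(t, k):
    """Index i of the block that contains round t (rounds and blocks 1-indexed)."""
    return (t - 1) // k + 1

def blocked_losses(losses, k):
    """Group consecutive losses into block losses, each the sum of k original
    losses, following (5.7). The horizon is assumed to be a multiple of k."""
    assert len(losses) % k == 0, "T is assumed to be a multiple of k"
    blocks = []
    for start in range(0, len(losses), k):
        chunk = losses[start:start + k]
        blocks.append(lambda x, chunk=chunk: sum(f(x) for f in chunk))
    return blocks

# Six linear losses f_t(x) = c_t * x grouped into n = 2 blocks of length k = 3.
coeffs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
losses = [lambda x, c=c: c * x for c in coeffs]
blocks = blocked_losses(losses, k=3)
```

The player then holds one action fixed for all rounds of a block, spending the block's `k` oracle calls on the samples for the next block action; that bookkeeping is what turns `m = k` calls per block into one call per original round.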

### 5.2 High Probability Bounds

The theoretical guarantees for the main algorithm of this paper, Algorithm 1, have all been in terms of expected regret as the performance metric. Even though expected regret is a widely accepted metric for online randomized algorithms, one might wonder whether the expectation bound holds only due to a balance of large and small chunks of regret or the given result actually holds most of the time. To answer this question, we provide bounds on asymptotically equivalent (up to logarithmic factors) to the statements from Theorem 3.1 that hold with high probability over the randomness in : these results transfer analogously to the reduction from section 5.1. In particular, the following theorem shows that Algorithm 1 obtains regret of for general convex loss functions and for smooth convex loss functions both holding with high probability.

###### Theorem 5.4.

Given that the Assumptions 2.1 and 2.2 hold, the regret of Algorithm 1 for general convex loss functions is w.p. for any bounded by

 T∑t=1ft(~xt) ≤ minx∈K{T∑t=1ft(x)}+RT(A3)+2GDT√m⋅√log2T/σ (5.10)

If the convex loss functions are also $\beta$-smooth, then the regret is, with probability at least $1 - \sigma$ for any $\sigma > 0$, bounded by

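$$\sum_{t=1}^{T} f_t(\tilde{x}_t) \le \min_{x \in \mathcal{K}} \left\{ \sum_{t=1}^{T} f_t(x) \right\} + \mathcal{R}_T(\mathcal{A}_3) + 2 G D \sqrt{2 T \log \frac{4}{\sigma}} + \frac{8 \beta D^2 T}{m} \cdot \log \frac{4T}{\sigma} \tag{5.11}$$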
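To get a feel for the size of the additional high-probability terms, the sketch below evaluates the deviation terms of (5.10) and (5.11), read here as $\frac{2GDT}{\sqrt{m}} \sqrt{\log(2T/\sigma)}$ and $2GD\sqrt{2T\log(4/\sigma)} + \frac{8\beta D^2 T}{m}\log(4T/\sigma)$. The function names, the unit constants, and the sample sizes $m = T$ and $m = \sqrt{T}$ used in the example are illustrative assumptions, not choices confirmed by the theorem statement.

```python
import math


def deviation_general(T, m, sigma, G=1.0, D=1.0):
    """Deviation term of (5.10): (2*G*D*T / sqrt(m)) * sqrt(log(2*T / sigma))."""
    return 2 * G * D * T / math.sqrt(m) * math.sqrt(math.log(2 * T / sigma))


def deviation_smooth(T, m, sigma, G=1.0, D=1.0, beta=1.0):
    """Deviation terms of (5.11): a sqrt(T log(1/sigma)) part plus a (T/m) log(T/sigma) part."""
    martingale_part = 2 * G * D * math.sqrt(2 * T * math.log(4 / sigma))
    sampling_part = 8 * beta * D ** 2 * T / m * math.log(4 * T / sigma)
    return martingale_part + sampling_part


if __name__ == "__main__":
    sigma = 0.01  # assumed failure probability
    for T in (1e4, 1e6, 1e8):
        # Normalizing by sqrt(T) isolates the logarithmic factors.
        general = deviation_general(T, m=T, sigma=sigma) / math.sqrt(T)
        smooth = deviation_smooth(T, m=math.sqrt(T), sigma=sigma) / math.sqrt(T)
        print(f"T={T:.0e}  general/sqrt(T)={general:.2f}  smooth/sqrt(T)={smooth:.2f}")
```

With these sample sizes, both normalized quantities grow only logarithmically in $T$, which is the sense in which the high-probability bounds match the expectation bounds up to logarithmic factors.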

## 6 Discussion

We have presented an efficient projection-free method for online convex optimization with smooth functions that makes only a single linear optimization computation per iteration and achieves regret $O(T^{2/3})$, improving upon the previous bound of $O(T^{3/4})$.

Certain algorithms in the literature make more than one linear optimization computation per iteration. To make the comparison with other methods precise, we need a more refined computational metric. For an online projection-free optimization algorithm, define its complexity as the overall number of gradient oracle calls and linear optimization calls made until the average regret becomes at most $\varepsilon$.

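In these terms, we have shown an algorithm whose complexity scales as $\varepsilon^{-3}$ in the target average regret $\varepsilon$, as compared to $\varepsilon^{-4}$, which is the previous best.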
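The conversion from a regret rate to this complexity measure can be worked out as follows. This is a sketch under stated assumptions: the regret bound has the form $c\,T^{\alpha}$ with a constant $c$ that hides the problem parameters, each iteration makes a constant number of oracle calls, and the symbol $\varepsilon$ for the target average regret is notation chosen here for illustration.

```latex
\frac{c\,T^{\alpha}}{T} \le \varepsilon
\iff
T \ge \left(\frac{c}{\varepsilon}\right)^{\frac{1}{1-\alpha}},
\qquad
\alpha = \tfrac{2}{3} \;\Rightarrow\; T \ge \left(\frac{c}{\varepsilon}\right)^{3},
\qquad
\alpha = \tfrac{3}{4} \;\Rightarrow\; T \ge \left(\frac{c}{\varepsilon}\right)^{4}.
```

Any dependence on the dimension or other constants enters only through $c$, so it changes the leading factor but not the exponent of $1/\varepsilon$.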

It thus remains open to obtain a -complexity algorithm for general convex sets that does not depend on the dimension, or show that this is impossible. It is also unknown at this time if these dependencies on , in both the smooth and non-smooth case, are tight.

## References

• Z. Allen-Zhu, E. Hazan, W. Hu, and Y. Li (2017) Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls. In Advances in Neural Information Processing Systems, pp. 6191–6200.
• A. Argyriou, M. Signoretto, and J. Suykens (2014) Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. arXiv preprint arXiv:1404.3591.
• M. Frank and P. Wolfe (1956) An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1-2), pp. 95–110.
• D. Garber and E. Hazan (2013) A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666.
• D. Garber (2016) Faster projection-free convex optimization over the spectrahedron. In Advances in Neural Information Processing Systems, pp. 874–882.
• E. Hazan and S. Kale (2012) Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning.
• E. Hazan and H. Luo (2016) Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pp. 1263–1271.
• E. Hazan (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325.
• E. Hazan (2008) Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pp. 306–316.
• M. Jaggi (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435.
• A. Kalai and S. Vempala (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307.
• G. Lan and Y. Zhou (2016) Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26 (2), pp. 1379–1409.
• G. Lan (2013) The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550.
• K. Levy and A. Krause (2019) Projection free online learning over smooth sets. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 1458–1466.
• N. Merhav, E. Ordentlich, G. Seroussi, and M. J. Weinberger (2002) On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory 48 (7), pp. 1947–1958.
• F. Pierucci, Z. Harchaoui, and J. Malick (2014) A smoothing approach for composite conditional gradient with nonsmooth loss. Research Report RR-8662, INRIA Grenoble.
• I. Pinelis (1994) Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability 22 (4), pp. 1679–1706.
• S. Shalev-Shwartz and Y. Singer (2007) A primal-dual perspective of online learning algorithms. Machine Learning 69 (2-3), pp. 115–142.
• J. Xie, Z. Shen, C. Zhang, B. Wang, and H. Qian (2019) Efficient projection-free online methods with stochastic recursive gradient. arXiv preprint arXiv:1910.09396.
• A. Yurtsever, S. Sra, and V. Cevher (2019) Conditional gradient methods via stochastic path-integrated differential estimator. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7282–7291.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936.

## Appendix A Proof of the high probability bounds

This section focuses on regret bounds for Algorithm 1 that hold with high probability. We use the following Azuma-type concentration inequality for vector-valued martingales, derived by applying the work of Pinelis [1994] to the Euclidean space $\mathbb{R}^d$.

###### Proposition A.1 (Theorem 3.5 in Pinelis [1994]).

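Let $\nu_1, \dots, \nu_K$ be a vector-valued martingale difference sequence with respect to a filtration such that for all $k$ it holds that the conditional expectation of $\nu_k$ given the past is zero and $\|\nu_k\| \le c_k$ for some $c_k > 0$. Then for any $\lambda > 0$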

 P[∥∥ ∥∥K∑k=1νk∥∥ ∥∥>λ] ≤ 2exp(−λ22∑Kk=1c