# Online Improper Learning with an Approximation Oracle

We revisit the question of reducing online learning to approximate optimization of the offline problem. In this setting, we give two algorithms with near-optimal performance in the full information setting: they guarantee optimal regret and require only poly-logarithmically many calls to the approximation oracle per iteration. Furthermore, these algorithms apply to the more general improper learning problems. In the bandit setting, our algorithm also significantly improves the best previously known oracle complexity while maintaining the same regret.

## 1 Introduction

One of the most fundamental and well-studied questions in learning theory is whether one can learn a given problem using an optimization oracle. For online learning in games, it was shown by Kalai and Vempala (2005) that an optimization oracle giving the best decision in hindsight is sufficient for attaining optimal regret.

However, in many non-convex settings, such an optimization oracle is either unavailable or NP-hard to compute. In contrast, in many such cases, efficient approximation algorithms are usually known, and are guaranteed to return a solution within a certain multiplicative factor of the optimum. These include not only combinatorial optimization problems such as Max Cut, Weighted Set Cover, Metric Traveling Salesman Problem, Set Packing, etc., but also machine learning problems such as Low Rank Matrix Completion.

Kakade et al. (2009) considered the question of whether an approximation algorithm is sufficient to obtain vanishing regret compared with an approximation to the best solution in hindsight. They gave an algorithm for this offline-to-online conversion. However, their reduction is inefficient in the number of per-iteration queries to the approximation oracle, which grows linearly with time. Ideally, an efficient reduction should call the oracle only a constant number of times per iteration and guarantee optimal regret at the same time, and this was considered an open question in the literature.

Various authors have improved upon this original offline-to-online reduction under certain cases, as we survey below. Recently, Garber (2017) has made significant progress by giving a more efficient reduction, which improves the number of oracle calls both in the full information and the bandit settings. He explicitly asked whether a near-optimal reduction with only logarithmically many calls per iteration exists.

### 1.1 Our Results

In this paper we resolve this question on the positive side, and in a more general setting. We give two different algorithms in the full information setting, one based on the online mirror descent (OMD) method and another based on the continuous multiplicative weight update (CMWU) algorithm, which give optimal regret and are oracle-efficient. Furthermore, our algorithms apply to more general loss vectors. Our results are summarized in the table below.

In addition to these two algorithms, we give an improved bandit algorithm based on OMD: it attains the same regret as in (Kakade et al., 2009; Garber, 2017) with a lower computational cost: our method requires oracle calls over all the game iterations, as opposed to in the previous best method.

Besides the improved oracle complexity, our methods have the following additional advantages:

• While the algorithm in (Garber, 2017) requires non-negative loss vectors, our second algorithm, based on CMWU, can work with general loss vectors. Furthermore, our OMD-based algorithm can also work with loss vectors from any convex cone satisfying the pairwise non-negative inner product (PNIP) property defined in Definition 4.1 (together with an appropriately chosen regularizer), which is more general than the non-negative orthant.

• Our methods apply to a general online improper learning setting, in which the predictions can be from a potentially different set from the target set to compete against. Previous work considered this different set to be a constant multiple of the target set, which makes sense primarily for combinatorial optimization problems.

However, in many interesting problems, such as Low Rank Matrix Completion, the natural approximation algorithm returns a matrix of higher rank. This is not in a constant multiple of the set of all low rank matrices, and our additional generality allows us to obtain meaningful results even for this case.

• Our first algorithm is based on the general OMD methodology, and thus allows any strongly convex regularizer. This can give better regret bounds, in terms of the space geometry, compared with the previous algorithm of (Garber, 2017) that is based on online gradient descent and Euclidean regularization. The improvement in regret bounds can be as large as the dimension.

• Our bandit algorithm is based on OMD with a new regularizer that is inspired by the construction of barycentric spanners, and may be of independent interest.

### 1.2 Our Techniques

The more general of our two algorithms is based on a completely different methodology compared with previous online-to-offline reductions. It is a variant of the continuous multiplicative weight update (CMWU) algorithm, also known as the continuous hedge algorithm. Our idea is to apply CMWU over a superset of the target set, and in every iteration the algorithm tries to play the mean of a log-linear distribution. To check feasibility of this mean, we show how to design a separation-or-decomposition oracle, which either certifies that the mean is infeasible (in which case it provides a separating hyperplane between the mean and the target set, and thus a more refined superset of the target set), or provides a distribution over feasible points whose average is superior to the mean in terms of the regret. Using this approach, the more oracle calls the algorithm makes, the tighter the superset it obtains, and we show an interesting trade-off between the oracle complexity and the regret bound.

The other algorithm follows the line of Garber (2017). We show how to significantly speed up Garber’s infeasible projection oracle, and to generalize Garber’s algorithm from online gradient descent (OGD) to online mirror descent (OMD).

This additional generality is crucial in our bandit algorithm, where we make use of a novel regularizer in OMD, called the barycentric regularizer, in order to have a low-variance unbiased estimator of the loss vector. This geometric regularization may be of independent interest.

### 1.3 Related Work

The reduction from online learning to offline approximation algorithms was already considered by Kalai and Vempala (2005). Their scheme, based on the follow-the-perturbed-leader (FTPL) algorithm, requires a very strong guarantee from the approximation oracle, namely a fully polynomial time approximation scheme (FPTAS), and requires an approximation that improves with time. Balcan and Blum (2006) used the same approach in the context of mechanism design.

Kalai and Vempala (2005) also proposed a specialized reduction that works under certain conditions on the approximation oracle, satisfied by some known algorithms for problems such as MAX-CUT. Fujita et al. (2013) further gave more general reductions that apply to problems whose approximation algorithms are based on convex relaxations of mathematical programs. Their scheme is also based on the FTPL method.

Recent advances in black-box online-to-offline reductions were made in (Kakade et al., 2009; Dudík et al., 2016; Garber, 2017). Hazan and Koren (2016) showed that efficient reductions are in general impossible, unless special structure is present. In the settings we consider, this special structure is a linear cost function over the space.

Our algorithms fall into one of two templates. The first is the online mirror descent method, which is an adaptive version of the follow-the-regularized-leader (FTRL) algorithm. The second is the continuous multiplicative weight update method, which dates back to Cover’s portfolio selection method (Cover, 1991) and Vovk’s aggregating algorithm (Vovk, 1990). The reader is referred to the books (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2012; Hazan, 2016) for details and background on these prediction frameworks. We also make use of polynomial-time algorithms for sampling from log-concave distributions (Lovász and Vempala, 2007).

## 2 Preliminaries

We use to denote the Euclidean norm of a vector . For and , denote by the Euclidean ball in of radius centered at , i.e., . For , , and , define , , , and . The convex hull of is denoted by . Denote by the volume (Lebesgue measure) of a set . Denote by

the probability simplex in

, i.e., .

A set is called a cone if for any we have . For any , define the dual cone of as . is always a convex cone, even when is neither convex nor a cone.

For any closed set , define to be the projection onto , namely The well-known Pythagorean theorem characterizes an important property of projections onto convex sets:

###### Lemma 2.1 (Pythagorean theorem).

For any closed convex set , and , we have , or equivalently, .
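The Pythagorean property can be checked numerically. The sketch below tests it in its standard form, $(y - \Pi_{\mathcal{K}}(y))^\top (x - \Pi_{\mathcal{K}}(y)) \le 0$ and $\|x - y\|^2 \ge \|x - \Pi_{\mathcal{K}}(y)\|^2 + \|\Pi_{\mathcal{K}}(y) - y\|^2$ for $x \in \mathcal{K}$. Taking $\mathcal{K}$ to be the Euclidean unit ball, and the helper names `project_onto_ball` and `check_pythagorean`, are assumptions made purely for illustration.

```python
import math
import random


def project_onto_ball(y, radius=1.0):
    """Euclidean projection of y onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(c * c for c in y))
    if norm <= radius:
        return list(y)
    return [c * radius / norm for c in y]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def check_pythagorean(trials=1000, dim=3, seed=0):
    """Return True if both forms of the inequality hold on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        y = [rng.uniform(-3, 3) for _ in range(dim)]  # arbitrary point
        # x is a point of K (illustrative choice: a projected random point)
        x = project_onto_ball([rng.uniform(-1, 1) for _ in range(dim)])
        p = project_onto_ball(y)  # projection of y onto K
        inner = sum((yi - pi) * (xi - pi) for yi, pi, xi in zip(y, p, x))
        if inner > 1e-9:
            return False
        if sq_dist(x, y) + 1e-9 < sq_dist(x, p) + sq_dist(p, y):
            return False
    return True
```

Calling `check_pythagorean()` runs the check on random points; it is a sanity test on one convex set, not a proof.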

###### Definition 2.2.

A function () is Legendre if

• is convex;

• is strictly convex with continuous gradient defined over ’s interior ;

• for any sequence converging to a boundary point of , .

###### Definition 2.3.

For a Legendre function , the Bregman divergence with respect to is defined as ().
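As a small illustration, the following sketch computes a Bregman divergence assuming the standard definition $D_\varphi(x, y) = \varphi(x) - \varphi(y) - \nabla\varphi(y)^\top (x - y)$. The two regularizers (squared Euclidean norm and negative entropy) and all function names are illustrative choices, not notation taken from the paper.

```python
import math


def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))


def half_sq_norm(x):
    """phi(x) = 0.5 * ||x||^2, whose divergence is 0.5 * ||x - y||^2."""
    return 0.5 * sum(c * c for c in x)


def half_sq_norm_grad(x):
    return list(x)


def neg_entropy(x):
    """phi(x) = sum_i (x_i log x_i - x_i), defined for positive x."""
    return sum(c * math.log(c) - c for c in x)


def neg_entropy_grad(x):
    return [math.log(c) for c in x]
```

With `half_sq_norm` the divergence reduces to half the squared Euclidean distance, and with `neg_entropy` it reduces to the generalized KL divergence $\sum_i x_i \log(x_i / y_i) - x_i + y_i$.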

The Pythagorean theorem can be generalized to projections with respect to a Bregman divergence (see e.g. Lemma 11.3 in (Cesa-Bianchi and Lugosi, 2006)):

###### Lemma 2.4 (Generalized Pythagorean theorem).

For any closed convex set , , , and any Legendre function , letting , we must have .

#### Log-concave distributions.

A distribution over with a density function is said to be log-concave if is a concave function. For a convex set equipped with a membership oracle, there exist polynomial-time algorithms for sampling from any log-concave distribution over (Lovász and Vempala, 2007). This can be used to approximately compute the mean of any log-concave distribution.
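To make the mean-computation step concrete, here is a deliberately naive sketch that estimates the mean of a log-concave density supported on a box. It uses self-normalized importance sampling with a uniform proposal, which is not the polynomial-time sampler of Lovász and Vempala (2007) and scales poorly with dimension. The function name `estimate_mean` and the box-shaped support are assumptions for illustration only.

```python
import math
import random


def estimate_mean(log_density, lower, upper, n_samples=100_000, seed=0):
    """Estimate the mean of the density proportional to exp(log_density(x))
    on the box [lower, upper], using uniform samples reweighted by the density."""
    rng = random.Random(seed)
    dim = len(lower)
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        w = math.exp(log_density(x))  # unnormalized density value
        total_weight += w
        for j in range(dim):
            weighted_sum[j] += w * x[j]
    return [s / total_weight for s in weighted_sum]
```

For the one-dimensional log-linear density proportional to $e^{-a x}$ on $[0, 1]$, the exact mean is $1/a - e^{-a}/(1 - e^{-a})$, which gives a convenient reference value for the estimate.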

We have the following classical result which says that every half-space close enough to the mean of a log-concave distribution must contain at least constant probability mass. For simplicity, we only state and prove the result for isotropic (i.e., identity covariance) log-concave distributions, but the result can be easily generalized to allow arbitrary covariance.

###### Lemma 2.5.

Consider any isotropic (identity covariance) log-concave distribution over with mean . Then for any half-space such that , we have .

The proof of Lemma 2.5 is given in Appendix A. As an implication, we have the following lemma regarding mean computation of a log-concave distribution, which is useful in this paper.

###### Lemma 2.6.

For any log-concave distribution in with mean , whose support is in (), and any and , it is possible to compute a point in time such that with probability at least we have:

1. ;

2. for any half space containing , .

For our purpose in this paper, it always suffices to choose and ( being the total number of rounds) without hurting our regret bounds. Therefore, for ease of presentation, we will assume that we can compute the mean of bounded-supported log-concave distributions exactly.

## 3 Online Improper Linear Optimization with an Improper Optimization Oracle

Now we describe the problem setting we consider in this paper. Let () be two compact subsets of , and let be a convex cone. Suppose we have an improper linear optimization oracle , which given an input can output a point such that

$$v^\top \mathcal{O}_{\mathcal{K},\mathcal{K}^*}(v) \le \min_{x^* \in \mathcal{K}^*} v^\top x^*.$$

In other words, it performs linear optimization over but is allowed to output a point from a (possibly different) set . Note that this implicitly requires that “dominates” in all directions in , that is, for all we must have .

#### Online improper linear optimization.

Consider a repeated game with rounds. In round , the player chooses a point while an adversary chooses a loss vector (), and then the player incurs a loss . The goal for the player is to have a cumulative loss that is comparable to that of the best single decision in hindsight.

We assume that the player only has access to the optimization oracle . Therefore, it is only fair to compare with the best decision in in hindsight. The (improper) regret over rounds is defined as

$$\mathrm{Reg}_{\mathcal{K},\mathcal{K}^*}(T) := \sum_{t=1}^{T} f_t^\top x_t - \min_{x^* \in \mathcal{K}^*} \sum_{t=1}^{T} f_t^\top x^*.$$

We sometimes treat as a function on , i.e., .
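The regret definition can be evaluated directly when the comparator set is finite. The sketch below does that, assuming losses and plays are given as lists of vectors. The function name `improper_regret` and the finite list `comparators` (standing in for $\mathcal{K}^*$) are hypothetical. For a compact set, the inner minimum would instead be computed by a linear optimization routine.

```python
def improper_regret(losses, plays, comparators):
    """Cumulative loss of the played points minus the cumulative loss of the
    best fixed comparator in hindsight (comparators is a finite stand-in for K*)."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    player_loss = sum(dot(f, x) for f, x in zip(losses, plays))
    best_fixed = min(sum(dot(f, c) for f in losses) for c in comparators)
    return player_loss - best_fixed
```

Note that the plays are not required to belong to `comparators`, which mirrors the improper setting where the player predicts from a possibly different set than the one she competes against.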

#### Full information and bandit settings.

We consider both full information and bandit settings. In the full information setting, after the player makes her choice in round , the entire loss vector is revealed to the player; in the bandit setting, only the loss value is revealed to the player.

#### α-regret minimization with an approximation oracle.

The problem of online linear optimization with an approximation oracle considered by Kakade et al. (2009) and Garber (2017) is a special instance in our online improper linear optimization framework. In this problem, the player has access to an approximate linear optimization oracle over (), which given a direction as input can output a point such that

$$v^\top \mathcal{O}^{\alpha}_{\mathcal{K}}(v) \le \alpha \cdot \min_{x \in \mathcal{K}} v^\top x.$$

In this setting we will consider and ; many combinatorial optimization problems with efficient approximation algorithms fall into this regime. The goal in the online problem is therefore to minimize the -regret, defined as

$$\mathrm{Reg}^{\alpha}_{\mathcal{K}}(T) := \sum_{t=1}^{T} f_t^\top x_t - \alpha \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t^\top x.$$

To see why this is a special case of online improper linear optimization, note that we can take and then the approximation oracle is equivalent to and the -regret is equal to the improper regret .
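The reduction can be spelled out in one line. The derivation below assumes that the comparator set is taken to be the scaled set $\mathcal{K}^* = \alpha\mathcal{K} = \{\alpha x : x \in \mathcal{K}\}$, which is our reading of the "constant multiple of the target set" described in the introduction.

```latex
% Assumption: K^* = \alpha K = \{ \alpha x : x \in K \}.
\begin{align*}
\min_{x^* \in \alpha\mathcal{K}} v^\top x^*
  &= \min_{x \in \mathcal{K}} v^\top (\alpha x)
   = \alpha \min_{x \in \mathcal{K}} v^\top x, \\
\text{hence}\quad
v^\top \mathcal{O}^{\alpha}_{\mathcal{K}}(v) \le \alpha \min_{x \in \mathcal{K}} v^\top x
  &\iff
v^\top \mathcal{O}^{\alpha}_{\mathcal{K}}(v) \le \min_{x^* \in \alpha\mathcal{K}} v^\top x^*, \\
\mathrm{Reg}^{\alpha}_{\mathcal{K}}(T)
  &= \sum_{t=1}^{T} f_t^\top x_t - \min_{x^* \in \alpha\mathcal{K}} \sum_{t=1}^{T} f_t^\top x^*
   = \mathrm{Reg}_{\mathcal{K},\alpha\mathcal{K}}(T).
\end{align*}
```

The first line only uses linearity of $x \mapsto v^\top x$ and $\alpha > 0$; the other two lines substitute it into the oracle guarantee and the regret definition.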

## 4 Efficient Online Improper Linear Optimization via Online Mirror Descent

In this section, we give an efficient online improper linear optimization algorithm (in the full information setting) based on online mirror descent (OMD) equipped with a strongly convex regularizer $\varphi$, which achieves $O(\sqrt{T})$ regret when the regularizer $\varphi$ and the domain $W$ of linear loss functions satisfy the pairwise non-negative inner product (PNIP) property (Definition 4.1). This property holds for many interesting domains with appropriately chosen regularizers. Notable examples include the non-negative orthant, the positive semidefinite matrix cone, and the Lorentz cone.

###### Definition 4.1 (Pairwise non-negative inner product).

For a twice-differentiable Legendre function with domain and a convex cone , we say satisfies the pairwise non-negative inner product (PNIP) property, if for all and , where , it holds that .

#### Examples.

satisfies the PNIP property if:

• (with domain ) and ;

• (with domain ) and ;

• (with domain ), where ,

is an invertible matrix, and

. This is useful in our bandit algorithm in Section 6.

### 4.1 Online Mirror Descent with a Projection-and-Decomposition Oracle

We first show that, assuming the availability of a projection-and-decomposition (PAD) oracle (Definition 4.2), we can implement a variant of the OMD algorithm that achieves optimal regret. In Section 4.2, we show how to construct a PAD oracle using the oracle $\mathcal{O}_{\mathcal{K},\mathcal{K}^*}$. In Section 4.3, we bound the number of oracle calls to $\mathcal{O}_{\mathcal{K},\mathcal{K}^*}$ in our algorithm.

###### Definition 4.2 (Projection-and-decomposition oracle).

A projection-and-decomposition (PAD) oracle onto , , is defined as a procedure that given , , a convex cone and a Legendre function produces a tuple , where , and , such that:

1. is “closer” to than with respect to the Bregman divergence of (and hence is an “infeasible projection”): ;

2. , and is a point that “almost dominates” in all directions in . In other words, there exists such that .

The purpose of the PAD oracle is the following. Suppose the OMD algorithm tells us to play a point . Since might not be in the feasible set , we can call the PAD oracle to find another point as well as a distribution over points . The first property in Definition 4.2 is sufficient to ensure that playing also gives low regret, and the second property further ensures that we have a distribution of points in that suffers less loss than for every possible loss function so we can play according to that distribution.

Using the PAD oracle, we can apply OMD as in Algorithm 1. Theorem 4.3 gives its regret bound.
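The interaction between OMD and a PAD oracle can be sketched as follows. This is an illustrative loop, not a transcription of Algorithm 1: it assumes the Euclidean regularizer $\varphi(x) = \frac{1}{2}\|x\|^2$, so that the mirror step $\nabla\varphi(y_{t+1}) = \nabla\varphi(x_t) - \eta f_t$ becomes $y_{t+1} = x_t - \eta f_t$. It also uses a trivial stand-in oracle for the proper case where both sets are the same Euclidean ball. The names `exact_ball_pad` and `omd_with_pad` are hypothetical.

```python
import math
import random


def exact_ball_pad(y, radius=1.0):
    """Stand-in PAD oracle when both sets equal a Euclidean ball: the exact
    projection satisfies the first property, and the one-point distribution
    on the projection satisfies the second property with zero error."""
    norm = math.sqrt(sum(c * c for c in y))
    x = list(y) if norm <= radius else [c * radius / norm for c in y]
    return x, [x], [1.0]


def omd_with_pad(losses, pad_oracle, eta, seed=0):
    """Run the OMD loop with phi(x) = 0.5 * ||x||^2 and return the played points."""
    rng = random.Random(seed)
    dim = len(losses[0])
    y = [0.0] * dim  # y_1 minimizes phi
    played = []
    for f in losses:
        # Replace the (possibly infeasible) OMD point by x_t and a
        # distribution (points, probs) over feasible points.
        x, points, probs = pad_oracle(y)
        # Play a feasible point sampled from the decomposition.
        played.append(rng.choices(points, weights=probs, k=1)[0])
        # Mirror step taken from x_t, not from the sampled point.
        y = [xi - eta * fi for xi, fi in zip(x, f)]
    return played
```

In this degenerate instance the loop is ordinary projected online gradient descent. A real PAD oracle would return an infeasible projection together with several oracle outputs and their mixing weights, while the loop itself would stay the same.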

###### Theorem 4.3.

Suppose satisfies the PNIP property (Definition 4.1). Then for any , Algorithm 1 satisfies the following regret guarantee:

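$$\forall x^* \in \mathcal{K}^*: \quad \mathbb{E}\left[\sum_{t=1}^{T} \big(f_t(\tilde{x}_t) - f_t(x^*)\big)\right] \le \frac{1}{\eta}\left(\varphi(x^*) - \varphi(y_1) + \sum_{t=1}^{T} D_\varphi(x_t, y_{t+1})\right) + \epsilon L T.$$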

In particular, if is -strongly convex and , setting and , we have

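$$\forall x^* \in \mathcal{K}^*: \quad \mathbb{E}\left[\sum_{t=1}^{T} \big(f_t(\tilde{x}_t) - f_t(x^*)\big)\right] \le L\sqrt{\frac{2AT}{\mu}} + LR.$$

###### Proof.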

First, for any fixed round , let be the output of in this round. We know by the second property of the PAD oracle that there exists such that . Since is equal to with probability , letting , we have

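$$f_t(\bar{x}_t) - f_t(x_t) = \mathbb{E}\big[f_t(\tilde{x}_t) - f_t(x_t)\big] = f_t\Big(\sum_i p_i v_i - x_t\Big) \le f_t\Big(\sum_i p_i v_i - x_t + c\Big) \le \epsilon L. \tag{1}$$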

We make use of the following properties of Bregman divergence, which can be verified easily (see e.g. Section 11.2 in (Cesa-Bianchi and Lugosi, 2006)):

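$$\big(\nabla\varphi(x) - \nabla\varphi(y)\big)^\top (x - z) = D_\varphi(z, x) - D_\varphi(z, y) + D_\varphi(x, y). \tag{2}$$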

Consider any . We have

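$$\begin{aligned}
&\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big) \\
&= \sum_{t=1}^{T} \frac{1}{\eta}\big(\nabla\varphi(x_t) - \nabla\varphi(y_{t+1})\big)^\top (x_t - x^*) && \text{(by algorithm definition)} \\
&= \frac{1}{\eta}\sum_{t=1}^{T} \big(D_\varphi(x^*, x_t) - D_\varphi(x^*, y_{t+1}) + D_\varphi(x_t, y_{t+1})\big) && \text{(by (2))} \\
&\le \frac{1}{\eta}\sum_{t=1}^{T} \big(D_\varphi(x^*, y_t) - D_\varphi(x^*, y_{t+1}) + D_\varphi(x_t, y_{t+1})\big) && \text{(by property of the PAD oracle)} \\
&= \frac{1}{\eta}\Big(D_\varphi(x^*, y_1) - D_\varphi(x^*, y_{T+1}) + \sum_{t=1}^{T} D_\varphi(x_t, y_{t+1})\Big). && \text{(by telescoping)}
\end{aligned} \tag{3}$$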

Combining (1) and (3), we can bound the expected improper regret of Algorithm 1 as

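$$\begin{aligned}
\forall x^* \in \mathcal{K}^*: \quad \mathbb{E}\left[\sum_{t=1}^{T} \big(f_t(\tilde{x}_t) - f_t(x^*)\big)\right] &= \sum_{t=1}^{T} \big(f_t(\bar{x}_t) - f_t(x^*)\big) \\
&\le \frac{1}{\eta}\Big(D_\varphi(x^*, y_1) - D_\varphi(x^*, y_{T+1}) + \sum_{t=1}^{T} D_\varphi(x_t, y_{t+1})\Big) + \epsilon L T.
\end{aligned} \tag{4}$$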

By the optimality condition , we have

$$D_\varphi(x^*, y_1) \le \varphi(x^*) - \varphi(y_1). \tag{5}$$

Plugging (5) into (4) and noting $D_\varphi(x^*, y_{T+1}) \ge 0$, we finish the proof of the first regret bound.

When $\varphi$ is $\mu$-strongly convex, we have the following well-known property (see http://xingyuzhou.org/blog/notes/strong-convexity for a proof):

$$D_\varphi(x, y) \le \frac{1}{2\mu}\,\|\nabla\varphi(x) - \nabla\varphi(y)\|^2.$$

Then by the definition in Algorithm 1 we have

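$$\forall t \in [T]: \quad D_\varphi(x_t, y_{t+1}) \le \frac{1}{2\mu}\,\|\nabla\varphi(x_t) - \nabla\varphi(y_{t+1})\|^2 = \frac{1}{2\mu}\,\|\eta f_t\|^2 \le \frac{\eta^2 L^2}{2\mu}. \tag{6}$$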

From the above inequality and the choices of parameters and , we have

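$$\mathbb{E}\left[\sum_{t=1}^{T} \big(f_t(\tilde{x}_t) - f_t(x^*)\big)\right] \le \frac{A}{\eta} + \frac{\eta L^2 T}{2\mu} + LR \le L\sqrt{\frac{2AT}{\mu}} + LR.$$

∎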
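The last inequality can be checked by balancing the two $\eta$-dependent terms. The worked step below assumes that $A$ is an upper bound on $\varphi(x^*) - \varphi(y_1)$ and that $\epsilon$ is chosen so that $\epsilon L T = LR$, that is $\epsilon = R/T$. Both are our reading of the final display rather than values quoted from the theorem statement.

```latex
% Assumptions: \varphi(x^*) - \varphi(y_1) \le A and \epsilon = R/T.
\begin{align*}
g(\eta) &:= \frac{A}{\eta} + \frac{\eta L^2 T}{2\mu},
\qquad
g'(\eta) = -\frac{A}{\eta^2} + \frac{L^2 T}{2\mu} = 0
\;\Longrightarrow\;
\eta = \sqrt{\frac{2A\mu}{L^2 T}}, \\
g\!\left(\sqrt{\tfrac{2A\mu}{L^2 T}}\right)
&= L\sqrt{\frac{AT}{2\mu}} + L\sqrt{\frac{AT}{2\mu}}
 = L\sqrt{\frac{2AT}{\mu}}.
\end{align*}
```

With this step size the two terms are equal, so the bound $L\sqrt{2AT/\mu}$ is attained with equality by this choice of $\eta$ and cannot be improved by tuning $\eta$ alone.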

For the problem of $\alpha$-regret minimization using an $\alpha$-approximation oracle, we have the following regret guarantee, which is an immediate corollary of Theorem 4.3.

###### Corollary 4.4.

If , , , , setting , , Algorithm 1 has the following regret guarantee:

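$$\forall x^* \in \mathcal{K}^*: \quad \mathbb{E}\left[\sum_{t=1}^{T} f_t(\tilde{x}_t) - \alpha \sum_{t=1}^{T} f_t(x^*)\right] \le \alpha L R\big(\sqrt{T} + 1\big).$$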

### 4.2 Construction of the Projection-and-Decomposition Oracle

Now we show how to construct the PAD oracle using the improper linear optimization oracle . Our construction is given in Algorithm 2.

###### Theorem 4.5.

Suppose satisfies the PNIP condition (Definition 4.1) and is -strongly convex. Then for any and , Algorithm 2 must terminate in iterations, and it correctly implements the projection-and-decomposition oracle , i.e., its output satisfies the two properties in Definition 4.2.

We break the proof of Theorem 4.5 into several lemmas.

###### Lemma 4.6.

If satisfies the PNIP condition (Definition 4.1), then computed in Algorithm 2 satisfy for all .

###### Proof.

Since we have , by the KKT condition, we have

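$$0 = \frac{\partial}{\partial z}\Big(D_\varphi(z, z_i) - \lambda\, w_i^\top (z - v_i)\Big)\Big|_{z = z_{i+1}} = \nabla\varphi(z_{i+1}) - \nabla\varphi(z_i) - \lambda w_i$$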

for some . On the other hand, note that , for some , where . Therefore, for all we have . This means . ∎
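In the Euclidean special case $\varphi(x) = \frac{1}{2}\|x\|^2$, the stationarity condition above reads $z_{i+1} = z_i + \lambda w_i$ with $\lambda \ge 0$, which is the usual projection onto a half-space. The sketch below implements that special case only. The function name `project_onto_halfspace` is hypothetical, and the half-space is written as $\{u : w^\top (u - v) \ge 0\}$, matching the constraint in the Lagrangian.

```python
def project_onto_halfspace(z, w, v):
    """Euclidean projection of z onto {u : w.(u - v) >= 0}.

    Returns the projected point and the multiplier lam >= 0 such that
    projected = z + lam * w, which is the KKT condition for phi = 0.5*||x||^2."""

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    slack = dot(w, [zi - vi for zi, vi in zip(z, v)])
    if slack >= 0:
        # z already satisfies the constraint, so the multiplier is zero.
        return list(z), 0.0
    lam = -slack / dot(w, w)
    return [zi + lam * wi for zi, wi in zip(z, w)], lam
```

For a general Legendre function the same condition holds in the dual coordinates $\nabla\varphi$, and the multiplier generally has to be found numerically rather than in closed form.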

###### Lemma 4.7.

Under the setting of Theorem 4.5, Algorithm 2 terminates in at most

$$\left\lceil 5d \log \frac{4R + 2\sqrt{\frac{2}{\mu} \min_{x^* \in \mathcal{K}^*} D_\varphi(x^*, y)}}{\epsilon} \right\rceil$$

iterations.

###### Proof.

According to the algorithm, for each , is the Bregman projection of onto a half-space containing , since the oracle ensures for all . Then by the generalized Pythagorean theorem (Lemma 2.4) we know for all and . Therefore we have for all and .

Let . Then there exists such that for all , where the last inequality is due to the -strong convexity of . This implies for all . Therefore, when , we must have , which means the loop must have terminated at this time. This proves the lemma. ∎

###### Lemma 4.8.

Under the setting of Theorem 4.5, for all , there exists such that .

###### Proof.

We assume for contradiction that there exists a unit vector such that . Note that . Letting , we have

$$\forall w \in \frac{h}{2} + \big(W \cap \mathcal{B}(0, r)\big): \quad \min_{i \in [k]} w^\top (v_i - y') > 0.$$

Since for , we have .

By the algorithm, we know that for all , there exists such that . Notice that from Lemma 4.6 we know for all . Thus for all there exists such that . In other words, we have

$$\forall w \in W_1 \setminus W_{k+1}: \quad \min_{i \in [k]} w^\top (v_i - y') \le 0.$$

Therefore, we must have . We also have for each from Lemma 2.6, since is the intersection of with a half-space that does not contain ’s centroid in the interior. Then we have

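$$\begin{aligned}
\mathrm{Vol}(W_1) &= \mathrm{Vol}\big(W \cap \mathcal{B}(0, 1)\big) = r^{-d}\, \mathrm{Vol}\big(W \cap \mathcal{B}(0, r)\big) \le r^{-d}\, \mathrm{Vol}(W_{k+1}) \\
&\le r^{-d} \big(1 - 1/(2e)\big)^{k}\, \mathrm{Vol}(W_1),
\end{aligned}$$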

where the last step is due to , which is true according to the termination condition of the loop. Therefore we have a contradiction. ∎
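The volume argument yields a contradiction as soon as $r^{-d}(1 - 1/(2e))^k < 1$. The sketch below computes the smallest such $k$ and compares it with $5d\log(1/r)$, which is the form of the iteration bound in Lemma 4.7. The function name `iterations_needed` is hypothetical, and treating $r$ as a free parameter in $(0, 1)$ is an assumption made for illustration.

```python
import math

# Each cut through the centroid keeps at most this fraction of the volume.
SHRINK = 1.0 - 1.0 / (2.0 * math.e)


def iterations_needed(d, r):
    """Smallest integer k with r**(-d) * SHRINK**k < 1, for 0 < r < 1."""
    threshold = d * math.log(1.0 / r) / (-math.log(SHRINK))
    return math.floor(threshold) + 1
```

Since $1 / \log\big(1/(1 - 1/(2e))\big) \approx 4.92 < 5$, running for $5d\log(1/r)$ iterations is always enough, which explains the constant 5 in the bound.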

We need the following basic property of projection onto a convex cone. The proof is given in Appendix B.

###### Lemma 4.9.

For any closed convex cone and any , we have .

The following lemma is a more general version of Lemma 6 in (Garber, 2017).

###### Lemma 4.10.

Given , and a convex cone , for any , the following two statements are equivalent:

1. There exists and such that .

2. For all , , there exists such that .

#### Geometric interpretation of Lemma 4.10.

Before proving Lemma 4.10, we discuss its geometric intuition. For simplicity of illustration, we only consider here. First we look at the case where . In this case the lemma simply degenerated to the fact

$$x \in \mathrm{CH}\big(\{v_i\}_{i=1}^{k}\big) \iff \text{there is no hyperplane that separates } x \text{ and all } v_i\text{'s}.$$

In the general case where $W$ is an arbitrary convex cone, Lemma 4.10 becomes

$$x \in \mathrm{CH}\big(\{v_i\}_{i=1}^{k}\big) + W^\circ \iff \text{there is no direction } w \in W \text{ such that } w^\top x$$

Denote . For the “” side, if , it is clear that for all we must have for some . For the “” side, if , then satisfies for all . Moreover it is easy to see , which completes the proof. See Figure 1 for a graphic illustration.

###### Proof of Lemma 4.10.

Suppose (A) holds. Then for any , , we have

 mini∈[k]w⊤(vi−x) ≤w⊤(k∑i=1pivi−x)≤w⊤(k∑i=1p