Non-stationary Stochastic Optimization with Local Spatial and Temporal Changes

by Xi Chen, et al.

We consider a non-stationary sequential stochastic optimization problem, in which the underlying cost functions change over time under a variation budget constraint. We propose an L_p,q-variation functional to quantify the change, which captures local spatial and temporal variations of the sequence of functions. Under the L_p,q-variation functional constraint, we derive both upper and matching lower regret bounds for smooth and strongly convex function sequences, which generalize previous results in (Besbes et al., 2015). Our results reveal some surprising phenomena under this general variation functional, such as the curse of dimensionality of the function domain. The key technical novelties in our analysis include an affinity lemma that characterizes the distance of the minimizers of two convex functions with bounded L_p difference, and a cubic spline based construction that attains matching lower bounds.



1 Introduction

Non-stationary stochastic optimization studies the problem of optimizing a non-stationary sequence of convex functions on the fly, with either noisy gradient or function value feedback. This problem has important applications in operations research and machine learning, such as dynamic pricing, online recommendation services, and simulation optimization (Gur, 2014; den Boer & Zwart, 2015; den Boer, 2015; Keskin & Zeevi, 2017). For example, in the case of dynamic pricing, an analyst is given the task of pricing a specific item over a long period of time, with feedback in the form of sales volumes in each time period. As the demand changes constantly over time, the problem can be naturally formulated as non-stationary sequential stochastic optimization, where the analyst adjusts his/her pricing over time based on noisy temporal feedback data.

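Formally, consider a sequence of convex functions $f_1, \dots, f_T : \mathcal{X} \to \mathbb{R}$ over $T$ epochs, where $\mathcal{X}$ is a convex, compact domain in the $d$-dimensional Euclidean space $\mathbb{R}^d$. At each epoch $t \in \{1, \dots, T\}$, a policy selects an action $x_t \in \mathcal{X}$, based on stochastic or noisy feedback (defined in Sec. 2) of previous epochs $1, \dots, t-1$, and suffers loss $f_t(x_t)$. The objective is to compete with the dynamic optimal sequence of actions in hindsight; that is, to minimize the regret

$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(x_t)\Big] - \sum_{t=1}^{T} \inf_{x \in \mathcal{X}} f_t(x).$$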

To ensure the existence of a policy with sub-linear regret (i.e., the non-trivial regret of $o(T)$), constraints are imposed upon function sequences such that any pair of consecutive functions $f_t$ and $f_{t+1}$ are sufficiently close, and therefore feedback from previous epochs is informative for later ones. These constraints usually carry strong practical implications. For example, in dynamic pricing problems, an action $x_t$ represents the price and $f_t$ is the (negative) revenue function at time $t$ in terms of price. Since the demand functions cannot change too rapidly, it is natural to impose a constraint on adjacent pairs of revenue functions (see, e.g., Keskin & Zeevi (2017)).

The question of optimizing regret for non-stationary convex functions with stochastic feedback has received much attention in recent years. One particularly interesting instance of non-stationary stochastic convex optimization was considered in Besbes et al. (2015), where sub-linear regret policies were derived when the average difference between consecutive functions is assumed to go to zero as $T \to \infty$. Optimal upper and lower regret bounds were derived for both noisy gradient and noisy function value feedback settings.

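In this work, we generalize the results of Besbes et al. (2015) so that local spatial and temporal changes of functions are taken into consideration. For any measurable function $f : \mathcal{X} \to \mathbb{R}$, define

$$\|f\|_p := \left( \frac{1}{\mathrm{vol}(\mathcal{X})} \int_{\mathcal{X}} |f(x)|^p \, \mathrm{d}x \right)^{1/p} \ \text{for } 1 \le p < \infty, \qquad \|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)|. \tag{1}$$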


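Here, $\mathrm{vol}(\mathcal{X})$ is the Lebesgue measure of the domain $\mathcal{X}$ and is finite because of the compactness of $\mathcal{X}$. We shall refer to $\|f\|_p$ as the $L_p$-norm of $f$ in the rest of this paper. Conventionally in functional analysis, the $L_p$ norm of a function is defined as the unnormalized integral $(\int_{\mathcal{X}} |f(x)|^p \, \mathrm{d}x)^{1/p}$; we nevertheless adopt the volume-normalized definition for convenience of presentation. It is worth noting that this normalization does not affect our results. In particular, because $\mathcal{X}$ is a compact domain and $\mathrm{vol}(\mathcal{X})$ is a constant, the regrets under the two definitions of the function norm differ only by a multiplicative constant. Moreover, Minkowski's inequality $\|f + g\|_p \le \|f\|_p + \|g\|_p$, as well as other basic properties of norms, remains valid. Also, for a sequence of convex functions $f = (f_1, \dots, f_T)$, define the $L_{p,q}$-variation functional of $f$ as

$$\mathrm{Var}_{p,q}(f) := \left( \frac{1}{T} \sum_{t=1}^{T-1} \|f_{t+1} - f_t\|_p^q \right)^{1/q} \ \text{for } 1 \le q < \infty, \qquad \mathrm{Var}_{p,\infty}(f) := \sup_{1 \le t \le T-1} \|f_{t+1} - f_t\|_p. \tag{2}$$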


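Note that in both Eqs. (1) and (2) we restrict ourselves to convex norms, that is, $p \ge 1$ and $q \ge 1$. We can then define function classes

$$\mathcal{F}_{p,q}(V_T) := \left\{ f = (f_1, \dots, f_T) : \mathrm{Var}_{p,q}(f) \le V_T \right\}, \tag{3}$$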


which serves as the budget constraint for a function sequence . The definition of is more general than introduced in Besbes et al. (2015) since it better reflects the spatial and temporal locality of in the subscripts and .
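The volume-normalized norm and the variation functional can be made concrete with a small numerical sketch. The Python code below approximates both on a uniform grid over the unit interval. It is only an illustration under stated assumptions: the function names `lp_norm` and `lpq_variation`, the grid discretization, the example bump, and the choice to average the q-th powers over the consecutive differences are all assumptions of this sketch rather than definitions taken from the paper.

```python
import numpy as np

def lp_norm(values, p):
    """Volume-normalized L_p norm of a function sampled on a uniform grid.

    On a uniform grid, (1/vol) * integral of |f|^p is approximated by the
    plain average of |f|^p over the grid points.
    """
    a = np.abs(np.asarray(values, dtype=float))
    if np.isinf(p):
        return float(a.max())
    return float(np.mean(a ** p) ** (1.0 / p))

def lpq_variation(fs, p, q):
    """Assumed form: (average over t of ||f_{t+1} - f_t||_p^q)^(1/q).

    `fs` has shape (T, n_grid); row t holds f_t sampled on the grid.
    Averaging over the T-1 consecutive differences is an assumption here.
    """
    fs = np.asarray(fs, dtype=float)
    diffs = np.array([lp_norm(fs[t + 1] - fs[t], p) for t in range(len(fs) - 1)])
    if np.isinf(q):
        return float(diffs.max())
    return float(np.mean(diffs ** q) ** (1.0 / q))

# A spatially local change: the function moves by 0.5 on 10% of the domain.
grid_size = 1000
bump = np.zeros(grid_size)
bump[200:300] = 0.5
fs = np.stack([np.zeros(grid_size), bump, np.zeros(grid_size)])
for p in (1, 2, np.inf):
    print(p, lpq_variation(fs, p, 1))
```

For this bump, smaller values of p give a smaller variation, which is the qualitative effect of spatial locality that the variation measure is meant to capture.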

1.1 A motivating example of dynamic pricing

To motivate the $L_{p,q}$-variation constraint, we use dynamic pricing as a motivating example and illustrate the advantages of the $L_{p,q}$-variation measure for loss functions with “local” spatial or temporal changes. We also provide guidelines on how the values of $p$ and $q$ should be set qualitatively.

We consider a stylized dynamic pricing problem of a single item under changing revenue functions. Let be a collection of time periods, at each of which the item receives a pricing , . We normalize the prices so that their range is the unit interval . At time period , an unknown function characterizes the negative expected revenue a retailer collects by setting the price at . The revenue function is assumed to be non-stationary over the time periods . The objective of the retailer is to design a pricing policy such that the aggregated (negative) expected revenue is minimized.

1.1.1 Spatial (pricing) locality of revenue changes

We first fix in the variation framework and show how different values of reflect degrees of spatial (pricing) locality of the revenue functions . Suppose for all , there exists a short interval with its length such that for all , and for all . Intuitively, the assumption implies that the changes of the revenue functions between consecutive time periods are “spatially local”, and the revenues are different only at prices in a small range . This is a reasonable assumption in practice since the revenue will not be sensitive to all possible prices in (e.g., a pair of adjacent revenue function values remain the same when price is very high or very low).

Under the existing variation measure (), simple calculation shows that . On the other hand, for , the variation measure satisfies . When the “locality” level is much smaller than 1, . Furthermore, in cases where and , we have and therefore the existing algorithm/analysis in Besbes et al. (2015) cannot achieve sub-linear regret on ; on the other hand, by considering the measure, one has for all , and therefore by applying algorithm/analysis in this paper we can achieve sub-linear regret on .

1.1.2 Temporal locality of revenue changes

We next consider in the variation framework and show how different values of reflect degrees of temporal locality of the revenue function . Suppose there exists a subset of time periods , such that for all , and for all . Intuitively, this assumption implies that the revenue function has local temporal changes, meaning that it changes only in short time intervals and remains the same at most other times. This is a relevant assumption when demands of the item have clear temporal correlations, such as seasonal food and clothes.

Simple calculations show that, for and , the variation measure of the above described function sequence is . This demonstrates the effect of the parameter in -variation for with local temporal changes, i.e., a smaller leads to a smaller variation measure of when .

1.1.3 Guidelines on the selection of values

Though the underlying sequence of expected revenue functions is assumed to be unknown, in practice it is common that certain background knowledge or prior information is available regarding . In this section we discuss how such prior information, especially regarding the magnitude changes of and in , can qualitatively help us select the parameters in the variation measure.

We first discuss the selection of and fix the choice for the moment. Suppose we have the prior knowledge that each pair of and differ significantly on portion of the domain by a difference of , as exemplified in Sec. 1.1.1. Then the variation of such a function sequence is approximately . According to our results in Theorems 3.1–3.3, the worst-case regret is where depending on feedback types (e.g., noisy gradient or function value feedback) and (strong) convexity of . The regret can be further re-parameterized as where .

The above analysis leads to the following insights providing qualitative suggestions of choices:

  1. The term is smaller for smaller values, because and is a strictly decreasing function in . This suggests that for function sequences with stronger spatial locality (e.g., revenue functions that only change on a small range of prices), one should use a smaller value in -variation measure;

  2. The term is smaller for larger values, because and is a strictly increasing function in . This suggests that for function sequences with smaller absolute amount of perturbation, one should use a larger in -variation measure.

We next discuss the selection of and fix the choice of . Unlike the spatial locality parameter , our Theorems 3.1 and 3.2 suggest that the optimal worst-case regret is insensitive to the choice of . This might sound surprising, but is the characteristic of the adopted worst-case analytical framework. To see this, we note that the worst-case function sequence is the one that evenly distributes the function changes across all (see also the detailed construction in the online supplement), in which case the -variation measure is the same for all . It should also be noted that the choice of does not affect our optimization algorithm or its re-starting procedure. Therefore, we simply recommend the selection of but we choose to include in our theorem statements for mathematical generality.

1.2 Results and techniques

The main result of this paper is to characterize the optimal regret over function classes , which includes explicit algorithms that are computationally efficient and attain the regret, and a lower bound argument based on Fano’s inequality (Ibragimov & Has’minskii, 1981; Yu, 1997; Cover & Thomas, 2006; Tsybakov, 2009) that shows the regret attained is optimal and cannot be further improved. Below is an informal statement of our main result (a formal description is given in Theorems 3.1 and 3.2):

Main result (informal).

For smooth and strongly convex function sequences under certain regularity conditions, the optimal regret over is with noisy gradient feedback, and with noisy function value feedback, provided that is not too small. In addition, for general convex function sequences satisfying only Lipschitz continuity on function values, we obtain a regret upper bound of with noisy gradient feedback, provided that is not too small. Here is the dimension of the domain .

We clarify that our results also cover the case of small , i.e., converges to 0 as at a very fast rate. However, the case of “not too small ” is of more interest. This is because if is very small, meaning that the underlying function sequence is close to a stationary one (i.e., ), then one could re-produce the standard and/or regrets ( for strongly convex and smooth functions with noisy function feedback, for strongly convex and smooth functions with noisy gradient feedback, and for general convex functions with noisy gradient feedback; see also, e.g., Jamieson et al. (2012); Agarwal et al. (2010); Hazan et al. (2007).) These rates are also known to be optimal (Jamieson et al., 2012; Hazan & Kale, 2014). Technical details of this point are given in the statements of Theorems 3.1, 3.2, 3.3.

More importantly, our result reveals several interesting facts about the regret over function sequences with local spatial and temporal changes. Most surprisingly, the optimal regret suffers from the curse of dimensionality, as the regret depends exponentially on the domain dimension . Such a phenomenon does not occur in previous works on stationary and non-stationary stochastic optimization problems. For example, for the case of being strongly convex and smooth, as spatial locality in becomes less significant (i.e., ), the optimal regrets approach (for noisy gradient feedback) and (for noisy function value feedback), which recovers the dimension-independent regret bounds in Besbes et al. (2015) derived for the special case of and . A similar curse of dimensionality also appears in the general convex case. We also note that, when is not too small, the obtained regret bound matches the optimal rate for in Besbes et al. (2015) as .

To obtain results for general -variation and the optimal regrets for strongly convex case, we make several important technical contributions in this paper, which are highlighted as follows.

  1. For noisy function value feedback, instead of using the online gradient descent (OGD) from Besbes et al. (2015), we adopt a regularized ellipsoidal (RE) algorithm from Hazan & Levy (2014) and extend it from exact function value evaluation to the noisy version. Our analysis relaxes an important assumption in Besbes et al. (2015) that requires the optimal solution to lie far away from the boundary of . Our policy based on the RE algorithm allows the optimal solution to be closer to the boundary of as increases.

  2. On the upper bound side, we prove an interesting affinity result (Lemma 4.2) which shows that the optimal solutions of cannot be too far apart provided that both are smooth and strongly convex functions, and is upper bounded. The affinity result is also generalizable to non-strongly convex functions (Lemma C.2), by directly integrating function differences in a close neighborhood of (or ) without resorting to (that could be unbounded without strong convexity). Both affinity results are key in deriving upper bounds for our problem, and have not appeared in the previous literature. They might also be useful for other non-stationary stochastic optimization problems (e.g., adaptivity to unknown parameters (Besbes et al., 2015; Karnin & Anava, 2016)).

  3. On the lower bound side, we present a systematic framework to prove lower bounds by first reducing the non-stationary stochastic optimization problem to an estimation problem with active queries, and then applying Fano’s inequality with a “sup-argument” similar in spirit to Castro & Nowak (2008) that handles the active querying component. To adapt Fano’s inequality, we also design a new construction of adversarial function sets, which is quite different from the one in Besbes et al. (2015). More specifically, to prove that the regret exhibits “curse of dimensionality”, one needs to construct functions that not only have different minima but also “localized” difference (meaning that for most ) such that is small. To construct such adversarial functions, we use the idea of “smoothing splines” from nonparametric statistics that connects two pieces of quadratic functions using a cubic function to ensure the smoothness and strong convexity of the constructed functions. Our analytical framework and spline-based lower bound construction could inspire new lower bounds for other online and non-stationary optimization problems.

1.3 Related work

In addition to the literature discussed in the introduction, we briefly review a few additional recent works from machine learning and optimization communities.

Stationary stochastic optimization.

The stationary stochastic optimization problem considers a stationary function sequence , and aims at finding a near-optimal solution such that is close to . When only noisy function evaluations are available at each epoch, the problem is also known as zeroth-order optimization and has received much attention in the optimization and machine learning community. Classical approaches include confidence-band methods (Agarwal et al., 2013) and pairwise comparison based methods (Jamieson et al., 2012), both of which achieve regret with polynomial dependency on domain dimension . Here in notation we drop poly-logarithmic dependency on . The tight dependency on , however, remains open. In the more restrictive statistical optimization setting , optimal dependency on can be attained by the so-called “two-point query” model (Shamir, 2015).

Online convex optimization.

In online convex optimization, an arbitrary convex function sequence is allowed, and the regret of a policy is compared against the optimal stationary benchmark in hindsight. Unlike the stochastic optimization setting, in online convex optimization the full information of is revealed to the optimizing algorithm after epoch , which allows for exact gradient methods. It is known that for unconstrained online convex optimization, the simplest gradient descent method attains regret for convex functions, and regret for strongly convex and smooth functions, both of which are optimal in the worst-case sense (Hazan, 2016). For constrained optimization problems, projection-free methods exist following mirror descent or follow-the-regularized-leader (FTRL) methods (Hazan & Levy, 2014). Zinkevich (2003); Hall & Willett (2015) considered the question of online convex optimization by competing against the optimal dynamic solution sequence subject to certain smoothness constraints like . Jadbabaie et al. (2015); Mokhtari et al. (2016) further imposed the constraint on both solution sequences and function sequences in terms of -variation and showed that adaptivity to the unknown smoothness parameter is possible with noiseless gradient and the information of . Daniely et al. (2015); Zhang et al. (2017) also designed algorithms that adapt to the unknown smoothness parameter, under the model that the entire function is revealed after time . However, the adaptation still remains an open problem in the “bandit” feedback setting considered in our paper, in which only noisy evaluations of or are revealed. Under the bandit feedback setting, the function perturbations (e.g., ) cannot be easily estimated, making it unclear whether adaptation to is possible.

Bandit convex optimization.

Bandit convex optimization is a combination of stochastic optimization and online convex optimization, where the stationary benchmark in hindsight of a sequence of arbitrary convex functions is used to evaluate regrets. At each time , only the function evaluation at the queried point (or its noisy version) is revealed to the learning algorithm. Despite their similarity to stochastic and/or online convex optimization, convex bandits are considerably harder due to their lack of first-order information and the arbitrary change of functions. Flaxman et al. (2005) proposed a novel finite-difference gradient estimator, which was adapted by Hazan & Levy (2014) to an ellipsoidal gradient estimator that achieves regret for constrained smooth and strongly convex bandit problems. For the non-smooth and non-strongly convex bandit problem, the recent work of Bubeck et al. (2017) attains regret with an explicit algorithm whose regret and running time both depend polynomially on dimension .

1.4 Notations and basic properties of

For a -dimensional vector we write to denote the norm of , for , and to denote the norm of . Define and as the -dimensional ball and sphere of radius , respectively. We also abbreviate and . For a -dimensional subset , denote as the interior of , as the closure of , and as the boundary of . For any , we also define as the “strict interior” of , where every point in is guaranteed to be at least away from the boundary of .

We note that the defined in (2) is monotonic in and , as shown below:

Proposition 1.1.

For any and it holds that . In addition, for any we have , and similarly for any we have , assuming all functions in are continuous.

The proof of Proposition 1.1 is deferred to Section D.1 in the online supplement.

The rest of the paper is organized as follows. In Section 2, we introduce the problem formulation. Section 3 contains the main results and describes the policies. Section 4 presents the proof of our main positive result. The concluding remarks and future works are discussed in Section 6. Additional proofs can be found in the online supplement.

2 Problem formulation

Suppose are a sequence of unknown convex differentiable functions supported on a bounded convex set . At epoch , a policy selects a point (i.e., makes an action) and suffers loss . Certain feedback is then observed which can guide the decision of actions in future epochs. Two types of feedback structures are considered in this work:

  • Noisy gradient feedback: , where is the gradient of evaluated at , and are independent -dimensional random vectors such that each component is a random variable with ; furthermore, conditioned on is a sub-Gaussian random variable with parameter , meaning that for all ;

  • Noisy function value feedback: , where are independent univariate random variables that satisfy ; furthermore, conditioned on is a sub-Gaussian random variable with parameter , meaning that for all .

Both feedback structures are popular in the optimization literature and were considered in previous work on online convex optimization and stochastic bandits (e.g., Hazan (2016) and references therein). For notational convenience, we shall use or simply to refer to a general feedback structure without specifying its type, which can be either or .

Apart from being closed convex and being convex and differentiable, we also make the following additional assumptions on the domain and functions :

  1. (Bounded domain): there exists constant such that ;

  2. (Bounded function and gradient): there exists constant such that and ;

  3. (Unique interior optimizer): there exists unique such that . Furthermore, the interior of is a non-empty set (i.e., ) and there exists such that .

  4. (Smoothness): there exists constant such that for all .

  5. (Strong convexity): there exists constant such that for all .

The assumptions (A1), (A2) are standard assumptions that were imposed in previous works on both stationary and non-stationary stochastic optimization (Flaxman et al., 2005; Agarwal et al., 2013; Shamir, 2015; Besbes et al., 2015). The condition (A3) assumes that the optimal solution is not too close to the boundary of the domain . Compared to similar assumptions in existing work (Flaxman et al., 2005; Besbes et al., 2015), our assumption is considerably weaker since can be within distance to the boundary; while in Flaxman et al. (2005); Besbes et al. (2015), must be distance away from the boundary (i.e., away from the boundary by at least a constant). Finally, the conditions (A4) and (A5) concern second-order properties of and enable smaller regret rates for gradient descent algorithms. We note that the condition in Besbes et al. (2015) (see Eq. (10) in Besbes et al. (2015)) is stronger and implies our (A4) and (A5) since we do not assume that is twice differentiable. We also consider parameters in (A1)–(A5) and domain dimensionality as constants throughout the paper and omit their (polynomial) multiplicative dependency in regret bounds. In Section 3.2, we further relax the assumptions (A3)–(A5) and provide upper bound results for general convex function sequences.


Let be a random quantity defined over a probability space. A policy that outputs a sequence of is admissible if it is a measurable function that can be written in the following form:

Let denote the class of all admissible policies for epochs. A widely used metric for evaluating the performance of an admissible policy is the regret against dynamic oracle :


Here is either the noisy gradient feedback or the noisy function feedback . Note that a unique minimizer exists due to the strong convexity of (condition A5). The goal of this paper is to characterize the optimal regret:


and find policies that achieve the rate-optimal regret, i.e., attain the optimal regret up to a factor polynomial in $\log T$. The optimal regret in (5) is also known as the minimax regret in the literature, because it minimizes over all admissible policies and maximizes over all convex function sequences .

3 Main results

We establish theorems giving both upper and lower bounds on worst-case regret for both noisy gradient feedback and noisy function feedback over . The policies for achieving the following upper bound result will be introduced in the next section.

Theorem 3.1 (Upper bound for strongly-convex function sequences).

Fix arbitrary and . Suppose (A1) through (A5) hold, and . Then there exists a computationally efficient policy and for some function that is a polynomial function in and , such that

For the noisy function value feedback, there exists another computationally efficient policy and for some function that is a polynomial function in and , such that

Theorem 3.2 (Minimax lower bound for strongly-convex function sequences).

Suppose the same conditions hold as in Theorem 3.1. Then there exists a constant independent of and such that

In Theorem 3.1, the quantities and depend on and only via poly-logarithmic factors, and these poly-log factors are usually not the focus when studying the regret. In Theorem 3.2 the quantity is independent of and . The other problem dependent parameters are treated as constants throughout the paper. The proof of Theorem 3.1 is given in Sec. 4, while the proof of Theorem 3.2 is relegated to the online supplement.

The condition in both Theorems 3.1 and 3.2 is necessary for obtaining a non-trivial sub-linear regret. In particular, the lower bound results in Theorem 3.2 show that for , no algorithm can achieve sub-linear regret under either feedback model. On the other hand, a trivial algorithm that outputs for an arbitrary leads to a linear regret.

Both upper and lower regret bounds in Theorems 3.1 and 3.2 consist of two terms. The term for and term for arise from regret bounds for stationary stochastic optimization problems (i.e., ), which were proved in Jamieson et al. (2012); Hazan & Kale (2014). The other terms involving polynomial dependency on are the main regret terms for typical dynamic function sequences whose perturbation is not too small.

We also remark that the parameter does not affect the optimal rate of convergence in Theorem 3.2 (provided that is assumed for convexity of the norms). While this appears counter-intuitive, this is a property of our worst-case analytical framework, as the function sequence that leads to the worst-case regret is the one that distributes function changes evenly across all (see for example our detailed construction of adversarial function sequences in the online supplement), in which case the -variation measure is the same for all .

Remark 3.1 (Comparing with Besbes et al. (2015)).

Besbes et al. (2015) considered the special case of and , and established the following result:


Note that in Eq. (6) we adopt a slightly different notation from Besbes et al. (2015). In particular, the parameter in our paper is times the parameter in (Besbes et al., 2015). Such normalization is for presentation clarity only (to single out the term in the regret bounds).

It is clear that our results reduce to Eq. (6) as for both and . In particular, for fixed domain dimension we have that and , matching regrets in Eq. (6). Therefore, the result from Besbes et al. (2015) (for strongly convex function sequences) is a special case of our results.

Remark 3.2 (Curse of dimensionality).

A significant difference between and settings is the curse of dimensionality. In particular, when the (optimal) regret depends exponentially on dimension , while for the dependency on is independent of on the exponent. The curse of dimensionality is a well-known phenomenon in non-parametric statistical estimation (Tsybakov, 2009).

Below we first introduce the policies, which are based on a “meta-policy” in Besbes et al. (2015).

3.1 Policies

We first describe a “meta-policy” proposed in Besbes et al. (2015) based on a re-starting procedure:

Meta-policy (restarting procedure): input parameters and ; sub-policy .

  1. Divide epochs into batches such that , , etc., with , and for . The epochs are divided as evenly as possible, so that for all .

  2. For each batch , , do the following:

    (a) Run sub-policy with and , corresponding to .
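The restarting procedure can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `split_into_batches`, `run_with_restarts`, and `run_sub_policy` are hypothetical, and the number of batches is assumed to be the ceiling of the horizon divided by the batch size, with batch sizes differing by at most one.

```python
def split_into_batches(T, batch_size):
    """Split epochs 1..T into contiguous batches of nearly equal size.

    Assumption of this sketch: the number of batches is ceil(T / batch_size),
    and batch sizes differ by at most one epoch.
    """
    if T < 1 or batch_size < 1:
        raise ValueError("T and batch_size must be positive integers")
    num_batches = -(-T // batch_size)  # integer ceiling division
    base, extra = divmod(T, num_batches)
    batches, start = [], 1
    for j in range(num_batches):
        size = base + (1 if j < extra else 0)
        batches.append(range(start, start + size))
        start += size
    return batches

def run_with_restarts(T, batch_size, run_sub_policy):
    """Run a fresh copy of the sub-policy on every batch.

    `run_sub_policy(epochs)` is a hypothetical callable that starts from
    scratch, plays one action per epoch in `epochs`, and returns the actions.
    """
    actions = []
    for epochs in split_into_batches(T, batch_size):
        actions.extend(run_sub_policy(epochs))
    return actions
```

Because each call to `run_sub_policy` starts from scratch, feedback gathered in earlier batches is deliberately discarded, which is the point of the restart.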

The key idea behind the meta-policy is to “restart” a certain sub-policy after epochs. This strategy ensures that the sub-policy has a sufficient number of epochs to exploit feedback information, while at the same time avoiding the use of outdated feedback information. For the noisy gradient feedback , we set if and otherwise; for the noisy function value feedback , we set if and otherwise. The motivation for our scalings is given in Sec. 4, in which we prove Theorem 3.1.

The sub-policy is carefully designed to exploit information provided from different types of feedback structures. For noisy gradient feedback , a simple online gradient descent (OGD, see, e.g., Besbes et al. (2015); Hazan (2016)) policy is used:

Sub-policy (OGD): input parameters ; step sizes .

  1. Select arbitrary .

  2. For to do the following:

    (a) Suffer loss and obtain feedback .

    (b) Compute , where .
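A minimal Python sketch of projected online gradient descent with noisy gradients is given below. Several choices are assumptions made for illustration only: the domain is taken to be a Euclidean ball so that the projection has a closed form, the step size is the standard 1/(M t) schedule for M-strongly convex functions, and the callable `noisy_grad(t, x)` is a hypothetical stand-in for the noisy gradient feedback.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(noisy_grad, x0, n_steps, strong_convexity, radius):
    """Projected OGD; returns the n_steps actions as an array of shape (n_steps, d).

    noisy_grad(t, x): hypothetical oracle returning a noisy gradient of f_t at x.
    Step size 1 / (strong_convexity * t) is an assumed, standard schedule.
    """
    x = project_ball(np.asarray(x0, dtype=float), radius)
    actions = [x.copy()]
    for t in range(1, n_steps):
        g = noisy_grad(t, x)               # feedback observed after playing x
        eta = 1.0 / (strong_convexity * t)
        x = project_ball(x - eta * g, radius)
        actions.append(x.copy())
    return np.array(actions)
```

On a fixed quadratic with additive Gaussian gradient noise, the last iterate of this sketch settles close to the minimizer; within the restarting procedure it would be re-initialized at the start of each batch.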

For noisy function value feedback , the classical approach is to first obtain an estimator of the gradient by perturbing along a random coordinate . This idea originates from the seminal work of Yudin & Nemirovskii (1983) and was applied to convex bandits problems (e.g., Flaxman et al. (2005); Besbes et al. (2015)). Such an approach, however, fails to deliver the optimal rate of regret when the optimal solution lies particularly close to the boundary of the domain . Here we describe a regularized ellipsoidal (RE) algorithm from Hazan & Levy (2014), which attains the optimal rate of regret even when is very close to .

The RE algorithm in Hazan & Levy (2014) is based on the idea of self-concordant barriers:

Definition 3.1 (self-concordant barrier).

Suppose is convex and . A convex function is a -self-concordant barrier of if it is three times continuously differentiable on and has the following properties:

  1. For any , if then .

  2. For any and it holds that and where .

It is well-known that for any convex set with non-empty interior , there exists a -self-concordant barrier function with , and furthermore for bounded the barrier can be selected such that it is strictly convex; i.e., for all (Nesterov & Nemirovskii, 1994; Boyd & Vandenberghe, 2004). For example, for linear constraints with , a logarithmic barrier function can be used to satisfy all the above properties (note that denotes the -th row of ).
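As an illustration of the logarithmic barrier for linear constraints, the following Python sketch evaluates the standard barrier, the negative sum of the logarithms of the constraint slacks, together with its gradient and Hessian. The function name `log_barrier` and the convention of returning infinity outside the interior are assumptions of this sketch.

```python
import numpy as np

def log_barrier(A, b, x):
    """Log barrier phi(x) = -sum_i log(b_i - a_i^T x) for the set {x : A x <= b}.

    Returns (value, gradient, Hessian); outside the interior the value is
    +inf and the derivatives are None.
    """
    slack = b - A @ x
    if np.any(slack <= 0):
        return np.inf, None, None
    value = -np.sum(np.log(slack))
    grad = A.T @ (1.0 / slack)                  # sum_i a_i / slack_i
    hess = A.T @ np.diag(1.0 / slack ** 2) @ A  # sum_i a_i a_i^T / slack_i^2
    return value, grad, hess

# Unit square [0, 1]^2 written as four linear constraints.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
value, grad, hess = log_barrier(A, b, np.array([0.5, 0.5]))
```

At the center of the square the gradient vanishes and the Hessian is a multiple of the identity; both grow without bound as the point approaches any face, which is the barrier property.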

We are now ready to describe the RE sub-policy that handles noisy function value feedback. The policy is similar to the algorithm proposed in Hazan & Levy (2014), except that noisy function value feedback is allowed in our policy, while Hazan & Levy (2014) considered only exact function evaluations. The analysis of our policy is also more involved for dealing with noise.

Sub-policy (RE): input parameters ; constant step size ; self-concordant barrier ;

  1. Select ;

  2. For to do the following:

    (a) Compute , where is the identity matrix in .

    (b) Sample from the uniform distribution on the unit -dimensional sphere .

    (c) Select ; suffer loss and obtain feedback .

    (d) Compute gradient estimate .

    (e) FTRL update: .
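The sampling and gradient-estimation steps of the sub-policy can be sketched as follows. Since the exact formulas are not reproduced here, the sketch borrows the general form of the ellipsoidal estimator in Hazan & Levy (2014): the query point is drawn from an ellipsoid shaped by the inverse square root of the barrier Hessian plus a regularization term, and the gradient estimate rescales the observed value by the inverse of that matrix. The names `inv_sqrt_psd`, `re_query_and_estimate`, `reg`, and `noisy_value` are hypothetical, and the FTRL update is omitted.

```python
import numpy as np

def inv_sqrt_psd(M):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * w ** -0.5) @ V.T

def re_query_and_estimate(x, hess_barrier, reg, noisy_value, rng):
    """One query of an ellipsoidal one-point gradient estimator (sketch).

    x            : current center, in the interior of the domain
    hess_barrier : Hessian of the self-concordant barrier at x
    reg          : non-negative regularization added to the Hessian (assumed form)
    noisy_value  : hypothetical oracle returning a (noisy) function value
    """
    d = x.size
    A = inv_sqrt_psd(hess_barrier + reg * np.eye(d))
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)              # uniform direction on the unit sphere
    y = x + A @ u                       # query point inside the Dikin ellipsoid
    g = d * noisy_value(y) * np.linalg.solve(A, u)  # one-point gradient estimate
    return y, g
```

Two properties make this construction attractive: the query point always stays inside the domain because it lies in the Dikin ellipsoid of the barrier, and for a linear function the estimate is unbiased for the gradient, since the average of the outer product of a uniform unit vector with itself is the identity divided by the dimension.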

In step 2(d), the gradient estimate satisfies