A Random Walk Approach to First-Order Stochastic Convex Optimization

01/17/2019 · Sattar Vakili, et al. · Cornell University

Online minimization of an unknown convex function over a convex and compact set is considered under first-order stochastic bandit feedback, which returns a random realization of the gradient of the function at each query point. Without knowing the distribution of the random gradients, a learning algorithm sequentially chooses query points with the objective of minimizing regret, defined as the expected cumulative loss of the function values at the query points in excess of the minimum value of the function. An active search strategy is developed based on devising a biased random walk on an infinite-depth tree constructed through successive partitioning of the domain of the function. It is shown that the biased random walk moves toward the optimal point at a geometric rate, leading to an order-optimal regret performance of O(√(T)). The structural properties of this random-walk based strategy admit a detailed finite-time regret analysis. By localizing data processing to small subsets of the input domain based on the tree structure, the strategy enjoys O(1) computation and memory complexity per query and allows dynamic allocation of limited data storage.

1 Introduction

1.1 Stochastic Convex Optimization

Stochastic convex optimization is concerned with the minimization of the expected value of a random loss function over a convex and compact set. The stochastic model of the loss is unknown, except for the knowledge that its expected value is a convex function of the query point. At each time, the decision maker chooses a query point, and either a random sample of the function value or a random sample of the gradient of the function at the query point is revealed. These two feedback models are commonly referred to as zeroth-order and first-order stochastic optimization, respectively.

A traditionally adopted objective of the problem, as in the pioneering work by Robbins and Monro [1] and Kiefer and Wolfowitz [2] in the early 1950s, is to approximate the minimizer of the loss function. Also known as stochastic approximation, this line of work focuses on the asymptotic convergence of the end point $x_T$ to the optimal point $x^*$ (or of $f(x_T)$ to $f(x^*)$) over a growing horizon of length $T$. We refer to this formulation as the offline setting, for the reason that the losses incurred during the query process are inconsequential and the query process is chosen for the sole purpose of outputting a desired end point $x_T$. Extensive studies exist under this formulation (see the overview article by Lai [3]).

The online counterpart of the problem adopts the measure of regret, defined as the expected cumulative loss at the query points in excess of the minimum loss: $R(T) = \mathbb{E}[\sum_{t=1}^{T} f(x_t)] - T f(x^*)$. Under this objective, the query process needs to balance the exploration of the input space in search of $x^*$ and the loss incurred during the search process. The behavior of the regret over a growing horizon length $T$ is a finer measure than the convergence of $x_T$ or $f(x_T)$. Specifically, a policy with a sublinear regret order in $T$ implies that the sequence of function values at the query points converges to the optimum value $f(x^*)$. The converse, however, is not true. In particular, the convergence of $x_T$ to $x^*$ (or of $f(x_T)$ to $f(x^*)$) does not imply a sublinear, let alone an optimal, order of the regret.

In this paper, we focus on online first-order stochastic convex optimization, where the observations are random gradients at the query points and the objective is to minimize regret. A prevailing approach to this problem is based on stochastic gradient descent (SGD), where the next query point $x_{t+1}$ is obtained by moving from $x_t$ in the opposite direction of the observed gradient (with a properly chosen step size that shrinks in $t$) while ensuring feasibility via a projection operation. It is known that this approach offers an $O(\sqrt{T})$ asymptotic regret order, matching the lower bound [1, 4, 5]. The performance of SGD approaches, however, depends on careful tuning of parameters (e.g., the sequence of step sizes). Regret analysis for finite $T$ is also lacking.
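For concreteness, the projected SGD update described above can be written in the following standard form (our notation, not necessarily that of the references: $\eta_t$ is the step size, $\hat g_t$ the observed random gradient at $x_t$, and $\Pi_{\mathcal{X}}$ the Euclidean projection onto the feasible set):

x_{t+1} = \Pi_{\mathcal{X}}\big( x_t - \eta_t \, \hat g_t \big).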

With access to the gradient, the problem can also be mapped to stochastic root finding [6], with the objective of locating the root of the expected gradient based on random samples of the gradient. One solution to the problem under a one-dimensional input space is the probabilistic bisection algorithm (PBA). Assuming a prior distribution of the optimal point $x^*$, PBA updates the belief (i.e., the posterior distribution) of $x^*$ based on each observation and subsequently probes the median point of the belief. It was shown in [7] that the regret of PBA is upper bounded by an order arbitrarily close to $\sqrt{T}$, and an $O(\sqrt{T})$ regret order was conjectured. We point out that PBA requires a known stochastic model of the random gradient function for the belief update using Bayes rule, and the update and sorting of the belief at each query point can be expensive in terms of computation and memory.

1.2 Main Results

We propose a random walk approach to first-order stochastic convex optimization. Referred to as Random-Walk based Gradient Descent (RWGD), the proposed policy first constructs a binary tree with infinite depth based on successively refined partitioning of the input space. Specifically, the root of the tree corresponds to the entire input space, which, without loss of generality, is assumed to be the unit interval $[0,1]$ in the one-dimensional case. The tree grows to infinite depth by binary splitting of each node (i.e., the corresponding interval) to form the two children of the node at the next level. The key idea of RWGD is to devise a biased random walk on this interval tree, initialized at the root node. Each move of the random walk is guided by a local sequential test based on random gradient realizations drawn at the left boundary, the middle point, and the right boundary of the interval corresponding to the current location of the random walk. The goal of the local sequential test is to determine, at a prescribed confidence level, whether there is a change of sign in the gradient in the left sub-interval or the right sub-interval of the current node. If either one is true (with the chosen confidence level), the walk moves to the corresponding child that sees the sign change. If neither is true, the walk moves back to the parent of the current node. The stopping rule and the output of the local sequential test are based on properly constructed lower and upper confidence bounds of the empirical mean of the observed gradient realizations. A greater-than-$1/2$ bias of the random walk toward the optimum is sufficient to ensure convergence to the optimal point at a geometric rate, as shown in Sec. 3.

By bounding the sample complexity of the local sequential test and analyzing the trajectory of the biased random walk, we show that the regret of RWGD is of order $\sqrt{T}$ up to a logarithmic factor, hence order-optimal up to that factor. Furthermore, the structural properties of the random walk approach allow a finer tracking of the query points, leading to a finite-time regret analysis, the first such result to the best of our knowledge.

In addition to its finite-time performance guarantee (in contrast to asymptotic regret analysis), RWGD also enjoys advantages in terms of robustness and computation/memory efficiency over the prevailing approach of SGD. Specifically, with no step-size parameters to tune, RWGD is more robust to model mismatch and offers improved performance, as demonstrated in the simulation examples in Sec. 5. By localizing data processing to small subsets of the input domain based on the tree structure, RWGD has $O(1)$ computation and memory complexity per query and allows dynamic allocation of limited data storage. The projection operation in SGD, however, involves the entire input domain at each query and can be computationally expensive.

There may appear to be a connection between RWGD and PBA, since both algorithms involve a certain bisection of the input domain. These two approaches are, however, fundamentally different. First, PBA requires knowledge of the distribution of the random gradient function, while RWGD operates under unknown models. Second, the belief-based bisection in PBA is over the entire input domain at each query and is carried out dynamically based on each random observation. The interval tree in RWGD, based on successive bisections of the input space, is predetermined, and each move of the random walk leads to a bisection of a sub-interval whose length shrinks at a geometric rate over time with high probability. It is this zooming effect of the biased random walk that leads to the $O(1)$ computation and memory complexity. For PBA, if the input space is discretized into $K$ points for computation and storage, updating and sorting the belief incurs $O(K)$ computation per sample and a linear memory requirement.

1.3 Other Related Work

There are several results on first-order stochastic optimization that adopt stronger assumptions on the loss function than convexity. In particular, assuming strong convexity or exponential concavity results in a regret logarithmic in time [8].

Under the zeroth-order feedback model, where the decision maker has access to noisy function values, the problem can be viewed as a continuum-armed bandit problem, on which a vast body of results exists. In particular, the work in [9] developed an approach based on the ellipsoid algorithm that achieves an $O(\sqrt{T})$ regret (up to logarithmic factors) when the objective function is convex and Lipschitz. The continuum-armed bandit under a Lipschitz assumption (not necessarily convex) has been studied in [10, 11, 12], where higher orders of regret were shown. The $\mathcal{X}$-armed bandit introduced in [13] considers a function that is Lipschitz with respect to a dissimilarity function known to the learner. Under the assumption of a finite number of global optima and a particular smoothness property, a regret of order $\sqrt{T}$ (up to logarithmic factors) was shown. The policy proposed in [13] uses a tree structure for updating the indexes in a bandit algorithm, which is fundamentally different from RWGD and, for example, does not induce a random walk. This line of work differs from the gradient-based approach considered in this work. Nevertheless, since a finite number of samples of the function value can be translated into an estimate of the gradient under certain regularity assumptions, gradient-based approaches can be extended to cases where samples of the function value are directly fed into the learning policy.

We mention that the stochastic online learning setting considered here differs, in problem formulation, objective, and techniques, from the adversarial counterpart of the problem in which the loss function is adversarially chosen at each time instant. For this line of research, see [14, 15] and references therein.

2 Problem Formulation and Preliminaries

Let $\ell(\cdot)$ be a differentiable random loss function over the input domain and let $f(x) = \mathbb{E}[\ell(x)]$ denote its expected value. We assume $f$ is a convex function. Let $x^*$ denote an optimal point of the objective function $f$. In this paper, we focus on the one-dimensional case. Without loss of generality, let the domain be $[0,1]$. The results can be easily translated to any interval $[a,b]$.

Without knowing $x^*$ or the stochastic models of the loss or its gradient, an agent sequentially chooses the sampling points $x_1, x_2, \ldots$. Each sampling time $t$ sees an i.i.d. realization of the loss function. A random cost $\ell(x_t)$ at the sampling point is incurred, and a random realization of the gradient at $x_t$ is observed. The objective is to minimize the cumulative loss. Specifically, the goal is to design a learning policy $\pi$ that maps the history of the sampling points and the observations to the next sampling point $x_{t+1}$. The performance of a learning policy is measured by regret, defined as the expected cumulative loss at the chosen sampling points in excess of the loss at the optimum point $x^*$:

R_\pi(T) = \mathbb{E}_\pi\Big[\sum_{t=1}^{T} f(x_t)\Big] - T f(x^*),    (1)

where $\mathbb{E}_\pi$ is the expectation operator with respect to the random process induced by the policy $\pi$.

We assume that the distribution of the random gradient observation $Y(x)$ at each point $x$ is sub-Gaussian with parameter $\sigma$, i.e., its moment generating function is bounded by that of a Gaussian random variable with variance $\sigma^2$:

\mathbb{E}\big[ e^{\lambda (Y(x) - \mathbb{E}[Y(x)])} \big] \le e^{\sigma^2 \lambda^2 / 2} \quad \text{for all } \lambda \in \mathbb{R}.

As a result of the Chernoff-Hoeffding bound, we have ([17]), for any $\epsilon > 0$,

\Pr\big( |\bar{Y}_s(x) - \mathbb{E}[Y(x)]| > \epsilon \big) \le 2 e^{-c s \epsilon^2},    (2)

where $\bar{Y}_s(x)$ is the sample mean obtained from $s$ samples of $Y(x)$, and $c > 0$ is a constant depending on the class of distributions.

Extensions of both the proposed policy and its regret analysis to more general families of distributions, including heavy-tailed distributions, are relatively straightforward, as discussed in subsequent sections.

3 Random-Walk based Gradient Descent on a Tree

The proposed policy is based on an infinite-depth binary tree whose nodes represent subintervals of $[0,1]$ and whose edges represent the subset relation. The $2^d$ nodes at depth $d$ ($d = 0, 1, 2, \ldots$) of the tree correspond to the intervals resulting from an equal-length partition of $[0,1]$, with each interval of length $2^{-d}$. Each node at depth $d$ has two children corresponding to its two equal-length subintervals at depth $d+1$. Let $(d, k)$ ($k = 0, 1, \ldots, 2^d - 1$) denote the $k$th node at depth $d$. We use the terms node and its corresponding interval interchangeably.

Figure 1: The binary tree representing the subintervals of $[0,1]$. At level 0, the root corresponds to the interval $[0,1]$; at level 1, the two nodes correspond, respectively, to the intervals $[0,1/2]$ and $[1/2,1]$; at level 2, the four nodes correspond, respectively, to the intervals $[0,1/4]$, $[1/4,1/2]$, $[1/2,3/4]$, and $[3/4,1]$; and so on.
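As a concrete illustration of this indexing (a minimal sketch in Python; the function names and the zero-based indexing convention are ours, not necessarily the paper's), the interval and the neighbors of node $(d, k)$ can be computed as follows:

def node_interval(d, k):
    # Node (d, k): the k-th node (0-indexed) at depth d of the interval tree.
    # Its interval has length 2^{-d} and starts at k * 2^{-d}.
    length = 2.0 ** (-d)
    left = k * length
    return (left, left + length)

def children(d, k):
    # The two children at depth d + 1 correspond to the two halves of the interval.
    return (d + 1, 2 * k), (d + 1, 2 * k + 1)

def parent(d, k):
    # By convention, the parent of the root is the root itself.
    return (0, 0) if d == 0 else (d - 1, k // 2)

For example, node_interval(2, 3) returns (0.75, 1.0), the last interval at level 2 in Figure 1.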

The basic structure of the proposed RWGD is to carry out a biased random walk on this interval tree. The walk starts at the root of the tree. Each move of the random walk is to one of the three adjacent nodes of the current location (i.e., the parent and the two children, with the parent of the root defined as the root itself). It is guided by the outputs of a confidence-bound based sequential test carried out at the two boundary points and the middle point of the interval currently being visited by the random walk.

We now specify the sequential test carried out at a generic sampling point $x$. The goal is to determine, at a confidence level $1-\delta$, whether the expected gradient at $x$ is negative or positive. If the former is true, the test module outputs 'right', indicating the target $x^*$ is more likely to lie to the right of the current sampling point $x$; if the latter is true, the test module outputs 'left', indicating the target is more likely to lie to the left of $x$. Specifically, the local test module sequentially collects samples of the random gradient at $x$. After collecting each sample, it determines whether to terminate the test and, if yes, which value to output. As specified below, the stopping rule and the corresponding output are determined by comparing the upper and lower confidence bounds on the sample mean of the gradient realizations against zero.

If the upper confidence bound on the sample mean falls below zero, terminate and output 'right'. If the lower confidence bound rises above zero, terminate and output 'left'. Otherwise, continue taking samples of the gradient at $x$. Here the confidence bounds are centered at $\bar{Y}_s(x)$, the sample mean of the gradient obtained from $s$ observations at point $x$, with a width determined by $\delta$, a constant, and the distribution parameter specified in (2).

Figure 2: The sequential test at a sampling point $x$.

By convention, we define the output of the test at the boundary point $0$ to be 'right' and at the boundary point $1$ to be 'left', without performing the test.
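A minimal Python sketch of such a confidence-bound based sequential test is given below. It is purely illustrative: the specific anytime confidence radius, the handling of the sub-Gaussian parameter sigma, and all names (local_test, sample_gradient, max_samples) are our assumptions, not the paper's exact specification.

import math, random

def local_test(sample_gradient, delta, sigma=1.0, max_samples=10**6):
    # sample_gradient(): returns one noisy gradient observation at the fixed point x.
    # Output "right": the gradient is deemed negative (minimizer likely to the right of x).
    # Output "left":  the gradient is deemed positive (minimizer likely to the left of x).
    total, s = 0.0, 0
    while s < max_samples:
        total += sample_gradient()
        s += 1
        mean = total / s
        # One common anytime confidence radius for a sub-Gaussian mean; the paper's
        # exact constants and functional form may differ.
        radius = sigma * math.sqrt(2.0 * math.log(2.0 * s * s / delta) / s)
        if mean + radius < 0:
            return "right", s
        if mean - radius > 0:
            return "left", s
    return ("right" if mean < 0 else "left"), s  # safety cap on the number of samples

# Example: gradient observations at x = 0.3 for f(x) = (x - 0.7)^2 with Gaussian noise.
out, n = local_test(lambda: 2 * (0.3 - 0.7) + random.gauss(0, 0.5), delta=0.1)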

We now specify the random walk on the tree based on the outputs of the local tests. The algorithm consists of the following loop. Let the current location of the random walk be a node of the tree, initially set to the root. The boundary points and the middle point of the interval corresponding to the current node are probed by the local test module with parameter $\delta$, where $\delta$ is set to a constant small enough to ensure a greater-than-$1/2$ bias of the random walk toward the optimum. Based on the outputs of the tests at the left boundary point, the middle point, and the right boundary point of the current node, the random walk chooses one of the three neighboring nodes (its two children and its parent) to move to. The procedure is specified in the pseudo-code given in Algorithm 1.

Initialization: set the location of the random walk to the root node.
loop
     Test the two boundary points and the middle point of the interval corresponding to the current node with the local sequential test.
     if the outputs at the left boundary, the middle point, and the right boundary are, in order, 'right', 'left', 'left' (a sign change in the left sub-interval) then
          move to the left child of the current node.
     else if the outputs are, in order, 'right', 'right', 'left' (a sign change in the right sub-interval) then
          move to the right child of the current node.
     else
          move to the parent of the current node.
     end if
end loop
Algorithm 1 The random walk module of RWGD.

For example, at the root node, the outputs of the tests at the left and the right boundary points are 'right' and 'left', respectively, by convention. We carry out the local test module at the middle point $1/2$. If the output of the test at $1/2$ is 'left' (indicating the derivative at $1/2$ is likely to be positive), the random walk moves to the left child corresponding to the interval $[0, 1/2]$. If the output of the test at $1/2$ is 'right', the random walk moves to the right child corresponding to the interval $[1/2, 1]$.
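Putting the pieces together, the sketch below shows how the random walk module could drive the hypothetical helpers node_interval, children, parent, and local_test from the sketches above; it is an illustration of the control flow, not the authors' implementation.

def rwgd(sample_gradient_at, delta, horizon):
    # sample_gradient_at(x): one noisy gradient observation at query point x.
    d, k = 0, 0                           # start the walk at the root node
    queries = 0
    while queries < horizon:
        left, right = node_interval(d, k)
        mid = (left + right) / 2.0

        def test(x):
            # Boundary convention: no sampling at the end points of [0, 1].
            if x == 0.0:
                return "right", 0
            if x == 1.0:
                return "left", 0
            return local_test(lambda: sample_gradient_at(x), delta)

        (o_l, n_l), (o_m, n_m), (o_r, n_r) = test(left), test(mid), test(right)
        queries += n_l + n_m + n_r
        left_child, right_child = children(d, k)
        if (o_l, o_m, o_r) == ("right", "left", "left"):
            d, k = left_child             # sign change detected in the left half
        elif (o_l, o_m, o_r) == ("right", "right", "left"):
            d, k = right_child            # sign change detected in the right half
        else:
            d, k = parent(d, k)           # inconsistent outputs: back off to the parent
    return node_interval(d, k)            # interval currently localizing the minimizer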

We point out that the extension to more general distribution models can be easily handled. The only required change is in the local sequential test module of the proposed policy, while the global random walk module remains the same. The local test module needs to be modified based on the corresponding concentration results. Specifically, for light-tailed distributions, Chernoff-Hoeffding bounds similar to the ones for sub-Gaussian distributions exist. For heavy-tailed distributions, more delicate techniques such as the truncated sample mean or median-based mean estimates with similar concentration inequalities can be employed. We omit the details, given that such extensions are quite standard (e.g., see [18]).

4 Regret Analysis

In this section, we analyze the performance of RWGD. We show an upper bound on the regret along with a finite-time analysis, as formalized in Theorem 1 below.

Theorem 1.

Let $\delta$ be the chosen parameter of the RWGD policy. The regret of RWGD satisfies, for every horizon length $T$,

(3)

The proof of Theorem 1 is based on the following two lemmas. Lemma 1 gives upper bounds on the sample complexity and error probability of the local sequential test. Lemma 2 establishes the geometric rate at which the biased random walk of RWGD approaches the optimum point $x^*$.

Lemma 1.

Let $Y_1, Y_2, \ldots$ denote the time-indexed samples of the random gradient at a fixed point $x$. Let $\mu$ denote the expected value of this i.i.d. sub-Gaussian random process, and let $\bar{Y}_s$ denote the sample mean obtained from the first $s$ observations. Let $N$ be the stopping time of the local sequential test with parameter $\delta$ applied to this process. We have

(4)

and, in each of the two cases of the sign of $\mu$, respectively,

(5)
(6)

Proof: See Appendix A.

The probability that the output of the local test carried out at a point is incorrect (i.e., 'left' when the expected gradient is negative, or 'right' when it is positive) is upper bounded by $\delta$. The condition for the random walk to move in the right direction is that the outputs of all three tests carried out at the boundary points and the middle point of the current interval are correct. Thus, the probability that the random walk moves in the right direction exceeds $1/2$ by the choice of $\delta$. This ensures that the random walk moves toward $x^*$ at a geometric rate with a probabilistic guarantee, as specified in Lemma 2.
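For intuition, an illustrative calculation under a simple union bound (not necessarily the exact argument of the paper): if each of the three local tests errs with probability at most $\delta$, then

\Pr(\text{correct move}) \ge 1 - 3\delta > 1/2 \quad \text{whenever } \delta < 1/6.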

Let $z_m$ denote the sampling point at the $m$th time that the local test is called by RWGD. For example, the first step of the random walk is taken based on the first three tests, carried out at the two boundary points and the middle point of the root interval; the second step of the random walk is taken based on the next three tests, carried out at the boundary and middle points of the node visited next, and so on. Let $|z_m - x^*|$ denote the distance between a sampling point and the optimum. Lemma 2 establishes an upper bound on this distance after $n$ steps are taken by the random walk.

Lemma 2.

For the sampling points probed at step $n$ of the random walk in RWGD ($n = 1, 2, \ldots$), with probability approaching one exponentially fast in $n$, the distance to the optimum satisfies

(7)

Proof: See Appendix B.

Proof of Theorem 1.

Let $N_m$ denote the number of samples taken the $m$th time that the local sequential test is carried out; $N_1$ samples are taken at the first tested point, $N_2$ samples at the second tested point, and so on. Let $\tau_n$ denote the time at the end of the $n$th step of the random walk. Notice that both $N_m$ and $\tau_n$ are random variables, where the randomness comes from the randomness in the gradient samples. Define a cut-off time and a corresponding number of steps as functions of the horizon $T$ and $\delta$; by their definition, at the cut-off time the random walk has taken more than the corresponding number of steps. We analyze the regret incurred up to the cut-off time, and after it, separately.

(8)

Next, we establish an upper bound on each term of the regret.

Upper bound on the first term. From Lemma 1, we have that the expected number of samples taken by each call of the local sequential test satisfies

(9)

Based on this upper bound on the expected number of samples and by the convexity of $f$, we have

(10)

Noticing the constraint on the total number of samples and using the bound above, the following constrained optimization problem gives us an upper bound on the first term.

(11)

We can show that taking equal values of the decision variables (for all indices) yields an upper bound on the above optimization problem, which results in

(12)

Upper bound on the second term. At the cut-off time, by its definition, we know that the random walk has taken more than the corresponding number of steps. From Lemma 2, we have, with probability at least the level specified there,

(13)

where the last inequality is obtained by substituting the definitions of the cut-off quantities.

The second term in the regret is upper bounded as follows.

(14)

In the above inequalities, $\mathbb{1}\{\cdot\}$ denotes the indicator function. We used the convexity of $f$ and the bound in (13) to arrive at (14).

From (12) and (14), we have

which concludes the proof. ∎

Remark 1.

Let $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ denote the average of the query points. By Jensen's inequality,

(15)

Theorem 1 thus shows that $f(\bar{x}_T)$ converges to $f(x^*)$ at a rate of order $1/\sqrt{T}$ up to a logarithmic factor.
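Spelled out with the regret definition in (1), the Jensen step takes the standard form (our phrasing of the argument):

f(\bar{x}_T) = f\Big(\frac{1}{T}\sum_{t=1}^{T} x_t\Big) \le \frac{1}{T}\sum_{t=1}^{T} f(x_t),
\qquad \text{hence} \qquad
\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le \frac{R_\pi(T)}{T}.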

5 Simulation

In this section, we compare the performance of RWGD and the standard SGD in simulations. The expected loss is a convex function on $[0,1]$, and the gradient observed at each query point is a normally distributed random variable centered at the true gradient value at that point. We vary the signal-to-noise ratio by scaling the noise variance (this is equivalent to changing the magnitude of the noise).

Figure 3: Comparison of the performance of RWGD and SGD.

In SGD, a sequence of sampling points is generated according to the rule $x_{t+1} = x_t - \alpha_t \hat{g}_t$, where $\hat{g}_t$ is the observed gradient at $x_t$, the initial point is randomly chosen, and $\{\alpha_t\}$ is a sequence of chosen step sizes; if the resulting value falls outside $[0,1]$, it is projected back onto $[0,1]$. The convergence of SGD is strongly dependent on the step sizes. In particular, in addition to the assumptions on the distribution of the gradient noise, convergence requires that the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ [6]. In our simulations, we use a diminishing step-size sequence satisfying this requirement, and the initial point is chosen at random (SGD does not show considerable sensitivity to the initial point in these examples). The parameter $\delta$ in RWGD is set to a fixed constant.
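A minimal sketch of this projected-SGD baseline is given below; the step-size constant and the test function are illustrative choices of ours, not the paper's simulation settings.

import random

def projected_sgd(sample_gradient_at, horizon, step=lambda t: 1.0 / t):
    # Projected SGD on [0, 1]: x_{t+1} = clip(x_t - alpha_t * g_t, 0, 1).
    # The default step size alpha_t = 1/t satisfies the summability conditions above.
    x = random.random()                              # random initial point in [0, 1]
    trajectory = []
    for t in range(1, horizon + 1):
        g = sample_gradient_at(x)
        x = min(1.0, max(0.0, x - step(t) * g))      # gradient step, then projection
        trajectory.append(x)
    return trajectory

# Illustrative run: expected loss f(x) = (x - 0.7)^2 with Gaussian gradient noise.
noisy_grad = lambda x: 2 * (x - 0.7) + random.gauss(0, 0.5)
xs = projected_sgd(noisy_grad, horizon=1000)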

Figure 4: Comparison of the performance of RWGD and SGD.

As shown in Figures 3 and 4, RWGD outperforms SGD in most cases. In contrast to SGD, whose performance depends strongly on the sequence of step sizes, RWGD performs consistently well.

6 Conclusion

We introduced a novel policy for the stochastic convex online learning problem based on constructing a search tree and inducing an efficient random walk on the tree. We established a finite-time regret analysis of the proposed policy, which ensures an order-optimal (up to a logarithmic factor) regret at any finite time $T$. The low computational complexity and memory requirement of the proposed policy make it suitable for a variety of applications. A potential future direction is extending the proposed policy to non-convex online learning problems.

References

  • [1] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
  • [2] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
  • [3] T. L. Lai, "Stochastic approximation," The Annals of Statistics, vol. 31, no. 2, pp. 391–406, 2003.
  • [4] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 928–936, 2003.
  • [5] E. Cope, "Regret and convergence bounds for a class of continuum-armed bandit problems," IEEE Transactions on Automatic Control, vol. 54, no. 6, 2009.
  • [6] R. Pasupathy and S. Kim, "The stochastic root-finding problem: overview, solutions, and open questions," ACM Transactions on Modeling and Computer Simulation, vol. 21, no. 3, Article 19, 2011.
  • [7] P. I. Frazier, S. G. Henderson, and R. Waeber, "Probabilistic bisection converges almost as quickly as stochastic approximation," available at arXiv:1612.03964 [math.PR], 2016.
  • [8] E. Hazan, A. Agarwal, and S. Kale, "Logarithmic regret algorithms for online convex optimization," Machine Learning, vol. 69, no. 2-3, pp. 169–192, 2007.
  • [9] A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin, "Stochastic convex optimization with bandit feedback," Advances in Neural Information Processing Systems, vol. 24, 2011.
  • [10] R. Agrawal, "The continuum-armed bandit problem," SIAM Journal on Control and Optimization, vol. 33, pp. 1926–1951, 1995.
  • [11] R. Kleinberg, "Nearly tight bounds for the continuum-armed bandit problem," Advances in Neural Information Processing Systems, vol. 18, 2005.
  • [12] R. Kleinberg, A. Slivkins, and E. Upfal, "Multi-armed bandits in metric spaces," in Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 681–690, 2008.
  • [13] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári, "X-armed bandits," Journal of Machine Learning Research, vol. 12, pp. 1655–1695, 2011.
  • [14] E. Hazan, "Introduction to online convex optimization," Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016.
  • [15] S. Shalev-Shwartz, "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
  • [16] S. Vakili, Q. Zhao, C. Liu, and C.-N. Chuah, "Hierarchical heavy hitter detection under unknown models," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
  • [17] R. Giuliano Antonini, Y. Kozachenko, and A. Volodin, "Convergence of series of dependent φ-sub-Gaussian random variables," Journal of Mathematical Analysis and Applications, vol. 338, no. 2, pp. 1188–1203, 2008.
  • [18] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi, "Bandits with heavy tail," IEEE Transactions on Information Theory, vol. 59, pp. 7711–7717, 2013.

Appendix A

Proof of Lemma 1.

The proof of Lemma 1 is based on concentration inequalities for sub-Gaussian distributions. We prove inequalities (4) and (5); the results in the other sign case of $\mu$ can be proven similarly. With the notation of Lemma 1, we have

(18)

The first of these inequalities holds because the bounding expression is a decreasing function of the number of samples over the relevant range, and the second holds because the stopping condition has not been triggered at any earlier sample size. We thus have

where the last inequality is based on the Chernoff-Hoeffding bound and (18). For the remaining range,

(20)
(21)

Inequality (20) holds based on (18) over the relevant range, and inequality (21) holds based on the Chernoff-Hoeffding bound. We can write the expected stopping time in terms of the sum of the tail probabilities as

which completes the proof by substituting the bounds above.

Appendix B

Proof of Lemma 2.

We define the value of each step of the random walk as $+1$ if the random walk moves in the right direction, i.e., it moves to the child whose interval contains $x^*$, or to the parent if neither of the children contains $x^*$; and $-1$ otherwise. We also use $S_n$ to denote the cumulative value of the first $n$ steps.

The condition for a step to take the value $+1$ is that the results of all three local sequential tests at that step are correct. Thus, as a result of Lemma 1, each step takes the value $+1$ with probability greater than $1/2$, which implies a positive expected step value. The positive expected value of each step indicates that the random walk is more likely to move closer to $x^*$ than away from it. In particular,

The last inequality is based on the Hoeffding inequality for independent Bernoulli random variables. Each time the random walk moves one step in the right direction, the length of the interval probed by the local sequential tests is halved. For example, at the root the interval trivially has length $1$; if the random walk then moves to the child that contains $x^*$, the interval length becomes $1/2$, and so on. Thus, on the event bounded above, we have

which completes the proof.