A Hybrid Stochastic Optimization Framework for Stochastic Composite Nonconvex Optimization

07/08/2019 ∙ by Quoc Tran-Dinh, et al. ∙ 0

In this paper, we introduce a new approach to develop stochastic optimization algorithms for solving stochastic composite and possibly nonconvex optimization problems. The main idea is to combine two stochastic estimators to form a new hybrid one. We first introduce our hybrid estimator and then investigate its fundamental properties to form a theoretical foundation for algorithmic development. Next, we apply our theory to develop several variants of stochastic gradient methods to solve both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single loop, yet it achieves an O(σ^3ε^-1 + σε^-3) complexity bound, which is significantly better than the O(σ^2ε^-4) complexity of state-of-the-art stochastic gradient methods, where σ is the variance and ε is a desired accuracy. Then, we consider two different variants of our method: adaptive step-size and double-loop schemes that have the same theoretical guarantees as our first algorithm. We also study two mini-batch variants and develop two hybrid SARAH-SVRG algorithms to solve the finite-sum problems. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our methods on several numerical examples with real datasets and compare them with state-of-the-art methods. Our numerical experiments show that the new methods are comparable to and, in many cases, outperform their competitors.

1 Introduction

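In this paper, we consider the following stochastic composite and possibly nonconvex optimization problem, which is widely studied in the literature:

$$\min_{x\in\mathbb{R}^p}\Big\{ F(x) := f(x) + \psi(x) \equiv \mathbb{E}_{\xi}\big[f(x;\xi)\big] + \psi(x) \Big\}, \tag{1}$$

where $f(\cdot;\cdot)$ is a stochastic function such that for each $x\in\mathbb{R}^p$, $f(x;\cdot)$ is a random variable in a given probability space $(\Omega, \mathbb{P})$, while for each realization $\xi\in\Omega$, $f(\cdot;\xi)$ is smooth on $\mathbb{R}^p$; $f(x) := \mathbb{E}_{\xi}[f(x;\xi)]$ is the expectation of the random function $f(x;\xi)$ over $\xi$ on $\Omega$; and $\psi$ is a proper, closed, and convex function.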

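In addition to (1), we also consider the following composite finite-sum problem:

$$\min_{x\in\mathbb{R}^p}\Big\{ F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \Big\}, \tag{2}$$

where $f_i$ for $i = 1, \dots, n$ are all smooth functions. Problem (2) can be considered as a special case of (1), where $\Omega := \{1, \dots, n\}$ and $\mathbb{P}$ is the uniform distribution on $\Omega$. If $n$ is extremely large such that evaluating the full gradient $\nabla f(x)$ and the function value $f(x)$ is expensive, then, as usual, we refer to this setting as the online model.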

If the regularizer $\psi$ is absent, then we obtain a smooth problem which has been widely studied in the literature. As another special case, if $\psi$ is the indicator of a nonempty, closed, and convex set $\mathcal{X}$, i.e. $\psi := \delta_{\mathcal{X}}$, then (1) also covers constrained nonconvex optimization problems.

1.1 Our goals, approach, and contribution

Our goals: Our objective is to develop a new approach to approximate a stationary point of (1) and its finite-sum setting (2) under standard assumptions used in existing methods. In this paper, we only focus on stochastic gradient descent-type (SGD) variants. We are also interested in both oracle complexity bounds and implementation aspects. The ultimate goal is to design simple algorithms that are easy to implement and require less parameter tuning effort.

Our approach: Our approach relies on a so-called “hybrid” idea which merges two existing stochastic estimators through a convex combination to design a “hybrid” offspring that inherits the advantages of its underlying estimators. We will focus on the hybrid estimators formed from the SARAH (a recursive stochastic estimator) introduced in [50] and any given unbiased estimator such as SGD [62], SVRG [32], or SAGA [17]. For the sake of presentation, we only focus on either SGD or SVRG estimator in this paper. We emphasize that our method is fundamentally different from momentum or exponential moving average-type methods such as in [15, 34] where we use two independent estimators instead of a combination of the past and the current estimators.

While our hybrid estimators are biased, fortunately, they provide some useful properties to develop new algorithms. One important feature is the variance reduced property which often allows us to derive a large step-size or a constant step-size in stochastic methods. Whereas a majority of stochastic algorithms rely on unbiased estimators such as SGD, SVRG, and SAGA, interestingly, recent evidence has shown that biased estimators such as SARAH, biased SAGA, or biased SVRG estimators also provide comparable and even better algorithms in terms of oracle complexity bounds as well as empirical performance, see, e.g. [19, 22, 53, 56, 68].

Our approach, on the one hand, can be extended to study second-order methods such as cubic regularization and subsampled schemes as in [8, 21, 63, 69, 71, 75]. The main idea is to exploit hybrid estimators to approximate both gradient and Hessian of the objective function similar to [69, 71, 75]. On the other hand, it can be applied to approximate a second-order stationary point of (1) and (2). The idea is to integrate our methods with a negative curvature search such as Oja’s algorithm [55] or Neon2 [6], or to employ perturbed/noise gradient techniques such as [23, 26, 38] in order to approximate a second-order stationary point. However, to avoid overloading this paper, we leave these extensions for our future work.

Our contribution: To this end, the contribution of this paper can be summarized as follows:

  • We first introduce a “hybrid” approach to merge two existing stochastic estimators in order to form a new one. Such a new estimator can be viewed as a convex combination of a biased estimator and an unbiased one to inherit the advantages of its underlying estimators. Although we only focus on a convex combination between SARAH [50] and either SGD [62] or SVRG [32] estimator, our approach can be extended to cover other possibilities. Given such new hybrid estimators, we develop several fundamental properties that can be useful for developing new stochastic optimization algorithms.

  • Next, we employ our new hybrid SARAH-SGD estimator to develop a novel stochastic proximal gradient algorithm, Algorithm 1, to solve (1). This algorithm can achieve an $\mathcal{O}(\sigma^3\varepsilon^{-1} + \sigma\varepsilon^{-3})$ oracle complexity bound. To the best of our knowledge, this is the first variant of SGD that achieves such an oracle complexity bound without using a double loop or check-points as in SVRG or SARAH, or requiring a table to store gradient components as in SAGA-type methods.

  • Then, we derive two different variants of Algorithm 1: adaptive step-size and double-loop schemes. Both variants have the same complexity as Algorithm 1. We also propose a mini-batch variant of Algorithm 1 and provide a trade-off analysis between mini-batch sizes and the choice of step-sizes to obtain better practical performance.

  • Finally, we design a hybrid SARAH-SVRG estimator and use it to develop new stochastic variants for solving the composite finite-sum problem (2). These variants also achieve the best-known complexity bounds while having new properties compared to existing methods.

Let us emphasize the following additional points of our contribution. Firstly, the new algorithm, Algorithm 1, is rather different from existing SGD methods. It first forms a mini-batch stochastic gradient estimator at a given initial point to provide a good approximation to the initial gradient of $f$. Then, it performs a single loop to update the iterate sequence, which consists of two steps: a proximal-gradient step and an averaging step, where our hybrid estimator is used.

Secondly, our methods work with both single-sample and mini-batches, and achieve the best-known complexity bounds in both cases. This is different from some existing methods such as SVRG-type and SpiderBoost that only achieve the best complexity under certain choices of parameters. Our methods are also flexible enough to allow different mini-batch sizes for the hybrid components in order to achieve different complexity bounds and to adjust the performance. For instance, in Algorithm 1, we can choose a single sample in the SARAH estimator while using a mini-batch in the SGD estimator, which leads to a different trade-off in the choice of the weight.

Finally, our theoretical results on hybrid estimators are also self-contained and independent. As we have mentioned, they can be used to develop other stochastic algorithms such as second-order methods or perturbed SGD schemes. We believe that they can also be used in other problems such as composition and constrained optimization [16, 44, 67].

1.2 Related work

Problem (1) and its sample averaging setting (2) have been widely studied in the literature for both convex and nonconvex models, see, e.g. [9, 10, 17, 27, 32, 42, 46, 50, 62, 64]. However, due to applications in deep learning, large-scale nonconvex optimization problems have attracted huge attention in recent years [30, 37]. Numerical methods for solving these problems heavily rely on two approaches: deterministic and stochastic approaches, ranging from first-order to second-order methods. Notable first-order methods include stochastic gradient descent-type, conditional gradient descent [59], and primal-dual schemes [13]. In contrast, advanced second-order methods consist of quasi-Newton, trust-region, sketching Newton, subsampled Newton, and cubic regularized Newton-based methods, see, e.g. [11, 48, 57, 63].

In terms of stochastic first-order algorithms, there has been a tremendously increasing trend in stochastic gradient descent methods and their variants in the last fifteen years. SGD-based algorithms can be classified into two categories: non-variance reduction and variance reduction schemes. The classical SGD method was studied in early work of Robbins & Monro [62], but its convergence rate was then investigated in [46] under new robust variants. Ghadimi & Lan extended SGD to nonconvex settings and analyzed its complexity in [27]. Other extensions of SGD can be found in the literature, including [4, 16, 20, 25, 28, 31, 34, 35, 45, 51, 58].

Alternatively, variance reduction-based methods have been intensively studied in recent years for both convex and nonconvex settings. Apart from mini-batch and importance sampling schemes [29, 73], the following methods are the most notable. The first class of algorithms is based on SAG estimator [64], including SAGA-variants [17]. The second one is SVRG [32] and its variants such as Katyusha [3], MiG [77], and many others [39, 60]. The third class relies on SARAH [50] such as SPIDER [22], SpiderBoost [68], ProxSARAH [56], and momentum variants [78]. Other approaches such as Catalyst [41] and SDCA [65] have also been proposed.

In terms of theory, many researchers have focussed on theoretical aspects of existing algorithms. For example, [27] appeared as one of the first remarkable works studying convergence rates of stochastic gradient descent-type methods for nonconvex and non-composite finite-sum problems. They later extended it to the composite setting in [29]. The authors of [68] also investigated the gradient dominant case, and [33] considered both finite-sum and composite finite-sum problems under different assumptions. Whereas many researchers have been trying to improve complexity upper bounds of stochastic first-order methods using different techniques [5, 6, 7, 22], other works have attempted to construct examples to establish lower-bound complexity barriers. The upper oracle complexity bounds have been substantially improved among these works and some results have matched the lower bound complexity in both convex and nonconvex settings [5, 4, 22, 27, 39, 40, 56, 60, 68, 76]. We refer to Table 1 for some notable examples of stochastic gradient-type methods for solving (1) and (2) and their non-composite settings.

Algorithms               Expectation   Finite-sum   Composite   Type
GD [49]                  NA                         ✓           Single
SGD [27]                               NA           ✓           Single
SAGA [60]                NA                         ✓           Single
SVRG [60]                NA                         ✓           Double
SVRG+ [39]                                          ✓           Double
SCSG [40]                                           ✗           Double
SNVRG [76]                                          ✗           Double
SPIDER [22]                                         ✗           Double
SpiderBoost [68]                                    ✓           Double
ProxSARAH [56]                                      ✓           Double
HybridSGD (This paper)                              ✓           Single
Table 1: A comparison of stochastic first-order oracle complexity bounds and the type of algorithms for nonsmooth nonconvex optimization in both the non-composite and composite cases. Here, $n$ is the number of data points, $\sigma$ is the variance in Assumption 2.3, and “single/double” means that the algorithm uses a single loop or a double loop, respectively. All the complexity bounds here depend on the Lipschitz constant $L$ in Assumption 2.2 and on the difference between the initial objective value and the lower bound in Assumption 2.1. We treat these quantities as constants and ignore them in the complexity bounds. Note that SAGA is a single-loop method, but it requires a matrix to store the stochastic gradients of all $n$ components.

In the convex case, there exist numerous research papers including [1, 2, 12, 24, 47, 49, 70] that study the lower bound complexity. In [22, 74], the authors constructed a lower-bound complexity for nonconvex finite-sum problems covered by (2). Their result bounds from below the complexity of any stochastic gradient method that relies only on the smoothness assumption to achieve an $\varepsilon$-stationary point in expectation. For the expectation problem (1), the best-known complexity bound to obtain an $\varepsilon$-stationary point in expectation is the one shown in [22, 68], which depends on an upper bound $\sigma$ of the variance (see Assumption 2.3). Unfortunately, we have not seen any lower-bound complexity for the nonconvex setting of (1) under standard assumptions from the literature.

While numerical stochastic algorithms for solving the non-composite setting, i.e. $\psi = 0$, are well-developed and have received considerable attention [5, 6, 7, 22, 40, 52, 53, 54, 60, 76], methods for the composite setting remain limited [60, 68]. In this paper, we will develop a novel approach to design stochastic optimization algorithms for solving the composite problems (1) and (2). Our approach is rather different from existing ones and we call it a “hybrid” approach.

1.3 Comparison

Let us compare our algorithms and existing methods in the following aspects:

Single-loop vs. multiple-loop:

As mentioned, we aim at developing practical methods that are easy to implement. One of the major differences between our methods and existing state-of-the-art methods is the algorithmic style: single-loop vs. multiple-loop. As discussed in several works, including [36], single-loop methods have some advantages over double-loop methods, including easier parameter tuning. The single-loop style consists of SGD, SAGA, and their variants [17, 18, 27, 46, 58, 62, 64], while the double-loop style comprises SVRG, SARAH, and their variants [32, 50]. Other algorithms such as Natasha [4] or Natasha1.5 [5] even have three loops. Let us compare these methods in detail as follows:

  • SGD and SAGA-type methods have a single loop, but SAGA-type algorithms use a matrix to maintain $n$ individual gradients, which can be very large if the number of components $n$ and the dimension of $x$ are large. In addition, SAGA has not yet been applied to solve (1). Our first algorithm, Algorithm 1, has a single loop like SGD and SAGA, and does not require heavy memory storage. However, to apply to (2), it still requires either an additional assumption or a check-point compared to SAGA. But if it solves (1), then it has the same assumptions as in SGD. In terms of complexity, Algorithm 1 is much better than SGD. To the best of our knowledge, Algorithm 1 is the first single-loop SGD variant that achieves the best-known complexity. Another related work is [15], which uses a momentum approach, but requires an additional bounded gradient assumption to achieve a complexity similar to that of Algorithm 1.

  • Algorithm 2 has a double loop, like SVRG and SARAH-type methods. While the double loop in SVRG, SARAH, and their variants is required to achieve convergence, it is optional in Algorithm 2. Note that double-loop or multiple-loop methods require tuning more parameters such as epoch lengths and possibly the mini-batch size of the snapshot points. Although Algorithm 3 is loopless, it can be viewed as a double-loop variant. This algorithm has a different complexity bound than existing methods.

Single-sample and mini-batch:

Our methods work with both single-sample and mini-batch, and in both cases, they achieve the best-known complexity bounds. This is different from some existing methods such as SVRG or SARAH-based methods [60, 68] where the best complexity is only obtained if one chooses the best parameter configuration.

Complexity bounds:

Algorithm 1 and its variants all achieve the best-known complexity bounds as in [56, 68] for solving (1). In early work such as Natasha [4] and Natasha1.5 [5] which are based on the SVRG estimator, the best complexity is often for solving (1) and for solving (2). By combining with additional sophisticated tricks, these complexity bounds are slightly improved. For instance, Natasha [4] or Natasha1.5 [5] can achieve in the finite-sum case, and in the expectation case, but they require three loops with several parameter adjustment which are difficult to tune in practice. SNVRG [76] exploits a dynamic epoch length as used in [40] to improve its complexity bounds. Again, this method also requires complicated parameter selection procedure. To achieve better complexity bounds, SARAH-based methods have been studied in [22, 53, 56, 68]. Their complexity meets the lower-bound one in the finite-sum case as indicated in [22, 56].

1.4 Paper organization

The rest of this paper is organized as follows. Section 2 discusses the main assumptions of our problems (1) and (2), and their optimality conditions. Section 3 develops new hybrid stochastic estimators and investigates their properties. We consider both single-sample and mini-batch cases. Section 4 studies a new class of hybrid gradient algorithms to solve both (1) and (2). We develop three different variants of hybrid algorithms and analyze their convergence and complexity estimates. Section 5 extends our algorithms to mini-batch cases. Section 6 is devoted to investigating hybrid SARAH-SVRG methods to solve the finite-sum problem (2). Section 7 gives several numerical examples and compares our methods with existing state-of-the-art methods. For the sake of presentation, all technical proofs are provided in the appendix.

2 Basic assumptions and optimality condition

Notation and basic concepts:

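We work with the Euclidean spaces $\mathbb{R}^p$ and $\mathbb{R}^n$, equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$. For any function $f$, $\mathrm{dom}(f) := \{x\in\mathbb{R}^p : f(x) < +\infty\}$ denotes the effective domain of $f$. If $f$ is continuously differentiable, then $\nabla f$ denotes its gradient. If, in addition, $f$ is twice continuously differentiable, then $\nabla^2 f$ denotes its Hessian.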

For a stochastic function defined on a probability space , we use to denote the expectation of w.r.t. on . We also overload the notation to express the expectation w.r.t. a realization in both single-sample and mini-batch cases. Given a finite set , we denote if for and . If for , then we write

by dropping the probability distribution

.

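Given a random mapping $G_{\xi}(\cdot)$ depending on a random vector $\xi$, we say that $G_{\xi}$ is $L$-average Lipschitz continuous if $\mathbb{E}_{\xi}\big[\|G_{\xi}(x) - G_{\xi}(y)\|^2\big] \le L^2\|x - y\|^2$ for all $x, y$, where $L$ is called the Lipschitz constant of $G_{\xi}$. If $G$ is a deterministic function, then this condition becomes $\|G(x) - G(y)\| \le L\|x - y\|$, which states that $G$ is $L$-Lipschitz continuous. In particular, if this condition holds for $G = \nabla f$, then we say that $f$ is $L$-smooth.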

For a proper, closed, and convex function , denotes its subdifferential at , and denotes its proximal operator. If is the indicator of a nonempty, closed, and convex set , then reduces to the projection onto . We say that is -weakly convex if for all and , where is a given constant. Clearly, a weakly convex function is not necessarily convex. However, any -smooth function is -weakly convex. If is -weakly convex, then is -strongly convex if . Therefore, the proximal operator is well-defined and single-valued if . Note that is non-expansive, i.e. for all .
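As a small illustration of the proximal operators discussed above, the following sketch implements two standard special cases in plain Python: soft-thresholding for an $\ell_1$ regularizer and the projection for the indicator of a box. The function names `prox_l1`, `prox_box`, and `dist`, as well as the choice of these two regularizers, are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def prox_l1(x, eta, lam=1.0):
    """Prox of eta * lam * ||.||_1: entrywise soft-thresholding with threshold eta * lam."""
    thr = eta * lam
    return [math.copysign(max(abs(xi) - thr, 0.0), xi) for xi in x]


def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi]^p, i.e. the projection onto the box."""
    return [min(max(xi, lo), hi) for xi in x]


def dist(x, y):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-3.0, 3.0) for _ in range(5)]
    y = [rng.uniform(-3.0, 3.0) for _ in range(5)]
    # Non-expansiveness: outputs are never farther apart than the inputs.
    print(dist(prox_l1(x, 0.5), prox_l1(y, 0.5)) <= dist(x, y) + 1e-12)
    print(dist(prox_box(x, -1.0, 1.0), prox_box(y, -1.0, 1.0)) <= dist(x, y) + 1e-12)
```

Both operators are cheap to evaluate in closed form, which is the situation the exact-prox assumption in this paper has in mind.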

If is a matrix, then is the spectral norm of and the inner product of two matrices and is defined as . Also, stands for the set of positive integer numbers, and . Given , denotes the maximum integer number that is less than or equal to . We also use to express complexity bounds of algorithms.

2.1 Fundamental assumptions

Our algorithms developed in the sequel rely on the following fundamental assumptions:

Assumption 2.1.

Both problems (1) and (2) satisfy the following conditions:

  • (Convexity of the regularizer) $\psi$ is a proper, closed, and convex function. The domain $\mathrm{dom}(F) := \mathrm{dom}(f)\cap\mathrm{dom}(\psi)$ is nonempty.

  • (Boundedness from below) There exists a finite lower bound

    $$F^{\star} := \inf_{x\in\mathbb{R}^p} F(x) > -\infty. \tag{3}$$

This assumption is fundamental and required for any algorithm. Here, since $\psi$ is proper, closed, and convex, its proximal operator is well-defined, single-valued, and non-expansive. We assume that this proximal operator can be computed exactly.

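Assumption 2.2 ($L$-average smoothness).

The expectation function $f$ in (1) is $L$-smooth on $\mathrm{dom}(F)$, i.e. there exists $L\in(0, +\infty)$ such that

$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla f(y;\xi)\|^2\big] \le L^2\|x - y\|^2, \quad \forall x, y\in\mathrm{dom}(F). \tag{4}$$

In the finite-sum setting (2), the $L$-smoothness condition (4) can be expressed as the $L$-average smoothness of all $f_i$ as:

$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L^2\|x - y\|^2, \quad \forall x, y\in\mathrm{dom}(F). \tag{5}$$

Assumption 2.3 (Bounded variance).

There exists $\sigma\in[0, +\infty)$ such that

$$\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla f(x)\|^2\big] \le \sigma^2, \quad \forall x\in\mathrm{dom}(F). \tag{6}$$

The bounded variance condition for (2) becomes

$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2, \quad \forall x\in\mathrm{dom}(F). \tag{7}$$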

Assumptions 2.2 and 2.3 are very standard in stochastic optimization methods and required for any stochastic gradient-based methods for solving (1). The -average smoothness in (5) is in general weaker than the individual smoothness of each [56]. Note that we do not require the Lipschitz continuity of or as in some recent work, e.g. [15].

We also consider problem (2) under the following assumption, which covers the case $\psi = \delta_{\mathcal{X}}$, the indicator of a nonempty, closed, convex, and bounded set $\mathcal{X}$. This assumption will be used to develop algorithms for solving (2) using hybrid SVRG estimators.

Assumption 2.4.

The domain of is bounded, i.e.:

2.2 First-order optimality condition

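The optimality condition of (1) can be written as

$$0 \in \nabla f(x^{\star}) + \partial\psi(x^{\star}). \tag{8}$$

Any point $x^{\star}$ satisfying (8) is called a stationary point of (1). The same definition applies to (2).

Note that (8) can be written equivalently as

$$\mathcal{G}_{\eta}(x^{\star}) := \frac{1}{\eta}\Big(x^{\star} - \mathrm{prox}_{\eta\psi}\big(x^{\star} - \eta\nabla f(x^{\star})\big)\Big) = 0. \tag{9}$$

Here, $\mathcal{G}_{\eta}$ is called the gradient mapping of $F$ in (1) for any $\eta > 0$. It is obvious that if $\psi = 0$, then $\mathcal{G}_{\eta}(x) = \nabla f(x)$, the gradient of $f$, for any $x$. Our goal is to seek an $\varepsilon$-stationary point $\hat{x}$ of (1) or (2) defined as follows: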
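To make the gradient mapping concrete, the following minimal sketch evaluates $\frac{1}{\eta}\big(x - \mathrm{prox}_{\eta\psi}(x - \eta\nabla f(x))\big)$ for the illustrative choice $\psi = \lambda\|\cdot\|_1$. The helper names `soft_threshold` and `gradient_mapping`, and the choice of regularizer, are assumptions for illustration only; the gradient of $f$ is passed in as a precomputed list.

```python
import math


def soft_threshold(z, thr):
    """Scalar prox of thr * |.|: shrink z toward zero by thr."""
    return math.copysign(max(abs(z) - thr, 0.0), z)


def gradient_mapping(x, grad, eta, lam):
    """Gradient mapping (x - prox_{eta*lam*||.||_1}(x - eta*grad)) / eta.

    x    : current point (list of floats)
    grad : gradient of the smooth part f at x (list of floats)
    eta  : positive step-size
    lam  : weight of the l1 regularizer (lam = 0 means no regularizer)
    """
    prox_point = [soft_threshold(xi - eta * gi, eta * lam) for xi, gi in zip(x, grad)]
    return [(xi - pi) / eta for xi, pi in zip(x, prox_point)]


if __name__ == "__main__":
    # Without a regularizer the mapping is just the gradient.
    print(gradient_mapping([1.0, -2.0], [0.5, 0.25], eta=0.5, lam=0.0))
    # For f(x) = 0.5 * (x - 2)^2 and lam = 0.5, the point x = 1.5 is stationary.
    print(gradient_mapping([1.5], [1.5 - 2.0], eta=0.5, lam=0.5))
```

The second call illustrates the equivalence between the optimality condition and a vanishing gradient mapping on a one-dimensional instance.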

Definition 2.1.

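Given a desired accuracy $\varepsilon > 0$, a point $\hat{x}$ is said to be an $\varepsilon$-stationary point of (1) or (2) if

$$\mathbb{E}\big[\|\mathcal{G}_{\eta}(\hat{x})\|^2\big] \le \varepsilon^2. \tag{10}$$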

Here, the expectation is taken over all the randomness rendered from both and the algorithm.

Let us clarify why is an approximate stationary point of (1). Indeed, if , then means that . On the other hand, is equivalent to . Therefore, for some . Using the -average smoothness of , we have . This condition shows that is an approximate stationary point of (1).
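The argument in the preceding paragraph can be spelled out as a short derivation. This is a sketch under stated assumptions: $f$ is $L$-smooth (which follows from $L$-average smoothness by Jensen's inequality), the gradient mapping is $\mathcal{G}_{\eta}(x) := \frac{1}{\eta}\big(x - \mathrm{prox}_{\eta\psi}(x - \eta\nabla f(x))\big)$, and $\|\mathcal{G}_{\eta}(\hat{x})\| \le \varepsilon$. The symbols $\bar{x}$ and $e$ are introduced here only for illustration.

```latex
\begin{align*}
\bar{x} &:= \mathrm{prox}_{\eta\psi}\big(\hat{x} - \eta\nabla f(\hat{x})\big)
  \quad\Longrightarrow\quad
  \mathcal{G}_{\eta}(\hat{x}) = \tfrac{1}{\eta}(\hat{x} - \bar{x}), \\
\mathcal{G}_{\eta}(\hat{x}) - \nabla f(\hat{x}) &\in \partial\psi(\bar{x})
  \quad\text{(optimality condition of the prox subproblem)}, \\
e &:= \mathcal{G}_{\eta}(\hat{x}) + \nabla f(\bar{x}) - \nabla f(\hat{x})
  \;\in\; \nabla f(\bar{x}) + \partial\psi(\bar{x}), \\
\|e\| &\le \|\mathcal{G}_{\eta}(\hat{x})\| + L\|\bar{x} - \hat{x}\|
  = (1 + L\eta)\,\|\mathcal{G}_{\eta}(\hat{x})\|
  \le (1 + L\eta)\,\varepsilon .
\end{align*}
```

In words, the point $\bar{x}$ lies within distance $\eta\varepsilon$ of $\hat{x}$ and violates the optimality condition (8) by at most $(1 + L\eta)\varepsilon$.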

In practice, we often replace the condition (10) by which can avoid storing the iterate sequence .

3 Hybrid stochastic estimators

In this section, we propose new stochastic estimators for a generic function $G$ that can cover function values, gradient, and Hessian of any expectation function $f$ in (1).

3.1 The construction of hybrid stochastic estimators

Given a function $G(x) := \mathbb{E}_{\xi}[G_{\xi}(x)]$, where $G_{\xi}$ is a (vector-valued) stochastic function, we define the following stochastic estimator of $G$. As concrete examples, $G$ can be the gradient mapping $\nabla f$ of $f$ or the Hessian mapping $\nabla^2 f$ of $f$ in problem (1) or (2).

Definition 3.1.

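Let $u_t$ be an unbiased stochastic estimator of $G(x_t)$ formed by a realization $\zeta_t$ of $\xi$, i.e. $\mathbb{E}_{\zeta_t}[u_t] = G(x_t)$ at a given $x_t$. The following quantity:

$$v_t := \beta_{t-1}v_{t-1} + \beta_{t-1}\big[G_{\xi_t}(x_t) - G_{\xi_t}(x_{t-1})\big] + (1 - \beta_{t-1})u_t \tag{11}$$

is called a hybrid stochastic estimator of $G$ at $x_t$, where $\xi_t$ and $\zeta_t$ are two independent realizations of $\xi$ on $\Omega$ and $\beta_{t-1}\in[0, 1]$ is a given weight.

Clearly, if $\beta_{t-1} = 0$, then we obtain a simple unbiased stochastic estimator. If $\beta_{t-1} = 1$, then we obtain the SARAH-type estimator as studied in [50] but for a general function $G$. We are interested in the case $\beta_{t-1}\in(0, 1)$, which can be referred to as a hybrid recursive stochastic estimator.

We can rewrite $v_t$ as

$$v_t = \beta_{t-1}G_{\xi_t}(x_t) + (1 - \beta_{t-1})u_t + \beta_{t-1}\big[v_{t-1} - G_{\xi_t}(x_{t-1})\big].$$

The first two terms are two stochastic estimators evaluated at $x_t$, while the third term is the difference of the previous estimator and a stochastic estimator at the previous iterate. Here, since $\beta_{t-1}\in(0, 1)$, the main idea is to exploit more recent information than the old one.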
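The hybrid update can be sketched in a few lines of plain Python. The sketch assumes the estimator has the form $v_t = \beta v_{t-1} + \beta\,[G_{\xi_t}(x_t) - G_{\xi_t}(x_{t-1})] + (1-\beta)u_t$, i.e. a convex combination of a SARAH-type recursive term and an unbiased estimator $u_t$, as described in the text. The function name `hybrid_estimator` and its argument names are hypothetical; the caller is expected to supply the stochastic evaluations as lists.

```python
def hybrid_estimator(v_prev, g_xi_curr, g_xi_prev, u_curr, beta):
    """One hybrid update, entrywise over vectors given as lists.

    v_prev    : previous estimator v_{t-1}
    g_xi_curr : G_xi(x_t), evaluated with the sample xi_t
    g_xi_prev : G_xi(x_{t-1}), evaluated with the same sample xi_t
    u_curr    : unbiased estimator u_t of G(x_t), built from an independent sample
    beta      : weight in [0, 1]; 0 gives u_t, 1 gives the SARAH recursion
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [
        beta * vp + beta * (gc - gp) + (1.0 - beta) * u
        for vp, gc, gp, u in zip(v_prev, g_xi_curr, g_xi_prev, u_curr)
    ]


if __name__ == "__main__":
    v_prev, g_curr, g_prev, u = [1.0, 2.0], [0.5, 0.5], [0.25, 1.0], [4.0, -4.0]
    print(hybrid_estimator(v_prev, g_curr, g_prev, u, 0.0))  # unbiased estimator only
    print(hybrid_estimator(v_prev, g_curr, g_prev, u, 1.0))  # SARAH-type recursion
    print(hybrid_estimator(v_prev, g_curr, g_prev, u, 0.5))  # genuine hybrid
```

Note that the same sample is used for both evaluations in the recursive term, while `u_curr` comes from an independent sample; this independence is what separates the construction from a momentum-style average of past estimators.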

In fact, if $\beta_{t-1}\in[0, 1]$, then the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We consider three concrete examples of the unbiased estimator $u_t$ of $G(x_t)$ as follows:

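  • The classical stochastic estimator: $u_t := G_{\zeta_t}(x_t)$.

  • The SVRG estimator: $u_t := \bar{G}(\tilde{x}) + G_{\zeta_t}(x_t) - G_{\zeta_t}(\tilde{x})$, where $\bar{G}(\tilde{x})$ is a given unbiased snapshot evaluated at a given point $\tilde{x}$.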

  • The SAGA estimator: , where if and if .

While both the classical stochastic and SVRG estimators work for both expectation and finite-sum settings, the SAGA estimator currently works only for the finite-sum setting (2). Note that it is also possible to consider mini-batch and importance sampling settings for our hybrid estimators.

3.2 Properties of hybrid stochastic estimators

Let us first define

(12)

the $\sigma$-field generated by the history of realizations of $\xi$ up to the iteration $t$. We first prove in Appendix 1.1 the following property of the hybrid stochastic estimator $v_t$.

Lemma 3.1.

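Let $v_t$ be defined by (11). Then

$$\mathbb{E}_{(\xi_t,\zeta_t)}[v_t] = G(x_t) + \beta_{t-1}\big(v_{t-1} - G(x_{t-1})\big). \tag{13}$$

If $\beta_{t-1}\neq 0$, then $v_t$ is a biased estimator of $G(x_t)$. Moreover, we have

$$\begin{aligned}\mathbb{E}_{(\xi_t,\zeta_t)}\big[\|v_t - G(x_t)\|^2\big] ={}& \beta_{t-1}^2\|v_{t-1} - G(x_{t-1})\|^2 - \beta_{t-1}^2\|G(x_t) - G(x_{t-1})\|^2 \\ &+ \beta_{t-1}^2\mathbb{E}_{\xi_t}\big[\|G_{\xi_t}(x_t) - G_{\xi_t}(x_{t-1})\|^2\big] + (1 - \beta_{t-1})^2\mathbb{E}_{\zeta_t}\big[\|u_t - G(x_t)\|^2\big].\end{aligned} \tag{14}$$

Remark 3.1.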

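From (11), we can see that $v_t$ remains a biased estimator as long as $\beta_{t-1}\in(0, 1]$. Its biased term is

$$\big\|\mathbb{E}_{(\xi_t,\zeta_t)}[v_t - G(x_t)]\big\| = \beta_{t-1}\|v_{t-1} - G(x_{t-1})\|.$$

Clearly, when $\beta_{t-1} < 1$, the biased term of the estimator $v_t$ is smaller than the one in the SARAH estimator in [50], which is $\|v_{t-1} - G(x_{t-1})\|$.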
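The bias of the hybrid estimator can be checked exactly on a toy finite-sum instance by enumerating every sample pair. The sketch below assumes the estimator form $v_t = \beta v_{t-1} + \beta\,[G_{\xi_t}(x_t) - G_{\xi_t}(x_{t-1})] + (1-\beta)G_{\zeta_t}(x_t)$ with $\xi_t$ and $\zeta_t$ drawn independently and uniformly from $\{1,\dots,n\}$. The scalar components $G_i(x) = a_i x^2$ and all function names are hypothetical choices made for illustration.

```python
def G_i(i, x, a):
    """Hypothetical scalar component G_i(x) = a[i] * x^2."""
    return a[i] * x * x


def G(x, a):
    """Full function G(x): the average of the components."""
    return sum(G_i(i, x, a) for i in range(len(a))) / len(a)


def expected_hybrid(v_prev, x_prev, x_curr, beta, a):
    """Exact conditional expectation of the hybrid estimator.

    Averages the estimator over all n * n equally likely pairs (xi, zeta),
    which is what independent uniform sampling amounts to.
    """
    n = len(a)
    total = 0.0
    for xi in range(n):
        for zeta in range(n):
            recursive = v_prev + G_i(xi, x_curr, a) - G_i(xi, x_prev, a)
            unbiased = G_i(zeta, x_curr, a)
            total += beta * recursive + (1.0 - beta) * unbiased
    return total / (n * n)


if __name__ == "__main__":
    a = [1.0, 2.0, 4.0]
    v_prev, x_prev, x_curr, beta = 3.0, 1.0, 0.5, 0.9
    lhs = expected_hybrid(v_prev, x_prev, x_curr, beta, a)
    rhs = G(x_curr, a) + beta * (v_prev - G(x_prev, a))
    # The two numbers agree up to rounding: the bias is beta * (v_prev - G(x_prev)).
    print(lhs, rhs)
```

Setting `beta = 1.0` in this check recovers the SARAH bias $v_{t-1} - G(x_{t-1})$, and `beta = 0.0` gives an unbiased estimator.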

While the variance of the classical stochastic estimator can only be bounded by a constant, the variance of the SVRG estimator can be reduced by gradually changing the snapshot point. The following lemma shows this property, whose proof can be found, e.g. in [61].

Lemma 3.2.

Assume that is an SVRG estimator of . Then the following estimate holds:

(15)

If is -average Lipschitz continuous, i.e. for all , then we have

(16)

The following lemma bounds the variance of $v_t$ defined in (11). Its proof is given in Appendix 1.2.

Lemma 3.3.

Assume that is -average Lipschitz continuous and is a classical stochastic estimator of . Then, we have the following upper bound:

(17)

where the expectation is taken over all the randomness, and

(18)

3.3 Mini-batch hybrid stochastic estimators

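We can also consider a mini-batch hybrid recursive stochastic estimator of $G(x_t)$ defined as:

$$v_t := \beta_{t-1}v_{t-1} + \frac{\beta_{t-1}}{b_t}\sum_{\xi_i\in\mathcal{B}_t}\big[G_{\xi_i}(x_t) - G_{\xi_i}(x_{t-1})\big] + (1 - \beta_{t-1})u_t, \tag{19}$$

where $\beta_{t-1}\in[0, 1]$ and $\mathcal{B}_t$ is a mini-batch of size $b_t$ and independent of $u_t$.

Note that $u_t$ can also be a mini-batch unbiased estimator of $G(x_t)$. For example, $u_t := \frac{1}{\hat{b}_t}\sum_{\zeta_i\in\hat{\mathcal{B}}_t}G_{\zeta_i}(x_t)$ is a mini-batch unbiased stochastic estimator.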
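A mini-batch version of the hybrid update can be sketched by averaging the per-sample differences and the per-sample unbiased evaluations before combining them. The sketch assumes the mini-batch estimator is $v_t = \beta v_{t-1} + \frac{\beta}{b}\sum_{i}[G_{\xi_i}(x_t) - G_{\xi_i}(x_{t-1})] + (1-\beta)u_t$ with $u_t$ itself a mini-batch average. The names `mean_vec` and `minibatch_hybrid` are hypothetical, and the two batches are allowed to have different sizes.

```python
def mean_vec(vectors):
    """Entrywise mean of a non-empty list of equal-length vectors."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]


def minibatch_hybrid(v_prev, diffs, u_samples, beta):
    """Mini-batch hybrid update.

    v_prev    : previous estimator v_{t-1}
    diffs     : per-sample differences G_xi(x_t) - G_xi(x_{t-1}) over the first batch
    u_samples : per-sample evaluations G_zeta(x_t) over an independent second batch
    beta      : weight in [0, 1]
    """
    d_bar = mean_vec(diffs)      # averaged recursive correction
    u_bar = mean_vec(u_samples)  # mini-batch unbiased estimator
    return [
        beta * vp + beta * d + (1.0 - beta) * u
        for vp, d, u in zip(v_prev, d_bar, u_bar)
    ]


if __name__ == "__main__":
    v = minibatch_hybrid(
        v_prev=[1.0],
        diffs=[[0.5], [1.5]],             # batch of size 2
        u_samples=[[2.0], [4.0], [6.0]],  # batch of size 3
        beta=0.5,
    )
    print(v)
```

With both batches of size one, this reduces to the single-sample hybrid estimator.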

For defined by (19), we have the following property, whose proof is in Appendix 1.3.

Lemma 3.4.

Let be the mini-batch stochastic estimator of defined by (19), where is also a mini-batch unbiased stochastic estimator of with such that is independent of . Then, the following estimate holds:

(20)

where if is finite i.e. , and , otherwise i.e. .

Similar to Lemma 4.1, we can bound the variance of the mini-batch hybrid estimator from (19) in the following lemma, whose proof is in Appendix 1.4. For simplicity of presentation, we choose and .

Lemma 3.5.

Assume that is -average Lipschitz continuous and is a mini-batch unbiased estimator as , is given in (19), and are mini-batches of sizes and , respectively for all . Then, we have the following upper bound on the variance :

(21)

where the expectation is taking over all the randomness , and , , and are defined in (18). Here, and if is finite and and , otherwise.

The theoretical results developed in Section 3 are self-contained. They can be specialized to develop stochastic optimization methods for solving (1) and (2). In the next sections, we only exploit these properties for $G = \nabla f$ to develop stochastic gradient-type methods.

4 Hybrid SARAH-SGD Algorithms

In this section, we utilize our hybrid stochastic estimator above with $G := \nabla f$ to develop new stochastic gradient algorithms for solving (1) and its finite-sum setting (2).

4.1 The single-loop algorithm

Our first algorithm is a single-loop stochastic proximal-gradient scheme for solving (1). This algorithm is described in detail in Algorithm 1.

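1: Initialization: An initial point $x_0$.
2: Input the parameters $\tilde{b}$, $\beta_t\in(0, 1)$, $\gamma_t$, and $\eta_t$ (will be specified later).
3: Generate an unbiased estimator $v_0 := \frac{1}{\tilde{b}}\sum_{\tilde{\xi}_i\in\tilde{\mathcal{B}}}\nabla f(x_0;\tilde{\xi}_i)$ at $x_0$ using a mini-batch $\tilde{\mathcal{B}}$ of size $\tilde{b}$.
4: Update $\hat{x}_1 := \mathrm{prox}_{\eta_0\psi}(x_0 - \eta_0 v_0)$ and $x_1 := (1 - \gamma_0)x_0 + \gamma_0\hat{x}_1$.
5: For $t := 1, \dots, m$ do
6:     Generate a proper sample pair $(\xi_t, \zeta_t)$ independently (single sample or mini-batch).
7:     Evaluate $v_t := \beta_{t-1}v_{t-1} + \beta_{t-1}\big(\nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t)\big) + (1 - \beta_{t-1})\nabla f(x_t;\zeta_t)$.
8:     Update $\hat{x}_{t+1} := \mathrm{prox}_{\eta_t\psi}(x_t - \eta_t v_t)$ and $x_{t+1} := (1 - \gamma_t)x_t + \gamma_t\hat{x}_{t+1}$.
9: EndFor
10: Choose $\bar{x}_m$ from $\{x_0, x_1, \dots, x_m\}$ (at random or deterministic, specified later).
Algorithm 1 (Hybrid stochastic gradient descent (Hybrid-SGD) algorithm)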
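The two stages of Algorithm 1 can be sketched in plain Python on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the components are $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, the regularizer is $\psi = \lambda\|\cdot\|_1$, the parameters $\eta$, $\gamma$, $\beta$ are held constant, and the last iterate is returned instead of a point chosen from the whole sequence. All names (`hybrid_sgd`, `centers`, `objective`, and so on) are hypothetical.

```python
import math
import random


def soft_threshold_vec(z, thr):
    """Prox of thr * ||.||_1, applied entrywise."""
    return [math.copysign(max(abs(zj) - thr, 0.0), zj) for zj in z]


def grad_sample(x, c):
    """Gradient of the sample loss 0.5 * ||x - c||^2."""
    return [xj - cj for xj, cj in zip(x, c)]


def objective(x, centers, lam):
    """F(x) = (1/n) * sum_i 0.5 * ||x - c_i||^2 + lam * ||x||_1."""
    n = len(centers)
    f = sum(0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in centers) / n
    return f + lam * sum(abs(xj) for xj in x)


def hybrid_sgd(x0, centers, lam, eta, gamma, beta, init_batch, iters, seed=0):
    rng = random.Random(seed)
    # Stage 1: mini-batch estimate v_0 of the initial gradient, then one prox/averaging step.
    batch = [rng.choice(centers) for _ in range(init_batch)]
    v = [sum(col) / init_batch for col in zip(*(grad_sample(x0, c) for c in batch))]
    x_prev = list(x0)
    x_hat = soft_threshold_vec([xj - eta * vj for xj, vj in zip(x_prev, v)], eta * lam)
    x = [(1.0 - gamma) * xj + gamma * hj for xj, hj in zip(x_prev, x_hat)]
    # Stage 2: single loop using the hybrid SARAH-SGD estimator.
    for _ in range(iters):
        c_xi, c_zeta = rng.choice(centers), rng.choice(centers)  # independent samples
        g_curr, g_prev = grad_sample(x, c_xi), grad_sample(x_prev, c_xi)
        u = grad_sample(x, c_zeta)
        v = [beta * vj + beta * (gc - gp) + (1.0 - beta) * uj
             for vj, gc, gp, uj in zip(v, g_curr, g_prev, u)]
        x_hat = soft_threshold_vec([xj - eta * vj for xj, vj in zip(x, v)], eta * lam)
        x_prev, x = x, [(1.0 - gamma) * xj + gamma * hj for xj, hj in zip(x, x_hat)]
    return x


if __name__ == "__main__":
    data_rng = random.Random(1)
    centers = [[2.0 + data_rng.gauss(0.0, 0.5), -1.0 + data_rng.gauss(0.0, 0.5)]
               for _ in range(50)]
    x = hybrid_sgd([0.0, 0.0], centers, lam=0.5, eta=0.5, gamma=0.5,
                   beta=0.9, init_batch=10, iters=200)
    print(objective([0.0, 0.0], centers, 0.5), objective(x, centers, 0.5))
```

When all samples coincide, the estimator equals the exact gradient at every step, and the loop reduces to a damped proximal-gradient method; this gives a convenient deterministic sanity check for the sketch.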

Algorithm 1 is different from existing SGD methods in the following respects:

  • Firstly, it starts with a relatively large mini-batch to compute an initial estimate for the initial gradient $\nabla f(x_0)$. This is quite different from existing methods, which often use single-sample, mini-batch, or increasing mini-batch sizes for the whole algorithm (e.g. [29]), and do not separate into two stages as in Algorithm 1:

    • Stage 1: Step 3 and Step 4.

    • Stage 2: Step 5 to Step 8.

    The idea behind this difference is to find a good stochastic approximation $v_0$ of $\nabla f(x_0)$ to move on.

  • Secondly, Algorithm 1 adopts the idea of ProxSARAH in [56] with two steps in $\hat{x}_t$ and $x_t$ to handle the composite forms. This is different from existing methods, as well as from methods for non-composite problems, in that two step-sizes $\eta_t$ and $\gamma_t$ are used. While the first step on $\hat{x}_t$ is a standard proximal-gradient step, the second one on $x_t$ is an averaging step. If $\psi = 0$, i.e. in the non-composite problems, then Steps 4 and 8 reduce to

    $$x_{t+1} := x_t - \gamma_t\eta_t v_t.$$

    Therefore, the product $\gamma_t\eta_t$ can be viewed as a combined step-size of Algorithm 1. Note that by using $\frac{1}{\eta_t}(x_t - \hat{x}_{t+1})$ to approximate the gradient mapping defined by (9), we can rewrite the main-step of Algorithm 1 as