In this paper, we consider the following stochastic composite and possibly nonconvex optimization problem, which is widely studied in the literature:
where is a stochastic function such that for each ,, while for each realization , is smooth on ; is the expectation of the random function over on ; and is a proper, closed, and convex function.
In addition to (1), we also consider the following composite finite-sum problem:
is a uniform distribution on . If is extremely large, so that evaluating the full gradient and the function value is expensive, then, as usual, we refer to this setting as the online model.
If the regularizer is absent, then we obtain a smooth problem which has been widely studied in the literature. As another special case, if is the indicator of a nonempty, closed, and convex set , i.e. , then (1) also covers constrained nonconvex optimization problems.
1.1 Our goals, approach, and contribution
Our goals: We aim to develop simple stochastic optimization algorithms that achieve the best-known oracle complexity bounds under standard assumptions used in existing methods. In this paper, we only focus on stochastic gradient descent-type (SGD) variants. We are also interested in both oracle complexity bounds and implementation aspects. The ultimate goal is to design simple algorithms that are easy to implement and require less parameter tuning effort.
Our approach: Our approach relies on a so-called “hybrid” idea which merges two existing stochastic estimators through a convex combination to design a “hybrid” offspring that inherits the advantages of its underlying estimators. We will focus on the hybrid estimators formed from the SARAH (a recursive stochastic estimator) introduced in 
and any given unbiased estimator such as the SGD, SVRG , or SAGA  estimator. For the sake of presentation, we only focus on either the SGD or the SVRG estimator in this paper. We emphasize that our method is fundamentally different from momentum or exponential moving average-type methods such as those in [15, 34], since we use two independent estimators instead of a combination of the past and current estimators.
While our hybrid estimators are biased, fortunately, they enjoy some useful properties for developing new algorithms. One important feature is the variance-reduction property, which often allows us to use a large or even constant step-size in stochastic methods. Whereas the majority of stochastic algorithms rely on unbiased estimators such as SGD, SVRG, and SAGA, recent evidence has interestingly shown that biased estimators such as the SARAH, biased SAGA, or biased SVRG estimators can lead to comparable or even better algorithms in terms of oracle complexity bounds as well as empirical performance; see, e.g. [19, 22, 53, 56, 68].
Our approach, on the one hand, can be extended to study second-order methods such as cubic regularization and subsampled schemes as in [8, 21, 63, 69, 71, 75]. The main idea is to exploit hybrid estimators to approximate both gradient and Hessian of the objective function similar to [69, 71, 75]. On the other hand, it can be applied to approximate a second-order stationary point of (1) and (2). The idea is to integrate our methods with a negative curvature search such as Oja’s algorithm  or Neon2 , or to employ perturbed/noise gradient techniques such as [23, 26, 38] in order to approximate a second-order stationary point. However, to avoid overloading this paper, we leave these extensions for our future work.
Our contribution: The contribution of this paper can be summarized as follows:
We first introduce a “hybrid” approach to merge two existing stochastic estimators in order to form a new one. Such a new estimator can be viewed as a convex combination of a biased estimator and an unbiased one to inherit the advantages of its underlying estimators. Although we only focus on a convex combination between SARAH  and either SGD  or SVRG  estimator, our approach can be extended to cover other possibilities. Given such new hybrid estimators, we develop several fundamental properties that can be useful for developing new stochastic optimization algorithms.
Next, we employ our new hybrid SARAH-SGD estimator to develop a novel stochastic proximal gradient algorithm, Algorithm 1, to solve (1). This algorithm achieves an -oracle complexity bound. To the best of our knowledge, this is the first SGD variant that achieves such an oracle complexity bound without using a double loop or check-points as in SVRG or SARAH, or requiring an -table to store gradient components as in SAGA-type methods.
Then, we derive two different variants of Algorithm 1: an adaptive step-size scheme and a double-loop scheme. Both variants have the same complexity as Algorithm 1. We also propose a mini-batch variant of Algorithm 1 and provide a trade-off analysis between mini-batch sizes and the choice of step-sizes to obtain better practical performance.
Finally, we design a hybrid SARAH-SVRG estimator and use it to develop new stochastic variants for solving the composite finite-sum problem (2). These variants also achieve the best-known complexity bounds while having new properties compared to existing methods.
Let us emphasize the following additional points of our contribution. Firstly, the new algorithm, Algorithm 1, is rather different from existing SGD methods. It first forms a mini-batch stochastic gradient estimator at a given initial point to provide a good approximation to the initial gradient of . Then, it performs a single loop to update the iterate sequence, consisting of two steps: a proximal-gradient step and an averaging step, where our hybrid estimator is used.
Secondly, our methods work with both single samples and mini-batches, and achieve the best-known complexity bounds in both cases. This is different from some existing methods, such as SVRG-type methods and SpiderBoost, which only achieve the best complexity under certain choices of parameters. Our methods are also flexible in choosing different mini-batch sizes for the hybrid components to achieve different complexity bounds and to adjust performance. For instance, in Algorithm 1, we can use a single sample in the SARAH estimator while using a mini-batch in the SGD estimator, which leads to a different trade-off in the choice of the weight.
Finally, our theoretical results on hybrid estimators are also self-contained and independent. As we have mentioned, they can be used to develop other stochastic algorithms such as second-order methods or perturbed SGD schemes. We believe that they can also be used in other problems such as composition and constrained optimization [16, 44, 67].
1.2 Related work
However, due to applications in deep learning, large-scale nonconvex optimization problems have attracted huge attention in recent years [30, 37]. Numerical methods for solving these problems heavily rely on two approaches: deterministic and stochastic, ranging from first-order to second-order methods. Notable first-order methods include stochastic gradient descent-type methods, conditional gradient descent , and primal-dual schemes . In contrast, advanced second-order methods consist of quasi-Newton, trust-region, sketching Newton, subsampled Newton, and cubic regularized Newton-based methods; see, e.g. [11, 48, 57, 63].
In terms of stochastic first-order algorithms, interest in stochastic gradient descent methods and their variants has grown tremendously over the last fifteen years. SGD-based algorithms can be classified into two categories: non-variance-reduction and variance-reduction schemes. The classical SGD method was studied in the early work of Robbins & Monro, but its convergence rate was only later investigated in  through new robust variants. Ghadimi & Lan extended SGD to nonconvex settings and analyzed its complexity in . Other extensions of SGD can be found in the literature, including [4, 16, 20, 25, 28, 31, 34, 35, 45, 51, 58].
Alternatively, variance reduction-based methods have been intensively studied in recent years for both convex and nonconvex settings. Apart from mini-batch and importance sampling schemes [29, 73], the following methods are the most notable. The first class of algorithms is based on the SAG estimator , including SAGA variants . The second is SVRG  and its variants such as Katyusha , MiG , and many others [39, 60]. The third class relies on SARAH , including SPIDER , SpiderBoost , ProxSARAH , and momentum variants . Other approaches such as Catalyst  and SDCA  have also been proposed.
In terms of theory, many researchers have focused on theoretical aspects of existing algorithms. For example,  appeared as one of the first remarkable works studying convergence rates of stochastic gradient descent-type methods for nonconvex, non-composite finite-sum problems. This analysis was later extended to the composite setting in . The authors of  also investigated the gradient-dominant case, and  considered both finite-sum and composite finite-sum problems under different assumptions. Whereas many researchers have tried to improve the complexity upper bounds of stochastic first-order methods using different techniques [5, 6, 7, 22], other works have constructed examples establishing lower complexity bounds. Among these works, the upper oracle complexity bounds have been substantially improved, and some results match the lower complexity bounds in both convex and nonconvex settings [5, 4, 22, 27, 39, 40, 56, 60, 68, 76]. We refer to Table 1 for some notable examples of stochastic gradient-type methods for solving (1) and (2) and their non-composite settings.
In the convex case, there exist numerous research papers, including [1, 2, 12, 24, 47, 49, 70], that study lower complexity bounds. In [22, 74], the authors constructed a lower complexity bound for nonconvex finite-sum problems covered by (2). They showed that the lower complexity bound for any stochastic gradient method relying only on the smoothness assumption to achieve an -stationary point in expectation is . For the expectation problem (1), the best-known complexity bound to obtain an -stationary point in expectation is , as shown in [22, 68], where is an upper bound on the variance (see Assumption 2.3). Unfortunately, we are not aware of any lower complexity bound for the nonconvex setting of (1) under standard assumptions from the literature.
While numerical stochastic algorithms for solving the non-composite setting, i.e. , are well developed and have received considerable attention [5, 6, 7, 22, 40, 52, 53, 54, 60, 76], methods for the composite setting remain limited [60, 68]. In this paper, we develop a novel approach to design stochastic optimization algorithms for solving the composite problems (1) and (2). Our approach is rather different from existing ones, and we call it a “hybrid” approach.
Let us compare our algorithms and existing methods in the following aspects:
Single-loop vs. multiple-loop:
As mentioned, we aim at developing practical methods that are easy to implement. One of the major differences between our methods and existing state-of-the-art algorithms is the algorithmic style: single-loop vs. multiple-loop. As discussed in several works, including , single-loop methods have some advantages over double-loop methods, including fewer parameters to tune. The single-loop style includes SGD, SAGA, and their variants [17, 18, 27, 46, 58, 62, 64], while the double-loop style comprises SVRG, SARAH, and their variants [32, 50]. Other algorithms such as Natasha  or Natasha1.5  even have three loops. Let us compare these methods in detail as follows:
SGD and SAGA-type methods use a single loop, but SAGA-type algorithms maintain an -matrix of individual gradients, which can be very large if and are large. In addition, SAGA has not yet been applied to solve (1). Our first algorithm, Algorithm 1, uses a single loop as SGD and SAGA do, and does not require heavy memory storage. However, to apply it to (2), it still requires either an additional assumption or a check-point compared to SAGA. But when solving (1), it relies on the same assumptions as SGD. In terms of complexity, Algorithm 1 is much better than SGD. To the best of our knowledge, Algorithm 1 is the first single-loop SGD variant that achieves the best-known complexity. Another related work is , which uses a momentum approach but requires an additional bounded-gradient assumption to achieve a complexity similar to that of Algorithm 1.
Note that double-loop or multiple-loop methods require tuning more parameters, such as epoch lengths and possibly the mini-batch size at the snapshot points. Although Algorithm 3 is loopless, it can be viewed as a double-loop variant. This algorithm has a different complexity bound than existing methods.
Single-sample and mini-batch:
Our methods work with both single samples and mini-batches, and in both cases they achieve the best-known complexity bounds. This is different from some existing methods, such as SVRG- or SARAH-based methods [60, 68], where the best complexity is only obtained under the best parameter configuration.
Algorithm 1 and its variants all achieve the best-known complexity bounds, as in [56, 68], for solving (1). Early methods such as Natasha  and Natasha1.5 , which are based on the SVRG estimator, often achieve a complexity of for solving (1) and for solving (2). By combining with additional sophisticated tricks, these complexity bounds can be slightly improved. For instance, Natasha  or Natasha1.5  can achieve in the finite-sum case and in the expectation case, but they require three loops with several parameter adjustments that are difficult to tune in practice. SNVRG  exploits a dynamic epoch length, as used in , to improve its complexity bounds, but it also requires a complicated parameter selection procedure. To achieve better complexity bounds, SARAH-based methods have been studied in [22, 53, 56, 68]. Their complexity matches the lower bound in the finite-sum case, as indicated in [22, 56].
1.4 Paper organization
The rest of this paper is organized as follows. Section 2 discusses the main assumptions of our problems (1) and (2), and their optimality conditions. Section 3 develops new hybrid stochastic estimators and investigates their properties. We consider both single-sample and mini-batch cases. Section 4 studies a new class of hybrid gradient algorithms to solve both (1) and (2). We develop three different variants of hybrid algorithms and analyze their convergence and complexity estimates. Section 5 extends our algorithms to mini-batch cases. Section 6 is devoted to investigating hybrid SARAH-SVRG methods to solve the finite-sum problem (2). Section 7 gives several numerical examples and compares our methods with existing state-of-the-art algorithms. For the sake of presentation, all technical proofs are provided in the appendix.
2 Basic assumptions and optimality condition
Notation and basic concepts:
We work with the Euclidean spaces, and equipped with standard inner product and norm . For any function , denotes the effective domain of . If is continuously differentiable, then denotes its gradient. If, in addition, is twice continuously differentiable, then denotes its Hessian.
For a stochastic function defined on a probability space , we use to denote the expectation of w.r.t. on . We also overload the notation to express the expectation w.r.t. a realization in both single-sample and mini-batch cases. Given a finite set , we denote if for and . If for , then we write
by dropping the probability distribution.
Given a random mapping
depending on a random vector, we say that is -average Lipschitz continuous if for all , where is called the Lipschitz constant of . If is a deterministic function, then this condition becomes which states that is -Lipschitz continuous. In particular, if this condition holds for , then we say that is -smooth.
For a proper, closed, and convex function , denotes its subdifferential at , and denotes its proximal operator. If is the indicator of a nonempty, closed, and convex set , then reduces to the projection onto . We say that is -weakly convex if for all and , where is a given constant. Clearly, a weakly convex function is not necessarily convex. However, any -smooth function is -weakly convex. If is -weakly convex, then is -strongly convex if . Therefore, the proximal operator is well-defined and single-valued if . Note that is non-expansive, i.e. for all .
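To make these objects concrete, the following sketch implements two common proximal operators, soft-thresholding for a scaled l1-norm and projection onto a box, and checks the non-expansiveness property numerically. The specific choice of g, the weight lam, and the test points are our own illustrative assumptions, not objects from the paper.

```python
import numpy as np

def prox_l1(x, lam):
    """prox of g(x) = lam * ||x||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proj_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^d, i.e. the projection onto it."""
    return np.clip(x, lo, hi)

# Non-expansiveness: ||prox(x) - prox(y)|| <= ||x - y||.
x = np.array([1.5, -0.2, 0.7])
y = np.array([-0.5, 0.3, 2.0])
assert np.linalg.norm(prox_l1(x, 0.5) - prox_l1(y, 0.5)) <= np.linalg.norm(x - y)
```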
If is a matrix, then is the spectral norm of , and the inner product of two matrices and is defined as . Also, stands for the set of positive integers, and . Given , denotes the largest integer less than or equal to . We also use to express complexity bounds of algorithms.
2.1 Fundamental assumptions
Our algorithms developed in the sequel rely on the following fundamental assumptions:
This assumption is fundamental and required for any algorithm. Here, since is proper, closed, and convex, its proximal operator is well-defined, single-valued, and non-expansive. We assume that this proximal operator can be computed exactly.
Assumption 2.2 (-average smoothness).
Assumption 2.3 (Bounded variance).
There exists such that
The bounded variance condition for (2) becomes
Assumptions 2.2 and 2.3 are standard in stochastic optimization and required for any stochastic gradient-based method for solving (1). The -average smoothness in (5) is in general weaker than the individual smoothness of each . Note that we do not require the Lipschitz continuity of or as in some recent work, e.g. .
We also consider problem (2) under the following assumption, which covers the case , the indicator of a nonempty, closed, convex, and bounded set . This assumption will be used to develop algorithms for solving (2) using hybrid SVRG estimators.
The domain of is bounded, i.e.:
2.2 First-order optimality condition
The optimality condition of (1) can be written as
Note that (8) can be written equivalently to
Let us clarify why is an approximate stationary point of (1). Indeed, if , then means that . On the other hand, is equivalent to . Therefore, for some . Using the -average smoothness of , we have . This condition shows that is an approximate stationary point of (1).
In practice, we often replace the condition (10) by which can avoid storing the iterate sequence .
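When the proximal operator of the regularizer is available in closed form, the gradient-mapping stationarity measure above can be computed explicitly. The sketch below is our own illustration: it takes g to be a scaled l1-norm with a hypothetical weight lam, f a simple quadratic, and verifies that the mapping vanishes at the closed-form minimizer.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def gradient_mapping(x, grad_f, eta, lam):
    """G_eta(x) = (x - prox_{eta*g}(x - eta*grad_f(x))) / eta, with g = lam*||.||_1."""
    x_plus = soft_threshold(x - eta * grad_f(x), eta * lam)
    return (x - x_plus) / eta

# Example: f(x) = 0.5*||x - b||^2 with g = lam*||.||_1 has the closed-form
# minimizer x_star = soft_threshold(b, lam); the gradient mapping vanishes there.
b = np.array([2.0, -0.3, 0.0])
lam, eta = 0.5, 0.8
grad_f = lambda x: x - b
x_star = soft_threshold(b, lam)
```

An iterate with a small gradient-mapping norm is then an approximate stationary point in the sense discussed above.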
3 Hybrid stochastic estimators
In this section, we propose new stochastic estimators for a generic function that can cover function values, gradients, and Hessians of any expectation function in (1).
3.1 The construction of hybrid stochastic estimators
Given a function , where is a (vector) stochastic function from . We define the following stochastic estimator of . As concrete examples, can be the gradient mapping of or the Hessian mapping of in problem (1) or (2).
Let be an unbiased stochastic estimator of formed by a realization of , i.e. at a given . The following quantity:
is called a hybrid stochastic estimator of at , where and are two independent realizations of on and is a given weight.
Clearly, if , then we obtain a simple unbiased stochastic estimator. If , then we obtain the SARAH-type estimator as studied in  but for general function . We are interested in the case , which can be referred to as a hybrid recursive stochastic estimator.
We can rewrite as
The first two terms are two stochastic estimators evaluated at , while the third term is the difference between the previous estimator and a stochastic estimator at the previous iterate. Here, since , the main idea is to put more weight on recent information than on old information.
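In gradient form, one step of this hybrid recursion can be sketched as follows. The finite-sum least-squares objective, the single-sample scheme, and the fixed weight beta are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)

def grad_i(x, i):
    """Component gradient of f(x) = (1/2n) * ||A x - b||^2."""
    return A[i] * (A[i] @ x - b[i])

def hybrid_step(v_prev, x_t, x_prev, beta):
    """One update of the hybrid estimator: a convex combination of a SARAH-type
    recursive term and an independent unbiased (SGD) term."""
    zeta = rng.integers(n)   # sample for the recursive (SARAH) part
    xi = rng.integers(n)     # independent sample for the unbiased (SGD) part
    sarah = v_prev + grad_i(x_t, zeta) - grad_i(x_prev, zeta)
    sgd = grad_i(x_t, xi)
    return beta * sarah + (1.0 - beta) * sgd

# beta = 1 recovers the SARAH estimator; beta = 0 recovers plain SGD.
```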
In fact, if , then the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We consider three concrete examples of the unbiased estimator of as follows:
The classical stochastic estimator: .
The SVRG estimator: , where is a given unbiased snapshot evaluated at a given point .
The SAGA estimator: , where if and if .
While both the classical stochastic and SVRG estimators work in both the expectation and finite-sum settings, the SAGA estimator currently works only in the finite-sum setting (2). Note that it is also possible to consider mini-batch and importance sampling settings for our hybrid estimators.
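The three unbiased estimators listed above can be contrasted in a few lines. The least-squares components, the snapshot point x_snap, and the gradient table below are hypothetical stand-ins used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

x = rng.standard_normal(d)
i = int(rng.integers(n))

# Classical (SGD) estimator: a single sampled component gradient.
u_sgd = grad_i(x, i)

# SVRG estimator: correct the sample with a full gradient at a snapshot x_snap.
x_snap = np.zeros(d)
u_svrg = grad_i(x, i) - grad_i(x_snap, i) + full_grad(x_snap)

# SAGA estimator: replace the snapshot with a table of stored component gradients.
table = np.stack([grad_i(x_snap, j) for j in range(n)])   # O(n*d) memory
u_saga = grad_i(x, i) - table[i] + table.mean(axis=0)
table[i] = grad_i(x, i)   # update the i-th entry after use
```

Averaging any of these estimators over all indices recovers the full gradient, which is the unbiasedness property used throughout this section.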
3.2 Properties of hybrid stochastic estimators
Let us first define
the -field generated by the history of realizations of up to the iteration . We first prove in Appendix 1.1 the following property of the hybrid stochastic estimator .
Let be defined by (11). Then
If , then is a biased estimator of . Moreover, we have
While the variance of can only be bounded by a constant , the variance of can be reduced by gradually changing the snapshot . The following lemma shows this property, whose proof can be found, e.g. in .
Assume that is an SVRG estimator of . Then the following estimate holds:
If is -average Lipschitz continuous, i.e. for all , then we have
Assume that is -average Lipschitz continuous and is a classical stochastic estimator of . Then, we have the following upper bound:
where the expectation is taken over all the randomness , and
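A quick Monte Carlo experiment illustrates the variance behavior this bound captures: when consecutive iterates are close, the recursive part of the hybrid estimator has small error, so a weight close to 1 sharply reduces the mean-squared error relative to plain SGD. The problem instance, step length, and parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

x0 = rng.standard_normal(d)
x1 = x0 - 0.01 * full_grad(x0)   # a small step, so the recursive correction is small
v0 = full_grad(x0)               # assume the previous estimate is exact

def mse(beta, trials=5000):
    """Monte Carlo estimate of E||v_1 - grad f(x_1)||^2 for the hybrid estimator."""
    g, err = full_grad(x1), 0.0
    for _ in range(trials):
        zeta, xi = rng.integers(n), rng.integers(n)
        v1 = beta * (v0 + grad_i(x1, zeta) - grad_i(x0, zeta)) \
             + (1.0 - beta) * grad_i(x1, xi)
        err += np.sum((v1 - g) ** 2)
    return err / trials

# With a small step, beta close to 1 keeps the error far below plain SGD (beta = 0).
```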
3.3 Mini-batch hybrid stochastic estimators
We can also consider a mini-batch hybrid recursive stochastic estimator of defined as:
where and is a mini-batch of size and independent of .
Note that can also be a mini-batch unbiased estimator of . For example, is a mini-batch unbiased stochastic estimator.
Let be the mini-batch stochastic estimator of defined by (19), where is also a mini-batch unbiased stochastic estimator of with such that is independent of . Then, the following estimate holds:
where if is finite i.e. , and , otherwise i.e. .
Assume that is -average Lipschitz continuous and is a mini-batch unbiased estimator as , is given in (19), and are mini-batches of sizes and , respectively for all . Then, we have the following upper bound on the variance :
where the expectation is taken over all the randomness , and , , and are defined in (18). Here, and if is finite, and and , otherwise.
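The mini-batch construction (19) with two independent batches can be sketched as follows. The batch sizes b1 and b2 and the least-squares components are illustrative assumptions; sampling without replacement keeps each batch average an unbiased estimator of the corresponding gradient.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)

def grad_batch(x, B):
    """Mini-batch gradient of f(x) = (1/2n)||A x - b||^2 over index set B."""
    r = A[B] @ x - b[B]
    return (A[B] * r[:, None]).mean(axis=0)

def hybrid_minibatch(v_prev, x_t, x_prev, beta, b1=5, b2=10):
    """One mini-batch hybrid update: batch B drives the recursive part, and an
    independent batch B_hat drives the unbiased part."""
    B = rng.choice(n, size=b1, replace=False)
    B_hat = rng.choice(n, size=b2, replace=False)
    sarah = v_prev + grad_batch(x_t, B) - grad_batch(x_prev, B)
    return beta * sarah + (1.0 - beta) * grad_batch(x_t, B_hat)
```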
4 Hybrid SARAH-SGD Algorithms
4.1 The single-loop algorithm
Algorithm 1 differs from existing SGD methods in the following points:
Firstly, it starts with a relatively large mini-batch to compute an initial estimate of the initial gradient . This is quite different from existing methods, which often use single-sample, mini-batch, or increasing mini-batch sizes throughout the whole algorithm (e.g. ) and do not separate it into two stages as in Algorithm 1:
The idea behind this difference is to find a good stochastic approximation for to move on.
Secondly, Algorithm 1 adopts the idea of ProxSARAH in , with two steps in and , to handle the composite form. This differs from existing methods, as well as from methods for non-composite problems, where two step-sizes and are used. While the first step on is a standard proximal-gradient step, the second one on is an averaging step. If , i.e. in non-composite problems, then Steps 4 and 8 reduce to
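Putting the pieces together, a minimal sketch of the two-stage, single-loop scheme described above might look as follows. The constant step-size eta, averaging weight gamma, hybrid weight beta, initial batch size b0, and the planted l1-regularized regression instance are all illustrative choices, not the theoretical parameter schedules analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 400, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d) / np.sqrt(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)   # planted regression problem
lam = 0.001                                      # weight of g(x) = lam * ||x||_1
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def hybrid_sgd(T=3000, beta=0.95, eta=0.05, gamma=0.8, b0=100):
    x = np.zeros(d)
    # Stage 1: a larger initial mini-batch gives a good estimate of grad f(x_0).
    B0 = rng.choice(n, size=b0, replace=False)
    v = np.mean([grad_i(x, i) for i in B0], axis=0)
    # Stage 2: a single loop of proximal-gradient + averaging steps.
    for _ in range(T):
        y = prox_g(x - eta * v, eta * lam)        # proximal-gradient step
        x_new = (1.0 - gamma) * x + gamma * y     # averaging step
        # hybrid SARAH-SGD update of the gradient estimate
        zeta, xi = rng.integers(n), rng.integers(n)
        v = beta * (v + grad_i(x_new, zeta) - grad_i(x, zeta)) \
            + (1.0 - beta) * grad_i(x_new, xi)
        x = x_new
    return x

x_out = hybrid_sgd()
```

Note the single loop: there are no check-points or epochs, and the only memory beyond the iterate is the current estimate v.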