1 Introduction
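In this paper, we consider the following stochastic composite and possibly nonconvex optimization problem, which is widely studied in the literature:
$$\min_{x\in\mathbb{R}^p}\Big\{ F(x) := f(x) + \psi(x) \equiv \mathbb{E}_{\xi}\big[f(x;\xi)\big] + \psi(x) \Big\}, \tag{1}$$
where $f(\cdot;\cdot) : \mathbb{R}^p\times\Omega\to\mathbb{R}$ is a stochastic function such that for each $x\in\mathbb{R}^p$, $f(x;\cdot)$ is a random variable in a given probability space $(\Omega,\mathbb{P})$, while for each realization $\xi\in\Omega$, $f(\cdot;\xi)$ is smooth on $\mathbb{R}^p$; $f(x) := \mathbb{E}_{\xi}[f(x;\xi)]$ is the expectation of the random function $f(x;\xi)$ over $\xi$ on $\Omega$; and $\psi : \mathbb{R}^p\to\mathbb{R}\cup\{+\infty\}$ is a proper, closed, and convex function.
In addition to (1), we also consider the following composite finite-sum problem:
$$\min_{x\in\mathbb{R}^p}\Big\{ F(x) := f(x) + \psi(x) \equiv \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \Big\}, \tag{2}$$
where $f_i : \mathbb{R}^p\to\mathbb{R}$ for $i\in[n]$ are all smooth functions. Problem (2) can be considered as a special case of (1) where $\Omega := \{\xi_1,\dots,\xi_n\}$ and $\mathbb{P}$ is a uniform distribution on $\Omega$. If $n$ is so large that evaluating the full gradient $\nabla f(x)$ and the function value $f(x)$ is expensive, then, as usual, we refer to this setting as the online model.
If the regularizer $\psi$ is absent, then we obtain a smooth problem which has been widely studied in the literature. As another special case, if $\psi$ is the indicator of a nonempty, closed, and convex set $\mathcal{X}$, i.e. $\psi := \delta_{\mathcal{X}}$, then (1) also covers constrained nonconvex optimization problems.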
1.1 Our goals, approach, and contribution
Our goals: Our objective is to develop a new approach to approximate a stationary point of (1) and its finite-sum setting (2) under standard assumptions used in existing methods. In this paper, we only focus on stochastic gradient descent-type (SGD) variants. We are also interested in both oracle complexity bounds and implementation aspects. The ultimate goal is to design simple algorithms that are easy to implement and require less parameter tuning effort.
Our approach: Our approach relies on a so-called "hybrid" idea which merges two existing stochastic estimators through a convex combination to design a "hybrid" offspring that inherits the advantages of its underlying estimators. We will focus on the hybrid estimators formed from SARAH (a recursive stochastic estimator) introduced in [50] and any given unbiased estimator such as SGD [62], SVRG [32], or SAGA [17]. For the sake of presentation, we only focus on either the SGD or the SVRG estimator in this paper. We emphasize that our method is fundamentally different from momentum or exponential moving average-type methods such as those in [15, 34], in that we use two independent estimators instead of a combination of the past and the current estimators.
While our hybrid estimators are biased, they fortunately provide some useful properties for developing new algorithms. One important feature is the variance-reduced property, which often allows us to derive a large stepsize or a constant stepsize in stochastic methods. Whereas a majority of stochastic algorithms rely on unbiased estimators such as SGD, SVRG, and SAGA, recent evidence has shown that biased estimators such as SARAH, biased SAGA, or biased SVRG estimators also provide comparable and even better algorithms in terms of oracle complexity bounds as well as empirical performance, see, e.g. [19, 22, 53, 56, 68].
Our approach, on the one hand, can be extended to study second-order methods such as cubic regularization and subsampled schemes as in [8, 21, 63, 69, 71, 75]. The main idea is to exploit hybrid estimators to approximate both the gradient and the Hessian of the objective function, similar to [69, 71, 75]. On the other hand, it can be applied to approximate a second-order stationary point of (1) and (2). The idea is to integrate our methods with a negative curvature search such as Oja's algorithm [55] or Neon2 [6], or to employ perturbed/noisy gradient techniques such as [23, 26, 38] in order to approximate a second-order stationary point. However, to avoid overloading this paper, we leave these extensions for our future work.
Our contribution: To this end, the contribution of this paper can be summarized as follows:

We first introduce a “hybrid” approach to merge two existing stochastic estimators in order to form a new one. Such a new estimator can be viewed as a convex combination of a biased estimator and an unbiased one to inherit the advantages of its underlying estimators. Although we only focus on a convex combination between SARAH [50] and either SGD [62] or SVRG [32] estimator, our approach can be extended to cover other possibilities. Given such new hybrid estimators, we develop several fundamental properties that can be useful for developing new stochastic optimization algorithms.

Next, we employ our new hybrid SARAH-SGD estimator to develop a novel stochastic proximal gradient algorithm, Algorithm 1, to solve (1). This algorithm can achieve oracle complexity bound. To the best of our knowledge, this is the first variant of SGD that achieves such an oracle complexity bound without using a double loop or checkpoints as in SVRG or SARAH, or requiring an $n\times p$ table to store gradient components as in SAGA-type methods.

Then, we derive two different variants of Algorithm 1: adaptive stepsize and double-loop schemes. Both variants have the same complexity as Algorithm 1. We also propose a minibatch variant of Algorithm 1 and provide a trade-off analysis between minibatch sizes and the choice of stepsizes to obtain better practical performance.

Finally, we design a hybrid SARAH-SVRG estimator and use it to develop new stochastic variants for solving the composite finite-sum problem (2). These variants also achieve the best-known complexity bounds while having new properties compared to existing methods.
Let us emphasize the following additional points of our contribution. Firstly, the new algorithm, Algorithm 1, is rather different from existing SGD methods. It first forms a minibatch stochastic gradient estimator at a given initial point to provide a good approximation to the initial gradient of $f$. Then, it performs a single loop to update the iterate sequence, which consists of two steps, a proximal-gradient step and an averaging step, where our hybrid estimator is used.
Secondly, our methods work with both single-sample and minibatches, and achieve the best-known complexity bounds in both cases. This is different from some existing methods such as SVRG-type methods and SpiderBoost that only achieve the best complexity under certain choices of parameters. Our methods also allow different minibatch sizes for the hybrid components to achieve different complexity bounds and to adjust the performance. For instance, in Algorithm 1, we can choose a single sample in the SARAH estimator while using a minibatch in the SGD estimator, which leads to a different trade-off in the choice of the weight.
Finally, our theoretical results on hybrid estimators are also self-contained and independent. As we have mentioned, they can be used to develop other stochastic algorithms such as second-order methods or perturbed SGD schemes. We believe that they can also be used in other problems such as composition and constrained optimization [16, 44, 67].
1.2 Related work
Problem (1) and its sample averaging setting (2) have been widely studied in the literature for both convex and nonconvex models, see, e.g. [9, 10, 17, 27, 32, 42, 46, 50, 62, 64]. However, due to applications in deep learning, large-scale nonconvex optimization problems have attracted huge attention in recent years [30, 37]. Numerical methods for solving these problems heavily rely on two approaches: deterministic and stochastic approaches, ranging from first-order to second-order methods. Notable first-order methods include stochastic gradient descent-type, conditional gradient descent [59], and primal-dual schemes [13]. In contrast, advanced second-order methods consist of quasi-Newton, trust-region, sketching Newton, subsampled Newton, and cubic regularized Newton-based methods, see, e.g. [11, 48, 57, 63].
In terms of stochastic first-order algorithms, there has been a tremendously increasing trend in stochastic gradient descent methods and their variants in the last fifteen years. SGD-based algorithms can be classified into two categories: non-variance-reduction and variance-reduction schemes. The classical SGD method was studied in the early work of Robbins & Monro [62], but its convergence rate was then investigated in [46] under new robust variants. Ghadimi & Lan extended SGD to nonconvex settings and analyzed its complexity in [27]. Other extensions of SGD can be found in the literature, including [4, 16, 20, 25, 28, 31, 34, 35, 45, 51, 58].
Alternatively, variance reduction-based methods have been intensively studied in recent years for both convex and nonconvex settings. Apart from minibatch and importance sampling schemes [29, 73], the following methods are the most notable. The first class of algorithms is based on the SAG estimator [64], including SAGA variants [17]. The second one is SVRG [32] and its variants such as Katyusha [3], MiG [77], and many others [39, 60]. The third class relies on SARAH [50], such as SPIDER [22], SpiderBoost [68], ProxSARAH [56], and momentum variants [78]. Other approaches such as Catalyst [41] and SDCA [65] have also been proposed.
In terms of theory, many researchers have focussed on theoretical aspects of existing algorithms. For example, [27] appeared as one of the first remarkable works studying convergence rates of stochastic gradient descent-type methods for nonconvex and noncomposite finite-sum problems. They later extended it to the composite setting in [29]. The authors of [68] also investigated the gradient dominant case, and [33] considered both finite-sum and composite finite-sum problems under different assumptions. Whereas many researchers have been trying to improve complexity upper bounds of stochastic first-order methods using different techniques [5, 6, 7, 22], other works have attempted to construct examples to establish lower-bound complexity barriers. The upper oracle complexity bounds have been substantially improved among these works and some results have matched the lower-bound complexity in both convex and nonconvex settings [4, 5, 22, 27, 39, 40, 56, 60, 68, 76]. We refer to Table 1 for some notable examples of stochastic gradient-type methods for solving (1) and (2) and their noncomposite settings.
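| Algorithms | Expectation | Finite-sum | Composite | Type |
|---|---|---|---|---|
| GD [49] | NA | | ✓ | Single |
| SGD [27] | | NA | ✓ | Single |
| SAGA [60] | NA | | ✓ | Single |
| SVRG [60] | NA | | ✓ | Double |
| SVRG+ [39] | | | ✓ | Double |
| SCSG [40] | | | ✗ | Double |
| SNVRG [76] | | | ✗ | Double |
| SPIDER [22] | | | ✗ | Double |
| SpiderBoost [68] | | | ✓ | Double |
| ProxSARAH [56] | | | ✓ | Double |
| Hybrid-SGD (This paper) | | | ✓ | Single |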
In the convex case, there exist numerous research papers including [1, 2, 12, 24, 47, 49, 70] that study the lower-bound complexity. In [22, 74], the authors constructed a lower-bound complexity for nonconvex finite-sum problems covered by (2). They showed that the lower-bound complexity for any stochastic gradient method relying only on the smoothness assumption to achieve an $\varepsilon$-stationary point in expectation is $\Omega(n^{1/2}\varepsilon^{-2})$. For the expectation problem (1), the best-known complexity bound to obtain an $\varepsilon$-stationary point in expectation is $\mathcal{O}(\sigma\varepsilon^{-3} + \sigma^2\varepsilon^{-2})$ as shown in [22, 68], where $\sigma$ is an upper bound of the variance (see Assumption 2.3). Unfortunately, we have not seen any lower-bound complexity for the nonconvex setting of (1) under standard assumptions in the literature.
While numerical stochastic algorithms for solving the noncomposite setting, i.e. $\psi = 0$, are well-developed and have received considerable attention [5, 6, 7, 22, 40, 52, 53, 54, 60, 76], methods for the composite setting remain limited [60, 68]. In this paper, we will develop a novel approach to design stochastic optimization algorithms for solving the composite problems (1) and (2). Our approach is rather different from existing ones and we call it a "hybrid" approach.
1.3 Comparison
Let us compare our algorithms and existing methods in the following aspects:
Single-loop vs. multiple-loop:
As mentioned, we aim at developing practical methods that are easy to implement. One of the major differences between our methods and existing state-of-the-art methods is the algorithmic style: single-loop vs. multiple-loop. As discussed in several works, including [36], single-loop methods have some advantages over double-loop methods, including easier parameter tuning. The single-loop style consists of SGD, SAGA, and their variants [17, 18, 27, 46, 58, 62, 64], while the double-loop style comprises SVRG, SARAH, and their variants [32, 50]. Other algorithms such as Natasha [4] or Natasha1.5 [5] even have three loops. Let us compare these methods in detail as follows:

SGD and SAGA-type methods have a single loop, but SAGA-type algorithms use an $n\times p$ matrix to maintain individual gradients, which can be very large if $n$ and $p$ are large. In addition, SAGA has not yet been applied to solve (1). Our first algorithm, Algorithm 1, has a single loop as in SGD and SAGA, and does not require heavy memory storage. However, to apply to (2), it still requires either an additional assumption or a checkpoint compared to SAGA. But if it solves (1), then it has the same assumptions as SGD. In terms of complexity, Algorithm 1 is much better than SGD. To the best of our knowledge, Algorithm 1 is the first single-loop SGD variant that achieves the best-known complexity. Another related work is [15], which uses a momentum approach but requires an additional bounded gradient assumption to achieve a complexity similar to that of Algorithm 1.

Algorithm 2 has a double loop as in SVRG and SARAH-type methods. While the double loop in SVRG, SARAH, and their variants is required to achieve convergence, it is optional in Algorithm 2. Note that double-loop or multiple-loop methods require tuning more parameters, such as epoch lengths and possibly the minibatch size of the snapshot points. Although Algorithm 3 is loopless, it can be viewed as a double-loop variant. This algorithm has a different complexity bound than existing methods.
Single-sample and minibatch:
Our methods work with both single-sample and minibatch, and in both cases, they achieve the best-known complexity bounds. This is different from some existing methods such as SVRG or SARAH-based methods [60, 68] where the best complexity is only obtained if one chooses the best parameter configuration.
Complexity bounds:
Algorithm 1 and its variants all achieve the best-known complexity bounds as in [56, 68] for solving (1). In early work such as Natasha [4] and Natasha1.5 [5] which are based on the SVRG estimator, the best complexity is often for solving (1) and for solving (2). By combining with additional sophisticated tricks, these complexity bounds are slightly improved. For instance, Natasha [4] or Natasha1.5 [5] can achieve in the finite-sum case, and in the expectation case, but they require three loops with several parameter adjustments that are difficult to tune in practice. SNVRG [76] exploits a dynamic epoch length as used in [40] to improve its complexity bounds. Again, this method also requires a complicated parameter selection procedure. To achieve better complexity bounds, SARAH-based methods have been studied in [22, 53, 56, 68]. Their complexity meets the lower-bound one in the finite-sum case as indicated in [22, 56].
1.4 Paper organization
The rest of this paper is organized as follows. Section 2 discusses the main assumptions of our problems (1) and (2), and their optimality conditions. Section 3 develops new hybrid stochastic estimators and investigates their properties. We consider both single-sample and minibatch cases. Section 4 studies a new class of hybrid gradient algorithms to solve both (1) and (2). We develop three different variants of hybrid algorithms and analyze their convergence and complexity estimates. Section 5 extends our algorithms to minibatch cases. Section 6 is devoted to investigating hybrid SARAH-SVRG methods to solve the finite-sum problem (2). Section 7 gives several numerical examples and compares our methods with existing state-of-the-art methods. For the sake of presentation, all technical proofs are provided in the appendix.
2 Basic assumptions and optimality condition
Notation and basic concepts:
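We work with the Euclidean spaces, $\mathbb{R}^p$ and $\mathbb{R}^n$, equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$. For any function $f$, $\mathrm{dom}(f) := \{x\in\mathbb{R}^p : f(x) < +\infty\}$ denotes the effective domain of $f$. If $f$ is continuously differentiable, then $\nabla f$ denotes its gradient. If, in addition, $f$ is twice continuously differentiable, then $\nabla^2 f$ denotes its Hessian.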
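For a stochastic function $f(x;\xi)$ defined on a probability space $(\Omega,\mathbb{P})$, we use $\mathbb{E}_{\xi}[f(x;\xi)]$ to denote the expectation of $f(x;\xi)$ w.r.t. $\xi$ on $\Omega$. We also overload the notation $\mathbb{E}_{\xi}[\cdot]$ to express the expectation w.r.t. a realization $\xi$ in both single-sample and minibatch cases. Given a finite set $\mathcal{S} := \{s_1,\dots,s_n\}$, we denote $s\sim\mathbf{U}_{\mathbf{p}}(\mathcal{S})$ if $\mathrm{Prob}(s = s_i) = \mathbf{p}_i$ for $i\in[n]$ and $\sum_{i=1}^n\mathbf{p}_i = 1$. If $\mathbf{p}_i = 1/n$ for $i\in[n]$, then we write $s\sim\mathbf{U}(\mathcal{S})$ by dropping the probability distribution $\mathbf{p}$.
Given a random mapping $G(\cdot;\xi)$ depending on a random vector $\xi$, we say that $G$ is $L$-average Lipschitz continuous if $\mathbb{E}_{\xi}\big[\|G(x;\xi) - G(y;\xi)\|^2\big] \leq L^2\|x - y\|^2$ for all $x, y$, where $L$ is called the Lipschitz constant of $G$. If $G$ is a deterministic function, then this condition becomes $\|G(x) - G(y)\| \leq L\|x - y\|$, which states that $G$ is $L$-Lipschitz continuous. In particular, if this condition holds for $G = \nabla f$, then we say that $f$ is $L$-smooth.
For a proper, closed, and convex function $\psi$, $\partial\psi(x)$ denotes its subdifferential at $x$, and $\mathrm{prox}_{\psi}(x) := \arg\min_{y}\big\{\psi(y) + \tfrac{1}{2}\|y - x\|^2\big\}$ denotes its proximal operator. If $\psi$ is the indicator of a nonempty, closed, and convex set $\mathcal{X}$, then $\mathrm{prox}_{\psi}$ reduces to the projection onto $\mathcal{X}$. We say that $f$ is $\mu$-weakly convex if $f(y) \geq f(x) + \langle g, y - x\rangle - \tfrac{\mu}{2}\|y - x\|^2$ for all $x, y\in\mathrm{dom}(f)$ and $g\in\partial f(x)$, where $\mu \geq 0$ is a given constant. Clearly, a weakly convex function is not necessarily convex. However, any $L$-smooth function is $L$-weakly convex. If $f$ is $\mu$-weakly convex, then $f(\cdot) + \tfrac{1}{2\eta}\|\cdot\|^2$ is strongly convex if $0 < \eta < 1/\mu$. Therefore, the proximal operator $\mathrm{prox}_{\eta f}$ is well-defined and single-valued if $0 < \eta < 1/\mu$. Note that $\mathrm{prox}_{\eta\psi}$ is nonexpansive, i.e. $\|\mathrm{prox}_{\eta\psi}(x) - \mathrm{prox}_{\eta\psi}(y)\| \leq \|x - y\|$ for all $x, y$.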
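To make the proximal operator concrete, the following minimal Python sketch evaluates it for two common regularizers: an $\ell_1$-penalty, whose proximal operator is componentwise soft-thresholding, and the indicator of a box, whose proximal operator is the Euclidean projection. The function names and the two regularizers are our own illustrative choices and are not prescribed by the paper.

```python
import math

def prox_l1(x, step):
    """Proximal operator of step * ||.||_1: componentwise soft-thresholding."""
    return [math.copysign(max(abs(xi) - step, 0.0), xi) for xi in x]

def prox_box(x, lower, upper):
    """Proximal operator of the indicator of the box [lower, upper]^p,
    i.e. the Euclidean projection onto the box (independent of the step)."""
    return [min(max(xi, lower), upper) for xi in x]

def distance(a, b):
    """Euclidean distance between two vectors stored as lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x = [3.0, -0.5, 1.0]
y = [2.0, 0.5, -4.0]
px, py = prox_l1(x, 1.0), prox_l1(y, 1.0)

# Nonexpansiveness: applying the prox never increases the distance.
nonexpansive = distance(px, py) <= distance(x, y)
```

Plain lists are used instead of an array library only to keep the sketch dependency-free.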
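If $X$ is a matrix, then $\|X\|$ is the spectral norm of $X$, and the inner product of two matrices $X$ and $Y$ is defined as $\langle X, Y\rangle := \mathrm{trace}(X^{\top}Y)$. Also, $\mathbb{N}_{+}$ stands for the set of positive integer numbers, and $[n] := \{1, 2, \dots, n\}$. Given $a\in\mathbb{R}$, $\lfloor a\rfloor$ denotes the maximum integer number that is less than or equal to $a$. We also use $\mathcal{O}(\cdot)$ to express complexity bounds of algorithms.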
2.1 Fundamental assumptions
Our algorithms developed in the sequel rely on the following fundamental assumptions:
Assumption 2.1.
This assumption is fundamental and required for any algorithm. Here, since $\psi$ is proper, closed, and convex, its proximal operator is well-defined, single-valued, and nonexpansive. We assume that this proximal operator can be computed exactly.
Assumption 2.2 (average smoothness).
Assumption 2.3 (Bounded variance).
Assumptions 2.2 and 2.3 are very standard in stochastic optimization and are required by any stochastic gradient-based method for solving (1). The average smoothness in (5) is in general weaker than the individual smoothness of each $f(\cdot;\xi)$ [56]. Note that we do not require the Lipschitz continuity of or as in some recent work, e.g. [15].
We also consider problem (2) under the following assumption, which covers the case $\psi = \delta_{\mathcal{X}}$, the indicator of a nonempty, closed, convex, and bounded set $\mathcal{X}$. This assumption will be used to develop algorithms for solving (2) using hybrid SVRG estimators.
Assumption 2.4.
The domain of is bounded, i.e.:
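2.2 First-order optimality condition
The optimality condition of (1) can be written as
$$0 \in \nabla f(x^{\star}) + \partial\psi(x^{\star}). \tag{8}$$
Any point $x^{\star}$ satisfying (8) is called a stationary point of (1). The same definition applies to (2).
Note that (8) can be written equivalently as
$$G_{\eta}(x^{\star}) := \frac{1}{\eta}\Big(x^{\star} - \mathrm{prox}_{\eta\psi}\big(x^{\star} - \eta\nabla f(x^{\star})\big)\Big) = 0. \tag{9}$$
Here, $G_{\eta}$ is called the gradient mapping of $F$ in (1) for any $\eta > 0$. It is obvious that if $\psi \equiv 0$, then $G_{\eta}(x) \equiv \nabla f(x)$, the gradient of $f$, for any $x$. Our goal is to seek an $\varepsilon$-stationary point of (1) or (2) defined as follows: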
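As a small numerical illustration of the gradient mapping, the sketch below evaluates $(x - \mathrm{prox}(x - \eta g))/\eta$ for a given gradient vector $g$ and checks that it reduces to $g$ when the regularizer is absent. The toy objective $f(x) = \tfrac{1}{2}\|x\|^2$, the $\ell_1$ regularizer, and all function names are assumptions made for illustration only.

```python
def soft_threshold(z, tau):
    """Prox of tau * ||.||_1 (componentwise soft-thresholding)."""
    return [max(abs(zi) - tau, 0.0) * (1.0 if zi >= 0 else -1.0) for zi in z]

def identity_prox(z, tau):
    """Prox of the zero regularizer: the identity map."""
    return list(z)

def gradient_mapping(x, grad, prox, eta):
    """Return (x - prox(x - eta * grad, eta)) / eta, componentwise."""
    z = [xi - eta * gi for xi, gi in zip(x, grad)]
    p = prox(z, eta)
    return [(xi - pi) / eta for xi, pi in zip(x, p)]

x = [1.0, -2.0]
grad = list(x)  # gradient of the toy function f(x) = 0.5 * ||x||^2

# Without a regularizer, the gradient mapping is the gradient itself.
g_smooth = gradient_mapping(x, grad, identity_prox, 0.5)
# With an l1 regularizer, it also accounts for the nonsmooth term.
g_l1 = gradient_mapping(x, grad, soft_threshold, 0.5)
```

The norm of this quantity is what an approximate stationarity test would monitor in the composite setting.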
Definition 2.1.
Let us clarify why is an approximate stationary point of (1). Indeed, if , then means that . On the other hand, is equivalent to . Therefore, for some . Using the average smoothness of , we have . This condition shows that is an approximate stationary point of (1).
In practice, we often replace the condition (10) by which can avoid storing the iterate sequence .
3 Hybrid stochastic estimators
In this section, we propose new stochastic estimators for a generic function $G$ that can cover function values, the gradient, and the Hessian of any expectation function $f$ in (1).
3.1 The construction of hybrid stochastic estimators
Consider a function $G(x) := \mathbb{E}_{\xi}[G_{\xi}(x)]$, where $G_{\xi}$ is a (vector) stochastic function from $\mathbb{R}^p\times\Omega\to\mathbb{R}^q$. We define the following stochastic estimator of $G$. As concrete examples, $G$ can be the gradient mapping $\nabla f$ of $f$ or the Hessian mapping $\nabla^2 f$ of $f$ in problem (1) or (2).
Definition 3.1.
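Let $u_t$ be an unbiased stochastic estimator of $G(x_t)$ formed by a realization $\zeta_t$ of $\xi$, i.e. $\mathbb{E}_{\zeta_t}[u_t] = G(x_t)$ at a given $x_t$. The following quantity:
$$v_t := \beta_{t-1}v_{t-1} + \beta_{t-1}\big[G_{\xi_t}(x_t) - G_{\xi_t}(x_{t-1})\big] + (1 - \beta_{t-1})u_t \tag{11}$$
is called a hybrid stochastic estimator of $G$ at $x_t$, where $\xi_t$ and $\zeta_t$ are two independent realizations of $\xi$ on $\Omega$ and $\beta_{t-1}\in[0,1]$ is a given weight.
Clearly, if $\beta_{t-1} = 0$, then we obtain a simple unbiased stochastic estimator. If $\beta_{t-1} = 1$, then we obtain the SARAH-type estimator as studied in [50], but for a general function $G$. We are interested in the case $\beta_{t-1}\in(0,1)$, which can be referred to as a hybrid recursive stochastic estimator.
We can rewrite $v_t$ as
$$v_t = \beta_{t-1}G_{\xi_t}(x_t) + (1 - \beta_{t-1})u_t + \beta_{t-1}\big[v_{t-1} - G_{\xi_t}(x_{t-1})\big].$$
The first two terms are two stochastic estimators evaluated at $x_t$, while the third term is the difference of the previous estimator and a stochastic estimator at the previous iterate. Here, since $\beta_{t-1}\in(0,1)$, the main idea is to exploit more recent information than the old one.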
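The hybrid update can be sketched in a few lines of Python: a convex combination, with weight `beta`, of a SARAH-type recursive term and an unbiased term `u`. The function and argument names are hypothetical, and the two limiting weights are checked to recover the unbiased and the SARAH-type estimators described above.

```python
def hybrid_estimator(v_prev, g_curr, g_prev, u, beta):
    """Convex combination of a recursive (SARAH-type) term and an unbiased term.

    v_prev : previous estimator
    g_curr : stochastic evaluation at the current iterate (sample xi)
    g_prev : stochastic evaluation at the previous iterate (same sample xi)
    u      : unbiased estimator at the current iterate (independent sample)
    beta   : weight in [0, 1]
    """
    return [beta * (vp + gc - gp) + (1.0 - beta) * ui
            for vp, gc, gp, ui in zip(v_prev, g_curr, g_prev, u)]

v_prev = [1.0, 2.0]
g_curr = [0.5, 0.5]
g_prev = [0.25, 1.0]
u = [3.0, -1.0]

v_unbiased = hybrid_estimator(v_prev, g_curr, g_prev, u, 0.0)  # equals u
v_sarah = hybrid_estimator(v_prev, g_curr, g_prev, u, 1.0)     # SARAH-type update
v_hybrid = hybrid_estimator(v_prev, g_curr, g_prev, u, 0.5)    # strictly in between
```

Note that `g_curr` and `g_prev` must be evaluated on the same sample, while `u` uses an independent one; the sketch leaves sampling to the caller.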
In fact, if $\beta_{t-1}\in[0,1]$, then the hybrid estimator covers many other estimators, including SGD, SVRG, and SARAH. We consider three concrete examples of the unbiased estimator $u_t$ of $G(x_t)$ as follows:

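The classical stochastic estimator: $u_t := G_{\zeta_t}(x_t)$.

The SVRG estimator: $u_t := \overline{G}(\widetilde{x}) + G_{\zeta_t}(x_t) - G_{\zeta_t}(\widetilde{x})$, where $\overline{G}(\widetilde{x})$ is a given unbiased snapshot evaluated at a given point $\widetilde{x}$.

The SAGA estimator: $u_t := G_{j_t}(x_t) - G_{j_t}(\hat{x}_t^{j_t}) + \frac{1}{n}\sum_{i=1}^{n}G_i(\hat{x}_t^{i})$, where $\hat{x}_{t+1}^{i} = x_t$ if $i = j_t$ and $\hat{x}_{t+1}^{i} = \hat{x}_t^{i}$ if $i \neq j_t$.
While both the classical stochastic and SVRG estimators work for both expectation and finite-sum settings, the SAGA estimator currently works only for the finite-sum setting (2). Note that it is also possible to consider minibatch and importance sampling settings for our hybrid estimators.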
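The difference between the classical and the SVRG estimators can be checked on a toy finite-sum problem. The sketch below uses hypothetical scalar least-squares components $f_i(x) = \tfrac{1}{2}(a_i x - b_i)^2$ (our choice, not from the paper), enumerates all indices to compute exact means and variances, and confirms that both estimators are unbiased while the SVRG one has smaller variance when the snapshot is close to the current point.

```python
def grad_i(i, x, a, b):
    """Gradient of the toy component f_i(x) = 0.5 * (a[i] * x - b[i]) ** 2."""
    return a[i] * (a[i] * x - b[i])

def full_grad(x, a, b):
    """Gradient of the average (1/n) * sum_i f_i at x."""
    return sum(grad_i(i, x, a, b) for i in range(len(a))) / len(a)

def sgd_estimator(i, x, a, b):
    """Classical stochastic estimator built from a single index i."""
    return grad_i(i, x, a, b)

def svrg_estimator(i, x, x_snap, g_snap, a, b):
    """SVRG estimator: snapshot gradient plus a single-index correction."""
    return g_snap + grad_i(i, x, a, b) - grad_i(i, x_snap, a, b)

a = [1.0, 2.0, -1.0, 0.5]
b = [0.0, 1.0, 2.0, -1.0]
n = len(a)
x, x_snap = 0.3, 0.25
g = full_grad(x, a, b)
g_snap = full_grad(x_snap, a, b)

sgd_vals = [sgd_estimator(i, x, a, b) for i in range(n)]
svrg_vals = [svrg_estimator(i, x, x_snap, g_snap, a, b) for i in range(n)]

# Exact mean and variance under uniform sampling of the index i.
mean_sgd = sum(sgd_vals) / n
mean_svrg = sum(svrg_vals) / n
var_sgd = sum((v - g) ** 2 for v in sgd_vals) / n
var_svrg = sum((v - g) ** 2 for v in svrg_vals) / n
```

Either estimator can play the role of the unbiased component in the hybrid construction; the SAGA variant is omitted here because it additionally needs a table of stored evaluations.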
3.2 Properties of hybrid stochastic estimators
Let us first define
(12) 
the $\sigma$-field generated by the history of realizations of $\xi$ up to the iteration $t$. We first prove in Appendix 1.1 the following property of the hybrid stochastic estimator $v_t$.
Lemma 3.1.
Remark 3.1.
While the variance of can only be bounded by a constant , the variance of can be reduced by gradually changing the snapshot . The following lemma shows this property, whose proof can be found, e.g. in [61].
Lemma 3.2.
Assume that is an SVRG estimator of . Then the following estimate holds:
(15) 
If is average Lipschitz continuous, i.e. for all , then we have
(16) 
Lemma 3.3.
Assume that is average Lipschitz continuous and is a classical stochastic estimator of . Then, we have the following upper bound:
(17) 
where the expectation is taken over all the randomness , and
(18) 
3.3 Minibatch hybrid stochastic estimators
We can also consider a minibatch hybrid recursive stochastic estimator of defined as:
(19) 
where and is a minibatch of size and independent of .
Note that can also be a minibatch unbiased estimator of . For example, is a minibatch unbiased stochastic estimator.
Lemma 3.4.
Let be the minibatch stochastic estimator of defined by (19), where is also a minibatch unbiased stochastic estimator of with such that is independent of . Then, the following estimate holds:
(20) 
where if is finite i.e. , and , otherwise i.e. .
Similar to Lemma 4.1, we can bound the variance of the minibatch hybrid estimator from (19) in the following lemma, whose proof is in Appendix 1.4. For simplicity of presentation, we choose and .
Lemma 3.5.
Assume that is average Lipschitz continuous and is a minibatch unbiased estimator as , is given in (19), and are minibatches of sizes and , respectively for all . Then, we have the following upper bound on the variance :
(21) 
where the expectation is taken over all the randomness , and , , and are defined in (18). Here, and if is finite and and , otherwise.
4 Hybrid SARAH-SGD Algorithms
In this section, we utilize our hybrid stochastic estimator above with $G := \nabla f$ to develop new stochastic gradient algorithms for solving (1) and its finite-sum setting (2).
4.1 The singleloop algorithm
Our first algorithm is a single-loop stochastic proximal-gradient scheme for solving (1). This algorithm is described in detail in Algorithm 1.
Algorithm 1 differs from existing SGD methods in the following respects:

Firstly, it starts with a relatively large minibatch to compute an initial estimate $v_0$ for the initial gradient $\nabla f(x_0)$. This is quite different from existing methods, which often use single-sample, minibatch, or increasing minibatch sizes for the whole algorithm (e.g. [29]) and do not separate it into two stages as in Algorithm 1.
The idea behind this difference is to find a good stochastic approximation for $\nabla f(x_0)$ to move on.

Secondly, Algorithm 1 adopts the idea of ProxSARAH in [56] with two steps in $\hat{x}_{t+1}$ and $x_{t+1}$ to handle the composite forms. This is different from existing methods as well as methods for noncomposite problems where two stepsizes $\eta_t$ and $\gamma_t$ are used. While the first step on $\hat{x}_{t+1}$ is a standard proximal-gradient step, the second one on $x_{t+1}$ is an averaging step. If $\psi = 0$, i.e. in the noncomposite problems, then Steps 4 and 8 reduce to
$$x_{t+1} := x_t - \eta_t\gamma_t v_t.$$
Therefore, the product $\eta_t\gamma_t$ can be viewed as a combined stepsize of Algorithm 1. Note that by using $\widetilde{G}_{\eta_t}(x_t) := \frac{1}{\eta_t}\big(x_t - \mathrm{prox}_{\eta_t\psi}(x_t - \eta_t v_t)\big)$ to approximate the gradient mapping defined by (9), we can rewrite the main step of Algorithm 1 as
$$x_{t+1} := x_t - \eta_t\gamma_t\widetilde{G}_{\eta_t}(x_t).$$
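The two-stage structure described above (an initial minibatch estimate, followed by a single loop of proximal-gradient, averaging, and hybrid-estimator updates) can be sketched as follows. This is only an illustrative sketch under our own assumptions, not the paper's Algorithm 1: the toy objective $f(x;\xi) = \tfrac{1}{2}\|x - \xi\|^2$ with an $\ell_1$ regularizer, the constant stepsizes `eta` and `gamma`, the constant weight `beta`, and all names are hypothetical choices.

```python
import random

def soft_threshold(z, tau):
    """Prox of tau * ||.||_1 (componentwise soft-thresholding)."""
    return [max(abs(zi) - tau, 0.0) * (1.0 if zi >= 0 else -1.0) for zi in z]

def sample(rng, mean, noise):
    """Draw one realization xi ~ N(mean, noise^2 I) (assumed toy model)."""
    return [m + rng.gauss(0.0, noise) for m in mean]

def stoch_grad(x, xi):
    """Gradient in x of the toy function f(x; xi) = 0.5 * ||x - xi||^2."""
    return [xj - sj for xj, sj in zip(x, xi)]

def hybrid_prox_sgd(mean, lam, noise=0.1, eta=0.5, gamma=0.5, beta=0.9,
                    init_batch=50, iters=500, seed=0):
    rng = random.Random(seed)
    p = len(mean)
    x = [0.0] * p
    # Stage 1: minibatch estimate of the gradient at the initial point.
    v = [0.0] * p
    for _ in range(init_batch):
        g = stoch_grad(x, sample(rng, mean, noise))
        v = [vj + gj / init_batch for vj, gj in zip(v, g)]
    # Stage 2: single loop.
    for _ in range(iters):
        # Proximal-gradient step using the current estimator v.
        z = [xj - eta * vj for xj, vj in zip(x, v)]
        x_hat = soft_threshold(z, eta * lam)
        # Averaging step between the old iterate and the prox-gradient point.
        x_new = [(1.0 - gamma) * xj + gamma * hj for xj, hj in zip(x, x_hat)]
        # Hybrid estimator update with two independent samples xi and zeta.
        xi, zeta = sample(rng, mean, noise), sample(rng, mean, noise)
        g_new, g_old = stoch_grad(x_new, xi), stoch_grad(x, xi)
        u = stoch_grad(x_new, zeta)
        v = [beta * (vj + gn - go) + (1.0 - beta) * uj
             for vj, gn, go, uj in zip(v, g_new, g_old, u)]
        x = x_new
    return x

# For this toy model the minimizer is soft_threshold(mean, lam) = [1.5, 0.0].
x_final = hybrid_prox_sgd(mean=[2.0, 0.1], lam=0.5)
```

The toy problem is convex and is used only because its minimizer is known in closed form; the stepsize and weight rules that yield the stated complexity bounds are those of the paper's analysis and are not reproduced here.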