Optimal Finite-Sum Smooth Non-Convex Optimization with SARAH

01/22/2019 · Lam M. Nguyen et al.

The total complexity (measured as the total number of gradient computations) of a stochastic first-order optimization algorithm that finds a first-order stationary point of a finite-sum smooth nonconvex objective function F(w) = 1/n ∑_{i=1}^{n} f_i(w) has been proven to be at least Ω(√n/ε), where ε denotes the attained accuracy E[‖∇F(w̃)‖²] ≤ ε for the outputted approximation w̃ (Fang et al., 2018). This paper is the first to show that this lower bound is tight for the class of variance reduction methods which only assume the Lipschitz continuous gradient assumption. We prove this complexity result for a slightly modified version of the SARAH algorithm in (Nguyen et al., 2017a;b) - showing that SARAH is optimal and dominates all existing results. For convex optimization, we propose SARAH++ with sublinear convergence for general convex and linear convergence for strongly convex problems; and we provide a practical version for which numerical experiments on various datasets show an improved performance.


1 Introduction

We are interested in solving the finite-sum smooth minimization problem

min_{w ∈ ℝ^d} { F(w) = (1/n) ∑_{i=1}^{n} f_i(w) },   (1)

where each f_i, i ∈ [n] = {1, …, n}, has a Lipschitz continuous gradient with constant L. Throughout the paper, we consider the case where F has a finite lower bound F*.

Problems of form (1) cover a wide range of convex and nonconvex problems in machine learning applications, including but not limited to logistic regression, neural networks, multi-kernel learning, etc. In many of these applications the number of component functions n is very large, which makes the classical Gradient Descent (GD) method less efficient since it requires computing a full gradient many times. Instead, a traditional alternative is to employ stochastic gradient descent (SGD) (Robbins & Monro, 1951; Shalev-Shwartz et al., 2011; Bottou et al., 2016). In recent years, a large number of improved variants of stochastic gradient algorithms called variance reduction methods have emerged, in particular SAG/SAGA (Schmidt et al., 2016; Defazio et al., 2014), SDCA (Shalev-Shwartz & Zhang, 2013), MISO (Mairal, 2013), SVRG/S2GD (Johnson & Zhang, 2013; Konečný & Richtárik, 2013), SARAH (Nguyen et al., 2017a), etc. These methods were first analyzed for strongly convex problems of form (1). Due to recent interest in deep neural networks, nonconvex problems of form (1) have been studied and analyzed by considering a number of different approaches, including many variants of variance reduction techniques (see e.g. (Reddi et al., 2016; Lei et al., 2017; Allen-Zhu, 2017a, b; Fang et al., 2018), etc.).

We study the SARAH algorithm (Nguyen et al., 2017a, b) depicted in Algorithm 1, slightly modified. We use the upper index (s) to indicate the s-th outer loop and the lower index t to indicate the t-th iteration in the inner loop. The key update rule is

v_t^{(s)} = ∇f_{i_t}(w_t^{(s)}) − ∇f_{i_t}(w_{t−1}^{(s)}) + v_{t−1}^{(s)}.   (2)

The computed v_t^{(s)} is used to update

w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}.   (3)

After iteration m in the inner loop, the outer loop remembers the last computed iterate w_{m+1}^{(s)} and starts its loop anew – first with a full gradient computation before again entering the inner loop with updates (2). Instead of remembering w_{m+1}^{(s)} for the next outer loop, the original SARAH algorithm in (Nguyen et al., 2017a) uses w̃_s = w_t^{(s)} with t chosen uniformly at random from {0, 1, …, m}. The authors of (Nguyen et al., 2017a) chose to do this in order to be able to analyze the convergence rate for a single outer loop – since in practice it makes sense to keep the last computed iterate if multiple outer loop iterations are used, we give full credit of Algorithm 1 to (Nguyen et al., 2017a) and call this SARAH.

  Parameters: the learning rate η > 0, the inner loop size m, and the outer loop size S
  Initialize: w̃_0
  Iterate:
  for s = 1, 2, …, S do
     w_0^{(s)} = w̃_{s−1}
     v_0^{(s)} = (1/n) ∑_{i=1}^{n} ∇f_i(w_0^{(s)})
     w_1^{(s)} = w_0^{(s)} − η v_0^{(s)}
     Iterate:
     for t = 1, …, m do
        Sample i_t uniformly at random from [n]
        v_t^{(s)} = ∇f_{i_t}(w_t^{(s)}) − ∇f_{i_t}(w_{t−1}^{(s)}) + v_{t−1}^{(s)}
        w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}
     end for
     Set w̃_s = w_{m+1}^{(s)} (modified point)
  end for
Algorithm 1 SARAH (modified version of (Nguyen et al., 2017a))
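To make the loop structure concrete, the following is a minimal Python/NumPy sketch of Algorithm 1 under the stated notation; the gradient oracle grad_f_i and all parameter names are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def sarah(grad_f_i, w0, n, eta, m, S, rng=None):
    """Minimal sketch of SARAH (Algorithm 1).

    grad_f_i(w, i) is assumed to return the gradient of the i-th component f_i at w.
    Returns the last inner iterate of the last outer loop (the modified point).
    """
    rng = np.random.default_rng() if rng is None else rng
    w_tilde = np.asarray(w0, dtype=float)
    for s in range(S):
        w_prev = w_tilde.copy()                                            # w_0^{(s)} = w̃_{s-1}
        v = np.mean([grad_f_i(w_prev, i) for i in range(n)], axis=0)       # full gradient v_0^{(s)}
        w = w_prev - eta * v                                               # w_1^{(s)}
        for t in range(1, m + 1):
            i_t = rng.integers(n)                                          # sample i_t uniformly from [n]
            # recursive update (2): v_t = ∇f_{i_t}(w_t) − ∇f_{i_t}(w_{t-1}) + v_{t-1}
            v = grad_f_i(w, i_t) - grad_f_i(w_prev, i_t) + v
            w_prev, w = w, w - eta * v                                     # update (3): w_{t+1} = w_t − η v_t
        w_tilde = w                                                        # keep the last iterate w_{m+1}^{(s)}
    return w_tilde
```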

We will analyze SARAH for smooth nonconvex optimization, i.e., we study (1) where we only assume that the component functions have Lipschitz continuous gradients, with no other assumptions:

Assumption 1 (L-smooth).

Each f_i, i ∈ [n], is L-smooth, i.e., there exists a constant L > 0 such that, for all w, w′ ∈ ℝ^d,

‖∇f_i(w) − ∇f_i(w′)‖ ≤ L ‖w − w′‖.   (4)

We stress that our convergence analysis relies only on the above smoothness assumption, without a bounded variance assumption (as required in (Lei et al., 2017; Zhou et al., 2018)) or a Hessian-Lipschitz assumption (as required in (Fang et al., 2018)).

We measure the convergence rate in terms of the total complexity T_c, i.e., the total number of gradient computations. For SARAH we have T_c = (n + 2m)·S, since each outer iteration computes one full gradient (n component gradients) and each of the m inner iterations computes two component gradients.

We notice that SARAH, using the notation and definition of (Fang et al., 2018), is a random algorithm A that maps functions f = (f_1, …, f_n) to a sequence of iterates, where each new iterate is produced by a measurable mapping of the component gradients queried so far, i_k is the individual function chosen by A at iteration k, and ξ is a uniform random vector with entries in [0, 1]. Rephrasing, Theorem 3 in (Fang et al., 2018) states the following lower bound: there exists a function f such that, in order to find a point w̃ for which the accuracy E[‖∇F(w̃)‖²] ≤ ε holds, A must have a total complexity of at least Ω(√n/ε) stochastic gradient computations. Applying this bound to SARAH tells us that if the final output w̃ has E[‖∇F(w̃)‖²] ≤ ε, then its total complexity must be at least Ω(√n/ε).

Our main contribution is to meet this lower bound and show that in SARAH we can choose the parameters m and η such that the total complexity is

T_c = O(√n/ε),

or, equivalently,

E[‖∇F(w̃)‖²] = O(√n/T_c).

This significantly improves over prior work, which only achieves a total complexity of O(n + √n/ε):

Related Work: The paper that introduces SARAH for nonconvex problems (Nguyen et al., 2017b) is only able to analyze the convergence of a single outer loop, giving a total complexity of O(n + 1/ε²).

Besides the lower bound, (Fang et al., 2018) introduces SPIDER, a variant of SARAH, which achieves to date the best known convergence result in the nonconvex case. SPIDER uses the SARAH update rule (2) as originally proposed in (Nguyen et al., 2017a) and the mini-batch version of SARAH in (Nguyen et al., 2017b). SPIDER and SARAH differ in iteration (3): SPIDER takes the normalized step w_{t+1}^{(s)} = w_t^{(s)} − η (v_t^{(s)}/‖v_t^{(s)}‖) with a small step size of order ε/L, whereas SARAH takes w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}. Also, SPIDER does not divide into an outer loop and an inner loop as SARAH does, although SPIDER also performs a full gradient update after a certain fixed number of iterations. A recent technical report (Wang et al., 2018) provides an improved version of SPIDER called SpiderBoost which allows a larger learning rate. For smooth nonconvex optimization, both SPIDER and SpiderBoost are able to show a total complexity of

O(n + √n/ε),

which is called “near-optimal” in (Fang et al., 2018) since, except for the n term, it almost matches the lower bound.

Method | Complexity | Additional assumption
GD (Nesterov, 2004) | O(n/ε) | None
SVRG (Reddi et al., 2016) | O(n + n^{2/3}/ε) | None
SCSG (Lei et al., 2017) | O((σ²/ε ∧ n) + (1/ε)(σ²/ε ∧ n)^{2/3}) | Bounded variance
 | O(n + n^{2/3}/ε) | None (σ → ∞)
SNVRG (Zhou et al., 2018) | Õ((σ²/ε ∧ n) + (1/ε)(σ²/ε ∧ n)^{1/2}) | Bounded variance
 | Õ(n + √n/ε) | None (σ → ∞)
SPIDER (Fang et al., 2018) | O(n + √n/ε) | None
SpiderBoost (Wang et al., 2018) | O(n + √n/ε) | None
R-SPIDER (Zhang et al., 2018) | O(n + √n/ε) | None
SARAH (this paper) | O(√n/ε) | None
Table 1: Comparison of results on the total complexity for smooth nonconvex optimization

Table 1 shows the comparison of results on the total complexity for smooth nonconvex optimization; here σ denotes the bound in the bounded variance assumption E[‖∇f_i(w) − ∇F(w)‖²] ≤ σ². (a) Each of the complexities in Table 1 also depends on the Lipschitz constant L; however, since we consider smooth optimization and it is customary to assume/design L = O(1), we ignore the dependency on L in the complexity results. (b) Although many algorithms have appeared during the past few years, we only compare algorithms having a convergence result which only supposes the smoothness assumption. For example, (Fang et al., 2018) can also prove a smaller total complexity by requiring an additional Hessian-Lipschitz assumption and adding a dependence on the Hessian-Lipschitz constant to their analysis. For this reason, this result is not part of the table, as it is weaker in that the analysis supposes an additional property of the component functions. (c) Among algorithms with convergence results that only suppose the smoothness assumption, Table 1 only mentions recent state-of-the-art results. For example, we do not provide comparisons with SGD (Robbins & Monro, 1951) and SGD-like methods (e.g. (Duchi et al., 2011; Kingma & Ba, 2014)) since they achieve a much worse complexity of O(1/ε²). (d) Although the bounded variance assumption is acceptable in much of the existing literature, this additional assumption limits the applicability of these convergence results since it adds a dependence on σ, which can be arbitrarily large. For a fair comparison with convergence analyses without the bounded variance assumption, σ must be set to go to infinity – and this is what is mentioned in Table 1. As an example, from Table 1 we observe that SCSG has an advantage over SVRG only if σ²/ε < n but, theoretically, it has the same total complexity as SVRG if σ → ∞. (e) For completeness, the incompatibility of assuming a bounded gradient has been discussed in (Nguyen et al., 2018a) for strongly convex objective functions.

According to the results in Table 1, we observe that SARAH-type algorithms dominate SVRG-type algorithms. In fact, this paper proves that SARAH (slightly modified as given in Algorithm 1) achieves the minimal possible total complexity among variance reduction techniques in the nonconvex case for finding a first-order stationary point based only on the smoothness assumption. This closes the gap of searching for “better” algorithms since the total complexity meets the lower bound Ω(√n/ε).

Contributions: We summarize our key contributions as follows.

Smooth Non-Convex. We provide a convergence analysis for the full SARAH algorithm with multiple outer iterations for nonconvex problems (unlike (Nguyen et al., 2017b), which only analyzes a single outer iteration). The convergence analysis only supposes the smoothness assumption (Lipschitz continuous gradients) and proves that SARAH with multiple outer loops (which has not been analyzed before) attains the asymptotically minimal possible total complexity in the nonconvex case (Theorem 1). We extend these results to the mini-batch case (Theorem 2).

Smooth Convex. In order to complete the picture, we study SARAH+ (Nguyen et al., 2017a), which was designed as a variant of SARAH for convex optimization. We propose a novel variant of SARAH+ called SARAH++. Here, we study the iteration complexity measured by the total number of iterations (which counts one full gradient computation as adding one iteration to the complexity) – and leave an analysis of the total complexity as an open problem. For SARAH++ we show a sublinear convergence rate in the general convex case (Theorem 3) and a linear convergence rate in the strongly convex case (Theorem 4). SARAH itself may already lead to good convergence and there may be no need to introduce SARAH++; in numerical experiments, however, we show the advantage of SARAH++ over SARAH. We further propose a practical version called SARAH Adaptive which improves the performance of SARAH and SARAH++ for convex problems – numerical experiments on various data sets show good overall performance.

For the convergence analysis of SARAH in the nonconvex case and SARAH++ in the convex case, we show that the analysis generalizes the total complexity of Gradient Descent (GD) (Remarks 1 and 2), i.e., the analysis reproduces known total complexity results for GD. To the best of our knowledge, this is the first variance reduction method having this property.

2 Non-Convex Case: Convergence Analysis of SARAH

SARAH is very different from other algorithms since it has a biased estimator of the gradient. Therefore, in order to analyze SARAH’s convergence rate, it is non-trivial to use existing proof techniques from algorithms with unbiased estimators such as SGD, SAGA, and SVRG.

2.1 A single batch case

We start analyzing SARAH (Algorithm 1) for the case where we choose a single sample i_t uniformly at random from [n] in the inner loop.

Lemma 1.

Suppose that Assumption 1 holds. Consider a single outer loop iteration in SARAH (Algorithm 1) with η ≤ 2/(L(√(4m+1)+1)). Then, for any s ≥ 1, we have

∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/η) E[F(w_0^{(s)}) − F(w_{m+1}^{(s)})].   (5)

The above result is for a single outer loop iteration of SARAH, which includes a full gradient step together with the inner loop. Since the outer loop iteration concludes with w̃_s = w_{m+1}^{(s)}, and w_0^{(s)} = w̃_{s−1}, we have

∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/η) E[F(w̃_{s−1}) − F(w̃_s)].

Summing over s = 1, …, S gives

∑_{s=1}^{S} ∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/η) E[F(w̃_0) − F(w̃_S)].   (6)

This proves our main result:

Theorem 1 (Smooth nonconvex).

Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1) with η ≤ 2/(L(√(4m+1)+1)). Then, for any given w̃_0, we have

(1/(S(m+1))) ∑_{s=1}^{S} ∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/(η(m+1)S)) [F(w̃_0) − F*],

where F* is any lower bound of F, and w_t^{(s)} is the result of the t-th iteration in the s-th outer loop.

The proof easily follows from (6) since F* is a lower bound of F (that is, E[F(w̃_S)] ≥ F*). We note that the term

(1/(S(m+1))) ∑_{s=1}^{S} ∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²]

is simply the average of the expectations of the squared norms of the gradients of all the iterates generated by SARAH. For nonconvex problems, our goal is to achieve

E[‖∇F(w̃)‖²] ≤ ε.

We note that, for simplicity, if w̃ is chosen uniformly at random from all the iterates w_t^{(s)} generated by SARAH, we are able to attain accuracy E[‖∇F(w̃)‖²] ≤ ε.
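Since Theorem 1 bounds the average of E[‖∇F(w_t^{(s)})‖²] over all iterates, an ε-accurate output can be produced by returning one iterate chosen uniformly at random. The sketch below shows one way to do this without storing every iterate, via reservoir sampling; this implementation detail is an illustration and is not prescribed by the paper.

```python
import numpy as np

class UniformIterateSelector:
    """Keep one iterate chosen uniformly at random from a stream (reservoir sampling, k = 1)."""

    def __init__(self, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.count = 0
        self.choice = None

    def observe(self, w):
        self.count += 1
        # replace the stored choice with probability 1/count so the final pick is uniform
        if self.rng.random() < 1.0 / self.count:
            self.choice = w.copy()

# Inside SARAH, call selector.observe(w) for every iterate w_t^{(s)};
# selector.choice is then a uniformly random iterate w̃ whose expected squared
# gradient norm equals the average bounded in Theorem 1.
```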

Corollary 1.

Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1) with η = 2/(L(√(4m+1)+1)), where m is the inner loop size. Then, in order to achieve an ε-accurate solution, the total complexity is

O( (n/√m + √m) · (1/ε) ).

The total complexity can be minimized over the inner loop size m. By choosing m = O(n), we achieve the minimal total complexity:

Corollary 2.

Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1) with η = 2/(L(√(4m+1)+1)), where m is the inner loop size, chosen proportional to n. Then, in order to achieve an ε-accurate solution, the total complexity is

O(√n/ε).
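The choice m = O(n) in Corollary 2 can be sanity-checked numerically. Under the simplified cost model T_c(m) ∝ (n + 2m)/√m suggested by Corollary 1 (constants in L, ε, and F(w̃_0) − F* dropped — an assumption made only for this illustration), the minimizer is m = n/2, which yields the O(√n/ε) total complexity:

```python
import numpy as np

def relative_total_complexity(m, n):
    # T_c(m) ∝ (n + 2m) / sqrt(m): one full gradient (n) plus 2 gradients per inner step,
    # divided by sqrt(m) because the admissible step size scales like 1/sqrt(m).
    return (n + 2 * m) / np.sqrt(m)

n = 100_000
ms = np.arange(1, 4 * n)
best_m = ms[np.argmin(relative_total_complexity(ms, n))]
print(best_m, n // 2)                                        # minimizer is (close to) n/2, i.e. m = O(n)
print(relative_total_complexity(best_m, n) / np.sqrt(n))     # ≈ 2*sqrt(2): an O(sqrt(n)) cost per unit 1/ε
```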

Remark 1.

The total complexity in Corollary 1 covers all choices of the inner loop size m. For example, in the case m = 0, SARAH recovers the Gradient Descent (GD) algorithm, which has total complexity O(n/ε). Theorem 1 for m = 0 also recovers the requirement on the learning rate for GD, which is η ≤ 1/L.

The above results explain the relationship between SARAH and GD and the advantages of the inner loop and outer loop of SARAH. SARAH becomes more beneficial in ML applications where n is large.

2.2 Mini-batch case

The above results can be extended to the mini-batch case where, instead of choosing a single sample i_t, we choose a mini-batch I_t of b samples uniformly at random from [n] for updating v_t^{(s)} in the inner loop. We then replace v_t^{(s)} in Algorithm 1 by

v_t^{(s)} = (1/b) ∑_{i ∈ I_t} [ ∇f_i(w_t^{(s)}) − ∇f_i(w_{t−1}^{(s)}) ] + v_{t−1}^{(s)},   (7)

where the mini-batch I_t of size b is chosen uniformly at random at each iteration of the inner loop. The result of Theorem 1 generalizes as follows.

Theorem 2 (Smooth nonconvex with mini-batch).

Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1) with v_t^{(s)} in the inner loop replaced by (7), with a learning rate η satisfying a mini-batch analogue of the step-size condition in Theorem 1.

Then, for any given w̃_0, we have

(1/(S(m+1))) ∑_{s=1}^{S} ∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/(η(m+1)S)) [F(w̃_0) − F*],

where F* is any lower bound of F, and w_t^{(s)} is the t-th iteration in the s-th outer loop.

We can again derive similar corollaries as was done for Theorem 1, but this does not lead to additional insight; it results in the same minimal total complexity for ε-accurate solutions.
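A short sketch of the mini-batch estimator (7) as it would replace the single-sample update inside the inner loop; the oracle grad_f_i and the sampling-without-replacement choice are assumptions consistent with the description above.

```python
import numpy as np

def minibatch_sarah_direction(grad_f_i, w_t, w_prev, v_prev, n, b, rng):
    """Mini-batch SARAH update (7): average the recursive difference over a batch I_t."""
    batch = rng.choice(n, size=b, replace=False)   # I_t drawn uniformly from [n], |I_t| = b
    diff = np.mean([grad_f_i(w_t, i) - grad_f_i(w_prev, i) for i in batch], axis=0)
    # v_t = (1/b) Σ_{i∈I_t} [∇f_i(w_t) − ∇f_i(w_{t−1})] + v_{t−1}
    return diff + v_prev
```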

3 Convex Case: SARAH++: A New Variant of SARAH+

In this section, we propose a new variant of SARAH+ (Algorithm 2) (Nguyen et al., 2017a), called SARAH++ (Algorithm 3), for convex problems of form (1).

Different from SARAH, SARAH+ provides a stopping criterion for the inner loop: as soon as

‖v_t^{(s)}‖² ≤ γ ‖v_0^{(s)}‖²,

the inner loop finishes. This idea originates from the property of SARAH that, for each outer loop iteration s, E[‖v_t^{(s)}‖²] → 0 as t → ∞ in the strongly convex case (Theorems 1a and 1b in (Nguyen et al., 2017a)). Therefore, it does not make any sense to update with tiny steps when ‖v_t^{(s)}‖² is small. (We note that SVRG (Johnson & Zhang, 2013) does not have this property.) SARAH+ suggests to choose the parameter γ empirically (Nguyen et al., 2017a), without theoretical guarantee.

  Parameters: the learning rate η > 0, 0 < γ ≤ 1, the maximum inner loop size m, and the outer loop size S
  Initialize: w̃_0
  Iterate:
  for s = 1, 2, …, S do
     w_0^{(s)} = w̃_{s−1}
     v_0^{(s)} = (1/n) ∑_{i=1}^{n} ∇f_i(w_0^{(s)})
     w_1^{(s)} = w_0^{(s)} − η v_0^{(s)}
     t = 1
     while ‖v_{t−1}^{(s)}‖² > γ ‖v_0^{(s)}‖² and t < m do
        Sample i_t uniformly at random from [n]
        v_t^{(s)} = ∇f_{i_t}(w_t^{(s)}) − ∇f_{i_t}(w_{t−1}^{(s)}) + v_{t−1}^{(s)}
        w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}
        t = t + 1
     end while
     Set w̃_s = w_t^{(s)}
  end for
Algorithm 2 SARAH+ (Nguyen et al., 2017a)

Here, we modify SARAH+ (Algorithm 2) into SARAH++ (Algorithm 3) by choosing the stopping criterion for the inner loop as

‖v_t^{(s)}‖² < γ ‖v_0^{(s)}‖²,

with the controlled factor γ now justified by the analysis below, and by introducing a stopping criterion for the outer loop.

3.1 Details of SARAH++ and Convergence Analysis

Before analyzing and explaining SARAH++ in detail, we introduce the following assumptions used in this section.

Assumption 2 (μ-strongly convex).

The function F : ℝ^d → ℝ is μ-strongly convex, i.e., there exists a constant μ > 0 such that, for all w, w′ ∈ ℝ^d,

F(w) ≥ F(w′) + ∇F(w′)^⊤ (w − w′) + (μ/2) ‖w − w′‖².

Under Assumption 2, let us define the (unique) optimal solution of (1) as w_*. Then strong convexity of F implies that

2μ [F(w) − F(w_*)] ≤ ‖∇F(w)‖², for all w ∈ ℝ^d.   (8)

We note here, for future use, that for strongly convex functions of the form (1), arising in machine learning applications, the condition number is defined as κ = L/μ. Assumption 2 covers a wide range of problems, e.g. ℓ2-regularized empirical risk minimization problems with convex losses.

We separately assume the special case of strong convexity of all f_i's with μ = 0, called the general convexity assumption, which we will use in the convergence analysis.

Assumption 3.

Each function f_i, i ∈ [n], is convex, i.e., for all w, w′ ∈ ℝ^d,

f_i(w) ≥ f_i(w′) + ∇f_i(w′)^⊤ (w − w′).

SARAH++ is motivated by the following lemma.

Lemma 2.

Suppose that Assumptions 1 and 3 hold. Consider a single outer loop iteration in SARAH (Algorithm 1) with η ≤ 1/L. Then, for any t ≥ 0 and any s ≥ 1, we have

(9)

where w_* is any optimal solution of (1).

Clearly, if

E[‖v_t^{(s)}‖²] ≥ γ E[‖v_0^{(s)}‖²],

where γ is a threshold determined by η and L, inequality (9) implies that E[F(w_{t+1}^{(s)}) − F(w_*)] decreases by at least (η/2) E[‖∇F(w_t^{(s)})‖²] compared to E[F(w_t^{(s)}) − F(w_*)].

For this reason, we choose the stopping criterion for the inner loop in SARAH++ as ‖v_t^{(s)}‖² < γ ‖v_0^{(s)}‖². Unlike SARAH+, for analyzing the convergence rate the maximum inner loop size m can be as small as 1.

  Parameters: the controlled factor γ (0 < γ ≤ 1), the learning rate η > 0, the total iteration count T, and the maximum inner loop size m.
  Initialize: w̃_0
  TotalIter = 0, s = 0
  Iterate:
  while TotalIter < T do
     s = s + 1
     w_0^{(s)} = w̃_{s−1}
     v_0^{(s)} = (1/n) ∑_{i=1}^{n} ∇f_i(w_0^{(s)})
     t = 0
     while ‖v_t^{(s)}‖² ≥ γ ‖v_0^{(s)}‖² and t < m do
        w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}
        t = t + 1
        if t < m then
           Sample i_t uniformly at random from [n]
           v_t^{(s)} = ∇f_{i_t}(w_t^{(s)}) − ∇f_{i_t}(w_{t−1}^{(s)}) + v_{t−1}^{(s)}
        end if
     end while
     τ_s = t
     TotalIter = TotalIter + τ_s
     w̃_s = w_t^{(s)}
  end while
  S̃ = s
  Set the final output w̃ = w̃_{S̃}
Algorithm 3 SARAH++
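The following is a minimal Python sketch of the SARAH++ loop structure described above: the inner loop stops once ‖v_t‖² falls below γ‖v_0‖² or after m steps, and the outer loop stops once the running count of inner iterations reaches T. Counter placement and other small details are assumptions where the pseudocode leaves them implicit.

```python
import numpy as np

def sarah_pp(grad_f_i, w0, n, eta, gamma, T, m, rng=None):
    """Sketch of SARAH++: SARAH with inner- and outer-loop stopping criteria."""
    rng = np.random.default_rng() if rng is None else rng
    w_tilde = np.asarray(w0, dtype=float)
    total_iters = 0
    while total_iters < T:
        w_prev = w_tilde.copy()
        v = np.mean([grad_f_i(w_prev, i) for i in range(n)], axis=0)   # v_0^{(s)}: full gradient
        v0_sq = float(v @ v)
        w, t = w_prev.copy(), 0
        # inner loop: continue while the recursive gradient estimate is still "large" and t < m
        while float(v @ v) >= gamma * v0_sq and t < m:
            w_prev, w = w, w - eta * v          # w_{t+1} = w_t − η v_t
            t += 1
            if t < m:
                i_t = rng.integers(n)
                v = grad_f_i(w, i_t) - grad_f_i(w_prev, i_t) + v
        w_tilde = w                             # restart the next outer loop from the last iterate
        total_iters += t                        # running sum of inner iterations, compared against T
    return w_tilde
```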

The above discussion leads to SARAH++ (Algorithm 3). In order to analyze its convergence for convex problems, we define the random variable τ_s as the stopping time of the inner loop in the s-th outer iteration:

τ_s = min{ t ≥ 1 : ‖v_t^{(s)}‖² < γ ‖v_0^{(s)}‖² or t = m }.

Note that τ_s is at least 1 since at t = 0 the condition ‖v_0^{(s)}‖² ≥ γ ‖v_0^{(s)}‖² always holds (recall γ ≤ 1).

Let the random variable S̃ be the stopping time of the outer iterations as a function of the algorithm parameter T:

S̃ = min{ s ≥ 1 : τ_1 + τ_2 + ⋯ + τ_s ≥ T }.

Notice that SARAH++ maintains a running sum ∑_{j ≤ s} τ_j against which the parameter T is compared in the stopping criterion of the outer loop.

For the general convex case, which supposes Assumption 3 in addition to smoothness, we have the next theorem.

Theorem 3 (Smooth general convex).

Suppose that Assumptions 1 and 3 hold. Consider SARAH++ (Algorithm 3) with η ≤ 1/L and 0 < γ ≤ 1. Then

E[ (1/T_total) ∑_{s=1}^{S̃} ∑_{t=0}^{τ_s − 1} ‖∇F(w_t^{(s)})‖² ],   where T_total = ∑_{s=1}^{S̃} τ_s ≥ T,

the expectation of the average of the squared norms of the gradients of all iterations generated by SARAH++, is bounded by (2/(ηT)) [F(w̃_0) − F(w_*)].

The theorem leads to the next corollary about the iteration complexity, i.e., we bound T, which is the total number of iterations performed by the inner loop across all outer loop iterations. This is different from the total complexity since T does not separately count the gradient evaluations when the full gradient is computed in the outer loop.

Corollary 3 (Smooth general convex).

For the conditions in Theorem 3 with η = 1/L, we achieve an ε-accurate solution after T = O(1/ε) inner loop iterations.

By supposing Assumption 2 in addition to the smoothness and general convexity assumptions, we can prove a linear convergence rate. For strongly convex objective functions we have the following result.

Theorem 4 (Smooth strongly convex).

Suppose that Assumptions 1, 2 and 3 hold. Consider SARAH++ (Algorithm 3) with η ≤ 1/L and 0 < γ ≤ 1. Then, for the final output w̃_{S̃} of SARAH++, we have

E[F(w̃_{S̃}) − F(w_*)] ≤ (1 − μη)^{T} [F(w̃_0) − F(w_*)].   (10)

This leads to the following iteration complexity.

Corollary 4 (Smooth strongly convex).

For the conditions in Theorem 4 with η = 1/L, we achieve E[F(w̃_{S̃}) − F(w_*)] ≤ ε after T = O(κ log(1/ε)) total iterations, where κ = L/μ is the condition number.

Remark 2.

The proofs of the above results hold for any maximum inner loop size m ≥ 1. If we choose m = 1, then SARAH++ reduces to the Gradient Descent algorithm, since the inner “while” loop stops right after updating w_1^{(s)}. In this case, Corollaries 3 and 4 recover the rate of convergence and complexity of GD.

In this section, we showed that SARAH++ has theoretical convergence guarantees (Theorems 3 and 4), while SARAH+ does not have such a guarantee.

An interesting open question we would like to discuss here is the total complexity of SARAH++. Although we have shown convergence results for SARAH++ in terms of the iteration complexity, the total complexity, which is computed as the total number of evaluations of the component gradients, still remains an open question. It is clear that the total complexity must depend on the learning rate η (or the controlled factor γ) – the factor that decides when to stop the inner iterations.

We note that T can be “closely” understood as the total number of updates of the algorithm. The total complexity equals n gradient evaluations for every full gradient computed in the outer loop plus two component gradient evaluations for each remaining inner iteration. For the special case m = 1, the algorithm recovers the GD algorithm with S̃ = T. Since each full gradient takes n gradient evaluations, the total complexity for this case is equal to O(n/ε) (in the general convex case) and O(nκ log(1/ε)) (in the strongly convex case).

However, it is non-trivial to derive the total complexity of SARAH++ since it should depend on the learning rate η. We leave this question as an open direction for future research.

3.2 Numerical Experiments

The paper (Nguyen et al., 2017a) provides experiments showing good overall performance of SARAH compared to other algorithms such as SGD (Robbins & Monro, 1951), SAG (Le Roux et al., 2012), SVRG (Johnson & Zhang, 2013), etc. For this reason, we provide experiments comparing SARAH++ directly with SARAH. We note that SARAH (with multiple outer loops), like SARAH++, has theoretical guarantees, with sublinear convergence for general convex and linear convergence for strongly convex problems, as proved in (Nguyen et al., 2017a). Because of these theoretical guarantees (which SARAH+ does not have), SARAH itself may already perform well for convex problems, and the question is whether SARAH++ offers an advantage.

We consider ℓ2-regularized logistic regression problems with

f_i(w) = log(1 + exp(−y_i x_i^⊤ w)) + (λ/2) ‖w‖²,   (11)

where {(x_i, y_i)}_{i=1}^{n} is the training data and the regularization parameter is set to λ = 1/n, a widely-used value in the literature (Le Roux et al., 2012; Nguyen et al., 2017a). The condition number is equal to κ = L/μ. We conducted experiments to demonstrate the advantage in performance of SARAH++ over SARAH for convex problems on popular data sets from LIBSVM (Chang & Lin, 2011), including covtype and ijcnn1.
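For reference, a sketch of the objective (11) and its component gradients in the form used by the SARAH/SARAH++ sketches above; the label convention y_i ∈ {−1, +1} and the dense-matrix representation are assumptions made for illustration.

```python
import numpy as np

def make_l2_logistic(X, y, lam):
    """ℓ2-regularized logistic loss: F(w) = (1/n) Σ log(1 + exp(−y_i x_iᵀ w)) + (λ/2)‖w‖²."""
    n = X.shape[0]

    def F(w):
        margins = -y * (X @ w)
        return np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (w @ w)

    def grad_f_i(w, i):
        # gradient of f_i(w) = log(1 + exp(−y_i x_iᵀ w)) + (λ/2)‖w‖²
        sigma = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ w)))
        return -sigma * y[i] * X[i] + lam * w

    return F, grad_f_i

# usage with the sketches above:  F, grad_f_i = make_l2_logistic(X, y, lam=1.0 / X.shape[0])
```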

Figure 1: Comparison of F(w) − F(w_*) between SARAH++ and SARAH with different learning rates on the covtype and ijcnn1 datasets

Figure 1 shows comparisons between SARAH++ and SARAH for different values of the learning rate η. We depict the value of F(w) − F(w_*) (in log scale) on the vertical axis and the “number of effective passes” (or number of epochs, where an epoch is the equivalent of n component gradient evaluations or one full gradient computation) on the horizontal axis. For SARAH, we choose the outer loop size S and tune the inner loop size m to achieve the best performance. The optimal solution w_* of the strongly convex problem in (11) is found by running Gradient Descent with a small-tolerance stopping criterion on ‖∇F(w)‖. We observe that SARAH++ achieves improved overall performance compared to regular SARAH, as shown in Figure 1. From the experiments we see that the stopping criterion (‖v_t^{(s)}‖² < γ‖v_0^{(s)}‖²) of SARAH++ is indeed important: it prevents the inner loop from taking tiny, redundant update steps. We also provide experiments on the sensitivity to the maximum inner loop size m in the supplementary material.

3.3 SARAH Adaptive: A New Practical Variant

We now propose a practical adaptive method which aims to further improve performance. Although we do not have a theoretical result for this adaptive method, numerical experiments are very promising and heuristically show improved performance on different data sets.

  Parameters: the maximum inner loop size m, the outer loop size S, and the factor γ (0 < γ ≤ 1).
  Initialize: w̃_0
  Iterate:
  for s = 1, 2, …, S do
     w_0^{(s)} = w̃_{s−1}
     v_0^{(s)} = (1/n) ∑_{i=1}^{n} ∇f_i(w_0^{(s)})
     t = 0
     while ‖v_t^{(s)}‖² ≥ γ ‖v_0^{(s)}‖² and t < m do
        Compute the learning rate η_t from ‖v_t^{(s)}‖² and ‖v_0^{(s)}‖² (adaptive)
        w_{t+1}^{(s)} = w_t^{(s)} − η_t v_t^{(s)}
        t = t + 1
        if t < m then
           Sample i_t uniformly at random from [n]
           v_t^{(s)} = ∇f_{i_t}(w_t^{(s)}) − ∇f_{i_t}(w_{t−1}^{(s)}) + v_{t−1}^{(s)}
        end if
     end while
     Set w̃_s = w_t^{(s)}
  end for
Algorithm 4 SARAH Adaptive

The motivation for this algorithm comes from the intuition behind Lemma 2 (for convex optimization). For a single outer loop with η ≤ 1/L, (9) holds for SARAH (Algorithm 1). Hence, for any t ≥ 0, we intentionally choose the learning rate η_t adaptively at each inner iteration rather than keeping it fixed. Since E[‖v_t^{(s)}‖²] ≤ E[‖v_{t−1}^{(s)}‖²], t ≥ 1, in (Nguyen et al., 2017a) for convex problems, we have E[‖v_t^{(s)}‖²] ≤ E[‖v_0^{(s)}‖²], t ≥ 0. We also stop the inner loop by the stopping criterion ‖v_t^{(s)}‖² < γ ‖v_0^{(s)}‖² for some 0 < γ ≤ 1. SARAH Adaptive is given in detail in Algorithm 4, without convergence analysis.
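A sketch of the SARAH Adaptive loop with the adaptive step-size rule left as a user-supplied function step_rule; the default shown, η_t = (1/L)·‖v_t^{(s)}‖/‖v_0^{(s)}‖, is a hypothetical placeholder chosen only for illustration and is not claimed to be the rule of Algorithm 4.

```python
import numpy as np

def sarah_adaptive(grad_f_i, w0, n, m, S, gamma, L, step_rule=None, rng=None):
    """Sketch of SARAH Adaptive (Algorithm 4) with an injectable adaptive step-size rule.

    step_rule(v_sq, v0_sq, L) returns η_t; the default below is a hypothetical
    placeholder (η_t = (1/L) * sqrt(v_sq / v0_sq)) used only for illustration.
    """
    if step_rule is None:
        step_rule = lambda v_sq, v0_sq, L: (1.0 / L) * np.sqrt(v_sq / v0_sq)
    rng = np.random.default_rng() if rng is None else rng
    w_tilde = np.asarray(w0, dtype=float)
    for s in range(S):
        w_prev = w_tilde.copy()
        v = np.mean([grad_f_i(w_prev, i) for i in range(n)], axis=0)   # v_0^{(s)}
        v0_sq = float(v @ v)
        w, t = w_prev.copy(), 0
        while float(v @ v) >= gamma * v0_sq and t < m:
            eta_t = step_rule(float(v @ v), v0_sq, L)   # adaptive learning rate
            w_prev, w = w, w - eta_t * v
            t += 1
            if t < m:
                i_t = rng.integers(n)
                v = grad_f_i(w, i_t) - grad_f_i(w_prev, i_t) + v
        w_tilde = w
    return w_tilde
```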

Figure 2: Comparison of F(w) − F(w_*) between SARAH Adaptive and SARAH with different learning rates on the covtype and ijcnn1 datasets
Figure 3: Comparison of F(w) − F(w_*) between SARAH Adaptive and SARAH++ with different learning rates on the covtype and ijcnn1 datasets

We have conducted numerical experiments on the same data sets and problems as introduced in the previous subsection. Figures 2 and 3 show the comparison between SARAH Adaptive and SARAH and SARAH++, respectively, for different values of the learning rate η. We observe that SARAH Adaptive has improved performance over SARAH and SARAH++ (without tuning a learning rate). We also present the numerical performance of SARAH Adaptive for different parameter settings in the supplementary material.

We note that additional experiments on more data sets are provided in the supplementary material.

4 Conclusion and Future Research

We have proven, for the first time, that the optimal total complexity for smooth nonconvex problems in the finite-sum setting, which arises frequently in supervised learning applications, can be achieved, namely by a slightly modified version of SARAH. For convex problems, we proposed SARAH++ with theoretical convergence guarantees and showed improved performance over SARAH.

For future research, the ideas in this paper may apply to general expectation minimization problems using an inexact version of the gradient (Nguyen et al., 2018b). It would also be worthwhile to investigate SARAH Adaptive in more detail since it has promising empirical results. Moreover, SARAH may open new research directions since, as shown in this paper, it reduces to Gradient Descent in special cases.

References

Appendix

Useful Existing Results

Lemma 3 (Theorem 2.1.5 in (Nesterov, 2004)).

Suppose that f is L-smooth. Then, for any w, w′ ∈ ℝ^d,

f(w) ≤ f(w′) + ∇f(w′)^⊤ (w − w′) + (L/2) ‖w − w′‖².   (12)
Lemma 4 (Lemma 2 in (Nguyen et al., 2017a) (or in (Nguyen et al., 2017b))).

Suppose that Assumption 1 holds. Consider v_t^{(s)} defined by (2) (or (7)) in SARAH (Algorithm 1) for any s ≥ 1. Then for any t ≥ 1,

E[‖∇F(w_t^{(s)}) − v_t^{(s)}‖²] ≤ ∑_{j=1}^{t} ( E[‖v_j^{(s)} − v_{j−1}^{(s)}‖²] − E[‖∇F(w_j^{(s)}) − ∇F(w_{j−1}^{(s)})‖²] ).   (13)
Lemma 5 (Lemma 3 in (Nguyen et al., 2017a)).

Suppose that Assumptions 1 and 3 hold. Consider v_t^{(s)} defined by (2) in SARAH (Algorithm 1) with η < 2/L for any s ≥ 1. Then we have that for any t ≥ 1,

E[‖∇F(w_t^{(s)}) − v_t^{(s)}‖²] ≤ (ηL/(2 − ηL)) ( E[‖v_0^{(s)}‖²] − E[‖v_t^{(s)}‖²] ).   (14)

Nonconvex SARAH

Proof of Lemma 1

Lemma 1. Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1) within a single outer loop with η ≤ 2/(L(√(4m+1)+1)). Then, for any s ≥ 1, we have

∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] ≤ (2/η) E[F(w_0^{(s)}) − F(w_{m+1}^{(s)})].

Proof.

We use some parts of the proof in (Nguyen et al., 2017b). By Assumption 1 and w_{t+1}^{(s)} = w_t^{(s)} − η v_t^{(s)}, for any s ≥ 1, we have

E[F(w_{t+1}^{(s)})] ≤ E[F(w_t^{(s)})] − η E[∇F(w_t^{(s)})^⊤ v_t^{(s)}] + (Lη²/2) E[‖v_t^{(s)}‖²]
 = E[F(w_t^{(s)})] − (η/2) E[‖∇F(w_t^{(s)})‖²] + (η/2) E[‖∇F(w_t^{(s)}) − v_t^{(s)}‖²] − (η/2 − Lη²/2) E[‖v_t^{(s)}‖²],   (15)

where the last equality follows from the fact a^⊤ b = ½( ‖a‖² + ‖b‖² − ‖a − b‖² ) for any a, b ∈ ℝ^d. By summing over t = 0, …, m, we have

∑_{t=0}^{m} E[F(w_{t+1}^{(s)})] ≤ ∑_{t=0}^{m} E[F(w_t^{(s)})] − (η/2) ∑_{t=0}^{m} E[‖∇F(w_t^{(s)})‖²] + (η/2) ∑_{t=0}^{m} E[‖∇F(w_t^{(s)}) − v_t^{(s)}‖²] − (η/2 − Lη²/2) ∑_{t=0}^{m} E[‖v_t^{(s)}‖²].   (16)

Now, we would like to determine η such that the expression in (16) can be appropriately bounded. We have