On Faster Convergence of Scaled Sign Gradient Descent

09/04/2021, by Xiuxian Li et al.

Communication has been seen as a significant bottleneck in industrial applications over large-scale networks. To alleviate the communication burden, sign-based optimization algorithms, which are closely related to adaptive gradient methods such as Adam, have recently gained popularity in both industrial and academic communities. Along this line, this paper investigates faster convergence for a variant of sign-based gradient descent, called scaled signGD, in three cases: 1) the objective function is strongly convex; 2) the objective function is nonconvex but satisfies the Polyak-Lojasiewicz (PL) inequality; 3) the gradient is stochastic, in which case the algorithm is called scaled signSGD. For the first two cases, it can be shown that scaled signGD converges at a linear rate. For case 3), the algorithm is shown to converge linearly to a neighborhood of the optimal value when a constant learning rate is employed, and to converge at a rate of O(1/k) when using a diminishing learning rate, where k is the iteration number. The results are also extended to the distributed setting by majority vote in a parameter-server framework. Finally, numerical experiments on logistic regression are performed to corroborate the theoretical findings.


I Introduction

This paper studies the unconstrained optimization problem

min_{x∈ℝ^d} f(x), (1)

where the objective f: ℝ^d → ℝ is a proper differentiable function and may be nonconvex. Such problems have numerous applications in industry, such as electric vehicles [1, 2], smart grid [3], the internet of things (IoT) [4], and so on. A quintessential algorithm for solving this problem is the gradient descent (GD) method [5, 6, 7], which requires access to true gradients. However, computing true gradients is usually expensive or difficult in practice, and hence the stochastic gradient descent (SGD) algorithm, which relies on cheaper-to-compute stochastic gradients, has become prevalent in deep neural networks [8, 9].

As for large-scale neural networks, training efficiency can generally be improved substantially by introducing multiple workers in a parameter-server framework, where a group of workers train on their own mini-batch datasets in parallel. Nonetheless, the communication between workers and the parameter server has been a non-negligible obstacle to wide practical application. As such, sign-based methods, as one class of gradient compression techniques, have become popular in recent years, not only because they can reduce the communication cost to one bit per gradient coordinate, but also because they perform well and are closely related to adaptive gradient methods [10, 11, 12]. As a matter of fact, it has been demonstrated in [13, 11] that SIGNSGD with momentum often performs quite similarly to Adam on deep learning tasks in practice. Note that a wide range of gradient compression approaches exist for reducing the communication cost in the literature, e.g., [14, 15], whose elaboration is beyond the scope of this paper. In particular, the sign-based methods considered in this paper can be regarded as a special gradient compression scheme that needs to transmit only one bit per gradient component [16].

Along this line, the sign gradient descent (SIGNGD) algorithm and its stochastic counterpart (SIGNSGD) have been extensively studied in recent years [17, 11, 18, 12, 19], which are, respectively, of the form

x_{k+1} = x_k − δ · sign(∇f(x_k)), (2)
x_{k+1} = x_k − δ · sign(g_k), (3)

where g_k is a stochastic gradient of f at x_k, δ > 0 is the learning rate, and the signum function sign(·) is applied componentwise. For instance, it was demonstrated in [11] that SIGNSGD enjoys an SGD-level convergence rate for nonconvex but smooth objective functions under a separable smoothness assumption, which, in combination with majority vote in the distributed setup, was further shown to be efficient in terms of communication and fault tolerance in [18]. Recently, the authors in [12] found that ℓ∞-smoothness is a weaker and more natural assumption than separable smoothness and established two conditions under which sign-based methods are preferable over GD.
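In code, the updates (2) and (3) are one-liners; a minimal sketch (the gradient oracles grad_f and stoch_grad are placeholders):

  import numpy as np

  def signgd_step(x, grad_f, lr):
      # SIGNGD (2): descend along the componentwise sign of the true gradient.
      return x - lr * np.sign(grad_f(x))

  def signsgd_step(x, stoch_grad, lr):
      # SIGNSGD (3): the same update driven by a stochastic gradient estimate.
      return x - lr * np.sign(stoch_grad(x))

Only the sign pattern of the gradient enters the update, which is what makes the one-bit-per-coordinate compression possible.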

Contributions. To the best of our knowledge, this paper is the first to address faster convergence of sign-based methods. The details are as follows.

First, it is found that SIGNGD is generally not convergent even for strongly convex and smooth objectives when using constant learning rates, although vanilla GD is. Therefore, the scaled versions in Algorithms 1 and 2 are investigated. It is proved that Algorithm 1 converges linearly to the minimal value in two cases: strongly convex objectives, and nonconvex objectives satisfying the Polyak-Łojasiewicz (PL) inequality. Meanwhile, Algorithm 2 converges linearly to a neighborhood of the minimal value when using a constant learning rate, with an error proportional to the learning rate and the variance of the stochastic gradients. When applying a suitable diminishing learning rate, a rate of O(1/k) can be ensured for (15), which is superior to the widely known O(1/√k) rate [20].

Second, the obtained results are extended to the distributed setup, where a group of workers compute their own (stochastic) gradients on individual datasets and then transmit the sign of the gradient and the gradient's ℓ1-norm to the parameter server, which computes the majority vote of the sign gradients, averages the gradient ℓ1-norms, and transmits both back to all the workers.

Notations. Denote [n] := {1, 2, …, n} for an integer n > 0. Let ‖x‖₁, ‖x‖₂, ‖x‖∞, and xᵀ be the ℓ1-norm, ℓ2-norm, ℓ∞-norm, and the transpose of a vector x, respectively. 1 and 0 stand for column vectors of compatible dimension with all entries being 1 and 0, respectively. ∇f represents the gradient of a function f. 𝔼 and ℙ denote the mathematical expectation and probability, respectively.

II Counterexamples for SIGNGD

For SIGNGD, an interesting result can also be found in the continuous-time setup, which demonstrates obvious advantages of SIGNGD over GD. In particular, SIGNGD converges linearly, while GD converges only sublinearly. The details are deferred to the Appendix as supplemental material.

Motivated by this fact in the continuous-time setup, it seems promising to consider the discrete-time counterpart of (23), i.e., SIGNGD (2) with a constant learning rate δ. However, this is not the case. It is well known that GD converges linearly for small enough δ, while several counterexamples are presented below to illustrate that the sign counterpart (2) is generally not convergent even for strongly convex and smooth objectives.

Example 1.

Consider f(x) = x²/2 for x ∈ ℝ, which is strongly convex and smooth with μ = L = 1. By choosing the initial point as x₀ = δ/2, it is easy to verify for (2) that for all k ≥ 0,

x_k = (−1)^k δ/2, (4)

which is obviously not convergent.
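This oscillation can be checked numerically with a few lines; a sketch of (2) on the quadratic above (assuming f(x) = x²/2, so ∇f(x) = x):

  import numpy as np

  lr = 0.1
  x = lr / 2.0                      # initial point x0 = delta / 2
  for k in range(6):
      x = x - lr * np.sign(x)       # SIGNGD step for f(x) = x^2 / 2
      print(k, x)                   # alternates between -lr/2 and +lr/2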

Example 1 shows that exact convergence cannot be ensured for SIGNGD even for strongly convex and smooth objectives. To fix this, one may attempt the sign counterpart of adaptive gradient methods. However, this generally does not work either. For instance, AdaGrad-Norm [21]

x_{k+1} = x_k − (η/b_{k+1}) ∇f(x_k),  b²_{k+1} = b²_k + ‖∇f(x_k)‖₂², (5)

is shown to converge linearly without knowing any function parameters beforehand [22], while linear convergence cannot be ensured in general for its sign counterparts, as illustrated below for two sign variants.

Example 2.

Consider the first sign variant

b²_{k+1} = b²_k + ‖∇f(x_k)‖₂²,  x_{k+1} = x_k − (η/b_{k+1}) sign(∇f(x_k)), (6)

and f(x) = x²/2 (strongly convex and smooth) with μ = L = 1. For simplicity, set η = 1 and b₀ = 1. Then simple manipulations characterize the iterates, as analyzed below.

In what follows, we show that the convergence rate of (6) is not linear. To do so, it is easy to see that b²_{k+1} = b²₀ + ∑_{j=0}^{k} ‖∇f(x_j)‖₂², which leads to

(7)

By contradiction, if x_k or f(x_k) were linearly convergent, then one would have |x_k| ≤ Cρᵏ for some constants C > 0 and ρ ∈ (0, 1), which, together with (7), implies that b_k converges to some finite limit, so the step size η/b_{k+1} remains bounded below by some constant c > 0, giving

(8)

After |x_k| decreases below c/2, invoking (8) leads to |x_{k+1}| ≥ c − |x_k| > |x_k|, thus implying that x_k will eventually oscillate around the origin, which contradicts the linear convergence of x_k. Hence, (6) is not linearly convergent.
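The mechanism behind Example 2 can also be observed numerically; a sketch of (6) on f(x) = x²/2, with η = 1 and b₀ = 1 as above:

  import numpy as np

  eta, b2, x = 1.0, 1.0, 1.0
  for k in range(5000):
      g = x                          # gradient of f(x) = x^2 / 2
      b2 += g * g                    # b_{k+1}^2 = b_k^2 + ||grad f(x_k)||^2
      x = x - (eta / np.sqrt(b2)) * np.sign(g)
  # If x_k decayed linearly, b2 would stay bounded, so the step eta/b
  # would not vanish and x_k would keep jumping across the origin.
  print(x)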

Example 3.

Consider now another sign variant

(9)

and let the objective be strongly convex and smooth, with initial values x₀ and b₀. In this case, it is straightforward to calculate that

(10)

from which one can conclude that (9) amounts to

(11)

which can be viewed as GD for a convex objective with a certain learning rate. Therefore, the convergence rate of classic GD can be invoked for (11), which is known to be sublinear [23].

Remark 1.

The above examples demonstrate that although GD and AdaGrad-Norm are indeed linearly convergent for strongly convex and smooth objectives, their sign counterparts fail to converge linearly in general.

III Linear Rate of Scaled SIGNGD/SGD

With the above preparations, we are now ready to study faster convergence for solving problem (1). As shown in Section II, the sign counterparts of GD and AdaGrad-Norm do not achieve linear convergence. As such, the scaled versions of SIGNGD/SGD are considered in this paper, as in Algorithms 1 and 2; they can be viewed as steepest descent with respect to the maximum norm [12], but are still not fully understood.

A few assumptions are necessary for the following analysis.

Assumption 1.

f is μ-strongly convex with respect to the ℓ∞-norm for some constant μ > 0, i.e., f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖∞² for all x, y ∈ ℝᵈ.

Assumption 2.

f satisfies the Polyak-Łojasiewicz (PL) inequality with respect to the ℓ1-norm, i.e., ‖∇f(x)‖₁² ≥ 2μ(f(x) − f*) for all x ∈ ℝᵈ, where f* is the minimum value of f.

Assumption 3.

f is L-smooth with respect to the ℓ∞-norm, i.e., f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖∞² for all x, y ∈ ℝᵈ.

The PL inequality does not even require f to be convex, and the ℓ∞- and ℓ1-norms employed in Assumptions 1 and 2, respectively, are slightly more relaxed than the Euclidean norm. Meanwhile, the smoothness condition is made with respect to the ℓ∞-norm, since it is more favorable than Euclidean smoothness and separable smoothness [12].

Remark 2.

It is noteworthy that another promising sign method is EF-SIGNGD [16], which uses error feedback and is given as

p_k = δ∇f(x_k) + e_k,  Δ_k = (‖p_k‖₁/d) sign(p_k),  x_{k+1} = x_k − Δ_k,  e_{k+1} = p_k − Δ_k, (12)

where δ > 0 is the learning rate and e₀ = 0. In [16], it is shown that EF-SIGNGD/SGD performs better than SIGNGD/SGD, actually enjoying the same convergence rate as GD/SGD. However, we point out that EF-SIGNGD/SGD is, roughly speaking, equivalent to GD/SGD. Let us show this by slightly modifying (12) as

(13)

By defining x̃_k := x_k − e_k, it is easy to verify that x̃_{k+1} = x̃_k − δ∇f(x_k), that is, (13) amounts to GD in terms of x̃_k. As a result, EF-SIGNGD/SGD is not considered here.
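The near-equivalence in Remark 2 is easy to verify numerically. The sketch below implements an error-feedback recursion of the form (12) (the ℓ1/d scaling of the compressed step follows [16]; treat this as one concrete instance) and tracks the corrected iterate x_k − e_k:

  import numpy as np

  def ef_signgd(grad_f, x0, lr, steps):
      x, e = x0.astype(float), np.zeros_like(x0, dtype=float)
      for _ in range(steps):
          p = lr * grad_f(x) + e                       # add back the residual
          d = (np.abs(p).sum() / p.size) * np.sign(p)  # scaled sign step
          x, e = x - d, p - d                          # e_{k+1} = p_k - d_k
      return x, e

  grad_f = lambda x: x                                 # f(x) = ||x||^2 / 2
  x, e = ef_signgd(grad_f, np.ones(3), 0.1, 100)
  # The corrected iterate obeys (x - e)_{k+1} = (x - e)_k - lr * grad_f(x_k),
  # i.e., a GD recursion up to where the gradient is evaluated.
  print(x - e)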

In the following, the main results are divided into two scenarios, i.e., the deterministic and stochastic settings.

  Input: learning rate δ > 0, current point x_k
  x_{k+1} = x_k − δ ‖∇f(x_k)‖₁ sign(∇f(x_k)) (14)
Algorithm 1 Scaled SIGNGD
  Input: learning rate δ > 0, current point x_k
  x_{k+1} = x_k − δ ‖g_k‖₁ sign(g_k) (15)
Algorithm 2 Scaled SIGNSGD
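In code, one step of Algorithm 1 or 2 is as follows (a sketch of (14)/(15) as written above, i.e., the sign step rescaled by the gradient's ℓ1-norm):

  import numpy as np

  def scaled_sign_step(x, g, lr):
      # (14)/(15): x_{k+1} = x_k - lr * ||g||_1 * sign(g), where g is the true
      # gradient (Algorithm 1) or a stochastic estimate (Algorithm 2).
      return x - lr * np.abs(g).sum() * np.sign(g)

The ℓ1 scaling reinstates the magnitude information that the plain sign operator discards, which the counterexamples in Section II suggest is necessary for linear convergence.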

III-A The Deterministic Setting

Consider the deterministic setting with full gradients, i.e., (14), for which we have the following results. Note that all proofs are given in the Appendix.

Theorem 1.

The following statements are true for (14).

  1. Under Assumptions 1 and 3, if δ ∈ (0, 2/L), then

    f(x_{k+1}) − f* ≤ (1 − ρ)^{k+1}(f(x₀) − f*), (16)

    where ρ = 2μδ(1 − Lδ/2) ∈ (0, 1].

  2. Under Assumptions 2 and 3 with δ satisfying the same condition, (16) still holds.

  3. If Assumption 3 holds only, then

    min_{0≤j≤k} ‖∇f(x_j)‖₁² ≤ (f(x₀) − f*)/(c(k + 1)), (17)

    where c = δ(1 − Lδ/2).

Remark 3.

In view of Theorem 1, the algorithm (14) is proved to be linearly convergent, in contrast to SIGNGD and the sign variants of AdaGrad-Norm discussed in Section II. Moreover, for nonconvex objectives that are smooth with respect to the Euclidean norm, by an argument similar to Theorem 1, a comparable bound can be obtained for SIGNGD with a suitably chosen constant learning rate. In comparison, the constant c in (17) can be made nearly optimal when δ is chosen to approach 1/L. In this regard, our result is tighter up to a dimension constant, and the learning rate here is easier to implement. In addition, if the smoothness is with respect to the maximum norm, then the result here has the same convergence bound as SIGNGD but with a less conservative learning rate selection.

Remark 4.

A similar result can also be obtained from the most related work [24] by resorting to the notion of a δ-approximate compressor. Specifically, the scaled sign operator can be viewed as a δ-approximate compressor, and then applying Theorem 13 in [24] leads to a corresponding learning rate and convergence rate. In contrast, Theorem 1 of this paper (with the constants suitably translated) yields a more relaxed learning rate and a faster convergence rate.

III-B The Stochastic Setting

This section considers the stochastic gradient case, where the true gradient is expensive to compute and a stochastic gradient g_k is instead relatively cheap to evaluate as an estimate of ∇f(x_k). To move forward, some standard assumptions are imposed on the stochastic gradients [13, 11].

Assumption 4.

The stochastic gradients are unbiased and have bounded variance with respect to the ℓ1-norm, i.e., there exists a constant σ > 0 such that

𝔼[g_k | x_k] = ∇f(x_k),  𝔼[‖g_k − ∇f(x_k)‖₁² | x_k] ≤ σ². (18)

In this case, the algorithm becomes (15). For brevity, define the success probability p_{k,i} := ℙ(sign(g_{k,i}) = sign(∇_i f(x_k))) for i ∈ [d] and k ≥ 0, where g_{k,i} and ∇_i f(x_k) represent the i-th components of g_k and ∇f(x_k), respectively.

Remark 5.

For the stochastic gradient g_k, when leveraging a mini-batch of size M at x_k, the oracle provides M gradient estimates, and the stochastic gradient can be chosen as their average. In this respect, the variance bound σ² can be reduced to σ²/M. Additionally, it was shown in [19] that the success probability should be greater than 1/2, as otherwise sign algorithms generally fail to work. A multitude of cases can ensure this; for instance, it suffices that each component of g_k possesses a unimodal and symmetric distribution [11, 19].
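The mini-batch variance reduction in Remark 5 is plain averaging of unbiased estimates; a sketch (the single-sample oracle stoch_grad is a placeholder):

  import numpy as np

  def minibatch_grad(stoch_grad, x, batch_size):
      # Average batch_size i.i.d. unbiased estimates of the gradient at x;
      # the variance of the average is the single-sample variance / batch_size.
      return np.mean([stoch_grad(x) for _ in range(batch_size)], axis=0)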

We are now in a position to present the main result on (15).

Theorem 2.

For (15), under Assumptions 1, 3, and 4, or Assumptions 2-4, the following statements are true.

  1. If a sufficiently small constant learning rate δ is employed, then

    (19)

    holds, i.e., the iterates converge linearly to a neighborhood of the minimal value whose radius is proportional to δ and σ.

  2. If a suitable diminishing learning rate δ_k is employed, then

    (20)

    holds, i.e., exact convergence at a rate of O(1/k).
Remark 6.

The first result in Theorem 2 shows that algorithm (15) converges linearly to a neighborhood of the optimal value, which is comparable to vanilla SGD in [25]. Moreover, the result in (20) is exact convergence at a rate of O(1/k) for both the strongly convex case and the nonconvex case under the PL inequality, which matches both vanilla SGD [26] and compression methods [20]. In addition, the same rate was established in [27]. However, the condition required for convergence in [27] does not always hold (e.g., in Theorem II.2 of [27]), and our result (20) covers faster rates than [27].

IV The Distributed Setting

Now, we extend the results in Section III to the distributed setting within a parameter-server framework. For simplicity, we focus only on scaled SIGNSGD in this section, but the results can be obtained similarly for scaled SIGNGD.

  Input: learning rate δ > 0, current point x_k, n workers each with an i.i.d. gradient estimate g_k^(i)
  On server: pull sign(g_k^(i)) and ‖g_k^(i)‖₁ from each worker; push sign(∑_{i=1}^n sign(g_k^(i))) and (1/n)∑_{i=1}^n ‖g_k^(i)‖₁ to each worker
  On each worker: x_{k+1} = x_k − δ · ((1/n)∑_{i=1}^n ‖g_k^(i)‖₁) · sign(∑_{i=1}^n sign(g_k^(i)))
Algorithm 3 Distributed Scaled SIGNSGD by Majority Vote
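For reference, one round of Algorithm 3 can be sketched in a single process as follows (each worker uploads sign(g) and ‖g‖₁; the server returns the majority-vote sign and the averaged norm, matching the description above):

  import numpy as np

  def majority_vote_step(x, worker_grads, lr):
      # worker_grads: list of n stochastic gradients, one per worker.
      signs = np.stack([np.sign(g) for g in worker_grads])       # 1 bit/coord
      norms = np.array([np.abs(g).sum() for g in worker_grads])  # one scalar each
      vote = np.sign(signs.sum(axis=0))      # coordinatewise majority vote
      scale = norms.mean()                   # average of the l1-norms
      return x - lr * scale * vote           # update applied by every worker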

To proceed, the distributed scaled SIGNSGD by majority vote is given in Algorithm 3, for which the following convergence result is obtained.

Theorem 3.

For Algorithm 3, under Assumptions 1, 3, and 4, or Assumptions 2-4, if the constant learning rate δ is sufficiently small, then

(21)

where the constants involve ⌊·⌋, the floor function, and the regularized incomplete beta function I_x(a, b), defined by I_x(a, b) = ∫₀ˣ t^{a−1}(1 − t)^{b−1} dt / B(a, b) with B(a, b) = ∫₀¹ t^{a−1}(1 − t)^{b−1} dt.

Remark 7.

It is noteworthy that the exact convergence can be similarly established as (20) in Theorem 2, which is omitted in Theorem 3.

V Experiments

Numerical experiments are provided here to corroborate the obtained theoretical results.

Fig. 1: Simulation results for several algorithms.
Example 4 (A Toy Example).

Let us consider a simple example whose objective f is nonconvex but satisfies the PL condition. To verify the performance of the proposed scaled SIGNGD, several existing algorithms are compared in Fig. 1, starting from an arbitrary initial state. The comparisons are performed against vanilla gradient descent (GD), SIGNGD, SIGNGD with momentum (i.e., SIGNUM), and EF-SIGNGD [16]. It can be observed from Fig. 1 that the proposed algorithm exhibits the same linear convergence as GD and EF-SIGNGD, while SIGNGD and SIGNUM do not converge, oscillating near the optimal point. In summary, this example shows the efficiency of scaled SIGNGD and supports the observation in Example 1.
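A script along the following lines reproduces the qualitative behavior in Fig. 1. Since the exact objective of Example 4 is not legible above, the sketch substitutes f(x) = x² + 3 sin²(x), a standard nonconvex function satisfying the PL condition:

  import numpy as np

  f = lambda x: x**2 + 3.0 * np.sin(x)**2          # nonconvex, satisfies PL
  grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

  def run(step, x0=2.0, iters=200):
      x, errs = x0, []
      for _ in range(iters):
          x = step(x)
          errs.append(f(x))
      return errs

  lr = 0.05
  gd     = run(lambda x: x - lr * grad(x))                          # GD
  sign   = run(lambda x: x - lr * np.sign(grad(x)))                 # SIGNGD
  scaled = run(lambda x: x - lr * abs(grad(x)) * np.sign(grad(x)))  # Alg. 1
  # GD and scaled SIGNGD decay linearly to f* = 0 (in one dimension the
  # scaled step coincides with GD), while SIGNGD ends up oscillating near
  # the minimizer, as in Example 1.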

Example 5.

Consider the logistic regression problem, where the objective is the logistic loss with a standard ℓ2-regularizer [20], i.e., f(x) = (1/m)∑_{i=1}^m log(1 + exp(−b_i a_iᵀx)) + (λ/2)‖x‖₂², where (a_i, b_i) are the data samples.

Fig. 2: Scaled SIGNSGD. (a) constant learning rate; (b) diminishing learning rate.
Fig. 3: Distributed scaled SIGNSGD with multiple workers.

To test the performance of scaled SIGNSGD, the epsilon dataset is exploited [28], and the baseline is calculated using the standard optimizer LogisticSGD of scikit-learn [29]. To marginalize out the effect of initialization, the numerical results are averaged over repeated runs. We compare scaled SIGNSGD with vanilla SGD, SIGNSGD, SIGNSGDM, and EF-SIGNSGD [16], as shown in Fig. 2, on a platform with an Intel Core i7-4300U CPU. Fig. 2 indicates that scaled SIGNSGD performs similarly to SGD and better than SIGNSGD and SIGNSGDM. It can also be observed that EF-SIGNSGD is comparable to SGD, which is consistent with the discussion in Remark 2. Moreover, the case in Fig. 2(a) with a constant learning rate converges faster than that in Fig. 2(b) with a diminishing learning rate. Meanwhile, Fig. 3 shows that more workers improve the performance. Therefore, the numerical results support our theoretical findings.
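For reference, a compact version of the training loop behind comparisons like Fig. 2 might look as follows; this is a sketch with synthetic data standing in for the epsilon dataset, the regularized logistic objective following the standard form in [20], and all constants chosen purely for illustration:

  import numpy as np

  rng = np.random.default_rng(0)
  m, d, lam = 1000, 20, 1e-4                  # illustrative sizes, not [28]
  A = rng.normal(size=(m, d)) / np.sqrt(d)
  b = np.sign(A @ rng.normal(size=d) + 0.1 * rng.normal(size=m))

  def stoch_grad(x, batch=32):
      idx = rng.integers(m, size=batch)
      s = 1.0 / (1.0 + np.exp(b[idx] * (A[idx] @ x)))    # sigmoid(-b * a.x)
      return -((s * b[idx])[:, None] * A[idx]).mean(axis=0) + lam * x

  x = np.zeros(d)
  for k in range(2000):
      g = stoch_grad(x)
      x -= 0.01 * np.abs(g).sum() * np.sign(g)           # scaled SIGNSGD (15)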

VI Conclusion

This paper has investigated faster convergence of scaled SIGNGD/SGD, which can reduce the communication cost compared with vanilla SGD. To further motivate the study of sign methods, continuous-time algorithms have been analyzed, indicating that sign GD can significantly improve the convergence speed of GD. Subsequently, it has been proven that scaled SIGNGD converges linearly for both strongly convex and nonconvex (PL) objectives. Also, the convergence of scaled SIGNSGD has been analyzed in two cases, with constant and decaying learning rates. The results have also been extended to the distributed setting in the parameter-server framework. The efficacy of the scaled sign methods has been validated by numerical experiments on the logistic regression problem.

Appendix

VI-A Further Motivations for SIGNGD

Let us provide more evidence for studying sign-based GD from the continuous-time perspective. To do so, consider the continuous-time dynamics corresponding to discrete-time GD and SIGNGD, i.e.,

ẋ(t) = −δ ∇f(x(t)), (22)
ẋ(t) = −δ sign(∇f(x(t))), (23)

where δ > 0 is a constant learning rate.

To proceed, let us construct a Lyapunov candidate as

V(x) = f(x) − f*, (24)

where f* denotes the minimum value attained by f.

For algorithms (22) and (23), the following results can be obtained.

Proposition 1.

For algorithm (22),

  1. if f is convex, then V(x(t)) = O(1/t), where the bound depends on R := sup_{t≥0} ‖x(t) − x*‖₂ with x* being a point in the set of minimizers;

  2. if f is nonconvex, then min_{s∈[0,t]} ‖∇f(x(s))‖₂² = O(1/t).

Proposition 2.

For algorithm (23),

  1. if f is convex, then V(x(t)) converges to zero at a linear (i.e., exponential) rate, where the rate depends on R := sup_{t≥0} ‖x(t) − x*‖∞;

  2. if f is nonconvex, then min_{s∈[0,t]} ‖∇f(x(s))‖₁ = O(1/t).

In view of the above results, it can be observed that (23) with sign gradients converges apparently faster than GD (22) in the continuous-time domain, indicating that the performance of gradient descent can be largely improved by sign gradient compression. For instance, in the scenario with convex objectives, GD (22) converges sublinearly while SIGNGD (23) converges linearly. As a result, the above results provide a new perspective on the advantages of SIGNGD over GD.
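The continuous-time comparison can be visualized by a forward-Euler discretization of (22) and (23) with a small step size h; a sketch:

  import numpy as np

  grad = lambda x: x                      # f(x) = ||x||^2 / 2, so f* = 0
  delta, h = 1.0, 1e-3                    # learning rate and Euler step size
  x_gd, x_sign = np.ones(5), np.ones(5)
  for _ in range(5000):
      x_gd   = x_gd   - h * delta * grad(x_gd)             # flow (22)
      x_sign = x_sign - h * delta * np.sign(grad(x_sign))  # flow (23)
  # Each coordinate of the sign flow reaches zero in finite time t = 1/delta,
  # while the gradient flow only decays exponentially on this quadratic.
  print(np.abs(x_gd).max(), np.abs(x_sign).max())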

VI-B Proof of Proposition 1

Consider first the case with convex objectives. In light of (22), it can be calculated that

dV(x(t))/dt = ⟨∇f(x(t)), ẋ(t)⟩ = −δ‖∇f(x(t))‖₂², (25)

which implies that V(x(t)) is nonincreasing.

Meanwhile, invoking the convexity of f yields

V(x(t)) ≤ ⟨∇f(x(t)), x(t) − x*⟩ ≤ ‖∇f(x(t))‖₂‖x(t) − x*‖₂, (26)

which, together with (25), gives rise to dV/dt ≤ −δV²/R², further implying the claimed result.

For the case with nonconvex objectives, integrating (25) from 0 to t yields

δ∫₀ᵗ ‖∇f(x(s))‖₂² ds = V(x(0)) − V(x(t)) ≤ V(x(0)), (27)

where the inequality has employed the fact that V(x(t)) ≥ 0 for all t. Then taking the minimum of ‖∇f(x(s))‖₂² over s ∈ [0, t] ends the proof. ∎

VI-C Proof of Proposition 2

Consider first the convex case. Similar to (25), it can be obtained that

dV(x(t))/dt = −δ‖∇f(x(t))‖₁. (28)

Akin to (26), one has that

V(x(t)) ≤ ⟨∇f(x(t)), x(t) − x*⟩ ≤ ‖∇f(x(t))‖₁‖x(t) − x*‖∞, (29)

where the second inequality has used Hölder's inequality. Combining (28) with (29) yields dV/dt ≤ −(δ/R)V, from which it is easy to verify the claimed result.

Consider now the nonconvex case. The desired result can be obtained from (28) and an argument similar to that of the convex case. This completes the proof. ∎

VI-D Proof of Theorem 1

To facilitate the subsequent analysis, define

(30)

In view of (14) and Assumption 3, it can be concluded that

f(x_{k+1}) − f(x_k) ≤ −δ(1 − Lδ/2)‖∇f(x_k)‖₁². (31)

In what follows, let us prove the three cases one by one.

First, for case 1, invoking Assumption 1 yields, for any y,

f(y) ≥ f(x_k) + ⟨∇f(x_k), y − x_k⟩ + (μ/2)‖y − x_k‖∞² ≥ f(x_k) − ‖∇f(x_k)‖₁‖y − x_k‖∞ + (μ/2)‖y − x_k‖∞²,

where the second inequality has employed the Hölder inequality. Then, minimizing over y, one has that ‖∇f(x_k)‖₁² ≥ 2μ(f(x_k) − f*). Therefore, in combination with (31), one can obtain that f(x_{k+1}) − f* ≤ (1 − ρ)(f(x_k) − f*) with ρ = 2μδ(1 − Lδ/2). Consequently, by iteration, this completes the proof of case 1.
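For completeness, the minimization over y can be carried out explicitly (writing r = ‖y − x_k‖∞; this spells out the step above under the ℓ∞-norm reading of Assumption 1):

  f* ≥ min_{r≥0} { f(x_k) − ‖∇f(x_k)‖₁ r + (μ/2) r² }
     = f(x_k) − ‖∇f(x_k)‖₁²/(2μ),   attained at r = ‖∇f(x_k)‖₁/μ,

which rearranges to the PL-type inequality ‖∇f(x_k)‖₁² ≥ 2μ(f(x_k) − f*).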

Second, for case 2, Assumption 2 directly provides ‖∇f(x_k)‖₁² ≥ 2μ(f(x_k) − f*), which, together with an argument similar to case 1, yields the conclusion in this case.

Third, for case 3, invoking (31) gives δ(1 − Lδ/2)‖∇f(x_k)‖₁² ≤ f(x_k) − f(x_{k+1}), which, by summation over 0 ≤ j ≤ k, implies that

c ∑_{j=0}^{k} ‖∇f(x_j)‖₁² ≤ f(x₀) − f(x_{k+1}) ≤ f(x₀) − f*, (32)

where the last inequality has used the fact that f(x_{k+1}) ≥ f*. Then taking the minimum of ‖∇f(x_j)‖₁² over 0 ≤ j ≤ k ends the proof. ∎

VI-E Proof of Theorem 2

Recall the quantity defined in (30). Invoking Assumption 3 gives rise to a descent inequality analogous to (31), now driven by the stochastic gradient. By taking the conditional expectation, one has

(33)

Consider now the i-th coordinate for i ∈ [d]. One can bound the expected sign error on this coordinate, which, together with (33), implies that

(34)

By Jesen’s inequality, it follows that . Because for , taking the expectation implies that

(35)

Now, under Assumption 1 or 2, an argument similar to the proof of Theorem 1 leads in both cases to ‖∇f(x_k)‖₁² ≥ 2μ(f(x_k) − f*), which together with (35) yields that

(36)

Iteratively applying the above inequality leads to (19).

It remains to show (20). An analysis similar to that for (36), now with the diminishing learning rate δ_k, further implies that

(37)

where the second inequality has employed the expression of δ_k.

For the last two terms in (37), in light of the fact that δ_k is decreasing in k, one has that

(38)

and