This paper studies an unconstrained optimization problem where the objective is a proper differentiable function that may be nonconvex. Such problems have numerous applications in industry, such as electric vehicles [1, 2], smart grid [3], internet of things (IoT) [4], and so on. To solve this problem, a quintessential algorithm is the gradient descent (GD) method [5, 6, 7], which requires access to true gradients. However, computing true gradients is usually expensive or difficult in practice, and therefore the stochastic gradient descent (SGD) algorithm, which relies on cheaper-to-compute stochastic gradients, has become prevalent in deep neural networks [8, 9].
For large-scale neural networks, training efficiency can generally be improved substantially by introducing multiple workers in a parameter-server framework, where a group of workers train on their own mini-batch datasets in parallel. Nonetheless, the communication between the workers and the parameter server has been a non-negligible obstacle to wide practical application. As such, sign-based methods, as one family of gradient compression techniques, have become popular in recent years, not only because they reduce the communication cost to one bit per gradient coordinate, but also because they perform well and are closely related to adaptive gradient methods [10, 11, 12]. As a matter of fact, it has been demonstrated in [13, 11] that SIGNSGD with momentum often performs similarly to Adam on deep learning tasks in practice. Note that a wide range of gradient compression approaches exist for reducing the communication cost in the literature, e.g., [14, 15], whose elaboration is beyond the scope of this paper. In particular, the sign-based methods considered in this paper can be regarded as a special gradient compression scheme which needs to transmit only one bit per gradient component.
Along this line, the sign gradient descent (SIGNGD) algorithm and its stochastic counterpart (SIGNSGD) have been extensively studied in recent years [17, 11, 18, 12, 19], which are, respectively, of the form
$$x_{k+1} = x_k - \gamma\,\mathrm{sign}(\nabla f(x_k)), \qquad (2)$$
$$x_{k+1} = x_k - \gamma\,\mathrm{sign}(g_k), \qquad (3)$$
where $g_k$ is a stochastic gradient of $f$ at $x_k$, $\gamma > 0$ is the learning rate, and the signum function $\mathrm{sign}(\cdot)$ is applied componentwise. For instance, SIGNSGD was shown to enjoy an SGD-level convergence rate for nonconvex but smooth objective functions under a separable smoothness assumption and, in combination with majority vote in the distributed setup, was further shown to be efficient in terms of communication and fault tolerance. Recently, it was found that smoothness with respect to the maximum norm is a weaker and more natural assumption than separable smoothness, and two conditions were established under which sign-based methods are preferable over GD.
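As a minimal illustration (our sketch, not code from the paper), the two updates can be written in NumPy, with the signum applied componentwise; the quadratic objective below is an illustrative choice:

```python
import numpy as np

def signgd_step(x, grad, lr):
    # SIGNGD: move a fixed distance lr along the sign of the true gradient
    return x - lr * np.sign(grad)

def signsgd_step(x, stoch_grad, lr):
    # SIGNSGD: the same update, driven by a stochastic gradient estimate
    return x - lr * np.sign(stoch_grad)

# Quadratic f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x = np.array([2.0, -3.0, 0.0])
x_next = signgd_step(x, x, lr=0.5)  # each nonzero coordinate moves by 0.5
```

Note that the step length is the same for every nonzero coordinate regardless of the gradient magnitude, which is the root of the convergence issues discussed next.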
Contributions. To the best of our knowledge, this paper is the first to address faster convergence of sign-based methods, with details as follows.
First, it is found that SIGNGD is not generally convergent, even for strongly convex and smooth objectives, when using constant learning rates, although vanilla GD is indeed convergent in that setting. Therefore, scaled versions in Algorithms 1 and 2 are investigated. It is proved that Algorithm 1 converges linearly to the minimal value in two cases: strongly convex objectives, and nonconvex objectives satisfying the Polyak-Łojasiewicz (PL) inequality. Meanwhile, Algorithm 2 converges linearly to a neighborhood of the minimal value when using a constant learning rate, with an error proportional to the learning rate and the variance of the stochastic gradients. When applying a suitable diminishing learning rate, a rate of order $O(1/k)$ can be ensured for (15), which is superior to the widely known rate $O(1/\sqrt{k})$.
Second, the obtained results are extended to the distributed setup, where a group of workers compute their own (stochastic) gradients using individual datasets and then transmit the gradient sign and the gradient $\ell_1$-norm to the parameter server, which computes the aggregate sign by majority vote, averages the gradient $\ell_1$-norms, and transmits both back to all the workers.
Notations. Denote by $[n]$ the set $\{1,\dots,n\}$ for an integer $n > 0$. Let $\|x\|_1$, $\|x\|_2$, $\|x\|_\infty$, and $x^\top$ be the $\ell_1$-norm, $\ell_2$-norm, $\ell_\infty$-norm, and the transpose of a vector $x$, respectively. $\mathbf{1}$ and $\mathbf{0}$ stand for column vectors of compatible dimension with all entries being $1$ and $0$, respectively. $\nabla f$ represents the gradient of a function $f$. $\mathbb{E}$ and $\mathbb{P}$ denote the mathematical expectation and probability, respectively.
II. Counterexamples for SIGNGD
For SIGNGD, an interesting result can also be established in the continuous-time setup, which demonstrates clear advantages of SIGNGD over GD: SIGNGD converges linearly, while GD is only sublinearly convergent. More details are postponed to the Appendix as supplemental material.
Motivated by this fact in the continuous-time setup, it seems promising to consider the discrete-time counterpart of (23), i.e., SIGNGD (2) with a constant learning rate. However, this is not the case. It is well known that GD converges linearly for a small enough constant learning rate, while several counterexamples are presented below to illustrate that the sign counterpart (2) is generally not convergent, even for strongly convex and smooth objectives.
Example 1. Consider $f(x) = x^2/2$ for $x \in \mathbb{R}$, which is strongly convex and smooth with $\nabla f(x) = x$. By choosing the initial point as $x_0 = \gamma/2$, it is easy to verify for (2) that $x_k = (-1)^k \gamma/2$ for all $k \ge 0$, which is obviously not convergent.
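The oscillation is easy to reproduce numerically; the quadratic and initial point below are illustrative choices consistent with the discussion:

```python
import numpy as np

# SIGNGD on f(x) = 0.5 * x^2 (so grad f(x) = x) with a constant learning
# rate: starting from x0 = lr/2, the iterate flips sign forever instead of
# converging, because the step length never shrinks.
lr = 0.2
x = lr / 2
traj = [x]
for _ in range(6):
    x = x - lr * np.sign(x)
    traj.append(x)
# traj alternates between +lr/2 and -lr/2
```

The same run with vanilla GD (`x = x - lr * x`) contracts geometrically, which is the contrast the counterexample is making.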
Example 1 shows that exact convergence cannot be ensured for SIGNGD, even for strongly convex and smooth objectives. To fix this, one may attempt to consider sign counterparts of adaptive gradient methods. However, this generally does not work either. For instance, the AdaGrad-Norm method
$$b_{k+1}^2 = b_k^2 + \|\nabla f(x_k)\|_2^2, \qquad x_{k+1} = x_k - \frac{\gamma}{b_{k+1}} \nabla f(x_k)$$
is known to converge linearly without knowing any function parameters beforehand, while linear convergence cannot be ensured in general for its sign counterparts, as illustrated below for two of its sign variants.
Consider the first sign variant as
and (strongly convex and smooth) with . For simplicity, set and . Then simple manipulations give rise to .
In what follows, we show that the convergence rate of (6) is not linear. To do so, it is easy to see that , which leads to that
By contradiction, if or is linearly convergent, then one has that for some constant , which, together with (7) and , gives
After decreases to where , invoking (8) leads to , thus implying that will finally oscillate around the origin, contradicting the linear convergence of . Hence, (6) is not linearly convergent.
Consider now another sign variant as
and let (strongly convex and smooth) with initial and . In this case, it is straightforward to calculate that and
from which one can conclude that (6) amounts to
The above examples demonstrate that although GD and AdaGrad-Norm are indeed linearly convergent for strongly convex and smooth objectives, their sign counterparts fail to converge linearly in general.
III. Linear Rate of Scaled SIGNGD/SGD
With the above preparations, we are now ready to study faster convergence for solving problem (1). As shown in Section II, the sign counterparts of GD and AdaGrad-Norm do not achieve linear convergence. As such, the scaled versions of SIGNGD/SGD are considered in this paper, as given in Algorithms 1 and 2; they can be viewed as steepest descent with respect to the maximum norm $\|\cdot\|_\infty$, which is still not fully understood.
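Algorithms 1 and 2 are not reproduced in this excerpt; the sketch below (our naming) uses a common form of scaled sign descent, rescaling the sign direction by the gradient $\ell_1$-norm, which is precisely the steepest-descent step with respect to the maximum norm:

```python
import numpy as np

def scaled_sign_step(x, grad, lr):
    # Steepest descent w.r.t. the max norm: the sign direction is rescaled
    # by the gradient l1-norm, so the step shrinks as the gradient vanishes.
    return x - lr * np.linalg.norm(grad, 1) * np.sign(grad)

# On f(x) = 0.5 * ||x||^2 (grad = x), plain SIGNGD with a constant learning
# rate oscillates, while the scaled step contracts toward the minimizer.
x = np.array([1.0, -0.5])
for _ in range(50):
    x = scaled_sign_step(x, x, lr=0.1)
final_l1 = np.linalg.norm(x, 1)
```

The rescaling restores the magnitude information that plain sign compression discards, which is what makes a linear rate possible.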
A few assumptions are necessary for the following analysis.
Assumption 1. $f$ is $\mu$-strongly convex with respect to the $\ell_\infty$-norm for some constant $\mu > 0$, i.e., $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|_\infty^2$ for all $x, y$.
Assumption 2. $f$ satisfies the Polyak-Łojasiewicz (PL) inequality with respect to the $\ell_1$-norm, i.e., $\|\nabla f(x)\|_1^2 \ge 2\mu (f(x) - f^\star)$ for all $x$, where $f^\star$ is the minimum value of $f$.
Assumption 3. $f$ is $L$-smooth with respect to the $\ell_\infty$-norm, i.e., $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_\infty^2$ for all $x, y$.
The PL inequality does not even require $f$ to be convex, and the $\ell_\infty$- and $\ell_1$-norms employed in Assumptions 1 and 2, respectively, are slightly more relaxed than the Euclidean norm. Meanwhile, the smoothness condition is stated with respect to the $\ell_\infty$-norm, since it is more favorable than both Euclidean smoothness and separable smoothness.
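As a concrete textbook illustration (not the paper's example) of a nonconvex function satisfying a PL inequality, consider $f(x) = x^2 + 3\sin^2(x)$, which in one dimension (where the $\ell_1$ and Euclidean norms coincide) satisfies PL with $\mu = 1/32$; the check below verifies this numerically on a grid:

```python
import numpy as np

# f(x) = x^2 + 3*sin(x)^2 is nonconvex (f''(pi/2) = 2 + 6*cos(pi) = -4 < 0)
# yet satisfies the PL inequality f'(x)^2 >= 2*mu*(f(x) - f*) with f* = 0
# and mu = 1/32.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
mu = 1.0 / 32.0

xs = np.linspace(-10.0, 10.0, 4001)
pl_holds = bool(np.all(df(xs)**2 >= 2.0 * mu * f(xs)))
d2f_at_pi_half = 2.0 + 6.0 * np.cos(np.pi)  # = -4, so f is nonconvex
```

Gradient methods on such a function still converge linearly even though the function has no convexity, which is exactly the regime Assumption 2 targets.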
It is noteworthy that another promising sign method is EF-SIGNGD with error feedback, given as
$$\delta_k = \gamma \nabla f(x_k) + e_k, \qquad x_{k+1} = x_k - \frac{\|\delta_k\|_1}{d}\,\mathrm{sign}(\delta_k), \qquad e_{k+1} = \delta_k - \frac{\|\delta_k\|_1}{d}\,\mathrm{sign}(\delta_k), \qquad (12)$$
where $\gamma > 0$ is the learning rate and $e_k$ is the accumulated compression error with $e_0 = 0$. It has been shown that EF-SIGNGD/SGD performs better than SIGNGD/SGD, actually enjoying the same convergence rate as GD/SGD. However, we point out that EF-SIGNGD/SGD is, roughly speaking, equivalent to GD/SGD. Let us show this by slightly modifying (12) so that the gradient is evaluated at the error-corrected iterate, i.e.,
$$\delta_k = \gamma \nabla f(x_k - e_k) + e_k, \qquad x_{k+1} = x_k - \frac{\|\delta_k\|_1}{d}\,\mathrm{sign}(\delta_k), \qquad e_{k+1} = \delta_k - \frac{\|\delta_k\|_1}{d}\,\mathrm{sign}(\delta_k). \qquad (13)$$
By defining $\tilde{x}_k = x_k - e_k$, it is easy to verify that $\tilde{x}_{k+1} = \tilde{x}_k - \gamma \nabla f(\tilde{x}_k)$, that is, (13) amounts to GD in terms of $\tilde{x}_k$. As a result, EF-SIGNGD/SGD is not considered further here.
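The equivalence can be checked numerically; the compressor and the error-corrected gradient evaluation below follow the standard error-feedback template and are our illustrative reading of the modified scheme, on an illustrative quadratic:

```python
import numpy as np

# Error feedback with the scaled sign compressor C(v) = (||v||_1/d)*sign(v),
# where the gradient is evaluated at the residual-corrected iterate
# x_tilde = x - e. Then x_tilde follows vanilla GD exactly:
# x_tilde_{k+1} = x_tilde_k - lr * grad(x_tilde_k).
def compress(v):
    return (np.linalg.norm(v, 1) / v.size) * np.sign(v)

grad = lambda x: x  # f(x) = 0.5 * ||x||^2
lr = 0.1
x = np.array([1.0, -2.0])
e = np.zeros_like(x)
x_gd = x.copy()
for _ in range(20):
    delta = lr * grad(x - e) + e          # gradient at the corrected iterate
    x, e = x - compress(delta), delta - compress(delta)
    x_gd = x_gd - lr * grad(x_gd)         # plain GD for comparison
x_tilde = x - e
```

The algebra behind the check: $\tilde{x}_{k+1} = x_{k+1} - e_{k+1} = x_k - \delta_k = \tilde{x}_k - \gamma\nabla f(\tilde{x}_k)$, independently of the compressor chosen.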
In the following, the main results are divided into two scenarios, i.e., the deterministic and stochastic settings.
III-A. The Deterministic Setting
Consider the deterministic setting with full gradients, i.e., algorithm (14), for which we have the following results. Note that all proofs are given in the Appendix.
In view of Theorem 1, algorithm (14) is proved to be linearly convergent, in contrast to SIGNGD and sign AdaGrad-Norm as discussed in Section II. Moreover, for nonconvex objectives that are smooth with respect to the Euclidean norm, by an argument similar to Theorem 1, a corresponding bound is easy to obtain for SIGNGD with a suitably chosen constant learning rate. In comparison, (17) can nearly match this bound with a suitable learning rate. In this regard, our result is tighter up to a dimension constant, and the learning rate here is easier to implement. In addition, if the smoothness is with respect to the maximum norm, then the result here has the same convergence bound as SIGNGD but with a less conservative learning rate selection.
A similar result can also be obtained from the most related work by resorting to the $\delta$-approximate compressor: the scaled sign operator can be viewed as a $\delta$-approximate compressor, and applying Theorem 13 therein yields a corresponding learning rate and convergence rate. In contrast, Theorem 1 of this paper (after replacing the corresponding constants) yields a more relaxed learning rate and a faster convergence rate.
III-B. The Stochastic Setting
This section considers the stochastic gradient case, where the true gradient is expensive to compute and instead a stochastic gradient $g_k$ is relatively cheap to evaluate as an estimate of $\nabla f(x_k)$. To move forward, some standard assumptions are imposed on the stochastic gradients [13, 11].
The stochastic gradients are unbiased and have bounded variance with respect to the $\ell_1$-norm, i.e., $\mathbb{E}[g_k] = \nabla f(x_k)$ and there exists a constant $\sigma > 0$ such that $\mathbb{E}[\|g_k - \nabla f(x_k)\|_1^2] \le \sigma^2$.
In this case, the algorithm becomes (15). For brevity, define for and , where and represent the -th components of and , respectively.
For a stochastic gradient evaluated with a mini-batch of size $n$ at $x_k$, the oracle gives $n$ gradient estimates, and the stochastic gradient can be chosen as their average. In this respect, the variance bound can be reduced to $\sigma^2/n$. Additionally, it has been shown that the success probability of each coordinate sign should be greater than $1/2$, as otherwise sign algorithms generally fail to work. A multitude of cases can ensure this; for instance, when each gradient component possesses a unimodal and symmetric distribution [11, 19].
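Both effects are easy to verify with a Monte Carlo sketch (our construction; the Gaussian noise model is an illustrative assumption): averaging $n$ unbiased estimates divides the variance by $n$, and the smaller variance raises the probability that the sign of the estimate matches the sign of the true gradient component.

```python
import numpy as np

# One gradient coordinate with true value 1.0, corrupted by Gaussian noise.
rng = np.random.default_rng(0)
true_grad, sigma, n = 1.0, 2.0, 16
single = rng.normal(true_grad, sigma, size=100_000)
batch = rng.normal(true_grad, sigma, size=(100_000, n)).mean(axis=1)

var_single, var_batch = single.var(), batch.var()   # ~sigma^2 vs ~sigma^2/n
p_single = (np.sign(single) == 1.0).mean()          # sign success probability
p_batch = (np.sign(batch) == 1.0).mean()            # much closer to 1
```

With these parameters the single-sample success probability is only about $\Phi(1/2) \approx 0.69$, while the mini-batch estimate is correct about $\Phi(2) \approx 0.98$ of the time, comfortably above the $1/2$ threshold.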
We are now in a position to present the main result on (15).
The first result in Theorem 2 shows that algorithm (15) converges linearly to a neighborhood of the optimum, which is comparable to vanilla SGD. Moreover, the result in (20) establishes exact convergence with the stated rate for both the strongly convex case and the nonconvex case under the PL inequality, matching both vanilla SGD and compression methods. The same rate has also been established in related work; however, the condition required there for convergence does not always hold (e.g., in Theorem II.2 therein), and our result (20) additionally provides a faster rate.
IV. The Distributed Setting
Now, we extend the results in Section III to the distributed setting within a parameter-server framework. For simplicity, we focus only on scaled SIGNSGD in this section, but the results can be obtained similarly for scaled SIGNGD.
To proceed, the distributed scaled SIGNSGD by majority vote is given in Algorithm 3, for which the following convergence result is obtained.
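Algorithm 3 itself is not reproduced in this excerpt; one aggregation round of the majority-vote scheme described above might look as follows (function and variable names are ours):

```python
import numpy as np

def server_aggregate(sign_msgs, norm_msgs):
    # Majority vote per coordinate over the workers' sign vectors,
    # and the average of the workers' gradient l1-norms.
    vote = np.sign(np.sum(sign_msgs, axis=0))
    scale = float(np.mean(norm_msgs))
    return scale, vote

grads = [np.array([0.9, -1.1]),
         np.array([1.2, 0.3]),
         np.array([0.7, -0.5])]
sign_msgs = [np.sign(g) for g in grads]             # 1 bit per coordinate
norm_msgs = [np.linalg.norm(g, 1) for g in grads]   # one scalar per worker
scale, vote = server_aggregate(sign_msgs, norm_msgs)
# each worker would then update: x <- x - lr * scale * vote
```

Each worker uploads only one bit per coordinate plus a single scalar, so the per-round communication cost is essentially independent of the gradient magnitudes.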
Numerical experiments are provided to corroborate the efficacy of the obtained theoretical results here.
Example 4 (A Toy Example).
Let us consider a simple example where the objective is nonconvex but satisfies the PL condition. To verify the performance of the proposed scaled SIGNGD, several existing algorithms are compared in Fig. 1 with an arbitrary initial state. The comparisons are performed against vanilla gradient descent (GD), SIGNGD, SIGNGDM (i.e., SIGNUM), and EF-SIGNGD. It can be observed from Fig. 1 that the proposed algorithm exhibits the same linear convergence as GD and EF-SIGNGD, while SIGNGD and SIGNUM fail to converge, oscillating near the optimal point. In summary, this example shows the efficiency of scaled SIGNGD and supports the observation in Example 1.
Consider the logistic regression problem, where the objective is $f(x) = \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|_2^2$ with a standard $\ell_2$-regularizer, and $(a_i, b_i)$, $i \in [n]$, are the data samples.
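The objective, its gradient, and a scaled sign training loop can be sketched on synthetic data (our construction; the epsilon dataset itself is not loaded here, and the step rule is the $\ell_1$-rescaled sign step used for illustration throughout):

```python
import numpy as np

# l2-regularized logistic regression on synthetic labeled data.
rng = np.random.default_rng(1)
n, d, lam = 200, 5, 1e-2
A = rng.normal(size=(n, d))
b = np.sign(A @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

def loss(x):
    return float(np.mean(np.log1p(np.exp(-b * (A @ x)))) + 0.5 * lam * x @ x)

def grad(x):
    coef = -b / (1.0 + np.exp(b * (A @ x)))  # -b_i * sigmoid(-b_i a_i^T x)
    return A.T @ coef / n + lam * x

x, lr = np.zeros(d), 0.05
for _ in range(200):
    g = grad(x)
    x = x - lr * np.linalg.norm(g, 1) * np.sign(g)  # scaled sign step
```

Since the loss is smooth and the step shrinks with the gradient, the loss decreases monotonically from its initial value $\log 2$ for this small learning rate.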
To test the performance of scaled SIGNSGD, the epsilon dataset is exploited, and the baseline is calculated using the standard optimizer LogisticSGD of scikit-learn. To marginalize out the effect of initial choices, the numerical results are averaged over repeated runs. We compare scaled SIGNSGD with vanilla SGD, SIGNSGD, SIGNSGDM, and EF-SIGNSGD, as shown in Fig. 2, on a platform with an Intel Core i7-4300U CPU. Fig. 2 indicates that scaled SIGNSGD has performance similar to SGD and performs better than SIGNSGD and SIGNSGDM. It can also be observed that EF-SIGNSGD is comparable to SGD, which is consistent with the discussion in Remark 2. Moreover, the case in Fig. 2(a) with a constant learning rate converges faster than that in Fig. 2(b) with a diminishing learning rate. Meanwhile, Fig. 3 shows that more workers can improve the performance. Therefore, the numerical results support our theoretical findings.
This paper has investigated faster convergence of scaled SIGNGD/SGD, which can reduce the communication cost compared with vanilla SGD. To further motivate the study of sign methods, continuous-time algorithms have been analyzed, indicating that sign gradients can significantly improve the convergence speed of GD. Subsequently, it has been proven that scaled SIGNGD converges linearly for both strongly convex and nonconvex (PL) objectives. The convergence of scaled SIGNSGD has also been analyzed for both constant and decaying learning rates, and the results have been extended to the distributed setting in the parameter-server framework. The efficacy of the scaled sign methods has been validated by numerical experiments on a logistic regression problem.
VI-A. Further Motivations for SIGNGD
Let us provide more evidence for studying sign-based GD from the continuous-time perspective. To this end, consider the continuous-time dynamics corresponding to discrete-time GD and SIGNGD, i.e.,
$$\dot{x}(t) = -\gamma \nabla f(x(t)), \qquad (22)$$
$$\dot{x}(t) = -\gamma\, \mathrm{sign}(\nabla f(x(t))), \qquad (23)$$
where $\gamma > 0$ is a constant learning rate.
To proceed, construct a Lyapunov candidate as $V(x) = f(x) - f^\star$, where $f^\star$ denotes the minimum value attained by $f$.
For algorithm (22),
if is convex, then , where with being the set of minimizers;
if is nonconvex, then .
For algorithm (23),
if is convex, then , where ;
if is nonconvex, then .
In view of the above results, it can easily be observed that (23) with sign gradients converges markedly faster than GD (22) in the continuous-time domain, indicating that the performance of gradient descent can be largely improved by sign gradient compression. For instance, for convex objectives, GD (22) is sublinearly convergent while SIGNGD (23) is linearly convergent. As a result, the above results provide a new perspective on the advantages of SIGNGD over GD.
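The contrast between the two flows can be seen in a simple forward-Euler simulation (our construction, on an illustrative quadratic): the GD flow decays exponentially and never reaches the minimizer, while the sign flow moves at constant speed and arrives at a small neighborhood of the minimizer in finite time.

```python
import numpy as np

# Forward-Euler discretization of the flows (22) and (23) on f(x) = 0.5*x^2
# (so grad f(x) = x), run up to time T with a small time step dt.
dt, gamma, T = 1e-3, 1.0, 5.0
x_gd, x_sgn = 2.0, 2.0
for _ in range(int(T / dt)):
    x_gd += dt * (-gamma * x_gd)              # GD flow: exponential decay
    x_sgn += dt * (-gamma * np.sign(x_sgn))   # sign flow: constant speed
# x_gd is roughly 2*exp(-T); x_sgn reaches a dt-sized neighborhood of 0
# already at time t = 2, and then stays there.
```

Here `dt`, `gamma`, and the quadratic are illustrative parameters; the finite-time arrival of the sign flow is what underlies its linear convergence guarantee.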
VI-B. Proof of Proposition 1
Consider the case with convex objectives. In light of (22), it can be calculated that
which implies .
Meanwhile, invoking the convexity of yields
which, together with (25), gives rise to , further implying the claimed result.
For the case with nonconvex objectives, by integrating (25) from to , one can obtain that
where the inequality has employed the fact that for all . Then taking the minimum of over ends the proof. ∎
VI-C. Proof of Proposition 2
Consider first the convex case. Similar to (25), it can be obtained that
Akin to (26), one has that
Consider now the nonconvex case. The desired result can be obtained by (28) and the similar argument to that in convex case. This completes the proof. ∎
VI-D. Proof of Theorem 1
To facilitate the subsequent analysis, define
In what follows, let us prove the three cases one by one.
First, for case 1, invoking Assumption 1 yields
where the second inequality has employed the Hölder inequality. Then one has that . Therefore, in combination with (31), one can obtain that , further leading to . Consequently, by iteration, this completes the proof of case 1.
Second, for case 2, Assumption 2 leads to , which, together with an argument similar to case 1, yields the conclusion in this case.
Third, for case 3, invoking (31) gives , which, by summation over , implies that
where the last inequality has used the fact that . Then taking the minimum of over ends the proof. ∎
VI-E. Proof of Theorem 2
By taking the conditional expectation, one has
Consider now the coordinate for . One has that
which, together with (33), implies that
By Jensen’s inequality, it follows that . Because for , taking the expectation implies that
Iteratively applying the above inequality leads to (19).
where , further implying that
where the second inequality has employed the expression of .
For the last two terms in (37), in light of the fact that for , one has that