I Introduction
This paper studies an unconstrained optimization problem
(1) 
where the objective is a proper differentiable function that may be nonconvex. This problem has numerous applications in industry, such as electric vehicles [1, 2], smart grids [3], and the internet of things (IoT) [4]. A quintessential algorithm for solving it is the gradient descent (GD) method [5, 6, 7], which requires access to true gradients. However, computing true gradients is often expensive or difficult in practice, so stochastic gradient descent (SGD), which relies on cheaper stochastic gradients, has become prevalent in training deep neural networks [8, 9].
For large-scale neural networks, training efficiency can generally be improved substantially by introducing multiple workers in a parameter-server framework, where the workers process their own minibatch datasets in parallel. Nonetheless, the communication between the workers and the parameter server remains a non-negligible obstacle to wide practical application. As such, sign-based methods, one family of gradient compression techniques, have been popular in recent decades, not only because they reduce the communication cost to one bit per gradient coordinate, but also because they perform well and are closely related to adaptive gradient methods [10, 11, 12]. In fact, it has been demonstrated in [13, 11] that SIGNSGD with momentum often performs very similarly to Adam on deep learning tasks in practice. A wide range of other gradient compression approaches exist in the literature for reducing the communication cost, e.g., [14, 15], whose elaboration is beyond the scope of this paper. In particular, the sign-based methods considered in this paper can be regarded as a special gradient compression scheme that needs to transmit only one bit per gradient component [16].
Along this line, the sign gradient descent (SIGNGD) algorithm and its stochastic counterpart (SIGNSGD) have been extensively studied in recent years [17, 11, 18, 12, 19]. They are, respectively, of the form
(2)  
(3) 
where is a stochastic gradient of at , is the learning rate, and the signum function is applied componentwise. For instance, it was demonstrated in [11] that SIGNSGD enjoys an SGD-level convergence rate for nonconvex but smooth objective functions under a separable smoothness assumption. In combination with majority vote in the distributed setup, it was further shown in [18] to be efficient in terms of communication and fault tolerance. Recently, the authors of [12] found that the smoothness is a weaker and more natural assumption than the separable smoothness, and established two conditions under which the sign-based methods are preferable to GD.
Contributions. To the best of our knowledge, this paper is the first to address faster convergence of sign methods. The contributions are detailed as follows.
First, it is found that SIGNGD is not generally convergent even for strongly convex and smooth objectives when using constant learning rates, although vanilla GD does converge in this setting. Therefore, the scaled versions in Algorithms 1 and 2 are investigated. It is proved that Algorithm 1 converges linearly to the minimal value in two cases: strongly convex objectives, and nonconvex objectives satisfying the Polyak-Łojasiewicz (PL) inequality. Meanwhile, Algorithm 2 converges linearly to a neighborhood of the minimal value when using a constant learning rate, with an error proportional to
and the variance of stochastic gradients. When applying a kind of diminishing learning rate, a rate
can be ensured for (15), which is superior to the widely known rate [20].
Second, the obtained results are extended to the distributed setup, where a group of workers compute their own (stochastic) gradients using individual datasets and then transmit the sign gradient and the gradient norm to the parameter server. The server calculates the sign gradient by majority vote, averages the gradient norms, and transmits the results back to all the workers.
Notations. Denote by for an integer . Let , , and be the norm, norm, norm and the transpose of , respectively. and
stand for column vectors of compatible dimension with all entries being
and , respectively. represents the gradient of a function . and denote the mathematical expectation and probability, respectively.
II Counterexamples for SIGNGD
For SIGNGD, an interesting result can also be found in the continuous-time setup, which demonstrates clear advantages of SIGNGD over GD. In particular, SIGNGD converges linearly, while GD converges only sublinearly. More details are deferred to the Appendix as supplementary material.
Motivated by this fact in the continuous-time setup, it seems promising to consider the discrete-time counterpart of (23), i.e., SIGNGD (2) with being a constant learning rate. However, this is not the case. It is well known that GD is linearly convergent for small enough , while the counterexamples presented below illustrate that the sign counterpart (2) is generally not convergent even for strongly convex and smooth objectives.
Example 1.
Consider for , which is strongly convex and smooth with . By choosing the initial point as , it is easy to verify for (2) that for ,
(4) 
which is obviously not convergent.
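The oscillation described in Example 1 can be reproduced numerically. The following Python sketch is an illustration under assumptions rather than the exact instance above: it takes the hypothetical objective f(x) = x^2/2 (strongly convex and smooth), a constant learning rate alpha = 0.1, and the initial point x_0 = alpha/2, and runs SIGNGD (2) alongside vanilla GD.

```python
import numpy as np

def grad(x):
    # Gradient of the assumed objective f(x) = x**2 / 2.
    return x

def sign_gd(x0, alpha, steps):
    # SIGNGD with a constant learning rate: x <- x - alpha * sign(grad f(x)).
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * np.sign(grad(xs[-1])))
    return np.array(xs)

def gd(x0, alpha, steps):
    # Vanilla GD with the same constant learning rate.
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return np.array(xs)

alpha = 0.1
sign_iterates = sign_gd(alpha / 2, alpha, 50)
gd_iterates = gd(alpha / 2, alpha, 50)
```

With these choices the SIGNGD iterates jump between alpha/2 and -alpha/2 indefinitely, whereas the GD iterates contract by the factor 1 - alpha at every step.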
Example 1 shows that exact convergence cannot be ensured for SIGNGD even for strongly convex and smooth objectives. To fix this, one may attempt to consider the sign counterpart of adaptive gradient methods. However, this generally does not work either. For instance, the AdaGradNorm [21]
(5) 
is shown to converge linearly without knowing any function parameters beforehand [22], while the linear convergence cannot be ensured in general for its sign counterparts, as illustrated below for its two sign variants.
Example 2.
Consider the first sign variant as
(6) 
and (strongly convex and smooth) with . For simplicity, set and . Then simple manipulations give rise to .
In what follows, we show that the convergence rate of (6) is not linear. To do so, it is easy to see that , which leads to that
(7) 
By contradiction, if or is linearly convergent, then one has that for some constant , which, together with (7) and , gives
(8) 
After decreases to where , invoking (8) leads to , thus implying that will eventually oscillate around the origin, which contradicts the linear convergence of . Hence, (6) is not linearly convergent.
Example 3.
Consider now another sign variant as
(9) 
and let (strongly convex and smooth) with with initial and . In this case, it is straightforward to calculate that and
(10) 
from which one can conclude that (9) amounts to
(11) 
which can be viewed as GD for the convex objective with a learning rate . Therefore, the convergence rate of classic GD can be invoked for (11), which is known to be sublinear [23].
Remark 1.
The above examples demonstrate that although GD and AdaGradNorm are indeed linearly convergent for strongly convex and smooth objectives, their sign counterparts fail to converge linearly in general.
III Linear Rate of Scaled SIGNGD/SGD
With the above preparations, we are now ready to study faster convergence for solving problem (1). As shown in Section II, the sign counterparts of GD and AdaGradNorm do not achieve linear convergence. As such, the scaled versions of SIGNGD/SGD are considered in this paper, as in Algorithms 1 and 2. They can be viewed as steepest descent with respect to the maximum norm [12], but are still not fully understood.
A few assumptions are necessary for the following analysis.
Assumption 1.
is strongly convex with respect to norm for some constant , i.e., for all .
Assumption 2.
satisfies the Polyak-Łojasiewicz (PL) inequality, i.e., , where is the minimum value.
Assumption 3.
is smooth with respect to norm, i.e., for all .
The PL inequality does not require to be even convex, and the  and norms employed in Assumptions 1 and 2, respectively, are slightly more relaxed than the Euclidean norm. Meanwhile, the smoothness condition is made with respect to norm, since it is more favorable than the Euclidean smoothness and separable smoothness [12].
Remark 2.
It is noteworthy that another promising sign method is EFSIGNGD [16] using error feedback, given as
(12) 
where is the learning rate. In [16], it is shown that EFSIGNGD/SGD has a better performance than SIGNGD/SGD, actually enjoying the same convergence rate as GD/SGD. However, we point out that EFSIGNGD/SGD is, roughly speaking, equivalent to GD/SGD. Let us show this by slightly modifying (12) as
(13) 
By defining , it is easy to verify that , that is, (13) amounts to GD in terms of . As a result, EFSIGNGD/SGD is not considered here.
In the following, the main results are divided into two scenarios, i.e., the deterministic and stochastic settings.
(14) 
(15) 
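As an illustration of the deterministic scaled update (14), the Python sketch below assumes that the scaling factor is the l1 norm of the gradient, i.e., x_{t+1} = x_t - alpha * ||grad f(x_t)||_1 * sign(grad f(x_t)), which is the steepest-descent step with respect to the maximum norm. The quadratic objective, the matrix `A`, the initial point, and the learning rate are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical strongly convex quadratic f(x) = 0.5 * x^T A x, minimum value 0.
A = np.diag([1.0, 2.0, 4.0])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def scaled_sign_gd(x0, alpha, steps):
    # Assumed scaled update: x <- x - alpha * ||g||_1 * sign(g).
    x = np.asarray(x0, dtype=float)
    values = [f(x)]
    for _ in range(steps):
        g = grad_f(x)
        x = x - alpha * np.linalg.norm(g, 1) * np.sign(g)
        values.append(f(x))
    return x, np.array(values)

# For a diagonal A, f is smooth with respect to the maximum norm with
# constant L equal to the sum of the diagonal entries, so alpha = 1 / L
# guarantees a decrease of the objective at every step.
L_inf = np.trace(A)
x_final, values = scaled_sign_gd([1.0, -0.5, 0.25], 1.0 / L_inf, 200)
```

In this sketch the objective values shrink by at least a constant factor per step, in contrast to the oscillation of unscaled SIGNGD with a constant learning rate.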
III-A The Deterministic Setting
Consider the deterministic setting with full gradients, i.e., (14), for which we have the following results. Note that all proofs are given in the Appendix.
Theorem 1.
Remark 3.
In view of Theorem 1, the algorithm (14) is proved to be linearly convergent, which is in contrast to SIGNGD and sign AdaGradNorm as discussed in Section II. Moreover, for the nonconvex but smooth with respect to the Euclidean norm, by an argument similar to that of Theorem 1, it is easy to obtain for SIGNGD with a constant learning rate that by choosing the learning rate as . In comparison, (17) can be nearly when is chosen to approach . In this regard, our result is tighter up to a dimension constant , and the learning rate here is easier to implement. In addition, if the smoothness is with respect to the maximum norm, then the result here has the same convergence bound as SIGNGD but with a less conservative learning rate selection.
Remark 4.
A similar result can also be obtained from the most related work [24] by resorting to the approximate compressor. To be specific, can be viewed as approximate compressor, and then applying Theorem 13 in [24] leads to the learning rate and convergence rate . In contrast, Theorem 1 of this paper (with replaced by here) is for with the convergence rate . It is easy to verify that our learning rate is more relaxed and the convergence rate is faster due to .
III-B The Stochastic Setting
This section considers the stochastic gradient case, where the true gradient is expensive to compute and instead a stochastic gradient
is relatively cheap to evaluate as an estimate of
. To move forward, some standard assumptions are imposed on stochastic gradients [13, 11].
Assumption 4.
The stochastic gradients are unbiased and have bounded variances with respect to norm, i.e., there exists a constant such that
(18) 
In this case, the algorithm becomes (15). For brevity, define for and , where and represent the th components of and , respectively.
Remark 5.
For stochastic gradient , when leveraging a minibatch of size at , the oracle gives us gradient estimates, and in this case the stochastic gradient can be chosen as the average of the estimates. In this respect, the variance bound can be reduced to . Additionally, it was shown in [19] that the success probability should be greater than ; otherwise the sign algorithm generally fails to work. A multitude of cases can ensure , for instance, when each component possesses a unimodal and symmetric distribution [11, 19].
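The variance reduction from minibatching mentioned in Remark 5 can be checked empirically. The sketch below is an illustration under assumptions: the stochastic gradient is modeled as a hypothetical true gradient plus zero-mean Gaussian noise, and the batch size, noise level, and number of trials are arbitrary. It also estimates, per coordinate, how often the sign of the stochastic gradient matches the sign of the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

true_grad = np.array([0.5, -1.0, 2.0])   # hypothetical true gradient
sigma = 1.0                              # assumed per-coordinate noise std
batch_size = 16
trials = 20000

# Single-sample stochastic gradients, shape (trials, 3).
single = true_grad + sigma * rng.standard_normal((trials, 3))

# Minibatch stochastic gradients: average of batch_size independent samples.
noise = sigma * rng.standard_normal((trials, batch_size, 3))
batch = true_grad + noise.mean(axis=1)

# Empirical variance (summed over coordinates) with and without averaging.
var_single = single.var(axis=0).sum()
var_batch = batch.var(axis=0).sum()

# Empirical per-coordinate frequency of recovering the correct sign.
success_single = (np.sign(single) == np.sign(true_grad)).mean(axis=0)
success_batch = (np.sign(batch) == np.sign(true_grad)).mean(axis=0)
```

Under this noise model the ratio `var_batch / var_single` should be close to `1 / batch_size`, and the sign of the averaged estimate should be correct more often than that of a single sample.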
We are now in a position to present the main result on (15).
Theorem 2.
Remark 6.
The first result in Theorem 2 shows that algorithm (15) converges linearly at a rate . This is comparable to vanilla SGD in [25], where the convergence rate is , which is slower than (i.e., ) when . Moreover, the result in (20) is the exact convergence with rate for both the strongly convex case and the nonconvex case with the PL inequality, which is the same as for both vanilla SGD [26] and compression methods [20]. In addition, the same rate was established in [27]. However, the condition in [27] for convergence does not always hold, e.g., in Theorem II.2 of [27], and our result (20) includes faster rates except for in [27].
IV The Distributed Setting
Now, we extend the results in Section III to the distributed setting within a parameter server framework. For simplicity, we only focus on scaled SIGNSGD in this section, but the results can be similarly obtained for scaled SIGNGD.
To proceed, the distributed scaled SIGNSGD by majority vote is given in Algorithm 3, for which the following convergence result is obtained.
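A minimal sketch of one communication round is given below, assuming each worker sends the sign vector of its (stochastic) gradient together with a scalar gradient norm, taken here to be the l1 norm. The function names `worker_message`, `server_aggregate`, and `apply_update`, as well as the three worker gradients, are hypothetical and are not taken from Algorithm 3.

```python
import numpy as np

def worker_message(grad):
    # Each worker sends one sign per coordinate plus a single scalar norm.
    return np.sign(grad), np.linalg.norm(grad, 1)

def server_aggregate(messages):
    # Majority vote on the signs and average of the transmitted norms.
    signs = np.stack([s for s, _ in messages])
    norms = np.array([n for _, n in messages])
    voted_sign = np.sign(signs.sum(axis=0))
    return voted_sign, norms.mean()

def apply_update(x, voted_sign, avg_norm, alpha):
    # Assumed scaled sign step using the aggregated quantities.
    return x - alpha * avg_norm * voted_sign

# Three hypothetical workers with noisy views of the same gradient.
grads = [np.array([1.0, -2.0, 0.5]),
         np.array([0.8, -1.5, -0.2]),
         np.array([-0.1, -2.5, 0.4])]
voted_sign, avg_norm = server_aggregate([worker_message(g) for g in grads])
x_next = apply_update(np.zeros(3), voted_sign, avg_norm, alpha=0.1)
```

In this sketch a tied vote (possible with an even number of workers) yields a zero entry in `voted_sign`, so that coordinate is left unchanged for the round; other tie-breaking rules are possible.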
Theorem 3.
V Experiments
Numerical experiments are provided to corroborate the efficacy of the obtained theoretical results here.
Example 4 (A Toy Example).
Let us consider a simple example, where for . It is easy to verify that is nonconvex but satisfies the PL condition. To verify the performance of the proposed scaled SIGNGD, several existing algorithms are compared in Fig. 1 by setting with an arbitrary initial state. The comparisons are performed with vanilla gradient descent (GD), SIGNGD, SIGNGDM (i.e., SIGNUM), and EFSIGNGD [16]. It can be observed from Fig. 1 that the proposed algorithm has the same linear convergence as GD and EFSIGNGD, while SIGNGD and SIGNUM fail to converge and instead oscillate near the optimal variable. In summary, this example shows the efficiency of the scaled SIGNGD and supports the observation in Example 1.
Example 5.
Consider the logistic regression problem, where the objective is with a standard regularizer [20], and and are the data samples.
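For concreteness, the following Python sketch gives one common form of such an objective, an l2-regularized logistic loss, together with its gradient. The exact regularizer and scaling used in the experiments are not restated here, so the form f(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)) + (lam/2) * ||x||^2 with labels b_i in {-1, +1}, and the names `logistic_objective`, `logistic_gradient`, and `minibatch_gradient`, are assumptions.

```python
import numpy as np

def logistic_objective(x, A, b, lam):
    # A: (n, d) feature matrix, b: (n,) labels in {-1, +1}, lam: l2 weight.
    z = b * (A @ x)
    # log(1 + exp(-z)) computed stably via logaddexp.
    return np.logaddexp(0.0, -z).mean() + 0.5 * lam * (x @ x)

def logistic_gradient(x, A, b, lam):
    z = b * (A @ x)
    # sigmoid(-z) = 1 / (1 + exp(z)), written with tanh for numerical stability.
    s = 0.5 * (1.0 - np.tanh(0.5 * z))
    return -(A.T @ (b * s)) / len(b) + lam * x

def minibatch_gradient(x, A, b, lam, idx):
    # Stochastic gradient computed from the rows selected by idx.
    return logistic_gradient(x, A[idx], b[idx], lam)
```

A stochastic gradient for the sign-based methods compared here can then be obtained by passing a random index set `idx` to `minibatch_gradient`.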
To test the performance of scaled SIGNSGD, the epsilon dataset with and is exploited [28], and the baseline is calculated using the standard optimizer LogisticSGD of scikit-learn [29]. To average out the effect of the initial choices, the numerical result is averaged over repeated runs with . We compare scaled SIGNSGD with vanilla SGD, SIGNSGD, SIGNSGDM, and EFSIGNSGD [16], as shown in Fig. 2, on a platform with an Intel Core i7-4300U CPU. Fig. 2 indicates that scaled SIGNSGD performs similarly to SGD and better than SIGNSGD and SIGNSGDM. It can also be observed that EFSIGNSGD is comparable to SGD, which is consistent with the discussion in Remark 2. Moreover, the case in Fig. 2(a) with a constant learning rate converges faster than that in Fig. 2(b) with a diminishing learning rate. Meanwhile, Fig. 3 shows that more workers can improve the performance. Therefore, the numerical results support our theoretical findings.
VI Conclusion
This paper has investigated faster convergence of scaled SIGNGD/SGD, which can relieve the communication cost compared with vanilla SGD. To further motivate the study of sign methods, continuous-time algorithms have been addressed, indicating that sign gradients can significantly improve the convergence speed of GD. Subsequently, it has been proven that scaled SIGNGD is linearly convergent for both strongly convex objectives and nonconvex objectives satisfying the PL inequality. Also, the convergence of scaled SIGNSGD has been analyzed in two cases, with constant and decaying learning rates. The results have also been extended to the distributed setting in the parameter server framework. The efficacy of scaled sign methods has been validated by numerical experiments on the logistic regression problem.
Appendix
VI-A Further Motivations for SIGNGD
Let us provide more evidence for studying sign-based GD from the continuous-time perspective. To this end, consider the continuous-time dynamics corresponding to the discrete-time GD and SIGNGD, i.e.,
(22)  
(23) 
where is a constant learning rate.
To proceed, let us construct a Lyapunov candidate as
(24) 
where denotes the minimum value attained by .
Proposition 1.
For algorithm (22),

if is convex, then , where with being the set of minimizers;

if is nonconvex, then .
Proposition 2.
For algorithm (23),

if is convex, then , where ;

if is nonconvex, then .
In view of the above results, it can be observed that (23) with sign gradients converges markedly faster than GD (22) in the continuous-time domain, indicating that the performance of gradient descent can be largely improved by sign gradient compression. For instance, in the scenario with convex objectives, GD (22) is sublinearly convergent while SIGNGD (23) is linearly convergent. As a result, the above results provide a new perspective on the advantages of SIGNGD compared with GD.
VI-B Proof of Proposition 1
Consider the case with convex objectives. In light of (22), it can be calculated that
(25) 
which implies .
Meanwhile, invoking the convexity of yields
(26) 
which, together with (25), gives rise to , further implying the claimed result.
For the case with nonconvex objectives, by integrating (25) from to , one can obtain that
(27) 
where the inequality has employed the fact that for all . Then taking the minimum of over ends the proof. ∎
VI-C Proof of Proposition 2
Consider first the convex case. Similar to (25), it can be obtained that
(28) 
Akin to (26), one has that
(29) 
where the second inequality has used Hölder's inequality. Combining (28) with (29) yields , from which it is easy to verify the claimed result.
Consider now the nonconvex case. The desired result can be obtained from (28) by an argument similar to that in the convex case. This completes the proof. ∎
VI-D Proof of Theorem 1
To facilitate the subsequent analysis, define
(30) 
In view of (14) and Assumption 3, it can be concluded that
(31) 
In what follows, let us prove the cases of this theorem one by one.
First, for case 1, invoking Assumption 1 yields
where the second inequality has employed Hölder's inequality. Then one has that . Therefore, in combination with (31), one can obtain that , further leading to . Consequently, by iteration, this completes the proof of case 1.
Second, for case 2, Assumption 2 leads to , which, together with an argument similar to that of case 1, yields the conclusion in this case.
Third, for case 3, invoking (31) gives , which, by summation over , implies that
(32) 
where the last inequality has used the fact that . Then taking the minimum of over ends the proof. ∎
VI-E Proof of Theorem 2
Recall in (30). Invoking Assumption 3 gives rise to
By taking the conditional expectation, one has
(33) 
Consider now the coordinate for . One has that
which, together with (33), implies that
(34) 
By Jensen's inequality, it follows that . Because for , taking the expectation implies that
(35) 
Now, under either Assumption 1 or Assumption 2, an argument similar to the proof of Theorem 1 leads to , which together with (35) yields
(36) 
Iteratively applying the above inequality leads to (19).