1 Introduction
Distributed computing is becoming increasingly important in modern data-intensive applications. In many applications, large-scale datasets are distributed over multiple machines for parallel processing in order to speed up computation. In other settings, the data sources are naturally distributed, and for privacy and efficiency considerations, the data are not transmitted to a central machine. An example is the recently proposed Federated Learning paradigm [49, 38, 37], in which the data are stored and processed locally in end users’ cellphones and personal computers.
In a standard worker-server distributed computing framework, a single master machine is in charge of maintaining and updating the parameter of interest, and a set of worker machines store the data, perform local computation and communicate with the master. In this setting, messages received from worker machines are prone to errors due to data corruption, hardware/software malfunction, and communication delay and failure. These problems are only exacerbated in a decentralized distributed architecture such as Federated Learning, where some machines may be subjected to malicious and coordinated attack and manipulation. A well-established framework for studying such scenarios is the Byzantine setting [40], where a subset of machines behave completely arbitrarily (even in a way that depends on the algorithm used and the data on the other machines), thereby capturing the unpredictable nature of the errors. Developing distributed algorithms that are robust in the Byzantine setting has become increasingly critical.
In this paper we focus on robust distributed optimization for statistical learning problems. Here the data points are generated from some unknown distribution and stored locally in worker machines, each storing data points; the goal is to minimize a population loss function defined as an expectation over , where is the parameter space. We assume that fraction of the worker machines are Byzantine; that is, their behavior is arbitrary. This Byzantine-robust distributed learning problem has attracted attention in a recent line of work [3, 10, 17, 26, 61, 62, 69]. This body of work develops robust algorithms that are guaranteed to output an approximate minimizer of when it is convex, or an approximate stationary point in the nonconvex case.
However, fitting complicated machine learning models often requires finding a local minimum of nonconvex functions, as exemplified by training deep neural networks and other high-capacity learning architectures [59, 28, 29]. It is well-known that many of the stationary points of these problems are in fact saddle points and far away from any local minimum [35, 29]. These tasks hence require algorithms capable of efficiently escaping saddle points and converging approximately to a local minimizer. In the centralized setting without Byzantine adversaries, this problem has been actively studied in recent work [27, 32, 12, 33].
A main observation of this work is that the interplay between nonconvexity and Byzantine errors makes escaping saddle points much more challenging. In particular, by orchestrating their messages sent to the master machine, the Byzantine machines can create fake local minima near a saddle point of that is far away from any true local minimizer. Such a strategy, which may be referred to as a saddle point attack, foils existing algorithms as we elaborate below:

Challenges due to nonconvexity: When is convex, gradient descent (GD) equipped with a robust gradient estimator is guaranteed to find an approximate global minimizer (with accuracy depending on the fraction of Byzantine machines) [17, 69, 3]. However, when is nonconvex, such algorithms may be trapped in the neighborhood of a saddle point; see Example 1 in Appendix A.

Challenges due to Byzantine machines: Without Byzantine machines, vanilla GD [42], as well as its more efficient variants such as perturbed gradient descent (PGD) [32], are known to converge to a local minimizer with high probability. However, Byzantine machines can manipulate PGD and GD (even robustified) into a fake local minimum near a saddle point; see Example 2 in Appendix A.
We discuss and compare with existing work in more detail in Section 2. The observations above show that existing robust and saddle-escaping algorithms, as well as their naive combination, are insufficient against the saddle point attack. Addressing these challenges requires the development of new robust distributed optimization algorithms.
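To make the second challenge concrete, the following is a minimal sketch of how a bounded adversarial error can stall gradient descent near a saddle point. The toy loss, the adversary rule, and all names (`grad_F`, `adversarial_oracle`, `delta`) are illustrative assumptions; they are not the constructions of Examples 1 and 2 in Appendix A.

```python
import numpy as np

def grad_F(w):
    # Gradient of the toy loss F(w) = 0.5*w[0]**2 - 0.5*w[1]**2 + 0.25*w[1]**4,
    # which has a saddle point at the origin and local minima at (0, 1) and (0, -1).
    return np.array([w[0], -w[1] + w[1] ** 3])

def adversarial_oracle(w, delta):
    # Hypothetical adversary: whenever the true gradient has norm at most delta,
    # report the zero vector. The reported vector is always within delta of the truth.
    g = grad_F(w)
    return np.zeros_like(g) if np.linalg.norm(g) <= delta else g

def gradient_descent(oracle, w0, step=0.1, iters=500):
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = w - step * oracle(w)
    return w

w0 = [0.5, 0.001]
exact = gradient_descent(grad_F, w0)  # reaches the local minimum near (0, 1)
attacked = gradient_descent(lambda w: adversarial_oracle(w, delta=0.05), w0)  # stalls near the saddle
```

In this sketch the attacked run stops at a point where the reported gradient is zero although the Hessian still has an eigenvalue close to -1, which is the kind of fake local minimum described above.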
1.1 Our Contributions
In this paper, we develop ByzantinePGD, a computation- and communication-efficient first-order algorithm that is able to escape saddle points and the fake local minima created by Byzantine machines, and converge to an approximate local minimizer of a nonconvex loss. To the best of our knowledge, our algorithm is the first to achieve such guarantees under adversarial noise.
Specifically, ByzantinePGD aggregates the empirical gradients received from the normal and Byzantine machines, and computes a robust estimate of the true gradient of the population loss . Crucial to our algorithm is the injection of random perturbation to the iterates , which serves the dual purpose of escaping saddle points and fake local minima. Our use of perturbation thus plays a more significant role than in existing algorithms such as PGD [32], as it also serves to combat the effect of Byzantine errors. To achieve this goal, we incorporate two crucial innovations: (i) we use multiple rounds of larger, yet carefully calibrated, amounts of perturbation, which is necessary to survive the saddle point attack; (ii) we use the moving distance in the parameter space as the criterion for successful escape, eliminating the need for (robustly) evaluating function values. Consequently, our analysis is significantly different from, and arguably simpler than, that of PGD.
We develop our algorithmic and theoretical results in a flexible, two-part framework, decomposing the optimization and statistical components of the problem.
The optimization part:
We consider a general problem of optimizing a population loss function given an inexact gradient oracle. For each query point , the inexact gradient oracle returns a vector (possibly chosen adversarially) that satisfies , where is nonzero but bounded. Given access to such an inexact oracle, we show that ByzantinePGD outputs an approximate local minimizer; moreover, no other algorithm can achieve significantly better performance in this setting in terms of the dependence on :
Theorem 1 (Informal; see Sec. 4.2).
Within iterations, ByzantinePGD outputs an approximate local minimizer that satisfies and , where is the minimum eigenvalue. In addition, given access only to an inexact gradient oracle, no algorithm is guaranteed to find a point with or .
Our algorithm is communication-efficient: it only sends gradients, and the number of parallel iterations in our algorithm matches the well-known iteration complexity of GD for nonconvex problems in the non-Byzantine setting [53] (up to log factors). In the exact gradient setting, a variant of the above result in fact matches the guarantees for PGD [32]; as mentioned, our proof is simpler.
Additionally, beyond Byzantine distributed learning, our results apply to any nonconvex optimization problem (distributed or not) with inexact information for the gradients, including those with noisy but non-adversarial gradients. Thus, we believe our results are of independent interest in broader settings.
The statistical part:
The optimization guarantee above can be applied whenever one has a robust aggregation procedure that serves as an inexact gradient oracle with a bounded error . We consider three concrete examples of such robust procedures: median, trimmed mean, and iterative filtering [22, 23]. Under statistical settings for the data, we provide explicit bounds on their errors as a function of the number of worker machines , the number of data points on each worker machine , the fraction of Byzantine machines , and the dimension of the parameter space . Combining these bounds with the optimization result above, we obtain concrete statistical guarantees on the output . Furthermore, we argue that our first-order guarantees on are often nearly optimal when compared against a universal statistical lower bound. This is summarized below:
Theorem 2 (Informal; see Sec. 5).
When combined with each of the following three robust aggregation procedures, ByzantinePGD achieves the statistical guarantees:
(i) median: ;
(ii) trimmed mean: ;
(iii) iterative filtering: .
Moreover, no algorithm can achieve .
We emphasize that the above results are established under a very strong adversary model: the Byzantine machines are allowed to send messages that depend arbitrarily on each other and on the data on the normal machines; they may even behave adaptively during the iterations of our algorithm. Consequently, this setting requires robust functional estimation (of the gradient function), which is a much more challenging problem than the robust mean estimation setting considered by existing work on median, trimmed mean and iterative filtering. To overcome this difficulty, we make use of careful covering net arguments to establish certain error bounds that hold uniformly over the parameter space, regardless of the behavior of the Byzantine machines. Importantly, our inexact oracle framework allows such arguments to be implemented in a transparent and modular manner.
Notation
For an integer , define the set . For matrices, denote the operator norm by ; for symmetric matrices, denote the largest and smallest eigenvalues by and , respectively. The dimensional ball centered at with radius is denoted by , or when it is clear from the context.
2 Related Work
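Table 1 (SP: saddle point; NC: negative curvature)
| Algorithm | PGD | Neon+GD | Neon2+GD | ByzantinePGD |
| --- | --- | --- | --- | --- |
| Byzantine-robust? | no | no | no | yes |
| Purpose of perturbation | escape SP | escape SP | escape SP | escape SP & robustness |
| Escaping method | GD | NC search | NC search | inexact GD |
| Termination criterion | decrease in | decrease in | distance in | distance in |
| Multiple rounds? | no | no | no | yes |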
Table 2
| Work | Robust Aggregation Method | Nonconvex Guarantee |
| --- | --- | --- |
| Feng et al. [26] | geometric median | no |
| Chen et al. [17] | geometric median | no |
| Blanchard et al. [10] | Krum | first-order |
| Yin et al. [69] | median, trimmed mean | first-order |
| Xie et al. [67] | mean-around-median, marginal median | first-order |
| Alistarh et al. [3] | martingale-based | no |
| Su and Xu [63] | iterative filtering | no |
| This work | median, trimmed mean, iterative filtering | second-order |
Efficient first-order algorithms for escaping saddle points
Our algorithm is related to a recent line of work which develops efficient first-order algorithms for escaping saddle points. Although vanilla GD converges to local minimizers almost surely [42, 43], achieving convergence in polynomial time requires a more careful algorithmic design [25]. Such convergence guarantees are enjoyed by several GD-based algorithms; examples include PGD [32], Neon+GD [68], and Neon2+GD [5]. The general idea of these algorithms is to run GD and add perturbation to the iterate when the gradient is small. While our algorithm also uses this idea, the design and analysis techniques of our algorithm are significantly different from the work above in the following aspects (also summarized in Table 1).

In our algorithm, besides helping with escaping saddle points, the random perturbation has the additional role of defending against adversarial errors.

The perturbation used in our algorithm needs to be larger, yet carefully calibrated, in order to account for the influence of the inexactness of gradients across the iterations, especially iterations for escaping saddle points.

We run inexact GD after the random perturbation, while Neon+GD and Neon2+GD use negative curvature (NC) search. It is not immediately clear whether NC search can be robustified against Byzantine failures. Compared to PGD, our analysis is arguably simpler and more straightforward.

Our algorithm does not use the value of the loss function (hence no need for robust function value estimation); PGD and Neon+GD assume access to the (exact) function values.

We employ multiple rounds of perturbation to boost the probability of escaping saddle points; this technique is not used in PGD, Neon+GD, or Neon2+GD.
Inexact oracles
Optimization with an inexact oracle (e.g. noisy gradients) has been studied in various settings such as general convex optimization [7, 21], robust estimation [55], and structured nonconvex problems [6, 16, 11, 71]. Particularly relevant to us is the recent work by Jin et al. [34], who consider the problem of minimizing when only given access to the gradients of another smooth function satisfying . Their algorithm uses Gaussian smoothing on . We emphasize that the inexact gradient setting considered by them is much more benign than our Byzantine setting, since (i) their inexactness is defined in terms of norm whereas the inexactness in our problem is in norm, and (ii) we assume that the inexact gradient can be any vector within error, and thus the smoothing technique is not applicable in our problem. Moreover, the iteration complexity obtained by Jin et al. [34] may be a high-degree polynomial of the problem parameters and thus not suitable for distributed implementation.
Byzantine-robust distributed learning
Solving large-scale learning problems in distributed systems has received much attention in recent years, where communication efficiency and Byzantine robustness are two important topics [58, 41, 70, 10, 15, 20]. Here, we compare with existing Byzantine-robust distributed learning algorithms that are most relevant to our work, and summarize the comparison in Table 2. A general idea of designing Byzantine-robust algorithms is to combine optimization algorithms with a robust aggregation (or outlier removal) subroutine. For convex losses, the aggregation subroutines analyzed in the literature include geometric median [26, 17], median and trimmed mean [69], iterative filtering for the high-dimensional setting [63], and martingale-based methods for the SGD setting [3]. For nonconvex losses, to the best of our knowledge, existing works only provide first-order convergence guarantees (i.e., small gradients), by using aggregation subroutines such as the Krum function [10], median and trimmed mean [69], mean-around-median and marginal median [67]. In this paper, we make use of subroutines based on median, trimmed mean, and iterative filtering. Our analysis of median and trimmed mean follows Yin et al. [69]. Our results based on the iterative filtering subroutine, on the other hand, are new:
Recent work by Su and Xu [63] also makes use of the iterative filtering subroutine for the Byzantine setting. They only study strongly convex loss functions, and assume that the gradients are sub-exponential and . Our results apply to the nonconvex case and do not require the aforementioned condition on (which may therefore scale, for example, linearly with the sample size ), but we impose the stronger assumption of sub-Gaussian gradients.
Other nonconvex optimization algorithms
Besides first-order GD-based algorithms, many other nonconvex optimization methods that can provably converge to an approximate local minimum have received much attention in recent years. For specific problems such as phase retrieval [11], low-rank estimation [16, 72], and dictionary learning [1, 64], many algorithms are developed by leveraging the particular structure of the problems, and they either use a smart initialization [11, 65] or initialize randomly [18, 14]. Other algorithms are developed for general nonconvex optimization, and they can be classified into gradient-based [27, 44, 68, 4, 5, 33], Hessian-vector-product-based [12, 2, 56, 57], and Hessian-based [54, 19] methods. While algorithms using Hessian information can usually achieve better convergence rates (for example, by Curtis et al. [19], and by Carmon et al. [12]), gradient-based methods are easier to implement in practice, especially in the distributed setting we are interested in.
Robust statistics
Outlier-robust estimation is a classical topic in statistics [30]. The coordinate-wise median aggregation subroutine that we consider is related to the median-of-means estimator [52, 31], which has been applied to various robust inference problems [51, 47, 50].
A recent line of work develops efficient robust estimation algorithms in high-dimensional settings [8, 22, 39, 13, 60, 45, 9, 36, 46]. In the centralized setting, the recent work [24] proposes a scheme, similar to the iterative filtering procedure, that iteratively removes outliers for gradient-based optimization.
3 Problem Setup
We consider empirical risk minimization for a statistical learning problem where each data point is sampled from an unknown distribution over the sample space . Let be the loss function of a parameter vector , where is the parameter space. The population loss function is therefore given by .
We consider a distributed computing system with one master machine and worker machines, of which are Byzantine machines and the other are normal. Each worker machine has data points sampled i.i.d. from . Denote by the th data point on the th worker machine, and let be the empirical loss function on the th machine. The master machine and worker machines can send and receive messages via the following communication protocol: In each parallel iteration, the master machine sends a parameter vector to all the worker machines, and then each normal worker machine computes the gradient of its empirical loss at and sends the gradient to the master machine. The Byzantine machines may be jointly controlled by an adversary and send arbitrary or even malicious messages. We denote the unknown set of Byzantine machines by , where . With this notation, the gradient sent by the th worker machine is
(1) 
where the symbol denotes an arbitrary vector. As mentioned, the adversary is assumed to have complete knowledge of the algorithm used and the data stored on all machines, and the Byzantine machines may collude [48] and adapt to the output of the master and normal worker machines. We only make the mild assumption that the adversary cannot predict the random numbers generated by the master machine.
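One parallel iteration of this communication protocol can be sketched as follows. This is only an illustration under stated assumptions: the squared loss, the helper names `local_gradient` and `collect_messages`, and the particular `attack` rule are hypothetical and are not specified by the setup above. The adversary is handed all honest gradients, mirroring the assumption that it knows the data on every machine.

```python
import numpy as np

def local_gradient(w, X, y):
    # Gradient of an assumed empirical squared loss (1/2n) * ||X w - y||^2 on one worker.
    return X.T @ (X @ w - y) / len(y)

def collect_messages(w, shards, byzantine, attack):
    # Normal workers report their local gradient at w; Byzantine workers report
    # whatever the adversary chooses, possibly as a function of all honest gradients.
    honest = [local_gradient(w, X, y) for X, y in shards]
    return np.stack([attack(w, honest) if i in byzantine else g
                     for i, g in enumerate(honest)])

rng = np.random.default_rng(0)
m, n, d = 10, 50, 3                      # workers, points per worker, dimension
w_true = np.ones(d)
shards = []
for _ in range(m):
    X = rng.normal(size=(n, d))
    shards.append((X, X @ w_true + 0.1 * rng.normal(size=n)))

byzantine = {0, 1}                       # index set unknown to the master in practice
attack = lambda w, honest: -10.0 * np.mean(honest, axis=0)  # colluding, data-dependent message
messages = collect_messages(np.zeros(d), shards, byzantine, attack)
```

The master only ever sees the rows of `messages` and has to aggregate them without knowing which rows are corrupted.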
We consider the scenario where is nonconvex, and our goal is to find an approximate local minimizer of . Note that a first-order stationary point (i.e., one with a small gradient) is not necessarily close to a local minimizer, since the point may be a saddle point whose Hessian matrix has a large negative eigenvalue. Accordingly, we seek to find a second-order stationary point , namely, one with a small gradient and a nearly positive semidefinite Hessian:
Definition 1 (Second-order stationarity).
We say that is an second-order stationary point of a twice differentiable function if and .
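For a function whose gradient and Hessian can be evaluated, this condition can be checked numerically. The sketch below assumes the usual reading of the definition (gradient norm at most one tolerance, smallest Hessian eigenvalue at least minus a second tolerance); the names `eps_g` and `eps_h` are hypothetical.

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps_g, eps_h):
    # Small gradient: ||grad|| <= eps_g.
    grad_ok = np.linalg.norm(grad) <= eps_g
    # Nearly positive semidefinite Hessian: lambda_min(hess) >= -eps_h.
    hess_ok = np.linalg.eigvalsh(hess).min() >= -eps_h
    return bool(grad_ok and hess_ok)

# A strict saddle (zero gradient, one negative eigenvalue) fails the test,
# while a point with the same gradient and a positive definite Hessian passes.
at_saddle = is_second_order_stationary(np.zeros(2), np.diag([1.0, -1.0]), 0.1, 0.1)
at_minimum = is_second_order_stationary(np.zeros(2), np.diag([1.0, 2.0]), 0.1, 0.1)
```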
In the sequel, we make use of several standard concepts from continuous optimization.
Definition 2 (Smooth and Hessian-Lipschitz functions).
A function is called smooth if , and Hessian Lipschitz if .
Throughout this paper, the above properties are imposed on the population loss function .
Assumption 1.
is smooth, and Hessian Lipschitz on .
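For reference, the two properties are commonly written as below. This is a sketch of the standard formulation; the symbols $F$, $L$, $\rho$, $w$, and $w'$ are assumed names introduced here for illustration.

```latex
\|\nabla F(w) - \nabla F(w')\|_2 \le L \, \|w - w'\|_2,
\qquad
\|\nabla^2 F(w) - \nabla^2 F(w')\|_2 \le \rho \, \|w - w'\|_2,
\qquad \forall\, w, w' .
```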
4 Byzantine Perturbed Gradient Descent
In this section, we describe our algorithm, Byzantine Perturbed Gradient Descent (ByzantinePGD), which provably finds a second-order stationary point of the population loss in the distributed setting with Byzantine machines. As mentioned, ByzantinePGD robustly aggregates gradients from the worker machines, and performs multiple rounds of carefully calibrated perturbation to combat the effect of Byzantine machines. We now elaborate.
It is well-known that naively aggregating the workers’ messages using standard averaging can be arbitrarily skewed in the presence of just a single Byzantine machine. In view of this, we introduce the subroutine , which robustly aggregates the gradients collected from the workers. We stipulate that provides an estimate of the true population gradient with accuracy , uniformly across . This property is formalized using the terminology of inexact gradient oracle.
Definition 3 (Inexact gradient oracle).
We say that provides a inexact gradient oracle for the population loss if, for every , we have .
Without loss of generality, we assume that throughout the paper. In this section, we treat as a given black box; in Section 5, we discuss several robust aggregation algorithms and characterize their inexactness . We emphasize that in the Byzantine setting, the output of can take values adversarially within the error bounds; that is, may output an arbitrary vector in the ball , and this vector can depend on the data in all the machines and all previous iterations of the algorithm.
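The oracle property can be probed empirically by comparing an oracle against a known gradient on a set of query points. The sketch below is illustrative only: `worst_case_error`, `shrinking_oracle`, and the quadratic test function are hypothetical, and a finite sample of points gives only a lower bound on the true worst-case error.

```python
import numpy as np

def worst_case_error(oracle, true_grad, points):
    # Largest observed deviation ||oracle(w) - true_grad(w)|| over the sample points.
    return max(np.linalg.norm(oracle(w) - true_grad(w)) for w in points)

def shrinking_oracle(true_grad, delta):
    # Hypothetical adversary: shorten the true gradient by delta (or zero it out).
    # The output always lies in the ball of radius delta around the true gradient.
    def oracle(w):
        g = true_grad(w)
        norm = np.linalg.norm(g)
        return np.zeros_like(g) if norm <= delta else g * (1.0 - delta / norm)
    return oracle

rng = np.random.default_rng(0)
true_grad = lambda w: 2.0 * w            # gradient of the assumed test function ||w||^2
points = rng.normal(size=(100, 3))
err = worst_case_error(shrinking_oracle(true_grad, delta=0.1), true_grad, points)
```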
The use of robust aggregation with bounded inexactness, however, is not yet sufficient to guarantee convergence to an approximate local minimizer. As mentioned, the Byzantine machines may create fake local minima that trap a vanilla gradient descent iteration. Our ByzantinePGD algorithm is designed to escape such fake minima as well as any existing saddle points of .
4.1 Algorithm
We now describe the details of our algorithm, given in the left panel of Algorithm 1. We focus on unconstrained optimization, i.e., . In Section 5, we show that the iterates during the algorithm actually stay in a bounded ball centered at the initial iterate , and we will discuss the statistical error rates within the bounded space.
In each parallel iteration, the master machine sends the current iterate to all the worker machines, and the worker machines send back . The master machine aggregates the workers’ gradients using and computes a robust estimate of the population gradient . The master machine then performs a gradient descent step using . This procedure is repeated until it reaches a point with for a prespecified threshold .
At this point, may lie near a saddle point whose Hessian has a large negative eigenvalue. To escape this potential saddle point, the algorithm invokes the routine (right panel of Algorithm 1), which performs rounds of perturbation-and-descent operations. In each round, the master machine perturbs randomly and independently within the ball . Let be the perturbed vector. Starting from the , the algorithm conducts at most parallel iterations of inexact gradient descent (using as before):
(2) 
During this process, once we observe that for some prespecified threshold (this means the iterate moves by a sufficiently large distance in the parameter space), we claim that is a saddle point and the algorithm has escaped it; we then resume inexact gradient descent starting from . If after rounds no sufficient move in the parameter space is ever observed, we claim that is a second-order stationary point of and output .
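The control flow described above can be sketched in a few lines. This is a simplified illustration, not Algorithm 1 itself: the parameter names (`grad_thresh`, `radius`, `dist_thresh`, `inner_iters`, `rounds`), the `max_iters` safety cap, the choice to measure the moving distance from the perturbed starting point, and the toy oracle are all assumptions made for the example, and the parameter values are not those of Theorem 3.

```python
import numpy as np

def escape(w, oracle, step, radius, dist_thresh, inner_iters, rounds, rng):
    """Return an iterate that moved at least dist_thresh, or None if no round did."""
    for _ in range(rounds):
        # Draw a perturbation uniformly from the ball of the given radius.
        u = rng.normal(size=w.shape)
        u *= radius * rng.uniform() ** (1.0 / w.size) / np.linalg.norm(u)
        start = w + u
        v = start.copy()
        for _ in range(inner_iters):
            v = v - step * oracle(v)                  # inexact gradient descent step
            if np.linalg.norm(v - start) >= dist_thresh:
                return v                              # moved far enough: w was a saddle point
    return None

def byzantine_pgd(w0, oracle, step, grad_thresh, radius, dist_thresh,
                  inner_iters, rounds, max_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(max_iters):                        # safety cap, not part of the description
        g = oracle(w)
        if np.linalg.norm(g) > grad_thresh:
            w = w - step * g                          # plain inexact gradient descent
            continue
        moved = escape(w, oracle, step, radius, dist_thresh, inner_iters, rounds, rng)
        if moved is None:
            return w                                  # no round escaped: output w
        w = moved                                     # resume descent from the escaped iterate
    return w

def grad_F(w):
    # Toy loss 0.5*w[0]**2 - 0.5*w[1]**2 + 0.25*w[1]**4: saddle at 0, minima at (0, +-1).
    return np.array([w[0], -w[1] + w[1] ** 3])

def attacked_oracle(w, delta=0.02):
    # Hypothetical adversary that zeroes out any gradient of norm at most delta.
    g = grad_F(w)
    return np.zeros_like(g) if np.linalg.norm(g) <= delta else g

w_out = byzantine_pgd([0.5, 0.001], attacked_oracle, step=0.1, grad_thresh=0.1,
                      radius=0.2, dist_thresh=0.5, inner_iters=100, rounds=5)
```

Note that the sketch never evaluates the loss itself; both the stopping rule and the escape test use only (inexact) gradients and distances in the parameter space.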
4.2 Convergence Guarantees
In this section, we provide the theoretical result guaranteeing that Algorithm 1 converges to a second-order stationary point. In Theorem 3, we let , be the initial iterate, and .
Theorem 3 (ByzantinePGD).
Suppose that Assumption 1 holds, and assume that provides a inexact gradient oracle for with . Given any , choose the parameters for Algorithm 1 as follows: stepsize , , , ,
Then, with probability at least , the output of Algorithm 1, denoted by , satisfies the bounds
(3)  
and the algorithm terminates within parallel iterations.
We prove Theorem 3 in Appendix B. (We make no attempt to optimize the multiplicative constants in Theorem 3.) Below we parse the above theorem and discuss its implications.
Focusing on the scaling with , we may read off from Theorem 3 the following result:
Observation 1.
Under the above setting, within parallel iterations, ByzantinePGD outputs an second-order stationary point of (here, by using the symbol , we ignore logarithmic factors and only consider the dependence on ); that is,
In terms of the iteration complexity, it is well-known that for a smooth nonconvex , gradient descent requires at least iterations to achieve [53]; up to logarithmic factors, our result matches this complexity bound. In addition, our first-order guarantee is clearly order-wise optimal, as the gradient oracle is inexact. It is currently unclear to us whether our second-order guarantee is optimal. We provide a converse result showing that one cannot hope to achieve a second-order guarantee better than .
Proposition 1.
There exists a class of real-valued smooth and Hessian Lipschitz differentiable functions such that, for any algorithm that only uses a inexact gradient oracle, there exists such that the output of the algorithm must satisfy and .
We prove Proposition 1 in Appendix D. Again, we emphasize that our results above are in fact not restricted to the Byzantine distributed learning setting. They apply to any nonconvex optimization problem (distributed or not) with inexact information for the gradients, including those with noisy but non-adversarial gradients; see Section 2 for comparison with related work in such settings.
As a byproduct, we can show that with a different choice of parameters, ByzantinePGD can be used in the standard (non-distributed) setting with access to the exact gradient , and the algorithm converges to an second-order stationary point within iterations:
Theorem 4 (Exact gradient oracle).
Suppose that Assumption 1 holds, and assume that for any query point we can obtain the exact gradient, i.e., . For any and , we choose the parameters in Algorithm 1 as follows: stepsize , , , and , . Then, with probability at least , Algorithm 1 outputs a satisfying the bounds
and the algorithm terminates within iterations.
We prove Theorem 4 in Appendix C. The convergence guarantee above matches that of the original PGD algorithm [32] up to logarithmic factors. Moreover, our proof is considerably simpler, and our algorithm only requires gradient information, whereas the original PGD algorithm also needs function values.
5 Robust Estimation of Gradients
The results in the previous section can be applied as long as one has a robust aggregation subroutine that provides a inexact gradient oracle of the population loss . In this section, we discuss three concrete examples of : median, trimmed mean, and a high-dimensional robust estimator based on the iterative filtering algorithm [22, 23, 60]. We characterize their inexactness under the statistical setting in Section 3, where the data points are sampled independently according to an unknown distribution .
To describe our statistical results, we need the standard notions of sub-Gaussian and sub-exponential random vectors.
Definition 4 (sub-Gaussianity and sub-exponentiality).
A random vector with mean is said to be sub-Gaussian if . It is said to be sub-exponential if .
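One common way to formalize these two notions bounds the moment generating function of every one-dimensional projection. The display below is a sketch of that textbook form; the symbols $z$, $\mu$, $\zeta$, $\xi$, $\lambda$, and $u$ are assumed names, and the constants may differ from the convention used in this paper.

```latex
\text{sub-Gaussian:}\quad
\mathbb{E}\big[\exp\big(\lambda \langle z - \mu, u \rangle\big)\big]
\le \exp\big(\tfrac{1}{2}\zeta^2\lambda^2\big)
\quad \forall\, \lambda \in \mathbb{R},\ \|u\|_2 = 1;
\qquad
\text{sub-exponential:}\quad
\mathbb{E}\big[\exp\big(\lambda \langle z - \mu, u \rangle\big)\big]
\le \exp\big(\tfrac{1}{2}\xi^2\lambda^2\big)
\quad \forall\, |\lambda| < 1/\xi,\ \|u\|_2 = 1 .
```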
We also need the following result (proved in Appendix E), which shows that the iterates of ByzantinePGD in fact stay in a bounded set around the initial iterate .
Proposition 2.
Under the choice of algorithm parameters in Theorem 3, all the iterates in ByzantinePGD stay in the ball with , where is a number that only depends on and .
Consequently, for the convergence guarantees of ByzantinePGD to hold, we only need to satisfy the inexact oracle property (Definition 3) within the bounded set , with given in Proposition 2. As shown below, the three aggregation procedures indeed satisfy this property, with their inexactness depending only mildly (logarithmically) on the radius .
5.1 Iterative Filtering Algorithm
We start with a recently developed high-dimensional robust estimation technique called the iterative filtering algorithm [22, 23, 60] and use it to build the subroutine . As can be seen below, iterative filtering can tolerate a constant fraction of Byzantine machines even when the dimension grows, which is an advantage over simpler algorithms such as median and trimmed mean.
We relegate the details of the iterative filtering algorithm to Appendix F.1. Again, we emphasize that the original iterative filtering algorithm is proposed to robustly estimate a single parameter vector, whereas in our setting, since the Byzantine machines may produce unspecified probabilistic dependency across the iterations, we need to prove an error bound for robust gradient estimation uniformly across the parameter space . We prove such a bound for iterative filtering under the following two assumptions on the gradients and the smoothness of each loss function .
Assumption 2.
For each , is sub-Gaussian.
Assumption 3.
For each , is smooth.
Let be the covariance matrix of , and define . We have the following bounds on the inexactness parameter of iterative filtering.
Theorem 5 (Iterative Filtering).
5.2 Median and Trimmed Mean
The median and trimmed mean operations are two widely used robust estimation methods. While the dependence of their performance on is not optimal, they are conceptually simple and computationally fast, and still have good performance in low-dimensional settings. We apply these operations in a coordinate-wise fashion to build .
Formally, for a set of vectors , , their coordinate-wise median is a vector with its th coordinate being for each , where is the usual (one-dimensional) median. The coordinate-wise trimmed mean is a vector with for each , where is a subset of obtained by removing the largest and smallest fraction of its elements.
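Both operations are short to write down with NumPy. The following is a minimal sketch; the function names, the trimming parameter name `beta`, and the rounding of the trimmed count with `floor` are assumptions made for illustration. The example also contrasts them with plain averaging when a single message is corrupted.

```python
import numpy as np

def coordinate_median(G):
    # G has shape (m, d): one row per worker message.
    return np.median(G, axis=0)

def coordinate_trimmed_mean(G, beta):
    # Per coordinate, drop the floor(beta * m) largest and smallest values, then average.
    m = G.shape[0]
    k = int(np.floor(beta * m))
    if 2 * k >= m:
        raise ValueError("beta is too large: nothing left to average")
    S = np.sort(G, axis=0)               # sorts each coordinate (column) independently
    return S[k:m - k].mean(axis=0)

# Nine identical honest messages and one corrupted message.
G = np.array([[1.0, 2.0]] * 9 + [[1e6, -1e6]])
plain_mean = G.mean(axis=0)              # dragged far away by the single outlier
med = coordinate_median(G)               # stays at the honest value
trimmed = coordinate_trimmed_mean(G, beta=0.1)
```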
For robust estimation of the gradient in the Byzantine setting, the error bounds of median and trimmed mean have been studied by Yin et al. [69]. For completeness, we record their results below as an informal theorem; details are relegated to Appendix F.3.
Theorem 6 (Informal).
[69] Under appropriate smoothness and probabilistic assumptions, with high probability, the median operation provides a inexact gradient oracle with , and the trimmed mean operation provides a inexact gradient oracle with . (Specifically, for median we assume that gradients have bounded skewness, and for trimmed mean we assume that the gradients are sub-exponentially distributed.)
5.3 Comparison and Optimality
In Table 3, we compare the above three algorithms in terms of the dependence of their gradient inexactness on the problem parameters , , , and . We see that when , the median and trimmed mean algorithms have better inexactness due to a better scaling with . When is large, iterative filtering becomes preferable.
Table 3
| Method | Gradient inexactness |
| --- | --- |
| median | |
| trimmed mean | |
| iterative filtering | |
Recall that according to Observation 1, with inexact gradients the ByzantinePGD algorithm converges to an second-order stationary point. Combining this general result with the bounds in Table 3, we obtain explicit statistical guarantees on the output of ByzantinePGD. To understand the statistical optimality of these guarantees, we provide a converse result below.
Observation 2.
There exists a statistical learning problem in the Byzantine setting such that the output of any algorithm must satisfy with a constant probability.
We prove Observation 2 in Appendix F.4. In view of this observation, we see that in terms of the first-order guarantee (i.e., on ) and up to logarithmic factors, trimmed mean is optimal if , the median is optimal if and , and iterative filtering is optimal if . The statistical optimality of their second-order guarantees (i.e., on ) is currently unclear to us, and we believe this is an interesting problem for future investigation.
6 Conclusion
In this paper, we study security issues that arise in large-scale distributed learning because of the presence of saddle points in nonconvex loss functions. We observe that in the presence of nonconvexity and Byzantine machines, escaping saddle points becomes much more challenging. We develop ByzantinePGD, a computation- and communication-efficient algorithm that is able to provably escape saddle points and converge to a second-order stationary point, even in the presence of Byzantine machines. We also discuss three different choices of the robust gradient and function value aggregation subroutines in ByzantinePGD: median, trimmed mean, and the iterative filtering algorithm. We characterize their performance in statistical settings, and argue for their near-optimality in different regimes including the high-dimensional setting.
Acknowledgements
D. Yin is partially supported by Berkeley DeepDrive Industry Consortium. Y. Chen is partially supported by NSF CRII award 1657420 and grant 1704828. K. Ramchandran is partially supported by NSF CIF award 1703678. P. Bartlett is partially supported by NSF grant IIS-1619362. The authors would like to thank Zeyuan Allen-Zhu for pointing out a potential way to improve our initial results, and Ilias Diakonikolas for discussing references [22, 23, 24].
References
 Agarwal et al. [2014] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In COLT, pages 123–137, 2014.
 Agarwal et al. [2016] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
 Alistarh et al. [2018] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
 Allen-Zhu [2017] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 Allen-Zhu and Li [2017] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
 Balakrishnan et al. [2014] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.
 Bertsekas and Tsitsiklis [2000] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
 Bhatia et al. [2015] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
 Bhatia et al. [2017] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
 Blanchard et al. [2017] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757, 2017.
 Candes et al. [2015] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
 Carmon et al. [2016] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. arXiv preprint arXiv:1611.00756, 2016.

 Charikar et al. [2017] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.
 Chatterji and Bartlett [2017] Niladri Chatterji and Peter L Bartlett. Alternating minimization for dictionary learning with random initialization. In Advances in Neural Information Processing Systems, pages 1994–2003, 2017.
 Chen et al. [2018a] Lingjiao Chen, Zachary Charles, Dimitris Papailiopoulos, et al. DRACO: Robust distributed training via redundant gradients. arXiv preprint arXiv:1803.09877, 2018a.
 Chen and Wainwright [2015] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
 Chen et al. [2017] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
 Chen et al. [2018b] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. arXiv preprint arXiv:1803.07726, 2018b.
 Curtis et al. [2017] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $O(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
 Damaskinos et al. [2018] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Rhicheek Patra, and Mahsa Taziki. Asynchronous Byzantine machine learning. arXiv preprint arXiv:1802.07928, 2018.
 Devolder et al. [2014] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
 Diakonikolas et al. [2016] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
 Diakonikolas et al. [2017] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2017.
 Diakonikolas et al. [2018] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
 Du et al. [2017] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
 Feng et al. [2014] Jiashi Feng, Huan Xu, and Shie Mannor. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

 Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
 Ge et al. [2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 Ge et al. [2017] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.
 Huber [2011] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.

 Jerrum et al. [1986] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
 Jin et al. [2017a] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017a.
 Jin et al. [2017b] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
 Jin et al. [2018] Chi Jin, Lydia T Liu, Rong Ge, and Michael I Jordan. Minimizing nonconvex population risk from rough empirical risk. arXiv preprint arXiv:1803.09357, 2018.
 Kawaguchi [2016] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 Klivans et al. [2018] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.
 Konečnỳ et al. [2015] Jakub Konečnỳ, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
 Konečnỳ et al. [2016] Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
 Lai et al. [2016] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
 Lamport et al. [1982] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.
 Lee et al. [2015] Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.
 Lee et al. [2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.
 Lee et al. [2017] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
 Levy [2016] Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
 Li [2017] Jerry Li. Robust sparse estimation tasks in high dimensions. arXiv preprint arXiv:1702.05860, 2017.
 Liu et al. [2018] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.
 Lugosi and Mendelson [2016] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
 Lynch [1996] Nancy A. Lynch. Distributed Algorithms. Elsevier, 1996.
 McMahan and Ramage [2017] Brendan McMahan and Daniel Ramage. Federated learning: Collaborative machine learning without centralized training data. https://research.googleblog.com/2017/04/federatedlearningcollaborative.html, 2017.
 Minsker and Strawn [2017] Stanislav Minsker and Nate Strawn. Distributed statistical estimation and rates of convergence in normal approximation. arXiv preprint arXiv:1704.02658, 2017.
 Minsker et al. [2015] Stanislav Minsker et al. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
 Nemirovskii et al. [1983] Arkadii Nemirovskii, David Borisovich Yudin, and Edgar Ronald Dawson. Problem complexity and method efficiency in optimization. Wiley, 1983.
 Nesterov [1998] Yurii Nesterov. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 1998.
 Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 Prasad et al. [2018] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
 Royer and Wright [2018] Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
 Royer et al. [2018] Clément W Royer, Michael O’Neill, and Stephen J Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. arXiv preprint arXiv:1803.02924, 2018.
 Shamir et al. [2014] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
 Soudry and Carmon [2016] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 Steinhardt et al. [2017] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
 Su and Vaidya [2016a] Lili Su and Nitin H Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434. ACM, 2016a.
 Su and Vaidya [2016b] Lili Su and Nitin H Vaidya. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing, pages 414–427. Springer, 2016b.
 Su and Xu [2018] Lili Su and Jiaming Xu. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018.
 Sun et al. [2015] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery using nonconvex optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2351–2360, 2015.
 Tu et al. [2015] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
 Vershynin [2010] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 Xie et al. [2018] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.
 Xu and Yang [2017] Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
 Yin et al. [2018a] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, pages 5650–5659, 2018a.

 Yin et al. [2018b] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: A key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007, 2018b.
 Zhang et al. [2016] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable non-convex phase retrieval with outliers: Median-truncated Wirtinger flow. In International Conference on Machine Learning, pages 1022–1031, 2016.
 Zhao et al. [2015] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
Appendix
Appendix A Challenges of Escaping Saddle Points in the Adversarial Setting
We provide two examples showing that in the nonconvex setting with saddle points, an inexact oracle can lead to much worse suboptimal solutions than in the convex setting, and that in the adversarial setting, escaping saddle points can be inherently harder than in the adversary-free case.
Consider standard gradient descent using exact or inexact gradients. Our first example shows that Byzantine machines have a more severe impact in the nonconvex case than in the convex case.
Example 1.
Let and consider the functions and . Here is strongly convex with a unique local minimizer , whereas has two local (in fact, global) minimizers and a saddle point (in fact, a local maximum) . Proposition 3 below shows the following: for the convex , gradient descent (GD) finds a near-optimal solution with suboptimality proportional to , regardless of initialization; for the nonconvex , GD initialized near the saddle point suffers from an suboptimality gap.
Proposition 3.
Suppose that . Under the setting above, the following holds.
(i) For , starting from any , GD using a inexact gradient oracle finds with .
(ii) For , there exists an adversarial strategy such that starting from a sampled uniformly from , GD with a inexact gradient oracle outputs with , with probability .
Proof.
Since , we have . For any , (since ). Thus, the adversarial oracle can always output when , and we have . Thus, if , the iterate can no longer move with this adversarial strategy. Then, we have (since ). The result for the convex function is a direct corollary of Theorem 1 in [69]. ∎
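The adversarial strategy used in the proof can be sketched numerically. The block below assumes a specific double-well function, F(w) = (w^2 - 1)^2 / 4, which has minimizers at -1 and 1 and a local maximum at 0; this choice, the step size, and the value of the inexactness level `delta` are illustrative assumptions and need not match the functions used in Example 1.

```python
def double_well(w):
    """Assumed nonconvex function: minimizers at -1 and 1, local maximum at 0."""
    return (w ** 2 - 1) ** 2 / 4


def grad_double_well(w):
    """Exact gradient of the assumed double-well function."""
    return w ** 3 - w


def adversarial_oracle(w, delta):
    """Report a zero gradient whenever that lie is within delta of the truth."""
    g = grad_double_well(w)
    return 0.0 if abs(g) <= delta else g


def run_gd(w0, oracle, step=0.1, iters=2000):
    """Plain gradient descent driven by the supplied gradient oracle."""
    w = w0
    for _ in range(iters):
        w -= step * oracle(w)
    return w


delta = 0.05
w_exact = run_gd(0.01, grad_double_well)
w_attacked = run_gd(0.01, lambda w: adversarial_oracle(w, delta))
```

Starting at the same point near the local maximum, exact GD moves to the minimizer at 1, whereas the attacked run never moves, so its suboptimality stays close to F(0) = 1/4 no matter how small `delta` is relative to that gap.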
Our second example shows that escaping saddle points is much harder in the Byzantine setting than in the non-Byzantine setting.
Example 2.
Let , and assume that in the neighborhood of the origin, takes the quadratic form , with . (This quadratic form holds only locally around the origin, not globally; otherwise the function would have no minimum.) The origin is not an second-order stationary point, but rather a saddle point. Proposition 4 below shows that exact GD escapes the saddle point almost surely, while GD with an inexact oracle fails to do so.
Proposition 4.
Under the setting above, if one chooses and samples from uniformly at random, then:
(i) Using exact gradient descent, with probability , the iterate eventually leaves .
(ii) There exists an adversarial strategy such that, when we update using inexact gradient oracle, if , with probability , the iterate cannot leave ; otherwise with probability the iterate cannot leave .
Proof.
Since , , we have . Sample uniformly at random from , and we know that with probability , . Then, by running exact gradient descent , we can see that the second coordinate of is . When , we know that as gets large, we eventually have , which implies that the iterate leaves .
On the other hand, suppose that we run inexact gradient descent, i.e., with . In the first step, if , the adversary can simply replace with (one can check that here we have ), and then the second coordinate of does not change, i.e., . In the following iterations, the adversary can keep using the same strategy and the second coordinate of never changes, and then the iterates cannot escape , since is a strongly convex function in its first coordinate. To compute the probability of getting stuck at the saddle point, we only need to compute the area of the region , which can be done via simple geometry. ∎
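The following sketch mimics this argument on an assumed local quadratic F(w) = w1^2 / 2 - lam * w2^2 / 2, where the second coordinate is the escape direction. The curvature `lam`, the inexactness level `delta`, the step size, and the starting point are hypothetical values chosen for illustration; the adversary zeroes the second gradient coordinate whenever doing so changes the gradient by at most `delta`.

```python
def exact_grad(w, lam):
    """Gradient of the assumed local form F(w) = w1**2 / 2 - lam * w2**2 / 2."""
    return (w[0], -lam * w[1])


def attacked_grad(w, lam, delta):
    """Zero the escape (second) coordinate whenever that costs at most delta in error."""
    g1, g2 = exact_grad(w, lam)
    return (g1, 0.0) if abs(g2) <= delta else (g1, g2)


def run_gd(w0, grad, step=0.5, iters=200):
    """Gradient descent on a 2-D iterate stored as a tuple."""
    w = w0
    for _ in range(iters):
        g = grad(w)
        w = (w[0] - step * g[0], w[1] - step * g[1])
    return w


lam, delta = 0.5, 0.1
start = (0.05, 0.05)  # |lam * w2| = 0.025 <= delta, so the start lies in the stuck strip
free = run_gd(start, lambda w: exact_grad(w, lam))
stuck = run_gd(start, lambda w: attacked_grad(w, lam, delta))
```

With exact gradients the second coordinate grows geometrically and the iterate leaves any small neighborhood of the origin; under the attack the second coordinate never changes while the first coordinate contracts to zero, so the iterate stays near the saddle point.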
Remark.
Even if we choose the largest possible perturbation in , i.e., sample from the circle , the stuck region still exists. We can compute the length of the arc and find the probability of being stuck. One can find that when , the probability of being stuck in is still ; otherwise, the probability of being stuck is .
The above examples show that the adversary can significantly alter the landscape of the function near a saddle point. We counter this by exerting a large perturbation on the iterate so that it escapes this bad region. The amount of perturbation is carefully calibrated to ensure that the algorithm finds a descent direction “steep” enough to be preserved under corruption, while not compromising the accuracy. Multiple rounds of perturbation are performed, boosting the escape probability exponentially.
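To make the effect of repeated perturbation concrete, the sketch below computes how many rounds are needed to drive the failure probability below a target, under the simplifying assumption that each round escapes independently with the same probability `p_single`. Both the independence assumption and the numbers are illustrative and are not taken from the analysis in the paper.

```python
def rounds_needed(p_single, target_failure):
    """Smallest Q with (1 - p_single)**Q <= target_failure, assuming independent rounds."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must lie in (0, 1]")
    q, fail = 0, 1.0
    while fail > target_failure:
        fail *= 1.0 - p_single
        q += 1
    return q


# With a 50% chance of escaping per round, ten rounds push failure below 0.1%.
q = rounds_needed(0.5, 0.001)
```

Because the failure probability shrinks geometrically in the number of rounds, the number of rounds grows only logarithmically in the inverse of the target failure probability.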
Appendix B Proof of Theorem 3
We first analyze the gradient descent step with inexact gradient oracle.
Lemma 1.
Suppose that . For any , if we run the following inexact gradient descent step:
(4) 
with . Then, we have
Proof.
Since is smooth, we know that
∎
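As a sanity check on this kind of descent guarantee, the sketch below numerically tests the standard bound for an L-smooth function with step size 1/L and a gradient error of norm at most delta: F(w') <= F(w) - ||grad F(w)||^2 / (2L) + delta^2 / (2L). This form of the inequality is the textbook one and is assumed here; the constants in Lemma 1 may be stated differently. The quadratic test function and all parameter values are hypothetical.

```python
import random


def check_descent(dim=5, L=4.0, delta=0.3, trials=1000, seed=0):
    """Return the largest observed value of F(w') minus the assumed bound."""
    rng = random.Random(seed)
    # F(w) = 0.5 * sum(a_i * w_i**2) with |a_i| <= L is L-smooth,
    # and it is nonconvex whenever some a_i is negative.
    a = [rng.uniform(-L, L) for _ in range(dim)]

    def F(w):
        return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))

    worst = float("-inf")
    for _ in range(trials):
        w = [rng.uniform(-1, 1) for _ in range(dim)]
        g = [ai * wi for ai, wi in zip(a, w)]
        # Gradient error of norm exactly delta in a random direction.
        e = [rng.gauss(0, 1) for _ in range(dim)]
        norm_e = sum(x * x for x in e) ** 0.5
        e = [delta * x / norm_e for x in e]
        # Inexact gradient step with step size 1 / L.
        w_next = [wi - (gi + ei) / L for wi, gi, ei in zip(w, g, e)]
        g_sq = sum(x * x for x in g)
        bound = F(w) - g_sq / (2 * L) + delta ** 2 / (2 * L)
        worst = max(worst, F(w_next) - bound)
    return worst


worst_violation = check_descent()
```

A non-positive `worst_violation` (up to floating-point rounding) indicates that every sampled step decreased the function value at least as much as the assumed bound predicts.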
Let be the threshold on that the algorithm uses to determine whether or not to add perturbation. Choose . Suppose that at a particular iterate , we observe . Then, we know that
According to Lemma 1, by running one iteration of the inexact gradient descent step, the decrease in function value is at least
(5) 
We proceed to analyze the perturbation step, which happens when the algorithm arrives at an iterate with . In this proof, we slightly abuse notation. Recall that in equation (2) in Section 4.1, we use to denote the iterates of the algorithm in the saddle point escaping process. Here, we simply use to denote these iterates. We start with the definition of the stuck region at .
Definition 5.
Given , and parameters , , and , the stuck region is a set of which satisfies the following property: there exists an adversarial strategy such that when we start with and run gradient descent steps with inexact gradient oracle :
(6) 
we observe , .
When it is clear from the context, we may simply use the terminology stuck region at . The following lemma shows that if has a large negative eigenvalue, then the stuck region has a small width along the direction of the eigenvector associated with this negative eigenvalue.
Lemma 2.
Assume that the smallest eigenvalue of satisfies , and let the unit vector be the eigenvector associated with . Let be two points such that with some . Choose step size , and consider the stuck region