Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning

06/14/2018 · Dong Yin, Yudong Chen, Kannan Ramchandran, Peter Bartlett

In this paper, we study robust large-scale distributed learning in the presence of saddle points in non-convex loss functions. We consider the Byzantine setting where some worker machines may have abnormal or even arbitrary and adversarial behavior. We argue that in the Byzantine setting, optimizing a non-convex function and escaping saddle points become much more challenging, even when robust gradient estimators are used. We develop ByzantinePGD, a robust and communication-efficient algorithm that can provably escape saddle points and converge to approximate local minimizers. The iteration complexity of our algorithm in the Byzantine setting matches that of standard gradient descent in the usual setting. We further provide three robust aggregation subroutines that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in statistical settings, and argue for their near-optimality in different regimes including the high dimensional setting.


1 Introduction

Distributed computing has become increasingly important in modern data-intensive applications. In many applications, large-scale datasets are distributed over multiple machines for parallel processing in order to speed up computation. In other settings, the data sources are naturally distributed, and for privacy and efficiency considerations, the data are not transmitted to a central machine. An example is the recently proposed Federated Learning paradigm [49, 38, 37], in which the data are stored and processed locally in end users’ cellphones and personal computers.

In a standard worker-server distributed computing framework, a single master machine is in charge of maintaining and updating the parameter of interest, and a set of worker machines store the data, perform local computation, and communicate with the master. In this setting, messages received from worker machines are prone to errors due to data corruption, hardware/software malfunction, and communication delay and failure. These problems are only exacerbated in a decentralized distributed architecture such as Federated Learning, where some machines may be subject to malicious and coordinated attacks and manipulation. A well-established framework for studying such scenarios is the Byzantine setting [40], where a subset of machines behave completely arbitrarily, even in a way that depends on the algorithm used and the data on the other machines, thereby capturing the unpredictable nature of the errors. Developing distributed algorithms that are robust in the Byzantine setting has become increasingly critical.

In this paper we focus on robust distributed optimization for statistical learning problems. Here the data points are generated from some unknown distribution $\mathcal{D}$ and stored locally in $m$ worker machines, each storing $n$ data points; the goal is to minimize a population loss function $F(w) := \mathbb{E}_{z \sim \mathcal{D}}[f(w; z)]$ defined as an expectation over $\mathcal{D}$, where $w$ ranges over the parameter space $\mathcal{W}$. We assume that an $\alpha$ fraction of the worker machines are Byzantine; that is, their behavior is arbitrary. This Byzantine-robust distributed learning problem has attracted attention in a recent line of work [3, 10, 17, 26, 61, 62, 69]. This body of work develops robust algorithms that are guaranteed to output an approximate minimizer of $F$ when it is convex, or an approximate stationary point in the non-convex case.

However, fitting complicated machine learning models often requires finding a local minimum of non-convex functions, as exemplified by training deep neural networks and other high-capacity learning architectures [59, 28, 29]. It is well-known that many of the stationary points of these problems are in fact saddle points and far away from any local minimum [35, 29]. These tasks hence require algorithms capable of efficiently escaping saddle points and converging approximately to a local minimizer. In the centralized setting without Byzantine adversaries, this problem has been studied actively and recently [27, 32, 12, 33].

A main observation of this work is that the interplay between non-convexity and Byzantine errors makes escaping saddle points much more challenging. In particular, by orchestrating the messages they send to the master machine, the Byzantine machines can create fake local minima near a saddle point of $F$ that is far away from any true local minimizer. Such a strategy, which may be referred to as a saddle point attack, foils existing algorithms, as we elaborate below:


  • Challenges due to non-convexity: When $F$ is convex, gradient descent (GD) equipped with a robust gradient estimator is guaranteed to find an approximate global minimizer (with accuracy depending on the fraction of Byzantine machines) [17, 69, 3]. However, when $F$ is non-convex, such algorithms may be trapped in the neighborhood of a saddle point; see Example 1 in Appendix A.

  • Challenges due to Byzantine machines: Without Byzantine machines, vanilla GD [42], as well as its more efficient variants such as perturbed gradient descent (PGD) [32], are known to converge to a local minimizer with high probability. However, Byzantine machines can manipulate PGD and GD (even robustified) into a fake local minimum near a saddle point; see Example 2 in Appendix A.

We discuss and compare with existing work in more detail in Section 2. The observations above show that existing robust and saddle-escaping algorithms, as well as their naive combination, are insufficient against the saddle point attack. Addressing these challenges requires the development of new robust distributed optimization algorithms.

1.1 Our Contributions

In this paper, we develop ByzantinePGD, a computation- and communication-efficient first-order algorithm that is able to escape saddle points and the fake local minima created by Byzantine machines, and converge to an approximate local minimizer of a non-convex loss. To the best of our knowledge, our algorithm is the first to achieve such guarantees under adversarial noise.

Specifically, ByzantinePGD aggregates the empirical gradients received from the normal and Byzantine machines, and computes a robust estimate of the true gradient of the population loss $F$. Crucial to our algorithm is the injection of random perturbation to the iterates, which serves the dual purpose of escaping saddle points and fake local minima. Our use of perturbation thus plays a more significant role than in existing algorithms such as PGD [32], as it also serves to combat the effect of Byzantine errors. To achieve this goal, we incorporate two crucial innovations: (i) we use multiple rounds of a larger, yet carefully calibrated, amount of perturbation, which is necessary to survive the saddle point attack; (ii) we use the moving distance in the parameter space as the criterion for successful escape, eliminating the need to (robustly) evaluate function values. Consequently, our analysis is significantly different from, and arguably simpler than, that of PGD.

We develop our algorithmic and theoretical results in a flexible, two-part framework, decomposing the optimization and statistical components of the problem.

The optimization part:

We consider a general problem of optimizing a population loss function $F$ given an inexact gradient oracle. For each query point $w$, the $\Delta$-inexact gradient oracle returns a vector $g(w)$ (possibly chosen adversarially) that satisfies $\|g(w) - \nabla F(w)\|_2 \le \Delta$, where $\Delta$ is non-zero but bounded. Given access to such an inexact oracle, we show that ByzantinePGD outputs an approximate local minimizer; moreover, no other algorithm can achieve significantly better performance in this setting in terms of the dependence on $\Delta$:

Theorem 1 (Informal; see Sec. 4.2).

Within $\widetilde{O}(1/\Delta^2)$ iterations, ByzantinePGD outputs an approximate local minimizer $\widehat{w}$ that satisfies $\|\nabla F(\widehat{w})\|_2 \le \widetilde{O}(\Delta)$ and $\lambda_{\min}(\nabla^2 F(\widehat{w})) \ge -\widetilde{O}(\Delta^{2/5})$, where $\lambda_{\min}(\cdot)$ is the minimum eigenvalue. In addition, given only access to a $\Delta$-inexact gradient oracle, no algorithm is guaranteed to find a point with gradient norm $o(\Delta)$ or minimum Hessian eigenvalue above $-\Omega(\sqrt{\Delta})$.

Our algorithm is communication-efficient: it only sends gradients, and the number of parallel iterations in our algorithm matches the well-known $O(1/\epsilon^2)$ iteration complexity of GD for non-convex problems in the non-Byzantine setting [53] (up to log factors). In the exact gradient setting, a variant of the above result in fact matches the guarantees for PGD [32]; as mentioned, our proof is simpler.

Additionally, beyond Byzantine distributed learning, our results apply to any non-convex optimization problem (distributed or not) with inexact gradient information, including problems with noisy but non-adversarial gradients. Thus, we believe our results are of independent interest in broader settings.

The statistical part:

The optimization guarantee above can be applied whenever one has a robust aggregation procedure that serves as an inexact gradient oracle with a bounded error $\Delta$. We consider three concrete examples of such robust procedures: median, trimmed mean, and iterative filtering [22, 23]. Under statistical settings for the data, we provide explicit bounds on their errors as a function of the number of worker machines $m$, the number of data points on each worker machine $n$, the fraction of Byzantine machines $\alpha$, and the dimension of the parameter space $d$. Combining these bounds with the optimization result above, we obtain concrete statistical guarantees on the output $\widehat{w}$. Furthermore, we argue that our first-order guarantees on $\widehat{w}$ are often nearly optimal when compared against a universal statistical lower bound. This is summarized below:

Theorem 2 (Informal; see Sec. 5).

When combined with each of the following three robust aggregation procedures, ByzantinePGD achieves explicit statistical guarantees:
(i) median;
(ii) trimmed mean;
(iii) iterative filtering.
The corresponding bounds on the oracle inexactness $\Delta$, stated in Section 5, are functions of $m$, $n$, $\alpha$, and $d$. Moreover, no algorithm can achieve a first-order error better than the universal statistical lower bound of Observation 2.

We emphasize that the above results are established under a very strong adversary model: the Byzantine machines are allowed to send messages that depend arbitrarily on each other and on the data on the normal machines; they may even behave adaptively during the iterations of our algorithm. Consequently, this setting requires robust functional estimation (of the gradient function), which is a much more challenging problem than the robust mean estimation setting considered by existing work on median, trimmed mean and iterative filtering. To overcome this difficulty, we make use of careful covering net arguments to establish certain error bounds that hold uniformly over the parameter space, regardless of the behavior of the Byzantine machines. Importantly, our inexact oracle framework allows such arguments to be implemented in a transparent and modular manner.

Notation

For an integer $N$, define the set $[N] := \{1, 2, \ldots, N\}$. For matrices, denote the operator norm by $\|\cdot\|_2$; for symmetric matrices, denote the largest and smallest eigenvalues by $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$, respectively. The $d$-dimensional ball centered at $w$ with radius $r$ is denoted by $\mathbb{B}_d(w, r)$, or $\mathbb{B}(w, r)$ when the dimension is clear from the context.

2 Related Work

| Algorithm | PGD | Neon+GD | Neon2+GD | ByzantinePGD |
| --- | --- | --- | --- | --- |
| Byzantine-robust? | no | no | no | yes |
| Purpose of perturbation | escape SP | escape SP | escape SP | escape SP & robustness |
| Escaping method | GD | NC search | NC search | inexact GD |
| Termination criterion | decrease in $F$ | decrease in $F$ | distance in $w$ | distance in $w$ |
| Multiple rounds? | no | no | no | yes |

Table 1: Comparison with PGD, Neon+GD, and Neon2+GD. SP = saddle point.
| Algorithm | Robust Aggregation Method | Non-convex Guarantee |
| --- | --- | --- |
| Feng et al. [26] | geometric median | no |
| Chen et al. [17] | geometric median | no |
| Blanchard et al. [10] | Krum | first-order |
| Yin et al. [69] | median, trimmed mean | first-order |
| Xie et al. [67] | mean-around-median, marginal median | first-order |
| Alistarh et al. [3] | martingale-based | no |
| Su and Xu [63] | iterative filtering | no |
| This work | median, trimmed mean, iterative filtering | second-order |

Table 2: Comparison with other Byzantine-robust distributed learning algorithms.

Efficient first-order algorithms for escaping saddle points

Our algorithm is related to a recent line of work which develops efficient first-order algorithms for escaping saddle points. Although vanilla GD converges to local minimizers almost surely [42, 43], achieving convergence in polynomial time requires a more careful algorithmic design [25]. Such convergence guarantees are enjoyed by several GD-based algorithms; examples include PGD [32], Neon+GD [68], and Neon2+GD [5]. The general idea of these algorithms is to run GD and add perturbation to the iterate when the gradient is small. While our algorithm also uses this idea, the design and analysis techniques of our algorithm are significantly different from the work above in the following aspects (also summarized in Table 1).


  • In our algorithm, besides helping with escaping saddle points, the random perturbation has the additional role of defending against adversarial errors.

  • The perturbation used in our algorithm needs to be larger, yet carefully calibrated, in order to account for the influence of the inexactness of gradients across the iterations, especially iterations for escaping saddle points.

  • We run inexact GD after the random perturbation, while Neon+GD and Neon2+GD use negative curvature (NC) search. It is not immediately clear whether NC search can be robustified against Byzantine failures. Compared to PGD, our analysis is arguably simpler and more straightforward.

  • Our algorithm does not use the value of the loss function (hence no need for robust function value estimation); PGD and Neon+GD assume access to the (exact) function values.

  • We employ multiple rounds of perturbation to boost the probability of escaping saddle points; this technique is not used in PGD, Neon+GD, or Neon2+GD.

Inexact oracles

Optimization with an inexact oracle (e.g., noisy gradients) has been studied in various settings such as general convex optimization [7, 21], robust estimation [55], and structured non-convex problems [6, 16, 11, 71]. Particularly relevant to us is the recent work by Jin et al. [34], who consider the problem of minimizing a population loss $F$ when only given access to the gradients of another smooth function that is uniformly close to $F$. Their algorithm uses Gaussian smoothing on that surrogate function. We emphasize that the inexact gradient setting considered by them is much more benign than our Byzantine setting, since (i) their inexactness is defined as an $\ell_\infty$ bound between two fixed functions, whereas the inexactness in our problem is an $\ell_2$ bound on an adversarially chosen gradient error, and (ii) we assume that the inexact gradient can be any vector within the $\Delta$ error bound, and thus the smoothing technique is not applicable in our problem. Moreover, the iteration complexity obtained by Jin et al. [34] may be a high-degree polynomial of the problem parameters and thus not suitable for distributed implementation.

Byzantine-robust distributed learning

Solving large-scale learning problems in distributed systems has received much attention in recent years, where communication efficiency and Byzantine robustness are two important topics [58, 41, 70, 10, 15, 20]. Here, we compare with existing Byzantine-robust distributed learning algorithms that are most relevant to our work, and summarize the comparison in Table 2. A general idea for designing Byzantine-robust algorithms is to combine optimization algorithms with a robust aggregation (or outlier removal) subroutine. For convex losses, the aggregation subroutines analyzed in the literature include geometric median [26, 17], median and trimmed mean [69], iterative filtering for the high-dimensional setting [63], and martingale-based methods for the SGD setting [3]. For non-convex losses, to the best of our knowledge, existing works only provide first-order convergence guarantees (i.e., small gradients), using aggregation subroutines such as the Krum function [10], median and trimmed mean [69], and mean-around-median and marginal median [67]. In this paper, we make use of subroutines based on median, trimmed mean, and iterative filtering. Our analysis of median and trimmed mean follows Yin et al. [69]. Our results based on the iterative filtering subroutine, on the other hand, are new:


  • The problem that we tackle is harder than what is considered in the original iterative filtering papers [22, 23]. There, the authors only consider robust estimation of a single mean parameter, whereas we guarantee robust gradient estimation over the entire parameter space.

  • Recent work by Su and Xu [63] also makes use of the iterative filtering subroutine for the Byzantine setting. They only study strongly convex loss functions, and they assume that the gradients are sub-exponential together with an additional scaling condition. Our results apply to the non-convex case and do not require the aforementioned scaling condition (the relevant quantity may therefore scale, for example, linearly with the sample size), but we impose the stronger assumption of sub-Gaussian gradients.

Other non-convex optimization algorithms

Besides first-order GD-based algorithms, many other non-convex optimization methods that provably converge to an approximate local minimum have received much attention in recent years. For specific problems such as phase retrieval [11], low-rank estimation [16, 72], and dictionary learning [1, 64], many algorithms are developed by leveraging the particular structure of the problems, and they either use a smart initialization [11, 65] or initialize randomly [18, 14]. Other algorithms are developed for general non-convex optimization, and they can be classified into gradient-based [27, 44, 68, 4, 5, 33], Hessian-vector-product-based [12, 2, 56, 57], and Hessian-based [54, 19] methods. While algorithms using Hessian information can usually achieve better convergence rates, for example those obtained by Curtis et al. [19] and by Carmon et al. [12], gradient-based methods are easier to implement in practice, especially in the distributed setting we are interested in.

Robust statistics

Outlier-robust estimation is a classical topic in statistics [30]. The coordinate-wise median aggregation subroutine that we consider is related to the median-of-means estimator [52, 31], which has been applied to various robust inference problems [51, 47, 50].

A recent line of work develops efficient robust estimation algorithms in high-dimensional settings [8, 22, 39, 13, 60, 45, 9, 36, 46]. In the centralized setting, the recent work [24] proposes a scheme, similar to the iterative filtering procedure, that iteratively removes outliers for gradient-based optimization.

3 Problem Setup

We consider empirical risk minimization for a statistical learning problem where each data point $z$ is sampled from an unknown distribution $\mathcal{D}$ over the sample space $\mathcal{Z}$. Let $f(w; z)$ be the loss function of a parameter vector $w \in \mathcal{W} \subseteq \mathbb{R}^d$, where $\mathcal{W}$ is the parameter space. The population loss function is therefore given by $F(w) := \mathbb{E}_{z \sim \mathcal{D}}[f(w; z)]$.

We consider a distributed computing system with one master machine and $m$ worker machines, $\alpha m$ of which are Byzantine machines and the others normal. Each worker machine has $n$ data points sampled i.i.d. from $\mathcal{D}$. Denote by $z_{i,j}$ the $j$-th data point on the $i$-th worker machine, and let $F_i(w) := \frac{1}{n}\sum_{j=1}^n f(w; z_{i,j})$ be the empirical loss function on the $i$-th machine. The master machine and worker machines can send and receive messages via the following communication protocol: In each parallel iteration, the master machine sends a parameter vector $w$ to all the worker machines, and then each normal worker machine computes the gradient of its empirical loss at $w$ and sends the gradient to the master machine. The Byzantine machines may be jointly controlled by an adversary and send arbitrary or even malicious messages. We denote the unknown set of Byzantine machines by $\mathcal{B}$, where $|\mathcal{B}| = \alpha m$. With this notation, the gradient $g_i(w)$ sent by the $i$-th worker machine is

$g_i(w) = \nabla F_i(w)$ for $i \notin \mathcal{B}$, and $g_i(w) = *$ for $i \in \mathcal{B}$,
(1)

where the symbol $*$ denotes an arbitrary vector. As mentioned, the adversary is assumed to have complete knowledge of the algorithm used and the data stored on all machines, and the Byzantine machines may collude [48] and adapt to the output of the master and normal worker machines. We only make the mild assumption that the adversary cannot predict the random numbers generated by the master machine.
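To make the protocol concrete, here is a minimal Python sketch of one communication round. The helper names (`one_round`, `grad_fn`, `byzantine_msg`, `aggregate`) are our own illustrative choices rather than the paper's; `aggregate` stands for any robust subroutine from Section 5.

```python
import numpy as np

def one_round(w, worker_data, byzantine_ids, grad_fn, byzantine_msg, aggregate):
    """One parallel iteration: the master broadcasts w, workers reply with gradients."""
    messages = []
    for i, data in enumerate(worker_data):
        if i in byzantine_ids:
            # Byzantine machine: an arbitrary vector (the "*" in Eq. (1)),
            # possibly a function of w and of the data on all machines.
            messages.append(byzantine_msg(w, worker_data))
        else:
            # Normal machine i: gradient of its empirical loss F_i at w.
            messages.append(np.mean([grad_fn(w, z) for z in data], axis=0))
    return aggregate(np.vstack(messages))  # robust estimate of the gradient of F
```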

We consider the scenario where $F$ is non-convex, and our goal is to find an approximate local minimizer of $F$. Note that a first-order stationary point (i.e., one with a small gradient) is not necessarily close to a local minimizer, since the point may be a saddle point whose Hessian matrix has a large negative eigenvalue. Accordingly, we seek to find a second-order stationary point, namely, one with a small gradient and a nearly positive semidefinite Hessian:

Definition 1 (Second-order stationarity).

We say that $w$ is an $(\epsilon_g, \epsilon_H)$-second-order stationary point of a twice differentiable function $F$ if $\|\nabla F(w)\|_2 \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(w)) \ge -\epsilon_H$.
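As a quick numerical illustration of Definition 1 (a hypothetical helper, not part of the paper), one can check both conditions directly given gradient and Hessian functions:

```python
import numpy as np

def is_second_order_stationary(grad_fn, hess_fn, w, eps_g, eps_h):
    """Check Definition 1: small gradient and nearly PSD Hessian."""
    grad_ok = np.linalg.norm(grad_fn(w)) <= eps_g
    lam_min = np.linalg.eigvalsh(hess_fn(w)).min()  # smallest eigenvalue
    return grad_ok and lam_min >= -eps_h

# Example: F(w) = w1^2 - w2^2 has a saddle at the origin: the gradient
# vanishes there, but the Hessian has eigenvalue -2.
grad = lambda w: np.array([2 * w[0], -2 * w[1]])
hess = lambda w: np.array([[2.0, 0.0], [0.0, -2.0]])
print(is_second_order_stationary(grad, hess, np.zeros(2), 0.1, 0.1))  # False
```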

In the sequel, we make use of several standard concepts from continuous optimization.

Definition 2 (Smooth and Hessian-Lipschitz functions).

A function $F$ is called $L$-smooth if $\|\nabla F(w) - \nabla F(w')\|_2 \le L\|w - w'\|_2$ for all $w, w'$, and $\rho$-Hessian Lipschitz if $\|\nabla^2 F(w) - \nabla^2 F(w')\|_2 \le \rho\|w - w'\|_2$ for all $w, w'$.

Throughout this paper, the above properties are imposed on the population loss function $F$.

Assumption 1.

$F$ is $L$-smooth and $\rho$-Hessian Lipschitz on $\mathcal{W}$.

4 Byzantine Perturbed Gradient Descent

In this section, we describe our algorithm, Byzantine Perturbed Gradient Descent (ByzantinePGD), which provably finds a second-order stationary point of the population loss $F$ in the distributed setting with Byzantine machines. As mentioned, ByzantinePGD robustly aggregates gradients from the worker machines, and performs multiple rounds of carefully calibrated perturbation to combat the effect of Byzantine machines. We now elaborate.

It is well-known that naively aggregating the workers’ messages using standard averaging can be arbitrarily skewed in the presence of just a single Byzantine machine. In view of this, we introduce a robust aggregation subroutine, which aggregates the gradients collected from the workers. We stipulate that this subroutine provides an estimate of the true population gradient $\nabla F(w)$ with accuracy $\Delta$, uniformly across the parameter space. This property is formalized using the terminology of an inexact gradient oracle.

Definition 3 (Inexact gradient oracle).

We say that a subroutine provides a $\Delta$-inexact gradient oracle for the population loss $F$ if, for every query point $w$, it returns a vector $\widehat{g}(w)$ satisfying $\|\widehat{g}(w) - \nabla F(w)\|_2 \le \Delta$.

Without loss of generality, we assume that $\Delta > 0$ throughout the paper. In this section, we treat the aggregation subroutine as a given black box; in Section 5, we discuss several robust aggregation algorithms and characterize their inexactness $\Delta$. We emphasize that in the Byzantine setting, the output of the subroutine can take values adversarially within the error bounds; that is, it may output an arbitrary vector in the ball $\mathbb{B}(\nabla F(w), \Delta)$, and this vector can depend on the data on all the machines and all previous iterations of the algorithm.
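To illustrate how adversarial such an oracle can be, the following hypothetical sketch (our own illustration, not the paper's code) returns a worst-case vector inside the ball $\mathbb{B}(\nabla F(w), \Delta)$. In particular, it can report a zero gradient whenever the true gradient is small, which is exactly the fake-stationarity behavior discussed above.

```python
import numpy as np

def adversarial_oracle(true_grad, delta, attack_dir=None):
    """Return a worst-case Delta-inexact gradient: any vector within
    distance delta of the true gradient is a legal answer."""
    g = np.asarray(true_grad, dtype=float)
    if np.linalg.norm(g) <= delta:
        return np.zeros_like(g)      # fake first-order stationarity
    if attack_dir is not None:
        d = attack_dir / np.linalg.norm(attack_dir)
        return g + delta * d         # bias the estimate by delta in a chosen direction
    return g
```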

The use of robust aggregation with bounded inexactness, however, is not yet sufficient to guarantee convergence to an approximate local minimizer. As mentioned, the Byzantine machines may create fake local minima that trap a vanilla gradient descent iteration. Our ByzantinePGD algorithm is designed to escape such fake minima as well as any existing saddle points of $F$.

4.1 Algorithm

We now describe the details of our algorithm, given in the left panel of Algorithm 1. We focus on unconstrained optimization, i.e., $\mathcal{W} = \mathbb{R}^d$. In Section 5, we show that the iterates of the algorithm actually stay in a bounded ball centered at the initial iterate $w_0$, and we discuss the statistical error rates within this bounded set.

In each parallel iteration, the master machine sends the current iterate $w_t$ to all the worker machines, and the worker machines send back their gradients $g_i(w_t)$. The master machine aggregates the workers’ gradients using the robust subroutine and computes an estimate $\widehat{g}(w_t)$ of the population gradient $\nabla F(w_t)$. The master machine then performs a gradient descent step using $\widehat{g}(w_t)$. This procedure is repeated until it reaches a point $\widetilde{w}$ with $\|\widehat{g}(\widetilde{w})\|_2 \le \epsilon_0$ for a pre-specified threshold $\epsilon_0$.

At this point, $\widetilde{w}$ may lie near a saddle point whose Hessian has a large negative eigenvalue. To escape this potential saddle point, the algorithm invokes an escape routine (right panel of Algorithm 1), which performs $Q$ rounds of perturbation-and-descent operations. In each round, the master machine perturbs $\widetilde{w}$ randomly and independently within the ball $\mathbb{B}(\widetilde{w}, r)$. Let $w_0'$ be the perturbed vector. Starting from $w_0'$, the algorithm conducts at most $T_{\mathrm{th}}$ parallel iterations of $\Delta$-inexact gradient descent (using the robust aggregation as before):

$w_{t+1}' = w_t' - \eta\, \widehat{g}(w_t').$
(2)

During this process, once we observe that $\|w_t' - \widetilde{w}\|_2 \ge R$ for some pre-specified threshold $R$ (this means the iterate moves by a sufficiently large distance in the parameter space), we claim that $\widetilde{w}$ is a saddle point and the algorithm has escaped it; we then resume $\Delta$-inexact gradient descent starting from $w_t'$. If after $Q$ rounds no sufficiently large move in the parameter space is ever observed, we claim that $\widetilde{w}$ is a second-order stationary point of $F$ and output it.

while true do
    Master: send $w$ to worker machines.
    for all $i \in [m]$ do in parallel
        Worker $i$: compute $g_i(w)$; send to master machine.
    end for
    Master: $\widehat{g}(w) \leftarrow$ robust aggregate of $\{g_i(w)\}$.
    if $\|\widehat{g}(w)\|_2 \le \epsilon_0$ then
        Master: (escaped, $w'$) $\leftarrow$ Escape($w$, $r$, $R$, $Q$, $T_{\mathrm{th}}$, $\eta$).
        if not escaped then
            return $w$.
        end if
        Master: $w \leftarrow w'$.
    else
        Master: $w \leftarrow w - \eta\, \widehat{g}(w)$.
    end if
end while

Escape($\widetilde{w}$, $r$, $R$, $Q$, $T_{\mathrm{th}}$, $\eta$):
for $q = 1, \ldots, Q$ do
    Master: sample $p_q$ uniformly from $\mathbb{B}(0, r)$; $w \leftarrow \widetilde{w} + p_q$.
    for $t = 1, \ldots, T_{\mathrm{th}}$ do
        Master: send $w$ to worker machines.
        for all $i \in [m]$ do in parallel
            Worker $i$: compute $g_i(w)$; send to master machine.
        end for
        Master: $\widehat{g}(w) \leftarrow$ robust aggregate; $w \leftarrow w - \eta\, \widehat{g}(w)$.
        if $\|w - \widetilde{w}\|_2 \ge R$ then
            return (true, $w$).
        end if
    end for
end for
return (false, $\widetilde{w}$).

Algorithm 1: Byzantine Perturbed Gradient Descent (ByzantinePGD)
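The following Python sketch mirrors the structure of Algorithm 1 under the parameter names used above ($\eta$, $\epsilon_0$, $r$, $R$, $Q$, $T_{\mathrm{th}}$). It is an illustrative rendering, not the authors' implementation; `inexact_grad` stands for the robust aggregate of one communication round.

```python
import numpy as np

def byzantine_pgd(inexact_grad, w0, eta, eps0, r, R, Q, T_th, T_max):
    """Sketch of Algorithm 1: inexact GD plus multi-round escape."""
    w = np.asarray(w0, dtype=float)
    for _ in range(T_max):
        g = inexact_grad(w)
        if np.linalg.norm(g) > eps0:
            w = w - eta * g                       # inexact gradient step
            continue
        escaped, w_new = escape(inexact_grad, w, eta, r, R, Q, T_th)
        if not escaped:
            return w                              # declared second-order stationary
        w = w_new                                 # escaped a saddle; resume descent
    return w

def escape(inexact_grad, w_tilde, eta, r, R, Q, T_th):
    """Q rounds of perturb-then-descend; success = moving distance >= R."""
    d = w_tilde.shape[0]
    for _ in range(Q):
        u = np.random.randn(d)
        # For simplicity we perturb on the sphere of radius r; the algorithm
        # samples within the ball B(w_tilde, r).
        w = w_tilde + r * u / np.linalg.norm(u)
        for _ in range(T_th):
            w = w - eta * inexact_grad(w)
            if np.linalg.norm(w - w_tilde) >= R:  # sufficient move: escaped
                return True, w
    return False, w_tilde
```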

4.2 Convergence Guarantees

In this section, we provide the theoretical result guaranteeing that Algorithm 1 converges to a second-order stationary point. In Theorem 3, we let $w_0$ be the initial iterate, and define $F^* := \inf_{w} F(w)$.

Theorem 3 (ByzantinePGD).

Suppose that Assumption 1 holds, and assume that the aggregation subroutine provides a $\Delta$-inexact gradient oracle for $F$ with $\Delta > 0$. Given any failure probability $\delta \in (0, 1)$, choose the parameters for Algorithm 1 as follows: step size $\eta = 1/L$, with the gradient threshold $\epsilon_0$, perturbation radius $r$, escape distance $R$, round number $Q$, and iteration caps set to appropriate polynomials in $\Delta$, $L$, $\rho$, and $\log(1/\delta)$.

Then, with probability at least $1 - \delta$, the output of Algorithm 1, denoted by $\widehat{w}$, satisfies the bounds

$\|\nabla F(\widehat{w})\|_2 \le \widetilde{O}(\Delta) \quad \text{and} \quad \lambda_{\min}(\nabla^2 F(\widehat{w})) \ge -\widetilde{O}(\Delta^{2/5}),$
(3)

and the algorithm terminates within $\widetilde{O}\big((F(w_0) - F^*)/\Delta^2\big)$ parallel iterations.

We prove Theorem 3 in Appendix B. (We make no attempt to optimize the multiplicative constants in Theorem 3.) Below, let us parse the theorem and discuss its implications.

Focusing on the scaling with $\Delta$, we may read off from Theorem 3 the following result:

Observation 1.

Under the above setting, within $\widetilde{O}(1/\Delta^2)$ parallel iterations, ByzantinePGD outputs an $(\widetilde{O}(\Delta), \widetilde{O}(\Delta^{2/5}))$-second-order stationary point of $F$; that is, $\|\nabla F(\widehat{w})\|_2 \le \widetilde{O}(\Delta)$ and $\lambda_{\min}(\nabla^2 F(\widehat{w})) \ge -\widetilde{O}(\Delta^{2/5})$. (Here, by using the symbol $\widetilde{O}$, we ignore logarithmic factors and only consider the dependence on $\Delta$.)

In terms of the iteration complexity, it is well-known that for a smooth non-convex $F$, gradient descent requires at least $\Omega(1/\epsilon^2)$ iterations to achieve $\|\nabla F(\widehat{w})\|_2 \le \epsilon$ [53]; up to logarithmic factors, our result matches this complexity bound. In addition, our first-order guarantee is clearly order-wise optimal, as the gradient oracle is $\Delta$-inexact. It is currently unclear to us whether our second-order guarantee is optimal. We provide a converse result showing that one cannot hope to achieve a second-order guarantee better than $\Omega(\sqrt{\Delta})$.

Proposition 1.

There exists a class of real-valued $L$-smooth and $\rho$-Hessian Lipschitz differentiable functions such that, for any algorithm that only uses a $\Delta$-inexact gradient oracle, there exists a function $F$ in the class on which the output $\widehat{w}$ of the algorithm must satisfy $\|\nabla F(\widehat{w})\|_2 = \Omega(\Delta)$ and $\lambda_{\min}(\nabla^2 F(\widehat{w})) \le -\Omega(\sqrt{\Delta})$.

We prove Proposition 1 in Appendix D. Again, we emphasize that our results above are in fact not restricted to the Byzantine distributed learning setting. They apply to any non-convex optimization problem (distributed or not) with inexact gradient information, including problems with noisy but non-adversarial gradients; see Section 2 for comparison with related work in such settings.

As a byproduct, we can show that with a different choice of parameters, ByzantinePGD can be used in the standard (non-distributed) setting with access to the exact gradient $\nabla F(w)$, and the algorithm converges to an $\epsilon$-second-order stationary point within $\widetilde{O}(1/\epsilon^2)$ iterations:

Theorem 4 (Exact gradient oracle).

Suppose that Assumption 1 holds, and assume that for any query point $w$ we can obtain the exact gradient, i.e., $\widehat{g}(w) = \nabla F(w)$. For any $\epsilon > 0$ and $\delta \in (0, 1)$, we choose the parameters in Algorithm 1 as follows: step size $\eta = 1/L$, with the remaining thresholds, radii, and round numbers set to appropriate polynomials in $\epsilon$, $L$, $\rho$, and $\log(1/\delta)$. Then, with probability at least $1 - \delta$, Algorithm 1 outputs a point $\widehat{w}$ satisfying the bounds

$\|\nabla F(\widehat{w})\|_2 \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 F(\widehat{w})) \ge -\sqrt{\rho\epsilon},$

and the algorithm terminates within $\widetilde{O}\big((F(w_0) - F^*)/\epsilon^2\big)$ iterations.

We prove Theorem 4 in Appendix C. The convergence guarantee above matches that of the original PGD algorithm [32] up to logarithmic factors. Moreover, our proof is considerably simpler, and our algorithm only requires gradient information, whereas the original PGD algorithm also needs function values.

5 Robust Estimation of Gradients

The results in the previous section can be applied as long as one has a robust aggregation subroutine that provides a $\Delta$-inexact gradient oracle for the population loss $F$. In this section, we discuss three concrete examples of such a subroutine: median, trimmed mean, and a high-dimensional robust estimator based on the iterative filtering algorithm [22, 23, 60]. We characterize their inexactness $\Delta$ under the statistical setting in Section 3, where the data points are sampled independently according to an unknown distribution $\mathcal{D}$.

To describe our statistical results, we need the standard notions of sub-Gaussian/exponential random vectors.

Definition 4 (sub-Gaussianity and sub-exponentiality).

A random vector $x \in \mathbb{R}^d$ with mean $\mu$ is said to be $\sigma$-sub-Gaussian if $\mathbb{E}\big[e^{\lambda\langle v,\, x - \mu\rangle}\big] \le e^{\sigma^2\lambda^2/2}$ for every unit vector $v$ and every $\lambda \in \mathbb{R}$. It is said to be $\sigma$-sub-exponential if the same bound holds for every unit vector $v$ and every $|\lambda| \le 1/\sigma$.

We also need the following result (proved in Appendix E), which shows that the iterates of ByzantinePGD in fact stay in a bounded set around the initial iterate $w_0$.

Proposition 2.

Under the choice of algorithm parameters in Theorem 3, all the iterates in ByzantinePGD stay in the ball $\mathbb{B}(w_0, R_0)$ with $R_0 = c_0\,(F(w_0) - F^*)/\Delta$, where $c_0$ is a number that only depends on $L$ and $\rho$.

Consequently, for the convergence guarantees of ByzantinePGD to hold, we only need the aggregation subroutine to satisfy the inexact oracle property (Definition 3) within the bounded set $\mathbb{B}(w_0, R_0)$, with $R_0$ given in Proposition 2. As shown below, the three aggregation procedures indeed satisfy this property, with their inexactness depending only mildly (logarithmically) on the radius $R_0$.

5.1 Iterative Filtering Algorithm

We start with a recently developed high-dimensional robust estimation technique called the iterative filtering algorithm [22, 23, 60] and use it to build the aggregation subroutine. As can be seen below, iterative filtering can tolerate a constant fraction of Byzantine machines even when the dimension $d$ grows, an advantage over simpler algorithms such as median and trimmed mean.

We relegate the details of the iterative filtering algorithm to Appendix F.1. Again, we emphasize that the original iterative filtering algorithm was proposed to robustly estimate a single parameter vector, whereas in our setting, since the Byzantine machines may produce unspecified probabilistic dependency across the iterations, we need to prove an error bound for robust gradient estimation uniformly across the parameter space. We prove such a bound for iterative filtering under the following two assumptions on the gradients and on the smoothness of each loss function $f(\cdot\,; z)$.

Assumption 2.

For each $w \in \mathbb{B}(w_0, R_0)$, the gradient $\nabla f(w; z)$ with $z \sim \mathcal{D}$ is $\sigma$-sub-Gaussian.

Assumption 3.

For each $z \in \mathcal{Z}$, the loss function $f(\cdot\,; z)$ is $L_f$-smooth.

Let $\Sigma(w)$ be the covariance matrix of $\nabla f(w; z)$, and define $\widetilde{\sigma}^2 := \sup_{w \in \mathbb{B}(w_0, R_0)} \|\Sigma(w)\|_2$. We have the following bounds on the inexactness parameter $\Delta$ of iterative filtering.

Theorem 5 (Iterative Filtering).

Suppose that Assumptions 2 and 3 hold. Use the iterative filtering algorithm described in Appendix F.1 for the aggregation subroutine, and assume that the Byzantine fraction $\alpha$ is bounded above by a sufficiently small absolute constant. Then, with high probability, the subroutine provides a $\Delta$-inexact gradient oracle with

$\Delta = \widetilde{O}\Big(\widetilde{\sigma}\sqrt{\tfrac{\alpha}{n}} + \sigma\sqrt{\tfrac{d}{nm}}\Big),$

where $\widetilde{O}(\cdot)$ hides an absolute constant and logarithmic factors (including a logarithmic dependence on the radius $R_0$).

The proof of Theorem 5 is given in Appendix F.2. Assuming bounded $\sigma$ and $\widetilde{\sigma}$, we see that iterative filtering provides a $\widetilde{O}\big(\sqrt{\alpha/n} + \sqrt{d/(nm)}\big)$-inexact gradient oracle.
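As a rough illustration of the spectral idea behind iterative filtering, here is a simplified hard-removal variant in Python. This is our own sketch, not the exact procedure of [22, 23] analyzed in Appendix F.1 (which uses a more careful, randomized/soft removal rule); `var_bound` is an assumed upper bound on the variance of the gradients from normal machines.

```python
import numpy as np

def filtered_mean(vectors, var_bound, max_remove):
    """Repeatedly remove the point most extreme along the top-variance
    direction until the empirical covariance looks consistent with clean data."""
    pts = np.array(vectors, dtype=float)
    for _ in range(max_remove):
        mu = pts.mean(axis=0)
        centered = pts - mu
        cov = centered.T @ centered / len(pts)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= var_bound:
            break                           # spread along every direction looks clean
        v = eigvecs[:, -1]                  # direction of largest variance
        scores = (centered @ v) ** 2        # outliers must stick out along some direction
        pts = pts[np.argsort(scores)[:-1]]  # drop the most extreme point
    return pts.mean(axis=0)
```

The key intuition, which the full algorithm makes quantitative, is that outliers can only shift the mean significantly if they also inflate the variance along some direction, and that direction is detectable spectrally.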

5.2 Median and Trimmed Mean

The median and trimmed mean operations are two widely used robust estimation methods. While the dependence of their performance on the dimension $d$ is not optimal, they are conceptually simple and computationally fast, and still have good performance in low-dimensional settings. We apply these operations in a coordinate-wise fashion to build the aggregation subroutine.

Formally, for a set of vectors $g_1, \ldots, g_m \in \mathbb{R}^d$, their coordinate-wise median is a vector whose $k$-th coordinate is $\mathrm{med}\{g_1^{(k)}, \ldots, g_m^{(k)}\}$ for each $k \in [d]$, where $\mathrm{med}$ is the usual (one-dimensional) median. The coordinate-wise $\beta$-trimmed mean is a vector whose $k$-th coordinate is the average of $U_k$ for each $k \in [d]$, where $U_k$ is a subset of $\{g_1^{(k)}, \ldots, g_m^{(k)}\}$ obtained by removing the largest and smallest $\beta$ fraction of its elements.
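Both operations are a few lines in Python; the sketch below (with hypothetical helper names) follows the definitions above.

```python
import numpy as np

def coord_median(grads):
    """Coordinate-wise median of the received gradient vectors (rows)."""
    return np.median(grads, axis=0)

def coord_trimmed_mean(grads, beta):
    """Coordinate-wise beta-trimmed mean: in each coordinate, drop the largest
    and smallest beta fraction of the m values, then average the rest."""
    m = grads.shape[0]
    k = int(np.floor(beta * m))
    sorted_g = np.sort(grads, axis=0)  # sort each coordinate independently
    return sorted_g[k : m - k].mean(axis=0)

# Example: 7 workers, 2 of them Byzantine sending huge values.
g = np.vstack([np.ones((5, 3)), 1e6 * np.ones((2, 3))])
print(coord_median(g))                   # [1. 1. 1.]
print(coord_trimmed_mean(g, beta=0.3))   # [1. 1. 1.]
```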

For robust estimation of the gradient in the Byzantine setting, the error bounds of median and trimmed mean have been studied by Yin et al. [69]. For completeness, we record their results below as an informal theorem; details are relegated to Appendix F.3.

Theorem 6 (Informal).

[69] Under appropriate smoothness and probabilistic assumptions (specifically, for median we assume that the gradients have bounded skewness, and for trimmed mean we assume that the gradients are sub-exponentially distributed), with high probability, the median operation and the trimmed mean operation each provide a $\Delta$-inexact gradient oracle, with explicit bounds on $\Delta$ in terms of $\alpha$, $n$, $m$, and $d$ given in Appendix F.3.

5.3 Comparison and Optimality

In Table 3, we compare the above three algorithms in terms of the dependence of their gradient inexactness $\Delta$ on the problem parameters $n$, $m$, $\alpha$, and $d$. We see that when the dimension $d$ is small, the median and trimmed mean algorithms have better inexactness due to a better scaling with $\alpha$. When $d$ is large, iterative filtering becomes preferable.

| Subroutine | Gradient inexactness $\Delta$ |
| --- | --- |
| median | bound given in Theorem 6 |
| trimmed mean | bound given in Theorem 6 |
| iterative filtering | bound given in Theorem 5 |

Table 3: Statistical bounds on gradient inexactness $\Delta$.

Recall that according to Observation 1, with $\Delta$-inexact gradients the ByzantinePGD algorithm converges to an $(\widetilde{O}(\Delta), \widetilde{O}(\Delta^{2/5}))$-second-order stationary point. Combining this general result with the bounds in Table 3, we obtain explicit statistical guarantees on the output $\widehat{w}$ of ByzantinePGD. To understand the statistical optimality of these guarantees, we provide a converse result below.

Observation 2.

There exists a statistical learning problem in the Byzantine setting such that the output $\widehat{w}$ of any algorithm must satisfy $\|\nabla F(\widehat{w})\|_2 = \widetilde{\Omega}\big(\tfrac{\alpha}{\sqrt{n}} + \sqrt{\tfrac{d}{nm}}\big)$ with a constant probability.

We prove Observation 2 in Appendix F.4. In view of this observation, we see that in terms of the first-order guarantee (i.e., on $\|\nabla F(\widehat{w})\|_2$) and up to logarithmic factors, the trimmed mean is optimal in the low-dimensional regime, the median is optimal in an intermediate regime of $\alpha$ and $d$, and iterative filtering is optimal in the high-dimensional regime. The statistical optimality of their second-order guarantees (i.e., on $\lambda_{\min}(\nabla^2 F(\widehat{w}))$) is currently unclear to us, and we believe this is an interesting problem for future investigation.

6 Conclusion

In this paper, we study security issues that arise in large-scale distributed learning because of the presence of saddle points in non-convex loss functions. We observe that in the presence of non-convexity and Byzantine machines, escaping saddle points becomes much more challenging. We develop ByzantinePGD, a computation- and communication-efficient algorithm that is able to provably escape saddle points and converge to a second-order stationary point, even in the presence of Byzantine machines. We also discuss three different choices of the robust gradient aggregation subroutine in ByzantinePGD: median, trimmed mean, and the iterative filtering algorithm. We characterize their performance in statistical settings, and argue for their near-optimality in different regimes including the high-dimensional setting.

Acknowledgements

D. Yin is partially supported by Berkeley DeepDrive Industry Consortium. Y. Chen is partially supported by NSF CRII award 1657420 and grant 1704828. K. Ramchandran is partially supported by NSF CIF award 1703678. P. Bartlett is partially supported by NSF grant IIS-1619362. The authors would like to thank Zeyuan Allen-Zhu for pointing out a potential way to improve our initial results, and Ilias Diakonikolas for discussing references [22, 23, 24].

References

Appendix

Appendix A Challenges of Escaping Saddle Points in the Adversarial Setting

We provide two examples showing that in the non-convex setting with saddle points, an inexact oracle can lead to much worse sub-optimal solutions than in the convex setting, and that in the adversarial setting, escaping saddle points can be inherently harder than in the adversary-free case.

Consider standard gradient descent using exact or $\Delta$-inexact gradients. Our first example shows that Byzantine machines have a more severe impact in the non-convex case than in the convex case.

Example 1.

Let $w \in \mathbb{R}$ and consider two one-dimensional functions $F_1$ and $F_2$. Here $F_1$ is strongly convex with a unique local minimizer, whereas $F_2$ has two local (in fact, global) minimizers and a saddle point (in fact, a local maximum). Proposition 3 below shows the following: for the convex $F_1$, gradient descent (GD) finds a near-optimal solution with sub-optimality controlled by $\Delta$, regardless of initialization; for the non-convex $F_2$, GD initialized near the saddle point suffers from a constant sub-optimality gap.

Proposition 3.

Suppose that the gradient oracle is $\Delta$-inexact. Under the setting above, the following holds.
(i) For the convex $F_1$, starting from any initial point, GD using a $\Delta$-inexact gradient oracle finds a solution whose sub-optimality gap vanishes with $\Delta$.
(ii) For the non-convex $F_2$, there exists an adversarial strategy such that, starting from an initial point sampled uniformly from a neighborhood of the saddle point, GD with a $\Delta$-inexact gradient oracle outputs a point with a constant sub-optimality gap, with constant probability.

Proof.

Near the saddle point of $F_2$, the magnitude of the true gradient is at most $\Delta$. In this region the adversarial oracle can always output the zero vector, which is a valid $\Delta$-inexact answer, so the iterate can no longer move under this adversarial strategy. An iterate trapped near the local maximum is bounded away from both global minimizers, and therefore suffers a constant sub-optimality gap. The result for the convex function $F_1$ is a direct corollary of Theorem 1 in [69]. ∎

Our second example shows that escaping saddle points is much harder in the Byzantine setting than in the non-Byzantine setting.

Example 2.

Let $w \in \mathbb{R}^2$, and assume that in the neighborhood of the origin, $F$ takes a quadratic form that is strongly convex in the first coordinate and has negative curvature in the second coordinate (this quadratic form holds locally around the origin, not globally; otherwise $F$ has no minimum). The origin is then not an approximate second-order stationary point, but rather a saddle point. Proposition 4 below shows that exact GD escapes the saddle point almost surely, while GD with an inexact oracle fails to do so.

Proposition 4.

Under the setting above, if one chooses an appropriate perturbation radius and samples the initial iterate uniformly at random from the corresponding ball around the origin, then:
(i) Using exact gradient descent, with probability $1$, the iterate eventually leaves the neighborhood of the origin.
(ii) There exists an adversarial strategy such that, when we update using a $\Delta$-inexact gradient oracle, the iterate cannot leave the neighborhood with a probability that is at least a positive constant, and that becomes $1$ when $\Delta$ is large relative to the negative curvature and the perturbation radius.

Proof.

Within the neighborhood of the origin, the gradient is linear: its first coordinate pulls the iterate toward the axis where the first coordinate vanishes, while its second coordinate pushes the iterate away from the origin at a geometric rate determined by the negative curvature. Sample the initial point uniformly at random from the perturbation ball; with probability $1$, its second coordinate is non-zero. Then, by running exact gradient descent, the second coordinate of the iterate grows geometrically, so as the number of iterations gets large, the iterate eventually leaves the neighborhood.

On the other hand, suppose that we run $\Delta$-inexact gradient descent. In the first step, if the second coordinate of the true gradient has magnitude at most $\Delta$, the adversary can simply replace it with zero (one can check that the resulting vector is a valid $\Delta$-inexact answer), and then the second coordinate of the iterate does not change. In the following iterations, the adversary can keep using the same strategy, so the second coordinate never changes, and the iterates cannot escape, since $F$ is strongly convex in its first coordinate. To compute the probability of getting stuck at the saddle point, we only need to compute the area of the stuck region, i.e., the set of initial points whose second coordinate is small enough for this strategy to apply, which can be done via simple geometry. ∎

Remark.

Even if we choose the largest possible perturbation, i.e., sample the initial point from the boundary circle of the perturbation ball, the stuck region still exists. We can compute the length of the arc intersecting the stuck region and thereby find the probability of being stuck. One finds that when $\Delta$ is large relative to the negative curvature and the radius, the probability of being stuck is still $1$; otherwise, the probability of being stuck is a positive constant.
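The following self-contained simulation illustrates the stuck region numerically. The quadratic below is our own illustrative choice of a saddle of the kind assumed in Example 2, not necessarily the exact function used in the paper.

```python
import numpy as np

# Toy saddle, assumed for illustration: F(w) = 0.5*w1^2 - 0.5*gamma*w2^2,
# with a saddle point at the origin and escape direction along w2.
gamma, delta, eta = 0.2, 0.05, 0.5

def true_grad(w):
    return np.array([w[0], -gamma * w[1]])

def attacked_grad(w):
    g = true_grad(w)
    # Whenever the escape component is within the Delta error budget,
    # the adversary zeroes it out: a valid Delta-inexact answer.
    if abs(g[1]) <= delta:
        g = np.array([g[0], 0.0])
    return g

w = np.array([0.3, 0.1])   # |gamma * w2| = 0.02 <= delta: inside the stuck region
for _ in range(1000):
    w = w - eta * attacked_grad(w)
print(w)  # w2 stays at 0.1 forever; GD is stuck near the saddle
```

With exact gradients the second coordinate would grow by a factor $(1 + \eta\gamma)$ per step and the iterate would escape; under the attack it never moves in that direction.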

The above examples show that the adversary can significantly alter the landscape of the function near a saddle point. We counter this by exerting a large perturbation on the iterate so that it escapes this bad region. The amount of perturbation is carefully calibrated to ensure that the algorithm finds a descent direction “steep” enough to be preserved under $\Delta$-corruption, while not compromising the accuracy. Multiple rounds of perturbation are performed, boosting the escape probability exponentially.

Appendix B Proof of Theorem 3

We first analyze the gradient descent step with a $\Delta$-inexact gradient oracle.

Lemma 1.

Suppose that $F$ is $L$-smooth. For any iterate $w_t$, suppose we run the following inexact gradient descent step:

$w_{t+1} = w_t - \eta\, g(w_t), \quad \text{where } \|g(w_t) - \nabla F(w_t)\|_2 \le \Delta,$
(4)

with step size $\eta = 1/L$. Then, we have

$F(w_{t+1}) \le F(w_t) - \frac{1}{2L}\|\nabla F(w_t)\|_2^2 + \frac{\Delta^2}{2L}.$

Proof.

Since $F$ is $L$-smooth, we know that $F(w_{t+1}) \le F(w_t) + \langle\nabla F(w_t),\, w_{t+1} - w_t\rangle + \frac{L}{2}\|w_{t+1} - w_t\|_2^2$. Substituting $w_{t+1} - w_t = -\frac{1}{L}\,g(w_t)$ and writing $g(w_t) = \nabla F(w_t) + e_t$ with $\|e_t\|_2 \le \Delta$ yields the claimed bound; see the expanded computation after equation (5) below. ∎

Let $\epsilon_0$ be the threshold on $\|g(w_t)\|_2$ that the algorithm uses to determine whether or not to add perturbation, and choose $\epsilon_0 \ge 2\Delta$. Suppose that at a particular iterate $w_t$, we observe $\|g(w_t)\|_2 > \epsilon_0$. Then, we know that $\|\nabla F(w_t)\|_2 > \epsilon_0 - \Delta$. According to Lemma 1, by running one iteration of the inexact gradient descent step, the decrease in function value is at least

$F(w_t) - F(w_{t+1}) \ge \frac{(\epsilon_0 - \Delta)^2 - \Delta^2}{2L}.$
(5)
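For completeness, the expansion used in the proof of Lemma 1 is the standard one. By $L$-smoothness and the update $w_{t+1} = w_t - \frac{1}{L}\,g(w_t)$,

$F(w_{t+1}) \le F(w_t) - \frac{1}{L}\langle\nabla F(w_t),\, g(w_t)\rangle + \frac{1}{2L}\|g(w_t)\|_2^2.$

Writing $g(w_t) = \nabla F(w_t) + e_t$, the inner-product term contributes $-\frac{1}{L}\|\nabla F(w_t)\|_2^2 - \frac{1}{L}\langle\nabla F(w_t), e_t\rangle$, while the squared-norm term contributes $\frac{1}{2L}\|\nabla F(w_t)\|_2^2 + \frac{1}{L}\langle\nabla F(w_t), e_t\rangle + \frac{1}{2L}\|e_t\|_2^2$. The cross terms cancel exactly, leaving

$F(w_{t+1}) \le F(w_t) - \frac{1}{2L}\|\nabla F(w_t)\|_2^2 + \frac{1}{2L}\|e_t\|_2^2 \le F(w_t) - \frac{1}{2L}\|\nabla F(w_t)\|_2^2 + \frac{\Delta^2}{2L}.$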

We proceed to analyze the perturbation step, which happens when the algorithm arrives at an iterate $\widetilde{w}$ with $\|g(\widetilde{w})\|_2 \le \epsilon_0$. In this proof, we slightly abuse the notation. Recall that in equation (2) in Section 4.1, we use $w_t'$ to denote the iterates of the algorithm in the saddle point escaping process. Here, we simply use $w_t$ to denote these iterates. We start with the definition of the stuck region at $\widetilde{w}$.

Definition 5.

Given $\widetilde{w}$ and parameters $\eta$, $T$, and $R$, the stuck region at $\widetilde{w}$ is the set of points $w_0$ satisfying the following property: there exists an adversarial strategy such that when we start with $w_0$ and run $T$ gradient descent steps with a $\Delta$-inexact gradient oracle,

$w_{t+1} = w_t - \eta\, g(w_t),$
(6)

we observe $\|w_t - \widetilde{w}\|_2 < R$ for all $t \in [T]$.

When it is clear from the context, we may simply use the terminology stuck region. The following lemma shows that if $\nabla^2 F(\widetilde{w})$ has a large negative eigenvalue, then the stuck region has a small width along the direction of the eigenvector associated with this negative eigenvalue.

Lemma 2.

Assume that the smallest eigenvalue of $\nabla^2 F(\widetilde{w})$ satisfies $\lambda_{\min}(\nabla^2 F(\widetilde{w})) \le -\gamma$ for some $\gamma > 0$, and let the unit vector $e$ be the eigenvector associated with this eigenvalue. Let $u_1, u_2$ be two points such that $u_1 - u_2 = \mu e$ with some scalar $\mu$. Choose step size $\eta = 1/L$, and consider the stuck region