Practical Distributed Learning: Secure Machine Learning with Communication-Efficient Local Updates

03/16/2019 ∙ by Cong Xie, et al.

Federated learning on edge devices poses new challenges, including misbehaving workers and privacy requirements. We propose a new robust federated optimization algorithm, with provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.


1 Introduction

Edge/IoT devices such as smartphones, wearable devices, sensors, and smart homes are playing an increasingly important role in our daily life. Such devices generate massive, diverse, and private data capturing human behavior. In response, there is a trend toward moving computation, including the training of machine-learning models, from the cloud/datacenters to edge devices (Anguita et al., 2013; Pantelopoulos & Bourbakis, 2010). This motivates federated learning (Konevcnỳ et al., 2016; McMahan et al., 2016) – which involves training machine-learning models locally on edge devices – ideally so that the resulting models exhibit improved representation and generalization.

Federated learning is also motivated by privacy protection. Privacy needs and legal requirements (e.g., US HIPAA laws (HealthInsurance.org, 1996) in a smart hospital, or Europe’s GDPR law (EU, 2018)) may necessitate that training be performed on-premises using IoT devices and edge machines, and that data and models must not be deposited in the cloud or cloudlets. In such scenarios, federated optimization is designed to learn a global model on the data locally stored on the devices, all without collecting the private local data at a central location.

Typically, federated learning is executed in a server-worker architecture. The workers are placed on edge devices and train the models locally on the private data. The servers are placed on the cloud/datacenters; they aggregate the learned models from the workers and produce a global model. In this architecture, the workers reveal only model parameters to the servers, and are otherwise anonymous. Likewise, the servers do not require any meta-data from the workers, or transmission via a trusted third party.

The outlined system scenario is characterized by mutual distrust between the servers and workers. 1) The users do not trust the cloud/datacenters, hence they would like to prevent the workers from revealing private data to the servers. 2) The cloud/datacenters also do not trust the users, since worker anonymity means there is no guarantee of correct behavior at the workers. In the worst case, users with abnormal behavior can poison the training data on the workers, which results in bad models being pushed to the servers.

We propose robust federated optimization, a novel federated learning algorithm that protects the global model from data poisoning. We summarize the key properties of robust federated learning below:

  • Limited computation.

    Edge devices, including smartphones, wearable devices, sensors, or vehicles, typically have weaker computational ability compared to the workstations or datacenters used in typical distributed machine learning. Thus, simpler models and stochastic training are usually applied in practice. Furthermore, to save battery power, in each epoch only a subset of workers may be selected to train the model.

  • Limited communication. The connection to the central servers is not guaranteed. Communication can be frequently unavailable, slow, or expensive (in money or in battery power). Thus, communication is much less frequent compared to typical distributed optimization.

  • Decentralized, non-IID training data. To protect the users’ privacy, the training data on local devices are not uploaded to a central server. As a result, the data distributions on different devices are not mixed and shuffled into IID samples as in standard settings; instead, they are non-identically distributed samples from the population. This is particularly true when each device is controlled by a specific user whose behavior is supposed to be unique. Furthermore, the data sampled on nearby devices are potentially non-independent, since such devices can be shared by the same user or family. For example, the step-counter data from a wearable fitness tracker and a smartphone owned by the same user can have different distributions with mutual dependency. Imagine that the fitness tracker is only used when the user is running, and the smartphone is only used when the user is walking: the two usage patterns yield different distributions, while their complementarity induces dependency.

  • Unbalanced. As a side effect of the decentralized, non-IID setting, different devices have local datasets of different sizes. The usage of some devices can be much heavier than that of others, which results in unbalanced amounts of data per device. For example, users who work out every day have much more data on their wearable fitness bands than average users.

  • Untrusted workers and data poisoning. A robust system must tolerate adversarial behavior from some users, e.g., a small fraction of users nefariously manipulating their local training data. As a result, some workers may push models learned on poisoned data to the servers.

Although empirical research has been conducted on federated optimization by several authors (Konevcnỳ et al., 2016; McMahan et al., 2016), to our knowledge there is limited work with theoretical guarantees. The limited existing work with convergence guarantees is based on the strong assumption of IID training data (Yu et al., 2018; Stich, 2018), which we have argued is inappropriate for the federated learning setting. In this paper, we focus on federated optimization with provable convergence guarantees and tolerance to data poisoning, under non-IID settings. In summary, the main contributions are as follows:

  • We propose a new federated optimization algorithm with provable convergence under non-IID settings. We show that federated learning remains effective with infrequent communication between local iterations.

  • We propose a robust variant of our federated optimization algorithm. The algorithm tolerates a small number of workers training on poisoned data. As far as we know, this paper is the first to investigate the robustness of federated optimization.

  • We show empirically that the proposed algorithm stabilizes the convergence, and protects the global model from data poisoning.

2 Related Work

There is a growing literature on practical applications of edge and fog computing (Garcia Lopez et al., 2015; Hong et al., 2013) in various scenarios such as smart homes or sensor networks. More and more big-data applications are moving from the cloud to the edge, including machine-learning tasks (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Although their computational power is growing, edge devices are still much weaker than the workstations and datacenters used in typical distributed machine learning, e.g., due to limited computation and communication capacity and limited battery power. To this end, there are machine-learning architectures such as MobileNet (Howard et al., 2017) which are designed for learning on weak devices.

The limited computational power of edge devices also motivates local training of machine-learning models. Existing federated optimization methods (Konevcnỳ et al., 2015, 2016; McMahan et al., 2016) unfortunately lack provable convergence guarantees. The theoretical analysis of related local optimization methods has so far focused on the IID setting (Yu et al., 2018; Stich, 2018). Furthermore, the issue of data poisoning has not been addressed in previous work. To the best of our knowledge, our proposed work is the first federated optimization framework that considers both convergence and robustness, theoretically and practically.

Typically, federated optimization uses the server-worker architecture, which is similar to the Parameter Server (PS) architecture. Stochastic Gradient Descent (SGD) with the PS architecture is widely used in typical distributed machine learning (Li et al., 2014a; Ho et al., 2013; Li et al., 2014b). Compared to the PS architecture, federated learning has much less synchronization. Furthermore, in federated learning, the workers push trained models instead of gradients to the servers.

Approaches based on robust statistics are often used to address security issues in the PS architecture (Yin et al., 2018; Xie et al., 2018). This enables procedures which tolerate multiple types of attacks and system failures. However, the existing methods and theoretical analysis do not consider local training on non-IID data. On the other hand, recent work has considered attacks targeting federated learning (Bagdasaryan et al., 2018; Fung et al., 2018; Bhagoji et al., 2018), but does not propose defense techniques with provable convergence.

3 Problem Formulation

Consider federated learning with n devices. On each device, there is a worker process that trains the model on local data. The overall goal is to train a global model using data from all the devices.

To do so, we consider the following optimization problem:

min_x F(x),  where F(x) = (1/n) Σ_{i=1}^{n} F_i(x),

and F_i(x) = E_{z_i ∼ D_i} [f(x; z_i)], for ∀i ∈ [n], where z_i is sampled from the local data D_i on the i-th device.

Notation/Term   Description
n               Number of devices
k               Number of simultaneously updating devices
T               Number of communication epochs
[n]             Set of integers {1, ..., n}
S_t             Randomly selected devices in the t-th epoch
b               Parameter of trimmed mean
H_min           Minimal number of local iterations
H_{t,i}         Number of local iterations in the t-th epoch on the i-th device
x_t             Initial model in the t-th epoch
x_{t,h,i}       Model updated in the t-th epoch, h-th local iteration, on the i-th device
D_i             Dataset on the i-th device
z_{t,h,i}       Data (minibatch) sampled in the t-th epoch, h-th local iteration, on the i-th device
γ               Learning rate
α               Weight of moving average
‖·‖             All the norms in this paper are ℓ2-norms
Device          Where the training data are placed
Worker          One worker on each device; the process that trains the model
User            Agent that produces data on the devices, and/or controls the devices
Abnormal user   Special user that produces poisoned data
Table 1: Notations and Terminologies

3.1 Non-IID Local Datasets

Note that different devices have different local datasets, i.e., D_i ≠ D_j for i ≠ j. Thus, samples drawn from different devices have different expectations, i.e., in general E_{z_i ∼ D_i}[f(x; z_i)] ≠ E_{z_j ∼ D_j}[f(x; z_j)]. Further, since different devices can be owned by the same user or the same group of users (e.g., families), samples drawn from different devices can potentially be dependent on each other.

3.2 Data Poisoning

In federated optimization, the users do not trust the servers. The refusal to transfer training data to the servers results in the non-IID setting introduced in the previous sections. On the other hand, the servers also do not trust the edge devices. A small number of devices may hold data poisoned by abnormal user behavior or, in the worst case, may be controlled by users or agents who intend to upload harmful models directly to the servers.

In this paper, we consider a generalized threat model in which the workers can push arbitrarily bad models to the servers. The bad models can cause misclassification, or global convergence to a poor sub-optimum. Beyond more benign issues such as hardware, software, or communication failures, there are multiple ways for nefarious users to manipulate the uploaded models, e.g., data poisoning (Bae et al., 2018). In the worst case, nefarious users can even directly hack the devices and replace the correct models with arbitrary values. We provide a more formal definition of the threat model in Section 4.1.

4 Methodology

A single execution of federated optimization is composed of T communication epochs. At the beginning of each epoch, a randomly selected group of k devices pulls the latest global model from the central server. Then, the same group of devices locally updates the model without communication with the central server. At the end of each epoch, the central server aggregates the updated models and then updates the global model.

In this paper, we propose a new robust federated optimization method. In the t-th epoch, on the i-th device, we locally minimize the local empirical risk F_i(x) by running SGD for H_{t,i} iterations, starting from the latest global model x_t.

Then, the server collects the resulting local models x_{t,H_{t,i},i} for i ∈ S_t and aggregates them using either the standard mean (Option I) or the trimmed mean (Option II). Finally, we update the global model with a moving average over the current model and the aggregated local models: x_{t+1} = (1 − α) x_t + α x̄_t, where x̄_t denotes the aggregated value.

The detailed algorithm is shown in Algorithm 1. x_{t,h,i} is the model parameter after the h-th local iteration of the t-th epoch on the i-th device. z_{t,h,i} is the data (minibatch) randomly drawn in the h-th local iteration of the t-th epoch on the i-th device. H_{t,i} is the number of local iterations in the t-th epoch on the i-th device. γ is the learning rate and T is the total number of epochs.

1:  Input: k, T, H_min, γ, α, b
2:  Initialize x_0
3:  for all epoch t ∈ [T] do
4:     Randomly select a group of k workers, denoted as S_t
5:     for all i ∈ S_t in parallel do
6:        Receive the latest global model x_t from the server
7:        x_{t,0,i} ← x_t
8:        for all local iteration h ∈ [H_{t,i}] do
9:           Randomly sample z_{t,h,i}
10:           x_{t,h,i} ← x_{t,h−1,i} − γ ∇f(x_{t,h−1,i}; z_{t,h,i})
11:        end for
12:        Push x_{t,H_{t,i},i} to the server
13:     end for
14:     Aggregate: x̄_t ← Option I: (1/k) Σ_{i ∈ S_t} x_{t,H_{t,i},i};  Option II: Trmean_b({x_{t,H_{t,i},i} : i ∈ S_t})
15:     Update the global model: x_{t+1} ← (1 − α) x_t + α x̄_t
16:  end for
Algorithm 1 Robust Federated Optimization
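
For concreteness, the following is a minimal NumPy sketch of one way to implement Algorithm 1. It is our illustration under the notation above, not the authors' released MXNet code; the helper names (local_sgd, trimmed_mean, robust_federated_optimization) and the default hyperparameters are ours.

```python
import numpy as np

def trimmed_mean(models, b):
    """Coordinate-wise b-trimmed mean of a list of parameter vectors (b = 0 gives the plain mean)."""
    stacked = np.sort(np.stack(models), axis=0)           # sort each coordinate across workers
    return stacked[b:len(models) - b].mean(axis=0)        # drop b smallest/largest, average the rest

def local_sgd(x, grad_fn, data, num_iters, lr, batch_size, rng):
    """Plain local SGD (Lines 7-11), starting from the latest global model x."""
    for _ in range(num_iters):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        x = x - lr * grad_fn(x, batch)
    return x

def robust_federated_optimization(x0, device_data, grad_fn, T, k, lr, alpha,
                                  b=0, num_iters=10, batch_size=32, seed=0):
    """Sketch of Algorithm 1; b = 0 corresponds to Option I, b > 0 to Option II."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(device_data)
    for _ in range(T):                                     # T communication epochs
        selected = rng.choice(n, size=k, replace=False)    # Line 4: sample S_t
        local_models = [local_sgd(x, grad_fn, device_data[i],
                                  num_iters, lr, batch_size, rng)
                        for i in selected]                 # Lines 5-13: local updates, then push
        aggregated = trimmed_mean(local_models, b)         # Line 14: aggregate
        x = (1 - alpha) * x + alpha * aggregated           # Line 15: moving-average update
    return x
```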

4.1 Threat Model and Robust Aggregation

First, we formally define the threat model.

Definition 1.

(Threat Model) In Line 12 of Algorithm 1, instead of the correct x_{t,H_{t,i},i}, a worker training on poisoned data or controlled by an abnormal user may push an arbitrary value to the server.

Remark 1.

Note that the users/workers are anonymous to the servers, and abnormal users can sometimes pretend to be well-behaved to fool the servers. Hence, it is impossible to reliably identify the workers training on poisoned data from their historical behavior alone.

In Algorithm 1, Option II uses the trimmed mean as a robust aggregation rule that tolerates the proposed threat model. To define the trimmed mean, we first define the order statistics.

Definition 2.

(Order Statistics) By sorting the scalar sequence {u_1, ..., u_k}, we get u_{(1)} ≤ u_{(2)} ≤ ... ≤ u_{(k)}, where u_{(j)} is the j-th smallest element in {u_1, ..., u_k}.

Then, we define the trimmed mean.

Definition 3.

(Trimmed Mean) For b with 0 ≤ b < k/2, the b-trimmed mean of the set of scalars {u_1, ..., u_k} is defined as follows:

Trmean_b({u_1, ..., u_k}) = (1 / (k − 2b)) Σ_{j=b+1}^{k−b} u_{(j)},

where u_{(j)} is the j-th smallest element in {u_1, ..., u_k} as defined in Definition 2. The high-dimensional version (u_i ∈ R^d) of Trmean_b simply applies the trimmed mean in a coordinate-wise manner.

Note that the trimmed mean (Option II) is equivalent to the standard mean (Option I) if we take b = 0.
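
As a small worked example (ours, for illustration), take k = 5, b = 1, and the scalars {1, 2, 3, 4, 100}. The trimmed mean discards the smallest and largest values:

Trmean_1({1, 2, 3, 4, 100}) = (2 + 3 + 4) / 3 = 3,

whereas the standard mean is 110 / 5 = 22; setting b = 0 recovers the standard mean, as noted above.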

5 Convergence Analysis

In this section, we prove the convergence of Algorithm 1 with non-IID data, for non-convex functions. Furthermore, we show that the proposed algorithm tolerates the threat model introduced in Definition 1. We start with the assumptions required by the convergence guarantees.

5.1 Assumptions

For convenience, we denote

Assumption 1.

(Existence of Global Optimum) We assume that there exists at least one (potentially non-unique) global minimum of the loss function F(x), denoted by x*.

Assumption 2.

(Bounded Taylor’s Approximation) We assume that for ∀x, z, the loss f(·; z) has L-smooth and μ-lower-bounded Taylor’s approximation:

f(x; z) + ⟨∇f(x; z), y − x⟩ + (μ/2) ‖y − x‖² ≤ f(y; z) ≤ f(x; z) + ⟨∇f(x; z), y − x⟩ + (L/2) ‖y − x‖²,

where μ ≤ L and L > 0.

Note that Assumption 2 covers the case of non-convexity by taking μ < 0, non-strong convexity by taking μ = 0, and strong convexity by taking μ > 0.

Assumption 3.

(Bounded Variance) Although we assume the non-IID setting, there should still be some similarity between the different local datasets. We assume that for ∀x and ∀i ∈ [n], the deviation of the local stochastic gradient from the global gradient is bounded: E ‖∇f(x; z_i) − ∇F(x)‖² ≤ V for some constant V ≥ 0.

Based on the assumptions above, we have the following convergence guarantees. All the detailed proofs can be found in the appendix.

5.2 Convergence without Data Poisoning

First, we analyze the convergence of Algorithm 1 with Option I, where there are no poisoned workers.

Theorem 1.

With an appropriately chosen learning rate γ, after T epochs, Algorithm 1 with Option I converges to a global optimum:

where , .

5.3 Convergence with Data Poisoning

Under the threat model defined in Definition 1, in the worst case, Algorithm 1 with Option I (as well as FedAvg from McMahan et al. (2016)) suffers from unbounded error.

Proposition 1.

(Informal) Algorithm 1 with Option I cannot tolerate the threat model defined in Definition 1.

Proof.

(Sketch) Assume that in a specific epoch t, among all the k selected workers, the last q of them are poisoned. For the poisoned workers, instead of pushing the correct value x_{t,H_{t,i},i} to the server, they push x_{t,H_{t,i},i} + (k/q) c, where c is an arbitrary constant. For convenience, we assume IID local datasets for all the workers. Thus, the expectation of the aggregated global model becomes E[(1/k) Σ_{i ∈ S_t} x_{t,H_{t,i},i}] + c, which means that in expectation, the aggregated global model can be manipulated to take arbitrary values, which results in unbounded error. ∎
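
The effect described in this sketch is easy to reproduce numerically. The snippet below is our own illustration (not one of the paper's experiments): it applies the above perturbation to a set of stand-in local models and compares the plain mean (Option I) with the trimmed mean (Option II).

```python
import numpy as np

rng = np.random.default_rng(0)
k, q, c = 10, 2, 1e6                       # k selected workers, q of them poisoned, arbitrary shift c
honest = rng.normal(size=k)                # stand-ins for the correctly trained local models

poisoned = honest.copy()
poisoned[-q:] += (k / q) * c               # each poisoned worker adds (k/q)*c, as in the sketch above

def trimmed_mean(u, b):
    """b-trimmed mean of a 1-D array: drop the b smallest and the b largest values."""
    u = np.sort(u)
    return u[b:len(u) - b].mean()

print(poisoned.mean() - honest.mean())                          # ≈ c: Option I is shifted arbitrarily
print(trimmed_mean(poisoned, b=q) - trimmed_mean(honest, b=q))  # small: Option II stays bounded
```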

In the following theorems, we show that using Algorithm 1 with Option II, the error can be upper bounded.

Theorem 2.

Assume that, in addition to the normal workers, there are q workers training on poisoned data, where q ≤ b and 2b < k. With an appropriately chosen learning rate γ, after T epochs, Algorithm 1 with Option II converges to a global optimum:

where , .

Remark 2.

Note that the additional error caused by the poisoned workers and the b-trimmed mean is controlled by a factor that decreases when q and b decrease, or when k increases.

Figure 1: Convergence on training data with balanced partition. (a) Top-1 accuracy on the testing set; (b) cross-entropy on the training set. The learning rate decays by a constant factor at a fixed epoch. Each epoch is a full pass of the local training data.
Figure 2: Convergence on training data with unbalanced partition. (a) Top-1 accuracy on the testing set; (b) cross-entropy on the training set. The learning rate decays by a constant factor at a fixed epoch. Each epoch is a full pass of the local training data. Note that with the same learning rate, FedReg converges as fast as or faster than FedAvg. For one of the tested learning rates, FedAvg diverges, while FedReg still converges stably.
Figure 3: Convergence on training data with balanced partition and no poisoned workers. (a) Top-1 accuracy on the testing set; (b) cross-entropy on the training set. The learning rate decays by a constant factor at a fixed epoch. Each epoch is a full pass of the local training data. Note that using the trimmed mean in FedRob causes some extra variance compared to FedReg, but the gap is tiny. With the other settings fixed, a larger b slightly slows down the convergence.
Figure 4: Convergence on training data with balanced partition, where a fixed number of the workers randomly selected in each epoch are guaranteed to train on poisoned data. (a) Top-1 accuracy on the testing set; (b) cross-entropy on the training set. The poisoned data have “flipped” labels, i.e., each label in the local training data is replaced by an incorrect one. The learning rate decays by a constant factor at a fixed epoch. Each epoch is a full pass of the local training data. Note that FedRob outperforms the baseline FedReg. Compared with FedReg with no data poisoning, FedRob inevitably converges more slowly, but the progress is reasonable.

6 Experiments

In this section, we evaluate the proposed algorithm by testing the convergence and robustness. Note that zoomed figures of the empirical results can be found in the appendix.

6.1 Datasets and Evaluation Metrics

We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. Each image is resized and cropped to a fixed shape. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We keep the network architecture simple enough that it can easily be handled by mobile devices. The detailed network architecture can be found in our submitted source code (which will also be released upon publication).
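
For a sense of scale, a network of this size might look like the following sketch. It is written in PyTorch purely for illustration (the paper's implementation uses MXNet), and the channel widths, kernel sizes, and 32×32 input assumption are ours, not the authors' exact architecture.

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """A minimal 4-conv + 1-FC classifier for 10 classes (illustrative layer sizes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```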

The experiments are conducted on CPU devices. We implement the federated optimization algorithms using the MXNet (Chen et al., 2015) framework.

In each experiment, the training set is partitioned onto the n devices. In each epoch, k devices are randomly selected to launch local updates with a fixed minibatch size. We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set and the cross-entropy loss on the training set as the evaluation metrics.

The baseline algorithm is FedAvg introduced by McMahan et al. (2016), which is a special case of our proposed Algorithm 1 with Option I and α = 1 (no moving average).

For convenience, we name Algorithm 1 with Option I as FedReg, and Algorithm 1 with Option II as FedRob.

6.2 Balanced Partition

We first test the performance of FedReg on training data with a balanced partition. Each of the n partitions contains the same number of images. We test the algorithms with different initial learning rates γ and moving-average weights α. The learning rate decays by a constant factor at a fixed epoch. The result is shown in Figure 1. Note that by taking the moving average, FedReg is less sensitive to the choice of learning rate, especially in the first several epochs. As expected, the moving average stabilizes the convergence.

6.3 Unbalanced Partition

To make the setting more realistic, we partition the training set into unbalanced sizes: the sizes of the partitions form an arithmetic sequence. Furthermore, to enlarge the variance, we make sure that each partition contains at most a small number of different labels out of the 10 classes. Note that some partitions only have one label.
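
A minimal sketch of one way to generate such an unbalanced, label-skewed partition is given below. It is our illustration rather than the paper's exact recipe: the arithmetic-sequence parameters are hypothetical, and the label skew is obtained simply by handing out samples in label-sorted order so that each device sees only a few distinct labels.

```python
import numpy as np

def unbalanced_label_skewed_partition(labels, n_devices, start, step, seed=0):
    """Device sizes follow an arithmetic sequence; samples are assigned in label-sorted
    order so each device sees only a small number of distinct labels (sometimes one)."""
    rng = np.random.default_rng(seed)
    sizes = start + step * np.arange(n_devices)        # arithmetic sequence of partition sizes
    assert sizes.sum() <= len(labels), "partition sizes exceed the dataset"
    by_label = np.argsort(labels, kind="stable")       # sample indices grouped by label
    cuts = np.cumsum(sizes)[:-1]
    parts = np.split(by_label[:sizes.sum()], cuts)     # consecutive label-sorted chunks
    return [rng.permutation(p) for p in parts]         # shuffle within each device

# Example with CIFAR-10-sized labels: 100 devices whose sizes are 50, 59, 68, ...
labels = np.random.default_rng(1).integers(0, 10, size=50_000)
parts = unbalanced_label_skewed_partition(labels, n_devices=100, start=50, step=9)
print(len(parts[0]), len(parts[-1]), len(np.unique(labels[parts[0]])))
```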

We test the algorithms with different initial learning rates γ and moving-average weights α. The learning rate decays by a constant factor at a fixed epoch. The result is shown in Figure 2. Note that due to the larger variance, standard federated optimization may converge to a sub-optimum, or even diverge. For one of the tested learning rates, running without the moving average (α = 1), FedAvg simply diverges. Taking the moving average with α < 1 makes the convergence stable. Furthermore, with appropriate choices of α and the same learning rate γ, FedReg converges to a slightly better solution. As expected, the moving average in FedReg prevents the local updates from drifting too far from the initial value, which stabilizes the convergence.

6.4 Robustness

First, we test the performance of FedRob when no workers are subjected to data poisoning. FedRob with different values of b is compared to FedReg. The result is shown in Figure 3. As expected, the trimmed mean introduces some extra variance into the convergence. This additional variance grows with b, as explained in our convergence analysis and Remark 2. However, the gap to the FedReg baseline is tiny. Furthermore, FedRob converges to the same optimum as FedReg.

To test the robustness, we simulate data poisoning that “flips” the labels of the local training data, i.e., each label in the local training data is replaced with a different, incorrect label. The experiment is set up so that in each epoch, a fixed number of the k randomly selected workers are subjected to data poisoning. The result is shown in Figure 4. We use FedReg without data poisoning as the ideal benchmark. As expected, FedReg without the trimmed mean cannot tolerate data poisoning, which causes catastrophic failure. FedRob tolerates the poisoned workers, though it converges more slowly compared to FedReg without data poisoning. Furthermore, a larger b improves the robustness.
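
A simple way to simulate this kind of poisoning is sketched below. The concrete flipping rule (mapping each label y to num_classes − 1 − y) is our assumption for illustration; the paper's exact mapping is not reproduced here.

```python
import numpy as np

def flip_labels(labels, num_classes=10):
    """Replace every label y by the deterministic wrong label (num_classes - 1) - y."""
    return (num_classes - 1) - np.asarray(labels)

def poison_selected_workers(device_labels, selected, q, seed=0):
    """Flip the labels on q of the workers selected in the current epoch.
    device_labels is a list (or dict) of per-device label arrays."""
    rng = np.random.default_rng(seed)
    poisoned_ids = rng.choice(selected, size=q, replace=False)
    for i in poisoned_ids:
        device_labels[i] = flip_labels(device_labels[i])
    return poisoned_ids
```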

6.5 Discussion

In all the experiments, whether with a balanced or unbalanced partition, the convergence of FedReg is more stable and less sensitive to the choice of learning rate than FedAvg, especially in the first several epochs. We observe that the moving average prevents the local updates from drifting too far from the initial values shared among all the workers. As a result, when the learning rate is too large, the moving average can prevent overfitting to the local training data. When the learning rate is small enough, the moving average is not necessary. In practice, we recommend using the moving average (α < 1) in cases where the appropriate choice of learning rate is unknown.

Unlike FedReg, FedRob tolerates poisoned workers. Furthermore, FedRob performs similarly to FedReg when there are only normal workers. Note that although the additional variance theoretically grows with b, the convergence is empirically improved with larger b (see Figure 4). That is because the trimmed mean filters out not only the values produced by poisoned workers, but also the natural outliers among the normal workers. As a side effect, the performance of FedRob on the testing set is very close to the optimum produced by the ideal FedReg without data poisoning, though the gap is larger on the training set. In general, there is a trade-off between convergence rate and robustness. In practice, we recommend using FedRob with a small b, which tolerates data poisoning and abnormal users while causing minimal deceleration when there are no abnormal users.

7 Conclusion

We propose a novel robust federated optimization algorithm for non-IID training data, which stabilizes the convergence and tolerates poisoned workers. The algorithm has provable convergence. Our empirical results show good performance in practice. In future work, we will analyze our algorithm under other threat models, such as hardware or software failures.

References

8 Proofs

Theorem 1.

With an appropriately chosen learning rate γ, after T epochs, Algorithm 1 with Option I converges to a global optimum:

where , .

Proof.

For convenience, we ignore the random sample in our notations. Thus, represents , where . Furthermore, we define .

Thus, Line 10 in Algorithm 1 can be rewritten into

Using the L-smoothness of f, we have

It is easy to check that , is -strongly convex, where . Thus, we have

Taking expectation on both sides, conditional on , we have

By telescoping and taking total expectation, we have

On the server, after aggregation, conditional on , we have

We define , which is convex. Then, we have

After T epochs, by telescoping and taking total expectation, we have

Theorem 2.

Assume that, in addition to the normal workers, there are q workers training on poisoned data, where q ≤ b and 2b < k. With an appropriately chosen learning rate γ, after T epochs, Algorithm 1 with Option II converges to a global optimum:

where , .

Proof.

First, we analyze the robustness of the trimmed mean. Assume that among the scalar sequence {u_1, ..., u_k}, q elements are poisoned. Without loss of generality, we denote the remaining correct values as {v_1, ..., v_{k−q}}. Thus, u_{(j)} ≥ v_{(j−q)} for q < j ≤ k, and u_{(j)} ≤ v_{(j)} for 1 ≤ j ≤ k − q, where u_{(j)} is the j-th smallest element in {u_1, ..., u_k}, and v_{(j)} is the j-th smallest element in {v_1, ..., v_{k−q}}.

Define . We have

Thus, we have

Note that for arbitrary subset with cardinality , we have the following bound:

In the worker set S_t, there are q poisoned workers. The remaining workers form the set of normal workers, with cardinality k − q.

Thus, we have

Using L-smoothness, we have