1 Introduction
Edge and IoT devices such as smartphones, wearables, sensors, and smart-home appliances play an increasingly important role in our daily lives. These devices generate massive, diverse, and private data that capture human behaviors. In response, there is a trend toward moving computation, including the training of machine-learning models, from the cloud/datacenters to edge devices (Anguita et al., 2013; Pantelopoulos & Bourbakis, 2010). This motivates federated learning (Konevcnỳ et al., 2016; McMahan et al., 2016), in which machine-learning models are trained locally on edge devices, ideally so that the resulting models exhibit improved representation and generalization.
Federated learning is also motivated by privacy protection. Privacy needs and legal requirements (e.g., the US HIPAA laws (HealthInsurance.org, 1996) in a smart hospital, or Europe’s GDPR (EU, 2018)) may necessitate that training be performed on-premises using IoT devices and edge machines, and that data and models never be deposited in the cloud or cloudlets. In such scenarios, federated optimization is designed to learn a global model from the data stored locally on the devices, all without collecting the private local data at a central location.
Typically, federated learning is executed in a server-worker architecture. The workers are placed on edge devices and train the models locally on the private data. The servers are placed in the cloud/datacenters; they aggregate the learned models from the workers and produce a global model. In this architecture, the workers reveal only model parameters to the servers and are otherwise anonymous. In turn, the servers require neither metadata from the workers nor transmission via a trusted third party.
The outlined scenario implies mutual distrust between the servers and the workers. 1) The users do not trust the cloud/datacenters, and hence would like to prevent the workers from revealing private data to the servers. 2) The cloud/datacenters also do not trust the users, since worker anonymity means there is no guarantee of correct behavior at the workers. In the worst case, users with abnormal behavior can poison the training data on the workers, which results in bad models being pushed to the servers.
We propose robust federated optimization, a novel federated-learning algorithm that protects the global model from data poisoning. We summarize its key properties below:

Limited computation.
Edge devices, including smartphones, wearables, sensors, and vehicles, typically have weaker computational capabilities than the workstations or datacenters used in typical distributed machine learning. Thus, simpler models and stochastic training are usually applied in practice. Furthermore, to save battery power, only a subset of workers may be selected to train the model in each epoch.

Limited communication. Connections to the central servers are not guaranteed. Communication can frequently be unavailable, slow, or expensive (in money or in battery power). Thus, communication is much less frequent than in typical distributed optimization.

Decentralized, non-IID training data. To protect users’ privacy, the training data on local devices are not uploaded to a central server. As a result, the data on different devices are not mixed and distributed IID as in standard settings; rather, they are non-identically distributed samples from the population. This is particularly true when each device is controlled by a specific user whose behavior is supposed to be unique. Furthermore, the data sampled on nearby devices are potentially non-independent, since such devices can be shared by the same user or family. For example, the step-counter data from a wearable fitness tracker and a smartphone owned by the same user can have different distributions with mutual dependency: imagine that the fitness tracker is used only when the user is running and the smartphone only when the user is walking, which results in different distributions, while their complementarity yields dependency.

Unbalanced. As a side effect of the decentralized, non-IID setting, different devices hold local datasets of different sizes. Some devices are used much more heavily than others, which results in unbalanced amounts of data per device. For example, users who work out every day have much more data on their wearable fitness bands than average users.

Untrusted workers and data poisoning. A robust system must tolerate adversarial behavior from some users; e.g., a small portion of users may nefariously manipulate their local training data. As a result, some workers may push models learned on poisoned data to the servers.
Although empirical research on federated optimization has been conducted by several authors (Konevcnỳ et al., 2016; McMahan et al., 2016), to our knowledge there is limited work with theoretical guarantees. The limited existing work with convergence guarantees is based on the strong assumption of IID training data (Yu et al., 2018; Stich, 2018), which we have argued is inappropriate for the federated-learning setting. In this paper, we focus on federated optimization with provable convergence guarantees and tolerance to data poisoning, under non-IID settings. In summary, the main contributions are as follows:

We propose a new federated optimization algorithm with provable convergence under non-IID settings. We show that federated learning remains effective with infrequent communication between rounds of local iterations.

We propose a robust variant of our federated optimization algorithm. The algorithm tolerates a small number of workers training on poisoned data. As far as we know, this paper is the first to investigate the robustness of federated optimization.

We show empirically that the proposed algorithm stabilizes the convergence, and protects the global model from data poisoning.
2 Related Work
There is a growing literature on practical applications of edge and fog computing (Garcia Lopez et al., 2015; Hong et al., 2013) in scenarios such as smart homes or sensor networks. More and more big-data applications, including machine-learning tasks, are moving from the cloud to the edge (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Although their computational power is growing, edge devices are still much weaker than the workstations and datacenters used in typical distributed machine learning, due to limited computation and communication capacity and limited battery power. To this end, machine-learning frameworks with simple architectures, such as MobileNet (Howard et al., 2017), are designed for learning with weak devices.
The limited computational power of edge devices also motivates local training of machine-learning models. Existing federated optimization methods (Konevcnỳ et al., 2015, 2016; McMahan et al., 2016) unfortunately lack provable convergence guarantees. The theoretical analysis of related local optimization methods has so far focused on the IID setting (Yu et al., 2018; Stich, 2018). Furthermore, the issue of data poisoning has not been addressed in previous work. To the best of our knowledge, our proposed work is the first federated optimization framework that considers both convergence and robustness, theoretically and practically.
Typically, federated optimization uses the server-worker architecture, which is similar to the Parameter Server (PS) architecture. Stochastic Gradient Descent (SGD) with the PS architecture is widely used in typical distributed machine learning (Li et al., 2014a; Ho et al., 2013; Li et al., 2014b). Compared to the PS architecture, federated learning has much less synchronization. Furthermore, in federated learning, the workers push trained models instead of gradients to the servers. Approaches based on robust statistics are often used to address security issues in the PS architecture (Yin et al., 2018; Xie et al., 2018); they tolerate multiple types of attacks and system failures. However, the existing methods and theoretical analyses do not consider local training on non-IID data. On the other hand, recent work has considered attacks targeting federated learning (Bagdasaryan et al., 2018; Fung et al., 2018; Bhagoji et al., 2018), but does not propose defense techniques with provable convergence.
3 Problem Formulation
Consider federated learning with multiple devices. On each device, a worker process trains the model on the local data. The overall goal is to train a global model using data from all the devices.
To do so, we consider the following optimization problem:
where , for , is sampled from the local data on the th device.
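The objective above follows the standard federated form: the global loss is the average over devices of each device’s expected loss on its local data. A minimal sketch under that reading (the quadratic model and all names are illustrative assumptions, not the paper’s exact setup):

```python
import numpy as np

def local_loss(w, X, y):
    # Expected loss on one device's local data (illustrative quadratic model).
    return np.mean((X @ w - y) ** 2)

def global_loss(w, devices):
    # Federated objective: the average of the local losses over all devices.
    return np.mean([local_loss(w, X, y) for X, y in devices])

rng = np.random.default_rng(0)
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
print(global_loss(np.zeros(3), devices))
```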
Notation/Term  Description

Number of devices
Number of simultaneously updating devices
Number of communication epochs
Set of integers
Randomly selected devices in each epoch
Parameter of the trimmed mean
Minimal number of local iterations
Number of local iterations in a given epoch, on a given device
Initial model in a given epoch
Model updated in a given epoch, at a given local iteration, on a given device
Dataset on a given device
Data (minibatch) sampled in a given epoch, at a given local iteration, on a given device
Learning rate
Weight of the moving average
All the norms in this paper are l2 norms

Device  Where the training data are placed
Worker  One worker on each device; the process that trains the model
User  Agent that produces data on the devices and/or controls the devices
Abnormal user  Special user that produces poisoned data
3.1 NonIID Local Datasets
Note that different devices have different local datasets. Thus, samples drawn from different devices have different expectations. Further, since different devices can be possessed by the same user or the same group of users (e.g., a family), samples drawn from different devices can be mutually dependent.
3.2 Data Poisoning
In federated optimization, the users do not trust the servers. Their refusal to transfer training data to the servers results in the non-IID setting introduced in the previous sections. On the other hand, the servers also do not trust the edge devices. A small number of devices may hold data poisoned by abnormal user behaviors or, in the worst case, be controlled by users or agents who intend to directly upload harmful models to the servers.
In this paper, we consider a generalized threat model in which the workers can push arbitrarily bad models to the servers. Such bad models can cause misclassification or global convergence to a suboptimum. Beyond more benign issues such as hardware, software, or communication failures, there are multiple ways for nefarious users to manipulate the uploaded models, e.g., data poisoning (Bae et al., 2018). In the worst case, nefarious users can even directly hack the devices and replace the correct models with arbitrary values. We give a more formal definition of the threat model in Section 4.1.
4 Methodology
A single execution of federated optimization is composed of communication epochs. At the beginning of each epoch, a randomly selected group of devices pull the latest global model from the central server. Then, the same group of devices locally update the model without communication with the central server. At the end of each epoch, the central server aggregates the updated models and then updates the global model.
In this paper, we propose a new robust federated optimization method. In the epoch, on the th device, we locally solve the following optimization problem using SGD for iterations:
Then, the server collects the resulting local models and aggregates them. Finally, we update the global model with a moving average over the current model and the aggregated local models.
The detailed algorithm is shown in Algorithm 1. is the model parameter updated in th local iteration of the epoch, on the th device. is the data randomly drawn in th local iteration of the epoch, on the th device. is the number of local iterations in the epoch, on the th device. is the learning rate and is the total number of epochs.
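The epoch structure just described can be sketched end-to-end as follows. This is a simplified simulation with Option I (standard-mean) aggregation; the least-squares model, synthetic data, and all names and parameter values are our illustrative assumptions, not the paper’s exact algorithm:

```python
import numpy as np

def run_epoch(w_global, devices, k, H, lr, alpha, rng):
    """One communication epoch: k randomly selected devices each run H
    local SGD steps starting from the pulled global model; the server
    then averages the local models (Option I) and applies a moving
    average with weight alpha."""
    selected = rng.choice(len(devices), size=k, replace=False)
    local_models = []
    for i in selected:
        X, y = devices[i]
        w = w_global.copy()                      # pull the latest global model
        for _ in range(H):                       # H local SGD iterations
            j = rng.integers(len(y))             # sample one local data point
            w -= lr * 2 * (X[j] @ w - y[j]) * X[j]   # least-squares gradient
        local_models.append(w)
    aggregated = np.mean(local_models, axis=0)   # Option I: standard mean
    return (1 - alpha) * w_global + alpha * aggregated

# Synthetic local datasets generated around a shared ground-truth model.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
devices = []
for _ in range(10):
    X = rng.normal(size=(50, 3))
    devices.append((X, X @ w_true + 0.1 * rng.normal(size=50)))
w = np.zeros(3)
for _ in range(50):                              # 50 communication epochs
    w = run_epoch(w, devices, k=5, H=10, lr=0.05, alpha=0.5, rng=rng)
print(np.round(w, 2))                            # approaches w_true
```

The moving-average step is what distinguishes this sketch from plain FedAvg-style averaging: the new global model is pulled only partway toward the aggregated local models.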
4.1 Threat Model and Robust Aggregation
First, we formally define the threat model.
Definition 1.
(Threat Model) In Line 12 of Algorithm 1, instead of the correct value, a worker that trains on poisoned data or is controlled by an abnormal user may push an arbitrary value to the server.
Remark 1.
Note that the users/workers are anonymous to the servers, and abnormal users can sometimes pretend to be well-behaved in order to fool the servers. Hence, it is impossible to reliably identify the workers training on poisoned data from their historical behavior.
In Algorithm 1, Option II uses the trimmed mean as a robust aggregation rule that tolerates the proposed threat model. To define the trimmed mean, we first define the order statistics.
Definition 2.
(Order Statistics) By sorting the scalar sequence , we get , where is the th smallest element in .
Then, we define the trimmed mean.
Definition 3.
(Trimmed Mean) For , the trimmed mean of the set of scalars is defined as follows:
where is the th smallest element in defined in Definition 2. The high-dimensional version simply applies the trimmed mean in a coordinate-wise manner.
Note that the trimmed mean (Option II) is equivalent to the standard mean (Option I) if we take .
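The coordinate-wise trimmed mean of Definition 3 can be implemented in a few lines. The sketch below uses our own names and a trimming parameter `b`; taking `b = 0` recovers the standard mean, matching the note above:

```python
import numpy as np

def trimmed_mean(models, b):
    """Coordinate-wise b-trimmed mean: for each coordinate, drop the b
    smallest and b largest values, then average the rest."""
    arr = np.sort(np.stack(models), axis=0)      # sort each coordinate
    return arr[b:len(models) - b].mean(axis=0)

models = [np.array([1.0, 2.0]), np.array([2.0, 3.0]),
          np.array([3.0, 4.0]), np.array([100.0, -100.0])]
print(trimmed_mean(models, b=1))                 # → [2.5 2.5]; outlier trimmed
print(trimmed_mean(models, b=0))                 # b = 0: the standard mean
```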
5 Convergence Analysis
In this section, we prove the convergence of Algorithm 1 with nonIID data, for nonconvex functions. Furthermore, we show that the proposed algorithm tolerates the threat model introduced in Definition 1. We start with the assumptions required by the convergence guarantees.
5.1 Assumptions
For convenience, we denote
Assumption 1.
(Existence of Global Optimum) We assume that there exists at least one (potentially non-unique) global minimum of the loss function, denoted by .
Assumption 2.
(Bounded Taylor’s Approximation) We assume that for , has smoothness and a lower-bounded Taylor approximation:
where , and .
Note that Assumption 2 covers non-convexity by taking , non-strong convexity by taking , and strong convexity by taking .
Assumption 3.
(Bounded Variance) Although we assume the non-IID setting, there should still be some similarity between the different local datasets. We assume that for , and , we have .
Based on the assumptions above, we have the following convergence guarantees. All the detailed proofs can be found in the appendix.
5.2 Convergence without Data Poisoning
First, we analyze the convergence of Algorithm 1 with Option I, where there are no poisoned workers.
Theorem 1.
5.3 Convergence with Data Poisoning
Under the threat model defined in Definition 1, in the worst case, Algorithm 1 with Option I (as well as FedAvg (McMahan et al., 2016)) suffers from unbounded error.
Proposition 1.
Proof.
(Sketch) Assume that in a specific epoch , among all the workers, the last of them are poisoned. For the poisoned workers, instead of pushing the correct value to the server, they push , where is an arbitrary constant. For convenience, we assume IID local datasets for all the workers. Thus, the expectation of the aggregated global model becomes , which means that in expectation, the aggregated global model can be manipulated to take arbitrary values, which results in unbounded error. ∎
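The proof sketch can be checked numerically: with the standard mean, a single poisoned worker can shift the aggregate by an arbitrary amount, while a trimmed mean stays near the honest values. The scalar example below uses illustrative numbers of our own choosing:

```python
import numpy as np

def trimmed_mean_1d(values, b):
    # b-trimmed mean of a list of scalars: drop b smallest and b largest.
    v = np.sort(np.asarray(values))
    return v[b:len(v) - b].mean()

honest = [0.9, 1.0, 1.1, 1.0]          # models pushed by normal workers
for c in [10.0, 1e6]:                  # the poisoned value can be arbitrary
    pushed = honest + [c]
    # The plain mean tracks c, while the trimmed mean discards it.
    print(np.mean(pushed), trimmed_mean_1d(pushed, b=1))
```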
In the following theorems, we show that using Algorithm 1 with Option II, the error can be upper bounded.
Theorem 2.
Assume that, in addition to the normal workers, there are workers training on poisoned data, where , and . We assume that , . We take . After epochs, Algorithm 1 with Option II converges to a global optimum:
where , .
Remark 2.
Note that the additional error caused by the poisoned workers and trimmed mean is controlled by the factor , which decreases when and decreases, or increases.
6 Experiments
In this section, we evaluate the proposed algorithm by testing the convergence and robustness. Note that zoomed figures of the empirical results can be found in the appendix.
6.1 Datasets and Evaluation Metrics
We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. Each image is resized and cropped to a fixed shape. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We keep the network architecture simple enough that it can be easily handled by mobile devices. The detailed network architecture can be found in our submitted source code (which will also be released upon publication).
The experiments are conducted on CPU devices. We implement the federated optimization algorithms using the MXNet framework (Chen et al., 2015).
In each experiment, the training set is partitioned onto the devices. In each epoch, devices are randomly selected to launch local updates with a fixed minibatch size. We repeat each experiment 10 times and take the average. As evaluation metrics, we use top-1 accuracy on the testing set and the cross-entropy loss on the training set.
6.2 Balanced Partition
We first test the performance of FedReg on training data with a balanced partition, where all partitions have the same number of images. We test the algorithms with different initial learning rates and moving-average weights . The learning rate decays at a fixed epoch. The results are shown in Figure 1. Note that by taking the moving average, FedReg is less sensitive to the change of learning rate, especially in the first several epochs. As expected, the moving average stabilizes the convergence.
6.3 Unbalanced Partition
To make the setting more realistic, we partition the training set into unbalanced sizes that follow an arithmetic sequence. Furthermore, to enlarge the variance, we ensure that each partition contains at most a few different labels out of all the labels. Note that some partitions have only one label.
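A partitioning of this kind can be sketched as follows: sorting the examples by label before cutting arithmetic-sequence-sized chunks yields unbalanced partitions that each touch only a few adjacent classes. The sizes and counts here are illustrative, not the paper’s exact values:

```python
import numpy as np

def unbalanced_partition(labels, n_devices):
    """Cut the label-sorted index list into contiguous chunks whose sizes
    follow an arithmetic sequence, so partitions are unbalanced and each
    contains only a few (possibly one) of the labels."""
    order = np.argsort(labels, kind="stable")        # group indices by label
    sizes = np.arange(1, n_devices + 1, dtype=float) # arithmetic sequence
    sizes = np.floor(sizes / sizes.sum() * len(labels)).astype(int)
    sizes[-1] += len(labels) - sizes.sum()           # absorb rounding remainder
    return np.split(order, np.cumsum(sizes)[:-1])

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=5000)
parts = unbalanced_partition(labels, n_devices=10)
print([len(p) for p in parts])                       # increasing sizes
print([len(np.unique(labels[p])) for p in parts])    # few labels per partition
```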
We test the algorithms with different initial learning rates and moving-average weights . The learning rate decays at a fixed epoch. The results are shown in Figure 2. Note that due to the larger variance, standard federated optimization may converge to a suboptimum, or even diverge. Without the moving average, FedAvg simply diverges, while taking the moving average stabilizes the convergence. Furthermore, with appropriate choices of the moving-average weight and the same learning rate, FedReg converges to a slightly better solution. As expected, the moving average in FedReg prevents the local updates from getting too far away from the initial value, which stabilizes the convergence.
6.4 Robustness
First, we test the performance of FedRob when no workers are subjected to data poisoning. FedRob with different trimming parameters is compared to FedReg. The results are shown in Figure 3. As expected, the trimmed mean introduces some extra variance into the convergence. This additional variance grows with the trimming parameter, as explained in our convergence analysis and Remark 2. However, the gap to the FedReg baseline is tiny. Furthermore, FedRob converges to the same optimum as FedReg.
To test the robustness, we simulate data poisoning that “flips” the labels of the local training data, i.e., each label in the local training data is replaced by its flipped counterpart. The experiment is set up so that, in each epoch, a fixed number of the randomly selected workers are subjected to data poisoning. The results are shown in Figure 4. We use FedReg without data poisoning as the ideal benchmark. As expected, FedReg without the trimmed mean cannot tolerate data poisoning, which causes catastrophic failure. FedRob tolerates the poisoned workers, though it converges more slowly than FedReg without data poisoning. Furthermore, a larger trimming parameter improves the robustness.
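One common way to simulate such label flipping, which we assume here for illustration, maps each label y to num_classes - 1 - y:

```python
import numpy as np

def flip_labels(y, num_classes=10):
    """Poison local training data by 'flipping' each label y to
    num_classes - 1 - y (assumed mapping, e.g., 0 <-> 9 on CIFAR-10)."""
    return num_classes - 1 - y

y = np.array([0, 3, 9, 5])
print(flip_labels(y))                  # → [9 6 0 4]
```

This mapping is an involution: flipping twice restores the original labels.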
6.5 Discussion
In all the experiments, whether with a balanced or an unbalanced partition, the convergence of FedReg is more stable and less sensitive to changes of the learning rate than FedAvg, especially in the first several epochs. We observe that the moving average prevents the local updates from getting too far away from the initial values shared among all the workers. As a result, when the learning rate is too large, the moving average can prevent overfitting of the local training data. When the learning rate is small enough, the moving average is not necessary. In practice, we recommend using the moving average when the appropriate choice of learning rate is unknown.
Unlike FedReg, FedRob tolerates poisoned workers. Furthermore, FedRob performs similarly to FedReg when there are only normal workers. Note that although the additional variance theoretically grows with the trimming parameter, the convergence is empirically improved with larger values (see Figure 4). That is because the trimmed mean filters out not only the values produced by poisoned workers, but also the natural outliers among the normal workers. As a side effect, the performance of FedRob on the testing set is very close to the optimum produced by the ideal FedReg without data poisoning, though the gap is larger on the training set. In general, there is a trade-off between convergence rate and robustness. In practice, we recommend using FedRob with a small trimming parameter, which tolerates data poisoning and abnormal users while causing minimal deceleration when there are no abnormal users.
7 Conclusion
We propose a novel robust federated optimization algorithm for non-IID training data, which stabilizes the convergence and tolerates poisoned workers. The algorithm has provable convergence, and our empirical results show good performance in practice. In future work, we plan to analyze our algorithm under other threat models, such as hardware or software failures.
References
 Anguita et al. (2013) Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. A public domain dataset for human activity recognition using smartphones. In ESANN, 2013.
 Bae et al. (2018) Bae, H., Jang, J., Jung, D., Jang, H., Ha, H., and Yoon, S. Security and privacy issues in deep learning. arXiv preprint arXiv:1807.11655, 2018.
 Bagdasaryan et al. (2018) Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.
 Bhagoji et al. (2018) Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. Analyzing federated learning through an adversarial lens. arXiv preprint arXiv:1811.12470, 2018.
 Cao et al. (2015) Cao, Y., Hou, P., Brown, D., Wang, J., and Chen, S. Distributed analytics and edge intelligence: Pervasive health monitoring at the era of fog computing. In Proceedings of the 2015 Workshop on Mobile Big Data, pp. 43–48. ACM, 2015.
 Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 EU (2018) EU. European Union’s General Data Protection Regulation (GDPR). 2018. https://eugdpr.org/, Last visited: Nov. 2018.
 Fung et al. (2018) Fung, C., Yoon, C. J., and Beschastnikh, I. Mitigating sybils in federated learning poisoning. arXiv preprint arXiv:1808.04866, 2018.

 Garcia Lopez et al. (2015) Garcia Lopez, P., Montresor, A., Epema, D., Datta, A., Higashino, T., Iamnitchi, A., Barcellos, M., Felber, P., and Riviere, E. Edge-centric computing: Vision and challenges. ACM SIGCOMM Computer Communication Review, 45(5):37–42, 2015.
 HealthInsurance.org (1996) HealthInsurance.org, S. A. Health insurance portability and accountability act of 1996. Public law, 104:191, 1996.
 Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pp. 1223–1231, 2013.
 Hong et al. (2013) Hong, K., Lillethun, D., Ramachandran, U., Ottenwälder, B., and Koldehofe, B. Mobile fog: A programming model for largescale applications on the internet of things. In Proceedings of the second ACM SIGCOMM workshop on Mobile cloud computing, pp. 15–20. ACM, 2013.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Konevcnỳ et al. (2015) Konevcnỳ, J., McMahan, B., and Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
 Konevcnỳ et al. (2016) Konevcnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. (2014a) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014a.
 Li et al. (2014b) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014b.
 Mahdavinejad et al. (2018) Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., and Sheth, A. P. Machine learning for internet of things data analysis: A survey. Digital Communications and Networks, 4(3):161–175, 2018.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Pantelopoulos & Bourbakis (2010) Pantelopoulos, A. and Bourbakis, N. G. A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1):1–12, 2010.
 Stich (2018) Stich, S. U. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 Xie et al. (2018) Xie, C., Koyejo, O., and Gupta, I. Phocas: dimensional Byzantine-resilient stochastic gradient descent. arXiv preprint arXiv:1805.09682, 2018.
 Yin et al. (2018) Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
 Yu et al. (2018) Yu, H., Yang, S., and Zhu, S. Parallel restarted sgd for nonconvex optimization with faster convergence and less communication. arXiv preprint arXiv:1807.06629, 2018.
 Zeydan et al. (2016) Zeydan, E., Bastug, E., Bennis, M., Kader, M. A., Karatepe, I. A., Er, A. S., and Debbah, M. Big data caching for networking: Moving from cloud to edge. IEEE Communications Magazine, 54(9):36–42, 2016.
8 Proofs
Theorem 1.
Proof.
For convenience, we ignore the random sample in our notations. Thus, represents , where . Furthermore, we define .
It is easy to check that , is strongly convex, where . Thus, we have
Taking expectation on both sides, conditional on , we have
By telescoping and taking total expectation, we have
On the server, after aggregation, conditional on , we have
We define , which is convex. Then, we have
After epochs, by telescoping and taking total expectation, we have
∎
Theorem 2.
Assume that, in addition to the normal workers, there are workers training on poisoned data, where , and . We assume that , . We take . After epochs, Algorithm 1 with Option II converges to a global optimum:
where , .
Proof.
First, we analyze the robustness of trimmed mean. Assume that among the scalar sequence , elements are poisoned. Without loss of generality, we denote the remaining correct values as . Thus, for , , for , where is the th smallest element in , and is the th smallest element in .
Define . We have
Thus, we have
Note that for arbitrary subset with cardinality , we have the following bound:
In the worker set , there are poisoned workers. We denote as the set of normal workers with cardinality .
Thus, we have
Using smoothness, we have