1 Introduction
Federated learning (Konečný et al., 2016; McMahan et al., 2016) enables training a global model on datasets partitioned across a massive number of resource-weak edge devices. This decentralized approach is motivated by the modern phenomenon of distributed (often personal) data collected at scale by edge devices such as smartphones, wearable devices, sensors, and smart homes/buildings. Ideally, the large amounts of training data from diverse users result in improved representation and generalization of machine-learning models. Federated learning is also motivated by the desire for privacy preservation (Bonawitz et al., 2019, 2017). In some scenarios, on-device training without depositing data in the cloud may be legally required by regulations such as US HIPAA laws (HealthInsurance.org, 1996), US FERPA laws (of Education, 2019), and by some interpretations of Europe’s GDPR law (EU, 2018).
Typically, a federated learning system is composed of servers and workers, with an architecture that is similar to ML training using parameter servers (Li et al., 2014a, b; Ho et al., 2013). The workers (edge devices) train the models locally on the private data. The servers aggregate the learned models from the workers, and produce a global model in the cloud/datacenter. To help protect user privacy, the workers do not expose the training data to the servers, and instead only expose the trained model.
Federated learning has three key properties:

Infrequent task scheduling. First, edge devices typically have weak computational capacity and limited battery life. Thus, unlike traditional distributed machine learning, on-device federated learning tasks are executed only when the device is able to participate, e.g., when it is idle, charging, and connected to an unmetered network such as WiFi (Bonawitz et al., 2019). Edge devices ping the servers when they are ready to execute training tasks, and the servers then schedule training tasks on the available devices. Second, to avoid congesting the network, the server randomizes the check-in time of the workers. For these reasons, the training task is executed infrequently on each edge device.

Infrequent communication. The connection between edge devices and the remote servers may be frequently unavailable, slow, or expensive (in terms of communication costs or in battery power usage). Thus, compared to typical distributed optimization, communication in federated learning needs to be infrequent.

Non-IID training data. Unlike in traditional distributed machine learning, the data on different devices are disjoint and thus represent non-identically distributed samples from the population.
We posit that the standard synchronous approach to federated optimization is potentially unscalable, inefficient, and inflexible. Previous synchronous training algorithms for federated averaging (McMahan et al., 2016; Bonawitz et al., 2019) use a block-synchronous approach that, in practical deployments, can only handle hundreds of devices in parallel, far fewer than the nearly 4 billion mobile phones that could potentially participate (eMarketer, 2019). First, even at smaller scales, such as thousands of devices in a stadium during a game, too many devices checking in at the same time can congest the network on the server side. Thus, in each global epoch, the server is limited to selecting only a subset of the available devices to trigger training tasks. Second, since task scheduling varies from device to device due to limited computational capacity and battery time, it is difficult to synchronize the selected devices at the end of each epoch. Some devices will no longer be available before synchronization, so the server has to set a timeout threshold to drop the stragglers. If the number of surviving devices is too small, the server may have to drop the entire epoch, including all the received updates.
To address these issues, we propose a novel asynchronous approach and algorithm for federated optimization. The key ideas are: 1) to solve regularized local problems to guarantee convergence, and 2) then use a weighted average to update the global model, where the mixing weight is set adaptively as a function of the staleness. Together, these techniques result in an effective asynchronous federated optimization procedure.
The main contributions of our paper are listed as follows:

We propose a new asynchronous federated optimization algorithm with provable convergence under non-IID settings.

We prove the convergence of the proposed approach for a restricted family of non-convex problems.

We propose strategies for controlling the error caused by asynchrony. We introduce a mixing hyperparameter which adaptively controls the tradeoff between the convergence rate and variance reduction according to the staleness.

We show empirically that the proposed algorithm converges fast and often outperforms synchronous federated optimization in practical settings.
2 Related work
Edge computing (Garcia Lopez et al., 2015; Hong et al., 2013) is increasingly applied in various scenarios such as smart homes, wearable devices, and sensor networks. At the same time, machine-learning applications are also moving from cloud to edge (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Typically, edge devices have weaker computation and communication capacity than workstations and datacenters, due to weak hardware, limited battery time, and metered networks. As a result, simple machine-learning models such as MobileNet (Howard et al., 2017) have been proposed for learning with weak devices.
Existing federated optimization methods (Konečný et al., 2015, 2016; McMahan et al., 2016; Bonawitz et al., 2019) focus on synchronous training. In each global epoch, training tasks are triggered on a subset of workers. However, some workers may fail, for example because of bad networking conditions or occasional issues. When this happens, the server has to wait until a sufficient number of workers respond. Otherwise, the server times out, drops the current epoch, and moves on to the next epoch. As far as we know, this paper is the first to discuss asynchronous training in federated learning both theoretically and empirically.
Asynchronous training (Zinkevich et al., 2009; Lian et al., 2017; Zheng et al., 2017) is widely used in traditional distributed SGD. Typically, asynchronous SGD converges faster than synchronous SGD. However, classic asynchronous SGD directly sends gradients to the servers after each local update, which is not feasible for edge devices due to unreliable and slow communication. In this paper, we take advantage of asynchronous training and combine it with federated optimization.
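Notation/Term  Description
$n$  Number of devices
$T$  Number of global epochs
$[n]$  Set of integers $\{1, \ldots, n\}$
$H_{min}$, $H_{max}$, $\delta$  Minimal/maximal number of local iterations; $\delta = \frac{H_{max}}{H_{min}}$ is the imbalance ratio
$H_{\tau}^{i}$  Number of local iterations in the $\tau$-th epoch on the $i$-th device
$x_t$  Global model in the $t$-th epoch on the server
$x_{\tau,h}^{i}$  Model initialized from $x_{\tau}$, updated in the $h$-th local iteration, on the $i$-th device
$\mathcal{D}^{i}$, $z_{\tau,h}^{i}$  $\mathcal{D}^{i}$ is the dataset on the $i$-th device, $z_{\tau,h}^{i} \sim \mathcal{D}^{i}$
$\gamma$  Learning rate
$\alpha$, $t - \tau$, $s(t - \tau)$  Mixing hyperparameter, staleness, and function of staleness for adaptive $\alpha$
$K$  Maximum staleness: $t - \tau \leq K$
$\rho$  Regularization weight
$\|\cdot\|$  All the norms in this paper are $\ell_2$-norms
Device  Where the training data are placed
Worker  Process that trains the model, one worker on each device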
3 Problem formulation
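We consider federated learning with $n$ devices. On each device, a worker process trains the model on local data. The overall goal is to train a global model $x \in \mathbb{R}^d$ using data from all the devices. Formally, we consider the following optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x),$$
where $F(x) = \frac{1}{n} \sum_{i \in [n]} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i)$ and, for $\forall i \in [n]$, $z^i$ is sampled from the local data $\mathcal{D}^i$ on the $i$-th device. Note that different devices have different local datasets, i.e., $\mathcal{D}^i \neq \mathcal{D}^j$, $\forall i \neq j$. Thus, samples drawn from different devices may have different expectations, i.e., in general, $\mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) \neq \mathbb{E}_{z^j \sim \mathcal{D}^j} f(x; z^j)$, $\forall i \neq j$.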
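To make the structure of the objective concrete, the sketch below evaluates a global loss that is the plain average of per-device losses. It is only an illustration: the squared loss, the scalar model, and the function names `local_loss` and `global_loss` are assumptions made for this example, and the empirical mean over each device's samples stands in for the expectation.

```python
from typing import List


def local_loss(x: float, data: List[float]) -> float:
    """Empirical stand-in for the expected loss on one device (toy squared loss)."""
    return sum(0.5 * (x - z) ** 2 for z in data) / len(data)


def global_loss(x: float, device_data: List[List[float]]) -> float:
    """Average of the per-device losses: every device gets the same weight."""
    return sum(local_loss(x, data) for data in device_data) / len(device_data)


if __name__ == "__main__":
    # Two devices with different (non-IID) local data and different sizes.
    devices = [[0.0, 0.0], [4.0]]
    for x in (4.0 / 3.0, 2.0):
        print(x, global_loss(x, devices))
```

In this toy setting each device contributes equally regardless of how many samples it holds, so the averaged objective is lower at the mean of the device means (2.0) than at the pooled sample mean (4/3).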
4 Methodology
A single execution of federated optimization has $T$ global epochs. In the $t$-th epoch, the server receives a locally trained model $x_{new}$ from an arbitrary worker, and updates the global model by weighted averaging:
$$x_t = (1 - \alpha) x_{t-1} + \alpha x_{new},$$
where $\alpha \in (0, 1)$ is the mixing hyperparameter.
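The server-side step can be sketched in a few lines. This is a minimal illustration, assuming the model is a flat list of floats and that the update is the convex combination (1 - alpha) * global + alpha * received; the function name `mix_models` is hypothetical.

```python
from typing import List


def mix_models(global_model: List[float], local_model: List[float],
               alpha: float) -> List[float]:
    """Weighted average of the current global model and a received local model."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha is assumed to lie strictly between 0 and 1")
    if len(global_model) != len(local_model):
        raise ValueError("models must have the same number of parameters")
    return [(1.0 - alpha) * g + alpha * l
            for g, l in zip(global_model, local_model)]


if __name__ == "__main__":
    x_global = [0.0, 2.0]
    x_new = [1.0, 0.0]
    print(mix_models(x_global, x_new, alpha=0.5))
```

A small alpha keeps the global model close to its previous value, while a large alpha lets a single (possibly stale) worker move it further.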
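On an arbitrary device $i$, after receiving a global model $x_t$ (potentially stale) from the server, we locally solve the following regularized optimization problem using SGD for multiple iterations:
$$\min_{x \in \mathbb{R}^d} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) + \frac{\rho}{2} \|x - x_t\|^2.$$
For convenience, we define $g_{x_t}(x; z) = f(x; z) + \frac{\rho}{2} \|x - x_t\|^2$.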
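As a rough illustration of the worker side, the following sketch runs SGD on a local loss plus a quadratic penalty that pulls the iterate back towards the received global model. The scalar model, the squared loss 0.5 * (x - z)^2, and the name `local_update` are assumptions chosen to keep the example self-contained.

```python
import random
from typing import List


def local_update(x_global: float, data: List[float], rho: float,
                 lr: float, num_iters: int, seed: int = 0) -> float:
    """Run SGD on the toy loss 0.5*(x - z)^2 + (rho/2)*(x - x_global)^2."""
    rng = random.Random(seed)
    x = x_global                                # start from the received model
    for _ in range(num_iters):
        z = rng.choice(data)                    # draw one local sample
        grad = (x - z) + rho * (x - x_global)   # loss gradient + penalty gradient
        x -= lr * grad
    return x


if __name__ == "__main__":
    # With rho = 1 the local solution sits halfway between the data and x_global.
    print(local_update(x_global=0.0, data=[4.0], rho=1.0, lr=0.1, num_iters=200))
```

Larger values of rho keep the local model closer to the global model it started from, which limits how far a single worker can drift on its own non-IID data.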
The server and workers conduct updates asynchronously. The server immediately updates the global model whenever it receives a local model. The communication between the server and workers is non-blocking.
The detailed algorithm is shown in Algorithm 1. The model parameter $x_{\tau,h}^{i}$ is updated in the $h$-th local iteration after receiving $x_{\tau}$, on the $i$-th device. $z_{\tau,h}^{i}$ is the data randomly drawn in the $h$-th local iteration after receiving $x_{\tau}$, on the $i$-th device. $H_{\tau}^{i}$ is the number of local iterations after receiving $x_{\tau}$, on the $i$-th device. $\gamma$ is the learning rate and $T$ is the total number of global epochs.
Remark 1.
On the server side, there are two threads running asynchronously in parallel: a scheduler and an updater. The scheduler periodically triggers training tasks on some selected workers. The updater receives locally trained models from workers and updates the global model. There could be multiple updater threads with a read-write lock on the global model, which improves the throughput. The scheduler randomizes the timing of training tasks to avoid overloading the updater thread, and controls the staleness ($t - \tau$ in the updater thread). Since we do not focus on the system design in this paper, we will not discuss the details of the scheduling strategies.
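The updater side of this design can be sketched with Python's standard `threading` and `queue` modules. This is a simplified illustration, not the system described in the paper: the class `AsyncServer` and its methods are hypothetical, the scheduler is omitted (workers are simulated by direct calls to `push`), and a plain mutex stands in for the read-write lock because the standard library does not provide one.

```python
import queue
import threading
from typing import List, Optional, Tuple


class AsyncServer:
    """Toy server: workers push models into an inbox, one updater thread mixes them in."""

    def __init__(self, init_model: List[float], alpha: float) -> None:
        self.model = list(init_model)
        self.alpha = alpha
        self.epoch = 0                      # number of global updates applied so far
        self.staleness_log: List[int] = []  # observed staleness of each update
        self._lock = threading.Lock()       # plain mutex standing in for a read-write lock
        self._inbox: "queue.Queue[Optional[Tuple[List[float], int]]]" = queue.Queue()

    def snapshot(self) -> Tuple[List[float], int]:
        """What a worker would download: the current model and its epoch stamp."""
        with self._lock:
            return list(self.model), self.epoch

    def push(self, local_model: List[float], tau: int) -> None:
        """Non-blocking hand-off of a local model trained from the epoch-tau model."""
        self._inbox.put((local_model, tau))

    def stop(self) -> None:
        self._inbox.put(None)               # sentinel that ends the updater loop

    def run_updater(self) -> None:
        while True:
            item = self._inbox.get()
            if item is None:
                break
            local_model, tau = item
            with self._lock:
                self.staleness_log.append(self.epoch - tau)
                self.model = [(1.0 - self.alpha) * g + self.alpha * l
                              for g, l in zip(self.model, local_model)]
                self.epoch += 1


if __name__ == "__main__":
    server = AsyncServer([0.0], alpha=0.5)
    updater = threading.Thread(target=server.run_updater)
    updater.start()
    _, tau = server.snapshot()              # two workers download the same model ...
    server.push([1.0], tau)                 # ... and report back one after the other
    server.push([1.0], tau)
    server.stop()
    updater.join()
    print(server.snapshot(), server.staleness_log)
```

Because both simulated workers started from the same snapshot, the second update is already one epoch stale when it is applied, which is the quantity the updater tracks.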
Remark 2.
Intuitively, larger staleness results in greater error when updating the global model. For local models with large staleness $(t - \tau)$, we can decrease $\alpha$ to mitigate the error caused by staleness. As shown in Algorithm 1, we can optionally use a function $s(t - \tau)$ to decide the value of $\alpha$. In general, $s(t - \tau)$ should be $1$ when $t = \tau$, and monotonically decrease as $(t - \tau)$ increases. There are many functions that satisfy these two properties, with different rates of decrease, e.g., $s_a(t - \tau) = \frac{1}{(t - \tau + 1)^a}$ with $a > 0$. The options used in this paper can be found in Section 6.2.
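The two requirements on the staleness function can be checked mechanically. The sketch below assumes the adaptive rule scales the base mixing weight as alpha * s(staleness); the helper names `is_valid_staleness_fn` and `effective_alpha` are hypothetical, and the check accepts non-increasing functions so that flat segments are allowed.

```python
from typing import Callable


def is_valid_staleness_fn(s: Callable[[int], float],
                          max_staleness: int = 50) -> bool:
    """Check that s(0) == 1 and that s never increases on 0..max_staleness."""
    if abs(s(0) - 1.0) > 1e-12:
        return False
    values = [s(k) for k in range(max_staleness + 1)]
    return all(a >= b for a, b in zip(values, values[1:]))


def effective_alpha(alpha: float, staleness: int,
                    s: Callable[[int], float]) -> float:
    """Mixing weight actually used for an update with the given staleness."""
    if staleness < 0:
        raise ValueError("staleness cannot be negative")
    return alpha * s(staleness)


if __name__ == "__main__":
    inverse = lambda k: 1.0 / (k + 1.0)     # one simple admissible choice
    print(is_valid_staleness_fn(inverse))
    print([effective_alpha(0.6, k, inverse) for k in range(4)])
```

A fresh update (staleness 0) is mixed in with the full base weight, and the weight shrinks as the update gets older.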
5 Convergence analysis
In this section, we prove the convergence of Algorithm 1 with non-IID data.
5.1 Assumptions
First, we introduce some definitions and assumptions for our convergence analysis.
Definition 1.
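(Smoothness) A differentiable function $f$ is $L$-smooth if for $\forall x, y$, $f(x) - f(y) \leq \langle \nabla f(y), x - y \rangle + \frac{L}{2} \|x - y\|^2$, where $L > 0$.
Definition 2.
(Weak convexity) A differentiable function $f$ is $\mu$-weakly convex if the function $g$ with $g(x) = f(x) + \frac{\mu}{2} \|x\|^2$ is convex, where $\mu \geq 0$.
Note that when $f$ is $\mu$-weakly convex, then $f$ is convex if $\mu = 0$, and potentially non-convex if $\mu > 0$.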
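A one-dimensional example may help fix ideas. The choice $f(x) = \cos x$ is ours, picked only because both properties can be read off its second derivative; it assumes the usual readings of the two definitions (a quadratic upper bound with constant $L$, and convexity of $f(x) + \frac{\mu}{2}\|x\|^2$).

```latex
% Illustrative example (not from the paper): f(x) = cos(x) on the real line.
\begin{align*}
  f(x) &= \cos x, & f''(x) &= -\cos x \in [-1, 1]. \\
  % Smoothness: the second derivative is bounded by 1 in absolute value.
  |f''(x)| &\le 1 & &\Longrightarrow\ f \text{ is } L\text{-smooth with } L = 1. \\
  % Weak convexity: adding (mu/2) x^2 shifts the second derivative up by mu.
  g(x) &= \cos x + \tfrac{\mu}{2} x^2, & g''(x) &= \mu - \cos x \ge 0 \ \ \forall x
  \iff \mu \ge 1.
\end{align*}
```

So $\cos x$ is non-convex, yet it is both $1$-smooth and $1$-weakly convex, which is the kind of function the two definitions are meant to cover.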
5.2 Convergence guarantees
Based on the assumptions above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.
Theorem 1.
Assume that is smooth and weakly convex, and each worker executes at least and at most local updates before pushing models to the server. We assume bounded delay . The imbalance ratio of local updates is . Furthermore, we assume that for , and , we have and , . Taking large enough such that and , and , after global updates, Algorithm 1 converges to a critical point:
Taking , , , we have
Remark 3.
The mixing hyperparameter controls the tradeoff between the convergence rate and additional error caused by variance. Larger makes the term converge faster to , but also enlarges the error term .
Remark 4.
In general, the additional error is controlled by two factors: the maximum delay $K$ and the imbalance ratio $\delta$. Larger delay and imbalance ratio result in slower convergence.
6 Experiments
In this section, we empirically evaluate the proposed algorithm.
6.1 Datasets
We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky and Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We resize each image and crop it to the shape of . We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We chose a simple network architecture so that it can be easily handled by mobile devices. In each experiment, the training set is partitioned onto devices. Each of the partitions has images. For any worker, the minibatch size for SGD is .
6.2 Evaluation setup
The experiments are conducted on CPU devices. We implement the federated optimization algorithms using the MXNet (Chen et al., 2015) framework.
The baseline algorithm is FedAvg, introduced by McMahan et al. (2016), which is synchronous federated optimization. For FedAvg, in each epoch, devices are randomly selected to launch local updates. We also use single-thread SGD as a baseline. For the two baseline algorithms, we use grid search to tune the learning rates and report the best results according to the top-1 accuracy on the testing set.
We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set and cross-entropy loss on the training set as the evaluation metrics.
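For convenience, we name Algorithm 1 as FedAsync. We also test the performance of FedAsync with adaptive mixing hyperparameters $\alpha_t = \alpha \times s(t - \tau)$, as mentioned in Section 4. We employ the following two strategies, parameterized by $a > 0$, $b \geq 0$:

Polynomial: $s_a(t - \tau) = (t - \tau + 1)^{-a}$.

Hinge: $s_{a,b}(t - \tau) = 1$ if $t - \tau \leq b$, and $s_{a,b}(t - \tau) = \frac{1}{a(t - \tau - b) + 1}$ otherwise.
For convenience, we refer to FedAsync with polynomial adaptive $\alpha$ as FedAsync+Poly, and FedAsync with hinge adaptive $\alpha$ as FedAsync+Hinge.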
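The two strategies are easy to write down in code. The sketch below assumes the polynomial form (staleness + 1)^(-a) and a hinge form that equals 1 up to a threshold b and then decays as 1 / (a * (staleness - b) + 1); the function names and the parameter values in the demo are illustrative choices, not the settings used in the experiments.

```python
def poly_staleness(staleness: int, a: float) -> float:
    """Polynomial decay: 1 at staleness 0, shrinking smoothly afterwards."""
    return (staleness + 1.0) ** (-a)


def hinge_staleness(staleness: int, a: float, b: int) -> float:
    """Hinge decay: no penalty up to staleness b, then a 1/(a*x + 1) decay."""
    if staleness <= b:
        return 1.0
    return 1.0 / (a * (staleness - b) + 1.0)


if __name__ == "__main__":
    for k in (0, 1, 3, 8, 15):
        print(k, poly_staleness(k, a=0.5), hinge_staleness(k, a=10.0, b=4))
```

The difference between the two shapes is visible at small staleness: the polynomial form starts discounting immediately, whereas the hinge form leaves updates untouched until the threshold is exceeded.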
In general, it is nontrivial to compare asynchronous training and synchronous training in a fair way. We conduct two comparisons: metrics vs. number of gradients, and metrics vs. number of communications:

The number of gradients is the number of gradients applied to the global model. Note that for both FedAsync and FedAvg, an epoch of local iterations is a full pass of the local dataset. Thus, for FedAsync, in each global epoch, gradients are applied to the global model. For FedAvg, since , gradients are applied to the global model in each global epoch.

The number of communications measures the communication overhead on the server side. We count how many times the models are exchanged (sent/received) on the server. On average, in each global epoch, FedAvg has the communications of FedAsync. Single-thread SGD has no communication, so we ignore it.
In all the experiments, we simulate the asynchrony by randomly sampling the staleness $(t - \tau)$ from a uniform distribution.
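A toy version of this simulation protocol is sketched below. It is not the experimental code: the scalar model, the noiseless squared loss centred on a per-device target, the function name `fedasync_toy`, and the choice to draw staleness uniformly from 0 up to `max_staleness` (capped by the number of updates so far) are all assumptions made for illustration.

```python
import random
from typing import List


def fedasync_toy(device_targets: List[float], epochs: int, alpha: float,
                 rho: float, lr: float, local_iters: int,
                 max_staleness: int, seed: int = 0) -> float:
    """Simulate asynchronous updates by training each worker from a stale model."""
    rng = random.Random(seed)
    history = [0.0]                       # history[t] = global model after t updates
    for _ in range(epochs):
        t = len(history) - 1
        staleness = rng.randint(0, min(max_staleness, t))
        x_stale = history[t - staleness]  # model the worker originally downloaded
        target = rng.choice(device_targets)   # an arbitrary device reports back
        x = x_stale
        for _ in range(local_iters):
            # Gradient of 0.5*(x - target)^2 + (rho/2)*(x - x_stale)^2.
            x -= lr * ((x - target) + rho * (x - x_stale))
        history.append((1.0 - alpha) * history[-1] + alpha * x)
    return history[-1]


if __name__ == "__main__":
    print(fedasync_toy([0.0, 4.0], epochs=300, alpha=0.1, rho=1.0,
                       lr=0.1, local_iters=20, max_staleness=4))
```

Keeping a history of past global models is a convenient way to emulate staleness in a single process: the worker trains from an older entry while the server mixes the result into the newest one.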
6.3 Empirical results
We test FedAsync (asynchronous federated optimization in Algorithm 1), FedAvg, and SGD, with different learning rates , regularization weights , mixing hyperparameter , and staleness. Each experiment has global epochs. decays by at the th global epoch.
In Figure 1, we show how FedAsync converges as the number of gradients grows. We can see that when the overall staleness is small, FedAsync converges as fast as SGD, and faster than FedAvg. When the staleness is larger, FedAsync converges more slowly. In the worst case, FedAsync has a convergence rate similar to that of FedAvg. When $\alpha$ is too large, the convergence can be unstable. With adaptive $\alpha$, the convergence is more robust.
In Figure 2, we show how FedAsync converges as the communication overhead grows. With the same amount of communication overhead, FedAsync converges faster than FedAvg when staleness is small. When staleness is large, FedAsync performs similarly to FedAvg.
In Figure 3, we show how staleness affects the convergence of FedAsync. Overall, larger staleness makes the convergence slower, but the influence is not catastrophic. Furthermore, using adaptive mixing hyperparameters, the instability caused by large staleness can be mitigated.
In Figure 4, we show how $\alpha$ affects the convergence of FedAsync. In general, FedAsync is robust to different values of $\alpha$. When the staleness is large, smaller $\alpha$ is better for FedAsync, while larger $\alpha$ is better for FedAsync+Poly and FedAsync+Hinge. That is because the adaptive $\alpha$ is automatically reduced when the staleness is large, so we need not manually decrease $\alpha$.
6.4 Discussion and evaluation conclusion
In general, the convergence rate of FedAsync lies between those of single-thread SGD and FedAvg. Larger $\alpha$ and smaller staleness make FedAsync closer to single-thread SGD. Smaller $\alpha$ and larger staleness make FedAsync closer to FedAvg.
FedAsync is generally insensitive to hyperparameters. When the staleness is large, we can tune $\alpha$ to improve the convergence. Without adaptive $\alpha$, smaller $\alpha$ is better for larger staleness. For adaptive $\alpha$, our best choice empirically was FedAsync+Poly with .
Larger staleness makes the convergence slower and unstable. There are three ways to control the influence of staleness:

On the server side, the updater thread can drop the updates with large staleness $(t - \tau)$. This can also be viewed as a special case of the adaptive mixing hyperparameter $\alpha$.

Using adaptive mixing hyperparameters improves the convergence, as shown in Figure 3. Different strategies show different improvement. So far we find that FedAsync+Poly with has the best performance.

On the server side, the scheduler thread can control the assignment of training tasks to the workers. If the on-device training is triggered less frequently, the overall staleness will be smaller.
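The first option above, dropping very stale updates, can be expressed as a staleness function that returns zero beyond a cutoff, so the mixing weight vanishes and the global model is left unchanged. The sketch below is a minimal illustration under that reading; `drop_if_stale`, `server_step`, and the cutoff value are hypothetical.

```python
from typing import Callable, List


def drop_if_stale(s: Callable[[int], float],
                  max_allowed: int) -> Callable[[int], float]:
    """Wrap a staleness function so that overly stale updates get weight 0."""
    def wrapped(staleness: int) -> float:
        return s(staleness) if staleness <= max_allowed else 0.0
    return wrapped


def server_step(global_model: List[float], local_model: List[float],
                alpha: float, staleness: int,
                s: Callable[[int], float]) -> List[float]:
    """Mix in a local model with weight alpha * s(staleness)."""
    weight = alpha * s(staleness)
    return [(1.0 - weight) * g + weight * l
            for g, l in zip(global_model, local_model)]


if __name__ == "__main__":
    s = drop_if_stale(lambda k: 1.0, max_allowed=3)
    print(server_step([1.0], [5.0], alpha=0.5, staleness=3, s=s))  # mixed in
    print(server_step([1.0], [5.0], alpha=0.5, staleness=4, s=s))  # ignored
```

Expressing the drop rule this way means the updater needs no separate code path: the same mixing step handles accepted and rejected updates.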
In general, FedAsync+Poly and FedAsync+Hinge have similar performance. FedAsync+Hinge performs slightly worse than FedAsync+Poly, since it does not penalize the update when $t - \tau \leq b$.
We conclude that FedAsync has the following advantages compared to FedAvg:

Efficiency: The server can receive updates from the workers at any time. Unlike in FedAvg, stragglers’ updates are not dropped. When the staleness is small, FedAsync converges much faster than FedAvg. When the staleness is large, FedAsync still performs similarly to FedAvg.

Flexibility: If some workers are no longer available for the training tasks (the devices are no longer idle, charging, or connected to unmetered networks), they can temporarily save the workspace, and continue the training or push the trained model to the server later. This also gives more flexibility to the scheduler on the server. Unlike FedAvg, FedAsync can schedule training tasks even if the workers are currently unavailable, since the server does not wait until the workers respond. The currently unavailable workers can start the training tasks later.

Scalability: Compared to FedAvg, FedAsync can handle more workers running in parallel, since all the updates on the server and the workers are non-blocking. The server only needs to randomize the response time of the workers to avoid congesting the network.

Fault tolerance: For FedAsync, catastrophic failures of workers can be regarded as a special case of unavailability. If some workers crash, the server will ignore them and fail gracefully (equivalent to ignoring data from those workers), since all the updates are non-blocking. In contrast, FedAvg requires a timeout mechanism for fault tolerance, which potentially slows down the progress due to the waiting time.
7 Conclusion
We proposed a novel asynchronous federated optimization algorithm for non-IID training data. We proved its convergence for a restricted family of non-convex problems. Our empirical evaluation validated both fast convergence and staleness tolerance. An interesting future direction is the design of strategies to adaptively tune the mixing hyperparameters.
References
Bonawitz et al. (2017) K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191. ACM, 2017.
 Bonawitz et al. (2019) K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
 Cao et al. (2015) Y. Cao, P. Hou, D. Brown, J. Wang, and S. Chen. Distributed analytics and edge intelligence: Pervasive health monitoring at the era of fog computing. In Proceedings of the 2015 Workshop on Mobile Big Data, pages 43–48. ACM, 2015.
 Chen et al. (2015) T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

eMarketer (2019) eMarketer. Number of mobile phone users worldwide from 2015 to 2020 (in billions). 2019. https://www.statista.com/statistics/274774/forecast-of-mobile-phone-users-worldwide/, Last visited: Mar. 2019.
EU (2018) EU. European Union’s General Data Protection Regulation (GDPR). 2018. https://eugdpr.org/, Last visited: Nov. 2018.

Garcia Lopez et al. (2015) P. Garcia Lopez, A. Montresor, D. Epema, A. Datta, T. Higashino, A. Iamnitchi, M. Barcellos, P. Felber, and E. Riviere. Edge-centric computing: Vision and challenges. ACM SIGCOMM Computer Communication Review, 45(5):37–42, 2015.
HealthInsurance.org (1996) S. A. HealthInsurance.org. Health insurance portability and accountability act of 1996. Public law, 104:191, 1996.
 Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231, 2013.
Hong et al. (2013) K. Hong, D. Lillethun, U. Ramachandran, B. Ottenwälder, and B. Koldehofe. Mobile fog: A programming model for large-scale applications on the internet of things. In Proceedings of the second ACM SIGCOMM workshop on Mobile cloud computing, pages 15–20. ACM, 2013.
 Howard et al. (2017) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Konečný et al. (2015) J. Konečný, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
Konečný et al. (2016) J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Li et al. (2014a) M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014a.
 Li et al. (2014b) M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014b.
 Lian et al. (2017) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
 Mahdavinejad et al. (2018) M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth. Machine learning for internet of things data analysis: A survey. Digital Communications and Networks, 4(3):161–175, 2018.
McMahan et al. (2016) H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 of Education (2019) U. D. of Education. Family Educational Rights and Privacy Act (FERPA). 2019. https://studentprivacy.ed.gov/?src=fpco, Last visited: May. 2019.
 Zeydan et al. (2016) E. Zeydan, E. Bastug, M. Bennis, M. A. Kader, I. A. Karatepe, A. S. Er, and M. Debbah. Big data caching for networking: Moving from cloud to edge. IEEE Communications Magazine, 54(9):36–42, 2016.

Zheng et al. (2017) S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 4120–4129. JMLR.org, 2017.
Zinkevich et al. (2009) M. Zinkevich, J. Langford, and A. J. Smola. Slow learners are fast. In Advances in neural information processing systems, pages 2331–2339, 2009.