Asynchronous Federated Optimization

Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of non-convex problems. Empirical results show that the proposed algorithm converges fast and tolerates staleness.


1 Introduction

Federated learning (Konečný et al., 2016; McMahan et al., 2016) enables training a global model on datasets partitioned across a massive number of resource-weak edge devices. This decentralized approach is motivated by the modern phenomenon of distributed (often personal) data collected at scale by edge devices such as smartphones, wearable devices, sensors, and smart homes/buildings. Ideally, the large amounts of training data from diverse users result in improved representation and generalization of machine-learning models. Federated learning is also motivated by the desire for privacy preservation (Bonawitz et al., 2019, 2017). In some scenarios, on-device training without depositing data in the cloud may be legally required by regulations such as US HIPAA laws (, 1996), US FERPA laws (of Education, 2019), and by some interpretations of Europe’s GDPR law (EU, 2018).

Typically, a federated learning system is composed of servers and workers, with an architecture that is similar to ML training using parameter servers (Li et al., 2014a, b; Ho et al., 2013). The workers (edge devices) train the models locally on the private data. The servers aggregate the learned models from the workers, and produce a global model on the cloud/datacenter. To help protect user privacy, the workers do not expose the training data to the servers, and instead only expose the trained model.

Federated learning has three key properties:

  • Infrequent task scheduling. Firstly, edge devices typically have weak computational capacity and limited battery. Thus, unlike traditional distributed machine learning, on-device federated learning tasks are executed only when the device is able to participate, e.g., idle, charging, and connected to unmetered networks (i.e., WiFi) (Bonawitz et al., 2019). Edge devices will ping the servers when they are ready to execute training tasks. The servers will then schedule training tasks on available edge devices. Secondly, to avoid congesting the network, the server randomizes the check-in time of the workers. For these reasons, on each edge device, the training task is executed infrequently.

  • Infrequent communication. The connection between edge devices and the remote servers may be frequently unavailable, slow, or expensive (in terms of communication costs or in battery power usage). Thus, compared to typical distributed optimization, communication in federated learning needs to be infrequent.

  • Non-IID training data. Unlike in traditional distributed machine learning, the data on different devices are disjoint, and thus may represent non-identically distributed samples from the population.

We posit that the standard synchronous approach to federated optimization is potentially unscalable, inefficient, and inflexible. Previous synchronous training algorithms for federated averaging (McMahan et al., 2016; Bonawitz et al., 2019) use a block-synchronous approach for practical deployments, which can handle only hundreds of devices in parallel, far fewer than the nearly 4 billion mobile phones that could potentially participate (eMarketer, 2019). First, even at smaller scales, such as thousands of devices in a stadium during a game, too many devices checking in at the same time can congest the network on the server side. Thus, in each global epoch, the server is limited to selecting only from a subset of the available devices to trigger the training tasks. Second, since task scheduling varies from device to device due to limited computational capacity and battery time, it is difficult to synchronize the selected devices at the end of each epoch. Some devices will no longer be available before synchronization. Instead, the server has to set a timeout threshold to drop the stragglers. If the number of surviving devices is too small, the server may have to drop the entire epoch, including all the received updates.

To address these issues, we propose a novel asynchronous approach and algorithm for federated optimization. The key ideas are: 1) to solve regularized local problems to guarantee convergence, and 2) then use a weighted average to update the global model, where the mixing weight is set adaptively as a function of the staleness. Together, these techniques result in an effective asynchronous federated optimization procedure.

The main contributions of our paper are listed as follows:

  • We propose a new asynchronous federated optimization algorithm with provable convergence under non-IID settings.

  • We prove the convergence of the proposed approach for a restricted family of non-convex problems.

  • We propose strategies for controlling the error caused by asynchrony. We introduce a mixing hyperparameter which adaptively controls the trade-off between the convergence rate and variance reduction according to the staleness.

  • We show empirically that the proposed algorithm converges fast and often outperforms synchronous federated optimization in practical settings.

2 Related work

Edge computing (Garcia Lopez et al., 2015; Hong et al., 2013) is increasingly applied in scenarios such as smart homes, wearable devices, and sensor networks. At the same time, machine-learning applications are also moving from the cloud to the edge (Cao et al., 2015; Mahdavinejad et al., 2018; Zeydan et al., 2016). Typically, edge devices have weaker computation and communication capacity than workstations and datacenters, owing to weak hardware, limited battery time, and metered networks. As a result, simple machine-learning models such as MobileNet (Howard et al., 2017) have been proposed for learning with weak devices.

Existing federated optimization methods (Konečný et al., 2015, 2016; McMahan et al., 2016; Bonawitz et al., 2019) focus on synchronous training. In each global epoch, training tasks are triggered on a subset of workers. However, some workers may fail, for example because of poor network conditions or other occasional issues. When this happens, the server has to wait until a sufficient number of workers respond. Otherwise, the server times out, drops the current epoch, and moves on to the next epoch. As far as we know, this paper is the first to discuss asynchronous training in federated learning both theoretically and empirically.

Asynchronous training (Zinkevich et al., 2009; Lian et al., 2017; Zheng et al., 2017) is widely used in traditional distributed SGD. Typically, asynchronous SGD converges faster than synchronous SGD. However, classic asynchronous SGD sends gradients directly to the servers after each local update, which is not feasible for edge devices because their communication is unreliable and slow. In this paper, we take advantage of asynchronous training and combine it with federated optimization.

Notation/Term                      Description
$n$                                Number of devices
$T$                                Number of global epochs
$[n]$                              Set of integers $\{1, \ldots, n\}$
$H_{min}$, $H_{max}$, $\delta$     Minimal/maximal number of local iterations; $\delta = H_{max}/H_{min}$ is the imbalance ratio
$H^i_\tau$                         Number of local iterations in the $\tau$-th epoch on the $i$-th device
$x_t$                              Global model in the $t$-th server epoch
$x^i_{\tau,h}$                     Model initialized from $x_\tau$, updated in the $h$-th local iteration, on the $i$-th device
$\mathcal{D}^i$, $z^i_{t,h}$       $\mathcal{D}^i$ is the dataset on the $i$-th device, $z^i_{t,h} \sim \mathcal{D}^i$
$\gamma$                           Learning rate
$\alpha$, $t-\tau$, $s(t-\tau)$    Mixing hyperparameter, staleness, and function of staleness for adaptive $\alpha$
$K$                                Maximum staleness: $t - \tau \le K$
$\rho$                             Regularization weight
$\|\cdot\|$                        All the norms in this paper are $\ell_2$-norms
Device                             Where the training data are placed
Worker                             Process that trains the model, one worker on each device
Table 1: Notations and Terminologies.

3 Problem formulation

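We consider federated learning with $n$ devices. On each device, a worker process trains the model on local data. The overall goal is to train a global model $x \in \mathbb{R}^d$ using data from all the devices. Formally, we consider the following optimization problem: $\min_{x \in \mathbb{R}^d} F(x)$, where $F(x) = \frac{1}{n} \sum_{i \in [n]} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i)$ and, for $\forall i \in [n]$, $z^i$ is sampled from the local data $\mathcal{D}^i$ on the $i$-th device. Note that different devices have different local datasets, i.e., $\mathcal{D}^i \neq \mathcal{D}^j, \forall i \neq j$. Thus, samples drawn from different devices may have different expectations, i.e., in general, $\mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) \neq \mathbb{E}_{z^j \sim \mathcal{D}^j} f(x; z^j), \forall i \neq j$.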
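To make the objective concrete, the following Python sketch evaluates an average-of-device-losses objective on a toy problem. Everything specific here is an illustrative assumption rather than part of the paper: the scalar model, the squared loss `0.5 * (x - z)**2`, the two hypothetical devices, and the function names.

```python
from typing import List, Sequence


def local_loss(x: float, data: Sequence[float]) -> float:
    """Empirical mean of the assumed toy loss f(x; z) = 0.5 * (x - z)**2 on one device."""
    return sum(0.5 * (x - z) ** 2 for z in data) / len(data)


def global_objective(x: float, device_data: List[Sequence[float]]) -> float:
    """Average of the per-device losses: every device gets equal weight."""
    return sum(local_loss(x, data) for data in device_data) / len(device_data)


# Two hypothetical devices whose data come from different distributions (non-IID).
device_data = [[-1.0, 0.0, 1.0], [3.0, 4.0, 5.0]]

# Each local loss is minimized at that device's own mean (0 and 4),
# while the global objective is minimized at the average of the means (2).
candidates = [0.0, 2.0, 4.0]
best = min(candidates, key=lambda x: global_objective(x, device_data))
```

In this toy setting no single device's optimum coincides with the global one, which is the difficulty that non-IID data introduces.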

4 Methodology

A single execution of federated optimization has $T$ global epochs. In the $t$-th epoch, the server receives a locally trained model $x_{new}$ from an arbitrary worker, and updates the global model by weighted averaging: $x_t = (1 - \alpha) x_{t-1} + \alpha x_{new}$, where $\alpha \in (0, 1)$ is the mixing hyperparameter.
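The server-side weighted averaging can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `mix_global_model` and the list-of-floats model representation are assumptions.

```python
from typing import List


def mix_global_model(x_prev: List[float], x_new: List[float], alpha: float) -> List[float]:
    """Return (1 - alpha) * x_prev + alpha * x_new, coordinate by coordinate."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha is assumed to lie strictly between 0 and 1")
    if len(x_prev) != len(x_new):
        raise ValueError("models must have the same number of parameters")
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(x_prev, x_new)]


# Hypothetical two-parameter model: with alpha = 0.5 the global model
# moves halfway toward the model pushed by the worker.
x_global = mix_global_model([0.0, 2.0], [1.0, 0.0], alpha=0.5)
```

A mixing hyperparameter close to 1 trusts the incoming model almost entirely, while a value close to 0 keeps the global model nearly unchanged.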

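On an arbitrary device $i$, after receiving a global model $x_t$ (potentially stale) from the server, we locally solve the following regularized optimization problem using SGD for multiple iterations: $\min_{x \in \mathbb{R}^d} \mathbb{E}_{z^i \sim \mathcal{D}^i} f(x; z^i) + \frac{\rho}{2} \|x - x_t\|^2$. For convenience, we define $g_{x'}(x; z) = f(x; z) + \frac{\rho}{2} \|x - x'\|^2$.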
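As a rough illustration of the local step, the sketch below runs SGD on a loss plus a quadratic penalty that pulls the iterate back toward the model received from the server. The scalar model, the toy loss `0.5 * (x - z)**2`, and all names and parameter values are assumptions made for the example.

```python
import random
from typing import Sequence


def local_regularized_sgd(
    x_global: float,
    data: Sequence[float],
    rho: float,
    lr: float,
    num_iters: int,
    rng: random.Random,
) -> float:
    """Run SGD on f(x; z) + (rho / 2) * (x - x_global)**2, starting from x_global.

    The toy loss f(x; z) = 0.5 * (x - z)**2 is an assumption for illustration.
    """
    x = x_global
    for _ in range(num_iters):
        z = rng.choice(data)                # sample one local data point
        grad_f = x - z                      # gradient of the toy loss
        grad_reg = rho * (x - x_global)     # gradient of the regularization term
        x -= lr * (grad_f + grad_reg)
    return x


rng = random.Random(0)
# All local points equal 2.0, so the unregularized local optimum is 2.0.
x_local = local_regularized_sgd(0.0, [2.0, 2.0, 2.0], rho=1.0, lr=0.1, num_iters=200, rng=rng)
```

With `rho=1.0` the iterate settles at 1.0, halfway between the received global model (0.0) and the local optimum (2.0), which shows how the regularizer limits how far a worker can drift from the model it started from.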

The server and workers conduct updates asynchronously. The server immediately updates the global model whenever it receives a local model. The communication between the server and workers is non-blocking.

The detailed algorithm is shown in Algorithm 1. The model parameter $x^i_{\tau,h}$ is updated in the $h$-th local iteration after receiving $x_\tau$, on the $i$-th device. $z^i_{\tau,h}$ is the data randomly drawn in the $h$-th local iteration after receiving $x_\tau$, on the $i$-th device. $H^i_\tau$ is the number of local iterations after receiving $x_\tau$, on the $i$-th device. $\gamma$ is the learning rate and $T$ is the total number of global epochs.

Remark 1.

On the server side, there are two threads running asynchronously in parallel: the scheduler and the updater. The scheduler periodically triggers training tasks on some selected workers. The updater receives locally trained models from workers and updates the global model. There could be multiple updater threads with a read-write lock on the global model, which improves the throughput. The scheduler randomizes the timing of training tasks to avoid overloading the updater thread, and controls the staleness ($t - \tau$ in the updater thread). Since we do not focus on the system design in this paper, we will not discuss the details of the scheduling strategies.
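The updater side of this design can be sketched with Python's standard `threading` and `queue` modules. This is a toy under stated assumptions, not the system described in the paper: the class `AsyncServer` and its methods are hypothetical, the model is a single float, a plain mutex stands in for the read-write lock (the standard library does not provide one), and staleness is counted as the number of server updates applied since the worker read the model.

```python
import queue
import threading
from typing import List, Tuple


class AsyncServer:
    """Toy server: worker pushes land in a queue; an updater thread mixes them in."""

    def __init__(self, x0: float, alpha: float) -> None:
        self.x = x0
        self.t = 0                                # number of global updates applied
        self.alpha = alpha
        self.lock = threading.Lock()              # mutex standing in for a read-write lock
        self.inbox: queue.Queue = queue.Queue()
        self.staleness_log: List[int] = []

    def read_model(self) -> Tuple[float, int]:
        """What a scheduler would send to a worker: the model and its time stamp."""
        with self.lock:
            return self.x, self.t

    def push(self, x_new: float, tau: int) -> None:
        """Called by workers; returns immediately without waiting for the update."""
        self.inbox.put((x_new, tau))

    def run_updater(self) -> None:
        while True:
            item = self.inbox.get()
            if item is None:                      # sentinel: shut down
                return
            x_new, tau = item
            with self.lock:
                self.staleness_log.append(self.t - tau)
                self.t += 1
                self.x = (1.0 - self.alpha) * self.x + self.alpha * x_new


server = AsyncServer(x0=0.0, alpha=0.5)
updater = threading.Thread(target=server.run_updater)
updater.start()
x, t = server.read_model()
server.push(1.0, t)     # two workers report models trained from the same snapshot
server.push(1.0, t)
server.inbox.put(None)
updater.join()
final_x, final_t = server.read_model()
```

The second push arrives one update late relative to the snapshot it was trained from, so its logged staleness is 1 while the first push has staleness 0.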

Remark 2.

Intuitively, larger staleness results in greater error when updating the global model. For local models with large staleness $(t - \tau)$, we can decrease $\alpha$ to mitigate the error caused by staleness. As shown in Algorithm 1, optionally, we can use a function $s(t - \tau)$ to decide the value of $\alpha$. In general, $s(t - \tau)$ should be $1$ when $t = \tau$, and monotonically decrease when $(t - \tau)$ increases. Many functions satisfy these two properties, with different rates of decrease, e.g., $s_a(t - \tau) = \frac{1}{(t - \tau + 1)^a}$. The options used in this paper can be found in Section 6.2.

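   Server Process:
  Initialize $x_0$, $\alpha_t \leftarrow \alpha, \forall t \in [T]$
    Scheduler Thread:
  Scheduler periodically triggers training tasks on some workers, and sends them the latest global model with time stamp
    Updater Thread:
  for all epoch $t \in [T]$ do
     Receive the pair $(x_{new}, \tau)$ from any worker
     Optional: $\alpha_t \leftarrow \alpha \times s(t - \tau)$, $s(\cdot)$ is a function of the staleness
     $x_t \leftarrow (1 - \alpha_t) x_{t-1} + \alpha_t x_{new}$
  end for
   Worker Processes:
  for all $i \in [n]$ in parallel do
     If triggered by the scheduler:
     Receive the pair of the global model and its time stamp $(x_t, t)$ from the server
     $\tau \leftarrow t$, $x^i_{\tau,0} \leftarrow x_t$
     Define $g_{x_t}(x; z) = f(x; z) + \frac{\rho}{2} \|x - x_t\|^2$, where $\rho > \mu$
     for all local iteration $h \in [H^i_\tau]$ do
        Randomly sample $z^i_{\tau,h} \sim \mathcal{D}^i$, update $x^i_{\tau,h} \leftarrow x^i_{\tau,h-1} - \gamma \nabla g_{x_t}(x^i_{\tau,h-1}; z^i_{\tau,h})$
     end for
     Push $(x^i_{\tau,H^i_\tau}, \tau)$ to the server
  end for
Algorithm 1 Asynchronous Federated Optimization (FedAsync)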
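The overall flow of Algorithm 1 can be imitated in a single process. The sketch below is a simulation under stated assumptions rather than the authors' implementation: the model is a scalar, the loss is the toy `0.5 * (x - z)**2`, asynchrony is imitated by letting each reporting worker start from a randomly chosen older global model, and staleness is counted as the number of server updates applied since the worker read the model. All names and parameter values are hypothetical.

```python
import random
from typing import Callable, List


def fedasync_simulation(
    device_data: List[List[float]],
    num_epochs: int,
    alpha: float,
    rho: float,
    lr: float,
    local_iters: int,
    max_staleness: int,
    staleness_fn: Callable[[int], float],
    seed: int = 0,
) -> float:
    """Single-process simulation of asynchronous federated optimization."""
    rng = random.Random(seed)
    history = [0.0]                      # history[t] is the global model after t updates
    for t in range(1, num_epochs + 1):
        # Imitate asynchrony: the worker trained from a possibly stale global model.
        staleness = rng.randint(0, min(max_staleness, t - 1))
        x_stale = history[t - 1 - staleness]
        data = rng.choice(device_data)   # an arbitrary worker reports back
        # Worker: SGD on the regularized local objective.
        x = x_stale
        for _ in range(local_iters):
            z = rng.choice(data)
            x -= lr * ((x - z) + rho * (x - x_stale))
        # Server: staleness-adaptive weighted averaging.
        alpha_t = alpha * staleness_fn(staleness)
        history.append((1.0 - alpha_t) * history[-1] + alpha_t * x)
    return history[-1]


# Two hypothetical devices with non-IID data (means 0 and 4).
devices = [[-1.0, 0.0, 1.0], [3.0, 4.0, 5.0]]
x_final = fedasync_simulation(
    devices, num_epochs=2000, alpha=0.1, rho=0.5, lr=0.05, local_iters=10,
    max_staleness=4, staleness_fn=lambda s: (s + 1) ** -0.5,
)
```

Every step in this toy is a convex combination of earlier iterates and data points, so the global model always stays within the range of the data. It can be expected to hover around the average of the two device means, although the exact value depends on the seed.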

5 Convergence analysis

In this section, we prove the convergence of Algorithm 1 with non-IID data.

5.1 Assumptions

First, we introduce some definitions and assumptions for our convergence analysis.

Definition 1.

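(Smoothness) A differentiable function $f$ is $L$-smooth if for $\forall x, y$, $f(x) - f(y) \le \langle \nabla f(y), x - y \rangle + \frac{L}{2} \|x - y\|^2$, where $L > 0$.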

Definition 2.

(Weak convexity) A differentiable function $f$ is $\mu$-weakly convex if the function $g$ with $g(x) = f(x) + \frac{\mu}{2} \|x\|^2$ is convex, where $\mu \ge 0$.

Note that when $f$ is $\mu$-weakly convex, then $f$ is convex if $\mu = 0$, and potentially non-convex if $\mu > 0$.
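A short derivation may help explain why Definition 2 matters for the regularized local problem of Section 4. It is a sketch under assumed notation ($f$ for the loss, $\mu$ for the weak-convexity constant, $\rho$ for the regularization weight, $x'$ for the model received from the server), and it relies only on the standard fact that adding an affine function preserves convexity.

```latex
\begin{align*}
g_{x'}(x) &= f(x) + \frac{\rho}{2}\|x - x'\|^2 \\
          &= \underbrace{f(x) + \frac{\mu}{2}\|x\|^2}_{\text{convex by Definition 2}}
           + \underbrace{\frac{\rho - \mu}{2}\|x\|^2}_{(\rho - \mu)\text{-strongly convex if } \rho > \mu}
           \underbrace{{} - \rho \langle x, x' \rangle + \frac{\rho}{2}\|x'\|^2}_{\text{affine in } x}.
\end{align*}
```

Hence, whenever $\rho > \mu$, the regularized local objective is $(\rho - \mu)$-strongly convex even if $f$ itself is non-convex, which is consistent with taking the regularization weight large enough in the analysis.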

5.2 Convergence guarantees

Based on the assumptions above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.

Theorem 1.

Assume that is -smooth and -weakly convex, and each worker executes at least and at most local updates before pushing models to the server. We assume bounded delay . The imbalance ratio of local updates is . Furthermore, we assume that for , and , we have and , . Taking large enough such that and , and , after global updates, Algorithm 1 converges to a critical point:

Taking , , , we have

Remark 3.

The mixing hyperparameter controls the trade-off between the convergence rate and additional error caused by variance. Larger makes the term converge faster to , but also enlarges the error term .

Remark 4.

In general, the additional error is controlled by two factors: the maximum delay $K$ and the imbalance ratio $\delta$. Larger delay and imbalance ratio result in slower convergence.

6 Experiments

In this section, we empirically evaluate the proposed algorithm.

6.1 Datasets

We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky and Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We resize each image and crop it to the shape of . We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. We chose a simple network architecture so that it can be easily handled by mobile devices. In each experiment, the training set is partitioned onto devices. Each of the partitions has images. For any worker, the minibatch size for SGD is .

6.2 Evaluation setup

The experiments are conducted on CPU devices. We implement the federated optimization algorithms using the MXNet (Chen et al., 2015) framework.

The baseline algorithm is FedAvg introduced by McMahan et al. (2016), which is synchronous federated optimization. For FedAvg, in each epoch, devices are randomly selected to launch local updates. We also use single-thread SGD as the baseline. For the two baseline algorithms, we use grid search to tune the learning rates and report the best results according to the top-1 accuracy on the testing set.

We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set, and cross entropy loss function on the training set as the evaluation metrics.
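For reference, the two evaluation metrics and the averaging over repeated runs can be computed as in the following plain-Python sketch. The function names, the tiny probability table, and the per-run accuracies are hypothetical values chosen for illustration, not results from the paper.

```python
import math
from typing import Sequence


def top1_accuracy(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-probability class equals the label."""
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda k: p[k])   # index of the largest probability
        hits += int(predicted == y)
    return hits / len(labels)


def cross_entropy(probs: Sequence[Sequence[float]], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


# Hypothetical predicted class probabilities for four samples and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]
labels = [0, 1, 0, 1]
acc = top1_accuracy(probs, labels)      # the first two samples are classified correctly
loss = cross_entropy(probs, labels)

# Averaging a metric over repeated runs (hypothetical per-run accuracies).
run_accuracies = [0.5, 0.75]
mean_accuracy = sum(run_accuracies) / len(run_accuracies)
```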

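For convenience, we name Algorithm 1 as FedAsync. We also test the performance of FedAsync with adaptive mixing hyperparameters $\alpha_t = \alpha \times s(t - \tau)$, as mentioned in Section 4. We employ the following two strategies, parameterized by $a$ and $b$:

  • Polynomial: $s_a(t - \tau) = (t - \tau + 1)^{-a}$.

  • Hinge: $s_{a,b}(t - \tau) = 1$ if $t - \tau \le b$, and $s_{a,b}(t - \tau) = \frac{1}{a (t - \tau - b) + 1}$ otherwise.

For convenience, we refer to FedAsync with polynomial adaptive $\alpha$ as FedAsync+Poly, and FedAsync with hinge adaptive $\alpha$ as FedAsync+Hinge.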
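The two strategies can be written as small Python functions. This sketch assumes a polynomial form that decays as `(staleness + 1) ** (-a)` and a hinge form that stays at 1 up to a threshold `b` and then decays as `1 / (a * (staleness - b) + 1)`; the function names and the parameter values used below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def polynomial_staleness(staleness: int, a: float) -> float:
    """Polynomial decay: equals 1 at zero staleness, then shrinks as (staleness + 1) ** (-a)."""
    return (staleness + 1) ** (-a)


def hinge_staleness(staleness: int, a: float, b: int) -> float:
    """Hinge decay: no penalty while staleness <= b, then 1 / (a * (staleness - b) + 1)."""
    if staleness <= b:
        return 1.0
    return 1.0 / (a * (staleness - b) + 1.0)


def adaptive_alpha(alpha: float, weight: float) -> float:
    """Scale the base mixing hyperparameter by the staleness weight."""
    return alpha * weight


# Hypothetical parameter values for illustration.
weights_poly = [polynomial_staleness(k, a=0.5) for k in (0, 3, 15)]
weights_hinge = [hinge_staleness(k, a=10.0, b=4) for k in (0, 4, 5)]
alpha_stale = adaptive_alpha(0.6, polynomial_staleness(3, a=0.5))
```

With these values the polynomial weights are 1, 1/2, and 1/4, while the hinge weights stay at 1 up to the threshold and then drop sharply to 1/11, so a base mixing hyperparameter of 0.6 shrinks to 0.3 at staleness 3 under the polynomial strategy.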

In general, it is non-trivial to compare asynchronous training and synchronous training in a fair way. We conduct two comparisons: metrics vs. number of gradients, and metrics vs. number of communications:

  • The number of gradients is the number of gradients applied to the global model. Note that for both FedAsync and FedAvg, an epoch of local iterations is a full pass of the local dataset. Thus, for FedAsync, in each global epoch, gradients is applied to the global model. For FedAvg, since , gradients is applied to the global model in each global epoch.

  • The number of communications measures the communication overhead on the server side. We count how many times the models are exchanged (sent/received) on the server. On average, in each global epoch, FedAvg has the communications of FedAsync. Single-thread SGD has no communication, so we ignore it.

In all the experiments, we simulate the asynchrony by randomly sampling the staleness from a uniform distribution.

6.3 Empirical results

We test FedAsync (asynchronous federated optimization in Algorithm 1), FedAvg, and SGD, with different learning rates , regularization weights , mixing hyperparameter , and staleness. Each experiment has global epochs. decays by at the th global epoch.

In Figure 1, we show how FedAsync converges as the number of gradients grows. When the overall staleness is small, FedAsync converges as fast as SGD, and faster than FedAvg. When the staleness is larger, FedAsync converges more slowly. In the worst case, FedAsync has a convergence rate similar to that of FedAvg. When the mixing hyperparameter $\alpha$ is too large, the convergence can be unstable. With adaptive $\alpha$, the convergence is more robust.

In Figure 2, we show how FedAsync converges as the communication overhead grows. With the same amount of communication overhead, FedAsync converges faster than FedAvg when staleness is small. When staleness is large, FedAsync performs similarly to FedAvg.

In Figure 3, we show how staleness affects the convergence of FedAsync. Overall, larger staleness makes the convergence slower, but the influence is not catastrophic. Furthermore, using adaptive mixing hyperparameters, the instability caused by large staleness can be mitigated.

In Figure 4, we show how the mixing hyperparameter $\alpha$ affects the convergence of FedAsync. In general, FedAsync is robust to different $\alpha$. When the staleness is large, smaller $\alpha$ is better for FedAsync, while larger $\alpha$ is better for FedAsync+Poly and FedAsync+Hinge. That is, because adaptive $\alpha$ is automatically adjusted to be smaller when the staleness is large, we need not manually decrease $\alpha$.

Figure 1: Metrics vs. # of gradients, for two settings of the maximum staleness. Panels (a) and (c): top-1 accuracy on the testing set. Panels (b) and (d): cross entropy on the training set.
Figure 2: Metrics vs. # of communications, for two settings of the maximum staleness. Panels (a) and (c): top-1 accuracy on the testing set. Panels (b) and (d): cross entropy on the training set.
Figure 3: Metrics at the end of training (at the 2000th epoch), with different staleness. Panel (a): top-1 accuracy on the testing set. Panel (b): cross entropy on the training set.
Figure 4: Metrics at the end of training (at the 2000th epoch), with different values of the mixing hyperparameter, for two settings of the maximum staleness. Panels (a) and (c): top-1 accuracy on the testing set. Panels (b) and (d): cross entropy on the training set.

6.4 Discussion and evaluation conclusion

In general, the convergence rate of FedAsync is between those of single-thread SGD and FedAvg. Larger $\alpha$ and smaller staleness make FedAsync closer to single-thread SGD. Smaller $\alpha$ and larger staleness make FedAsync closer to FedAvg.

FedAsync is generally insensitive to hyperparameters. When the staleness is large, we can tune to improve the convergence. Without adaptive , smaller is better for larger staleness. For adaptive , our best choice empirically was FedAsync+Poly with .

Larger staleness makes the convergence slower and unstable. There are three ways to control the influence of staleness:

  • On the server side, the updater thread can drop the updates with large staleness . This can also be viewed as a special case of adaptive mixing hyperparameter .

  • Using adaptive mixing hyperparameters improves the convergence, as shown in Figure 3. Different strategies show different improvement. So far we find that FedAsync+Poly with has the best performance.

  • On the server side, the scheduler thread can control the assignment of training tasks to the workers. If the on-device training is triggered less frequently, the overall staleness will be smaller.

In general, FedAsync+Poly and FedAsync+Hinge have similar performance. FedAsync+Hinge performs slightly worse than FedAsync+Poly, since it does not penalize the update when .

We conclude that FedAsync has the following advantages compared to FedAvg:

  • Efficiency: The server can receive the updates from the workers at any time. Unlike FedAvg, stragglers’ updates will not be dropped. When the staleness is small, FedAsync converges much faster than FedAvg. When the staleness is large, FedAsync still has similar performance as FedAvg.

  • Flexibility: If some workers are no longer available for the training tasks (the devices are no longer idle, charging, or connected to unmetered networks), they can temporarily save the workspace, and continue the training or push the trained model to the server later. This also gives more flexibility to the scheduler on the server. Unlike FedAvg, FedAsync can schedule training tasks even if the workers are currently unavailable, since the server does not wait until the workers respond. The currently unavailable workers can start the training tasks later.

  • Scalability: Compared to FedAvg, FedAsync can handle more workers running in parallel since all the updates on the server and the workers are non-blocking. The server only needs to randomize the responding time of the workers to avoid congesting the network.

  • Fault tolerance: For FedAsync, catastrophic failures of workers can be regarded as a special case of unavailability. If some workers crash, the server will ignore them and fail gracefully (equivalent to ignoring data from those workers), since all the updates are non-blocking. In contrast, FedAvg requires a time-out mechanism for fault tolerance, which potentially slows down the progress due to the waiting time.

7 Conclusion

We proposed a novel asynchronous federated optimization algorithm on non-IID training data. We proved the convergence for a restricted family of non-convex problems. Our empirical evaluation validated both fast convergence and staleness tolerance. An interesting future direction is the design of strategies to adaptively tune mixing hyperparameters.