1 Introduction
Federated Learning (FL) (McMahan et al., 2017) is a privacy-preserving framework for training models from decentralized user data residing on devices at the edge. With the Federated Averaging algorithm (FedAvg), in each federated learning round, every participating device (also called a client) receives an initial model from a central server, performs stochastic gradient descent (SGD) on its local dataset, and sends back the gradients. The server then aggregates all gradients from the participating clients and updates the starting model. While batches in datacenter training can typically be assumed to be IID (independent and identically distributed), this assumption is unlikely to hold in federated learning settings. In this work, we specifically study the effects of non-identical data distributions at each client, assuming the data are drawn independently from differing local distributions. We consider a continuous range of non-identical distributions and provide empirical results over a range of hyperparameters and optimization strategies.
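A single FedAvg round, as described above, can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the `local_update` callback standing in for client-side SGD, and the flat weight vector, are simplifying assumptions of ours.

```python
import numpy as np

def fedavg_round(w, clients, local_update):
    """One FedAvg communication round (illustrative sketch).

    `clients` is a list of (x, y) local datasets; `local_update(w, data)`
    runs local SGD on one client and returns that client's weight update.
    Both are placeholders, not part of any specific library.
    """
    total = sum(len(x) for x, _ in clients)
    delta = np.zeros_like(w)
    for x, y in clients:
        # Weight each client's update by its share of the total examples.
        delta += (len(x) / total) * local_update(w, (x, y))
    return w - delta  # server applies the aggregated update
```

The server never sees raw client data, only the weighted average of the clients' updates.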
2 Related Work
Several authors have explored the FedAvg algorithm on non-identical client data partitions generated from image classification datasets. McMahan et al. (2017) synthesize pathological non-identical user splits from the MNIST dataset, sorting training examples by class label and partitioning them into shards such that each client is assigned 2 shards. They demonstrate that FedAvg on non-identical clients still converges to 99% accuracy, though taking more rounds than with identical clients. In a similar sort-and-partition manner, Zhao et al. (2018) and Sattler et al. (2019) generate extreme partitions of the CIFAR-10 dataset, forming a population consisting of 10 clients in total. These settings are somewhat unrealistic, as practical federated learning would typically involve a larger pool of clients and more complex distributions than simple partitions.
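The sort-and-partition scheme used in these works can be sketched as follows. This is a minimal NumPy version under our own assumptions (equal-size shards, random shard assignment); the function name is ours, not from the cited papers.

```python
import numpy as np

def pathological_split(labels, num_clients, shards_per_client=2, seed=0):
    """Sort-and-partition split in the style of McMahan et al. (2017):
    sort example indices by class label, cut them into equal shards,
    and hand each client `shards_per_client` shards at random, so each
    client sees only a handful of classes."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")  # group examples by class
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    return [np.concatenate([shards[s] for s in
                            shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
            for c in range(num_clients)]
```

With 2 shards per client, each client holds examples from at most 2 classes, which is the pathological extreme these papers study.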
Other authors look at more realistic data distributions at the client. For example, Caldas et al. (2018) use Extended MNIST (Cohen et al., 2017) with partitions over the writers of the digits, rather than simply partitioning over digit class. Closely related to our work, Yurochkin et al. (2019) use a Dirichlet distribution with concentration parameter 0.5 to synthesize non-identical datasets. We extend this idea, exploring a continuous range of concentrations, with a detailed exploration of optimal hyperparameter and optimization settings.
On the theoretical side, prior work has studied the convergence of FedAvg variants under different conditions. Sahu et al. (2018) introduce a proximal term to the client objectives and prove convergence guarantees. Li et al. (2019) analyze FedAvg under proper sampling and averaging schemes for strongly convex problems.
3 Synthetic Non-Identical Client Data
In our visual classification task, we assume that on every client, training examples are drawn independently with class labels following a categorical distribution over $N$ classes parameterized by a vector $q$ ($q_i \ge 0$ and $\|q\|_1 = 1$). To synthesize a population of non-identical clients, we draw $q \sim \mathrm{Dir}(\alpha p)$ from a Dirichlet distribution, where $p$ characterizes a prior class distribution over the $N$ classes and $\alpha > 0$ is a concentration parameter controlling the identicalness among clients. We experiment with 8 values of $\alpha$ to generate populations that cover a spectrum of identicalness. With $\alpha \to \infty$, all clients have distributions identical to the prior; with $\alpha \to 0$, on the other extreme, each client holds examples from only one class chosen at random.

In this work, we use the CIFAR-10 (Krizhevsky et al., 2009) image classification dataset, which contains 60,000 images (50,000 for training, 10,000 for testing) from 10 classes. We generate balanced populations consisting of 100 clients, each holding 500 images. We set the prior distribution $p$ to be uniform across the 10 classes, identical to the test set on which we report performance. For every client, given an $\alpha$, we sample $q \sim \mathrm{Dir}(\alpha p)$ and assign the client the corresponding number of images from the 10 classes. Figure 1 illustrates populations drawn from the Dirichlet distribution with different concentration parameters.
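To make the synthesis procedure concrete, here is a minimal NumPy sketch. It is not the authors' code: for simplicity each client draws without replacement within itself but may overlap with other clients, unlike the disjoint 500-image assignment described above, and the function name is ours.

```python
import numpy as np

def dirichlet_clients(labels, num_clients, alpha, examples_per_client, seed=0):
    """Synthesize non-identical client datasets: for each client, draw a
    class mixture q ~ Dir(alpha * p) (p uniform here), then sample a
    multinomial count per class and pick that many examples of each class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    prior = np.ones(len(classes)) / len(classes)  # uniform prior p
    clients = []
    for _ in range(num_clients):
        q = rng.dirichlet(alpha * prior)               # per-client class mixture
        counts = rng.multinomial(examples_per_client, q)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == c), size=k, replace=False)
            for c, k in zip(classes, counts) if k > 0])
        clients.append(idx)
    return clients
```

Large `alpha` yields near-uniform clients; small `alpha` concentrates each client's mass on very few classes.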
4 Experiments and Results
Given the above dataset preparation, we now benchmark the performance of the vanilla FedAvg algorithm across a spectrum of distributions, from identical to non-identical.
We use the same CNN architecture and notation as McMahan et al. (2017), except that a weight decay of 0.004 is used and no learning-rate decay schedule is applied. This model is not the state of the art on the CIFAR-10 dataset, but it is sufficient to show relative performance for the purposes of our investigation.
FedAvg is run under a fixed client batch size $B$, a set of local epoch counts $E$, and reporting fractions $C \in \{0.05, 0.1, 0.2, 0.4\}$ (corresponding to 5, 10, 20, and 40 of the 100 clients participating in every round, respectively) for a total of 10,000 communication rounds. We perform a hyperparameter search over a grid of client learning rates.

4.1 Classification Performance with Non-Identical Distributions
Figure 3 shows classification performance as a function of the Dirichlet concentration parameter $\alpha$ (larger $\alpha$ implies more identical distributions). Significant changes in test accuracy occur at low $\alpha$, where the clients approach one-class populations. Increasing the reporting fraction $C$ yields diminishing returns, and the gain in performance is especially marginal for identically distributed client datasets. Interestingly, for a fixed budget of optimization rounds, synchronizing the weights more frequently (i.e., a smaller local epoch count $E$) does not always improve accuracy on non-identical data.
In addition to reduced end-of-training accuracy, we also observe more volatile training error with more non-identical data; see Figure 3. Runs with a small reporting fraction struggle to converge within 10,000 communication rounds.
Hyperparameter sensitivity.
As well as affecting overall accuracy on the test set, the learning conditions specified by $\alpha$ and $C$ have a significant effect on hyperparameter sensitivity. On the identical end (large $\alpha$), a wide range of learning rates (about two orders of magnitude) produces good accuracy on the test set. However, with smaller values of $\alpha$ and $C$, careful tuning of the learning rate is required to reach good accuracy. See Figure 4.
4.2 Accumulating Model Updates with Momentum
Using momentum on top of SGD has proven highly successful in accelerating network training, keeping a running accumulation of the gradient history to dampen oscillations. This seems particularly relevant for FL, where participating parties may have sparse data distributions and hold a limited subset of labels. In this subsection we test the effect of momentum at the server on the performance of FedAvg.
Vanilla FedAvg updates the weights via $w \leftarrow w - \Delta w$, where $\Delta w = \sum_k \frac{n_k}{n} \Delta w_k$ ($n_k$ is the number of examples on the $k$'th client, $\Delta w_k$ is the weight update from the $k$'th client, and $n = \sum_k n_k$). To add momentum at the server, we instead compute a velocity $v \leftarrow \beta v + \Delta w$ and update the model with $w \leftarrow w - v$. We term this approach FedAvgM (Federated Averaging with Server Momentum).
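The server-side update described above can be sketched directly. As a hedge: this implements classical (heavy-ball) momentum with an illustrative momentum of 0.9 and a server learning-rate parameter of our own choosing, whereas the experiments below use the Nesterov variant.

```python
import numpy as np

def fedavgm_step(w, v, client_deltas, client_sizes, beta=0.9, server_lr=1.0):
    """One FedAvgM server step: aggregate the weighted client updates
    Delta_w = sum_k (n_k / n) * Delta_w_k, accumulate the velocity
    v <- beta * v + Delta_w, then apply w <- w - server_lr * v."""
    n = sum(client_sizes)
    delta = sum((nk / n) * dw for nk, dw in zip(client_sizes, client_deltas))
    v = beta * v + delta
    return w - server_lr * v, v
```

Setting `beta=0` recovers vanilla FedAvg; the velocity `v` carries information across communication rounds, which is what dampens the oscillations induced by sparse, non-identical client updates.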
In experiments, we use the Nesterov accelerated gradient (Nesterov, 2007) with momentum $\beta$. The model architecture, client batch size $B$, and learning rate are the same as for vanilla FedAvg in the previous subsection. The learning rate of the server optimizer is held constant at 1.0.
Effect of server momentum.
Figure 5 shows the effect of learning with non-identical data both with and without server momentum. Test accuracy improves consistently for FedAvgM over FedAvg, with performance close to the centralized learning baseline in many cases. For example, with some settings of $E$ and $C$, FedAvgM performance stays relatively constant at high accuracy, whereas FedAvg accuracy falls rapidly as the data become more non-identical.
Hyperparameter dependence on $C$ and $E$.
Hyperparameter tuning is harder for FedAvgM as it involves an additional hyperparameter $\beta$. In Figure 6, we plot accuracy against the effective learning rate, defined as $\eta_{\mathrm{eff}} = \eta / (1 - \beta)$ (Shallue et al., 2018), which suggests an optimal $\eta_{\mathrm{eff}}$ for each set of learning conditions. Notably, when the reporting fraction $C$ is large, the selection of $\eta_{\mathrm{eff}}$ is easier, and a range of values across two orders of magnitude yields reasonable test accuracy. In contrast, when only a few clients report each round, the viable window for $\eta_{\mathrm{eff}}$ can be as small as one order of magnitude. To prevent client updates from diverging, we additionally have to use a combination of a low absolute learning rate and high momentum. The local epoch parameter $E$ affects the choice of learning rate as well: extensive local optimization increases the variance of clients' weight updates, so a lower $\eta_{\mathrm{eff}}$ is necessary to counteract the noise.

References
Caldas et al. (2018). LEAF: a benchmark for federated settings. arXiv preprint arXiv:1812.01097.
Cohen et al. (2017). EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
Li et al. (2019). On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189.
McMahan et al. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
Nesterov (2007). Gradient methods for minimizing composite objective function.
Sahu et al. (2018). On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
Sattler et al. (2019). Robust and communication-efficient federated learning from non-IID data. arXiv preprint arXiv:1903.02891.
Shallue et al. (2018). Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.
Yurochkin et al. (2019). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252–7261.
Zhao et al. (2018). Federated learning with non-IID data. arXiv preprint arXiv:1806.00582.