Federated Visual Classification with Real-World Data Distribution

03/18/2020 ∙ by Tzu Ming Harry Hsu, et al. ∙ Google ∙ MIT

Federated Learning enables visual models to be trained on-device, bringing advantages for user privacy (data need never leave the device), but challenges in terms of data diversity and quality. Whilst typical models in the datacenter are trained using data that are independent and identically distributed (IID), data at source are typically far from IID. Furthermore, differing quantities of data are typically available at each device (imbalance). In this work, we characterize the effect these real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm. To do so, we introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits that simulate real-world edge learning scenarios. We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.




1 Introduction

Federated learning (FL) is a privacy-preserving framework, originally introduced by McMahan et al. [21], for training models from decentralized user data residing on devices at the edge. Models are trained iteratively across many federated rounds. For each round, every participating device (a.k.a. client) receives an initial model from a central server, performs stochastic gradient descent (SGD) on its local training data, and sends back the gradients. The server then aggregates all gradients from the participating clients and updates the starting model. FL preserves user privacy in that the raw data used for training models never leave the devices throughout the process. In addition, differential privacy [22] can be applied for a theoretically bounded guarantee that no information about individuals can be derived from the aggregated values on the central server.

Federated learning is an active area of research with a number of open questions [18, 14] remaining to be answered. A particular challenge is the distribution of data at user devices. Whilst in centralized training, data can be assumed to be independent and identically distributed (IID), this assumption is unlikely to hold in federated settings. Decentralized training data on end-user devices will vary due to user-specific habits, preferences, geographic locations, etc. Furthermore, in contrast to the streamed batches from a central data store in the data center, devices participating in an FL round will have differing amounts of data available for training.

In this work, we study the effect these heterogeneous client data distributions have on learning visual models in a federated setting, and propose novel techniques for more effective and efficient federated learning. We focus in particular on two types of distribution shift: Non-Identical Class Distribution, meaning that the distribution of visual classes at each device is different, and Imbalanced Client Sizes, meaning that the amount of data available for training at each device varies. Our key contributions are:

  • We analyze the effect of learning with per-user data in real-world datasets, in addition to carefully controlled setups with parametric (Dirichlet) and natural (geographic) distributions.

  • We propose two new algorithms to mitigate per-client distribution shift and imbalance, substantially improving both classification accuracy and stability.

  • We provide new large-scale datasets with per-user data for two classification problems (natural world and landmark recognition) to the community.

Ours is the first work to our knowledge that attempts to train large-scale visual classification models for real-world problems in a federated setting. We expect that more is to be done to achieve robust performance in this and related settings, and plan to make our datasets and benchmarks available to the community to enable future research in this area.

2 Related Work

2.0.1 Synthetic Client Data

Several authors have explored the FedAvg algorithm on synthetic non-identical client data partitions generated from image classification datasets. McMahan et al. [21] synthesize pathological non-identical user splits from the MNIST dataset, sorting training examples by class labels and partitioning into shards such that each client is assigned 2 shards. They demonstrate that FedAvg on non-identical clients still converges to 99% accuracy, though it takes more rounds than with identically distributed clients. In a similar sort-and-partition manner, [37, 28] use extreme partitions of the CIFAR-10 dataset to form a population consisting of 10 clients in total. In contrast to these pathological data splits, Yurochkin et al. [35] and Hsu et al. [12] synthesize more diverse non-identical datasets with Dirichlet priors.
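The sort-and-partition scheme described above can be sketched in a few lines. This is only an illustrative sketch, not the code used in [21]: the function name `shard_partition`, the toy label list, and the random dealing of shards are assumptions made for the example.

```python
import random

def shard_partition(labels, num_clients, shards_per_client=2, seed=0):
    """Sort example indices by label, cut them into equal shards,
    and deal `shards_per_client` shards to every client."""
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(order) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)  # hypothetical choice: shards are dealt at random
    clients = []
    for c in range(num_clients):
        mine = shards[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append([i for shard in mine for i in shard])
    return clients

# Toy data: 10 classes with 100 examples each.
labels = [i % 10 for i in range(1000)]
clients = shard_partition(labels, num_clients=20)
classes_per_client = [len({labels[i] for i in c}) for c in clients]
```

In this toy setup every shard holds a single class, so each client sees at most two classes, which is what makes the split pathological.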

2.0.2 Realistic Datasets

Other authors look at more realistic data distributions at the client. For example, Caldas et al. [3] use the Extended MNIST dataset [4] split over the writers of the digits and the CelebA dataset [19] split by the celebrity on the picture. The Shakespeare and Stack Overflow datasets [8] contain natural per-user splits of textual data using roles and online user ids, respectively. Luo et al. [20] propose a dataset containing 900 images from 26 street-level cameras, which they use to train object detectors. These datasets are however limited in size, and are not representative of data captured on user devices in a federated learning context. Our work aims to address these limitations (see Section 4).

Variance reduction methods have been used in the federated learning literature to correct for the distribution shift caused by heterogeneous client data. Sahu et al. [26] introduce a proximal term to client objectives for bounded variance. Karimireddy et al. [15] propose to use control variates for correcting the drift of client updates. Importance sampling is a classic technique for variance reduction in Monte Carlo methods [13, 10] and has been used widely in the domain adaptation literature for countering covariate and target shift [25, 36, 23]. In this work, we adopt a similar idea of importance reweighting in a novel federated setting, resulting in augmented client objectives. Unlike the classic setting, where samples are drawn from one proposal distribution that has the same support as the target, heterogeneous federated clients form multiple proposal distributions, each of which has only partially common support with the target.

3 Federated Visual Classification Problems

Figure 1: iNaturalist Species Distribution. Visualized here are the distributions of Douglas-Fir and Red Maple in the continental US within iNaturalist. In a federated learning context, visual categories vary with location, and users in different locations will have very different training data distributions.

Many problems in visual classification involve data that vary around the globe [6, 9]. This means that the distribution of data visible to a given user device will vary, sometimes substantially. For example, user observations in the citizen scientist app iNaturalist will depend on the underlying species distribution in that region (see Figure 1). Many other factors could potentially influence the data present on a device, including the interests of the user, their photography habits, etc. For this study we choose two problems with an underlying geographic variation to illustrate the general problem of non-identical user data, Natural Species Classification and Landmark Recognition:

3.0.1 Natural Species Classification

We create a dataset and classification problem based on the iNaturalist 2017 Challenge [31], where images are contributed by a community of citizen scientists around the globe. Domain experts take pictures of natural species and provide annotations during field trips. Fine-grained visual classifiers could potentially be trained in a federated fashion with this community of citizen scientists without transferring images.

3.0.2 Landmark Recognition

We study the problem of visual landmark recognition based on the 2019 Landmark Recognition Challenge [1], where the images are taken and uploaded by Wikipedia contributors. It resembles a scenario where smartphone users take photos of natural and architectural landmarks (e.g., famous buildings, monuments, mountains, etc.) while traveling. Landmark recognition models could potentially be trained via federated learning without uploading or storing private user photos at a centralized party.

Both datasets have data partitioning per user, enabling us to study a realistic federated learning scenario where labeled images were provided by the user and learning proceeds on-device. For experimentation in lab settings, we use a simulation engine for running federated learning algorithms, similar to TensorFlow Federated.


4 Datasets

In the following section, we describe in detail the datasets we develop and analyze key distributional statistics as a function of user and geo-location. We plan to make these datasets available to the community.

Figure 2: iNaturalist Distribution. In (a) we show the re-balancing of the original iNaturalist-2017 dataset. In (b) and (c) we show class and example counts vs clients for our 5 iNaturalist partitionings with varying levels of class distribution shift and size imbalance. The client count is different in each partitioning.

4.1 iNaturalist-User-120k and Geo-Location Splits

iNaturalist-2017 [31] is a large scale fine-grained visual classification dataset comprised of images of natural species taken by citizen scientists. It has 579,184 training examples and 95,986 test examples covering over 5,000 classes. Images in this dataset are each associated with a fine-grained species label, a longitude-latitude coordinate where the picture was originally taken, and authorship information.

The iNaturalist-2017 training set has a very long-tailed distribution over classes as shown in Figure 2, while the test set is relatively uniform over classes. While studying learning robustly with differing training and test distributions is a topic for research [32] in itself, in our federated learning benchmark, we create class-balanced training and test sets with uniform distributions. This allows us to focus on distribution variations and imbalance at the client level, without correcting for overall domain shift between training and test sets.

To equalize the number of examples across classes, we first sort all class labels by their count and truncate tail classes with fewer than 100 training examples. We then subsample per class until all remaining classes each have 100 examples. This results in a balanced training set consisting of 1,203 classes and 120,300 examples. We use this class-balanced iNaturalist subset for the remainder of the paper.
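The truncate-then-subsample step can be illustrated as follows. This is a minimal sketch under stated assumptions: `balance_classes` is a hypothetical helper, the toy labels stand in for species labels, and the uniform random subsampling is an assumption, since the text does not say how the 100 examples per class are chosen.

```python
import random
from collections import Counter, defaultdict

def balance_classes(labels, per_class=100, seed=0):
    """Drop classes with fewer than `per_class` examples, then subsample
    every remaining class down to exactly `per_class` examples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    kept = []
    for y, idxs in by_class.items():
        if len(idxs) < per_class:
            continue  # tail class: truncated
        kept.extend(rng.sample(idxs, per_class))  # assumed: uniform subsampling
    return sorted(kept)

# Toy example with a threshold of 3 instead of 100.
toy_labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 1
kept = balance_classes(toy_labels, per_class=3)
kept_counts = Counter(toy_labels[i] for i in kept)
```

With the paper's numbers, 1,203 surviving classes at 100 examples each give the stated 120,300 training examples.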

The iNaturalist-2017 dataset also includes user ids, which we use to partition the balanced training set into a population of 9,275 clients. We refer to this partitioning as iNaturalist-User-120k.

This contributor partitioning gives us a data split that is realistic to the target problem of learning using data collected per-user. However, for experimentation, it would be very useful to have a continuous range of data distributions, with clients of varying levels of deviation from the global distribution. To achieve this with the iNaturalist dataset, we use the geo-locations provided to split the data at varying levels of granularity.

To utilize the geo-location tags, we leverage the S2 grid system, which defines a hierarchical partitioning of the planet surface. We perform an adaptive partitioning similar to [33]. Specifically, every S2 cell is recursively subdivided into four finer-level cells until no single cell contains more than $N_{\max}$ examples. Cells ending up with fewer than $N_{\min}$ examples are discarded. With this scheme, we are able to control the granularity of the resulting S2 cells such that a smaller $N_{\max}$ results in a larger client count. We use $N_{\max} \in$ {30k, 3k, 1k, 100}, and refer to the resulting data partitionings as iNaturalist-Geo-{30k, 3k, 1k, 100}, respectively. Rank statistics of our geo- and per-user data splits are shown in Figures 2(b) and 2(c). Note the client count in the geo-partitionings ranges from 11 to 3,606, which is the largest range ever studied to our knowledge.
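The recursive subdivision can be sketched as below. As a loudly stated assumption, the sketch replaces S2 cells with plain latitude/longitude rectangles split into four quadrants, so it only mirrors the control flow of the scheme, not the S2 geometry. The names `adaptive_cells`, `n_max`, and `n_min`, and the `max_depth` guard, are hypothetical.

```python
import random

def adaptive_cells(points, n_max, n_min,
                   bounds=(-90.0, 90.0, -180.0, 180.0), depth=0, max_depth=30):
    """Split a cell into four quadrants until it holds at most n_max points;
    discard final cells holding fewer than n_min points.
    `points` is a list of (lat, lng) pairs; one list of points is returned per cell."""
    if len(points) <= n_max or depth >= max_depth:
        return [points] if len(points) >= n_min else []
    lat0, lat1, lng0, lng1 = bounds
    lat_mid, lng_mid = (lat0 + lat1) / 2, (lng0 + lng1) / 2
    quads = {}
    for p in points:
        key = (p[0] >= lat_mid, p[1] >= lng_mid)
        quads.setdefault(key, []).append(p)
    cells = []
    for (hi_lat, hi_lng), pts in quads.items():
        sub = (lat_mid if hi_lat else lat0, lat1 if hi_lat else lat_mid,
               lng_mid if hi_lng else lng0, lng1 if hi_lng else lng_mid)
        cells.extend(adaptive_cells(pts, n_max, n_min, sub, depth + 1, max_depth))
    return cells

rng = random.Random(0)
points = [(rng.uniform(-90, 90), rng.uniform(-180, 180)) for _ in range(2000)]
coarse = adaptive_cells(points, n_max=500, n_min=1)
fine = adaptive_cells(points, n_max=50, n_min=1)
```

Because the finer partition refines the coarser one, lowering `n_max` can only increase the number of cells, which matches the statement that a smaller threshold yields more clients.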

Figure 3: Landmarks-User-160k Distribution. Images are partitioned according to the authorship attribute from the GLD-v2 dataset. Filtering is applied to mitigate long tail in the train split.

4.2 Landmarks-User-160k

Google Landmarks Dataset V2 (GLD-v2) [1] is a large scale image dataset for landmark recognition and retrieval, consisting of 5 million images, each attributed to one of over 280,000 authors. The full dataset is noisy: images with the same label could depict landmark exteriors, historical artifacts, paintings or sculptures inside a building. For benchmarking federated learning algorithms on a well-defined image classification problem, we use the cleaned subset (GLD-v2-clean), which is half the size of the full dataset. In this set, an image is discarded if its computed local geometric features cannot be matched to at least two other images with the same label [24].

For creating a dataset for federated learning with natural user identities, we partition the GLD-v2-clean subset according to the authorship attribute. In addition, we mitigate the long tail while maintaining realism by requiring every landmark to have at least 30 images and be visited by at least 10 users, meanwhile requiring every user to have contributed at least 30 images that depict 5 or more landmarks. The resulting dataset has 164,172 images of 2,028 landmarks from 1,262 users, which we refer to as the train split of Landmarks-User-160k.

The test split is created from the leftover images in GLD-v2-clean whose authors do not overlap with those in the train split. The test split contains 19,526 images and is well-balanced among classes. 1,835 of the landmarks have exactly 10 test images, and there is a short tail for the rest of the landmarks due to insufficient samples (Figure 3).

5 Methods

The datasets described above contain significant distribution variations between clients, which present considerable challenges for efficient federated learning [18, 14]. In the following, we describe our baseline approach, the Federated Averaging algorithm (FedAvg) (Section 5.1), and two new algorithms intended to specifically address the non-identical class distributions and imbalanced client sizes present in the data (Sections 5.2 and 5.3 respectively).

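Algorithm 1: FedAvg, FedIR, and FedVC.

Server training loop:
    Initialize $\theta_0$
    for each round $t = 0, 1, \ldots$ do
        Subset of $K$ clients ← SelectClients($K$)
        for each client $k = 1, 2, \ldots, K$ do in parallel
            $\Delta\theta_t^k$ ← ClientUpdate($k$, $\theta_t$)
        $\bar{g}_t$ ← AggregateClient($\{\Delta\theta_t^k\}_{k=1}^{K}$)
        $\theta_{t+1}$ ← $\theta_t - \gamma \bar{g}_t$

SelectClients($K$):
    FedAvg: return $K$ clients sampled uniformly
    FedVC:  return $K$ clients sampled with probability $\propto n_k$ for client $k$

ClientUpdate($k$, $\theta_t$):
    $\theta$ ← $\theta_t$
    for each local mini-batch $b$ over $E$ epochs (FedVC: over $N_{VC}/B$ steps) do
        $\theta$ ← $\theta - \eta \nabla L(b; \theta)$    (FedIR: use $\tilde{L}(b; \theta)$ in Eq. 4)
    return $\Delta\theta$ ← $\theta_t - \theta$ to server

AggregateClient($\{\Delta\theta_t^k\}_{k=1}^{K}$):
    return $\sum_{k=1}^{K} \frac{n_k}{n} \Delta\theta_t^k$, where $n = \sum_{k=1}^{K} n_k$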

5.1 Federated Averaging and Server Momentum

A standard algorithm [21] for FL, and the baseline approach used in this work, is Federated Averaging (FedAvg); see Algorithm 1. For every federated round, $K$ clients (the report goal) are randomly selected with uniform probability from a pool of active clients. Selected clients, indexed by $k$, download the same starting model $\theta_t$ from a central server and perform local SGD optimization, minimizing an empirical loss $L(b)$ over local mini-batches $b$ with learning rate $\eta$, for $E$ epochs before sending the accumulated model update $\Delta\theta_t^k$ back to the server. The server then averages the updates from the reporting clients, $\bar{g}_t = \sum_{k=1}^{K} \frac{n_k}{n} \Delta\theta_t^k$, with weights proportional to the sizes $n_k$ of clients' local data (where $n = \sum_{k=1}^{K} n_k$), and finishes the federated round by applying the aggregated updates to the starting model, $\theta_{t+1} \leftarrow \theta_t - \gamma \bar{g}_t$, where $\gamma$ is the server learning rate. Given this framework, alternative optimizers can be applied. FedAvgM [12] has been shown to improve robustness to non-identically distributed client data. It uses a momentum optimizer on the server with the update rule $\theta_{t+1} \leftarrow \theta_t - \gamma v_t$, where $v_t \leftarrow \beta v_{t-1} + \bar{g}_t$ is the exponentially weighted moving average of the model updates with powers of $\beta$.
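The server side of a round can be sketched with plain Python lists standing in for model parameters. This is an illustrative sketch, not the authors' implementation: the helper names `aggregate` and `server_step` are hypothetical, and the sign convention (a client update is the starting model minus the locally trained model, so the server subtracts it) is an assumption.

```python
def aggregate(updates, sizes):
    """Average client updates with weights proportional to client data sizes."""
    n = sum(sizes)
    dim = len(updates[0])
    return [sum(s / n * u[j] for u, s in zip(updates, sizes)) for j in range(dim)]

def server_step(theta, g_bar, gamma=1.0, velocity=None, beta=0.0):
    """One server update. beta == 0 gives plain FedAvg; beta > 0 gives
    server momentum in the style of FedAvgM. Returns (new_theta, new_velocity)."""
    if velocity is None:
        velocity = [0.0] * len(theta)
    velocity = [beta * v + g for v, g in zip(velocity, g_bar)]
    theta = [t - gamma * v for t, v in zip(theta, velocity)]
    return theta, velocity

# Two clients with 30 and 10 examples report updates for a 2-parameter model.
g_bar = aggregate([[0.2, 0.0], [0.0, 0.4]], sizes=[30, 10])
theta_avg, _ = server_step([1.0, 2.0], g_bar)               # FedAvg
theta_m, v = server_step([1.0, 2.0], g_bar, beta=0.9)       # FedAvgM, round 1
theta_m, v = server_step(theta_m, g_bar, velocity=v, beta=0.9)  # round 2
```

With momentum, the second round moves further than the first for the same aggregated update, because the velocity accumulates past updates.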

5.2 Importance Reweighted Client Objectives

Now we address the non-identical class distribution shift in federated clients. Consider a target distribution $p(x, y)$ of images $x$ and class labels $y$, on which a model being trained is supposed to perform well (e.g. a validation dataset known to the central server), and a predefined loss function $\ell(x, y)$. The objective of learning is to minimize the expected loss $\mathbb{E}_p[\ell(x, y)]$ with respect to the target distribution $p$. SGD in the centralized setting achieves this by minimizing an empirical loss on mini-batches of IID training examples from the same distribution, which are absent in the federated setting. Instead, training examples on a federated client $k$ are sampled from a client-specific distribution $q_k(x, y)$. This implies that the empirical loss being optimized on every client is a biased estimator of the loss with respect to the target distribution, since $\mathbb{E}_{q_k}[\ell(x, y)] \neq \mathbb{E}_p[\ell(x, y)]$.

We propose an importance reweighting scheme, denoted FedIR, that applies importance weights $w_k(x, y)$ to every client's local objective as follows:

$$\tilde{\ell}(x, y) = \ell(x, y)\, w_k(x, y), \quad \text{where } w_k(x, y) = \frac{p(x, y)}{q_k(x, y)}. \tag{1}$$

With the importance weights in place, an unbiased estimator of loss with respect to the target distribution can be obtained using training examples from the client distribution:

$$\mathbb{E}_p[\ell(x, y)] = \sum_{x, y} \ell(x, y)\, p(x, y) = \sum_{x, y} \ell(x, y)\, \frac{p(x, y)}{q_k(x, y)}\, q_k(x, y) = \mathbb{E}_{q_k}[\ell(x, y)\, w_k(x, y)]. \tag{2}$$

Assuming that all clients share the same conditional distribution of images given a class label as the target, i.e. $p(x \mid y) \approx q_k(x \mid y)$ for all $k$, the importance weights can be computed on every client directly from the class probability ratio:

$$w_k(x, y) = \frac{p(x, y)}{q_k(x, y)} = \frac{p(y)\, p(x \mid y)}{q_k(y)\, q_k(x \mid y)} \approx \frac{p(y)}{q_k(y)}. \tag{3}$$

Note that this computation does not compromise the privacy-preserving property of federated learning. The denominator $q_k(y)$ is private information that is available locally and never leaves client $k$, whereas the numerator $p(y)$ does not contain private information about clients and can be transmitted from the central server with minimal communication cost: $C$ scalars in total for $C$ classes.

Since scaling the loss also changes the effective learning rate in the SGD optimization, in practice, we use self-normalized weights when computing loss over a mini-batch $b$:

$$\tilde{L}(b) = \frac{\sum_{(x, y) \in b} \ell(x, y)\, w_k(y)}{\sum_{(x, y) \in b} w_k(y)}, \quad \text{where } w_k(y) = \frac{p(y)}{q_k(y)}. \tag{4}$$

This corresponds to the self-normalized importance sampling in the statistics literature [10]. FedIR does not change server optimization loops and can be applied together with other methods, such as FedAvgM. See Algorithm 1.
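A small worked example may help make the self-normalized, class-ratio weighting concrete. It is a sketch only: the function names are hypothetical, the weights are assumed to be the ratio of target to local class probabilities, and the per-example losses are made-up numbers rather than outputs of a real model.

```python
from collections import Counter

def importance_weights(target_class_probs, local_labels):
    """Per-class weight: target class probability divided by the client's
    local class frequency. Only classes present on the client get a weight."""
    counts = Counter(local_labels)
    n = len(local_labels)
    return {y: target_class_probs[y] / (c / n) for y, c in counts.items()}

def fedir_batch_loss(per_example_losses, batch_labels, weights):
    """Self-normalized importance-weighted loss over one mini-batch."""
    w = [weights[y] for y in batch_labels]
    return sum(l * wi for l, wi in zip(per_example_losses, w)) / sum(w)

# The server broadcasts a uniform target over 2 classes; the client is 90% class 0.
weights = importance_weights({0: 0.5, 1: 0.5}, local_labels=[0] * 9 + [1])
batch_labels = [0, 0, 0, 1]
batch_losses = [1.0, 1.0, 1.0, 3.0]  # hypothetical per-example losses
weighted = fedir_batch_loss(batch_losses, batch_labels, weights)
unweighted = sum(batch_losses) / len(batch_losses)
```

Here the rare class gets weight 5.0 and the common class about 0.56, so the weighted batch loss (2.5) leans toward the rare class compared with the plain mean (1.5). Only the class probabilities of the target travel from server to client.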

5.3 Splitting Imbalanced Clients with Virtual Clients

The number of training examples on users' devices varies in the real world. Imbalanced clients can cause challenges for both optimization and engineering practice. Previous empirical studies [21, 12] suggest that the number of local epochs $E$ at every client has crucial effects on the convergence of FedAvg. A larger $E$ implies more optimization steps towards local objectives being taken, which leads to slow convergence or divergence due to increased variance. Imbalanced clients suffer from this optimization challenge even when $E$ is small. Specifically, a client with a large number of training examples takes significantly more local optimization steps than another with fewer training examples. This difference in steps is proportional to the difference in the number of training examples. In addition, a client with an overly large training dataset will take a long time to compute updates, creating a bottleneck in the federated learning round. In practice, an FL production system would abandon such clients if they fail to report back to the central server within a certain time window [2].

We hence propose a new Virtual Client (FedVC) scheme to overcome both issues. The idea is to conceptually split large clients into multiple smaller ones, and repeat small clients multiple times such that all virtual clients are of similar sizes. To realize this, we fix the number of training examples used for a federated learning round to be $N_{VC}$ for every client, resulting in exactly $N_{VC}/B$ optimization steps taken at every client given a mini-batch size $B$. Concretely, consider a client $k$ with a local dataset $D_k$ of size $n_k = |D_k|$. A random subset consisting of $N_{VC}$ examples is uniformly resampled from $D_k$ for every round in which the client is selected. This resampling is conducted without replacement when $n_k \geq N_{VC}$, and with replacement otherwise. In addition, to avoid underutilizing training examples from large clients, the probability that any client is selected for a round is set to be proportional to the client size $n_k$, in contrast to uniform as in FedAvg. Key changes are outlined in Algorithm 1. It is clear that FedVC is equivalent to FedAvg when all clients are of the same size.
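The two FedVC changes, size-proportional client selection and fixed-size resampling, can be sketched as follows. The function names and the toy sizes are hypothetical. As a simplification, `random.choices` draws clients with replacement, so the same client may be picked more than once in a round; the text does not specify this detail.

```python
import random

def select_clients(client_sizes, k, rng):
    """Pick k client ids with probability proportional to client size.
    Simplification: drawn with replacement, so repeats are possible."""
    ids = list(client_sizes)
    return rng.choices(ids, weights=[client_sizes[i] for i in ids], k=k)

def virtual_client_sample(dataset, n_vc, rng):
    """Resample exactly n_vc examples: without replacement if the client
    has enough data, with replacement otherwise."""
    if len(dataset) >= n_vc:
        return rng.sample(dataset, n_vc)
    return rng.choices(dataset, k=n_vc)

rng = random.Random(0)
picks = select_clients({"small": 1, "large": 99}, k=1000, rng=rng)
large = virtual_client_sample(list(range(1000)), n_vc=8, rng=rng)
small = virtual_client_sample(list(range(3)), n_vc=8, rng=rng)
steps_per_round = 8 // 4  # virtual client size 8, mini-batch size 4
```

Every selected client then runs the same number of local steps per round, regardless of how much data it holds.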

6 Experiments

Dataset                   Clients       Classes   Examples   Centralized Accuracy
CIFAR-10                  100           10        50,000     86.16%
CIFAR-100                 100           100       50,000     55.21%
iNaturalist Geo Splits    11 to 3,606   1,203     120,300    57.90%
iNaturalist-User-120k     9,275         1,203     120,300    57.90%
Landmarks-User-160k       1,262         2,028     164,172    67.05%

Table 1: Training Dataset Statistics. Note that while the CIFAR-10/100 and iNaturalist datasets each have different partitionings with different levels of identicalness, the underlying data pool is unchanged and the partitionings thus share the same centralized learning baselines.

We now present an empirical study using the datasets and methods of Sections 4 and 5. We start by analyzing the classification performance as a function of non-identical data distribution (Section 6.1), using the CIFAR-10/100 datasets. Next we show how Importance Reweighting can improve performance in the more non-identical cases (Section 6.2). With real user data, clients are also imbalanced; we show how this can be mitigated with Federated Virtual Clients in Section 6.3. Finally we present a set of benchmark results with the per-user splits of the iNaturalist and Landmarks datasets (Section 6.4). A summary of the datasets used is provided in Table 1. Implementation details are deferred to Section 6.5.

6.0.1 Metrics

When using the same dataset, the performance of a model trained with federated learning algorithms is inherently upper bounded by that of a model trained in the conventional centralized fashion. We evaluate the relative accuracy, defined as $\mathrm{Acc}_{\text{federated}} / \mathrm{Acc}_{\text{centralized}}$, and compare this metric under different types of budgets. The centralized training baseline uses the same configurations and hyperparameters for a fair comparison.

Figure 4: Relative Accuracy vs. Non-identicalness. Federated learning experiments are performed on (a) CIFAR-10 and (b) CIFAR-100 with a fixed number of local epochs $E$. The top row shows the distributions of Earthmover's Distance (EMD) of clients with different data partitionings. Total client counts are annotated to the right, and the weighted average of all client EMD is marked. Data is increasingly non-identical to the right and the dashed line indicates the centralized learning performance. The best accuracies over a grid of hyperparameters are reported (see Appendix 0.A.1).

6.1 Classification Accuracy vs Distribution Non-Identicalness

Our experiments use the CIFAR-10/100 datasets to characterize classification accuracy with a continuous range of distribution non-identicalness. We follow the protocol described in [12] such that the class distribution of every client is sampled from a Dirichlet distribution with varying concentration parameter $\alpha$.
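To illustrate how the concentration parameter controls non-identicalness, the sketch below draws per-client class distributions from a symmetric Dirichlet. It is not the protocol code of [12]: the helper `sample_dirichlet` is hypothetical, it uses the standard construction of normalizing independent Gamma draws, and the two concentration values are chosen only for the demonstration.

```python
import random

def sample_dirichlet(alpha, num_classes, rng):
    """Draw one class distribution from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
# Small concentration: each client is dominated by a few classes.
skewed = [sample_dirichlet(0.5, 10, rng) for _ in range(50)]
# Large concentration: each client is close to uniform over classes.
near_uniform = [sample_dirichlet(100.0, 10, rng) for _ in range(50)]

mean_max_skewed = sum(max(q) for q in skewed) / len(skewed)
mean_max_uniform = sum(max(q) for q in near_uniform) / len(near_uniform)
```

Comparing the average largest class probability per client shows the effect: it is much higher for the small concentration value than for the large one.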

We measure distribution non-identicalness using an average Earthmover's Distance (EMD) metric. Specifically, we take the discrete class distribution $q_k$ for every client $k$, and define the population's class distribution as $p = \sum_k \frac{n_k}{n} q_k$, where $n = \sum_k n_k$ counts training samples from all clients. The non-identicalness of a dataset is then computed as the weighted average of distances between clients and the population: $\sum_k \frac{n_k}{n} \mathrm{Dist}(q_k, p)$. Here $\mathrm{Dist}(\cdot, \cdot)$ is a distance metric between two distributions; in particular, we use $\mathrm{EMD}(q, p) \equiv \lVert q - p \rVert_1$, bounded between $[0, 2]$.
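The non-identicalness measure can be computed directly from per-client class distributions and client sizes. The sketch below assumes the distance is the L1 norm between class distributions, which is consistent with the reported values topping out near 2.0; the helper names are hypothetical.

```python
def population_distribution(client_dists, client_sizes):
    """Size-weighted average of the clients' class distributions."""
    n = sum(client_sizes)
    num_classes = len(client_dists[0])
    return [sum(s / n * q[c] for q, s in zip(client_dists, client_sizes))
            for c in range(num_classes)]

def mean_emd(client_dists, client_sizes):
    """Size-weighted average L1 distance between each client and the population."""
    p = population_distribution(client_dists, client_sizes)
    n = sum(client_sizes)
    return sum(s / n * sum(abs(qc - pc) for qc, pc in zip(q, p))
               for q, s in zip(client_dists, client_sizes))

# Identical clients: no shift at all.
identical = mean_emd([[0.5, 0.5], [0.5, 0.5]], [10, 10])
# Ten equally sized clients, each holding exactly one of ten classes.
one_hot = [[1.0 if c == k else 0.0 for c in range(10)] for k in range(10)]
single_class = mean_emd(one_hot, [100] * 10)
```

Identical clients score 0, while ten single-class clients score 1.8, which approaches the upper bound of 2 as the number of classes grows.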

Figures 4(a) and 4(b) show the trend in classification accuracy as a function of distribution non-identicalness (average EMD difference). We are able to approach centralized learning accuracy with data on the identical end. A substantial drop around an EMD of 1.7 to 2.0 is observed in both datasets. By applying momentum on the server, FedAvgM significantly improves the convergence under heterogeneity conditions for all datasets. Using more clients per round (larger report goal $K$) is also beneficial for training but has diminishing returns.

Figure 5: Comparing Base Methods with and without FedIR. Accuracy shown at 2.5k communication rounds. Centralized learning accuracy marked with dashed lines.

6.2 Importance Reweighting

Importance Reweighting is proposed for addressing the per-client distribution shift. We evaluate FedIR with both FedAvg and FedAvgM on the two datasets with natural user splits: iNaturalist-User-120k and Landmarks-User-160k.

For Landmarks, we experiment with two different training schemes: (a) fine-tuning the entire network (all layers) end to end, and (b) training only the last two layers while freezing the network backbone. We keep the number of local epochs $E$ fixed and experiment with report goals $K \in$ {10, 50, 100}.

The result in Figure 5 shows a consistent improvement on the Landmarks-User-160k dataset over the FedAvg baseline. While FedAvgM gives the most significant improvements in all runs, FedIR further improves the convergence speed and accuracy especially when the report goal is small (Figure 7). Possibly due to the difference in non-identicalness, the Landmarks-User-160k dataset, with a mean EMD of 1.94, benefits more than iNaturalist-User-120k, with a mean EMD of 1.83.

Data        Method    FedVC   K    Acc@Round (%)          Acc@Batch (%)
                                   1k     2.5k   5k       10k    25k    50k
Geo-3k      FedAvg    no      10   47.0   47.9   48.7     37.8   44.4   46.5
Geo-3k      FedAvgM   no      10   47.2   50.4   45.0     42.5   47.1   44.9
Geo-3k      FedAvg    yes     10   37.4   46.2   52.8     46.2   53.1   55.5
Geo-3k      FedAvgM   yes     10   49.7   54.8   56.7     54.8   56.7   57.1
User-120k   FedAvg    no      10   34.7   39.7   41.3     37.8   39.8   42.9
User-120k   FedAvgM   no      10   31.9   39.2   41.3     32.3   41.6   43.4
User-120k   FedAvg    yes     10   31.3   39.7   43.9     39.7   48.9   52.8
User-120k   FedAvgM   yes     10   37.9   43.7   49.1     43.7   47.4   54.6
Centralized                        57.9

Table 2: Accuracy of Federated Virtual Client on iNaturalist. Acc@Round denotes the accuracy at a given FL communication round. Acc@Batch denotes the accuracy at a given batch count, accumulated over the largest client in each round, which is a proxy for a fixed time budget. $K$ denotes the report goal per round.

Figure 6: Learning with Federated Virtual Clients. Curves on the left are learned on the iNaturalist geo-partitioning Geo-3k and the user split User-120k, with 135 and 9,275 clients respectively. Experiments on multiple iNaturalist partitionings are shown on the right, plotting relative accuracy at 2.5k communication rounds against mean EMD. Centralized learning achieves 57.9% accuracy.

6.3 Federated Virtual Clients

We apply the Virtual Clients scheme (FedVC) to both FedAvg and FedAvgM and evaluate its efficacy using the iNaturalist user and geo-location datasets, each of which contains significantly imbalanced clients. In the experiments, 10 clients are selected for every federated round. We use a mini-batch size $B = 64$ and set the virtual client size $N_{VC} = 256$.

Figure 6 demonstrates the efficiency and accuracy improvements gained via FedVC when clients are imbalanced. The convergence of vanilla FedAvg suffers when clients perform excessive local optimization steps. In iNaturalist-Geo-3k, for example, clients can take up to 46 (i.e., 3000/64) local steps before reporting to the server. To show that FedVC utilizes data efficiently, we report accuracy at fixed batch budgets in addition to fixed round budgets. The batch budget is calculated by summing the number of local batches taken by the largest client in each round. As shown in Table 2, FedVC consistently yields superior accuracy on both FedAvg and FedAvgM. Learning curves in Figure 6 show that FedVC also decreases the learning volatility and stabilizes learning.
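The step counts behind the batch budget are simple arithmetic, sketched below. The helper `local_steps` is hypothetical, and the virtual client size of 256 is an assumption; it is consistent with Table 2, where the FedVC accuracies at 2.5k rounds coincide with those at a 10k batch budget.

```python
def local_steps(num_examples, batch_size, epochs=1):
    """Number of full mini-batch steps a client takes in one round."""
    return epochs * (num_examples // batch_size)

# Largest possible client in Geo-3k without FedVC (3,000 examples, batch size 64).
steps_largest_client = local_steps(3000, 64)

# With FedVC every client trains on a fixed number of examples per round.
steps_virtual_client = local_steps(256, 64)  # assumed virtual client size
batches_after_2500_rounds = 2500 * steps_virtual_client
```

Under these assumptions a FedVC client takes 4 steps per round instead of up to 46, so 2.5k rounds correspond to a budget of 10k batches.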

iNaturalist per-user and geo-location datasets reflect varying degrees of non-identicalness. Figure 6, though noisier, exhibits a similar trend compared to Figure 4. The performance degrades as the degree of non-identicalness, characterized by EMD, increases.

6.4 Federated Visual Classification Benchmarks

Having shown that our proposed modifications to FedAvg indeed lead to a speedup in learning on both iNaturalist and Landmarks, we wish to also provide some benchmark results on natural user partitioning with reasonable operating points. We hope that these datasets can be used as a proxy to understand real-world federated visual classification, and act as benchmarks for future improvements.

Method    FedVC   FedIR   K     Accuracy@Rounds (%)
                                1k     2.5k   5k
FedAvg    yes     no      10    31.3   39.7   43.9
FedAvg    yes     no      100   36.9   46.5   51.4
FedAvg    yes     yes     10    30.1   41.3   47.5
FedAvg    yes     yes     100   35.5   44.8   49.8
FedAvgM   yes     no      10    37.9   43.7   49.1
FedAvgM   yes     no      100   53.0   56.1   57.2
FedAvgM   yes     yes     10    38.4   42.1   47.0
FedAvgM   yes     yes     100   51.3   54.3   56.2
Centralized                     57.9

Table 3: iNaturalist-User-120k accuracy. Numbers reported at fixed communication rounds. $K$ denotes the report goal per round.

6.4.1 iNaturalist-User-120k

The iNaturalist-User-120k data has 9,275 clients and 120k examples, containing 1,203 species classes. We use report goals $K \in$ {10, 100}. FedVC samples $N_{VC} = 256$ examples per client. A summary of the benchmark results is shown in Table 3.

Notice that FedAvgM with FedVC and a large report goal of $K = 100$ has a 57.2% accuracy, almost reaching the same level as centralized learning (57.9%). That said, there is still plenty of room to improve performance with a small number of reporting clients and small round budgets. Being able to learn fast with a limited pool of clients is one of the critical research areas for practical visual FL.

Figure 7: Landmarks-User-160k Learning Curves. Only the last two layers of the network are fine-tuned. FedIR is also shown due to its ability to address skewed training distribution as presented in this dataset.

Method    FedIR   K     Accuracy@Rounds (%)
                        Two layers              All layers
                        1k     2.5k   5k        1k     2.5k   5k
FedAvg    no      10    4.2    14.6   24.6      18.2   38.1   49.7
FedAvg    no      50    4.5    16.5   26.0      20.9   42.0   53.3
FedAvg    no      100   4.9    16.5   26.3      21.9   42.3   53.4
FedAvg    yes     10    6.3    17.4   26.6      19.6   38.5   51.7
FedAvg    yes     50    7.4    19.7   28.8      26.0   45.2   55.0
FedAvg    yes     100   7.2    20.1   29.0      26.5   45.7   55.2
FedAvgM   no      10    23.0   30.1   30.8      29.4   44.1   53.7
FedAvgM   no      50    29.9   36.4   38.6      55.2   62.0   64.8
FedAvgM   no      100   31.9   37.4   39.6      56.3   63.4   65.0
FedAvgM   yes     10    26.5   32.1   31.3      27.9   45.1   53.5
FedAvgM   yes     50    31.6   37.5   38.9      53.1   61.6   63.2
FedAvgM   yes     100   33.7   38.3   39.8      57.7   64.1   65.9
Centralized             40.27                   67.05

Table 4: Landmarks-User-160k Accuracy. $K$ denotes the report goal per round.

6.4.2 Landmarks-User-160k

The Landmarks-User-160k dataset comprises 164,172 images for 2,028 landmarks, divided among 1,262 clients. We follow the setup in Section 6.2, where we experiment with either training the whole model or fine-tuning the last two fully connected layers. Report goals $K \in$ {10, 50, 100} are used.

Similarly, FedAvgM with a report goal of $K = 100$ is able to achieve 65.9% accuracy at 5k communication rounds, which is just 1.2% off from centralized learning (67.05%). Interestingly, when we perform a constrained FL, learning only the last two layers, the accuracy is also not far off from centralized learning (39.8% compared to 40.3%).

6.5 Implementation Details

We use MobileNetV2 [27]

pre-trained on ImageNet 

[5] for both iNaturalist and Landmarks-User-160k experiments; for the latter, a 64-dimensional bottleneck layer between the 1280-dimensional features and the softmax classifier. We replaced BatchNorm with GroupNorm [34] due to its superior stability for FL tasks [11]. During training, the image is randomly cropped then resized to a target input size of 299299 (iNaturalist) or 224224 (Landmarks) with scale and aspect ratio augmentation similar to [30]. A weight decay of is applied.

For CIFAR-10 and CIFAR-100 experiments, we use a CNN similar to LeNet-5 [17], which has two 5×5, 64-channel convolution layers, each preceding a 2×2 max-pooling layer, followed by two fully-connected layers with 384 and 192 channels respectively, and finally a softmax linear classifier. This model is not the state-of-the-art on the CIFAR datasets, but is sufficient to show the relative performance for our investigation. Weight decay is set to . Models are trained from scratch for 10k/20k federated rounds on CIFAR-10/100, respectively.
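As a rough sanity check on the size of this CIFAR model, the sketch below counts its trainable parameters. It is only an estimate under stated assumptions that the text does not confirm: a 32×32×3 input, "same"-padded convolutions, stride-2 pooling, and a bias term in every layer. The helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def cifar_cnn_param_count(image_size=32, in_channels=3, num_classes=10):
    """Parameter count of the LeNet-5-like CNN described in the text.

    Assumes 'same' padding (convolutions keep the spatial size) and
    2x2 max pooling with stride 2 (each pool halves the spatial size).
    """
    side = image_size
    total = conv_params(5, in_channels, 64)   # first 5x5, 64-channel conv
    side //= 2                                # first 2x2 max pool
    total += conv_params(5, 64, 64)           # second 5x5, 64-channel conv
    side //= 2                                # second 2x2 max pool
    flat = side * side * 64                   # flattened feature size
    total += dense_params(flat, 384)          # fully-connected, 384 units
    total += dense_params(384, 192)           # fully-connected, 192 units
    total += dense_params(192, num_classes)   # softmax linear classifier
    return total


print(cifar_cnn_param_count(num_classes=10))
print(cifar_cnn_param_count(num_classes=100))
```

Under these assumptions the first fully-connected layer dominates the parameter count, so the estimate is sensitive to the assumed input size.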

Unless otherwise stated, the client learning rate is 0.01 and momentum is used for FedAvgM. The learning rate is kept constant without decay for simplicity. The client batch size is 32 for Landmarks-User-160k and 64 for others.

7 Conclusions

We have shown that large-scale visual classifiers can be trained using a privacy-preserving, federated approach, and highlighted the challenges that per-user data distributions pose for learning. We provide two new datasets and benchmarks, providing a platform for other explorations in this space. We expect others to improve on our results, particularly when the number of participating clients and round budget are small. There remain many challenges for Federated Learning that are beyond the scope of this paper: real world data may include domain shift, label noise, poor data quality and duplication. Model size, bandwidth and unreliable client connections also pose challenges in practice. We hope our work inspires further exploration in this area.


We thank Andre Araujo, Grace Chu, Tobias Weyand, Bingyi Cao, Huizhong Chen, Tomer Meron, and Hartwig Adam for their valuable feedback and support.



Appendix A Additional Experiments

A.1 Hyperparameter Sensitivity

To study how sensitive the hyperparameter tuning process is to different degrees of non-identicalness in FL settings, we perform experiments on CIFAR-10/100 datasets with a grid of hyperparameters (CIFAR experiments in the main text are tuned over the same grid). Following [29], we define the effective learning rate for FedAvgM as $\eta_{\text{eff}} = \eta / (1 - \beta)$, where $\eta$ is the learning rate and $\beta$ is the momentum. For all values of the Dirichlet concentration $\alpha$, we sweep over the learning rate $\eta$ and momentum $\beta$.

Figure 8: Relative Accuracy of FedAvgM on CIFAR Datasets. Darker shades denote regions of higher relative accuracy. $\eta_{\text{eff}}$ is the effective learning rate, and $K$ is the reporting goal out of 100 clients. Note that the data split is increasingly non-identical to the right.

In Figure 8 we show the effect of different hyperparameter settings on the relative accuracy, with each grid point showing the best result over all learning rate and momentum combinations that give the same effective learning rate. We train for 10k/20k communication rounds with CIFAR-10/100, respectively.
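The grouping step described above can be sketched as follows. This is a minimal illustration, assuming the usual momentum-SGD definition of the effective learning rate, learning rate divided by (1 - momentum). The function names and the sweep results are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def effective_lr(lr, momentum):
    """Effective learning rate, assumed here to be lr / (1 - momentum)."""
    return lr / (1.0 - momentum)


def best_per_effective_lr(results, ndigits=6):
    """Keep the best accuracy among runs that share an effective learning rate.

    results maps (lr, momentum) -> accuracy. Keys are rounded so that
    floating-point noise does not split one grid point into several.
    """
    best = defaultdict(float)
    for (lr, momentum), acc in results.items():
        key = round(effective_lr(lr, momentum), ndigits)
        best[key] = max(best[key], acc)
    return dict(best)


# Hypothetical sweep results, for illustration only.
results = {
    (0.01, 0.0): 0.60,
    (0.001, 0.9): 0.64,
    (0.1, 0.0): 0.55,
    (0.01, 0.9): 0.58,
}
best = best_per_effective_lr(results)
print(best)
```

Here the first two runs share one effective learning rate and the last two share another, so each grid point reports the better of its two runs.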

Within each individual contour plot, it can be seen that the accuracy consistently drops with increased non-identicalness, and the set of hyperparameters yielding high performance becomes smaller. In general, we find an effective learning rate works well in many situations.

Across different report goals $K$, a larger $K$ enables good performance over a wider range of effective learning rates. This result is unsurprising, since with more clients reporting in, the server observes more data and hence obtains gradients with less variance. The number of local epochs does not affect the choice of hyperparameters much in our experiments (see last two rows of Figure 8). Interestingly, while CIFAR-10 and CIFAR-100 have different numbers of classes and centralized learning accuracy, they exhibit very similar characteristics in terms of relative accuracy (the overall shape of plots in Figure 8 is similar).

A.2 The Effect of Pretraining

Pretraining large visual models (e.g., using ImageNet) is very common in centralized training. It is likely to be even more beneficial in federated settings, where extra computation rounds could be prohibitively time consuming. In some cases, however, it may be necessary or desirable to train from scratch. In this section, we investigate the feasibility of training large federated visual classification models without pretraining. (Note that in the main text, the smaller CIFAR-10/100 experiments are trained from scratch, but the larger iNaturalist and Landmarks experiments use an ImageNet-pretrained MobileNetV2.)

We perform experiments using iNaturalist-Geo-3k with a combination of settings including the FL algorithm (FedAvg/FedAvgM) and report goal (10 or 100). Since training from random initialization and training from pretrained weights converge to different final test accuracies, we use relative accuracy to evaluate each FL algorithm's progress relative to the corresponding centralized learning upper bound.

Figure 9: Learning Curves from ImageNet Pretraining and from Scratch. On the left vertical axis is the relative accuracy while on the right is the absolute accuracy. Two plots are rescaled to have the full span of 100% relative accuracy.
Data    Method   Initialization  Report goal  Rounds@Relative Accuracy (multiplier)
                                              10%          50%           90%
Geo-3k  FedAvg   pretrained      10           165 (1.0)    669 (4.1)     4912 (29.8)
Geo-3k  FedAvg   pretrained      100          165 (1.0)    567 (3.4)     3780 (22.9)
Geo-3k  FedAvgM  pretrained      10           79 (1.0)     249 (3.2)     1505 (19.1)
Geo-3k  FedAvgM  pretrained      100          60 (1.0)     116 (1.9)     420 (6.9)
Geo-3k  FedAvg   scratch         10           9005 (1.0)   39236 (4.4)   50k
Geo-3k  FedAvg   scratch         100          7793 (1.0)   20k           20k
Geo-3k  FedAvgM  scratch         10           1463 (1.0)   5788 (4.0)    50k
Geo-3k  FedAvgM  scratch         100          977 (1.0)    3733 (3.8)    20k
Table 5: Communication Rounds to Reach Relative Accuracy. Note that models have different centralized learning accuracy (51.4% from scratch and 57.9% from pretrained). The multipliers are calculated row-wise, using Rounds@10% as the baseline. Experiments that do not reach the target relative accuracy within the training budget show that budget (20k or 50k rounds) without a multiplier.

From Figure 9, we see that FL with pretraining requires orders of magnitude fewer communication rounds for convergence and yields higher final relative accuracy than training from scratch. Table 5 further shows the rounds needed to reach 10%, 50%, and 90% relative accuracy. We see that FedAvgM accelerates convergence significantly: with a report goal of 100, it takes 94% (977 → 60) fewer rounds than FedAvg to reach 10% relative accuracy when starting from pretrained model weights. We also see that FedAvgM has a much steeper learning curve, reaching 90% relative accuracy in 6.9× the rounds needed to reach 10% (compared to 20× for FedAvg).
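The quantities in Table 5 can be computed along these lines. This is a minimal sketch: relative accuracy is taken as federated accuracy divided by the matching centralized accuracy, and the multipliers divide each round count by the Rounds@10% entry of the same row. The function names and the learning curve are hypothetical; only the round counts in the last call come from the first row of Table 5.

```python
def relative_accuracy(acc, centralized_acc):
    """Accuracy as a fraction of the centralized upper bound."""
    return acc / centralized_acc


def rounds_to_reach(curve, centralized_acc, target):
    """First round at which the relative accuracy reaches target.

    curve is a list of (round, accuracy) pairs sorted by round.
    Returns None if the target is never reached within the curve.
    """
    for rnd, acc in curve:
        if relative_accuracy(acc, centralized_acc) >= target:
            return rnd
    return None


def multipliers(rounds):
    """Row-wise multipliers, using the first entry (Rounds@10%) as baseline."""
    base = rounds[0]
    return [round(r / base, 1) for r in rounds]


# Hypothetical learning curve, for illustration only.
curve = [(100, 0.05), (200, 0.30), (300, 0.52)]
print(rounds_to_reach(curve, centralized_acc=0.579, target=0.1))
print(rounds_to_reach(curve, centralized_acc=0.579, target=0.9))

# Round counts from the first row of Table 5 (FedAvg, pretrained, report goal 10).
print(multipliers([165, 669, 4912]))
```

Returning `None` for an unreached target mirrors the table entries that list only the round budget.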

Whilst our results suggest that it is possible to train large federated visual classification models from scratch, doing so efficiently and effectively remains an open challenge with room for improvement.

Appendix B CIFAR-10/100 Dataset Details

B.1 Synthetic Clients with Dirichlet Prior

To generate non-identical client datasets from the CIFAR-10 and CIFAR-100 [16] datasets, we partition each into 100 clients, with 500 training examples each. We assume every client has their data independently drawn from the original dataset according to a multinomial distribution $q$ over $N$ classes ($q_i \geq 0$ and $\|q\|_1 = 1$).

To synthesize a population of non-identical clients, we draw a multinomial $q \sim \text{Dir}(\alpha p)$ from a Dirichlet distribution, where $p$ describes a prior class distribution over $N$ classes, and $\alpha$ is a parameter controlling the concentration, or identicalness among all clients. The parameter $\alpha$ can be used to control the overall homogeneity: $\alpha \to \infty$ generates clients that are all identical to the prior $p$, while $\alpha \to 0$ generates clients that tend to hold very sparse labels. After drawing the class distribution $q$ for every client, we sample training examples from CIFAR-10/100 for each class according to $q$ without replacement. This ensures there are no overlapping examples between any two clients.

Note that by drawing examples without replacement, towards the end of the assignment process some classes can be exhausted earlier than others, leaving a shorter list of available classes from which the client synthesis procedure can continue drawing samples. When this happens, we eliminate the exhausted classes and force the remaining clients to sample only from the classes still available, with a multinomial distribution over those classes.


For CIFAR-10, we use concentration values $\alpha \in$ {100, 10, 1, 0.5, 0.2, 0.1, 0.05, 0}; for CIFAR-100 we use $\alpha \in$ {1000, 100, 10, 5, 2, 1, 0.5, 0}. Summary statistics showing the class count over the client population in both datasets are given in Figure 10.

Figure 10: CIFAR-10/100 Distribution. Each curve represents the class counts of clients within a data partitioning synthesized using a Dirichlet concentration parameter $\alpha$.
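The client synthesis procedure of this appendix can be sketched as below. This is an illustrative reading and not the authors' implementation. It assumes a uniform prior over classes and a strictly positive concentration, and it handles exhausted classes by sampling only from the classes that still have examples, in proportion to the client's drawn class weights. The names `dirichlet` and `partition` are hypothetical.

```python
import random


def dirichlet(alpha, prior, rng):
    """Draw q ~ Dir(alpha * prior) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha * p, 1.0) for p in prior]
    total = sum(draws)
    if total == 0.0:
        # Very small alpha can underflow to all zeros; fall back to one class.
        draws = [0.0] * len(prior)
        draws[rng.randrange(len(prior))] = 1.0
        total = 1.0
    return [d / total for d in draws]


def partition(labels, num_clients, per_client, alpha, seed=0):
    """Assign example indices to clients without replacement."""
    if len(labels) < num_clients * per_client:
        raise ValueError("not enough examples for the requested partition")
    rng = random.Random(seed)
    num_classes = max(labels) + 1
    prior = [1.0 / num_classes] * num_classes  # uniform prior (assumption)

    # Shuffled pool of unassigned example indices for each class.
    pools = {c: [] for c in range(num_classes)}
    for idx, c in enumerate(labels):
        pools[c].append(idx)
    for pool in pools.values():
        rng.shuffle(pool)

    clients = []
    for _ in range(num_clients):
        q = dirichlet(alpha, prior, rng)
        chosen = []
        for _ in range(per_client):
            # Exhausted classes are dropped; choices() renormalizes the rest.
            available = [c for c in range(num_classes) if pools[c]]
            weights = [q[c] for c in available]
            if sum(weights) == 0.0:
                weights = [1.0] * len(available)
            c = rng.choices(available, weights=weights, k=1)[0]
            chosen.append(pools[c].pop())
        clients.append(chosen)
    return clients
```

For example, `partition(labels, num_clients=100, per_client=500, alpha=0.1)` on a list of 50,000 integer class labels would produce 100 disjoint client index lists. The limiting case of zero concentration (one class per client) is outside this sketch, since a Gamma draw needs a positive shape parameter.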