On Pre-Training for Federated Learning

by Hong-You Chen, et al.
The Ohio State University

In most of the literature on federated learning (FL), neural networks are initialized with random weights. In this paper, we present an empirical study on the effect of pre-training on FL. Specifically, we aim to investigate if pre-training can alleviate the drastic accuracy drop when clients' decentralized data are non-IID. We focus on FedAvg, the fundamental and most widely used FL algorithm. We found that pre-training does largely close the gap between FedAvg and centralized learning under non-IID data, but this does not come from alleviating the well-known model drifting problem in FedAvg's local training. Instead, how pre-training helps FedAvg is by making FedAvg's global aggregation more stable. When pre-training using real data is not feasible for FL, we propose a novel approach to pre-train with synthetic data. On various image datasets (including one for segmentation), our approach with synthetic pre-training leads to a notable gain, essentially a critical step toward scaling up federated learning for real-world applications.



1 Introduction

Figure 1: Pre-training for FedAvg and centralized learning. We initialize each paradigm with an ImageNet or our proposed synthetic pre-trained model, or with random weights. Pre-training helps both paradigms but has a larger impact on FedAvg. Even without real data, our proposed pre-training with synthetic data is sufficient to achieve a notable gain for FedAvg.

The increasing attention to data privacy and protection has attracted a flurry of research interests in federated learning (FL) [li2020federated-survey, yang2019federated, kairouz2019advances]. In FL, data are kept separate at individual clients. The goal of FL is thus to learn a “global” model in a decentralized way. Specifically, one would hope to obtain a model whose accuracy is as good as if it were trained using centralized data.

FedAvg [mcmahan2017communication] is arguably the most widely used FL algorithm. It assumes that every client is connected to a server and trains the global model in an iterative manner, alternating between parallel local model training at the clients and global model aggregation at the server. FedAvg is easy to implement and enjoys theoretical guarantees of convergence [zhou2017convergence, stich2019local, haddadpour2019convergence, li2020convergence, zhao2018federated]. Its performance, however, can degrade drastically when clients’ data are not IID, and clients’ data are often collected individually and doomed to be non-IID. That is, the accuracy of the global model can be much lower than that of its counterpart trained with centralized data. To alleviate this issue, existing literature has explored better approaches for local training [li2020federated, karimireddy2020scaffold, feddyn] or global aggregation [wang2020federated, hsu2019measuring, chen2021fedbe].

In this paper, we investigate a different and rarely studied dimension in FL — initialization. Specifically, we investigate if model pre-training can alleviate the non-IID issue and fundamentally improve FedAvg. In centralized learning, pre-training on large-scale datasets [devlin2018bert, hendrycks2019using] has been shown to improve accuracy, generalizability, robustness, etc. In most of the FL literature, however, neural networks are initialized with random weights.

This setup makes sense from the FL perspective: FL is meant to learn a model without centralized data; thus, there should be no other data available for pre-training. From an application perspective, however, pre-training has led to several breakthroughs in areas like computer vision and natural language processing, and many pre-trained models are publicly available. It is hard to imagine that practitioners in these areas would simply ignore the pre-trained models, even if the task at hand needs to be solved in an FL setting.

To incorporate these two perspectives, we conduct our study in two aspects and focus on visual recognition.

First, assuming pre-trained models are available, we conduct a systematic study to investigate whether and how they benefit FL.

We use models pre-trained on datasets like ImageNet [deng2009imagenet] to initialize FedAvg for downstream federated tasks. As one might expect, pre-training significantly improves FedAvg’s accuracy under non-IID conditions; the gap between FedAvg and centralized training is largely reduced. To better understand this, we analyze the training dynamics of FedAvg between local training and global aggregation. To our surprise, pre-training does not alleviate local model drifting [li2020federated, karimireddy2020scaffold], a well-known issue under non-IID data. Instead, pre-training helps FedAvg by making global aggregation more stable. Concretely, FedAvg averages the local models’ weights with coefficients that are simply proportional to the local data sizes. Due to the model drifting in local training, these coefficients can be far from optimal [chen2021fedbe]. Interestingly, with pre-training, FedAvg becomes less sensitive to these coefficients, resulting in a stronger global model in terms of accuracy.

Second, assuming pre-trained models are not available and there are no real data for pre-training at the server, we explore the use of synthetic data to make pre-training applicable to FL. We choose to use fractal images [kataoka2020pre, anderson2022improving] for two reasons. First, fractals are artificial-looking and raise fewer privacy concerns. Second, fractals have been shown to capture geometric properties of elements found in nature [mandelbrot1982fractal]. Directly applying the existing pre-training methods with fractals [kataoka2020pre, anderson2022improving], however, can hardly improve FL, as will be shown in subsection 6.2. To resolve this issue, we propose a novel approach to pre-train with fractals, taking into account the inner workings of their generating process by iterated function systems [barnsley2014fractals]. The resulting pre-trained model can better capture the geometric variations of objects and is more useful for FL tasks on natural images.

We validate our approach on a variety of image classification datasets, including CIFAR-10/100 [krizhevsky2009learning], Tiny-ImageNet [le2015tiny], and iNaturalist [van2018inaturalist]. We also work on semantic segmentation using Cityscapes [cordts2016cityscapes]. Our empirical results demonstrate significant and consistent improvements on FedAvg via pre-training, even using synthetic data. More importantly, the more challenging the non-IID condition is, the larger the gain by pre-training. Our analyses further show the compatibility of pre-training with advanced FL algorithms [feddyn, li2020federated], and reveal new insights for improving FL in practice.

Contributions and scope.

We conduct the very first systematic study on pre-training for FL, using five image datasets including iNaturalist and Cityscapes. Our novelty for this part lies in the findings; we believe that such a study is timely and significant to the FL community. When pre-training with real data is not feasible (this is still under debate in the FL community), we present a novel approach that uses synthetic data for pre-training, which is sufficient to attain a notable gain. Our analyses further reveal several new insights into FL, opening up future research directions.

We focus on visual recognition, which has been the focus of many FL works. We go beyond them by further studying semantic segmentation, not merely classification.

2 Related Work

Federated learning. FedAvg [mcmahan2017communication] is the fundamental FL algorithm. Many works have been proposed to improve it, especially to alleviate its accuracy drop under non-IID data. In global aggregation, [wang2020federated, yurochkin2019bayesian] matched local model weights before averaging; [lin2020ensemble, he2020group, zhou2020distilled, chen2021fedbe] replaced weight averaging by model ensemble and distillation; [hsu2019measuring, reddi2021adaptive] applied server momentum and adaptive optimization. In local training, to reduce local model drifting (a problem commonly believed to cause the accuracy drop), [zhao2018federated] mixed client and server data; FedProx [li2020federated], FedDANE [li2019feddane], and FedDyn [feddyn] employed regularization toward the global model; SCAFFOLD [karimireddy2020scaffold] and Mime [karimireddy2020mime] used control variates or server statistics to correct local gradients; [wang2020tackling, yao2019federated] modified local update rules. We investigate a rarely studied aspect to improve FedAvg: initialization. To our knowledge, very few works have studied this [lin2022fednlp, stremmel2021pretraining]; none are as systematic as ours on the effect of pre-training. [qu2021rethinking, hsu2020federated, cheng2021fine] used pre-trained models in their experiments but did not, or only briefly, analyze their effect.

Pre-training. Pre-training has been widely applied in computer vision, natural language processing, and many other application domains to speed up convergence and boost accuracy for downstream tasks [kolesnikov2020big, goyal2021self, Radford2021LearningTV, sun2017revisiting, devlin2018bert, yang2019xlnet, brown2020language]. Many works have attempted to analyze its impacts [hendrycks2020pretrained, erhan2010does, he2019rethinking, djolonga2021robustness, kornblith2019better]. For example, [hendrycks2019using] and [chen2020adversarial] found that pre-training improves robustness against adversarial examples; [neyshabur2020being] studied the loss landscape when fine-tuning on target tasks; [mehta2021empirical] empirically showed that pre-training reduces forgetting in continual learning. Despite the ample research on pre-training, its impact on FL remains largely unexplored. We aim to fill in this missing piece.

Training with synthetic data. Since it is costly to collect and annotate data, many works have explored the idea of synthesizing real-looking images for training, e.g., by using 3D graphics renderers [peng2015learning, prakash2019structured, mishra2021task2sim, mikami2021scaling, gan2020threedworld] or deep generative models [bowles2018gan, eilertsen2021ensembles, mayer2018makes]. A few recent studies have explored images generated by simple procedures, e.g., random generative models [baradad2021learning] or fractals [kataoka2020pre, anderson2022improving]. Surprisingly, pre-training on these non-realistic images produces good initialization for downstream tasks, outperforming training from scratch. We choose fractals as they are artificial-looking and raise fewer privacy concerns.

3 Background: Federated Learning

We provide a brief background on federated learning (FL), using classification as the running example. The goal is to learn a classifier $h = g \circ f$, where $f$ is the feature extractor parameterized by $\theta$ and $g$ is the classification head parameterized by $\phi$. We use $w$ to denote $\{\theta, \phi\}$.

In centralized learning, we are given a training set $D = \{(x_i, y_i)\}_{i=1}^{|D|}$, where $x_i$ is the input (e.g., an image) and $y_i$ is the true label. We can learn $w$ by empirical risk minimization

$$\min_w \mathcal{L}(D; w) = \frac{1}{|D|} \sum_{i} \ell(x_i, y_i; w). \tag{1}$$

Here, $\mathcal{L}(D; w)$ is the empirical risk, and $\ell$ is the loss function.

In FL, the training data are collected at $N$ clients. Each client $k$ has a training set $D_k$. Except for the difference in training data configurations, the goal of learning remains the same: to train a “global” model $w$. The optimization problem in Equation 1 can be rewritten as

$$\min_w \sum_{k=1}^{N} \frac{|D_k|}{|D|} \mathcal{L}(D_k; w). \tag{2}$$

Here, $\mathcal{L}(D_k; w)$ is the empirical risk of client $k$; $D = \bigcup_k D_k$ denotes the aggregated data from all clients.

Federated averaging (FedAvg).

As clients’ data are separated, Equation 2 cannot be solved directly. A standard way to relax it is FedAvg [mcmahan2017communication], which iterates between two steps, local training at the clients and global aggregation at the server, for multiple rounds. Within each round, the server broadcasts the “global” model to the clients, who then independently update the model locally using their data. The server then aggregates the “local” models back into the “global” model and proceeds to the next round. In general, the two steps can be formulated as

$$\text{Local: } w_k^{(t)} \leftarrow \arg\min_w \mathcal{L}(D_k; w), \text{ initialized with } \bar{w}^{(t-1)}; \qquad \text{Global: } \bar{w}^{(t)} \leftarrow \sum_{k=1}^{N} \frac{|D_k|}{|D|}\, w_k^{(t)}. \tag{3}$$

The superscript $(t)$ indicates the models after round $t$; $\bar{w}^{(0)}$ denotes the initial model. That is, local training aims to minimize each client’s empirical risk, often by several epochs of stochastic gradient descent (SGD). Global aggregation takes an element-wise average over local model parameters.
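As a concrete reference, one round of the two steps above can be sketched in PyTorch as follows. This is a minimal illustration under our own simplifications (the client data loaders, model, and hyperparameters are placeholders), not the exact implementation used in the experiments.

```python
import copy

import torch


def fedavg_round(global_model, clients, local_epochs=1, lr=0.1):
    """One FedAvg round: broadcast, local SGD, size-weighted averaging.

    `clients` is a list of (dataloader, num_examples) pairs.
    """
    local_states, sizes = [], []
    for loader, n_k in clients:
        local = copy.deepcopy(global_model)          # broadcast the global model
        opt = torch.optim.SGD(local.parameters(), lr=lr, momentum=0.9)
        local.train()
        for _ in range(local_epochs):                # local training on D_k
            for x, y in loader:
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(local(x), y)
                loss.backward()
                opt.step()
        local_states.append(local.state_dict())
        sizes.append(n_k)

    # Global aggregation: element-wise average weighted by |D_k| / |D|.
    total = sum(sizes)
    avg = {key: sum(state[key].float() * (n / total)
                    for state, n in zip(local_states, sizes))
           for key in local_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```

In practice this loop is repeated for many rounds, optionally with client sampling before the broadcast.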


When clients’ data are non-IID, the local models $\{w_k^{(t)}\}$ would drift away from each other and from $\bar{w}^{(t-1)}$, making $\bar{w}^{(t)}$ deviate from the solution of Equation 2 and resulting in a drastic accuracy drop.

4 Pre-training for Federated Learning

In most of the FL literature that learns neural networks, the model $w$, or more specifically the feature extractor $f$, is initialized with random weights. In this section, we attempt to provide a detailed empirical study and analysis on the effect of pre-training on FL. Specifically, we investigate whether and how pre-training helps FedAvg alleviate the accuracy drop in non-IID conditions.

4.1 Preparation and notation

To begin with, let us identify factors that may affect FedAvg’s accuracy along its training process. Let us denote by $S$ the global test set, by $\mathcal{L}(S; w)$ the test loss, and by $\mathcal{A}(S; w)$ the test accuracy. Following Equation 3, we can decompose $\mathcal{A}(S; \bar{w}^{(t)})$ after round $t$ by

$$\mathcal{A}(S; \bar{w}^{(t)}) = \mathcal{A}(S; \bar{w}^{(t-1)}) + \underbrace{\Big(\tfrac{1}{N}\textstyle\sum_k \mathcal{A}(S; w_k^{(t)}) - \mathcal{A}(S; \bar{w}^{(t-1)})\Big)}_{G_\text{L}^{(t)}} + \underbrace{\Big(\mathcal{A}(S; \bar{w}^{(t)}) - \tfrac{1}{N}\textstyle\sum_k \mathcal{A}(S; w_k^{(t)})\Big)}_{G_\text{G}^{(t)}}.$$

The first term is the initial test accuracy of round $t$; the second term ($G_\text{L}^{(t)}$) is the average gain by Local training; the third ($G_\text{G}^{(t)}$) is the gain by Global aggregation. A negative $G_\text{L}^{(t)}$ indicates that the local models after local training (at round $t$) have somewhat “forgotten” what the global model has learned [french1999catastrophic, kirkpatrick2017overcoming]. Namely, the local models drift away from the global model.
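The decomposition only requires the logged per-round accuracies. A small helper, assuming one list of global-model accuracies and one list of averaged local-model accuracies (names are our own), might look like:

```python
def decompose_round_gains(acc_global, acc_local_mean):
    """Split each round's accuracy change into local and aggregation gains.

    acc_global[t]     : test accuracy of the global model after round t
                        (index 0 holds the initial model).
    acc_local_mean[t] : averaged test accuracy of the local models of
                        round t (index 0 is unused).

    Returns (G_l, G_g): a negative G_l[t] signals local model drifting;
    G_g[t] is the gain recovered by global aggregation.
    """
    G_l, G_g = [], []
    for t in range(1, len(acc_global)):
        G_l.append(acc_local_mean[t] - acc_global[t - 1])  # local-training gain
        G_g.append(acc_global[t] - acc_local_mean[t])      # aggregation gain
    return G_l, G_g
```

By construction, `acc_global[t-1] + G_l[t-1] + G_g[t-1]` recovers `acc_global[t]` exactly.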

Dataset | Classes | Training | Test | Clients | Resolution
CIFAR-10/100 | 10/100 | | | |
Tiny-ImageNet | 200 | | | |
iNaturalist-GEO | 1203 | | | |
Cityscapes | 19 | | | |
Table 1: Summary of datasets and setups.

4.2 Experimental setup


We conduct the study using CIFAR-10, CIFAR-100 [krizhevsky2009learning], Tiny-ImageNet [le2015tiny], iNaturalist [van2018inaturalist], and Cityscapes [cordts2016cityscapes]. The first four are for image classification; the last is for semantic segmentation. Table 1 summarizes their statistics.

(We also include NLP tasks such as sentiment analysis [caldas2018leaf] and language modeling [mcmahan2017communication] in the Suppl. and show similar improvements by pre-training as in Figure 2.)

Non-IID splits.

To simulate non-IID conditions, we follow [hsu2019measuring] to partition the training sets of CIFAR-10, CIFAR-100, and Tiny-ImageNet into $N$ clients. To split the data of class $c$, we draw an $N$-dimensional vector $p_c$ from $\text{Dir}(\alpha)$, and assign data of class $c$ to client $n$ proportionally to $p_c[n]$. The resulting clients have different numbers of total images and different class distributions. The smaller $\alpha$ is, the severer the non-IID condition is.
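A sketch of this Dirichlet partition (with NumPy; the function name and seed handling are our own) is:

```python
import numpy as np


def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition example indices into non-IID client shards.

    For each class c, draw p_c ~ Dir(alpha * 1_N) and assign that
    class's examples to the N clients in proportion to p_c, following
    the split of [hsu2019measuring]. Smaller alpha -> more skewed shards.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        p = rng.dirichlet(alpha * np.ones(num_clients))
        # cut points proportional to p_c, so client n gets ~p_c[n] of class c
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            shards[client].extend(part.tolist())
    return shards
```

Every index is assigned to exactly one client, so the shards form a partition of the training set.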

For iNaturalist, a dataset for species recognition, we use the data proposed by [hsu2020federated], which are split by Geo locations. There are two versions, GEO-10K and GEO-3K, with different numbers of clients. For Cityscapes, a dataset of street scenes, we use the official training/validation sets, which contain disjoint sets of cities in Germany. We split the training data by cities to simulate a realistic scenario.


We report the averaged accuracy (%) for classification and the mIoU (%) for segmentation.


We use ResNet20 [he2016deep] for CIFAR-10/100, which is suitable for a 32×32 resolution. We use ResNet18 [he2016deep] for Tiny-ImageNet and iNaturalist. For semantic segmentation on Cityscapes, we use DeepLabV3+ [chen2018encoder] with a MobileNet-V2 [sandler2018mobilenetv2] backbone.


We compare initializing these models with random or pre-trained weights. For all the datasets except Tiny-ImageNet, we use ImageNet [deng2009imagenet] for pre-training. For Tiny-ImageNet (a subset of ImageNet), we use Places365 [zhou2017places]. The pre-training is done by standard supervised training. Specifically, we obtain pre-trained checkpoints from PyTorch’s and Places365’s official sites. (For ResNet20, we pre-trained with the same procedure but on down-sampled ImageNet images.)


We perform FedAvg for 100 iterative rounds. Each round of local training takes a fixed number of epochs. (We found this number quite stable across all experiments and freeze it throughout the paper.) We use the SGD optimizer with weight decay and momentum, except that for DeepLabV3+ we use the Adam [kingma2014adam] optimizer. We apply the standard image pre-processing and data augmentation [he2016deep]. We reserve part of the training set as the validation set for hyperparameter tuning, e.g., for the learning rate. (We use the same learning rate within each dataset, selected by FedAvg without pre-training.) We follow the literature [he2016identity] to decay the learning rate periodically along the rounds. We leave more details and the selected hyperparameters to section 6 and the Suppl.

Scope of this section.

Due to the space limit, we mainly study the following settings and postpone CIFAR-100 to section 6. For CIFAR-10 and Tiny-ImageNet, we use the Dirichlet splits described above. For iNaturalist, we use GEO-3K, on which we sample a subset of clients per round to simulate partial participation; all clients participate in every round in the other settings. We include other settings, e.g., larger and different $N$ and $\alpha$, client sampling, and learning rate scheduling, in section 6 and the Suppl.

4.3 Results, observations, and discussions

Figure 2 summarizes the results. For each combination (dataset + pre-training or not), we show the test accuracy/mIoU using the global model, i.e., $\mathcal{A}(S; \bar{w}^{(t)})$. We also show the averaged test accuracy/mIoU using each local model, i.e., $\frac{1}{N}\sum_k \mathcal{A}(S; w_k^{(t)})$. The red and green arrows indicate the gain by local training ($G_\text{L}$) and by global aggregation ($G_\text{G}$). For brevity, we only draw the first 50 rounds, but the final accuracy/mIoU are after 100 rounds. For each combination, we also depict the accuracy of centralized learning. We have the following observations and one hypothesis.

Pre-training significantly improves FedAvg.

On CIFAR-10, FedAvg with pre-training has an 8.1% gain over FedAvg without pre-training. On Tiny-ImageNet, the gain is 7.9%. Similar findings can be seen on iNaturalist and Cityscapes. Moreover, we found that pre-training brings more benefits to federated learning than to centralized learning: for the latter, the gains on CIFAR-10 and Tiny-ImageNet are only 1.6% and 5.4%. In other words, while it is expected that pre-training improves neural network training as it encodes the knowledge of external data, pre-training seems to offer extra benefits to FedAvg.

Pre-training does not alleviate local model drifting.

Local model drifting is commonly believed to be the main cause of the accuracy drop in FL [chen2021fedbe, li2020federated, karimireddy2020scaffold]. For instance, for FedAvg without pre-training, we found notable local model drifting (i.e., negative $G_\text{L}$). This can be seen from the slanting line segments: segments with negative slopes (i.e., the averaged local accuracy of round $t$ is lower than the global accuracy of round $t-1$) indicate drifting. Interestingly, while pre-training improves FedAvg’s test accuracy, we see no clear decrease, and sometimes even an increase, in local model drifting. We thus postulate that reducing local model drifting may not be necessary for achieving a high test accuracy in FL.

Figure 2: Training dynamics of FedAvg. For each combination (dataset + pre-training or not), we show the test accuracy/mIoU using the global model and the averaged test accuracy/mIoU using each local model. The red and green arrows indicate the gain by local training ($G_\text{L}$) and by global aggregation ($G_\text{G}$). For brevity, we only draw the first 50 rounds, but the final accuracy/mIoU are after 100 rounds.

Global aggregation compensates for local model drifting.

The main reason that reducing local model drifting may not be necessary is the gain produced by global aggregation. As shown in Figure 2, global aggregation (the vertical segments from the averaged local accuracy to the global accuracy) almost always leads to improvements. Interestingly, the larger the drift is (i.e., the more negative $G_\text{L}$), the larger the aggregation gain is (i.e., the more positive $G_\text{G}$). With that being said, we see some exceptions in the first few rounds of FedAvg: global aggregation results in a negative $G_\text{G}$, especially for FedAvg without pre-training. While these cases are scarce, they suggest a hypothesis: pre-training leads to more robust global aggregation.

4.4 Analysis on global aggregation

To verify our hypothesis, we conduct a further analysis on global aggregation. In FedAvg (Equation 3), global aggregation is a simple weight average, using local data sizes as the coefficients. According to [karimireddy2020scaffold, chen2021fedbe], this simple weight average may gradually deviate from the solution of Equation 1 under non-IID conditions and ultimately lead to a drastic accuracy drop. Here, we want to analyze if pre-training can alleviate this issue. Of course, for over-parameterized models like neural networks, it is unlikely that we can find a unique minimum of Equation 1 to calculate the deviation. We thus propose an alternative way to quantify the robustness of aggregation. We focus on CIFAR-10 and Tiny-ImageNet.

Optimal convex aggregation.

Our idea is inspired by [chen2021fedbe], which showed that the coefficients used by FedAvg may not optimally combine the local models. This motivates us to search for the optimal convex aggregation and calculate its accuracy gap against the simple average. The smaller the gap is, the more robust the global aggregation is. We define the optimal convex aggregation as

$$\lambda^\star = \arg\max_{\lambda \in \Delta^{N-1}} \mathcal{A}\Big(S; \sum_k \lambda[k]\, w_k^{(t)}\Big).$$

That is, we search for the vector $\lambda$ in the $(N-1)$-simplex $\Delta^{N-1}$ that maximizes the test accuracy of the combined model. Here, we apply SGD to find $\lambda^\star$. (See the Suppl. for details.)

Figure 3: Pre-training leads to robust aggregation. We show for each combination (dataset + pre-training or not) the test accuracy using the global model (solid). We also show the accuracy obtained by applying the optimal convex aggregation throughout the entire FedAvg process (dashed). FedAvg with pre-training has a smaller gap.

Figure 3 shows the test accuracy curves of standard FedAvg and of FedAvg with the optimal convex aggregation. For the latter, we replace the simple weight average by the optimal convex aggregation throughout the entire FedAvg process: at the beginning of each round, we send the optimal convex combination back to the clients for their local training. The optimal convex aggregation consistently outperforms the simple average, and the gap is larger for FedAvg without pre-training than with it. In other words, for FedAvg with pre-training, the simple weight average attains an accuracy much closer to the optimal convex aggregation. This verifies that pre-training leads to more robust global aggregation for FedAvg.
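For illustration, the search over the simplex can be sketched as follows, parameterizing $\lambda$ by a softmax and minimizing a validation loss as a differentiable proxy for accuracy. This is our own simplified sketch (the exact procedure is in the Suppl.); it assumes PyTorch 2.x for `torch.func.functional_call`.

```python
import torch
from torch.func import functional_call


def optimal_convex_aggregation(model, local_states, val_loader,
                               steps=100, lr=0.1):
    """Search the simplex for mixing weights lambda = softmax(z) that
    minimize validation loss of the convex combination of local models.

    `local_states` is a list of state dicts (one per client); `model`
    only provides the architecture for functional evaluation.
    """
    z = torch.zeros(len(local_states), requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        lam = torch.softmax(z, dim=0)                 # stays on the simplex
        params = {k: sum(lam[i] * local_states[i][k]
                         for i in range(len(local_states)))
                  for k in local_states[0]}
        loss = 0.0
        for x, y in val_loader:                       # validation loss of the mix
            out = functional_call(model, params, (x,))
            loss = loss + torch.nn.functional.cross_entropy(out, y)
        opt.zero_grad()
        loss.backward()                               # gradient flows through lam
        opt.step()
    return torch.softmax(z, dim=0).detach()
```

The softmax reparameterization keeps the search unconstrained while guaranteeing the returned coefficients are non-negative and sum to one.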

Figure 4: Pre-training leads to a lower-variance loss surface. We show the test loss of FedAvg’s global model (curve) and the confidence interval of the test losses of the sampled convex combinations (shaded area) on CIFAR-10 and Tiny-ImageNet. We also show the histograms of losses at rounds 10 and 50.

Robust aggregation seems to come from a lower-variance loss surface. Why does pre-training lead to robust aggregation? To answer this, we investigate the variation of the test loss for different $\lambda$ on the simplex. Concretely, we uniformly sample $\lambda$ many times, construct the corresponding global models, calculate their test losses, and compute the confidence interval. As shown in Figure 4, FedAvg with pre-training has a smaller interval, i.e., a lower-variance loss surface in aggregation. This helps explain why it has a smaller gap between the simple weight average and the optimal convex aggregation.
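Sampling $\lambda$ uniformly from the simplex reduces to drawing from a flat Dirichlet distribution; a minimal helper (the function name is our own):

```python
import numpy as np


def sample_simplex(n, size, seed=0):
    """Draw `size` vectors uniformly from the (n-1)-simplex via
    Dir(1, ..., 1), e.g., to probe the loss surface of convex
    combinations of n local models."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(n), size=size)
```

Each sampled row can then be used as the mixing coefficients of the local models, and the spread of the resulting test losses gives the shaded interval in Figure 4.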

5 Pre-training with Synthetic Data

When pre-trained models are not available, or there are no centralized real data for pre-training, we resort to the use of synthetic data. Specifically, we build upon the recent works of [kataoka2020pre, anderson2022improving] and propose a novel and more effective way to leverage fractal images for pre-training. We choose fractals since they are artificial-looking and raise fewer privacy concerns, making them suitable for federated applications. In the following, we give a concise background, followed by our approach.

5.1 Background: supervised fractal pre-training

Fractal generation. A fractal image can be rendered via an affine Iterated Function System (IFS) [kataoka2020pre, barnsley2014fractals]. The IFS rendering process can be thought of as “drawing” pixels iteratively on a canvas. The pixel transition is governed by a set of affine transformations, which we call an IFS code $\mathcal{C} = \{(A_m, b_m, p_m)\}_{m=1}^{M}$. Here, an $(A_m, b_m)$ pair specifies an affine transformation $v \mapsto A_m v + b_m$, and $p_m$ is the corresponding sampling probability. Concretely, given $\mathcal{C}$, an IFS generates an image as follows. Starting from an initial pixel coordinate $v_0$, it repeatedly samples one transformation with replacement according to the probabilities $\{p_m\}$ and applies it to arrive at the next pixel coordinate. This process continues until a sufficient number of iterations is reached. The traveled coordinates are then used to synthesize a fractal, by rendering each pixel as a binary or continuous value on a black canvas. Due to the randomness in the IFS, one code can create different but geometrically-similar fractals.
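A minimal IFS renderer following this description (binary rendering only; the coloring and scale jittering used in the actual pipeline are omitted) can be sketched as below, together with the classic Sierpinski-triangle code as an example.

```python
import numpy as np


def render_fractal(code, size=64, iters=20000, seed=0):
    """Render one binary fractal from an IFS code.

    `code` is a list of (A, b, p) triples: a 2x2 matrix, a 2-vector,
    and a sampling probability. Starting from (0, 0), we repeatedly
    sample a transformation according to p and map the point; all
    visited coordinates are drawn on a black canvas.
    """
    rng = np.random.default_rng(seed)
    probs = np.array([p for _, _, p in code], dtype=float)
    probs /= probs.sum()
    pts = np.empty((iters, 2))
    v = np.zeros(2)
    for t in range(iters):
        A, b, _ = code[rng.choice(len(code), p=probs)]
        v = A @ v + b                      # next pixel coordinate
        pts[t] = v
    # normalize the traveled coordinates into the canvas and rasterize
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    xy = ((pts - lo) / np.maximum(hi - lo, 1e-8) * (size - 1)).astype(int)
    canvas = np.zeros((size, size), dtype=np.uint8)
    canvas[xy[:, 1], xy[:, 0]] = 1
    return canvas


# Example code: the Sierpinski triangle as an IFS with three maps.
half = 0.5 * np.eye(2)
sierpinski = [(half, np.array([0.0, 0.0]), 1 / 3),
              (half, np.array([0.5, 0.0]), 1 / 3),
              (half, np.array([0.25, 0.5]), 1 / 3)]
```

Because the transformation at each step is sampled at random, re-running the renderer with a different seed yields a different but geometrically-similar fractal from the same code.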

Supervised pre-training.

Since the IFS code controls the generation process and hence the geometry of the fractal, it can essentially be seen as the fractal’s ID or class label. In [kataoka2020pre], the authors proposed to sample $C$ different codes and create for each code a number of images to construct a $C$-class classification dataset. This synthetic labeled dataset is then used to pre-train a neural network via a multi-class loss (e.g., cross entropy). The follow-up work by [anderson2022improving] proposed to create more complex images by painting multiple fractals on one canvas. The resulting image thus carries several of the $C$ labels, for which a multi-label loss is more suitable for supervised pre-training.

(a) Fractal + Multi-label Supervisions (b) Fractal + SimSiam (c) Fractal Pair Similarity (ours) + SimSiam
Figure 5: Pre-training with fractals. (a) multi-label training [anderson2022improving]; (b) SimSiam [chen2021exploring]; (c) SimSiam with our fractal pair similarity (FPS).

5.2 Our approach: Fractal Pair Similarity (FPS)

We propose a novel way for fractal pre-training, inspired by one critical insight: we can in theory sample infinitely many IFS codes and create fractals with highly diverse geometric properties. That is, instead of creating a $C$-class dataset that limits the diversity of codes (e.g., to 1K) but focuses more on intra-class variation, we propose to trade the latter for the former.

We propose to sample a new set of codes and create an image that contains multiple fractals on the fly. Namely, for each image, we sample a small set of IFS codes, generate the corresponding fractals, and composite them into one image. In theory, the resulting dataset has all its images from different classes (i.e., different code sets).

Analogy to self-supervised learning.

How can we pre-train a neural network using a dataset whose images are all of different class labels?

Here, we propose to view the dataset as essentially “unlabeled” and draw an analogy to self-supervised learning [liu2021self], especially methods based on contrastive learning [wu2018unsupervised, he2020momentum, chen2020improved, chen2020simple, chen2020big, tian2020makes] or similarity learning [grill2020bootstrap, chen2021exploring]. The core idea of these approaches is to treat every image as coming from a different class and learn to either repulse different images away (i.e., negative pairs) or draw an image and an augmented version of it closer (i.e., positive pairs). This line of approaches is so effective that it can even outperform supervised pre-training in many tasks [ericsson2021well]. We note that while [baradad2021learning] has applied self-supervised learning to fractals, the motivation is very different from ours: the authors directly used the supervised dataset created by [kataoka2020pre] but ignored the labels.

Fractal Pair Similarity (FPS).

Conventionally, to employ contrastive or similarity learning, one must perform data augmentation such that a single image becomes a positive pair. Common methods are image-level scale/color jittering, flips, crops, RandAugment [cubuk2020randaugment], SelfAugment [reed2021selfaugment], etc. Here, we propose to exploit one option unique to fractals (or broadly, synthetic data): using the IFS to create a pair of images based on the same set of codes. We argue that this method can create stronger and more physically-meaningful augmentation: not only do different fractals from the same codes capture intra-class variation, but we can also create a pair of images with different placements of the same fractals (more like object-level augmentation). Moreover, this method is easily compatible with commonly used augmentations.
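A sketch of the pair construction is given below. It assumes the individual fractals of one code set have already been rendered (e.g., by an IFS renderer) and only illustrates the object-level placement; the scale jittering and coloring of [anderson2022improving] are omitted, and the function name is our own.

```python
import numpy as np


def fps_pair(fractals, canvas=128, seed=0):
    """Create one FPS positive pair: two canvases containing the same
    fractals (one per sampled IFS code) at independent random
    placements. `fractals` is a list of 2-D arrays rendered from the
    same code set.
    """
    rng = np.random.default_rng(seed)
    pair = []
    for _ in range(2):  # two views built from the same codes
        img = np.zeros((canvas, canvas), dtype=np.float32)
        for f in fractals:
            h, w = f.shape
            # independent placement per view: object-level augmentation
            top = rng.integers(0, canvas - h + 1)
            left = rng.integers(0, canvas - w + 1)
            patch = img[top:top + h, left:left + w]
            np.maximum(patch, f, out=patch)  # composite by element-wise max
        pair.append(img)
    return pair
```

The two views share the same fractal identities but differ in layout, so pulling them together teaches the network placement-invariant geometric features.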


We apply the similarity learning approach SimSiam [chen2021exploring] for pre-training, due to its efficiency and effectiveness. See Figure 5 for an illustration and the Suppl. for details. Other self-supervised learning algorithms are readily applicable. We use the IFS code sampling scheme proposed in [anderson2022improving] and similarly apply scale jittering on each IFS code before we use it to render fractals. We also follow [anderson2022improving] to color, resize, and flip the fractals. We pre-sample a total of 100K IFS codes in advance and uniformly sample codes from them to generate each pair of images.
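For reference, the SimSiam objective [chen2021exploring] is the symmetric negative cosine similarity between predictor outputs and stop-gradient projections; in code:

```python
import torch
import torch.nn.functional as F


def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam objective.

    p1, p2: predictor outputs for the two views of a positive pair;
    z1, z2: projector outputs, treated as constants via stop-gradient.
    """
    def neg_cos(p, z):
        # stop-gradient on z prevents representation collapse
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```

In our setting, the two views of each positive pair are the two canvases generated from the same IFS code set.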

6 Experiments

We conduct extensive experiments to validate FPS and justify the importance of pre-training for FL.

6.1 Setup

This section follows the datasets and setup described in subsection 4.2, unless stated otherwise.

Fractal pre-training. For ResNet20, we render images matching the network’s input resolution and run the IFS for 1K iterations. For ResNet18 and DeepLabV3+, we render larger images, sampling the number of fractals per image uniformly to increase the diversity, and run the IFS for 100K iterations. For multi-label pre-training, we choose 1K IFS codes, as this gives the best validation result. All methods are pre-trained for 100 epochs. Each epoch has 1M image pairs or 2M images for fair comparisons.

Notation. Given a model learned in FL, we denote the corresponding centrally learned model on the same dataset by “Centralized”. We use Δ to denote their accuracy gap. We highlight the gain by pre-training, i.e., pre-training vs. random initialization, by magenta digits.

Init. / Dataset C10 C100 Tiny-ImgNet
Random 74.4 51.4 42.4
Fractal + Multi-label 73.0 51.0 40.9
StyleGAN + SimSiam 79.2 53.0 44.6
Fractal + SimSiam 77.4 51.7 44.2
FPS (ours) + SimSiam 80.5 54.7 45.7
FPS (ours) + MoCo-V2 79.9 53.9 46.1
ImageNet / Places365 87.9 62.2 50.3
Table 2: Comparison of synthetic pre-training methods in section 5. Means over multiple runs are reported.

6.2 Comparisons on synthetic pre-training

We first compare synthetic pre-training methods. We also consider the images generated by (the best version of) random StyleGAN [baradad2021learning, karras2019style].

We use these methods to initialize FedAvg for non-IID CIFAR-10/100 and Tiny-ImageNet with Dirichlet splits. We split the training data into clients with full participation in every round. Results are in Table 2. Our approach FPS outperforms all the baselines except pre-training with real data. Taking a deeper look, we found that multi-label supervised training with fractals cannot learn strong enough features for the downstream tasks; the accuracy is worse than random initialization, essentially a case of negative transfer [wang2019characterizing]. We attribute this to both the limited diversity of fractals and the learning mechanism. (We tried increasing the number of codes to obtain more diverse fractals, but could not improve the results due to poor convergence.) Self-supervised learning, on the contrary, can better learn from diverse fractals (i.e., each fractal its own class), although the gain is not pronounced.

By taking the inner working of fractals into account to create geometrically-meaningful positive pairs, our FPS unleashes the power of synthetic pre-training.

We apply another self-supervised learning algorithm, MoCo-V2 [chen2020improved], to show the generalizability of FPS. The accuracies are similar, but SimSiam is faster to train. In later experiments, we therefore focus on FPS (+SimSiam).

Init. / Setting GEO-10K GEO-3K Centralized Avg. gap
Random 20.9 12.2 48.5 32.0
FPS 27.8 (+6.9) 17.7 (+5.5) 50.2 (+1.7) 27.5
ImageNet 46.6 (+25.7) 45.6 (+33.4) 58.4 (+9.9) 12.3
Table 3: Top- test accuracy on iNaturalist-GEO [hsu2020federated]. The last column is the average accuracy gap to centralized learning.

6.3 iNaturalist and Cityscapes

Table 3 summarizes the results on iNaturalist 2017 [van2017devil], using the GEO-10K ( clients and clients per round) and GEO-3K ( clients and clients per round) splits by geographic location proposed in [hsu2020federated].

Init. / Setting Cities Centralized Gap
Random 41.0 57.2 16.2
FPS 64.9 (+23.9) 67.2 (+10.0) 2.3
ImageNet 67.6 (+26.6) 71.1 (+13.9) 3.5
Table 4: mIoU () of segmentation on Cityscapes. The last column is the gap to centralized learning.

Table 4 summarizes the semantic segmentation results (mIoU) on Cityscapes [cordts2016cityscapes]. The training data are split into clients by cities, and we consider full client participation every round.

From both tables, we see clear gaps between centralized and federated learning on realistic non-IID splits, and pre-training with either synthetic (i.e., FPS) or real data notably reduces these gaps. Specifically, compared to random initialization, FPS brings encouraging gains in FL, on both iNaturalist and Cityscapes. The gain is larger than that in centralized learning.

Figure 6: Tiny-ImageNet ( clients, Dir(), participation, local epoch) with one variable changed at a time: (a) number of clients, (b) Dir()-non-IID degree, (c) participation rate (%), (d) local epochs per round.
Figure 7: Compatibility with advanced client training.

6.4 Different federated settings

We study different federated settings, including different numbers of clients, non-IID degrees, and client participation rates. We also consider different numbers of local epochs per round while keeping the total number of local epochs fixed. Figure 6 summarizes the results, in which we use Tiny-ImageNet with a base setting ( clients, Dir(), participation, and local epoch) and change one variable at a time. We see consistent gains from either real or synthetic pre-training across all the settings. Importantly, when the settings become more challenging (e.g., stronger non-IID-ness or more clients), the gain from pre-training gets larger, justifying its practical applicability and value.

6.5 Compatibility with other FL methods

We study whether pre-training can improve more advanced FL methods like FedProx [li2020federated], FedDyn [feddyn], and MOON [li2021model]. Figure 7 shows the results. FPS is compatible with these methods and boosts their accuracy. Interestingly, when pre-training with real data is considered, all these methods perform quite similarly. This suggests that FedAvg remains a strong FL approach if pre-trained models are available.

Figure 8: Deeper/wider ResNet on CIFAR-10.

6.6 Further Analysis

Pre-training makes network sizes scale better. On CIFAR-10 Dir(0.3), we study different network depths and widths based on ResNet20. As shown in Figure 8, with random initialization, deeper models bring quite limited gains, while wider models improve more. With FPS, going either deeper or wider brings notable gains, suggesting that pre-training makes it easier to train larger models in FL.

Federated pre-training on ImageNet. So far we have discussed pre-training at a single site before applying the pre-trained model to FL. Here we perform a preliminary trial to show that federated pre-training is also promising. We split ImageNet into clients with Dir() and train a ResNet18 with FedAvg. It reaches top-1 validation accuracy at rounds ( clients every round). We use this checkpoint to initialize the iNaturalist experiment and achieve and on GEO-10K and GEO-3K (cf. Table 3), respectively. The results are encouraging: they are close to using centralized ImageNet pre-training and much higher than training from scratch.

7 Conclusion

We investigate a rarely studied dimension of federated learning (FL): initialization. We conduct the first systematic study of pre-training for FL and show that it largely bridges the gap between FL and centralized learning. When pre-training on real data is not feasible, we present a novel approach that uses synthetic data for pre-training, which is sufficient to attain a notable gain. We validate our approach on five image datasets, including iNaturalist and Cityscapes. Our analyses further reveal several new insights into FL, opening up future research directions.


This research is supported in part by grants from the National Science Foundation (IIS-2107077, OAC-2118240, and OAC-2112606), the OSU GI Development funds, and Cisco Systems, Inc. We are thankful for the generous support of the computational resources by the Ohio Supercomputer Center and AWS Cloud Credits for Research.


Supplementary Material

We provide details omitted in the main paper.

To keep the same reference numbers as in the main paper, we use plain text for those newly added references in the supplementary material.

Appendix A Additional Details of Fractal Pre-training

Figure 9: Examples of image pairs of FPS, generated with (a) 2, (b) 3, and (c) 5 IFS codes; each image in a pair contains 2, 3, and 5 fractals, respectively. In each pair, the two fractals rendered from the same IFS code reflect intra-class variations and different placements.

a.1 More details on fractal pair generation

We provide more details of our FPS algorithm described in section 5. To generate a fractal image pair for similarity learning, we first randomly sample distinct IFS codes. Each IFS code is then used to produce two fractal shapes, with random coloring, resizing, and flipping applied. Finally, we obtain two sets of fractals; each set contains distinct fractal shapes that can be pasted on a black canvas to generate one fractal image. The two resulting fractal images are further processed with image augmentations, such as random resized cropping, color jittering, and flipping, to obtain the final image pair. We provide more examples of image pairs generated by our FPS in Figure 9. In each pair, the two fractals generated from the same IFS code show intra-class variations, in terms of shapes and colors, and are placed at random locations.
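The IFS rendering underlying the pipeline above can be sketched with the classic "chaos game". This is a simplified, illustrative implementation (our own, not the authors' renderer): coloring, compositing onto a canvas, and augmentations are omitted, and all parameter ranges are assumptions. Each IFS code is a set of contractive affine maps; iterating randomly chosen maps traces out the fractal attractor.

```python
import math
import random

def random_ifs_code(n_maps=3, seed=None):
    """Sample an IFS 'code': contractive 2-D affine maps
    (scaled rotations plus translations), each with norm < 1."""
    rng = random.Random(seed)
    code = []
    for _ in range(n_maps):
        s = rng.uniform(0.3, 0.8)             # contraction factor
        th = rng.uniform(0.0, 2.0 * math.pi)  # rotation angle
        a, b = s * math.cos(th), -s * math.sin(th)
        c, d = s * math.sin(th), s * math.cos(th)
        e, f = rng.uniform(-1, 1), rng.uniform(-1, 1)  # translation
        code.append((a, b, c, d, e, f))
    return code

def render_fractal_points(code, n_iter=10000, seed=0):
    """Chaos game: repeatedly apply a randomly chosen map from the
    code; the visited points approximate the fractal attractor."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for _ in range(n_iter):
        a, b, c, d, e, f = rng.choice(code)
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points
```

Because every map is contractive, the iterates stay bounded; two renders from the same code with different random seeds, colors, and placements yield the positive pair used by FPS.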

a.2 Self-supervised learning approaches

After generating fractal image pairs with FPS, we pre-train models by applying two self-supervised learning approaches, SimSiam [chen2021exploring] (similarity-based) and MoCoV2 [chen2020improved] (contrastive-based). We briefly describe them in the following.

SimSiam. Given a positive image pair (x_1, x_2), we process both images with an encoder network f to extract their features. A prediction MLP head h is then applied to transform the features of one image to match the features of the other. Let z_i = f(x_i) and p_i = h(f(x_i)). The objective (to learn f and h) is to minimize the negative cosine similarity between them:

D(p_1, z_2) = - (p_1 / ||p_1||_2) · (z_2 / ||z_2||_2),

where ||·||_2 is the l2-norm and z_2 is treated as a constant (stop-gradient). Since the relation between x_1 and x_2 is symmetric, the final loss can be written as follows:

L = (1/2) D(p_1, z_2) + (1/2) D(p_2, z_1).
MoCoV2. Similar to SimSiam [chen2021exploring], MoCoV2 [chen2020improved] also aims to maximize the similarity between features extracted from a positive image pair. The main difference is that MoCoV2 adopts the contrastive loss, which also takes negative pairs into account. Following the naming in MoCoV2 [chen2020improved], let q be the query and k_+ be the positive key. We also have K negative images/keys in the mini-batch. We define q = f_q(x^q), k_+ = f_k(x^{k_+}), and k_i = f_k(x^{k_i}), where f_q is the encoder for query images and f_k is the encoder for keys. The objective function for MoCoV2 (to learn f_q and f_k) is written as follows:

L_q = -log [ exp(q · k_+ / τ) / (exp(q · k_+ / τ) + Σ_{i=1}^{K} exp(q · k_i / τ)) ],

where τ is a temperature hyper-parameter. Besides the contrastive loss, MoCoV2 maintains a dictionary to store and reuse features of keys from previous mini-batches, thereby making the negative keys not limited to the current mini-batch. Finally, to enforce stability during training, a momentum update is applied to the key encoder f_k.
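The contrastive objective can likewise be sketched for a single query. This is an illustrative pure-Python version of the InfoNCE loss (not MoCoV2's actual implementation, which batches queries and keeps keys in a momentum-updated queue).

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.2):
    """InfoNCE loss for one query feature vector.
    q and k_pos are same-length vectors; k_negs is a list of
    negative key vectors; tau is the temperature."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    m = max(logits)  # subtract max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]  # -log softmax(positive key)
```

The loss is small when the query matches its positive key and is dissimilar from all negatives, and grows as negatives become more similar to the query.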

In subsection 6.2 and Table 2 of the main paper, we pre-train ResNet-18 and ResNet-20 for 100 epochs using SimSiam [chen2021exploring] and MoCoV2 [chen2020improved]. Specifically, these architectures serve as the encoder f in SimSiam and as the query and key encoders f_q and f_k in MoCoV2. For FPS, each epoch consists of 1M image pairs generated on-the-fly. For the comparison to StyleGAN in Table 2, we use the pre-generated StyleGAN dataset provided in [baradad2021learning], which has 1.3M images. After pre-training, we keep the query encoder f_q from MoCoV2, following [chen2020improved].

For SimSiam, we use the SGD optimizer with learning rate , momentum , weight decay , and batch size . For MoCoV2, we use the SGD optimizer with learning rate , momentum , weight decay , and batch size . The dictionary size is set to . For both SimSiam and MoCoV2, the learning rate decay follows the cosine schedule (Loshchilov et al., 2017).
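The cosine schedule mentioned above can be written as a one-line rule, sketched below (a generic formulation of cosine annealing without warm restarts; the function name and default arguments are ours).

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine learning-rate decay: anneal from base_lr at step 0
    down to min_lr at total_steps, following half a cosine wave."""
    t = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

For example, with base_lr = 0.05, the rate starts at 0.05, passes through half that value midway, and reaches min_lr at the end of training.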

For other experiments besides Table 2, we mainly use SimSiam for FPS, with the same training setup as mentioned above (e.g., 100 epochs).

Dataset | Task | #Classes | Split | Networks
CIFAR-10/100 | Classification | 10/100 | Dirichlet | ResNet-{20, 32, 44, 56}
Tiny-ImageNet | Classification | 200 | Dirichlet | ResNet-18
ImageNet-2012 | Classification | 1000 | Dirichlet | ResNet-18
iNaturalist-2017 | Classification | 1203 | GEO-10K/3K | ResNet-18
Cityscapes | Segmentation | 19 | Cities | DeepLabV3+
Table 5: Summary of datasets and setups.
Dataset | Non-IID | Sampling | Optimizer | Learning rate | Batch size
CIFAR-10/100 | Dirichlet({0.1, 0.3}) | 100% | SGD + 0.9 momentum | 0.01 | 32
Tiny-ImageNet | Dirichlet(0.3) | 100%/10% | SGD + 0.9 momentum | 0.01 | 32
ImageNet-2012 | Dirichlet(0.3) | 10% | SGD + 0.9 momentum | 0.1 | 128
iNaturalist-2017 | GEO-10K, GEO-3K | 50%/20% | SGD + 0.9 momentum | 0.1 | 128
Cityscapes | Cities | 100% | Adam | 0.001 | 16
Table 6: FL settings and training hyperparameters.

Appendix B FL Experiment Details

b.1 Datasets, FL settings, and hyperparameters

We train FedAvg for 100 rounds, with local epochs and weight decay . Learning rates are decayed by every rounds. Besides that, we summarize the training hyperparameters for each federated experiment of the main paper in Table 6. We always reserve part of the training set as a validation set for the hyperparameter search used to finalize the setups.

For pre-processing, we generally follow the standard practice of normalizing the images and applying augmentations. CIFAR-10/100 images are padded pixels on each side, randomly flipped horizontally, and then randomly cropped back to the original resolution. For the other datasets with larger resolutions, we simply random-crop to the desired sizes and flip horizontally, following the official PyTorch ImageNet training script (https://github.com/pytorch/examples/tree/master/imagenet).

For the Cityscapes dataset, we use output stride . In training, the images are randomly cropped; in testing, they are resized.

To further understand the effects of hyperparameters, we provide more analysis on Tiny-ImageNet in subsection C.4.

b.2 Optimal convex aggregation

We provide more details of the optimal convex aggregation experiment in section 4. To find the optimal combination for averaging clients' weights, we optimize with the SGD optimizer for 20 epochs (batch size 32) on the global test set. The combination vector is normalized and each entry is constrained to be non-negative (which can be done with a softmax function in PyTorch) to ensure the combination is convex. Since the optimization problem is not convex, we initialize the vector with several different values, including uniform initialization, and return the best one (in terms of test accuracy) for Equation 5.
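The softmax parameterization described above can be sketched as follows (an illustrative pure-Python version; the paper trains the logits with SGD in PyTorch, which is omitted here).

```python
import math

def softmax(theta):
    """Map unconstrained logits to non-negative weights summing to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def convex_aggregate(client_weights, theta):
    """Convex combination of flattened client models: softmax(theta)
    guarantees the result stays inside the convex hull of the
    client weight vectors."""
    alpha = softmax(theta)
    dim = len(client_weights[0])
    return [sum(a * w[j] for a, w in zip(alpha, client_weights))
            for j in range(dim)]
```

With all-zero logits the coefficients are uniform and the aggregate reduces to the plain FedAvg average; optimizing the logits searches over all other convex combinations.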

Appendix C Additional Experiments and Analyses

c.1 Scope and more experiments in NLP

Init. / Setting FL CL Gap
Random 75.4 80.5 5.1
Pre-trained 83.4 (+8.0) 85.0 (+4.5) 1.6
Table 7: Sent140 accuracy ( rounds; local epoch, clients per round).
Init. / Setting FL CL Gap
Random 46.4 59.1 12.7
Pre-trained 52.6 (+6.2) 59.6 (+0.6) 7.0
Table 8: Shakespeare next-character prediction accuracy ( rounds; local epoch, clients per round, clients in total).
Figure 10: Dynamics of FL on Sent140 and Shakespeare. We show the test accuracy of the global model and the averaged test accuracy of the local models.

In this paper, we focus on computer vision (CV). We respectfully think this should not diminish our contributions. First, CV is a big area. Second, while many federated learning (FL) works focus on CV, most of them merely study classification on simple datasets like CIFAR. We go beyond them by experimenting with iNaturalist and Cityscapes (for segmentation), which are realistic and remain challenging even in centralized learning (CL). We believe our focused study in CV and the encouraging results on these two datasets, using either real or synthetic pre-training, are valuable to the FL community.

That being said, here we provide two experiments on natural language processing (NLP) to verify that our observations are consistent with those on CV tasks. First, we conduct a sentiment analysis experiment on the large-scale federated Sent140 dataset (Caldas et al., 2019), which has K clients and M samples. We use a pre-trained DistilBERT (Sanh et al., 2019). Second, we experiment with another popular FL NLP dataset, the Shakespeare next-character prediction task proposed in [mcmahan2017communication]. We use the version provided in (Caldas et al., 2019), and use the Penn Treebank dataset (Marcinkiewicz et al., 1994) to pre-train an LSTM with two layers of 256 units each.

Table 7 and Table 8 show the results for the Sent140 and Shakespeare datasets, respectively. We also see a similar trend in Figure 10 — pre-training helps more in FL than in CL and largely closes their gap, even if the local model drifts are not alleviated.

c.2 More analysis on global aggregation

Figure 11: Pre-training leads to a lower-variance loss surface (cf. Figure 4). We show the test loss by the FedAvg’s global model (curve) and the confidence interval of the test losses by the sampled convex combinations (shaded area) on iNaturalist-GEO-3K and Cityscapes.

In section 4, we discussed how pre-training leads to more robust aggregation, where the loss variance is smaller when we sample convex combinations of local models. In Figure 4, we only showed CIFAR-10 and Tiny-ImageNet due to the space limit. In Figure 11, we also include iNaturalist-GEO-3K and Cityscapes. We have a consistent finding: pre-training does lead to a lower-variance loss surface.

c.3 More comparisons on synthetic pre-training

In subsection 6.2, we compared synthetic pre-training methods, including images generated by (the best version of) a randomly initialized StyleGAN [baradad2021learning, karras2019style]. Due to the space limit, we only listed some of the comparisons there. Here we provide the complete results in Table 9. Across datasets and self-supervised algorithms, our FPS consistently outperforms the baselines. Since FPS + SimSiam overall provides the strongest performance, we focus on it for the other experiments.

Init. / Dataset C10 C100 Tiny-ImgNet
Random 74.4 51.4 42.4
Fractal + Multi-label 73.0 51.0 40.9
StyleGAN + SimSiam 79.2 53.0 44.6
Fractal + SimSiam 77.4 51.7 44.2
FPS (ours) + SimSiam 80.5 54.7 45.7
StyleGAN + MoCoV2 75.3 53.5 45.8
Fractal + MoCoV2 73.5 53.3 44.5
FPS (ours) + MoCoV2 79.9 53.9 46.1
ImageNet / Places365 87.9 62.2 50.3
Table 9: Comparison on synthetic pre-training methods in section 5. Means of runs are reported.

c.4 More settings on Tiny-ImageNet

To understand how the FL settings affect the observations about pre-training, we conduct further studies on Tiny-ImageNet. We focus on the setting in the main text (i.e., clients, Dir(), participation, local epoch) but change one variable at a time: the learning rate schedule in Figure 12, and the number of clients, non-IID degree, participation rate, and number of local epochs in Figure 13. In Figure 6 of the main text, we reported only the mean for brevity; here we also present the standard deviations over different runs, which are generally small.

For different learning rate schedules, we observe similar trends: pre-training does not alleviate the client shift much, but the test accuracy of the global model after aggregating the local models is higher. In the main paper, we focus on the schedule that decays the learning rate every rounds, given its better performance compared to the other two variants here (decaying every round and no decay).

For the number of clients, more clients degrade the performance because each client holds fewer data and the resulting local training is prone to over-fitting. However, pre-training on either real data or with our FPS still improves performance significantly.

For the non-IID degree, we manipulate it via the concentration parameter of the Dirichlet distribution. We observe that more non-IID settings (smaller parameter values) sharply degrade the performance of random initialization, while pre-training is more robust (smaller accuracy drop).

For participation rates and the number of local epochs, we observe that these are not very sensitive variables as long as they are large enough. Interestingly, using either fewer or more local epochs does not close the gap between pre-training and random initialization.

Figure 12: Tiny-ImageNet ( clients, Dir(), participation, local epoch) with different learning rate decay strategies: we show the accuracy of the global model and the averaged accuracy of the local models.
Figure 13: Tiny-ImageNet ( clients, Dir(), participation, local epoch) with one variable changed at a time: (a) number of clients, (b) Dir()-non-IID degree, (c) participation rate (%), (d) local epochs per round. Mean±std over all runs are reported.

Appendix D Additional Discussions

d.1 Potential negative societal impacts

Our work discusses how pre-training on a large dataset can improve FL. While not specific to our work, collecting real data could inject bias or undermine privacy if the collection process is not carefully designed. However, we believe these concerns mainly stem from the data collection, not from the learning algorithms.

To remedy this, in section 5 we propose an alternative that uses synthetic data: it requires no real data and can still improve FL significantly.

d.2 Computation resources

We implement all code in PyTorch and train the models with GPUs. For experiments with images, we trained on a 2080 Ti GPU; pre-training takes about day and FL takes about hours. For experiments with images, we trained on an A6000 GPU; pre-training takes about days and FL takes about day. For the Cityscapes dataset, we trained with an A6000 GPU for about days.