Federated Learning with Heterogeneous Architectures using Graph HyperNetworks

by Or Litany, et al.

Standard Federated Learning (FL) techniques are limited to clients with identical network architectures. This restricts potential use-cases like cross-platform training or inter-organizational collaboration when both data privacy and architectural confidentiality are required. We propose a new FL framework that accommodates heterogeneous client architectures by adopting a graph hypernetwork for parameter sharing. A property of the graph hypernetwork is that it can adapt to various computational graphs, thereby allowing meaningful parameter sharing across models. Unlike existing solutions, our framework does not limit the clients to share the same architecture type, makes no use of external data, and does not require clients to disclose their model architecture. Compared with distillation-based and non-graph hypernetwork baselines, our method performs notably better on standard benchmarks. We additionally show encouraging generalization performance to unseen architectures.





1 Introduction

Federated learning (FL) (McMahan et al., 2017; Yang et al., 2019; Konečný et al., 2015, 2017) allows multiple clients to collaboratively train a strong model that benefits from their individual data, without having to share that data. In particular, aggregating the parameters of locally trained models alleviates the need to share raw data, thereby preserving privacy to some extent and reducing the volume of data transferred. FL allows safe and efficient learning from edge nodes like smartphones and self-driving cars. Another important use case is to improve performance among medical facilities while maintaining patient privacy, for example in disease identification.

Most current FL approaches have a key limitation: all clients must share the same network architecture. As a result, they cannot be applied to many important cases that require learning with heterogeneous architectures. For example, depending on the computing power or OS version, different platforms may run different networks. For some organizations, changing models might hinder legacy expertise or pose regulatory challenges. Additionally, some organizations may wish to benefit from each other’s access to data without sharing their proprietary architectures. In all these cases, one is interested in federated learning with heterogeneous architectures (HAFL).

The following example illustrates why current FL approaches do not support clients with different architectures. Consider Federated Averaging (FedAvg), perhaps the most widely used FL technique, where model parameters of different clients are averaged on a shared server (McMahan et al., 2017). Unless all clients share the same structure of parameters and layers, it is not well defined how to average their weights. The problem of aggregating weights across different architectures is not specific to FedAvg; it arises in other FL approaches as well. This limitation raises a fundamental research question: Can federated learning handle heterogeneous architectures? And can this be achieved while clients keep their architectures undisclosed?
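To make the averaging problem concrete, here is a minimal sketch of key-by-key parameter averaging in plain Python. The function name `fedavg`, the dictionary layout, and the toy parameter lists are assumptions chosen for illustration; they are not taken from any FedAvg implementation.

```python
def fedavg(client_states):
    """Average parameters key by key.

    Every client must expose the same layer names with the same sizes;
    otherwise there is no canonical element-wise average.
    """
    names = set(client_states[0])
    for state in client_states:
        if set(state) != names:
            raise ValueError("clients do not have the same set of layers")
    averaged = {}
    for name in client_states[0]:
        sizes = {len(state[name]) for state in client_states}
        if len(sizes) != 1:
            raise ValueError(f"cannot average '{name}': sizes differ {sorted(sizes)}")
        columns = zip(*(state[name] for state in client_states))
        averaged[name] = [sum(col) / len(client_states) for col in columns]
    return averaged


# Two clients with identical architectures: averaging is well defined.
same_a = {"conv1": [1.0, 2.0], "fc": [0.0, 4.0]}
same_b = {"conv1": [3.0, 4.0], "fc": [2.0, 0.0]}
print(fedavg([same_a, same_b]))

# A client with a wider first layer: element-wise averaging breaks down.
wider = {"conv1": [1.0, 2.0, 3.0], "fc": [0.0, 4.0]}
try:
    fedavg([same_a, wider])
except ValueError as err:
    print("FedAvg failed:", err)
```

The failure in the second call is the situation the heterogeneous-architecture setting has to deal with: the parameter containers simply do not line up.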


Figure 1: An overview of our approach. We tackle a federated learning setup with clients that have different architectures by using a shared Graph Hypernetwork (GHN) that can adapt to any network architecture. At each communication round, a local copy of the GHN is trained by each client using their own data and network architecture (illustrated as graphs within each client; nodes represent layers and edges represent computational flow), before returning to the server for aggregation.

The first work to consider FL with non-identical architectures is Diao et al. (2021). Focusing on platforms with different computation capabilities, their setup assumes all clients share the same architecture type (e.g. ResNet18) and differ only in the number of channels per layer such that smaller capacity model parameters are subsets of the larger ones. This allows averaging of corresponding parameter subsets. Another approach for HAFL is knowledge distillation. For example, Lin et al. (2020) suggested distilling knowledge from ensembles of client models on a server using unlabeled or synthetic datasets. Although this solution could be beneficial in certain setups, it has two major limitations: (1) Client architectures must be disclosed and (2) External data must be provided. We could instead design a distillation-based approach to HAFL that does not require clients to share their architectures. First, train a global model between clients using standard FL. Then, distill knowledge from that shared model to train each client’s model on local data. Even though this solution addresses the two limitations discussed above, training a large global model might be difficult for clients with limited computing power. Moreover, only using locally available data for distillation may overfit and hinder the benefits of FL. Finally, and perhaps most importantly, this approach bypasses, but does not address, the fundamental question: How can we learn to use knowledge about architectures when transferring information about model parameters between architectures?

In this work, we propose a general approach to HAFL based on a hypernetwork (HN) (Ha et al., 2017) that acts as the knowledge aggregator. The hypernetwork takes as input a description of a client architecture and emits weights to populate that architecture. Unlike FedAvg, which aggregates weights in a predefined way, the hypernetwork implicitly learns an aggregation operator from data and can therefore be significantly more powerful. For providing an architecture descriptor to the HN, we propose to represent client architectures as graphs, where nodes can represent both parametric layers (such as convolutional layers) and non-parametric layers (such as a summation operation) and directional edges represent network connectivity such as skip connections.

To allow the hypernetwork to process any graph, regardless of its size or topology, we propose to use a graph hypernetwork (GHN)  (Zhang et al., 2020). We therefore name our approach HAFL-GHN. The GHN operates on a graph representation of the architecture and predicts layer parameters at each node that represents a parametric layer. During training, at each communication round, clients train local copies of the GHN weights using their own architectures and data (Fig. 1). Then, these weights are sent by the clients to the server where they are aggregated. Since different architectures have different layer compositions, representing layers as nodes allows meaningful knowledge aggregation across architectures. This forms an improved hypernetwork model that uses knowledge gleaned from different network types and datasets, to populate them with improved parameters. Critically, client architectures are not communicated.

HAFL-GHN relies on the ability of the Graph Neural Network (GNN) to generalize across different client architectures. We discuss this matter in light of recent results in the theory of GNNs (Yehudai et al., 2021) and demonstrate experimentally that our approach generalizes to unseen architectures in certain cases, allowing clients to modify their architectures after federation has occurred, without the need for client-wide FL retraining.

Our experimental study on three image datasets shows that HAFL-GHN outperforms a distillation-based baseline, and a non-GNN based HN architecture (Shamsian et al., 2021), by a large margin. This margin increases as the local dataset size decreases. We further show that HAFL-GHN provides a large benefit in generalization to unseen architectures, improving the accuracy of converged models and drastically shortening the time needed for convergence. Such generalization could allow new clients, with new architectures, to benefit from models trained on different data and different architectures, lowering the bar for deploying new, personalized, architectures.

2 Previous work

Federated Learning.

Federated learning (McMahan et al., 2017; Kairouz et al., 2021; Yang et al., 2019; Mothukuri et al., 2021) is a learning setup in which multiple clients collaboratively train individual models while trying to benefit from the data of all the clients without sharing their data. The most well known FL technique is federated averaging (McMahan et al., 2017), where all clients use the same architecture, which is trained locally by each client and then sent to the server and averaged with other locally trained models. Many recent works have focused on improving privacy (McMahan et al., 2018; Agarwal et al., 2018; Li et al., 2019; Zhu et al., 2020) and communication efficiency (Chen et al., 2021; Agarwal et al., 2018; Chen et al., 2021; Dai et al., 2019; Stich, 2019). Another widely studied setup is the heterogeneous client data setup (Hanzely & Richtárik, 2020; Zhao et al., 2018; Li et al., 2020; Karimireddy et al., 2020; Zhang et al., 2021). To solve this problem, personalized FL (pFL) methods were proposed that adapt global models to specific clients (Kulkarni et al., 2020). FL with heterogeneous client architectures is, however, still underexplored. Diao et al. (2021) considered a setup where clients share the same computational graph and differ only in the number of channels per layer. By enforcing an inclusion structure, models of different capacity can aggregate corresponding parameter subsets. Most related to the current work is the recent work of Shamsian et al. (2021), which also used hypernetworks. Their method considers a simple case of clients with varying architectures, where three pre-defined simple architectures are "hard" encoded into the structure of the hypernetwork. In contrast, our framework allows clients to use a variety of layers and computational graphs, and facilitates better weight sharing as the same layers are used in different architectures. In Section 5 we show that our approach outperforms Shamsian et al. (2021) by large margins.


HyperNetworks (HN) (Klein et al., 2015; Ha et al., 2017) are neural networks that predict input-conditioned weights for another neural network that performs the task of interest. HNs are widely used in many learning tasks such as generation of 3D content (Littwin & Wolf, 2019; Sitzmann et al., 2019), neural architecture search (Brock et al., 2018) and language modelling (Suarez, 2017). More relevant to our work are Graph Hypernetworks (GHNs), hypernetworks that take graphs as input. Nachmani & Wolf (2020) used GHNs for molecule property prediction. Even more relevant is the use of GHNs for Neural Architecture Search (NAS) (Zhang et al., 2020). In our work, we adapt GHNs to HAFL and address the unique challenges arising from the problem setup.

3 Problem Definition

Traditional FL addresses the setup of $C$ clients working together to improve their local models. Each client $c$ has access only to its local data samples $\{(x_j^{(c)}, y_j^{(c)})\}_{j=1}^{n_c}$, sampled from client-specific data distributions $\mathcal{P}_c$, $c = 1, \dots, C$.

Here, we generalize FL to FL with Heterogeneous Architectures (HAFL). In this setup, each client $c$ can use a different network architecture $f_c(\cdot\,; w_c)$. Here, $f_c$ is a neural architecture from some predefined family of models with learnable weights $w_c$ (see further discussion on architecture families in Section 4). Moreover, we assume that all clients are connected to a server, and that the clients can share information among themselves only through the server. Importantly, in HAFL we are interested in keeping both the architectures and the data private and in avoiding their transfer.

Our goal is to solve the following minimization problem:

$$\min_{w_1, \dots, w_C} \; \sum_{c=1}^{C} \mathbb{E}_{(x, y) \sim \mathcal{P}_c} \Big[ \ell_c\big(f_c(x; w_c), y\big) \Big] \qquad (1)$$

for a suitable loss function $\ell_c$.


4 Approach

4.1 Overview and workflow

Standard FL methods rely on aggregating model parameters and are not directly applicable when clients use different architectures. In our HAFL setup, the parameter vectors $\{w_c\}_{c=1}^{C}$ of different clients have different shapes and sizes, and as a result, a direct aggregation can be meaningless or not well defined. To address this issue, we re-parameterize the weights of a model as the output of a hypernetwork, which serves as a trainable knowledge aggregator. The hypernetwork weights, $\theta$, are learned from data using updates from all clients, and are the only weights that are trained in our model (in contrast to $w_c$, which are functions of $\theta$). Since the output of the hypernetwork must fit a given architecture $f_c$, the hypernetwork should take as input a representation of $f_c$. Here, we advocate graphs as a natural representation for neural architectures (in agreement with Zhang et al. (2020)) and apply graph hypernetworks to process them.

The workflow of training our HAFL-GHN is illustrated in Figure 1. At each communication round, several steps are followed. (1) First, the server shares the current weights of the GHN, $\theta$, with all clients (blue dashed line); (2) Each client $c$ uses the GHN to predict weights for its own specific architecture $f_c$, and updates the GHN weights locally using its own architecture and data (arched gray arrow from each client to itself); (3) Each client sends the locally updated GHN weights to the server (orange dashed lines); (4) Weight averaging is performed on the server (arched gray arrow from the GHN to itself).

4.2 Method

Representing neural architectures as graphs.

A neural architecture can be represented in many different ways. Early chain-structured architectures such as VGG (Simonyan & Zisserman, 2015) and AlexNet (Krizhevsky et al., 2012) can easily be represented as sequences of layers. Recently, several architectures with richer connectivity structures have been proposed (Ronneberger et al., 2015; Targ et al., 2016; Huang et al., 2017). A sequential representation is not suitable for these architectures. In contrast, all neural architectures are computational graphs, which makes a graph representation a much more natural choice. We use graph representations to ensure generality in architecture space.

Given a neural architecture, we represent it as a graph $G = (V, E, X)$ in the following way. The set of vertices $V$ contains a vertex for each parametric layer in the architecture. For example, a convolutional layer with weights of dimensions . In a similar way, we use another type of node to represent non-parametric operations in the network, such as summation or concatenation of matrices, which are frequently used in ResNet-like architectures. See the example in Figure 2. The set of edges $E$ represents the computational flow of the architecture: there is an edge between the nodes $u$ and $v$ (i.e., $(u, v) \in E$) if the output of the layer represented by $u$ is the input of a layer represented by $v$. $X$ is a matrix that holds the input node features. Initially, each node is equipped with categorical (one-hot) features indicating the layer type it represents, denoted by . Each categorical layer type specifies the following: linear/conv, stride, kernel size, activation and feature dimensions. As an example, the convolution layer discussed above may be represented as . According to the desired architecture family, the nodes in the graph can represent computational blocks of different granularity. Granularity can range from a single layer to complex blocks, so nodes can even represent complex mechanisms like attention. Throughout this paper, we consider the ResNet architecture family and model layers as nodes.
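The graph representation described above can be sketched with a small data structure. The layer vocabulary `LAYER_TYPES`, the class name `ArchGraph`, and the example block are hypothetical and chosen only for illustration; the actual layer types used in the paper are listed in its supplementary material.

```python
from dataclasses import dataclass

# Hypothetical layer vocabulary, for illustration only.
LAYER_TYPES = ["conv3x3", "conv1x1", "linear", "sum"]
NON_PARAMETRIC = {"sum"}


@dataclass
class ArchGraph:
    layer_types: list  # layer type of each node, indexed by node id
    edges: list        # directed (src, dst) pairs: output of src feeds dst

    def features(self):
        """One-hot node features, one row per node."""
        return [[1.0 if t == name else 0.0 for name in LAYER_TYPES]
                for t in self.layer_types]

    def in_neighbors(self, v):
        """Nodes whose output is an input of node v."""
        return [src for src, dst in self.edges if dst == v]

    def parametric_nodes(self):
        """Nodes for which a hypernetwork would have to emit weights."""
        return [v for v, t in enumerate(self.layer_types)
                if t not in NON_PARAMETRIC]


# A stem convolution, a two-convolution residual branch, a summation node
# that closes the skip connection, and a linear classifier.
block = ArchGraph(
    layer_types=["conv3x3", "conv3x3", "conv3x3", "sum", "linear"],
    edges=[(0, 1), (1, 2), (2, 3), (0, 3), (3, 4)],
)
print(block.features())
print(block.parametric_nodes())
print(block.in_neighbors(3))
```

The skip connection appears as the extra edge `(0, 3)` into the summation node, which a purely sequential description of the layers could not express.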

Figure 2: The GHN architecture. The GHN is composed of two sub-networks: a graph neural network (GNN) and a set of MLPs, one for each layer type. We input a graph representation of an architecture to the GHN, with the colored node features corresponding to different layer types. We then apply the GNN to produce node features. Finally, the node features for the parametric layers are further processed by an MLP to obtain the final layer weights.

GHN Architecture.

Originally introduced for neural architecture search, a GHN (Zhang et al., 2020; Knyazev et al., 2021) is a generalization of standard hypernetworks that allows generating weights for heterogeneous network architectures. The GHN weight prediction can be broken down into two stages: (1) Processing a graph representation of an architecture using a GNN and then (2) Predicting weights for each layer in that architecture. We now explain these two stages in detail.

In the first stage, our hypernetwork takes as input a graph representation of an architecture and processes it using a $T$-layer graph neural network with learnable parameters $\theta_G$ (we use in our implementation). This process outputs a latent representation $h_v$ for each node $v$. We use maximally expressive message passing GNN layers, as introduced in Morris et al. (2019). These layers have the following form:

$$h_v^{(t)} = \sigma\Big( W_1^{(t)} h_v^{(t-1)} + W_2^{(t)} \sum_{u \in \mathcal{N}(v)} h_u^{(t-1)} \Big) \qquad (2)$$

Here, $t$ represents the depth of the layer, $W_1^{(t)}$ and $W_2^{(t)}$ are layer-specific parameters, $\mathcal{N}(v)$ is the set of neighbors of $v$, $h_v^{(0)}$ is the input feature vector of $v$, and $\sigma$ is a non-linear activation function such as ReLU. As mentioned above, we denote the concatenation of the parameters $\{W_1^{(t)}, W_2^{(t)}\}_{t=1}^{T}$ by $\theta_G$. As shown in several recent works, the node features extracted by such GNNs are a representation of the local neighbourhood around each node (Xu et al., 2019; Morris et al., 2019; Yehudai et al., 2021).
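As an illustration of the first stage, the sketch below implements one message-passing layer of the kind described by Morris et al. (2019) in plain Python: each node combines a linear map of its own features with a linear map of the sum of its neighbors' features, followed by a ReLU. The function names, the absence of a bias term, and the choice of neighborhood are assumptions of this sketch, not details confirmed by the text.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, xi) for xi in x]


def message_passing_layer(H, neighbors, W_self, W_neigh):
    """One layer: relu(W_self @ h_v + W_neigh @ sum of neighbor features)."""
    dim = len(H[0])
    out = []
    for v, h_v in enumerate(H):
        agg = [0.0] * dim
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, H[u])]
        own = matvec(W_self, h_v)
        msg = matvec(W_neigh, agg)
        out.append(relu([a + b for a, b in zip(own, msg)]))
    return out


# Toy path graph 0 - 1 - 2 with two-dimensional one-hot features.
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]

H1 = message_passing_layer(H0, neighbors, identity, identity)
print(H1)
```

Stacking several such layers lets the feature of each node depend on progressively larger neighborhoods of the computational graph.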

At the second stage, we use a set of MLPs $\{H_k\}$ with learnable parameters $\{\theta_k\}$, one per layer type $k$, to map latent node representations $h_v$ to layer weights $w_v$:

$$w_v = H_{\tau(v)}\big(h_v; \theta_{\tau(v)}\big) \qquad (3)$$

Here, $\tau(v)$ is a categorical variable that determines the type of the layer represented by the node $v$. We note that a single MLP will not suffice since there are multiple layer types with different output sizes. We denote the concatenation of $\{\theta_k\}$ as $\theta_H$. For a particular client $c$, the weights $w_v$ of all its parametric layers are concatenated to form the client’s weight vector $w_c$ mentioned above.
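The second stage can be sketched as a dictionary of small MLP heads, one per layer type, each with its own output size. Everything concrete here is an assumption made for illustration: the layer-type names, the flattened weight sizes in `OUT_SIZES`, the latent and hidden dimensions, the ReLU activation, and the random initialization.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Create a 2-layer MLP with small random weights (no biases)."""
    def init(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W1": init(hidden_dim, in_dim), "W2": init(out_dim, hidden_dim)}


def mlp_forward(params, h):
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in params["W1"]]
    return [sum(w * x for w, x in zip(row, hidden)) for row in params["W2"]]


# Hypothetical layer types with the size of their flattened weight tensors.
OUT_SIZES = {"conv3x3": 3 * 3 * 4 * 4, "linear": 4 * 10}
LATENT_DIM, HIDDEN_DIM = 8, 16

rng = random.Random(0)
heads = {t: make_mlp(LATENT_DIM, HIDDEN_DIM, n, rng) for t, n in OUT_SIZES.items()}


def predict_weights(node_latents, node_types):
    """Map each parametric node's latent vector to a flat weight vector."""
    return {
        v: mlp_forward(heads[t], h)
        for v, (h, t) in enumerate(zip(node_latents, node_types))
        if t in heads  # non-parametric nodes (e.g. "sum") get no weights
    }


latents = [[rng.random() for _ in range(LATENT_DIM)] for _ in range(3)]
types = ["conv3x3", "sum", "linear"]
weights = predict_weights(latents, types)
print({v: len(w) for v, w in weights.items()})
```

Because the head is selected by layer type, every occurrence of the same layer type in any client architecture trains the same head, which is where the cross-architecture sharing comes from.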

Objective and training procedure.

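Based on the GHN formulation, we can now state our training objective: we look for optimal GHN parameters $\theta = (\theta_G, \theta_H)$ that simultaneously minimize the empirical risk of all clients:

$$\theta^{*} = \arg\min_{\theta} \; \sum_{c=1}^{C} \frac{1}{n_c} \sum_{j=1}^{n_c} \ell_c\Big( f_c\big(x_j^{(c)}; w_c(\theta)\big), y_j^{(c)} \Big) \qquad (4)$$

where $w_c(\theta)$ denotes the weights that the GHN with parameters $\theta$ predicts for the architecture $f_c$ of client $c$, $\{(x_j^{(c)}, y_j^{(c)})\}_{j=1}^{n_c}$ are the local samples of client $c$, and $\ell_c$ is its loss function.

The training procedure of our GHN is based on local updates of the GHN weights, performed by all clients, followed by a GHN weight averaging process on the shared server. More specifically, the local optimization at each client $c$ aims to solve the following client-specific minimization problem:

$$\arg\min_{\theta} \; \frac{1}{n_c} \sum_{j=1}^{n_c} \ell_c\Big( f_c\big(x_j^{(c)}; w_c(\theta)\big), y_j^{(c)} \Big) \qquad (5)$$

and performs a predefined number of SGD iterations (or a similar gradient-based optimization method). Locally updated weights are then averaged at the server side and redistributed to the clients for further updates. The procedure is summarized in Algorithm 1.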

We note that FedAvg can be seen as a specific instance of HAFL-GHN. To see that, consider the standard FL setting where all client architectures are the same and the GNN implements the identity map. Consequently, all clients will have the same latent node representations $h_v$: the one-hot encoding of the layer. If the MLPs are all implemented as linear mappings, then averaging their parameters is equivalent to directly averaging the client network parameters.
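The equivalence can be spelled out in a few lines. The derivation below assumes, beyond what the text states, that each layer $k$ has its own one-hot code $e_k$ and that the linear head held by client $c$ is a matrix $M_c$; both symbols are introduced here only for illustration.

```latex
% Identity GNN: the latent of the node for layer k is its one-hot code.
h_v = e_k
% Linear head on client c: the predicted weights are the k-th column of M_c.
w_k^{(c)} = M_c \, e_k
% Averaging the heads on the server and reading out layer k:
\Big( \tfrac{1}{C} \sum_{c=1}^{C} M_c \Big) e_k
  = \tfrac{1}{C} \sum_{c=1}^{C} M_c \, e_k
  = \tfrac{1}{C} \sum_{c=1}^{C} w_k^{(c)}
```

The last expression is the FedAvg update for layer $k$, so under these assumptions averaging hypernetwork parameters and averaging client parameters coincide.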

Input: R: number of communication rounds, C: number of clients, L: number of local update steps.
Initialize shared GHN weights $\theta$;
for r = 1,…,R do
        Server shares current GHN weights $\theta$ with all clients;
        for c = 1,…,C do
                Set $\theta_c \leftarrow \theta$ and update $\theta_c$ by local optimization for L update steps on client c (see Eq. 5);
                Send updated GHN weights $\theta_c$ to the server;
        end for
        Average GHN weights: $\theta \leftarrow \frac{1}{C} \sum_{c=1}^{C} \theta_c$;
end for
Algorithm 1 HAFL-GHN
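The control flow of Algorithm 1 can be sketched on a deliberately tiny problem. This is a toy under stated assumptions, not the authors' implementation: the "hypernetwork" is a lookup table with one scalar weight per layer type instead of a GNN with MLP heads, a client "model" multiplies its input by the sum of its layers' weights, the server average is uniform, and all names (`predict`, `local_update`, `hafl_round`, `train`) are hypothetical.

```python
def predict(theta, arch, x):
    """Toy client model: the shared table theta supplies one weight per layer."""
    return sum(theta[t] for t in arch) * x


def local_update(theta, arch, data, steps, lr):
    """Run `steps` gradient steps on a local copy of the shared weights."""
    theta = dict(theta)
    for _ in range(steps):
        grad = {t: 0.0 for t in theta}
        for x, y in data:
            err = predict(theta, arch, x) - y
            for t in arch:  # a type used twice receives the gradient twice
                grad[t] += 2.0 * err * x / len(data)
        for t in theta:
            theta[t] -= lr * grad[t]
    return theta


def hafl_round(theta, clients, steps, lr):
    """One communication round: broadcast, local updates, uniform averaging."""
    updates = [local_update(theta, c["arch"], c["data"], steps, lr) for c in clients]
    return {t: sum(u[t] for u in updates) / len(updates) for t in theta}


def train(theta, clients, rounds, steps, lr):
    for _ in range(rounds):
        theta = hafl_round(theta, clients, steps, lr)
    return theta


# Three clients with different "architectures" over two shared layer types.
# Targets are generated with a = 1 and b = 2, which the clients never exchange.
clients = [
    {"arch": ["a", "b"], "data": [(0.5, 1.5), (1.0, 3.0)]},
    {"arch": ["a", "a"], "data": [(0.5, 1.0), (1.0, 2.0)]},
    {"arch": ["b"], "data": [(0.5, 1.0), (1.0, 2.0)]},
]
theta = train({"a": 0.0, "b": 0.0}, clients, rounds=50, steps=5, lr=0.1)
print(theta)
```

Only the shared table travels between clients and server; the architecture lists and the data stay local, which mirrors the privacy property claimed for the real method.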

Hypernetwork weight initialization.

Poor weight initialization of CNNs can have a detrimental effect on training behavior, causing either vanishing or exploding gradients. This has been carefully studied in Glorot & Bengio (2010) and He et al. (2015b). When the network weights are outputs of another network, the hypernetwork parameters must be initialized so that the hypernetwork outputs a proper initialization of the main network. This has been observed by Chang et al. (2019), who proposed a variance formula. We empirically found that initializing using zero bias and Xavier-normal initialization multiplied by translates to an accurate Kaiming initialization of the client network. Here and are the number of channels in each layer and the hidden layer dimension of . See supplementary for more details.

Inference and local refinement.

At inference time, the trained GHN is applied to the graph representation of each client architecture $f_c$, yielding architecture-specific weights $w_c$. We discuss the case of generalizing to architectures not seen during training in Section 5.4. In contrast to the training stage, where the generated weights are used as-is, at the inference stage they can be used as an initialization and then locally refined using the client’s local dataset. The amount of improvement depends on the amount of local data. Furthermore, since this data has already been used for training the shared GHN model, only minor improvements should be expected before overfitting might occur. We found that a single refinement epoch of just the linear prediction head worked well in practice.

4.3 Implementation details

For full reproducibility of our results, we will release code upon publication. We ran an extensive hyperparameter search using Weights & Biases (Biewald, 2020) for a 4 architecture setup using a fixed number of 500 epochs. We found the optimal values (which were used in all our experiments) to be: the GNN introduced in Morris et al. (2019) with layers, a latent dimension of , SGD with a learning rate of and cosine scheduling. Inspired by Zhang et al. (2020), we experimented with both synchronous and asynchronous message-passing that respects the directional graph structure, yet our experiments did not indicate an advantage to the latter. For the client architectures, we define different layer types. The different architectures and non-parametric layers are described in detail in the supplementary material. For each layer type, a 2-layer MLP hypernetwork is used with a latent dimension of , and a leaky-ReLU activation function. We adopted the idea from Littwin & Wolf (2019) and separately predict the scale and weight for each node so that the layer parameters become . Implementation was done in PyTorch (Paszke et al., 2017) using PyTorch Geometric (Fey & Lenssen, 2019), and training was done on a cluster with NVIDIA DGX V100 GPUs. The range of the parameter sweep and implementation details for baselines are provided in the supplementary.

5 Experiments

Here we describe a comparative experimental study of our approach. In Section 5.1 we review the methods that we compare with. Sections 5.2 and 5.3 provide results for two applications. Finally, we show ablations including generalization to unseen architectures (Sec. 5.4), the effect of communication rate (Sec. 5.5), and the importance of the GNN (Sec. 5.6). In the supplementary material, we also demonstrate strong results with our method in unbalanced data distribution scenarios.

We evaluate HAFL-GHN performance in two tasks. For natural image classification, we use two standard benchmarks in FL: CIFAR-10 and CIFAR-100. Another task, which emphasizes the importance of enabling HAFL between medical facilities, involves disease classification in chest X-ray scans. CNNs are widely adopted for image classification tasks, and architectures mostly differ in depth, layer types and graph connectivity (e.g. residual connections). We thus ran each experiment with four different network architectures, named Arch 0-3. Arch 0 is a standard ResNet18 (He et al., 2015a), Arch 1 has 10 layers and no skip connections, Arch 2 has 12 layers and skip connections are used only in the first 8 layers, and Arch 3 has 12 (different) layers with skip connections between the last 8 layers. Detailed designs can be found in Figure 5 in the supplementary. In all experiments, we split the data (either uniformly or following a Dirichlet distribution, as indicated in the setup) between 4, 8, and 16 clients while maintaining an even distribution of the different architectures among clients (for instance, in the 8 clients case, we use 2 clients for each architecture).
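A Dirichlet split of this kind is commonly implemented per class: for every label, client proportions are drawn from a Dirichlet distribution and that class's samples are divided accordingly. The sketch below follows this common recipe using only the standard library; the exact protocol and concentration value used in the paper are not stated in the text, so the function names, `alpha`, and the round-robin architecture assignment are assumptions.

```python
import random


def dirichlet(alpha, k, rng):
    """Sample a k-dimensional symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def split_by_dirichlet(labels, num_clients, alpha, seed=0):
    """Partition sample indices among clients with label-skewed proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        proportions = dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c, p in enumerate(proportions):
            cumulative += p
            # The last client takes the remainder so that nothing is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


def assign_architectures(num_clients, num_archs=4):
    """Round-robin assignment keeps architectures evenly distributed."""
    return [c % num_archs for c in range(num_clients)]


labels = [i % 10 for i in range(200)]  # toy labels with 10 classes
parts = split_by_dirichlet(labels, num_clients=8, alpha=0.5)
archs = assign_architectures(8)
print([len(p) for p in parts])
print(archs)
```

Smaller values of `alpha` concentrate each class on fewer clients, while large values approach the uniform split.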

5.1 Compared baselines

Since we are the first to address the HAFL setup (FL between substantially different architectures that are kept undisclosed, without requiring external data), we compare our performance with two variants of previous methods that we adapted to the HAFL setup. (1) A local-distillation baseline, which we name HAFL-distillation, a variant of the distillation technique proposed by Lin et al. (2020). This baseline is based on locally available data and does not require the clients to disclose their architectures. In detail, for the first step we use standard federated averaging (McMahan et al., 2018) to train a shared architecture. The trained model is then sent to the clients for distillation using only local data. (2) A new variant of pFedHN (Shamsian et al., 2021), where the heads of individual client models were modified to account for different architectures (different number of outputs according to the number of parameters).

In addition to these baselines, we also report results of Local training, where each client performs standard (non-FL) training on its local train samples. This can be seen as a lower bound. For completeness, we also include an Upper bound score: the result obtained if all clients used the most expressive architecture, Arch 0, and were trained with standard FL.

5.2 Results on CIFAR-10 and CIFAR-100

We experiment with the widely known CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009) image classification datasets that contain 60,000 natural color images in 10/100 classes with a predefined train/test split. The results under HAFL are summarized in Table 1. Our method consistently outperforms the local distillation and pFedHN baselines, except on CIFAR-10 when local data is 25%, where we perform similarly or slightly worse than local distillation. Importantly, as can be expected, as the amount of locally available data decreases, local distillation deteriorates rapidly. By contrast, our proposed method shows gradual degradation, resulting in an improvement of about 20 points in average accuracy over the baseline when the local data percentage is at 6.25%. Evidently, despite being tuned to its best performing parameters per setup (dataset+split), pFedHN performs poorly. This can be attributed to weight sharing only occurring at the shared MLP responsible for encoding the architecture, so the separate prediction heads do not benefit from federation. A reduced version of our method without the graph structure is also examined. This no-graph variant, which can be seen as a set network without information about the connectivity between the model layers, already performs very well. The importance of the graph is discussed in detail in Section 5.6.

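| | | CIFAR-100 | | | | | CIFAR-10 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| data % | method | Arch 0 (Original) | Arch 1 (No skip) | Arch 2 (Skip first) | Arch 3 (Skip last) | Avg. | Arch 0 (Original) | Arch 1 (No skip) | Arch 2 (Skip first) | Arch 3 (Skip last) | Avg. |
| 25% | Upper-bound | 73.2 | | | | | 93.6 | | | | |
| | Local training | 49.5 | 50.2 | 52.2 | 48.2 | 50.1 | 86.3 | 84.6 | 84.7 | 85.4 | 85.3 |
| | pFedHN | 38.0 | 38.7 | 46.5 | 34.7 | 39.5 | 85.8 | 84.1 | 85.8 | 84.4 | 85.0 |
| | HAFL Distillation | 51.1 | 44.8 | 49.0 | 46.8 | 47.9 | 90.0 | 89.3 | 89.4 | 90.1 | 89.7 |
| | Ours | 56.3 | 51.8 | 50.1 | 51.9 | 52.5 | 89.9 | 87.4 | 86.0 | 89.0 | 88.1 |
| | Ours (no-graph) | 55.5 | 52.7 | 49.5 | 55.2 | 53.2 | 89.9 | 87.4 | 86.0 | 89.0 | 88.1 |
| 12.50% | Upper-bound | 71.1 | | | | | 93.2 | | | | |
| | Local training | 36.4 | 32.5 | 38.1 | 33.5 | 35.1 | 79.2 | 75.6 | 74.5 | 80.0 | 77.3 |
| | pFedHN | 17.8 | 25.0 | 24.0 | 20.2 | 21.8 | 77.6 | 76.0 | 78.3 | 74.8 | 76.7 |
| | HAFL Distillation | 44.2 | 33.2 | 41.3 | 35.2 | 38.5 | 85.7 | 84.6 | 84.2 | 83.3 | 84.5 |
| | Ours | 52.3 | 49.9 | 48.4 | 49.9 | 50.1 | 88.9 | 85.9 | 85.8 | 87.0 | 86.9 |
| | Ours (no-graph) | 51.5 | 48.7 | 45.7 | 50.4 | 49.1 | 88.3 | 86.3 | 84.9 | 87.4 | 86.7 |
| 6.25% | Upper-bound | 71.3 | | | | | 92.9 | | | | |
| | Local training | 22.7 | 19.4 | 22.0 | 22.0 | 21.5 | 76.9 | 75.6 | 75.6 | 74.5 | 75.7 |
| | pFedHN | 13.6 | 14.4 | 16.4 | 12.6 | 14.3 | 58.2 | 56.6 | 57.3 | 50.7 | 55.7 |
| | HAFL Distillation | 25.3 | 23.1 | 30.0 | 23.6 | 25.5 | 77.4 | 75.9 | 75.7 | 74.8 | 76.0 |
| | Ours | 47.3 | 46.6 | 44.0 | 47.0 | 46.2 | 86.4 | 83.8 | 83.0 | 85.3 | 84.6 |
| | Ours (no-graph) | 47.8 | 46.4 | 41.2 | 46.9 | 45.6 | 86.9 | 84.5 | 82.7 | 86.2 | 85.1 |

Table 1: Accuracy (%) on the CIFAR-10/100 datasets. The upper bound is a single value per dataset, obtained with Arch 0.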

5.3 Results on Chest X-ray

As discussed in the introduction, a main motivation for HAFL is to allow cross-organizational collaborations. This is especially important when multiple entities have access to valuable data that cannot be shared, but use different neural architectures for processing it. Such is the case for medical clinics and hospitals. To showcase the application of our method to such a real-world problem, we tested it using medical data of X-ray images. The ChestX-ray dataset (Wang et al., 2017) comprises 112,120 frontal-view X-ray images of 30,805 unique patients with fourteen disease image labels (images can have multiple labels), extracted from medical reports using natural language processing. Table 2 reports the average AUC score across the 14 binary classification tasks for our method as well as the baselines. Our method achieves consistently superior performance.
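The reported metric, an AUC averaged over per-label binary tasks, can be sketched as follows. The pairwise (Mann-Whitney) formulation is a standard definition of AUC; the function names, the toy scores, and the choice to skip labels that have only one class present are assumptions of this sketch rather than details from the paper's evaluation code.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when only one class is present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(score_rows, label_rows):
    """Average the per-label AUC over all labels that can be evaluated."""
    num_tasks = len(label_rows[0])
    aucs = []
    for k in range(num_tasks):
        auc = binary_auc([row[k] for row in score_rows],
                         [row[k] for row in label_rows])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Four images, two disease labels each (multi-label: rows can contain several 1s).
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [1, 1], [0, 0], [0, 1]]
print(mean_auc(scores, labels))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable at the scale of the full dataset.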

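ChestX-ray (Wang et al., 2017)

| data % | method | Arch 0 (Original) | Arch 1 (No skip) | Arch 2 (Skip first) | Arch 3 (Skip last) | Avg. |
|---|---|---|---|---|---|---|
| 25% | Upper-bound | 78.0 | | | | |
| | Local training | 60.5 | 61.8 | 58.7 | 60.3 | 60.3 |
| | pFedHN | 62.4 | 60.5 | 63.0 | 63.7 | 62.4 |
| | HAFL Distillation | 67.5 | 67.8 | 65.4 | 69.8 | 67.6 |
| | Ours | 72.2 | 73.2 | 70.0 | 70.5 | 71.5 |
| 12.50% | Upper-bound | 78.0 | | | | |
| | Local training | 59.5 | 59.4 | 57.1 | 59.8 | 59.0 |
| | pFedHN | 62.5 | 59.9 | 58.4 | 61.0 | 60.5 |
| | HAFL Distillation | 66.4 | 64.0 | 62.9 | 67.6 | 65.2 |
| | Ours | 74.0 | 71.9 | 70.6 | 73.5 | 72.5 |
| 6.25% | Upper-bound | 77.0 | | | | |
| | Local training | 59.2 | 58.5 | 56.9 | 58.9 | 58.4 |
| | pFedHN | 60.4 | 56.3 | 58.0 | 60.0 | 58.7 |
| | HAFL Distillation | 65.7 | 62.8 | 61.72 | 66.5 | 64.2 |
| | Ours | 69.4 | 62.4 | 64.4 | 68.0 | 66.1 |

Table 2: Average AUC on the NIH Chest X-ray dataset. The upper bound is a single value, obtained with Arch 0.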

5.4 Generalization to unseen architectures

In standard FL, a client that did not participate in the training procedure may still benefit from the pre-trained network as long as its data is similar enough to that used in training (Shamsian et al., 2021). We want HAFL-GHN to enjoy a similar generalization, namely, that the GHN could generalize to clients whose architecture has not been observed during training. Since neural architectures often share local structures, as with the many ResNet variants, when a new architecture is composed of local structures that have previously been seen, it may be feasible to generalize to a new composition of these known components (compositional generalization). The reason is that $k$-layer message-passing GNNs essentially encode the $k$-hop local connectivity pattern around each node. When the same $k$-hop neighborhood appears in a new architecture, possibly in a different position in the computational graph, the GNN can predict reasonable weights since it was trained on such local structures (Yehudai et al., 2021).

Figure 3: Generalization to unseen architectures: our method (light blue) quickly ramps up to high performance and maintains a considerable gap compared to training from scratch (dark blue).

A clear benefit of such generalization is that, given a new architecture, our GHN can immediately populate it to give a better initialization. To test this, we ran a “leave one out” experiment on the CIFAR-10 dataset, where we let 3 clients with 3 different architectures train in an HAFL fashion. We then introduce a 4th architecture and refine it using only local data. We compare against training that architecture from scratch on the same data. The results shown in Figure 3 and Table 6 in the supplementary are very encouraging. When refining on local data, the unseen architecture's performance ramps up very quickly and reaches performance similar to that achieved by the architectures that participated in the HAFL-GHN process. In contrast, the training-from-scratch alternative takes much longer to converge and reaches a consistently lower final performance. We further stress-tested the generalization capabilities by introducing a much smaller CNN with only 4 layers. A practical scenario of this sort is training a model on an edge device together with stronger models on the server. Table 7 in the supplementary shows the performance of the 4 architectures used in our main experiments, and how they are influenced when each is replaced by a smaller architecture. The average drop in performance of keeps the performance well above the local-training alternative. Despite the different architecture, when initialized with a model trained on the four main architectures, the fast convergence property persists (see Figure 7).

5.5 Communication rate

Figure 4: HAFL-GHN on CIFAR-10 with different communication rates.

In FL, the amount of data transferred can become a bottleneck. A common remedy is to reduce the frequency of client-server communication. To study the effect of the communication rate on HAFL-GHN performance, we trained 4 architectures on CIFAR-10. In each experiment, clients train locally for  epochs before sending local GHN weights to the server for averaging. Client-server communication occurs  times, keeping  at a constant value of . Figure 4 shows the results for R between 1 and 20. While more frequent communication leads to better performance, we observe that even with 5 times less frequent communication the overall performance decreases by less than .
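The train-locally-then-average loop described above can be sketched as follows. This is a toy under stated assumptions, not the paper's training code: weights are plain lists of floats, local training is a few gradient steps on a quadratic objective standing in for local GHN training, all clients are weighted equally, and the names `local_epochs` and `rounds` are illustrative labels for the two knobs (local training length and number of client-server exchanges).

```python
def average_weights(client_weights):
    """Element-wise mean of the clients' weight vectors (equal client weighting)."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


def local_train(weights, target, epochs, lr=0.1):
    """Toy local update: gradient steps on 0.5 * ||w - target||^2."""
    w = list(weights)
    for _ in range(epochs):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def run_federation(client_targets, local_epochs, rounds, dim):
    """Alternate local training and server-side averaging for a number of rounds."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        # Each client starts from the current global weights and trains locally.
        updates = [local_train(global_w, t, local_epochs) for t in client_targets]
        # The server averages the returned weights.
        global_w = average_weights(updates)
    return global_w


# Two hypothetical clients whose local optima differ.
targets = [[1.0, 1.0], [3.0, 3.0]]
frequent = run_federation(targets, local_epochs=1, rounds=20, dim=2)
infrequent = run_federation(targets, local_epochs=5, rounds=4, dim=2)
```

In this linear toy problem the two schedules reach the same point, because only the total number of local steps matters. With real, non-linear client models, longer local training between averaging steps lets client weights drift apart, which is the trade-off the experiment measures.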

5.6 The importance of encoding the graph structure

Our HAFL-GHN uses the same set of layer types for all models. Because these layers are repeated within the computational graphs of the different models, more sharing is possible, leading to a more efficient use of the federation. Although layers of the same type could be averaged directly, it is important that their location within the graph is also encoded. In our method this is done via message passing in the GNN. To verify its importance, we implemented a variant of our method that does not use a GNN and instead directly averages the MLPs corresponding to the same layer types. Table 3 shows this ablation, for which we used CNNs with 4, 7, 10, and 13 layers, all of the same type except for the first and last. With this amount of sharing, the results highlight that the GNN is instrumental for performance gains.

data %  GNN  CIFAR10                                     CIFAR100
             4 layers  7 layers  10 layers  13 layers    4 layers  7 layers  10 layers  13 layers
25%     no   55.3      75.5      75.1       69.9         27.8      40.0      36.5       29.4
25%     yes  63.6      77.4      77.0       74.4         26.4      41.1      40.4       33.5
12.5%   no   60.2      75.5      73.8       70.1         32.4      41.5      41.6       36.8
12.5%   yes  67.7      78.2      77.8       74.4         36.1      43.5      42.3       35.0
Table 3: The importance of the GNN. Under an HAFL setup we use 4 architectures sharing a repeated layer type. The results show significant gains from message passing, which informs each layer's encoding with its placement within the architecture.

6 Conclusion and limitations

We proposed a new setup for federated learning called HAFL, and a first solution to this problem based on a graph hypernetwork. Our experiments yielded positive results: by modeling neural networks as graphs with layers as nodes, we have shown that GNNs can exploit recurring structures and facilitate efficient federation during learning, and even generalization to new architectures. There are, however, a few limitations to our solution. First, it is limited to architectures built from predefined building blocks. In theory, this set could be arbitrarily large, but architectures that use disjoint sets of nodes may share knowledge less efficiently. Future work might benefit from incorporating the inclusion structure proposed in Diao et al. (2021) to group layers that differ only in the number of channels. Second, as indicated by the upper-bound performance, our solution is not yet comparable with same-architecture FL. Finally, a more robust aggregation technique would alleviate the undesirable degradation of performance when local data is limited. Our hope is that this research will lead to further study of this important setup.

Acknowledgement: We thank the following people for useful discussions and proofreading: Leonidas Guibas, Zan Gojcic, Francis Williams, and Cinjon Resnick.

References


  • Agarwal et al. (2018) Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpsgd: Communication-efficient and differentially-private distributed sgd. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/21ce689121e39821d07d04faab328370-Paper.pdf.
  • Biewald (2020) Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  • Brock et al. (2018) Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rydeCEhs-.
  • Chang et al. (2019) Oscar Chang, Lampros Flokas, and Hod Lipson. Principled weight initialization for hypernetworks. In International Conference on Learning Representations, 2019.
  • Chen et al. (2021) Mingzhe Chen, Nir Shlezinger, H Vincent Poor, Yonina C Eldar, and Shuguang Cui. Communication-efficient federated learning. Proceedings of the National Academy of Sciences, 118(17), 2021.
  • Dai et al. (2019) Xinyan Dai, Xiao Yan, Kaiwen Zhou, Han Yang, Kelvin Kai Wing Ng, James Cheng, and Yu Fan. Hyper-sphere quantization: Communication-efficient SGD for federated learning. CoRR, abs/1911.04655, 2019. URL http://arxiv.org/abs/1911.04655.
  • Diao et al. (2021) Enmao Diao, Jie Ding, and Vahid Tarokh. HeteroFL: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=TNkPBBYFkXg.
  • Fey & Lenssen (2019) Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • Ha et al. (2017) David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=rkpACe1lx.
  • Hamilton et al. (2018) William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs, 2018.
  • Hanzely & Richtárik (2020) Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. CoRR, abs/2002.05516, 2020. URL https://arxiv.org/abs/2002.05516.
  • He et al. (2015a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015a.
  • He et al. (2015b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, 2015b.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.
  • Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification, 2019.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
  • Kairouz et al. (2021) Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2):1–210, 2021.
  • Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Klein et al. (2015) Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4840–4848, 2015.
  • Knyazev et al. (2021) Boris Knyazev, Michal Drozdzal, Graham W Taylor, and Adriana Romero-Soriano. Parameter prediction for unseen deep architectures. In Advances in Neural Information Processing Systems, 2021.
  • Konečný et al. (2015) Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter, 2015.
  • Konečný et al. (2017) Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (canadian institute for advanced research), 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • Kulkarni et al. (2020) Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pp. 794–797. IEEE, 2020.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In I. Dhillon, D. Papailiopoulos, and V. Sze (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 429–450, 2020. URL https://proceedings.mlsys.org/paper/2020/file/38af86134b65d0f10fe33d30dd76442e-Paper.pdf.
  • Li et al. (2019) Wenqi Li, Fausto Milletarì, Daguang Xu, Nicola Rieke, Jonny Hancox, Wentao Zhu, Maximilian Baust, Yan Cheng, Sébastien Ourselin, M Jorge Cardoso, et al. Privacy-preserving federated brain tumour segmentation. In International Workshop on Machine Learning in Medical Imaging, pp. 133–141. Springer, 2019.
  • Li et al. (2017) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks, 2017.
  • Lin et al. (2020) Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning, 2020.
  • Littwin & Wolf (2019) Gidi Littwin and Lior Wolf. Deep meta functionals for shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1824–1833, 2019.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
  • McMahan et al. (2018) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJ0hF1Z0b.
  • Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4602–4609, 2019.
  • Mothukuri et al. (2021) Viraaji Mothukuri, Reza M Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Generation Computer Systems, 115:619–640, 2021.
  • Nachmani & Wolf (2020) Eliya Nachmani and Lior Wolf. Molecule property prediction and classification with graph hypernetworks. arXiv preprint arXiv:2002.00240, 2020.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in neural information processing systems, 2017.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Springer, 2015.
  • Shamsian et al. (2021) Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, 2021.
  • Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Sitzmann et al. (2019) Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, pp. 1119–1130, 2019.
  • Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
  • Suarez (2017) Joseph Suarez. Character-level language modeling with recurrent highway hypernetworks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3269–3278, 2017.
  • Targ et al. (2016) Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. CoRR, abs/1603.08029, 2016.
  • Wang et al. (2017) Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471, 2017.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • Yehudai et al. (2021) Gilad Yehudai, Ethan Fetaya, Eli Meirom, Gal Chechik, and Haggai Maron. From local structures to size generalization in graph neural networks, 2021.
  • Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Trong Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks, 2019.
  • Zhang et al. (2020) Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search, 2020.
  • Zhang et al. (2021) Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization. In International Conference on Learning Representations, 2021.
  • Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • Zhu et al. (2020) Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, and Wei Li. Federated heavy hitters discovery with differential privacy. In International Conference on Artificial Intelligence and Statistics, pp. 3837–3847. PMLR, 2020.

Appendix A Network architectures

Each experiment in this paper used four different types of architectures split among the clients, plus an additional small architecture for the stress test. There are ten different types of nodes (layers) in each architecture. Figure 5 shows the architectures used in the CIFAR-10/100 experiments. The types of convolutional layers are denoted by “c<channels_in>_<channels_out>_<kernel_size>_<stride>”. In the chest X-ray experiment, where the input images are grayscale, the first convolutional layer is of type “c1_64_k7_s2”.
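As an illustration of the naming convention, the sketch below parses a convolutional layer type string into its four fields. It assumes the concrete form seen in the example “c1_64_k7_s2”, in which the kernel size and stride carry `k` and `s` prefixes. The names `ConvSpec` and `parse_conv_type` are hypothetical and not part of the paper's code.

```python
import re
from typing import NamedTuple


class ConvSpec(NamedTuple):
    channels_in: int
    channels_out: int
    kernel_size: int
    stride: int


# Assumed pattern, modeled on the example "c1_64_k7_s2".
_CONV_RE = re.compile(r"^c(\d+)_(\d+)_k(\d+)_s(\d+)$")


def parse_conv_type(name: str) -> ConvSpec:
    """Split a convolutional layer type string into its numeric fields."""
    match = _CONV_RE.match(name)
    if match is None:
        raise ValueError(f"not a convolutional layer type: {name!r}")
    return ConvSpec(*(int(group) for group in match.groups()))


spec = parse_conv_type("c1_64_k7_s2")
print(spec)  # ConvSpec(channels_in=1, channels_out=64, kernel_size=7, stride=2)
```

Two layers share a node type only if all four fields match, which is why a grayscale first layer (one input channel) forms its own type.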

CIFAR-10, CIFAR-100, and Chest X-rays, respectively, use different types of linear layers with 10, 100, and 14 output dimensions.

Figure 5: The four client architectures used in CIFAR-10/100 experiments.

Appendix B Non-parametric layers

Since our main architectures all use residual connections and the same activation type, the graph connectivity suffices to express the architecture. In general, however, different non-parametric layers might be desired. To test this, we implemented three versions of ResNet in which several of the residual layers were replaced by concatenation, resulting in hybrid addition-concatenation networks. To that end, we included two new non-parametric node types in our layer type set: “add” and “concat”. These nodes participate in message passing and produce latent embeddings that depend on where they reside in the graph. No additional parameters are required for them. We trained these 3 architectures together with the vanilla ResNet, with four and eight clients, on the CIFAR10 and CIFAR100 datasets. On CIFAR10, the average results by architecture type are 88.96, 89, 89, 89.2 for 4 clients and 88.6, 88.1, 88.3, 86.8 for 8 clients. On CIFAR100, the results are 56, 56.2, 56.5, 48.2 for 4 clients and 50.6, 50.4, 47.7, 46.0 for 8 clients. The results are on par with those shown in Table 1. In particular, the vanilla ResNet architecture obtains 88.96 and 88.6 on CIFAR10 and 56 and 50.6 on CIFAR100 with 4 and 8 clients, respectively, whereas its performance under the HAFL-distillation baseline is 90, 85.7, 51.1, and 44.2. In this example, HAFL-GHN trains successfully with two commonly used non-parametric layers.

Appendix C Hypernetwork initialization

As described in Section 4.2 of the main paper, a proper initialization of the hypernetwork weights is instrumental to successful training of the client networks. Figure 6 shows this for an example convolutional layer. In both plots, the blue histogram shows the distribution of the desired Kaiming weight initialization (He et al., 2015b). When the weights are generated by a hypernetwork with a standard initialization, the result is the orange histogram shown on the left. The orange histogram on the right shows the result after applying our initialization scheme.

Figure 6: Hypernetwork weight initialization. We compare a direct Kaiming weight initialization (He et al., 2015b) of a convolutional layer (blue) with the weights generated by the hypernetwork (orange), without (left) and with (right) our initialization scheme.

Appendix D Hyperparameter search for our HAFL-GHN

We ran an extensive hyperparameter search using Weights & Biases (Biewald, 2020) for a 4-architecture setup with a fixed number of 500 epochs. The search covered 3 different GNN types: GraphConv (Morris et al., 2019), GatedGraphConv (Li et al., 2017), and GraphSAGE (Hamilton et al., 2018); 1 to 8 GNN layers with latent dimensions between 16 and 128; hypernetwork bottleneck dimensions between 16 and 64; learning rates between 1e-4 and 0.1; and SGD (with and without a cosine scheduler) and Adam (Kingma & Ba, 2015) optimizers with weight decay values between 5e-6 and 5e-4.

Appendix E Implementation details: Local Distillation

Baseline distillation from a teacher model trained via standard (same-architecture) FL is done using a distillation loss (Hinton et al., 2015) composed of cross-entropy (CE) and Kullback-Leibler (KL) divergence terms. The softmax in the KL loss is taken with respect to a temperature  and . We trained the distillation as well as the main FL models for 200 epochs with SGD. The learning rate for training the teacher network (with same-architecture FL) is set to  with cosine scheduling. When distilling from the teacher network to the student, we use a learning rate of .
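For orientation, a common formulation of the knowledge distillation loss of Hinton et al. (2015) is sketched below. It may differ in its details from the loss used for this baseline. The weighting `alpha`, the `temperature` default, the temperature-squared scaling of the KL term, and all function names are assumptions made for illustration, not values or code taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """CE between the student's prediction and the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label CE term and a softened teacher-student KL term."""
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Assumption: the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


loss = distillation_loss([1.0, 2.0, 0.5], [0.8, 2.5, 0.1], label=1)
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains.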

Appendix F Implementation details: pFedHN

We found this architecture to be quite sensitive to hyperparameters. Hence, unlike our method, which uses the same set of hyperparameters in all experiments, here we chose the best-performing hyperparameters per setting. We swept over the following hyperparameters: learning rate, number of hidden layers in the shared MLP, latent dimension size, optimizer (Adam and SGD), and weight decay.

Appendix G Unbalanced distribution

In collaborative training between different entities, clients’ data may be distributed unevenly. In medical data, for example, this may occur when clinics specialize in certain diseases or use different sensors. Thus, in addition to architectural differences, HAFL can also have data imbalance. Here we study the behavior of HAFL under such unbalanced data distributions.

Table 4: Each row in the table shows the unbalanced class distribution for 4 clients, with the original balanced distribution on the left. Three different α values are shown (top, middle, bottom).

We follow Yurochkin et al. (2019), Hsu et al. (2019), and Lin et al. (2020) and use the (symmetric) Dirichlet distribution, parameterized by a concentration parameter α, to split the CIFAR-10 training and test sets between the different clients. Table 4 shows the resulting per-client class distributions. As can be seen, the smaller α is, the less balanced the distribution.
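One common way to realize such a split is sketched below. For each class, client proportions are drawn from a symmetric Dirichlet and that class's samples are divided accordingly. The cited works differ in their exact sampling procedures, so this is one variant chosen for illustration. The function names, the seed handling, and the synthetic labels are assumptions.

```python
import random


def dirichlet_sample(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices among clients, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        proportions = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += proportions[c]
            # The last client takes the remainder so every sample is assigned once.
            end = len(indices) if c == num_clients - 1 else round(cumulative * len(indices))
            clients[c].extend(indices[start:end])
            start = end
    return clients


# Synthetic labels standing in for a 10-class dataset.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_split(labels, num_clients=4, alpha=0.1)
```

Small values of alpha concentrate each class on a few clients, while large values approach an even split.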

Table 5 shows the performance on CIFAR-10 under 3 different α values: 100, 1, and 0.1. The corresponding class distributions are shown in Table 4.

α    method       Client 0       Client 1       Client 2       Client 3
                  unbal.  bal.   unbal.  bal.   unbal.  bal.   unbal.  bal.
100  HAFL-GHN     89.7    89.5   87.6    87.6   87.0    86.2   88.3    88.1
     Local        81.8    82.4   75.1    74.2   80.5    80.5   81.1    80.3
     Standard FL  92.9    93.7   93.2    93.7   94.1    93.7   94.5    93.7
1    HAFL-GHN     91.1    87.1   87.4    85.5   84.9    84.6   85.0    85.5
     Local        87.8    75.8   87.3    83.5   84.4    84.0   84.4    83.5
     Standard FL  93.8    93.0   92.9    93.0   92.6    93.0   92.9    93.0
0.1  HAFL-GHN     94.6    62.2   98.3    28.7   92.3    34.0   92.4    61.0
     Local        93.6    51.1   96.6    27.8   90.9    25.2   90.7    54.1
     Standard FL  69.3    53.8   58.2    53.8   42.3    53.8   49.6    53.8
Table 5: HAFL with unbalanced distribution. In the table, α corresponds to the level of imbalance, e.g. α=100 (almost uniform) and α=0.1 (extremely unbalanced). “unbal.” and “bal.” are short for unbalanced and balanced and refer to the distribution of the test set, with unbalanced being the same distribution as each client’s training set.

When the training data is unevenly distributed, performance can be measured either on a similarly distributed test set or on a balanced test set. In the former case, a client would like high performance on local, biased samples. For example, a hospital that specializes in a specific disease may hope to improve model performance on that disease through FL. We refer to this as the “unbalanced” metric. Alternatively, a client might be interested in balancing its model bias, in which case the performance on the full (“balanced”) test set is of interest. We compare our HAFL-GHN against local training. We also include the usual upper-bound performance of standard FL with all clients using the same architecture. Table 5 shows that HAFL-GHN outperforms local training in both the “unbalanced” and “balanced” tasks and across all α values. Specifically, local training achieves high performance only on test sets with a similar distribution and severely sacrifices performance on balanced test sets. Same-architecture FL also shows a trade-off: despite being the most performant on the balanced test set, it sacrifices accuracy on local distributions. HAFL-GHN’s ability to improve on local training in both tasks can be attributed to the inherent personalization of the network. That is, beyond its ability to adapt to new architectures, HAFL-GHN can also learn a personalized weight prediction according to the client distributions. This result aligns well with the observations of Shamsian et al. (2021).

Appendix H Generalization – additional results

Table 7 shows the performance of the 4 architectures used in our experiments and how they are affected when each is replaced by a smaller architecture. Even after the resulting average drop, performance stays well above the local-training alternative.

                      GHN Init  From scratch
Arch 1 (original)     86.1      83.9
Arch 2 (No skip)      84.2      81.3
Arch 3 (Skip first)   84.3      83.7
Arch 4 (Skip last)    85.6      83.6
Table 6: Generalization to unseen architectures: leave-one-architecture-out experiment on CIFAR-10. Each row is the accuracy on a held-out architecture while training on the other architectures.

Replaced  Arch 1  Arch 2  Arch 3  Arch 4
None      90.4    89.0    87.3    88.5
Arch 4    88.2    86.9    86.4    80.3
Arch 3    88.8    87.1    79.6    88.2
Arch 2    88.6    81.6    85.4    86.2
Arch 1    77.5    83.8    85.4    83.6
Table 7: Training with a much smaller architecture lowers the average performance. However, performance remains well above the local training alternative.

Figure 7: Generalization to a smaller 4-layer CNN architecture. Our method (light blue) quickly ramps up to high performance and maintains a considerable gap compared to training from scratch (dark blue) until convergence.