FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout

02/26/2021 ∙ by Samuel Horvath, et al.

Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact and constitutes a primary problem for fairness, training performance and accuracy. Although significant effort has gone into tackling statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model's capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in Neural Networks and enables the extraction of lower-footprint submodels without the need for retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client's capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.



1 Introduction

Over the past few years, advances in deep learning have revolutionised the way we interact with everyday devices. Much of this success relies on the availability of large-scale training infrastructures and the collection of vast amounts of training data. However, users and providers are becoming increasingly aware of the privacy implications of this ever-increasing data collection, leading to the creation of various privacy-preserving initiatives by service providers (Apple, 2017) and government regulators (European Commission, 2018).

Federated Learning (FL) (McMahan et al., 2017a) is a relatively new subfield of machine learning (ML) that allows the training of models without the data leaving the users’ devices; instead, FL allows users to collaboratively train a model by moving the computation to them. At each round, participating devices download the latest model and compute an updated model using their local data. These locally trained models are then sent from the participating devices back to a central server, where the updates are aggregated into the next round’s global model. Until now, a lot of research effort has been invested with the sole goal of maximising the accuracy of the global model (McMahan et al., 2017a; Liang et al., 2019; Li et al., 2020b; Karimireddy et al., 2020; Wang et al., 2020), while complementary mechanisms have been proposed to ensure privacy and robustness (Bonawitz et al., 2017; Geyer et al., 2017; McMahan et al., 2018; Melis et al., 2019; Hu et al., 2020; Bagdasaryan et al., 2020).

Figure 1: FjORD employs OD to tailor the amount of computation to the capabilities of each participating device.

A key challenge of deploying FL in the wild is the vast heterogeneity of devices (Li et al., 2020a), ranging from low-end IoT to flagship mobile devices. Despite this fact, the widely accepted norm in FL is that the local models have to share the same architecture as the global model. Under this assumption, developers typically opt to either drop low-tier devices from training, hence introducing training bias due to unseen data (Kairouz et al., 2019), or limit the global model’s size to accommodate the slowest clients, leading to degraded accuracy due to the restricted model capacity (Caldas et al., 2018b). In addition to these limitations, variability in sample sizes, computation load and data transmission speeds further contribute to a very unbalanced training environment. Finally, the resulting model might not be as efficient as models specifically tailored to the capabilities of each device tier to meet the minimum processing-performance requirements (Laskaridis et al., 2020).

In this work, we introduce FjORD (Fig. 1), a novel adaptive training framework that enables heterogeneous devices to participate in FL by dynamically adapting model size – and thus computation, memory and data exchange sizes – to the available client resources. To this end, we introduce Ordered Dropout (OD), a mechanism for run-time ordered (importance-based) pruning, which enables us to extract and train submodels in a nested manner. As such, OD enables all devices to participate in the FL process independently of their capabilities by training a submodel of the original DNN, while still contributing knowledge to the global model. Alongside OD, we propose a self-distillation method from the maximal supported submodel on a device to enhance the feature extraction of smaller submodels. Finally, our framework has the additional benefit of producing models that can be dynamically scaled during inference, based on the hardware and load constraints of the device.

Our evaluation shows that FjORD enables significant accuracy benefits over the baselines across diverse datasets and networks, while allowing for the extraction of submodels of varying FLOPs and sizes without the need for retraining.

2 Motivation

Despite the progress on the accuracy front, the unique deployment challenges of FL still limit the attainable performance. FL is typically deployed either on siloed setups, such as among hospitals, or on mobile devices in the wild (Bonawitz et al., 2019). In this work, we focus on the latter setting. Hence, while cloud-based distributed training uses powerful high-end clients (Hazelwood et al., 2018), in FL these are commonly substituted by resource-constrained and heterogeneous embedded devices.

In this respect, FL deployment is currently hindered by the vast heterogeneity of client hardware (Wu et al., 2019; Ignatov et al., 2019; Bonawitz et al., 2019). On the one hand, different mobile hardware leads to significantly varying processing speed (Almeida et al., 2019), in turn leading to longer waits upon aggregation of updates (i.e. stragglers). At the same time, devices of mid and low tiers might not even be able to support larger models, e.g. because the model does not fit in memory or processing is too slow, and thus are either excluded or dropped upon timeouts from the training process, together with their unique data. More interestingly, the resource allocation to participating devices may also reflect demographic and socio-economic information about their owners, which makes the exclusion of such clients unfair (Kairouz et al., 2019). Analogous to the device load and heterogeneity, a similar trend can be traced in the downstream (model) and upstream (updates) network communication in FL, which can be an additional substantial bottleneck for the training procedure (Sattler et al., 2020).

3 Ordered Dropout

In this paper, we first introduce the tools that act as enablers for heterogeneous federated training. Concretely, we have devised a mechanism of importance-based pruning for the easy extraction of subnetworks from the original, specially trained model, each with a different computational and memory footprint. We name this technique Ordered Dropout (OD), as it orders knowledge representation in nested submodels of the original network.

More specifically, our technique starts by sampling a value (denoted by $p$) from a distribution of candidate values. Each of these values corresponds to a specific submodel, which in turn gets translated to a specific computational and memory footprint (see Table 2). Such sampled values and associations are depicted in Fig. 2. Contrary to conventional dropout (RD), our technique drops adjacent components of the model instead of random neurons, which translates to computational benefits in today’s linear algebra libraries and higher accuracy as shown later.

3.1 Ordered Dropout Mechanics

The proposed OD method is parametrised with respect to: i) the value of the dropout rate $p \in (0, 1]$ per layer, ii) the set of candidate values $\mathcal{P}$, such that $p \in \mathcal{P}$, and iii) the sampling method of $p$ over the set of candidate values, such that $p \sim D_{\mathcal{P}}$, where $D_{\mathcal{P}}$ is the distribution over $\mathcal{P}$.

A primary hyperparameter of OD is the dropout rate $p$, which defines how much of each layer is to be included, with the rest of the units dropped in a structured and ordered manner. The value of $p$ is selected by sampling from the dropout distribution $D_{\mathcal{P}}$, which is represented by a set of discrete values $\mathcal{P} = \{s_1, s_2, \dots, s_{|\mathcal{P}|}\}$ such that $0 < s_1 < \dots < s_{|\mathcal{P}|} \le 1$ and probabilities $\mathbf{P}(p = s_i) > 0$ for all $i \in [|\mathcal{P}|]$ such that $\sum_{i=1}^{|\mathcal{P}|} \mathbf{P}(p = s_i) = 1$. For instance, a uniform distribution over $\mathcal{P}$ is denoted by $p \sim \mathcal{U}_{\mathcal{P}}$ (i.e. $D = \mathcal{U}$). In our experiments we use the uniform distribution over the set $\{i/k\}_{i=1}^{k}$, which we refer to as $\mathcal{U}_k$ (or uniform-$k$). The discrete nature of the distribution stems from the innately discrete number of neurons or filters to be selected. The selection of the set $\mathcal{P}$ is discussed in the next subsection.

The dropout rate $p$ can be constant across all layers or configured individually per layer $l$, leading to $p_l \sim D^{l}_{\mathcal{P}}$. As such an approach opens the search space dramatically, we refer the reader to NAS techniques (Zoph and Le, 2017) and continue with the same $p$ value across network layers for simplicity, without hurting the generality of our approach.
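As an illustration of the sampling step, the following minimal sketch builds the candidate set of a uniform-k distribution and draws dropout rates from it. The function names `uniform_k_candidates` and `sample_p` are hypothetical and chosen for this example only; exact fractions are used so that the candidate values carry no floating-point error.

```python
import random
from fractions import Fraction


def uniform_k_candidates(k):
    """Candidate set {1/k, 2/k, ..., 1} used by the uniform-k distribution."""
    return [Fraction(i, k) for i in range(1, k + 1)]


def sample_p(candidates, rng, probs=None):
    """Draw one value p from a discrete distribution over `candidates`.

    probs=None gives the uniform case; otherwise `probs` holds one positive
    weight per candidate (an assumption of this sketch).
    """
    return rng.choices(candidates, weights=probs, k=1)[0]


P = uniform_k_candidates(5)        # 1/5, 2/5, 3/5, 4/5, 1
rng = random.Random(0)             # seeded for reproducibility
draws = [sample_p(P, rng) for _ in range(1000)]
```

Passing explicit `probs` would model a non-uniform distribution over the same candidate set.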

Figure 2: Ordered vs. Random Dropout. The left-most features are used by more devices during training, creating a natural ordering to the importance of these features.

Given a $p$ value, a pruned $p$-subnetwork can be directly obtained as follows. For each layer $l$ with width $K_l$ (i.e. neurons for fully-connected layers, linear and recurrent, and filters for convolutional layers), the submodel for a given $p$ has all neurons/filters with index $\{0, 1, \dots, \lceil p \cdot K_l \rceil - 1\}$ included and $\{\lceil p \cdot K_l \rceil, \dots, K_l - 1\}$ pruned. OD is not applied to the last layer, in order to maintain the same output dimensionality. Moreover, the unnecessary connections between pruned neurons/filters are also removed. For BatchNorm, we maintain a separate set of statistics for every dropout rate $p$; this has only a marginal effect on the number of parameters and can be used in a privacy-preserving manner (Li et al., 2021). We denote a pruned $p$-subnetwork $\mathbf{F}_p$ with its weights $\mathbf{w}_p$, where $\mathbf{F}$ and $\mathbf{w}$ are the original network and weights, respectively. Importantly, contrary to existing pruning techniques (Han et al., 2015; Lee et al., 2019; Molchanov et al., 2019), a $p$-subnetwork from OD can be directly obtained post-training without the need to fine-tune, thus eliminating the requirement to access any labelled data.
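The extraction step can be sketched for a single fully-connected layer as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are stored as a list of rows (one row per output neuron), the helper names `kept_units` and `extract_dense` are hypothetical, and a small tolerance guards the ceiling against floating-point round-off.

```python
import math


def kept_units(p, width):
    """Number of leading units kept for rate p, i.e. ceil(p * width)."""
    # The 1e-9 tolerance is an assumption of this sketch; it avoids keeping
    # an extra unit when p * width is an integer up to round-off.
    return math.ceil(p * width - 1e-9)


def extract_dense(weight, bias, p_in, p_out):
    """Slice out the leading block of a dense layer.

    weight: list of rows, shape [out_units][in_units]; bias: [out_units].
    p_in / p_out are the rates applied to the input and output widths
    (use p_in=1.0 for the first layer, p_out=1.0 for the last layer).
    """
    out_keep = kept_units(p_out, len(weight))
    in_keep = kept_units(p_in, len(weight[0]))
    sub_weight = [row[:in_keep] for row in weight[:out_keep]]
    sub_bias = bias[:out_keep]
    return sub_weight, sub_bias


# Toy layer with 4 output units and 5 input units.
weight = [[10 * r + c for c in range(5)] for r in range(4)]
bias = [0.1, 0.2, 0.3, 0.4]
sub_weight, sub_bias = extract_dense(weight, bias, p_in=0.5, p_out=0.5)
```

Because only a leading block is kept, the submodel is a contiguous slice of the original tensors, which is what makes extraction possible without retraining or index bookkeeping.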

Figure 6: Full non-federated datasets. Panels: (a) ResNet18 - CIFAR10, (c) RNN - Shakespeare. OD: Ordered Dropout with $D_{\mathcal{P}} = \mathcal{U}_5$, SM: single independent models, KD: knowledge distillation.

3.2 Training OD Formulation

We propose two ways to train an OD-enabled network: i) plain OD and ii) knowledge distillation OD training (OD w/ KD). In the first approach, in each step we first sample $p \sim D_{\mathcal{P}}$; then we perform the forward and backward pass using the $p$-reduced network $\mathbf{F}_p$; finally we update the submodel’s weights using the selected optimiser. Since sampling a $p$-reduced network provides significant computational savings on average, we can exploit this reduction to further boost accuracy. Therefore, in the second approach we exploit the nested structure of OD, i.e. $p_1 \le p_2 \implies \mathbf{F}_{p_1} \subset \mathbf{F}_{p_2}$, and allow the bigger-capacity supermodel to teach the sampled $p$-reduced network at each iteration via knowledge distillation (teacher $p_{\max} > p$, $p_{\max} = \max \mathcal{P}$). In particular, in each iteration, the loss function consists of two components as follows:

$$\mathcal{L}_d(\mathrm{SM}_p, \mathrm{SM}_{p_{\max}}, y_{\mathrm{label}}) = (1 - \alpha)\, \mathrm{CE}(\mathrm{SM}_p, y_{\mathrm{label}}) + \alpha\, \mathrm{KL}(\mathrm{SM}_p, \mathrm{SM}_{p_{\max}}, T)$$

where $\mathrm{SM}_p$ is the softmax output of the sampled $p$-submodel, $y_{\mathrm{label}}$ is the ground-truth label, CE is the cross-entropy function, KL is the KL divergence, $T$ is the distillation temperature (Hinton et al., 2014) and $\alpha$ is the relative weight of the two components. We observed in our experiments that always backpropagating through the teacher network as well further boosts performance. Furthermore, the best performing values for distillation were $\alpha = T = 1$, thus smaller models exactly mimic the teacher output.
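The two-component loss can be sketched in plain Python as below. This is an illustrative sketch under assumptions rather than the authors' code: `alpha` weights the distillation term and `1 - alpha` the cross-entropy term, the KL divergence is taken from the teacher distribution to the student distribution, the temperature divides the logits of both networks, and no extra temperature-squared scaling is applied. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of predicted probabilities against an integer label."""
    return -math.log(probs[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); the direction is an assumption of this sketch."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


def od_kd_loss(student_logits, teacher_logits, label, alpha, temperature):
    """(1 - alpha) * CE(student, label) + alpha * KL(teacher, student)."""
    ce = cross_entropy(softmax(student_logits), label)
    kl = kl_divergence(softmax(teacher_logits, temperature),
                       softmax(student_logits, temperature))
    return (1 - alpha) * ce + alpha * kl


student = [2.0, 1.0, 0.1]   # logits of the sampled submodel
teacher = [3.0, 0.5, 0.2]   # logits of the maximal submodel
loss = od_kd_loss(student, teacher, label=0, alpha=0.5, temperature=1.0)
```

With `alpha=1.0` only the distillation term remains, so the loss is zero exactly when the student reproduces the teacher's output distribution.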

3.3 Ordered Dropout exactly recovers SVD

We further show that our new OD formulation can recover the Singular Value Decomposition (SVD) in the case where there exists a linear mapping from features to responses. We formalise this claim in the following theorem.

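Theorem 1.

Let $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m$ be a neural network with two fully-connected linear layers with no activation or biases and $K = \min\{m, n\}$ hidden neurons. Moreover, let data $\mathcal{X}$ come from a uniform distribution on the $n$-dimensional unit ball and $A$ be an $m \times n$ full rank matrix with $K$ distinct singular values. If response $y$ is linked to data $\mathcal{X}$ via a linear map $x \mapsto Ax$ and distribution $D_{\mathcal{P}}$ is such that for every $b \in [K]$ there exists $p \in \mathcal{P}$ for which $b = \lceil p \cdot K \rceil$, then for the optimal solution of

$$\min_{\mathbf{F}} \; \mathbb{E}_{x \sim \mathcal{X}} \, \mathbb{E}_{p \sim D_{\mathcal{P}}} \, \| \mathbf{F}_p(x) - y \|^2$$

it holds that $\mathbf{F}_p(x) = A_b x$, where $A_b$ is the best $b$-rank approximation of $A$ and $b = \lceil p \cdot K \rceil$.

Theorem 1 shows that the ordered importance representation of our OD formulation holds not only intuitively but also theoretically. The proof of this claim is deferred to the Appendix.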
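For reference, the "best rank-$b$ approximation" in the statement is the truncated SVD, by the standard Eckart–Young–Mirsky theorem. The block below spells this out; the symbols $\sigma_i$, $u_i$ and $v_i$ (singular values and left/right singular vectors of $A$) are notation introduced here for illustration and do not appear in the original text.

```latex
% SVD of the m x n matrix A with K = min{m, n} distinct singular values
A = \sum_{i=1}^{K} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 > \sigma_2 > \dots > \sigma_K > 0 .

% Best rank-b approximation in Frobenius norm (Eckart--Young--Mirsky)
A_b = \sum_{i=1}^{b} \sigma_i \, u_i v_i^{\top}
    = \operatorname*{arg\,min}_{\operatorname{rank}(B) \le b} \; \| A - B \|_F .
```

Read this way, the theorem says that keeping the first $b$ hidden neurons of the optimal OD-trained network reproduces the map built from the $b$ largest singular values, which is the sense in which the hidden units are ordered by importance.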

3.4 Model-Device Association

Computational and Memory Implications. The primary objective of OD is to alleviate the excessive computational and memory demands of the training and deployment processes. When a layer is shrunk through OD, there is no need to perform the forward and backward passes or gradient updates on the pruned units. As a result, OD offers gains both in terms of FLOP count and model size. In particular, for every fully-connected and convolutional layer, the number of FLOPs and weight parameters is reduced by a factor of $\frac{K_1 \cdot K_2}{\lceil p \cdot K_1 \rceil \cdot \lceil p \cdot K_2 \rceil} \sim \frac{1}{p^2}$, where $K_1$ and $K_2$ correspond to the number of input and output neurons/channels, respectively. Accordingly, the bias terms are reduced by a factor of $\frac{K_2}{\lceil p \cdot K_2 \rceil} \sim \frac{1}{p}$. The normalisation, activation and pooling layers are compressed in terms of FLOPs and parameters similarly to the biases in fully-connected and convolutional layers. This is also evident in Table 2. Finally, smaller model size also leads to reduced memory footprint for gradients and the optimiser’s state vectors such as momentum. This raises the question of how these submodels relate to devices in the wild and how this relationship is modelled.
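The scaling of weights and biases with the dropout rate can be checked with a few lines of arithmetic. The sketch below counts the parameters of one fully-connected layer whose input and output widths are both shrunk by the same rate; `kept_units` and `dense_params` are hypothetical helper names, and the rounding tolerance is an assumption of the sketch.

```python
import math


def kept_units(p, width):
    """Number of leading units kept for rate p, i.e. ceil(p * width)."""
    return math.ceil(p * width - 1e-9)  # tolerance guards float round-off


def dense_params(k_in, k_out, p=1.0):
    """Weight and bias counts of a dense layer shrunk to rate p on both sides."""
    weights = kept_units(p, k_in) * kept_units(p, k_out)
    biases = kept_units(p, k_out)
    return weights, biases


full_w, full_b = dense_params(512, 512)          # unpruned layer
half_w, half_b = dense_params(512, 512, p=0.5)   # p = 0.5 submodel

weight_reduction = full_w / half_w   # ~ 1 / p^2
bias_reduction = full_b / half_b     # ~ 1 / p
```

The same trend is visible in Table 2: assuming the smallest ResNet18 entry corresponds to p = 0.2, the 11M and 456K parameter counts differ by a factor of roughly 24, close to 1/0.2² = 25.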

Ordered Dropout Rates Space. Our primary objective with OD is to tackle device heterogeneity. Inherently, each device has certain capabilities and can run a specific number of model operations within a given time budget. Since each $p$ value defines a submodel of a given width, we can indirectly associate a $p^i_{\max}$ value with the capabilities of the $i$-th device, such as memory, processing throughput or energy budget. As such, each participating client is given at most the $p^i_{\max}$-submodel it can handle.

Devices in the wild, however, can have dramatically different capabilities, a fact further exacerbated by the co-existence of previous-generation devices. Modelling each device discretely quickly becomes intractable at scale. Therefore, we cluster devices of similar capabilities together and subsequently associate a single $p^i_{\max}$ value with each cluster. This clustering can be done heuristically (i.e. based on the specifications of the device) or via benchmarking of the model on the actual device, and is considered a system-design decision for our paper. As smartphones nowadays run a multitude of simultaneous tasks (LiKamWa and Zhong, 2015), our framework can further support modelling of transient device load by reducing its associated $p^i_{\max}$, which essentially brings the capabilities of the device to a lower tier at run time, thus bringing real-time adaptability to FjORD.

Concretely, the discrete candidate values of $\mathcal{P}$ depend on i) the number of clusters and corresponding device tiers, ii) the different load levels being modelled and iii) the size of the network itself, as for each tier $i$ there exists a $p^i_{\max}$ beyond which the network cannot be resolved. In this paper, we treat the former two as invariants (assumed to be given by the service provider), but provide results across different numbers and distributions of clusters, models and datasets.
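One simple way to associate a device tier with a maximum submodel is to pick the largest candidate whose cost fits a compute budget. The sketch below is a hypothetical heuristic, not the authors' procedure: it uses the ResNet18 MAC counts from Table 2 and assumes that the five table columns correspond to p = 0.2, 0.4, 0.6, 0.8 and 1.0. The function name `assign_p_max` and the budgets are made up for illustration.

```python
# MACs per submodel for CIFAR10 / ResNet18, taken from Table 2.
# The mapping of columns to p values is an assumption (uniform-5 candidates).
RESNET18_MACS = {0.2: 23e6, 0.4: 91e6, 0.6: 203e6, 0.8: 360e6, 1.0: 555e6}


def assign_p_max(mac_budget, macs_per_p):
    """Largest candidate p whose MAC count fits the budget, or None."""
    feasible = [p for p, macs in macs_per_p.items() if macs <= mac_budget]
    return max(feasible) if feasible else None


low_tier = assign_p_max(100e6, RESNET18_MACS)    # fits up to the 91M submodel
mid_tier = assign_p_max(250e6, RESNET18_MACS)    # fits up to the 203M submodel
high_tier = assign_p_max(600e6, RESNET18_MACS)   # fits the full model
too_small = assign_p_max(10e6, RESNET18_MACS)    # cannot run any submodel
```

Transient load could be modelled in the same way by lowering the budget passed for a device at run time, which moves it to a lower tier.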

3.5 Preliminary Results

In this section, we present some results to showcase the performance of OD in the centralised non-FL training setting (i.e. the server has access to all training data) across three tasks, explained in detail in § 5.

Concretely, we run OD with distribution $D_{\mathcal{P}} = \mathcal{U}_5$ (uniform distribution over the set $\{i/5\}_{i=1}^{5}$) and compare it with end-to-end trained submodels (SM) trained in isolation for the given width of the model. Fig. 6 shows that across the three datasets, the best attained performance of OD along every width is very close to the performance of the baseline models. We note at this point that the submodel baselines are trained from scratch, explicitly optimised for the given width with no possibility to jump across them, while our OD model was trained using a single training loop and offers the ability to switch between accuracy-computation points without the need to retrain.


4 FjORD

Building upon OD, we introduce FjORD, a framework for federated training over heterogeneous clients. We subsequently describe the workflow of FjORD, further documented in Algorithm 1.

As a starting point, the global model architecture, $\mathbf{F}$, is initialised with weights $\mathbf{w}^0$, either randomly or via a pretrained network. The dropout rates space $\mathcal{P}$ is selected along with distribution $D_{\mathcal{P}}$ with $|\mathcal{P}|$ discrete candidate values, with each $p$ corresponding to a subnetwork of the global model with varying FLOPs and parameters. Next, the devices to participate are clustered into tiers and a $p^c_{\max}$ value is associated with each cluster $c$. The resulting $p^c_{\max}$ represents the maximum capacity of the network that the devices in this cluster can handle without violating a latency or memory footprint constraint.

At the beginning of each communication round $t$, the set of participating devices $\mathcal{S}_t$ is determined, which either consists of all available clients $\mathcal{A}_t$ or contains only a random subset of $\mathcal{A}_t$ based on the server’s capacity. Next, the server broadcasts the current model to the set of clients $\mathcal{S}_t$ and each client $i$ receives $\mathbf{w}_{p^i_{\max}}$. On the client side, each client runs $E$ local iterations and at each local iteration $k$, the device $i$ samples $p_{(i,k)}$ from the conditional distribution $D_{\mathcal{P}} \mid D_{\mathcal{P}} \le p^i_{\max}$, which accounts for its limited capability. Subsequently, each client updates the respective weights ($\mathbf{w}_{p_{(i,k)}}$) of the local submodel using the FedAvg (McMahan et al., 2017a) update rule. In this step, other strategies (Li et al., 2020b; Wang et al., 2020; Karimireddy et al., 2020) can be interchangeably employed. At the end of the local iterations, each device sends its update back to the server.
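The client-side sampling from the conditional distribution can be sketched as follows: the candidate values above the device's maximum are discarded and the remaining probabilities are renormalised. The helper names `restrict_candidates` and `sample_local_p` are hypothetical, and the candidate set and device limit are example values.

```python
import random


def restrict_candidates(candidates, probs, p_max):
    """Keep only candidate values (and their weights) not exceeding p_max."""
    pairs = [(p, w) for p, w in zip(candidates, probs) if p <= p_max]
    if not pairs:
        raise ValueError("no candidate value fits this device")
    values, weights = zip(*pairs)
    return list(values), list(weights)


def sample_local_p(candidates, probs, p_max, rng):
    """Draw p for one local iteration, conditioned on p <= p_max."""
    values, weights = restrict_candidates(candidates, probs, p_max)
    # random.choices normalises the weights, which performs the conditioning.
    return rng.choices(values, weights=weights, k=1)[0]


candidates = [0.2, 0.4, 0.6, 0.8, 1.0]   # uniform-5 candidate values
probs = [0.2] * 5                        # uniform probabilities
rng = random.Random(0)                   # seeded for reproducibility

# A device whose cluster supports at most the 0.6 submodel.
local_draws = [sample_local_p(candidates, probs, 0.6, rng) for _ in range(200)]
```

Each draw selects the submodel that is trained in that local iteration, so a lower-tier device only ever touches the leading part of the network.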

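Finally, the server aggregates these communicated changes and updates the global model, to be distributed in the next global federated round to a different subset of devices. Heterogeneity of devices leads to heterogeneity in the model updates and, hence, we need to account for that in the global aggregation step. To this end, we utilise the following aggregation rule:

$$\mathbf{w}^{t+1}_{s_j} \setminus \mathbf{w}^{t+1}_{s_{j-1}} = \mathrm{WA}\left( \left\{ \mathbf{w}^{(i,t,E)}_{s_j} \setminus \mathbf{w}^{(i,t,E)}_{s_{j-1}} \right\}_{i \in \mathcal{S}^j_t} \right)$$

where $\mathbf{w}_{s_j} \setminus \mathbf{w}_{s_{j-1}}$ are the weights that belong to $\mathbf{F}_{s_j}$ but not to $\mathbf{F}_{s_{j-1}}$, $\mathbf{w}^{t+1}$ the global weights at communication round $t+1$, $\mathbf{w}^{(i,t,E)}$ the weights on client $i$ at communication round $t$ after $E$ local iterations, $\mathcal{S}^j_t = \{ i \in \mathcal{S}_t : p^i_{\max} \ge s_j \}$ the set of clients that have the capacity to update $\mathbf{w}_{s_j}$, and WA stands for weighted average, where weights are proportional to the amount of data on each client.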
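The aggregation step can be sketched per parameter: each global parameter is replaced by the data-weighted average over only those clients whose submodel contains it. The sketch below is a simplified illustration, not the authors' implementation: the model is a flat dictionary of scalar parameters, each client reports only the parameters of its own submodel, and parameters that no participating client holds keep their previous value (this last choice is an assumption of the sketch). The name `aggregate` is hypothetical.

```python
def aggregate(global_weights, client_updates):
    """Data-weighted average, per parameter, over the clients that hold it.

    global_weights: {name: value} for the full model.
    client_updates: list of (num_samples, {name: value}) pairs, where each
    client's dictionary covers only its own submodel.
    """
    new_weights = {}
    for name, old_value in global_weights.items():
        total_samples = 0.0
        weighted_sum = 0.0
        for num_samples, weights in client_updates:
            if name in weights:          # client has capacity for this unit
                total_samples += num_samples
                weighted_sum += num_samples * weights[name]
        if total_samples > 0:
            new_weights[name] = weighted_sum / total_samples
        else:
            new_weights[name] = old_value  # assumption: untouched if unseen
    return new_weights


global_weights = {"u0": 0.0, "u1": 0.0, "u2": 0.0}
client_updates = [
    (10, {"u0": 1.0}),               # low-tier client: smallest submodel only
    (30, {"u0": 2.0, "u1": 4.0}),    # mid-tier client: two leading units
]
new_weights = aggregate(global_weights, client_updates)
```

Here `u0` is averaged over both clients, `u1` only over the mid-tier client, and `u2` is left unchanged because no participating client could train it in this round.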


Communication Savings. In addition to the computational savings (§ 3.4), OD provides additional communication savings. First, for the server-to-client transfer, every device $i$ with $p^i_{\max} < 1$ observes a reduction of approximately $1/(p^i_{\max})^2$ in the downstream transferred data due to the smaller model size (§ 3.4). Accordingly, the upstream client-to-server transfer is decreased by the same factor, as only the gradient updates of the unpruned units are transmitted.

Subnetwork Knowledge Transfer. In § 3.2, we introduced knowledge distillation for our OD formulation. We extend this approach to FjORD, where instead of the full network, we employ width $\max\{p \in \mathcal{P} : p \le p^i_{\max}\}$ as a teacher network in each local iteration on device $i$.

5 Evaluation of FjORD

Table 1: Datasets

Dataset      Model     # Clients  # Samples  Task
CIFAR10      ResNet18                        Image classification
FEMNIST      CNN                             Image classification
Shakespeare  RNN                             Next character prediction

Table 2: MACs and Parameters per $p$-reduced network

Dataset / Model      Metric  p=0.2  p=0.4  p=0.6  p=0.8  p=1.0
CIFAR10 / ResNet18   MACs    23M    91M    203M   360M   555M
CIFAR10 / ResNet18   Params  456K   2M     4M     7M     11M
FEMNIST / CNN        MACs    47K    120K   218K   342K   491K
FEMNIST / CNN        Params  5K     10K    15K    20K    26K
Shakespeare / RNN    MACs    12K    40K    83K    143K   216K
Shakespeare / RNN    Params  12K    40K    82K    142K   214K

In this section, we provide a thorough evaluation of FjORD and its components across different tasks, datasets, models and device cluster distributions to showcase its performance, elasticity and generality.

Figure 10: Ordered Dropout with KD vs eFD baselines. Performance vs dropout rate p across different networks and datasets. Panels: (a) ResNet18 - CIFAR10, (c) RNN - Shakespeare.

Datasets and Models. We evaluate FjORD on two vision and one text prediction task, shown in Table 1. For CIFAR10 (Krizhevsky et al., 2009), we use the “CIFAR” version of ResNet18 (He et al., 2016). We federate the dataset by randomly dividing it into equally-sized partitions, each allocated to a specific client, and thus remaining IID in nature. For FEMNIST, we use a CNN with two convolutional layers followed by a softmax layer. For Shakespeare, we employ an RNN with an embedding layer (without dropout) followed by two LSTM (Hochreiter and Schmidhuber, 1997) layers and a softmax layer. We report the model’s performance at the last epoch on the test set, which is constructed by combining the test data of each client. We report top-1 accuracy for vision tasks and negative perplexity for text prediction. Further details, such as hyperparameters and descriptions of datasets and models, are available in the Appendix.

5.1 Experimental Setup

Infrastructure. FjORD was implemented on top of the Flower (v0.14dev) (Beutel et al., 2020) framework and PyTorch (v1.4.0) (Paszke et al., 2019). We run all our experiments on a private cloud cluster consisting of Nvidia V100 GPUs. To scale to hundreds of clients on a single machine, we optimised Flower so that clients only allocate GPU resources when actively participating in a federated client round. We report average performance and the standard deviation across three runs for all experiments. To model client availability, we run multiple Flower clients in parallel and sample 10% at each global round, with the ability for clients to switch identity at the beginning of each round to overprovision for larger federated datasets. Furthermore, we model client heterogeneity by assigning each client to one of the device clusters. We provide the following setups:

  • Uniform-{5,10}: This refers to the distribution $D_{\mathcal{P}}$, i.e. $p \sim \mathcal{U}_k$, with $k = 5$ or $k = 10$.

  • Drop Scale $ds \in \{0.5, 1.0\}$: This parameter affects a possible skew in the number of devices per cluster. It refers to the drop in clients per cluster of devices as we go to higher $p$’s. Formally, for uniform-$n$ and drop scale $ds$, the high-end cluster contains $1 - (n-1) \cdot ds/n$ of the devices and the rest of the clusters contain $ds/n$ each. Hence, for $ds = 1.0$ of the uniform-5 case, all devices can run the $p = 0.2$ subnetwork, 80% can run the $p = 0.4$ and so on, leading to a device distribution of $(0.2, 0.2, 0.2, 0.2, 0.2)$. This percentage drop is half for the case of $ds = 0.5$, resulting in a larger high-end cluster, e.g. $(0.1, 0.1, 0.1, 0.1, 0.6)$.
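The device distribution implied by a drop scale can be worked out with a short sketch. The function names `cluster_shares` and `fraction_able_to_run` are hypothetical; the arithmetic follows the description above, with every cluster except the high-end one holding a share of drop scale divided by the number of clusters and the high-end cluster holding the remainder.

```python
def cluster_shares(n, drop_scale):
    """Share of devices in each of n clusters, ordered from low- to high-end."""
    low_share = drop_scale / n
    high_share = 1.0 - (n - 1) * low_share   # remainder goes to the top tier
    return [low_share] * (n - 1) + [high_share]


def fraction_able_to_run(shares):
    """Fraction of devices able to run each submodel, smallest first.

    A device can run the submodel of its own cluster and of every lower one.
    """
    return [sum(shares[j:]) for j in range(len(shares))]


uniform = cluster_shares(5, 1.0)    # about (0.2, 0.2, 0.2, 0.2, 0.2)
skewed = cluster_shares(5, 0.5)     # about (0.1, 0.1, 0.1, 0.1, 0.6)
coverage = fraction_able_to_run(uniform)   # about (1.0, 0.8, 0.6, 0.4, 0.2)
```

The `coverage` list reproduces the statement that all devices can run the smallest submodel, 80% the next one, and so on, for the uniform-5 case with drop scale 1.0.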

Baselines. To assess the performance of our work against the state-of-the-art, we compare FjORD with the following set of baselines: i) Extended Federated Dropout (eFD), ii) FjORD with eFD (FjORD w/ eFD).

eFD builds on top of the technique of Federated Dropout (FD) (Caldas et al., 2018b), which adopts a Random Dropout (RD) at neuron/filter level for minimising the model’s footprint. However, FD does not support adaptability to heterogeneous client capabilities out of the box, as it inherits a single dropout rate across devices. For this reason, we propose an extension to FD that allows the dropout rate to be adapted to the device capabilities, defined by the respective cluster membership. It is clear that eFD dominates FD in performance and provides a tougher baseline, as the latter needs to impose the same dropout rate on all devices to fit the model at hand, leading to larger dropout rates (i.e. uniform dropout of 80% for the full model to support the low-end devices). We provide empirical evidence for this in the Appendix. For investigative purposes, we also applied eFD on top of FjORD, as a means to update a larger part of the model from lower-tier devices, i.e. allowing them to evaluate submodels beyond their $p^i_{\max}$ during training.
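The difference between the random selection used by (extended) Federated Dropout and the ordered selection of OD can be illustrated on unit indices. This is a hypothetical sketch, not the baselines' actual code: `p` denotes the fraction of units kept (so the random dropout rate is `1 - p`), and the helper names are made up for this example.

```python
import math
import random


def num_kept(p, width):
    """Number of units kept for fraction p, i.e. ceil(p * width)."""
    return math.ceil(p * width - 1e-9)  # tolerance guards float round-off


def ordered_keep(p, width):
    """Ordered Dropout: always keep the leading units."""
    return list(range(num_kept(p, width)))


def random_keep(p, width, rng):
    """Random Dropout: keep a random subset of the same size."""
    return sorted(rng.sample(range(width), num_kept(p, width)))


rng = random.Random(0)                 # seeded for reproducibility
od_units = ordered_keep(0.4, 10)       # the four leading units
rd_units = random_keep(0.4, 10, rng)   # four units at random positions
```

Both selections have the same footprint, but only the ordered one yields a contiguous block that is shared by every device of equal or higher capability, which is what gives the nested structure.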

Figure 14: Ablation analysis of FjORD with Knowledge Distillation (KD). Ordered Dropout with $D_{\mathcal{P}} = \mathcal{U}_5$. Panels: (a) ResNet18 - CIFAR10, (c) RNN - Shakespeare.

5.2 Performance Evaluation

In order to evaluate the performance of FjORD, we compare it to the two baselines, eFD and FjORD + eFD. We consider the uniform-5 setup with drop scale of 1.0 (i.e. uniform clusters). For each baseline, we train one independent model $\mathbf{F}_p$, end-to-end, for each $p$. For eFD, this means that the clusters of devices that cannot run model $\mathbf{F}_p$ compensate by randomly dropping out neurons/filters. We point out that $p = 0.2$ is omitted from the eFD results as it essentially does not employ any dropout whatsoever. For the case of FjORD + eFD, we control the RD by capping its rate. This allows larger submodels to be updated more often, as a device belonging to cluster $c$ can now have $p^c_{\max} \rightarrow p^{c+1}_{\max}$ during training, where $c+1$ is the next more powerful cluster, while at the same time preventing the destructive effect of too high dropout values shown in the eFD baseline.

Fig. 10 presents the achieved accuracy for varying values of $p$ across the three target datasets. FjORD (denoted by FjORD w/ KD) outperforms eFD on CIFAR10, FEMNIST and Shakespeare, with gains measured in percentage points (pp) for the vision tasks and in points (p) for Shakespeare. Compared to FjORD + eFD, FjORD also achieves performance gains on all three datasets (0.22 p on average on Shakespeare).

Across all tasks, we observe that FjORD is able to improve its performance with increasing $p$ due to the nested structure of its OD method. We also conclude that eFD on top of FjORD does not seem to lead to better results. More importantly though, given the heterogeneous pool of devices, to obtain the highest performing model for eFD, multiple models have to be trained (i.e. one per device cluster). The highest performing eFD model for each of CIFAR10, FEMNIST and Shakespeare can therefore be identified only a posteriori, after all model variants have been trained. Instead, despite the device heterogeneity, FjORD requires a single training process that leads to a global model that significantly outperforms the best model of eFD on all three datasets, while allowing the direct, seamless extraction of submodels due to the nested structure of OD.

5.3 Ablation Study of KD in FjORD

To evaluate the contribution of our knowledge distillation method to the attainable performance of FjORD, we conduct an ablative analysis on all three datasets. We adopt the same setup of uniform-5 and $ds = 1.0$ as in the previous section and compare FjORD with and without KD.

Fig. 14 shows the efficacy of FjORD’s KD in FL settings. FjORD’s KD consistently improves the performance across all three datasets when , with average gains of and pp for submodels of size and on CIFAR-10, and pp for FEMNIST and and pp for Shakespeare. For the cases of , the impact of KD is fading, especially in the two vision tasks. We believe this to be a side-effect of optimising for the average accuracy across submodels, which also yielded the strategy. We leave the exploration of alternative weighted KD strategies as future work. Overall, the use of KD significantly improves the performance of the global model, yielding gains of and pp for CIFAR10 and FEMNIST respectively and p for Shakespeare.

5.4 FjORD’s Deployment Flexibility

5.4.1 Device Clusters Scalability

An important characteristic of FjORD is its ability to scale to a larger number of device clusters or, equivalently, perform well with higher granularity of $p$ values. To illustrate this, we fall back to the local setup for simplicity and timeliness of experiment results and test the performance of OD across two setups, uniform-5 and -10 (defined in § 5.1).

As shown in Fig. 17, FjORD sustains its performance even under the higher granularity of $p$ values. This means that for applications where the modelling of clients needs to be more fine-grained, FjORD can still be of great value, without any significant degradation in achieved accuracy per submodel. This further supports the use-case where device load needs to be modelled explicitly in device clusters (e.g. modelling device capabilities and load with deciles).

5.4.2 Adaptability to Device Distributions

In this section, we make a similar case about FjORD’s elasticity with respect to the allocation of available devices to each cluster. We adopt the setup of uniform-5 once again, but compare across drop scales $0.5$ and $1.0$. In both cases, the clusters of clients that can support models of $p \in \{0.2, \dots, 0.8\}$ are equisized, but the former halves their percentage of devices and allocates the difference to the last cluster. Hence, when $ds = 0.5$, the high-end cluster accounts for 60% of the devices. The rationale behind this is that the majority of participating devices are able to run the whole original model.

The results depicted in Fig. 20 show that in both cases, FjORD is able to perform up to par, without significant degradation in accuracy due to the presence of more lower-tier devices in the FL setting. We should note that we did not alter the uniform sampling in this case on the premise that high-end devices are seen more often, exactly to illustrate FjORD’s adaptability to latent user device distribution changes of which the server is not aware.

Figure 17: Demonstration of FjORD’s scalability with respect to the number of device clusters on non-federated datasets. Panel: (b) RNN - Shakespeare.

Figure 20: Demonstration of the adaptability of FjORD across different device distributions. Panel: (b) RNN - Shakespeare.

6 Related Work

Dropout Techniques. Contrary to conventional Random Dropout (Srivastava et al., 2014), which stochastically drops a different, random set of a layer’s units in every batch and is typically applied for regularisation purposes, OD employs a structured ordered dropping scheme that aims primarily at tunably reducing the computational and memory cost of training and inference. However, OD can still have an implicit regularization effect since we encourage learning towards the top-ranked units (e.g. the left-most units in the example of Fig. 2), as these units will be dropped less often during training. Respectively, at inference time, the load of a client can be dynamically adjusted by dropping the least important units, i.e. adjusting the width of the network.

To the best of our knowledge, the only similar technique to OD is Nested Dropout (Rippel et al., 2014), where the authors proposed a similar construction, which is applied to the representation layer in autoencoders in order to enforce identifiability of the learned representation. In our case, we apply OD to every layer to elastically adapt the computation and memory requirements during training and inference.

Traditional Pruning. Conventional non-FL compression techniques can be applicable to reduce the network size and computation needs. The majority of pruning methods (Han et al., 2015; Guo et al., 2016; Li et al., 2016; Lee et al., 2019; Molchanov et al., 2019) aim to generate a single pruned model and require access to labelled data in order to perform a costly fine-tuning/calibration for each pruned variant. Instead, FjORD’s Ordered Dropout enables the deterministic extraction of multiple pruned models with varying resource budgets directly after training. In this manner, we remove both the excessive overhead of fine-tuning and the need for labelled data availability, which is crucial for real-world, privacy-aware applications (Wainwright et al., 2012; Shokri and Shmatikov, 2015). Finally, other model compression methods (Fang et al., 2018; Wang et al., 2019a; Dudziak et al., 2019) remain orthogonal to FjORD.

System Heterogeneity. So far, although substantial effort has been devoted to alleviating the statistical heterogeneity (Li et al., 2020a) among clients (Smith et al., 2017; Li and Wang, 2019; Hsieh et al., 2020; Fallah et al., 2020; Li et al., 2020c), system heterogeneity has largely remained unaddressed. Considering the diversity of client devices, techniques for client selection (Nishio and Yonetani, 2019) and for controlling the per-round number of participating clients and local iterations (Luo et al., 2021; Wang et al., 2019b) have been developed. Nevertheless, as these schemes are restricted to allocating a uniform amount of work to each selected client, they either limit the model complexity to fit the lowest-end devices or exclude slow clients altogether. From an aggregation viewpoint, FedProx (Li et al., 2020b) allows partial results to be integrated into the global model, thus enabling the allocation of different amounts of work across heterogeneous clients. Although each client is allowed to perform a different number of local iterations based on its resources, large models still cannot be accommodated on the more constrained devices.

Communication Optimisation. The majority of existing work has focused on tackling the communication overhead in FL. Konečný et al. (2016) proposed using structured and sketched updates to reduce the transmitted data. ATOMO (Wang et al., 2018) introduced a generalised gradient decomposition and sparsification technique, aiming to reduce the gradient sizes communicated upstream. Han et al. (2020) adaptively select the gradients’ sparsification degree based on the available bandwidth and computational power. Building upon gradient quantisation methods (Lin et al., 2018; Horváth et al., 2019; Rajagopal et al., 2020; Horváth and Richtárik, 2021), Amiri et al. (2020) proposed using quantisation in the model sharing and aggregation steps. However, their scheme requires the same clients to participate across all rounds, and is, thus, unsuitable for realistic settings where clients’ availability cannot be guaranteed.

Despite the bandwidth savings, these communication-optimising approaches do not offer computational gains nor do they address device heterogeneity. Nonetheless, they remain orthogonal to our work and can be complementarily combined to further alleviate the communication cost.

Computation-Communication Co-optimisation. A few works aim to co-optimise both the computational and bandwidth costs. PruneFL (Jiang et al., 2020) proposes an unstructured pruning method. Despite the similarity to our work in terms of pruning, this method assumes a common pruned model across all clients at a given round, thus not allowing more powerful devices to update more weights. Hence, the pruned model needs to meet the constraints of the least capable devices, which severely limits the model capacity. Moreover, the adopted unstructured sparsity is difficult to translate into processing speed gains (Yao et al., 2019). Federated Dropout (Caldas et al., 2018b) randomly sparsifies the global model before sharing it with the clients. Similarly to PruneFL, Federated Dropout does not consider the system diversity and distributes the same model to all clients. Thus, it is either restricted by the low-end devices or excludes them altogether from the FL process.

Contrary to the presented works, our framework embraces the client heterogeneity, instead of treating it as a limitation, and thus pushes the boundaries of FL deployment in terms of fairness, scalability and performance by tailoring the model size to the device at hand.

7 Conclusions

In this work, we have introduced FjORD, a federated learning method for heterogeneous device training. To this end, FjORD builds on top of our Ordered Dropout technique as a means to extract submodels of smaller footprints from a main model, in such a way that training a part also contributes to training the whole. We show that our Ordered Dropout is equivalent to SVD for linear mappings and demonstrate that FjORD’s performance in the local and federated setting exceeds that of competing techniques, while maintaining flexibility across different environment setups.

References


  • M. Almeida, S. Laskaridis, I. Leontiadis, S. I. Venieris, and N. D. Lane (2019) EmBench: Quantifying Performance Variations of Deep Neural Networks Across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL), Cited by: §2.
  • M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor (2020) Federated Learning with Quantized Global Model Updates. arXiv preprint arXiv:2006.10672. Cited by: §6.
  • Apple (2017) Learning with Privacy at Scale. In Differential Privacy Team Technical Report, Cited by: §1.
  • Authors (2019) TensorFlow Federated Datasets. Cited by: §B.1.
  • E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2020) How To Backdoor Federated Learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2938–2948. Cited by: §1.
  • D. J. Beutel, T. Topal, A. Mathur, X. Qiu, T. Parcollet, and N. D. Lane (2020) Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390. Cited by: §B.2, §5.1.
  • K. Bonawitz et al. (2017) Practical Secure Aggregation for Privacy-Preserving Machine Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §1.
  • K. Bonawitz et al. (2019) Towards Federated Learning at Scale: System Design. In Proceedings of Machine Learning and Systems (MLSys), Cited by: §2, §2.
  • S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar (2018a) LEAF: A Benchmark for Federated Settings. arXiv preprint arXiv:1812.01097. Cited by: §B.1, §B.1.
  • S. Caldas, J. Konečný, B. McMahan, and A. Talwalkar (2018b) Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. In NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality, Cited by: §1, §5.1, §6.
  • G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik (2017) EMNIST: Extending MNIST to Handwritten Letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: §B.1.
  • Ł. Dudziak, M. S. Abdelfattah, R. Vipperla, S. Laskaridis, and N. D. Lane (2019) ShrinkML: End-to-End ASR Model Compression Using Reinforcement Learning. In INTERSPEECH, pp. 2235–2239. Cited by: §6.
  • European Commission (2018) GDPR: 2018 Reform of EU Data Protection Rules. Cited by: §1.
  • A. Fallah, A. Mokhtari, and A. Ozdaglar (2020) Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §6.
  • B. Fang, X. Zeng, and M. Zhang (2018) NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 115–127. Cited by: §6.
  • R. C. Geyer, T. J. Klein, and M. Nabi (2017) Differentially Private Federated Learning: A Client Level Perspective. In NeurIPS Workshop on Machine Learning on the Phone and other Consumer Devices (MLPCD), Cited by: §1.
  • Y. Guo, A. Yao, and Y. Chen (2016) Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1387–1395. Cited by: §6.
  • P. Han, S. Wang, and K. K. Leung (2020) Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach. In IEEE International Conference on Distributed Computing Systems (ICDCS), Cited by: §6.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143. Cited by: §3.1, §6.
  • K. Hazelwood et al. (2018) Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §B.1, §5.
  • G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the Knowledge in a Neural Network. In NeurIPS Deep Learning Workshop, Cited by: §3.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §B.1, §5.
  • S. Horváth, C. Ho, Ľ. Horváth, A. N. Sahu, M. Canini, and P. Richtárik (2019) Natural Compression for Distributed Deep Learning. arXiv preprint arXiv:1905.10988. Cited by: §6.
  • S. Horváth and P. Richtárik (2021) A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning. In International Conference on Learning Representations, Cited by: §6.
  • K. Hsieh, A. Phanishayee, O. Mutlu, and P. Gibbons (2020) The Non-IID Data Quagmire of Decentralized Machine Learning. In International Conference on Machine Learning (ICML), Cited by: §6.
  • R. Hu, Y. Guo, H. Li, Q. Pei, and Y. Gong (2020) Personalized Federated Learning With Differential Privacy. IEEE Internet of Things Journal (JIOT) 7 (10), pp. 9530–9539. Cited by: §1.
  • A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool (2019) AI Benchmark: All About Deep Learning on Smartphones in 2019. In International Conference on Computer Vision Workshops (ICCVW), Cited by: §2.
  • Y. Jiang, S. Wang, B. J. Ko, W. Lee, and L. Tassiulas (2020) Model Pruning Enables Efficient Federated Learning on Edge Devices. In Workshop on Scalability, Privacy, and Security in Federated Learning (SpicyFL), NeurIPS, Cited by: §6.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §1, §2.
  • S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh (2020) SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In International Conference on Machine Learning (ICML), Cited by: §1, §4.
  • J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon (2016) Federated Learning: Strategies for Improving Communication Efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning, Cited by: §6.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §B.1, §5.
  • S. Laskaridis, S. I. Venieris, H. Kim, and N. D. Lane (2020) HAPI: Hardware-Aware Progressive Inference. In International Conference on Computer-Aided Design (ICCAD), Cited by: §1.
  • N. Lee, T. Ajanthan, and P. Torr (2019) SNIP: Single-Shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations (ICLR), Cited by: §3.1, §6.
  • D. Li and J. Wang (2019) FedMD: Heterogenous Federated Learning via Model Distillation. In NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality, Cited by: §6.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning Filters for Efficient ConvNets. In International Conference on Learning Representations (ICLR), Cited by: §6.
  • T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2020a) Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine. Cited by: §1, §6.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020b) Federated Optimization in Heterogeneous Networks. In Proceedings of Machine Learning and Systems (MLSys), Cited by: §1, §4, §6.
  • T. Li, M. Sanjabi, A. Beirami, and V. Smith (2020c) Fair Resource Allocation in Federated Learning. In International Conference on Learning Representations (ICLR), Cited by: §6.
  • X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou (2021) FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In International Conference on Learning Representations (ICLR), Cited by: footnote 3.
  • P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L. Morency (2019) Think Locally, Act Globally: Federated Learning with Local and Global Representations. In NeurIPS 2019 Workshop on Federated Learning, Cited by: §1.
  • R. LiKamWa and L. Zhong (2015) Starfish: Efficient Concurrency Support for Computer Vision Applications. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 213–226. Cited by: §3.4.
  • Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally (2018) Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In International Conference on Learning Representations (ICLR), Cited by: §6.
  • B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas (2021) Cost-Effective Federated Learning Design. In INFOCOM, Cited by: §6.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017a) Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §1, §4.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017b) Communication-efficient Learning of Deep Networks from Decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §B.1.
  • H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2018) Learning Differentially Private Recurrent Language Models. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov (2019) Exploiting Unintended Feature Leakage in Collaborative Learning. In IEEE Symposium on Security and Privacy (SP), pp. 691–706. Cited by: §1.
  • P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz (2019) Importance Estimation for Neural Network Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11264–11272. Cited by: §3.1, §6.
  • T. Nishio and R. Yonetani (2019) Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge. In IEEE International Conference on Communications (ICC), Cited by: §6.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8026–8037. Cited by: §B.2, §5.1.
  • A. Rajagopal, D. Vink, S. Venieris, and C. Bouganis (2020) Multi-Precision Policy Enforced Training (MuPPET) : A Precision-Switching Strategy for Quantised Fixed-Point Training of CNNs. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 7943–7952. Cited by: §6.
  • O. Rippel, M. Gelbart, and R. Adams (2014) Learning Ordered Representations with Nested Dropout. In International Conference on Machine Learning (ICML), pp. 1746–1754. Cited by: §6.
  • F. Sattler, S. Wiedemann, K. -R. Müller, and W. Samek (2020) Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 31 (9), pp. 3400–3413. Cited by: §2.
  • R. Shokri and V. Shmatikov (2015) Privacy-Preserving Deep Learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1310–1321. Cited by: §6.
  • V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated Multi-Task Learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR) 15 (56), pp. 1929–1958. Cited by: §6.
  • M. J. Wainwright, M. Jordan, and J. C. Duchi (2012) Privacy Aware Learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
  • H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright (2018) ATOMO: Communication-Efficient Learning via Atomic Sparsification. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §6.
  • J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor (2020) Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §4.
  • K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019a) HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 8612–8620. Cited by: §6.
  • S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2019b) Adaptive Federated Learning in Resource Constrained Edge Computing Systems. IEEE Journal on Selected Areas in Communications (JSAC) 37 (6). Cited by: §6.
  • C. Wu et al. (2019) Machine Learning at Facebook: Understanding Inference at the Edge. In IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §2.
  • Z. Yao, S. Cao, W. Xiao, C. Zhang, and L. Nie (2019) Balanced Sparsity for Efficient DNN Inference on GPU. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 5676–5683. Cited by: §6.
  • B. Zoph and Q. Le (2017) Neural Architecture Search with Reinforcement Learning. In International Conference on Learning Representations (ICLR), Cited by: §3.1.

Appendix A Proof of Theorem 1


Let denote SVD decomposition of . This decomposition is unique as is full rank with distinct singular values. We also denote

Assuming a linkage of input data and response through a linear mapping , we obtain the following

Let us denote to be -th row of matrix U and to be -th row of V. Due to being uniform on unit ball and structure of the neural network, we can further simplify the objective to

where denotes Frobenius norm. Since for each there exists nonzero probability such that , we can explicitly compute expectation, which leads to

Realising that has rank at most , we can use Eckart–Young theorem which implies that

Equality is obtained if and only if for all . This can be achieved, e.g.  and for all . ∎
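For reference, the Eckart–Young theorem invoked in the proof can be stated in generic notation as follows. The symbols here (A, K, σ_i, u_i, v_i, b and B) are chosen for this statement and are an assumption about notation; they are not necessarily the ones used in the paper's own displayed equations.

```latex
% Singular value decomposition of a matrix A with K = min(m, n)
\[
A \;=\; \sum_{i=1}^{K} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_K \ge 0 .
\]

% Eckart--Young (Frobenius norm): best approximation of rank at most b
\[
\min_{\operatorname{rank}(B) \le b} \; \lVert A - B \rVert_F^{2}
\;=\; \sum_{i=b+1}^{K} \sigma_i^{2},
\qquad \text{attained by} \quad
A_b \;=\; \sum_{i=1}^{b} \sigma_i \, u_i v_i^{\top} .
\]

% The minimiser is unique whenever \sigma_b > \sigma_{b+1},
% which holds for every b when the singular values are distinct.
```

The uniqueness condition is where the assumption of distinct singular values is used: it ensures that each truncated sum A_b is the only minimiser at its rank.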

Appendix B Experimental Details

B.1 Datasets and Models

Below, we provide a detailed description of the datasets and models used in this paper. We use the vision datasets EMNIST (Cohen et al., 2017), together with its federated equivalent FEMNIST, and CIFAR10 (Krizhevsky et al., 2009), as well as the language modelling dataset Shakespeare (McMahan et al., 2017b). In the centralised training scenarios, we use the union of dataset partitions for training and validation, while in the federated deployment, we adopt either a random partitioning for IID datasets or the pre-partitioned scheme available in TensorFlow Federated (TFF) (Authors, 2019).

CIFAR10. The CIFAR10 dataset is a computer vision dataset consisting of 32×32 colour images with 10 possible labels. For the federated version of CIFAR10, we randomly partition the dataset among clients, each client having data-points. We train a ResNet18 (He et al., 2016) on this dataset, where for Ordered Dropout we train independent batch normalisation layers for every width, as Ordered Dropout affects the distribution of the layers’ outputs. We perform standard data augmentation and preprocessing, i.e. a random crop followed by a random horizontal flip, and then we normalise the pixel values according to their mean and standard deviation.

(F)EMNIST. EMNIST consists of gray-scale images of both numbers and upper and lower-case English characters, with 62 possible labels in total (10 digits plus 26 upper-case and 26 lower-case letters). The digits are partitioned according to their author, resulting in a naturally heterogeneous federated dataset. EMNIST is the collection of all the data-points. We do not use any preprocessing on the images. We train a Convolutional Neural Network (CNN), which contains two convolutional layers, each with kernels with and filters, respectively. Each convolutional layer is followed by a max pooling layer. Finally, the model has a dense output layer followed by a softmax activation. FEMNIST refers to the federated variant of the dataset, which has been partitioned based on the writer of the digit/character (Caldas et al., 2018a).

Shakespeare. The Shakespeare dataset is also derived from the benchmark designed by Caldas et al. (2018a). The dataset corpus is the collected works of William Shakespeare, and the clients correspond to roles in Shakespeare’s plays with at least two lines of dialogue. The non-federated dataset is constructed as a collection of all the clients’ data-points, in the same way as for FEMNIST. For the preprocessing step, we apply the same technique as the TFF dataloader, where we split each client’s lines into sequences of characters, padding if necessary. We use a vocabulary size of entities – characters contained in Shakespeare’s work, beginning and end of line tokens, padding tokens, and out-of-vocabulary tokens. We perform next-character prediction on the clients’ dialogue using a Recurrent Neural Network (RNN). The RNN takes as input a sequence of characters, embeds it into a learned -dimensional space, and passes the embedding through two LSTM (Hochreiter and Schmidhuber, 1997) layers, each with units. Finally, we use a softmax output layer with units. For this dataset, we do not apply Ordered Dropout to the embedding layer, but only to the subsequent LSTM layers, because the embedding layer has an insignificant impact on the size of the model.
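The split-and-pad step described for the Shakespeare preprocessing can be sketched as below. This is a simplified illustration and not the TFF dataloader itself: the function name, the padding character and the sequence length used in the demo are all hypothetical choices made for the example.

```python
from typing import List


def split_into_sequences(text: str, seq_len: int, pad_char: str = "_") -> List[str]:
    """Cut a client's text into fixed-length chunks, right-padding the last one.

    seq_len and pad_char are placeholders; the values used in the paper are
    not assumed here.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    sequences = []
    for start in range(0, len(text), seq_len):
        chunk = text[start:start + seq_len]
        # Only the final chunk can be short; pad it up to seq_len.
        sequences.append(chunk + pad_char * (seq_len - len(chunk)))
    return sequences


if __name__ == "__main__":
    for seq in split_into_sequences("to be or not", seq_len=5):
        print(repr(seq))
```

For next-character prediction, each such sequence would typically be paired with a copy of itself shifted by one position as the target; that pairing is omitted here for brevity.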

B.2 Implementation Details

FjORD was built on top of PyTorch (Paszke et al., 2019) and Flower (Beutel et al., 2020), an open-source federated learning framework which we extended to support Ordered, Federated, and Adaptive Dropout and Knowledge Distillation. Our OD aggregation was implemented in the form of a Flower strategy that considers each client’s maximum width. Server and clients run in a multiprocess setup, communicating over gRPC channels, and can be distributed across multiple devices. To scale to hundreds of clients per cloud node, we optimised Flower so that clients only allocate GPU resources when actively participating in a federated client round. This is accomplished by moving the forward/backward propagation of clients into a separate spawned process, which frees its resources when finished. Timeouts are also introduced in order to limit the effect of stragglers or failed client processes on the entire training round.
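One way an aggregation step can take each client's maximum width into account is sketched below. It is not Flower's API and not the authors' strategy: the function name is hypothetical, weights are flattened to a 1-D list for brevity, each client is assumed to return a prefix of the global weights, and a plain unweighted mean is used where a real system would likely weight by local dataset size.

```python
from typing import List, Sequence


def aggregate_prefix_updates(
    global_weights: Sequence[float],
    client_updates: Sequence[Sequence[float]],
) -> List[float]:
    """Average each weight over the clients whose submodel contains it.

    Assumption: client i returns the first len(client_updates[i]) weights.
    Weights that no sampled client updated keep their previous global value.
    """
    new_weights = list(global_weights)
    for j in range(len(global_weights)):
        contributions = [update[j] for update in client_updates if len(update) > j]
        if contributions:
            new_weights[j] = sum(contributions) / len(contributions)
    return new_weights


if __name__ == "__main__":
    global_w = [0.0, 0.0, 0.0, 0.0]
    narrow_client = [1.0, 1.0]            # can only train half the width
    wide_client = [3.0, 3.0, 3.0, 3.0]    # trains the full width
    print(aggregate_prefix_updates(global_w, [narrow_client, wide_client]))
```

In this sketch the leading weights are averaged over both clients, while the trailing weights are informed only by the wide client, so low-end devices still contribute to the part of the model they can hold.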

B.3 Hyperparameters

In this section we lay out the hyperparameters used for each tuple.

B.3.1 Non-federated Experiments

For centralised training experiments, we employ SGD with momentum as an optimiser. We also note that the training epochs of this setting are significantly fewer than the equivalent federated training rounds, as each iteration is a full pass over the dataset, compared to an iteration over the sampled clients.

  • ResNet18. We use batch size , step size of and train for epochs. We decrease the step size by a factor of at epochs and .

  • CNN. We use batch size and train for epochs. We keep the step size constant at .

  • RNN. We use batch size and train for epochs. We keep the step size constant at .

B.3.2 Federated Experiments

For each federated deployment, we start the communication round by randomly sampling clients to model client availability and for each available client we run one local epoch. We decrease the client step size by at and of total rounds. We run 500 global rounds of training across experiments and use SGD without momentum.

  • ResNet18. We use local batch size and step size of .

  • CNN. We use local batch size and step size of .

  • RNN. We use local batch size and step size of .
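The round-based decay of the client step size in the federated setup can be expressed as a small schedule function. The decay factor and milestone fractions in the demo below are placeholder values chosen for illustration, as is the base step size; only the total of 500 global rounds comes from the text.

```python
from typing import Sequence


def client_step_size(
    round_idx: int,
    total_rounds: int,
    base_step: float,
    decay_factor: float,
    milestones: Sequence[float],
) -> float:
    """Return the step size for a round, decayed once per milestone passed.

    milestones are fractions of total_rounds. All numeric arguments are
    hypothetical inputs, not the paper's hyperparameters.
    """
    step = base_step
    for fraction in milestones:
        if round_idx >= fraction * total_rounds:
            step *= decay_factor
    return step


if __name__ == "__main__":
    # Placeholder schedule: halve the step size at 50% and 75% of training.
    for r in (0, 249, 250, 375, 499):
        print(r, client_step_size(r, 500, 1.0, 0.5, (0.5, 0.75)))
```

Expressing milestones as fractions rather than absolute round numbers keeps the schedule unchanged if the number of global rounds is altered.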

Appendix C Additional Experiments

Federated Dropout vs. eFD

Figure 24: Federated Dropout (FD) vs Extended Federated Dropout (eFD). Performance vs dropout rate p across different networks and datasets. Recoverable panels: (a) ResNet18 - CIFAR10; (c) RNN - Shakespeare.

In this section we provide evidence of eFD’s accuracy dominance over FD. We inherit the setup of the experiment in § 5.2 to be able to compare results and extrapolate across similar conditions. From Fig. 24, it is clear that eFD’s performance dominates the baseline FD by 27.13-33 percentage points (pp) (30.94pp avg.) on CIFAR10, 4.59-9.04pp (7.13pp avg.) on FEMNIST and 1.51-6.96 points (p) (3.96p avg.) on Shakespeare. The superior performance of eFD, as a technique, can be attributed to the fact that it allows for an adaptive dropout rate based on the device capabilities. As such, instead of imposing a uniformly high dropout rate to accommodate the low end of the device spectrum, more capable devices are able to update a larger portion of the network, thus utilising its capacity more intelligently.
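The difference between a uniform and an adaptive rate can be made concrete with a toy assignment. In this sketch, each device is described by the largest fraction of the model it can handle; the function names, the device labels and the capacity values are hypothetical, and "keep fraction" is this example's own term for the share of the network a device trains.

```python
from typing import Dict


def fd_assignment(capacities: Dict[str, float]) -> Dict[str, float]:
    """Uniform rate: every device trains the fraction the weakest one can hold."""
    shared = min(capacities.values())
    return {device: shared for device in capacities}


def efd_assignment(capacities: Dict[str, float]) -> Dict[str, float]:
    """Adaptive rate: every device trains the fraction it can hold itself."""
    return dict(capacities)


def mean_keep_fraction(assignment: Dict[str, float]) -> float:
    """Average share of the network updated per participating device."""
    return sum(assignment.values()) / len(assignment)


if __name__ == "__main__":
    devices = {"low_end": 0.25, "mid_tier": 0.5, "high_end": 1.0}
    print("FD :", fd_assignment(devices), mean_keep_fraction(fd_assignment(devices)))
    print("eFD:", efd_assignment(devices), mean_keep_fraction(efd_assignment(devices)))
```

Under the uniform scheme the high-end device trains no more of the network than the low-end one, whereas the adaptive scheme lets it update the full width.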

However, it should also be noted that, despite FD’s accuracy drop, on average it is expected to have a lower computation, upstream network bandwidth and energy impact on devices at the higher end of the spectrum, as they run the largest dropout rate possible to accommodate the computational needs of their lower-end counterparts. This behaviour, however, can also be interpreted as wasted computation potential on the higher end, especially under unconstrained environments (e.g. a device charging overnight), at the expense of global model accuracy.