Communication-Efficient Federated Learning via Optimal Client Sampling

Federated learning is a private and efficient framework for learning models in settings where data is distributed across many clients. Due to the interactive nature of the training process, frequent communication of large amounts of information is required between the clients and the central server which aggregates local models. We propose a novel, simple and efficient way of updating the central model in communication-constrained settings by determining the optimal client sampling policy. In particular, modeling the progression of clients' weights by an Ornstein-Uhlenbeck process allows us to derive the optimal sampling strategy for selecting a subset of clients with significant weight updates. The central server then collects local models from only the selected clients and subsequently aggregates them. We propose four client sampling strategies and test them on two federated learning benchmark tasks, namely, a classification task on EMNIST and a realistic language modeling task on the Stackoverflow dataset. The results show that the proposed framework provides a significant reduction in communication while maintaining competitive or achieving superior performance compared to the baseline. Our methods introduce a new line of communication strategies orthogonal to the existing user-local methods such as quantization or sparsification, thus complementing rather than aiming to replace them.





1 Introduction

Federated learning (FL) is a communication-efficient, privacy-preserving framework for training machine learning models in settings where the data is distributed across many clients. Such settings are common in applications that involve mobile devices [34], automated vehicles, and Internet-of-Things (IoT) systems, as well as in cross-silo applications including healthcare [43] and banking. In FedAvg, the baseline FL procedure proposed in [34] (included in the supplementary material as Algorithm 2), a server distributes an initial model to clients who independently update the model using their local training data; these updates are aggregated by the server which broadcasts a new global model to the clients and selects a subset of them to start a new round of local training; the procedure is repeated until convergence. Since clients communicate only their models to the server, FL offers data security that can be further strengthened using privacy mechanisms including those that provide differential privacy guarantees [1, 55, 35].

The number of clients in FL systems may be in the millions, and the models that they locally train could be rather large; for example, VGG, the widely known neural network for image recognition, has 160M parameters, weighing 25MB when represented by 32 bits. Clearly, collecting sizable models from a large number of clients may require significant communication resources. Moreover, many settings where FL is applied are highly dynamic (e.g., mobile devices, IoT), with new users joining at any moment and old users continuing to generate new data. Such settings may lead to a large number of training rounds and clients' model uploads, even though the contributions of some locally updated models to the global one may be negligible. Since collecting enormous amounts of information requires considerable resources, it is desirable to reduce the communication rates consumed by an FL system. This is being explored in a promising line of work focused on reducing each client's communication budget by compressing the ML model through strategies such as quantization and sparsification [53, 27, 52, 28, 4, 24].

In existing FL systems, the number of clients participating in each round of updates (and, therefore, the required communication budget) is typically fixed. Yet the contribution of many clients in any given round could be dispensable, especially near convergence. Following this intuition, we propose a novel approach to reducing communication in FL by identifying and transmitting only the client updates that are deemed informative. (Note that this approach is orthogonal to the aforementioned compression strategies, and that the two may in principle be combined; we leave this to future work.)

In particular, we model the progression of each user's vector of weights during stochastic gradient descent (SGD) as a multidimensional stochastic process, and decide whether or not to send an update to the server based on how informative the observed segment of a sample path is (e.g., how far the process is from its steady state). Specifically, we rely on the Ornstein-Uhlenbeck process (OU) – a continuous stochastic process parameterized by its mean and covariance functions – and interpret the weights in SGD iterations as points obtained by discretizing a sample path of the underlying OU process. Relying on this connection, we borrow techniques for optimal sampling of OU processes and adapt them to the problem of optimal client sampling. The optimal strategy turns out to be a simple threshold on the update's norm and can thus be efficiently implemented on the client side.

We propose and analyze four different strategies for selecting the threshold and compare them in terms of trained model accuracy and communication rates. To start, we show that using a judiciously chosen fixed threshold in all loops of FL training reduces communication without significant performance deterioration. Next, using OU process concepts, we propose a strategy where a model is communicated if the fraction of weights that are far from their steady state exceeds a predetermined threshold. We then develop two analogous dynamic strategies, wherein the threshold varies adaptively during the training process. In the first one, we estimate (locally, at a client's side) the parameters of an OU process governing weight updates for each client, and use them to determine whether the current global model (i.e., the latest model broadcasted by the server) is close to the mean of a client's process; if so, the client's update need not be transmitted to the server. This is an effective and reliable assessment of a client's model relevance but requires additional memory resources on the client's side. In the second one, we propose an alternative based on the empirical gradient norm variance evaluated across clients. This strategy requires the clients to first share their updates' norms with the server so that an optimal threshold can be computed and broadcasted back; the clients then rely on the received threshold to decide whether or not to send their updates to the server.

Efficacy of the proposed formulation is demonstrated in practical settings, where we show that it may reduce communication during an FL training loop by up to 80% while achieving the same rate of convergence and competitive model accuracy. In particular, we test our methods in two FL benchmark tasks: a digit classification task on a real federated dataset (EMNIST) using two different network architectures, and a next-word prediction task on the Stackoverflow dataset, for which we train a recurrent neural network.

2 Background and Related Work

2.1 Federated Learning

The FL algorithm introduced in [34], FedAvg, requires a random subset of clients to send their updates to the server after having trained locally for E epochs on mini-batches of size B. For more details, please see Algorithm 2 in the supplementary document. In subsequent years, the field has attracted considerable attention from the privacy and security [1, 20, 5, 12], optimization [42], adversarial attack [6, 9], and personalization [49] perspectives – see [26] for a comprehensive overview. Our focus is on the communication challenges in federated learning.

2.2 Communication reduction strategies

Highly relevant to our problem is the line of research on distributed estimation under communication constraints (see, e.g., [13, 18, 7, 57] and the references therein). Recently, work in [58, 14, 23, 2] provided bounds on the minimax risk, i.e., the worst-case estimation error for a distributed system operating under a constraint on the maximum number of bits that remote nodes are allowed to transmit. Under the same constraint, Barnes et al. [8] leverage Fisher information to derive lower bounds on the estimation error.

In machine learning applications, including FL, schemes for reducing communication overhead typically perform compression on the client side and thus require additional computation for encoding and decoding. Deterministic approaches such as low rank approximation, sparsification, subsampling, and quantization [27, 4, 24], and randomized ones including random rotations and stochastic rounding [52] and randomized approximation [28] can be used to reduce the communication while maintaining high accuracy. Note that these methods may be leveraged on the server side as well [15].

An alternative approach to learning under resource constraints focuses on reducing the overall model complexity, e.g., by bounding the model size [31, 39], pruning [22, 3, 17, 59], or restricting weights to be binary [16]. Adaptation of these methods to FL and their analysis in such contexts are open research questions. Finally, reducing the number of clients who send their updates to the server may significantly reduce the amount of communication, as demonstrated by heuristics which impose limits on the uploading times of updates.

2.3 Ornstein-Uhlenbeck Processes

The Ornstein-Uhlenbeck Process Definition.

The Ornstein-Uhlenbeck process (OU) is a stationary Gauss-Markov process that, over time, drifts towards its mean function. Unlike the Wiener process, whose drift is constant, the drift of the OU process depends on how far its current realization is from the mean. OU processes have been extensively studied in a wide range of fields including physics [29], finance [50, 19, 48], and biology [44, 45], to name a few. Formally, the OU process is described by the stochastic differential equation

dX_t = θ(μ − X_t) dt + σ dW_t,    (1)

where W_t denotes the standard Wiener process. Expression (1) specifies a process that drifts towards the mean μ with velocity θ, and has volatility driven by a Brownian motion with variance σ².
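To make these dynamics concrete, the following minimal sketch simulates a sample path of (1) via an Euler-Maruyama discretization; all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama discretization of dX_t = theta*(mu - X_t) dt + sigma*dW_t."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        drift = theta * (mu - x[t]) * dt                    # mean-reverting drift
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()  # Brownian increment
        x[t + 1] = x[t] + drift + diffusion
    return x

rng = np.random.default_rng(0)
path = simulate_ou(theta=2.0, mu=1.0, sigma=0.3, x0=5.0, dt=0.01, n_steps=2000, rng=rng)
```

Starting from x0 = 5.0, the path relaxes towards mu = 1.0 at rate theta and then fluctuates around it with stationary standard deviation sigma/sqrt(2*theta), mirroring the drift/volatility description above.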

Estimating parameters of the OU process.

Various techniques for estimating the parameters of the OU process from observations of its sample path have been proposed in the literature, including least-squares, maximum likelihood [32], and the Jackknife method [46]. Further details are provided in the supplementary material.
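As one concrete instance (a sketch, not the paper's exact procedure), a least-squares estimator can exploit the fact that a regularly sampled OU path follows an AR(1) recursion X_{t+1} = a X_t + b + noise with a = exp(-theta*dt) and b = mu*(1 - a); the helper name estimate_ou_ls is ours.

```python
import numpy as np

def estimate_ou_ls(x, dt):
    """Least-squares (AR(1)) estimates of OU parameters (theta, mu) from a
    sample path x observed at regular intervals of length dt."""
    A = np.column_stack([x[:-1], np.ones(len(x) - 1)])  # regressors: X_t and intercept
    (a, b), *_ = np.linalg.lstsq(A, x[1:], rcond=None)  # fit X_{t+1} = a*X_t + b
    theta_hat = -np.log(a) / dt
    mu_hat = b / (1.0 - a)
    return theta_hat, mu_hat

# Sanity check on a synthetic path with theta = 2.0, mu = 1.0:
rng = np.random.default_rng(1)
dt = 0.01
a_true = np.exp(-2.0 * dt)
x = np.empty(20001)
x[0] = 0.0
for t in range(20000):
    x[t + 1] = a_true * x[t] + 1.0 * (1 - a_true) + 0.3 * np.sqrt(dt) * rng.standard_normal()
theta_hat, mu_hat = estimate_ou_ls(x, dt)
```

The estimates theta_hat and mu_hat should land close to 2.0 and 1.0, respectively; the maximum-likelihood and Jackknife variants cited above refine this basic regression.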

Optimal sampling of the OU process.

Rate-constrained sampling of stochastic processes has been widely studied in the literature, primarily in the context of communications and control [25, 11, 37, 41, 36, 51, 40, 21]. In [25, 11], this problem is studied for i.i.d. random sequences and Gauss-Markov processes; Nayyar et al. [37] present a similar study in a scenario where the nodes of a sensor network collaboratively estimate the environment while operating under energy constraints that limit the amount of information the sensors can transmit to a central processor. Rabi et al. [41] study linear diffusion processes and formulate sampling as a stopping time problem; Nar et al. [36] extend the results to multidimensional problems.

The setting where samples are observed locally (by nodes/clients) but used for estimation only if communicated to the central processor is studied in [51, 40]. As shown there, thresholding the increase in signal magnitude is an optimal sampling policy for estimating parameters of an OU process. An extension of these results to a larger class of continuous Markov processes with regularity conditions is reported in [21].

Connection between OU processes and SGD.

Recently, [10, 54, 30, 33] have studied SGD in various settings by leveraging stochastic differential equations. Mandt et al. [33] model SGD as an OU process and leverage its properties to derive the optimal model parameters. Wang et al. [54] investigate asymptotic behaviour of descent algorithms through stochastic processes modeling. Li et al. [30] show that SGD can be used for statistical inference, demonstrating that the average of SGD sequences can be approximated by an OU process. Blanc et al. [10] show that when learning a neural network with SGD and independent label noise, the dynamics of weight updates can intuitively be thought of as an OU process.

3 Efficient Training via Client Sampling

Here we introduce the strategy wherein a client transmits its weights to the server only if the norm of the client model's update exceeds a pre-determined threshold. This simple strategy can be derived using the interpretation of SGD as an OU process.

3.1 SGD as an OU process

Consider the loss function

L(w) = (1/N) Σᵢ ℓ(xᵢ, w),

where {x₁, …, x_N} is a dataset with N samples and ℓ(xᵢ, w) is the loss of point xᵢ for model parameters w. In gradient descent, L(w) is minimized by evaluating in each iteration an approximation of the gradient using a mini-batch of the data. In particular, for a mini-batch B_t of size S drawn at iteration t, the update is

w_{t+1} = w_t − η ĝ(w_t),   ĝ(w) = (1/S) Σ_{x∈B_t} ∇ℓ(x, w),

where η denotes the learning rate. The following observations and assumptions are commonly encountered in the literature (see, e.g., [33]).

Observation 1: The central limit theorem implies that ĝ(w) ≈ N(g(w), C(w)/S), where g(w) denotes the full gradient and C(w) is the corresponding covariance matrix.

Assumption 1: When w approaches a stationary value, the covariance C(w) ≈ C is approximately constant [33].

Assumption 2: The iterates lie in a region where the loss can be approximated by a quadratic form (readily justified in the case of smooth loss functions), and the process reaches a quasi-stationary distribution around a local minimum.

Predicated on the above, the discrete process

w_{t+1} = w_t − η A w_t + (η/√S) B ξ_t,   ξ_t ∼ N(0, I),

can be interpreted as obtained by discretizing the OU process

dw_t = −η A w_t dt + (η/√S) B dW_t,

where A denotes the Hessian of the quadratic approximation of the loss around the minimum (Assumption 2) and B satisfies BBᵀ = C [33].

In the supplementary, we illustrate these arguments by showing plots of the sample paths of randomly selected weights in a convolutional neural network trained on the MNIST dataset using SGD with a constant learning rate. At each iteration, the distribution of the weight increments is approximately Gaussian; as expected, the weights follow trajectories typical of OU sample paths.
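This behaviour can be reproduced in a few lines: constant-step SGD on a one-dimensional quadratic loss with Gaussian gradient noise is exactly an AR(1) recursion, i.e., a discretized OU process (a self-contained sketch with illustrative values, not the paper's experiment).

```python
import numpy as np

# SGD on L(w) = 0.5 * h * w**2 with additive gradient noise; the iterate
# w_{t+1} = (1 - eta*h) * w_t - eta * noise is an AR(1) / discretized-OU
# process relaxing towards the minimum w* = 0 and then fluctuating around it.
rng = np.random.default_rng(0)
h, eta, noise_std = 1.0, 0.1, 0.5
w, path = 5.0, [5.0]
for _ in range(2000):
    stochastic_grad = h * w + noise_std * rng.standard_normal()  # noisy gradient
    w -= eta * stochastic_grad
    path.append(w)
path = np.array(path)
```

Early iterates decay geometrically towards the minimum; late iterates hover around it with a stationary variance of roughly eta**2 * noise_std**2 / (1 - (1 - eta*h)**2), which is the OU-like quasi-stationary regime described above.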

3.2 Thresholding: Optimal sampling of OU processes

The aforementioned OU process sampling strategies (Sec 2) assume a network of nodes with unlimited access to the process, while the estimator resides on a central server and the communication between the nodes and the server is limited. In this setting, a node has access to the signal and may locally determine whether or not it should communicate to the server.

When a sampling frequency constraint limits the number of samples that can be collected by the server, the nodes should locally decide when to send an update. Guo et al. [21] show that the optimal strategy in this case is a threshold policy – a node transmits a sample the first time the process deviates from the decoder's running estimate by more than a threshold – and that the optimal decoding policy keeps, between transmissions, the conditional expectation of the process given the last received sample. In particular, for the OU process in (1), this expectation decays from the last sample towards the mean μ at rate θ; note that θ and μ are unknown (more on this below).

To establish a connection to FL, we recall the arguments from the previous subsection and note that in each round of an FL procedure, client "observes" a partial sample path of an OU process (i.e., the progression of its weights during local training); the sample path starts from a point (i.e., model weights) broadcasted by the server at the beginning of a training round. Invoking the above sampling optimality results, we propose to schedule transmission of updates if the norm of the difference between a locally updated and the previously broadcasted model exceeds a judiciously selected threshold.

Since we do not know the process parameters θ and μ, we estimate them using the previously aggregated models available to the server (see Sec 2.3); this in turn enables estimating a non-transmitting client's model. Furthermore, note that not receiving an update implies that the client's model remains close to the previously broadcasted one (since proximity to a steady state is the reason thresholding takes place); therefore, in this case the trivial approximation that reuses the last broadcasted model is justified (and is immediately accessible to the server).

3.3 Selecting the threshold

To summarize, we propose a scheme where client k is selected if the norm of its weights update exceeds a threshold δ, i.e., if ‖w_k^{t+1} − w^t‖ ≥ δ; ideally, thresholding reduces communication without incurring significant accuracy loss compared to the baseline. In this section, we explore several thresholding strategies, listed below.

  1. Fixed threshold (FT): The server provides a fixed threshold δ to all clients before training starts. During training, and before sending local updates, each client tests whether ‖w_k^{t+1} − w^t‖ ≥ δ. The clients for which this holds send their updates; the others communicate only the size of their local dataset (to be used in computing the weights in the model aggregation step).

  2. Adaptive threshold (AT): At each iteration t, all clients report ‖w_k^{t+1} − w^t‖ to the server (just one float per client); the server in turn computes the empirical mean and variance of the received norms, combines them into a threshold, and sends it back to the clients.

  3. OU process estimation (OU): During training, each client keeps track of the iterates of its local model's parameters and, at the end, solves a regression problem (see Supplementary A.2) to find the OU parameter estimates. The client then calculates the fraction of parameters whose current values remain far from the estimated means. Intuitively, this is reflective of the number of weights that have not reached the steady state of the corresponding OU process. If the fraction is higher than a predefined, fixed threshold (fraction), then the client communicates its update to the server.

  4. Adaptive OU (AOU): Similar to OU, but instead of using a fixed fraction for all rounds, we compute an adaptive fraction for each round: the server collects the fractions from all clients and combines their mean and standard deviation, as in AT; the resulting value is sent back to the clients to use as the threshold.
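The client-side test and one plausible server-side threshold computation can be sketched as follows; the exact mean/standard-deviation combination used in AT is our assumption for illustration, and the helper names select_clients and at_threshold are ours.

```python
import numpy as np

def select_clients(updates, threshold):
    """FT-style client-side rule: a client transmits only if the l2 norm of
    its model update meets the threshold. `updates` maps client id -> update."""
    return [k for k, delta in updates.items() if np.linalg.norm(delta) >= threshold]

def at_threshold(update_norms, alpha=1.0):
    """AT-style server-side rule (sketch): combine the empirical mean and
    standard deviation of the reported norms into the round's threshold."""
    norms = np.asarray(update_norms, dtype=float)
    return max(norms.mean() - alpha * norms.std(), 0.0)

updates = {0: np.array([3.0, 4.0]),   # norm 5.0 -> update transmitted
           1: np.array([0.1, 0.0])}   # norm 0.1 -> only dataset size sent
senders = select_clients(updates, threshold=1.0)
```

With a fixed threshold of 1.0, only client 0 transmits; under AT, the threshold would instead be recomputed each round from the norms all clients report.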

3.4 An efficient communication algorithm

Input: K clients, local minibatch size B, number of local epochs E, learning rate η, threshold selection rule
Output: Global model
1  initialize w^0;
2  for t = 1, 2, … do
3      determine threshold δ_t according to the selection rule;
4      S_t ← random set of m clients;
5      for each client k ∈ S_t, in parallel, do    // local clients
6          w_k^{t+1} ← ClientUpdate(k, w^t);
7          if ‖w_k^{t+1} − w^t‖ ≥ δ_t then
8              send (w_k^{t+1}, n_k);
9          else
10             send (NACK, n_k);
11         end if
12     end for
13     for each client k that sent NACK do    // server
14         estimate ŵ_k^{t+1} (Eq 2);
15     end for
16     w^{t+1} ← Σ_{k∈S_t} (n_k/n) ŵ_k^{t+1}, with ŵ_k^{t+1} = w_k^{t+1} for transmitting clients;
17 end for
Algorithm 1 Efficient communication FederatedAveraging

Summarizing the discussion, we formalize our algorithm for communication-efficient FL as Algorithm 1. In brief, m clients are selected at the beginning of round t. The server broadcasts model parameters w^t, and the selected clients locally perform SGD with mini-batches of size B for E epochs. Then, following a communication threshold rule (selected among those described in Sec 3.3) that evaluates whether the local model update exceeds the threshold, each client locally decides whether or not to communicate its update, transmitting either the model update or a negative-acknowledgement message. In both cases, the client sends its training data size n_k to enable weighting in the aggregation step. Finally, following the optimal sampling strategy, for each non-transmitting client the server estimates the client's model by the mean of the corresponding OU process,

ŵ_k^{t+1} = μ̂_k.    (2)

Parameters θ and μ_k are estimated via least-squares (Supplementary A.2) from the models previously received from client k. Note that the server's computationally cheap alternative to estimation is to simply reuse the client's model from the previous round, i.e., to set ŵ_k^{t+1} = ŵ_k^t; in our experiments, we observed that this alternative consistently provides high accuracy. Finally, the server computes a new model as the weighted average of the collected and estimated client models (the aggregation step of Algorithm 1).
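Under the assumptions above, one server round can be sketched as follows; this sketch uses the cheap fallback of reusing the broadcast model for non-transmitting clients rather than the full OU estimation, and the function and variable names are ours.

```python
import numpy as np

def run_round(global_w, local_models, data_sizes, threshold):
    """One round of threshold-based aggregation (sketch).

    local_models: client id -> locally trained weight vector
    data_sizes:   client id -> n_k, the client's training-set size
    Clients whose update norm clears the threshold transmit their model; for
    the rest the server reuses the broadcast model (NACK fallback), then
    aggregates with FedAvg-style weights n_k / n.
    """
    collected = {}
    for k, w_k in local_models.items():
        if np.linalg.norm(w_k - global_w) >= threshold:
            collected[k] = w_k        # update transmitted
        else:
            collected[k] = global_w   # NACK: fall back to broadcast model
    n = sum(data_sizes.values())
    return sum((data_sizes[k] / n) * collected[k] for k in collected)

global_w = np.zeros(2)
new_w = run_round(global_w,
                  local_models={0: np.array([1.0, 0.0]), 1: np.array([0.01, 0.0])},
                  data_sizes={0: 1, 1: 1},
                  threshold=0.5)
# new_w averages client 0's transmitted model with the reused broadcast model.
```

Here client 1's small update is suppressed, so the aggregate moves only in the direction contributed by client 0, at half the cost in uplink communication.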

4 Experiments

In this section we present experiments that demonstrate the performance of the proposed algorithm across different datasets, settings, and models. We start by describing the datasets and preprocessing steps, and then benchmark the client selection strategies on two datasets, EMNIST and Stackoverflow, with three different models: (1) a small EMNIST neural network convenient for comparing all the strategies; (2) a more sophisticated convolutional neural network applied to EMNIST; and (3) a recurrent neural network for next-word prediction on the Stackoverflow dataset. Further details and additional experimental results can be found in the supplementary document.

4.1 Datasets

Two different datasets were used for benchmarking and model validation experiments. For federated experiments we use the EMNIST dataset, a reprocessed version of the original MNIST dataset in which each image is linked to its original writer, providing a natural non-i.i.d. distribution and allowing us to emulate an FL setting. This dataset consists of images attributed to 3843 users. For a larger and more realistic FL scenario we use the Stackoverflow dataset, a language modelling dataset with questions and answers collected from 342477 unique users. Following previous work with this dataset [42], we build a vocabulary of the 10000 most frequent words and restrict each user's dataset to at most 128 sentences. We use padding and truncation to enforce 20-word sentences, and represent them with index sequences corresponding to the vocabulary words, out-of-vocabulary words, and beginning and end of sentences.

4.2 Results

We describe experiments on the aforementioned federated datasets and three different architectures. The main results are presented in this section, with further details, plots and tables found in the supplementary document.

We fix the number of selected clients at the beginning of each round and compare the four thresholding strategies of Sec 3.3 with two baselines: (1) a model with no communication constraints, which collects updates from all the selected clients, and (2) a setting where a fraction of clients is randomly dropped, with that fraction chosen for a meaningful and fair comparison with the proposed thresholding schemes. Specifically, the number of retained clients is set equal to the average number of clients participating per round in the thresholding scheme achieving the best accuracy.

Figure 1: Performance of different client selection strategies. On the left, we plot the accuracy per MB of communication. The right plot shows accuracy over FederatedAveraging rounds.

In the EMNIST experiments we use standard settings and fix hyperparameters as in prior work [42], selecting clients at random in each round. We test two architectures: (i) a feed-forward network with 100K parameters; and (ii) a convolutional neural network with 1.7M parameters. In the Stackoverflow experiments we train an LSTM with 4M parameters. All models are trained for 100 rounds. Since the results we obtained for the convolutional network on EMNIST are similar to those for the other two architectures, we present them in the supplementary document.


Accuracy of the models is compared in Fig 1 for the small (feed-forward) NN on EMNIST. In the left panel, we observe that the thresholding strategies typically achieve better accuracy per amount of communication than the baselines, except for OU, which exhibits a similar rate. In the right panel, we compare the overall accuracy across FL rounds and observe that the AT, OU and AOU thresholding methods achieve final accuracy similar to that of the full communication scheme; the FT strategy is somewhat less accurate, while the random client selection scheme suffers from significant performance degradation.

Figure 2: Communication for different client selection strategies on a small neural network on EMNIST. The left plot shows the number of clients participating in each round for each strategy. The right plot shows the cumulative communication.

Communication savings.

As shown in Fig 2, our thresholding strategies typically require a much smaller amount of communication to obtain accuracy comparable to the baseline (i.e., to the scheme using updates of all clients). Among them, the FT strategy achieves the highest communication savings but is ultimately not capable of matching the accuracy of the baseline. Moreover, while randomly dropping clients trivially achieves a low communication rate, the random client selection scheme suffers from a significant deterioration in accuracy. This should not come as a surprise, since the random scheme closely resembles tuning the mini-batch size in SGD: if we use less data, less communication is needed, but at the cost of slower convergence. It is also in part a consequence of dropping too many clients near the optimum, where gradient norms may become smaller. Our adaptive thresholding strategies overcome the aforementioned problems by changing the threshold in each round based on either the overall clients' update norms (AT scheme) or the overall number of varying weights (AOU scheme). The results show that the level of accuracy in FL can be maintained while reducing communication by using a smaller number of clients; however, these clients have to be carefully selected.

The right panel in Fig 2 shows that the FT scheme reduces communication exponentially in each round, while the communication in adaptive thresholding strategies oscillates due to selecting a near-optimal number of clients for model updates. The results in Table 1 and Table 2 suggest that while there is no single method that is uniformly superior in exploring accuracy-communication trade-offs, the thresholding strategies are an efficient way of subselecting clients without a significant deterioration of accuracy.

Method     Accuracy   Acc. rate (acc/byte)   Overall comm. (MB)   Comm. used (%)
Baseline   86.25%     423.7                  2035.4               100%
FT         84.42%     782.2                  1079.16              53%
AT         86.01%     513.1                  1676.35              82%
OU         86.14%     431.1                  1998.35              98%
AOU        86.08%     488.7                  1761.43              87%
Random     83.85%     759                    1104.8               54%
Table 1: Results for the small (feed-forward) NN on the EMNIST dataset
           Accuracy           Acc. rate (acc/byte)   Overall comm. (GB)   Comm. used (%)
Dataset    EMNIST   Stack     EMNIST   Stack         EMNIST   Stack       EMNIST   Stack
Baseline   97.5%    13.07%    29.3     1.61          33.27    81.0        100%     100%
FT         95.4%    13.63%    150      2.34          6.3      58.3        19%      71.96%
AT         97.4%    13.04%    34.9     2.99          27.9     43.6        83%      54%
Random     94.2%    9.67%     148.9    2.22          6.3      43.6        19%      54%
Table 2: Results for the convolutional EMNIST network and Stackoverflow

5 Conclusion

We propose a novel approach to reducing communication rates in FL by judiciously subselecting clients instead of relying on traditional compression strategies. A parallel between OU processes and SGD suggests strategies for identifying clients whose updates are informative and therefore should be communicated to the server. Experimental results demonstrate efficacy of the proposed methods in various settings. Moreover, our approaches can be combined with compression strategies to lower the communication rates even further.

The proposed client selection protocol based on thresholding is theoretically justified by the existing results on optimal OU process sampling. Future work involves automating threshold selection, and the analytical investigation of the connection between the selected thresholds and the FL system performance. Moreover, it is of interest to combine the methods proposed in this paper with the techniques that attempt to reduce communication burden by sparsifying transmitted information.

Broader Impact

Federated learning has emerged as a paradigm that allows users to maintain ownership of their data while still being able to enjoy sophisticated machine learning tools in applications which range from entertainment to vital technologies that may rely on sensitive health data.

Adoption and proliferation of FL solutions depends fundamentally on availability of compute and communication resources. Our work addresses scenarios characterized by communication constraints and offers novel solutions that can vastly amplify already successful techniques such as compression and sparsification.

Further studies are needed to explore the effect of our work on fairness – for example, making sure that our sampling schemes are not biased with respect to vulnerable groups and that models trained with our sampling scheme deliver fair predictions. For this, comprehensive datasets would be needed to assess the model’s performance over diverse groups of clients.


  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In sigsac, pp. 308–318. Cited by: §1, §2.1.
  • [2] J. Acharya, C. L. Canonne, and H. Tyagi (2018) Inference under information constraints i: lower bounds from chi-square contraction. arXiv preprint arXiv:1812.11476. Cited by: §2.2.
  • [3] A. Aghasi, N. Nguyen, and J. Romberg (2016) Net-trim: a layer-wise convex pruning of deep neural networks. arXiv preprint arXiv:1611.05162. Cited by: §2.2.
  • [4] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720. Cited by: §1, §2.2.
  • [5] K. Amin, A. Kulesza, A. Munoz, and S. Vassilvtiskii (2019) Bounding user contributions: a bias-variance trade-off in differential privacy. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 263–271. Cited by: §2.1.
  • [6] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2018) How to backdoor federated learning. arXiv preprint arXiv:1807.00459. Cited by: §2.1.
  • [7] M. F. Balcan, A. Blum, S. Fine, and Y. Mansour (2012) Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pp. 26–1. Cited by: §2.2.
  • [8] L. P. Barnes, Y. Han, and A. Ozgur (2019) Learning distributions from their samples under communication constraints. arXiv preprint arXiv:1902.02890. Cited by: §2.2.
  • [9] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo (2018) Analyzing federated learning through an adversarial lens. arXiv preprint arXiv:1811.12470. Cited by: §2.1.
  • [10] G. Blanc, N. Gupta, G. Valiant, and P. Valiant (2019) Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. arXiv preprint arXiv:1904.09080. Cited by: §2.3.
  • [11] P. A. Bommannavar and T. Başar (2008) Optimal estimation over channels with limits on usage. IFAC Proceedings Volumes 41 (2), pp. 6632–6637. Cited by: §2.3.
  • [12] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2016) Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482. Cited by: §2.1.
  • [13] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §2.2.
  • [14] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff (2016) Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 1011–1020. Cited by: §2.2.
  • [15] S. Caldas, J. Konečny, H. B. McMahan, and A. Talwalkar (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210. Cited by: §2.2.
  • [16] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §2.2.
  • [17] X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867. Cited by: §2.2.
  • [18] J. C. Duchi, A. Agarwal, and M. J. Wainwright (2011) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic control 57 (3), pp. 592–606. Cited by: §2.2.
  • [19] L. T. Evans, S. P. Keef, and J. Okunev (1994) Modelling real interest rates. Journal of banking & finance 18 (1), pp. 153–165. Cited by: §2.3.
  • [20] R. C. Geyer, T. Klein, and M. Nabi (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557. Cited by: §2.1.
  • [21] N. Guo and V. Kostina (2020) Optimal causal rate-constrained sampling for a class of continuous markov processes. arXiv preprint arXiv:2002.01581. Cited by: §2.3, §2.3, §3.2.
  • [22] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.2.
  • [23] Y. Han, A. Özgür, and T. Weissman (2018) Geometric lower bounds for distributed parameter estimation under communication constraints. arXiv preprint arXiv:1802.08417. Cited by: §2.2.
  • [24] S. Horvath, C. Ho, L. Horvath, A. N. Sahu, M. Canini, and P. Richtarik (2019) Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988. Cited by: §1, §2.2.
  • [25] O. C. Imer and T. Basar (2005) Optimal estimation with limited measurements. In Proceedings of the 44th IEEE Conference on Decision and Control, pp. 1029–1034. Cited by: §2.3.
  • [26] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §2.1.
  • [27] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §1, §2.2.
  • [28] J. Konečnỳ and P. Richtárik (2018) Randomized distributed mean estimation: accuracy vs. communication. Frontiers in Applied Mathematics and Statistics 4, pp. 62. Cited by: §1, §2.2.
  • [29] D. S. Lemons and P. Langevin (2002) An introduction to stochastic processes in physics. JHU Press. Cited by: §2.3.
  • [30] T. Li, L. Liu, A. Kyrillidis, and C. Caramanis (2018) Statistical inference using SGD. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.3.
  • [31] D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §2.2.
  • [32] R. S. Liptser and A. N. Shiryaev (2013) Statistics of random processes ii: applications. Vol. 6, Springer Science & Business Media. Cited by: §2.3.
  • [33] S. Mandt, M. Hoffman, and D. Blei (2016) A variational analysis of stochastic gradient algorithms. In International conference on machine learning, pp. 354–363. Cited by: §2.3, §3.1, §3.1.
  • [34] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 1273–1282. External Links: Link Cited by: §A.1, §1, §2.1, 2.
  • [35] H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz (2018) A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210. Cited by: §1.
  • [36] K. Nar and T. Başar (2014) Sampling multidimensional Wiener processes. In 53rd IEEE Conference on Decision and Control, pp. 3426–3431. External Links: Document, ISSN 0191-2216 Cited by: §2.3.
  • [37] A. Nayyar, T. Başar, D. Teneketzis, and V. V. Veeravalli (2013) Optimal strategies for communication and remote estimation with an energy harvesting sensor. IEEE Transactions on Automatic Control 58 (9), pp. 2246–2260. Cited by: §2.3.
  • [38] T. Nishio and R. Yonetani (2019) Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–7. Cited by: §2.2.
  • [39] D. Oktay, J. Ballé, S. Singh, and A. Shrivastava (2019) Model compression by entropy penalized reparameterization. arXiv preprint arXiv:1906.06624. Cited by: §2.2.
  • [40] T. Z. Ornee and Y. Sun (2019) Sampling for remote estimation through queues: age of information and beyond. arXiv preprint arXiv:1902.03552. Cited by: §2.3, §2.3.
  • [41] M. Rabi, G. V. Moustakides, and J. S. Baras (2012) Adaptive sampling for linear state estimation. SIAM Journal on Control and Optimization 50 (2), pp. 672–702. Cited by: §2.3.
  • [42] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295. Cited by: §2.1, §4.1, §4.2.
  • [43] M. Ribero, J. Henderson, S. Williamson, and H. Vikalo (2020) Federating recommendations using differentially private prototypes. arXiv preprint arXiv:2003.00602. Cited by: §1.
  • [44] L. M. Ricciardi and L. Sacerdote (1979) The Ornstein–Uhlenbeck process as a model for neuronal activity. Biological Cybernetics 35 (1), pp. 1–9. Cited by: §2.3.
  • [45] R. V. Rohlfs, P. Harrigan, and R. Nielsen (2014) Modeling gene expression evolution with an extended ornstein–uhlenbeck process accounting for within-species variation. Molecular biology and evolution 31 (1), pp. 201–211. Cited by: §2.3.
  • [46] J. Shao and D. Tu (2012) The jackknife and bootstrap. Springer Science & Business Media. Cited by: §2.3.
  • [47] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [48] T. Leung and X. Li (2015) Optimal mean reversion trading: mathematical analysis and practical applications. Vol. 1, World Scientific. Cited by: §2.3.
  • [49] V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. Cited by: §2.1.
  • [50] E. M. Stein and J. C. Stein (1991) Stock price distributions with stochastic volatility: an analytic approach. The review of financial studies 4 (4), pp. 727–752. Cited by: §2.3.
  • [51] Y. Sun, Y. Polyanskiy, and E. Uysal-Biyikoglu (2017) Remote estimation of the wiener process over a channel with random delay. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 321–325. Cited by: §2.3, §2.3.
  • [52] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan (2017) Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3329–3337. Cited by: §1, §2.2.
  • [53] H. Tang, X. Lian, T. Zhang, and J. Liu (2019) Doublesqueeze: parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1905.05957. Cited by: §1.
  • [54] Y. Wang (2017) Asymptotic analysis via stochastic differential equations of gradient descent algorithms in statistical and computational paradigms. arXiv preprint arXiv:1711.09514. Cited by: §2.3.
  • [55] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), pp. 1307–1322. Cited by: §1.
  • [56] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Appendix B.
  • [57] Y. Zhang, J. C. Duchi, and M. J. Wainwright (2013) Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research 14 (1), pp. 3321–3363. Cited by: §2.2.
  • [58] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright (2013) Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pp. 2328–2336. Cited by: §2.2.
  • [59] M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §2.2.

Appendix A Background details

A.1 Federated Averaging algorithm

For convenience and completeness, here we provide FedAvg, the baseline federated learning algorithm proposed in [34].

Input: K clients indexed by k; C is the fraction of clients selected per round, B is the local minibatch size, E is the number of local epochs, and η is the learning rate.
Output: Global model w
Server executes:
       initialize w_0 ;
       for t = 1, 2, … do
              m ← max(⌈C · K⌉, 1) ;
              S_t ← random set of m clients ;
              for each client k ∈ S_t in parallel do
                     w_{t+1}^k ← ClientUpdate(k, w_t) ;
              end for
              w_{t+1} ← Σ_{k ∈ S_t} (n_k / n) w_{t+1}^k, where n = Σ_{k ∈ S_t} n_k ;
       end for
Clients execute:
       ClientUpdate(k, w) : // Run on client k
       split local data P_k into batches of size B ;
       for each local epoch i = 1, …, E do
              for each batch b do
                     w ← w − η ∇ℓ(w; b) ;
              end for
       end for
       return w to server ;
Algorithm 2 FederatedAveraging [34]
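The server and client loops of FedAvg can be sketched in a few lines of Python. The toy least-squares objective, synthetic clients, and hyperparameters below are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch of FederatedAveraging on a toy least-squares problem.
# Hyperparameters and data are illustrative, not from the paper.

def client_update(w, X, y, eta=0.1, epochs=1, batch_size=10):
    """E local epochs of minibatch SGD on one client's data (ClientUpdate)."""
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # squared loss
            w = w - eta * grad
    return w

def fed_avg(clients, d, rounds=50, m=5):
    """Server loop: sample m clients, aggregate local models weighted by n_k."""
    w = np.zeros(d)
    for _ in range(rounds):
        sampled = np.random.choice(len(clients), size=m, replace=False)
        n_total = sum(len(clients[k][1]) for k in sampled)
        w_next = np.zeros(d)
        for k in sampled:
            X, y = clients[k]
            w_next += (len(y) / n_total) * client_update(w.copy(), X, y)
        w = w_next
    return w

# Toy federation: 20 clients whose data share the linear model w* = (1, -2).
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
clients = [(X, X @ w_star + 0.01 * rng.standard_normal(50))
           for X in (rng.standard_normal((50, 2)) for _ in range(20))]

w_hat = fed_avg(clients, d=2)
```

Since every client's data comes from the same linear model, the aggregated weights converge to w* up to the noise level.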

A.2 Estimating parameters of the OU process via least-squares regression

A number of methods exist for estimating the parameters of the OU process from discrete observations of its sample path. Here we summarize the simple least-squares solution. First, note that the continuous OU process

dX_t = θ(μ − X_t) dt + σ dW_t

can be discretized as

X_{k+1} = X_k + θ(μ − X_k)Δt + σ ΔW_k,

where Δt denotes the discretization (sampling) period and ΔW_k ∼ N(0, Δt) are i.i.d. increments of a Wiener process. This leads to a linear measurement model

X_{k+1} = a X_k + b + w_k,

where w_k denotes i.i.d. zero-mean Gaussian noise with variance σ²Δt, and where

a = 1 − θΔt,  b = θμΔt.

Then the parameter estimation can be formulated as a least-squares regression problem and solved in closed form, yielding the estimates

θ̂ = (1 − â)/Δt,  μ̂ = b̂/(1 − â).
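The least-squares recipe for OU parameter estimation can be sketched as follows; the Euler–Maruyama simulation and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

# Least-squares estimation of OU parameters from a discretely observed
# path: regress X_{k+1} on X_k, then map the fitted slope/intercept back
# to the mean-reversion rate theta and long-run mean mu.

def estimate_ou(x, dt):
    """Fit X_{k+1} = a*X_k + b by ordinary least squares, then invert."""
    A = np.column_stack([x[:-1], np.ones(len(x) - 1)])
    (a, b), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
    theta = (1.0 - a) / dt   # Euler scheme: a = 1 - theta*dt
    mu = b / (1.0 - a)       # and b = theta*mu*dt
    return theta, mu

# Simulate an OU sample path with the Euler-Maruyama scheme.
rng = np.random.default_rng(0)
theta_true, mu_true, sigma, dt, n = 2.0, 1.5, 0.3, 0.01, 200_000
x = np.empty(n)
x[0] = 0.0
noise = sigma * np.sqrt(dt) * rng.standard_normal(n - 1)
for k in range(n - 1):
    x[k + 1] = x[k] + theta_true * (mu_true - x[k]) * dt + noise[k]

theta_hat, mu_hat = estimate_ou(x, dt)
```

On a long enough path the regression recovers θ and μ up to sampling error.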

Appendix B Numerical illustration of SGD: Aptness of the OU process model

To illustrate the aptness of the OU process for modeling sequences of weights/parameters during SGD, we performed a simple experiment on the Fashion MNIST dataset [56]. This dataset consists of 60,000 training and 10,000 test examples; each sample is a 28x28 image belonging to one of 10 categories. No user or ID information is available.

We train a feedforward neural network with a single hidden layer of 128 units with ReLU activation and a softmax output layer. We train the model for 5 epochs with the batch size set to 100, reaching a final test accuracy of 84%, at which point the model saturates.

Figure 3: Parameter drift when retuning the model with new data. The weights follow trajectories reminiscent of the OU process sample paths.

In Fig 3 we show the trajectories of individual weights randomly selected from each layer (we plot the trajectories after centering each weight at its initial value). Similar to sample paths of an OU process, all trajectories in Fig 3 ultimately drift towards a mean around which they stabilize.

Figure 4: Distribution of increments of weights from a dense layer

We repeat the training 600 times with the same initial weights but with batches shuffled independently for each run. The trajectories of 100 randomly selected weights are recorded and used to generate histograms of increments (over a fixed interval) for each weight. In Fig 4, we plot the histograms of increments for four randomly selected parameters. The histograms suggest that the underlying distribution of the weight increments is approximately Gaussian, which is consistent with the Gaussian transition densities of the OU process.
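The Gaussian transition density of the OU process referred to here can be checked numerically: conditioned on X_t = x0, X_{t+τ} is Gaussian with mean μ + (x0 − μ)e^{−θτ} and variance σ²(1 − e^{−2θτ})/(2θ). The parameter values in the sketch below are illustrative.

```python
import numpy as np

# Numerical check of the OU Gaussian transition density: simulate many
# independent transitions of length tau with fine Euler steps and compare
# the sample moments of the increments with the closed form.

theta, mu, sigma, tau, x0 = 1.0, 0.0, 0.5, 0.5, 2.0
n, steps = 100_000, 500
dt = tau / steps
rng = np.random.default_rng(1)

x = np.full(n, x0)
for _ in range(steps):
    x += theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)

increments = x - x0  # increments over a fixed interval, as in Fig 4
mean_th = mu + (x0 - mu) * np.exp(-theta * tau) - x0
var_th = sigma**2 * (1 - np.exp(-2 * theta * tau)) / (2 * theta)
```

The empirical mean and variance of the increments match the closed-form transition moments up to discretization and sampling error.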

We further apply least-squares regression to the observed SGD trajectories to estimate parameters of the corresponding OU models. In Fig 5(a), we show example trajectories along with the corresponding means to which they are converging. In Fig 5(b), we single out one weight trajectory and show the mean to which it is converging as well as the variance of the steady-state distribution of the corresponding OU model.

(a) A few example weight trajectories and the corresponding estimated parameters (in particular, means of the corresponding OU model).
(b) An illustrative weight trajectory and the estimated mean and standard deviation of the corresponding OU model.
Figure 5: Weight trajectories and the estimates of the OU model parameters using least-squares regression.

Appendix C Experimental Details

C.1 Datasets

The datasets used for benchmarking and model validation are: (i) the EMNIST dataset, a reprocessed version of the original MNIST dataset in which each image is linked to its original writer, providing a natural non-i.i.d. distribution and allowing us to emulate an FL setting; and (ii) the Stackoverflow dataset, a language modeling dataset of questions and answers collected from users. The sizes of the datasets are summarized in Table 3 below.

Dataset Users Samples
EMNIST 3.4 K 60 K
Stackoverflow 342 K 136 M
Table 3: Datasets

C.2 Models

The ML architectures that we used are presented in Tables 4–6 below.

small EMNIST

We describe the small NN trained on EMNIST in Table 4. In particular, we used a fully connected neural network having one hidden layer with 128 neurons and ReLU activation.


conv EMNIST

We train a deep neural network with the following architecture: two 5x5 convolution layers, with 32 and 64 channels respectively, each followed by 2x2 max-pooling, a dense layer with 512 neurons and ReLU activation, and a 10-unit softmax output layer; see Table 5. The network has a total of 1,663,370 parameters. At each round, 50 clients are uniformly selected to update the model. Each client locally trains for E epochs using stochastic gradient descent (SGD) with a batch size of B.

Layer Output # Trainable Activation
Input (28,28,1) - -
Flatten 784 - -
Dense 128 100480 ReLU
Dense 10 1290 Softmax
Total 101,770
Table 4: small EMNIST digit recognition model architecture
Layer Output # Trainable Activation Hyperparameters
Input (28,28,1) - - -
Conv2d (28,28,32) 832 ReLU kernel size = 5
MaxPool2d (14,14,32) - - pool size = (2,2)
Conv2d (14,14,64) 51264 ReLU kernel size = 5
MaxPool2d (7,7,64) - - pool size = (2,2)
Flatten 3136 - - -
Dense 512 1606144 ReLU -
Dense 10 5130 Softmax -
Total 1,663,370
Table 5: EMNIST digit recognition convolutional model architecture


Stackoverflow

We train a recurrent neural network for next word prediction that first embeds words into a 96-dimensional space, followed by an LSTM and two dense layers. The architecture is presented in Table 6.

Layer Output # Trainable Activation
Input 20 - -
Embedding (20,96) 960384 -
LSTM (20,670) 2055560 -
Dense (20,96) 64416 ReLU
Dense (20,10004) 970388 Softmax
Total 4,050,748
Table 6: Stackoverflow next word prediction model architecture
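The totals in Tables 4–6 can be reproduced from the standard parameter-count formulas for dense, convolutional, and LSTM layers; the short Python check below is illustrative (the helper names are ours, not from any framework).

```python
# Parameter-count check for Tables 4-6, using the standard formulas:
#   dense  = n_in * n_out + n_out
#   conv2d = kh * kw * c_in * c_out + c_out
#   lstm   = 4 * ((n_in + n_hidden) * n_hidden + n_hidden)

def dense(n_in, n_out):
    return n_in * n_out + n_out

def conv2d(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out + c_out

def lstm(n_in, n_hidden):
    return 4 * ((n_in + n_hidden) * n_hidden + n_hidden)

small_emnist = dense(784, 128) + dense(128, 10)                  # Table 4
conv_emnist = (conv2d(5, 5, 1, 32) + conv2d(5, 5, 32, 64)
               + dense(3136, 512) + dense(512, 10))              # Table 5
stackoverflow = (10004 * 96                                      # embedding
                 + lstm(96, 670) + dense(670, 96)
                 + dense(96, 10004))                             # Table 6
```

These evaluate to 101,770, 1,663,370, and 4,050,748 parameters respectively, matching the table totals.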

Appendix D Additional Experimental Results

D.1 conv EMNIST

Here we report the improvements achieved by thresholding for the convolutional model on EMNIST. As shown in Fig 6, employing FT provides 95% accuracy while using only 20% of the clients. This method requires a pre-specified threshold, which may be inconvenient since, as shown below in Sec D.3, the threshold needs to be tuned. However, we observe that AT matches the performance of the baseline scheme that uses all clients while reducing communication by 20%. Fig 7 compares the communication used by the various methods and shows how AT reduces communication over the iterations. Note that a random selection of the same number of clients as in FT leads to a significant deterioration in accuracy, demonstrating the value of our proposed thresholding strategies.

Figure 6: Performance for different client selection strategies. On the left, we plot the accuracy per MB. The right plot shows accuracy over FederatedAveraging rounds.
Figure 7: Communication for different client selection strategies. The left plot shows the number of clients participating in each round for each strategy. The right plot shows the cumulative communication.

D.2 Stackoverflow

Below we present details and results for the Stackoverflow dataset. We observe that AT uses half of the clients while achieving almost the same performance as the baseline, which uses all clients. Note that a random selection of clients leads to deteriorated accuracy.

Figure 8: Performance for various client selection strategies. On the left, we plot the accuracy per MB. The right plot shows accuracy over FederatedAveraging rounds.
Figure 9: Communication for various client selection strategies. The left plot shows the number of clients participating in each round for each strategy. The right plot shows the cumulative communication.

D.3 Selecting the threshold for FT

A disadvantage of a fixed threshold strategy is the necessity of tuning an additional parameter; this was our motivation to develop adaptive strategies.

Figure 10: Performance for different threshold values.

In Fig 10, the right plot shows that higher accuracy is achieved by choosing smaller threshold values. However, the performance per amount of communication is better for larger thresholds. Furthermore, in Fig 11 we observe that the smallest threshold requires much more communication than the other threshold values.

Figure 11: Communication for different threshold values.

At the other extreme, using a threshold that is too large can greatly deteriorate the accuracy. As observed in Fig 10, with the largest threshold considered the model achieves only 90% accuracy.