Federated learning (FL) is a communication-efficient, privacy-preserving framework for training machine learning models in settings where the data is distributed across many clients. Such settings are common in applications that involve mobile devices, automated vehicles, and Internet-of-Things (IoT) systems, as well as in cross-silo applications including healthcare and banking. In FedAvg, the baseline FL procedure proposed in [34] (included in the supplementary material as Algorithm 2), a server distributes an initial model to clients, who independently update the model using their local training data. These updates are aggregated by the server, which broadcasts a new global model to the clients and selects a subset of them to start a new round of local training; the procedure is repeated until convergence. Since clients communicate only their models to the server, FL offers data security that can be further strengthened using privacy mechanisms, including those that provide differential privacy guarantees [1, 55, 35].
The number of clients in FL systems may be in the millions, and the models that they locally train could be rather large; for example, VGG [47], the widely known neural network for image recognition, has 160M parameters, which amounts to roughly 640MB when each parameter is represented by 32 bits. Clearly, collecting sizable models from a large number of clients may require significant communication resources. Moreover, many settings where FL is applied are highly dynamic (e.g., mobile devices, IoT), with new users joining at any moment and old users continuing to generate new data. Such settings may lead to a large number of training rounds and clients’ model uploads, even though the contributions of some locally updated models to the global one may be negligible. Since collecting enormous amounts of information requires considerable resources, it is desirable to reduce the communication rates consumed by an FL system. This is being explored in a promising line of work focused on reducing each client’s communication budget by compressing the ML model through strategies such as quantization and sparsification [53, 27, 52, 28, 4, 24].
In existing FL systems, the number of clients participating in each round of updates (and, therefore, the required communication budget) is typically fixed. Yet the contribution of many clients in any given round could be dispensable, especially near convergence. Following this intuition, we propose a novel approach to reducing communication in FL by identifying and transmitting only the client updates that are deemed informative. (Note that this approach is orthogonal to the aforementioned compression strategies, and that the two may in principle be combined; this is left to future work.)
In particular, we model the progression of each user’s vector of weights during stochastic gradient descent (SGD) as a multidimensional stochastic process, and decide whether or not to send an update to the server based on how informative the observed segment of a sample path is (e.g., how far the process is from its steady state). Specifically, we rely on the Ornstein-Uhlenbeck (OU) process, a continuous stochastic process parameterized by the mean and covariance functions, and interpret weights in SGD iterations as the points obtained by discretizing a sample path of the underlying OU process. Relying on this connection, we borrow techniques for optimal sampling of OU processes and adapt them to the problem of optimal client sampling. The optimal strategy turns out to be a simple threshold on the update’s norm and can thus be efficiently implemented at the client side.
We propose and analyze four different strategies for selecting the threshold and compare them in terms of the trained model accuracy and communication rates. To start, we show that using a judiciously chosen fixed threshold in all rounds of FL training reduces communication without significant performance deterioration. Next, using the OU process concepts, we propose a strategy where a model is communicated if the fraction of weights that are far from their steady state exceeds a predetermined threshold. We then develop two analogous dynamic strategies, wherein the threshold varies adaptively during the training process. In the first one, we estimate (locally, at the client’s side) the parameters of an OU process governing weight updates for each client, and use them to determine if the current global model (i.e., the latest model broadcast by the server) is close to the mean of the client’s process; if so, the client’s update need not be transmitted to the server. This is an effective and reliable assessment of the relevance of a client’s model but requires additional memory resources on the client’s side. In the second one, we propose an alternative based on the empirical variance of the gradient norm evaluated across clients. This strategy requires the clients to first share the norms of their updates with the server so that an optimal threshold can be computed and broadcast back; the clients then rely on the received threshold to decide whether or not to send their updates to the server.
Efficacy of the proposed formulation is demonstrated in practical settings, where we show that it may reduce communication during an FL training loop by up to 80% while achieving the same rate of convergence and competitive model accuracy. In particular, we test our methods on two FL benchmarks: a digit classification task on a real federated dataset (EMNIST) using two different network architectures, and a next-word prediction task on the Stackoverflow dataset, for which we train a recurrent neural network.
2 Background and Related Work
2.1 Federated Learning
The FL algorithm introduced in [34], FedAvg, requires a random subset of clients to send their updates to the server after having trained locally for $E$ epochs on mini-batches of size $B$. For more details, please see Algorithm 2 in the supplementary document. In subsequent years, this field has gained a lot of attention from the perspectives of privacy and security [1, 20, 5, 12], optimization [42], adversarial attacks [6, 9], and personalization [49]; see [26] for a comprehensive overview. Our focus is on communication challenges in federated learning.
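The server-side aggregation step of FedAvg can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' implementation: it assumes that models are flat lists of floats and that each client's contribution is weighted by the size of its local dataset; the function name `fedavg_aggregate` is hypothetical.

```python
from typing import List, Sequence


def fedavg_aggregate(client_weights: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Average client weight vectors, weighting each by its local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client model")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # fraction of all training samples held by this client
        for j in range(dim):
            aggregated[j] += share * weights[j]
    return aggregated


if __name__ == "__main__":
    # Two toy clients; the second holds three times as much data as the first.
    print(fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a real system the weight vectors would be tensors and the loop would be vectorized; plain lists are used here only to keep the sketch dependency-free.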
2.2 Communication reduction strategies
Highly relevant to our problem is the line of research on distributed estimation under communication constraints (see, e.g., [13, 18, 7, 57] and the references therein). Recently, the work in [58, 14, 23, 2] provided bounds on the minimax risk, i.e., the worst-case estimation error for a distributed system operating under a constraint on the maximum number of bits that remote nodes are allowed to transmit. Under the same constraint, Barnes et al. [8] leverage Fisher information to derive lower bounds on the estimation error.
In machine learning applications, including FL, schemes for reducing communication overhead typically perform compression on the client side and thus require additional computation for encoding and decoding. Deterministic approaches such as low-rank approximation, sparsification, subsampling, and quantization [27, 4, 24], and randomized ones including random rotations and stochastic rounding [52] and randomized approximation [28], can be used to reduce the communication while maintaining high accuracy. Note that these methods may be leveraged on the server side as well [15].
An alternative approach to learning under resource constraints is focused on reducing the overall model complexity, e.g., by bounding the model size [31, 39], pruning [22, 3, 17, 59], or restricting weights to be binary [16]. Adaptation of these methods to FL and their analysis in such contexts are open research questions. Finally, reducing the number of clients who send their updates to the server may significantly reduce the amount of communication, as demonstrated by heuristics that impose limits on the uploading times of updates [38].
2.3 Ornstein-Uhlenbeck Processes
The Ornstein-Uhlenbeck Process Definition.
The Ornstein-Uhlenbeck (OU) process is a stationary Gauss-Markov process that, over time, drifts towards its mean function. Unlike the Wiener process, whose drift is constant, the drift of the OU process depends on how far its current realization is from the mean. OU processes have been extensively studied in a wide range of fields including physics [29], finance [50, 19, 48], and biology [44, 45], to name a few. Formally, the OU process is described by the stochastic differential equation
$$d\theta_t = \lambda(\mu - \theta_t)\,dt + \sigma\,dW_t, \qquad (1)$$
where $W_t$ denotes the standard Wiener process. Expression (1) specifies a process that drifts towards $\mu$ with velocity $\lambda$ and has volatility driven by a Brownian motion with variance $\sigma^2$.
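To make the dynamics of an OU process concrete, the sketch below simulates a scalar sample path with the Euler-Maruyama scheme. It assumes the standard scalar form $d\theta_t = \lambda(\mu - \theta_t)\,dt + \sigma\,dW_t$; the function name `simulate_ou`, the parameter names, and the numerical values are illustrative choices, not something prescribed by the text.

```python
import math
import random
from typing import List


def simulate_ou(theta0: float, lam: float, mu: float, sigma: float,
                dt: float, n_steps: int, seed: int = 0) -> List[float]:
    """Euler-Maruyama discretization of d(theta) = lam*(mu - theta)*dt + sigma*dW."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    path = [theta0]
    for _ in range(n_steps):
        theta = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment ~ N(0, dt)
        path.append(theta + lam * (mu - theta) * dt + sigma * dw)
    return path


if __name__ == "__main__":
    path = simulate_ou(theta0=5.0, lam=1.0, mu=0.0, sigma=0.3, dt=0.01, n_steps=2000)
    print(f"start={path[0]:.2f}, end={path[-1]:.2f}")
```

With `sigma=0` the path decays deterministically towards `mu`, which is a convenient sanity check of the mean-reverting drift.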
Estimating parameters of the OU process.
Optimal sampling of the OU process.
Rate-constrained sampling of stochastic processes has been widely studied in the literature, primarily in the context of communications and control [25, 11, 37, 41, 36, 51, 40, 21]. In [25, 11], this problem is studied for random i.i.d. sequences and Gauss-Markov processes; Nayyar et al. [37] present a similar study in the scenario where the nodes of a sensor network collaboratively estimate the environment while operating under energy constraints that limit the amount of information sensors can transmit to a central processor. Rabi et al. [41] study linear diffusion processes and formulate sampling as a stopping time problem; Nar et al. [36] extend the results to multidimensional problems.
The setting where samples are observed locally (by nodes/clients) but used for estimation only if communicated to the central processor is studied in [51, 40]. As shown there, thresholding the increase in signal magnitude is an optimal sampling policy for estimating parameters of an OU process. An extension of these results to a larger class of continuous Markov processes with regularity conditions is reported in [21].
Connection between OU processes and SGD.
Recently, [10, 54, 30, 33] have studied SGD in various settings by leveraging stochastic differential equations. Mandt et al. [33] model SGD as an OU process and leverage its properties to derive the optimal model parameters. Wang et al. [54] investigate the asymptotic behaviour of descent algorithms through stochastic process modeling. Li et al. [30] show that SGD can be used for statistical inference, demonstrating that the average of SGD sequences can be approximated by an OU process. Blanc et al. [10] show that when learning a neural network with SGD and independent label noise, the dynamics of weight updates can intuitively be thought of as an OU process.
3 Efficient Training via Client Sampling
Here we introduce the strategy wherein a client transmits its weights to the server only if the norm of the client model’s update exceeds a predetermined threshold. This simple strategy can be derived using the interpretation of SGD as an OU process.
3.1 SGD as an OU process
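Consider the loss function $L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell_n(\theta)$, where $\mathcal{D}$ is a dataset with $N$ samples and $\ell_n(\theta)$ is the loss of point $n$ for $n = 1, \dots, N$. In stochastic gradient descent, $L$ is minimized by evaluating in each iteration an approximation of the gradient using a mini-batch $\mathcal{B}_t \subset \mathcal{D}$ of the data. In particular,
$$\theta_{t+1} = \theta_t - \eta\,\hat{g}(\theta_t), \qquad \hat{g}(\theta) = \frac{1}{B}\sum_{n \in \mathcal{B}_t} \nabla \ell_n(\theta),$$
where $\eta$ denotes the learning rate and $B$ the mini-batch size.
The following observations and assumptions are commonly encountered in the literature (see, e.g., [33]).
The central limit theorem implies that $\hat{g}(\theta) \approx g(\theta) + \frac{1}{\sqrt{B}}\,\Delta g(\theta)$ with $\Delta g(\theta) \sim \mathcal{N}(0, C(\theta))$, where $g(\theta) = \nabla L(\theta)$ denotes the full gradient and $C(\theta)$ is the corresponding covariance matrix.
Assumption 1: When $\theta$ approaches a stationary value, $C(\theta) = C$ is constant [33].
Assumption 2: The iterates lie in a region where the loss can be approximated by a quadratic form, $L(\theta) \approx \frac{1}{2}(\theta - \theta^\ast)^\top A\,(\theta - \theta^\ast)$ (readily justified in the case of smooth loss functions), and the process reaches a quasi-stationary distribution around a local minimum $\theta^\ast$.
Predicated on the above, the discrete process
$$\theta_{t+1} = \theta_t - \eta A(\theta_t - \theta^\ast) + \frac{\eta}{\sqrt{B}}\,C^{1/2}\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I),$$
can be interpreted as obtained by discretizing, with step $\Delta t = \eta$, the OU process
$$d\theta_t = A(\theta^\ast - \theta_t)\,dt + \sqrt{\eta / B}\;C^{1/2}\,dW_t.$$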
In the supplementary, we illustrate these arguments by showing plots of the sample paths of randomly selected weights in a convolutional neural network trained on the MNIST dataset using SGD with a constant learning rate. For each iteration, distribution of the weight increments is approximately Gaussian; as expected, weights follow trajectories typical of OU sample paths.
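The correspondence between SGD and an OU process can be checked numerically on a toy problem. The sketch below runs SGD with Gaussian gradient noise on a one-dimensional quadratic loss and compares the empirical variance of the iterates with the stationary variance of the resulting linear recursion. All names and numerical values are assumptions made for illustration, and the Gaussian noise term merely stands in for mini-batch sampling noise.

```python
import random
import statistics
from typing import List


def noisy_sgd_quadratic(theta0: float, theta_star: float, curvature: float,
                        lr: float, noise_std: float, n_steps: int,
                        seed: int = 0) -> List[float]:
    """SGD on L(theta) = curvature/2 * (theta - theta_star)^2 with noisy gradients."""
    rng = random.Random(seed)
    theta = theta0
    iterates = [theta]
    for _ in range(n_steps):
        # Exact gradient plus zero-mean Gaussian noise (stand-in for mini-batch noise).
        grad = curvature * (theta - theta_star) + rng.gauss(0.0, noise_std)
        theta -= lr * grad
        iterates.append(theta)
    return iterates


def stationary_variance(curvature: float, lr: float, noise_std: float) -> float:
    """Stationary variance of theta <- (1 - lr*curvature)*theta - lr*noise."""
    return lr * noise_std ** 2 / (curvature * (2.0 - lr * curvature))


if __name__ == "__main__":
    iterates = noisy_sgd_quadratic(5.0, 1.0, 1.0, 0.1, 1.0, 20000, seed=1)
    tail = iterates[5000:]  # discard the transient drift towards theta_star
    print(statistics.pvariance(tail), stationary_variance(1.0, 0.1, 1.0))
```

After the initial transient, the iterates fluctuate around the minimizer with a roughly constant spread, which is the mean-reverting behaviour characteristic of OU sample paths.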
3.2 Thresholding: Optimal sampling of OU processes
The aforementioned OU process sampling strategies (Sec 2) assume a network of nodes with unlimited access to the process, while the estimator is on a central server and the communication between the nodes and server is limited. In this setting, a node has access to signal information and may locally determine whether it should communicate to the server or not.
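When a sampling frequency constraint limits the number of samples that could be collected by a server, the nodes should locally decide when to send an update. Guo et al. [21] show that the optimal strategy in this case is to sample at the times $\tau_{i+1} = \inf\{t \ge \tau_i : \|\theta_t - \mathbb{E}[\theta_t \mid \theta_{\tau_i}]\| \ge \gamma\}$, and that the optimal decoding policy is given by $\hat{\theta}_t = \mathbb{E}[\theta_t \mid \theta_{\tau_i}]$, $t \in [\tau_i, \tau_{i+1})$. In particular, for the OU process in (1), $\mathbb{E}[\theta_t \mid \theta_{\tau_i}] = \mu + (\theta_{\tau_i} - \mu)\,e^{-\lambda (t - \tau_i)}$; note that $\lambda$ and $\mu$ are unknown (more on this below).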
To establish a connection to FL, we recall the arguments from the previous subsection and note that in each round of an FL procedure, each client "observes" a partial sample path of an OU process (i.e., the progression of its weights during local training); the sample path starts from a point (i.e., model weights) broadcast by the server at the beginning of a training round. Invoking the above sampling optimality results, we propose to schedule transmission of an update if the norm of the difference between the locally updated and the previously broadcast model exceeds a judiciously selected threshold.
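The client-side decision described above amounts to a single norm computation and a comparison. The following is a minimal sketch under stated assumptions: models are flat lists of floats, the norm is taken to be the Euclidean norm (the text does not specify which norm is used), and the function names are hypothetical.

```python
import math
from typing import Sequence


def update_norm(local_weights: Sequence[float],
                global_weights: Sequence[float]) -> float:
    """Euclidean norm of the difference between the local and the broadcast model."""
    return math.sqrt(sum((l - g) ** 2 for l, g in zip(local_weights, global_weights)))


def should_transmit(local_weights: Sequence[float],
                    global_weights: Sequence[float],
                    threshold: float) -> bool:
    """Send the update only if it moved further than the threshold."""
    return update_norm(local_weights, global_weights) > threshold


if __name__ == "__main__":
    broadcast = [0.0, 0.0]
    local = [3.0, 4.0]  # this update has norm 5
    print(should_transmit(local, broadcast, threshold=4.0))
    print(should_transmit(local, broadcast, threshold=6.0))
```

Because the test needs only the broadcast model and the locally trained model, it adds negligible computation on the client.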
Since we do not know the process parameters $\lambda$ and $\mu$, we estimate them using the previously aggregated models available to the server (see Sec 2.3); this in turn enables computing the decoded estimate. Furthermore, note that not receiving an update implies that the client’s weights have remained close to the previously broadcast model (since the proximity to a steady state is the reason thresholding takes place); therefore, in this case the trivial approximation that reuses the broadcast model is justified (and is immediately accessible to the server).
3.3 Selecting the threshold
To summarize, we propose a scheme where client $k$ is selected if the norm of its weights update exceeds a threshold $\gamma$, i.e., if $\|\theta^k_{t+1} - \theta_t\| > \gamma$, where $\theta_t$ denotes the model broadcast at the beginning of round $t$ and $\theta^k_{t+1}$ denotes the locally updated model of client $k$; ideally, thresholding reduces communication without incurring significant accuracy loss compared to the baseline. In this section, we explore several thresholding strategies, as listed below.
Fixed threshold (FT): The server provides a fixed threshold $\gamma$ to all clients before training starts. During training, and before sending local updates, each client tests whether $\|\theta^k_{t+1} - \theta_t\| > \gamma$. The clients for which this holds true send their updates; the others communicate only the size of their local dataset (to be used in the computation of weights in the model aggregation step).
Adaptive threshold (AT): At each iteration $t$, all clients report $\|\theta^k_{t+1} - \theta_t\|$ to the server (just one float number per client); the server in turn computes the empirical mean and variance of the received norms, and sends back to the clients a threshold $\gamma_t$ computed from these statistics.
OU process estimation (OU): During training, each client keeps track of the iterates of its local model’s parameters and, at the end, solves a regression problem (see Supplementary A.2) to find the estimates $\hat{\lambda}$ and $\hat{\mu}$ of the OU parameters. We then calculate the fraction of the parameters whose estimates indicate that the weight is still far from its steady state. Intuitively, this is reflective of the number of weights that have not reached the steady state of the corresponding OU process. If the fraction is higher than a predefined, fixed threshold (fraction), then client $k$ communicates to the server.
Adaptive OU (AOU): Similar to OU, but instead of using a fixed fraction for all rounds, we compute an adaptive fraction for each round by collecting the fractions from all clients and computing their mean and standard deviation, as in AT; finally, a threshold computed from these two statistics is sent back to the clients.
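The server-side part of the adaptive strategies can be sketched as follows. The exact rule that combines the empirical mean and spread into a threshold is not reproduced here, so the snippet assumes a hypothetical form, `mean + k * std`, with a tunable coefficient `k`; the function names are likewise illustrative. The same helper applies to AT (where the inputs are update norms) and to AOU (where the inputs are per-client fractions).

```python
import statistics
from typing import List, Sequence


def adaptive_threshold(values: Sequence[float], k: float = 0.0) -> float:
    """Hypothetical rule: threshold = mean + k * (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean + k * std


def clients_above(values: Sequence[float], threshold: float) -> List[int]:
    """Indices of the clients whose reported value exceeds the threshold."""
    return [i for i, value in enumerate(values) if value > threshold]


if __name__ == "__main__":
    reported_norms = [1.0, 2.0, 3.0, 4.0]  # one float per client
    gamma = adaptive_threshold(reported_norms)
    print(gamma, clients_above(reported_norms, gamma))
```

A negative `k` lowers the threshold and admits more clients, while a positive `k` makes the selection stricter; how this coefficient should be set is exactly the kind of choice the thresholding strategies differ on.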
3.4 An efficient communication algorithm
Summarizing the discussions, we formalize our algorithm for communication-efficient FL as Algorithm 1. In brief, $K$ clients are selected at the beginning of round $t$. The server broadcasts model parameters $\theta_t$, and the selected clients locally perform SGD with mini-batches of size $B$ for $E$ epochs. Then, following a communication threshold rule (selected among those described in Sec 3.3) that evaluates if the local model update exceeds the threshold, each client locally decides whether to communicate its updates or not, and transmits either the model updates or a negative-acknowledgement message, respectively. In both cases, the client sends its training data size $n_k$ to enable weighting in the aggregation step (line 16). Finally, following the optimal sampling strategy, the server estimates the parameters of each client that did not transmit by the conditional mean of that client’s OU process.
The OU parameters $\lambda$ and $\mu$ are estimated via least-squares (Supplementary A.2). Note that the server’s computationally cheap alternative to estimation is to simply reuse the client’s model from the previous round; in our experiments, we observed that this alternative consistently provides high accuracy. Finally, the server computes a new model by aggregating the received and estimated client models (line 11 of the algorithm’s pseudo-code).
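The aggregation step with the computationally cheap fallback can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: a client that sends a negative acknowledgement is represented by `None`, its model from the previous round is substituted, and every selected client is weighted by its reported dataset size. The function and argument names are hypothetical.

```python
from typing import Dict, List, Optional


def aggregate_round(received: Dict[int, Optional[List[float]]],
                    previous: Dict[int, List[float]],
                    sizes: Dict[int, int]) -> List[float]:
    """Weighted average over selected clients, reusing old models for silent ones."""
    total = sum(sizes[k] for k in received)
    dim = len(next(iter(previous.values())))
    new_global = [0.0] * dim
    for k, update in received.items():
        # None stands for a negative acknowledgement: fall back to the last known model.
        model = update if update is not None else previous[k]
        share = sizes[k] / total
        for j in range(dim):
            new_global[j] += share * model[j]
    return new_global


if __name__ == "__main__":
    received = {0: [2.0, 2.0], 1: None}  # client 1 stayed below the threshold
    previous = {0: [0.0, 0.0], 1: [4.0, 0.0]}
    sizes = {0: 1, 1: 1}
    print(aggregate_round(received, previous, sizes))
```

Replacing the fallback with an estimate based on fitted OU parameters would only change the line that selects `model`; the weighting logic stays the same.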
4 Experiments
In this section we present a number of experiments that demonstrate the performance of our proposed algorithm on different datasets and various settings and models. We start by describing the datasets and preprocessing steps, and then benchmark various client selection strategies on two different datasets, EMNIST and Stackoverflow, with three different models: (1) a small EMNIST neural network convenient for a comparison of all the strategies; (2) a more sophisticated convolutional neural network applied to EMNIST; and (3) a recurrent neural network for next-word prediction on the Stackoverflow dataset. Further details and additional experimental results can be found in the supplementary document.
Two different datasets were used for benchmarking and model validation experiments. For federated experiments we use the EMNIST dataset, a reprocessed version of the original MNIST dataset where each image is linked to its original writer, providing a natural non-i.i.d. distribution and allowing us to emulate an FL setting. This dataset consists of images attributed to 3843 users. For a larger and more realistic FL scenario we use the Stackoverflow dataset, a language modelling dataset with questions and answers collected from 342477 unique users. Following previous work with this dataset [42], we build a vocabulary of the 10000 most frequent words and restrict each user’s dataset to at most 128 sentences. We use padding and truncation to enforce 20-word sentences, and represent them with index sequences corresponding to the vocabulary words, out-of-vocabulary words, and the beginning and end of sentences.
We describe experiments on the aforementioned federated datasets and three different architectures. The main results are presented in this section, with further details, plots and tables found in the supplementary document.
We fix the number of selected clients at the beginning of each round and compare the four thresholding strategies in Sec 3.3 with two baselines: (1) a model with no communication constraints, which collects updates from all the clients, and (2) the setting where a fraction of clients is randomly dropped, with the fraction chosen for a meaningful and fair comparison with the proposed thresholding schemes. Specifically, the number of retained clients is set equal to the average number of clients participating per round in the thresholding scheme achieving the best accuracy.
In the EMNIST experiments we use standard settings and fix hyperparameters as in prior work [42], selecting clients at random in each round. We test two architectures: (i) a feed-forward network with 100K parameters; and (ii) a convolutional neural network with 1.7M parameters. In the Stackoverflow experiments we train an LSTM with 4M parameters. All the models are trained for 100 rounds. Since the results we obtained for the convolutional network on EMNIST are similar to the ones for the other two architectures, we present them in the supplementary document.
Accuracy of the models is compared in Fig 1 for the small (feed-forward) NN on EMNIST. In the left panel, we observe that the thresholding strategies typically achieve better accuracy per amount of communication than the baselines, except for OU, which exhibits a similar rate. In the right panel, we compare the overall accuracy across FL rounds and observe that the AT, OU and AOU thresholding methods achieve final accuracy similar to that of the full communication scheme; the FT strategy is somewhat less accurate, while the random client selection scheme suffers from significant performance degradation.
As shown in Fig 2, our thresholding strategies typically require a much smaller amount of communication to obtain accuracy comparable to the baseline (i.e., to the scheme using updates of all clients). Among them, the FT strategy achieves the highest communication savings but is ultimately not capable of matching the accuracy of the baseline. Moreover, while randomly dropping clients trivially achieves a low communication rate, the random client selection scheme suffers from a significant deterioration in accuracy. This should not come as a surprise since the random scheme closely resembles tuning the mini-batch size in SGD: if we use less data, less communication is needed, but at the cost of slower convergence. It is also in part a consequence of dropping too many clients near the optimum, where gradient norms might become smaller. Our adaptive thresholding strategies overcome the aforementioned problems by changing the threshold in each round based on either the overall clients’ update norms (AT scheme) or the overall number of varying weights (AOU scheme). The results show that the level of accuracy in FL can be maintained while reducing communication by using a smaller number of clients; however, these clients have to be carefully selected.
The right panel in Fig 2 shows that the FT scheme reduces communication exponentially in each round, while the communication in adaptive thresholding strategies oscillates due to selecting a near-optimal number of clients for model updates. The results in Table 1 and Table 2 suggest that while there is no single method that is uniformly superior in exploring accuracy-communication trade-offs, the thresholding strategies are an efficient way of subselecting clients without a significant deterioration of accuracy.
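| Method | Accuracy (EMNIST) | Accuracy (Stackoverflow) | Accuracy rate (EMNIST) | Accuracy rate (Stackoverflow) | Overall communication (EMNIST) | Overall communication (Stackoverflow) | Communication used (EMNIST) | Communication used (Stackoverflow) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 97.5% | 13.07% | 29.3 | 1.61 | 33.27 | 81.0 | 100% | 100% |
| FT | 95.4% | 13.63% | 150 | 2.34 | 6.3 | 58.3 | 19% | 71.96% |
| AT | 97.4% | 13.04% | 34.9 | 2.99 | 27.9 | 43.6 | 83% | 54% |
| Random | 94.2% | 9.67% | 148.9 | 2.22 | 6.3 | 43.6 | 19% | 54% |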
5 Conclusion
We propose a novel approach to reducing communication rates in FL by judiciously subselecting clients instead of relying on traditional compression strategies. A parallel between OU processes and SGD suggests strategies for identifying clients whose updates are informative and therefore should be communicated to the server. Experimental results demonstrate efficacy of the proposed methods in various settings. Moreover, our approaches can be combined with compression strategies to lower the communication rates even further.
The proposed client selection protocol based on thresholding is theoretically justified by the existing results on optimal OU process sampling. Future work involves automating threshold selection, and the analytical investigation of the connection between the selected thresholds and the FL system performance. Moreover, it is of interest to combine the methods proposed in this paper with the techniques that attempt to reduce communication burden by sparsifying transmitted information.
Broader Impact
Federated learning has emerged as a paradigm that allows users to maintain ownership of their data while still being able to enjoy sophisticated machine learning tools in applications that range from entertainment to vital technologies that might rely on sensitive health data.
Adoption and proliferation of FL solutions depend fundamentally on the availability of compute and communication resources. Our work addresses scenarios characterized by communication constraints and offers novel solutions that can vastly amplify already successful techniques such as compression and sparsification.
Further studies are needed to explore the effect of our work on fairness – for example, making sure that our sampling schemes are not biased with respect to vulnerable groups and that models trained with our sampling scheme deliver fair predictions. For this, comprehensive datasets would be needed to assess the model’s performance over diverse groups of clients.
References
[1] (2016) Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
[2] (2018) Inference under information constraints I: lower bounds from chi-square contraction. arXiv preprint arXiv:1812.11476.
[3] (2016) Net-trim: a layer-wise convex pruning of deep neural networks. arXiv preprint arXiv:1611.05162.
[4] (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720.
[5] (2019) Bounding user contributions: a bias-variance trade-off in differential privacy. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 263–271.
[6] (2018) How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
[7] (2012) Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pp. 26–1.
[8] (2019) Learning distributions from their samples under communication constraints. arXiv preprint arXiv:1902.02890.
[9] (2018) Analyzing federated learning through an adversarial lens. arXiv preprint arXiv:1811.12470.
[10] (2019) Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. arXiv preprint arXiv:1904.09080.
[11] (2008) Optimal estimation over channels with limits on usage. IFAC Proceedings Volumes 41 (2), pp. 6632–6637.
[12] (2016) Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482.
[13] (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122.
[14] Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pp. 1011–1020.
[15] (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210.
[16] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
[17] (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867.
[18] (2011) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
[19] (1994) Modelling real interest rates. Journal of Banking & Finance 18 (1), pp. 153–165.
[20] (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557.
[21] (2020) Optimal causal rate-constrained sampling for a class of continuous Markov processes. arXiv preprint arXiv:2002.01581.
[22] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[23] (2018) Geometric lower bounds for distributed parameter estimation under communication constraints. arXiv preprint arXiv:1802.08417.
[24] (2019) Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988.
[25] (2005) Optimal estimation with limited measurements. In Proceedings of the 44th IEEE Conference on Decision and Control, pp. 1029–1034.
[26] (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
[27] (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
[28] (2018) Randomized distributed mean estimation: accuracy vs. communication. Frontiers in Applied Mathematics and Statistics 4, pp. 62.
[29] (2002) An introduction to stochastic processes in physics. JHU Press.
[30] Statistical inference using SGD. In Thirty-Second AAAI Conference on Artificial Intelligence.
[31] (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858.
[32] (2013) Statistics of random processes II: applications. Vol. 6, Springer Science & Business Media.
[33] (2016) A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pp. 354–363.
[34] (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 1273–1282.
[35] (2018) A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210.
[36] (2014) Sampling multidimensional Wiener processes. In 53rd IEEE Conference on Decision and Control, pp. 3426–3431.
[37] (2013) Optimal strategies for communication and remote estimation with an energy harvesting sensor. IEEE Transactions on Automatic Control 58 (9), pp. 2246–2260.
[38] (2019) Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–7.
[39] (2019) Model compression by entropy penalized reparameterization. arXiv preprint arXiv:1906.06624.
[40] (2019) Sampling for remote estimation through queues: age of information and beyond. arXiv preprint arXiv:1902.03552.
[41] (2012) Adaptive sampling for linear state estimation. SIAM Journal on Control and Optimization 50 (2), pp. 672–702.
[42] (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295.
[43] (2020) Federating recommendations using differentially private prototypes. arXiv preprint arXiv:2003.00602.
[44] The Ornstein-Uhlenbeck process as a model for neuronal activity. Biological Cybernetics 35 (1), pp. 1–9.
[45] (2014) Modeling gene expression evolution with an extended Ornstein–Uhlenbeck process accounting for within-species variation. Molecular Biology and Evolution 31 (1), pp. 201–211.
[46] (2012) The jackknife and bootstrap. Springer Science & Business Media.
[47] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[48] (2015) Optimal mean reversion trading: mathematical analysis and practical applications. Vol. 1, World Scientific.
[49] (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434.
[50] (1991) Stock price distributions with stochastic volatility: an analytic approach. The Review of Financial Studies 4 (4), pp. 727–752.
[51] (2017) Remote estimation of the Wiener process over a channel with random delay. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 321–325.
[52] (2017) Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3329–3337.
[53] (2019) DoubleSqueeze: parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1905.05957.
[54] (2017) Asymptotic analysis via stochastic differential equations of gradient descent algorithms in statistical and computational paradigms. arXiv preprint arXiv:1711.09514.
[55] (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In ACM SIGMOD International Conference on Management of Data, pp. 1307–1322.
[56] (2017-08-28) (Website).
[57] (2013) Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research 14 (1), pp. 3321–3363.
[58] (2013) Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pp. 2328–2336.
[59] (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.
Appendix A Background details
A.1 Federated Averaging algorithm
For convenience and completeness, we provide here FedAvg, the baseline federated learning algorithm (Algorithm 2).
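The FedAvg loop described in the introduction (broadcast, local training, aggregation, repeat) can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 2: the least-squares model, the function names `local_sgd` and `fedavg`, the learning rate, and the synthetic client data are all assumptions made for the example. Client models are averaged with weights proportional to local dataset size, as is standard for FedAvg.

```python
import numpy as np

def local_sgd(w, X, y, epochs, batch_size, lr, rng):
    """Mini-batch SGD on a least-squares loss, starting from the global weights w."""
    w = w.copy()
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg(clients, w0, rounds, clients_per_round,
           epochs=1, batch_size=10, lr=0.1, seed=0):
    """Hypothetical FedAvg driver: clients is a list of (X, y) pairs."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(rounds):
        # Server samples a subset of clients and broadcasts the global model.
        chosen = rng.choice(len(clients), size=clients_per_round, replace=False)
        updates, sizes = [], []
        for k in chosen:
            X, y = clients[k]
            updates.append(local_sgd(w, X, y, epochs, batch_size, lr, rng))
            sizes.append(len(y))
        # Server aggregates: average weighted by local dataset size.
        w = np.average(updates, axis=0, weights=sizes)
    return w

# Synthetic example (assumed data): 20 clients sharing one linear model.
data_rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(20):
    X = data_rng.normal(size=(30, 2))
    clients.append((X, X @ w_true + 0.01 * data_rng.normal(size=30)))

w_hat = fedavg(clients, np.zeros(2), rounds=30, clients_per_round=5)
```

In this toy setting every selected client uploads its full model in every round, which is exactly the communication cost that the thresholding strategies aim to reduce.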
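A.2 Estimating parameters of the OU process via least-squares regression
A number of methods exist for estimating the parameters of the OU process from discrete observations of its sample path. We here summarize the simple least-squares solution. Writing the OU process in the standard form $d\theta_t = \lambda(\mu - \theta_t)\,dt + \sigma\,dW_t$, first note that the continuous process can be discretized as
$$\theta_{k+1} - \theta_k = \lambda(\mu - \theta_k)\,\Delta + \sigma\,\Delta W_k,$$
where $\Delta$ denotes the discretization (sampling) period and $\Delta W_k \sim \mathcal{N}(0, \Delta)$ are i.i.d. increments of a Wiener process. This leads to a linear measurement model
$$\theta_{k+1} = a\,\theta_k + b + \epsilon_k,$$
where $\epsilon_k$ denotes i.i.d. noise, and where
$$a = 1 - \lambda\Delta, \qquad b = \lambda\mu\Delta, \qquad \operatorname{Var}(\epsilon_k) = \sigma^2\Delta.$$
Then the parameter estimation can be formulated as a least-squares regression problem and solved in closed form, yielding
$$\hat{\lambda} = \frac{1 - \hat{a}}{\Delta}, \qquad \hat{\mu} = \frac{\hat{b}}{1 - \hat{a}}, \qquad \hat{\sigma}^2 = \frac{\widehat{\operatorname{Var}}(\epsilon)}{\Delta},$$
where $\hat{a}$ and $\hat{b}$ are the least-squares slope and intercept obtained by regressing $\theta_{k+1}$ on $\theta_k$, and $\widehat{\operatorname{Var}}(\epsilon)$ is the sample variance of the regression residuals.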
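The least-squares procedure can be illustrated with a short numerical sketch. It assumes the OU process is written as $d\theta_t = \lambda(\mu - \theta_t)\,dt + \sigma\,dW_t$ and discretized with an Euler step of length `dt`; the symbol names, the helper functions `simulate_ou` and `fit_ou_least_squares`, and the parameter values are choices made for this example, not taken from the paper's code.

```python
import numpy as np

def simulate_ou(lam, mu, sigma, theta0, dt, n_steps, rng):
    """Euler discretization: theta[k+1] = theta[k] + lam*(mu - theta[k])*dt + sigma*dW."""
    theta = np.empty(n_steps + 1)
    theta[0] = theta0
    dW = rng.normal(scale=np.sqrt(dt), size=n_steps)  # Wiener increments ~ N(0, dt)
    for k in range(n_steps):
        theta[k + 1] = theta[k] + lam * (mu - theta[k]) * dt + sigma * dW[k]
    return theta

def fit_ou_least_squares(theta, dt):
    """Regress theta[k+1] on theta[k] (slope a, intercept b) and map back to (lam, mu, sigma)."""
    x, y = theta[:-1], theta[1:]
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - (a * x + b)
    lam = (1.0 - a) / dt                      # a = 1 - lam*dt
    mu = b / (1.0 - a)                        # b = lam*mu*dt
    sigma = np.sqrt(resid.var(ddof=2) / dt)   # Var(noise) = sigma^2 * dt
    return lam, mu, sigma

rng = np.random.default_rng(0)
path = simulate_ou(lam=2.0, mu=1.0, sigma=0.5, theta0=0.0,
                   dt=0.01, n_steps=100_000, rng=rng)
lam_hat, mu_hat, sigma_hat = fit_ou_least_squares(path, dt=0.01)
```

With a long enough sample path the estimates should land close to the values used in the simulation; the reversion rate is typically the noisiest of the three, since its accuracy depends on the total observation time rather than on the number of samples.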
Appendix B Numerical illustration of SGD: Aptness of the OU process model
To illustrate the aptness of the OU process for modeling sequences of weights/parameters during SGD, we performed a simple experiment on the Fashion MNIST dataset. This dataset consists of 60,000 training images and 10,000 test images; each sample is a 28x28 image belonging to one of 10 categories. No user or id information is available.
We train a feed-forward neural network with a single hidden layer of 128 units with ReLU activation, followed by a softmax output layer. We train the model for 5 epochs with a batch size of 100, reaching a final test accuracy of 84%, at which point the model saturates.
In Fig 3 we show the trajectories of individual weights randomly selected from each layer (the trajectories are plotted after centering). Similar to sample paths of an OU process, all trajectories in Fig 3 ultimately drift towards a mean around which they stabilize.
We repeat the training 600 times with the same initial weights but with the batches reshuffled for each run. The trajectories of 100 randomly selected weights are recorded and used to generate histograms of increments (on a fixed interval) for each weight. In Fig 4, we plot the histograms of increments for four randomly selected parameters. The histograms suggest that the underlying distribution of the weight increments is approximately Gaussian, which is consistent with the transition densities of the OU process.
We further apply least-squares regression to the observed SGD trajectories to estimate parameters of the corresponding OU models. In Fig 4(a), we show example trajectories along with the corresponding means to which they are converging. In Fig 4(b), we single out one weight trajectory and show the mean to which it is converging as well as the variance of the steady-state distribution of the corresponding OU model.
Appendix C Experimental Details
The datasets used for the benchmarking and model validation experiments are: (i) the EMNIST dataset, a reprocessed version of the original MNIST dataset in which each image is linked to its original writer, providing a naturally non-i.i.d. distribution and allowing us to emulate an FL setting; and (ii) the Stackoverflow dataset, a language modelling dataset with questions and answers collected from users. The sizes of the datasets are summarized in the table below.
|Dataset||Users||Samples|
|EMNIST||3.4 K||60 K|
|Stackoverflow||342 K||136 M|
The ML architectures that we used are presented in Tables 1-2 below.
We describe the small NN trained on EMNIST in Table 1. In particular, we used a fully connected neural network having one hidden layer with 128 neurons and ReLU activation.
We train a deep neural network with the following architecture: two 5x5 convolution layers, with 32 and 64 channels respectively, interleaved with 2x2 max-pooling, a dense layer with 512 neurons with ReLU activation, and a 10-unit softmax output layer. The network has a total of 1,663,370 parameters. At each round, 50 clients are selected uniformly at random to update the model. Each client trains locally for a fixed number of epochs using stochastic gradient descent (SGD) with a fixed batch size.
|Layer||Output shape||# of parameters||Activation||Hyperparameters|
|Conv2d||(28,28,32)||832||ReLU||kernel size = 5|
|MaxPool2d||(14,14,32)||0||-||pool size = (2, 2)|
|Conv2d||(7,7,64)||51264||ReLU||kernel size = 5|
|MaxPool2d||784||0||-||pool size = (2, 2)|
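The total of 1,663,370 parameters quoted for the convolutional EMNIST model can be checked with a few lines of arithmetic. The sketch below assumes a single-channel 28x28 input, "same" padding in the convolutions (so that only the pooling layers shrink the feature map), and bias terms in every layer; these assumptions are not stated in the text, but they reproduce the per-layer counts 832 and 51264 as well as the total.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a square-kernel convolution."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

side = 28                           # assumed input: 28x28, one channel
conv1 = conv2d_params(5, 1, 32)     # 832
side //= 2                          # 2x2 max-pooling: 28 -> 14
conv2 = conv2d_params(5, 32, 64)    # 51264
side //= 2                          # 2x2 max-pooling: 14 -> 7
flat = side * side * 64             # 3136 features entering the dense layer
dense1 = dense_params(flat, 512)    # 1606144
dense2 = dense_params(512, 10)      # 5130
total = conv1 + conv2 + dense1 + dense2
print(total)                        # 1663370
```

Almost all of the parameters sit in the first dense layer, so under these assumptions it dominates the per-client upload size for this model.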
We train a recurrent neural network for next-word prediction that first embeds words into a 96-dimensional space, followed by an LSTM and finally a dense layer. The architecture is presented in Table 6.
Appendix D Additional Experimental Results
D.1 Conv EMNIST
Here we report the improvements due to thresholding for the convolutional model on EMNIST. As shown in Fig 6, employing FT provides 95% accuracy while using only 20% of the clients. This method requires a pre-specified threshold, which may be inconvenient since, as shown below in Sec D.3, the threshold needs to be tuned. However, we observe that AT matches the performance of the baseline scheme that uses all clients while reducing the communication by 20%. Fig 7 compares the communication used by the various methods and shows how AT reduces communication over the iterations. Note that a random selection of the same number of clients as in FT leads to a significant deterioration of the accuracy, demonstrating the value of our proposed thresholding strategies.
D.2 Stackoverflow
Below we present details and results on the Stackoverflow dataset. Here we observe that AT uses half of the clients while achieving almost the same performance as the baseline which uses all clients. Note that a random selection of clients leads to deteriorated accuracy.
D.3 Selecting the threshold for FT
A disadvantage of a fixed threshold strategy is the necessity of tuning an additional parameter; this was our motivation to develop adaptive strategies.
In Fig 10, we observe on the right that higher accuracy is achieved by choosing smaller thresholds. However, the performance per amount of communication is better for . Furthermore, in Fig 11 we observe that threshold requires much more communication than the other values of threshold.
In the other extreme, using a threshold that is too large can greatly deteriorate the accuracy. As observed in Fig 10, when , the model achieves only 90% accuracy.
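The trade-off discussed in this section can be illustrated with a toy sketch of a fixed-threshold rule. The rule used here is an assumption made purely for illustration: a client uploads its model only if the Euclidean distance between its local weights and the current global weights is at least `threshold`. The function names `select_uploads` and `aggregate` are hypothetical, and the exact criterion used by FT in the paper may differ.

```python
import numpy as np

def select_uploads(global_w, local_ws, threshold):
    """Indices of clients whose update norm is at least `threshold` (assumed rule)."""
    norms = np.linalg.norm(np.asarray(local_ws) - global_w, axis=1)
    return np.flatnonzero(norms >= threshold)

def aggregate(global_w, local_ws, selected):
    """Average the uploaded models; keep the global model if nobody uploads."""
    if len(selected) == 0:
        return global_w
    return np.asarray(local_ws)[selected].mean(axis=0)

# Toy data (assumed): 50 clients with random local models around a zero global model.
rng = np.random.default_rng(0)
global_w = np.zeros(3)
local_ws = rng.normal(size=(50, 3))

for threshold in (0.0, 1.0, 2.0, 10.0):
    selected = select_uploads(global_w, local_ws, threshold)
    new_w = aggregate(global_w, local_ws, selected)
    print(threshold, len(selected))
```

Under this rule the number of uploads can only decrease as the threshold grows: a threshold of zero recovers the full-communication baseline, while a very large threshold silences every client and leaves the global model unchanged, mirroring the two extremes described above.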