1 Introduction
Federated Learning (FL) allows users to reap the benefits of models trained from rich yet sensitive data captured by their mobile devices, without the need to centrally store such data (McMahan et al., 2017; Konečný et al., 2016a; Smith et al., 2017). Under the FL paradigm, each device performs training on samples available locally and only communicates intermediate model updates.
Network speed and number of nodes are two of the core systems aspects that differentiate FL from traditional distributed learning in data centers, with network bandwidth being potentially orders of magnitude slower and the number of worker nodes orders of magnitude larger. Together, these issues exacerbate the communication bottlenecks usually associated with distributed learning, increasing both the number of stragglers and the probability of devices dropping out altogether. The problem is further aggravated when working with high capacity models with large numbers of parameters.
Insisting on training these large models using existing federated optimization methods can lead to the systematic exclusion of clients with restricted bandwidth or limited network access from the training stage, and thus to a degraded user experience once these models are served. One naive solution involves training low-capacity models with smaller communication footprints, at the expense of model accuracy. As a middle ground, we could develop strategies to reduce the communication footprint of larger, high-capacity models. Recent work (Konečný et al., 2016b) has in fact taken this approach, but only in the context of client-to-server FL communication. Their success with lossy compression strategies is perhaps not surprising, as the clients’ lossy, yet unbiased, updates are eventually averaged over many users. However, server-to-client exchanges do not benefit from such averaging. As such, they remain a main bottleneck in our goal of expanding FL’s reach.
In this work, we propose two novel strategies to mitigate the server-to-client communication footprint, and empirically demonstrate their efficacy and seamless integration with existing client-to-server strategies. The specific contributions of this paper are as follows:

We study lossy compression of the models downloaded by the clients, thus addressing the open question of whether these approaches are viable for server-to-client exchanges. We also introduce the use of the theoretically motivated Kashin’s representation to reduce the error associated with the lossy compression (Lyubarskii and Vershynin, 2010; Kashin, 1977).

We introduce Federated Dropout, a technique that builds upon the popular idea of dropout (Srivastava et al., 2014), yet is primarily motivated by systems-related concerns. Our approach enables each device to locally operate on a smaller submodel (i.e. with smaller weight matrices) while still providing updates that can be applied to the larger global model on the server. It thus reduces communication costs by allowing for these smaller submodels to be exchanged between server and clients, while also reducing the computational cost of local training.

We empirically show that these approaches are compatible not only with one another but also with existing client-to-server compression. Combining these approaches during FL training (see Figure 1) reduces the size of the downloaded models up to , the size of the corresponding updates up to , and the required local computations by up to , all without degrading the model’s accuracy and only at the expense of a slightly slower convergence rate (in terms of number of communication rounds).
2 Related Work
We review the relevant related work given our objective of reducing the communication footprint in server-to-client exchanges in Federated Learning (FL).
Federated Learning. Federated Learning (FL) is a technique that aims to learn a global model over data distributed across multiple edge devices (usually mobile phones) without the data ever leaving the device on which it was generated (McMahan et al., 2017). It brings along a set of statistical (non-IID, unbalanced data) and systems (stragglers, communication bottlenecks, etc.) challenges which differentiate it from traditional distributed learning in the data center, and which have been tackled by several works. For instance, McMahan et al. (2017) propose Federated Averaging (FedAvg), which in its canonical form works by (1) sending the global model to a subset of the available devices, (2) training the model on each device using the available local data, and (3) averaging the local updates, which ends a round of training. In contrast, Smith et al. (2017) present a multi-task variant that also models the relationship between clients in order to learn personalized yet related models for each device. Nonetheless, all approaches we are aware of (including the two aforementioned ones) require continued exchanges between a central server and its clients across a potentially slow network.
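The three steps of a FedAvg round described above can be illustrated with a minimal sketch. The least-squares model, the function names (`local_update`, `fedavg_round`), the learning rate, and the weighting of each update by the client's number of local samples are assumptions made for this example only; they are not taken from the experiments in this paper.

```python
import numpy as np

def local_update(global_w, x, y, lr=0.5, epochs=1):
    """Hypothetical local training: gradient steps on a least-squares loss."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w - global_w  # the model update sent back to the server

def fedavg_round(global_w, clients, rng, clients_per_round=2):
    """One round: (1) pick clients, (2) train locally, (3) average updates."""
    chosen = rng.choice(len(clients), size=clients_per_round, replace=False)
    updates, weights = [], []
    for idx in chosen:
        x, y = clients[idx]
        updates.append(local_update(global_w, x, y))
        weights.append(len(y))  # assumed weighting: local sample count
    avg_update = np.average(np.stack(updates), axis=0, weights=weights)
    return global_w + avg_update

# Toy data: five clients sharing one underlying linear model.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(5):
    x = rng.normal(size=(20, 2))
    clients.append((x, x @ true_w))

w = np.zeros(2)
for _ in range(100):
    w = fedavg_round(w, clients, rng)
```

Note that every round begins with the server sending `global_w` to the selected clients; this download is the server-to-client exchange that the remainder of the paper aims to shrink.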
Communication-efficient distributed learning. Distributed learning is known to suffer from communication overheads associated with the frequent gradient updates exchanged among nodes (Wang et al., 2018; Dean et al., 2012; Smith et al., 2018; Reddi et al., 2016). To reduce these bottlenecks, recent studies focus on communicating a sparsified, quantized or randomly subsampled version of the updates. Although these operations introduce noise, they have been shown both empirically and theoretically to maintain the quality of the trained models. We refer the reader to the introduction of Wang et al. (2018) for more details and references.
In the context of FL, Konečný et al. (2016b) successfully perform lossy compression on the client-to-server exchanges (i.e. the model updates). Of particular interest is their use of the randomized Hadamard transform to reduce the error incurred by the subsequent quantization. This is due to the fact that the Hadamard transform, in expectation, spreads a vector’s information more evenly across its components (Suresh et al., 2017; Konečný and Richtárik, 2016). We note, however, that neither the work on traditional distributed learning nor the work of Konečný et al. (2016b) considers compressing the server-to-client exchanges. Nevertheless, in FL, downloading a large model can still be a considerable burden for users, particularly for those in regions with network constraints. Furthermore, as FL is expected to deal with a large number of devices, communicating the global model may even become a bottleneck for the server (as it would, ideally, send the model to the clients in parallel).
Model compression. Deep models tend to demand significant computational resources both for training and inference. Using them on edge devices is therefore not a straightforward task. Because of this, several recent works have proposed compressing the models before deploying them on-device (Ravi, 2018). Popular alternatives include pruning the least useful connections in a network (Han et al., 2016, 2015), weight quantization (Hubara et al., 2016; Lin et al., 2017; De Sa et al., 2018), and model distillation (Hinton et al., 2015). Many of these approaches, however, are not applicable to the problems addressed in this work, as they are either ingrained in the training procedure (and our server holds no data and performs no actual training) or are mostly optimized for inference. In the context of FL, we need something computationally light that can be efficiently applied in every round and that also allows for subsequent local training. We do note, however, that some of the previously mentioned approaches could potentially be leveraged at inference time in the federated setting, and exploring these directions would be an interesting avenue for further research.
3 Methods
In this section, we present our proposed strategies for reducing Federated Learning’s (FL) server-to-client communication costs, namely lossy compression techniques (Section 3.1) and Federated Dropout (Section 3.2). We introduce the strategies separately, but they are fully compatible with one another (as we show in Section 4.4).
3.1 Lossy Compression
Our first approach to reducing bandwidth usage consists of using lightweight lossy compression techniques that can be applied to an already trained model and that, when reversed (i.e. after decompression), maintain the model’s quality. The particular set of techniques we propose is inspired by those successfully used by Konečný et al. (2016b) to compress the client-to-server updates. We apply them, however, to the server-to-client exchanges, meaning we do not get the benefit of averaging the noisy decompressions over many updates.
Our method works as follows: we reshape each to-be-compressed weight matrix in our model into a vector and (1) apply a basis transform to it. We then (2) subsample and (3) quantize the resulting vector and finally send it through the network. Once received, we simply execute the respective inverse transformations to obtain a noisy version of the original weight matrix.
Basis transform. Previous work (Lyubarskii and Vershynin, 2010; Konečný et al., 2016b) has explored the idea of using a basis transform to reduce the error that will later be incurred by perturbations such as quantization. In particular, Konečný et al. (2016b) use the random Hadamard transform to more evenly spread out a vector’s information among its dimensions. We go even further and also apply the classical results of Kashin (1977) to spread a vector’s information as much as possible in every dimension (Lyubarskii and Vershynin, 2010). Thus, Kashin’s representation mitigates the error incurred by subsequent quantization compared to using the random Hadamard transform. For a more detailed discussion, we refer the reader to Section A.3 in the Appendix.
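To make the "spreading" effect concrete, the following is a minimal sketch of a randomized Hadamard transform: a random diagonal of signs followed by an orthonormal Walsh-Hadamard transform. The function names and the toy input are hypothetical, and the sketch assumes the vector length is a power of two; it is an illustration rather than the implementation used in our experiments.

```python
import numpy as np

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = np.asarray(v, dtype=np.float64)
    n = v.shape[0]
    h = 1
    while h < n:
        blocks = v.reshape(-1, 2, h)
        top, bottom = blocks[:, 0, :], blocks[:, 1, :]
        v = np.stack([top + bottom, top - bottom], axis=1).reshape(n)
        h *= 2
    return v / np.sqrt(n)

def randomized_hadamard(v, signs):
    """Forward transform: flip signs at random, then apply the Hadamard matrix."""
    return fwht(signs * v)

def inverse_randomized_hadamard(z, signs):
    """The normalized Hadamard matrix is its own inverse, so undo in reverse order."""
    return signs * fwht(z)

rng = np.random.default_rng(0)
v = np.zeros(8)
v[3] = 4.0                                  # all the energy sits in one coordinate
signs = rng.choice([-1.0, 1.0], size=v.size)
z = randomized_hadamard(v, signs)           # energy now spread over all coordinates
v_back = inverse_randomized_hadamard(z, signs)
```

For this spike input every transformed coordinate has the same magnitude, so the range that a subsequent quantizer has to cover is much smaller than in the original basis, while the norm of the vector is unchanged.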
Subsampling. For a subsampling rate $s$ (the fraction of elements that are kept), we zero out a $1 - s$ fraction of the elements in each weight matrix, appropriately rescaling the remaining values. The elements to zero out are picked uniformly at random. Thus, we only communicate the nonzero values and a random seed which allows recovery of the corresponding indices.
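A minimal sketch of this subsampling step is given below. The helper names, the choice of keeping a fixed number of uniformly chosen entries, and the rescaling by the inverse of the kept fraction (which makes the decompressed vector an unbiased estimate of the original) are assumptions made for illustration.

```python
import numpy as np

def _kept_indices(size, keep_fraction, seed):
    """Indices of the kept entries; reproducible from the shared seed alone."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(keep_fraction * size)))
    return np.sort(rng.choice(size, size=n_keep, replace=False))

def subsample(vec, keep_fraction, seed):
    """Return only the kept values, rescaled so the estimate stays unbiased."""
    idx = _kept_indices(vec.size, keep_fraction, seed)
    return vec[idx] * (vec.size / idx.size)

def desubsample(values, size, keep_fraction, seed):
    """Rebuild the full-length vector; dropped entries are zero."""
    idx = _kept_indices(size, keep_fraction, seed)
    out = np.zeros(size)
    out[idx] = values
    return out

vec = np.arange(1.0, 11.0)
sent = subsample(vec, keep_fraction=0.5, seed=42)       # what goes over the network
recovered = desubsample(sent, vec.size, keep_fraction=0.5, seed=42)
```

Only `sent` and the seed need to be transmitted, since the receiver can regenerate the same indices from the seed.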
Probabilistic quantization. For a vector $h = (h_1, \dots, h_n)$, let us denote $h_{\max} = \max_j h_j$ and $h_{\min} = \min_j h_j$. Uniform probabilistic 1-bit quantization replaces every element $h_i$ by $h_{\min}$ with probability $\frac{h_{\max} - h_i}{h_{\max} - h_{\min}}$, and by $h_{\max}$ otherwise. It is straightforward to verify this yields an unbiased estimate of $h$. Now, for $q$-bit uniform quantization, we first equally divide $[h_{\min}, h_{\max}]$ into $2^q$ intervals. If $h_i$ falls in the interval bounded by $h'$ and $h''$, the quantization operates by replacing $h_{\min}$ and $h_{\max}$ in the above procedure by $h'$ and $h''$, respectively.
3.2 Federated Dropout
To further reduce communication costs, we propose an algorithm in which each client, instead of locally training an update to the whole global model, trains an update to a smaller submodel. These submodels are subsets of the global model and, as such, the computed local updates have a natural interpretation as updates to the larger global model. We call this technique Federated Dropout as it is inspired by the well-known idea of dropout (Srivastava et al., 2014), albeit motivated primarily by systems-level concerns rather than as a strategy for regularization.
In traditional dropout, hidden units are multiplied by a random binary mask in order to drop an expected fraction of neurons during each training pass through the network. Because the mask changes in each pass, each pass is effectively computing a gradient with respect to a different submodel. These submodels can have different sizes (architectures) depending on how many neurons are dropped in each layer. Now, even though some units are dropped, in all implementations we are aware of, activations are still multiplied with the original weight matrices; these matrices simply contain some rows and columns that go unused.
To extend this idea to FL and realize communication and computation savings, we instead zero out a fixed number of activations at each fully-connected layer, so all possible submodels have the same reduced architecture; see Figure 2. The server can map the necessary values into this reduced architecture, meaning only the necessary coefficients are transmitted to the client, repacked as smaller dense matrices. The client (which may be fully unaware of the original model’s architecture) trains its submodel and sends its update, which the server then maps back to the global model (this can be done by communicating a single random seed to the client and back, or via state on the server). For convolutional layers, zeroing out activations would not realize any space savings, so we instead drop out a fixed percentage of filters.
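The server-side bookkeeping for one fully-connected layer can be sketched as follows. The helper names, the layer shape, the keep fraction, and the placeholder client update are hypothetical; the sketch only illustrates how a dense submatrix can be extracted for the client and how the client's update can be mapped back into the global weights.

```python
import numpy as np

def kept_units(n_units, keep_fraction, rng):
    """Pick a fixed number of units to keep, so all submodels share one shape."""
    n_keep = max(1, int(round(keep_fraction * n_units)))
    return np.sort(rng.choice(n_units, size=n_keep, replace=False))

def extract_submodel(w, in_idx, out_idx):
    """Repack the surviving rows and columns as a smaller dense matrix."""
    return w[np.ix_(in_idx, out_idx)].copy()

def apply_update(w, sub_update, in_idx, out_idx):
    """Map the client's update back onto the matching global coefficients."""
    new_w = w.copy()
    new_w[np.ix_(in_idx, out_idx)] += sub_update
    return new_w

seed = 7                                   # stands in for the per-client seed or server state
rng = np.random.default_rng(seed)
w = rng.normal(size=(6, 8))                # global weights: 6 input units, 8 output units
in_idx = kept_units(w.shape[0], 0.5, rng)
out_idx = kept_units(w.shape[1], 0.5, rng)

sub = extract_submodel(w, in_idx, out_idx)  # what the client downloads
sub_update = -0.1 * np.ones_like(sub)       # placeholder for the client's trained update
new_w = apply_update(w, sub_update, in_idx, out_idx)
```

The client only ever sees `sub`, a dense matrix with a quarter of the original entries in this toy setting, and the coefficients that were not selected are left untouched by the update.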
This technique brings two additional benefits beyond savings in server-to-client communication. First, the size of the client-to-server updates is also reduced. Second, the local training procedure now requires a smaller number of FLOPs per gradient evaluation, either because all matrix multiplications are now of smaller dimensions (for fully-connected layers) or because fewer filters have to be applied (for convolutional ones). Thus, we reduce local computational costs.
4 Experimental Results
In this section, we first present our experimental setup (Section 4.1) before presenting results for our lossy compression (Section 4.2) and Federated Dropout (Section 4.3) strategies. Finally, we show experiments that use both of these strategies in tandem with those proposed in Konečný et al. (2016b) to also compress client-to-server exchanges (Section 4.4).
4.1 Experimental Setup
Optimization Algorithm. We focus on testing our strategies against already established FL benchmarks. In particular, we restrict our experiments to the use of Federated Averaging (FedAvg) (McMahan et al., 2017).
Datasets. We use three datasets in our experiments: MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky and Hinton, 2009) and Extended MNIST or EMNIST (Cohen et al., 2017). The first two were used to benchmark the performance of FedAvg and of lossy compression for client-to-server updates (Konečný et al., 2016b). For these two datasets, we use the artificial IID partition proposed by these previous works. Meanwhile, EMNIST is a dataset that has only recently been introduced as a useful benchmark for FL. Derived from the same source as MNIST, it also includes the identifier of the user that wrote the character (digit, lower or upper case letter), creating a natural and much more realistic partition of the data. Table 1 summarizes the basic dataset properties. Due to space constraints, we relegate the MNIST results to Appendix B, though all conclusions presented here also qualitatively hold for these experiments.
Dataset  # of users  IID  Training samples per user  Test samples per user  

mean  mean  
MNIST  Yes  
CIFAR10  Yes  
EMNIST  No 
Models. For MNIST’s digit recognition task we use the same model as McMahan et al. (2017): a CNN with two 5x5 convolution layers (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer, for a total of more than parameters. For CIFAR10, we use the all-convolutional model described as “Model C” in Springenberg et al. (2015), which also has a total of over parameters. Finally, for EMNIST we use a variant of the MNIST model with 2048 units in the final fully connected layer. While none of these models is the state of the art, they are sufficient for evaluating our methods, as we wish to measure accuracy degradation against a baseline and not to achieve the best possible accuracy on these tasks.
Hyperparameters. We do not optimize our experiments for FedAvg’s hyperparameters, always using those that proved to work reasonably well in our baseline setting which involves no compression and no Federated Dropout. For local training at each client we use static learning rates of for MNIST, for CIFAR10 and for EMNIST. We select random clients per round for MNIST and CIFAR10, and for EMNIST. Finally, each selected client trains for one epoch per round using a batch size of .
4.2 Lossy Compression
We focus on testing how the compression strategies presented in Section 3.1 impact the global model’s accuracy. Like Konečný et al. (2016b), we do not compress all variables of our models. As they mention, compressing smaller variables causes significant accuracy degradation but translates into minuscule communication savings. As such, we do not compress biases for any of the models. (Unlike Konečný et al. (2016b), we do compress all 9 convolutional layers in the CIFAR10 model, not just the 7 in the middle.)
In our experiments, we vary three parameters:

The type of basis transform applied: no transform or identity (I), randomized Hadamard transform (HD) and Kashin’s representation (K).

The subsampling rate $s$, which refers to the fraction of weights that are kept (i.e. $1 - s$ of the weights are zeroed out).

The number of quantization bits $q$.
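To illustrate the role of the last parameter, the following is a minimal sketch of uniform probabilistic quantization as described in Section 3.1. The function names are hypothetical, and the sketch assumes a grid of 2**num_bits equally spaced points between the minimum and maximum of the vector, so that each index fits in num_bits bits; with num_bits=1 this reduces to rounding every entry to either the minimum or the maximum.

```python
import numpy as np

def quantize(vec, num_bits, rng):
    """Stochastically round each entry to one of its two neighbouring grid points."""
    lo, hi = float(vec.min()), float(vec.max())
    levels = 2 ** num_bits                    # assumed number of grid points
    if hi == lo:
        return np.zeros(vec.shape, dtype=np.int64), lo, hi
    step = (hi - lo) / (levels - 1)
    pos = (vec - lo) / step                   # position measured in grid cells
    lower = np.floor(pos)
    prob_up = pos - lower                     # closer to the upper point: more likely to round up
    idx = lower + (rng.random(vec.shape) < prob_up)
    idx = np.clip(idx, 0, levels - 1).astype(np.int64)
    return idx, lo, hi                        # indices plus two floats are transmitted

def dequantize(idx, lo, hi, num_bits):
    """Map grid indices back to real values."""
    levels = 2 ** num_bits
    if hi == lo:
        return np.full(idx.shape, lo)
    return lo + idx * (hi - lo) / (levels - 1)

rng = np.random.default_rng(0)
vec = rng.normal(size=1000)
num_bits = 4
idx, lo, hi = quantize(vec, num_bits, rng)
rec = dequantize(idx, lo, hi, num_bits)
step = (hi - lo) / (2 ** num_bits - 1)
```

Each entry moves by at most one grid step, and rounding up with probability proportional to the distance from the lower grid point keeps the reconstruction unbiased in expectation.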
Figure 3 shows the effect of varying these parameters for CIFAR10 and EMNIST. We repeat each experiment times and report the mean accuracy among these repetitions. The three main takeaways from these experiments are: (1) for every model, we are able to find a setting of compression parameters that at the very least matches our baseline; (2) Kashin’s representation proves to be most useful for aggressive quantization values; and (3) it appears that subsampling is not all that helpful in the server-to-client setting. We proceed to give more details about these highlights.
The first takeaway is that, for every model, we are indeed able to find a setting of compression parameters that matches or, in some cases, slightly outperforms our baseline. In particular, we are able to quantize every model to bits, which translates to a reduction in communication of nearly 8×.
The second takeaway is that Kashin’s representation proves to be most useful for aggressive quantization values, i.e. for a low number of quantization bits. In our experiments, gains were observed only in regimes where the overall accuracy had already degraded, but we hypothesize that the use of Kashin’s representation may provide clearer benefits in the compression of client-to-server gradient updates, where more aggressive quantization is admissible. We also highlight that using Kashin’s representation may be beneficial for other datasets. Indeed, its computational costs are comparable to those of the random Hadamard transform while also providing better theoretical error rates (see Section A.1). We refer the reader to Section A.3 in the Appendix, where we show preliminary results that demonstrate Kashin’s potential to dominate over the randomized Hadamard transform in compressing fully-trained models, particularly for small values of .
Finally, it appears that subsampling is not all that helpful in this server-to-client setting. This contrasts with the results presented by Konečný et al. (2016b) for compressed client-to-server updates, where aggressive values of were admissible. This trend extends to the other compression parameters: server-to-client compression of global models requires much more conservative settings than client-to-server compression of model updates. For example, for CIFAR10, Konečný et al. (2016b) get away with using and under a random Hadamard transform representation (the updates for CIFAR10 can actually be compressed up to 2 bits). Meanwhile, in Figure 3 we can see that, for the same and representation, already causes an unacceptable degradation of the accuracy. This is not surprising, since it is expected that the updates’ error will cancel out once several of them get aggregated at the server, which is not true for model downloads.
4.3 Federated Dropout
We focus on testing how the global model’s accuracy deteriorates once we use the strategy proposed in Section 3.2. In these experiments, we vary the percentage of neurons (or filters for the case of convolutional layers) that are kept on each layer of our models (we call this the federated dropout rate). We always keep the totality of the input and logits layers, and never drop the neuron that can be associated with the bias term.
Figure 4 shows how the convergence of our three models behaves under different federated dropout rates. We repeat each experiment 10 times and report the mean among these repetitions. The main takeaway from these experiments is that, for every model, it is possible to find a federated dropout rate less than that matches or, in some cases, even improves on the final accuracy of the model.
A federated dropout rate of seems to work across the board. This corresponds to dropping of the rows and columns of the weight matrices of fully-connected layers (which translates to a reduction in size), and to dropping the same percentage of filters of each convolutional layer. Now, because fully-connected layers correspond to most of the parameters of the MNIST and EMNIST models, the reduction will apply to them both in terms of the amount of data that has to be communicated and of the number of FLOPs required for local training. Meanwhile, because our CIFAR model is fully convolutional, gains will be of .
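As a rough accounting of why dropping both rows and columns compounds, consider a fully-connected layer with an $m \times n$ weight matrix whose input and output units are both kept at the same federated dropout rate. The symbol $k$ for this rate and the value $k = 0.5$ below are introduced purely for illustration and are not the settings used in our experiments.

```latex
\[
\frac{\lvert W_{\mathrm{sub}} \rvert}{\lvert W \rvert}
  = \frac{(k\,m)\,(k\,n)}{m\,n}
  = k^{2},
\qquad \text{e.g. } k = 0.5 \;\Rightarrow\; k^{2} = 0.25 .
\]
```

Since the input and logits layers are always kept in full, the first and last weight matrices shrink along one dimension only, so the overall savings for a given model are somewhat smaller than $k^{2}$ would suggest.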
As a final comment, we note that more aggressive federated dropout rates tend to slow down the convergence rate of the model, even if they sometimes result in a higher accuracy.
4.4 Reducing the overall communication cost
Our final set of experiments shows how our models behave once we combine our two strategies, lossy compression and Federated Dropout, with existing client-to-server compression schemes (Konečný et al., 2016b), in order to explore how the different components of this end-to-end, communication-efficient framework interact. To do this, we evaluate how our models behave under 3 different compression schemes (aggressive, moderate and conservative) and 4 different federated dropout rates (, , and ). The values for these schemes and rates were picked based on the observed behavior during the previous experiments, being somewhat more conservative as we are now combining different sources of noise. Table 2 describes the settings for each scheme.
Figure 5 shows how our CIFAR10 and EMNIST models behave under each of the previously mentioned conditions. We repeat each experiment 5 times and report the mean among these repetitions. For all three models, a federated dropout rate of resulted in models with no accuracy degradation under all compression schemes except for the most aggressive. For MNIST and EMNIST, this translates into server-to-client communication savings of , client-to-server savings of and a reduction of in local computation, all without degrading the accuracy of the final global model (and sometimes even improving it). For CIFAR10, we provide server-to-client communication savings of , client-to-server savings of and local computation savings of .
Based on these results, we also hypothesize that a federated dropout rate of combined with a moderate or conservative compression scheme will be a good starting point when setting these parameters in practice.
Scheme  ClienttoServer  ServertoClient  

transf.  transf.  
Aggressive  Kashin’s  Kashin’s  
Moderate  Kashin’s  Kashin’s  
Conservative  Kashin’s  Kashin’s 
5 Conclusions and Open Questions
The ecosystem currently targeted by Federated Learning (FL) is marked by heterogeneous edge networks that can potentially be orders of magnitude slower than the ones in datacenters. At the same time, FL can be quite demanding in terms of bandwidth, particularly when used to train deep models. We are thus at risk of either restricting the type of models we are able to train using this technique, or of excluding large groups of users from federated training. Both issues are problematic, but because access to high-end networks also appears to be correlated with sensitive factors such as income and age (Anzilotti, 2016; Pew Research Center, 2018), the latter may have implications related to fairness, making it particularly sensitive as we continue the adoption of FL systems.
Our work dramatically reduces the communication overheads in FL by (1) using lossy compression techniques on the server-to-client exchanges and by (2) using Federated Dropout, a technique that only communicates subsets of the global model to each client. We empirically show that a combination of our strategies with previous work allows for up to a reduction in server-to-client communication, a reduction in local computation and a reduction in client-to-server communication.
In future work, we plan to: explore the efficacy of introducing a server step size in order to account for the use of different submodels in Federated Dropout; investigate the possibility of using the same submodels for all the selected clients in one round; and further characterize the benefits of Kashin’s representation in compressing the gradient updates in FL and in traditional model serving. An additional future direction to pursue related to fairness involves studying the effect of adaptively using these strategies (i.e. using more aggressive compression and federated dropout rates for some users) to prevent unfairly biased models.
Finally, we note that the success of Federated Dropout suggests an entirely new avenue of research in which smaller, perhaps personalized, submodels are eventually aggregated into a larger, more complex model that can be managed by the server. Contrary to the classic datacenter setting, the computational overhead associated with first creating and then aggregating the submodels is justified in FL.
Acknowledgements
This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
References
 Anzilotti [2016] Eillie Anzilotti. Visualizing the state of global internet connectivity, Aug 2016. URL https://www.citylab.com/life/2016/08/visualizingthestateofglobalinternetconnectivity/496328/.
 Candes et al. [2006] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
 Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 De Sa et al. [2018] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.
 Dean et al. [2012] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 Foucart and Rauhut [2013] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing, volume 1. Birkhäuser Basel, 2013.

 Han et al. [2015] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations, 2016.
 Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
 Kashin [1977] Boris Sergeevich Kashin. Diameters of some finite-dimensional sets and classes of smooth functions. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 41(2):334–351, 1977.
 Konečný and Richtárik [2016] Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.
 Konečný et al. [2016a] Jakub Konečný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.
 Konečný et al. [2016b] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016b.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 Lin et al. [2017] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.
 Lyubarskii and Vershynin [2010] Yurii Lyubarskii and Roman Vershynin. Uncertainty principles and vector quantization. IEEE Transactions on Information Theory, 56(7):3491–3501, 2010.
 McMahan et al. [2017] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.
 Pew Research Center [2018] Pew Research Center. Mobile fact sheet, Feb 2018. URL http://www.pewinternet.org/factsheet/mobile/.
 Ravi [2018] Sujith Ravi. Custom On-Device ML Models with Learn2Compress, May 2018. URL https://ai.googleblog.com/2018/05/customondevicemlmodels.html.
 Reddi et al. [2016] Sashank J Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. Aide: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
 Smith et al. [2017] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.

 Smith et al. [2018] Virginia Smith, Simone Forte, Ma Chenxin, Martin Takáč, Michael I Jordan, and Martin Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
 Springenberg et al. [2015] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. International Conference on Learning Representations (workshop track), 2015.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Suresh et al. [2017] Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337, 2017.
 Wang et al. [2018] Hongyi Wang, Scott Sievert, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. ATOMO: Communication-efficient Learning via Atomic Sparsification. arXiv preprint arXiv:1806.04090, 2018.
Appendix A Kashin’s Representation
For reasons of space, we have relegated a more detailed discussion of Kashin’s representation (see Section 3) to the Appendix. In this section, we briefly discuss Kashin’s representation from both a theoretical (Section A.1) and a practical (Section A.2) standpoint. Finally, we present some preliminary results that argue the potential of Kashin’s representation to dominate over the random Hadamard transform with respect to the size vs. accuracy tradeoff (Section A.3).
A.1 Theoretical Overview
The idea of using the classical results of Kashin [1977] to increase the robustness of coefficients to perturbations was first introduced by Lyubarskii and Vershynin [2010]. Their result states that, given a tight frame satisfying a form of uncertainty principle, a weaker notion of the RIP [Candes et al., 2006], it is possible to convert the frame representation of every vector into the more robust Kashin’s representation, whose coefficients will have the smallest possible dynamic range.
Error rates. Since the results of Suresh et al. [2017] (who quantified the reduction in quantization error due to the Hadamard transform) rely on exactly this notion of dynamic range, and assuming the subsampled randomized Hadamard transform satisfies the uncertainty principle, Theorem 3.5 of Lyubarskii and Vershynin [2010] can be directly used as a drop-in replacement for Lemma 7 in Suresh et al. [2017]. This removes the logarithmic dependence on dimension from Theorem 3 therein, matching the lower bounds. We do not provide the complete proof as, beyond drawing this connection, it does not imply any novelty whatsoever. However, an open question remains, as we are not aware of a result showing what the parameters of the uncertainty principle guaranteed by the subsampled randomized Hadamard transform are. Such parameters do exist, however, as the transform is known to satisfy the RIP [Foucart and Rauhut, 2013], which is a stronger notion.
A.2 Practical Considerations
In practice, given a tight frame, the algorithm for computing Kashin’s representation is straightforward. It runs for iterations, and takes parameters as input. In a single iteration, one first computes the frame coefficients, projects them onto a ball, and reconstructs the error in the original domain. Another iteration proceeds starting with the reconstructed error and a smaller ball. We refer the reader to Lyubarskii and Vershynin [2010] for more details regarding and their relationship with the uncertainty principle.
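The iteration described above can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in our experiments: it assumes the ball is an ℓ∞ ball (so the projection is coordinate-wise clipping), it assumes a Parseval tight frame given explicitly as a list of vectors, and the names `kashin_coefficients`, `initial_level` and `shrink` are hypothetical. The final iteration skips the projection, so the representation is exact. A three-vector frame in the plane (the so-called Mercedes-Benz frame) stands in for the Hadamard-based frame.

```python
import math


def analyze(frame, x):
    """Frame coefficients <u_i, x> for each frame vector u_i."""
    return [sum(u_j * x_j for u_j, x_j in zip(u, x)) for u in frame]


def synthesize(frame, coeffs):
    """Reconstruct sum_i coeffs[i] * u_i in the original domain."""
    dim = len(frame[0])
    return [sum(c * u[j] for c, u in zip(coeffs, frame)) for j in range(dim)]


def kashin_coefficients(frame, x, num_iterations, initial_level, shrink):
    """Sketch of the iteration; parameter names are illustrative assumptions."""
    total = [0.0] * len(frame)
    residual = list(x)
    level = initial_level
    for iteration in range(num_iterations):
        coeffs = analyze(frame, residual)
        if iteration < num_iterations - 1:
            # Project onto the l-infinity ball of radius `level` (clipping).
            coeffs = [max(-level, min(level, c)) for c in coeffs]
        # On the last iteration the projection is skipped, which makes the
        # accumulated coefficients an exact representation of x.
        total = [t + c for t, c in zip(total, coeffs)]
        approximation = synthesize(frame, coeffs)
        residual = [r - a for r, a in zip(residual, approximation)]
        level *= shrink  # the next iteration uses a smaller ball
    return total


# Parseval tight frame of three vectors in the plane (illustrative only).
_SCALE = math.sqrt(2.0 / 3.0)
MERCEDES_BENZ = [
    [_SCALE * math.cos(math.radians(a)), _SCALE * math.sin(math.radians(a))]
    for a in (90.0, 210.0, 330.0)
]

x = [0.0, 1.0]
plain = analyze(MERCEDES_BENZ, x)
kashin = kashin_coefficients(MERCEDES_BENZ, x, num_iterations=2,
                             initial_level=0.6, shrink=0.5)
print(max(abs(c) for c in plain), max(abs(c) for c in kashin))
```

In this toy case, both coefficient vectors reconstruct the same input, but the second has a smaller largest-magnitude entry, which is the property that makes the subsequent quantization less lossy.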
In our work, we use the randomized Hadamard transform as the initial tight frame (see Section A.1 for details on why this is possible). We also run the algorithm for just iterations (as very often this provides most of the benefit), fix , and use a variant of the algorithm which yields an exact representation (omitting the projection in the last iteration). Given this, the choice of is irrelevant. The dominant part of the computation is then three applications of the fast Walsh-Hadamard transform, as opposed to a single one in Konečný et al. [2016b].
As a particular example, say we are to compress an n-dimensional vector. We first pad the vector with zeros, so that its dimension is the closest larger power of 2. Then, we multiply the vector by a diagonal matrix of independent Rademacher random variables (each ±1 with equal probability), followed by the application of the fast Walsh-Hadamard transform. The first n columns of the resulting transform matrix correspond to the tight frame used to find the Kashin’s representation. Nonetheless, we avoid representing this explicitly. Finally, note that, if the initial dimension was already a power of 2, we need to pad zeros to the next power of 2 in order to realize any benefit over just using the Hadamard transform.
A.3 Dominance over Hadamard
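The padding, sign-flip and transform steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our actual implementation: the function names are hypothetical, the transform is normalized by the square root of the padded dimension so that it is orthonormal (and therefore its own inverse), and the `strictly_larger` flag models the extra padding needed when the input dimension is already a power of 2.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of 2."""
    out = list(values)
    size = len(out)
    if size == 0 or size & (size - 1):
        raise ValueError("length must be a power of 2")
    half = 1
    while half < size:
        for start in range(0, size, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(size)
    return [v * scale for v in out]


def next_power_of_two(n, strictly_larger=False):
    """Smallest power of 2 that is >= n (or > n if strictly_larger is set)."""
    size = 1
    while size < n or (strictly_larger and size == n):
        size *= 2
    return size


def randomized_hadamard(vector, signs):
    """Zero-pad to len(signs), flip signs, then apply the transform."""
    padded = list(vector) + [0.0] * (len(signs) - len(vector))
    return fwht([s * v for s, v in zip(signs, padded)])


def inverse_randomized_hadamard(coeffs, signs, original_length):
    """Undo the transform, undo the sign flips, and drop the padding."""
    rotated = fwht(coeffs)
    return [s * v for s, v in zip(signs, rotated)][:original_length]


rng = random.Random(0)  # fixed seed so both ends can regenerate the signs
x = [1.0, 2.0, 3.0, 4.0, 5.0]
size = next_power_of_two(len(x))
signs = [rng.choice((-1.0, 1.0)) for _ in range(size)]
y = randomized_hadamard(x, signs)
x_back = inverse_randomized_hadamard(y, signs, len(x))
```

Only the seed for the sign pattern needs to be shared between sender and receiver; the transform matrix itself is never materialized.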
Given the theoretical properties of Kashin’s representation, we hypothesize it should dominate the random Hadamard transform when it comes to the size vs. accuracy tradeoff. A preliminary experiment to corroborate this hypothesis is the following:


We train an MNIST model until we get an accuracy of around .

We compress the original model using some linear transform, some subsampling ratio and some number of quantization bits.

We decompress the model and evaluate both its new accuracy and its distance to the original model.

We repeat the previous two steps for different linear transforms (identity, random Hadamard transform and Kashin’s representation), subsampling ratios (, and ) and quantization bits (, , , , ).
An important detail is that, whenever we use Kashin’s representation, we do a grid search over the best values for (from to ) and . However, is kept fixed as .
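The compress, decompress and compare steps of this experiment can be sketched as below for the identity transform. This is a simplified illustration under several assumptions that are not taken from the experiment itself: the model is a flat list of floats, quantization is uniform and stochastic between the minimum and maximum kept value, dropped entries are restored as zeros without rescaling, and all function names are hypothetical.

```python
import math
import random


def quantize(values, bits, rng):
    """Stochastically round each value to one of 2**bits uniform levels."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = []
    for v in values:
        position = (v - lo) / step
        lower = math.floor(position)
        # Round up with probability equal to the fractional part.
        code = lower + (1 if rng.random() < position - lower else 0)
        codes.append(min(code, levels))
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to floating-point values."""
    return [lo + c * step for c in codes]


def roundtrip(values, ratio, bits, seed=0):
    """Subsample, quantize, then decompress back to the full dimension."""
    rng = random.Random(seed)
    count = max(1, round(ratio * len(values)))
    kept = sorted(rng.sample(range(len(values)), count))
    codes, lo, step = quantize([values[i] for i in kept], bits, rng)
    restored = [0.0] * len(values)  # dropped entries stay at zero
    for i, v in zip(kept, dequantize(codes, lo, step)):
        restored[i] = v
    return restored


def l2_distance(a, b):
    """Euclidean distance between the original and decompressed vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


weights = [random.Random(1).gauss(0.0, 1.0) for _ in range(64)]
for bits in (1, 4, 8):
    print(bits, l2_distance(weights, roundtrip(weights, ratio=1.0, bits=bits)))
```

Swapping in a different linear transform amounts to applying it before `roundtrip` and inverting it afterwards; the distance is then measured in the original domain.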
The results of this experiment are shown in Figure 6. In the legend, R denotes the rotation (I for identity, HD for randomized Hadamard, Kashin for Kashin based on the randomized Hadamard), and SR denotes the subsampling ratio, that is, the fraction of elements kept nonzero. In the top row, the figure shows the accuracy of the compressed model vs. the number of bits used for quantization, and vs. the model’s size (in MB). In the bottom row, the error incurred is plotted against the same two quantities. Kashin’s representation clearly dominates the other two representations when it comes to the size vs. accuracy tradeoff, making up the Pareto frontier for all combinations of subsampling ratio and quantization bits. Nevertheless, we did optimize over the parameters associated with Kashin’s algorithm, something that does not need to be done for the random Hadamard transform. In Section A.2, we propose a set of values that worked well enough for our experiments, but further exploration on how to easily determine these values is in order.
Appendix B MNIST Experimental Results
For reasons of space, we have relegated the experimental results using MNIST (see Section 4) to the Appendix.
Figure 7 shows the results of using our lossy compression on MNIST under the experimental setup presented in Section 4.2. Meanwhile, Figure 8 shows the results of using Federated Dropout (see Section 4.3 for details). Finally, Figure 9 shows the results of performing both lossy compression for downloads and uploads, as well as Federated Dropout, as described in Section 4.4.