1 Introduction
The key paradigm of federated learning is that data is stored locally on edge devices, while model updates (either gradients or weights) are communicated over a network and aggregated by a central server. This setup enables edge devices to jointly learn a model without sharing data, thereby retaining data privacy. However, communication bandwidth often stands in the way of large-scale deployment of federated learning systems: it can be very costly to send model updates over a network, especially when communicating with mobile phones and edge devices.
To reduce the bandwidth requirements of federated learning, it is natural to compress model updates before sending them over the network. Previous works in this direction ajiheafield2017sparse; Sattler2020RobustAC; lin2018deep; DBLP:conf/icml/RothchildPUISB020 have explored compression schemes including top-$k$ sparsification (i.e. keeping the $k$ weights with the largest magnitudes) and gradient sketching.
At the same time, in the machine learning theory community, researchers have been working to understand what at first seems like an entirely different question: why do hugely overparametrized models generalize so well? One promising approach to answering this question utilizes the concept of intrinsic dimension, defined for a given optimization problem as the smallest dimension $d$ for which the problem can be solved with the weights restricted to a $d$-dimensional manifold. To be precise, it is the smallest $d$ for which the optimization problem
$$\min_{\theta \in \mathcal{M}_d} \mathcal{L}(\theta) \qquad (1)$$
has a satisfactory solution, where $\mathcal{M}_d$ is a $d$-dimensional manifold in the parameter space $\mathbb{R}^D$. If the intrinsic dimension of an optimization problem is low, then even if a model is vastly overparameterized, only a small number of parameters need to be tuned to obtain a good solution, which is often enough to imply certain generalization guarantees.
We begin this paper by observing that the two problems above are naturally related. If one can find a solution to a problem by tuning only $d$ parameters, as in Equation 1, then a corresponding low-bandwidth algorithm can be obtained by simply running gradient descent on the $d$-dimensional manifold. This is because gradients on the manifold are $d$-dimensional, and hence require less bandwidth to communicate.
However, for very small $d$ (as is desired), it is often insufficient to simply optimize a size-$d$ subset of a model's parameters, especially if this subset must be chosen manually for each neural network architecture. Thus, we are inspired to seek a more general family of these types of low-bandwidth algorithms.
We rewrite the optimization problem in Equation 1 in the original parameter space as
$$\min_{\theta^{(d)} \in \mathbb{R}^d} \mathcal{L}(\theta_0 + A \theta^{(d)}),$$
where $\theta_0 \in \mathbb{R}^D$ is the initial parameter vector and $A \in \mathbb{R}^{D \times d}$ is a fixed matrix, so that stochastic gradient descent in the original space can be written as
$$\theta_t = \theta_{t-1} - \eta\, A A^\top \hat{\nabla}\mathcal{L}(\theta_{t-1}). \qquad (2)$$
We call this method static intrinsic gradient compression, because our gradients are projected into a static ("intrinsic") subspace. Now, Equation 2 admits a natural generalization, which allows us to explore more of the parameter space while still preserving a low level of upload bandwidth usage:
$$\theta_t = \theta_{t-1} - \eta\, A_t A_t^\top \hat{\nabla}\mathcal{L}(\theta_{t-1}), \qquad (3)$$
where $A_t$ may vary with time. We call the set of all such algorithms intrinsic gradient compression algorithms, and consider three particular instantiations for federated learning: static, subspace, and time-varying intrinsic gradient compression.
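To make updates (2) and (3) concrete, the following toy sketch (ours, for illustration only: a hypothetical quadratic loss and dense orthonormal matrices in place of the structured transforms used later in the paper) compares a fixed projection $A$ against a fresh $A_t$ at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, steps = 200, 10, 300        # full dim, intrinsic dim, number of steps
target = rng.normal(size=D)       # toy loss: L(theta) = 0.5 * ||theta - target||^2
grad = lambda theta: theta - target

def run(fresh_matrix_each_step):
    theta = np.zeros(D)
    A = np.linalg.qr(rng.normal(size=(D, d)))[0]       # orthonormal columns
    for _ in range(steps):
        if fresh_matrix_each_step:                     # Eq. (3): A_t varies with t
            A = np.linalg.qr(rng.normal(size=(D, d)))[0]
        theta = theta - A @ (A.T @ grad(theta))        # Eq. (2)-style update
    return 0.5 * np.sum((theta - target) ** 2)

static_loss = run(False)          # stuck in the fixed affine subspace
varying_loss = run(True)          # fresh subspaces keep making progress
print(static_loss, varying_loss)
```

With a fixed $A$ the iterates never leave the affine subspace $\theta_0 + \mathrm{span}(A)$, so the loss plateaus at the distance from the target to that subspace; letting $A_t$ vary drives the loss far lower at the same per-step upload cost.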
The static algorithm is an extremely simple baseline; it simply projects the local model update to a lower-dimensional space before sending it to the server to be aggregated. Nonetheless, we find that it performs remarkably well in practice compared to recent gradient compression schemes. The subspace and time-varying algorithms are designed specifically for federated learning: the subspace method reduces the upload bandwidth requirements of the static algorithm, while the time-varying method improves performance across multiple epochs of distributed training.
Our approach is model-agnostic and highly scalable. In experiments across multiple federated learning benchmarks (language modeling, text classification, and image classification), we vastly outperform prior gradient compression methods, and show strong performance even at very high compression rates.
Our contributions are as follows.

We identify a general class of optimization algorithms based on the notion of intrinsic dimension that use low amounts of upload bandwidth, which we call intrinsic gradient compression algorithms.

We specify three such algorithms: static, time-varying, and subspace compression, with different levels of upload and download bandwidth for use in various federated settings.

We provide theoretical guarantees on the performance of our algorithms.

Through extensive experiments, we show that these methods outperform prior gradient compression methods for federated learning, obtaining large reductions in bandwidth at the same level of performance.
2 Preliminaries
2.1 Intrinsic Dimension
The concept of intrinsic dimension was introduced in the work of li2018measuring as a way of evaluating the true difficulty of an optimization problem. While difficulty is usually gauged by counting the number of parameters, some optimization problems are easier than others in that good solutions are far more plentiful. To illustrate this concept, consider an optimization problem over a large space $\mathbb{R}^D$ and a small space $\mathbb{R}^d$, together with a function $f: \mathbb{R}^d \to \mathbb{R}^D$, so that for any $\theta^{(d)} \in \mathbb{R}^d$ we have $f(\theta^{(d)}) \in \mathbb{R}^D$. If a good solution lies in the image of $f$ on $\mathbb{R}^d$, one can write
$$\min_{\theta^{(d)} \in \mathbb{R}^d} \mathcal{L}\big(f(\theta^{(d)})\big) \qquad (4)$$
and thus transform the original problem over $\mathbb{R}^D$ into an optimization problem over $\mathbb{R}^d$. If we can still find good solutions to the original problem with $d \ll D$, then the problem may be easier than originally expected. Intuitively, even though the "true" dimension of the optimization problem is $D$, the fact that good solutions can be found while searching over a manifold of dimension $d$ suggests that the problem is easier than a typical $D$-dimensional optimization problem.
With this, we can now define the notion of intrinsic dimension. The intrinsic dimension with respect to a task and a performance threshold is the smallest integer $d$ for which optimizing Equation 4 on the task can yield a solution meeting the threshold. The intrinsic dimension is not completely knowable, because we cannot find the "best performing model" exactly. However, if, say, training with some optimization algorithm gives us a solution to Equation 4 meeting the threshold using $d'$ dimensions, we can say with certainty that the intrinsic dimension is at most $d'$.
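As a toy illustration of this definition (our example, not from the paper's experiments), consider a problem in $D = 10{,}000$ ambient dimensions whose loss depends on only $k = 10$ coordinates; random subspace search then succeeds as soon as $d$ reaches $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 10_000, 10                  # ambient dimension; loss constrains only k coords

def loss(theta):
    return 0.5 * np.sum((theta[:k] - 1.0) ** 2)

results = {}
for d in (5, 10):                  # optimize over theta = A @ z with z in R^d
    A = rng.normal(size=(D, d))
    # minimizing loss(A @ z) is a least-squares problem on the first k rows of A
    z, *_ = np.linalg.lstsq(A[:k], np.ones(k), rcond=None)
    results[d] = loss(A @ z)

print(results)   # d=5 cannot reach the optimum; d=10 (generically) can
```

Here the intrinsic dimension is at most 10, despite the $10{,}000$ ambient parameters.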
Throughout this paper we will always take $f(\theta^{(d)}) = \theta_0 + A\theta^{(d)}$ for a matrix $A \in \mathbb{R}^{D \times d}$, where $\theta_0$ is the original value of the parameter vector (e.g. a pretrained checkpoint). Consequently, the image of $f$ on $\mathbb{R}^d$ (and thus the manifold over which we optimize) is an affine $d$-dimensional subspace of $\mathbb{R}^D$. The affine nature is crucial: it allows us to perform a full fine-tune starting from a pretrained checkpoint, which is not possible if we just use a standard (linear) subspace.
2.2 Related Work
Below, we describe how our contribution relates to relevant prior work. Due to space constraints, we describe additional related work in Appendix C.
Intrinsic Dimension
As discussed in the previous section, li2018measuring introduced the concept of intrinsic dimensionality to gain insight into the difficulty of optimization problems. (The concept of intrinsic dimension has also been used to describe the dimensionality of datasets; these works are not directly related to ours, but we provide an overview of them in Appendix C.) aghajanyan2020intrinsic followed up on this work by considering the setting of fine-tuning models in natural language processing. They show that the intrinsic dimension of some of these tasks is surprisingly low, and argue that this result explains the widespread success of language model fine-tuning.
These works form the basis of our static intrinsic gradient compression algorithm. Whereas they use the concept of intrinsic dimension as a mechanism for understanding optimization landscapes, we use it as a tool for gradient compression. We then extend these works by introducing two new algorithms designed for the federated setting: subspace and time-varying intrinsic gradient compression. These algorithms were not explored by previous works because they are uniquely interesting from the perspective of federated learning: they are designed to reduce communication bandwidth rather than to shed light on objective landscapes.
Gradient Compression
With the proliferation of large-scale machine learning models over the past decade, the topic of distributed model training has gained widespread attention. Federated learning combines the challenges of distributed training and limited network bandwidth, motivating the use of gradient compression. For example, a single gradient update for a 100 million parameter model takes approximately 0.4 gigabytes of bandwidth (uncompressed).
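The 0.4 gigabyte figure is simply one float32 per parameter:

```python
params = 100_000_000       # a 100M-parameter model
bytes_per_param = 4        # one float32 per gradient entry
gigabytes = params * bytes_per_param / 1e9
print(gigabytes)           # 0.4 GB per uncompressed gradient update
```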
Gradient compression methods may be divided into two groups: biased and unbiased methods. Unbiased gradient compression estimators tend to be more straightforward to analyze, and are generally better understood for stochastic gradient descent: as long as their variance is bounded, it is usually possible to obtain reasonable bounds on their performance. Biased gradient compression estimators are typically much more challenging to analyze, although they often deliver good empirical performance. For example, top-$k$ compression is a popular (biased) method which keeps the $k$ elements of the gradient with the largest magnitudes. Numerous papers are dedicated to debiasing such methods to make them more amenable to theoretical analysis. In particular, many of these use the idea of error feedback stich2020error; ef21 to obtain theoretical guarantees for otherwise biased algorithms, like top-$k$ lin2018deep and FetchSGD DBLP:conf/icml/RothchildPUISB020. Other, more exotic ideas also exist, like albasyoni2020optimal, which finds an optimal gradient compression algorithm, albeit one which is computationally infeasible.

Federated and Distributed Learning
From the introduction of federated learning mcmahan2017communication, it was clear that communication costs represented a significant challenge to its widespread adoption. mcmahan2017communication introduced the FedAvg algorithm, which aims to reduce communication costs by performing multiple local updates before communicating model updates. However, even with local update methods such as FedAvg, communicating model updates often remains too costly. (Additionally, the benefits of these methods are vastly diminished when clients have a small amount of local data, as many rounds of communication are necessary.) As a result, the area of gradient compression has attracted recent attention within the federated learning community.
Top-$k$ compression is among the simplest and most intuitive compression schemes. ajiheafield2017sparse showed that top-$k$ compression produced good results on neural machine translation and MNIST image classification tasks. shi2019understanding provided a theoretical analysis and an approximate top-$k$ selection algorithm to improve sampling efficiency. Sattler2020RobustAC combined top-$k$ compression with ternary quantization and a Golomb encoding of the weight updates. konecny2018federated study multiple strategies for improving communication efficiency, including low-rank updates, randomly masked updates, and sketched updates. Their low-rank update strategy is related to our method, but we differ from them in that we compute our low-dimensional updates differently, perform large-scale experiments, give theoretical analysis, and consider the trade-off between download and upload bandwidth (they consider only upload bandwidth). Also related, vkj2019powerSGD proposed a low-rank version of SGD based on power iteration for data-parallel distributed optimization. Most recently, FetchSGD DBLP:conf/icml/RothchildPUISB020 used sketching to reduce the size of gradients before sending them over the network; FetchSGD is the current state of the art in gradient compression. Finally, it is important to note that local update methods (e.g. FedAvg) and gradient compression methods may be combined: one can simply perform multiple training steps before compressing the resulting model update. For fair comparison to FetchSGD, in our experiments we perform only one local step per update.
3 Methods
3.1 Intrinsic Gradient Compression
In this subsection, we characterize a family of low-bandwidth optimization algorithms based on the notion of intrinsic dimension. In the following subsection, we describe three algorithms from this family in detail, which we implement.
We start from the optimization problem induced by intrinsic dimension (Equation 4). If we directly run gradient descent on Equation 4 with respect to the intrinsic weights $\theta^{(d)}$, we obtain an update of the following form:
$$\theta^{(d)}_t = \theta^{(d)}_{t-1} - \eta\, A^\top \hat{\nabla}\mathcal{L}(\theta_{t-1}).$$
Then, left-multiplying both sides by $A$ and adding $\theta_0$, we obtain
$$\theta_t = \theta_{t-1} - \eta\, A A^\top \hat{\nabla}\mathcal{L}(\theta_{t-1}). \qquad (5)$$
Note that here, we can interpret $A^\top \hat{\nabla}\mathcal{L}(\theta_{t-1})$ as a compressed gradient of dimension $d$, and $A A^\top \hat{\nabla}\mathcal{L}(\theta_{t-1})$ as the approximate (decompressed) gradient. This inspires us to consider the more general family of optimization algorithms given by
$$\theta_t = \theta_{t-1} - \eta\, A_t g_t, \qquad (6)$$
where $g_t$ is a $d$-dimensional vector computed from data available at timestep $t$ that plays a role similar to a gradient, but may not be an exact gradient, and the $A_t$ are matrices known ahead of time (say, generated with random seeds). One intuitive way of interpreting this algorithm is that $\theta_t - \theta_0$ is constrained to lie in a low-dimensional subspace, namely the span of the columns of $A_1, \dots, A_t$. This family of algorithms can be made to use only $O(d)$ upload bandwidth per step, as only the vector $g_t$ must be uploaded. Furthermore, note that Equation 6 makes no reference to the intrinsic weights $\theta^{(d)}$, meaning that it represents a general optimization algorithm in the original space. Formally: all optimization algorithms of the form
$$\theta_t = \theta_{t-1} - \eta\, A_t g_t$$
can be simulated with $O(d)$ upload bandwidth per step in a standard federated learning setting, where $g_t$ can be calculated by the client at time $t$ combined with all data from the server, and each $A_t$ is a matrix known to both the client and the server.
We call all algorithms of the form above intrinsic gradient compression algorithms.
Table 1: Theoretical trade-offs of intrinsic gradient compression methods per client ($D$: full model dimension; $d$: intrinsic dimension; $K$: number of subspaces; $E$: number of epochs).

Intrinsic Gradient Compression Method | Upload | Download | Dimensions Explored
No Compression | $O(D)$ | $O(D)$ | $D$
Static | $O(d)$ | $O(d)$ | $d$
Time-Varying | $O(d)$ | $O(d)$ | $Ed$
Subspace | $O(d)$ | $O(Kd)$ | $Kd$
Subspace + Time-Varying | $O(d)$ | $O(Kd)$ | $EKd$
3.2 Algorithms
While Section 3.1 shows that any algorithm of the form Equation 6 can be implemented with low levels of upload bandwidth, not every algorithm of the form Equation 6 can be implemented with low levels of download bandwidth as well. In this section, we describe three particular intrinsic gradient compression algorithms which use low amounts of both upload and download bandwidth. We show the theoretical tradeoffs between each of these algorithms in Table 1.
These federated learning algorithms can be decomposed into three main phases.

Reconciliation: The client reconciles its model with the server’s copy of the model.

Compression: The local model calculates, compresses, and sends its local gradient to the server.

Decompression: The server updates its own copy of the model using the estimated gradients it has received.
Compression and decompression are shared between all algorithms, while each algorithm has a distinct reconciliation phase.
Static Intrinsic Gradient Compression
The static intrinsic gradient compression algorithm simply involves projecting gradients into a fixed ("static") low-dimensional space and reconstructing them on the server:
$$\theta_t = \theta_{t-1} - \eta\, A A^\top g_t.$$
Nonetheless, it performs remarkably well in practice (see Section 4). The full algorithm is given in Algorithm 1.
Note that in the reconciliation phase, the parameters on the server will always be equal to $\theta_0 + Az$ for some $z \in \mathbb{R}^d$. Thus, the server can just send $z$ to the client, using $O(d)$ download bandwidth. In the compression phase, the client compresses the gradient by multiplying by $A^\top$, and for decompression the server multiplies the result by $A$.
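The full round trip of the static algorithm can be sketched as follows (our simplification: a single client, a toy quadratic loss, and a small dense $A$ in place of the structured transform used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, lr = 500, 8, 0.5
theta0 = rng.normal(size=D)                      # shared pretrained weights
A = np.linalg.qr(rng.normal(size=(D, d)))[0]     # shared compression matrix

z_server = np.zeros(d)                           # server tracks only d numbers

target = rng.normal(size=D)
grad = lambda theta: theta - target              # toy quadratic loss gradient

losses = []
for _ in range(100):
    # Reconciliation: download the d-dim intrinsic weights, rebuild the model
    theta = theta0 + A @ z_server
    losses.append(0.5 * np.sum((theta - target) ** 2))
    # Compression: upload only the d-dim projected gradient
    g_low = A.T @ grad(theta)
    # Decompression/update: the server works entirely in the d-dim space
    z_server = z_server - lr * g_low

print(losses[0], losses[-1])
```

Each round costs $d$ floats down and $d$ floats up, instead of $D$ in each direction.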
Subspace Static Intrinsic Gradient Compression
The subspace algorithm is motivated by the fact that in some cases, upload bandwidth is more heavily constrained than download bandwidth. Rather than using a single compression matrix $A$, we use a set of $K$ different compression matrices $A_1, \dots, A_K$, each corresponding to a different subspace. At each iteration, each client is randomly assigned one of these matrices. Each client then explores a subspace of dimension $d$ and uploads a vector of size $d$ to the server. Finally, the server aggregates these local updates into a global update of size $Kd$, which is downloaded by each client. In this way, it is possible to explore a subspace of size $Kd$ using only $O(d)$ upload bandwidth. With $K = 1$, this algorithm is equivalent to static gradient compression. The full algorithm is given in Algorithm 2.
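One round of the subspace method can be sketched as follows (a toy setting of our own: every client shares the same quadratic loss, and the matrices are small and dense):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K, n_clients, lr = 300, 5, 4, 20, 0.5

# K compression matrices; in practice each is regenerated from a shared seed
mats = [np.linalg.qr(rng.normal(size=(D, d)))[0] for _ in range(K)]

target = rng.normal(size=D)
theta = np.zeros(D)

for _ in range(100):
    uploads = []
    for _client in range(n_clients):
        i = rng.integers(K)                   # random matrix assignment
        g_low = mats[i].T @ (theta - target)  # upload: d floats plus the index i
        uploads.append((i, g_low))
    # server aggregates into one update spanning up to K*d dimensions
    update = np.zeros(D)
    for i, g_low in uploads:
        update += mats[i] @ g_low
    theta = theta - lr * update / n_clients

final_loss = 0.5 * np.sum((theta - target) ** 2)
print(final_loss)
```

Each client uploads only $d$ floats, yet the aggregated update (and hence the download) lives in a $Kd$-dimensional space.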
TimeVarying Intrinsic Gradient Compression
Finally, the time-varying algorithm utilizes the fact that changing the subspace in which we are optimizing is nearly costless: it simply requires sending the random seed from which the (pseudo)random matrix $A_e$ may be generated. Rather than using one (or a set of) static compression matrices for all epochs (where an epoch is one round of training over all clients), we generate a new matrix $A_e$ at each epoch $e$. Formally, we have:
$$\theta_t = \theta_{t-1} - \eta\, A_{e(t)} A_{e(t)}^\top g_t,$$
where $e(t)$ denotes the epoch containing timestep $t$. In this case, our algorithm can be implemented using at most twice the download bandwidth of static compression per client per timestep. Since the bandwidth is doubled but we search $E$ times more directions in the space over $E$ epochs, this algorithm is particularly useful when we have many epochs.
Letting $\theta^{(e)}$ denote the client parameters at epoch $e$, note that we have the value of $\theta^{(e-1)}$ when performing reconciliation. Now we can write
$$\theta^{(e)} - \theta^{(e-1)} = (\theta^{(e)} - \theta_0) - (\theta^{(e-1)} - \theta_0).$$
We can see that the epoch-$e$ update lies in the span of $A_e$, while the accumulated earlier updates lie in the span of $A_1, \dots, A_{e-1}$, showing the validity of the algorithm, which is given in full in Algorithm 3.
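The seed trick underlying the time-varying algorithm is easy to state in code (a minimal sketch; the helper below is hypothetical and uses a dense matrix for clarity):

```python
import numpy as np

D, d = 256, 16

def compression_matrix(seed):
    """Deterministically regenerate the epoch's matrix from an integer seed."""
    rng = np.random.default_rng(seed)
    return np.linalg.qr(rng.normal(size=(D, d)))[0]

# Server announces only the seed for epoch e: a few bytes, not D*d floats
epoch_seed = 1234
A_client = compression_matrix(epoch_seed)
A_server = compression_matrix(epoch_seed)
print(np.array_equal(A_client, A_server))   # True: the matrix is never sent
```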
Finally, we note that it is possible to use both subspace and time-varying compression together. In this case, a new batch of $K$ compression matrices is generated at each epoch. We do not experiment with this setup, but it is likely to show further improvements over using each of these methods alone.
Choice of Compression Matrix
Here, we discuss how to choose the compression matrix $A$. Our methods are theoretically agnostic to the choice of $A$, and depend only on the existence of efficient subroutines for calculating the matrix-vector products $Az$ and $A^\top v$. Nonetheless, the choice of $A$ has significant practical considerations, which we discuss here.
The naive choice is to let $A$ be a random dense matrix, but such a choice is impossible due to memory constraints: the matrix has $D \times d$ entries. For example, if we aim to train even a small version of BERT (roughly 100M parameters), a dense $A$ would require storing on the order of $10^8 \times d$ entries, which is prohibitive even for modest intrinsic dimensions $d$.
Our approach, also taken by aghajanyan2020intrinsic; li2018measuring, utilizes the Fastfood transform DBLP:conf/icml/LeSS13. This transform expresses the matrix as $A = \mathrm{Trunc}_D\, H G \Pi H B\, \mathrm{Pad}_n$, where $n$ is the smallest power of two larger than $D$, $H$ is a standard Hadamard matrix, $B$ is a random diagonal matrix with independent Rademacher entries (random signs), $\Pi$ is a random permutation matrix, $G$ is a random diagonal matrix with independent standard normal entries, $\mathrm{Pad}_n$ is a linear operator which simply pads a $d$-dimensional vector with zeros until it has size $n$, and $\mathrm{Trunc}_D$ is a linear operator which takes the first $D$ elements of an $n$-dimensional vector. Since we can compute a matrix-vector product with $H$ via the fast Walsh-Hadamard transform, we can perform a matrix multiplication by $A$ in $O(n \log n)$ time and $O(n)$ space. Finally, to ensure that we do not need to communicate the matrices $A_e$, we generate each matrix pseudorandomly from a random seed. Thus, the matrices themselves never need to be transferred over the network.
3.3 Theoretical Guarantees
In this section, we provide guarantees on static, time-varying, and subspace intrinsic gradient compression. We focus on convex functions, which are the most amenable to analysis. First, we contend that it is not interesting to prove guarantees of the form "time-varying intrinsic gradient compression works well for all convex functions": the hypotheses are too weak to produce meaningful results, even if one assumes access to oracle convex optimization routines which return the exact minimizer (rather than just an approximate optimizer).
Two representative works, similar to ours, which consider a setup with access to an oracle that finds minimizers of convex functions are stich2013optimization and ssobound. stich2013optimization considers an optimization algorithm which searches over random $d$-dimensional subspaces, showing that, theoretically, searching a random $d$-dimensional subspace $T$ times performs about as well as searching $dT$ directions once, offering no bandwidth benefit in our context. ssobound shows a similar result without requiring random subspaces. Thus, showing interesting guarantees for arbitrary convex functions is likely quite challenging.
Rather, in the spirit of intrinsic dimension, we assume that our convex optimization problems are "easier" than standard problems, in that searching a few directions is likely to yield good solutions. In this case, we show that time-varying compression works even better than static compression. Intuitively, this is because each random subspace sampled in the time-varying algorithm contains a point which allows us to meaningfully reduce our loss. As a consequence, when we consider many subspaces sequentially, we can reduce our loss exponentially.
Thus, we state our hypotheses via a formalized definition of intrinsic dimension. A convex function $f$ has intrinsic dimension $d$ (with parameters $\gamma, \delta$) if, with probability at least $1 - \delta$ over the choice of subspace,
$$\min_{x \in x_0 + U} f(x) - f^* \le \gamma\, \big(f(x_0) - f^*\big),$$
where $U$ is a uniformly chosen $d$-dimensional subspace over the Grassmannian, and $f^*$ is the minimum of the function $f$.
The result on static compression now follows directly. We merely need to account for the fact that we are using an approximate optimization algorithm rather than an oracle optimization algorithm. However, since a convex problem restricted to a subspace is still convex, this follows directly from well-known guarantees on gradient descent.
In what follows, we assume that at each step we have access to $g_t$, an unbiased estimate of the true gradient of $f$ at time $t$ given the current iterate; such a $g_t$ naturally emerges from our methods, where the randomness comes from the data points in the batch. In all cases, we assume that $A$ is an orthonormal basis of a random subspace sampled according to the Grassmannian. All proofs are given in Appendix A. For the static compression algorithm, if the function $f$ has intrinsic dimension $d$, we obtain a convergence guarantee (stated precisely in Appendix A) after taking $T$ total steps of the static compression algorithm.
For subspace compression, we do not obtain stronger theoretical guarantees than for static compression, but we include the result for completeness. Note that the two use the same total amount of upload bandwidth, because the subspace method saves a factor of $K$ on upload. We also need a further assumption on the ratio of the variance to the squared mean of the gradient estimates: if it is too small, the extra variance induced by the subspace method causes a substantial performance drop.
For the subspace algorithm, if the function $f$ has intrinsic dimension $d$, then with probability $1 - \delta$ we obtain an analogous guarantee after taking $T$ steps, under the additional assumption that the ratio of the variance to the squared mean of the gradient estimates is bounded for all timesteps (the precise statement is given in Appendix A).
Finally, we prove a stronger guarantee for time-varying compression, taking advantage of the effectively exponentially decaying loss obtained by repeatedly applying the static compression result: for the time-varying algorithm, if the function $f$ has intrinsic dimension $d$ over $E$ epochs, the suboptimality decays geometrically in the number of epochs after taking $T$ steps per epoch (see Appendix A for the precise statement).
[Figure: Results on computer vision benchmarks. Both static and time-varying intrinsic gradient compression significantly outperform prior work, with time-varying compression performing best. Time-varying and static compression perform similarly at the beginning of training, but time-varying outperforms static at equal space when the compression rate is higher. For the FedAvg and uncompressed methods with compression rates above 1, compression was performed by training for fewer epochs.]
[Figure: Results on NLP benchmarks. Subspace compression has the added benefit of much lower upload compression (not shown). For the SST-2 results, error bars show the standard error of performance over five runs with different random seeds.]
4 Experiments
We evaluate our method across three benchmarks: two from NLP (language modeling and text classification) and one from computer vision (image classification). As in previous works DBLP:conf/icml/RothchildPUISB020; mcmahan2017communication, we simulate a federated setting in order to scale to large numbers of clients. We perform experiments in both non-IID and IID settings.
Image Classification (ResNet9 on CIFAR10)
First, we consider image classification on CIFAR-10, a dataset of 50,000 32x32px training images. We use the same experimental setup as DBLP:conf/icml/RothchildPUISB020: we split the data between 10,000 clients in a non-IID fashion, such that each client only has data from a single class. At each step, we sample 100 clients at random, such that each gradient step corresponds to 500 images. We perform 24 rounds of communication between all clients (i.e. 24 epochs).
We use a ResNet-9 architecture with 6,570,880 trainable parameters for a fair comparison to previous work. Note that the model does not have batch normalization, as it would not make sense in a setting where each client has so few examples. Due to the substantial number of epochs performed here, we experiment with both static and time-varying gradient compression (subspace compression is better suited to settings involving fewer rounds of communication). We experiment with intrinsic dimensions from 4,000 to 256,000. Our results are shown in Figure 3. Whereas FedAvg and top-$k$ struggle at even modest compression rates, the intrinsic gradient compression methods deliver strong performance at much larger compression rates. The intrinsic methods outperform the current state-of-the-art gradient compression method, FetchSGD DBLP:conf/icml/RothchildPUISB020, by a large margin, and easily scale to high compression rates. Finally, we see that time-varying intrinsic compression generally outperforms static compression for the same communication cost.
Text Classification (BERT on SST2)
Next, we consider text classification on the Stanford Sentiment Treebank v2 (SST-2) dataset sst2, a common sentiment analysis dataset. For this experiment, we consider IID data splits into 50 and 500 clients, respectively. We employ the popular BERT devlinetal2019bert architecture with 109M parameters, and we use intrinsic dimensions from 200 to 25,600. The purpose of this experiment is to push the limits of gradient compression; we project the 109M-dimensional BERT gradients into as few as 200 dimensions. Our results are given in Figure 6. First, in agreement with aghajanyan2020intrinsic, we find that it is possible to achieve remarkably high compression ratios for text classification: we obtain nearly full performance even when compressing the 109M-dimensional parameter vector into an intrinsic space of dimension 16,384. Furthermore, we find that time-varying intrinsic gradient compression consistently outperforms static intrinsic gradient compression at the same compression rate.
Language Modeling (GPT2 on PersonaChat)
Lastly, we consider language modeling on the PersonaChat zhang2018personalizing dataset. The dataset has a non-IID split into 17,568 clients, in which each client is assigned all data corresponding to a given personality; as a result, it is widely used in federated learning simulations. We perform language modeling using the GPT-2 transformer architecture (124M parameters) and conduct two rounds of training across the clients (i.e. two epochs). Due to the low number of training rounds, it is natural to apply static and subspace gradient compression. (Time-varying compression does not make sense here, as its benefit is derived from the setting where there are many rounds of communication between the clients.) Our results are shown in Figure 6. Overall, intrinsic dimension-based gradient compression vastly outperforms a wide range of prior approaches to reducing communication in federated learning. On the low-compression end of the spectrum, we obtain nearly full performance with superior compression rates to the state-of-the-art FetchSGD DBLP:conf/icml/RothchildPUISB020. On the high-compression end of the spectrum, we scale better than previous approaches. For example, we obtain a perplexity of around 20 even at an extremely high compression rate of 1898x.
Finally, we see that subspace intrinsic compression performs similarly to (or slightly worse than) static compression at the same level of overall compression. However, if it is more important to conserve upload bandwidth than download bandwidth, then subspace intrinsic gradient compression significantly outperforms static intrinsic gradient compression (see Table 2).
Gradient Reconstruction: Data Privacy Experiment
One of the primary motivations of federated learning is the desire for individual clients to be able to retain data privacy while still participating in model training. However, prior work DBLP:conf/nips/ZhuLH19 has shown that if the client sends their full local model update to the server, it is sometimes possible to approximately reconstruct their local data from the model update. We investigate the extent to which an attacker can reconstruct a client’s data given a compressed gradient update, and we find that our compression helps to mitigate this reconstruction problem. Full details are included in Appendix E due to space constraints.
5 Conclusion
We propose a family of intrinsic gradient compression algorithms for federated learning. This family includes static compression, which performs remarkably well despite its simplicity; subspace compression, which is optimized for upload bandwidth; and time-varying compression, which improves performance by changing the intrinsic subspace over time. We provide theoretical results for our algorithms and demonstrate their effectiveness through numerous large-scale experiments. We hope that our results help make the real-world deployment of large-scale federated learning systems more feasible.
References
Appendix A Proofs Omitted in the Main Text
A.1 Proof of the Static Compression Result (Section 3.3)
First, we show that the restriction of $f$ to the subspace is convex.

Lemma. The function $g(z) = f(x_0 + Az)$ is convex.

Proof.
For any $z_1, z_2$ and $\lambda \in [0, 1]$, we have
$$g(\lambda z_1 + (1-\lambda) z_2) = f\big(\lambda (x_0 + A z_1) + (1-\lambda)(x_0 + A z_2)\big) \le \lambda g(z_1) + (1-\lambda) g(z_2)$$
by the convexity of $f$, and we may conclude. ∎
We can now decompose the suboptimality of the iterate into two terms: the gap to the best point in the sampled subspace, and the gap between the subspace minimum and the global minimum. We can bound the first term with a result from [scaffold], because the restricted function is convex and thus classical convex optimization algorithms converge quickly (namely, within $T$ steps). The second term is bounded, with probability at least $1 - \delta$, by our assumption on the intrinsic dimension of the function $f$.
A.2 Proof of the Subspace Compression Result (Section 3.3)
It is not immediately clear how to fit the subspace algorithm into the existing SGD framework. First, we reparametrize the iterates in terms of stacked intrinsic weights, one block per compression matrix, so that the true gradient of the reparametrized objective is the concatenation of the projected gradients. We then define a stochastic gradient which is nonzero only in the block corresponding to the client's randomly assigned compression matrix, rescaled so that it is unbiased. Our algorithm is then equivalent to SGD with this unbiased gradient estimate: after multiplying both sides of the SGD update by the stacked compression matrix, we recover exactly the update used in subspace compression.
It remains to bound the variance of these gradient estimates, which enters the SGD bound. Under the assumed bound on the ratio of the variance to the squared mean, the variance of the block-sampled estimate is at most a constant factor times the original variance. The rest of the analysis is exactly the same as in Section A.1, and we may conclude.
A.3 Proof of the Time-Varying Compression Result (Section 3.3)
Here, we repeatedly apply the static compression result, using the fact that a fresh subspace is sampled at each epoch. Intuitively, the time-varying design implies that each new subspace choice is a fresh opportunity to move closer to the optimum, so each epoch brings us closer to the desired solution. From [scaffold], after the iterations of a single epoch, the suboptimality shrinks by a constant factor. By repeatedly applying this result across $E$ epochs, with probability at least $1 - E\delta$, the final suboptimality decays geometrically in $E$, and we may conclude.
Appendix B Subspace Intrinsic Gradient Compression
This is given in Algorithm 2.
Appendix C Additional Related Work
C.1 Intrinsic Dimensionality
As mentioned in the main paper, the concept of measuring the intrinsic dimensionality of loss landscapes was introduced by [li2018measuring], who consider optimizing a model in a random low-dimensional subspace of the full parameter space. They define the intrinsic dimension of the optimization problem as the minimum subspace dimension for which a solution to the problem can be found, where a “solution” refers to attaining a certain percentage of the maximum possible validation accuracy (i.e. the validation accuracy obtained by optimizing over the full parameter space). They use a fixed cutoff of 90% accuracy for their experiments. [aghajanyan2020intrinsic] apply these ideas in the setting of finetuning NLP models.
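The measurement procedure of [li2018measuring] can be sketched on a toy least-squares problem (our own illustrative setup, not their experiments): sweep the subspace dimension upward and report the smallest dimension at which a random-subspace fit reaches 90% of full-model performance.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100                              # full parameter count (illustrative)
X = rng.standard_normal((500, D))
y = X @ rng.standard_normal(D)       # toy regression task, exactly solvable

def subspace_r2(d):
    """Best R^2 achievable when the weights are restricted to a random
    d-dimensional subspace w = P @ theta of the full parameter space."""
    P = rng.standard_normal((D, d))
    theta, *_ = np.linalg.lstsq(X @ P, y, rcond=None)
    resid = y - X @ (P @ theta)
    return 1.0 - (resid @ resid) / (y @ y)

# "Intrinsic dimension": smallest d attaining 90% of full performance (R^2 = 1 here)
d90 = next(d for d in range(1, D + 1) if subspace_r2(d) >= 0.90)
```

On this generic task the target has no low-dimensional structure, so d90 lands near 0.9·D; the striking empirical finding of [li2018measuring] is that for overparametrized neural networks the analogous quantity is far smaller than the full parameter count.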
A number of works have tried to measure the intrinsic dimension of datasets, rather than objective landscapes. [NIPS2004_74934548] introduced a maximum likelihood approach to estimating intrinsic dimensionality based on nearest neighbors, while [CERUTI20142569] employed angle- and norm-based similarity.
Finally, some works have tried to measure the intrinsic dimensionality of image representations and datasets. [gong2019intrinsic] finds that the representations produced by popular image and face representation learning models (ResNet50 and SphereFace) have quite low intrinsic dimensionalities (16 and 19, respectively). Along similar lines, [pope2021the] showed that popular image datasets (MNIST, CIFAR-10, ImageNet) also have low intrinsic dimensionality.
C.2 Model Pruning
There has been great interest in compressing models by using fewer weights, starting with the work of [hinton2015distilling, han2015deep]. One related work is Diff Pruning [guo2020parameter], which constrains the number of weights that can be changed from a pretrained model. In essence, Diff Pruning attempts to solve an ℓ0 minimization problem on the weights of the model, and approaches this by means of a relaxation to a problem that is more amenable to standard analysis.
A number of other works have explored the idea of finetuning by only modifying a subset of a model’s parameters. [ravfogel2021bitfit] finetunes only the layer biases, whereas [houlsby2019parameter] introduces the concept of low-parameter adapters between each layer. Compared to [ravfogel2021bitfit], our method is far more flexible, allowing any number of parameters to be changed. Compared to [houlsby2019parameter], our methods are architecture-independent and can be applied to any model.
C.3 Federated Learning
Federated learning is generally concerned with the distributed training of machine learning models across many devices, each of which holds private data. Many aspects of this federated setup are separate subfields of research, including how to ensure the privacy of client-held data [Xie2020DBA, bhagoji2019analyzing], how to deal with heterogeneous data and networks [li2020federated, li2020convergence, yu2020federated], how to reconcile weights/gradients from multiple clients [li2020federated, wang2020federated, pmlrv119li20g], how to manage clients in a fault-tolerant manner, how to deploy on mobile/IoT devices [chaoyanghe2020fedml], and how to ensure fairness [mohri2019agnostic].
The classic FedAvg [mcmahan2017communication] algorithm communicates model updates after multiple local training iterations. FedProx [li2020federated] generalized and reparametrized FedAvg, and FedMA [wang2020federated] improved this approach by matching and averaging hidden layers of networks with similar activations at each communication round. Additionally, FedAwS [yu2020federated] considered federated averaging in the case where each client has data from only a single class.
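The FedAvg pattern described above — multiple local training iterations followed by server-side weight averaging — can be sketched in a few lines. This is a toy numpy-only simulation of our own; the client objectives, sizes, and variable names are illustrative assumptions, not any paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_clients, local_iters = 10, 4, 5  # illustrative sizes

# Each client k holds a private quadratic objective 0.5 * ||w - c_k||^2,
# whose local gradient at w is simply (w - c_k).
client_targets = [rng.standard_normal(D) for _ in range(n_clients)]

w_global = np.zeros(D)
for _ in range(50):                       # communication rounds
    local_weights = []
    for c in client_targets:
        w = w_global.copy()
        for _ in range(local_iters):      # several local training iterations
            w -= 0.1 * (w - c)            # local gradient step on private data
        local_weights.append(w)
    w_global = np.mean(local_weights, axis=0)  # server averages the weights
```

For these toy objectives, the rounds contract the global weights toward the average of the client optima; FedProx and FedMA can be read as refinements of exactly this aggregation step.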
Appendix D Further Experimental Details and Analysis
In the main paper, we included a number of figures demonstrating our performance in comparison to prior work. Here, we include tables with our precise results for clarity and in order to facilitate future comparison with our work.
D.1 General Implementation Details
We perform our language modeling experiments on 8 RTX 6000 GPUs and our image/text classification experiments on 1 RTX 6000 GPU. For the intrinsic gradient compression matrices, we employ the FastFood method described in Section 3.2, using a CUDA implementation of the fast Walsh-Hadamard transform from [thomas2018learning].
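For reference, here is a minimal pure-numpy sketch of the structured transform underlying FastFood; it is illustrative only — the actual implementation stacks such blocks to reach the full model dimension and runs the Walsh-Hadamard transform on GPU via [thomas2018learning], and the exact scalings here are our own assumptions.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 256                              # block size, must be a power of two
B = rng.choice([-1.0, 1.0], size=n)  # random sign flips
Pi = rng.permutation(n)              # random permutation
G = rng.standard_normal(n)           # random Gaussian scaling

def fastfood_block(x):
    """One FastFood-style block, roughly (1/n) * H G Pi H B x. It mimics a
    dense Gaussian random projection in O(n log n) time and O(n) memory."""
    v = fwht(B * x)   # randomize signs, then mix with a Hadamard transform
    v = G * v[Pi]     # permute and rescale with Gaussian weights
    return fwht(v) / n
```

Because the unnormalized transform satisfies H H = n I, the 1/n factor makes the block approximately norm-preserving on generic inputs; a dense n-by-n Gaussian matrix would instead need O(n²) memory, which is prohibitive when n is the number of model parameters.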
D.2 Further PersonaChat Analysis
First, we give more details on the PersonaChat dataset, which were omitted from the main paper due to space constraints. The PersonaChat dataset [zhang2018personalizing] was collected by first giving imaginary personas (defined by a set of 5 sentences) to Amazon Mechanical Turk workers and asking them to take on those personas. Then, the system paired workers and asked them to discuss. Since the personas were imaginary and no personally identifiable information was exchanged (in particular, the workers were explicitly told not to use personally identifiable information), the dataset does not contain personally identifiable information. The dataset has a non-IID split into 17,568 clients in which each client is assigned all data corresponding to a given personality; as a result, it is widely used in federated learning simulations. We perform language modeling using the GPT-2 transformer architecture (124M parameters). We perform static and subspace gradient compression using intrinsic dimensions of 16384, 65536, 262144, 1048576, and 4194304.
We show full results on PersonaChat below, complete with upload and download compression. Overall compression is calculated as the average compression over both upload and download. We compare with FedAvg [mcmahan2017communication], local Top-K, and FetchSGD [DBLP:conf/icml/RothchildPUISB020]. FedAvg is the baseline federated learning approach, which involves sending and averaging model weights. Top-K refers to sending the top gradient coordinates, sorted by magnitude. FetchSGD compresses the gradients with sketching.
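The local Top-K baseline is simple enough to state in a few lines. The following is a generic sketch of Top-K gradient sparsification (not the exact baseline implementation used in the experiments): each client uploads only the k largest-magnitude gradient coordinates as (index, value) pairs.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries of a gradient vector."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]          # roughly 2k numbers instead of len(grad)

def topk_decompress(idx, vals, dim):
    """Server side: scatter the sparse update back into a dense vector."""
    dense = np.zeros(dim)
    dense[idx] = vals
    return dense
```

Sending k index-value pairs in place of a dense gradient yields roughly dim/(2k) upload compression; note that the support changes every round and the aggregate of many clients' sparse updates is generally less sparse, so download compression behaves differently from upload compression.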
Our method significantly outperforms competing approaches across the board. We obtain an accuracy close to that of uncompressed optimization at 29.7× overall compression; FedAvg and Top-K both fail to achieve such strong results, while FetchSGD does so only at a significantly lower compression rate.
Next we compare static and K-varying intrinsic gradient compression. When comparing overall compression rates, static compression is slightly better than K-varying compression. However, K-varying compression is optimized for low upload bandwidth; it obtains much better upload compression rates than static compression at the same accuracy. For example, K-varying compression with and yields perplexity at upload compression , whereas static compression with yields perplexity at upload compression .
Name  Intrinsic Dim.  PPL  Up. Comp.  Down. Comp.  Total Comp.  
Uncompressed  13.9  1  1  1  
[mcmahan2017communication]  FedAvg (2 local iters)  16.3  2  2  2  
[mcmahan2017communication]  FedAvg (5 local iters)  20.1  5  5  5  
Local TopK ()  19.3  30.3  2490  60  
Local TopK ()  17.1  3.6  248  7.1  
[DBLP:conf/icml/RothchildPUISB020]  FetchSGD ()  14.8  3.8  100  7.3  
[DBLP:conf/icml/RothchildPUISB020]  FetchSGD ()  15.8  2.4  10  3.9  
Ours (static)  16384  27.7  7595  7595  7595  
Ours (subspace)  16384  19.6  7595  949  1688  
Ours (static)  65536  20.6  1900  1900  1900  
Ours (subspace)  65536  17.8  1900  237  422  
Ours (static)  262144  17.6  475  475  475  
Ours (subspace)  262144  16.6  475  59.3  105  
Ours (static)  1048576  15.8  119  119  119  
Ours (subspace)  1048576  15.4  119  14.8  26.3  
Ours (static)  4194304  14.8  29.7  29.7  29.7 
D.3 Further SST-2 Details and Analysis
Intrinsic Dim.  200  400  800  1,600 

Static  82.8 ()  85.3 ()  87.1 ()  87.5 () 
TimeVarying  85.9 ()  87.8 ()  87.8 ()  88.7 () 
Intrinsic Dim.  3,200  6,400  12,800  25,600 

Static  88.3 ()  89.4 ()  89.5 ()  89.5 () 
TimeVarying  89.0 ()  89.4 ()  89.4 ()  89.4 () 
Regarding the experimental setup, we perform 30 rounds (i.e. 30 epochs) of training for all compressed runs, while we perform 6 rounds for the uncompressed baseline (as it converges more quickly). Federated learning experiments have previously been criticized for being challenging to reproduce; as a result, we perform each run five times over different random seeds. Due to the substantial number of epochs performed here, it is natural to apply static and time-varying intrinsic gradient compression. We use intrinsic dimensions of 200, 400, 800, …, 25,600.
In Table 3, we show full results for the SST-2 dataset with static and time-varying gradient compression for a range of intrinsic dimensions. We include in this experiment a demonstration of the robustness of our method to variation in random seeds: we run each experiment five times using separate random seeds (i.e. different intrinsic subspaces and model initializations). We report standard errors in Table 3 and include Figure 6 with error bars in the main paper. Overall variability is quite low.
We also see that time-varying intrinsic gradient compression outperforms static intrinsic compression, especially for low intrinsic dimensions. For example, time-varying compression at outperforms static compression with , and time-varying compression with outperforms static compression with .
Appendix E Gradient Reconstruction: Data Privacy Experiment
Data privacy is one of the central motivations of federated learning.
However, a number of works have shown that if the client does not hold much data and sends back its full local gradient, it is possible to approximately reconstruct the client's local data from that gradient. This is a significant problem, because their data would then effectively be visible to the central server and to any attackers that intercept their communications.
Here, we show that compressing gradients with our approach can mitigate this problem. Specifically, we check whether our compressed gradients can be reconstructed with the iterative procedure proposed by [DBLP:conf/nips/ZhuLH19], which takes a gradient and a model and tries to recover an image. As in [DBLP:conf/nips/ZhuLH19], we use a ResNet152 model on a randomly selected image from ImageNet and run for 24,000 iterations (by which time the method has converged). We reconstruct the image both from the full gradient (the center image) and from the intrinsically-compressed gradient (the right image) with intrinsic dimension 65,536.
As seen in Figure 10, given the full gradient it is possible to obtain a fairly good reconstruction of the image. By contrast, with our method, the reconstruction is visually much less similar to the original image. Of course, our method does not solve the problem entirely; an outline of the dog in the image is still visible because the compressed gradient still contains some information about the local data. To solve the issue entirely, it would be necessary to use a method such as differential privacy.
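To make the attack concrete, here is a minimal numpy sketch of gradient matching in the spirit of [DBLP:conf/nips/ZhuLH19], on a single linear unit with squared loss — a toy setup of our own, whereas the actual experiment uses a ResNet152 and also recovers the labels. The attacker adjusts a dummy input by gradient descent so that the gradient it induces matches the gradient observed from the client.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
w = rng.standard_normal(D)       # model weights, known to the attacker
x_true = rng.standard_normal(D)  # client's private input
y = 1.0                          # label, assumed known for simplicity

# Gradient of 0.5 * (w @ x - y)^2 with respect to w, as sent by the client
g_true = (w @ x_true - y) * x_true

def match_loss(x):
    """Squared distance between the dummy gradient and the observed one."""
    return float(np.sum(((w @ x - y) * x - g_true) ** 2))

x = rng.standard_normal(D)       # attacker's randomly initialized dummy input
x_init = x.copy()
lr = 1e-4
for _ in range(20000):
    r = w @ x - y
    h = r * x - g_true                     # gradient mismatch
    x -= lr * 2.0 * ((x @ h) * w + r * h)  # analytic gradient of match_loss
```

For this one-layer model the gradient is proportional to the input, so recovery is especially easy; for deep networks the same matching objective is minimized with more sophisticated optimizers. Compressing the gradient, as in our method, removes much of the information this matching objective relies on.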