
Intrinsic Gradient Compression for Federated Learning

12/05/2021
by Luke Melas-Kyriazi, et al.
University of Oxford
Harvard University

Federated learning is a rapidly-growing area of research which enables a large number of clients to jointly train a machine learning model on privately-held data. One of the largest barriers to wider adoption of federated learning is the communication cost of sending model updates from and to the clients, which is accentuated by the fact that many of these devices are bandwidth-constrained. In this paper, we aim to address this issue by optimizing networks within a subspace of their full parameter space, an idea known as intrinsic dimension in the machine learning theory community. We use a correspondence between the notion of intrinsic dimension and gradient compressibility to derive a family of low-bandwidth optimization algorithms, which we call intrinsic gradient compression algorithms. Specifically, we present three algorithms in this family with different levels of upload and download bandwidth for use in various federated settings, along with theoretical guarantees on their performance. Finally, in large-scale federated learning experiments with models containing up to 100M parameters, we show that our algorithms perform extremely well compared to current state-of-the-art gradient compression methods.


1 Introduction

The key paradigm of federated learning is that data is stored locally on edge devices, while model updates (either gradients or weights) are communicated over a network and aggregated by a central server. This setup enables edge computing devices to jointly learn a model without data sharing, thereby retaining their data privacy. However, the issue of communication bandwidth often stands in the way of large-scale deployment of federated learning systems: it can be very costly to send model updates over a network, especially when communicating with mobile phones and edge devices.

To reduce bandwidth requirements for federated learning, it is natural to compress model updates before sending them over the network. Previous works in this direction ajiheafield2017sparse; Sattler2020RobustAC; lin2018deep; DBLP:conf/icml/RothchildPUISB020 have explored compression schemes including Top-k sparsification (i.e. taking the k weights with the largest magnitudes) and gradient sketching.

At the same time, in the machine learning theory community, researchers have been working to understand what at first seems like an entirely different question: why do hugely overparametrized models generalize so well? One promising approach to answering this question has utilized the concept of intrinsic dimension, defined for a given optimization problem as the smallest dimension d for which we can solve the problem when the weights are restricted to a d-dimensional manifold. To be precise, it is the smallest d for which the optimization problem

min_{θ ∈ M} L(θ)    (1)

has a satisfactory solution, where M is a d-dimensional manifold. If the intrinsic dimension of an optimization problem is low, then even if a model is vastly overparameterized, only a small number of parameters need to be tuned in order to obtain a good solution, which is often enough to imply certain generalization guarantees.

We begin this paper by observing that the two problems above are naturally related. If one can find a solution to the problem by tuning only d parameters, as in Equation 1, then a corresponding low-bandwidth algorithm can be found by simply running gradient descent on M. This occurs because gradients on M are d-dimensional, and hence require less bandwidth to communicate.

However, for very small d (as is desired), it is often insufficient to simply optimize a d-sized subset of a model's parameters, especially if this subset must be chosen manually for each neural network architecture. Thus, we are inspired to seek a more general family of these types of low-bandwidth algorithms.

We rewrite the optimization problem in Equation 1 in the original parameter space as

min_{θ_d ∈ R^d} L(θ_0 + A θ_d),

so then stochastic gradient descent in the original space can be written as

θ^(t+1) = θ^(t) − η A Aᵀ ĝ_t,    (2)

where ĝ_t is a stochastic gradient of L at θ^(t), A ∈ R^(D×d) is a fixed projection matrix, and θ_0 is the initial value of the parameters.

We call this method static intrinsic gradient compression, because our gradients are projected into a static ("intrinsic") subspace. Now, Equation 2 admits a natural generalization, which allows us to explore more of the parameter space while still preserving a low level of upload bandwidth usage:

θ^(t+1) = θ^(t) − η A_t A_tᵀ ĝ_t,    (3)

where A_t may vary with time. We call the set of all such algorithms intrinsic gradient compression algorithms, and consider three particular instantiations for federated learning: static, K-subspace, and time-varying intrinsic gradient compression.

The static algorithm is an extremely simple baseline; it simply projects the local model update to a lower-dimensional space before sending it to the server to be aggregated. Nonetheless, we find that it performs remarkably well in practice compared to recent gradient compression schemes. The K-subspace and time-varying algorithms are designed specifically for federated learning: the K-subspace method reduces the upload bandwidth requirements of the static algorithm, while the time-varying method improves performance across multiple epochs of distributed training.

Our approach is model-agnostic and highly scalable. In experiments across multiple federated learning benchmarks (language modeling, text classification, and image classification), we vastly outperform prior gradient compression methods, and show strong performance even at very high compression rates.

Our contributions are as follows.

  • We find a general class of optimization algorithms based on the notion of intrinsic dimension that use low amounts of upload bandwidth, which we denote intrinsic gradient compression algorithms.

  • We specify three such algorithms: static compression, time-varying compression, and K-subspace compression, with different levels of upload and download bandwidth for use in various federated settings.

  • We provide theoretical guarantees on the performance of our algorithms.

  • Through extensive experiments, we show that these methods outperform prior gradient compression methods for federated learning, obtaining large reductions in bandwidth at the same level of performance.

2 Preliminaries

2.1 Intrinsic Dimension

The concept of intrinsic dimension was introduced in the work of li2018measuring as a way of evaluating the true difficulty of an optimization problem. While this can usually be done by counting the number of parameters, some optimization problems are easier than others in that solutions may be far more plentiful. To illustrate this concept, consider an optimization problem min_{θ ∈ R^D} L(θ) over a large space R^D and a small space R^d, together with a function g : R^d → R^D, so that for any θ_d ∈ R^d we have g(θ_d) ∈ R^D. If a good solution is in the image of g on R^d, one can write

min_{θ ∈ R^D} L(θ) = min_{θ_d ∈ R^d} L(g(θ_d))    (4)

where g and L thus transform the original problem over R^D into an optimization problem over R^d. If we can still find good solutions to the original problem when d ≪ D, then the problem may be easier than originally expected. Intuitively, even though the "true" dimension of the optimization problem is D, the fact that good solutions can be found while searching over a manifold of dimension d suggests that the problem is easier than a typical D-dimensional optimization problem.

With this, we can now define the notion of intrinsic dimension. The intrinsic dimension with respect to a task and a performance threshold is the smallest integer d such that optimizing Equation 4 on the task can yield a solution with performance at least equal to the threshold. The intrinsic dimension is not completely knowable, because we cannot find the "best performing model" exactly. However, if, say, training with some optimization algorithm gives us a solution to Equation 4 that reaches the threshold using d dimensions, we can say with certainty that the intrinsic dimension is at most d.

Throughout this paper we will always take g(θ_d) = θ_0 + A θ_d for a matrix A ∈ R^(D×d), where θ_0 is the original value of θ. Consequently, the image of g on R^d (and thus the manifold over which we optimize) is an affine d-dimensional subspace of R^D. The affine nature is crucial – it allows us to do a full fine-tune starting from a pretrained checkpoint, which is not possible if we just use a standard (non-affine) subspace.
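To make this reparametrization concrete, the following minimal NumPy sketch optimizes a toy quadratic loss inside such an affine intrinsic subspace; the dimensions, the dense Gaussian A, and the loss itself are illustrative assumptions rather than the setup used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 20                         # full and intrinsic dimensions (illustrative)
theta0 = rng.normal(size=D)               # original parameter value ("pretrained" point)
A = rng.normal(size=(D, d)) / np.sqrt(D)  # dense stand-in for a structured projection

# A toy convex loss over the full D-dimensional space.
target = rng.normal(size=D)
loss = lambda theta: 0.5 * np.sum((theta - target) ** 2)
grad = lambda theta: theta - target

# Optimize only the d intrinsic weights; the full parameters never leave
# the affine subspace theta0 + span(A).
theta_d = np.zeros(d)
for _ in range(500):
    g_full = grad(theta0 + A @ theta_d)   # gradient in R^D
    theta_d -= 0.1 * (A.T @ g_full)       # chain rule gives the d-dimensional gradient
print(loss(theta0 + A @ theta_d) <= loss(theta0))  # the loss decreased within the subspace
```

Only the d-dimensional vector of intrinsic weights is ever updated; the D-dimensional parameters are reconstructed on demand.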

2.2 Related Work

Below, we describe how our contribution relates to relevant prior work. Due to space constraints, we describe additional related work in Appendix C.

Intrinsic Dimension

As discussed in the previous section, li2018measuring introduced the concept of intrinsic dimensionality to gain insight into the difficulty of optimization problems. (The concept of intrinsic dimension has also been used to describe the dimensionality of datasets; these works are not directly related to ours, but we provide an overview of them in Appendix C.) aghajanyan2020intrinsic followed up on this work by considering the setting of finetuning models in natural language processing. They show that the intrinsic dimension of some of these tasks is surprisingly low, and claim that this result explains the widespread success of language model finetuning.

These works form the basis of our static intrinsic gradient compression algorithm. Whereas these works use the concept of intrinsic dimension as a mechanism for understanding optimization landscapes, we use it as a tool for gradient compression. We then extend these works by introducing two new algorithms designed for the federated setting: K-subspace and time-varying intrinsic gradient compression. Our algorithms were not explored by previous works because they are uniquely interesting from the perspective of federated learning: they are designed to reduce communication bandwidth rather than to shed light on objective landscapes.

Gradient Compression

With the proliferation of large-scale machine learning models over the past decade, the topic of distributed model training has gained widespread attention. Federated learning combines the challenges of distributed training and limited network bandwidth, motivating the use of gradient compression. For example, a single gradient update for a 100 million parameter model takes approximately 0.4 gigabytes of bandwidth (uncompressed).

Gradient compression methods may be divided into two groups: biased and unbiased methods. Unbiased gradient compression estimators tend to be more straightforward to analyze, and are generally better understood for stochastic gradient descent: as long as their variance is bounded, it is usually possible to obtain reasonable bounds on their performance. Biased gradient compression estimators are typically much more challenging to analyze, although they often deliver good empirical performance. For example, top-k compression is a popular (biased) method which keeps only the k elements of the gradient with the largest magnitudes. Numerous papers are dedicated to the topic of debiasing such methods to make them more amenable to theoretical analysis. In particular, many of these use the idea of error feedback stich2020error; ef21 to obtain theoretical guarantees on otherwise biased algorithms, like Top-K lin2018deep and FetchSGD DBLP:conf/icml/RothchildPUISB020. Other, more exotic ideas also exist, like albasyoni2020optimal, which finds an optimal gradient compression algorithm, albeit one that is computationally infeasible.
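For illustration, a minimal sketch of biased top-k compression with an error-feedback buffer (in the spirit of the error-feedback literature cited above, not our method) might look like this; the function name and interface are hypothetical.

```python
import numpy as np

def topk_with_error_feedback(grad, error, k, lr):
    """Sketch of biased top-k compression with an error-feedback buffer."""
    corrected = grad + error                            # add back previously dropped mass
    idx = np.argpartition(np.abs(corrected), -k)[-k:]   # indices of the k largest magnitudes
    compressed = np.zeros_like(corrected)
    compressed[idx] = corrected[idx]                    # only these k values (plus indices) are sent
    new_error = corrected - compressed                  # remember what was dropped for next round
    return -lr * compressed, new_error                  # model update and carried-over residual
```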

Federated and Distributed Learning

From the introduction of federated learning mcmahan2017communication, it was clear that communication costs represented a significant challenge to its widespread adoption. mcmahan2017communication introduced the FedAvg algorithm, which aims to reduce communication costs by performing multiple local updates before communicating model updates. However, even with local update methods such as FedAvg, communicating model updates often remains too costly. (Additionally, the benefits of these methods are vastly diminished when clients have a small amount of local data, as many rounds of communication are necessary.) As a result, the area of gradient compression has attracted recent attention within the federated learning community.

Top-k compression is among the simplest and most intuitive compression schemes. ajiheafield2017sparse showed that top-k compression produced good results on neural machine translation and MNIST image classification tasks. shi2019understanding provided a theoretical analysis and an approximate top-k selection algorithm to improve sampling efficiency. Sattler2020RobustAC combined top-k compression with ternary quantization and a Golomb encoding of the weight updates. konecny2018federated study multiple strategies for improving communication efficiency, including low-rank updates, randomly masked updates, and sketched updates. Their low-rank update strategy is related to our method, but we differ from them in that we compute our low-dimensional updates differently, perform large-scale experiments, give theoretical analysis, and consider the trade-off between download and upload bandwidth (they consider only upload bandwidth). Also related, vkj2019powerSGD proposed a low-rank version of SGD based on power iteration for data-parallel distributed optimization. Most recently, FetchSGD DBLP:conf/icml/RothchildPUISB020 used sketching to reduce the size of gradients before sending them over the network. FetchSGD is the current state-of-the-art in gradient compression.

Finally, it is important to note that local update methods (e.g. FedAvg) and gradient compression methods may be combined. In particular, one can simply perform multiple local training steps before compressing the resulting model update. For fair comparison to FetchSGD, in our experiments we perform only one local step per update.

3 Methods

3.1 Intrinsic Gradient Compression

In this subsection, we characterize a family of low-bandwidth optimization algorithms based on the notion of intrinsic dimension. In the following subsection, we will describe in detail three algorithms from this family, which we implement and evaluate in our experiments.

We start from the optimization problem induced by intrinsic dimension (Equation 4). If we directly run gradient descent on Equation 4 with respect to the intrinsic weights θ_d, we obtain an update of the following form:

θ_d^(t+1) = θ_d^(t) − η Aᵀ ĝ_t,

where ĝ_t is a stochastic gradient of L at θ^(t) = θ_0 + A θ_d^(t). Then, left-multiplying both sides by A, we obtain

θ^(t+1) = θ^(t) − η A Aᵀ ĝ_t.    (5)

Note that here, we can interpret Aᵀ ĝ_t as a compressed gradient with dimension d, and A Aᵀ ĝ_t as the approximate (decompressed) gradient. This inspires us to consider the more general family of optimization algorithms given by

θ^(t+1) = θ^(t) − η A_t x_t,    (6)

where x_t is a d-dimensional vector computed from data available at timestep t that plays a similar role to a gradient, but may not be an exact gradient, and the A_t ∈ R^(D×d) are all matrices known ahead of time (say, generated with random seeds). One intuitive way of interpreting this algorithm is that the overall update θ^(t) − θ_0 is constrained to lie in a low-dimensional subspace, namely that given by the span of the columns of the matrices A_t. This family of algorithms can be made to use only O(d) upload bandwidth per round, as only the vector x_t must be uploaded. Furthermore, note that Equation 6 makes no reference to the intrinsic weights θ_d, meaning that it represents a general optimization algorithm in the original space. Formally, all optimization algorithms of the form

θ^(t+1) = θ^(t) − η A_t x_t

can be simulated with O(d) upload bandwidth per round in a standard federated learning setting, where x_t is a function that can be calculated by the client at time t combined with all data from the server, and A_t is a matrix known to both the client and the server.

We call all algorithms of the form above intrinsic gradient compression algorithms.

Intrinsic Gradient Compression Method | Upload | Download | Dimensions Explored
No Compression | D | D | D
Static | d | d | d
Time-Varying | d | 2d | Ed
K-Subspace | d | Kd | Kd
K-Subspace + Time-Varying | d | 2Kd | EKd
Table 1: Bandwidth and performance comparisons, where d is the intrinsic dimension, D the full model dimension, K the number of subspaces, and E the number of epochs. The bandwidth refers to that used by each client per round (up to constant factors), while the dimensions explored are totaled over training. Note that we break upload and download bandwidth into separate columns, because download speeds can often be considerably faster than upload speeds, and we may thus be willing to tolerate higher values of download bandwidth.

3.2 Algorithms

While Section 3.1 shows that any algorithm of the form given in Equation 6 can be implemented with low levels of upload bandwidth, not every algorithm of this form can be implemented with low levels of download bandwidth as well. In this section, we describe three particular intrinsic gradient compression algorithms which use low amounts of both upload and download bandwidth. We show the theoretical tradeoffs between each of these algorithms in Table 1.

These federated learning algorithms can be decomposed into three main phases.

  • Reconciliation: The client reconciles its model with the server’s copy of the model.

  • Compression: The client calculates, compresses, and sends its local gradient to the server.

  • Decompression: The server updates its own copy of the model using the estimated gradients it has received.

Compression and decompression are shared between all algorithms, while each algorithm has a distinct reconciliation phase.

Static Intrinsic Gradient Compression

The static intrinsic gradient compression algorithm simply involves projecting gradients into a fixed ("static") low-dimensional space and reconstructing them on the server:

θ^(t+1) = θ^(t) − η A Aᵀ ĝ_t.

Despite its simplicity, it performs remarkably well in practice (see Section 4). The full algorithm is given in Algorithm 1.

Note that in the reconciliation phase, the parameters θ (which are on the server) will always be equal to θ_0 + A θ_d for some θ_d ∈ R^d. Thus, the server can just send θ_d to the client, using only O(d) download bandwidth. In the compression phase, the client compresses its gradient by multiplying by Aᵀ, and in the decompression phase the server reconstructs the full-dimensional update by multiplying by A.

  input: learning rate η, timesteps T, local batch size ℓ, clients per round W
  Create matrix A ∈ R^(D×d) with d ≪ D. Spawn A on all nodes using a suitable shared random number generator.
  Current vector: θ_d^(0) = 0
  for t = 1, …, T do
     Randomly select W clients c_1, …, c_W.
     loop
         {In parallel on clients c_i}
         Download θ_d^(t−1), calculate current θ^(t−1) = θ_0 + A θ_d^(t−1).
         Compute stochastic gradient g_i^t on a batch B_i of size ℓ: g_i^t = (1/ℓ) Σ_{z ∈ B_i} ∇_θ L(θ^(t−1), z).
         Sketch g_i^t to S_i^t = Aᵀ g_i^t and upload it to the aggregator.
     end loop
     Aggregate sketches: S^t = (1/W) Σ_i S_i^t.
     Unsketch: Δ^t = A S^t.
     Update: θ_d^(t) = θ_d^(t−1) − η S^t, θ^(t) = θ^(t−1) − η Δ^t.
  end for
Algorithm 1 Static Intrinsic Gradient Compression
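The following is a minimal, self-contained simulation of the static scheme in Algorithm 1; the quadratic client objectives and the dense Gaussian A are stand-ins for real models and for the Fastfood matrices described in Section 3.2.

```python
import numpy as np

def static_intrinsic_fl(client_grads, theta0, A, lr=0.1, rounds=200):
    """Toy simulation of static intrinsic gradient compression.

    client_grads: list of callables mapping the full parameter vector in R^D to a
    (stochastic) gradient in R^D. A is the shared D x d projection, assumed to be
    generated from a seed known to the server and every client.
    """
    theta_d = np.zeros(A.shape[1])                 # server state: d intrinsic weights
    for _ in range(rounds):
        theta = theta0 + A @ theta_d               # clients rebuild full params from d numbers
        # Each client uploads only the d-dimensional sketch A^T g of its gradient.
        sketches = [A.T @ g(theta) for g in client_grads]
        theta_d -= lr * np.mean(sketches, axis=0)  # server aggregates and updates
    return theta0 + A @ theta_d

# Example: three "clients" whose local losses are quadratics around different targets.
rng = np.random.default_rng(0)
D, d = 2_000, 32
theta0 = np.zeros(D)
A = rng.normal(size=(D, d)) / np.sqrt(D)
targets = [rng.normal(size=D) for _ in range(3)]
clients = [lambda th, t=t: th - t for t in targets]   # gradient of 0.5 * ||th - t||^2
theta_final = static_intrinsic_fl(clients, theta0, A)
```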

K-Subspace Static Intrinsic Gradient Compression

The K-subspace algorithm is motivated by the fact that in some cases, upload bandwidth is more heavily constrained than download bandwidth. Rather than using a single compression matrix A, we use a set of K different compression matrices A^1, …, A^K, each corresponding to a different subspace. At each iteration, each client is randomly assigned one of these matrices. Each client then explores a subspace of dimension d and uploads a vector of size d to the server. Finally, the server aggregates these local updates into a global update of size Kd, which is downloaded by each client. In this way, it is possible to explore a subspace of size Kd using only d upload bandwidth. With K = 1, this algorithm is equivalent to static gradient compression. The full algorithm is given in Algorithm 2.

  input: number of subspaces K, learning rate η, timesteps T, local batch size ℓ, clients per round W
  Create matrices A^1, …, A^K, where A^k ∈ R^(D×d) with d ≪ D. Spawn them across all nodes using K distinct random seeds.
  Current vectors: θ_d^(k,0) = 0 for k = 1, …, K.
  for t = 1, …, T do
      Randomly select W clients c_1, …, c_W.
      loop
          {In parallel on clients c_i}
          Download θ_d^(k,t−1) for k = 1, …, K, calculate current θ^(t−1) = θ_0 + Σ_k A^k θ_d^(k,t−1).
          Choose a random index k_i ∈ {1, …, K}.
          Compute stochastic gradient g_i^t on a batch B_i of size ℓ: g_i^t = (1/ℓ) Σ_{z ∈ B_i} ∇_θ L(θ^(t−1), z).
          Sketch S_i^t = (A^(k_i))ᵀ g_i^t and upload it to the aggregator.
      end loop
      Write the sketches received as S_1^t, …, S_W^t.
      Unsketch to get Δ^t = (1/W) Σ_i A^(k_i) S_i^t.
      Update: θ^(t) = θ^(t−1) − η Δ^t.
      for k = 1, …, K do
          Update: θ_d^(k,t) = θ_d^(k,t−1) − (η/W) Σ_{i : k_i = k} S_i^t.
      end for
  end for
Algorithm 2 K-Subspace Intrinsic Gradient Compression
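As an illustration of the upload/download asymmetry in Algorithm 2, here is a hypothetical single-round sketch in NumPy; the dense matrices, the toy client gradients, and the averaging convention are simplifying assumptions.

```python
import numpy as np

def k_subspace_round(client_grads, theta0, A_list, theta_d_list, lr=0.1, seed=0):
    """One illustrative round of K-subspace compression.

    A_list: K projection matrices, each of shape (D, d).
    theta_d_list: the K intrinsic vectors maintained by the server.
    Each client uploads d numbers; the downloaded state has size K * d.
    """
    rng = np.random.default_rng(seed)
    K = len(A_list)
    # Clients reconstruct the full parameters from the K downloaded intrinsic vectors.
    theta = theta0 + sum(A @ td for A, td in zip(A_list, theta_d_list))
    sums = [np.zeros_like(td) for td in theta_d_list]
    for g in client_grads:
        k = rng.integers(K)                       # client's randomly assigned subspace
        sums[k] += A_list[k].T @ g(theta)         # d-dimensional upload
    for k in range(K):
        theta_d_list[k] -= lr * sums[k] / len(client_grads)   # aggregate over all clients
    return theta_d_list
```

In this sketch the downloaded state grows with K (all K intrinsic vectors), while each client's upload stays at d numbers, matching the asymmetry shown in Table 1.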

Time-Varying Intrinsic Gradient Compression

Finally, the time-varying algorithm utilizes the fact that changing the subspace in which we are optimizing is nearly costless: it simply involves sending the random seed s_e from which the (pseudo-)random matrix A_e may be generated. Rather than using one (or a set of) static compression matrices for all epochs (i.e. one epoch being one round of training over all clients), we generate a new matrix A_e at each epoch e. Formally, for timestep t within epoch e we have:

θ^(t+1) = θ^(t) − η A_e A_eᵀ ĝ_t.

In this case, our algorithm can be implemented with at most 2d download bandwidth used per client per timestep. Since this bandwidth is twice that of static subspace compression, but we search E times more directions in the space, this algorithm is particularly useful when we have many epochs.

Letting θ^(e) denote the parameters at the start of epoch e, note that we have the value of θ^(e−1) available when performing reconciliation. Now we can write the server's current parameters as

θ = θ^(e−1) + A_{e−1} θ_d^(e−1) + A_e θ_d^(e),

where θ_d^(e−1) is the final intrinsic vector of the previous epoch and θ_d^(e) is the current epoch's intrinsic vector. We can see that the first correction lies in the span of A_{e−1} and the second lies in the span of A_e, so reconciliation requires downloading only these two d-dimensional vectors, showing the validity of the algorithm, which is given in full in Algorithm 3.
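A compact simulation of this per-epoch subspace switching might look as follows; the seed-indexed dense matrices stand in for the Fastfood construction of Section 3.2, and the client interface is purely illustrative.

```python
import numpy as np

def projection_from_seed(seed, D, d):
    # Stand-in for a Fastfood-style structured matrix: only the integer seed
    # needs to be communicated, and every node regenerates the same matrix.
    return np.random.default_rng(seed).normal(size=(D, d)) / np.sqrt(D)

def time_varying_training(client_grads, theta0, D, d, epochs=4, rounds=50, lr=0.1):
    """Sketch of time-varying intrinsic gradient compression: one fresh subspace
    per epoch, folded into the base point at the end of the epoch."""
    base = theta0.copy()
    for e in range(epochs):
        A_e = projection_from_seed(e, D, d)        # broadcast the seed, not the matrix
        theta_d = np.zeros(d)
        for _ in range(rounds):
            theta = base + A_e @ theta_d
            sketches = [A_e.T @ g(theta) for g in client_grads]
            theta_d -= lr * np.mean(sketches, axis=0)
        base = base + A_e @ theta_d                # reconciliation needs only this d-vector
    return base
```

Only an integer seed and a couple of d-dimensional vectors cross the network in each round, which is where the roughly 2d download cost per timestep comes from.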

Finally, we note that it is possible to use both K-subspace and time-varying compression together. In this case, a new batch of K compression matrices is generated at each epoch e. We do not experiment with this setup, but it is likely to show further improvements over using each of these methods alone.

  input: learning rate η, epochs E, timesteps per epoch T, local batch size ℓ, clients per round W
  for e = 1, …, E do
      Create matrix A_e ∈ R^(D×d) with d ≪ D from seed s_e, and spawn it on all nodes.
      Current and final vectors: θ_d^(e,0) = 0; θ_d^(e−1,T) is the final intrinsic vector of the previous epoch (zero for e = 1).
      for t = 1, …, T do
          Randomly select W clients c_1, …, c_W.
          loop
              {In parallel on clients c_i}
              Download θ_d^(e−1,T) and θ_d^(e,t−1), calculate current θ^(e,t−1).
              Update the local base point: θ^(e,0) = θ^(e−1,0) + A_{e−1} θ_d^(e−1,T), so that θ^(e,t−1) = θ^(e,0) + A_e θ_d^(e,t−1).
              Compute stochastic gradient g_i^t on a batch B_i of size ℓ: g_i^t = (1/ℓ) Σ_{z ∈ B_i} ∇_θ L(θ^(e,t−1), z).
              Sketch S_i^t = A_eᵀ g_i^t and upload it to the aggregator.
          end loop
          Aggregate sketches: S^t = (1/W) Σ_i S_i^t.
          Unsketch: Δ^t = A_e S^t.
          Update: θ_d^(e,t) = θ_d^(e,t−1) − η S^t, θ^(e,t) = θ^(e,t−1) − η Δ^t.
      end for
      Let θ^(e+1,0) = θ^(e,T).
  end for
Algorithm 3 Time-Varying Intrinsic Gradient Compression

Choice of Compression Matrix

Here, we discuss how to choose A. Our methods are theoretically agnostic to the choice of A, and depend only on the existence of efficient subroutines for calculating the matrix-vector products A θ_d and Aᵀ g. Nonetheless, the choice of A has significant practical implications, which we discuss here.

The naive choice is to let A be a random dense matrix, but such a choice is infeasible due to memory constraints. For example, if we aim to train even a small version of BERT (100M parameters) with an intrinsic dimension d, we would need to store a dense matrix with roughly 10^8 · d entries.

Our approach, also taken by aghajanyan2020intrinsic; li2018measuring, utilizes the Fastfood transform DBLP:conf/icml/LeSS13. This transform expresses the matrix A (up to normalization) as the product A = T H G Π H B P, where M is the smallest power of two larger than D; H is a standard M × M Hadamard matrix; B is a random diagonal matrix with independent Rademacher entries (random signs); Π is a random permutation matrix; G is a random diagonal matrix with independent standard normal entries; P is a linear operator which simply pads a d-dimensional vector with zeroes until it has size M; and T is a linear operator which takes the first D elements from an M-dimensional vector. Since we can quickly compute a matrix-vector product with H using a fast Walsh-Hadamard transform, we can perform a matrix multiplication by A in O(D log D) time and O(D) space.

Finally, to ensure that we do not need to communicate the matrices A_t, we generate each matrix pseudorandomly from a random seed s_t. Thus, the matrices themselves never need to be transferred over the network.
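To make this construction concrete, here is a minimal NumPy sketch of a Fastfood-style implicit projection supporting the two required matrix-vector products; normalization constants are omitted and the exact factor ordering is an illustrative assumption.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

class FastfoodProjection:
    """Implicit D x d projection A: supports A @ v and A.T @ g without ever storing A."""
    def __init__(self, D, d, seed=0):
        rng = np.random.default_rng(seed)
        self.D, self.d = D, d
        self.M = 1 << (D - 1).bit_length()              # smallest power of two >= D
        self.B = rng.choice([-1.0, 1.0], size=self.M)   # random signs (Rademacher diagonal)
        self.Pi = rng.permutation(self.M)               # random permutation
        self.G = rng.normal(size=self.M)                # Gaussian diagonal

    def matvec(self, v):
        """Compute A @ v for v in R^d."""
        x = np.zeros(self.M)
        x[:self.d] = v                                  # pad to length M
        x = fwht(self.B * x)                            # apply B, then H
        x = self.G * x[self.Pi]                         # apply the permutation, then G
        x = fwht(x)                                     # apply H again
        return x[:self.D]                               # truncate to length D

    def rmatvec(self, g):
        """Compute A.T @ g for g in R^D (the compressed gradient)."""
        x = np.zeros(self.M)
        x[:self.D] = g
        x = self.G * fwht(x)                            # apply H, then G
        y = np.zeros(self.M)
        y[self.Pi] = x                                  # transpose of the permutation
        return (self.B * fwht(y))[:self.d]              # apply H, then B, truncate to d
```

A quick sanity check is that matvec and rmatvec are adjoint: for random u and g, np.dot(P.matvec(u), g) should match np.dot(u, P.rmatvec(g)) up to floating-point error.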

3.3 Theoretical Guarantees

In this section, we provide guarantees on static, time-varying, and -subspace intrinsic gradient compression. We focus on convex functions, which are the most amenable to analysis. First, we contend that it is not interesting to prove guarantees of the form “time-varying intrinsic gradient compression works well for all convex functions”. This is because the hypotheses are too weak to produce meaningful results, even if one assumes that one has access to oracle convex optimization routines which return the minimizer (rather than just an approximate optimizer).

Two representative works, similar to ours, which consider a setup where we have access to an oracle that finds minimizers of convex functions are stich2013optimization and ssobound. stich2013optimization considers an optimization algorithm which searches over random low-dimensional subspaces, showing that, theoretically, repeatedly searching a few random directions performs about as well as searching correspondingly many directions at once, offering no bandwidth benefit in our context. ssobound shows a similar result without requiring random subspaces. Thus, showing interesting guarantees for arbitrary convex functions is likely quite challenging.

Rather, in the flavor of intrinsic dimension, we assume that our convex optimization problems are "easier" than standard problems, in that searching a few directions is likely to yield good solutions. In this case, we show that time-varying intrinsic gradient compression works even better than static compression. Intuitively, this is because each random subspace sampled in the time-varying algorithm contains a point which allows us to meaningfully reduce our loss. As a consequence, when we consider many subspaces sequentially, we can reduce our loss exponentially.

Thus, we state our hypotheses via a formalized definition of intrinsic dimension. Informally, a convex function f has intrinsic dimension d if, for any starting point, the minimum of f over S, a uniformly chosen d-dimensional affine subspace through that point (drawn over the Grassmannian), is with high probability substantially closer to the global minimum of f than the starting point is.

The result on static compression now follows directly. We merely need to account for the fact that we are using an approximate optimization algorithm and not an oracle optimization algorithm. However, since a convex problem restricted to an affine subspace is still convex, this follows directly from well-known guarantees on gradient descent.

In what follows, we assume that at each step we have access to ĝ_t, an unbiased estimate of the true gradient of f at time t given the current iterate – such an estimate naturally emerges from our methods, where the randomness comes from the data points in the batch. In all cases, we assume that A is an orthonormal basis of a random subspace sampled according to the Grassmannian. All proofs are given in Appendix A.

For the static compression algorithm, if the function has intrinsic dimension d, we obtain a bound on the expected suboptimality of the iterate produced by running the static compression algorithm for T total steps.

For K-subspace compression, we do not obtain stronger theoretical guarantees than for static compression, but we include the result for completeness. Note that the two use the same total amount of upload bandwidth, because the K-subspace method saves a factor of K on upload. We also need a further assumption on the ratio of the variance to the squared mean of the gradient estimates: if it is too small, the extra variance induced by the K-subspace method causes a substantial performance drop.

For the K-subspace algorithm, if the function has intrinsic dimension d with the stated probability, we obtain an analogous bound after taking T steps, under an additional assumption bounding the ratio of the gradient variance to its squared mean.

Finally, we prove a stronger guarantee for time-varying compression, taking advantage of the effectively exponentially decaying loss obtained by repeatedly applying the static compression result above.

For the time-varying algorithm, if the function has intrinsic dimension d over E epochs, the suboptimality decays geometrically in the number of epochs after taking T steps per epoch.

Figure 3: Results on computer vision benchmarks. (a) Accuracy on CIFAR-10 across compression rates. (b) Training curves on CIFAR-10 of static and time-varying compression at a fixed intrinsic dimension. Both static and time-varying intrinsic gradient compression significantly outperform prior work, with time-varying intrinsic compression performing best. On the right, we see that time-varying and static compression perform similarly at the beginning of training, but time-varying outperforms static with equal space when the compression is higher. For the FedAvg and uncompressed methods with compression rates above 1, compression was performed by training for fewer epochs.

Figure 6: Results on NLP benchmarks. (a) Perplexity on PersonaChat. (b) Accuracy on SST-2. K-subspace and static compression both strongly outperform all other methods, though K-subspace has the added benefit of much lower upload compression (not shown). For the SST-2 results, error bars show the standard error of performance calculated over five runs with different random seeds.

4 Experiments

We evaluate our method across three benchmarks: two from NLP (language modeling and text classification) and one from computer vision (image classification). As with previous works DBLP:conf/icml/RothchildPUISB020; mcmahan2017communication, we simulate a federated setting in order to scale to large numbers of clients (upwards of 10,000). We perform experiments in both non-IID and IID settings.

Image Classification (ResNet-9 on CIFAR-10)

First, we consider image classification on CIFAR-10, a dataset of 50,000 32×32 px images. We use the same experimental setup as DBLP:conf/icml/RothchildPUISB020: we split the data between 10,000 clients in a non-IID fashion, such that each client only has data from a single class. At each step, we sample 100 clients at random, such that each gradient step corresponds to 500 images. We perform 24 rounds of communication between all clients (i.e. 24 epochs).

We use a ResNet-9 architecture with 6,570,880 trainable parameters for a fair comparison to previous work. Note that the model does not have batch normalization, as it would not make sense in a setting where each client has so few examples. Due to the substantial number of epochs performed here, we experiment with both static and time-varying gradient compression (K-subspace compression is better suited to settings involving fewer rounds of communication). We experiment with intrinsic dimensions from 4000 to 256000.

Our results are shown in Figure 3. Whereas FedAvg and Top-K struggle at even modest compression rates, the intrinsic gradient compression methods deliver strong performance at much larger compression rates. The intrinsic methods outperform the current state-of-the-art gradient compression method, FetchSGD DBLP:conf/icml/RothchildPUISB020, by a large margin, and easily scale to high compression rates. Finally, we see that time-varying intrinsic compression generally outperforms static compression for the same communication cost.

Text Classification (BERT on SST-2)

Next, we consider text classification on the Stanford Sentiment Treebank-v2 (SST-2) dataset sst2, a common sentiment analysis dataset. For this experiment, we consider IID splits of the data into 50 and 500 clients. We employ the popular BERT devlin-etal-2019-bert architecture with 109M parameters, and we use intrinsic dimensions from 200 to 25,600. The purpose of this experiment is to push the limits of gradient compression; we project the 109M-dimensional BERT gradients into as few as 200 dimensions.

Our results are given in Figure 6. First, in agreement with aghajanyan2020intrinsic, we find that it is possible to achieve remarkably high compression ratios for text classification: we get nearly full performance even when compressing the 109M-dimensional parameter vector into an intrinsic space of dimension 16,384. Furthermore, we find that time-varying intrinsic gradient compression consistently outperforms static intrinsic gradient compression at the same compression rate.

Language Modeling (GPT-2 on PersonaChat)

Lastly, we consider language modeling on the PersonaChat zhang2018personalizing dataset. The dataset has a non-IID split into 17,568 clients in which each client is assigned all data corresponding to a given personality; as a result, it is widely used in federated learning simulations. We perform language modeling using the GPT-2 transformer architecture (124M parameters) and conduct two rounds of training across the clients (i.e. two epochs). Due to the low number of training rounds, it is natural to apply static and K-subspace gradient compression (we use K = 8). (Time-varying compression does not make sense here, as its benefit is derived from the setting where there are many rounds of communication between the clients.)

Our results are shown in Figure 6. Overall, intrinsic dimension-based gradient compression vastly outperforms a wide range of prior approaches to reducing communication in federated learning. On the low-compression end of the spectrum, we obtain nearly full performance with superior compression rates to the state-of-the-art FetchSGD DBLP:conf/icml/RothchildPUISB020. On the high-compression end of the spectrum, we scale better than previous approaches. For example, we obtain a perplexity of around 20 even with an extremely high compression rate of 1898.

Finally, we see that K-subspace intrinsic compression performs similarly to (or slightly worse than) static compression at the same level of overall compression. However, if it is more important to conserve upload bandwidth than download bandwidth, then K-subspace intrinsic gradient compression significantly outperforms static intrinsic gradient compression (see Table 2).

Gradient Reconstruction: Data Privacy Experiment

One of the primary motivations of federated learning is the desire for individual clients to be able to retain data privacy while still participating in model training. However, prior work DBLP:conf/nips/ZhuLH19 has shown that if the client sends their full local model update to the server, it is sometimes possible to approximately reconstruct their local data from the model update. We investigate the extent to which an attacker can reconstruct a client’s data given a compressed gradient update, and we find that our compression helps to mitigate this reconstruction problem. Full details are included in Appendix E due to space constraints.

5 Conclusion

We propose a family of intrinsic gradient compression algorithms for federated learning. This family includes static compression, which performs remarkably well despite its simplicity; K-subspace compression, which is optimized for upload bandwidth; and time-varying compression, which improves performance by changing the intrinsic subspace over time. We provide theoretical results for our algorithms and demonstrate their effectiveness through numerous large-scale experiments. We hope that our results help make the real-world deployment of large-scale federated learning systems more feasible.

References

Appendix A Proofs Omitted in the Main Text

A.1 Proof of the Static Compression Result (Section 3.3)

First, we show that h(θ_d) := f(θ_0 + A θ_d) is convex in θ_d.

Lemma. h is convex.

Proof.

We have, for any θ_d, θ_d′ and λ ∈ [0, 1],

h(λ θ_d + (1 − λ) θ_d′) = f(λ (θ_0 + A θ_d) + (1 − λ)(θ_0 + A θ_d′)) ≤ λ h(θ_d) + (1 − λ) h(θ_d′),

and we may conclude. ∎

We can now write

f(θ_T) − f* = ( f(θ_T) − min_{θ ∈ S} f(θ) ) + ( min_{θ ∈ S} f(θ) − f* ),

where S = θ_0 + span(A) is the sampled affine subspace. We can bound the first term with a result from [scaffold], because f restricted to S is convex, and thus classical convex optimization algorithms converge quickly. The second term is bounded by our assumption on the intrinsic dimension of the function f: with the stated probability, it is sufficiently small, and we may conclude.

A.2 Proof of the K-Subspace Result (Section 3.3)

For this result, it is not immediately clear how to fit the algorithm into the existing SGD framework. First, we parametrize the iterates using the concatenated matrix A = [A^1, …, A^K] ∈ R^(D×Kd)

and take θ = θ_0 + A θ_Kd, where θ_Kd ∈ R^(Kd) stacks the K intrinsic vectors. The correct gradient with respect to θ_Kd is Aᵀ g, where g is the true gradient. However, now define g̃ to be the estimator in which each client contributes only to the block corresponding to its randomly assigned subspace, rescaled so that the estimate remains unbiased.

Then, we claim that our algorithm is equivalent to using g̃ as an unbiased gradient estimate. Thus, the SGD update acts on θ_Kd, and after multiplying both sides by the matrix A we obtain an update in the original parameter space

which matches our algorithm for K-subspace compression.

It remains to compute the variance of the gradient estimates g̃, which enters the SGD bound. The block structure of g̃ inflates the variance relative to that of Aᵀ g. Thus, given the assumed bound on the ratio of the variance to the squared mean, the true variance is at most a bounded factor times the original variance. The rest of the analysis is exactly the same as in Section A.1, and we may conclude.

A.3 Proof of the Time-Varying Result (Section 3.3)

Here, we repeatedly apply the static compression result (Section 3.3), using the fact that we sample fresh directions in each epoch. Intuitively, the time-varying design implies that each new subspace choice is a fresh opportunity to get closer to the optimum, so each epoch brings us closer and closer to the desired optimum.

From [scaffold], we have that after the iterations of a single epoch, the optimality gap shrinks by a multiplicative factor. By repeatedly applying this result across epochs, with high probability the final optimality gap decays geometrically in the number of epochs, and we may conclude.

Appendix B K-Subspace Intrinsic Gradient Compression

This is given in Algorithm 2.

Appendix C Additional Related Work

C.1 Intrinsic Dimensionality

As mentioned in the main paper, the concept of measuring the intrinsic dimensionality of loss landscapes was introduced by [li2018measuring]. [li2018measuring] consider optimizing a D-parameter model in a random d-dimensional subspace of the full parameter space. They define the intrinsic dimension of the optimization problem as the minimum dimension d for which a solution to the problem can be found, where a "solution" refers to attaining a certain percentage of the maximum possible validation accuracy (i.e. the validation accuracy obtained by optimizing over all D dimensions). They use a fixed cut-off of 90% accuracy for their experiments. [aghajanyan2020intrinsic] apply these ideas in the setting of finetuning NLP models.

A number of works have tried to measure the intrinsic dimension of datasets, rather than objective landscapes. [NIPS2004_74934548] introduced a maximum likelihood approach to estimating intrinsic dimensionality based on nearest-neighbors, while [CERUTI20142569] employed angle and norm-based similarity.

Finally, some works have tried to measure the intrinsic dimensionality of image representations and datasets. [gong2019intrinsic] finds that the representations produced by popular image and face representation learning models (ResNet-50 and SphereFace) have quite low intrinsic dimensionalities (16 and 19, respectively). Along similar lines, [pope2021the] showed that popular image datasets (MNIST, CIFAR-10, ImageNet) also have low intrinsic dimensionality.

C.2 Model Pruning

There has been great interest in compressing models by using fewer weights, starting with the work of [hinton2015distilling, han2015deep]. One related work is Diff Pruning [guo2020parameter], which constrains the number of weights that can be changed from a pretrained model. In essence, diff pruning attempts to solve an L0 minimization problem on the weights of the model, and approaches this by means of a relaxation to a problem that is more amenable to standard analysis.

A number of other works have explored the idea of finetuning by only modifying a subset of a model's parameters. [ravfogel2021bitfit] finetunes only the layer biases, whereas [houlsby2019parameter] introduces the concept of low-parameter adapters between each layer. Compared to [ravfogel2021bitfit], our method is far more flexible, allowing any number of parameters to be changed. Compared to [houlsby2019parameter], our methods are architecture-independent, and can be applied to any model.

Federated Learning

Federated learning is generally concerned with the distributed training of machine learning models across many devices, each of which holds private data. Many aspects of this federated setup are separate subfields of research, including how to ensure the privacy of client-held data [Xie2020DBA, bhagoji2019analyzing], how to deal with heterogeneous data and networks [li2020federated, li2020convergence, yu2020federated], how to reconcile weights/gradients from multiple clients [li2020federated, wang2020federated, pmlr-v119-li20g], how to manage clients in a fault-tolerant manner, deployment on mobile/IoT devices [chaoyanghe2020fedml], and fairness [mohri2019agnostic].

The classic FedAvg [mcmahan2017communication] algorithm communicates model updates after multiple local training iterations. FedProx [li2020federated] generalized and re-parametrized FedAvg, and FedMA [wang2020federated] improved this approach by matching and averaging hidden layers of networks with similar activations at each communication round. Additionally, FedAwS [yu2020federated] considered federated averaging in the case where each client has data from only a single class.

Appendix D Further Experimental Details and Analysis

In the main paper, we included a number of figures demonstrating our performance in comparison to prior work. Here, we include tables with our precise results for clarity and in order to facilitate future comparison with our work.

D.1 General Implementation Details

We perform our language modeling experiments on 8 RTX 6000 GPUs and our image/text classification experiments on 1 RTX 6000 GPU. Regarding the intrinsic gradient compression matrices A, we employ the Fastfood method described in Section 3.2, using a CUDA implementation of the fast Walsh-Hadamard transform from [thomas2018learning].

D.2 Further PersonaChat Analysis

First, we give more details on the PersonaChat dataset, which were omitted from the main paper due to space constraints. The PersonaChat dataset [zhang2018personalizing] was collected by first giving imaginary personas (defined by a set of 5 sentences) to Amazon Mechanical Turk workers and asking them to take on those personas. Then, the system paired workers and asked them to discuss. Since the personas were imaginary and no personally identifiable information was exchanged (in particular, the workers were explicitly told not to use personally identifiable information), the dataset does not contain personally identifiable information. The dataset has a non-IID split into 17,568 clients in which each client is assigned all data corresponding to a given personality; as a result, it is widely used in federated learning simulations. We perform language modeling using the GPT-2 transformer architecture (124M parameters). We perform static and K-subspace gradient compression using intrinsic dimensions of 16384, 65536, 262144, 1048576, and 4194304.

We show full results on PersonaChat below, complete with upload and download compression. Overall compression is calculated as the average compression over both upload and download. We compare with FedAvg [mcmahan2017communication], Top-K, and FetchSGD [DBLP:conf/icml/RothchildPUISB020]. FedAvg is the baseline federated learning approach involving sending and averaging weights. Top-K refers to sending the k gradient coordinates with the largest magnitudes. FetchSGD compresses the gradients with sketching.

Our method significantly outperforms competing approaches across the board. We obtain performance close to that of uncompressed optimization using 29.7× overall compression; FedAvg and Top-K both fail to achieve such strong results, while FetchSGD does so only at a significantly lower compression rate.

Next, we compare static and K-subspace intrinsic gradient compression. When comparing overall compression rates, static compression is slightly better than K-subspace compression. However, K-subspace compression is optimized for low upload bandwidth; it obtains much better upload compression rates than static compression at the same accuracy. For example, at a comparable perplexity, K-subspace compression achieves a several-times-higher upload compression rate than static compression (see Table 2).

Name | Intrinsic Dim. | PPL | Up. Comp. | Down. Comp. | Total Comp.
Uncompressed | – | 13.9 | 1 | 1 | 1
[mcmahan2017communication] FedAvg (2 local iters) | – | 16.3 | 2 | 2 | 2
[mcmahan2017communication] FedAvg (5 local iters) | – | 20.1 | 5 | 5 | 5
Local Top-K | – | 19.3 | 30.3 | 2490 | 60
Local Top-K | – | 17.1 | 3.6 | 248 | 7.1
[DBLP:conf/icml/RothchildPUISB020] FetchSGD | – | 14.8 | 3.8 | 100 | 7.3
[DBLP:conf/icml/RothchildPUISB020] FetchSGD | – | 15.8 | 2.4 | 10 | 3.9
Ours (static) | 16384 | 27.7 | 7595 | 7595 | 7595
Ours (K-subspace) | 16384 | 19.6 | 7595 | 949 | 1688
Ours (static) | 65536 | 20.6 | 1900 | 1900 | 1900
Ours (K-subspace) | 65536 | 17.8 | 1900 | 237 | 422
Ours (static) | 262144 | 17.6 | 475 | 475 | 475
Ours (K-subspace) | 262144 | 16.6 | 475 | 59.3 | 105
Ours (static) | 1048576 | 15.8 | 119 | 119 | 119
Ours (K-subspace) | 1048576 | 15.4 | 119 | 14.8 | 26.3
Ours (static) | 4194304 | 14.8 | 29.7 | 29.7 | 29.7
Table 2: Results of our method and comparison to prior work, including the state-of-the-art in gradient compression (FetchSGD). The table shows language modeling perplexity (PPL, lower is better) and compression rates (higher is better). We show upload, download, and total compression rates. The two Local Top-K and two FetchSGD rows correspond to different hyperparameter settings. For our intrinsic gradient compression results, we show static and K-subspace compression for a range of intrinsic dimensions between 16,384 and 4,194,304. For K-subspace compression we use K = 8.

D.3 Further SST-2 Details and Analysis

Intrinsic Dim. | 200 | 400 | 800 | 1,600 | 3,200 | 6,400 | 12,800 | 25,600
Static | 82.8 | 85.3 | 87.1 | 87.5 | 88.3 | 89.4 | 89.5 | 89.5
Time-Varying | 85.9 | 87.8 | 87.8 | 88.7 | 89.0 | 89.4 | 89.4 | 89.4
Table 3: Accuracy of a BERT model trained on the Stanford Sentiment Treebank v2 (SST-2) for varying intrinsic dimensions. We calculate the standard error over five trials with different random seeds. We see that, for a fixed dimension, time-varying intrinsic gradient compression outperforms static intrinsic gradient compression.

Regarding the experimental setup, we perform 30 rounds (i.e. 30 epochs) of training for all compressed runs, while we perform 6 for the uncompressed baseline (as it converges more quickly). Federated learning experiments have previously been criticized for being challenging to reproduce; as a result, we perform each run five times over different random seeds. Due to the substantial number of epochs performed here, it is natural to apply static and time-varying intrinsic gradient compression. We use intrinsic dimensions of 200, 400, 800, …, 25,600.

In Table 3, we show full results for the SST-2 dataset with static and time-varying gradient compression for a range of intrinsic dimensions. We include in this experiment a demonstration of the robustness of our method to variation in random seeds; we run each experiment five times using separate random seeds (i.e. different intrinsic subspaces and model initializations). We report standard errors in Table 3 and include Figure 6 with error bars in the main paper. Overall variability is quite low.

We also see that time-varying intrinsic gradient compression outperforms static intrinsic compression, especially for low intrinsic dimensions. For example, time-varying compression with dimension 200 outperforms static compression with dimension 400, and time-varying compression with dimension 400 outperforms static compression with dimension 1,600.

Appendix E Gradient Reconstruction: Data Privacy Experiment

Figure 10: Image reconstruction from gradients with and without our intrinsic gradient compression method. (a) The original input image. (b) Reconstruction from a single full gradient of a ResNet-152 model (60M parameters), produced using the method of [DBLP:conf/nips/ZhuLH19]. (c) The result of the same image reconstruction method applied to a gradient compressed by our algorithm using intrinsic dimension 65,536.

Data privacy is one of the central motivations of federated learning.

However, a number of works have shown that if the client does not have a large amount of data and the client sends back its full local gradient, it is possible to approximately reconstruct its local data from the model update. This is a significant problem, because the client's data would then effectively be visible to the central server and to any attackers that intercept its communications.

Here, we show that compressing gradients with our approach can help mitigate this problem. Specifically, we check whether our compressed gradients can be reconstructed with the iterative procedure proposed by [DBLP:conf/nips/ZhuLH19], which takes a gradient and a model and tries to recover an image. As in [DBLP:conf/nips/ZhuLH19], we use a ResNet-152 model on a randomly selected image from ImageNet and run for 24,000 iterations (by which time the method has converged). We reconstruct the image both from the full gradient (the center image in Figure 10) and from the intrinsically compressed gradient (the right image) with intrinsic dimension 65,536.
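For reference, a hedged sketch of a gradient-matching attack adapted to compressed gradients is shown below; the helper name, the LBFGS settings, and the toy interface are assumptions rather than the exact procedure of [DBLP:conf/nips/ZhuLH19], and A denotes the (possibly implicit) D × d compression matrix.

```python
import torch

def reconstruct_from_compressed(model, A, compressed_grad, input_shape, label, steps=300):
    """Gradient-matching reconstruction (in the spirit of deep leakage from gradients),
    where the attacker only observes the d-dimensional sketch A^T g of the true gradient."""
    dummy = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.LBFGS([dummy])
    loss_fn = torch.nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        task_loss = loss_fn(model(dummy), label)
        grads = torch.autograd.grad(task_loss, model.parameters(), create_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        # Only the compressed view of the gradient is available to match against.
        match = ((A.T @ flat - compressed_grad) ** 2).sum()
        match.backward()
        return match

    for _ in range(steps):
        opt.step(closure)
    return dummy.detach()
```

Because the attacker can only match the d-dimensional sketch rather than the full gradient, far less information about the local data is available, consistent with the weaker reconstructions in Figure 10.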

As seen in Figure 10, given the full gradient it is possible to obtain a fairly good reconstruction of the image. By contrast, with our method, the reconstruction is visually much less similar to the original image. Of course, our method does not solve the problem entirely; an outline of the dog in the image is still visible because the compressed gradient still contains some information about the local data. To solve the issue entirely, it would be necessary to use a method such as differential privacy.