Distilled One-Shot Federated Learning

09/17/2020 · Yanlin Zhou, et al. · University of Florida

Current federated learning algorithms take tens of communication rounds transmitting unwieldy model weights under ideal circumstances and hundreds when data is poorly distributed. Inspired by recent work on dataset distillation and distributed one-shot learning, we propose Distilled One-Shot Federated Learning (DOSFL), which reduces the number of communication rounds required to train a performant model to only one. Each client distills their private dataset and sends the synthetic data (e.g. images or sentences) to the server. The distilled data look like noise and become useless after model fitting. We empirically show that, in only one round of communication, our method can achieve 96% test accuracy on federated MNIST, 81% on federated IMDB with a customized CNN (centralized 86%), and 84% on TREC-6 with a Bi-LSTM (centralized 89%), approaching the centralized baseline on all three tasks. By evading the need for model-wise updates (i.e., weights, gradients, loss, etc.), the total communication cost of DOSFL is reduced by over an order of magnitude. We believe that DOSFL represents a new direction orthogonal to previous work, towards weight-less and gradient-less federated learning.


1 Introduction

Conventional supervised learning dictates that data be gathered into a central location where it can be used to train a model. However, this is intrusive and difficult when data is spread across multiple devices or clients. For this reason, federated learning (FL) has garnered attention due to its ability to collectively train neural networks while keeping data private. The most popular federated learning algorithm is FedAvg McMahan et al. (2016). Each iteration, clients perform local training and forward the resulting model weights to a server, which averages them to obtain a global model. Since the learning process happens at the local level, neither the server nor other clients directly observe a client's data.

Federated learning introduces distinct challenges not present in classical distributed machine learning Li et al. (2019a). The main focuses of this paper are expensive communication and statistical heterogeneity. Previous approaches try to learn faster when data is poorly distributed; they include modifying the training loss Li et al. (2018), using lifelong learning to prevent forgetting Shoham et al. (2019), and correcting local updates using control variates Karimireddy et al. (2019). These methods improve upon FedAvg, but can still take hundreds of communication rounds while increasing the amount of information sent to the server.

Inspired by dataset distillation Wang et al. (2018), we propose Distilled One-shot Federated Learning (DOSFL) to solve the communication challenges of federated learning (see Figure 1). Each client distills their data and uploads learned synthetic data to the server, instead of transmitting bulky gradients or weights. Even large datasets containing thousands of examples can be compressed to only a few fabricated examples. The server then interleaves the clients’ distilled data together, using them to train a global model. To achieve good results even when client data is poorly distributed, we leverage soft labels Sucholutsky and Schonlau (2019) and introduce two new techniques: soft reset and random masking.

Figure 1: Distilled One-Shot Federated Learning. (1) The server initializes a model which is broadcast to all clients. (2) Each client distills their private dataset and (3) transmits synthetic data to the server. (4) The server fits its model on the distilled data and (5) distributes the final model to all clients.

Within only one round of communication, DOSFL can reach 96% test accuracy on Federated MNIST when data is independently and identically distributed (IID) and nearly 80% when data is non-IID. If we use each client’s steps serially, DOSFL can achieve almost 99% on IID Federated MNIST. In addition, DOSFL can reach 81% test accuracy for IMDB sentiment analysis with a customized CNN (86% when trained centrally), and 84% on TREC-6 with a bi-directional LSTM (89% when trained centrally). Due to the reduction in rounds, we claim a communication cost reduction of up to 96% compared to 100 rounds of FedAvg while achieving similar accuracy. Moreover, since only synthetic data is uploaded to the server, our method provides moderate privacy gains compared to FedAvg.

We believe DOSFL is part of a new paradigm in the field of federated learning. So far, nearly every federated learning algorithm communicates model weights or gradients. While effective, breaking from this pattern offers many benefits, such as low communication Guha et al. (2019) or private model architectures Li and Wang (2019). We hope that DOSFL, along with related work, may inspire the machine learning community to explore possible techniques for weight-less and gradient-less federated learning.

2 Related Work

2.1 Federated Learning

Since the introduction of FedAvg in 2016 McMahan et al. (2016), there has been an explosion of work directed at the problem of statistical heterogeneity. When statistical heterogeneity is high, convergence for FedAvg slows and becomes unstable Li et al. (2019a). The issue is that the difference between the local losses and the global objective (their weighted sum) may be large. As such, minimizing a particular local loss does not ensure that the global loss is also minimized. This is problematic even when the losses are convex and smooth Li et al. (2019b). In applications where some privacy loss can be tolerated, Zhao et al. demonstrate massive gains in performance by making as little as 5% of data public Zhao et al. (2018).

Numerous successors to FedAvg have been suggested. Server momentum introduces a global momentum parameter that improves convergence both theoretically and experimentally Liu et al. (2019). Shoham et al. modify the loss with Elastic Weight Consolidation to prevent forgetting as clients perform local training Shoham et al. (2019). SCAFFOLD uses control variates, namely the gradient of the global model, to address client drift during local training Karimireddy et al. (2019). These schemes, while effective, at least double the per-round communication cost.

While faster learning decreases the total number of communication rounds, strategies have also been devised to explicitly reduce communication costs. FedPAQ quantizes local updates before transmission, with averaging happening on both the client and server sides Reisizadeh et al. (2019). Sparsifying the weights may perform better than FedAvg alone Sattler et al. (2019). Asynchronous model updates have also been explored, using adaptive weighted averaging combined with a proximal loss to combat staleness Xie et al. (2019), and updating deeper layers less frequently than shallower layers Chen et al. (2019). SAFA takes a semi-asynchronous approach; only up-to-date and deprecated clients synchronize with the server Wu et al. (2019).

A few papers have made first steps towards one-shot federated learning. Guha et al. try different heuristics for selecting the client models that would form the best ensemble Guha et al. (2019); by swapping weight averaging for ensembling, only one round of communication is necessary. Upper and lower bounds have been proven for one-shot distributed optimization, along with an order-optimal algorithm Salehkaleybar et al. (2019); Sharifnassab et al. (2019). The extent to which these results apply to federated learning of neural networks is unknown, as the local losses must be convex and drawn from the same probability distribution.

2.2 Distillation

There is a wealth of literature on compressing a dataset while maintaining the features most crucial for training models. These methods include dataset pruning Angelova et al. (2005) and core set construction Bachem et al. (2017); Tsang et al. (2005); Sener and Savarese (2017), which keep the examples measured to be most useful for training and remove the rest. The drawback of drawing distilled examples from the original dataset is that the achievable level of compression is much lower than that of dataset distillation, which is exempt from the requirement that distilled data be real Wang et al. (2018).

Dataset distillation was introduced by Wang et al. (2018) to compress a large dataset, containing thousands to millions of images, down to only a few synthetic training images. The key idea is to use gradient descent to learn the features most helpful for rapidly training a neural network. Given some model parameters $\theta_0$, dataset distillation minimizes the loss of adapted parameters $\theta_1$, obtained by performing gradient descent on $\theta_0$ with the distilled data. This procedure resembles meta-learning, which performs task-specific adaptation followed by a meta-update Finn et al. (2017). With dataset distillation, 10 synthetic digits can train a neural network from 13% to 94% test accuracy in 3 iterations, near the test accuracy reached by training on all of MNIST.

Dataset distillation was originally limited to image classification tasks because the distilled labels were predetermined and fixed. Learnable, or soft, labels not only decrease the number of required labels but also extend dataset distillation to language tasks such as sentiment classification Sucholutsky and Schonlau (2019). Soft labels have a long history, having been proposed for model distillation by Hinton et al. (2015) and for k-nearest neighbors by El Gayar et al. (2006). Using soft-label dataset distillation, Sucholutsky and Schonlau were able to train LeNet to high test accuracy with only 10 distilled images Sucholutsky and Schonlau (2019). Examples of distilled data (text, grayscale images, and RGB images) are shown in Figure 1.

3 Distilled One-Shot Federated Learning

Suppose we have $N$ numbered clients, each with their own local model parameters $\theta_k$ and loss function $\ell_k$. Given some probability vector $p = (p_1, \dots, p_N)$ (each $p_k \geq 0$ and $\sum_k p_k = 1$), our goal is to find parameters $\theta$ that minimize the weighted sum of the local losses,

$\min_{\theta} \sum_{k=1}^{N} p_k \, \ell_k(\theta)$  (1)

However, we often do not have $N$ distinct loss functions but rather the same loss function evaluated on distinct private datasets. Let $\ell(x, y; \theta)$ be the loss of a single example $(x, y)$. Following Wang et al. (2018), we define $\ell(D; \theta)$ to be the average loss over all data points in the set $D$. Thus, for each client $k$ with dataset $D_k$, we have $\ell_k(\theta) = \ell(D_k; \theta)$.

1:  Initialize server weights $\theta_0$
2:  for clients $k = 1, \dots, N$ do
3:      $(\tilde{x}^k, \tilde{y}^k, \tilde{\eta}^k) \leftarrow$ DistillData($\theta_0$, $D_k$)
4:      Send the distilled data $(\tilde{x}^k, \tilde{y}^k, \tilde{\eta}^k)$ to the server
5:  end for
6:  Merge the distilled data into a single sequence $\{(\tilde{x}_i, \tilde{y}_i, \tilde{\eta}_i)\}_{i=1}^{NS}$
7:  for distill epoch $e = 1, \dots, E$ do
8:      for distill step $i = 1, \dots, NS$ do
9:          $j \leftarrow (e-1)NS + i$   (number of adaptations so far)
10:         $\theta_j \leftarrow \theta_{j-1} - \tilde{\eta}_i \nabla_{\theta_{j-1}} \ell(\tilde{x}_i, \tilde{y}_i; \theta_{j-1})$
11:     end for
12: end for
13: return the trained server model
14: function DistillData($\theta_0$, $D$)
15:     Initialize the distilled data $\tilde{x}$, $\tilde{y}$, $\tilde{\eta}$
16:     for each distillation training iteration do
17:         Get a minibatch $(x_t, y_t)$ from the client's dataset $D$
18:         for distill epoch $e = 1, \dots, E$ do
19:             for distill step $i = 1, \dots, S$ do
20:                 $j \leftarrow (e-1)S + i$   (number of adaptations so far)
21:                 $\theta_j \leftarrow \theta_{j-1} - \tilde{\eta}_i \nabla_{\theta_{j-1}} \ell(\tilde{x}_i, \tilde{y}_i; \theta_{j-1})$
22:             end for
23:         end for
24:         $\tilde{x} \leftarrow \tilde{x} - \alpha \nabla_{\tilde{x}} \ell(x_t, y_t; \theta_{SE})$
25:         $\tilde{\eta} \leftarrow \tilde{\eta} - \alpha \nabla_{\tilde{\eta}} \ell(x_t, y_t; \theta_{SE})$
26:         if soft_label then
27:             $\tilde{y} \leftarrow \tilde{y} - \alpha \nabla_{\tilde{y}} \ell(x_t, y_t; \theta_{SE})$
28:         end if
29:     end for
30:     return $(\tilde{x}, \tilde{y}, \tilde{\eta})$
31: end function
Algorithm 1 Distilled One-Shot Federated Learning

Our solution consists of 3 steps. These steps are summarized in Algorithm 1.

  1. A central server randomly initializes model parameters $\theta_0$. This initialization can be distributed to the clients as a random seed.

  2. The clients distill their datasets. Start by initializing the distilled data $\tilde{x}$, distilled labels $\tilde{y}$, and distilled learning rate $\tilde{\eta}$. Each entry in $\tilde{x}$ is drawn from a standard normal distribution, while $\tilde{\eta}$ is set to a predefined value $\tilde{\eta}_0$. The distilled labels are initialized as either one-hot vectors for classification problems or normally distributed random vectors for regression problems. Adapt the broadcast parameters $\theta_0$ into $\theta_1$ via gradient descent:

    $\theta_1 = \theta_0 - \tilde{\eta} \nabla_{\theta_0} \ell(\tilde{x}, \tilde{y}; \theta_0)$  (2)

    Afterwards, minimize the loss of $\theta_1$ evaluated on a minibatch of real data $(x_t, y_t)$:

    $\tilde{x} \leftarrow \tilde{x} - \alpha \nabla_{\tilde{x}} \ell(x_t, y_t; \theta_1)$  (3)
    $\tilde{\eta} \leftarrow \tilde{\eta} - \alpha \nabla_{\tilde{\eta}} \ell(x_t, y_t; \theta_1)$  (4)
    $\tilde{y} \leftarrow \tilde{y} - \alpha \nabla_{\tilde{y}} \ell(x_t, y_t; \theta_1)$  (5)

    where $\alpha$ is the distillation learning rate. Equation 5 only applies when using soft labels.

    This can be done with a sequence of distilled data $(\tilde{x}_i, \tilde{y}_i, \tilde{\eta}_i)$, $i = 1, \dots, S$, where $S$ is the number of distill steps, repeated for $E$ distill epochs. Each distilled example successively adapts $\theta_j$ into $\theta_{j+1}$ until we obtain $\theta_{SE}$ after $SE$ gradient descent updates. This dramatically increases the expressive power of dataset distillation at the expense of compute time. (A minimal code sketch of this client-side procedure follows this list.)

  3. The clients upload the distilled data to the server. If $S > 1$, the server sorts the distilled data by step index, e.g. the step sequences $d^1_1, d^1_2, \dots$ and $d^2_1, d^2_2, \dots$ from clients 1 and 2 become $d^1_1, d^2_1, d^1_2, d^2_2, \dots$, where each $d^k_i = (\tilde{x}^k_i, \tilde{y}^k_i, \tilde{\eta}^k_i)$ is a 3-tuple. The server then trains its own model on the combined sequence.
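To make Equations 2-5 concrete, below is a minimal PyTorch-style sketch of a single client's distillation loop under simplifying assumptions: one distill step, a tiny linear stand-in model instead of LeNet, and no soft resets or random masking. All names (`distill_client`, `forward`, hyperparameter defaults) are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def forward(params, x):
    """Tiny functional classifier (flatten -> linear); a stand-in for LeNet."""
    w, b = params
    return x.flatten(1) @ w + b

def distill_client(theta0, loader, n_iters=200, alpha=0.01, lr_init=0.02,
                   n_classes=10, soft_label=True):
    """One client's DistillData (Eqs. 2-5), simplified to a single distill step."""
    in_dim = theta0[0].shape[0]
    # Distilled inputs start as standard-normal noise; labels start one-hot
    # (stored as logits so they can become soft); the distilled learning rate
    # starts at a predefined value lr_init.
    x_syn = torch.randn(n_classes, in_dim, requires_grad=True)
    y_syn = torch.eye(n_classes).requires_grad_(soft_label)
    eta = torch.tensor(lr_init, requires_grad=True)

    learnable = [x_syn, eta] + ([y_syn] if soft_label else [])
    opt = torch.optim.Adam(learnable, lr=alpha)

    for _ in range(n_iters):
        x_real, y_real = next(iter(loader))   # minibatch from a shuffling DataLoader
        theta = [p.detach().clone().requires_grad_(True) for p in theta0]

        # Eq. 2: adapt theta0 on the distilled data, keeping the step differentiable.
        inner = F.cross_entropy(forward(theta, x_syn), y_syn.softmax(dim=-1))
        grads = torch.autograd.grad(inner, theta, create_graph=True)
        theta1 = [p - eta * g for p, g in zip(theta, grads)]

        # Eqs. 3-5: update the distilled data, learning rate (and soft labels)
        # to minimize the adapted parameters' loss on real data.
        outer = F.cross_entropy(forward(theta1, x_real), y_real)
        opt.zero_grad()
        outer.backward()
        opt.step()

    return x_syn.detach(), y_syn.detach(), eta.detach()
```

Here `theta0` is the broadcast initialization as a list of tensors, e.g. `[torch.randn(784, 10) * 0.01, torch.zeros(10)]`, and `loader` is an ordinary shuffling DataLoader over the client's private data.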

The last step can cause issues when the data is non-IID. Consider two clients $a$ and $b$ with distilled examples $\tilde{d}^a$ and $\tilde{d}^b$ respectively, both distilled for the same initialization $\theta_0$. The server first trains on $\tilde{d}^a$, arriving at new parameters $\theta'$, which are then trained on $\tilde{d}^b$. But $\tilde{d}^b$ has been distilled to train $\theta_0$, not $\theta'$. To combat this interference, we introduce two new techniques for improving performance on non-IID data.

Soft resets sample the starting parameters from a Gaussian distribution centered at the server's parameters $\theta_0$, with variance $\sigma^2$. By re-sampling between distillation iterations, dataset distillation learns more robust examples capable of training any model with weights near $\theta_0$. This technique is based on the "hard resets" introduced in Wang et al. (2018), which completely re-initialize the parameters. Data distilled with hard resets can be used on any randomly initialized model, but cannot train models to the same level of accuracy as data distilled without resets.

Random masking randomly selects a fraction of the distilled data each training iteration and replaces it with random tensors. These random tensors perturb the model during training, while also reducing the amount of distilled data that actually trains the starting parameters. After the training iteration, the original distilled data are restored. Sequences of distilled data can then still train a model even when there is interference from other distill steps. However, resetting and storing the distilled data is compute and memory intensive, which slows down distillation.

Finally, we explore an alternative setting called serial DOSFL. First, the server selects a single client to distill its data. Afterwards, the server updates the global model by training on that client's distilled data. A different client then performs dataset distillation targeting the updated parameters. This process repeats until every client has distilled their data, so the global model is updated $N$ times, once per client. The name serial is chosen because the next client can only begin distillation after the current client finishes. To distinguish the two settings, we refer to the previous one as parallel DOSFL.
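The two settings differ only in how the server orchestrates distillation and fitting. A sketch, reusing `forward` and `distill_client` from above and simplifying each client's contribution to a single distilled 3-tuple so that merging reduces to concatenation:

```python
import torch
import torch.nn.functional as F

def train_on(theta, distilled):
    """Server-side fitting: apply each distilled 3-tuple as one gradient step
    with its learned rate (uses forward() from the earlier sketch)."""
    for x_syn, y_syn, eta in distilled:
        theta = [p.detach().clone().requires_grad_(True) for p in theta]
        loss = F.cross_entropy(forward(theta, x_syn), y_syn.softmax(dim=-1))
        grads = torch.autograd.grad(loss, theta)
        theta = [(p - eta * g).detach() for p, g in zip(theta, grads)]
    return theta

def dosfl_parallel(theta0, client_loaders):
    """Parallel DOSFL: every client distills against the same broadcast
    initialization, then the server trains once on the merged distilled data."""
    distilled = [distill_client(theta0, loader) for loader in client_loaders]
    return train_on(theta0, distilled)

def dosfl_serial(theta0, client_loaders):
    """Serial DOSFL: each client distills against the current global model,
    which the server updates before the next client begins."""
    theta = theta0
    for loader in client_loaders:
        theta = train_on(theta, [distill_client(theta, loader)])
    return theta
```

With multiple distill steps per client, `train_on` would instead iterate over the interleaved sequence described in step 3 above.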

4 Experiments

We evaluate DOSFL on several federated classification tasks; cross-entropy loss is therefore used for all experiments. To train the distilled data, we use Adam Kingma and Ba (2014) with a learning rate that decays by a factor of 0.5 at a fixed epoch interval; the per-task values for federated MNIST, IMDB, and TREC-6 are listed in Appendix A. These hyperparameters are mirrored from Sucholutsky and Schonlau (2019) and have been found to be optimal or near-optimal. Clients distill the data for a fixed number of epochs with a fixed batch size; the per-dataset values are also given in Appendix A. All experiments were run on an Nvidia M40 GPU, taking 2-3 minutes per client for federated MNIST and on the order of a minute per client for the text tasks. We use the default train and test splits in PyTorch Paszke et al. (2019).

The client-server architecture is simulated by partitioning a dataset into one subset per client and then distilling these subsets. The server models have their weights Xavier initialized Glorot and Bengio (2010). Following the methodology of McMahan et al., IID partitions are created by randomly dividing the dataset McMahan et al. (2016). For non-IID partitions, we first sort the entire dataset by label and then divide it into shards of equal length. Starting from empty subsets, the shards are randomly assigned to the subsets until each holds the same number of shards. As the number of shards per client increases, the partition becomes more IID, with subsets more likely to contain examples from each class.
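A minimal NumPy sketch of this sort-and-shard partitioning; the function name and the fixed seed are illustrative, not taken from the paper's code:

```python
import numpy as np

def partition_non_iid(labels, n_clients, shards_per_client, seed=0):
    """Sort example indices by label, split them into equal-length shards, and
    deal a fixed number of random shards to each client. Returns one index
    array per client."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                                   # indices sorted by label
    shards = np.array_split(order, n_clients * shards_per_client)
    perm = rng.permutation(len(shards))                          # shuffle shard assignment
    return [
        np.concatenate([shards[j] for j in perm[i::n_clients]])  # shards_per_client shards each
        for i in range(n_clients)
    ]
```

For example, `partition_non_iid(mnist_labels, n_clients=10, shards_per_client=2)` yields a highly non-IID split in which each client sees only a few digit classes, while a large `shards_per_client` approaches the IID case.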

4.1 Image Classification

Additions              10 clients IID   10 clients non-IID   100 clients IID   100 clients non-IID
None                   94.02%
Soft label (SL)        95.64%           91.53%
Soft reset (SR)        93.05%           78.29%
Random masking (RM)    93.68%
SL, SR                 90.87%           78.83%
SR, RM                 88.07%
SR, RM, SL             88.95%
Table 1: Parallel DOSFL accuracy on federated MNIST. For reference, a centrally trained LeNet reaches roughly 99% test accuracy on MNIST, and 93% with dataset distillation.

We first test parallel DOSFL on 10 and 100 client federated MNIST with different combinations of soft labels, soft resets, and random masking. All federated MNIST experiments use LeNet as the model architecture LeCun et al. (1998). Our distilled data are not single examples but batches (the distill batch size is given in Appendix A); using soft labels, this batch size could be made much smaller. After the server model has trained on distilled data from the clients, its accuracy is measured on a test set. Each experiment is run 5 times, and the best result is reported in Table 1.
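The paper does not spell out its exact LeNet variant; for concreteness, a standard LeNet-5-style PyTorch module such as the following could serve as the client and server model (layer sizes are an assumption):

```python
import torch.nn as nn

class LeNet(nn.Module):
    """LeNet-5-style network for 28x28 MNIST digits (a plausible stand-in)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```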

The distill steps, distill epochs, and initial distilled learning rate are listed in Appendix A. Of the proposed additions to dataset distillation, soft resets provide the largest jump in non-IID performance, followed by random masking and soft labels. The reset variance and masking probability were chosen by a small grid search. However, soft labels also boost accuracy when data is IID, whereas the other two methods cause dips in the final accuracy. The benefits of the additions do not stack; even with all add-ons, non-IID DOSFL remains well below the IID results. Surprisingly, the behavior of these additions changes depending on the number of clients: while accuracies in the 100 client case are lower in general, soft resets and no additions work comparatively better there.

We also evaluate serial DOSFL, with soft resets and soft labels, on federated MNIST, using the same distill steps and distill epochs as above. The best out of 5 runs, as determined by final accuracy, is shown in Figure 2. As expected, serial DOSFL performs better than parallel DOSFL when the data is IID (almost 99% vs. 96%; cf. Section 1). This advantage disappears when the data is extremely non-IID. The reason is simple: when the clients' datasets are IID, a model trained on one client's dataset transfers to the others. Encouragingly, the server model achieves its final accuracy after as little as 15% of the clients have finished performing dataset distillation. This holds even when the data is non-IID, although it then takes longer (around 40% of the clients).

Figure 2: Performance of parallel and serial DOSFL on IID federated MNIST, with soft resets and soft labels, vs. the number of clients distilled. (a) Accuracy for 10 clients. (b) Accuracy for 100 clients.

We conclude with an analysis of the impact of non-IIDness on DOSFL. We ran serial and parallel DOSFL on 10 client federated MNIST over a range of shard counts. The results are given in Figure 3. Importantly, both parallel and serial DOSFL maintain their IID performance even as the shard count drops to a moderately non-IID setting in which each client on average still contains examples of all 10 digits, although serial DOSFL curves slightly downward as the shard count decreases while parallel DOSFL stays flat. Beyond this point, test accuracy degrades quickly until parallel and serial DOSFL reach similar test accuracies in the most non-IID setting.

Figure 3: DOSFL performance on non-IID federated MNIST with soft labels and soft resets. (a) Parallel DOSFL accuracy for different shard counts. (b) Serial DOSFL accuracy for different shard counts.

4.2 Text Classification

To show that DOSFL is not limited to image-based tasks, we test DOSFL on federated IMDB (sentiment analysis) Maas et al. (2011) and federated TREC-6 (question classification) Voorhees and Harman (2000). Directly applying dataset distillation to language tasks is challenging because text data is discrete: each token is a one-hot vector with dimension equal to the vocabulary size, which can be in the thousands. To overcome this issue, we use pre-trained GloVe embeddings with a lookup table to convert one-hot token ids to word vectors in 100-dimensional Euclidean space Pennington et al. (2014). Distilled sentences are then fixed-size real-valued matrices. Real sentences are also padded or truncated to the same fixed length: 200 tokens for federated IMDB and 30 for federated TREC-6.
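To make the embedding trick concrete, a minimal sketch follows; the function name and the source of `glove_vectors` (a (vocab_size, 100) tensor of pre-trained GloVe vectors loaded, e.g., from the published GloVe files) are assumptions rather than the paper's code.

```python
import torch

def embed_and_pad(token_ids, glove_vectors, max_len=200):
    """Map a token-id sequence to a fixed-size real-valued matrix: look up
    pre-trained 100-d vectors and pad (or truncate) to max_len rows."""
    vecs = glove_vectors[torch.as_tensor(token_ids[:max_len])]   # (len, 100) lookup
    pad = torch.zeros(max_len - vecs.shape[0], vecs.shape[1])    # zero padding
    return torch.cat([vecs, pad], dim=0)                         # (max_len, 100)

# Distilled "sentences" are learned directly in this embedding space, e.g. a
# (distill_batch, max_len, 100) tensor initialized from a standard normal:
distilled_sentences = torch.randn(2, 200, 100, requires_grad=True)
```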

For federated IMDB, we use a simple CNN model, which we call TextCNN. We test parallel and serial DOSFL with soft labels for 10 and 100 clients on IID and non-IID federated IMDB; the distill steps, distill epochs, distill batch size, and starting distill learning rate are listed in Appendix A. The results, best out of 5 runs, are provided in Table 2. Almost all settings surpass the dataset distillation baseline Sucholutsky and Schonlau (2019), except for 100 client parallel DOSFL. Furthermore, all results are close to the centralized baseline of 86.1%, with serial DOSFL coming within 1%. Since there are only 2 classes in the IMDB dataset (positive or negative sentiment), non-IID performance is within 2% of IID: the majority of clients contain examples of both classes, whereas in federated MNIST no client can have more than 4 classes.
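The "customized CNN" is likewise not specified in detail; a common Kim-style 1-D convolutional classifier over the 100-dimensional embedded sentences, such as the sketch below, is one plausible stand-in (the kernel sizes and channel counts are assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Stand-in sentiment classifier: 1-D convolutions over a
    (batch, seq_len, 100) matrix of GloVe vectors, max-pooled per filter
    and fed to a linear output layer."""
    def __init__(self, embed_dim=100, num_classes=2,
                 kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, x):                  # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # -> (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))
```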

For federated TREC-6, we adopt a Bi-LSTM model to show that DOSFL can be used with non-CNN architectures. We use 2 and 29 clients, since the dataset contains 5452 examples and the client dataset sizes must be divisible by the shard count. The amount of training data per client for 2 client federated TREC-6 and 10 client federated IMDB is almost equal, and likewise 29 client federated TREC-6 is comparable with 100 client federated IMDB. Results are recorded in Table 2; the corresponding hyperparameters appear in Appendix A. Due to the low number of clients, we were able to reduce the amount of distilled data compared to the previous two tasks. Unlike federated IMDB, there is a larger, roughly 6% gap in accuracy between the IID and non-IID settings, as the average client only sees a fraction of the 6 classes.

Dataset   Setting     10 clients IID   10 clients non-IID   100 clients IID   100 clients non-IID
IMDB      Parallel
IMDB      Serial

Dataset   Setting     2 clients IID    2 clients non-IID    29 clients IID    29 clients non-IID
TREC-6    Parallel
TREC-6    Serial
Table 2: Parallel and serial DOSFL performance on federated IMDB and federated TREC-6. For reference, TextCNN reaches 86.1% test accuracy on IMDB and the Bi-LSTM reaches 89% on TREC-6 when trained centrally; the corresponding dataset distillation baselines are those of Sucholutsky and Schonlau (2019).

5 Discussion

Communication. We now compare the total communication cost of DOSFL with that of FedAvg, measured in the number of scalar values sent between the clients and the server. Since the server model's initialization can be distributed as a random seed, we ignore the cost of the first server-to-client transmission. Let $C$ be the fraction of the $N$ clients that participate each round. FedAvg sends $|\theta|$ server model parameters to each participating client, who responds with $|\theta|$ locally trained parameters, so its lifetime communication cost is $2CN|\theta|T$, where $T$ is the number of communication rounds. For parallel DOSFL, we only need to consider the expense of sending distilled data to the server. Thus, the communication cost of parallel DOSFL is $NSBd$, where $S$ is the number of distill steps, $B$ is the batch size of the distilled data, and $d$ is the number of elements in each data point.

Dataset   Model      Break-even round
MNIST     LeNet      10
IMDB      TextCNN    1
TREC-6    Bi-LSTM    1
Table 3: Communication comparison between DOSFL and FedAvg.

We compare the number of communication rounds, the break-even round, needed for the lifetime cost of FedAvg to equal that of DOSFL for the tasks in Section 4. Note that this value is independent of the number of clients $N$, since $N$ appears in the cost expressions of both FedAvg and parallel DOSFL. Break-even rounds for federated MNIST, IMDB, and TREC-6 are provided in Table 3. The higher break-even round for MNIST, compared to the text tasks, is due to LeNet having significantly fewer parameters than either TextCNN or the Bi-LSTM.
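As a sanity check on this arithmetic, a small helper with purely illustrative sizes (the paper's exact model and distilled-data dimensions are not reproduced here):

```python
import math

def break_even_rounds(model_params, elems_per_example, distill_batch,
                      distill_steps, participation=1.0):
    """Smallest number of FedAvg rounds whose per-client lifetime cost
    (2 * C * |theta| scalars per round, upload plus download) matches parallel
    DOSFL's one-time upload of S * B synthetic examples of d elements each."""
    dosfl_cost = distill_steps * distill_batch * elems_per_example
    fedavg_cost_per_round = 2 * participation * model_params
    return math.ceil(dosfl_cost / fedavg_cost_per_round)

# Toy numbers only (not the paper's exact sizes): a ~60k-parameter LeNet and
# 10 distilled steps of 10 MNIST-sized images (28*28 values each).
print(break_even_rounds(60_000, 28 * 28, distill_batch=10, distill_steps=10))  # -> 1
```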

Privacy. DOSFL, by virtue of communicating synthetic data instead of model weights, releases less information to the server than vanilla FedAvg. Distilled data targeted at a specific model initialization appears random to the human eye Wang et al. (2018). Furthermore, the distilled data are useless without the model initialization for which they were distilled: using distilled data to train a different initialization produces no gain in test accuracy. At most, the server could train its model on one client's distilled data, after which it only obtains a trained local model, which is exactly what naive FedAvg forces clients to share with the server anyway. However, secure multiparty computation makes it possible for client model parameters to remain secret while still computing their average Bonawitz et al. (2016).

Parallel vs. serial DOSFL. For most federated learning applications, parallel DOSFL is the better choice. It performs nearly as well as serial DOSFL when the data is IID and similarly when data is non-IID. Furthermore, serial DOSFL takes longer to finish since only one client can distill data at a time; even with early stopping, it could take up to $N$ times longer than parallel DOSFL. Finally, serial DOSFL has the server send the global model weights to the clients $N$ times, costing more bandwidth than parallel DOSFL. Serial DOSFL should only be used when the number of clients is small and maximum performance is desired.

DOSFL is best suited for cross-silo federated learning, where 2-100 organizations seek to learn a shared model without sharing data Kairouz et al. (2019). In cross-silo learning, participants can likely dedicate hardware for the sole purpose of federated learning. Large models are also likely, and the communication savings of DOSFL grow as models get larger. DOSFL is less suited for cross-device federated learning, where billions of low-powered devices may participate Kairouz et al. (2019). These devices are multi-purpose, so learning must take place while a device is idle, and clients do not have much uninterrupted time to perform dataset distillation. In extreme cases, dataset distillation may be too expensive for mobile hardware.

6 Conclusion

In this paper, we have proposed a new algorithm for one-shot federated learning and tested it on several tasks (federated MNIST, IMDB, and TREC-6) and neural networks (LeNet, TextCNN, and Bi-LSTM). Experimental results show that Distilled One-Shot Federated Learning can come close to the centralized baseline after only one communication round. DOSFL provides an alternative to gradient or weight averaging, relying instead on dataset distillation, and is well suited for cross-silo federated learning. In the future, we plan to test DOSFL on more difficult supervised learning tasks requiring larger models, where FedAvg is strained communication-wise; federated next word prediction and regression tasks are particularly promising. More generally, we hope to improve the non-IID performance of DOSFL and to incorporate existing federated learning advances.

Broader Impact

DOSFL represents a departure from previous federated learning algorithms based around gradient or weight averaging. Its lower communication cost could see it replace these existing algorithms in certain cases or used to jump start a federated learning session. Our algorithm is ideal for cross-silo federated learning, where organizations cooperate to train a global model. The number of clients is small, about 2 to 100, plus clients are available to perform dataset distillation Kairouz et al. (2019). Furthermore, data may be highly non-IID, and non-IID performance is a strong point of DOSFL. Improved cross-silo federated learning could spur a democratization of machine learning services. Currently, only the largest and most invasive corporations possess data in large enough quantities to build compelling products using machine learning. Federated learning could enable several smaller businesses to pool their data to compete with multi-national corporations on equal footing. This could be most beneficial in the financial or healthcare industries, which have legal restrictions on how data can be distributed.

DOSFL is not well suited for cross-device federated learning, where billions of mobile devices may be involved. These devices are often low-power, have unreliable networking, and are rarely available for training. In extreme cases, devices may not have enough computation power to support dataset distillation. Yet cross-device federated learning is where communication is most constrained, where communication cost-savings would be most valuable. Federated learning also introduces security risks not present in centralized training. Clients may maliciously upload weights to worsen training or cause the global model to secretly mispredict certain types of inputs. Here DOSFL suffers, given that dataset distillation can be used for model poisoning attacks Wang et al. (2018). Additional security mechanisms could be adapted to DOSFL when needed, such as fully or partially homomorphic encryption Gentry and Boneh (2009). Follow-up research is needed on the security and privacy of DOSFL.

We believe DOSFL is a promising algorithm with room for future improvements. However, the whole field of one-shot federated learning deserves additional attention. There are concrete gains to be had by replacing weight averaging with other forms of aggregation, such as ensembling Guha et al. (2019) or model distillation Li and Wang (2019). Techniques from transfer learning and out-of-distribution generalization may also be applicable; it may be possible to extend learned models from one client to another even when data is highly non-IID. Which of these aggregation methods proves most useful is an open question. We hope that DOSFL encourages other researchers in academia and industry to investigate the new paradigms of weight-less and gradient-less federated learning.


References

  • [1] A. Angelova, Y. Abu-Mostafa, and P. Perona (2005) Pruning training sets for learning of object categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 494–501. Cited by: §2.2.
  • [2] O. Bachem, M. Lucic, and A. Krause (2017) Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476. Cited by: §2.2.
  • [3] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2016) Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482. Cited by: §5.
  • [4] Y. Chen, X. Sun, and Y. Jin (2019) Communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation. arXiv preprint arXiv:1903.07424. Cited by: §2.1.
  • [5] N. El Gayar, F. Schwenker, and G. Palm (2006) A study of the robustness of knn classifiers trained using soft labels. In Proceedings of the Second International Conference on Artificial Neural Networks in Pattern Recognition, ANNPR'06, Berlin, Heidelberg, pp. 67–80. ISBN 3540379517. Cited by: §2.2.
  • [6] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2.2.
  • [7] C. Gentry and D. Boneh (2009) A fully homomorphic encryption scheme. Vol. 20, Stanford university Stanford. Cited by: Broader Impact.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4.
  • [9] N. Guha, A. Talwlkar, and V. Smith (2019) One-shot federated learning. arXiv preprint arXiv:1902.11175. Cited by: §1, §2.1, Broader Impact.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.2.
  • [11] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §5, Broader Impact.
  • [12] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh (2019) SCAFFOLD: stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378. Cited by: §1, §2.1.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • [15] D. Li and J. Wang (2019) FedMD: heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581. Cited by: §1, Broader Impact.
  • [16] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2019) Federated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873. Cited by: §1, §2.1.
  • [17] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2018) Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127. Cited by: §1.
  • [18] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189. Cited by: §2.1.
  • [19] W. Liu, L. Chen, Y. Chen, and W. Zhang (2019) Accelerating federated learning via momentum gradient descent. arXiv preprint arXiv:1910.03197. Cited by: §2.1.
  • [20] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011-06) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. External Links: Link Cited by: §4.2.
  • [21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §1, §2.1, §4.
  • [22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.
  • [23] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §4.2.
  • [24] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani (2019) Fedpaq: a communication-efficient federated learning method with periodic averaging and quantization. arXiv preprint arXiv:1909.13014. Cited by: §2.1.
  • [25] S. Salehkaleybar, A. Sharifnassab, and S. J. Golestani (2019) One-shot federated learning: theoretical limits and algorithms to achieve them. arXiv preprint arXiv:1905.04634. Cited by: §2.1.
  • [26] F. Sattler, S. Wiedemann, K. Müller, and W. Samek (2019) Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems. Cited by: §2.1.
  • [27] O. Sener and S. Savarese (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489. Cited by: §2.2.
  • [28] A. Sharifnassab, S. Salehkaleybar, and S. J. Golestani (2019) Order optimal one-shot distributed learning. In Advances in Neural Information Processing Systems, pp. 2165–2174. Cited by: §2.1.
  • [29] N. Shoham, T. Avidor, A. Keren, N. Israel, D. Benditkis, L. Mor-Yosef, and I. Zeitak (2019) Overcoming forgetting in federated learning on non-iid data. arXiv preprint arXiv:1910.07796. Cited by: §1, §2.1.
  • [30] I. Sucholutsky and M. Schonlau (2019) Soft-label dataset distillation and text dataset distillation. External Links: 1910.02551 Cited by: §1, §2.2, §4.2, Table 2, §4.
  • [31] I. W. Tsang, J. T. Kwok, and P. Cheung (2005) Core vector machines: fast svm training on very large data sets. Journal of Machine Learning Research 6 (Apr), pp. 363–392. Cited by: §2.2.
  • [32] E. M. Voorhees and D. Harman (2000) Overview of the sixth text retrieval conference (trec-6). Information Processing & Management 36 (1), pp. 3–35. Cited by: §4.2.
  • [33] T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018) Dataset distillation. arXiv preprint arXiv:1811.10959. Cited by: §1, §2.2, §2.2, §3, §3, §5, Broader Impact.
  • [34] W. Wu, L. He, W. Lin, S. Jarvis, et al. (2019) SAFA: a semi-asynchronous protocol for fast federated learning with low overhead. arXiv preprint arXiv:1910.01355. Cited by: §2.1.
  • [35] C. Xie, S. Koyejo, and I. Gupta (2019) Asynchronous federated optimization. arXiv preprint arXiv:1903.03934. Cited by: §2.1.
  • [36] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §2.1.

Appendix A Hyperparameters

Hyperparameter                     Symbol    MNIST    IMDB    TREC-6
Distill batch size
Distill steps
Distill epochs
Initial distilled learning rate
Learning rate
Learning rate decay
Learning rate decay period
Random masking probability
Soft reset variance
Table 4: DOSFL hyperparameters. A dash (–) indicates that the value is constant across different tasks.

Appendix B Example distilled images

Figure 4: First step of distilled images from 1 out of 10 clients for IID federated MNIST with no additions (i.e., without soft labels, soft resets, or random masking).
Figure 5: First step of distilled images from 1 out of 10 clients for non-IID federated MNIST with no additions (i.e., without soft labels, soft resets, or random masking).
Figure 6: First step of distilled images from 1 out of 100 clients for IID federated MNIST with soft labels. The values above each image are the 3 labels with the highest logits.

Appendix C Example distilled sentences

C.1 IMDB

We provide a distilled sentence from one of 100 clients for federated IMDB with IID distribution. The logit is 1.63 for the positive class and 0 for the negative class. The corresponding distill learning rate is 0.0272. shaw malone assembled shelly pendleton tha insanity vietnam finishes morton leather watts respectable mastery funky idle watched peripheral ely glossy 1934 honed periods suppress setting eden arises resides moses aura succumb prc missing dyer angela emulate showcased meredith embraces bonnie translates replicate potts segment affects enhances stein juliet bumping mystic resistance token alienate hays unnamed mira rewarded fateful aspire uniformly bliss mermaid burnt joins unforgettable martino namely marshal ivan morse segment pleads boasting victorian closeness rafael reid saddle boot hawks lingered landon …

Further, we exhibit a distilled sentence for non-IID federated IMDB. The logit is 1.68 for the positive class and 0 for the negative class. The corresponding distill learning rate is 0.0284. outset wed burroughs grossly contacted reginald anticipating dimitri returns nap housed feeds pitting woodward potts graduates attendant inherit superficial pleasure yanks pills salem tombstone mcintyre finishes ponder pa concede thru herzog getting supports claudio board elevated lieu chaney cashing meantime denise disposition mess whopping comprehend slicing haley cronies screens zombie assures separately ill. debacle helm aroused scrape minuscule dozen wears devoid bio drunken recommendation shrewd denying decaying blocks primal housekeeper moviegoers mates crook useless dictates cap …

C.2 TREC-6

In addition, we show a distilled sentence from 1 out of 29 clients for federated TREC6 with IID distribution. The logit is 1.96 for class 1, and 0 for the remaining classes. The corresponding distill learning rate is 2.25. conversion loop monster manufactured causing besides stealing yankee 1932 igor supplier nicholas lloyd sees businessman alternate alternate photograph portrayed tale trials 49 principal sequel authors topped donation fictional bull philip

At last, We illustrate a distilled sentence from 1 out of 29 clients for federated TREC6 with non-IID distribution. The logit is 1.58 for class 2, and 0 for the remaining classes. The corresponding distill learning rate is 1.87. fair listen programming helps lose remembered block changed classical learning break tap klein stole quick reed solomon mouse extension sisters virtual holmes knight medieval norman newton rider nobel rhode murdered