Decentralized Learning of Generative Adversarial Networks from Multi-Client Non-iid Data

05/23/2019
by Ryo Yonetani et al.

This work addresses a new problem of learning generative adversarial networks (GANs) from multiple data collections that are each i) owned separately and privately by different clients and ii) drawn from a non-identical distribution that comprises different classes. Given such multi-client and non-iid data as input, we aim to achieve a distribution involving all the classes the input data can belong to, while keeping the data decentralized and private in each client's storage. Our key contribution to this end is a new decentralized approach for learning GANs from non-iid data called Forgiver-First Update (F2U), which a) asks clients to train an individual discriminator with their own data and b) updates a generator to fool the most 'forgiving' discriminators, i.e., those that deem generated samples the most real. Our theoretical analysis proves that this updating strategy indeed allows the decentralized GAN to learn a generator's distribution covering all the input classes as its global optimum, based on f-divergence minimization. Moreover, we propose a relaxed version of F2U called Forgiver-First Aggregation (F2A), which adaptively aggregates the discriminators while emphasizing forgiving ones and performs well in practice. Our empirical evaluations with image generation tasks demonstrate the effectiveness of our approach over state-of-the-art decentralized learning methods.


1 Introduction

Large-scale datasets as well as high-performance computational resources are arguably vital for training many state-of-the-art deep learning models. Typically, such datasets have been curated from publicly available data (e.g., [Russakovsky2015a, Gemmeke2017]), and models have been trained on a single workstation or a well-organized computer cluster. At the same time, increasing attention is being paid to decentralized learning, where multiple clients collaboratively utilize their private data and computational resources to enable large-scale training while the data are kept decentralized in each client's storage. Unlike much related work focusing on supervised decentralized learning, this work addresses an unsupervised task, more specifically, learning generative adversarial networks (GANs) [Goodfellow2014] from decentralized data.

Figure 1: Problem setting. (a) Individual clients each have a private data collection drawn from a non-identical distribution that comprises different classes. (b) Given such multi-client non-iid data as input, we learn a generative adversarial network in the decentralized setting to achieve (c) a distribution involving all the input classes, specifically $\bar{p}(x) \propto \max_n p_n(x)$.

Particularly, we are interested in learning a generative model from multiple image data collections that are each i) owned separately and privately by different clients, and ii) drawn from non-identical data-generating distributions that comprise different classes (e.g., image categories; see also Figure 1 (a)). Given such multi-client and non-iid data as input, we aim to achieve a generative model of a distribution that involves all the classes the input data can belong to (Figure 1 (c)). Doing so allows us to generate diverse samples that can be observed collectively under various clients' environments, and will ultimately benefit many applications including image translation [Isola2017, Zhu2017, Choi2018], anomaly detection [Schlegl2017], data compression [Agustsson2018], and domain adaptation [Tzeng2017]. However, curation of clients' private data should be prohibited due to privacy concerns (e.g., life-logging videos [Chowdhury2016], biological data [Ching2018], and medical data [Li2005]). This dilemma between data utility and privacy makes it hard to aggregate all the client data in a central server, necessitating decentralized learning approaches.

Nevertheless, it is hard to determine how supervised decentralized learning, which has been extensively studied [Lian2017, Jiang2017, Wen2017, Bonawitz2019], can be adapted for learning generative models from decentralized non-iid data. An exception proposed recently is the decentralized learning of GANs [Hardy2018], which lets each client train an individual discriminator with their own data while asking a central server to update a generator to fool those discriminators (Figure 1 (b)). While allowing clients to decentralize their data in their own storage, this approach i) restricts all client data to be iid, and ii) otherwise has no theoretical guarantee on what distribution will be learned. Consequently, little work has been done on the decentralized learning of generative models from non-iid data, even though data non-iidness is one of the key properties of practical decentralized learning settings [Mcmahan2017].

Given this background, our main contribution is twofold. Firstly, we propose a new unsupervised decentralized approach for learning GANs from multi-client non-iid data, which we refer to as Forgiver-First Update (F2U). Specifically, given multiple discriminators $D_1, \dots, D_N$ each trained by different clients with non-identical distributions $p_1, \dots, p_N$, F2U allows a generator to learn $\bar{p}(x) = \frac{1}{Z}\max_n p_n(x)$ ($Z$ is the normalizing constant), which comprises all the input classes, including rare ones observed only by a small fraction of the clients as well as common ones shared by many. Our theoretical analysis based on $f$-divergence minimization proves that $\bar{p}$ can be achieved as the decentralized GAN's global optimum by letting the generator fool the most 'forgiving' discriminators for each generated sample, i.e., those which deem the sample the most real and closest to what they own.

Secondly, we present a relaxed version of F2U called Forgiver-First Aggregation (F2A). Instead of selecting the single most forgiving discriminator, F2A adaptively aggregates the judgments that the discriminators make on generated samples, while emphasizing those from more forgiving discriminators, and updates the generator with the aggregated judgments. While sacrificing the theoretical guarantee, F2A often performs better than F2U in practice. Moreover, F2A can be combined with off-the-shelf secure aggregation techniques such as [Bonawitz2017] to make its training process secure. Technically, the adaptive aggregation is done by a regularized weighted averaging function whose weight parameter is also updated via back-propagation, allowing the generator to better adapt to the non-iidness of the input data.

We empirically evaluated our approach with image generation tasks on several public image datasets. The experimental results demonstrate that the decentralized GANs trained with F2U and F2A clearly outperformed several state-of-the-art approaches [Hardy2018, Durugkar2016].

2 Preliminaries

Problem Setting

Consider $N$ clients who each have their own private data collection $X_n$ ($n = 1, \dots, N$) drawn from a hidden, non-identical data-generating distribution $p_n$ that comprises different disjoint classes (e.g., the two clients observing different classes in Figure 1). Given multi-client non-iid data $X_1, \dots, X_N$ as input, we address the problem of learning a generative adversarial network with the generator's distribution given by $\bar{p}(x) = \frac{1}{Z}\max_n p_n(x)$, where $Z$ is the normalizing constant, while keeping $X_1, \dots, X_N$ decentralized and private such that each $X_n$ is visible only to the $n$-th client. The distribution $\bar{p}$ contains all the classes that $X_1, \dots, X_N$ collectively have (i.e., all the classes in Figure 1)¹. Compared to other possible distributions that could be learned from $X_1, \dots, X_N$, such as the uniform mixture $\frac{1}{N}\sum_n p_n(x)$ or a data-size-weighted mixture $\sum_n w_n p_n(x)$ with $w_n \propto |X_n|$, learning a generator for $\bar{p}$ is advantageous when we aim to generate diverse samples including those of rare classes observed only by a small fraction of $X_1, \dots, X_N$ as well as common ones shared by many.

¹More generally, consider a set of disjoint classes, each of which has its own data-generating distribution $q_k$. Then we denote by $C_n$ the subset of classes that $p_n$ comprises, and define $p_n(x) = \sum_{k \in C_n} \pi_k^{(n)} q_k(x)$, where $\pi^{(n)}$ is a class prior that satisfies i) $\sum_{k \in C_n} \pi_k^{(n)} = 1$ and ii) $\pi_k^{(n)} > 0$ if $k \in C_n$ and $\pi_k^{(n)} = 0$ otherwise. Because we assume the classes to be disjoint, $\max_n p_n(x) > 0$ exactly where some $q_k$ with $k \in \bigcup_n C_n$ is positive; namely, $\bar{p}$ comprises all the classes $X_1, \dots, X_N$ collectively have.
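As a concrete contrast, the following toy sketch (the class supports and data sizes are made up) compares a data-size-weighted mixture with the target $\bar{p}(x) \propto \max_n p_n(x)$ on a small discrete example; note how the class observed only by the small client nearly vanishes from the mixture but not from $\bar{p}$.

```python
import numpy as np

# Toy, discrete illustration of the target distribution bar_p(x) ∝ max_n p_n(x).
# All values below are hypothetical: three disjoint classes, two clients, and
# client 1 holds far less data than client 2.
p1 = np.array([0.5, 0.5, 0.0])   # client 1 observes classes {0, 1}
p2 = np.array([0.0, 0.5, 0.5])   # client 2 observes classes {1, 2}
sizes = np.array([500, 9500])    # |X_1| << |X_2| (assumed)

# Data-size-weighted mixture: the rare class 0 is almost suppressed.
w = sizes / sizes.sum()
p_mix = w[0] * p1 + w[1] * p2

# Target of this paper: bar_p ∝ max_n p_n keeps every observed class prominent.
p_bar = np.maximum(p1, p2)
p_bar /= p_bar.sum()             # the 1/Z normalization

print("mixture:", np.round(p_mix, 3))   # [0.025 0.5   0.475]
print("bar_p  :", np.round(p_bar, 3))   # [0.333 0.333 0.333]
```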

Generative Adversarial Networks

As a preliminary, let us briefly introduce a formulation for training GANs in the centralized setting. Namely, we assume that data samples $x$ drawn from a data-generating distribution $p$ can be accessed in a single place without any restriction. Typically, GANs consist of a generator $G$ and a discriminator $D$, which work as follows: $G$ takes as input a noise vector $z$ drawn from a normal distribution $\mathcal{N}(0, I)$ to generate a realistic sample $G(z)$; ideally, the generated samples follow a generator's distribution $p_g$ that matches $p$. $D$ receives either real samples $x$ or generated ones $G(z)$ and discriminates between them.

Training of GANs proceeds based on the competition between $G$ and $D$; they are coupled and updated by minimizing the following two objective functions alternately:

$L_D = \mathbb{E}_{x \sim p}\big[\, l(D(x), 1) \,\big] + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\, l(D(G(z)), 0) \,\big],$    (1)

$L_G = \mathbb{E}_{x \sim p}\big[\, l(D(x), 1) \,\big] + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\, l(D(G(z)), 1) \,\big],$    (2)

where $\mathbb{E}_{z \sim \mathcal{N}(0, I)}[\, l(D(G(z)), 1) \,]$ can also be represented by $\mathbb{E}_{x \sim p_g}[\, l(D(x), 1) \,]$ with the generator's distribution $p_g$.

The loss $l(\cdot, \cdot)$ is defined differently depending on the choice of loss function, such as the binary cross entropy [Goodfellow2014] or the mean squared error [Mao2017], and measures how the judgments of $D$ differ from the labels $1$ (real) or $0$ (fake). Given mini-batches of real samples and noise vectors, $D$ is updated with Eq. (1) via back-propagation while $G$ is fixed, so as to detect generated samples more accurately. $G$ is updated with Eq. (2) while fixing $D$ (note that the first term of Eq. (2) does not depend on $G$), so as to generate more realistic samples that are more likely to fool $D$.
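A minimal sketch of Eqs. (1)-(2) with the mean-squared error loss is shown below; the judgment values are made up and back-propagation is omitted.

```python
import numpy as np

def mse(pred, label):
    # l(a, b) instantiated as the mean-squared error of LSGANs [Mao2017].
    return np.mean((pred - label) ** 2)

def discriminator_loss(d_real, d_fake):
    # Eq. (1): push judgments on real samples toward 1 and on generated ones toward 0.
    return mse(d_real, 1.0) + mse(d_fake, 0.0)

def generator_loss(d_real, d_fake):
    # Eq. (2): the generator wants the discriminator to output 1 on generated samples;
    # the real-sample term is constant w.r.t. G and only matters for the analysis.
    return mse(d_real, 1.0) + mse(d_fake, 1.0)

# d_real would be D(x) on a real mini-batch, d_fake would be D(G(z)) on a generated one.
d_real, d_fake = np.array([0.9, 0.8, 0.95]), np.array([0.2, 0.1, 0.3])
print(discriminator_loss(d_real, d_fake), generator_loss(d_real, d_fake))
```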

Decentralized Learning of GANs

Now we consider our main problem: decentralized learning of GANs from multi-client non-iid data. Following the basic idea of the Multi-Discriminator GAN (MD-GAN) proposed in [Hardy2018], we consider the existence of a server that collaborates with the clients to learn GANs. Under the decentralized setting, it is reasonable to ask each client to learn its own discriminator $D_n$ with $X_n$ and to inform the server how $D_n$ judges generated samples instead of directly sharing $X_n$. Then the server can update the generator $G$ to fool $D_1, \dots, D_N$.

Within this approach, we focus particularly on the following challenges to address our problem:

  1. How can the judgments $D_1(x), \dots, D_N(x)$ be taken into account by $G$ to learn $\bar{p}$, especially when $p_1, \dots, p_N$ are non-identical?

  2. How can each $X_n$ be kept secret from any party other than the $n$-th client?

Unfortunately, both of these challenges remain unsolved in MD-GAN. As we introduced in Section 1, it assumes all of the input data collections to be iid, and otherwise has no theoretical guarantee on what will be learned as the generator's distribution. Moreover, it requires clients to exchange discriminators periodically (e.g., client 1 updates $D_2$ with $X_1$ while client 2 updates $D_1$ with $X_2$) to avoid each $D_n$ from overfitting to the discrimination of a single $X_n$. This exchange, however, allows clients to infer what other clients own (e.g., inferring from $D_2$ whether $X_2$ is likely to contain a certain sample). Other relevant GANs using multiple discriminators, such as Generative Multi-Adversarial Networks (GMAN) [Durugkar2016], also assume all the data samples to be drawn from the same distribution, making it difficult to address the first concern.

3 Proposed Approach

In this section, we present i) Forgiver-First Update (F2U), which is proven to achieve $\bar{p}$ as the global optimum of a decentralized GAN with multiple discriminators; and ii) Forgiver-First Aggregation (F2A), which works well in practice and can also involve off-the-shelf secure aggregation techniques to preserve data privacy under certain settings.

3.1 Forgiver-First Update

As its name implies, F2U asks the generator $G$ to be updated against the discriminator that gives the most forgiving judgment for each generated sample $x$, i.e., $\hat{D}(x) = \max_n D_n(x)$. To better understand this approach, Figure 1 (a) illustrates an example of two-client non-iid data. Our key insight is that, when client data comprise different classes as shown in the figure, the discriminators trained from them will judge each sample differently depending on where the sample is, e.g., $D_1(x) > D_2(x)$ within the support of $p_1$ and $D_2(x) > D_1(x)$ within that of $p_2$. Accordingly, selecting $\max_n D_n(x)$ to update $G$ intuitively means selecting $\max_n p_n(x)$ (up to normalization) as the data distribution that $G$ will learn.
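A minimal sketch of the forgiver-first selection on a mini-batch of generated samples is given below (the judgment values are made up); the generator's LSGAN loss is measured against the single most forgiving discriminator per sample.

```python
import numpy as np

def f2u_generator_loss(d_outputs):
    # d_outputs: shape (N_clients, batch_size), entries are D_n(G(z)) for each client.
    forgiving = d_outputs.max(axis=0)        # hat{D}(x) = max_n D_n(x), per sample
    chosen = d_outputs.argmax(axis=0)        # index of the most forgiving discriminator
    loss = np.mean((forgiving - 1.0) ** 2)   # LSGAN generator loss against hat{D}
    return loss, chosen

d_outputs = np.array([[0.1, 0.7, 0.4],      # judgments from D_1 (made up)
                      [0.6, 0.2, 0.5]])     # judgments from D_2 (made up)
loss, chosen = f2u_generator_loss(d_outputs)
print(loss, chosen)  # in real training, gradients flow through max_n D_n(G(z)) only
```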

Theoretical Results with Least-Square GANs

Below we prove that our decentralized GAN achieves $\bar{p}$ as the global optimum if $G$ is updated with $\hat{D}(x) = \max_n D_n(x)$. Here we focus on a typical setting of least-squares GANs (LSGANs) [Mao2017] that will be used in our experiments, where $l$ is defined by the mean-squared error, $l(a, b) = (a - b)^2$, with labels $1$ (real) and $0$ (fake) (another case with the standard GAN is also presented in the supplementary material).

As shown in [Mao2017], the optimal discriminator given the data-generating distribution $p$ and the generator's distribution $p_g$ is:

$D^*(x) = \dfrac{p(x)}{p(x) + p_g(x)}.$    (3)

This leads to the following lemma.

Lemma 1.

If each $D_n$ is trained optimally from the data-generating distribution $p_n$ and the generator's distribution $p_g$, then $\hat{D}(x) = \max_n D_n(x)$ can be regarded as the optimal discriminator trained from $\bar{p}$, i.e., $\hat{D}(x) = \frac{Z\bar{p}(x)}{Z\bar{p}(x) + p_g(x)}$, where $Z$ is a positive constant.

Proof.

Eq. (3) can be represented by $D_n(x) = 1 - \frac{p_g(x)}{p_n(x) + p_g(x)}$. By fixing $x$ and regarding $p_g(x)$ as a positive constant, we see that $D_n(x)$ monotonically increases with $p_n(x)$ within $[0, 1)$. Thus, $\hat{D}(x) = \max_n D_n(x) = \frac{\max_n p_n(x)}{\max_n p_n(x) + p_g(x)} = \frac{Z\bar{p}(x)}{Z\bar{p}(x) + p_g(x)}$, where $Z\bar{p}(x) = \max_n p_n(x)$. ∎
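A quick numerical sanity check of Lemma 1 on random discrete densities (the densities themselves are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random discrete densities over 6 points for N = 3 clients plus a generator.
P = rng.random((3, 6)); P /= P.sum(axis=1, keepdims=True)   # p_1, p_2, p_3
pg = rng.random(6);     pg /= pg.sum()                       # p_g

D_opt = P / (P + pg)                 # Eq. (3): optimal D_n for each client
lhs = D_opt.max(axis=0)              # hat{D}(x) = max_n D_n(x)

p_max = P.max(axis=0)                # Z * bar_p(x)
rhs = p_max / (p_max + pg)           # optimal discriminator for bar_p (Lemma 1)

assert np.allclose(lhs, rhs)
```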

On the other hand, by substituting $\hat{D}$, $\bar{p}$, and $p_g$ into the objective function in Eq. (2), and by letting $l$ be the mean-squared error as done in [Mao2017], we obtain:

$L_G = \mathbb{E}_{x \sim \bar{p}}\big[(\hat{D}(x) - 1)^2\big] + \mathbb{E}_{x \sim p_g}\big[(\hat{D}(x) - 1)^2\big]$    (4)

$\;\;\;\;\;\; = \mathbb{E}_{x \sim \bar{p}}\!\left[\left(\dfrac{p_g(x)}{Z\bar{p}(x) + p_g(x)}\right)^{\!2}\right] + \mathbb{E}_{x \sim p_g}\!\left[\left(\dfrac{p_g(x)}{Z\bar{p}(x) + p_g(x)}\right)^{\!2}\right].$    (5)
Theorem 2.

The global minimum of $L_G$ given $\hat{D}(x) = \max_n D_n(x)$ is achieved if and only if $p_g = \bar{p}$.

Proof.

Importantly, the theoretical result in [Mao2017] based on the minimization of the Pearson divergence is not directly applicable here because $Z$ in Eq. (5) is fixed but unknown in practice. To explicitly deal with $Z$ in the divergence minimization, we introduce the following function $f$:

$f(u) = \dfrac{u + 1}{(Zu + 1)^2} - A, \qquad A = \dfrac{2}{(Z + 1)^2},$    (6)

where $f(1) = 0$ and $f$ is continuous and convex for $u \geq 0$ (see the supplementary material for more detail). This function can then be used to define the $f$-divergence below:

$D_f(\bar{p} \,\|\, p_g) = \mathbb{E}_{x \sim p_g}\!\left[ f\!\left(\dfrac{\bar{p}(x)}{p_g(x)}\right)\right],$    (7)

where $A$ is a constant. This $f$-divergence is non-negative and becomes zero if and only if $p_g = \bar{p}$. Finally, $L_G$ in Eq. (5) can be rearranged with $D_f$ as follows:

$L_G = D_f(\bar{p} \,\|\, p_g) + A.$    (8)

From Eq. (8), $L_G$ reaches the global minimum if and only if $p_g = \bar{p}$. ∎
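The rearrangement in Eqs. (5)-(8) and the location of the global minimum can be checked numerically on random discrete densities, as in the following sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 8)); P /= P.sum(axis=1, keepdims=True)   # p_1, p_2, p_3
p_max = P.max(axis=0)
Z = p_max.sum()                      # normalizing constant (note Z >= 1)
p_bar = p_max / Z                    # bar_p

def L_G(pg):
    # Eq. (5) with the optimal hat{D} plugged in (discrete version).
    d_hat = Z * p_bar / (Z * p_bar + pg)
    return np.sum((p_bar + pg) * (d_hat - 1.0) ** 2)

def D_f(pg):
    # Eq. (7) with f from Eq. (6).
    u = p_bar / pg
    f = (u + 1) / (Z * u + 1) ** 2 - 2.0 / (Z + 1) ** 2
    return np.sum(pg * f)

A = 2.0 / (Z + 1) ** 2
pg = rng.random(8); pg /= pg.sum()                 # an arbitrary generator distribution
assert np.isclose(L_G(pg), D_f(pg) + A)            # Eq. (8)
assert L_G(pg) >= L_G(p_bar) and np.isclose(L_G(p_bar), A)   # minimum at p_g = bar_p
```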

3.2 Forgiver-First Aggregation

While F2U has a theoretical guarantee to achieve $\bar{p}$ as the global optimum, we rarely obtain optimal discriminators in practice. Moreover, [Durugkar2016] shows that involving many discriminators, instead of selecting one of them, can accelerate the training process. Therefore, we propose F2A, which aggregates $D_1(x), \dots, D_N(x)$ while emphasizing the more forgiving ones as follows:

$\hat{D}(x) = \sum_{n=1}^{N} w_n(x)\, D_n(x), \qquad w_n(x) = \dfrac{\exp\!\big(\lambda D_n(x)\big)}{\sum_{m=1}^{N} \exp\!\big(\lambda D_m(x)\big)}.$    (9)

Here, $\lambda \geq 0$ is a parameter allowing us to take different aggregation strategies to better adapt to given client data. When $\lambda$ becomes larger, $\hat{D}(x)$ converges to $\max_n D_n(x)$, which would benefit cases where client data are highly non-iid and barely overlapping. In contrast, when $\lambda$ is nearly $0$, $\hat{D}(x)$ becomes just the average of $D_1(x), \dots, D_N(x)$, which would work well when the client data are iid and significantly overlapping.
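A tiny numeric demo of these two regimes of Eq. (9) (the judgment values are made up):

```python
import numpy as np

def f2a_aggregate(d, lam):
    # Softmax-weighted aggregation of judgments: larger D_n(x) -> larger weight.
    w = np.exp(lam * d)
    w /= w.sum()
    return float(np.sum(w * d))

d = np.array([0.2, 0.5, 0.9])        # hypothetical judgments D_1(x), D_2(x), D_3(x)
print(f2a_aggregate(d, 0.0))         # ~= mean(d) = 0.533 (iid-like regime)
print(f2a_aggregate(d, 50.0))        # ~= max(d)  = 0.9   (highly non-iid regime)
```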

Importantly, $\lambda$ can be updated adaptively via back-propagation, as done in [Durugkar2016]. This makes it unnecessary to manually try multiple choices of $\lambda$ to find better ones based on how non-iid the client data are. Specifically, we augment $L_G$ in Eq. (2) to introduce the following regularized objective to update $\lambda$:

$L_\lambda = \mathbb{E}_{x \sim p_g}\big[\, l(\hat{D}(x), 1) \,\big] + \dfrac{\gamma}{2}\lambda^2,$    (10)

where $\gamma$ denotes the regularization strength and we omit the first term in Eq. (2) as it does not contain $G$. By computing the gradient of $L_\lambda$ with respect to $\lambda$, we obtain:

$\dfrac{\partial L_\lambda}{\partial \lambda} = \mathbb{E}_{x \sim p_g}\!\left[\dfrac{\partial\, l(\hat{D}(x), 1)}{\partial \hat{D}(x)} \cdot \dfrac{\partial \hat{D}(x)}{\partial \lambda}\right] + \gamma\lambda,$    (11)

$\dfrac{\partial \hat{D}(x)}{\partial \lambda} = \sum_{n=1}^{N} w_n(x)\, D_n(x)^2 - \hat{D}(x)^2.$    (12)

Here, $\frac{\partial l(\hat{D}(x), 1)}{\partial \hat{D}(x)}$ is the original loss gradient measured on the single aggregated judgment $\hat{D}(x)$, which is multiplied by $\frac{\partial \hat{D}(x)}{\partial \lambda}$, a quantity that can be viewed as the variance of $D_1(x), \dots, D_N(x)$ weighted by $w_1(x), \dots, w_N(x)$. Updating $\lambda$ by gradient descent with $\frac{\partial L_\lambda}{\partial \lambda}$ is therefore reasonable, because $\lambda$ will be increased when the input data collections are non-iid, making the judgments $D_n(x)$ diverse, and will be decreased otherwise.
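The weighted-variance form of Eq. (12) can be verified numerically under the aggregation of Eq. (9); the following sketch compares it against a finite-difference approximation (values are arbitrary):

```python
import numpy as np

def aggregate(d, lam):
    w = np.exp(lam * d); w /= w.sum()
    return w, float(np.sum(w * d))

def grad_lambda(d, lam):
    # Eq. (12): weighted variance of the judgments D_1(x), ..., D_N(x).
    w, d_hat = aggregate(d, lam)
    return float(np.sum(w * d ** 2) - d_hat ** 2)

d, lam, eps = np.array([0.2, 0.5, 0.9]), 2.0, 1e-6
numeric = (aggregate(d, lam + eps)[1] - aggregate(d, lam - eps)[1]) / (2 * eps)
assert np.isclose(grad_lambda(d, lam), numeric, atol=1e-6)
```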

For updating the generator $G$, let us denote the parameters of $G$ by $\theta$. Simplifying the formal notation of the chain rule, the loss gradient with respect to $\theta$ is derived as follows:

$\dfrac{\partial L_G}{\partial \theta} = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\!\left[\dfrac{\partial\, l(\hat{D}(x), 1)}{\partial \hat{D}(x)} \cdot \dfrac{\partial \hat{D}(x)}{\partial x} \cdot \dfrac{\partial x}{\partial \theta}\right], \qquad x = G(z),$    (13)

where $\frac{\partial \hat{D}(x)}{\partial x}$ can further be decomposed to:

$\dfrac{\partial \hat{D}(x)}{\partial x} = \sum_{n=1}^{N} w_n(x)\Big(1 + \lambda\big(D_n(x) - \hat{D}(x)\big)\Big)\dfrac{\partial D_n(x)}{\partial x}.$    (14)

When $\lambda$ is small, the first, uniform part of Eq. (14) is dominant and the discriminators are treated as equal when updating $G$. In contrast, when $\lambda$ becomes large, more importance is given to the discriminators with larger $D_n(x)$, encouraging $G$ to fool more forgiving ones.

Toward Secure Decentralized Learning with F2A

Decentralized learning of GANs with F2A proceeds as follows. First, the server that has $G$ generates two mini-batches of generated samples and distributes them to all of the clients, as done in [Hardy2018]. Then, the clients update $D_1, \dots, D_N$ using Eq. (1) with their own data and the first (training) mini-batch. Subsequently, the server and the clients perform F2A synchronously to compute the loss gradients in Eq. (11) and in Eq. (13) with the second (testing) mini-batch. Finally, the server updates $\lambda$ and $G$ with these gradients. These training steps are iterated until $G$ reaches an expected performance. Note that the synchronous update by multiple clients is justified in [Hardy2018] given that the size of the mini-batches is set reasonably small, and is also necessary for introducing the secure-aggregation technique discussed below.
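The following structural sketch walks through one such synchronous round; there are no real networks here, and the generator and discriminators are hypothetical stand-ins, so all names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 3, 1.0

# Stand-ins for the real components (hypothetical).
server_generate = lambda: rng.normal(size=(8, 2))                  # a generated mini-batch
client_D = [lambda x, n=n: 1 / (1 + np.exp(-(x.sum(axis=1) - n)))  # fake "discriminators"
            for n in range(N)]

# One synchronous round:
batch_train = server_generate()     # server -> clients (downlink): for updating each D_n
batch_test = server_generate()      # server -> clients (downlink): for updating lambda, G
# 1) each client updates its own D_n with X_n and batch_train (omitted in this sketch)
# 2) clients judge batch_test; only judgments/gradients go back to the server (uplink)
judgments = np.stack([D(batch_test) for D in client_D])            # shape (N, batch)
# 3) the server aggregates the judgments with Eq. (9); the lambda update (Eq. 11)
#    and the generator update (Eq. 13) would follow from these aggregated values.
w = np.exp(lam * judgments); w /= w.sum(axis=0, keepdims=True)
d_hat = (w * judgments).sum(axis=0)
print(d_hat)
```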

One important concern is how this training process can be made provably secure against malicious parties. Consider a typical type of attack in which one of the clients or the server attempts to reveal a part of the private client data from what can be obtained during and after training. Importantly, F2A is designed to consist of summations of client-wise variables, such as $\sum_n \exp(\lambda D_n(x))$ and $\sum_n \exp(\lambda D_n(x))\, D_n(x)$ in Eq. (9), the sum over clients in Eq. (12), and that in Eqs. (13)-(14). This allows the uplink communications from the clients to the server to involve off-the-shelf secure aggregation techniques that compute such sums while keeping each client's contribution secret under certain conditions, such as the one used in [Bonawitz2017]. Doing so provably prevents malicious parties from intercepting the aggregation procedure and using it to reveal a part of $X_n$.

Moreover, the downlink communications from the server to the clients convey only generated samples. Because $G$ is trained against multiple discriminators, these generated samples are ideally mixtures of the multi-client data. Therefore, even if a specific generated sample happens to be similar to what the $n$-th client owns in $X_n$, other parties can view only the generated samples and cannot confirm from them which client's data the sample resembles.
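To illustrate the kind of secure aggregation referred to above, the toy sketch below shows only the underlying idea of additive masking: clients add pairwise random masks that cancel in the sum, so the server recovers the sum of client-wise terms without seeing any individual term. This is not the protocol of [Bonawitz2017] itself, which additionally handles key agreement, client dropouts, and malicious behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.3, 0.8, 0.5])              # private client-wise terms (hypothetical)
N = len(values)

# Pairwise masks r_{nm} shared between clients n < m:
# client n adds +r_{nm}, client m adds -r_{nm}, so the masks cancel in the total.
masks = np.triu(rng.normal(size=(N, N)), k=1)
masked = values + masks.sum(axis=1) - masks.sum(axis=0)

# The server only ever sees `masked`, yet their sum equals the sum of the private values.
assert np.isclose(masked.sum(), values.sum())
```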

4 Experimental Results

We empirically evaluate the decentralized learning of GANs with F2U and F2A on image generation tasks using decentralized versions of public image datasets. Note that this work aims exclusively at evaluating the quality of images generated by the proposed approach rather than its communication efficiency. Thus we implemented the generator, the discriminators, and the client data all in a single workstation to simulate the decentralized setting.

4.1 Implementation Details

As a backbone model, we implemented a variant of LSGANs [Mao2017] with a DCGAN-based architecture [Radford2015], which used spectral normalization [Miyato2018] instead of batch normalization [Ioffe2015] in the discriminators (see the supplementary material for details). Deeper models such as ones with residual blocks [He2016] and other sophisticated techniques such as the gradient penalty [Gulrajani2017] would provide higher performance but were not used in this paper, because our focus is not to obtain the best possible performance but to investigate whether GANs trained with our approaches could outperform other state-of-the-art methods under the decentralized non-iid setting. That being said, we compare several other choices of models in the supplementary material.

To decentralize the backbone model, the discriminator presented above was instantiated $N$ times and initialized independently. The aggregation parameter $\lambda$ for F2A was implemented by a one-channel fully connected layer with a hidden trainable parameter followed by a ReLU activation, which produced $\lambda \geq 0$ as its output. $\lambda$ was initialized to a fixed value and the regularization strength was kept constant throughout training. Both the generator and the discriminators were trained using Adam [Kingma2014]. All the implementations were done with Keras (https://keras.io/) and evaluated on an NVIDIA Tesla V100 GPU.

4.2 Baseline Methods

F2U and F2A rely on multiple discriminators to train a single generator. We chose the following state-of-the-art GANs with the same configuration as baseline methods. For all of the methods, we used the same model architecture and optimization strategy to ensure fair comparisons.

Multi-Discriminator Generative Adversarial Networks (MD-GAN) [Hardy2018] is a pioneering attempt to decentralize GAN training that a) asks each client to train an individual discriminator with their own data and b) updates the generator to fool those multiple discriminators. The generator is updated by applying the loss gradients computed by each discriminator in turn. Importantly, MD-GAN requires clients to exchange their discriminators periodically to prevent them from overfitting to one particular client's data. While this approach might benefit non-iid cases, it comes with the privacy concern that one client could infer what data other clients have from the outputs of the exchanged discriminators. We therefore evaluated the original MD-GAN and its variant MD-GAN (w/o ED), which omits the discriminator exchanges.

Generative Multi-Adversarial Networks (GMAN) [Durugkar2016] aims at stabilizing the learning process by introducing multiple discriminators, albeit trained from an identical data distribution. It aggregates the losses computed with the individual discriminators using a softmax function so that discriminators with higher losses are emphasized more when updating the generator, while ours can be viewed as emphasizing discriminators with lower losses. Similar to F2A, the softmax function is tunable with an aggregation parameter and regularized. We evaluated two variants of GMAN proposed in the paper: GMAN*, which tunes the aggregation parameter via back-propagation, and GMAN-0, which keeps it fixed.

4.3 Data and Preprocessing

We used the training splits of MNIST (60,000 samples), Fashion MNIST (60,000 samples) [Xiao2017], and CIFAR10 (50,000 samples), each of which comprises 10 different classes. We set $N = 5$ and split each dataset into five subsets $X_1, \dots, X_5$ under the following conditions, chosen such that the original data distribution can be regarded as $\bar{p}$ (a sketch of one possible split is given after the list):

  • Non-Overlapping (Non-OVL): the five subsets contained the images of disjoint sets of classes, with no class shared between clients, standing for the most challenging condition.

  • Moderately Overlapping (Mod-OVL): the five subsets contained partially overlapping sets of classes, so that some classes were shared between clients.

  • Fully Overlapping (Full-OVL): all the subsets contained all the classes equally, though such iid cases were not of our main focus.
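One plausible way to realize the three conditions for $N = 5$ clients and 10 classes is sketched below; the specific Non-OVL class pairs and Mod-OVL windows are illustrative assumptions rather than the exact splits used in the experiments, and a real split would also partition each shared class's samples among the clients that cover it.

```python
import numpy as np

def client_class_sets(condition, n_clients=5, n_classes=10):
    # Return, for each client, the set of classes its subset may contain.
    if condition == "non_ovl":     # disjoint pairs, e.g. {0,1}, {2,3}, ..., {8,9} (assumed)
        return [list(range(2 * n, 2 * n + 2)) for n in range(n_clients)]
    if condition == "mod_ovl":     # overlapping windows of four classes (assumed)
        return [[(2 * n + k) % n_classes for k in range(4)] for n in range(n_clients)]
    return [list(range(n_classes)) for _ in range(n_clients)]          # full_ovl

labels = np.repeat(np.arange(10), 6)            # stand-in for dataset labels
for classes in client_class_sets("mod_ovl"):
    idx = np.flatnonzero(np.isin(labels, classes))
    print(classes, len(idx))
```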

4.4 Evaluation Process and Metric

We chose the Fréchet Inception Distance (FID) [Heusel2017] as our evaluation metric. As discussed in [Lucic2018], FID is sensitive to mode dropping, i.e., it degrades when some of the classes contained in the original dataset are missing in the generated data, and thus serves as a suitable metric in this work to see whether all the different classes the client data had were learned successfully. In our experiments, we randomly sampled 10,000 images from both the training data and the trained generators to compute FID scores.
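For reference, FID compares the Gaussian statistics of Inception features of real and generated images [Heusel2017]; a minimal sketch of the computation is shown below, with random placeholder features standing in for Inception-v3 activations.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_gen):
    # ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})  [Heusel2017]
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# In practice feat_* are 2048-d Inception-v3 pool features of 10,000 images each;
# random features are used here only to keep the sketch runnable.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(512, 16)), rng.normal(loc=0.3, size=(512, 16))))
```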

Importantly, we found that the choices of hyperparameters for training GANs, such as the mini-batch size and the number of iterations, affected FID scores greatly and differently for each method, as also discussed in [Lucic2018]. Instead of picking one specific choice of hyperparameters, we tested each method with four combinations of mini-batch sizes and numbers of iterations (chosen separately for MNIST and Fashion MNIST and for CIFAR10), and reported the median, minimum, and maximum FID scores over those four combinations. Each combination was tested once with fixed random seeds.


MNIST (backbone: 19.37 (16.14 - 23.01))
                  Non-OVL                 Mod-OVL                 Full-OVL
MD-GAN            38.42 (36.49 - 39.42)   34.33 (26.76 - 41.13)   11.90 (10.85 - 15.13)
MD-GAN (w/o ED)   45.14 (34.79 - 53.76)   40.39 (30.98 - 49.26)   15.70 (13.29 - 18.96)
GMAN*             67.69 (54.23 - 93.01)   58.65 (52.33 - 63.68)   20.86 (14.23 - 21.34)
GMAN-0            69.83 (57.66 - 101.62)  49.92 (45.52 - 57.51)   18.90 (15.55 - 22.43)
F2U               22.19 (16.64 - 29.23)   13.38 (10.25 - 15.47)   14.32 (11.32 - 19.27)
F2A               18.96 (16.54 - 19.96)   14.53 (12.54 - 16.67)   17.21 (13.85 - 25.25)

Fashion MNIST (backbone: 26.18 (21.08 - 31.10))
                  Non-OVL                 Mod-OVL                 Full-OVL
MD-GAN            56.09 (52.32 - 63.09)   47.12 (45.00 - 51.43)   22.04 (19.15 - 24.68)
MD-GAN (w/o ED)   51.62 (50.10 - 53.48)   46.09 (44.62 - 47.82)   25.62 (24.88 - 31.87)
GMAN*             56.79 (53.01 - 58.14)   49.84 (46.91 - 51.52)   27.97 (25.91 - 31.35)
GMAN-0            55.21 (51.89 - 57.68)   49.31 (46.00 - 57.81)   29.86 (25.27 - 33.29)
F2U               43.07 (34.79 - 56.23)   32.65 (27.94 - 37.21)   36.87 (33.55 - 37.42)
F2A               37.16 (26.61 - 37.32)   29.03 (24.32 - 31.88)   25.82 (24.49 - 29.18)

CIFAR10 (backbone: 32.61 (30.53 - 38.58))
                  Non-OVL                 Mod-OVL                 Full-OVL
MD-GAN            56.64 (53.70 - 60.31)   50.30 (45.51 - 54.95)   37.66 (35.86 - 41.13)
MD-GAN (w/o ED)   49.33 (46.69 - 57.07)   47.89 (41.96 - 54.32)   44.60 (41.95 - 45.38)
GMAN*             50.50 (45.60 - 52.96)   41.83 (39.41 - 43.21)   43.30 (40.74 - 44.40)
GMAN-0            47.99 (44.32 - 51.24)   43.63 (41.44 - 46.57)   42.97 (41.39 - 46.81)
F2U               66.43 (66.27 - 66.77)   40.42 (36.93 - 53.82)   45.11 (44.24 - 46.24)
F2A               38.92 (37.91 - 44.31)   41.01 (38.08 - 43.58)   41.23 (38.87 - 45.51)

Table 1: FID scores in the form of median (min - max) across multiple hyperparameter combinations, where Non-OVL, Mod-OVL, and Full-OVL denote the non-overlapping, moderately overlapping, and fully overlapping conditions. The scores of the centralized backbone model are shown next to each dataset name.

4.5 Results

Comparison with Baselines

Table 1 lists the FID scores. We found that i) when client data were non-overlapping or moderately overlapping, F2U or F2A clearly outperformed the baselines; and ii) when the data were fully overlapping, MD-GAN worked best, but the other approaches including F2U and F2A still performed reasonably well. Note that MD-GAN, however, required discriminators to be exchanged mutually between clients periodically, making it difficult to train while securing client data. Without this discriminator exchange, MD-GAN (w/o ED) and F2A showed comparable performance under the fully overlapping condition, while F2A can further be combined with a secure-aggregation protocol to enable secure training.

Effects of the Aggregation Parameter

Figure 2: Changes of $\lambda$ over iterations.

Additionally, we visualize how the aggregation parameter $\lambda$ changed over iterations for MNIST in Figure 2. As shown, when client data were non-overlapping or moderately overlapping, $\lambda$ increased greatly until it was saturated by the regularization. In contrast, $\lambda$ increased less when the data were fully overlapping.

More Results

Results on different choices of backbone models, aggregation parameters for F2A, and numbers of clients, as well as qualitative results, are reported in the supplementary material.

5 Discussion and Related Work

Our work on decentralized learning of GANs has a connection to existing literature in several aspects. In terms of the formulation of GANs, recent work has also tried to involve multiple discriminators and/or multiple generators. The motivations behind such works are, however, not to enable decentralized learning for multi-client data but to stabilize the training process [Durugkar2016, Chavdarova2018], to avoid mode collapse [Nguyen2017, Ghosh2018], or to model multi-domain data [Zhu2017, Choi2018, Liu2016], all under the centralized setting.

To the best of our knowledge, our work is the first to address the problem of unsupervised decentralized learning from non-iid data. Much work has focused on how computations for large-scale training can be decentralized to multiple clients (see [Lian2017], [Bonawitz2019], and [Tal2018] for summaries of recent work). Among these studies, the most relevant approach is federated learning [Mcmahan2017], which addressed the problem of learning from non-iid data. More recent work has tried to make federated learning more communication efficient [Konecny2016, Lin2017, Jeong2018], secure [Bonawitz2017, Bagdasaryan2018], and applicable to practical wireless settings [Giannakis2016, Wang2018], but under the setting of standard supervised learning. One exception presented recently is MD-GAN [Hardy2018], which however worked only when client data were iid. Another interesting attempt on the decentralized learning of GANs, but with a different objective, was [Kosmopoulos2018], which tried to allow clients to 'jointly' learn their own GANs by exchanging discriminators.

Finally, our work has several limitations. i) Our empirical evaluation is based on a simulation in a single workstation. Practical implementations of the proposed approach, as well as of the baselines, will come with problems of communication, security, and scalability for a large number of clients, as discussed in [Mcmahan2017]. ii) Our approach by itself is not designed to resolve common problems observed in GAN training, such as mode collapse and the lack of convergence guarantees. We still require additional contributions to make GANs perform well in practice. iii) The current formulation can be applied only to standard GANs with a generator and a discriminator. One interesting extension for future work is to deal with various architectures such as conditional GANs [Isola2017, Mirza2014, Reed2016, Odena2017, Miyato2018b] and GANs with multiple generators [Zhu2017, Choi2018, Chavdarova2018, Ghosh2018, Liu2016].

6 Conclusion

We addressed the problem of learning GANs in a decentralized fashion from multi-client non-iid data and presented new approaches called Forgiver-First Update (F2U) and Forgiver-First Aggregation (F2A). We hope that our work has raised a new challenge of decentralized deep learning, i.e., unsupervised decentralized learning from non-iid data, and will also impact a variety of real-world applications such as anomaly detection on confidential medical data and learning image compression models using photo collections stored privately in smartphones.

References

  • [1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [2] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 776–780, 2017.
  • [3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [4] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [5] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [7] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Proceedings of the Information Processing in Medical Imaging, pages 146–157, 2017.
  • [8] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative Adversarial Networks for Extreme Learned Image Compression. Computing Research Repository, abs/1804.02958, 2018.
  • [9] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial Discriminative Domain Adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [10] Soumyadeb Chowdhury, Md Sadek Ferdous, and Joemon M Jose. Exploring Lifelog Sharing and Privacy. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 553–558, 2016.
  • [11] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P. M. Agapow, M. Zietz, M. M. Hoffman, W. Xie, G. L. Rosen, B. J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A. E. Carpenter, A. Shrikumar, J. Xu, E. M. Cofer, C. A. Lavender, S. C. Turaga, A. M. Alexandari, Z. Lu, D. J. Harris, D. DeCaprio, Y. Qi, A. Kundaje, Y. Peng, L. K. Wiley, M. H. S. Segler, S. M. Boca, S. J. Swamidass, A. Huang, A. Gitter, and C. S. Greene. Opportunities and Obstacles for Deep Learning in Biology and Medicine. Journal of The Royal Society Interface, 15(141), 2018.
  • [12] Mingyan Li, Radha Poovendran, and Sreeram Narayanan. Protecting Patient Privacy against Unauthorized Release of Medical Images in a Group Communication Environment. Computerized Medical Imaging and Graphics, 29(5):367 – 383, 2005.
  • [13] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Proceedings of the Advances in Neural Information Processing Systems 30, pages 5330–5340, 2017.
  • [14] Zhanhong Jiang, Aditya Balu, Chinmay Hegde, and Soumik Sarkar. Collaborative Deep Learning in Fixed Topology Networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
  • [15] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
  • [16] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konecný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards Federated Learning at Scale: System Design. Computing Research Repository, abs/1902.01046, 2019.
  • [17] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. MD-GAN: Multi-Discriminator Generative Adversarial Networks for Distributed Datasets. Computing Research Repository, abs/1811.03850, 2018.
  • [18] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2017.
  • [19] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191, 2017.
  • [20] Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative Multi-Adversarial Networks. In Proceedings of the International Conference on Learning Representations, 2016.
  • [21] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least Squares Generative Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2813–2821, 2017.
  • [22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, 2016.
  • [23] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, 2018.
  • [24] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on International Conference on Machine Learning, pages 448–456, 2015.
  • [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 171–180, 2016.
  • [26] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved Training of Wasserstein GANs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Proceedings of the Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [27] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • [28] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Computing Research Repository, abs/1708.07747, 2017.
  • [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [30] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-Scale Study. In Proceedings of the Advances in Neural Information Processing Systems, pages 698–707, 2018.
  • [31] Tatjana Chavdarova and François Fleuret. SGAN: An Alternative Training of Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [32] Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. Dual Discriminator Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, pages 2670–2680, 2017.
  • [33] Arnab Ghosh, Viveka Kulharia, Vinay P. Namboodiri, Philip H.S. Torr, and Puneet K. Dokania. Multi-Agent Diverse Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [34] Ming-Yu Liu and Oncel Tuzel. Coupled Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 469–477, 2016.
  • [35] Tal Ben-Nun and Torsten Hoefler. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. Computing Research Repository, abs/1802.09941, 2018.
  • [36] Jakub Konecný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. In Proceedings of the NIPS Workshop on Private Multi-Party Machine Learning, 2016.
  • [37] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In Proceedings of the International Conference on Learning Representations, 2018.
  • [38] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data. Computing Research Repository, abs/1811.11479, 2018.
  • [39] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How To Backdoor Federated Learning. Computing Research Repository, abs/1807.00459, 2018.
  • [40] Georgios B. Giannakis, Qing Ling, Gonzalo Mateos, Ioannis D. Schizas, and Hao Zhu. Decentralized Learning for Wireless Communications and Networking, pages 461–497. Springer International Publishing, 2016.
  • [41] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning. In Proceedings of the IEEE International Conference on Computer Communications, 2018.
  • [42] Dimitrios Kosmopoulos. A Prototype Towards Modeling Visual Data Using Decentralized Generative Adversarial Networks. In Proceedings of IEEE International Conference on Image Processing, pages 4163–4167, 2018.
  • [43] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. Computing Research Repository, abs/1411.1784, 2014.
  • [44] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In Proceedings of International Conference on Machine Learning, pages 1060–1069, 2016.
  • [45] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the International Conference on Machine Learning, pages 2642–2651, 2017.
  • [46] Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. In Proceedings of the International Conference on Learning Representations, 2018.
  • [47] Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the International Conference on Machine Learning, pages 807–814, 2010.
  • [48] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical Evaluation of Rectified Activations in Convolutional Network. Computing Research Repository, abs/1505.00853, 2015.

Appendix A On the Convexity of $f$

To show the global optimality of $\bar{p}$ with LSGANs in Theorem 2, we defined the following function $f$ in Eq. (6):

$f(u) = \dfrac{u + 1}{(Zu + 1)^2} - \dfrac{2}{(Z + 1)^2},$    (15)

where $u = \bar{p}(x) / p_g(x) \geq 0$ and $f(u) = 0$ holds if and only if $u = 1$. This function needs to be convex at least for $u \geq 0$ to be used with the $f$-divergence.

To show its convexity, we calculate the second derivative of $f$:

$f''(u) = \dfrac{2Z\,(Zu + 3Z - 2)}{(Zu + 1)^4}.$    (16)

Since $Z = \int \max_n p_n(x)\,dx \geq \int p_1(x)\,dx = 1$, we have $f''(u) > 0$ if $u \geq 0$; namely, $f$ is convex for $u \geq 0$.

Appendix B Global Optimality of $\bar{p}$ with the Standard GAN

In addition to our main theoretical results that show the global optimality of $\bar{p}$ with LSGANs, we here prove that the same global optimum can also be achieved for the standard GAN using the binary cross-entropy loss.

As proven in [3], the optimal discriminator trained from the data-generating distribution $p$ with the generator's distribution $p_g$ fixed is given as follows:

$D^*(x) = \dfrac{p(x)}{p(x) + p_g(x)}.$    (17)

As we showed in Lemma 1 of the main paper, $\hat{D}(x) = \max_n D_n(x)$ can then be regarded as the optimal discriminator trained from $\bar{p}$, namely,

$\hat{D}(x) = \dfrac{Z\bar{p}(x)}{Z\bar{p}(x) + p_g(x)},$    (18)

where $Z$ is a positive constant. On the other hand, the objective function for the generator in [3], given $\hat{D}$, can be reformulated as:

$L_G = \mathbb{E}_{x \sim \bar{p}}\big[\log \hat{D}(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log (1 - \hat{D}(x))\big]$    (19)

$\;\;\;\;\;\; = \mathbb{E}_{x \sim \bar{p}}\!\left[\log \dfrac{Z\bar{p}(x)}{Z\bar{p}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \dfrac{p_g(x)}{Z\bar{p}(x) + p_g(x)}\right].$    (20)

Now, consider the following continuous function $f$:

$f(u) = u \log \dfrac{Zu}{Zu + 1} + \log \dfrac{1}{Zu + 1} - B, \qquad B = \log \dfrac{Z}{(Z + 1)^2},$    (21)

where $u \geq 0$, and its second derivative is:

$f''(u) = \dfrac{1}{u(Zu + 1)} + \dfrac{Z(Z - 1)}{(Zu + 1)^2}.$    (22)

Since $f''(u) > 0$ if $u > 0$ (recall that $Z \geq 1$), the function is convex for $u > 0$. We introduce the $f$-divergence with this function as follows:

$D_f(\bar{p} \,\|\, p_g) = \mathbb{E}_{x \sim p_g}\!\left[f\!\left(\dfrac{\bar{p}(x)}{p_g(x)}\right)\right]$    (23)

$\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; = L_G - B,$    (24)

where $B$ is a constant. With $D_f$, $L_G$ in Eq. (20) can be rearranged:

$L_G = D_f(\bar{p} \,\|\, p_g) + B,$    (25)

which reaches its global minimum if and only if $p_g = \bar{p}$.

Appendix C Implementation Details

This section presents implementation details of the backbone GANs used in our experiments.

MNIST and Fashion MNIST

The architecture of the generator was designed as follows. A 128-dimensional noise vector drawn from the normal distribution was first fed to a fully connected layer activated with ReLU [47], and the output was reshaped into a low-resolution feature map. This feature map was then deconvoluted using two consecutive 2D deconvolution layers, both of which were batch-normalized and activated with ReLU. Finally, one more deconvolution layer, activated by the hyperbolic tangent, was applied to obtain gray-scale images at the target resolution. The discriminator that received the gray-scale images consisted of four consecutive convolution layers, each followed by spectral normalization [23] and LeakyReLU activation [48]. Zero-padding was applied before the second convolution layer to down-scale the feature maps properly in the subsequent convolutions. Finally, the feature maps were flattened and fed into a fully connected layer with a one-dimensional output, followed by spectral normalization and linear activation (having the normalization after the last layer of discriminators might not be a standard choice but improved the overall performance in our experiments).

CIFAR10

Similar to the architecture shown above, the generator first fed a 128-dimensional noise vector drawn from the normal distribution to a fully connected layer with ReLU activation. The output was reshaped into a low-resolution feature map and then fed to three consecutive 2D deconvolution layers, each followed by batch normalization and ReLU activation. One more deconvolution layer, activated by the hyperbolic tangent, was applied finally to obtain colored images of the size 32 x 32. The discriminator consisted of five convolution layers, each followed by spectral normalization and leaky ReLU, and the output was flattened and fed to a fully connected layer with a single channel, again with spectral normalization and linear activation.

Appendix D Additional Experimental Results

D.1 Effect of Model Choices

Tables 2 and 3 show MNIST results with several other models, including a standard GAN with binary cross entropy loss (‘BCE’ in the table) with or without spectral normalization (SN), as well as LSGAN (‘MSE’ in the table) without SN. Overall, we found that the MSE loss and SN were both important; for both F2U and F2A and for all the conditions, MSE worked better than BCE when combined with SN.


Loss  SN   Non-OVL                 Mod-OVL                 Full-OVL
BCE   no   44.85 (42.31 - 51.28)   21.03 (18.77 - 24.81)   12.63 (11.77 - 18.58)
BCE   yes  25.85 (21.17 - 35.58)   28.06 (23.28 - 29.62)   29.83 (24.70 - 39.20)
MSE   no   44.10 (41.88 - 48.36)   23.16 (20.80 - 31.06)   11.20 (9.52 - 12.61)
MSE   yes  22.19 (16.64 - 29.23)   13.38 (10.25 - 15.47)   14.32 (11.32 - 19.27)

Table 2: Effect of Model Choices (F2U): FID scores on MNIST in the form of median (min - max) across several hyperparameter combinations.

Loss  SN   Non-OVL                 Mod-OVL                 Full-OVL
BCE   no   43.62 (42.59 - 45.35)   21.47 (16.96 - 22.85)   11.89 (11.25 - 16.54)
BCE   yes  24.04 (20.33 - 26.42)   33.65 (26.93 - 40.08)   28.02 (23.40 - 29.34)
MSE   no   74.65 (52.63 - 86.38)   30.19 (23.25 - 37.95)   11.72 (8.96 - 13.24)
MSE   yes  18.96 (16.54 - 19.96)   14.53 (12.54 - 16.67)   17.21 (13.85 - 25.25)

Table 3: Effect of Model Choices (F2A): FID scores on MNIST in the form of median (min - max) across several hyperparameter combinations.

D.2 Effect of Aggregation Parameters

Table 4 lists other settings of $\lambda$ and its regularization, including cases where $\lambda$ was fixed to a certain value (either a small value or the value reached after saturation under the non-overlapping condition) and a case where $\lambda$ was regularized more weakly. Especially under the non-overlapping and moderately overlapping conditions, aggregation with a fixed $\lambda$ showed limited performance regardless of how large or small $\lambda$ was, indicating the importance of updating $\lambda$ dynamically. The weaker regularization affected the scores only slightly.


Setting                             Non-OVL                 Mod-OVL                 Full-OVL
λ fixed (small value)               46.76 (35.93 - 53.85)   32.93 (28.88 - 36.00)   17.26 (14.41 - 22.58)
λ fixed (post-saturation value)     22.76 (14.34 - 23.20)   23.35 (11.97 - 26.81)   15.84 (14.92 - 16.48)
adaptive λ, weaker regularization   21.99 (17.97 - 24.36)   13.78 (12.22 - 18.13)   17.01 (15.88 - 18.43)
adaptive λ (default F2A)            18.96 (16.54 - 19.96)   14.53 (12.54 - 16.67)   17.21 (13.85 - 25.25)

Table 4: Effects of $\lambda$ and its regularization: FID scores on MNIST in the form of median (min - max) across multiple hyperparameter combinations.

D.3 Effect of the Number of Clients

We also tested how the performance changed when the number of clients became larger: $N \in \{10, 20\}$. For $N = 10$, we split the MNIST data into ten subsets such that each subset involved images of only a single digit. For $N = 20$, we further divided each subset obtained for $N = 10$ randomly into two subsets of the same size. Figure 3 shows the median FID scores. As a reference, we also present the results for $N = 5$ under the non-overlapping condition in the figure. Both F2U and F2A clearly outperformed the other methods even when $N$ was large.

Figure 3: Effects of N: Median FID scores on MNIST across multiple hyperparameter combinations.

D.4 Qualitative Results

Finally, we show some examples of generated images in Figures 4 to 12. Especially under the non-overlapping conditions in Figures 4, 7, and 10, we found i) lower-quality images with MD-GAN and MD-GAN (w/o ED); and ii) biased outputs (e.g., many '1's generated) with GMAN* and GMAN-0, while iii) F2U and F2A did not show such major issues.

Figure 4: Qualitative results (MNIST, Non-OVL)
Figure 5: Qualitative results (MNIST, Mod-OVL)
Figure 6: Qualitative results (MNIST, Full-OVL)
Figure 7: Qualitative results (FMNIST, Non-OVL)
Figure 8: Qualitative results (FMNIST, Mod-OVL)
Figure 9: Qualitative results (FMNIST, Full-OVL)
Figure 10: Qualitative results (CIFAR10, Non-OVL)
Figure 11: Qualitative results (CIFAR10, Mod-OVL)
Figure 12: Qualitative results (CIFAR10, Full-OVL)