FedGAN: Federated Generative Adversarial Networks for Distributed Data

06/12/2020 · Mohammad Rasouli, et al. · Stanford University

We propose Federated Generative Adversarial Network (FedGAN) for training a GAN across distributed sources of non-independent-and-identically-distributed data subject to communication and privacy constraints. Our algorithm uses local generators and discriminators which are periodically synced via an intermediary that averages and broadcasts the generator and discriminator parameters. We theoretically prove the convergence of FedGAN with both equal and two time-scale updates of the generator and discriminator, under standard assumptions, using stochastic approximation and communication-efficient stochastic gradient descent. We test FedGAN on toy examples (2D system, mixed Gaussian, and Swiss roll), image datasets (MNIST, CIFAR-10, and CelebA), and time series datasets (household electricity consumption and electric vehicle charging sessions). We show FedGAN converges and has performance similar to a general distributed GAN, while reducing communication complexity. We also show its robustness to reduced communications.


1 Introduction

Generative adversarial networks (GANs) were proposed by Goodfellow et al. (2014) for generating fake data similar to the original data and have found wide applications. In many cases data is distributed across multiple sources, with the data in each source being too limited in size and diversity to locally train an accurate GAN for the entire population of distributed data. On the other hand, due to privacy constraints, data cannot be shared or pooled centrally. Therefore, a distributed GAN algorithm is required for training a GAN representing the entire population; such a distributed GAN also allows generating publicly accessible data (Yonetani et al., 2019). Current distributed GAN algorithms require large communication bandwidth among data sources or between data sources and an intermediary (to ensure convergence) due to architectures that separate generators from discriminators (Augenstein et al., 2019; Hardy et al., 2018b). But in many applications communication bandwidth is limited, e.g. energy, mobile communications, finance, and sales (Yang et al., 2019). Communication-efficient distributed GAN is an open problem. We propose an architecture which places local discriminators with local generators, synced occasionally through an intermediary.

Communication-efficient learning across multiple data sources, which are also subject to privacy constraints, is studied under federated learning (Konečnỳ et al., 2016; McMahan et al., 2016). Therefore, we refer to distributed communication-efficient GAN as federated GAN (FedGAN). FedGAN can extend GAN applications to federated learning. For example, in many cases even the pooled dataset is not large enough to learn an accurate model, and FedGAN can help produce more data similar to the original data for better training (Bowles et al., 2018).

The major challenge in GAN algorithms is their convergence, since the cost functions may not converge using gradient descent in the minimax game between the discriminator and the generator. Convergence is also the major challenge in federated learning, since each source updates its local model using multiple stochastic gradient descents (SGDs) before syncing with the others through the intermediary (Dinh et al., 2019); it becomes even more challenging when data at different sources are not independent and identically distributed (non-iid) (Wang et al., 2019). Convergence of distributed GANs is an open problem. We theoretically prove our FedGAN algorithm converges even with non-iid sources under both equal and two time-scale updates. We connect results from stochastic approximation for GAN convergence and communication-efficient SGD for federated learning to address FedGAN convergence.

We test our FedGAN on popular toy examples in the GAN literature including the 2D system, mixed Gaussians, and Swiss roll, on image datasets including MNIST, CIFAR-10, and CelebA, and on time series data from the energy industry including household electricity consumption and electric vehicle charging, to show its convergence, efficiency, and robustness to reduced communications. We use the energy industry since it is not currently studied in federated learning, yet it often involves distributed data, subject to privacy constraints, with limited communication infrastructure, and mostly in time series format (Stankovic et al., 2016; Balachandran et al., 2014). To the best of our knowledge, there is no other algorithm for communication-efficient distributed GAN to compare ours with. We compare the performance of FedGAN with a typical distributed GAN that has local discriminators and one generator in the intermediary which communicate frequently (similar to that in (Augenstein et al., 2019)).

The rest of the paper is as follows. We first review the relevant literature (Section 2). Then we propose our FedGAN algorithm (Section 3.1), discuss its communication and computation properties (Section 3.2), and theoretically prove its convergence even when the data sources are non-iid (Section 3.3). Next, we run experiments on toy examples (Section 4.1), image datasets (Section 4.2), and time series datasets (Section 4.3). Finally, we point out some observations from our experiments and some open problems (Section 5). Parts of the proofs and experiments are in the appendices.

2 Literature Review

This work relates to three strands of literature: GAN convergence, distributed GAN, and federated learning.

Convergence of GAN has been studied through the convergence of dynamical systems using stochastic approximation, by modeling GAN as a continuous two-player game solvable by gradient-based multi-agent learning algorithms (Chasnov et al., 2019). With equal step sizes of the generator and discriminator updates, the problem is single time-scale stochastic approximation (Konda et al., 2004), for which certain conditions have been developed (and tested (Mescheder et al., 2018)) for convergence of the ODE representing the stochastic approximation of GAN, e.g. a Hurwitz Jacobian at equilibrium (Khalil, 2002), negative definite Hessians with small learning rate (Nowozin et al., 2016; Ratliff et al., 2013), consensus optimization regularization (Nagarajan and Kolter, 2017), and non-imaginary eigenvalues of the spectrum of the gradient vector field Jacobian (Mescheder et al., 2017). With different step sizes of the generator and discriminator updates, the problem is two time-scale stochastic approximation (Borkar, 1997), for which convergence is shown under global (Borkar, 1997) and local asymptotic stability assumptions (Karmakar and Bhatnagar, 2017; Borkar, 2009). (Heusel et al., 2017) proposes a two time-scale update rule (TTUR) for GAN with SGD and shows convergence under those stability conditions. All of the above papers are for centrally trained GAN, while our FedGAN algorithm is distributed.

Distributed GANs have been proposed recently. For iid data sources, (Hardy et al., 2018b) proposes a single generator at the intermediary and distributed discriminators which communicate generated data and the corresponding error. Also, discriminators exchange their parameters occasionally to avoid overfitting to local data. (Hardy et al., 2018a) utilizes a gossip approach for distributed GAN which does not require an intermediary server. For non-iid data sources, (Yonetani et al., 2019) trains individual discriminators and updates the centralized generator to fool the weakest discriminator. All of the above algorithms require large communication, while our FedGAN is communication efficient. Also, to the best of our knowledge there is no theoretical result for the convergence of distributed GAN, while we provide such results for FedGAN.

Federated learning, proposed for communication-efficient distributed learning, runs parallel SGD on a randomly selected subset of all the agents and updates the parameters with the averages of the trained models through an intermediary once in a while. Its convergence is proved for a convex objective with iid data (Stich, 2018), a non-convex objective with iid data (Wang and Joshi, 2018), a strongly convex objective and non-iid data with all agents responsive (Wang et al., 2019) and with some agents non-responsive (Li et al., 2019), and a non-convex objective and non-iid data (Yu et al., 2019). Our FedGAN study is with a non-convex objective and non-iid data, and also involves GAN convergence challenges on top of distributed training issues.

A distributed GAN with communication constraints is proposed by Augenstein et al. (2019) under the name FedAvg-GAN; it has distributed discriminators but a centralized generator, similar to the distributed GAN in Hardy et al. (2018b) with the difference of selecting a subset of agents for discriminator updating. This approach does not fully address the large communications required for distributed GAN, as it needs communication in every generator update iteration. We overcome this issue by placing both the discriminators and the generators at the agents, and then communicating with the intermediary only once every $K$ steps to sync parameters across agents. An architecture similar to our FedGAN is envisioned in (Rajagopal. and Nirmala., 2019) and (Hardy et al., 2018b), but they do not provide theoretical studies of convergence, and their experimental results are very limited. We provide a complete study in this paper.

3 FedGAN Algorithm

In this section we propose the FedGAN algorithm, discuss its communication and computation complexity, and theoretically prove its convergence.

3.1 Model and Algorithm

We denote the training iteration horizon by $N$ and index time by $n$. Consider $B$ agents, with the local dataset of agent $i$ denoted by $R_i$ and the weight of agent $i$ denoted by $p_i$. Agent $i$'s data comes from an individual distribution (data is non-iid across agents). Assume each agent has a local discriminator and generator with corresponding parameter vectors $w_n^i$ and $\theta_n^i$, loss functions $\mathcal{L}_D^{(i)}$ and $\mathcal{L}_G^{(i)}$, local true gradients $g_i(w_n^i, \theta_n^i)$ and $h_i(w_n^i, \theta_n^i)$, local stochastic gradients $\tilde{g}_i(w_n^i, \theta_n^i)$ and $\tilde{h}_i(w_n^i, \theta_n^i)$, and learning rates $a(n)$ and $b(n)$ at time $n$. We assume the learning rates are the same across agents. The gradients $\tilde{g}_i$ and $\tilde{h}_i$ are stochastic, since every agent uses a mini-batch of its local data for SGD.

There is an intermediary whose role is syncing the local generators and discriminators. The intermediary parameters at time $n$ are denoted by $\bar{w}_n$ and $\bar{\theta}_n$. Note that the intermediary does not train a generator or discriminator itself; $\bar{w}_n$ and $\bar{\theta}_n$ are only obtained by averaging $w_n^i$ and $\theta_n^i$ across $i$.

The FedGAN algorithm is presented in Algorithm 1. All agents run SGDs for training local generators and discriminators using local data. Every $K$ time steps of local gradient updates, the agents send their parameters to the intermediary, which in turn sends back the average parameters to all agents to sync. We refer to $K$ as the synchronization interval. $K$, $a(n)$, and $b(n)$ are tuning parameters of the FedGAN algorithm.

Remark 1.

In our model, privacy is the main reason agents do not share data and rather send model parameters. Adding privacy noise to the model parameters can further preserve privacy. We leave this as a future direction for this research. Also, we assume all agents participate in the communication process. There is a literature on federated learning which studies the case where only part of the agents send their parameters, due to communication failures (Konečnỳ et al., 2016). This could be an extension of FedGAN in future work.

Input  : Set training period $N$. Initialize local discriminator and generator parameters for each agent $i$: $w_0^i$ and $\theta_0^i$, $i = 1, \ldots, B$. Set the learning rates of the discriminator and the generator at iteration $n$, $a(n)$ and $b(n)$, for $n = 0, \ldots, N-1$. Also set the synchronization interval $K$.
1 for $n = 0, \ldots, N-1$ do
2       Each agent $i$ calculates the local stochastic gradients $\tilde{g}_i(w_n^i, \theta_n^i)$ from $R_i$, and $\tilde{h}_i(w_n^i, \theta_n^i)$ from $R_i$ and fake data generated by the local generator. Each agent $i$ updates its local parameters in parallel to the others via
(1)   $w_{n+1}^i = w_n^i - a(n)\,\tilde{g}_i(w_n^i, \theta_n^i), \qquad \theta_{n+1}^i = \theta_n^i - b(n)\,\tilde{h}_i(w_n^i, \theta_n^i)$
 if $(n+1) \bmod K = 0$ then
3             All agents send parameters to the intermediary. The intermediary calculates $\bar{w}_{n+1}$ and $\bar{\theta}_{n+1}$ by averaging
(2)   $\bar{w}_{n+1} = \sum_{i=1}^{B} p_i\, w_{n+1}^i, \qquad \bar{\theta}_{n+1} = \sum_{i=1}^{B} p_i\, \theta_{n+1}^i$
The intermediary sends back $\bar{w}_{n+1}$ and $\bar{\theta}_{n+1}$, and the agents update their local parameters
(3)   $w_{n+1}^i = \bar{w}_{n+1}, \qquad \theta_{n+1}^i = \bar{\theta}_{n+1}$
4       end if
5
6 end for
Algorithm 1 Federated Generative Adversarial Network (FedGAN)
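
To make the steps concrete, the following is a minimal, illustrative PyTorch sketch of Algorithm 1 on toy 2D data; the network builders, batch size, and learning rates are hypothetical stand-ins, not the configuration used in the paper.

```python
# Minimal FedGAN sketch (Algorithm 1): each agent trains a local GAN with SGD,
# and every sync_interval steps the intermediary averages and broadcasts parameters.
import copy
import torch
import torch.nn as nn

def make_generator(noise_dim=2, data_dim=2):
    return nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

def make_discriminator(data_dim=2):
    return nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

def average_params(models, weights):
    """Weighted average of the agents' parameters, as in Eq. (2)."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = sum(p * m.state_dict()[key] for p, m in zip(weights, models))
    return avg

def fedgan(local_data, weights, steps=200, sync_interval=10, lr_d=1e-3, lr_g=1e-3):
    B = len(local_data)
    gens = [make_generator() for _ in range(B)]
    discs = [make_discriminator() for _ in range(B)]
    opt_g = [torch.optim.SGD(g.parameters(), lr=lr_g) for g in gens]
    opt_d = [torch.optim.SGD(d.parameters(), lr=lr_d) for d in discs]
    bce = nn.BCEWithLogitsLoss()
    for n in range(steps):
        for i in range(B):
            real = local_data[i][torch.randint(len(local_data[i]), (64,))]
            noise = torch.randn(64, 2)
            # Local discriminator step (Eq. (1), w update).
            opt_d[i].zero_grad()
            d_loss = bce(discs[i](real), torch.ones(64, 1)) + \
                     bce(discs[i](gens[i](noise).detach()), torch.zeros(64, 1))
            d_loss.backward()
            opt_d[i].step()
            # Local generator step (Eq. (1), theta update).
            opt_g[i].zero_grad()
            g_loss = bce(discs[i](gens[i](noise)), torch.ones(64, 1))
            g_loss.backward()
            opt_g[i].step()
        if (n + 1) % sync_interval == 0:
            # Intermediary averages and broadcasts parameters (Eqs. (2)-(3)).
            g_avg, d_avg = average_params(gens, weights), average_params(discs, weights)
            for i in range(B):
                gens[i].load_state_dict(g_avg)
                discs[i].load_state_dict(d_avg)
    return gens, discs

# Toy usage: two agents with non-iid 2D Gaussian data.
data = [torch.randn(1000, 2) + 2.0, torch.randn(1000, 2) - 2.0]
gens, discs = fedgan(data, weights=[0.5, 0.5])
```

Setting lr_d and lr_g to different values in this sketch corresponds to the two time-scale updates analyzed in Section 3.3.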

3.2 Communication and Computation Complexity

FedGAN communications are limited to all agents sending parameters to the intermediary and receiving back the synchronized parameters every $K$ steps. For a parameter vector of size $M$ (we assume the sizes of the generator and discriminator are of the same order), the average communication per round per agent is of order $M/K$. Increasing $K$ reduces the average communication, which may reduce the performance of the trained FedGAN (we test FedGAN robustness to increasing $K$ in Section 4.2 and leave its theoretical understanding for future research). For a general distributed GAN where the generator is trained at the intermediary, the communication involves sending discriminator parameters and generator parameters (or the fake generated data), and this communication must happen at every time step for convergence. The average communication per round per agent is therefore of order $M$. This shows the communication efficiency of FedGAN.

Since each agent trains a local generator, FedGAN requires increased computation for agents compared to distributed GAN, but of the same order (roughly doubled). However, in FedGAN the intermediary has a significantly lower computational burden since it only averages the agents' parameters.
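
For intuition only, the following back-of-envelope sketch compares per-agent communication per round for FedGAN and a per-step distributed GAN; the parameter count, float width, and synchronization interval are hypothetical values, not measurements from the paper.

```python
# Rough per-agent communication per training round (upload + download of both
# the generator and discriminator parameters), under hypothetical sizes.
M = 5_000_000        # parameters per network (assumed)
bytes_per_param = 4  # 32-bit floats
K = 20               # FedGAN synchronization interval (assumed)

fedgan_per_round = 2 * 2 * M * bytes_per_param / K   # exchanged only every K rounds
dist_gan_per_round = 2 * 2 * M * bytes_per_param     # exchanged every round

print(f"FedGAN:          {fedgan_per_round / 1e6:.0f} MB/round per agent")
print(f"Distributed GAN: {dist_gan_per_round / 1e6:.0f} MB/round per agent")
```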

3.3 Convergence Analysis

In this section, we show that FedGAN converges even with non-iid sources of data, under certain standard assumptions. We analyze the convergence of FedGAN for both equal time-scale updates and two time-scale updates (distinguished by whether $a(n) = b(n)$). While using equal time-scale updates is considered standard, some recent progress in GANs such as Self-Attention GAN (Zhang et al., 2018) advocates the use of the two time-scale updates presented in (Heusel et al., 2017) for tuning hyper-parameters.

We extend the notation in this section. For a centralized GAN that pools all the distributed data together, we denote the generator's and discriminator's loss functions by $\mathcal{L}_G$ and $\mathcal{L}_D$, with true gradients $h(w, \theta)$ and $g(w, \theta)$. Also define the weighted averages $\tilde{g}(w, \theta) = \sum_{i=1}^{B} p_i\, \tilde{g}_i(w, \theta)$ and $\tilde{h}(w, \theta) = \sum_{i=1}^{B} p_i\, \tilde{h}_i(w, \theta)$; $\tilde{g}$ and $\tilde{h}$ are random variables due to the randomness in the mini-batch stochastic gradients of the agents.

We make the following standard assumptions for the rest of this section. The first four are with respect to the centralized GAN and are often used in the stochastic approximation literature on GAN convergence. The last assumption is with respect to the local GANs and is common in distributed learning.

  • (A1) $g$ and $h$ are $L$-Lipschitz.

  • (A2) $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$, $\sum_n b(n) = \infty$, $\sum_n b(n)^2 < \infty$.

  • (A3) The stochastic gradient errors $\tilde{g}_i - g_i$ and $\tilde{h}_i - h_i$ are martingale difference sequences w.r.t. the increasing $\sigma$-field generated by the iterates up to time $n$.

  • (A4) $\sup_n \|w_n\| < \infty$ and $\sup_n \|\theta_n\| < \infty$.

  • (A5) For every agent $i$, $\mathbb{E}\|\tilde{g}_i - g_i\|^2 \le \sigma_g^2$ and $\mathbb{E}\|\tilde{h}_i - h_i\|^2 \le \sigma_h^2$ (bounded variance), and $\|g_i(w, \theta) - g(w, \theta)\| \le \epsilon$ (bounded gradient divergence).

(A1)-(A4) are standard assumptions. In (A5), the first bounds ensure closeness between the local stochastic gradients and the local true gradients, while the second bound ensures closeness of the local discriminator true gradients of the non-iid sources to the discriminator true gradient of the pooled data.

We next prove the convergence of FedGAN. To this end, we rely on the extensive literature that connects the convergence of GAN to the convergence of an ODE representation of the parameter updates (Mescheder et al., 2017). We prove that the ODE representing the parameter updates of FedGAN asymptotically tracks the ODE representing the parameter updates of the centralized GAN. We then use the existing results on convergence of the centralized GAN ODE (Mescheder et al., 2018; Nagarajan and Kolter, 2017). Note that our results do not mean the FedGAN and the centralized GAN converge to the same point in general.

Equal time-scale update. It has been shown in (Nagarajan and Kolter, 2017; Mescheder et al., 2017) that under equal time-scale updates, the centralized GAN tracks the following ODE asymptotically (we use $t$ to denote continuous time):

(4)  $\dot{w}(t) = -g(w(t), \theta(t)), \qquad \dot{\theta}(t) = -h(w(t), \theta(t))$

We now show that when $a(n) = b(n)$, the updates in (1), as specified by the proposed algorithm, also track the ODE in (4) asymptotically. To this end, we further extend the notation. For the centralized GAN, let $z_n = (w_n, \theta_n)$ and define $F(z) = (-g(z), -h(z))$ from (4). For FedGAN, define $\bar{z}_n = (\bar{w}_n, \bar{\theta}_n)$ from (2) and, with a little abuse of notation, define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$. Define a continuous piecewise-linear function $\bar{z}(t)$ by

(5)  $\bar{z}(t(n)) = \bar{z}_n$

with linear interpolation on each interval $[t(n), t(n+1)]$. Correspondingly, define $\bar{w}(t)$ and $\bar{\theta}(t)$ to have $\bar{z}(t) = (\bar{w}(t), \bar{\theta}(t))$.

Let $z^s(t)$ (correspondingly $z_s(t)$) denote the unique solution to (4) starting (ending) at $s$,

(6)  $\dot{z}^s(t) = F(z^s(t)), \qquad t \ge s$

with $z^s(s) = \bar{z}(s)$ (with $z_s(s) = \bar{z}(s)$). Define $(w^s(t), \theta^s(t))$ (correspondingly $(w_s(t), \theta_s(t))$) to be the elements of $z^s(t)$ (elements of $z_s(t)$).

Now, in order to prove FedGAN follows (4) asymptotically, it is sufficient to show that $\bar{z}(t)$ asymptotically tracks $z^s(t)$ and $z_s(t)$ as $s \to \infty$. The first step is to show that the difference between the intermediary averaged parameters $(\bar{w}_n, \bar{\theta}_n)$ and the parameters $(w'_n, \theta'_n)$, defined below based on centralized GAN updates in between synchronization intervals, is bounded. If $n \bmod K = 0$, then let $(w'_n, \theta'_n) = (\bar{w}_n, \bar{\theta}_n)$; otherwise, denote by $n_K$ the largest multiple of $K$ before $n$ and let

(7)  $w'_{n+1} = w'_n - a(n)\, g(w'_n, \theta'_n), \qquad \theta'_{n+1} = \theta'_n - b(n)\, h(w'_n, \theta'_n), \qquad n \ge n_K.$

We prove the following Lemma 1 and Lemma 2 for this purpose. The distinction between the proof here and that of Theorem 1 in (Wang et al., 2019) is that we consider local SGD for federated learning.

We present the proofs of the results in this section in Appendix B.

Lemma 1.

.

Lemma 2.

.

Next, using Lemma 1 and Lemma 2, we prove Theorem 1, which shows that $\bar{z}(t)$ asymptotically tracks $z^s(t)$ and $z_s(t)$ as $s \to \infty$. This in turn proves that (1) asymptotically tracks the limiting ODE in (4). The proof is modified from Lemma 1 in Chapter 2 of (Borkar, 2009), extended from centralized GAN to FedGAN.

Theorem 1.

For any $T > 0$ (a.s. stands for almost sure convergence),

(8)  $\lim_{s \to \infty} \sup_{t \in [s, s+T]} \|\bar{z}(t) - z^s(t)\| = 0 \ \text{a.s.}, \qquad \lim_{s \to \infty} \sup_{t \in [s-T, s]} \|\bar{z}(t) - z_s(t)\| = 0 \ \text{a.s.}$
Corollary 1.

From Theorem 1 above and Theorem 2 of Section 2.1 in Borkar (2009), under equal time-scale update, FedGAN tracks the ODE in (4) asymptotically.

We provide the convergence analysis of FedGAN with two time-scale updates in Appendix A.

4 Experiments

In this section we test the proposed FedGAN algorithm using different datasets to show its convergence, its performance in generating data close to real data, and its robustness to reducing communications (by increasing the synchronization interval $K$). First, in Section 4.1 we experiment with popular toy examples in the GAN literature: the 2D system (Nagarajan and Kolter, 2017), mixed Gaussian (Metz et al., 2016), and Swiss roll (Gulrajani et al., 2017). Next, in Section 4.2 we experiment with image datasets including MNIST, CIFAR-10, and CelebA. Finally, we consider time series data in Section 4.3 by focusing on the energy industry, including PG&E household electricity consumption and electric vehicle charging sessions from a charging station company. In all the experiments, data sources are partitioned into non-iid subsets, each owned by one agent.

4.1 Toy Examples

Three toy examples, the 2D system, mixed Gaussian, and Swiss roll, are presented in Appendix C. These experiments show the convergence and performance of FedGAN in generating data similar to real data. The first experiment, the 2D system, also shows the robustness of FedGAN performance to increasing the synchronization interval $K$ to reduce communications.

4.2 Image Datasets

We test FedGAN on MNIST, CIFAR-10, and CelebA to show its performance on image datasets.

Both MNIST and CIFAR-10 consist of 10 classes of data, which we split across 5 agents, each with two classes of images. We use the ACGAN neural network structure in (Odena et al., 2017). For a detailed list of architectures and hyperparameters and other generated images, see Appendix D.

For the MNIST dataset, we set the synchronization interval $K$. Figure 1(a) presents the generated MNIST images from FedGAN. It shows FedGAN can generate images close to real ones.

(a) Generated MNIST images from FedGAN.

(b) FID score on CIFAR-10 for different synchronization intervals $K$ and for distributed GAN.
Figure 1: Experiment results for FedGAN on MNIST and CIFAR-10.

For the CIFAR-10 dataset, we use FID scores (Heusel et al., 2017) to compare the generated and real data and show the realness of the images generated by FedGAN. We check the robustness of FedGAN performance to reduced communications and increased synchronization intervals $K$. We benchmark FedGAN performance against a typical distributed GAN similar to that in (Augenstein et al., 2019), where there are local discriminators that send their parameters to the intermediary after each update, and the intermediary sends back the average discriminator parameters plus the generated data of its updated centralized generator (to the best of our knowledge, there is no other algorithm for communication-efficient distributed GAN to compare ours with). Figure 1(b) shows the results of FedGAN on CIFAR-10. It can be observed that, even for a large synchronization interval $K$, the FedGAN FID score is close to that of distributed GAN (except for the tail part). This indicates that FedGAN has high performance for image data, and furthermore its performance is robust to reducing the communications by increasing the synchronization interval $K$. The gap between distributed GAN and FedGAN in the tail part requires further investigation in the future.
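
For reference, FID compares Gaussian fits to Inception features of real and generated images; a minimal sketch of the score from precomputed feature matrices is below, where feats_real and feats_fake are hypothetical arrays of Inception activations (this is a generic FID computation, not the paper's evaluation code).

```python
# Minimal FID sketch from precomputed (n_samples, n_features) feature matrices.
import numpy as np
from scipy import linalg

def fid_score(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                            # drop numerical noise
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

# Example with random features; real usage would pass Inception-v3 activations.
rng = np.random.default_rng(0)
print(fid_score(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```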

Next, we test the FedGAN algorithm on CelebA (Liu et al., 2015), a dataset of face images of celebrities. We split the data across agents by first generating 16 classes based on the combinations of four binary attributes of the images, Eyeglasses, Male, Smiling, and Young, and then allocating each of these classes to one of the agents (some classes are divided between two agents to ensure equal data sizes across agents); a hypothetical sketch of this split is given after Figure 2. We check FedGAN robustness to reduced communications by increasing the synchronization interval $K$, and also compare the performance with distributed GAN. We use the ACGAN neural network structure in (Odena et al., 2017) (for details of data splitting, and lists of structures and hyperparameters, see Appendix D). Figure 2(a) shows the generated images (more generated images are presented in Appendix D). Figure 2(b) shows that the performance of FedGAN is close to distributed GAN, and robust to reduced communications.

(a) Generated CelebA images from FedGAN.

(b) FID score on CelebA compared to distributed GAN.
Figure 2: Experiment results for FedGAN on CelebA.
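
The following is a small, hypothetical sketch of the attribute-based non-iid split described above; the attribute table, number of agents, and class-to-agent assignment are illustrative placeholders (the actual split, including dividing some classes between two agents, is described in Appendix D).

```python
# Hypothetical sketch: form 2^4 = 16 classes from four binary CelebA attributes
# and allocate each class to an agent, producing a non-iid partition.
import pandas as pd

ATTRS = ["Eyeglasses", "Male", "Smiling", "Young"]
NUM_AGENTS = 8  # illustrative number of agents

# attr_df stands in for the CelebA attribute table: one row per image.
attr_df = pd.DataFrame({
    "image_id":   [f"img_{k}.jpg" for k in range(8)],
    "Eyeglasses": [0, 1, 0, 1, 0, 1, 0, 1],
    "Male":       [0, 0, 1, 1, 0, 0, 1, 1],
    "Smiling":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Young":      [1, 1, 1, 1, 0, 0, 0, 0],
})

# Each unique combination of the four binary attributes defines one class.
attr_df["cls"] = attr_df[ATTRS].astype(str).apply(lambda r: "".join(r), axis=1)
classes = sorted(attr_df["cls"].unique())

# Simple round-robin allocation of classes to agents (placeholder policy).
agent_of_class = {c: i % NUM_AGENTS for i, c in enumerate(classes)}
attr_df["agent"] = attr_df["cls"].map(agent_of_class)
print(attr_df[["image_id", "cls", "agent"]])
```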

4.3 Time Series Data in Energy Industry

In this section, we test FedGAN on time series data. We particularly focus on the energy industry, where communication and privacy constraints are important and data often is in time series format. We experiment on household electricity consumption and electric vehicle (EV) charging sessions data. In order to measure the performance of FedGAN for time series data, absent an accepted measure or score of the realness of time series data, we cluster both the real data and the generated data, and visually compare the top 9 cluster centroids for each.

For household electricity consumption, we use the hourly load profile data of households over one year in California from Pacific Gas and Electric Company (PG&E) (dividing each single household's data into separate daily profiles). The data includes both household characteristics and temporal features: regional average income, enrollment in a low-income tariff program or not, all-electric or not, daily average temperature, weekday/weekend, month, tariff type, climate zone, and house premise type.

For EV data, we use charging session data from an electric vehicle (EV) charging station company, where each session is defined by the plug-in and plug-off of an EV at a charging station. For each session, we observe the start time, end time, 15-min charging power, charging time, charged energy, and whether the EV was fully charged or not. We also observe characteristics of the charging station as well as the EV; for example, an EV with battery capacity 24 kWh arriving at a high-tech workplace at 9:00am on a Monday (see Appendix E for full details).

For both experiments, we split the data in equal parts across the agents (representing different utility companies or different EV charging companies), based on climate zones or the category of charging stations (to ensure non-iid data across agents), and set the synchronization interval $K$. We use a network structure similar to CGAN (Mirza and Osindero, 2014).

We hold out a portion of each agent's data, train FedGAN on the rest of the data, and use the trained FedGAN to generate fake time series profiles for the held-out portion. We then apply k-means to both the real and the generated data of that portion. The top 9 k-means centroids are shown in Figures 3(a) and 3(b) for household electricity consumption, and in Figure 4 for EV charging sessions. Visually comparing them shows the performance of FedGAN in generating profiles close to real ones for time series data.
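
As an illustration of this evaluation, the sketch below clusters held-out real profiles and generated profiles with k-means and extracts the largest clusters' centroids for visual comparison; real_profiles and fake_profiles are hypothetical placeholder arrays, not the PG&E or EV data.

```python
# Hypothetical sketch of the Section 4.3 evaluation: compare k-means centroids
# of real and FedGAN-generated daily profiles (rows are normalized 24-hour profiles).
import numpy as np
from sklearn.cluster import KMeans

def top_centroids(profiles, n_clusters=9, seed=0):
    """Return centroids of the largest clusters, largest first."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(profiles)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_[np.argsort(sizes)[::-1]]

rng = np.random.default_rng(0)
real_profiles = rng.random((1000, 24))   # placeholder for held-out real profiles
fake_profiles = rng.random((1000, 24))   # placeholder for generated profiles

real_c = top_centroids(real_profiles)
fake_c = top_centroids(fake_profiles)
print(real_c.shape, fake_c.shape)        # (9, 24) each; plotted side by side as in Figure 3
```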

(a) Top 9 cluster centroids of the real data

(b) Top 9 cluster centroids of the generated data
Figure 3: Top 9 k-means clusters for real PG&E daily household electricity consumption and for FedGAN-generated profiles. The consumption profiles are normalized.

(a) Top 9 cluster centroids of the real data

(b) Top 9 cluster centroids of the generated data
Figure 4: Top 9 k-means clusters for real EV charging profiles and for FedGAN-generated profiles. The charging profiles are normalized.

5 Conclusions and Future Directions

We proposed an algorithm for communication-efficient distributed GAN subject to privacy constraints (FedGAN). We proved the convergence of FedGAN for non-iid sources of data, with both equal and two time-scale updates of the generators and discriminators. We tested FedGAN on toy examples (2D, mixed Gaussian, and Swiss roll), image data (MNIST, CIFAR-10, and CelebA), and time series data (household electricity consumption and EV charging sessions) to study its convergence and performance. We showed FedGAN has performance similar to a general distributed GAN, while reducing communication complexity. We also showed its robustness to reduced communications.

There are some observations and open problems. The first is to experiment with FedGAN on other federated learning datasets such as mobile phone texts, and on other applications besides image classification and energy. Checking robustness to an increasing number of agents is also needed, which requires a large set of GPUs (beyond the engineering capacity of this research). Theoretically, an explanation for the FedGAN robustness to reduced communication, as well as identifying the rate of convergence, are interesting. While privacy is the main reason agents do not share data in FedGAN, adding privacy noise to the model parameters, for example via differential privacy, can further preserve privacy and should be studied. Finally, non-responsiveness of some agents in practice should be studied.

6 Broader Impact

This work has the following potential positive impact on society: it provides an algorithm for shared learning across agents with local data while preserving privacy and keeping communication costs low, hence it helps democratize data power. We have also emphasized the energy domain as an application, which is at the forefront of sustainability and reversing global warming; we particularly experimented on data for household demand prediction and electric vehicle charging station planning.

Acknowledgment

We would like to thank the Pacific Gas and Electric Company (PG&E) for providing the household energy consumption dataset and SLAC National Accelerator Laboratory for providing the EV dataset.

References

  • S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, et al. (2019) Generative models for effective ml on private, decentralized datasets. arXiv preprint arXiv:1911.06679. Cited by: §1, §1, §2, §4.2.
  • K. Balachandran, R. L. Olsen, and J. M. Pedersen (2014) Bandwidth analysis of smart meter network infrastructure. In 16th International Conference on Advanced Communication Technology, pp. 928–933. Cited by: §1.
  • V. S. Borkar (1997) Stochastic approximation with two time scales. Systems & Control Letters 29 (5), pp. 291–294. Cited by: §2.
  • V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer. Cited by: Appendix A, Appendix A, Appendix A, Appendix A, Appendix B, §2, §3.3, Corollary 1, Theorem 2, Theorem 3.
  • C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert (2018) Gan augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863. Cited by: §1.
  • B. Chasnov, L. J. Ratliff, E. Mazumdar, and S. A. Burden (2019) Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings. arXiv preprint arXiv:1906.00731. Cited by: §2.
  • C. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Zomaya, and V. Gramoli (2019) Federated learning over wireless networks: convergence analysis and resource allocation. arXiv preprint arXiv:1910.13067. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: Appendix C, §4.
  • C. Hardy, E. Le Merrer, and B. Sericola (2018a) Gossiping gans. In Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning: DIDL, Vol. 22. Cited by: §2.
  • C. Hardy, E. L. Merrer, and B. Sericola (2018b) Md-gan: multi-discriminator generative adversarial networks for distributed datasets. arXiv preprint arXiv:1811.03850. Cited by: §1, §2, §2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §2, §3.3.
  • P. Karmakar and S. Bhatnagar (2017) Two time-scale stochastic approximation with controlled markov noise and off-policy temporal-difference learning. Mathematics of Operations Research 43 (1), pp. 130–151. Cited by: Appendix A, §2, §4.2.
  • H. K. Khalil (2002) Nonlinear systems. Upper Saddle River. Cited by: §2.
  • N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: Appendix C.
  • V. R. Konda, J. N. Tsitsiklis, et al. (2004) Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability 14 (2), pp. 796–819. Cited by: §2.
  • J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §1, Remark 1.
  • X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189. Cited by: §2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: §4.2.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §1.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §2, §3.3.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §2, §3.3, §3.3.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled generative adversarial networks. ArXiv abs/1611.02163. Cited by: Appendix C, §4.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §4.3.
  • V. Nagarajan and J. Z. Kolter (2017) Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595. Cited by: Appendix C, §2, §3.3, §3.3, §4.
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) F-gan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §2.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. Cited by: §4.2, §4.2.
  • A. Rajagopal and V. Nirmala (2019) Federated AI lets a team imagine together: federated learning of GANs. Cited by: §2.
  • L. J. Ratliff, S. A. Burden, and S. S. Sastry (2013) Characterization and computation of local nash equilibria in continuous games. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 917–924. Cited by: §2.
  • L. Stankovic, V. Stankovic, J. Liao, and C. Wilson (2016) Measuring the energy intensity of domestic activities from smart meter data. Applied Energy 183, pp. 1565–1580. Cited by: §1.
  • S. U. Stich (2018) Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767. Cited by: §2.
  • J. Wang and G. Joshi (2018) Cooperative sgd: a unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576. Cited by: §2.
  • S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2019) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221. Cited by: §1, §2, §3.3.
  • Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19. Cited by: §1.
  • R. Yonetani, T. Takahashi, A. Hashimoto, and Y. Ushiku (2019) Decentralized learning of generative adversarial networks from multi-client non-iid data. arXiv preprint arXiv:1905.09684. Cited by: §1, §2.
  • H. Yu, S. Yang, and S. Zhu (2019) Parallel restarted sgd with faster convergence and less communication: demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5693–5700. Cited by: §2.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §3.3.

Supplementary Material for "FedGAN: Federated Generative Adversarial Networks for Distributed Data"

Appendix A Convergence Analysis of FedGAN with Two Time-Scale Updates

To study the two time-scale FedGAN we add the following assumptions.

  • (A6) $b(n) = o(a(n))$, i.e., the generator is updated on a slower time scale than the discriminator.

  • (A7) For each $\theta$, the ODE $\dot{w}(t) = -g(w(t), \theta)$ has a local asymptotically stable attractor $\lambda(\theta)$ within a domain of attraction such that $\lambda$ is Lipschitz. The ODE $\dot{\theta}(t) = -h(\lambda(\theta(t)), \theta(t))$ has a local asymptotically stable equilibrium within a domain of attraction.

(A6) is the standard assumption to determine the relationship of the generator and discriminator learning rates. (A7) characterizes the local asymptotic behavior of the limiting ODE in (4) and shows its local asymptotic stability. Both assumptions are regular conditions in the literature on two time-scale stochastic approximation [Borkar, 2009, Karmakar and Bhatnagar, 2017]. In the stochastic approximation literature, global asymptotic stability assumptions are often made, but [Karmakar and Bhatnagar, 2017] and Chapter 2 of [Borkar, 2009] relax them to local asymptotic stability, which is a more practical assumption for GANs. (Note that another way of relaxing to local asymptotic stability is to assume that the initial parameter is in the region of attraction of a locally asymptotically stable attractor, which is difficult to ensure in practice.) The relaxed local stability assumption (A7) limits the convergence results to be conditioned on an unverifiable event, i.e., that $w_n$ and $\theta_n$ eventually belong to some compact set of their region of attraction.

To prove the convergence of FedGAN for two time-scale updates, similar to the proof for the equal time-scale update, we show that the ODE representation of FedGAN asymptotically tracks the ODE below, representing the parameter updates of the two time-scale centralized GAN,

(9)

where $b(n) = o(a(n))$ to ensure that $w$ updating is fast compared to $\theta$ updating (A6). Consequent to (A6), $\theta$ can be considered quasi-static while analyzing the updates of $w$, and we can look at the following ODE in studying $w$ (with a small change of notation we drop the dependency on the fixed $\theta$ for convenience):

(10)  $\dot{w}(t) = -g(w(t), \theta)$

In order to show that the updates of $w$ asymptotically track (10), we follow the same idea as in the equal time-scale update: construct the continuous interpolated trajectory $\bar{w}(t)$ defined immediately after (5), and show that it asymptotically almost surely approaches the solution of (10). For this, we also use the construction of $w^s(t)$ and $w_s(t)$ defined immediately after (6). We thus have the following lemma as a special case of Theorem 1.

Lemma 3.

For any $T > 0$,

(11)  $\lim_{s \to \infty} \sup_{t \in [s, s+T]} \|\bar{w}(t) - w^s(t)\| = 0 \ \text{a.s.}, \qquad \lim_{s \to \infty} \sup_{t \in [s-T, s]} \|\bar{w}(t) - w_s(t)\| = 0 \ \text{a.s.}$
Proof of Lemma 3.

The proof can be directly obtained by considering the special case of Theorem 1 where the dimension of $\theta$ is zero. ∎

With Lemma 3, Theorem 2 below shows that FedGAN tracks the ODE in (10) asymptotically when $\theta$ is fixed. We refer the reader to [Borkar, 2009] for the proof.

Theorem 2.

[Theorem 2, Chapter 2, [Borkar, 2009]]. Almost surely, the sequence $\{w_n\}$ generated by (1) when $\theta$ is fixed converges to a (possibly sample-path-dependent) compact connected internally chain transitive invariant set of (10).

The following Lemma 4 and Theorem 3 extend the results of Theorem 2 to the case when both $w$ and $\theta$ vary as in (9). For both proofs we refer to the respective parts of [Borkar, 2009]. Lemma 4 shows that $w_n$ asymptotically tracks $\lambda(\theta_n)$, and Theorem 3 shows that the proposed FedGAN asymptotically converges to an equilibrium of the ODE in (9) representing the centralized GAN with two time-scale updates. Lemma 4 is an adaptation of Lemma 1 in Chapter 6 of [Borkar, 2009].

Lemma 4.

For the two time-scale updates as specified in (1), $(w_n, \theta_n)$ converges almost surely to an internally chain transitive invariant set of the ODE $\dot{w}(t) = -g(w(t), \theta(t))$, $\dot{\theta}(t) = 0$.

Proof of Lemma 4.

Rewrite the generator update as

(12)  $\theta_{n+1} = \theta_n - a(n)\,\epsilon_n$

where $\epsilon_n = \frac{b(n)}{a(n)}\,\tilde{h}(w_n, \theta_n)$ and $\epsilon_n \to 0$ since $b(n) = o(a(n))$. From the third extension in Section 2.2 of [Borkar, 2009], $(w_n, \theta_n)$ should converge to an internally chain transitive invariant set of $\dot{w}(t) = -g(w(t), \theta(t))$, $\dot{\theta}(t) = 0$. Considering this, the proof follows directly from Theorem 2. ∎

Theorem 3.

[Theorem 2, Chapter 6, [Borkar, 2009]] $(w_n, \theta_n)$ converges almost surely to an equilibrium of the ODE in (9).

Appendix B Proof of Results in Section 3.3

Proof of Lemma 1.
(13)

The latter part on the right hand side can be written as

(14)

Here $\sigma_g^2$ and $\sigma_h^2$ are the bounds on the variances of the discriminator and generator stochastic gradients respectively, and $\epsilon$ is the bound on the gradient divergence for the discriminator (Assumption (A5)). The above inequality follows from Assumption (A1) and an identity that holds because the fake data is generated based on the parameters of the generator. Also, the last equality holds because $a(n) = b(n)$ in equal time-scale updates.

Considering (13) and (14), we have

(15)

The last inequality holds by induction over $n$ and the mild assumption that the learning rate is unchanged within the same synchronization interval. ∎

Proof of Lemma 2.
(16)

We have

(17)

where the last inequality follows from Lemma 1.

Consequently,

(18)

Proof of Theorem 1.

The proof is modified from Lemma 1 in Chapter 2 of [Borkar, 2009]. We shall only prove the first claim, as the arguments for proving the second claim are completely analogous. Define

(19)
(20)

Denote , . Let . Let and . (Note that is overloaded here both as a function and a variable). By construction,

(21)
(22)

Furthermore, denote

(23)

Also by construction,