Venn GAN: Discovering Commonalities and Particularities of Multiple Distributions

02/09/2019 · by Yasin Yazıcı, et al.

We propose a GAN design which models multiple distributions effectively and discovers their commonalities and particularities. Each data distribution is modeled with a mixture of K generator distributions. As the generators are partially shared between the modeling of different true data distributions, shared ones capture the commonality of the distributions, while non-shared ones capture their unique aspects. We show the effectiveness of our method on various datasets (MNIST, Fashion-MNIST, CIFAR-10, Omniglot, CelebA) with compelling results.


1 Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) learn a function that can sample from an approximated probability distribution. Due to enormous interest, GANs have been improved substantially over the past few years (Radford et al., 2015; Gulrajani et al., 2017; Miyato et al., 2018; Karras et al., 2017; Mescheder, 2018).

GANs are designed to learn a single distribution, though multiple distributions can be modeled by treating them separately. However, this naive approach does not consider relationships between the distributions. An interesting question is how to model multiple distributions efficiently and discover their common and unique aspects. We explain this situation by utilizing Venn diagrams. Figure 1 depicts different interactions between sets, where each set represents a distribution. In one configuration, each set has its own unique part as well as intersections with the other sets, whereas in another, some sets are supersets of others. Each case is useful in a different scenario; the nested configuration, for example, fits the case where one distribution is a subset of another, such as a specific dog breed within the distribution over many dog breeds.

Figure 1: Three different configurations of Venn diagrams with 2 and 3 sets.

In this paper, we propose Venn GAN, which models multiple distributions efficiently and discovers their interactions and uniqueness. Each data distribution is modeled with a mixture of generator distributions. As the generators are partially shared between the modeling of different true data distributions, shared ones capture the commonality of the distributions, while non-shared ones capture their unique aspects. Our contributions are the following:

  • Introducing a novel and interesting problem setting in which multiple distributions exist in various configurations (see Figure 1).

  • Proposing a new method that can capture commonalities and particularities of various distributions with a high success rate.

  • Thoroughly evaluating the method on various datasets, namely MNIST, Fashion-MNIST, CIFAR-10, Omniglot and CelebA, with compelling results.

2 Related work

Multi-generator/discriminator GAN: There have been several attempts to use multiple generators or discriminators to solve various issues with GANs. Arora et al. (2017); Hoang et al. (2017); Ghosh et al. (2017) modeled a single distribution with multiple generators to capture different modes of the distribution. In order to guide the generators toward different modes, they utilized a classifier which separates the generators from one another.

Durugkar et al. (2016); Neyshabur et al. (2017); Juefei-Xu et al. (2017) utilized multiple discriminators to address mode collapse and optimization stability. Similarly, Doan et al. (2018) used multiple discriminators with learned importance to ease the training of GANs. Tolstikhin et al. (2017) used a meta-learning algorithm analogous to AdaBoost to improve coverage of modes with multiple generators.

Mixture of Distributions with GAN: Some earlier works treated multiple generators as a mixture of distributions modeling a single distribution (Arora et al., 2017; Hoang et al., 2017; Ghosh et al., 2017). Our model is different, as we model multiple data distributions and share the generator distributions as components of each data distribution.

Conditional GAN: This type of GAN uses a condition, alongside noise, to generate data (Mirza & Osindero, 2014). The conditions are expected to correlate with the generated data. Conditional GANs have been used for image-to-image translation (Isola et al., 2016; Hoang et al., 2018; Yi et al., 2017), text-to-image synthesis (Reed et al., 2016) and super-resolution (Ledig et al., 2016).

How GANs are conditioned is still an active research field. We focus on conditioning of the generator. The most common way to include conditions in the generator is to provide them as input (Mirza & Osindero, 2014; Reed et al., 2016; Odena et al., 2016). Recently, Miyato & Koyama (2018) used conditional BatchNorm (de Vries et al., 2017; Dumoulin et al., 2017) to include conditions in the generator.

Other Related Works: The concurrent work of Kaneko et al. (2018) is perhaps the most similar to ours. However, their motivation, method and experiments are different from ours. They are motivated by ambiguous class labels caused by label noise, and propose a model to discover class-distinct and class-mutual parts. Their method utilizes a modified version of AC-GAN and redesigns its input to achieve this objective, while our work scales the GAN objective to multiple distributions and models each distribution as a mixture of generator distributions.

3 Method

3.1 Background

GAN is a two-player zero-sum game between a discriminator and a generator:

max_D V(D, G) = E_{x~p_d}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]   (1)

min_G E_{z~p_z}[log(1 − D(G(z)))]   (2)

It utilizes the discriminator to assess a pseudo-divergence between the true data distribution, p_d, and the generator's distribution, p_g. The discriminator maximizes the divergence, while the generator minimizes it. In this way, the generator learns to mimic the data distribution implicitly. Goodfellow et al. (2014) show that, under certain assumptions, for a fixed optimal D, minimizing Eq. 2 for G leads to p_g = p_d.
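The following is a minimal PyTorch sketch of this objective, assuming a discriminator that outputs raw logits; the non-saturating generator loss used below is a common substitute for the literal Eq. 2 and is our assumption, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    # Eq. 1: D maximizes E_x[log D(x)] + E_z[log(1 - D(G(z)))].
    # We minimize the negative, written with the numerically stable
    # logits form of binary cross-entropy.
    real_term = F.binary_cross_entropy_with_logits(
        d_real_logits, torch.ones_like(d_real_logits))
    fake_term = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))
    return real_term + fake_term

def generator_loss(d_fake_logits):
    # Non-saturating variant of Eq. 2: instead of minimizing
    # log(1 - D(G(z))), the generator maximizes log D(G(z)).
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
```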

3.2 Multi-distribution GAN

The value function, Eq. 1, can be scaled to K distributions trivially as follows:

max_{D_1,...,D_K} Σ_{k=1}^K E_{x~p_d^k}[log D_k(x)] + E_{x~p_g^k}[log(1 − D_k(x))]   (3)

min_{G_1,...,G_K} Σ_{k=1}^K E_{x~p_g^k}[log(1 − D_k(x))]   (4)

where p_d^k is the k-th true data distribution and p_g^k is the k-th generator's distribution, which are independent from one another. Note that D_i and p_g^j (Eq. 4 does not explicitly show G_j, only its distribution p_g^j) interact with one another only when i = j. This makes learning one distribution independent from the others. By following the proof from Goodfellow et al. (2014), we can show that, at equilibrium, p_g^k = p_d^k for every k.

However, this objective does not consider possible overlaps between the data distributions. Incorporating them can make the model more efficient and lead to interesting discoveries, e.g. the commonalities and particularities of the distributions. To achieve this, we have reformulated the way we construct the generator distributions p_g^k. Each is no longer the distribution of a single generator, but a mixture of generator distributions q_j (Eq. 5). In this way, each data distribution is modeled as a mixture of generators' distributions. As the q_j are shared across all data distributions, some of them cover common parts and others unique ones. Each generator learns only a sub-part of the distributions, and the sub-parts are combined in different proportions to compose the data distributions.

p_g^k = Σ_j π_kj q_j   (5)

where π is a mixture matrix whose rows sum up to one to make p_g^k valid. Note that this reformulation does not change the objective (Eq. 3 and Eq. 4), only how we model p_g^k.
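A small sketch of sampling from Eq. 5: draw a region index from the k-th row of π, then sample from that region's generator. The names (`pi`, `generators`) and the NumPy setting are our illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(pi, generators, k, n_samples, noise_dim=128):
    """Sample n_samples points from p_g^k = sum_j pi[k, j] q_j (Eq. 5).

    `generators` is a list of callables mapping noise z to samples."""
    assert np.allclose(pi.sum(axis=1), 1.0), "rows of pi must sum to one"
    counts = rng.multinomial(n_samples, pi[k])  # samples drawn per region
    batches = []
    for j, n_j in enumerate(counts):
        if n_j > 0:
            z = rng.standard_normal((n_j, noise_dim))
            batches.append(generators[j](z))
    return np.concatenate(batches, axis=0)
```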

3.3 Conceptual Explanation: Relation to Venn Diagrams

The method in the previous section can be explained using Venn diagrams where each set represents a distribution. We deal with a situation where multiple distributions exist. Each distribution might have a unique part and commonalities with the other distributions, as in the overlapping configurations of Figure 1. In another case, one distribution's support might cover the others', as in the nested configuration of Figure 1. Our proposed method models each region of a Venn diagram as a probability distribution q_j. Each set S_k should capture its corresponding data distribution, i.e. p_d^k. Each set can be represented by the union of its regions, e.g. for three overlapping sets, S_1 = R_1 ∪ R_12 ∪ R_13 ∪ R_123, where R_1 is the part unique to S_1 and R_12, R_13, R_123 are its pairwise and three-way intersections with the other sets. Conversely, each region can be represented with set operations, e.g. R_123 = S_1 ∩ S_2 ∩ S_3. Set configurations can take different forms, e.g. the nested configuration of Figure 1 is S_3 ⊂ S_2 ⊂ S_1.

The fully overlapping 3-set diagram can be represented by:

p_g^1 = (1/4)(q_1 + q_12 + q_13 + q_123), and analogously for p_g^2 and p_g^3,   (6)

where q_12 denotes the region shared by sets 1 and 2, and q_123 the region shared by all three. Similarly, the nested diagram can be represented by:

p_g^1 = (1/3)(q_1 + q_2 + q_3),   p_g^2 = (1/2)(q_2 + q_3),   p_g^3 = q_3,   (7)

with regions numbered from the outermost ring inward. In both cases we assume that each region contributes equally. Learning the mixture weights is left for future study.
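The mixture matrices behind Eq. 6 and Eq. 7 can be constructed mechanically; the sketch below (our illustration, with hypothetical function names) builds π for the fully overlapping and the nested configurations.

```python
import itertools
import numpy as np

def full_venn_pi(K):
    # One region per nonempty subset of the K sets (2^K - 1 regions).
    # Set k mixes, with equal weight, every region whose subset contains k;
    # for K = 3 this reproduces Eq. 6 (each row has four 1/4 entries).
    subsets = [s for r in range(1, K + 1)
               for s in itertools.combinations(range(K), r)]
    pi = np.zeros((K, len(subsets)))
    for j, s in enumerate(subsets):
        for k in s:
            pi[k, j] = 1.0
    return pi / pi.sum(axis=1, keepdims=True), subsets

def nested_pi(K):
    # S_K subset of ... subset of S_1: region j is the "ring" S_j \ S_{j+1},
    # and set k is the equal-weight union of rings j >= k, reproducing
    # Eq. 7 for K = 3.
    pi = np.triu(np.ones((K, K)))
    return pi / pi.sum(axis=1, keepdims=True)

pi3, regions = full_venn_pi(3)  # 3 x 7 matrix whose rows sum to one
```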

3.4 Implementation Details

Generator side: We can use two approaches to model the region distributions q_j. The first is to use independent generators, one per region: each region is modeled by G_j(z; θ_j), where G_j is a generative network, z is input noise and θ_j are the parameters of the j-th network. The second approach is a single generator with conditions: each region is modeled by G(z, c; θ), where c is a condition whose j-th index is used to generate region j and θ are the network parameters. The former approach can be expensive when there are many regions to model, but it has its own advantages, as we will show in the experiments. The conditional generator is more efficient, since the number of regions grows exponentially with the number of distributions, e.g. K fully overlapping distributions contain up to 2^K − 1 regions. Also, sharing weights between regions regularizes the model and makes training easier. Besides, this type of generator has certain effects on modeling: different conditions with the same noise produce semantically related samples, as detailed in the CelebA experiments. We use both types and discuss their advantages and disadvantages in more detail in the experiments section.
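For the conditional approach, the conditioning described in Section 4 (a linear scaling and addition per feature channel, without normalization) could be written as a small module along these lines; the class name and initialization are our assumptions, a sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionalScaleShift(nn.Module):
    """Per-region conditioning: the region index selects a learned
    per-channel scale (gamma) and shift (beta), applied to the feature
    map directly, with no normalization."""

    def __init__(self, num_regions, num_channels):
        super().__init__()
        self.gamma = nn.Embedding(num_regions, num_channels)
        self.beta = nn.Embedding(num_regions, num_channels)
        nn.init.ones_(self.gamma.weight)   # start as an identity transform
        nn.init.zeros_(self.beta.weight)

    def forward(self, h, region_idx):
        # h: N x C x H x W features; region_idx: N integer region labels
        g = self.gamma(region_idx)[:, :, None, None]
        b = self.beta(region_idx)[:, :, None, None]
        return g * h + b
```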

Discriminator side: There should be K discriminators for a K-distribution game. As we have changed each generator distribution into a mixture of distributions, each discriminator takes input from all incoming regions with non-zero mixture weight. Figure 2 illustrates the connections for the fully overlapping 3-set diagram; other types can be constructed in a similar way by following the connection pattern in the mixture matrix π. When sampling from the generators to feed D_k, the number of samples from each region should be proportional to the k-th row of π. The "+" sign in the diagram corresponds to a union operation over the incoming regions; in practice, it is a concatenation over the batch dimension. As each set should represent a true data distribution, the union of the regions belonging to set S_k should match the data distribution p_d^k. To satisfy this, each discriminator D_k compares a specific true data distribution p_d^k with the union of the regions belonging to the corresponding set, e.g. D_1 compares p_d^1 with the union of q_1, q_12, q_13 and q_123. As certain regions are fed into more than one discriminator, those regions are forced to represent common parts of the distributions. For example, q_123 will suffer a loss if its modeling does not satisfy the 3-way intersection of the distributions; in other words, it receives negative feedback from any discriminator it cannot satisfy. Similar analogies can be made for the 2-way intersections q_12, q_13 and q_23, whereas individual regions like q_1, q_2 and q_3 are used by only a single discriminator and are thereby inclined to model the unique part of their corresponding distribution. Sharing regions between discriminators that receive different true data distributions is the core dynamic of learning commonalities between the true data distributions. We assume that all the regions in a distribution have equal weights.
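In code, the union amounts to concatenating per-region batches for every region with non-zero weight in row k of π, as in this sketch (names hypothetical); under the equal-weight assumption above, equal per-region batch sizes realize the proportionality.

```python
import torch

def fake_batch_for_discriminator(pi, region_samplers, k, batch_per_region):
    # "Union" over incoming regions: concatenate, over the batch dimension,
    # samples from every region j with pi[k, j] > 0. Because all active
    # regions of a set share the same weight, drawing the same number of
    # samples from each region is proportional sampling.
    parts = [region_samplers[j](batch_per_region)
             for j in range(pi.shape[1]) if pi[k, j] > 0]
    return torch.cat(parts, dim=0)
```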

Figure 2: Venn GAN architecture for 3 distributions with the fully overlapping diagram. Each mode of the generator represents the corresponding region in Figure 1; other Venn diagrams can be constructed in a similar way. Each discriminator receives the union of its corresponding set's regions. For this illustration we use a conditional generator, but independent generators can be used in the same way. The "+" sign takes the union over its incoming regions, and the union is fed into a discriminator as the fake class. Each discriminator receives a different true data distribution and, through the GAN objective, compares it to the mixture of the regions.

The objective of the model is the minimax game with K discriminators for the K-distribution game stated in Eq. 3 and Eq. 4. Eq. 5 shows how p_g^k can be represented; from the Venn diagram perspective, it can equivalently be written as:

p_g^k = (1/N_k) Σ_{j: R_j ⊆ S_k} q_j   (8)

where N_k is the number of regions in set S_k.

In practice, we observe some amount of leakage between regions. To alleviate this issue, we include an additional objective which aims to separate the regions of the generator from one another:

L_C = Σ_j E_{x~q_j}[log C(j | x)]   (9)

where j is the category for region q_j and C is a classifier which outputs a probability distribution over the regions. With this objective, the classifier tries to separate the regions, and the generator tries to satisfy the classifier by increasing the differences between the regions. Similar losses have been used previously by Hoang et al. (2018). The combined objective becomes:

V_total = V + λ L_C   (10)

where λ is a balancing hyper-parameter between the two terms and V is the value function of Eq. 3 and Eq. 4.
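A sketch of the classifier term and the combined generator loss, under the assumption that cross-entropy (the negative of Eq. 9) is minimized by both the classifier and the generator:

```python
import torch
import torch.nn.functional as F

def classifier_term(classifier, fake_images, region_labels):
    # Negative of Eq. 9: cross-entropy of the region classifier C(j | x).
    # The classifier minimizes it to separate the regions; the generator
    # minimizes it too, pushing the regions apart from one another.
    logits = classifier(fake_images)
    return F.cross_entropy(logits, region_labels)

def generator_total_loss(gan_loss, cls_loss, lam=1.0):
    # Eq. 10: GAN term plus the lambda-weighted classifier term.
    # lam = 1.0 is a placeholder, not the paper's tuned value.
    return gan_loss + lam * cls_loss
```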

4 Experiments

Network Architecture: The discriminator and generator architectures are similar to DCGAN (Radford et al., 2015) for MNIST, Fashion-MNIST, Omniglot and CIFAR-10, while CelebA uses a ResNet-type architecture; detailed specifications are given in the Appendix. The classifier architecture is the same as the discriminator's except for the last layer, whose output dimension equals the number of regions. An Exponential Moving Average (EMA) (Karras et al., 2017; Yazıcı et al., 2018) is kept over the generator(s) parameters outside the training loop. Conditioning of the generator is similar to that of Miyato & Koyama (2018); de Vries et al. (2017); Dumoulin et al. (2017), except that there is no normalization, only scaling and addition.
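A minimal sketch of the EMA update, assuming an `ema_generator` created as a deep copy of the generator before training starts; the decay value is our placeholder, not the paper's setting.

```python
import torch

@torch.no_grad()
def update_ema(ema_generator, generator, decay=0.999):
    # Exponential moving average of generator weights, maintained outside
    # the training loop and used only for sampling/evaluation:
    # p_ema <- decay * p_ema + (1 - decay) * p
    for p_ema, p in zip(ema_generator.parameters(), generator.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```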

Objective Details: A zero-centered gradient penalty (Mescheder, 2018) has been applied on the true data distributions for each discriminator in every case except the illustrative examples. We found that this improves the quality of generation, especially on CelebA.
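A sketch of a zero-centered gradient penalty on real data in PyTorch; the weight `gamma` is a placeholder, since the paper's value is not reproduced here.

```python
import torch

def r1_gradient_penalty(discriminator, real_images, gamma=10.0):
    # Penalize the squared norm of grad_x D(x) at real samples
    # (the zero-centered regularizer of Mescheder, 2018).
    real_images = real_images.detach().requires_grad_(True)
    d_out = discriminator(real_images)
    grad, = torch.autograd.grad(d_out.sum(), real_images, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(1).mean()
```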

Optimization & Hyperparameters: We have used the ADAM optimizer (Kingma & Ba, 2014). The discriminator and generator follow an alternating update rule with a single discriminator update per generator update. The model has been trained for 100k iterations for CelebA, 50k for CIFAR-10, and 20k for MNIST, Fashion-MNIST and Omniglot. For each region, we use a fixed batch size (a different one for the illustrative example). The batch size of real data depends on the number of regions fed to each discriminator: a union over M regions corresponds to M times the per-region batch size. λ is selected by searching over a range of values with the quantitative score (explained shortly) over various scenarios. The classifier's optimization is the same as the discriminators'. For the conditional generator, we use the same noise for different conditions during training. The illustrative example does not use a classifier.

Quantification of Results: For the artificial datasets, we can quantify the rate of correct generation (accuracy) for the different regions. To do so, we have trained a separate classifier on MNIST, Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009) using their training splits. This model assesses whether the generated images from each region belong to the correct classes. We use 10k generated samples from each region to compute the metric, and the accuracy of the classifier on each region is reported. Details about the architecture, optimization etc. of the classifier can be found in the Appendix. The accuracies of the classifier on the test sets of MNIST, Fashion-MNIST and CIFAR-10 are 99.12%, 91.20% and 84.20%, respectively. During Venn GAN training, we evaluate the model every 2k iterations and report the best average results.
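The metric can be computed as in the following sketch, assuming a pretrained `classifier`, a callable that samples one region, and the set of class indices that region is supposed to produce (all names hypothetical):

```python
import torch

@torch.no_grad()
def region_accuracy(classifier, sample_region, expected_classes,
                    n_samples=10000, batch=100, noise_dim=128):
    # Generate from one region, classify with the separately trained
    # classifier, and count predictions that fall inside the classes
    # the region is supposed to cover.
    correct = 0
    for _ in range(n_samples // batch):
        z = torch.randn(batch, noise_dim)
        preds = classifier(sample_region(z)).argmax(dim=1)
        for c in expected_classes:
            correct += (preds == c).sum().item()
    return correct / n_samples
```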

4.1 Illustrative Examples

We use a mixture-of-Gaussians illustrative example to show that the method works as anticipated. The nature of the dataset and its low dimensionality make it easy to spot subtle behaviours of the method. For this experiment, we generate 3 different data distributions, where each equally mixes 4 out of 7 Gaussians, as in Figure 3.
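The toy data can be recreated along the lines of the sketch below; the component layout, noise scale and the particular 4-of-7 assignments are our assumptions, chosen so that each Venn region corresponds to exactly one Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# 7 Gaussian components on a ring; each of the 3 data distributions
# equally mixes 4 of the 7. Component 6 is shared by all three sets,
# components 3, 4, 5 by exactly two, and 0, 1, 2 are unique to one.
angles = 2 * np.pi * np.arange(7) / 7
means = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
sets = [[0, 3, 4, 6], [1, 3, 5, 6], [2, 4, 5, 6]]

def sample_distribution(k, n):
    comps = rng.choice(sets[k], size=n)  # pick a component uniformly
    return means[comps] + 0.05 * rng.standard_normal((n, 2))
```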

Figure 3: Samples from the data distributions for the illustrative example. Each distribution equally mixes 4 out of 7 Gaussians.

In order to model these distributions, we use the fully overlapping 3-set configuration with one generator per region (7 in total). The experiment is conducted with independent generators for 5k iterations; further details about the training and architecture are in the Appendix. Figure 4 shows the results. All the regions are generated at the correct positions, e.g. the pink samples are generated by the region common to all three distributions. We have conducted this experiment multiple times with no notable differences, which indicates the stability of the method.

Figure 4: Generated regions for the illustrative example, colored by region of the Venn diagram in Figure 1; the region common to all three distributions is shown in pink.

4.2 Main Experiments

We have designed experiments on multiple artificial datasets as well as natural datasets to investigate the working dynamics of the method. For the artificial datasets, MNIST, Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009) have been used. With these datasets, we have designed 2- and 3-distribution games using the Venn configurations of Table 1. The distributions are constructed by using the class information from the datasets, as per Tables 1 and 2. For all types, each distribution contains 2000 samples from the classes it includes. We never use the same sample twice for different distributions, as that could lead to trivial solutions.

Case  Venn Type                  Distributions
A     2 overlapping sets         2
B     3 fully overlapping sets   3
C     3 nested sets              3
Table 1: Configurations of artificial datasets. The classes assigned to each set are drawn from the label correspondences in Table 2.
Label          0            1           2         3      4     5       6      7        8     9
Fashion-MNIST  T-shirt/top  Trouser     Pullover  Dress  Coat  Sandal  Shirt  Sneaker  Bag   Ankle boot
CIFAR-10       Airplane     Automobile  Bird      Cat    Deer  Dog     Frog   Horse    Ship  Truck
Table 2: Correspondence of labels for Fashion-MNIST and CIFAR-10
Figure 5: MNIST, Fashion-MNIST, CIFAR-10 results for case A
Figure 6: MNIST, Fashion-MNIST, CIFAR-10 results for case C
Figure 7: MNIST, Fashion-MNIST, CIFAR-10 results for case B

Figures 5, 7 and 6 show the results for cases A, B and C respectively. In case A of MNIST, all three regions (the two unique parts and the intersection) are correctly modeled, and similarly for Fashion-MNIST. CIFAR-10 image quality is not as good as the others, so it is harder to judge; however, among the recognizable classes we see that "Automobile", "Horse", "Ship" and "Truck" appear in the right regions. In case B of MNIST and Fashion-MNIST, the vast majority of objects appear in the right regions with good image quality. For CIFAR-10, the results are decent for "Airplane", "Automobile" and "Deer"; for the other regions the quality is not satisfactory and there appears to be some leakage. In case C, we see near-perfect performance on MNIST and Fashion-MNIST: objects are placed in the right regions and image quality is good enough to recognize them. CIFAR-10 image quality is again not very good, but the objects seem to be placed in the right regions; for example, one region includes only "Airplane", "Automobile" and "Bird", while another includes only "Frog", "Horse", "Ship" and "Truck".

Dataset Case Classifier IG R1 R2 R3 R4 R5 R6 R7 Avg
MNIST A Yes Yes 99.76 99.11 81.72 n/a n/a n/a n/a 93.53
MNIST A Yes No 99.69 98.86 83.40 n/a n/a n/a n/a 93.98
F-MNIST A Yes Yes 91.37 87.75 80.17 n/a n/a n/a n/a 86.43
F-MNIST A Yes No 90.15 86.48 80.92 n/a n/a n/a n/a 85.85
CIFAR-10 A Yes Yes 78.03 75.19 58.07 n/a n/a n/a n/a 70.42
CIFAR-10 A Yes No 72.23 71.65 52.78 n/a n/a n/a n/a 65.55
MNIST B Yes Yes 99.33 100.0 95.67 98.44 98.22 99.64 99.58 98.70
MNIST B Yes No 99.32 100.0 96.05 98.75 98.14 99.56 99.36 98.74
F-MNIST B Yes Yes 73.03 97.36 70.43 68.33 91.02 92.09 18.59 72.97
F-MNIST B Yes No 71.86 98.07 68.45 71.04 93.17 91.54 18.02 73.16
CIFAR-10 B Yes Yes 83.57 58.71 10.63 53.14 2.81 51.93 28.28 41.30
CIFAR-10 B Yes No 88.3 52.84 11.43 51.99 2.98 52.78 35.29 42.23
MNIST C Yes Yes 99.5 n/a n/a n/a n/a 93.85 94.19 95.85
MNIST C Yes No 99.12 n/a n/a n/a n/a 93.08 93.64 95.28
F-MNIST C Yes Yes 94.88 n/a n/a n/a n/a 85.25 67.49 82.54
F-MNIST C Yes No 94.41 n/a n/a n/a n/a 83.17 67.5 81.69
CIFAR-10 C Yes Yes 85.63 n/a n/a n/a n/a 70.57 63.83 73.34
CIFAR-10 C Yes No 77.39 n/a n/a n/a n/a 66.82 61.85 68.69
MNIST A No No 99.54 98.88 81.45 n/a n/a n/a n/a 93.29
F-MNIST A No No 90.48 86.59 80.12 n/a n/a n/a n/a 85.73
CIFAR-10 A No Yes 76.4 73.32 60.93 n/a n/a n/a n/a 70.22
MNIST B No No 98.72 99.99 95.28 99.08 97.40 99.29 99.29 98.43
F-MNIST B No No 67.89 97.81 63.79 68.18 88.98 91.91 15.68 70.61
CIFAR-10 B No Yes 85.17 51.64 9.27 51.44 2.89 46.48 22.97 38.55
MNIST C No No 98.49 n/a n/a n/a n/a 92.88 93.71 95.03
F-MNIST C No No 92.57 n/a n/a n/a n/a 84.04 67.24 81.28
CIFAR-10 C No Yes 86.14 n/a n/a n/a n/a 71.85 61.77 73.25
Table 3: Quantitative results on 3 datasets and 3 cases. The accuracy of each region (R1-R7) is reported. IG stands for Independent Generators and Avg is the average over all regions. The regions can be tracked from Figure 1; n/a marks regions that do not exist in that type of Venn diagram.

Table 3 lists quantitative results for the experiments above. Interestingly, MNIST performs best in case B, while the same case is the hardest for Fashion-MNIST and CIFAR-10. We believe this is due to the clear separation between the classes in MNIST, whereas Fashion-MNIST has a few hard-to-distinguish classes such as "Pullover", "Coat" and "Shirt". As expected, average accuracy drops as the dataset becomes harder (MNIST > Fashion-MNIST > CIFAR-10).

Conditional Generator vs. Independent Generators: For MNIST and Fashion-MNIST, the conditional generator produces comparable or slightly better results, while independent generators are better for CIFAR-10. We postulate that for simple datasets, a single conditional generator has sufficient capacity to match the quality of multiple generators. Moreover, sharing most of the weights across regions regularizes training, as there are many common features between regions. For CIFAR-10, however, weight sharing may burden the representation of the different regions rather than regularize it.

Effect of the Classifier: As explained in the method section, we utilize a classifier to alleviate leakage between regions. In this section, we evaluate its effectiveness on the various datasets. To reduce the number of settings, we use conditional generators for MNIST and Fashion-MNIST and independent generators for CIFAR-10, for the reasons explained in the previous section. The bottom section of Table 3 shows 9 settings without the classifier term in the objective. Including the classifier yields slight but consistent improvements in all settings; for CIFAR-10, the improvements are more significant than for the other datasets.

Figure 8: Omniglot results

Omniglot (Lake et al., 2015) contains letters from many alphabets. Each alphabet contains a certain number of letters, with only 20 samples per letter, which makes this dataset hard to model. We have selected the "Cyrillic", "Greek" and "Latin" alphabets as 3 different distributions. As these alphabets include both unique and common letters, we model them with the fully overlapping 3-set configuration to discover both the unique and the common letters.

In Figure 8, the first three regions correspond to letters unique to "Cyrillic", "Greek" and "Latin", in that order. The majority of the letters in each of these regions belong to their own alphabet and not to the others. The other regions show more mistakes, such as the letter "o" appearing in multiple regions.

CelebA (Liu et al., 2015): For this dataset, we use both the overlapping and the nested types with two distributions. In the overlapping case, the first distribution contains only male faces while the second contains only female faces. In the nested case, the first distribution contains only female faces while the second contains both genders. Our aim is to see whether semantic commonalities and differences of the distributions can be captured successfully. In the overlapping setting there is no overlap in genders, but we are interested in what type of commonalities our method can find. We have used the conditional generator for this experiment to see the semantic relations between the regions more clearly.

Figure 9: CelebA results (overlapping type): one region is only males, one is only females, and one is the intersection
Figure 10: CelebA results (nested type): the shared region is only females, the difference region is only males

In the overlapping CelebA experiment (Figure 9), the male-only region depicts stereotypically masculine faces with short hair, whereas the female-only region exhibits predominantly feminine features such as long hair. The intersection region, on the other hand, features faces which are neither predominantly male nor female. As the images in different regions are generated with the same noise, the pose and background of an image remain similar across regions while the facial attributes change. Similarly, the nested CelebA experiment (Figure 10) shows that the model captures the commonality of the distributions well: the shared region correctly contains female faces, while the difference region contains male faces, as it should. Again, thanks to the shared noise, generations in different regions can be compared. Both experiments show that Venn GAN can capture high-level semantic commonality between high-dimensional, complex distributions.

5 Discussion & Conclusion

In this paper, we have used prior knowledge to choose the Venn type, i.e. the mixture matrix π. When we know that the distributions have intersections and unique parts, an overlapping type has been used; if one distribution is a subset of another, we have utilized the nested type. We note that certain distributions may not fall under either of these two types. If we have prior knowledge about the relations between the distributions, the method can be applied easily. When no such prior knowledge is available, the ideal approach would be to learn the configuration, which we leave for future work.

The main limitation of the method is that it takes the union over the regions of a set with equal probability, which is a strong assumption in many cases. Ideally, π should be optimized end-to-end with the model parameters. One challenge is that the mixture weights are discrete, as in practice we use the number of samples to approximate them; this could be handled with a reinforcement learning algorithm. Another, bigger challenge is to find a meaningful reward signal for the training of π: the reward should correlate negatively with "leaks" between the regions. We think this is also an important direction for future research.

In conclusion, we have proposed a novel multi-distribution GAN method which can discover particularities and commonalities between distributions. Our method models each data distribution with a mixture of generator distributions. As the generators are partially shared between the modeling of different true data distributions, shared ones capture the commonality of the distributions, while non-shared ones capture their unique aspects. We have successfully trained the model on various datasets to show its effectiveness. We believe this method has good potential for new applications and better data modeling.

Acknowledgments

Yasin Yazıcı was supported by a SINGA scholarship from the Agency for Science, Technology and Research (A*STAR). Georgios Piliouras would like to acknowledge SUTD grant SRG ESD 2015 097, MOE AcRF Tier 2 Grant 2016-T2-1-170 and NRF 2018 Fellowship NRF-NRFF2018-07. This research is partially supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funds (Project No.A1892b0026). This research was carried out at Advanced Digital Sciences Center (ADSC), Institute for Infocomm Research (I2R) and at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Singapore, and the Infocomm Media Development Authority, Singapore. Research at I2R was partially supported by A*STAR SERC Strategic Funding (A1718g0045). The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).


Appendix A Network Architectures

The prior distribution for the generator(s) is a 128-dimensional isotropic Gaussian distribution. If not mentioned, stride and padding of the convolutions are 1. "Cond" is conditioning, i.e. a linear scaling and addition for each feature channel; it is not used when multiple generators are utilized. "LReLU" is LeakyReLU.

Layers Act. Output Shape

Latent vector - 128 x 1 x 1
Conv 4 x 4, pad=3 Cond - LReLU 128 x 4 x 4
Conv 4 x 4, pad=3 Cond - LReLU 128 x 7 x 7
Upsample - 128 x 14 x 14
Conv 3 x 3, pad=1 Cond - LReLU 64 x 14 x 14
Upsample - 64 x 28 x 28
Conv 3 x 3, pad=1 Cond - LReLU 32 x 28 x 28
Conv 3 x 3, pad=1 Tanh 1 x 28 x 28
Table 4: Generator Architecture for 28x28 resolution (MNIST, Fashion-MNIST, Omniglot)
Layers Act. Output Shape
Input image - 3 x 28 x 28
Conv 4 x 4, st=2 LReLU 64 x 14 x 14
Conv 4 x 4, st=2 LReLU 128 x 7 x 7
Conv 4 x 4, st=2 LReLU 256 x 3 x 3
Conv 3 x 3, st=1, pad=0 Squeeze 1
Table 5: Discriminator Architecture for 28x28 resolution (MNIST, Fashion-MNIST, Omniglot)
Layers Act. Output Shape
Latent vector - 128 x 1 x 1
Conv 4 x 4, pad=3 Cond - LReLU 512 x 4 x 4
Upsample - 512 x 8 x 8
Conv 3 x 3 Cond - LReLU 256 x 8 x 8
Upsample - 256 x 16 x 16
Conv 3 x 3 Cond - LReLU 128 x 16 x 16
Upsample - 128 x 32 x 32
Conv 3 x 3 Cond - LReLU 64 x 32 x 32
Conv 3 x 3 Tanh 3 x 32 x 32
Table 6: Generator Architecture for 32x32 resolution (CIFAR-10)
Layers Act. Output Shape
Input image - 3 x 32 x 32
Conv 3 x 3 LReLU 64 x 32 x 32
Conv 3 x 3 LReLU 128 x 32 x 32
Downsample - 128 x 16 x 16
Conv 3 x 3 LReLU 128 x 16 x 16
Conv 3 x 3 LReLU 256 x 16 x 16
Downsample - 256 x 8 x 8
Conv 3 x 3 LReLU 256 x 8 x 8
Conv 3 x 3 LReLU 512 x 8 x 8
Downsample - 512 x 4 x 4
Conv 4 x 4, st=1, pad=0 Squeeze 1
Table 7: Discriminator Architecture for 32x32 resolution (CIFAR-10)
Layers Act. Output Shape
Latent vector - 128 x 1 x 1
Conv 4 x 4, pad=3 Cond 512 x 4 x 4
ResBlock - 512 x 4 x 4
Upsample Cond 512 x 8 x 8
ResBlock - 512 x 8 x 8
Upsample Cond 512 x 16 x 16
ResBlock - 256 x 16 x 16
Upsample Cond 256 x 32 x 32
ResBlock - 128 x 32 x 32
Upsample Cond 128 x 64 x 64
ResBlock LReLU - Cond 64 x 64 x 64
Conv 3 x 3 Tanh 3 x 64 x 64
Table 8: ResNet Generator Architecture for 64x64 resolution (CelebA)
Layers Act. Output Shape
Input image - 3 x 64 x 64
Conv 3 x 3 - 64 x 64 x 64
ResBlock - 64 x 64 x 64
Downsample - 64 x 32 x 32
ResBlock - 128 x 32 x 32
Downsample - 128 x 16 x 16
ResBlock - 256 x 16 x 16
Downsample - 256 x 8 x 8
ResBlock - 512 x 8 x 8
Downsample - 512 x 4 x 4
ResBlock LReLU 512 x 4 x 4
Conv 4 x 4, st=1, pad=0 Squeeze 1
Table 9: ResNet Discriminator Architecture for 64x64 resolution (CelebA)

Appendix B Training of the classifiers for Quantification

For MNIST, Fashion-MNIST and CIFAR-10, we have trained 3 separate classifiers to assess the quality of the method. For each dataset, the architecture is the same as the discriminator used for that dataset, except the last layer, which outputs 10 logit values (one per class) instead of 1. We have used the ADAM optimizer. Each model has been trained for 50k iterations. The accuracies of the classifiers on the test sets of MNIST, Fashion-MNIST and CIFAR-10 are 99.12%, 91.20% and 84.20%, respectively.

Appendix C Illustrative Examples

For these experiments, we have used 7 generators (one per region) and 3 discriminators. Each generator is built from 4 fully connected layers, each followed by LeakyReLU except the last one, which is linear. The discriminators are likewise constructed from 4 fully connected layers followed by LeakyReLU except the last one, which is linear. In both networks, each hidden layer has 256 units, while the last layer of a generator outputs the 2-dimensional sample and the last layer of a discriminator outputs a single value. The prior distribution for the generators is a 128-dimensional isotropic Gaussian distribution. We have used the ADAM optimizer (Kingma & Ba, 2014). The optimization of discriminator and generator follows an alternating update rule with a single discriminator update per generator update. The model has been trained for 5k iterations.