Class-Distinct and Class-Mutual Image Generation with GANs

11/27/2018 ∙ by Takuhiro Kaneko, et al.

We describe a new problem called class-distinct and class-mutual (DM) image generation. Typically in class-conditional image generation, it is assumed that there are no intersections between classes, and a generative model is optimized to fit discrete class labels. However, in real-world scenarios, it is often required to handle data in which class boundaries are ambiguous or unclear. For example, data crawled from the web tend to contain mislabeled data resulting from confusion. Given such data, our goal is to construct a generative model that can be controlled for class specificity, which we employ to selectively generate class-distinct and class-mutual images in a controllable manner. To achieve this, we propose novel families of generative adversarial networks (GANs) called class-mixture GAN (CMGAN) and class-posterior GAN (CPGAN). In these new networks, we redesign the generator prior and the objective function in auxiliary classifier GAN (AC-GAN), then extend these to class-mixture and arbitrary class-overlapping settings. In addition to an analysis from an information theory perspective, we empirically demonstrate the effectiveness of our proposed models for various class-overlapping settings (including synthetic to real-world settings) and tasks (i.e., image generation and image-to-image translation).

1 Introduction

In computer vision and machine learning, generative modeling has been studied to produce or reproduce samples that are indistinguishable from real data. Until recently, applicable data have been restricted to relatively simple or low-dimensional data, owing to the difficulty of modeling complex distributions. However, advances in deep generative modeling have lifted this restriction and have shown impressive results for various applications, such as speech synthesis [54], image-to-image translation [16, 62, 20, 69, 7, 28, 30], and photo editing [68, 5, 17, 14].

Figure 1: Concept of our proposed class-distinct and class-mutual image generation method. Given images whose class boundaries are ambiguous (a), our goal is not to create a generative model conditioned on discrete class labels that divide data by force (b), but rather to construct a generative model that allows for the existence of images lying between classes and can selectively generate an image on the basis of class specificity (c).

Among these, one of the most prominent models is the generative adversarial network (GAN) [11], which learns a generative distribution that mimics a target distribution through adversarial training between a generator and a discriminator. This training algorithm allows any data distribution to be learned without explicit density estimation. This eliminates the over-smoothing that results from representing data distributions only approximately, and allows the production of high-fidelity images [27, 19, 40, 64, 4].

Another advantage of GANs is that they can represent high-dimensional data in a compact but expressive latent space. In naive GANs, various factors may be entangled in the latent space because no explicit structure is imposed on the latent variables. However, conditional extensions of GANs (e.g., conditional GAN (cGAN) [39] and auxiliary classifier GAN (AC-GAN) [43]), which incorporate supervision into the latent space as conditional information, make it possible to learn representations that are disentangled between the supervision and other factors. These representations allow images to be generated selectively by varying the conditional information (e.g., attribute/class labels [39, 43, 52, 17, 41, 18], texts [47, 66, 65, 60], and location descriptions [46]). Furthermore, the supervision helps simplify the learned target distribution from an overall distribution to a conditional one. Recent studies [43, 66, 41, 64, 4] show that this contributes to stabilizing the otherwise unstable training of typical GANs.

When focusing on class-conditional image generation, conventional approaches (including cGAN and AC-GAN) assume that there are no intersections among classes, and optimize a generative model to fit discrete class labels. However, in real-world scenarios this assumption is too restrictive, because it is often necessary to handle data in which class boundaries are ambiguous or unclear. For example, in Clothing1M [58], the difference between knitwear and sweater is not obvious, as shown in Figure 1(a). This causes difficulty in annotating correctly. In fact, the authors of Clothing1M report that the overall annotation accuracy (including other classes) is only 61.54%. Given such ambiguous class-boundary data, it is important to consider the relationships among classes. However, typical conditional models attempt to fit discrete labels regardless of the quality of supervision, as shown in Figure 1(b).

To mitigate these restrictions, we tackle a new problem called class-distinct and class-mutual (DM) image generation, in which the goal is to construct a generative model that can capture between-class relationships and selectively generate an image on the basis of class specificity, as shown in Figure 1(c). By adopting such a model, we aim to allow for the existence of images lying between classes. To solve this problem, we redesign the generator's input and the objective function in AC-GAN and propose two extensions: class-mixture GAN (CMGAN) and class-posterior GAN (CPGAN). CMGAN is an extension of AC-GAN to the class-mixture setting, in which we replace the discrete prior in AC-GAN with a class-mixture prior. One limitation of CMGAN is that the hyperparameters of the class-mixture prior must be defined manually in advance. To remedy this drawback, we propose CPGAN, in which we use the class posterior (in practice, approximated by the classifier's posterior) instead of the class-mixture prior, and tune the generator to fit the class posterior.

Despite their simplicity, we find empirically that the proposed models can achieve DM image generation in various settings, from synthetic to real-world. We also evaluate typical class-conditional GANs (i.e., cGAN and AC-GAN) and state-of-the-art extensions (i.e., cGAN with a projection discriminator [41] and CFGAN [17]), and demonstrate that they are not suitable for DM image generation, even in relatively simple settings on MNIST [26]. Finally, we show the generality of the proposed models by applying CPGAN to StarGAN [7], a state-of-the-art AC-GAN-based image-to-image translation model.

Our contributions are summarized as follows:

  • We introduce a novel problem, called class-distinct and class-mutual (DM) image generation, as an extension of class-conditional image generation to a class-overlapping setting.

  • To solve this problem, we propose two extensions of AC-GAN: CMGAN, an extension of AC-GAN to the class-mixture setting, and CPGAN, an extension of CMGAN to an arbitrary class-overlapping setting.

  • We conduct extensive experiments on this new problem. We clarify the difference between typical class-conditional image generation and DM image generation, analyze the model configurations, validate the effectiveness for real-world noisy labels, and demonstrate the generality by application to image-to-image translation.

2 Related work

Deep generative models.

Along with GANs, the most prominent deep generative models are variational autoencoders (VAEs) [23, 49] and autoregressive models (ARs), such as PixelRNN and PixelCNN [55]. All three families have strengths and weaknesses. One well-known weakness of GANs is training instability, owing to the difficulty of balancing the generator and the discriminator; however, this has been improved by recent advances [9, 45, 50, 67, 1, 2, 36, 12, 19, 57, 40, 38, 64, 4]. In this study, we focus on GANs because they offer flexibility in designing latent variables, and it is relatively easy to incorporate various priors. For VAEs and ARs, several conditional extensions have recently been proposed [22, 61, 35, 56, 48]. However, to the best of our knowledge, DM image generation has not been tackled for these models, and such extensions represent a promising area for future work.

Disentangled representation learning. Naive deep generative models (particularly GANs and VAEs) do not impose an explicit structure on the latent variables. Thus, the latent variables may be employed by the generator (or encoder) in a highly entangled manner. To solve this problem, recent studies have incorporated supervision into the networks [39, 43, 52, 17, 41, 18, 47, 66, 65, 60, 46]. These models can learn disentangled representations in a stable manner following the supervision; however, the learnable representations are restricted to that supervision. To overcome this limitation, unsupervised [6, 18] and weakly supervised [25, 34, 37, 17, 18] models have been proposed. Our CMGAN and CPGAN approaches are categorized as weakly supervised models, because they learn class-specificity-controllable models using only weak supervision (i.e., binary class labels). The difference from previous studies is that we focus on class specificity. Note that from this viewpoint, DM image generation is different from typical class-wise interpolation (or category morphing in [41]), which interpolates between classes regardless of class specificity. We discuss these relationships in Section 3.4, and demonstrate the difference experimentally in Section 4.1.

Figure 2: Comparison of probability densities. Here, class A contains "0" and "1", and class B contains "1" and "2": "1" is shared between the two classes, while "0" and "2" are specific to A and B, respectively. (a) Probability density of the classifier's posterior for real data. In this case, "1" lies between the classes, while "0" and "2" are completely classified. (b–d) The upper row shows the probability densities of the generator's prior. AC-GAN handles classes discretely, while CMGAN and CPGAN handle the relationships between classes. The lower row shows the probability densities of the classifier's posterior for the generated data. AC-GAN ignores the mass between classes, while CMGAN and CPGAN succeed in capturing it.

Conditional image-to-image translation. Recently, supervision has not only been incorporated into image-generation models, but also into image-to-image translation models. State-of-the-art models include StarGAN [7], which achieves multi-domain image-to-image translation using attribute supervision, and outperforms single-domain image-to-image translation models (e.g., DIAT [28] and CycleGAN [69]) and encoder-decoder models (e.g., IcGAN [44]). Naive StarGAN is difficult to apply to class-overlapping data, because it handles classes as discrete values, similarly to AC-GAN [43]. However, it is possible to overcome this limitation by replacing the AC-GAN-based loss with the CPGAN-based loss. We demonstrate this experimentally in Section 4.4.

 

Model Generator Discriminator
Formulation Sampling Formulation Objective
cGAN [39] Concat.
cGAN+ [41] Projection
AC-GAN [43] with AC loss
CFGAN [17] cond. on with Cond. MI loss
CM-CFGAN† cond. on with Cond. MI loss
CMGAN (Equation 7) with KL-AC loss
CPGAN (Equation 8) with KL-AC loss

 

Table 1: Relationships between class-conditional GANs. (†Naive CFGAN assumes that there are only two classes. For comparison purposes, we extend it to an arbitrary number of classes; we call this extension CM-CFGAN.)

3 Class-distinct and class-mutual image generation

Our goal is to construct a generative model that can be applied to class-overlapping data and is controllable for class specificity. To achieve this, we extend AC-GAN [43] to the class-overlapping setting. In this section, we provide an overview of AC-GAN (Section 3.1), and then present our proposed models CMGAN and CPGAN in Sections 3.2 and 3.3, respectively. In Section 3.4, we discuss the relationships between the proposed models and previous related conditional GANs.

3.1 Background: AC-GAN

AC-GAN [43] is a class-conditional extension of GAN [11]. Its objective consists of two losses: an adversarial loss and an auxiliary classifier loss.

Adversarial loss. To make generated images indistinguishable from the real ones, the following adversarial loss is employed:

$\mathcal{L}_{adv} = \mathbb{E}_{x^r \sim p^r(x)}[\log D(x^r)] + \mathbb{E}_{z \sim p(z),\, y^g \sim p(y)}[\log(1 - D(G(z, y^g)))],$ (1)

where $x^r$ is a real image sample, and the generator $G$ produces an image $G(z, y^g)$ conditioned on both the noise variable $z$ and the class label $y^g$. Typically, $z$ is sampled from a uniform or Gaussian distribution, whereas $y^g$ is sampled from a categorical distribution $\mathrm{Cat}(K)$, where $K$ is the number of classes. Here, $G$ attempts to generate images that are indistinguishable from real images by minimizing this loss, and the discriminator $D$ attempts to best classify real versus generated data by maximizing it.

Auxiliary classifier loss. To make the generated images belong to the target class, an auxiliary classifier (AC) loss is employed. To achieve this, the auxiliary classifier $C$ is first trained with a classification loss on real images, defined as

$\mathcal{L}_{ac}^{r} = \mathbb{E}_{(x^r, y^r) \sim p^r(x, y)}[-\log C(y = y^r \mid x^r)],$ (2)

where the input image and class label pair $(x^r, y^r)$ is given as the training data. (In the original AC-GAN study [43], $C$ was trained using not only real but also generated images; however, recent extensions such as WGAN-GP [12] and StarGAN [7] use only real images, and we follow the latter formulation.) $C(y \mid x)$ represents a probability distribution over class labels computed by $C$. Here, $C$ learns to classify a real image $x^r$ as the corresponding class $y^r$ by minimizing this loss. For this classifier, $G$ is optimized by

$\mathcal{L}_{ac}^{g} = \mathbb{E}_{z \sim p(z),\, y^g \sim p(y)}[-\log C(y = y^g \mid G(z, y^g))],$ (3)

where $G$ attempts to generate images classified as the target class $y^g$ by minimizing this loss.

Full objective. In practice, networks shared between $D$ and $C$ are commonly used [43, 12]. In this configuration, the full objective is expressed as

$\mathcal{L}_{D/C} = -\mathcal{L}_{adv} + \lambda^{r} \mathcal{L}_{ac}^{r},$ (4)
$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda^{g} \mathcal{L}_{ac}^{g},$ (5)

where $\lambda^{r}$ and $\lambda^{g}$ are trade-off parameters between the adversarial and AC losses. Then, $D/C$ and $G$ are optimized by minimizing $\mathcal{L}_{D/C}$ and $\mathcal{L}_{G}$, respectively.
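
For concreteness, the following is a minimal PyTorch sketch of the AC-GAN objectives in Equations 1–5. The interfaces are assumptions for illustration rather than the authors' implementation: `G(z, y)` maps noise and a one-hot label to an image, `D` returns a real/fake logit, `C` returns class logits, and the adversarial loss is written in the non-saturating BCE form commonly used in practice.

```python
import torch
import torch.nn.functional as F

def acgan_losses(G, D, C, x_real, y_real, z_dim, num_classes,
                 lambda_r=1.0, lambda_g=1.0):
    """Return (discriminator/classifier loss, generator loss), cf. Eqs. (4)-(5).
    y_real: integer class indices for the real batch."""
    batch = x_real.size(0)
    # Sample the generator inputs: noise z and a discrete class label y^g ~ Cat(K).
    z = torch.randn(batch, z_dim, device=x_real.device)
    y_gen = torch.randint(num_classes, (batch,), device=x_real.device)
    x_fake = G(z, F.one_hot(y_gen, num_classes).float())

    # Adversarial loss (Eq. 1), non-saturating BCE form.
    d_real, d_fake = D(x_real), D(x_fake.detach())
    loss_adv_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                  + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_fake_g = D(x_fake)
    loss_adv_g = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g))

    # AC loss on real images (Eq. 2): C learns to classify real data.
    loss_ac_r = F.cross_entropy(C(x_real), y_real)
    # AC loss on generated images (Eq. 3): pushes G toward the target class.
    loss_ac_g = F.cross_entropy(C(x_fake), y_gen)

    loss_dc = loss_adv_d + lambda_r * loss_ac_r   # cf. Eq. (4)
    loss_g = loss_adv_g + lambda_g * loss_ac_g    # cf. Eq. (5)
    return loss_dc, loss_g
```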

Information theory viewpoint. Based on the theoretical analysis in InfoGAN [6], we can interpret Equation 3 as a variational upper bound on the negative mutual information between $y^g$ and $G(z, y^g)$. This means that the semantic features captured by this loss depend heavily on the design of $y^g$. That is, when $y^g$ is a discrete variable (as in a typical AC-GAN), discrete semantic features are learned, and controllability is restricted to ON or OFF.

From another perspective, Equation 3 can be interpreted as a cross-entropy loss, i.e., the cross entropy between $p(y^g)$ and $C(y \mid G(z, y^g))$. From this perspective, Equation 3 can be rewritten as

$\mathcal{L}_{ac}^{g} = \mathbb{E}_{z \sim p(z),\, y^g \sim p(y)}\left[ D_{\mathrm{KL}}\left( p(y^g) \,\|\, C(y \mid G(z, y^g)) \right) \right],$ (6)

where we ignore the entropy term $H(p(y^g))$ because $p(y^g)$ is constant when $y^g$ is given as a ground truth. This equation means that $G$ is optimized so that $C(y \mid G(z, y^g))$ fits $p(y^g)$ in terms of the Kullback-Leibler (KL) divergence. We call this redesigned loss the KL-AC loss. This view suggests that AC-GAN is sensitive to the design of $p(y^g)$. In conventional AC-GAN, $y^g$ is represented as a discrete variable. As a result, $G$ is optimized to prefer a class-discrete distribution, as shown in Figure 2(b). This property is not suitable for DM image generation, where we need to cover the distribution between classes, as shown in Figure 2(a).
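
A minimal sketch of the KL-AC loss in Equation 6 is given below, assuming a classifier `C` that returns logits and a target label distribution `p_y` given as a batch of probability vectors (a one-hot `p_y` recovers the ordinary AC loss up to a constant). The names are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kl_ac_loss(C, x_fake, p_y, eps=1e-8):
    # KL(p_y || C(y | x_fake)) = sum_y p_y * (log p_y - log C); the log p_y term
    # is the (constant) negative entropy of the target and does not affect G's gradients.
    log_q = F.log_softmax(C(x_fake), dim=1)
    return (p_y * (torch.log(p_y + eps) - log_q)).sum(dim=1).mean()
```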

3.2 Proposal I: CMGAN

As discussed above, naive AC-GAN is tuned to fit a discrete class distribution. This motivated us to develop CMGAN, an extension of AC-GAN to a class-mixture setting. In CMGAN, we introduce an $N$-class mixture variable $y_{mix}$ (Mix):

$y_{mix} = \sum_{n=1}^{N} w_n y_n,$ (7)

where $y_1, \dots, y_N$ are one-hot variables sampled from a categorical distribution (i.e., $y_n \sim \mathrm{Cat}(K)$). Here, $w_n$ represents the weight of $y_n$, and $(w_1, \dots, w_N)$ is sampled from a Dirichlet distribution (i.e., $(w_1, \dots, w_N) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_N)$). In CMGAN, we optimize $G$ conditioned on $z$ and $y_{mix}$ with the KL-AC loss, where $y^g$ and $p(y^g)$ in Equation 6 are replaced by $y_{mix}$ and $p(y_{mix})$, respectively. Then, $G$, $D$, and $C$ are optimized in a similar manner as in AC-GAN. From an information theory perspective, this extension means that we can cover representations between classes and fit the classifier posterior for the generated data to an $N$-class mixture distribution, as shown in Figure 2(c).
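
A minimal sketch of the class-mixture prior in Equation 7 follows; `num_classes`, `n_mix`, and `alpha` correspond to $K$, $N$, and the Dirichlet parameter in the text, and the function signature is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def sample_class_mixture(batch, num_classes, n_mix=2, alpha=1.0, device="cpu"):
    # One-hot class vectors y_1, ..., y_N for each sample in the batch.
    labels = torch.randint(num_classes, (batch, n_mix), device=device)
    onehots = F.one_hot(labels, num_classes).float()                     # (B, N, K)
    # Mixture weights w ~ Dir(alpha, ..., alpha).
    w = torch.distributions.Dirichlet(
        torch.full((n_mix,), alpha, device=device)).sample((batch,))    # (B, N)
    # y_mix = sum_n w_n * y_n  (Eq. 7)
    return (w.unsqueeze(-1) * onehots).sum(dim=1)                        # (B, K)
```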

3.3 Proposal II: CPGAN

CMGAN can cover the relationships between any classes. However, a limitation of this method is that we must define in advance how many classes are mixed (i.e., $N$), as well as how to mix them (i.e., the parameters of the Dirichlet distribution). To extend CMGAN to an arbitrary class-overlapping setting, we developed CPGAN. In particular, we focus on the class posterior $p(y \mid x)$, which can represent the between-class relationships in given images. However, the direct calculation of $p(y \mid x)$ is intractable. Therefore, we approximate it using the classifier's posterior $C(y \mid x)$, and introduce the following variable $y_{cp}$ (CP):

$y_{cp} = C(y \mid x^r), \quad x^r \sim p^r(x).$ (8)

We optimize $G$ conditioned on $z$ and $y_{cp}$ with the KL-AC loss, where $y^g$ and $p(y^g)$ in Equation 6 are replaced by $y_{cp}$ and $C(y \mid x^r)$, respectively. Then, $G$, $D$, and $C$ are optimized in a similar manner as in AC-GAN. From an information theory perspective, this extension enables us to capture the between-class relationships in a data-driven manner, and to fit the classifier posterior for the generated data to that for the real data, as shown in Figure 2(d). A limitation of CPGAN is that it requires real images to sample $y_{cp}$ by Equation 8. However, as $C(y \mid x)$ is simpler than an image distribution, it can easily be modeled using an additional generative model. We empirically find that no degradation occurs when using such a model instead of real images. We demonstrate this in the last paragraph of Section 4.2.
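
A minimal sketch of the class-posterior (CP) prior in Equation 8 follows; `C` is assumed to return logits, and no gradient is propagated into `C` through this sampling step.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_class_posterior(C, x_real):
    # y_cp = C(y | x^r): the classifier's soft posterior on a real image.
    return F.softmax(C(x_real), dim=1)

# During training, G receives (z, y_cp) and the KL-AC loss (Eq. 6) pulls
# C(y | G(z, y_cp)) toward y_cp; at test time, a one-hot y_cp yields a
# class-distinct image and a multi-modal y_cp yields a class-mutual image.
```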

3.4 Relationships with previous conditional GANs

In Table 1, we summarize the relationships between our CMGAN and CPGAN and previous class-conditional GANs, including typical models (i.e., cGAN [39] and AC-GAN [43]) and state-of-the-art extensions (i.e., cGAN with a projection discriminator (cGAN+) [41] and CFGAN [17]). The novelty of our models is that we redesign $G$'s input to obtain controllability over the distributions between classes, and we redesign the AC loss so that $G$ fits the class-overlapping distribution. Note that cGAN+ also considers the structure of $y$ through the projection discriminator, but it cannot reflect this structure in $G$'s input as our models do. Therefore, it is not applicable to DM image generation. We empirically demonstrate the differences between our models and these previous models in Section 4.

Figure 3: Generated image samples on MNIST. Each row shows samples generated with a fixed $z$ while the class condition $y$ is varied; in particular, in (a) we vary $y$ continuously between classes. CMGAN and CPGAN succeed in selectively generating class-distinct (red font) and class-mutual (blue font) images, whereas the other related models fail to do so. Note that we do not use the class labels of the digits (0–9) directly as supervision. Instead, we derive them from the class labels denoted by capital letters (A, B, ...). See Figure 12 in the appendices for more samples.

4 Experiments

To examine the effectiveness of CMGAN and CPGAN on DM image-generation tasks, we conducted four experiments. First, we performed an evaluation on synthetic sets of MNIST [26] and CIFAR-10 [24] to clarify the difference between typical class-conditional image generation and DM image generation (Section 4.1) and to analyze the model configurations in detail (Section 4.2). As a more realistic setting, we also utilized Clothing1M [58], which includes many wrongly labeled instances (Section 4.3). Finally, we demonstrate the generality of the proposed models through an application to image-to-image translation using CelebA [31] in Section 4.4. We provide the details of the experimental setup in Appendix D.

 

Models Case I Case II Case III
DMA FID DMA FID DMA FID
Cat Mix CP Cat Mix CP Cat Mix CP
cGAN 50.2 4.9 9.4 N/A 36.4 3.4 5.8 N/A 27.6 4.8 9.2 N/A
cGAN+ 49.4 5.6 10.9 N/A 36.4 4.1 7.4 N/A 27.3 4.4 8.3 N/A
AC-GAN 49.7 5.3 11.1 12.2 39.4 3.7 8.3 12.2 27.1 4.0 18.4 22.0
CM-CFGAN 77.8 24.7 5.3 N/A 54.1 7.1 4.4 N/A 55.8 20.5 4.6 N/A
CMGAN 100.0 92.3 6.2 11.5 84.8 17.9 6.5 6.3 100.0 60.9 6.1 8.0
CPGAN 100.0 65.6 17.1 4.3 99.9 25.7 16.6 3.3 99.9 38.3 22.3 3.9

 

Table 2: DMA and FID scores on class-overlapping MNIST. For class-mixture sampling (Mix), we set in Cases I and II, in Case III, and in all cases. Bold fonts indicate the top scores in each case. In cGAN, cGAN+, and CM-CFGAN, we do not calculate the FID for CP (N/A), because they do not have an auxiliary classifier.

4.1 Model comparison

To demonstrate the difference between typical class-conditional image generation and DM image generation, we compared our models with typical and state-of-the-art conditional GANs (see Table 1) on synthetic sets of MNIST. We illustrate the considered class-overlapping settings in Figure 4. During training, class labels denoted by capital letters (A, B, ...) are given as supervision, but it is not known which digits each class includes. Under these conditions, our goal is to construct a generative model that can selectively generate class-distinct images (e.g., 0, 2, 4, 6, and 8 in Case II) and class-mutual images (e.g., 1, 3, 5, 7, and 9 in Case II). When creating the subsets (A, B, ...) from the original dataset, we considered two data-division methods: (a) class-mutual instances are duplicated and assigned to all related classes, with an overlap; (b) class-mutual instances are divided completely into one class each, without overlap. In (a), class-mutual images are relatively easy to find because the overlap actually exists, whereas in (b) we need to find the semantic overlap through learning. In this subsection, we discuss case (a) and show that the compared models do not perform well even in this relatively simple setting, whereas our models do. We discuss case (b) in Appendix B.1.1 and in the following subsections.

Figure 4: Illustration of class-overlapping settings.

Evaluation metrics. We employed two evaluation metrics. First, to evaluate whether the constructed generator can be controlled for class specificity, we generate images from conditions representing class-distinct or class-mutual states, and calculate the accuracy with which the generated images are classified into the expected states. (We trained this classifier using supervision of the expected states, independently of the GAN training. We generated 50,000 samples for each state and report the average across all states.) For example, in Case II, the class-distinct states are represented by one-hot class vectors, and the images generated from them are expected to be classified as the class-distinct digits (0, 2, 4, 6, and 8), respectively; the class-mutual states are represented by mixtures of the classes that share a digit, and the images generated from them are expected to be classified as the class-mutual digits (1, 3, 5, 7, and 9), respectively. We present image samples representing the expected states in the top row of Figure 3. We call this metric the class-distinct and class-mutual accuracy (DMA). Second, to quantify the quality of the generated samples, we employ the Fréchet inception distance (FID) [13], which captures the similarity between the generated and real distributions. Another commonly employed metric is the inception score (IS) [50]; however, recent studies [13, 32] have indicated drawbacks of this approach, so we use the FID.
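
A minimal sketch of the DMA computation described above is given below; `G`, `state_classifier`, and the list of (condition vector, expected state id) pairs are placeholders for illustration rather than the authors' evaluation code.

```python
import torch

@torch.no_grad()
def dma_score(G, state_classifier, states, z_dim,
              n_samples=50000, batch=100, device="cpu"):
    accs = []
    for y_state, expected_id in states:          # one-hot (distinct) or mixed (mutual) y
        correct, seen = 0, 0
        while seen < n_samples:
            b = min(batch, n_samples - seen)
            z = torch.randn(b, z_dim, device=device)
            y = y_state.to(device).expand(b, -1)
            pred = state_classifier(G(z, y)).argmax(dim=1)
            correct += (pred == expected_id).sum().item()
            seen += b
        accs.append(correct / n_samples)
    return sum(accs) / len(accs)                 # average over all states
```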

Results. We summarize the quantitative and qualitative results in Table 2 and Figure 3, respectively. For the FID, we compare the scores obtained when the generator's prior is changed (i.e., Cat, Mix, and CP). The FID scores indicate that most models perform best when the prior used in training is also used for evaluation. This implies the importance of the prior's design and supports the validity of our extensions to it. The DMA scores indicate that CMGAN and CPGAN significantly outperform the other compared models. The qualitative results in Figure 3 also support these scores. For example, in Figure 3(b), CMGAN and CPGAN succeed in accurately generating class-distinct (e.g., 0, 2, ...) and class-mutual (e.g., 1, 3, ...) images, but the others fail. Figure 3(a) highlights the difference between typical class-wise interpolation (or category morphing in [41]) and DM image generation: although the former can generate images continuously between classes by varying $y$, the changes are not necessarily related to class specificity.

4.2 Analysis of model configurations

Generalization and memorization. We conducted a model configuration analysis on CIFAR-10 [24], a commonly used dataset for evaluating image generation. We analyzed three cases: original CIFAR-10 (no overlap) and class-overlapping CIFAR-10 (Cases II and III in Figure 4). When creating the subsets from the original dataset in Cases II and III, we ensured that class-mutual instances are divided completely into one class each. In this setting, we must pay attention to the memorization effect [63], i.e., a DNN-based classifier can fit labels even when they are random. In our tasks, such an effect may prevent CPGAN from finding the semantic overlap. However, another study [3] demonstrates empirically that a DNN-based classifier prioritizes learning simple patterns first, and that explicit regularization (e.g., dropout) can hinder memorization without compromising generalization on real data. This suggests that we could find the semantic overlap by appropriately regularizing the models.

 

No. Condition Original Case II Case III
DMA FID DMA FID DMA FID
Dropout
1 W/ 95.4 11.6 77.4 12.2 60.1 13.6
2 W/o 94.8 12.7 51.7 14.9 37.9 18.4
# of shared blocks
3 0 96.8 20.3 56.7 20.9 43.3 23.9
4 1 96.5 19.0 61.0 19.9 40.1 22.4
5 2 94.8 15.3 72.6 15.2 33.9 35.7
6 3 95.7 13.0 78.1 13.6 59.0 16.7
Iter. of adding
7 20000 95.0 12.3 71.2 13.5 49.9 15.4
8 40000 95.8 12.5 63.8 15.0 46.8 14.8

 

Table 3: Analysis of model configurations. No. 1 is the default model (with dropout, four shared blocks, and joint learning of $G$ and $C$ from the start). In the other models, only the listed parameters are changed from the default model.

Based on these studies, we analyzed four aspects: (1) the effect of explicit regularization (i.e., dropout); (2) the effect of an architecture shared between $D$ and $C$, which we expect to act as a regularizer (we varied the number of shared residual blocks); (3) the effect of jointly learning $G$ and $C$ (we varied the iteration at which the joint learning starts, i.e., from the beginning or later); and (4) the effect of the scale of $\lambda^{g}$, the trade-off parameter between the adversarial and AC losses. We implemented the models based on AC-WGAN-GP ResNet [12], and only modified the generator's prior and the objective function for CMGAN and CPGAN.

Figure 5: DMA and FID scores on original and class-overlapping CIFAR-10. We alter the trade-off parameter $\lambda^{g}$ and analyze its effect. We calculate the FID using the generator prior that was used in training.
Figure 6: Generated image samples on CIFAR-10. Notation is similar to Figure 3. See Figures 13 and 14 in the appendices for more samples.

Results. Regarding (1)–(3), we list the results in Table 3. (1) The comparison with and without dropout (nos. 1, 2) indicates that explicit regularization (i.e., dropout) is useful for mitigating the memorization effect, and CPGAN succeeds in finding the semantic overlap between classes in terms of the DMA. (2) The comparison of the number of shared blocks (nos. 1, 3–6) indicates that the shared architecture also acts as an effective regularizer. We also find that it improves the FID scores, because sharing prevents $G$ from fitting too strongly to $C$ and generating only easily classifiable images. (3) The comparison of the timing of joint learning (nos. 1, 7, 8) indicates that early joint learning (a stage in which the classifier prioritizes learning simple patterns [3]) is useful for achieving DM image generation. We summarize the results for (4) in Figure 5, where we also analyze the effect on AC-GAN and CMGAN. We find that a trade-off exists between the DMA and FID scores for all the models; however, for CPGAN, the degradation of the FID is relatively small. This is because CPGAN represents the generator's prior in a data-driven manner and is hence robust to the gap between the generator's prior and the classifier's posterior. We present generated samples in Figure 6. Similarly to MNIST, CMGAN and CPGAN are able to selectively generate class-distinct (e.g., airplane (A) and bird (B) in Case II) and class-mutual (e.g., automobile (A ∩ B) and cat (B ∩ C) in Case III) images accurately, but AC-GAN fails.

Effect of classifier posterior knowledge. In the above, when calculating the FID score for CPGAN, we employ the classifier posterior for real images, $C(y \mid x^r)$, as the generator's prior. This may be a slightly advantageous condition, because partial information about the real data is used to generate images. However, the classifier posterior is much simpler than the image distribution itself; therefore, it can easily be modeled using an additional generative model. To validate this statement, we learn such a model with an additional GAN after the CPGAN training has finished. In the sampling phase, we generate images by stacking this model and $G$, which means that data are generated without relying on any supervision, similarly to the other models. The resulting FID scores for the original case, Case II, and Case III are competitive with the scores for CPGAN with the real classifier posterior in Figure 5.
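
A minimal sketch of this stacked sampling is shown below, assuming a small auxiliary generator `G_cp` (trained separately, details omitted) that outputs logits over classes; the names are placeholders, and neither real images nor labels are needed at sampling time.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_without_real_images(G, G_cp, batch, z_dim, z_cp_dim, device="cpu"):
    z_cp = torch.randn(batch, z_cp_dim, device=device)
    y_cp = F.softmax(G_cp(z_cp), dim=1)    # imitated classifier posterior
    z = torch.randn(batch, z_dim, device=device)
    return G(z, y_cp)                      # stacked sampling: G_cp -> G
```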

Figure 7: Translated image samples and DMA scores on CelebA. We calculate the DMA scores for each class-distinct or class-mutual state. Note that in training, we only know the dataset identifiers indicated by colored bold font, i.e., A (black hair), B (male), and C (smiling); the class-distinct and class-mutual representations (gray) are found through learning. See Figure 16 in the appendices for more samples.

 

Model ACC FID
AC-GAN 48.6 11.4
AC-GAN w/ dropout 53.8 8.4
CPGAN 61.6 7.0
CPGAN w/ dropout 70.2 7.3
cGAN+ 46.1 23.3
cGAN+ w/ SN 50.3 10.9

 

Table 4: ACC and FID scores on Clothing1M. ACC is calculated for the class-distinct states (i.e., $y$ is a one-hot vector).

Figure 8: Generated image samples on Clothing1M for class-distinct states (i.e., $y$ is a one-hot vector). Each row contains samples generated with a fixed class condition $y$ and varied $z$. CPGAN generates more class-distinct images than the other models. See Figure 15 in the appendices for more samples.

4.3 Evaluation on real-world noisy labels

To validate the effectiveness of the proposed models in a real-world scenario, we utilized Clothing1M [58], which contains clothing images from 14 classes collected from several online shopping websites. As discussed in Section 1, this dataset contains a large amount of mislabeled data (the annotation accuracy is only 61.54%). Both noisy labeled data and clean labeled data are provided as training sets. Following [58], we used mixed data consisting of bootstrapped clean labeled data and noisy labeled data. To shorten the training time, we downsized the images. Our implementation is based on AC-WGAN-GP ResNet [12], and for CPGAN we only modified the generator's prior and the objective function. Considering the findings in Section 4.2, we compared the performance with and without dropout. As another state-of-the-art baseline, we evaluated cGAN+ [41], which has the same network architecture as the proposed model except for an additional projection layer. We tested cGAN+ with dropout, but found that it suffers from severe mode collapse during training; therefore, we instead used cGAN+ with spectral normalization (SN) [40] as a regularized model. (We also modified the GAN objective of this model from the WGAN-GP loss [12] to the hinge loss [29, 51], as recommended in [41, 40].) In this dataset, it is difficult to calculate the DMA because the ideal class-mutual states are not obvious. Therefore, we instead report the accuracy (ACC) for the class-distinct states (i.e., $y$ is a one-hot vector) using a classifier trained on the clean labeled data.

Results. We list the results in Table 4. We observe that CPGAN outperforms AC-GAN and cGAN+ in terms of the FID. These results indicate that mismatches between images and labels can cause learning difficulties in typical class-conditional models, and that it is important to incorporate a mechanism like ours that allows for such mismatches when handling noisy labeled data. The ACC scores indicate that CPGAN can selectively generate class-distinct images when given a discrete (one-hot) $y$. We present the qualitative results in Figure 8. In the competing models (particularly cGAN+ w/ SN), the between-class differences are ambiguous, whereas for CPGAN w/ dropout they are relatively obvious.

4.4 Application to image-to-image translation

Our proposed models are general extensions of AC-GAN; therefore, they can be incorporated into any AC-GAN-based model. To demonstrate this, we incorporate CPGAN into StarGAN [7], a model for multi-domain image-to-image translation that is optimized using the AC loss together with an adversarial loss [11] and a cycle-consistency loss [69]. To combine CPGAN with StarGAN, we change the class representation from discrete labels to CP and replace the AC loss with the KL-AC loss. We denote this combined model by CP-StarGAN. To validate its effectiveness, we employ a modified version of CelebA [31] in which we consider the situation where multiple datasets collected under different criteria are given. In particular, we divided CelebA into three subsets without overlap: (A) a black-hair set, (B) a male set, and (C) a smiling set. Our goal is to discover class-distinct (e.g., black hair and not male) and class-mutual (e.g., black hair and male) representations without relying on any additional annotation. We believe that this would be useful for real-world applications in which we want to discover class intersections among existing datasets.
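
A minimal sketch of this loss swap in a StarGAN-style translator is given below; `G` maps an input image and a target condition to a translated image, and `C` returns domain logits. Both interfaces are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cp_stargan_cls_loss(G, C, x_src, x_ref, eps=1e-8):
    with torch.no_grad():
        y_cp = F.softmax(C(x_ref), dim=1)   # target condition from a reference real image
    x_trans = G(x_src, y_cp)                # translate x_src toward the condition y_cp
    log_q = F.log_softmax(C(x_trans), dim=1)
    # KL(y_cp || C(y | x_trans)), averaged over the batch (cf. Eq. 6), replacing
    # the discrete cross-entropy domain-classification loss of naive StarGAN.
    return (y_cp * (torch.log(y_cp + eps) - log_q)).sum(dim=1).mean()
```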

Results. We present the translated image samples and the quantitative evaluation results in Figure 7. (To calculate the DMA, we trained three classifiers that distinguish black hair or not, male or not, and smiling or not, respectively. We calculated the accuracy for the expected states using these classifiers and report the average score. Such multi-dimensional representations were not used for training the GAN models.) These results imply that CP-StarGAN can selectively generate class-distinct (e.g., black hair, not male, and not smiling) and class-mutual (e.g., black hair, male, and smiling) images accurately, whereas conventional StarGAN fails. Note that in Figure 7, we present the multi-dimensional attribute representations (e.g., black hair, not male, and not smiling) as the expected states, but such representations are not given as supervision during training. Instead, only the dataset identifiers (i.e., A, B, or C) are given as supervision, and we discover the multi-dimensional representations through learning.

5 Conclusion

This study introduced a new problem, called DM image generation, in which we aim to construct a generative model that can be controlled for class specificity. To solve this problem, we redesigned the generator's prior and the objective function in AC-GAN, and developed CMGAN and CPGAN, which extend AC-GAN to class-mixture and arbitrary class-overlapping settings, respectively. In addition to an analysis based on information theory, we demonstrated the difference between typical class-conditional image generation and DM image generation through experiments and comparisons with related conditional GANs. We also analyzed the model configurations from the viewpoints of memorization and generalization, demonstrated the effectiveness on real-world noisy labels, and validated the generality through an application to image-to-image translation. Based on our findings, adapting these methods to other generative models such as VAEs [23, 49] and ARs [55] and using them as data-mining tools on other datasets remain interesting future directions.

Acknowledgement

We would like to thank Hiroharu Kato, Atsuhiro Noguchi, and Antonio Tejero-de-Pablos for helpful discussions. This work was supported by JSPS KAKENHI Grant Number JP17H06100, partially supported by JST CREST Grant Number JPMJCR1403, Japan, and partially supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) as “Seminal Issue on Post-K Computer.”

References

  • [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [3] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.
  • [4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [5] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
  • [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [8] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In NIPS, 2017.
  • [9] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
  • [10] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In NIPS, 2017.
  • [14] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Trans. on Graph., 36(4):107:1–107:14, 2017.
  • [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [17] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In CVPR, 2017.
  • [18] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative adversarial image synthesis with decision tree latent controller. In CVPR, 2018.
  • [19] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  • [20] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
  • [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [22] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
  • [23] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [24] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [25] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998.
  • [27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [28] M. Li, W. Zuo, and D. Zhang. Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586, 2016.
  • [29] J. H. Lim and J. C. Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
  • [30] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
  • [31] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [32] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
  • [33] A. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop, 2013.
  • [34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In NIPS, 2016.
  • [35] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
  • [36] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [37] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.
  • [38] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
  • [39] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [40] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [41] T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
  • [42] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [43] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
  • [44] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. In NIPS Workshop, 2016.
  • [45] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [46] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
  • [47] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [48] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. In ICLR Workshop, 2017.
  • [49] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [50] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [51] D. Tran, R. Ranganath, and D. M. Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
  • [52] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
  • [53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [54] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • [55] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • [56] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelCNN decoders. In NIPS, 2016.
  • [57] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In ICLR, 2018.
  • [58] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.
  • [59] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. In ICML Workshop, 2015.
  • [60] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • [61] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [62] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
  • [63] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
  • [64] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [65] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916, 2017.
  • [66] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [67] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
  • [68] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
  • [69] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix A Contents

In these appendices, we provide additional analysis in Appendix B, extended results in Appendix C, and details of the experimental setup in Appendix D.

Appendix B Additional analysis

B.1 Additional analysis for Section 4.1

B.1.1 Model comparison in completely divided data

In Section 4.1, we compared the models in the setting where, in the training set, class-mutual instances are duplicated and assigned to all related classes, with an overlap. In the following, we denote this setting as the data-duplicated setting. In contrast, here we compare the models in the other setting, where class-mutual instances are divided completely into one class each, without overlap. We denote this setting as the data-divided setting. In this setting, finding class-mutual images is not trivial because no actual overlap exists, and we need to find the semantic overlap through learning. We have already discussed such a setting on CIFAR-10 in Section 4.2, and we find that a similar tendency can be observed on MNIST. In particular, here we tested explicit regularization (i.e., dropout).

Results. We demonstrate the effect of explicit regularization in Table 5, which shows the results without dropout (upper table) and with dropout (lower table). By incorporating dropout, CMGAN and CPGAN achieve a tendency similar to that in the data-duplicated settings listed in Table 2: CMGAN and CPGAN achieve the highest DMA scores, and CPGAN additionally achieves the highest or runner-up FID scores. In the results without dropout, the degradation of the DMA scores is large for CPGAN but relatively small for CMGAN. We argue that this is because CMGAN attempts to find class-mutual images based on the given class-mixture prior, regardless of the classifier posterior, whereas CPGAN depends on the quality of the classifier posterior. Therefore, when the classifier memorizes and fits the labels owing to the lack of appropriate regularization, it becomes difficult to find the semantic overlap between classes.

 

Models Case I Case II Case III
w/o dropout DMA FID DMA FID DMA FID
Cat Mix CP Cat Mix CP Cat Mix CP

 

cGAN 55.6 5.8 12.8 N/A 40.4 3.9 6.6 N/A 32.7 4.3 9.8 N/A
cGAN+ 55.8 6.3 15.7 N/A 40.1 4.6 7.9 N/A 32.0 4.8 11.1 N/A
AC-GAN 55.7 6.1 18.5 6.2 40.4 5.9 13.5 6.2 32.7 6.3 21.4 6.4
CM-CFGAN 91.1 38.8 6.0 N/A 59.8 7.7 4.6 N/A 70.4 32.3 5.3 N/A
CMGAN 96.2 72.3 22.9 70.6 86.0 33.1 15.6 29.7 71.4 70.8 35.2 65.2
CPGAN 74.7 7.5 41.6 7.3 70.1 6.4 17.4 6.3 51.4 7.6 43.8 7.3

 

 

Models Case I Case II Case III
w/ dropout DMA FID DMA FID DMA FID
Cat Mix CP Cat Mix CP Cat Mix CP

 

cGAN† N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
cGAN+† N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
AC-GAN 56.4 5.2 10.2 7.1 43.5 5.6 8.8 8.1 34.8 5.0 11.0 8.1
CM-CFGAN† N/A N/A N/A N/A 76.3 16.8 6.7 N/A N/A N/A N/A N/A
CMGAN 100.0 72.0 7.7 19.5 100.0 37.4 6.7 9.2 99.1 77.7 7.8 19.8
CPGAN 100.0 45.1 21.3 4.8 100.0 18.8 9.9 4.3 99.9 25.2 17.5 4.0

 

Table 5: DMA and FID scores on class-overlapping MNIST in data-divided settings. For comparison, see Table 2 for those in data-duplicated settings. The upper table lists the results without dropout. The lower table lists the results with dropout. For class-mixture sampling (Mix), we set in Cases I and II, in Case III, and in all cases. Bold font indicates the top two scores in each case across both tables. In cGAN, cGAN+, and CM-CFGAN, we do not calculate FID for CP (N/A) because they do not have an auxiliary classifier. †We find that a model that uses tends to fail to stabilize the training when using dropout. Therefore, we do not calculate all scores for such models, except for CM-CFGAN with dropout (Case II) because it succeeds in stabilizing the training.

 

Models Case II Case III
Data-duplicated setting DMA FID DMA FID
Mix () Mix () Mix () Mix ()

 

CMGAN with 84.8 6.5 11.7 77.3 4.4 11.8
CMGAN with 98.8 11.1 8.2 100.0 12.4 6.1

 

 

Models Case II Case III
Data-divided setting DMA FID DMA FID
Mix () Mix () Mix () Mix ()

 

CMGAN with 100.0 6.7 12.8 80.9 5.2 11.0
CMGAN with 100.0 13.2 9.8 99.1 19.0 7.8

 

Table 6: DMA and FID scores for CMGAN with $N = 2$ and CMGAN with $N = 3$. The upper table lists the results in data-duplicated settings. The lower table lists the results in data-divided settings. We alter the class-mixture sampling method (in particular, the number of mixed classes $N$) in both training and image generation and compare the scores. Bold font indicates the best score in each block.

B.1.2 Analysis of parameter sensitivity of CMGAN

In CMGAN, we need to define the hyperparameters of class-mixture sampling (i.e., $N$, the number of mixed classes, and the Dirichlet distribution parameter) in advance. In this appendix, we fixed the Dirichlet parameter and analyzed the sensitivity to $N$. We compared CMGAN with $N = 2$ and CMGAN with $N = 3$ in four cases:

  • Case II and Case III in data-duplicated settings (discussed in Section 4.1)

  • Case II and Case III in data-divided settings (discussed in Appendix B.1.1)

In the latter two cases, we used CMGAN with dropout. We omit Case I because it contains only two classes, and CMGAN with $N = 3$ cannot be applied in this case.

Results. We list the results in Table 6. Similar to the results in Tables 2 and 5, each model achieves the best FID scores when the prior used in training is also used for generating the images. Interestingly, CMGAN with $N = 3$ outperforms CMGAN with $N = 2$ in terms of the DMA scores, whereas CMGAN with $N = 2$ outperforms CMGAN with $N = 3$ in terms of the FID scores. We argue that this is because CMGAN with $N = 3$ attempts to capture wider between-class relationships than CMGAN with $N = 2$. As a result, CMGAN with $N = 3$ succeeds in finding class-mutual images without omission. In contrast, it may cover between-class relationships excessively: for example, in Case II the maximum number of related classes is two, yet CMGAN with $N = 3$ attempts to cover not only two-class relationships but also three-class relationships. This excessive representation may cause image-quality degradation.

B.1.3 Effect of classifier posterior knowledge

In the last paragraph of Section 4.2, we showed that the classifier posterior can easily be imitated using an additional GAN model, and that CPGAN with this imitated class posterior (which can generate images without relying on any supervision, similarly to the other models) is competitive with CPGAN with the real class posterior in terms of the FID score. To confirm whether this observation holds in other cases, we also conducted the same experiments on MNIST. We evaluated the effectiveness in six cases:

  • Case I, Case II, and Case III in data-duplicated settings (discussed in Section 4.1)

  • Case I, Case II, and Case III in data-divided settings (discussed in Appendix B.1.1)

In the latter three cases, we used CPGAN with dropout. For the additional GAN, we used the network architecture described in Table 14.

Results. The results are listed in Table 7. We find that in all cases, the scores for CPGAN with the imitated class posterior are competitive with those of CPGAN with the real class posterior, which are listed in Tables 2 and 5.

 

Models Case I Case II Case III
Data-duplicated setting FID FID FID

 

CPGAN w/ real CP 4.3 3.3 3.9
CPGAN w/ imitated CP 4.3 3.3 3.9

 

 

Models Case I Case II Case III
Data-divided setting FID FID FID

 

CPGAN w/ real CP 4.8 4.3 4.0
CPGAN w/ imitated CP 4.8 4.3 4.1

 

Table 7: Comparison of the FID scores between CPGAN with the real CP and CPGAN with the imitated CP on MNIST. The upper table lists the results in data-duplicated settings. The lower table lists the results in data-divided settings. The scores for CPGAN with the real CP are the same as those listed in Tables 2 and 5.

 

Models Original Case II Case III
w/o dropout DMA FID DMA FID DMA FID
Cat Mix CP Cat Mix CP Cat Mix CP

 

AC-GAN† 93.4 14.3 27.6 14.2 36.6 15.3 26.7 15.5 31.8 20.0 43.3 20.2
CMGAN 96.9 17.4 16.8 17.2 76.3 26.8 16.4 26.4 52.5 48.2 20.5 47.5
CPGAN 94.8 12.8 26.6 12.7 51.7 14.8 27.6 14.9 37.9 18.2 60.4 18.4

 

 

Models Original Case II Case III
w/ dropout DMA FID DMA FID DMA FID
Cat Mix CP Cat Mix CP Cat Mix CP

 

AC-GAN 94.0 11.8 23.7 12.6 34.8 12.3 24.0 23.4 30.0 14.0 40.5 29.1
CMGAN 98.1 17.3 13.0 14.4 78.2 28.7 13.5 13.0 50.4 54.3 15.4 18.3
CPGAN 95.4 11.1 28.4 11.6 77.4 22.3 20.5 12.2 60.1 27.5 22.2 13.6

 

Table 8: DMA and FID scores on CIFAR-10. The upper table lists the results without dropout. The lower table lists the results with dropout. For class-mixture sampling (Mix), we set in Cases I and II, in Case III, and in all cases. Bold font indicates the top two scores across both tables. †This is the same as the original AC-WGAN-GP ResNet [12] (our reimplementation).

B.2 Additional analysis for Section 4.2

B.2.1 Effect of dropout for other models

In Section 4.2, we demonstrated that explicit regularization (i.e., dropout) mitigates the memorization effect and helps CPGAN find the semantic overlap in data-divided settings. In this appendix, we also investigate its effect on the other models (i.e., AC-GAN and CMGAN).

Results. We list the results in Table 8. Regarding the DMA, we find that dropout is most effective for CPGAN, whereas its effect is relatively small for AC-GAN and CMGAN. This is because CPGAN captures the class relationships in a data-driven manner (i.e., classifier-driven manner), whereas the other two models (AC-GAN and CMGAN) attempt to capture them in a deterministic manner regardless of the classifier.

Another interesting finding is that the DMA scores for CMGAN are higher than the classification accuracy on the real test data. We argue that this is because CMGAN covers wide class-mutual relationships and can selectively generate class-distinct images when given a condition representing a class-distinct state (i.e., a one-hot vector). Such images are more easily classifiable than the real images in the test set, which enables CMGAN to achieve higher DMA scores. The qualitative results in Figure 13 also support this observation.

B.3 Additional analysis for Section 4.3

B.3.1 Effect of classifier posterior knowledge

In Section 4.2 and Appendix B.1.3, we showed that the classifier posterior can easily be mimicked using an additional GAN model in synthetic settings. To examine whether this statement holds for real-world noisy labeled data, we conducted the same experiments on Clothing1M. For the additional GAN, we used the network architecture described in Table 14. We evaluated two models: CPGAN with and without dropout.

Results. We list the results in Table 9. We find that the scores for CPGAN with the imitated class posterior are competitive with those of CPGAN with the real class posterior, which are listed in Table 4. These results verify that the imitated class posterior is also useful for real-world noisy labeled data.

 

Models FID

 

CPGAN w/ real CP 7.0
CPGAN w/ imitated CP 6.9
CPGAN w/ dropout w/ real CP 7.3
CPGAN w/ dropout w/ imitated CP 7.2

 

Table 9: Comparison of the FID scores between CPGAN with the real CP and CPGAN with the imitated CP on Clothing1M. The scores for CPGAN with the real CP are the same as those listed in Table 4.

B.4 Additional analysis for Section 4.4

B.4.1 Image generation on different-criteria data

We tested image generation using the same dataset setting as described in Section 4.4, where we tested image-to-image translation. The purpose of this setting is to analyze the effectiveness of our proposed models on data collected under different criteria. We used CelebA and evaluated the following four cases:

  • DM image generation for two attributes

    • (A) black hair and (B) male

    • (A) black hair and (B) smiling

    • (A) male and (B) smiling

  • DM image generation for three attributes

    • (A) black hair, (B) male, and (C) smiling

Here, A, B, and C denote the dataset identifiers that we used as supervision. In the original dataset, binary attribute indicators (e.g., smiling or not smiling) are given as supervision, but in these experiments we did not use this information. We consider the data-divided setting, where class-mutual instances are completely divided into one class or the other without overlap. In the experiments above, we used dropout to mitigate the memorization effect; however, we empirically found that standard augmentation (horizontal flipping) is sufficient for this dataset, because the number of training samples is large enough that memorizing the data is difficult even without additional regularization. We compared AC-GAN, CMGAN, and CPGAN. We resized the images to a smaller resolution to shorten the training time.
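To make the data-divided setting concrete, the sketch below is our own illustration (the exact assignment rule used in the experiments is not specified here): each image that is positive for several attributes is routed to exactly one dataset identifier, so class-mutual instances are divided without overlap.

import numpy as np

rng = np.random.default_rng(0)

def divide_dataset(attrs):
    # attrs: (N, K) binary matrix of K attributes (e.g., black hair, male, smiling).
    # Returns one dataset identifier in {0, ..., K-1} per image, or -1 if no attribute holds.
    ids = np.full(len(attrs), -1, dtype=int)
    for i, a in enumerate(attrs):
        positive = np.flatnonzero(a)
        if len(positive) > 0:
            ids[i] = rng.choice(positive)  # assumption: random assignment among positive attributes
    return ids

# Toy example with 3 attributes and 4 images:
attrs = np.array([[1, 0, 0],   # class-distinct: attribute A only
                  [1, 1, 0],   # class-mutual: A and B -> assigned to exactly one of them
                  [0, 1, 1],
                  [0, 0, 1]])
print(divide_dataset(attrs))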

Results. We show the generated image samples and quantitative evaluation results for the two-attribute and three-attribute cases in Figures 9 and 10, respectively. Similar to the results for image-to-image translation (discussed in Section 4.4), CPGAN succeeds in selectively and accurately generating class-distinct (e.g., black hair, not male, and not smiling in the three-attribute case) and class-mutual (e.g., black hair, male, and smiling in the three-attribute case) images, whereas AC-GAN fails to do so.

B.4.2 Image-to-image translation for two-attribute cases

In Section 4.4, we show the results of image-to-image translation for the three-attribute case. As an additional analysis, we evaluated the two-attribute cases described in Appendix B.4.1.

Results. We show the translated image samples and quantitative evaluation results on CelebA (image-to-image translation, two attributes) in Figure 11. Refer to Section 4.4 for a detailed description of the experimental setup. We obtain results similar to those for the three-attribute case in Figure 7.

Figure 9: Generated image samples and DMA and FID scores on CelebA (image generation, two attributes). Each row shows generated samples with the noise input fixed and the class condition varied. We calculated the DMA scores for each class-distinct or class-mutual state. Note that in training, we only know the dataset identifiers indicated by colored bold font (e.g., A (black hair) and B (male) in (a)); class-distinct (e.g., A B: black hair and not male in (a)) and class-mutual (e.g., A B: black hair and male in (a)) representations are found through learning.
Figure 10: Generated image samples and DMA and FID scores on CelebA (image generation, three attributes). Each row shows generated samples with the noise input fixed and the class condition varied. We calculated the DMA scores for each class-distinct or class-mutual state. Note that in training, we only know the dataset identifiers indicated by colored bold font (i.e., A (black hair), B (male), and C (smiling)); class-distinct (e.g., A B C: black hair, not male, and not smiling) and class-mutual (e.g., A B C: black hair, male, and smiling) representations are found through learning.
Figure 11: Translated image samples and DMA scores on CelebA (image-to-image translation, two attributes). We calculated the DMA scores for each class-distinct or class-mutual state. Note that in training, we only know the dataset identifiers indicated by colored bold font (e.g., A (black hair) and B (male) in (a)); class-distinct (e.g., A B: black hair and not male in (a)) and class-mutual (e.g., A B: black hair and male in (a)) representations are found through learning.

Appendix C Extended results

C.1 Extended results of Section 4.1

In Figure 12, we provide the generated image samples on MNIST. This is the extended version of Figure 3.

C.2 Extended results of Section 4.2

In Figure 13, we provide the generated image samples on CIFAR-10. This is the extended version of Figure 6. Additionally, we show the continuous class-wise interpolation results in Figure 14. These results highlight the difference between DM image generation and typical class-wise interpolation (or category morphing).

C.3 Extended results of Section 4.3

In Figure 15, we provide the generated image samples on Clothing1M for class-distinct states (i.e., the class condition is a one-hot vector). This is the extended version of Figure 8.

C.4 Extended results of Section 4.4

In Figure 16, we provide the translated image samples on CelebA (image-to-image translation, three attributes). This is the extended version of Figure 7. We also provide the generated image samples on CelebA (image generation, three attributes) in Figure 17. This is the extended version of Figure 10.

Figure 12: Generated image samples on MNIST. This is the extended version of Figure 3. Each row shows generated samples with the noise input fixed and the class condition varied. In particular, in (a) we vary the class condition continuously between classes. This highlights the difference between typical class-wise interpolation (or category morphing) and DM image generation. Although the former makes it possible to generate images continuously between classes by varying the class condition, the changes are not necessarily related to class specificity. In contrast, CMGAN and CPGAN succeed in selectively generating class-distinct (red font) and class-mutual (blue font) images. Note that we do not use the class labels of the digits directly as supervision. Instead, we derive them from class labels denoted by capital letters (A, B, ...). See Table 4 for the details of the class-overlapping settings.
Figure 13: Generated image samples on CIFAR-10. This is the extended version of Figure 6. Each row shows generated samples with the noise input fixed and the class condition varied. In (a) and (b), each column is expected to represent airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, respectively, from left to right. In (c), each column is expected to represent airplane, automobile, bird, cat, deer, dog, and frog, respectively, from left to right. In (b), classes A, B, C, D, and E contain {truck, airplane, automobile}, {automobile, bird, cat}, {cat, deer, dog}, {dog, frog, horse}, and {horse, ship, truck}, respectively. In (c), classes A, B, and C include {dog, airplane, automobile, frog}, {automobile, bird, cat, frog}, and {cat, deer, dog, frog}, respectively. CMGAN and CPGAN succeed in selectively generating class-distinct (red font) and class-mutual (blue font) images, whereas AC-GAN fails to do so. Note that we do not use the class labels of the categories (airplane, automobile, ...) directly as supervision. Instead, we derive them from class labels denoted by capital letters (A, B, ...). See Figure 14 for the other extended results.
Figure 14: Continuous class-wise interpolation results on CIFAR-10 (Case II). This is the extended version of Figures 6 and 13. In each row, we vary the class condition continuously between classes while fixing the noise input. Even with the conventional model (i.e., AC-GAN), it is possible to generate images continuously between classes, as shown in the top block. However, the changes are not necessarily related to class specificity. For example, in the fifth column, A B (automobile) is expected to appear, but AC-GAN generates unrelated images. In contrast, CMGAN and CPGAN succeed in capturing class-distinct (surrounded by red lines) and class-mutual (surrounded by blue lines) images, and can generate them continuously on the basis of class specificity, as shown in the middle and bottom blocks. For example, from the first to the tenth columns, CMGAN and CPGAN generate A (airplane), A B (automobile), and B (bird) continuously. As shown in this figure, the aim of DM image generation differs from that of typical class-wise interpolation (or category morphing).
Figure 15: Generated image samples on Clothing1M for class-distinct states (i.e., the class condition is a one-hot vector). This is the extended version of Figure 8. Each row contains samples generated with the noise input fixed and the class condition varied. From left to right, each column is expected to represent t-shirt, shirt, knitwear, chiffon, sweater, hoodie, windbreaker, jacket, down coat, suit, shawl, dress, vest, and underwear, respectively. CPGAN can generate class-distinct (i.e., more classifiable) images by selectively using the class-distinct states. The ACC scores reflect these observations. Additionally, when using CP (or the generated CP in Appendix B.3.1) as the generator's input, CPGAN achieves a better FID score than AC-GAN and cGAN+, as listed in Tables 4 and 9. This implies that CPGAN can cover the overall distribution reasonably well by selectively using CP (or the generated CP).
Figure 16: Translated image samples on CelebA (image-to-image translation, three attributes). This is the extended version of Figure 7. Note that in training, we only know the dataset identifiers indicated by colored bold font (i.e., A (black hair), B (male), and C (smiling)); class-distinct and class-mutual representations (e.g., A B C (black hair, not male, and not smiling)) are found through learning.
Figure 17: Generated image samples on CelebA (image generation, three attributes). This is the extended version of Figure 10. Note that in training, we only know the dataset identifiers indicated by colored bold font (i.e., A (black hair), B (male), and C (smiling)); class-distinct and class-mutual representations (e.g., A B C (black hair, not male, and not smiling)) are found through learning.

Appendix D Details of experimental setup

In this appendix, we describe the details of the network architectures and training schemes for each dataset. In the following tables, FC. and Conv. indicate fully connected and convolutional layers, respectively. To downscale and upscale, we respectively use strided convolutions (Conv.) and backward convolutions (Conv.), i.e., fractionally strided convolutions. The terms BN and IN indicate batch normalization [15] and instance normalization [53], respectively. The terms ReLU and lReLU indicate the rectified linear unit [42] and the leaky rectified linear unit [33, 59], respectively.
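As a minimal illustration of this notation (not the exact layers used in the tables; the channel sizes below are placeholders), a downscaling block and an upscaling block could look as follows in PyTorch.

import torch.nn as nn

# "Conv." with stride > 1 downscales; a backward (fractionally strided / transposed)
# convolution upscales. BN, ReLU, and lReLU follow the table notation.
down_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # Conv. (downscale)
    nn.BatchNorm2d(128),                                      # BN
    nn.LeakyReLU(0.2),                                        # lReLU
)

up_block = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # backward Conv. (upscale)
    nn.BatchNorm2d(64),                                                # BN
    nn.ReLU(),                                                         # ReLU
)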

D.1 MNIST

Generative model. The generative model for MNIST, which was used for the experiments discussed in Section 4.1, is shown in Table 10. As a pre-process, we normalized the pixel values. In the generator's output layer, we used the sigmoid function. As a GAN objective, we used the Wasserstein GAN objective with gradient penalty (WGAN-GP) [12] and trained the model using the Adam optimizer [21] with a minibatch of size 64. We set the parameters to the default values of WGAN-GP (we refer to the source code provided by the authors of WGAN-GP: https://github.com/igul222/improved_wgan_training). We set the trade-off parameters for the auxiliary classifier and trained for 200,000 iterations.
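For reference, the following is a minimal sketch of the WGAN-GP gradient penalty [12] used as the GAN objective throughout this appendix; it is a generic re-implementation rather than the authors' code, and lambda_gp is a placeholder for the penalty weight.

import torch

def gradient_penalty(discriminator, real, fake, device="cpu"):
    # WGAN-GP penalty on interpolates between real and fake image batches;
    # `discriminator` is any callable returning one critic score per sample.
    eps = torch.rand(real.size(0), 1, 1, 1, device=device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    score = discriminator(interp)
    grads = torch.autograd.grad(outputs=score.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# Typical critic loss:
# d_loss = d_fake.mean() - d_real.mean() + lambda_gp * gradient_penalty(D, x_real, x_fake)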

 

Generator

 

Input: ,
4096 FC., BN, ReLU, Reshape
128 Conv. , BN, ReLU, Cut
64 Conv. , BN, ReLU ()
1 Conv. , Sigmoid ()

 

 

Discriminator / Auxiliary classifier

 

Input:
64 Conv. , lReLU ()
0.5 Dropout
128 Conv. , lReLU ()
0.5 Dropout
256 Conv. , lReLU ()
0.5 Dropout
FC. output for
FC. output for

 

Table 10: Generative model for MNIST

Classifier model used for evaluation. The classifier model used for the MNIST DMA evaluation is shown in Table 11. As a pre-process, we normalized the pixel values. We trained the model using the SGD optimizer with a minibatch of size 128. We set the initial learning rate to 0.1 and divided it by 10 at 40,000, 60,000, and 80,000 iterations. We trained the model for 100,000 iterations in total. We also measured the classification accuracy on the MNIST test data.
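The training schedule used for the evaluation classifiers (here and in the later subsections) can be sketched as follows; this is only an illustration of the schedule described above, with `classifier` standing in for the CNN of Table 11.

import torch
from torch.optim.lr_scheduler import MultiStepLR

# SGD, initial learning rate 0.1, divided by 10 at 40,000 / 60,000 / 80,000
# iterations, 100,000 iterations in total.
classifier = torch.nn.Linear(784, 10)  # placeholder for the CNN in Table 11
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
scheduler = MultiStepLR(optimizer, milestones=[40000, 60000, 80000], gamma=0.1)

for iteration in range(100000):
    # forward pass and cross-entropy loss on a minibatch of size 128 would go here:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()
    scheduler.step()  # stepped per iteration (not per epoch) in this sketch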

 

Input:
64 Conv., BN, ReLU
64 Conv. , BN, ReLU ()
0.5 Dropout
128 Conv., BN, ReLU
128 Conv. , BN, ReLU ()
0.5 Dropout
256 Conv., BN, ReLU
256 Conv., BN, ReLU
256 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
FC. output

 

Table 11: Classifier model used for MNIST DMA evaluation

D.2 CIFAR-10

Generative model. The generative model for CIFAR-10, which was used for the experiments discussed in Section 4.2, is shown in Table 12. The network architecture is the same as that of AC-WGAN-GP ResNet [12] (we refer to the source code provided by the authors of WGAN-GP: https://github.com/igul222/improved_wgan_training). Regarding the dropout position, we also refer to CT-GAN ResNet [57] (source code provided by the authors of CT-GAN: https://github.com/biuyq/CT-GAN/blob/master/CT-GANs). Similar to AC-WGAN-GP ResNet, we used conditional batch normalization (CBN) [10, 8] to condition the generator on the class state. In CMGAN and CPGAN, to represent class-mixture states, we combined the CBN parameters (i.e., the scale and bias parameters) with the weights calculated by Mix and CP, respectively. As a pre-process, we normalized the pixel values. In the generator's output layer, we used the hyperbolic tangent (tanh) function. As a GAN objective, we used WGAN-GP and trained the model using the Adam optimizer with minibatch sizes of 64 and 128, respectively. We set the parameters to the default values of WGAN-GP. We compared the performance when varying the value of the trade-off parameter, as shown in Figure 5. We trained for 100,000 iterations, letting the learning rate decay linearly to 0 over the 100,000 iterations.
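The following is a rough sketch of the conditioning mechanism described above, in which the per-class CBN scale and bias parameters are combined with the class weights (one-hot, Mix, or CP vectors). It is our own illustration with placeholder sizes, not the exact released implementation.

import torch
import torch.nn as nn

class MixtureCBN(nn.Module):
    # Conditional batch normalization whose per-class gamma/beta are blended
    # according to the class-weight vector of each sample.
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, class_weights):
        # class_weights: (batch, num_classes); a one-hot row gives ordinary CBN,
        # a mixture row (Mix or CP) blends the per-class parameters.
        gamma = class_weights @ self.gamma.weight   # (batch, num_features)
        beta = class_weights @ self.beta.weight
        h = self.bn(x)
        return gamma[:, :, None, None] * h + beta[:, :, None, None]

# e.g., cbn = MixtureCBN(128, 10); y = cbn(feature_map, mix_or_cp_weights)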

 

Generator

 

Input: ,
2048 FC., Reshape
ResBlock ()
ResBlock ()
ResBlock ()
BN, ReLU, 3 Conv., Tanh ()

 

 

Discriminator / Auxiliary classifier

 

Input:
ResBlock ()
ResBlock ()
0.2 Dropout
ResBlock ()
0.5 Dropout
ResBlock ()
0.5 Dropout
ReLU, Global mean pool
FC. output for
FC. output for

 

Table 12: Generative model for CIFAR-10

Classifier model used for evaluation. The classifier model used for the CIFAR-10 DMA evaluation is shown in Table 13. As a pre-process, we normalized the pixel values. We trained the model using the SGD optimizer with a minibatch of size 128. We set the initial learning rate to 0.1 and divided it by 10 at 40,000, 60,000, and 80,000 iterations. We trained the model for 100,000 iterations in total. We also measured the classification accuracy on the CIFAR-10 test data.

 

Input:
128 Conv., BN, ReLU
128 Conv., BN, ReLU
MaxPool ()
256 Conv., BN, ReLU
256 Conv., BN, ReLU
MaxPool ()
512 Conv., BN, ReLU
512 Conv., BN, ReLU
512 Conv., BN, ReLU
256 Conv., BN, ReLU
MaxPool ()
1024 FC., ReLU
0.5 Dropout
1024 FC., ReLU
0.5 Dropout
FC. output

 

Table 13: Classifier model used for CIFAR-10 DMA evaluation

Generative model for classifier posterior. The generative model used for classifier posterior generation is shown in Table 14. We used a cGAN with the concat discriminator [39]. The dimensionality of its noise input is the same as that of the classifier posterior. As a GAN objective, we used WGAN-GP and trained the model using the Adam optimizer with a minibatch of size 256. We set the parameters to the default values of WGAN-GP for toy datasets (we refer to the source code provided by the authors of WGAN-GP: https://github.com/igul222/improved_wgan_training). We trained for 100,000 iterations.

 

Generator

 

Input: ,
512 FC., ReLU
512 FC., ReLU
512 FC., ReLU
FC. output for

 

 

Discriminator

 

Input: ,
512 FC., ReLU +
512 FC., ReLU +
512 FC., ReLU +
FC. output for

 

Table 14: Generative model for classifier posterior
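The model outlined in Table 14 can be sketched roughly as follows. The dimensions are placeholders, the softmax output is our assumption (to keep the output posterior-like), and the condition is concatenated only at the input here, whereas the table suggests it is injected at every discriminator layer.

import torch
import torch.nn as nn

NOISE_DIM, COND_DIM, CP_DIM = 64, 10, 10  # hypothetical sizes

class PosteriorGenerator(nn.Module):
    # MLP generator mapping noise plus a condition to a posterior-like vector.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + COND_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, CP_DIM),
            nn.Softmax(dim=1),  # assumption: output lies on the probability simplex
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class ConcatDiscriminator(nn.Module):
    # Concat discriminator: the posterior vector and the condition are concatenated.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CP_DIM + COND_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),  # WGAN critic score
        )

    def forward(self, cp, cond):
        return self.net(torch.cat([cp, cond], dim=1))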

D.3 Clothing1M

Generative model. The generative model for Clothing1M, which was used for the experiments discussed in Section 4.3, is shown in Table 15. As a pre-process, we normalized the pixel values. In the generator's output layer, we used the tanh function. As a GAN objective, we used WGAN-GP and trained the model using the Adam optimizer with a minibatch of size 66. We set the parameters to the default values of WGAN-GP for images (we refer to the source code provided by the authors of WGAN-GP: https://github.com/igul222/improved_wgan_training). We set the trade-off parameters for the auxiliary classifier and trained for 200,000 iterations.

 

Generator

 

Input: ,
8192 FC., Reshape
ResBlock ()
ResBlock ()
ResBlock ()
ResBlock ()
BN, ReLU, 3 Conv., Tanh ()

 

 

Discriminator / Auxiliary classifier

 

Input:
64 Conv. ()
ResBlock ()
0.2 Dropout
ResBlock ()
0.2 Dropout
ResBlock ()
0.5 Dropout
ResBlock ()
0.5 Dropout
FC. output for
FC. output for

 

Table 15: Generative model for Clothing1M

Classifier model used for evaluation. The classifier model used for the Clothing1M ACC evaluation is shown in Table 16. As a pre-process, we normalized the pixel values. We trained the model using the SGD optimizer with a minibatch of size 129. We set the initial learning rate to 0.1 and divided it by 10 at 40,000, 60,000, and 80,000 iterations. We trained the model for 100,000 iterations in total. We also measured the classification accuracy on the clean labeled data.

 

Input:
64 Conv., BN, ReLU
64 Conv. , BN, ReLU ()
0.5 Dropout
128 Conv., BN, ReLU
128 Conv., BN, ReLU
128 Conv. , BN, ReLU ()
0.5 Dropout
256 Conv., BN, ReLU
256 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
512 Conv., BN, ReLU
512 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
512 Conv., BN, ReLU
512 Conv., BN, ReLU
256 Conv., BN, ReLU ()
0.5 Dropout
FC. output

 

Table 16: Classifier model used for Clothing1M ACC evaluation

D.4 CelebA (image-to-image translation)

Generative model. The generative model for CelebA on the image-to-image translation task, which was used for the experiments discussed in Section 4.4, is shown in Table 17. The network architecture is the same as that of StarGAN [7] (we refer to the source code provided by the authors of StarGAN: https://github.com/yunjey/StarGAN). As a pre-process, we normalized the pixel values. In the generator's output layer, we used the tanh function. As a GAN objective, we used WGAN-GP and trained the model using the Adam optimizer with a minibatch of size 16. We set the parameters to the default values of StarGAN. We set the trade-off parameters for the auxiliary classifier. We trained the model for 50,000 iterations and let the learning rate decay linearly to 0 over the last 25,000 iterations.
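The linear decay just described can be sketched as follows; this is a generic illustration (the model, optimizer, and base learning rate are placeholders), keeping the learning rate constant for the first 25,000 iterations and decaying it linearly to 0 over the last 25,000.

import torch
from torch.optim.lr_scheduler import LambdaLR

TOTAL_ITERS, DECAY_START = 50000, 25000
model = torch.nn.Linear(8, 8)  # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def linear_decay(it):
    # Multiplier on the base learning rate at iteration `it`.
    if it < DECAY_START:
        return 1.0
    return max(0.0, 1.0 - (it - DECAY_START) / (TOTAL_ITERS - DECAY_START))

scheduler = LambdaLR(optimizer, lr_lambda=linear_decay)
# call scheduler.step() once per training iteration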

 

Generator

 

Input: ,
, 64 Conv., IN, ReLU ()
, 128 Conv. , IN, ReLU ()
, 256 Conv. , IN, ReLU ()
ResBlock ()
ResBlock ()
ResBlock ()
ResBlock ()
ResBlock ()
ResBlock ()
128 Conv. , IN, ReLU ()
64 Conv. , IN, ReLU ()
3 Conv., Tanh ()

 

 

Discriminator / Auxiliary classifier

 

Input:
64 Conv. , lReLU ()
128 Conv. , lReLU ()
256 Conv. , lReLU ()
512 Conv. , lReLU ()
1024 Conv. , lReLU ()
2048 Conv. , lReLU ()
Conv. output for

Conv. (zero pad) output for

 

Table 17: Generative model for CelebA (image-to-image translation)

Classifier model used for evaluation. The classifier model used for the CelebA DMA evaluation on the image-to-image translation task is shown in Table 18. As a pre-process, we normalized the pixel values. We trained the model using the SGD optimizer with a minibatch of size 128. We set the initial learning rate to 0.1 and divided it by 10 at 40,000, 60,000, and 80,000 iterations. We trained the model for 100,000 iterations in total. We also measured the binary classification accuracy for black hair, male, and smiling.

 

Input:
32 Conv., BN, ReLU
32 Conv. , BN, ReLU ()
0.5 Dropout
64 Conv., BN, ReLU
64 Conv. , BN, ReLU ()
0.5 Dropout
128 Conv., BN, ReLU
128 Conv. , BN, ReLU ()
0.5 Dropout
256 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
512 Conv., BN, ReLU
512 Conv., BN, ReLU
512 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
FC. output

 

Table 18: Classifier model used for CelebA (image-to-image translation) DMA evaluation

D.5 CelebA (image generation)

Generative model. The generative model for CelebA on the image-generation task, which was used for the experiments discussed in Appendix B.4.1, is shown in Table 19. As a pre-process, we normalized the pixel values. In the generator's output layer, we used the tanh function. As a GAN objective, we used WGAN-GP and trained the model using the Adam optimizer with a minibatch of size 64. We set the parameters to the default values of WGAN-GP for the standard CNN architecture (we refer to the source code provided by the authors of WGAN-GP: https://github.com/igul222/improved_wgan_training). We set the trade-off parameters for the auxiliary classifier and trained for 150,000 iterations.

 

Generator

 

Input: ,
8192 FC., BN, ReLU, Reshape
256 Conv. , BN, ReLU ()
128 Conv. , BN, ReLU ()
64 Conv. , BN, ReLU ()
3 Conv. , Tanh ()

 

 

Discriminator / Auxiliary classifier

 

Input:
64 Conv. , lReLU ()
128 Conv. , lReLU ()
256 Conv. , lReLU ()
512 Conv. , lReLU ()
FC. output for
FC. output for

 

Table 19: Generative model for CelebA (image generation)

Classifier model used for evaluation. The classifier model used for the CelebA DMA evaluation on the image-generation task is shown in Table 20. As a pre-process, we normalized the pixel values. We trained the model using the SGD optimizer with a minibatch of size 128. We set the initial learning rate to 0.1 and divided it by 10 at 40,000, 60,000, and 80,000 iterations. We trained the model for 100,000 iterations in total. We also measured the binary classification accuracy for black hair, male, and smiling.

 

Input:
64 Conv., BN, ReLU
64 Conv. , BN, ReLU ()
0.5 Dropout
128 Conv., BN, ReLU
128 Conv. , BN, ReLU ()
0.5 Dropout
256 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
512 Conv., BN, ReLU
512 Conv., BN, ReLU
512 Conv., BN, ReLU
256 Conv. , BN, ReLU ()
0.5 Dropout
FC. output

 

Table 20: Classifier model used for CelebA (image generation) DMA evaluation