Conditional Transferring Features: Scaling GANs to Thousands of Classes with 30

by Chunpeng Wu, et al.

Generative adversarial networks (GANs) have greatly improved the quality of unsupervised image generation. Previous GAN-based methods often require a large amount of high-quality training data while producing only a small number (e.g., tens) of classes. This work aims to scale GANs up to thousands of classes while reducing the use of high-quality data in training. We propose an image generation method based on conditional transferring features, which capture pixel-level semantic changes when transforming low-quality images into high-quality ones. Moreover, self-supervision learning is integrated into our GAN architecture to provide more label-free semantic supervisory information observed from the training data. As such, training our GAN architecture requires far fewer high-quality images, supplemented by a small number of additional low-quality images. The experiments on CIFAR-10 and STL-10 show that, even with 30% of the high-quality training images removed, our method outperforms previous ones. The scalability on object classes has been experimentally validated: with 30% fewer high-quality training images, our method obtains the best quality in generating 1,000 ImageNet classes, as well as in generating all 3,755 classes of CASIA-HWDB1.0 Chinese handwriting characters.




I Introduction

As one of the most exciting breakthroughs in unsupervised machine learning, the generative adversarial network (GAN) [12] has been successfully applied to a variety of applications, such as face verification [19], human pose estimation [3] and small object detection [18]. In principle, GANs are trained in an adversarial manner: a generator produces new data by mimicking a targeted distribution; meanwhile, a discriminator measures the similarity between the generated and targeted distributions, which in turn is used to adapt the generator.
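The adversarial objective described above can be illustrated with a minimal sketch of the standard cross-entropy GAN losses from [12], using the non-saturating form for the generator. The function names and the toy discriminator outputs are hypothetical and chosen only for illustration; this is not the training code of our method.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy loss of the discriminator.

    d_real / d_fake are discriminator probabilities in (0, 1) for real and
    generated samples (hypothetical toy values, not taken from the paper).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) towards 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A confident discriminator separates real from generated samples well.
confident = (discriminator_loss([0.9, 0.8], [0.1, 0.2]),
             generator_loss([0.1, 0.2]))
# A fooled discriminator outputs 0.5 everywhere.
fooled = (discriminator_loss([0.5, 0.5], [0.5, 0.5]),
          generator_loss([0.5, 0.5]))
```

When the discriminator is confident, its own loss is low and the generator loss is high; the generator update then moves the generated distribution towards the targeted one, which raises the discriminator loss again.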

The quality of generated data relies heavily on both the volume and the quality of the training data. For example, our experiments on GAN-based image generation and image-to-image translation show dramatic performance degradation when reducing the number of high-quality training images. Figure 1 shows several mushroom images generated by SN-GAN [21] trained with 60% (top row) or 100% (bottom row) of the ImageNet training data [27]. The images in the bottom row, obtained by using the entire training dataset, present a more distinguishable appearance (e.g., cap and stem of the mushroom) and have much better quality. After removing 40% of the training data, the Inception score decreases from 21.1 to 14.8 and FID increases from 90.4 to 141.2. The high demand for high-quality training data has emerged as a major challenge for GAN-based methods: it is very difficult or even impossible to collect sufficient data for producing satisfactory results in real-world applications.
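To put the degradation quoted above into relative terms, the following short sketch computes the relative change of both metrics from the four numbers in the text. The helper name is hypothetical.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, as a fraction."""
    return (after - before) / before

# Values quoted in the text for SN-GAN trained on 100% vs. 60% of ImageNet.
inception_change = relative_change(21.1, 14.8)  # negative: the score drops
fid_change = relative_change(90.4, 141.2)       # positive: FID gets worse

print(f"Inception score change: {inception_change:+.1%}")
print(f"FID change: {fid_change:+.1%}")
```

Under these numbers, removing 40% of the training data costs roughly 30% of the Inception score and raises FID by roughly 56%.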

Fig. 1: Our experiments on GAN-based image generation. The top and bottom rows show mushroom images generated using 60% and 100% of the ImageNet training set, respectively.

To address the challenge, we propose an image generation method based on conditional transferring features (CTFs) with three key solutions. First, we construct the training data from a portion of the original high-quality images and a small number of low-quality images. Second, our method extracts the CTFs by transforming low-quality images into high-quality images. Third, we further enhance our method with more label-free supervisory information observed from the training data.

II Related Work

Many GAN research studies explore how to stabilize GAN training through modifying network architecture [25, 14] and optimizing algorithms [1, 20]. The recent SN-GAN [21] stabilizes the discriminator of a GAN using a novel weight normalization method. In this work, we propose a new approach rarely considered in GAN-based methods, that is, using low-quality training data to facilitate the generation of high-quality images.

Fig. 2: Our proposed image generation method. The data-flow in blue extracts our proposed CTFs and provides them to the image generator. Tasks ② (in blue) and ③ (in dotted lines) are first adversarially trained, followed by the training of adversarial tasks ① and ③. The numbers of high-quality and low-quality images are not required to be the same.

GAN-based image generation methods [5, 28, 9, 7, 4, 32, 30, 13, 29, 15, 24, 17] tackle the issues of multi-resolution, variation observation, architecture changing, energy estimation for samples, embedding recursive structures, integrating condition information into GANs, and quality evaluation of generated images. Recently, BigGAN [2] significantly improved image synthesis quality by adding orthogonal regularization to the generator. Our GAN-based method can scale to thousands of classes with significantly less high-quality training data.

III Image Generation Based on CTFs

Figure 2 shows our proposed image generation method. Following traditional GAN-based methods, the design consists of an image generator and a discriminator. Our discriminator is also used for self-supervision (SP) learning. We introduce a second generator for extracting CTFs. There are three learning tasks in our method:

  • Task ① uses the image generator and the discriminator to produce images similar to the high-quality images, taking noise and the conditional transferring features as the image generator's input. The Conv layers in the image generator convolve the CTFs with the output of the previous layer at the same resolution.

  • Task ② (highlighted in blue) uses the CTF-extracting generator and the discriminator to transform the low-quality images into images similar to the high-quality ones, and provides the extracted CTFs to the image generator. Noise is injected into each Res.+Unpool (ResBlock+Unpooling) block of the CTF-extracting generator to increase the randomness of the generated images.

  • Task ③ (represented in dotted lines) uses the discriminator to distinguish the real images from the synthetic images.

The adversarial tasks ② and ③ are first trained for extracting the CTFs until no significant improvement can be observed. Afterwards, tasks ① and ③ are adversarially trained for image generation based on the CTFs.
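The two-stage schedule described above can be sketched as a small control loop. This is only an illustration under stated assumptions: the text does not specify how "no significant improvement" is detected, so the patience-based stopping rule, the thresholds, and all function and parameter names below are hypothetical.

```python
def train_two_stage(step_ctf, step_gen, score_ctf, patience=3,
                    min_delta=1e-3, max_ctf_steps=1000, gen_steps=5):
    """Stage 1: run the CTF-extraction tasks until the score stops improving.
    Stage 2: run the image-generation tasks for a fixed number of steps.

    step_ctf / step_gen are callables performing one adversarial update;
    score_ctf returns a quality score where higher is better (assumption).
    """
    best, stale, ctf_steps = float("-inf"), 0, 0
    while ctf_steps < max_ctf_steps and stale < patience:
        step_ctf()
        ctf_steps += 1
        score = score_ctf()
        if score > best + min_delta:
            best, stale = score, 0   # significant improvement observed
        else:
            stale += 1               # no significant improvement
    for _ in range(gen_steps):
        step_gen()
    return ctf_steps, gen_steps

# Toy run: the stage-1 score saturates at 0.4, so stage 1 stops early.
scores = iter([0.1, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4])
log = []
result = train_two_stage(
    step_ctf=lambda: log.append("CTF extraction + discrimination"),
    step_gen=lambda: log.append("image generation + discrimination"),
    score_ctf=lambda: next(scores),
)
```

In the toy run, stage 1 performs six updates (three improving, three stale) before stage 2 starts.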

Fig. 3: Data-flow (in green) for extracting the conditional transferring features in a ResBlock+Unpooling block of the CTF-extracting generator.

III-A Extracting CTFs

In the CTF-extracting generator, the ResBlock+Unpooling block is used to extract the conditional transferring feature. Figure 3 shows the detailed extraction data-flow (in green). The random noise is embedded and concatenated to the input of the block. The concatenation operator has been described previously in [26, 21, 22]. We replace the batch-normalization [16] layers in the traditional ResBlock with two conditional batch-normalization (CBN) [8] layers in ResBlock+Unpooling. Both CBN layers are conditioned on the label information of the high-quality images. According to the definition of CBN [8], for a given layer, an input activation $x$ is transformed into a normalized activation $z$ specific to a class $c$, calculated as:

$$z = \gamma_c \, \frac{x - \mu}{\sigma} + \beta_c \qquad (1)$$

where $\mu$ and $\sigma$ are respectively the mean and standard deviation taken across spatial axes, and $\gamma_c$ and $\beta_c$ are trainable parameters specific to class $c$. Thus, the trainable parameters of each CBN layer are its scale and shift parameters collected across all the classes of the high-quality images.
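The class-conditional normalization described above, following the CBN definition in [8], can be sketched for a single channel as follows. This is a minimal illustration: the function name, the small `eps` constant added for numerical stability, and the toy per-class parameters are assumptions and not part of our implementation.

```python
import math

def conditional_batch_norm(x, class_id, gamma, beta, eps=1e-5):
    """Normalize one channel over its spatial axes, then apply the scale
    and shift of the given class: z = gamma_c * (x - mu) / sigma + beta_c.

    x is a 2-D list (H x W) holding a single channel; gamma and beta hold
    one trainable scalar per class (hypothetical toy values below).
    """
    values = [v for row in x for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var + eps)
    g, b = gamma[class_id], beta[class_id]
    return [[g * (v - mu) / sigma + b for v in row] for row in x]

feature_map = [[1.0, 2.0], [3.0, 4.0]]
gamma = [1.0, 2.0]  # one scale per class
beta = [0.0, 0.5]   # one shift per class
z0 = conditional_batch_norm(feature_map, 0, gamma, beta)
z1 = conditional_batch_norm(feature_map, 1, gamma, beta)
```

The same input activation yields different normalized activations for different classes, which is how the label information enters the block.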

The label information of both the low-quality and high-quality images is concatenated with the feature differences between adjacent ResBlock+Unpooling blocks. The conditional transferring feature is calculated by:


where the first term is the aggregated difference map between the feature maps of the previous and the current ResBlock+Unpooling blocks. Note that the two blocks might not have the same number of feature maps. More precisely, given a feature map of the current block, the difference map between it and each feature map of the previous block is calculated, and the aggregated difference map is then obtained by aggregating all of these difference maps together. The remaining terms are the labels of the input low-quality images and the class-conditional CBN parameters, which carry the label information of the high-quality images. The class information of the low-quality and high-quality images is first concatenated together before being concatenated to the difference maps. The class-conditional parameters of the CBN layer that precedes the Unpooling layer are not used in Equation (2) because, as shown in Figure 3, its resolution corresponds to that of the previous block rather than the current one. The feature maps of the previous block are upsampled to the size of those of the current block using bilinear interpolation. For the first ResBlock+Unpooling block, the previous feature maps are replaced by the gray-level version of the low-quality image. The differences between a pair of feature maps are evaluated in a DCT-based frequency domain. For the first block, the difference map is calculated as shown in Equation (3):


where the two transforms are the DCT and the inverse DCT, one function upsamples an image using bilinear interpolation, and another function converts a color image into a gray-level image. For the remaining blocks, the difference map is calculated as shown in Equation (4):
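The operations named above (gray-level conversion, bilinear upsampling, DCT and inverse DCT) can be sketched in plain Python. This is only an illustration of the building blocks and does not reproduce Equations (3) and (4). It rests on several assumptions: channel averaging for the gray-level conversion, an orthonormal DCT-II, corner-aligned bilinear interpolation, and a plain subtraction of DCT coefficients. All function names are hypothetical.

```python
import math

def to_gray(rgb):
    """Average the three channels of an H x W x 3 nested list (assumption)."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]

def upsample_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D list, mapping corners onto corners."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1, wy = min(y0 + 1, in_h - 1), y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1, wx = min(x0 + 1, in_w - 1), x - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bot = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out

def dct_1d(v):
    """Orthonormal DCT-II of a list."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [sum(c[k] * math.sqrt((1.0 if k == 0 else 2.0) / n)
                * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
            for t in range(n)]

def apply_2d(f, m):
    """Apply a 1-D transform along the rows, then along the columns."""
    rows = [f(r) for r in m]
    cols = [f(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def frequency_difference(current, previous):
    """DCT-domain difference between `current` and the upsampled `previous`."""
    prev_up = upsample_bilinear(previous, len(current), len(current[0]))
    a, b = apply_2d(dct_1d, current), apply_2d(dct_1d, prev_up)
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy feature maps: a 2x2 map from the previous block, a 4x4 current map.
prev_block = [[0.0, 2.0], [4.0, 6.0]]
curr_block = [[1.0] * 4 for _ in range(4)]
diff = frequency_difference(curr_block, prev_block)
spatial_diff = apply_2d(idct_1d, diff)  # back to the spatial domain
```

Because the DCT is linear, transforming the coefficient difference back with the inverse DCT gives the spatial difference between the current map and the upsampled previous map.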


III-B Self-supervision (SP) learning

Popular self-supervision tasks [6, 11, 23] predict chromatic transformations, rotation, scaling, the relative position of image patches, etc. Our self-supervision method is inspired by [10]. For any image sampled from the training data or generated by either of the two generators, we randomly cut an image patch from it, paste the patch at a random location of the image, and record the bounding-box coordinates of the patch. The self-supervision loss shown in Figure 2 is the difference between the recorded coordinates and those predicted by the discriminator.
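The cut-and-paste pretext task described above can be sketched as follows. The fixed patch size, the mean absolute difference used as the loss, and all function names are assumptions made for illustration; the text only states that the loss is a difference between recorded and predicted coordinates.

```python
import random

def cut_and_paste(image, patch_h, patch_w, rng):
    """Copy a random patch of `image` to a random location.

    Returns the modified copy and the bounding box (top, left, bottom,
    right) of the pasted patch, which serves as the label-free target.
    """
    h, w = len(image), len(image[0])
    src_t, src_l = rng.randint(0, h - patch_h), rng.randint(0, w - patch_w)
    dst_t, dst_l = rng.randint(0, h - patch_h), rng.randint(0, w - patch_w)
    patch = [row[src_l:src_l + patch_w] for row in image[src_t:src_t + patch_h]]
    out = [list(row) for row in image]  # leave the input image untouched
    for i in range(patch_h):
        out[dst_t + i][dst_l:dst_l + patch_w] = patch[i]
    return out, (dst_t, dst_l, dst_t + patch_h, dst_l + patch_w)

def sp_loss(recorded_box, predicted_box):
    """Mean absolute difference between recorded and predicted coordinates."""
    pairs = zip(recorded_box, predicted_box)
    return sum(abs(r - p) for r, p in pairs) / len(recorded_box)

rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 image
pasted, box = cut_and_paste(image, 3, 3, rng)
loss = sp_loss(box, (0, 0, 3, 3))  # hypothetical discriminator prediction
```

No class label is needed to build the target box, so this supervisory signal is available for real and generated images alike.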

Method | CIFAR-10 training data | CIFAR-10 Inception score | STL-10 training data | STL-10 Inception score
AC-GAN [24] | 50,000 (CIFAR-10 training set) | | 5,000 (STL-10 training set) | -
Ours (Fewer HQ) | 32,500+10,000 | | 3,250+1,000 |
Ours (Entire HQ) | 50,000+10,000 | | 5,000+1,000 |
TABLE I: Conditional image generation on CIFAR-10 and STL-10.

Method | Training data | Inception score
SN-GAN+Projection | 1,246,991 |
Ours (Fewer HQ) | 880,544 (810,544+70,000) |
Ours (Entire HQ) | 1,316,991 (1,246,991+70,000) |
TABLE II: Conditional image generation on CASIA-HWDB1.0.

IV Experiments

Method | Training data | Inception score
AC-GAN [24] | 1,282,167 (ImageNet training set) |
SN-GAN [21] | 1,282,167 (ImageNet training set) |
Ours (Fewer HQ) | 893,408 (833,408+60,000) |
Ours (Entire HQ) | 1,342,167 (1,282,167+60,000) |
TABLE III: Conditional image generation on ImageNet.

Our image generator has an architecture similar to that of the generator adopted in SN-GAN+Projection [21, 22], with two differences. First, our image generator has additional layers for convolving with the CTFs provided by the CTF-extracting generator. Second, our image generator uses regular BN (instead of the CBN in SN-GAN+Projection), while our CTF-extracting generator uses CBN.

IV-A Conditional Image Generation on CIFAR-10 and STL-10

Table I compares the quality of image generation by previous methods [25, 28, 24, 31, 14, 13, 17, 21, 22] with ours. All the previous approaches use the entire CIFAR-10 training set (50,000 images). Our training data is a mix of high-quality (HQ) images sampled from the CIFAR-10 or STL-10 training set and low-quality (LQ) images. Since CIFAR-10 and STL-10 are already the “simplest” datasets, we have to use their down-sampled versions as LQ images. For comparison purposes, Ours (Fewer HQ) uses 32,500 CIFAR-10 images and 10,000 down-sampled images as training data, and Ours (Entire HQ) uses the entire CIFAR-10 training set and 10,000 down-sampled images. Following the popular testing protocol [14, 13, 21], we scale all the generated images to 32×32 for CIFAR-10 classes and 48×48 for STL-10 classes.

The experiment shows that Ours (Fewer HQ), with less training data, slightly outperforms previous methods. Using the entire CIFAR-10 or STL-10 training set further improves the image quality of our method: Ours (Entire HQ) is respectively 1.7% and 2.7% better in Inception score than the previous best, SN-GAN+Projection [21, 22].

IV-B Conditional Image Generation on 3,755-Class CASIA-HWDB1.0

To further validate the scalability on object classes, we compare the generation of the 3,755 classes of CASIA-HWDB1.0 Chinese characters by our method and by SN-GAN+Projection [21, 22]. SN-GAN+Projection adopts the entire CASIA-HWDB1.0 training set (1,246,991 images) as training data. Our training data takes 810,544 images from the CASIA-HWDB1.0 training set as HQ images and 70,000 MNIST handwriting images as LQ images. The total size of our training data (880,544 images) is 29.4% smaller than the entire CASIA-HWDB1.0 training set. The resolution of the generated images is set to 48×48, which is the same as that of the original CASIA-HWDB1.0 dataset.

The quantitative comparison in Table II validates that Ours (Fewer HQ) and Ours (Entire HQ) can produce higher-quality Chinese characters in 3,755 CASIA-HWDB1.0 classes, compared to SN-GAN+Projection. The quality gap between SN-GAN+Projection and Ours (Fewer HQ) is larger than the gap presented in Table I, which implies our advantage on more image classes.

Fig. 4: Our generated images of the ImageNet classes (Ours (Entire HQ)).

IV-C Conditional Image Generation on ImageNet

We use our method for conditional image generation on ImageNet classes and compare it to AC-GAN [24], SN-GAN [21] and SN-GAN+Projection [21, 22]. The training of the three previous GANs adopts the entire ImageNet training set (1,282,167 images). The training data of Ours (Fewer HQ) contains 833,408 images from the ImageNet training set as HQ images and 60,000 CIFAR-100 images as LQ ones. Thus, the total training set of Ours (Fewer HQ) is 30.3% smaller than the entire ImageNet training set used in previous methods [24, 21, 22]. The resolution of the generated images is set to 128×128 to compare with previous methods.
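The training-set sizes quoted in the experiments can be cross-checked with a few lines of arithmetic. The sketch below uses only the image counts stated in the text; the helper name is hypothetical.

```python
def reduction(hq_used, lq_added, hq_full):
    """Return (fraction of HQ images removed, fraction by which the total
    training set shrinks relative to the full HQ training set)."""
    hq_removed = 1.0 - hq_used / hq_full
    total_smaller = 1.0 - (hq_used + lq_added) / hq_full
    return hq_removed, total_smaller

# (HQ images used, LQ images added, full HQ training set), as quoted above.
settings = {
    "CIFAR-10": (32_500, 10_000, 50_000),
    "CASIA-HWDB1.0": (810_544, 70_000, 1_246_991),
    "ImageNet": (833_408, 60_000, 1_282_167),
}
results = {name: reduction(*args) for name, args in settings.items()}
for name, (hq_removed, total_smaller) in results.items():
    print(f"{name}: {hq_removed:.1%} of HQ removed, "
          f"total set {total_smaller:.1%} smaller")
```

With these counts, the total training set is 15.0%, 29.4% and 30.3% smaller for CIFAR-10, CASIA-HWDB1.0 and ImageNet respectively, while the share of HQ images removed is about 35% in each case.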

Table III summarizes the comparison with previous methods, and Figure 4 shows some examples of the images generated by Ours (Entire HQ). Ours (Fewer HQ) outperforms previous methods, even though it uses 30.3% less training data. Using the entire ImageNet training set and CIFAR-100 images, Ours (Entire HQ) is 17.0% better in Inception score than the previous best, SN-GAN+Projection [21, 22].

V Conclusion

Previous GAN-based image generation methods face the challenge of a heavy dependency on high-quality training data. In contrast, collecting low-quality images is relatively easier and more economical. By observing the learning process of transforming low-quality images into high-quality images, we find that certain intermediate outputs combined with the class information, namely conditional transferring features (CTFs), can be adopted to improve the quality of image generation and the scalability of GANs on object classes. Moreover, we integrate self-supervision learning into our GAN architecture to further improve the learning ability of the GAN. Experiments on conditional image generation tasks show that our method performs better than previous methods even when 30% of the high-quality training data is removed. Our method also successfully scales GANs to thousands of object classes, such as the 1,000 ImageNet classes and the 3,755 CASIA-HWDB1.0 classes.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv:1701.07875v3. Cited by: §II.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2019) Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR. Cited by: §II.
  • [3] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang (2017) Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation. ICCV. Cited by: §I.
  • [4] Z. Dai, A. Almahairi, P. Bachman, E. H. Hovy, and A. C. Courville (2017) Calibrating Energy-based Generative Adversarial Networks. ICLR. Cited by: §II.
  • [5] E. Denton, S. Chintala, A. Szlam, and R. Fergus (2015) Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. NeurIPS. Cited by: §II.
  • [6] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. NeurIPS. Cited by: §III-B.
  • [7] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017) Adversarially Learned Inference. ICLR. Cited by: §II, §IV-C, TABLE III.
  • [8] V. Dumoulin, J. Shlens, and M. Kudlur (2017) A Learned Representation for Artistic Style. ICLR. Cited by: §III-A.
  • [9] I. Durugkar, I. Gemp, and S. Mahadevan (2017) Generative Multi-Adversarial Networks. ICLR. Cited by: §II.
  • [10] D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. ICCV. Cited by: §III-B.
  • [11] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised Representation Learning by Predicting Image Rotations. ICLR. Cited by: §III-B.
  • [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Nets. NeurIPS. Cited by: §I.
  • [13] G. L. Grinblat, L. C. Uzal, and P. M. Granitto (2017) Class-Splitting Generative Adversarial Networks. arXiv:1709.07359v1. Cited by: §II, §IV-A.
  • [14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved Training of Wasserstein GANs. NeurIPS. Cited by: §II, §IV-A.
  • [15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NeurIPS. Cited by: §II.
  • [16] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167v3. Cited by: §III-A.
  • [17] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR. Cited by: §II, TABLE I, §IV-A.
  • [18] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan (2017) Perceptual Generative Adversarial Networks for Small Object Detection. CVPR. Cited by: §I.
  • [19] Y. Li, L. Song, X. Wu, R. He, and T. Tan (2018) Anti-Makeup: Learning A Bi-Level Adversarial Network for Makeup-Invariant Face Verification. AAAI. Cited by: §I.
  • [20] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2017) Least Squares Generative Adversarial Networks. ICCV. Cited by: §II.
  • [21] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral Normalization for Generative Adversarial Networks. ICLR. Cited by: §I, §II, §III-A, §IV-A, §IV-A, §IV-B, §IV-C, §IV-C, TABLE III, §IV.
  • [22] T. Miyato and M. Koyama (2018) cGANs with Projection Discriminator. ICLR. Cited by: §III-A, §IV-A, §IV-A, §IV-B, §IV-C, §IV-C, §IV.
  • [23] M. Noroozi and P. Favaro (2016) Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. ECCV. Cited by: §III-B.
  • [24] A. Odena, C. Olah, and J. Shlens (2017) Conditional Image Synthesis with Auxiliary Classifier GANs. ICML. Cited by: §II, TABLE I, §IV-A, §IV-C.
  • [25] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR. Cited by: §II, §IV-A.
  • [26] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative Adversarial Text to Image Synthesis. ICML. Cited by: §III-A.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. F. Li (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §I.
  • [28] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved Techniques for Training GANs. NeurIPS. Cited by: §II, §IV-A.
  • [29] T. Salimans and D. P. Kingma (2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NeurIPS. Cited by: §II.
  • [30] D. Warde-Farley and Y. Bengio (2017) Improving Generative Adversarial Networks with Denoising Feature Matching. ICLR. Cited by: §II.
  • [31] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie (2017) Stacked Generative Adversarial Networks. CVPR. Cited by: §IV-A.
  • [32] J. Yang, A. Kannan, D. Batra, and D. Parikh (2017) LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation. ICLR. Cited by: §II.