I Introduction
As one of the most exciting breakthroughs in unsupervised machine learning,
generative adversarial networks (GANs) [12] have been successfully applied to a variety of applications, such as face verification [19], human pose estimation [3], and small object detection [18]. In principle, GANs are trained in an adversarial manner: a generator produces new data by mimicking a targeted distribution, while a discriminator measures the similarity between the generated and targeted distributions, which in turn is used to adapt the generator. The quality of the generated data relies heavily on both the volume and the quality of the training data. For example, our experiments on GAN-based image generation and image-to-image translation show dramatic performance degradation when the number of high-quality training images is reduced. Figure
1 shows several mushroom images generated by SN-GAN [21] trained with 60% (top row) or 100% (bottom row) of the ImageNet training data [27]. The images in the bottom row, obtained with the entire training dataset, show more distinguishable appearances (e.g., the cap and stem of a mushroom) and much better quality. After removing 40% of the training data, the Inception score decreases from 21.1 to 14.8 and the FID increases from 90.4 to 141.2. The heavy demand for high-quality training data has emerged as a major challenge for GAN-based methods: it is very difficult, or even impossible, to collect sufficient data to produce satisfactory results in real-world applications.
To address this challenge, we propose an image generation method based on conditional transferring features (CTFs) with three key solutions. First, we construct the training data from a portion of the original high-quality images and a small number of low-quality images. Second, our method extracts the CTFs while learning to transform low-quality images into high-quality images. Third, we further enhance our method with additional label-free supervisory information observed from the training data.
II Related Work
Many GAN studies explore how to stabilize GAN training by modifying the network architecture [25, 14] or the optimization algorithm [1, 20]. The recent SN-GAN [21] stabilizes the discriminator of a GAN with a novel weight-normalization method. In this work, we propose an approach rarely considered in GAN-based methods: using low-quality training data to facilitate the generation of high-quality images.

GAN-based image generation methods [5, 28, 9, 7, 4, 32, 30, 13, 29, 15, 24, 17] tackle issues such as multi-resolution generation, observation of variations, architecture changes, energy estimation for samples, embedding recursive structures, integrating conditional information into GANs, and quality evaluation of generated images. Recently, BigGAN [2] significantly improved image synthesis quality by adding orthogonal regularization to the generator. Our GAN-based method can scale to thousands of classes with significantly less high-quality training data.
III Image Generation Based on CTFs
Figure 2 shows our proposed image generation method. Following traditional GAN-based methods, the design consists of a main generator and a discriminator; the discriminator is also used for self-supervision (SP) learning. In addition, we introduce a second generator for extracting CTFs. There are three learning tasks in our method:
- Task 1 adopts the main generator and the discriminator to produce images similar to the high-quality training images, using noise and the conditional transferring features as the main generator's input. The Conv layers in the main generator convolve each CTF together with the output of the previous layer at the same resolution.
- Task 2 (highlighted in blue) adopts the CTF generator and the discriminator to transform the low-quality images into images similar to the high-quality ones, and provides the extracted CTFs to the main generator. Noise is injected into each Res.+Unpool (ResBlock+Unpooling) block of the CTF generator to increase the randomness of the generated images.
- Task 3 (represented by dotted lines) uses the discriminator to distinguish the real sampled images from the synthetic images.
The adversarial tasks 2 and 3 are trained first to extract the CTFs, until no significant improvement is observed. Afterwards, tasks 1 and 3 are trained adversarially for image generation based on the CTFs.
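For concreteness, the two-phase schedule can be sketched as below. This is a minimal pseudo-PyTorch outline under our own assumptions: the module names (g_main, g_ctf, d), their call signatures, and the hinge-style losses are placeholders rather than the paper's implementation, and the data loaders are assumed to be infinite iterators over (images, labels) batches.

```python
import torch

def train_phase(g_main, g_ctf, d, hq_loader, lq_loader, steps, extract_ctfs):
    # Train either the CTF generator (phase 1) or the main generator (phase 2)
    # adversarially against the discriminator.
    g = g_ctf if extract_ctfs else g_main
    opt_g = torch.optim.Adam(g.parameters(), lr=2e-4, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(d.parameters(), lr=2e-4, betas=(0.0, 0.9))

    for _ in range(steps):
        x_hq, y_hq = next(hq_loader)
        x_lq, y_lq = next(lq_loader)

        if extract_ctfs:
            # Phase 1 (tasks 2 + 3): transform LQ images toward HQ images.
            fake, _ = g_ctf(x_lq, y_lq, y_hq)
        else:
            # Phase 2 (tasks 1 + 3): generate from noise and frozen CTFs.
            with torch.no_grad():
                _, ctfs = g_ctf(x_lq, y_lq, y_hq)
            z = torch.randn(x_hq.size(0), 128)
            fake = g_main(z, ctfs, y_hq)

        # Task 3: the discriminator separates real HQ images from synthetic ones.
        d_loss = (torch.relu(1.0 - d(x_hq, y_hq)).mean()
                  + torch.relu(1.0 + d(fake.detach(), y_hq)).mean())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Task 1 or 2: the active generator tries to fool the discriminator.
        g_loss = -d(fake, y_hq).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Phase 1 (extract CTFs), then phase 2 (image generation based on the CTFs):
# train_phase(g_main, g_ctf, d, hq_loader, lq_loader, steps, extract_ctfs=True)
# train_phase(g_main, g_ctf, d, hq_loader, lq_loader, steps, extract_ctfs=False)
```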

III-A Extracting CTFs
In the CTF generator, the ResBlock+Unpooling block is used to extract the conditional transferring features. Figure 3 shows the detailed extraction data flow (in green). The random noise is embedded and concatenated to the input of the block; this operator was described previously in [26, 21, 22]. We replace the batch-normalization [16] layers of the traditional ResBlock with two conditional batch-normalization (CBN) [8] layers, CBN_1 and CBN_2, in ResBlock+Unpooling. Both CBN layers are conditioned on the label information of the high-quality images, whose number of classes we denote by $C$. According to the CBN definition [8], an input activation $x$ of a CBN layer is transformed into a normalized activation $\hat{x}_c$ specific to a class $c$ as

$$\hat{x}_c = \gamma_c\,\frac{x-\mu}{\sigma} + \beta_c, \qquad (1)$$

where $\mu$ and $\sigma$ are respectively the mean and standard deviation taken across the spatial axes, and $\gamma_c$ and $\beta_c$ are trainable parameters specific to class $c$ of that layer. Thus, the trainable parameters of CBN_1 are $\gamma^{(1)}=\{\gamma^{(1)}_c\}_{c=1}^{C}$ and $\beta^{(1)}=\{\beta^{(1)}_c\}_{c=1}^{C}$; similarly, $\gamma^{(2)}$ and $\beta^{(2)}$ denote the trainable parameters across all classes of CBN_2.
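For illustration, a class-conditional batch-normalization layer of this form can be sketched in PyTorch as below. This is a minimal sketch based on the published CBN formulation in Equation (1), not the authors' implementation; the module name and the embedding-based lookup of per-class parameters are our own assumptions.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Batch norm whose scale (gamma) and shift (beta) are looked up per class,
    following Eq. (1): x_hat = gamma_c * (x - mu) / sigma + beta_c."""

    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        # Plain BN computes mu and sigma across batch and spatial axes; affine=False
        # because the class-conditional gamma/beta replace the usual affine terms.
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = self.bn(x)                                   # (x - mu) / sigma
        gamma = self.gamma(y).unsqueeze(-1).unsqueeze(-1)  # per-class scale
        beta = self.beta(y).unsqueeze(-1).unsqueeze(-1)    # per-class shift
        return gamma * out + beta
```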
The label information of both the low-quality and the high-quality images is concatenated to the feature differences between adjacent ResBlock+Unpooling blocks. The CTF of the $i$-th block is calculated by

$$\mathrm{CTF}_i = \mathrm{concat}\big(\Delta_i,\ \mathrm{concat}(y^{LQ}, \gamma, \beta)\big), \qquad (2)$$
where $\Delta_i$ is the aggregation of the difference maps between the feature maps $F_i$ and $F_{i-1}$ of the $i$-th and the $(i-1)$-th ResBlock+Unpooling blocks, respectively. Note that the number of feature maps in the $i$-th block might not equal that in the $(i-1)$-th block. More precisely, given a feature map $F_{i,j}$, the difference map between $F_{i,j}$ and each $F_{i-1,k}$ is calculated, and $\Delta_i$ is then obtained by aggregating all of these difference maps together. $y^{LQ}$ denotes the labels of the input low-quality images $x^{LQ}$, while $\gamma$ and $\beta$ carry the label information of the high-quality images. The class information of the low-quality and the high-quality images is first concatenated together before being concatenated to the difference maps $\Delta_i$. The class-conditional parameters of the CBN layer placed in front of the Unpooling layer (see Figure 3) are not used in Equation (2), because its resolution corresponds to $F_{i-1}$ rather than $F_i$. The feature maps $F_{i-1}$ are upsampled to the size of $F_i$ using bilinear interpolation. For the first ResBlock+Unpooling block ($i=1$), the previous feature maps are replaced by the gray-level version of the low-quality image $x^{LQ}$. The differences between a pair of feature maps are evaluated in a DCT-based frequency domain. When $i>1$, $\Delta_i$ is calculated as shown in Equation (3):

$$\Delta_i = \mathrm{Agg}_{j,k}\Big(\mathcal{T}^{-1}\big(\mathcal{T}(F_{i,j}) - \mathcal{T}(\mathrm{up}(F_{i-1,k}))\big)\Big), \qquad (3)$$
where $\mathcal{T}$ and $\mathcal{T}^{-1}$ are the DCT and inverse DCT transforms, $\mathrm{up}(\cdot)$ upsamples an image using bilinear interpolation, $\mathrm{gray}(\cdot)$ converts a color image into a gray-level image, and $\mathrm{Agg}(\cdot)$ aggregates the pairwise difference maps. When $i=1$, $\Delta_1$ is calculated as shown in Equation (4):
$$\Delta_1 = \mathrm{Agg}_{j}\Big(\mathcal{T}^{-1}\big(\mathcal{T}(F_{1,j}) - \mathcal{T}(\mathrm{up}(\mathrm{gray}(x^{LQ})))\big)\Big) \qquad (4)$$
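As a concrete illustration of Equations (3) and (4), the DCT-domain difference maps could be computed as in the sketch below. The function names, the use of SciPy's dctn/idctn, and the mean as the aggregation operator are our own assumptions; the paper only specifies a DCT-based difference followed by aggregation over all feature-map pairs.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import zoom


def difference_map(f_prev: np.ndarray, f_curr: np.ndarray) -> np.ndarray:
    """Difference between one pair of 2-D feature maps, evaluated in the DCT domain."""
    # Upsample the previous-block map to the current resolution (bilinear, order=1).
    scale = (f_curr.shape[0] / f_prev.shape[0], f_curr.shape[1] / f_prev.shape[1])
    f_prev_up = zoom(f_prev, scale, order=1)
    # Subtract in the frequency domain, then transform back to the spatial domain.
    return idctn(dctn(f_curr) - dctn(f_prev_up))


def aggregated_difference(feats_prev, feats_curr) -> np.ndarray:
    """Delta_i: aggregate the difference maps over all (j, k) feature-map pairs.

    feats_prev / feats_curr are lists of 2-D feature maps from blocks i-1 and i.
    For the first block (Eq. 4), feats_prev holds a single gray-level LQ image.
    """
    maps = [difference_map(fp, fc) for fc in feats_curr for fp in feats_prev]
    return np.mean(maps, axis=0)  # aggregation operator assumed to be a mean
```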
III-B Self-Supervision (SP) Learning
Popular self-supervision tasks [6, 11, 23] predict chromatic transformations, rotation, scaling, the relative position of image patches, and so on. Our self-supervision method is inspired by [10]. For any image sampled from the training data or produced by either generator, we randomly cut a patch from the image, paste it to a random location of the image, and record the bounding-box coordinates of the pasted patch. The self-supervision loss shown in Figure 2 is the difference between the recorded coordinates and those predicted by the discriminator.
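A minimal version of this cut-paste self-supervision signal is sketched below, assuming a hypothetical coordinate-regression head d_bbox on the discriminator; the patch size, the normalized box encoding, and the MSE loss are our own choices, since the paper only states that the loss measures the difference between the recorded and predicted coordinates.

```python
import random
import torch
import torch.nn.functional as F

def cut_paste(img: torch.Tensor, patch_frac: float = 0.25):
    """Cut a random patch from img (C, H, W), paste it at a random location,
    and return the edited image plus the normalized pasted-box coordinates."""
    c, h, w = img.shape
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    sy, sx = random.randint(0, h - ph), random.randint(0, w - pw)   # source location
    patch = img[:, sy:sy + ph, sx:sx + pw].clone()
    dy, dx = random.randint(0, h - ph), random.randint(0, w - pw)   # paste location
    edited = img.clone()
    edited[:, dy:dy + ph, dx:dx + pw] = patch
    target = torch.tensor([dy / h, dx / w, (dy + ph) / h, (dx + pw) / w])
    return edited, target

def self_supervision_loss(d_bbox, img: torch.Tensor) -> torch.Tensor:
    # d_bbox is an assumed bounding-box regression head of the discriminator.
    edited, target = cut_paste(img)
    pred = d_bbox(edited.unsqueeze(0)).squeeze(0)
    return F.mse_loss(pred, target)
```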
TABLE I: Conditional image generation on CIFAR-10 and STL-10.

| Method | CIFAR-10 training data | CIFAR-10 Inception score | STL-10 training data | STL-10 Inception score |
|---|---|---|---|---|
| AC-GAN [24] | 50,000 (CIFAR-10 training set) | | 5,000 (STL-10 training set) | - |
| PROG-GAN [17] | 50,000 (CIFAR-10 training set) | | 5,000 (STL-10 training set) | |
| SN-GAN+Projection | 50,000 (CIFAR-10 training set) | | 5,000 (STL-10 training set) | |
| Ours (Fewer HQ) | 32,500+10,000 | | 3,250+1,000 | |
| Ours (Entire HQ) | 50,000+10,000 | | 5,000+1,000 | |
TABLE II: Conditional image generation on the 3,755-class CASIA-HWDB1.0 dataset.

| Method | Training data | Inception score |
|---|---|---|
| SN-GAN+Projection | 1,246,991 | |
| Ours (Fewer HQ) | 880,544 (810,544+70,000) | |
| Ours (Entire HQ) | 1,316,991 (1,246,991+70,000) | |
IV Experiments
TABLE III: Conditional image generation on ImageNet.

| Method | Training data | Inception score |
|---|---|---|
| AC-GAN [7] | 1,282,167 (ImageNet training set) | |
| SN-GAN [21] | 1,282,167 (ImageNet training set) | |
| SN-GAN+Projection | 1,282,167 (ImageNet training set) | |
| Ours (Fewer HQ) | 893,408 (833,408+60,000) | |
| Ours (Entire HQ) | 1,342,167 (1,282,167+60,000) | |
Our main generator has an architecture similar to the generator adopted in SN-GAN+Projection [21, 22], with two differences. First, our main generator has additional Conv layers for convolving with the CTFs provided by the CTF generator. Second, our main generator uses regular BN (instead of the CBN used in SN-GAN+Projection), while our CTF generator uses CBN.
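The extra Conv layers that mix a CTF into the main generator can be sketched as follows; the concatenate-then-convolve layout, the module name, and the channel sizes are assumptions on our part, chosen to match the description that each Conv layer convolves the CTF together with the same-resolution output of the previous layer.

```python
import torch
import torch.nn as nn

class CTFInjection(nn.Module):
    """Fuse a conditional transferring feature with the generator's
    same-resolution activation via a 3x3 convolution."""

    def __init__(self, feat_channels: int, ctf_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels + ctf_channels, feat_channels,
                              kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, ctf: torch.Tensor) -> torch.Tensor:
        # feat and ctf are assumed to share the same spatial resolution (H, W).
        return self.conv(torch.cat([feat, ctf], dim=1))
```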
IV-A Conditional Image Generation on CIFAR-10 and STL-10
Table I compares the quality of image generation by previous methods [25, 28, 24, 31, 14, 13, 17, 21, 22] with ours. All the previous approaches use the entire CIFAR-10 training set (50,000 images). Our training data is a mixture of high-quality (HQ) images sampled from the CIFAR-10 or STL-10 training set and low-quality (LQ) images. Since CIFAR-10 and STL-10 are already among the “simplest” datasets, we use their down-sampled versions as LQ images. For comparison purposes, Ours (Fewer HQ) uses 32,500 CIFAR-10 images and 10,000 down-sampled images as training data, and Ours (Entire HQ) uses the entire CIFAR-10 training set plus 10,000 down-sampled images. Following the popular testing protocol [14, 13, 21], we scale all the generated images to 32×32 for CIFAR-10 classes and 48×48 for STL-10 classes.
The experiment shows that Ours (Fewer HQ), despite using fewer training data, slightly outperforms the previous methods. Using the entire CIFAR-10 or STL-10 training set further improves the image quality of our method: Ours (Entire HQ) is respectively 1.7% and 2.7% better in Inception score than the previous best, SN-GAN+Projection [21, 22].
IV-B Conditional Image Generation on 3,755-Class CASIA-HWDB1.0
To further validate the scalability over object classes, we compare the generation of the 3,755 classes of CASIA-HWDB1.0 Chinese characters using our method and SN-GAN+Projection [21, 22]. SN-GAN+Projection adopts the entire CASIA-HWDB1.0 training set (1,246,991 images) as training data. Our training data uses 810,544 CASIA-HWDB1.0 training images as HQ images and 70,000 MNIST handwritten-digit images as LQ images. The total size of our training data (880,544 images) is 29.4% smaller than the entire CASIA-HWDB1.0 training set. The resolution of the generated images is set to 48×48, the same as the original CASIA-HWDB1.0 dataset.
The quantitative comparison in Table II shows that both Ours (Fewer HQ) and Ours (Entire HQ) produce higher-quality Chinese characters across the 3,755 CASIA-HWDB1.0 classes than SN-GAN+Projection. The quality gap between SN-GAN+Projection and Ours (Fewer HQ) is larger than the gap in Table I, which suggests that our advantage grows with the number of image classes.

IV-C Conditional Image Generation on ImageNet
We use our method for conditional image generation on the ImageNet classes and compare it to AC-GAN [7], SN-GAN [21], and SN-GAN+Projection [21, 22]. The three previous GANs are trained on the entire ImageNet training set (1,282,167 images). The training data of Ours (Fewer HQ) contains 833,408 ImageNet training images as HQ images and 60,000 CIFAR-100 images as LQ images; thus, the total training set of Ours (Fewer HQ) is 30.3% smaller than the entire ImageNet training set used by the previous methods [24, 21, 22]. The resolution of the generated images is set to 128×128 for comparison with previous methods.
Table III summarizes the comparison with previous methods, and Figure 4 shows examples of images generated by Ours (Entire HQ). Ours (Fewer HQ) outperforms the previous methods even though it uses 30.3% less training data. Using the entire ImageNet training set together with the CIFAR-100 images, Ours (Entire HQ) is 17.0% better in Inception score than the previous best, SN-GAN+Projection [21, 22].
V Conclusion
Previous GAN-based image generation methods depend heavily on high-quality training data, whereas collecting low-quality images is relatively easier and more economical. By observing the learning process of transforming low-quality images into high-quality images, we find that certain intermediate outputs combined with class information, which we call conditional transferring features (CTFs), can be adopted to improve both the quality of image generation and the scalability of GANs across object classes. Moreover, we integrate self-supervision learning into our GAN architecture to further improve its learning ability. Experiments on conditional image generation show that our method outperforms previous methods even when 30% of the high-quality training data is removed, and it successfully scales GANs to thousands of object classes, such as the 1,000 ImageNet classes and the 3,755 CASIA-HWDB1.0 classes.
References
- [1] (2017) Wasserstein GAN. arXiv:1701.07875.
- [2] (2019) Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR.
- [3] (2017) Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation. ICCV.
- [4] (2017) Calibrating Energy-Based Generative Adversarial Networks. ICLR.
- [5] (2015) Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. NeurIPS.
- [6] (2014) Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. NeurIPS.
- [7] (2017) Adversarially Learned Inference. ICLR.
- [8] (2017) A Learned Representation for Artistic Style. ICLR.
- [9] (2017) Generative Multi-Adversarial Networks. ICLR.
- [10] (2017) Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. ICCV.
- [11] (2018) Unsupervised Representation Learning by Predicting Image Rotations. ICLR.
- [12] (2014) Generative Adversarial Nets. NeurIPS.
- [13] (2017) Class-Splitting Generative Adversarial Networks. arXiv:1709.07359.
- [14] (2017) Improved Training of Wasserstein GANs. NeurIPS.
- [15] (2017) GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NeurIPS.
- [16] (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.
- [17] (2018) Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR.
- [18] (2017) Perceptual Generative Adversarial Networks for Small Object Detection. CVPR.
- [19] (2018) Anti-Makeup: Learning a Bi-Level Adversarial Network for Makeup-Invariant Face Verification. AAAI.
- [20] (2017) Least Squares Generative Adversarial Networks. ICCV.
- [21] (2018) Spectral Normalization for Generative Adversarial Networks. ICLR.
- [22] (2018) cGANs with Projection Discriminator. ICLR.
- [23] (2016) Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. ECCV.
- [24] (2017) Conditional Image Synthesis with Auxiliary Classifier GANs. ICML.
- [25] (2016) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR.
- [26] (2016) Generative Adversarial Text to Image Synthesis. ICML.
- [27] (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV.
- [28] (2016) Improved Techniques for Training GANs. NeurIPS.
- [29] (2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NeurIPS.
- [30] (2017) Improving Generative Adversarial Networks with Denoising Feature Matching. ICLR.
- [31] (2017) Stacked Generative Adversarial Networks. CVPR.
- [32] (2017) LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation. ICLR.