Generative adversarial networks (GANs) based on deep convolutional neural networks (CNNs) have shown considerable success in capturing complex and high-dimensional image data, and have been utilized in numerous applications including image-to-image translation [12, 4, 36, 32, 26, 24] and text-to-image translation [23, 11]. Despite the recent advances, however, the training of GANs is known to be unstable and sensitive to the choice of hyper-parameters. To address this problem, some researchers proposed novel generator and discriminator structures [13, 33, 35]. These methods effectively improve the image generation task on challenging datasets such as ImageNet, but are difficult to apply to various other applications since they impose training overhead or require modifying the network architectures.
Other studies attempted to stabilize the training of GANs by using novel loss functions or regularization terms. Arjovsky et al. applied the Wasserstein distance to the adversarial loss function, which shows better training stability than the original loss function. Gulrajani et al. extended this method by adding a gradient regularization term, called the gradient penalty, to further stabilize the training procedure. Miyato et al. proposed a weight normalization technique called spectral normalization, which limits the spectral norm of the weight matrices to stabilize the training of the discriminator. Combined with the projection-based discriminator, this approach significantly improved the performance of the image generation task on ImageNet.
Recently, Chen et al. combined GANs with self-supervised learning by adding an auxiliary loss function. Although this method improves the performance of GANs, it needs an additional task-specific network and objective functions for self-supervised learning, which results in an extra computational load during training. Mao et al. proposed a simple regularization term which maximizes the ratio of the distance between images with respect to the distance between their latent vectors. This technique imposes no training overhead and does not require modifying the network structure, which makes it readily applicable to various applications.
Inspired by this method, this paper presents a simple yet effective way to greatly improve the performance of GANs without modifying the original network architectures or imposing training overhead. In general, the discriminator of GANs extracts features using multiple convolutional layers and predicts whether the input image is real or fake using a fully connected layer which produces a single scalar value, i.e., a probability value. Indeed, the operation of the fully connected layer predicting the single probability value is equivalent to an inner product: to predict the probability, this layer computes the inner product between a single embedding vector, i.e., a weight vector, and an image feature vector obtained via the CNN. In this inner product, however, the discriminator unintentionally ignores the part of the feature space which is perpendicular to the weight vector. Since the generator is trained through adversarial learning, which focuses on deceiving the discriminator, it produces images without considering the ignored feature space. For instance, if the discriminator first learns the global structure for distinguishing between real and generated images, the generator will naturally attempt to produce images having a global structure similar to that of real images without considering the local structure. In other words, the generator fails to fully capture the complex and high-dimensional feature space of the image data.
To alleviate this problem, we propose a novel cascading rejection (CR) module which extracts different features in an iterative procedure. The CR module leads the discriminator to effectively distinguish between real and generated images, which results in a strong penalization of the generator. In order to deceive the robust discriminator equipped with the CR module, the generator must produce images that are more similar to the real images. Since the proposed CR module needs only a few simple vector operations, it can be readily applied to existing frameworks with marginal training overhead. We conducted extensive experiments on various datasets including CIFAR-10, Celeb-HQ [13, 16], LSUN, and tiny-ImageNet [5, 30]. Experimental results show that the proposed method significantly improves the performance of GANs and conditional GANs in terms of the Fréchet inception distance (FID), which reflects the diversity and visual appearance of the generated images.
In summary, in this paper we present:
- A simple but effective technique for improving the performance of GANs without imposing training overhead or modifying the network structure.
- A novel CR module which guides the discriminator to consider the non-overlapping features of the images when distinguishing between real and generated images. By strongly penalizing the generator through a discriminator equipped with the CR module, the proposed method significantly improves the performance of GANs and conditional GANs in terms of the FID.
2.1 Generative adversarial networks
Typically, GANs consist of a generator G and a discriminator D. Both networks are trained simultaneously: G is trained to create new images which are indistinguishable from real images, whereas D is optimized to differentiate between real and generated images. This relation can be considered as a two-player min-max game in which G and D compete with each other. Formally, G (D) is trained to minimize (maximize) the loss function, called the adversarial loss, as follows:

\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \quad (1)

where z and x denote a random noise vector and a real image sampled from the noise distribution p_z(z) and the real data distribution p_{data}(x), respectively. It is worth noting that D(G(z)) and D(x) are scalar values indicating the probabilities that G(z) and x came from the data distribution.
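As a minimal numerical sketch (not the authors' implementation), the value of this min-max objective can be estimated from batches of discriminator outputs; the function name and batch values below are illustrative:

```python
import numpy as np

# Minimal sketch of the adversarial objective:
# D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; G minimizes the second term.
def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Estimate the objective from batches of discriminator outputs on
    real images (d_real) and generated images (d_fake)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A near-perfect discriminator drives the objective toward 0,
# whereas a chance-level one yields 2*log(0.5).
near_perfect = adversarial_loss(np.full(4, 0.99), np.full(4, 0.01))
chance_level = adversarial_loss(np.full(4, 0.5), np.full(4, 0.5))
```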
Conditional GANs (cGANs), which aim at producing class-conditional images, have been actively researched [18, 21, 19, 34]. cGAN techniques usually add conditional information y, such as a class label or text condition, to both the generator and the discriminator in order to control the data generation process in a supervised manner. This can be formally expressed as follows:

\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))]. \quad (2)
By training the networks based on the above equation, the generator can select an image category to be generated, which is not possible when employing the standard GANs framework.
2.2 Revisiting the Fully Connected Layer
To train the discriminator using Eqs. 1 and 2, the discriminator should produce a single scalar value as its output. To this end, in the last layer, the discriminator usually employs a fully connected layer with a single output channel, which acts as an inner product between an embedding vector w, i.e., a weight vector, and an image feature vector v obtained through the multiple convolutional layers. Even if the last layer consists of several pixels, as in PatchGAN, the discriminator conducts the inner product for each pixel and averages all values for the adversarial loss function.
The inner product between v and w is illustrated in Fig. 1. As shown in Fig. 1, the inner product produces a scalar value, but it ignores the part of the feature space which is perpendicular to w. In other words, the discriminator only considers the feature subspace parallel to w when predicting the probability value for the adversarial loss. This often makes it difficult for the discriminator to effectively penalize the generator. For instance, as shown in Fig. 2, even if the generator produces low-quality images lying in the ignored feature space, the discriminator cannot distinguish between real and generated images. In other words, the generator can minimize the adversarial loss by producing low-quality images in the ignored feature space, which results in performance degradation of the generator. To alleviate this problem, this paper proposes the CR module, which encourages the discriminator to consider the ignored feature space in the last layer.
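To make the ignored subspace concrete, the following hedged sketch (with illustrative vector sizes and names) verifies numerically that perturbing v along a direction orthogonal to w leaves the single-output fully connected layer's prediction unchanged:

```python
import numpy as np

# A single-output fully connected layer computes p = <v, w>, so any
# component of v orthogonal to w cannot change p.
rng = np.random.default_rng(0)
w = rng.normal(size=8)   # embedding (weight) vector of the last layer
v = rng.normal(size=8)   # feature vector from the convolutional layers

# Build a vector orthogonal to w via vector rejection, then perturb v with it.
v_perp = v - (v @ w) / (w @ w) * w
p_before = v @ w
p_after = (v + 5.0 * v_perp) @ w  # large change in the ignored subspace
```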
3 Proposed Method
3.1 Cascading rejection module
In Euclidean space, the fully connected layer producing a single scalar value as an output is equivalent to the inner product P(v, w) as follows:

P(v, w) = \langle v, w \rangle = \sum_i v_i w_i, \quad (3)

where v and w indicate the input feature vector and the embedding vector, i.e., the weight vector, of the last fully connected layer, respectively. From this formulation, we observe that the feature ignored by the inner product, v̂, can be obtained by the vector rejection of v from w, which is defined as follows:

\hat{v} = v - \frac{\langle v, w \rangle}{\langle w, w \rangle}\, w. \quad (4)
In other words, by minimizing the adversarial loss using an additional probability value, obtained through the inner product of v̂ and another weight vector ŵ, the discriminator is able to consider the ignored feature space.
Based on these observations, we propose the CR module, which iteratively conducts the inner product and vector rejection processes. Fig. 2 illustrates the proposed CR module, where v indicates the input feature vector of the CR module, which is obtained through the multiple convolutional layers in the discriminator. The iterative vector rejection process generates vectors v̂_i, which represent the feature ignored by the previous inner product operation, whereas the iterative inner product produces scalar values p_i, which indicate the probabilities that v came from the real data distribution. Note that the v̂_i are non-overlapping with each other since they are obtained through the vector rejection operation. By using the probabilities obtained via the CR module, the adversarial loss of the discriminator and that of the generator can be rewritten as

\mathcal{L}_D = -\sum_{i=1}^{N} \lambda_i \left( \mathbb{E}_{x \sim p_{data}}[\log p_i] + \mathbb{E}_{z \sim p_z}[\log(1 - \hat{p}_i)] \right), \quad (5)

\mathcal{L}_G = \sum_{i=1}^{N} \lambda_i \, \mathbb{E}_{z \sim p_z}[\log(1 - \hat{p}_i)], \quad (6)
where p_i (p̂_i) indicates the i-th probability value when the input of the discriminator is a real (generated) image, and λ_i is a hyper-parameter which controls the relative importance of each loss term. In order to predict the probability using the more essential features in the earlier stages of the CR module, we set λ_i as follows:
When N is one, the loss functions in Eqs. 5 and 6 are equivalent to the original one in Eq. 1. In contrast, when N is larger than one, the discriminator and generator should consider the ignored feature space to minimize L_D and L_G, respectively. It is worth noting that since the iterative inner product and vector rejection processes are simple vector operations, the proposed CR module imposes little training overhead. In addition, since the proposed CR module is appended after the last fully connected layer of the discriminator, no modification of the existing discriminator architecture is required.
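The iterative procedure above can be sketched as follows. This is an illustrative reimplementation, not the authors' code, under the assumption that each stage i holds its own weight vector w_i:

```python
import numpy as np

def cascading_rejection(v, weights):
    """Sketch of the CR module's forward pass (names are illustrative).

    At each stage, a logit p_i = <v_i, w_i> is produced, and the component
    of v_i parallel to w_i is rejected, so the next stage only sees the
    feature subspace ignored so far.
    """
    logits = []
    for w in weights:
        logits.append(v @ w)
        v = v - (v @ w) / (w @ w) * w  # vector rejection of v from w
    return logits

rng = np.random.default_rng(1)
v0 = rng.normal(size=16)
ws = [rng.normal(size=16) for _ in range(3)]  # N = 3 stages
logits = cascading_rejection(v0, ws)
```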
3.2 Conditional Cascading Rejection module
In this subsection, we introduce the CR module for cGANs, called the conditional cascading rejection (cCR) module. Among the various cGAN frameworks [18, 23, 25, 22, 21, 20], we design the cCR module based on the conditional projection discriminator, which shows superior performance to other existing discriminators for cGANs. As depicted in Fig. 4, the conditional projection discriminator takes an inner product between the embedded condition vector w_y, which differs depending on the given condition, and the feature vector of the discriminator, i.e., v, so as to impose a regularity condition. Based on this regularity condition, like the standard discriminator, the conditional projection discriminator predicts the conditional probability P(v, w, w_y) as follows:

P(v, w, w_y) = \langle v, w + w_y \rangle = \langle v, w \rangle + \langle v, w_y \rangle. \quad (7)
Based on the above equation, we design the cCR module by replacing the w in the CR module with (w + w_y), as shown in Fig. 5. More specifically, the i-th vector rejection process of the cCR module can be expressed as follows:

\hat{v}_i = v_i - \frac{\langle v_i, w_i + w_y \rangle}{\langle w_i + w_y, w_i + w_y \rangle}\,(w_i + w_y). \quad (8)
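A single cCR stage can then be sketched as below, assuming each stage's weight vector is replaced by (w + w_y) as described in the text; all names and sizes are illustrative:

```python
import numpy as np

def ccr_stage(v, w, w_y):
    """Sketch of one conditional CR stage: the rejection direction is the
    condition-dependent vector (w + w_y) rather than w alone."""
    u = w + w_y                          # condition-dependent direction
    logit = v @ u                        # conditional probability logit
    v_next = v - (v @ u) / (u @ u) * u   # vector rejection from (w + w_y)
    return logit, v_next

rng = np.random.default_rng(2)
v = rng.normal(size=8)
w, w_y = rng.normal(size=8), rng.normal(size=8)
logit, v_next = ccr_stage(v, w, w_y)
```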
4.1 Implementation details
In order to evaluate the effectiveness of the CR module, we conducted extensive experiments using the CIFAR-10, LSUN, Celeb-HQ [13, 16], and tiny-ImageNet [5, 30] datasets. The CIFAR-10 and LSUN datasets consist of 10 classes, whereas tiny-ImageNet [5, 30], which is a subset of ImageNet, is composed of 200 classes. Among the large number of images in the LSUN dataset, we randomly selected 30,000 images for each class. In addition, we resized the images from the Celeb-HQ, LSUN, and tiny-ImageNet datasets to a fixed resolution. For the objective function, we adopted the hinge version of the adversarial loss. The hinge version of the loss with the CR module is defined as follows:

\mathcal{L}_D = \sum_{i=1}^{N} \lambda_i \left( \mathbb{E}_{x}[\max(0, 1 - p_i)] + \mathbb{E}_{z}[\max(0, 1 + \hat{p}_i)] \right), \quad (10)

\mathcal{L}_G = -\sum_{i=1}^{N} \lambda_i \, \mathbb{E}_{z}[\hat{p}_i]. \quad (11)
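Under the assumption that the hinge terms of each CR stage are weighted by λ_i and summed, the losses can be sketched as follows (illustrative names and per-stage logit batches):

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits, lambdas):
    """Weighted sum of per-stage hinge losses for the discriminator."""
    loss = 0.0
    for p, q, lam in zip(real_logits, fake_logits, lambdas):
        loss += lam * (np.mean(np.maximum(0.0, 1.0 - p)) +
                       np.mean(np.maximum(0.0, 1.0 + q)))
    return loss

def g_hinge_loss(fake_logits, lambdas):
    """The generator maximizes the weighted fake logits."""
    return -sum(lam * np.mean(q) for q, lam in zip(fake_logits, lambdas))

real = [np.array([2.0]), np.array([1.5])]    # per-stage real logits, N = 2
fake = [np.array([-2.0]), np.array([-1.5])]  # per-stage fake logits
lams = [1.0, 0.5]
```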
Since all parameters in the generator and the discriminator, including the CR module, are differentiable, we performed optimization using the Adam optimizer, setting β1 and β2 to 0 and 0.9, respectively, and the learning rate to 0.0002. During the training procedure, we updated the discriminator five times per each update of the generator. For the CIFAR-10 dataset, we used a batch size of 64 and trained the generator for 100k iterations, whereas the generators for the Celeb-HQ and LSUN datasets were trained for 100k iterations with a batch size of 32. For tiny-ImageNet, we set the batch size to 64 and trained the generator for 400k iterations (about 100 epochs). Our experiments were conducted on an Intel Xeon E3-1245 v5 CPU and an RTX 2080 Ti GPU, and implemented in TensorFlow.
4.2 Baseline models
In this work, we employed the generator and discriminator architectures of the leading cGAN schemes [20, 19] as our baseline models. The detailed architectures of the two models for the two image resolutions are presented in Tables 1 and 2, respectively. Both models employ multiple residual blocks (ResBlocks), as depicted in Fig. 6. In the ResBlocks of the discriminator, we applied spectral normalization to all layers, including the proposed CR module. In the discriminator, down-sampling (average pooling) is performed after the second convolutional layer, whereas the generator up-samples the feature maps using nearest-neighbor interpolation prior to the first convolutional layer.
Table 3 (excerpt): FID scores (lower is better); the last three columns report cGAN results, which are not applicable to Celeb-HQ.
- Celeb-HQ [13, 16]: 11.81, 11.11, 8.49; -, -, -
- tiny-ImageNet [5, 30]: 39.12, 38.37, 36.57; 32.15, 31.75, 29.69
4.3 Comparison of Sample Quality
Evaluation metrics. To evaluate the performance of the generator, we employed a principled and comprehensive metric, the Fréchet inception distance (FID), which measures the visual appearance and diversity of the generated images. The FID is obtained by calculating the Wasserstein-2 distance between the distribution of the real images and that of the generated ones in the feature space of the Inception model, which is defined as follows:

\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the samples from the real and generated data distributions, respectively. Lower FID scores indicate better quality of the generated images. There is an alternative approximate measure of image quality, called the inception score (IS); however, since the IS has some flaws, as mentioned in [2, 3], we employed the FID as the main metric in this paper.
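The FID between two Gaussians can be computed from their means and covariances as sketched below; this is an illustrative NumPy-only version in which the matrix square root is computed by eigendecomposition, assuming symmetric positive semi-definite covariances:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    s2h = _sqrtm_psd(sigma2)
    # Tr((S1 S2)^{1/2}) = Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for PSD matrices
    tr_covmean = np.trace(_sqrtm_psd(s2h @ sigma1 @ s2h))
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean

# Identical Gaussians have distance 0; shifting the mean by d adds ||d||^2.
mu, sigma = np.zeros(3), np.eye(3)
```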
Results. To demonstrate the advantage of the CR module, we conducted extensive experiments while adjusting the value of N. In our experiments, we randomly generated 50,000 images for the CIFAR-10, LSUN, and tiny-ImageNet datasets and 30,000 images for the Celeb-HQ dataset. Table 3 summarizes the performance of the proposed method; the bold numbers represent the best performance among the results. As shown in Table 3, the CR module significantly improves the performance of GANs. In addition, Fig. 7 shows the FID results on the tiny-ImageNet dataset during the training procedure. As shown in Fig. 7, the proposed method outperforms standard GANs throughout training. These results reveal that the proposed method effectively improves GAN performance by considering the features ignored by the discriminator. Thus, we confirmed that, by strongly penalizing the generator through the CR module, the proposed method leads the generator to produce images that are more similar to the real images, which results in low FID scores.
Moreover, to demonstrate the validity of the cCR module, we conducted additional cGAN experiments on the CIFAR-10, LSUN, and tiny-ImageNet datasets, excluding the Celeb-HQ dataset, which does not contain conditional information, i.e., class labels. We employed the same baseline models as in the GAN experiments, but replaced the batch normalization (BN) in the generator with conditional batch normalization layers; likewise, the CR module in the discriminator was replaced with the cCR module. As shown in Table 3, the proposed method shows better performance than the baseline cGANs, which reveals that the performance of cGANs can be improved by applying the cCR module to the cGAN framework. We notice that the cCR module with a larger N shows worse performance than that with a smaller N on datasets having a small number of classes, such as CIFAR-10 and LSUN, whereas a larger N shows better performance on the tiny-ImageNet dataset consisting of 200 classes. These results indicate that a small N is enough to penalize the generator when training on datasets having a small number of classes. In other words, to achieve fine performance, the N of the cCR module should be adjusted depending on the number of given conditions.
Fig. 9 shows example images generated on the Celeb-HQ, LSUN, and tiny-ImageNet datasets. As depicted in Fig. 9, the proposed method allows the generator to produce visually pleasing images. In addition, the proposed method requires only a small number of additional network parameters, proportional to the dimension of the last layer of the discriminator, which is very small compared to the overall number of discriminator parameters. Thus, the proposed CR module can be added to the discriminator with a marginal training overhead. It is worth noting that this work does not intend to design optimal generator and discriminator architectures for the CR and cCR modules; other structures could lead to better performance and generate higher-quality images. Instead, we care more about whether it is possible to improve the performance of the GAN and cGAN frameworks by simply adding the CR and cCR modules, respectively, to the discriminator.
4.4 Image-to-image translation with CR module
To demonstrate the generalization ability of the proposed method, we applied the CR module to an image-to-image translation scheme. We selected CycleGAN, one of the state-of-the-art frameworks for image-to-image translation with unpaired training data, as the baseline framework. The detailed architectures are described in Table 4. Note that we did not conduct global pooling on the last layer of the discriminator. Instead, in the CR module, we predicted probabilities for each pixel using a convolutional layer, which is equivalent to a per-pixel fully connected layer. In addition, in the ResBlocks of the generator, we employed instance normalization (IN) instead of BN.
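The per-pixel prediction described above rests on the equivalence between a 1×1 convolution with a single output channel and a per-pixel inner product, which the following sketch (with illustrative shapes) checks numerically:

```python
import numpy as np

# A 1x1 convolution with one output channel is a per-pixel inner product,
# so the CR module can operate at every spatial location of a
# PatchGAN-style discriminator output.
rng = np.random.default_rng(3)
feat = rng.normal(size=(4, 4, 8))  # H x W x C feature map
w = rng.normal(size=8)             # 1x1 conv kernel, single output channel

conv1x1 = np.einsum('hwc,c->hw', feat, w)  # per-pixel logits in one shot
manual = np.array([[feat[i, j] @ w for j in range(4)] for i in range(4)])
```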
In our experiments, we evaluated the performance on the photo-to-Monet dataset used in the CycleGAN work. We resized the images to a fixed resolution and trained the generator for 100k iterations using the hinge version of the adversarial loss in Eqs. 10 and 11. Experimental results are presented in Fig. 10 and Table 5. As shown in Table 5, the proposed method significantly improves the FID score, which demonstrates that the proposed CR module can be applied to discriminators having several pixels in the last layer, such as PatchGAN. Thus, we confirmed that the proposed CR module can be easily utilized to improve the performance of other GAN-based applications.
In this paper, we have introduced a straightforward method for improving the performance of GANs. By using the non-overlapping features obtained via the proposed CR module, the discriminator effectively penalizes the generator during training, which in turn improves the performance of the generator. One of the main advantages of the CR module is that it can be readily integrated into existing discriminator architectures. Moreover, our experiments reveal that, without imposing training overhead, a discriminator equipped with the CR module significantly improves the performance of the baseline models. In addition, the generalization ability of the proposed method is demonstrated by applying the CR module to cGANs and image-to-image translation frameworks. We expect the proposed method to be applicable to various GAN-based applications.
-  (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1.
-  (2018) A note on the inception score. arXiv preprint arXiv:1801.01973. Cited by: §4.3.
-  (2019) Self-supervised gans via auxiliary rotation loss. In , pp. 12154–12163. Cited by: §1, §1, §4.3.
-  (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §1.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §4.1, Table 3.
-  (2017) A learned representation for artistic style. Proc. of ICLR 2. Cited by: §4.3.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.1.
-  (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.3.
-  (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994. Cited by: §1.
-  (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §1, §2.2, §4.4.
-  (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1, §1, §4.1, Table 3.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §1.
-  (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738. Cited by: §1, §4.1, Table 3.
-  (2019) Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1429–1437. Cited by: §1, §1, §1.
-  (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.1, §3.2.
-  (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §1, §2.1, §4.2.
-  (2018) CGANs with projection discriminator. arXiv preprint arXiv:1802.05637. Cited by: §1, Figure 4, Figure 5, §3.2, §4.2, §4.3, Table 1, Table 2, Table 3.
-  (2017) Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. Cited by: §1, §2.1, §3.2.
-  (2016) Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583. Cited by: §3.2.
-  (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §1, §3.2.
-  (2019) Pepsi: fast image inpainting with parallel decoding network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11360–11368. Cited by: §1.
-  (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §3.2.
-  (2019) PEPSI++: fast and lightweight network for image inpainting. arXiv preprint arXiv:1905.09010. Cited by: §1.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.3.
-  (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence 30 (11), pp. 1958–1970. Cited by: §1, §4.1, Table 3.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.4.
-  (2015) Tiny imagenet classification with convolutional neural networks. CS 231N. Cited by: §1, §4.1, Table 3.
-  (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §1, §4.1, Table 3.
-  (2018) Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589. Cited by: §1.
-  (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1.
-  (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915. Cited by: §2.1.
-  (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1947–1962. Cited by: §1.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, Figure 10, §4.4, §4.4, Table 5.