GIU-GANs: Global Information Utilization for Generative Adversarial Networks

01/25/2022, by Yongqi Tian, et al. (Beijing Institute of Technology)

In recent years, with the rapid development of artificial intelligence, image generation based on deep learning has advanced dramatically. Image generation based on Generative Adversarial Networks (GANs) is a promising line of study. However, because convolution is spatial-agnostic and channel-specific, the features extracted by traditional convolution-based GANs are constrained, and GANs cannot capture finer per-image detail. On the other hand, straightforwardly stacking convolutions produces too many parameters and layers in GANs, which leads to a high risk of overfitting. To overcome these limitations, we propose new GANs called Global Information Utilization Generative Adversarial Networks (GIU-GANs). GIU-GANs leverage a new module called the Global Information Utilization (GIU) module, which integrates Squeeze-and-Excitation Networks (SENet) and involution to focus on global information through a channel attention mechanism, leading to higher-quality generated images. Meanwhile, Batch Normalization (BN) inevitably ignores the representation differences among the noise samples drawn by the generator and thus degrades the generated image quality, so we introduce Representative Batch Normalization (RBN) into the GANs architecture to address this issue. The CIFAR-10 and CelebA datasets are employed to demonstrate the effectiveness of the proposed model. Extensive experiments show that our model achieves competitive, state-of-the-art performance.


1 Introduction

Image generation is a vital and challenging problem in Computer Vision. There has been remarkable progress in this field with the development of deep learning. In 2014, the application of Generative Adversarial Networks (GANs) DBLP:journals/corr/GoodfellowPMXWOCB14 to image generation was reported with promising results. GANs based on deep Convolutional Neural Networks (CNNs) DBLP:journals/corr/RadfordMC15 then elevated the quality of generated images to a higher level, and CNNs have since been the core ingredient of GANs. However, convolution kernels have two notable properties, namely spatial-agnosticism and channel-specificity DBLP:journals/corr/abs-2103-06255, that contribute to their popularity yet bring some issues.

Spatial-agnostic means that a convolution kernel produces the same output regardless of its location in an image. As is well known, the size of a convolution kernel is $C_o \times C_i \times K \times K$, where $C_i$ and $C_o$ are the numbers of input and output channels, respectively, and $K$ is the kernel size. Owing to the spatial-agnostic property, a convolution kernel has the same parameters in different regions of an image, which reduces computation. However, this constrains the receptive field of convolution, making it difficult for models to capture relationships between spatially distant locations and thus depriving convolution of the ability to extract global information. The convolution kernel of each channel has specific parameters so that CNNs can detect different features; this is the second property of convolution, called channel-specific. Generally, the number of channels increases as a network deepens, yet it is questionable whether more channels yield higher accuracy. When the quality of a training set is poor and the model is complex, there is a risk of overfitting that leads to poor performance. Hundreds of channels complicate a model and increase this risk. It has been shown that many convolution kernels are redundant in the channel dimension DBLP:conf/bmvc/JaderbergVZ14, so the traditional convolution operation may increase computation yet fail to improve model performance.

For GANs based on CNNs, the spatial-agnostic property keeps the time complexity low, yet it limits the utilization of global information. GANs' design therefore often requires stacking convolutions, which results in too many layers in the model; such complex models increase the risk of overfitting and reduce the quality of generated images. Consider the other property of convolution, channel-specific, which causes redundant channels in the model. Redundant channels complicate GANs and again encourage overfitting. An overly complex discriminator applies overly stringent evaluation criteria to the generated fake images, which causes a serious problem for GANs, namely mode collapse, where the generator produces many similar images and loses diversity. To summarize, how to improve the quality of the generated images without incurring a risk of overfitting is a thorny issue.

Aiming to address these problems of GANs based on CNNs, this paper reports the Global Information Utilization (GIU) module for GANs to enhance the quality of generated images. The GIU module consists of involution DBLP:journals/corr/abs-2103-06255 and SENet DBLP:conf/cvpr/HuSS18. Involution is a new neural network operator whose characteristics are the opposite of convolution's: it shares kernel parameters across the channel domain and generates the corresponding involution kernel from the input feature map. The size of the involution kernel is $H \times W \times K \times K \times G$, where $G \ll C$, meaning that all channels within a group share the kernel's parameter matrix. Meanwhile, we note that the parameter matrix of the involution kernel is determined by the input feature map. To enhance useful features and suppress useless ones, GIU-GANs introduce the GIU module, which combines the channel attention mechanism SENet with involution. SENet uses two operations, i.e., Squeeze and Excitation, to explicitly model the interdependence between feature channels. It exploits the relationship between channels to learn the importance of each channel, enhancing features that are useful according to this importance and inhibiting features that are not useful for the current task. Compared with traditional convolution, the GIU module has a stronger ability to extract and utilize global information. Because its parameters are determined by the input feature map, it attends to each pixel on the image lattice and adaptively enhances useful features. Meanwhile, the GIU module shares the kernel over channels, which reduces redundant channel information. It improves the performance of GANs while reducing the risk of overfitting. Through the GIU module, we strengthen the feature extraction ability of GANs and thus enhance the quality of generated images.

In image generation tasks based on GANs, random noise is sampled and mapped to images by the generator, and Batch Normalization (BN) DBLP:conf/icml/IoffeS15 has become a core component of generators because it stabilizes training. However, normalization also causes the generator to ignore the representation differences between different noise samples. To address this problem, we replace the BN in the generator with Representative Batch Normalization (RBN) DBLP:conf/cvpr/GaoHLCP21. RBN adds centering calibration and scaling calibration to traditional BN: it compresses input features into representative statistics and enhances or suppresses them through learnable variables. Through RBN, GANs can enhance or suppress the representational differences of different random noise samples and thereby enhance the generated image quality.

The contributions of this work can be summarized as follows:

  • We design a new GIU module for GANs. The GIU module makes good use of global information to improve the feature extraction ability of GANs and enhance the quality of the generated images.

  • We introduce RBN for GANs to make the model pay more attention to the expression of representative features.

  • We adopt two technologies, i.e. spectral normalization DBLP:conf/iclr/MiyatoKKY18 and WGAN-GP DBLP:conf/nips/GulrajaniAADC17 to stabilize the training of GANs and improve the quality of generated images.

We compare our model with several classical GANs, i.e., Deep Convolutional GANs (DCGANs) DBLP:journals/corr/RadfordMC15, Least Squares Generative Adversarial Networks (LSGANs) DBLP:conf/iccv/MaoLXLWS17, WGAN-GP, Spectrally Normalized GANs (SNGANs) DBLP:conf/iclr/MiyatoKKY18, and Self-Attention Generative Adversarial Networks (SAGANs) DBLP:conf/icml/ZhangGMO19, on the CIFAR-10 and CelebA datasets. Each GAN was trained for 1 million iterations, and the best model was selected to randomly generate 50k samples for comparison using two evaluation metrics, the Inception Score (IS) DBLP:conf/nips/SalimansGZCRCC16 and the Frechet Inception Distance (FID) DBLP:conf/iccv/LiuLWT15. Extensive experimental results show that our model achieves competitive, state-of-the-art performance.

The rest of this paper is organized as follows. Section 2 reviews work related to GANs. In Section 3, we introduce GIU-GANs, including the architecture of the GIU module and the techniques used to stabilize the training of GANs. The algorithm of RBN is introduced in Section 4. In Section 5, we present experimental results for our model and several classic GANs on the CIFAR-10 and CelebA datasets. Section 6 concludes the paper.

2 Related Work

In this section, we review some previous work related to GANs.

Image generation has long been a subject that attracts researchers from a diverse range of fields, including Computer Vision and Machine Learning. In 2014, Generative Adversarial Networks (GANs) DBLP:journals/corr/GoodfellowPMXWOCB14 were proposed for image generation and received extensive attention. Since then, various GANs have been developed. To enhance the quality of generated images, Mirza et al. proposed Conditional Generative Adversarial Nets (CGANs) DBLP:journals/corr/MirzaO14, which introduced condition information as a label to the generator. Radford et al. proposed Deep Convolutional GANs (DCGANs) DBLP:journals/corr/RadfordMC15 to enhance GANs with a fully convolutional structure. Furthermore, Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (InfoGAN) DBLP:conf/nips/ChenCDHSSA16 constrained some dimensions of the random noise to control the semantic characteristics of the generated data. The Variational Autoencoder and Generative Adversarial Network (VAE-GAN) DBLP:conf/icml/LarsenSLW16 added a discriminator to the original Variational Autoencoder (VAE) DBLP:journals/corr/KingmaW13 to better capture the data distribution. The Energy-based Generative Adversarial Network (EBGAN) DBLP:conf/iclr/ZhaoML17 introduced an auto-encoder into the discriminator, showing a better convergence pattern and scalability for generating high-resolution images. While GANs' theory has made rapid progress, GANs' applications have also flourished, including text-to-image DBLP:conf/icml/ReedAYLSL16 , DBLP:conf/iccv/ZhangXL17 , DBLP:journals/pami/ZhangXLZWHM19 , DBLP:conf/cvpr/HongYCL18 , image-to-image DBLP:conf/icml/IoffeS15 , DBLP:conf/iccv/ZhuPIE17 , DBLP:conf/cvpr/KarrasLA19 , DBLP:conf/iclr/NieNP19 , and image super-resolution tasks DBLP:conf/cvpr/LedigTHCCAATTWS17 , DBLP:conf/eccv/WangYWGLDQL18 , DBLP:conf/eccv/BulatYT18 .

Although GANs have made great progress, their training remains difficult. Poorly trained GANs may generate similar images and reduce the diversity of generated images, leading to mode collapse. To address this issue, Multi-Agent Diverse Generative Adversarial Networks (MAD-GANs) DBLP:conf/cvpr/GhoshKNTD18 built multiple generators simultaneously, with each generator producing a different pattern, to guarantee the diversity of the produced samples. Improving GANs' objective function is another strategy to stabilize training. Unrolled Generative Adversarial Networks (Unrolled GANs) DBLP:conf/iclr/MetzPPS17 changed the objective function of the generator to consider both the current state of the generator and the state of the discriminator after K updates. Mao et al. proposed Least Squares Generative Adversarial Networks (LSGANs) DBLP:conf/iccv/MaoLXLWS17 to stabilize training by changing the objective from cross-entropy loss to least squares loss, achieving a certain effect. Furthermore, Wasserstein Generative Adversarial Networks (WGANs) DBLP:conf/icml/ArjovskyCB17 replaced the JS divergence with the Wasserstein distance and greatly stabilized training. However, because WGANs used weight clipping to enforce the Lipschitz constraint, the clipping threshold affects the training result. Gulrajani et al. therefore proposed WGAN-GP DBLP:conf/nips/GulrajaniAADC17, which abandoned weight clipping and instead used a gradient penalty to satisfy the Lipschitz constraint. Similarly, Spectrally Normalized GANs (SNGANs) DBLP:conf/iclr/MiyatoKKY18 used spectral normalization to satisfy the Lipschitz constraint and stabilize training. On this basis, Self-Attention Generative Adversarial Networks (SAGANs) DBLP:conf/icml/ZhangGMO19 combined spectral normalization with the self-attention DBLP:conf/nips/VaswaniSPUJGKP17 mechanism to improve the performance of GANs.

3 Global Information Utilization for Generative Adversarial Networks (GIU-GANs)

In this section, we present GIU-GANs in detail. First, we introduce the preliminary work behind GIU-GANs, including SENet and involution. Then, the implementation of the GIU module is explained. Next, we show the architecture of GIU-GANs. Finally, we describe in detail two techniques used to stabilize the training of GANs: spectral normalization for both the generator and the discriminator, and WGAN-GP.

3.1 Preliminary Work

In this subsection, we describe the channel attention SENet and involution in detail.

3.1.1 Squeeze-and-Excitation Networks (SENet)

At present, CNNs are core components of Computer Vision research, and capturing more information from images is becoming increasingly important. Numerous studies DBLP:conf/cvpr/SzegedyVISW16 , DBLP:conf/eccv/NewellYD16 , DBLP:conf/nips/JaderbergSZK15 consider the spatial domain to improve model performance, whereas SENet considers the relationship between feature channels. It captures the importance of each feature channel and then enhances or suppresses features according to this importance.

The first step is to compress each feature channel into a single real number using global average pooling DBLP:journals/corr/LinCY13. Each real number obtained by compression has a global receptive field, and the number of these values matches the number of input channels.

Then, the SENet captures the relationship between the channels via the excitation operation. It employs a gating mechanism in the form of the sigmoid function:

$s = \sigma\bigl(W_2\,\delta(W_1 z)\bigr)$   (1)

where $z \in \mathbb{R}^{C}$ is the squeezed channel descriptor, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are fully connected layers, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid function. The SENet adopts the fully connected layer $W_1$ to reduce the dimension of the squeezed descriptor $z$ by a ratio $r$, which lowers the computational load. The result of this dimensionality reduction is processed by the Rectified Linear Unit (ReLU) activation function and then multiplied by the fully connected layer $W_2$, so that the output dimension is restored to $C$. Next, the result is processed by the sigmoid function to obtain the channel weights $s \in \mathbb{R}^{C}$.

Finally, the sigmoid activation value captured for each channel is multiplied by the original feature:

$\tilde{x}_c = s_c \cdot u_c$   (2)

where $u_c$ is the feature map of the c-th channel and $s_c$ is its learned importance.

As such, useful feature channels are enhanced, and less useful feature channels are suppressed.
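
For concreteness, the following is a minimal PyTorch sketch of an SE block as described above; the class name, the reduction ratio of 16, and the exact layer choices are illustrative assumptions rather than details taken from this paper.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: global average pooling (squeeze)
    followed by a two-layer gating MLP (excitation)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # dimensionality reduction
        self.fc2 = nn.Linear(channels // reduction, channels)  # restore channel dimension
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: one real number per channel
        s = self.relu(self.fc1(s))        # excitation: bottleneck + ReLU
        s = self.sigmoid(self.fc2(s))     # per-channel importance in (0, 1)
        return x * s.view(b, c, 1, 1)     # reweight the original feature maps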

3.1.2 Involution

Involution is a new neural network operator with characteristics opposite to those of convolution: it is spatial-specific and channel-agnostic. The former property means the kernel parameters are generated from the feature map itself, which improves feature extraction; the latter reduces the number of kernel channels, thereby significantly decreasing the computational load and helping to prevent overfitting. The involution kernel has size $H \times W \times K \times K \times G$, where $K$ is the spatial size of the involution kernel and $H$ and $W$ are the resolutions of the feature map. Involution is defined as follows:

$Y_{i,j,k} = \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,j,u+\lfloor K/2 \rfloor, v+\lfloor K/2 \rfloor, \lceil kG/C \rceil}\, X_{i+u, j+v, k}$   (3)

where $\mathcal{H}$ is the involution kernel, which is created as follows:

$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}})$   (4)

$\Psi_{i,j}$ is the index set in the neighborhood of pixel $(i,j)$ used to obtain the $K \times K \times G$ kernel parameters, whereas $\phi$, together with a series of scaling and reshape operations, is the kernel generation function. Selecting the single-point set $\Psi_{i,j} = \{(i,j)\}$ on the feature map yields the involution kernel instantiation:

$\mathcal{H}_{i,j} = \phi(X_{i,j}) = W_1\, \sigma(W_0\, X_{i,j})$   (5)

where $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{(K \times K \times G) \times \frac{C}{r}}$. Involution uses $W_0$ and $W_1$ to transform the feature vector $X_{i,j} \in \mathbb{R}^{C}$ into a parameter vector of length $K \times K \times G$; $r$ is the channel reduction ratio, and $\sigma$ denotes the BN and ReLU operations.

Involution thus turns the feature vector $X_{i,j}$ of pixel $(i,j)$ in a feature map into a kernel of shape $K \times K \times G$ by $\phi$ and a reshape operation, where $K \times K$ is the kernel shape and $G$ is the number of shared kernels. Finally, involution performs multiplication-addition between this weight tensor of size $K \times K \times G$ and the feature vectors in the $K \times K$ neighborhood of pixel $(i,j)$ to obtain the final output feature map.
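
The following is a simplified, stride-1 PyTorch sketch of the involution operator described above, following the cited involution paper; the class and parameter names, the default reduction ratio, and the use of nn.Unfold are implementation assumptions, not the authors' exact code.

import torch
import torch.nn as nn

class Involution(nn.Module):
    """Minimal involution: the kernel is generated per pixel from the input
    feature map and shared across channel groups."""
    def __init__(self, channels: int, kernel_size: int = 3, groups: int = 1, reduction: int = 4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # kernel generation: two 1x1 convolutions (with BN + ReLU in between)
        # that map each pixel's feature vector to K*K*G parameters
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        kernel = self.span(self.reduce(x))                       # B x (K*K*G) x H x W
        kernel = kernel.view(b, self.groups, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        out = (kernel * patches).sum(dim=3)                      # multiply-add over the K*K neighborhood
        return out.view(b, c, h, w)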

3.1.3 Algorithm of GIU module

We notice that the involution kernel generation function is designed around the feature vector of a single point on the feature map:

$\mathcal{H}_{i,j} = \phi(X_{i,j})$   (6)

where $X_{i,j} \in \mathbb{R}^{C}$ is the feature vector of the pixel located at point $(i,j)$. The values of this feature vector represent the responses of point $(i,j)$ on each channel. Different channel responses have different effects on the current task: useful values improve the performance of GANs, whereas less useful values misguide GANs. To obtain a more effective involution kernel, we combine the SE module with involution into the GIU module, which adaptively enhances the kernel's useful parameters and suppresses useless ones. Figure 1 is a schematic of the architecture of the GIU module.

Figure 1: Schematic of the Global Information Utilization module of GIU-GANs. The $\otimes$ symbol represents matrix multiplication and $\oplus$ represents matrix addition.

Firstly, SENet is employed to capture the importance of each channel of the input feature map:

$s = \sigma\bigl(W_2\, \delta(W_1\, F_{sq}(X))\bigr)$   (7)

where $X = [x_1, x_2, \dots, x_C]$ and $F_{sq}$ is the global average pooling operation:

$z_c = F_{sq}(x_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$   (8)

$X$ are the feature maps obtained from the input by a convolution operation, $H$ and $W$ are the spatial dimensions of the feature maps, and $s \in \mathbb{R}^{C}$ is the importance of each channel. The importance $s_c$ of each channel is multiplied with the channel itself: $\tilde{x}_c = s_c \cdot x_c$, where $x_c$ is a 2-dimensional spatial map and $c$ denotes the c-th channel.

Then, the reweighted feature map $\tilde{X}$ is used to generate the involution kernel:

$\mathcal{H}_{i,j} = \phi(\tilde{X}_{i,j}) = W_1\, \sigma(W_0\, \tilde{X}_{i,j})$   (9)

The previous involution kernel was generated from the single-point set at point $(i,j)$, $\mathcal{H}_{i,j} = \phi(X_{i,j})$, and the improved kernel function is $\mathcal{H}_{i,j} = \phi(\tilde{X}_{i,j})$. The new generation function has the ability to select channels and can adaptively adjust the parameter matrix according to the importance of each channel. After reshaping the involution kernel to size $K \times K \times G$, where $G$ counts the number of groups and each group shares the same involution kernel, the final feature map is obtained by multiplication-addition between the kernel produced by the GIU module and the neighborhood of the corresponding point on the input feature map:

$Y_{i,j,k} = \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,j,u+\lfloor K/2 \rfloor, v+\lfloor K/2 \rfloor, \lceil kG/C \rceil}\, X_{i+u, j+v, k}$   (10)

Specifically, the algorithm of the GIU module is shown in Algorithm 1:

Input: The set of feature maps for the current batch, $X \in \mathbb{R}^{N \times C \times H \times W}$, with spatial dimensions $H \times W$.
$F_1$, $F_2$ are fully connected layer operations.
$W_0$, $W_1$ are convolution operations.
$\sigma$ is the sigmoid activation function, $\delta$ is the ReLU activation function.
Output: The set of feature maps for the current batch, $Y \in \mathbb{R}^{N \times C \times H \times W}$.

1: $z \leftarrow F_{sq}(X)$ (global average pooling).
2: $s \leftarrow \sigma(F_2(\delta(F_1(z))))$.
3: $\tilde{X} \leftarrow s \cdot X$.
4: $\mathcal{H} \leftarrow W_1(\delta(W_0(\tilde{X})))$.
5: Reshape the involution kernel at each position to $1 \times 1 \times (K \times K \times G)$.
6: $Y_{i,j,k} \leftarrow \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,j,u+\lfloor K/2 \rfloor, v+\lfloor K/2 \rfloor, \lceil kG/C \rceil}\, X_{i+u, j+v, k}$.
7: return $Y$.
Algorithm 1 Algorithm of GIU module
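
A hedged PyTorch sketch of Algorithm 1 follows; where the text is ambiguous (layer names, reduction ratios, and the choice to unfold the original rather than the reweighted feature map for the multiplication-addition), the decisions below are our assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class GIUModule(nn.Module):
    """Sketch of the GIU module: SE-style channel reweighting of the input,
    followed by involution whose kernel is generated from the reweighted features."""
    def __init__(self, channels: int, kernel_size: int = 3, groups: int = 1,
                 se_reduction: int = 16, inv_reduction: int = 4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # SE branch: channel importance from global average pooling (F_1, F_2)
        self.fc1 = nn.Linear(channels, channels // se_reduction)
        self.fc2 = nn.Linear(channels // se_reduction, channels)
        # involution kernel generation from the SE-reweighted feature map (W_0, W_1)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // inv_reduction, 1),
            nn.BatchNorm2d(channels // inv_reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // inv_reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # steps 1-3: squeeze, excitation, channel reweighting
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(x.mean(dim=(2, 3))))))
        x_se = x * s.view(b, c, 1, 1)
        # steps 4-5: generate and reshape the involution kernel
        kernel = self.span(self.reduce(x_se)).view(b, self.groups, 1, self.k * self.k, h, w)
        # step 6: multiply-add the kernel with neighborhoods of the input feature map
        patches = self.unfold(x).view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        return (kernel * patches).sum(dim=3).view(b, c, h, w)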

3.2 The architecture of GIU-GANs

The architecture of GIU-GANs is shown in Figure 2 and Figure 3:

Figure 2: The discriminator architecture of GIU-GANs. $H$ and $W$ denote the height and width of the current feature map, $C$ denotes the number of channels in the current hidden layer, and $N$ denotes the batch size.
Figure 3: The generator architecture of GIU-GANs. $H$ and $W$ denote the height and width of the current feature map, $C$ denotes the number of channels in the current hidden layer, and $N$ denotes the batch size.

Figure 2 and Figure 3 show the architecture of GIU-GANs. We place the GIU module near the output layer of the discriminator and near the input layer of the generator. The discriminator does not use BN; GIU-GANs use LeakyReLU with a leak slope of 0.1 in the discriminator and ReLU in the generator.

The GIU module attends to the importance of the channels in the hidden layer and uses this as the basis for generating the parameter matrix. Meanwhile, the GIU module considers the index set of the neighborhood of each pixel on the feature map, so its number of parameters is closely related to the height and width of the feature map. In view of these observations, we insert the GIU module after hidden layers with more channels, that is, near the output layer of the discriminator and near the input layer of the generator. The experiments in Section 5 show that this placement is effective.

3.3 Two Technologies for Training GANs

3.3.1 Spectral normalization for both generator and discriminator

As is well known, WGAN replaced the JS divergence with the Wasserstein distance. However, this requires the discriminator to satisfy the Lipschitz constraint:

$\frac{\lVert f(x_1) - f(x_2) \rVert}{\lVert x_1 - x_2 \rVert} \le K, \quad \forall x_1, x_2$   (11)

It requires that the absolute value of the derivative of the function does not exceed $K$; otherwise, gradient explosion will occur during model training. SNGANs satisfy the Lipschitz constraint using spectral normalization, which normalizes the parameter matrix $W$ of each layer:

$\bar{W}_{SN}(W) = \frac{W}{\sigma(W)}$   (12)

where $\sigma(W)$ is the spectral norm (the largest singular value) of $W$ and $\bar{W}_{SN}(W)$ is the parameter matrix after spectral normalization. Spectral normalization has the advantage of not requiring extra hyperparameter tuning, and its computational cost is relatively low. In conclusion, it can be used to satisfy the Lipschitz constraint and stabilize the training of GANs.

In GIU-GANs, we borrow from SAGANs: spectral normalization is applied to both the discriminator and the generator to stabilize training.
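
As a usage sketch (not the authors' exact layer configuration), PyTorch's built-in spectral_norm wrapper can apply Eq. (12) to each weight layer; the channel sizes below are illustrative, while the LeakyReLU slope of 0.1 follows the discriminator setting described above.

import torch.nn as nn
from torch.nn.utils import spectral_norm

# each convolution's weight matrix is divided by its spectral norm on every forward pass
disc_block = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.1, inplace=True),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.1, inplace=True),
)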

3.3.2 WGAN-GP

Moreover, we adopt WGAN-GP as our objective function. Instead of weight clipping, WGAN-GP uses a gradient penalty:

$L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\bigl[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\bigr]$   (13)

where $\mathbb{P}_r$ is the real image distribution, $\mathbb{P}_g$ is the generated image distribution, $\mathbb{P}_{\hat{x}}$ is the distribution of points interpolated between $\mathbb{P}_r$ and $\mathbb{P}_g$, and $\lambda$ is the penalty coefficient. The gradient penalty term, $\lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$, enforces the Lipschitz constraint and stabilizes the training of GANs. Specifically, the training details of GIU-GANs are summarized in Algorithm 2:

Input: The batch size $m$, Adam hyperparameters $\alpha$, $\beta_1$, $\beta_2$, the gradient penalty coefficient $\lambda$.
Input: Initial discriminator parameters $w_0$, initial generator parameters $\theta_0$.

1: for number of training iterations do
2:     for $i = 1, \dots, m$ do
3:         Sample real data $x \sim \mathbb{P}_r$, latent variable $z \sim p(z)$, a random number $\epsilon \sim U[0, 1]$.
4:         $\tilde{x} \leftarrow G_\theta(z)$.
5:         $\hat{x} \leftarrow \epsilon x + (1 - \epsilon) \tilde{x}$.
6:         $L^{(i)} \leftarrow D_w(\tilde{x}) - D_w(x) + \lambda (\lVert \nabla_{\hat{x}} D_w(\hat{x}) \rVert_2 - 1)^2$.
7:     $w \leftarrow \mathrm{Adam}\bigl(\nabla_w \frac{1}{m} \sum_{i=1}^{m} L^{(i)}, w, \alpha, \beta_1, \beta_2\bigr)$.
8:     Sample a batch of latent variables $\{z^{(i)}\}_{i=1}^{m} \sim p(z)$.
9:     $\theta \leftarrow \mathrm{Adam}\bigl(\nabla_\theta \frac{1}{m} \sum_{i=1}^{m} -D_w(G_\theta(z^{(i)})), \theta, \alpha, \beta_1, \beta_2\bigr)$.
10: return $\theta$, $w$.
Algorithm 2 Training Algorithm of GIU-GANs. We use default values of $\lambda = 10$, $\alpha = 0.0002$, $\beta_1 = 0$, $\beta_2 = 0.9$.
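
A minimal PyTorch sketch of the gradient-penalty term in Eq. (13) used inside Algorithm 2 is given below, with lambda = 10 as in the default setting; the function name and tensor shapes are illustrative assumptions.

import torch

def gradient_penalty(discriminator, real, fake, lambda_gp: float = 10.0):
    """WGAN-GP gradient penalty: sample x_hat on straight lines between real
    and generated images and penalize gradients whose norm deviates from 1."""
    b = real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=real.device)        # random interpolation coefficient
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_out, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(b, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()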

4 The algorithm of Representative Batch Normalization(RBN)

The conventional BN performs three operations on input feature maps $X$: centering, scaling, and affine transformation, respectively given by

$Y = \gamma \cdot \frac{X - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$   (14)

where $\mu_B$ and $\sigma_B^2$ represent the mean and variance of the feature maps, respectively, $\gamma$ and $\beta$ denote the learnable scaling and offset terms of the affine transformation, and $\epsilon$ is used to prevent division by zero. During training, the mean and variance are calculated over a mini-batch:

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$   (15)

However, normalizing with the mini-batch statistics $\mu_B$ and $\sigma_B^2$ suppresses the expression of some of the random noise in the generator. Therefore, we replace BN with RBN in part of the network layers of the GANs.

First, RBN introduces a centering calibration before the centering operation of BN:

$X_{cm} = X + w_m \odot K_m$   (16)

where $w_m$ is the learnable variable and $K_m$ is the statistical feature obtained after a series of processing of $X$. Different operations result in different shapes of the statistical feature. For example, $K_m \in \mathbb{R}^{N \times C \times 1 \times 1}$ captures statistical information of each channel through feature-map normalization, while $K_m \in \mathbb{R}^{N \times 1 \times H \times W}$ captures statistical information of each pixel after channel normalization. In GIU-GANs we follow the original RBN paper DBLP:conf/cvpr/GaoHLCP21 and adopt global average pooling to normalize the feature map into the shape $N \times C \times 1 \times 1$. The new mean is $E(X_{cm})$; subtracting the old mean yields:

$E(X_{cm}) - E(X) = w_m \odot E(K_m)$   (17)

After the centering calibration, the learnable variable $w_m$ enhances or suppresses representative noise features: if $w_m \odot K_m > 0$, the representative noise features are enhanced, whereas if $w_m \odot K_m < 0$, they are suppressed.

Meanwhile, RBN adds a scaling calibration after the original scaling operation:

$Y = X_s \cdot R(w_v \odot K_s + w_b)$   (18)

where $X_s$ is the feature after BN scaling, $K_s$ is the statistical feature obtained by global average pooling, and $w_v$ and $w_b$ are learnable variables which control the strength and position of the restriction function $R(\cdot)$, respectively. Because the output of $R(\cdot)$ lies in $(0, 1)$, there must be a value in $(0, 1)$ that scales the feature variance, which is expressed as follows:

(19)

After the scaling calibration, the dispersion of the feature variance is reduced and a more stable channel distribution is obtained.
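
The following simplified PyTorch sketch folds the standard BN steps into nn.BatchNorm2d and wraps them with the two calibrations; the exact placement relative to BN's affine step and the use of sigmoid as the restriction function follow the RBN paper only approximately and should be read as assumptions.

import torch
import torch.nn as nn

class RepresentativeBatchNorm2d(nn.Module):
    """Sketch of RBN: a centering calibration before BN and a scaling
    calibration after it, both driven by global-average-pooled statistics."""
    def __init__(self, channels: int, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, eps=eps, momentum=momentum, affine=True)
        # learnable calibration variables, one per channel
        self.w_m = nn.Parameter(torch.zeros(1, channels, 1, 1))   # centering calibration weight
        self.w_v = nn.Parameter(torch.ones(1, channels, 1, 1))    # scaling calibration strength
        self.w_b = nn.Parameter(torch.zeros(1, channels, 1, 1))   # scaling calibration position

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k_m = x.mean(dim=(2, 3), keepdim=True)                 # representative statistic K_m
        x = x + self.w_m * k_m                                  # centering calibration, Eq. (16)
        y = self.bn(x)                                          # standard BN: centering, scaling, affine
        k_s = y.mean(dim=(2, 3), keepdim=True)                  # statistic for the scaling calibration
        return y * torch.sigmoid(self.w_v * k_s + self.w_b)    # scaling calibration, Eq. (18)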

In Section 5, we compared GIU-GANs with and without RBN on the CIFAR-10 and CelebA datasets. The experimental results revealed the effectiveness of GIU-GANs with RBN on the CelebA dataset.

5 Experimental Results

To evaluate the effectiveness of the proposed GIU-GANs, we conducted experiments on two datasets, CelebA and CIFAR-10. The CIFAR-10 dataset consists of 50k training and 10k testing images; we trained on all 50k training images. For the CelebA dataset, we aligned and cropped 200k images to 64 × 64 resolution and used them as the training set. Moreover, to further demonstrate the effectiveness of the GIU module, ablation studies were conducted on the CIFAR-10 dataset. Finally, GIU-GANs with and without RBN were verified experimentally on the CIFAR-10 and CelebA datasets. Two common evaluation metrics, the Inception Score (IS) and the Frechet Inception Distance (FID), were used as the main metrics. All models were trained for 1 million iterations, and 50k images generated by each model were measured by PyTorch implementations of IS and FID. All experiments were run on an NVIDIA RTX 3090 GPU and an Intel Core i9-9900K CPU @ 3.60 GHz, using PyTorch 1.7.0. In the following, we present the experimental results of GIU-GANs and compare them with other GANs in detail.
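
For reference, FID compares Gaussian fits of Inception activations of real and generated images; a sketch of the distance itself is given below (the Inception feature extraction, which the PyTorch implementations handle, is omitted, and the function name is illustrative).

import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians fitted to Inception
    activations: ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                   # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))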

5.1 Hyperparameters Study

To explore the advantages of the GIU module in processing hidden layers with many channels, we also inserted the GIU module after a hidden layer with fewer channels and compared this variant with the GIU-GANs described above. The best results of both models are shown in Table 1.

Model | IS (↑) | FID (↓)
GIU-GANs (different position) | | 24.02
GIU-GANs | | 22.66
Table 1: Test results of the GIU module at different locations in GIU-GANs. All models have been trained for 1 million iterations, and the best IS and FID are reported.

We first conducted the comparative experiment on the influence of the insertion position of the GIU module, as reported above. Next, to verify which hyperparameters fit our model better, the Two Time-scale Update Rule (TTUR) DBLP:conf/nips/HeuselRUNH17 was adopted for GIU-GANs, that is, the generator and the discriminator are given different learning rates and optimization algorithms. All models were trained for 1 million iterations on the CIFAR-10 dataset, and their IS and FID scores were tested. Table 2 shows the results of GIU-GANs under different learning rates on the CIFAR-10 dataset. We set the learning rate of GIU-GANs without TTUR to 0.0002 and compared it with GIU-GANs with TTUR, where the learning rate of the discriminator is 0.0004 and that of the generator is 0.0001.

Model | IS (↑) | FID (↓)
GIU-GANs with TTUR | | 34.80
GIU-GANs | | 22.66
Table 2: Test results of GIU-GANs with and without TTUR. All models have been trained for 1 million iterations. The best IS and FID are reported.

Finally, we evaluated the optimization algorithms Adagrad DBLP:journals/jmlr/DuchiHS11, RMSprop tieleman2012lecture, and Adam DBLP:journals/corr/KingmaB14 for GIU-GANs. The best results of all models are presented in Table 3.

Model | IS (↑) | FID (↓)
GIU-GANs-Adagrad | | 61.28
GIU-GANs-RMSprop | | 28.93
GIU-GANs-Adam | | 22.66
Table 3: Test results of GIU-GANs with different optimization algorithms (Adagrad, RMSprop, and Adam). All models have been trained for 1 million iterations. The best IS and FID are reported.

Based on these comparisons, Adam is chosen as the optimization algorithm of GIU-GANs, the learning rate of both the generator and the discriminator is set to 0.0002, and the GIU module is placed after hidden layers with a larger number of channels, which greatly improves the performance of the model.
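
An illustrative sketch of the resulting optimizer setup is shown below; the placeholder networks stand in for the real GIU-GANs generator and discriminator, and the beta values follow Algorithm 2.

import torch
import torch.nn as nn

# placeholder networks; the real GIU-GANs generator and discriminator go here
generator = nn.Sequential(nn.Linear(100, 3 * 32 * 32))
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

# final setting from the hyperparameter study: Adam with lr = 0.0002 for both
# networks and betas = (0, 0.9); with TTUR the discriminator would instead use
# lr = 0.0004 and the generator lr = 0.0001
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.0, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.0, 0.9))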

5.2 Results on the CIFAR-10 DataSet

The architecture of the GIU-GANs used to train the CIFAR-10 dataset is shown in Figure 4:

Figure 4: The architecture of GIU-GANs applied on the CIFAR-10 dataset.

In this experiment, GIU-GANs and the compared GANs were all trained for 1 million iterations on the CIFAR-10 dataset. All GANs generated 32 × 32 images, and their IS and FID scores were tested. The best results of all models are shown in Table 4.

Figure 5: 32×32 examples were randomly produced by our models on the CIFAR-10 dataset.

Figure 5 shows 64 images generated by GIU-GANs, and Table 4 shows that GIU-GANs perform better than other well-known GANs.

Model | IS (↑) | FID (↓)
DCGANs | | 26.22
WGAN-GP | | 22.95
LSGANs | | 66.51
SNGANs | | 24.60
SAGANs | | 67.31
GIU-GANs | | 22.66
Table 4: Comparison of 50k 32 × 32 images generated by GIU-GANs and 50k 32 × 32 images generated by other GANs on CIFAR-10. All models have been trained for 1 million iterations. The best IS and FID are reported.

5.3 Results on the CelebA DataSet

The architecture of the GIU-GANs used to train the CelebA dataset is shown in Figure 6:

Figure 6: The architecture of GIU-GANs applied on the CelebA dataset.

In this experiment, we cropped and aligned all images from the CelebA dataset for training and trained for 1 million iterations. All GANs generated 50k 64 × 64 images, and their IS and FID scores were tested. We show the best results of all models in Table 5.

Additionally, as shown in Figure 7, when DCGANs were trained on the CelebA dataset, the vanishing gradient of the discriminator left the generator's gradient unchanged, so we excluded DCGANs from this comparison.

(a) Convergence curve of the discriminator of DCGANs on the CelebA dataset
(b) Convergence curve of the generator of DCGANs on the CelebA dataset.
Figure 7: Convergence curve of DCGANs on the CelebA dataset.
Figure 8: 64×64 examples were randomly produced by our models on the CelebA dataset.
Model | IS (↑) | FID (↓)
WGAN-GP | | 11.07
LSGANs | | 15.48
SNGANs | | 7.09
SAGANs | | 19.10
GIU-GANs | | 6.34
Table 5: Comparison of 50k 64 × 64 images generated by GIU-GANs and 50k 64 × 64 images generated by other GANs on CelebA. All models have been trained for 1 million iterations. The best IS and FID are reported.

The images generated by GIU-GANs are shown in Figure 8. As Table 5 shows, GIU-GANs performed best in both IS and FID.

5.4 Ablation Study

To explore the reasons for GIU-GANs' superior performance, several ablation studies were designed to study the contributions of individual components. We evaluated GIU-GANs with and without the GIU module on the CIFAR-10 dataset. In addition, a baseline without the GIU module and RBN was also trained and tested. The best results are shown in Table 6.

Model | IS (↑) | FID (↓)
baseline | | 37.54
GIU-GANs without GIU module | | 39.97
GIU-GANs | | 22.66
Table 6: Ablation study of the proposed techniques on CIFAR-10. All models have been trained for 1 million iterations. The best IS and FID are reported.

From Table 6, we know that the performance of GIU-GANs deteriorates significantly when the GIU module is removed.

Table 7 shows the influence of RBN on GIU-GANs. GIU-GANs without RBN perform better on the CIFAR-10 dataset, but for large datasets like CelebA, GIU-GANs with RBN are superior.

Model | CIFAR-10 IS (↑) | CIFAR-10 FID (↓) | CelebA IS (↑) | CelebA FID (↓)
GIU-GANs without RBN | | 22.12 | | 6.53
GIU-GANs | | 22.66 | | 6.34
Table 7: Experimental results of GIU-GANs with and without RBN on the CIFAR-10 and CelebA datasets. All models have been trained for 1 million iterations. The best IS and FID are reported.

5.5 Results Analysis

The training of GANs is extremely unstable: after long training, the error tends to grow again, resulting in overfitting. In GANs, some hidden layers have thousands of channels, and this redundant channel information not only increases the computation time of the model but also makes the model more complex and increases the risk of overfitting. The GIU module is placed after hidden layers with a large number of channels. It first filters the global information, enhancing useful information and suppressing useless information; it then attends to each pixel of the feature maps and generates the corresponding parameter matrix for multiplication-addition. By filtering the global channels and attending to each pixel on the feature maps, the GIU module can make good use of global information. The performance of GIU-GANs can thus be improved without stacking convolutions, so GIU-GANs can enhance the quality of the generated images without greatly increasing the risk of overfitting.

6 Conclusion

In this paper, we proposed a new generative model called GIU-GANs, which incorporates a GIU module composed of SENet and involution into the GANs framework, so that the model can make full use of channel information to extract features and reduce redundant information, thereby enhancing the quality of the generated images. In addition, since RBN attends to representative features, we replace the BN layers of the generator near the input and output with RBN. Compared with other GANs, the IS and FID of our model reach a state-of-the-art level on CIFAR-10 and CelebA.

Considering the success of the Vision Transformer (ViT) DBLP:conf/iclr/DosovitskiyB0WZ21 in visual tasks, we plan to combine the ViT model with GANs in the future to further improve performance. Moreover, involution does not change the number of output channels, so GIU-GANs do not abandon convolution entirely. How to improve involution so that GANs can completely abandon convolution and build a backbone with involution alone remains a challenging problem.

7 Acknowledgments

This research work is supported in part by the Fundamental Research Funds for the Central Universities under Grant 21621017, in part by the Innovative Youth Program of Guangdong University under Grant 2019KQNCX194, and in part by the Educational and Scientific Project of Guangdong Province under Grant 2021GXJK368.

References

  • (1) I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative adversarial networks, CoRR abs/1406.2661.
  • (2) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations, ICLR, 2016.
  • (3) D. Li, J. Hu, C. Wang, X. Li, Q. She, L. Zhu, T. Zhang, Q. Chen, Involution: Inverting the inherence of convolution for visual recognition, CoRR abs/2103.06255.
  • (4) M. Jaderberg, A. Vedaldi, A. Zisserman, Speeding up convolutional neural networks with low rank expansions, in: British Machine Vision Conference, BMVC, 2014.
  • (5) X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, S. P. Smolley, Least squares generative adversarial networks, in: IEEE International Conference on Computer Vision, ICCV, 2017, pp. 2813–2821.
  • (6) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of wasserstein gans, in: Advances in Neural Information Processing Systems, NIPS, 2017, pp. 5767–5777.
  • (7) T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, in: International Conference on Learning Representations, ICLR, 2018.
  • (8) J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 7132–7141.
  • (9) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, ICML, Vol. 37 of JMLR Workshop and Conference Proceedings, 2015, pp. 448–456.
  • (10) S. Gao, Q. Han, D. Li, M. Cheng, P. Peng, Representative batch normalization with feature calibration, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 8669–8679.
  • (11) H. Zhang, I. J. Goodfellow, D. N. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, ICML, Vol. 97 of Proceedings of Machine Learning Research, 2019, pp. 7354–7363.
  • (12) T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, NIPS, 2016, pp. 2226–2234.
  • (13) Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: IEEE International Conference on Computer Vision, ICCV, 2015, pp. 3730–3738.
  • (14) M. Mirza, S. Osindero, Conditional generative adversarial nets, CoRR abs/1411.1784.
  • (15) X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, Infogan: Interpretable representation learning by information maximizing generative adversarial nets, in: Advances in Neural Information Processing Systems, NIPS, 2016, pp. 2172–2180.
  • (16) A. B. L. Larsen, S. K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric, in: International Conference on Machine Learning, ICML, Vol. 48 of JMLR Workshop and Conference Proceedings, 2016, pp. 1558–1566.
  • (17) D. P. Kingma, M. Welling, Auto-encoding variational bayes, in: International Conference on Learning Representations, ICLR, 2014.
  • (18) J. J. Zhao, M. Mathieu, Y. LeCun, Energy-based generative adversarial networks, in: International Conference on Learning Representations, ICLR, 2017.
  • (19) S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, in: International Conference on Machine Learning, ICML, Vol. 48 of JMLR Workshop and Conference Proceedings, 2016, pp. 1060–1069.
  • (20) H. Zhang, T. Xu, H. Li, Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks, in: IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5908–5916.
  • (21) H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D. N. Metaxas, Stackgan++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (8) (2019) 1947–1962.
  • (22) S. Hong, D. Yang, J. Choi, H. Lee, Inferring semantic layout for hierarchical text-to-image synthesis, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 7986–7994.
  • (23) J. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: IEEE International Conference on Computer Vision, ICCV, 2017, pp. 2242–2251.
  • (24) T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019, pp. 4401–4410.
  • (25) W. Nie, N. Narodytska, A. Patel, Relgan: Relational generative adversarial networks for text generation, in: International Conference on Learning Representations, ICLR, 2019.
  • (26) C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 105–114.
  • (27) X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. C. Loy, Esrgan: Enhanced super-resolution generative adversarial networks, in: European Conference on Computer Vision (ECCV) workshops, Vol. 11133 of Lecture Notes in Computer Science, 2018, pp. 63–79.
  • (28) A. Bulat, J. Yang, G. Tzimiropoulos, To learn image super-resolution, use a GAN to learn how to do image degradation first, in: European Conference on Computer Vision (ECCV), Vol. 11210 of Lecture Notes in Computer Science, 2018, pp. 187–202.
  • (29) A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. S. Torr, P. K. Dokania, Multi-agent diverse generative adversarial networks, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 8513–8521.
  • (30) L. Metz, B. Poole, D. Pfau, J. Sohl-Dickstein, Unrolled generative adversarial networks, in: International Conference on Learning Representations, ICLR, 2017.
  • (31) M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, ICML, Vol. 70 of Proceedings of Machine Learning Research, 2017, pp. 214–223.
  • (32) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, NIPS, 2017, pp. 5998–6008.
  • (33) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 2818–2826.
  • (34) A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: European Conference on Computer Vision (ECCV), Vol. 9912 of Lecture Notes in Computer Science, 2016, pp. 483–499.
  • (35) M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems, NIPS, 2015, pp. 2017–2025.
  • (36) M. Lin, Q. Chen, S. Yan, Network in network, in: International Conference on Learning Representations, ICLR, 2014.
  • (37) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, in: Advances in Neural Information Processing Systems, NIPS, 2017, pp. 6626–6637.
  • (38) J. C. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011) 2121–2159.
  • (39) T. Tieleman, G. Hinton, Lecture 6.5-rmsprop, coursera: Neural networks for machine learning, University of Toronto, Technical Report.
  • (40) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, ICLR, 2015.
  • (41) A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, ICLR, 2021.