Generative Adversarial Networks (GANs) are developed to learn the distribution of input data and then generate new samples from the learned distribution. Recently, GANs have been applied to many tasks and achieved impressive results, such as image enhancement (Demir and Unal 2018; Yu et al. 2018; Nazeri et al. 2019; Ledig et al. 2017; Zhang et al. 2019; Kupyn et al. 2018), high resolution image generation [2, 6, 7, 14], and 3D generation and reconstruction [15, 13, 16]. However, training GANs is unstable and sensitive to hyper-parameters. Different loss functions and network structures have been developed to train powerful GANs with better quality and stability [1, 2, 4, 9, 6, 12, 19, 18].
Generating high resolution images was difficult for early works, but recent advances in the community have made it possible to generate high quality images even at very high resolutions [2, 6, 7]. Even so, we find that training GANs to generate images at high resolutions is both time consuming and computationally intensive: high-end graphic cards and long training times are required. Karras et al. report that training their generative networks at the highest resolution takes an NVIDIA DGX-1 with 8 Tesla V100 GPUs 4 days; with only one graphic card, training would take 14 to 41 days depending on the configuration. BigGAN uses a powerful Google TPU v3 Pod with 128 to 512 cores to scale up GANs.
In this paper, we propose an acceleration framework to increase the efficiency of training GANs at high resolution. Instead of generating large images directly, we generate small codes in latent space, as demonstrated in Figure 2. The traditional structure generates images from input noise directly, which becomes challenging to train as the image resolution increases. In our framework, we propose a different, two-stage way. Encoder and decoder networks are first trained to transform large images into small latent codes. The generative network then only learns to generate small codes from noise in the latent space. Newly generated code samples can easily be transformed into high resolution images via the trained decoder network.
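The two-stage flow can be sketched as follows. This is a minimal numpy mock-up, not the paper's actual networks: the encoder is replaced by average pooling followed by tanh, the decoder by nearest-neighbor upsampling, and the 4x factor, the shapes, and the placeholder generator are illustrative assumptions. Only the data flow and the tensor shapes are meaningful.

```python
import numpy as np

FACTOR = 4  # assumed downsampling factor between image and latent code

def encode(img):
    # Stand-in encoder: 4x average pooling, then tanh to keep codes in [-1, 1].
    h, w, c = img.shape
    code = img.reshape(h // FACTOR, FACTOR, w // FACTOR, FACTOR, c).mean(axis=(1, 3))
    return np.tanh(code)

def decode(code):
    # Stand-in decoder: nearest-neighbor upsampling back to image size.
    return code.repeat(FACTOR, axis=0).repeat(FACTOR, axis=1)

def generator(seed, code_shape):
    # Placeholder generator: maps noise to a code-shaped tensor in [-1, 1].
    rng = np.random.default_rng(seed)
    return np.tanh(rng.standard_normal(code_shape))

# Stage 1: the encoder/decoder turn a large image into a small latent code.
image = np.random.default_rng(0).uniform(-1, 1, (256, 256, 3))
code = encode(image)            # (64, 64, 3): 4x smaller per side

# Stage 2: the GAN works entirely in code space; new samples are decoded.
fake_code = generator(42, code.shape)
fake_image = decode(fake_code)  # back to (256, 256, 3)
```

The generative adversarial training itself never touches full-resolution images; only the decode step at sampling time does.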
The proposed acceleration framework increases training efficiency for two reasons. First, the latent codes are many times smaller than the original large images, which means fewer layers and smaller feature maps in both the generator and discriminator networks. Second, although additional encoder and decoder networks have to be trained in the first step, they can be trained quickly using low resolution images, as both networks are fully convolutional. For different resolutions, the encoder and decoder networks only need to be trained once. As a result, the proposed acceleration framework trains the networks without feeding in high resolution images at every training step.
Finally, we build a traditional image generative network and accelerate it using the proposed framework. Experimental results are promising: the training speed is two to five times faster across the tested resolutions. In addition, the quality of the generated samples after acceleration is well improved, benefiting from the better stability of learning the distribution of small latent codes rather than large images.
2 Related Work
Training high resolution GANs is challenging and prone to gradient problems, as the discriminator can easily distinguish fake images from real ones. To train high resolution GANs stably, various techniques have been proposed. Gulrajani et al. [4] used a gradient penalty to avoid gradient problems and improve training stability. Spectral normalization was used by [10] to stabilize the training of the discriminator network. Zhang et al. [18] introduced a self-attention module into their network, enabling it to model long-range, multi-level dependencies across image regions; the self-attention mechanism helps improve details and quality. Karras et al. [6, 7] trained the network progressively, starting from low resolution and growing to larger resolutions. Both the generator and discriminator networks grow during the training process, which increases training stability significantly, and they manage to generate high resolution images with high fidelity. BigGAN [2] used powerful computational resources and scaled up GANs with very large batch sizes and parameter counts.
The proposed method accelerates the training of GANs without modifying either the network structures or the loss function. We train additional encoder and decoder networks to convert image generation into small latent code generation. Variational Autoencoders (VAEs) [8] also train encoder and decoder networks to generate images. When training the encoder and decoder networks, VAEs force the codes in latent space to obey a particular distribution (e.g. the standard normal distribution), and full resolution images are used in every training step. In the proposed framework, in contrast, the encoder and decoder networks are trained independently using low resolution images, and their parameters are fixed after training. Training images are then transformed into latent codes, and a generative network is trained to learn the distribution of those codes. The proposed method does not use full resolution images in every training step. Both the target and the training manner of the proposed method are different from VAE methods.
3 The Acceleration Framework
The key to the proposed acceleration framework for high resolution generation lies in generating small codes in latent space instead of large images in image space. We first train an encoder and a decoder network: the encoder transforms images into latent codes, and the decoder transforms codes back into images. GANs are then trained to generate codes in latent space, which can easily be transformed into images by the trained decoder network. The proposed acceleration framework is demonstrated in Figure 2.
3.1 Encoder and Decoder Networks
The framework of the proposed encoder and decoder networks is demonstrated in Figure 3(a). The whole framework is composed of three parts: an encoder network $E$, a decoder network $D$, and an image discriminator network $D_I$. Several downsampling operations in the encoder network transform input images $x$ into latent codes $c = E(x)$. The decoder network contains upsampling operations to convert latent codes back into images. We add a tanh operation at the end of the encoder network to force the values of the latent codes into [-1, 1].
The objective is to minimize the reconstruction error. We adopt the $L_1$ loss function here,

$\mathcal{L}_{rec} = \mathbb{E}_x\big[\lVert D(E(x)) - x \rVert_1\big],$

where $x$ is the input image. To improve the reconstruction quality, we add an adversarial loss by introducing an image discriminator network $D_I$, which predicts whether images are real or not. The adversarial loss for the encoder and decoder networks is defined as

$\mathcal{L}_{adv} = -\mathbb{E}_x\big[\log D_I(D(E(x)))\big].$

The total loss for training the encoder and decoder networks is

$\mathcal{L}_{ED} = \mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv},$

where $\lambda_{adv}$ weights the adversarial term. To train the image discriminator, we use the non-saturating loss,

$\mathcal{L}_{D_I} = -\mathbb{E}_x\big[\log D_I(x)\big] - \mathbb{E}_x\big[\log\big(1 - D_I(D(E(x)))\big)\big].$
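A numerical sketch of these encoder/decoder training losses, under assumed notation: `x` is a batch of images, `x_rec` the reconstructions, and `d_real`/`d_fake` the image discriminator's probabilities that its input is real. The toy inputs are illustrative, not real data.

```python
import numpy as np

def l1_reconstruction_loss(x, x_rec):
    # Mean absolute difference between input images and reconstructions.
    return np.mean(np.abs(x_rec - x))

def adversarial_loss(d_fake):
    # Non-saturating generator-side loss: pushes D_I(reconstruction) toward 1.
    return -np.mean(np.log(d_fake + 1e-12))

def discriminator_loss(d_real, d_fake):
    # Non-saturating discriminator loss: real images vs. reconstructions.
    return -np.mean(np.log(d_real + 1e-12)) - np.mean(np.log(1.0 - d_fake + 1e-12))

# Toy example: all-zero images "reconstructed" as constant 0.5.
x = np.zeros((2, 8, 8, 3))
x_rec = np.full((2, 8, 8, 3), 0.5)
rec = l1_reconstruction_loss(x, x_rec)  # mean |0.5 - 0| = 0.5
```

The small epsilon inside the logarithms is a common numerical safeguard against log(0); it is not part of the loss definition itself.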
3.2 Code Generation Networks
The encoder and decoder networks can transform large input images into smaller latent codes and invert latent codes back into images. Thus, to generate new images from random input noise, we only need to generate small codes in latent space. All high resolution images in the dataset are transformed into latent codes by the encoder network, and only those codes are used in the code generation process. In this process, we aim to learn the distribution of the latent codes corresponding to the high resolution images.
Though the size of the latent codes differs from that of normal RGB images, neither the generator nor the discriminator requires special customization. Codes and images are both three-dimensional tensors, and the value range of the codes, fixed to [-1, 1] by the tanh at the end of the encoder network, is exactly the range commonly assumed by generator and discriminator networks. Therefore, the structure of our code generation networks is nearly the same as that of image generation networks, except that the output dimension matches the size of the latent codes. The architecture of the proposed code generation networks is displayed in Figure 3(b); it is identical to a simple image generator and discriminator.
To train the proposed code generation network, we use the WGAN-GP loss function [4], which has been shown to improve stability for image generation. The loss contains the Wasserstein GAN loss and the gradient penalty loss. The loss function used to train the generator $G$ is

$\mathcal{L}_G = -\,\mathbb{E}_{z}\big[C(G(z))\big],$

where $C$ is the code discriminator (critic) and $z$ is the input noise. The loss function for the code discriminator contains two parts. The first part is the Wasserstein loss,

$\mathcal{L}_{W} = \mathbb{E}_{z}\big[C(G(z))\big] - \mathbb{E}_{x}\big[C(c)\big],$

where $c$ is the latent code corresponding to image $x$, i.e. $c = E(x)$. The second part is the gradient penalty for random samples $\hat{c}$ interpolated between real and generated codes,

$\mathcal{L}_{GP} = \mathbb{E}_{\hat{c}}\Big[\big(\lVert\nabla_{\hat{c}} C(\hat{c})\rVert_2 - 1\big)^2\Big].$

The total loss function for our code discriminator is

$\mathcal{L}_C = \mathcal{L}_{W} + \lambda\,\mathcal{L}_{GP},$

where $\lambda$ is the gradient penalty weight used in our experiments.
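The WGAN-GP objectives can be sketched numerically. The sketch below assumes a toy linear critic C(c) = <w, c>, whose input gradient is the constant vector w, so the gradient penalty has a closed form; in the real method the critic is a network and the gradient comes from automatic differentiation. The weight lam = 10 is the default from the WGAN-GP paper, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(16)          # toy linear critic: C(c) = c . w
critic = lambda c: c @ w

def generator_loss(fake_codes):
    # Generator maximizes the critic's score on generated codes.
    return -np.mean(critic(fake_codes))

def wasserstein_loss(real_codes, fake_codes):
    # Critic separates real codes from generated ones.
    return np.mean(critic(fake_codes)) - np.mean(critic(real_codes))

def gradient_penalty(real_codes, fake_codes):
    # Sample points on lines between real and fake codes, then penalize the
    # deviation of the critic's input-gradient norm from 1 at those points.
    # For this linear critic the gradient is w everywhere, so the penalty
    # does not depend on the interpolation; it is kept to mirror the algorithm.
    eps = rng.uniform(size=(real_codes.shape[0], 1))
    interp = eps * real_codes + (1 - eps) * fake_codes
    grad_norm = np.linalg.norm(np.broadcast_to(w, interp.shape), axis=1)
    return np.mean((grad_norm - 1.0) ** 2)

real = rng.standard_normal((4, 16))
fake = rng.standard_normal((4, 16))
lam = 10.0  # gradient penalty weight; 10 is the WGAN-GP paper's value
d_loss = wasserstein_loss(real, fake) + lam * gradient_penalty(real, fake)
```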
3.3 Implementation Details
We propose to transform large images into small latent codes by training encoder and decoder networks. Inevitably, part of the information in the original images is lost in the encode/decode process, and the size of the latent code determines the reconstruction error: smaller latent codes make the code generation process faster, but decoding from them is less accurate. We encode images into three different sizes of latent codes; the networks are all trained using low resolution images. The reconstruction mean square errors corresponding to the different code sizes and resolutions are listed in Table 1. A smaller code size leads to a larger reconstruction error. We also notice that the reconstruction error decreases when the input resolution increases: even though the encoder and decoder networks are trained on low resolution images, they work even better for higher resolutions. In Figure 4, we display the reconstructed images from the three code sizes. The reconstructions from the two larger code sizes look nearly the same as the input images, while with the smallest code size the reconstructed images miss some details and look smooth. In our experiments, we adopt the code size that gives the best balance between reconstruction accuracy and the size of the latent codes.
Table 1: Reconstruction mean square errors for different code sizes and input resolutions.
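The trade-off described above can be illustrated with a toy round trip, assuming (as a stand-in, not the paper's networks) an average-pooling "encoder" and a nearest-neighbor "decoder": a larger downsampling factor, i.e. a smaller code, loses more detail and gives a larger MSE.

```python
import numpy as np

def round_trip_mse(img, factor):
    # Downsample by block averaging, upsample by nearest neighbor, compare.
    h, w = img.shape
    code = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    rec = code.repeat(factor, axis=0).repeat(factor, axis=1)
    return np.mean((rec - img) ** 2)

ramp = np.tile(np.arange(32.0), (32, 1))  # smooth horizontal gradient image
mse_small_code = round_trip_mse(ramp, 8)  # smaller code: more downsampling -> 5.25
mse_large_code = round_trip_mse(ramp, 4)  # larger code: less downsampling -> 1.25
```

Even on a perfectly smooth image the smaller code cannot preserve the within-block variation, mirroring the "smooth, missing details" reconstructions in Figure 4.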
Training encoder and decoder networks. Though the proposed framework targets accelerating high resolution image synthesis, the encoder and decoder networks can be trained using low resolution images, which is what we do in our experiments for efficiency. Note that the encoder and decoder networks can be applied to any other resolution, as both of them are fully convolutional neural networks. As shown in Table 1, the reconstruction error is even lower for high resolution inputs than for low resolution inputs, though the same encoder and decoder networks are used. Therefore, the encoder and decoder networks only need to be trained once using low resolution images, no matter what resolution we want to generate at.
Normalization techniques are not used in our encoder and decoder networks, as we find they can be trained easily and stably without them. In the generator network, we use pixel normalization [6] after convolutional operations. As pointed out by [6], and as we also observe in our experiments, pixel normalization does not seem to change the results much; we add it simply to prevent possible escalation of signal magnitudes during training.
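Pixel normalization, as introduced in the Progressive GAN paper, rescales each pixel's feature vector to roughly unit average magnitude across channels. A minimal numpy version (channel-last layout assumed):

```python
import numpy as np

def pixel_norm(a, eps=1e-8):
    # a: feature map of shape (H, W, C); normalize each pixel's feature
    # vector by the root mean square of its channel activations.
    return a / np.sqrt(np.mean(a ** 2, axis=-1, keepdims=True) + eps)

feat = np.random.default_rng(0).standard_normal((4, 4, 8))
normed = pixel_norm(feat)
# After normalization, the mean squared activation per pixel is ~1,
# which bounds signal magnitudes regardless of how the weights grow.
```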
4 Experiments
To measure whether the proposed framework is able to accelerate the training of high resolution image generation, we build a traditional image generation network and accelerate it using the proposed framework. We mainly compare the results of the network before and after applying the proposed acceleration framework, in terms of image quality, quantitative measurement and training speed.
4.1 Experimental Settings
The image generation network is nearly the same as Figure 3(b), except that the dimension of the output from the generator and the input to the discriminator becomes the size of the images rather than the codes. Applying the proposed acceleration framework to this image generation network, we train a code generation network with exactly the same structure but with the output size of the latent codes. By comparison, the image generator and discriminator contain more layers than the code generator and discriminator, whereas the general structures of both are identical. The layer details of the networks before and after acceleration at the highest resolution are listed in Table 2.
Table 2: Layer details of the networks before and after acceleration.
The dataset we use is the CelebA-HQ dataset [6], which contains 30000 face images at 1024x1024 resolution. We resize the images to lower resolutions and learn to generate images at three resolutions respectively. We display the generated images before and after acceleration, and evaluate the generated samples quantitatively by calculating the Fréchet Inception Distance (FID) [5]. We do not calculate the Inception score [12], as we only generate faces using the CelebA-HQ dataset rather than images in multiple categories. In addition, FID is considered to be consistent with human evaluation in terms of measuring the realism and variation of generated images [5].
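For reference, the FID between two Gaussians N(mu1, s1) and N(mu2, s2) fitted to Inception features is ||mu1 - mu2||^2 + Tr(s1 + s2 - 2 (s1 s2)^(1/2)). A numpy-only sketch of this formula (the feature extraction itself is out of scope here):

```python
import numpy as np

def sqrtm_psd(m):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition.
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def fid(mu1, s1, mu2, s2):
    # Tr((s1 s2)^(1/2)) is computed through the symmetric product
    # s1^(1/2) s2 s1^(1/2), which has the same trace of square root.
    s1_half = sqrtm_psd(s1)
    covmean = sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

# With identical covariances the trace term vanishes and
# FID reduces to the squared distance between the means.
mu1, mu2 = np.zeros(3), np.array([1.0, 2.0, 2.0])
sigma = np.eye(3)
d = fid(mu1, sigma, mu2, sigma)  # -> 9.0
```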
All experiments are run on one NVIDIA P100 graphic card with 16GB memory, and our CPU is Intel Xeon E5-2682 V4 @ 2.5GHz.
4.2 Training Speed
After acceleration, the generative network only needs to generate small latent codes instead of large images; the width of the codes in our experiments is one fourth of the width of the original images. As a result, the code generator and discriminator have fewer layers and smaller feature maps than their image counterparts. Note that the proposed framework has to train additional encoder and decoder networks. As described in the previous section, we can train these two networks using low resolution images, and they converge relatively fast: it takes only about 4 hours and 30 minutes to train the encoder and decoder networks, much less than training high resolution image generator and discriminator networks. In addition, for any other resolution or dataset, we only have to train the encoder and decoder once. Therefore, we compare only the speed of training the generative networks.
The training speed for one epoch (feeding in 30000 images) before and after acceleration is listed in Table 3. For all three resolutions, training is accelerated by more than two times. At the highest resolution, the training speed after acceleration is 5 times faster than before, which makes it possible to train the image generator in about 3 days using only one P100 graphic card.
Table 3: Training time for one epoch (30000 images) before and after acceleration.
| Resolution | Before acceleration | After acceleration |
| | 30 minutes | 8 minutes |
| | 56 minutes | 22 minutes |
| | 225 minutes | 45 minutes |
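The speedup factors implied by the per-epoch times in Table 3 can be checked directly (times in minutes, in the order the table lists them):

```python
# Minutes per epoch of 30000 images, before vs. after acceleration.
before = [30, 56, 225]
after = [8, 22, 45]
speedups = [b / a for b, a in zip(before, after)]
# e.g. 225 / 45 = 5.0 at the slowest (highest-resolution) setting
```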
4.3 Qualitative Evaluation
A good acceleration method should increase the speed while keeping the quality. We first qualitatively analyze the generated samples before and after acceleration. High resolution images at the three resolutions are displayed in Figures 5, 6 and 7. At the two lower resolutions, the traditional network is able to generate reasonable, good results; at the highest resolution, however, the generated samples contain many artifacts. We consider that the large number of parameters and the size of the feature maps significantly increase the difficulty of network training at the highest resolution. In contrast, after being accelerated by the proposed framework, the network is able to generate good samples at all resolutions. In addition, the general appearance of the samples after acceleration looks more natural, with fewer artifacts than before.
The results show that the proposed acceleration framework does not lower the quality of the original network. On the contrary, the quality of the generated samples increases after acceleration. Our framework converts large image generation into small code generation, which gives the networks fewer parameters to learn and makes them easier to converge. Therefore, after acceleration, the network shows better stability and generates high resolution images with satisfying quality.
Table 4: FIDs before and after acceleration for the three resolutions.
4.4 Quantitative Evaluation
We further evaluate the generated samples quantitatively. We randomly generate 50k samples after training, and the FIDs corresponding to the different resolutions before and after acceleration are calculated. From Table 4, we can see that the FID decreases after applying the proposed acceleration method. Before acceleration, the network generates good samples with relatively low FIDs at the two lower resolutions but fails at the highest one, where the FID is very large. The proposed method decreases the FIDs at all three resolutions, with significant improvements at two of them. In conclusion, the quality of the generated samples is well improved by the proposed acceleration framework, which is in accordance with the qualitative evaluation.
4.5 Other Datasets
We further test the proposed method on other datasets. LSUN datasets [17] are used to train the networks. As mentioned earlier, the encoder and decoder networks only need to be trained once, so we do not retrain them on these specific datasets; instead, we directly use the encoder and decoder networks trained on the CelebA-HQ dataset. Even so, the proposed framework is able to generate reasonable samples, as displayed in Figures 8 and 9. In addition, we report the FIDs in Table 5. The proposed acceleration method improves the quality on all datasets as well.
| Method | FID | Training time per epoch |
| without acceleration | 54.83 | 225 minutes |
5 Limitations and Discussions
The target of the proposed method is to accelerate the training of high resolution image generation without lowering the quality. Therefore, we did not spend much effort on building complex networks; we built a relatively simple generative network and tested the results before and after applying the proposed acceleration framework. The proposed method enables generating promising samples at high resolutions within a short training time. However, the generated images are not as good as those of recent advanced image generation structures. More sophisticated generative networks should be used to reach the quality of state-of-the-art methods in our future work.
In our experiments, we adopt a moderate code size; in fact, the code can be much smaller. We adopt this size for the best quality of generated images. At the highest resolution, we test smaller code sizes in Table 6: even for the smallest code size, the FID is much lower than before acceleration, and the training speed is the fastest.
6 Conclusion
In this paper, we propose an acceleration framework for high resolution image generation. Encoder and decoder networks are first trained to transform large images into small latent codes; the code generation networks then learn to generate small codes in latent space. The training process is highly accelerated and network stability is well improved. Experimental results show that the proposed framework makes training two to five times faster, and improves the generated image quality as well. The proposed acceleration framework makes it possible to generate satisfying high resolution images with less training time and limited hardware resources.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406, 2018.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning, pages 2642–2651. JMLR.org, 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
-  E. Smith and D. Meger. Improved adversarial systems for 3d object generation and reconstruction. arXiv preprint arXiv:1707.09557, 2017.
-  T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
-  J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pages 82–90, 2016.
-  B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni. 3d object reconstruction from a single depth view with adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 679–688, 2017.
-  F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
-  H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916, 2017.