An Acceleration Framework for High Resolution Image Synthesis

09/09/2019 · Jinlin Liu, et al.

Synthesis of high resolution images using Generative Adversarial Networks (GANs) is challenging, usually requiring multiple high-end graphics cards with large memory and long training times. In this paper, we propose a two-stage framework to accelerate the training process of synthesizing high resolution images. High resolution images are first transformed to small codes via trained encoder and decoder networks. The codes in latent space are many times smaller than the original high resolution images. Then, we train a code generation network to learn the distribution of the latent codes. In this way, the generator only learns to generate small latent codes instead of large images. Finally, we decode the generated latent codes to image space via the decoder network so as to output the synthesized high resolution images. Experimental results show that the proposed method accelerates the training process significantly and increases the quality of the generated samples. The proposed acceleration framework makes it possible to generate high resolution images using less training time with limited hardware resources. With the proposed acceleration method, it takes only 3 days to train a 1024×1024 image generator on the CelebA-HQ dataset using just one NVIDIA P100 graphics card.


1 Introduction

Generative Adversarial Networks (GANs) are developed to learn the distribution of input data and then generate new samples from the learned distribution [3]. Recently, GANs have been applied to many tasks and have achieved impressive results, such as image enhancement (Demir and Unal 2018; Yu et al. 2018; Nazeri et al. 2019; Ledig et al. 2017; Zhang et al. 2019; Kupyn et al. 2018), high resolution image generation [2, 6, 7, 14], and 3D generation and reconstruction [15, 13, 16]. Training GANs is unstable and sensitive to hyper-parameters [18]. Different loss functions and network structures have been developed to train powerful GANs for better quality and stability [1, 2, 4, 9, 6, 12, 19, 18].

Generation of high resolution images was difficult for early works, but recent advances in the community have made it possible to generate high quality images even at 1024×1024 resolution [2, 6, 7]. Even so, we find that training GANs to generate images at high resolutions is both time consuming and computationally intensive; high-end graphics cards and long training times are required. Karras et al. [6] report that it takes an NVIDIA DGX-1 with 8 Tesla V100 GPUs 4 days to train generative networks at 1024×1024 resolution. If only one graphics card is available, it would take 14 days for [6] and 41 days for [7]. BigGAN [2] uses a powerful Google TPU v3 Pod with 128 to 512 cores to scale up GANs.

We propose an acceleration framework in this paper to increase the efficiency of training GANs at high resolutions. We manage to generate small codes in latent space instead of large images, as demonstrated in Figure 2. A traditional structure generates images from input noise directly, which becomes challenging to train as the image resolution increases. In our framework, we propose a different, two-stage approach. Encoder and decoder networks are first trained to transform large images into small latent codes. The generative network then only learns to generate small codes from noise in the latent space. Newly generated code samples can easily be transformed into high resolution images via the trained decoder network.
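To make the two-stage idea concrete, the following is a minimal PyTorch sketch of the pipeline. The stand-in encoder, decoder and generator modules, the 4x downsampling factor and all shapes are illustrative assumptions, not the paper's exact architectures (those follow Figure 3).

import torch
import torch.nn as nn

# Stand-in networks; the real architectures follow Figure 3.
encoder = nn.Sequential(nn.Conv2d(3, 3, 4, stride=4), nn.Tanh())          # image -> code (1/4 width)
decoder = nn.Sequential(nn.ConvTranspose2d(3, 3, 4, stride=4), nn.Tanh()) # code -> image
generator = nn.Sequential(nn.ConvTranspose2d(64, 3, 64), nn.Tanh())       # noise -> code

images = torch.rand(2, 3, 256, 256) * 2 - 1      # toy batch scaled to [-1, 1]

# Stage 1 (assumed already trained, then frozen): images <-> small codes.
with torch.no_grad():
    codes = encoder(images)                       # (2, 3, 64, 64): 16x fewer values
    recon = decoder(codes)                        # back to (2, 3, 256, 256)

# Stage 2: a GAN is trained on the codes only (Section 3.2); here we just sample.
noise = torch.randn(2, 64, 1, 1)
fake_codes = generator(noise)                     # (2, 3, 64, 64)
samples = decoder(fake_codes)                     # decoded high resolution samples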

The proposed acceleration framework increases training efficiency for two reasons. First, the latent codes are many times smaller than the original large images, which means fewer layers and smaller feature maps in both the generator and discriminator networks. Second, although additional encoder and decoder networks have to be trained in the first step, they can be trained quickly using low resolution images, as both networks are fully convolutional. For different resolutions, the encoder and decoder networks only need to be trained once. As a result, the proposed acceleration framework trains the networks without feeding high resolution images in every training step.

Finally, we build a traditional image generative network and accelerate it using the proposed framework. Experimental results are promising: the training speed is two to five times faster across the tested resolutions. In addition, the quality of the generated samples after acceleration is clearly improved, benefiting from the better stability of learning the distribution of small codes in latent space rather than large images.

2 Related Work

Training high resolution GANs is challenging and prone to gradient problems, as the discriminator easily distinguishes fake images from real ones [11]. To train high resolution GANs stably, various techniques have been proposed. Gulrajani et al. [4] used a gradient penalty to avoid gradient problems and improve training stability. Spectral normalization was used by [10] to stabilize the training of the discriminator network. Zhang et al. [18] introduced a self-attention module, which enables the network to model long range, multi-level dependencies across image regions; the self-attention mechanism helps improve detail and quality. Karras et al. [6, 7] trained the networks progressively, starting from low resolution and growing to large resolutions. Both the generator and discriminator grow during the training process, which increases training stability significantly, and they manage to generate images at 1024×1024 resolution with high fidelity. Brock et al. [2] used powerful computational resources and scaled up GANs using very large batch sizes and parameter counts.

The proposed method accelerates the training of GANs without modifying either the network structures or the loss function. We train additional encoder and decoder networks to convert image generation into small latent code generation. Variational Autoencoders (VAEs) [8] also train encoder and decoder networks to generate images. When training the encoder and decoder networks, VAEs force the code in latent space to obey a particular distribution (e.g. the standard normal distribution), and full resolution images are used in every training step. In the proposed framework, the encoder and decoder networks are trained independently using low resolution images, and their parameters are fixed after training. Training images are then transformed to latent codes, and a generative network is trained to learn the distribution of those codes. The proposed method does not use full resolution images in every training step. Both the objective and the training procedure of the proposed method differ from VAE methods.

Figure 2: The architecture of (a) a traditional image generation structure and (b) the proposed acceleration framework. A traditional structure generates high resolution images from noise directly. The proposed acceleration framework first trains encoder and decoder networks in a supervised way. High resolution images are transformed to small latent codes. Then a code generator is trained to learn the distribution of codes in latent space. Finally, newly generated code samples are transformed to high resolution images by the trained decoder network. The proposed framework manages to generate small codes rather than large images, which makes the generation of high resolution images using GANs easier and faster.
Figure 3: The architecture of the encoder and decoder networks and the code generation networks.

3 The Acceleration Framework

The key to the proposed acceleration framework for high resolution generation lies in generating small codes in latent space instead of large images in image space. We first train encoder and decoder networks: the encoder transforms images to latent codes, and the decoder transforms codes back to images. A GAN is then trained to generate codes in latent space, which can easily be transformed to images by the trained decoder network. The proposed acceleration framework is demonstrated in Figure 2.

3.1 Encoder and Decoder Networks

The framework of the proposed encoder and decoder networks is demonstrated in Figure 3(a). The whole framework is composed of three parts: an encoder network E, a decoder network D and an image discriminator network D_I. Several downsampling operations in the encoder network transform an input image x to a latent code z = E(x). The decoder network contains upsampling operations that convert latent codes back to images. We add a tanh operation at the end of the encoder network to force the values of the latent codes into [-1, 1].
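A minimal sketch of such encoder and decoder networks, assuming two stride-2 convolutions (a 4x width reduction), a 3-channel code and LeakyReLU activations; the channel counts, layer counts and the tanh on the decoder output are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Two stride-2 convolutions give a 4x width reduction; tanh bounds codes in [-1, 1].
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),   # 3-channel code in [-1, 1]
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    # Mirror of the encoder: two 2x upsampling steps recover the input resolution.
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(3, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

x = torch.rand(1, 3, 256, 256) * 2 - 1
z = Encoder()(x)        # (1, 3, 64, 64)
y = Decoder()(z)        # (1, 3, 256, 256)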

The objective is to minimize the reconstruction error, for which we adopt the L1 loss,

\mathcal{L}_{rec} = \mathbb{E}_{x}\big[ \lVert D(E(x)) - x \rVert_1 \big],   (1)

where x is the input image. To improve the reconstruction quality, we add an adversarial loss by introducing an image discriminator network. The image discriminator predicts whether images are real or not. The adversarial loss for the encoder and decoder networks is defined as

\mathcal{L}_{adv} = -\mathbb{E}_{x}\big[ \log D_I(D(E(x))) \big].   (2)

The total loss for training the encoder and decoder networks is

\mathcal{L}_{E,D} = \mathcal{L}_{rec} + \lambda \, \mathcal{L}_{adv},   (3)

where \lambda balances the two terms. To train the image discriminator, we use the non-saturating loss,

\mathcal{L}_{D_I} = -\mathbb{E}_{x}\big[ \log D_I(x) \big] - \mathbb{E}_{x}\big[ \log\big(1 - D_I(D(E(x)))\big) \big].   (4)
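The losses above might be computed as in the following sketch, assuming logit-valued discriminator outputs and an illustrative weighting factor lam; the helper name encdec_losses is ours, not the paper's.

import torch
import torch.nn.functional as F

def encdec_losses(x, E, D, D_img, lam=1.0):
    # Eq. (1): L1 reconstruction error between input and decoded image.
    recon = D(E(x))
    l_rec = F.l1_loss(recon, x)
    # Eq. (2): adversarial term pushing reconstructions to look real to D_img.
    logits_fake = D_img(recon)
    l_adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    # Eq. (3): total encoder/decoder loss.
    l_enc_dec = l_rec + lam * l_adv
    # Eq. (4): non-saturating loss for the image discriminator.
    logits_real = D_img(x)
    logits_fake_d = D_img(recon.detach())
    l_disc = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake_d, torch.zeros_like(logits_fake_d)))
    return l_enc_dec, l_disc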

3.2 Code Generation Networks

The encoder and decoder networks can transform large input images into smaller latent codes and invert latent codes back to images. Thus, to generate new images from random input noise, we only need to generate small codes in latent space. All high resolution images in the dataset are transformed to latent codes by the encoder network, and only those codes are used in the code generation process. In this process, we aim to learn the distribution of the latent codes corresponding to high resolution images.
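As a sketch of this precomputation step (the stand-in encoder and all sizes are illustrative assumptions): the trained encoder is frozen and run once over the dataset, and GAN training afterwards touches only the resulting code tensors.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

encoder = nn.Sequential(nn.Conv2d(3, 3, 4, stride=4), nn.Tanh())   # stand-in trained encoder
for p in encoder.parameters():
    p.requires_grad_(False)                       # parameters are fixed after stage 1

images = torch.rand(100, 3, 256, 256) * 2 - 1     # toy stand-in for the image dataset
loader = DataLoader(TensorDataset(images), batch_size=16)

with torch.no_grad():
    codes = torch.cat([encoder(x) for (x,) in loader])   # (100, 3, 64, 64)

# The code generator / discriminator of Section 3.2 train on this loader alone.
code_loader = DataLoader(TensorDataset(codes), batch_size=16, shuffle=True)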

Though the size of the latent codes differs from that of normal RGB images, the structure of the generator and discriminator requires no special customization. Codes and images are both three-dimensional tensors, and the tanh at the end of the encoder bounds the code values to [-1, 1], exactly the range commonly taken by generator and discriminator networks. Therefore, the structure of our code generation networks is nearly the same as that of image generation networks, except that the dimension of the output is the size of the latent codes. The architecture of the proposed code generation networks is displayed in Figure 3(b); the structure is identical to a simple image generator and discriminator.

To train the proposed code generation network, we use the WGAN-GP loss function [4], which has been shown to improve stability for image generation. The loss function contains the Wasserstein GAN loss and the gradient penalty loss. Denoting the code generator by G, the code discriminator by C and the input noise by n, the loss used to train the generator is

\mathcal{L}_{G} = -\mathbb{E}_{n}\big[ C(G(n)) \big].   (5)

The loss function for the code discriminator contains two parts. The first part is the Wasserstein loss,

\mathcal{L}_{W} = \mathbb{E}_{n}\big[ C(G(n)) \big] - \mathbb{E}_{x}\big[ C(z) \big],   (6)

where z is the latent code corresponding to image x, i.e. z = E(x). The second part is the gradient penalty for a random sample \hat{z} interpolated between real and generated codes,

\mathcal{L}_{GP} = \mathbb{E}_{\hat{z}}\big[ (\lVert \nabla_{\hat{z}} C(\hat{z}) \rVert_2 - 1)^2 \big].   (7)

In experiments, we set the gradient penalty weight \lambda_{gp} as in [4]. The total loss function for our code discriminator is

\mathcal{L}_{C} = \mathcal{L}_{W} + \lambda_{gp} \, \mathcal{L}_{GP}.   (8)
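A sketch of Eqs. (5)-(8) in PyTorch, assuming G maps noise to codes and the critic C outputs a scalar score per code; the function name and the default lam_gp = 10 (the value used in [4]) are our assumptions.

import torch

def code_gan_losses(G, C, real_codes, noise, lam_gp=10.0):
    fake = G(noise)
    # Eq. (5): the generator tries to raise the critic score of generated codes.
    loss_g = -C(fake).mean()
    # Eq. (6): Wasserstein loss for the code discriminator (critic).
    loss_w = C(fake.detach()).mean() - C(real_codes).mean()
    # Eq. (7): gradient penalty on random interpolates between real and fake codes.
    eps = torch.rand(real_codes.size(0), 1, 1, 1, device=real_codes.device)
    z_hat = (eps * real_codes + (1 - eps) * fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(C(z_hat).sum(), z_hat, create_graph=True)[0]
    loss_gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    # Eq. (8): total critic loss.
    loss_c = loss_w + lam_gp * loss_gp
    return loss_g, loss_c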

3.3 Implementation Details

Code size.

We propose to transform large images into small latent codes by training encoder and decoder networks. Inevitably, part of the information in the original images is lost in the encoding and decoding process, and the size of the latent code determines the reconstruction error: smaller latent codes make the code generation process faster, but decoding from them is less accurate. We encode the images to three different sizes of latent codes, with all networks trained on low resolution images. The reconstruction mean square errors corresponding to different code sizes and input resolutions are listed in Table 1. A smaller code size leads to a larger reconstruction error. We also notice that the reconstruction error decreases as the input resolution increases: even though the encoder and decoder networks are trained on low resolution images, they work even better at higher resolutions. In Figure 4, we display the reconstructions from the three code sizes. The reconstructed images from the two larger code sizes look nearly the same as the input images, while with the smallest code size the reconstructions miss some details and look smooth. In our experiments, we adopt a code whose width is one fourth of the image width, as the best balance between reconstruction accuracy and code size.

Code size  | Reconstruction MSE at increasing input resolutions
largest    | 2.7e-4 | 2.0e-4 | 1.5e-4 | 1.1e-4
medium     | 1.3e-3 | 1.0e-3 | 7.5e-4 | 3.5e-4
smallest   | 3.4e-3 | 2.8e-3 | 2.0e-3 | 8.8e-4
Table 1: Reconstruction mean square error (MSE) of the encoder and decoder networks with different code sizes and input resolutions.
Figure 4: Reconstructed images using different code sizes. (a) Input images. (b)-(d) Reconstructed images using progressively smaller code sizes. The reconstructed images become smooth with the smallest code size.

Training encoder and decoder networks.

Though the proposed framework targets accelerating high resolution image synthesis, the encoder and decoder networks can be trained using low resolution images, and in our experiments they are trained on low resolution images for efficiency. Note that the encoder and decoder networks can be applied to any other resolution, as both are fully convolutional neural networks. As shown in Table 1, the reconstruction error is even lower for high resolution inputs than for low resolution inputs, though the same encoder and decoder networks are used. Therefore, the encoder and decoder networks only need to be trained once on low resolution images, regardless of the resolution at which we want to generate.
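Because both networks are fully convolutional, the same trained weights apply at any input resolution, with the code scaling proportionally. A toy check with a stand-in 4x-downsampling encoder (an illustrative assumption, not the paper's architecture):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 3, 4, stride=4), nn.Tanh())  # stand-in, 4x downsample

# One set of weights, three input resolutions: the code width is always one fourth.
for size in (256, 512, 1024):
    x = torch.rand(1, 3, size, size)
    print(size, tuple(encoder(x).shape))   # 256 -> (1, 3, 64, 64), 512 -> (1, 3, 128, 128), ...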

Normalization techniques.

Normalization techniques are not used in our encoder and decoder networks, as we find they can be trained easily and stably without them. In the generator network, we use pixel normalization [6] after convolutional operations. As pointed out by [6], and as we also observe in our experiments, pixel normalization does not seem to change the results much; we add it simply to prevent possible escalation of signal magnitudes during training.
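Pixel normalization, as introduced by Karras et al. [6], normalizes each spatial position's feature vector to unit average magnitude over channels. A short sketch (eps is the small constant from [6]):

import torch

def pixel_norm(x, eps=1e-8):
    # Divide each pixel's feature vector by its root-mean-square over channels.
    return x * torch.rsqrt((x * x).mean(dim=1, keepdim=True) + eps)

feat = torch.randn(2, 64, 32, 32)
out = pixel_norm(feat)   # applied after each convolution in the generator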

4 Experiments

To measure whether the proposed framework accelerates the training of high resolution image generation, we build a traditional image generation network and accelerate it using the proposed framework. We mainly compare the results of the networks before and after applying the acceleration framework, in terms of image quality, quantitative measurement and training speed.

4.1 Experimental Settings

The image generation network is nearly the same as Figure 3(b), except that the dimension of the output from the generator and of the input to the discriminator matches the image size rather than the code size. Applying the proposed acceleration framework to this image generation network, we train a code generation network with exactly the same structure but with output size equal to the size of the latent codes. By comparison, the image generator and discriminator contain more layers than the code generator and discriminator, whereas the general structures of the two methods are identical. The layer details of the networks before and after acceleration are listed in Table 2.

Layers | Before acceleration | After acceleration
Generator: dense layer, conv layers, output
Discriminator: conv layers, dense layer, output
Table 2: Feature map sizes of the generator and discriminator layers before and after acceleration.

The dataset we use is the CelebA-HQ dataset [6], which contains 30000 faces at 1024×1024 resolution. We resize the images to 256×256 and 512×512, and learn to generate images at the three resolutions respectively, i.e. 256×256, 512×512 and 1024×1024. We display the generated images before and after acceleration, and evaluate the generated samples quantitatively by calculating the Fréchet Inception Distance (FID) [5]. We do not calculate the Inception score [12], as we only generate faces from the CelebA-HQ dataset rather than images in multiple categories. In addition, FID is considered consistent with human evaluation in terms of measuring the realism and variation of the generated images [18].

All experiments are run on one NVIDIA P100 graphics card with 16GB memory, and our CPU is an Intel Xeon E5-2682 V4 @ 2.5GHz.

4.2 Training Speed

After acceleration, the generative network only needs to generate small latent codes instead of large images; the width of the codes in our experiments is one fourth of the width of the original images. As a result, the code generator and discriminator have fewer layers and smaller feature maps than before acceleration. Note that the proposed framework has to train additional encoder and decoder networks. As described in the previous section, these two networks can be trained using low resolution images and converge relatively fast: it takes only about 4 hours and 30 minutes to train the encoder and decoder networks, much less than training high resolution image generator and discriminator networks. In addition, for any other resolutions or datasets, we only have to train the encoder and decoder once. Therefore, we compare only the speed of training the generative networks.

The training speed, measured as the time to run one epoch (feeding in 30000 images) before and after acceleration, is listed in Table 3. For all three resolutions, training is accelerated by more than a factor of two. At 1024×1024 resolution, training after acceleration is 5 times faster than before, which makes it possible to train a 1024×1024 image generator in 3 days using only one P100 graphics card.

Resolution | Before acceleration | After acceleration
256×256    | 30 minutes  | 8 minutes
512×512    | 56 minutes  | 22 minutes
1024×1024  | 225 minutes | 45 minutes
Table 3: The running time of training one epoch (feeding in 30000 images).
(a) Before acceleration (FID=23.62)
(b) After acceleration (FID=20.78)
Figure 5: Randomly generated 256×256 images without manual selection (a) before and (b) after acceleration. The quality of the generated samples after acceleration is comparable with those before acceleration.
(a) Before acceleration (FID=30.43)
(b) After acceleration (FID=14.72)
Figure 6: Randomly generated 512×512 images without manual selection (a) before and (b) after acceleration. The generated samples after acceleration look better, with fewer artifacts than those before acceleration.

4.3 Qualitative Evaluation

A good acceleration method should increase speed while preserving quality. We first analyze the generated samples qualitatively before and after acceleration. High resolution images at 256×256, 512×512 and 1024×1024 are generated and displayed in Figures 5, 6 and 7 respectively. At 256×256 and 512×512, the traditional network is able to generate reasonable, good results. However, at 1024×1024 the generated samples contain many artifacts; we consider that the large number of parameters and the size of the feature maps increase the difficulty of training significantly at this resolution. In contrast, after being accelerated by the proposed framework, the network generates good samples at all resolutions. In addition, the general appearance of the samples after acceleration looks more natural, with fewer artifacts than before.

The results show that the proposed acceleration framework does not lower the quality of the original network. On the contrary, the quality of the generated samples increases after acceleration. Our framework converts large image generation into small code generation, which leaves the networks fewer parameters to learn and makes them easier to converge. Therefore, after acceleration, the network shows better stability in generating high resolution images with satisfying image quality.

(a) Before acceleration (FID=54.83)
(b) After acceleration (FID=14.80)
Figure 7: Randomly generated 1024×1024 images without manual selection (a) before and (b) after acceleration. The quality of the generated samples after acceleration is significantly better than before acceleration.
Resolution | Before acceleration | After acceleration
256×256    | 23.62 | 20.78
512×512    | 30.43 | 14.72
1024×1024  | 54.83 | 14.80
Table 4: FID before and after acceleration.

4.4 Quantitative Evaluation

We further evaluate the generated samples quantitatively. We randomly generate 50k samples after training, and the FIDs corresponding to different resolutions before and after acceleration are calculated. From Table 4, we can see that the FID decreases after applying the proposed acceleration method. Before acceleration, the network generates good samples with relatively low FIDs at 256×256 and 512×512, but fails at higher resolution: the FID is very large for 1024×1024 samples. The proposed method decreases the FIDs at all three resolutions, and for 512×512 and 1024×1024 the improvements are significant. In conclusion, the quality of the generated samples is clearly improved by the proposed acceleration framework, which is in accordance with the qualitative evaluation.

4.5 Other Datasets

We further test the proposed method on other datasets. LSUN datasets [17] are used to train the networks. As mentioned in Section 3, the encoder and decoder networks only need to be trained once; thus, we do not retrain them on these specific datasets but directly use the encoder and decoder networks trained on the CelebA-HQ dataset. Even so, the proposed framework generates reasonable samples, as displayed in Figures 8 and 9. In addition, we report the FIDs in Table 5. The proposed acceleration method improves the quality on all datasets as well.

(a) Before acceleration (FID=38.32)
(b) After acceleration (FID=17.90)
Figure 8: Randomly generated samples using the LSUN bedroom dataset without manual selection (a) before and (b) after acceleration. The quality of the generated samples is improved after acceleration.
(a) Before acceleration (FID=30.99)
(b) After acceleration (FID=21.91)
Figure 9: Randomly generated samples using the LSUN church dataset without manual selection (a) before and (b) after acceleration. The quality of the generated samples is improved after acceleration.
LSUN dataset        | cat   | airplane | bus   | bird  | bedroom | church
Before acceleration | 66.90 | 72.53    | 25.75 | 62.44 | 38.32   | 30.99
After acceleration  | 34.25 | 25.78    | 15.29 | 28.42 | 17.90   | 21.91
Table 5: FID on LSUN datasets before and after acceleration.
Settings                              | FID   | Running time (one epoch)
Without acceleration                  | 54.83 | 225 minutes
With acceleration, adopted code size  | 14.80 | 45 minutes
With acceleration, smaller code size  | 17.71 | 19 minutes
With acceleration, smallest code size | 22.12 | 9 minutes
Table 6: FID and the running time of training one epoch under different settings.

5 Limitations and Discussions

The goal of the proposed method is to accelerate the training of high resolution image generation without lowering quality; we therefore did not spend much time building complex networks. We build a relatively simple generative network and test the results before and after applying the proposed acceleration framework. The proposed method enables generating promising samples at high resolutions within a short training time. However, the generated images are not as good as those of recent advanced image generation structures such as [7]. More sophisticated generative networks should be used to reach the quality of state-of-the-art methods in our following work.

In our experiments, we adopt a code width one fourth of the image width; in fact, the code can be much smaller. We adopt this size for the best quality of generated images. For 1024×1024 resolution, we test smaller code sizes in Table 6. Even with the smallest code size, the FID is much lower than without acceleration and the training speed is the fastest.

The input noise latent space interpolation results are displayed in Figure 10. The generated images change smoothly, which shows that the proposed framework does not simply memorize training samples.

Figure 10: Input noise latent space interpolation results.

6 Conclusion

We propose an acceleration framework for high resolution image generation in this paper. Encoder and decoder networks are first trained to transform large images into small latent codes; the code generation networks then learn to generate small codes in latent space. The training process is highly accelerated and network stability is clearly improved. Experimental results show that the proposed framework makes training two to five times faster, and improves the quality of the generated images as well. The proposed acceleration framework makes it possible to generate satisfying high resolution images using less training time with limited hardware resources.

References