Decoupled Learning for Conditional Adversarial Networks
Incorporating encoding-decoding nets with adversarial nets has been widely adopted in image generation tasks. We observe that state-of-the-art results are obtained by carefully balancing the reconstruction loss and the adversarial loss, and that this balance shifts with network structure, dataset, and training strategy. Empirical studies have demonstrated that an inappropriate weight between the two losses may cause instability, and it is tricky to search for the optimal setting, especially when lacking prior knowledge of the data and network. This paper makes the first attempt to relax the need for manual balancing by proposing the concept of decoupled learning, where a novel network structure is designed that explicitly disentangles the backpropagation paths of the two losses. Experimental results demonstrate the effectiveness, robustness, and generality of the proposed method. The other contribution of the paper is a new evaluation metric for measuring the image quality of generative models. We propose the normalized relative discriminative score (NRDS), which introduces the idea of relative comparison, rather than providing absolute estimates like existing metrics.
Generative adversarial networks (GANs) constitute an adversarial framework that generates images from noise while preserving high fidelity. However, generating random images from noise does not meet the requirements of many real applications, e.g., image inpainting, image transformation [4, 21], and image manipulation [23, 25]. To overcome this problem, recent works like [19, 13] concatenate additional features generated by an encoder or certain extractor to the random noise, or directly replace the noise with the features. In most recent practice, encoding-decoding networks (ED), e.g., VAE, AAE, etc., have become the popular structure to incorporate with GANs for image-conditional modeling, where the encoder extracts features that are then fed to the decoder/generator to generate the target images. The encoding-decoding network tends to yield blurry images. Incorporating a discriminator, as empirically demonstrated in many works [6, 4, 7, 23, 9, 26, 24], effectively increases the quality (i.e., realism and resolution) of images generated by encoding-decoding networks. In recent years, the adversarial loss has become a common regularizer for boosting image quality, especially in image generation tasks.
In existing works that incorporate encoding-decoding networks (ED) with GANs, the decoder of ED and the generator of GAN share the same network and parameters; thus the reconstruction loss (from ED) and the adversarial loss (from the discriminator) are both imposed on the decoder/generator. Although ED is known to be stable in training, and many variants of GANs, e.g., DCGAN, WGAN, LSGAN, etc., have stabilized the training of GANs, coupling the reconstruction loss and the adversarial loss by making them interact with each other may yield unstable results, e.g., introducing artifacts as shown in Fig. 1.
We observe increased details in the generated images as compared to the image generated from ED only (the top row in Fig. 1, where the weight of the adversarial loss is 0). However, we also observe obvious artifacts introduced by adding the adversarial loss (e.g., the first and second faces generated with larger adversarial weights). A higher weight on the adversarial loss preserves richer details in generated images but suffers a higher risk of introducing significant artifacts or even causing instability, while a lower weight on the adversarial loss does not effectively boost the image fidelity. Generally, the trade-off between the two losses needs to be carefully tuned; otherwise, the generated images may present significant artifacts, e.g., stripes, spots, or anything visually unrealistic.
Existing works generally arrive at an appropriate weight between the two losses by conducting extensive empirical studies, and yet this weight may vary with the network structure and the dataset used.
In this paper, we make the first attempt to relax the manual balancing between the two losses by proposing a novel decoupled learning structure. Moving away from the traditional routine of incorporating ED and GAN, decoupled learning explicitly disentangles the two networks, avoiding interaction between them. To make the presentation easy to follow, we denote the coupled structure used in existing works as ED+GAN (the effects of ED and GAN are added together during training), and the proposed method as ED//GAN (the effects of ED and GAN are learned and propagated separately through the two networks). The contributions of this paper can be summarized in the following three aspects:
We propose decoupled learning (ED//GAN) to tackle a ubiquitous but often neglected problem in the widely adopted ED+GAN structure, removing the need for manual balancing between the reconstruction loss and the adversarial loss. To the best of our knowledge, this is the first attempt to deal with this issue.
Based on the proposed decoupled learning (ED//GAN), we further observe its merit in visualizing the boosting effect of adversarial learning. Although many empirical studies have demonstrated the effect of GAN from a visual perspective, few of them could show how GAN sharpens the blurry output of ED, e.g., what kinds of edges and textures are captured by GAN but missed by ED.
Moving away from providing absolute performance measures like existing works, we design the normalized relative discriminative score (NRDS), which provides relative estimates of the models in comparison. After all, the purpose of model evaluation is usually to rank performance; absolute measurements are therefore often unnecessary. In essence, NRDS aims to indicate whether one model is better or worse than another, which is a more practical way to arrive at a reliable estimate.
In the widely used ED+GAN structure, ED tends to generate smooth and blurry results because it minimizes a pixel-wise average over possible solutions in pixel space, while GAN drives results toward the natural image manifold, producing perceptually more convincing solutions. Incorporating the two parts as in existing works causes competition between the two networks, and when the balance point is not appropriately chosen, bad solutions may result, causing artifacts in the generated images. Many empirical studies have shown that stacking a GAN on top of ED does not necessarily boost image quality. We aim to avoid such competition by training ED and GAN in a relatively independent manner: we preserve the structures of ED and GAN without sharing parameters, as compared to existing works where the parameters of the decoder in ED and the generator in GAN are shared. The independent network design explicitly decouples the interaction between ED and GAN, but still follows the classic objective functions, i.e., the reconstruction loss and the minimax game for ED and GAN, respectively. Thus, any existing work based on ED+GAN can be easily adapted to the proposed structure without significantly changing its objectives, meanwhile gaining the benefit of not having to find a balance between ED and GAN.
Compared to ED+GAN, the uniqueness of the proposed ED//GAN lies in the two decoupled backpropagation paths where the reconstruction and adversarial losses are backpropagated to separate networks, instead of imposing both losses to generator/decoder as done in ED+GAN. Fig. 2 illustrates the major difference between ED+GAN and ED//GAN.
In ED+GAN, both the reconstruction loss and the adversarial loss are backpropagated to Dec, and the general objective can be written as
$$\min \; \mathcal{L}_{rec} + \lambda \mathcal{L}_{adv},$$
where $\mathcal{L}_{rec}$ and $\mathcal{L}_{adv}$ denote the reconstruction and adversarial losses, respectively. The parameter $\lambda$ is the weight that balances the two losses.
In ED//GAN, we no longer need the weight $\lambda$ because the two losses are backpropagated along different paths without interaction. The general objective for ED//GAN then becomes
$$\min \; \mathcal{L}_{rec} \quad \text{and} \quad \min \; \mathcal{L}_{adv},$$
optimized over the disjoint parameter sets of (Enc, Dec) and (G, D), respectively.
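The decoupled objective can be sketched with a toy numpy example. This is not the paper's implementation: the elementwise linear "decoder" and "generator" and the surrogate adversarial gradient are illustrative assumptions. The point is that the reconstruction gradient touches only the Dec parameters and the residual gradient touches only the G parameters, so no weight between the losses is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=8)            # shared latent code
x = rng.normal(size=8)            # target "image" (flattened toy)

theta_dec = rng.normal(size=8)    # Dec parameters (elementwise linear toy)
theta_g = np.zeros(8)             # G parameters (residual generator)
lr = 0.05

# Path 1 (ED): reconstruction loss 0.5*||theta_dec*z - x||^2 updates Dec only.
for _ in range(300):
    theta_dec -= lr * (theta_dec * z - x) * z

ed_only_err = np.linalg.norm(theta_dec * z - x)

# Path 2 (GAN): a surrogate "adversarial" gradient pushes the composed output
# x_hat + theta_g*z toward x; it updates G only and never touches Dec.
x_hat = theta_dec * z
for _ in range(300):
    theta_g -= lr * (x_hat + theta_g * z - x) * z

final_err = np.linalg.norm(x_hat + theta_g * z - x)
print(ed_only_err, final_err)     # adding the residual does not hurt here
```

Because each step contracts the error of its own path only, no manual trade-off between the two updates arises in this sketch.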
The general framework of the proposed decoupled learning (ED//GAN) is detailed in Fig. 3, incorporating the encoding-decoding network (ED) with GAN (i.e., D and G) in a decoupled manner, i.e., G and Dec are trained separately corresponding to the adversarial loss and reconstruction loss, respectively.
Assume an input image $x \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. ED (i.e., Enc and Dec) is trained independently of GAN (i.e., G and D), and the reconstructed image from ED is $\hat{x}$, a blurred version of the input image $x$. The generator G, together with the discriminator D, learns a residual $\Delta x$ which is added to $\hat{x}$ to yield the final output image $\tilde{x} = \hat{x} + \Delta x$. Since $\Delta x = \tilde{x} - \hat{x}$, the image generated by G is actually the residual map between $\tilde{x}$ and $\hat{x}$. Assuming $\tilde{x}$ is close to the real image $x$, $\Delta x$ illustrates how adversarial learning increases the resolution and photo-realism of a blurry image. Generally, $\Delta x$ contains details, e.g., edges and texture; in face images specifically, wrinkles and the edges of the eyes and mouth. Therefore, a byproduct of decoupled learning is that it provides a direct mechanism to conveniently illustrate how adversarial learning boosts the performance of ED.
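As a quick numeric illustration of adding the learned residual back to the blurry reconstruction to obtain the final output: the box-filtered reconstruction below is a hand-made stand-in for the ED output (in the model both the reconstruction and the residual are learned).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((16, 16))                     # "real" image

# blurry stand-in for the ED reconstruction: 3x3 box filter, zero-padded
pad = np.pad(x, 1)
x_hat = sum(pad[i:i + 16, j:j + 16] for i in range(3) for j in range(3)) / 9.0

delta = x - x_hat                            # ideal residual for G to learn
x_tilde = x_hat + delta                      # final output

print(np.abs(x_tilde - x).max())             # ~0: residual restores details
```

The residual `delta` here concentrates on high-frequency content (edges, texture), which is exactly what the blurry reconstruction lacks.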
In the proposed ED//GAN framework, the gradients derived from the reconstruction loss and the adversarial loss are directed along separate paths without any interaction, avoiding the competition between the reconstruction and adversarial effects that may cause instability, as discussed in the introduction. G serves as a subsequent processing block of ED, recovering details missed by the output of ED. G and Dec share the latent variable $z$ because of the correspondence between the blurry image $\hat{x}$ and its recoverable details $\Delta x$.
The proposed decoupled learning can be divided into two parts: 1) reconstruction learning of Enc and Dec, and 2) adversarial learning of G and D. Enc and Dec (i.e., ED) are trained independently of G and D (i.e., GAN), updated through a pixel-level norm as shown by the red dashed arrow in Fig. 3. G and D are trained with the original GAN objective, and G is updated only by the adversarial loss, as indicated by the blue dashed arrow. The final output image is obtained by pixel-wise summation of the outputs of G and Dec.
The encoding-decoding network (ED) aims to minimize the pixel-level error between the input image $x$ and the reconstructed image $\hat{x}$. The training of ED is well known to be stable, and ED can be any structure designed for a specific application, e.g., U-Net or a conditional network, with or without batch normalization. Works that adopt batch normalization to enhance the stability of the ED+GAN structure may incur side effects that hurt the diversity of generated images. With the proposed ED//GAN, however, batch normalization becomes unnecessary because the training of ED is isolated from that of GAN, and ED itself is stable without batch normalization. The reconstruction loss of the ED part can be expressed as
$$\mathcal{L}_{rec} = \left\| x - Dec(Enc(x)) \right\|,$$
where $Enc(\cdot)$ and $Dec(\cdot)$ indicate the functions of the encoder and decoder, respectively. The latent variable derived from $x$ is denoted by $z = Enc(x)$, and $\|\cdot\|$ indicates a pixel-level norm. More generally, the latent variable $z$ can be regularized as in VAE and AAE.
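A minimal sketch of this reconstruction loss with a toy linear encoder/decoder. The shapes and the choice of an $\ell_1$ pixel norm here are illustrative assumptions (the paper's concrete ED is a convolutional network and its norm is unspecified above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.random((4, 4))              # toy input image

# hypothetical linear encoder/decoder weights, for illustration only
W_enc = rng.normal(size=(8, 16)) * 0.1
W_dec = rng.normal(size=(16, 8)) * 0.1

z = W_enc @ x.reshape(-1)           # latent variable z = Enc(x)
x_rec = (W_dec @ z).reshape(4, 4)   # reconstruction Dec(Enc(x))

l_rec = np.abs(x - x_rec).sum()     # pixel-level l1 norm (l2 is also common)
print(l_rec)
```

Minimizing `l_rec` alone drives the blurry-but-stable behavior the text attributes to ED.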
In the proposed ED//GAN, GAN works differently from the vanilla GAN in two aspects: 1) the inputs of G are features of the input image (sharing the latent variable $z$ with Dec) rather than random noise; 2) the fake samples fed to D are not directly generated by G; instead, they are conditioned on the output of Dec. Therefore, the losses of GAN can be expressed as
$$\min_{G} \max_{D} \; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(\hat{x} + G(z))\big)\big].$$
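These adversarial losses can be evaluated numerically from the discriminator's outputs; a hedged sketch (the probability values below are made up for illustration, and `d_fake` stands for D's output on the composed samples $\hat{x} + G(z)$):

```python
import numpy as np

def d_loss(d_real, d_fake):
    # discriminator maximizes log D(x) + log(1 - D(x_hat + G(z)))
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss(d_fake):
    # generator minimizes log(1 - D(x_hat + G(z)))
    return np.log(1.0 - d_fake).mean()

d_real = np.array([0.9, 0.8])   # D outputs on real images (assumed values)
d_fake = np.array([0.2, 0.1])   # D outputs on x_hat + G(z) (assumed values)
print(d_loss(d_real, d_fake), g_loss(d_fake))
```

Note that only G's parameters receive this adversarial gradient; Dec is frozen with respect to it.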
Finally, we obtain the objective of the proposed decoupled learning (ED//GAN),
$$\min_{Enc,\,Dec} \mathcal{L}_{rec} \;+\; \min_{G} \max_{D} \; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(\hat{x} + G(z))\big)\big].$$
Note that there are no weighting parameters between the losses in the objective function, which removes the manual tuning that may otherwise require an expert with strong domain knowledge and rich experience. During training, each component can be updated alternately and separately because the three components do not overlap in backpropagation, i.e., the backpropagation paths are not intertwined. In practice, ED can be trained first because it is completely independent of GAN, while GAN operates on the output of ED.
A side product of the proposed ED//GAN is that it helps to investigate how the discriminator independently boosts the quality of generated images. In ED+GAN, however, the effect of discriminator is difficult to directly identify because it is coupled with the effect of ED. The learned residual in ED//GAN is considered the boosting factor from the adversarial learning (discriminator). Generally, the images from ED tend to be blurry, while the residual from GAN carries the details or important texture information for photo-realistic image generation. Imposing the residual onto the reconstructed images is supposed to yield higher-fidelity images as compared to the reconstructed images.
In Fig. 4 (middle picture in each triple), we can observe that the adversarial learning mainly enhances the edges at the eyebrows, eyes, mouth, and teeth for face images. For the bird and flower images, the residuals further enhance the color. In some cases, the added details also create artifacts. In general, adding the residual to the blurry images from ED (Fig. 4, left) yields output images with finer details.
An argument on the visualization of adversarial effect may be that subtracting the result of ED from that of ED+GAN could also obtain the residual. Although this process can roughly visualize the boost from GAN, we emphasize that “ED+GAN” minus “ED” is not purely the effect from GAN because the training of GAN is affected by ED in ED+GAN and vice versa. In the proposed ED//GAN, however, ED is trained independently from GAN, thus GAN only learns the residual between real images and those from ED.
In evaluating image quality (e.g., realism and resolution), designing a reliable metric for generative models remains an open issue. Existing metrics (e.g., the inception score and related methods), although successful in certain cases, have been demonstrated to be problematic in others. If a perfect metric existed, the training of generative models would be much easier because we could use it directly as a loss without training a discriminator. The rationale behind our design is that if it is difficult to obtain an absolute score (a perfect metric) for a model, we can at least compare which model generates better images than the others. From this perspective, we propose to perform relative comparison rather than evaluation based on an absolute score as in existing works. More specifically, we train a single discriminator/classifier to separate real samples from generated samples; generated samples closer to the real ones are more difficult to separate. For example, consider two generative models $G_1$ and $G_2$, which define the distributions of generated samples $p_{g_1}$ and $p_{g_2}$, respectively. Suppose the distribution of real data is $p_{data}$. If $JSD(p_{g_1} \| p_{data}) < JSD(p_{g_2} \| p_{data})$, where $JSD$ denotes the Jensen-Shannon divergence, and assuming $p_{g_1}$ and $p_{g_2}$ intersect with $p_{data}$, a discriminator D trained to classify real samples as 1 and generated samples as 0 would show the following inequality:
$$\mathbb{E}_{x \sim p_{g_1}}[D(x)] > \mathbb{E}_{x \sim p_{g_2}}[D(x)].$$
The main idea is that if the generated samples are closer to the real ones, more epochs are needed to distinguish them from real samples. The discriminator is a binary classifier separating the real samples from the fake ones generated by all the models in comparison. In each epoch, the discriminator output of each sample is recorded. The average discriminator output on real samples increases with epochs (approaching 1), while that on the generated samples of each model decreases with epochs (approaching 0). However, the decrement rate of each model varies with how close its generated samples are to the real ones: samples closer to the real ones show a slower decrement rate. Therefore, we compare the decrement rates of the models to evaluate their generated images relative to each other. The decrement rate is inversely reflected by the area under the curve of average discriminator output versus epoch: a larger area indicates a slower decrement rate, implying that the generated samples are closer to the real ones. Fig. 5 illustrates the computation of the normalized relative discriminative score (NRDS).
There are three steps to compute the proposed normalized relative discriminative score (NRDS): 1) obtain the curve $C_i$ of discriminator output versus epoch (or mini-batch) for each model $i$ (assuming $n$ models in comparison) during training; 2) compute the area $A(C_i)$ under each curve; and 3) compute the NRDS of the $i$-th model by
$$\text{NRDS}_i = \frac{A(C_i)}{\sum_{j=1}^{n} A(C_j)}.$$
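The three steps can be sketched directly in numpy. The two hand-made curves below are hypothetical discriminator outputs, not measured data:

```python
import numpy as np

def nrds(curves):
    """curves[i][t]: mean discriminator output on model i's samples at epoch t."""
    curves = [np.asarray(c, dtype=float) for c in curves]
    # step 2: trapezoidal area under each curve
    areas = np.array([0.5 * (c[1:] + c[:-1]).sum() for c in curves])
    # step 3: normalize across the n models in comparison
    return areas / areas.sum()

# model A's outputs decay slowly (samples harder to tell from real),
# model B's decay fast, so A should receive the larger share
scores = nrds([[0.5, 0.45, 0.40, 0.35],
               [0.5, 0.30, 0.10, 0.05]])
print(scores)
```

By construction the scores sum to 1, which is why only their relative order within one comparison is meaningful.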
To illustrate the computation of NRDS, Fig. 6 shows a toy example. Assume the samples named "fake-close" and "fake-far" are generated by two different models to simulate the real samples. We train a discriminator on the real and fake (i.e., fake-close and fake-far) samples. The discriminator is a neural network with two hidden layers of 32 nodes each, with ReLU as the activation function. After each epoch of training on the real and fake samples, the discriminator is tested on the same real, fake-close, and fake-far samples, respectively. For example, all real samples are fed to the discriminator, and we compute the mean of its outputs; by the same token, we obtain the average outputs on fake-close and fake-far. After 300 epochs, we plot the curves shown in Fig. 6 (right). Intuitively, the curve of fake-close approaches zero more slowly than that of fake-far because the samples in fake-close are closer (more similar) to the real samples.
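The toy experiment can be reproduced in spirit with a few lines of numpy. To keep the sketch short we substitute a logistic-regression discriminator for the paper's two-layer MLP, and the Gaussian means, learning rate, and epoch count are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
real       = rng.normal([0.0, 0.0], 1.0, size=(n, 2))
fake_close = rng.normal([1.5, 1.5], 1.0, size=(n, 2))
fake_far   = rng.normal([4.0, 4.0], 1.0, size=(n, 2))

# logistic-regression discriminator: real=1, both kinds of fake=0
w, b = np.zeros(2), 0.0
X = np.vstack([real, fake_close, fake_far])
y = np.concatenate([np.ones(n), np.zeros(2 * n)])

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
curves = {"fake-close": [], "fake-far": []}
for epoch in range(100):
    p = sigmoid(X @ w + b)
    w -= 0.1 * X.T @ (p - y) / len(y)     # gradient step on binary cross-entropy
    b -= 0.1 * (p - y).mean()
    curves["fake-close"].append(sigmoid(fake_close @ w + b).mean())
    curves["fake-far"].append(sigmoid(fake_far @ w + b).mean())

area = {k: float(np.sum(v)) for k, v in curves.items()}   # area under each curve
total = sum(area.values())
nrds = {k: a / total for k, a in area.items()}
print(nrds)   # fake-close should score higher
```

The qualitative behavior matches the description: the fake-far curve collapses toward zero within a few epochs, while fake-close keeps fooling the discriminator longer and therefore accumulates more area.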
A toy example of computing NRDS. Left: the real and fake samples are randomly drawn from 2-D normal distributions with different means but the same (identity) covariance. The real samples (blue circles) have zero mean. The red "x" and yellow "+" denote fake samples whose means are close to and far from that of the real samples, respectively, hence the notation fake-close/fake-far. Right: the curves of epoch vs. averaged discriminator output on the corresponding sets (colors) of samples.
Computing the areas under the curves of fake-close and fake-far and substituting them into Eq. 11 yields a higher NRDS for fake-close.
Therefore, we can claim that the model generating fake-close is relatively better. Note that the NRDS value of a single model in isolation is meaningless; we can only conclude that a model with higher NRDS is better than those with lower NRDS in the same comparison, and a high NRDS does not necessarily indicate an absolutely good model.
We evaluate the proposed decoupled learning mainly on 1) its ability to relax the weight setting and 2) its generality in adapting to existing works. First, we compare the proposed ED//GAN to the traditional ED+GAN on the UTKFace dataset using a general (not fine-tuned) network structure. Then, two existing works, Pix2Pix and CAAE, are adopted for adaptation, using the corresponding datasets, i.e., the CMP Facade and UTKFace databases, respectively. The UTKFace dataset consists of about 20,000 aligned and cropped faces with large diversity in age and race; the decoupled learning applied to UTKFace demonstrates performance on image manipulation tasks. The CMP Facade dataset illustrates the performance of decoupled learning on image transformation tasks. In all experiments, no parameter tuning is performed on any dataset.
For fair comparison, we compare ED//GAN and ED+GAN on the same network and dataset. This network is neither specifically designed for any applications nor delicately fine-tuned to achieve the best result. The goal is to illustrate the advantages of ED//GAN as compared to ED+GAN. Table 1 details the network structure used in this experiment.
Table 1: Network structure used in this experiment (layer sizes omitted).

Enc / D:  [Conv, BN, ReLU] x 5  ->  Reshape, FC, tanh
Dec / G:  [FC, ReLU, BN, Reshape]  ->  [Deconv, BN, ReLU] x 4
To demonstrate the effectiveness of ED//GAN in relaxing the weight setting, we compare it to ED+GAN while changing the weight between the reconstruction loss and the adversarial loss. The objective function of ED//GAN (Eq. 9) requires no weight. For comparison purposes, we intentionally add a weighting parameter to the adversarial loss as in the objective of GAN (Eq. 1). Varying the weight from 0.001 to 1 in steps of 10x, we obtain the results shown in Fig. 7 after 200 epochs with a batch size of 25.
The output images from ED//GAN have relatively higher fidelity and maintain almost the same quality regardless of the weight change. In contrast, the outputs of ED+GAN vary significantly with the weight; in addition, ED+GAN generates unstable results, e.g., mode collapse and significant artifacts. The corresponding NRDS is reported in Table 2, where the models in comparison correspond to the different weight settings (Eq. 11). The discriminator adopted for NRDS is the same as D in Table 1.
Now, we remove the batch normalization in Enc and Dec to see whether ED//GAN still yields stable results. Fig. 8 compares the results from ED//GAN and ED+GAN by removing the batch normalization in Enc and Dec. The corresponding NRDS is listed in Table 3.
From the two experiments, ED//GAN vs. ED+GAN with and without batch normalization on ED (i.e., Enc and Dec), we observe that ED//GAN generally yields a higher NRDS, indicating better image quality. In addition, the NRDS values for ED//GAN vary much less than those of ED+GAN, as observed from the lower standard deviation (std), indicating robustness against different weights. These observations agree with our claim: ED//GAN stabilizes training regardless of the trade-off issue in the traditional ED+GAN structure.
We notice that for ED//GAN, the NRDS value changes slightly with the weight; however, the change is too small to be observable by visual inspection. We also observe that NRDS peaks at certain settings: in both Tables 2 and 3, the highest value is achieved by the proposed ED//GAN without any weight setting.
An essential merit of ED//GAN is its adaptability for existing ED+GAN works. Specifically, an existing work that adopted the ED+GAN structure could be easily modified to the ED//GAN structure without significantly reducing the performance but with the benefit of not having to fine-tune the weight. To demonstrate the adaptability of ED//GAN, we modify two existing works: 1) Pix2Pix  for image transformation and 2) CAAE  for image manipulation. According to Fig. 2, the modification is simply to parallelize a G (the copy of Dec) to the original network. The objective functions are modified from Eq. 1 to Eq. 3, which is straightforward to implement.
In Pix2Pix, ED is implemented by the U-Net, which directly passes feature maps from encoder to decoder, preserving more details. In order not to break the structure of U-Net, we apply another U-Net as the generator G in the corresponding ED//GAN version. Fig. 10 compares the results from Pix2Pix and its ED//GAN version. The reported weight in Pix2Pix is 100:1, where the weight on reconstruction loss is 100, and 1 on the adversarial loss. We change the weight setting to 1:1 and 1000:1 to illustrate its effect on the generated images.
We observe that the generated images with the weight of 1:1 introduce significant artifacts (zoom in for better view). With higher weight on the reconstruction loss, e.g., 100:1 and 1000:1, more realistic images can be generated, whose quality is similar to that from the ED//GAN version that does not need weight setting.
We next adapt CAAE, a conditional ED+GAN structure, to the proposed ED//GAN structure as shown in Fig. 12. CAAE generates aged faces by manipulating the label concatenated to the latent variable $z$ from Enc. The original CAAE network has an extra discriminator on $z$ that forces $z$ to be uniformly distributed; we do not show this discriminator in Fig. 12 because it does not affect the adaptation. Fig. 11 shows random examples comparing the original and modified structures. The weights of the reconstruction loss and the adversarial loss follow the setting reported in the original work; we also use two other weight settings for the original structure and compare the results with the corresponding ED//GAN version.
We observe that the ED//GAN structure still, in general, yields NRDS values higher than or similar to those of the coupled counterpart. Although the proposed ED//GAN ranks second in Table 4, it achieves this competitive result without tuning the weight parameter; in Table 5, ED//GAN ranks at the top compared to the ED+GAN structure. Note that we explore alternative parameter settings around the optimal settings already known from the original papers; when designing a new structure without such prior knowledge, it could be difficult to find the optimal weight within a few trials.
It is worth emphasizing that the goal is not to beat the best result from fine-tuned ED+GAN. Rather, ED//GAN aims at achieving stable and competitive results without having to fine-tune the weight.
This paper proposed a novel decoupled learning structure (ED//GAN) for image generation with image-conditional models. Different from existing works, where the reconstruction loss (from ED) and the adversarial loss (from GAN) are backpropagated to a single decoder (the coupled structure, ED+GAN), in ED//GAN the two losses are backpropagated through separate networks, thus avoiding interaction between them. The essential benefit of the decoupled structure is that the weighting factor that has to be fine-tuned in ED+GAN is no longer needed, improving stability without searching for the best weight setting. This can largely facilitate the wider adoption of conditional image generation in specific tasks. The experimental results demonstrated the effectiveness of decoupled learning. We also showed that existing ED+GAN works can be conveniently modified to ED//GAN by adding a generator that learns the residual.