Since vanilla Generative Adversarial Networks [goodfellow2014generative] were proposed, they have become a compelling topic, bringing AI, in some sense, closer to a human-like way of learning. However, so far there is no explicit evidence showing that GANs can actually create new information, which makes the capabilities of GAN-based data augmentation questionable. The study of disentangled representations in GANs aims to provide a side-proof for this. It provides means to reorganize and manipulate the elements in the generated data, allowing for a more flexible and intuitive tool for data augmentation.
As discussed in [brock2018large], unconditional generative networks produce sub-standard samples when latent vectors are drawn from low-density areas of the learned distribution. Thus, [brock2018large, karras2019style] propose to draw latent codes from a truncated sampling space at the cost of some diversity. Conditional generation provides another way to constrain the sampling space. Explicitly disentangled semantics in the latent space (e.g. based on class labels) make the generation process controllable. The focus of this paper is a better disentanglement of different types of information within the learned latent representations. This is achieved with a newly designed conditional generative model, which we refer to as Information Compensation GAN (ICGAN).
Existing conditional generative networks, e.g. InfoGAN [chen2016infogan], have achieved good disentanglement of discrete information in RGB images. InfoGAN is also capable of disentangling clear continuous information, such as the rotation of digits in gray-scale images. However, this performance does not seem to carry over consistently to RGB images. To deal with this issue, we re-design the architecture of the conditional GAN by adding our proposed Information Compensation Connection (IC-Connection) to each layer of its generator. As presented in Fig. 2, this sub-structure provides an additional path for information transfer from the input latent code to the intermediate activations in each layer. Through this path, the IC-Connection is able to “compensate” or recover information that might be lost during the internal deconvolution operations. Our structure looks similar to StyleGAN [karras2019style], but it is based on a different hypothesis. StyleGAN treats all latent information as style, so that injecting the latent code only into intermediate layers is sufficient. We argue (and show empirically in Section 4.2) that it is better to also inject information at the front of the generator, as this allows content information to be introduced as well. Thus, by keeping the input at the front of the network, different layers can better construct the basic content of different concepts. At the same time, we remove the instance normalization layer to better preserve the information passing through.
To gain deeper insight into the performance of the proposed method, we present a novel evaluation procedure based on a modified MNIST dataset. We convert the images from gray-scale to RGB color space and add a background color to each image. This results in two clearly perceptible attributes. One attribute focuses on discrete features, i.e. the digits’ content. The other attribute focuses on continuous features, i.e. background colors. The ablation test (Fig. 1) shows that our method can encode different information clearly in different layers in an unsupervised manner. Moreover, our performance is better than that of a fully-supervised version of InfoGAN.
In summary, we present a structure called IC-Connection for better disentangling the latent representation in conditional generative models. In addition, we design a novel evaluation procedure to assess the degree of disentanglement.
2 Related Work
Generative Adversarial Networks. GANs [goodfellow2014generative] have reached many milestones since they were proposed. Many novel GAN structures have been designed for different purposes, such as CGAN [mirza2014conditional] for conditional generation and InfoGAN [chen2016infogan] for unsupervised conditional generation. Some of the techniques used in our model are based on those successful GAN architectures, such as the weight update method used in PGGAN [karras2017progressive]. Different from them, we introduce a novel Information Compensation Connection to enhance the ability of conditional generation. This design seems to effectively improve the quality of generated images.
Learning Disentangled Representations. Finding more controllable generative networks is an essential topic in understanding deep generative models, and disentangled representations are one aspect of this field. Highly disentangled representations are useful for data augmentation. Moreover, they help to quantitatively interpret generative networks. InfoGAN [chen2016infogan] was the first to propose a theoretical background for unsupervised conditional generation in GANs. Its procedure of maximizing the mutual information has become standard practice in this task. Likewise, β-VAE [higgins2017beta_vae] improved the degree of disentanglement in Variational Autoencoders. In this paper, we propose methods to improve the level of representation disentanglement.
Skip connections. Skip connections have different implications for different tasks. In the detection task, skip connections, or so-called identity shortcut connections [he2016resnet], can help to deal with the vanishing gradient problem in deep networks. In the U-Net [ronneberger2015unet] structures used for segmentation, concatenating the activations of the contraction and expansion sections helps to maintain spatial information in the segments. Skip connections greatly expand the ways in which information can flow, which is why they are useful in so many scenarios. All of these works focus on connecting activations to activations. Here, we aim to connect the input vector directly to each activation, so that information lost during the upsampling phase can be compensated through an alternative “route”. The “route” designed in this paper is called the Information Compensation Connection (IC-Connection).
3 Information compensation in GANs
In most GANs, the information of the latent code is passed through a single pipeline from the front of the generator to the end. However, some works have shown that passing the latent information only through intermediate layers is sufficient to generate high-quality images [karras2019style]. In this type of structure, changing the latent code in different layers produces semantic changes in the output. For example, the hair color of people in images is usually decoded in later layers, while the shape of faces is defined in earlier layers. This observation suggests that GANs may decode different information in different layers. If this hypothesis is correct, then letting the generator focus exclusively on the information needed in the current layer (without attending to the information to be modelled in later layers) might make the generator easier to train. In other words, we can supply the information needed by later layers through other “routes” rather than depend only on the information flowing from the previous layer. Following this procedure, different layers can learn different aspects of the representation. This is the main goal of the proposed IC-Connection. Different from [karras2019style], we find that keeping the input at the front of the generator and removing instance normalization makes it easier for the generator to construct the shape of different objects. Moreover, it leads to improved disentangling performance for both discrete and continuous information. The structure of our generator is shown in Fig. 2.
3.1 Network Architecture
Information Compensation Connection. Our design is inspired by the AdaIN block [huang2017adain], which is also used in [karras2019style]. [huang2017adain] regard instance normalization as a form of style normalization; other styles are then injected by scaling and shifting the feature maps. This effectively improves the performance on the Neural Style Transfer (NST) task. In other words, instance normalization can mitigate the effect of global features on the content information, and these global features can be modified by changing the feature statistics. In our design, we remove the instance normalization layer to keep the global information of each input feature map and only apply compensation where needed. Recently, [karras2019style2] also reported a negative effect of instance normalization on the generation quality of GANs. The Shift-Scale Block (SSBlock) shown in Fig. 2 receives the output feature maps from the previous layer. The affine parameters $\gamma$ and $\beta$ in Eq. 1 are learned from the input latent code $z$.
$$\mathrm{SSBlock}(x_i, z) = \gamma_i(z)\, x_i + \beta_i(z) \qquad (1)$$
where $x_i$ stands for the $i$-th feature map from the previous layer. The substructure containing the SSBlock and the skip connections in our generator is called the Information Compensation Connection (IC-Connection). This design gives the generator the flexibility to combine different information from activations occurring at different layers of the architecture. It resembles the way painters work: they can focus on composition and outline at an early stage without thinking much about the coloring phase.
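The scale-and-shift step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `ss_block`, the linear maps `w_scale` and `w_shift`, and all tensor shapes are hypothetical choices made for the example.

```python
import numpy as np

def ss_block(features, latent, w_scale, w_shift):
    """Scale and shift each feature map with parameters predicted from the latent code.

    features: (C, H, W) activations from the previous layer
    latent:   (Z,) input latent code
    w_scale, w_shift: (Z, C) hypothetical linear maps from the latent code
                      to one scale and one shift per feature map
    """
    scale = latent @ w_scale  # shape (C,)
    shift = latent @ w_shift  # shape (C,)
    # No instance normalization is applied before the affine transform.
    return scale[:, None, None] * features + shift[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 7, 7))
latent = rng.uniform(size=100)
w_scale = rng.normal(size=(100, 4))
w_shift = rng.normal(size=(100, 4))
out = ss_block(features, latent, w_scale, w_shift)
```

Because the latent code enters every layer through such a block, each layer can read the part of the code it needs instead of relying only on what survived the previous layers.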
Shift-invariant Convolutional Layer. For better discriminator performance, we use the shift-invariant convolutional layer proposed in [zhang2019shiftinvariant]. There it is observed that content information is easily lost if the stride of the max-pooling and convolutional layers is larger than one. The authors propose to first pass the feature maps through a max-pooling layer with stride one, and then to blur and downsample the outputs. In this way, the resulting feature maps keep more content information and are closer to shift-invariant.
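The pooling scheme described above (dense max-pooling, then blur, then subsampling) can be illustrated on a single 2-D feature map. This is a sketch under assumptions: the 2x2 pooling window, the 3x3 binomial blur kernel, reflect padding, and all function names are illustrative choices, not details taken from the paper.

```python
import numpy as np

def max_pool_stride1(x, k=2):
    """Max pooling over k x k windows evaluated at every position (stride 1)."""
    h, w = x.shape
    out = np.full((h - k + 1, w - k + 1), -np.inf)
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, x[dy:dy + h - k + 1, dx:dx + w - k + 1])
    return out

def blur_and_subsample(x, stride=2):
    """Low-pass filter with a 3x3 binomial kernel, then keep every stride-th sample."""
    kernel = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0
    h, w = x.shape
    padded = np.pad(x, 1, mode="reflect")
    blurred = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            blurred += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return blurred[::stride, ::stride]

def max_blur_pool(x):
    return blur_and_subsample(max_pool_stride1(x))

feature_map = np.arange(49, dtype=float).reshape(7, 7)
pooled = max_blur_pool(feature_map)
```

The blur removes the high frequencies that would otherwise alias when the map is subsampled, which is what makes the output less sensitive to small shifts of the input.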
Learning rate scaling. Progressive Growing GANs [karras2017progressive] propose a way to keep the training of different layers at a similar pace. Spectral normalization [miyato2018spectral] is another option, but in our experiments the former works better. The intuition behind the learning rate scaling method of [karras2017progressive] is that, if one layer has more kernels than another, many small changes accumulate into a large change in the generated result. It is therefore reasonable to give a lower learning rate to a layer with a larger number of input signals. The scaling factor used is the per-layer constant from the CNN initialization of [he2015kaiming]. In practice, this learning rate scaling technique works well with the WGAN-GP loss [gulrajani2017improved].
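As a small worked example of the scaling factor, the sketch below computes the He constant sqrt(2 / fan_in) for two convolutional layers. The helper names and the layer sizes are hypothetical and chosen only for illustration.

```python
import math

def he_constant(fan_in):
    """Per-layer constant sqrt(2 / fan_in) from He initialization."""
    return math.sqrt(2.0 / fan_in)

def conv_fan_in(in_channels, kernel_size):
    """Number of input signals feeding one output unit of a square convolution."""
    return in_channels * kernel_size * kernel_size

# Two hypothetical 3x3 layers: one with 16 and one with 256 input channels.
small_layer = he_constant(conv_fan_in(16, 3))
large_layer = he_constant(conv_fan_in(256, 3))
ratio = small_layer / large_layer
```

The layer with 16 times more input signals receives a factor that is 4 times smaller, so its effective update step shrinks accordingly.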
Our goal is to generate high quality images and enhance mutual information as much as possible. Thus, our loss function has two parts, the adversarial loss and the mutual information loss.
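Adversarial loss. We use the Wasserstein distance with gradient penalty [gulrajani2017improved] as our adversarial loss, defined in Eq. 2.
$$L_{adv} = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \qquad (2)$$
where $\hat{x}$ is a point sampled between the distribution of real data $P_r$ and the generated distribution $P_g$. $\lambda$ is set to 10 as suggested in [gulrajani2017improved]. The gradient penalty is a good approximation of the 1-Lipschitz constraint on the discriminator, which can effectively mitigate the mode collapse problem during the training of GANs.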
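To make the structure of the critic loss concrete, the following sketch evaluates it for a toy linear critic, whose input gradient is known in closed form. Everything here is an assumption for illustration: the function name, the linear critic, the data, and the penalty weight of 10 (the default of [gulrajani2017improved]). A real implementation would obtain the gradients through automatic differentiation.

```python
import numpy as np

def wgan_gp_critic_loss(d_real, d_fake, grad_norms, lam=10.0):
    """Wasserstein critic loss plus gradient penalty.

    d_real, d_fake: critic outputs on real and generated samples
    grad_norms:     L2 norms of the critic gradient at the interpolated points
    """
    penalty = np.mean((grad_norms - 1.0) ** 2)
    return np.mean(d_fake) - np.mean(d_real) + lam * penalty

rng = np.random.default_rng(0)
w = np.array([0.6, 0.8])                      # toy linear critic D(x) = w . x
real = rng.normal(loc=1.0, size=(8, 2))
fake = rng.normal(loc=-1.0, size=(8, 2))

# Points sampled on straight lines between real and generated samples.
eps = rng.uniform(size=(8, 1))
x_hat = eps * real + (1.0 - eps) * fake

# For a linear critic the gradient at every x_hat is simply w.
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
loss = wgan_gp_critic_loss(real @ w, fake @ w, grad_norms)
```

Since the toy critic has unit gradient norm, the penalty term vanishes and only the Wasserstein estimate remains.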
Regularization of discrete variables. As discussed in [chen2016infogan], maximizing the mutual information between the generator distribution $G(z)$ and the latent code $z$ can induce the latent code to learn meaningful disentangled information in its $c$ part. The mutual information is expressed as $I(c; G(z))$. For one-hot encoded categorical information, maximizing the mutual information can be expressed as minimizing the cross entropy (Eq. 3) between the input code $c$ and the predicted output $\hat{c}$
$$L_{dis} = -\sum_{i=1}^{N} c_i \log \hat{c}_i \qquad (3)$$
where $N$ stands for the number of categories for one attribute.
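As a worked example, the cross entropy between a one-hot input code and a predicted class distribution reduces to the negative log-probability assigned to the true class. The function name and the small stabilizing constant `eps` below are assumptions made for this sketch.

```python
import math

def one_hot_cross_entropy(code, probs, eps=1e-12):
    """Cross entropy between a one-hot code and predicted class probabilities."""
    return -sum(c * math.log(p + eps) for c, p in zip(code, probs))

code = [0.0, 1.0, 0.0]          # one-hot input for the second category
probs = [0.1, 0.8, 0.1]         # hypothetical prediction recovered from the image
loss = one_hot_cross_entropy(code, probs)
```

Only the entry of the true category contributes, so the loss falls to zero as the predicted probability of that category approaches one.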
Regularization of continuous variables. For continuous information, the constraints introduced depend on the properties of the variables. For normally distributed attributes (e.g. rotation), we could use the factored Gaussian [chen2016infogan]. In our case, we model the sub-task for continuous color information as a nonlinear regression problem. Thus, we choose the mean squared error (Eq. 4) as the metric for minimizing the information loss.
$$L_{con} = \frac{1}{M} \sum_{j=1}^{M} \left(c^{con}_j - \hat{c}^{con}_j\right)^2 \qquad (4)$$
where $c^{con}$ and $\hat{c}^{con}$ are the continuous part of the input latent code and its predicted value, and $M$ stands for the number of elements used for this continuous concept in the latent code, e.g. $M = 3$ for RGB color in our case.
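The total loss function is defined in Eq. 5,
$$L = L_{adv} + \lambda_{dis} L_{dis} + \lambda_{con} L_{con} \qquad (5)$$
where $L_{adv}$ is the adversarial loss (Eq. 2), $L_{dis}$ is the cross entropy (Eq. 3), $L_{con}$ is the mean squared error (Eq. 4), and $\lambda_{dis}$ and $\lambda_{con}$ are hyperparameters for balancing the different loss functions.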
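The combination of the three terms can be sketched as a weighted sum. The weight names `w_dis` and `w_con` and all numeric values below are hypothetical, since the text does not state the hyperparameter values; the mean squared error is computed over the three RGB elements of the continuous code.

```python
def mean_squared_error(target, predicted):
    """Mean squared error over the elements of the continuous code."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

def total_loss(adversarial, discrete, continuous, w_dis=1.0, w_con=1.0):
    """Weighted sum of adversarial, discrete and continuous terms (weights are placeholders)."""
    return adversarial + w_dis * discrete + w_con * continuous

rgb_in = [1.0, 0.0, 0.0]        # background color fed to the generator
rgb_out = [0.5, 0.0, 0.0]       # hypothetical color recovered from the image
continuous_term = mean_squared_error(rgb_in, rgb_out)
combined = total_loss(1.0, 0.5, 0.25, w_dis=2.0, w_con=4.0)
```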
In addition, it is possible to encode the discrete information in the latent space in exactly the same order in which its labels were provided. This is achieved by adding weak supervision on this information. In our case, we can use another cross entropy loss to train the discriminator with real data in a supervised manner. Since this supervision provides additional constraints to the GAN, it can lead to better disentanglement. However, supervised conditional generation is not the focus of this paper.
We evaluate the effectiveness of our method from three different directions. The quality of the generated data is discussed first. Then, we evaluate the quality of disentanglement of the discrete and continuous variables through a novel procedure based on our ColorMNIST dataset.
We conduct experiments on our newly designed ColorMNIST dataset. The dataset contains 108,504 training images and 18,103 testing images of 28x28x3 pixels. There are two types of attributes in the dataset. One attribute focuses on discrete features, i.e. 12 classes (digits 0 to 9, flat color region and random noise); the other focuses on continuous features, i.e. the background color, drawn from hue values 0 to 1 with a step size of 0.01 in HSV color space. The GAN is trained only on images with discrete labels from 0 to 9, but the classifier for evaluating the degree of disentanglement of discrete information is trained on the complete dataset.
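The background palette described above can be generated with the standard `colorsys` module. This sketch assumes that the 100 colors correspond to hue values 0.00 to 0.99 at full saturation and value; the function name is illustrative and how the digit is composited onto the background is not covered.

```python
import colorsys

def background_palette(n_colors=100):
    """RGB triples for hues k / n_colors at full saturation and value (assumed range)."""
    return [colorsys.hsv_to_rgb(k / n_colors, 1.0, 1.0) for k in range(n_colors)]

palette = background_palette()
red = palette[0]      # hue 0.0
cyan = palette[50]    # hue 0.5
```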
Training Details. Experiments are run on a single NVIDIA TITAN X GPU. The models are trained using the Adam optimizer with mini-batches of size 100. We follow the two time-scale update rule (TTUR) [heusel2017TTUR] to train our GANs. The initial learning rate of our two newly designed architectures is 5e-2 for the generator and 8e-2 for the discriminator, with 10% learning rate drops every 5 epochs. For the original InfoGAN [chen2016infogan], we change the suggested learning rate of the discriminator from 2e-4 to 3e-3, which is better suited to our dataset.
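One possible reading of the schedule above is a step decay that multiplies the learning rate by 0.9 every 5 epochs. That interpretation of a "10% drop", as well as the function name, is an assumption of this sketch.

```python
def stepped_lr(initial_lr, epoch, drop=0.10, every=5):
    """Learning rate after applying one multiplicative drop per completed block of epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // every)

# Two time-scale setup: separate schedules for generator and discriminator.
generator_lrs = [stepped_lr(5e-2, epoch) for epoch in range(15)]
discriminator_lrs = [stepped_lr(8e-2, epoch) for epoch in range(15)]
```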
4.1 Assessing the quality of the generated images
In this experiment, we assess the training stability and generation quality of the proposed method. Towards this goal, we compare the performance of four different architectures: InfoGAN [chen2016infogan], StyleGAN [karras2019style], our newly designed ICGAN, and a similar architecture with Batch Normalization and without the IC-Connection. We use the Fréchet Inception Distance (FID) [heusel2017TTUR] to evaluate the generation quality. This score is a commonly used metric for evaluating the quality and diversity of generated images by comparing them with the training set.
We train our network on the ColorMNIST dataset. The latent code has 100 dimensions: the first 10 elements are one-hot encoded digit information, the next 3 elements carry the RGB background color information converted from the 100 solid colors in ColorMNIST, and the rest is random noise drawn from a uniform distribution between 0 and 1.
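The layout of the latent code can be made explicit with a short sketch. The function name and the use of Python's `random` module are assumptions; only the 10 + 3 + 87 split and the uniform noise range come from the description above.

```python
import random

def sample_latent(digit, rgb, dim=100, rng=random):
    """Build one latent code: 10 one-hot digit entries, 3 RGB entries, then uniform noise."""
    one_hot = [1.0 if k == digit else 0.0 for k in range(10)]
    noise = [rng.random() for _ in range(dim - 10 - 3)]  # uniform in [0, 1)
    return one_hot + list(rgb) + noise

z = sample_latent(digit=3, rgb=(1.0, 0.0, 0.0))
```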
We plot the learning curves of these architectures in Fig. 3. We find that our method converges faster than InfoGAN and that its FID score is much more stable during the training phase. Among the similar architectures that use the same learning rate scaling technique and loss functions (StyleGAN, ours (ICGAN) and ours without the IC-Connection), there is no significant difference in generation quality or stability. Moreover, as presented in Table 1, their FID scores are very close at the convergence point.
We fix the discrete variables to digit 0 and change the continuous variables by interpolating over up to 400 colors.
4.2 Disentanglement of discrete variables
To evaluate how well our model can disentangle discrete information, we train a classifier on ColorMNIST along with solid color and noise data. The classification accuracy serves as a proxy metric to verify whether the conditional generation networks disentangle the discrete digit information. The architecture of the classifier is similar to the discriminator of our GAN. The average classification accuracy of the classifier on the testing set is over 0.994.
We calculate the classification accuracy on our generated images over multiple rounds and compare the average accuracy of the different GAN architectures. We randomly generate 1000 images for each GAN and pass them to the pre-trained classifier. The generated images are labeled by matching each latent discrete code c to a digit content. The mean classification accuracy and the standard error of the mean are shown in Table 2. Our method performs best at disentangling the discrete digit information.
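The two reported statistics can be computed from per-round accuracies as shown below. The accuracy values are made up purely for illustration and are not results from the paper; the function name is likewise an assumption.

```python
import math
import statistics

def mean_and_sem(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

round_accuracies = [0.98, 0.99, 1.00]   # illustrative numbers only
mean_acc, sem_acc = mean_and_sem(round_accuracies)
```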
In our experiments, the original InfoGAN [chen2016infogan] easily mixes the digit and color information together and tends to map the 10 discrete variables to 10 different color ranges, as can be seen in Fig. 3(a). In practice, the training procedure of InfoGAN is also prone to mode collapse, which causes unexpected generation results. Our method disentangles all the digits adequately. In contrast, StyleGAN [karras2019style] tends to mix up the digits ‘4’, ‘9’ and ‘8’ (Fig. 3(a)) quite frequently.
In our experiments we were unable to achieve proper disentanglement of digit information on the ColorMNIST dataset through the vanilla InfoGAN. On the one hand, we do believe there might exist a specific setting under which this is possible. On the other hand, it is remarkable how much easier it is for our method to achieve this goal. Further experiments showed that we can get relatively better disentanglement with InfoGAN only if we add supervision. The classification performance of the supervised version of InfoGAN can reach accuracy which is similar to that of other methods. We could say that, compared to the original InfoGAN, our new architecture can better disentangle discrete digit content and keep diversity of color even without the IC-Connection.
4.3 Disentanglement of continuous variables
Measuring the degree of disentanglement of continuous variables is a hard task. Previous works assess performance only through the visual effect of the generated images [chen2016infogan] [shen2018faceid]. In our experiments, thanks to the characteristics of our dataset, we can go a step further by evaluating whether the background color in HSV space changes linearly with continuous changes in the latent space. We use the mean squared error of hue values between the standard color ring and the generated color ring computed from the generated images (shown in Fig. 5).
First, we generate images by interpolating the latent continuous variables; the results of a 400-step interpolation are shown in Fig. 4(b). We plot the results of InfoGAN with supervision and ours (ICGAN) without supervision. The other methods either have random background colors (e.g. InfoGAN with supervision and ours w/o IC) or similar generation quality (e.g. StyleGAN). As can be noted, the color changes of the images generated by our method without supervision are visually smoother than those from InfoGAN with supervision on the continuous color variables.
Table 3: Hue MSE per method. Columns: methods, MSE, MSE. Rows include ours (w/o IC).
In the next step, we compute our generated color ring in the manner shown in Fig. 5. We sample pixels from four regions around the digits in the generated images. The mean value of these four regions in each image is used to represent the background color of the image. Then, we convert the color information from RGB color space to HSV space and construct the color ring from the hue values. Because most of the GANs considered in our experiments are trained in an unsupervised manner, the starting index will not be the same as that of the standard color ring. So, as shown in Fig. 5, we enumerate every possible index mapping by rotating and flipping our generated color ring without changing its internal order. The mapping with the lowest MSE score gives the Hue value error, which measures the quality of disentanglement of the continuous color variables in our dataset. The Hue MSE is calculated using Eq. 6.
$$\mathrm{MSE}_{hue} = \frac{1}{K} \sum_{k=0}^{K-1} \left(h_k - \hat{h}_k\right)^2 \qquad (6)$$
where $h$ stands for the Hue values of the standard color ring (index 0 is for hue value 0.0), $\hat{h}$ stands for the Hue values of one possible mapping from the generated color ring to the standard one, and $K$ is the number of colors in the ring. As shown in Table 3, our method has the lowest Hue MSE, which is close to the performance of StyleGAN [karras2019style] on disentangling continuous information and much better than that of InfoGAN [chen2016infogan], even with supervision. We plot the exact hue values in the generated color ring of ICGAN in Fig. 6. We can see clearly linear behavior in the learned representation of Hue values.
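The search over rotations and flips of the generated color ring can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that the generated ring has already been reduced to one hue value per entry and has the same length as the standard ring.

```python
def hue_mse(reference, candidate):
    """Plain mean squared error between two equally long lists of hue values."""
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)

def best_ring_mse(reference, generated):
    """Lowest MSE over all rotations of the generated ring and of its mirror image."""
    n = len(generated)
    best = float("inf")
    for ring in (generated, generated[::-1]):      # original and flipped order
        for shift in range(n):                     # every starting index
            rotated = ring[shift:] + ring[:shift]
            best = min(best, hue_mse(reference, rotated))
    return best

reference = [k / 100 for k in range(100)]          # standard ring, index 0 is hue 0.0
# A ring with the same internal order, but rotated and flipped.
generated = (reference[30:] + reference[:30])[::-1]
aligned_error = best_ring_mse(reference, generated)
```

Rotating and flipping only changes where the ring starts and in which direction it is read, so a generator that learned the correct hue order is not penalized for an arbitrary starting color.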
5 Inspecting the features internally encoded
To further understand how the IC-Connection influences the learned features, we perform an ablation test on the concept-related activations. Previous work [Bolei2018] uses segmentation networks to help measure the Intersection over Union (IoU) of the concept changes in the generated images while modifying the activations in GANs. Inspired by [Bolei2018], we use segment information of the generated images to help interpret our architecture in a novel way.
We mask out the segment of one concept, i.e. we set the digit regions to 0 in our case. Then, we backpropagate this loss and inspect the activations with high gradient values. We use the gradient scores (Eq. 7) to represent each activation and rank the scores layer-wise.
where l is the layer index, i is the index of the activation in layer l, and w and h stand for the width and height of the current gradient matrix.
In the next step, we perform an ablation test by suppressing the activations with the top 10% gradient scores (Eq. 7) in each layer. As shown in Fig. 7, background information is encoded in the last two layers and digit shape is encoded in the earlier layers. The background color is not affected in the first five layers, even with cumulative ablation. In addition, the concept, i.e. the digit, cannot be ablated from a single layer alone, unlike in the feed-forward GAN tested in [Bolei2018]. This suggests that our GAN learns to use the IC-Connection to encode different information in different layers.
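The per-layer ablation step can be illustrated with NumPy. This is a sketch under assumptions: the score is taken to be the mean absolute gradient over the width and height of each activation map, which may differ from the exact form of Eq. 7, and the function names and array shapes are hypothetical.

```python
import numpy as np

def gradient_scores(grads):
    """One score per activation map: mean absolute gradient over height and width (assumed form)."""
    return np.abs(grads).mean(axis=(1, 2))

def ablate_top_fraction(activations, grads, fraction=0.10):
    """Zero the activation maps whose gradient scores fall in the top fraction of the layer."""
    scores = gradient_scores(grads)
    k = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[-k:]
    ablated = activations.copy()
    ablated[top] = 0.0
    return ablated, top

rng = np.random.default_rng(0)
activations = np.ones((20, 4, 4))                 # one layer with 20 activation maps
grads = rng.uniform(0.0, 0.1, size=(20, 4, 4))    # stand-in for backpropagated gradients
grads[7] = 5.0                                    # two maps that react strongly to the mask
grads[13] = -3.0
ablated, top = ablate_top_fraction(activations, grads)
```

Applying this layer by layer and regenerating the image after each step shows which layers carry the masked concept.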
We propose a new information compensation structure for GANs. According to our novel evaluation procedure, our method performs better at disentangling both discrete (digit shape) and continuous (color) information than InfoGAN [chen2016infogan] and the more recent StyleGAN [karras2019style]. Moreover, experiments on quantifying continuous color information suggest that our architecture is capable of generating new information.
Appendix A Appendix
We design a new dataset for conditional generation called ColorMNIST. Our dataset contains 108,503 (training) and 18,103 (testing) RGB images of 10 digits with 100 background colors drawn from HSV space from (0,1,1) to (1,1,1) with a step of 0.01 in Hue value. Each instance has two labels, one for the digit and one for the background color. The digit is regarded as a fully discrete attribute, and the background color is treated as a continuous attribute to some extent. The numbers of different digits in the original MNIST dataset are not perfectly balanced, so we undersample some digits, such as digit 1, while creating the dataset. For the discrete attribute, in addition to the 10 classes for the 10 digits, we add two more classes: one for solid color backgrounds and one for random noise. These two classes affect the accuracy estimate for digits when the input digit image is destroyed in some way. Without them, the classifier would assign a pseudo-label in this case, and the pseudo-label might coincide with the original label of the digit itself, which would strongly distort the accuracy-based metric. We list the information of ColorMNIST in Table 4.
In Fig. 8, we show some images sampled from our dataset for each discrete label. Labels 0 to 9 stand for the digits 0 to 9. Label 10 represents solid background color images and label 11 represents random noise. In the classification task we use all the images, but for the generation task we only use the images containing digits (discrete labels 0 to 9). We also plot 10 of the 100 color labels in Fig. 9. The continuous color information represents the colors of the rainbow color ring.