Unsupervised Representation Adversarial Learning Network: from Reconstruction to Generation

04/19/2018 ∙ by Yuqian Zhou, et al.

A good representation for arbitrarily complicated data should support semantic generation, clustering and reconstruction. Previous research has achieved impressive performance on each of these tasks individually. This paper aims to learn a disentangled representation that is effective for all of them in an unsupervised way. To achieve all three tasks together, we learn the forward and inverse mappings between data and representation on the basis of a symmetric adversarial process. In theory, we jointly minimize the upper bounds of the two conditional entropy losses between the latent variables and the observations to achieve cycle consistency. The newly proposed RepGAN is tested on the MNIST, FashionMNIST, CelebA, and SVHN datasets on unsupervised or semi-supervised classification, generation and reconstruction tasks. The results demonstrate that RepGAN is able to learn a useful and competitive representation. To the authors' knowledge, our work is the first to achieve both a high unsupervised classification accuracy and a low reconstruction error on MNIST.


1 Introduction

Learning a good representation from a complex data distribution can be addressed with deep directed generative models. Among them, the Generative Adversarial Network (GAN) [Goodfellow2016] was proposed to generate samples in a complicated data space by sampling from a simple pre-defined latent space. Specifically, a generator is trained to map latent samples to realistic data, and a discriminator is applied to differentiate real samples from generated ones. However, the original GAN only learns the forward mapping from an entangled latent space to the data space. Given complicated data, it lacks an inverse inference network to map the data back to an interpretable latent space.

Figure 1: Structure of AAE, InfoGAN and the combined RepGAN. AAE follows an x-z-x flow, and InfoGAN follows a z-x-z process. The latent vector is split into a categorical slot c, a continuous slot s and noise n. RepGAN jointly trains these two structures to emphasize cycle consistency, so that a bijection can be achieved between the latent and data spaces.

Several efforts have been made to learn the bidirectional mapping in an adversarial way. InfoGAN [Chen et al.2016] was proposed to address the problem of the uninformative latent space of GAN by disentangling the latent variables and maximizing the mutual information between a subset of the variables and the observations. InfoGAN is able to learn a representation with semantic meaning in a fully unsupervised way. However, faithful reconstruction cannot be achieved by InfoGAN. Another model, the Adversarial Autoencoder (AAE) [Makhzani et al.2015], performs variational inference by matching the aggregated posterior distribution with the prior distribution using an adversarial loss. The autoencoder-like structure guarantees good reconstruction performance, but generation from sampled latent variables is not faithful enough. BiGAN [Donahue et al.2016] and ALI [Dumoulin et al.2016] both propose an encoder (inference network) and a decoder (generative network), and seek to match the joint distributions of latent variables and data from the two networks. However, their objective functions do not constrain the relationship between the latent variables and the observations, which results in unsatisfactory reconstruction performance. ALICE [Li et al.2017] resolves this non-identifiability issue by additionally optimizing the conditional entropy, but it does not learn a disentangled latent space for semantic interpretation and knowledge discovery.

Bidirectional mapping is also addressed in applications such as image domain transformation and semantic image editing. In BiCycleGAN [Zhu et al.2017b], the authors distinguished two models, cVAE-GAN and cLR-GAN, and explained the hybrid model in an intuitive way (in terms of real or fake sampling). It does not encode interpretable information into the latent vector, but directly concatenates the vector with the images from another domain. crVAE [Shang et al.2017] could only demonstrate the semantic meaning of the latent vector through visual inspection. IAN [Brock et al.2016] proposed a hybridization of VAE and GAN to solve the semantic photo editing problem by improving the representation capability of the latent space without increasing its dimensionality. The decoder of the VAE is used as the generator of the GAN, and hidden layer outputs of the discriminator are used to quantify the reconstruction loss, which was shown to improve the reconstruction quality. However, no cycle consistency was enforced and the latent space was not disentangled. DTN [Taigman et al.2016] applied a similar structure for image domain transfer, but the latent space is not constrained to a regularized distribution, so random generation tasks were not performed.

In this paper, we seek to learn a generic interpretable representation and bidirectional network which is capable of reconstruction, generation and clustering at the same time. A model supporting all these capabilities is important for data analysis and transmission. Reconstruction ability will help data compression while transmitting. Clustering and generation ability will benefit the natural analysis of complicated data without human prior knowledge.

We first perform a theoretical analysis of two popular unsupervised learning models, the Adversarial Autoencoder (AAE) and the Information Maximizing GAN (InfoGAN). We identify their respective advantages and disadvantages by studying the loss functions they try to minimize, and relate them to mutual information and conditional entropy in information theory [Zhao et al.2017]. Then we propose a novel model that uses the concept of cycle consistency [Zhu et al.2017a, Yi et al.2017, Kim et al.2017] to combine those two models, and which achieves a better overall performance in terms of unsupervised classification accuracy, data reconstruction and generation by learning a useful generic disentangled latent space. Finally, we show and analyze the effectiveness of this new model on the MNIST, FashionMNIST, CelebA and SVHN datasets.

2 Related Work

In this section, we review two models, AAE [Makhzani et al.2015] and InfoGAN [Chen et al.2016], by studying their network structures and loss functions. We denote the parameters of the generator (from z to x) as θ, and those of the encoder (from x to z) as φ.

2.1 Adversarial Autoencoder (AAE)

The Adversarial Autoencoder [Makhzani et al.2015] is closely related to the Variational Autoencoder (VAE) [Kingma and Welling2013, Rezende et al.2014, Doersch2016]. The basic AAE-like structure is shown in figure 1. Recall that the variational lower bound optimized by the VAE is,

(1)

The first term is identified as the regularization term, which tries to match q_φ(z|x), the posterior of z conditional on x, to a target distribution p(z) using the KL divergence. The second term represents the reconstruction loss: given the data x, we generate the latent representation z, and then use this z to reconstruct the data. Notice that the loss term above is for a specific data point x. To get the loss over a training batch, we need to average it over p_data(x), namely

(2)
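For reference, a standard way of writing the VAE loss that matches the term-by-term description above is sketched here. The notation (encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, prior $p(z)$, data distribution $p_{\mathrm{data}}(x)$) follows common convention and is an assumption; it may differ from the authors' exact typesetting.

```latex
\begin{align}
% Per-sample loss: KL regularizer plus negative reconstruction log-likelihood
\mathcal{L}_{\mathrm{VAE}}(x) &= \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  - \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \\
% Averaged over the data distribution
\mathcal{L}_{\mathrm{VAE}} &= \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\Big[\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big]
  - \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\end{align}
```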

The loss function of AAE is similar to that of VAE, except that the regularization term is replaced by an adversarial learning process (represented by the JS divergence) on the aggregated posterior distribution. Therefore, the objective function for AAE becomes,

(3)
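Under the same assumed notation, one plausible form of the AAE objective replaces the per-sample KL term with a Jensen-Shannon divergence between the aggregated posterior $q_\phi(z) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[q_\phi(z \mid x)]$ and the prior. This is a sketch consistent with the sentence above, not necessarily the authors' exact expression.

```latex
\begin{equation}
\mathcal{L}_{\mathrm{AAE}} = \mathrm{JS}\big(q_\phi(z)\,\|\,p(z)\big)
  - \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\end{equation}
```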

InfoVAE [Zhao et al.2017] generalizes the regularization term of AAE to a family of divergences, and justifies the richer information it provides in the latent code sampled from the aggregated posterior distribution. The authors also proved that the latent space learned by InfoVAE (or its variant, AAE) does not suffer from the exploding problem or from uninformative latent modeling. However, unsupervised generation with a disentangled latent vector was not reported in the original AAE or InfoVAE papers.

2.2 Information Maximizing GAN (InfoGAN)

Another unsupervised learning model is InfoGAN (figure 1), which basically adds an information maximizing term on top of the vanilla GAN [Goodfellow2016] so that the generator is forced to use all the information contained in the input when generating sample data points. In the original InfoGAN paper, the latent vector is disentangled into categorical, continuous and noise parts, and the discriminator will output the categorical and continuous parts to achieve the mutual information maximization. The design of the discriminator is similar to other previous work like [Odena et al.2016, Salimans et al.2016, Springenberg2015]. The loss function of InfoGAN can be written as,

(4)
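A form of the InfoGAN loss that is symmetric to the AAE loss, as the later comparison in Section 3.0.2 suggests, can be sketched as follows. The symbols $p_\theta(x)$ (the generator's marginal over data) and $q_\phi(z \mid x)$ (the network that recovers the code) are assumed notation, any weighting between the two terms is omitted, and in the original InfoGAN only the categorical and continuous parts of the latent vector are recovered.

```latex
\begin{equation}
\mathcal{L}_{\mathrm{InfoGAN}} = \mathrm{JS}\big(p_\theta(x)\,\|\,p_{\mathrm{data}}(x)\big)
  - \mathbb{E}_{z \sim p(z)}\,\mathbb{E}_{x \sim p_\theta(x \mid z)}\big[\log q_\phi(z \mid x)\big]
\end{equation}
```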

InfoGAN achieves fully unsupervised representation learning with a disentangled semantic latent space. Both the generation and clustering performance are impressive, but the reconstruction quality of input images was not reported.

3 RepGAN

In this section, we derive the relationship between the loss functions of previous models and conditional entropy, and compare their influence on the learned representation. Then we integrate the losses and propose a novel model called Representation GAN (RepGAN) to learn a useful disentangled latent space.

3.0.1 Conditional Entropy

The conditional entropy measures the uncertainty of one random variable given the other. For example, H(x|z) quantifies the uncertainty of the observation space x given the latent space z, which can be formulated as,

(5)

Comparing the upper bound of the conditional entropy with the second term in the VAE/AAE loss (equation 2), we can see that they are exactly the same expression. As a result, the upper bound of the conditional entropy H(x|z) is equivalent to the reconstruction term of the VAE/AAE objective function. Minimizing the upper bound of the conditional entropy also amounts to maximizing the mutual information I(x; z), if the entropy of the data distribution H(x) is assumed to be fixed.
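The bound referred to here follows from a standard identity, sketched below under assumed notation: $q(x, z) = p_{\mathrm{data}}(x)\,q_\phi(z \mid x)$ is the joint distribution induced by the encoder, and $q(x \mid z)$ is its exact conditional. Because a KL divergence is non-negative, dropping it gives an upper bound that coincides with the reconstruction term.

```latex
\begin{align}
H(x \mid z) &= -\mathbb{E}_{q(x, z)}\big[\log q(x \mid z)\big] \\
  &= -\mathbb{E}_{q(x, z)}\big[\log p_\theta(x \mid z)\big]
     - \mathbb{E}_{q(z)}\Big[\mathrm{KL}\big(q(x \mid z)\,\|\,p_\theta(x \mid z)\big)\Big] \\
  &\le -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big], \\
I(x; z) &= H(x) - H(x \mid z).
\end{align}
```

With $H(x)$ fixed by the data, lowering the bound on $H(x \mid z)$ raises a lower bound on $I(x; z)$.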

Similarly, the conditional entropy H(z|x) can be formulated as,

(6)

We see that the reconstruction loss term in equation 4 is exactly the same as the upper bound of the conditional entropy H(z|x). As noted by the authors of the original InfoGAN paper, this loss function tries to minimize the conditional entropy H(z|x) and consequently to maximize the mutual information I(x; z).
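By symmetry, the same argument applies in the generative direction. The sketch below assumes the joint distribution $p(x, z) = p(z)\,p_\theta(x \mid z)$ induced by the generator, with $p(z \mid x)$ its exact posterior; the notation is illustrative rather than taken from the original typesetting.

```latex
\begin{align}
H(z \mid x) &= -\mathbb{E}_{p(x, z)}\big[\log p(z \mid x)\big] \\
  &= -\mathbb{E}_{p(x, z)}\big[\log q_\phi(z \mid x)\big]
     - \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(z \mid x)\,\|\,q_\phi(z \mid x)\big)\Big] \\
  &\le -\mathbb{E}_{z \sim p(z)}\,\mathbb{E}_{x \sim p_\theta(x \mid z)}\big[\log q_\phi(z \mid x)\big], \\
I(x; z) &= H(z) - H(z \mid x).
\end{align}
```

Here $H(z)$ is fixed by the choice of prior, so minimizing the bound on $H(z \mid x)$ again maximizes a lower bound on the mutual information.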

3.0.2 Comparing AAE and InfoGAN

Now we study the properties of the loss functions of AAE and InfoGAN. AAE minimizes the conditional entropy H(x|z), so it is trained to decrease the uncertainty of x given z. As shown in figure 2, AAE exhibits a stochastic mapping from x to the latent representation z. Comparing equation 4 to equation 2, we can see that they are symmetric to each other. Namely, AAE follows an x-z-x pattern while InfoGAN follows a z-x-z pattern. As a result, we can use the same argument as in the AAE case to conclude that InfoGAN actually maps multiple data points back to the same latent representation, as shown in figure 2.

Both AAE and InfoGAN try to maximize the mutual information I(x; z) by minimizing conditional entropies. However, since conditional entropy is not symmetric, the two models have different focuses. Specifically, we show that AAE maps multiple points in latent space to a single point in data space, whereas InfoGAN maps multiple points in data space to a single point in latent space. Therefore, AAE is good at reconstruction (when the latent space is large enough), although its classification performance is not guaranteed. On the other hand, InfoGAN is good at classification, because different digits with subtle differences can be put into the same category, which makes the classifier robust to noise and small style changes. Its reconstruction performance, however, is not guaranteed.

In fact, if we follow the design of the latent vector in the original InfoGAN paper, the reconstruction of InfoGAN cannot be good, because the noise at the input of InfoGAN is not present at the output, which means that subtle information describing the details of the image is discarded during reconstruction. In our experiments, we find that the noise changes the generated image greatly, as shown in figures 10 and 11. If the noise is simply discarded, the reconstructed images differ noticeably from the inputs.

To further understand the mapping relations of AAE and InfoGAN, consider the discrete case. The second term in equation 2 is minimized to zero when p_θ(x|z) equals one for all z drawn from q_φ(z|x), where those z are the outputs of the encoder of the AAE. Thus in the optimal case, the probability mass function q_φ(z|x) should have disjoint support for different x [Zhao et al.2017], but it is not optimized to 1 for a specific z. That is the reason why one data point x can be mapped to different z, and multiple z can be used to reconstruct the same x. Similar explanations apply to the mapping property of InfoGAN.

Another problem is the dimension of the latent vector. A lower latent dimension suffers from insufficient representation ability, while a higher latent dimension increases the difficulty of distribution regularization using adversarial learning or KL divergence. For example, consider the case where the latent space has a categorical one-hot variable c and a continuous variable s, and we wish to categorize the digits in MNIST in terms of which number they represent and their style. If the input is a slim and clockwise rotated digit 7, the encoder should output c with the 7th element being one and the others being zero, and s should have its two elements describing the style of the digit, namely slimness and rotation. However, apart from being slim and rotated, this digit 7 still has many other characteristics, perhaps a bar in the middle or a sharp turn at the corner. This subtle information will be lost during the encoding process due to the insufficient latent dimension. As a result, when the latent space is not large enough, the loss function of AAE makes the output of the decoder converge to an averaged version of all the inputs that share the same latent representation, which is a blurred image when training with the L2 norm under a Gaussian assumption. When the latent space is large enough, the reconstruction of AAE is much better. However, a high latent dimension increases the difficulty of latent regularization. These drawbacks are shown in figures 4 and 5.
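The averaging effect can be made precise with a short worked derivation. For a fixed latent code $z$, let $\mu_z = \mathbb{E}[x \mid z]$ denote the mean of all inputs mapped to that code (a symbol introduced here for illustration). The expected squared error of any decoder output $\hat{x}$ then decomposes as follows, because the cross term vanishes.

```latex
\begin{align}
\mathbb{E}\big[\|x - \hat{x}\|^2 \mid z\big]
  &= \mathbb{E}\big[\|x - \mu_z\|^2 \mid z\big] + \|\mu_z - \hat{x}\|^2, \\
\arg\min_{\hat{x}}\ \mathbb{E}\big[\|x - \hat{x}\|^2 \mid z\big] &= \mu_z .
\end{align}
```

The first term does not depend on $\hat{x}$, so the L2-optimal output is the pixel-wise mean of the inputs sharing the code, which appears blurred whenever those inputs differ in fine detail.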

Figure 2: Mapping relations of AAE, InfoGAN and the proposed RepGAN. Cycle consistency is enforced to obtain a deterministic bijection, which benefits both reconstruction and generation.

3.0.3 Model Structure

In order to combine the strengths of both models, we propose to train AAE and InfoGAN together with shared parameters, so that the new model can achieve good classification and reconstruction performance at the same time. The network architecture is illustrated in figure 1. Specifically, the encoders (X2Rep) of AAE and InfoGAN are the same module with shared parameters. Likewise, the decoders (Rep2X) of AAE and InfoGAN are the same module with shared parameters. During training, we alternate between InfoGAN-style and AAE-style updates, so that the classification accuracy is improved by InfoGAN training while the reconstruction performance is improved by AAE training. The training of InfoGAN is emphasized, which experimentally gives better results: the ratio of InfoGAN to AAE training steps is 5:1. The latent vector is split into three subsets: a categorical variable c, a continuous variable s and a noise variable n. The continuous and noise variables can be sampled from a Gaussian distribution. The full objective function for RepGAN is,

(7)
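The 5:1 alternation between InfoGAN-style and AAE-style updates described above can be sketched as a simple repeating schedule. This is a minimal illustration: the text does not say whether the ratio applies per batch or per epoch, so a per-step pattern is assumed, and `training_schedule` is a hypothetical helper name rather than part of any released code.

```python
from itertools import cycle, islice

# Ratio stated in the text: five InfoGAN-style steps for every AAE-style step.
INFOGAN_STEPS_PER_AAE_STEP = 5


def training_schedule(num_steps, infogan_per_aae=INFOGAN_STEPS_PER_AAE_STEP):
    """Return a list naming which update ('infogan' or 'aae') runs at each step."""
    pattern = ["infogan"] * infogan_per_aae + ["aae"]
    return list(islice(cycle(pattern), num_steps))


schedule = training_schedule(12)
for step, phase in enumerate(schedule):
    # A real training loop would call the matching update on the shared
    # encoder and decoder here; this sketch only prints the phase.
    print(step, phase)
```

Because both phases update the same encoder and decoder, the schedule only decides which loss drives the shared parameters at a given step.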

In practical implementation, we rewrite the objective function as,

(8)

where the image reconstruction term is computed as a norm of the reconstruction error, the categorical term is a cross-entropy loss, and the continuous term is the negative log-likelihood of a Gaussian, using the re-parametrization trick as in InfoGAN. The model structure used in the experiments is summarized in table 1. The stride of each convolution layer is always 2, and we follow WGAN [Arjovsky and Bottou2017, Arjovsky et al.2017] for the structure design and optimization tricks. For learning rate, we used , , for the generators, AAE discriminator and InfoGAN discriminator on the MNIST dataset. For FashionMNIST, we used , , . For SVHN, we used , , .
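The three kinds of loss terms named above (a norm for image reconstruction, cross-entropy for the categorical code, and a Gaussian negative log-likelihood for the continuous code) can be sketched in NumPy as follows. The function names are hypothetical, the squared L2 error is an assumption (the specific norm is not legible in this copy), and the relative weights of the terms are not shown.

```python
import numpy as np


def reconstruction_loss(x, x_hat):
    """Mean squared (L2) error between inputs and reconstructions (assumed norm)."""
    return float(np.mean((x - x_hat) ** 2))


def categorical_cross_entropy(c_onehot, c_probs, eps=1e-8):
    """Cross-entropy between one-hot codes and predicted class probabilities."""
    return float(-np.mean(np.sum(c_onehot * np.log(c_probs + eps), axis=1)))


def gaussian_nll(s, mean, sigma):
    """Negative log-likelihood of s under a diagonal Gaussian N(mean, sigma^2)."""
    var = sigma ** 2
    per_dim = 0.5 * np.log(2.0 * np.pi * var) + (s - mean) ** 2 / (2.0 * var)
    return float(np.mean(np.sum(per_dim, axis=1)))


# Tiny usage example with a batch of two samples.
x = np.array([[0.0, 1.0], [1.0, 0.0]])
c = np.eye(10)[[3, 7]]              # one-hot codes for classes 3 and 7
s = np.zeros((2, 2))                # two continuous dimensions, as in Table 1
print(reconstruction_loss(x, x))
print(categorical_cross_entropy(c, c))
print(gaussian_nll(s, mean=np.zeros((2, 2)), sigma=np.ones((2, 2))))
```

The `mean` and `sigma` arguments correspond to the "s mean" and "s sigma" encoder outputs listed in Table 1.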

| Encoder | Decoder |
|---|---|
| In 28x28x1 | In 32x1 |
| 4x4x64 conv, LReLU, BN | FC 1024, ReLU, BN |
| 4x4x128 conv, LReLU, BN | FC 7x7x128, ReLU, BN |
| FC 1024, LReLU, BN | 4x4x64 deConv, ReLU, BN |
| c: FC 10, softmax, BN | 4x4x1 deConv, Sigmoid |
| s mean: FC 2, LReLU, BN | |
| s sigma: FC 2, LReLU, BN, exp() | |
| n: FC 20, LReLU, BN | |

| Dz | Dx |
|---|---|
| In c/s/n | In 28x28x1 |
| FC 3000, LReLU | 4x4x64 conv, LReLU |
| FC 3000, LReLU | 4x4x128 conv, LReLU, BN |
| FC 1, raw output (WGAN) | FC 1024, LReLU, BN |
| | FC 1, sigmoid |

Table 1: Network Structure of RepGAN

4 Experiments

We tested the three models AAE, InfoGAN and RepGAN on the MNIST [LeCun et al.2010], FashionMNIST [Xiao et al.2017], and SVHN [Netzer et al.2011] datasets. We conducted two sets of experiments with different types of the latent variable z. First, z is not disentangled but directly sampled from an isotropic Gaussian distribution. In this experiment, we investigate the theory of mapping discussed in section 3.0.2. Second, like the original InfoGAN, we split the latent vector into three slots: a one-hot vector c sampled from a categorical distribution, a continuous vector s sampled from a Gaussian, and a random noise n. In addition to reconstruction and generation performance, we demonstrate the unsupervised and semi-supervised clustering performance of RepGAN, and the importance of the noise.

4.1 Gaussian Latent space

We first implement AAE, InfoGAN, and RepGAN with a single entangled latent space, using a latent vector sampled from an isotropic Gaussian distribution with zero mean and 0.5 variance. We vary the dimension of the latent vector over 2, 8, 16, 32 and 64, and compare the reconstruction performance in image space and latent space for all three models. Training and testing are conducted on the MNIST dataset.

4.1.1 Image Reconstruction

The image reconstruction result is computed by arranging each trained model in the AAE-like structure and feeding real data samples to the input. The visualization is shown in figure 3. AAE achieves better reconstruction than InfoGAN, and RepGAN is almost as good as AAE. As shown in figures 3 and 5, InfoGAN reconstructs poorly, and its error remains the highest among the three models for all latent dimensions. That is because the loss definition of InfoGAN does not place constraints on image reconstruction.

Figure 3: Reconstruction visualization of AAE, InfoGAN and RepGAN with Gaussian latent space. AAE and RepGAN achieve comparably good reconstruction, but InfoGAN cannot recover the original input image well enough. The reconstructed images are blurry when the latent dimension of AAE is low, but better with RepGAN.
Figure 4: Generated samples and the randomly selected learned latent distribution of all three models. The target latent space is zero-mean Gaussian with 0.5 variance.
Figure 5: Reconstruction error and latent vector error curves for the three models across the latent dimensions. Left: data reconstruction error in terms of MSE. Right: latent reconstruction error in terms of MSE, illustrating the ability of generation.

4.1.2 Latent Reconstruction and Generation

For latent reconstruction evaluation, we follow the structure of InfoGAN for testing. After training all the models, we reorganize the network structure and feed a sampled latent vector into the network. In figure 5, we plot the MSE of the latent code to examine the ability of latent regularization and the existence of mode collapse in the models. For AAE, the error becomes large when the latent dimension is high because of (1) heavy mode collapse, in which the model generates identical x for different z, as implied by the objective, and (2) unsatisfactory latent regularization, as in VAE. In this case, when we sample from the true prior distribution, the AAE model cannot generate good-quality images, because the high-quality manifold has shifted. As shown in figure 4, AAE cannot generate high-quality samples when the latent dimension is too large, and cannot generate sharp samples when the dimension is small. The model fails to learn a good latent distribution when the latent dimension is greater than or equal to 16.

In contrast, InfoGAN and RepGAN achieve identically good performance for latent space modeling and new sample generation. The generated images are also sharp and clear. All the images in figure 4 are randomly sampled from the generation results. Compared with AAE and InfoGAN, which can only guarantee one of recognition and generation, the proposed RepGAN achieves both capabilities simultaneously by constraining the two conditional entropies, and the mapping between the latent variable and real data reduces to a bijection.
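The two error measures used in this section, image reconstruction (x-z-x) and latent reconstruction (z-x-z), can be illustrated with a toy example. The sketch below uses hypothetical linear stand-ins for the generator and encoder, not the convolutional networks of Table 1, and toy dimensions chosen for illustration; it only shows how the two MSE values are computed and why they can differ.

```python
import numpy as np

rng = np.random.default_rng(0)
data_dim, latent_dim = 16, 4  # toy sizes, not the paper's

# Hypothetical stand-ins: a linear generator with orthonormal columns,
# and its transpose as the encoder.
W, _ = np.linalg.qr(rng.normal(size=(data_dim, latent_dim)))


def generate(z):
    """Map latent codes of shape (batch, latent_dim) to data space."""
    return z @ W.T


def encode(x):
    """Map data of shape (batch, data_dim) to latent space."""
    return x @ W


def mse(a, b):
    return float(np.mean((a - b) ** 2))


# Prior with zero mean and variance 0.5, so the standard deviation is sqrt(0.5).
z = rng.normal(0.0, np.sqrt(0.5), size=(1000, latent_dim))
latent_mse = mse(z, encode(generate(z)))   # z-x-z cycle

x = rng.normal(size=(1000, data_dim))      # stand-in for real data
data_mse = mse(x, generate(encode(x)))     # x-z-x cycle

print(latent_mse, data_mse)
```

In this toy setting the z-x-z cycle is exact while the x-z-x cycle loses whatever lies outside the low-dimensional subspace, which mirrors the point that a small latent dimension limits reconstruction even when the latent code itself is recovered perfectly.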

4.2 Disentangled Latent space

In this experiment, we disentangle the latent space and follow the original structure of InfoGAN. The variables c, s and n have dimensions 10, 2, and 20 respectively. In addition to reconstruction and generation, we compare the unsupervised and semi-supervised clustering performance. We also investigate the importance of the noise.
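Sampling a disentangled latent vector with these dimensions can be sketched as follows. The uniform categorical prior and the unit-variance Gaussians are assumptions made for illustration (the text only states that the continuous and noise parts are Gaussian), and `sample_latent` is a hypothetical helper.

```python
import numpy as np

C_DIM, S_DIM, N_DIM = 10, 2, 20  # dimensions of c, s and n stated in the text


def sample_latent(batch_size, rng):
    """Sample [c, s, n]: one-hot category, continuous code and noise."""
    labels = rng.integers(0, C_DIM, size=batch_size)  # assumed uniform prior
    c = np.eye(C_DIM)[labels]                         # one-hot categorical slot
    s = rng.normal(size=(batch_size, S_DIM))          # assumed unit variance
    n = rng.normal(size=(batch_size, N_DIM))          # assumed unit variance
    return np.concatenate([c, s, n], axis=1)


z = sample_latent(8, np.random.default_rng(0))
print(z.shape)
```

The concatenated vector has 32 entries, which matches the decoder input size listed in Table 1.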

4.2.1 Unsupervised Learning

| Model | MNIST (Acc) | MNIST (MSE) | FMNIST (Acc) | FMNIST (MSE) |
|---|---|---|---|---|
| AAE | 86.92% | 0.007 | 57.30% | 0.015 |
| InfoGAN | 95% | 0.07 | 53.81% | 0.098 |
| VADE | 94.46% | None | None | None |
| DEC | 84.30% | None | None | None |
| RepGAN | 96.74% | 0.02 | 58.64% | 0.013 |

Table 2: Testing Accuracy for Unsupervised Classification

| Model | SVHN (Acc) | SVHN (MSE) | FMNIST (Acc) | FMNIST (MSE) |
|---|---|---|---|---|
| AAE | 71.67% | 0.004 | 83.82% | 0.02 |
| InfoGAN | 74.16% | 0.03 | 79.41% | 0.12 |
| RepGAN | 76.05% | 0.006 | 82.81% | 0.034 |

Table 3: Testing Accuracy for Semi-supervised Classification

When evaluating the unsupervised clustering accuracy, we set the continuous and noise vectors to zero and generate the cluster head of each cluster. Then we search the training set for the sample closest to each cluster head, and assign the label of that sample to the whole cluster. Finally, we compute the accuracy based on the assigned cluster labels. Table 2 shows the classification accuracies on MNIST and FashionMNIST, together with those of comparable models such as VADE [Jiang et al.2016] and DEC [Xie et al.2016]. InfoGAN and RepGAN achieve average accuracies of about 95% and 96% respectively, which is much higher than AAE, which only achieves 87%. For FashionMNIST, the classification accuracy is low due to the high similarity of images with different category labels. This experimental result is consistent with our theoretical analysis in section 3.0.2, namely that InfoGAN is better for classification than AAE. RepGAN, being the combination of AAE and InfoGAN, successfully preserves the ability of InfoGAN for clustering and generation, and that of AAE for reconstruction. Note that our network structure is slightly different from those in the original InfoGAN and AAE papers, so the results may not be the same as the reported ones.
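The labeling procedure described above (generate one cluster head per category, find its nearest training sample, and propagate that label to the cluster) can be sketched as follows. Euclidean distance on flattened images is an assumption, since the distance measure is not specified, and the function names and toy data are hypothetical.

```python
import numpy as np


def label_clusters(cluster_heads, train_x, train_y):
    """Assign each cluster the label of the training sample nearest to its head."""
    # Pairwise squared Euclidean distances, shape (num_clusters, num_train).
    diffs = cluster_heads[:, None, :] - train_x[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return train_y[dists.argmin(axis=1)]


def clustering_accuracy(pred_cluster, cluster_labels, true_y):
    """Fraction of samples whose cluster's assigned label matches the true label."""
    return float(np.mean(cluster_labels[pred_cluster] == true_y))


# Toy data: two clusters in a 2-D "image" space.
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
train_y = np.array([3, 3, 7, 7])
heads = np.array([[4.9, 5.0], [0.05, 0.1]])   # generated with s = 0 and n = 0

cluster_labels = label_clusters(heads, train_x, train_y)
# pred_cluster would come from the argmax of the encoder's categorical output.
pred_cluster = np.array([1, 1, 0, 0])
true_y = np.array([3, 3, 7, 3])
accuracy = clustering_accuracy(pred_cluster, cluster_labels, true_y)
print(cluster_labels, accuracy)
```

Because each cluster inherits a single label, the accuracy depends both on how pure the clusters are and on whether the cluster head lands near a representative training sample.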

The qualitative evaluation of the reconstruction and generation ability of RepGAN is shown in figures 6 and 9. By fixing the categorical code, the model is able to generate samples belonging to the corresponding cluster, and by changing the continuous value, the model traverses the learned manifold of styles. When reconstructing, RepGAN achieves a more faithful reconstruction than InfoGAN, and sharper images than AAE.

We also compare our generated images on CelebA with those of InfoGAN in figure 7. Using the same latent space configuration as InfoGAN, namely 10 categorical variables that are each a 10-dimensional one-hot vector, we are able to achieve better image quality while still showing attribute changes. In summary, RepGAN does well in all three tasks: reconstruction, generation, and unsupervised clustering.

Figure 6: Varying the categorical variable, the RepGAN can cluster different types of images in MNIST and FashionMNIST in a fully unsupervised way. The categorical variable varies along the column, and the continuous variable varies along the row.
Figure 7: Comparison of the attribute generation ability of RepGAN and InfoGAN. RepGAN generates better-quality images than InfoGAN.
Figure 8: Comparison of generated SVHN samples between AAE and RepGAN. RepGAN generates sharper images than AAE.
Figure 9: Reconstruction performance of AAE, InfoGAN and RepGAN with disentangled latent space. The first row of each group of images is the input, and the second row is the reconstruction. InfoGAN cannot achieve a faithful reconstruction, and AAE only recovers blurred images if the latent dimension is not big enough. RepGAN achieves sharper and clearer reconstruction.

4.2.2 Semi-supervised Learning

We conduct semi-supervised learning on FashionMNIST and SVHN by using 1000 labeled images from each dataset for training. Unlike ACGAN [Odena et al.2016] or catGAN [Springenberg2015], InfoGAN is not designed for semi-supervised or supervised learning. In our experimental setting, we therefore first train the discriminator with the selected 1000 labels, and then jointly train the discriminator and the generator to achieve semi-supervised learning. Consequently, the reported semi-supervised results on SVHN and FashionMNIST may not be better than those of other state-of-the-art approaches. AAE, however, is easier to adapt to the semi-supervised setting, and according to the original paper it achieves comparable classification accuracy on SVHN. Adding the InfoGAN objectives greatly enhances the generation ability when sampling from the latent space, and achieves comparable or better clustering accuracy on SVHN and FashionMNIST.

The testing accuracy and MSE values are shown in table 3. The three models demonstrate similar classification performance, but the reconstruction of InfoGAN is still unsatisfactory. We also compare the generation quality on SVHN between AAE and RepGAN in figure 8, and show that RepGAN generates sharper images than AAE.

4.2.3 Effectiveness of Noise

The noise variable is interpreted as representing incompressible information in InfoGAN. We pass the noise through the network for intact and plausible image reconstruction when training the AAE-like part, since the categorical and continuous variables may not be expressive enough for intact reconstruction. The difference between the continuous and noise variables is the following: the lower-dimensional continuous variable is used to encode the most salient attributes (or directions of largest data variance) commonly shared by all the samples (it is enhanced by ), while the noise is used to encode incompressible or entangled information (it is enhanced by ).

In figure 10, we demonstrate the effect of the continuous vs. noise variables on generated samples. Specifically, in the first row, we interpolate on the continuous code and set the noise variable to zero. While varying the continuous variable, the style changes explicitly and smoothly. After adding random noise, more variants are generated in addition to the uniformly changed style.

In the second row, we interpolate the first two dimensions of the noise code and set the continuous variable to zero. We can see tiny changes in the generated images when traversing the first two dimensions of the noise, and the changes are slightly different for distinct clusters. This demonstrates that the information encoded in the noise is cluster-dependent. If we randomly sample the continuous variable and keep it the same for all the clusters, we can see identical changes of the image attributes across clusters (degree of slant and thickness). This demonstrates that the information encoded in s is cluster-independent, or cluster-shared. Similarly, figure 11 shows the generated samples from the FashionMNIST dataset (unsupervised).
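The two traversals described in this section can be sketched by constructing the latent batches that would be fed to the decoder. The layout [c, s, n] with dimensions 10, 2 and 20 follows Section 4.2, while the helper names and the traversal range of -2 to 2 are assumptions made for illustration.

```python
import numpy as np

C_DIM, S_DIM, N_DIM = 10, 2, 20  # dimensions of c, s and n from Section 4.2
Z_DIM = C_DIM + S_DIM + N_DIM


def continuous_traversal(category, s_index, values):
    """Fixed category, zero noise, one continuous dimension swept over values."""
    z = np.zeros((len(values), Z_DIM))
    z[:, category] = 1.0
    z[:, C_DIM + s_index] = values
    return z


def noise_traversal(category, n_index, values):
    """Fixed category, zero continuous code, one noise dimension swept over values."""
    z = np.zeros((len(values), Z_DIM))
    z[:, category] = 1.0
    z[:, C_DIM + S_DIM + n_index] = values
    return z


values = np.linspace(-2.0, 2.0, 5)  # assumed traversal range
style_batch = continuous_traversal(category=7, s_index=0, values=values)
noise_batch = noise_traversal(category=7, n_index=1, values=values)
print(style_batch.shape, noise_batch.shape)
```

Decoding `style_batch` for every category with the same `values` would correspond to the cluster-shared style changes, while decoding `noise_batch` would show the smaller, cluster-dependent variations.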

5 Conclusion and Discussion

In this paper, we analyzed the advantages and disadvantages of two popular unsupervised machine learning models: AAE and InfoGAN. We showed both theoretically and experimentally that InfoGAN is able to achieve a higher classification accuracy, whereas AAE is able to achieve a better reconstruction quality. We then combined the two models to take advantage of the strengths of each. We showed on the MNIST, FashionMNIST and SVHN datasets that the new model, named RepGAN, is able to achieve both a high classification accuracy and a good reconstruction quality in both the original input space and the latent space. By performing well in both classification and reconstruction, RepGAN is able to learn a good bidirectional mapping between the input space and the latent space, which is a desired property of an unsupervised representation learning model. Applying it to the discovery of structure in arbitrarily complicated data, with more complicated network structures and larger latent dimensions, is left for future work.

Figure 10: First row: traversing the continuous variable s with zero noise, and then controlling s and adding noise n. The same noise batch is used for different clusters, but it produces different variants, so the noise variable is cluster-dependent. Second row: traversing the first two dimensions of the noise vector with zero s. Tiny changes in the samples can be seen. Then we add the same random continuous variable value for distinct clusters, and identical changes are produced. s is cluster-independent and corresponds to commonly shared attributes: slant and thickness.
Figure 11: Generated samples on the FashionMNIST dataset. Similar to MNIST, we can see a smooth style transition when the noise is set to zero. Compared to the MNIST dataset, the noise variables have a larger impact on generated images on FashionMNIST. In addition, since RepGAN on FashionMNIST is trained in an unsupervised way, different categories with similar appearance may be classified as the same category, as shown in the bottom right sub-figure, where sandals and sneakers are confused.

References