Unsupervised Projection Networks for Generative Adversarial Networks

by   Daiyaan Arfeen, et al.
berkeley college

We propose the use of unsupervised learning to train projection networks that project onto the latent space of an already trained generator. We apply our method to a trained StyleGAN, and use our projection network to perform image super-resolution and clustering of images into semantically identifiable groups.





1 Introduction

Deep learning has been rapidly pushing the boundaries of image processing, generation, and manipulation. Generative Adversarial Networks (GANs) in particular have achieved notable success in this domain [5]. StyleGAN, a recently published network architecture for face generation [7], is able to automatically separate high-level features in an unsupervised manner while preserving stochasticity in the generated images. This generator can better disentangle complex image properties into latent factors, allowing for better control and interpolation properties. We propose in this paper a new method that constructs a network mapping an image to its latent code, for the purpose of exposing structure in the latent space.

Specifically, we explore whether the image-to-latent network can learn a projection onto a latent space that can be used to explore the structure of images; we demonstrate this by using it to cluster real images and to perform image super-resolution. Our results show that the projection network is successful, as we are able to achieve image super-resolution on images produced by the generator. Furthermore, using the projection network for clustering results in meaningful, varied clusters of real images.

2 Related Work

Recently, deep generative models such as generative adversarial networks (GANs) have become popular in the image generation literature [5]. A GAN consists of a generator network and a discriminator network, in which the generator's goal is to produce synthetic images that trick the discriminator into believing they are real. Training resembles searching for the equilibrium of a game between two players with opposing objectives. When trained successfully, the two-player game ideally results in the generator being able to generate realistic images while the discriminator cannot do better than random guessing.
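For reference, the two-player game described above is commonly written as the minimax objective of the original GAN formulation [5]. The symbols $D$ (discriminator), $G$ (generator), $p_{\text{data}}$ (data distribution), and $p_z$ (latent prior) follow that standard notation and are introduced here only for illustration:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

At the ideal equilibrium of this game, $D(x) = 1/2$ for every input, which corresponds to the discriminator doing no better than random guessing.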

Since the original GAN paper, myriad GAN training techniques and network architectures have been developed to allow generators to better approximate high-quality, varied image distributions. For example, StyleGAN [7] uses a learned fully-connected network to disentangle the randomly sampled latent space for separation of high-level image attributes. When trained on faces, this latent vector likely encodes both global and local information such as pose, gender, face shape, hair style, etc.

Even with a disentangled latent space, StyleGAN still suffers from a challenge common to GANs: it is difficult to derive the structure of the latent space. This is problematic for applying GANs to controlled image generation, as the latent space is the main source of variance in the final output. Clustering is one task that can expose structure in a space by partitioning it into distinct regions.

[8] showed good results clustering images using their projections into the latent space of a GAN, but their method requires training a network to learn a projection from the image space to the latent space as well as training the GAN with a new loss. [3] trained an encoder to learn the mapping from image to latent space jointly with the generator and discriminator, while [2] proposed a bidirectional GAN architecture that includes an inverting network and similarly projects images into the GAN's latent space. Learned projections have also been proposed by [12], where a projection function was learned for an already trained GAN for the purpose of interpolation.

In this paper we learn a projection function for an already trained StyleGAN so that we may suitably cluster images in the StyleGAN latent space. We do so by training the projection network with a latent loss in an unsupervised manner. We note that our technique should work for most already trained generator architectures; in this paper we explore StyleGAN exclusively.

3 Methodology

The StyleGAN generator consists of two networks that are jointly trained: a multi-layer perceptron (MLP) that projects an entangled latent variable $z$ drawn from a random distribution into a disentangled latent space $\mathcal{W}$, and a generator that transforms a disentangled latent variable $w \in \mathcal{W}$ into an image. We denote the MLP as $M$ and the generator as $G$.

Our objective is to find an estimator $E_\theta$ that projects images to a disentangled latent space, in order to minimize over $\Theta$, the parameter space, the risk taken over the latent variable $z$. Our loss function is the squared error loss. Thus, the parameters of $E_\theta$ are denoted as:

$$\theta^* = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{z}\left[\lVert E_\theta(G(M(z))) - M(z) \rVert_2^2\right]$$

If $G$ is not StyleGAN's generator, $M$ is usually the identity function and we would be projecting from images directly to the randomly sampled latent space $\mathcal{Z}$.
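A minimal sketch of the latent loss described above, under toy assumptions: the frozen networks `M` and `G` are small random maps rather than StyleGAN, the sizes `d_z`, `d_w`, and `d_img` are arbitrary, and `zero_E` and `exact_E` are hypothetical projection functions used only to show the two extremes of the loss. Note that the loss needs only sampled latent variables, not an image dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_w, d_img = 8, 4, 16  # toy sizes, chosen only for illustration

# Frozen toy stand-ins for the pretrained networks (assumptions, not StyleGAN).
A = rng.normal(size=(d_z, d_w)) / np.sqrt(d_z)
B = rng.normal(size=(d_w, d_img)) / np.sqrt(d_w)

def M(z):
    """Toy mapping network: entangled z -> disentangled w."""
    return np.tanh(z @ A)

def G(w):
    """Toy generator: disentangled w -> flattened 'image'."""
    return w @ B

def latent_loss(E, z):
    """Monte Carlo estimate of E_z ||E(G(M(z))) - M(z)||^2."""
    w = M(z)
    w_hat = E(G(w))
    return float(np.mean(np.sum((w_hat - w) ** 2, axis=1)))

z = rng.normal(size=(256, d_z))

# A projection that ignores its input, and the exact inverse of the linear toy G.
zero_E = lambda x: np.zeros((x.shape[0], d_w))
exact_E = lambda x: x @ np.linalg.pinv(B)

loss_zero = latent_loss(zero_E, z)
loss_exact = latent_loss(exact_E, z)
print(f"uninformative projection: {loss_zero:.4f}")
print(f"exact inverse of toy G:   {loss_exact:.2e}")
```

In this toy setting the exact inverse drives the loss to numerical zero; for a real generator no closed-form inverse is available, which is why the projection is learned.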

Unlike previous approaches to learning a projection onto the latent space of a GAN, our method is unsupervised and relies only on the trained generator. Previous approaches, such as [12], used supervised learning to train the projection network $E_\theta$ against the generator $G$, such that its parameters are

$$\theta^* = \arg\min_{\theta \in \Theta} \; \sum_{x \in X} \mathcal{L}\big(G(E_\theta(x)),\, x\big)$$

where $X$ is the image dataset and $\mathcal{L}$ is a loss function such as MSE or negative log-likelihood.

4 Experiments and Results

4.1 Training a Baseline Model

For all of our experiments, our projection network $E$ is the ResNet18 model [6] pretrained on ImageNet [1], with the fully connected layer replaced by a randomly initialized one. The generator $G$ and the mapping network $M$ come from a pretrained StyleGAN model [7] (trained on the FFHQ dataset) with a fixed output size, and both are held constant during training of $E$. We approximate the optimal parameters of $E$ through stochastic gradient descent with the Adam optimizer. In less than 10 minutes of training on a single Tesla V100 GPU, $E$ is able to learn a qualitatively good projection onto the disentangled latent space $\mathcal{W}$, as shown in Figure 2.
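The training procedure above can be sketched on a toy problem. Everything here is an assumption made for illustration: the frozen networks are small random maps rather than StyleGAN, the projection is a single linear layer `theta` rather than ResNet18, and the Adam hyperparameters are common defaults, not the values used in our experiments. The sketch only shows the structure of the loop: sample latents, generate images with the frozen networks, and update the projection alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_w, d_img = 8, 4, 16  # toy sizes (assumptions)

A = rng.normal(size=(d_z, d_w)) / np.sqrt(d_z)    # frozen toy mapping network
B = rng.normal(size=(d_w, d_img)) / np.sqrt(d_w)  # frozen toy generator

def sample_batch(n):
    """Draw latents, map them, and generate images; no dataset is needed."""
    z = rng.normal(size=(n, d_z))
    w = np.tanh(z @ A)
    x = w @ B
    return x, w

def loss_and_grad(theta, x, w):
    """Squared-error latent loss of the linear projection x @ theta."""
    err = x @ theta - w
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    grad = 2.0 * x.T @ err / x.shape[0]
    return loss, grad

theta = np.zeros((d_img, d_w))  # trainable projection parameters
m = np.zeros_like(theta)        # Adam first-moment estimate
v = np.zeros_like(theta)        # Adam second-moment estimate
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8

x_eval, w_eval = sample_batch(512)  # fixed batch for before/after comparison
initial_loss, _ = loss_and_grad(theta, x_eval, w_eval)

for t in range(1, 1001):
    x, w = sample_batch(64)
    _, g = loss_and_grad(theta, x, w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

final_loss, _ = loss_and_grad(theta, x_eval, w_eval)
print(f"latent loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Only `theta` is updated; `A` and `B` are never modified, mirroring the fact that the generator and mapping network are held constant.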

Figure 2: Top: Images randomly sampled from $G$. Bottom: Images reconstructed by $G$ using $E$'s projection of the original images.

4.2 Super-Resolution

We observe that our projection network compresses the most important information about the face in the input image. Even when the input image is at a much lower resolution than the one $E$ was trained to project, the distinguishing details are still extracted from the face. We therefore use $E$, trained on full-resolution images, to super-resolve downsampled images: we feed it the downsampled image and then feed the resulting projection into $G$. This produces accurate recreations at the two milder downsampling levels we tested; at the lowest resolution, however, the reconstructed images are noticeably different from the others. A comprehensive comparison with the same examples downsampled to different resolutions is shown in Figure 3.
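The pipeline described above (downsample, project, regenerate) can be sketched as follows. This is a toy under stated assumptions: `toy_G` renders a 32x32 image from a 4-dimensional code, `toy_E` is a resolution-agnostic projection that reads off quadrant means, and block averaging is used for downsampling. None of these are the networks or the resampling method used in our experiments; the sketch only illustrates that, for an image drawn from the generator, the code recovered from a low-resolution copy can regenerate the full-resolution image.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a 2-D image by an integer factor."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def toy_G(w):
    """Toy generator: 4-D code -> 32x32 image with one constant per quadrant."""
    return np.kron(w.reshape(2, 2), np.ones((16, 16)))

def toy_E(img):
    """Toy projection: quadrant means, valid at any even resolution."""
    h, w = img.shape
    return img.reshape(2, h // 2, 2, w // 2).mean(axis=(1, 3)).reshape(4)

def super_resolve(img_low, E, G):
    """Project a low-resolution image to a latent code, then regenerate it."""
    return G(E(img_low))

w_true = np.array([0.1, 0.4, 0.7, 1.0])
x_full = toy_G(w_true)           # image from the generator's distribution
x_low = downsample(x_full, 4)    # 32x32 -> 8x8
x_sr = super_resolve(x_low, toy_E, toy_G)

max_error = float(np.abs(x_sr - x_full).max())
print(x_low.shape, x_sr.shape, max_error)
```

The reconstruction is exact here only because the toy image is fully determined by its code; with a real generator, detail lost in downsampling can only be recovered to the extent that the projection still identifies the right latent code.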

Figure 3: Comparing super-resolution results at different resolutions. In order from top to bottom, the left side displays images produced by $G$ downsampled to three lower resolutions and at the original resolution (no downsampling). The right side shows those same images as reconstructed by $E$ and $G$.

We also attempted the same experiment with images taken directly from the FFHQ dataset; however, the images produced are not similar enough to the originals. An example is shown in Figure 4.

Figure 4: The same comparison as above but using images from FFHQ. Clearly $E$ and $G$ are unable to faithfully reproduce these images.

4.3 Distribution Extension

One consequence of training on images only drawn from the generator’s distribution is the inability to extend to images not in the generator’s output space. This can be seen as a form of overfitting and limits the application of our method. We attempt to alleviate this issue through two methods, described below.


After we train our network $E$ as described in Section 4.1, we further train it on a reconstruction loss. Specifically, we take an image $x$ from the FFHQ dataset, and we train $E$ to minimize the risk of a reconstruction loss between $x$ and its reconstruction $G(E(x))$, taken over $x \in X$, where $X$ is the set of FFHQ images. We continue training with the original loss, and we simply add our reconstruction loss, multiplied by a weighting factor, to the original loss. Unfortunately, we did not see improvements with this loss; instead, our model regressed to producing more "general" images that lack detail and look more "smoothed", as can be seen in Figure 5.
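As a sketch of the combined objective described above, written in notation that is partly our own assumption: $E_\theta$ is the projection network, $G$ the generator, $M$ the mapping network, $X$ the set of FFHQ images, $\lambda$ a hypothetical name for the weighting factor, and $\ell$ a placeholder for the reconstruction loss, whose exact form is left unspecified here.

```latex
\mathcal{R}(\theta) =
\mathbb{E}_{z}\left[\lVert E_\theta(G(M(z))) - M(z) \rVert_2^2\right]
+ \lambda \, \mathbb{E}_{x \in X}\left[\ell\big(G(E_\theta(x)),\, x\big)\right]
```

The first term is the original latent loss, which needs only sampled latent variables; the second term is the only place where real images enter training.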

Figure 5: An example of the images that look too "smoothed" from the reconstruction loss.

Jointly Training and Utilizing a Discriminator

In order to avoid the smoothing described above, we use a discriminator to ensure that the images still look like faces in the FFHQ dataset. We use a standard discriminator model with a series of downsampling convolutional layers connected to a final fully connected layer. We tried a variety of losses on the generator, including the original discriminator loss [5], a feature matching loss from pix2pixHD [10], and the standard discriminator loss used in pix2pixHD [10]. With all of these losses, $E$ failed to converge during training, so we then allowed $G$ to be trained jointly with $E$. Even this did not converge, resulting in even worse pictures than before (see Figure 6 for an example of what happens when $E$ and $G$ cannot converge). We also tried replacing the reconstruction loss with a loss based on [9], similar to that of [4]; however, similar smoothing still occurred. Ultimately, we believe that these methods are insufficient to allow $E$ and $G$ to reproduce images outside of the distribution that $G$'s images are from, and further study is needed to determine whether it is possible at all to do so. For now, $E$ is limited to properly encoding only images from $G$'s distribution.

Figure 6: An example of the images produced when $E$ and $G$ fail to converge during training when we attempt to make them learn to reproduce images from FFHQ. We see definite mode collapse despite using losses that are designed to prevent it.

4.4 Image Clustering

One benefit of the low-dimensional latent embedding produced by $E$ is that, due to its disentangled nature, similar images should have similar values in certain dimensions of the embedding. We hypothesize that, despite being trained only on images in the distribution of StyleGAN's outputs, $E$'s learned embedding is able to generalize to sets of real images. We therefore feed 8000 randomly sampled images from the FFHQ dataset through the encoder and perform greedy agglomerative clustering with Ward's linkage [11] on the embeddings of the images. This results in sensible clusters that encode various features of faces: gender, glasses, hair style, ethnicity, and overall style. Figure 7 shows example closest pairs found by this agglomerative clustering, and Figure 8 shows clusters of multiple images.

Figure 7: Example closest pairs from greedy agglomerative clustering on embeddings of FFHQ images.
Figure 8: Example clusters from greedy agglomerative clustering on embeddings of FFHQ images.
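A minimal sketch of agglomerative clustering with Ward's linkage on latent embeddings, assuming SciPy is available. The embeddings here are synthetic stand-ins (two well-separated Gaussian groups in a 16-dimensional space), not projections of FFHQ images, and the choice of two clusters is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Synthetic stand-in for latent codes of images: two groups of 20 points.
emb = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 16)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 16)),
])

# Build the merge tree bottom-up, merging the pair of clusters that
# least increases within-cluster variance at each step (Ward's criterion).
Z = linkage(emb, method="ward")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# The first row of Z records the first merge: indices of the two points joined.
first_pair = (int(Z[0, 0]), int(Z[0, 1]))
print(labels)
print("first merged pair:", first_pair)
```

Reading off the earliest merges in `Z` is one way to obtain close pairs of embeddings, while cutting the tree at a chosen number of clusters yields larger groups.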

5 Conclusion and Future Work

In this work we propose a method to train a projection network in an unsupervised manner to project images produced by a GAN onto its randomly sampled latent space. We explore using StyleGAN [7] specifically, and demonstrate our results on super-resolution of images produced by the generator. We attempt to extend the projection network to images not produced by the generator (e.g. from the FFHQ dataset), albeit unsuccessfully. However, we perform agglomerative clustering on the latent encodings of real images produced by the projection network, which results in semantically meaningful clusters that encode various traits across different people. We believe this work can be extended to perform “semantic photoshopping”: eventually being able to change individual aspects of input images by perturbing the disentangled latent space in the desired directions.