Deep learning has been rapidly pushing the boundaries of image processing, generation, and manipulation. Generative Adversarial Networks (GANs) in particular have achieved success in this domain. StyleGAN, a recently published network architecture for face generation, is able to automatically separate high-level features in an unsupervised manner while preserving stochasticity in the generated images. Its generator better disentangles complex image properties into latent factors, allowing for better control and interpolation. In this paper we propose a new method that trains a network to map an image to its latent code for the purpose of deriving structure.
We explore whether this image-to-latent network can learn a projection onto a latent space that exposes the structure of the images; we demonstrate this by using it to cluster real images and to perform image super-resolution. Our results show that the projection network is successful: we are able to achieve image super-resolution on images produced by the generator, and utilizing the projection network for clustering yields meaningful, varied clusters of real images.
2 Related Work
Recently, deep generative models such as generative adversarial networks (GANs) have been popular in the image generation literature. GANs consist of generator and discriminator networks in which the generator's goal is to produce synthetic images that trick the discriminator into believing they are real. Training resembles a two-player game in which the players have opposing objectives. When trained successfully, this game ideally results in a generator that produces realistic images while the discriminator can do no better than random guessing.
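Formally, this two-player game is the minimax objective introduced by Goodfellow et al.:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\big(1 - D(G(z))\big)\right]$$

where $G$ is the generator, $D$ is the discriminator, and $p_z$ is the prior over the latent variable.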
Since the original GAN paper, myriad GAN training techniques and network architectures have been developed to allow generators to better approximate high-quality, varied image distributions. For example, StyleGAN uses a learned fully-connected network to disentangle the randomly sampled latent space, separating high-level image attributes. When trained on faces, this latent vector likely encodes both global and local information such as pose, gender, face shape, and hair style.
Even with a disentangled latent space, StyleGAN still suffers from a common challenge faced by GANs: it is difficult to derive the structure of the latent space. This is problematic for the application of GANs to controlled image generation, as the latent space is the main source of variance in the final output. Clustering is one task that can expose structure in a space by partitioning it into distinct regions. Prior work showed good results clustering images using their projections into the latent space of a GAN, but that method requires training a network to learn a projection from the image space to the latent space as well as training the GAN with a new loss. Other approaches trained an encoder to learn the mapping from image to latent space jointly with the generator and discriminator, or proposed a bidirectional GAN architecture that includes an inverting network and similarly projects images into the GAN's latent space. Learned projections have also been proposed for an already trained GAN for the purpose of interpolation.
In this paper we learn a projection function for an already trained StyleGAN so that we may suitably cluster images in the StyleGAN latent space. We do so by training the projection network with a latent loss in an unsupervised manner. We note that our technique should work for most already trained generator architectures; in this paper we explore StyleGAN exclusively.
3 Method
The StyleGAN generator consists of two jointly trained networks: a multi-layer perceptron (MLP) that projects an entangled latent variable $z$, drawn from a random distribution $\mathcal{Z}$, into a disentangled latent space $\mathcal{W}$, and a generator that transforms a disentangled latent variable $w \in \mathcal{W}$ into an image. We denote the MLP as $f$ and the generator as $g$.
Our objective is to find an estimator $E_\theta$ that projects images to the disentangled latent space $\mathcal{W}$, minimizing over $\Theta$, the parameter space, the risk taken over the latent variable $z \sim \mathcal{Z}$. Our loss function is the squared error loss. Thus, the parameters of $E_\theta$ are given by:

$$\theta^{*} = \operatorname*{arg\,min}_{\theta \in \Theta} \; \mathbb{E}_{z \sim \mathcal{Z}} \left[ \left\| E_\theta\big(g(f(z))\big) - f(z) \right\|_2^2 \right]$$
If $g$ is not StyleGAN's generator, $f$ is usually the identity function and we would be projecting from images directly to the randomly sampled latent space $\mathcal{Z}$.
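To make the latent loss concrete, here is a toy sketch in which $f$, $g$, and $E_\theta$ are replaced by small linear maps; all names and shapes here are illustrative stand-ins, not the actual architecture. Because the training pairs $(g(f(z)), f(z))$ are produced by sampling $z$, no image dataset is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins (hypothetical): f maps an entangled latent z to a
# disentangled code w; g renders w to a flattened "image" x.
A = rng.normal(size=(8, 8))     # plays the role of the mapping network f
B = rng.normal(size=(64, 8))    # plays the role of the generator g

# Unsupervised training data: sample z, record (image, latent) pairs.
# Only the trained generator is required -- no real images.
Z = rng.normal(size=(1000, 8))
W_targets = Z @ A.T             # w = f(z), the regression target
X = W_targets @ B.T             # x = g(w), the encoder's input

# Fit a linear encoder E(x) = x @ E_mat minimizing the squared latent
# loss ||E(g(f(z))) - f(z)||^2; in this linear toy the minimizer is the
# least-squares solution.
E_mat, *_ = np.linalg.lstsq(X, W_targets, rcond=None)

# The encoder recovers latent codes for new images in g's range.
z_new = rng.normal(size=8)
w_new = A @ z_new
w_hat = (B @ w_new) @ E_mat
print(float(np.linalg.norm(w_hat - w_new)))  # near zero
```

In the linear case the loss admits this closed-form solution; the actual method instead runs SGD on a deep encoder, but the objective being minimized is the same.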
Unlike previous approaches to learning a projection onto the latent space of a GAN, our method is unsupervised and relies only on the trained generator. Previous approaches instead used supervised learning over a dataset of real images, training the projection network so that the parameters of $E_\theta$ are:

$$\theta^{*} = \operatorname*{arg\,min}_{\theta \in \Theta} \; \frac{1}{|X|} \sum_{x \in X} \mathcal{L}\big(g(E_\theta(x)), x\big)$$

where $X$ is the image dataset and $\mathcal{L}$ is a loss function such as MSE or negative log-likelihood.
4 Experiments and Results
4.1 Training a Baseline Model
For all of our experiments, our network $E_\theta$ is a ResNet18 model pretrained on ImageNet, with the final fully-connected layer replaced by a randomly initialized one. The generator $g$ and the mapping network $f$ come from a pretrained StyleGAN model (trained on the FFHQ dataset) with a fixed output size, held constant during training of $E_\theta$. We approximate $\theta^{*}$ through stochastic gradient descent with the Adam optimizer. In less than 10 minutes of training on a single Tesla V100 GPU, $E_\theta$ learns a qualitatively good projection onto $\mathcal{W}$, as shown in Figure 2.
4.2 Image Super-Resolution
We observe that our projection network compresses the most important information about the face in the input image. Even when the input image is at a much lower resolution than the images $E_\theta$ is trained to project, the distinguishing details of the face are still extracted. We therefore use $E_\theta$, trained on full-resolution generator outputs, to super-resolve heavily downsampled images by feeding it the downsampled image and then feeding the resulting projection into $g$. This produces accurate recreations of images under moderate downsampling; the reconstructions only become noticeably different from the originals at the most aggressive downsampling factor we tested. A comprehensive comparison with the same examples downsampled to different resolutions is shown in Figure 3.
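A minimal sketch of the pipeline's resizing steps follows; `E` and `g` are left as stand-ins for the trained encoder and generator, and the sizes and factor are illustrative, not the resolutions used in our experiments:

```python
import numpy as np

def downsample(img, factor):
    """Block-average an HxWxC image by an integer factor."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor,
                       w // factor, factor, c).mean(axis=(1, 3))

def upsample_nearest(img, factor):
    """Nearest-neighbour upsample back to the encoder's input size."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical super-resolution pipeline (E and g are the trained
# networks, not defined here):
#   x_lr  = downsample(x, factor)           # degrade the input
#   x_in  = upsample_nearest(x_lr, factor)  # resize to E's input size
#   w_hat = E(x_in)                         # project to latent space
#   x_sr  = g(w_hat)                        # regenerate at full resolution

img = np.arange(64, dtype=float).reshape(8, 8, 1)
small = downsample(img, 4)
print(small.shape, upsample_nearest(small, 4).shape)
```

The key point is that no pixel-wise upsampling model is involved: the generator synthesizes the high-resolution output entirely from the projected latent code.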
We also attempted the same experiment with images taken directly from the FFHQ dataset; however, the images produced are not similar enough to the originals. An example is shown in Figure 4.
4.3 Distribution Extension
One consequence of training only on images drawn from the generator's distribution is the inability to extend to images outside the generator's output space. This can be seen as a form of overfitting and limits the applicability of our method. We attempt to alleviate this issue through two methods, described below.
Training with a Reconstruction Loss
After training $E_\theta$ as described in Section 4.1, we further train it with a reconstruction loss: we take an image $x$ from the FFHQ dataset and train $E_\theta$ to minimize the risk

$$\mathbb{E}_{x \sim X} \left[ \left\| g\big(E_\theta(x)\big) - x \right\|_2^2 \right]$$

where $X$ is the set of FFHQ images. We continue training with the original latent loss and simply add the reconstruction loss, multiplied by a weighting factor. Unfortunately, we did not see improvements with this loss; instead, our model regressed to producing more "general" images that lack detail and look more "smoothed," as can be seen in Figure 5.
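The combined objective can be sketched as below; the argument names and the default `weight` are hypothetical stand-ins for our actual implementation:

```python
import numpy as np

def combined_loss(w_hat_gen, w_true, x_hat_real, x_real, weight=0.1):
    """Original latent loss plus a weighted image reconstruction loss.

    w_hat_gen  : E(g(w_true)), encoder output on a generated image
    w_true     : the sampled latent code (training target)
    x_hat_real : g(E(x_real)), round-trip of a real FFHQ image
    x_real     : the real image
    weight     : weighting factor on the reconstruction term
    """
    latent_loss = np.mean((w_hat_gen - w_true) ** 2)
    recon_loss = np.mean((x_hat_real - x_real) ** 2)
    return latent_loss + weight * recon_loss

print(combined_loss(np.zeros(4), np.zeros(4),
                    np.ones(4), np.zeros(4), weight=0.5))  # -> 0.5
```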
Jointly Training and Utilizing a Discriminator
In order to avoid the smoothing described above, we add a discriminator to encourage the output images to still look like faces from the FFHQ dataset. We use a standard discriminator model with a series of downsampling convolutional layers followed by a final fully-connected layer. We tried a variety of adversarial losses, including the original GAN discriminator loss, a feature-matching loss from pix2pixHD, and the standard discriminator loss used in pix2pixHD. With all of these losses, $E_\theta$ was unable to converge during training, so we then allowed $g$ to be trained jointly with $E_\theta$. Even this did not converge, resulting in even worse pictures than before (see Figure 6 for an example of what happens when $E_\theta$ and $g$ fail to converge). We also tried replacing the pixel-wise reconstruction loss with a perceptual loss, but similar smoothing still occurred. Ultimately, we believe these methods are insufficient to allow $E_\theta$ and $g$ to reproduce images outside the generator's output distribution, and further study is needed to determine whether it is possible at all. For now, $E_\theta$ is limited to properly encoding only images from $g$'s distribution.
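For reference, the feature-matching term we experimented with can be sketched as below; this is a simplified stand-in (pix2pixHD matches multi-scale discriminator activations with per-layer weights, which we omit here):

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator activations on real and
    generated images, averaged over layers (simplified sketch)."""
    return np.mean([np.mean(np.abs(fr - ff))
                    for fr, ff in zip(feats_real, feats_fake)])

print(feature_matching_loss([np.zeros(4)], [np.ones(4)]))  # -> 1.0
```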
4.4 Image Clustering
One benefit of the low-dimensional latent embedding produced by $E_\theta$ is that, due to its disentangled nature, similar images should have similar values in certain dimensions of the embedding space. We hypothesize that despite being trained only on images from the distribution of StyleGAN's outputs, $E_\theta$'s learned embedding generalizes to sets of real images. We feed 8000 randomly sampled images from the FFHQ dataset through the encoder and perform greedy agglomerative clustering with Ward's linkage on the resulting embeddings. This yields sensible clusters that encode various features of faces: gender, glasses, hair style, ethnicity, and overall style. Figure 7 shows example closest pairs found by this agglomerative clustering, and Figure 8 shows clusters of multiple images.
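Greedy agglomerative clustering with Ward's linkage can be sketched as below. This is a naive toy on four two-dimensional points standing in for latent embeddings; at the scale of 8000 images a library implementation would be used instead:

```python
import numpy as np

def ward_merge_cost(ca, cb):
    """Increase in within-cluster variance from merging clusters a and b
    (Ward's criterion): |a||b|/(|a|+|b|) * ||mean_a - mean_b||^2."""
    na, nb = len(ca), len(cb)
    mu_a, mu_b = np.mean(ca, axis=0), np.mean(cb, axis=0)
    return na * nb / (na + nb) * np.sum((mu_a - mu_b) ** 2)

def agglomerative_ward(points, n_clusters):
    """Greedily merge the pair of clusters with the cheapest Ward cost
    until only n_clusters remain (quadratic per step; toy sketch)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups of toy "latent codes" recover two clusters.
pts = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
       np.array([5.0, 5.0]), np.array([5.1, 5.0])]
print([len(c) for c in agglomerative_ward(pts, 2)])  # -> [2, 2]
```

Because merges are greedy and never undone, cutting the resulting hierarchy at different depths exposes coarser or finer structure in the embedding space.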
5 Conclusion and Future Work
In this work we propose a method to train a projection network in an unsupervised manner to project images produced by a GAN onto its latent space. We explore StyleGAN specifically and demonstrate our results on super-resolution of images produced by the generator. We attempt to extend the projection network to images not produced by the generator (e.g., from the FFHQ dataset), albeit unsuccessfully. However, performing agglomerative clustering on the latent encodings produced by the projection network results in semantically meaningful clusters that encode various traits across different people. We believe this work can be extended to perform "semantic photoshopping": changing individual aspects of input images by perturbing the disentangled latent space in the desired directions.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
-  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
-  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
-  Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4610–4617, 2019.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244, 1963.
-  Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.