Mask-Guided Discovery of Semantic Manifolds in Generative Models

Mengyu Yang et al., University of Toronto · May 15, 2021

Advances in the realm of Generative Adversarial Networks (GANs) have led to architectures capable of producing amazingly realistic images, such as StyleGAN2, which, when trained on the FFHQ dataset, generates images of human faces from random vectors in a lower-dimensional latent space. Unfortunately, this space is entangled – translating a latent vector along its axes does not correspond to a meaningful transformation in the output space (e.g., smiling mouth, squinting eyes). The model behaves as a black box, providing neither control over its output nor insight into the structures it has learned from the data. We present a method to explore the manifolds of changes of spatially localized regions of the face. Our method discovers smoothly varying sequences of latent vectors along these manifolds suitable for creating animations. Unlike existing disentanglement methods that either require labelled data or explicitly alter internal model parameters, our method is an optimization-based approach guided by a custom loss function and a manually defined region of change. Our code is open-source and can be found, along with supplementary results, on our project page: https://github.com/bmolab/masked-gan-manifold

1 Introduction

Advances in the realm of Generative Adversarial Networks (GANs) [goodfellow_generative_2014] have led to architectures capable of producing amazingly realistic images, such as StyleGAN2 [karras_analyzing_2020], which, when trained on the FFHQ dataset [karras_style-based_2019], generates images of human faces from random vectors in a lower-dimensional latent space. Unfortunately, this space is entangled – translating a latent vector along its axes does not correspond to a meaningful transformation in the output space (e.g., smiling mouth, squinting eyes). The model behaves as a black box, providing neither control over its output nor insight into the structures it has learned from the data.

However, the smoothness of the mappings from latent vectors to faces, together with empirical evidence [shen_interpreting_2020; creswell_inverting_2018], suggests that manifolds of meaningful transformations are in fact hidden inside the latent space, obscured because they are neither axis-aligned nor even linear. Travelling along these manifolds would provide puppetry-like abilities to manipulate faces, while studying their geometry would provide insight into the nature of the face variations present in the dataset – revealing and quantifying the degrees of freedom of eyes, mouths, etc.

We present a method to explore the manifolds of changes of spatially localized regions of the face. Our method discovers smoothly varying sequences of latent vectors along these manifolds suitable for creating animations. Unlike existing disentanglement methods that either require labelled data [shen_interpreting_2020; wei_maggan_2020] or explicitly alter internal model parameters [alharbi_disentangled_2020; broad_network_2020], our method is an optimization-based approach guided by a custom loss function and a manually defined region of change. Our code is open-source and can be found, along with supplementary results, on our project page: https://github.com/bmolab/masked-gan-manifold

2 Method

We design functions defined on the images generated by our pre-trained model (we continue with the running example of StyleGAN2 trained on FFHQ). The desired property of these functions is that they attain their minimum when only the target region of the face (for instance, the mouth) has changed. We then use standard optimization techniques to discover smoothly varying paths through the latent space that lie on the manifold of such minima.

We start with a user-provided initial generated image $I_0 = G(z_0)$, where $G$ is the generator network and $z_0$ some latent vector (note that in this work we use StyleGAN2's higher-dimensional intermediate latent space $\mathcal{W}+$; refer to [karras_style-based_2019] for details). We then define a rectangular mask region $M$ over the image, for instance around the mouth, and define $I^M$ as the image formed by cropping $I$ to $M$, and $I^{\bar{M}}$ as its complement (i.e., the rest of the image). We seek a manifold containing images $I = G(z)$ which have primarily changed in the mouth region $I^M$ but not in the rest of the image $I^{\bar{M}}$. We can define this manifold as the minima of the function

$$f_{M,\beta}(z) = \left|\, d\!\left(G(z)^M, I_0^M\right) - \beta \,\right| + d\!\left(G(z)^{\bar{M}}, I_0^{\bar{M}}\right) \tag{1}$$

where $d$ is a distance function between images and $\beta$ is a tuneable offset. We have experimented with both pixel-wise distance and the LPIPS perceptual loss [zhang_unreasonable_2018]. $f_{M,\beta}$ satisfies our requirement, as it is minimal when the target region has changed by a factor of $\beta$ while the rest of the image remains unchanged.
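As a concrete illustration, Equation 1 might be implemented along the following lines in PyTorch. This is a minimal sketch rather than the authors' released code: the generator handle G, the helpers crop_to_mask and crop_to_complement, and the (top, bottom, left, right) region format are assumptions of this example; the lpips package supplies the perceptual distance of [zhang_unreasonable_2018].

```python
import torch
import lpips  # pip install lpips; the LPIPS perceptual distance

perceptual = lpips.LPIPS(net='alex')  # expects (N, 3, H, W) images in [-1, 1]

def crop_to_mask(image, region):
    """I^M: the pixels inside a rectangular mask (top, bottom, left, right)."""
    top, bottom, left, right = region
    return image[:, :, top:bottom, left:right]

def crop_to_complement(image, region):
    """I^M-bar: a copy of the image with the masked region zeroed out."""
    top, bottom, left, right = region
    out = image.clone()
    out[:, :, top:bottom, left:right] = 0.0
    return out

def image_distance(a, b, metric='lpips'):
    """The distance d between two image batches: LPIPS or pixel-wise."""
    if metric == 'lpips':
        return perceptual(a, b).mean()
    return (a - b).pow(2).mean()

def mask_loss(z, I0, G, region, beta, metric='lpips'):
    """f_{M,beta}(z): minimal when I^M has moved by beta and I^M-bar is unchanged."""
    I = G(z)
    inside = image_distance(crop_to_mask(I, region), crop_to_mask(I0, region), metric)
    outside = image_distance(crop_to_complement(I, region),
                             crop_to_complement(I0, region), metric)
    return (inside - beta).abs() + outside
```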

In order to create smoothly varying animations that explore this manifold, we use a physically-inspired model of masses connected by springs. Take a matrix of latent vectors $Z = [z_1, \dots, z_N] \in \mathbb{R}^{N \times d}$, where $d$ is the dimension of the latent space. The vectors are connected by springs of rest-length $\ell$ (an adjustable parameter) in series, encouraging each $z_i$ to be similar, but not too similar, to its neighbours. We further encourage the path to have minimal curvature by also adding higher-order "stiffener" springs connecting each $z_i$ to vectors that are further apart. This system can be formalized as follows,

$$L_{\text{spring}}(Z) = \sum_{k \in K} \sum_{i=1}^{N-k} \left( \lVert z_{i+k} - z_i \rVert - k\ell \right)^2 \tag{2}$$

For our experiments, we include spring orders $K = \{1, 2\}$. Putting everything together, given a reference latent vector $z_0$ and mask region $M$, we use the L-BFGS [liu_limited_1989] algorithm to optimize $L_{\text{total}}$ and find $Z^*$, as seen in Equation 3, where $\alpha_{\text{mask}}$ and $\alpha_{\text{spring}}$ are tuneable parameters controlling the importance of each term. The result of this optimization is visually represented in Figure 1, left.

$$Z^* = \arg\min_{Z} \; L_{\text{total}}(Z), \qquad L_{\text{total}}(Z) = \alpha_{\text{mask}} \sum_{i=1}^{N} f_{M,\beta}(z_i) + \alpha_{\text{spring}} L_{\text{spring}}(Z) \tag{3}$$
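The spring energy and the full optimization can be sketched in the same spirit. Again this is illustrative rather than definitive: the initialization of the path around the seed vector, the assumption of flat $d$-dimensional latents, and all hyperparameter defaults are choices of this sketch, not values from the paper; mask_loss is the function from the previous example.

```python
import torch

def spring_loss(Z, rest_length, orders=(1, 2)):
    """L_spring (Equation 2): series springs plus higher-order stiffeners."""
    total = Z.new_zeros(())
    for k in orders:
        gaps = (Z[k:] - Z[:-k]).norm(dim=1)          # ||z_{i+k} - z_i|| for each i
        total = total + ((gaps - k * rest_length) ** 2).sum()
    return total

def find_manifold_path(G, z0, I0, region, beta, n_points=20, rest_length=1.0,
                       a_mask=1.0, a_spring=1.0, steps=50):
    """Optimize Equation 3 with L-BFGS to find a smooth path Z* on the manifold."""
    d = z0.shape[-1]  # assumes latents are flat d-dimensional vectors
    # Initialize the path as small perturbations of the seed vector z0.
    Z = (z0.reshape(1, d) + 0.05 * torch.randn(n_points, d)).requires_grad_(True)
    opt = torch.optim.LBFGS([Z], max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = a_mask * sum(mask_loss(z.unsqueeze(0), I0, G, region, beta)
                            for z in Z)
        loss = loss + a_spring * spring_loss(Z, rest_length)
        loss.backward()
        return loss

    opt.step(closure)
    return Z.detach()
```

Each optimized $z_i$ can then be decoded with $G$ and the frames concatenated to obtain the kind of smoothly varying animation described above.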

3 Results and discussion

Figure 1: Left: A choice of seed vector $z_0$, shown as an orange dot, and mask region $M$ creates a function $f_{M,\beta}$ illustrated here on the latent space of the generative model. The optimization seeks a set of vectors $z_i$, shown as red dots, that lie along the minima of this landscape. They are encouraged to be evenly spaced along a path with minimal curvature by the spring loss $L_{\text{spring}}$, where springs of orders $k = 1$ and $k = 2$ are shown as green and blue connectors respectively. Right: Various results of our algorithm. Each row consists of a different seed vector $z_0$ and mask region $M$ as shown in the first column. The other columns are selected images from the generated sequences $G(z_i)$. Note that the change in the images is well localized to the masked region. Also note that we use a large value of $\beta$ to exaggerate the changes for clarity. Refer to the Appendix for many more results.

Figure 1, right, shows some of our experimental results. It can be seen that changes to the face are all localized within the mask region while minimal change occurs outside. More importantly, we demonstrate that our method generalizes to any mask region of choice as well as to the initial face (see Figures 2 to 10 in the Appendix for additional experiments). The spring constraints of our method are designed to generate smooth videos; please refer to our supplementary material to experience this qualitatively.

This work is a small contribution towards the larger vision of exploring and characterizing the semantic manifolds lurking in the latent spaces of generative models. Generalizing to different models, to manifolds of different dimensionality, and to controls richer than rectangular masks is a small sampling of the natural extensions of this line of inquiry.

4 Ethical implications

The StyleGAN2 model we use is capable of generating realistic faces while also demonstrating a proficient understanding of how faces tend to vary in the dataset. Given these qualities, GANs can become popular tools for modelling and promulgating what is considered to be "normal", which, if used uncritically, could marginalize people labelled "abnormal" by these systems [crawford_trouble_2017].

Furthermore, there has been much popular discussion about whether we are entering a post-truth era, in which generative tools such as the one we present here raise fears of hyper-realistic "deepfake" videos impersonating real people, poisoning the information ecology and further eroding trust in any consensus reality [vincent_watch_2018].

Perhaps more subtly, our method and others like it can create physically plausible videos of faces changing in "unnatural" ways, such as shifting bone structure or smoothly varying a face from one identity to another. If such videos become commonplace in our culture, might this contribute to a reconfiguring of our traditional conception of separate, fixed, individual identities towards fluid, overlapping, and changeable ones? The consequences of such a fundamental shift, be they negative, positive, or neutral, are difficult to anticipate but worthy of consideration.

5 Appendix

Below are some additional figures of experiments with different faces and masks. For all figures below, the first column is the reference image with the mask region shown. Unless its value is explicitly stated, we use a large value of $\beta$ to exaggerate the changes for clarity.

We encourage readers to view our video animations, from which stills were taken to create the figures below. An important aspect of our method is that it creates smooth animations while exploring the manifolds; the video animations therefore convey much more visual information, some of which is lost in still figures. Refer to the supplementary materials for the animations.

Figure 2: Each row represents an experiment with a different offset value $\beta$ on the same face and mask region.
Figure 3: Experiment with a single offset value $\beta$ with the mask region around the eyes.
Figure 4: Experiment with a single offset value $\beta$ with the mask region around the right half of the face.
Figure 5: Experiment with a single offset value $\beta$ with the mask region around everything except the eye region (i.e., the eye region remains unchanged).
Figure 6: Experiment with a single offset value $\beta$ with the mask region around the right half of the face.
Figure 7: Experiment with a single offset value $\beta$ with the mask region around the mouth.
Figure 8: Experiment with a single offset value $\beta$ with the mask region around the mouth and chin.
Figure 9: Experiment with a single offset value $\beta$ with the mask region around the right half of the face.
Figure 10: Experiment with a single offset value $\beta$ with the mask region around the mouth.