Tensor-based Emotion Editing in the StyleGAN Latent Space

by   René Haas, et al.
IT University of Copenhagen

In this paper, we use a tensor model based on the Higher-Order Singular Value Decomposition (HOSVD) to discover semantic directions in Generative Adversarial Networks. This is achieved by first embedding a structured facial expression database into the latent space using the e4e encoder. Specifically, we discover directions in latent space corresponding to the six prototypical emotions: anger, disgust, fear, happiness, sadness, and surprise, as well as a direction for yaw rotation. These latent space directions are employed to change the expression or yaw rotation of real face images. We compare the directions we find to similar directions found by two other methods. The results show that the visual quality of the resulting edits is on par with the state of the art. It can also be concluded that the tensor-based model is well suited for emotion and yaw editing, i.e., that the emotion or yaw rotation of a novel face image can be robustly changed without a significant effect on identity or other attributes in the images.



1 Introduction

Generative Adversarial Networks (GANs) [Goodfellow2014GAN] have emerged as one of the most promising architectures for image synthesis. GANs can produce synthetic images with near-perfect photorealism [Karras2018PGGAN, Karras2019StyleGAN, Brock2019BigGAN, Karras2020StyleGAN2, Karras2020StyleGANada]. GANs learn to organize the data they are trained on into a latent space and, by drawing samples from this latent space, are able to synthesize new images which are not contained in the training data but follow the same distribution. In particular, in the field of face synthesis, StyleGAN has set new standards for what is possible [Karras2019StyleGAN, Karras2020StyleGAN2, Karras2020StyleGANada].

Recent work has explored methods to gain artistic control over the images produced by modern GANs [Shen2020InterfaceganTPAMI, Abdal2020Image2StyleGANpp, Harkonen2020GANSpace, Patashnik2021StyleCLIP, Shen2020SeFa, Tewari2020StyleRig, Wu2020StyleSpace, Abdal2020StyleFlow]. In this work, we use a multilinear tensor model to derive latent space directions in StyleGAN2 [Karras2020StyleGAN2] corresponding to the six prototypical emotions: anger, disgust, happiness, fear, sadness, and surprise as well as yaw rotation. With these directions, we are able to edit the emotion of real face images as shown in Fig. 1.


The StyleGAN generator is composed of two networks, the mapping network and the synthesis network. The mapping network maps a latent vector in $\mathcal{Z}$ onto the auxiliary latent space $\mathcal{W}$, while the synthesis network maps a vector in $\mathcal{W}$ to the final output image. The latent vectors in $\mathcal{Z}$ follow the standard normal distribution, while the distribution of the auxiliary latent codes in $\mathcal{W}$ is learned by the mapping network. The main benefit of this mapping is that the $\mathcal{W}$ space is more disentangled than the $\mathcal{Z}$ space [Karras2019StyleGAN].

Every major block of the synthesis network, corresponding to one resolution, is modulated by two style vectors. Thus, for the full 1024 by 1024 generator there are 9 major blocks and the synthesis network takes a total of 18 style vectors as input. Each set of style vectors has a different effect on the synthesized image. In detail, the style vectors for the early layers, corresponding to coarse spatial resolutions, control high-level aspects of the image such as pose and face shape. Style vectors on the middle layers control smaller-scale facial features such as hair style and whether the eyes and mouth are open or closed. The style vectors on the later layers, corresponding to higher resolutions, control fine details such as the texture and microstructure of the generated image [Karras2019StyleGAN]. In $\mathcal{W}$ space, all of the style vectors are identical. However, we can also allow them to differ, in which case the resulting space is denoted as the $\mathcal{W}^+$ space. The $\mathcal{W}^+$ space can be used for style mixing [Karras2019StyleGAN] and GAN inversion [Abdal2019Image2StyleGAN, Zhu2020InDomain]. Recently, an additional latent space referred to as style space has also been proposed [Wu2020StyleSpace].
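The relation between a single latent code and the 18 style vectors can be illustrated with a small NumPy sketch. It is not the StyleGAN implementation: the random vectors stand in for outputs of the mapping network, the style dimension of 512 is the usual StyleGAN setting, and the helper names `to_w_plus` and `style_mix` are hypothetical.

```python
import numpy as np

NUM_STYLES, STYLE_DIM = 18, 512  # 9 major blocks x 2 style vectors, 512-d styles

def to_w_plus(w):
    """Repeat one latent code so that all 18 style vectors are identical."""
    return np.tile(w, (NUM_STYLES, 1))

def style_mix(w_plus_a, w_plus_b, crossover):
    """Take style vectors [0, crossover) from A and the remaining ones from B."""
    mixed = w_plus_b.copy()
    mixed[:crossover] = w_plus_a[:crossover]
    return mixed

rng = np.random.default_rng(0)
# Random stand-ins for two codes produced by the mapping network (assumption).
w_a, w_b = rng.standard_normal((2, STYLE_DIM))

# Coarse styles (pose, face shape) from A, everything else from B.
mixed = style_mix(to_w_plus(w_a), to_w_plus(w_b), crossover=4)
```

Once the rows are allowed to differ, as in `mixed`, the code no longer corresponds to a single vector repeated 18 times, which is the distinction between the two spaces described above.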

Semantic Face Editing.

Several methods have been proposed to enable edits of the images produced by StyleGAN. InterFaceGAN [Shen2020Interfacegan, Shen2020InterfaceganTPAMI] uses pre-trained binary classifiers to annotate StyleGAN-generated images based on single binary attributes, e.g., young vs. old, male vs. female, glasses vs. no glasses. Support vector machines are then trained on the annotated data to discriminate between the two classes of each attribute in the latent space. The normal vector of the separating hyperplane defines a direction in latent space that changes the corresponding binary attribute. GANSpace [Harkonen2020GANSpace] finds interpretable directions in an unsupervised fashion with PCA, although manual examination of the found directions is required. Directions found with PCA are typically entangled, affecting multiple attributes. It was shown that the degree of entanglement can be reduced by applying the found directions to only a subset of the style vectors. It has also been proposed to apply an eigenvalue decomposition to the weights of the pre-trained generator to discover meaningful semantic directions in the latent space [Shen2020SeFa]. Recently, StyleCLIP [Patashnik2021StyleCLIP] demonstrated text-driven semantic editing by minimizing a CLIP [Radford2021CLIP] loss between a text input and the generated image. StyleFlow [Abdal2020StyleFlow] proposed editing along non-linear paths using normalizing flows to better preserve identity.
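As a rough illustration of the PCA-based idea described above, the following NumPy sketch computes principal directions from a set of latent codes and applies one of them to a subset of the style vectors only. It is a simplified sketch, not the GANSpace code: the latent codes are random stand-ins, and `pca_directions` and `edit_subset` are hypothetical helper names.

```python
import numpy as np

def pca_directions(codes, n_components):
    """Return the top principal directions (as rows) of a set of latent codes."""
    centered = codes - codes.mean(axis=0)
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]

def edit_subset(w_plus, direction, strength, layers):
    """Move only the selected style vectors along a direction."""
    edited = w_plus.copy()
    edited[layers] += strength * direction
    return edited

rng = np.random.default_rng(0)
codes = rng.standard_normal((600, 512))   # stand-in for sampled latent codes
directions = pca_directions(codes, n_components=3)

w_plus = np.zeros((18, 512))              # stand-in for a code with 18 styles
edited = edit_subset(w_plus, directions[0], strength=2.0, layers=slice(0, 3))
```

Restricting `layers` is what limits an entangled direction to, for example, the coarse styles that mostly affect pose.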

Separate from StyleGAN research, different multilinear methods have been widely used to model and analyze faces and expressions [Blanz1999MorphableModel, Ferrari2017Dictionary3DMM, tensorface, grasshof2020Multilinear]. Recently there has been some interest in applying these methods to explore the latent space of GANs. For example, StyleRig [Tewari2020StyleRig] proposes edits by minimizing the loss between the generated image and an image rendered by a 3D morphable model. Furthermore, models based on the Higher-Order Singular Value Decomposition (HOSVD) have successfully been used to model faces, their 3D reconstruction, as well as in transferring expressions [Vasilescu2002Tensorface, Vlasic2005FaceTransfer, Brunton2014MultilinearWavelets, Chen2014FaceWarehouse]. Recently, it has been suggested [Haas2021tensorGAN] to use such a HOSVD-based tensor model for semantic face editing in StyleGAN. There, a facial expression database was projected into the StyleGAN latent space and relevant semantic subspaces corresponding to identity, expression, and yaw rotation were defined using HOSVD-based subspace factorization. The model showed limited flexibility for representing arbitrary latent codes, and to overcome this a stacked style-separated model was proposed. This extended the tensor model to an ensemble of tensor models, one for each style vector in the StyleGAN latent space. Further, it was shown that in the derived expression subspace, each of the six prototypical emotions formed nearly linear trajectories, in agreement with [Grasshof2017apathy]. Although initial results were promising, convincing expression editing using a HOSVD-based model on the StyleGAN latent space had not yet been demonstrated. We propose a solution to this shortcoming and demonstrate the robustness and competitiveness of our approach in this work.

Generator Inversion.

To facilitate editing of real images, the images first need to be projected into the StyleGAN latent space. This is also referred to as GAN inversion [Zhu2018GANInversion], and the problem is to find a latent code that, when passed to the generator, produces an image as close as possible to the given target image. Typically, GAN inversion techniques are based either on training an encoder [Pidhorskyi2020ALAE, Richardson2021pSp, Tov2021e4e, alaluf2021restyle], which can embed an image into latent space at inference time, or on optimization [Karras2020StyleGAN2, Abdal2019Image2StyleGAN, yang2019unconstrained, Abdal2020Image2StyleGANpp]. In the latter approach, the latent code is found by minimizing a loss function, typically a pixel-wise L2 loss or a perceptual image similarity [Zhang2018LPIPS]. Hybrid approaches have also been proposed which use a trained encoder to find a good initial condition for subsequent iterative optimization of the latent code [puzer, Zhu2020InDomain].

Recently, [Roich2021pivotal] showed that novel images can be embedded into $\mathcal{W}$ space with a lower reconstruction error by fine-tuning the pre-trained generator on the target image such that the latent code in $\mathcal{W}$ space yields an image closer to the target.

Recent work [Tov2021e4e] suggests that there is a trade-off between distortion and editability when selecting which latent space to project a given target image into. When projecting out-of-domain images into the StyleGAN latent space, picking the extended space $\mathcal{W}^+$ leads to a higher quality reconstruction, i.e., it yields an image closer to the target image. However, latent codes in the $\mathcal{W}^+$ space are generally less editable than latent codes in $\mathcal{W}$ space. To find latent codes with the optimal trade-off between distortion and editability, a novel training methodology was proposed [Tov2021e4e] which embeds images into $\mathcal{W}^+$ space in a way that constrains the latent codes to be as close to $\mathcal{W}$ space as possible.

Figure 2: Diagram of our method. We first project a facial expression database into the $\mathcal{W}^+$ space of StyleGAN. We then use the HOSVD to factorize the latent representation of the data in order to derive meaningful semantic subspaces. From the subspaces we define a set of global editing directions in $\mathcal{W}^+$ corresponding to yaw rotation and each of the six basic emotions.


Our contributions can be summarized as follows:

  • We show that a HOSVD-based tensor model is able to robustly discover novel semantic directions, corresponding to the six prototypical emotions, in pre-trained GANs.

  • We show that convincing emotion directions can be derived by truncating the expression intensity subspace.

  • We show that, by using the e4e encoder [Tov2021e4e] for projecting real images into the latent space of StyleGAN, it is possible to construct a tensor model which enables stable rotation and expression transfer on real faces.

  • We show that the previously proposed tensor model for the GAN latent space [Haas2021tensorGAN] had an implicit rank-one constraint, which can be relaxed, leading to lower reconstruction error.

2 Method

In this section, we describe the tensor model formulation of [Haas2021tensorGAN] and propose two extensions to it: (1) we show how to relax the implicit rank-one constraint of the model by replacing the set of parameter vectors of the model with a single full-rank parameter tensor, and (2) we show how to derive emotion directions in $\mathcal{W}^+$ by truncating the expression intensity subspace. An overview of our approach is shown in Fig. 2.

2.1 Multilinear Tensor Model

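Given a data set of StyleGAN latent codes in $\mathcal{W}^+$, we represent them so that each latent code is equivalent to a vector $w \in \mathbb{R}^{K}$, where $K = 18 \cdot 512 = 9216$ for the generator producing $1024 \times 1024$ images. Suppose we have latent codes for $N_p$ different persons, performing $N_e$ expressions, each with $N_i$ different intensities, from $N_r$ different rotations; then we arrange the data into the 5th-order tensor $\mathcal{T} \in \mathbb{R}^{K \times N_p \times N_e \times N_i \times N_r}$. We then proceed to calculate the Higher-Order Singular Value Decomposition (HOSVD) of the mean-centered data tensor as

$$\mathcal{T} - \bar{\mathcal{T}} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3 \times_4 U_4 \times_5 U_5, \qquad (1)$$

where $\mathcal{C}$ is the core tensor and $\times_n$ denotes the $n$-mode tensor matrix product. The mean tensor is written as $\bar{\mathcal{T}} = \bar{w} \otimes \mathbf{1}_{N_p} \otimes \mathbf{1}_{N_e} \otimes \mathbf{1}_{N_i} \otimes \mathbf{1}_{N_r}$, where $\bar{w}$ is the mean latent code of the data set, $\mathbf{1}_{N}$ is a vector of ones with dimension $N$, and $\otimes$ denotes the tensor product. The matrices $U_n$ have orthonormal columns, i.e., $U_n^\top U_n = I$, and are constructed from the left singular vectors of the mode-$n$ matrix unfoldings of the mean-centered data tensor. The columns of $U_n$ form the basis for the respective subspace. The columns of $U_1$ form a basis for the latent space and are identical to the principal components [grasshof2020Multilinear]. Likewise, $U_2$, $U_3$, $U_4$, and $U_5$ form the bases for the person identity, expression, intensity, and rotation subspaces, respectively.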
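The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration on a small random tensor, not the implementation used in the paper (which relies on tntorch); the helper names `unfold`, `mode_product`, and `hosvd`, as well as the toy tensor sizes, are assumptions made for the example.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: the chosen mode becomes the rows of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product: multiply `matrix` onto axis `mode` of `tensor`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)

def hosvd(data):
    """HOSVD of the mean-centered data tensor (axis 0 is the latent dimension)."""
    mean_code = data.reshape(data.shape[0], -1).mean(axis=1)
    centered = data - mean_code.reshape(-1, *([1] * (data.ndim - 1)))
    # Left singular vectors of every mode-n unfolding give the factor matrices.
    factors = [np.linalg.svd(unfold(centered, m), full_matrices=False)[0]
               for m in range(data.ndim)]
    core = centered
    for m, u in enumerate(factors):
        core = mode_product(core, u.T, m)   # project onto each subspace basis
    return mean_code, core, factors

rng = np.random.default_rng(0)
# Toy sizes (latent dim, persons, expressions, intensities, rotations).
data = rng.standard_normal((8, 4, 3, 2, 2))
mean_code, core, factors = hosvd(data)
```

Multiplying the core tensor back with all factor matrices and adding the mean code recovers the data tensor, which is a convenient sanity check for any implementation.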

Parameter Vectors.

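To recover a specific latent code from the tensor model, we select appropriate rows of $U_2$, $U_3$, $U_4$, and $U_5$ corresponding to the desired person, expression, expression intensity, and rotation, respectively. By introducing one-hot vectors $q_2, \dots, q_5$, which we will refer to as the canonical parameters for the tensor model, we get

$$w = \bar{w} + \mathcal{C} \times_1 U_1 \times_2 p_2^\top \times_3 p_3^\top \times_4 p_4^\top \times_5 p_5^\top, \qquad (2)$$

where $p_n = U_n^\top q_n$. This formulation is analogous to the one proposed in [Grasshof2017apathy, grasshof2020Multilinear] and subsequently in [Haas2021tensorGAN]. Now, (2) can be further simplified by defining $\mathcal{B} = \mathcal{C} \times_1 U_1$, which allows us to write

$$w = \bar{w} + \mathcal{B} \times_2 p_2^\top \times_3 p_3^\top \times_4 p_4^\top \times_5 p_5^\top, \qquad (3)$$

which gives a more compact representation of the tensor model.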
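The following sketch illustrates how selecting one row from each factor matrix picks out a single latent code of the multilinear model. Random arrays stand in for the core tensor and the factor matrices, so only the algebraic identity is shown; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, E, I, R = 6, 5, 4, 3, 2   # toy sizes, hypothetical
core = rng.standard_normal((K, P, E, I, R))        # stand-in core tensor
U1 = rng.standard_normal((K, K))                   # stand-in latent basis
U2, U3, U4, U5 = (rng.standard_normal((n, n)) for n in (P, E, I, R))
w_mean = rng.standard_normal(K)

# Every code of the model at once: core multiplied by all factor matrices.
full = np.einsum('abcde,ka,ib,jc,ld,me->kijlm', core, U1, U2, U3, U4, U5)

def latent_from_rows(person, expr, intensity, rot):
    """Compact form: fold U1 into the core, then contract with one row per mode."""
    basis = np.einsum('abcde,ka->kbcde', core, U1)
    p2, p3, p4, p5 = U2[person], U3[expr], U4[intensity], U5[rot]
    return w_mean + np.einsum('kbcde,b,c,d,e->k', basis, p2, p3, p4, p5)

w = latent_from_rows(person=2, expr=1, intensity=0, rot=1)
```

Indexing a row of a factor matrix plays the role of multiplying with a one-hot vector, so `w` coincides with the corresponding fibre of `full` plus the mean code.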

Recovering Subspace Parameters.

To find the parameters for a novel latent code $w$, with $\hat{w}$ corresponding to the latent code which best approximates $w$, one could minimize the loss,


Additionally, it has been proposed in [Grasshof2017apathy] to regularize the solution by the Tikhonov regularizer and sum constraint as


that yields the regularized minimization problem


This regularization is important for finding stable parameter vector representations and thereby enables expression editing for latent codes corresponding to novel images, as will be seen below.

Relaxing the Rank-One Constraint.

In the tensor model (3), each latent code is entirely determined by four parameter vectors $p_2$, $p_3$, $p_4$, and $p_5$ corresponding to identity, expression, expression intensity, and rotation, respectively. Using component notation and the Einstein summation convention, we rewrite (3) as

$$w_k = \bar{w}_k + \mathcal{B}_{kijlm}\, Q_{ijlm}, \qquad (7)$$

where $Q_{ijlm} = (p_2)_i (p_3)_j (p_4)_l (p_5)_m$ is a rank-one tensor.

Now, we propose to relax this implicit rank-one constraint and instead allow the tensor $Q$ to be full rank, which leads to the problem

$$\min_{Q} \sum_k \left( w_k - \bar{w}_k - \mathcal{B}_{kijlm}\, Q_{ijlm} \right)^2. \qquad (8)$$

The relaxation increases the number of parameters of the tensor model from $N_p + N_e + N_i + N_r$ parameters to $N_p N_e N_i N_r$ parameters. This results in a more flexible model which yields lower reconstruction errors for novel latent codes.
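The difference between the two parameterizations can be made concrete with a toy example. The sketch below builds a rank-one parameter tensor from four vectors and compares it with a full parameter tensor fitted by ordinary least squares. It makes several assumptions: the basis tensor and the target code are random stand-ins, the rank-one vectors are random rather than fitted, and an unregularized least-squares solve replaces the Adam-based estimation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, E, I, R = 40, 3, 3, 2, 2                 # toy sizes, hypothetical
B = rng.standard_normal((K, P, E, I, R))        # stand-in for the basis tensor
w_mean = rng.standard_normal(K)
w_target = rng.standard_normal(K)               # stand-in for a novel latent code

def decode(Q):
    """Model output: mean code plus the contraction of B with Q."""
    return w_mean + np.einsum('kijlm,ijlm->k', B, Q)

# Rank-one parameters: outer product of four vectors (P + E + I + R numbers).
p2, p3, p4, p5 = (rng.standard_normal(n) for n in (P, E, I, R))
Q_rank_one = np.einsum('i,j,l,m->ijlm', p2, p3, p4, p5)

# Full parameter tensor: P * E * I * R free numbers, fitted by least squares.
q_vec, *_ = np.linalg.lstsq(B.reshape(K, -1), w_target - w_mean, rcond=None)
Q_full = q_vec.reshape(P, E, I, R)

err_rank_one = np.linalg.norm(decode(Q_rank_one) - w_target)
err_full = np.linalg.norm(decode(Q_full) - w_target)
```

Because every rank-one tensor is also an admissible full tensor, the least-squares error of the full tensor can never exceed the error of any rank-one choice, which is the intuition behind the lower reconstruction error reported for the relaxed model.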

2.2 Truncating the Expression Intensity Subspace

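From (1), the expression intensity subspace is truncated to a one-dimensional subspace by selecting the dominant singular vector, i.e., the first column of $U_4$, which we denote $u_4$. The truncated core tensor is then written as

$$\tilde{\mathcal{C}} = (\mathcal{T} - \bar{\mathcal{T}}) \times_1 U_1^\top \times_2 U_2^\top \times_3 U_3^\top \times_4 u_4^\top \times_5 U_5^\top. \qquad (9)$$

Defining $\tilde{\mathcal{B}} = \tilde{\mathcal{C}} \times_1 U_1$ as before, the model is written similarly to (2) and (3) as

$$w = \bar{w} + \tilde{\mathcal{B}} \times_2 p_2^\top \times_3 p_3^\top \times_4 \alpha \times_5 p_5^\top, \qquad (10)$$

where the corresponding intensity parameter $\alpha$ is a scalar since the expression intensity subspace has been truncated. Thus, the expression intensity factors out of the model and we may write

$$w = \bar{w} + \alpha\, \tilde{\mathcal{B}} \times_2 p_2^\top \times_3 p_3^\top \times_5 p_5^\top, \qquad (11)$$

where $\alpha$ can now be interpreted as the expression intensity parameter. We trivially unfold the singleton dimension of $\tilde{\mathcal{B}}$ corresponding to the intensity subspace, i.e., $\tilde{\mathcal{B}} \in \mathbb{R}^{K \times N_p \times N_e \times 1 \times N_r} \to \mathbb{R}^{K \times N_p \times N_e \times N_r}$, and then write the model as

$$w = \bar{w} + \alpha\, \tilde{\mathcal{B}} \times_2 p_2^\top \times_3 p_3^\top \times_4 p_5^\top. \qquad (12)$$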

2.3 Recovering Semantic Directions

Emotion Directions.

We define emotion directions in latent space by selecting an appropriate row of $U_3$ corresponding to the emotion of interest, which we denote $p_3$. The combined parameter tensor corresponding to an expression direction is then written as

$$Q = \bar{p}_2 \otimes p_3 \otimes \bar{p}_5, \qquad (13)$$

where $\bar{p}_2$ and $\bar{p}_5$ are the mean person and rotation parameters, respectively. To change the expression of a given latent code $w$, we interpolate linearly in the direction given by the vector $d$ with components

$$d_k = \tilde{\mathcal{B}}_{kijm}\, Q_{ijm}, \qquad (14)$$

thus performing an expression edit as

$$w' = w + \alpha\, d. \qquad (15)$$

Rotation Direction.

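We edit rotations in a similar way. First we select the mean person, expression, and expression intensity parameters $\bar{p}_2$, $\bar{p}_3$, and $\bar{p}_4$, and then define the rotation direction parameter as the difference between the parameters corresponding to the left and right rotations, i.e., the difference between the two rows of $U_5$. We write the rotation direction parameter directly as

$$p_5^{\mathrm{rot}} = U_5^\top \left( q_5^{(1)} - q_5^{(2)} \right), \qquad (16)$$

where $q_5^{(1)}$ and $q_5^{(2)}$ are the one-hot vectors selecting the two rotations. Now the combined rotation direction tensor is written as

$$Q^{\mathrm{rot}} = \bar{p}_2 \otimes \bar{p}_3 \otimes \bar{p}_4 \otimes p_5^{\mathrm{rot}}, \qquad (17)$$

and we can change the rotation of a latent code as

$$w'_k = w_k + \beta\, \mathcal{B}_{kijlm}\, Q^{\mathrm{rot}}_{ijlm}, \qquad (18)$$

where $\beta$ is the strength of the rotation.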
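The construction of the editing directions can be sketched as follows. The code is illustrative only: random arrays stand in for the truncated basis tensor and the factor matrices, the mean parameters are taken as the mean rows of the corresponding factor matrices, the rotation direction is computed on the truncated four-mode tensor for brevity, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, E, R = 12, 5, 6, 2                        # toy sizes, hypothetical
B_trunc = rng.standard_normal((K, P, E, R))     # stand-in truncated basis tensor
U2 = rng.standard_normal((P, P))                # rows: person parameters
U3 = rng.standard_normal((E, E))                # rows: expression parameters
U5 = rng.standard_normal((R, R))                # rows: rotation parameters

def emotion_direction(emotion_index):
    """Mean person and rotation parameters combined with one expression row."""
    Q = np.einsum('i,j,m->ijm',
                  U2.mean(axis=0), U3[emotion_index], U5.mean(axis=0))
    return np.einsum('kijm,ijm->k', B_trunc, Q)

def rotation_direction():
    """Difference of the two rotation rows; the sign is a convention."""
    rot = U5[0] - U5[1]
    Q = np.einsum('i,j,m->ijm', U2.mean(axis=0), U3.mean(axis=0), rot)
    return np.einsum('kijm,ijm->k', B_trunc, Q)

def edit(w, direction, strength):
    """Linear edit applied directly to a latent code."""
    return w + strength * direction

w = rng.standard_normal(K)                      # stand-in for an embedded image
d_emotion = emotion_direction(3)
d_rot = rotation_direction()
w_edited = edit(edit(w, d_emotion, 1.0), d_rot, -0.5)
```

Since the directions depend only on the model and not on the code being edited, they can be computed once and reused for any latent code.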

With this formulation, we apply semantic edits directly in $\mathcal{W}^+$ without the need for estimating the tensor model parameters beforehand, as has otherwise been suggested [Haas2021tensorGAN].


Figure 3: Image embeddings. (a) BU-3DFE images, (b) random samples from the generator, and (c) real images. The embeddings of the original images are shown in the top row, the parameter vector embeddings in the middle, and the parameter tensor embeddings in the bottom row.

3 Experiments

Our tensor model was trained with the latent space projection of images from the Binghamton University 3D Facial Expression database (BU-3DFE) [bu3dfe]. The BU-3DFE database contains 2500 3D face scans and corresponding images from two views of 100 persons (56 female and 44 male) with varying ages (18-70 years) and diverse ethnic/racial ancestries. Each subject was asked to perform the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise, each with four levels of intensity. Additionally, for each participant, a neutral face is provided. Hence, for each person, there are 25 facial expressions in total, recorded from two pose directions, left and right, resulting in 5000 face images. Additionally, we used the FEI face database [Thomaz2010Feidatabase], which contains 14 images of each of 200 individuals, 100 male and 100 female. For each individual, the database contains two frontal images, one with a neutral or non-smiling expression and the other with a smiling facial expression; the rest of the images depict the individual with a neutral expression from various yaw rotations.

3.1 Implementation Details

We use the full resolution, i.e., $1024 \times 1024$, StyleGAN2 [Karras2020StyleGANada] generator, which has been pre-trained on FFHQ [Karras2019StyleGAN]. The tensor model was implemented in PyTorch [Paszke2019PyTorch] using tntorch [tntorch] to calculate the HOSVD. To estimate the tensor model parameters we used gradient descent implemented in PyTorch with the Adam optimizer. For comparing images we use two different metrics: for perceptual image similarity we use LPIPS [Zhang2018LPIPS], and for identity similarity we use ArcFace [Deng2019ArcFace]. To measure the pose of the generated images we use MediaPipe [Lugaresi2019MediaPipe] to extract 2D and 3D landmarks and then solve the Perspective-n-Point (PnP) [Fischler1981RandomSC] problem, which gives a scalar value for the yaw rotation of a given image. We embedded all images into $\mathcal{W}^+$ space using the e4e encoder [Tov2021e4e].

3.2 Subspace Parameter Recovery

We estimated the tensor model parameters for three types of novel latent codes: 1) BU-3DFE latent codes, where we left one person out in the calculation of the tensor model, 2) randomly sampled latent codes, and 3) real images projected into latent space. Fig. 3 shows the result of recovering the tensor model parameters for these three types of latent codes in vector and tensor form, respectively. Using parameter vectors for the tensor model led to a considerably higher reconstruction error than using a parameter tensor, as illustrated in Fig. 4 and quantified in Tab. 1. The randomly sampled images appear to be slightly harder to reconstruct than the embedded real images.

For the representation with parameter vectors, we find that although the proposed regularization (5) leads to a slightly higher reconstruction error, it is important for finding parameter vectors which are suitable for expression editing. Fig. 5 shows that performing expression edits on the regularized parameters leads to less identity change than on the non-regularized parameters. The importance of regularization is more noticeable when we recover the parameters for a randomly generated image than for an image contained in the BU-3DFE database.

Random Latents BU-3DFE Latents
Rank one
Full rank
Table 1: Comparison of the reconstruction error when representing randomly sampled latent codes and latent codes from the BU-3DFE data set with parameter vectors and a parameter tensor, respectively.
Figure 4: Representing a latent code in the tensor model with parameter vectors with and without regularization compared with a representation using a parameter tensor.
(a) Without regularization.
(b) With regularization.
Figure 5: Visual comparison of the effect of regularization for expression editing using parameter vectors for the tensor model.
Figure 6: Direct edit in the $\mathcal{W}^+$ space without prior estimation of the model parameters.

Moreover, it can be seen that estimating the tensor model parameters is not necessary for expression editing, because we can edit the latent code directly by perturbing it in the directions defined by (15) instead of manipulating the estimated parameters of the tensor model. The effect of such a direct edit is illustrated in Fig. 6. The main advantage of performing expression edits in this way is that we avoid the reconstruction error associated with representing the latent code in terms of the tensor model parameters.

3.3 Expression Direction Recovery

Fig. 7 shows the effect of applying the six found latent space directions to the BU-3DFE mean face. We found that subtracting the sadness direction from the mean face also produces a happy facial expression. However, the resulting expression is qualitatively different from adding the happy direction to the mean face. While adding the happy direction results in a wide smile, subtracting the sadness direction results in a smile that is narrower but where the mouth is more open. See the supplementary materials for videos showing the found emotion directions on real face images.

Figure 7: Effect of applying the directions corresponding to the six prototypical expressions to a real image. The rows show the different expressions, determined by the expression parameter, while the strength is modulated by the scalar intensity parameter; the rotation parameters remain unchanged. The right column shows edits in the direction of the respective expression while the left column illustrates its subtraction.

3.4 Comparison to Related Work

We compared the rotation and smile directions found by our approach to those previously found by InterFaceGAN [Shen2020InterfaceganTPAMI] and GANSpace [Harkonen2020GANSpace]. For InterFaceGAN, we used the PyTorch version of the rotation and smile directions provided by the authors of [Roich2021pivotal] at their GitHub repository (https://github.com/danielroich/PTI/tree/main/editings/interfacegan_directions). For the rotations, we chose a manipulation strength that resulted in a similar degree of rotation. To perform rotations with GANSpace [Harkonen2020GANSpace], we initially used the principal component applied to the first three style vectors. However, we found that if we only changed the first three style vectors to edit the rotation, the result tends to break down when the editing strength is large, which is demonstrated in the first row in Fig. 8. If we applied the edit to the first five style vectors instead, we generally obtained better results; see the second row in Fig. 8.

We visually compared the rotations by GANSpace, InterFaceGAN, and our proposed method on images which are randomly sampled from the generator as well as on images from the FEI face database [Thomaz2010Feidatabase]. For the FEI database we used the frontal face images as initial conditions and then applied rotations with GANSpace, InterFaceGAN, and our method to approximate the latent codes corresponding to rotated images from the database. The results on randomly sampled images are shown in Fig. 8 and on the FEI database in Fig. 9, respectively. It can be seen that the quality of the edits is visually on par, except that the gaze direction follows the camera in the InterFaceGAN results.

Figure 8: Comparison of rotations produced by GANSpace [Harkonen2020GANSpace] (top 2 rows), InterFaceGAN [Shen2020InterfaceganTPAMI] (third row) and our approach (bottom). Here GANSpace* refers to a manipulation where we edit the first five style vectors rather than the first three as described in the main text.
Figure 9: Qualitative comparison of the found rotation direction with the equivalent edits from InterFaceGAN [Shen2020InterfaceganTPAMI] and GANSpace [Harkonen2020GANSpace] applied on the FEI face database [Thomaz2010Feidatabase].

3.5 Happy Faces

We compared the found happiness direction to the smile directions from GANSpace and InterFaceGAN, respectively. For GANSpace we used the 47th principal component applied to the 5th and 6th style vectors. The results are shown in Fig. 10. Although each method resulted in a smile in the generated image, the style of the smile differs. Our method yielded a wider smile whereas GANSpace yielded a smile with a larger mouth opening, while the smile by InterFaceGAN seems to fall between these two.

Figure 10: Visual comparison of editing a randomly sampled latent code in the smiling directions found in GANSpace [Harkonen2020GANSpace] and InterFaceGAN [Shen2020InterfaceganTPAMI] with the happiness direction found in this work.

3.6 Face Frontalization

To evaluate face frontalization, we started with the latent codes corresponding to the rotated images in the FEI database [Thomaz2010Feidatabase], then edited the yaw of each latent code to frontalize the images. A qualitative comparison is shown in Fig. 11. In Tab. 2, we compare the perceptual and identity similarity scores of the frontalized images to the ground truth. The frontalized images are very similar to the result obtained by using the pose direction from InterFaceGAN. However, our method yielded better similarity scores with respect to the ground truth. In addition, the gaze direction produced by InterFaceGAN is not straight ahead, whereas ours is.

Figure 11: Qualitative comparison of facial frontalization with InterFaceGAN [Shen2020InterfaceganTPAMI] and our method on the FEI face database [Thomaz2010Feidatabase].
LPIPS [Zhang2018LPIPS] ArcFace [Deng2019ArcFace]
Table 2: Comparison of perceptual and identity similarity scores of facial frontalization of images from the FEI face database with InterFaceGAN [Shen2020InterfaceganTPAMI] and our method. The results are reported as mean value ± standard error of the mean.
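For reference, the reporting convention used in Table 2 (mean value with standard error of the mean) can be computed as in the short sketch below. The scores are made-up placeholders rather than values from the experiments, and `mean_and_sem` is a hypothetical helper name.

```python
import numpy as np

def mean_and_sem(scores):
    """Return the mean and the standard error of the mean of a set of scores."""
    scores = np.asarray(scores, dtype=float)
    # Sample standard deviation (ddof=1) divided by the square root of n.
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return scores.mean(), sem

# Placeholder similarity scores, one per frontalized image (not real data).
mean, sem = mean_and_sem([1.0, 2.0, 3.0, 4.0])
```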

3.7 Validation with expression classifier

To validate that the semantic directions recovered with our approach produce a change in the generated images corresponding to the intended labels, we use a pre-trained expression classifier [pyfeat] which is trained on the FER2013 data set [Goodfellow2013Challenges]. We sampled random images with varying expressions from StyleGAN and edited these in the direction of each basic emotion. Using the classifier, we obtained the probability mass distribution of expressions for the sampled and edited images. From this, we calculated the average difference in probability mass due to the edit and visualize the results with a heatmap in Fig. 12.

The edits in the direction of anger, happiness, sadness, and surprise lead to changes in the class probabilities which correspond to an increase in probability of the expected class labels. However, the edits in the disgust direction lead to an increase in probability for anger as well as disgust, while edits in the fear direction lead to a larger probability mass for the surprise label. This is explained by the fact that PyFeat also classifies the BU-3DFE raw images in a similar way, as can be seen in the confusion matrix in Fig. 13. Thus, this discrepancy is not due to a limitation of our model, but rather due to systematic differences between the BU-3DFE and FER2013 data sets, which are especially apparent for data points annotated with the fear or disgust labels.
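The averaging step behind such a heatmap can be sketched as follows. The classifier outputs are simulated here: random logits stand in for the predictions on sampled images, and each edit is mimicked by boosting one logit, so the numbers say nothing about the actual experiment. The seven-class layout and the function names are assumptions of the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mean_probability_shift(before, after):
    """before: (N, C) class probabilities; after: (D, N, C) after each of D edits.

    Returns a (D, C) matrix with the average change in probability mass.
    """
    return (after - before[None]).mean(axis=1)

rng = np.random.default_rng(0)
num_images, num_classes, num_directions = 50, 7, 6
logits = rng.standard_normal((num_images, num_classes))
before = softmax(logits)
# Simulated edits: direction d raises the logit of class d for every image.
boost = 2.0 * np.eye(num_directions, num_classes)[:, None, :]
after = softmax(logits[None] + boost)

heatmap = mean_probability_shift(before, after)
```

Each row of `heatmap` sums to zero because probability mass is only redistributed between classes, so an increase for one label is always balanced by decreases elsewhere.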

4 Conclusion

In this work, we have presented an extension of the HOSVD-based tensor model proposed in [Haas2021tensorGAN]. In contrast to [Haas2021tensorGAN], (1) we use the e4e encoder [Tov2021e4e] to recover highly editable latent codes for the BU-3DFE database, (2) we improve reconstruction in the tensor model by allowing the parameters to be full-rank, and (3) we show that edits can be applied directly in latent space. Further, we showed that we can calculate linear directions in latent space corresponding to the six prototypical emotions by truncating the emotion intensity subspace. After obtaining a latent representation of the data, constructing the tensor model is fast, requiring only a few minutes to calculate the HOSVD. Further, the latent space directions corresponding to the six prototypical emotions can be calculated from the tensor model and subsequently applied to any latent code in the original latent space without the need to first estimate the subspace parameters, as otherwise suggested in [Haas2021tensorGAN]. In other words, the found semantic directions are global and can be applied to any latent code without any further calculations. Our method is able to identify directions in latent space corresponding to yaw rotation, as well as each of the six basic expressions. The quality of the edits performed with these directions is on par with the corresponding edits using GANSpace [Harkonen2020GANSpace] and InterFaceGAN [Shen2020InterfaceganTPAMI].

Figure 12: Heatmap of the average difference in expression probability masses due to expression edits with our approach. Note that Fear increases the probability mass for Surprise and Disgust increases the probability mass for Anger. The reason is explained in the main text.
Figure 13: Confusion matrix showing the PyFeat classification results on BU-3DFE. It shows that the correlation between Fear/Surprise and Disgust/Anger is not due to a limitation of our model, but can be attributed to the differences between the BU-3DFE and FER2013 data sets.