1 Introduction
Generative Adversarial Networks (GANs) [Goodfellow2014GAN] have emerged as one of the most promising architectures for image synthesis, producing synthetic images with near-perfect photorealism [Karras2018PGGAN, Karras2019StyleGAN, Brock2019BigGAN, Karras2020StyleGAN2, Karras2020StyleGANada]. GANs learn to organize the data they are trained on into a latent space; by drawing samples from this latent space, they can synthesize new images which are not contained in the training data but follow the same distribution. In particular, in the field of face synthesis, StyleGAN has set new standards for what is possible [Karras2019StyleGAN, Karras2020StyleGAN2, Karras2020StyleGANada].
Recent work has explored methods to gain artistic control over the images produced by modern GANs [Shen2020InterfaceganTPAMI, Abdal2020Image2StyleGANpp, Harkonen2020GANSpace, Patashnik2021StyleCLIP, Shen2020SeFa, Tewari2020StyleRig, Wu2020StyleSpace, Abdal2020StyleFlow]. In this work, we use a multilinear tensor model to derive latent space directions in StyleGAN2 [Karras2020StyleGAN2] corresponding to the six prototypical emotions (anger, disgust, happiness, fear, sadness, and surprise) as well as yaw rotation. With these directions, we are able to edit the emotion of real face images as shown in Fig. 1.
StyleGAN.
The StyleGAN generator is composed of two networks, the mapping network $f$ and the synthesis network $g$. The mapping network $f: \mathcal{Z} \to \mathcal{W}$ maps the latent vector $z \in \mathcal{Z}$ onto the auxiliary latent space $\mathcal{W}$, while the synthesis network maps a vector $w \in \mathcal{W}$ to the final output image. The latent vectors in $\mathcal{Z}$ follow the standard normal distribution, while the distribution of the auxiliary latent codes in $\mathcal{W}$ is learned by the mapping network $f$. The main benefit of this mapping is that the $\mathcal{W}$ space is more disentangled than the $\mathcal{Z}$ space [Karras2019StyleGAN]. Every major block of the synthesis network, each corresponding to one resolution, is modulated by two style vectors. Thus, for the full 1024 by 1024 generator there are 9 major blocks and the synthesis network takes a total of 18 style vectors as input. Each set of style vectors has different effects on the synthesized image. In detail, the style vectors for the early layers, corresponding to coarse spatial resolutions, control high-level aspects of the image such as pose and face shape. Style vectors in the middle layers control smaller-scale facial features like hair style and whether the eyes and mouth are open or closed. The style vectors in the later layers, corresponding to higher resolutions, control fine details such as the texture and the microstructure of the generated image [Karras2019StyleGAN]. In $\mathcal{W}$ space, each of the 18 style vectors is identical. However, we can also allow them to differ, in which case the resulting space is denoted as the $\mathcal{W}+$ space. The $\mathcal{W}+$ space can be used for style mixing [Karras2019StyleGAN] and GAN inversion [Abdal2019Image2StyleGAN, Zhu2020InDomain]. Recently, an additional latent space referred to as style space $\mathcal{S}$ has also been proposed [Wu2020StyleSpace].
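The relationship between the $\mathcal{W}$ and $\mathcal{W}+$ representations can be sketched as follows. This is a minimal NumPy illustration: the array shapes match the full-resolution generator, but the vectors are random stand-ins rather than real StyleGAN codes.

```python
import numpy as np

n_layers, dim = 18, 512  # 1024x1024 StyleGAN2: 9 blocks, 2 style inputs each

# A latent code in W space: one 512-d vector shared by all layers.
w = np.random.randn(dim)
w_plus_from_w = np.tile(w, (n_layers, 1))  # shape (18, 512), all rows identical

# A latent code in W+ space: every layer may receive a different style vector.
w_plus = np.random.randn(n_layers, dim)

# W is the subset of W+ in which all 18 rows agree.
assert w_plus_from_w.shape == (n_layers, dim)
assert np.allclose(w_plus_from_w[0], w_plus_from_w[-1])
```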
Semantic Face Editing.
Several methods have been proposed to enable edits of the images produced by StyleGAN. InterFaceGAN [Shen2020Interfacegan, Shen2020InterfaceganTPAMI] uses pretrained binary classifiers to annotate StyleGAN-generated images based on single binary attributes, e.g., young vs. old, male vs. female, glasses vs. no glasses. Support vector machines are then trained on the annotated data to discriminate between each attribute in the latent space. The normal vector of the separating hyperplane defines a direction in latent space that changes the corresponding binary attribute. GANSpace [Harkonen2020GANSpace] finds interpretable directions in an unsupervised fashion with PCA, although manual examination of the found directions is required. Directions found with PCA are typically entangled, affecting multiple attributes; it was shown that the degree of entanglement can be reduced by only applying the found directions to a subset of the style vectors. It has also been proposed to apply an eigenvalue decomposition to the weights of the pretrained generator to discover meaningful semantic directions in the latent space [Shen2020SeFa]. Recently, StyleCLIP [Patashnik2021StyleCLIP] demonstrated text-driven semantic editing by minimizing a CLIP [Radford2021CLIP] loss between a text input and the generated image. StyleFlow [Abdal2020StyleFlow] proposed editing along nonlinear paths using normalizing flows to better preserve identity.
Separate from StyleGAN research, multilinear methods have been widely used to model and analyze faces and expressions [Blanz1999MorphableModel, Ferrari2017Dictionary3DMM, tensorface, grasshof2020Multilinear]. Recently there has been some interest in applying these methods to explore the latent space of GANs. For example, StyleRig [Tewari2020StyleRig] proposes edits by minimizing the loss between the generated image and an image rendered by a 3D morphable model. Furthermore, models based on the Higher-Order Singular Value Decomposition (HOSVD) have successfully been used to model faces, their 3D reconstruction, as well as to transfer expressions [Vasilescu2002Tensorface, Vlasic2005FaceTransfer, Brunton2014MultilinearWavelets, Chen2014FaceWarehouse]. Recently, it has been suggested [Haas2021tensorGAN] to use such an HOSVD-based tensor model for semantic face editing in StyleGAN. Here, a facial expression database was projected into the StyleGAN latent space, and relevant semantic subspaces corresponding to identity, expression and yaw rotation were defined using HOSVD-based subspace factorization. The model showed limited flexibility for representing arbitrary latent codes, and to overcome this a stacked style-separated model was proposed, which extended the tensor model to an ensemble of tensor models, one for each style vector in $\mathcal{W}+$. Further, it was shown that in the derived expression subspace, each of the six prototypical emotions forms a nearly linear trajectory, in agreement with [Grasshof2017apathy].
Although initial results were promising, convincing expression editing using an HOSVD-based model on the StyleGAN latent space had not yet been demonstrated. We propose a solution to this shortcoming and demonstrate the robustness and competitiveness of our approach in this work.
Generator Inversion.
To facilitate editing of real images, the images first need to be projected into the StyleGAN latent space. This is also referred to as GAN inversion [Zhu2018GANInversion]: the problem is to find a latent code that, when passed to the generator, produces an image as close as possible to a given target image. Typically, GAN inversion techniques are either based on training an encoder [Pidhorskyi2020ALAE, Richardson2021pSp, Tov2021e4e, alaluf2021restyle], which can embed an image into latent space at inference time, or on optimization [Karras2020StyleGAN2, Abdal2019Image2StyleGAN, yang2019unconstrained, Abdal2020Image2StyleGANpp]. In the latter approach, the latent code is found by minimizing a loss function, typically pixelwise L2 or a perceptual image similarity [Zhang2018LPIPS]. Hybrid approaches have also been proposed, which use a trained encoder to find a good initial condition for subsequent iterative optimization of the latent code [puzer, Zhu2020InDomain]. Recently, [Roich2021pivotal] showed that novel images can be embedded into $\mathcal{W}$ space with a lower reconstruction error by fine-tuning the pretrained generator on the target image such that the latent code in $\mathcal{W}$ space yields an image closer to the target.
Recent work [Tov2021e4e] suggests that there is a tradeoff between distortion and editability when selecting which latent space to project a given target image into. When projecting out-of-domain images into the StyleGAN latent space, picking the extended $\mathcal{W}+$ space leads to a higher-quality reconstruction, i.e., it yields an image closer to the target image. However, latent codes in $\mathcal{W}+$ space are generally less editable than latent codes in $\mathcal{W}$ space. To find latent codes with the optimal tradeoff between distortion and editability, a novel training methodology was proposed [Tov2021e4e] which embeds images into $\mathcal{W}+$ space in a way that constrains the latent codes to be as close to $\mathcal{W}$ space as possible.
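Optimization-based inversion can be sketched with a toy stand-in for the generator. The linear map, sizes, and learning rate below are illustrative assumptions only; a real inversion would optimize through the actual synthesis network and typically combine pixelwise L2 with a perceptual loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "generator": a fixed linear map from a 16-d latent to a 64-d "image".
A = rng.standard_normal((64, 16))

def generate(w):
    return A @ w

# Target "image" produced by some unknown latent code.
target = generate(rng.standard_normal(16))

# Optimization-based inversion: gradient descent on the pixelwise L2 loss
# ||generate(w) - target||^2, starting from the origin of latent space.
w = np.zeros(16)
lr = 0.003
for _ in range(2000):
    residual = generate(w) - target
    grad = 2 * A.T @ residual  # analytic gradient of the quadratic loss
    w -= lr * grad

# The recovered latent code reproduces the target almost exactly.
assert np.linalg.norm(generate(w) - target) < 1e-4
```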
Contributions.
Our contributions can be summarized as follows:


We show that an HOSVD-based tensor model is able to robustly discover novel semantic directions, corresponding to the six prototypical emotions, in pretrained GANs.

We show that convincing emotion directions can be derived by truncating the expression intensity subspace.

We show that, by using the e4e encoder [Tov2021e4e] for projecting real images into the latent space of StyleGAN, it is possible to construct a tensor model which enables stable rotation and expression transfer on real faces.

We show that the previously proposed tensor model for the GAN latent space [Haas2021tensorGAN] had an implicit rank-one constraint, which can be relaxed, leading to lower reconstruction errors.
2 Method
In this section, we describe the tensor model formulation of [Haas2021tensorGAN] and propose two extensions to it: (1) we show how to relax the implicit rank-one constraint of the model by replacing the set of parameter vectors with a single full-rank parameter tensor, and (2) we show how to derive emotion directions in latent space by truncating the expression intensity subspace. An overview of our approach is shown in Fig. 2.
2.1 Multilinear Tensor Model
Given a data set of StyleGAN latent codes in $\mathcal{W}+$, we represent each latent code as a vector $w \in \mathbb{R}^n$, where $n = 18 \cdot 512 = 9216$ for the generator producing $1024 \times 1024$ images. Suppose we have latent codes for $P$ different persons, performing $E$ expressions, each with $I$ different intensities, from $R$ different rotations; then we arrange the data into the 5th-order tensor $\mathcal{X} \in \mathbb{R}^{P \times E \times I \times R \times n}$. We then calculate the Higher-Order Singular Value Decomposition (HOSVD) of the mean-centered data tensor as
$\mathcal{X} - \bar{\mathcal{X}} = \mathcal{C} \times_1 U_p \times_2 U_e \times_3 U_i \times_4 U_r \times_5 U_w,$ (1)
where $\mathcal{C}$ is the core tensor and $\times_k$ denotes the mode-$k$ tensor-matrix product. The mean tensor is written as $\bar{\mathcal{X}} = \mathbb{1}_P \otimes \mathbb{1}_E \otimes \mathbb{1}_I \otimes \mathbb{1}_R \otimes \bar{w}$, where $\bar{w}$ is the mean latent code from the data set, $\mathbb{1}_d$ is a vector of ones with dimension $d$, and $\otimes$ denotes the tensor product. The matrices $U_p$, $U_e$, $U_i$, $U_r$, and $U_w$ have orthonormal columns, i.e., $U^\top U = I$, and are constructed from the left singular vectors of the mode-$k$ matrix unfoldings of the mean-centered data tensor. The columns of each matrix form the basis for the respective subspace. The columns of $U_w$ form a basis for the latent space and are identical to the principal components [grasshof2020Multilinear]. Likewise, $U_p$, $U_e$, $U_i$, and $U_r$ form the bases for the person identity, expression, intensity and rotation subspaces, respectively.
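The decomposition in (1) can be sketched in NumPy. The `unfold` and `hosvd` helpers and the toy tensor sizes below are illustrative stand-ins, not the paper's implementation (which uses tntorch on much larger tensors).

```python
import numpy as np

def unfold(T, mode):
    """Mode-k matricization: mode-k fibers become the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, U, mode):
    """Mode-k tensor-matrix product T x_k U."""
    return np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T):
    """Factor matrices from the left singular vectors of each unfolding, plus core."""
    Us = [np.linalg.svd(unfold(T, k), full_matrices=False)[0] for k in range(T.ndim)]
    core = T
    for k, U in enumerate(Us):
        core = mode_product(core, U.T, k)  # project onto each subspace basis
    return core, Us

# Toy data tensor: persons x expressions x intensities x rotations x latent dim.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3, 2, 2, 10))
Xc = X - X.mean(axis=(0, 1, 2, 3))  # subtract the mean latent code

core, Us = hosvd(Xc)

# Multiplying the core by all factor matrices reconstructs the centered tensor.
R = core
for k, U in enumerate(Us):
    R = mode_product(R, U, k)
assert np.allclose(R, Xc)
```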
Parameter Vectors.
To recover a specific latent code from the tensor model, we select appropriate rows of $U_p$, $U_e$, $U_i$, and $U_r$ corresponding to the desired person, expression, expression intensity, and rotation, respectively. By introducing one-hot vectors $p$, $e$, $i$, and $r$, which we will refer to as the canonical parameters for the tensor model, we get
$w = \bar{w} + \mathcal{C} \times_1 (p^\top U_p) \times_2 (e^\top U_e) \times_3 (i^\top U_i) \times_4 (r^\top U_r) \times_5 U_w,$ (2)
where $p \in \mathbb{R}^P$, $e \in \mathbb{R}^E$, $i \in \mathbb{R}^I$, and $r \in \mathbb{R}^R$. This formulation is analogous to the one proposed in [Grasshof2017apathy, grasshof2020Multilinear] and, subsequently, [Haas2021tensorGAN]. Now, (2) can be further simplified by defining $\mathcal{A} = \mathcal{C} \times_5 U_w$ and the parameter vectors $\tilde{p} = U_p^\top p$, $\tilde{e} = U_e^\top e$, $\tilde{i} = U_i^\top i$, and $\tilde{r} = U_r^\top r$, which allows us to write
$w = \bar{w} + \mathcal{A} \times_1 \tilde{p}^\top \times_2 \tilde{e}^\top \times_3 \tilde{i}^\top \times_4 \tilde{r}^\top,$ (3)
which is a more compact representation of the tensor model.
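The equivalence between the successive mode products of (2) and the compact form (3), obtained by folding the latent basis into the core tensor once, can be checked numerically. All arrays below are random stand-ins, and the person, expression, intensity and rotation bases are taken as identity matrices for brevity, so the projected parameters coincide with the one-hot canonical ones.

```python
import numpy as np

rng = np.random.default_rng(1)
P, E, I, R, n = 4, 3, 2, 2, 10

C = rng.standard_normal((P, E, I, R, n))              # stand-in core tensor
Uw = np.linalg.qr(rng.standard_normal((n, n)))[0]     # stand-in latent basis
w_mean = rng.standard_normal(n)

# One-hot canonical parameters: person 2, expression 1, intensity 0, rotation 1.
p, e, i, r = np.eye(P)[2], np.eye(E)[1], np.eye(I)[0], np.eye(R)[1]

# Eq. (2): contract the core with all parameter vectors and with U_w.
w2 = w_mean + np.einsum('jklqm,j,k,l,q,nm->n', C, p, e, i, r, Uw)

# Eq. (3): precompute A = C x_5 U_w once, then contract with the parameters.
A = np.einsum('jklqm,nm->jklqn', C, Uw)
w3 = w_mean + np.einsum('jklqn,j,k,l,q->n', A, p, e, i, r)

assert np.allclose(w2, w3)
```

Precomputing $\mathcal{A}$ is what makes repeated latent code synthesis cheap: the expensive multiplication with the large latent basis happens once, after which every latent code is a small multilinear contraction.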
Recovering Subspace Parameters.
To find the parameters for a novel latent code $w$, with $\hat{w}(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r})$ denoting the model reconstruction which should best approximate $w$, one could minimize the loss
$L(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}) = \| w - \hat{w}(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}) \|_2^2.$ (4)
Additionally, it has been proposed in [Grasshof2017apathy] to regularize the solution with the Tikhonov regularizer
$R(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}) = \|\tilde{p}\|_2^2 + \|\tilde{e}\|_2^2 + \|\tilde{i}\|_2^2 + \|\tilde{r}\|_2^2$ (5)
and a sum constraint on the canonical parameters, which yields the regularized minimization problem
$\min_{\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}} \; L(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}) + \lambda R(\tilde{p}, \tilde{e}, \tilde{i}, \tilde{r}) \quad \text{s.t.} \quad \mathbb{1}^\top p = \mathbb{1}^\top e = \mathbb{1}^\top i = \mathbb{1}^\top r = 1.$ (6)
This regularization is important for finding stable parameter vector representations and thereby enables expression editing for latent codes corresponding to novel images, as will be seen below.
Relaxing the RankOne Constraint.
In the tensor model (3), each latent code is entirely determined by the four parameter vectors $\tilde{p}$, $\tilde{e}$, $\tilde{i}$, and $\tilde{r}$ corresponding to identity, expression, expression intensity and rotation, respectively. Using component notation and the Einstein summation convention, we rewrite (3) as
$w_m = \bar{w}_m + \mathcal{A}_{jklqm} B_{jklq},$ (7)
where $B_{jklq} = \tilde{p}_j \tilde{e}_k \tilde{i}_l \tilde{r}_q$ is a rank-one tensor.
Now, we propose to relax this implicit rank-one constraint and instead allow the tensor $\mathcal{B}$ to be full rank, which leads to the problem
$\min_{\mathcal{B}} \sum_m \big( w_m - \bar{w}_m - \mathcal{A}_{jklqm} \mathcal{B}_{jklq} \big)^2.$ (8)
The relaxation increases the number of parameters of the tensor model from $P + E + I + R$ to $P \cdot E \cdot I \cdot R$. This results in a more flexible model which yields lower reconstruction errors for novel latent codes.
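Since the relaxed problem (8) is linear in the parameter tensor, it can be solved in closed form with ordinary least squares rather than iterative optimization. The following sketch uses toy sizes and random stand-ins for the combined tensor and the mean latent code; the real model has a much larger latent dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
P, E, I, R, n = 4, 3, 2, 2, 60  # toy sizes; n = 9216 for the real model

A = rng.standard_normal((P, E, I, R, n))  # stand-in for the combined tensor
w_mean = rng.standard_normal(n)

# A novel latent code generated by a full-rank parameter tensor B_true.
B_true = rng.standard_normal((P, E, I, R))
w = w_mean + np.einsum('jklqm,jklq->m', A, B_true)

# Problem (8) is linear in B: flatten the first four modes and solve
# ordinary least squares for the P*E*I*R coefficients.
A_flat = A.reshape(P * E * I * R, n)
B_hat, *_ = np.linalg.lstsq(A_flat.T, w - w_mean, rcond=None)
B_hat = B_hat.reshape(P, E, I, R)

residual = w - w_mean - np.einsum('jklqm,jklq->m', A, B_hat)
assert np.linalg.norm(residual) < 1e-8

# The recovered parameter tensor is genuinely full rank: its mode-1 unfolding
# has more than one significant singular value, which a rank-one tensor
# (the outer product of four parameter vectors) cannot have.
s = np.linalg.svd(B_hat.reshape(P, -1), compute_uv=False)
assert s[1] > 1e-6
```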
2.2 Truncating the Expression Intensity Subspace
From (1), the expression intensity subspace is truncated to a one-dimensional subspace by selecting the dominant singular vector, i.e., the first column of $U_i$, which we denote $u_1$. The truncated core tensor is then written as
$\mathcal{C}' = (\mathcal{X} - \bar{\mathcal{X}}) \times_1 U_p^\top \times_2 U_e^\top \times_3 u_1^\top \times_4 U_r^\top \times_5 U_w^\top.$ (9)
Defining $\mathcal{A}' = \mathcal{C}' \times_5 U_w$ as before, the model is written similarly to (2) and (3) as
$w = \bar{w} + \mathcal{A}' \times_1 \tilde{p}^\top \times_2 \tilde{e}^\top \times_3 \lambda \times_4 \tilde{r}^\top,$ (10)
where the corresponding intensity parameter $\lambda$ is a scalar since the expression intensity subspace has been truncated. Thus, the expression intensity factors out of the model and we may write
$w = \bar{w} + \lambda \left( \mathcal{A}' \times_1 \tilde{p}^\top \times_2 \tilde{e}^\top \times_4 \tilde{r}^\top \right),$ (11)
where $\lambda$ can now be interpreted as the expression intensity parameter. We trivially unfold the singleton dimension of $\mathcal{A}'$ corresponding to the intensity subspace, i.e., $\hat{\mathcal{A}}_{jkqm} = \mathcal{A}'_{jk1qm}$, and then write the model as
$w_m = \bar{w}_m + \lambda\, \hat{\mathcal{A}}_{jkqm}\, \tilde{p}_j \tilde{e}_k \tilde{r}_q.$ (12)
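A small sketch of the truncated model, with random stand-ins for the combined tensor (intensity mode already removed), the mean latent code, and the parameter vectors, showing that the scalar intensity parameter simply scales the offset from the mean latent code:

```python
import numpy as np

rng = np.random.default_rng(3)
P, E, R, n = 4, 3, 2, 10

# Stand-in for the combined tensor after truncating the intensity subspace
# and dropping the resulting singleton mode.
A_hat = rng.standard_normal((P, E, R, n))
w_mean = rng.standard_normal(n)
p, e, r = rng.standard_normal(P), rng.standard_normal(E), rng.standard_normal(R)

def latent(lam):
    # Truncated model: the scalar intensity factors out of the contraction.
    return w_mean + lam * np.einsum('jkqm,j,k,q->m', A_hat, p, e, r)

# Doubling the intensity doubles the offset from the mean latent code.
d1 = latent(1.0) - w_mean
d2 = latent(2.0) - w_mean
assert np.allclose(d2, 2 * d1)
```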
2.3 Recovering Semantic Directions
Emotion Directions.
We define emotion directions in latent space by selecting an appropriate row of $U_e$ corresponding to the emotion of interest. The combined parameter tensor corresponding to an expression direction is then written as
$B_{jkq} = \bar{\tilde{p}}_j\, \tilde{e}_k\, \bar{\tilde{r}}_q,$ (13)
where $\bar{\tilde{p}}$ and $\bar{\tilde{r}}$ are the mean person and rotation parameters, respectively. To change the expression of a given latent code $w$, we interpolate linearly in the direction given by the vector $n$ with components
$n_m = \hat{\mathcal{A}}_{jkqm} B_{jkq},$ (14)
thus performing an expression edit as
$w' = w + \lambda n.$ (15)
Rotation Direction.
We edit rotations in a similar way. First, we select the mean person and expression parameters $\bar{\tilde{p}}$ and $\bar{\tilde{e}}$ (the expression intensity is the scalar $\lambda$), and then define the rotation direction parameter as the difference between the parameters corresponding to the left and right rotations, i.e., the difference between the two rows of $U_r$. We write the rotation direction parameter directly as
$\tilde{r}_{\mathrm{rot}} = \tilde{r}_{\mathrm{left}} - \tilde{r}_{\mathrm{right}}.$ (16)
Now the combined rotation direction tensor is written as
$B^{\mathrm{rot}}_{jkq} = \bar{\tilde{p}}_j\, \bar{\tilde{e}}_k\, (\tilde{r}_{\mathrm{rot}})_q,$ (17)
and we can change the rotation of a latent code as
$w' = w + \lambda\, n^{\mathrm{rot}}, \quad n^{\mathrm{rot}}_m = \hat{\mathcal{A}}_{jkqm} B^{\mathrm{rot}}_{jkq},$ (18)
where $\lambda$ is the strength of the rotation.
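The rotation edit mirrors the expression edit, except that the direction parameter is a difference of the two rows of the rotation basis rather than a single row. Again, all arrays are random stand-ins with toy sizes.

```python
import numpy as np

rng = np.random.default_rng(5)
P, E, R, n = 4, 6, 2, 10

A_hat = rng.standard_normal((P, E, R, n))          # stand-in combined tensor
Ur = np.linalg.qr(rng.standard_normal((R, R)))[0]  # stand-in rotation basis
p_mean = rng.standard_normal(P)                    # stand-in mean person parameter
e_mean = rng.standard_normal(E)                    # stand-in mean expression parameter

# Rotation direction parameter: left-pose row minus right-pose row of U_r.
r_dir = Ur[0] - Ur[1]

# Contract to a latent-space direction and apply with a chosen strength.
n_rot = np.einsum('jkqm,j,k,q->m', A_hat, p_mean, e_mean, r_dir)
w = rng.standard_normal(n)
strength = 0.8
w_rot = w + strength * n_rot

# Applying the opposite strength undoes the edit.
assert np.allclose(w_rot - strength * n_rot, w)
```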
With this formulation, we apply semantic edits directly in latent space, without the need for estimating the tensor model parameters beforehand, as has otherwise been suggested [Haas2021tensorGAN].



3 Experiments
Our tensor model was trained on latent space projections of images from the Binghamton University 3D Facial Expression database (BU3DFE) [bu3dfe]. The BU3DFE database contains 2500 3D face scans and corresponding images from two views of 100 persons (56 female and 44 male) with varying ages (18-70 years) and diverse ethnic/racial ancestries. Each subject was asked to perform the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise, each at four levels of intensity. Additionally, for each participant, a neutral face is provided. Hence, for each person, there are 25 facial expressions in total, recorded from two pose directions, left and right, resulting in 5000 face images. Additionally, we used the FEI face database [Thomaz2010Feidatabase], which contains 14 images of each of 200 individuals, 100 male and 100 female. For each individual, the database contains two frontal images, one with a neutral or non-smiling expression and the other with a smiling facial expression; the remaining images depict the individual with a neutral expression at various yaw rotations.
3.1 Implementation Details
We use the full-resolution, i.e., $1024 \times 1024$, StyleGAN2 [Karras2020StyleGANada] generator which has been pretrained on FFHQ [Karras2019StyleGAN]. The tensor model was implemented in PyTorch [Paszke2019PyTorch] using tntorch [tntorch] to calculate the HOSVD. To estimate the tensor model parameters, we used gradient descent implemented in PyTorch with the Adam optimizer. For comparing images, we use two different metrics: for perceptual image similarity we use LPIPS [Zhang2018LPIPS], and for identity similarity we use ArcFace [Deng2019ArcFace]. To measure the pose of the generated images, we use MediaPipe [Lugaresi2019MediaPipe] to extract 2D and 3D landmarks and then solve the Perspective-n-Point (PnP) [Fischler1981RandomSC] problem, which gives a scalar value for the yaw rotation of a given image. We embedded all images into $\mathcal{W}+$ space using the e4e encoder [Tov2021e4e].
3.2 Subspace Parameter Recovery
We estimated the tensor model parameters for three types of novel latent codes: 1) BU3DFE latent codes where we left one person out of the calculation of the tensor model, 2) randomly sampled latent codes, and 3) real images projected into latent space. Fig. 12 shows the result of recovering the tensor model parameters for these three types of latent codes when recovering the parameters in vector and tensor form, respectively. It can be seen that using parameter vectors for the tensor model leads to a significantly higher reconstruction loss compared to using a parameter tensor representation, as illustrated in Fig. 4 and quantified in Tab. 1. The randomly sampled images appear slightly harder to reconstruct than the embedded real images.
For the representation with parameter vectors, we find that although the proposed regularization (5) leads to a slightly higher reconstruction error, it is important for finding parameter vectors which are suitable for expression editing. Fig. 5 shows that performing expression edits on the regularized parameters leads to less identity change compared to the non-regularized parameters. The importance of regularization is more noticeable when we recover the parameters for a randomly generated image than for an image contained in the BU3DFE database.
[Tab. 1: reconstruction errors of the rank-one and full-rank parameter representations for random latents and BU3DFE latents.]
Moreover, it can be seen that the tensor model is not strictly necessary for expression editing, because we can edit a latent code directly by perturbing it along the directions defined by (15), instead of manipulating the estimated parameters of the tensor model. The effect of such a direct edit is illustrated in Fig. 6. The main advantage of performing expression edits in this way is that we avoid the reconstruction error associated with representing the latent code in terms of the tensor model parameters.
3.3 Expression Direction Recovery
Fig. 7 shows the effect of applying the six found latent space directions to the BU3DFE mean face. We found that subtracting the sadness direction from the mean face also produces a happy facial expression. However, the resulting expression is qualitatively different from the one obtained by adding the happy direction to the mean face: while adding the happy direction results in a wide smile, subtracting the sadness direction results in a smile that is narrower but where the mouth is more open. See the supplementary materials for videos showing the found emotion directions applied to real face images.
3.4 Comparison to Related Work
We compared the rotation and smile directions found by our approach to those previously found by InterFaceGAN [Shen2020InterfaceganTPAMI] and GANSpace [Harkonen2020GANSpace]. For InterFaceGAN, we used the PyTorch version of the rotation and smile directions provided by the authors of [Roich2021pivotal] in their GitHub repository (https://github.com/danielroich/PTI/tree/main/editings/interfacegan_directions). For the rotations, we chose a manipulation strength that resulted in a similar degree of rotation. To perform rotations with GANSpace [Harkonen2020GANSpace], we initially used the principal component applied to the first three style vectors. However, we found that if we only changed the first three style vectors to edit the rotation, the result tends to break down when the editing strength is large, as demonstrated in the first row of Fig. 8. If we applied the edit to the first five style vectors instead, we generally obtained better results, as shown in the second row of Fig. 8.
We visually compared the rotations by GANSpace, InterFaceGAN and our proposed method on images which are randomly sampled from the generator as well as on images from the FEI face database [Thomaz2010Feidatabase]. For the FEI database, we used the frontal face images as initial conditions and then applied rotations with GANSpace, InterFaceGAN and our method to approximate the latent codes corresponding to rotated images from the database. The results on randomly sampled images are shown in Fig. 8 and on the FEI database in Fig. 9, respectively. It can be seen that the quality of the edits is visually on par, except that the gaze direction follows the camera in the InterFaceGAN results.
3.5 Happy Faces
We compared the found happiness direction to the smile directions from GANSpace and InterFaceGAN, respectively. For GANSpace, we used the 47th principal component applied to the 5th and 6th style vectors. The results are shown in Fig. 10. Although each method resulted in a smile in the generated image, the style of smile is different: our method yielded a wider smile, GANSpace yielded a smile with a larger mouth opening, while the smile by InterFaceGAN falls between these two.
3.6 Face Frontalization
To experiment with face frontalization, we started with the latent codes corresponding to the rotated images in the FEI database [Thomaz2010Feidatabase] and then edited the yaw of each latent code to frontalize the images. A qualitative comparison is shown in Fig. 11. In Tab. 2, we compare the perceptual and identity similarity scores of the frontalized images against the ground truth. It can be seen that our frontalized images are very similar to those obtained using the pose direction from InterFaceGAN. However, our method yielded better similarity scores against the ground truth. In addition, the gaze direction in the InterFaceGAN results is not straight ahead, whereas ours is.
[Tab. 2: LPIPS [Zhang2018LPIPS] and ArcFace [Deng2019ArcFace] similarity scores against the ground truth for InterFaceGAN and TensorGAN frontalization.]
3.7 Validation with Expression Classifier
To validate that the semantic directions recovered with our approach produce a change in the generated images corresponding to the intended labels, we use a pretrained expression classifier [pyfeat] which is trained on the FER2013 data set [Goodfellow2013Challenges]. We sampled random images with varying expressions from StyleGAN and edited these in the direction of each basic emotion. Using the classifier, we obtained the probability mass distribution of expressions for the sampled and edited images. From this, we calculated the average difference in probability mass due to the edit and visualize the results with a heatmap in Fig. 12. The edits in the direction of anger, happiness, sadness, and surprise lead to changes in the class probabilities corresponding to an increase in probability of the expected class labels. However, edits in the disgust direction lead to an increase in probability for anger as well as disgust, while edits in the fear direction lead to a larger probability mass for the surprise label. This is explained by the fact that PyFeat also classifies the raw BU3DFE images in a similar way, as can be seen in the confusion matrix in Fig. 13. Thus, this discrepancy is not due to a limitation of our model, but rather due to systematic differences between the BU3DFE and FER2013 data sets, which are especially apparent for data points annotated with the fear or disgust labels.
4 Conclusion
In this work, we have presented an extension of the HOSVD-based tensor model proposed in [Haas2021tensorGAN]. In contrast to [Haas2021tensorGAN], (1) we use the e4e encoder [Tov2021e4e] to recover highly editable latent codes for the BU3DFE database, (2) we improve reconstruction in the tensor model by allowing the parameters to be full-rank, and (3) we show that edits can be applied directly in latent space. Further, we showed that we can calculate linear directions in latent space corresponding to the six prototypical emotions by truncating the emotion intensity subspace. After obtaining a latent representation of the data, constructing the tensor model is fast, requiring only a few minutes to calculate the HOSVD. Further, the latent space directions corresponding to the six prototypical emotions can be calculated from the tensor model and subsequently applied to any latent code in the original latent space without the need to first estimate the subspace parameters, as otherwise suggested in [Haas2021tensorGAN]. In other words, the found semantic directions are global and can be applied to any latent code without any further calculations. Our
method is able to identify directions in latent space corresponding to yaw rotation, as well as each of the six basic expressions. The quality of the edits performed with these directions is on par with the corresponding edits using GANSpace [Harkonen2020GANSpace] and InterFaceGAN [Shen2020InterfaceganTPAMI].