
Style Transformer for Image Inversion and Editing

03/15/2022
by Xueqi Hu, et al.

Existing GAN inversion methods fail to provide latent codes for reliable reconstruction and flexible editing simultaneously. This paper presents a transformer-based image inversion and editing model for pretrained StyleGAN which not only achieves lower distortion, but also offers high quality and flexibility for editing. The proposed model employs a CNN encoder to provide multi-scale image features as keys and values. Meanwhile, it regards the style codes to be determined for different layers of the generator as queries. It first initializes the query tokens as learnable parameters and maps them into W+ space. Then multi-stage alternating self- and cross-attention is utilized, updating the queries with the purpose of inverting the input through the generator. Moreover, based on the inverted code, we investigate reference- and label-based attribute editing through a pretrained latent classifier, and achieve flexible image-to-image translation with high quality results. Extensive experiments are carried out, showing better performance on both inversion and editing tasks within StyleGAN.




1 Introduction

Generative Adversarial Networks (GANs) [5, 27, 29] have been significantly improved in recent years. In particular, with the help of AdaIN [15] or its variant ModulatedConv, StyleGAN [17, 18] is able to synthesize high-resolution images of high quality. Therefore, utilizing a pretrained and fixed StyleGAN for downstream tasks has become a popular research topic, especially for the editing task of image-to-image (I2I) translation [11, 31, 32, 37, 38, 3].

To edit a given real-world image, we first need to find its input noise vector z or intermediate latent code w, which can faithfully reconstruct the specified real image using the pretrained generator. Then, the code is modified by an offset corresponding to the target attribute, so that it can be mapped into an edited image while preserving the original details. Despite great efforts, inverting [1, 2, 43, 30, 33] or editing [11, 31, 37, 32, 3] images for StyleGAN is still challenging for the following reasons. First, there are several candidate latent embeddings, and existing methods [33, 45, 37] reveal that the choice among them is critical. Compared to the Z or W space with a single 512-d vector, the W+ space has enough capacity to describe image details and is therefore suitable for inversion. In W+, each image is represented by 18 different codes, each of which is 512-d. They are given to the generator to formulate features and the final synthesis from low to high resolutions in sequence. However, the code in W+ cannot be well edited unless enough regularization is imposed. Second, the distributions in W or W+ are highly complex. Real images only lie on a manifold in the space [33]. Moreover, different dimensions are often entangled for a single attribute, making independent editing difficult.

This paper aims to improve encoder-based image inversion and editing for StyleGAN at the same time. Inspired by the great success of transformers in image classification [10, 23] and object detection [6, 46], we utilize them to find the appropriate latent code in the W+ space for image inversion and editing. The basic idea is to regard the latent codes in different generation stages as query tokens, and the image features at different spatial positions as keys and values, and then perform multi-head cross-attention to update the queries in an iterative way. Meanwhile, before the cross-attention, the queries are also allowed to access each other through self-attention, which enhances the regularization on them, so the final codes given to the generator become tightly linked.

Particularly, the queries first interact with the image features (keys) by comparing similarities between each query-key pair. These similarities are then organized into an attention matrix to dynamically weight the features (values) and update the queries for the transformer block in the next stage. The image features, used as keys and values, are obtained by a CNN encoder. To capture image details at different resolutions, we employ the two-pyramid encoder proposed in [30] to provide multi-scale features as keys and values. Note that our model performs multiple cross-attentions from low to high resolutions, so the style queries are gradually updated by features at different scales. Therefore, the general contents in the queries are formulated first and then refined by details at higher resolutions. After several rounds of self- and cross-attention, the queries absorb enough details from the input image, so they can be utilized to invert it through the pretrained generator.

We are further interested in how to edit the codes to translate a specified attribute. Traditional approaches [11, 31, 37] assume linear separation in the latent space for a binary attribute, so inverted codes from different images are edited along the same direction. We argue that an identical direction is not optimal for editing quality, and may reduce result diversity. Inspired by [7, 14], we divide image editing in StyleGAN into two different types. One is label-based editing, in which only a target label is specified. The other is reference-based editing, which requires another image to supply the desired style. For the former, a pretrained non-linear latent classifier is used to determine the direction. It computes a loss for the inverted code according to the target label, and its gradient is back-propagated to the code, giving the editing direction. In the latter case, we want to determine the exact editing vector from the reference. Therefore, the inverted code from the source is used as the query, and that from the reference as the key and value. Cross-attention is performed between them. The parameters of the attention module are trained under supervision from the latent classifier, encouraging the edited attribute to be similar to the reference while leaving other attributes unchanged. The proposed editing method is able to give diverse results while maintaining image quality.

The contributions of the paper are summarized in the following aspects. First, we propose a novel multi-stage style transformer in the W+ space to invert images accurately. The transformer includes self- and cross-attention modules, in which the style queries are gradually updated from the multi-scale image features. Second, we characterize image editing in StyleGAN as label-based and reference-based, and use a non-linear classifier to generate the editing vector. Diverse and high-fidelity editing results are obtained.

2 Related Works

GAN inversion was first proposed in [44] and has become important due to the wide applications of recent generators. There are basically two approaches, either encoder-free or encoder-based. The former does not have any trainable parameters, and the latent codes are directly optimized by the gradient, mainly from a reconstruction loss. To deal with the complex latent structure, Abdal et al. [1] invert a real image in the W+ space, and use pixel-wise MSE and perceptual losses with Adam [20] to tune the code. Image2StyleGAN++ [2] extends the code space with layer-wise additive noise vectors to decrease the distortion. Although such methods can reliably find the code through multi-step iterations, they are inefficient and their codes lack editability.

In contrast, encoder-based methods intend to train a common model to achieve inversion for all images. They improve the editing ability and are efficient during inference. IDInvert [43] utilizes a CNN as encoder to output the code. Besides the reconstruction and perceptual losses, it is trained with an extra adversarial loss. pSp [30] designs a two-pyramid encoder to provide multi-scale features, and maps them to the style vectors through multiple convolution layers. Benefiting from strong features, pSp achieves lower distortion. ReStyle [4] uses the encoder to produce a residual style that refines the inversion iteratively. E4e [33] analyzes the distortion-editability trade-off for inversion and editing tasks in W+. It sacrifices inversion accuracy to improve editability, constraining the codes for different layers to be close to each other. Kim et al. [19] and Wang et al. [36] depart from the W+ space and enhance the code with spatial dimensions, so that more information is given to the generator to reduce distortion. Compared to previous works, our method stays strictly in W+, and is able to achieve minimal distortion and high-quality editing at the same time.

Latent code manipulation for a pretrained StyleGAN is often used to edit attributes and achieve I2I translation, in either a supervised or unsupervised manner. GANSpace [11] and SeFa [32] adopt PCA to find the principal directions in the latent space. These directions are responsible for controlling the pose, gender or background. Note that for a particular attribute, these works apply the same direction to all latent codes to realize editing. Voynov and Babenko [34] train a simple module to edit the input, and use a reconstructor in the pixel domain to interpret the editing, finding noticeable directions. LatentCLR [40] builds a learnable direction model to edit the code and trains it with a contrastive loss. Hence, these two models give different images unique editing directions. All the above works are unsupervised, requiring no attribute labels for editing, but only limited directions for some attributes can be found.

The supervised methods can identify directions for more attributes, especially the local ones. InterfaceGAN [31] trains a linear binary SVM in the latent space to obtain a separation plane, whose normal vector controls the corresponding attribute. StyleSpace [37] finds the control direction in a precise way guided by the semantic mask. Moreover, it proposes to edit the code in the S space, which is defined by the affine layer after W+. Wang et al. [35] further extend this space by tracing back the gradient flow to its previous stage, making the changes more accurate. Note that these works still share the same editing direction for all images. Recently, StyleFlow [3] conditionally manipulates the images using Continuous Normalizing Flows (CNF). Yao et al. [38] propose a latent transformation module to generate adaptive directions for different images. Wang et al. [36] utilize a CNN encoder to provide multi-scale features to supplement the style vector, which effectively adapts directions at different locations.

However, previous works only deal with label-based editing. Collins et al. [8] apply k-means clustering on features to obtain channel-wise masks, determining which channels are locally semantics-aware. The cluster memberships of the reference then guide local attribute editing for the source. Different from the above works, our method operates strictly in W+, and realizes both label- and reference-based editing.

Figure 2: The overall framework of the Style Transformer for image inversion. We build a multi-stage transformer-based model to update the code in the W+ space. Details within the transformer block are depicted on the right. Each block has multi-head self- and cross-attention, following the common routine of transformer models.

3 Framework of Style Transformer

We aim to achieve accurate image inversion for StyleGAN with our proposed style transformer in the W+ space. Given a real image x, our model specifies N different style vectors denoted by w_k, where k is the index of the vector injected into the corresponding stage of the generator G. For simplicity, we use w without any index to represent all w_k. Note that in StyleGAN2, w_k is first projected by the affine layer A, and then affects the corresponding layers by modulating the convolution kernels.

Fig. 2 illustrates the overview of the proposed framework. The input image x is encoded by E, generating a series of image features F1 to F3 at multiple resolutions [30]. The N different queries, output from an MLP, access these features through the transformer blocks in a sequential way, forming the final code w for the generator. The initial input z to the MLP is also learnable, and it is gradually updated into w, which is suitable for inverting x. By training all parameters, including the transformer blocks, the encoder E, the MLP and the initial z, the pretrained G can utilize the final w to reconstruct the input x with minimal distortion.
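To make the data flow concrete, below is a minimal PyTorch sketch of this pipeline. It is an illustration under assumptions, not the authors' released code: the class and argument names (StyleTransformerInverter, encoder, mlp, blocks, generator), the feature list F1-F3, and the N = 18 queries of dimension 512 are taken from the surrounding text or assumed.

import torch
import torch.nn as nn

class StyleTransformerInverter(nn.Module):
    """Hypothetical sketch: multi-scale CNN features drive N learnable style queries."""
    def __init__(self, encoder, mlp, blocks, generator, num_queries=18, dim=512):
        super().__init__()
        self.encoder = encoder               # two-pyramid CNN encoder E (multi-scale features)
        self.mlp = mlp                       # pretrained StyleGAN mapping network, finetuned
        self.blocks = nn.ModuleList(blocks)  # one transformer block per feature scale
        self.generator = generator           # fixed StyleGAN2 synthesis network
        # learnable initial codes z, sampled from a standard Gaussian
        self.z_init = nn.Parameter(torch.randn(num_queries, dim))

    def forward(self, x):
        feats = self.encoder(x)                        # e.g. [F1, F2, F3], coarse to fine
        w = self.mlp(self.z_init)                      # initial queries in W space, (N, 512)
        w = w.unsqueeze(0).repeat(x.size(0), 1, 1)     # broadcast over the batch
        for block, f in zip(self.blocks, feats):
            w = block(w, f)                            # self- + cross-attention update
        return self.generator(w), w                    # inverted image and its W+ code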

3.1 Style Transformer Block

The style transformer block is the key component for image inversion. The same structure is applied 3 times in the model, exploiting image details from F1 to F3, respectively. The specific design within the block is shown on the right of Fig. 2. Basically, there are two types of attention, multi-head self-attention and cross-attention. In addition, we follow the common design routines for transformers, incorporating residual connections, normalization and an FFN module into the block.

Style query initialization. Given a single style code w, a high-fidelity image can be synthesized by the StyleGAN generator. However, the W+ space needs N different style vectors to reconstruct one image, and they essentially describe details at different scales; they are therefore employed to affect features of different resolutions in the generator. A common choice in transformer decoders is to randomly initialize the beginning query tokens and keep them as learnable parameters of the model. However, considering that the code distribution in the W+ space is complex and far from a Gaussian prior, we utilize the pretrained MLP in StyleGAN to first map each individual code z_k to the beginning style query w_k in the W space, and then update it through the self- and cross-attention operations. Note that z_k is sampled from a standard Gaussian and set as a model parameter. Moreover, the pretrained MLP is finetuned during training.

Multi-Head Self-Attention. The self-attention is performed among the different query tokens w_k. It intends to find the potential relation between any pair of them, and route the values to connect them. We denote the set of all w_k as W. The query Q, key K and value V are all projected from W according to Eq. 1. Note that W^Q, W^K and W^V are learnable projection heads in the self-attention module, which do not change the feature dimension.

Q = W W^Q,   K = W W^K,   V = W W^V    (1)

The multi-head attention operation is formulated as in Eq. 2, where Q_i, K_i and V_i are the query, key and value in the i-th head, and head_i is the result from that head. The feature dimension of each head is d, and h is the number of attention heads.

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d)) V_i,   i = 1, ..., h    (2)

The final update on W from the self-attention is given in Eq. 3. W^O is also learnable, and is responsible for fusing the results from the different heads.

W = Concat(head_1, ..., head_h) W^O + W    (3)

Multi-Head Cross-Attention. The self-attention does not involve any image features in its computation. Therefore, we further design the cross-attention for the inversion task, so that the query tokens can obtain information from the image features F1, F2 and F3 at different resolutions. In the cross-attention, the keys and values are projected from the image feature F at the current resolution, given by the encoder E, while the queries are computed by a linear projection on the previous results from the self-attention block. Particularly, we obtain the query, key and value according to Eq. 4, where W^Q, W^K and W^V share similar settings with the self-attention.

Q = W W^Q,   K = F W^K,   V = F W^V    (4)

The multi-head cross-attention is carried out in the same way as shown in Eq. 2 and Eq. 3. After that, the updated query tokens are given to an FFN for refinement, and the results are further passed to the transformer block in the next stage, mining details from finer-resolution features.
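As a rough illustration of one block, the following PyTorch sketch stacks the self-attention, cross-attention and FFN with residual connections and normalization. The layer placement, hidden sizes and the use of nn.MultiheadAttention are assumptions made for readability, not the authors' implementation.

import torch
import torch.nn as nn

class StyleTransformerBlock(nn.Module):
    """Hypothetical block: self-attention over queries, then cross-attention to image features."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))

    def forward(self, w, feat):
        # w: (B, N, 512) style queries; feat: (B, HW, 512) flattened image features
        sa, _ = self.self_attn(w, w, w)           # Eq. 1-3: queries attend to each other
        w = self.norm1(w + sa)
        ca, _ = self.cross_attn(w, feat, feat)    # Eq. 4: queries gather image details
        w = self.norm2(w + ca)
        return self.norm3(w + self.ffn(w))        # FFN refinement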

3.2 Training Objectives for Image Inversion

During training, the backbone of StyleGAN (including the affine layer A) is strictly fixed. The gradients from the loss only tune the other parameters. Note that we use the same training objectives as pSp [30]. Particularly, to give an accurate reconstruction, the L2 loss between the input x and its inverted version G(w) is calculated. Meanwhile, LPIPS [42], a perceptual similarity metric computed on deep network features, is also adopted, specifying another objective. Additionally, to keep the identity of the inverted image, we incorporate a pretrained ArcFace model [9] for the ID loss, so that the cosine similarity between the embeddings of x and G(w) is maximized. Notice that we do not adopt any adversarial loss during training.
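A hedged sketch of this objective is given below, assuming the third-party lpips package for the perceptual term and a generic pretrained face-embedding network id_net standing in for ArcFace; the loss weights are placeholders rather than the paper's values.

import torch
import torch.nn.functional as F
import lpips  # pip install lpips; perceptual similarity metric

lpips_loss = lpips.LPIPS(net='alex')  # assumption: AlexNet backbone; images in [-1, 1]

def inversion_loss(x, x_inv, id_net, w_l2=1.0, w_lpips=0.8, w_id=0.1):
    """x: real image, x_inv: G(w); id_net: pretrained face-embedding model (e.g. ArcFace)."""
    loss_l2 = F.mse_loss(x_inv, x)
    loss_lpips = lpips_loss(x_inv, x).mean()
    # ID loss: maximize the cosine similarity of identity embeddings
    e_x, e_inv = id_net(x), id_net(x_inv)
    loss_id = 1.0 - F.cosine_similarity(e_inv, e_x, dim=-1).mean()
    return w_l2 * loss_l2 + w_lpips * loss_lpips + w_id * loss_id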

Figure 3: Reference-based editing module and its training strategy. The inverted codes w_s and w_r, from the source and reference, are given to the transformer module T, which specifies the code w_e for the edited image. C is an attribute classifier in the W+ space, by which we constrain the edited attribute to be similar to the reference, while the others stay the same as the source.

4 Image Editing in Style Transformer

Image editing with a fixed StyleGAN is an important application not only in itself, but also for evaluating the quality of image inversion. Low distortion is only one aspect; flexible and high-fidelity editing is also important. As described in [7, 14], there are two types of editing, either through a target label or through a reference image in the desired domain. Previous works [31, 37, 38] focus on the former, but few works deal with reference-based editing, which potentially provides diverse results.

Typically, given the inverted style code w_s for the source x_s and a desired target attribute, we need to determine an offset, so that the shifted code can be mapped to an edited image with the desired attribute different from x_s, while keeping the content of x_s. In reference-based editing, another image x_r is given as an extra input. Since our style transformer can invert images with negligible distortion, we train a latent classifier C for binary attributes in the W+ space to guide the editing, as in [38]. Concretely, given a code w inverted from an image, the classifier computes several embedding features f_i(w) corresponding to the i-th attribute, and the final logits for the BCE loss L_cls. During editing, C is fixed to evaluate the edited code w_e.

4.1 Reference-based Editing

Module design. We design a simple module T to translate a particular attribute according to the inverted code w_r from the reference x_r. Since both w_s and w_r represent their images with almost no distortion, these codes contain enough information about the edited attribute. T should therefore be able to specify w_e based on w_s and w_r. Again, a cross-attention structure is chosen, as shown in Fig. 3. The tokens of w_s are used as a series of queries, while the keys and values are projected from w_r. According to [37, 31], some local attributes depend only on a single w_k at a particular resolution in W+. So we choose a routing scheme from [24] that differs from Eq. 2. The idea is to normalize the attention matrix over queries rather than keys by the softmax, and then re-normalize it over keys by its row sum, as in Eq. 5. This strategy assigns the value features to queries in a unique way, so that a value token from w_r affects only a few tokens in w_s.

A_jk = exp(M_jk) / sum_j' exp(M_j'k),   A'_jk = A_jk / sum_k' A_jk',   where M = Q K^T / sqrt(d)    (5)
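The routing scheme can be sketched as follows; the tensor shapes and the eps constant are assumptions, and the function only illustrates the normalization order described above (softmax over queries, then re-normalization over keys).

import torch

def routing_cross_attention(q, k, v, eps=1e-8):
    """q: (B, Nq, d) source queries; k, v: (B, Nk, d) projected from the reference code.
    Softmax is taken over the query axis, then each row is re-normalized over keys (Eq. 5),
    so a reference token is routed to only a few source tokens."""
    d = q.size(-1)
    logits = torch.einsum('bqd,bkd->bqk', q, k) / d ** 0.5
    attn = logits.softmax(dim=1)                          # normalize over queries
    attn = attn / (attn.sum(dim=2, keepdim=True) + eps)   # re-normalize over keys
    return torch.einsum('bqk,bkd->bqd', attn, v)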

Loss designs. To guarantee the i-th attribute editing results, we design the following loss terms to train the projection heads in T. Particularly, we constrain the edited code w_e by the similarity loss L_sim, as shown in Eq. 6:

L_sim = || f_i(w_e) - f_i(w_r) ||_2    (6)

Here f_i is the i-th attribute embedding from the pretrained latent classifier C. L_sim ensures the edited attribute to be similar to that of w_r. At the same time, the other attributes, denoted by j != i, should stay close to the source w_s, giving L_pres in Eq. 7:

L_pres = sum_{j != i} || f_j(w_e) - f_j(w_s) ||_2    (7)

Finally, we regularize || w_e - w_s ||_2, so the edited image does not change too much.
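A possible training-loss sketch for this module is shown below, assuming the latent classifier exposes one embedding function per attribute; the weighting factors and function names are illustrative only.

import torch

def ref_edit_loss(w_e, w_s, w_r, attr_embeds, i, lam_pres=1.0, lam_reg=0.1):
    """w_e / w_s / w_r: edited / source / reference codes of shape (B, 18, 512);
    attr_embeds: list of per-attribute embedding functions f_j from the latent classifier C;
    i: index of the edited attribute."""
    # Eq. 6: the edited attribute should match the reference
    loss_sim = (attr_embeds[i](w_e) - attr_embeds[i](w_r)).pow(2).mean()
    # Eq. 7: all other attributes should stay close to the source
    loss_pres = sum((f(w_e) - f(w_s)).pow(2).mean()
                    for j, f in enumerate(attr_embeds) if j != i)
    # regularize the code change itself
    loss_reg = (w_e - w_s).pow(2).mean()
    return loss_sim + lam_pres * loss_pres + lam_reg * loss_reg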

4.2 Label-based Editing

Compared to reference-based editing, label-based editing is relatively easy. So we adopt an encoder-free method to edit w based on the latent classifier C. We emphasize that for each w, there should be a unique direction for the i-th attribute editing, which is determined by the gradient back-propagated from the classifier C. Note that the first-order gradient on w is grad_w L_cls(C(w), y_t), and the editing direction is its normalized negative. Here y_t is the target label, and C(w) is the logits after the sigmoid.

We also investigate a method based on the second-order derivative of L_cls, i.e., the Hessian matrix H. Similar to [26], a randomly sampled unit vector u is first obtained, and then scaled by a small number xi. Then we evaluate the Hessian-vector product Hu by Eq. 8. According to the power iteration, Hu converges to the dominant eigenvector, so we let the editing direction be the normalized Hu.

H u ~= ( grad_w L_cls(w + xi u) - grad_w L_cls(w) ) / xi    (8)
Figure 4: Qualitative results of image inversion. Our method is compared with pSp and e4e. Besides the inverted image, we also list images edited by InterFaceGAN [31] for faces. For cars, we use the directions provided in GANSpace [11] for editing.
Domain  Method  Inversion                       Editing           Model Size
                MSE     LPIPS   FID     SWD     FID     SWD       Params(M)  FLOPs(G)  Time(s)
Face    pSp     0.037   0.169   31.52   15.07   46.64   29.05     267.3      72.55     0.0668
        e4e     0.050   0.209   36.16   17.25   47.45   25.10     267.3      72.55     0.0659
        Ours    0.036   0.166   28.31   14.00   40.57   23.21     40.6       36.37     0.0436
Car     pSp     0.115   0.298   17.24   19.76   27.25   36.01     238.0      66.11     0.0565
        e4e     0.110   0.314   14.68   18.25   21.50   27.57     238.0      66.11     0.0541
        Ours    0.089   0.245   13.58   16.14   21.24   25.28     40.6       36.34     0.0435
Table 1: Quantitative comparison of different inversion methods. To account for the distortion-editability trade-off, we also list metrics for image editing to give a comprehensive evaluation. We further list the parameters and FLOPs of the three methods; Time is the inference time of one iteration.
Figure 5: Qualitative comparison of different label-based editing methods. We list the results for editing "Gender", "Smile" and "Bangs", and compare them with [31, 37]. Note that we evaluate both the first- and second-order editing methods proposed in Sec. 4.2.

5 Experiments

5.1 Implementation Details

All experiments are implemented on StyleGAN2 [18] pretrained on the FFHQ [17] and LSUN Cars [39] datasets. We build our model on the pSp encoder for multi-scale image features. For the face domain, we train the inversion model on the FFHQ dataset and evaluate on the CelebA-HQ [16] test set. For the car domain, the inversion model is trained and evaluated on the Stanford Cars [21] dataset. The synthesis network of StyleGAN2 is fixed and all other parameters in our model are trainable.

5.2 Inversion Results

We compare our model with pSp [30] and e4e [33], two state-of-the-art encoder-based inversion methods. Qualitative and quantitative results are shown in Fig. 4 and Tab. 1. Our model is validated in three aspects: the perceptual quality of inversion, the editing ability, and the model size. MSE and LPIPS evaluate the pixel-level and perceptual similarity between input and inverted images. FID [13] and SWD [28] measure the distance between the distributions of real and generated images, indicating the visual quality of the generated images. To compare the editing ability of the three methods, we adopt InterFaceGAN [31] in the face domain to edit the latent codes generated by each method. For the car domain, we apply GANSpace [11] to find the semantic directions. The metrics are averaged over the editing results of the whole test set. Our model outperforms the others on inversion and shows higher editability. Moreover, we list the parameters, FLOPs and inference time of the three methods in Tab. 1. Compared with the ConvNet mapping, the transformer used in our model has only 18 or 16 tokens for the face and car domains, hence it is lightweight and efficient.

Method         Gender          Smile           Bangs
               FID    SWD      FID    SWD      FID    SWD
InterFaceGAN   48.72  19.43    40.03  18.94    44.01  29.41
StyleSpace     37.31  17.31    34.72  15.46    42.91  20.96
Ours-1         38.73  17.83    33.50  14.89    41.15  19.30
Ours-2         34.84  16.14    32.88  15.23    40.14  18.53
Table 2: Quantitative comparison of label-based editing on three attributes, corresponding to Fig. 5.
Figure 6: Mean-AD results of label-based editing on three attributes, compared with [31, 37]; lower is better. Ours-1 and Ours-2 represent our first- and second-order methods, respectively.
Figure 7: Reference-based editing results. Given a pair of source and reference images, we first utilize the proposed method to find their inverted codes in W+. Then the transformer block described in Sec. 4.1 is used to take the "bangs", "mouth" and "gender" styles from the reference code and apply them to the source.

5.3 Editing Results

We apply reference- and label-based editing on the CelebA-HQ dataset, in which each image is labeled with 40 facial attributes. We invert the images to latent codes using our pretrained inversion model, and train a 40-attribute latent classifier on them. The latent classifier consists of 4 fully-connected layers, with an independent branch for each attribute before the prediction, leading to an independent embedding feature per attribute.
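One plausible realization of such a classifier is sketched below; the hidden width and the trunk/branch split are assumptions chosen to be consistent with the description (4 fully-connected layers, one independent branch per attribute), not the exact architecture used in the paper.

import torch
import torch.nn as nn

class LatentClassifier(nn.Module):
    """Hypothetical 40-attribute classifier on flattened W+ codes (18 * 512 = 9216 dims)."""
    def __init__(self, num_attrs=40, in_dim=18 * 512, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(                  # shared layers
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # one independent branch per attribute -> independent embedding features f_i(w)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(num_attrs))
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_attrs))

    def forward(self, w):
        h = self.trunk(w.flatten(1))
        embeds = [b(h) for b in self.branches]       # per-attribute embeddings, used for editing losses
        logits = torch.cat([head(e) for head, e in zip(self.heads, embeds)], dim=1)
        return torch.sigmoid(logits), embeds         # probabilities for the BCE loss, plus embeddings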

Label-based Editing. We first apply our pretrained inversion model to obtain the latent codes of the images, and use the first- and second-order methods to edit the images towards the target attributes. Desirable results can be generated by only ONE iteration. We evaluate the first- and second-order methods illustrated in Sec. 4.2 and compare our results with InterFaceGAN [31] and StyleSpace [37]. Qualitative results and metrics are shown in Fig. 5 and Tab. 2. Moreover, we measure the disentanglement of attributes by calculating the Attribute Dependency (AD) [37], which indicates the degree of change in other attributes while editing one attribute. A multi-branch attribute classifier based on ResNet-50 [12] is applied to obtain the predicted logits of the images. We measure the logit changes between the input and edited images, and normalize them by the standard deviation computed from the logits of numerous generated images. For a target attribute i, we calculate the mean-AD over the other attributes as mean_{j != i}(|dl_j| / sigma_j), where j is the index of the fixed attributes. Fig. 6 illustrates the mean-AD over the degree of target attribute change dl_i / sigma_i. Our methods perform better than InterFaceGAN and StyleSpace, and the second-order method shows higher disentanglement.
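For clarity, a minimal sketch of the mean-AD computation under this definition follows; the function and argument names are hypothetical.

import torch

def mean_attribute_dependency(logits_orig, logits_edit, sigma, target_idx):
    """logits_orig / logits_edit: (B, 40) attribute-classifier logits for input / edited images;
    sigma: (40,) std of logits over many generated images; target_idx: edited attribute i.
    Returns mean over j != i of E[ |dl_j| / sigma_j ]."""
    delta = (logits_edit - logits_orig).abs() / sigma   # normalized per-attribute change
    mask = torch.ones(delta.size(1), dtype=torch.bool)
    mask[target_idx] = False                            # exclude the edited attribute
    return delta[:, mask].mean().item()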

Method         Quality (%)             Disentanglement (%)
               BA     GE     GO        BA     GE     GO
InterFaceGAN   15.00  7.50   9.17      11.67  1.67   8.33
StyleSpace     10.83  10.00  13.33     18.33  15.00  10.83
Ours-1         25.83  39.17  31.67     35.83  34.17  30.00
Ours-2         48.33  43.33  47.50     34.17  49.17  49.17
Table 3: User study of label-based editing compared with [31], [37]. BA, GE and GO represent ‘Bangs’, ‘Gender’ and ‘Goatee’ attributes.

Considering human judgements, we further conduct a user study. We ask 60 volunteers to evaluate the methods in two aspects: image quality and disentanglement. Results are shown in Tab. 3. The detailed algorithms of the first- and second-order methods are provided in Appendix B.

Reference-based Editing. The reference-based editing module is trained for each attribute individually. To ensure that the module takes the style from the reference image, we randomly divide the training images into source and reference sets, instead of depending on the labels. We train the module on three attributes, and qualitative results are shown in Fig. 7. The edited images take the relevant attributes from the different reference images, and show a similar style on the translated attribute. Note that the reference-based editing module is trained only in the latent space, resulting in less diversity than editing directly on images. However, different from the optimization-based method [8], our model applies commonly to all images, and is lightweight and more flexible.

Method         MSE     LPIPS   Params(M)  FLOPs(G)  Time(s)
pSp            0.0373  0.1693  267.3      72.55     0.0668
Ours w/o self  0.0369  0.1716  37.3       36.31     0.0429
Ours full      0.0363  0.1665  40.6       36.37     0.0436
Table 4: Ablations of the transformer structure. Time is the inference time of one iteration. The best results are indicated in bold.

5.4 Ablations and Analysis

We further validate the benefit of the transformer by comparing pSp [30], our full model with both self- and cross-attention, and ours without self-attention in Tab. 4. pSp [30] maps image features to W+ through individual mapping networks; although it obtains the image features directly and completely, the relations between the individual codes are not tight enough. In our model, cross-attention is necessary to update the queries by fusing image features, and self-attention is also important for constructing the potential relations between the queries.

6 Limitations

We now discuss the limitations that we are aware of in our work. First, for the inversion task, although our proposed method achieves improved reconstruction quality, there are still some differences between the input and reconstructed images, especially for out-of-domain inputs. We think this is mainly caused by the limited representation capacity of the W+ space. As described in [36], the distortion can be significantly reduced by adding more information from the source. Moreover, since we apply multi-head attention, training is slower due to the additional matrix multiplications. Second, for the reference-based editing task, we adopt a transformer-based module in the latent space, resulting in less diversity for some attributes than editing directly on images, where the mode seeking loss [25] can encourage diversity in the pixel domain. But our method is lightweight and more flexible.

7 Conclusion

This paper presents a transformer-based image inversion and editing method for StyleGAN. We choose the W+ space to represent real images, which requires determining multiple style codes for different layers of the generator. To effectively exploit information from the input image, we design a multi-stage transformer module, which mainly consists of self- and cross-attention. In the initial stage, the MLP maps a set of learnable noise vectors into codes in W, and then they are iteratively updated by the two types of attention operations, so the codes from the final stage can reconstruct the input accurately. Based on them, we are able to carry out label- and reference-based editing in a flexible way. Given a required label, an encoder-free strategy is employed to find the unique editing vector according to the gradient from a pretrained latent classifier. Meanwhile, given a reference code, a transformer block is trained to edit the source, so that the result takes the relevant style from the reference. Experiments show that the proposed image inversion and editing method achieves lower distortion and higher quality at the same time.

References

Appendix A Training Details

We adopt a pretrained StyleGAN2 [18] generator in our experiments, in which the synthesis network is fixed and the mapping network (MLP) is trained. In the multi-head attention of the transformer block, the number of heads is set to 4, and the dimension of each head is 512. For inversion task, the Ranger optimizer is used in training, which is a combination of Rectified Adam [22] with the Lookahead technique [41]. We train the model for iterations with a batch size of 8, the learning rate is set to . For the reference-based editing task, we use the Adam [20] optimizer to train the model for iterations with a batch size of 8, the learning rate is set to . All experiments are implemented on 2 NVIDIA RTX 2080Ti GPUs.

Appendix B Label-based Editing Methods

We propose first- and second-order label-based editing methods in the main text. To give a detailed explanation, we provide pseudo code in PyTorch style.

Algorithm 1 and Algorithm 2 illustrate the first- and second-order methods, respectively. Moreover, we measure the disentanglement of five attributes by Re-scoring [31] in Fig. 8. The top row lists the edited attributes, and the scores are the changes in classification logits between the original and edited images.

Figure 8: Re-scoring results of label-based editing on five attributes. Ours-1 and Ours-2 represent our first- and second-order methods, respectively.

Appendix C More Results

In this section, we provide more results of inversion, label-based editing and reference-based editing in Fig. 9, Fig. 10 and Fig. 11.

# w: input latent code (18, 512)
# C: latent classifier, outputs probabilities after sigmoid
# y_t: target label

w = w.detach().requires_grad_(True)  # enable gradients w.r.t. the code
predicted = C(w)
loss = torch.nn.functional.binary_cross_entropy(predicted, y_t)
loss.backward()
direct = w.grad
direct = direct / torch.norm(direct, dim=1, keepdim=True)
w_edit = w - alpha * direct  # alpha is a scaling factor.
Algorithm 1 First-order Label-based Editing
# w: input latent code (18, 512)
# C: latent classifier, outputs probabilities after sigmoid
# y_t: target label

r_d = torch.randn(18, 512)                             # random direction u
w_d = (w + kasi * r_d).detach().requires_grad_(True)   # kasi is a small number, we set it to 10e-4.
w_0 = w.clone().detach().requires_grad_(True)

predicted_d = C(w_d)
loss = torch.nn.functional.binary_cross_entropy(predicted_d, y_t)
loss.backward()
direct_d = w_d.grad                                    # gradient at the perturbed code

C.zero_grad()
predicted_0 = C(w_0)
loss = torch.nn.functional.binary_cross_entropy(predicted_0, y_t)
loss.backward()
direct_0 = w_0.grad                                    # gradient at the original code

direct = direct_d - direct_0                           # ~ kasi * H u, the Hessian-vector product of Eq. 8
direct = direct / torch.norm(direct, dim=1, keepdim=True)
w_edit = w - alpha * direct                            # alpha is a scaling factor.
Algorithm 2 Second-order Label-based Editing
Figure 9: More results of inversion compared with [30] and [33].
Figure 10: More results of label-based editing on five attributes.
Figure 11: More results of reference-based editing on three attributes. The edited images take the style of Bangs, Mouth and Gender from different reference images.