LSC-GAN: Latent Style Code Modeling for Continuous Image-to-image Translation

10/11/2021
by   Qiusheng Huang, et al.

Image-to-image (I2I) translation is usually carried out among discrete domains. However, image domains often correspond to a physical value and are therefore continuous: images change gradually with that value, and there is no obvious gap between different domains. This paper builds a model for I2I translation among continuously varying domains. We first divide the whole domain coverage into discrete intervals and explicitly model the latent style code for the center of each interval. To deal with continuous translation, we design editing modules that change the latent style code along two directions. These editing modules help constrain the codes of the domain centers during training, so that the model can better understand the relation among them. To obtain diverse results, the latent style code is further diversified with either random noise or features from a reference image, giving an individual style code to the decoder for label-based or reference-based synthesis. Extensive experiments on age and viewing-angle translation show that the proposed method achieves high-quality results and is also flexible for users.


1 Introduction

Generative models, particularly Generative Adversarial Networks (GANs) [4, 9, 31], have developed rapidly in recent years. By mimicking the data distribution through the adversarial training of two deep Convolutional Neural Networks (CNNs), namely the generator G and the discriminator D, they can synthesize high-resolution, realistic images. G maps a randomly sampled noise vector to a color image, which is inspected by D so that the output distribution aligns with the real data. Conditional GANs [30, 32] synthesize images that fulfill the requirement of a given label. Usually, the label indicates the discrete type of the image and is encoded as a one-hot vector. It is adopted by both G and D, so that G can learn the conditional data distribution of a specific class.
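As a concrete illustration of this conditioning mechanism (not taken from any cited implementation; layer sizes and dimensions below are assumptions for the sketch), the one-hot label can simply be concatenated with the noise vector for G and with the flattened image for D:

```python
# Minimal conditional-GAN sketch: the one-hot label c is concatenated with the
# inputs of both G and D, so G learns the class-conditional distribution.
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=64, n_classes=10, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, c_onehot):
        return self.net(torch.cat([z, c_onehot], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self, n_classes=10, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, x_flat, c_onehot):
        return self.net(torch.cat([x_flat, c_onehot], dim=1))
```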

Due to the success of GAN and cGAN, direct image-to-image (I2I) translation across different domains can be accomplished by the generator G of a cGAN. It aims to learn a mapping function that changes a source-domain input image into the target domain. The result should keep the content of the input while satisfying the target-domain requirement. In practice, labels reflecting a particular attribute are often not discrete; they are more appropriately described by a continuous value that varies within a certain range, e.g., the age or the viewing angle of a face. Such labels define continuous image domains, which require G to be controllable so that its output changes naturally across the domains. In image generation, GANs [18, 19] usually achieve this through "morphing": by interpolating between two input noise (or label) vectors, the intermediate results change in a smooth way, showing that G has a strong ability to map a noise vector to an image. However, neither the quality nor the domain conformity of the intermediate results can be ensured.
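The "morphing" procedure mentioned above amounts to linear interpolation in the latent space; a minimal sketch (G here stands for any pretrained generator, an assumption of this example) is:

```python
# Linearly interpolate between two latent codes and decode each point,
# producing a smooth sequence of intermediate images.
import torch

def morph(G, z_a, z_b, steps=8):
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1.0 - t) * z_a + t * z_b  # intermediate latent code
        images.append(G(z_t))
    return images
```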

In I2I translation, existing techniques lack special designs for dealing with continuous domains. CycleGAN [48], StarGAN [6] and StarGAN-v2 [7] all focus on translations among discrete domains, defined by either a single attribute or multiple attributes. Although these models can still perform I2I translation by simply dividing the whole domain range into several discrete intervals, the relation among different domains is ignored during training, and the domain conformity of the result is not accurate enough. Moreover, CycleGAN and StarGAN are label-based, translating the source image only according to a provided target label, so they cannot output diverse syntheses. StarGAN-v2 supports both label- and reference-based translation and synthesizes different results for the same source in the target domain. But these results only show attribute-irrelevant diversity, such as different lightness and backgrounds. It cannot slightly modify the desired attribute within the same domain, e.g., changing the age from 5 to under 3. In summary, these models lack fine control for continuous domains.

This paper aims at continuous I2I translation, i.e., editing the facial age or viewing angle. We intentionally consider the relation among the predefined discrete domains. Similar to StarGAN-v2, we achieve both label- and reference-based synthesis, and the model outputs diverse images, all fulfilling the target-domain requirement. To accurately capture the characteristics of each domain interval, we first explicitly build representative style codes for the domain centers. They are then encoded together with random noise or a reference image to give diverse and specific styles. To make the model understand the domain relations, we design two types of latent code editing modules, which change the code along two opposite directions (e.g., the young and old directions), so that the style code of one domain can be translated into its two neighbouring domains. These modules help constrain the distance among domain centers during training. Moreover, at inference they are used to edit the style code, producing consistent change in a specific direction. Note that with these modules, the latent style code can even go beyond the initial domain coverage, so the model becomes flexible for the user during inference.

The contributions of this paper lie in the following aspects.

  • We propose a generic I2I translation model that is particularly suitable for continuous domains. By modeling the latent style code for each domain center, our method preserves the common domain styles well. They are further diversified by either noises or reference images, which gives the corresponding label- or reference-based translations, respectively.

  • We design latent style editing modules to convert the code of a domain center into other domains, and set up a relation constraint among them. The modules can further be adopted to edit the image consistently in one direction during the inference stage.

  • Extensive experiments are performed on two different datasets with continuous domain labels, FFHQ-Aging and Multi-PIE. On the former, our model translates the age of a face into an arbitrary age group; on the latter, it edits the viewing angle of a face. Both show the superiority of the proposed model.

Fig. 2: t-SNE visualization of the latent style codes. After training, we have several domain centers. Together with different sampling noises, they are encoded into diversified style codes; here we sample several noise vectors, leading to the same amount of codes for each domain. Within a domain, red arrows indicate different directions of variation, representing the intra- and inter-domain diversity, respectively. We calculate the average of all the codes in each domain and display it as a black dot.
TABLE I: Comparisons with approaches in I2I translation. Methods are compared on six properties: latent-guided synthesis, reference-guided synthesis, intra-domain diversity, inter-domain diversity, continuous mapping from domain to domain, and going beyond the domain limit. The compared methods are StarGAN-v2 [7], DivCo [28], DRIT++ [25], CVAE-GAN [3], DLOW [8], LIFE [33], StarGAN [6], STGAN [26], SAM [1], IPCGAN [40], VI-GAN [42], CD-VAE [44] and ours; only the proposed method covers all six properties.

2 Related Work

2.1 I2I translations between two domains

The topic of I2I translation has drawn great attention from researchers in recent years. It was first proposed in Pix2Pix [17] and extended to Pix2PixHD [39] for translating high-resolution images. The generator G of both models is built as an auto-encoder, with the first half reducing the spatial resolution and encoding the content of the source-domain image, and the second half enlarging the size and translating it into the target image. These models need paired data during training, which means the output from G is directly constrained by a pixel-level distance to the target image. CycleGAN [48], DualGAN [43] and DiscoGAN [20] relax the assumption to unpaired data. They build a generator for each of the two translation directions and set up a cycle consistency between them. Note that these models only give single-modal results, and they implicitly edit the image by the target domain label.

To obtain diverse results, one way is to change the backbone G to a VAE [22, 3, 47, 44], in which the latent code from the encoder is sampled from a posterior distribution determined by the input. Another solution is to incorporate a reference image in G; the same content can then be combined with different references, leading to various results [27, 16, 24, 25]. Note that the synthesis needs to show a similar style to the reference. UNIT [27] uses two VAEs, mapping images from different domains to a shared probabilistic space; the decoders then directly translate the sampled latent code into either the original or the other domain. MUNIT [16] and DRIT [25] explicitly disentangle the content and style codes to promote diverse styles. They add an extra encoder to specify a style code and inject it into G by AdaIN [15].

In addition, the training loss is also important for diverse results. MS-GAN [29] and DivCo [28] design losses to prevent mode collapse in image generation. The former directly maximizes the ratio of the distance in pixel space to the distance in latent space. The latter adopts a similar idea through contrastive learning, constraining two close latent codes to produce similar visual results.

2.2 I2I translations among multiple domains

Previous models mainly deal with two domains. In practice, it is sometimes necessary to convert images among multiple domains. StarGAN [6] is the first work that achieves translation among multiple domains with a single generator. It takes the target label and the original image as two inputs and outputs the label-based synthesis, which is inspected by a discriminator. Similar to [48], it also adopts a cycle consistency to ensure the output takes the content from the input. Instead of the target label, AttGAN [13] employs the label difference as the input and changes the image towards the target domain. STGAN [26] designs a module to iteratively process the encoder features from different layers and inject them into the multiple layers of the decoder. All these models realize the translation according to the specified label. They are deterministic, so they cannot give diverse results.

A common solution for diversity is to bring noise into the generator. SMIT [36] extends StarGAN [6] by concatenating a random noise vector with the target label and then mapping them into a style vector, which influences the feature statistics in the generator. To prevent the noise from being ignored, the model parameters for processing the noise are not trained in SMIT. As in the two-domain task, adding an extra image as a reference is another option to diversify the synthesis. ELEGANT [41] exchanges and mixes the latent codes from two images. It consists of only a pair of connected encoder-decoders, which can either reconstruct the input image or edit the specified attribute towards the target domain. HomoGAN [5] learns interpolators between the two latent codes of a source and a target image, then maps the interpolated codes into images so that they exhibit a gradual change from the source to the target. Note that although reference-based models can give diverse results, they usually have lower quality than label-based models.

To achieve high-quality and diverse translations, StarGAN-v2 combines label- and reference-based synthesis in a single model, with a careful design of the training loss. Since StarGAN-v2 is the current state-of-the-art model for I2I translation, we build our proposed model on top of it and give a brief introduction to it in the next section.

2.3 Continuous domain I2I translation

The above works deal with discrete domains. However, I2I translation for continuous domains has more applications. Here we focus on facial age and viewing-angle translation. Most works on age editing [46, 40, 12] divide the whole range into domain intervals but lack a special design for modeling the relation among them. Moreover, they can only translate according to the target label, so the results lack diversity. Another problem is that these models are trained on constrained data (e.g., frontal view and similar backgrounds), so they perform poorly on in-the-wild images. LIFE [33] is the first model trained on high-quality in-the-wild images. SAM [1] uses a well-trained StyleGAN to produce high-quality age-edited images, but still cannot produce diversity.

Different from age editing, many works on viewing-angle editing [42, 44] consider the continuous mapping through the geometric relation between two domains. However, they only deal with geometry-related translation and lack diverse results. DLOW [8] synthesizes transitional images between two domains, in which a random scalar reflecting the domainness is used as a pseudo label to control the distance between the intermediate and the source/target domains. Although DLOW achieves continuous I2I translation, the intermediate domains are not determined by a physical value and have no true data under its assumption; therefore, the domain conformity is not guaranteed by real data. Moreover, DLOW translates by the domainness and lacks the ability for reference-based synthesis. Tab. I summarizes the comparison between the above works and our proposed approach.

Fig. 3: Overview of our method. (a) By feeding the auto-encoder G with the two different types of style codes (label- and reference-based), translated images in the target domain with diverse styles can be obtained. The style codes are given to the same G with shared parameters. (b) Details of the computation of the two style codes for the target domain. The latent code for each domain center is first output by the multi-branch F. It is then further diversified into the label- or reference-based style code by the module M or E, with either a sampling noise vector or the reference image as its input. (c) The code from F can be edited by the modules Y&O along two opposite directions, so the code of one domain changes into its neighbouring domains.

3 Proposed Method

3.1 Preliminaries on StarGAN-v2

Our work is based on StarGAN-v2 [7], which realizes both label- and reference-based translation. It has an auto-encoder G as the backbone of the generator, and the feature in each layer of G is affected by a style vector injected from a side branch. For label-based synthesis, StarGAN-v2 uses a mapping network F to directly project a noise into a domain style code. F adopts a multi-branch structure, and the output from the branch indicated by the label is used. For reference-based editing, it directly encodes a reference image into the style vector through an encoder E, a CNN that also has a multi-branch structure. When editing an image, the source is input into G, and the style code from F or E is injected into G by AdaIN to obtain the translated image.
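For clarity, the AdaIN-based style injection can be sketched as follows; this is a generic formulation in the spirit of [15] and StarGAN-v2, not the authors' exact code, and the (1 + gamma) offset is one common implementation choice:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Modulate instance-normalized features with a style vector s."""
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.fc = nn.Linear(style_dim, num_features * 2)  # predicts scale and shift

    def forward(self, x, s):
        gamma, beta = self.fc(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(2).unsqueeze(3)
        beta = beta.unsqueeze(2).unsqueeze(3)
        return (1 + gamma) * self.norm(x) + beta
```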

In addition, StarGAN-v2 adopts four loss functions: an adversarial loss with gradient penalty [11, 2] to make the translated image realistic; a style reconstruction loss, which reconstructs the style code of the label-based synthesis through E, so that the style of the edited image is close enough to that of the reference; a cycle consistency loss computed on pixels, which retains the domain-invariant characteristics of the source image; and a diversity-sensitive loss, which encourages stronger diversity. For shorthand, we refer to the sum of these terms as the StarGAN-v2 loss.
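In the notation of [7], this shorthand corresponds to the StarGAN-v2 objective, which can be summarized (the weights are the hyper-parameters of [7]) as

$$\mathcal{L}_{\text{StarGAN-v2}} \;=\; \mathcal{L}_{adv} \;+\; \lambda_{sty}\,\mathcal{L}_{sty} \;-\; \lambda_{ds}\,\mathcal{L}_{ds} \;+\; \lambda_{cyc}\,\mathcal{L}_{cyc}.$$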

3.2 Proposed framework

Given a real image and an arbitrary target domain, drawn from the sets of all images and all domains respectively, we want to edit the image so that it has the style of the target domain; in other words, the image is translated to that domain. Traditional I2I translation usually assumes a set of discrete domains. Here, we mainly deal with a continuous domain. In particular, the whole domain coverage is divided into discrete, non-overlapping intervals, each also treated as a domain. As the interval index increases or decreases, the domain changes consistently in one direction.

For continuous I2I translation, the result is required to change with the target domain in a consistent way, and it should also exhibit diverse intra-domain and inter-domain features. To achieve this, we first need to build different style vectors and then train the generator to use these style vectors for translation. Fig. 2 compares the latent style code distributions of StarGAN-v2 and the proposed LSC-GAN. Fig. 3 gives an overview of the major blocks in the generator and the two different pipelines for synthesis.

3.2.1 Auto-encoder G

Given a style code belonging to the target domain, the role of G is to translate the source image to that domain, giving diverse and appropriate results that conform to the target. In our design, there are two types of style vectors, for label- and reference-based synthesis respectively, shown in Fig. 3a. Both of them express the desired style of the target domain, and they are the keys to enabling G to change the source image accordingly. Note that we embed the style vector into G through AdaIN [15], which affects the feature statistics in each layer of G.

3.2.2 Mapping network F for the center of domain interval

To represent the center of each domain interval, and to set up the continuous relation among the centers, we use the network F to learn this mapping, as shown in Fig. 3b. Similar to StarGAN-v2, a multi-branch network is adopted in F. Specifically, we utilize one output branch per domain interval, which forms the representative code of that interval, and the parameters of the branches are not shared. A constant vector is given as the input of F, and the codes of all domain centers are obtained simultaneously from the branch outputs. We then select the code indicated by the target domain for further processing.
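A minimal sketch of one plausible implementation of F (the dimensions, the learned constant, and the shared trunk are our assumptions) is:

```python
import torch
import torch.nn as nn

class MappingF(nn.Module):
    """Map a constant input to one center code per domain interval."""
    def __init__(self, const_dim=64, style_dim=64, num_domains=6):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, const_dim))  # constant input (learned here)
        self.shared = nn.Sequential(nn.Linear(const_dim, 256), nn.ReLU())
        # one unshared output branch per domain interval
        self.branches = nn.ModuleList(
            [nn.Linear(256, style_dim) for _ in range(num_domains)])

    def forward(self):
        h = self.shared(self.const)
        # all domain-center codes are produced simultaneously: (1, num_domains, style_dim)
        return torch.stack([b(h) for b in self.branches], dim=1)
```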

3.2.3 Mapping network M for noise vector

The purpose of M is to diversify the style around the domain center, mapping it to the label-based style vector in Fig. 3b, where the other input is a randomly sampled noise vector. In fact, the code from F expresses the common characteristics of the domain. We then add variation to it through the noise, so that the resulting style increases the diversity of the final result while still keeping the basic characteristics of the domain. In particular, we concatenate the noise and the output from F along the channel dimension and feed them into M to generate the label-based style vector.
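A corresponding sketch of M (again with assumed dimensions) simply concatenates the noise with the selected center code and maps the result through a small MLP:

```python
import torch
import torch.nn as nn

class MappingM(nn.Module):
    """Diversify a domain-center code s_center with a sampled noise z."""
    def __init__(self, z_dim=16, style_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(z_dim + style_dim, 256), nn.ReLU(),
            nn.Linear(256, style_dim))

    def forward(self, z, s_center):
        # concatenate noise and center code along the channel dimension
        return self.mlp(torch.cat([z, s_center], dim=1))
```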

3.2.4 Style encoder E for the reference image

Similar to StarGAN-v2, our method also accepts a reference image for modeling the style, since driving the style by both labels and references greatly encourages diverse editing results. The module E in Fig. 3b is used to encode the reference input. The intermediate features of the reference are encoded together with the domain center, and they are jointly mapped into the reference style, which is further exploited by the auto-encoder G for reference-based synthesis. This style not only ensures that the translation lies in the target domain, but also makes it have a similar style to the reference. Functionally, E plays a role comparable to M.

3.2.5 Latent code editing module Y&O

To model the relation of the centers among different domain intervals, we build two editing modules Y and O to change a center code consistently along two opposite directions, as shown in Fig. 3c. They have the same structure but do not share model parameters. Given a domain code generated by the network F, the module O is expected to translate it into its neighbouring domain in one direction; that is, the output of O becomes close to the center of that neighbour. Similarly, the module Y changes the code in the opposite direction, and its output is expected to be close to the center of the other neighbour. Note that the features edited by Y&O do not participate in the generation process during training. They are only employed to constrain the center code of each domain; the losses designed for them are illustrated in Section 3.3. Moreover, Y&O can conveniently be used to translate the source, even beyond the initial domain coverage, during inference.
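The following sketch shows how such editing modules could be applied recursively; it reflects our reading of Fig. 3c (the residual MLP is an assumption), not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class EditModule(nn.Module):
    """One editing direction (Y or O): same structure, separate parameters."""
    def __init__(self, style_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, style_dim))

    def forward(self, s):
        return s + self.mlp(s)  # residual update of the style code

def shift_code(s, module, times):
    """Apply the Y or O module `times` times in a row."""
    for _ in range(times):
        s = module(s)
    return s

# O, Y = EditModule(), EditModule()
# s_older = shift_code(s_center, O, 2)  # move the code two domains in the "old" direction
```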

3.2.6 Discriminator D

Like StarGAN-v2, we adopt a multi-task discriminator to conduct adversarial training on each domain separately. Given an image and the domain it belongs to, D learns to distinguish whether the image is real or fake during training.

3.3 Training objectives

Since our network is based on StarGAN-v2, we use a similar training loss; here we only illustrate the extra terms. Let the source input and its corresponding domain label, and the reference image with its target domain label, be given, where the domain labels are integers indexing the ordered intervals, increasing or decreasing by 1 as the domain moves to the right or left neighbour. Note that the division of the domain coverage is based on the actual physical value, and the domains are organized to keep their internal order. The following three loss terms are designed, namely the continuous domain constraint, the triplet loss with adaptive margin, and the cycle continuous consistency loss, to model the center of each domain interval; the centers of the source and target domains are involved in these terms.

3.3.1 Continuous domain constraint

As assumed above, there are two types of latent editing modules, named Y and O, which translate the center code into different domains specified by the target label. Y&O perform successive editing on the code, changing it in one direction by recursively modifying the previous output. As shown in (1), editing consists of calling the module Y or O repeatedly: when the target domain has a larger index than the source, the module O is applied as many times as the index difference; when the target index is smaller, the module Y is used in the same recursive manner.

(1)

Given a pair of source and target labels, we first choose to edit the source-domain center by the module O or Y, obtaining the edited code. Then, the loss is computed according to (2).

(2)

Basically, there are three different cases. The first and second items utilize the codes edited by the O and Y modules, respectively. This loss is employed to train the parameters in O, Y, M and F, building the relation among the domain centers.
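Since the formula itself was lost in extraction, the following is only a plausible sketch consistent with the description above, in notation of our own choosing: let $s_i$ and $s_j$ be the center codes of the source and target intervals, let $O^{k}$ ($Y^{k}$) denote $k$ repeated calls of O (Y), and note that the distance measure is an assumption; we write only the two editing cases.

$$\mathcal{L}_{cd} \;=\;
\begin{cases}
\big\| O^{\,j-i}(s_i) - s_j \big\|_1, & j > i,\\[4pt]
\big\| Y^{\,i-j}(s_i) - s_j \big\|_1, & j < i.
\end{cases}$$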

3.3.2 Triplet loss with adaptive margin

We adopt a triplet loss to further ensure that the edited latent code lies in the target domain, as shown in (3). Here the edited code (from O or Y) is the anchor, the target-domain center directly computed from F is the positive, and the center of another domain is the negative. The loss computes the Euclidean distance between the anchor and the positive and compares it with the distance between the anchor and the negative.

(3)

The negative is the latent code of an arbitrary other domain, computed by F; hence the loss is determined by the hardest negative over the domain index. An adaptive margin adjusts the required distance between two domains, and we set it according to the gap between them: the greater the distance between the two domains, the greater the margin.
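The exact margin formula is not recoverable from the extracted text; a plausible hedged form of the triplet loss, with $\hat{s}_j$ the edited anchor, $s_j$ the positive, $s_{n^*}$ the hardest negative, and a margin growing with the domain gap (e.g., $m(j,n)=\alpha\,|j-n|$ for some constant $\alpha$), would be

$$\mathcal{L}_{trip} \;=\; \max\!\Big(0,\ \|\hat{s}_j - s_j\|_2 \;-\; \|\hat{s}_j - s_{n^*}\|_2 \;+\; m(j, n^*)\Big),
\qquad n^* = \arg\min_{n \neq j}\|\hat{s}_j - s_n\|_2.$$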

3.3.3 Cycle continuous consistency loss

To further emphasize the role of the modules Y&O and stabilize the latent code of each domain-interval center, we use the loss function in (4).

(4)

Here the cycle continuous consistency loss is obtained by editing the code produced by O (or Y) back in the opposite direction with Y (or O), requiring the result to return to the original center.
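With the same assumed notation as above, one plausible form of this cycle continuous consistency term is

$$\mathcal{L}_{ccc} \;=\;
\begin{cases}
\big\| Y^{\,j-i}\!\big(O^{\,j-i}(s_i)\big) - s_i \big\|_1, & j > i,\\[4pt]
\big\| O^{\,i-j}\!\big(Y^{\,i-j}(s_i)\big) - s_i \big\|_1, & j < i.
\end{cases}$$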

3.3.4 Full objective

Finally, we train M, F, Y&O, E, G and D to minimize the following objective.

(5)

where the weights of the three proposed terms are hyper-parameters, and the StarGAN-v2 loss introduced in Section 3.1 is also included.
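Under the placeholder names used in the sketches above (the symbols $\lambda_{cd}$, $\lambda_{trip}$ and $\lambda_{ccc}$ are ours, not the paper's), the full objective in (5) has the form

$$\mathcal{L}_{full} \;=\; \mathcal{L}_{\text{StarGAN-v2}} \;+\; \lambda_{cd}\,\mathcal{L}_{cd} \;+\; \lambda_{trip}\,\mathcal{L}_{trip} \;+\; \lambda_{ccc}\,\mathcal{L}_{ccc}.$$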

Fig. 4: Label-based comparison on FFHQ-Aging. The real images in red box are input into the model for translation. To show the diversity, 3 different noises are sampled for the same domain. We also compare 2 sets of results with other works.
Fig. 5: Reference-based visual results on FFHQ-Aging. The source and reference images in the first column and the first row are real images, while the rest are generated. According to the different reference images, the generated images take on the corresponding age-domain characteristics and a style similar to the reference.
Fig. 6: Reference-based comparison on FFHQ-Aging. Here is a visual comparison with other work.
Fig. 7: Beyond the domain limit. (a) Label-based results with different noises. For the right part, we show the results of calling the module Y 1–5 times in a row. For the left part, we continuously edit the latent code with the module O, even though the result in the blue box is already in the oldest domain. (b) The two faces already belong to the oldest domain. We edit them with the module O 1–10 times, which clearly shows an aging effect even beyond the maximum limit.
Fig. 8: Label-based comparisons on Multi-PIE. We randomly selected 4 samples to compare the results of label-based view translation. For each sample, the first row is the ground truth, and the rest are translated images from 6 different models.
Fig. 9: Reference-based comparison on Multi-PIE. In the first row, we randomly selected two references at each viewing angle with different lighting and identity. We compare the experimental results with StarGAN-v2.

4 Experiments

We choose two tasks, face aging and view translation, to validate the proposed model for continuous-domain I2I translation. Both tasks have continuous domains: in the former, the domains are defined by age, while in the latter, they are determined by the geometric viewing angle.

4.1 Datasets

FFHQ-Aging. According to the recent work LIFE [33], FFHQ-Aging is the best dataset so far in terms of the amount, variety and resolution of images, so we adopt it to evaluate our method. It contains 70,000 images, categorized into 10 age groups: 0–2, 3–6, 7–9, 10–14, 15–19, 20–29, 30–39, 40–49, 50–69 and 70+. Among them, 63,000 images are used as the training set, 500 as the validation set, and 6,500 as the testing set. We resize each image to 256×256. However, the confidence of the age labels is not high, which may lead to inaccurate class-conditional distribution modeling by the I2I model. In addition, to alleviate the problem of unbalanced samples in each group and to increase the amount of training images in each domain, we merge the groups into 6 continuous domains: 1 {0–2, 3–6}, 2 {7–9, 10–14}, 3 {15–19}, 4 {20–29, 30–39}, 5 {40–49, 50–69}, 6 {70+}.

Multi-PIE. We also evaluate the proposed model on Multi-PIE [10] for the task of viewing-angle translation. It contains about 130,000 images with 13 different viewing angles spanning from −90° to 90°. Each image is resized to 128×128 during training. To test our hypothesis effectively, five angles (0°, 30°, 45°, 60°, 90°) are used for training and testing. In detail, 40,000 images are used as the training set, 2,000 as the validation set, and 8,000 as the testing set.

4.2 Implementation details.

The loss weights in (5) balance the individual terms. All networks are optimized by the Adam [21] solver (β1 = 0.5, β2 = 0.999). The batch size is set to 8, and the model is trained for 70K iterations; training takes about 58 hours on a single NVIDIA GeForce RTX 3090. For the networks G and D, we adopt the same structure as StarGAN-v2. We also use the same basic residual network with IN [38] as StarGAN-v2 for M, F and E. For the proposed network Y&O, an MLP structure with residual connections is adopted. The code is implemented with PyTorch [35].

4.3 Evaluation metrics

We use the Fréchet inception distance (FID) [14] to evaluate the visual quality and the learned perceptual image patch similarity (LPIPS) [45, 23] to evaluate the diversity of translated images. For Multi-PIE, a face identity recognition network pretrained on the VGGFace [34] dataset is employed to calculate the identity accuracy of the translated images. The metrics for all domains are computed on the test set, and we report their averages.

method | FID (label/ref) | LPIPS(%) (label/ref)
StarGAN-v2 [7] | 47.92 / 48.31 | 9.6 / 9.4
DivCo [28] | 49.00 / 56.63 | 14.5 / 7.3
DRIT++ [25] | 54.52 / 53.18 | 11.6 / 9.3
LIFE [33] | 207.86 / - | 2.7 / -
w/o FM | 44.60 / 45.05 | 13.0 / 10.4
w/o ContinuousLoss | 45.60 / 45.82 | 12.8 / 11.5
w/o Trip | 43.85 / 44.47 | 15.1 / 18.9
LSC-GAN (ours) | 42.54 / 42.82 | 17.5 / 17.2
TABLE II: Quantitative comparisons with the state of the art on FFHQ-Aging by FID and LPIPS. We measure both types of synthesis: on the left of "/" is the value for label-based synthesis, on the right for reference-based. "-" means the model cannot realize reference-based editing.
method | quality(%) | degree(%)
StarGAN-v2 [7] | 13.4 | 6.5
DivCo [28] | 5.1 | 4.8
LIFE [33] | 10.6 | 3.2
LSC-GAN (ours) | 70.8 | 85.5
TABLE III: User study on FFHQ-Aging. We calculate and report the average of each indicator separately.
method | FID (label/ref) | ID(%) (label/ref)
U-Net [37] | 26.50 | 56.7
CVAE-GAN [3] | 26.32 | 26.3
VI-GAN [42] | 25.04 | 57.1
CD-VAE [44] | 23.95 | 60.4
StarGAN-v2 [7] | 27.20 / 23.81 | 0.7 / 0.4
LSC-GAN (ours) | 19.18 / 18.29 | 68.0 / 71.8
TABLE IV: Quantitative comparisons with other approaches on Multi-PIE by FID and ID accuracy. For StarGAN-v2 and LSC-GAN (ours), we measure both types of synthesis (label-based / reference-based).

5 Results and Analysis

5.1 Qualitative and Quantitative Results on FFHQ-Aging

According to LIFE [33], traditional age translation methods such as IPCGAN cannot generate diverse results and are often limited to facial images captured under specific conditions, such as a frontal view and constant lighting. To be fair, we mainly choose StarGAN-v2 [7] and DivCo [28], the state-of-the-art approaches in terms of quality and diversity, for comparison. Since LIFE first proposed large-scale age editing on in-the-wild images, we also compare with it.

5.1.1 Label-based synthesis

The results in Fig. 4 demonstrate that our method produces diverse, high-quality images. The diversity is reflected both intra-domain (lighting style, hair color, etc.) and inter-domain (head size, wrinkles, etc.). Note that although the intra-domain variation we show is obviously correlated with age, it does not affect the classification of the domain category. Compared with our method, the image quality of StarGAN-v2 is acceptable, but it obviously lacks inter-domain diversity, and the face does not change except for color. DivCo has inter-domain diversity but still lacks intra-domain diversity, and its image quality is poor. Similarly, the images generated by DRIT++ also have artifacts, especially in the bangs. LIFE cannot generate diverse results, and it shows obvious image artifacts. In general, these methods can only synthesize the average age of each domain, while our model is able to change the age within a domain. For example, in the first domain (0–6), we can freely express all the features of ages 0–6, whereas the other methods can only generate features of roughly a 3-year-old.

Tab. II lists the FID and LPIPS of all competing methods. For LPIPS, we randomly sample 10 different noises for each test image, translate them to the same target domain, and measure the distance between any two results; finally, we traverse the 6 target domains and take the average. For FID, we calculate it between the generated images used for LPIPS and the testing-set images, and average the values. Our method achieves the best score on both indicators. In fact, it is challenging to obtain high diversity, especially inter-domain diversity, while maintaining high quality.
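The evaluation protocol above can be sketched as follows; `translate(x, domain, z)` is a hypothetical wrapper around our generator, and the `lpips` package is only one possible LPIPS implementation (an assumption about tooling, not a statement from the paper):

```python
import itertools
import torch
import lpips  # pip install lpips

def lpips_diversity(translate, test_images, num_domains=6, num_noise=10, z_dim=16):
    """Average pairwise LPIPS over noise samples, then over domains and images."""
    metric = lpips.LPIPS(net='alex')
    scores = []
    for x in test_images:
        for domain in range(num_domains):
            outs = [translate(x, domain, torch.randn(1, z_dim)) for _ in range(num_noise)]
            pair_scores = [metric(a, b).item()
                           for a, b in itertools.combinations(outs, 2)]
            scores.append(sum(pair_scores) / len(pair_scores))
    return sum(scores) / len(scores)
```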

5.1.2 Reference-based synthesis

In Fig. 5, we list results from 48 different pairs, consisting of 6 sources and 8 references. It can be seen that our method naturally and accurately extracts characteristics such as the age, illumination and other styles from the reference image, while the identity and pose of the source image are kept as much as possible. In addition, we take two sources to compare with other works, shown in Fig. 6. StarGAN-v2 can extract the age-irrelevant factors, but still cannot synthesize different ages according to the reference. Note that the first and second references belong to the first domain (0–6) but have different specific ages; the results of StarGAN-v2 only show lighting changes within the same domain. DivCo's editing varies the age, but the results are obviously out of control, making the translation far from the age of the reference. DRIT++ not only has the same problems as StarGAN-v2, but also shows serious artifacts. Taking the hair color as an example, StarGAN-v2, DivCo and DRIT++ cannot change it according to the reference.

To quantify our advantages, the FID and LPIPS are also reported in Tab. II. For LPIPS, we randomly sample reference images and keep them the same across methods for a fair comparison. Our results are still the best, which proves that the proposed model takes the desired style from the reference image.

5.1.3 Beyond domain limit

In the training phase, the modules Y&O are used to edit and constrain the latent code. During inference, they can still be utilized to translate the code even beyond the domain limit. We show the editing effect in Fig. 7. The upper bound of the label in Fig. 7(a) is 70+. When we continuously call the module Y, the face gradually becomes younger; similarly, the module O changes the image into an older domain. Note that we do not generate images through the Y&O modules in the training phase, which means that the modules M and E have never seen the codes given by Y&O. All existing works require the same domain range during training and inference, but ours can translate images beyond the training range and continue to modify the facial age, and these images still have high quality. The Y&O modules can also perform reference-based translation, as shown in Fig. 7(b): although the first column is labeled as the oldest domain, the module O still edits the latent code towards an older age.

5.1.4 Ablation study

In Tab. II, we show the effect of each proposed component by removing it from the full model. w/o FM uses the same structure as StarGAN-v2 for label-based synthesis, without the modules F and M; all the loss functions and the Y&O modules are retained. Both FID and LPIPS get worse, which means that the structure of StarGAN-v2 is not suitable for continuous-domain translation. It is not reasonable to directly map noises or reference images into style codes reflecting a domain interval: since the style code then tends to be close to the center of the domain, the diversity of the translated images is reduced. In contrast, our structure first maps a constant into the latent domain code that characterizes the center of each domain interval, and then uses noises or reference images to give diverse styles, which not only expresses the age within the domain but also prevents mode collapse. w/o ContinuousLoss removes the continuous domain constraint in (2) and the cycle continuous consistency in (4); since they have similar functions, we ablate them together. The poor quality and low LPIPS show that they are indispensable in establishing the link among domains. w/o Trip removes the triplet loss in (3). The performance in Tab. II also drops significantly. The triplet loss constrains the distances of the latent codes among different domains, which helps the modules Y&O to fit the age trend.

5.1.5 User Study

We conduct a user study in the form of a questionnaire. The subjects are provided with 20 sets of generated images and are required to select the most suitable one from each set; we ask which result has the best quality and which has the maximum degree of translation. In detail, a group of translated results for the same source image is listed for the users, and these images come from 4 different models: StarGAN-v2, DivCo, LIFE and LSC-GAN. The statistics are listed in Tab. III. Our model achieves the best scores on both quality and translation degree, which shows that, from a human perspective, the images edited by our network are more in line with human aesthetics.

5.2 Qualitative and Quantitative Results on Multi-PIE

We also carry out view translation on Multi-PIE. Different from face aging, the ground truth of the target view is known, but we do not use it and keep the task unsupervised during training. We choose U-Net [37], CVAE-GAN [3], and the more advanced VI-GAN [42] and CD-VAE [44] for comparison. Note that it would be unfair to measure SSIM or MSE between the ground truths and our results, because LSC-GAN synthesizes diverse results that are not necessarily the same as the ground truths.

5.2.1 Label-based synthesis

Fig. 8 shows the visual comparisons. StarGAN-v2 and CVAE-GAN obviously cannot maintain the identity of the source image. The image quality of U-Net and CD-VAE is poor, with serious artifacts. VI-GAN performs well at some viewing angles, but overall it is still inferior to ours, especially in image quality. Our LSC-GAN gives satisfactory results and accurately translates the source images into different viewing angles. In Tab. IV, we measure the FID and ID accuracy to further demonstrate the advantages of our model, and we achieve the best score on both metrics. Note that we do not adopt any identity-preserving loss, yet our model still keeps the identity well under different views. This proves that the proposed method can model the latent style codes to establish and fit the continuous trends.

5.2.2 Reference-based synthesis

Few works can successfully perform reference-based viewing-angle translation. As shown in Fig. 9, our model can not only change the view angle according to the reference while maintaining the source identity, but can also realistically synthesize the lighting environment of the reference image. Although StarGAN-v2 can also capture the illumination condition of the reference, the identity is seriously affected by the reference, and the source image is ignored. The quantitative results in Tab. IV show that the proposed LSC-GAN gives the best result, which once again proves the superiority of our model in modeling the latent style codes and their relations for continuous-domain translation.

6 Conclusion

This paper investigates the continuous I2I translation task. We propose LSC-GAN for both label- and reference-based translation, which intentionally builds a latent style code for each domain interval and models their relation by the designed code editing modules. These modules translate the style code from one domain to its neighbours, so they can recursively change the style code from the source to the target. Moreover, LSC-GAN achieves diverse synthesis by explicitly incorporating sampling noises and references into the latent style code. Extensive experiments on two datasets for continuous-domain I2I translation demonstrate the effectiveness of the proposed model.

References

  • [1] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. arXiv preprint arXiv:2102.02754, 2021.
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • [3] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Cvae-gan: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017.
  • [4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [5] Ying-Cong Chen, Xiaogang Xu, Zhuotao Tian, and Jiaya Jia. Homomorphic latent space interpolation for unpaired image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2408–2416, 2019.
  • [6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
  • [7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020.
  • [8] Rui Gong, Wen Li, Yuhua Chen, and Luc Van Gool. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2477–2486, 2019.
  • [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [10] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and vision computing, 28(5):807–813, 2010.
  • [11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5767–5777. Curran Associates, Inc., 2017.
  • [12] Zhenliang He, Meina Kan, Shiguang Shan, and Xilin Chen. S2gan: Share aging factors across ages and share aging trends among individuals. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9440–9449, 2019.
  • [13] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11):5464–5478, 2019.
  • [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 6626–6637. Curran Associates, Inc., 2017.
  • [15] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
  • [16] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
  • [17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [18] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • [19] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  • [20] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, pages 1857–1865. PMLR, 2017.
  • [21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [23] Alex Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [24] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pages 35–51, 2018.
  • [25] Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang. Drit++: Diverse image-to-image translation via disentangled representations. International Journal of Computer Vision, pages 1–16, 2020.
  • [26] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3673–3682, 2019.
  • [27] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pages 700–708, 2017.
  • [28] Rui Liu, Yixiao Ge, Ching Lam Choi, Xiaogang Wang, and Hongsheng Li. Divco: Diverse conditional image synthesis via contrastive generative adversarial network. arXiv preprint arXiv:2103.07893, 2021.
  • [29] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [30] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [31] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, pages 2642–2651. PMLR, 2017.
  • [32] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. volume 70 of Proceedings of Machine Learning Research, pages 2642–2651, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [33] Roy Or-El, Soumyadip Sengupta, Ohad Fried, Eli Shechtman, and Ira Kemelmacher-Shlizerman. Lifespan age transformation synthesis. In European Conference on Computer Vision, pages 739–755. Springer, 2020.
  • [34] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • [35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
  • [36] Andrés Romero, Pablo Arbeláez, Luc Van Gool, and Radu Timofte. Smit: Stochastic multi-label image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [38] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [39] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
  • [40] Zongwei Wang, Xu Tang, Weixin Luo, and Shenghua Gao. Face aging with identity-preserved conditional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7939–7947, 2018.
  • [41] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In Proceedings of the European conference on computer vision (ECCV), pages 168–184, 2018.
  • [42] Xiaogang Xu, Ying-Cong Chen, and Jiaya Jia. View independent generative adversarial network for novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7791–7800, 2019.
  • [43] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pages 2849–2857, 2017.
  • [44] Mingyu Yin, Li Sun, and Qingli Li. Novel view synthesis on unpaired data by conditional deformable variational auto-encoder. In European Conference on Computer Vision, pages 87–103. Springer, 2020.
  • [45] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [46] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5810–5818, 2017.
  • [47] Zhilin Zheng and Li Sun. Disentangling latent space for vae by label relevant/irrelevant dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12192–12201, 2019.
  • [48] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.