Combining Attention with Flow for Person Image Synthesis

08/04/2021 · Yurui Ren, et al. · Peking University / Tencent

Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model that combines the attention operation with the flow-based operation. Our model not only takes advantage of the attention operation to generate accurate target structures but also uses the flow-based operation to sample realistic source textures. Both objective and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive ablation studies verify our hypotheses and show the efficacy of the proposed modules. Besides, additional experiments on the portrait image editing task demonstrate the versatility of the proposed combination.


1. Introduction

Being able to synthesize person images by transforming the poses of given persons is an important task with a large variety of applications. Industries such as electronic commerce, virtual reality, film production, and next-generation communication require such algorithms to generate content. In many cases, however, these requirements are met by graphics technologies that offer precise control over the image rendering. Editing images in this way requires professionals to build fine-grained 3D models for each scene. This complex and tedious process puts these algorithms out of reach of ordinary users and meanwhile increases the cost of the generated content.

Recently, advances in computer vision have enabled tremendous progress in generating realistic images (Goodfellow et al., 2014; Brock et al., 2018; Karras et al., 2019, 2020). Several algorithms (Ma et al., 2017; Siarohin et al., 2018; Ren et al., 2020; Men et al., 2020) have been proposed to automatically synthesize person images from references using learning-based methods. Formally, the pose-guided person image synthesis task aims to synthesize person images by transforming the poses of reference images according to the given modifications while preserving the reference identities. Examples are provided in Fig. 1. It can be seen that the reference and target images in this task have clear mapping relationships: targets are spatially transformed versions of the reference images. Therefore, this task can be tackled by reassembling the references in the spatial domain. However, Convolutional Neural Networks (CNNs) lack the ability to perform efficient spatial transformations (Goodfellow et al., 2016; Vaswani et al., 2017). Convolutional operations are building blocks that process one local neighborhood at a time. To model long-term dependencies, stacks of convolutional operations are required. Realistic source textures are "washed away" during these operations, which results in over-smoothed images. Therefore, a fundamental challenge for this task is to design efficient spatial transformation blocks to reassemble the reference images.

The attention operation has been proved to be an effective method for extracting non-local dependencies (Vaswani et al., 2017; Wang et al., 2018; Zhang et al., 2019). The response of a target query is calculated as the weighted sum of source features. By using this operation, each target feature directly communicates with all source features. Thus, targets can be expected to sample specific source features by increasing the weights of the corresponding regions and suppressing the other features. However, to generate images with realistic textures, each output position should only sample a very local region of the sources. This requires the attention correlation matrix to be sparse so that all unsampled features are rejected, which is extremely difficult for the standard attention operation. Meanwhile, this position-agnostic operation cannot maintain the patterns of the reference images (e.g., logos).

Another efficient spatial transformation operation is the flow-based operation. This operation warps the source information by predicting 2D coordinate offsets that specify the sampling positions. Different from the attention operation, the flow-based operation can obtain photo-realistic textures since it samples a very local source patch for each target position. However, it is hard for the networks to obtain stable gradients from the flow-based operation because each output feature is only related to a local and indeterminate source patch. This phenomenon hinders the models from extracting accurate motions, which is more evident when complex deformations and severe occlusions are observed.

Observing the complementary advantages of these two operations, in this paper we propose a novel model that combines the attention operation with the flow-based operation. The architecture of the proposed model is shown in Fig. 2 and Fig. 3. Specifically, a Deformation Estimation Module is first designed to extract the deformations between the reference images and the desired targets. Two types of deformations are estimated: the correlation matrices for the attention operation and the flow fields for the flow-based operation. Then, we generate a combination map that is responsible for selecting the better deformation between the correlation matrices and the flow fields for each target position. Finally, an Image Synthesis Module is employed to synthesize the target images by reassembling the source feature maps according to the estimated deformations and combination maps.

We compare the proposed model with several state-of-the-art methods. The experimental results show the superiority of our model. Meanwhile, comprehensive ablation studies are conducted to verify our hypotheses and show the efficacy of the proposed modules. Besides, we further apply our model to tackle the portrait image editing task. We show that the proposed model can achieve intuitive portrait image control by modifying the poses and expressions of reference images according to the provided modifications. The main contributions of our paper can be summarized as follows:

  • We propose a novel model for person image synthesis by combining the attention operation with the flow-based operation. By exploiting the complementary advantages of these operations, our model can synthesize images with not only accurate structures but also realistic details.

  • We demonstrate the versatility of the proposed model by further extending it to tackle the portrait image editing task. Experiments show that our model can synthesize portrait images with accurate movements.

Figure 2. The architecture of the deformation estimation module. The deformations are estimated by the attention correlation estimator and the flow field estimator. Then, the combination maps are generated using the corresponding warped images. Our model estimates multi-scale deformations to transform both global and local contexts.


2. Related Work

Recently, deep neural networks have begun to produce visually compelling images conditioned on user specifications such as segmentation maps and edge maps (Isola et al., 2017; Choi et al., 2018; Huang et al., 2018; Liu et al., 2019a; Yu et al., 2019). The pose-guided person image synthesis task is a highly active topic in this field, where the target images are synthesized by rendering the corresponding skeletons with the appearance of the reference images. Ma et al. (Ma et al., 2017) tackle this task by proposing a coarse-to-fine framework. Their framework first synthesizes coarse images with accurate poses and then refines the results by adding vivid textures in an adversarial manner. Some follow-up works (Ma et al., 2018; Esser et al., 2018) manage to disentangle the poses and appearance of the reference images to improve the results. However, these methods use 1D embeddings to represent the appearance, which hinders the generation of complex textures. Men et al. (Men et al., 2020) alleviate this problem by extracting appearance from different semantic regions separately. The extracted embeddings are then injected into the feature maps of target skeletons to generate the final results. Instead of using the same embeddings for all target positions, Zhang et al. (Zhang et al., 2021) propose to inject the style of each semantic part into the corresponding target semantic regions. With these representative appearance embeddings, such methods can generate images with realistic textures. However, they cannot maintain the patterns of the reference images. Meanwhile, they rely on accurate human parsing maps, so their performance may be vulnerable to parsing errors.

Some other methods tackle this task by proposing efficient spatial transformation modules (Siarohin et al., 2018; Wang et al., 2019; Liu et al., 2019b; Zablotskaia et al., 2019; Ren et al., 2020; Tang et al., 2021). Siarohin et al. (Siarohin et al., 2018) introduce deformable skip connections to spatially transform the source neural textures with a set of affine transformations. This method relieves the spatial misalignment caused by pose differences and achieves good results. However, it requires predefining a set of transformation components, which limits its applicability. Zhu et al. (Zhu et al., 2019) use cascaded attention blocks to transfer the source information progressively. This method can generate accurate structures for target images, but complex textures are smoothed during the multiple transfers, which degrades performance. Li et al. (Li et al., 2019) propose a flow-based method for this task. To estimate accurate deformations, they generate flow field labels with an additional 3D human reconstruction method. However, the performance of this model is limited by the accuracy of the 3D reconstruction model. Han et al. (Han et al., 2019) propose a cascaded flow estimator to predict flow fields in an unsupervised manner. However, this method warps the sources at the pixel level, which means that additional refinement networks are required to fill the holes caused by occlusions. Ren et al. (Ren et al., 2020) propose a global-flow local-attention framework to reassemble the input image at the feature level. Tang et al. (Tang et al., 2021) further improve the results by proposing a structure-aware person image synthesis method that predicts flow fields of different body semantics separately. Recently, some methods (Zhang et al., 2020; Zhou et al., 2021) achieve spatial transformation by using the attention operation. The correspondences between the sources and targets are extracted by learning attention correlation matrices. However, the attention operation cannot maintain the spatial distributions of the reference images, which hinders these methods from reconstructing complex textures.

3. Our Approach

We propose a novel model for the pose-guided person image synthesis task. The motivation of our model comes from the observation that the commonly used spatial transformation operations have complementary advantages. As shown in Fig. 1, the flow-based operation can extract vivid source textures by assigning a very local patch to each target position. However, it lacks the ability to capture complex deformations between sources and targets. In contrast, the attention operation can extract accurate deformations and synthesize targets with reasonable structures, but it cannot maintain the source textures. Therefore, by combining these two operations, our model can synthesize images with not only accurate global structures but also realistic local details. In the following, we first provide the details of our deformation estimation module (Sec. 3.1). Then, the image synthesis module is introduced for the target image synthesis (Sec. 3.2). Finally, we explain the training losses (Sec. 3.3). Please note that, for simplicity of the discussion, we describe the model as warping source features at a single scale. Our model can be extended by warping multi-scale source features to transform both global and local contexts.

3.1. Deformation Estimation Module

A fundamental challenge of the pose-guided person image synthesis task is to accurately reassemble the source information according to the provided modifications. This requires estimating the correspondence between the reference image $x_s$ and the desired target image $x_t$. We deal with this task using a deformation estimation module. The architecture of this module is shown in Fig. 2. It consists of three parts: the attention correlation estimator, the flow field estimator, and the combination map generator.

The Attention Correlation Estimator is responsible for calculating the attention correlation matrix $A$ that contains the correlations of all queries to all keys. This estimator first encodes the reference image $x_s$ and the target skeleton $p_t$ into feature maps using encoders $E_k$ and $E_q$, respectively:

$$k = E_k(x_s), \quad q = E_q(p_t) \tag{1}$$

where $k, q \in \mathbb{R}^{hw \times c}$ are the extracted feature maps representing keys and queries, respectively, $h$ and $w$ denote the spatial size of the feature maps, and $c$ is the number of feature channels. Then, the correlation matrix $A$ is obtained as

$$A_{j,i} = \frac{\exp\left(\alpha \, q_j^{\top} k_i\right)}{\sum_{i' \in \Omega} \exp\left(\alpha \, q_j^{\top} k_{i'}\right)} \tag{2}$$

where $\Omega$ is the coordinate set of the feature maps, $k_i$ denotes the feature located at position $i$ of the reference feature map, $q_j$ denotes the feature located at position $j$ of the skeleton feature map, and the coefficient $\alpha$ controls the sharpness of the softmax operation; it is kept fixed for all experiments. With matrix $A$, we can calculate the attention results as the weighted sum of the source inputs $v$:

$$\mathcal{W}_a(A, v) = A \, v \tag{3}$$

where $\mathcal{W}_a$ indicates the attention warping operation. By using the attention operation, the source information can be integrated into the desired target positions according to matrix $A$.
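For concreteness, a minimal PyTorch-style sketch of the attention warping in Eqs. (1)–(3) is given below. The tensor shapes, the helper name `attention_warp`, and the example value of the sharpness coefficient are illustrative assumptions rather than the released implementation.

```python
import torch

def attention_warp(k, q, value, alpha):
    """Attention warping (Eqs. 1-3), assuming flattened feature maps.

    k:      keys from the reference image,     shape (B, HW, C)
    q:      queries from the target skeleton,  shape (B, HW, C)
    value:  source inputs to be warped,        shape (B, HW, C_v)
    alpha:  sharpness coefficient of the softmax in Eq. (2)
    """
    # Correlation of every target query with every source key: (B, HW, HW).
    logits = alpha * torch.bmm(q, k.transpose(1, 2))
    # Softmax over source positions gives the attention correlation matrix A.
    A = torch.softmax(logits, dim=-1)
    # Each target position is a weighted sum over all source positions (Eq. 3).
    return torch.bmm(A, value), A

# Hypothetical usage with random tensors standing in for E_k(x_s) and E_q(p_t).
B, HW, C = 2, 32 * 32, 64
k, q = torch.randn(B, HW, C), torch.randn(B, HW, C)
value = torch.randn(B, HW, 3)                 # e.g. a flattened RGB reference image
warped, A = attention_warp(k, q, value, alpha=10.0)   # alpha chosen arbitrarily here
print(warped.shape, A.shape)                  # (2, 1024, 3) (2, 1024, 1024)
```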

Figure 3. The architecture of the image synthesis module. The feature maps of the reference images are warped using the estimated deformations. Then, the network generates the final images by rendering the target skeletons with aligned neural textures.


The Flow Field Estimator is responsible for estimating flow fields that contain the relative movements between the sources and targets. Different from the attention operation, the flow-based operation forces the correlation matrices to be sparse by sampling a local source region for each output. Therefore, this operation helps to reconstruct vivid source textures. To be specific, the estimator $F$ takes the reference image $x_s$, the reference skeleton $p_s$, and the target skeleton $p_t$ as inputs. The flow fields $\mathbf{w}$ are generated by analyzing the difference between the reference images and the desired targets:

$$\mathbf{w} = F(x_s, p_s, p_t) \tag{4}$$

We design $F$ with an auto-encoder structure. It first extracts features from the inputs and then decodes them into flow fields. Several skip connections are used to leverage both local and global contexts. After obtaining $\mathbf{w}$, the output of the flow-based operation can be obtained by

$$\mathcal{W}_f(\mathbf{w}, v)_l = v_{l + \mathbf{w}_l} \tag{5}$$

where $\mathcal{W}_f$ indicates the flow-based warping operation, which samples the input $v$ with the flow fields $\mathbf{w}$ using bilinear interpolation.
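The bilinear sampling of Eq. (5) could be implemented, for example, with `torch.nn.functional.grid_sample`; the sketch below assumes a pixel-offset flow layout and a recent PyTorch version, and the helper name `flow_warp` is ours.

```python
import torch
import torch.nn.functional as F

def flow_warp(value, flow):
    """Flow-based warping (Eq. 5) with bilinear interpolation.

    value: source feature map or image, shape (B, C, H, W)
    flow:  2D coordinate offsets in pixels, shape (B, 2, H, W);
           flow[:, 0] = horizontal offset, flow[:, 1] = vertical offset (assumed layout)
    """
    B, _, H, W = flow.shape
    # Base grid holding the coordinates of every target position.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=flow.dtype),
                            torch.arange(W, dtype=flow.dtype), indexing='ij')
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).to(flow.device)   # (1, 2, H, W)
    # Absolute sampling positions = target coordinates + predicted offsets.
    coords = base + flow
    # Normalize to [-1, 1], as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                               # (B, H, W, 2)
    return F.grid_sample(value, grid, mode='bilinear', align_corners=True)
```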

The Combination Map Generator predicts the combination maps used to select between the warping results of the attention operation and those of the flow-based operation. As mentioned before, the attention operation and the flow-based operation have complementary advantages. Therefore, reasonably combining their results is important for exploiting their strengths and thus improving the quality of the final images. Here, we tackle this task by generating combination maps $m$ using a content-aware combination map generator $G_m$:

$$m = G_m\!\left(\mathcal{W}_a(A, \bar{x}_s),\ \mathcal{W}_f(\mathbf{w}, \bar{x}_s)\right) \tag{6}$$

where $\bar{x}_s$ is the resized version of the original reference image $x_s$. We design the generator $G_m$ with several residual blocks. The Sigmoid activation function is used as the non-linearity of the output layer, so the combination maps $m$ have continuous values between 0 and 1. With the deformations $A$ and $\mathbf{w}$ and the combination maps $m$, we can generate the final images by spatially transforming the source textures.
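As a rough illustration of such a generator, the sketch below stacks a few residual blocks and ends with a sigmoid output; the block count, channel widths, and exact inputs are assumptions consistent with Eq. (6), not the paper's configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A plain residual block used inside the sketched combination map generator."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class CombinationMapGenerator(nn.Module):
    """Predicts a per-position map m in (0, 1) from the two warped reference images
    (Eq. 6); in_channels=6 assumes two stacked RGB inputs."""
    def __init__(self, in_channels=6, channels=32, num_blocks=3):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, warped_attn, warped_flow):
        x = torch.cat((warped_attn, warped_flow), dim=1)
        return self.tail(self.blocks(self.head(x)))
```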

3.2. Image Synthesis Module

The image synthesis module is used to synthesize the final images by rendering target skeletons with reference textures. The architecture of this module is shown in Fig. 3. Specifically, this module $G$ takes $x_s$, $p_t$, $A$, $\mathbf{w}$, and $m$ as inputs and synthesizes the predicted image $\hat{x}_t$:

$$\hat{x}_t = G(x_s, p_t, A, \mathbf{w}, m) \tag{7}$$

The warping block in module $G$ is responsible for reassembling the reference neural textures according to the estimated deformations. Let $v_s$ denote the feature map extracted from the reference image $x_s$. This block generates the aligned feature map $v_{align}$ by first warping $v_s$ with both the attention correlation matrix $A$ and the flow field $\mathbf{w}$ and then combining the warped results with the combination map $m$:

$$v_{align} = m \odot \mathcal{W}_a(A, v_s) + (1 - m) \odot \mathcal{W}_f(\mathbf{w}, v_s) \tag{8}$$

where $\odot$ denotes element-wise multiplication over the spatial domain. After obtaining $v_{align}$, the target image is generated by adding the vivid neural textures to the feature map $v_t$ extracted from the target skeleton $p_t$:

$$v_o = v_t + v_{align} \tag{9}$$

where $v_o$ is the output feature map containing both target semantics and reference textures. We further decode $v_o$ to synthesize $\hat{x}_t$.
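Putting the pieces together, the warping block of Eqs. (8)–(9) might be sketched as follows; `flow_warp` refers to the hypothetical helper from the earlier sketch, and the feature-map shapes are assumed.

```python
import torch

def warping_block(v_s, v_t, A, flow, m):
    """Combine attention- and flow-warped reference features (Eqs. 8-9).

    v_s:  reference feature map,        shape (B, C, H, W)
    v_t:  target-skeleton feature map,  shape (B, C, H, W)
    A:    attention correlation matrix, shape (B, HW, HW)
    flow: flow field,                   shape (B, 2, H, W)
    m:    combination map in (0, 1),    shape (B, 1, H, W)
    """
    B, C, H, W = v_s.shape
    # Attention warping: weighted sum over all source positions (Eq. 3).
    v_flat = v_s.flatten(2).transpose(1, 2)                       # (B, HW, C)
    v_attn = torch.bmm(A, v_flat).transpose(1, 2).reshape(B, C, H, W)
    # Flow-based warping: bilinear sampling of a local source patch (Eq. 5).
    v_flow = flow_warp(v_s, flow)
    # Select the better deformation per position with the combination map (Eq. 8).
    v_align = m * v_attn + (1.0 - m) * v_flow
    # Add the aligned neural textures to the target-skeleton features (Eq. 9).
    return v_t + v_align
```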

3.3. Training Losses

The proposed model is trained with several loss functions, each fulfilling a specific role. These loss functions can be divided into two categories: losses for accurate deformations and losses for realistic images.

Losses for Accurate Deformations. Estimating accurate deformations between sources and targets is crucial for generating realistic images. However, using losses only at the end of the network (e.g., a reconstruction loss) cannot guarantee that the model learns meaningful deformations. Therefore, several losses are designed to directly constrain the estimated deformations. For the attention correlation matrix $A$, we calculate the $\ell_1$ distance between the warped image and the target image:

$$\mathcal{L}_{attn} = \left\| \mathcal{W}_a(A, x_s) - x_t \right\|_1 \tag{10}$$

This attention loss encourages the correlation matrix to contain meaningful deformations that reduce the reconstruction errors. To constrain the flow field $\mathbf{w}$, we employ the sampling correctness loss and the regularization loss proposed in (Ren et al., 2020). The sampling correctness loss calculates the normalized cosine similarity between the warped reference features and the ground-truth target features. We use VGG-19 to extract the corresponding features $\phi_s$ and $\phi_t$ from the images $x_s$ and $x_t$. This loss is defined as

$$\mathcal{L}_{c} = \frac{1}{N} \sum_{l \in \Omega'} \exp\left( - \frac{\mu\!\left(\mathcal{W}_f(\mathbf{w}, \phi_s)_l,\ \phi_t^l\right)}{\mu_l^{\max}} \right) \tag{11}$$

where $\mu(\cdot,\cdot)$ indicates the cosine similarity operation and the coordinate set $\Omega'$ contains all $N$ positions in the feature maps. For each position $l$, this loss first calculates the similarity between the warped reference feature $\mathcal{W}_f(\mathbf{w}, \phi_s)_l$ and the ground-truth target feature $\phi_t^l$. Then, the similarity is normalized by $\mu_l^{\max}$ to avoid the bias brought by occlusion, where $\mu_l^{\max}$ is the similarity of the source feature most similar to $\phi_t^l$:

$$\mu_l^{\max} = \max_{l' \in \Omega'} \mu\!\left(\phi_s^{l'},\ \phi_t^l\right) \tag{12}$$
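A hedged sketch of Eqs. (11)–(12) is given below, assuming the VGG-19 features have already been extracted, flattened, and warped; it follows the description above rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def sampling_correctness_loss(phi_s_warped, phi_s, phi_t, eps=1e-8):
    """Sampling correctness loss (Eqs. 11-12).

    phi_s_warped: warped reference VGG features,    shape (B, N, C)
    phi_s:        unwarped reference VGG features,  shape (B, N, C)
    phi_t:        ground-truth target VGG features, shape (B, N, C)
    """
    phi_w_n = F.normalize(phi_s_warped, dim=-1)
    phi_s_n = F.normalize(phi_s, dim=-1)
    phi_t_n = F.normalize(phi_t, dim=-1)
    # Cosine similarity between each warped source feature and its target feature.
    mu = (phi_w_n * phi_t_n).sum(dim=-1)                                   # (B, N)
    # Normalization term (Eq. 12): similarity of the most similar source feature
    # for every target position, which reduces the bias caused by occlusion.
    mu_max = torch.bmm(phi_t_n, phi_s_n.transpose(1, 2)).max(dim=-1).values
    return torch.exp(-mu / (mu_max + eps)).mean()
```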

The regularization loss is used to exploit the spatial correlations of the flow fields $\mathbf{w}$. It assumes that each local deformation estimated by $\mathbf{w}$ should be an affine transformation. Let $c_l$ be the reference coordinates of a local patch centered at location $l$. The coordinates of the sampling points are calculated using the corresponding flow field patch $\mathbf{w}_l$ as $\tilde{c}_l = c_l + \mathbf{w}_l$, and let $\hat{c}_l$ denote the homogeneous coordinates of $c_l$. This loss assumes a linear relationship between $\tilde{c}_l$ and $\hat{c}_l$ and calculates the least-square error:

$$\mathcal{L}_{r} = \sum_{l} \left\| \tilde{c}_l - \theta_l \hat{c}_l \right\|_2^2 \tag{13}$$

where the matrix $\theta_l$ is the least-square solution of the linear equation $\tilde{c}_l = \theta_l \hat{c}_l$. It can be calculated as

$$\theta_l = \tilde{c}_l \hat{c}_l^{\top} \left( \hat{c}_l \hat{c}_l^{\top} \right)^{-1} \tag{14}$$
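Since Eq. (14) is a closed-form least-square solution, it can be evaluated directly with batched matrix operations; the sketch below assumes the patch coordinates have already been gathered into matrices.

```python
import torch

def regularization_loss(c_tilde, c_hat):
    """Flow regularization loss (Eqs. 13-14).

    c_tilde: sampled coordinates of each local patch,  shape (B, L, 2, n)
             (n points per patch, one patch per location l)
    c_hat:   homogeneous reference coordinates,        shape (B, L, 3, n)
    """
    # Closed-form least-square solution theta_l = c_tilde c_hat^T (c_hat c_hat^T)^-1 (Eq. 14).
    cht = c_hat.transpose(-1, -2)                           # (B, L, n, 3)
    theta = c_tilde @ cht @ torch.inverse(c_hat @ cht)      # (B, L, 2, 3)
    # Error between the sampled coordinates and their best affine fit (Eq. 13).
    return ((c_tilde - theta @ c_hat) ** 2).sum(dim=(-1, -2)).mean()
```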

Losses for Realistic Images. After obtaining accurate deformations, our model synthesizes the final images with the warped features. Several losses are designed to obtain realistic images. The perceptual loss proposed in (Johnson et al., 2016) is used to calculate the reconstruction error between the predicted image $\hat{x}_t$ and the ground-truth image $x_t$:

$$\mathcal{L}_{perc} = \sum_{i} \left\| \phi_i(\hat{x}_t) - \phi_i(x_t) \right\|_1 \tag{15}$$

where $\phi_i$ denotes the $i$-th activation map of the VGG-19 network. Besides, a face reconstruction loss is used for generating natural faces. This loss calculates the difference between the features of the cropped faces:

$$\mathcal{L}_{face} = \sum_{i} \left\| \phi_i\!\left(C(\hat{x}_t)\right) - \phi_i\!\left(C(x_t)\right) \right\|_1 \tag{16}$$

where $C$ is the face cropping function. To generate vivid details, we use a style loss to calculate the statistical error between the activation maps:

$$\mathcal{L}_{style} = \sum_{i} \left\| \mathcal{G}\!\left(\phi_i(\hat{x}_t)\right) - \mathcal{G}\!\left(\phi_i(x_t)\right) \right\|_1 \tag{17}$$

where $\mathcal{G}(\phi_i)$ represents the Gram matrix of activation map $\phi_i$. In addition to the VGG-based losses, a generative adversarial loss is employed to mimic the distribution of the ground-truth images:

$$\mathcal{L}_{adv} = \mathbb{E}\left[\log D(x_t)\right] + \mathbb{E}\left[\log\left(1 - D(\hat{x}_t)\right)\right] \tag{18}$$

where $D$ is the discriminator. We use the following overall loss to train our model:

$$\mathcal{L} = \lambda_a \mathcal{L}_{attn} + \lambda_c \mathcal{L}_{c} + \lambda_r \mathcal{L}_{r} + \lambda_p \mathcal{L}_{perc} + \lambda_f \mathcal{L}_{face} + \lambda_s \mathcal{L}_{style} + \lambda_{adv} \mathcal{L}_{adv} \tag{19}$$

where the $\lambda$ terms weight the individual losses.
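For reference, a small sketch of the VGG-based perceptual and style terms in Eqs. (15) and (17) is given below, assuming the activation maps are supplied as lists; the layer selection and loss weights are not specified here.

```python
import torch

def gram_matrix(act):
    """Gram matrix of an activation map, shape (B, C, H, W) -> (B, C, C)."""
    B, C, H, W = act.shape
    feat = act.flatten(2)                                  # (B, C, HW)
    return feat @ feat.transpose(1, 2) / (C * H * W)

def perceptual_and_style_loss(acts_pred, acts_gt):
    """acts_pred / acts_gt: lists of VGG-19 activation maps of x_hat_t and x_t."""
    l_perc = sum((p - g).abs().mean() for p, g in zip(acts_pred, acts_gt))        # Eq. 15
    l_style = sum((gram_matrix(p) - gram_matrix(g)).abs().mean()
                  for p, g in zip(acts_pred, acts_gt))                            # Eq. 17
    return l_perc, l_style
```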

4. Experiment

Dataset. The In-shop Clothes Retrieval Benchmark of the DeepFashion dataset (Liu et al., 2016) is used in our experiments. This dataset contains high-resolution images of fashion models wearing different clothing items in different poses. Images of the same person in the same clothes are paired for training and testing. We split the dataset according to personal identity so that the identities of the training and testing sets do not overlap. Training and testing pairs are then randomly selected from the corresponding splits.

Implementation Details. In our experiments, reference features at two different resolutions are extracted and warped to generate the final results. Considering the large deformations between the reference images and target images, we train the model in stages to avoid it getting stuck in bad local minima. The deformation estimation module is first pre-trained; then we train the whole model in an end-to-end manner. The same batch size is used for all experiments. We use the historical average technique (Salimans et al., 2016) to update an average model by taking a weighted average of the current parameters and the previous parameters. More details can be found in the Supplementary Materials.
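The historical average of the parameters can be maintained, for instance, as a running weighted average of the generator weights; the decay value in the sketch below is a placeholder, not the paper's setting.

```python
import copy
import torch

@torch.no_grad()
def update_average_model(avg_model, model, decay=0.999):
    """Blend the current parameters into the historical-average model.
    `decay` is a placeholder value; the paper's setting is not given here."""
    for p_avg, p_cur in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(decay).add_(p_cur, alpha=1.0 - decay)

# Hypothetical usage: keep a separate copy of the generator for evaluation.
# avg_generator = copy.deepcopy(generator)
# After each training step: update_average_model(avg_generator, generator)
```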

Metric     VU-Net   Def-GAN   Pose-Attn   Intr-Flow   ADGAN    GFLA     Ours
SSIM ↑     0.6738   0.6836    0.6714      0.6968      0.6736   0.7074   0.7113
LPIPS ↓    0.2637   0.2330    0.2533      0.1875      0.2250   0.1962   0.1813
FID ↓      23.669   18.460    20.728      13.014      14.546   9.9125   9.4502
FR (%) ↑   4.12     14.40     9.56        16.80       29.08    18.88    30.00

Table 1. The evaluation results compared with several state-of-the-art methods. SSIM and LPIPS measure the reconstruction errors. FID indicates the realism of the generated images. The fooling rate (FR) is obtained from human subjective studies and represents the probability that the generated images are mistaken for real images.

4.1. Comparisons with State-of-the-arts

We compare our model with several state-of-the-art methods including VU-Net (Esser et al., 2018), Def-GAN (Siarohin et al., 2018), Pose-Attn (Zhu et al., 2019), Intr-Flow (Li et al., 2019), ADGAN (Men et al., 2020), and GFLA (Ren et al., 2020). The released weights of these methods are used for evaluation.

Figure 4. The qualitative comparisons with several state-of-the-art methods, including VU-Net (Esser et al., 2018), Def-GAN (Siarohin et al., 2018), Pose-Attn (Zhu et al., 2019), Intr-Flow (Li et al., 2019), ADGAN (Men et al., 2020), and GFLA (Ren et al., 2020).


Figure 5. More results of our model. Deformations and combination maps estimated for feature maps are provided. The visualizations are resized for illustration.


Figure 6. The qualitative results of the ablation study. We mark some typical artifacts with red arrows.


Qualitative Results. We provide typical qualitative results in Fig. 4 for intuitive comparison. It can be seen that without using spatial transformation blocks, VU-Net fails to reconstruct the details of the clothes. Def-GAN and Pose-Attn can generate images with relatively reasonable structures. However, obvious artifacts can be observed in images with complex textures. ADGAN first extracts semantic embeddings from the reference images and then injects the extracted variables into target skeletons. This model can generate realistic results for solid-color clothes. However, since the extracted 1D embeddings do not contain the spatial information of the image, it struggles to reconstruct complex textures.

Using the flow-based operation as the spatial transformation module, Intr-Flow and GFLA can extract vivid neural textures from the reference images and generate results with realistic details. However, their results may suffer from inaccurate structures, which is more evident when complex deformations and severe occlusions are observed. The main reason is that the poor gradients provided by the warping operation hinder the model from estimating accurate deformations for each position. By combining the flow-based operation with the attention operation, our model can generate images with not only accurate structures but also vivid textures. Besides, we provide more results of our model in Fig. 5. The corresponding combination maps, as well as the visualizations of the deformations, are also given. It can be seen that the deformations estimated by the flow fields are selected to generate complex textures, while for the smooth areas (e.g., arms, legs) the deformations estimated by the attention correlation matrices are selected. Thus, our model can use the reference information to synthesize realistic results.

Quantitative Results. The evaluation results are shown in Tab. 1. We use the Structural Similarity Index Measure (SSIM) (Wang et al., 2004) and the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) to evaluate the reconstruction errors between the generated images and the ground-truth images. SSIM is a commonly used pixel-level image similarity indicator. However, as discussed in (Zhang et al., 2018), pixel-level distance metrics may be insufficient for assessing perceptual quality. Therefore, LPIPS is further employed to evaluate the perceptual distance by calculating feature differences using a network trained on human judgments. It can be seen that our model achieves competitive SSIM and LPIPS scores, which means that we can generate images with better perceptual quality and fewer reconstruction errors. Besides, the Fréchet Inception Distance (FID) (Heusel et al., 2017) is used to measure the distance between the distributions of synthesized images and real images. This metric indicates the realism of the generated images. Our model achieves the best FID score compared with the state-of-the-art models.
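As an aside, SSIM and LPIPS can be reproduced with publicly available packages; the sketch below assumes a recent scikit-image (for `structural_similarity`) and the `lpips` package, and is only meant to illustrate how such scores are typically computed.

```python
import torch
import lpips                                     # pip install lpips
from skimage.metrics import structural_similarity

def evaluate_pair(pred, gt, lpips_fn):
    """pred, gt: uint8 RGB images as numpy arrays of shape (H, W, 3)."""
    # Pixel-level similarity (higher is better).
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=255)
    # Perceptual distance (lower is better); lpips expects tensors in [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).float()[None] / 127.5 - 1.0
    dist = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return ssim, dist

# lpips_fn = lpips.LPIPS(net='alex')   # perceptual metric trained on human judgments
```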

Since the objective metrics have their own limitations, their results may not match actual subjective perceptions (Zhang et al., 2018). Therefore, we conduct a user study to compare the subjective quality. Pairs of ground-truth and generated images are shown to volunteers, who are asked to select the more realistic image from each pair. The fooling rate (FR) is calculated as the final score of the user study. This test is implemented on Amazon Mechanical Turk (MTurk). Images are randomly selected as the test set, and each image pair is compared by several different volunteers, with each volunteer conducting a number of comparisons. The evaluation results are shown in Tab. 1. It can be seen that our model achieves the best FR score, which means that we can generate more realistic images.

4.2. Ablation Study

In this section, we evaluate the efficacy of the proposed modules by comparing our model with several variants.

Baseline. A baseline model is trained to prove the efficacy of the deformation modules. An auto-encoder network is used for this model. The reference images and target skeletons are concatenated as the model inputs. We train this model using the same loss functions as those of our image synthesis module.

Attention Model. The attention model is used to evaluate the performance gain brought by the attention correlation estimator. We remove the flow field estimator and use the attention correlation matrices as the final deformations.

Flow Model. Similarly, the flow model is designed to evaluate the efficacy of the flow field estimator. The attention correlation estimator is removed from the proposed model. We take the flow fields as the final deformations to warp the reference features.

Without Face Loss. This model is used to show the effect of the face reconstruction loss on the quality of the generated images. We remove the face reconstruction loss when training this model.

Full Model. The proposed model with both attention correlation estimator and flow field estimator is used in this model.

The qualitative results of the ablation models are provided in Fig. 6. It can be seen that the baseline model generates images with correct poses and identities. However, it cannot preserve the textures of the reference images. By using the spatial transformation modules, both the attention model and the flow model can efficiently deform the reference images and generate targets with realistic details. However, different types of artifacts can be observed in their results. The powerful attention operation helps the model extract long-term correlations and generate accurate target structures. However, the dense connections prevent the model from benefiting from image locality. As shown in the top three rows of Fig. 6, the complex textures of the reference images cannot be reconstructed well. Meanwhile, as this operation may destroy the spatial distributions of the reference images, the special patterns (e.g., logos) are not generated. The flow-based operation can build correlations between adjacent deformations. Thus, it helps to reconstruct the patterns and complex textures by extracting whole local patches from the reference images. However, as shown in the bottom three rows of Fig. 6, this operation fails to estimate accurate deformations for all target positions, which may lead to inaccurate target structures. Our full model benefits from both transformation modules and generates images with not only accurate structures but also realistic details. Meanwhile, compared with the w/o face model, the full model produces more realistic faces, which improves the perceptual quality of the final images.

Metric    Baseline   Attn Model   Flow Model   w/o Face   Full Model
SSIM ↑    0.6997     0.7119       0.7084       0.7124     0.7113
LPIPS ↓   0.2189     0.1852       0.1911       0.1778     0.1813
FID ↓     16.283     10.636       10.535       9.9336     9.4502

Table 2. The evaluation results of the ablation study.

The quantitative evaluation results are shown in Tab. 2. It can be seen that the baseline model fails to achieve good evaluation results. The main reason is that this model lacks efficient spatial transformation blocks to deform the reference information; thus, the network cannot obtain aligned feature maps to synthesize the final images. Compared with the baseline, both the attention model and the flow model achieve significant performance gains. This indicates that designing efficient spatial transformation modules is crucial for this task. Meanwhile, the attention model achieves better reconstruction scores than the flow model, which confirms that the attention operation helps to generate accurate target structures. However, its poorer FID score indicates that the realism of its final images is degraded since it cannot preserve vivid reference details. By combining the attention operation with the flow-based operation, both the w/o face model and the full model obtain additional performance improvements. This combination helps the models exploit the complementary advantages of the deformation blocks. Although the w/o face model achieves slightly better LPIPS and SSIM scores, as shown in Fig. 6, the sensitivity of the human visual system to faces means that the face reconstruction loss yields a dramatic improvement in visual quality.

Figure 7. The results of the portrait image editing task. The top two rows show the results of pose editing. The bottom row shows the results of expression editing.


4.3. Portrait Image Editing

In this subsection, we further apply our model to the portrait image editing task. To achieve intuitive portrait image control, we employ three-dimensional morphable face models (Blanz and Vetter, 1999; Paysan et al., 2009) (3DMMs) to describe the motions of the faces. 3DMMs allow users to control 3D face meshes with fully disentangled semantic parameters (e.g., shape, pose). The VoxCeleb dataset (Nagrani et al., 2017), which contains talking-head videos, is used for training and testing. The face reconstruction model proposed in (Deng et al., 2019) is employed to extract the corresponding 3DMM parameters from the images. Face landmarks rendered from the extracted 3DMM parameters are used as the semantic guidance for the reference and target face images. Images from the same video are randomly paired as the reference and target images. After training, users can control the motions of portrait images by providing specific 3DMM parameters. Example results are shown in Fig. 7. Our model enables intuitive editing of the poses and expressions of a given portrait image, which will find a large variety of applications in industries such as social media and virtual reality. Meanwhile, our model can generate images with realistic details and accurate motions. Please find more details in the Supplementary Materials.

5. Conclusion

We have proposed a novel model for the pose-guided person image synthesis task by combining the attention operation with the flow-based operation. We empirically demonstrate the advantages and disadvantages of these two operations for spatial transformation. Ablation studies show that our model can exploit their complementary advantages to generate images with not only accurate global structures but also realistic local details. Both subjective and objective experiments show the superiority of the proposed model compared with state-of-the-art methods. Besides, additional experiments on the portrait image editing task demonstrate that the proposed combination can be flexibly applied to different types of data.

References

  • V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194. Cited by: §4.3.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
  • Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §2.
  • Y. Deng, J. Yang, S. Xu, D. Chen, Y. Jia, and X. Tong (2019) Accurate 3d face reconstruction with weakly-supervised learning: from single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §4.3.
  • P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8857–8866. Cited by: §2, Figure 4, §4.1.
  • I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §1.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §1.
  • X. Han, X. Hu, W. Huang, and M. R. Scott (2019) Clothflow: a flow-based model for clothed person generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10471–10480. Cited by: §2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.1.
  • X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189. Cited by: §2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §3.3.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1.
  • Y. Li, C. Huang, and C. C. Loy (2019) Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3693–3702. Cited by: §2, Figure 4, §4.1.
  • M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019a) Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10551–10560. Cited by: §2.
  • W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao (2019b) Liquid warping gan: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5904–5913. Cited by: §2.
  • Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104. Cited by: §4.
  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. arXiv preprint arXiv:1705.09368. Cited by: §1, §2.
  • L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108. Cited by: §2.
  • Y. Men, Y. Mao, Y. Jiang, W. Ma, and Z. Lian (2020) Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5084–5093. Cited by: §1, §2, Figure 4, §4.1.
  • A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §4.3.
  • P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance, pp. 296–301. Cited by: §4.3.
  • Y. Ren, G. Li, S. Liu, and T. H. Li (2020) Deep spatial transformation for pose-guided person image generation and animation. IEEE Transactions on Image Processing. Cited by: §1, §2, §3.3, Figure 4, §4.1.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. arXiv preprint arXiv:1606.03498. Cited by: §4.
  • A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe (2018) Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3408–3416. Cited by: §1, §2, Figure 4, §4.1.
  • J. Tang, Y. Yuan, T. Shao, Y. Liu, M. Wang, and K. Zhou (2021) Structure-aware person image generation with pose decomposition and semantic correlation. arXiv preprint arXiv:2102.02972. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §1.
  • T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713. Cited by: §2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §4.1.
  • X. Yu, Y. Chen, S. Liu, T. Li, and G. Li (2019) Multi-mapping image-to-image translation via learning disentanglement. In Advances in Neural Information Processing Systems, Cited by: §2.
  • P. Zablotskaia, A. Siarohin, B. Zhao, and L. Sigal (2019) DwNet: dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139. Cited by: §2.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354–7363. Cited by: §1.
  • J. Zhang, K. Li, Y. Lai, and J. Yang (2021) PISE: person image synthesis and editing with decoupled gan. arXiv preprint arXiv:2103.04023. Cited by: §2.
  • P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen (2020) Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153. Cited by: §2.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §4.1.
  • X. Zhou, B. Zhang, T. Zhang, P. Zhang, J. Bao, D. Chen, Z. Zhang, and F. Wen (2021) CoCosNet v2: full-resolution correspondence learning for image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11465–11475. Cited by: §2.
  • Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2347–2356. Cited by: §2, Figure 4, §4.1.