Towards Fine-grained Human Pose Transfer with Detail Replenishing Network

05/26/2020 ∙ by Lingbo Yang, et al. ∙ 2

Human pose transfer (HPT) is an emerging research topic with huge potential in fashion design, media production, online advertising and virtual reality. For these applications, the visual realism of fine-grained appearance details is crucial for production quality and user engagement. However, existing HPT methods often suffer from three fundamental issues: detail deficiency, content ambiguity and style inconsistency, which severely degrade the visual quality and realism of generated images. Aiming towards real-world applications, we develop a more challenging yet practical HPT setting, termed Fine-grained Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail replenishment. Concretely, we analyze the potential design flaws of existing methods via an illustrative example, and establish the core FHPT methodology by combining the ideas of content synthesis and feature transfer in a mutually-guided fashion. Thereafter, we substantiate the proposed methodology with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine model training scheme. Moreover, we build a complete suite of fine-grained evaluation protocols to address the challenges of FHPT in a comprehensive manner, including semantic analysis, structural detection and perceptual quality assessment. Extensive experiments on the DeepFashion benchmark dataset have verified the power of the proposed benchmark against state-of-the-art works, with 12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and nearly 40% gain on face identity preservation. Moreover, the evaluation results offer further insights into the subject matter, which could inspire many promising future works along this direction.




I Introduction

Human pose transfer (HPT) is an emerging research topic that has attracted increasing attention recently. Aiming at synthesizing person images under new target poses while preserving the appearance of a given source image, HPT holds huge potential for empowering numerous creative applications, such as automatic fashion design, creative media production, online advertising and virtual reality. For these applications, users would most likely focus their attention on semantically meaningful and detail-rich regions, such as the face and clothes. Therefore, the ability to preserve semantic information and replenish fine-grained appearance details is crucial for the performance and user experience of an HPT model.

Fig. 1: A qualitative illustration of the FHPT objectives achievable by our method and state-of-the-art HPT methods: DSC [35] and PATN [45]. The detailed regions in colored bounding boxes are enlarged on the right. Please zoom in for details.

Inspired by image-to-image translation [14], early HPT solutions [26][27][6] adopt a global predictive strategy to directly translate the source image onto target poses by employing the U-net architecture with skip connections to propagate low-level features. However, due to the lack of localized deformation modeling, U-net based global methods often fail to properly handle the misalignment between source and target poses, leading to detail deficiency in synthesized person images, such as over-smoothed clothes or distorted faces. Instead, Siarohin et al. propose a modified feature fusion mechanism [35] for warping the appearance features extracted from the source image onto the target pose with part-wise affine transformation, opening up a local warping approach for fine-grained appearance transfer. Thereafter, extensive research efforts have been dedicated to better modeling of body deformation and local feature transfer, including thin-plate spline functions [5], local attention [45][33], optical flow [38][43] and 3D surface models [21][32][7]. However, because they rely on exact feature correspondences for detail reconstruction, local warping methods typically suffer from content ambiguity in invisible regions, which arise from the frequent viewpoint variations and occlusions in real-world applications. To simultaneously overcome detail deficiency and content ambiguity, hybrid methods have emerged that attempt to replenish new content with an additional predictive branch [38][37][23]. Yet the replenished contents often exhibit inferior perceptual quality to local-warped contents, incurring style inconsistency in blended images. Evidently, for real-world HPT applications, it is crucial to replenish fine-grained appearance details in a style-consistent manner to improve performance and user experience.

To narrow the theory-practice gap of HPT, we aim towards a more practical and challenging setting of HPT, termed Fine-grained Human Pose Transfer (FHPT). Specifically, FHPT addresses the aforementioned issues of detail deficiency, content ambiguity, and style inconsistency by placing more emphasis on the preservation and replenishment of fine-grained semantic and appearance details, including facial identity, hairstyle, cloth fabrics, and small body parts (Fig. 1). To implement FHPT, we propose the Detail Replenishing Network (DRN) with two distinctive designs: a style-guided detail replenishment module to enforce the style consistency of generated contents across the entire human body, and an intermediate feature-sharing path to facilitate the mutual guidance between the global predictive and local warping branches. Moreover, we establish a comprehensive suite of fine-grained evaluation protocols for more reliable and accurate measurement of the model capability towards FHPT objectives, including face identity preservation, keypoint localization, and content-based image retrieval. Extensive experiments carried out on the DeepFashion [24] dataset verify the efficacy of our proposed DRN in preserving semantic attributes in the source image, as well as replenishing fine-grained appearance details in a style-consistent fashion. Compared to existing baselines, our method achieves 12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and nearly 40% gain on face identity preservation, establishing a strong baseline method for FHPT.

This manuscript extends our previous work [40] in three aspects. Firstly, we formally develop FHPT, a more practical and challenging HPT scenario with a higher emphasis on fine-grained semantic fidelity and detail quality. Secondly, we propose the DRN for fine-grained person image generation and additionally address the limitation in identity preservation of our previous model by incorporating a facial attribute transfer module, leading to nearly 40% higher facial identity preservation in synthesized images. Thirdly, we develop a complete suite of fine-grained evaluation protocols for FHPT, offering new insight into the subject matter. All these efforts not only promote HPT towards a more challenging yet practical level, but will also benefit other semantic-guided synthesis tasks, such as face renovation [HiFaceGAN] and street view synthesis [38].

The rest of the paper is organized as follows: Section II provides a retrospect on existing HPT works, covering their architectural designs and limitations with respect to FHPT challenges. Following the retrospect, Section III establishes the new objectives, evaluation criteria, and a novel “detail-replenishment” methodology for the FHPT task, and the corresponding benchmark implementation is detailed in Section IV. Finally, we report the experimental results in Section V and conclude the paper in Section VI.

II A Retrospect on Human Pose Transfer

As an emerging topic in computer vision, HPT is still in its infancy, where the first relevant research [26] was not proposed until 2017. Driven by different research ideas in various related fields, including conditional GANs [30], style transfer [16], human pose estimation [3], and computer graphics [25], HPT has been evolving in a multifaceted manner. Yet there is still no conclusive answer to the rationality of these ideas, and their contribution towards the HPT objectives. Furthermore, most HPT works directly adopt existing evaluation metrics from related research fields, with limited ability in addressing the multifaceted nature of the HPT problem. In retrospect, the area of HPT would benefit from a more detailed and comprehensive objective design, as well as the corresponding evaluation protocols and research methodology towards practical applications. Here we provide a brief overview of the existing HPT methods and evaluation criteria, highlighting their potential limitations for preservation and replenishment of fine-grained semantic and appearance details.

Fig. 2: Illustration of different HPT network architectures and their potential design flaw. (a) Global predictive method — detail deficiency; (b) Local warping method — content ambiguity; (c) Hybrid method with independent branches — style inconsistency; (d) Our proposed detail replenishing network conducts better feature utilization and mutual guidance with a feature-sharing path between two branches, which in consequence generates results with superior quality.

II-A HPT Methods

Based on the feature transfer mechanism, we can group existing HPT methods into three categories: global predictive methods, local warping methods, and hybrid methods. Here we illustrate the behavior of different methods with a typical example in Fig. 2, where the source image contains a person in a blue-yellow checkerboard shirt partially occluded by his arm, and the target pose expects the person to move his arm away. By reasoning about the expected output of each method, we can reveal their potential limitations in two aspects: the ability to preserve semantic and appearance details in the source image, and the ability to synthesize new content in occluded regions without exact feature correspondences.

Global predictive methods [26][27][6] typically formulate HPT as a multi-modal image-to-image translation problem [14] and utilize the U-net architecture with skip connections for feature propagation. The pose guidance is introduced by encoding joint locations into a spatial heatmap [26] and concatenating it with the input source image. Specifically, as shown in Fig. 2 (a), the source pose $P_s$ and target pose $P_t$ are encoded into spatial guidance maps and concatenated with the source image $I_s$ to directly generate a prediction $\hat{I}_t$ of the target image $I_t$. However, such works often cannot reliably cope with the structural misalignment between different poses [35] due to the lack of accurate deformation modeling. Alternatively, several works in human motion transfer [4][38][41] aim to directly translate pose features to video frames, but often suffer from limited generalization towards new persons due to the lack of appearance guidance. Generally speaking, global predictive methods typically suffer from the inability to capture localized feature correspondences, which often leads to detail deficiency in synthesized images, e.g. blurry details and distortion artifacts, Fig. 2 (a).

Local warping methods take inspiration from spatial transformer networks [15] and incorporate deformation modeling into the feature propagation mechanism. Deformable Skip-Connection (DSC) [35] was proposed for warping local features through part-wise affine transformation, establishing a new HPT methodology that later inspired a large body of work. Following DSC, Thin-Plate Splines [2] have further facilitated non-linear warping estimation [5], and local attention mechanisms have been incorporated for increased flexibility in deformation modeling [45][33]. Moreover, the rapid advancement in automated dense 3D annotation [8][17] has enabled pixel-level estimation of feature warping flow, allowing fine-grained appearance transfer from source to target images [21][32][7].

The deformation mapping of the human body between different poses is usually determined via interpolation over a finite set of keypoint correspondences. Specifically, denoting the regions occupied by the human body in the source and target images as $\Omega_s$ and $\Omega_t$, one can formulate the deformation as a continuous mapping $f: \Omega_s \to \Omega_t$ that carries keypoints in the source pose to the target pose, i.e. $f(p_s^k) = p_t^k$ for every keypoint index $k$. Usually such a mapping is not unique, which necessitates an additional regularization term, sometimes referred to as the “bending energy” [2], to mitigate the distortion of warped contents. The resulting mapping is usually composed of part-wise affine transformations [35] or thin-plate splines [5]. Afterward, the output image is reconstructed from the image features encoded from the source image $I_s$ and warped by the estimated deformation, which can be denoted as $\hat{I}_t = D(f(E(I_s)))$ with encoder $E$ and decoder $D$.

However, due to the frequent viewpoint changes and self-occlusions, there is no guarantee that the estimated warping would cover the entire target human body, i.e. $f(\Omega_s) \subsetneq \Omega_t$ in general. Hence it would be difficult for a local warping network to faithfully recover the underlying content without exact correspondence in the source image, leading to content ambiguity in uncovered regions, Fig. 2 (b).
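As a rough illustration of how a part-wise affine transformation can be estimated from keypoint correspondences, the following sketch fits a single 2D affine map by least squares. The function names `fit_affine` and `apply_affine` are hypothetical, and the sketch makes no claim to match the implementation in [35]; it only shows the basic idea for one body part.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2D affine map A (2x3) such that dst ~ A @ [x, y, 1]."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    ones = np.ones((src.shape[0], 1))
    X = np.hstack([src, ones])                      # (N, 3) homogeneous coordinates
    A_T, *_ = np.linalg.lstsq(X, dst, rcond=None)   # solves X @ A.T ~ dst
    return A_T.T                                    # (2, 3)

def apply_affine(A, pts):
    """Apply the 2x3 affine map to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Three keypoints of one (hypothetical) body part in source and target pose.
source_kps = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_kps = [[1.0, -1.0], [3.0, -1.0], [1.0, 1.0]]   # scale by 2, shift by (1, -1)
A = fit_affine(source_kps, target_kps)
warped = apply_affine(A, [[1.0, 1.0]])
```

With at least three non-collinear keypoints per part the fit is well determined; any pixel of the part that has no counterpart in the source image remains uncovered, which is the content ambiguity discussed above.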

Hybrid methods [32][23][43][37] aim to hallucinate new contents in uncovered regions with another global predictive branch, and blend the global and local generation results according to an estimated composition mask. Such a design is a common practice for video frame prediction [38][37], where there is little motion between consecutive frames and warping via optical flow [13] is convenient. Furthermore, Zheng et al. [43] designed an unsupervised flow learning scheme to tackle the unpaired scenario. However, existing hybrid methods typically focus on blending the generation results at the image level, whereas intermediate-level feature fusion is less explored. Specifically, the global predictive branch often works separately from the local warping branch, which could incur style inconsistency between hallucinated and warped contents, Fig. 2(c).

Considering the requirements of real-world applications, it is crucial to develop a more suitable approach for replenishing fine-grained appearance details in uncovered regions without exact feature correspondences. Specifically, the semantic fidelity between source and target images, as well as the style consistency across the entire human body, should be effectively maintained.

II-B HPT Evaluation Criteria

For HPT, the quality of generated person images naturally comprises multiple factors, including perceptual quality, structural integrity, and semantic fidelity. Therefore, it is crucial to have well-designed measures addressing these factors simultaneously for reliable and accurate judgment on the capability of HPT models. However, most evaluation metrics are designed towards specific objectives, with less discriminative power in other aspects. For instance, PSNR is designed for measuring signal-level fidelity, but is not well-suited for evaluating perceptual quality [31]; Inception Score [34] is designed for measuring images with a multi-modal distribution, and is less reliable for the HPT task containing only a single object category [1]. Therefore, existing works typically introduce auxiliary evaluation tasks to validate the performance in other aspects, such as person detection [35][32], keypoint localization [45][4] and attribute prediction [21]. However, most auxiliary evaluation tasks focus on high-level semantic aggregation, with less discriminative ability on fine-grained details. Also, the corresponding solutions to these tasks are by design robust, or in other words, insensitive to the noise and artifacts in generated images. Several works [26][35][45] have utilized the generated images for data augmentation on person Re-ID, but only on the low-resolution (128×64) Market1501 [44] benchmark dataset, making it unreliable for measuring the visual quality of high-resolution and detail-rich images. Therefore, it is vital to develop a complete suite of evaluation protocols to assess HPT models in a reliable, accurate, and comprehensive fashion, which we will elaborate in the next section.

III Towards Fine-grained Human Pose Transfer

In this section, we develop the proposed FHPT in three aspects: objectives, evaluation criteria, and the corresponding methodology. Specifically, we establish a new set of FHPT objectives to address the requirements in practical application scenarios with user interactions. Based on the proposed objectives, we develop a complete suite of fine-grained evaluation protocols for a more comprehensive, reliable, and accurate measurement of the capability of FHPT models. To fulfill these FHPT objectives, we provide a new detail replenishment methodology along with a corresponding benchmark solution that will be further detailed in Sec. IV.

III-A FHPT Objectives

Considering the requirements for creative design applications with complex garments and accessories of diverse color, fabric, texture, FHPT aims to focus on better style representation and attention to pattern details, and satisfy the natural inclination of human perception towards semantically-meaningful and detail-rich contents. Different from HPT objectives with a bias towards the accuracy of pose transfer, the proposed FHPT further emphasizes the preservation and replenishment of fine-grained semantic and appearance details, including facial identity, hairstyle, cloth fabrics, patterns, and small body parts, Fig. 1. Specifically, the FHPT objectives comprise three aspects:

  • Perceptual Realism: The generated images should look natural and appealing, with rich, convincing appearance details over the entire image. Specifically, for hybrid networks, the predicted and warped image contents should be consistent in style and quality.

  • Structural Integrity: The generated images should fit the target pose well without noticeable structural distortion, particularly for facial landmarks and small body parts.

  • Semantic Fidelity: The generated images should preserve all necessary semantic attributes in the source image that help determine the person’s identity, including both the facial attributes and the clothing appearance, such as color, hairstyle, and fabric.

III-B FHPT Evaluation Criteria

To better inspire the design of network architectures, loss functions, and training schemes for the proposed FHPT and offer accurate and reliable measurement for the corresponding models, we establish a comprehensive suite of fine-grained evaluation protocols targeting the FHPT objectives: semantic fidelity, structural integrity, and perceptual realism. Below we detail the evaluation protocols for each objective.

Perceptual Evaluation Protocols Existing works often adopt Structural Similarity (SSIM) [39] and Inception Score (IS) [34] to account for the statistical and perceptual fidelity of generated images. However, a recent study [1] has shown that the Inception Score is sensitive to network weights, batch size, and data distribution, making it unreliable for measuring the quality of generative models. To address this issue, we introduce two supervised perceptual metrics, FID [10] and LPIPS [42]. Both metrics utilize a pre-trained network to project images onto a feature space and compute the distance between image features, either for the entire distribution (FID) or for individual pairs of samples (LPIPS). Compared with the unsupervised IS metric, the FID and LPIPS metrics can better reflect the perceptual quality of generated images.

Structural Evaluation Protocols We introduce a new metric called Keypoint Error Curve (KEC) to evaluate the localization accuracy of small body parts in generated images. It is similar to the PCKh metric in pose estimation tasks [45], but refined in two ways: (1) In addition to body joints, we also extract landmarks for the face and both hands, and evaluate the accuracy separately for each body part. This helps better reflect the capability of an FHPT model in maintaining the structural integrity of small body parts. (2) We provide a more detailed performance profile by evaluating the keypoint estimation accuracy at a set of adaptively selected anchors. For each given threshold $\delta$, we calculate the percentage of detected keypoints whose distance to the ground truth is smaller than $\delta$ and plot the accuracy curve.
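The accuracy-versus-threshold curve described above can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `keypoint_error_curve` is hypothetical, distances are plain Euclidean pixel distances, and any normalization (such as the head-size normalization used by PCKh) is assumed to have been applied beforehand.

```python
import numpy as np

def keypoint_error_curve(pred, gt, thresholds, valid=None):
    """Fraction of keypoints whose distance to ground truth is below each threshold.

    pred, gt:   (N, 2) arrays of keypoint coordinates for one body part group.
    thresholds: iterable of distance thresholds (the anchors of the curve).
    valid:      optional (N,) boolean mask of keypoints that were detected.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)
    if valid is not None:
        dist = dist[np.asarray(valid, dtype=bool)]
    if dist.size == 0:
        return [0.0 for _ in thresholds]
    return [float(np.mean(dist < t)) for t in thresholds]

# Toy example: three keypoints with errors 0, 5 and 10 pixels.
pred_kps = [[0, 0], [3, 4], [10, 0]]
gt_kps = [[0, 0], [0, 0], [0, 0]]
curve = keypoint_error_curve(pred_kps, gt_kps, thresholds=[1, 6, 11])
```

Running the function once per group (body, face, hands) gives the separate per-part curves mentioned in point (1).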

Semantic Evaluation Protocols As most semantic information in typical person images concentrates on facial attributes and clothing contents, we propose two well-defined and interpretable measurements for semantic fidelity: content-based image retrieval and face identification. For retrieval, we utilize generated images to query from the database of all corresponding source images, and calculate the retrieval scores using ground truth annotations provided in [24]. Specifically, the retrieval-based evaluation metric is superior to existing distance metrics in three aspects:

  1. The retrieval task is more application-driven with customers being the final judge of content similarity. Thus a well-trained retrieval system will work effectively in extracting the most informative and discriminative features closely related to human perception.

  2. The retrieval-based evaluation protocol does not require the ground truth to be under the same pose as the query, making it more practical for real application purposes. Furthermore, its robustness against pose and view variations also indicates a better focus on semantic and appearance details.

  3. The reliability of retrieval-based metrics can be easily quantified by querying with real images, and further improved by fine-tuning the retrieval system over the testing dataset. In contrast, the performance bound for existing perceptual metrics is always the same, 0 for LPIPS and 1 for SSIM, making it difficult to quantify and improve the reliability of such metrics.
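To make the retrieval-based score concrete, the sketch below computes a top-k recall from ranked retrieval results. It assumes that a retrieval system has already produced, for each generated query image, a list of gallery identifiers sorted by similarity; the function name `top_k_recall` and the data layout are illustrative assumptions, not the protocol's actual interface.

```python
def top_k_recall(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item among the top-k results.

    ranked_ids:   list of lists, gallery ids per query sorted by similarity.
    relevant_ids: list of sets, ground-truth matching gallery ids per query.
    """
    if not ranked_ids:
        raise ValueError("at least one query is required")
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)

# Two toy queries: the first finds its match at rank 3, the second at rank 1.
rankings = [["a", "b", "c"], ["d", "e", "f"]]
matches = [{"c"}, {"d"}]
recall_at_1 = top_k_recall(rankings, matches, k=1)
recall_at_3 = top_k_recall(rankings, matches, k=3)
```

Querying the same system with real target images instead of generated ones gives the reference score mentioned in point 3.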

To measure the consistency of face identity across source/target image pairs, we extract feature embeddings of cropped facial regions with a pre-trained face recognition model [19], and compare the embedding distance between each pair. If the distance is lower than a threshold $\tau$, the person’s identity is considered to be preserved. In practice, we evaluate at two values of $\tau$.
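The thresholded identity check can be sketched as below. The embeddings are assumed to come from some pre-trained face recognition model and are represented here as plain arrays; the use of Euclidean distance on L2-normalized embeddings and the function names are assumptions made for illustration, since the text does not specify the distance.

```python
import numpy as np

def l2_normalize(emb, eps=1e-12):
    """Scale each embedding vector to unit length."""
    emb = np.asarray(emb, dtype=float)
    norm = np.linalg.norm(emb, axis=-1, keepdims=True)
    return emb / np.maximum(norm, eps)

def identity_preservation_rate(src_emb, gen_emb, tau):
    """Fraction of pairs whose embedding distance falls below the threshold tau."""
    dist = np.linalg.norm(l2_normalize(src_emb) - l2_normalize(gen_emb), axis=-1)
    return float(np.mean(dist < tau))

# Toy embeddings: the first pair matches in direction, the second is orthogonal.
source_embeddings = [[1.0, 0.0], [0.0, 2.0]]
generated_embeddings = [[3.0, 0.0], [4.0, 0.0]]
rate_strict = identity_preservation_rate(source_embeddings, generated_embeddings, tau=0.5)
rate_loose = identity_preservation_rate(source_embeddings, generated_embeddings, tau=2.0)
```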

III-C FHPT Methodology

Networks (a) (b) (c) (d)
Explicit deformation modeling
Pose-guided appearance transfer
Multi-branch architecture
Mutual guidance across branches
Fine-grained detail synthesis
TABLE I: Main capabilities of different models, (a) Global predictive methods, (b) Local warping methods, (c) Hybrid methods, (d) Our DRN.

So far, we have discussed existing HPT approaches in terms of feature utilization and content generation schemes, and highlighted their potential flaws in addressing the FHPT challenges. Specifically, a fundamental problem in FHPT lies in the preservation and replenishment of fine-grained semantic and appearance details, which has been explored from two separate directions. One is to directly synthesize new content based on high-level semantic guidance as in global predictive methods [27][29], and the other is to transfer low-level visual features from the source image, as in local warping methods with warping flow [38] or attention mechanisms [45]. Consequently, the task of human pose transfer is also known by different names, such as “pose-guided human (person) image generation (synthesis)”, indicating the methodological divergence in existing research efforts. Both approaches have their advantages and limitations, which necessitates an effective integration scheme to maximize the complementary effects between the two. With the emergence of hybrid architectures, the two opposite ends have started to join together, but the integration result is still sub-optimal, mainly due to the inconsistency between globally and locally generated content.

Moving towards FHPT, the transfer of available image contents and the synthesis of new content should be carried out in a mutually-guided fashion to simultaneously promote the style consistency and detail quality of the final output. To this end, we propose the Detail Replenishing Network (DRN) to substantiate the FHPT methodology with two distinctive architectural designs: 1) a guided detail replenishing module to enforce style-consistency; 2) an intermediate feature sharing pathway to promote mutual guidance. Table I summarizes the main capabilities of DRN against existing networks in Fig. 2(a)-(c), where DRN works in a “coarse-to-fine” fashion by synthesizing a global detail replenishing residual map $R$ upon the coarse estimation $\tilde{I}_t$, leading to detail-rich, unambiguous and style-consistent generation results. In the next section, we will detail the implementation of individual components of the conceptualized FHPT model in Fig. 2(d), and introduce the corresponding model learning scheme to better fulfill the proposed FHPT objectives.

Fig. 3: Overview of the proposed DRN. The pose transfer branch first estimates a coarse output $\tilde{I}_t$ under the target pose. The content feature map $F_c$ provides spatial guidance to the detail replenishing branch, which then generates a residual map $R$ to refine the coarse output. The detail replenishing branch also helps preserve the person’s identity by generating the face under the target view given the source face $I_s^f$ (if visible) and the target sketch $S_t$.

IV Detail Replenishing Network

In this section, we illustrate the architectural design and training details of the proposed Detail Replenishing Network (DRN), which contains two branches: a pose transfer branch for acquiring a coarse estimation and providing spatial guidance, and a detail replenishing branch for refining local visual details in a style-guided fashion (Fig. 3).

IV-A Network Architecture

Pose Representation The human pose representation can be implemented in various ways depending on resource constraints and application considerations. In this paper, we choose a computationally efficient pose representation of sparse 2D keypoints. For each image in the training dataset, we extract an 18-point pose skeleton using a pretrained estimation network [3], and convert the skeleton coordinates into an 18-channel pose heatmap. Dense representations, such as body parsing labels [5], optical flows [37] or pseudo-3D surfaces [32], are also compatible with our framework at the expense of computational cost and manipulation inflexibility. Furthermore, we extract the face bounding box using a lightweight face detection library [19] to help preserve necessary facial attributes. In addition, the landmarks of the target face are estimated with [3] and encoded as a sketch $S_t$, which will be provided during inference if the front view is available in the target image.
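As an illustration of the keypoint-to-heatmap conversion, the sketch below renders each joint as a 2D Gaussian in its own channel. The Gaussian encoding, the `sigma` value and the function name `keypoints_to_heatmap` are assumptions made for this example; the text only states that 18 keypoints are converted into an 18-channel heatmap.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, height, width, sigma=6.0):
    """Encode keypoints as a (K, H, W) heatmap with one Gaussian blob per joint.

    keypoints: list of (x, y) tuples, or None for joints that were not detected.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for channel, kp in enumerate(keypoints):
        if kp is None:
            continue  # undetected joint: leave the channel all zeros
        x, y = kp
        heatmap[channel] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap

# Toy skeleton with 18 joints on an 8x10 canvas; the second joint is missing.
skeleton = [(5, 3), None] + [(0, 0)] * 16
pose_map = keypoints_to_heatmap(skeleton, height=8, width=10)
```

The source and target heatmaps produced this way can then be stacked along the channel axis to form the pose pair fed to the pose transfer branch.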

Pose Transfer Branch takes in the source image $I_s$ and the paired pose representation $(P_s, P_t)$, and aims to generate a coarse estimation $\tilde{I}_t$ of the target image. The network begins with several down-sampling convolutional layers to encode the contents of the source image and the pose dependencies between $P_s$ and $P_t$. To better capture the relationship between different poses, the pose pair is concatenated along the depth axis before being sent into the encoder. We leverage the design in [45] by introducing several cascaded transfer blocks to encourage a smooth transition of encoded contents. The resulting feature map $F_c$ is roughly aligned with the target pose, and will serve as the spatial guidance for detail replenishment. Finally, we decode $F_c$ with several upsampling layers to acquire a coarse estimation $\tilde{I}_t$ of the target image.


Note that for our proposed framework, it is crucial that the transfer branch provide accurate spatial guidance for the upcoming detail replenishing branch. In particular, we do not want any fake details to be injected into the guidance map $F_c$ and the coarse estimation $\tilde{I}_t$. Instead, the transfer branch is supposed to provide useful hints on “where to add what kind of details” and let the detail replenishing branch do the work. This allows a coarse-to-fine generation of the final image with better visual quality. Ablation studies are provided in Section V to justify our claims.

Detail Replenishing Branch is composed of several detail-replenishing modules (DRM), each of which specializes in refining the entire image or a specific region. For our benchmark solution, we adopt both types of DRMs to refine both the entire image and the face region. The global module predicts a residual map $R$ from the spatial guidance map $F_c$ and the source image $I_s$, and adds the residual map onto the coarse result $\tilde{I}_t$, while the regional (face) module directly synthesizes the target face image $\hat{I}_t^f$ under the target landmark sketch $S_t$ according to the facial attributes extracted from the input face crop $I_s^f$. In this way, the face module can be trained independently from the global module and can benefit from additional face data. To assemble the final output, we perform an alpha blending with a Gaussian-blurred weight mask $M = K_\sigma * \mathbb{1}_{face}$, where $*$ denotes the 2D convolution operator, $K_\sigma$ is a discrete 2D Gaussian kernel, and $\mathbb{1}_{face}$ is an indicator function that equals 1 if the corresponding pixel belongs to the face bounding box and 0 elsewhere. Incidentally, we only need to activate the face enhancing module if the frontal face is available in both the source and the target image. Finally, blending the two results with this mask leads to the final result:

$$\hat{I}_t = M \odot \hat{I}_t^f + (1 - M) \odot (\tilde{I}_t + R),$$

where $\odot$ denotes element-wise multiplication and $\hat{I}_t^f$ is placed at the face bounding box.
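The blending step can be sketched with NumPy as follows. This is a minimal illustration under several assumptions: images are `(H, W, C)` float arrays, the synthesized face has already been pasted onto a full-size canvas, zero padding is used at the borders, and the kernel size and `sigma` are arbitrary. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Discrete 2D Gaussian kernel normalized to sum to one."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-ax ** 2 / (2 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def conv2d_same(img, kernel):
    """Zero-padded 2D filtering with output of the same size as the input.

    The kernel is symmetric here, so correlation and convolution coincide.
    """
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="constant")
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def blend_face(global_img, face_canvas, box, size=5, sigma=1.0):
    """Alpha-blend the face canvas into the global result with a blurred box mask."""
    h, w = global_img.shape[:2]
    x0, y0, x1, y1 = box
    indicator = np.zeros((h, w))
    indicator[y0:y1, x0:x1] = 1.0              # 1 inside the face box, 0 elsewhere
    mask = conv2d_same(indicator, gaussian_kernel(size, sigma))[..., None]
    return mask * face_canvas + (1.0 - mask) * global_img, mask

# Toy inputs: a black global result and a white face canvas.
global_result = np.zeros((16, 16, 3))
face_result = np.ones((16, 16, 3))
blended, soft_mask = blend_face(global_result, face_result, box=(4, 4, 12, 12))
```

Blurring the indicator softens the box boundary, so the face region fades into the surrounding image instead of showing a visible seam.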

Fig. 4: Network architecture of the proposed detail replenishment module.

Detail Replenishment Module The proposed detail replenishment module (DRM) aims to encode the visual attributes of the appearance condition image and generate a detail replenishing map $R$ following the spatial guidance map $F_c$. The network architecture of the DRM is shown in Fig. 4.

We extract the appearance and style information from the source image $I_s$ with an encoder consisting of several convolutional down-sampling layers and residual blocks. To acquire a more robust representation against partially occluded clothes, we perform an adaptive average pooling to convert the encoded feature tensor into a style code $z$. After the style code is acquired, an enhancing residual map $R$ (or target face crop $\hat{I}_t^f$) is generated with respect to both the spatial and appearance guidance:

$$R = G_R(F_c, z),$$

where $G_R$ denotes the residual map generator and $F_c$ is the spatial guidance map described in Section IV-A. Inspired by related works in style-based image generation [18], we utilize adaptive instance normalization (AdaIN) [12] to infuse the code $z$ into the output enhancing map. Note that $z$ now assumes the role of “style-guide” by controlling the pattern and granularity of synthesized details in different regions. The residual map generator is formed by several up-sampling convolutional layers. The content feature map at the coarsest scale is fed into the generator, where semantic and appearance details are gradually infused in a coarse-to-fine fashion across multiple layers. At the $i$-th layer, the style code $z$ is processed with a 3-layer MLP to acquire the modulation weight $\gamma_i$ and bias $\beta_i$, which are then used for controlling the AdaIN operation:

$$\mathrm{AdaIN}(x_i; \gamma_i, \beta_i) = \gamma_i \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + \beta_i,$$

where $\mu(x_i)$ and $\sigma^2(x_i)$ are the per-channel mean and variance of the feature map $x_i$, the input at the $i$-th AdaIN layer.
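The style pooling and AdaIN modulation described above can be sketched in NumPy. This is an illustrative sketch only: the modulation weight and bias are passed in directly rather than produced by the 3-layer MLP, global averaging stands in for adaptive average pooling to a 1x1 output, and the names `style_code` and `adain` are hypothetical.

```python
import numpy as np

def style_code(features):
    """Global average pooling: (C, H, W) feature tensor -> (C,) style vector."""
    return features.mean(axis=(1, 2))

def adain(x, gamma, beta, eps=1e-5):
    """Adaptive instance normalization of a (C, H, W) feature map.

    Each channel is normalized by its own spatial mean and variance, then
    scaled by gamma and shifted by beta (both of shape (C,)).
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma[:, None, None] * x_norm + beta[:, None, None]

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))       # stand-in for a content feature map
gamma = np.array([1.0, 2.0, 3.0, 4.0])     # stand-in for the MLP-predicted weight
beta = np.array([0.0, 1.0, -1.0, 0.5])     # stand-in for the MLP-predicted bias
modulated = adain(content, gamma, beta)
code = style_code(content)
```

After modulation, each channel has mean close to `beta` and standard deviation close to `gamma`, which is how the style code controls the statistics of the generated details while the spatial layout still comes from the content features.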

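Input: Networks (pose transfer branch, detail replenishing branch, discriminators $D_A$ and $D_P$); Dataloader
  repeat
     for each mini-batch in Dataloader do
        Update the pose transfer branch on the coarse estimation
        Update the pose transfer and detail replenishing branches on the final output
        for step in $1, \dots, k$ do
           Update $D_A$ and $D_P$
        end for
     end for
  until Convergence.
Algorithm 1 Alternate optimization algorithm for the proposed network.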

IV-B Training Losses

We adopt slightly different loss functions for two branches. For the transfer branch, the output coarse estimation is NOT expected to contain any fake textures, as explained in section IV-A. Therefore, we adopt the following loss formulation to update the transfer branch without adversarial losses:


where is the pixelwise L1 loss, and is the perceptual loss in [16]:


Here is a pretrained VGG-19 network and denotes the layer index. For the enhancing branch, we further add style and adversarial loss terms upon to promote visual details, leading to the full loss function as follows:


where the style term is the Gram-matrix based style loss [16]:


with the Gram matrix defined as:
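As a rough illustration of the Gram-matrix style loss of [16], the sketch below computes it with NumPy. It assumes the standard formulation from [16] (Gram matrix normalized by C·H·W, squared Frobenius distance summed over layers); random arrays stand in for the VGG-19 features, and the function names are hypothetical.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W as in [16]."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)                 # flatten spatial positions
    return f @ f.T / (c * h * w)               # (C, C) channel correlations

def style_loss(feats_fake, feats_real):
    """Squared Frobenius distance between Gram matrices, summed over layers."""
    return sum(
        np.sum((gram_matrix(a) - gram_matrix(b)) ** 2)
        for a, b in zip(feats_fake, feats_real)
    )

rng = np.random.default_rng(0)
# Stand-ins for feature maps at two layers (real use would take VGG-19 activations).
feats_real = [rng.standard_normal((4, 8, 8)), rng.standard_normal((6, 4, 4))]
feats_fake = [f + 0.1 * rng.standard_normal(f.shape) for f in feats_real]
loss = style_loss(feats_fake, feats_real)

# Shuffling spatial positions leaves the Gram matrix unchanged.
perm = rng.permutation(64)
shuffled = feats_real[0].reshape(4, 64)[:, perm].reshape(4, 8, 8)
```

Because the Gram matrix discards spatial layout, this loss compares texture statistics rather than pixel positions, which is why it complements the L1 and perceptual terms.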

The adversarial loss is evaluated with two conditional discriminators, which measure appearance and pose consistency respectively. Concretely, the appearance discriminator takes a pair of source/target images as input and tries to distinguish between real pairs and fake pairs. Similarly, the pose discriminator takes in image/pose pairs and evaluates whether the generated images are consistent with the given pose conditions. Combining the shape and appearance losses leads to the following formulation:

which is also used for updating the discriminators. In practice, we use the LSGAN variant [28] for better training stability. For face enhancement, the landmark sketch replaces the global pose as the pose condition, and the cropped face patch replaces the source image.
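The least-squares objectives of LSGAN [28] for the two discriminators can be sketched as follows. This is a minimal illustration that assumes the common 0/1 target coding and an unweighted sum over the appearance and pose discriminators; the function names and the example scores are hypothetical, and the actual weighting used in the paper is not recoverable from this section.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: push scores of real pairs to 1 and fake pairs to 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Generator loss: push scores of fake pairs towards 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def total_adv_g_loss(da_fake, dp_fake):
    """Unweighted sum over the appearance and pose discriminators (assumed)."""
    return lsgan_g_loss(da_fake) + lsgan_g_loss(dp_fake)

# Hypothetical discriminator scores for a batch of four pairs.
da_real = np.array([0.9, 1.0, 0.8, 1.1])   # appearance discriminator, real pairs
da_fake = np.array([0.2, 0.1, 0.3, 0.0])   # appearance discriminator, fake pairs
dp_fake = np.array([0.5, 0.4, 0.6, 0.5])   # pose discriminator, fake pairs

d_loss = lsgan_d_loss(da_real, da_fake)
g_loss = total_adv_g_loss(da_fake, dp_fake)
```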

IV-C Optimization Algorithm

We employ a coarse-to-fine gradient descent optimization algorithm to update the components alternately. In each iteration, we first update the pose transfer branch with its loss function over the coarse estimation result. This allows the spatial guidance map to be roughly aligned with the target pose. Then we perform an end-to-end fine-tuning to update the enhancing branch with the detail-aware loss computed over the final output. (The face enhancing module should be optimized separately, as it operates on cropped face regions instead of the entire image, and the facial contents are in principle independent of other body parts.) Note that the pose transfer branch is also fine-tuned in this step, as gradients can be backpropagated through the output of the transfer branch and the preceding pose transfer blocks. The discriminators are then updated for a fixed number of steps, chosen empirically to balance running speed and discriminative capability. The complete training routine is shown in Algorithm 1. Other model learning schemes are compared in the ablation study in Section V-A.
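The order of updates within one iteration can be made explicit with a schematic loop. The sketch below is framework-agnostic and only records the call order; the three callbacks are hypothetical hooks that would each wrap a forward pass, a loss evaluation and an optimizer step, and `d_steps` is a placeholder because the value used in the paper is not recoverable from this section.

```python
def train_alternately(batches, update_transfer, update_full,
                      update_discriminators, d_steps=1):
    """Run one pass over `batches` following the alternate update order."""
    for batch in batches:
        update_transfer(batch)            # coarse loss on the coarse estimation
        update_full(batch)                # detail-aware loss on the final output
        for _ in range(d_steps):          # discriminator updates
            update_discriminators(batch)

# Record the call order for two dummy batches.
log = []
train_alternately(
    batches=[0, 1],
    update_transfer=lambda b: log.append(("transfer", b)),
    update_full=lambda b: log.append(("full", b)),
    update_discriminators=lambda b: log.append(("disc", b)),
    d_steps=2,
)
```

Keeping the coarse update first in every iteration is what gives the enhancing branch a roughly aligned guidance map to work from, in contrast to the End2end scheme evaluated in the ablation study.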

V Experiments

In this section, we evaluate the performance of the proposed DRN against several competitive baseline methods. Furthermore, we perform an ablation study to verify the efficacy of our main contributions.

Datasets We carry out all experiments on the In-shop Clothes Retrieval Benchmark of the DeepFashion dataset [24], which contains more than 50,000 editorial images of fashion models under varying poses with texture-rich garments. Following the pre-processing routine in [45], we crop the background from both sides of the image by 40 pixels, leaving the center 256 × 176 region for training and testing. For pose representations, we extract the body skeletons and facial landmarks using OpenPose [3] and detect the face bounding box using [19]. We use the data partition in [45] with 101,966 pairs for training and 8,570 pairs for testing. The partition ensures that the same person does not appear in both the training and testing splits, so the test scores reflect generalization to unseen identities.
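The side-cropping step can be sketched in a few lines of NumPy. The example assumes a 256 × 256 input stored as a height × width × channel array, which is implied by the 40-pixel margins and the 256 × 176 result; the function name is hypothetical.

```python
import numpy as np

def crop_sides(image, margin=40):
    """Remove `margin` pixel columns from the left and right of an (H, W, C) image."""
    return image[:, margin:image.shape[1] - margin]

image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder for a dataset image
cropped = crop_sides(image)
print(cropped.shape)  # (256, 176, 3)
```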

Implementation Details

We use PyTorch to implement our proposed framework. The transfer branch contains 3 down-sampling layers and 9 cascaded pose transfer blocks [45], and the enhancing branch contains 6 residual blocks [9]. LeakyReLU with a negative slope of 0.2 is applied after each normalization layer. The length of the texture code is fixed to 128; the impact of different lengths is analyzed in the ablation study in Section V-A. The Rectified Adam optimizer [22] is adopted with weight regularization for improved training stability and better final performance. The full framework is updated for 40 batches with a total of 200K iterations. The learning rate is initialized to the same value for all modules and is kept fixed for the first 10 batches before linearly decaying to 0. The loss weights are set to (). For face enhancement, we retain the training samples with paired face bounding boxes and resize the face crops to a fixed size using bicubic interpolation. The face loss is computed by Eqn. (7) with the same loss weights. The training continues for 200 epochs with a fixed learning rate.
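The "hold, then linearly decay to 0" learning-rate schedule can be written as a small function. This is a sketch under assumptions: `base_lr` is a placeholder because the initial value is not recoverable from this section, and the schedule is expressed in generic steps, with 10 and 40 taken from the text as the hold length and the total length.

```python
def learning_rate(step, base_lr, hold_steps, total_steps):
    """Constant for `hold_steps`, then linear decay that reaches 0 at `total_steps`."""
    if step < hold_steps:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - hold_steps)

# base_lr=1.0 is a placeholder; the schedule simply scales whatever value is used.
schedule = [learning_rate(s, base_lr=1.0, hold_steps=10, total_steps=40)
            for s in range(41)]
```

In PyTorch, a function of this shape (returning the multiplier for `base_lr`) could be passed to `torch.optim.lr_scheduler.LambdaLR`.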


V-A Ablation Study

In this section, we compare the proposed DRN against four ablation methods to analyze the impact of different components. The first two aim to verify the proposed network architecture, while the other two focus on the proposed alternate training strategy. We defer the ablation study of the facial detail replenishment module to Section V-B, where a more targeted analysis is provided on cropped facial regions. The settings of each ablation method are as follows:

PB Only We remove the enhancing branch and train the pose transfer branch directly in an end-to-end fashion with loss function . Notice that this setting differs from the PATN [45] baseline where an additional style loss term is introduced with slightly adjusted loss weights.

Texture64dim We reduce the length of the texture code to 64 and keep the rest unchanged. In this way, less appearance information is extracted from the source image, which could result in fewer details and more severe artifacts.

PB Fixed We initialize the parameters of the transfer branch using the pre-trained models in [45] and keep them fixed during training. In this way, the interaction between the two branches is broken, and the detailed style loss and conditional adversarial losses cannot be back-propagated into the transfer branch.

End2end We randomly initialize the parameters and start training from scratch. Instead of the proposed alternate training, we directly update the whole framework with the loss defined over the refined output, without explicit constraints on the coarse estimation. In this way, the detail replenishing branch has to rely on inaccurate spatial guidance maps, which could lead to difficulty in convergence and suboptimal performance.

Method         SSIM     IS       FID      LPIPS
PB Only        0.772    3.231    18.602   0.250
Texture64dim   0.763    3.040    18.263   0.243
PB Fixed       0.767    2.923    20.346   0.238
End2end        0.768    3.147    16.496   0.231
Ours Full      0.774    3.125    14.611   0.218
TABLE II: Quantitative ablation study results. The best result under each metric is shown in bold format.

We present the scores evaluated on the DeepFashion dataset in Table II. Our full framework, DRN, outperforms the other ablation methods on every metric except IS, especially on perceptual metrics such as FID and LPIPS, highlighting the importance of detail replenishment for human image generation. We also note that reducing the dimension of the latent code leads to a significant performance drop, indicating a potential trade-off between perceptual quality and memory cost. Furthermore, Fig. 6 showcases the quality improvement over the other ablation methods. The proposed DRN is capable of creating more authentic appearance details and suppressing repetitive artifacts. Also, due to the guidance from the texture-aware loss over final outputs, it better preserves the integrity of body parts, as observed in the second and the third rows, indicating an increased accuracy of the estimated spatial guidance map. Therefore, the effect of mutual guidance between both modules, along with the alternate optimization algorithm, is further justified.

V-B Comparisons with Previous Works

We choose several representative state-of-the-art works as baselines: DSC [35], UV-Net [6], SPT [36] and PATN [45]. All the tests are carried out on the same set of testing pairs as in [45]. Note that the official DSC implementation requires the input pose to contain full-body joints (shoulders and hips), so we remove the 426 of the 8,570 test pairs that do not meet this criterion and report the scores on the remaining samples.

Qualitative Comparison Fig. 5 showcases the generation results for some challenging examples with large pose variations and rich textures. In general, our method is more faithful to the input conditions and contains more realistic semantic and appearance details, such as clothing fabrics (linen, laces), textural patterns, and hair waves. In particular, our method mitigates the gender bias issue reported in [35], which is caused by the imbalanced gender distribution in fashion images. As shown in the third row, without explicit facial guidance in the source image, all the other methods tend to predict the more frequently occurring female face by mistake. In contrast, our method is more effective in capturing the fine-grained semantic details, and is able to faithfully retain the masculinity of the synthesized person in both coarse and refined outputs.

Perceptual Evaluation As reported in Table III, our DRN achieves significant improvements over recent state-of-the-art methods in terms of perceptual quality: FID drops to 14.611, compared with 19.816 for PATN and 24.479 for DSC, and LPIPS drops to 0.218, compared with 0.253 and 0.233, respectively. The patchwise statistical measure SSIM is comparable across methods. It is noteworthy that although our method has the lowest Inception Score, the qualitative improvement is evident in Fig. 5, which suggests that the proposed perceptual metrics are more informative than the unsupervised IS metric.

Fig. 5: Qualitative results of the proposed method against several competitive baselines. Some images have been cropped for visualization purposes. Please zoom in for details.
Method         SSIM ↑   IS ↑     FID ↓    LPIPS ↓
UV-Net         0.763    3.440    -        -
SPT            0.736    3.441    -        -
DSC            0.762    3.330    24.479   0.233
PATN           0.773    3.209    19.816   0.253
Ours Coarse    0.780    3.230    18.405   0.243
Ours Full      0.774    3.125    14.611   0.218
TABLE III: Performance against other baseline methods. The best and the second best results for each metric are shown in bold / underline format. Up arrow means higher score is preferred, and vice versa.

To further analyze the impact of the proposed detail replenishment modules, we also report the scores tested on coarse estimations without the residual map, as shown in Table III. Although both SSIM and IS scores drop slightly after enhancement, the perceptual fidelity is significantly improved, with a 20% gain on FID and a 10% gain on LPIPS. Furthermore, even without detail replenishment, our pose transfer branch still achieves better perceptual scores than the PATN baseline, verifying the efficacy of our DRN for refining the pose map with detail-aware losses. In summary, DRN is effective at remedying the deficiency of semantic and appearance details incurred during the pose transfer stage, and consistently improves the visual quality of generated human images.
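The relative gains quoted above can be checked directly from the Table III entries. The short sketch below assumes that "gain" means the relative reduction of a lower-is-better metric with respect to the baseline; the function name is hypothetical.

```python
def relative_gain(baseline, ours):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - ours) / baseline

# (FID, LPIPS) values copied from Table III.
scores = {
    "DSC": (24.479, 0.233),
    "PATN": (19.816, 0.253),
    "Ours Coarse": (18.405, 0.243),
}
ours_fid, ours_lpips = 14.611, 0.218   # Ours Full

for name, (fid, lpips) in scores.items():
    print(f"{name}: FID gain {relative_gain(fid, ours_fid):.1f}%, "
          f"LPIPS gain {relative_gain(lpips, ours_lpips):.1f}%")
```

Under this definition, the "Ours Coarse" row gives roughly 20.6% on FID and 10.3% on LPIPS, in line with the 20% and 10% figures stated in the text.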

Fig. 6: Qualitative results of ablation methods. Our full method faithfully captures the texture details while maintaining the integrity of body parts. Please zoom in for details.
Metric            L2 error   Acc. ()   Acc. ()
DSC               0.720      0.078     0.383
PATN              0.706      0.085     0.485
Ours (w/o face)   0.647      0.247     0.752
Ours Full         0.615      0.420     0.873
TABLE IV: Mean distance between feature embeddings and the percentage of identity-preserving cases with images generated from different baselines.
Fig. 7: Keypoint error curves of ablation methods for each body part. Note that the error levels for both hands are doubled to cover the majority of estimated points. Best viewed in color.

Structural Evaluation As shown in Fig. 7, our DRN demonstrates consistent quality improvement over the entire human body. Among different body parts, DRN is especially effective at restoring facial details, even without using ground-truth facial landmarks. For the body and both hands, the accuracy curves of different methods are roughly parallel, with a gap of 5% between the proposed DRN and the original PATN network (PB Only). This indicates that our mutual-guidance design helps refine the structural integrity of generated body parts, removing about 5% of the outliers with large localization errors. In summary, the proposed DRN not only restores convincing appearance details but also helps improve the structural integrity of body parts.
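An accuracy curve of the kind plotted in Fig. 7 can be computed as the fraction of keypoints whose localization error falls below each error level. The sketch below is an assumption about the protocol, which is defined outside this section: it uses the Euclidean pixel distance between keypoints detected on generated and real images, and the coordinates and thresholds are made up for illustration.

```python
import numpy as np

def keypoint_accuracy(pred, target, thresholds):
    """Fraction of keypoints whose localization error is within each threshold.

    pred, target: arrays of shape (N, 2) with (x, y) coordinates in pixels.
    thresholds:   1-D array of error levels in pixels.
    """
    errors = np.linalg.norm(pred - target, axis=1)   # Euclidean error per keypoint
    return np.array([(errors <= t).mean() for t in thresholds])

# Hypothetical keypoints: errors of 1, 3, 10 and 0 pixels.
target = np.array([[10.0, 10.0], [50.0, 80.0], [120.0, 40.0], [90.0, 200.0]])
pred = target + np.array([[1.0, 0.0], [0.0, 3.0], [6.0, 8.0], [0.0, 0.0]])
curve = keypoint_accuracy(pred, target, thresholds=np.array([2.0, 5.0, 10.0]))
```

A constant vertical gap between two such curves, like the 5% gap discussed above, means that one method has that many fewer outliers at every error level.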

Semantic Evaluation We first evaluate the facial identity preservation of different baselines in Table IV by measuring the mean embedding distance and the corresponding percentage of identity-preserving cases. Our basic framework with global DRM already leads existing methods by 27%, and face detail replenishment adds a further 12%. We also present the qualitative results in Fig. 8, where DSC [35] fails to preserve the facial structure and incurs severe landmark distortions. PATN [45] manages to synthesize more plausible faces, but with limited diversity in hairstyles and lip shapes, and sometimes with false attributes, such as the thick black beard for the male in row 4. In contrast, our method preserves prominent facial attributes and corrects mild distortions, leading to better identity consistency and visual realism of generated faces.
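The two kinds of quantities in Table IV, a mean embedding distance and a percentage of identity-preserving cases, can be computed as in the sketch below. It assumes that embeddings come from some face recognition network and that a case counts as identity-preserving when the L2 distance is below a threshold; the embeddings and the threshold value here are made up, since the thresholds used in the paper are not recoverable from this section.

```python
import numpy as np

def identity_metrics(emb_generated, emb_real, threshold):
    """Mean L2 embedding distance and fraction of pairs closer than `threshold`."""
    dist = np.linalg.norm(emb_generated - emb_real, axis=1)
    return dist.mean(), (dist < threshold).mean()

# Hypothetical 2-D embeddings for three generated/real face pairs.
emb_real = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
emb_generated = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mean_dist, acc = identity_metrics(emb_generated, emb_real, threshold=1.0)
```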

Fig. 8: Qualitative results of face attribute transfer with different methods. Our face enhancing module can more faithfully preserve the visual attributes of input faces and prevent distortions. Please zoom in for details.

Also, it is noteworthy that existing works on face view transfer [20][11] mostly deal with face images of the same resolution and visual quality. However, faces in real person images can vary in spatial resolution, leading to potentially severe quality discrepancies between resized image pairs. For example, in row 4 of Fig. 8, the input male face comes from a full-body image and is much blurrier than the target face from an upper-body shot. This issue calls for generative models that are more robust to scale and quality variations, which may constitute another meaningful research topic in the FHPT domain.

(a) The “easy” case
(b) The “challenging” case
(c) The “confusing” case
Fig. 9: Top 10 matching candidates for image retrieval evaluation. Queries are generated with different baselines. From left to right: Input source, generated query, and retrieval results. From top to bottom: DSC [35], PATN [45], the proposed DRN, and the ground truth. Positive and negative items are marked in green checks and red crosses respectively. Zoom in for details.
Recall (%)     ResNet50                    VGG16
Method         Top 3    Top 5    Top 10    Top 3    Top 5    Top 10
PATN           6.58     10.32    17.84     6.43     9.97     16.97
DSC            11.09    17.60    30.07     11.31    17.52    28.93
PB Only        6.56     10.23    17.78     6.35     10.02    17.10
PB Fixed       4.58     7.27     12.91     4.85     7.64     12.92
Tex-64dim      6.59     10.32    17.80     6.95     11.06    18.06
End2end        6.89     10.90    18.67     6.29     9.47     16.33
Ours Full      11.79    18.45    30.79     11.61    18.60    30.23
Real Images    26.13    41.36    67.89     25.08    38.83    62.92
TABLE V: Top-K recall performance on in-shop clothes retrieval task with images generated from different baselines.

Finally, we evaluate the overall content preservation of our proposed method with the newly proposed content-based image retrieval task. Table V presents the Top-K recall of different methods for K = 3, 5, 10 under two different backbones, ResNet and VGG. Our DRN has a significant advantage over the state-of-the-art PATN baseline, with almost twice the recall rate for every K, which demonstrates the efficacy of our texture enhancing module for recovering appearance details lost during the pose transfer stage. It is also evident that the proposed network architecture and training strategy are complementary to each other, as changing either element severely degrades the overall performance, leaving a recall rate barely comparable to that of the PATN baseline.
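For readers unfamiliar with the metric, Top-K recall can be sketched as follows. The example assumes a common definition (a query counts as a hit if at least one gallery item with the same item ID appears among its K nearest neighbours) and cosine similarity between backbone embeddings; the retrieval pipeline actually used is not detailed in this section, and the toy embeddings and IDs are made up.

```python
import numpy as np

def top_k_recall(query_emb, gallery_emb, query_ids, gallery_ids, ks=(3, 5, 10)):
    """Percentage of queries with at least one same-ID gallery item in the top K."""
    # Cosine similarity between L2-normalized embeddings.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    ranking = np.argsort(-(q @ g.T), axis=1)              # best match first
    hits = gallery_ids[ranking] == query_ids[:, None]     # (queries, gallery)
    return {k: 100.0 * hits[:, :k].any(axis=1).mean() for k in ks}

# Toy gallery of four items and two generated queries.
gallery_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
gallery_ids = np.array([0, 1, 2, 3])
query_emb = np.array([[1.0, 0.05], [0.1, 1.0]])
query_ids = np.array([1, 2])

recall = top_k_recall(query_emb, gallery_emb, query_ids, gallery_ids, ks=(1, 2))
```

In this toy case the first query finds its item only at rank 2, while the second finds it at rank 1, so recall rises from 50% at K = 1 to 100% at K = 2.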

We also point out two important observations in Table V. Firstly, the warping-based DSC method significantly outperforms PATN, a more recently published work with better visual quality (see Fig. 5). Secondly, although our method achieves the top recall performance, it is still far less accurate than querying with real images. To better understand the information carried within the extracted feature embeddings, and the similarity criterion used by the retrieval system, we compare the top 10 best matching candidates retrieved from queries generated with different methods.

Three typical examples are shown in Fig. 9, where (a) represents the easy case with input clothes containing few textures, (b) represents the challenging case with clothes of rich textures, and (c) represents the confusing case with two or more items in different textural styles, here a deep gray coat and a leopard-spotted short skirt. In case (a), we note that the retrieval system frequently mismatches a woman in a deep blue skirt, suggesting a higher focus on color consistency over other attributes such as gender and accessories. For case (b), the retrieval results are closer to our expectations, where the number of positive matches grows consistently with the increasing amount of textural detail in query images. For the confusing case (c), DSC successfully recovers the leopard-spotted skirt from the source image, leading to a more accurate retrieval result with correct top-3 candidates, although not positive matches. In contrast, both our DRN and the PATN baseline fail to capture the desired outfit. This could partially explain DSC’s superior performance and help motivate novel hybrid architectures combining the strengths of warping-based and synthesizing-based methods.

In conclusion, our retrieval-based evaluation protocol is more reliable in measuring semantic and appearance consistency of generated human images, especially for the case with rich textures. Also, it is much more flexible in adjusting the similarity criterion according to different attributes, such as colors, fabrics, textures, and hairstyles. We believe that the advances in fashion retrieval systems could help facilitate the design of better similarity criteria that would further boost the visual realism of human-centered image synthesis tasks.

VI Conclusions

Human pose transfer (HPT) is an emerging research topic with huge application potential in creative media applications. Yet current HPT methods typically introduce detail deficiency, content ambiguity or style inconsistency in synthesized person images due to the suboptimal integration between low-level feature transfer and high-level semantic-guided content synthesis. To address existing issues and narrow the theory-practice gap of HPT, we aim towards a more practical and challenging setting, termed Fine-Grained Human Pose Transfer (FHPT). Specifically, we establish new objectives for the FHPT task: perceptual realism, structural integrity, and semantic fidelity. We pair them with a suite of fine-grained evaluation protocols targeted at these objectives, including face identity preservation, keypoint localization, and content-based image retrieval, which together provide a comprehensive, accurate and reliable measurement of model capability in FHPT scenarios. To implement FHPT, we propose a Detail Replenishing Network (DRN) with distinctive structural designs, including a guided detail replenishment branch for style consistency in generated contents, and an intermediate feature-sharing path to facilitate the mutual guidance between the global predictive and local warping branches. As a result, DRN achieves significant performance gains over various HPT baselines, highlighting the challenges in the FHPT task as well as the efficacy of our contributions.

In the future, we aim to investigate more effective semantic guidance and feature utilization schemes for FHPT. For instance, it would be beneficial to incorporate region adaptive style control into the FHPT framework. Also, designing interpretable and customizable evaluation protocols is another promising direction. Finally, although our framework is designed for synthesizing human images, we note that the proposed “detail-replenishment” approach can also benefit other semantic-guided image synthesis tasks, such as face renovation or street view synthesis, leading to enhanced appearance quality and perceptual realism in produced contents.


  • [1] S. Barratt and R. Sharma (2018) A note on the inception score. arXiv preprint. Cited by: §II-B, §III-B.
  • [2] F. L. Bookstein (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence 11 (6), pp. 567–585. Cited by: §II-A, §II-A.
  • [3] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §II, §IV-A, §V.
  • [4] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2018) Everybody dance now. ArXiv abs/1808.07371. Cited by: §II-A, §II-B.
  • [5] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin (2018) Soft-gated warping-gan for pose-guided person image synthesis. In NeurIPS, Cited by: §I, §II-A, §II-A, §IV-A.
  • [6] P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. CVPR, pp. 8857–8866. Cited by: §I, §II-A, §V-B.
  • [7] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and V. Lempitsky (2019-06) Coordinate-based texture inpainting for pose-guided human image generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [8] R. A. Güler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In CVPR, Cited by: §II-A.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §V.
  • [10] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a nash equilibrium. CoRR. Cited by: §III-B.
  • [11] R. Huang, S. Zhang, T. Li, and R. He (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448. Cited by: §V-B.
  • [12] X. Huang and S. J. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. CoRR abs/1703.06868. Cited by: §IV-A.
  • [13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017-07) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §II-A.
  • [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint. Cited by: §I, §II-A.
  • [15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. ArXiv abs/1506.02025. Cited by: §II-A.
  • [16] J. Johnson, A. Alahi, and F. Li (2016) Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155. Cited by: §II, §IV-B, §IV-B.
  • [17] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2017) End-to-end recovery of human shape and pose. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7122–7131. Cited by: §II-A.
  • [18] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948. Cited by: §IV-A.
  • [19] V. Kazemi and J. Sullivan (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1867–1874. Cited by: §III-B, §IV-A, §V.
  • [20] J. Komorowski and P. Rokita (2010) A method for novel face view synthesis using stereo vision. In International Conference on Computer Vision and Graphics, pp. 49–56. Cited by: §V-B.
  • [21] Y. Li, C. Huang, and C. C. Loy (2019) Dense intrinsic appearance flow for human pose transfer. In CVPR, pp. 3693–3702. Cited by: §I, §II-A, §II-B.
  • [22] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019) On the variance of the adaptive learning rate and beyond. ArXiv. Cited by: §V.
  • [23] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao (2019) Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. CoRR. Cited by: §I, §II-A.
  • [24] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In (CVPR), Cited by: §I, §III-B, §V.
  • [25] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graph.. Cited by: §II.
  • [26] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. V. Gool (2017) Pose guided person image generation. In NIPS, pp. 406–416. Cited by: §I, §II-A, §II-B, §II.
  • [27] L. Ma, Q. Sun, S. Georgoulis, L. V. Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In CVPR, Cited by: §I, §II-A, §III-C.
  • [28] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §IV-B.
  • [29] Y. Men, Y. Mao, Y. Jiang, W. Ma, and Z. Lian (2020) Controllable person image synthesis with attribute-decomposed gan. ArXiv abs/2003.12267. Cited by: §III-C.
  • [30] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §II.
  • [31] A. K. Moorthy and A. C. Bovik (2011) Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE Transactions on Image Processing 20, pp. 3350–3364. Cited by: §II-B.
  • [32] N. Neverova, R. A. Güler, and I. Kokkinos (2018) Dense pose transfer. In ECCV, Cited by: §I, §II-A, §II-A, §II-B, §IV-A.
  • [33] Y. Ren, X. Yu, J. Chen, T. H. Li, and G. Li (2020) Deep image spatial transformation for person image generation. ArXiv abs/2003.00696. Cited by: §I, §II-A.
  • [34] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In NeurIPS, Cited by: §II-B, §III-B.
  • [35] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe (2018) Deformable gans for pose-based human image generation. In CVPR, pp. 3408–3416. Cited by: Fig. 1, §I, §II-A, §II-A, §II-A, §II-B, Fig. 9, §V-B, §V-B, §V-B.
  • [36] S. Song, W. Zhang, J. Liu, and T. Mei (2019) Unsupervised person image generation with semantic parsing transformation. In CVPR, pp. 2357–2366. Cited by: §V-B.
  • [37] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. CoRR abs/1910.12713. Cited by: §I, §II-A, §IV-A.
  • [38] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In NeurIPS, Cited by: §I, §I, §II-A, §II-A, §III-C.
  • [39] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing. Cited by: §III-B.
  • [40] L. Yang, P. Wang, X. Zhang, S. Wang, Z. Gao, P. Ren, X. Xie, S. Ma, and W. Gao (2020) Region-adaptive texture enhancement for detailed person image synthesis. In IEEE International Conference on Multimedia and Expo (ICME), Cited by: §I.
  • [41] L. Yang, Z. Zhao, S. Wang, S. Wang, S. Ma, and W. L. Gao (2019) Disentangled human action video generation via decoupled learning. 2019 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 495–500. Cited by: §II-A.
  • [42] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595. Cited by: §III-B.
  • [43] H. Zheng, L. Chen, C. Xu, and J. Luo (2019) Unsupervised pose flow learning for pose guided synthesis. ArXiv abs/1909.13819. Cited by: §I, §II-A.
  • [44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124. Cited by: §II-B.
  • [45] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. Cited by: Fig. 1, §I, §II-A, §II-B, §III-B, §III-C, §IV-A, Fig. 9, §V-A, §V-A, §V-B, §V-B, §V, §V.