Human pose transfer (HPT) is an emerging research topic that has attracted increasing attention in recent years. Aiming to synthesize person images under new target poses with respect to the appearance of a given source image, HPT holds great potential for empowering numerous creative applications, such as automatic fashion design, creative media production, online advertising, and virtual reality. In these applications, users would most likely focus their attention on semantically meaningful and detail-rich regions, such as the face and clothes. Therefore, the ability to preserve semantic information and replenish fine-grained appearance details is crucial for the performance and user experience of an HPT model.
Inspired by image-to-image translation, early HPT solutions adopt a global predictive strategy to directly translate the source image onto target poses, employing the U-net architecture with skip connections to propagate low-level features. However, due to the lack of localized deformation modeling, U-net based global methods often fail to properly handle the misalignment between source and target poses, leading to detail deficiency in synthesized person images, such as over-smoothed clothes or distorted faces. Instead, Siarohin et al. propose a modified feature fusion mechanism
for warping the appearance features extracted from the source image onto the target pose with part-wise affine transformations, opening up a local warping approach for fine-grained appearance transfer. Thereafter, extensive research efforts have been dedicated to better modeling of body deformation and local feature transfer, including thin-plate spline functions, local attention, optical flow, and 3D surface models. However, because they rely on exact feature correspondences for detail reconstruction, local warping methods typically suffer from content ambiguity in invisible regions, which arise frequently in real-world applications due to viewpoint variations and occlusions. To simultaneously overcome detail deficiency and content ambiguity, hybrid methods have emerged that attempt to replenish new content with an additional predictive branch. Yet the replenished contents often exhibit inferior perceptual quality compared to local-warped contents, incurring style inconsistency in blended images. Evidently, for real-world HPT applications, it is crucial to replenish fine-grained appearance details in a style-consistent manner to improve performance and user experience.
To narrow the theory-practice gap of HPT, we aim towards a more practical and challenging setting, termed Fine-grained Human Pose Transfer (FHPT). Specifically, FHPT addresses the aforementioned issues of detail deficiency, content ambiguity, and style inconsistency by placing greater emphasis on the preservation and replenishment of fine-grained semantic and appearance details, including facial identity, hairstyle, cloth fabrics, and small body parts (Fig. 1). To implement FHPT, we propose the Detail Replenishing Network (DRN) with two distinctive designs: a style-guided detail replenishment module to enforce the style consistency of generated contents across the entire human body, and an intermediate feature-sharing path to facilitate mutual guidance between the global predictive and local warping branches. Moreover, we establish a comprehensive suite of fine-grained evaluation protocols for more reliable and accurate measurement of model capability towards the FHPT objectives, including face identity preservation, keypoint localization, and content-based image retrieval. Extensive experiments carried out on the DeepFashion dataset verify the efficacy of our proposed DRN in preserving semantic attributes of the source image, as well as replenishing fine-grained appearance details in a style-consistent fashion. Compared to existing baselines, our method achieves a 12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and a near 40% gain on face identity preservation, establishing a strong baseline method for FHPT.
This manuscript extends our previous work in three aspects. Firstly, we formally develop FHPT, a more practical and challenging HPT scenario with a higher emphasis on fine-grained semantic fidelity and detail quality. Secondly, we propose the DRN for fine-grained person image generation and additionally address the limitation in identity preservation of our previous model by incorporating a facial attribute transfer module, leading to near 40% higher facial identity preservation in synthesized images. Thirdly, we develop a complete suite of fine-grained evaluation protocols for FHPT, offering new insight into the subject matter. All these efforts not only promote HPT towards a more challenging yet practical level, but will also benefit other semantic-guided synthesis tasks, such as face renovation [HiFaceGAN] and street view synthesis.
The rest of the paper is organized as follows: Section II provides a retrospect on existing HPT works, covering their architectural designs and limitations with respect to the FHPT challenges. Following the retrospect, Section III establishes the new objectives, evaluation criteria, and a novel “detail-replenishment” methodology for the FHPT task, and the corresponding benchmark implementation is detailed in Section IV. Finally, we report the experimental results in Section V and conclude the paper in Section VI.
II. A Retrospect on Human Pose Transfer
As an emerging topic in computer vision, HPT is still in its infancy, with the first relevant research appearing only in 2017. Driven by different research ideas from various related fields, including conditional GANs, style transfer
, human pose estimation, and computer graphics 
, HPT has been evolving in a multifaceted manner. Yet there is still no conclusive answer regarding the suitability of these ideas and their contribution towards the HPT objectives. Furthermore, most HPT works directly adopt existing evaluation metrics from related research fields, which have limited ability to address the multifaceted nature of the HPT problem. In retrospect, the area of HPT would benefit from a more detailed and comprehensive objective design, as well as corresponding evaluation protocols and a research methodology oriented towards practical applications. Here we provide a brief overview of existing HPT methods and evaluation criteria, highlighting their potential limitations for the preservation and replenishment of fine-grained semantic and appearance details.
II-A HPT Methods
Based on the feature transfer mechanism, we can summarize existing HPT methods into three categories: global predictive methods, local warping methods, and hybrid methods. Here we illustrate the behavior of different methods with a typical example in Fig. 2, where the source image contains a person in a blue-yellow checkerboard shirt partially occluded by his arm, and the target pose expects the person to move his arm away. By examining the expected output of each method, we can reveal their potential limitations in two aspects: the ability to preserve semantic and appearance details of the source image, and the ability to synthesize new content in occluded regions without exact feature correspondences.
Global predictive methods typically formulate HPT as a multi-modal image-to-image translation problem and utilize the U-net architecture with skip connections for feature propagation. The pose guidance is introduced by encoding joint locations into a spatial heatmap and concatenating it with the input source image. Specifically, as shown in Fig. 2 (a), the source pose and target pose are encoded into spatial guidance maps and concatenated with the source image to directly generate a prediction of the target image. However, such works often cannot reliably cope with the structural misalignment between different poses due to the lack of accurate deformation modeling. Alternatively, several works in human motion transfer aim to directly translate pose features to video frames, but often suffer from limited generalization towards new persons due to the lack of appearance guidance. Generally speaking, global predictive methods typically suffer from an inability to capture localized feature correspondences, which often leads to detail deficiency in synthesized images, e.g. blurry details and distortion artifacts, Fig. 2 (a).
Local warping methods
take inspiration from spatial transformer networks and incorporate deformation modeling into the feature propagation mechanism. The Deformable Skip-Connection (DSC) is proposed for warping local features through part-wise affine transformations, establishing a new HPT methodology that later inspired a great number of works. Following DSC, Thin-Plate Splines have further facilitated non-linear warping estimation, and local attention mechanisms have been incorporated for increased flexibility in deformation modeling. Moreover, the rapid advancement in automated dense 3D annotation has enabled pixel-level warping flow estimation, allowing fine-grained appearance transfer from source to target images.
The deformation mapping of the human body between different poses is usually determined via interpolation over a finite set of keypoint correspondences. Specifically, denoting the regions occupied by the human body in the source and target images as $\Omega_s$ and $\Omega_t$, one can formulate the deformation as a continuous mapping $f$ that carries keypoints in the source pose to the corresponding keypoints in the target pose. Usually such a mapping is not unique, which necessitates an additional regularization term, sometimes referred to as the “bending energy”, to mitigate the distortion of warped contents. The resulting mapping is usually composed of part-wise affine transformations or thin-plate splines. Afterward, the output image is reconstructed from the image features encoded from the source image and warped by the estimated deformation $f$.
However, due to frequent viewpoint changes and self-occlusions, there is no guarantee that the estimated warping covers the entire target human body; that is, the warped source region may be a proper subset of the target body region. Hence it is difficult for a local warping network to faithfully recover the underlying content without exact correspondences in the source image, leading to content ambiguity in uncovered regions, Fig. 2 (b).
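To make the interpolation step concrete, the following is a minimal NumPy sketch of a 2D thin-plate spline fitted from keypoint correspondences. The function names and the small ridge constant are illustrative choices of ours, not part of any cited implementation; the ridge term plays the role of the bending-energy regularization mentioned above.

```python
import numpy as np

def fit_tps(src_pts, dst_pts, reg=1e-6):
    """Fit a 2D thin-plate spline f with f(src_pts[i]) ~= dst_pts[i].

    Returns (w, a): kernel weights and affine coefficients of
    f(x) = a0 + A x + sum_i w_i * U(|x - p_i|), with U(r) = r^2 log r.
    """
    n = src_pts.shape[0]
    d2 = np.sum((src_pts[:, None, :] - src_pts[None, :, :]) ** 2, axis=-1)
    K = 0.5 * d2 * np.log(d2 + 1e-12)          # U(r) = r^2 log r, U(0) = 0
    P = np.hstack([np.ones((n, 1)), src_pts])   # affine part [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K + reg * np.eye(n)             # small ridge for stability
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst_pts
    params = np.linalg.solve(L, rhs)
    return params[:n], params[n:]               # kernel weights, affine coeffs

def tps_apply(x, src_pts, w, a):
    """Evaluate the fitted mapping at query points x (m x 2)."""
    d2 = np.sum((x[:, None, :] - src_pts[None, :, :]) ** 2, axis=-1)
    U = 0.5 * d2 * np.log(d2 + 1e-12)
    P = np.hstack([np.ones((x.shape[0], 1)), x])
    return U @ w + P @ a
```

When the correspondences happen to follow a purely affine motion, the kernel weights vanish and the mapping reduces to the affine part, which is the degenerate case handled by part-wise affine methods.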
Hybrid methods aim to hallucinate new contents in uncovered regions with an additional global predictive branch, and blend the global and local generation results according to an estimated composition mask. Such a design is common practice in video frame prediction, where there is little motion between consecutive frames and warping via optical flow is convenient. Furthermore, Zheng et al. designed an unsupervised flow learning scheme to tackle the unpaired scenario. However, existing hybrid methods typically focus on blending the generation results at the image level, whereas intermediate-level feature fusion is less explored. Specifically, the global predictive branch often works separately from the local warping branch, which can incur style inconsistency between hallucinated and warped contents, Fig. 2 (c).
Considering the requirements of real-world applications, it is crucial to develop a more suitable approach for replenishing fine-grained appearance details in uncovered regions without exact feature correspondences. Specifically, the semantic fidelity between source and target images, as well as the style consistency across the entire human body, should be effectively maintained.
II-B HPT Evaluation Criteria
For HPT, the quality of generated person images naturally comprises multiple factors, including perceptual quality, structural integrity, and semantic fidelity. Therefore, it is crucial to have well-designed measures addressing these factors simultaneously for reliable and accurate judgment of the capability of HPT models. However, most evaluation metrics are designed towards specific objectives, with less discriminative power in other aspects. For instance, PSNR is designed for measuring signal-level fidelity, but is not well-suited for evaluating perceptual quality; the Inception Score is designed for measuring images with a multi-modal distribution, and is less reliable for the HPT task, which contains only a single object category. Therefore, existing works typically introduce auxiliary evaluation tasks to validate the performance in other aspects, such as person detection, keypoint localization, and attribute prediction. However, most auxiliary evaluation tasks focus on high-level semantic aggregation, with less discriminative ability on fine-grained details. Also, the corresponding solutions to these tasks are robust by design, or in other words, insensitive to the noises and artifacts in generated images. Several works have utilized the generated images for data augmentation on person Re-ID, but over the low-resolution (128×64) Market1501 benchmark dataset, making them unreliable for measuring the visual quality of high-resolution and detail-rich images. Therefore, it is vital to develop a complete suite of evaluation protocols to assess HPT models in a reliable, accurate, and comprehensive fashion, which we elaborate on in the next section.
III. Towards Fine-grained Human Pose Transfer
In this section, we develop the proposed FHPT in three aspects: objectives, evaluation criteria, and the corresponding methodology. Specifically, we establish a new set of FHPT objectives to address the requirements of practical application scenarios with user interactions. Based on the proposed objectives, we develop a complete suite of fine-grained evaluation protocols for a more comprehensive, reliable, and accurate measurement of the capability of FHPT models. To fulfill these FHPT objectives, we provide a new detail replenishment methodology along with a corresponding benchmark solution that will be further detailed in Sec. IV.
III-A FHPT Objectives
Considering the requirements of creative design applications involving complex garments and accessories of diverse colors, fabrics, and textures, FHPT focuses on better style representation and attention to pattern details, satisfying the natural inclination of human perception towards semantically meaningful and detail-rich contents. Different from HPT objectives biased towards the accuracy of pose transfer, the proposed FHPT further emphasizes the preservation and replenishment of fine-grained semantic and appearance details, including facial identity, hairstyle, cloth fabrics, patterns, and small body parts, Fig. 1. Specifically, the FHPT objectives comprise three aspects:
Perceptual Realism The generated images should look natural and appealing, with rich, convincing appearance details over the entire image. Specifically, for hybrid networks, the predicted and warped image contents should be consistent in style and quality.
Structural Integrity The generated images should well fit the target pose without noticeable structural distortion, particularly for facial landmarks and small body parts.
Semantic Fidelity The generated images should preserve all necessary semantic attributes in the source image that help determine the person’s identity, including both the facial attributes and the clothing appearance, such as color, hairstyle, and fabric.
III-B FHPT Evaluation Criteria
To better inspire the design of network architectures, loss functions, and training schemes for the proposed FHPT and offer accurate and reliable measurement for the corresponding models, we establish a comprehensive suite of fine-grained evaluation protocols targeting the FHPT objectives: semantic fidelity, structural integrity, and perceptual realism. Below we detail the evaluation protocols for each objective.
Perceptual Evaluation Protocols Existing works often adopt the Structural Similarity (SSIM) and Inception Score (IS) to account for the statistical and perceptual fidelity of generated images. However, a recent study has shown that the Inception Score is susceptible to network weights, batch size, and data distribution, making it unreliable for measuring the quality of generative models. To address this issue, we introduce two reference-based perceptual metrics, FID and LPIPS, to better reflect the perceptual quality of our framework. Both metrics utilize a pre-trained network to project images onto a feature space and compute the distance between image features, either over the entire distribution or for individual pairs of samples. Compared with the reference-free IS metric, the FID and LPIPS metrics can better reflect the perceptual quality of generated images.
Structural Evaluation Protocols We introduce a new metric called Keypoint Error Curve (
KEC) to evaluate the localization accuracy of small body parts in generated images. It is similar to the PCKh metric in pose estimation tasks, but refined in two ways: (1) In addition to body joints, we also extract landmarks for the face and both hands, and evaluate the accuracy separately for each body part. This helps better reflect the capability of an FHPT model in maintaining the structural integrity of small body parts. (2) We provide a more detailed performance profile by evaluating the keypoint estimation accuracy at a set of adaptively selected anchors. For each given threshold $d$, we calculate the percentage of detected keypoints whose distance to the ground truth is smaller than $d$, and plot the resulting accuracy curve.
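As a reference, the per-threshold accuracy underlying the KEC can be computed as in the following NumPy sketch. The function name is ours, and the anchor (threshold) selection strategy is left to the evaluator.

```python
import numpy as np

def keypoint_error_curve(pred, gt, thresholds, visible=None):
    """Percentage of predicted keypoints within each distance threshold.

    pred, gt: (N, K, 2) arrays of keypoint coordinates in pixels.
    visible:  optional (N, K) boolean mask of annotated keypoints.
    Returns one accuracy value per threshold, forming the KEC curve.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, K) Euclidean distances
    if visible is None:
        visible = np.ones(dist.shape, dtype=bool)
    dist = dist[visible]                        # drop unannotated keypoints
    return np.array([np.mean(dist <= t) for t in thresholds])
```

By construction the curve is non-decreasing in the threshold, and evaluating it separately on face, hand, and body landmarks yields the per-part profile described above.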
Semantic Evaluation Protocols As most semantic information in typical person images is concentrated in facial attributes and clothing contents, we propose two well-defined and interpretable measurements for semantic fidelity: content-based image retrieval and face identification. For retrieval, we use generated images to query a database of all corresponding source images, and calculate the retrieval scores using the provided ground truth annotations. Specifically, the retrieval-based evaluation metric is superior to existing distance metrics in three aspects:
The retrieval task is more application-driven with customers being the final judge of content similarity. Thus a well-trained retrieval system will work effectively in extracting the most informative and discriminative features closely related to human perception.
The retrieval-based evaluation protocol does not require the ground truth to be under the same pose as the query, making it more practical for real application purposes. Furthermore, its robustness against pose and view variations also indicates a better focus on semantic and appearance details.
The reliability of retrieval-based metrics can easily be quantified by querying with real images, and further improved by fine-tuning the retrieval system over the testing dataset. In contrast, the performance bound for existing perceptual metrics is always the same, 0 for LPIPS and 1 for SSIM, making it difficult to quantify and improve the reliability of such metrics.
To measure the consistency of face identity across source/target image pairs, we extract feature embeddings of cropped facial regions with a pre-trained face recognition model, and compare the embedding distance between each pair. If the distance is lower than a threshold $\tau$, the person’s identity is considered to be preserved. In practice, we report results under two choices of $\tau$.
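The identification step reduces to thresholding an embedding distance. A minimal sketch follows, using hypothetical low-dimensional embeddings in place of real face-recognition features; the function names and the distance choice (L2 on normalized embeddings) are illustrative.

```python
import numpy as np

def identity_preserved(emb_src, emb_gen, tau):
    """Decide identity preservation by thresholding the embedding distance.

    emb_src, emb_gen: face embeddings from a pre-trained recognition
    model (e.g. one normalized vector per face).
    """
    return np.linalg.norm(emb_src - emb_gen) < tau

def identity_rate(src_embs, gen_embs, tau):
    """Fraction of source/generated pairs whose identity is preserved."""
    dists = np.linalg.norm(src_embs - gen_embs, axis=1)
    return float(np.mean(dists < tau))
```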
III-C FHPT Methodology
TABLE I: Capability comparison of HPT architectures, covering explicit deformation modeling, pose-guided appearance transfer, mutual guidance across branches, and fine-grained detail synthesis.
So far, we have discussed existing HPT approaches in terms of their feature utilization and content generation schemes, and highlighted their potential flaws in addressing the FHPT challenges. Specifically, a fundamental problem in FHPT lies in the preservation and replenishment of fine-grained semantic and appearance details, which has been explored through two separate approaches. One is to directly synthesize new content based on high-level semantic guidance, as in global predictive methods; the other is to transfer low-level visual features from the source image, as in local warping methods based on warping flows or attention mechanisms. Consequently, the task of human pose transfer is also known by different names, such as “pose-guided human (person) image generation (synthesis)”, indicating the methodological divergence in existing research efforts. Needless to say, both approaches have their advantages and limitations, which necessitates an effective integration scheme to maximize their complementary effects. With the emergence of hybrid architectures, the two opposite ends have started to join together, but the integration result is still sub-optimal, mainly due to the inconsistency between globally and locally generated content.
Moving towards FHPT, the transfer of available image contents and the synthesis of new content should be carried out in a mutually guided fashion to simultaneously promote the style consistency and detail quality of the final output. To this end, we propose the Detail Replenishing Network (DRN) to substantiate the FHPT methodology with two distinctive architectural designs: 1) a style-guided detail replenishing module to enforce style consistency; 2) an intermediate feature-sharing pathway to promote mutual guidance. Table I summarizes the main capabilities of DRN against the existing networks in Fig. 2 (a)-(c), where DRN works in a “coarse-to-fine” fashion by synthesizing a global detail-replenishing residual map upon the coarse estimation, leading to detail-rich, unambiguous, and style-consistent generation results. In the next section, we will detail the implementation of individual components of the conceptualized FHPT model in Fig. 2 (d), and introduce the corresponding model learning scheme to better fulfill the proposed FHPT objectives.
IV. Detail Replenishing Network
In this section, we illustrate the architectural design and training details of the proposed Detail Replenishing Network (DRN), which contains two branches: a pose transfer branch for acquiring a coarse estimation and providing spatial guidance, and a detail replenishing branch for refining local visual details in a style-guided fashion, Fig. 3.
IV-A Network Architecture
Pose Representation The human pose representation can be implemented in various ways depending on resource constraints and application considerations. In this paper, we choose a computationally efficient pose representation of sparse 2D keypoints. For each image in the training dataset, we extract an 18-point pose skeleton using a pretrained estimation network, and convert the skeleton coordinates into an 18-channel pose heatmap. Dense representations, such as body parsing labels, optical flows, or pseudo-3D surfaces, are also compatible with our framework at the expense of higher computational cost and reduced manipulation flexibility. Furthermore, we also extract the face bounding box using a lightweight face detection library to help preserve necessary facial attributes. Also, the landmarks of the target face are estimated and encoded as a sketch, which is provided during inference if a frontal view is available in the target image.
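The keypoint-to-heatmap encoding can be sketched as follows. The Gaussian bandwidth and the negative-coordinate convention for missing joints are illustrative choices, not the exact settings of our implementation.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, h, w, sigma=6.0):
    """Encode K 2D joints as a (K, h, w) stack of Gaussian heatmaps.

    keypoints: (K, 2) array of (x, y) pixel coordinates; a negative
    coordinate marks a missing joint and yields an all-zero channel.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:
            continue  # undetected joint: leave the channel empty
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

The resulting channels peak at the joint locations and can be concatenated with the RGB image along the depth axis, as done by the pose transfer branch below.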
Pose Transfer Branch takes in the source image and the paired source/target pose representations, and aims to generate a coarse estimation of the target image. The network begins with several down-sampling convolutional layers to encode the contents of the source image and the dependencies between the source and target poses. To better capture the relationship between different poses, the pose pair is concatenated along the depth axis before being sent into the encoder. Following prior work, we introduce several cascaded transfer blocks to encourage a smooth transition of encoded contents. The resulting feature map is roughly aligned with the target pose and serves as the spatial guidance for detail replenishment. Finally, we decode it with several upsampling layers to acquire a coarse estimation of the target image.
Note that for our proposed framework, it is crucial that the transfer branch provides accurate spatial guidance for the subsequent detail replenishing branch. In particular, we do not want any fake details to be injected into the guidance map or the coarse estimation. Instead, the transfer branch is supposed to provide useful hints on “where to add what kind of details” and let the detail replenishing branch do the work. This allows a coarse-to-fine generation of the final image with better visual quality. Ablation studies are provided in Section V to justify our claims.
Detail Replenishing Branch is composed of several detail-replenishing modules (DRM), each specializing in refining either the entire image or a specific region. For our benchmark solution, we adopt both types of DRMs to refine the entire image and the face region. The global module predicts a residual map from the spatial guidance map and the source image, and adds the residual map onto the coarse result; the regional (face) module directly synthesizes the target face image under the target landmark sketch, according to the facial attributes extracted from the input face crop. In this way, the face module can be trained independently from the global module and can benefit from additional face data. To assemble the final output, we perform alpha blending with a Gaussian-blurred weight mask $\alpha = g * \mathbb{1}_{\mathrm{face}}$, where $*$ denotes the 2D convolution operator, $g$ is a discrete 2D Gaussian kernel, and $\mathbb{1}_{\mathrm{face}}$ is an indicator function that equals 1 if the corresponding pixel belongs to the face bounding box and 0 elsewhere. Incidentally, we only need to activate the face enhancing module if a frontal face is available in both the source and the target image. Finally, combining the results with the blending weights in Eqn. (1) leads to the final result.
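The mask construction and blending can be sketched in NumPy as follows; the kernel size, sigma, and function names are illustrative, and a real implementation would typically operate on tensors on the GPU.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur_mask(mask, sigma=3.0):
    """Separable Gaussian blur of a binary {0, 1} mask."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, mask)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def blend(global_img, face_img, face_box, sigma=3.0):
    """Alpha-blend an enhanced face into the globally refined image.

    face_box: (top, left, bottom, right) of the face region; face_img is
    assumed to be a full-size image containing the enhanced face crop.
    """
    t, l, b, r = face_box
    indicator = np.zeros(global_img.shape[:2])
    indicator[t:b, l:r] = 1.0                       # indicator of the face box
    alpha = np.clip(blur_mask(indicator, sigma), 0.0, 1.0)[..., None]
    return alpha * face_img + (1.0 - alpha) * global_img
```

The blur softens the box boundary so that the face crop fades smoothly into the surrounding image instead of leaving a visible seam.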
Detail Replenishment Module The proposed detail replenishment module (DRM) aims to encode the visual attributes of the appearance condition image and generate a detail-replenishing map following the spatial guidance map. The network architecture of the DRM is shown in Fig. 4.
We extract the appearance and style information from the source image
with an encoder consisting of several convolutional down-sampling layers and residual blocks. To acquire a more robust representation against partially occluded clothes, we perform adaptive average pooling to convert the encoded feature tensor into a style code. After the style code is acquired, an enhancing residual map (or target face crop) is generated with respect to both the spatial and appearance guidance:
where the conditioning input is the spatial guidance map described in Section IV-A. Inspired by related works in style-based image generation, we utilize adaptive instance normalization (AdaIN) to infuse the style code into the output enhancing map. Note that the style code now assumes the role of “style guide” by controlling the pattern and granularity of synthesized details in different regions. The residual map generator is formed by several up-sampling convolutional layers. The content feature map at the coarsest scale is fed into the generator, where semantic and appearance details are gradually infused in a coarse-to-fine fashion across multiple layers. At the $i$-th layer, the texture code is processed with a 3-layer MLP to acquire the modulation weight $w_i$ and bias $b_i$, which are then used for controlling the AdaIN operation:
$$\mathrm{AdaIN}(F_i, w_i, b_i) = w_i \cdot \frac{F_i - \mu(F_i)}{\sigma(F_i)} + b_i,$$ where $\mu(F_i)$ and $\sigma(F_i)$ are the per-channel mean and standard deviation of $F_i$, the input feature map at the $i$-th AdaIN layer.
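The AdaIN modulation can be sketched as follows; the (C, H, W) layout and the eps constant are illustrative choices.

```python
import numpy as np

def adain(content, w, b, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map.

    w, b: per-channel modulation weight and bias of shape (C,),
    predicted from the style code by an MLP.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)   # zero mean, unit variance
    return w[:, None, None] * normalized + b[:, None, None]
```

After modulation, each channel's statistics are set by the style code, which is precisely how the style guide controls the granularity of the synthesized details.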
IV-B Training Losses
We adopt slightly different loss functions for the two branches. For the transfer branch, the output coarse estimation is NOT expected to contain any fake textures, as explained in Section IV-A. Therefore, we adopt the following loss formulation to update the transfer branch without adversarial losses:
where the first term is the pixelwise L1 loss and the second is the perceptual loss:
Here $\phi$ denotes a pretrained VGG-19 network and $j$ denotes the layer index. For the enhancing branch, we further add style and adversarial loss terms to promote visual details, leading to the following full loss function:
where the style term is the Gram-matrix based style loss, computed from the Gram matrix $G(F) = \frac{1}{CHW} F F^{\top}$ of a feature map $F$ reshaped to $C \times HW$.
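A minimal NumPy version of the Gram-matrix style loss follows; the normalization by C·H·W is the common convention and may differ from the exact implementation.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feats_a, feats_b):
    """Sum of squared Frobenius distances between Gram matrices
    over a list of feature maps (one per VGG layer)."""
    return sum(np.sum((gram_matrix(a) - gram_matrix(b)) ** 2)
               for a, b in zip(feats_a, feats_b))
```

Because the Gram matrix discards spatial arrangement and keeps only channel correlations, this term matches texture statistics rather than exact pixel locations, which is what enforces style consistency between replenished and warped contents.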
The adversarial loss is evaluated with two conditional discriminators that measure the appearance and pose consistency, respectively. Concretely, the appearance discriminator takes a pair of source/target images as input and tries to distinguish real pairs from fake pairs containing generated images. Similarly, the pose discriminator takes in image/pose pairs and evaluates whether the generated images are consistent with the given pose conditions. Combining the pose and appearance terms leads to the following formulation:
which is also used for updating the discriminators. In practice, we use the LSGAN variant for better training stability. For face enhancement, the landmark sketch is adopted as the pose condition in place of the global pose, and the source image is naturally replaced with the cropped face patch.
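For reference, the LSGAN objectives take the following least-squares form; the real/fake target values 1 and 0 are the standard choice, and the discriminator scores are assumed to be raw (unbounded) outputs.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: push real scores to 1, fake scores to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push fake scores towards 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

Replacing the sigmoid cross-entropy of the vanilla GAN with these squared errors avoids vanishing gradients for samples the discriminator already classifies confidently, which is the training-stability benefit referred to above.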
IV-C Optimization Algorithm
We employ a coarse-to-fine gradient descent optimization algorithm to update the components alternately. In each iteration, we first update the pose transfer branch with the coarse loss computed over the coarse estimation result. This allows the spatial guidance map to be roughly aligned with the target pose. Then we perform an end-to-end fine-tuning step to update the enhancing branch with the detail-aware loss computed over the final output. (The face enhancing module is optimized separately, as it operates on cropped face regions instead of the entire image, and the facial contents are in principle independent of other body parts.) Note that the pose transfer branch is also fine-tuned in this step, as gradients can be backpropagated through the spatial guidance map and the preceding pose transfer blocks. The discriminators are then updated for $k$ steps, where $k$ is chosen empirically to balance running speed and discriminative capability. The complete training routine is shown in Algorithm 1. Other model learning schemes are compared in the ablation study in Section V-A.
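The alternating schedule can be summarized by the following skeleton. The callback names are placeholders for the actual gradient steps; this is a structural sketch of the update order, not the full routine of Algorithm 1.

```python
def train(num_iters, k, update_transfer, update_enhance, update_discriminators):
    """Skeleton of the alternating update schedule.

    The three callbacks stand in for one gradient step on the pose
    transfer branch (coarse loss), the detail replenishing branch
    (detail-aware loss, also backpropagating into the transfer branch),
    and the discriminators, respectively.
    """
    for _ in range(num_iters):
        update_transfer()         # coarse loss on the coarse estimation
        update_enhance()          # detail-aware loss on the final output
        for _ in range(k):        # k discriminator steps per iteration
            update_discriminators()
```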
V. Experimental Results
In this section, we evaluate the performance of the proposed DRN against several competitive baseline methods. Furthermore, we perform an ablation study to verify the efficacy of our main contributions.
Datasets We carry out all experiments on the In-shop Clothes Retrieval Benchmark of the DeepFashion dataset, which contains more than 50,000 editorial images of fashion models under varying poses with texture-rich garments. Following the standard pre-processing routine, we crop the background from both sides of the image by 40 pixels, leaving the center 256×176 region for training and testing. For pose representations, we extract the body skeletons and facial landmarks using OpenPose, and detect the face bounding boxes with a lightweight face detection library. We use the standard data partition with 101,966 pairs for training and 8,570 pairs for testing. The partition ensures that the same person does not appear in both the training and testing splits, so the evaluation properly reflects the generalization ability of our DRN.
We use PyTorch to implement our proposed framework. The transfer branch contains 3 down-sampling layers and 9 cascaded pose transfer blocks, and the enhancing branch contains 6 residual blocks. LeakyReLU with a negative slope of 0.2 is applied after each normalization layer. The length of the texture code is fixed to 128, where the impact of different lengths is further analyzed in the ablation study; see Section V-A for details. The Rectified Adam optimizer is adopted with weight regularization for improved training stability and better final performance. The full framework is trained with a batch size of 40 for a total of 200K iterations. The learning rate is identical for all modules, kept fixed during the initial stage of training, and then linearly decayed to 0. For face enhancement, we retain the training samples with paired face bounding boxes, and resize the face crops using bicubic interpolation. The face loss is computed by Eqn. (7) with the same loss weights. The training continues for 200 epochs with a fixed learning rate.
V-A Ablation Study
In this section, we compare the proposed DRN against four ablation methods to analyze the impact of different components. The first two verify the proposed network architecture, while the other two focus on the proposed alternate training strategy. We leave the ablation study on the facial detail replenishment module to Section V-B, where a more targeted analysis is provided on cropped facial regions. The settings of each ablation method are as follows:
PB Only We remove the enhancing branch and train the pose transfer branch directly in an end-to-end fashion with loss function . Note that this setting differs from the PATN  baseline in that an additional style loss term is introduced with slightly adjusted loss weights.
Texture64dim We reduce the length of the texture code to 64 and keep the rest unchanged. This reduces the appearance information extracted from the source image, which could result in fewer details and more severe artifacts.
PB Fix We initialize the parameters of the transfer branch using the pre-trained models in  and keep them fixed during training. This breaks the interaction between the two branches, so the style loss and conditional adversarial losses cannot be back-propagated into the transfer branch.
End2end We randomly initialize the parameters and train from scratch. Instead of the proposed alternate training, we directly update the whole framework with the loss defined over the refined output, without explicit constraints on the coarse estimation . In this way, the detail replenishing branch has to rely on inaccurate spatial guidance maps, which could lead to difficulty in convergence and suboptimal performance.
We present the scores evaluated on the DeepFashion dataset in Table II. Our full framework, DRN, consistently outperforms the other ablation methods, especially on perceptual metrics such as FID and LPIPS, highlighting the importance of detail replenishment for human image generation. We also note that reducing the dimension of the latent code leads to a significant performance drop, indicating a potential trade-off between perceptual quality and memory cost. Furthermore, Fig. 6 showcases the quality improvement over the other ablation methods. The proposed DRN creates more authentic appearance details and suppresses repetitive artifacts. Also, owing to the guidance from the texture-aware loss over the final outputs, it better preserves the integrity of body parts, as observed in the second and third rows, indicating increased accuracy of the estimated spatial guidance map . This further justifies the effect of mutual guidance between the two modules, along with the alternate optimization algorithm.
V-B Comparisons with Previous Works
We choose several representative state-of-the-art works as baselines: DSC , UV-Net , SPT  and PATN . All tests are carried out on the same set of testing pairs as in . Note that the official DSC implementation requires the input pose to contain full-body joints (shoulders and hips); we therefore remove 426 cases out of the 8,570 test pairs that do not meet this criterion and report scores on the remaining samples.
Qualitative Comparison Fig. 5 showcases the generation results on some challenging examples with large pose variations and rich textures. In general, our method is more faithful to the input conditions and contains more realistic semantic and appearance details, such as clothing fabrics (linen, laces), textural patterns, and hair waves. In particular, our method mitigates the gender-bias issue reported in , which is caused by the imbalanced gender distribution in fashion images. As shown in the third row, without explicit facial guidance in the source image, all the other methods mistakenly predict the more frequently occurring female face. In contrast, our method is more effective in capturing fine-grained semantic details and faithfully retains the masculinity of the synthesized person in both the coarse and refined outputs.
Perceptual Evaluation As reported in Table III, our DRN achieves significant improvements over recent state-of-the-art methods in terms of perceptual quality, with clear gains on FID against both PATN and DSC; similar gains are observed on the LPIPS score. The patch-wise signal statistic SSIM remains comparable. It is noteworthy that although our method has the lowest Inception Score, the qualitative improvement is undeniable, highlighting the superiority of the adopted perceptual metrics over the unsupervised IS metric.
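For reference, the SSIM statistic mentioned above compares luminance, contrast, and structure between two images. A minimal single-window sketch is given below (the standard metric averages this statistic over local sliding windows; the function name is our own):

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """SSIM computed over whole images as a single window (a simplification;
    the standard metric averages this statistic over local windows)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical images score exactly 1, while structural differences pull the score down, which is why SSIM rewards smooth-but-aligned predictions and can remain flat even when perceptual quality improves.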
To further analyze the impact of the proposed detail replenishment modules, we also report scores tested on the coarse estimations without the residual map, as shown in Table III. Although both the SSIM and IS scores drop slightly after enhancement, the perceptual fidelity is significantly improved, with a 20% gain on FID and a 10% gain on LPIPS. Furthermore, even without detail replenishment, our pose transfer branch still achieves better perceptual scores than the PATN baseline, verifying the efficacy of refining the pose map with detail-aware losses. In summary, DRN is effective in mitigating semantic and appearance detail deficiency during the pose transfer stage and consistently improves the visual quality of generated human images.
Metric | L2 error | Acc. () | Acc. ()
Ours (w/o face) | 0.647 | 0.247 | 0.752
Structural Evaluation As shown in Fig. 7, our DRN demonstrates consistent quality improvement over the entire human body. Among different body parts, DRN is especially powerful at restoring facial details, even without using ground-truth facial landmarks. For the body and both hands, the accuracy curves of the different methods are roughly parallel, with a gap of 5% between the proposed DRN and the original PATN (PB Only). This indicates that our mutual-guidance design helps refine the structural integrity of generated body parts, reducing the fraction of outliers with large localization errors by 5%. In summary, the proposed DRN not only restores convincing appearance details but also improves the structural integrity of body parts.
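The keypoint accuracy curves discussed above follow a PCK-style protocol: a predicted keypoint counts as correct when it falls within a pixel threshold of the ground truth. A minimal sketch (function name and array shapes are our own assumptions):

```python
import numpy as np

def keypoint_accuracy(pred, gt, threshold):
    """Fraction of keypoints whose predicted (x, y) location lies within
    `threshold` pixels of the ground truth (a PCK-style accuracy)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean error
    return float((dist <= threshold).mean())
```

Sweeping the threshold produces the accuracy curves of Fig. 7; parallel curves then translate into a constant accuracy gap at every threshold.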
Semantic Evaluation We first evaluate the facial identity preservation of different baselines in Table IV by measuring the mean embedding distance and the corresponding percentage of identity-preserving cases. Our basic framework with the global DRM already leads existing methods by 27%, with an additional 12% gain from face detail replenishment. We also present qualitative results in Fig. 8, where DSC  fails to preserve the facial structure and incurs severe landmark distortions. PATN  manages to synthesize more plausible faces, but with limited diversity in hairstyles and lip shapes, and sometimes with false attributes, such as the thick black beard for the male in row 4. In contrast, our method preserves prominent facial attributes and corrects mild distortions, leading to better identity consistency and visual realism of the generated faces.
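The identity-preservation measurement above can be sketched as follows, assuming face embeddings (e.g., from a pre-trained face recognizer) are available; the threshold `tau` and all names are hypothetical:

```python
import numpy as np

def identity_preservation(gen_emb, ref_emb, tau):
    """Mean L2 embedding distance between generated and reference faces,
    and the fraction of cases whose distance stays below threshold `tau`
    (counted as identity-preserving)."""
    d = np.linalg.norm(np.asarray(gen_emb, float) - np.asarray(ref_emb, float), axis=1)
    return float(d.mean()), float((d < tau).mean())
```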
It is also noteworthy that existing works on face view transfer  mostly deal with face images of the same resolution and visual quality. However, faces in real person images can vary in spatial resolution, leading to potentially severe quality discrepancies between resized image pairs. For example, in row 4 of Fig. 8, the input male face comes from a full-body image and is much blurrier than the target face from an upper-body shot. Such an issue calls for generative models that are more robust to scale and quality variations, which may constitute another meaningful research topic to explore in the FHPT domain.
Finally, we evaluate the overall content preservation of our proposed method with the newly proposed content-based image retrieval task. Table V presents the Top-K recall of different methods for K = 3, 5, 10 under two different backbones, ResNet and VGG. Our DRN holds a significant advantage over the state-of-the-art PATN baseline, with almost twice the recall rate for every K, proving the efficacy of our texture enhancing module in recovering appearance details lost during the pose transfer stage. It is also evident that the proposed network architecture and training strategy are complementary to each other, as changing either element severely degrades the overall performance, leaving a recall rate barely comparable to the PATN baseline.
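The retrieval protocol above reduces to a Top-K recall computation over feature embeddings. A minimal sketch using cosine similarity (the similarity choice and all names are our assumptions, not details confirmed by the paper):

```python
import numpy as np

def topk_recall(query_emb, gallery_emb, gallery_ids, query_ids, k):
    """For each query, rank gallery items by cosine similarity and check
    whether an item with the matching identity appears among the top k."""
    q = np.asarray(query_emb, float)
    g = np.asarray(gallery_emb, float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = q @ g.T                          # (num_queries, gallery_size)
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
    gallery_ids = np.asarray(gallery_ids)
    hits = [(gallery_ids[row] == qid).any() for row, qid in zip(top, query_ids)]
    return float(np.mean(hits))
```

Under this protocol, generated images serve as queries against a gallery of real images, so recall directly measures how much identity-relevant appearance content survives synthesis.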
We also point out two important observations in Table V. First, the warping-based DSC method significantly outperforms PATN, a recently published work with better visual quality (Fig. 5). Second, although our method achieves the top recall performance, it is still far less accurate than querying with real images. To better understand the information carried within the extracted feature embeddings, as well as the similarity criterion used by the retrieval system, we compare the top 10 best-matching candidates retrieved from queries generated with different methods.
Three typical examples are shown in Fig. 9, where (a) represents the easy case with input clothes containing few textures, (b) the challenging case with richly textured clothes, and (c) the confusing case with two or more items in different textural styles: a deep gray coat and a leopard-spotted short skirt. In case (a), we note that the retrieval system frequently mismatches a woman in a deep blue skirt, suggesting a stronger focus on color consistency than on other attributes such as gender and accessories. For case (b), the retrieval results are closer to our expectations, where the number of positive matches grows consistently with the amount of textural detail in the query images. For the confusing case (c), DSC successfully recovers the leopard-spotted skirt from the source image, leading to a more accurate retrieval result with correct top-3 candidates, albeit not positive matches. In contrast, both our DRN and the PATN baseline fail to capture the desired outfit. This could partially explain DSC's superior retrieval performance and help motivate novel hybrid architectures combining the strengths of warping-based and synthesis-based methods.
In conclusion, our retrieval-based evaluation protocol is more reliable for measuring the semantic and appearance consistency of generated human images, especially in cases with rich textures. It is also much more flexible in adjusting the similarity criterion according to different attributes, such as colors, fabrics, textures, and hairstyles. We believe that advances in fashion retrieval systems could facilitate the design of better similarity criteria that would further boost the visual realism of human-centered image synthesis tasks.
Human pose transfer (HPT) is an emerging research topic with huge application potential in creative media. Yet current HPT methods typically introduce detail deficiency, content ambiguity or style inconsistency into synthesized person images, owing to the suboptimal integration between low-level feature transfer and high-level semantic-guided content synthesis. To address these issues and narrow the theory-practice gap of HPT, we target a more practical and challenging setting, termed Fine-Grained Human Pose Transfer (FHPT). Specifically, we establish three new objectives for the FHPT task: perceptual realism, structural integrity, and semantic fidelity. We pair these objectives with a suite of fine-grained evaluation protocols, including face identity preservation, keypoint localization, and content-based image retrieval, providing a comprehensive, accurate and reliable measurement of model capability in FHPT scenarios. To implement FHPT, we propose a Detail Replenishing Network (DRN) with distinctive structural designs, including a guided detail replenishment branch for style consistency in generated content, and an intermediate feature-sharing path to facilitate mutual guidance between the global predictive and local warping branches. As a result, DRN achieves significant performance gains over various HPT baselines, highlighting the challenges of the FHPT task as well as the efficacy of our contributions.
In the future, we aim to investigate more effective semantic guidance and feature utilization schemes for FHPT. For instance, it would be beneficial to incorporate region-adaptive style control into the FHPT framework. Designing interpretable and customizable evaluation protocols is another promising direction. Finally, although our framework is designed for synthesizing human images, the proposed "detail replenishment" approach can also benefit other semantic-guided image synthesis tasks, such as face renovation and street view synthesis, leading to enhanced appearance quality and perceptual realism in the produced content.
-  (2018) A note on the inception score. arXiv preprint. Cited by: §II-B, §III-B.
-  (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence 11 (6), pp. 567–585. Cited by: §II-A, §II-A.
-  (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §II, §IV-A, §V.
-  (2018) Everybody dance now. ArXiv abs/1808.07371. Cited by: §II-A, §II-B.
-  (2018) Soft-gated warping-gan for pose-guided person image synthesis. In NeurIPS, Cited by: §I, §II-A, §II-A, §IV-A.
-  (2018) A variational u-net for conditional appearance and shape generation. CVPR, pp. 8857–8866. Cited by: §I, §II-A, §V-B.
-  (2019) Coordinate-based texture inpainting for pose-guided human image generation. In CVPR, Cited by: §I, §II-A.
-  (2018) DensePose: dense human pose estimation in the wild. In CVPR, Cited by: §II-A.
-  (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §V.
-  (2017) GANs trained by a two time-scale update rule converge to a nash equilibrium. CoRR. Cited by: §III-B.
-  (2017) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448. Cited by: §V-B.
-  (2017) Arbitrary style transfer in real-time with adaptive instance normalization. CoRR abs/1703.06868. Cited by: §IV-A.
-  (2017-07) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Cited by: §II-A.
-  (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint. Cited by: §I, §II-A.
-  (2015) Spatial transformer networks. ArXiv abs/1506.02025. Cited by: §II-A.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155. Cited by: §II, §IV-B, §IV-B.
-  (2017) End-to-end recovery of human shape and pose. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7122–7131. Cited by: §II-A.
-  (2018) A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948. Cited by: §IV-A.
-  (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1867–1874. Cited by: §III-B, §IV-A, §V.
-  (2010) A method for novel face view synthesis using stereo vision. In International Conference on Computer Vision and Graphics, pp. 49–56. Cited by: §V-B.
-  (2019) Dense intrinsic appearance flow for human pose transfer. In CVPR, pp. 3693–3702. Cited by: §I, §II-A, §II-B.
-  (2019) On the variance of the adaptive learning rate and beyond. ArXiv. Cited by: §V.
-  (2019) Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. CoRR. Cited by: §I, §II-A.
-  (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In (CVPR), Cited by: §I, §III-B, §V.
-  (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graph.. Cited by: §II.
-  (2017) Pose guided person image generation. In NIPS, pp. 406–416. Cited by: §I, §II-A, §II-B, §II.
-  (2018) Disentangled person image generation. In CVPR, Cited by: §I, §II-A, §III-C.
-  (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §IV-B.
-  (2020) Controllable person image synthesis with attribute-decomposed gan. ArXiv abs/2003.12267. Cited by: §III-C.
-  (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §II.
-  (2011) Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE Transactions on Image Processing 20, pp. 3350–3364. Cited by: §II-B.
-  (2018) Dense pose transfer. In ECCV, Cited by: §I, §II-A, §II-A, §II-B, §IV-A.
-  (2020) Deep image spatial transformation for person image generation. ArXiv abs/2003.00696. Cited by: §I, §II-A.
-  (2016) Improved techniques for training gans. In NeurIPS, Cited by: §II-B, §III-B.
-  (2018) Deformable gans for pose-based human image generation. In CVPR, pp. 3408–3416. Cited by: Fig. 1, §I, §II-A, §II-A, §II-A, §II-B, Fig. 9, §V-B, §V-B, §V-B.
-  (2019) Unsupervised person image generation with semantic parsing transformation. In CVPR, pp. 2357–2366. Cited by: §V-B.
-  (2019) Few-shot video-to-video synthesis. CoRR abs/1910.12713. Cited by: §I, §II-A, §IV-A.
-  (2018) Video-to-video synthesis. In NeurIPS, Cited by: §I, §I, §II-A, §II-A, §III-C.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing. Cited by: §III-B.
-  (2020) Region-adaptive texture enhancement for detailed person image synthesis. In IEEE International Conference on Multimedia and Expo (ICME), Cited by: §I.
-  (2019) Disentangled human action video generation via decoupled learning. 2019 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 495–500. Cited by: §II-A.
-  (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595. Cited by: §III-B.
-  (2019) Unsupervised pose flow learning for pose guided synthesis. ArXiv abs/1909.13819. Cited by: §I, §II-A.
-  (2015) Scalable person re-identification: a benchmark. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124. Cited by: §II-B.
-  (2019) Progressive pose attention transfer for person image generation. Cited by: Fig. 1, §I, §II-A, §II-B, §III-B, §III-C, §IV-A, Fig. 9, §V-A, §V-A, §V-B, §V-B, §V, §V.