In this paper, we study the problem of synthesizing images of a person’s appearance under novel poses, commonly known as human pose transfer [1, 2, 3]. The problem was first introduced in [1], where a transfer model receives a source image that provides conditional appearance constraints and is expected to transfer the person’s appearance to new poses. Human pose transfer forms the core of many real-world applications, including interactive fashion design, creative media production, and many other human-centered tasks.
In a typical person image, the bulk of the area is occupied by clothes with rich textural patterns, which are likely to attract a viewer’s attention. Therefore, the ability to synthesize realistic texture details is crucial to the performance and perceived quality of a transfer model. In the area of image-to-image (I2I) translation [4, 5], the U-net architecture [6] has achieved remarkable success in preserving fine-grained visual details through skip-connections. For the human pose transfer task, however, this technique is not directly applicable due to the structural deformation of the human body under different poses. Siarohin et al. [2] proposed a modified version called the deformable skip-connection, where feature maps extracted from the source image are warped onto the corresponding target regions according to pose correspondences.
The deformable skip-connection [2] encapsulates the two most distinctive features of the so-called “warping-based” methods: (1) a pixel-wise (or part-wise) correspondence map estimated from the source pose to the target pose; (2) a mechanism for warping features from the source image to the corresponding target regions according to the estimated mapping. However, most existing methods [7, 8, 9, 10] channel the appearance features through the same pathway used for pose transfer, where the loss of fine-grained textural information is inevitable due to down-sampling operations. This often leads to over-smoothed clothes and distorted facial landmarks that severely degrade the visual quality of synthesized images. In addition, estimating a pixel-wise mapping is extremely time-consuming and often requires additional dense annotations, making it intractable for many interactive design and editing applications.
Instead, we seek another path to enhancing the appearance details of pose-transferred images: synthesizing new textures that match the style of the input source image. To this end, we propose a novel Region-Adaptive Texture Enhancing Network (RATE-Net), where an additional texture enhancing module estimates a residual map for detail refinement upon the coarse estimations of the pose transfer module. The architecture of our framework is shown in Fig. 1, where the source image is utilized in two ways: for the pose transfer module, we estimate a feature map roughly aligned with the target pose, which later serves as the spatial guidance for the texture enhancing module. In addition, we extract the style and textural information of the source image into compact codes, and inject the codes into the residual enhancing map through adaptive normalization layers [11]. To strengthen the mutual guidance between the two modules, we also design an alternate training strategy that further improves the overall performance. Extensive experiments conducted on the challenging DeepFashion [12] benchmark dataset demonstrate the superiority of the proposed framework over recent warping-based methods. In summary, our contributions are twofold:
We propose RATE-Net, a novel enhancement-based solution that utilizes the style and texture information of the input source image to refine the textures of coarse images. The network uses the source image for both label map estimation and texture/style control, which leads to better pose transfer results compared with warping-based methods.
We design an effective training strategy that maximizes the mutual guidance between the two modules, where the pose transformation mapping can be further refined through a style-aware loss function computed over enhanced images, which helps preserve the integrity of the human body.
2 Related Works
Human Pose Transfer was first described in PG2 [1], where the goal is to transfer the appearance of a person from the source image to new poses. Existing works typically focus on warping the textures from the source image to the target image based on the warping transformation estimated between the corresponding poses. DSC [2] segments the human skeleton into rigid parts and approximates the warping function with piece-wise affine transformations. Some recent works [7, 8, 9, 10] further estimate pixel-level feature warping flows by leveraging dense keypoint annotations such as DensePose [13] and SMPL [14]. However, estimating dense annotations and pixel-wise warping flows is typically computationally expensive, and the corresponding ground-truth annotations are difficult to collect. PATN [3] proposed a lightweight network that gradually transfers a person’s pose through several cascaded pose-attentional transfer blocks. However, most appearance details of the input image are lost due to the down-sampling operations, which often leads to inferior results when dealing with person images in texture-rich clothes. Instead, we propose a region-adaptive texture enhancing network that is better at capturing and re-synthesizing fine-grained details. Furthermore, it relies only on 2D skeletons for pose annotation and can be trained in an end-to-end fashion.
Adaptive Normalization provides a mechanism to inject guidance information into the main image generation pathway. It is typically implemented as an affine transformation over normalized feature responses, with parameters inferred from external data. AdaIN [11] was first proposed for the style transfer task, and later used in other tasks such as high-resolution face image synthesis [15] and few-shot I2I translation [16]. SPADE [17] further expands the mean and deviation parameters from vectors to 3-tensors, thus incorporating spatial attention into semantically controlled image synthesis. This idea has also proven useful for few-shot video synthesis [18]. Inspired by these works, we incorporate pose guidance into our appearance encoding network to capture fine-grained visual information from texture-rich regions for detail enhancement in synthesized images.
3 The Network Architecture
The proposed RATE-Net contains two modules: a pose transfer module that generates a coarse image under the target pose, and a texture enhancing module that estimates a residual map to fill in more appearance details on the coarse image. The overall architecture is illustrated in Fig. 1.
3.1 Notations
We first introduce some notations. The network takes two inputs, a source image $I_s$ and a target pose $P_t$, and tries to generate a new image $\hat{I}_t$ containing the person in $I_s$ under the pose $P_t$. An array of keypoint coordinates is estimated for both the source and target images, denoted as $K_s$ and $K_t$, respectively. During training, the generator is fed with paired images $(I_s, I_t)$ of the same person under different poses, along with the corresponding pose heatmaps $(P_s, P_t)$, which can be estimated and cached before training. The output of the network, $\hat{I}_t$, is compared with the ground-truth image $I_t$ for losses. Below we dissect the generator and elaborate on the two modules.
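As a concrete illustration, the pose heatmaps $P_s$ and $P_t$ can be rasterized from the cached keypoint arrays before training. The following is a minimal sketch, assuming one Gaussian channel per joint; the `sigma` value and the negative-coordinate convention for missing joints are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pose_to_heatmap(keypoints, size=(256, 256), sigma=6.0):
    """Rasterize (x, y) keypoints into per-joint Gaussian heatmaps.

    keypoints: (K, 2) array; a negative coordinate marks an undetected joint.
    Returns a (K, H, W) float32 array, one channel per keypoint.
    """
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    maps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # joint missing in this image: leave channel empty
            continue
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

An 18-point skeleton thus becomes an 18-channel heatmap, which can be concatenated with the source image along the channel dimension.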
3.2 Pose Transfer Module
As shown in Fig. 1, the input of the pose transfer module consists of a source image $I_s$ and a pose pair $(P_s, P_t)$, which are concatenated along the channel dimension. The network consists of several convolutional down-sampling layers, followed by a series of pose-attentional transfer blocks [3] that progressively warp the contents of the source image onto the target pose. The resulting feature map $F_t$ is then fed into the up-sampling layers to recover a coarse estimation $\hat{I}_c$ of the target image.
It should be emphasized that although our pose transfer module shares a similar network architecture with [3], the purpose and training strategy are completely different. In particular, we DO NOT aim to directly warp the fine-grained textures of the source image onto the target pose, as most of the textural information would be lost due to the down-sampling operations. Instead, we utilize the pose transfer network to acquire a reasonable content feature map under the target pose, which provides regional guidance for the subsequent texture enhancing module by hinting “where to add more texture details”. In this way, our framework can synthesize new textures with a separate network and build the final pose-transferred image in a coarse-to-fine manner, which greatly improves the training stability and the performance of our network. Ablation studies are presented to validate these claims.
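The attention-style update performed inside each pose-attentional transfer block [3] can be sketched conceptually as follows; `conv_img` and `conv_pose` are hypothetical stand-ins for the learned convolutional branches, and this single-step function only illustrates the masked blending idea, not the full cascaded architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patb_step(img_feat, pose_feat, conv_img, conv_pose):
    """One conceptual pose-attentional transfer step (in the spirit of PATN [3]).

    conv_img / conv_pose stand in for learned conv branches; here they are
    any callables mapping a (C, H, W) array to a (C, H, W) array.
    """
    mask = sigmoid(conv_pose(pose_feat))   # where the pose says to update
    update = conv_img(img_feat)            # what to write in those regions
    # attended regions take the new content; the rest keeps the old features
    return mask * update + (1.0 - mask) * img_feat
```

Stacking several such blocks lets the content features migrate toward the target pose progressively rather than in one jump.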
3.3 Texture Enhancing Module
The texture enhancing module aims to recover fine-grained visual details from the source image $I_s$, and to enhance the coarse estimation $\hat{I}_c$ by synthesizing a region-aware residual texture map $\Delta I$ under the guidance of the pose-aligned content feature map $F_t$. We reuse the source image to extract texture codes $z_t$ with an encoder containing several convolutional down-sampling layers, residual blocks and an average pooling operation. To inject the texture codes into the content feature map, we utilize Adaptive Instance Normalization [11], which originates from style transfer tasks. In particular, $z_t$ now plays the role of a “style guide” that controls the pattern and granularity of textures in the respective regions. The final output is acquired by adding the residual map onto the coarse estimation: $\hat{I}_t = \hat{I}_c + \Delta I$.
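A minimal sketch of the AdaIN operation [11] used to inject the texture information: here `gamma` and `beta` are assumed to have already been predicted from the texture code $z_t$ (the channel-first layout and the per-channel affine parameters are illustrative assumptions).

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization over a (C, H, W) feature map.

    Each channel is normalized to zero mean / unit variance, then re-scaled
    and shifted by the style-derived parameters gamma, beta (each shape (C,)).
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normed = (content - mu) / (sigma + eps)
    return gamma[:, None, None] * normed + beta[:, None, None]
```

After AdaIN-modulated decoding produces the residual map, the final image is simply the element-wise sum `coarse + residual`, so the enhancing branch only has to model what the coarse estimate is missing.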
We leverage the design in [3] that uses two discriminators to differentiate real and fake samples in terms of both shape and appearance consistency. The shape discriminator evaluates input image/pose pairs $(\hat{I}_t, P_t)$ for shape consistency, and the appearance discriminator compares the appearance consistency between the synthesized image $\hat{I}_t$ and the source image $I_s$. Unlike [3], which multiplies the two scores, we train the two discriminators separately so that both criteria can be individually analyzed and optimized.
4 The Training Strategy
It is clear from Fig. 1 that the two modules of our framework are mutually dependent: the texture enhancing module relies on the pose-aligned guidance map to put textures in the right places, and the losses computed over texture-enhanced images are back-propagated to the pose transfer module, which helps improve the accuracy of the estimated guidance map. Therefore, an alternate updating strategy should be useful for promoting the mutual guidance between the two modules and achieving the best overall performance. Concretely, for each input batch, we first update the pose transfer module with a loss function defined over the coarse estimation $\hat{I}_c$, before performing an end-to-end fine-tuning step that updates the two modules together with another texture-aware loss defined over the final output $\hat{I}_t$. The discriminators are then updated for $k$ steps, where we empirically found $k=3$ to strike a nice balance between training speed and discriminative capability. This constitutes one “1-1-3” training cycle of our framework.
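The schedule above can be sketched as a generic loop; the three update callables below are placeholders for the actual optimizer steps, not the paper's implementation.

```python
def train_cycle(batches, update_pose, update_joint, update_disc, d_steps=3):
    """One pass over `batches` following the alternate '1-1-3' schedule.

    Per batch: one coarse-loss update of the pose transfer module, one
    end-to-end update of both modules with the texture-aware loss, then
    d_steps discriminator updates.
    """
    for batch in batches:
        update_pose(batch)            # step 1: coarse loss -> pose module only
        update_joint(batch)           # step 2: full loss -> both modules
        for _ in range(d_steps):      # step 3: discriminator updates
            update_disc(batch)
```

With `d_steps=3`, every batch contributes 5 parameter updates, which is consistent with the 40K-cycle / 200K-iteration budget reported in Section 4.2.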
4.1 Loss Functions
As discussed in Section 3.2, the pose transfer module is not designed for conveying fine-grained texture information, and the coarse output $\hat{I}_c$ can be expected to lack visual realism. Therefore, it is unnecessary to enforce an adversarial loss at this stage. Instead, we simply adopt the following formulation:
$$\mathcal{L}_{coarse} = \lambda_{\ell_1} \mathcal{L}_{\ell_1} + \lambda_{perc} \mathcal{L}_{perc},$$
where $\mathcal{L}_{\ell_1} = \lVert \hat{I}_c - I_t \rVert_1$ is the pixel-wise L1 loss, and $\mathcal{L}_{perc}$ is the perceptual loss in [19]:
$$\mathcal{L}_{perc} = \sum_{l} w_l \left\lVert \phi_l(\hat{I}_c) - \phi_l(I_t) \right\rVert_1.$$
Here $\phi$ is a pretrained VGG-19 network and $l$ denotes the layer index. In practice, we found it effective to sample $\phi_l$ from different layers and use the weighted average loss to balance the perceptual consistency across different scales.
For the loss function in step 2, we further add style and adversarial loss terms upon $\mathcal{L}_{coarse}$, leading to the full loss function:
$$\mathcal{L}_{full} = \mathcal{L}_{coarse} + \lambda_{style} \mathcal{L}_{style} + \lambda_{adv} \mathcal{L}_{adv},$$
where $\mathcal{L}_{style}$ is the Gram-matrix based style loss [19] and $\mathcal{L}_{adv}$ is the sum of the adversarial losses from the shape and appearance discriminators, which is also used for the discriminator updates. Readers can find more details in the corresponding literature.
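For reference, the Gram-matrix style loss can be sketched as follows, assuming the VGG feature maps are given in channel-first `(C, H, W)` layout; the normalization constant is one common convention [19], not necessarily the exact one used here.

```python
import numpy as np

def gram(feat):
    """Gram matrix of channel correlations for a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)  # normalized (C, C) correlation matrix

def style_loss(feats_a, feats_b):
    """Sum of squared Gram-matrix differences over matched feature layers."""
    return sum(np.mean((gram(a) - gram(b)) ** 2) for a, b in zip(feats_a, feats_b))
```

Because the Gram matrix discards spatial arrangement and keeps only channel co-activation statistics, this term penalizes texture-style mismatches without forcing pixel-aligned copies of the source patterns.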
4.2 Implementation Details
We implement the proposed framework in PyTorch. Both modules include 3 down-sampling layers, and the pose transfer module contains 9 cascaded pose-attentional transfer blocks. LeakyReLU with a negative slope of 0.2 is used after convolution and normalization layers. The dimension of the texture codes is set to 128, and the impact of different lengths is further analyzed in the ablation study. We use the Rectified Adam optimizer [20] for better stability and performance at convergence. The full training involves 40K alternating cycles, leading to a total of 200K iterations. The same learning rate is used for all networks; it is kept fixed for the first 10K cycles and then linearly decays to 0. One of the loss weights is set to 10, and the other weights are set to 5.
5 Experiments
In this section, we compare our framework with several state-of-the-art methods to demonstrate its superiority. Furthermore, we perform a thorough ablation study to verify the efficacy of our main contributions.
Dataset We validate our proposed framework on the In-shop Clothes Retrieval Benchmark of DeepFashion [12], which contains about 50K images of fashion models in texture-rich clothes under various poses. The images are at 256 × 256 resolution with clean backgrounds. We adopt OpenPose [21] to estimate an 18-point skeleton for each image, and convert it into an 18-channel heatmap as in [3]. We use the train/test pairs in [3], where 101,966 pairs are randomly selected for training and 8,570 pairs for testing. The partition guarantees that the persons appearing in the training and testing sets do not overlap, making it more reliable for validating the generalization ability of our model.
Evaluation Metrics For the human pose transfer task, we aim to evaluate both the statistical fidelity and the perceptual quality of generated images. To this end, we adopt the Structural Similarity (SSIM) [22] and Inception Score (IS) [23] to account for the model’s performance in both respects. However, a recent study [24] pointed out that IS is theoretically flawed and cannot always provide useful guidance when comparing models. To evaluate the perceptual quality of generated images more reliably, we introduce two more supervised metrics: FID [25] and LPIPS [26]. Note that both metrics utilize a pretrained network to map images into a feature space, and compute the distance between image features with respect to both the global distribution and each pair of samples. We believe that supervised perceptual metrics better reflect the perceptual fidelity of our proposed model than the unsupervised IS metric.
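For reference, once feature means and covariances have been estimated for the real and generated image sets, FID [25] reduces to the closed-form Fréchet distance between two Gaussians. A minimal sketch using SciPy (the statistics themselves would come from Inception features, which are omitted here):

```python
import numpy as np
from scipy import linalg

def fid_from_stats(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to image features.

    FID = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 * sqrt(cov1 @ cov2)).
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can add tiny imaginary parts
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```

Identical feature distributions give a score of 0, and larger values indicate a greater distribution-level mismatch between real and synthesized images.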
5.1 Comparison with Previous Works
We compare our proposed framework with several representative works: DSC [2], VU-Net [27], SPT [28] and PATN [3]. All tests are carried out on the same set of testing pairs as in [3]. Table 1 shows a significant improvement of the proposed framework over recent state-of-the-art methods in terms of perceptual quality, while the SSIM and IS scores remain comparable. To further analyze the impact of the texture enhancing module, we also evaluate the quality of the coarse estimations without the residual map. As observed in Table 1, although the SSIM score drops slightly after enhancement, the perceptual fidelity is significantly improved, with a 20% gain on FID and a 10% gain on LPIPS. Furthermore, although the network backbones are highly similar, our pose transfer module still performs considerably better than the original PATN implementation in perceptual fidelity, which verifies the efficacy of our texture enhancing module in refining the estimation of the guidance map $F_t$.
In addition, we showcase qualitative results for some challenging examples with large pose transitions and texture-rich garments. As shown in Fig. 2, our method is more faithful to the input pose and appearance conditions than the other baselines and produces more realistic visual details, especially clothing textures and hair waves. Also, the gender bias issue on the DeepFashion dataset is partly resolved, as shown in the third row, where all the other methods mistakenly output a female image. Moreover, it can be observed that the texture enhancing module is effective in recovering fine-grained visual details lost in the pose transfer module and consistently improves the visual quality of synthesized human images.
5.2 Ablation Study
In this section, we perform an ablation study to analyze the effect of different parts of our proposed framework on the final performance. For each part, we set up a corresponding baseline by either removing this part from the whole framework or changing the key parameters.
PB Only We remove the texture enhancing module and train the pose transfer module directly in an end-to-end fashion, with the loss function computed over the coarse estimation $\hat{I}_c$. Note that this setting differs slightly from the PATN [3] baseline, as the loss function has an additional style loss term and the weights are slightly modified.
PB Fix To further investigate the efficacy of our proposed training scheme, we initialize the parameters of the pose transfer module with the pretrained models from [3] and keep them fixed during training. In this way, the interaction between the two modules is broken, and the style and conditional adversarial losses cannot be back-propagated into the pose transfer module.
Texture64dim We construct the corresponding baseline by reducing the length of the texture code to 64. In this way, less textural information is extracted from the source image, which could result in less informative textures and more severe artifacts.
Full This is the full version of our proposed framework used for comparing with other SOTA methods.
We report the quantitative performance of all four methods on the DeepFashion dataset in Table 2. As observed, our full method has a clear advantage over the other ablation baselines, especially in terms of perceptual distance metrics like FID and LPIPS, which we believe highlights the importance of textural enhancement in the human pose transfer task. It is also noteworthy that increasing the length of the texture code significantly boosts performance, and we speculate that this could lead to a useful method for determining the intrinsic dimension of the latent texture space, namely the code length at which the model’s performance stops improving. We did not evaluate our model with longer codes because the memory cost became intolerable; however, we believe it is still possible to further improve the performance with longer texture codes.
To better visualize the impact of different components, we showcase some representative examples in Fig. 3. As observed, our full framework is capable of synthesizing much finer textural details than all the other methods. In addition, our framework better preserves the integrity of the human body (as observed in the second and third rows), which we believe indicates the improved accuracy of the pose-aligned feature map due to the more localized guidance provided by the loss function defined over texture-enhanced images. This further justifies our claim of mutual guidance between the two modules and the proposed alternate training strategy.
6 Conclusion
We present RATE-Net, a novel framework for synthesizing person images with sharp texture details. Instead of simply warping patches from the source image at the risk of losing fine-grained texture details, we propose to synthesize new textures with an additional texture enhancing module that adds more visual details to the coarse pose transfer results. Furthermore, an effective training strategy is proposed to alternately update the two modules for better overall performance. Compared with previous works, our framework synthesizes human images with much finer details and better preserves the style and appearance of the source image. We believe the idea behind our framework can inspire other related topics in semantic image synthesis as well.
References
[1] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool, “Pose guided person image generation,” in NIPS, 2017, pp. 406–416.
[2] Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe, “Deformable GANs for pose-based human image generation,” in CVPR, 2018, pp. 3408–3416.
[3] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai, “Progressive pose attention transfer for person image generation,” in CVPR, 2019.
[4] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
[5] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, Lecture Notes in Computer Science, Springer, 2015.
[7] Natalia Neverova, Riza Alp Güler, and Iasonas Kokkinos, “Dense pose transfer,” in ECCV, 2018.
[8] Haoye Dong, Xiaodan Liang, Ke Gong, Hanjiang Lai, Jia Zhu, and Jian Yin, “Soft-gated warping-GAN for pose-guided person image synthesis,” in NeurIPS, 2018.
[9] Yining Li, Chen Huang, and Chen Change Loy, “Dense intrinsic appearance flow for human pose transfer,” in CVPR, 2019, pp. 3693–3702.
[10] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao, “Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis,” CoRR, 2019.
[11] Xun Huang and Serge J. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” CoRR, vol. abs/1703.06868, 2017.
[12] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang, “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,” in CVPR, 2016.
[13] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos, “DensePose: Dense human pose estimation in the wild,” in CVPR, 2018.
[14] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black, “SMPL: A skinned multi-person linear model,” ACM Trans. Graph., 2015.
[15] Tero Karras, Samuli Laine, and Timo Aila, “A style-based generator architecture for generative adversarial networks,” CoRR, vol. abs/1812.04948, 2018.
[16] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz, “Few-shot unsupervised image-to-image translation,” CoRR, vol. abs/1905.01723, 2019.
[17] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in CVPR, 2019.
[18] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro, “Few-shot video-to-video synthesis,” CoRR, vol. abs/1910.12713, 2019.
[19] Justin Johnson, Alexandre Alahi, and Fei-Fei Li, “Perceptual losses for real-time style transfer and super-resolution,” CoRR, vol. abs/1603.08155, 2016.
[20] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han, “On the variance of the adaptive learning rate and beyond,” ArXiv, 2019.
[21] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in CVPR, 2017.
[22] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Processing, 2004.
[23] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training GANs,” in NeurIPS, 2016.
[24] Shane Barratt and Rishi Sharma, “A note on the inception score,” arXiv preprint, 2018.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” CoRR, 2017.
[26] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
[27] Patrick Esser, Ekaterina Sutter, and Björn Ommer, “A variational U-Net for conditional appearance and shape generation,” in CVPR, 2018, pp. 8857–8866.
[28] Sijie Song, Wei Zhang, Jiaying Liu, and Tao Mei, “Unsupervised person image generation with semantic parsing transformation,” in CVPR, 2019, pp. 2357–2366.