Video-based human motion transfer is an interesting but challenging research problem. Given two monocular video clips, one of a source subject and the other of a target subject, the goal is to transfer the motion from the source person to the target while maintaining the target person’s appearance. Specifically, in the synthesized video, the subject should have the same motion as the source person and the same appearance as the target person (including clothes and background). To achieve this, it is essential to produce high-quality image-to-image translation of frames while ensuring temporal coherence.
The main difficulty lies in effectively decoupling and recombining the pose and appearance information of the source and target subjects. Based on generative adversarial networks (GANs), a powerful tool for high-quality image-to-image translation, Chan et al. [6] proposed to first learn a mapping from a 2D pose to a subject image from the target video, and then use the pose of the source subject as the input to the learned mapping for video synthesis. However, due to the difference between the source and target poses, this approach often results in noticeable artifacts, especially when body parts occlude one another.
Observing that the self-occlusion issue is difficult to handle in the image domain, we propose to first reconstruct 3D human models from 2D images of both the source and target subjects, and then adjust the pose of the target human body to match the source (while maintaining the target person’s body shape). An intrinsic geometric description of the deformed target is then projected back to 2D to form an image that reflects the 3D structure.
This image, along with the 2D pose figure extracted from the source image, is used as a constraint during GAN-based image-to-image translation to effectively maintain the structural characteristics of the human body under different poses.
In addition, previous methods [6, 28] only use the appearance of the target person when training pose-to-image translation, and do not fully utilize the appearance of the source. When an input pose is very different from any pose seen during training, such solutions might lead to blurry results. Observing that the source video frame corresponding to the input pose might contain rich reusable details (especially for body parts such as hands, where the source and target subjects share some similarity), we selectively transfer details from real source frames to the synthesized video frames. This is achieved by our detail enhancement network. Figure 1 shows representative motion transfer results with rich details.
We summarize our contributions as follows: 1) We propose to reconstruct a 3D human body with its shape from a target frame and its pose from a source frame, and project it to 2D to serve as the condition of a GAN-based network. This projection contains rich 3D information, including body shape, pose, and occlusion, which helps maintain the structural characteristics of the human body in the generated images. 2) We introduce the detail enhancement net (DE-Net), which utilizes information from the real source frames to enhance details in the generated results.
2 Related Work
Over the last decades, motion transfer has been extensively studied due to its potential for fast video content production. Some early solutions mainly revolved around realigning existing video footage according to its similarity to the desired pose [3, 7]. However, it is not easy to find an accurate similarity measure for different actions of different subjects. Several other approaches have also attempted to address this problem in 3D, but they focus on the use of inverse kinematic solvers and transfer motion between 3D skeletons, whereas we consider using a reconstructed 3D body mesh to guide motion transfer in the image domain.
Recently, rapid advances in deep learning, especially generative adversarial networks (GANs) and their variants (e.g., cGAN [22], CoGAN [17], CycleGAN [34], DiscoGAN [15]), have provided a powerful tool for image-to-image translation. These models have yielded impressive results across a wide spectrum of synthesis tasks and can synthesize visually pleasing images from conditional labels. Pix2pix [12], based on a conditional GAN framework, is one of the pioneering works. CycleGAN [34] further presents the idea of a cycle consistency loss for learning to translate between two domains in the absence of paired images, and Recycle-GAN [2] combines both spatial and temporal constraints for video retargeting tasks. Pix2pixHD [29] introduces a multi-scale conditional GAN to synthesize high-resolution images using both global and local generators, and vid2vid [28] designs specific spatial and temporal adversarial constraints for video synthesis.
Based on these GAN variants, many approaches [1, 6, 8, 19, 20] have been proposed for human motion transfer between two domains. The key idea of these approaches is to decouple the pose information from the input image and use it as the input of a GAN to generate a realistic output image. For example, one approach separates the input image into two parts, the foreground (or different body parts) and the background, and generates the final realistic image by processing the two parts separately and then fusing them. Chan et al. [6] extract pose information with the off-the-shelf human pose detector OpenPose [4, 5, 26, 31] and use the pix2pixHD [29] framework together with a specialized Face GAN to learn a mapping from a 2D pose figure to an image. Neverova et al. [23] adopt a similar idea but use DensePose estimates to guide image generation. Wang et al. [28] go a step further and adopt both OpenPose and DensePose. However, due to the lack of 3D semantic information, these approaches are highly sensitive to problems such as self-occlusions.
To solve the above problems, it is natural to add 3D information to the condition of generative networks. There are many robust 3D human mesh reconstruction methods, such as [13, 14, 24, 32], which can reconstruct a 3D model with the corresponding pose from a single image or a video clip. Benefiting from these accurate and reliable 3D body reconstruction techniques, we can study human motion transfer from a new perspective. Liu et al. [18] present a novel warping strategy, which also uses the projection of 3D models, to tackle motion transfer, appearance transfer and novel view synthesis within a unified model. However, because their network serves several purposes at once, it does not perform particularly well on motion transfer.
We aim to generate a new video of the target person imitating the character movements in the source video, while keeping the structural integrity and detail features of the target subject as much as possible. To accomplish this, we use the mesh projection containing 3D information as the condition for the GAN, and introduce a detail enhancement mechanism to improve the details.
We denote as a set of source video frames, and as a set of target frames. Our architecture can be divided into two modules, as shown in Figure 2: the Motion Transfer Net (MT-Net) on account of motion transfer across two domains and the Detail Enhancement Net (DE-Net) used for the enhancement of details. More specifically, MT-Net takes two real frames and as input, and generates an output image that has the same appearance as and the pose as . DE-Net takes image as input, which is the blending of raw transfer result with blurred details and corresponding real frame with rich details, and aims to generate an image in the target domain with the details enhanced. Our training pipeline is as follows:
Within-Domain Pre-Training of MT-Net. To stabilize the training process, we first pre-train MT-Net using within-domain samples. For the domain , let , and we can obtain , which is the reconstructed source frame. For the domain , let , and we should obtain the reconstructed target frame. This process initializes the MT-Net. Note that for each , is randomly selected from corresponding domain and fixed during the training process.
Training of DE-Net. After the pre-training of MT-Net, we let and , which generates initial transferred image by the MT-Net, which is often blurred. We then calculate a blended image which is an average of and corresponding real frame that contains clear details. We then train the DE-Net to discern and generate details from the blended result selectively to produce output image with details enhanced that matches the target domain.
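The blending step described above can be sketched in a few lines. This is only an illustration of a pixel-wise mean of two equally sized images; the function name `blend_mean` and the toy arrays are hypothetical and not taken from the authors' code.

```python
import numpy as np

def blend_mean(raw_transfer: np.ndarray, real_frame: np.ndarray) -> np.ndarray:
    """Pixel-wise mean of two images of identical shape (hypothetical helper)."""
    if raw_transfer.shape != real_frame.shape:
        raise ValueError("images must share the same shape")
    # Promote to float first so that uint8 inputs do not overflow when summed.
    return (raw_transfer.astype(np.float32) + real_frame.astype(np.float32)) / 2.0

# Toy stand-ins for a blurred transfer result and a detailed real frame.
raw = np.full((4, 4, 3), 100, dtype=np.uint8)
real = np.full((4, 4, 3), 200, dtype=np.uint8)
blended = blend_mean(raw, real)
```

Because the two inputs are averaged rather than concatenated, the network never sees the real frame in isolation, which is consistent with the stated aim of selecting details rather than copying them.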
Our transfer pipeline is as follows: Let and , we can get the initial transfer result (with the source pose and target appearance) using MT-Net. And then we can obtain the final result with details enhanced by the DE-Net.
Note that the domains of and for training DE-Net and transfer are swapped, because in the training setting, has the appearance as the source and pose as the target, and DE-Net aims to produce an image with the appearance and pose both in the target domain, so the ground truth of is available (which is exactly corresponding ). This provides supervision for training to enhance details in the target domain (i.e. with the appearance of the target subject). Such supervision is not available if the transfer setting is used.
3.2 Motion Transfer Net
As illustrated in Figure 3, the Motion Transfer Net consists of 3 parts: Label Image Generator that produces label images that encode 2D/3D human pose and shape information, Appearance Encoder that encodes the appearance of the input image, and Pose GAN that produces an output image with given appearance and pose/shape constraints.
3.2.1 Label Image Generation via 3D Human Models
To maintain the structural integrity of the generated results and produce realistic images for actions involving self-occlusions, we utilize the 3D geometry information of the underlying subject to produce label images as the GAN condition to regularize the generative network. The architecture of our label image generator is shown in Figure 4.
3D human model reconstruction. We first extract the 3D body shape and pose information for both source and target videos using a state-of-the-art pre-trained 3D pose and shape estimator [32]. This yields a 3D deformable mesh model that includes the details of the body, face and fingers. When transferring between two domains, the 3D human models also allow the generation of a 3D mesh with the pose from one domain and the shape from the other. The extracted deformable mesh sequences might exhibit temporal incoherence artifacts due to inevitable reconstruction errors. This can be alleviated by simply applying temporal smoothing to the mesh vertices, since all meshes in our sequences share the same connectivity.
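Since every mesh in a sequence shares the same connectivity, vertex positions can be filtered directly along the time axis. The following is a minimal sketch assuming a centered moving average with edge padding; the filter type, the window size, and the name `smooth_vertices` are assumptions, as the text does not specify which smoothing is used.

```python
import numpy as np

def smooth_vertices(verts: np.ndarray, window: int = 5) -> np.ndarray:
    """Centered moving average over time for a (T, V, 3) vertex sequence."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    # Repeat the first and last frames so the output keeps all T frames.
    padded = np.pad(verts, ((half, half), (0, 0), (0, 0)), mode="edge")
    out = np.zeros(verts.shape, dtype=np.float64)
    for offset in range(window):
        out += padded[offset:offset + verts.shape[0]]
    return out / window

# A static two-vertex "mesh" with a single-frame jitter on one coordinate.
verts = np.zeros((5, 2, 3))
verts[2, 0, 0] = 3.0
smoothed = smooth_vertices(verts, window=3)
```

In this toy sequence the spike of 3.0 at frame 2 is spread over frames 1 to 3, which is the kind of jitter suppression the paragraph describes.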
Human model projection. We project the reconstructed 3D human model onto 2D to obtain a label image, which is used as the condition to guide the generator. The image should ideally contain intrinsic 3D information (invariant to pose changes) to guide the synthesis process, such that a particular color corresponds to a specific location on the human body. To achieve this, we propose to extract the three non-trivial eigenvectors corresponding to the three smallest eigenvalues and treat them as a 3-channel image assigned to each vertex, which is projected to 2D to form a 3D constraint image. Although additional 3D information is available, 3D meshes extracted from 2D images may occasionally contain artifacts due to the inherent ambiguity. Therefore, we also adopt OpenPose [4, 5, 26, 31] to extract a 2D pose figure as part of the condition, which is less informative but more robust in the 2D space. Our label image therefore has 6 channels after combining both 2D and 3D constraints.
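A rough sketch of the per-vertex coloring follows, under the assumption that the eigenvectors are those of a mesh Laplacian. A uniform graph Laplacian is used here as a simple stand-in for whichever discretization the authors use, and the function name, the octahedron test mesh, and the zero-filled placeholder images are all hypothetical. Rasterizing the colored mesh to 2D is omitted.

```python
import numpy as np

def laplacian_vertex_colors(num_vertices, faces):
    """Return (eigenvalues, per-vertex 3-channel colors in [0, 1])."""
    # Build the adjacency matrix from triangle edges (uniform weights).
    adjacency = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adjacency[i, j] = adjacency[j, i] = 1.0
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue 0 on a connected mesh).
    feats = eigvecs[:, 1:4]
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    colors = (feats - lo) / np.maximum(hi - lo, 1e-12)
    return eigvals, colors

# Octahedron: 6 vertices, 8 triangular faces.
faces = [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
         (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]
eigvals, colors = laplacian_vertex_colors(6, faces)

# Stacking a 3-channel 3D constraint image with a 3-channel 2D pose figure
# gives the 6-channel label image; zeros are placeholders for rendered images.
constraint_image = np.zeros((8, 4, 3))
pose_figure = np.zeros((8, 4, 3))
label_image = np.concatenate([constraint_image, pose_figure], axis=-1)
```

Because the Laplacian depends only on connectivity in this sketch, the colors stay fixed when the mesh is re-posed, which matches the requirement that a given color identify a given body location.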
3.2.2 Pose to Image Translation
We learn the mapping from label image sequences to realistic image sequences by training a conditional GAN, consisting of Appearance Encoder and Pose GAN. The design of Pose GAN is similar to pix2pixHD: it is composed of a generator and two multi-scale frame discriminators with the same architecture, for images in the source and target domains respectively. The two networks drive each other: the generator learns to synthesize more realistic images conditioned on the input to fool the discriminator, while the discriminator in turn learns to distinguish “real” images (ground truth) from “fake” (generated) images. Our network differs from pix2pixHD in that the data we use to learn the mapping includes not only the target video but also the source video. We achieve this by conditioning the generator on both label images and appearance features, extracted by the Label Image Generator and Appearance Encoder respectively. See Figure 5 for the generative network architecture. It is worth mentioning that, to address the poor continuity caused by single-frame generation, adjacent frames are involved in training to improve temporal coherence, similar to [6].
Appearance Encoder and Pose GAN Generator. As mentioned above, to make full use of the given data and meet the needs of the subsequent detail enhancement, we train the generative network using data from both the source video and the target video. However, training two separate conditional GANs incurs a high cost in computing resources and time. To simplify this process, we introduce an Appearance Encoder, and use label images containing 2D/3D constraints together with appearance features to guide Pose GAN to produce the reconstructed image (for within-domain input) or the initial transfer result (for cross-domain input). Note that when the given subject video is replaced, our framework only needs to fine-tune the upsampling part of Pose GAN to generate a new subject.
Appearance Encoder is a fully convolutional network that extracts appearance features of the input image, which are used as a condition for the Pose GAN. It takes randomly selected frames as input and outputs appearance features corresponding to that domain. Pose GAN is the main part of MT-Net and consists of three submodules: Downsampling, ResNet blocks and Upsampling. It operates on both the label images and the appearance features extracted by the Appearance Encoder, and synthesizes results with the corresponding pose and appearance. As shown in Figure 5, the output of Appearance Encoder is added to the intermediate ResNet blocks in the generator.
Pose GAN Discriminator. We use the multi-scale discriminator presented in pix2pixHD [29]. Discriminators operating at different scales assess images at different levels of detail. In our method, we use two discriminators, one per domain and each with 3 scales, to estimate the probability that a generated image belongs to the corresponding domain.
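One way to picture the three scales is as a pyramid in which each level is a downsampled copy of the frame, with one discriminator head per level. The sketch below builds such a pyramid with plain 2×2 average pooling; the pooling choice and the helper names `downsample2x` and `image_pyramid` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def downsample2x(img: np.ndarray) -> np.ndarray:
    """Halve height and width by averaging non-overlapping 2x2 blocks."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w].astype(np.float64)  # drop a trailing odd row or column
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def image_pyramid(img: np.ndarray, num_scales: int = 3) -> list:
    """Return [full, half, quarter, ...] resolution copies of img."""
    levels = [img.astype(np.float64)]
    for _ in range(num_scales - 1):
        levels.append(downsample2x(levels[-1]))
    return levels

frame = np.arange(16 * 8 * 3, dtype=np.float64).reshape(16, 8, 3)
pyramid = image_pyramid(frame)
```

The coarsest level gives a discriminator a wide view of overall body structure, while the finest level exposes local texture, which is the usual motivation for judging images at several scales.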
Temporal Smoothing. We use the temporal smoothing strategy in [6] to enhance the continuity between adjacent generated frames. The generation of the current frame depends not only on the current label image but also on the previous frame.
Therefore, let denote the domain in which the training images are selected, our conditional GAN has the following objective:
Here the discriminator takes a pair of adjacent images in the domain and classifies them as real images (the current frame and the previous frame from the training set) or fake images (the generated current frame and the previous frame output by the Pose GAN).
3.3 Detail Enhancement Net
Through the first stage of training, we can obtain initial transfer results with blurred details: source to target transfer result as well as target to source transfer result . We can also construct paired data and , as shown in Figure 6, where images in the same pair have the same pose but different appearances, and different clarity of details, which motivates us to use a DE-Net to enhance the details from the blended image. Note that in different videos, subjects may have different builds or different positions relative to camera. In such cases, before sending to DE-Net we transform source frame by aligning the head position and height in accordance with target frame with simple scaling and translation.
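The alignment by scaling and translation can be written as a similarity transform fixed by two measurements per subject: a head position and a body height in pixels. The sketch below assumes these are already available (for example from detected keypoints); the function names and the sample numbers are hypothetical.

```python
import numpy as np

def align_transform(src_head, src_height, tgt_head, tgt_height):
    """Return (scale, translation) so that p -> scale * p + translation
    maps the source head onto the target head and matches body height."""
    scale = tgt_height / src_height
    translation = np.asarray(tgt_head, dtype=float) - scale * np.asarray(src_head, dtype=float)
    return scale, translation

def apply_transform(points, scale, translation):
    """Apply the uniform scale and translation to (x, y) pixel coordinates."""
    return scale * np.asarray(points, dtype=float) + translation

# Source subject: head at (100, 50), 400 px tall. Target: head at (260, 80), 600 px tall.
scale, translation = align_transform((100, 50), 400.0, (260, 80), 600.0)
head_aligned = apply_transform((100, 50), scale, translation)
foot_aligned = apply_transform((100, 450), scale, translation)
```

Here the source head lands on (260, 80) and the source foot on (260, 680), so the aligned source subject spans the same 600 pixels as the target.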
The purpose of our DE-Net is to generate clear details of target images from the blended image pair. It is a GAN where the generator is a U-net which synthesizes images in the target domain with clear details, as illustrated in Figure 7. The discriminator discerns the “real” images (ground truth) and “fake” images (synthesized by ).
In the training stage, we use the blended image pair as input and train DE-Net in a supervised manner with as ground truth. Using the mean blended image instead of concatenation prevents the output from overfitting to . We optimize the DE-Net by the following objective:
Where and are the label images used to generate .
In the transfer stage, we use the source to target transfer result and the corresponding source image to obtain enhanced transfer result.
3.4 Full Objective
The training of our network is divided into two stages. First we train the Motion Transfer Net, which consists of Pose GAN and Appearance Encoder. The full objective contains adversarial loss, perceptual loss and discriminator feature-matching loss, which has the following form:
The discriminator feature-matching loss is presented in pix2pixHD [29] and similarly regularizes the output using intermediate features of the discriminator, calculated as
where is the number of layers, is the number of elements in the th layer and is the index of discriminators in the multi-scale architecture. is the condition of cGAN and is the corresponding ground truth. The DE-Net is optimized with the following objective
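Returning to the discriminator feature-matching loss described in this section, the sketch below follows the form used in pix2pixHD: for each discriminator in the multi-scale architecture and each of its layers, take the L1 distance between the features of the real and the generated image, divided by the number of elements in that layer. The nested-list layout and the function name are assumptions for illustration, and any per-term weights used in the authors' implementation are not included.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """real_feats[k][i] and fake_feats[k][i] hold the layer-i features of
    discriminator k for the real and generated image, respectively."""
    total = 0.0
    for real_k, fake_k in zip(real_feats, fake_feats):      # over discriminators
        for real_i, fake_i in zip(real_k, fake_k):          # over layers
            # Mean absolute difference = L1 norm divided by the element count.
            total += float(np.abs(real_i - fake_i).mean())
    return total

# Two discriminators with two layers each; features differ by exactly 1 everywhere.
real = [[np.ones((2, 3)), np.ones(4)], [np.ones((1, 5)), np.ones(2)]]
fake = [[np.zeros((2, 3)), np.zeros(4)], [np.zeros((1, 5)), np.zeros(2)]]
loss = feature_matching_loss(real, fake)
```

Each of the four layers contributes 1.0 in this toy case, so the loss is 4.0, and it drops to zero when the generated features match the real ones.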
We compare our method with state-of-the-art methods and ablation variants, both quantitatively and qualitatively.
Dataset. To verify the performance of our method, we collected 8 in-the-wild single-dancer videos from YouTube and filmed 5 videos ourselves, of which 2 have an ordinary background and 3 a green screen. All videos are at resolution and each lasts 2-5 minutes. Each subject wears different clothes and performs different types of action, such as freestyle dancing and stretching exercises. To prepare for training and testing, we cut the start and end parts that contain no action, and crop and normalize each frame to 512 × 1024 by simple scaling and translation.
We adopt a multi-stage training strategy using the Adam optimizer with a learning rate of 0.0001. In the first stage, we pre-train the MT-Net for 20 epochs. In the next stage, the parameters of MT-Net are fixed and DE-Net is trained separately for 10 epochs. We set hyperparameters and for both stages. More details about MT-Net and DE-Net are given in the supplementary material.
4.2 Quantitative Results
Evaluation Metrics. We use objective metrics for quantitative evaluation under two different conditions: 1) To directly measure the quality of the generated images, we perform self-transfer, in which the source and target are the same subject, and then use SSIM [30] and Learned Perceptual Image Patch Similarity (LPIPS) [33] to assess the similarity between source and target images. We split the frames of each subject into training and test sets at a ratio of 8:2 for this evaluation. 2) We also evaluate the performance of cross-subject transfer, where the source and target are different subjects, using Inception Score [25] and Fréchet Inception Distance (FID) [11] as metrics. Note that we compute the FID score between the original and generated target images, since no ground truth exists for comparison in this case. We exclude the green screen dataset from the quantitative evaluation to focus on more challenging cases.
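For readers unfamiliar with FID, the quantity is the Fréchet distance between two Gaussians fitted to image features: the squared distance between the means plus Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). The sketch below computes that distance from given means and covariances. It does not extract Inception features, and the function names are illustrative rather than from any particular library.

```python
import numpy as np

def feature_stats(feats: np.ndarray):
    """Mean vector and covariance matrix of an (N, D) feature array."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, cov1, mu2, cov2) -> float:
    """Fréchet distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    diff = mu1 - mu2
    # For positive semi-definite inputs, Tr((cov1 cov2)^(1/2)) equals the sum
    # of the square roots of the eigenvalues of cov1 @ cov2.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Same covariance, shifted mean: the distance is the squared mean offset.
shift_only = frechet_distance(np.zeros(3), np.eye(3), [1.0, 2.0, 2.0], np.eye(3))
# Same mean, different covariance scale.
scale_only = frechet_distance(np.zeros(2), np.eye(2), np.zeros(2), 4.0 * np.eye(2))
```

In the cross-subject setting described above, the two sets of statistics would come from features of the original target frames and of the generated frames, so a lower value indicates that the generated images are distributed more like real images of the target.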
The metrics mentioned above are all based on single frames, which cannot reflect the smoothness of generated image sequences. The effect of mesh filtering in time series can be observed in the video results and quantitatively measured by the user study.
4.2.1 Comparison with existing methods.
Comparison results with state-of-the-art methods are reported in Table I. Our method performs better than the others.
Table I. Methods compared: vid2vid, Chan et al., LW-GAN, ours.
4.2.2 Ablation study
In this part, we perform an ablation study to verify the impact of each component of our model, including the use of 3D constraints (“3D”) and DE-Net (“DE”). Our full pipeline is denoted “Full”. When 3D is disabled, we use 2D pose figures as the default label images.
Table II shows the results of the ablation study. Our full framework performs better than the other variants, and both the 3D constraints and DE-Net improve the results. The comparison of MT(3D only) and MT demonstrates the effect of the 3D constraints. However, because the projection results fail to fully characterize human facial features and expression, the score of MT(3D only) is slightly below that of MT+3D. Furthermore, we observe that the self-transfer scores of MT and MT+3D (or MT+DE and Full) are similar. This is because the source and target subjects share the same body shape in self-transfer, which somewhat limits the effectiveness of the 3D constraints, whereas the cross-subject transfer scores demonstrate the important role 3D information plays in transfer between subjects with different shapes.
4.2.3 User evaluation
We also conduct a user study to measure the human perceptual quality for cross-subject transfer results. In our experiments, we compare videos generated by vid2vid, Chan et al., Liquid Warping GAN and our method. Specifically, we show to volunteers a series of videos by each of the methods at the resolution of , and the volunteers are given unlimited time to make responses. 50 distinct participants are involved and each of them is asked to select: 1) the clearest result with rich details; 2) the most temporally stable result; and 3) the overall best result. As shown in Table III, our method is more realistic, with richer details and with better temporal stability in comparison with other methods.
Table III. Methods compared in the user study: vid2vid, Chan et al., LW-GAN, ours.
4.3 Qualitative Results
We visualize our generated results in Figure 8. It can be seen that our method successfully drives the motions of different targets with structural integrity and rich details, particularly in the face and hands. We also demonstrate that our method outperforms existing methods in Figure 9.
As illustrated in the first row of Figure 9, our method improves the structural integrity of arms and legs, and avoids missing hands in cases where other methods fail to generate them. At the same time, our method also renders details of the generated results more accurately, such as the facial expression shown in the second row of Figure 9. Figure 10 shows the advantage of using the 3D constraints and DE-Net.
4.4 Limitations and Discussion
Although our model is able to synthesize motion-transferred images with high authenticity and rich details, there are still several limitations. Figure 11 shows some failure cases with visual artifacts. In the left example, our model fails to eliminate the long hair of the source subject in the result, while in the right example, some undesired parts of clothing appear in the generated image because of the loose clothes worn by the source subject.
These failure cases are mainly attributed to abnormal movements of the source subject that cause large changes in the apparent body shape, such as shaking hair or suddenly opening clothes, in which case the DE-Net fails to eliminate the extra details. Our future work will focus on improving the ability of DE-Net to avoid undesired details in the transfer results.
We have proposed a new approach to human motion transfer. It employs 3D body shape and pose constraints, which are more expressive and complete than 2D constraints, as a condition to regularize the generative adversarial learning framework. We also design a detail enhancement mechanism that reinforces the details of the synthesized results using detailed information from real source frames. Extensive experiments show that our method outperforms existing methods both visually and quantitatively.
[1] (2018) Synthesizing images of humans in unseen poses. pp. 8340–8348.
[2] (2018) Recycle-GAN: unsupervised video retargeting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–135.
[3] (1997) Video rewrite: driving visual speech with audio. In SIGGRAPH.
[4] OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008.
[5] (2017) Realtime multi-person 2D pose estimation using part affinity fields. In CVPR.
[6] (2018) Everybody Dance Now.
[7] (2003) Recognizing action at a distance. In IEEE International Conference on Computer Vision, Nice, France, pp. 726–733.
[8] (2018) A variational U-Net for conditional appearance and shape generation.
[9] (2018) DensePose: dense human pose estimation in the wild.
[10] (2008) Real-time motion retargeting to highly varied user-created morphologies. In SIGGRAPH 2008.
[11] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
[12] Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
[13] (2018) Total capture: a 3D deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[14] (2018) End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR).
[15] (2017) Learning to discover cross-domain relations with generative adversarial networks. CoRR abs/1703.05192.
[16] (1999) A hierarchical approach to interactive motion editing for human-like figures. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1999, Los Angeles, CA, USA, August 8-13, 1999, W. N. Waggenspack (Ed.), pp. 39–48.
[17] (2016) Coupled generative adversarial networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 469–477.
[18] (2019) Liquid Warping GAN: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In The IEEE International Conference on Computer Vision (ICCV).
[19] (2017) Pose guided person image generation. In Advances in Neural Information Processing Systems, pp. 405–415.
[20] (2018) Disentangled person image generation. In IEEE Conference on Computer Vision and Pattern Recognition.
[21] (2003) Discrete differential-geometry operators for triangulated 2-manifolds. In Visualization and Mathematics III, pp. 35–57.
[22] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[23] (2018) Dense pose transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 123–138.
[24] (2019) Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[25] (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
[26] (2017) Hand keypoint detection in single images using multiview bootstrapping. In CVPR.
[27] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[28] (2018) Video-to-video synthesis.
[29] (2017) (Website)
[30] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[31] (2016) Convolutional pose machines. In CVPR.
[32] (2019) Monocular total capture: posing face, body, and hands in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[33] The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
[34] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on.