Literature review and compilation for human motion synthesis and human novel view synthesis
In this paper, we tackle the problem of human motion transfer, where we synthesize a novel motion video for a target person that imitates the movement from a reference video. It is a video-to-video translation task in which the estimated poses are used to bridge the two domains. Despite substantial progress on the topic, several problems remain with previous methods. First, there is a domain gap between training and testing pose sequences: the model is tested on poses it has not seen during training, such as difficult dancing moves. Furthermore, pose detection errors are inevitable, making the job of the generator harder. Finally, generating realistic pixels from sparse poses is challenging in a single step. To address these challenges, we introduce a novel pose-to-video translation framework for generating high-quality videos that are temporally coherent even for in-the-wild pose sequences unseen during training. We propose a pose augmentation method to minimize the training-test gap, a unified paired and unpaired learning strategy to improve robustness to detection errors, and a two-stage network architecture to achieve superior texture quality. To further boost research on the topic, we build two human motion datasets. Finally, we show the superiority of our approach over state-of-the-art studies through extensive experiments and evaluations on different datasets.
Video synthesis receives growing attention from the research and industrial communities due to its wide range of applications. Among these, human motion retargeting has seen significant progress, showing that with up-to-date deep neural network design and training techniques, approximate human motion can be transferred from one video to another. Such methods make it possible to generate a personalized dancing video of a subject without any dancing experience. For example, one can animate oneself using a ballerina video, or generate motion-synchronized videos of multiple persons to be used for fake video detection.
Similarly to the Everybody Dance Now work  and other video-to-video translation works [58, 49], our method requires a training video of a person performing a variety of motions. An off-the-shelf body pose detector  is used to parse the pose skeleton, which is represented as multi-channel pose maps fed to our network. Then, instead of generating the entire frame, as previous methods do [7, 49], we argue that using only the foreground (i.e., the person) improves performance and increases the number of possible applications (Figure 1). Furthermore, focusing on the foreground saves network capacity and computation cost, and allows the generated foreground to be easily reused on a new background. Generating the entire frame limits the side movements of the generated person, as traditional methods do not handle the situation in which the person leaves the known background region. We therefore focus on the foreground only. For the application of changing the background, further limitations of generating the whole frame, such as introducing extra background regions and missing body limbs, are presented in Figure 2.
Despite substantial progress, state-of-the-art human motion retargeting methods are still far from perfect, with several questions remaining open. How can we generalize to arbitrary in-the-wild reference motions, including extreme poses not seen during training? How can we achieve results that are robust to possible pose detection errors? How can we obtain realistic texture details while keeping the video temporally consistent and smooth? These challenging issues prevent us from crossing the realism gap and achieving higher visual fidelity. To tackle the human motion translation problem, we need a more robust method that generalizes well across different pose domains, produces high-fidelity texture details, and features temporal consistency at the same time. In this work, we attempt to answer these questions by proposing a novel two-stage pose-to-video translation network employing a unified paired and unpaired learning framework.
Pose estimation networks often fail on in-the-wild input frames (Figure 4), so mismatching body parts appear. Moreover, for some poses, especially in in-the-wild scenarios, several keypoints can be missed. To tackle these issues, and to enrich the variance in the input and enhance the robustness of the network to keypoint detector errors, we propose a body pose augmentation method. Specifically, we drop out random pose channels and resize the lengths of certain body parts. Furthermore, we observe that direct synthesis of realistic textures from a sparse pose representation is often challenging (see Figure 5 for examples), so we propose a refinement network that refines textures of images obtained from a pose-to-image translation network. Finally, there exists a gap between training and testing poses: in-the-wild poses seen during testing will be substantially different from the training pose sequences extracted from the target person. This limitation is by design, as we would like to handle poses that the reference person is not able to perform themselves, including difficult dancing moves. To bridge the gap, we introduce unpaired learning into the paired training pipeline, improving the generalization of the system. To make unpaired learning feasible, we collect a large-scale single-person activities (SPA) dataset and use it as the source of in-the-wild input to the training pipeline, leveraging it along with the supervised paired training branch through a carefully designed combination of discriminators.
With our method, we can generate superior results featuring texture realism, motion smoothness and robustness to poses never seen during training. We extensively evaluate our system by conducting comparisons with state-of-the-art human motion retargeting techniques by reporting quantitative and qualitative metrics on three different datasets. The results support that our method significantly outperforms existing works by obtaining better numerical scores and achieving higher visual fidelity.
In summary, our major contributions are three-fold:
We propose a pose-to-video translation framework to generate high fidelity videos for unseen in-the-wild poses by minimizing the domain gaps between testing and training pose sequences.
We conduct extensive experiments and evaluations, both quantitatively and qualitatively, which demonstrate the significant advantage over state-of-the-art methods with superior result quality and generalization ability on unseen input poses. When presented with the results generated by our method vs. the ground truth examples, the users are often unable to tell which is real, preferring our method in 48.1% of cases.
We collect two datasets: a high-quality indoor human video dataset containing all training sequences of the target persons used in this paper, recorded in front of a green backdrop screen, and a single-person activities dataset including large-scale in-the-wild pose sequences that are used during both training and evaluation.
Image-to-Image Translation. With the recent development of conditional GANs , various inputs, including but not limited to classification categories  and images , can be used to condition high-fidelity image synthesis, besides the noise input of earlier works [11, 36]. Pix2pix [21] uses paired training data to transfer an image from one domain to another and introduces an encoder-decoder architecture with skip connections. This architecture and training method have been widely adopted in follow-up work. Subsequently, pix2pixHD  proposes a multi-scale generator and discriminator structure with residual blocks to synthesize high-resolution photo-realistic images. However, obtaining paired data may still be prohibitively difficult for some tasks. Unsupervised image-to-image translation methods have been proposed to minimize domain gaps using unpaired image data [59, 54]. Nevertheless, compared with single images, video translation is usually more challenging since temporal consistency is a vital factor, besides image quality, that impacts the final result.
Early efforts on unconditional video synthesis typically convert a latent code into low-resolution short video clips via a recurrent neural network [45, 42, 37]. These methods often suffer from low-quality results, similar to works on future frame prediction that also use adversarial training and image reconstruction losses [47, 46, 16, 12]. To generate photo-realistic videos and allow finer-grained control, conditional video generation methods have shown great potential recently by using a sequence of conditioning inputs [17, 49, 20, 8], such as semantic segmentation maps . A further group of works attempts to perform video-to-video translation requiring only a few images of the reference person. Despite substantial progress achieved recently [55, 28, 38, 18, 52], the generated human motion results are far from realistic [26, 48]. Our work also falls into the scope of video-to-video translation, specifically focusing on pose-to-video translation for unseen in-the-wild poses.
Human Motion Transfer. Synthesizing novel views of the human face and body has been studied extensively [29, 5, 10, 14, 23, 56, 33, 13, 30, 25]. Some methods transfer facial expressions with parametric face models [3, 41]. Extending such methods to animating bodies is not trivial. Similar to the face re-animation method of Kim et al. , human body motion transfer can be achieved by adopting image generation neural networks. Villegas et al. [43, 44] use pose to predict future frames and synthesize a new human video. Ma et al. [32, 31] use a reference image to synthesize a novel view given a target pose. Siarohin et al.  improve the approach by proposing a deformable network architecture. Balakrishnan et al.  segment human body parts according to the target pose and use a spatial transformation sub-module to synthesize unseen poses. Instead of predicting future frames or generating single frames from an input pose, we focus on high-fidelity personalized video generation, where a network is specifically trained for each target person.
Discussion. Our focus is to synthesize personalized videos given arbitrary reference pose sequences, utilizing training videos for the target subject. We borrow the basic problem formulation from existing literature on pose to image generation [7, 49, 58, 1]. However, compared with Everybody Dance Now (EDN) , Video-to-Video Synthesis (vid2vid) , and Zhou et al. , besides different methods on enforcing temporal coherence, we also adopt a two-stage translation and refinement network to improve generation texture quality, and propose a novel unified learning framework incorporating both paired and unpaired training examples to help the network generalize better on unseen in-the-wild poses. We demonstrate the clear advantage of our methods over EDN  and vid2vid  in experiments.
Given a source pose sequence , our goal is to generate a photo-realistic human video sequence of the target person performing a sequence of motions provided by the input sequence of poses.
To train the method, for each subject we record a set of video clips covering common body poses and motions. We then parse the human pose skeleton from each video frame to form the paired training data, where each pair consists of a pose skeleton and the corresponding ground-truth image. In the setting of a conditional GAN , a pose-to-image translation network can then be used to synthesize a body video from input poses.
However, for such a challenging task, there are a few critical issues that prevent the baseline framework from working well in most general cases:
There exist multiple domain gaps between the training and in-the-wild poses, including different recording devices/environments, subject identities, and motion styles. Direct inference on in-the-wild poses using networks trained purely with paired supervision will produce notably degraded results (Table 4);
A single-stage pose-to-image translation network tends to focus more on mapping the input pose to a rough spatial body image layout. Its limited capability is usually not sufficient to achieve high local visual realism at the same time, especially on dynamic texture details of body and clothes.
We detail our solutions to these problems in the next sections.
As the first stage, the Pose2Video network performs pose-to-video translation, as shown in Figure 3a ( in Stage I). We encode the tracked skeleton with parts, such as arms and legs, into an input pose map with exactly channels by drawing a line segment of each part at its corresponding map channel. The network structure follows pix2pixHD . However, instead of doing image-to-image translation at the single-frame level, we adapt the network to accept multi-frame input to further utilize training sequence pairs for better temporal consistency and smoothness. For each training data pair () at time , where and represent input pose maps and corresponding ground-truth frames respectively, we collect and stack frames centered around as network input , to help the network gain a better understanding of the temporal context. For discriminator , we use the concatenation of pose and the ground-truth image as a true pair, and pose and the generated image as a false pair.
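The pose-map construction and temporal stacking described above can be sketched as follows. This is a minimal illustration with hypothetical names: the actual channel layout, limb set, and line thickness used by the method are not specified in the text, and boundary frames are simply clamped here.

```python
import numpy as np

def rasterize_pose(keypoints, limbs, h, w):
    """Draw each limb of a 2D skeleton as a line segment in its own
    channel, producing an (n_limbs, h, w) pose map. `keypoints` maps
    joint index -> (x, y) pixel coordinates (absent if undetected)."""
    pose_map = np.zeros((len(limbs), h, w), dtype=np.float32)
    for c, (a, b) in enumerate(limbs):
        pa, pb = keypoints.get(a), keypoints.get(b)
        if pa is None or pb is None:
            continue  # missed keypoint: leave the channel empty
        # sample points densely along the segment and round to pixels
        n = int(max(abs(pb[0] - pa[0]), abs(pb[1] - pa[1]))) + 1
        xs = np.linspace(pa[0], pb[0], n).round().astype(int)
        ys = np.linspace(pa[1], pb[1], n).round().astype(int)
        ok = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        pose_map[c, ys[ok], xs[ok]] = 1.0
    return pose_map

def stack_temporal(pose_maps, t, k=1):
    """Stack the 2k+1 pose maps centered at frame t along the channel
    axis, clamping indices at the sequence boundaries."""
    idx = [min(max(i, 0), len(pose_maps) - 1) for i in range(t - k, t + k + 1)]
    return np.concatenate([pose_maps[i] for i in idx], axis=0)
```

The stacked maps give the generator temporal context around frame t, which is what lets the multi-frame variant smooth over per-frame detection noise.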
Pose Augmentation. The commonly used body tracking methods [6, 15, 53] may not be perfectly accurate. Additionally, their performance drops substantially on in-the-wild videos at inference time, which are usually more challenging compared with our training sequences captured in a controlled environment. To remedy this inevitable gap, we augment the training pose maps by randomly dropping some input channels, perturbing the locations of joint keypoints, and elongating or shortening body-part lengths in some channels (see the example in Figure 3f), so that both the lengths and locations of body limbs are randomly varied during training.
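A minimal sketch of such augmentation applied to raw keypoints, before rasterization, might look like the following. The limb tree, noise magnitudes, and drop probability here are illustrative assumptions, not the method's actual settings.

```python
import numpy as np

# Illustrative limb tree: (parent_joint, child_joint) per bone, listed
# root-first. The real channel layout is not specified in the paper.
LIMBS = [(0, 1), (1, 2), (2, 3)]  # e.g. a shoulder->elbow->wrist chain

def augment_pose(keypoints, rng, p_drop=0.1, jitter=2.0,
                 scale_range=(0.8, 1.2)):
    """Augment a skeleton to mimic detector errors: rescale limb
    lengths, jitter joint locations, and randomly drop joints so the
    corresponding pose-map channels come out empty."""
    kp = {j: np.asarray(p, dtype=float) for j, p in keypoints.items()}
    # 1) elongate/shorten each bone by moving the child joint along the
    #    parent->child direction (root-first, so shifts chain down the tree)
    for parent, child in LIMBS:
        if parent in kp and child in kp:
            s = rng.uniform(*scale_range)
            kp[child] = kp[parent] + s * (kp[child] - kp[parent])
    # 2) perturb joint locations with small Gaussian noise
    for j in kp:
        kp[j] = kp[j] + rng.normal(0.0, jitter, size=2)
    # 3) randomly drop joints, emptying the channels that use them
    return {j: p for j, p in kp.items() if rng.random() >= p_drop}
```

Because the augmented skeleton is rasterized afterwards, dropped joints translate directly into the missing-channel patterns the generator must learn to tolerate.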
Our Pose2Video network can already generate video output conditioned on the input pose sequence. However, for the challenging problem of photo-realistic human video generation, it is often difficult for the network to directly synthesize realistic texture from a sparse pose in a single stage. Therefore, we concatenate an additional texture refinement network after Pose2Video, as shown in Figure 3 ( in Stage II), to further refine the local texture details based on the rough output of the first stage. The texture refinement network also follows the conditional GAN setting, where the inputs to the generator are adjacent frames of Pose2Video RGB outputs , where . For the discriminator , the concatenation of and a ground-truth image is treated as a true pair, and and a generated image as a false pair.
Even with our pose augmentation method, the paired poses extracted from the recorded training data may still have limited coverage of the vast human motion space, given that users may want to test challenging motion sequences at inference time, such as ballet and hip-hop moves. Furthermore, in-the-wild videos with complex backgrounds and occlusions also tend to make pose estimation more difficult compared with our chroma-keyed training sequences. In light of this, we propose to not only use augmented pairs of training data but also introduce unpaired learning with in-the-wild poses to boost generalization ability and robustness.
We unify paired and unpaired training branches by swapping inputs and learning objectives while sharing weights of the target networks, as shown in Figure 3a. In what follows, we introduce learning strategies of both paired and unpaired training branches respectively and conclude with the full training objective.
Stages I and II share similar network structures and learning objectives. For simplicity, we use the same notation system to describe the training strategies of both stages within a single framework. Besides the paired pose map and ground-truth frame that are consistent across the paper, we also uniformly denote the corresponding input and generator of each network stage as and , where and for the Pose2Video network (Stage I), and and for the refinement network (Stage II).
In the paired learning branch, the existence of ground-truth images allows us to enforce paired supervision on the network output. Instead of using low-level pixel objectives such as an L1 reconstruction loss, we measure the perceptual similarity loss  with a VGG19 network  between the network output and the corresponding ground truth, letting the network gain a better semantic understanding and avoid the blurriness caused by imperfect pose-image pairing and non-deterministic texture details with respect to the pose.
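The layer-wise perceptual objective can be illustrated generically: given feature activations of the output and the ground truth extracted at several network layers (for a VGG19 loss, typically relu activations at increasing depth), the loss is a weighted sum of per-layer L1 distances. The weights below are assumptions; the same form also serves for the discriminator feature matching loss mentioned later.

```python
import numpy as np

def perceptual_loss(feats_fake, feats_real, weights=None):
    """Weighted sum of mean L1 distances between corresponding feature
    maps of the generated and ground-truth frames, taken at several
    layers of a fixed feature extractor (e.g. VGG19 activations)."""
    if weights is None:
        weights = [1.0] * len(feats_fake)
    return sum(w * np.abs(f - r).mean()
               for w, f, r in zip(weights, feats_fake, feats_real))
```

Matching intermediate features rather than raw pixels penalizes semantic mismatches while tolerating small spatial misalignments, which is why it yields less blurry results than a pixel L1 loss here.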
We also use a single-frame discriminator conditioned on the input pose to enforce natural results and proper pose correspondence:
where is the temporal stacking of adjacent frames centered around , as described in the Pose2Video Network section.
To achieve more stable GAN training, we also adopt the discriminator feature matching loss .
Besides single-frame quality, the generated sequences should also be temporally consistent and retain plausible motion quality. Therefore, we use an additional unconditioned temporal discriminator to tell whether a continuous subset of result frames is realistic or not in the temporal domain:
where  stands for the stacking of  frames of generator outputs. We also use a feature matching loss for .
During unpaired learning, instead of using recorded paired data, we randomly feed the network with body pose inputs extracted from video sequences of different subjects. Incorporating these in-the-wild inputs helps bridge the pose domain gap and increases network robustness against unseen inputs. Without paired ground-truth supervision, we perform unpaired training with network models that share weights with paired training and solely adopt a single-frame discriminator similar to the one introduced in the paired learning section. Different from , the positive examples do not share the same condition as the negative ones, but are pairs randomly drawn from the recorded sequences used in paired learning:
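The unpaired discriminator objective can be sketched as follows: since an in-the-wild pose has no ground-truth image, the "real" pair is drawn at random from the recorded paired-training data, while the "fake" pair conditions on the wild pose. The LSGAN-style squared-error form and the helper names here are assumptions for illustration.

```python
import numpy as np

def unpaired_d_loss(disc, pose_wild, fake_img, recorded_pairs, rng):
    """Discriminator loss for the unpaired branch (LSGAN-style squared
    error, an assumed form). The real pair is a randomly drawn
    (pose, image) example from the recorded paired data; the fake pair
    couples the in-the-wild pose with the generator's output for it."""
    pose_rec, img_rec = recorded_pairs[rng.integers(len(recorded_pairs))]
    d_real = disc(pose_rec, img_rec)   # should score high (real)
    d_fake = disc(pose_wild, fake_img)  # should score low (fake)
    return 0.5 * ((d_real - 1.0) ** 2 + d_fake ** 2)
```

Because the generator weights are shared with the paired branch, gradients from this loss push it to produce plausible frames even for pose domains that have no ground truth.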
We train our networks with a two-stage training strategy: we first train the Pose2Video network by optimizing , , and , and then fix the weights in and train the refinement network by optimizing , , and . The overall loss functions of both training stages are similarly defined as:
In this section, we conduct both quantitative and qualitative experiments to demonstrate the advantages of our method, especially on challenging poses.
We use three datasets to perform validation. The first one is released with the Everybody Dance Now (EDN) paper . The dataset includes videos of five subjects. We follow the training and validation strategy in  to perform experiments. To facilitate the following presentation, we denote the validation videos from EDN as EDN-Vali.
For the second dataset, we collected videos of four subjects. All videos are filmed in an indoor environment with the subject standing in front of a green backdrop screen (an example frame is shown in Figure 3(c)) to help isolate the foreground and achieve better segmentation quality. An iPhone fixed on a tripod is used to shoot the videos. During the process, all subjects are asked either to perform slow random moves or to follow simple online dancing videos (these guidance videos are not used in either training or validation). On average, we collect  minutes of video for each target person. We split all videos of each subject into training and validation sets with a ratio of . We denote the validation set as Target-Vali.
We process the captured data by first applying an off-the-shelf pose detection network  to obtain estimated body poses. Then we perform chroma-key composition to mask out the target person and change the background to a solid green color. Finally, we crop each frame with the smallest rectangle that encloses the target person, as shown in Figure 3(e).
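The chroma-key masking and tight cropping steps can be sketched as below. The green-dominance threshold is a hypothetical choice for illustration; a production pipeline would typically use a dedicated keying tool or a segmentation network.

```python
import numpy as np

def green_mask(frame, g_margin=40):
    """Rough chroma-key on an (h, w, 3) RGB frame: flag pixels whose
    green channel dominates red and blue by at least `g_margin`
    (threshold is an assumption). Returns True for foreground."""
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    keyed = (g - r > g_margin) & (g - b > g_margin)  # background screen
    return ~keyed

def crop_to_person(frame, fg):
    """Crop the frame to the smallest rectangle enclosing the
    foreground mask, as done before feeding frames to the network."""
    ys, xs = np.nonzero(fg)
    return frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

After keying, the background can be replaced with a solid color and the crop keeps the person roughly centered and scale-normalized across frames.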
Besides the two datasets for paired learning, we create a large-scale single-person activities dataset (SPA) for unpaired learning and validation. To make SPA suitable for pose-to-video generation, we collect single-person activity videos and make sure that every video captures the whole human body and includes only one person. The average duration of each video is about  seconds, and we extract frames at  FPS, giving  frames in total. The body poses of each SPA video are detected for training and inference use. Additionally, we randomly hold out  videos from SPA as a validation dataset, denoted as SPA-Vali, to verify model performance on unseen in-the-wild poses. The average duration of SPA-Vali videos is about  seconds. Compared with the other two validation datasets, SPA-Vali is more challenging.
We adopt the multi-scale generator and discriminator architecture and apply a progressive training schedule: we first train a model at 128×256 resolution and then upsample to 256×512. We set  for the input pose maps , and  in Eqn. 2. The hyper-parameters in Eqn. 4 are set as  and . We use an initial learning rate of 0.0002 and gradually decrease it.
We numerically evaluate the generated videos with both objective analysis and subjective user studies.
Objective Metrics. We adopt three widely used objective metrics to assess result quality: the SSIM (Structural Similarity)  index to measure perceived image quality degradation between synthesized and real frames; LPIPS (Learned Perceptual Image Patch Similarity)  to measure the perceptual similarity between generated and real images; and FID (Fréchet Inception Distance)  to measure the distribution distance.
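As an illustration of the third metric, a minimal numpy version of the FID formula is sketched below: it computes ||mu_a - mu_b||^2 + Tr(Ca + Cb - 2(Ca·Cb)^{1/2}) over two sets of pre-extracted features. Practical implementations use InceptionV3 pool3 activations and a robust matrix square root (e.g. scipy.linalg.sqrtm); the eigendecomposition here is a simplification.

```python
import numpy as np

def fid(feats_a, feats_b):
    """Fréchet Inception Distance between two (n_samples, dim) feature
    sets: squared mean difference plus a covariance term. The matrix
    square root is taken via eigendecomposition, which suffices because
    a product of covariance matrices has non-negative eigenvalues."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    ca = np.cov(feats_a, rowvar=False)
    cb = np.cov(feats_b, rowvar=False)
    vals, vecs = np.linalg.eig(ca @ cb)
    sqrt_vals = np.sqrt(np.clip(np.real(vals), 0.0, None))
    covmean = np.real(vecs @ np.diag(sqrt_vals) @ np.linalg.inv(vecs))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(ca + cb - 2.0 * covmean))
```

Identical feature distributions give a score near zero, and shifting one set only moves the mean term, which makes the metric easy to sanity-check.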
Subjective Scores. We also conduct user studies to analyze the video quality regarding real human perception. The experiments are performed using the Amazon Mechanical Turk (AMT) platform. We design two settings for users to compare video quality: 1) Pairwise comparison: we show workers pairs of videos with exactly the same motion but from two different sources (including ground-truth, results of our method, or results of state-of-the-art methods), and ask them to choose the more realistic one. The two videos are shown side-by-side and their orders are randomly chosen; 2) Single-video evaluation: we only show workers a single video and ask them whether this video looks real to them. Workers can only choose a video as real or not.
Both EDN  and vid2vid  are state-of-the-art methods on human body motion transfer and have achieved significantly better results compared with existing image-based human body generation methods [4, 50]. In the following sections, we compare our method with the two studies and show both qualitative and quantitative results.
|Method|SSIM ↑|LPIPS ↓|
|ours w/ BG|0.948|0.027|
|ours segmented FG|0.959|0.031|
|ours w/o BG|0.976|0.015|
We use the data collected by EDN to train our models, and then validate the models on EDN-Vali dataset. We conduct three experimental settings: 1) ours w/ BG: similarly to EDN, we use original images with background for training and validation; 2) ours segmented FG: similarly to EDN, we use original images to train the networks, but apply a segmentation network  on the generated images to get foreground region; 3) ours w/o BG: we apply a segmentation network on the training data and only use the segmented foreground to train our model, so the synthesized results only contain foreground. Following EDN, we run the experiments for five subjects and report the averaged results. The results in Table 2 show that our method outperforms EDN significantly as we achieve much higher SSIM and much lower LPIPS. We also notice that using the foreground region for training performs better than using a whole frame as the networks can focus on synthesizing only the foreground part. Although the foreground region can be obtained by segmenting the generated image, the image quality is inferior to the foreground that is synthesized directly, supporting that it is beneficial to remove the background prior to training.
In their original article, the vid2vid authors extract human poses with both OpenPose  and DensePose . However, DensePose detects human body shapes together with skeleton poses, and this shape information encodes the identity of the reference subject, which can make it difficult to preserve the target identity during inference. For a fair comparison, we implement their method using the same pose generation and augmentation methods as ours.
We list the average SSIM and FID metrics in Table 2, evaluated on our Target-Vali dataset, for which we have access to the ground truth. Compared with vid2vid , we achieve noticeably higher SSIM and lower FID scores, indicating that our method synthesizes results with better objective quality.
We also present user study results for both the pair-wise comparison and single-video evaluation settings in Table 4 and Table 4. From the pair-wise comparison results, we can see that  and  of users prefer our results to those of vid2vid on the Target-Vali and SPA-Vali datasets respectively. Users also demonstrate very close preferences between our results and the ground truth ( vs. ), which shows that our results are close to the real videos. For the single-video evaluation, we report the percentage of videos that are rated as real by users. Our method achieves better scores than vid2vid on both datasets as well. As expected, both methods receive lower scores on SPA-Vali compared with Target-Vali, since SPA contains more challenging poses. However, our method continues to perform substantially better than vid2vid, demonstrating superior generalization.
For visual comparison, we randomly select pose sequences from SPA-Vali and generate motion transfer results with both our method and vid2vid in Figure 4. The results confirm that our approach generates more realistic results in cases where the pose detector fails to reliably find the keypoints. For example, in the top row of Figure 4a, the left foot is incorrectly detached from the leg in the first frame; our method is still capable of generating the foot in the right position with consistent orientation, while vid2vid generates an unnatural dot at the wrong location, misled by the input pose. In Figure 4b, the right lower arm is entirely missing in the second frame; our method successfully predicts the missing arm by utilizing the adjacent frames, achieving significantly better results than vid2vid. In Figure 4c, the position of the left lower arm in the second frame is inconsistent with its neighbors; our method again generates consistent results, while vid2vid generates an extra arm in that frame. In Figure 4d, the detected face keypoints are unnaturally stretched in the third frame; in contrast to the broken head produced by vid2vid, our method still manages to generate an intact head and hair with consistent direction. Besides robustness, it is worth noticing that vid2vid often produces less satisfactory foreground masks, which leads to internal holes or missing parts, as shown in Figure 4b and Figure 4d. In contrast, our method consistently produces complete and accurate foreground boundaries.
We perform ablation studies to identify which of the contributions are responsible for superior quality. We report the following experiments: 1) PL-Stage1: as the baseline, we adopt a single-stage Pose2Video network without unpaired learning and pose augmentation, and stack the input poses as , which follows the settings of existing methods [7, 49]; 2) PL-Stage1-DA: we add pose augmentation to PL-Stage1; 3) PL-Stage1-DA-F: based on PL-Stage1-DA, we change input poses to ; 4) PL-Stage2: we add the second refinement stage to PL-Stage1-DA-F, but still without unpaired learning; 5) PL-UL-Stage2: our full method, with unpaired learning used on both stages.
Table 5 summarizes the quantitative ablation results. By incrementally introducing these key components, both the SSIM and FID metrics gradually improve, supporting that all the proposed contributions are essential for photorealistic human motion retargeting. We also present visual results of the ablation analysis in Figure 5. By zooming in on certain areas, we can clearly see that, compared with the baseline, our full method produces better mask boundaries, fewer artifacts, and richer, sharper texture details.
More motion transfer examples generated by our method are shown in Figure 6, including results for both uniformly sampled single frames and consecutive frame sequences. These various results demonstrate that our method can generate high-fidelity results on various subjects from challenging body pose sequences.
We introduced a novel approach for video-based human motion transfer. The network consists of two stages, with one sub-network for pose-to-image translation and another for image-to-image translation. The network is trained on stacked continuous frames to achieve temporally consistent results. We also incorporate a unified paired and unpaired learning strategy and a pose augmentation method during training to help the network generalize well to unseen in-the-wild poses. Our experiments support that all the proposed contributions are essential for obtaining realistic human motion retargeting. To further boost research on the topic, we collected two datasets containing human motion videos. The first dataset is used for training personalized models, while the second one is used for unpaired learning and evaluation.
Isola, P., J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
Johnson, J., A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711, 2016.
Odena, A., C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, pp. 2642–2651, 2017.
Saito, M., E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2830–2839, 2017.
Walker, J., C. Doersch, A. Gupta, and M. Hebert. An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851, 2016.
Zhang, R., P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.