This is the official PyTorch implementation of the CVPR 2020 paper "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting".
We present a lightweight video motion retargeting approach, TransMoMo, that is capable of realistically transferring the motion of a person in a source video to another video of a target person. Without using any paired data for supervision, the proposed method can be trained in an unsupervised manner by exploiting invariance properties of three orthogonal factors of variation: motion, structure, and view-angle. Specifically, with loss functions carefully derived based on invariance, we train an auto-encoder to disentangle the latent representations of these factors given the source and target video clips. This allows us to selectively transfer motion extracted from the source video seamlessly to the target video in spite of structural and view-angle disparities between the source and the target. Because it does not assume paired data, our method can be trained on a vast amount of videos without manual annotation of source-target pairing, leading to improved robustness against large structural variations and extreme motion in videos. We demonstrate the effectiveness of our method over the state-of-the-art methods. Code, model and data are publicly available on our project page (https://yzhq97.github.io/transmomo).
Let’s sway you could look into my eyes. Let’s sway under the moonlight, this serious moonlight.
Can an amateur dancer learn instantly how to dance like a professional in different styles, e.g., Tango, Locking, Salsa, and Kompa? While it is almost impossible in reality, one can now achieve this virtually via motion retargeting - transferring the motion of a source video featuring a professional dancer to a target video of him/herself.
Motion retargeting is an emerging topic in both computer vision and graphics due to its wide applicability to content creation. Most existing methods [41, 28, 30] achieve motion retargeting through high-quality 3D pose estimation or reconstruction. These methods either require complex and expensive optimization or are error-prone on unconstrained videos that contain complex motion. Recently, several efforts have also been made to retarget motion in 2D space [2, 6, 23]. Image-based methods [15, 4] obtain compelling results on conditional person generation. However, these methods neglect the temporal coherence of video and thus suffer from flickering results. Video-based methods [44, 6, 2] show state-of-the-art results. However, insufficient consideration of the variance between two individuals [44, 6], or the limitation of training on synthesized data, makes their results deteriorate dramatically when encountering large structural variations or extreme motion in web videos.
In this study, we aim to address video motion retargeting via an end-to-end learnable framework in 2D space, bypassing the need for explicit estimation of 3D human pose. Despite recent progress in generative frameworks and motion synthesis, learning for motion retargeting in 2D space remains challenging due to the following issues: 1) Considering the large structural and view-angle variances between the source and target videos, it is difficult to learn a direct person-to-person mapping at the pixel level. Conventional image-to-image translation methods tend to generate unnatural motion in extreme conditions or fail on unseen examples; 2) No corresponding image pairs of two different subjects performing the same motion are available to supervise the learning of such a transfer; 3) Human motion is highly articulated and complex, so it is challenging to perform motion modeling and transfer.
To address the first challenge, instead of performing direct video-to-video translation at the pixel level, we decompose the translation process into three steps as shown in Fig. 2, i.e., skeleton extraction, motion retargeting on skeletons, and skeleton-to-video rendering. The decomposition allows us to focus on the core problem of motion retargeting, using skeleton sequences as the input and output spaces. To cope with the second and third challenges, we exploit the invariance property of three factors: motion, structure, and view-angle. These factors of variation are enforced to be independent of each other, each held constant when the other factors vary. In particular, 1) motion should be invariant despite structural and view-angle perturbations, 2) structure of one skeleton sequence should be consistent across time and invariant despite view-angle perturbations, and 3) view-angle of one skeleton sequence should be consistent across time and invariant despite structural perturbations. The invariance properties allow us to derive a set of purely unsupervised loss functions to train an auto-encoder for disentangling a sequence of skeletons into orthogonal latent representations of motion, structure, and view-angle. Given the disentangled representation, one can easily mix the latent codes of motion and structure from different skeleton sequences for motion retargeting. Taking a different view-angle as a condition to the decoder, one can generate retargeted motion from novel viewpoints. Since motion retargeting is performed in the 2D skeleton space, it can be seen as a lightweight and plug-and-play module, which is complementary to existing skeleton extraction [5, 3, 35, 48] and skeleton-to-video rendering methods [6, 44, 43].
There are several existing studies designed for general representation disentanglement in video [20, 40, 13]. While these methods have shown impressive results in constrained scenarios, it is difficult for them to model articulated human motion due to its highly non-linear and complex kinematic structure. Instead, our method is designed specifically for representation disentanglement in human videos.
We summarize our contributions as follows: 1) We propose a novel Motion Retargeting Network in 2D skeleton space, which can be trained end-to-end with unlabeled web data. 2) We introduce novel loss functions based on invariance to endow the proposed network with the ability to disentangle representations in a purely unsupervised manner. 3) Extensive experiments demonstrate the effectiveness of our method over other state-of-the-art approaches [6, 2, 41], especially in in-the-wild scenarios where motion is complex.
Video Motion Retargeting. Hodgins and Pollard proposed a control system parameter scaling algorithm to adapt simulated motion to new characters. Lee and Shin decomposed the problem into inter-frame constraints and intra-frame relationships, and modeled them with an Inverse Kinematics problem and B-spline curves, respectively. Choi and Ko proposed a real-time method based on inverse rate control that computes the changes in joint angles. Tak and Ko proposed a per-frame filter framework to generate physically plausible motion sequences. Recently, Villegas et al. designed a recurrent neural network architecture with a Forward Kinematics layer to capture high-level properties of motion. However, the target to be animated in the aforementioned approaches is typically an articulated virtual character, and their results critically depend on the accuracy of 3D pose estimation. More recently, Aberman et al. proposed to retarget motion in 2D space. However, since their training relies on synthetic paired data, the performance is likely to degrade in unconstrained scenarios. Instead, our method can be trained on purely unlabeled web data, which makes it robust in the challenging in-the-wild motion transfer task.
There exist a few attempts to address the video motion retargeting problem. Liu et al. designed a novel GAN architecture with an attentive discriminator network and better conditioning inputs. However, this method relies on 3D reconstruction of the target person. Aberman et al. proposed to tackle video-driven performance cloning in a two-branch framework. Chan et al. proposed a simple but effective method to obtain temporally coherent video results. Wang et al. achieve results of similar quality to Chan et al. with more complex shape representation and temporal modelling. However, the performance of all these methods degrades dramatically when large variations exist between the two individuals, since they either give no consideration to body variations [1, 43, 44] or address them with a simple rescaling.
Unsupervised Representation Disentanglement. There is a vast literature [26, 29, 21, 36, 47, 46] on disentangling factors of variation. Bilinear models were an early approach to separate content and style for images of faces and text in various fonts. Recently, InfoGAN learned a generative model with disentangled factors based on Generative Adversarial Networks (GANs). β-VAE and DIP-VAE build on Variational Auto-Encoders (VAEs) to disentangle interpretable factors in an unsupervised way.
Other approaches explore general methods for learning disentangled representations from video. Whitney et al. used a gating principle to encourage each dimension of the latent representation to capture a distinct mode of variation. Villegas et al. used an unsupervised approach to factor video into content and motion. Denton et al. proposed to leverage the temporal coherence of video and a novel adversarial loss to learn a disentangled representation. MoCoGAN employs unsupervised adversarial training to learn the separation of motion and content. Hsieh et al. proposed an auto-encoder framework that combines structured probabilistic models and deep networks for disentanglement. However, the performance of these methods is not satisfactory on human videos, since they are not designed specifically for the disentanglement of highly articulated and complex objects.
Various machine learning algorithms have been used to generate realistic person images. The generation process can be conditionally guided by skeleton keypoints [4, 31] and style codes [32, 15, 11]. Our method is complementary to image-based person generation approaches and can further improve their temporal coherence, since it performs motion retargeting in the 2D skeleton space only.
As illustrated in Fig. 2, we decompose the translation process into three steps, i.e., skeleton extraction, motion retargeting and skeleton-to-video rendering. In our framework, motion retargeting is the most important component in which we introduce our core contribution (i.e., invariance-driven disentanglement). Skeleton extraction and skeleton-to-video rendering are replaceable and can thus benefit from recent advances in 2D keypoints estimation [3, 5, 48] and image-to-image translation [22, 44, 43].
The Motion Retargeting Network decomposes 2D joint input sequences into a motion code that represents the movements of the actor, a structure code that represents the body shape of the actor, and a view-angle code that represents the camera angle. The decoder takes any combination of the latent codes and produces a reconstructed 3D joint sequence, which automatically isolates view from motion and structure.
For transferring motion from a source video to a target video, we first use an off-the-shelf 2D keypoints detector to extract joint sequences from videos. By combining the motion code encoded from the source sequence and the structure code encoded from the target sequence, our model then yields a transferred 3D joint sequence. The transferred sequence is then projected back to 2D with any desired view-angle. Finally, we convert the 2D joint sequence frame-by-frame to a pixel-level representation, i.e., label maps. These label maps are fed into a pre-trained image-to-image generator to render the transferred video.
Here, we detail the encoders and decoders for an input sequence $\mathbf{x} \in \mathbb{R}^{T \times 2N}$, where $T$ is the length of the sequence and $N$ is the number of body joints.
The motion encoder uses several layers of one-dimensional temporal convolution to extract motion information: $E_m(\mathbf{x}) = \mathbf{m} \in \mathbb{R}^{M \times C_m}$, where $M$ is the sequence length after encoding and $C_m$ is the number of channels. Note that the motion code $\mathbf{m}$ is variable in length so as to preserve temporal information.
The structure encoder has a similar network structure $\tilde{E}_s$, with the difference that the final structure code is obtained after a temporal max pooling: $E_s(\mathbf{x}) = \mathrm{maxpool}(\tilde{E}_s(\mathbf{x})) = \mathbf{s}$, therefore $\mathbf{s} \in \mathbb{R}^{C_s}$. Effectively, the process of obtaining the structure code can be interpreted as performing multiple body shape estimations in sliding windows, $\tilde{E}_s(\mathbf{x}) = [\mathbf{s}_1, \ldots, \mathbf{s}_M]$, and then aggregating the estimations. Assuming the viewpoint is also stationary (i.e., all the temporal variances are caused by the movements of the actor), the view code $\mathbf{v} = E_v(\mathbf{x}) \in \mathbb{R}^{C_v}$ is obtained the same way we obtained the structure code.
The decoder $G$ takes the motion, body and view codes as input and reconstructs a 3D joint sequence $\hat{\mathbf{X}} = G(\mathbf{m}, \mathbf{s}, \mathbf{v}) \in \mathbb{R}^{T \times 3N}$ through convolution layers, in symmetry with the encoders. Our discriminator $D$ is a temporal convolutional network that is similar to our motion encoder.
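The aggregation step of the structure (and view) encoder can be illustrated with a minimal sketch. It assumes the window-level estimates are already available as plain lists of floats; the function name `temporal_max_pool` and the numbers are hypothetical and are not taken from the released code.

```python
from typing import List


def temporal_max_pool(window_codes: List[List[float]]) -> List[float]:
    """Collapse M window-level codes (each with C channels) into one code of
    C channels by taking the channel-wise maximum over time."""
    if not window_codes:
        raise ValueError("need at least one window-level estimate")
    # zip(*...) regroups the values by channel, so max() runs over time.
    return [max(channel) for channel in zip(*window_codes)]


# Three sliding-window estimates with two channels each (made-up numbers).
estimates = [[0.2, 1.0], [0.5, 0.4], [0.1, 0.9]]
structure_code = temporal_max_pool(estimates)
```

Because the pooled code has a fixed size regardless of the number of windows, sequences of different lengths map to structure codes of the same shape, whereas the motion code keeps its temporal axis.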
The disentanglement of motion, structure and view is achieved by leveraging the invariance of each of these factors to changes in the other two. We design loss terms to restrict changes when perturbations are added, while the entire network tries to reconstruct joint sequences from decomposed features. Structural perturbation is added through limb scaling, i.e., manually shortening or extending the length of the limbs. View perturbation is introduced by rotating the reconstructed 3D sequence and projecting it back to 2D. Motion perturbation need not be explicitly added, since motion itself varies through time. We first describe the ways perturbations are added and then detail the definitions of the loss terms derived from the three invariances, i.e., motion, structure and view-angle invariance.
Limb Scaling as Structural Perturbation. For an input 2D sequence $\mathbf{x}$, we create a structurally-perturbed sequence $\mathbf{x}'$ by elongating or shortening the limbs of the performer, as illustrated in Figure 3. It is done in such a way that the created sequence is effectively the same motion performed by a different actor. The length of a limb is extended or shortened by the same ratio across all frames, so limb scaling does not introduce ambiguity between motion and body structure. Specifically, the limb-scaled sequence is created by applying the limb-scale function frame-by-frame: $\mathbf{x}'_t = f(\mathbf{x}_t; \gamma_1, \ldots, \gamma_L, \gamma_g)$, where $\mathbf{x}_t$ is the $t$-th frame in the input sequence, $f$ is the limb scaling function, $\gamma_1, \ldots, \gamma_L$ are the local scaling factors (one per limb) and $\gamma_g$ is the global scaling factor. Modeling the human skeleton as a tree and joints as its nodes, we define the pelvis joint as the root. For each frame in the sequence, starting from the root, we recursively move the joints and all their dependent joints (child nodes) along the direction of the limb by distance $(\gamma_i - 1)\,\ell_i^{(t)}$, where $\ell_i^{(t)}$ is the original length of the $i$-th limb in the $t$-th frame. After all local scaling factors have been applied, the global scaling factor is directly multiplied with all the joint coordinates.
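The per-frame limb scaling described above can be sketched as a tree traversal. This is an illustrative sketch under stated assumptions, not the authors' implementation: the skeleton is given as a hypothetical `parents` list (with `-1` marking the pelvis root), each limb is identified by its child joint, and the names `scale_limbs`, `local_scale` and `global_scale` are invented here. Rebuilding each child from its already-scaled parent is equivalent to moving the child and all its descendants along the limb direction.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def scale_limbs(joints: List[Point], parents: List[int],
                local_scale: Dict[int, float], global_scale: float) -> List[Point]:
    """Scale every limb of one 2D frame, then scale the whole frame.

    parents[j] is the parent joint of j, or -1 for the root (pelvis).
    local_scale maps a child joint index to the ratio applied to the limb
    connecting it to its parent; limbs that are not listed keep ratio 1.
    """
    children: Dict[int, List[int]] = {j: [] for j in range(len(joints))}
    root = parents.index(-1)
    for j, p in enumerate(parents):
        if p >= 0:
            children[p].append(j)

    scaled: Dict[int, Point] = {root: joints[root]}
    stack = [root]
    while stack:
        p = stack.pop()
        for c in children[p]:
            ratio = local_scale.get(c, 1.0)
            # Original limb vector from parent to child.
            dx = joints[c][0] - joints[p][0]
            dy = joints[c][1] - joints[p][1]
            # Attach the scaled limb to the already-moved parent, which also
            # carries the displacement down to every descendant.
            scaled[c] = (scaled[p][0] + ratio * dx, scaled[p][1] + ratio * dy)
            stack.append(c)

    return [(global_scale * scaled[j][0], global_scale * scaled[j][1])
            for j in range(len(joints))]


# Toy leg: pelvis -> knee -> ankle, with the thigh stretched by 1.5x.
frame = [(0.0, 0.0), (0.0, -1.0), (0.0, -2.0)]
perturbed = scale_limbs(frame, parents=[-1, 0, 1],
                        local_scale={1: 1.5}, global_scale=2.0)
```

Applying the same `local_scale` and `global_scale` to every frame of a sequence keeps the ratios constant over time, which is what keeps the perturbation from being confused with motion.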
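3D Rotation as View Perturbation. Let $\phi$ be a rotate-and-project function, i.e., for a 3D coordinate $\mathbf{p} \in \mathbb{R}^3$:
$$\phi(\mathbf{p}, \theta; \mathbf{n}) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} R(\mathbf{n}, \theta)\, \mathbf{p},$$
where $R(\mathbf{n}, \theta)$ is a rotation matrix obtained using Rodrigues’ rotation formula, $\theta$ is the rotation angle, and $\mathbf{n}$ is a unit vector representing the axis around which we rotate. In practice, $\mathbf{n}$ is an estimated vertical direction of the body. It is computed using four points: left shoulder, right shoulder, left pelvis and right pelvis. Note that $\phi$ is differentiable with respect to $\mathbf{p}$.
As shown in Fig. 4, we create several rotated sequences from the reconstructed 3D sequence $\hat{\mathbf{X}}$:
$$\hat{\mathbf{x}}^{(k)} = \phi(\hat{\mathbf{X}}, \theta_k; \mathbf{n}), \quad k = 1, \ldots, K,$$
where $\theta_k$ are the rotation angles and $K$ is the number of projections. Loss terms enforcing disentanglement will be described later in this chapter.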
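A rotate-and-project step of this kind can be sketched in a few lines. The sketch below assumes an orthographic projection that simply drops the depth coordinate, and it takes the rotation axis as an input rather than estimating it from the shoulders and pelvis; the function names `rodrigues` and `rotate_and_project` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rodrigues(axis: Sequence[float], theta: float) -> List[List[float]]:
    """Rotation matrix for a rotation by theta (radians) around `axis`,
    built with Rodrigues' rotation formula. The axis is normalized first."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]


def rotate_and_project(point: Sequence[float], axis: Sequence[float],
                       theta: float) -> Tuple[float, float]:
    """Rotate a 3D point around `axis`, then keep only its x and y coordinates."""
    rot = rodrigues(axis, theta)
    rotated = [sum(rot[i][j] * point[j] for j in range(3)) for i in range(3)]
    return rotated[0], rotated[1]


# Rotating around the vertical (y) axis by 180 degrees mirrors x and keeps y.
px, py = rotate_and_project((1.0, 2.0, 3.0), axis=(0.0, 1.0, 0.0), theta=math.pi)
```

Applying `rotate_and_project` to every joint of every frame with a shared axis and angle gives one rotated 2D sequence; repeating this for several angles gives the set of view-perturbed sequences.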
Motion should be invariant despite structural and view-angle perturbations. To this end, we designed the following loss terms.
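Cross Reconstruction Loss. Recall that we use limb scaling to obtain data of the same movements performed by “different” actors $\mathbf{x}$ and $\mathbf{x}'$. We cross reconstruct the two sequences, as shown in Fig. 5. The cross reconstruction involves encoding, swapping and decoding, namely:
$$\tilde{\mathbf{x}} = \left[G\big(E_m(\mathbf{x}'), E_s(\mathbf{x}), E_v(\mathbf{x})\big)\right]_{xy}, \qquad \tilde{\mathbf{x}}' = \left[G\big(E_m(\mathbf{x}), E_s(\mathbf{x}'), E_v(\mathbf{x}')\big)\right]_{xy},$$
where $\mathbf{x}'$ is the limb-scaled version of $\mathbf{x}$; $E_m$, $E_s$ and $E_v$ are the motion, structure and view encoders; $G$ is the decoder; and $[\cdot]_{xy}$ keeps the $x$ and $y$ coordinates of the decoded 3D sequence. Since $\mathbf{x}$ and $\mathbf{x}'$ have the same motion, we expect $\tilde{\mathbf{x}}$ to be the same as $\mathbf{x}$, and $\tilde{\mathbf{x}}'$ to be the same as $\mathbf{x}'$. Therefore, the cross reconstruction loss is defined as
$$\mathcal{L}_{\text{crs}} = \mathbb{E}\left[\tfrac{1}{2}\left(\|\tilde{\mathbf{x}} - \mathbf{x}\|^2 + \|\tilde{\mathbf{x}}' - \mathbf{x}'\|^2\right)\right].$$
Structural Invariance Loss. This signal is to ensure that the motion codes are invariant to structural changes. $\mathbf{x}$ and $\mathbf{x}'$ have the same motion but different body structures, therefore we expect the motion encoder to have the same output:
$$\mathcal{L}^{(m)}_{\text{inv\_s}} = \mathbb{E}\left[\|E_m(\mathbf{x}) - E_m(\mathbf{x}')\|^2\right].$$
Rotation Invariance Loss. Similarly, to ensure that the motion code is invariant to rotation, we add:
$$\mathcal{L}^{(m)}_{\text{inv\_r}} = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|E_m(\mathbf{x}) - E_m(\hat{\mathbf{x}}^{(k)})\|^2\right],$$
where $\hat{\mathbf{x}}^{(k)}$ is the $k$-th of the $K$ rotated variants of $\mathbf{x}$.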
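The encode, swap and decode pattern behind the cross reconstruction loss can be made concrete with a deliberately tiny toy model. Everything in it is an assumption made for illustration: a "sequence" is a 1D list of floats, its "structure" is its peak amplitude, its "motion" is the amplitude-normalized signal, and there is no view code. The real encoders and decoder are learned networks; these hand-written stand-ins only show why swapping motion codes between a sequence and its scaled copy should reproduce the originals.

```python
from typing import List


def encode_structure(x: List[float]) -> float:
    """Toy structure code: the peak amplitude of the sequence."""
    return max(abs(v) for v in x)


def encode_motion(x: List[float]) -> List[float]:
    """Toy motion code: the sequence with its amplitude normalized away."""
    s = encode_structure(x)
    return [v / s for v in x]


def decode(motion: List[float], structure: float) -> List[float]:
    """Toy decoder: re-apply a structure code to a motion code."""
    return [structure * v for v in motion]


def mse(a: List[float], b: List[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def cross_reconstruction_loss(x: List[float], x_scaled: List[float]) -> float:
    """Swap the motion codes of x and its scaled copy, decode, and compare
    each result with the sequence whose structure code it was given."""
    rec_x = decode(encode_motion(x_scaled), encode_structure(x))
    rec_x_scaled = decode(encode_motion(x), encode_structure(x_scaled))
    return 0.5 * (mse(rec_x, x) + mse(rec_x_scaled, x_scaled))


x = [0.0, 1.0, -2.0, 0.5]
x_scaled = [1.5 * v for v in x]          # same motion, "different actor"
loss_same_motion = cross_reconstruction_loss(x, x_scaled)

other = [2.0, 0.0, 1.0, -1.0]            # a genuinely different motion
loss_other_motion = cross_reconstruction_loss(x, other)
```

In this toy setting the loss vanishes only when the two inputs share the same motion, which is the property the loss is meant to push a learned encoder towards.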
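Body structure should be consistent across time and invariant despite view-angle perturbations.
Triplet Loss. The triplet loss is added to exploit the time-invariant property of the body structure and thereby better enforce disentanglement. Recall that the body encoder produces multiple body structure estimations before aggregating them. The triplet loss is designed to map estimations from the same sequence to a small neighborhood while alienating estimations from different sequences. Let us define an individual triplet loss term:
$$\tau(\mathbf{c}, \mathbf{c}_{+}, \mathbf{c}_{-}) = \max\left\{0,\; s(\mathbf{c}, \mathbf{c}_{-}) - s(\mathbf{c}, \mathbf{c}_{+}) + \delta\right\},$$
where $s(\cdot, \cdot)$ denotes the cosine similarity function and $\delta$ is our margin. The total triplet loss for the invariance of structure is defined as
$$\mathcal{L}_{\text{trip\_s}} = \mathbb{E}\left[\tau(\mathbf{s}_{t_1}, \mathbf{s}_{t_2}, \mathbf{s}'_{t_2})\right],$$
where $\mathbf{s}_{t_1}$ and $\mathbf{s}_{t_2}$ ($t_1 \neq t_2$) are window-level structure estimations from the same sequence and $\mathbf{s}'_{t_2}$ is an estimation from a different sequence.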
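A single triplet term with cosine similarity and a margin can be sketched as follows. The hinge form used here (negative similarity minus positive similarity plus margin, clipped at zero) is the common formulation and is an assumption about the exact expression; the function names and the margin value are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return dot / (norm_a * norm_b)


def triplet_term(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float) -> float:
    """Penalize the case where the anchor is not at least `margin` more
    similar to the positive than to the negative."""
    gap = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive)
    return max(0.0, gap + margin)


# Two estimates from the same sequence agree; one from another sequence differs.
satisfied = triplet_term([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=0.3)
violated = triplet_term([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], margin=0.3)
```

The term is zero once the positive pair is more similar than the negative pair by at least the margin, so well-separated codes stop contributing gradient.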
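Rotation Invariance Loss. This signal is to ensure that the structure codes are invariant to rotation:
$$\mathcal{L}^{(s)}_{\text{inv\_r}} = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|E_s(\mathbf{x}) - E_s(\hat{\mathbf{x}}^{(k)})\|^2\right],$$
where $E_s$ is the structure encoder and $\hat{\mathbf{x}}^{(k)}$ is the $k$-th of the $K$ rotated variants of $\mathbf{x}$.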
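View-angle of one skeleton sequence should be consistent through time and invariant despite structural perturbations.
Triplet Loss. Similarly, a triplet loss is designed to map view estimations from the same sequence to a small neighborhood while alienating estimations from rotated sequences. Continuing to use the definition of a triplet term $\tau(\text{anchor}, \text{positive}, \text{negative})$ in Eq. 4:
$$\mathcal{L}_{\text{trip\_v}} = \mathbb{E}\left[\tau(\mathbf{v}_{t_1}, \mathbf{v}_{t_2}, \hat{\mathbf{v}}^{(k)}_{t_2})\right],$$
where $\mathbf{v}_{t}$ is the $t$-th window-level view estimation from $\mathbf{x}$, and $\hat{\mathbf{v}}^{(k)}_{t}$ is the corresponding estimation from the $k$-th rotated sequence $\hat{\mathbf{x}}^{(k)}$.
Structural Invariance Loss. This signal is to ensure that the view code is invariant to structural change:
$$\mathcal{L}^{(v)}_{\text{inv\_s}} = \mathbb{E}\left[\|E_v(\mathbf{x}) - E_v(\mathbf{x}')\|^2\right],$$
where $E_v$ is the view encoder and $\mathbf{x}'$ is the limb-scaled version of $\mathbf{x}$.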
The loss terms defined above are designed to enforce disentanglement. Besides them, some basic loss terms are needed for this representation learning process.
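Reconstruction Loss. Reconstructing data is the fundamental functionality of auto-encoders. Recall that our decoder outputs reconstructed 3D sequences. Our reconstruction loss minimizes the difference between real data and 3D reconstructions projected back to 2D:
$$\mathcal{L}_{\text{rec}} = \mathbb{E}\left[\|[\hat{\mathbf{X}}]_{xy} - \mathbf{x}\|^2\right],$$
where $\hat{\mathbf{X}}$ is the 3D sequence decoded from the codes of $\mathbf{x}$ and $[\cdot]_{xy}$ keeps the $x$ and $y$ coordinates, i.e., we expect $\hat{\mathbf{X}}$ to be the same as the input $\mathbf{x}$ when we directly remove the $z$ coordinates from $\hat{\mathbf{X}}$.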
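Adversarial Loss. The unsupervised recovery of 3D motion from joint sequences is achieved through adversarial training. Reconstructed 3D joint sequences are rotated and projected back to 2D, and a discriminator is used to measure the domain discrepancy between the projected 2D sequences and real 2D sequences. The feasibility of recovering static 3D human pose from 2D coordinates with adversarial learning has been verified in several works [37, 14, 7, 34]. We want the reconstructed 3D sequence to look right after we rotate it and project it back to 2D, therefore the adversarial loss is defined as
$$\mathcal{L}_{\text{adv}} = \mathbb{E}\left[\log D(\mathbf{x}) + \frac{1}{K}\sum_{k=1}^{K}\log\big(1 - D(\hat{\mathbf{x}}^{(k)})\big)\right],$$
where $D$ is the discriminator, $\mathbf{x}$ is a real 2D sequence and $\hat{\mathbf{x}}^{(k)}$ is the $k$-th of the $K$ rotated and projected reconstructions.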
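The proposed motion retargeting network can be trained end-to-end with a weighted sum of the loss terms defined above:
$$\mathcal{L} = \sum_{i} \lambda_i \mathcal{L}_i,$$
where the sum ranges over the reconstruction, cross reconstruction, adversarial, triplet and invariance loss terms, and $\lambda_i$ are the corresponding weights.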
Implementation details. We perform the proposed training pipeline on the synthetic Mixamo dataset  for quantitative error measurement and fair comparison. For in-the-wild training, we collected a motion dataset named Solo-Dancer from online videos. For skeleton-to-video rendering, we recorded target videos and use the synthesis pipeline proposed in . The trained generator is shared by all the motion retargeting methods.
Evaluation metrics. We evaluate the quality of motion retargeting for both skeleton and video, as retargeting results on skeleton would largely influence the quality of generated videos. For skeleton keypoints, we perform evaluations on a held-out test set from Mixamo (with ground truth available) using mean square error (MSE) as the metric. For generated videos, we evaluate the quality of frames with FID score  and through a user study.
We train the model on unconstrained videos in-the-wild, and the model automatically learns the disentangled representations of motion, body, and view-angle, which enables a wide range of applications. We test motion retargeting, novel-view synthesis and latent space interpolation to demonstrate the effectiveness of the proposed pipeline.
Motion retargeting. We extract the desired motion from the source skeleton sequence, then retarget the motion to the target person. Videos from the Internet vary drastically in body structure as shown in Fig. 6. For example, Spiderman has very long legs but the child has short ones. Our method, no matter how large the structural gap between the source and the target, is capable of generating a skeleton sequence precisely with the same body structure as the target person while preserving the motion from the source person.
Novel-view synthesis. We can explicitly manipulate the view of decoded skeleton in the 3D space, rotating it before projecting it down to 2D. We show an example in Fig. 7. This enables us to see the motion-transferred video at any desired view-angle.
Latent space interpolation. The learned latent representation is meaningful when interpolated, as shown in Fig. 8. Both the motion and the body structure change smoothly between the videos, demonstrating the effectiveness of our model in capturing a reasonable coverage of the manifold.
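Interpolating between two latent codes can be sketched as a plain linear blend. This is an assumption about the interpolation scheme (the text does not state which one is used), and the codes below are made-up lists of floats rather than outputs of a trained encoder; `lerp` and `interpolation_path` are hypothetical names.

```python
from typing import List


def lerp(code_a: List[float], code_b: List[float], alpha: float) -> List[float]:
    """Linear blend of two latent codes; alpha=0 gives code_a, alpha=1 gives code_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(code_a, code_b)]


def interpolation_path(code_a: List[float], code_b: List[float],
                       steps: int) -> List[List[float]]:
    """Evenly spaced codes from code_a to code_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(code_a, code_b, i / (steps - 1)) for i in range(steps)]


# Blend the structure codes of two performers in five steps.
path = interpolation_path([0.0, 2.0], [4.0, 6.0], steps=5)
```

Each code along the path would then be passed to the decoder together with fixed motion and view codes, so that only the interpolated factor changes in the output.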
We compare the motion retargeting results of our method with the following methods (one intuitive method and three state-of-the-art methods), both quantitatively and qualitatively. 1) Limb Normalization (LN) is an intuitive method that calculates a scaling factor for each limb and applies local normalization. 2) Neural Kinematic Networks (NKN) uses detected 3D keypoints for unsupervised motion retargeting. 3) Everybody Dance Now (EDN) applies a global linear transformation on all the keypoints. 4) Learning Character-Agnostic Motion (LCM) performs disentanglement in the 2D space in a fully-supervised manner.
For the fairness of the comparison, we train and test all the models on a unified Mixamo dataset, but note that our model is trained with less information, using neither 3D information nor the pairing between motion and skeletons. In addition, we train a separate model with in-the-wild data only. All the methods are evaluated with the aforementioned evaluation metrics.
Our method outperforms all the compared methods in terms of both numerical joint position error and quality of generated images. EDN and LN are naive rule-based methods: the former does not estimate the body structure, and the latter is bound to fail when the actor is not facing the camera directly. Although NKN is able to transfer motion with little error on the synthesized dataset, it suffers on in-the-wild data due to the unreliability of 3D pose estimation. LCM is trained with a finite set of characters, therefore its capacity for generalization is limited. In contrast, our method uses limb scaling to augment the training data, exploring all possible body structures in a continuous space.
It is noteworthy that our method enables training on arbitrary web data, which previous methods cannot use. The fact that the model trained on in-the-wild data (i.e., the Solo-Dancer dataset) achieved the lowest error (in Table 1) demonstrates the benefits of training on in-the-wild data. For complex motion such as the one shown in Fig. 10, the model learned from wild data performs better, as wild data features a larger diversity of motion. These results show the advantage of our method in learning from unlimited real-world data, whereas supervised methods rely on strictly paired data that are hard to expand.
In summary, we attribute the superior performance of our method to the following reasons: 1) Our disentanglement is directly performed in 2D space, which circumvents the imprecise process of 3D keypoint detection from in-the-wild videos. 2) Our explicit invariance-driven loss terms maximize the utilization of information contained in the training data, evidenced by the largely increased data efficiency compared to implicit unsupervised approaches. 3) Our limb scaling mechanism improves the model’s ability to handle extreme body structures. 4) In-the-wild videos provide an unlimited source of motion, compared to the limited movements in synthetic datasets like Mixamo.
Table 2: Ablation study. Columns: Method, w/o crs, w/o trip, w/o adv, Ours (full).
We train three ablated models to study the impact of the individual loss terms. The results are shown in Table 2. The w/o crs model is created by removing the cross reconstruction loss, the w/o trip model by removing the triplet loss, and the w/o adv model by removing the adversarial loss. Removing the cross reconstruction loss has the most detrimental effect on the 2D retargeting performance of our model, evidenced by the doubling of MSE. Removing the triplet loss also increases the MSE. Although removing the adversarial loss does not significantly affect the 2D retargeting performance of our model, the rotated sequences look less natural without it.
In this work, we propose a novel video motion retargeting approach, in which motion can be successfully transferred in scenarios where large variations of body structure exist between the source and target person. The proposed motion retargeting network runs on 2D skeleton input only, which makes it a lightweight and plug-and-play module that is complementary to existing skeleton extraction and skeleton-to-video rendering methods. Leveraging three inherent invariance properties in temporal sequences, the proposed network can be trained with unlabeled web data end-to-end. Our experiments demonstrate the promising results of our method and the effectiveness of the invariance-driven constraints.
Acknowledgement. This work is supported by the SenseTime-NTU Collaboration Project, Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP. We would like to thank Tinghui Zhou, Rundi Wu and Kwan-Yee Lin for insightful discussion and their exceptional support.
The content of our supplementary material is organized as follows.
Details about the implementation of the three stages of our method.
Datasets and evaluation metrics we use in our experiments.
Qualitative results of ablation study.
We use a pretrained DensePose model for skeleton extraction. Missing keypoints are complemented by nearest-neighbor interpolation. The extracted skeleton sequences are smoothed using a Gaussian kernel with a temporal standard deviation. We use joints for a skeleton; the detailed skeleton format will be given in our GitHub repository.
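The two preprocessing steps, nearest-neighbor filling of missing keypoints and temporal Gaussian smoothing, can be sketched on a single coordinate track. This is a minimal sketch under assumptions: a missing detection is represented by `None`, the kernel is truncated at three standard deviations and renormalized at the sequence borders, and the value of `sigma` in the example is arbitrary because the value used by the authors is not recoverable from the text.

```python
import math
from typing import List, Optional


def fill_missing_nearest(track: List[Optional[float]]) -> List[float]:
    """Replace each missing value with the value of the temporally nearest
    valid frame (the earlier frame wins on ties)."""
    valid = [i for i, v in enumerate(track) if v is not None]
    if not valid:
        raise ValueError("track has no valid detections")
    filled: List[float] = []
    for i in range(len(track)):
        nearest = min(valid, key=lambda j: abs(j - i))
        filled.append(float(track[nearest]))
    return filled


def gaussian_smooth(track: List[float], sigma: float) -> List[float]:
    """Smooth a track over time with a truncated, renormalized Gaussian kernel."""
    radius = max(1, math.ceil(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed: List[float] = []
    for i in range(len(track)):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < len(track):   # skip taps that fall outside the sequence
                num += w * track[j]
                den += w
        smoothed.append(num / den)
    return smoothed


raw = [1.0, None, None, 4.0, 4.0]
filled = fill_missing_nearest(raw)
smooth = gaussian_smooth(filled, sigma=1.0)
```

In a full pipeline the same two functions would be applied independently to the x and y tracks of every joint.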
The sizes of the latent representations are , and . Our encoders down-sample the input sequences to an eighth of its original length, therefore . For limb-scaling, we use global and local scaling factors randomly sampled from
(uniformly distributed). For view perturbations we use. Our motion retargeting network is trained steps with batch size and learning rate using Adam  optimization algorithm. The weights of the loss terms are given as follows: . These parameters are determined through quantitative and qualitative experiments on a validation set.
For training on unlabeled web data, we collected a motion dataset named Solo-Dancer. We downloaded from YouTube categories of dancing videos, each one of the videos features only a single dancer. The total length of the videos add up to hours. We then used an off-the-shelf 2D keypoints detector  to extract keypoints frame-by-frame to be used as our training data.
We also perform the proposed unsupervised training pipeline on the synthetic Mixamo dataset  in order to quantitatively measure the transfer results with ground truth and baseline methods. The training set comprises of characters, each character has sequences and a total of hours for each character.
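For an inferred sequence $\hat{\mathbf{x}}$ and a ground-truth sequence $\mathbf{x}$, each with $N$ joints and $T$ frames, the mean square error (MSE) is
$$\mathrm{MSE} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{j=1}^{N}\left\|\hat{\mathbf{x}}_{j,t} - \mathbf{x}_{j,t}\right\|_2^2,$$
where $j$ is the subscript of body joints and $t$ is the subscript of time. Similarly, the mean absolute error (MAE) is
$$\mathrm{MAE} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{j=1}^{N}\left\|\hat{\mathbf{x}}_{j,t} - \mathbf{x}_{j,t}\right\|_1.$$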
These two metrics are measured in the original scale of Mixamo dataset. The errors are computed after hip-alignment, as visualized in Figure 11.
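Computing joint errors after hip alignment can be sketched as follows. The sketch assumes that hip alignment means translating every frame so that a chosen hip joint sits at the origin, that the squared error uses the Euclidean distance per joint, and that the absolute error sums the absolute coordinate differences per joint; these choices, the hip index and the function names are assumptions, not details confirmed by the text.

```python
from typing import List, Tuple

Frame = List[Tuple[float, float]]


def hip_align(seq: List[Frame], hip: int) -> List[Frame]:
    """Translate every frame so that the hip joint is at the origin."""
    return [[(x - frame[hip][0], y - frame[hip][1]) for x, y in frame]
            for frame in seq]


def joint_errors(pred: List[Frame], gt: List[Frame], hip: int = 0) -> Tuple[float, float]:
    """Return (mean squared error, mean absolute error) over all joints and
    frames, measured after aligning both sequences at the hip."""
    pred_a, gt_a = hip_align(pred, hip), hip_align(gt, hip)
    squared = absolute = 0.0
    count = 0
    for frame_p, frame_g in zip(pred_a, gt_a):
        for (px, py), (gx, gy) in zip(frame_p, frame_g):
            dx, dy = px - gx, py - gy
            squared += dx * dx + dy * dy
            absolute += abs(dx) + abs(dy)
            count += 1
    return squared / count, absolute / count


# One frame, two joints; the prediction is shifted and one joint is off by 1 in y.
prediction = [[(1.0, 1.0), (2.0, 1.0)]]
ground_truth = [[(0.0, 0.0), (1.0, 1.0)]]
mse_value, mae_value = joint_errors(prediction, ground_truth, hip=0)
```

Aligning at the hip removes any global translation between the two sequences, so the metrics reflect pose differences rather than where the skeleton happens to be placed in the frame.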
We calculate the Fréchet Inception Distance (FID) to evaluate the quality of generated frames. FID measures the perceptual distance between the generated frames and the real target frames, and a smaller number represents higher visual consistency.
For the quality of retargeted videos, we ask volunteers to perform subjective pairwise A/B tests. For each method ( baseline and ours), we test retargeted videos with the combination of source and target individuals. All the videos are 10 seconds in length. Participants choose which video has better motion consistency (between source videos and retargeted videos) in a pair of retargeted videos from two different methods. Source videos are also given to testers for reference. For each baseline method, retargeted videos are compared times by different participants against our model. Our model has two variants with different training sets (i.e., Mixamo and SoloDancer), the results are shown in Table 1 in main paper as “User” and “User (wild)” respectively.
Besides testing standard MSE, we render the retargeted video for further comparison. As can be empirically observed in Fig. 12, the full model produces the results of the best quality. The cross reconstruction loss plays an essential role for disentanglement. The results without triplet loss show slightly degraded quality on the frame level. However, it is important to note that the triplet loss is used to smooth the structure and view code temporally, therefore stabilizing the generated video. The adversarial loss improves the plausibility of generated joint sequences, making them look more natural and realistic. Recall that the adversarial loss is added on randomly rotated output joint sequences to make the rotated output sequences indistinguishable from real data.