TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting

03/31/2020 ∙ by Zhuoqian Yang, et al. ∙ SenseTime Corporation Tsinghua University Nanyang Technological University Carnegie Mellon University Peking University The Chinese University of Hong Kong 19

We present TransMoMo, a lightweight video motion retargeting approach that is capable of realistically transferring the motion of a person in a source video to another video of a target person. Without using any paired data for supervision, the proposed method can be trained in an unsupervised manner by exploiting the invariance properties of three orthogonal factors of variation: motion, structure, and view-angle. Specifically, with loss functions carefully derived from these invariances, we train an auto-encoder to disentangle the latent representations of these factors given the source and target video clips. This allows us to selectively transfer motion extracted from the source video seamlessly to the target video in spite of structural and view-angle disparities between the source and the target. Because no paired data is required, our method can be trained on a vast amount of videos without manual annotation of source-target pairs, leading to improved robustness against large structural variations and extreme motion in videos. We demonstrate the effectiveness of our method over state-of-the-art methods. Code, model and data are publicly available on our project page.






Code Repositories


This is the official PyTorch implementation of the CVPR 2020 paper "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting".


1 Introduction

Let’s sway you could look into my eyes. Let’s sway under the moonlight, this serious moonlight.

David Bowie, Let’s Dance

Can an amateur dancer learn instantly how to dance like a professional in different styles, e.g., Tango, Locking, Salsa, and Kompa? While it is almost impossible in reality, one can now achieve this virtually via motion retargeting, i.e., transferring the motion of a source video featuring a professional dancer to a target video of him/herself.

Motion retargeting is an emerging topic in both computer vision and graphics due to its wide applicability to content creation. Most existing methods [41, 28, 30] achieve motion retargeting through high-quality 3D pose estimation or reconstruction [9]. These methods either require complex and expensive optimization or are error-prone on unconstrained videos that contain complex motion. Recently, several efforts have also been made to retarget motion in 2D space [2, 6, 23]. Image-based methods [15, 4] obtain compelling results on conditional person generation. However, these methods neglect temporal coherence in video and thus suffer from flickering results. Video-based methods [44, 6, 2] show state-of-the-art results. However, insufficient consideration of the variances between two individuals [44, 6] or the limitation of training on synthesized data [2] causes their results to deteriorate dramatically when encountering large structural variations or extreme motion in web videos.

Figure 2: Motion retargeting pipeline. Our method achieves motion retargeting in three stages. 1. Skeleton Extraction: 2D body joints are extracted from source and target videos using an off-the-shelf model. 2. Motion Retargeting Network: our model decomposes the joint sequences and recombines the elements to generate a new joint sequence, which can be viewed at any desired view-angle. 3. Skeleton-to-Video Rendering: the retargeted video is rendered using the output joint sequence, with an available image-to-image translation method.

In this study, we aim to address video motion retargeting via an end-to-end learnable framework in 2D space, bypassing the need for explicit estimation of 3D human pose. Despite recent progress in generative frameworks and motion synthesis, learning for motion retargeting in 2D space remains challenging due to the following issues: 1) Considering the large structural and view-angle variances between the source and target videos, it is difficult to learn a direct person-to-person mapping at the pixel level. Conventional image-to-image translation methods tend to generate unnatural motion in extreme conditions or fail on unseen examples; 2) No corresponding image pairs of two different subjects performing the same motion are available to supervise the learning of such a transfer; 3) Human motion is highly articulated and complex, so it is challenging to perform motion modeling and transfer.

To address the first challenge, instead of performing direct video-to-video translation at the pixel level, we decompose the translation process into three steps as shown in Fig. 2, i.e., skeleton extraction, motion retargeting on skeletons and skeleton-to-video rendering. The decomposition allows us to focus on the core problem of motion retargeting using skeleton sequences as the input and output spaces. To cope with the second and third challenges, we exploit the invariance property of three factors: motion, structure, and view-angle. These factors of variation are enforced to be independent of each other, i.e., each is held constant when the other factors vary. In particular, 1) motion should be invariant despite structural and view-angle perturbations, 2) structure of one skeleton sequence should be consistent across time and invariant despite view-angle perturbations, and 3) view-angle of one skeleton sequence should be consistent across time and invariant despite structural perturbations. The invariance properties allow us to derive a set of purely unsupervised loss functions to train an auto-encoder for disentangling a sequence of skeletons into orthogonal latent representations of motion, structure, and view-angle. Given the disentangled representation, one can easily mix the latent codes of motion and structure from different skeleton sequences for motion retargeting. Taking a different view-angle as a condition to the decoder, one can generate retargeted motion in novel viewpoints. Since motion retargeting is performed in the 2D skeleton space, it can be seen as a lightweight and plug-and-play module, which is complementary to existing skeleton extraction [5, 3, 35, 48] and skeleton-to-video rendering methods [6, 44, 43].

There are several existing studies designed for general representation disentanglement in video [20, 40, 13]. While these methods have shown impressive results on constrained scenarios, it is difficult for them to model articulated human motion due to its highly non-linear and complex kinematic structures. Instead, our method is designed specifically for representation disentanglement in human videos.

We summarize our contributions as follows: 1) We propose a novel Motion Retargeting Network in 2D skeleton space, which can be trained end-to-end with unlabeled web data. 2) We introduce novel loss functions based on invariance to endow the proposed network with the ability to disentangle representations in a purely unsupervised manner. 3) Extensive experiments demonstrate the effectiveness of our method over other state-of-the-art approaches [6, 2, 41], especially in in-the-wild scenarios where motion is complex.

2 Related Work

Video Motion Retargeting. Hodgins and Pollard [19] proposed a control system parameter scaling algorithm to adapt simulated motion to new characters. Lee and Shin [27] decomposed the problem into inter-frame constraints and intra-frame relationships and modeled them with an Inverse Kinematics problem and B-spline curves, respectively. Choi and Ko [10] proposed a real-time method based on inverse rate control that computes the changes in joint angles. Tak and Ko [38] proposed a per-frame filter framework to generate physically plausible motion sequences. Recently, Villegas et al. [41] designed a recurrent neural network architecture with a Forward Kinematics layer to capture high-level properties of motion. However, the target to be animated in the aforementioned approaches is typically an articulated virtual character, and their results critically depend on the accuracy of 3D pose estimation. More recently, Aberman et al. [2] proposed to retarget motion in 2D space. However, since their training relies on synthetic paired data, the performance is likely to degrade in unconstrained scenarios. Instead, our method can be trained on purely unlabeled web data, which makes it robust to the challenging in-the-wild motion transfer task.

There exist a few attempts to address the video motion retargeting problem. Liu et al. [28] designed a novel GAN [16] architecture with an attentive discriminator network and better conditioning inputs. However, this method relies on 3D reconstruction of the target person. Aberman et al. [1] proposed to tackle video-driven performance cloning in a two-branch framework. Chan et al. [6] proposed a simple but effective method to obtain temporally coherent video results. Wang et al. [44] achieve results of similar quality to Chan et al. with a more complex shape representation and temporal modelling. However, the performance of all these methods degrades dramatically when large variations exist between the two individuals, since body variations are either not considered [1, 43, 44] or addressed only by a simple rescaling [6].

Unsupervised Representation Disentanglement. There is a vast literature [26, 29, 21, 36, 47, 46] on disentangling factors of variation. Bilinear models [39] were an early approach to separate content and style for images of faces and text in various fonts. Recently, InfoGAN [8] learned a generative model with disentangled factors based on Generative Adversarial Networks (GAN). β-VAE [18] and DIP-VAE [25] build on Variational Auto-Encoders (VAEs) to disentangle interpretable factors in an unsupervised way.

Other approaches explore general methods for learning disentangled representations from video. Whitney et al. [45] used a gating principle to encourage each dimension of the latent representation to capture a distinct mode of variation. Villegas et al. [42] used an unsupervised approach to factor video into content and motion. Denton et al. [13] proposed to leverage the temporal coherence of video and a novel adversarial loss to learn a disentangled representation. MoCoGAN [40] employs unsupervised adversarial training to learn the separation of motion and content. Hsieh et al. [20] proposed an auto-encoder framework, which combines structured probabilistic models and deep networks for disentanglement. However, the performance of these methods is not satisfactory on human videos, since they are not designed specifically for disentanglement of highly articulated and complex objects.

Person Generation. Various machine learning algorithms have been used to generate realistic person images. The generation process can be conditionally guided by skeleton keypoints [4, 31] and style codes [32, 15, 11]. Our method is complementary to image-based person generation approaches and can further boost their temporal coherence, since it performs motion retargeting in the 2D skeleton space only.

3 Methodology

As illustrated in Fig. 2, we decompose the translation process into three steps, i.e., skeleton extraction, motion retargeting and skeleton-to-video rendering. In our framework, motion retargeting is the most important component in which we introduce our core contribution (i.e., invariance-driven disentanglement). Skeleton extraction and skeleton-to-video rendering are replaceable and can thus benefit from recent advances in 2D keypoints estimation [3, 5, 48] and image-to-image translation [22, 44, 43].

The Motion Retargeting Network decomposes 2D joint input sequences into a motion code that represents the movements of the actor, a structure code that represents the body shape of the actor and a view-angle code that represents the camera angle. The decoder takes any combination of the latent codes and produces a reconstructed 3D joint sequence, which automatically isolates view from motion and structure.

For transferring motion from a source video to a target video, we first use an off-the-shelf 2D keypoints detector to extract joint sequences from videos. By combining the motion code encoded from the source sequence and the structure code encoded from the target sequence, our model then yields a transferred 3D joint sequence. The transferred sequence is then projected back to 2D with any desired view-angle. Finally, we convert the 2D joint sequence frame-by-frame to a pixel-level representation, i.e., label maps. These label maps are fed into a pre-trained image-to-image generator to render the transferred video.

3.1 Motion Retargeting Network

Here, we detail the encoders and decoders for an input sequence $\mathbf{x} \in \mathbb{R}^{T \times 2N}$, where $T$ is the length of the sequence and $N$ is the number of body joints.

The motion encoder uses several layers of one dimensional temporal convolution to extract motion information: $E_m(\mathbf{x}) = \mathbf{m} \in \mathbb{R}^{M \times C_m}$, where $M$ is the sequence length after encoding and $C_m$ is the number of channels. Note that the motion code is variable in length so as to preserve temporal information.

The structure encoder has a similar network structure, $E_s(\mathbf{x}) = \mathbf{s} \in \mathbb{R}^{M \times C_s}$, with the difference that the final structure code is obtained after a temporal max pooling: $\bar{\mathbf{s}} = \mathrm{maxpool}(\mathbf{s})$, therefore $\bar{\mathbf{s}} \in \mathbb{R}^{C_s}$. Effectively, the process of obtaining the structure code can be interpreted as performing multiple body shape estimations in sliding windows, $\mathbf{s} = [\mathbf{s}_1, \dots, \mathbf{s}_M]$, and then aggregating the estimations. Assuming the viewpoint is also stationary (i.e., all the temporal variances are caused by the movements of the actor), the view code $\bar{\mathbf{v}}$ is obtained in the same way as the structure code, using a view encoder $E_v$.

The decoder takes the motion, body and view codes as input and reconstructs a 3D joint sequence, $G(\mathbf{m}, \bar{\mathbf{s}}, \bar{\mathbf{v}}) = \hat{\mathbf{X}} \in \mathbb{R}^{T \times 3N}$, through convolution layers, in symmetry with the encoders. Our discriminator $D$ is a temporal convolutional network that is similar to our motion encoder.
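
The aggregation step of the structure (and view) encoder can be illustrated with a small standalone sketch. It is not the authors' PyTorch code: the function name `temporal_max_pool` and the toy numbers are assumptions made for illustration, and the per-window estimates would in practice come from the temporal convolutions.

```python
from typing import List, Sequence


def temporal_max_pool(window_codes: Sequence[Sequence[float]]) -> List[float]:
    """Collapse M per-window codes (each with C channels) into one C-channel code."""
    if not window_codes:
        raise ValueError("need at least one per-window estimate")
    # zip(*rows) walks channel by channel across all time windows.
    return [max(channel) for channel in zip(*window_codes)]


# Hypothetical per-window structure estimates: M = 3 windows, C = 4 channels.
window_estimates = [
    [0.2, 1.0, -0.5, 0.3],
    [0.4, 0.8, -0.1, 0.3],
    [0.1, 0.9, -0.7, 0.6],
]
structure_code = temporal_max_pool(window_estimates)
print(structure_code)  # [0.4, 1.0, -0.1, 0.6]
```

Because the pooling runs over the time axis, the resulting code has a fixed length regardless of how long the input sequence is, whereas the motion code keeps its temporal dimension.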

3.2 Invariance-Driven Disentanglement

The disentanglement of motion, structure and view is achieved by leveraging the invariance of each of these factors to changes in the other two. We design loss terms to restrict changes when perturbations are added, while the entire network tries to reconstruct joint sequences from the decomposed features. Structural perturbation is added through limb scaling, i.e., manually shortening or extending the length of the limbs. View perturbation is introduced by rotating the reconstructed 3D sequence and projecting it back to 2D. Motion perturbation need not be explicitly added, since motion itself varies through time. We first describe the ways perturbations are added and then detail the definitions of the loss terms derived from the three invariances, i.e., motion, structure and view-angle invariance.

Figure 3: Limb-scaling process. We show a step-by-step limb-scaling process on a joint sequence starting from the root joint (pelvis). At each step, the scaled limbs are highlighted in red. This example scales all limbs with the same factor, but the scaling factors are randomly generated at training time.

Limb Scaling as Structural Perturbation. For an input 2D sequence $\mathbf{x}$, we create a structurally-perturbed sequence $\mathbf{x}'$ by elongating or shortening the limbs of the performer, as illustrated in Figure 3. It is done in such a way that the created sequence is effectively the same motion performed by a different actor. The length of a limb is extended/shortened by the same ratio across all frames, so limb-scaling does not introduce ambiguity between motion and body structure. Specifically, the limb-scaled sequence is created by applying the limb-scale function frame-by-frame: $\mathbf{x}'_t = f(\mathbf{x}_t; \boldsymbol{\gamma}, \gamma_g)$, where $\mathbf{x}_t$ is the $t$-th frame in the input sequence, $f$ is the limb scaling function, $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \dots)$ are the local scaling factors and $\gamma_g$ is the global scaling factor. Modeling the human skeleton as a tree and joints as its nodes, we define the pelvis joint as the root. For each frame in the sequence, starting from the root, we recursively move the joints and all their dependent joints (child nodes) along the direction of the limb by distance $(\gamma_i - 1)\,\ell_i^{(t)}$, where $\ell_i^{(t)}$ is the original length of the $i$-th limb in the $t$-th frame. After all local scaling factors have been applied, the global scaling factor $\gamma_g$ is directly multiplied with all the joint coordinates.
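
The recursive procedure above can be sketched for a single frame as follows. This is a minimal illustration rather than the released implementation: the joint names, the `parent` dictionary encoding the skeleton tree, and the convention that each local factor is keyed by the child joint of its limb are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def scale_limbs(
    joints: Dict[str, Point],
    parent: Dict[str, Optional[str]],
    local_scale: Dict[str, float],
    global_scale: float = 1.0,
) -> Dict[str, Point]:
    """Limb-scale one 2D frame; local_scale is keyed by the child joint of each limb."""
    children: Dict[str, List[str]] = {}
    root = None
    for joint, par in parent.items():
        if par is None:
            root = joint
        else:
            children.setdefault(par, []).append(joint)
    if root is None:
        raise ValueError("skeleton tree has no root joint")

    moved = {j: [x, y] for j, (x, y) in joints.items()}

    def shift_subtree(joint: str, dx: float, dy: float) -> None:
        # Move a joint together with all of its dependent (child) joints.
        moved[joint][0] += dx
        moved[joint][1] += dy
        for child in children.get(joint, []):
            shift_subtree(child, dx, dy)

    def visit(joint: str) -> None:
        for child in children.get(joint, []):
            # Limb vector taken from the original frame; its norm is the
            # original limb length, so (gamma - 1) * vector moves the child
            # by (gamma - 1) * length along the limb direction.
            lx = joints[child][0] - joints[joint][0]
            ly = joints[child][1] - joints[joint][1]
            gamma = local_scale.get(child, 1.0)
            shift_subtree(child, (gamma - 1.0) * lx, (gamma - 1.0) * ly)
            visit(child)

    visit(root)
    # The global factor multiplies every coordinate after the local pass.
    return {j: (global_scale * x, global_scale * y) for j, (x, y) in moved.items()}


# Toy skeleton: pelvis is the root, one leg chain and a neck.
frame = {"pelvis": (0.0, 0.0), "knee": (0.0, -1.0), "ankle": (0.0, -2.0), "neck": (0.0, 1.0)}
tree = {"pelvis": None, "knee": "pelvis", "ankle": "knee", "neck": "pelvis"}
scaled = scale_limbs(frame, tree, {"knee": 2.0, "ankle": 0.5})
print(scaled["knee"], scaled["ankle"])  # (0.0, -2.0) (0.0, -2.5)
```

Applying the same factors to every frame of a sequence keeps the limb ratios constant over time, which is what makes the perturbed sequence look like the same motion performed by a different body.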

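3D Rotation as View Perturbation. Let $\phi$ be a rotate-and-project function, i.e., for a 3D coordinate $\mathbf{p}$:

$$\phi(\mathbf{p}, \theta; \mathbf{n}) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \mathbf{R}(\mathbf{n}, \theta)\, \mathbf{p},$$

where $\mathbf{R}(\mathbf{n}, \theta)$ is a rotation matrix obtained using Rodrigues’ rotation formula and $\mathbf{n}$ is a unit vector representing the axis around which we rotate. In practice, $\mathbf{n}$ is an estimated vertical direction of the body. It is computed using four points: left shoulder, right shoulder, left pelvis and right pelvis. Note that $\phi$ is differentiable with respect to $\mathbf{p}$.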
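
A rotate-and-project function of this kind can be sketched with the vector form of Rodrigues' rotation formula. The sketch below is illustrative only: the function name is hypothetical, and it assumes that projection keeps the first two coordinates and discards the third.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def rotate_and_project(p: Vec3, theta: float, n: Vec3) -> Tuple[float, float]:
    """Rotate p by theta (radians) about axis n, then drop the third coordinate."""
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    nx, ny, nz = (c / norm for c in n)  # make sure the axis is a unit vector
    px, py, pz = p
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    dot = nx * px + ny * py + nz * pz
    # n x p, the cross product of the axis with the point
    cx, cy = ny * pz - nz * py, nz * px - nx * pz
    # Rodrigues: p cos(t) + (n x p) sin(t) + n (n . p) (1 - cos(t));
    # only the first two components are needed after projection.
    rx = px * cos_t + cx * sin_t + nx * dot * (1.0 - cos_t)
    ry = py * cos_t + cy * sin_t + ny * dot * (1.0 - cos_t)
    return (rx, ry)


vertical = (0.0, 1.0, 0.0)  # assumed vertical body axis for this toy example
front = rotate_and_project((1.0, 2.0, 0.0), 0.0, vertical)
back = rotate_and_project((1.0, 2.0, 0.0), math.pi, vertical)
print(front, back)
```

With a rotation angle of zero the function reduces to dropping the third coordinate, and a half turn about the vertical axis mirrors the horizontal coordinate while leaving the height unchanged.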

Figure 4: Rotation as view perturbation. This figure illustrates the process of taking an input 2D sequence $\mathbf{x}$, reconstructing a 3D sequence $\hat{\mathbf{X}}$ using our motion retargeting network and projecting it back to 2D, with rotation as view-angle perturbation.

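As shown in Fig. 4, we create several rotated sequences from the reconstructed 3D sequence $\hat{\mathbf{X}}$:

$$\hat{\mathbf{x}}^{(k)} = \phi(\hat{\mathbf{X}}, \theta_k; \mathbf{n}), \quad k = 1, \dots, K,$$

where $\theta_k$ is the $k$-th rotation angle and $K$ is the number of projections. Loss terms enforcing disentanglement are described later in this section.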

3.2.1 Invariance of Motion

Motion should be invariant despite structural and view-angle perturbations. To this end, we designed the following loss terms.

Figure 5: Cross reconstruction process. This figure illustrates the process of cross-reconstruction using a 2D input sequence $\mathbf{x}$ and its limb-scaled variant $\mathbf{x}'$.

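Cross Reconstruction Loss. Recall that we use limb scaling to obtain data of the same movements performed by “different” actors $\mathbf{x}$ and $\mathbf{x}'$. We cross reconstruct the two sequences, as shown in Fig. 5. The cross reconstruction involves encoding, swapping and decoding, namely:

$$\tilde{\mathbf{x}} = \phi\big(G(E_m(\mathbf{x}'), \bar{\mathbf{s}}, \bar{\mathbf{v}}), 0; \mathbf{n}\big), \qquad \tilde{\mathbf{x}}' = \phi\big(G(E_m(\mathbf{x}), \bar{\mathbf{s}}', \bar{\mathbf{v}}'), 0; \mathbf{n}\big),$$

where $\mathbf{x}'$ is the limb-scaled version of $\mathbf{x}$, $(\bar{\mathbf{s}}, \bar{\mathbf{v}})$ and $(\bar{\mathbf{s}}', \bar{\mathbf{v}}')$ are the structure and view codes of $\mathbf{x}$ and $\mathbf{x}'$, and $\phi(\cdot, 0; \mathbf{n})$ projects the decoded 3D sequence to 2D without rotation. Since $\mathbf{x}$ and $\mathbf{x}'$ have the same motion, we expect $\tilde{\mathbf{x}}$ to be the same as $\mathbf{x}$; $\tilde{\mathbf{x}}'$ to be the same as $\mathbf{x}'$. Therefore, the cross reconstruction loss is defined as

$$\mathcal{L}_{crs} = \mathbb{E}_{\mathbf{x}}\Big[\tfrac{1}{2}\,\|\mathbf{x} - \tilde{\mathbf{x}}\|^2 + \tfrac{1}{2}\,\|\mathbf{x}' - \tilde{\mathbf{x}}'\|^2\Big].$$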
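
The encode, swap and decode pattern can be made concrete with a deliberately tiny toy model. Everything in this sketch is an assumption made for illustration: a "sequence" is a list of scalars, the view code is omitted, and the hand-written `encode_motion`, `encode_structure` and `decode` stand in for the learned networks. It only shows why the loss vanishes when motion and structure are properly separated and becomes positive when the motion code leaks structure.

```python
from typing import Callable, List

Seq = List[float]


def encode_structure(x: Seq) -> float:
    """Toy structure code: the overall scale of the sequence."""
    return max(abs(v) for v in x)


def encode_motion(x: Seq) -> Seq:
    """Toy motion code: the sequence with its scale divided out."""
    scale = encode_structure(x)
    return [v / scale for v in x]


def decode(motion: Seq, structure: float) -> Seq:
    """Toy decoder: re-apply a structure (scale) to a motion code."""
    return [structure * v for v in motion]


def cross_reconstruction_loss(
    x: Seq, x_scaled: Seq, enc_m: Callable[[Seq], Seq]
) -> float:
    # Swap motion codes between the sequence and its scaled variant.
    x_tilde = decode(enc_m(x_scaled), encode_structure(x))
    x_scaled_tilde = decode(enc_m(x), encode_structure(x_scaled))

    def sq_err(a: Seq, b: Seq) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return 0.5 * sq_err(x, x_tilde) + 0.5 * sq_err(x_scaled, x_scaled_tilde)


x = [0.0, 0.5, 1.0, 0.5]
x_scaled = [1.5 * v for v in x]  # same motion, "different actor"

disentangled = cross_reconstruction_loss(x, x_scaled, encode_motion)
entangled = cross_reconstruction_loss(x, x_scaled, lambda seq: list(seq))
print(disentangled, entangled)
```

In the entangled case the motion code is simply the raw sequence, so swapping it carries the other actor's scale along and the reconstruction no longer matches.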


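Structural Invariance Loss. This signal is to ensure that the motion codes are invariant to structural changes. $\mathbf{x}$ and its limb-scaled version $\mathbf{x}'$ have the same motion but different body structures, therefore we expect the motion encoder to have the same output:

$$\mathcal{L}_{inv\_m}^{(s)} = \mathbb{E}_{\mathbf{x}}\Big[\|E_m(\mathbf{x}) - E_m(\mathbf{x}')\|^2\Big].$$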


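Rotation Invariance Loss. Similarly, to ensure that the motion code is invariant to rotation, we add:

$$\mathcal{L}_{inv\_m}^{(v)} = \mathbb{E}_{\mathbf{x}}\Big[\frac{1}{K}\sum_{k=1}^{K}\|E_m(\mathbf{x}) - E_m(\hat{\mathbf{x}}^{(k)})\|^2\Big],$$

where $\hat{\mathbf{x}}^{(k)}$ is the $k$-th rotated variant.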

3.2.2 Invariance of Structure

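Body structure should be consistent across time and invariant despite view-angle perturbations.
Triplet Loss. The triplet loss is added to exploit the time-invariant property of the body structure and thereby better enforce disentanglement. Recall that the body encoder produces multiple body structure estimations before aggregating them. The triplet loss is designed to map estimations from the same sequence to a small neighborhood while alienating estimations from different sequences. Let us define an individual triplet loss term:

$$\tau(\mathbf{c}, \mathbf{p}_+, \mathbf{p}_-) = \max\{0,\; s(\mathbf{c}, \mathbf{p}_-) - s(\mathbf{c}, \mathbf{p}_+) + m\}, \tag{4}$$

where $s(\cdot, \cdot)$ denotes the cosine similarity function and $m$ is our margin. The total triplet loss for the invariance of structure is defined as

$$\mathcal{L}_{trip\_s} = \mathbb{E}_{\mathbf{x}}\Big[\frac{1}{2M}\sum_{t_1, t_2}\big[\tau(\mathbf{s}_{t_1}, \mathbf{s}_{t_2}, \mathbf{s}'_{t_2}) + \tau(\mathbf{s}'_{t_1}, \mathbf{s}'_{t_2}, \mathbf{s}_{t_2})\big]\Big],$$

where $\mathbf{s}_t$ and $\mathbf{s}'_t$ are the $t$-th structure estimations of $\mathbf{x}$ and of its limb-scaled version $\mathbf{x}'$, and $t_1 \neq t_2$.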
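Rotation Invariance Loss. This signal is to ensure that the structure codes are invariant to rotation:

$$\mathcal{L}_{inv\_s}^{(v)} = \mathbb{E}_{\mathbf{x}}\Big[\frac{1}{K}\sum_{k=1}^{K}\|\bar{\mathbf{s}} - \bar{\mathbf{s}}^{(k)}\|^2\Big],$$

where $\bar{\mathbf{s}}$ is the structure code of $\mathbf{x}$ and $\bar{\mathbf{s}}^{(k)}$ is the structure code of the $k$-th rotated variant $\hat{\mathbf{x}}^{(k)}$.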
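
The individual triplet term used for the structure codes, built on cosine similarity with a margin, can be sketched as below. The function names and the margin value of 0.3 are assumptions chosen for the example, not values taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_term(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.3,  # hypothetical margin value
) -> float:
    """Hinge on similarity: the negative must be at least `margin` less similar."""
    gap = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive)
    return max(0.0, gap + margin)


anchor = [1.0, 0.0]          # estimation from one time window
same_sequence = [1.0, 0.1]   # estimation from another window, same sequence
far_negative = [0.0, 1.0]    # estimation from a clearly different structure
near_negative = [1.0, 0.2]   # estimation that is still too close to the anchor

easy = triplet_term(anchor, same_sequence, far_negative)
hard = triplet_term(anchor, same_sequence, near_negative)
print(easy, hard)
```

The term is zero once the negative is separated from the anchor by at least the margin, so only triplets that are not yet well separated contribute a gradient.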

3.2.3 Invariance of View-Angle

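View-angle of one skeleton sequence should be consistent through time and invariant despite structural perturbations.
Triplet Loss. Similarly, a triplet loss is designed to map view estimations from the same sequence to a small neighborhood while alienating estimations from rotated sequences. Continuing to use the definition of a triplet term in Eq. 4:

$$\mathcal{L}_{trip\_v} = \mathbb{E}_{\mathbf{x}}\Big[\frac{1}{2MK}\sum_{t_1, t_2, k}\big[\tau(\mathbf{v}_{t_1}, \mathbf{v}_{t_2}, \mathbf{v}^{(k)}_{t_2}) + \tau(\mathbf{v}^{(k)}_{t_1}, \mathbf{v}^{(k)}_{t_2}, \mathbf{v}_{t_2})\big]\Big],$$

where $\mathbf{v}^{(k)} = E_v(\hat{\mathbf{x}}^{(k)})$ are the view estimations of the $k$-th rotated sequence, and $t_1 \neq t_2$.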
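Structural Invariance Loss. This signal is to ensure that the view code is invariant to structural change:

$$\mathcal{L}_{inv\_v}^{(s)} = \mathbb{E}_{\mathbf{x}}\Big[\|\bar{\mathbf{v}} - \bar{\mathbf{v}}'\|^2\Big],$$

where $\bar{\mathbf{v}}$ and $\bar{\mathbf{v}}'$ are the view codes of $\mathbf{x}$ and $\mathbf{x}'$, and $\mathbf{x}'$ is the limb-scaled version of $\mathbf{x}$.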

3.2.4 Training Regularization

The loss terms defined above are designed to enforce disentanglement. Besides them, some basic loss terms are needed for this representation learning process.
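Reconstruction Loss. Reconstructing data is the fundamental functionality of auto-encoders. Recall that our decoder outputs reconstructed 3D sequences. Our reconstruction loss minimizes the difference between real data and 3D reconstructions projected back to 2D:

$$\mathcal{L}_{rec} = \mathbb{E}_{\mathbf{x}}\Big[\tfrac{1}{2}\,\|\mathbf{x} - \phi(\hat{\mathbf{X}}, 0; \mathbf{n})\|^2\Big],$$

i.e., we expect $\hat{\mathbf{X}}$ to be the same as the input $\mathbf{x}$ when we directly remove the $z$ coordinates from $\hat{\mathbf{X}}$.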

Adversarial Loss. The unsupervised recovery of 3D motion from joint sequences is achieved through adversarial training. Reconstructed 3D joint sequences are rotated and projected back to 2D, and a discriminator is used to measure the domain discrepancy between the projected 2D sequences and real 2D sequences. The feasibility of recovering static 3D human pose from 2D coordinates with adversarial learning has been verified in several works [37, 14, 7, 34]. We want the reconstructed 3D sequence to look right after we rotate it and project it back to 2D, therefore the adversarial loss is defined as

$$\mathcal{L}_{adv} = \mathbb{E}_{\mathbf{x}}\Big[\frac{1}{K}\sum_{k=1}^{K}\Big(\tfrac{1}{2}\log D(\mathbf{x}) + \tfrac{1}{2}\log\big(1 - D(\hat{\mathbf{x}}^{(k)})\big)\Big)\Big],$$

where $D$ is the discriminator and $\hat{\mathbf{x}}^{(k)}$ is the $k$-th rotated and projected sequence.


3.2.5 Total Loss

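The proposed motion retargeting network can be trained end-to-end with a weighted sum of the loss terms defined above:

$$\mathcal{L} = \sum_{i} \lambda_i\, \mathcal{L}_i,$$

where $\mathcal{L}_i$ ranges over the reconstruction, cross reconstruction, adversarial, triplet and invariance loss terms, and $\lambda_i$ are the corresponding weights.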

4 Experiments

Figure 6: Motion retargeting results. Top to bottom: input source frame, extracted source skeleton, transformed skeleton, generated frame.
Figure 7: Novel view synthesis results. The first row shows the continuous rotation of generated skeleton, and the second row shows the corresponding rendering results.
Figure 8: Latent space interpolation results. Linear interpolation is tested for body structure (horizontal axis) and motion (vertical axis).
Figure 9: Qualitative comparison with state-of-the-art methods. Each column on the right represents a motion retargeting method.
Figure 10: Results of our in-the-wild trained model. Qualitative comparison for models trained with our method on Mixamo and Solo-Dancer separately. The first column gives two challenging motion sources, and the other columns show corresponding results.

4.1 Setup

Implementation details. We perform the proposed training pipeline on the synthetic Mixamo dataset [33] for quantitative error measurement and fair comparison. For in-the-wild training, we collected a motion dataset named Solo-Dancer from online videos. For skeleton-to-video rendering, we recorded target videos and used the synthesis pipeline proposed in [6]. The trained generator is shared by all the motion retargeting methods.

Evaluation metrics. We evaluate the quality of motion retargeting for both skeleton and video, as retargeting results on skeleton would largely influence the quality of generated videos. For skeleton keypoints, we perform evaluations on a held-out test set from Mixamo (with ground truth available) using mean square error (MSE) as the metric. For generated videos, we evaluate the quality of frames with FID score [17] and through a user study.
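
As a rough illustration of the skeleton-level metrics, the sketch below computes a mean square error and a mean absolute error over joint coordinates. The averaging convention (over all frames, joints and coordinates) and the function name are assumptions; the exact normalization used for the reported numbers is not specified here.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Tuple[float, float]]  # one (x, y) pair per joint


def joint_errors(pred: Sequence[Frame], target: Sequence[Frame]) -> Tuple[float, float]:
    """Return (MSE, MAE) averaged over every coordinate of every joint and frame."""
    diffs: List[float] = [
        p - t
        for frame_p, frame_t in zip(pred, target)
        for joint_p, joint_t in zip(frame_p, frame_t)
        for p, t in zip(joint_p, joint_t)
    ]
    if not diffs:
        raise ValueError("empty sequences")
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


# One frame with two joints: predicted versus ground-truth positions.
predicted = [[(0.0, 0.0), (1.0, 1.0)]]
ground_truth = [[(0.0, 1.0), (1.0, 3.0)]]
mse, mae = joint_errors(predicted, ground_truth)
print(mse, mae)  # 1.25 0.75
```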

4.2 Representation Disentanglement

We train the model on unconstrained in-the-wild videos, and the model automatically learns disentangled representations of motion, body structure, and view-angle, which enables a wide range of applications. We test motion retargeting, novel-view synthesis and latent space interpolation to demonstrate the effectiveness of the proposed pipeline.
Motion retargeting. We extract the desired motion from the source skeleton sequence, then retarget the motion to the target person. Videos from the Internet vary drastically in body structure, as shown in Fig. 6. For example, Spiderman has very long legs but the child has short ones. No matter how large the structural gap between the source and the target, our method is capable of generating a skeleton sequence with precisely the same body structure as the target person while preserving the motion of the source person.
Novel-view synthesis. We can explicitly manipulate the view of decoded skeleton in the 3D space, rotating it before projecting it down to 2D. We show an example in Fig. 7. This enables us to see the motion-transferred video at any desired view-angle.
Latent space interpolation. The learned latent representation is meaningful when interpolated, as shown in Fig. 8. Both the motion and the body structure change smoothly between the videos, demonstrating the effectiveness of our model in capturing a reasonable coverage of the manifold.
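
The interpolation grid of Fig. 8 can be mimicked with plain linear interpolation between two pairs of codes. This sketch assumes the codes are flat lists of equal length and uses hypothetical names; in the actual pipeline each interpolated pair would be passed to the decoder to obtain a skeleton sequence.

```python
from typing import List, Sequence, Tuple

Code = List[float]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Code:
    """Linear interpolation between two codes of equal length."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_grid(
    motion_a: Code, motion_b: Code, struct_a: Code, struct_b: Code, steps: int
) -> List[List[Tuple[Code, Code]]]:
    """Rows vary the motion code (vertical axis), columns the structure code."""
    if steps < 2:
        raise ValueError("need at least two interpolation steps")
    ts = [i / (steps - 1) for i in range(steps)]
    return [
        [(lerp(motion_a, motion_b, t_m), lerp(struct_a, struct_b, t_s)) for t_s in ts]
        for t_m in ts
    ]


grid = interpolation_grid([0.0, 0.0], [1.0, 2.0], [1.0], [3.0], steps=3)
print(grid[1][1])  # ([0.5, 1.0], [2.0])
```

The corners of the grid reproduce the four original combinations of motion and structure, and the interior cells are the blended codes.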


Method        MSE     MAE     FID    User   User (wild)
LN            0.0886  0.1616  48.37  81.7%  82.9%
NKN [41]      0.0198  0.0781  67.32  84.5%  86.3%
EDN [6]       0.1186  0.2022  40.56  75.2%  77.1%
LCM [2]       0.0151  0.0749  37.15  68.5%  71.6%
Ours          0.0131  0.0673  31.26  -      -
Ours (wild)   0.0121  0.0627  31.29  -      -

Table 1: Quantitative Results. MSE and MAE are joint position errors measured on Mixamo, reported in the original scale of the data. FID measures the quality of rendered images. Users evaluate the consistency between source videos and generated videos. For each compared method, we report the percentage of users who prefer our model (User) and our in-the-wild trained model (User (wild)), respectively.

4.3 Comparisons to State-of-the-Art Methods

We compare the motion retargeting results of our method with the following methods (including one intuitive method and three state-of-the-art methods) both quantitatively and qualitatively. 1) Limb Normalization (LN) is an intuitive method that calculates a scaling factor for each limb and applies local normalization. 2) Neural Kinematic Networks (NKN) [41] uses detected 3D keypoints for unsupervised motion retargeting. 3) Everybody Dance Now (EDN) [6] applies a global linear transformation on all the keypoints. 4) Learning Character-Agnostic Motion (LCM) [2] performs disentanglement in 2D space in a fully-supervised manner.

For the fairness of the comparison, we train and test all the models on a unified Mixamo dataset, but note that our model is trained with less information, using neither 3D information [41] nor the pairing between motion and skeletons [2]. In addition, we train a separate model with in-the-wild data only. All the methods are evaluated with the aforementioned evaluation metrics.

Our method outperforms all the compared methods in terms of both numerical joint position error and quality of the generated images. EDN and LN are naive rule-based methods: the former does not estimate the body structure, and the latter is bound to fail when the actor is not facing the camera directly. Although NKN is able to transfer motion with little error on the synthesized dataset, it suffers on in-the-wild data due to the unreliability of 3D pose estimation. LCM is trained with a finite set of characters, so its capacity for generalization is limited. In contrast, our method uses limb-scaling to augment the training data, exploring a continuous space of body structures.

It is noteworthy that our method enables training on arbitrary web data, which previous methods cannot exploit. The fact that the model trained on in-the-wild data (i.e., the Solo-Dancer dataset) achieved the lowest error (in Table 1) demonstrates the benefits of training on in-the-wild data. For complex motion such as the one shown in Fig. 10, the model learned from wild data performs better, as wild data features a larger diversity of motion. These results show the advantage of our method in learning from unlimited real-world data, while supervised methods rely on strictly paired data that are hard to expand.

In summary, we attribute the superior performance of our method to the following reasons: 1) Our disentanglement is directly performed in 2D space, which circumvents the imprecise process of 3D-keypoints detection from in-the-wild videos. 2) Our explicit invariance-driven loss terms maximize the utilization of information contained in the training data, evidenced by the largely increased data efficiency compared to implicit unsupervised approaches [41]. 3) Our limb scaling mechanism improves the model’s ability to handle extreme body structures. 4) In-the-wild videos provide an unlimited source of motion, compared to limited movements in synthetic datasets like Mixamo [33].


Method   w/o crs   w/o trip   w/o adv   Ours (full)
MSE      0.0392    0.0154     0.0136    0.0131
MAE      0.1259    0.0708     0.0682    0.0673

Table 2: Ablation Study Results.

4.4 Ablation study

We train several ablated models to study the impact of the individual loss terms. The results are shown in Table 2. We design three ablated models. The w/o crs model is created by removing the cross reconstruction loss. The w/o trip model is created by removing the triplet loss. The w/o adv model is created by removing the adversarial loss. Removing the cross reconstruction loss has the most detrimental effect on the 2D retargeting performance of our model, evidenced by the roughly threefold increase in MSE (0.0392 vs. 0.0131). Removal of the triplet loss increased the MSE by about 18%. Although removing the adversarial loss does not significantly affect the 2D retargeting performance of our model, the rotated sequences look less natural without it.
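
The relative effect of each ablation can be checked directly from the MSE row of Table 2. The short script below only redoes that arithmetic; the variable names are arbitrary.

```python
# MSE values copied from Table 2.
FULL_MSE = 0.0131
ABLATED_MSE = {"w/o crs": 0.0392, "w/o trip": 0.0154, "w/o adv": 0.0136}


def mse_ratio(ablated: float, full: float = FULL_MSE) -> float:
    """How many times larger the ablated MSE is than the full model's MSE."""
    return ablated / full


ratios = {name: mse_ratio(value) for name, value in ABLATED_MSE.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the full-model MSE ({(ratio - 1) * 100:+.0f}%)")
```

The cross reconstruction loss accounts for by far the largest change, while the adversarial loss barely moves the 2D error, in line with its role of making rotated views look natural rather than improving joint positions.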

5 Conclusion

In this work, we propose a novel video motion retargeting approach, in which motion can be successfully transferred in scenarios where large variations of body structure exist between the source and target person. The proposed motion retargeting network runs on 2D skeleton input only, which makes it a lightweight and plug-and-play module that is complementary to existing skeleton extraction and skeleton-to-video rendering methods. Leveraging three inherent invariance properties in temporal sequences, the proposed network can be trained with unlabeled web data end-to-end. Our experiments demonstrate the promising results of our method and the effectiveness of the invariance-driven constraints.

Acknowledgement. This work is supported by the SenseTime-NTU Collaboration Project, Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP. We would like to thank Tinghui Zhou, Rundi Wu and Kwan-Yee Lin for insightful discussion and their exceptional support.


  • [1] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or (2019) Deep video-based performance cloning. Comput. Graph. Forum 38, pp. 219–233. Cited by: §2.
  • [2] K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or (2019) Learning character-agnostic motion for motion retargeting in 2d. ACM Trans. Graph. 38 (4), pp. 75:1–75:14. Cited by: TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting, §1, §1, §2, §4.3, §4.3, Table 1.
  • [3] R. Alp Güler, N. Neverova, and I. Kokkinos (2018) Densepose: dense human pose estimation in the wild. In CVPR, pp. 7297–7306. Cited by: §1, §3, In-the-wild dataset..
  • [4] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. V. Guttag (2018) Synthesizing images of humans in unseen poses. In CVPR, Cited by: §1, §2.
  • [5] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, Cited by: §1, §3.
  • [6] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In ICCV, Cited by: TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting, §1, §1, §1, §2, §4.1, §4.3, Table 1, S1.3 Skeleton-to-Video Rendering.
  • [7] C. Chen, A. Tyagi, A. Agrawal, D. Drover, S. Stojanov, and J. M. Rehg (2019) Unsupervised 3d pose estimation with geometric self-supervision. In CVPR, pp. 5714–5724. Cited by: §3.2.4.
  • [8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, Cited by: §2.
  • [9] X. Chen, K. Lin, W. Liu, C. Qian, and L. Lin (2019) Weakly-supervised discovery of geometry-aware representation for 3d human pose estimation. In CVPR, Cited by: §1.
  • [10] K. Choi and H. Ko (2000) Online motion retargetting. Journal of Visualization and Computer Animation 11, pp. 223–235. Cited by: §2.
  • [11] R. de Bem, A. Ghosh, A. Boukhayma, T. Ajanthan, N. Siddharth, and P. H. S. Torr (2019) A conditional deep generative model of people in natural images. In WACV, pp. 1449–1458. Cited by: §2.
  • [12] DensePose: dense human pose estimation in the wild. Note: Cited by: S1.1 Skeleton Extraction.
  • [13] E. L. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In NeurIPS, Cited by: §1, §2.
  • [14] D. Drover, C. Chen, A. Agrawal, A. Tyagi, and C. Phuoc Huynh (2018) Can 3d pose be learned from 2d projections alone?. In ECCV, Cited by: §3.2.4.
  • [15] P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. In CVPR, Cited by: §1, §2.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.
  • [17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §4.1, FID..
  • [18] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework.. In ICLR, Cited by: §2.
  • [19] J. K. Hodgins and N. S. Pollard (1997) Adapting simulated behaviors for new characters. In SIGGRAPH, Cited by: §2.
  • [20] J. Hsieh, B. Liu, D. Huang, F. Li, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In NeurIPS, Cited by: §1, §2.
  • [21] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2.
  • [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §3.
  • [23] D. Joo, D. Kim, and J. Kim (2018) Generating a fusion image: one’s identity and another’s shape. In CVPR, Cited by: §1.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: S1.2 Motion Retargeting Network.
  • [25] A. Kumar, P. Sattigeri, and A. Balakrishnan (2018) Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, Cited by: §2.
  • [26] H. Lee, H. Tseng, J. Huang, M. K. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §2.
  • [27] J. Lee and S. Y. Shin (1999) A hierarchical approach to interactive motion editing for human-like figures. In SIGGRAPH, Cited by: §2.
  • [28] L. Liu, W. Xu, M. Zollhoefer, H. Kim, F. Bernard, M. Habermann, W. Wang, and C. Theobalt (2018) Neural rendering and reenactment of human actor videos. arXiv preprint arXiv:1809.03658. Cited by: §1, §2.
  • [29] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. In ICCV, Cited by: §2.
  • [30] W. Liu, Z. Piao, and S. Gao (2019) Liquid warping gan: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In ICCV, Cited by: §1.
  • [31] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. V. Gool (2017) Pose guided person image generation. In NeurIPS, Cited by: §2.
  • [32] L. Ma, Q. Sun, S. Georgoulis, L. V. Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In CVPR, Cited by: §2.
  • [33] Mixamo. Note: Cited by: §4.1, §4.3, Synthesized dataset..
  • [34] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, pp. 7753–7762. Cited by: §3.2.4.
  • [35] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas (2018) Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In CVPR, Cited by: §1.
  • [36] S. Qian, K. Lin, W. Wu, Y. Liu, Q. Wang, F. Shen, C. Qian, and R. He (2019) Make a face: towards arbitrary high fidelity face manipulation. In ICCV, Cited by: §2.
  • [37] V. Ramakrishna, T. Kanade, and Y. Sheikh (2012) Reconstructing 3d human pose from 2d image landmarks. In ECCV, pp. 573–586. Cited by: §3.2.4.
  • [38] S. Tak and H. Ko (2005) A physically-based motion retargeting filter. ACM Trans. Graph. 24, pp. 98–117. Cited by: §2.
  • [39] J. B. Tenenbaum and W. T. Freeman (2000) Separating style and content with bilinear models. Neural Computation 12, pp. 1247–1283. Cited by: §2.
  • [40] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: decomposing motion and content for video generation. In CVPR, Cited by: §1, §2.
  • [41] R. Villegas, J. Yang, D. Ceylan, and H. Lee (2018) Neural kinematic networks for unsupervised motion retargetting. In CVPR, Cited by: TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting, §1, §1, §2, §4.3, §4.3, §4.3, Table 1.
  • [42] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. In ICLR, Cited by: §2.
  • [43] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In NeurIPS, Cited by: §1, §2, §3.
  • [44] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. In NeurIPS, Cited by: §1, §1, §2, §3.
  • [45] W. F. Whitney, M. Chang, T. D. Kulkarni, and J. B. Tenenbaum (2016) Understanding visual concepts with continuation learning. In ICLR Workshop, Cited by: §2.
  • [46] W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019) Disentangling content and style via unsupervised geometry distillation. In ICLRW, Cited by: §2.
  • [47] W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019) TransGaGa: geometry-aware unsupervised image-to-image translation. In CVPR, Cited by: §2.
  • [48] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang (2017) Learning feature pyramids for human pose estimation. In ICCV, Cited by: §1, §3.


The content of our supplementary material is organized as follows.

  1. Details about the implementation of the three stages of our method.

  2. Datasets and evaluation metrics we use in our experiments.

  3. Qualitative results of ablation study.

S1. Implementation Details

S1.1 Skeleton Extraction

We use a pretrained DensePose model [12] for skeleton extraction; missing keypoints are filled in by nearest-neighbor interpolation. The extracted skeleton sequences are smoothed using a Gaussian kernel with a temporal standard deviation . We use joints for a skeleton; the detailed skeleton format will be given in our GitHub repository.
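The two post-processing steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `(T, J, 2)` array layout, and the boolean detection mask are assumptions, and the smoothing width `sigma` is left as a parameter. The joint count and `sigma` in the usage lines are arbitrary placeholders.

```python
import numpy as np


def fill_missing_nearest(seq, valid):
    """Fill missing keypoints with the nearest valid frame of the same joint.

    seq:   (T, J, 2) array of 2D keypoints.
    valid: (T, J) boolean mask, False where the detector missed a joint.
    """
    filled = seq.copy()
    frames = np.arange(seq.shape[0])
    for j in range(seq.shape[1]):
        good = frames[valid[:, j]]
        if good.size == 0:
            continue  # joint never detected; leave it untouched
        # For every frame, pick the closest frame in which joint j was detected.
        nearest = good[np.abs(frames[:, None] - good[None, :]).argmin(axis=1)]
        filled[:, j] = seq[nearest, j]
    return filled


def gaussian_smooth(seq, sigma):
    """Smooth a (T, J, 2) sequence along time with a normalized Gaussian kernel."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # Repeat the first and last frames so the ends are not pulled toward zero.
    padded = np.pad(seq, ((radius, radius), (0, 0), (0, 0)), mode="edge")
    out = np.zeros(seq.shape, dtype=float)
    for k, w in enumerate(kernel):
        out += w * padded[k:k + seq.shape[0]]
    return out


# Usage with placeholder sizes (10 joints and sigma=1.5 are arbitrary choices).
rng = np.random.default_rng(0)
keypoints = rng.normal(size=(30, 10, 2))
mask = rng.random((30, 10)) > 0.1
smoothed = gaussian_smooth(fill_missing_nearest(keypoints, mask), sigma=1.5)
```

Filling gaps before smoothing keeps undetected joints from dragging their neighbors in time toward arbitrary values.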

S1.2 Motion Retargeting Network

The sizes of the latent representations are , and . Our encoders down-sample the input sequences to an eighth of their original length, therefore . For limb-scaling, we use global and local scaling factors randomly sampled from (uniformly distributed). For view perturbations we use . Our motion retargeting network is trained steps with batch size and learning rate using the Adam [24] optimization algorithm. The weights of the loss terms are given as follows: . These parameters are determined through quantitative and qualitative experiments on a validation set.
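As a rough illustration of limb-scaling with uniformly sampled global and local factors, the sketch below rescales each bone of a skeleton and rebuilds the joint positions from the root outward. Everything specific here is an assumption made for the example: the parent-index representation of the skeleton, the function names, the sampling range passed as `low` and `high`, and the choice to reuse one set of factors for all frames of a sequence.

```python
import numpy as np


def scale_limbs(joints, parents, global_scale, local_scales):
    """Rescale every bone of one skeleton and rebuild the joint positions.

    joints:       (J, D) joint coordinates, with parents listed before children.
    parents:      parents[j] is the parent index of joint j, -1 for the root.
    global_scale: scalar applied to all bones.
    local_scales: (J,) per-bone factors (the root's entry is ignored).
    """
    joints = np.asarray(joints, dtype=float)
    out = joints.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root stays where it is
        bone = joints[j] - joints[p]
        out[j] = out[p] + global_scale * local_scales[j] * bone
    return out


def random_limb_scaling(seq, parents, low, high, rng):
    """Apply one random global/local scaling to a whole (T, J, D) sequence."""
    g = rng.uniform(low, high)
    local = rng.uniform(low, high, size=len(parents))
    return np.stack([scale_limbs(frame, parents, g, local) for frame in seq])


# Usage on a hypothetical 3-joint chain; the range (0.5, 1.5) is a placeholder.
chain = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
sequence = np.stack([chain, chain + 1.0])
augmented = random_limb_scaling(sequence, [-1, 0, 1], 0.5, 1.5, np.random.default_rng(0))
```

Scaling bone vectors rather than joint coordinates changes limb lengths while preserving joint angles, so the motion is kept and only the body structure varies.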

S1.3 Skeleton-to-Video Rendering

For skeleton-to-video rendering, we recorded target videos of subjects as training data (none of the recorded subjects is an author of this work). We use the synthesis pipeline proposed in [6]. Each generator is trained on the target video for epochs and the output size is .

Figure 11: Visualization of retargeting error computation with our model. Input joint sequences (red) are plotted on the diagonal. Off the diagonal are the retargeted sequences (blue) from our model and the ground truth (yellow); where the two overlap, the color becomes green. Sequences on the same row are expected to perform the same motion, while sequences on the same column are expected to share the same body structure.

S2. Experimental Details

S2.1 Dataset

In-the-wild dataset.

For training on unlabeled web data, we collected a motion dataset named Solo-Dancer. We downloaded from YouTube categories of dancing videos; each video features only a single dancer. The total length of the videos adds up to hours. We then used an off-the-shelf 2D keypoint detector [3] to extract keypoints frame by frame to be used as our training data.

Synthesized dataset.

We also apply the proposed unsupervised training pipeline to the synthetic Mixamo dataset [33] in order to quantitatively evaluate the transfer results against ground truth and compare them with baseline methods. The training set comprises characters; each character has sequences, for a total of hours per character.

S2.2 Evaluation Metrics

MSE and MAE.

For an inferred sequence and a groundtruth sequence

where is the subscript of body joints and is the subscript of time. Similarly,

These two metrics are measured in the original scale of the Mixamo dataset. The errors are computed after hip-alignment, as visualized in Figure 11.
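A minimal sketch of how hip-aligned MSE and MAE could be computed on joint sequences is given below. It assumes the usual definitions averaged over joints and time, with the squared Euclidean distance per joint for MSE and the L1 distance per joint for MAE; these per-joint norms, the `(T, J, D)` array layout, the function names, and the hip joint index are assumptions of this example rather than details confirmed by the text.

```python
import numpy as np


def hip_align(seq, hip_index=0):
    """Translate each frame of a (T, J, D) sequence so the hip is at the origin."""
    return seq - seq[:, hip_index:hip_index + 1, :]


def retargeting_errors(pred, gt, hip_index=0):
    """Return (MSE, MAE) between two (T, J, D) sequences after hip alignment."""
    diff = hip_align(pred, hip_index) - hip_align(gt, hip_index)
    # Assumed per-joint norms: squared L2 for MSE, L1 for MAE.
    mse = float(np.mean(np.sum(diff ** 2, axis=-1)))
    mae = float(np.mean(np.sum(np.abs(diff), axis=-1)))
    return mse, mae


# Usage: a pure translation of the whole skeleton yields zero error.
rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 5, 2))
mse, mae = retargeting_errors(truth + 3.0, truth)
```

Aligning at the hip removes global translation, so the metrics reflect pose differences rather than where the skeleton sits in the frame.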


FID.

We calculate the Fréchet Inception Distance (FID) [17] to evaluate the quality of generated frames. FID measures the perceptual distance between the generated frames and the real target frames; a smaller value indicates higher visual consistency.
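For reference, the Fréchet distance underlying FID can be sketched in a few lines once image features are available. The example assumes the Inception features have already been extracted into two `(N, D)` arrays with `D >= 2`; feature extraction is outside its scope, and the function name is illustrative rather than taken from any particular library.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative up to
    # numerical noise for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Usage with random stand-in features (real use would pass Inception features).
rng = np.random.default_rng(0)
real_feats = rng.standard_normal((200, 4))
fake_feats = real_feats + np.array([1.0, 0.0, 0.0, 0.0])
distance = frechet_distance(real_feats, fake_feats)
```

The distance is zero when both feature sets have the same mean and covariance, and it grows as either statistic drifts apart.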

User study.

For the quality of retargeted videos, we ask volunteers to perform subjective pairwise A/B tests. For each method ( baseline and ours), we test retargeted videos with the combination of source and target individuals. All the videos are 10 seconds in length. In each pair of retargeted videos from two different methods, participants choose the video with better motion consistency between the source video and the retargeted video. Source videos are also shown to participants for reference. For each baseline method, retargeted videos are compared times by different participants against our model. Our model has two variants trained on different training sets (i.e., Mixamo and Solo-Dancer); the results are shown in Table 1 of the main paper as “User” and “User (wild)”, respectively.

Figure 12: Qualitative results of ablation study. The first column gives the motion sources, and the other columns show corresponding results.

S3. Qualitative Ablation Study

Besides testing standard MSE, we render the retargeted videos for further comparison. As can be observed in Fig. 12, the full model produces the best-quality results. The cross reconstruction loss plays an essential role in disentanglement. The results without the triplet loss show slightly degraded quality at the frame level. However, it is important to note that the triplet loss is used to smooth the structure and view codes temporally, thereby stabilizing the generated video. The adversarial loss improves the plausibility of the generated joint sequences, making them look more natural and realistic. Recall that the adversarial loss is applied to randomly rotated output joint sequences to make them indistinguishable from real data.
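To make the role of a triplet loss on latent codes concrete, the sketch below shows the generic hinge-style triplet margin loss. It is the standard textbook form, not necessarily the exact formulation used in this work: the Euclidean distance, the `margin` parameter, and the way anchors, positives, and negatives are paired are all assumptions of the example.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin):
    """Mean hinge loss that pulls positives closer to anchors than negatives.

    anchor, positive, negative: (N, C) arrays of latent codes.
    margin: required gap between the two distances (hypothetical parameter).
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Usage: codes of the same sequence at two time steps could serve as
# anchor and positive, with a code from another sequence as the negative.
a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 0.0]])
n = np.array([[3.0, 4.0]])
satisfied = triplet_margin_loss(a, p, n, margin=1.0)
violated = triplet_margin_loss(a, n, p, margin=1.0)
```

When the positive is already closer than the negative by at least the margin, the loss is zero, so only violating triplets contribute gradients.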