
Human Identity-Preserved Motion Retargeting in Video Synthesis by Feature Disentanglement

by Jingzhe Ma, et al.

Most motion retargeting methods in human action video synthesis decompose the input video into motion (dynamic information) and shape (static information). However, we observe that if the dynamic information is directly transferred to another subject, the synthesized motion appears unnatural. This phenomenon is mainly caused by neglecting subject-dependent information in motion. To solve this problem, we propose a novel motion retargeting method that combines subject-independent information (common motion content) from a source video with subject-dependent information (individualized identity motion) from a target video, so that it can synthesize videos with a more natural appearance along with identity-preserved motion. In the proposed method, two encoders are employed to extract identity and motion content representations respectively. We employ an adaptive instance normalization (AdaIN) layer in the generator and instance normalization (IN) layers in the motion content encoder to synthesize the new motion. Besides, we collected a dataset, named Chuang101, with 101 subjects in total. Each subject performs identical dancing movements, which makes it convenient to disentangle the motion and identity of each subject. Furthermore, an efficient quantitative metric for identity information is designed based on gait recognition. The experiments show that the proposed method synthesizes videos more naturally while the subject's identity is preserved.





1. Introduction

Figure 1. Human identity-preserved motion retargeting. The subject-independent information of the source video, i.e. the common motion content (MC), is transferred to a target appearance while the subject-dependent information, i.e. the individualized identity motion (ID), of the target video is preserved.
Figure 2. The pipeline of the inference framework. Different from the training stage (Figure 3), the framework has only two input videos, i.e. the source video and the target video. The 2D keypoints are estimated by an off-the-shelf method, OpenPose (Cao et al., 2019). The two motions are disentangled into ID and MC features by the disentanglement block, and a new motion is synthesized by the motion synthesis block. Finally, the pre-trained VQ-GAN (Esser et al., 2021) is employed to render the synthesized video.

Motion retargeting refers to transferring motion from a source video of one person to another person in a target video. This technology has great potential in augmented reality (AR) and virtual reality (VR). With its help, users can easily generate their own videos with specific motions even if they never performed those motions, and movie clips can be generated without the actors performing. It is a key technology in video editing and content generation.

There are many methods to achieve motion retargeting, which can be divided into three categories by data modality: image-based (Li et al., 2019; Zhou et al., 2019; Siarohin et al., 2018; Zhu et al., 2019), video-based (Chan et al., 2019; Wang et al., 2018, 2019; Mallya et al., 2020; Huang et al., 2021), and skeleton-based (Chan et al., 2019; Aberman et al., 2019b, 2020a; Yang et al., 2020; Liu et al., 2019). The image-based methods obtain encouraging results on conditional generation. However, they ignore the context in the temporal domain, so the generated motion may not be concordant with the appearance. The video-based methods partially solve this temporal neglect but may introduce complex models that are hard to train. Many skeleton-based methods address both problems. They can be further divided into 3D keypoint-based (Aberman et al., 2020a; Liu et al., 2019) and 2D keypoint-based (Chan et al., 2019; Aberman et al., 2019b; Yang et al., 2020; Aberman et al., 2019a) methods. Since it is difficult to obtain accurate 3D keypoints from video with keypoint estimation algorithms (Yoshiyasu et al., 2018), we study motion retargeting in the 2D keypoint modality. However, we observe that previous 2D skeleton-based methods do not consider the subject-dependent information in motion features. For example, if the motion of a male subject is directly transferred to a female subject, the generated motion will not be concordant with the female.

Recently, some works (Xie et al., 2021; Su et al., 2021) have tried to solve the aforementioned problem. However, the following issues remain: (1) no efficient decoupling of subject-dependent and subject-independent information; (2) a lack of labeled data containing similar motion videos from different subjects, without which it is difficult to evaluate the disentanglement; (3) the lack of a suitable metric to evaluate whether synthetic motion contains subject-dependent information.

In this study, we propose a novel identity-preserved motion retargeting framework, shown in Figure 2, with its training pipeline shown in Figure 3. The inference framework is divided into three stages: skeleton estimation, motion transfer in 2D keypoint space by feature disentanglement (our method), and skeleton-to-video rendering. In the first stage, we employ an off-the-shelf algorithm, OpenPose (Cao et al., 2019), to estimate the 2D skeleton sequences of the dancer in the video. In the second stage, to cope with the first challenge above, we propose a novel disentanglement block that decomposes motions into subject-dependent representations, i.e. individualized identity motion (ID), and subject-independent representations, i.e. common motion content (MC). In addition, we propose a motion synthesis block capable of synthesizing motion while preserving subject identity. Since adaptive instance normalization (AdaIN) and instance normalization (IN) can manipulate feature statistics (Huang and Belongie, 2017), we employ AdaIN and IN in our blocks to retarget motions. Meanwhile, a triplet loss (Schroff et al., 2015) is utilized to contrastively learn the ID and MC information of subjects. To address the second challenge, we collected a dataset with 101 subjects from the Internet, named Chuang101. Compared with previous datasets, Chuang101 has the following advantages: more subjects (dancers), complex dance actions, and every subject performing the same group of actions. For the third challenge, we propose a new metric, named Identity Score (IDScore), to evaluate whether the synthesized motion preserves ID information; it is an efficient quantitative metric built on a gait recognition algorithm. In the last stage, we convert the transferred 2D skeleton sequence frame-by-frame to a pixel-level representation and feed these images into a pre-trained VQ-GAN (Esser et al., 2021) network to render the video.

We summarize our contributions as follows:

  • We propose a novel framework in 2D skeleton space, which can disentangle the ID and MC representations and synthesize a motion that can preserve the ID information of the target subject.

  • 101 dance videos have been collected online from the public domain, and the different subjects in this dataset perform the same group of actions.

  • We propose a new metric, Identity Score (IDScore), to evaluate whether the identity representation is correctly retargeted to the synthetic motion. Experiments demonstrate that our method can disentangle the identity representation and apply it to the synthesized videos effectively.

Figure 3. The training stage of our proposed method. Before training, we sample paired motion videos and employ OpenPose to obtain 2D skeleton sequences. During training, we train the disentanglement block (Section 3.1) and the motion synthesis block (Section 3.2). Each branch (color) processes the real motion of one subject and yields its ID and MC representations; the MC feature of one branch (left color) is combined with the ID feature of another branch (right color) to produce the synthesized motions. To ensure that we correctly disentangle the ID and MC information, we use several loss functions (Section 3.4).

2. Related work

Over the last several years, some works have been dedicated to skeleton-based motion retargeting from 2D inputs. Several works (Ma et al., 2017; Dong et al., 2018; Zhao et al., 2020; Cui et al., 2021) have explored motion retargeting in image space. Ma et al. (Ma et al., 2017) proposed a pioneering top-down method based mainly on a conditional GAN (Mirza and Osindero, 2014); it contains two stages: pose integration and image refinement. Top-down methods usually show poor results when there are large differences between source and target poses. To address this deficiency, deformation-based methods (Dong et al., 2018; Cui et al., 2021) were proposed, which transfer the feature maps of the source person image into the target pose. To explore the latent feature representation in motion retargeting, Zhao et al. (Zhao et al., 2020) extract pose features from a pose sequence interpolated from the source pose to the target pose. Although the works mentioned above achieve impressive results, they only retarget the action in image space and ignore temporal information.

Several approaches focus on video motion retargeting from 2D inputs. Aberman et al. (Aberman et al., 2019a) proposed a two-branch framework with part confidence maps as 2D pose representations to clone the performance in a video. Aberman et al. (Aberman et al., 2019b) trained a deep neural network to decompose temporal sequences of 2D poses into three components: motion, skeleton, and camera view-angle. Recently, Yang et al. (Yang et al., 2020) proposed a method, named TransMoMo, that can be trained in an unsupervised manner to decompose the same three components: view, shape, and motion. However, these methods directly transfer the motion from the source subject to the target subject, which results in abnormal motion of the target subject after synthesis.

More recently, to address the above problem, Xie et al. (Xie et al., 2021) proposed a novel network to decompose motion into unique and common features. For the disentanglement of unique and common features, they employed a gradient reversal layer (GRL) (Ganin et al., 2016) to restrict the gradient of the common features. However, their method is not effective when the identity label information is unseen, i.e. for in-the-wild videos. Therefore, we propose a novel disentanglement block containing an MC encoder and an ID encoder. Meanwhile, a triplet loss (Hoffer and Ailon, 2015) is employed to decompose the ID and MC representations. In addition, we employ IN layers in the MC encoder to suppress the identity information in motions and perform global average pooling in the ID encoder to suppress the motion content information. Finally, we employ an AdaIN layer to statistically fuse the MC and ID representations.

Unlike other related works (Chan et al., 2019; Aberman et al., 2019b, 2020a; Yang et al., 2020; Liu et al., 2019), our evaluation needs to assess whether the identity information is transferred correctly. Xie et al. (Xie et al., 2021) employed a classification network to evaluate the identity information. However, that metric cannot be applied when the identity is unseen. Thus, we propose a novel quantitative metric based on a gait recognition algorithm.

3. Methods

Our problem can be defined as follows: given a target motion x_t and a source motion x_s in 2D space, we disentangle the identity (ID) and motion content (MC) features with a learned disentanglement block D and synthesize a motion with a motion synthesis block S. The synthetic motion has the ID information of the target motion x_t and the MC information of the source motion x_s. The formula can be represented as:

x̂ = S(D_MC(x_s), D_ID(x_t)),

where x_t, x_s ∈ R^{2K×T}, K is the number of skeleton points, and T is the number of frames.

Our framework for the training stage is illustrated in Figure 3. Before training, we sample three videos and estimate their 2D keypoints with an off-the-shelf network, OpenPose (Cao et al., 2019). The estimated results are denoted x^(1), x^(2), and x^(3): real motions whose shape is 2K×T, where K is the number of keypoints and T is the number of motion frames. In particular, one pair of these motions shares the same MC label with different ID labels, while another pair shares the same ID label with different MC labels. Our training stage contains three branches. We feed each motion into the disentanglement block of the corresponding branch, where the disentanglement blocks share parameters, and obtain the ID and MC features of each branch. Afterwards, we cross the MC and ID features to take full advantage of these samples: we form the Cartesian product of the three MC features and the three ID features, obtaining nine combinations, and feed each combination to the motion synthesis block to get a synthesized motion that carries the MC information of the i-th branch and the ID information of the j-th branch, with i, j ∈ {1, 2, 3}. A discriminator, a temporal convolutional network, is employed to ensure that the distribution of the synthetic motions is similar to that of the real motions. Moreover, we apply the disentanglement block again after the motion synthesis block to ensure that the ID and MC information of the synthesized motion is similar to that of the input motions. The inference pipeline is shown as the second stage in Figure 2.
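The 3×3 crossing of branch features described above can be sketched in a few lines of Python; the function name and placeholder feature values are ours, for illustration only:

```python
# Sketch of the feature-crossing step: take the MC and ID features of the
# three training branches and form all 3 x 3 combinations; each pair is
# later fed to the motion synthesis block.
def cross_features(mc_feats, id_feats):
    """Cartesian product of per-branch MC and ID features."""
    return [(i, j, mc, idf)
            for i, mc in enumerate(mc_feats)
            for j, idf in enumerate(id_feats)]

pairs = cross_features(["mc1", "mc2", "mc3"], ["id1", "id2", "id3"])
```

Three of the nine pairs (i = j) correspond to plain reconstruction; the other six transfer the MC of one subject onto the ID of another.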

3.1. Disentanglement block

To obtain the ID and MC representations in the latent space, we propose a disentanglement block, as shown in Figure 4. The block mainly contains two encoders, the MC encoder E_MC and the ID encoder E_ID, which capture the MC and ID representations respectively.

Figure 4. The pipeline of our disentanglement block, where IN means instance normalization, E_MC and E_ID denote the MC and ID encoders, the identity projection head projects the ID feature, and global average pooling is performed over the temporal dimension.

The MC encoder E_MC consists of several one-dimensional convolution layers. The MC feature can be represented as:

f_MC = E_MC(x) ∈ R^{C_MC×T_MC},

where T_MC is the sequence length after the MC encoder and C_MC is the number of channels. In addition, inspired by (Aberman et al., 2020b), we use instance normalization (IN) layers in E_MC, which mainly remove local statistics, in other words the identity information, while preserving the motion content.
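As a minimal illustration of what the IN layers do, the following NumPy sketch normalizes each channel of a motion feature over time, removing the per-channel statistics that the paper associates with identity; the shapes and names are our assumptions:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization over the temporal axis.

    x: (channels, frames) feature of a single motion sequence.
    Removes each channel's mean and scale (the "local statistics"),
    keeping only the normalized temporal pattern, i.e. the motion content.
    """
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)

feat = np.random.randn(8, 64) * 3.0 + 5.0   # 8 channels, 64 frames
normed = instance_norm(feat)                 # per-channel mean ~0, std ~1
```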

The ID encoder E_ID has a network structure similar to E_MC except for the IN layers. The identity sequence feature can be denoted as:

s_ID = E_ID(x) ∈ R^{C_ID×T_ID},

where T_ID is the sequence length after the ID encoder and C_ID is the number of channels. It is worth mentioning that E_ID applies average pooling over time after the last downsampling layer, so the ID feature can be denoted as:

f_ID = AvgPool_t(s_ID),
where AvgPool_t denotes average pooling over the temporal dimension. Furthermore, (Chen et al., 2020) proposed a non-linear transformation, the projection head, which improves performance on downstream tasks. Therefore, we employ an identity projection head, which contains two fully connected layers that project the identity feature into another latent space. The projected feature can be denoted as:

p = W_2 σ(W_1 f_ID) ∈ R^{C_p},

where σ is the LeakyReLU activation function, W_1 and W_2 are the weights of the two fully connected layers, and C_p is the number of channels after the identity projection head. It is worth mentioning that p is mainly used to compute the triplet loss (Equation 8), while f_ID is mainly used for the downstream task, i.e. motion synthesis.
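A minimal sketch of such a two-layer projection head; the dimensions, weights, and LeakyReLU slope here are illustrative, not the paper's settings:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Elementwise LeakyReLU."""
    return np.where(x > 0, x, slope * x)

def identity_projection_head(f_id, w1, w2):
    """Two fully connected layers projecting the pooled ID feature
    into the space used for the triplet loss."""
    return w2 @ leaky_relu(w1 @ f_id)

rng = np.random.default_rng(0)
f_id = rng.standard_normal(128)          # pooled ID feature (toy size)
w1 = rng.standard_normal((64, 128))      # first FC layer weights
w2 = rng.standard_normal((32, 64))       # second FC layer weights
p = identity_projection_head(f_id, w1, w2)
```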

3.2. Motion synthesis block

Figure 5. The pipeline of our motion synthesis block. The Conv block contains a 1D convolution layer and a LeakyReLU activation function. The Upsample layer uses linear interpolation. ToImageSpace and FromImageSpace are 1D convolution layers with kernel size 1 that map the learned (channel-wise) space to the real motion space and back, respectively.

Our motion synthesis block is shown in Figure 5. Recall that we decompose ID and MC features with our disentanglement block. First, we extract the statistical representations (mean μ_ID and variance σ_ID) of the ID feature with a multilayer perceptron (MLP). Then, we apply these statistics in the AdaIN layer:

AdaIN(f_MC) = σ_ID · (f_MC − μ(f_MC)) / σ(f_MC) + μ_ID,

where μ(·) and σ(·) are the per-channel mean and standard deviation over time. The AdaIN layer applies a per-channel affine transformation built into a normalization layer. Note that the affine transformation is temporally invariant and hence only affects non-temporal features of the motion, i.e. the ID information. Because the IN layers remove the local statistics of the MC feature, the AdaIN layer can fuse the MC information with the new ID information by scaling and shifting the feature channels. Afterwards, we employ a skip connection to learn high-level representations of the fused features. Inspired by (Karras et al., 2017), we employ several PG blocks in the decoder to improve the quality of the synthesized motion. A PG block mainly contains a Conv block, an Upsample layer, a ToImageSpace layer, and a FromImageSpace layer. In particular, ToImageSpace and FromImageSpace are 1D convolution layers with kernel size 1, and a hyperparameter controls the importance of the upsampled feature when it is added to the image-space feature of each channel. The Conv block contains a 1D convolution layer and a LeakyReLU activation function.
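The AdaIN fusion above can be sketched in NumPy as follows; here the ID statistics are passed in directly rather than predicted by the MLP, and all shapes are toy values:

```python
import numpy as np

def adain(f_mc, id_mean, id_std, eps=1e-5):
    """Adaptive instance normalization: strip the MC feature's own
    per-channel statistics, then shift/scale each channel with
    statistics derived from the ID feature.

    f_mc: (channels, frames); id_mean, id_std: (channels, 1).
    """
    mean = f_mc.mean(axis=1, keepdims=True)
    std = f_mc.std(axis=1, keepdims=True)
    return id_std * (f_mc - mean) / (std + eps) + id_mean

f_mc = np.random.randn(4, 64)            # normalized MC feature
id_mean = np.full((4, 1), 2.0)           # would come from the ID-feature MLP
id_std = np.full((4, 1), 0.5)
fused = adain(f_mc, id_mean, id_std)     # channels now carry the ID statistics
```

Because the scale and shift are constant over time, the temporal pattern (the motion content) is preserved while each channel's statistics are replaced.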

Figure 6. Schematic diagram for IDScore. The three pipelines denote the raw-data, crossing-stage, and reconstruction-stage evaluation pipelines respectively.

3.3. Identity score

In this section, we introduce our proposed metric, Identity Score (IDScore). The goal of gait recognition is to capture unique biometric characteristics from the common human walking process, which is well suited to our purpose, i.e. identifying individuals from generally similar dance motions. Thus, we employ a gait recognition algorithm, GaitGraph (Teepe et al., 2021), to capture the ID information and evaluate whether it is transferred correctly in the synthesized motion. In gait recognition, the test set is usually split into a gallery set and a probe set, where the identities of the sequences in the former are considered known while those in the latter are unknown.

To evaluate identity recognition performance, we conduct three experiments: testing the raw probe data, reconstructing the probe data, and crossing a new subject into the probe data. The details are shown in Figure 6. In the raw-data stage, the evaluated dataset is divided into gallery and probe sets with non-overlapping motion sequences. In the reconstruction stage, all probe data is reconstructed and the reconstruction results are fed to the off-the-shelf GaitGraph network to obtain the recognition accuracies of the reconstruction stage. In the crossing stage, we collect a motion of a new subject whose identity is absent from the gallery set. The ID information of the new subject is then transferred to all probe data, and the crossed probe data is fed into GaitGraph to obtain the recognition accuracies of the crossing stage. The crossing accuracies reflect the recognition effect after transferring the new subject's identity information. Ideally, they should be at the level of random guessing, because the new subject is not registered in the gallery set.

For the reconstruction stage, a larger accuracy means a better reconstruction effect. In contrast, for the crossing stage, a smaller accuracy means better transfer performance, because the transferred identity is not in the gallery data. It is worth mentioning that another possible reason for a decrease in crossing accuracy is poor reconstruction. The difference between the reconstruction and crossing recognition accuracies is therefore used to eliminate this factor, and we name it IDScore. A higher IDScore indicates that more identity information is preserved. IDScore can be formulated as:

IDScore = Acc_reconstruction − Acc_crossing.
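The metric itself is a simple difference of accuracies; in code, checked against two of the percentages later reported in Table 1:

```python
def id_score(acc_reconstruction, acc_cross):
    """IDScore: reconstruction-stage minus crossing-stage recognition
    accuracy, in percentage points. Higher = more identity preserved."""
    return acc_reconstruction - acc_cross

# Sanity check with Table 1 values: Ours (32.86 / 10.41) and
# TransMoMo (26.33 / 12.24).
ours = id_score(32.86, 10.41)        # 22.45
transmomo = id_score(26.33, 12.24)   # 14.09
```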

(a) MC feature using t-SNE
(b) ID feature using t-SNE
Figure 7. The disentanglement results of ID and MC representations on the test dataset.

3.4. Loss functions

In our study, we employ some loss functions in the training phase: identity motion loss, motion content loss, motion reconstruction loss, and adversarial loss.

Identity motion loss. Recall that we use three input motions in the training phase, as shown in Figure 3. To ensure the disentanglement of the ID feature, we employ a triplet loss (Hoffer and Ailon, 2015) to improve the clustering of the different ID representations in the latent space, named the ID triplet loss:

L_ID-trip = max(d(p_a, p_pos) − d(p_a, p_neg) + m, 0),

where p_a, p_pos, and p_neg are the projected ID features of the anchor, a motion with the same ID label, and a motion with a different ID label, m is the margin hyperparameter, and d is the L2 distance.
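A NumPy sketch of this triplet loss on projected ID features; the margin value and the toy vectors are illustrative, not the paper's settings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with L2 distance: pull same-identity features
    together, push different-identity features at least `margin` apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # same identity: close to the anchor
n = np.array([1.0, 1.0])      # different identity: far from the anchor
loss = triplet_loss(a, p, n)  # already satisfied -> zero loss
```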

Furthermore, recall that we apply the disentanglement block again to the synthesized motion, yielding its ID feature f̂_ID. To make the ID representation of the synthesized motion similar to the raw ID feature f_ID, the mean absolute error (MAE) is employed, named the ID reconstruction loss:

L_ID-rec = E[ ||f̂_ID − f_ID||_1 ],

where the expectation is taken over the synthesized motions of the three branches and ||·||_1 is the L1 distance.

Motion content loss. Similarly, a triplet loss is designed to extract the MC information, named the MC triplet loss:

L_MC-trip = max(d(f_a, f_pos) − d(f_a, f_neg) + m, 0),

where f_a, f_pos, and f_neg are MC features and the positive sample now shares the MC label, rather than the ID label, with the anchor. Note this difference in negative and positive samples between Equation 10 and Equation 8.

Similarly, we define the motion content reconstruction loss (MC reconstruction loss) as:

L_MC-rec = E[ ||f̂_MC − f_MC||_1 ],

where f̂_MC is the MC feature of the synthesized motion.
Motion reconstruction loss. Recall that we cross the ID and MC features before the motion synthesis block, obtaining nine combinations and thus nine synthesized motions. The motion reconstruction loss is defined as:

L_rec = E_{x ∈ X} [ ||x̂ − x_gt||_1 ],

where X is the motion training set and x_gt denotes the ground-truth motion of the synthesized motion x̂.

Adversarial loss. The discriminator D is used to measure the domain discrepancy between real and synthesized motions. Following LSGAN (Mao et al., 2017), the adversarial loss is defined as:

L_adv = E_x [ (D(x) − 1)^2 ] + E_x̂ [ D(x̂)^2 ],

where x is a real motion and x̂ a synthesized motion.
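Under the least-squares (LSGAN) formulation the paper adopts (Section 4.1), the two adversarial objectives can be sketched as follows; the score values are toy numbers:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push fake (synthesized) scores to 1."""
    return np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.9, 1.0])   # discriminator scores on real motions
d_fake = np.array([0.1, 0.0])   # discriminator scores on synthesized motions
```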
Total loss. We combine the above terms to obtain the final loss:

L = λ_1 L_ID-trip + λ_2 L_ID-rec + λ_3 L_MC-trip + λ_4 L_MC-rec + λ_5 L_rec + λ_6 L_adv,

where the weights λ_i balance the loss terms.

Method                          Rec. ↑   Cross ↓  IDScore ↑   Rec. ↑   Cross ↓  IDScore ↑
TransMoMo (Yang et al., 2020)   26.33%   12.24%   14.09%      66.53%   30.82%   35.71%
LCM (Aberman et al., 2019b)     18.37%   14.49%    3.88%      44.90%   29.59%   15.31%
Ours                            32.86%   10.41%   22.45%      72.86%   29.59%   43.27%
Raw data                        34.69%     —        —         77.35%     —        —
Table 1. Results on the IDScore metric. Rec. and Cross are the gait recognition accuracies in the reconstruction and crossing stages under two recognition settings (left and right halves); IDScore = Rec. − Cross.

4. Experiments

4.1. Setup

Dataset. To correctly disentangle the ID and MC features, we collected 101 subjects' solo dance videos online from the public domain, named Chuang101, with the same choreography and background music. We manually align the videos in the temporal dimension; each video is 3 minutes and 40 seconds long at 30 fps and is resized to the same scale. To obtain the 2D keypoints, OpenPose (Cao et al., 2019) is employed. Invalid frames, in which more than a third of the 2D keypoints are missing, are removed; keypoints missing in fewer than a third of the joints are filled in from the five frames before and after. We then trim each 2D skeleton sequence into clips of 64 frames, yielding 81 dance clips per video. Finally, the 2D skeleton sequences are split into training and test sets without overlap. The training set has 91 subjects with 7371 dance clips, and the test set has 10 subjects with 810 dance clips. Furthermore, 2 videos from YouTube are collected to test the generation performance of our method.
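The clip-trimming step can be sketched as follows; the non-overlapping stride and tail handling are our assumptions (the paper reports 81 clips per video, which implies invalid frames are removed before trimming):

```python
def trim_to_clips(num_frames, clip_len=64):
    """Split a keypoint sequence of num_frames frames into non-overlapping
    clips of clip_len frames, dropping the incomplete tail."""
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]

# A full video is 3 min 40 s at 30 fps = 6600 frames before frame removal.
clips = trim_to_clips(6600)
```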

Implementation details. All 2D skeleton keypoints use the BODY_25 keypoint format of OpenPose. Our method contains about 1.9M parameters; details of the training hyperparameters are given in the supplementary material. We set the number of input frames to 64 and the number of joints to 25. The discriminator consists of several 1D temporal convolution layers. Inspired by least squares generative adversarial networks (LSGAN) (Mao et al., 2017), we employ their adversarial losses in our model to constrain the distribution of the synthesized motion to be similar to that of real motion. The specific version of the triplet loss adopted for the ID and MC triplet losses is the Batch All (BA+) triplet loss (Hermans et al., 2017).

Moreover, we use the VQ-GAN network (Esser et al., 2021), which contains about 87M parameters, to render the synthesized motions to videos. The synthesized motions are converted to pixel-level maps frame-by-frame as input. The parameters of the discriminator network are frozen for an initial number of training steps.

Figure 8. 2D skeleton reconstruction results: red skeletons are ground truth, green are our method, blue are TransMoMo, and yellow are LCM. Each column shows a different frame of the sequence.

4.2. Representation Disentanglement

The ID and MC representations in the latent space are visualized in Figure 7 to demonstrate the effectiveness of the proposed disentanglement block, i.e. its ability to disentangle the ID and MC representations. Specifically, we randomly sample motions of different subjects and obtain their ID and MC representations with our disentanglement block. Afterwards, we project the representations into 2D and 3D space using t-SNE (Van der Maaten and Hinton, 2008).

Motion content feature. The 2D projection of the MC feature is shown in Figure 7(a), where each MC label is marked with a different color. Since all 81 MC labels cannot be shown in one figure, we sample 10 labels from the test dataset, each with ten samples/subjects. Our network has learned to cluster the MC information, demonstrating that the MC information manipulates the synthesized motion consistently across subjects.

Metric      w/o id projector  w/o PG   w/o AdaIN  w/o avg  Ours (full)  Raw data
Rec. ↑          28.16%        30.20%    28.16%    28.16%     32.86%      34.69%
Cross ↓         10.00%        11.84%    10.41%    10.82%     10.41%        —
IDScore ↑       18.16%        18.36%    17.75%    17.34%     22.45%        —
Rec. ↑          69.59%        68.78%    68.98%    69.18%     72.86%      77.35%
Cross ↓         27.96%        29.18%    27.96%    30.00%     29.59%        —
IDScore ↑       41.63%        39.60%    41.02%    39.18%     43.27%        —
Table 2. Ablation study results (the upper and lower halves correspond to the two recognition settings of Table 1).

Identity motion feature. Figure 7(b) shows the t-SNE result of the ID feature, colored by ID label. Our model has a certain disentanglement capability in terms of the ID representation, although some ID information remains undecoupled; we believe it may be disentangled in higher dimensions. When we project the ID representation into 3D space with t-SNE, some identities also show a certain degree of discrimination in the z dimension; for example, 00074 (blue) and 00077 (green), as well as 00079 (light brown) and 00071 (pink), are separated in 3D space. The 3D visualization is shown in the supplementary material (id_3d.html).

4.3. Comparison

In this subsection, we quantitatively and qualitatively compare our method with other state-of-the-art methods using motion transfer results. The compared methods mainly include: TransMoMo (Yang et al., 2020), and Learning Character-Agnostic Motion (LCM) (Aberman et al., 2019b).

Quantitative Comparisons.

Since the output of the TransMoMo and LCM methods is 15 keypoints in the OpenPose BODY_25 format, for fairness of comparison we convert our output (25 keypoints in the OpenPose BODY_25 format) to 15 keypoints. It is worth mentioning that the input to the GaitGraph method is 17 keypoints in COCO format. Therefore, we train an MLP module to map the 15 keypoints in OpenPose BODY_25 format to 17 keypoints in COCO format, and feed the synthesized motions through this module. Furthermore, we directly employ GaitGraph weights pre-trained on CASIA-B (Yu et al., 2006), an open gait recognition dataset, and do not fine-tune on our test dataset.

Table 1 compares our method with the others. The recognition result on raw data is the upper bound of this experiment. Our method outperforms the other methods in IDScore, which means it can effectively transfer ID information. The LCM and TransMoMo methods do not consider the identity information in the motion and therefore have lower IDScores. In particular, the low crossing accuracy of LCM is partly caused by its poor reconstruction performance. In addition, our method has a reconstruction accuracy of 32.86%, closest to the upper bound of 34.69%, which means its reconstructions contain more identity information rather than merely cloning the motion.

Figure 9. The rendered results of the qualitative comparisons. All synthesized motions are results of crossing the target identity information. The first two rows share the same source motion with different target identity motions.

Qualitative comparisons. We sample motions from our test dataset and reconstruct them to evaluate the reconstruction performance of our method. Figure 8 visually compares our method with the others on 2D skeleton reconstruction: red skeletons are ground truth, green are our method, blue are TransMoMo, and yellow are LCM. Our method is the most similar to the ground-truth skeleton for complex motions, such as in the last column. These results demonstrate that our method can reconstruct high-quality skeletons and preserve their identity information.

In addition, to explore the cross-domain capability of our method, two in-the-wild videos are collected from YouTube. In this experiment, we combine the ID information of the in-the-wild videos with the MC information from our test dataset. We then render the synthesized motions of the state-of-the-art methods, TransMoMo (Yang et al., 2020) and Learning Character-Agnostic Motion (LCM) (Aberman et al., 2019b), and of our method to RGB videos. The results are shown in Figure 9, where the first two columns are the source and target video clips, and the first column of each method is the synthesized skeleton. Our method learns the ID information of the target video; for example, in the first row, the previous methods perform exactly the same action (raising the arms) as the source video. In addition, the target video in the last row has a right-facing view while the source video does not, and the view of our rendered video matches the target video. This aligns with our conjecture above: our method does not clone the source action but performs it with the target video's ID information.

4.4. Ablation study

We train several ablated models to study the impact of the different blocks in our method; the results are shown in Table 2. We set up four ablated models: w/o id projector removes the id projector block in the identity encoder; w/o PG removes the ToImageSpace and FromImageSpace layers in the PG block; w/o AdaIN removes the IN in the MC encoder and the AdaIN in the decoder; w/o avg removes the global average pooling in the ID encoder. Each block is beneficial, and our full method achieves the best reconstruction performance. In particular, w/o id projector achieves the lowest crossing-stage recognition accuracy, but this comes with poor reconstruction performance. The skeleton visualizations of the ablation study are shown in the supplementary material.

5. Conclusion and Future Work

To make synthesized videos more natural with respect to the properties of the person, human identity is incorporated into our proposed method. Human identities can be disentangled from motions by the disentanglement block, and can be used to synthesize motions by the motion synthesis block. With the help of our dataset, the model can be trained and identities can be easily disentangled. The evaluation with gait recognition, a human identification technology based on walking style, also shows that human identity plays an important role in video synthesis.

Our work shows that human identity can help human video synthesis, and this idea extends naturally to more attributes of a subject. If gender, body shape, age, and other attributes of a person are considered along with identity, the synthesized videos should become even more realistic.


  • K. Aberman, P. Li, D. Lischinski, O. Sorkine-Hornung, D. Cohen-Or, and B. Chen (2020a) Skeleton-aware networks for deep motion retargeting. ACM Transactions on Graphics (TOG) 39 (4), pp. 62–1. Cited by: §1, §2.
  • K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or (2019a) Deep video-based performance cloning. In Computer Graphics Forum, Vol. 38, pp. 219–233. Cited by: §1, §2.
  • K. Aberman, Y. Weng, D. Lischinski, D. Cohen-Or, and B. Chen (2020b) Unpaired motion style transfer from video to animation. ACM Transactions on Graphics (TOG) 39 (4), pp. 64–1. Cited by: §3.1.
  • K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or (2019b) Learning character-agnostic motion for motion retargeting in 2d. ACM Transactions on Graphics (TOG) 38 (4), pp. 75. Cited by: §1, §2, §2, Table 1, §4.3, §4.3.
  • Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Figure 2, §1, §3, §4.1.
  • C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5933–5942. Cited by: §1, §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: §3.1.
  • A. Cui, D. McKee, and S. Lazebnik (2021) Dressing in order: recurrent person image generation for pose transfer, virtual try-on and outfit editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14638–14647. Cited by: §2.
  • H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin (2018) Soft-gated warping-gan for pose-guided person image synthesis. Advances in neural information processing systems 31. Cited by: §2.
  • P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883. Cited by: Figure 2, §1, §4.1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The journal of machine learning research 17 (1), pp. 2096–2030. Cited by: §2.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §4.1.
  • E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International workshop on similarity-based pattern recognition, pp. 84–92. Cited by: §2, §3.4.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pp. 1501–1510. Cited by: §1.
  • Z. Huang, X. Han, J. Xu, and T. Zhang (2021) Few-shot human motion transfer by personalized geometry and texture modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2297–2306. Cited by: §1.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §3.2.
  • Y. Li, C. Huang, and C. C. Loy (2019) Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3693–3702. Cited by: §1.
  • W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao (2019) Liquid warping gan: a unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5904–5913. Cited by: §1, §2.
  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. Advances in neural information processing systems 30. Cited by: §2.
  • A. Mallya, T. Wang, K. Sapra, and M. Liu (2020) World-consistent video-to-video synthesis. In European Conference on Computer Vision, pp. 359–378. Cited by: §1.
  • X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794–2802. Cited by: §4.1.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1.
  • A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe (2018) Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3408–3416. Cited by: §1.
  • Y. Su, J. Zhang, M. Xing, W. Peng, and Z. Feng (2021) Disentangling style on dynamic aligned poses for individual identification. Ad Hoc Networks 113, pp. 102384. Cited by: §1.
  • T. Teepe, A. Khan, J. Gilg, F. Herzog, S. Hörmann, and G. Rigoll (2021) GaitGraph: graph convolutional network for skeleton-based gait recognition. In 2021 IEEE International Conference on Image Processing (ICIP), pp. 2314–2318. Cited by: §3.3.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: §4.2.
  • T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713. Cited by: §1.
  • T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. arXiv preprint arXiv:1808.06601. Cited by: §1.
  • F. Xie, G. Irie, and T. Matsubayashi (2021) Disentangling subject-dependent/-independent representations for 2d motion retargeting. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4200–4204. Cited by: §1, §2, §2.
  • Z. Yang, W. Zhu, W. Wu, C. Qian, Q. Zhou, B. Zhou, and C. C. Loy (2020) Transmomo: invariance-driven unsupervised video motion retargeting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5306–5315. Cited by: §1, §2, §2, Table 1, §4.3, §4.3.
  • Y. Yoshiyasu, R. Sagawa, K. Ayusawa, and A. Murai (2018) Skeleton transformer networks: 3d human pose and skinned mesh from single rgb image. In Asian Conference on Computer Vision, pp. 485–500. Cited by: §1.
  • S. Yu, D. Tan, and T. Tan (2006) A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 4, pp. 441–444. Cited by: §4.3.
  • W. Zhao, Q. Xie, Y. Ma, Y. Liu, and S. Xiong (2020) Pose guided person image generation based on pose skeleton sequence and 3d convolution. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 1561–1565. Cited by: §2.
  • Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. Berg (2019) Dance dance generation: motion transfer for internet videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2347–2356. Cited by: §1.