Do as we do: Multiple Person Video-To-Video Transfer

04/10/2021 ∙ by Mickael Cormier, et al. ∙ KIT Fraunhofer

Our goal is to transfer the motion of real people from a source video to a target video with realistic results. While recent advances have significantly improved image-to-image translation, only a few works account for body motion and temporal consistency. However, those focus only on video re-targeting for single actors. In this work, we propose a marker-less approach for multiple-person video-to-video transfer using pose as an intermediate representation. Given a source video with multiple persons dancing or working out, our method transfers the body motion of all actors to a new set of actors in a different video. In contrast to recent "do as I do" methods, we focus specifically on transferring multiple persons at the same time and tackle the related identity switch problem. Our method is able to convincingly transfer body motion to the target video while preserving specific features of the target video, such as feet touching the floor and the relative position of the actors. The evaluation is performed with visual quality and appearance metrics using publicly available videos with the permission of their owners.




I Introduction

Human motion analysis is an important topic in the computer vision community. Recent advances in human-related application systems bring new human-centered challenges such as driver behavior recognition [25], crowd pose estimation for crowd motion analysis [15], or human action recognition in the dark [35]. However, the CNNs trained for such higher-level tasks require large amounts of annotated data, which is often challenging to collect and properly annotate. Therefore, synthetic photo-realistic data is often considered a cost-effective way to augment existing datasets [14, 12, 21]. In this work, we introduce a method for synthesizing realistic-looking videos of multiple persons dancing side by side and switching places, based on real input and target videos. Building on the Everybody Dance Now work [3] and similar to recent video-to-video translation works [41, 16, 28], we first extract pose skeletons using a state-of-the-art method [2, 1, 29, 34]. Since those works only address the video-to-video translation problem for a single person, we extend the simple yet efficient method from [3] to the multiple-person transfer problem. We collect online videos of dance workouts with different numbers of persons and perform an ablation study depending on the number of persons. Furthermore, we improve the face generation network by using more accurate face landmarks. Finally, this scenario brings new challenges regarding the pose transfer of each individual in the group. The normalization step is adapted in order to accurately map each subject from the input video to its counterpart in the target. In addition, we address the problem of persons switching places by adapting the keypoint correspondence network from [30] for tracking each individual in the video.

II Related Work

Recent breakthroughs in the field of image-to-image translation came with the introduction of conditional GANs for paired and unpaired images [19, 42]. Those works were rapidly followed by numerous methods for image and video manipulation. In this section, we review related work on image-to-image translation and appearance transfer.

Wang et al. [32] presented a method to generate high-resolution results of 2048×1024 pixels from semantic label maps using a perceptual loss, a coarse-to-fine generator and a multi-scale discriminator architecture. An approach was proposed in [22] to generate high-resolution images using semantic segmentation and texture prediction. Various generative adversarial networks were proposed to increase the visual quality of generated images using labels and texts [43, 39, 36, 27]. Liu et al. [23] proposed an encoder-decoder for pose-guided high-resolution appearance transfer to a target pose. They use local descriptors by means of a progressive local perceptual loss and local discriminators at the highest resolution, followed by training of the autoencoder architecture. Zanfir et al. [38] successfully transferred the appearance of a person in source images to a person in target images while preserving the body outline of the target person, using 3D pose as an intermediate representation. Kundu et al. [20] propose a recurrent network for the long-term synthesis of 3D person interactions. Attribute-Decomposed GAN [26] introduces a generative model for controllable person image synthesis, which generates desired human attributes such as pose, head, upper clothes and pants.

Efros et al. [11] transfer videos based on predicted skeletons, introducing the concepts of “Do as I do”, where the images of a target person are generated according to a driver's movement, and “Do as I say”, where images of target persons are produced based on imposed commands. More recently, Zhou et al. [41] trained a model on a relatively long video of a target person, which made it possible to transfer any movement from a reference video to the target person while preserving the target person's appearance. This model receives a frame of the target person and a pose from the reference as input and generates an image of the target person in that pose as output. Wang et al. [31] proposed a few-shot vid2vid framework for generating images of targets, including humans or scenes, that have never been seen previously. This model generalizes the poses of the reference video to a few example images of the target. Liu et al. [24] proposed a generative adversarial learning-based approach to upper body video synthesis. They transfer body and facial landmarks of the source person to the target person and then normalize the upper body landmarks to generate facial features in the target video with spatio-temporal smoothing. Chan et al. [3] proposed a similar approach for motion transfer from a source video to a target video. Their approach consists of two steps, pose encoding and normalization, followed by a pose-to-video translation. Their poses are normalized by evaluating the ankle positions and height of the subjects in order to adapt the size of the source to the target. Their pose-to-video translation uses a three-step coarse-to-fine approach, with one step explicitly addressing the quality of the generated face, and temporal smoothing.
Videos for unseen in-the-wild poses are generated in [28] using data augmentation and unpaired learning to improve the generalization of the system and minimize the domain gap between testing and training pose sequences. Gomes et al. [16] account for pose, shape, appearance, and motion features of the moving target. Finally, a graph convolutional network is proposed in [13] for generating dance videos from audio information to create natural motions that preserve the key movements of different music styles. Nevertheless, these methods focus only on the transfer of a single subject at a time.

Fig. 1: An overview of our method. First, a pose detector is used to detect the pose of each actor. For each person, a pose stick figure is generated with a distinct set of colors which is kept consistent over time through pose tracking. Those are then normalized into stick figures for the target domain. Finally, the trained generator is applied.

III Method

Starting from a video with a fixed number of source persons and a video with the same number of targets, we aim to generate a new synthetic version of the target video in which the persons now perform the movement seen in the source. Following [3], the pipeline is divided into three stages: pose detection, global pose normalization, and mapping from normalized skeletons to the target subjects.

Pose Encoding   Since the focus of our work lies on generalizing pose transfer, we use a pre-trained state-of-the-art model for pose estimation [2, 1, 29, 34] to produce accurate pose estimates for the input frames. We then generate a colored pose stick figure for each person. In order to reliably learn the appearance of each individual using the poses as an intermediate representation, we notice empirically that the stick figures need clearly distinct colors for each person. If the colors of the body parts of two persons are too similar, the model tends to average both appearances and produces less realistic features.
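As an illustration of the distinct-color requirement, the following minimal sketch builds one color palette per person by spreading hues evenly over the color wheel. The function name `person_colors`, the hue spacing, and the per-limb brightness ramp are assumptions made for this example and are not taken from the original pipeline.

```python
import colorsys


def person_colors(num_persons, num_limbs):
    """Return one list of RGB limb colors per person (hypothetical helper).

    Each person gets a single hue, evenly spaced from the other persons,
    so that limb colors of two persons can never be confused.
    """
    palettes = []
    for person in range(num_persons):
        hue = person / num_persons  # evenly spaced hues, one per person
        palette = []
        for limb in range(num_limbs):
            # Same hue for the whole person, brightness varies per limb.
            value = 0.5 + 0.5 * (limb + 1) / num_limbs
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            palette.append((round(r * 255), round(g * 255), round(b * 255)))
        palettes.append(palette)
    return palettes
```

With three persons, this yields red, green and blue families, so no limb color is shared between two persons.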

Pose Normalization  Since target and source videos differ in environment and camera settings, and show people with different physical appearances such as height and shape, a normalization step between the input and target subjects is needed in order to produce more realistic frames. For instance, in Figure 2, the horizon of the source video is higher than the horizon of the target video, which causes the people in the generated video to appear above the horizon: their feet are not located on the floor. Moreover, if a person in the source video is taller than the corresponding person in the target video, the subject in the generated frame is abnormally tall. Similarly, if the distance between the people and the camera is shorter in the source video than in the target video, the people in the generated video are disproportionately large.
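A simplified per-person version of such a normalization can be sketched as a scale and a translation: the scale matches the body height of the source subject to its target counterpart, and the translation places the ankles on the target's floor position. This is only a sketch under these assumptions; the function `normalize_pose` and the `ankle_x`, `ankle_y` and `height` statistics are hypothetical names, and the transform actually used here and in [3] may differ.

```python
def normalize_pose(keypoints, src_stats, tgt_stats):
    """Map one person's source keypoints into the target frame (sketch).

    keypoints: list of (x, y) pixel coordinates of a single person.
    src_stats, tgt_stats: dicts with 'ankle_x', 'ankle_y' (ground contact
    point) and 'height' (head-to-ankle extent in pixels), assumed to be
    measured beforehand for the source subject and its target counterpart.
    """
    scale = tgt_stats["height"] / src_stats["height"]
    normalized = []
    for x, y in keypoints:
        # Scale around the source ankle, then move to the target ankle.
        nx = tgt_stats["ankle_x"] + scale * (x - src_stats["ankle_x"])
        ny = tgt_stats["ankle_y"] + scale * (y - src_stats["ankle_y"])
        normalized.append((nx, ny))
    return normalized
```

Applying the transform separately to each source/target pair keeps every subject's feet on the floor of the target scene, independently of the other subjects.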

Changes of Place between Source Subjects   If the source subjects change places, an identity switch occurs in the target video, meaning that an input person swaps appearance with another after changing places. To address this, each subject needs to be tracked over time. In this case, poses are tracked before encoding using keypoint correspondences as proposed in [30], with scenario-specific adjustments. First, we extend the model to track all 25 available body landmarks instead of only 17. Although we could also use face and hand landmarks, those are not predicted as reliably, so we decide to use body landmarks only. Since the number of subjects in each video remains the same, we drop frames where more poses are detected than expected. Typically, when more poses are predicted than there are people in a frame, two poses are assigned to one person, which can result in identity switches. For better keypoint heatmap accuracy, we train a keypoint correspondence network with input images of size  pixels instead of  pixels. As the persons form a closed set, no similarity threshold is used and poses are always assigned in a greedy fashion to the ids of the closed set.
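The closed-set assignment can be sketched as follows. The sketch assumes that a similarity score between every detected pose and every tracked person is already available (in [30] it is derived from keypoint correspondences); the function `assign_ids` and the layout of the `similarity` matrix are hypothetical.

```python
def assign_ids(similarity, num_persons):
    """Greedily assign detected poses to a closed set of person ids (sketch).

    similarity[i][j] is the similarity between detected pose i and tracked
    person j. Returns a dict {pose_index: person_id}, or None when the
    frame should be dropped because more poses than persons were detected.
    """
    if len(similarity) > num_persons:
        return None  # too many poses: drop the frame
    # All (score, pose, person) triples, best score first.
    candidates = sorted(
        ((similarity[i][j], i, j)
         for i in range(len(similarity))
         for j in range(num_persons)),
        reverse=True,
    )
    assignment, used_ids = {}, set()
    for _, pose, person in candidates:
        if pose in assignment or person in used_ids:
            continue
        # No similarity threshold: every pose receives an id of the set.
        assignment[pose] = person
        used_ids.add(person)
    return assignment
```

Because no threshold is applied, every detected pose always receives one of the known ids, which is only reasonable when the set of persons is fixed for the whole video.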

Pose to Video Translation   After preprocessing, the tracked skeletons are used as input to our model. An overview of our method is given in Figure 1. Our model is based on a conditional GAN setup, trained in three stages: global, local and faces. For more details, we refer to [3].

Fig. 2: Top: input images. Bottom left: a generated image before normalizing the source keypoints. Bottom right: the same frame after normalizing the source keypoints. In this case the normalization step allows our model to generate the feet of the subjects on the floor.

IV Experiments

IV-A Setup

A separate model is trained on collected frames for each training video of 2, 3, 4 and 5 people videos at respectively. This is followed by an evaluation using unseen test frames in order to evaluate the efficiency of our approach. Each model is trained separately in three stages. In the first stage, a global generator is used for training the model. In the second stage, the model is refined with a local enhancer generator, and finally FaceGAN is used in the last stage. For our experiments we use the Adam optimizer with learning and . For all experiments the batch size is set to . Also we set . We trained our models for about iterations which required totally about hours on an RTX 2080Ti. As a baseline, we reproduce the results from [3] for single-to-single person motion transfer.

Evaluation Metrics

We measure the quality of the synthesized frames using four metrics: 1) Peak Signal-to-Noise Ratio (PSNR), which measures the pixel-level similarity between generated images and ground truth, 2) Structural Similarity (SSIM) [33], which compares two images in terms of luminance, contrast, and structure, 3) Learned Perceptual Image Patch Similarity (LPIPS) [40], which measures the perceptual similarity between synthesized images and ground truth, and 4) Fréchet Inception Distance (FID) [17], which measures the quality of the frames of generated videos. Higher values are better for 1) and 2), while lower values are better for 3) and 4).
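As a concrete reference for the simplest of these metrics, PSNR can be computed from the mean squared error between a generated frame and its ground truth. The sketch below uses the standard definition for 8-bit images; representing images as flat lists of pixel values and the function name `psnr` are assumptions made for brevity, not the evaluation code used for the reported results.

```python
import math


def psnr(generated, reference, max_value=255.0):
    """PSNR in dB between two images given as flat lists of pixel values."""
    if len(generated) != len(reference):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means a smaller pixel-wise error, which is why this metric is to be maximized, in contrast to the distance-based LPIPS and FID.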

IV-B Quantitative Results

We compare our approach quantitatively for different numbers of people. We perform the evaluation on held-out test data and report our results in Table I. We first report results for single-to-single transfer as a baseline. For this model almost four times more data is available for training than for multiple-person transfer, which may partially explain the performance gap between the baseline and our models. Overall, our models provide satisfying results given the few constraints we chose to apply when collecting the video pairs. Surprisingly, our 5-to-5 model performs especially well on the FID and LPIPS metrics, which means the results should be more convincing to the human eye. In order to refine our results, and since acceptable face keypoints for the 3-to-3 and 5-to-5 videos were available, we added 60 supplemental face landmarks to our model and report our findings in Table II. This addition brought a strong improvement in terms of FID to the 3-to-3 model and a smaller improvement to the 5-to-5 model. However, we emphasize that these improvements rely largely on the quality of the pose estimator’s predictions. Therefore, such additional facial landmarks may not always be available or of sufficient quality.

TABLE I: 1 Person with 23,000 Frames as in [3]. The other models are trained with around 5,600 frames due to unavailability of more frames.
TABLE II: Our best models are further optimized using 68 face keypoints instead of only eight.

IV-C Qualitative Results

Transfer results for multiple source and target subjects can be seen in the teaser figure and Figure 3. The advantage of using target normalization can be clearly seen in the teaser figure, where the input subjects are shifted to the left as in the learned target video. This property is important since the target scene could contain physical objects onto which the person would otherwise mistakenly be projected. We show results for more difficult face poses in Figure 3, with one example in which a turning face is handled properly and another in which our model struggles. In the latter case, the turning head of the dancer in the input video has never been seen in a similar fashion during training for the target subject. Therefore, our model cannot handle the projection of the back of the head and produces a strong artifact instead of generating hair. Furthermore, failure cases for previously unseen extreme poses are illustrated in Figure 6. As shown in Table I, as the number of subjects for transfer grows, the performance of the model generally declines, which is to be expected when the difficulty of the task increases without altering the parameters of the model. However, the 5-to-5 model contradicts this trend. We show qualitative results for this model in Figure 4. Those results, and particularly the faces, are convincingly smooth. We notice the similarities in clothing between the target subjects. This setup not only boosts the performance metric for this video due to the clothing, but also allows better performance for the whole scene. We argue that this is related to the limited number of parameters available to our model: fewer parameters are required to learn the appearances of the lower bodies, so more parameters are available to accurately represent faces and upper bodies.
Future work could progressively increase the number of subjects and investigate the model size required to reach optimal transfer performance. Finally, identity switches are handled as shown in Figure 5. However, such tracking is highly dependent on the quality of the pose estimator. Therefore, for input scenes in which a subject disappears for a long time behind another subject, the track may be lost, requiring a new mapping from source to target. While our results suggest that it is feasible to convincingly transfer multiple persons at the same time, we find that the quality of the synthesized faces still needs improvement. Furthermore, extreme arm or face poses remain challenging.

Fig. 3: Given a source video [37] (top) and two different target videos from [6] (middle) and [8] (bottom), our approach transfers the movements of the people in the top row to the people in the middle and bottom rows. While the middle row handles the turning face (right), the bottom row cannot manage the face pose and produces a dark artifact in the center of the face.
Fig. 4: Results for our 5-to-5 person model. Input appearance from [5] (top) followed by our results (bottom) on [4]. Due to relatively uniform target clothes, the results are particularly smooth.
Fig. 5: Transfer results on [9] (bottom) with input targets from [10] (top) switching places without identity switch.
Fig. 6: Failure cases. Input appearance from [5] (top) followed by our results (bottom) on [4]. If our model has seen only relatively small motions for training, it can’t handle extreme poses.

V Conclusion

We extended and generalized the concept of video human motion transfer to multiple persons using a relatively simple yet efficient model. We address the pose normalization of multiple subjects and potential identity switches when different actors change places. Our method, while using only a few thousand frames, delivers high-quality videos of a target group of persons following the visual instructions of another group, even generating convincing shadows. However, our results are strongly limited by the available data for the target group, which is difficult to collect. Furthermore, input and target videos are required to share a similar perspective. Future work could focus on the training data and on extracting even more information such as semantic masks, dense poses or clothing information. A potential application is to create photo-realistic avatars from synthesized poses in order to efficiently render individuals anonymous and thereby facilitate the generation of new realistic data in the target domain.


  • [1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §III.
  • [2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §I, §III.
  • [3] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In IEEE International Conference on Computer Vision (ICCV), Cited by: §I, §II, §III, §III, §IV-A, TABLE I.
  • [4] B. Y. channel (2019-07) Bollywood workout. Note: Cited by: Fig. 4, Fig. 6.
  • [5] S. F. F. Y. channel (2019-08-19) Chica. chung ha. Note: Cited by: Fig. 4, Fig. 6.
  • [6] S. F. F. Y. channel (2020-01-09) Ring ring. Note: Cited by: Fig. 3.
  • [7] S. F. F. Y. channel (2020-01-16) Shape of you. Note: Cited by: Do as we do: Multiple Person Video-To-Video Transfer.
  • [8] S. F. F. Y. channel (2020-01-24) 4 week diet challenge- day 16. Note: Cited by: Fig. 3.
  • [9] S. F. F. Y. channel (2020-08-11) Circus. Note: Cited by: Fig. 5.
  • [10] K. dance cover YouTube channel (2017-11-12) TWICE-cheer up. Note: Cited by: Fig. 5.
  • [11] A. A. Efros, A. C. Berg, G. Mori, and J. Malik (2003) Recognizing action at a distance. In IEEE International Conference on Computer Vision, Nice, France, pp. 726–733. Cited by: §II.
  • [12] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara (2018) Learning to detect and track visible and occluded body joints in a virtual world. In European Conference on Computer Vision (ECCV), Cited by: §I.
  • [13] J. P. Ferreira, T. M. Coutinho, T. L. Gomes, J. F. Neto, R. Azevedo, R. Martins, and E. R. Nascimento (2021) Learning to dance: a graph convolutional adversarial network to generate realistic dance motions from audio. Computers & Graphics 94, pp. 11 – 21. External Links: ISSN 0097-8493, Document Cited by: §II.
  • [14] E. Francis, K. Theodora, H. Alexander, and L. Bastian (2017) Exploring spatial context for 3d semantic segmentation of point clouds. In IEEE International Conference on Computer Vision, 3DRMS Workshop, ICCV, Cited by: §I.
  • [15] T. Golda, T. Kalb, A. Schumann, and J. Beyerer (2019) Human Pose Estimation for Real-World Crowded Scenarios. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Cited by: §I.
  • [16] T. L. Gomes, R. Martins, J. Ferreira, and E. R. Nascimento (2020) Do as i do: transferring human motion and appearance between monocular videos with spatial and temporal constraints. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 3355–3364. External Links: Document Cited by: §I, §II.
  • [17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500. Cited by: §IV-A.
  • [18] incorporesony YouTube channel (2019-07-27) GIRLS like you. Note: Cited by: Do as we do: Multiple Person Video-To-Video Transfer.
  • [19] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017-07) Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538604571, Link, Document Cited by: §II.
  • [20] J. N. Kundu, H. Buckchash, P. Mandikal, R. M. V, A. Jamkhandi, and R. V. Babu (2020) Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 2713–2722. External Links: Document Cited by: §II.
  • [21] I. Kviatkovsky, N. Bhonker, and G. Medioni (2020) From real to synthetic and back: synthesizing training data for multi-person scene understanding. External Links: 2006.02110 Cited by: §I.
  • [22] C. Lassner, G. Pons-Moll, and P. V. Gehler (2017) A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 853–862. Cited by: §II.
  • [23] J. Liu, H. Liu, M. Chiu, Y. Tai, and C. Tang (2020) Pose-guided high-resolution appearance transfer via progressive training. arXiv preprint arXiv:2008.11898. Cited by: §II.
  • [24] Z. Liu, H. Hu, Z. Wang, K. Wang, J. Bai, and S. Lian (2019) Video synthesis of human upper body with realistic face. In 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 200–202. Cited by: §II.
  • [25] M. Martin, A. Roitberg, M. Haurilet, M. Horne, S. Reiß, M. Voit, and R. Stiefelhagen (2019-10) Drive&Act: a multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §I.
  • [26] Y. Men, Y. Mao, Y. Jiang, W. Ma, and Z. Lian (2020) Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II.
  • [27] A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pp. 2642–2651. Cited by: §II.
  • [28] J. Ren, M. Chai, S. Tulyakov, C. Fang, X. Shen, and J. Yang (2020) Human motion transfer from poses in the wild. In Computer Vision – ECCV 2020 Workshops, A. Bartoli and A. Fusiello (Eds.), Cham, pp. 262–279. Cited by: §I, §II.
  • [29] T. Simon, H. Joo, I. Matthews, and Y. Sheikh (2017) Hand keypoint detection in single images using multiview bootstrapping. In CVPR, Cited by: §I, §III.
  • [30] R. Umer, A. Doering, B. Leibe, and J. Gall (2020) Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. arXiv preprint arXiv:2004.12652. Cited by: §I, §III.
  • [31] T. Wang, M. Liu, A. Tao, G. Liu, B. Catanzaro, and J. Kautz (2019) Few-shot video-to-video synthesis. Advances in Neural Information Processing Systems 32, pp. 5013–5024. Cited by: §II.
  • [32] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018-06) High-resolution image synthesis and semantic manipulation with conditional gans. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Link, Document Cited by: §II.
  • [33] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §IV-A.
  • [34] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In CVPR, Cited by: §I, §III.
  • [35] Y. Xu, J. Yang, H. Cao, K. Mao, J. Yin, and S. See (2020) ARID: a new dataset for recognizing action in the dark. External Links: 2006.03876 Cited by: §I.
  • [36] X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: §II.
  • [37] B. YouTube (2019-02-10) COCA cola. Note: Cited by: Fig. 3.
  • [38] M. Zanfir, A. Popa, A. Zanfir, and C. Sminchisescu (2018) Human appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5391–5399. Cited by: §II.
  • [39] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 5907–5915. Cited by: §II.
  • [40] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595. Cited by: §IV-A.
  • [41] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. Berg (2019) Dance dance generation: motion transfer for internet videos. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §I, §II.
  • [42] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §II.
  • [43] S. Zhu, R. Urtasun, S. Fidler, D. Lin, and C. Change Loy (2017) Be your own prada: fashion synthesis with structural coherence. In Proceedings of the IEEE international conference on computer vision, pp. 1680–1688. Cited by: §II.