Retargeting human body motion — transferring motion from a “driving” image or video of one subject (the source) to another subject (the target), using one or more reference images of the target subject in an arbitrary pose — has received a great deal of attention in recent years, due to numerous practical and entertaining applications in content generation [wang2019fewshotvid2vid, Chan_2019_ICCV]. Such applications include transferring sophisticated athletic techniques or dancing performances to untrained celebrities for special effects for cinema and television; creating amusing performances for one’s friends or acquaintances for sheer entertainment; and creating plausible motion sequences from photos or videos depicting famous and important political figures (including historical figures who may no longer be alive to perform such actions) for the creation of plausible full-body “deepfake” videos. However, retaining the target subject’s identity while rendering them in novel, unseen poses is highly challenging, and the state-of-the-art is still far from plausible.
Many approaches to this task learn to render a specific person [aberman2019deep, chai2020neural, Chan_2019_ICCV, gafni2019vid2game, isola2017image, knoche2020reposing, ren2020human, wang2018high, wang2018video, wei2020gac, yang2020transmomo, zhou2019dance] conditioned on the desired pose. This requires a large number of training frames of that person, and incurs substantial training time that must be repeated for each new subject. By contrast, in the few-shot setting addressed in this work, only a few reference images of the target are available, and video generation from those images should be fast (i.e., requiring no subject-specific training). To overcome the lack of data for a given subject, many other techniques [bhatnagar2019multi, lazova2019360, li2019dense, liu2019liquid, ma2020learning, mir2020learning, neverova2018dense, patel2020tailornet] leverage existing human body models [alp2018densepose, kanazawa2018end, loper2015smpl] to construct an approximate representation of the subject that can then be manipulated and rendered. While the 3D nature of these representations often leads to improved performance over their purely 2D counterparts [balakrishnan2018synthesizing, ma2017pose, ma2018disentangled, ren2020deep-cv, si2018multistage], their explicit nature, which is limited in its ability to capture salient details with standard human body models, also leads to reduced modeling power and therefore fidelity. Large variations in the clothing (e.g., dresses or jackets that do not conform to the body shape), body type, or hair of the source and target subjects, for example, cannot easily be represented with standard models that only represent the body itself.
In this work we attain more flexible and expressive modeling power by exploiting a representation that allows for 3D modeling and manipulation, and yet is fully implicit, i.e., it can be fully learned, even though we use no explicit ground-truth 3D information such as meshes or voxel grids as supervision. Recently, just such a representation, the Transformable Bottleneck Network (TBN) [olszewski2019transformable], has been shown to produce excellent results on novel view synthesis of rigid objects. In that work, image content is encoded into an implicit volumetric representation (the “bottleneck”), in which each encoded feature corresponds to the local structure and appearance of the corresponding spatial region depicted in the image. However, while it requires no 3D supervision, it is trained on multi-view datasets of rigid objects to produce implicit volumes that can then be rigidly transformed to produce novel views of the depicted content corresponding to changes in viewpoint.
We build upon this approach to address the challenge of performing motion retargeting for non-rigid humans (for which multiple images of a given subject may be available, but in dramatically different poses). In doing so, we address several challenges: how to aggregate volumetric features from images with changes in camera and body pose, and how to learn this aggregation from videos without explicit 3D or camera pose supervision. With such an implicit representation, to synthesize a novel pose, we achieve non-rigid implicit volume aggregation and manipulation by learning a 3D flow to resample the 3D body model from input images captured with the subject performing various poses or under different viewpoints. To allow for expressing large-scale motion while retaining fine-grained details in the synthesized images, we propose a multi-resolution scheme in which image content is encoded, transformed and aggregated into bottlenecks of different resolution scales.
As we focus on transferring motion between human subjects, our network pipeline is designed and trained specifically to extract and manipulate the foreground of the encoded images, with a separate network for extracting and compositing the background with the synthesized result. Our training scheme employs techniques and loss functions designed for the challenging task of producing plausible motion retargeting without 3D supervision or the use of explicit 3D models, including specialized training techniques that teach the network to synthesize plausible results when no ground-truth images corresponding to the applied spatial manipulation are available. We thus avoid the limitations of explicit body representations [li2019dense, liu2019liquid], which may lead to unrealistic results due to the limited reconstruction accuracy and mesh precision. Furthermore, our approach allows for learning directly from real 2D images and videos without requiring the tedious and cumbersome collection of copious high-fidelity 3D data [lazova2019360, neverova2018dense, grigorev2019coordinate].
In our experiments, we demonstrate that our approach qualitatively and quantitatively outperforms state-of-the-art approaches to human motion transfer, despite the few images used for inference, and even allows for plausible motion transfer when using only a single image of the target.
In summary, our key contributions are:
A novel set of neural network architectures for performing implicit volumetric human motion retargeting, which exploits the power of 3D human motion modeling while avoiding the limitations of standard 3D human body modeling techniques.
A framework to train these networks to attain high-fidelity human motion transfer using only a few example images of the target subject performing various poses, without requiring target-specific training.
Evaluations demonstrating our few-shot approach outperforms state-of-the-art alternatives both quantitatively and qualitatively, even those requiring training models for each new subject with substantial training data.
2 Related Work
Video-to-Video Generation. Existing works on video-to-video generation can synthesize high-quality human motion videos using conditional image generation [chai2020neural, isola2017image, wang2018high]. Chan et al. [Chan_2019_ICCV] apply pre-computed human poses from driving videos as input for novel view and pose generation of a target person. Along this line, several works improve the synthesis quality through additional input signals [gafni2019vid2game], pose augmentation and pre-processing [ren2020human, wei2020gac, yang2020transmomo, zhou2019dance], and temporal coherence [aberman2019deep, wang2018video]. However, a long recorded video and person-specific training are required for each target person, limiting the scalability of such methods. Our work targets a few-shot scenario, with just a handful of source images available, as discussed below.
Few-Shot Motion Retargeting. To address scalability, others train generic models that can use just a few images of the target subject in arbitrary poses at test time to synthesize novel images with given poses. While high-quality results have been achieved for face animation [gu2019flnet, ha2019marionette, qian2019make, zakharov2020fast, zakharov2019few], animating bodies remains challenging. Some methods adapt the video-to-video approach to a few-shot setting, using an extra network to generate identity-specific weights for the image generation network [wang2019fewshotvid2vid], or adapting a pre-trained network to new subjects [lee2019metapix]. Another class of methods trains networks conditioned on an image of the source identity and an explicit representation of the target pose [ma2017pose, ma2018disentangled, esser2018variational]. More recent works exploit multi-stage networks to improve quality [si2018multistage, ilyes2018pose, dong2018soft], in particular using 2D spatial transformers [balakrishnan2018synthesizing, ren2020deep, ren2020deep-cv] or deformation [siarohin2018deformable] to warp the source image into the target pose, or synthesizing images using attention [tang2020xinggan, zhu2019progressive]. However, such purely 2D methods struggle to capture the complex motions generated by 3D shapes and transformations. Several works use explicit 3D representations, exploiting off-the-shelf human pose [alp2018densepose] and shape [kanazawa2018end] inference networks and body meshes [loper2015smpl]. DensePose [alp2018densepose] is used to unwrap appearance from the source image(s) into a canonical texture map, which is then inpainted and re-rendered in the target pose [lazova2019360, neverova2018dense, grigorev2019coordinate]. Other works use 3D meshes to compute 2D flow fields to warp features from source to target images [li2019dense, liu2019liquid], with shallow encoders and decoders at each end to reduce warping artifacts.
Finally, several works use body models and standard rendering techniques for clothing transfer [bhatnagar2019multi, ma2020learning, mir2020learning, patel2020tailornet]. However, such explicit representations generally produce synthetic-looking results. We use an implicit 3D representation on this task for the first time, unlocking several benefits: the motion representation is more flexible, no 3D supervision or prior is needed, the image decoder can refine the output easily, and multiple reference images can be used to improve synthesis quality.
Unsupervised Motion Retargeting. Unsupervised methods learn retargeting purely from videos [Siarohin_2019_NeurIPS, lathuiliere2020motion, kim2019unsupervised, song2019unsupervised, pumarola2018unsupervised], forgoing motion supervision and therefore also class-specific motion representations. Siarohin et al. learn unsupervised keypoints in order to warp object parts into novel poses [siarohin2019animating, Siarohin_2019_NeurIPS]. Lorenz et al. [lorenz2019unsupervised] show that body parts can also be represented by unsupervised learning, which also helps to disentangle body pose and shape [esser2019unsupervised]. Nevertheless, without the benefit of explicit pose information from a human detection model, these methods struggle to generate good results for challenging driving poses. We therefore use a high-quality, off-the-shelf pose detection model [cao2018openpose] to facilitate the generation process.
Implicit and Volumetric 3D Representations. The flexibility of implicit 3D content representations [10.1145/1015706.1015816, 10.1145/383259.383266, survey_implicit_2007], which obviate the use of explicit surfaces such as discrete triangle meshes, makes them amenable to recent learning-based approaches that extract the information needed to render or reconstruct scenes directly from images. In [sitzmann2019deepvoxels], an object-specific network is trained from many calibrated images to extract deep features embedded in a 3D grid that are used to synthesize novel views of the object. Mildenhall et al. [mildenhall2020nerf] perform novel view synthesis on complex scenes by aggregating samples from a trained network which, given a point in space and a view direction, provides an opacity and color value indicating the radiance towards the viewer from this point. This approach was later extended to handle unconstrained, large-scale scenes [martinbrualla2020nerf], or to use multiple radiance field functions, stored in sparse voxel fields, to better capture detailed scenes with large empty regions [liu2020neural]. However, these approaches share the limitation that the networks are trained on many images with known camera poses for a specific scene, and thus cannot be used to perform 3D reconstruction or novel view synthesis from new images without re-training. Furthermore, many images from different viewpoints are required to train these networks to learn to sufficiently render points between these images. Other recent efforts use 3D feature grids [kar2017lsm] or pixel-wise implicit functions [saito2019pifu, saito2020pifuhd, li2020monocular] to infer dynamic shapes from one or more images, but these require synthetic ground-truth 3D geometry for supervision, which limits their applicability to unconstrained real-world conditions. In [olszewski2019transformable], an encoder-decoder framework is used to extract a volumetric implicit representation of image content that is spatially disentangled in a manner that allows for novel view synthesis, 3D reconstruction and spatial manipulation of the encoded image content. However, while it allows for performing non-rigid manipulations of the image content after training, it must be trained using a multi-view dataset of rigid objects, making it unsuitable for human motion retargeting.
In our work we develop an enhanced implicit volumetric representation and multi-view fusion techniques to address these concerns.
At the heart of our approach to few-shot human body motion retargeting is an implicit volumetric representation of the shape and appearance of the subject depicted in the input images. In the following sections, we first describe this representation, then outline the network architectures employing it to perform foreground motion transfer and composition into the background environment. Finally, we discuss the training techniques and loss functions employed to train these networks to perform flexible motion retargeting while still achieving high-fidelity image synthesis results. The overall architecture is illustrated in Figure 1.
3.1 Implicit Volumetric Representation
Given an input image, our encoder extracts a feature volume or “bottleneck”, a volumetric grid with depth, height, and width dimensions that is used to represent the image content. Each cell in the grid contains a feature vector describing the structure and appearance of the local image content corresponding to that region of the volume. The depth dimension of this volume is aligned with the view direction of the camera, while the height and width correspond to those of the input image. The feature volume may be passed directly to the image decoder to synthesize an image corresponding to the input (in which case it acts as an auto-encoder), or it may be spatially manipulated in a manner corresponding to the desired transformation of the image content. Such manipulations include rigid transformations corresponding to camera viewpoint changes, or non-rigid manipulations corresponding to the subject’s body motion.
Given the encoded bottleneck and a dense flow field, which stores for each cell of the transformed bottleneck a 3D coordinate pointing into the original volume and thereby encodes the mapping from the source pose to the target pose, we employ a sampling layer that performs trilinear sampling to produce the transformed bottleneck. This flexible sampling mechanism enables a wide variety of spatial manipulations, from rigid transformations for novel view synthesis (simulating camera pose change by rotating the bottleneck) to more complex non-rigid changes (raising the arms of the subject while keeping the rest of the body stationary).
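The trilinear sampling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `identity_flow` and `trilinear_sample`, the (channels, depth, height, width) layout, the use of voxel-unit coordinates, and the border clamping are all assumptions made here for illustration.

```python
import numpy as np


def identity_flow(depth, height, width):
    """Flow field that maps every output cell to the same cell in the source."""
    z, y, x = np.meshgrid(
        np.arange(depth), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([z, y, x]).astype(np.float64)


def trilinear_sample(volume, flow):
    """Resample a feature volume at the 3D coordinates stored in a flow field.

    volume: (C, D, H, W) source feature volume.
    flow:   (3, D', H', W') array; flow[:, k, j, i] holds the (z, y, x)
            coordinate, in voxel units, to read for output cell (k, j, i).
    Returns a (C, D', H', W') array.
    Out-of-range coordinates are clamped to the border (an assumption).
    """
    _, D, H, W = volume.shape
    z = np.clip(flow[0], 0, D - 1)
    y = np.clip(flow[1], 0, H - 1)
    x = np.clip(flow[2], 0, W - 1)

    # Integer corners surrounding each sampling location.
    z0 = np.floor(z).astype(int)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    z1 = np.minimum(z0 + 1, D - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)

    # Fractional offsets act as interpolation weights.
    wz, wy, wx = z - z0, y - y0, x - x0

    out = np.zeros((volume.shape[0],) + z.shape, dtype=np.float64)
    for zi, wzi in ((z0, 1.0 - wz), (z1, wz)):
        for yi, wyi in ((y0, 1.0 - wy), (y1, wy)):
            for xi, wxi in ((x0, 1.0 - wx), (x1, wx)):
                # Accumulate the contribution of one of the eight corners.
                out += volume[:, zi, yi, xi] * (wzi * wyi * wxi)
    return out
```

With the identity flow the volume is returned unchanged; adding a fractional offset to one coordinate channel shifts the content by blending neighboring cells, which is what allows both rigid and non-rigid manipulations to be expressed with the same layer.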
While the flow field for rigid camera viewpoint changes can be easily computed given the relative camera transformation between the source and target views, non-rigid body pose transformations can be much more complicated. They require the flow field to provide, for each cell in the transformed bottleneck, the sampling location in the original bottleneck that semantically corresponds to the desired 3D motion of the content extracted from the input image.
Our training process, as described below, in which both rigid viewpoint and non-rigid pose transformations are used, achieves the spatial disentanglement required to infer the appropriate flow field and employ it to guide the desired spatial transformation of the image content.
3.2 Flow-Guided Volumetric Resampling
Network Architecture Overview. Our human body synthesis branch has three major components: the encoder, which consists of a 2D encoding network and a 3D encoding network; the decoder, which consists of a 3D decoding network and a 2D decoding network; and the flow-guided transformable bottleneck network. A source reference image, containing only the foreground region from the original image (obtained using a pre-trained segmentation model [chen2018encoder]), is passed through the encoder to obtain the 3D feature representation. To warp the features, we estimate the flow using the flow network, whose inputs include the source/target poses, which are obtained with the off-the-shelf pose detection network [cao2018openpose, alp2018densepose]. The synthesized foreground frame is then generated by resampling the 3D features with the estimated flow field and decoding the result.
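The pipeline described above can be summarized compactly as follows. This is a sketch in assumed notation: the symbols $E_{2D}$, $E_{3D}$, $D_{3D}$, $D_{2D}$ (encoders and decoders), $\mathcal{T}$ (trilinear resampling), $B_s$ (source bottleneck), $F_{s \to t}$ (estimated flow), $I_s$ (source foreground image) and $\hat{I}_t$ (synthesized foreground) are introduced here for illustration and are not recovered from the source.

```latex
B_s = E_{3D}\big(E_{2D}(I_s)\big), \qquad
\hat{I}_t = D_{2D}\Big(D_{3D}\big(\mathcal{T}(B_s,\, F_{s \to t})\big)\Big)
```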
Multi-Resolution Bottleneck. To increase the fine-scale fidelity of the synthesized images while retaining the global structure of the target subject, we adopt a multi-resolution representation, with implicit volumetric representations at multiple resolutions, as seen in Figure 1. We employ skip connections between each 3D encoder and 3D decoder. The encoder therefore produces one 3D feature volume per resolution. The skip connections use 3D warping, given the estimated flow, to map content from the correct region in the encoded bottleneck to the one to be decoded, rather than the direct connections used in prior art [he2016deep]. Thus, for each feature obtained from the 3D encoder, the 3D decoder receives the correspondingly warped feature as input.
Multi-View Aggregation. Given that a single input image only contains partial information for the depicted human body, we allow for the aggregation of information, represented with our 3D features, from multiple views to improve the synthesized image quality. To accelerate the inference, we take the first source image as the canonical body pose (though this may actually be the subject in any natural pose), and aggregate features from all other images to this pose. In this way, we only perform the aggregation once during inference. More formally, the aggregated feature combines the 3D representation of the first image with the 3D representation of every other source image, each warped by the flow from that image to the first image.
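One plausible form of this aggregation is a simple average over aligned features. The averaging operator is an assumption, as are the symbols: $K$ is the number of source images, $B_k$ the 3D representation of the $k$-th image, $F_{k \to 1}$ the warping flow from the $k$-th image to the first, and $\mathcal{T}$ the trilinear resampling operation.

```latex
\bar{B} = \frac{1}{K}\Big( B_1 + \sum_{k=2}^{K} \mathcal{T}\big(B_k,\, F_{k \to 1}\big) \Big)
```

Because every feature is warped into the pose of the first image before fusion, the result can be computed once and reused for every target pose.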
3.3 Background Modeling
The background environment is modeled through a separate network, to which the source images are fed. Specifically, this network estimates a background image and a confidence map, in which high confidence indicates the foreground (i.e., the depicted person), while low confidence indicates the background. The final synthesized image is obtained by component-wise multiplication of the confidence map with the color channels of the synthesized foreground image, composited with the synthesized background image.
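The compositing step can be sketched as a standard alpha blend. The function name `composite`, the array layout, and in particular the use of the complement of the confidence map to weight the background are assumptions made for illustration; the source only states that the confidence map multiplies the color channels of the synthesized image.

```python
import numpy as np


def composite(foreground, background, confidence):
    """Blend a synthesized foreground with an estimated background.

    foreground, background: (H, W, 3) images in the same value range.
    confidence: (H, W) map in [0, 1]; high values mark the person.
    The (1 - confidence) background weight is an assumed convention.
    """
    mask = confidence[..., None]  # broadcast over the color channels
    return mask * foreground + (1.0 - mask) * background
```

Pixels with confidence close to one are taken from the foreground branch, pixels with confidence close to zero from the background network, and intermediate values blend the two smoothly along the silhouette.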
Retargeting Supervision. We use a conditional discriminator to determine whether the synthesized foreground image is real or fake. The concatenation of the input image, 2D source/target poses, and target image is sent to the discriminator, and we apply an adversarial loss in which the real sample is the foreground region from the real image. The generator is trained to minimize this objective, while the discriminator is trained to maximize it.
Similarly, we use a second discriminator that operates on the full synthesized image, containing both foreground and background, and apply the analogous adversarial loss with the real image as the real sample.
We also use a perceptual loss [johnson2016perceptual] between the real and generated images to improve fidelity of the generated images for both the foreground and background.
Additionally, we measure the reconstruction quality of each of the source images using the aggregated bottleneck. The reconstructed image is generated by decoding the encoded features directly, with no flow required for this auto-encoding. A reconstruction loss penalizes the difference between each source image and its reconstruction.
Mask Supervision. We also leverage mask supervision, using the foreground masks, to better supervise the implicit 3D representation modeling. Similar to previous work [olszewski2019transformable], we introduce an occupancy decoder to obtain the estimated mask directly from the 3D features. Since our architecture has multi-resolution bottlenecks, we apply multiple occupancy decoders, one per bottleneck, so that an estimated mask is obtained from each 3D representation. The mask loss compares each estimated mask with the mask obtained from the corresponding source reference image using the pre-trained segmentation network.
Unsupervised Random Rotation Supervision. We further introduce an unsupervised training technique to help the network learn implicit volumetric representations for the encoded subject with appropriate spatial structure. Specifically, we apply a random rotation around the vertical axis of the volume to the encoded bottleneck and require the corresponding synthesized image to be indistinguishable from the ground-truth views. The magnitude of this rotation is sampled from a uniform distribution. The synthesized image is obtained by resampling the encoded bottleneck with the flow field defined by the rigid transformation between the source pose and the random pose, and then decoding it; it should contain a novel view of the subject performing the same pose as in the source image. However, since there is no ground-truth image corresponding to each random rigid transformation, we introduce a discriminator to match the distribution between real images and these synthesized views in an unsupervised manner. We provide the discriminator with the concatenation of the generated image and the original source to better maintain the source identity. We employ an adversarial loss to enforce the rotation constraint on the foreground region.
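The rigid flow field for such a rotation can be computed in closed form. The following NumPy sketch assumes a (depth, height, width) grid in which height is the vertical axis, rotation about the volume center, and voxel-unit coordinates. The function names and the sampling range `max_angle` are hypothetical, since the source does not state the bounds of the uniform distribution.

```python
import numpy as np


def rotation_flow(depth, height, width, angle_rad):
    """Flow field rotating a (D, H, W) volume about its vertical (height) axis.

    Returns a (3, D, H, W) array holding, for every output cell, the
    (z, y, x) source coordinate to sample from.
    """
    z, y, x = np.meshgrid(
        np.arange(depth), np.arange(height), np.arange(width), indexing="ij"
    )
    cz, cx = (depth - 1) / 2.0, (width - 1) / 2.0
    dz, dx = z - cz, x - cx
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    # Inverse mapping: to rotate the content by +angle, each output cell
    # reads from the location obtained by rotating it by -angle.
    src_x = cos_a * dx + sin_a * dz + cx
    src_z = -sin_a * dx + cos_a * dz + cz
    return np.stack([src_z, y.astype(np.float64), src_x])


def random_rotation_flow(shape, rng, max_angle=np.pi):
    """Sample an angle uniformly in [-max_angle, max_angle] (assumed range)."""
    angle = rng.uniform(-max_angle, max_angle)
    return rotation_flow(*shape, angle), angle
```

The height coordinate is left untouched, so the subject stays upright while the depth and width coordinates are mixed; the resulting field can be passed to a trilinear sampling layer in the same way as a learned non-rigid flow.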
Full Objective. The full training objective for the entire motion retargeting network is a weighted sum of the loss functions described above, with six hyper-parameters controlling the weight of each loss function.
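A weighted sum consistent with the six loss terms described in this section could take the following form. All symbol names are assumptions introduced for illustration: the two adversarial losses (foreground and full image), the perceptual loss, the reconstruction loss, the mask loss, and the random rotation loss, each with its own weight $\lambda$.

```latex
\mathcal{L} =
  \lambda_{fg}\,\mathcal{L}_{adv}^{fg}
+ \lambda_{full}\,\mathcal{L}_{adv}^{full}
+ \lambda_{perc}\,\mathcal{L}_{perc}
+ \lambda_{rec}\,\mathcal{L}_{rec}
+ \lambda_{mask}\,\mathcal{L}_{mask}
+ \lambda_{rot}\,\mathcal{L}_{rot}
```

The generator minimizes this objective, while each discriminator maximizes its own adversarial term.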
In this section, we evaluate our method on two different datasets, and show qualitative and quantitative comparisons with recent state-of-the-art efforts on human motion generation. For additional results, including video sequences, please consult the supplemental material.
4.1 Experimental Setting
Datasets. We adopt two datasets for experiments:
iPER. The first dataset, known as the Impersonator (iPER) dataset, was recently collected and released by LiquidGAN [liu2019liquid], and serves as a benchmark dataset for human motion animation tasks. The dataset consists of videos, each depicting one person performing different actions. We follow the protocol in previous work, splitting the training and testing set in the ratio of 8:2. Compared with other datasets [liu2016deepfashion, zheng2015scalable, zablotskaia2019dwnet], iPER includes human subjects with widely varying shape, height and appearance, performing diverse motions.
Youtube-Dancing. We create this dataset by collecting dancing videos from Youtube. It is divided into a set of dancing videos for training and a separate set of videos for testing. The videos are recorded in unconstrained, in-the-wild environments, and consist of more challenging and diverse motion patterns compared with the iPER dataset.
Implementation Details. We normalize all images to a fixed range for training, and train the networks to synthesize images at a fixed resolution. We apply the Adam optimizer [kingma2014adam] with mini-batch training. Following previous work [wang2018high], we apply multi-scale discriminators, in which one discriminator accepts images at the original resolution and another accepts images at half this resolution. The hyper-parameters in Eqn. 10 share a common value, with one exception.
Evaluation Metrics. We use three widely adopted metrics to evaluate the image synthesis quality. The Structural Similarity (SSIM) index [wang2004image] measures the structural similarity between synthesized and real images. The Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable] measures the perceptual similarity between these images. The Fréchet Inception Distance (FID) [heusel2017gans] calculates the distance between two distributions of real and synthesized images, and is commonly used to quantify the fidelity of the synthesized images.
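To make the SSIM formula concrete, the sketch below evaluates it once over the whole image. This is a simplified, single-window variant written for illustration: the standard metric averages the same expression over local windows, and the evaluation in this work presumably uses that standard form. The function name `global_ssim` is hypothetical; the constants 0.01 and 0.03 are the defaults commonly used with SSIM.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equally sized images (simplified variant)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images score one, and the score decreases as mean intensity, contrast, or structure diverge. Higher SSIM is better, whereas lower LPIPS and FID are better.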
4.2 Comparisons and Results
We first show the experimental results on the iPER dataset. Following previous work [liu2019liquid], we use three reference images with different degrees of occlusion from each video to synthesize other images. We report both the SSIM and LPIPS for our work and existing studies, including PG2 [ma2017pose], SHUP [balakrishnan2018synthesizing], DSC [siarohin2018deformable], and LiquidGAN [liu2019liquid] in Table 1. As shown in the table, we achieve better results using both metrics than state-of-the-art works, indicating higher synthesized image quality with our method. Additionally, we provide qualitative results in Figure 2. We show an example source reference image from a target subject, and the generated sequence using the driving pose information from the ground-truth images. As depicted, our method can generate realistic images with diverse motion patterns.
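Table 3 (ablation on iPER; rows recovered from the source):
|Setting|SSIM|LPIPS|
|---|---|---|
|w/o Multi-Resolution TBN|0.870|0.051|
|w/o Random Rotations|0.876|0.047|

Table 4 (number of reference images on iPER):
|Setting|SSIM|LPIPS|
|---|---|---|
|One reference image|0.878|0.045|
|Two reference images|0.879|0.045|
|Three reference images|0.881|0.044|
|Four reference images|0.882|0.043|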
We then perform these experiments on our collected Youtube-Dancing dataset. We compare our method with the most recent work, FewShot-V2V [wang2019fewshotvid2vid], which can generate motion retargeting videos using one or a few images. For evaluation, we use the first frame from each testing video as a source reference frame and use poses from the other frames to generate images. Besides the SSIM and LPIPS, we also follow FewShot-V2V [wang2019fewshotvid2vid] and report the FID scores. The results are summarized in Table 2. As can be seen, our method achieves better results than FewShot-V2V [wang2019fewshotvid2vid] on each of the three metrics. We also provide sample qualitative results in Figure 3. We show the source reference image from a target subject, and the generated short sequence from our method and FewShot-V2V using the pose from the ground-truth images. Compared with FewShot-V2V, our method generates more realistic images with fewer artifacts, and the identity and the texture from the target subject are much better preserved.
4.3 Ablation Study
Architecture and Training Technique Analysis. We conduct an ablation analysis of our network architectures and training strategies. For this experiment, we analyze the effects of each component used to generate the human body, i.e., the foreground region. We thus remove the background using an image segmentation model [chen2018encoder], and only compute evaluation metrics on the generated foreground region. We use one source reference image from a target subject and perform experiments with the following settings on iPER:
w/o Multi-Resolution TBN. Instead of using multi-resolution bottlenecks, we use a single resolution TBN to encode the implicit 3D representation.
w/o Skip-Connections. The skip-connections (implemented via flow warping) between the encoded and decoded 3D bottlenecks are removed.
w/o Random Rotations. We remove the unsupervised random rotation supervision applied to the encoded TBN and its associated loss function, defined in Eqn. 9.
The quantitative results, including the SSIM and LPIPS, are shown in Table 3. We can see that each component is beneficial and that our full method (Full) achieves the best results.
Multi-View Aggregation Analysis. As our method can leverage multiple reference images from the target to perform multi-view aggregation, we conduct experiments to analyze the effect of the number of reference images on the synthesis result using the iPER dataset. The experimental results, presented in Table 4, demonstrate that using more reference images improves the quality of the synthesized human bodies.
Our approach to few-shot human motion retargeting exploits the advantages of 3D representations of the human body while avoiding the limitations of the more straightforward prior methods. Our implicit 3D representation, learned via spatial disentanglement during training, avoids the pitfalls of standard geometric representations such as dense pose estimates or template meshes, which are limited in their expressive capacity and for which it is impossible to obtain accurate ground truth in unconstrained conditions. At the same time, it allows for 3D-aware motion inference and image content manipulation, and attains state-of-the-art results on challenging motion retargeting benchmarks. Though we require 2D human poses, our approach could be extended to allow for more general motion retargeting for images of articulated animals given their 2D poses.