Generative models, both conditional and unconditional, have been at the core of the computer vision field since its inception. In recent years, approaches such as GANs [goodfellow2014generative] and VAEs [kingma2013auto] have achieved impressive results in a variety of image-based generative tasks. Progress on the video side, on the other hand, has been much more timid. A particular challenge is the generation of videos containing high-resolution moving human subjects. In addition to ensuring that each frame is realistic and that the video is temporally coherent overall, an additional challenge is maintaining the coherent appearance and motion realism of the human subject itself. Notably, visual artifacts exhibited on human subjects tend to be the most glaring for observers (an effect partially termed the "uncanny valley" in computer graphics).
In this paper, we address the problem of human motion transfer. Namely, given a single image depicting a (source) human subject, we propose a method to generate a high-resolution video of this subject, conditioned on the (driving) motion expressed in an auxiliary video. The task is illustrated in Figure 1. Similar to recent methods that focus on pose-guided image generation [ma2017pose, siarohin2018deformable, balakrishnansynthesizing, ma2018disentangled, esser2018variational, dong2018soft, zanfir2018human, Neverova_2018_ECCV, grigorev2019coordinate, siarohin2019animating], we leverage an intermediate pose-centric representation of the subject. However, unlike those methods, which tend to focus on sparse keypoint [ma2017pose, siarohin2018deformable, balakrishnansynthesizing, ma2018disentangled, dong2018soft] or skeleton [esser2018variational] representations, or on intermediate dense optical flow obtained from those impoverished sources [siarohin2019animating], we utilize a more detailed dense intermediate representation [Guler2018DensePose] and a texture transfer approach to define a fine-grained warping from the (source) human subject image to the target poses. This texture warping allows us to preserve the appearance of the subject more explicitly. Further, we focus on temporal consistency, which ensures that the transfer is not done independently for each generated frame, but is instead sequentially conditioned on previously generated frames. We also note that, unlike [chan2018everybody, wang2018video], we rely only on a single (source) image of the subject and not a video, making the problem considerably more challenging.
Contributions: Our contributions are multi-fold. First, we propose a dense warp-based architecture designed to account for, and correct, dense pose errors produced by our intermediate human representation. Second, we formulate our generative framework as a conditional model, where each frame is generated conditioned not only on the source image and the target pose, but also on the previous frame generated by the model. This enables our framework to produce a much more temporally coherent output. Third, we illustrate the efficacy of our approach by showing improved performance with respect to recent state-of-the-art methods. Finally, we collect and make available a new high-resolution dataset of fashion videos.
2 Related work
Image generation has become an increasingly popular task in recent years. The goal is to generate realistic images that mimic samples from the true visual data distribution. Variational Autoencoders (VAEs) [kingma2013auto] and Generative Adversarial Networks (GANs) [goodfellow2014generative] are powerful tools for image generation that have shown promising results. In unconstrained image generation, the resulting images are synthesized from random noise vectors. However, this paradigm can be extended to conditional image generation [pix2pix2016, pix2pixHD], where, apart from the noise vector, the network input includes conditional information, which can be, for example, a style image [pix2pix2016, johnson2016perceptual], a descriptive sentence [hong2018textim] or an object layout [zhao2019layoutim] designating aspects of the desired image output. Multi-view synthesis [zhou2016view, park2017transformation, sun2018multi] is one of the largest topics in conditional generation and is the one most closely related to our proposal. The task of multi-view synthesis is to generate an unseen view given one or more known views.
Pose guided image generation. In their pioneering work, Ma et al. [ma2017pose] proposed pose-guided image generation using GANs. They model human poses as a set of keypoints and use standard image-to-image translation networks (e.g., UNET [ronneberger2015u]). Later, it was found [balakrishnansynthesizing, siarohin2018deformable] that UNET-based architectures have difficulty processing inputs that are not spatially aligned. In pose-guided models, the keypoints of the target are not spatially aligned with the source image. As a consequence, [balakrishnansynthesizing, siarohin2018deformable] propose new generator architectures that first try to spatially align these two inputs and then generate target images using image-to-image translation paradigms. Neverova et al. [Neverova_2018_ECCV] propose to exploit the SMPL [SMPL:2015] representation of a person, which they estimate using DensePose [Guler2018DensePose], in order to improve pose-guided generation. Compared to keypoint representations, DensePose [Guler2018DensePose] provides much more information about the human pose, so using it as a condition allows much better generation results. Grigorev et al. [grigorev2019coordinate] propose coordinate-based inpainting to recover missing parts in the DensePose [Guler2018DensePose] estimation. Coordinate-based inpainting explicitly predicts where to copy the missing texture from, while regular inpainting predicts the RGB values of the missing texture itself. In contrast to Grigorev et al. [grigorev2019coordinate], our work can perform both standard inpainting and coordinate-based inpainting.
Video Generation. The field of video generation is much less explored than image generation. Initial works adopt 3D convolutions and recurrent architectures for video generation [vondrick2016generating, saito2017temporal, tulyakov2017mocogan]. Later works focus on conditional video generation. The most well-studied task in conditional video generation is future frame prediction [srivastava2015unsupervised, oh2015action, finn2016unsupervised, babaeizadeh2017stochastic]. Recent works exploit an intermediate representation, in the form of learned keypoints, for future frame prediction [villegas2017learning, zhao2018learning, wang2018every]. However, the most realistic video results are obtained by conditioning video generation on another video. This task is often called video-to-video translation. Two recent works [chan2018everybody, wang2018video] propose pose-guided video generation, which is a special case of video-to-video translation. The main drawback of these models is that they need to train a separate network for each person. In contrast, we propose to generate a video based only on a single image of a person. Recently, this task was addressed by Siarohin et al. [siarohin2019animating], but they try to learn a representation of the subject in an unsupervised manner, which leads to sub-optimal results. Conversely, in our work, we exploit and refine the richer structure and representation from DensePose [Guler2018DensePose] as an intermediate guide for video generation.
The objective of this work is to generate a video, consisting of frames , containing a person from a source image , conditioned on a sequence of driving video frames of another person , such that the person from the source image replicates the motions of the person from the driving video. Our model is based on the standard image-to-image [pix2pix2016] translation framework. However, standard convolutional networks are not well suited for tasks where the condition and the result are not well aligned. Hence, as advocated in recent works [balakrishnansynthesizing, siarohin2018deformable], the key to precise human subject video generation lies in leveraging motion from the estimated poses. Moreover, the perceptual quality of the video is highly dependent on the temporal consistency between nearby frames. We design our model with these goals and intuitions in mind (see Figure 2(a)).
First, differently from standard pose-guided image generation frameworks [ma2017pose, siarohin2018deformable, balakrishnansynthesizing], in order to produce temporally consistent videos we add a Markovian assumption to our model. If we generate each frame of the video independently, the result can be temporally inconsistent and exhibit many flickering artifacts. To this end, we condition generation of each frame on a previously generated frame , where .
Second, we use a DensePose [Guler2018DensePose] architecture to estimate correspondences between pixels and parts of the human body, in 3D. We apply DensePose [Guler2018DensePose] to the initial image and to every frame of a driving video , where function denotes the output of the DensePose. Using this information we obtain a partial correspondence between pixels of any two human images. This correspondence allows us to analytically compute the coarse warp grid estimate and , where . The coarse warp grid, in turn, allows us to perform texture transfer and estimate motion flow. Even though DensePose produces high-quality estimates, it is not perfect and sometimes suffers from artifacts, such as false human detections and missing body parts. Another important drawback of using pose estimators is the lack of information regarding clothing. This information is very important to us, since we are trying to map a given person onto a video while preserving their body shape, facial features, hair and clothing. This motivates us to compute refined warp grid estimates and , where (see Figure 2(b)). We train this component end-to-end using standard image generation losses.
To sum up, our generator consists of three blocks: pose encoder , warp module and the decoder (see Figure 2(a)). generates video frames iteratively one-by-one. First, the encoder produces a representation of the driving pose . Then given the source image , the source pose and the driving pose , warp module estimates a grid , encodes the source image and warps this encoded image according to the grid. This gives us a representation . Previous frame is processed in the same way, i.e., after transformation we obtain a representation . These two representations together with are concatenated and later processed by the decoder to produce an output frame .
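The iterative generation described above can be sketched in a few lines. This is only an illustration under stated assumptions: `pose_encoder`, `warp_module` and `decoder` are hypothetical numerical stand-ins for the three learned blocks, and array shapes are chosen for convenience. What the sketch does reflect is the control flow from the text: the source image and the previous frame are both warped towards the current driving pose, the two representations are concatenated with the pose representation, and the source image plays the role of the "previous" frame at the first step.

```python
import numpy as np

def pose_encoder(pose):
    # Hypothetical stand-in for the learned pose encoder.
    return pose.mean(axis=-1, keepdims=True)

def warp_module(image, image_pose, target_pose):
    # Hypothetical stand-in: the real module estimates a warp grid from the
    # two poses, encodes `image`, and warps the encoded features.
    return image * 0.5

def decoder(features):
    # Hypothetical stand-in: collapse concatenated channels to 3 (RGB).
    return np.stack([features.mean(axis=-1)] * 3, axis=-1)

def generate_video(source, source_pose, driving_poses):
    """Generate one frame per driving pose, each conditioned on the
    source image and on the previously generated frame."""
    frames = []
    # At the first step the source image is treated as the "previous" frame.
    prev_frame, prev_pose = source, source_pose
    for pose in driving_poses:
        pose_repr = pose_encoder(pose)
        src_repr = warp_module(source, source_pose, pose)
        prev_repr = warp_module(prev_frame, prev_pose, pose)
        features = np.concatenate([src_repr, prev_repr, pose_repr], axis=-1)
        frame = decoder(features)
        frames.append(frame)
        # Markovian conditioning: the next step sees this generated frame.
        prev_frame, prev_pose = frame, pose
    return frames
```

Because only the most recent generated frame is carried forward, memory use does not grow with the length of the driving sequence.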
3.1 Warp module
Coarse warp grid estimate. As described in Section 3, we use DensePose [Guler2018DensePose] for coarse estimation of warp grids. For simplicity, we describe only how to obtain warp grid estimates ; the procedure for is similar. For each body part of the SMPL model [SMPL:2015], DensePose [Guler2018DensePose] estimates UV coordinates. However, the UV pixel coordinates in the source image may not exactly match the UV pixel coordinates in the driving frame , so in order to obtain a correspondence we use nearest neighbour interpolation. In more detail, for each pixel in the driving frame we find the nearest neighbour in the UV space of the source image that belongs to the same body part. In order to perform efficient nearest neighbour search we make use of KD-Trees [bentley1975multidimensional].
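The per-part nearest-neighbour matching can be illustrated as follows. This is a sketch under assumptions rather than the implementation used here: the function name, the array layout (a part-index map plus a two-channel UV map per image) and the convention that part index 0 denotes background are all hypothetical. To stay dependency-free, a brute-force distance computation stands in for the KD-Tree; a KD-Tree would return the same matches with a faster search.

```python
import numpy as np

def coarse_warp_grid(src_part, src_uv, drv_part, drv_uv):
    """For every driving pixel, find the source pixel of the same body part
    whose UV coordinate is closest.

    src_part, drv_part: (H, W) integer part indices (0 assumed background).
    src_uv, drv_uv:     (H, W, 2) UV coordinates.
    Returns an (H, W, 2) array of source (row, col) positions, with -1
    wherever no correspondence exists.
    """
    h, w = drv_part.shape
    grid = np.full((h, w, 2), -1, dtype=np.int64)
    for part in np.unique(drv_part):
        if part == 0:
            continue  # background carries no UV correspondence
        src_rc = np.argwhere(src_part == part)
        if len(src_rc) == 0:
            continue  # this body part is not visible in the source image
        drv_rc = np.argwhere(drv_part == part)
        src_pts = src_uv[src_rc[:, 0], src_rc[:, 1]]
        drv_pts = drv_uv[drv_rc[:, 0], drv_rc[:, 1]]
        # Brute-force squared distances in UV space (KD-Tree stand-in).
        d2 = ((drv_pts[:, None, :] - src_pts[None, :, :]) ** 2).sum(axis=-1)
        nearest = d2.argmin(axis=1)
        grid[drv_rc[:, 0], drv_rc[:, 1]] = src_rc[nearest]
    return grid
```

Pixels left at -1 are exactly the cases (background, occluded or undetected parts) where a coarse estimate of this kind has nothing to offer.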
Refined warp grid estimate. While the coarse warp grid estimation preserves general human body movement, it contains many errors because of self-occlusions and imprecise DensePose [Guler2018DensePose] estimates. Moreover, the movement of the person's outfit is not modeled. This motivates us to add an additional correction branch . The goal of this branch is, given source image , coarse warp grid and target pose , to predict refined warp grid . This refinement branch is trained end-to-end with the entire framework. Differentiable warping is implemented using a bilinear kernel [jaderberg2015spatial]. The main problem of the bilinear kernel is limited gradient flow, i.e., from each spatial location gradients are propagated only to the 4 nearby spatial locations. This makes the module highly vulnerable to local minima. One way to address the local minimum problem is good initialization. With this in mind, we adopt a residual architecture for our module, i.e., the refinement branch predicts only a correction, which is later added to the coarse warp grid.
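To make the two ingredients of this paragraph concrete, the following sketch implements bilinear sampling and the residual update in NumPy. It is illustrative only: the function names are hypothetical, grid coordinates are assumed to be in pixel units as (row, col), and out-of-range coordinates are simply clamped. Each output value is a weighted sum of exactly four neighbouring input values, which is why gradients reach only those four locations.

```python
import numpy as np

def bilinear_sample(feat, grid):
    """Sample `feat` (H, W, C) at the (row, col) positions in `grid` (H', W', 2)."""
    h, w = feat.shape[:2]
    r = np.clip(grid[..., 0], 0, h - 1)
    c = np.clip(grid[..., 1], 0, w - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    wr = (r - r0)[..., None]  # vertical interpolation weight
    wc = (c - c0)[..., None]  # horizontal interpolation weight
    # Only these four neighbours contribute to each output location.
    top = (1 - wc) * feat[r0, c0] + wc * feat[r0, c1]
    bottom = (1 - wc) * feat[r1, c0] + wc * feat[r1, c1]
    return (1 - wr) * top + wr * bottom

def refine(coarse_grid, correction):
    # Residual formulation: the branch predicts a correction that is
    # added to the coarse grid, so a zero correction keeps the coarse warp.
    return coarse_grid + correction
```

With a zero correction the refined grid equals the coarse one, so training starts from the DensePose-based estimate rather than from an arbitrary warp.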
Note that since we transform intermediate representations of source image , the spatial size of the warp grid should be equal to the spatial size of the representation. In our case the spatial size of the representation is . Because of this, and to save computational resources, predicts corrections of size ; moreover, coarse warp grid is downsampled to the size of . Also, since convolutional layers are translation equivariant, they cannot efficiently process the absolute coordinates of the coarse warp grid . In order to alleviate this issue we input relative shifts to the network, i.e., , where is an identity warp grid.
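The relative-shift input can be illustrated with a short sketch. The helper names, the pixel-unit (row, col) convention and the strided downsampling are assumptions made for the example; the text does not specify how the grid is downsampled. The point is that a pure translation, which looks different at every location in absolute coordinates, becomes a constant field once the identity grid is subtracted.

```python
import numpy as np

def identity_grid(h, w):
    """Grid that maps every location to itself, as (row, col) in pixel units."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows, cols], axis=-1).astype(np.float64)

def relative_shift(coarse_grid):
    """Convert absolute sampling coordinates into per-pixel displacements."""
    h, w = coarse_grid.shape[:2]
    return coarse_grid - identity_grid(h, w)

def downsample_grid(grid, factor):
    """Assumed downsampling: keep every `factor`-th entry and rescale the
    coordinates so they address the smaller feature map."""
    return grid[::factor, ::factor] / factor
```

For example, a warp that shifts everything two pixels to the right yields the constant displacement (0, 2) at every location, which a convolutional layer can process uniformly.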
Our training procedure is similar to [balakrishnansynthesizing]; however, it is specifically adapted to take the Markovian assumption into account. At training time we sample four frames from the same training video (three of which are consecutive), the indices of these frames are , where , and is the total number of frames in the video. Experimentally, we observe that using four frames gives the best trade-off between temporal consistency and computational efficiency. We treat frame as the source image , while the rest are treated as both the driving and the ground truth target frames. We generate the three frames as follows:
for the first frame, where the source frame is treated as the "previous" frame, and for the rest:
This formulation has low memory consumption, but at the same time allows standard pose-guided image generation, which is needed to produce the first target output frame. Note that if in Eq.3.2 we use the real previous frame , the generator will ignore the source image , since is much more similar to than .
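The four-frame sampling can be sketched as follows. The exact index constraints are not reproduced here, so the sketch makes its own assumptions: the source index is drawn uniformly from the whole video, and the three consecutive target indices are drawn independently of it. The function name and signature are hypothetical.

```python
import random

def sample_training_indices(num_frames, rng=random):
    """Return (source_index, [j, j + 1, j + 2]) for one training sample.

    Assumption: the source frame and the start of the consecutive triple
    are sampled independently and uniformly.
    """
    if num_frames < 3:
        raise ValueError("need at least three frames for a consecutive triple")
    source = rng.randrange(num_frames)
    start = rng.randrange(num_frames - 2)
    return source, [start, start + 1, start + 2]
```

During training the second and third target frames would then be generated from the model's own previous output rather than from the real previous frame, for the reason given above.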
Losses. We use a combination of losses from pix2pixHD [pix2pixHD] model. We employ the least-square GAN [mao2017least] for the adversarial loss:
where is the patch-based critic [pix2pixHD, pix2pix2016].
To drive image reconstruction we also employ feature matching [pix2pixHD] and perceptual [johnson2016perceptual] losses:
where is the feature representation from the -th layer of the network; for the perceptual loss this network is VGG-19 [DBLP:journals/corr/SimonyanZ14a] and for feature matching it is the critic . The total loss is given by:
following [pix2pixHD] .
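As a rough numerical illustration of these loss terms, the sketch below uses the common least-squares GAN form with targets 1 for real and 0 for fake, and an L1 distance between feature maps. These specific choices, the way the terms are combined, and all function names are assumptions for illustration; the weight `lam` is left as a parameter because its value is not restated here.

```python
import numpy as np

def lsgan_critic_loss(d_real, d_fake):
    # Least-squares GAN: push critic outputs to 1 on real, 0 on generated.
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    # The generator tries to make the critic output 1 on generated frames.
    return float(np.mean((d_fake - 1.0) ** 2))

def feature_l1_loss(real_feats, fake_feats):
    # Sum over layers of the mean absolute feature difference. The same
    # form serves the perceptual loss (VGG-19 features) and the feature
    # matching loss (critic features).
    return float(sum(np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)))

def total_generator_loss(d_fake, vgg_real, vgg_fake, critic_real, critic_fake, lam):
    # Assumed combination: adversarial term plus weighted reconstruction terms.
    reconstruction = (feature_l1_loss(vgg_real, vgg_fake)
                      + feature_l1_loss(critic_real, critic_fake))
    return lsgan_generator_loss(d_fake) + lam * reconstruction
```

With a patch-based critic, `d_real` and `d_fake` are maps of per-patch scores, and the means above average over all patches.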
We have conducted an extensive set of experiments to evaluate the proposed DwNet. We first describe our newly collected dataset, then we compare our method with previous state-of-the-art models for pose-guided human video synthesis and for pose-guided image generation. We show the superiority of our model in terms of realism and temporal coherency. Finally, we evaluate the contribution of each architecture choice we made and show that each part of the model positively contributes to the results.
4.1 The Fashion Dataset
We introduce a new Fashion dataset containing 500 training and 100 test videos, each containing roughly 350 frames. Videos from our dataset show a single human subject and are characterized by high resolution and a static camera. Most importantly, clothing and textures are diverse and cover a large space of possible appearances. The dataset is publicly released at: https://vision.cs.ubc.ca/datasets/fashion/.
We conduct our experiments on the proposed Fashion and Tai-Chi [tulyakov2017mocogan] datasets. The latter is composed of more than 3000 tai-chi video clips downloaded from YouTube. In all previous works [siarohin2019animating, tulyakov2017mocogan], the smaller pixel resolution version of this dataset has been used; however, for our work we use pixel resolution. The length varies from 128 to 1024 frames per video. The train and test sets contain 3049 and 285 videos, respectively.
4.2.2 Evaluation metrics
There is no consensus in the community on a single criterion for measuring the quality of generated videos from the perspective of realism, texture similarity and temporal coherence. Therefore, we choose a set of widely accepted evaluation metrics to measure performance.
Perceptual loss. The texture similarity is measured using a perceptual loss. Similar to our training procedure, we use VGG-19 [DBLP:journals/corr/SimonyanZ14a] network as a feature extractor and then compute loss between the extracted features from the real and generated frames.
FID. We use the Fréchet Inception Distance [heusel2017gans] (FID) to measure the realism of the individual frames. FID is a widely used metric for comparing GAN-based methods.
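For reference, the Fréchet distance underlying FID compares two Gaussians fitted to feature vectors of real and generated frames. The sketch below computes it with NumPy only; the function names are hypothetical, and the Inception feature extraction itself is not shown. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which are real and non-negative for positive semi-definite covariances, so small numerical imaginary or negative parts are discarded.

```python
import numpy as np

def feature_statistics(features):
    """Mean and covariance of an (N, D) array of feature vectors."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of square roots of the
    # eigenvalues of the product sigma1 @ sigma2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

Identical statistics give a distance of zero, and the value grows with differences in either the means or the covariances.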
AKD. We evaluate whether the motion is correctly transferred by means of the Average Keypoint Distance (AKD) metric [siarohin2019animating]. Similarly to [siarohin2019animating], we employ a human pose estimator [cao2017realtime] and compute the average distance between ground truth and detected keypoints. Intuitively, this metric measures whether the person moves in the same way as in the driving video. We do not report the Missing Keypoint Rate (MKR) because it is similar and close to zero for all methods.
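The distance computation behind AKD is simple enough to state in code. The sketch assumes that keypoints have already been detected in both the ground-truth and the generated frames and stored as arrays of (x, y) positions; the function name and array layout are hypothetical, and the handling of missing detections is not covered.

```python
import numpy as np

def average_keypoint_distance(gt_keypoints, pred_keypoints):
    """Mean Euclidean distance between corresponding keypoints.

    Both inputs have shape (num_frames, num_keypoints, 2).
    """
    distances = np.linalg.norm(gt_keypoints - pred_keypoints, axis=-1)
    return float(distances.mean())
```

A lower value means the poses detected in the generated video follow those of the ground-truth video more closely.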
User study. We conduct a user study to measure the overall temporal coherency and quality of the synthesised videos. For the user study we use Amazon Mechanical Turk (AMT). On AMT we show users two videos (one produced by DwNet and another by a competing method) in random order and ask them to choose the one with higher realism and consistency. To conduct this study we follow the protocol introduced by Siarohin et al. [siarohin2019animating].
4.3 Implementation details
All of our models are trained for two epochs. In our case, an epoch denotes a full pass through the whole set of video frames, where each sample from the dataset is a set of four frames, as explained in Section 3.2. We train our model starting with a learning rate of 0.0002 and bring it down to zero during the training procedure. Generally, our model is similar to that of Johnson et al. [johnson2016perceptual]. The novel components of the architecture, the pose encoder and the appearance encoder, each contain 2 downsampling Conv layers. The warp module's refinement branch is also based on 2 Conv layers, followed by 2 ResNet blocks. Our decoder consists of 9 ResNet blocks and 2 upsampling Conv layers. We perform all our training procedures on a single GPU (Nvidia GeForce GTX 1080). Our code will be released.
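One simple way to realise a learning rate that starts at 0.0002 and reaches zero by the end of training is a linear decay. The shape of the decay is an assumption here, since only the start and end values are stated; the function name and step-based parameterisation are likewise hypothetical.

```python
def linear_decay_lr(step, total_steps, base_lr=0.0002):
    """Learning rate that falls linearly from `base_lr` to zero.

    Assumption: the decay is linear over the whole run.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the training range
    return base_lr * (1.0 - step / total_steps)
```

Under this assumption, the rate halfway through training is half of the initial value.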
4.4 Comparison with the state-of-the-art
We compare our framework with the current state-of-the-art method for motion transfer, MonkeyNet [siarohin2019animating], which solves a similar problem for human synthesis. The first main advantage of our method, compared to MonkeyNet, is the ability to generate frames at a higher resolution. Originally, MonkeyNet was trained on size frames. However, to conduct fair experiments we re-train MonkeyNet from scratch to produce images of the same size as our method. Our task is quite novel and there is a limited number of baselines. To this end, we also compare with the Coordinate Inpainting framework [grigorev2019coordinate], which is state-of-the-art for image (not video) synthesis, i.e., synthesis of a new image of a person based on a single image. Even though this framework solves a slightly different problem, we still choose to compare with it, since it similarly leverages DensePose [Guler2018DensePose]. This approach does not have any explicit background handling mechanism; therefore, there are no experimental results for it on the Tai-Chi dataset. Note that since the authors of the paper have not released the training code for the method, we were only able to run our experiments on a pre-trained network.
The quantitative comparison is reported in Table 1. Our approach outperforms MonkeyNet and Coordinate Inpainting on both datasets and according to all metrics. With respect to MonkeyNet, this can be explained by its inability to grasp complex human poses; hence it completely fails on the Tai-Chi dataset, which contains a large variety of non-trivial motions. This can be further observed in Figure 3. MonkeyNet simply remaps the person from the source image without modifying the pose. In Figure 4 we can still observe a large difference in terms of human motion consistency and realism. Unlike our model, MonkeyNet produces images with missing body parts. For Coordinate Inpainting, the poor performance could be explained by the lack of temporal consistency, since (unlike our method) it generates each frame independently and hence lacks consistency in clothing texture and appearance. Coordinate Inpainting relies heavily on the output of DensePose and does not correct the resulting artifacts, as is done in our model using the refined warp grid estimate. As one can see from Figure 4, the resulting frames are blurry and inconsistent in small details. This can also explain why such a small percentage of users prefer the results of Coordinate Inpainting. The user study comparison is reported in Table 2, where we can observe that videos produced by our model were preferred by users significantly more often than the videos from the competing models.
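Table 1. Quantitative comparison.
| Method | Fashion: Perceptual (↓) | Fashion: FID (↓) | Fashion: AKD (↓) | Tai-Chi: Perceptual (↓) | Tai-Chi: FID (↓) | Tai-Chi: AKD (↓) |
| Coordinate Inpainting | 0.6434 | 66.50 | 4.20 | - | - | - |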
Table 2. User study.
| Method | Fashion: User-Preference (↑) | Tai-Chi: User-Preference (↑) |
| Coordinate Inpainting | 99.60% | - |
The table in Figure 5 (right) shows the contribution of each of the major architecture choices, i.e., the Markovian assumption (), the refined warp grid estimate () and the coarse warp grid estimate (). For these experiments we remove the mentioned parts from DwNet and train the resulting model architectures. As expected, removing the Markovian assumption, i.e., not conditioning on the previous frame, leads to worse realism and lower similarity with the features of a real image. This is mainly because it leads to a loss of temporal coherence. Further removal of both warp grid estimators from the generation pipeline results in a worse FID score. The perceptual loss is not affected by this change, which can be explained by the fact that the warp mostly contributes to the removal of artifacts and to the naturalness of the texture on a person. In Figure 5 (left) we see the qualitative reflection of our quantitative results. The full model produces the best results, while the third column shows misalignment between the textures of two frames. The architecture without a refined warp produces less realistic results, with a distorted face. Lastly, an architecture without any warp produces blurry, unrealistic results with an inconsistent texture.
In this paper we present DwNet, a generative architecture for pose-guided video generation. Our model can produce high-quality videos based on a source image depicting human appearance and a driving video of another person moving. We propose a novel Markovian modeling to address the temporal inconsistency that typically arises in video generation frameworks. Moreover, we propose a novel warp module that is able to correct warping errors. We validate our method on two video datasets and show its superiority over the baselines. Possible future directions include multiple-source generation and exploiting our warp correction to improve DensePose [Guler2018DensePose] estimation.