Video-to-video synthesis (vid2vid) aims at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video. While the state-of-the-art of vid2vid has advanced significantly, existing approaches share two major limitations. First, they are data-hungry. Numerous images of a target human subject or a scene are required for training. Second, a learned model has limited generalization capability. A pose-to-human vid2vid model can only synthesize poses of the single person in the training set. It does not generalize to other humans that are not in the training set. To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. Our model achieves this few-shot generalization capability via a novel network weight generation module utilizing an attention mechanism. We conduct extensive experimental validations with comparisons to strong baselines using several large-scale video datasets including human-dancing videos, talking-head videos, and street-scene videos. The experimental results verify the effectiveness of the proposed framework in addressing the two limitations of existing vid2vid approaches.
Video-to-video synthesis (vid2vid) refers to the task of converting an input semantic video to an output photorealistic video. It has a wide range of applications, including generating a human-dancing video using a human pose sequence [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance], or generating a driving video using a segmentation mask sequence [wang2018video]. Typically, to obtain such a model, one begins with collecting a training dataset for the target task. It could be a set of videos of a target person performing diverse actions or a set of street-scene videos captured by using a camera mounted on a car driving in a city. The dataset is then used to train a model that converts novel input semantic videos to corresponding photorealistic videos at test time. In other words, we expect that a vid2vid model for humans can generate videos of the same person performing novel actions that are not in the training set, and that a street-scene vid2vid model can generate videos of novel street scenes with the same style as those in the training set. With the advance of the generative adversarial networks (GANs) framework [goodfellow2014generative] and its image-conditional extensions [isola2017image, wang2017high], existing vid2vid approaches have shown promising results.
We argue that generalizing to novel input semantic videos is insufficient. One should also aim for a model that can generalize to unseen domains, such as generating videos of human subjects that are not included in the training dataset. More ideally, a vid2vid model should be able to synthesize videos of unseen domains by leveraging just a few example images given at test time. If a vid2vid model cannot generalize to unseen persons or scene styles, then we must train a model for each new subject or scene style. Moreover, if a vid2vid model cannot achieve this domain generalization capability with only a few example images, then one has to collect many images for each new subject or scene style. This would make the model not easily scalable. Unfortunately, existing vid2vid approaches suffer from these drawbacks as they do not consider such generalization.
To address these limitations, we propose the few-shot vid2vid framework. The few-shot vid2vid framework takes two inputs for generating a video, as shown in Figure 1. In addition to the input semantic video as in vid2vid, it takes a second input, which consists of a few example images of the target domain made available at test time. Note that this is absent in existing vid2vid approaches [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. Our model uses these few example images to dynamically configure the video synthesis mechanism via a novel network weight generation mechanism. Specifically, we train a module to generate the network weights using the example images. We carefully design the learning objective function to facilitate learning the network weight generation module.
We conduct extensive experimental validation with comparisons to various baseline approaches using several large-scale video datasets including dance videos, talking head videos, and street-scene videos. The experimental results show that the proposed approach effectively addresses the limitations of existing vid2vid frameworks. Moreover, we show that the performance of our model is positively correlated with the diversity of the videos in the training dataset, as well as the number of example images available at test time. When the model sees more domains at training time, it generalizes better to unseen domains (Figure 7(a)). When the model is given more example images at test time, the quality of the synthesized videos improves (Figure 7(b)).
GANs. The proposed few-shot vid2vid model is based on GANs [goodfellow2014generative]. Specifically, we use a conditional GAN framework. Instead of generating outputs by converting samples from some noise distribution [goodfellow2014generative, radford2015unsupervised, liu2016coupled, gulrajani2017improved, karras2018style], we generate outputs based on user input data, which allows more flexible control over the outputs. The user input data can take various forms, including images [isola2017image, zhu2017unpaired, liu2016unsupervised, park2019SPADE], categorical labels [odena2016conditional, miyato2018cgans, zhang2019self, brock2018large], textual descriptions [reed2016generative, zhang2017stackgan, xu2018attngan], and videos [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. Our model belongs to the last one. However, different from the existing video-conditional GANs, which take the video as the sole data input, our model also takes a set of example images. These example images are provided at test time, and we use them to dynamically determine the network weights of our video synthesis model through a novel network weight generation module. This helps the network generate videos of unseen domains.
Image-to-image synthesis, which transfers an input image from one domain to a corresponding image in another domain [isola2017image, taigman2016unsupervised, bousmalis2016unsupervised, shrivastava2016learning, zhu2017unpaired, liu2016unsupervised, huang2018multimodal, zhu2017toward, wang2017high, choi2017stargan, park2019SPADE, liu2019few, benaim2018one], is the foundation of vid2vid. For videos, the new challenge lies in generating sequences of frames that are not only photorealistic individually but also temporally consistent as a whole. Recently, the FUNIT [liu2019few] was proposed for generating images of unseen domains via the adaptive instance normalization technique [huang2017adain]. Our work is different in that we aim for video synthesis and achieve generalization to unseen domains via a network weight generation scheme. We compare these techniques in the experiment section.
Video generative models can be divided into three main categories, including 1) unconditional video synthesis models [vondrick2016generating, saito2017temporal, tulyakov2017mocogan], which convert random noise samples to video clips, 2) future video prediction models [srivastava2015unsupervised, kalchbrenner2016video, finn2016unsupervised, mathieu2015deep, lotter2016deep, xue2016visual, walker2016uncertain, walker2017pose, denton2017unsupervised, villegas2017decomposing, liang2017dual, lee2018stochastic, hu2018video, li2018flow, hao2018controllable, pan2019video], which generate future video frames based on the observed ones, and 3) vid2vid models [wang2018video, chan2018everybody, gafni2019vid2game, zhou2019dance], which convert semantic input videos to photorealistic videos. Our work belongs to the last category, but in contrast to the prior works, we aim for a vid2vid model that can synthesize videos of unseen domains by leveraging few example images given at test time.
Adaptive networks refer to networks where part of the weights are dynamically computed based on the input data. This class of networks has a different inductive bias to regular networks and has found use in several tasks including sequence modeling [ha2016hypernetworks], image filtering [jia2016dynamic, wu2018dynamic, su2019pixel], frame interpolation [niklaus2017video, niklaus2017video2], and neural architecture search [zhang2019graph]. Here, we apply it to the vid2vid task.
Human pose transfer synthesizes a human in an unseen pose by utilizing an image of the human in a different pose. To achieve high quality generation results, existing human pose transfer methods largely utilize human body priors such as body part modeling [balakrishnan2018synthesizing] or human surface-based coordinate mapping [neverova2018dense]. Our work differs from these works in that our method is more general. We do not use specific human body priors other than the input semantic video. As a result, the same model can be directly used for other vid2vid tasks such as street scene video synthesis, as shown in Figure 5. Moreover, our model is designed for video synthesis, while existing human pose transfer methods are mostly designed for still image synthesis and do not consider the temporal aspect of the problem. As a result, our method renders more temporally consistent results (Figure 4).
Video-to-video synthesis aims at learning a mapping function that can convert a sequence of input semantic images (for example, segmentation masks or images denoting human poses), $s_1^T \equiv s_1, s_2, \ldots, s_T$, to a sequence of output images, $\tilde{x}_1^T \equiv \tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T$, in a way that the conditional distribution of $\tilde{x}_1^T$ given $s_1^T$ is similar to the conditional distribution of the ground truth image sequence, $x_1^T \equiv x_1, x_2, \ldots, x_T$, given $s_1^T$. In other words, it aims to achieve $D\big(p(\tilde{x}_1^T \mid s_1^T),\, p(x_1^T \mid s_1^T)\big) \to 0$, where $D$ is a distribution divergence measure such as the Jensen-Shannon divergence or the Wasserstein distance. To model the conditional distribution, existing works make a simplified Markov assumption, leading to a sequential generative model given by

$$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t}). \tag{1}$$
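In other words, it generates the output image, $\tilde{x}_t$, based on the observed $\tau + 1$ input semantic images, $s_{t-\tau}^{t}$, and the $\tau$ past generated images, $\tilde{x}_{t-\tau}^{t-1}$. The sequential generator $F$ can be modeled in several different ways [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. A popular choice is to use an image matting function given by

$$F(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t}) = (\mathbf{1} - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \tilde{h}_t,$$

where the symbol $\mathbf{1}$ is an image of all ones, $\odot$ is the element-wise product operator, $\tilde{m}_t$ is a soft occlusion map, $\tilde{w}_{t-1}$ is the optical flow from $t-1$ to $t$, and $\tilde{h}_t$ is a synthesized intermediate image.
Figure 2(a) visualizes the vid2vid architecture and the matting function, which shows that the output image $\tilde{x}_t$ is generated by combining the optical-flow warped version of the last generated image, $\tilde{w}_{t-1}(\tilde{x}_{t-1})$, and the synthesized intermediate image, $\tilde{h}_t$. The soft occlusion map, $\tilde{m}_t$, dictates how these two images are combined at each pixel location. Intuitively, if a pixel is observed in the previously generated frame, it would favor duplicating the pixel value from the warped image. In practice, these quantities are generated via neural network-parameterized functions, $M_{\theta_M}$, $W_{\theta_W}$, and $H_{\theta_H}$:

$$\tilde{m}_t = M_{\theta_M}(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t}), \quad \tilde{w}_{t-1} = W_{\theta_W}(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t}), \quad \tilde{h}_t = H_{\theta_H}(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t}),$$

where $\theta_M$, $\theta_W$, and $\theta_H$ are learnable parameters. They are kept fixed once the training is done.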
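As a rough illustration of the image matting function described above, the following NumPy sketch composites a flow-warped previous frame with an intermediate image using a soft occlusion map. The helper names `warp_backward` and `matte`, the nearest-neighbour backward sampling, and the (dx, dy) flow layout are assumptions made for this example; in the actual model the mask, flow, and intermediate image are produced by the three networks.

```python
import numpy as np

def warp_backward(image, flow):
    """Nearest-neighbour backward warp (an assumed, simplified warping scheme).

    image: (H, W, C) array; flow: (H, W, 2) array holding (dx, dy) per pixel.
    Output pixel (y, x) is read from image at (y + dy, x + dx), clamped to the border.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

def matte(prev_frame, flow, occlusion, intermediate):
    """Blend (1 - m) * warp(prev_frame) + m * intermediate at every pixel."""
    m = occlusion[..., None]  # (H, W) -> (H, W, 1) so it broadcasts over channels
    return (1.0 - m) * warp_backward(prev_frame, flow) + m * intermediate

# Toy example: a 2x2 RGB frame, zero flow, and a hand-written occlusion map.
prev_frame = np.zeros((2, 2, 3))
intermediate = np.ones((2, 2, 3))
flow = np.zeros((2, 2, 2))
occlusion = np.array([[0.0, 1.0], [0.25, 0.5]])
frame = matte(prev_frame, flow, occlusion, intermediate)
```

Where the occlusion map is 0 the output copies the warped previous frame, and where it is 1 the output takes the newly synthesized pixel; values in between blend the two.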
Few-shot vid2vid. While the sequential generator in (1) is trained for converting novel input semantic videos, it is not trained for synthesizing videos of unseen domains. For example, a model trained for a particular person can only be used to generate videos of the same person. In order to adapt to unseen domains, we let $F$ depend on extra inputs. Specifically, we let $F$ take two more input arguments: one is a set of $K$ example images $\{e_1, e_2, \ldots, e_K\}$ of the target domain, and the other is the set of their corresponding semantic images $\{s_{e_1}, s_{e_2}, \ldots, s_{e_K}\}$. That is

$$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t},\, \{e_1, \ldots, e_K\},\, \{s_{e_1}, \ldots, s_{e_K}\}).$$

This modeling allows $F$ to leverage the example images given at test time to extract useful patterns for synthesizing videos of the unseen domain. We propose a network weight generation module $E$ for extracting the patterns. Specifically, $E$ is designed to extract patterns from the provided example images and use them to compute the network weights $\theta_H$ for the intermediate image synthesis network $H$:

$$\theta_H = E(\tilde{x}_{t-\tau}^{t-1},\, s_{t-\tau}^{t},\, \{e_1, \ldots, e_K\},\, \{s_{e_1}, \ldots, s_{e_K}\}).$$

Note that the network $E$ does not generate the weights $\theta_M$ or $\theta_W$ because the flow prediction network $W$ and the soft occlusion map prediction network $M$ are designed for warping the last generated image, and warping is a mechanism that is naturally shared across domains.
We build our few-shot vid2vid framework based on Wang et al. [wang2018video], which is the state-of-the-art for the vid2vid task. Specifically, we reuse their proposed flow prediction network $W$ and the soft occlusion map prediction network $M$. The intermediate image synthesis network $H$ is a conditional image generator. Instead of reusing the architecture proposed by Wang et al. [wang2018video], we adopt the SPADE generator [park2019SPADE], which is the current state-of-the-art semantic image synthesis model.
The SPADE generator contains several spatial modulation branches and a main image synthesis branch. Our network weight generation module $E$ only generates the weights for the spatial modulation branches. This has two main advantages. First, it greatly reduces the number of parameters that $E$ has to generate, which helps avoid overfitting. Second, it avoids creating a shortcut from the example images to the output image, since the generated weights are only used in the spatial modulation modules, which generate the modulation values for the main image synthesis branch. In the following, we discuss details of the design of the network $E$ and the learning objective.
Network weight generation module. As discussed above, the goal of the network weight generation module $E$ is to learn to extract appearance patterns that can be injected into the video synthesis branch by controlling its weights. We first consider the case where only one example image is available ($K = 1$). We then extend the discussion to handle the case of multiple example images.
We decompose $E$ into two sub-networks: an example feature extractor $E_F$, and a multi-layer perceptron $E_P$. The network $E_F$ consists of several convolutional layers and is applied on the example image $e_1$ to extract an appearance representation $q$. The representation $q$ is then fed into $E_P$ to generate the weights $\theta_H$ in the intermediate image synthesis network $H$.
Let the image synthesis network $H$ have $L$ layers $H^l$, where $l \in [1, L]$. We design the weight generation network $E$ to also have $L$ layers, each of which generates the weights for the corresponding $H^l$. Specifically, to generate the weights $\theta_H^l$ for layer $H^l$, we first take the output $q^l$ from the $l$-th layer in $E_F$. Then, we average pool $q^l$ (since $q^l$ might still be a feature map with spatial dimensions) and apply a multi-layer perceptron $E_P^l$ to generate the weights $\theta_H^l$. Mathematically, if we define $q^0 \equiv e_1$, then $q^l = E_F^l(q^{l-1})$, and $\theta_H^l = E_P^l(q^l)$. These generated weights are then used to convolve the current input semantic map $s_t$ to generate the normalization parameters used in SPADE (Figure 2(c)).
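For each layer in the main SPADE generator, we use $\theta_H^l$ to compute the denormalization parameters $\gamma^l$ and $\beta^l$ to denormalize the input features. We note that, in the original SPADE module, the scale map $\gamma^l$ and bias map $\beta^l$ are generated by fixed weights operated on the input semantic map $s_t$. In our setting, these maps are generated by dynamic weights, $\theta_H^l$. Moreover, $\theta_H^l$ contains three sets of weights: $\theta_S^l$, $\theta_\gamma^l$, and $\theta_\beta^l$. $\theta_S^l$ acts as a shared layer to extract common features, and $\theta_\gamma^l$ and $\theta_\beta^l$ take the output of $\theta_S^l$ to generate the $\gamma^l$ and $\beta^l$ maps, respectively. For each BatchNorm layer in $H^l$, we compute the denormalized features $p_H^l$ from the normalized features $\hat{p}_H^l$ by

$$p_S^l = \begin{cases} s_t, & \text{if } l = 0, \\ \sigma(p_S^{l-1} \otimes \theta_S^l), & \text{otherwise,} \end{cases}$$

$$\gamma^l = p_S^l \otimes \theta_\gamma^l, \qquad \beta^l = p_S^l \otimes \theta_\beta^l,$$

$$p_H^l = \gamma^l \odot \hat{p}_H^l + \beta^l,$$

where $\otimes$ stands for convolution, and $\sigma$ is the nonlinearity function.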
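To make the weight generation and denormalization steps above more concrete, here is a minimal NumPy sketch of a single modulation layer. It is an illustration under several assumptions, not the authors' implementation: the generated kernels are restricted to 1x1 convolutions, a single linear projection stands in for the per-layer multi-layer perceptron, ReLU is used as the nonlinearity, and the names `conv1x1`, `generate_weights`, and `dynamic_spade` are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def generate_weights(q, projection, shapes):
    """Average-pool the appearance feature q (C, H, W), project it with one
    linear map (standing in for the per-layer MLP), and split the result
    into kernels of the requested shapes."""
    pooled = q.mean(axis=(1, 2))
    flat = projection @ pooled
    kernels, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        kernels.append(flat[start:start + size].reshape(shape))
        start += size
    return kernels

def dynamic_spade(p_hat, p_s_prev, theta_s, theta_gamma, theta_beta):
    """One modulation layer: shared features, then scale and bias maps,
    then denormalization of the normalized features p_hat."""
    p_s = np.maximum(conv1x1(p_s_prev, theta_s), 0.0)  # ReLU is an assumption
    gamma = conv1x1(p_s, theta_gamma)
    beta = conv1x1(p_s, theta_beta)
    return gamma * p_hat + beta, p_s

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 4))       # appearance feature from the example image
s_t = rng.standard_normal((3, 4, 4))     # semantic map with 3 channels
p_hat = rng.standard_normal((6, 4, 4))   # normalized generator features
shapes = [(5, 3), (6, 5), (6, 5)]        # shared, scale, and bias kernels
total = sum(int(np.prod(s)) for s in shapes)
projection = rng.standard_normal((total, 8))
theta_s, theta_gamma, theta_beta = generate_weights(q, projection, shapes)
p_h, p_s = dynamic_spade(p_hat, s_t, theta_s, theta_gamma, theta_beta)
```

The example image influences the output only through the generated kernels, which act on the semantic map; the main synthesis features `p_hat` never see the example image directly, which is the shortcut-avoidance property discussed earlier.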
Attention-based aggregation ($K > 1$). In addition, we want $E$ to be capable of extracting the patterns from an arbitrary number of example images. As different example images may carry different appearance patterns, and they have different degrees of relevance to different input images, we design an attention mechanism [xu2015show, vaswani2017attention] to aggregate the extracted appearance patterns $q_1, \ldots, q_K$.
To this end, we construct a new attention network $E_A$, which consists of several fully convolutional layers. $E_A$ is applied to each of the semantic images of the example images, $s_{e_k}$. This results in a key vector $a_k \in \mathbb{R}^{C \times N}$, where $C$ is the number of channels and $N$ is the spatial dimension of the feature map. We also apply $E_A$ to the current input semantic image $s_t$ to extract its key vector $a_t \in \mathbb{R}^{C \times N}$. We then compute the attention weight $\alpha_k \in \mathbb{R}^{N \times N}$ by taking the matrix product $\alpha_k = (a_k)^T \otimes a_t$. The attention weights are then used to compute a weighted average of the appearance representation, $q = \sum_{k=1}^{K} q_k \otimes \alpha_k$, which is then fed into the multi-layer perceptron $E_P$ to generate the network weights (Figure 2(b)). This aggregation mechanism is helpful when different example images contain different parts of the subject. For example, when example images include both front and back of the target person, the attention maps can help capture corresponding body parts during synthesis (Figure 7(c)).
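The aggregation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `aggregate` is hypothetical, and the softmax over all example locations is an assumed normalization that turns the raw matrix products into weights of a weighted average.

```python
import numpy as np

def aggregate(keys_ex, key_t, feats_ex):
    """Attention-weighted average of example appearance features.

    keys_ex:  (K, C, N)  key vectors of the K example semantic images
    key_t:    (C, N)     key vector of the current semantic image
    feats_ex: (K, Cq, N) appearance features of the K example images
    Returns the aggregated feature (Cq, N) and the weights (K, N, N).
    """
    K, _, N = keys_ex.shape
    # Raw weights: for each example k, the (N, N) product of its keys
    # (transposed) with the current keys.
    alpha = np.einsum("kci,cj->kij", keys_ex, key_t)
    # Assumed normalization: softmax over all K * N example locations,
    # separately for each location j of the current frame.
    flat = alpha.reshape(K * N, N)
    flat = np.exp(flat - flat.max(axis=0, keepdims=True))
    flat = flat / flat.sum(axis=0, keepdims=True)
    alpha = flat.reshape(K, N, N)
    # Weighted average: sum over examples of feature-times-weight products.
    q = np.einsum("kdi,kij->dj", feats_ex, alpha)
    return q, alpha

rng = np.random.default_rng(0)
K, C, Cq, N = 3, 4, 5, 6
keys_ex = rng.standard_normal((K, C, N))
key_t = rng.standard_normal((C, N))
feats_ex = rng.standard_normal((K, Cq, N))
q, alpha = aggregate(keys_ex, key_t, feats_ex)
```

Because the weights for each current-frame location sum to one, every column of `q` is a convex combination of example features, so each output location can draw on whichever example image (and region) best matches it.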
Warping example images. To ease the burden of the image synthesis network, we can also (optionally) warp the given example image and combine it with the intermediate synthesized output $\tilde{h}_t$. Specifically, we make the model estimate an additional flow $\tilde{w}_{e_t}$ and mask $\tilde{m}_{e_t}$, which are used to warp the example image $e_1$ to the current input semantics, similar to how we warp and combine with previous frames. The new intermediate image then becomes

$$\tilde{h}'_t = (\mathbf{1} - \tilde{m}_{e_t}) \odot \tilde{w}_{e_t}(e_1) + \tilde{m}_{e_t} \odot \tilde{h}_t.$$

In the case of multiple example images, we pick $e_1$ to be the image that has the largest similarity score to the current frame by looking at the attention weights $\alpha$. In practice, we found this helpful when example and target images are similar in most regions, such as synthesizing poses where the background remains static.
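A small sketch of the two steps in this paragraph follows: choosing which example image to warp, and blending the warped example with the intermediate image. Using the total attention mass per example as the similarity score is an assumption made for illustration, as are the function names; the warped example is taken as given here.

```python
import numpy as np

def pick_example(alpha):
    """alpha: (K, N, N) attention weights. Return the index of the example
    with the largest total attention mass (an assumed similarity score)."""
    return int(alpha.sum(axis=(1, 2)).argmax())

def combine_with_example(warped_example, mask_e, intermediate):
    """Blend (1 - m_e) * warped_example + m_e * intermediate per pixel.

    warped_example, intermediate: (H, W, C); mask_e: (H, W) with values in [0, 1].
    """
    m = mask_e[..., None]
    return (1.0 - m) * warped_example + m * intermediate

# Toy example: the second of two example images attracts all the attention.
alpha = np.zeros((2, 3, 3))
alpha[1] = 1.0
best = pick_example(alpha)

mask_e = np.array([[0.0, 1.0], [0.5, 0.25]])
blended = combine_with_example(np.zeros((2, 2, 3)), mask_e, np.ones((2, 2, 3)))
```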
Training. We use the same learning objective as in the vid2vid framework [wang2018video]. But instead of training the vid2vid model using data from one domain, we use data from multiple domains. In Figure 7(a), we show that the performance of our few-shot vid2vid model is positively correlated with the number of domains included in the training dataset. This shows that our model can gain from increased visual experiences. Our framework is trained in the supervised setting where paired $s_1^T$ and $x_1^T$ are available. We train our model to convert $s_1^T$ to $x_1^T$ by using $K$ example images randomly sampled from $x_1^T$. We adopt a progressive training technique, which gradually increases the length of training sequences. Initially, we set $T = 1$, which means the network only generates single frames. After that, we double the sequence length ($T = 2, 4, 8, \ldots$) every few epochs.
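The progressive schedule can be written as a one-line function. The parameters `epochs_per_stage` and `max_length` are hypothetical knobs chosen for this sketch; the text above only states that the length starts at one frame and doubles every few epochs.

```python
def sequence_length(epoch, epochs_per_stage, max_length):
    """Start with single frames and double the training sequence length
    every `epochs_per_stage` epochs, capped at `max_length`."""
    return min(2 ** (epoch // epochs_per_stage), max_length)

# Sequence length at epochs 0, 5, 10, 15, 20, 25 with a doubling every 5 epochs.
schedule = [sequence_length(e, epochs_per_stage=5, max_length=16)
            for e in range(0, 30, 5)]
print(schedule)  # [1, 2, 4, 8, 16, 16]
```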
Inference. At test time, our model can take an arbitrary number of example images. In Figure 7(b), we show that our performance is positively correlated with the number of example images. Moreover, we can also (optionally) finetune the network using the given example images to improve performance. Note that we only finetune the weight generation module $E$ and the intermediate image synthesis network $H$, and leave all parameters related to flow estimation ($\theta_M$, $\theta_W$) fixed. We found this can better preserve the person identity in the example image.
Implementation details. Our training procedure follows the procedure from the vid2vid work [wang2018video]. We use the ADAM optimizer [kingma2014adam] with and . Training was conducted using an NVIDIA DGX-1 machine with 8 32GB V100 GPUs.
Datasets. We adopt three video datasets to validate our method.
YouTube dancing videos. It consists of dancing videos from YouTube. We divide them into a training set and a test set with no overlapping subjects. Each video is further divided into short clips of continuous motions. This yields about clips for training. At each iteration, we randomly pick a clip and select one or more frames in the same clip as the example images. At test time, both the example images and the input human poses are not seen during training.
Street-scene videos. We use street-scene videos from three different geographical areas: 1) Germany, from the Cityscapes dataset [Cordts2016cityscapes], 2) Boston, collected using a dashcam, and 3) NYC, collected by a different dashcam. We apply a pretrained segmentation network [wu2018cgnet] to get the segmentation maps. Again, during training, we randomly select one frame of the same area as the example image. At test time, in addition to the test set images from these three areas, we also test on the ApolloScape [huang2018apolloscape] and CamVid [brostow2008camvid] datasets, which are not included in the training set.
Face videos. We use the real videos in the FaceForensics dataset [rossler2018faceforensics], which contains videos of news briefing from different reporters. We split the dataset into videos for training and videos for validation. We extract sketches from the input videos similar to vid2vid, and select one frame of the same video as the example image to convert sketches to face videos.
Baselines. Since no existing vid2vid method can adapt to unseen domains using few example images, we construct 3 strong baselines that consider different ways of achieving the target generalization capability. For the following comparisons and figures, all methods use 1 example image.
Encoder. In this baseline approach, we encode the example images into a style vector and then decode the features using the image synthesis branch in our $H$ to generate $\tilde{h}_t$.
ConcatStyle. In this baseline approach, we also encode the example images into a style vector. However, instead of directly decoding the style vector using the image synthesis branch in our $H$, it concatenates the vector with each of the input semantic images to produce an augmented semantic input image. This image is then used as input to the spatial modulation branches in our $H$ for generating the intermediate image $\tilde{h}_t$.
AdaIN. In this baseline, we insert an AdaIN normalization layer after each spatial modulation layer in the image synthesis branch of $H$. We generate the AdaIN normalization parameters by feeding the example images to an encoder, similar to the FUNIT method [liu2019few].
In addition to these baselines, for the human synthesis task, we also compare our approach with the following methods using the pretrained models provided by the authors.
PoseWarp [balakrishnan2018synthesizing] synthesizes humans in unseen poses using an example image. The idea is to assume each limb undergoes a similarity transformation. The final output image is obtained by combining all transformed limbs together.
MonkeyNet [Siarohin2019monkeynet] is proposed for transferring motions from a sequence to a still image. It first detects keypoints in the images, and then predicts their flows for warping the still image.
| | YouTube dancing videos | | | Street-scene videos | | | |
| Method | Pose Error | FID | Human Pref. | Pixel Acc | mIoU | FID | Human Pref. |
Evaluation metrics. We use the following metrics for quantitative evaluation.
Fréchet Inception Distance (FID) [heusel2017gans] measures the distance between the distributions of real data and generated data. It is commonly used to quantify the fidelity of synthesized images.
Pose error. We estimate the poses of the synthesized subjects using OpenPose [cao2018openpose]. This renders a set of joint locations for each video frame. We then compute the absolute error in pixels between the estimated pose and the original pose input to the model. The idea behind this metric is that if the image is well-synthesized, a well-trained human pose estimation network should be able to recover the original pose used to synthesize the image. We note similar ideas were used in evaluating image synthesis performance in several prior works [isola2017image, wang2017high, wang2018video].
Segmentation accuracy. To evaluate the performance of street scene videos, we run a state-of-the-art street scene segmentation network on the result videos generated by all the competing methods. We then report the pixel accuracy and mean intersection-over-union (IoU) ratio. The idea of using segmentation accuracy as a performance metric follows the discussion of using the pose error as discussed above.
Human subjective score. Finally, we use Amazon Mechanical Turk (AMT) to evaluate the quality of generated videos. We perform AB tests where we provide the user videos from two different approaches and ask them to choose the one with better quality. For each pair of comparisons, we generate clips, each of them viewed by workers. Orders are randomized.
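For the pose error and segmentation accuracy metrics listed above, a plain-Python sketch is given below. It is illustrative only: averaging the absolute x and y differences over detected joints, skipping undetected joints, and ignoring classes absent from both prediction and ground truth are assumptions, and the function names are hypothetical.

```python
def pose_error(estimated, reference):
    """Mean absolute coordinate error in pixels.

    Each pose is a list of (x, y) joints; None marks a joint that was not
    detected, and such joints are skipped (an assumed convention).
    """
    total, count = 0.0, 0
    for est, ref in zip(estimated, reference):
        if est is None or ref is None:
            continue
        total += abs(est[0] - ref[0]) + abs(est[1] - ref[1])
        count += 2
    return total / count if count else float("nan")

def segmentation_scores(pred, target, num_classes):
    """Pixel accuracy and mean IoU for two flat lists of class labels."""
    pairs = list(zip(pred, target))
    pixel_acc = sum(p == t for p, t in pairs) / len(pairs)
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return pixel_acc, sum(ious) / len(ious)

err = pose_error([(10, 20), (30, 40), None], [(12, 20), (30, 44), (5, 5)])
acc, miou = segmentation_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(err, acc, miou)
```

In the toy inputs, the two detected joints differ by 2 and 4 pixels in one coordinate each, giving a mean error of 1.5, and four of six pixels are labelled correctly.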
Main results. In Figure 3, we show results of using different example images when synthesizing humans. It can be seen that our method can successfully transfer motion to all the example images. Figure 4 shows comparisons of our approaches against other methods. It can be seen that other methods either generate obvious artifacts or fail to transfer the motion faithfully.
Figure 5 shows the results of synthesizing street scene videos with different example images. It can be seen that even with the same input segmentation map, our method can achieve different visual results using different example images.
Table 1 shows quantitative comparisons of both tasks against the other methods. It can be seen that our method consistently achieves better results than the others on all the performance metrics.
In Figure 6, we show results of using different example images when synthesizing faces. Our method can faithfully preserve the person identity while capturing the motion in the input videos.
Finally, to verify our hypothesis that a larger training dataset helps improve the quality of synthesized videos, we conduct an experiment where part of the dataset is held out during training. We vary the number of videos in the training set and plot the resulting performance in Figure 7(a). We find that the results support our hypothesis. We also evaluate whether having access to more example images at test time helps with the video synthesis performance. As shown in Figure 7(b), the result confirms our assumption.
Limitations. Although our network can, in principle, generalize to unseen domains, it will not perform well when the test domain is too different from the training domains. For example, when testing on CG characters that look very different from real-world people, the network will struggle. In addition, since our network takes semantic estimations such as pose maps or segmentation maps as input, it will also likely fail when these estimations fail.
We presented a few-shot video-to-video synthesis framework that can synthesize videos of unseen subjects or street scene styles at the test time. This was enabled by our novel adaptive network weight generation scheme, which dynamically determines the weights based on the example images. Experimental results showed that our method performs favorably against the competing methods.
Figure 8 shows comparisons with our baseline methods on street scene sequences. Again, our method is the only one that can realistically reproduce the style in the example images, while the others either generate artifacts or fail to capture the style.
In the experiment section, we introduce three baselines for the few-shot video-to-video synthesis task. Here, we provide additional details of their architectures. Note that these baselines are only designed for dealing with one example image, and we compare the performance of the proposed method with these baselines on the one example image setting.
The Encoder Baseline. As discussed in the main paper, the Encoder baseline consists of an image encoder that encodes the example image to a style latent code, which is then directly fed into the head of the main image synthesis branch in the SPADE generator. We visualize the architecture of the Encoder baseline in Figure 9(a).
The ConcatStyle Baseline. As visualized in Figure 9(b), in the ConcatStyle baseline, we also employ an image encoder to encode the example image to a style latent code. Now, instead of feeding the style code into the head of the main image synthesis branch, we concatenate the style code with the input semantic image via a broadcasting operation. The concatenation is the new input semantic image to the SPADE modules.
The AdaIN Baseline. In this baseline, we use the AdaIN [huang2017adain] for adaptive video-to-video synthesis. Specifically, we use an image encoder to encode the example image to a latent vector and use multi-layer perceptrons to convert the latent vector to the mean and variance vectors for the AdaIN operations. The AdaIN parameters are fed into each layer of the main image synthesis branch. Specifically, we add an AdaIN normalization layer after a SPADE normalization layer as shown in Figure 9(c).
In the AdaIN, information from the example image is represented as a scaling vector and a bias vector. This operation could be considered as a 1x1 convolution with a group size equal to the channel size. From this perspective, the AdaIN is a constrained case of the proposed weight generation scheme, since our scheme can generate a convolutional kernel with a group size equal to 1 and a kernel size larger than 1x1. Moreover, the proposed scheme can be easily combined with the SPADE module. Specifically, we use the proposed generation scheme to generate weights for the SPADE layers, which in turn generate spatially adaptive de-modulation parameters. To justify the importance of weight generation, we compare with AdaIN both using weighted average and our attention module (Figure 10(a)).
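The claim that AdaIN amounts to a 1x1 convolution whose group size equals the channel count can be checked numerically. In the sketch below (helper names are hypothetical), a grouped 1x1 convolution with one channel per group is represented by a diagonal weight matrix, and applying it to instance-normalized features reproduces the AdaIN output.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of x (C, H, W) to zero mean and unit variance."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adain(x, scale, bias):
    """Per-channel scale and bias applied to instance-normalized features."""
    return scale[:, None, None] * instance_norm(x) + bias[:, None, None]

def conv1x1(x, weight, bias):
    """Full 1x1 convolution: weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))
scale = rng.standard_normal(4)
bias = rng.standard_normal(4)

as_adain = adain(x, scale, bias)
# A diagonal kernel mixes no channels, i.e. one channel per group.
as_conv = conv1x1(instance_norm(x), np.diag(scale), bias)
print(np.allclose(as_adain, as_conv))  # True
```

A generated kernel with a single group corresponds to a full (non-diagonal) weight matrix, which can additionally mix channels; this is the extra expressiveness the paragraph above attributes to weight generation.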
We also compare with AdaIN when different dataset sizes are used. The assumption is that when the dataset is small, both methods are able to capture the diversity in the dataset. However, as the dataset size grows larger, AdaIN starts to fail because its expressiveness is limited, as shown in Figure 10(c).
As discussed in the main paper, a disadvantage of vid2vid is that it requires different models for different persons or cities. For example, such models typically need minutes of training data and days of training time, while our method only needs one image and negligible time for weight generation.
To compare our performance with vid2vid, we show quantitative comparisons in Figure 10(b) for synthesizing a specific person. We find that our model renders comparable results even when . Moreover, if we further finetune our model based on the example images, we can achieve comparable or even better performance.