Few-shot Video-to-Video Synthesis

October 28, 2019 · Ting-Chun Wang et al.

Video-to-video synthesis (vid2vid) aims at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video. While the state-of-the-art of vid2vid has advanced significantly, existing approaches share two major limitations. First, they are data-hungry. Numerous images of a target human subject or a scene are required for training. Second, a learned model has limited generalization capability. A pose-to-human vid2vid model can only synthesize poses of the single person in the training set. It does not generalize to other humans that are not in the training set. To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. Our model achieves this few-shot generalization capability via a novel network weight generation module utilizing an attention mechanism. We conduct extensive experimental validations with comparisons to strong baselines using several large-scale video datasets including human-dancing videos, talking-head videos, and street-scene videos. The experimental results verify the effectiveness of the proposed framework in addressing the two limitations of existing vid2vid approaches.


1 Introduction

Video-to-video synthesis (vid2vid) refers to the task of converting an input semantic video to an output photorealistic video. It has a wide range of applications, including generating a human-dancing video from a human pose sequence [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance], or generating a driving video from a segmentation mask sequence [wang2018video]. Typically, to obtain such a model, one begins by collecting a training dataset for the target task. It could be a set of videos of a target person performing diverse actions or a set of street-scene videos captured by a camera mounted on a car driving in a city. The dataset is then used to train a model that converts novel input semantic videos to corresponding photorealistic videos at test time. In other words, we expect that a vid2vid model trained on human videos can generate videos of the same person performing novel actions that are not in the training set, and that a street-scene vid2vid model can generate videos of novel street scenes in the same style as those in the training set. With the advance of the generative adversarial networks (GANs) framework [goodfellow2014generative] and its image-conditional extensions [isola2017image, wang2017high], existing vid2vid approaches have shown promising results.

We argue that generalizing to novel input semantic videos is insufficient. One should also aim for a model that can generalize to unseen domains, such as generating videos of human subjects that are not included in the training dataset. Ideally, a vid2vid model should be able to synthesize videos of unseen domains by leveraging just a few example images given at test time. If a vid2vid model cannot generalize to unseen persons or scene styles, then we must train a new model for each new subject or scene style. Moreover, if a vid2vid model cannot achieve this domain generalization capability with only a few example images, then one has to collect many images for each new subject or scene style, which makes the approach difficult to scale. Unfortunately, existing vid2vid approaches suffer from these drawbacks because they do not consider such generalization.

To address these limitations, we propose the few-shot vid2vid framework. The few-shot vid2vid framework takes two inputs for generating a video, as shown in Figure 1. In addition to the input semantic video as in vid2vid, it takes a second input, which consists of a few example images of the target domain made available at test time. Note that this is absent in existing vid2vid approaches [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. Our model uses these few example images to dynamically configure the video synthesis mechanism via a novel network weight generation mechanism. Specifically, we train a module to generate the network weights using the example images. We carefully design the learning objective function to facilitate learning the network weight generation module.

Figure 1: Comparison between vid2vid (left) and the proposed few-shot vid2vid (right). Existing vid2vid methods [chan2018everybody, gafni2019vid2game, wang2018video] do not consider generalization to unseen domains. A trained model can only be used to synthesize videos similar to those in the training set. For example, a vid2vid model can only be used to generate videos of the person in the training set. To synthesize a new person, one needs to collect a dataset of the new person and use it to train a new vid2vid model. In contrast, our few-shot vid2vid model does not have these limitations. Our model can synthesize videos of new persons by leveraging a few example images provided at test time.

We conduct extensive experimental validation with comparisons to various baseline approaches using several large-scale video datasets including dance videos, talking-head videos, and street-scene videos. The experimental results show that the proposed approach effectively addresses the limitations of existing vid2vid frameworks. Moreover, we show that the performance of our model is positively correlated with the diversity of the videos in the training dataset, as well as with the number of example images available at test time. When the model sees more domains at training time, it generalizes better to unseen domains (Figure 7(a)). When the model is given more example images at test time, the quality of the synthesized videos improves (Figure 7(b)).

2 Related Work

GANs. The proposed few-shot vid2vid model is based on GANs [goodfellow2014generative]. Specifically, we use a conditional GAN framework. Instead of generating outputs by converting samples from some noise distribution [goodfellow2014generative, radford2015unsupervised, liu2016coupled, gulrajani2017improved, karras2018style], we generate outputs based on user input data, which allows more flexible control over the outputs. The user input data can take various forms, including images [isola2017image, zhu2017unpaired, liu2016unsupervised, park2019SPADE], categorical labels [odena2016conditional, miyato2018cgans, zhang2019self, brock2018large], textual descriptions [reed2016generative, zhang2017stackgan, xu2018attngan], and videos [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. Our model belongs to the last category. However, different from existing video-conditional GANs, which take the video as the sole data input, our model also takes a set of example images. These example images are provided at test time, and we use them to dynamically determine the network weights of our video synthesis model through a novel network weight generation module. This helps the network generate videos of unseen domains.

Image-to-image synthesis, which transfers an input image from one domain to a corresponding image in another domain [isola2017image, taigman2016unsupervised, bousmalis2016unsupervised, shrivastava2016learning, zhu2017unpaired, liu2016unsupervised, huang2018multimodal, zhu2017toward, wang2017high, choi2017stargan, park2019SPADE, liu2019few, benaim2018one], is the foundation of vid2vid. For videos, the new challenge lies in generating sequences of frames that are not only photorealistic individually but also temporally consistent as a whole. Recently, FUNIT [liu2019few] was proposed for generating images of unseen domains via the adaptive instance normalization technique [huang2017adain]. Our work is different in that we aim for video synthesis and achieve generalization to unseen domains via a network weight generation scheme. We compare these techniques in the experiment section.

Video generative models can be divided into three main categories, including 1) unconditional video synthesis models [vondrick2016generating, saito2017temporal, tulyakov2017mocogan], which convert random noise samples to video clips, 2) future video prediction models [srivastava2015unsupervised, kalchbrenner2016video, finn2016unsupervised, mathieu2015deep, lotter2016deep, xue2016visual, walker2016uncertain, walker2017pose, denton2017unsupervised, villegas2017decomposing, liang2017dual, lee2018stochastic, hu2018video, li2018flow, hao2018controllable, pan2019video], which generate future video frames based on the observed ones, and 3) vid2vid models [wang2018video, chan2018everybody, gafni2019vid2game, zhou2019dance], which convert semantic input videos to photorealistic videos. Our work belongs to the last category, but in contrast to the prior works, we aim for a vid2vid model that can synthesize videos of unseen domains by leveraging few example images given at test time.

Adaptive networks refer to networks where part of the weights are dynamically computed based on the input data. This class of networks has a different inductive bias than regular networks and has found use in several tasks including sequence modeling [ha2016hypernetworks], image filtering [jia2016dynamic, wu2018dynamic, su2019pixel], frame interpolation [niklaus2017video, niklaus2017video2], and neural architecture search [zhang2019graph]. Here, we apply it to the vid2vid task.
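To make the idea concrete, the sketch below shows a toy adaptive convolution whose kernel is predicted from a conditioning vector by a small hypernetwork. It illustrates the general concept only, not code from any of the cited works; all module names and sizes are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv2d(nn.Module):
    """A conv layer whose weights are predicted from a conditioning vector."""
    def __init__(self, cond_dim, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # Hypernetwork: maps the conditioning vector to a flat kernel tensor.
        self.hyper = nn.Linear(cond_dim, out_ch * in_ch * k * k)

    def forward(self, x, cond):
        # Predict the convolution kernel from the conditioning input.
        w = self.hyper(cond).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)

layer = AdaptiveConv2d(cond_dim=16, in_ch=8, out_ch=8)
x = torch.randn(1, 8, 32, 32)   # input feature map
cond = torch.randn(16)          # e.g. features extracted from an example image
y = layer(x, cond)              # the effective weights depend on `cond`
print(y.shape)                  # torch.Size([1, 8, 32, 32])
```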

Human pose transfer synthesizes a human in an unseen pose by utilizing an image of the human in a different pose. To achieve high quality generation results, existing human pose transfer methods largely utilize human body priors such as body part modeling [balakrishnan2018synthesizing] or human surface-based coordinate mapping [neverova2018dense]. Our work differs from these works in that our method is more general. We do not use specific human body priors other than the input semantic video. As a result, the same model can be directly used for other vid2vid tasks such as street scene video synthesis, as shown in Figure 5. Moreover, our model is designed for video synthesis, while existing human pose transfer methods are mostly designed for still image synthesis and do not consider the temporal aspect of the problem. As a result, our method renders more temporally consistent results (Figure 4).

3 Few-shot Video-to-Video Synthesis

Video-to-video synthesis aims at learning a mapping function that can convert a sequence of input semantic images (for example, segmentation masks or images denoting a human pose), $s_1^T \equiv \{s_1, s_2, ..., s_T\}$, to a sequence of output images, $\tilde{x}_1^T \equiv \{\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_T\}$, in a way that the conditional distribution of $\tilde{x}_1^T$ given $s_1^T$ is similar to the conditional distribution of the ground truth image sequence, $x_1^T \equiv \{x_1, x_2, ..., x_T\}$, given $s_1^T$. In other words, it aims to achieve $D(p(\tilde{x}_1^T | s_1^T), p(x_1^T | s_1^T)) \rightarrow 0$, where $D$ is a distribution divergence measure such as the Jensen-Shannon divergence or the Wasserstein distance. To model the conditional distribution, existing works make a simplified Markov assumption, leading to a sequential generative model $F$ given by

$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t})$   (1)

In other words, it generates the output image, $\tilde{x}_t$, based on the observed input semantic images, $s_{t-\tau}^{t}$, and the $\tau$ past generated images, $\tilde{x}_{t-\tau}^{t-1}$. The sequential generator $F$ can be modeled in several different ways [chan2018everybody, gafni2019vid2game, wang2018video, zhou2019dance]. A popular choice is to use an image matting function given by

$\tilde{x}_t = (\mathbf{1} - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \tilde{h}_t$   (2)

where $\mathbf{1}$ is an image of all ones, $\odot$ is the element-wise product operator, $\tilde{m}_t$ is a soft occlusion map, $\tilde{w}_{t-1}$ is the optical flow from $t-1$ to $t$, and $\tilde{h}_t$ is a synthesized intermediate image.

Figure 2(a) visualizes the vid2vid architecture and the matting function, which shows that the output image is generated by combining the optical-flow-warped version of the last generated image, $\tilde{w}_{t-1}(\tilde{x}_{t-1})$, and the synthesized intermediate image, $\tilde{h}_t$. The soft occlusion map, $\tilde{m}_t$, dictates how these two images are combined at each pixel location. Intuitively, if a pixel is observed in the previously generated frame, the combination favors duplicating the pixel value from the warped image. In practice, these quantities are generated via neural-network-parameterized functions $M$, $W$, and $H$:

$\tilde{m}_t = M_{\theta_M}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t})$   (3)
$\tilde{w}_{t-1} = W_{\theta_W}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t})$   (4)
$\tilde{h}_t = H_{\theta_H}(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t})$   (5)

where $\theta_M$, $\theta_W$, and $\theta_H$ are learnable parameters. They are kept fixed once training is done.
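As a rough illustration of Eqs. (2)-(5), the sketch below composites a flow-warped previous frame with a synthesized intermediate image using a soft occlusion map. The networks M, W, and H are passed in as stand-in callables; only the warping and compositing logic follows the equations above, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (N,C,H,W) with a dense pixel-space `flow` (N,2,H,W)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid[:, 0] = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(image, grid.permute(0, 2, 3, 1), align_corners=True)

def generate_frame(prev_fake, curr_semantic, M, W, H):
    """One step of the sequential generator, following Eq. (2) with stub networks."""
    m = torch.sigmoid(M(prev_fake, curr_semantic))       # soft occlusion map in [0, 1]
    flow = W(prev_fake, curr_semantic)                    # optical flow t-1 -> t
    h_t = torch.tanh(H(prev_fake, curr_semantic))         # intermediate synthesized image
    warped_prev = warp(prev_fake, flow)
    # Eq. (2): blend the warped previous frame and the intermediate image.
    return (1 - m) * warped_prev + m * h_t

# Toy usage with stub networks that return tensors of plausible shapes.
N, C, Hh, Ww = 1, 3, 64, 64
prev = torch.rand(N, C, Hh, Ww)
sem = torch.rand(N, 8, Hh, Ww)
M = lambda x, s: torch.zeros(N, 1, Hh, Ww)            # occlusion logits
W = lambda x, s: torch.zeros(N, 2, Hh, Ww)            # zero flow
Hn = lambda x, s: torch.rand(N, C, Hh, Ww) * 2 - 1    # fake intermediate image
frame = generate_frame(prev, sem, M, W, Hn)
print(frame.shape)                                     # torch.Size([1, 3, 64, 64])
```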

Figure 2: (a) Architecture of the vid2vid framework [wang2018video]. (b) Architecture of the proposed few-shot vid2vid framework. It consists of a network weight generation module $E$ that maps example images to part of the network weights for video synthesis. The module consists of three sub-networks: $E_F$, $E_P$, and $E_A$ (used when $K > 1$). The sub-network $E_F$ extracts features from the example images. When there are multiple example images ($K > 1$), $E_A$ combines the extracted features by estimating soft attention maps and computing a weighted average of the different extracted features. The final representation is then fed into the network $E_P$ to generate the weights $\theta_H$ for the image synthesis network $H$.

Few-shot vid2vid. While the sequential generator $F$ in (1) is trained for converting novel input semantic videos, it is not trained for synthesizing videos of unseen domains. For example, a model trained for a particular person can only be used to generate videos of the same person. In order to adapt to unseen domains, we let $F$ depend on extra inputs. Specifically, we let $F$ take two more input arguments: one is a set of $K$ example images $\{e_1, e_2, ..., e_K\}$ of the target domain, and the other is the set of their corresponding semantic images $\{s_{e_1}, s_{e_2}, ..., s_{e_K}\}$. That is

$\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}, \{e_1, ..., e_K\}, \{s_{e_1}, ..., s_{e_K}\})$   (6)

This modeling allows $F$ to leverage the example images given at test time to extract useful patterns for synthesizing videos of the unseen domain. We propose a network weight generation module $E$ for extracting the patterns. Specifically, $E$ is designed to extract patterns from the provided example images and use them to compute the network weights $\theta_H$ for the intermediate image synthesis network $H$:

$\theta_H = E(\tilde{x}_{t-\tau}^{t-1}, s_{t-\tau}^{t}, \{e_1, ..., e_K\}, \{s_{e_1}, ..., s_{e_K}\})$   (7)

Note that the network $E$ does not generate the weights $\theta_W$ or $\theta_M$, because the flow prediction network $W$ and the soft occlusion map prediction network $M$ are designed for warping the last generated image, and warping is a mechanism that is naturally shared across domains.

We build our few-shot vid2vid framework on Wang et al. [wang2018video], which is the state-of-the-art for the vid2vid task. Specifically, we reuse their flow prediction network $W$ and soft occlusion map prediction network $M$. The intermediate image synthesis network $H$ is a conditional image generator. Instead of reusing the architecture proposed by Wang et al. [wang2018video], we adopt the SPADE generator [park2019SPADE], the current state-of-the-art semantic image synthesis model.

The SPADE generator contains several spatial modulation branches and a main image synthesis branch. Our network weight generation module only generates the weights for the spatial modulation branches. This has two main advantages. First, it greatly reduces the number of parameters that $E$ has to generate, which helps avoid overfitting. Second, it avoids creating a shortcut from the example images to the output image, since the generated weights are only used in the spatial modulation modules, which generate the modulation values for the main image synthesis branch. In the following, we discuss details of the design of the network $E$ and the learning objective.

Network weight generation module. As discussed above, the goal of the network weight generation module $E$ is to learn to extract appearance patterns that can be injected into the video synthesis branch by controlling its weights. We first consider the case where only one example image is available ($K = 1$). We then extend the discussion to handle the case of multiple example images.

We decompose $E$ into two sub-networks: an example feature extractor $E_F$ and a multi-layer perceptron $E_P$. The network $E_F$ consists of several convolutional layers and is applied to the example image $e_1$ to extract an appearance representation $q$. The representation $q$ is then fed into $E_P$ to generate the weights $\theta_H$ of the intermediate image synthesis network $H$.

Let the image synthesis network $H$ have $L$ layers $H^l$, where $l \in \{1, 2, ..., L\}$. We design the weight generation network $E_P$ to also have $L$ layers $E_P^l$, each of which generates the weights for the corresponding $H^l$. Specifically, to generate the weights $\theta_H^l$ for layer $H^l$, we first take the output $q^l$ from the $l$-th layer in $E_F$. We then average pool $q^l$ (since $q^l$ may still be a feature map with spatial dimensions) and apply a multi-layer perceptron $E_P^l$ to generate the weights $\theta_H^l$. Mathematically, if we define $q^0 = e_1$, then $q^l = E_F^l(q^{l-1})$ and $\theta_H^l = E_P^l(\mathrm{AvgPool}(q^l))$, where $E_F^l$ denotes the $l$-th layer of $E_F$. These generated weights are then used to convolve the current input semantic map $s_t$ to generate the normalization parameters used in SPADE (Figure 2(c)).
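The following sketch shows one plausible reading of this weight generation module: a small convolutional encoder standing in for $E_F$, whose per-layer features are average pooled and mapped by per-layer MLPs standing in for $E_P^l$ to flat weight vectors $\theta_H^l$. Layer counts, channel sizes, and weight dimensions are invented for illustration and are not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Sketch of E_F + per-layer E_P: turns an example image into per-layer
    weight vectors for the spatial modulation branches (sizes are illustrative)."""
    def __init__(self, num_layers=4, feat_ch=64, weight_dims=None):
        super().__init__()
        # Stand-in for E_F: a small conv encoder; its l-th feature map feeds the l-th MLP.
        self.encoder = nn.ModuleList([
            nn.Conv2d(3 if l == 0 else feat_ch, feat_ch, 3, stride=2, padding=1)
            for l in range(num_layers)
        ])
        weight_dims = weight_dims or [1024] * num_layers  # size of theta_H^l per layer
        # Stand-in for E_P^l: one MLP per layer of the image synthesis network H.
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_ch, 256), nn.ReLU(), nn.Linear(256, d))
            for d in weight_dims
        ])

    def forward(self, example_image):
        thetas, q = [], example_image
        for conv, mlp in zip(self.encoder, self.mlps):
            q = torch.relu(conv(q))        # q^l: feature map of E_F at layer l
            pooled = q.mean(dim=(2, 3))    # average pool the spatial dimensions
            thetas.append(mlp(pooled))     # theta_H^l for layer H^l
        return thetas                      # list of flat weight vectors

gen = WeightGenerator()
theta = gen(torch.randn(1, 3, 128, 128))
print([t.shape for t in theta])            # four tensors of shape (1, 1024)
```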

For each layer $H^l$ in the main SPADE generator, we use $\theta_H^l$ to compute the denormalization parameters used to denormalize the input features. We note that, in the original SPADE module, the scale map $\gamma$ and bias map $\beta$ are generated by fixed weights operating on the input semantic map $s_t$. In our setting, these maps are generated by the dynamic weights $\theta_H^l$. Moreover, $\theta_H^l$ contains three sets of weights: $\theta_S^l$, $\theta_\gamma^l$, and $\theta_\beta^l$. $\theta_S^l$ acts as a shared layer to extract common features, while $\theta_\gamma^l$ and $\theta_\beta^l$ take the output of $\theta_S^l$ to generate the $\gamma$ and $\beta$ maps, respectively. For each BatchNorm layer in $H^l$, we compute the denormalized features $\hat{h}_t^l$ from the normalized features $\bar{h}_t^l$ by

$p_t^l = \sigma(s_t \ast \theta_S^l)$   (8)
$\gamma_t^l = p_t^l \ast \theta_\gamma^l, \quad \beta_t^l = p_t^l \ast \theta_\beta^l$   (9)
$\hat{h}_t^l = \gamma_t^l \odot \bar{h}_t^l + \beta_t^l$   (10)

where $\ast$ stands for convolution, and $\sigma$ is the nonlinearity function.
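Below is a minimal sketch of Eqs. (8)-(10), assuming the generated weights are ordinary convolution kernels and the nonlinearity $\sigma$ is a ReLU; the channel counts and 3x3 kernel size are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dynamic_spade(normalized, semantic, theta_s, theta_gamma, theta_beta):
    """Eqs. (8)-(10), roughly: produce gamma/beta maps from the semantic map s_t
    using *generated* conv kernels, then denormalize the BatchNorm'ed activations.
    All kernels are plain conv weight tensors of shape (out, in, k, k)."""
    p = F.relu(F.conv2d(semantic, theta_s, padding=theta_s.shape[-1] // 2))       # shared features
    gamma = F.conv2d(p, theta_gamma, padding=theta_gamma.shape[-1] // 2)          # scale map
    beta = F.conv2d(p, theta_beta, padding=theta_beta.shape[-1] // 2)             # bias map
    return normalized * gamma + beta                                              # Eq. (10)

# Toy shapes: 8 semantic channels, 16 feature channels, 3x3 generated kernels.
s_t = torch.randn(1, 8, 32, 32)
feat = torch.randn(1, 16, 32, 32)         # normalized activations of layer H^l
theta_s = torch.randn(32, 8, 3, 3)        # generated shared weights
theta_gamma = torch.randn(16, 32, 3, 3)   # generated gamma weights
theta_beta = torch.randn(16, 32, 3, 3)    # generated beta weights
out = dynamic_spade(feat, s_t, theta_s, theta_gamma, theta_beta)
print(out.shape)                          # torch.Size([1, 16, 32, 32])
```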

Attention-based aggregation ($K > 1$). In addition, we want $E$ to be capable of extracting patterns from an arbitrary number of example images. As different example images may carry different appearance patterns, and they have different degrees of relevance to different input images, we design an attention mechanism [xu2015show, vaswani2017attention] to aggregate the extracted appearance representations $q_1, ..., q_K$.

To this end, we construct a new attention network $E_A$, which consists of several fully convolutional layers. $E_A$ is applied to each of the semantic images of the example images, $s_{e_1}, ..., s_{e_K}$. This results in a key vector $a_k \in \mathbb{R}^{C \times N}$ for each example, where $C$ is the number of channels and $N$ is the spatial dimension of the feature map. We also apply $E_A$ to the current input semantic image $s_t$ to extract its key vector $a_t$. We then compute the attention weight by taking the matrix product $\alpha_k = (a_k)^T a_t$. The attention weights are then used to compute a weighted average of the appearance representations, which is then fed into the multi-layer perceptron $E_P$ to generate the network weights (Figure 2(b)). This aggregation mechanism is helpful when different example images contain different parts of the subject. For example, when the example images include both the front and the back of the target person, the attention maps can help capture the corresponding body parts during synthesis (Figure 7(c)).
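The sketch below implements one plausible version of this attention-based aggregation; the softmax normalization over example positions is an assumption made for illustration, as are all shapes.

```python
import torch

def aggregate_examples(q_list, key_list, key_t):
    """Attention-weighted aggregation over K example images (Figure 2(b), roughly).
    q_list:   K appearance features from E_F, each of shape (C_q, N)
    key_list: K key maps from E_A on example semantics, each of shape (C, N)
    key_t:    key map of the current semantic frame, shape (C, N)"""
    scores = torch.stack([a_k.t() @ key_t for a_k in key_list], dim=0)  # (K, N, N)
    K, N, _ = scores.shape
    # Normalize over all example positions for every target position (an assumption).
    alpha = torch.softmax(scores.reshape(K * N, N), dim=0).reshape(K, N, N)
    q = torch.stack(q_list, dim=0)                                      # (K, C_q, N)
    # Weighted average of appearance features, one column per target position.
    return torch.einsum('kcn,knm->cm', q, alpha)                        # (C_q, N)

C, C_q, N, K = 32, 64, 16 * 16, 3
agg = aggregate_examples([torch.randn(C_q, N) for _ in range(K)],
                         [torch.randn(C, N) for _ in range(K)],
                         torch.randn(C, N))
print(agg.shape)                                                        # torch.Size([64, 256])
```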

Warping example images. To ease the burden of the image synthesis network, we can also (optionally) warp the given example image and combine it with the intermediate synthesized output $\tilde{h}_t$. Specifically, we make the model estimate an additional flow $\tilde{w}_e$ and mask $\tilde{m}_e$, which are used to warp the example image $e_1$ to the current input semantics, similar to how we warp and combine with previous frames. The new intermediate image then becomes

$\tilde{h}'_t = (\mathbf{1} - \tilde{m}_e) \odot \tilde{w}_e(e_1) + \tilde{m}_e \odot \tilde{h}_t$   (11)

In the case of multiple example images, we pick the example image that has the largest similarity score to the current frame, determined from the attention weights $\alpha_k$. In practice, we found this helpful when example and target images are similar in most regions, such as synthesizing poses where the background remains static.

Training. We use the same learning objective as in the vid2vid framework [wang2018video]. But instead of training the vid2vid model using data from one domain, we use data from multiple domains. In Figure 7(a), we show that the performance of our few-shot vid2vid model is positively correlated with the number of domains included in the training dataset. This shows that our model can gain from increased visual experience. Our framework is trained in a supervised setting where paired $x_1^T$ and $s_1^T$ are available. We train our model to convert $s_1^T$ to $x_1^T$ by using example images randomly sampled from $x_1^T$. We adopt a progressive training technique, which gradually increases the length of the training sequences. Initially, we set $T = 1$, which means the network only generates single frames. After that, we double the sequence length ($T \leftarrow 2T$) every few epochs.
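A minimal sketch of this progressive sequence-length schedule; the number of epochs per stage and the maximum length are illustrative values, and the actual training call is left as a placeholder.

```python
# Start with single frames (T = 1) and double the training sequence length
# every `epochs_per_stage` epochs, up to `max_length`.
def sequence_length_schedule(num_epochs, epochs_per_stage=5, max_length=32):
    T = 1
    for epoch in range(num_epochs):
        yield epoch, T
        if (epoch + 1) % epochs_per_stage == 0:
            T = min(T * 2, max_length)

for epoch, T in sequence_length_schedule(20):
    # train_one_epoch(model, dataloader, seq_len=T)   # training call omitted
    print(f"epoch {epoch}: training on sequences of length {T}")
```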

Inference. At test time, our model can take an arbitrary number of example images. In Figure 7(b), we show that our performance is positively correlated with the number of example images. Moreover, we can also (optionally) finetune the network using the given example images to improve performance. Note that we only finetune the weight generation module $E$ and the intermediate image synthesis network $H$, and leave all parameters related to flow estimation ($W$, $M$) fixed. We found this better preserves the person identity in the example images.

4 Experiments

Implementation details. Our training procedure follows that of the vid2vid work [wang2018video]. We use the ADAM optimizer [kingma2014adam]. Training was conducted on an NVIDIA DGX-1 machine with 8 32GB V100 GPUs.

Datasets. We adopt three video datasets to validate our method.


  • YouTube dancing videos. It consists of dancing videos collected from YouTube. We divide them into a training set and a test set with no overlapping subjects. Each video is further divided into short clips of continuous motions, which serve as training samples. At each iteration, we randomly pick a clip and select one or more frames in the same clip as the example images. At test time, both the example images and the input human poses are unseen during training.

  • Street-scene videos. We use street-scene videos from three different geographical areas: 1) Germany, from the Cityscapes dataset [Cordts2016cityscapes], 2) Boston, collected using a dashcam, and 3) NYC, collected by a different dashcam. We apply a pretrained segmentation network [wu2018cgnet] to get the segmentation maps. Again, during training, we randomly select one frame of the same area as the example image. At test time, in addition to the test set images from these three areas, we also test on the ApolloScape [huang2018apolloscape] and CamVid [brostow2008camvid] datasets, which are not included in the training set.

  • Face videos. We use the real videos in the FaceForensics dataset [rossler2018faceforensics], which contains news-briefing videos from different reporters. We split the dataset into a training set and a validation set. We extract sketches from the input videos as in vid2vid, and select one frame of the same video as the example image when converting sketches to face videos.

Baselines. Since no existing vid2vid method can adapt to unseen domains using only a few example images, we construct three strong baselines that achieve the target generalization capability in different ways. In the following comparisons and figures, all methods use one example image.


  • Encoder. In this baseline approach, we encode the example images into a style vector and then decode the style vector using the image synthesis branch in our $H$ to generate $\tilde{h}_t$.

  • ConcatStyle. In this baseline approach, we also encode the example images into a style vector. However, instead of directly decoding the style vector using the image synthesis branch in our $H$, we concatenate the vector with each of the input semantic images to produce an augmented semantic input image. This image is then used as input to the spatial modulation branches in our $H$ for generating the intermediate image $\tilde{h}_t$.

  • AdaIN. In this baseline, we insert an AdaIN normalization layer after each spatial modulation layer in the image synthesis branch of $H$. We generate the AdaIN normalization parameters by feeding the example images to an encoder, similar to the FUNIT method [liu2019few].

In addition to these baselines, for the human synthesis task, we also compare our approach with the following methods using the pretrained models provided by the authors.


  • PoseWarp [balakrishnan2018synthesizing] synthesizes humans in unseen poses using an example image. The idea is to assume each limb undergoes a similarity transformation. The final output image is obtained by combining all transformed limbs together.

  • MonkeyNet [Siarohin2019monkeynet] was proposed for transferring motion from a video sequence to a still image. It first detects keypoints in the images and then predicts their flows for warping the still image.

                                          |     YouTube Dancing videos        |        Street Scene videos
Method                                    | Pose Error   FID      Human Pref. | Pixel Acc   mIoU    FID      Human Pref.
------------------------------------------+-----------------------------------+------------------------------------------
Encoder                                   | 13.30        234.71   0.96        | 0.400       0.222   187.10   0.97
ConcatStyle                               | 13.32        140.87   0.95        | 0.479       0.240   154.33   0.97
AdaIN                                     | 12.66        207.18   0.93        | 0.756       0.360   205.54   0.87
PoseWarp [balakrishnan2018synthesizing]   | 16.84        180.31   0.83        | N/A         N/A     N/A      N/A
MonkeyNet [Siarohin2019monkeynet]         | 13.73        260.77   0.93        | N/A         N/A     N/A      N/A
Ours                                      | 6.01         80.44    –           | 0.831       0.408   144.24   –
Table 1: Our method outperforms existing pose transfer methods and our baselines for both dancing and street scene video synthesis tasks. For pose error and FID, lower is better. For pixel accuracy and mIoU, higher is better. The human preference score indicates the fraction of subjects favoring results synthesized by our method.

Figure 3: Visualization of human video synthesis results. Given the same pose video but different example images, our method synthesizes realistic videos of the subjects, who are not seen during training. Click the image to play the video clip in a browser.

Figure 4: Comparisons against different baselines for human motion synthesis. Note that the competing methods either have many visible artifacts or completely fail to transfer the motion. Click the image to play the video clip in a browser.

Figure 5: Visualization of street scene video synthesis results. Our approach is able to synthesize videos that realistically reflect the style in the example images even if the style is not included in the training set. Click the image to play the video clip in a browser.

Figure 6: Visualization of face video synthesis results. Given the same input video but different example images, our method synthesizes realistic videos of the subjects, who are not seen during training. Click the image to play the video clip in a browser.
Figure 7: (a) The plot shows that the quality of our synthesized videos improves when the model is trained with a larger dataset. A larger variety helps learn a more generalizable network weight generation module and hence improves adaptation capability. (b) The plot shows that the quality of our synthesized videos is correlated with the number of example images provided at test time. The proposed attention mechanism can take advantage of a larger example set to better generate the network weights. (c) Visualization of attention maps when multiple example images are given. Note that when synthesizing the front of the target, the attention map indicates that the network relies more on the front example image, and vice versa.

Evaluation metrics. We use the following metrics for quantitative evaluation.


  • Fréchet Inception Distance (FID) [heusel2017gans] measures the distance between the distributions of real data and generated data. It is commonly used to quantify the fidelity of synthesized images.

  • Pose error. We estimate the poses of the synthesized subjects using OpenPose [cao2018openpose]. This renders a set of joint locations for each video frame. We then compute the absolute error in pixels between the estimated pose and the original pose input to the model (a minimal sketch of this metric and the segmentation metrics appears after this list). The idea behind this metric is that if the image is well synthesized, a well-trained human pose estimation network should be able to recover the original pose used to synthesize the image. We note that similar ideas were used to evaluate image synthesis performance in several prior works [isola2017image, wang2017high, wang2018video].

  • Segmentation accuracy. To evaluate the performance on street scene videos, we run a state-of-the-art street scene segmentation network on the result videos generated by all the competing methods. We then report the pixel accuracy and mean intersection-over-union (mIoU). The idea of using segmentation accuracy as a performance metric follows the same reasoning as the pose error metric discussed above.

  • Human subjective score. Finally, we use Amazon Mechanical Turk (AMT) to evaluate the quality of the generated videos. We perform A/B tests in which we show workers videos from two different approaches and ask them to choose the one with better quality. For each pairwise comparison, we generate a set of clips, each viewed by multiple workers. Orders are randomized.
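As referenced in the pose error item above, here is a minimal sketch (not the paper's evaluation code) of how the pose error and segmentation scores can be computed from pre-extracted keypoints and label maps; the array shapes and formats are assumptions.

```python
import numpy as np

def pose_error(pred_joints, input_joints):
    """Mean absolute joint position error in pixels.
    Both arrays: (num_frames, num_joints, 2); joint detection (e.g. with OpenPose)
    is assumed to have been run beforehand."""
    return np.abs(pred_joints - input_joints).mean()

def segmentation_scores(pred_labels, gt_labels, num_classes):
    """Pixel accuracy and mean IoU from integer label maps of equal shape."""
    pixel_acc = (pred_labels == gt_labels).mean()
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_labels == c, gt_labels == c).sum()
        union = np.logical_or(pred_labels == c, gt_labels == c).sum()
        if union > 0:
            ious.append(inter / union)
    return pixel_acc, float(np.mean(ious))

pred = np.random.randint(0, 5, (4, 64, 64))
gt = np.random.randint(0, 5, (4, 64, 64))
print(segmentation_scores(pred, gt, num_classes=5))
```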

Main results. In Figure 3, we show results of using different example images when synthesizing humans. Our method successfully transfers the motion to the subjects in all the example images. Figure 4 shows comparisons of our approach against other methods. The other methods either generate obvious artifacts or fail to transfer the motion faithfully.

Figure 5 shows the results of synthesizing street scene videos with different example images. It can be seen that even with the same input segmentation map, our method can achieve different visual results using different example images.

Table 1 shows quantitative comparisons of both tasks against the other methods. It can be seen that our method consistently achieves better results than the others on all the performance metrics.

In Figure 6, we show results of using different example images when synthesizing faces. Our method can faithfully preserve the person identity while capturing the motion in the input videos.

Finally, to verify our hypothesis that a larger training dataset helps improve the quality of synthesized videos, we conduct an experiment where part of the dataset is held out during training. We vary the number of videos in the training set and plot the resulting performance in Figure 7(a). We find that the results support our hypothesis. We also evaluate whether having access to more example images at test time helps with the video synthesis performance. As shown in Figure 7(b), the result confirms our assumption.

Limitations. Although our network can, in principle, generalize to unseen domains, it will not perform well when the test domain is too different from the training domains. For example, when testing on CG characters that look very different from real-world people, the network will struggle. In addition, since our network takes semantic estimates such as pose maps or segmentation maps as input, when these estimates fail, our network will also likely fail.

5 Conclusion

We presented a few-shot video-to-video synthesis framework that can synthesize videos of unseen subjects or street-scene styles at test time. This was enabled by our novel adaptive network weight generation scheme, which dynamically determines the weights based on the example images. Experimental results showed that our method performs favorably against the competing methods.

References

Appendix A Comparisons to Baselines for Street Scenes

Figure 8 shows comparisons with our baseline methods on street scene sequences. Again, our method is the only one that can realistically reproduce the style in the example images, while the others either generate artifacts or fail to capture the style.

Appendix B Details of Our Baseline Methods

In the experiment section, we introduced three baselines for the few-shot video-to-video synthesis task. Here, we provide additional details of their architectures. Note that these baselines are designed for handling one example image, and we compare the proposed method with them in the one-example-image setting.

The Encoder Baseline. As discussed in the main paper, the Encoder baseline consists of an image encoder that encodes the example image to a style latent code, which is then directly fed into the head of the main image synthesis branch in the SPADE generator. We visualize the architecture of the Encoder baseline in Figure 9(a).

The ConcatStyle Baseline. As visualized in Figure 9(b), in the ConcatStyle baseline, we also employ an image encoder to encode the example image into a style latent code. Now, instead of feeding the style code into the head of the main image synthesis branch, we concatenate the style code with the input semantic image via a broadcasting operation. The concatenated result serves as the new input semantic image to the SPADE modules.

The AdaIN Baseline. In this baseline, we use AdaIN [huang2017adain] for adaptive video-to-video synthesis. Specifically, we use an image encoder to encode the example image into a latent vector and use multi-layer perceptrons to convert the latent vector into the mean and variance vectors for the AdaIN operations. The AdaIN parameters are fed into each layer of the main image synthesis branch. Specifically, we add an AdaIN normalization layer after each SPADE normalization layer, as shown in Figure 9(c).

Figure 8: Comparisons against different baseline methods for the street view synthesis task. Note that the proposed approach is the only one that can transfer the styles from the example images to the output videos. Click the image to play the video clip in a browser.
Figure 9: Details of our baselines. (a) The Encoder baseline tries to input the style information of the unseen domain by feeding a style code to the head of the main image synthesis branch in the SPADE generator. (b) The ConcatStyle baseline tries to input the style information of the unseen domain by concatenating the style code with the input semantic image, which is then input to the SPADE modules. (c) The AdaIN baseline tries to input the style information of the unseen domain by using AdaIN modulation.

Appendix C Discussion with AdaIN

In AdaIN, information from the example image is represented as a scaling vector and a bias vector. This operation can be considered a 1x1 convolution with a group size equal to the channel size. From this perspective, AdaIN is a constrained case of the proposed weight generation scheme, since our scheme can generate a convolutional kernel with a group size of 1 and a kernel size larger than 1x1. Moreover, the proposed scheme can be easily combined with the SPADE module. Specifically, we use the proposed generation scheme to generate weights for the SPADE layers, which in turn generate spatially adaptive denormalization parameters. To justify the importance of weight generation, we compare with AdaIN using both weighted averaging and our attention module (Figure 10(a)).
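The sketch below illustrates this argument: AdaIN's per-channel scale and bias are equivalent to a 1x1 convolution with group size equal to the channel count, whereas a generated kernel can have group size 1 and a larger spatial extent. The tensors and sizes are illustrative only.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)   # features to be modulated
scale = torch.randn(16)          # per-channel AdaIN scale from the example image
bias = torch.randn(16)           # per-channel AdaIN bias

# AdaIN viewed as a grouped 1x1 convolution: one 1x1 kernel per channel (groups = C).
w_adain = scale.view(16, 1, 1, 1)
adain_out = F.conv2d(x, w_adain, bias=bias, groups=16)
assert torch.allclose(adain_out,
                      x * scale.view(1, 16, 1, 1) + bias.view(1, 16, 1, 1),
                      atol=1e-5)

# A generated full 3x3 kernel with groups = 1 mixes channels and a spatial
# neighborhood -- a strictly larger family of operations than AdaIN.
w_generated = torch.randn(16, 16, 3, 3)   # e.g. output of the weight generation module
generated_out = F.conv2d(x, w_generated, padding=1)
print(adain_out.shape, generated_out.shape)
```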

We also compare with AdaIN when different dataset sizes are used. The assumption is that when the dataset is small, both methods are able to capture the diversity in the dataset. However, as the dataset grows larger, AdaIN starts to fail since its expressibility is limited, as shown in Figure 10(c).

Figure 10: Comparisons to AdaIN and vid2vid. (a) Both our attention mechanism and our weight generation scheme help improve image quality. (b) As we use more examples at test time, the performance improves. If we are allowed to finetune the model, we can achieve performance comparable to vid2vid. (c) AdaIN achieves good performance when the dataset is small. However, since the capacity of the network is limited, it struggles to handle larger datasets.

Appendix D Comparison with vid2vid

As discussed in the main paper, the disadvantage of vid2vid is that it requires a different model for each person or city. Such a model typically needs minutes of training data and days of training time, while our method only needs one example image and negligible time for weight generation.

To compare our performance with vid2vid, we show quantitative comparisons in Figure 10(b) for synthesizing a specific person. We find that our model renders comparable results even when only one example image is given. Moreover, if we further finetune our model on the example images, we can achieve comparable or even better performance.