
World-Consistent Video-to-Video Synthesis

by   Arun Mallya, et al.

Video-to-video synthesis (vid2vid) aims to convert high-level semantic inputs to photorealistic videos. While existing vid2vid methods can achieve short-term temporal consistency, they fail to ensure long-term consistency. This is because they lack knowledge of the 3D world being rendered and generate each frame only based on the past few frames. To address this limitation, we introduce a novel vid2vid framework that efficiently and effectively utilizes all past generated frames during rendering. This is achieved by condensing the 3D world rendered so far into a physically-grounded estimate of the current frame, which we call the guidance image. We further propose a novel neural network architecture to take advantage of the information stored in the guidance images. Extensive experimental results on several challenging datasets verify the effectiveness of our approach in achieving world consistency: the output video is consistent within the entire rendered 3D world.



1 Introduction

Video-to-video synthesis [wang2018video] concerns generating a sequence of photorealistic images given a sequence of semantic representations extracted from a source 3D world. For example, the representations can be the semantic segmentation masks rendered by a graphics engine while driving a car in a virtual city [wang2018video]. The representations can also be the pose maps extracted from a source video of a person dancing, and the application is to create a video of a different person performing the same dance [chan2019everybody]. From the creation of a new class of digital artworks to applications in computer graphics, the video-to-video synthesis task has many exciting practical use-cases. A key requirement of any such video-to-video synthesis model is the ability to generate images that are not only individually photorealistic, but also temporally smooth. Moreover, the generated images have to follow the geometric and semantic structure of the source 3D world.

While we have observed steady improvement in photorealism and short-term temporal stability in the generation results, we argue that one crucial aspect of the problem has been largely overlooked: long-term temporal consistency. As a specific example, when visiting the same location in the virtual city, an existing vid2vid method [wang2019few, wang2018video] could generate an image that is very different from the one it generated when the car first visited the location, despite using the same semantic inputs. Existing vid2vid methods rely on optical flow warping and generate an image conditioned on the past few generated images. While such operations can ensure short-term temporal stability, they cannot guarantee long-term temporal consistency. Existing vid2vid models have no knowledge of what they have rendered in the past. Even for a short round-trip in a virtual room, these methods fail to preserve the appearances of the wall and the person in the generated video, as illustrated in the teaser figure.

In this paper, we attempt to address the long-term temporal consistency problem, by bolstering vid2vid models with memories of the past frames. By combining ideas from scene flow [vedula1999three] and conditional image synthesis models [park2019semantic], we propose a novel architecture that explicitly enforces consistency in the entire generated sequence. We perform extensive experiments on several benchmark datasets, with comparisons to the state-of-the-art methods. Both quantitative and visual results verify that our approach achieves significantly better image quality and long-term temporal stability. On the application side, we also show that our approach can be used to generate videos consistent across multiple viewpoints, enabling simultaneous multi-agent world creation and exploration.

2 Related work

Semantic Image Synthesis [chen2017photographic, liu2019learning, park2019semantic, qi2018semi, wang2018high] refers to the problem of converting a single input semantic representation to an output photorealistic image. Built on top of the generative adversarial networks (GAN) [goodfellow2014generative] framework, existing methods [liu2019learning, park2019semantic, wang2018high] propose various novel network architectures to advance the state of the art. Our work is built on the SPADE architecture proposed by Park et al. [park2019semantic] but focuses on the temporal stability issue in video synthesis.

Conditional GANs synthesize data conditioned on user input. This stands in contrast to unconditional GANs that synthesize data solely based on random variable inputs [goodfellow2014generative, gulrajani2017improved, karras2017progressive, karras2018style]. Based on the input type, there exist label-conditional GANs [brock2018large, miyato2018cgans, odena2016conditional, zhang2019self], text-conditional GANs [reed2016generative, xu2018attngan, zhang2017stackgan], image-conditional GANs [benaim2018one, bousmalis2016unsupervised, choi2017stargan, huang2018multimodal, isola2017image, lee2018diverse, liu2016unsupervised, liu2019few, shrivastava2016learning, taigman2016unsupervised, zhu2017unpaired], scene-graph conditional GANs [johnson2018image], and layout-conditional GANs [zhao2019image]. Our method is a video-conditional GAN, where we generate a video conditioned on an input video. We address the long-term temporal stability issue that the state-of-the-art overlooks [chan2019everybody, wang2019few, wang2018video].

Video synthesis exists in many forms, including 1) unconditional video synthesis [saito2017temporal, tulyakov2017mocogan, vondrick2016generating], which converts random variable inputs to video clips, 2) future video prediction [denton2017unsupervised, finn2016unsupervised, hao2018controllable, hu2018video, kalchbrenner2016video, lee2018stochastic, li2018flow, liang2017dual, lotter2016deep, mathieu2015deep, pan2019video, srivastava2015unsupervised, villegas2017decomposing, walker2016uncertain, walker2017pose, xue2016visual], which generates future video frames based on the observed ones, and 3) video-to-video synthesis [chan2019everybody, chen2019mocycle, gafni2019vid2game, wang2019few, wang2018video, zhou2019dance], which converts an input semantic video to a real video. Our work belongs to the last category. Our method treats the input video as one from a self-consistent world so that when the agent returns to a spot that it has previously visited, the newly generated frames should be consistent with the past generated frames. While a few works have focused on improving the temporal consistency of an input video [bonneel2015blind, lai2018learning, yao2017occlusion], our method does not treat consistency as a post-processing step, but rather as a core part of the video generation process.

Novel-view synthesis aims to synthesize images at unseen viewpoints given some viewpoints of the scene. Most of the existing works require images at multiple reference viewpoints as input [choi2019extreme, flynn2019deepview, flynn2016deepstereo, hedman2018deep, kalantari2016learning, mildenhall2019local, zhou2018stereo]. While some works can synthesize novel views based on a single image [srinivasan2017learning, wiles2019synsin, xie2016deep3d], the synthesized views are usually close to the reference views. Our work differs from these works in that our input is different: instead of using a set of RGB images, our network takes in a sequence of semantic maps. If we directly treat all past synthesized frames as reference views, the memory requirement grows linearly with respect to the video length. If we only use the latest frames, the system cannot handle long-term consistency, as shown in the teaser figure. Instead, we propose a novel framework to keep track of the synthesis history in this work.

The closest related works are those on neural rendering [aliev2019neural, meshry2019neural, sitzmann2019deepvoxels, thies2019deferred], which can re-render a scene from arbitrary viewpoints after training on a set of given viewpoints. However, note that these methods still require RGB images from different viewpoints as input, making them unsuitable for applications such as game engines. On the other hand, our method can directly generate RGB images using semantic inputs, so rendering a virtual world becomes more effortless. Moreover, they need to train a separate model (or part of the model) for each scene, while we only need one model per dataset, or domain.

3 World-consistent video-to-video synthesis


Recent image-to-image translation methods perform extremely well when turning semantic images to realistic outputs. To produce videos instead of images, simply doing it frame-by-frame will usually result in severe flickering artifacts [wang2018video]. To resolve this, vid2vid [wang2018video] proposes to take both the semantic inputs and previously generated frames as input to the network (e.g. ). The network then generates three outputs: a hallucinated frame, a flow map, and a (soft) mask. The flow map is used to warp the previous frame, and the warped frame is linearly combined with the hallucinated frame using the soft mask. Ideally, the network should reuse the content in the warped frame as much as possible, and only use the disoccluded parts from the hallucinated frame.
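The warp-and-blend step described above can be sketched in a few lines. This is only an illustration in plain Python and not the vid2vid implementation: frames are nested lists of grayscale values, the warp uses nearest-neighbor lookup, and the helper names (`warp_previous_frame`, `blend_frames`) as well as the convention that the mask weights the hallucinated frame are assumptions made for the example.

```python
def warp_previous_frame(prev, flow):
    """Backward-warp `prev` with a per-pixel flow field.

    flow[y][x] = (dx, dy) points from an output pixel to its source
    location in `prev`. Nearest-neighbor lookup with border clamping is
    a simplification; practical systems use bilinear sampling.
    """
    h, w = len(prev), len(prev[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + int(round(dx)), 0), w - 1)
            sy = min(max(y + int(round(dy)), 0), h - 1)
            row.append(prev[sy][sx])
        warped.append(row)
    return warped


def blend_frames(warped, hallucinated, mask):
    """Per-pixel linear combination of the two candidate frames.

    Assumed convention: mask = 0 reuses the warped pixel, mask = 1 takes
    the hallucinated pixel (e.g. in disoccluded regions).
    """
    return [
        [(1.0 - m) * w + m * h for w, h, m in zip(w_row, h_row, m_row)]
        for w_row, h_row, m_row in zip(warped, hallucinated, mask)
    ]
```

With a soft mask, pixels the flow can explain are copied from the previous output, while newly revealed pixels come from the hallucinated frame. Nothing in this step looks further back than the previous frame, which is why it cannot enforce long-term consistency on its own.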

While the above framework reduces flickering between neighboring frames, it still struggles to ensure long-term consistency. This is because it only keeps track of the past few frames, and cannot memorize everything in the past. Consider the scenario in the teaser figure, where an object moves out of and back into the field of view. In this case, we would want to make sure its appearance is similar during the revisit, but that cannot be handled by existing frameworks like vid2vid [wang2018video].

In light of this, we propose a new framework to handle world-consistency. It is a superset of temporal consistency, which only ensures consistency between frames in a video. A world-consistent video should not only be temporally stable, but also be consistent across the entire 3D world the user is viewing. This not only makes the output look more realistic, but also enables applications such as the multi-player scenario where different players can view the same scene from different viewpoints. We achieve this by using a novel guidance image conditional scheme, which is detailed below.

Figure 1: Overview of guidance image generation for training. Consider a scene in which one or more cameras with known parameters and positions travel over time. At the first time step, the scene is textureless and an output image is generated for this viewpoint. The output image is then back-projected to the scene and a guidance image for a subsequent camera position is generated by projecting the partially textured point cloud. Using this guidance image, the generative method can produce an output that is consistent across views and smooth over time. Note that the guidance image can be noisy, misaligned, and have holes, and the generation method should be robust to such inputs.

Guidance images and their generation. The lack of knowledge about the world structure being generated limits the ability of vid2vid to generate view-consistent outputs. As shown in Fig. 4 and Sec. 4, the color and structure of the objects generated by vid2vid [wang2018video] tend to drift over time. We believe that in order to produce realistic outputs that are consistent over time and viewpoint change, an ideal method must be aware of the 3D structure of the world.

To achieve this, we introduce the concept of “guidance images”, which are physically-grounded estimates of what the next output frame should look like, based on how the world has been generated so far. As alluded to in their name, the role of these “guidance images” is to guide the generative model to produce colors and textures that respect previous outputs. Prior works including vid2vid [wang2018video] rely on optical flows to warp the previous frame for producing an estimate of the next frame. Our guidance image differs from this warped frame in two aspects. First, instead of using optical flow, the guidance image should be generated by using the motion field, or scene flow, which describes the true motion of each 3D point in the world. (As an example, consider a textureless sphere rotating under constant illumination. In this case, the optical flow would be zero, but the motion field would be nonzero.) Second, the guidance image should aggregate information from all past viewpoints (and thus frames), instead of only the direct previous frames as in vid2vid. This makes sure that the generated frame is consistent with the entire history.

While estimating motion fields without an RGB-D sensor [golyanik2017multiframe] or a rendering engine [dosovitskiy2017carla] is not easy, we can obtain motion fields for the static parts of the world by reconstructing part of the 3D world using structure from motion (SfM) [longuet1981computer, tomasi1992shape]. This enables us to generate guidance images as shown in Fig. 1 for training our video-to-video synthesis method using datasets captured by regular cameras. Once we have the 3D point cloud of the world, the video synthesis process can be thought of as a camera moving through the world and texturing every new 3D point it sees. Consider a camera moving through space and time as shown in the left part of Fig. 1. Suppose we generate an output image at some time step. This image can be back-projected to the 3D point cloud and colors can be assigned to the points, so as to create a persistent representation of the world. At a later time step, we can obtain the projection of the 3D point cloud to the camera and create a guidance image leveraging estimated motion fields. Our method can then generate an output frame based on the guidance image.
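The colorize-then-project loop described above can be illustrated with a toy pinhole camera. The sketch below is not the authors' pipeline: the camera dictionary layout, the function names, the rule that a point keeps the first color it receives, and the absence of occlusion handling during colorization are all assumptions made to keep the example short.

```python
import math


def project(point, cam):
    """Project a 3D world point with a pinhole camera.

    cam holds a world-to-camera rotation "R" (3x3), translation "t",
    focal length "f" and principal point ("cx", "cy"). Returns
    (u, v, depth), or None if the point lies behind the camera.
    """
    R, t = cam["R"], cam["t"]
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = int(round(cam["f"] * xc[0] / xc[2] + cam["cx"]))
    v = int(round(cam["f"] * xc[1] / xc[2] + cam["cy"]))
    return u, v, xc[2]


def colorize_points(points, colors, image, cam):
    """Back-project a generated image onto still-uncolored points.

    Assumption: a point keeps the first color assigned to it, so later
    frames cannot overwrite the texture of the world.
    """
    h, w = len(image), len(image[0])
    for i, p in enumerate(points):
        if colors[i] is not None:
            continue
        hit = project(p, cam)
        if hit is None:
            continue
        u, v, _ = hit
        if 0 <= u < w and 0 <= v < h:
            colors[i] = image[v][u]


def render_guidance(points, colors, cam, height, width):
    """Project colored points into a new view, nearest point winning.

    Returns the (sparse) guidance image and a validity mask that is 1
    where a colored point landed and 0 in the holes.
    """
    guidance = [[(0, 0, 0)] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    zbuf = [[math.inf] * width for _ in range(height)]
    for p, c in zip(points, colors):
        if c is None:
            continue
        hit = project(p, cam)
        if hit is None:
            continue
        u, v, z = hit
        if 0 <= u < width and 0 <= v < height and z < zbuf[v][u]:
            zbuf[v][u] = z
            guidance[v][u] = c
            mask[v][u] = 1
    return guidance, mask
```

At the first frame every entry of `colors` is `None`, so the guidance image is empty; it fills in as more outputs are back-projected. The returned mask marks the holes that the generator has to tolerate.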

Although we generate guidance images using the projection of 3D point clouds, they can also be generated by any other method that gives a reasonable estimate. This makes the concept powerful, as we can use different sources to generate guidance images at training and test time. For example, at test time we can generate guidance images using a graphics engine, which can provide ground truth 3D correspondences. This enables just-in-time colorization of a virtual 3D world with real-world colors and textures, as we move through the world.

Note that our guidance image also differs from the projected image used in prior works like Meshry et al. [meshry2019neural] in several aspects. First, in their case, the 3D point cloud is fixed once constructed, while in our case it is constantly being “colorized” as we synthesize more and more frames. As a result, our guidance image is blank at the beginning, and can become denser depending on the viewpoint. Second, the way we use these guidance images to generate outputs is also different. The guidance images can have misalignments and holes due to limitations of SfM, for example in the background and in the person’s head in Fig. 1. As a result, our method also differs from DeepFovea [kaplanyan2019deepfovea], which inpaints sparsely but accurately rendered video frames. In the following subsection, we describe a method that is robust to noise in guidance images, so it can produce outputs consistent over time and viewpoints.

Figure 2: Overview of our world consistent video-to-video synthesis architecture. Our Multi-SPADE module takes input labels, warped previous frames, and guidance images to modulate the features in each layer of our generator.

Framework for generating videos using guidance images. Once the guidance images are generated, we are able to utilize them to synthesize the next frame. Our generator network is based on the SPADE architecture proposed by Park et al. [park2019semantic], which accepts a random vector encoding the image style as input and uses a series of SPADE blocks and upsampling layers to generate an output image. Each SPADE block takes a semantic map as input and learns to modulate the incoming feature maps through an affine transform $y = x \cdot \gamma_{\text{seg}} + \beta_{\text{seg}}$, where $x$ is the incoming feature map, and $\gamma_{\text{seg}}$ and $\beta_{\text{seg}}$ are predicted from the input segmentation map.

An overview of our method is shown in Fig. 2. At a high-level, our method consists of four sub-networks: 1) an input label embedding network (orange), 2) an image encoder (red), 3) a flow embedding network (green), and 4) an image generator (gray). In our method, we make two modifications to the original SPADE network. First, we feed in the concatenated labels (semantic segmentation, edge maps, etc.) to a label embedding network (orange), and extract features in corresponding output layers as input to each SPADE block in the generator. Second, to keep the image style consistent over time, we encode the previously synthesized frame using the image encoder (red), and provide this embedding to our generator (gray) in place of the random vector. (When generating the first frame, where no previous frame exists, we use an encoder which accepts the semantic map as input.)

Utilizing guidance images. Although using this modified SPADE architecture produces output images with better visual quality than vid2vid [wang2018video], the outputs are not temporally stable, as shown in Sec. 4. To ensure world-consistency of the output, we would want to incorporate information from the introduced guidance images. Simply linearly combining a guidance image with the hallucinated frame from the SPADE generator is problematic, since the hallucinated frame may contain something very different from the guidance image. Another way is to directly concatenate the guidance image with the input labels. However, the semantic inputs and guidance images have different physical meanings. Besides, unlike semantic inputs, which are labeled densely (per pixel), the guidance images are labeled sparsely. Directly concatenating them would require the network to compensate for the difference. Hence, to avoid these potential issues, we choose to treat these two types of inputs differently.

To handle the sparsity of the guidance images, we first apply partial convolutions [liu2018image] on these images to extract features. Partial convolutions only convolve valid regions in the input with the convolution kernels, so the output features are not contaminated by the holes in the image. These features are then used to generate affine transformation parameters $\gamma_{\text{guidance}}$ and $\beta_{\text{guidance}}$, which are inserted into existing SPADE blocks while keeping the rest of the blocks untouched. This results in a Multi-SPADE module, which allows us to use multiple conditioned inputs in sequence, so we can not only condition on the current input labels, but also on our guidance images,

$$y = (x \cdot \gamma_{\text{label}} + \beta_{\text{label}}) \cdot \gamma_{\text{guidance}} + \beta_{\text{guidance}}.$$
Using this module yields several benefits. First, conditioning on these maps generates more temporally smooth and higher quality frames than simple linear blending techniques. Separating the two types of input (semantic labels and guidance images) also allows us to adopt different types of convolutions (i.e. normal vs. partial). Second, since most of the network architecture remains unchanged, we can initialize the weights of the generator with one trained for single image generation. It is easy to collect large training datasets for single image generation by crawling the internet, while video datasets can be harder to collect and annotate. After the single image generator is trained, we can train a video generator by training the newly added layers (i.e. the layers generating the guidance affine parameters $\gamma_{\text{guidance}}$ and $\beta_{\text{guidance}}$) from scratch and only finetuning the other parts of the network.
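To make the role of partial convolutions concrete, the following sketch applies one to a single-channel image with holes. It follows the general formulation of Liu et al. [liu2018image] (sum over valid pixels, rescale by window size over valid count, propagate an updated mask), but it is a toy version: it uses no padding, one channel, and plain Python lists, and the function name `partial_conv2d` is chosen for this example only.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution without padding.

    Only pixels with mask == 1 contribute to each window. The response
    is rescaled by (window size / number of valid pixels) so that holes
    do not darken the output. A window with no valid pixel outputs 0 and
    stays marked as a hole in the returned mask.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out, new_mask = [], []
    for y in range(h - kh + 1):
        out_row, mask_row = [], []
        for x in range(w - kw + 1):
            valid = 0
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    m = mask[y + i][x + j]
                    valid += m
                    acc += kernel[i][j] * image[y + i][x + j] * m
            if valid > 0:
                out_row.append(acc * (kh * kw) / valid + bias)
                mask_row.append(1)
            else:
                out_row.append(0.0)
                mask_row.append(0)
        out.append(out_row)
        new_mask.append(mask_row)
    return out, new_mask
```

With an averaging kernel, a window that is half holes still returns the mean of its valid pixels rather than a value pulled toward whatever placeholder fills the holes. That is the property that makes sparse guidance images usable as a conditioning input.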

Handling dynamic objects. The guidance image allows us to generate world-consistent outputs over time. However, since the guidance is generated based on SfM for real-world scenes, it has the inherent limitation that SfM cannot handle dynamic objects. To resolve this issue, we revert to using optical flow-warped frames as maps in addition to the guidance images we have from SfM. The complete Multi-SPADE module then becomes

$$y = \big((x \cdot \gamma_{\text{label}} + \beta_{\text{label}}) \cdot \gamma_{\text{flow}} + \beta_{\text{flow}}\big) \cdot \gamma_{\text{guidance}} + \beta_{\text{guidance}},$$

where $\gamma_{\text{flow}}$ and $\beta_{\text{flow}}$ are generated using a flow-embedding network (green) applied on the optical flow-warped previous frame. This provides additional constraints that the generated frame should be consistent even in the dynamic regions. Note that this is needed only due to the limitation of SfM, and can potentially be removed when ground truth / high quality 3D registrations are available, for example in the case of game engines, or RGB-D data capture.
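The sequential conditioning of the Multi-SPADE module can be sketched as three chained affine modulations. This is a schematic only: feature maps are flat lists of floats, the normalization that a SPADE block applies before modulating is omitted, and the function names and the label, flow, guidance ordering are assumptions for illustration.

```python
def affine(x, gamma, beta):
    """Element-wise modulation y = x * gamma + beta."""
    return [xi * g + b for xi, g, b in zip(x, gamma, beta)]


def multi_spade(x, label, flow, guidance):
    """Chain three SPADE-style modulations.

    Each of `label`, `flow` and `guidance` is a (gamma, beta) pair that
    would be predicted from the label embedding, the flow-warped
    previous frame and the guidance image respectively.
    """
    y = affine(x, *label)
    y = affine(y, *flow)
    y = affine(y, *guidance)
    return y
```

Setting a stage to the identity, with gamma = 1 and beta = 0, recovers the output of the remaining stages. One way to read this is that it is why a generator trained with only the label stage can be reused when the flow and guidance stages are added.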



Figure 3: Sample inputs and generated outputs on Cityscapes (panels: Guidance Image, Generated Output). Note how the guidance image is initially black, and becomes denser as more frames are synthesized. Click on any image to play video.

Figure 3 shows a sample set of inputs and outputs generated by our method on the Cityscapes dataset.

4 Experiments

Implementation details. We train our network in two stages. In the first stage, we only train our network to generate single images. This means that only the first SPADE layer of our Multi-SPADE block (visualized in Fig. 2) is trained. Following this, we have a network that can generate high-quality single frame outputs. In the second stage, we train on video clips, progressively doubling the generated video length every epoch, starting from 8 frames and stopping at 32 frames. In this stage, all 3 SPADE layers of each Multi-SPADE block are trained. We found that this two-stage pipeline makes the training faster and more stable. We observed that the ordering of the flow and guidance SPADEs did not make a significant difference in the output quality. We train the network for 20 epochs in each stage, and this takes about 10 days on an NVIDIA DGX-1 (8 V100 GPUs) for an output resolution of


We train our generator with the multi-scale image discriminator using perceptual and GAN feature matching losses as in SPADE [park2019semantic]. Following vid2vid [wang2018video], we add a temporal video discriminator at two temporal scales and a warping loss that encourages the output frame to be similar to the optical flow-warped previous frame. We also add a loss term to encourage the output frame to correspond to the guidance image, and this is necessary to ensure view consistency. Additional details about architecture and loss terms can be found in Appendices A and B. Code and trained models will be released upon publication.
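The clip-length schedule of the second training stage (start at 8 frames, double every epoch, stop at 32) can be written out explicitly. The helper below is a hypothetical illustration of that rule, not code from the authors.

```python
def clip_length_schedule(num_epochs=20, start=8, stop=32):
    """Clip length per epoch: double each epoch, capped at `stop`."""
    lengths = []
    n = start
    for _ in range(num_epochs):
        lengths.append(n)
        n = min(2 * n, stop)
    return lengths
```

Under this reading the network sees 8-frame clips in the first epoch, 16-frame clips in the second, and 32-frame clips from the third epoch onward.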

Datasets. We train and evaluate our method on three datasets, Cityscapes [Cordts2016cityscapes], MannequinChallenge [li2019learning], and ScanNet [dai2017scannet], as they have mostly static scenes where existing SfM methods perform well.


  • Cityscapes [Cordts2016cityscapes]. This dataset consists of driving videos of resolution captured in several German cities, using a pair of stereo cameras. We split this dataset into a training set of 3500 videos with 30 frames each, and a test set of 3 long sequences with 600-1200 frames each, similar to vid2vid [wang2018video]. As not all the images are labeled with segmentation masks, we annotate the images using the network from Zhu et al. [zhu2019improving], which is based on a DeepLabv3-Plus [chen2018encoder]-like architecture with a WideResNet38 [wu2019wider] backbone.

  • MannequinChallenge [li2019learning]. This dataset contains video clips captured using hand-held cameras, of people pretending to be frozen in a large variety of poses, imitating mannequins. We resize all frames to and randomly split this dataset into 3040 train sequences and 292 test sequences, with sequence lengths ranging from 5-140 frames. We generate human body segmentation and part-specific UV coordinate maps using DensePose [Guler2018DensePose, wu2019detectron2] and body poses using OpenPose [cao2018openpose].

  • ScanNet [dai2017scannet]. This dataset contains multiple video clips captured in a total of 706 indoor rooms. We set aside 50 rooms for testing, and the rest for training. From each video sequence, we extracted 3 sub-sequences of length at most 100, resulting in 4000 train sequences and 289 test sequences, with images of size . We used the provided segmentation maps based on the NYUDv2 [silberman2012indoor] 40 labels.

For all datasets, we also use MegaDepth [li2018megadepth] to generate depth maps and add the visualized inverted depth images as input. As the MannequinChallenge and ScanNet datasets contain a large variety of objects and classes which are not fully annotated, we use edge maps produced by HED [xie2015holistically] in order to better represent the input content. In order to generate guidance images, we performed SfM on all the video sequences using OpenSfM [opensfm], which provided 3D point clouds and estimated camera poses and parameters as output.

Baselines. We compare our method against the following strong baselines.


  • vid2vid [wang2018video]. This is the prior state-of-the-art method for video-to-video synthesis. For comparison on Cityscapes, we use the publicly available pretrained model. For the other two datasets, we train vid2vid from scratch using the public code, while providing the same input labels (semantic segmentation, depth, edge maps, etc.) as to our method.

  • Inpainting [liu2018image]. We train a state-of-the-art partial convolution-based inpainting method to fill in the pixels missing from our guidance images. We train the models from scratch for each dataset, using masks obtained from the corresponding guidance images.

  • Ours w/o W.C. (World Consistency). As an ablation, we also compare against our model that does not use guidance images. In this case, only the first two SPADE layers in each Multi-SPADE block are trained (label and flow-warped previous output SPADEs). Other details are the same as our full model.

Table 1: Comparison scores. ↓ means lower is better, while ↑ means the opposite.

                            Cityscapes             MannequinChallenge     ScanNet
Method                      FID↓   mIOU↑  P.A.↑    FID↓   mIOU↑  P.A.↑    FID↓   mIOU↑  P.A.↑
Image synthesis models
SPADE [park2019semantic]    48.25  0.63   0.95     29.99  0.13   0.63     31.46  0.08   0.54
Video synthesis models
vid2vid [wang2018video]     69.07  0.55   0.94     72.25  0.05   0.45     60.03  0.04   0.35
Ours w/o W.C.               51.51  0.62   0.95     27.23  0.17   0.67     20.93  0.12   0.62
Ours                        49.89  0.61   0.95     22.69  0.19   0.69     21.07  0.13   0.63

Table 2: Human preference scores. Higher is better.

Compared Methods                Cityscapes   MannequinChallenge   ScanNet
Image Realism
Ours/vid2vid [wang2018video]    0.73/0.27    0.83/0.17            0.77/0.23
Temporal Stability
Ours/vid2vid [wang2018video]    0.75/0.25    0.63/0.37            0.82/0.18

Table 3: Forward-backward consistency. Δ means difference.

                            Cityscapes      MannequinChallenge   ScanNet
Method                      ΔRGB    ΔLAB    ΔRGB    ΔLAB         ΔRGB    ΔLAB
vid2vid [wang2018video]     14.90   3.46    37.56   9.42         46.30   12.16
Ours                        8.73    2.04    12.61   3.61         11.85   3.41

Evaluation metrics. We use both objective and subjective metrics for evaluating our model against the baselines.


  • Segmentation accuracy and Fréchet Inception Distance (FID). We adopt metrics widely used in prior work on image synthesis [chen2017photographic, park2019semantic, wang2018high] to measure the quality of generated video frames. We evaluate the output frames based on how well they can be segmented by a trained segmentation network. We report both the mean Intersection-Over-Union (mIOU) and Pixel Accuracy (P.A.) using the PSPNet [zhao2017pyramid] (Cityscapes) and DeepLabv2 [chen2017deeplab] (MannequinChallenge & ScanNet). We also use the Fréchet Inception Distance (FID) [heusel2017gans] to measure the distance between the distributions of the generated and real images, using the standard Inception-v3 network.

  • Human preference score. Using Amazon Mechanical Turk (AMT), we perform a subjective visual test to gauge the relative quality of videos. We evaluate videos on two criteria: 1) photorealism and 2) temporal stability. The first aims to find which generated video looks more like a real video, while the second aims to find which one is more temporally smooth and has less flickering. For each question, an AMT participant is shown two videos synthesized by two different methods, and asked to choose the better one according to the current criterion. We generate several hundred questions for each dataset, each of which is answered by 3 different workers. We evaluate an algorithm by the fraction of comparisons in which its outputs are preferred.

  • Forward-Backward consistency. A major contribution of our work is generating outputs that are consistent over a longer duration of time with the world that was previously generated. All our datasets have videos that explore new parts of the world over time, rarely revisiting previously explored parts. However, a simple way to revisit a location is to play the video in forward and then in reverse, i.e. arrange frames from time . We can then compare the first produced and last produced frames and measure their difference. We measure the difference per-pixel in both RGB and LAB space, and a lower value would indicate better long-term consistency.
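The forward-backward protocol in the last item can be sketched as follows. The exact frame ordering and distance used in the paper are not recoverable from the text here, so this example makes two labeled assumptions: the turning-point frame is not repeated, and the difference is a mean absolute per-channel difference. Conversion to LAB space is omitted, and the function names are illustrative.

```python
def forward_backward_order(num_frames):
    """Indices for playing a clip forward and then in reverse.

    Assumption: the last frame is not duplicated at the turning point,
    e.g. 4 frames give 0, 1, 2, 3, 2, 1, 0.
    """
    forward = list(range(num_frames))
    return forward + forward[-2::-1]


def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-channel difference between two frames.

    Frames are nested lists of pixel tuples in any color space; using
    the absolute difference is an assumption of this sketch.
    """
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pix_a, pix_b in zip(row_a, row_b):
            for ca, cb in zip(pix_a, pix_b):
                total += abs(ca - cb)
                count += 1
    return total / count
```

A generator is run on the semantic inputs in the order given by `forward_backward_order`, and the first and last generated frames, which share the same input, are compared with `mean_abs_diff`. A world-consistent method should give a value close to zero.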

Figure 4: Comparison of different video generation methods on the Cityscapes dataset (rows, top to bottom: vid2vid [wang2018video], SPADE [park2019semantic], Inpainting [liu2018image], Ours w/o W.C., Ours). Note that for our results, the textures of the cars, roads, and signboards are stable over time, while they change gradually in vid2vid and other methods. Click on an image to play the video.

Figure: Forward-backward consistency (panels: Cityscapes, MannequinChallenge, ScanNet; methods: vid2vid [wang2018video] and ours). Click on each image to see the change in output when the viewpoint is revisited. Note how drastically vid2vid results change, while ours remain almost the same.

Main results. In Table 1, we compare our proposed approach against vid2vid [wang2018video], as well as SPADE [park2019semantic], which is the single image generator that our method builds upon. We also compare against a version of our method that does not use guidance images and is thus not world-consistent (Ours w/o W.C.). Inpainting [liu2018image] could not provide meaningful output images without large artifacts, as shown in Fig. 4. We can observe that our method consistently beats vid2vid on all three metrics on all three datasets, indicating superior image quality. Interestingly, our method also improves upon SPADE in FID, probably as a result of reducing temporal variance across an output video sequence. We also see improvements over Ours w/o W.C. on almost all metrics.

In Table 2, we show human evaluation results on metrics of image realism and temporal stability. We observe that the majority of workers rank our method better on both metrics.

In Fig. 4, we visualize some sequences generated by the various methods (please zoom in and play the videos in Adobe Acrobat). We can observe that in the first row, vid2vid [wang2018video] produces temporal artifacts in the cars parked to the side and patterns on the road. SPADE [park2019semantic], which produces one frame at a time, produces very unstable videos, as shown in the second row. The third row shows outputs from the partial convolution-based inpainting [liu2018image] method. It clearly has a hard time producing visually and semantically meaningful outputs. The fourth row shows Ours w/o W.C., an intermediate version of our method that uses labels and optical flow-warped previous output as input. While this clearly improves upon vid2vid in image quality and SPADE in temporal stability, it causes flickering in trees, cars, and signboards. The last row shows our method. Note how the textures of the cars, roads, and signboards, which are areas for which we have guidance images, are stable over time. We also provide high resolution, uncompressed videos for all three datasets on our website.

Table 3 compares the forward-backward consistency of the different methods and shows that our method beats vid2vid [wang2018video] by a large margin, especially on the MannequinChallenge and ScanNet datasets (by more than a factor of 3). Figure 4 visualizes some frames at the start and end of generation. As can be seen, the outputs of vid2vid change dramatically, while ours are consistent. We show additional qualitative examples in Fig. 4. We also provide additional quantitative results on short-term consistency in Appendix 0.C.

Figure: Qualitative results on the MannequinChallenge and ScanNet datasets. Click on an image to play video. Note the results are consistent over time and viewpoints.

Figure: Stereo results on Cityscapes. Click on an image to see the outputs produced by a pair of stereo cameras. Note how our method produces images that are consistent across the two views, while the outputs of Ours w/o W.C. differ in the highlighted regions.

Generating consistent stereo outputs. Here, we show a novel application enabled by our method through the use of guidance images. We show videos rendered simultaneously for multiple viewpoints, specifically for a pair of stereo viewpoints on the Cityscapes dataset in Fig. 4. For the strongest baseline, Ours w/o W.C., the left-right videos can only be generated independently, and they clearly are not consistent across multiple viewpoints, as highlighted by the boxes. On the other hand, our method can generate left-right videos in sync by sharing the underlying 3D point cloud and guidance maps. Note how the textures on roads, including shadows, move in sync and remain consistent over time and camera locations.

5 Conclusions and discussion

We presented a video-to-video synthesis framework that can achieve world consistency. By using a novel guidance image extracted from the generated 3D world, we are able to synthesize the current frame conditioned on all the past frames. The conditioning was implemented using a novel Multi-SPADE module, which not only led to better visual quality, but also made transplanting a single image generator to a video generator possible. Comparisons on several challenging datasets showed that our method improves upon prior state-of-the-art methods.

While advancing the state-of-the-art, our framework still has several limitations. For example, the guidance image generation is based on SfM. When SfM fails to register the 3D content, our method will also fail to ensure consistency. Also, we do not consider a possible change in time of the day or lighting in the current framework. In the future, our framework can benefit from improved guidance images enabled by better 3D registration algorithms. Furthermore, the albedo and shading of the 3D world may be disentangled to better model the time effects. We leave these to future work.

Acknowledgements. We would like to thank Jan Kautz, Guilin Liu, Andrew Tao, and Bryan Catanzaro for their feedback, and Sabu Nadarajan, Nithya Natesan, and Sivakumar Arayandi Thottakara for helping us with the compute, without which this work would not have been possible.


Appendix 0.A Objective functions

Our objective function contains five losses: an image GAN loss, a video GAN loss, a perceptual loss, a flow-warping loss, and a world-consistency loss. Except for the world-consistency loss, all are inherited from vid2vid [wang2018video]. Note that we replace the least-squares GAN losses used in vid2vid with the hinge losses used in SPADE [park2019semantic]. We describe these terms in detail below.

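GAN losses. Let $\mathbf{s}_1^T \equiv \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\}$ be a sequence of input semantic frames. Let $\mathbf{x}_1^T \equiv \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$ be the sequence of corresponding real video frames, and $\tilde{\mathbf{x}}_1^T \equiv \{\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_T\}$ be the frames synthesized by our generator. Define $(\mathbf{x}_t, \mathbf{s}_t)$ as one pair of frames at a particular time instance $t$, where $\mathbf{x}_t \in \mathbf{x}_1^T$ and $\mathbf{s}_t \in \mathbf{s}_1^T$. The image GAN loss ($\mathcal{L}_I^t$) and the video GAN loss ($\mathcal{L}_V^t$) for time $t$ are then defined as

$$\mathcal{L}_I^t = \mathbb{E}_{(\mathbf{x}_t, \mathbf{s}_t)}\left[\min\left(0, -1 + D_I(\mathbf{x}_t, \mathbf{s}_t)\right)\right] + \mathbb{E}_{(\tilde{\mathbf{x}}_t, \mathbf{s}_t)}\left[\min\left(0, -1 - D_I(\tilde{\mathbf{x}}_t, \mathbf{s}_t)\right)\right],$$

$$\mathcal{L}_V^t = \mathbb{E}_{(\mathbf{x}_{t-K+1}^{t}, \mathbf{s}_{t-K+1}^{t})}\left[\min\left(0, -1 + D_V(\mathbf{x}_{t-K+1}^{t}, \mathbf{s}_{t-K+1}^{t})\right)\right] + \mathbb{E}_{(\tilde{\mathbf{x}}_{t-K+1}^{t}, \mathbf{s}_{t-K+1}^{t})}\left[\min\left(0, -1 - D_V(\tilde{\mathbf{x}}_{t-K+1}^{t}, \mathbf{s}_{t-K+1}^{t})\right)\right],$$

where $D_I$ and $D_V$ are the image and video discriminators, respectively. The video discriminator takes $K$ consecutive frames and concatenates them together for discrimination. For both GAN losses, we also add the feature matching loss ($\mathcal{L}_{FM}^t$) as in pix2pixHD [wang2018high],

$$\mathcal{L}_{FM}^t = \sum_i \frac{1}{P_i} \left\| D^{(i)}(\mathbf{x}_t, \mathbf{s}_t) - D^{(i)}(\tilde{\mathbf{x}}_t, \mathbf{s}_t) \right\|_1,$$

where $D^{(i)}$ denotes the $i$-th layer with $P_i$ elements of the discriminator network $D_I$ or $D_V$.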
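As an illustration of the two terms above, the following minimal Python sketch computes a hinge discriminator loss and a feature matching loss on plain lists of numbers. The function names and the flat-list feature layout are assumptions made for this example and are not taken from the released implementation; the hinge loss is written in its usual "minimize" form.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Hinge loss minimized by a discriminator.

    Real samples are pushed above +1 and synthesized samples below -1.
    Minimizing this is equivalent to maximizing
    mean(min(0, -1 + D(real))) + mean(min(0, -1 - D(fake))).
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def feature_matching_loss(real_features, fake_features):
    """Sum over layers of the mean absolute feature difference.

    Each argument holds one flat list of activations per discriminator
    layer (a hypothetical layout). Dividing by the layer size plays the
    role of the per-layer element count in the text.
    """
    loss = 0.0
    for real_layer, fake_layer in zip(real_features, fake_features):
        diff = sum(abs(r - f) for r, f in zip(real_layer, fake_layer))
        loss += diff / len(real_layer)
    return loss
```

In practice the scores and features would be tensors produced by the image or video discriminator; lists are used here only to keep the sketch dependency-free.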

Figure 3: Label / flow-warped image embedding network.

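Perceptual loss. We use the VGG-16 network [simonyan2014very] as a feature extractor and minimize the L1 losses between the features extracted from the real and the generated images. In particular,

$$\mathcal{L}_P^t = \sum_i \frac{1}{P_i} \left\| \psi^{(i)}(\mathbf{x}_t) - \psi^{(i)}(\tilde{\mathbf{x}}_t) \right\|_1,$$

where $\psi^{(i)}$ denotes the $i$-th layer of the VGG network, $P_i$ is its number of elements, and $\mathbf{x}_t$ and $\tilde{\mathbf{x}}_t$ are the real and generated frames at time $t$.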

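Flow-warping loss. We first warp the previous frame to the current frame using optical flow. We then encourage the warped frame to be similar to the current frame by using an L1 loss,

$$\mathcal{L}_F^t = \left\| \tilde{\mathbf{x}}_t - \mathcal{W}_{t-1}^{t}(\tilde{\mathbf{x}}_{t-1}) \right\|_1,$$

where $\mathcal{W}_{t-1}^{t}$ is the warping function derived from optical flow, and $\tilde{\mathbf{x}}_{t-1}$ and $\tilde{\mathbf{x}}_t$ are the previous and current generated frames.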
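The warping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are single-channel 2D lists, the flow stores a per-pixel offset `(dx, dy)` pointing from the current frame back into the previous frame, and sampling is nearest-neighbor with border clamping, whereas real implementations typically use bilinear sampling on tensors. All function names are hypothetical.

```python
def warp_nearest(prev, flow):
    """Backward-warp `prev` so that out[y][x] = prev[y + dy][x + dx].

    Source coordinates are rounded to the nearest pixel and clamped
    to the image border.
    """
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            out[y][x] = prev[sy][sx]
    return out


def l1_mean(a, b):
    """Mean absolute difference between two equally sized 2D lists."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def flow_warping_loss(prev, curr, flow):
    """L1 distance between the warped previous frame and the current frame."""
    return l1_mean(warp_nearest(prev, flow), curr)
```

When the flow explains the motion exactly, the warped previous frame matches the current frame and the loss is zero; any residual measures temporal inconsistency.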

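World-consistency loss. Finally, we enforce world consistency by encouraging the generated image to be similar to our guidance image. This is achieved by

$$\mathcal{L}_{WC}^t = \left\| \tilde{\mathbf{x}}_t - \mathbf{g}_t \right\|_1,$$

where $\tilde{\mathbf{x}}_t$ is the generated frame at time $t$ and $\mathbf{g}_t$ is our estimated guidance image.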
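Since guidance is only available in regions already covered by the rendered 3D world, one plausible way to evaluate this term is to restrict the L1 distance to pixels where the guidance image is valid. The sketch below assumes such a binary validity mask; the mask, the single-channel 2D-list layout, and the function name are assumptions for illustration and may differ from the actual implementation.

```python
def world_consistency_loss(generated, guidance, valid):
    """Mean absolute difference over pixels with valid guidance.

    `generated`, `guidance`, and `valid` are equally sized 2D lists;
    `valid` holds truthy values where guidance exists (an assumed mask).
    Returns 0.0 when no pixel has guidance, e.g. for the first frame.
    """
    total, count = 0.0, 0
    for gen_row, gui_row, mask_row in zip(generated, guidance, valid):
        for gen, gui, mask in zip(gen_row, gui_row, mask_row):
            if mask:
                total += abs(gen - gui)
                count += 1
    return total / count if count else 0.0
```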

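The overall objective function is then

$$\mathcal{L} = \min_G \left( \max_{D_I, D_V} \sum_t \left( \lambda_I \mathcal{L}_I^t + \lambda_V \mathcal{L}_V^t \right) + \sum_t \left( \lambda_{FM} \mathcal{L}_{FM}^t + \lambda_P \mathcal{L}_P^t + \lambda_F \mathcal{L}_F^t + \lambda_{WC} \mathcal{L}_{WC}^t \right) \right),$$

where $\lambda_I$, $\lambda_V$, $\lambda_{FM}$, $\lambda_P$, $\lambda_F$, and $\lambda_{WC}$ are the weights of the individual terms, which are set to 1, 1, 10, 10, 10, and 10, respectively, in all of our experiments.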
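The weighting can be illustrated with a small Python sketch of the generator-side sum. It assumes that the six weights map, in order, to the image GAN, video GAN, feature matching, perceptual, flow-warping, and world-consistency terms; the dictionary keys and function name are hypothetical.

```python
# Assumed mapping of the reported weights (1, 1, 10, 10, 10, 10) to terms.
LOSS_WEIGHTS = {
    "image_gan": 1.0,
    "video_gan": 1.0,
    "feature_matching": 10.0,
    "perceptual": 10.0,
    "flow_warping": 10.0,
    "world_consistency": 10.0,
}


def total_generator_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of per-frame loss values given as a name -> value dict."""
    missing = sorted(set(weights) - set(terms))
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    return sum(weights[name] * terms[name] for name in weights)
```

Failing loudly on a missing term avoids silently training without one of the losses.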

Optimization details. We use the ADAM optimizer [kingma2014adam] with for all experiments and network components. We use a learning rate of 1e-4 for the encoder and generator networks (which are described below) and 4e-4 for the discriminators.

Appendix 0.B Network architecture

Figure 4: Previous image / segmentation encoder.

As described in the main paper, our framework contains four components: a label embedding network (Fig. 3), an image encoder (Fig. 4), a flow embedding network (Fig. 3), and an image generator (Fig. 5).

Label embedding network (Fig. 3). We adopt an encoder-decoder style network to embed the input labels into different feature representations, which are then fed to the Multi-SPADE modules in the image generator.

Image / segmentation encoder (Fig. 4). These networks generate the input to the main image generator. The segmentation encoder is used when generating the first frame in the sequence, while the image encoder is used when generating the subsequent frames. The segmentation encoder encodes the input semantics of the first frame, while the image encoder encodes the previously generated frame.
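The switch between the two encoders can be sketched as a simple generation loop. The callables `seg_encoder`, `image_encoder`, and `generator` are placeholders for the actual networks, and their signatures are assumptions made for this example.

```python
def synthesize_sequence(labels, seg_encoder, image_encoder, generator):
    """Generate one frame per label map.

    The first frame starts from the encoded segmentation; every later
    frame starts from the encoding of the previously generated frame.
    """
    outputs = []
    for t, label in enumerate(labels):
        if t == 0:
            features = seg_encoder(label)
        else:
            features = image_encoder(outputs[-1])
        outputs.append(generator(features, label))
    return outputs
```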

Figure 5: Image generator.
Figure 6: Multi-SPADE Residual Block and Multi-SPADE module.

Flow embedding network (Fig. 3). This network embeds the optical flow-warped previous frame. It adopts the same architecture as the label embedding network, except for the number of input channels. The embedded features are again fed to the Multi-SPADE layers in the main image generator.

Image generator (Fig. 5). The generator consists of a series of Multi-SPADE residual blocks (M-SPADE ResBlks) and upsampling layers. The structure of each M-SPADE ResBlk is shown in Fig. 6. It is obtained by replacing the SPADE layers in the original SPADE ResBlks with Multi-SPADE layers.
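To give a rough intuition for a Multi-SPADE layer, the sketch below normalizes an activation vector and then applies several SPADE-style scale-and-shift modulations in sequence, one per conditioning input (for example labels, the flow-warped previous frame, and the guidance image). The sequential chaining, the 1D activation layout, and the directly supplied `gamma` and `beta` values are simplifying assumptions; in a real layer these parameters would be predicted by convolutions from the embedded conditioning maps.

```python
import math


def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat activation list."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def spade_modulate(h, gamma, beta):
    """Element-wise SPADE-style modulation: gamma * h + beta."""
    return [g * v + b for v, g, b in zip(h, gamma, beta)]


def multi_spade(x, conditions):
    """Normalize once, then apply one modulation per conditioning input.

    `conditions` is a list of (gamma, beta) pairs, each with the same
    length as `x` (hypothetical stand-ins for predicted parameters).
    """
    h = normalize(x)
    for gamma, beta in conditions:
        h = spade_modulate(h, gamma, beta)
    return h
```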

Discriminators. We use the same image and video discriminators as vid2vid [wang2018video].

Appendix 0.C Additional Results

Short-term temporal video consistency. For each sequence, we first take two neighboring frames from the ground truth images and compute the optical flow between them using FlowNet2 [ilg2017flownet]. We then use the optical flow to warp the corresponding synthesized images and compute the L1 distance between the warped image and the target image in RGB space, normalized by the number of pixels and channels. This process is repeated for all pairs of neighboring frames in all sequences, and the results are averaged. The results are shown in Table 4. As can be seen, Ours w/o World Consistency (W.C.) consistently performs better than vid2vid [wang2018video], and Ours (with world consistency) in turn consistently outperforms Ours w/o W.C.

| Dataset | vid2vid [wang2018video] | Ours w/o W.C. | Ours |
| --- | --- | --- | --- |
| Cityscapes | 0.0036 | 0.0032 | 0.0029 |
| MannequinChallenge | 0.0397 | 0.0319 | 0.0312 |
| ScanNet | 0.0351 | 0.0278 | 0.0192 |

Table 4: Short-term temporal consistency scores. Lower is better.
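The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for Table 4: frames are assumed to be height x width x channel nested lists, the flows are assumed to be precomputed from the ground truth frames, and `warp` is a user-supplied callable (hypothetical signature `warp(frame, flow)`), since the warping direction and sampling scheme are not specified here.

```python
def l1_per_element(a, b):
    """L1 distance normalized by the number of pixels and channels."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for px_a, px_b in zip(row_a, row_b):
            for ca, cb in zip(px_a, px_b):
                total += abs(ca - cb)
                count += 1
    return total / count


def short_term_consistency(sequences, warp):
    """Average warped L1 error over all neighboring pairs in all sequences.

    Each sequence is a dict with "frames" (synthesized frames) and
    "flows" (one ground-truth flow per neighboring pair).
    """
    errors = []
    for seq in sequences:
        frames, flows = seq["frames"], seq["flows"]
        for i, flow in enumerate(flows):
            warped = warp(frames[i], flow)
            errors.append(l1_per_element(warped, frames[i + 1]))
    return sum(errors) / len(errors)
```

Passing the warp as a callable keeps the metric independent of the optical flow library used to produce and apply the flows.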