Video Generation from Single Semantic Label Map

03/11/2019 ∙ by Junting Pan, et al. ∙ 16

This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process. Different from typical end-to-end approaches, which model both scene content and dynamics in a single step, we propose to decompose this difficult task into two sub-problems. As current image generation methods do better than video generation in terms of detail, we synthesize high quality content by only generating the first frame. Then we animate the scene based on its semantic meaning to obtain the temporally coherent video, giving us excellent results overall. We employ a cVAE for predicting optical flow as a beneficial intermediate step to generate a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to achieve major improvements in the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods.




Code Repositories


Pytorch implementation of "Video Generation from Single Semantic Label Map", CVPR 2019


1 Introduction

A typical visual scene is composed of foreground objects and the background. In a dynamic scene, the motion of the background is determined by camera movement, which is independent of the motion of foreground objects. Scene understanding, which includes understanding both how foreground objects and background look and how they change, is essential to advancing computer vision. Besides recognition models, scene understanding can also be accomplished by generative methods [34]. In this work we focus on using generative models to understand our visual world.

There has been much progress in image generation to address static scene modeling. Researchers have proposed methods to generate images from only noise [10] or from pre-defined conditions such as attribute, text and pose [41, 20]. In recent works, people also pay attention to image generation conditioned on semantic information with either paired [12] or unpaired data [42]. The conditional image generation methods provide a way to manipulate existing images and have potential value as a data augmentation strategy to assist other computer vision tasks.

Figure 1: Comparison with existing generation tasks. From top: image-to-image translation, video-to-video, and our image-to-video synthesis. Our method takes only one semantic label map as input and synthesizes a sequence of photo-realistic video frames.

While image generation tasks only model static scenes, for video prediction, it is essential to also investigate the temporal dynamics. Models are trained to predict raw pixels of the future frame by learning from historical motion patterns. There is another line of work on video synthesis without any history frames.

Similar to research on image generation, some work has investigated unconditional video generation, that is, directly generating video clips from noise by using generative adversarial networks to learn a mapping between a spatio-temporal latent space and video clips [31, 25]. Another group of researchers worked on video-to-video translation [37], where a sequence of frames is generated according to a sequence of aligned semantic representations.

Figure 2: Overview of our two step generation network. In the first stage, we generate the starting frame by mapping from a semantic label map. In the second stage, we use our flow prediction network to transform the initial frame to a video sequence.

In this work, we study video generation in a setting similar to the video-to-video work [37], except that generation is conditioned only on a single frame's semantic label map. Compared to previous work on video generation, our setting not only provides control over the generation process but also allows high variability in the results. Conditioning the generation on a semantic label map helps avoid undesirable results (e.g. a car driving on the pavement), which often occur in unconditional generation. Furthermore, we can generate cars moving at different speeds or in different directions, which is not possible in the video-to-video setting.

Figure 3: Overall architecture of the proposed image-to-video generation network. It consists of two components: a) Motion Encoder and b) Video Decoder. For any pair of bidirectional flow predictions, the consistency check is computed only in non-occluded areas.

One intuitive idea to address this new task would be to train an end-to-end conditional generative model. However, it is not easy to apply such a model to datasets composed of diverse objects and background,  i.e. different objects in different scenes have different motions. In reality, training a single end-to-end model to simultaneously model both appearance and motion of these objects and scenes is very hard. Therefore, as illustrated in Fig. 2, we take a divide-and-conquer strategy, designed to model appearance and motion in a progressive manner.

In the first stage, we aim to transform a semantic label map into a frame such that the appearance of the scene is synthesized, which falls into the category of image-to-image translation. During the translation process, the model focuses only on producing an image of good quality with reasonable content.

In the next stage, future motion of the scene is predicted based on the generated frame. Specifically, a conditional VAE is employed to model the uncertainty of future motion. Different from existing video prediction tasks, where motion information can be estimated from historical frames, in our setting we only have one semantic label map and one generated frame available. We argue that it is important for the model to leverage the semantic meaning of the first frame when predicting motion. For example, buildings and pedestrians have very distinctive motion. We take both the semantic label map and the generated frame as input and feed them into a motion prediction model. Empirical results demonstrate that with the semantic representation as input, the model learns better motion for dynamic objects than without it, especially for complex scenes with multiple classes of objects. We model motion with optical flow. Once flows are predicted, they are directly applied to warp the first frame to synthesize future frames. Finally, a post-processing network is added to rectify imperfections caused by the warping operation. Inspired by [21], we further improve the performance of flow prediction and future frame generation using bidirectional flows and geometric consistency. Experimental results demonstrate the effectiveness of the proposed method in video generation.

Our contributions are the following.

  1. We introduce the novel task of conditioning video generation on a single semantic label map, allowing a good balance between flexibility and quality compared to existing video generation approaches.

  2. The difficult task is divided into two sub-problems, i.e., image generation followed by image-to-sequence generation, such that each stage can specialize on one problem.

  3. We make full use of the semantic categorical prior in motion prediction when only one starting frame is available. It helps predict more accurate optical flow, thereby producing better future frames.

2 Related Work

Image generation

Much work exists on image generation, which can generally be classified into two categories: unconditional generation and conditional generation. In unconditional generation, some work extends GANs [10] or VAEs [16] to map from noise to the real data distribution. Auto-regressive architectures model the image on a per-pixel basis [32, 22]. In the second category, conditional models generate images given class categories, textual descriptions, scene graphs or images [20, 2, 41, 15, 26]. For the image translation task in particular, researchers study how to generate a meaningful image from a semantic representation such as semantic label maps, with paired or unpaired data [12, 42, 38, 3, 26]. However, image generation tasks model the photo-realism of a scene without considering its motion.

Video Generation Similar to image generation, video generation can also be divided into two categories: conditional and unconditional. In the unconditional category, VideoGAN [34] explicitly disentangles a scene's foreground from its background under the assumption that the background is stationary. The model is limited to simple cases and cannot handle scenes with a moving background due to camera movement. TGAN [25] first generates a sequence of latent variables and then synthesizes a sequence of frames based on those latent variables. MoCoGAN [31] also maps a sequence of random vectors to a sequence of frames. However, its framework decomposes video into a content subspace and a motion subspace, making the video generation process more controllable. Conditional video generation is still at an early stage. One recent work is vid2vid [37], in which the authors aim at transforming a sequence of semantic representations, e.g. semantic label maps and sketch maps, into a sequence of video frames. Our work falls into the category of conditional video generation, but unlike vid2vid, our method requires only a single semantic label map as input, which enables more freedom over the generation process.

Video prediction Some work models future motion in a deterministic manner. In [23, 29, 33], future prediction is carried out in a latent space, and the representation of future frames is projected back to the image domain. These models are directly trained to optimize a reconstruction loss, such as Mean Squared Error (MSE), between the predicted frames and ground truth frames. However, they are prone to converging to blurry results, as they compute an average of all possible future outcomes for the same starting frame. In [19, 13, 8], future motion is predicted as either optical flow or filters, and the corresponding spatial transformation is then applied to history frames to produce future frames. The result is sharp but lacks diversity. A group of researchers [39, 36, 7, 1] introduced conditional variational autoencoders for video prediction to model uncertainty in future motion, allowing the results to be both sharp and diverse. Similar to our work, Walker et al. [35] and Li et al. [18] attempt to predict multiple future frames from a static image. In the training phase, they take the ground truth optical flow, either human annotated or computed, as supervision to predict such flow, and transform the given frame into future frames. Contrary to Walker et al. [35] and Li et al. [18], we learn optical flow in an unsupervised manner, i.e., without taking any pre-computed flow as supervision.

3 Semantic Label Map to Video Generation

Generating a video sequence from a single semantic label map allows more flexibility compared to translating multiple label maps to a video, but is also more challenging. In this work we propose to divide this difficult task into two relatively easy sub-problems and address each one separately, i.e., i) Image-to-Image (I2I): an image generation model based on conditional GANs [38] that maps a given semantic label map to the starting frame of a sequence, and ii) Image-to-Video (I2V): an image-sequence generation network that produces a sequence of frames based on the generated starting frame and a latent variable $z$. Each stage has a model specializing in the corresponding task, which benefits the overall performance.
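The two-stage decomposition can be sketched as a simple composition of two models. This is a minimal illustration, not the authors' implementation: `generate_video`, `i2i_model` and `i2v_model` are hypothetical names, the two models are assumed to be arbitrary callables, and the default latent size of 1024 follows the value quoted in Sec. 4.2.

```python
import numpy as np

def generate_video(label_map, i2i_model, i2v_model, z_dim=1024, rng=None):
    """Two-stage generation: label map -> first frame -> frame sequence.

    i2i_model and i2v_model are hypothetical callables standing in for the
    trained networks of the two stages.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Stage 1 (I2I): synthesize the appearance of the starting frame.
    first_frame = i2i_model(label_map)
    # Sample a latent motion code from the standard normal prior.
    z = rng.standard_normal(z_dim)
    # Stage 2 (I2V): animate the starting frame into future frames.
    future_frames = i2v_model(first_frame, label_map, z)
    # Prepend the starting frame to obtain the full sequence.
    return np.concatenate([first_frame[None], future_frames], axis=0)
```

Because the stages only communicate through the starting frame, either model can be swapped out independently, and sampling a different `z` for the same label map yields a different motion.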

3.1 Image-to-Image (I2I)

Image-to-image translation aims at learning the mapping of an image in the source domain to its corresponding image in the target domain. Among the existing methods [12, 42, 38, 3], we adopt the state-of-the-art image translation model pix2pixHD [38] to generate an image from a semantic label map. It includes a coarse-to-fine architecture to progressively produce high quality images with fine details while keeping global consistency. Note that the translation stage is not restricted to this method and other image translation approaches can substitute pix2pixHD.

3.2 Image-to-Video (I2V)

In this section, we present how to use cVAE for image sequence generation conditioned on an initial frame obtained from Sec. 3.1. It is composed of two sub-modules, i.e., flow prediction and video frame generation from flow. Fig. 3 shows the network structure and the components of the proposed Image-to-Video model.

Conditional VAE - Compared to future prediction from multiple frames, where future motion can be estimated from the past sequence, motion predicted from a single frame can be more diverse. We employ the conditional VAE (cVAE) model [39] as the backbone to capture multiple possible future motions conditioned on a static image. The proposed cVAE is composed of an encoder and a decoder. The encoder learns to map a starting frame $I_0$ and the subsequent frames $I_{1:T}$ into a latent variable $z$ that carries information about the motion distribution conditioned on the first frame $I_0$. To achieve such a mapping, the latent variable $z$ is composed of two parts, one projected from the whole sequence including both $I_0$ and $I_{1:T}$, and the other from only the initial frame $I_0$. The decoder then reconstructs the sequence and outputs $\hat{I}_{1:T}$ based on a sampled $z$ and $I_0$. During training, the encoder learns to match the standard normal distribution, $\mathcal{N}(\mathbf{0}, \mathbf{I})$. At inference, the cVAE generates a video sequence from a given starting frame $I_0$ and a latent variable $z$ sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$, without the need for the motion encoder.
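The difference between training-time and inference-time sampling in a cVAE can be sketched as follows. This assumes the common Gaussian parameterization in which the encoder outputs a mean and a log-variance; the function names are illustrative and the text does not confirm these exact details.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Training: draw z = mu + sigma * eps with eps ~ N(0, I).

    mu and log_var are assumed to be the encoder outputs.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def sample_prior(dim, rng):
    """Inference: the motion encoder is dropped and z comes from N(0, I)."""
    return rng.standard_normal(dim)
```

Drawing several codes with `sample_prior` for the same starting frame is what produces multiple plausible futures.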

Flow Prediction - We first use an image encoder to transform the starting frame $I_0$ into a latent vector $z_{I_0}$ as one part of the latent variable $z$. The whole sequence $I_{0:T}$ is sent to another sequence encoder to compute $z_m$, which makes up the other part of $z$ for uncertainty modeling. $z_{I_0}$ and $z_m$ are concatenated into one vector $z$, which is fed to a decoder to compute future optical flow. For motion generation, we predict bidirectional flows, i.e. both forward flow from the initial frame to future frames and backward flow from future frames to the initial frame. Computing cycle flow allows us to perform forward-backward consistency checks. For regions which appear in both frames (A and B), correspondence between the two frames can be captured both from A to B and from B to A. We compute an occlusion mask to omit regions which are either occluded or missing in the generated frame, so that the consistency check is only conducted on non-occluded regions. Putting all this together, the resulting output of the cVAE is the optical flow as well as the occlusion mask for both forward and backward directions, defined as:

$$W^f, W^b, O^f, O^b = \mathcal{F}(I_0),$$

where $\mathcal{F}$ is the flow prediction module that is composed of the motion encoder and the flow decoder, as shown in Fig. 3. $W^f = \{w^f_1, \dots, w^f_T\}$, where $w^f_t$ is the forward optical flow from $I_0$ to $I_t$, and $W^b = \{w^b_1, \dots, w^b_T\}$, where $w^b_t$ is the backward optical flow. $O^f = \{o^f_1, \dots, o^f_T\}$ and $O^b = \{o^b_1, \dots, o^b_T\}$ are the multi-frame forward and backward occlusion maps. We define a pixel value in the occlusion map to be zero when there is no correspondence between frames. All optical flows and occlusion maps are jointly predicted by our image-to-flow module. Note that both the bidirectional flows and the occlusion maps are learned without any pre-computed flow as supervision.

Video frame Generation - With the predicted optical flow, we can directly produce future frames by warping the initial frame. However, frames obtained solely by warping have inherent flaws, as some parts of an object may not be visible in one frame but appear in another. To fill in the holes caused by either occlusion or objects entering or leaving the scene, we propose to add a post-processing network after frame warping. It takes a warped frame and its corresponding occlusion mask as input and generates the refined frame. The final output of our model is defined as follows:

$$\hat{I}_t(\mathbf{x}) = \mathcal{P}\big(o^b_t(\mathbf{x}) \cdot I_0(\mathbf{x} + w^b_t(\mathbf{x}))\big),$$

where $\mathcal{P}$ is the post-processing network, $w^b_t$ and $o^b_t$ are the backward flow and occlusion map for frame $t$, and $\mathbf{x}$ denotes the coordinates of a position in the frame.
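The warping step that precedes the post-processing network can be illustrated with a small NumPy routine. This is a sketch under assumptions: it uses bilinear sampling with border clamping, a flow layout of (dx, dy) per pixel, and an occlusion map that is 1 where a pixel is visible and 0 where it is occluded. The function names are hypothetical and the refinement network itself is not modeled.

```python
import numpy as np

def warp_frame(frame, flow):
    """Sample frame at x + flow(x) with bilinear interpolation.

    frame: (H, W, C) float array; flow: (H, W, 2) holding (dx, dy) per pixel.
    Sampling positions are clamped to the image border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = (sx - x0)[..., None]  # horizontal interpolation weight
    ay = (sy - y0)[..., None]  # vertical interpolation weight
    top = (1 - ax) * frame[y0, x0] + ax * frame[y0, x1]
    bottom = (1 - ax) * frame[y1, x0] + ax * frame[y1, x1]
    return (1 - ay) * top + ay * bottom

def masked_warp(frame, flow, occlusion):
    """Zero out occluded pixels of the warped frame (occlusion: (H, W))."""
    return occlusion[..., None] * warp_frame(frame, flow)
```

The zeroed regions returned by `masked_warp` are the holes that a refinement network would then be asked to fill in.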

Loss Function - Our loss function contains both per-pixel reconstruction and uncertainty modeling. For the per-pixel reconstruction, we compute losses in both the forward and backward directions, summed over all $T$ frames of the generated sequence. We only compute reconstruction in non-occluded regions to avoid learning incorrect deformations. Neighboring pixels usually belong to the same object and thus tend to have similar displacement. Therefore, similar to previous work [40, 30], we also add a smoothness constraint to encourage flow in a local neighborhood to be similar.
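The masked reconstruction and smoothness terms can be sketched as below. The choice of an L1 penalty, finite-difference gradients, and an occlusion map equal to 1 at visible pixels are assumptions made for illustration; the function names are hypothetical and the authors' exact formulation may differ.

```python
import numpy as np

def masked_l1(target, warped, occlusion):
    """Reconstruction error counted only where occlusion == 1 (visible).

    target, warped: (H, W, C); occlusion: (H, W) with values in [0, 1].
    """
    return float(np.sum(occlusion[..., None] * np.abs(target - warped)))

def flow_smoothness(flow):
    """L1 norm of horizontal and vertical finite differences of a flow field.

    flow: (H, W, 2). A constant flow field has zero smoothness cost.
    """
    dx = np.abs(flow[:, 1:] - flow[:, :-1])
    dy = np.abs(flow[1:, :] - flow[:-1, :])
    return float(dx.sum() + dy.sum())
```

In a bidirectional setup, `masked_l1` would be evaluated once per direction and per frame, each time with the occlusion map of that direction.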


We compute forward-backward consistency loss for non-occluded regions:
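A common way to write such a consistency term is sketched below. It is not necessarily the authors' exact formulation: the symbols $w^f_t$, $w^b_t$ (forward and backward flow for frame $t$) and $o^f_t$, $o^b_t$ (occlusion maps, 1 at visible pixels), the L1 norm, and the sign convention are all assumptions.

```latex
\mathcal{L}_{fc} = \sum_{t=1}^{T} \sum_{\mathbf{x}}
  o^f_t(\mathbf{x}) \,\bigl\| w^f_t(\mathbf{x}) + w^b_t\bigl(\mathbf{x} + w^f_t(\mathbf{x})\bigr) \bigr\|_1
  + o^b_t(\mathbf{x}) \,\bigl\| w^b_t(\mathbf{x}) + w^f_t\bigl(\mathbf{x} + w^b_t(\mathbf{x})\bigr) \bigr\|_1
```

The intuition is that following the forward flow from a pixel and then the backward flow from where it lands should return to the starting position, so the two displacements cancel wherever the pixel is visible in both frames.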


To train the in-painting network, we apply an $L_1$ loss together with a perceptual loss [14], which has been shown to be useful for image generation. Our data loss is therefore formulated as a weighted sum of the above terms. The perceptual loss uses VGG-19 [27], from which we extract and collect features from the first 16 layers. We add a penalty on the occlusion maps to avoid the trivial solution where all pixels become occluded (we define the value at a position of an occlusion map to be 0 when the pixel becomes occluded in the next frame). Each term carries its own fixed weight. To model the motion uncertainty, we incorporate a KL-divergence loss such that the encoder's distribution over $z$ matches the standard normal prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The training loss for the cVAE is the data loss combined with the KL-divergence loss.
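For a diagonal Gaussian posterior, the KL divergence to a standard normal prior has a closed form, sketched here. The parameterization by mean and log-variance is an assumption about the encoder output, and the function name is illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Equals 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2); it is zero exactly
    when mu = 0 and log_var = 0.
    """
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```

Adding this term to the data loss, scaled by a weight, pulls the encoder distribution toward the prior that is sampled from at inference time.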


3.3 Flow prediction with semantic label maps

Different from video prediction conditioned on multiple frames, generating a video from a static frame has no access to historical motion information. To infer the future motion of an object in a static frame, the model needs to understand the semantic category of that object and its interaction with other objects and the background. For example, a car will stop when the traffic light is red and move on when it is green. To improve future motion estimation for the whole frame, we incorporate the semantic label map, which describes the semantic information of the whole scene, into the flow prediction module discussed in the previous sub-section.

We explore two ways of integrating the semantic label map for flow prediction. In the first method, we expand a semantic label map into several heatmaps, each filled with ones at positions that correspond to one semantic category and zeros elsewhere. These heatmaps are concatenated with the generated starting frame and fed to the cVAE model for future frame synthesis. In the other method, we further divide the heatmaps into two sets, i.e., foreground heatmaps and background heatmaps, as shown in Fig. 4. Each set of heatmaps is fed to a separate sequence encoder to obtain latent vectors $z_{FG}$ and $z_{BG}$. They are then concatenated with the latent vector of the starting frame, $z_{I_0}$, to form the input to the flow decoder. In Section 4, experimental results demonstrate that integrating the semantic label map helps compute more accurate flow and accordingly improves video generation performance.
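Both integration schemes start from the same one-hot expansion of the label map, which can be sketched as follows. The function names, the channel-first layout, and the particular choice of which class ids count as foreground are assumptions for illustration only.

```python
import numpy as np

def label_map_to_heatmaps(label_map, num_classes):
    """Expand an (H, W) integer label map into (num_classes, H, W) heatmaps.

    Heatmap k is 1 where the label equals k and 0 elsewhere.
    """
    class_ids = np.arange(num_classes)[:, None, None]
    return (label_map[None] == class_ids).astype(np.float32)

def first_method_input(frame, heatmaps):
    """Method 1: stack the (3, H, W) frame with all heatmaps channel-wise."""
    return np.concatenate([frame, heatmaps], axis=0)

def split_heatmaps(heatmaps, foreground_ids):
    """Method 2: separate heatmaps into foreground and background sets.

    foreground_ids is a hypothetical list of class ids treated as foreground.
    """
    is_fg = np.zeros(heatmaps.shape[0], dtype=bool)
    is_fg[list(foreground_ids)] = True
    return heatmaps[is_fg], heatmaps[~is_fg]
```

In the second scheme, each of the two returned sets would be passed to its own sequence encoder, so that one encoder sees only foreground classes and the other only background classes.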

Figure 4: Semantic sequence encoder. Each sequence encoder only focuses on learning either foreground or background motion.

4 Experiments

In this section we present the dataset and describe the details about the implementation. We evaluate our method against several baseline methods with both qualitative and quantitative metrics. We also perform ablation studies to confirm the effectiveness of using semantic label maps for video generation.

4.1 Datasets and Evaluation Metrics

Datasets We conducted quantitative experiments on the Cityscapes dataset and provide qualitative results on several other datasets. Cityscapes [6] consists of urban scene videos recorded from a car driving on the street. It contains 2,975 training, 500 validation and 1,525 test video sequences, each containing 30 frames. The ground truth semantic segmentation mask is only available for the 20th frame of every video. We use DeepLabV3 [5] to compute semantic segmentation maps for all frames, which are used for training and testing. We train the model using all videos from the training set and test it on the validation set. UCF101 [28] contains 13,320 videos of 101 action classes. The KTH Action dataset [17] consists of 600 videos of people performing one of six actions (walking, jogging, running, boxing, hand-waving, hand-clapping). KITTI [9], similar to Cityscapes, was recorded from a car traversing streets.

Evaluation Metrics We provide both quantitative and qualitative evaluation results in this section. For qualitative evaluation, we conducted a human subjective study to evaluate our method as well as the baseline methods. We randomly generated 100 video sequences for each method, pairing each generated video with the result of another randomly chosen method. The participants were asked to choose the more realistic-looking video from each pair. We calculate the human preference score after each pair of videos has been evaluated by 10 participants.

The Fréchet Inception Distance (FID) [11] measures the similarity between two sets of images. It was shown to correlate well with human judgment of visual quality and is most often used to evaluate the quality of samples from GANs. FID is calculated by computing the Fréchet distance between two feature representations of the Inception network. Similar to [37], we use the video inception network [4] to extract spatio-temporal feature representations.
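The Fréchet distance between two Gaussians fitted to feature sets has the closed form $\|\mu_a - \mu_b\|^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$, which can be sketched in NumPy as below. The feature extractor (here, whatever produced the rows of `feats_a` and `feats_b`) is outside the scope of the sketch, the function name is illustrative, and the trace of the matrix square root is obtained from eigenvalues rather than from an explicit matrix square root.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (N, D) arrays with one feature vector per sample, D >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Identical feature sets give a distance of zero, and shifting every feature by a constant changes only the mean term, which makes the function easy to sanity-check before applying it to network features.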

4.2 Implementation details

Our method takes a single semantic label map and predicts $T$ = 8 frames in a single step. We resize all frames to a common resolution and extract the semantic segmentation maps with DeepLabV3 [5] for training. We do not use any flow map as ground truth for training. In the cVAE, the motion encoder is built upon stacks of 2D convolutional layers interleaved with max pooling layers. The latent vector $z$ has dimension 1024: 896 for foreground motion and 128 for background motion. For the flow decoder, we use three blocks, each consisting of 3D convolutional layers interleaved with a bilinear upsampling layer that progressively recovers the input resolution in both spatial and temporal dimensions. For the post-processing network, we adopt the U-Net architecture from [24].

4.3 Ablation Studies

We conduct extensive experiments on the Cityscapes dataset to analyze the contribution of the semantic label map and optical flow to motion prediction. We find that optical flow is a reliable representation for conveying motion between frames and preserves better visual quality. Fig. 9 shows that the model without optical flow produces blurry frames. In contrast, our flow-based solution preserves better details even on fast-moving objects and produces fewer artifacts.

We also compare frame sequences generated by the model without the semantic label map and with the two ways of integrating it. As shown in Fig. 10, the model integrating the semantic label map is able to capture both foreground object motion and background motion, whereas the one without it fails to estimate the independent foreground object motion. By further separating semantic label maps into background and foreground, the model captures more structural details, as marked by the red rectangles. As expected, semantic information plays an important role in generating object motion when predicting from a single frame, and separating semantic classes into background and foreground groups brings further improvements.

Figure 5: Comparison between different approaches to video prediction from a static image. Top left: ground truth. Top right: FG. Bottom left: MoCoGAN. Bottom right: img2vid (ours). Our method preserves the visual quality, while the other methods degrade rapidly.

Figure 6: Comparisons with other competing baselines. Notice that vid2vid uses a sequence of semantic label maps while other methods only take one as input. Please zoom in for best view.

| Method | MoCoGAN | FG | vid2vid | Ours |
| --- | --- | --- | --- | --- |
| FID | 8.77 | 3.69 | 4.86 | 3.52 |

Table 1: Comparison of video generation methods where the input is a single semantic label map.

| Method | MoCoGAN | FG | Ours |
| --- | --- | --- | --- |
| FID | 7.06 | 2.86 | 1.80 |

Table 2: Comparison of video prediction methods that take a single starting frame as input.

4.4 Baselines

We compare our network with five state-of-the-art baseline methods trained on the Cityscapes dataset.

MoCoGAN [31] is an unconditional video generation model. Here, we also compared the conditional setting of MoCoGAN, given the initial frame as input.

FlowGrounded (FG) [18] is a video prediction model from a static image. We compare our image-to-video stage with this method on both video generation and video prediction tasks.

Vid2Vid [37] maps a sequence of semantic representations to a sequence of video frames, where future motion is approximately given by the semantic segmentation sequence. We evaluate vid2vid to see whether our method is comparable to this "upper bound".

Figure 7: Comparisons with other competing baselines on the UCF-101 dataset and the KTH human action dataset. Please zoom in to see the details.
Figure 8: Samples on KITTI generated by the model trained on the Cityscapes dataset.
Figure 9: Ablation studies of our method. Top left: GT. Top right: w/o segmentation label map and flow. Bottom left: w/o flow. Bottom right: our full model. Our method better preserves the visual quality.
Figure 10: We compare three different variants of using semantic label map for flow and frame prediction. (a) ground truth, (b) w/o semantic label maps, (c) with semantic label maps, (d) with separate semantic label maps for background and foreground objects.

4.5 Results

Quantitative Results Table 1 reports the results on the Cityscapes dataset. The lower the FID, the better the model, and our method has the lowest FID of all competing methods. Notice that the results here are slightly different from those reported by Wang et al. [38] because we only evaluate 8-frame sequences at a reduced resolution due to GPU memory limitations. We generated a total of 500 short sequences on the validation set. We also provide results for video prediction when only the starting frame is given. As shown in Table 2, our method outperforms the other state-of-the-art approaches in video prediction from a static image.

Qualitative Results Fig. 6 compares our generation results with other approaches. MoCoGAN has limited capability in modeling video sequences (both motion and appearance). FG fails to synthesize the details of the scene, e.g. windows of the background building are completely missing, which makes its frames look blurrier. Our method maintains the semantic structure of the scene for the duration of the sequence and contains finer details than the previous two methods. The proposed method makes reasonable estimates of the objects' future motion and produces temporally coherent video sequences. Compared to the ground truth sequence, our model can generate semantically correct samples but with different properties, e.g., a white car in the ground truth sequence appears as a silver car in our result. Vid2vid, whose input is a sequence of semantic label maps, produces realistic images with great detail but is limited in preserving temporal consistency across frames, e.g. a silver car in one frame has turned black in a later frame, while our method keeps the same color. To further show the effectiveness of our method in predicting general motion, we provide visual results on the UCF-101 and KTH action datasets, which mainly consist of people performing actions. As shown in Fig. 7, our method preserves the body structure well and synthesizes complex non-linear motions such as people skiing, playing violin and walking. We also trained the model on Cityscapes and tested it on samples from KITTI to show the method's generalization ability, as shown in Fig. 8.

| Comparison | Human Preference Score |
| --- | --- |
| seg2vid (ours) / MoCoGAN | 1.0 / 0.0 |
| seg2vid (ours) / FG | 0.78 / 0.22 |
| seg2vid (ours) / vid2vid | 0.37 / 0.63 |

Table 3: User study on video generation methods.

| Comparison | Human Preference Score |
| --- | --- |
| seg2vid (ours) / MoCoGAN | 1.0 / 0.0 |
| seg2vid (ours) / FG | 0.82 / 0.18 |

Table 4: User study on video prediction methods.

The user study in Table 3 also shows that our method is the most favored except for vid2vid. In addition to the results on synthesized data, we also report results for the video prediction task. As shown in Fig. 5, our method predicts background motion well and simultaneously captures the movement of the car on the left side. The details and structure of the scene are well preserved with our approach, while other methods suffer severe deformation. Table 4 shows that participants find our results more realistic.

5 Conclusion

In this work, we introduced the new task of video generation conditioned only on a single semantic label map, and proposed a novel method for this task. Instead of learning the generation end-to-end, which is very challenging, we employed a divide-and-conquer strategy that models appearance and motion in a progressive manner to obtain quality results. We demonstrated that introducing semantic information brings a large improvement when predicting motion from static content. The performance compared to other baselines indicates the effectiveness of the proposed method for video generation.