Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning

04/14/2020
by   Kangning Liu, et al.

Existing unsupervised video-to-video translation methods fail to produce translated videos that are frame-wise realistic, semantic information preserving, and video-level consistent. In this work, we propose UVIT, a novel unsupervised video-to-video translation model. Our model decomposes style and content, uses a specialized encoder-decoder structure, and propagates inter-frame information through bidirectional recurrent neural network (RNN) units. The style-content decomposition mechanism enables us to achieve style-consistent video translation results and provides a good interface for modality-flexible translation. In addition, by changing the input frames and style codes incorporated in our translation, we propose a video interpolation loss, which captures temporal information within the sequence to train our building blocks in a self-supervised manner. Our model can produce photo-realistic, spatio-temporally consistent translated videos in a multimodal way. Subjective and objective experimental results validate the superiority of our model over existing methods. More details can be found on our project website: https://uvit.netlify.com



1 Introduction

Recent image-to-image translation (I2I) methods have achieved astonishing results by employing Generative Adversarial Networks (GANs) [goodfellow2014generative]. While there is an explosion of papers on I2I, its video counterpart is much less explored. Nevertheless, the ability to synthesize dynamic visual representations is important to a wide range of tasks such as video colorization [zhang2019deep], medical imaging [nie2016estimating], model-based reinforcement learning [arulkumaran2017brief, ha2018world], computer graphics rendering [kajiya1986rendering], etc.

Compared with the I2I task, video-to-video translation (V2V) is more challenging. Besides the frame-wise realism and semantic preservation requirements shared with the I2I task, V2V methods additionally need to maintain temporal consistency to generate sequence-wise realistic videos. Consequently, directly applying I2I methods on each frame of the video is not an optimal solution, because the I2I cross-domain mapping does not preserve temporal consistency within the sequence.

Recent methods [bansal2018recycle, bashkirova2018unsupervised, chen2019mocycle] have introduced different constraints to model temporal consistency on top of the Cycle-GAN approach [zhu2017unpaired] for unpaired datasets. They either use a 3D spatio-temporal translator [bashkirova2018unsupervised] or add a temporal loss to a traditional image-level translator [bansal2018recycle, chen2019mocycle]. However, 3DCycleGAN [bashkirova2018unsupervised] heavily sacrifices image-level quality, and ReCycleGAN [bansal2018recycle] suffers from style shift and inter-frame random noise.

Figure 1: A consistent video should be 1) style consistent and 2) content consistent. First row: label inputs; second row: ReCycleGAN [bansal2018recycle] outputs; third row: UVIT (ours) outputs. To overcome the style shift (e.g. a sunset frame gradually changing to a rain frame), we utilize style-conditioned translation. To reduce artifacts across frames, our translator incorporates multi-frame information. We use systematic sampling to get the results from a 64-frame sequence. The full video is provided in the appendix.

In this paper we propose UVIT, a novel framework for video-to-video cross-domain mapping. To this end, a temporally consistent video sequence translation should simultaneously guarantee: (1) style consistency, and (2) content consistency; see fig:compareHR for a visual example. Style consistency requires the whole video sequence to have the same style, ensuring that the video frames are overall realistic. Meanwhile, content consistency refers to the appearance continuity of contents in adjacent video frames, which ensures that the video sequence is dynamically vivid.

In UVIT, by assuming that all domains share the same underlying structure, namely a content space, we exploit style-conditioned translation. To simultaneously impose style and content consistency, we adopt an Encoder-RNN-Decoder architecture as the video translator; see fig:overview for an illustration of the proposed framework. There are two key ingredients in our framework:

Conditional video translation: By applying the same style code to decode the content features of a specific translated video, the translated video is style consistent. Besides, by changing the style code across videos, we achieve subdomain and modality flexible video translation (hereafter we say subdomain rather than domain, since a subdomain is a subset of a domain; for instance, day, night and snow are subdomains of the scene video domain); see fig:teaser for an illustration of subdomains (columns) and modalities (rows). This overcomes the limitations of existing CycleGAN-based video translation techniques, which perform deterministic translations (the generator acting as an injective function).

Figure 2: Overview of our proposed UVIT model: given an input video sequence, we first decompose it into content by a Content Encoder and style by a Style Encoder. Then the content is processed by special RNN units, namely TrajGRUs [shi2017deep], to obtain the content used for translation and interpolation in a recurrent manner. Finally, the translation content and the interpolation content are decoded into the translated video and the interpolated video together with the style latent variable. We also show the video adversarial loss (orange), the cycle consistency loss (violet), the video interpolation loss (green) and the style encoder loss (blue)
Figure 3: Our proposed UVIT model can produce photo-realistic, spatio-temporally consistent translated videos in a multimodal way for multiple subdomains

Consistent video translation:

Building inter-frame dependency is essential for generating dynamically vivid videos. Existing video translators utilize optical flow or 3D Convolutional Neural Networks (CNNs), which are incapable of fully capturing the complex relationships between multiple video frames. We adopt an RNN-based translator that integrates inter-frame and current-frame information in a high-dimensional hidden space across more frames. As validated by Bojanowski et al. on the image generation task [bojanowski2018optimizing], integrating features in hidden space is beneficial for producing semantically meaningful, smooth nonlinear interpolations in image space. Lacking paired supervision, another crucial aspect of unsupervised video translation lies in the training criterion. Besides the GAN loss [goodfellow2014generative] and the spatial cycle consistency loss [zhu2017unpaired], we propose a video interpolation loss as a temporal constraint to strengthen semantic preservation. Specifically, we use the translator building blocks to interpolate the current frame according to the inter-frame information produced during translation. The current frame is then used as a self-supervised target to tailor the inter-frame information. Meanwhile, it has been validated that introducing a self-supervision task is beneficial for cross-domain unsupervised tasks [sun2019unsupervised, chen2019self, ren2018cross]. Such self-supervision is applied to all building blocks of the translator, which stabilizes the challenging unpaired video adversarial learning. The effectiveness of this loss is validated through an ablation study.

The main contributions of our paper are summarized as follows:

  1. We introduce an Encoder-RNN-Decoder video translator, which decomposes temporal consistency into independent style and content consistencies for more stable and consistent video translation.

  2. Our style-conditioned decoder ensures style consistency as well as facilitates multimodal and multi-subdomain V2V translation.

  3. We use self-supervised learning to incorporate an innovative video interpolation loss, which tailors the inter-frame information according to the current frame. The resulting translated frame is more semantically faithful to the corresponding input frame. Therefore, our RNN-based translator can recurrently generate dynamically vivid and photo-realistic video frames.

2 Related Work

Image-to-Image Translation. Most GAN-based I2I methods focus on the case where paired data exists [pix2pix2017, zhu2017toward, wang2018high]. However, with the cycle-consistency loss introduced in CycleGAN [zhu2017unpaired], promising performance has also been achieved for unsupervised I2I [huang2018multimodal, almahairi2018augmented, liu2017unsupervised, mo2018instagan, romero2018smit, gong2019dlow, choi2018stargan, wu2019transgaga, cho2019image, wu2019relgan, alharbi2019latent]. The conditional distribution of translated images given an input image is quite likely to be multimodal (a semantic label can map to different images even in a fixed weather condition), yet traditional I2I methods often lack this characteristic and produce a unimodal outcome. Zhu et al. [zhu2017toward] proposed Bicycle-GAN, which can output diverse translations in a supervised manner. There are also extensions [huang2018multimodal, almahairi2018augmented, karras2019style] of CycleGAN that decompose style and content so that the output can be multimodal in the unsupervised scenario. Our work goes in this direction: under the assumption that close frames within the same domain share the same style, we adapt the style control strategy proposed in the image domain by Almahairi et al. [almahairi2018augmented] to the video domain.

Video-to-Video Translation. In the seminal work vid2vid, Wang et al. [wang2018video] combined optical flow and video-specific constraints and proposed a general solution for V2V in a supervised way, achieving long-term high-resolution video sequences. However, vid2vid relies heavily on labeled data, which makes it difficult to scale to unsupervised real-world scenarios. As our approach targets the unsupervised V2V setting, that setting is the focus of this paper.

Based on the I2I CycleGAN approach, recent methods [bashkirova2018unsupervised, bansal2018recycle, chen2019mocycle] for unsupervised V2V design spatio-temporal losses to achieve more temporally consistent results while preserving semantic information. Bashkirova et al. [bashkirova2018unsupervised] proposed 3DCycleGAN, which adopts 3D convolutions in the generator and discriminator of the CycleGAN framework to capture temporal information. However, the small 3D convolution operator (with a temporal dimension of only 3) only captures dependencies between adjacent frames, so 3DCycleGAN cannot exploit temporal information to generate longer style-consistent video sequences. Furthermore, the 3D discriminator is also limited in capturing complex temporal relationships between video frames. As a result, when the gap between the input and target domains is large, 3DCycleGAN tends to sacrifice image-level quality and generates blurry and gray translations.

Additionally, Bansal et al. [bansal2018recycle] designed a recycle loss (ReCycleGAN) that jointly models the spatio-temporal relationship between video frames and thus addresses the semantic preservation problem. They trained a temporal predictor to predict the next frame from two past frames, and plugged this predictor into the cycle loss to impose a spatio-temporal constraint on a traditional image-level translator. Although ReCycleGAN succeeds in V2V translation scenarios such as face-to-face or flower-to-flower, similar to CycleGAN it lacks domain generalization, as the translation fails to be consistent in domains with a large gap with respect to the input. We argue that two major factors limit ReCycleGAN in complex scenarios. First, its translator is a traditional image-level translator without the ability to record inter-frame information within videos. It processes input frames independently, which limits its capacity to exploit temporal information, so it is not sufficiently content consistent. Second, the ReCycleGAN temporal predictor only imposes a temporal constraint between a few adjacent frames, so the generated video content can still shift abnormally: a sunny scene could change to a snowy scene in the following frames. Note that Chen et al. [chen2019mocycle] incorporate optical flow to add motion cycle consistency and motion translation constraints; however, their Motion-guided CycleGAN still suffers from the same two limitations as ReCycleGAN.

In summary, previous methods fail to produce style-consistent and multimodal video sequences, and they lack the ability to achieve translation that is both sufficiently content consistent and frame-wise realistic. In this paper, we propose UVIT, a novel method for unsupervised multimodal video-to-video translation, which produces high-quality semantic-preserving frames with consistency across the video sequence. Moreover, to the best of our knowledge, our method is the first to jointly address multiple subdomains and multimodality in V2V cross-domain translations.

3 Unsupervised Multimodal VIdeo-to-video Translation via Self-Supervised Learning (UVIT)

3.1 Problem setting

Let A and B be two video domains, and consider sequences of video frames drawn from each of them; for example, sequences of semantic segmentation labels or scene images. Our general goal of unsupervised video-to-video translation is to train a generator to convert videos between domain A and domain B with many-to-many mappings. Either domain A or domain B can have multiple subdomains (e.g. sunny, snow, rain in the case of weather conditions). More concretely, to generate style-consistent video sequences, we assume all frames of a video share a style latent variable, and we denote the style latent variables of domain A and domain B separately.

We aim to learn two conditional video translation mappings, one from domain A to domain B and one from domain B to domain A. As we propose to use the video interpolation loss to train the translator components in a self-supervised manner, we also define a video interpolation mapping within each domain. The interpolation and translation mappings use exactly the same building blocks.

3.2 Translation and Interpolation pipeline

In this work, inspired by UNIT [liu2017unsupervised], we assume a shared content space such that corresponding frames in two domains are mapped to the same latent content representation. We show the translation and interpolation processes in fig:merge_module. To achieve the goal of unsupervised video-to-video translation, we propose an Encoder-RNN-Decoder translator which contains the following components:

  • Two content encoders, which extract the frame-wise content information from each domain to the common spatial content space.

  • Two style encoders, which encode video frames into their respective style domains.

  • Two Trajectory Gated Recurrent Units (TrajGRUs) [shi2017deep], which form a Bi-TrajGRU that propagates the inter-frame content information bidirectionally. TrajGRU [shi2017deep] is a variant of the Convolutional RNN (Recurrent Neural Network) [xingjian2015convolutional] that can actively learn the location-variant structure in video data. More details are given in the appendix.

  • One merge module, which adaptively combines the inter-frame content from the two directions.

  • Two conditional content decoders, which take the spatio-temporal content information and the style code to generate the output frame. If needed, they also take the conditional subdomain information as a one-hot vector encoding.

Figure 4: Video translation (left) and video interpolation (right): the two processes share modules organically. The input latent content is processed by the Merge Module to merge information from the TrajGRUs in both the forward and the backward direction. The translation content is obtained by updating the interpolation content with the content from the current frame

Video translation: Given an input frame sequence, we extract the posterior style from the first frame with a style encoder. Additionally, we extract each frame's content representation with the content encoder.

Translation is conducted in a recurrent way. To get the translation result for a time step, we process the independent content representations as follows: (1) propagate the content of the surrounding frames through the Bi-TrajGRU to obtain the inter-frame content information; (2) update this information with the current frame content (see fig:merge_module left, Merge Module) to get the spatio-temporal content for translation. Finally, using the same style-conditioned strategy as Augmented CycleGAN [almahairi2018augmented, dumoulin2016learned, perez2018film], the content decoder takes the prior style information as the condition and utilizes the spatio-temporal content to generate the translation result. This process is repeated until we obtain the whole translated sequence.
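To make the recurrent pipeline concrete, the following is a minimal sketch of how the components described above could be wired together (the A-to-B direction); the module interfaces, tensor shapes and the update step are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoTranslatorSketch(nn.Module):
    """Hypothetical Encoder-RNN-Decoder translator wiring; all sub-modules are
    injected and their exact interfaces are assumptions."""

    def __init__(self, content_enc, style_enc, traj_fwd, traj_bwd, merge, update, decoder):
        super().__init__()
        self.content_enc = content_enc  # frame -> content feature map
        self.style_enc = style_enc      # frame -> style code
        self.traj_fwd = traj_fwd        # recurrent cell, forward in time
        self.traj_bwd = traj_bwd        # recurrent cell, backward in time
        self.merge = merge              # combines forward/backward inter-frame content
        self.update = update            # injects the current-frame content
        self.decoder = decoder          # (content, style) -> output frame

    def forward(self, frames):
        # frames: list of T tensors of shape (N, C, H, W)
        style = self.style_enc(frames[0])                 # one style code per video
        contents = [self.content_enc(f) for f in frames]  # frame-wise content

        # Propagate inter-frame content in both temporal directions.
        # The recurrent cells are assumed to accept hidden=None at the first step.
        fwd_states, h = [], None
        for c in contents:
            h = self.traj_fwd(c, h)
            fwd_states.append(h)
        bwd_states, h = [], None
        for c in reversed(contents):
            h = self.traj_bwd(c, h)
            bwd_states.append(h)
        bwd_states.reverse()

        translated, interp_contents = [], []
        for t, c in enumerate(contents):
            # Interpolation content: inter-frame information only.
            interp = self.merge(fwd_states[t], bwd_states[t])
            interp_contents.append(interp)
            # Translation content: interpolation content updated with the current frame.
            trans = self.update(interp, c)
            translated.append(self.decoder(trans, style))
        # interp_contents are decoded separately into interpolated frames for the
        # self-supervised video interpolation loss (omitted here).
        return translated, interp_contents
```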

The style code is injected as the condition of the (AdaIN-based [huang2017arbitraryadain]) content decoder. If a domain (e.g. scene images) is presorted, we have prior information on which subset (rain, night, etc.) a video belongs to, so we can take such prior information as a subdomain (subset) label to achieve deterministic control of the style. Within each subset, there are still different modalities (e.g. overcast day and sunny day within the day subdomain) to which we have no prior access; this modality information is therefore learned by the style encoder. The subdomain label (taken as a one-hot vector, if available) and the modality information together constitute the 21-dimensional style code. Style consistency is ensured by sharing the style code within a specific video sequence. Multimodal translation is realized by injecting different style codes across videos. When subdomain information is unavailable, the style encoder simply learns subdomain styles as modalities, and we can still generate multimodal, style-consistent results in a stochastic way.
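For illustration, a minimal AdaIN-style conditioning layer consistent with the description above could look as follows; the 21-dimensional style code comes from the text, while the MLP and feature sizes are assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class AdaINConditioning(nn.Module):
    """Sketch of AdaIN-based style conditioning: a small MLP maps the style
    code to per-channel scale and bias that modulate instance-normalized
    content features."""

    def __init__(self, style_dim=21, num_features=256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.mlp = nn.Linear(style_dim, 2 * num_features)

    def forward(self, content, style):
        # content: (N, C, H, W); style: (N, style_dim)
        gamma, beta = self.mlp(style).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(content) + beta
```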

Video interpolation: In the video translation process in fig:merge_module, when translating a specific frame, the translation content is formed by integrating the current frame content with the inter-frame information from the surrounding frames. The inter-frame content information helps to build up the dependency between each frame and its surrounding frames, ensuring content consistency across frames. However, if the two are not well aligned, image-level quality can be affected: the translated frame will tend to over-smooth image-level details and sacrifice high-level semantic correspondence with the input. Tailoring the inter-frame information is thus of pivotal importance.

Thanks to the flexible Encoder-Decoder structure, our decoder can generate an interpolated frame from the inter-frame content alone. The video interpolation loss is the L1 distance between the interpolated frame and the current frame, which adds supervision to the inter-frame content. The inter-frame content can thus be combined with the current frame content more organically. Therefore, the translation task directly benefits from the interpolation task, producing more semantically preserving and photo-realistic outcomes.

Meanwhile, such self-supervised training helps to make the network more stable in the challenging unpaired video adversarial learning [sun2019unsupervised]. GANs are powerful methods for learning a data distribution without supervision, yet training GANs is well known to be delicate and unstable [arjovsky1701towards, mao2017least, arjovsky2017wasserstein] and prone to mode collapse [bansal2018recycle]. Besides the cycle loss acting as a spatial constraint, we introduce the video interpolation loss as a temporal constraint for GAN training in a self-supervised way. It has been validated that bringing in self-supervision is beneficial for cross-domain unsupervised tasks (e.g., natural image synthesis) [sun2019unsupervised, chen2019self, ren2018cross].

Furthermore, our framework aims to learn latent representations for style and content, and it has been empirically observed [gulrajani2016pixelvae, chen2016variational] that it is non-trivial to use latent variables when they are coupled with a strong autoregressive decoder (e.g., an RNN). Goyal et al. [goyal2017z] found that an auxiliary cost can ease the training of latent variables in RNN-based generative latent variable models. The video interpolation task therefore provides the latent variables with an auxiliary objective that enhances the performance of the overall model.

Note that the proposed temporal loss differs substantially from the previous ReCycleGAN loss [bansal2018recycle]: (1) we use an RNN-based architecture that better captures temporal information in a high-level feature space, (2) interpolation is conducted within the translator building blocks rather than in separate modules, training the translator with direct self-supervision, and (3) the translator directly utilizes the tailored inter-frame information for better semantic-preserving translations.

3.3 Loss functions

We use the least-squares [mao2017least] version of the Relativistic GAN (RGAN) [jolicoeur2018relativistic] for the adversarial loss. RGAN estimates the probability that given real data is more realistic than randomly sampled fake data.

We use image-level discriminators and video-level discriminators to ensure that output frames resemble a real video clip at both the image level and the video level. Moreover, we also add style discriminators to train the style encoders adversarially.

Video adversarial loss. The translated video frames should be realistic compared to real samples in the target domain on both an image-level and a video-level basis.

(1)

where the translated frames of a sub-sequence (from the first to the last time step) are scored by the image-level discriminator and the video-level discriminator of domain B. The adversarial loss for domain A is defined analogously.
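Since the equation itself is not reproduced above, a plausible relativistic least-squares form consistent with this description (our notation, not necessarily the paper's exact formulation) is

\mathcal{L}^{B}_{adv} = \mathbb{E}\big[(D^{img}_{B}(\tilde{x}^{B}_{t}) - D^{img}_{B}(x^{B}_{t}) - 1)^{2}\big] + \mathbb{E}\big[(D^{vid}_{B}(\tilde{x}^{B}_{1:T}) - D^{vid}_{B}(x^{B}_{1:T}) - 1)^{2}\big],

where \tilde{x}^{B}_{t} is a translated frame, x^{B}_{t} a real frame of domain B, and \tilde{x}^{B}_{1:T}, x^{B}_{1:T} the corresponding clips.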

Video interpolation loss. The interpolated video frames should be close to the ground truth frames (pixel-wise loss). Additionally, they aim to be realistic compared to other real frames within the domain (adversarial loss).

Since we use bidirectional TrajGRUs, only the interior frames of a sub-sequence (those with both past and future neighbours) are interpolated and contribute to the video interpolation loss. The first part of the loss is the supervised pixel-wise L1 loss, and the latter part is the GAN loss computed with the image-level discriminator; a weighting factor controls the balance between the two terms.
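In the same assumed notation, the elided loss for domain A could be written as

\mathcal{L}^{A}_{vi} = \mathbb{E}\big[\lambda_{vi}\,\lVert\hat{x}^{A}_{t} - x^{A}_{t}\rVert_{1} + (D^{img}_{A}(\hat{x}^{A}_{t}) - D^{img}_{A}(x^{A}_{t}) - 1)^{2}\big],

where \hat{x}^{A}_{t} is the frame interpolated from its neighbours, x^{A}_{t} the corresponding real frame, and \lambda_{vi} the weighting factor; this is a reconstruction of the elided equation, not its verbatim form.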

Cycle consistency loss. In order to ensure semantic consistency in an unpaired setting, we use a cycle-consistency loss:

(2)

where the reconstructed frames of domain A are obtained by translating the input sequence to domain B and back, using the posterior style variable produced by encoding the original sequence with the style encoder, and a weighting factor controls the strength of the cycle consistency loss.
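A plausible reconstruction of the elided equation, in our notation, is

\mathcal{L}^{A}_{cyc} = \lambda_{cyc}\,\mathbb{E}\big[\lVert G_{B\rightarrow A}(G_{A\rightarrow B}(x^{A}_{1:T}, z_{B}),\, \hat{z}_{A}) - x^{A}_{1:T}\rVert_{1}\big],

where z_{B} is a style code of domain B, \hat{z}_{A} the posterior style obtained by encoding x^{A}_{1:T} with the style encoder, and \lambda_{cyc} the cycle consistency weight.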

Style encoder loss.

To train the style encoders, the style reconstruction loss and the style adversarial loss are defined in a similar way as in Augmented CycleGAN [almahairi2018augmented]:

(3)

Here, the prior style latent variable of domain A is drawn from the prior distribution, the reconstructed style latent variable of domain A is obtained by applying the style encoder to the generated frames, and a weighting factor controls the style reconstruction loss.
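A plausible form of the two terms, following Augmented CycleGAN and using our notation, is

\mathcal{L}^{A}_{style} = \lambda_{style}\,\mathbb{E}\big[\lVert\hat{z}_{A} - z_{A}\rVert_{1}\big] + \mathbb{E}\big[(D^{style}_{A}(\hat{z}_{A}) - D^{style}_{A}(z_{A}) - 1)^{2}\big],

where z_{A} is drawn from the prior, \hat{z}_{A} is recovered by the style encoder from frames generated with z_{A}, and \lambda_{style} is the style reconstruction weight.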

Therefore, the objective for the generator is:

(4)
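With the notation above, the elided objective plausibly amounts to the sum of the four loss groups over both domains,

\mathcal{L}_{G} = \sum_{X \in \{A, B\}}\big(\mathcal{L}^{X}_{adv} + \mathcal{L}^{X}_{vi} + \mathcal{L}^{X}_{cyc} + \mathcal{L}^{X}_{style}\big),

with the individual weighting factors folded into each term; the exact grouping in the paper may differ.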

Detailed weight values and the loss functions for the discriminators are given in the appendix. The detailed training algorithm for RGANs can be found in [jolicoeur2018relativistic].

4 Experiments

We validate our method on two common yet challenging datasets: the Viper [richter2017playing] and Cityscapes [cordts2016cityscapes] datasets. In order to feed more frames within the memory of a single GPU, we train at a reduced resolution with 10 frames per batch for the main experiments. During inference, we use video sequences of 30 frames. These 30 frames are divided into 4 smaller sub-sequences of 10 frames with overlap, all sharing the same style code to remain style consistent. Note that our model can easily be extended to process longer style-consistent video sequences by continuing to share the same style code across sub-sequences. A video example of a longer style-consistent video is provided in the appendix, where a detailed description of the datasets and implementation is also given.

4.1 Ablation Study

In order to demonstrate the contribution of our method, we first conduct ablation experiments. We provide quantitative and qualitative results that evidence the benefit of the proposed video interpolation loss for V2V translation. Besides, we study how the number of frames influences the semantic preservation performance. We also provide multimodal consistent results of our model trained without subdomain labels in the appendix.

Video interpolation loss. We provide ablation experiments to show the effectiveness of the proposed video interpolation loss. We conduct experiments on both the image-to-label and the label-to-image tasks. We denote UVIT trained without the video interpolation loss as "UVIT wo/vi".

We follow the experimental setting of ReCycleGAN [bansal2018recycle] and use semantic segmentation metrics to quantitatively evaluate the image-to-label results. The Mean Intersection over Union (mIoU), Average Class Accuracy (AC) and Pixel Accuracy (PA) scores for ablation experiments are reported in table:comparison_ablation. Our model with video interpolation loss achieves the best performance across subdomains, which confirms that the video interpolation helps to preserve the semantic information between the translated frame and the corresponding input frame.
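For reference, a minimal sketch of how these three segmentation metrics can be computed from a confusion matrix; this follows the standard definitions and is not the exact evaluation code used for the numbers below.

```python
import numpy as np

def segmentation_scores(conf):
    """conf: KxK confusion matrix (rows: ground truth, cols: prediction).
    Returns (mIoU, average class accuracy, pixel accuracy)."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1).astype(float)    # pixels per ground-truth class
    pred = conf.sum(axis=0).astype(float)  # pixels per predicted class
    union = gt + pred - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1e-12), np.nan)
    miou = np.nanmean(iou)                 # classes absent everywhere are ignored
    ac = np.nanmean(np.where(gt > 0, tp / np.maximum(gt, 1e-12), np.nan))
    pa = tp.sum() / conf.sum()
    return miou, ac, pa
```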

Criterion Model Day Sunset Rain Snow Night
mIoU UVIT wo/vi 10.14 10.70 11.06 10.30 9.06
UVIT 13.71 13.89 14.34 13.23 10.10
AC UVIT wo/vi 15.07 15.78 15.46 15.01 13.06
UVIT 18.74 19.13 18.98 17.81 13.99
PA UVIT wo/vi 56.33 57.16 58.76 55.45 55.19
UVIT 68.06 66.35 67.21 65.49 58.97
Table 1: Image-to-label (semantic segmentation) quantitative evaluation. We compare UVIT with UVIT trained without the video interpolation loss (UVIT wo/vi) under the Mean Intersection over Union (mIoU), Average Class Accuracy (AC) and Pixel Accuracy (PA) scores
Criterion Model Day Sunset Rain Snow Night
FID UVIT wo/vi 26.95 23.04 30.48 34.62 47.50
UVIT 17.32 16.79 19.52 18.91 19.93
Table 2: Label-to-image quantitative evaluation. We compare UVIT with UVIT trained without the video interpolation loss (UVIT wo/vi) under the Fréchet Inception Distance (FID) score

For the label-to-image task, we use the Fréchet Inception Distance (FID) [heusel2017gans] to evaluate the feature distribution distance between translated videos and ground-truth videos. Similar to vid2vid [wang2018video], we use a pre-trained network (I3D [carreira2017quo]) to extract features from videos. We take the semantic labels from the respective subdomains to generate videos and evaluate the FID score on all subdomains of the Viper dataset. table:comparison_FID_2 shows the FID score for UVIT and the corresponding ablation experiment. On both the image-to-label and label-to-image tasks, the proposed video interpolation loss plays a crucial role in achieving good translation results.
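A compact sketch of the FID computation on such features (the standard formula; the actual evaluation pipeline may differ in feature extraction and preprocessing):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """feats_*: (N, D) arrays of video features, e.g. from a pre-trained I3D.
    Fits a Gaussian to each set and returns the Frechet distance between them."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```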

Different number of input frames. Our RNN-based translator incorporates temporal information from multiple frames. We also investigate the influence of the number of frames on the performance of our model. As shown in table:Ablation2, UVIT achieves better semantic preservation when more frames are fed during training, as the RNNs are better trained to leverage the temporal information. Specifically, for the image-to-label translation, increasing the number of frames from 4 to 10 raises the overall mIoU from 11.19 to 13.07. The complete table and analysis are attached in the appendix.

Criterion Frame number Day Sunset Rain Snow Night All
mIoU 4 11.84 11.91 12.35 11.37 8.49 11.19
6 12.29 12.66 13.03 11.77 9.79 11.94
8 13.05 13.21 14.23 13.07 11.00 12.87
10 13.71 13.89 14.34 13.23 10.10 13.07
Table 3: Quantitative results of UVIT with different numbers of frames per batch during training on the image-to-label (semantic segmentation) task. As the number of input frames in the sub-sequence increases, our RNN-based translator can utilize the temporal information better, resulting in better semantic preservation

4.2 Comparison of UVIT with State-of-the-Art Methods

Image-to-label mapping. To further ensure reproducibility, we use the same setting as in our ablation study to compare UVIT with ReCycleGAN [bansal2018recycle] on the image-to-label mapping task. We report the mIoU, AC and PA metrics of the proposed approach and competing methods in table:comparison. The results clearly validate the advantage of our method over the competing approaches in terms of preserving semantic information. Our model effectively leverages the inter-frame information from more frames in a direct way, which utilizes the temporal information better than the indirect way of ReCycleGAN [bansal2018recycle].

Criterion Model Day Sunset Rain Snow Night All
mIoU Cycle-GAN 3.39 3.82 3.02 3.05 7.76 4.10
ReCycleGAN (Reproduced)* 10.31 11.18 11.26 9.81 7.74 10.11
ReCycleGAN (Reported)** 8.50 13.20 10.10 9.60 3.10 8.90
UVIT (Ours) 13.71 13.89 14.34 13.23 10.10 13.07
AC Cycle-GAN 7.83 8.56 7.91 7.53 11.12 8.55
ReCycleGAN (Reproduced)* 15.78 15.80 15.95 15.56 11.46 14.84
ReCycleGAN (Reported)** 12.60 13.20 10.10 13.30 5.90 12.40
UVIT (Ours) 18.74 19.13 18.98 17.81 13.99 17.59
PA Cycle-GAN 15.46 16.34 12.83 13.20 49.03 19.59
ReCycleGAN (Reproduced)* 54.68 55.91 57.72 50.84 49.10 53.65
ReCycleGAN (Reported)** 48.70 70.00 60.10 58.90 33.70 53.70
UVIT (Ours) 68.06 66.35 67.21 65.49 58.97 65.20
* Result reproduced by us; the outputs are downscaled to the evaluation resolution before computing the statistics.
** Result reported in the original paper [bansal2018recycle] at its original resolution.
Table 4: Quantitative comparison between UVIT and baseline approaches on the image-to-label (semantic segmentation) task. Our translator effectively leverages the temporal information in a direct way, thus producing more semantically preserving translation outcomes
Criterion Model Day Sunset Rain Snow Night
FID ReCycleGAN [bansal2018recycle] 23.60 24.45 28.54 31.58 35.74
Improved ReCycleGAN 20.39 21.32 25.67 21.44 21.45
UVIT (ours) 17.32 16.79 19.52 18.91 19.93
Table 5: Quantitative comparison between UVIT and baseline approaches on the label-to-image task. The better FID indicates that our translation has better visual quality and temporal consistency
Human Preference Score Video level Image level
UVIT (ours) / Improved ReCycleGAN 0.67 / 0.33 0.66 / 0.34
UVIT (ours) / 3DCycleGAN [bashkirova2018unsupervised] 0.75 / 0.25 0.70 / 0.30
UVIT (ours) / vid2vid [wang2018video] 0.49 / 0.51
UVIT (ours) / CycleGAN [zhu2017unpaired] 0.61 / 0.39
Table 6: Label-to-image Human Preference Score. Our method outperforms all the competing unsupervised methods in both video-level and image-level reality. Note that we achieve comparable performance with vid2vid although it is supervised

Label-to-image mapping. In this setting, we compare the quality of the video sequences translated by different methods. We first report the FID score [heusel2017gans] on all subdomains of the Viper dataset in the same setting as our ablation experiments. As the original ReCycleGAN output video sequences cannot ensure style consistency, as shown in fig:compareHR, we also report the results of our improved version of ReCycleGAN for a fair comparison. Concretely, we develop a conditional version which explicitly controls the style of the generated video sequences in a similar way as our model, and denote it as Improved ReCycleGAN. The FID results of the different methods are shown in table:comparison_FID. The proposed UVIT achieves a better FID on all 5 subdomains, which validates the effectiveness of our model in achieving better visual quality and temporal consistency. Combining table:comparison_FID_2 and table:comparison_FID, there is another observation: UVIT without the video interpolation loss does not dominate the Improved ReCycleGAN in terms of FID. This shows that the video interpolation loss is crucial for the superiority of our spatio-temporal translator.

To thoroughly evaluate the visual quality of the video translation results, we conduct a subjective evaluation on the Amazon Mechanical Turk (AMT) platform. Detailed information on this subjective test is provided in the appendix. We compare the proposed UVIT with 3DCycleGAN and ReCycleGAN. The video-level and image-level human preference scores (HPS) are reported in table:HPS. For reference, we also compare the video-level quality between UVIT and the supervised vid2vid model [wang2018video]. Meanwhile, an image-level quality comparison between UVIT and CycleGAN (the image translation baseline) is also included. table:HPS clearly demonstrates the effectiveness of our proposed model. In the video-level comparison, our unsupervised model outperforms the competing unsupervised ReCycleGAN and 3DCycleGAN by a large margin, and achieves comparable results with the supervised benchmark. In the image-level comparison, UVIT achieves a better HPS than both the competing V2V approaches and the image-to-image baseline. Qualitative examples in fig:compare1 also show that the UVIT model produces a more content-consistent video sequence, which could not be achieved by simply introducing style control without the specialized network structure that records the inter-frame information. For a better comparison, we include several examples of generated videos in the appendix.

Figure 5: Label-to-image qualitative comparison. Top left: label inputs; top right: Improved ReCycleGAN outputs; bottom left: UVIT outputs; bottom right: ground truth. Video examples can be found in the appendix

4.3 More experimental results

Figure 6: Viper Sunset-and-Day Top left: input Sunset video; Top right: input Day video; Bottom left: translated Day video; Bottom right: translated Sunset video

High resolution results. To get a higher resolution and show more details within the existing GPU constraints, we also train our model at a higher resolution with 4 frames per batch, then test on longer sequences, which are divided into sub-sequences of 4 frames with overlap. A visual example is shown in fig:compareHR. More results and videos are provided in the appendix.

Translation on other datasets. Besides translating video sequences between image and semantic label domains, we also train models to translate video sequences between different scene image subdomains and different video datasets.

In fig:s2dHD and fig:r2sHD, we provide visual examples of video translation between Sunset-and-Day and Rain-and-Snow scenes in the Viper dataset. Visual examples of translation between the Viper and Cityscapes [cordts2016cityscapes] datasets are shown in figure 8. They show the ability of our approach to learn the association between synthetic videos and real-world videos. More examples and the corresponding videos are attached in the appendix.

Figure 7: Viper Rain-and-Snow Top left: input Rain video; Top right: input Snow video; Bottom left: translated Snow video; Bottom right: translated Rain video
Figure 8: Cityscapes to Viper translation. Top left: input Cityscapes video; Top right: translated Viper video in the night scenario; Bottom left: translated Viper video in the snow scenario; Bottom right: translated Viper video in the sunset scenario

5 Conclusion

In this paper, we have proposed UVIT, a novel method for unsupervised video-to-video translation. A specialized Encoder-RNN-Decoder spatio-temporal translator decomposes style and content in the video for temporally consistent and modality-flexible video-to-video translation. In addition, we have designed a video interpolation loss within the translator which exploits the highly structured video data to train our translators in a self-supervised manner. This enables the effective application of RNN-based networks to the challenging V2V task. Extensive experiments show the effectiveness of the proposed UVIT model. Without using any paired training data, UVIT is capable of producing excellent multimodal video translation results, which are image-level realistic, semantic information preserving and video-level consistent.

Acknowledgments. This work was partly supported by the ETH Zürich Fund (OK), and by Huawei, Amazon AWS and Nvidia grants.

References

1 Additional Loss Details and Implementation Details

1.1 Loss functions for the discriminator

In this section, we provide more details of our image-level, video-level, and style latent variable discriminator losses. For simplicity, we only present the loss functions for domain A; the loss functions for domain B are defined by the same set of equations. Our adversarial loss is based on the Relativistic GAN (RGAN) [jolicoeur2018relativistic], which tries to predict the probability that a real sample is relatively more realistic than a fake one.

Image level discriminator loss

The loss term is defined as follows:

(5)

Video level discriminator loss

for domain A is defined as follows:

(6)

Style latent variable discriminator loss

This loss term for the style domain A is defined as follows:

(7)
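Since the three equations are not reproduced here, plausible relativistic least-squares forms for domain A (our notation; the exact formulations in the paper may differ) are

\mathcal{L}_{D^{img}_{A}} = \mathbb{E}\big[(D^{img}_{A}(x^{A}_{t}) - D^{img}_{A}(\tilde{x}^{A}_{t}) - 1)^{2}\big],
\mathcal{L}_{D^{vid}_{A}} = \mathbb{E}\big[(D^{vid}_{A}(x^{A}_{1:T}) - D^{vid}_{A}(\tilde{x}^{A}_{1:T}) - 1)^{2}\big],
\mathcal{L}_{D^{style}_{A}} = \mathbb{E}\big[(D^{style}_{A}(z_{A}) - D^{style}_{A}(\hat{z}_{A}) - 1)^{2}\big],

i.e. each discriminator is trained to rate real frames, clips and prior style codes as more realistic than the corresponding generated ones.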

1.2 Network structure

Style Encoder, Content Encoder and Content Decoder

Our style encoder is similar to the one used in Augmented CycleGAN [almahairi2018augmented]. Under the shared content space assumption [liu2017unsupervised], we decompose the style-conditioned Resnet generator of Augmented CycleGAN [almahairi2018augmented] into a Content Encoder and a Content Decoder. Moreover, when subdomain information is available, we assign part of the style latent variable to record such prior information; concretely, we use a one-hot vector to encode the subdomain information.

RNN - Trajectory Gated Recurrent Units (TrajGRUs)

Traditional RNNs (Recurrent Neural Networks) are based on fully connected layers, which have limited capacity to exploit the underlying spatio-temporal information in video sequences. In order to take full advantage of the spatial and temporal correlations, UVIT utilizes a convolutional RNN architecture in the generator. TrajGRU [shi2017deep] is a variant of the Convolutional RNN [xingjian2015convolutional] that can actively learn the location-variant structure in video data. It uses the input and hidden state to generate a local neighborhood set for each location at each time step, thus warping the previous state to compensate for motion. We use two TrajGRUs to propagate the inter-frame information in both directions in the shared content space.
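For illustration only, a simplified TrajGRU-like cell might look as follows: a convolution predicts several flow fields from the input and the previous hidden state, the previous hidden state is warped along each flow, and standard GRU gating is applied to the aggregated result. This is a hypothetical sketch; the actual TrajGRU [shi2017deep] differs in its connection structure and details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajGRUCellSketch(nn.Module):
    """Simplified TrajGRU-style cell: learn L flow fields, warp the previous
    hidden state along each of them, then apply GRU gating."""

    def __init__(self, in_ch, hid_ch, num_links=5):
        super().__init__()
        self.num_links = num_links
        self.flow_conv = nn.Conv2d(in_ch + hid_ch, 2 * num_links, 5, padding=2)
        self.in_conv = nn.Conv2d(in_ch, 3 * hid_ch, 3, padding=1)
        self.hid_conv = nn.Conv2d(hid_ch * num_links, 3 * hid_ch, 1)

    def _warp(self, h, flow):
        # Sample h at locations displaced by the predicted flow (in pixels).
        n, _, H, W = h.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(h.device)  # (2, H, W)
        coords = base.unsqueeze(0) + flow                          # (N, 2, H, W)
        grid_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0          # normalize to [-1, 1]
        grid_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
        grid = torch.stack((grid_x, grid_y), dim=-1)               # (N, H, W, 2)
        return F.grid_sample(h, grid, align_corners=True)

    def forward(self, x, h_prev):
        # h_prev: previous hidden state, e.g. initialized with zeros at t = 0.
        flows = self.flow_conv(torch.cat([x, h_prev], dim=1))
        warped = [self._warp(h_prev, flows[:, 2 * l:2 * l + 2])
                  for l in range(self.num_links)]
        h_agg = torch.cat(warped, dim=1)
        xz, xr, xh = self.in_conv(x).chunk(3, dim=1)
        hz, hr, hh = self.hid_conv(h_agg).chunk(3, dim=1)
        z = torch.sigmoid(xz + hz)          # update gate
        r = torch.sigmoid(xr + hr)          # reset gate
        h_tilde = torch.tanh(xh + r * hh)   # candidate state
        return z * h_prev + (1.0 - z) * h_tilde
```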

Discriminators

For the image-level discriminators, the architecture is based on the PatchGAN [pix2pix2017] approach. Likewise, the video-level discriminators are similar to PatchGANs, but employ 3D convolutional filters. For the style latent variable discriminators, we use the same architecture as in Augmented CycleGAN [almahairi2018augmented].

1.3 Datasets

We validate our method using two common yet challenging datasets: Viper [richter2017playing], and Cityscapes [cordts2016cityscapes] datasets.

Viper has semantic label videos and scene image videos. There are 5 subdomains for the scene videos: day, sunset, rain, snow and night. The large diversity of scene scenarios makes this dataset a very challenging testing bed for the unsupervised V2V task. We quantitatively evaluate translation performance by different methods on the image-to-label and the label-to-image mapping tasks. We further conduct the translation between different subdomains of the scene videos for qualitative analysis.

Cityscapes has real-world street scene videos. As there is no subdomain information for Cityscapes, we conduct the Cityscapes experiments without subdomain labels. We conduct a qualitative analysis of the translation between scene videos of the Cityscapes and Viper datasets. Note that there are no ground-truth semantic labels for the continuous Cityscapes video sequences; semantic labels are only available for a limited portion of non-continuous individual frames. Therefore, we cannot use Cityscapes for our evaluation of image-to-label (semantic segmentation) performance.

1.4 Implementation Details

We train our model with 10 frames per batch on a single NVIDIA P100 GPU for the main experiments, in order to capture temporal information from more frames. With the batch size set to one, training takes about one week. Note that it takes roughly 4 days to train using 6 frames per batch.

During inference, we use video sequences of 30 frames. These 30 frames are divided into 4 smaller sub-sequences of 10 frames with overlap; they all share the same style code to remain style consistent. To get a higher resolution and show more details within the existing GPU resource constraints, we also train our model at a higher resolution with 4 frames per batch.

Parameters. The video interpolation loss weight is set to 10, the cycle consistency loss weight to 10, and the style reconstruction loss weight to 0.025.

1.5 Human Preference Score

We have conducted human subjective experiments to evaluate the visual quality of synthesized videos using the Amazon Mechanical Turk (AMT) platform.

For the video-level evaluation, we show two videos (synthesized by two different models) to AMT participants, and ask them to select which one looks more realistic regarding a video-consistency and video quality criteria.

  • UVIT (ours) / 3DCycleGAN: Since 3DcycleGAN [bashkirova2018unsupervised] generates consistent output with 8 frames in the original paper setting, UVIT results are organized to 8 frames for a fair comparison.

  • UVIT (ours) / Improved ReCycleGAN: When comparing with improved RecycleGAN [bansal2018recycle], we take each video clip with 30 frames.

  • UVIT (ours) / vid2vid: When comparing with vid2vid [wang2018video], we take each video clip with 28 frames, following the setting in vid2vid [wang2018video].

For the image-level evaluation, we show to AMT participants two generated frames synthesized by two different algorithms, and ask them which one looks more real in visual quality.

These evaluations have been conducted on 100 video samples and 100 frame samples to assess the video-level and image-level quality, respectively. We gathered answers from 10 different workers for each sample.

2 Additional examples of the label-to-image qualitative comparison ()

More results of the label-to-image mapping comparison between UVIT (ours) and Improved ReCycleGAN are depicted in fig:ours_comparision_append and the attached video 1_LRCompare.mp4, for which we give a short description to guide the comparison. From left to right, there are outputs for six different input samples to compare:

  • 1: Please see the trajectory of the car and the surrounding road.

  • 2: Please see the boundary between two cars.

  • 3: Please see the translation of the road to check the complete translation and consistency across frames.

  • 4: Please see the walls and the pillar across frames.

  • 5: Please see the consistency of the road.

  • 6: Please see the consistency of the wall.

Figure 9: Video screen cut of the label-to-image qualitative comparison. First row: semantic label inputs; Second row: improved ReCycleGAN outputs; Third row: UVIT outputs. Fourth row: ground truth. A full video file can be found in 1_LRCompare.mp4.

3 Higher resolution results ()

To get a higher resolution and show more details within the existing GPU resource constraints, we also train our model at a higher resolution with 4 frames per batch. At test time, we divide a longer sequence into sub-sequences of 4 frames with overlap. All results in this section are produced by this higher-resolution model trained with 4 frames per batch.

3.1 Additional examples of the label-to-image qualitative comparison

In the two figures below (and their corresponding videos), we provide visual examples of how our UVIT method compares with RecycleGAN [bansal2018recycle]. The RecycleGAN outputs are generated with the original code provided by the RecycleGAN authors. Besides the video-level quality comparison from the videos, we encourage the reader to also check the frame-level quality from the images, since the .mp4 format may fail to preserve some image-level quality.

Figure 10: Video screenshot of the video corresponding to Fig. 1 in the main paper. The corresponding video is attached
Figure 11: Video screenshot of the comparison with RecycleGAN [bansal2018recycle]: we aim to compare the content consistency and image-level quality. Here the RecycleGAN results are produced by the original RecycleGAN code at its original resolution. Since there is no guarantee of style consistency for RecycleGAN, we select RecycleGAN visual results from a short sequence of 30 frames where the style is almost consistent to compare with UVIT (ours). The corresponding video is attached

3.2 Quantitative comparison of the label-to-image and image-to-label

In table:comparison_append and Table 8, we show quantitative results for our proposed method trained at the higher resolution with 4 frames per batch.

Criterion Model Day Sunset Rain Snow Night All
mIoU ReCycleGAN (Reproduced) 10.32 11.19 11.25 9.83 7.73 10.12
UVIT (Ours, 4 frames) 12.05 12.23 13.37 11.54 10.49 11.93
AC ReCycleGAN (Reproduced) 15.80 15.79 15.93 15.57 11.47 14.85
UVIT (Ours, 4 frames) 17.21 17.41 18.16 17.37 14.30 16.50
PA ReCycleGAN (Reproduced) 54.70 55.92 57.71 50.85 49.11 53.66
UVIT (Ours, 4 frames) 63.44 61.98 64.72 60.83 62.05 62.35
Table 7: Quantitative comparison between UVIT and baseline approaches on the image-to-label (semantic segmentation) task (UVIT trained with 4 frames per batch). Our translator effectively leverages the temporal information in a direct way, thus producing more semantically preserving translation outcomes
Criterion Model Day Sunset Rain Snow Night
FID ReCycleGAN [bansal2018recycle] 23.60 24.45 28.54 31.58 35.74
UVIT (ours, 4 frames) 18.68 16.70 20.20 18.27 19.29
Table 8: Quantitative comparison between UVIT and baseline approaches on the label-to-image task (UVIT trained with 4 frames per batch). The better FID indicates that our translation has better visual quality and temporal consistency. We use the pre-trained network (I3D [carreira2017quo]) to extract features from 30-frame sequences, just as in the experiments in the main paper.

3.3 Label-to-image multi-subdomain and multimodality results

Video results of UVIT translating label sequences to image sequences with multiple subdomains and modalities are shown in fig:Multimodaliy_append and the enclosed video 4_Multimodality.mp4. Each video has a length of 220 frames.

Figure 12: Video screenshot of the label-to-image multi-subdomain and multimodality results. Better depicted in 4_Multimodality.mp4.

3.4 Long video example (1680 frames)

In fig:long and the attached video, we provide a video sequence example with more than 1680 frames to give a qualitative impression of how our UVIT model performs in terms of style consistency. Note that the semantic labels in Viper [richter2017playing] are automatically generated; we observe that small flips may still occasionally exist in the input semantic label sequence.

Figure 13: Screenshot of a long style-consistent translation video example (1680 frames). Left: input semantic labels; right: UVIT translated video in the sunset scenario. All frames within the video share the same style code to keep the style consistent. The corresponding video is attached

3.5 Translation on other datasets

In fig:R2S_HR_append and the attached video 6_Rainandsnow.mp4, we provide visual examples of UVIT video translation between Rain and Snow scenes in the Viper dataset. In fig:S2D_HR_append and the attached video 7_Sunsetanday.mp4, we provide visual examples of UVIT video translation between Sunset and Day scenes in the Viper dataset. In fig:C2G_HR_append and the attached video 8_Cityscapesandviper.mp4, we provide visual examples of UVIT video translation between the Cityscapes and Viper datasets. Besides the video-level quality evaluation from the videos, we encourage the reader to also check the frame-level quality from the images, since the .mp4 format may fail to preserve some image-level quality.

Figure 14: Screenshot of Viper Rain-and-Snow translation. First row: real rain inputs; Second row: translated snow videos; Third row: real snow inputs; Fourth row: translated rain videos. Video is attached as 6_Rainandsnow.mp4
Figure 15: Screenshot of Viper Sunset-and-Day translation. First row: real sunset inputs; Second row: translated day videos; Third row: real day inputs; Fourth row: translated sunset videos. Video is attached as 7_Sunsetandday.mp4
Figure 16: Screenshot of Cityscapes-and-Viper translation. Top left: real Cityscapes input; Top right: translated Viper videos with different style codes; Bottom left: real Viper input; Bottom right: translated Cityscapes videos with different style codes. Since the general distribution between Cityscapes and Viper may be different (e.g. there are more buildings in Cityscapes), the translated Viper video may differ from input Cityscapes video in class distribution to fool the discriminator, so as to be close to the class distribution in the target domain. Video is attached as 8_Cityscapesandviper.mp4

4 Additional Results for Ablation Study

Here we provide the supplementary results for the ablation part. We first give a complete table for the ablation experiment on how the performance depends on the input frame number. After that, we give the qualitative results of multimodal consistent videos when UVIT is trained and tested without the sub-domain label.

4.1 Different input frame number

Our RNN-based translator incorporates temporal information from multiple frames. In table:Ablation2_append, we provide the complete table corresponding to the "Different input frame number" ablation study section in the main paper.

4.2 UVIT without the subdomain label

To check how UVIT performs in terms of style consistency without subdomain label information, we run this ablation experiment. The results are shown in fig:No_sub and the corresponding video. By randomly sampling the style code from the prior distribution, we can obtain multimodal consistent video results in a stochastic way.

Figure 17: Ablation study: no subdomain label is used during training and testing. The first video is the input semantic label sequence; the remaining videos are the translated scene videos with style codes randomly sampled from the prior distribution. Each video has 220 frames. The corresponding video is attached
Criterion Frame number Day Sunset Rain Snow Night All
mIoU 4 11.84 11.91 12.35 11.37 8.49 11.19
6 12.29 12.66 13.03 11.77 9.79 11.94
8 13.05 13.21 14.23 13.07 11.00 12.87
10 13.71 13.89 14.34 13.23 10.10 13.07
AC 4 16.78 16.75 16.57 16.32 12.21 15.7
6 17.50 17.46 17.66 16.73 14.23 16.62
8 18.42 18.28 19.19 17.80 15.18 17.68
10 18.74 19.13 18.98 17.81 13.99 17.59
PA 4 62.84 60.34 61.97 58.77 51.68 59.04
6 62.85 61.21 62.21 59.77 56.84 60.51
8 65.56 64.11 66.26 64.18 62.02 64.25
10 68.06 66.35 67.21 65.49 58.97 65.20
Table 9: Quantitative results of UVIT with different numbers of frames per batch during training on the image-to-label (semantic segmentation) task. As the number of input frames increases, our RNN-based translator can utilize the temporal information better, resulting in better semantic preservation