Temporally Coherent Video Harmonization Using Adversarial Networks

by Haozhi Huang, et al.

Compositing is one of the most important editing operations for images and videos. The process of improving the realism of composite results is often called harmonization. Previous approaches for harmonization mainly focus on images. In this work, we take one step further to attack the problem of video harmonization. Specifically, we train a convolutional neural network in an adversarial way, exploiting a pixel-wise disharmony discriminator to achieve more realistic harmonized results and introducing a temporal loss to increase temporal consistency between consecutive harmonized frames. Thanks to the pixel-wise disharmony discriminator, we are also able to relieve the need of input foreground masks. Since existing video datasets which have ground-truth foreground masks and optical flows are not sufficiently large, we propose a simple yet efficient method to build up a synthetic dataset supporting supervised training of the proposed adversarial network. Experiments show that training on our synthetic dataset generalizes well to the real-world composite dataset. Also, our method successfully incorporates temporal consistency during training and achieves more harmonious results than previous methods.



I Introduction

Fig. 1: Given a composite video generated by a direct cut-and-paste operation (a), our method learns to harmonize it to improve its realism. Compared with previous methods RealismCNN [1] (b) and Deep Image Harmonization [2] (c), the harmonized results of our method look more natural and temporally consistent (d). The warm color during sunset is correctly cast to the foreground in our result. HSV color values at the same position are shown for clear comparison.

Generating realistic composite videos is a fundamental requirement in video editing tasks. Given two videos, one contains a desired foreground and the other contains a desired background. To generate a realistic composite video from these two source videos, three steps must be performed correctly. First, extract a foreground object from one of the source videos by computing an alpha matte or a binary mask indicating the pixels which belong to the foreground. Second, paste the foreground to a proper location in the background. Third, adjust the foreground appearance to make it look natural in the new background. The last step is often called harmonization. In this paper, we focus on the video harmonization task, which aims to improve the realism of a composite video by adjusting the appearance of the foreground (see Figure 1).

Traditional methods for image harmonization are based on learning statistical relationships between hand-crafted appearance features [3, 4, 5, 6]. These methods do not consider whether the foreground and the background are compatible in the context of the whole image. Recently, several image harmonization methods have been proposed that leverage convolutional neural networks (CNNs) to automatically learn features capturing the context and semantic information of composite images, and these methods generate more appealing harmonization results. Zhu et al. [1] trained a CNN to distinguish natural images from generated ones. They then used the predicted realism score to guide a simple color adjustment of the foreground to obtain more realistic composite images. Unlike the method in [1], which treats realism evaluation and improvement as two separate steps, Tsai et al. [2] proposed an end-to-end network which takes a composite image and a foreground mask as input to generate a harmonized image directly.

All the methods mentioned above aim at harmonization for images. For videos, although applying image harmonization frame by frame can generate a video harmonization result, it introduces obvious flicker artifacts because the temporal consistency between consecutive harmonized frames is not considered. In this paper, we propose an end-to-end harmonization network for videos, which is able to simultaneously harmonize the composite frames and maintain the temporal consistency between them. In order to make the harmonization results more realistic, we propose to train the network with a pixel-wise discriminator. Unlike the global discriminators most commonly used in the literature, which predict whether an image as a whole is real or fake, our discriminator is trained to precisely distinguish the harmonious pixels from the disharmonious ones. In addition, with the well-trained pixel-wise discriminator, we are able to predict foreground masks automatically, which removes the need for input foreground masks. To constrain the network to generate temporally consistent results, we train it with a temporal loss term that incorporates temporal information in the training phase and avoids the cost of computing optical flows in the inference phase.

Training the proposed end-to-end harmonization network requires a large amount of video data taken at different scenes with ground-truth harmonious and disharmonious pairs, foreground masks and optical flows, which is still a missing piece in the community. To this end, we propose a simple yet effective way to build up a synthetic dataset which satisfies this demand. We also build up a real-world composite video dataset for evaluating the proposed method in real scenarios, which helps demonstrate that training on our synthetic dataset enables our network to generalize to real-world composite videos. Extensive experiments on the two datasets demonstrate that the proposed method is able to conduct temporally coherent video harmonization while generating more harmonious results than existing methods.

The contributions of our method are two-fold. Firstly, to the best of our knowledge, this is the first end-to-end CNN for video harmonization. The network is trained in an adversarial way with an introduced temporal loss to simultaneously acquire high-quality harmonization results and temporal consistency. Secondly, a synthetic dataset is proposed for supporting the efficient training of the video harmonization network, which contains ground-truth harmonious and disharmonious pairs, foreground masks and optical flows.

II Related Work

Our work attempts to generate a temporally consistent harmonized video by an adversarial network. This is closely related to the literature on image harmonization, conditional generative adversarial networks, and temporally consistent video editing.

Image harmonization. Traditional methods for image harmonization focus on matching appearance statistics without considering the context of images, such as aligning the statistics of global or local histograms [7, 6], shifting the colors towards predefined harmonious color templates [3], gradient-domain compositing [8, 9, 10], multi-scale matching of various statistics [5], and maximizing the co-occurrence probability of color distributions [4]. Recently, Zhu et al. [1] trained a discriminative model based on a CNN to predict a realism score for a composite image, and then used the score to determine a simple brightness and contrast adjustment for improving the realism. Instead of separating realism evaluation and improvement into two steps, Tsai et al. [2] proposed an end-to-end CNN to learn how to harmonize a composite image directly. To improve the ability to capture semantic information, Tsai et al. pretrained the harmonization network with semantic segmentation and used the segmentation branch to provide features that help harmonization. Unlike these existing algorithms, which address image harmonization, we propose an end-to-end CNN trained in an adversarial manner to solve the problem of video harmonization, which generates realistic and temporally consistent harmonized results.

Conditional image generation based on adversarial training. The generative adversarial network (GAN) was first proposed by Goodfellow et al. [11] to address the problem of realistic image generation from input noise variables. The key idea of a GAN is to train a generator and a discriminator in an adversarial fashion. While the discriminator is trained to distinguish fake images from real ones, the generator is trained to deceive the discriminator and generate images that are as realistic as possible. Recently, GANs have also been widely used in the task of conditional image generation [12, 13, 14, 15, 16, 17]. Although the Markovian discriminator proposed in [14, 18] performs patch-level discrimination instead of global discrimination, it is designed to speed up the discriminator by using a very small receptive field, which is insufficient for the harmonization task. In this paper, we propose to use an encoder-decoder structure to learn a pixel-wise discriminator which labels each pixel as harmonious or disharmonious, so that a more precise adversarial loss can be calculated. With the proposed discriminator, we can also complete the harmonization task without input foreground masks.

Temporally consistent video editing. Directly applying image harmonization frame by frame to videos inevitably results in flicker artifacts, because the corresponding regions in different frames are harmonized in different ways. Many approaches have been proposed to enforce temporal consistency in video editing, such as spatio-temporal smoothing [19, 20, 21], optimization with a temporal loss [22, 23], and frame propagation [24]. These methods either require extra post-processing operations or rely on a time-consuming optimization process. Recently, some video style transfer methods [25, 26] have shown that temporal consistency and style transfer can be learned simultaneously by a CNN, which achieves considerable temporal consistency at very little time cost. Inspired by this idea, our proposed method also incorporates temporal consistency during the training phase. However, instead of calculating a global temporal loss, we introduce a regional temporal loss that forces our model to pay more attention to the disharmonious regions, therefore leading to more coherent harmonized results.

Fig. 2: The training phase of the proposed video harmonization model. The harmonization network is trained in an adversarial manner with the pixel-wise disharmony discriminator. A two-frame coordinated training strategy is adopted to incorporate a regional temporal loss to constrain the consecutive harmonized foregrounds to have similar appearances.
Fig. 3: Building up the synthetic dataset. Panels: (a) ground-truth frame 1, (b) foreground mask, (c) pure background, (d) composite frame 1, (e) ground-truth frame 2, (f) composite frame 2. Given an image (a), we take it as the first ground-truth frame. Then we cut out the foreground and apply inpainting to obtain the pure background (c). By performing color adjustment on the foreground of (a), we obtain the first composite frame (d). By applying a random affine transform to the foregrounds of (a) and (d), we obtain the second ground-truth frame (e) and the second composite frame (f).

III Video Harmonization Network

In this section, we describe the details of our proposed end-to-end CNN for video harmonization. Figure 2 shows an overview of our network. The harmonization network takes one frame of a composite video and a foreground mask as input, and performs appearance adjustments on the foreground while keeping the background unchanged. To incorporate temporal consistency between consecutive harmonized frames, a two-frame coordinated training strategy with a regional temporal loss is adopted. Note that in the training phase the two frames are fed to the harmonization network in a coordinated but separate way, while in the testing phase the harmonization network processes a video frame by frame. This kind of setting has been proven effective in video style transfer [26] for training a network that produces smoother transformations with fewer flicker artifacts. To further enhance the realism of the harmonized results, the harmonization network is trained in an adversarial way with a pixel-wise disharmony discriminator, which distinguishes the disharmonious pixels from the harmonious ones. Moreover, the well-trained discriminator can also be employed to predict the disharmonious area in the input, which serves as a replacement for the input foreground mask.

III-A Synthetic Training Dataset

Before describing the network architecture, it is essential to describe the way we collect data. For supervised training, our harmonization network needs a composite video and a corresponding harmonized video as a sample pair. Given an arbitrary composite video, it is hard to acquire a high-quality harmonized result even for a human expert. For the image harmonization task, Tsai et al. [2] collected images from the MSCOCO dataset [27] which have ground-truth foreground masks, and then applied color transfer between random foreground pairs with the same semantic labels. The image after the foreground adjustment is used as the input, and the original image is used as the ground truth. The difficulty of extending this idea to the video harmonization task lies in the limited number of videos that have ground-truth foreground masks. Even in the very recent video object segmentation dataset DAVIS [28], there are only 90 annotated videos. This number is far from enough, because the harmonization network requires training data covering a wide variety of scenes to learn the natural appearances of foregrounds in various cases. Meanwhile, we also need the ground-truth optical flows between consecutive frames for evaluating the temporal consistency.

To address this data issue, we construct a synthetic dataset named Dancing MSCOCO which contains ground-truth foreground masks and optical flows. Based on the MSCOCO dataset, we apply small-scale random affine transforms to the foregrounds and acquire a series of images containing the same "dancing" foreground, which simulate the consecutive frames in a real video. A similar strategy has been adopted by Dosovitskiy et al. [29] to collect data for training FlowNet to predict optical flows, which shows competitive performance compared to state-of-the-art methods like DeepFlow [30] and EpicFlow [31]. Similarly, Khoreva et al. [32] used synthetic frames to help the training of an object tracking system. The success of [29, 33, 32] suggests that proper synthetic data is sufficient, to some extent, for training a deep neural network that targets real video data.

Key outputs during the process of building up our Dancing MSCOCO dataset are shown in Figure 3. First, we select images containing "people" as the foregrounds from the MSCOCO dataset, and discard the images whose foreground area is smaller than of the whole image. Since the image numbers of different classes are imbalanced in the MSCOCO dataset, training all the classes together inevitably introduces biases. In this paper, we simply focus on images containing people to avoid a class bias. It is easy to transfer the proposed method to other kinds of foregrounds. Further solutions for avoiding a class bias are beyond the scope of this paper. Second, we cut out the foreground and apply inpainting [34] to fill the holes to obtain pure background images. For each background image, we perform random cropping to obtain a distorted copy of the background image and resize it back to the original size, which simulates a background movement in a video. Third, we apply color adjustments to the foregrounds to simulate the composite images. Besides performing color transfer [7] between random foreground pairs as in [2], we also perform random adjustments of the basic color properties including exposure, hue, saturation, temperature, contrast, and tone curve. This makes our dataset cover more kinds of compositing situations than the dataset created in [2]. Fourth, we apply the same random affine transform to the original foreground and the color-adjusted foreground, which simulates a foreground movement in a video, and paste each back onto the corresponding randomly cropped background. Since we know the exact affine transform between the foregrounds, it is easy to acquire the corresponding ground-truth optical flow. The affine transform parameters are sampled from a suitable range so that no large-scale foreground movements are produced. Finally, we obtain a total of 33,338 pairs of "consecutive original frames" and their corresponding "consecutive composite frames". 29,818 pairs are used as the training data, 1,000 pairs as the validation data, and 2,520 pairs as the testing data. We have also tried generating more than one distorted copy for each image, but found only a marginal improvement in training.
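The statement that a known affine transform directly yields the ground-truth optical flow can be illustrated with a short sketch. This is not the authors' code: the function names, the parameter ranges, and the choice to rotate about the image origin are all assumptions made for illustration, and the flow is expressed as a per-pixel displacement (dx, dy).

```python
import numpy as np

def affine_flow(height, width, A, b):
    """Dense flow induced by the map x' = A @ x + b on a pixel grid.

    Returns an (H, W, 2) array holding the displacement (dx, dy) per pixel.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    coords = np.stack([xs, ys], axis=-1)  # (H, W, 2), ordered as (x, y)
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    moved = coords @ A.T + b              # apply the affine map to every pixel
    return moved - coords                 # displacement = new position - old

def random_small_affine(rng, max_rot_deg=5.0, max_scale=0.05, max_shift=5.0):
    """Sample a small rotation, scale and shift (ranges are hypothetical)."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    A = scale * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
    b = rng.uniform(-max_shift, max_shift, size=2)
    return A, b

def foreground_flow(mask, A, b):
    """Keep the flow only where the (H, W) mask marks foreground (1)."""
    flow = affine_flow(mask.shape[0], mask.shape[1], A, b)
    return flow * mask[..., None]
```

In this sketch the flow is exact by construction, which is the property the synthetic dataset relies on; a real pipeline would also have to account for the background motion introduced by the random cropping.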

III-B Adversarial Training with Temporal Loss

Our network contains two parts, a harmonization network $G$ behaving as a generator and a pixel-wise disharmony discriminator $D$. The harmonization network processes a composite video frame by frame to generate a harmonized video. At each time step $t$, $G$ takes a composite frame $I_t$ and a foreground mask $M_t$ as input and generates a harmonized output frame $O_t$, which adjusts the appearance of the foreground to make it look more natural in the background. To incorporate temporal consistency, the network is trained in a two-frame coordinated manner to constrain the consecutive outputs $O_{t-1}$ and $O_t$ to be temporally coherent. Thanks to this training strategy, $G$ is able to generate temporally consistent harmonized frames, even though it processes a video frame by frame. The insight behind this strategy is that by constraining the outputs to be temporally consistent, we train a more stable network that can avoid amplifying small differences in the inputs. This strategy makes the harmonization network learn smoother transformations. To acquire more realistic harmonized results, we propose a pixel-wise disharmony discriminator $D$ to play against $G$ by telling disharmonious pixels from harmonious ones.

Pixel-wise disharmony discriminator. The objective for training the discriminator $D$ is to classify harmonious pixels into class $0$ and disharmonious ones into class $1$, while keeping the parameters of the harmonization network $G$ fixed. The loss function we aim to minimize is:


Here, $M_t$ is the input foreground mask, in which the foreground pixels are labeled as $1$ and the background pixels are labeled as $0$. $D(O_t)$ is the discriminator output for the harmonized frame $O_t$, which should be close to $M_t$ for guiding the discriminator to distinguish the harmonized pixels from the ground-truth realistic pixels. In our experiments, we find that training $D$ merely with $O_t$ cannot generalize well to the composite frame $I_t$, which is also a kind of image with a disharmonious foreground. Thus, we also feed $I_t$ to $D$ as a disharmonious sample, and give the loss terms relevant to $O_t$ and $I_t$ the same weight. $D(\hat{O}_t)$ is the discriminator output for the ground-truth realistic frame $\hat{O}_t$, in which all pixels should be labeled as $0$. Note that, unlike the most common global discriminators in the literature, which consider an image as a whole to be real or fake, our discriminator learns to classify each pixel separately. This is because the background pixels in $O_t$ and $I_t$ are definitely harmonious, and should be treated differently from the disharmonious pixels in order to train a more precise discriminator. To correctly label a pixel, the discriminator should have a large receptive field to capture context information. Thus a UNet [35] architecture is used for the discriminator, as for the harmonization network, which will be described in detail later. Theoretically, our method is a variant of LSGAN [36], which shows a more stable and faster convergence than vanilla GAN [11] or WGAN [37].
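A least-squares form of the discriminator objective that is consistent with the description above can be sketched as follows. The exact equation is not reproduced here, so the per-pixel mean, the factor of 0.5 shared by the two disharmonious samples, and all function and argument names are assumptions; only the targets (the mask for the harmonized and composite frames, all zeros for the real frame) and the equal weighting of the two disharmonious terms come from the text.

```python
import numpy as np

def discriminator_loss(d_harmonized, d_composite, d_real, mask):
    """Pixel-wise least-squares loss for the disharmony discriminator.

    All arguments are (H, W) arrays. `mask` uses 1 for foreground
    (disharmonious) pixels and 0 for background (harmonious) pixels.
    """
    # Harmonized and composite frames share the target `mask` and get
    # equal weight, as stated in the text.
    fake_term = 0.5 * (np.mean((d_harmonized - mask) ** 2)
                       + np.mean((d_composite - mask) ** 2))
    # Every pixel of the ground-truth frame should be labeled 0.
    real_term = np.mean(d_real ** 2)
    return float(fake_term + real_term)
```

The loss is zero only when the discriminator reproduces the mask on both disharmonious inputs and outputs zeros on the real frame, which is what makes its prediction usable as a foreground mask estimate.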

Harmonization network. The objective for training the harmonization network $G$ is to generate temporally coherent harmonized frames that are indistinguishable from realistic ones. The loss function that we minimize for training $G$ is:


Here, $N$ denotes the total number of pixels in a frame, and $N_F$ denotes the number of pixels in the foreground. The harmonization network is trained with a combination of a reconstruction loss, a regional temporal loss, and an adversarial loss. The reconstruction loss enforces the harmonized frame $O_t$ to be similar to the ground-truth realistic frame $\hat{O}_t$. The temporal loss enforces the harmonized frame $O_t$ to be temporally consistent with the previous harmonized frame $O_{t-1}$. $S$ denotes a Spatial Transformer Network [38], which warps $O_{t-1}$ according to the ground-truth optical flow provided by our Dancing MSCOCO dataset. The Spatial Transformer Network is fully differentiable, which makes our network end-to-end trainable. Instead of treating each pixel equally as in the global temporal loss [26], our regional temporal loss focuses on the foreground region by taking a Hadamard product of the foreground mask and the global element-wise difference between frames: $M_t \odot (O_t - S(O_{t-1}))$. The insight behind this is that since the network only needs to learn a simple identity mapping for the background pixels, it should pay more attention to learning how to generate temporally consistent foregrounds. In our experiments, we find that the generated backgrounds achieve great temporal consistency even without any applied temporal loss.
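The difference between the regional and the global temporal loss can be made concrete with a small sketch. It assumes that the previous frame has already been warped to the current frame, that the penalty is a squared difference, and that the regional loss is normalized by the foreground pixel count while the global loss is normalized by the total pixel count; the norm and the function names are assumptions rather than details taken from the paper.

```python
import numpy as np

def regional_temporal_loss(curr, prev_warped, mask):
    """Squared temporal difference restricted to the foreground.

    curr, prev_warped: (H, W, C) frames; mask: (H, W), 1 = foreground.
    """
    # Hadamard product of the mask with the element-wise frame difference.
    diff = mask[..., None] * (curr - prev_warped)
    n_fg = max(float(mask.sum()), 1.0)  # guard against an empty foreground
    return float(np.sum(diff ** 2) / n_fg)

def global_temporal_loss(curr, prev_warped):
    """Squared temporal difference averaged over every pixel of the frame."""
    n = curr.shape[0] * curr.shape[1]
    return float(np.sum((curr - prev_warped) ** 2) / n)
```

With this normalization, a change confined to the background contributes nothing to the regional loss, while a change inside a small foreground is not diluted by the large number of unchanged background pixels.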

On the other hand, the adversarial loss encourages the harmonization network $G$ to generate harmonized results that the discriminator $D$ cannot distinguish from realistic ones, hence enforcing the output of $G$ to be close to the manifold underlying realistic frames. Although the reconstruction loss already enforces the harmonized frames to be similar to the realistic frames, this is not enough because there may be multiple valid answers besides the ground truth for harmonizing a composite frame. Different training samples with similar compositing situations may teach $G$ different harmonization solutions for similar inputs. This eventually prompts $G$ to output an average of different solutions, which may fall outside the manifold of realistic frames. Leveraging $D$ to provide an adversarial loss for training $G$ can deal with this problem.
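The averaging effect described above can be checked with a toy calculation. The two foreground colors below are made-up RGB values standing in for two equally plausible harmonization results of the same input; under a squared reconstruction error, the best single output is their mean, which matches neither plausible result.

```python
import numpy as np

# Two hypothetical, equally plausible foreground colors (RGB in [0, 1]).
warm = np.array([1.0, 0.6, 0.2])
cool = np.array([0.2, 0.6, 1.0])
targets = np.stack([warm, cool])

def expected_squared_error(output, targets):
    """Average squared error of one output against all plausible targets."""
    return float(np.mean(np.sum((targets - output) ** 2, axis=1)))

# The minimizer of the average squared error is the mean of the targets,
# here a neutral gray that is neither the warm nor the cool solution.
l2_optimum = targets.mean(axis=0)

if __name__ == "__main__":
    print("L2-optimal output:", l2_optimum)
    print("error at optimum :", expected_squared_error(l2_optimum, targets))
    print("error at 'warm'  :", expected_squared_error(warm, targets))
```

A reconstruction loss alone therefore rewards the gray compromise over either realistic answer, which is the gap an adversarial term is meant to close.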

As shown in Figure 2, our harmonization network adopts a UNet-like architecture [35], whose skip connections preserve content details that may otherwise be lost during the progressive downsampling in the convolutional layers. Our discriminator adopts the same architecture in order to acquire a large receptive field. The network proposed for deep image harmonization [2] also uses skip connections, but in the form of element-wise summation instead of the feature concatenation used in the UNet. We have also tried the residual network used in style transfer [39]. The comparison between the UNet and these networks is reported in Section IV-C.

III-C Harmonization without Foreground Masks

Previous state-of-the-art harmonization methods all require a foreground mask as input in addition to the composite image itself. In this paper, with the help of a well-trained pixel-wise disharmony discriminator $D$, we can predict the disharmonious foreground area automatically. Thus, we are able to accomplish the harmonization task without an input foreground mask, using the same harmonization network $G$ that was trained with foreground masks: $O_t = G(I_t, D(I_t))$, where $I_t$ is the composite frame. Another solution for achieving the same goal is to train $G$ without input foreground masks in the first place. In our experiments, we find that this solution achieves similar performance to ours but requires the extra work of training another network.
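The mask-free inference path can be sketched as a simple composition of the two trained networks. The callables below are toy stand-ins, not the trained models, and the optional thresholding step is an assumption: the text does not say whether the discriminator output is binarized before being used as the mask.

```python
import numpy as np

def harmonize_without_mask(frame, generator, discriminator, threshold=None):
    """Use the discriminator's per-pixel prediction in place of the mask."""
    predicted_mask = discriminator(frame)  # (H, W), higher = more disharmonious
    if threshold is not None:              # optional binarization (assumption)
        predicted_mask = (predicted_mask >= threshold).astype(frame.dtype)
    return generator(frame, predicted_mask), predicted_mask

def toy_discriminator(frame):
    """Stand-in: flags pixels whose red channel is implausibly strong."""
    return (frame[..., 0] > 0.8).astype(np.float64)

def toy_generator(frame, mask):
    """Stand-in: halves the red channel inside the masked region only."""
    out = frame.copy()
    out[..., 0] = np.where(mask > 0.5, 0.5 * frame[..., 0], frame[..., 0])
    return out

if __name__ == "__main__":
    frame = np.zeros((2, 2, 3))
    frame[0, 0, 0] = 1.0
    harmonized, mask = harmonize_without_mask(frame, toy_generator, toy_discriminator)
    print(mask)
    print(harmonized[..., 0])
```

The generator interface is unchanged between the masked and mask-free settings, which is why no additional network has to be trained.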

IV Experiments

In this section, we present extensive experiments to evaluate the effectiveness of the proposed method.

Real-world composite dataset. Besides the synthetic dataset, we also build a dataset containing real-world composite videos to evaluate the effectiveness of the proposed method. First, we collect videos with the tag "fashion" from the YouTube-8M dataset [40], the content of which is usually a person talking in front of a static camera. We require the videos to be taken by a static camera because incompatible camera movements also influence the realism of a composite video, which is outside the scope of this paper. For each video, we cut out a 5-10 second clip which contains the desired content. To generate the ground-truth foreground masks, we use the Rotobrush tool in Adobe After Effects CC 2017 to manually label the foreground regions. Additionally, we collect various background videos from Videvo.net [41]. For each background video, we apply random adjustments of basic color properties as in Section III-A to get two more distorted copies to cover various compositing situations. In total, we acquire 30 foreground videos and 48 background videos. Then we extract the foregrounds and paste them onto the different backgrounds one by one. In the end, we get 1440 composite videos. For this real-world composite dataset, the ground-truth optical flows between frames are unavailable. Instead, we use the state-of-the-art method [33] to estimate the optical flows.

Implementation details. During training, we resize the input and the ground-truth frame to , and scale the range of color values to . We set and with a fixed learning rate of . We use a batch size of to alternately train our harmonization network and pixel-wise disharmony discriminator for epochs. Here, the batch size means that for each iteration we feed only one pair of consecutive frames for training. For optimization we use Adam [42] with . Hyperparameters are chosen by experiments on the validation dataset.

Method | PSNR | MSE | Temporal loss (synthetic) | Temporal loss (real) | Realism (user study) | Temporal (user study)
Cut-and-paste | 18.19 | 0.030 | 0.0006 | 0.027 | -0.004 | 2.176
Zhu [1] | 18.27 | 0.032 | 0.0822 | 0.098 | -0.330 | -2.064
Tsai [2] | 18.42 | 0.024 | 0.2325 | 0.029 | -0.179 | -1.291
Ours | 22.45 | 0.009 | 0.0247 | 0.026 | 0.551 | 1.179
TABLE I: Comparison with previous methods. The quantitative results (PSNR, MSE, and the two temporal losses) are evaluated on the synthetic dataset and the real-world composite dataset. The user study results show the Plackett-Luce scores acquired by the different methods.
Fig. 4: Visual results on the real-world composite dataset. (a) are the cut-and-paste results, i.e., the inputs to the harmonization methods. (b) are the results of RealismCNN [1]. (c) are the results of Deep Image Harmonization [2]. (d) are our results. Among all methods, our harmonized results appear most harmonious and temporally consistent. The foregrounds of our results are adjusted to a cool color tone, which looks more harmonious with the background.
Fig. 5: Visual results on the synthetic dataset. The first column shows the ground-truth harmonious frames. The second column shows the input frames. The third column shows the results of RealismCNN [1]. The fourth column shows the results of Deep Image Harmonization [2]. The last column shows our results. Among all methods, our harmonized results are closest to the ground truth and most temporally coherent.

IV-A Comparison with Previous Methods

We evaluate the proposed method through comparisons with two state-of-the-art methods [1, 2] and the cut-and-paste baseline. Here, we choose to compare with previous methods using their original settings and training data to demonstrate the temporal inconsistency of image harmonization methods and the effectiveness of the proposed synthetic dataset.

Table I shows the quantitative evaluation results. PSNR and MSE are calculated on the synthetic dataset. The temporal loss on the synthetic dataset is the regional temporal loss computed according to Equation 2 using the ground-truth optical flow. The temporal loss on the real-world composite dataset is the regional temporal loss computed using the estimated optical flow. Our method achieves better performance than the state-of-the-art methods [1, 2] in terms of PSNR, MSE, and temporal losses. Although the cut-and-paste baseline shows better temporal consistency on the synthetic dataset, its realism is far from satisfactory.
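For reference, the two image-quality metrics can be computed as in the following sketch. It assumes color values scaled to the range [0, 1], so the peak value is 1; the peak value and the function names are assumptions, not settings reported in the paper.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all pixels and channels."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given peak value."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))
```

Note that averaging per-frame PSNR values over a dataset generally gives a different number than computing PSNR from the dataset-averaged MSE, so the two columns of a results table need not be related by the formula above.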

To demonstrate that training on our synthetic dataset generalizes well to real-world composite videos, we set up a user study similar to those in [1, 2], using 15 real-world composite videos randomly picked from the real-world composite dataset. For each video, a user watches the four results and is asked to rank the four methods with regard to either single-frame realism or temporal consistency between frames. A total of 32 subjects participated in this study, giving a total of 480 rankings over the four candidate methods. We then use the Plackett-Luce (P-L) model [43] to compute the global ranking score for each method. Table I shows that, compared with the other harmonization methods, our method achieves the highest ranking score for both single-frame realism and temporal consistency. It is no surprise that the cut-and-paste method achieves the best temporal consistency score, but this comes at the cost of realism because no appearance adjustment is applied.
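One common way to fit Plackett-Luce scores from a set of rankings is the minorization-maximization (MM) iteration, sketched below. This is an illustrative implementation under stated assumptions, not the procedure used in the study: it assumes every item appears in the rankings and is ranked above last place at least once, and it reports mean-centered log-worths, which is only one possible scaling of the scores.

```python
import numpy as np

def fit_plackett_luce(rankings, n_items, n_iters=200):
    """Fit Plackett-Luce worths with an MM iteration.

    rankings: list of sequences of item ids, best first.
    Returns mean-centered log-worths (higher = preferred).
    """
    gamma = np.ones(n_items)
    # wins[i] = number of rankings in which item i is not ranked last.
    wins = np.zeros(n_items)
    for ranking in rankings:
        for item in ranking[:-1]:
            wins[item] += 1.0
    for _ in range(n_iters):
        denom = np.zeros(n_items)
        for ranking in rankings:
            order = np.asarray(ranking)
            # At each stage, one item is chosen among those still unranked.
            for stage in range(len(order) - 1):
                remaining = order[stage:]
                denom[remaining] += 1.0 / gamma[remaining].sum()
        gamma = wins / denom
        gamma /= gamma.sum()  # fix the arbitrary overall scale
    log_worth = np.log(gamma)
    return log_worth - log_worth.mean()

if __name__ == "__main__":
    # Hypothetical rankings of three methods by five participants.
    toy_rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2], [2, 0, 1]]
    print(fit_plackett_luce(toy_rankings, n_items=3))
```

Because the scores are identified only up to an additive constant on the log scale, negative values such as those in a centered score table simply indicate methods ranked below the average.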

Figure 1, Figure 4 and Figure 5 illustrate the visual results generated by different methods. Overall, our method generates more harmonious and temporally consistent results than previous methods. RealismCNN [1] may generate unsatisfactory results when the realism prediction process fails. Among all the methods, the foreground colors of our results are the most consistent with the backgrounds. In addition, both [1] and [2] generate foregrounds in different appearances across frames, which leads to flicker artifacts in the video. More visual comparisons can be found in the supplementary material.

IV-B Effectiveness of the Synthetic Dataset

To evaluate the effectiveness of the synthetic dataset, we train our harmonization network without the discriminator on different datasets, and test on the synthetic dataset and the real-world composite dataset. The datasets we have tested include the MSCOCO dataset [27], the DAVIS dataset [28], and our Dancing MSCOCO dataset. For the MSCOCO dataset, similar to the preprocessing for our synthetic dataset, we only keep those images containing people as the foregrounds and those with foreground occupying over of the whole image, and apply color transfers between foregrounds to acquire composite images as in [2]. For the DAVIS dataset, we apply random adjustments to the basic color properties to create recolored copies for each of the original 70 videos, resulting in 4410 videos in the end. When training on the DAVIS dataset, we use the same temporal loss setting as in our proposed model. We also train on our dataset without the temporal loss for comparison. Table II shows that training on our dataset with our regional temporal loss achieves smaller MSE and temporal losses than training on the other datasets. This means that our Dancing MSCOCO dataset generalizes the model to more compositing cases, while the temporal loss incorporated during the training phase leads to temporally consistent harmonized results in the inference phase. Although training on the DAVIS dataset leads to a smaller temporal loss on the real-world composite dataset, this comes at the expense of single-frame realism. This is because the DAVIS dataset covers a limited number of scenarios, which makes generalizing to other datasets very difficult.

Dataset                    PSNR    MSE     Temporal loss (synthetic)   Temporal loss (real-world)
MSCOCO                     21.38   0.011   0.051                       0.027
DAVIS                      17.11   0.031   0.060                       0.015
Ours (w/o temporal loss)   23.29   0.008   0.043                       0.033
Ours                       22.13   0.009   0.022                       0.023
TABLE II: Performance of training on different datasets.
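For reference, the two single-frame metrics reported in the tables can be computed as in the following sketch. It assumes pixel values in [0, 1] (so the peak value is 1.0) and images stored as flat lists. Neither the value range nor the averaging protocol is stated in this section, and the function names are illustrative.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)

# Toy harmonized output and ground truth (4 pixels, one channel).
pred = [0.5, 0.5, 0.5, 0.5]
target = [0.4, 0.6, 0.5, 0.5]
frame_mse = mse(pred, target)
frame_psnr = psnr(pred, target)
```

Note that averaging per-frame PSNR values is not the same as converting an averaged MSE to PSNR, so the two columns of a results table need not satisfy the single-image formula.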

IV-C Analysis of the Network Architecture

Model choice. To justify the choice of the UNet architecture, we train different networks without the discriminator, using the same setting on the Dancing MSCOCO dataset. The network architectures we have tested include the UNet [35], the deep image harmonization network (DIH) proposed in [2], and the style transfer residual network proposed in [39]. The number of parameters in the different architectures is constrained to the same level. Table III shows that the UNet achieves the smallest MSE. This is because the skip connections used in the UNet preserve image details well and keep extra information that is needed by the harmonization task. Although DIH obtains the smallest temporal loss, its MSE is the largest and its PSNR is the lowest. As ensuring the harmonization quality is the first priority, we choose the UNet as our network structure.

Architecture   PSNR    MSE     Temporal loss (synthetic)   Temporal loss (real-world)
ResNet [39]    21.38   0.011   0.020                       0.026
DIH [2]        19.82   0.014   0.014                       0.022
UNet [35]      22.13   0.009   0.022                       0.023
TABLE III: Performance of different network architectures.

Regional temporal loss vs. global temporal loss. To demonstrate the effectiveness of the proposed regional temporal loss, we train our harmonization network without the discriminator using either the regional temporal loss or the global temporal loss. After the two models converge to similar PSNRs and MSEs, we stop the training and compare the temporal losses in the foreground regions. While using the regional temporal loss results in , using the global temporal loss results in . This shows that the regional temporal loss pushes the model to create more temporally consistent foregrounds.
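The difference between the two variants can be illustrated with a small sketch. It assumes that the previous output has already been warped into the current frame with optical flow, and that the regional variant simply restricts the average to foreground pixels. The precise definition (including occlusion handling and normalization) is given earlier in the paper and may differ; the function and variable names here are illustrative.

```python
def temporal_loss(curr, warped_prev, weights):
    """Weighted mean squared difference between the current output and the
    previous output warped into the current frame (flat pixel lists)."""
    total = sum(w * (c - p) ** 2 for c, p, w in zip(curr, warped_prev, weights))
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0

# Toy frame with 4 pixels: the first two are background, the last two foreground.
curr        = [0.50, 0.50, 0.80, 0.20]
warped_prev = [0.50, 0.50, 0.60, 0.40]
fg_mask     = [0.0, 0.0, 1.0, 1.0]

# Global variant: every pixel counts equally.
global_loss = temporal_loss(curr, warped_prev, [1.0] * len(curr))
# Regional variant: only foreground pixels contribute.
regional_loss = temporal_loss(curr, warped_prev, fg_mask)
```

In this toy case the background is perfectly stable, so the global average dilutes the foreground flicker, whereas the regional variant penalizes it at full strength. This is one plausible reading of why a regional loss yields more temporally consistent foregrounds.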

Fig. 6: The first row shows the results generated by the model trained with a pixel-wise disharmony discriminator. The second row shows the results generated by the model trained without a discriminator, in which the foreground boundaries are more obvious.

Ablation study on the adversarial loss. To evaluate the effectiveness of the adversarial loss, we train a model without the discriminator in addition to the proposed model. Figure 6 shows harmonized results generated by the two models: the model trained with a discriminator produces harmonized foregrounds with more realistic appearances and less obvious foreground boundaries. We also set up a user study using 15 videos randomly picked from the real-world composite dataset, in which each user watches a pair of videos generated by the two methods at a time and is asked to choose the one that looks more realistic. A total of 36 subjects participated in this study, giving 540 pairwise comparisons between the two candidate methods. While of the choices prefer the model with the discriminator, of the choices prefer the one without the discriminator. The rest consider the two equally good. This demonstrates that the pixel-wise disharmony discriminator contributes to the generation of more realistic harmonized results.
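The bookkeeping of such a pairwise study is simple: 36 subjects each rating 15 video pairs gives 36 × 15 = 540 comparisons, and each reported percentage is the share of comparisons that fall into one of three outcomes. A minimal sketch of the tally follows; the vote labels and the toy votes are made up for illustration and are not the study's data.

```python
from collections import Counter

OPTIONS = ("with_d", "without_d", "equal")  # hypothetical outcome labels

def preference_shares(votes):
    """Return the fraction of votes for each outcome."""
    counts = Counter(votes)
    total = len(votes)
    return {option: counts[option] / total for option in OPTIONS}

n_subjects, n_videos = 36, 15
n_comparisons = n_subjects * n_videos  # one comparison per subject and video

# Made-up toy votes, only to exercise the function.
toy_votes = ["with_d", "with_d", "equal", "without_d", "with_d", "equal"]
shares = preference_shares(toy_votes)
```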

IV-D Evaluation of Estimated Foreground Masks

With a well-trained pixel-wise disharmony discriminator, we are able to use the predicted disharmony map instead of the ground-truth foreground mask as input to generate harmonized results. Figure 7 shows that the disharmony maps predicted by our discriminator are close to the ground-truth foreground masks and are suitable as a replacement for the input foreground masks. For comparison, we test alternative solutions: using an all-zero mask or an all-one mask with our trained harmonization network, and training a new harmonization network from scratch that takes no foreground mask as input. Table IV shows that while the all-zero mask and all-one mask solutions result in large MSEs and low PSNRs, using a predicted disharmony map (Adversarial ) achieves performance similar to training a new network from scratch. This demonstrates that the discriminator endows the proposed network with the flexibility of being used either with or without input foreground masks.

Fig. 7: Disharmony maps predicted by our pixel-wise discriminator. The first row shows the inputs, with the ground-truth foreground masks in the top-right corner. The second row shows the predicted disharmony maps.

Solution                PSNR    MSE
All-Zero Mask           17.72   0.031
All-One Mask            14.00   0.048
Training from Scratch   20.04   0.016
Adversarial             19.87   0.018
Full Model with Mask    22.45   0.009
TABLE IV: Results of different solutions without input foreground masks.
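The mask substitutes compared in Table IV can be sketched as follows. The sketch assumes that the harmonization network receives the mask as a fourth input channel next to the RGB frame, which is an assumption made here for illustration. The predicted disharmony map would come from the trained discriminator, which is represented below by a hand-written toy map. All function names are hypothetical.

```python
def make_mask(mode, height, width, predicted=None):
    """Build the mask fed to the harmonization network without a ground-truth mask."""
    if mode == "all_zero":
        return [[0.0] * width for _ in range(height)]
    if mode == "all_one":
        return [[1.0] * width for _ in range(height)]
    if mode == "predicted":
        if predicted is None:
            raise ValueError("a predicted disharmony map is required")
        # Clamp the discriminator output to [0, 1] so it resembles a soft mask.
        return [[min(1.0, max(0.0, v)) for v in row] for row in predicted]
    raise ValueError("unknown mode: " + mode)

def attach_mask(frame, mask):
    """Append the mask value as a fourth channel to every RGB pixel."""
    return [[tuple(px) + (m,) for px, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]

# Toy 2x2 RGB frame and a toy stand-in for the discriminator output.
frame = [[(0.2, 0.4, 0.6), (0.8, 0.5, 0.1)],
         [(0.3, 0.3, 0.3), (0.9, 0.9, 0.9)]]
toy_disharmony_map = [[0.1, 0.9], [1.2, -0.3]]

net_input = attach_mask(frame, make_mask("predicted", 2, 2, toy_disharmony_map))
zero_input = attach_mask(frame, make_mask("all_zero", 2, 2))
```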

V Conclusions

In this paper, we proposed an end-to-end CNN for video harmonization. The proposed network contains two parts: a generator (i.e., the harmonization network) and a pixel-wise discriminator. The harmonization network takes a composite video as input and outputs a harmonized video that looks more realistic, while the pixel-wise discriminator plays against it by learning to distinguish disharmonious foreground pixels from harmonious ones. To maintain temporal consistency between consecutive harmonized frames, we adopted a regional temporal loss that makes the harmonization network pay more attention to generating temporally coherent foregrounds. To address the lack of suitable video training data, we constructed a synthetic dataset, Dancing MSCOCO, which provides harmonization and temporal correspondence ground truths at the same time. With the help of a pretrained pixel-wise disharmony discriminator, an input foreground mask is no longer required to obtain reasonable harmonization quality. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods.


The authors would like to thank Tencent AI Lab and Tsinghua University - Tencent Joint Lab for the support.


  • [1] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros, “Learning a discriminative model for the perception of realism in composite images,” in Proceedings of ICCV, 2015.
  • [2] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang, “Deep image harmonization,” in Proceedings of CVPR, 2017.
  • [3] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and Y.-Q. Xu, “Color harmonization,” ACM Transactions on Graphics (TOG), 2006.
  • [4] J.-F. Lalonde and A. A. Efros, “Using color compatibility for assessing image realism,” in Proceedings of ICCV, 2007.
  • [5] K. Sunkavalli, M. K. Johnson, W. Matusik, and H. Pfister, “Multi-scale image harmonization,” ACM Transactions on Graphics (TOG), 2010.
  • [6] S. Xue, A. Agarwala, J. Dorsey, and H. Rushmeier, “Understanding and improving the realism of image composites,” ACM Transactions on Graphics (TOG), 2012.
  • [7] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Computer Graphics and Applications, 2001.
  • [8] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics (TOG), 2003.
  • [9] J. Jia, J. Sun, C.-K. Tang, and H.-Y. Shum, “Drag-and-drop pasting,” ACM Transactions on Graphics (TOG), 2006.
  • [10] M. W. Tao, M. K. Johnson, and S. Paris, “Error-tolerant image compositing,” in Proceedings of ECCV, 2010.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of NIPS, 2014.
  • [12] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of CVPR, 2016.
  • [13] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of CVPR, 2016.
  • [14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of CVPR, 2017.
  • [15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of ICCV, 2017.
  • [16] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in Proceedings of ICLR, 2016.
  • [17] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion gan for future-flow embedded video prediction,” in Proceedings of ICCV, 2017.
  • [18] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of CVPR, 2017.
  • [19] M. Lang, O. Wang, T. O. Aydin, A. Smolic, and M. H. Gross, “Practical temporal consistency for image-based graphics applications,” ACM Transactions on Graphics (TOG), 2012.
  • [20] N. Bonneel, K. Sunkavalli, S. Paris, and H. Pfister, “Example-based video color grading,” ACM Transactions on Graphics (TOG), 2013.
  • [21] T. O. Aydin, N. Stefanoski, S. Croci, M. Gross, and A. Smolic, “Temporally coherent local tone mapping of hdr video,” ACM Transactions on Graphics (TOG), 2014.
  • [22] N. Bonneel, J. Tompkin, K. Sunkavalli, D. Sun, S. Paris, and H. Pfister, “Blind video temporal consistency,” ACM Transactions on Graphics (TOG), 2015.
  • [23] M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for videos,” in German Conference on Pattern Recognition, 2016.
  • [24] G. Ye, E. Garces, Y. Liu, Q. Dai, and D. Gutierrez, “Intrinsic video and applications,” ACM Transactions on Graphics (TOG), 2014.
  • [25] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei, “Characterizing and improving stability in neural style transfer,” in Proceedings of ICCV, 2017.
  • [26] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu, “Real-time neural style transfer for videos,” in Proceedings of CVPR, 2017.
  • [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of ECCV, 2014.
  • [28] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv:1704.00675, 2017.
  • [29] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of ICCV, 2015.
  • [30] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in Proceedings of ICCV, 2013.
  • [31] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Epicflow: Edge-preserving interpolation of correspondences for optical flow,” in Proceedings of CVPR, 2015.
  • [32] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele, “Lucid data dreaming for object tracking,” arXiv preprint arXiv:1703.09554, 2017.
  • [33] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of CVPR, 2017.
  • [34] A. Criminisi, P. Pérez, and K. Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Transactions on Image Processing (TIP), vol. 13, no. 9, pp. 1200–1212, 2004.
  • [35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proceedings of MICCAI, 2015.
  • [36] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in Proceedings of ICCV, 2017.
  • [37] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial network,” in Proceedings of ICML, 2017.
  • [38] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Proceedings of NIPS, 2015.
  • [39] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proceedings of ECCV, 2016.
  • [40] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
  • [41] Videvo. (2017) Videvo free footage. http://www.videvo.net.
  • [42] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.
  • [43] L. Maystre and M. Grossglauser, “Fast and accurate inference of plackett–luce models,” in Proceedings of NIPS, 2015.