Automatic Video Colorization using 3D Conditional Generative Adversarial Networks

by   Panagiotis Kouzouglidis, et al.

In this work, we present a method for automatic colorization of grayscale videos. The core of the method is a Generative Adversarial Network that is trained and tested on sequences of frames in a sliding window manner. Network convolutional and deconvolutional layers are three-dimensional, with frame height, width and time as the dimensions taken into account. Multiple chrominance estimates per frame are aggregated and combined with available luminance information to recreate a colored sequence. Colorization trials are run successfully on a dataset of old black-and-white films. The usefulness of our method is also validated with numerical results, computed with a newly proposed metric that measures colorization consistency over a frame sequence.



1 Introduction

In this paper, we address the problem of automatic colorization of monochrome digitized videos [1, 2, 3, 4, 5]. Perhaps the most straightforward practical application is colorizing black-and-white footage from old films or documentaries. Video compression is another possible application of note [5].

Video colorization methods can be categorized according to the level of user interaction required. One group of methods assumes that a partially colored frame exists in the video, where color has been manually annotated in the form of color seeds [5, 6, 7]. The method must then propagate color from these seeds to the rest of the frame, and then to other frames in the video. Other methods assume instead that a reference colored image exists that is similar in content and structure to the target monochrome video frames [3, 4, 8]. These methods may or may not require user intervention; for example, in [9] the user can specify matching areas between the reference and the target frames. In reference image-based methods, the problem of video colorization is hence converted to the problem of how to propagate color from the reference frame to other frames and/or from frame to frame. Optical flow estimation has been used to guide frame-to-frame color propagation [2, 3]. In [7], Gabor feature flow is used as an alternative to standard optical flow, as a more robust guide to color propagation. Naturally, methods in this vein work best for coloring short videos or frames coming from the same scene [3].

In the present work, we propose a learning-based method for video colorization. As such, we assume that a collection of colored frames exists that will be used to train the model. In particular, the proposed method is based on an appropriately designed Generative Adversarial Network (GAN) [10]. GANs have gained a fair amount of traction in the last few years. Although they are even harder to train than standard neural networks, requiring various heuristics and a careful choice of hyperparameters to attain convergence to a Nash equilibrium [11, 12], they have proven to be excellent generative models. The proposed model employs a conditional GAN (cGAN) architecture, popularized by the pix2pix model [13]. In the current work, convolutional and deconvolutional layers are 3D (height, width, time dimensions) to accommodate the sequential nature of video data.

The main novel points of the current paper are as follows: (a) we present a model for learning-based automatic video colorization that can take advantage of the sequential nature of video, while avoiding the use of frame-by-frame color propagation techniques that come with their own inherent limitations (typically they require existing colored key frames and/or are practically applicable within a single shot). Other recent works use learning methods to color video via propagation [1], or via frame-by-frame image colorization, with each frame processed separately [14]; (b) we elaborate on the issue of video colorization evaluation and propose a quantitative colorization metric specifically for video; (c) we show that the proposed method creates colorization models that are transferable, in the sense that learning over a particular frame sequence produces a plausible output usable on a sequence of different content.

The rest of the paper is organized as follows. In section 2, we briefly discuss preliminaries on adversarial nets and present the architecture and processing pipeline of the proposed video colorization method. In section 3, we elaborate on existing numerical evaluation methods and propose a new metric to evaluate video colorization. In section 4, we show numerical and qualitative results of our method, tested on a collection of old films. We close the paper with section 5, where we discuss conclusions and future work.

Figure 2: Colorization results using our method (panels: Grayscale, Ground Truth, Proposed). Depicted frames are samples from films: “Et Dieu..créa la femme” and “Tzéni, Tzéni” (2 top and 2 bottom rows respectively).

Figure 3: Comparison of proposed model vs non-sequential 2D cGAN model (panels: 2D cGAN, Proposed). The proposed model produces better results than the non-sequential variant, as the former can take advantage of optical flow information, with its 3D convolution/deconvolution layers and estimate aggregation scheme. (Note for example how each method colorizes the hand of the standing actor on the top frame, or the color of the suit on the bottom frame). Depicted frames are samples from the film “Dial M for Murder”.

2 Proposed method

The proposed method assumes the existence of a training set consisting of a sequence of colored frames, and a test set consisting of a sequence of monochrome frames that are to be colorized. During the training phase, a cGAN model is used to learn how to color batches of $C$ ordered frames. The hyperparameter $C$ is fixed beforehand. During the testing phase, the model is run on the input monochrome video in a sliding window manner. Windows overlap and move by a single frame at a time, thereby producing a set of colorization estimates for each monochrome frame, and hence multiple video colorization proposals. These estimates are then combined to produce a single colorized output. In what follows we discuss the details of this process.
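The sliding-window scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sliding_windows` and `estimates_per_frame` are hypothetical, and no padding is applied at the ends of the video, so boundary frames are covered by fewer windows than interior frames.

```python
def sliding_windows(num_frames, window_size):
    """Return the frame indices covered by each window, moving one frame at a time."""
    if not 1 <= window_size <= num_frames:
        raise ValueError("window_size must be between 1 and num_frames")
    last_start = num_frames - window_size
    return [list(range(s, s + window_size)) for s in range(last_start + 1)]


def estimates_per_frame(num_frames, window_size):
    """Count how many windows (hence colorization estimates) cover each frame."""
    counts = [0] * num_frames
    for window in sliding_windows(num_frames, window_size):
        for t in window:
            counts[t] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical values: a 6-frame clip and a window of 3 frames.
    print(sliding_windows(6, 3))
    print(estimates_per_frame(6, 3))
```

Interior frames receive as many estimates as the window size; whether and how the first and last frames are padded to receive the same number is not specified here and is left open in the sketch.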

A GAN is a generative, neural network-based model that consists of two components, the generator network and the discriminator network. The cGAN architecture [13] that is employed as part of the proposed method is a supervised variant of the original unsupervised GAN [10]. A cGAN learns a mapping from observed input $x$ to target output $y$. (Footnote 1: Other variants of a cGAN are possible; for example, a noise variable could be added to produce a non-deterministic output [13]. We employ a deterministic cGAN variant in this work.)

Formally, the objective to be optimized is:

$$G^* = \arg\min_G \max_D \Big[ \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{x}[\log(1 - D(G(x)))] \Big] + \lambda\, \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_1\big],$$

with hyperparameter $\lambda$ controlling the trade-off between the GAN (discriminator) loss and the $L_1$ loss. $D$ and $G$ correspond to the discriminator and generator respectively. The GAN loss quantifies how plausible the colorization output is, while the $L_1$ loss forces the colorization to be close to the ground truth. We use representations in the CIE Lab color space (following e.g. [15]). For a monochrome frame sequence, only luminance is known beforehand. Input $x$ is a sequence of luminance channels (channel $L$) of $C$ consecutive frames, and the objective is to learn a mapping from luminance to chrominance (channels $a$, $b$), where $H$, $W$ are the frame dimensions.
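To make the two parts of the objective concrete, the following minimal sketch evaluates them for given discriminator outputs and chrominance values. It is an illustration, not the training code of the paper: the function name, the flat-list data layout and the small `eps` guard are assumptions, and the real-sample term $\log D(y)$ is taken to be the standard GAN term.

```python
import math


def cgan_objective_terms(d_real, d_fake, y_true, y_pred, lam, eps=1e-12):
    """Evaluate the GAN term, the L1 term and their weighted sum.

    d_real: discriminator probabilities D(y) for ground-truth samples.
    d_fake: discriminator probabilities D(G(x)) for generated samples.
    y_true, y_pred: flat lists of chrominance values for one sample.
    lam: trade-off weight (lambda) between the GAN term and the L1 term.
    """
    # Empirical means stand in for the expectations in the objective.
    gan_real = sum(math.log(p + eps) for p in d_real) / len(d_real)
    gan_fake = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    gan = gan_real + gan_fake
    # L1 norm of the difference between ground truth and prediction.
    l1 = sum(abs(a - b) for a, b in zip(y_true, y_pred))
    return gan, l1, gan + lam * l1


if __name__ == "__main__":
    # Hypothetical numbers: an undecided discriminator and a small L1 error.
    print(cgan_objective_terms([0.5], [0.5], [0.0, 1.0], [0.5, 0.5], lam=10.0))
```

The discriminator tries to increase the GAN term, while the generator tries to decrease the full sum; a larger `lam` pulls the generator towards the ground-truth colors at the expense of the adversarial term.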

The generator network comprises a series of convolutional and deconvolutional layers. Skip connections are added in the manner introduced by UNet [16]. As inputs and outputs are sequences of fixed-size frames, all convolutions and deconvolutions are three-dimensional (frame height, width and time dimensions). The encoder and decoder stacks each comprise strided convolutional/deconvolutional layers, followed iteratively by batch normalization (BN) and rectified linear unit (ReLU) activation layers. Following [17], outputs are forced to lie in the range $[-1, 1]$ with a tanh activation layer at the end of the generator network, and only later renormalized to valid chrominance values. The discriminator network is a 3D convolutional network comprising convolutional layers iteratively followed by BN and ReLU layers. The discriminator is topped by a fully connected (“dense”) layer and a sigmoid activation unit in order to map the image to a real/fake probability figure.
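For skip connections to work, each decoder layer has to reproduce the spatial size of the matching encoder layer. The sketch below traces this with the standard output-size formulas for strided and transposed convolutions. The kernel, stride and padding values and the function names are hypothetical, chosen only for illustration, and are not the settings of the paper.

```python
def conv_out(n, kernel, stride, padding):
    """Output length of a strided convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n, kernel, stride, padding):
    """Output length of a transposed convolution along one dimension."""
    return (n - 1) * stride - 2 * padding + kernel


def encoder_shapes(shape, num_layers, kernel, stride, padding):
    """Trace (height, width, time) sizes through a stack of 3D convolutions."""
    shapes = [tuple(shape)]
    for _ in range(num_layers):
        shape = tuple(
            conv_out(n, k, s, p)
            for n, k, s, p in zip(shape, kernel, stride, padding)
        )
        shapes.append(shape)
    return shapes


if __name__ == "__main__":
    # Assumed setup: 64x64 frames, 3-frame window, stride 2 in space, 1 in time.
    print(encoder_shapes((64, 64, 3), 2, (4, 4, 3), (2, 2, 1), (1, 1, 1)))
    # A matching transposed convolution restores the previous spatial size.
    print(deconv_out(16, 4, 2, 1))
```

With these assumed values each encoder layer halves height and width while leaving the time dimension unchanged, and the mirrored transposed convolution doubles them again, which is what allows encoder and decoder feature maps to be concatenated.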

At test time, we use the generator in a sliding window fashion over the footage to be colorized. Hence, each frame is given as input to the generator a total of $C$ times, since $C$ is the size of the sliding window. The produced chrominance estimates $\hat{y}_1, \dots, \hat{y}_C$ (footnote 2: lowercase $\hat{y}$ denotes the colorization estimate for a single frame, as opposed to a colorization estimate for a sequence of frames) then need to be combined into a single estimate $\hat{y}$. We can write $\hat{y}$ as a maximum-a-posteriori (MAP) estimate:

$$\hat{y} = \arg\max_{y} \; p(y) \prod_{i=1}^{C} p(\hat{y}_i \mid y) \qquad (1)$$

where a prior distribution $p(y)$ can be assumed over possible values of $y$ in order to favor a particular chrominance setup. If identical (e.g. isotropic Gaussian) distributions centered around each $\hat{y}_i$ are assumed and an uninformative prior is used, the above formula simplifies to an average over all chrominance values per frame pixel: $\hat{y} = \frac{1}{C} \sum_{i=1}^{C} \hat{y}_i$. Finally, the chrominance estimate is recombined with input luminance to recreate colored RGB frames for the input video. The architecture of the proposed model is summarized in figure 1.
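Under an uninformative prior, the aggregation step reduces to a per-pixel average of all estimates available for a frame. A minimal sketch follows, assuming windows that advance one frame at a time and frames stored as flat lists of chrominance values; the function name and data layout are hypothetical.

```python
def aggregate_estimates(window_outputs, num_frames):
    """Average all chrominance estimates produced for each frame.

    window_outputs[s][j] is the estimate for frame s + j, given as a flat
    list of chrominance values; window s starts at frame s.
    """
    sums = [None] * num_frames
    counts = [0] * num_frames
    for start, window in enumerate(window_outputs):
        for offset, frame in enumerate(window):
            t = start + offset
            if sums[t] is None:
                sums[t] = [0.0] * len(frame)
            for i, value in enumerate(frame):
                sums[t][i] += value
            counts[t] += 1
    # Per-pixel mean over however many estimates each frame received.
    return [[v / counts[t] for v in sums[t]] for t in range(num_frames)]


if __name__ == "__main__":
    # Two windows of size 2 over a 3-frame clip, one value per frame.
    outputs = [[[0.0], [2.0]], [[4.0], [6.0]]]
    print(aggregate_estimates(outputs, 3))
```

Dividing by the actual count per frame, rather than by the window size, keeps the average correct for boundary frames that are covered by fewer windows.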

3 Metrics for numerical evaluation of video colorization

In this section we describe the metrics we use for numerical evaluation of video colorization. We use two metrics that measure per-frame colorization quality, also usable in single-image colorization. Furthermore, we propose a new metric suitable for video colorization in particular.

Peak Signal-to-Noise Ratio (PSNR): PSNR is calculated for each test frame in the RGB colorspace, and the mean over all frames is reported as a benchmark for the whole video.
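As a reference for how such a figure is typically obtained, the sketch below computes per-frame PSNR from the mean squared error and averages it over a video. It assumes 8-bit RGB values (peak value 255) stored as flat lists; the function names are hypothetical.

```python
import math


def psnr(frame_a, frame_b, max_value=255.0):
    """PSNR in dB between two frames given as flat lists of RGB values."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(predicted_frames, truth_frames, max_value=255.0):
    """Mean per-frame PSNR over a whole video."""
    values = [psnr(p, t, max_value) for p, t in zip(predicted_frames, truth_frames)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical three-value "frames".
    print(psnr([10, 20, 30], [12, 20, 28]))
    print(mean_psnr([[10, 20, 30]], [[12, 20, 28]]))
```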

Raw Accuracy (RA): Raw Accuracy, used in [15] to evaluate image colorization, is defined in terms of accuracy of predicted colors over a varying threshold. Colors are classified as correctly predicted if their Euclidean distance in the $ab$ space is lower than a threshold. Accuracy is computed over color values for every pixel position and frame. Integrating over the curve that is produced by varying the threshold yields the RA metric. We integrated from $0$ to $150$ distance units as in [15].
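The RA computation can be sketched as an area under the accuracy-versus-threshold curve. The version below is an illustration under assumptions: chrominance is given as a list of two-value pairs per pixel, the threshold range and step are parameters, the area is computed with the trapezoidal rule and normalized to lie between 0 and 1, and a pixel counts as correct when its distance does not exceed the threshold. The function name is hypothetical.

```python
import math


def raw_accuracy(pred, truth, max_threshold=150, step=1):
    """Normalized area under the accuracy-vs-threshold curve.

    pred, truth: lists of two-value chrominance pairs, one pair per pixel.
    """
    dists = [math.hypot(pa - ta, pb - tb) for (pa, pb), (ta, tb) in zip(pred, truth)]
    thresholds = list(range(0, max_threshold + 1, step))
    # Fraction of pixels whose color error is within each threshold.
    accuracy = [sum(d <= t for d in dists) / len(dists) for t in thresholds]
    # Trapezoidal integration over the sampled thresholds.
    area = sum((accuracy[i] + accuracy[i + 1]) / 2.0 * step for i in range(len(accuracy) - 1))
    return area / (thresholds[-1] - thresholds[0])


if __name__ == "__main__":
    # Hypothetical two-pixel example with one exact and one approximate match.
    print(raw_accuracy([(0.0, 0.0), (30.0, 40.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

Multiplying the returned value by 100 gives a percentage comparable in form to the RA figures reported in the results table.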

Color Consistency (CC): The aforementioned metrics measure strictly the quality of colorization of each frame separately. We propose and use a metric to measure both per-frame quality and also the consistency of the choice of colors between consecutive frames. Such a metric can, for example, penalize erratic differences in colorization from frame to frame, that would otherwise be “invisible” to the other metrics, borrowed from single image restoration/colorization. We define color consistency over sets of two consecutive colorization predictions and corresponding ground truth values as


where affinity matrices and are defined as

with function a positive, strictly decreasing function that is used to convert distances to similarities. We use . Total CC over a video sequence is calculated as the average CC over all consecutive frames. Higher values correspond to better results.

4 Experiments

We have tested our method on a collection of old films: “Dial M for Murder” (USA, 1954; 63,243 frames) [18]; “Et Dieu..créa la femme” (France, 1956; 54,922 frames) [19]; “Tzéni, Tzéni” (Greece, 1965; 58,932 frames) [20]; “A streetcar named desire” (USA, 1951; 18,002 frames) [21]; and “Twelve angry men” (USA, 1957; 12,000 frames) [22]. Frames were sampled from these films at a fixed frame rate. Films [18], [19] and [20] are colored, while [21] and [22] are originally black-and-white. Consequently, only the colored films could be used for training, while the black-and-white ones could be used only for testing with a colorizer trained on another film.

We have first experimented with training and testing on different parts of the same (colored) film. For training and testing we have used the first 75% and the last 25% of each colored film, respectively. The proposed 3D cGAN model, with hyperparameters $C$ (sliding window size) and $\lambda$ (GAN-$L_1$ loss trade-off) fixed, was compared against a 2D cGAN model that learned to colorize each frame separately. We have also used data augmentation on our training set, with random horizontal flips of the input during training and additive Gaussian noise. For estimate aggregation (eq. 1) we present results with an uninformative prior (preliminary tests with priors learned over data statistics did not give any definite improvement). We also compare with a grayscale baseline, i.e. the case where the “colorized” video estimate uses only luminance information. Numerical results can be examined in table 1. Qualitative results can be examined in figures 2 and 3. While in general both models fare satisfactorily, the proposed model can avoid erroneous colorizations in several cases (cf. fig. 3). This point is validated by our numerical results, where we calculate the metrics presented in section 3. While the proposed model is still better w.r.t. PSNR and RA, it could be argued that the difference in the result is statistically insignificant. This is not the case with the proposed CC metric however, where the performance of the proposed model is markedly better. These results validate our expectation, as the 3D structure of the proposed model can take into account the sequential structure of the video, in contrast to its 2D counterpart.
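The augmentation step for frame sequences can be sketched as follows. This is an assumption-laden illustration: the flip probability and noise standard deviation are left as parameters because their values are not given here, a single flip decision is shared by all frames of a window so that the sequence stays temporally consistent, and the function name and nested-list layout are hypothetical.

```python
import random


def augment_sequence(frames, flip_prob, noise_sigma, rng):
    """Randomly flip a frame sequence horizontally and add Gaussian noise.

    frames: list of frames; each frame is a list of rows of float values.
    flip_prob, noise_sigma: hypothetical augmentation parameters.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # One decision per sequence, so every frame in the window is flipped alike.
    flip = rng.random() < flip_prob
    augmented = []
    for frame in frames:
        rows = [row[::-1] if flip else list(row) for row in frame]
        augmented.append([[v + rng.gauss(0.0, noise_sigma) for v in row] for row in rows])
    return augmented


if __name__ == "__main__":
    clip = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(augment_sequence(clip, flip_prob=0.5, noise_sigma=0.1, rng=random.Random(0)))
```

When a flip is applied to the luminance input, the same flip has to be applied to the corresponding chrominance target so that input and ground truth remain aligned.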

We have also run tests for training and testing on different films. The case that is perhaps closest to a practical application of the current model is using a model trained on one of the colored films to color black-and-white footage, i.e. in our case films [21] and [22]. Results for this case can be examined in fig. 4 (training performed on film [18]). Video colorization demos are available online.

Table 1: Numerical results for colorization evaluation. Training and testing are performed on different clips of the same film. PSNR is measured in dB; RA and CC values are percentages. Higher values are better. The proposed model performs best in all cases.

(a) “Dial M for Murder”
Method       PSNR (dB)   RA (%)   CC (%)
Grayscale    32.69       96.55    73.09
2D cGAN      34.97       96.67    82.07
Proposed     35.66       96.73    85.59

(b) “Et Dieu..créa la femme”
Method       PSNR (dB)   RA (%)   CC (%)
Grayscale    30.23       94.07    47.82
2D cGAN      32.08       95.17    56.67
Proposed     32.32       95.31    58.80

(c) “Tzéni, Tzéni”
Method       PSNR (dB)   RA (%)   CC (%)
Grayscale    29.83       92.85    39.17
2D cGAN      31.44       93.87    50.81
Proposed     31.77       94.14    55.16

5 Conclusion and Future work

We have presented a method for automatic video colorization, based on a novel cGAN-based model with 3D convolutional and deconvolutional layers and an estimate aggregation scheme. The usefulness of our model has been validated with tests on colorizing old black-and-white film footage. Model performance has also been evaluated with single-image based metrics as well as a newly proposed metric that measures sequential color consistency. As future work, we envisage exploring the uses of the color prior in our aggregation scheme.

Figure 4: Colorization results where color ground-truth is unavailable. Depicted are samples from “A streetcar named desire” and “Twelve angry men” (2 leftmost, 2 rightmost columns respectively), colorized with the proposed model trained on “Dial M for Murder”.


References

  • [1] Simone Meyer, Victor Cornillère, Abdelaziz Djelouah, Christopher Schroers, and Markus Gross, “Deep video color propagation,” arXiv preprint arXiv:1808.03232, 2018.
  • [2] Mayu Otani and Hirohisa Hioki, “Video colorization based on optical flow and edge-oriented color propagation,” in Computational Imaging XII. International Society for Optics and Photonics, 2014, vol. 9020, p. 902002.
  • [3] VS Rao Veeravasarapu and Jayanthi Sivaswamy, “Fast and fully automated video colorization,” in Signal Processing and Communications (SPCOM), 2012 International Conference on. IEEE, 2012, pp. 1–5.
  • [4] Sifeng Xia, Jiaying Liu, Yuming Fang, Wenhan Yang, and Zongming Guo, “Robust and automatic video colorization via multiframe reordering refinement,” in IEEE International Conference on Image Processing. IEEE, 2016, pp. 4017–4021.
  • [5] Liron Yatziv and Guillermo Sapiro, “Fast image and video colorization using chrominance blending,” IEEE Transactions on Image Processing, vol. 15, no. 5, pp. 1120–1129, 2006.
  • [6] Anat Levin, Dani Lischinski, and Yair Weiss, “Colorization using optimization,” in ACM transactions on graphics (TOG). ACM, 2004, vol. 23, pp. 689–694.
  • [7] Bin Sheng, Hanqiu Sun, Marcus Magnor, and Ping Li, “Video colorization using parallel optimization in feature space,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 407–417, 2014.
  • [8] Nir Ben-Zrihem and Lihi Zelnik-Manor, “Approximate nearest neighbor fields in video,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5233–5242.
  • [9] Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller, “Transferring color to greyscale images,” in ACM Transactions on Graphics (TOG). ACM, 2002, vol. 21, pp. 277–280.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems (NIPS), 2014, pp. 2672–2680.
  • [11] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training GANs,” in Advances in neural information processing systems (NIPS), 2016, pp. 2234–2242.
  • [12] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng, “Training GANs with optimism,” CoRR, vol. abs/1711.00141, 2017.
  • [13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.
  • [14] A.W. Juliani, “Pix2Pix-Film,” 2017, [Online; accessed 2-January-2018].
  • [15] Richard Zhang, Phillip Isola, and Alexei A Efros, “Colorful image colorization,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 649–666.
  • [16] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  • [17] S. Chintala, E. Denton, M. Arjovsky, and M. Mathieu, “How to train a GAN? Tips and tricks to make GANs work,” 2016, [Online; accessed 25-January-2018].
  • [18] “Dial M for murder,” title/tt0046912/, 1954.
  • [19] “Et Dieu..créa la femme,” title/tt0049189/, 1956.
  • [20] “Tzéni, tzéni,” title/tt0145006/, 1966.
  • [21] “A streetcar named desire,” title/tt0044081/, 1951.
  • [22] “Twelve angry men,” title/tt0050083/, 1957.