1 Introduction
Colorization of images requires predicting the two missing channels of a provided gray-level input. Similar to other computer vision tasks like monocular depth prediction or semantic segmentation, colorization is ill-posed. However, unlike the aforementioned tasks, colorization is also ambiguous, i.e., many different colorizations are perfectly plausible. For instance, differently colored shirts or cars are very reasonable, while there is certainly less diversity in shades of façades. Capturing these subtleties is a non-trivial problem.
Early work on colorization was therefore interactive, requiring a reference color image or scribbles [1, 2, 3, 4, 5, 6]. To automate the process, classical methods formulated the task as a prediction problem [7, 8], using datasets of limited size. More recent deep learning methods were shown to capture more intricate color properties on larger datasets [9, 10, 11, 12, 13, 14]. However, all those methods have in common that they only produce a single colorization for a given gray-level image. Hence, the ambiguity and multimodality are often not modeled adequately. To this end, even more recently, diverse output space distributions for colorization were described using generative modeling techniques such as variational autoencoders [15], generative adversarial nets [16], or autoregressive models [17, 18]. While approaches based on generative techniques can produce diverse colorizations by capturing a dataset distribution, they often lack structural consistency; e.g., parts of a shirt differ in color or the car is speckled. Inconsistencies are due to the fact that structural coherence is only encouraged implicitly when using deep-net-based generative methods. For example, in results obtained from [15, 16, 18, 19, 20] illustrated in Fig. 1, the colors of the shoulder and neck differ as these models are sensitive to occlusion. In addition, existing diverse colorization techniques often lack a form of controllability that permits intervening while maintaining structural consistency.
To address both consistency and controllability, our method enhances the output space of variational autoencoders [21] with a Gaussian Markov random field formulation. Our approach, which we train in an end-to-end manner, enables explicit modeling of the structural relationship between multiple pixels in an image. Beyond learning the structural consistency between pixels, we also develop a control mechanism which incorporates external constraints. This enables a user to interact with the generative process using color strokes. We illustrate visually appealing results on the Labeled Faces in the Wild (LFW) [22], LSUN-Church [23], and ILSVRC-2015 [24] datasets and assess photorealism with a user study.
2 Related Work
As mentioned before, we develop a colorization technique which enhances variational autoencoders with Gaussian Markov random fields. Before discussing the details, we review the three areas of colorization, Gaussian Markov random fields, and variational autoencoders.
Colorization: Early colorization methods rely on user interaction in the form of a reference image or scribbles [1, 2, 3, 4, 5, 6]. First attempts to automate the colorization process [7] rely on classifiers trained on datasets containing a few tens to a few thousands of images. Naturally, recent deep-net-based methods scaled to much larger datasets containing millions of images [9, 10, 11, 12, 13, 14, 25]. All these methods operate on a provided intensity field and produce a single color image, which doesn't embrace the ambiguity of the task. To address ambiguity, Royer et al. [18] use a PixelCNN [26] to learn a conditional model of the color field given the gray-level image, and draw multiple samples from this distribution to obtain different colorizations. In addition to compelling results, failure modes are reported due to ignored complex long-range pixel interactions, e.g., if an object is split due to occlusion. Similarly, [17] uses PixelCNNs to learn multiple embeddings of the gray-level image, before a convolutional refinement network is trained to obtain the final image. Note that in this case, instead of learning the color field directly, it is represented by a low-dimensional embedding. Although the aforementioned PixelCNN-based approaches yield diverse colorizations, they lack large-scale spatial coherence and are prohibitively slow due to the autoregressive, i.e., sequential, nature of the model.
Another conditional latent-variable approach for diverse colorization was proposed by Deshpande et al. [15]. The authors train a variational autoencoder to produce a low-dimensional embedding of the color field. Then, a Mixture Density Network (MDN) [27] is used to learn a multimodal distribution over the latent codes. Latent samples are afterwards converted to multiple color fields using a decoder. This approach offers an efficient sampling mechanism. However, the output is often speckled because colors are sampled independently for each pixel.
Beyond the aforementioned probabilistic formulations, conditional generative adversarial networks [16] have been used to produce diverse colorizations. However, mode collapse, which results in the model producing a single color version of the gray-level image, is a frequent concern in addition to consistency. This is mainly due to the generator learning to largely ignore the random noise vector when conditioned on a relevant context. [19] addresses the former issue by concatenating the input noise channel with several convolutional layers of the generator. A second solution is proposed by [20], where the connection between the output and the latent code is encouraged to be invertible to avoid many-to-one mappings. These models show compelling results when tested on datasets with strong alignment between the samples, e.g., the LSUN bedroom dataset [23] in [19] and image-to-image translation datasets [28, 29, 30, 31, 32] in [20]. We will demonstrate in Sec. 4 that they lack global consistency on more complex datasets. In contrast to the aforementioned formulations, we address both the diversity and the global structural consistency requirements while ensuring computational efficiency. To this end, we formulate the colorization task by augmenting variational autoencoder models with Gaussian conditional random fields (GCRFs). Using this approach, beyond modeling a structured output space distribution, controllability of the colorization process is natural.
Gaussian Conditional Random Fields: Markov random fields [33] and their conditional counterparts are a compelling tool to model correlations between variables. They are hence, in principle, a good match for colorization, where we are interested in reasoning about color dependencies between different pixels. However, inference of the most likely configuration in classical Markov random fields defined over large output spaces is computationally demanding [34, 35, 36, 37] and tractable only in a few special cases.
Gaussian Markov random fields [38] represent one of those cases which permit efficient and exact inference. They model the joint distribution of the data, e.g., the pixel values of the two color channels of an image, as a multivariate Gaussian density. Gaussian Markov random fields have been used in the past for different computer vision applications, including semantic segmentation [39, 40, 41], human part segmentation and saliency estimation [40, 41], image labeling [42], and image denoising [43, 44]. A sparse Gaussian conditional random field trained with a LEARCH framework has been proposed for colorization in [8]. Different from this approach, we use a fully connected Gaussian conditional random field and learn its parameters end-to-end with a deep net. Beyond structural consistency, our goal is to jointly model the ambiguity which is an inherent part of the colorization task. To this end, we make use of variational autoencoders.

Variational Autoencoders: Variational autoencoders (VAEs) [21] and conditional variants [45], i.e., conditional VAEs (CVAEs), have been used to model ambiguity in a variety of tasks [46, 47]. They are based on the manifold assumption, stating that a high-dimensional data point $C$, such as a color image, can be modeled based on a low-dimensional embedding $z$ and some auxiliary data $G$, such as a gray-level image. Formally, existence of a low-dimensional embedding space and a transformation via the conditional $p_\theta(C|z,G)$ is assumed. Given a dataset containing pairs of conditioning information $G$ and desired output $C$, i.e., given $(G, C)$, CVAEs formulate maximization of the conditional log-likelihood $\log p_\theta(C|G)$, parameterized by $\theta$, by considering the following identity:

(1) $\log p_\theta(C|G) = D_{KL}(q_\phi(z|C,G) \,\|\, p(z|C,G)) - D_{KL}(q_\phi(z|C,G) \,\|\, p(z|G)) + \mathbb{E}_{q_\phi(z|C,G)}[\log p_\theta(C|z,G)]$

Hereby, $D_{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence between two distributions, and $q_\phi(z|C,G)$ is used to approximate the intractable posterior $p(z|C,G)$ of a deep net which models the conditional $p_\theta(C|z,G)$. The approximation of the posterior, i.e., $q_\phi(z|C,G)$, is referred to as the encoder, while the deep net used for reconstruction, i.e., for modeling the conditional $p_\theta(C|z,G)$, is typically called the decoder.
Since the KL divergence is non-negative, we obtain a lower bound on the data log-likelihood when considering the right-hand side of the identity given in Eq. 1. CVAEs minimize the negated version of this lower bound, i.e.,

(2) $\min_{\theta,\phi} \; D_{KL}(q_\phi(z|C,G) \,\|\, p(z|G)) - \mathbb{E}_{q_\phi(z|C,G)}[\log p_\theta(C|z,G)],$

where the expectation is approximated via samples $z \sim q_\phi(z|C,G)$. For simplicity of the exposition, we ignored the summation over the samples in the dataset, and provide the objective for training of a single pair $(G, C)$.
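The CVAE objective above has a simple numerical form. Below is a minimal sketch, assuming a diagonal-Gaussian encoder, a standard-normal prior, and a Gaussian decoder whose log-likelihood reduces to a squared reconstruction error; `cvae_loss` is an illustrative helper, not the authors' implementation:

```python
import numpy as np

def cvae_loss(mu, log_var, c_true, c_recon):
    """Negative evidence lower bound for one training pair, assuming a
    Gaussian encoder q(z|C,G) = N(mu, diag(exp(log_var))), a standard-normal
    prior, and a Gaussian decoder whose log-likelihood reduces to a squared
    reconstruction error (constants dropped)."""
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form, summed over latent dims
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # One-sample Monte-Carlo estimate of -E_q[log p(C|z,G)], up to a constant
    recon = 0.5 * np.sum((c_true - c_recon)**2)
    return kl + recon

rng = np.random.default_rng(0)
mu, log_var = rng.normal(size=8), np.zeros(8)
c = rng.normal(size=(16, 16, 2))        # a,b color channels of a toy image
loss = cvae_loss(mu, log_var, c, c)     # perfect reconstruction: only KL remains
```

With unit encoder variance and perfect reconstruction, the loss reduces to the KL term `0.5 * sum(mu**2)`, which makes the trade-off between prior matching and reconstruction easy to see.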
We next discuss how we combine those ingredients for diverse, controllable yet structurally coherent colorization.
3 Consistency and Controllability for Colorization
Our proposed colorization model has several appealing properties: (1) diversity, i.e., it generates diverse and realistic colorizations for a single gray-level image; (2) global coherence, enforced by explicitly modeling the output-space distribution of the generated color field using a fully connected Gaussian conditional random field (GCRF); (3) controllability, i.e., our model can consider external constraints at run time efficiently. For example, the user can enforce a given object to have a specific color, or two separated regions to have the same colorization.
3.1 Overview
We provide an overview of our approach in Fig. 2. Given a gray-level image $G$ with $N$ pixels, our goal is to produce different color fields consisting of the two channels $a$ and $b$ of the Lab color space. In addition, we enforce spatial coherence at a global scale and enable controllability using a Gaussian Markov random field which models the output space distribution.
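Colorization in the Lab space amounts to predicting the two chrominance channels and reattaching them to the given luminance. A minimal sketch, assuming scikit-image is available for the Lab-to-RGB conversion; `compose_lab` is an illustrative helper, not part of the paper's pipeline:

```python
import numpy as np
from skimage import color

def compose_lab(gray_L, ab):
    """Assemble a Lab image from the given luminance channel (L in [0, 100])
    and predicted chrominance channels, then convert to RGB for display.
    `gray_L`: (H, W) luminance; `ab`: (H, W, 2) predicted a,b channels."""
    lab = np.concatenate([gray_L[..., None], ab], axis=-1)
    return color.lab2rgb(lab)  # (H, W, 3) floats in [0, 1]

L = np.full((4, 4), 50.0)   # mid-gray luminance
ab = np.zeros((4, 4, 2))    # zero chroma -> a neutral gray output
rgb = compose_lab(L, ab)
```

Predicting only `ab` (rather than all three RGB channels) guarantees that every sampled colorization stays consistent with the input intensities.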
To produce a diverse colorization, we want to learn a multimodal conditional distribution $p(C|G)$ of the color field $C$ given the gray-level image $G$. However, learning this conditional is challenging since the color field and the intensity field are high-dimensional. Hence, training samples for learning $p(C|G)$ are sparsely scattered and the distribution is difficult to capture, even when using large datasets. Therefore, we assume the manifold hypothesis to hold, and we choose to learn a conditional $p(z|G)$ based on low-dimensional embeddings $z$ captured from $C$ and $G$, by using a variational autoencoder which approximates the intractable posterior via an encoder. Deshpande et al. [15] demonstrated that sampling from the approximation of the posterior results in low variance of the generated images. Following [15], we opt for a multi-stage training procedure to directly sample from $p(z|G)$ as follows. To capture the low-dimensional embedding, in a first training stage, we use a variational autoencoder to learn a parametric uni-modal Gaussian encoder distribution $q_\phi(z|C,G)$ of the color field embedding given both the gray-level image and the color image (Fig. 2 (a)). At the same time, we learn the parameters of the decoder.
Importantly, we note that the encoder takes advantage of both the color image and the gray-level intensities when mapping to the latent representation $z$. Due to the use of the color image, we expect that this mapping can be captured to a reasonable degree using a uni-modal distribution, i.e., we use a Gaussian.
However, multiple colorizations can be obtained from a grayscale image during inference. Hence, following Deshpande et al. [15], we don't expect a uni-modal distribution to be accurate during testing, when only conditioning on the gray-level image $G$.
To address this issue, in a second training stage, we train a Mixture Density Network (MDN) to maximize the log-likelihood of embeddings sampled from $q_\phi(z|C,G)$ (Fig. 2 (b)). Intuitively, for a gray-level image, the MDN predicts the parameters of $M$ Gaussian components, each corresponding to a different colorization. The embedding that was learned in the first stage is then tied to one of these components. The remaining components are optimized by close-by gray-level image embeddings.
At test time, different embeddings are sampled from the MDN and transformed by the decoder into diverse colorizations, as we show in Fig. 2.
To encourage globally coherent colorizations and to ensure controllability, we use a fully connected GCRF layer to model the output space distribution. The negative log-posterior of the GCRF has the form of a quadratic energy function:

(3) $E_{G,z}(y) = y^\top A(G,z)\, y - 2\, b(G,z)^\top y + c$

Here, $y \in \mathbb{R}^{2N}$ stacks the pixels' colors for the $a$ and $b$ channels, and the energy captures unary and higher-order correlations (HOCs) between them. Intuitively, the joint GCRF enables the model to capture more global image statistics, which turn out to yield more spatially coherent colorizations, as we will show in Sec. 4. The unary term $b(G,z)$ is obtained from the VAE decoder and encodes the color per pixel. The HOC term $A(G,z)$ is responsible for encoding the structure of the input image. It is a function of the inner product of low-rank pixel embeddings $S \in \mathbb{R}^{N \times d}$, learned from the gray-level image and measuring the pairwise similarity between the pixels' intensities. The intuition is that pixels with similar intensities should have similar colorizations. The HOC term is shared between the different colorizations obtained at test time. Beyond global consistency, it also enables controllability by properly propagating user edits encoded in the unary term. Due to the symmetry of the HOC term, the quadratic energy function has a unique global minimum that can be obtained by solving the system of linear equations:

(4) $A(G,z)\, y = b(G,z)$

Subsequently, we drop the dependency of $A$ and $b$ on $G$ and $z$ for notational simplicity.
We now discuss how to perform inference in our model and how to learn the model parameters such that colorization and structure are disentangled and controllability is enabled by propagating user strokes.
3.2 Inference
In order to ensure a globally consistent colorization, we take advantage of the structure in the image. To this end, we encourage two pixels to have similar colors if their intensities are similar. Thus, we want to minimize the difference between the color field for the $a$ and $b$ channels and the weighted average of the colors at similar pixels. More formally, we want to encourage the equalities $y_a = W y_a$ and $y_b = W y_b$, where $W$ is a similarity matrix obtained from applying a softmax function to every row of the matrix resulting from $S S^\top$. To simplify notation, we use the block-structured matrix $\hat{W} = \mathrm{diag}(W, W)$, so that both constraints read $y = \hat{W} y$.
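The similarity matrix described above can be sketched directly. Assuming `S` holds the low-rank per-pixel embeddings, the row-wise softmax of `S @ S.T` yields a row-stochastic matrix, so multiplying it with a color field averages colors over similar pixels (toy sizes, illustrative code):

```python
import numpy as np

def similarity_matrix(S):
    """Pairwise pixel similarity W = row-softmax(S S^T), where S is an
    (N, d) matrix of low-rank pixel embeddings learned from the gray image.
    Every row of W sums to 1, so W @ y is a weighted average of the colors
    at pixels with similar embeddings."""
    logits = S @ S.T                                # (N, N) inner products
    logits -= logits.max(axis=1, keepdims=True)     # stabilize the softmax
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 3))    # 6 "pixels", 3-dimensional embeddings
W = similarity_matrix(S)
```

The low-rank factorization keeps the cost of building the fully connected affinity manageable: only the N x d matrix `S` is learned, not an arbitrary N x N matrix.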
In addition to capturing the structure, we obtain the color prior and controllability by encoding the user input in the computed unary term $B$. Hence, we add the constraint $D y = D B$, where $D$ is a diagonal matrix with 0 and 1 entries corresponding to whether the pixel's value isn't or is specified by the user, and $B$ is a vector encoding the color each pixel should be set to.
With the aforementioned intuition at hand, we obtain the quadratic energy function to be minimized as:

$E(y) = \beta \, \| D (y - B) \|^2 + \| y - \hat{W} y \|^2,$

with $\beta$ being a hyperparameter. This corresponds to a quadratic energy function of the form $E(y) = y^\top A y - 2 b^\top y + c$, where $A = \beta D + (I - \hat{W})^\top (I - \hat{W})$, $b = \beta D B$, and $c = \beta B^\top D B$. It's immediately apparent that the unary term only encodes color statistics, while the HOC term is only responsible for structural consistency. Intuitively, the conditional $p(y | G, z)$ is interpreted as a Gaussian multivariate density:

(5) $p(y | G, z) \propto \exp(-E(y)),$

parametrized by the above defined energy function $E(y)$. It can be checked that $A$ is a positive definite full-rank matrix. Hence, for a strictly positive definite matrix, inference is reduced to solving a linear system of equations:
(6) $A y = b$
We solve the linear system above using the LU decomposition of the matrix $A$. How to learn the terms $B$ and $\hat{W}$ will be explained in the following.
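MAP inference thus reduces to a single linear solve. Below is a toy sketch of this step, assuming the masked unary term and the smoothness term described in this section; `gcrf_map` is an illustrative helper, simplified to a single color channel, and uses SciPy's LU factorization:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def gcrf_map(W_hat, B, D_mask, beta=1.0):
    """MAP inference for the quadratic GCRF energy
        beta * ||D (y - B)||^2 + ||y - W_hat y||^2,
    where A = beta*D + (I - W_hat)^T (I - W_hat) is symmetric positive
    definite (for beta > 0 and a non-degenerate mask), so the minimizer
    solves A y = b with b = beta * D * B."""
    n = W_hat.shape[0]
    I = np.eye(n)
    A = beta * np.diag(D_mask) + (I - W_hat).T @ (I - W_hat)
    b = beta * D_mask * B
    return lu_solve(lu_factor(A), b)    # LU-based linear solve

# Toy example: 4 "pixels" with uniform similarity (rows of W_hat sum to 1).
W_hat = np.full((4, 4), 0.25)
B = np.ones(4)                          # constant unary colors
y = gcrf_map(W_hat, B, np.ones(4))      # all unaries observed -> y == B
```

With only one unary revealed (a sparse user edit), the smoothness term propagates that single color to all similar pixels, which is exactly the behavior the controllability mechanism relies on.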
3.3 Learning
We now present the two training stages illustrated in Fig. 2, which ensure color and structure disentanglement and produce diverse colorizations. We also discuss the modifications to the loss given in Eq. 2 during each stage.
Stage 1: Training a Structured Output Space Variational Autoencoder: During the first training stage, we use the variational autoencoder formulation to learn a low-dimensional embedding for a given color field. This stage is divided into two phases to ensure color and structure disentanglement. In a first phase, we learn the unary term $B$ produced by the VAE decoder. In the second phase, we fix the weights of the VAE apart from the decoder's two topmost layers, and learn an $N \times d$ embedding matrix $S$ for the pixels from the gray-level image. The matrix $W$, obtained from applying a softmax to every row of $S S^\top$, is used to encourage a smoothness prior for the $a$ and $b$ channels. In order to ensure that $S$ learns the structure required for the controllability stage, where sparse user edits need to be propagated, we follow a training schedule where the unary terms are masked gradually using the matrix $D$. The input image is reconstructed from the sparse unary entries using the learned structure. When colorization from sparse user edits is desired, we solve the linear system from Eq. 6 for the learned HOC term and a matrix $D$ and unary term $B$ encoding the user edits, as illustrated in Fig. 3. We explain the details of the training schedule in the experimental section.
Given the new formulation of the GCRF posterior, the program for the first training stage reads as follows:

(7) $\min_{\theta,\phi} \; D_{KL}(q_\phi(z|C,G) \,\|\, \mathcal{N}(0, I)) - \mathbb{E}_{q_\phi(z|C,G)}[\log p(y | G, z)]$

Subsequently, we use the term $\mathcal{L}_{\mathrm{VAE}}$ to refer to the objective function of this program.
Stage 2: Training a Mixture Density Network (MDN): Since a color image is not available during testing, in the second training stage, we capture the approximate posterior $q_\phi(z|C,G)$, a Gaussian which was learned in the first training stage, using a parametric distribution $p_\psi(z|G)$. Due to the dependence on the color image, we expect the approximate posterior $q_\phi(z|C,G)$ to be easier to model than $p_\psi(z|G)$. Therefore, we let $p_\psi(z|G)$ be a Gaussian Mixture Model (GMM) with $M$ components. Its means, variances, and component weights are parameterized via a mixture density network (MDN) with parameters $\psi$. Intuitively, for a given gray-level image, we expect the components to correspond to different colorizations. The color-field embedding learned from the first training stage is mapped to one of the components by minimizing the negative conditional log-likelihood, i.e., by minimizing:

(8) $\mathcal{L}_{\mathrm{MDN}} = -\log \sum_{m=1}^{M} \pi_m(G) \, \mathcal{N}(z \mid \mu_m(G), \sigma^2 I)$

Hereby, $\pi_m$, $\mu_m$, and $\sigma$ refer to, respectively, the mixture coefficients, the means, and a fixed covariance of the GMM learned by an MDN network parametrized by $\psi$. However, minimizing $\mathcal{L}_{\mathrm{MDN}}$ is hard, as it involves the computation of the logarithm of a summation over the different exponential components. To avoid this, we explicitly assign the code $z$ to the Gaussian component $m^*$ whose mean is closest to $z$, i.e., $m^* = \arg\min_m \|z - \mu_m(G)\|$. Hence, the negative log-likelihood loss is reduced to solving the following program:
(9) $\min_\psi \; -\log \pi_{m^*}(G) + \frac{\|z - \mu_{m^*}(G)\|^2}{2\sigma^2}$
Note that the latent samples $z$ are obtained from the approximate posterior $q_\phi(z|C,G)$ learned in the first stage.
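The hard assignment above can be sketched as follows; `mdn_assignment_loss` is an illustrative helper assuming a fixed spherical variance, not the authors' implementation:

```python
import numpy as np

def mdn_assignment_loss(z, pi, mus, sigma=0.1):
    """Hard-assignment MDN loss: pick the component whose mean is closest
    to the sampled code z, then pay its negative log mixture weight plus a
    scaled squared distance to that mean (fixed spherical variance sigma^2).
    This avoids the log-sum-exp over all components."""
    d2 = np.sum((mus - z)**2, axis=1)   # squared distance to each mean
    m = int(np.argmin(d2))              # closest component
    return -np.log(pi[m]) + d2[m] / (2.0 * sigma**2), m

z = np.array([0.9, 0.1])                # sampled color-field embedding
pi = np.array([0.5, 0.5])               # mixture weights predicted by the MDN
mus = np.array([[1.0, 0.0],             # component 0 (closest to z)
                [-1.0, 0.0]])           # component 1
loss, m = mdn_assignment_loss(z, pi, mus)
```

Only the closest component receives a gradient toward the sampled code, which is what lets the remaining components drift toward the embeddings of nearby gray-level images.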
4 Results
Next, we present quantitative and qualitative results on three datasets of increasing color field complexity: (1) the Labeled Faces in the Wild (LFW) dataset [22], which consists of 13,233 face images aligned by deep funneling [48]; (2) the LSUN-Church dataset [23], containing 126,227 images; and (3) the validation set of ILSVRC-2015 (ImageNet-Val) [24], with 50,000 images. We compare the diverse colorizations obtained by our model with baselines representing three different generative models: (1) conditional generative adversarial networks [16, 19, 20]; (2) the variational autoencoder with MDN [15]; and (3) the probabilistic image colorization model [18] based on PixelCNN. Note that [15] presents a comparison between VAE-MDN and a conditional VAE, demonstrating the benefits of the VAE-MDN approach.

4.1 Baselines
Table 1: User study results. Each entry reports the percentage of pairwise comparisons in which the first method of the pair was preferred.

            | Ours vs VAE-MDN | Ours vs PIC | VAE-MDN vs PIC
LFW         | 61.12 %         | 59.04 %     | 57.17 %
LSUN-Church | 66.89 %         | 71.61 %     | 54.46 %
ILSVRC-2015 | 54.79 %         | 66.98 %     | 62.88 %
Table 2: Error-of-best (eob) per pixel, variance (Var.), SSIM, and training time in hours (Train.) on LFW, LSUN-Church, and ILSVRC-2015, for cGAN [16], MLN-GAN [19], BicycleGAN [20], VAE-MDN [15], PIC [18], and our approach.
Conditional Generative Adversarial Networks: We compare our approach with three GAN models: the cGAN architecture proposed by Isola et al. [16], the GAN with multi-layer noise by Cao et al. [19], and the BicycleGAN by Zhu et al. [20].
Variational Autoencoder with Mixture Density Network (VAE-MDN): The architecture by Deshpande et al. [15] trains an MDN-based autoencoder to generate different colorizations. It is the basis for our method.
Probabilistic Image Colorization (PIC): The PIC model proposed by Royer et al. [18] uses a CNN to learn an embedding of a gray-level image, which is then used as input for a PixelCNN network.
Comparison with Baselines: We qualitatively compare the diversity and global spatial consistency of the colorizations obtained by our model with those generated by the aforementioned baselines in Figures 1 and 4. We observe that our approach is the only one which generates a consistent colorization of the skin of the girl in Fig. 1. We are also able to uniformly color the ground, the snake, and the actor's coat in Fig. 4.
For global consistency evaluation, we perform a user study, presented in Tab. 1, where participants are asked to select the more realistic image from a pair of images at a time. We restrict the study to the three approaches with the overall lowest error-of-best (eob) per pixel reported in Tab. 2, namely VAE-MDN, PIC, and our model. We use the clicking speed to filter out inattentive participants. Participants neither knew the paper content, nor were the methods revealed to them. We gathered 5,374 votes from 271 unique users. The results show that users prefer results obtained with the proposed approach.
To evaluate diversity, we use two metrics: (1) the variance of the diverse colorizations and (2) the mean structural similarity (SSIM) [49] across all pairs of colorizations generated for one image. We report our results in Tab. 2.
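The two diversity metrics can be sketched as follows. The variance metric follows the description above; for the pairwise metric, the sketch substitutes a plain L2 distance for SSIM to stay dependency-free (the paper uses SSIM [49], e.g., as provided by scikit-image); both helpers are illustrative:

```python
import numpy as np
from itertools import combinations

def diversity_variance(colorizations):
    """Diversity metric (1): per-pixel variance across the K colorizations
    sampled for one image, averaged over pixels and channels.
    `colorizations`: (K, H, W, 2) array of predicted a,b fields."""
    return float(np.mean(np.var(colorizations, axis=0)))

def mean_pairwise_distance(colorizations):
    """Stand-in for metric (2): average distance over all pairs of sampled
    colorizations (the paper averages SSIM over the pairs instead)."""
    pairs = combinations(range(len(colorizations)), 2)
    return float(np.mean([np.linalg.norm(colorizations[i] - colorizations[j])
                          for i, j in pairs]))

rng = np.random.default_rng(2)
samples = rng.normal(size=(4, 8, 8, 2))   # 4 sampled colorizations
v = diversity_variance(samples)
d = mean_pairwise_distance(samples)
```

Both quantities are zero exactly when all sampled colorizations are identical, which is the signature of mode collapse discussed below.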
Global Consistency: Our model noticeably outperforms all the baselines in producing spatially coherent results, as demonstrated by the user study. PIC generates very diversified samples for the LFW and ILSVRC-2015 datasets but lacks long-range spatial dependencies because of the autoregressive nature of the model. For example, the snake in the second row of Fig. 4 has different colors for the head and the tail, and the woman's skin tone is inconsistent in Fig. 1. The VAE-MDN, BicycleGAN, and MLN-GAN outputs are sometimes speckled and objects are not uniformly colored. For example, parts of the dome of the building in the second row of Fig. 4 are mistaken for sky, and the shirt in the third row is speckled. In contrast, our model is capable of capturing complex long-range dependencies, as confirmed by the user study.
Diversity: Across all datasets, cGAN suffers from mode collapse and is frequently unable to produce diverse colorizations. The PIC, MLNGAN and BicycleGAN models yield the most diverse results at the expense of photorealism. Our model produces diverse results while ensuring long range spatial consistency.
Table 3: Average PSNR (dB) for 10, 50, and 100 revealed edits on the ImageNet test set.

Method             | 10   | 50   | 100
Levin et al. [2]   | 26.5 | 28.5 | 30
Endo et al. [50]   | 24.8 | 25.9 | 26
Barron et al. [51] | 25.3 | 28   | 29
Zhang et al. [14]  | 28   | 30.2 | 31.5
Ours               | 26.7 | 29.3 | 30.4
Controllability: For the controllability experiments, we set the hyperparameter $\beta$ to 1 during training and to 5 during testing. We opt for the following training schedule to force the model to encode the structure required to propagate sparse user inputs: We train the unary branch for 15 epochs (Stage 1, Phase 1), then train the HOC term for 15 epochs as well (Stage 1, Phase 2). We use the diagonal matrix $D$ to randomly specify the pixels whose colors are encoded by the unary branch, and we decrease the fraction of revealed pixels following a training schedule, down to a small fraction of the total number of pixels at fixed epochs. Note that additional stages could be added to the training schedule to accommodate complex datasets where very sparse user input is desired. In Fig. 5, we show that with a single pixel as a user edit, we are able to colorize a boot in pink, a sea coral in blue, and the background behind the spider in yellow in Fig. 5 (a-c), respectively. With two edits, we colorize a face in green (Zhang et al. [14] use 3 edits) in Fig. 5 (d), and the sky and the building in different colors in Fig. 5 (e,f). With three user edits, we show that we can colorize more complex images in Fig. 5 (g-i). We show the edits using red markers. We visualize the attention weights per pixel, corresponding to the pixel's row in the similarity matrix $W$, in blue, where darker shades correspond to stronger correlations. Quantitatively, we report the average PSNR for 10, 50, and 100 edits on the ImageNet test set in Tab. 3, where edits (points) corresponding to randomly selected patches are revealed to the algorithm. We observe that our method achieves slightly better results than the one proposed by Levin et al. [2], as our algorithm learns for every pixel color an 'attention mechanism' over all the pixels in the image, while Levin et al. impose local smoothness.
Visualization of the HOC and Unary Terms: In order to obtain more insights into the model's dynamics, we visualize the unary terms and the HOC terms in Fig. 6 and Fig. 7, respectively. As illustrated in Fig. 7, the HOC term has learned complex long-range pixel affinities through end-to-end training. The results in Fig. 6 further suggest that the unary term outputs a colorization with possibly some noise or inconsistencies, which the HOC term fixes to ensure global coherency. For example, for the picture in the second column of Fig. 6, the colors of the face, chest, and shoulder predicted by the unary term are not consistent, and were fixed by the HOC term, which captured the long-range correlation, as shown in Fig. 7 (c).
We notice different interesting strategies for encoding the long-range correlations: On the LSUN-Church dataset, the model encourages local smoothness, as every pixel seems to be strongly correlated with its neighbors. This is the case for the sky in Fig. 7 (e). The model trained on the LFW dataset, however, encodes long-range correlations. To ensure consistency over a large area, it chooses some reference pixels and correlates every pixel in the area with them, as can be seen in Fig. 7 (c).
We provide more results and details of the employed deep net architectures in the supplementary material.
5 Conclusion
We proposed a Gaussian conditional random field based variational autoencoder formulation for colorization and illustrated its efficacy on a variety of benchmark datasets, outperforming existing methods. The developed approach goes beyond existing methods in that it not only models the ambiguity inherent to the colorization task, but also takes structural consistency into account.
Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, Samsung, and 3M. We thank NVIDIA for providing the GPUs used for this research.
References
 [1] Welsh, T., Ashikhmin, M., Mueller, K.: Transferring color to greyscale images. SIGGRAPH (2002)
 [2] Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. SIGGRAPH (2004)
 [3] Chia, A.Y.S., Zhuo, S., Gupta, R.K., Tai, Y.W., Cho, S.Y., Tan, P., Lin, S.: Semantic colorization with internet images. SIGGRAPH (2011)
 [4] Gupta, R.K., Chia, A.Y.S., Rajan, D., Ng, E.S., Zhiyong, H.: Image colorization using similar images. ACM Multimedia (2012)
 [5] Cohen-Or, D., Lischinski, D.: Colorization by example. Eurographics Symposium on Rendering (2005)
 [6] Morimoto, Y., Taguchi, Y., Naemura, T.: Automatic colorization of grayscale images using multiple images on the web. SIGGRAPH (2009)
 [7] Charpiat, G., Hofmann, M., Schölkopf, B.: Automatic image colorization via multimodal predictions. ECCV (2008)
 [8] Deshpande, A., Rock, J., Forsyth, D.: Learning large-scale automatic image colorization. ICCV (2015)
 [9] Cheng, Z., Yang, Q., Sheng, B.: Deep colorization. ICCV (2015)
 [10] Iizuka, S., Simo-Serra, E., Ishikawa, H.: Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. SIGGRAPH (2016)
 [11] Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. ECCV (2016)
 [12] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. ECCV (2016)

 [13] Varga, D., Szirányi, T.: Twin deep convolutional neural network for example-based image colorization. CAIP (2017)
 [14] Zhang, R., Zhu, J.Y., Isola, P., Geng, X., Lin, A.S., Yu, T., Efros, A.A.: Real-time user-guided image colorization with learned deep priors. SIGGRAPH (2017)
 [15] Deshpande, A., Lu, J., Yeh, M.C., Forsyth, D.: Learning diverse image colorization. CVPR (2017)

 [16] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. CVPR (2017)
 [17] Guadarrama, S., Dahl, R., Bieber, D., Norouzi, M., Shlens, J., Murphy, K.: Pixcolor: Pixel recursive colorization. BMVC (2017)
 [18] Royer, A., Kolesnikov, A., Lampert, C.H.: Probabilistic image colorization. BMVC (2017)
 [19] Cao, Y., Zhou, Z., Zhang, W., Yu, Y.: Unsupervised diverse colorization via generative adversarial networks. arXiv preprint arXiv:1702.06674 (2017)
 [20] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal imagetoimage translation. In: NIPS. (2017)
 [21] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. ICLR (2014)

 [22] Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. In: Advances in face detection and facial image analysis. (2016)
 [23] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365 (2015)
 [24] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. IJCV (2015)
 [25] Varga, D., Szirányi, T.: Twin deep convolutional neural network for example-based image colorization. CAIP (2017)
 [26] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with pixelcnn decoders. NIPS (2016)
 [27] Bishop, C.M.: Mixture density networks. Aston University (1994)
 [28] Laffont, P.Y., Ren, Z., Tao, X., Qian, C., Hays, J.: Transient attributes for high-level understanding and editing of outdoor scenes. SIGGRAPH (2014)

 [29] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
 [30] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. CVPR (2017)
 [31] Yu, A., Grauman, K.: Finegrained visual comparisons with local learning. In: CVPR. (2014)
 [32] Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: ECCV. (2016)
 [33] Kindermann, R., Snell, J.L.: Markov random fields and their applications. American Mathematical Society (1980)
 [34] Schwing, A.G., Hazan, T., Pollefeys, M., Urtasun, R.: Distributed Message Passing for Large Scale Graphical Models. In: Proc. CVPR. (2011)
 [35] Schwing, A.G., Hazan, T., Pollefeys, M., Urtasun, R.: Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins. In: Proc. NIPS. (2012)
 [36] Schwing, A.G., Hazan, T., Pollefeys, M., Urtasun, R.: Globally Convergent Parallel MAP LP Relaxation Solver using the Frank-Wolfe Algorithm. In: Proc. ICML. (2014)
 [37] Meshi, O., Schwing, A.G.: Asynchronous Parallel Coordinate Minimization for MAP Inference. In: Proc. NIPS. (2017)
 [38] Rue, H.: Gaussian Markov random fields: theory and applications. CRC Press (2008)
 [39] Vemulapalli, R., Tuzel, O., Liu, M.Y., Chellapa, R.: Gaussian conditional random field network for semantic segmentation. CVPR (2016)
 [40] Chandra, S., Usunier, N., Kokkinos, I.: Dense and low-rank Gaussian CRFs using deep embeddings. ICCV (2017)
 [41] Chandra, S., Kokkinos, I.: Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. ECCV (2016)
 [42] Jancsary, J., Nowozin, S., Sharp, T., Rother, C.: Regression tree fields—an efficient, nonparametric approach to image labeling problems. CVPR (2012)
 [43] Tappen, M.F., Liu, C., Adelson, E.H., Freeman, W.T.: Learning Gaussian conditional random fields for low-level vision. CVPR (2007)
 [44] Vemulapalli, R., Tuzel, O., Liu, M.Y.: Deep Gaussian conditional random field network: A model-based deep network for discriminative denoising. CVPR (2016)
 [45] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. NIPS (2015)
 [46] Wang, L., Schwing, A.G., Lazebnik, S.: Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space. In: NIPS. (2017)

 [47] Jain, U., Zhang, Z., Schwing, A.G.: Creativity: Generating Diverse Questions using Variational Autoencoders. In: Proc. CVPR. (2017)
 [48] Huang, G., Mattar, M., Lee, H., Learned-Miller, E.G.: Learning to align from scratch. NIPS (2012)
 [49] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)

 [50] Endo, Y., Iizuka, S., Kanamori, Y., Mitani, J.: DeepProp: Extracting deep features from a single image for edit propagation. Eurographics (2016)
 [51] Barron, J.T., Poole, B.: The fast bilateral solver. In: ECCV. (2016)
 [52] Berthelot, D., Schumm, T., Metz, L.: BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017)
 [53] Yu, A., Grauman, K.: Fine-grained visual comparisons with local learning. In: CVPR. (2014)
 [54] Susskind, J.M., Anderson, A.K., Hinton, G.E.: The Toronto face database. Technical report, Department of Computer Science, University of Toronto (2010)
 [55] Hou, X., Shen, L., Sun, K., Qiu, G.: Deep feature consistent variational autoencoder. IEEE Winter Conference on Applications of Computer Vision (2017)
6 Introduction
We use a variational autoencoder (VAE) model enriched with a Mixture Density Network (MDN) and a Gaussian Conditional Random Field (GCRF) to generate diverse and globally consistent colorizations while enabling controllability through sparse user edits. In the following, we derive closed-form expressions for the derivative of the loss with respect to the GCRF layer inputs, i.e., the unary term and the HOC term (Sec. 7). We then present the model's architecture (Sec. 8) and additional results on the LFW, LSUN-Church and ILSVRC-2015 (Sec. 9) and ImageNet (Sec. 10) datasets. Finally, we explore endowing the VAE and BEGAN [52] with a structured output-space distribution through the GCRF formulation for image generation (Sec. 11).
7 Learning: Gradients of the GCRF Parameters
During the first training phase's forward pass, the GCRF layer receives the unary term $B$ and the HOC term $A$, and outputs the reconstructed color field $\hat{y}$ after solving the linear system $A\hat{y} = B$ given in Eq. 4 of the main paper. In the backward pass, the GCRF layer receives the gradient $\partial\mathcal{L}/\partial\hat{y}$ of the objective function of the program given in Eq. 7 in the main paper, and computes closed-form expressions for the gradients $\partial\mathcal{L}/\partial B$ and $\partial\mathcal{L}/\partial A$ of the loss with respect to the unary and HOC terms. Using the chain rule, the gradients of the remaining parameters can be expressed in terms of $\partial\mathcal{L}/\partial B$ and $\partial\mathcal{L}/\partial A$.
Note that the gradient with respect to the unary term decomposes as $\partial\mathcal{L}/\partial B = (\partial\hat{y}/\partial B)^\top\,\partial\mathcal{L}/\partial\hat{y}$. By taking Eq. 4 in the main paper into account, we compute $\partial\hat{y}/\partial B = A^{-1}$. Hence, it can be easily verified that evaluating the closed-form expression for the gradient of the loss with respect to the unary term corresponds to solving the following system of linear equations:
$A^\top\,\dfrac{\partial\mathcal{L}}{\partial B} = \dfrac{\partial\mathcal{L}}{\partial\hat{y}}.$  (10)
Similarly, we derive the gradient with respect to the HOC term by application of the chain rule: since $A\hat{y} = B$, we have $\partial\hat{y}/\partial A_{ij} = -A^{-1} E_{ij}\,\hat{y}$, where $E_{ij}$ denotes the matrix with a one at entry $(i,j)$ and zeros elsewhere. Hence:
$\dfrac{\partial\mathcal{L}}{\partial A} = -\dfrac{\partial\mathcal{L}}{\partial B}\,\hat{y}^\top.$  (11)
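The two closed-form gradients can be verified numerically by differentiating through the linear solve. Below is a minimal NumPy sketch under the standard formulation in which the layer output solves $A\hat{y} = B$; the symbols, the toy system, and the linear toy loss are illustrative choices, not the main paper's exact implementation:

```python
import numpy as np

def gcrf_forward(A, b):
    """Forward pass: output the field y that solves the linear system A y = b."""
    return np.linalg.solve(A, b)

def gcrf_backward(A, y, grad_y):
    """Backward pass: closed-form gradients of the loss w.r.t. the unary term b
    and the matrix A, given the incoming gradient grad_y = dL/dy.
    dL/db solves A^T (dL/db) = dL/dy (Eq. 10); dL/dA = -(dL/db) y^T (Eq. 11)."""
    grad_b = np.linalg.solve(A.T, grad_y)
    grad_A = -np.outer(grad_b, y)
    return grad_b, grad_A

# toy, well-conditioned symmetric positive definite system
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)
b = rng.standard_normal(4)
w = rng.standard_normal(4)   # linear toy loss L(y) = w^T y, so dL/dy = w

y = gcrf_forward(A, b)
grad_b, grad_A = gcrf_backward(A, y, w)
```

A finite-difference check against `grad_b` and `grad_A` confirms both expressions; the same pattern underlies any deep-net layer that solves a linear system in its forward pass.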
8 Architecture and Implementation Details
In this section, we discuss the architecture of the encoder, the MDN, the decoder, the structure encoder and the GCRF layer. We introduce the following notation for describing the different architectures: we denote by $C(k, s, n, f)$ a convolutional layer with kernel size $k$, stride $s$, $n$ output channels and activation $f$; we write $BN$ for a batch normalization layer, $U(x)$ for bilinear upsampling with scale factor $x$, and $FC(n)$ for a fully connected layer with $n$ output channels. To regularize, we use dropout for fully connected layers.
Encoder: The encoder network learns an approximate posterior conditioned on the gray-level image representation and the color image. Its input is a color field, from which it generates a low-dimensional embedding.
MDN: The MDN's input is a gray-level feature embedding obtained from [48]. It predicts the parameters of a mixture of Gaussian components, namely the component means and the mixture weights; we use a fixed spherical variance.
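For concreteness, the MDN objective with a fixed spherical variance can be sketched as follows. This is a minimal NumPy version of the mixture negative log-likelihood only; the variable names and the log-sum-exp stabilization are our own choices, not the paper's implementation:

```python
import numpy as np

def mdn_nll(z, means, log_weights, sigma):
    """Negative log-likelihood of a latent code z under a mixture of M
    spherical Gaussians with fixed variance sigma^2 (the MDN loss).
    means has shape (M, d); log_weights has shape (M,)."""
    d = z.size
    # squared distance of z to each component mean, shape (M,)
    d2 = np.sum((means - z) ** 2, axis=1)
    # per-component joint log-density log(pi_m * N(z; mu_m, sigma^2 I))
    log_probs = (log_weights - d2 / (2.0 * sigma ** 2)
                 - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    # log-sum-exp over components for numerical stability
    m = log_probs.max()
    return -(m + np.log(np.exp(log_probs - m).sum()))
```

At test time, a diverse colorization is then obtained by sampling a latent code from one of the predicted components and decoding it.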
Decoder: During training, the decoder's input is the embedding generated by the encoder. At test time, the input is a latent code sampled from one of the Gaussian components predicted by the MDN. The decoder generates a field of unary terms through repeated bilinear upsampling and convolution operations. During training, we additionally learn four feature maps of the gray-level image at increasing resolutions, and concatenate these representations with the ones learned by the decoder after every upsampling operation, i.e., the first gray-level feature map is concatenated with the decoder's representation after the first upsampling, the second after the second upsampling, and so on.
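The upsample-then-concatenate step of the decoder can be illustrated at the shape level as follows; this is a NumPy sketch in which the feature-map sizes and names are hypothetical, chosen only to show the skip-connection pattern:

```python
import numpy as np

def up1d(a, f=2):
    """Linear interpolation of a 1-D signal to f times its length."""
    n = a.shape[0]
    return np.interp(np.linspace(0.0, n - 1, f * n), np.arange(n), a)

def bilinear_upsample2(x):
    """Bilinear 2x upsampling of a 2-D feature map, applied per axis."""
    x = np.apply_along_axis(up1d, 0, x)
    x = np.apply_along_axis(up1d, 1, x)
    return x

# one decoder step: upsample, then concatenate the matching gray-level feature map
feat = np.ones((8, 8))      # hypothetical decoder feature map
gray = np.zeros((16, 16))   # hypothetical gray-level feature map at the new resolution
merged = np.stack([bilinear_upsample2(feat), gray])  # channel-wise concatenation
```

The concatenated tensor is then fed to the next convolution, so the decoder sees the gray-level structure at every resolution.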
Structure Encoder: The structure encoder learns an embedding for every node of the downsampled gray-level image. It consists of stacked convolutional layers.
GCRF Layer: The GCRF is implemented as an additional layer whose inputs are the unary term and the embedding matrix produced by the structure encoder. During the forward pass, it outputs a color field by solving the linear system in Eq. 5 of the main paper for the two color channels separately. During the backward pass, it computes the gradients of the loss with respect to the unary term and the embedding matrix, respectively. The gradient with respect to the unary term is obtained by solving the linear system in Eq. 10; the gradient with respect to the HOC term is given by Eq. 11, from which the gradient with respect to the embedding matrix follows by the chain rule.
9 Additional Results
10 Results on ImageNet Dataset
11 Beyond Colorization: GCRF for Structured Generative Models
Beyond colorization, we explore the effect of endowing two different generative models, namely the variational autoencoder and the Boundary Equilibrium Generative Adversarial Network (BEGAN) [52], with a structured output space through our GCRF formulation. We show the results in Fig. 13 and Fig. 14 using the Toronto Face Dataset (TFD) [54]. Quantitative results are reported in Tab. 4 using (1) the KL divergence between the distributions of generated and real data, and sharpness measured by (2) gradient magnitude, (3) edge width and (4) variance of the Laplacian. The results are normalized with respect to measurements on real data.
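The first metric in Tab. 4, the KL divergence between the distributions of generated and real data, can be approximated from histograms. The following sketch uses a shared binning range and a small smoothing constant; both are our own choices, and the paper's exact estimator may differ:

```python
import numpy as np

def histogram_kld(real, generated, bins=32, eps=1e-8):
    """Approximate KL(real || generated) from histograms over a shared range."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi))
    # normalize with smoothing so empty bins do not produce infinities
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

A divergence near zero indicates that the generated samples match the real data statistics; larger values indicate a mismatch.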
For the variational autoencoder model, the GCRF is added on top of the decoder. Additionally, the reconstruction loss is augmented with the feature loss from [55]. We compare our results with those obtained from a classical VAE and from a VAE trained with the feature loss but without the GCRF layer. Fig. 13 shows that our model produces sharper, higher-quality and more diverse faces.
For the BEGAN model, we add our GCRF layer on top of the discriminator. Hence, at the output layer level, the model implicitly penalizes generated samples whose statistics differ from those of real samples. In Fig. 14, we compare our results with the classical BEGAN for the hyperparameter gamma set to 0.5, after approximately 120,000 iterations. We observe that our model generates more diverse and higher-quality samples.
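The sharpness measures reported in Tab. 4 can be computed along the following lines. This is an illustrative NumPy version with periodic boundary handling; the paper's exact implementation and normalization may differ:

```python
import numpy as np

def variance_of_laplacian(img):
    """Sharpness proxy: variance of the 5-point Laplacian response."""
    lap = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
           + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1) - 4.0 * img)
    return lap.var()

def mean_gradient_magnitude(img):
    """Sharpness proxy: mean finite-difference gradient magnitude."""
    gy, gx = np.gradient(img)
    return np.sqrt(gx ** 2 + gy ** 2).mean()
```

Both measures respond to high-frequency content, so a blurred image scores lower than a sharp one; in Tab. 4 they are reported relative to the scores of real images.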
                         KLD    Gradient  Edge width  Laplacian
Classical VAE            0.33   66.5%     52.65%      16.3%
Feature Consistent VAE   0.16   91.5%     72.9%       30.17%
SVAE (Ours)
BEGAN                    0.09   98.54%    99.4%       96.37%
SBEGAN (Ours)