divcolor
Implementation of "Learning Diverse Image Colorization" CVPR'17
Colorization is an ambiguous problem, with multiple viable colorizations for a single grey-level image. However, previous methods only produce the single most probable colorization. Our goal is to model the diversity intrinsic to the problem of colorization and produce multiple colorizations that display long-scale spatial coordination. We learn a low-dimensional embedding of color fields using a variational autoencoder (VAE). We construct loss terms for the VAE decoder that avoid blurry outputs and take into account the uneven distribution of pixel colors. Finally, we build a conditional model for the multimodal distribution between the grey-level image and the color field embeddings. Samples from this conditional model result in diverse colorization. We demonstrate that our method obtains better diverse colorizations than a standard conditional variational autoencoder (CVAE) model, as well as a recently proposed conditional generative adversarial network (cGAN).
Diverse Colorization in Torch
In colorization, we predict the two-channel color field for an input grey-level image. It is an inherently ill-posed and ambiguous problem: multiple different colorizations are possible for a single grey-level image. For example, different shades of blue for the sky, different colors for a building, different skin tones for a person and other stark or subtle color changes are all acceptable colorizations. In this paper, our goal is to generate multiple colorizations for a single grey-level image that are diverse and, at the same time, each realistic. This is a demanding task, because color fields are not only cued to the local appearance but also have a long-scale spatial structure. Sampling colors independently from per-pixel distributions makes the output spatially incoherent and does not generate a realistic color field (see Figure 2). Therefore, we need a method that generates multiple colorizations while balancing per-pixel color estimates and long-scale spatial coordination. This paradigm is common to many ambiguous vision tasks where multiple predictions are desired, viz. generating motion fields from a static image [25], synthesizing future frames [27], time-lapse videos [31], interactive segmentation and pose-estimation [1], etc.
A natural approach to the problem is to learn a conditional model P(C|G) for a color field C conditioned on the input grey-level image G. We can then draw samples from this conditional model to obtain diverse colorizations. Building this explicit conditional model is difficult: C and G are high-dimensional spaces, and the distribution of natural color fields and grey-level features in these high-dimensional spaces is therefore scattered. This does not expose the sharing required to learn a multimodal conditional model. Therefore, we seek feature representations of C and G that allow us to build such a conditional model.
Our strategy is to represent the color field C by its low-dimensional latent variable embedding z. This embedding is learned by a generative model such as the Variational Autoencoder (VAE) [14] (see Step 1 of Figure 1). Next, we leverage a Mixture Density Network (MDN) to learn a multimodal conditional model P(z|G) (see Step 2 of Figure 1). Our feature representation for the grey-level image comprises the features from the conv7 layer of a colorization CNN [30]. These features encode spatial structure and per-pixel affinity to colors. Finally, at test time we sample multiple embeddings z and use the VAE decoder to obtain the corresponding colorization for each z (see Figure 1). Note that our low-dimensional embedding encodes the spatial structure of color fields, so we obtain spatially coherent diverse colorizations by sampling the conditional model.
The contributions of our work are as follows. First, we learn a smooth low-dimensional embedding along with a device to generate corresponding color fields with high fidelity (Sections 3, 7.2). Second, we learn a multimodal conditional model between the grey-level features and the low-dimensional embedding, capable of producing diverse colorizations (Section 4). Third, we show that our method outperforms the strong baselines of conditional variational autoencoders (CVAE) and conditional generative adversarial networks (cGAN) [10] for obtaining diverse colorizations (Section 7.3, Figure (k)).
Colorization. Early colorization methods were interactive; they used a reference color image [26] or scribble-based color annotations [18]. Subsequently, [3, 4, 5, 11, 20] performed automatic image colorization without any human annotation or interaction. However, these methods were trained on datasets of limited size, ranging from a few tens to a few thousands of images. Recent CNN-based methods have been able to scale to much larger datasets of a million images [8, 16, 30]. All these methods are aimed at producing only a single color image as output. [3, 16, 30] predict a multimodal distribution of colors over each pixel. But [3] performs a graph-cut inference to produce a single color field prediction, [30] take the expectation after making the per-pixel distribution peaky, and [16] sample the mode or take the expectation at each pixel to generate a single colorization. To obtain diverse colorizations from [16, 30], colors have to be sampled independently for each pixel. This leads to speckle noise in the output color fields, as shown in Figure 2. Furthermore, one obtains little diversity with this noise. Isola et al. [10] use conditional GANs for the colorization task. Their focus is to generate a single colorization for a grey-level input. We produce diverse colorizations for a single input, which are all realistic.
Figure 2: [16, 30] predict a per-pixel probability distribution over colors. The first three images are diverse colorizations obtained by sampling the per-pixel distributions independently. The last image is the ground-truth color image. These images demonstrate the speckle noise and lack of spatial coordination resulting from independent sampling of pixel colors.
Variational Autoencoder. As discussed in Section 1, we wish to learn a low-dimensional embedding z of a color field C. Kingma and Welling [14] demonstrate that this can be achieved using a variational autoencoder, comprising an encoder network and a decoder network. They derive the following lower bound on the log likelihood:

log P(C) ≥ E_{q(z|C)}[log P(C|z)] − KL( q(z|C) ‖ P(z) )   (1)
The lower bound is maximized by maximizing Equation 1 with respect to the network parameters. They assume the posterior P(C|z) is Gaussian; therefore, the first term of Equation 1 reduces to a decoder network trained with an L2 loss. Further, they assume the prior distribution P(z) is a zero-mean unit-variance Gaussian. Therefore, the encoder network q(z|C) is trained with a KL-divergence loss to the distribution P(z). Sampling z ~ q(z|C) is performed with the reparameterization trick to enable backpropagation and the joint training of encoder and decoder. VAEs have been used to embed and decode digits [6, 12, 14], faces [15, 28] and, more recently, CIFAR images [6, 13]. However, they are known to produce blurry and over-smooth outputs. We carefully devise loss terms that discourage blurry, greyish outputs and incorporate specificity and colorfulness (Section 3).

We use a VAE to obtain a low-dimensional embedding for a color field. In addition to this, we also require an efficient decoder that generates a realistic color field from a given embedding. Here, we develop loss terms for the VAE decoder that avoid the over-smooth and washed out (or greyish) color fields obtained with the standard L2 loss.
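As a concrete illustration of the reparameterization trick and the KL-divergence encoder loss just described, here is a minimal NumPy sketch; the helper names are hypothetical and this is not the paper's released code.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over
    dimensions; this is the encoder loss when the prior is N(0, I)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the sample is a
    differentiable function of (mu, log_var) and gradients can flow back
    through the encoder during joint training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

With mu = 0 and log_var = 0 the KL term vanishes, which is the fixed point the encoder is regularized towards.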
Specificity. The top-k principal components u_1, …, u_k are the directions of maximum variance in the high-dimensional space of color fields. Producing color fields that vary primarily along the top-k principal components therefore reduces the L2 loss at the expense of specificity in the generated color fields. To disallow this, we project the generated color field c and the ground-truth color field c_gt along the top-k principal components; we use a fixed k in our implementation. Next, we divide the difference between these projections along each principal component by the corresponding standard deviation σ_i estimated from the training set. This encourages changes along all principal components to be on an equal footing in our loss. The residue off the top-k subspace is divided by the standard deviation σ_k of the k-th component. Writing the specificity loss as the squared sum of these distances and the residue,

L_mah = Σ_{i=1}^{k} (1/σ_i²) ( u_iᵀ(c − c_gt) )² + (1/σ_k²) ‖ (c − c_gt) − Σ_{i=1}^{k} u_i u_iᵀ(c − c_gt) ‖²

The above loss is a combination of the Mahalanobis distance [19] between the vectors c and c_gt with a diagonal covariance matrix, plus an additional residual term.

Colorfulness. The distribution of colors in images is highly imbalanced, with more greyish colors than others. This biases the generative model to produce color fields that are washed out. Zhang et al. [30] address this by rebalancing the loss to take into account the different populations of colors in the training data. The goal of rebalancing is to give higher weight to rarer colors relative to common colors. We adopt a similar strategy that operates in the continuous color field space instead of the discrete color space of Zhang et al. [30]. We use the empirical probability estimates (or normalized histogram) of colors in the quantized 'ab' color field computed by [30]. For each pixel p, we quantize it to obtain its bin b_p and retrieve the inverse probability 1/p_{b_p}, which is used as a weight in the squared difference between the predicted color and the ground truth at pixel p. Writing this loss in vector form,
L_hist = (c − c_gt)ᵀ diag(w) (c − c_gt),   where w_p = 1/p_{b_p}   (2)
Gradient. In addition to the above, we also use a first-order loss term that encourages generated color fields to have the same gradients as the ground truth. Writing ∇_h and ∇_v for the horizontal and vertical gradient operators, the loss term is

L_grad = ‖∇_h c − ∇_h c_gt‖² + ‖∇_v c − ∇_v c_gt‖²   (3)
Writing the overall loss on the decoder as

L_dec = L_mah + λ_hist L_hist + λ_grad L_grad   (4)

we set the hyperparameters λ_hist and λ_grad to fixed values. The loss on the encoder is the KL-divergence to N(0, I), the same as in [14]. We down-weight this encoder loss with respect to the decoder loss. This relaxes the regularization of the low-dimensional embedding, but gives greater importance to the fidelity of the color field produced by the decoder. Our relaxed constraint on the embedding space does not have adverse effects, because our conditional model (see Section 4) still produces low-dimensional embeddings which decode to natural colorizations (see Figures (f), (k)).
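The decoder loss terms described above can be restated compactly in NumPy. This is an illustrative sketch under assumed shapes (c, c_gt as flat vectors for the first two terms, 2-D fields for the gradient term) with hypothetical function names and placeholder lambda values, not the released training code.

```python
import numpy as np

def specificity_loss(c, c_gt, U, sigma):
    """Mahalanobis-style loss over the top-k principal components (columns
    of U, shape (d, k)) with per-component standard deviations sigma (k,);
    the residue off the top-k subspace is scaled by sigma[-1]."""
    diff = c - c_gt
    proj = U.T @ diff                      # projections along the top-k PCs
    residue = diff - U @ proj              # remainder off the PC subspace
    return np.sum((proj / sigma) ** 2) + np.sum(residue ** 2) / sigma[-1] ** 2

def colorfulness_loss(c, c_gt, w):
    """Squared error weighted per pixel by inverse color-bin frequency w."""
    return np.sum(w * (c - c_gt) ** 2)

def gradient_loss(C, C_gt):
    """First-order loss on horizontal and vertical finite differences."""
    dh = np.diff(C, axis=1) - np.diff(C_gt, axis=1)
    dv = np.diff(C, axis=0) - np.diff(C_gt, axis=0)
    return np.sum(dh ** 2) + np.sum(dv ** 2)

def decoder_loss(c, c_gt, C, C_gt, U, sigma, w, lam_hist=1.0, lam_grad=1.0):
    """Combined decoder loss of Equation 4; lambda values are placeholders."""
    return (specificity_loss(c, c_gt, U, sigma)
            + lam_hist * colorfulness_loss(c, c_gt, w)
            + lam_grad * gradient_loss(C, C_gt))
```

All three terms vanish when the prediction matches the ground truth, and the specificity term grows fastest along low-variance principal directions, as intended.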
We want to learn a multimodal (one-to-many) conditional model P(z|G) between the grey-level image G and the low-dimensional embedding z. Mixture density networks (MDN) model the conditional probability distribution of target vectors, conditioned on the input, as a mixture of Gaussians [2]. This takes into account the one-to-many mapping and allows the target vectors to take multiple values conditioned on the same input vector, providing diversity.

MDN Loss. We now formulate the loss function for an MDN that models the conditional distribution P(z|G). Here, P(z|G) is a Gaussian mixture model with M components. The loss function minimizes the conditional negative log likelihood of this distribution. Write L_mdn for the MDN loss, π_i for the mixture coefficients, μ_i for the means and σ for the fixed spherical covariance of the GMM. The π_i and μ_i are produced by a neural network with parameters θ and input G. The MDN loss is

L_mdn = −log Σ_{i=1}^{M} π_i(G, θ) N( z | μ_i(G, θ), σ² )   (5)
It is difficult to optimize Equation 5, since it involves a log of a sum over exponents of the form exp(−‖z − μ_i‖²/2σ²). The distances ‖z − μ_i‖ are large when training commences, which leads to numerical underflow in the exponent. To avoid this, we pick the Gaussian component m whose predicted mean is closest to the ground-truth code and optimize only that component per training step. This reduces the loss function to

L̃_mdn = ‖z − μ_m(G, θ)‖² / (2σ²) − log π_m(G, θ),   m = argmin_i ‖z − μ_i(G, θ)‖   (6)
Intuitively, this approximation resolves the identifiability (or symmetry) issue within the MDN, as we tie a grey-level feature to a single component (the m-th component above). The other components are free to be optimized by nearby grey-level features. Therefore, clustered grey-level features jointly optimize the entire GMM, resulting in diverse colorizations. In Section 7.3, we show that this MDN-based strategy produces better diverse colorizations than the baselines of CVAE and cGAN (Section 5).
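The closest-component approximation above can be sketched in a few lines of NumPy (hypothetical function name; a re-statement of the idea, not the released code):

```python
import numpy as np

def mdn_closest_component_loss(z, pis, mus, sigma):
    """Approximate MDN loss in the spirit of Equation 6: optimize only the
    component m whose mean is closest to the ground-truth code z, avoiding
    the numerically unstable log of a sum of exponents in Equation 5."""
    d2 = np.sum((mus - z) ** 2, axis=1)    # squared distance to each mean
    m = int(np.argmin(d2))                 # closest component
    return d2[m] / (2.0 * sigma ** 2) - np.log(pis[m])
```

The distance term pulls the chosen mean towards z, while the −log π_m term raises that component's mixture weight, so clustered grey-level features gradually claim different components.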
Conditional Variational Autoencoder (CVAE). A CVAE conditions the generative process of a VAE on a specific input; sampling from a CVAE therefore produces diverse outputs for a single input. Walker et al. [25] use a fully convolutional CVAE for diverse motion prediction from a static image. Xue et al. [27] introduce cross-convolutional layers between the image and motion encoders of a CVAE to obtain diverse future frame synthesis. Zhou and Berg [31] generate diverse time-lapse videos by incorporating conditional, two-stack and recurrent architecture modifications to standard generative models.
Recall that, for our problem of image colorization, the input to the CVAE is the grey-level image G and the output is the color field C. Sohn et al. [23] derive a lower bound on the conditional log-likelihood of the CVAE. They show that training a CVAE consists of training an encoder network with a KL-divergence loss and a decoder network with an L2 loss. The difference with respect to the VAE is that the embedding generation and the decoder network both have an additional input G.
Conditional Generative Adversarial Network (cGAN). Isola et al. [10] recently proposed a cGAN-based architecture to solve various image-to-image translation tasks, one of which is colorizing grey-level images. They use an encoder-decoder architecture along with skip connections that propagate low-level detail. The network is trained with a patch-based adversarial loss, in addition to an L2 loss. The noise (or embedding) is provided in the form of dropout [24]. At test time, we use dropout to generate diverse colorizations and cluster the colorizations into cluster centers (see cGAN in Figure (k)).

Notation. Before describing the network architectures, note the following notation. Write conv(k, s, n, f) for a convolution with kernel size k, stride s, n output channels and activation f; bn for batch normalization; up(s) for bilinear upsampling with scale factor s; and fc(n) for a fully connected layer with n output channels. We perform convolutions with zero-padding, and our fully connected layers use dropout regularization
[24]. Radford et al. [21] propose a DCGAN architecture with a generator (or decoder) network that can model the complex spatial structure of images. We model the decoder network of our VAE on the generator network of Radford et al. [21], following their best practices: strided convolutions instead of pooling, batch normalization [9], ReLU activations for intermediate layers and tanh for the output layer, and avoiding fully connected layers except where decorrelation is required to obtain the low-dimensional embedding. The encoder network is roughly the mirror of the decoder network, as per standard practice for autoencoder networks. See Figure 4 for an illustration of our VAE architecture.

Encoder Network. The encoder network accepts a color field as input and outputs a low-dimensional embedding.

Decoder Network. The decoder network accepts the low-dimensional embedding and performs 5 rounds of bilinear upsampling and convolutions to finally output a two-channel color field (the a and b channels of Lab color space).
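The up(s) bilinear upsampling used between the decoder convolutions can be sketched as follows. This is an illustrative align-corners-style implementation in NumPy, not necessarily the exact resize kernel used in the released code.

```python
import numpy as np

def bilinear_upsample(x, s):
    """Bilinear upsampling of an (H, W, C) array by integer scale s, using
    align-corners-style sample positions; a stand-in for the decoder's
    up(s) operation."""
    def interp_axis(a, n_new, axis):
        n_old = a.shape[axis]
        pos = np.linspace(0.0, n_old - 1.0, n_new)   # sample positions
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, n_old - 1)
        shape = [1] * a.ndim
        shape[axis] = n_new
        w = (pos - lo).reshape(shape)                # interpolation weights
        return (np.take(a, lo, axis=axis) * (1.0 - w)
                + np.take(a, hi, axis=axis) * w)
    x = interp_axis(x, x.shape[0] * s, 0)
    x = interp_axis(x, x.shape[1] * s, 1)
    return x
```

Interpolating rather than transposed-convolving avoids checkerboard artifacts in the upsampled color field; the subsequent convolution then refines the result.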
We use the same embedding dimension for all three of our datasets (Section 7.1).
The inputs to the MDN are the grey-level features from [30]. The output layer of the MDN comprises activations for the means and softmaxed activations for the mixture weights of the GMM components, and we use a fixed spherical variance. The MDN uses 5 convolutional layers followed by two fully connected layers. Equivalently, the MDN is a network of convolutional and fully connected layers in which the first convolutional layers are pretrained on the task of [30] and held fixed.
At test time, we can sample multiple embeddings from the MDN and then generate diverse colorizations using the VAE decoder. However, to study diverse colorizations in a principled manner, we adopt a different procedure: we order the predicted means in descending order of their mixture weights and use the top-k means as the diverse colorizations shown in Figure (k) (see ours, ours+skip).
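The top-k selection of GMM means by mixture weight is a one-liner; sketched here with a hypothetical helper name:

```python
import numpy as np

def topk_means(pis, mus, k):
    """Return the k GMM means with the largest mixture weights, in
    descending weight order, to decode into k diverse colorizations."""
    order = np.argsort(pis)[::-1][:k]
    return mus[order]
```

Each returned mean is then passed through the VAE decoder to obtain one colorization.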
In the CVAE, the encoder and the decoder both take an additional input G, so we need an encoder for grey-level images, as shown in Figure 3. The color image encoder and the decoder are the same as in the VAE (Section 6.1). The grey-level encoder of the CVAE produces an output feature map. The low-dimensional latent variable generated by the VAE (or color) encoder is spatially replicated and multiplied with the output of the grey-level encoder, which forms the input to the decoder. Additionally, we add skip connections from the grey-level encoder to the decoder, similar to [10].
At test time, we feed multiple randomly sampled embeddings to the CVAE decoder along with the fixed grey-level input, and cluster the resulting outputs into a small set of diverse colorizations (see CVAE in Figure (k)).
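Reducing many sampled outputs to a few representative colorizations (as done for both the CVAE and the cGAN baselines) amounts to clustering. A minimal k-means sketch, assuming colorizations are flattened to vectors; the function name and the farthest-point initialization (chosen for determinism) are my own, not the paper's:

```python
import numpy as np

def kmeans_centers(samples, k, iters=20):
    """Tiny k-means with farthest-point initialization: a stand-in for
    clustering many sampled colorizations down to k representative
    diverse outputs."""
    centers = [samples[0]]
    for _ in range(1, k):
        # pick the sample farthest from all chosen centers so far
        d = np.min([np.sum((samples - c) ** 2, axis=1) for c in centers],
                   axis=0)
        centers.append(samples[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each sample to its nearest center, then recompute means
        assign = np.argmin(((samples[:, None] - centers[None]) ** 2).sum(-1),
                           axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = samples[assign == j].mean(axis=0)
    return centers
```

Each returned center is reshaped back into a color field and reported as one of the k diverse colorizations.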
Refer to http://vision.cs.illinois.edu/projects/divcolor for our TensorFlow code.
In Section 7.2, we evaluate the performance improvement from the loss terms we construct for the VAE decoder. Section 7.3 shows the diverse colorizations obtained by our method and compares them to the CVAE and the cGAN. We also demonstrate the performance of another variant of our method, "ours+skip": a VAE with an additional grey-level encoder and skip connections to the decoder (similar to the cGAN in Figure 3), with the MDN step unchanged. The grey-level encoder architecture is the same as in the CVAE described above.
Table 1: Mean absolute error per pixel, over all pixels (All) and over a uniformly spaced central grid (Grid), for VAE decoders trained with different losses.

Dataset        L2-Loss          Mah-Loss         All terms
               All     Grid     All     Grid     All     Grid
LFW            .034    .035     .034    .032     .029    .029
Church         .024    .025     .026    .026     .023    .023
ImageNet-Val   .031    .031     .039    .039     .039    .039
Table 2: Mean weighted absolute error per pixel (weights as in the colorfulness loss of Section 3.1), over all pixels (All) and over a central grid (Grid).

Dataset        L2-Loss          Mah-Loss         All terms
               All     Grid     All     Grid     All     Grid
LFW            7.20    11.29    6.69    7.33     2.65    2.83
Church         4.9     4.68     6.54    6.42     1.74    1.71
ImageNet-Val   10.02   9.21     12.99   12.19    4.82    4.66
We use three datasets with varying complexity of color fields. First, we use the Labelled Faces in the Wild (LFW) dataset [17], which consists of face images aligned by deep funneling [7]. Since the face images are aligned, this dataset has some structure to it. Next, we use the LSUN-Church [29] dataset. These images are not aligned and lack the structure present in LFW; they are, however, images of the same scene category and therefore more structured than images in the wild. Finally, we use the validation set of ILSVRC-2015 [22] (called ImageNet-Val) as our third dataset. These images are the most unstructured of the three. For each dataset, we randomly choose a subset of images as the test set and use the remaining images for training.
Table 3: Error-of-best (Eob.) of diverse colorizations with respect to ground truth.

Method      LFW     Church   ImageNet-Val
CVAE        .031    .029     .037
cGAN        .047    .048     .048
Ours        .030    .036     .043
Ours+skip   .031    .036     .041
We train VAE decoders with: the standard L2 loss, the specificity loss of Section 3.1, and all our loss terms of Equation 4. Figure (a)–(f) shows the colorizations obtained for the test set with these different losses. To produce these colorizations, we sample the embedding from the encoder network; therefore, this does not comprise a true colorization task, but it allows us to evaluate the performance of the decoder network when the best possible embedding is available. The colorizations obtained with the L2 loss are greyish. In contrast, by using all our loss terms we obtain plausible and realistic colorizations with vivid colors: note the yellow shirt and the yellow equipment, the brown desk and the green trees in the third row. For all datasets, using all our loss terms provides better colorizations than the standard L2 loss. Note also that the face images in the second row have more contained skin colors than those in the first row, which shows the subtle benefit of the specificity loss.
In Table 1, we compare the mean absolute error per pixel with respect to the ground truth for the different loss terms, and in Table 2 we compare the mean weighted absolute error per pixel. The weighted error uses the same weights as the colorfulness loss of Section 3.1. We compute the error over all pixels (All) and over a uniformly spaced grid in the center of the image (Grid); the grid avoids using too many correlated neighboring pixels. On the absolute error metric of Table 1, for LFW and Church we obtain lower errors with all loss terms than with the standard L2 loss. Note that, unlike the L2 loss, we do not specifically train for this absolute error metric, and yet we achieve reasonable performance with our loss terms. On the weighted error metric of Table 2, our loss terms outperform the standard L2 loss on all datasets.
In Figure (k), we compare the diverse colorizations generated by our strategy (Sections 3, 4) and the baseline methods, CVAE and cGAN (Section 5). Qualitatively, we observe that our strategy generates better-quality diverse colorizations, each of which is realistic. Note that, for each dataset, the different methods use the same train/test split and are trained for the same number of epochs. The diverse colorizations have good quality for LFW and LSUN-Church: we observe different skin tones, hair, cloth and background colors for LFW, and different brick, sky and grass colors for LSUN-Church. More colorizations appear in Figures (f), 7, (k) and 9.
In Table 3, we show the error-of-best (i.e. the error of the colorization closest to the ground truth) and the variance of the diverse colorizations. A lower error-of-best implies that one of the diverse predictions is close to the ground truth. Note that our method reliably produces high variance with an error-of-best comparable to the other methods. Our goal is to generate diverse colorizations; however, since diverse colorizations are not observed in the ground truth for a single image, we cannot evaluate them directly. Therefore, we use the weaker proxy of variance to evaluate diversity. Large variance is desirable for diverse colorization, and is what we obtain. We rely on qualitative evaluation to verify the naturalness of the different colorizations in the predicted pool.
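The two quantities reported in Table 3 can be computed as follows; a sketch with assumed function names, since the paper does not give these exact formulas:

```python
import numpy as np

def error_of_best(preds, gt):
    """Minimum mean absolute error to ground truth over the diverse pool:
    low when at least one prediction is close to the true color field."""
    return min(np.mean(np.abs(p - gt)) for p in preds)

def pool_variance(preds):
    """Per-pixel variance across the prediction pool, averaged over pixels;
    a weak proxy for diversity."""
    return float(np.mean(np.var(np.stack(preds), axis=0)))
```

A good method keeps error_of_best low (at least one prediction is accurate) while keeping pool_variance high (the predictions actually differ).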

Our loss terms help us build a variational autoencoder for high-fidelity color fields. The multimodal conditional model produces embeddings that decode to realistic, diverse colorizations. The colorizations obtained from our methods are more diverse than those of the CVAE and cGAN, and the proposed method can be applied to other ambiguous problems. Our low-dimensional embeddings allow us to predict diversity with multimodal conditional models, but they do not encode high spatial detail. In future work, we will focus on improving spatial detail along with diversity.
Acknowledgements. We thank Arun Mallya and Jason Rock for useful discussions and suggestions. This work is supported in part by ONR MURI Award N000141612007, and in part by NSF under Grants No. NSF IIS1421521.

Proceedings of the 10th European Conference on Computer Vision: Part III, ECCV '08, pages 126–139, 2008.
DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1462–1471, 2015.
Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition, 2017.