Deep Exemplar-based Colorization

07/17/2018 ∙ by Mingming He, et al.

We propose the first deep learning approach for exemplar-based local colorization. Given a reference color image, our convolutional neural network directly maps a grayscale image to an output colorized image. Rather than using hand-crafted rules as in traditional exemplar-based methods, our end-to-end colorization network learns how to select, propagate, and predict colors from large-scale data. The approach performs robustly and generalizes well even when using reference images that are unrelated to the input grayscale image. More importantly, as opposed to other learning-based colorization methods, our network allows the user to achieve customizable results by simply feeding different references. In order to further reduce manual effort in selecting the references, the system automatically recommends references with our proposed image retrieval algorithm, which considers both semantic and luminance information. The colorization can be performed fully automatically by simply picking the top reference suggestion. Our approach is validated through a user study and favorable quantitative comparisons to state-of-the-art methods. Furthermore, our approach can be naturally extended to video colorization. Our code and models will be freely available for public use.






1 Introduction

The aim of image colorization is to add colors to a gray image such that the colorized image is perceptually meaningful and visually appealing. The problem is ill-conditioned and inherently ambiguous since there are potentially many colors that can be assigned to the gray pixels of an input image (e.g., leaves may be colored in green, yellow, or brown). Hence, there is no unique correct solution and human intervention often plays an important role in the colorization process.

Manual information to guide the colorization is generally provided in one of two forms: user-guided scribbles or a sample reference image. In the first paradigm [Levin et al. 2004, Yatziv and Sapiro 2006, Huang et al. 2005, Luan et al. 2007, Qu et al. 2006], the scribbles must be placed and the palette of colors chosen carefully in order to achieve a convincing result. This often requires both experience and a good sense of aesthetics, thus making it challenging for an untrained user. In the second paradigm [Welsh et al. 2002, Irony et al. 2005, Tai et al. 2005, Charpiat et al. 2008, Liu et al. 2008, Chia et al. 2011, Gupta et al. 2012, Bugeau et al. 2014], a color reference image similar to the grayscale image is given to facilitate the process. First, correspondence is established, and then colors are propagated from the most reliable correspondences. However, the quality of the result depends heavily on the choice of reference. Intensity disparities between the reference and the target caused by lighting, viewpoint, and content dissimilarity can mislead the colorization algorithm.

A more reliable solution is to leverage a huge reference image database to search for the most similar image patch/pixel for colorization. Recently, deep learning techniques have achieved impressive results in modeling large-scale data. Image colorization is formulated as a regression problem and deep neural networks are used to directly solve it [Cheng et al. 2015, Deshpande et al. 2015, Larsson et al. 2016, Iizuka et al. 2016, Zhang et al. 2016, Isola et al. 2017, Zhang et al. 2017]. These methods can colorize a new photo fully automatically without requiring any scribbles or reference. Unfortunately, none of these methods allow multi-modal colorization [Charpiat et al. 2008]. By learning from the data, their models mainly use the dominant colors they have learned, hindering any kind of user controllability. Another drawback is that these networks must be trained on a very large reference image database containing all potential objects.

More recent works attempt to achieve the best of both worlds: controllability from interaction and robustness from learning. Zhang et al. [2017] and Sangkloy et al. [2016] add manual hints in the form of color points or strokes to the deep neural network in order to suggest possibly desired colors for the scribbles provided by users. This greatly facilitates traditional scribble-based interactions and achieves impressive results with more natural colors learned from the large-scale data. However, the scribbles are still essential for achieving high quality results, so a certain amount of trial-and-error is still involved.

In this paper, we suggest another type of hybrid solution. We propose the first deep learning approach for exemplar-based local colorization. Compared with existing colorization networks [Cheng et al. 2015, Iizuka et al. 2016, Zhang et al. 2016], our network allows control over the output colorization by simply choosing different references. As our results show, the reference can be similar or dissimilar to the target, but we can always obtain plausible colors in the results, which are visually faithful to the references and perceptually meaningful.

To achieve this goal, we present the first convolutional neural network (CNN) to directly select, propagate and predict colors from an aligned reference for a gray-scale image. Our approach is qualitatively superior to existing exemplar-based methods. The success comes from two novel sub-networks in our exemplar-based colorization framework.

First, the Similarity sub-net is a pre-processing step which provides the input of the end-to-end colorization network. It measures the semantic similarity between the reference and the target using a VGG-19 network pre-trained on the gray-scale image object recognition task. It provides a more robust and reliable similarity metric to varying semantic image appearances than previous metrics based on low-level features.

Figure 1: Our goal is to selectively propagate the correct reference colors (indicated by the dots) for the relevant patches/pixels, and predict natural colors learned from the large-scale data when no appropriate matching region is available in the reference (indicated by the region outlined in red). Panels (left to right): reference, target, colorized result. Input images (from left to right): Julian Fong/flickr and Ernest McGray, Jr/flickr.

Then, the Colorization sub-net provides a more general colorization solution for either similar or dissimilar patch/pixel pairs. It employs multi-task learning to train two different branches, which share the same network and weights but are associated with two different loss functions: 1) Chrominance loss, which encourages the network to selectively propagate the correct reference colors for relevant patches/pixels, satisfying chrominance consistency; 2) Perceptual loss, which enforces a close match between the high-level feature representations of the result and of the true color image. This ensures a proper colorization learned from the large-scale data even in cases where there is no appropriate matching region in the reference (see Fig. 1). Therefore, our method can greatly loosen the restrictive requirement of a good reference selection imposed by other exemplar-based methods.

To guide the user towards efficient reference selection, the system recommends the most likely reference based on a proposed image retrieval algorithm. It leverages both high-level semantic information and low-level luminance statistics to search for the most similar images in the ImageNet dataset [Russakovsky et al. 2015]. With the help of this recommendation, our approach can serve as a fully automatic colorization system. The experiments demonstrate that our automatic colorization outperforms existing automatic methods quantitatively and qualitatively, and even produces results of comparably high quality to state-of-the-art interactive methods [Zhang et al. 2017, Sangkloy et al. 2016]. Our approach can also be extended to video colorization.

Our contributions are as follows: (1) The first deep learning approach for exemplar-based colorization, which allows controllability and is robust to reference selection. (2) A novel end-to-end double-branch network architecture which jointly learns faithful local colorization to a meaningful reference and plausible color prediction when a reliable reference is unavailable. (3) A reference image retrieval algorithm for reference recommendation, with which we can also attain a fully automatic colorization. (4) Transferability to unnatural images, even though the network is trained purely on a natural image dataset. (5) An extension to video colorization.

Figure 2: System pipeline (inference stage). The system consists of two sub-networks. The Similarity sub-net works as a pre-processing step using Input 1, which includes the two luminance channels from the target and the reference respectively, the bidirectional mapping functions, and the two chrominance channels from the reference. It computes the bidirectional similarity maps and the aligned reference chrominance, which, along with the target luminance, form Input 2 for the Colorization sub-net. The Colorization sub-net is an end-to-end CNN that predicts the chrominance channels of the target, which are then combined with the target luminance to generate the final colorized result.

2 Related work

Next, we provide an overview of the major related works in each of the main algorithm categories.

2.1 Scribble-based colorization

These methods focus on propagating local user hints, for instance, color points or strokes, to the entire gray-scale image. The color propagation is based on some low-level similarity metrics. The pioneering work of Levin et al. [2004] assumed that adjacent pixels with similar luminance should have similar color, and then solved a Markov Random Field for propagating sparse scribble colors. Further advances extended similarity to textures [Qu et al. 2006, Luan et al. 2007], intrinsic distance [Yatziv and Sapiro 2006], and exploited edges to reduce color bleeding [Huang et al. 2005]. The common drawback of such methods is that providing good scribbles requires intensive manual work and professional skills.

2.2 Example-based colorization

These methods provide a more intuitive way to reduce extensive user effort by feeding a very similar reference to the input grayscale image. The earliest work [Welsh et al. 2002] transferred colors by matching global color statistics, similar to Reinhard et al. [2001]. The approach yielded unsatisfactory results in many cases since it ignored spatial pixel information. For more accurate local transfer, different correspondence techniques are considered, including segmented region level [Irony et al. 2005, Tai et al. 2005, Charpiat et al. 2008], super-pixel level [Gupta et al. 2012, Chia et al. 2011], and pixel level [Liu et al. 2008, Bugeau et al. 2014]. However, finding low-level feature correspondences (e.g., SIFT, Gabor wavelet) with hand-crafted similarity metrics is susceptible to error in situations with significant intensity and content variation. Recently, two works utilized deep features extracted from a pre-trained VGG-19 network for reliable matching between images that are semantically related but visually different, and then applied the matching to style transfer [Liao et al. 2017] and color transfer [He et al. 2017]. However, all of these exemplar-based methods have to rely on finding a good reference, which is still an obstacle for users, even when some semi-automatic retrieval methods [Liu et al. 2008, Chia et al. 2011] are used. By contrast, our approach is robust to any given reference thanks to the capability of our deep network to learn natural color distributions from large-scale image data.

2.3 Learning-based colorization

Several techniques rely entirely on learning to produce the colorization result. Deshpande et al. [2015] defined colorization as a linear system and learned its parameters. Cheng et al. [2015] concatenated several pre-defined features and fed them into a three-layer fully connected neural network. Recently, some end-to-end learning approaches [Larsson et al. 2016, Iizuka et al. 2016, Zhang et al. 2016, Isola et al. 2017] leveraged CNNs to automatically extract features and predict the color result. The key difference between those networks is the loss function (e.g., image reconstruction loss [Iizuka et al. 2016], classification loss [Larsson et al. 2016, Zhang et al. 2016], and a loss that accounts for multi-modal colorization [Isola et al. 2017]). All of these networks are learned from large-scale data and do not require any user intervention. However, they only produce a single plausible result for each input, even though colorization is intrinsically an ill-posed problem with multi-modal uncertainty [Charpiat et al. 2008].

2.4 Hybrid colorization

To achieve desirable color results, Zhang et al. [2017] and Sangkloy et al. [2016] proposed a hybrid framework that inherits the controllability from scribble-based methods and the robustness from learning-based methods. Zhang et al. [2017] use provided color points while Sangkloy et al. [2016] adopt strokes. Instead, we incorporate the reference rather than user-guided points or strokes into the colorization network, since we believe that giving a similar color example is a more intuitive way for untrained users. Furthermore, the reference selection can be achieved automatically using our image retrieval system.

3 Exemplar-based Colorization Network

Our goal is to colorize a target grayscale image based on a color reference image. More specifically, we aim to apply a reference color to the target where there is semantically-related content, and fall back to a plausible colorization for the objects or regions with no related content in the reference. To achieve this goal, we address two major challenges.

First, it is difficult to measure the semantic relationship between the reference and the target, especially given that the reference is in color while the target is a grayscale image. To solve this problem, we use a gray-VGG-19, trained on the image classification task using only the luminance channel, to extract features from both images, and then compute the differences between these features.

Second, it is still challenging to select reference colors and propagate them properly by defining hand-crafted rules based on the similarity metrics. Instead, we propose an end-to-end network to learn selection and propagation simultaneously. Often these two steps are not enough to recover all colors, especially when the reference is not closely related to the target. To address this issue, our network instead predicts the dominant colors, learned from the large-scale data, for misaligned objects.

Fig. 2 illustrates the system pipeline. Our system uses the CIE Lab color space, which is perceptually linear. Thus, each image can be separated into a luminance channel (L) and two chrominance channels (a and b). The input of our system includes a grayscale target image, a color reference image, and the bidirectional mapping functions between them. A bidirectional mapping function is a spatial warping function defined with bidirectional correspondences: it returns the transformed pixel location given a source location p. One of the two functions maps pixels from the target to the reference and the other maps pixels from the reference to the target; both are defined over the full pixel grid given by the height and width of the input images. For simplicity, we assume the two input images are of the same dimensions, although this is not necessary in practice. Our network consists of two sub-networks. The Similarity sub-net computes the semantic similarities between the reference and the target, and outputs bidirectional similarity maps. The Colorization sub-net takes the target luminance, the aligned reference chrominance, and the similarity maps as the input, and outputs the predicted ab channels of the target, which are then combined with the target luminance to get the colorized result. Details of the two sub-networks are introduced in the following sections.

3.1 Similarity Sub-Network

Before calculating pixel-level similarity, the two input images (the target and the reference) have to be aligned. The bidirectional mapping functions between them can be calculated with a dense correspondence algorithm, such as SIFTFlow [Liu et al. 2011], Daisy Flow [Tola et al. 2010] or DeepFlow [Weinzaepfel et al. 2013]. In our work, we adopt the recent Deep Image Analogy technique [Liao et al. 2017] for dense matching, since it is capable of matching images that are visually different but semantically related.
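To illustrate what a mapping function does once it has been computed, the following is a minimal NumPy sketch of warping the reference chrominance onto the target grid. It is an illustration under stated assumptions, not the implementation used in this work: the mapping is assumed to be stored as an integer array of (row, column) coordinates, and the names `warp`, `phi_t2r`, and `ref_ab` are hypothetical.

```python
import numpy as np

def warp(image, mapping):
    """Sample `image` at the locations stored in `mapping`.

    image:   (H, W, C) array, e.g. the ab channels of the reference.
    mapping: (H, W, 2) integer array (assumed layout); mapping[y, x] holds
             the (row, col) in `image` matched to pixel (y, x).
    """
    rows = np.clip(mapping[..., 0], 0, image.shape[0] - 1)
    cols = np.clip(mapping[..., 1], 0, image.shape[1] - 1)
    return image[rows, cols]

# Toy example: a hypothetical target-to-reference mapping that mirrors
# the image horizontally.
H, W = 2, 3
ref_ab = np.arange(H * W * 2, dtype=np.float32).reshape(H, W, 2)
yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
phi_t2r = np.stack([yy, W - 1 - xx], axis=-1)

aligned_ref_ab = warp(ref_ab, phi_t2r)  # reference chrominance on the target grid
```

Nearest-pixel lookup is used here for brevity; a real dense correspondence field may be sub-pixel and would then need interpolation.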

Our work is inspired by recent observations that CNNs trained on image recognition tasks are capable of encoding a full spectrum of features, from low-level textures to high-level semantics. Such features provide a similarity metric that is robust and reliable under varying image appearances (caused by different lighting, time, viewpoint, and even slightly different categories), which may be challenging for the low-level feature metrics (e.g., intensity, SIFT, Gabor wavelet) used in many works [Welsh et al. 2002, Liu et al. 2008, Charpiat et al. 2008, Tai et al. 2005].

Model and test input                  Top-5 Class Acc (%)   Top-1 Class Acc (%)
Ori VGG-19 tested on color image      91.24                 73.10
Ori VGG-19 tested on gray image       83.63                 61.14
Our VGG-19 tested on gray image       89.39                 70.05
Table 1: Classification accuracies of the original and our fine-tuned VGG-19, calculated on the ImageNet validation dataset.

We take the intermediate output of VGG-19 as our feature representation. Certainly, other recognition networks, such as GoogleNet [Szegedy et al. 2015] or ResNet [He et al. 2015], can also be used. The original VGG-19 is trained on color images and has a degraded accuracy when recognizing grayscale images, as shown in Table 1. To reduce the performance gap between color images and their gray versions, we train a gray-VGG-19 using only the luminance channel of an image. It increases the top-5 accuracy on gray images from 83.63% to 89.39%, and approaches that of the original VGG-19 (91.24%) evaluated on color images.

We then feed the luminance channels of the target and the reference into our gray-VGG-19 separately, and obtain a five-level feature map pyramid for each. The feature map of each level is extracted from an intermediate layer of the network. Note that the features have progressively coarser spatial resolution with increasing levels. We upsample all feature maps to the spatial resolution of the input images. Bidirectional similarity maps, one forward and one backward, are then computed between the upsampled target and reference features at each pixel p of every level, using the bidirectional mapping functions to pair up corresponding locations.


As mentioned in Liao et al. [2017], cosine distance performs better in measuring feature similarity since it is more robust to appearance variances between image pairs. Thus, our similarity metric $d(x, y)$ between two deep features $x$ and $y$ is defined as their cosine similarity:

$$ d(x, y) = \frac{x^{\top} y}{\lVert x \rVert \, \lVert y \rVert} \tag{2} $$
The forward similarity map reflects the matching confidence from the target to the reference, while the backward similarity map measures the matching accuracy in the reverse direction. We refer to both together as the bidirectional similarity maps.
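The forward and backward maps can be sketched in NumPy for a single pyramid level as follows. This is a hedged illustration rather than the exact formulation: it assumes that the forward map compares each target feature with the reference feature it is mapped to, and that the backward map compares that same reference feature with the target feature reached by mapping back again. The function names and the integer (row, column) mapping layout are assumptions.

```python
import numpy as np

def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-pixel cosine similarity between two (H, W, C) feature maps."""
    dot = np.sum(feat_a * feat_b, axis=-1)
    norms = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1)
    return dot / (norms + eps)

def warp(feat, mapping):
    """Look up `feat` at the integer (row, col) locations in `mapping`."""
    return feat[mapping[..., 0], mapping[..., 1]]

def bidirectional_similarity(feat_t, feat_r, phi_t2r, phi_r2t):
    # Reference feature matched to each target pixel p.
    feat_r_aligned = warp(feat_r, phi_t2r)
    forward = cosine_similarity_map(feat_t, feat_r_aligned)
    # Target feature reached by going target -> reference -> target.
    feat_t_roundtrip = warp(warp(feat_t, phi_r2t), phi_t2r)
    backward = cosine_similarity_map(feat_t_roundtrip, feat_r_aligned)
    return forward, backward

# Toy check: identical features and identity mappings give similarity ~1.
rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 8))
yy, xx = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
identity = np.stack([yy, xx], axis=-1)
sim_fwd, sim_bwd = bidirectional_similarity(feat, feat, identity, identity)
```

Running this once per pyramid level would yield one forward and one backward map per level.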

Figure 3: Two-branch training of the Colorization sub-net. Both branches take nearly the same Input 2 except for the concatenated chrominance channels. The aligned ground truth chrominance is used in the Chrominance branch to compute the chrominance loss, while the aligned reference chrominance is used in the Perceptual branch to compute the perceptual loss.

3.2 Colorization Sub-Network

We design an end-to-end CNN to learn selection, propagation and prediction of colors simultaneously. As shown on the right of Fig. 2, the Colorization sub-net takes a thirteen-channel map as the input, which concatenates the gray target (luminance only), the aligned reference with chrominance channels only, and the bidirectional similarity maps. It then predicts the ab channels of the target image. Next, we describe the loss function, network architecture and training strategy of the network.
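The channel count can be checked with a short NumPy sketch. It assumes that the thirteen channels break down as one luminance channel, two aligned chrominance channels, and one forward plus one backward similarity map for each of the five pyramid levels; the channel ordering and variable names are illustrative assumptions.

```python
import numpy as np

H, W, LEVELS = 4, 4, 5  # toy spatial size; five feature levels

target_l = np.zeros((H, W, 1), dtype=np.float32)         # target luminance
aligned_ref_ab = np.zeros((H, W, 2), dtype=np.float32)   # warped reference ab
sim_forward = np.zeros((H, W, LEVELS), dtype=np.float32)
sim_backward = np.zeros((H, W, LEVELS), dtype=np.float32)

# Assumed ordering: luminance, chrominance, forward maps, backward maps.
net_input = np.concatenate(
    [target_l, aligned_ref_ab, sim_forward, sim_backward], axis=-1
)
```

Under these assumptions the count is 1 + 2 + 5 + 5 = 13, matching the thirteen-channel input described above.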

3.2.1 Loss

Usually, the objective of colorization is to encourage the output to be as close as possible to the ground truth, i.e., the original ab channels of a color image in the training dataset. However, this is not appropriate in exemplar-based colorization, because the colorization should allow customization with the reference (e.g., a flower can be colorized in red, yellow, or purple depending on the reference). Thus, it is not accurate to directly penalize a measure of the difference between the prediction and the ground truth, as in other colorization networks (e.g., using a pixel-wise regression loss [Cheng et al. 2015, Iizuka et al. 2016, Isola et al. 2017, Zhang et al. 2017] or a classification loss [Larsson et al. 2016, Zhang et al. 2016]).

Instead, our objective function is designed to consider two desiderata. First, we prefer reliable reference colors to be applied in the output, thus making it faithful to the reference. Second, we encourage the colorization to be natural, even when no reliable reference color is available.

To achieve both goals, we propose a multi-task network, which involves two branches: the Chrominance branch and the Perceptual branch. Both branches share the same network and weights but are associated with their own input and loss functions, as shown in Fig. 3. A weighting parameter dictates the relative weight between the two branches.

In the Chrominance branch, the network learns to selectively propagate the correct reference colors, which depends on how well the target and the reference are matched. However, training such a network is not easy: 1) on the one hand, the network cannot be trained directly with the reference chrominance warped onto the target, because the corresponding ground truth colorization is unknown; 2) on the other hand, the network cannot be trained using the ground truth target chrominance as a reference, because that would essentially provide the network with the answer it is supposed to predict. Thus, we leverage the bidirectional mapping functions to reconstruct a "fake" reference from the ground truth chrominance, by warping the ground truth chrominance to the reference and back to the target. The fake reference replaces the warped reference chrominance in the training stage, with the underlying hypothesis that correct color samples in the fake reference are very likely to lie in the same positions as correct color samples in the warped reference, since both are warped with the same mapping.
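A minimal NumPy sketch of this construction is given below. It assumes the fake reference is obtained by looking up the ground truth chrominance through the reference-to-target mapping and then through the target-to-reference mapping; the function name, argument names, and integer (row, column) mapping layout are hypothetical.

```python
import numpy as np

def fake_reference(target_ab, phi_t2r, phi_r2t):
    """Round-trip the ground truth chrominance through both mappings.

    target_ab: (H, W, 2) ground truth ab channels of the target.
    phi_t2r:   (H, W, 2) integer mapping, target pixel -> reference pixel.
    phi_r2t:   (H, W, 2) integer mapping, reference pixel -> target pixel.
    """
    # Ground truth colors placed on the reference grid.
    on_reference_grid = target_ab[phi_r2t[..., 0], phi_r2t[..., 1]]
    # Warped back to the target grid, exactly as the reference would be.
    return on_reference_grid[phi_t2r[..., 0], phi_t2r[..., 1]]

H, W = 2, 2
target_ab = np.arange(H * W * 2, dtype=np.float32).reshape(H, W, 2)
yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
identity = np.stack([yy, xx], axis=-1)

# Consistent mappings reproduce the ground truth.
consistent = fake_reference(target_ab, identity, identity)

# An inconsistent mapping (every target pixel matched to reference pixel
# (0, 0)) spreads one wrong color everywhere, mimicking a bad match.
collapsed = np.zeros_like(identity)
inconsistent = fake_reference(target_ab, collapsed, identity)
```

Where the two mappings agree, the fake reference carries the correct color; where they disagree, it carries a wrong color, which is what the network must learn to reject.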

To train the Chrominance branch, both the target luminance and the fake reference are fed to the network (along with the similarity maps), yielding a predicted chrominance.


Here, the target is colorized with the guidance of the fake reference, and the prediction should recover the ground truth chrominance if the network selects the correct color samples and propagates them properly. The smooth L1 distance between the prediction and the ground truth is evaluated at each pixel and integrated over the entire image to give the Chrominance loss.


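Here the smooth $L_1$ distance between two values $x$ and $y$ is $\frac{1}{2}(x - y)^2$ if $|x - y| < 1$, and $|x - y| - \frac{1}{2}$ otherwise. We take the smooth $L_1$ loss as the distance metric to avoid the averaging solution in the ambiguous colorization problem [Zhang et al. 2017].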
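As a small worked example, the smooth L1 distance and its sum over an image can be written in NumPy as follows. The piecewise definition is the standard one; summing over all pixels and both chrominance channels is an assumption about how the per-pixel values are aggregated, and the function names are illustrative.

```python
import numpy as np

def smooth_l1(diff):
    """Standard smooth L1: quadratic near zero, linear for |diff| >= 1."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)

def chrominance_loss(pred_ab, gt_ab):
    """Sum of smooth L1 distances over all pixels and ab channels."""
    return float(np.sum(smooth_l1(pred_ab - gt_ab)))

values = smooth_l1(np.array([0.5, -2.0]))  # -> [0.125, 1.5]
loss = chrominance_loss(np.zeros((2, 2, 2)), np.full((2, 2, 2), 0.5))  # 8 * 0.125
```

The linear tail keeps large errors from dominating the gradient, which is why this distance is less prone to the averaged, desaturated solutions that a purely quadratic loss encourages.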

Using only the Chrominance branch works for reliable color samples in the reference but may fail when the reference is dissimilar to parts of the image. To allow the network to predict perceptually plausible colors even without a proper reference, we add a Perceptual branch. In this branch, we take the aligned reference chrominance and the target luminance as the network input during training, and generate the predicted chrominance.


In this branch, we minimize the Perceptual loss [Johnson et al. 2016] instead. It is the L2 distance between two feature maps extracted from the same layer of the original VGG-19: one computed for the predicted colorization and one for the ground truth color image. Perceptual loss measures the semantic differences caused by unnatural colorization and is robust to appearance differences caused by two plausible colors, as shown in Fig. 4. We also explored cosine distance but found that L2 distance generated superior results. A similar loss is widely used in other tasks, such as style transfer [Chen et al. 2017a, Chen et al. 2018], photo-realistic image synthesis [Chen and Koltun 2017], and super resolution [Sajjadi et al. 2017].

Our network learns its parameters by minimizing a weighted combination of both loss functions (Equations (4) and (6)) across a large dataset, where the relative weight is set empirically to balance both branches.
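The combination of the two terms can be sketched as follows. This is an illustration only: the feature maps are stand-in arrays rather than real VGG-19 activations, the weight `alpha` is left as a parameter because its value is not restated here, and attaching the weight to the perceptual term is an assumption.

```python
import numpy as np

def perceptual_loss(feat_pred, feat_gt):
    """Squared L2 distance between two feature maps of equal shape."""
    return float(np.sum((feat_pred - feat_gt) ** 2))

def total_loss(chrominance_term, perceptual_term, alpha):
    """Weighted sum of the two branch losses (weight placement assumed)."""
    return chrominance_term + alpha * perceptual_term

# Stand-in features; in practice these would come from a fixed layer of a
# pre-trained VGG-19 applied to the predicted and ground truth images.
feat_pred = np.ones((2, 2, 3))
feat_gt = np.zeros((2, 2, 3))

perc = perceptual_loss(feat_pred, feat_gt)                      # 12.0
combined = total_loss(chrominance_term=1.0, perceptual_term=perc, alpha=0.5)
```

Because feature-space distances and chrominance distances live on different scales, the weight mainly serves to bring the two terms to comparable magnitudes.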

Figure 4: Visualization of Perceptual loss. Both colorized results have the same chrominance (ab channels) distance to the ground truth, but the unnatural green face (right) has a much larger Perceptual loss than a more plausible skin color (left). Panels: ground truth, colorized result 1, colorized result 2, and an error map for each result. Input image: Zhang et al. [2017].
Figure 5: Color reference recommendation pipeline. Input images: ImageNet dataset.

3.2.2 Architecture

The sub-network adopts a U-net encoder-decoder structure with skip connections between the lower layers and the symmetric higher layers. We empirically chose the U-net architecture because of its effectiveness, as evidenced in many image generation tasks [Badrinarayanan et al. 2015, Yu and Koltun 2015, Zhang et al. 2017]. Specifically, our network consists of 10 convolutional blocks. Each convolutional block contains a few convolution-activation pairs, followed by a batch normalization layer [Ioffe and Szegedy 2015], with the exception of the last block. The feature maps in the first few convolutional blocks are progressively halved spatially while the number of feature channels is doubled. To aggregate multi-scale contextual information without losing resolution (as in Yu and Koltun [2015], Zhang et al. [2017] and Fan et al. [2018]), dilated convolution layers are used in two of the intermediate convolutional blocks. In the last few convolutional blocks, feature maps are progressively doubled spatially while the number of feature channels is halved. All down-sampling layers use strided convolution, while all up-sampling layers use strided deconvolution. Symmetric skip connections are added between the outputs of three pairs of encoder and decoder blocks. Finally, a convolution layer is added after the last block to predict the output chrominance. The final layer is a bounded activation layer (also used in Radford et al. [2015] and Chen et al. [2017]), which keeps the predicted chrominance within a meaningful bound.

3.2.3 Dataset

We generate a training dataset based on the ImageNet dataset [Russakovsky et al. 2015] by sampling approximately 700,000 image pairs from 7 popular categories: animals, plants, people, scenery, food, transportation and artifacts. Because of the cost of generating training data, these involve 700 classes out of the total 1,000 classes. To make the network robust to any reference, we sample image pairs with different extents of similarity. Specifically, one portion of the image pairs belongs to the Top-5 similarity (selected by our recommendation algorithm described in Section 4) within the same class. Another portion is randomly sampled within the same class. The remaining pairs have less similarity, as they are randomly sampled from different classes but within the same category. In the training stage, we randomly switch the roles of the two images in each pair to augment the data. In other words, the target and the reference can be swapped to form two variant pairs during training. All images are rescaled to a fixed shortest-edge length.

3.2.4 Training

Our network is trained using the Adam optimizer [Kingma and Ba 2014] with a batch size of 256. For every iteration, within the batch, the first half of the data (128) go through the Chrominance branch using the fake reference, and the remaining half (128) go through the Perceptual branch using the aligned reference. The two branches use their corresponding losses. When updating the Chrominance branch, only the Chrominance loss is used for gradient back propagation. When updating the Perceptual branch, only the Perceptual loss is used for gradient back propagation. The learning rate decays from its initial value by a factor of 0.1 every 3 epochs. By default, we train the whole network for 10 epochs. The whole training procedure takes around 2 days on 8 Titan XP GPUs.
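The batch split and the step decay can be sketched in plain Python. This is a hedged illustration: "decays by 0.1" is read as multiplying the learning rate by 0.1, the initial learning rate is left as a parameter (`base_lr`) because its value is not restated here, and both function names are hypothetical.

```python
def learning_rate(epoch, base_lr, decay=0.1, step=3):
    """Step decay: multiply by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

def split_batch(batch_size=256):
    """Index ranges for the two branches: first half, second half."""
    half = batch_size // 2
    chrominance_part = slice(0, half)         # trained with the Chrominance loss
    perceptual_part = slice(half, batch_size)  # trained with the Perceptual loss
    return chrominance_part, perceptual_part

# Learning rate as a multiple of the initial value over 10 epochs.
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in range(10)]
chrom_idx, perc_idx = split_batch(256)
```

With 10 epochs and a step of 3, the rate is reduced three times, at the starts of epochs 3, 6 and 9 (counting from 0).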

4 Color Reference Recommendation

As discussed earlier, our network is robust to reference selection, and provides user control for the colorization. To aid users in finding good references, we propose a novel image retrieval algorithm that automatically recommends good references to the user. Alternatively, the approach yields a fully automatic system by directly using the Top-1 candidate.

The ideal reference is expected to match the target image in both semantic content and photometric luminance. The purpose of incorporating the luminance term is to avoid any unnatural composition of luminance and chrominance. In other words, combining the reference chrominance with the target luminance may produce visually unfaithful colors to the reference. Therefore, we desire the reference’s luminance to be as close as possible to the target’s.

To measure semantic similarity, we adopt the intermediate features of a pre-trained image classification network as descriptors, which have been widely used in recent image retrieval works [Krizhevsky et al. 2012, Babenko et al. 2014, Gong et al. 2014, Babenko and Lempitsky 2015, Razavian et al. 2016, Tolias et al. 2015].

We propose an effective and efficient image retrieval algorithm. The system overview is shown in Fig. 5. We feed the luminance channel of each image from our training dataset (see Section 3.2.3) to our pre-trained gray-VGG-19 (see Section 3.1), and take two features for it: one from the last convolutional layer and one from the first fully-connected layer. These features are pre-computed and stored in the database for later queries. We also feed the query image (i.e., the target gray image) to the gray-VGG-19 network, and get its corresponding two features. We then proceed with the two ranking steps described next.

4.0.1 Global Ranking

Through gray-VGG-19, we can also get the recognized Top-1 class ID for the query image. According to the class ID, we narrow down the search domain to all the images within the same class. Here, we want to further filter out dissimilar candidates by comparing features between the query and all candidates. Even within the same class, a candidate could have a context that is irrelevant to the query. For example, the query could be "a cat running on grass", but the candidate could be "a cat sitting inside the house". We would nevertheless like the semantic content in the two images to be as similar as possible. To achieve this, for each candidate image in this class, we directly compute the cosine similarity (Equation (2)) between the fully-connected features of the query and of the candidate as the global score, and rank all candidates by their scores.
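The global ranking step amounts to a cosine-similarity sort, which can be sketched in NumPy as follows. The feature vectors here are toy values standing in for pre-computed fully-connected features, and the function name and `top_n` parameter are illustrative.

```python
import numpy as np

def global_ranking(query_feat, candidate_feats, top_n, eps=1e-8):
    """Rank candidates by cosine similarity to the query.

    query_feat:      (D,) feature of the query image.
    candidate_feats: (N, D) features of the candidates in the same class.
    Returns the indices of the top_n candidates and their scores.
    """
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    c = candidate_feats / (
        np.linalg.norm(candidate_feats, axis=1, keepdims=True) + eps
    )
    scores = c @ q
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]

query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
top_idx, top_scores = global_ranking(query, candidates, top_n=2)
```

In this toy case the second candidate points in the same direction as the query and ranks first, followed by the third candidate at 45 degrees.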

Figure 6: Visualization of color selection in the Chrominance branch. The points with smaller difference between the predicted colorization and aligned reference color are most likely to be selected by the network and maintained in the final results. Note how inconsistencies between the similarity maps and the true color difference make it difficult to determine good points by hand-crafted rules. Panels (left to right): target, reference, aligned reference, predicted result, chrominance difference, and two matching error maps. Input images: ImageNet dataset.
Figure 7: Comparison of results from training with different branch configurations. Panels (left to right): target, reference, aligned reference, Chrominance branch only, two branches (three columns), Perceptual branch only. Input images (from left to right, top to bottom): Tabitha Mort/pexels, Steve Morgan/wikimedia and Anonymous/pxhere.

4.0.2 Local Ranking

The global ranking provides us with a shortlist of the top-ranked candidates. However, the fully-connected features cannot provide more accurate information about the object, since they ignore spatial information. For this reason, we further prune these candidates by conducting a local ranking on the remaining images. The local similarity score consists of both semantic and luminance terms.

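For each image pair consisting of the query and a candidate, at each point p in the convolutional feature map of the query, we find its nearest neighbor q in the convolutional feature map of the candidate by minimizing the cosine distance between the two feature vectors. Then, the semantic term is defined as the cosine similarity (see Equation (2)) between the two feature vectors at p and q.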

The luminance term measures the similarity of luminance statistics between two local windows corresponding to and respectively. We evenly split image into a 2D grid with each grid cell having resolution. Each grid cell in the image corresponds to a point in its feature map , since it undergoes 4 down-sampling layers. is denoted as the grid cell in the image corresponding to the point in . Likewise, from corresponds to the point in . The function measures the correlation coefficient between luminance histograms of and .

The local similarity score is summarized as:


where determines the relative importance between the two terms (empirically set to ). This similarity score is computed for each pair . According to all local scores, we re-rank all retained candidates and retrieve the top selections.
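A minimal sketch of how the two terms could be combined is given below. It assumes the score is accumulated as a sum over target points, that the luminance term is a Pearson correlation between histograms, and that `alpha` is supplied by the caller; the function names and the toy data are hypothetical, and the exact form of our score is the equation above rather than this code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(h1, h2):
    """Correlation coefficient between two histograms (0.0 if either is flat)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    if var1 == 0.0 or var2 == 0.0:
        return 0.0
    return cov / math.sqrt(var1 * var2)


def local_score(target_feats, ref_feats, target_hists, ref_hists, alpha):
    """Semantic term plus alpha times luminance term, summed over target points.

    target_feats / ref_feats: one feature vector per feature-map point.
    target_hists / ref_hists: luminance histogram of the grid cell that
    corresponds to each feature-map point (same ordering as the features).
    """
    total = 0.0
    for p, feat_p in enumerate(target_feats):
        # Nearest neighbour q in the reference: highest cosine similarity,
        # which is equivalent to the smallest cosine distance.
        sims = [cosine_similarity(feat_p, feat_q) for feat_q in ref_feats]
        q = max(range(len(sims)), key=sims.__getitem__)
        total += sims[q] + alpha * pearson(target_hists[p], ref_hists[q])
    return total
```

Candidates retained by the global ranking would then be re-ranked by this score in descending order.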

We compress neural features with the common PCA-based compression [Babenko et al. 2014] to accelerate the search. The channels of feature are compressed from to and the channels of features are reduced from to with practically negligible loss. After these dimensionality reductions, our reference retrieval can run in real time.

5 Discussion

In this section, we analyze and demonstrate the capabilities of our colorization network through ablation studies.

5.1 What does the Colorization sub-net learn?

The Colorization sub-net learns how to select, propagate, and predict colors based on the target and the reference. As discussed earlier, it is an end-to-end network that involves two branches, each playing a distinct role. First, we want to understand the behavior of the network using just the Chrominance branch during learning. For this purpose, we only train the Chrominance branch of by minimizing the Chrominance loss (in Equation (4)), and evaluate it on one example to intuitively understand its operation (Fig. 6). By comparing the chrominance of the predicted result ( column) with the chrominance of the aligned reference ( column), we notice that they have consistent colors in most regions (e.g., "blue" sky, "white" plane and "green" lawn). This indicates that our Chrominance branch picks color samples from the reference and propagates them to the entire image to achieve a smooth colorization.

To learn which color samples are selected by the network, we compute the chrominance difference between the predicted result and the aligned reference in the column ("blue" denotes nearly no difference while "red" denotes a noticeable difference). Colors of the points with smaller errors are more likely to be selected by the network and then retained in the final result.

"How does the network infer good samples?" or "Can it be directly inferred from the matching between images?" To answer these questions, we compare the difference map ( column) with the averaged five-level matching errors ( column) and ( columns). On the one hand, we can see that the matching errors are essentially consistent with the difference. This demonstrates that our network can learn a good sampling based on the matching quality, which serves as a key "hint" to determine appropriate locations. On the other hand, we find that the network does not always select points with smaller matching errors, as evidenced by a significant number of inconsistent samples. Without similarity maps, the Colorization sub-net can hardly infer the matching accuracy between the aligned reference and the input, which also increases the ambiguity of the color prediction. Thus, adaptive selection according to similarities may be infeasible through an intuitive heuristic. However, by using large-scale data, our network can learn this mechanism directly and more robustly.

To understand the role of the Perceptual branch, we train it by solely minimizing the Perceptual loss (in Equation (6)). We show an example in Fig. 7. In this case, some regions do not have a good match in the reference (e.g., the "trunk" object on the right). By using the Chrominance branch only, we obtain results with incorrect colors for trunk objects ( column). However, the Perceptual branch is capable of addressing this problem ( column). It predicts a single, natural brown color for the trunk, since the majority of trunks in the training data are brown. Thus, the prediction of the Perceptual branch is purely based on the dominant color of objects in the large-scale data, and is independent of the reference. As we can see in the column, it predicts the same colors even for different references.

To enjoy the advantages of both branches, we adopt a multi-task training strategy to train both branches simultaneously. The term is used as their relative weight. The double-branch results in columns of Fig. 7 explicitly indicate that our network learns to adaptively fuse the predictions of both branches: selecting and propagating the reference color at well-matched regions, but generalizing to the natural color learnt from large-scale data for mismatched or unrelated regions. The relative weight tunes the preference towards each branch. Evaluated on the ImageNet validation data, we set as the default in our experiments.

Top row: Target, Aligned reference, Samples (threshold), Samples (cross-check). Bottom row: Reference, Our result, Zhang et al. [2017] (threshold), Zhang et al. [2017] (cross-check).
Figure 8: Comparison of our end-to-end network with the alternative of selecting color samples with manual thresholds or cross-check matching, and then colorizing with Zhang et al. [2017]. Input images: ImageNet dataset.
Columns: Target, Ground truth, Manually selected, Top-1, Intra-class, Intra-category, Inter-category.
Figure 9: Our method predicts plausible colorization with different references: manually selected, automatically recommended, randomly selected in the same class of the target, randomly selected in the same category, and randomly selected out of the category. Input images: ImageNet dataset except the two manual reference photos by Andreas Mortonus/flickr and Indi Samarajiva/flickr.
Columns: Target, SIFTFlow, DaisyFlow, DeepFlow, Deep Analogy.
Figure 10: Our method works with different dense matching algorithms. The first row shows the target and the aligned references by different matching algorithms: SIFTFlow [Liu et al. 2011], DaisyFlow [Tola et al. 2010], DeepFlow [Weinzaepfel et al. 2013], and Deep Image Analogy [Liao et al. 2017]. The second row shows the reference and final colorized results using different aligned references. Input images: ImageNet dataset.
Columns: Target, Iizuka et al. [2016], Zhang et al. [2016], Larsson et al. [2016], Ours, Reference.
Figure 11: Transferability comparison of colorization networks trained on ImageNet. Input images (from left to right, top to bottom): Charpiat et al. [2008], Snow64/wikimedia and Ryo Taka/pixabay.

5.2 Why is end-to-end learning crucial?

Our Colorization sub-net learns three key components in colorization: color sample selection, color propagation, and dominant color prediction. To our knowledge, there is no other work that learns all three steps simultaneously through a neural network.

A simple alternative is to process the three steps sequentially. In our study, we adopt the state-of-the-art color propagation and prediction method [Zhang et al. 2017]. Such a learning-based method significantly improves on previous optimization methods [Levin et al. 2004], especially when given few user points. We try two color selection strategies: 1) Threshold: select color points with the top 10 averaged bidirectional similarity score; 2) Cross-check in matching: select color points where the bidirectional mapping satisfies . Once the points are obtained, we directly feed them to the pre-trained color propagation network [Zhang et al. 2017]. We show the two predicted colorization results in and columns of Fig. 8 respectively.
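The two hand-crafted selection strategies can be sketched as follows. This is an illustration under stated assumptions rather than our exact procedure: the cut-off is expressed as a hypothetical `keep_fraction` parameter, and the cross-check condition is assumed to be the usual round-trip test (a target point is kept when its match in the reference maps back to the same target point).

```python
def select_by_threshold(similarity, keep_fraction):
    """Keep the points with the highest similarity scores.

    similarity: dict mapping a target point to its averaged bidirectional
    similarity score. keep_fraction is an assumed cut-off parameter.
    """
    k = max(1, int(len(similarity) * keep_fraction))
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    return set(ranked[:k])


def select_by_cross_check(forward, backward):
    """Keep target points whose forward match maps back to themselves.

    forward: target point -> matched reference point.
    backward: reference point -> matched target point.
    """
    return {p for p, q in forward.items() if backward.get(q) == p}
```

Both functions return a sparse set of target points whose reference colors would then be passed to a separate propagation network, which is exactly the sequential pipeline evaluated here.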

As we can see, the colorization does not work well and introduces many noticeable color artifacts. One possible reason is that the network [Zhang et al. 2017] is not trained on this type of input sample, but on user-guided points. Therefore, such sequential learning would always result in a sub-optimal solution.

Moreover, the study also shows the difficulty in determining hand-crafted rules for point selections, as mentioned in Section 5.1. It is hard to eliminate all improper color samples through heuristics. The pre-trained network will also propagate wrong samples, thus causing such artifacts. On the contrary, our end-to-end learning approach avoids these pitfalls by jointly learning selection, propagation and prediction, resulting in a single network that directly optimizes for the quality of the final colorization.

5.3 Robustness

A significant advantage of our network is its robustness to reference selection compared with traditional exemplar-based colorization. It can provide plausible colors whether the reference is related or unrelated to the target. Fig. 9 shows how well our method works on varying references with different levels of similarity to the target image. As we can see, the colorization result is naturally more faithful to the reference when the reference is more similar to the target in semantic content. In other situations, the result degenerates to a conservative colorization. This is due to the Perceptual branch, which predicts the dominant colors from large-scale data. This behavior is similar to that of existing learning-based approaches (e.g., [Iizuka et al. 2016, Larsson et al. 2016, Zhang et al. 2016]).

In addition, our network is also robust to different types of dense matching algorithms, as shown in Fig. 10. Note that our network is only trained using Deep Image Analogy [Liao et al. 2017] as the default matching approach, and is then tested with various matching algorithms. We can also observe that the result is more faithful to the reference color in well-aligned regions, while it degenerates to the dominant colors in misaligned regions. Note that better alignment can improve the results for objects that have semantic correspondences in the reference, but cannot help the colorization of objects that do not exist in the reference.

5.4 Transferability

Previous learning-based methods are data-driven and thus only able to colorize images that share common properties with those in the training set. Since their networks are trained on natural images, like the ImageNet dataset, they would fail to provide satisfactory colors for unseen images, for example, human-created images (e.g., paintings or cartoons). Their results may degrade to no colorization (, columns in Fig. 11) or introduce notable color artifacts ( column). By contrast, our method benefits from the reference and successfully works in both cases. Although our network does not see such types of images in training, with the Chrominance branch it learns to predict colors based on correlations of image pairs. This learnt ability carries over to unseen objects.

6 Comparison and Results

In this section, we first report our performance and user study results. Then we qualitatively and quantitatively compare our method to previous techniques, including learning-based, exemplar-based, and interactive-based methods. Finally, we validate our method on legacy grayscale images and videos.

6.1 Performance

Our core algorithm is developed in CUDA. All of our experiments are conducted on a PC with an Intel E5 2.6GHz CPU and an NVIDIA Titan XP GPU. The total runtime for a image is approximately 0.166s, including 0.016s for reference recommendation, 0.1s for similarity measurement and 0.05s for colorization.

Method                   Top-5 Class Acc (%)   Top-1 Class Acc (%)   PSNR (dB)
Ground truth (color)     90.35/89.99           71.12/71.25           NA
Ground truth (gray)      84.2/81.35            61.5/57.39            23.28
Iizuka et al. [2016]     85.53/84.12           63.42/61.61           24.92
Zhang et al. [2016]      84.28/83.12           60.97/60.25           22.43
Larsson et al. [2016]    85.42/83.93           63.56/61.36           25.50
Ours                     85.94/84.79           65.1/63.73            22.92
Table 2: Colorization results compared with learning-based methods on images from the ImageNet validation set. The second and third columns are the Top-5 and Top-1 classification accuracies after colorization using the VGG19-BN and VGG16 networks. The last column is the PSNR between the colorized result and the ground truth.

6.2 Comparison with Exemplar-based methods

To compare with existing exemplar-based methods [Welsh et al. 2002, Irony et al. 2005, Bugeau et al. 2014, Gupta et al. 2012], we run our algorithm on 35 pairs collected from their papers. Fig. 12 shows several representatives and the complete set can be found in the supplemental material. To provide a fair comparison, we directly borrow their results from their publications or run their publicly available code.

In these examples, the content and object layouts of the reference are very similar to the target (i.e., no irrelevant objects or great intensity disparities). This is a strict requirement of existing exemplar-based methods, whose colorization relies solely on low-level features and is not learned from large-scale data. On the contrary, our algorithm is more general and has no such restrictive requirement. Even on these very related image pairs, our method shows better visual quality than previous techniques. The success comes from the sophisticated mechanism of color sample selection and propagation that are jointly learned from data rather than through heuristics.

Columns: Target, Reference, Welsh et al. [2002], Irony et al. [2005], Bugeau and Ta [2012], Gupta et al. [2012], Ours.
Figure 12: Comparison results with example-based methods. Input images: Irony et al. [2005] and Gupta et al. [2012].
Columns: Target, Ground truth, Iizuka et al. [2016], Larsson et al. [2016], Zhang et al. [2016], Ours, Reference.
Figure 13: Comparison results with learning-based methods. Input images: ImageNet dataset.
Columns: Target, Ground truth, Ours (Top-1 ref), Zhang et al. [2017], Larsson et al. [2016], Iizuka et al. [2016], Ours (random ref).
Figure 14: An example showing users' preference for vibrant colorization. The numbers in brackets represent the fooling rates. Our colorized results ( and last columns) are guided by the top-right references. Input images: ImageNet dataset.

6.3 Comparison with learning-based methods

We compare our method with state-of-the-art learning-based colorization networks [Larsson et al. 2016, Zhang et al. 2016, Iizuka et al. 2016] by evaluating on images in the validation set of ImageNet (the same as Larsson et al. [2016]). Our method is trained on a subset of the ImageNet training set, as described in Section 3.2. We tested our automatic solution by taking the Top-1 recommendation as the reference (Sec. 4). To be fair, we also use author-released models trained on the ImageNet dataset to run their methods.

We show a quantitative comparison of colorized results in Table 2 on two metrics: PSNR relative to the ground truth and classification accuracy. Our results have a lower PSNR score (22.9178dB) than Larsson et al. [2016] and Iizuka et al. [2016], because PSNR overly penalizes a plausible but different colorization result. A correct colorization faithful to the reference may even achieve a lower PSNR than a conservative colorization, such as predicting gray for every pixel (24.9214dB). On the contrary, our method outperforms all other methods in recognition accuracy when the colorized results are fed into VGG19 or VGG16 networks pre-trained for image recognition. This indicates that our colorized results appear more natural than the others, to the point of being recognized almost as well as the true color images.
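The bias of PSNR towards conservative predictions can be seen in a small worked example. The sketch below assumes 8-bit values (peak 255) and a mean squared error over a flat list of channel values; how the metric was computed for Table 2 (color space, channels) is not specified here, and the pixel values are made up for illustration.

```python
import math


def psnr(result, truth, peak=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(result) != len(truth) or not result:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(result, truth)) / len(result)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# One RGB pixel of a green leaf, a gray guess, and a plausible yellow leaf.
truth = [60, 160, 60]
gray = [120, 120, 120]
vivid = [200, 180, 40]

gray_is_rewarded = psnr(gray, truth) > psnr(vivid, truth)
```

Here the gray guess scores higher than the yellow leaf, even though a yellow leaf is a perfectly plausible colorization.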

A qualitative comparison for selected representative cases is shown in Fig. 13. For a full set of images randomly drawn from cases, please refer to our supplemental material. From this comparison, an apparent difference is that our results are more saturated and colorful when compared to Iizuka et al. [2016] and Larsson et al. [2016], with the help of sampling colorful points from the reference. Zhang et al. [2016] uses a class-rebalancing step to oversample more colorful portions of the gamut during training, but such a solution sometimes results in overly aggressive colorization and causes artifacts (e.g., the blue and orange colors in the 4th row of Fig. 13). Our approach can control colorization and achieve desired colors by simply giving different references, thus our results are visually faithful to the reference colors.

In addition to quantitative and qualitative comparisons, we use a perceptual metric to evaluate how compelling our colorization looks to a human observer. We ran a real vs. fake two-alternative forced-choice user study on Amazon Mechanical Turk (AMT) across different learning-based methods. This is similar to the approach taken by Zhang et al. [2016]. Participants in the study were shown a series of pairs of images. Each pair consisted of a ground-truth color photo and a re-colorized version produced by either our algorithm (randomly selected reference or Top-1 recommended reference) or a baseline [Iizuka et al. 2016, Larsson et al. 2016, Zhang et al. 2016]. The two images were shown side-by-side in randomized order. For every pair, participants were asked to observe the image pair for no more than 5 seconds and click on the photo they believed was the more realistic as early as possible. All images were shown with the resolution of pixels on the short edge.

To guarantee that all algorithms are compared by the same "turker" population, we included results from different algorithms in one experimental session for each participant. Each session consisted of 5 practice trials (excluded from subsequent analysis), followed by 50 randomly selected test pairs (each algorithm contributed 10 pairs). During the practice trials, participants were given feedback as to whether their answers were correct. No feedback was given during the 50 test pairs. We conducted 5 different sessions to make sure every algorithm covered all image pairs. Each participant was allowed to complete at most one session. All experiment sessions were posted simultaneously and a total of 125 participants were involved in the user study (25 participants per session).
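The session design implies some simple bookkeeping, sketched below together with the fooling-rate computation. The constants are taken from the description above; the function name and the representation of a trial as a boolean are assumptions made for illustration.

```python
def fooling_rate(trials):
    """Percentage of trials in which the colorized image was picked as real.

    trials: list of booleans, True when the participant chose the
    colorized image over the ground-truth photo.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    return 100.0 * sum(trials) / len(trials)


SESSIONS = 5
PARTICIPANTS_PER_SESSION = 25
ALGORITHMS = 5            # three baselines plus our two reference settings
PAIRS_PER_ALGORITHM = 10  # test pairs each algorithm contributes per session

total_participants = SESSIONS * PARTICIPANTS_PER_SESSION
test_pairs_per_session = ALGORITHMS * PAIRS_PER_ALGORITHM
judgements_per_algorithm = total_participants * PAIRS_PER_ALGORITHM
```

Under this design each algorithm receives 1250 judgements in total, and a fooling rate of 50% would correspond to participants being unable to tell the colorized image from the real one.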

As shown in Table 3, our method with the Top-1 reference (38.08%) and Zhang et al. [2016] (35.36%) respectively ranked first and second in the fooling rate. We believe that this may be partly because participants preferred more colorful results to less saturated results, as shown in Fig. 14. Zhang et al. [2016] uses a class-rebalancing step to encourage rare colors, but at the expense of images which are overly aggressively colorized, while our method produces more vibrant colorization by utilizing correct color samples from the reference. Our method with a random reference degenerates to conservative color prediction, since few reliable color samples can be used from the unrelated reference. This verifies that a good reference is important to high-quality colorization.

Fig. 15 provides a better sense of the participants’ competency at detecting subtle errors made by our algorithm. The percentage on the left shows how often participants think our colorized result is more realistic than the ground truth. Some issues may come from lack of colorization in some local regions (e.g., ), or poor white balancing in the ground truth image (e.g., ). Surprisingly, our results are considered more natural to human observers than the ground truth image in some cases (e.g.).

Method                         Fooling Rate (%)
Iizuka et al. [2016]           24.56 ± 1.76
Larsson et al. [2016]          24.64 ± 1.71
Zhang et al. [2016]            35.36 ± 1.52
Ours with random reference     21.92 ± 1.56
Ours with Top-1 reference      38.08 ± 1.72
Table 3: Amazon Mechanical Turk real vs. fake fooling rate. We compared our method using an automatically recommended reference or a random intra-class reference with other learning-based methods. Note that the best expectation of fooling rate should be around 50%, which occurs when the user cannot distinguish real from fake images and is forced to choose between two equally believable images. Input images: ImageNet dataset.
Columns: Ground truth, Our result, Ground truth, Our result.
Figure 15: Examples from the user study. Results are generated with our method with the Top-1 reference and are sorted by how often the users chose our algorithm’s colorization over the ground truth. Input images: ImageNet dataset.

6.4 Comparison with interactive-based methods

We compare our hybrid method with a different hybrid solution [Zhang et al. 2017] which combines user-guided scribbles (i.e., points) and deep learning. As shown in Fig. 16, by giving a proper reference selected by the user, our method can achieve comparable quality to theirs with dozens of user-given color points. Thus, our method proposes a simple way to control the appearance of colorization generated with the help of deep neural networks.

Zhang et al. [2017] also present a variant of their method which uses a global color histogram of a reference image as input to control colorization results. In Fig. 17, we show a comparison with results by Zhang et al. [2017] using the global color histogram either from the reference image ( column) or the aligned reference ( column). Their method provides a global control to alter color distribution and average saturation but fails to achieve locally variant colorization effects. Our method can preserve semantic correspondence and locally map the reference color to the target (e.g., the plant colorized in green and the flowerpot colorized in blue).

6.5 Colorization of legacy photographs and movies

Our system was trained on "synthetic" grayscale images by removing the chrominance channels from color images. We tested our system on legacy grayscale images, and show some selected results in Fig. 18. Moreover, our method can be extended to colorize legacy movies by independently colorizing each frame and then temporally smoothing the colorized results with the method of Bonneel et al. [2015]. Some selected frames of a movie example are shown in Fig. 19. Please refer to our supplemental material for a video demo.

Columns: Target, Zhang et al. [2017], Ours, Reference.
Figure 16: Comparison results with the interactive-based method. The points overlaid on the target are manually given and used in Zhang et al. [2017], while the reference in the last column is manually selected and used by our approach. Input images (from left to right, top to bottom): Ansel Adams/wikipedia, Carina Chen/pixabay, Dorothea Lange and Bess Hamiti/pixabay.
Top row: Target, Zhang et al. [2017] (Top-1 ref), Zhang et al. [2017] (Top-1 aligned ref), Ours (Top-1 ref). Bottom row: Ground truth, Zhang et al. [2017] (Random ref), Zhang et al. [2017] (Random aligned ref), Ours (Random ref).
Figure 17: Comparison to Zhang et al. [2017] using global histogram hints from references overlaid on the top-right corner. The histogram used in Zhang et al. [2017] is either from the reference ( column) or from its aligned version generated by Liao et al. [2017] ( column). Input images: ImageNet dataset.
Figure 18: Colorization of legacy pictures. In each set, the target grayscale photo is the upper-left, the reference is the lower-left and our result lies on the right. Input images (from left to right, top to bottom, target to reference): George L. Andrews/wikipedia, Official White House Photographer/wikimedia, Vandamm/wikimedia, Anonymous/wikimedia, Esther Bubley/wikimedia, Anonymous/wikimedia, Nick Macneill/geograph, Bernd/pixabay, Oberholster Venita/pixabay, EU2017EE Estonian Presidency/wikimedia, Audrey Coey/flickr, EU2017EE Estonian Presidency/wikimedia, Patrick Feller/wikimedia and Anonymous/pixabay.
Figure 19: Extending our method to video colorization. All black and white frames (top row) are independently colorized with the same reference (leftmost column of bottom row) to generate colorized results (right 4 columns of bottom row). The input clip is from the film Anna Lucasta (public domain) and the reference photo is by Heather Harvey/flickr.

7 Limitations and Conclusions

We have presented a novel colorization approach that employs a deep learning architecture and a reference color image. Our approach is a general solution for exemplar-based colorization since it yields plausible results even in cases where the target image does not have clear correspondences in the reference. In such cases, it is still capable of producing plausible and natural colors for the target image. Unlike most deep-learning colorization frameworks, our approach allows us to control colorized results. Furthermore, with the reference recommendation algorithm, the system also provides the user with an automatic tool for re-coloring black-and-white photographs and movies.

Columns: Target, Reference, Aligned reference, Combination, Result.
Figure 20: Limitations of our work. Top row: our network cannot colorize objects with unusual or artistic colors constrained by the perceptual loss. Second row: the perceptual loss does not sufficiently penalize incorrect reference colors on regions with less semantic importance, e.g. a smooth background. Third row: the classification network fails to distinguish regions with similar local textures, e.g. sand and grass. Fourth row: the result is visually less faithful to the reference if their luminance gaps are too large. Input images: ImageNet dataset except the images on the first row by Anonymous/pxhere and Anonymouse/pixabay.

Our approach also suffers from some limitations that can be addressed in future work. First, our network cannot colorize objects with unusual or artistic colors, since it is constrained by the learning from the proposed Perceptual branch, as shown in the top row of Fig. 20.

Second, the perceptual loss based on the classification network (VGG) cannot penalize incorrect colors in regions with less semantic importance, such as the wall in the second row of Fig. 20, or fails to distinguish less semantic regions with similar local texture, such as the similar sand and grass textures in the third row of Fig. 20. In addition, our result is less faithful to the reference when there are dramatic luminance disparities between images, as shown in the bottom row of Fig. 20. To mitigate this limitation, our reference recommendation algorithm enforces luminance similarity in the local ranking. Occasionally, our method fails to predict colors for some local regions, as shown in Fig. 15. It would be worthwhile to explore how to better balance the two branches of our network.


  • [Babenko and Lempitsky 2015] Babenko, A., and Lempitsky, V. 2015. Aggregating local deep features for image retrieval. In Proc. ICCV, 1269–1277.
  • [Babenko et al. 2014] Babenko, A., Slesarev, A., Chigorin, A., and Lempitsky, V. 2014. Neural codes for image retrieval. In Proc. ECCV, Springer, 584–599.
  • [Badrinarayanan et al. 2015] Badrinarayanan, V., Kendall, A., and Cipolla, R. 2015. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.
  • [Barnes et al. 2009] Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. 2009. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. (Proc. of SIGGRAPH) 28, 3, 24–1.
  • [Bay et al. 2006] Bay, H., Tuytelaars, T., and Van Gool, L. 2006. Surf: Speeded up robust features. 404–417.
  • [Bonneel et al. 2015] Bonneel, N., Tompkin, J., Sunkavalli, K., Sun, D., Paris, S., and Pfister, H. 2015. Blind video temporal consistency. ACM Trans. Graph. (Proc. of SIGGRAPH Asia) 34, 6, 196.
  • [Bugeau and Ta 2012] Bugeau, A., and Ta, V.-T. 2012. Patch-based image colorization. In Pattern Recognition (ICPR), 2012 21st International Conference on, IEEE, 3058–3061.
  • [Bugeau et al. 2014] Bugeau, A., Ta, V.-T., and Papadakis, N. 2014. Variational exemplar-based image colorization. IEEE Trans. on Image Processing 23, 1, 298–307.
  • [Charpiat et al. 2008] Charpiat, G., Hofmann, M., and Schölkopf, B. 2008. Automatic image colorization via multimodal predictions. 126–139.
  • [Chen and Koltun 2017] Chen, Q., and Koltun, V. 2017. Photographic image synthesis with cascaded refinement networks. In Proc. ICCV, vol. 1.
  • [Chen et al. 2017a] Chen, D., Liao, J., Yuan, L., Yu, N., and Hua, G. 2017. Coherent online video style transfer. In Proc. ICCV.
  • [Chen et al. 2017b] Chen, D., Yuan, L., Liao, J., Yu, N., and Hua, G. 2017. Stylebank: An explicit representation for neural image style transfer. In Proc. CVPR.
  • [Chen et al. 2018] Chen, D., Yuan, L., Liao, J., Yu, N., and Hua, G. 2018. Stereoscopic neural style transfer. In Proc. CVPR.
  • [Cheng et al. 2015] Cheng, Z., Yang, Q., and Sheng, B. 2015. Deep colorization. In Proc. ICCV, 415–423.
  • [Chia et al. 2011] Chia, A. Y.-S., Zhuo, S., Gupta, R. K., Tai, Y.-W., Cho, S.-Y., Tan, P., and Lin, S. 2011. Semantic colorization with internet images. ACM Trans. Graph. (Proc. of SIGGRAPH Asia) 30, 6, 156.
  • [Çiçek et al. 2016] Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., and Ronneberger, O. 2016. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 424–432.
  • [Deshpande et al. 2015] Deshpande, A., Rock, J., and Forsyth, D. 2015. Learning large-scale automatic image colorization. In Proc. ICCV, 567–575.
  • [Fan et al. 2018] Fan, Q., Chen, D., Yuan, L., Hua, G., Yu, N., and Chen, B. 2018. Decouple learning for parameterized image operators. In ECCV 2018, European Conference on Computer Vision.
  • [Gong et al. 2014] Gong, Y., Wang, L., Guo, R., and Lazebnik, S. 2014. Multi-scale orderless pooling of deep convolutional activation features. In Proc. ECCV, Springer, 392–407.
  • [Gupta et al. 2012] Gupta, R. K., Chia, A. Y.-S., Rajan, D., Ng, E. S., and Zhiyong, H. 2012. Image colorization using similar images. In Proc. of the 20th ACM international conference on Multimedia, ACM, 369–378.
  • [He et al. 2015] He, K., Zhang, X., Ren, S., and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • [He et al. 2017] He, M., Liao, J., Yuan, L., and Sander, P. V. 2017. Neural color transfer between images. arXiv preprint arXiv:1710.00756.
  • [Huang et al. 2005] Huang, Y.-C., Tung, Y.-S., Chen, J.-C., Wang, S.-W., and Wu, J.-L. 2005. An adaptive edge detection based colorization algorithm and its applications. In Proc. of the 13th annual ACM international conference on Multimedia, ACM, 351–354.
  • [Iizuka et al. 2016] Iizuka, S., Simo-Serra, E., and Ishikawa, H. 2016. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM, vol. 35, 110.
  • [Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.
  • [Irony et al. 2005] Irony, R., Cohen-Or, D., and Lischinski, D. 2005. Colorization by example. In Rendering Techniques, 201–210.
  • [Isola et al. 2017] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In Proc. CVPR.
  • [Johnson et al. 2016] Johnson, J., Alahi, A., and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, Springer, 694–711.
  • [Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Krizhevsky et al. 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [Larsson et al. 2016] Larsson, G., Maire, M., and Shakhnarovich, G. 2016. Learning representations for automatic colorization. In Proc. ECCV, 577–593.
  • [Levin et al. 2004] Levin, A., Lischinski, D., and Weiss, Y. 2004. Colorization using optimization. ACM Trans. Graph. (Proc. of SIGGRAPH) 23, 3, 689–694.
  • [Liao et al. 2017] Liao, J., Yao, Y., Yuan, L., Hua, G., and Kang, S. B. 2017. Visual attribute transfer through deep image analogy. ACM Trans. Graph. (Proc. of SIGGRAPH) 36, 4, 120.
  • [Liu et al. 2008] Liu, X., Wan, L., Qu, Y., Wong, T.-T., Lin, S., Leung, C.-S., and Heng, P.-A. 2008. Intrinsic colorization. ACM Trans. Graph. (Proc. of SIGGRAPH Asia) 27, 5, 152.
  • [Liu et al. 2011] Liu, C., Yuen, J., and Torralba, A. 2011. Sift flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33, 5, 978–994.
  • [Lowe 1999] Lowe, D. G. 1999. Object recognition from local scale-invariant features. In Proc. ICCV, vol. 2, IEEE, 1150–1157.
  • [Luan et al. 2007] Luan, Q., Wen, F., Cohen-Or, D., Liang, L., Xu, Y.-Q., and Shum, H.-Y. 2007. Natural image colorization. In Proc. of the 18th Eurographics conference on Rendering Techniques, Eurographics Association, 309–320.
  • [Pitie et al. 2005] Pitie, F., Kokaram, A. C., and Dahyot, R. 2005. N-dimensional probability density function transfer and its application to color transfer. In Proc. ICCV, vol. 2, 1434–1439.
  • [Qu et al. 2006] Qu, Y., Wong, T.-T., and Heng, P.-A. 2006. Manga colorization. ACM Trans. Graph. (Proc. of SIGGRAPH) 25, 3, 1214–1220.
  • [Radford et al. 2015] Radford, A., Metz, L., and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [Razavian et al. 2016] Razavian, A. S., Sullivan, J., Carlsson, S., and Maki, A. 2016. A baseline for visual instance retrieval with deep convolutional networks. arXiv preprint arXiv:1412.6574.
  • [Reinhard et al. 2001] Reinhard, E., Ashikhmin, M., Gooch, B., and Shirley, P. 2001. Color transfer between images. IEEE Computer Graphics and Applications 21, 5, 34–41.
  • [Ronneberger et al. 2015] Ronneberger, O., Fischer, P., and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 234–241.
  • [Russakovsky et al. 2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3, 211–252.
  • [Sajjadi et al. 2017] Sajjadi, M. S., Schölkopf, B., and Hirsch, M. 2017. Enhancenet: Single image super-resolution through automated texture synthesis. In Proc. ICCV, 4491–4500.
  • [Sangkloy et al. 2016] Sangkloy, P., Lu, J., Fang, C., Yu, F., and Hays, J. 2016. Scribbler: Controlling deep image synthesis with sketch and color. In Proc. CVPR.
  • [Simonyan and Zisserman 2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [Szegedy et al. 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. 2015. Going deeper with convolutions. In Proc. CVPR.
  • [Tai et al. 2005] Tai, Y.-W., Jia, J., and Tang, C.-K. 2005. Local color transfer via probabilistic segmentation by expectation-maximization. In Proc. CVPR, vol. 1, IEEE, 747–754.
  • [Tola et al. 2010] Tola, E., Lepetit, V., and Fua, P. 2010. Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32, 5, 815–830.
  • [Tolias et al. 2015] Tolias, G., Sicre, R., and Jégou, H. 2015. Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879.
  • [Weinzaepfel et al. 2013] Weinzaepfel, P., Revaud, J., Harchaoui, Z., and Schmid, C. 2013. Deepflow: Large displacement optical flow with deep matching. In Proc. ICCV, 1385–1392.
  • [Welsh et al. 2002] Welsh, T., Ashikhmin, M., and Mueller, K. 2002. Transferring color to greyscale images. ACM Trans. Graph. (Proc. of SIGGRAPH) 21, 3, 277–280.
  • [Yang et al. 2014] Yang, H., Lin, W.-Y., and Lu, J. 2014. Daisy filter flow: A generalized discrete approach to dense correspondences. In Proc. CVPR, 3406–3413.
  • [Yatziv and Sapiro 2006] Yatziv, L., and Sapiro, G. 2006. Fast image and video colorization using chrominance blending. IEEE Trans. on Image Processing 15, 5, 1120–1129.
  • [Yosinski et al. 2015] Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H. 2015. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
  • [Yu and Koltun 2015] Yu, F., and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • [Zhang et al. 2016] Zhang, R., Isola, P., and Efros, A. A. 2016. Colorful image colorization. In Proc. ECCV, 649–666.
  • [Zhang et al. 2017] Zhang, R., Zhu, J.-Y., Isola, P., Geng, X., Lin, A. S., Yu, T., and Efros, A. A. 2017. Real-time user-guided image colorization with learned deep priors. ACM Trans. Graph. (Proc. of SIGGRAPH) 36, 4, 119.
  • [Zhou et al. 2016] Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., and Efros, A. A. 2016. Learning dense correspondence via 3d-guided cycle consistency. arXiv preprint arXiv:1604.05383.