
Attention-based Stylisation for Exemplar Image Colourisation

by Marc Gorriz Blanch, et al.

Exemplar-based colourisation aims to add plausible colours to a grayscale image using the guidance of a colour reference image. Most of the existing methods tackle the task as a style transfer problem, using a convolutional neural network (CNN) to obtain deep representations of the content of both inputs. Stylised outputs are then obtained by computing similarities between both feature representations in order to transfer the style of the reference to the content of the target input. However, in order to gain robustness towards dissimilar references, the stylised outputs need to be refined with a second colourisation network, which significantly increases the overall system complexity. This work reformulates the existing methodology by introducing a novel end-to-end colourisation network that unifies the feature matching with the colourisation process. The proposed architecture integrates attention modules at different resolutions that learn how to perform the style transfer task in an unsupervised way towards decoding realistic colour predictions. Moreover, axial attention is proposed to simplify the attention operations and to obtain a fast but robust, cost-effective architecture. Experimental validation demonstrates the efficiency of the proposed methodology, which generates high-quality and visually appealing colourisations. Furthermore, the complexity of the proposed methodology is reduced compared to state-of-the-art methods.





Code Repositories


PyTorch official implementation for XCNET: Attention-based Stylisation for Exemplar Image Colourisation


1 Introduction

Colourisation refers to the process of adding colours to greyscale or other monochrome content such that the colourised results are perceptually meaningful and visually appealing. Digital colourisation has become a classic task in computer vision, gaining significant importance in areas as diverse as the broadcasting and film industries, the restoration of legacy content, and producer assistance.

Although significant progress has been achieved, mapping colours onto a grayscale input is a complex and ambiguous task due to the large number of degrees of freedom involved in arriving at a unique solution. In some cases, the semantics of the scene can help to infer priors on the colour distribution of the image, but in most cases the ambiguity in the decisions leads the system to make random choices, such as the colour of a car or a bird, when no further information is available. Thus, in order to overcome the ambiguity challenge, more conservative solutions propose the involvement of human interaction during the colour assignment process, introducing methodologies such as scribble-based colourisation

[27, 21, 19, 56, 39, 32, 60, 9, 54] or exemplar-based colourisation [4, 55, 16, 31, 41, 8, 51, 45, 21, 37, 34, 15, 52, 57]. Specifically, colourisation by example can be automated by means of a retrieval system that selects content-related references, which can also be used as a recommender within semi-automatic frameworks [16]. However, existing methods are either highly sensitive to the selection of references (needing similar content, position and size of related objects) or extremely complex and time consuming. For instance, most exemplar-based approaches require a style transfer or similar method to compute the semantic correspondences between the target and the reference before starting the colourisation process. This usually increases the system complexity by requiring two-stage pipelines with separate, and even independent, style transfer and colourisation systems.

This work proposes a straightforward end-to-end solution which integrates attention modules that learn how to extract and transfer style features from the reference to the target in an unsupervised way during the colourisation process. Moreover, axial attention [17] is adopted to reduce the overall complexity and achieve a simple and fast architecture that scales easily to high-resolution inputs. As shown in Figure 1, the proposed architecture uses a pre-trained backbone to extract semantic and style features at different scales from the grayscale target and the colour reference. Then, attention modules at different resolutions extract analogies between both feature sources and automatically yield output feature maps that fuse the style of the reference with the content of the target. Finally, a multi-scale pyramid decoder generates colour predictions at multiple resolutions, enabling the representation of higher-level semantics and robustness to variations in the scale and size of local content areas. The main advantage of such an end-to-end solution is that the attention modules learn how to perform style transfer based on the needs of the colourisation decoder in order to encourage high-quality and realistic predictions, even if the reference significantly mismatches the target content. Moreover, it generalises the similarity computation of previous image analogy approaches in a way that constrains the similarity neither to a specific local patch search (attention modules can be interpreted as a set of long-term deformable kernels) nor to specific similarity metrics. Finally, the proposed architecture introduces a novel redesign of the conventional transformer, enabling a modular combination of multi-head attention layers at different resolutions.

Overall, the contributions of this work are threefold:

  • A fast end-to-end architecture for exemplar-based colourisation that improves on existing methods while significantly decreasing complexity and runtime.

  • A multi-scale interpretation of the axial transformer for unsupervised style transfer and feature analogy.

  • A multi-loss training strategy that combines a multi-scale adversarial loss with conventional style transfer and exemplar-based colourisation losses.

2 Related work

Figure 1: Proposed architecture for exemplar-based image colourisation. $T_L$ is the black-and-white frame with luma component only, and $R_{Lab}$ is the colour reference. Multi-scale outputs are used for training, where each output contains the colourised image components at the targeted resolution.

Modern digital colourisation algorithms can be roughly grouped into three main paradigms: automatic learning-based, scribble-based and exemplar-based colourisation. Automatic learning-based methods perform colourisation with end-to-end architectures which learn the direct mapping of every grayscale pixel to the colour space. Such approaches require large image datasets to train the network parameters without user intervention. However, in most cases they produce desaturated results because the colourisation process is treated as a regression problem. As identified in the literature, well-designed loss functions such as adversarial loss

[22, 3], classification loss [26, 47] or perceptual loss [49], or their combination with regularisation [59], are needed to better capture the colour distribution of the input content and enable more colourful results. A different approach is proposed in PixColor [14], solving the automatic colourisation task as an autoregressive problem. Such methods predict the colour distribution of every pixel by conditioning on the grayscale input and the joint colour distribution of previous pixels. Similarly, ColTran [25] addresses the same methodology by using an axial transformer [17]. Autoregressive methods become impractical for colourisation due to the high dimensionality of the colour distribution and the related complexity of decoding high-resolution images. For instance, even for modelling 8-bit RGB inputs only, the model needs to predict $256^3$ values.

Scribble-based colourisation interactively propagates initial strokes or colour points annotated by the user to the whole grayscale image. An optimisation approach [54] is proposed to propagate the user hints by using adaptive clustering in a high-dimensional affinity space. Alternatively, a Markov Random Field for propagating the scribbles [27] is proposed under the rationale that adjacent pixels with similar intensity should have similar colours. Finally, a deep learning approach [60] fuses low-level cues along with high-level semantic information to propagate the user hints.

Exemplar-based colourisation uses a colour reference to condition the prediction process. An early approach proposed the matching of global colour statistics [51], but yielded unsatisfactory results since it ignored local spatial information. More accurate approaches considered the extraction of correspondences at different levels, such as pixels [30], super-pixels [15, 8], segmented regions [21, 45, 4] or deep features [16, 57]. Based on the extraction of deep image analogies from a pre-trained VGG-19 network [44], a deep learning framework uses previously computed similarity maps to perform exemplar-based colourisation [16]. Such a method is subsequently extended to video colourisation using a temporal consistency loss to enforce temporal coherency [16]. An alternative approach proposed the use of style transfer techniques based on AdaIN [18] to generate an initial stylised version which is further refined with a colourisation network [55]. Finally, a novel framework was proposed to fuse the semantic colours and global colour distribution of the reference image towards the prediction of the final colour images [31].

Finally, the architecture presented in this work adopts axial attention to reduce the complexity of the overall system. As introduced in the axial transformer [17], attention is performed along a single axis at a time, reducing the effective dimensionality of the attention maps and hence the complexity of the overall transformer. Such an approach manages to approximate conventional attention by focusing sequentially on each of the dimensions of the input tensor. An application was proposed to perform panoptic segmentation [48], integrating axial attention modules into a modified version of DeepLab [6] and improving the original baseline.

3 Proposed method

Aiming at exemplar-based colourisation, the goal of this method is to enable the colourisation of a grayscale target $T_L$ based on the colours of a reference $R_{Lab}$, where both are images of $H \times W$ pixels represented in the CIE Lab colour space [10]. Note that the target's subscript refers specifically to the luminance channel. To achieve this, an exemplar-based colourisation network is trained to model the mapping to the target ab colour channels $\hat{T}_{ab}$, conditioned on the reference's Lab channels. The CIE Lab colour space is chosen as it is designed to maintain perceptual uniformity and is more perceptually linear than other colour spaces [10]. This work assumes a normalised range of values of $[-1, 1]$ for each of the Lab channels.
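As a concrete illustration, the normalisation into $[-1, 1]$ can be sketched as below. The raw channel ranges used here (L in [0, 100], a and b in [-128, 127]) are common conventions and an assumption, since the paper does not state them:

```python
import numpy as np

def normalise_lab(lab):
    """Map CIE Lab channels into [-1, 1].

    Assumed raw ranges (not specified in the paper):
    L in [0, 100], a and b in [-128, 127].
    """
    lab = lab.astype(np.float64)
    out = np.empty_like(lab)
    out[..., 0] = lab[..., 0] / 50.0 - 1.0                # L: [0, 100] -> [-1, 1]
    out[..., 1:] = (lab[..., 1:] + 128.0) / 127.5 - 1.0   # a, b: [-128, 127] -> [-1, 1]
    return out
```

The inverse mapping for display is the same affine transform reversed, applied after the Tanh-activated prediction heads.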

3.1 Exemplar-based Colourisation Network

As shown in Figure 1, the proposed architecture is composed of four parts: the feature extractor backbone, the axial attention modules, the multi-scale pyramid decoder and the prediction heads.

First, both the target and the reference images are fed into a pre-trained feature extractor backbone to obtain multi-scale activated feature maps $f_T^{(l)}$ and $f_R^{(l)}$, taken at an intermediate position of each convolutional block $l$, together with the last activated feature map (for the target input only), which is the output of the backbone. Note that the features have progressively coarser volumes with increasing levels. Without loss of generality, the experiments in this paper consider a VGG-19 network pre-trained on ImageNet [12], extracting $f_T^{(l)}$ and $f_R^{(l)}$ from the first Rectified Linear Unit (ReLU) activation of every convolutional block (relu{l}_1 in VGG-19), and the backbone output from the encoder (relu5_3 in VGG-19). Note that in order to feed $T_L$ into the pre-trained network, the luminance channel is triplicated to obtain a 3-dimensional input space. Then, all $f_T^{(l)}$, $f_R^{(l)}$ pairs and the backbone output are projected onto a $d$-dimensional space by means of a convolution plus ReLU activation [53], to obtain the projected target, reference and backbone feature maps, respectively.
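The per-level projection can be sketched as a 1x1 convolution expressed as a per-pixel matrix product; the dimensions used here (C=256 backbone channels, d=64) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_1x1(feat, weight, bias):
    """1x1 convolution plus ReLU, expressed as a per-pixel matrix product.

    feat:   (H, W, C_in) activated feature map from the backbone
    weight: (C_in, C_out) projection matrix (the 1x1 kernel)
    bias:   (C_out,)
    """
    out = feat @ weight + bias
    return np.maximum(out, 0.0)  # ReLU

# Project a C=256 backbone feature map onto a d=64-dimensional space.
feat = rng.standard_normal((16, 16, 256))
w = rng.standard_normal((256, 64)) * 0.05
b = np.zeros(64)
proj = project_1x1(feat, w, b)
```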

Next, each pair of projected target and reference feature maps at level $l$ is fed into an axial attention module to compute a multi-head attention mask describing the deep correspondences between both sources. Then, the style of the reference source is transferred into the content of the target source by matrix multiplication of the attention mask with the reference source. Section 3.2 describes the axial attention module in depth and provides more information about the logic behind style transfer via attention. This process yields $d$-dimensional fused feature maps.

After generating the multi-scale fused features, a multi-scale pyramid decoder composed of stacked decoders and prediction heads is employed to produce colour predictions at different scales from the corresponding fused features. Thus, starting from the coarsest level, each decoder performs a five-fold operation: (1) adds the fused feature map to the output of the previous decoder, (2) applies a convolution plus ReLU activation, (3) upsamples the resultant feature map by a factor of 2, (4) concatenates, similarly to the U-Net architecture [43], the resultant upsampled map with the projected target feature map as a skip connection, and (5) refines the resultant map with another convolution plus ReLU activation which projects the concatenated $2d$-dimensional volume back into the initial $d$ dimensions.
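A minimal numpy sketch of one decoder stage is given below. 1x1 convolutions stand in for the paper's (unspecified) kernel sizes, and all shapes are illustrative assumptions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 on (H, W, C)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1_relu(x, w, b):
    return np.maximum(x @ w + b, 0.0)

def decoder_step(fused, prev, skip, w1, b1, w2, b2):
    """One pyramid-decoder stage (illustrative sketch).

    fused: (H, W, d)   attention-fused features at this level
    prev:  (H, W, d)   output of the previous decoder stage
    skip:  (2H, 2W, d) projected target features used as skip connection
    """
    x = fused + prev                        # (1) add with previous decoder output
    x = conv1x1_relu(x, w1, b1)             # (2) convolution + ReLU
    x = upsample2x(x)                       # (3) upsample by a factor of 2
    x = np.concatenate([x, skip], axis=-1)  # (4) U-Net-style skip concatenation
    return conv1x1_relu(x, w2, b2)          # (5) project 2d -> d dimensions

rng = np.random.default_rng(0)
d = 8
fused = rng.standard_normal((4, 4, d))
prev = rng.standard_normal((4, 4, d))
skip = rng.standard_normal((8, 8, d))
w1, b1 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)
w2, b2 = rng.standard_normal((2 * d, d)) * 0.1, np.zeros(d)
out = decoder_step(fused, prev, skip, w1, b1, w2, b2)
```

Stacking such stages doubles the spatial resolution at every level, which is what allows a prediction head to be attached to each scale.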

Finally, the prediction heads map the decoded feature volumes into the output ab channels. Each prediction head is composed of a convolution plus ReLU activation followed by a convolution plus hyperbolic tangent (Tanh) activation to generate the two colour channels.

3.2 Axial attention for unsupervised style transfer

Figure 2: Axial attention module described in Section 3.2. In the figure, BN denotes Batch Normalisation and A denotes activation.

Given two projected sources of features, relative to the target and reference respectively, the goal of the axial attention module is to combine them in such a way that the style codified in the reference features is transferred into similar content areas within the target features.

Style transfer between two sources of features has been solved in many different ways, although in most cases only artistic style is targeted, without contemplating the semantic analogies between both sources. Some strategies include the use of perceptual losses for training feed-forward image transformation networks [23], in order to encourage the transformed images to produce features similar to those of the style reference when both are fed into a pre-trained loss network (e.g. VGG-16). A faster strategy is to use Adaptive Instance Normalisation (AdaIN) [18] to align the mean and variance of the content features with those of the style features. Finally, another paradigm tackles deep image analogies for multi-scale visual attribute transfer [28], but the analogy computation and transfer process are performed via a PatchMatch algorithm [2], which is computationally expensive.

This work proposes the use of attention to perform such processes faster and in an unsupervised way. In contrast with image analogy methods based on PatchMatch algorithms, attention does not need to be constrained to a specific local search technique (even if it can act as a set of long-term deformable kernels) nor to a specific similarity metric (e.g. correlation loss, cosine similarity), since the module learns these automatically. Attention was introduced to tackle the problem of long-range interactions in sequence modelling [46, 29, 35, 7]. However, attention modules have recently been employed to improve computer vision tasks such as object detection [5] or image classification [13] by providing contextual information from other sources of information. Following the same rationale, an attention mechanism can solve the semantic analogy problem in style transfer by focusing on the most relevant areas of the style source when decoding each voxel in the content source.

Following the original definition of stand-alone attention [46, 50, 58], given projected target and reference feature maps $g_T$ and $g_R$, the fused feature map $y$ at position $o = (i, j)$ is computed as follows:

$$y_o = \sum_{p \in \mathcal{N}} \mathrm{softmax}_p\left(q_o^\top k_p\right) v_p, \qquad (1)$$

where $\mathcal{N}$ is the whole 2D location lattice. Furthermore, the queries $q = W_Q\, g_T$ from the target source and the keys $k = W_K\, g_R$ and values $v = W_V\, g_R$ from the reference source are all linear projections of the target and reference projected sources $g_T$ and $g_R$, respectively, where $W_Q$, $W_K$ and $W_V$ are learnable parameters. The $\mathrm{softmax}_p$ denotes a softmax operation applied over all possible positions $p$ within the 2D lattice $\mathcal{N}$.
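The fusion in Equation 1 can be sketched as below, with a single head and illustrative dimensions (the real model uses multi-head axial layers with positional encodings):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_attention(target, reference, Wq, Wk, Wv):
    """Single-head attention: every target position queries every
    reference position and retrieves a mixture of reference values.

    target, reference: (H, W, d) projected feature maps
    Wq, Wk, Wv:        (d, d) learnable linear projections
    """
    H, W, d = target.shape
    q = target.reshape(-1, d) @ Wq      # queries from the target
    k = reference.reshape(-1, d) @ Wk   # keys from the reference
    v = reference.reshape(-1, d) @ Wv   # values from the reference
    attn = softmax(q @ k.T, axis=-1)    # (H*W, H*W) attention mask
    return (attn @ v).reshape(H, W, d)  # style transferred onto the target

rng = np.random.default_rng(0)
d = 16
t = rng.standard_normal((8, 8, d))
r = rng.standard_normal((8, 8, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = fuse_attention(t, r, Wq, Wk, Wv)
```

The (H*W, H*W) mask is what makes full attention quadratic in the number of pixels, which motivates the axial simplification below.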

Next, a position-sensitive learned positional encoding [11, 40, 48] is adopted to encourage the attention modules to model a dynamic prior of where to look in the receptive field of the reference source (an $m \times m$ region within $\mathcal{N}$). Positional encoding has proven beneficial in computer vision tasks to exploit spatial information and capture shapes and structures within the sources of input features. Therefore, as in [48], key-, query- and value-dependent positional encodings are applied to Equation 1 as follows:

$$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \mathrm{softmax}_p\left(q_o^\top k_p + q_o^\top r_p^{q} + k_p^\top r_p^{k}\right)\left(v_p + r_p^{v}\right), \qquad (2)$$

where $\mathcal{N}_{m \times m}(o)$ is the local region centred around location $o$, and $r^{q}$, $r^{k}$ and $r^{v}$ are the learned relative positional encodings for queries, keys and values, respectively. The inner products $q_o^\top r_p^{q}$ and $k_p^\top r_p^{k}$ measure the compatibilities from location $p$ to $o$ within the queries and keys space, and $r_p^{v}$ guides the output to retrieve content within the values space.

Finally, axial attention [17] is adopted to reduce the complexity of the original formulation by computing the attention operations along a single 1-dimensional axial lattice at a time, instead of across the whole 2D space. Following the formulation of stand-alone axial-DeepLab [48], the global attention operation is simplified by defining an axial-attention layer that propagates the information along the width-axis, followed by another one along the height-axis. In this work, the span is set equal to the input image resolution, but such values can be reduced for high-resolution inputs. Finally, multi-head attention is performed by applying single axial attention heads with head-dependent projections $W_Q$, $W_K$ and $W_V$, then concatenating the results of each head and projecting the final output maps by means of an output convolution.
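A minimal sketch of the axial factorisation is shown below: a width-axis pass followed by a height-axis pass, each position attending only within its own row or column. Positional encodings and multi-head projections are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v, axis):
    """Attention restricted to one axis of the 2D lattice.

    q, k, v: (H, W, d). With axis=1 every position attends only within
    its own row (width-axis); with axis=0, within its own column.
    """
    if axis == 0:  # height-axis: transpose so the row logic can be reused
        q, k, v = (x.transpose(1, 0, 2) for x in (q, k, v))
    scores = np.einsum('hid,hjd->hij', q, k)          # per-row (W x W) maps
    out = np.einsum('hij,hjd->hid', softmax(scores), v)
    return out.transpose(1, 0, 2) if axis == 0 else out

rng = np.random.default_rng(0)
H = W = 8; d = 4
q, k, v = (rng.standard_normal((H, W, d)) for _ in range(3))
out = axial_attention(q, k, v, axis=1)   # width-axis pass
out = axial_attention(out, k, v, axis=0) # then height-axis pass
```

Instead of one (HW x HW) attention map, each pass only builds H maps of size (W x W) (or vice versa), which is the source of the complexity reduction.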

As shown in Figure 2, a succession of multi-head width-height axial attention layers is integrated to build the axial attention module for unsupervised style transfer. Given the target and reference projected inputs, the module performs a three-fold operation: (1) normalises the target and reference projected sources by means of batch normalisation plus ReLU activation, (2) fuses the normalised sources by means of the multi-head width-height axial attention layers, and (3) adds the resulting features to the target source identity and activates the output with a ReLU activation.

Reference Welsh et al. [51] Xiao et al. [52] Zhang et al. [57] Ours
Figure 3: Qualitative comparison of the existing and the proposed exemplar-based colourisation methods.

3.3 Training losses

Usually, the objective of colourisation is to encourage the predicted colour channels to be as close as possible to the ground truth in the original training dataset. However, this objective does not apply in exemplar-based colourisation, where the prediction should be customised by the colour reference while preserving the content of the grayscale target. Therefore, the definition of the training strategy is not straightforward, as directly penalising the distance to the ground-truth colours is not accurate. Instead, the objective is to enable the reliable transfer of reference colours to the target content towards obtaining a colour prediction faithful to the reference. This work takes advantage of the pyramidal decoder to combine state-of-the-art exemplar-based losses with adversarial training at multiple resolutions. Hence, a multi-loss training strategy is proposed to combine a smooth-$\ell_1$ loss, a colour histogram loss and a total variance regularisation, as in [34], with a multi-scale adversarial loss by means of multiple patch-based discriminators [22]. In order to handle multi-scale losses, average pooling with a factor of 2 is applied to both target and reference to successively generate the multi-scale ground truths.
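The multi-scale ground-truth generation can be sketched as repeated 2x2 average pooling:

```python
import numpy as np

def avg_pool2x(img):
    """2x2 average pooling on an (H, W, C) image; H and W must be even."""
    H, W, C = img.shape
    return img.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def multiscale_ground_truth(img, n_scales):
    """Successively downsampled copies, full resolution first."""
    pyramid = [img]
    for _ in range(n_scales - 1):
        pyramid.append(avg_pool2x(pyramid[-1]))
    return pyramid

img = np.arange(64, dtype=np.float64).reshape(8, 8, 1)
pyr = multiscale_ground_truth(img, 3)
```

Average pooling (rather than subsampling) keeps the per-scale colour statistics consistent, so each scale's losses see the same global distribution.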

Smooth-$\ell_1$ loss. In order to induce dataset priors in cases where the content of the reference highly mismatches the target, a pixel loss based on the Huber loss [20] (also known as smooth-$\ell_1$) is proposed to encourage realistic predictions. The Huber loss is widely used in colourisation as a substitute for the standard $\ell_2$ loss in order to avoid averaged solutions to the ambiguous colourisation problem [59]. As noted in the Fast R-CNN paper [42], smooth-$\ell_1$ is less sensitive to outliers than the $\ell_2$ loss and in some cases prevents exploding gradients.
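A numpy sketch of the smooth-$\ell_1$ (Huber) loss follows; the transition point beta=1 is the usual default and an assumption here:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Huber / smooth-L1 loss: quadratic for small residuals,
    linear for large ones, which damps outlier gradients."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta,
                    0.5 * diff ** 2 / beta,
                    diff - 0.5 * beta)
    return loss.mean()
```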

Colour histogram loss. As proposed in [31], in order to fully capture the global colour distribution of the reference image and penalise the differences with the predicted colour distribution, a colour histogram loss is considered. The computation of a colour histogram is not a differentiable process that can be integrated within the training loop. To avoid this problem, the aforementioned exemplar-based colourisation approach [31] approximates the colour histograms $h_R$ and $h_{\hat{T}}$, corresponding to the reference and predicted images respectively, by means of a function similar to a bilinear interpolation. Then, the histogram loss is defined as a symmetric distance [38] as follows:

$$\mathcal{L}_{hist} = \frac{2}{Q} \sum_{q=1}^{Q} \frac{\left(h_R(q) - h_{\hat{T}}(q)\right)^2}{h_R(q) + h_{\hat{T}}(q) + \epsilon},$$

where $\epsilon$ prevents infinity overflows and $Q$ is the number of histogram bins. In this work, fixed values of $\epsilon$ and $Q$ are used.
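A sketch of the histogram loss is given below. np.histogram2d (non-differentiable) stands in for the paper's differentiable bilinear approximation, and the symmetric chi-square-style distance is one plausible reading of the symmetric distance in [38]; both are assumptions:

```python
import numpy as np

def colour_histogram(ab, bins=64):
    """Normalised joint histogram of the ab channels in [-1, 1].
    Sketch only: np.histogram2d replaces the paper's differentiable
    bilinear-interpolation approximation of the histogram."""
    h, _, _ = np.histogram2d(ab[..., 0].ravel(), ab[..., 1].ravel(),
                             bins=bins, range=[[-1, 1], [-1, 1]])
    return h / h.sum()

def histogram_loss(h_ref, h_pred, eps=1e-5):
    """Symmetric chi-square-style distance between two colour
    histograms; eps guards against division by zero. The exact
    symmetric distance used in the paper may differ."""
    return 2.0 * np.sum((h_ref - h_pred) ** 2 / (h_ref + h_pred + eps))
```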

Total variance regularisation. As widely used in the style transfer literature [23], a total variance loss is proposed in order to encourage low variance among neighbouring pixels of the predicted colour channels $\hat{T}_{ab}$.
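A minimal sketch of such a regulariser, using the anisotropic total variation on the predicted ab channels (the exact variant and normalisation are assumptions):

```python
import numpy as np

def total_variance_loss(ab):
    """Anisotropic total variation on predicted (H, W, 2) colour
    channels: penalises differences between neighbouring pixels,
    averaged over all elements."""
    dh = np.abs(ab[1:, :, :] - ab[:-1, :, :]).sum()  # vertical neighbours
    dw = np.abs(ab[:, 1:, :] - ab[:, :-1, :]).sum()  # horizontal neighbours
    return (dh + dw) / ab.size
```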

Adversarial loss. Although the histogram loss encourages the predictions to contain the reference colours, it does not consider spatial information, nor does it discriminate how realistically different object instances are colourised. With the aim of guiding the previous losses towards realistic decisions, an adversarial strategy based on LS-GAN [33] is proposed to derive the scale-based generator loss $\mathcal{L}_{adv}^{s}$ and discriminator loss $\mathcal{L}_{D}^{s}$, using the ground-truth colour targets as real sources and a patch-based discriminator $D$ (same as used in [22]). Note that within the GAN framework, the proposed exemplar-based colourisation network acts as the generator. Then, the total discriminator loss is computed by adding the individual multi-scale losses:

$$\mathcal{L}_{D} = \sum_{s} \mathcal{L}_{D}^{s}.$$

Finally, the total multi-scale loss is computed as:

$$\mathcal{L} = \sum_{s} \left( \lambda_{pix} \mathcal{L}_{pix}^{s} + \lambda_{hist} \mathcal{L}_{hist}^{s} + \lambda_{TV} \mathcal{L}_{TV}^{s} + \lambda_{adv} \mathcal{L}_{adv}^{s} \right),$$

where $\lambda_{pix}$, $\lambda_{hist}$, $\lambda_{TV}$ and $\lambda_{adv}$ are the multi-loss weights which specify the contribution of each individual loss.

4 Experiments

4.1 Training settings

A training dataset based on ImageNet [12] is generated by sampling images from the most popular categories, which include: animals, plants, people, scenery, food, transportation and artifacts. Pairs of target-reference images are randomly generated based on the correspondence recommendation pipeline proposed in [16]. First, a top-5 global ranking is created by minimising the distance between the features of the target and those of the rest of the images of the same class, extracted at the first fully connected layer of a VGG-19 pre-trained on ImageNet and projected into a lower-dimensional space via a PCA transformation [1]. Next, following the process in [16], the global ranking is refined by a local search selecting the most similar image by means of a patch-based similarity: the top-1 reference is selected by minimising the cosine distance between patches corresponding to the most similar position-wise feature vectors of the same pre-trained VGG-19, from both the target and the reference candidate. Finally, pairs of target-reference images are randomly sampled on-the-fly during training by using a weighted distribution over three categories: the top-1 reference, a random choice among the top-5 candidates, and a random choice among the rest of the images of the same class. Testing data is generated in a similar way, sampling pairs of target-reference images from the same categories (with different targets than in training). All images are resized to a fixed resolution, converted to the CIE Lab colour space and normalised into the range $[-1, 1]$ for each channel.

All the experiments use multi-head attention layers with a fixed number of heads, hidden dimension and prediction head dimension. As shown in Figure 1, a backbone with several convolutional blocks is used, starting the decoding process from a low resolution and decoding multi-scale predictions. Although several ablations are performed, the best trade-off between complexity and performance is achieved by applying the attention modules from an intermediate block onwards. All models are trained using an Adam optimiser [24] with a fixed learning rate and fixed multi-loss weights for all the experiments. Finally, all models are implemented in PyTorch 1.7.0 [36] and trained with a single GPU.

4.2 Comparison with colourisation methods

In order to compare our approach with existing exemplar-based colourisation methods [57, 52, 51], a test dataset is collected by randomly sampling target-reference pairs from the validation set defined in Section 4.1. To provide a fair comparison, all results are obtained by running the original publicly available codes and models provided by the authors.

A qualitative comparison for a selection of representative cases is shown in Figure 3. From this comparison, our method, along with Zhang et al. [57], produces the most visually appealing results, effectively transferring the colours from the reference. Both methods show that the image analogy methodology better captures local information from semantically related objects and leads to more precise colour predictions. On the contrary, the methods from Welsh et al. [51] and Xiao et al. [52], based on global histogram estimation, fail to detect precise patterns and only map overall tones from the reference. The proposed multi-loss strategy, incorporating histogram and adversarial losses at different resolutions, enables more colourful and saturated results. However, unlike the conservative colourisation of [57], the instability of the adversarial training can lead to some colour noise, as can be seen in the 4th row of Figure 3. A better control of the adversarial loss could boost our method's performance, reaching the stability of [57] while producing more colourful and visually appealing predictions.

Figure 4: Runtime comparison in seconds.
Method                  HIS    SSIM   Top-1 acc.  Top-5 acc.
Welsh et al. [51]       0.55   0.78   50.3%       74.1%
Xiao et al. [52]        0.59   0.84   54.8%       79.2%
Zhang et al. [57]       0.66   0.88   65.6%       84.8%
Ours (axial att.)       0.72   0.87   68.1%       89.1%
Ours (standard att.)    0.74   0.88   69.7%       90.1%
Ours (single module)    0.68   0.88   67.6%       88.9%
Ours (w/o adv. loss)    0.70   0.88   67.5%       89.2%
Ours (w/o pix. loss)    0.68   0.86   67.5%       86.7%
Ours (w/o hist. loss)   0.54   0.88   65.4%       89.2%
Table 1: Quantitative comparison of the state-of-the-art methods with the proposed method in different settings. Note that standard attention is only used in the ablation study; the rest of our configurations use axial attention.
Target Reference Ours w/o adv. loss w/o hist. loss w/o pix. loss
Figure 5: Visual comparison of each individual training loss contribution.
Target Reference Standard Axial multiple Axial single
Figure 6: Visual comparison of the attention module configurations, using standard attention, or axial attention applied once or twice.

Moreover, a quantitative comparison is shown in Table 1, using three different metrics: Histogram Intersection Similarity (HIS) [22] relative to the reference image, Structural Similarity Index Measure (SSIM) relative to the target ground-truth image, and classification accuracy. First, the HIS score measures the averaged colour histogram intersection between the reference and predicted images. As shown in the results, our method, along with [57], both based on semantic-related analogies, achieves higher HIS scores, suggesting a better transfer of the reference colours. On the contrary, the methods in [52] and [51], based on global histogram estimation, obtain slightly lower HIS scores due to averaged colourisation in ambiguous cases where the target and reference objects are not recognised. The SSIM score is used to estimate the structural similarity preserved by each method. As can be observed, the methods achieving a more precise colourisation obtain higher SSIM scores. The method in [57] achieves the same score as ours, suggesting that more stable predictions help to better retain the structural information of the target image. Finally, our method outperforms all other methods on image recognition accuracy when the colour predictions are fed into a VGG-16 pre-trained on ImageNet. The obtained results indicate that the proposed method overall outperforms previous methods, which is also reflected by the classification performance.
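For reference, the HIS metric can be sketched as the intersection of normalised colour histograms; the bin count and binning range used here are assumptions, as the exact settings behind Table 1 are not specified:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram Intersection Similarity between two normalised
    histograms: 1.0 means identical colour distributions."""
    return np.minimum(h1, h2).sum()

def his_score(ref_ab, pred_ab, bins=32):
    """HIS between reference and predicted ab channels (sketch)."""
    def hist(ab):
        h, _, _ = np.histogram2d(ab[..., 0].ravel(), ab[..., 1].ravel(),
                                 bins=bins, range=[[-1, 1], [-1, 1]])
        return h / h.sum()
    return histogram_intersection(hist(ref_ab), hist(pred_ab))
```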

Method Naturalness (%)
Real images
Ours 61.30%
Zhang et al. [57]
Xiao et al. [52]
Welsh et al. [51]
Table 2: Perceptual test results. The values show the percentage of images selected as genuine (natural) for each of the methods.

In addition to the qualitative and quantitative comparisons, a perceptual test is performed to validate the overall results and to detect possible failure cases. Target-reference pairs are randomly sampled from the test dataset and colourised using our method and the three state-of-the-art methods [57, 52, 51]. The generated image pool thus includes both original images and images whose colours are predicted. Each individual test session randomly selects a set of images and shows them one by one to viewers, including participants with technical and non-technical backgrounds. Each participant then has to annotate whether the colours in each image appear to be genuine (natural) or not. As shown in Table 2, which reports the percentage of annotations that evaluated the colours as genuine with respect to the total number of annotations for each specific method, our approach (61.30%) slightly outperforms the method in [57]. As discussed in the visual comparison, the potential production of colour noise might have lowered the performance of our method. In contrast, the stability of [57] enabled a considerably high rate despite its conservative colourisation. Finally, the methods in [52] and [51] achieve the lowest results.

Finally, the runtime is also compared to highlight the efficiency of the proposed end-to-end architecture. All the results are obtained using the implementations provided by the authors. Runtime values are obtained on a machine with a 3.60GHz Intel Xeon Gold 5122 CPU and a single NVIDIA GeForce RTX 2080 Ti GPU. As shown in Figure 4, among neural-network-based methods, the pyramid structure in Xiao et al. [52] is the most time-consuming. The method from Zhang et al. [57] slightly reduces the runtime, but the PatchMatch search used in Deep Image Analogy [28] remains costly. In contrast, our end-to-end approach significantly reduces complexity, achieving runtimes of 20 ms per image.

4.3 Ablation study

Several experiments are performed to evaluate the effects of the different architectural choices and training hyperparameters. The ablation study includes an analysis of the attention module, comparing the performance obtained with the standard attention operation against the axial attention simplifications proposed in Section 3.2. Moreover, the number of attention operations at each scale is also evaluated. Finally, the contribution of each training loss is validated by removing it from the total multi-loss function and studying the effect on the final predictions.

As discussed in Section 3.2, axial attention is adopted to reduce the complexity of the original attention formulation by computing the attention operations along a single axis at a time, instead of across the whole spatial extent. Although axial attention is applied to both the horizontal and vertical axes to approximate the standard behaviour, a significant loss is identified in Table 1. A visual comparison is shown in Figure 6, where standard attention yields more precise results, being able to capture longer-range relationships. In order to refine the axial approximation and derive more complex relationships, the attention module is applied 2 consecutive times. As shown, such a configuration outperforms the single-pass one in both the quantitative and qualitative evaluations. Finally, the individual contribution of each training loss is evaluated by removing it from the multi-loss configuration. As shown in Table 1 and Figure 5, a major drop in the HIS score is identified in the absence of the histogram loss, indicating its importance in guiding the learning process towards an effective transfer of the reference colours. The absence of the adversarial loss also lowers the performance, dropping the HIS score by 0.2 and the Top-1 accuracy by 0.6%. However, the effect is more evident in the visual comparison, where a clear loss of both colourfulness and naturalness can be observed.
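To illustrate the factorisation discussed above, the following is a minimal NumPy sketch (not the paper's PyTorch implementation; all function names are hypothetical) contrasting standard 2D attention, which scores all H×W position pairs, with axial attention, which attends along one axis at a time so that each position only interacts with its row and then its column:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def full_attention_2d(x):
    # Standard attention over all H*W positions:
    # the score matrix has (H*W)^2 entries.
    h, w, d = x.shape
    flat = x.reshape(h * w, d)
    return attention(flat, flat, flat).reshape(h, w, d)

def axial_attention_2d(x):
    # Axial attention: attend along each row (width axis),
    # then along each column (height axis). Each pass mixes
    # only W (or H) positions, so the total score cost is
    # on the order of H*W*(H+W) instead of (H*W)^2.
    rows = attention(x, x, x)                # width axis, per row
    t = rows.swapaxes(0, 1)                  # view columns as rows
    cols = attention(t, t, t).swapaxes(0, 1) # height axis, per column
    return cols

feats = np.random.rand(16, 16, 8)   # toy 16x16 feature map, 8 channels
out_full = full_attention_2d(feats)
out_axial = axial_attention_2d(feats)
print(out_full.shape, out_axial.shape)  # (16, 16, 8) (16, 16, 8)
```

Applying the axial pair twice, as in the 2-pass configuration evaluated above, simply composes `axial_attention_2d` with itself, letting information propagate along indirect row-column paths at a cost still far below full 2D attention.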

5 Conclusions

This paper introduces a novel architecture for exemplar-based colourisation. The proposed model integrates attention modules at different resolutions that learn how to perform style transfer in an unsupervised way towards decoding realistic colour predictions. This methodology significantly simplifies previous exemplar-based approaches, unifying the feature matching with the colourisation process and therefore achieving fast end-to-end colourisation. Moreover, in order to further reduce the model complexity, axial attention is proposed to simplify the standard attention operations and hence reduce the computational cost. The proposed method outperforms state-of-the-art methods in both visual quality and complexity, and significantly reduces the runtime.


  • [1] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky (2014) Neural codes for image retrieval. In European conference on computer vision, pp. 584–599. Cited by: §4.1.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28 (3), pp. 24. Cited by: §3.2.
  • [3] M. G. Blanch, M. Mrak, A. F. Smeaton, and N. E. O’Connor (2019) End-to-end conditional gan-based architectures for image colourisation. In 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. Cited by: §2.
  • [4] A. Bugeau, V. Ta, and N. Papadakis (2013) Variational exemplar-based image colorization. IEEE Transactions on Image Processing 23 (1), pp. 298–307. Cited by: §1, §2.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §3.2.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.
  • [7] J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. Cited by: §3.2.
  • [8] A. Y. Chia, S. Zhuo, R. K. Gupta, Y. Tai, S. Cho, P. Tan, and S. Lin (2011) Semantic colorization with internet images. ACM Transactions on Graphics (TOG) 30 (6), pp. 1–8. Cited by: §1, §2.
  • [9] Y. Ci, X. Ma, Z. Wang, H. Li, and Z. Luo (2018) User-guided deep anime line art colorization with conditional adversarial networks. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1536–1544. Cited by: §1.
  • [10] C. Connolly and T. Fleiss (1997) A study of efficiency and accuracy in the transformation from rgb to cielab color space. IEEE transactions on image processing 6 (7), pp. 1046–1048. Cited by: §3.
  • [11] J. Cordonnier, A. Loukas, and M. Jaggi (2019) On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584. Cited by: §3.2.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.1, §4.1.
  • [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §3.2.
  • [14] S. Guadarrama, R. Dahl, D. Bieber, M. Norouzi, J. Shlens, and K. Murphy (2017) Pixcolor: pixel recursive colorization. arXiv preprint arXiv:1705.07208. Cited by: §2.
  • [15] R. K. Gupta, A. Y. Chia, D. Rajan, E. S. Ng, and H. Zhiyong (2012) Image colorization using similar images. In Proceedings of the 20th ACM international conference on Multimedia, pp. 369–378. Cited by: §1, §2.
  • [16] M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan (2018) Deep exemplar-based colorization. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–16. Cited by: §1, §2, §4.1.
  • [17] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans (2019) Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180. Cited by: §1, §2, §2, §3.2.
  • [18] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2, §3.2.
  • [19] Y. Huang, Y. Tung, J. Chen, S. Wang, and J. Wu (2005) An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 351–354. Cited by: §1.
  • [20] P. J. Huber (1992) Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §3.3.
  • [21] R. Ironi, D. Cohen-Or, and D. Lischinski (2005) Colorization by example.. In Rendering techniques, pp. 201–210. Cited by: §1, §2.
  • [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2, §3.3, §3.3, §4.2.
  • [23] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §3.2, §3.3.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [25] M. Kumar, D. Weissenborn, and N. Kalchbrenner (2021) Colorization transformer. arXiv preprint arXiv:2102.04432. Cited by: §2.
  • [26] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European conference on computer vision, pp. 577–593. Cited by: §2.
  • [27] A. Levin, D. Lischinski, and Y. Weiss (2004) Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694. Cited by: §1, §2.
  • [28] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang (2017) Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088. Cited by: §3.2, §4.2.
  • [29] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §3.2.
  • [30] X. Liu, L. Wan, Y. Qu, T. Wong, S. Lin, C. Leung, and P. Heng (2008) Intrinsic colorization. In ACM SIGGRAPH Asia 2008 papers, pp. 1–9. Cited by: §2.
  • [31] P. Lu, J. Yu, X. Peng, Z. Zhao, and X. Wang (2020) Gray2ColorNet: transfer more colors from reference image. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 3210–3218. Cited by: §1, §2, §3.3.
  • [32] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y. Xu, and H. Shum (2007) Natural image colorization. In Proceedings of the 18th Eurographics conference on Rendering Techniques, pp. 309–320. Cited by: §1.
  • [33] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794–2802. Cited by: §3.3.
  • [34] Y. Morimoto, Y. Taguchi, and T. Naemura (2009) Automatic colorization of grayscale images using multiple images on the web. In SIGGRAPH 2009: Talks, pp. 1–1. Cited by: §1, §3.3.
  • [35] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933. Cited by: §3.2.
  • [36] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §4.1.
  • [37] F. Pitié, A. C. Kokaram, and R. Dahyot (2007) Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding 107 (1-2), pp. 123–137. Cited by: §1.
  • [38] J. Puzicha, T. Hofmann, and J. M. Buhmann (1997) Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 267–272. Cited by: §3.3.
  • [39] Y. Qu, T. Wong, and P. Heng (2006) Manga colorization. ACM Transactions on Graphics (TOG) 25 (3), pp. 1214–1220. Cited by: §1.
  • [40] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909. Cited by: §3.2.
  • [41] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer graphics and applications 21 (5), pp. 34–41. Cited by: §1.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §3.3.
  • [43] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.1.
  • [44] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • [45] Y. Tai, J. Jia, and C. Tang (2005) Local color transfer via probabilistic segmentation by expectation-maximization. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, pp. 747–754. Cited by: §1, §2.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.2, §3.2.
  • [47] P. Vitoria, L. Raad, and C. Ballester (2020) ChromaGAN: adversarial picture colorization with semantic class distribution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2445–2454. Cited by: §2.
  • [48] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L. Chen (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In European Conference on Computer Vision, pp. 108–126. Cited by: §2, §3.2, §3.2.
  • [49] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §2.
  • [50] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §3.2.
  • [51] T. Welsh, M. Ashikhmin, and K. Mueller (2002) Transferring color to greyscale images. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 277–280. Cited by: §1, §2, Figure 3, §4.2, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [52] C. Xiao, C. Han, Z. Zhang, J. Qin, T. Wong, G. Han, and S. He (2020) Example-based colourization via dense encoding pyramids. In Computer Graphics Forum, Vol. 39, pp. 20–33. Cited by: §1, Figure 3, §4.2, §4.2, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [53] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.1.
  • [54] K. Xu, Y. Li, T. Ju, S. Hu, and T. Liu (2009) Efficient affinity-based edit propagation using kd tree. ACM Transactions on Graphics (TOG) 28 (5), pp. 1–6. Cited by: §1, §2.
  • [55] Z. Xu, T. Wang, F. Fang, Y. Sheng, and G. Zhang (2020) Stylization-based architecture for fast deep exemplar colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9363–9372. Cited by: §1, §2.
  • [56] L. Yatziv and G. Sapiro (2006) Fast image and video colorization using chrominance blending. IEEE transactions on image processing 15 (5), pp. 1120–1129. Cited by: §1.
  • [57] B. Zhang, M. He, J. Liao, P. V. Sander, L. Yuan, A. Bermak, and D. Chen (2019) Deep exemplar-based video colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8052–8061. Cited by: §1, §2, Figure 3, §4.2, §4.2, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [58] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354–7363. Cited by: §3.2.
  • [59] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §2, §3.3.
  • [60] R. Zhang, J. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros (2017) Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999. Cited by: §1, §2.