Colorization Transformer

by Manoj Kumar, et al.

We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition on the grayscale input. Two subsequent fully parallel networks upsample the coarse colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based both on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. The code and pre-trained checkpoints for the Colorization Transformer are publicly available.





1 Introduction


Figure 1: Samples of our model showing diverse, high-fidelity colorizations.

Image colorization is a challenging, inherently stochastic task that requires a semantic understanding of the scene as well as knowledge of the world. Core immediate applications of the technique include producing organic new colorizations of existing image and video content as well as giving life to originally grayscale media, such as old archival images (tsaftaris2014novel), videos (geshwind1986method) and black-and-white cartoons (sykora2004unsupervised; qu2006manga; cinarel2017into). Colorization also has important technical uses as a way to learn meaningful representations without explicit supervision (zhang2016colorful; larsson2016learning; vondrick2018tracking) or as an unsupervised data augmentation technique, whereby diverse semantics-preserving colorizations of labelled images are produced with a colorization model trained on a potentially much larger set of unlabelled images.

The current state-of-the-art in automated colorization comprises neural generative approaches based on log-likelihood estimation (guadarrama2017pixcolor; royer2017probabilistic; ardizzone2019guided). Probabilistic models are a natural fit for the one-to-many task of image colorization and obtain better results than earlier deterministic approaches, avoiding some of their persistent pitfalls (zhang2016colorful). Probabilistic models also have the central advantage of producing multiple diverse colorings sampled from the learnt distribution.

In this paper, we introduce the Colorization Transformer (ColTran), a probabilistic colorization model composed only of axial self-attention blocks (ho2019axial; wang2020axial). The main advantage of axial self-attention blocks is the ability to capture a global receptive field with only two layers at a cost of O(N√N) instead of O(N²) in the number of pixels N, and they can be implemented efficiently using matrix multiplications on modern accelerators such as TPUs (jouppi2017indatacenter). In order to enable colorization of high-resolution grayscale images, we decompose the task into three simpler sequential subtasks: coarse low-resolution autoregressive colorization, parallel color super-resolution and parallel spatial super-resolution. For coarse low-resolution colorization, we apply a conditional variant of the Axial Transformer (ho2019axial), a state-of-the-art autoregressive image generation model that does not require custom kernels (child2019generating). While Axial Transformers support conditioning by biasing the input, we find that directly conditioning the transformer layers can improve results significantly. Moreover, by leveraging the semi-parallel sampling mechanism of Axial Transformers we are able to colorize images at higher resolution and faster than previous work (guadarrama2017pixcolor), which in turn improves colorization fidelity. Finally, we employ fast, parallel, deterministic upsampling models to super-resolve the coarsely colorized image into the final high-resolution output. In summary, our main contributions are:

  • First application of transformers for high-resolution (256×256) image colorization.

  • We introduce conditional transformer layers for low-resolution coarse colorization in Section 4.1. The conditional layers incorporate conditioning information via multiple learnable components that are applied per-pixel and per-channel. We validate the contribution of each component with extensive experimentation and ablation studies.

  • We propose training an auxiliary parallel prediction model jointly with the low resolution coarse colorization model in Section 4.2. Improved FID scores demonstrate the usefulness of this auxiliary model.

  • We establish a new state-of-the-art on image colorization outperforming prior methods by a large margin on FID scores and a 2-Alternative Forced Choice (2AFC) Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth.

2 Related work

Colorization methods initially relied on human-in-the-loop approaches that provide hints in the form of scribbles (levin2004colorization; ironi2005colorization; huang2005adaptive; yatziv2006fast; qu2006manga; luan2007natural; tsaftaris2014novel; zhang2017real; ci2018user) and exemplar-based techniques that involve identifying a reference source image to copy colors from (reinhard2001color; welsh2002transferring; tai2005local; ironi2005colorization; pitie2007automated; morimoto2009automatic; gupta2012image; xiao2020example). Exemplar-based techniques have recently been extended to video as well (zhang2019deep). In the past few years, the focus has moved to more automated, neural colorization methods. Deterministic colorization techniques such as CIC (zhang2016colorful), LRAC (larsson2016learning), LTBC (iizuka2016let), Pix2Pix (isola2017image) and DC (cheng2015deep; dahl2016automatic) involve variations of CNNs that model per-pixel color information conditioned on the intensity.

Generative colorization models typically extend unconditional image generation models to incorporate conditioning information from a grayscale image. Specifically, cINN (ardizzone2019guided) uses conditional normalizing flows (dinh2014nice), VAE-MDN (deshpande2017learning; deshpande2015learning) and SCC-DC (Messaoud_2018_ECCV) use conditional VAEs (kingma2013auto), and cGAN (cao2017unsupervised) uses GANs (goodfellow2014generative) for generative colorization. Most closely related to ColTran are other autoregressive approaches such as PixColor (guadarrama2017pixcolor) and PIC (royer2017probabilistic), with PixColor obtaining slightly better results than PIC due to its CNN-based upsampling strategy. ColTran is similar to PixColor in its usage of an autoregressive model for low-resolution colorization and parallel spatial upsampling. ColTran differs from PixColor in the following ways. We train ColTran in a completely unsupervised fashion, while the conditioning network in PixColor requires pre-training with an object detection network that provides substantial semantic information. PixColor relies on PixelCNN (oord2016pixel), which requires a large depth to model interactions between all pixels; ColTran relies on the Axial Transformer (ho2019axial) and can model all interactions between pixels with just two layers. PixColor uses different architectures for conditioning, colorization and super-resolution, while ColTran is conceptually simpler, as we use self-attention blocks everywhere for both colorization and super-resolution. Finally, we train our autoregressive model on a single coarse channel with a separate color upsampling network, which improves fidelity (see Section 5.3). The multi-stage generation process in ColTran that upsamples in depth and in size is related to that used in Subscale Pixel Networks (menick2018generating) for image generation, with differences in the order and representation of bits as well as in the use of fully parallel networks. The self-attention blocks that are the building blocks of ColTran were initially developed for machine translation (vaswani2017attention), but are now widely used in a number of other applications including density estimation (parmar2018image; child2019generating; ho2019flow++; weissenborn2019scaling) and GANs (zhang2019self).

3 Background: Axial Transformer

3.1 Row and column self-attention

Self-attention (SA) has become a standard building block in many neural architectures. Although the complexity of self-attention is quadratic in the number of input elements (here pixels), it has recently become quite popular for image modeling (parmar2018image; weissenborn2019scaling) due to modeling innovations that do not require running global self-attention between all pixels. Following the work of (ho2019axial), we employ standard self-attention (vaswani2017attention) within rows and columns of an image. By alternating row and column self-attention we effectively allow global exchange of information between all pixel positions. For the sake of brevity we omit the exact equations for multihead self-attention and refer the interested reader to Appendix H for more details. Row/column attention layers are the core components of our model: we use them in the autoregressive colorizer, the spatial upsampler and the color upsampler.
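As a rough sketch of this idea, single-head row and column attention can be written as below, omitting the learned projections, multihead splitting, residual connections and layer norm of the full model:

```python
import numpy as np

def attention(x):
    # Scaled dot-product self-attention along the second-to-last axis.
    # x: (..., seq_len, d); queries, keys and values are all x here.
    d = x.shape[-1]
    logits = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over the sequence axis
    return w @ x

def row_attention(x):
    # x: (H, W, d); each of the H rows is an independent sequence of length W.
    return attention(x)

def col_attention(x):
    # Swap H and W so each column becomes a sequence, attend, swap back.
    return np.swapaxes(attention(np.swapaxes(x, 0, 1)), 0, 1)

x = np.random.randn(8, 8, 16)
# One row layer followed by one column layer already gives every output
# position an (indirect) global receptive field over all 64 pixels.
y = col_attention(row_attention(x))
```

Each call attends over sequences of length √N, so a row/column pair costs O(N√N) for N pixels rather than the O(N²) of global self-attention.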

3.2 Axial Transformer

The Axial Transformer (ho2019axial) is an autoregressive model that applies (masked) row and column self-attention operations in a way that efficiently summarizes all past information x_<s to model a distribution over the pixel at position s. Causal masking is employed by setting A_ij = 0 for all j > i during self-attention (see Eq. 15).

Outer decoder.

The outer decoder computes a state o over all previous rows by applying layers of full row self-attention followed by masked column self-attention (Eq. 2). o is shifted down by a single row, such that the output context at position (i, j) contains information only about pixels from prior rows (Eq. 3).

Inner decoder.

The embeddings to the inner decoder are shifted right by a single column to mask the current pixel. The context o from the outer decoder conditions the inner decoder by biasing the shifted embeddings. The inner decoder then computes a final state h by applying layers of masked row-wise self-attention to infuse additional information from prior pixels of the same row (Eq. 4). h comprises information about all past pixels. A dense layer projects h into a distribution over the pixel at position (i, j) conditioned on all previous pixels.
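The row and column shifts described above amount to zero-padding on one side and cropping on the other; a minimal sketch:

```python
import numpy as np

def shift_down(x):
    # Pad one zero row at the top and drop the last row, so the value at
    # position (i, j) only carries context computed from rows < i.
    return np.pad(x, ((1, 0), (0, 0)))[:-1]

def shift_right(x):
    # Same idea along columns: position (i, j) only sees columns < j,
    # masking the current pixel.
    return np.pad(x, ((0, 0), (1, 0)))[:, :-1]

x = np.arange(9.0).reshape(3, 3)
down = shift_down(x)    # first row is zeros, remaining rows move down by one
right = shift_right(x)  # first column is zeros, columns move right by one
```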


As shown above, the outer and inner decoder operate on 2-D inputs, such as a single channel of an image. For multi-channel RGB images, when modeling the "current channel", the Axial Transformer incorporates information from prior channels of an image (as per raster order) with an encoder. The encoder encodes each prior channel independently with a stack of unmasked row/column attention layers. The encoder outputs across all prior channels are summed to output a conditioning context for the "current channel". The context conditions the outer and inner decoder by biasing the inputs in Eq 1 and Eq 4 respectively.


The Axial Transformer natively supports semi-parallel sampling that avoids re-evaluating the entire network to generate each pixel of an RGB image. The encoder is run once per channel, the outer decoder once per row and the inner decoder once per pixel. The context from the outer decoder and the encoder is initially zero. The encoder conditions the outer decoder (Eq. 1), and the encoder together with the outer decoder condition the inner decoder (Eq. 4). The inner decoder then generates a row, one pixel at a time, via Eqs. 6, 5 and 4. After generating all pixels in a row, the outer decoder recomputes the context via Eqs. 3, 2 and 1 and the inner decoder generates the next row. This proceeds until all pixels in a channel are generated, after which the encoder recomputes its context to generate the next channel.
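The resulting sampling schedule can be sketched as a triply nested loop; the functions below are hypothetical stand-ins for the three trained components (which are really stacks of row/column attention layers):

```python
import numpy as np

H, W, CHANNELS = 4, 4, 3

# Hypothetical stand-ins for the trained networks.
def encoder(prev_channels):               # run once per channel
    return np.zeros((H, W))
def outer_decoder(prev_rows, enc_ctx):    # run once per row
    return np.zeros(W)
def inner_decoder(prev_pixels, ctx):      # run once per pixel
    return 0                              # index of the sampled intensity

img = np.zeros((CHANNELS, H, W), dtype=int)
for c in range(CHANNELS):
    enc_ctx = encoder(img[:c])                        # context from prior channels
    for i in range(H):
        row_ctx = outer_decoder(img[c, :i], enc_ctx)  # context from prior rows
        for j in range(W):
            # each pixel conditions on prior pixels of its own row plus
            # the cached encoder and outer-decoder contexts
            img[c, i, j] = inner_decoder(img[c, i, :j], (enc_ctx, row_ctx))
```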


Figure 2: Depiction of ColTran. It consists of 3 individual models: an autoregressive colorizer (left), a color upsampler (middle) and a spatial upsampler (right). Each model is optimized independently. The autoregressive colorizer (ColTran core) is an instantiation of the Axial Transformer (Sec. 3.2, ho2019axial) with the conditional transformer layers and auxiliary parallel head proposed in this work (Sec. 4.1). During training, the ground-truth coarse low-resolution image is both the input to the decoder and the target. Masked layers ensure that the conditional distribution for each pixel depends solely on previous ground-truth pixels (see Appendix G for a recap on autoregressive models). ColTran upsamplers are stacked row/column attention layers that deterministically upsample color and space in parallel. Each attention block (in green) is residual and consists of the following operations: layer norm → multihead self-attention → MLP.

4 Proposed Architecture

Image colorization is the task of transforming a grayscale image x_g into a colored image x_s. The task is inherently stochastic: for a given grayscale image x_g, there exists a conditional distribution over colored images, p(x_s | x_g). Instead of predicting x_s directly from x_g, we sequentially predict two intermediate low-resolution images, x_s↓ and x_c↓, with different color depths first. Besides decomposing high-resolution image colorization into simpler tasks, the smaller resolution allows for training larger models.

We obtain x_s↓, a spatially downsampled representation of x_s, by standard area interpolation. x_c↓ is a 3-bit per-channel representation of x_s↓, that is, each color channel has only 8 intensities. Thus, there are 8³ = 512 coarse colors per pixel, which are predicted directly as a single "color" channel. We rewrite the conditional likelihood to incorporate the intermediate representations as follows:

p(x_s | x_g) = p(x_s | x_g, x_s↓) · p(x_s↓ | x_g, x_c↓) · p(x_c↓ | x_g)

ColTran core (Section 4.1), a parallel color upsampler and a parallel spatial upsampler (Section 4.3) model p(x_c↓ | x_g), p(x_s↓ | x_g, x_c↓) and p(x_s | x_g, x_s↓), respectively. In the subsections below, we describe these individual components in detail. From now on we will refer to the low resolution as 64×64 and the high resolution as 256×256. An illustration of the overall architecture is shown in Figure 2.
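The 3-bit coarse representation can be sketched as follows; this is one plausible bit packing for illustration, and the released code may order the channels differently:

```python
import numpy as np

def to_coarse(rgb):
    # rgb: uint8 array of shape (..., 3). Keeping the top 3 bits of each
    # channel leaves 8 intensities per channel, i.e. 8**3 = 512 coarse
    # colors, packed into a single integer "color" channel.
    bits = (rgb >> 5).astype(np.int64)        # values in [0, 8)
    return bits[..., 0] * 64 + bits[..., 1] * 8 + bits[..., 2]

def from_coarse(idx):
    # Invert the packing, mapping each 3-bit value back to 8-bit space.
    r, g, b = idx // 64, (idx // 8) % 8, idx % 8
    return (np.stack([r, g, b], axis=-1) << 5).astype(np.uint8)

px = np.array([[255, 128, 0]], dtype=np.uint8)
idx = to_coarse(px)        # a single channel with values in [0, 512)
back = from_coarse(idx)    # coarse reconstruction: [[224, 128, 0]]
```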

4.1 ColTran Core

In this section, we describe ColTran core, a conditional variant of the Axial Transformer (ho2019axial) for low resolution coarse colorization. ColTran Core models a distribution over 512 coarse colors for every pixel, conditioned on a low resolution grayscale image in addition to the colors from previously predicted pixels as per raster order (Eq. 9).

Component      | Unconditional                     | Conditional
Self-Attention | q = x W_q, k = x W_k, v = x W_v   | q_c = q ∘ (c U_q^s) + c U_q^b (likewise for k, v)
MLP            | y = MLP(x)                        | y_c = y ∘ (c U_m^s) + c U_m^b
Layer Norm     | y = γ ⊙ LN(x) + β                 | y = γ(c) ⊙ LN(x) + β(c)

Table 1: We contrast the different components of unconditional self-attention with self-attention conditioned on context c. Learnable parameters specific to conditioning are denoted by U^s (scale) and U^b (shift).

Given a context representation c, we propose conditional transformer layers in Table 1. Conditional transformer layers have conditional versions of all components within the standard attention block (see Appendix H, Eqs. 14-18).

Conditional Self-Attention. For every layer in the decoder, we apply six convolutions to c to obtain three scale and three shift vectors, which we apply element-wise to the queries q, keys k and values v of the self-attention operation (Section 3.1), respectively.

Conditional MLP. A standard component of the transformer architecture is a two-layer pointwise feed-forward network after the self-attention layer. We scale and shift the output of each MLP conditioned on c, as for self-attention.

Conditional Layer Norm. Layer normalization (ba2016layer) globally scales and shifts a given normalized input using learnable vectors γ, β. Instead, we predict γ and β as a function of c. We first aggregate c into a global 1-D representation c̄ via a learnable spatial pooling layer, which is initialized as a mean pooling layer. Similar to 1-D conditional normalization layers (perez2017film; de2017modulating; dumoulin2016learned; huang2017arbitrary), we then apply linear projections on c̄ to predict γ and β, respectively.
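A minimal sketch of these conditional components, with random stand-ins for the learned projection matrices and a fixed mean pooling in place of the learnable pooling layer:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
x = rng.normal(size=(H, W, D))  # activations inside a transformer layer
c = rng.normal(size=(H, W, D))  # conditioning context from the grayscale encoder

# Stand-ins for learned projections; the real model has separate scale and
# shift projections for each of q, k, v, the MLP output and layer norm.
U_s = rng.normal(size=(D, D)) * 0.1   # predicts element-wise scales
U_b = rng.normal(size=(D, D)) * 0.1   # predicts element-wise shifts

def cond_scale_shift(h, c):
    # Applied to q, k, v (conditional self-attention) and to the MLP output.
    return h * (c @ U_s) + (c @ U_b)

def cond_layer_norm(h, c, eps=1e-6):
    # Normalize per position, then modulate with gamma and beta predicted
    # from a spatially pooled context.
    z = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + eps)
    c_pool = c.mean(axis=(0, 1))   # global 1-D representation of the context
    return (c_pool @ U_s) * z + (c_pool @ U_b)

y = cond_layer_norm(cond_scale_shift(x, c), c)
```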

A grayscale encoder consisting of multiple alternating row and column self-attention layers encodes the grayscale image into the initial conditioning context c. It serves both as context for the conditional layers and as additional input to the embeddings of the outer decoder. The sum of the outer decoder's output and c conditions the inner decoder. Figure 2 illustrates how conditioning is applied in the autoregressive core of the ColTran architecture.

Conditioning every layer via multiple components allows stronger gradient signals through the encoder, so that the encoder can learn better contextual representations. We validate this empirically by outperforming the native Axial Transformer, which conditions by biasing the context states (see Section 5.2 and Section 5.4).

4.2 Auxiliary parallel model

We additionally found it beneficial to train an auxiliary parallel prediction model p̃(x_c↓ | x_g) directly on top of the representations learned by the grayscale encoder (Eq. 10).

Intuitively, this forces the model to compute richer representations and global color structure already at the output of the encoder, which can help conditioning and therefore has a beneficial, regularizing effect on learning. We apply a linear projection on top of the output of the grayscale encoder to obtain a per-pixel distribution over 512 coarse colors. It was crucial to tune the relative contribution of the autoregressive and parallel predictions to improve performance, which we study in Section 5.3.
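The combined objective can be sketched as below, with random logits standing in for network outputs; the weight on the parallel term is the hyperparameter tuned in Section 5.3, and the value used here is purely illustrative:

```python
import numpy as np

def nll(logits, labels):
    # Mean per-pixel negative log-likelihood (softmax cross-entropy).
    logits = logits - logits.max(-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -np.take_along_axis(log_probs, labels[..., None], -1).mean()

rng = np.random.default_rng(0)
N, K = 64 * 64, 512                      # pixels in a coarse image, coarse colors
ar_logits = rng.normal(size=(N, K))      # from the autoregressive decoder
par_logits = rng.normal(size=(N, K))     # linear head on the grayscale encoder
labels = rng.integers(0, K, size=(N,))   # ground-truth coarse colors

lam = 0.01  # illustrative weight on the auxiliary parallel loss
loss = nll(ar_logits, labels) + lam * nll(par_logits, labels)
```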

4.3 Color & Spatial Upsampling

In order to produce high-fidelity colorized images from low-resolution, coarse color images and a given high-resolution grayscale image, we train color and spatial upsampling models. They share the same architecture while differing in their respective inputs and the resolution at which they operate. Similar to the grayscale encoder, the upsamplers comprise multiple alternating layers of row and column self-attention. The output of the encoder is projected to compute the logits underlying the per-pixel color probabilities of the respective upsampler. Figure 2 illustrates the architectures.

Color Upsampler. We convert the coarse image x_c↓ of 512 colors back into a 3-bit RGB image with 8 symbols per channel. The channels are embedded using separate embedding matrices per channel. We upsample each channel individually, conditioning only on the respective channel's embedding. The channel embedding is summed with the respective grayscale embedding for each pixel and serves as input to the subsequent self-attention layers (encoder). The output of the encoder is further projected to per pixel-channel probability distributions over 256 color intensities for each channel (Eq. 11).


Spatial Upsampler. We first naively upsample x_s↓ into a blurry, high-resolution RGB image using area interpolation. As above, we then embed each channel of the blurry RGB image and run a per-channel encoder exactly as with the color upsampler. The output of the encoder is finally projected to per pixel-channel probability distributions over 256 color intensities for each channel (Eq. 12).


In our experiments, similar to (guadarrama2017pixcolor), we found parallel upsampling to be sufficient for high-quality colorizations. Parallel upsampling has the huge advantage of fast generation, which would be notoriously slow for fully autoregressive models at high resolution. To avoid possible minor color inconsistencies between pixels, instead of sampling each pixel from the predicted distributions (Eq. 11 and Eq. 12), we just use the argmax. Even though this slightly limits the potential diversity of colorizations, in practice we observe that sampling only coarse colors via ColTran core is enough to produce a great variety of colorizations.
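The decoding step can be sketched as follows, with random logits standing in for upsampler outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 16, 3, 256))  # per pixel-channel distributions

# Deterministic decoding as used by the ColTran upsamplers: take the most
# likely intensity per pixel-channel, avoiding inconsistent neighbours.
img_argmax = logits.argmax(-1)

# Stochastic alternative (not used at this stage): per-pixel sampling
# via inverse-CDF over the softmax probabilities.
p = np.exp(logits - logits.max(-1, keepdims=True))
p /= p.sum(-1, keepdims=True)
cdf = p.cumsum(-1)
u = rng.random(size=logits.shape[:-1] + (1,))
img_sampled = (u < cdf).argmax(-1)
```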


We train our architecture to minimize the negative log-likelihood of the data (Eq. 13). The likelihoods of the three components are maximized independently, and λ is a hyperparameter that controls the relative contribution of the auxiliary parallel model of Section 4.2.

5 Experiments

5.1 Training and Evaluation

We evaluate ColTran on colorizing grayscale images from the ImageNet dataset (russakovsky2015imagenet). We train the ColTran core, the color upsampler and the spatial upsampler independently on 16 TPUv2 chips for 450K, 300K and 150K steps, respectively. Each component of our architecture uses axial attention blocks with a fixed hidden size and number of heads. We use RMSprop (tieleman2012lecture) with a fixed learning rate. We set apart 10000 images from the training set as a holdout set to tune hyperparameters and perform ablations. To compute FID, we generate samples conditioned on the grayscale images from this holdout set. We use the public validation set to display qualitative results and report final numbers.


Figure 3: Per pixel log-likelihood of coarse colored images over the validation set as a function of training steps. We ablate the various components of the ColTran core in each plot. Left: ColTran with Conditional Transformer Layers vs a baseline Axial Transformer which conditions via addition (ColTran-B). ColTran-B 2x and ColTran-B 4x refer to wider baselines with increased model capacity. Center: Removing each conditional sub-component one at a time (no cLN, no cMLP and no cAtt). Right: Conditional shifts only (Shift), Conditional scales only (Scale), removal of kq conditioning in cAtt (cAtt, only v) and fixed mean pooling in cLN (cLN, mean pool). See Section 5.2 for more details.

5.2 Ablations of ColTran Core

The autoregressive core of ColTran models downsampled, coarse-colored images of resolution 64×64 with 512 coarse colors, conditioned on the respective grayscale image. In a series of experiments we ablate the different components of the architecture (Figure 3). In the section below, we refer to the conditional self-attention, conditional layer norm and conditional MLP subcomponents as cAtt, cLN and cMLP, respectively. We report the per-pixel log-likelihood over coarse colors on the validation set as a function of training steps.

Impact of conditional transformer layers. The left side of Figure 3 illustrates the significant improvement in loss that ColTran core (with conditional transformer layers) achieves over the original Axial Transformer (marked ColTran-B). This demonstrates the usefulness of our proposed conditional layers. Because conditional layers introduce a higher number of parameters we additionally compare to and outperform the original Axial Transformer baselines with 2x and 4x wider MLP dimensions (labeled as ColTran-B 2x and ColTran-B 4x). Both ColTran-B 2x and ColTran-B 4x have an increased parameter count which makes for a fair comparison. Our results show that the increased performance cannot be explained solely by the fact that our model has more parameters.

Importance of each conditional component. We perform a leave-one-out study to determine the importance of each conditional component. We remove each conditional component one at a time and retrain the new ablated model. The curves no cLN, no cMLP and no cAtt in the middle of Figure 3 quantify our results. While each conditional component improves final performance, cAtt plays the most important role.

Multiplicative vs Additive Interactions. Conditional transformer layers employ both conditional shifts and scales consisting of additive and multiplicative interactions, respectively. The curves Scale and Shift on the right hand side of Figure 3 demonstrate the impact of these interactions via ablated architectures that use conditional shifts and conditional scales only. While both types of interactions are important, multiplicative interactions have a much stronger impact.

Context-aware dot-product attention. Self-attention computes the similarity between pixel representations using a dot product between queries q and keys k (see Eq. 15). cAtt applies conditional shifts and scales on q, k and v, which allows modifying this similarity based on contextual information. The curve cAtt, only v on the right of Figure 3 shows that removing this property by conditioning only on v leads to worse results.

Fixed vs adaptive global representation: cLN aggregates global information with a flexible learnable spatial pooling layer. We experimented with a fixed mean pooling layer forcing all the cLN layers to use the same global representation with the same per-pixel weight. The curve cLN, mean pool on the right of Figure 3 shows that enforcing this constraint causes inferior performance as compared to even having no cLN. This indicates that different aggregations of global representations are important for different cLN layers.


Figure 4: Left: FID of generated 64×64 coarse samples as a function of training steps for different values of λ. Center: Final FID scores as a function of λ. Right: FID as a function of log-likelihood.

5.3 Other ablations

Auxiliary Parallel Model.

We study the effect of the hyperparameter λ, which controls the contribution of the auxiliary parallel prediction model described in Section 4.2. For a given λ, we now optimize the sum of the autoregressive log-likelihood and the λ-weighted parallel log-likelihood instead of the autoregressive log-likelihood alone. Note that the parallel model predicts each pixel independently, which is more difficult than modelling each pixel conditioned on previous pixels. Hence, employing the joint objective as a holdout metric would just lead to a trivial solution at λ = 0. Instead, the FID of the generated coarse 64×64 samples provides a reliable way to find an optimal value of λ. In Figure 4, for a small non-zero λ, our model converges to a better FID faster, with a marginal but consistent final improvement. At higher values of λ the performance deteriorates quickly.


Upsampling coarse colored, low-resolution images to a higher resolution is much simpler. Given ground-truth coarse images, the ColTran upsamplers map these to fine-grained images without any visible artifacts, at an FID of 16.4. For comparison, the FID between two random sets of 5000 samples from our holdout set is 15.5. It is furthermore extremely important to provide the grayscale image as input to each of the individual upsamplers, without which the generated images appear highly smoothed out and the FID worsens to 27.0. We also trained a single upsampler for both color and resolution; the FID in this case worsens marginally to 16.6.

5.4 Frechet Inception Distance

We compute FID using colorizations of 5000 grayscale images of resolution 256×256 from the ImageNet validation set, as done in (ardizzone2019guided). To compute the FID, we ensure that there is no overlap between the grayscale images that condition ColTran and those in the ground-truth distribution. In addition to ColTran, we report two additional results, ColTran-S and ColTran-B. ColTran-B refers to the baseline Axial Transformer that conditions via addition at the input. PixColor samples smaller 28×28 colored images autoregressively, compared to ColTran's 64×64. As a control experiment, we train an autoregressive model at resolution 28×28 (ColTran-S) to disentangle architectural choices from the inherent stochasticity of modelling higher resolution images. ColTran-S and ColTran-B obtain FID scores of 21.9 and 21.6, which significantly improve over the previous best FID of 24.32. Finally, ColTran achieves the best FID score of 19.71. All results are presented in Table 2, left.

Correlation between FID and Log-likelihood.

For each architectural variant, Figure 4 (right) illustrates the correlation between the log-likelihood and FID after 150K training steps. There is a moderately positive correlation of 0.57 between the log-likelihood and FID. Importantly, even an absolute improvement on the order of 0.01-0.02 can improve FID significantly. This suggests that designing architectures that achieve better log-likelihood values is likely to lead to improved FID scores and colorization fidelity.

Models                              | FID
ColTran                             | 19.71 ± 0.12
ColTran-B                           | 21.6 ± 0.13
ColTran-S                           | 21.9 ± 0.09
PixColor [guadarrama2017pixcolor]   | 24.32 ± 0.21
cGAN [cao2017unsupervised]          | 24.41 ± 0.27
cINN [ardizzone2019guided]          | 25.13 ± 0.3
VAE-MDN [deshpande2017learning]     | 25.98 ± 0.28
Ground truth                        | 14.68 ± 0.15
Grayscale                           | 30.19 ± 0.1

Models                                     | AMT Fooling rate
ColTran (Oracle)                           | 62.0 % ± 0.99
ColTran (Seed 1)                           | 40.5 % ± 0.81
ColTran (Seed 2)                           | 42.3 % ± 0.76
ColTran (Seed 3)                           | 41.7 % ± 0.83
PixColor [guadarrama2017pixcolor] (Oracle) | 38.3 % ± 0.98
PixColor (Seed 1)                          | 33.3 % ± 1.04
PixColor (Seed 2)                          | 35.4 % ± 1.01
PixColor (Seed 3)                          | 33.2 % ± 1.03
CIC [zhang2016colorful]                    | 29.2 % ± 0.98
LRAC [larsson2016learning]                 | 30.9 % ± 1.02
LTBC [iizuka2016let]                       | 25.8 % ± 0.97

Table 2: We outperform various state-of-the-art colorization models both on FID (left) and human evaluation (right). We obtain the FID scores from (ardizzone2019guided) and the human evaluation results from (guadarrama2017pixcolor). ColTran-B is a baseline Axial Transformer that conditions via addition and ColTran-S is a control experiment where we train ColTran core (see Section 4.1) on smaller 28×28 colored images.

5.5 Qualitative Evaluation

Human Evaluation.

For our qualitative assessment, we follow the protocol used in PixColor (guadarrama2017pixcolor). ColTran colorizes 500 grayscale images, with 3 different colorizations per image, denoted as seeds. Human raters assess the quality of these colorizations with a two-alternative forced choice (2AFC) test. We display both the ground-truth and the recolorized image sequentially for one second each, in random order. The raters are then asked to identify the image with fake colors. For each seed, we report the mean fooling rate over 500 colorizations and 5 different raters. For the oracle methods, we use the human ratings to pick the best-of-three colorizations. ColTran's best seed achieves a fooling rate of 42.3 % compared to 35.4 % for PixColor's best seed. ColTran Oracle achieves a fooling rate of 62 %, indicating that human raters prefer ColTran's best-of-three colorizations over the ground truth image itself.


Figure 5: We display the per-pixel, maximum predicted probability over 512 colors as a proxy for uncertainty.
Visualizing uncertainty.

The autoregressive core model of ColTran should be highly uncertain at object boundaries when colors change. Figure 5 illustrates the per-pixel, maximum predicted probability over 512 colors as a proxy for uncertainty. We observe that the model is indeed highly uncertain at edges and within more complicated textures.

6 Conclusion

We presented the Colorization Transformer (ColTran), an architecture that relies entirely on self-attention for image colorization. We introduced conditional transformer layers, a novel building block for conditional generative models based on self-attention. Our ablations show the superiority of this mechanism over a number of different baselines. Finally, we demonstrated that ColTran can generate diverse, high-fidelity colorizations on ImageNet that are largely indistinguishable from the ground truth even for human raters.


Appendix A Code, checkpoints and tensorboard files

Our implementation is open-sourced in the google-research framework at Our full set of hyperparameters is available here.

We provide pre-trained checkpoints of the colorizer and upsamplers on ImageNet at The colorizer and spatial upsampler for these checkpoints were trained longer, for 600K and 300K steps respectively, which gave us a slightly improved FID score of

Finally, reference tensorboard files for our training runs are available at colorizer tensorboard, color upsampler tensorboard and spatial upsampler tensorboard.

Appendix B Exponential Moving Average

We found that using an exponential moving average (EMA) of our checkpoints is crucial for generating high-quality samples. In Figure 6, we display the FID as a function of training steps, with and without EMA. With EMA applied, our FID score improves steadily over time.
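As a sketch, an EMA over model parameters can be maintained alongside training as follows. The parameter dictionary and the decay value of 0.9 are illustrative only, not the values or API used in our training runs.

```python
def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over model parameters.

    ema_params, params: dicts mapping parameter names to values.
    Returns the updated EMA parameters; samples are then drawn from a
    model built with ema_params rather than the raw params.
    """
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}

# Toy usage: the EMA smoothly trails the raw parameter trajectory.
ema = {"w": 0.0}
for step in range(1, 4):
    raw = {"w": float(step)}  # stand-in for the optimizer's output
    ema = ema_update(ema, raw, decay=0.9)
```

After the three toy steps above, `ema["w"]` is 0.561, illustrating how the average lags behind, and smooths, the raw values.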


Figure 6: Left: FID vs. training steps, with and without Polyak averaging. Right: the effect of K in top-K sampling on FID. See Appendices B and E.

Appendix C Number of parameters and inference speed

Inference speed.

ColTran core can sample a batch of 20 grayscale images at 64×64 resolution in around 3.5 to 5 minutes on a P100 GPU, whereas PixColor takes around 10 minutes to colorize 28×28 grayscale images on a K40 GPU. Sampling 28×28 colorizations with ColTran takes around 30 seconds. The upsampler networks take on the order of milliseconds.

Further, in our naive implementation, we recompute the activations in Table 1 to generate every pixel in the inner decoder. Instead, we could compute these activations once per grayscale image in the encoder and once per row in the outer decoder, and reuse them. This is likely to speed up sampling even more; we leave this engineering optimization for future work.

Number of parameters.

ColTran has a total of ColTran core (46M) + Color Upsampler (14M) + Spatial Upsampler (14M) = 74M parameters. In comparison, PixColor has Conditioning network (44M) + Colorizer network (11M) + Refinement Network (28M) = 83M parameters.

Appendix D Lower compute regime

We retrained the autoregressive colorizer and the color upsampler on 4 TPUv2 chips (the lowest configuration) with reduced batch sizes of 56 and 192, respectively. For the spatial upsampler, we found that a batch size of 8 was sub-optimal and led to a large deterioration in loss. We therefore used a smaller spatial upsampler with 2 axial attention blocks and a batch size of 16, also trained on 4 TPUv2 chips. The FID drops from 19.71 to 20.9, which is still significantly better than the other models in Table 2. We note that in this experiment we use only 12 TPUv2 chips in total, while PixColor (guadarrama2017pixcolor) uses a total of 16 GPUs.

Appendix E Improved FID with Top-K sampling

We can improve colorization fidelity and remove artifacts due to unnatural colors via top-K sampling, at the cost of reduced colorization diversity. In this setting, for a given pixel, ColTran samples a color from the top-K colors (instead of all 512 colors) as determined by the predicted probabilities. Our results in Figure 6 demonstrate a performance improvement over the baseline ColTran model.
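A minimal sketch of per-pixel top-K sampling follows; the function and its inputs are illustrative stand-ins for the model's predicted color distribution, not the paper's implementation.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample a color index from the k highest-scoring entries.

    logits: (512,) unnormalized scores over colors. Entries outside
    the top-k are masked to -inf before renormalizing, so low-probability
    (often unnatural) colors can never be drawn.
    """
    kth = np.sort(logits)[-k]                       # k-th largest score
    masked = np.where(logits >= kth, logits, -np.inf)
    z = masked - masked.max()                       # stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(logits), p=probs)

# Toy usage: with monotonically increasing logits, only the last
# k indices can ever be sampled.
rng = np.random.default_rng(0)
idx = top_k_sample(np.arange(512.0), k=5, rng=rng)
```

Smaller K trades diversity for fidelity, matching the FID trend in Figure 6 (right).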

Appendix F Additional ablations

Figure 7 shows additional ablations of our conditional transformer layers, neither of which helped:

  • Conditional transformer layers based on Gated layers (oord2016pixel) (Gated)

  • A global conditioning layer instead of pointwise conditioning in cAtt and cLN (cAtt + cMLP, global)



Figure 7: Ablated models. Gated: Gated conditioning layers as done in (oord2016pixel) and cAtt + cMLP, global: Global conditioning instead of pointwise conditioning in cAtt and cLN.

Appendix G Autoregressive models

Autoregressive models are a family of probabilistic methods that model the joint distribution of data $x = (x_1, \dots, x_n)$, a sequence of symbols, as a product of conditionals $p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$. During training, the input to an autoregressive model is the entire sequence of ground-truth symbols. Masking ensures that the contribution of all "future" symbols in the sequence is zeroed out. The outputs of the autoregressive model are the corresponding conditional distributions $p(x_i \mid x_{<i})$. Optimizing the parameters of the autoregressive model proceeds by a standard log-likelihood objective.

Generation happens sequentially, symbol by symbol. Once a symbol $x_i$ is generated, the entire sequence $x_{\leq i}$ is fed back to the autoregressive model to generate $x_{i+1}$.

In the case of autoregressive image generation, symbols typically correspond to the 3 RGB pixel channels. These are generated sequentially in raster-scan order, channel by channel and pixel by pixel.
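The sequential generation procedure above can be sketched generically as follows; the `model` callable is a hypothetical stand-in for a trained network that maps a prefix of symbols to a distribution over the next symbol.

```python
import numpy as np

def sample_autoregressive(model, length, num_symbols, rng):
    """Generic symbol-by-symbol sampling loop.

    model(prefix) is assumed to return a probability vector of size
    num_symbols over the next symbol given the prefix so far.
    """
    seq = []
    for _ in range(length):
        probs = model(seq)                    # condition on all past symbols
        seq.append(int(rng.choice(num_symbols, p=probs)))
    return seq

# Toy "model": uniform over 4 symbols, ignoring the prefix.
rng = np.random.default_rng(0)
out = sample_autoregressive(lambda prefix: np.full(4, 0.25), 6, 4, rng)
```

For images, `length` would be the number of pixel channels in raster-scan order; caching activations across steps (Appendix C) avoids re-running the full network per symbol.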

Appendix H Row/Column Self-attention

In the following we describe row self-attention; we omit the height dimension, as all operations are performed in parallel for each column. Given the representation of a single row $x \in \mathbb{R}^{W \times D}$ within an image, the row-wise self-attention block is applied as follows:

$$q = \mathrm{LN}(x)\, W_q, \quad k = \mathrm{LN}(x)\, W_k, \quad v = \mathrm{LN}(x)\, W_v$$
$$y = x + \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{D}}\right) v$$

$\mathrm{LN}$ refers to the application of layer normalization (ba2016layer). Finally, we apply residual connections and a feed-forward neural network with a single hidden layer and ReLU activation ($\mathrm{FF}(x) = \max(0, x W_1)\, W_2$) after each self-attention block, as is common practice in transformers.

Column-wise self-attention over the columns of the image works analogously.
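A single-head version of row-wise scaled dot-product attention can be sketched in NumPy as follows; the projection matrices here are random stand-ins, and the multi-head variant used in practice applies this per head and concatenates the results.

```python
import numpy as np

def row_self_attention(x, wq, wk, wv):
    """Single-head self-attention over one row x of shape (W, D).

    wq, wk, wv: (D, D) projection matrices. A sketch of the standard
    scaled dot-product formulation; layer norm, the output projection,
    and the residual connection are omitted for brevity.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])        # (W, W) attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return attn @ v                                # (W, D) attended values

# Toy usage: one row of width 16 with feature dimension 8.
rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal((16, D))
y = row_self_attention(x, *(rng.standard_normal((D, D)) for _ in range(3)))
```

Running the same function over each column of the transposed feature map gives column-wise attention, which is what makes axial attention cost $O(W^2 H + H^2 W)$ rather than $O((HW)^2)$.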

Appendix I Out of domain colorizations


Figure 8: We train our colorization model on ImageNet and display high resolution colorizations from LSUN


Figure 9: We train our colorization model on ImageNet and display low resolution colorizations from Celeb-A

We use our colorization model, trained on ImageNet, to colorize high-resolution grayscale images from LSUN (yu2015lsun) and low-resolution grayscale images from Celeb-A (liu2015deep). Note that these models were trained only on ImageNet and not finetuned on Celeb-A or LSUN.

Appendix J Number of axial attention blocks

We did a very small hyperparameter sweep using the baseline axial transformer (no conditional layers) with the following configurations:

  • hidden size = 512, number of blocks = 4

  • hidden size = 1024, number of blocks = 2

  • hidden size = 512, number of blocks = 2

Once we found the optimal configuration, we fixed it for all subsequent architecture designs.

Appendix K Analysis of MTurk ratings


Figure 10: Top: colorizations. Bottom: ground truth. From left to right, our colorizations have a progressively higher fooling rate.


Figure 11: In each column, we display the ground truth followed by 3 samples. Left: diverse and realistic. Center: realism improves from left to right. Right: failure cases.


Figure 12: We display the per-pixel, maximum predicted probability over 512 colors as a proxy for uncertainty.

We analyzed our samples on the basis of the MTurk ratings in Figure 11. On the left, we show images where all the samples have a fooling rate above 60%; our model shows diversity in color for both high-level structure and low-level details. In the center, we display samples that have a high variance in MTurk ratings, with a difference of 80% between the best and the worst sample; all of these are complex objects that our model colorizes reasonably well given multiple attempts. On the right of Figure 11, we show failure cases where all samples have a fooling rate of 0%. In these cases, our model is unable to colorize highly complex structures that would arguably be difficult even for a human.

Appendix L More probability maps

We display additional probability maps to visualize uncertainty, as done in Section 5.5.

Appendix M More samples

We display a wide diversity of colorizations from ColTran that were not cherry-picked.