1 Introduction
Image colorization is a challenging, inherently stochastic task that requires a semantic understanding of the scene as well as knowledge of the world. Core immediate applications of the technique include producing organic new colorizations of existing image and video content as well as giving life to originally grayscale media, such as old archival images (tsaftaris2014novel), videos (geshwind1986method) and black-and-white cartoons (sykora2004unsupervised; qu2006manga; cinarel2017into). Colorization also has important technical uses as a way to learn meaningful representations without explicit supervision (zhang2016colorful; larsson2016learning; vondrick2018tracking) or as an unsupervised data augmentation technique, whereby diverse semantics-preserving colorizations of labelled images are produced with a colorization model trained on a potentially much larger set of unlabelled images.
The current state of the art in automated colorization comprises neural generative approaches based on log-likelihood estimation (guadarrama2017pixcolor; royer2017probabilistic; ardizzone2019guided). Probabilistic models are a natural fit for the one-to-many task of image colorization and obtain better results than earlier deterministic approaches, avoiding some of their persistent pitfalls (zhang2016colorful). Probabilistic models also have the central advantage of producing multiple diverse colorings sampled from the learnt distribution.
In this paper, we introduce the Colorization Transformer (ColTran), a probabilistic colorization model composed only of axial self-attention blocks (ho2019axial; wang2020axial). The main advantages of axial self-attention blocks are the ability to capture a global receptive field with only two layers and a lower computational complexity than full two-dimensional self-attention. They can be implemented efficiently using matrix multiplications on modern accelerators such as TPUs (jouppi2017indatacenter). In order to enable colorization of high-resolution grayscale images, we decompose the task into three simpler sequential subtasks: coarse low-resolution autoregressive colorization, parallel color super-resolution and parallel spatial super-resolution. For coarse low-resolution colorization, we apply a conditional variant of the Axial Transformer (ho2019axial), a state-of-the-art autoregressive image generation model that does not require custom kernels (child2019generating). While Axial Transformers support conditioning by biasing the input, we find that directly conditioning the transformer layers can improve results significantly. Further, by leveraging the semi-parallel sampling mechanism of Axial Transformers, we are able to colorize images faster and at higher resolution than previous work (guadarrama2017pixcolor), which in turn improves colorization fidelity. Finally, we employ fast, parallel, deterministic upsampling models to super-resolve the coarsely colorized image into the final high-resolution output. In summary, our main contributions are:
First application of transformers for high-resolution (256×256) image colorization.

We introduce conditional transformer layers for low-resolution coarse colorization in Section 4.1. The conditional layers incorporate conditioning information via multiple learnable components that are applied per-pixel and per-channel. We validate the contribution of each component with extensive experimentation and ablation studies.

We propose training an auxiliary parallel prediction model jointly with the low-resolution coarse colorization model in Section 4.2. Improved FID scores demonstrate the usefulness of this auxiliary model.

We establish a new state of the art in image colorization, outperforming prior methods by a large margin on FID scores and in a two-alternative forced-choice (2AFC) Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth.
2 Related work
Colorization methods initially relied on human-in-the-loop approaches that provide hints in the form of scribbles (levin2004colorization; ironi2005colorization; huang2005adaptive; yatziv2006fast; qu2006manga; luan2007natural; tsaftaris2014novel; zhang2017real; ci2018user) and on exemplar-based techniques that identify a reference source image to copy colors from (reinhard2001color; welsh2002transferring; tai2005local; ironi2005colorization; pitie2007automated; morimoto2009automatic; gupta2012image; xiao2020example). Exemplar-based techniques have recently been extended to video as well (zhang2019deep). In the past few years, the focus has moved to more automated, neural colorization methods. Deterministic colorization techniques such as CIC (zhang2016colorful), LRAC (larsson2016learning), LTBC (iizuka2016let)
, Pix2Pix (isola2017image) and DC (cheng2015deep; dahl2016automatic) involve variations of CNNs to model per-pixel color information conditioned on the intensity. Generative colorization models typically extend unconditional image generation models to incorporate conditioning information from a grayscale image. Specifically, cINN (ardizzone2019guided) uses conditional normalizing flows (dinh2014nice), VAE-MDN (deshpande2017learning; deshpande2015learning) and SCC-DC (Messaoud_2018_ECCV) use conditional VAEs (kingma2013auto), and cGAN (cao2017unsupervised) uses GANs (goodfellow2014generative) for generative colorization. Most closely related to ColTran are other autoregressive approaches such as PixColor (guadarrama2017pixcolor) and PIC (royer2017probabilistic)
with PixColor obtaining slightly better results than PIC due to its CNN-based upsampling strategy. ColTran is similar to PixColor in its use of an autoregressive model for low-resolution colorization and parallel spatial upsampling. ColTran differs from PixColor in the following ways. We train ColTran in a completely unsupervised fashion, while the conditioning network in PixColor requires pre-training with an object detection network that provides substantial semantic information. PixColor relies on PixelCNN (oord2016pixel), which requires a large depth to model interactions between all pixels, whereas ColTran relies on the Axial Transformer (ho2019axial) and can model all interactions between pixels with just 2 layers. PixColor uses different architectures for conditioning, colorization and super-resolution, while ColTran is conceptually simpler, as we use self-attention blocks everywhere for both colorization and super-resolution. Finally, we train our autoregressive model on a single coarse channel and a separate color upsampling network that improves fidelity (see Section 5.3). The multi-stage generation process in ColTran, which upsamples in depth and in size, is related to that of Subscale Pixel Networks (menick2018generating) for image generation, with differences in the order and representation of bits as well as in the use of fully parallel upsampling networks. The self-attention blocks that are the building blocks of ColTran were initially developed for machine translation (vaswani2017attention), but are now widely used in a number of other applications, including density estimation (parmar2018image; child2019generating; ho2019flow++; weissenborn2019scaling) and GANs (zhang2019self).
3 Background: Axial Transformer
3.1 Row and column selfattention
Self-attention (SA) has become a standard building block in many neural architectures. Although the complexity of self-attention is quadratic in the number of input elements (here pixels), it has recently become popular for image modeling (parmar2018image; weissenborn2019scaling) due to modeling innovations that do not require running global self-attention between all pixels. Following (ho2019axial), we employ standard self-attention (vaswani2017attention) within rows and columns of an image. By alternating row and column self-attention, we effectively allow global exchange of information between all pixel positions. For the sake of brevity, we omit the exact equations for multi-head self-attention and instead refer the interested reader to Appendix H for more details. Row/column attention layers are the core components of our model. We use them in the autoregressive colorizer, the spatial upsampler and the color upsampler.
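As a minimal illustration of the mechanism described above, the following numpy sketch alternates row and column self-attention over a toy feature map (single head, no learned projections or masking; all names are ours, not from the ColTran codebase):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head self-attention along the second-to-last axis of x.
    # x: (..., length, dim); for brevity q = k = v = x.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_block(img):
    # img: (H, W, D). Row attention mixes along W within each row;
    # column attention mixes along H within each column. After one
    # row pass and one column pass, every pixel has (indirectly)
    # exchanged information with every other pixel.
    img = attention(img)                                         # rows
    img = np.swapaxes(attention(np.swapaxes(img, 0, 1)), 0, 1)   # columns
    return img
```

This is why two axial layers suffice for a global receptive field: the column pass reads row-mixed features, so information propagates between any pair of positions.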
3.2 Axial Transformer
The Axial Transformer (ho2019axial) is an autoregressive model that applies (masked) row and column self-attention operations in a way that efficiently summarizes all past information to model a distribution over the pixel at position (i, j). Causal masking is employed by setting A_ij = 0 for all j > i during self-attention (see Eq. 15).
Outer decoder.
The outer decoder computes a state h over all previous rows by applying multiple layers of full row self-attention followed by masked column self-attention (Eq. 2). h is shifted down by a single row, such that the output context o at position (i, j) only contains information about pixels from prior rows (Eq. 3).
(1) h = Embeddings(x)
(2) h = MaskedColumn(Row(h)), applied over multiple layers
(3) o = ShiftDown(h)
Inner decoder.
The embeddings to the inner decoder are shifted right by a single column to mask the current pixel x_ij. The context o from the outer decoder conditions the inner decoder by biasing the shifted embeddings (Eq. 4). The inner decoder then computes a final state h by applying multiple layers of masked row-wise self-attention that infuse additional information from prior pixels of the same row (Eq. 5). h thus comprises information about all past pixels x_<ij. A dense layer projects h into a distribution over the pixel at position (i, j) conditioned on all previous pixels (Eq. 6).
(4) h = ShiftRight(Embeddings(x)) + o
(5) h = MaskedRow(h), applied over multiple layers
(6) p(x_ij | x_<ij) = Dense(h)
Encoder.
As shown above, the outer and inner decoder operate on 2D inputs, such as a single channel of an image. For multichannel RGB images, when modeling the "current channel", the Axial Transformer incorporates information from prior channels of an image (as per raster order) with an encoder. The encoder encodes each prior channel independently with a stack of unmasked row/column attention layers. The encoder outputs across all prior channels are summed to output a conditioning context for the "current channel". The context conditions the outer and inner decoder by biasing the inputs in Eq 1 and Eq 4 respectively.
Sampling.
The Axial Transformer natively supports semi-parallel sampling that avoids re-evaluating the entire network to generate each pixel of an RGB image. The encoder is run once per channel, the outer decoder once per row and the inner decoder once per pixel. The context from the outer decoder and the encoder is initially zero. The encoder conditions the outer decoder (Eq. 1), and the encoder plus outer decoder condition the inner decoder (Eq. 4). The inner decoder then generates a row, one pixel at a time, via Eqs. 6, 5 and 4. After generating all pixels in a row, the outer decoder recomputes the context via Eqs. 3, 2 and 1, and the inner decoder generates the next row. This proceeds until all pixels in a channel are generated. The encoder then recomputes its context to generate the next channel.
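The sampling schedule above can be sketched as three nested loops. The `encoder`, `outer_decoder` and `inner_decoder` functions below are trivial stand-ins (not the real networks), meant only to show how often each component runs:

```python
import numpy as np

H, W, C = 4, 4, 3  # toy spatial size and channel count
rng = np.random.default_rng(0)

def encoder(prev_channels):
    # Stand-in: runs once per channel, summarizing prior channels.
    return np.zeros((H, W)) if not prev_channels else sum(prev_channels)

def outer_decoder(rows_so_far, enc_ctx):
    # Stand-in: runs once per row, summarizing prior rows.
    return enc_ctx + rows_so_far.sum()

def inner_decoder(ctx, row, col):
    # Stand-in: runs once per pixel; here it just draws a 3-bit value.
    return int(rng.integers(0, 8))

channels = []
for c in range(C):                      # encoder: once per channel
    enc_ctx = encoder(channels)
    chan = np.zeros((H, W), dtype=int)
    for i in range(H):                  # outer decoder: once per row
        out_ctx = outer_decoder(chan[:i], enc_ctx)
        for j in range(W):              # inner decoder: once per pixel
            chan[i, j] = inner_decoder(out_ctx, i, j)
    channels.append(chan)
```

The point is the asymmetry of the loop nesting: only the cheap inner decoder runs per pixel, which is what makes sampling semi-parallel rather than fully sequential through the whole network.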
4 Proposed Architecture
Image colorization is the task of transforming a grayscale image x_g into a colored image x. The task is inherently stochastic; for a given grayscale image x_g, there exists a conditional distribution p(x | x_g) over colored images. Instead of predicting x directly from x_g, we sequentially predict two intermediate low-resolution images, x↓s and x↓, with different color depths first. Besides decomposing high-resolution colorization into simpler subtasks, the smaller resolution allows for training larger models.
We obtain x↓, a spatially downsampled representation of x, by standard area interpolation. x↓s is a 3-bit per-channel representation of x↓; that is, each color channel has only 8 intensities. Thus, there are 8^3 = 512 coarse colors per pixel, which are predicted directly as a single "color" channel. We rewrite the conditional likelihood to incorporate the intermediate representations as follows:
(7) p(x | x_g) = Σ_{x↓, x↓s} p(x, x↓, x↓s | x_g)
(8) p(x | x_g) = Σ_{x↓, x↓s} p(x | x↓, x↓s, x_g) · p(x↓ | x↓s, x_g) · p(x↓s | x_g)
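A minimal sketch of the 3-bit coarse color representation, assuming the top three bits of each 8-bit channel are kept and packed into one of 8^3 = 512 symbols (the exact bit layout in the paper's code may differ):

```python
import numpy as np

def to_coarse(rgb):
    # rgb: (H, W, 3) uint8 image. Keep the top 3 bits of each
    # channel (8 intensities per channel) and pack them into a
    # single "color" channel with 8**3 = 512 symbols per pixel.
    bits = (rgb >> 5).astype(np.int32)   # values in [0, 8)
    return (bits[..., 0] << 6) | (bits[..., 1] << 3) | bits[..., 2]

def from_coarse(coarse):
    # Invert the packing back to a blocky 3-bit RGB image.
    r = (coarse >> 6) & 7
    g = (coarse >> 3) & 7
    b = coarse & 7
    return (np.stack([r, g, b], axis=-1) << 5).astype(np.uint8)
```

Packing the three coarse channels into one symbol is what lets the autoregressive core predict a single 512-way categorical per pixel instead of three separate channels.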
ColTran core (Section 4.1), a parallel color upsampler and a parallel spatial upsampler (Section 4.3) model p(x↓s | x_g), p(x↓ | x↓s, x_g) and p(x | x↓, x↓s, x_g), respectively. In the subsections below, we describe these individual components in detail. From now on, we refer to the low resolution as 64×64 and the high resolution as 256×256. An illustration of the overall architecture is shown in Figure 2.
4.1 ColTran Core
In this section, we describe ColTran core, a conditional variant of the Axial Transformer (ho2019axial) for low-resolution coarse colorization. ColTran core models a distribution over 512 coarse colors for every pixel, conditioned on a low-resolution grayscale image x_g↓ in addition to the colors of previously predicted pixels as per raster order (Eq. 9).
(9) p(x↓s | x_g↓) = ∏_i p(x↓s[i] | x↓s[<i], x_g↓)
Component  Unconditional  Conditional
Self-Attention  y = softmax(q kᵀ / √D) v  q̂ = q ⊙ s_q + b_q (likewise k̂, v̂); y = softmax(q̂ k̂ᵀ / √D) v̂
MLP  y = MLP(x)  y = MLP(x) ⊙ s_m + b_m
Layer Norm  y = LN(x) ⊙ γ + β  y = LN(x) ⊙ γ(c) + β(c)
Here the scales s and shifts b, as well as γ(c) and β(c), are predicted from the conditioning context c.
Given a context representation c, we propose conditional transformer layers in Table 1. Conditional transformer layers have conditional versions of all components within the standard attention block (see Appendix H, Eqs. 14-18).
Conditional Self-Attention. For every layer in the decoder, we apply six convolutions to c to obtain three scale and three shift vectors, which we apply elementwise to the queries q, keys k and values v of the self-attention operation (Section 3.1), respectively.
Conditional MLP. A standard component of the transformer architecture is a two-layer pointwise feed-forward network after the self-attention layer. We scale and shift the output of each MLP conditioned on c, as for self-attention.
Conditional Layer Norm. Layer normalization (ba2016layer) globally scales and shifts a given normalized input using learnable vectors γ and β. Instead, we predict γ and β as a function of c. We first aggregate c into a global 1D representation via a learnable spatial pooling layer, which is initialized as a mean pooling layer. Similar to 1D conditional normalization layers (perez2017film; de2017modulating; dumoulin2016learned; huang2017arbitrary), we then apply linear projections on this pooled representation to predict γ and β, respectively.
A grayscale encoder consisting of multiple, alternating row and column self-attention layers encodes the grayscale image into the initial conditioning context c. It serves both as context for the conditional layers and as additional input to the embeddings of the outer decoder. The sum of the outer decoder's output and c conditions the inner decoder. Figure 2 illustrates how conditioning is applied in the autoregressive core of the ColTran architecture.
Conditioning every layer via multiple components allows stronger gradient signals through the encoder, and in effect the encoder can learn better contextual representations. We validate this empirically by outperforming the native Axial Transformer, which conditions only by biasing the context states (see Section 5.2 and Section 5.4).
4.2 Auxiliary parallel model
We additionally train an auxiliary parallel prediction model p̃ that models x↓s directly on top of the representations learned by the grayscale encoder, which we found beneficial for regularization (Eq. 10).
(10) p̃(x↓s | x_g↓) = ∏_i p̃(x↓s[i] | x_g↓)
Intuitively, this forces the model to compute richer representations and global color structure already at the output of the encoder, which can help conditioning and therefore has a beneficial, regularizing effect on learning. We apply a linear projection on top of the output of the grayscale encoder to obtain a per-pixel distribution over 512 coarse colors. It was crucial to tune the relative contribution of the autoregressive and parallel predictions to improve performance, which we study in Section 5.3.
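The auxiliary head itself amounts to a linear projection followed by a softmax over the 512 coarse colors; a minimal sketch with hypothetical names:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def parallel_head(enc_out, w):
    # enc_out: (H, W, D) grayscale-encoder output
    # w:       (D, 512) projection to coarse-color logits
    # Returns per-pixel distributions over the 512 coarse colors,
    # predicted in parallel with no autoregressive conditioning.
    return softmax(enc_out @ w)
```

Because this head sees only the grayscale encoder output, its loss gradient flows entirely through the encoder, which is the intended regularizing pressure.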
4.3 Color & Spatial Upsampling
In order to produce high-fidelity colorized images from low-resolution, coarse-color images and a given high-resolution grayscale image, we train color and spatial upsampling models. They share the same architecture while differing in their respective inputs and the resolution at which they operate. Similar to the grayscale encoder, the upsamplers comprise multiple alternating layers of row and column self-attention. The output of the encoder is projected to compute the logits underlying the per-pixel color probabilities of the respective upsampler. Figure 2 illustrates the architectures.
Color Upsampler. We convert the coarse image x↓s of 512 colors back into a 3-bit RGB image with 8 symbols per channel. The channels are embedded using separate embedding matrices, where k indicates the channel. We upsample each channel individually, conditioning only on the respective channel's embedding. The channel embedding is summed with the respective grayscale embedding for each pixel and serves as input to the subsequent self-attention layers (encoder). The output of the encoder is further projected to per pixel-channel probability distributions over 256 color intensities for all channels k (Eq. 11).
(11) p(x↓ | x↓s, x_g↓) = ∏_{i,k} p(x↓[i,k] | x↓s, x_g↓)
Spatial Upsampler. We first naively upsample x↓ into a blurry, high-resolution RGB image using area interpolation. As above, we then embed each channel of the blurry RGB image and run a per-channel encoder exactly as in the color upsampler. The output of the encoder is finally projected to per pixel-channel probability distributions over 256 color intensities for all channels k (Eq. 12).
(12) p(x | x↓, x_g) = ∏_{i,k} p(x[i,k] | x↓, x_g)
In our experiments, similar to (guadarrama2017pixcolor), we found parallel upsampling to be sufficient for high-quality colorizations. Parallel upsampling has the major advantage of fast generation, which would be notoriously slow for fully autoregressive models at high resolution. To avoid minor color inconsistencies between pixels, instead of sampling each pixel from the predicted distributions (Eq. 11 and Eq. 12), we just use the argmax. Even though this slightly limits the potential diversity of colorizations, in practice we observe that sampling only coarse colors via ColTran core is enough to produce a great variety of colorizations.
Objective.
We train our architecture to minimize the negative log-likelihood of the data (Eq. 13). p(x↓s | x_g↓), p(x↓ | x↓s, x_g↓) and p(x | x↓, x_g) are maximized independently, and λ is a hyperparameter that controls the relative contribution of p̃ and p.
(13) L = −log p(x↓s | x_g↓) − λ log p̃(x↓s | x_g↓) − log p(x↓ | x↓s, x_g↓) − log p(x | x↓, x_g)
5 Experiments
5.1 Training and Evaluation
We evaluate ColTran on colorizing grayscale images from the ImageNet dataset (russakovsky2015imagenet). We train the ColTran core, color upsampler and spatial upsampler independently on 16 TPUv2 chips for 450K, 300K and 150K steps, respectively. Each component of our architecture consists of multiple axial attention blocks with a fixed hidden size and number of heads. We use RMSprop (tieleman2012lecture) with a fixed learning rate. We set apart 10000 images from the training set as a holdout set to tune hyperparameters and perform ablations. To compute FID, we generate samples conditioned on the grayscale images from this holdout set. We use the public validation set to display qualitative results and report final numbers.
5.2 Ablations of ColTran Core
The autoregressive core of ColTran models downsampled, coarse-colored images of resolution 64×64 with 512 coarse colors, conditioned on the respective grayscale image. In a series of experiments, we ablate the different components of the architecture (Figure 3). In the sections below, we refer to the conditional self-attention, conditional layer norm and conditional MLP subcomponents as cAtt, cLN and cMLP, respectively. We report the per-pixel log-likelihood over coarse colors on the validation set as a function of training steps.
Impact of conditional transformer layers. The left side of Figure 3 illustrates the significant improvement in loss that ColTran core (with conditional transformer layers) achieves over the original Axial Transformer (marked ColTran-B), demonstrating the usefulness of our proposed conditional layers. Because conditional layers introduce additional parameters, we also compare to, and outperform, original Axial Transformer baselines with 2x and 4x wider MLP dimensions (labeled ColTran-B 2x and ColTran-B 4x). Both have an increased parameter count, which makes for a fair comparison and shows that the improved performance cannot be explained solely by our model having more parameters.
Importance of each conditional component. We perform a leave-one-out study to determine the importance of each conditional component. We remove each conditional component one at a time and retrain the ablated model. The curves no cLN, no cMLP and no cAtt in the middle of Figure 3 quantify our results. While each conditional component improves final performance, cAtt plays the most important role.
Multiplicative vs additive interactions. Conditional transformer layers employ both conditional shifts and scales, consisting of additive and multiplicative interactions, respectively. The curves Scale and Shift on the right-hand side of Figure 3 demonstrate the impact of these interactions via ablated architectures that use conditional shifts only and conditional scales only. While both types of interactions are important, multiplicative interactions have a much stronger impact.
Context-aware dot-product attention. Self-attention computes the similarity between pixel representations using a dot product between q and k (see Eq. 15). cAtt applies conditional shifts and scales to q, k and v, which allows modifying this similarity based on contextual information. The curve cAtt, only v on the right of Figure 3 shows that removing this property, by conditioning only on v, leads to worse results.
Fixed vs adaptive global representation. cLN aggregates global information with a flexible, learnable spatial pooling layer. We experimented with a fixed mean pooling layer, forcing all cLN layers to use the same global representation with the same per-pixel weight. The curve cLN, mean pool on the right of Figure 3 shows that enforcing this constraint causes performance inferior even to having no cLN at all. This indicates that different aggregations of the global representation are important for different cLN layers.
5.3 Other ablations
Auxiliary Parallel Model.
We study the effect of the hyperparameter λ, which controls the contribution of the auxiliary parallel prediction model p̃ described in Section 4.2. For a given λ, we optimize a λ-weighted combination of the parallel and autoregressive log-likelihoods instead of the autoregressive log-likelihood alone. Note that p̃ models each pixel independently, which is more difficult than modelling each pixel conditioned on previous pixels as done by p. Hence, employing the combined likelihood as a holdout metric would just lead to a trivial solution at λ = 0. Instead, the FID of the generated coarse 64x64 samples provides a reliable way to find an optimal value of λ. In Figure 4, at the optimal λ, our model converges to a better FID faster, with a marginal but consistent final improvement. At higher values of λ, the performance deteriorates quickly.
Upsamplers.
Upsampling coarse-colored, low-resolution images to a higher resolution is a much simpler task. Given ground-truth coarse images, the ColTran upsamplers map these to fine-grained images without any visible artifacts and with an FID of 16.4. For comparison, the FID between two random sets of 5000 samples from our holdout set is 15.5. It is further extremely important to provide the grayscale image as input to each of the individual upsamplers; without it, the generated images appear highly smoothed out and the FID worsens to 27.0. We also trained a single upsampler for both color and resolution, in which case the FID worsens marginally to 16.6.
5.4 Frechet Inception Distance
We compute FID using colorizations of 5000 grayscale images of resolution 256×256 from the ImageNet validation set, as done in (ardizzone2019guided). To compute the FID, we ensure that there is no overlap between the grayscale images that condition ColTran and those in the ground-truth distribution. In addition to ColTran, we report two additional results, ColTran-S and ColTran-B. ColTran-B refers to the baseline Axial Transformer that conditions via addition at the input. PixColor samples smaller 28×28 colored images autoregressively, as compared to ColTran's 64×64. As a control experiment, we train an autoregressive model at resolution 28×28 (ColTran-S) to disentangle architectural choices from the inherent stochasticity of modelling higher-resolution images. ColTran-S and ColTran-B obtain FID scores of 21.9 and 21.6, which significantly improve over the previous best FID of 24.32. Finally, ColTran achieves the best FID score of 19.71. All results are presented in Table 2, left.
Correlation between FID and Loglikelihood.
For each architectural variant, Figure 4, right, illustrates the correlation between the log-likelihood and FID after 150K training steps. There is a moderately positive correlation of 0.57 between the log-likelihood and FID. Importantly, even an absolute improvement in log-likelihood on the order of 0.01-0.02 can improve FID significantly. This suggests that designing architectures that achieve better log-likelihood values is likely to lead to improved FID scores and colorization fidelity.
5.5 Qualitative Evaluation
Human Evaluation.
For our qualitative assessment, we follow the protocol used in PixColor (guadarrama2017pixcolor). ColTran colorizes 500 grayscale images, with 3 different colorizations per image, denoted as seeds. Human raters assess the quality of these colorizations with a two-alternative forced-choice (2AFC) test. We display the ground-truth and the recolorized image sequentially, for one second each, in random order. The raters are then asked to identify the image with fake colors. For each seed, we report the mean fooling rate over 500 colorizations and 5 different raters. For the oracle methods, we use the human ratings to pick the best of three colorizations. ColTran's best seed achieves a fooling rate of 42.3% compared to 35.4% for PixColor's best seed. ColTran Oracle achieves a fooling rate of 62%, indicating that human raters prefer ColTran's best-of-three colorizations over the ground-truth image itself.
Visualizing uncertainty.
The autoregressive core of ColTran is expected to be highly uncertain at object boundaries where colors change. Figure 5 illustrates the per-pixel maximum predicted probability over 512 colors as a proxy for uncertainty. We observe that the model is indeed highly uncertain at edges and within more complicated textures.
6 Conclusion
We presented the Colorization Transformer (ColTran), an architecture that relies entirely on self-attention for image colorization. We introduce conditional transformer layers, a novel building block for conditional, generative models based on self-attention. Our ablations show the superiority of employing this mechanism over a number of different baselines. Finally, we demonstrate that ColTran can generate diverse, high-fidelity colorizations on ImageNet, which are largely indistinguishable from the ground truth even for human raters.
References
Appendix A Code, checkpoints and tensorboard files
Our implementation is open-sourced in the google-research repository at https://github.com/google-research/google-research/tree/master/coltran. Our full set of hyperparameters is available here. We provide pre-trained checkpoints of the colorizer and upsamplers on ImageNet at https://console.cloud.google.com/storage/browser/gresearch/coltran. The colorizer and spatial upsampler for these checkpoints were trained longer, for 600K and 300K steps, which gave us a slightly improved FID score.
Finally, reference tensorboard files for our training runs are available at colorizer tensorboard, color upsampler tensorboard and spatial upsampler tensorboard.
Appendix B Exponential Moving Average
We found that using an exponential moving average (EMA) of our checkpoints is extremely important for generating high-quality samples. In Figure 6, we display the FID as a function of training steps, with and without EMA. On applying EMA, our FID score improves steadily over time.
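A checkpoint EMA amounts to the usual recursive update over parameters; a minimal sketch (the decay value below is an assumption, not taken from the paper):

```python
def ema_update(ema_params, params, decay=0.999):
    # One exponential-moving-average step over a dict of parameters:
    # ema <- decay * ema + (1 - decay) * current.
    # decay=0.999 is a common default and an assumption here.
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}
```

In practice the EMA copy is updated after every training step and the averaged weights, not the raw ones, are used for sampling and FID evaluation.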
Appendix C Number of parameters and inference speed
Inference speed.
ColTran core can sample a batch of 20 64x64 colorizations in around 3.5-5 minutes on a P100 GPU, whereas PixColor takes 10 minutes to colorize 28x28 grayscale images on a K40 GPU. Sampling 28x28 colorizations with ColTran takes around 30 seconds. The upsampler networks run in the order of milliseconds.
Further, in our naive implementation, we recompute the activations in Table 1 to generate every pixel in the inner decoder. Instead, we could compute these activations once per grayscale image in the encoder and once per row in the outer decoder and reuse them. This is likely to speed up sampling even more, and we leave this engineering optimization for future work.
Number of parameters.
ColTran has a total of ColTran core (46M) + Color Upsampler (14M) + Spatial Upsampler (14M) = 74M parameters. In comparison, PixColor has Conditioning network (44M) + Colorizer network (11M) + Refinement Network (28M) = 83M parameters.
Appendix D Lower compute regime
We retrained the autoregressive colorizer and color upsampler on 4 TPUv2 chips (the lowest configuration) with reduced batch sizes of 56 and 192, respectively. For the spatial upsampler, we found that a batch size of 8 was suboptimal and led to a large deterioration in loss. We thus used a smaller spatial upsampler with 2 axial attention blocks and a batch size of 16, trained also on 4 TPUv2 chips. The FID worsens from 19.71 to 20.9, which is still significantly better than the other models in Table 2. We note that in this experiment we use only 12 TPUv2 chips in total, while PixColor (guadarrama2017pixcolor) uses a total of 16 GPUs.
Appendix E Improved FID with TopK sampling
We can improve colorization fidelity and remove artifacts caused by unnatural colors via top-K sampling, at the cost of reduced colorization diversity. In this setting, for a given pixel, ColTran generates a color from the top K colors (instead of all 512 colors) as determined by the predicted probabilities. Our results in Figure 6 demonstrate a performance improvement over the baseline ColTran model.
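Top-K sampling can be sketched as restricting the categorical distribution to its K largest probabilities and renormalizing (a generic implementation, not the paper's code):

```python
import numpy as np

def topk_sample(probs, k, rng):
    # Sample a color index from the k most probable of the
    # coarse colors, renormalizing their probabilities.
    top = np.argsort(probs)[-k:]          # indices of the k largest
    p = probs[top] / probs[top].sum()
    return int(top[rng.choice(k, p=p)])
```

With k equal to the vocabulary size this reduces to ordinary sampling, and with k = 1 to the argmax, which is the trade-off between diversity and fidelity described above.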
Appendix F Additional ablations
Additional ablations of our conditional transformer layers, which did not help, are shown in Figure 7:

Conditional transformer layers based on gated layers (oord2016pixel) (Gated)

A global conditioning layer instead of pointwise conditioning in cAtt and cMLP (cAtt + cMLP, global)
Appendix G Autoregressive models
Autoregressive models are a family of probabilistic methods that model the joint distribution of data, a sequence of symbols s = (s_1, ..., s_N), as a product of conditionals ∏_i p(s_i | s_<i). During training, the input to an autoregressive model is the entire sequence of ground-truth symbols. Masking ensures that the contributions of all "future" symbols in the sequence are zeroed out. The outputs of the autoregressive model are the corresponding conditional distributions. Optimizing the parameters of the autoregressive model proceeds by a standard log-likelihood objective. Generation happens sequentially, symbol by symbol. Once a symbol s_i is generated, the entire sequence (s_1, ..., s_i) is fed back to the autoregressive model to generate s_{i+1}.
In the case of autoregressive image generation, symbols typically correspond to the 3 RGB pixel-channels. These are generated sequentially in raster-scan order, channel by channel and pixel by pixel.
Appendix H Row/Column Selfattention
In the following, we describe row self-attention; we omit the height dimension, as all operations are performed in parallel for each row. Given the representation x of a single row of an image, a row-wise self-attention block is applied as follows:
(14) q = LN(x) W_q,  k = LN(x) W_k,  v = LN(x) W_v
(15) A = softmax(q kᵀ / √D), with A_ij = 0 for j > i under causal masking
(16) y = A v
(17) x' = x + y W_o
LN refers to the application of layer normalization (ba2016layer). Finally, we apply residual connections and a feed-forward neural network with a single hidden layer and ReLU activation (FFN) after each self-attention block, as is common practice in transformers.
(18) out = x' + FFN(LN(x')), where FFN(z) = W_2 ReLU(W_1 z)
Column-wise self-attention over the columns of an image works analogously.
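A numpy sketch of a single masked row self-attention head consistent with the description above (the projection shapes and the additive large-negative masking are implementation choices of ours, not from the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_row_attention(x, Wq, Wk, Wv, causal=True):
    # x: (W, D) representations of one row of pixels.
    W_len, D = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    if causal:
        # Zero out attention to "future" pixels j > i by masking
        # the logits before the softmax.
        scores = np.where(np.tril(np.ones((W_len, W_len))) == 1,
                          scores, -1e9)
    return softmax(scores) @ v
```

With the mask in place, the output at position i depends only on positions j <= i, which is exactly the causality the decoder needs.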
Appendix I Out of domain colorizations
We use our colorization model trained on ImageNet to colorize high-resolution grayscale images from LSUN (yu2015lsun) and low-resolution grayscale images from CelebA (liu2015deep). Note that these models were trained only on ImageNet and not fine-tuned on CelebA or LSUN.
Appendix J Number of axial attention blocks
We did a very small hyperparameter sweep using the baseline axial transformer (no conditional layers) with the following configurations:

hidden size = 512, number of blocks = 4

hidden size = 1024, number of blocks = 2

hidden size = 512, number of blocks = 2
Once we found the optimal configuration, we fixed this for all future architecture design.
Appendix K Analysis of MTurk ratings
We analyzed our samples on the basis of the MTurk ratings in Figure 11. On the left, we show images where all samples have a fool rate > 60%. Our model is able to show diversity in color for both high-level structure and low-level details. In the center, we display samples with high variance in MTurk ratings, with a difference of 80% between the best and worst sample. All of these depict complex objects that our model colorizes reasonably well given multiple attempts. On the right of Figure 11, we show failure cases where all samples have a fool rate of 0%. In these cases, our model is unable to colorize highly complex structures that would arguably be difficult even for a human.
Appendix L More probability maps
We display additional probability maps to visualize uncertainty, as done in Section 5.5.
Appendix M More samples
We display a wide diversity of colorizations from ColTran that were not cherry-picked.