Open source repository at GitHub for End-to-End Conditional GAN-based Architectures for Image Colourisation
In this work recent advances in conditional adversarial networks are investigated to develop an end-to-end architecture based on Convolutional Neural Networks (CNNs) to directly map realistic colours to an input greyscale image. Observing that existing colourisation methods sometimes exhibit a lack of colourfulness, this paper proposes a method to improve colourisation results. In particular, the method uses Generative Adversarial Neural Networks (GANs) and focuses on improvement of training stability to enable better generalisation in large multi-class image datasets. Additionally, the integration of instance and batch normalisation layers in both generator and discriminator is introduced to the popular U-Net architecture, boosting the network capabilities to generalise the style changes of the content. The method has been tested using the ILSVRC 2012 dataset, achieving improved automatic colourisation results compared to other methods based on GANs.
Colourisation refers to the process of adding colours to greyscale or other monochrome images such that the colourised results are perceptually meaningful and visually appealing. In general, greyscale content is present in many multimedia applications: from “black and white” videos in old archives and videos with faded colours, to computer vision applications that discard the chroma component in order to simplify processing. However, while the luminance information provides valuable content-related information regarding shapes and structures, the perception of colour is important for modern video viewing. It is also essential for understanding the visual world, allowing the distinction between objects and physical variations, such as shadow gradations, light source reflections or reflectance variations on video frames. For this reason, adding chromatic information to images and improving the quality of colour has become a research area of significant interest for a wide variety of domains that traditionally have resorted to using luminance data alone. This includes medical imaging, surveillance systems  or restoration of degraded historical images .
Recently, the emergence of deep learning has enabled the development of new colourisation algorithms which better generalise the natural data distribution of colours. Convolutional Neural Networks (CNNs) outperform many state-of-the-art methods based on hand-crafted features in tasks such as image enhancement, image classification or object detection [10, 12]. State-of-the-art colourisation methods based on Generative Adversarial Neural Networks (GANs) aim to mimic the natural colour distribution of the training data by forcing the generated samples to be indistinguishable from natural images. Moreover, using adversarial loss, the discriminator can learn a trainable loss function that guarantees a correct adaptation of the differences between generated and real images in the target domain. However, existing methods still suffer from ambiguity when trying to predict realistic colours, causing desaturated results in most cases. Nevertheless, GANs are a suitable basis for further tackling the desaturation problem and gaining colourfulness.
Motivated by the recent success of Conditional Adversarial Networks in image-to-image translation tasks, including colourisation [8, 25], this paper proposes an automatic colourisation paradigm using end-to-end convolutional neural network architectures. Improved colourisation is achieved by introducing techniques that improve the stability of the adversarial loss during training, leading to better colourisation of a variety of images from large multi-class datasets. Further enhancements are achieved by applying feature normalisation techniques which are widely used in style transfer models. The capabilities of adversarial models in image colourisation are improved by adapting an Instance-Batch Normalisation (IBN) convolutional architecture to conditional GANs. Some examples of the results achieved by the proposed method are presented in Figure 1. The main contributions of this work are the following:
Analysis of drawbacks in state-of-the-art methods for automatic image colourisation.
Identification and integration of appropriate architectural features and training procedures which lead to boosted GAN performance for image colourisation. The proposed improvements include:
A novel generator-discriminator setting which adapts the IBN paradigm to an encoder-decoder architecture, enabling generalisation of the content’s style changes while encouraging stabilisation during GAN training.
The use of Spectral Normalisation (SN)  for improving the generalisation of the adversarial colourisation and preventing training instability.
The use of multi-scale discriminators to achieve an improved colour generation in small areas and local details and a boosted colourfulness.
The paper has the following structure: Section II reviews related work in the literature, identifying the main drawbacks and possible improvements, Section III details the proposed methodology, Section IV provides information about the implementation and data used in the experiments and a quantitative evaluation of the results while Section V provides conclusions and identifies future work.
Automatic colourisation was originally introduced in 1970 to describe a novel computer-assisted technology for adding colour to black and white movies and TV programs. Although such a semi-automatic method improved the efficiency of traditional hand-crafted techniques, it still required a considerable amount of manual effort and artistic experience to achieve acceptable results. Since then, it has been shown that the task is complex, ill-conditioned and inherently ambiguous due to the large degrees of freedom during the assignment of colour information.
In some cases, the semantics of the scene and the variations of the luminance intensity can help to infer priors of the colour distribution of the image. For example, an algorithm can successfully associate rapid changes to vegetation areas, assigning ranges of green to them, or smooth areas to sky, inferring blue tones. Nevertheless, in most cases the ambiguity in the decisions can lead a system to make random choices. For instance, the hypothetical prior of a car being red is the same as it being green or blue, although in reality the decision will converge towards the dominant samples in the training data. This fact motivated the research of conservative solutions, such as scribble-based colourisation [19, 27] and exemplar-based colourisation, which involve user interaction through semi-automatic methods. In the first method, the user annotates ground truth colours at certain informative points and the system learns how to propagate them to the entire image. Alternatively, a whole colour reference is carefully selected with similar content and semantics to the target, and the system attempts to transfer the colours from reliable estimated correspondences. However, the quality of the final results depends on the choice of the reference samples and the style transfer methodology used to estimate the correspondences.
Fully automatic methods, in turn, tend to produce desaturated results, which is associated with treating automatic colourisation as a standard regression problem. Taking a greyscale input image, a parametric model can learn how to predict the corresponding chrominance channels by minimising the Euclidean distance between the estimations and the ground truth. Nevertheless, basic solutions are commonly based on averaging the colours of the training examples. In this way the basic model produces desaturated results characterised by low absolute values in the colour channels when trained on large databases of natural images. Previously this problem has been addressed through a deep learning approach which introduced a rebalancing process during training with the aim of penalising the predicted colours based on their likelihood in the prior distribution of training data. Such a method outperforms previous state-of-the-art approaches, including recent successes with GANs, in which more complex architectures need to adopt the methodology to generalise the predicted colours.
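The averaging effect described above can be illustrated with a small worked example. The sketch below is not taken from the paper: the three chrominance pairs are made-up values standing in for equally plausible colours of the same grey object, and the helper names are hypothetical. It shows that the prediction minimising the mean squared (Euclidean) error is the mean colour, whose chroma is much lower than that of any individual sample.

```python
import math

def mean_colour(samples):
    """Per-channel mean of a list of (a, b) chrominance pairs."""
    n = len(samples)
    return (sum(a for a, _ in samples) / n, sum(b for _, b in samples) / n)

def mse(pred, samples):
    """Mean squared Euclidean distance between one prediction and all samples."""
    return sum((pred[0] - a) ** 2 + (pred[1] - b) ** 2 for a, b in samples) / len(samples)

def chroma(ab):
    """Distance from the neutral (grey) axis; low values mean desaturated colours."""
    return math.hypot(ab[0], ab[1])

# Illustrative (a, b) values only: a reddish, a greenish and a bluish ground truth.
samples = [(60.0, 40.0), (-50.0, 50.0), (20.0, -70.0)]
best = mean_colour(samples)

print("regression optimum:", best, "chroma:", chroma(best))
print("sample chromas:", [chroma(s) for s in samples])
```

Predicting any single plausible colour gives a higher squared error than predicting the washed-out mean, which is why a pure regression objective favours desaturated outputs.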
As proposed in the pix2pix framework, a more traditional regression loss such as the $L_1$ or $L_2$ distance is beneficial when included in the final objective function. This enables a conditional GAN to increase the error rate of the discriminator while producing realistic results close to the ground truth. Although such a framework achieves state-of-the-art performance across a range of image-to-image translation tasks, it still requires the aforementioned rebalancing method, targeting colour rarity far from the desaturated mean of natural data distributions. The high instability during training when a GAN deals with complex generator architectures and high-resolution training images can lead the pix2pix framework to mode collapse, converging towards undesirable local equilibria between the generator and discriminator [22, 25]. This effect reduces the contribution of the adversarial loss in the multi-loss objective, giving the total weight of convergence to the regression loss and hence leading the system again to desaturated results.
Aiming at colourisation of images, the goal of our method is to enable automatic CNN-based colourisation of an input greyscale image, denoted $X \in \mathbb{R}^{H \times W \times 1}$, where $H \times W$ is the image dimension in pixels, and represented by the lightness channel in the CIE Lab colour space. To achieve this, it is essential to train an end-to-end CNN architecture capable of learning the direct mapping to the two associated colour channels $Y \in \mathbb{R}^{H \times W \times 2}$. As commonly used in the literature, the CIE Lab colour space is chosen as it is designed to maintain perceptual uniformity and is more perceptually linear than other colour spaces. The mapping function $\mathcal{F}: X \to Y$ can be expressed in a neural network form as:

$$\mathcal{F}(X; \theta) = \sigma_K\big(W_K \, \sigma_{K-1}(\cdots \sigma_1(W_1 X))\big),$$

where $\theta = \{W_k\}_{k=1}^{K}$ is the set of learning parameters for a $K$-layer CNN, omitting the bias terms for simplicity, and $\sigma_k$ the corresponding non-linear activation function, with $k = 1, \dots, K$.
A mapping convolutional model is trained using a generative adversarial methodology with conditional GANs. This work uses the pix2pix framework as baseline to solve image-to-image translation tasks such as generating realistic street scenes from semantic segmentation maps, aerial photography from cartographic maps or image colourisation from greyscale inputs. As per the traditional GANs setting, two CNNs (a generator $G$ and a discriminator $D$) are trained simultaneously in a minimax two-player game, with the objective of reaching the Nash equilibrium between them. Given an input greyscale image $X$ and a vector of random noise $z$, the aim of the generator is to capture the original colour distribution of the training data and to learn a realistic mapping $G: \{X, z\} \to Y$ to the colourisation result $Y$. On the other hand, the discriminator aims to distinguish real images from colourised ones through the mapping $D: \{X, Y\} \to [0, 1]$, estimating the probability that a sample came from the training data rather than from $G$. Therefore, the conditional GAN framework will model the colour distribution of the training data following the minimax training strategy:

$$G^* = \arg\min_G \max_D \mathcal{L}(G, D),$$

where the objective function is given by

$$\mathcal{L}(G, D) = \mathbb{E}_{X, Y}\big[\log D(X, Y)\big] + \mathbb{E}_{X, z}\big[\log\big(1 - D(X, G(X, z))\big)\big] + \lambda \, \mathbb{E}_{X, Y, z}\big[\lVert Y - G(X, z) \rVert_1\big],$$

using $\lambda$ to control the contribution of the regression loss.
As suggested in recent works [5, 17], the standard loss function for the generator is redefined in order to guarantee non-saturation by maximising the probability of the discriminator being mistaken and converting the loss to a strictly decreasing function. Moreover, note the aforementioned regression distance introduced in the final generator objective to encourage a colourisation close to the ground truth outputs. Regarding the GAN architectures, the pix2pix framework uses a U-Net as generator and a Markovian PatchGAN as discriminator, yielding output probability maps based on the discrimination of patches in the input domain. The framework exploits the intrinsic fully convolutional architecture of the discriminator to control the input patch size via its respective receptive field.
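The non-saturating generator loss with an added regression term can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the regression term is taken to be a mean absolute error as in pix2pix, and the default weight `lam=100.0` is only a commonly used pix2pix setting, not a value confirmed by this paper.

```python
import math

EPS = 1e-12  # avoids log(0); an implementation detail of this sketch

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy minimised by the discriminator.

    d_real / d_fake are lists of discriminator probabilities in [0, 1]
    for real and generated samples respectively.
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake, y_true, y_pred, lam=100.0):
    """Non-saturating adversarial term plus lam times the mean absolute error.

    Instead of minimising log(1 - D(G(x))), the generator minimises
    -log(D(G(x))), which keeps strong gradients when the discriminator
    confidently rejects the generated samples.
    """
    adversarial = -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)
    regression = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return adversarial + lam * regression

# The adversarial part decreases strictly as the discriminator is fooled more.
for p in (0.1, 0.5, 0.9):
    print(p, generator_loss([p], [0.5], [0.5]))
```

With `lam` large, the regression term dominates once the adversarial term stops changing, which matches the collapse towards desaturated results discussed earlier.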
The application of mini-batch normalisation techniques such as batch normalisation (BN) has become a common practice in deep learning to accelerate the training of deep neural networks. In the case of GANs, it was shown with the DCGAN architecture that applying batch normalisation in both generator and discriminator architectures can be very beneficial to stabilise the GAN learning and to prevent a mode collapse due to poor initialisation. Internally, batch normalisation preserves content-related information by reducing the internal covariate shift within a mini-batch during training. It uses the internal mean and variance of the batch to normalise each feature channel.
On the other hand, Instance Normalisation (IN) was proven to be beneficial in style transfer, speeding up stylisation. Image colourisation, like other style transfer techniques, aims to capture style information by learning features that are invariant to appearance changes, with the aim of generalising the colourisation process within a mini-batch of variable content. Therefore, unlike batch normalisation, IN uses the statistics of an individual sample instead of the whole mini-batch to normalise features.
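The difference between the two normalisation schemes comes down to which values are pooled when computing the mean and variance. The following sketch, written in plain Python with hypothetical helper names and without the learnable scale and shift parameters used in practice, normalises a tiny batch both ways. Each feature map is stored as a flat list of spatial values, and all maps are assumed to have the same length.

```python
def _normalise(values, eps=1e-5):
    """Zero-mean, unit-variance normalisation of a flat list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm(x):
    """x[n][c] is a flat feature map; statistics are pooled over the whole batch."""
    n_samples, n_channels = len(x), len(x[0])
    out = [[None] * n_channels for _ in range(n_samples)]
    for c in range(n_channels):
        size = len(x[0][c])
        pooled = [v for n in range(n_samples) for v in x[n][c]]
        normed = _normalise(pooled)
        for n in range(n_samples):
            out[n][c] = normed[n * size:(n + 1) * size]
    return out

def instance_norm(x):
    """Statistics are computed per sample and per channel."""
    return [[_normalise(channel) for channel in sample] for sample in x]

# Two samples, one channel: same structure, different overall brightness.
batch = [[[1.0, 2.0, 3.0]], [[11.0, 12.0, 13.0]]]
print("BN:", batch_norm(batch))
print("IN:", instance_norm(batch))
```

In this toy batch, instance normalisation maps both samples to the same values, removing the appearance difference, whereas batch normalisation keeps the darker sample below zero and the brighter one above it.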
Inspired by IBN-Net, in the presented approach BN and IN are combined in the same convolutional architecture with the aim of exploiting the instance-based normalisation capabilities in style transfer while encouraging stabilisation during training, to improve both the learning and generalisation capacities of the GAN. This work adapts the residual IBN-Net architecture to a U-Net generator and a patch-based discriminator. The IBN-Net work discussed that appearance variance in a deep convolutional model mainly lies in shallow layers, while the feature discrimination for content is higher in deep layers. Therefore, IBN-Net avoids IN in deep layers to preserve content discrimination in deep features, while it keeps batch normalisation in the whole architecture to preserve content-related features at different levels. Figure 2 shows the final proposed architectures for generator and discriminator. Note that normalisation is not applied to the input layers to avoid sample oscillation and model instability.
One common strategy to improve the generalisation of the network and to prevent instability during training is the use of weight regularisation. This technique penalises the weights of the network in proportion to their size, aiming to keep their values small during training and hence preventing small changes in the input from leading to large changes in the output. In the context of GANs, the use of sigmoid activations in the discriminator can lead the optimisation process towards unbounded gradients. To prevent such a situation, Spectral Normalisation (SN) was introduced as a regularisation step to control the Lipschitz constant of the discriminator.
In the context of convolutional neural networks, its authors proved that the Lipschitz constant of a linear mapping $h \mapsto Wh$, e.g. between the pre-activations of two layers, is its largest singular value or spectral norm $\sigma(W)$. Then, they performed spectral normalisation by replacing the layer weights $W$ with $W / \sigma(W)$, $\sigma(W)$ being the largest singular value of $W$, i.e. the square root of the largest eigenvalue of $W^\top W$.
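The spectral norm is usually estimated by power iteration rather than by a full decomposition. The sketch below is a plain-Python illustration with hypothetical function names, operating on a weight matrix given as a list of rows; it assumes a non-zero matrix and is not the implementation used in the paper.

```python
import math
import random

def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def _transpose(m):
    return [list(col) for col in zip(*m)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def spectral_norm(w, n_iter=50, seed=0):
    """Estimate the largest singular value of w by power iteration."""
    rng = random.Random(seed)
    wt = _transpose(w)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(w[0]))]
    for _ in range(n_iter):
        u = _matvec(w, v)
        u_len = _norm(u)
        u = [x / u_len for x in u]
        v = _matvec(wt, u)
        v_len = _norm(v)
        v = [x / v_len for x in v]
    # With v a unit vector along the dominant right singular direction,
    # the length of w @ v is the largest singular value.
    return _norm(_matvec(w, v))

def spectrally_normalise(w, n_iter=50):
    """Return w divided by its estimated spectral norm."""
    sigma = spectral_norm(w, n_iter)
    return [[x / sigma for x in row] for row in w]

weights = [[3.0, 0.0], [0.0, 1.0]]
print(spectral_norm(weights))
print(spectrally_normalise(weights))
```

After normalisation the largest singular value of the layer is one, so the layer cannot amplify the norm of its input. Practical implementations typically run only one iteration per training step and carry the vector estimate over between steps, which keeps the overhead small.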
A challenge in colourisation is to achieve precision in small areas and local details. Using the Markovian PatchGAN discrimination in the pix2pix framework, colourfulness can be boosted by increasing the receptive field of the discriminator, albeit at the price of increasing the complexity with deeper architectures and losing spatial information, commonly leading to blurry effects and tiling artefacts. A better solution is to use the multi-scale discrimination setting to tackle high-resolution image processing without varying the discriminator architecture. This is achieved using discriminators at different scales by downsampling the actual inputs. Therefore, keeping the original discriminator architecture fixed, variable receptive fields are obtained. These fields are larger at the coarsest levels, and the modified objective function in the GAN context is given as:

$$G^* = \arg\min_G \max_{D_1, \dots, D_N} \sum_{k=1}^{N} \Big( \mathbb{E}\big[\log D_k(X_k, Y_k)\big] + \mathbb{E}\big[\log\big(1 - D_k(X_k, G(X, z)_k)\big)\big] \Big) + \lambda \, \mathbb{E}\big[\lVert Y - G(X, z) \rVert_1\big],$$

where $X_k$, $Y_k$ are the inputs of the discriminator $D_k$ at scale $k$ and $G(X, z)_k$ is the generated output at the same scale, with $k = 1, \dots, N$, and $N$ the number of discriminator scales. As before, $z$ is the noise vector and $\lambda$ controls the contribution of the regression loss.
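The relation between downsampling and receptive field can be made concrete with a short calculation. The sketch below is illustrative only: the layer list is the widely used 70×70 PatchGAN configuration from pix2pix (4×4 kernels with strides 2, 2, 2, 1, 1), and the three scales with a factor of 2 between them are assumptions, since the exact settings of this paper are not recoverable from the text. The helper names are hypothetical.

```python
# (kernel, stride) per convolution; assumed pix2pix-style PatchGAN, not the paper's exact setup.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

def receptive_field(layers):
    """Input patch size seen by one output unit of a stack of convolutions."""
    r = 1
    for kernel, stride in reversed(layers):
        r = (r - 1) * stride + kernel
    return r

def downsample2x(img):
    """Average-pool a 2-D list by a factor of 2 (height and width assumed even)."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]

def pyramid(img, n_scales):
    """Inputs for n_scales discriminators: the original plus repeated 2x downsamplings."""
    scales = [img]
    for _ in range(n_scales - 1):
        scales.append(downsample2x(scales[-1]))
    return scales

image = [[float(i + j) for j in range(8)] for i in range(8)]
rf = receptive_field(PATCHGAN_70)
for k, scaled in enumerate(pyramid(image, 3)):
    # The same discriminator covers roughly rf * 2**k pixels of the original image.
    print("scale", k, "size", len(scaled), "x", len(scaled[0]), "coverage ~", rf * 2 ** k)
```

The discriminator architecture stays the same at every scale; only its input changes, so the coarser scales judge global colour consistency while the finest scale focuses on local detail.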
Training examples are generated from the ImageNet dataset, particularly from the 1,000 synsets selected for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. Samples are selected from the reduced validation set, containing RGB images uniformly distributed as images per class. The test dataset is created by randomly selecting images per class from the training set, generating up to examples. All images are resized to pixels and converted to the CIE Lab colour space.
The pix2pix framework is used as a GAN baseline. The generator consists of a U-Net encoder-decoder architecture, conforming to the following structure: . The encoder’s blocks consist of convolutions with spectral normalisation and a stride of, followed by the normalisation layers as explained in Section III-B and a Leaky ReLU activation. The decoder blocks apply the same block composition but using ReLU activations. The last layer is a convolution with a tanh activation producing a -dimensional output space. For layers of encoder-decoder architecture, skip connections between layers at the encoder and layers at the decoder are applied in order to recover the information lost during the downsampling operations. After the generator, the discriminator is used in a form of PatchGAN with the following fully convolutional architecture: . The output layer is a convolution producing the output probability maps, and the input layer takes the concatenation between the original greyscale input and the original or generated colour channels. Regarding the multi-scale discrimination, a setup of different discriminators is used, downsampling the original input volumes by a factor of and .
Figure 3 illustrates the convergence behaviour of the adversarial and regression losses composing the generator’s objective function. A poor response from the adversarial loss can be observed for the baseline pix2pix method, represented by the BN line, which rapidly collapses to a local minimum, giving all the weight of global convergence to the regression loss. A loss of colourfulness occurs after this point, where the regression loss abruptly starts to overfit, leading to the generation of desaturated colours. A considerable improvement results after adding spectral normalisation, the BN+SN line, where weight regularisation helps to stabilise the adversarial loss and slows down the convergence of the regression, hence preserving colourfulness and preventing overfitting. The aforementioned behaviour can be validated by observing the IBN+SN line. Although instance normalisation leads to instability due to increasing the variance of content-based features during training, a sudden improvement of the adversarial loss can be observed after epoch, where the combination of both normalisation techniques leads to colour generalisation while penalising the regression loss and helping the system to prevent desaturation.
The effect of overfitting and lack of colourfulness can be evaluated by comparing deterministic measures, such as the averaged $L_1$ or $L_2$ distance, with perceptually-based ones designed to better capture the visual plausibility of the results. From the results summarised in Table I it can be observed that, unlike the perceptual evaluation, deterministic measures reflect poor performance for those models generating wider colour distributions, e.g. the chrominance distance of a red car colourised with a plausible blue will always be higher than that of the same car colourised with a desaturated colour. Additionally, the perceptual loss is computed using a VGG19 model for image classification pretrained on ImageNet. As proposed in previous works [9, 25], the distance between the convolutional features produced by classifying real and generated samples is averaged as:

$$\mathcal{L}_{perc}(x, \hat{x}) = \sum_{l} \frac{1}{N_l} \, d\big(\phi_l(x), \phi_l(\hat{x})\big),$$

where $x$ is an input tensor, $\hat{x}$ its generated counterpart, $d$ the chosen distance, $\phi_l$ are the convolutional features from layers $l$, and $N_l$ is the number of features of each volume.
| BN + SN | 10.76 | 25.69 | 58.58 |
| IN + SN | 9.89 | 26.73 | 60.36 |
| BN + SN + MD | 11.50 | 25.11 | 58.52 |
| IBN + SN + MD | 11.20 | 25.32 | 57.77 |
Finally, colourfulness is evaluated by estimating the logarithmic colour distribution of the generated samples in the test dataset, comparing the proposed configurations with the prior distribution of the real data. As shown in Figure 4, SN provides improved colourfulness for both channels, reducing the area of intersection to the real data distribution with respect to the baseline methodology (BN), which uses Batch Normalisation. Finally, we improved the BN+SN setting by applying Multi-scale Discrimination (MD), which enables an increase in colourfulness by gaining detail in local and small areas. Examples of colourisation achieved by all analysed methods are presented in Figure 5.
The work presented in this paper improved the state of the art for automatic colourisation using conditional adversarial networks. The proposed GAN architecture integrates techniques from the literature to ensure good training stability and to increase the contribution of the adversarial loss during training, which prevents the GAN from collapsing into desaturated colours. It was also shown that batch normalisation and instance normalisation can be integrated together in a fully-convolutional encoder-decoder architecture within a GAN framework without lowering performance, while encouraging the assignment of more plausible colours. Finally, this work shows that by boosting the performance of the adversarial framework, a reduction of the desaturation effect can be achieved through improved discrimination of unreliable colours.
Variational exemplar-based image colorization. IEEE Trans. on Image Proc. 23 (1).
IEEE Conf. on Computer Vision and Pattern Recognition.
Perceptual losses for real-time style transfer and super-resolution. In European Conf. on Computer Vision.