Pik-Fix: Restoring and Colorizing Old Photos

by Runsheng Xu, et al.

Restoring and inpainting the visual memories that are present, but often impaired, in old photos remains an intriguing but unsolved research topic. Decades-old photos often suffer from severe and commingled degradations such as cracks, defocus, and color fading, which are difficult to treat individually and harder to repair when they interact. Deep learning presents a plausible avenue, but the lack of large-scale datasets of old photos makes addressing this restoration task very challenging. Here we present a novel reference-based end-to-end learning framework that can both repair and colorize old, degraded pictures. Our proposed framework consists of three modules: a restoration sub-network that removes degradations, a similarity sub-network that performs color histogram matching and color transfer, and a colorization sub-network that learns to predict the chroma elements of images conditioned on chromatic reference signals. The overall system makes use of color histogram priors from reference images, which greatly reduces the need for large-scale training data. We have also created a first-of-a-kind public dataset of real old photos, each paired with a ground truth "pristine" version that was manually restored by Photoshop experts. We conducted extensive experiments on this dataset and on synthetic datasets, and found that our method significantly outperforms previous state-of-the-art models in both qualitative comparisons and quantitative measurements.




1. Introduction

While our experience of the visual world is colorful, in the earlier days of photography pictures were usually captured as "black and white", i.e. as gray-scale. As time elapses, they suffer other degradations as well. While consumer services are available for restoring and colorizing old photos, these require significant expertise in image manipulation, which is labor-intensive, costly, and time-consuming. Thus, developing automated systems that can rapidly and accurately colorize and restore old photos is of interest.

Recent deep learning based techniques for image restoration and colorization have achieved high levels of performance. They have been successfully applied to image denoising (chen2018learning; meng2020gia), super-resolution (ledig2017photo; mei2020dsr), deblurring (nah2017deep; tu2022maxim), and compression (balle2016end; chen2020proxiqa). Deep learning has also been applied to colorization, but those models generally require large-scale training datasets (zhang2016colorful) to obtain favorable performance, which is energy-inefficient, labor-intensive, and time-consuming. Towards reducing these data requirements, the authors of (ironi2005colorization; gupta2012image; he2018deep; zhang2019deep) proposed to employ reference/example images to assist the colorization of gray-scale images. He et al. (he2018deep) use separate similarity and colorization networks, the latter operating in the Lab color space. However, there are inherent ambiguities in the colors of natural objects because of the effects of ambient lighting. Better results than pixel-level color matching may therefore be obtained by deriving features that describe the statistical color distributions of the reference pictures. In this direction, Yoo et al. (yoo_fewshot) deploy the means and variances of deep color features, but do not utilize second-order (spatial) distribution models, thereby discarding information descriptive of correlations that exist within image textures and their colors.

Since the spatial statistical color distribution is very likely a useful source of colorization features, we have developed a reference-based, multi-scale spatial color histogram fusion method for image colorization. The use of reference pictures to guide the colorization of gray-scale photographs relieves the need for very large-scale training data. Precisely, we devised a novel end-to-end deep learning framework for old photo restoration, which we dub Pik-Fix, composed of 1) a convolutional sub-network that is trained to conduct degradation restoration, 2) a similarity sub-network that performs reference color matching, and 3) a colorization sub-network that learns to render the final colorful image. As illustrated in Fig. 1, Pik-Fix is capable of restoring and colorizing old degraded photos using only limited training data, making it attractive for data-efficient applications. For experimental evaluation, previous methods (wan2020bringing) mainly rely on quantitative results computed on synthetic data with restoration ground truth, and on qualitative results computed on collected real data without restoration ground truth. To the best of our knowledge, there exists no public dataset of authentic, real-world degraded and gray-scale photos that are associated with pristine reference versions of the same photos. Towards advancing research in this direction, we designed and built a first-of-a-kind real-world old photo dataset consisting of 200 authentic old grayscale photos, where each old photo is paired with a 'pristine' version of it that was manually restored and colorized by Adobe Photoshop editors. Our experimental results show that Pik-Fix is able to outperform state-of-the-art methods on both existing public synthetic datasets and on our real-world old photo dataset, even though it requires much less training data.

The major contributions we make are summarized as follows:

- We propose the first end-to-end deep learning framework (Pik-Fix) that learns to simultaneously restore and colorize old photos, while requiring only a limited amount of training data;

- A reference-based multi-scale color histogram fusion method for image colorization that learns the content-aware transfer functions between the input and reference signals;

- The first publicly available dataset of authentic, real-world degraded old photographs. Each of these 200 authentic contents is paired with a 'pristine' version that was manually restored and colorized by Adobe Photoshop editors;

- Our experimental results show that the model, called Pik-Fix, achieves better visual and numerical performance than state-of-the-art methods on existing synthetic data and on our new real-world dataset.

The rest of the paper is organized as follows. Section 2 reviews previous literature relevant to image colorization and old photo restoration, while Section 3 details our proposed Pik-Fix model. Experimental results and concluding remarks are given in Sections 4 and 5, respectively.

Figure 2. Flow diagram of the triplet networks (restoration, similarity and colorization) that define the flow of visual information processing in Pik-Fix.

2. Related Work

2.1. Image Colorization

Image colorization models aim to colorize gray-scale images with natural colors that accurately reflect the scenes they were projected from. Before the emergence of deep neural networks, conventional methods for image colorization could be categorized into two classes: scribble-based approaches and example-based methods. Scribble-based approaches (levin2004colorization; huang2005adaptive; yatziv2006fast; qu2006manga; luan2007natural; sykora2009lazybrush) rely on user scribbles (e.g., strokes) as guidance to colorize images. In these algorithms, colorization is generally formulated as a constrained optimization problem, whereby the user-defined color scribbles are propagated by maximizing similarity metrics. Although these methods can produce realistic colorized images given detailed guidance from a user, the process is very labor-intensive. Example-based methods extract summary color statistics from reference pictures, selected by the algorithm designer or drawn from a shared database, which are used to conduct colorization of gray-scale photographs (welsh2002transferring; ironi2005colorization; charpiat2008automatic; gupta2012image; liu2008intrinsic; chia2011semantic; he2018deep). While these methods have reduced labor costs, the success of the colorization process greatly depends on the chosen reference images. These kinds of models have used a variety of ways to compute correspondences between the input pictures and the reference data, including pixel-level comparison (welsh2002transferring; liu2008intrinsic), semantic matching (ironi2005colorization; charpiat2008automatic), and super-pixel level (gupta2012image; chia2011semantic) similarities.

More recently, deep neural networks have produced very good results on automatic image colorization problems (larsson2016learning; iizuka2016let; zhang2016colorful; zhao2018pixel; deshpande2015learning; cheng2015deep; zhang2017real; isola2017image). Semantic analysis has been identified as important for successful colorization. For example, (iizuka2016let) and (zhao2018pixel) design two-branch architectures that explicitly learn to fuse local image features with global semantic predictions. The authors of (su2020instance) argue that pixel-level analysis is insufficient to learn subtle variations of object appearance and color, and show that incorporating object-level analysis into the regression architecture yields better performance.

2.2. Image Restoration

There is a very wide array of degradations that can affect older photographs, including some that occur during capture, such as film grain and blur, and others that occur over time, like stains, color fading, and cracks. Traditional computational approaches to restoring photographs that have been digitized usually involve the application of prior constraints such as non-local self-similarity (buades2005non), sparsity (elad2006image), or local smoothness (weiss2007makes). More recently, deep learning-based methods have proved efficacious on many picture restoration tasks, such as image denoising (zhang2017learning; zhang2017beyond; zhang2018ffdnet), super-resolution (dong2014learning; kim2016accurate; ledig2017photo), and deblurring (xu2014deep; sun2015learning; nah2017deep). The success of these methods derives from their ability to simultaneously learn smooth semantic, perceptual, and local image representations.

To deal with difficult situations where multiple picture degradations exist simultaneously, (yu2018crafting) designed a toolbox of operators to handle each specific degradation individually. The authors of (suganuma2019attention) instead employ an attention mechanism that automatically performs operator selection. Along a different vein, (ulyanov2018deep) argues that pretrained deep neural networks can be used as image priors for automatic image restoration without extra training data, since the feature maps they produce accurately capture low-level image statistics.

2.3. Old Photo Restoration.

Old photo restoration aims at removing the degradations of old photos and colorizing them with natural colors. However, most existing models only address one particular aspect of old photo restoration, either color restoration or degradation restoration. The authors of (wang2018high) designed an image-level pixel-to-pixel image translation framework using paired synthetic and real images. A model called Deoldify (jantic) also implements pixel-to-pixel translation using a GAN. (ulyanov2018deep) learns to conduct single-degradation image restoration in an unsupervised manner. (wan2020bringing) first encodes old photos, ground truth images, and synthetic images into separate latent representations, then learns image restoration by translating between these latent spaces.

Although previous methods have been able to deliver perceptual quality by solely conducting colorization or restoration, in most instances old photo restoration requires both colorization and distortion restoration. Our work leverages both learning-based restoration and example-based color restoration methods to obtain old photo restoration that addresses both aspects. Importantly, our example-based colorization technique requires much less training data.

3. Methodology

There are several major challenges that need to be addressed to further advance old photo restoration. Complex, commingled degradations are often observed in real-world old photos; these are impossible to model analytically and difficult to gather into large amounts of representative training data. Further, colorization is an ill-posed, ambiguous problem (Charpiat2008), hence existing models require very large training datasets. The presence of complex distortions can make colorization harder. While this has not been deeply studied, the loss of real information likely impedes inferencing and regression. Conversely, restoring degraded gray-scale photos may be harder without the clues supplied by color, which tends to be smooth and regional. Solving both problems together has the potential to improve the overall solution.

Towards overcoming these challenges, we propose an end-to-end trainable framework as depicted in Fig. 2. Each potentially grayscale, distorted photo is converted into the perceptually uniform CIE Lab color space, described by a luminance channel L and two chrominance channels a and b. Denoting the luminance of the input grayscale photo as L_in, the restoration sub-net attempts to reverse any degradations to produce a restored gray-scale image L_res. Then, L_res and the luminance channel L_ref of an associated reference picture are both fed into the similarity sub-net, which produces a similarity map. Then, the chromatic features from the a and b channels of the reference image are projected onto the input image space. The colorization sub-net accepts L_res and the projected reference color features together as inputs, processing them to generate chroma channels (a_out, b_out), finally concatenating them with L_res to obtain a restored and colorized result.

While previous methods operate by directly feeding the raw a and b channels of the reference image into a colorization network (he2018deep; zhang2019deep), or utilize low-order statistics (e.g., mean and variance) of adaptive instance normalization of the reference image (Xu_2020_CVPR; yoo_fewshot), we instead employ a multi-scale fusion method that combines a spatial-preserving color histogram with deep features. The spatial-preserving color histogram contains useful prior information regarding the spatial relationships of color. The color features and deep features are aggregated over multiple scales, enabling the learning of the colorization process without a large number of training samples. In the following sections, we detail the restoration sub-net, similarity sub-net, colorization sub-net, and the reference selection algorithm.

3.1. Restoration Sub-Net

Broadly, the types of degradations that affect old photos can be divided into two categories: physical defects (e.g., cracks, tears, smudges) and capture defects (e.g., blur, poor exposure) (wan2020bringing). Correcting physical defects typically requires that the receptive fields of the analyzing neural network be large enough to capture impairments that span much of the photo's dimensions. Yet it is also important that the network accesses local information, since capture distortions usually manifest locally, even when globally present.

Here we address the bifurcated nature of old photo distortions by developing a multi-level Residual Dense Network (RDN (rdn)) that serves as the restoration sub-net. RDN models have previously demonstrated outstanding performance on common image restoration tasks like super-resolution, denoising, and deblurring, mainly facilitated by a core module called the residual dense block. The residual dense block is able to extract abundant information via the use of dense connections and contiguous memory mechanisms. While the RDN architecture has been shown to be suitable for handling capture defects, it processes images at a single resolution, restricting the sizes of the filter receptive fields and weakening its ability to correct physical flaws. To enable the RDN to handle the broader range of distortions, we formulated a multi-level RDN that is able to analyze distorted pictures over an enlarged span of receptive field sizes.

As shown in Fig. 2, an original picture, along with 4x and 8x downsampled versions of it, are fed into the top, second, and third levels of the RDN, respectively. Each level consists of three residual dense blocks, each composed of 4 identical residual dense units. The outputs of the lower levels are upsampled via bilinear interpolation, fused via concatenation, and passed through another convolution layer to generate the restored luminance L_res.

3.2. Similarity Sub-Net

After the refined luminance map L_res is obtained from the restoration sub-net, it is passed to the similarity sub-net along with the reference image's luminance channel L_ref. The similarity sub-net is designed to project the reference image features onto the feature space of the input picture. As illustrated in the Similarity Net in Fig. 2, a pre-trained ResNet34 (resnet) is employed to retrieve the layer1, layer2, layer3, and layer4 feature maps of the input and reference pictures, respectively. Note that these feature maps have progressively smaller spatial resolutions and larger numbers of feature channels with increased network depth. Then, four convolution layers are applied to these intermediate features, yielding feature maps f_1, ..., f_4 having the same channel dimension C. We utilize similarity maps at multiple scales to later allow for multi-level feature fusion in the colorization sub-net. Rather than simply resizing and concatenating the four feature maps, we construct learnable coefficients α^i_j, where j = 1 to 4, that assign different weights to the feature maps depending on the target similarity map size. These weighted feature maps are then concatenated together to obtain a feature tensor. For instance, the concatenated feature F^i at scale i would be:

$$F^i = \bigoplus_{j=1}^{4} g\big(\alpha^i_j \odot (W * f_j)\big),$$

where W is the shared convolution filter for the convolution operator *, g is an up-sampling or down-sampling function that aligns the feature size to the target similarity map size, ⊙ indicates element-wise multiplication, and ⊕ denotes the concatenation operation. Subsequently, the three-dimensional feature tensor F^i of shape H_i × W_i × C is reshaped to a two-dimensional matrix of shape H_iW_i × C. Then, the similarity map M^i characterizing the correlation structure between the reference picture and the input picture at scale level i is computed at each pair of spatial locations (u, v) as follows:

$$M^i(u, v) = \frac{\big(F^i_{in}(u) - \mu_{in}\big) \cdot \big(F^i_{ref}(v) - \mu_{ref}\big)}{\big\lVert F^i_{in}(u) - \mu_{in} \big\rVert_2 \,\big\lVert F^i_{ref}(v) - \mu_{ref} \big\rVert_2},$$

where μ_in and μ_ref are the mean feature vectors. The softmax function is then applied to the elements of the similarity map along the x-axis so each mapped element lies within [0, 1]. This similarity map is then passed to the colorization sub-net, whose task is simplified since the reference picture's information is aligned with that of the input image.
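As an illustration, here is a minimal NumPy sketch of this mean-subtracted correlation computation with row-wise softmax; the random matrices stand in for the convolved ResNet34 features at one scale.

```python
import numpy as np

def similarity_map(feat_in, feat_ref):
    """Correlation between input and reference features at one scale.
    feat_in, feat_ref: (H*W, C) matrices of per-location deep features."""
    a = feat_in - feat_in.mean(axis=0)    # subtract mean feature vector
    b = feat_ref - feat_ref.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                         # (H*W, H*W) cosine similarities
    # Row-wise softmax: each input location's weights over the reference
    # locations sum to 1 and lie in [0, 1].
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

sim = similarity_map(np.random.rand(16, 8), np.random.rand(16, 8))
print(sim.shape)  # (16, 16)
```

The resulting row-stochastic matrix is what lets reference chroma statistics be "warped" onto input-image coordinates by a single matrix multiplication.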

3.3. Colorization Sub-Net

To tackle the aforementioned colorization problem, we develop a method of guiding the process using the color prior in the reference picture. Specifically, we utilize a space-preserving color histogram (SPHist) computed on the reference pictures. Unlike the traditional color histogram, the SPHist retains spatial picture information while modelling the probability that each pixel color falls within each bin. Importantly, the SPHist is differentiable, and thus it can be used in an end-to-end neural network trained using gradient back-propagation. We accomplish this by using Gaussian expansion (schutt2017_gaussian) to separately approximate the SPHist of the a and b channels using K histogram bins. Then, the probability of the pixel at location (x, y) falling into the k-th bin is expressed as follows:

$$h_k(x, y) = \frac{\exp\big(-(c(x, y) - \mu_k)^2 / 2\sigma^2\big)}{\sum_{j=1}^{K} \exp\big(-(c(x, y) - \mu_j)^2 / 2\sigma^2\big)},$$

where c(x, y) is the value of the a (or b) channel of the reference picture at spatial coordinate (x, y); the spread σ of the Gaussian distribution is fixed; and μ_k is a learnable parameter representing the center of bin k, which is initialized as:

$$\mu_k = c_{min} + \Big(k - \tfrac{1}{2}\Big)\,\frac{c_{max} - c_{min}}{K},$$

where c_min and c_max are the minimum and maximum possible values of the chroma channels (-1 and 1, respectively, in our experiments). Although the bins are equally distributed at the start, after training over several iterations their distribution becomes unequal, since some colors are rarer than others 'in the wild'. The extracted color histogram is reshaped and down-sampled to the four available scales to enable matrix multiplication with the corresponding scales of similarity maps, leading to a warped SPHist that carries a similarity-guided, space-preserving color histogram from the reference picture. The warped SPHist is then fed into different levels of the encoder in the colorization sub-network to conduct color prediction.
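A minimal NumPy sketch of the Gaussian-expansion soft histogram follows. The bin centers here are fixed at their uniform initialization (they are learnable in the model), and the values of K and σ are illustrative assumptions.

```python
import numpy as np

def soft_histogram(channel, n_bins=16, sigma=0.1, vmin=-1.0, vmax=1.0):
    """Space-preserving soft color histogram (SPHist) via Gaussian expansion.
    channel: (H, W) chroma values in [vmin, vmax].
    Returns (H, W, n_bins): per-pixel soft assignment over the bins."""
    # Uniformly spaced initial bin centers (learned during training).
    centers = vmin + (np.arange(n_bins) + 0.5) * (vmax - vmin) / n_bins
    diff = channel[..., None] - centers              # (H, W, n_bins)
    logits = -(diff ** 2) / (2 * sigma ** 2)
    # Normalize over bins so each pixel's assignments sum to 1; every step
    # is smooth, so the histogram is differentiable end-to-end.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

h = soft_histogram(np.random.uniform(-1, 1, (8, 8)))
print(h.shape)  # (8, 8, 16)
```

Because the per-pixel bin dimension is kept rather than summed over space, the histogram preserves spatial layout, which is what allows it to be warped by the similarity maps.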

The backbone of our colorization network employs a global U-Net shape (unet) with densenet blocks (denseblock). There are four dense blocks in the encoder, containing 6, 12, 24, and 16 dense units, respectively. The decoder shares a similar structure with the encoder, and bi-linear interpolation is employed to upscale the forwarded features between the dense blocks. The warped SPHist extracted from the reference picture is concatenated with the intermediate features after each dense block in the encoder, yielding inputs to the fusion module. The fusion module contains a dense block with six dense units and a convolution layer, which is responsible for efficiently combining the traditional color heuristics and deep features to enable accurate colorization. Since the reference information is fused during the intermediate stages instead of at the start, the model learns to deal with dissimilarities between the input and reference pictures in a multi-scale manner.

Input Reference Pix2pix (wang2018high) Deoldify (jantic) InstColor (su2020instance) He et al. (he2018deep) Ours
Figure 3. Visual comparisons against state-of-the-art colorization methods on DIV2K. With only 800 training images, our method accomplishes visually pleasing colorization, and our results are significantly better than the others.

3.4. Training Objective

In order to 1) simultaneously train the restoration and colorization nets, 2) exploit the rich color information available in the reference pictures, and 3) improve the visual quality of the overall restored output, we employed a weighted sum of diverse objective functions against which the entire Pik-Fix system can be trained end-to-end. Among these, the luminance reconstruction loss between the restored luminance L_res and the ground truth luminance L_gt is used to supervise the training of the restoration subnet:

$$\mathcal{L}_{lum} = \lVert L_{res} - L_{gt} \rVert_1.$$
However, it is well-known that relying on pixel-wise norms as loss functions tends to generate blurred picture restorations (ledig2017photo). Hence, we also used a measure of perceptual loss that has been shown to deliver better quality visual results on a variety of restoration tasks (ledig2017photo; ding2021comparison; wang2018esrgan). The relu2_2, relu3_2, relu4_2, and relu5_2 layers of a VGG19 (simonyan2014very) are adapted to this task:

$$\mathcal{L}_{per} = \sum_{l} \frac{1}{C_l H_l W_l} \big\lVert \phi_l(L_{res}) - \phi_l(L_{gt}) \big\rVert_2^2,$$

where φ_l is a feature map of shape C_l × H_l × W_l. While the VGG19 loss is not based on a biological model, it has been observed to provide perceptually pleasing results.
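A toy sketch of this layer-wise feature loss follows; `toy_extract` is a hypothetical stand-in (progressive average pooling) for the VGG19 activations, used only to make the structure of the loss concrete.

```python
import numpy as np

def perceptual_loss(img_a, img_b, extract):
    """VGG-style perceptual loss: size-normalized squared error between
    feature maps at several layers. `extract` maps an image to a list of
    feature maps (a stand-in here for the VGG19 relu layers)."""
    loss = 0.0
    for fa, fb in zip(extract(img_a), extract(img_b)):
        loss += np.mean((fa - fb) ** 2)  # mean = sum / (C*H*W) per layer
    return loss

def toy_extract(img):
    """Hypothetical extractor: progressively pooled copies of the image
    stand in for relu2_2 ... relu5_2 activations."""
    feats, x = [], img
    for _ in range(4):
        h, w = x.shape
        x = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean((1, 3))
        feats.append(x)
    return feats

a = np.random.rand(32, 32)
print(perceptual_loss(a, a, toy_extract))  # 0.0
```

Identical images give zero loss at every layer; any discrepancy in the pooled "features" contributes a positive, size-normalized penalty.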

The colorization subnet is intended to transfer color distributions from the reference picture to the predicted output pictures. Thus, we also use a histogram loss that measures the distance between the color histograms of the output and reference pictures, as expressed by the Earth Mover's Distance (EMD):

$$\mathcal{L}_{hist} = \sum_{k=1}^{K} \big|\, \mathrm{CDF}_k(h_{out}) - \mathrm{CDF}_k(h_{ref}) \,\big|,$$

where CDF_k(h) is the k-th element of the cumulative density function of the probability mass function h. Here, h_out and h_ref are one-dimensional differentiable histograms formed by globally pooling the SPHist features of the output and reference pictures, respectively.
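For one-dimensional normalized histograms, the EMD reduces to the L1 distance between cumulative distributions, as in this sketch:

```python
import numpy as np

def emd_loss(hist_out, hist_ref):
    """Earth Mover's Distance between two 1-D normalized histograms,
    computed as the L1 distance between their cumulative distributions."""
    return float(np.abs(np.cumsum(hist_out) - np.cumsum(hist_ref)).sum())

h1 = np.array([0.5, 0.5, 0.0, 0.0])  # mass in the low bins
h2 = np.array([0.0, 0.0, 0.5, 0.5])  # same mass shifted to high bins
print(emd_loss(h1, h1))  # 0.0
print(emd_loss(h1, h2))  # 2.0
```

Unlike a bin-wise L1 or L2 distance, the cumulative form penalizes mass in proportion to how far it must move across bins, which matches the "transfer the color distribution" intent of the loss.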

We also use the chroma reconstruction loss to impose spatial consistency between the predicted chromatic channels and the ground truth channels, supplementing the histogram loss by directly penalizing pixel-wise chromatic errors:

$$\mathcal{L}_{chroma} = \lVert (a_{out}, b_{out}) - (a_{gt}, b_{gt}) \rVert_1.$$
The adversarial loss is a recipe often used to enhance the visual quality of images synthesized using GANs (ledig2017photo; jiang2021enlightengan; kupyn2018deblurgan). We utilize a PatchGAN (isola2017image) discriminator D to ensure that all of the local patches of the enhanced output channels are visually similar to realistic chroma maps. The adversarial loss is expressed as:

$$\mathcal{L}_{adv} = \mathbb{E}\big[\log D(a_{gt}, b_{gt})\big] + \mathbb{E}\big[\log\big(1 - D(a_{out}, b_{out})\big)\big].$$

Finally, we combine all of the above loss functions into an overall weighted loss under which Pik-Fix is trained:

$$\mathcal{L} = \lambda_{lum}\mathcal{L}_{lum} + \lambda_{per}\mathcal{L}_{per} + \lambda_{hist}\mathcal{L}_{hist} + \lambda_{chroma}\mathcal{L}_{chroma} + \lambda_{adv}\mathcal{L}_{adv}.$$
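A minimal sketch of the weighted combination; the λ values shown are placeholders, not the weights used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of the individual training objectives."""
    return sum(weights[name] * value for name, value in losses.items())

# Placeholder weights and example per-objective loss values.
weights = {"lum": 1.0, "perceptual": 0.1, "hist": 0.05, "chroma": 1.0, "adv": 0.01}
losses = {"lum": 0.2, "perceptual": 0.5, "hist": 0.3, "chroma": 0.1, "adv": 0.8}
print(total_loss(losses, weights))  # 0.373
```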

Input Expert Repair Wan et al. (wan2020bringing) Deoldify (jantic) InstColor (su2020instance) He et al. (he2018deep) Ours
Figure 4. Visual comparisons against state-of-the-art colorization and restoration methods on RealOld dataset. It shows that with the limited synthetic training data from Pascal, our model is able to fix most of the degradation and deliver plausible colorization.
Dataset           | DIV2K (w/o degradation) | Pascal VOC (w/o degradation) | Pascal VOC (w/ degradation)
Method            | PSNR↑  SSIM↑  LPIPS↓    | PSNR↑  SSIM↑  LPIPS↓         | PSNR↑  SSIM↑  LPIPS↓
Pix2pix           | 21.12  0.872  0.138     | 20.89  0.782  0.200          | 20.37  0.732  0.231
DeOldify          | 23.65  0.913  0.128     | 23.96  0.873  0.117          | 21.45  0.789  0.192
He et al.         | 23.53  0.918  0.125     | 23.85  0.925  0.114          | -      -      -
InstColorization  | 22.45  0.914  0.131     | 23.95  0.932  0.111          | -      -      -
Wan et al.        | -      -      -         | -      -      -              | 18.01  0.598  0.421
Ours              | 23.95  0.925  0.120     | 24.01  0.940  0.100          | 22.22  0.828  0.186
Table 1. Quantitative comparison on the DIV2K and Pascal VOC validation datasets. Upward arrows indicate that a higher score denotes better image quality; downward arrows that lower is better. The best score for each measure is highlighted.

3.5. Reference Picture Selection

As discussed earlier, Pik-Fix requires color reference pictures as additional inputs to guide the colorization process. Thus, we developed an automatic reference curation model that selects good reference pictures from a given database for an input grayscale picture, during either the training or inference phase.

An ideal reference should be both visually and semantically similar to the target image to be colorized, while providing rich and appropriate color information for the colorizing process. Inspired by the deep features used in perceptual similarity models (zhang2018unreasonable; ding2020image), we leveraged a pre-trained VGG19 net (simonyan2014very) as a backbone to extract intermediate deep feature maps. Then, we measured the degrees of textural and structural similarity between each given grayscale input image and each of the available color images using the global means and variances/covariances of their feature maps, respectively (ding2020image). Finally, a weighted summation of the texture and structure similarities is used to determine which of the color reference pictures in the training set has the greatest similarity to the grayscale input picture to be repaired and colorized. When deploying the trained Pik-Fix system, users may either accept a recommended reference picture automatically retrieved from an available corpus, or they may manually select a reference picture, according to their preference.
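A simplified sketch of such mean/covariance-based candidate ranking follows. The score formulas and the α weight are illustrative stand-ins for the DISTS-style measure (ding2020image), and random arrays stand in for VGG19 feature maps.

```python
import numpy as np

def texture_structure_score(feat_gray, feat_color, alpha=0.5):
    """Texture term compares global feature means; structure term compares
    feature covariance against the variances (SSIM-like ratios, both <= 1,
    maximized when the two feature maps are identical)."""
    mu_g, mu_c = feat_gray.mean(), feat_color.mean()
    texture = 2 * mu_g * mu_c / (mu_g ** 2 + mu_c ** 2 + 1e-8)
    cov = np.mean((feat_gray - mu_g) * (feat_color - mu_c))
    structure = (2 * cov + 1e-8) / (feat_gray.var() + feat_color.var() + 1e-8)
    return alpha * texture + (1 - alpha) * structure

def pick_reference(query_feat, candidate_feats):
    """Return the index of the highest-scoring candidate reference."""
    scores = [texture_structure_score(query_feat, c) for c in candidate_feats]
    return int(np.argmax(scores))

np.random.seed(0)
q = np.random.rand(16, 16)
cands = [np.random.rand(16, 16), q.copy(), np.random.rand(16, 16)]
print(pick_reference(q, cands))  # 1: the identical candidate wins
```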

4. Experiments

4.1. Experimental Setting

Datasets. We trained and evaluated our method on three datasets: Div2K (Agustsson_2017_CVPR_Workshops), Pascal (pascal), and RealOld, summarized as follows.

Div2K (Agustsson_2017_CVPR_Workshops): Div2K was introduced to assist the development of image super-resolution. It contains high-resolution RGB images split 800/100 into training and validation sets. In our experiments, we only use Div2K to illustrate the effectiveness of the Pik-Fix colorization subnet. To demonstrate that Pik-Fix does not require a large number of potential reference pictures, during inference we limited the selection of reference pictures to the training set of only 800 images.

Pascal (pascal): We randomly selected 10,000/1000 images from Pascal to serve as training data and testing data. As on the Div2K dataset, the reference pictures used when inferencing were retrieved from the training set. When testing on Pascal, two different experiments were conducted: simultaneous image restoration and colorization, and only image colorization. In order to produce realistic defect pictures, similar to those used in (wan2020bringing), we hired Photoshop experts to mimic the degradation patterns in real old photos (but not from our newly created RealOld dataset) on images from the Pascal dataset, using alpha compositing with randomized transparency levels, thereby generating synthetic old photos. We also added Gaussian blur, and simulated severe photo damage by randomly setting polygonal picture regions to pure white.

Real-World Old Photos (RealOld): To validate the efficacy and generalizability of our model under realistic conditions, we collected digitized copies of 200 real old black & white photographs. Each of these photos was manually restored and colorized digitally by Photoshop experts. To the best of our knowledge, this is the first real-world old photo dataset that has aligned "ground truth" 'pristine' photos to enable pixel-to-pixel processing and comparison. We are making this dataset publicly available to allow other researchers to develop advanced algorithms that can both colorize and repair old photos impaired by scratches, blur, cracks, wear, film grain noise, and other physical and capture distortions. In our experiments, RealOld is used for testing models that have been trained on Pascal. Furthermore, we randomly downloaded 2,000 RGB portraits from Google Images, and utilized our reference selection algorithm to pick the best references among them when testing on the RealOld data.

Evaluation Metrics. We report the PSNR and the structural similarity index (SSIM) (wang2004image) between the reference and restored/colorized pictures. As an alternative, we also use the learned perceptual image patch similarity (LPIPS) model (zhang2018perceptual).
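For completeness, PSNR can be computed as in this short sketch (for images scaled to [0, 1]); SSIM and LPIPS require their respective reference implementations.

```python
import numpy as np

def psnr(img_a, img_b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((img_a - img_b) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)   # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(a, b), 2))  # 20.0
```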

Training Details. We trained Pik-Fix in an end-to-end manner using the Adam solver (adam) with its default momentum parameters. The initial learning rate was set to 0.0001 and was exponentially decreased at the end of each epoch using a decay rate of 0.99. The loss balance weights λ were fixed throughout training. For data augmentation, randomly cropped patches were included in the training data. At each epoch, we selected one of the following two methods of reference patch/picture generation: 1) a patch was cropped from an RGB image at a location different from that of the input patch, then processed with color jittering and a small affine transformation to create a reference picture; or 2) a picture was randomly selected from the training set (excluding the ground truth) and treated as the reference. All compared models were trained on DIV2K and Pascal for 20 epochs, respectively, on a single RTX 3090 Ti GPU.
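The first reference-generation method can be sketched as follows; the crop size and jitter range are illustrative assumptions, and per-channel gain stands in for the full color-jitter and affine transform.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reference_patch(rgb, size=32):
    """Augmentation sketch: crop a patch at a random location, then apply
    color jitter (random per-channel gain) to create a reference that is
    similar but not identical in color to the input."""
    h, w, _ = rgb.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    patch = rgb[y:y + size, x:x + size].copy()
    gains = rng.uniform(0.8, 1.2, size=3)      # per-channel color jitter
    return np.clip(patch * gains, 0.0, 1.0)

ref = make_reference_patch(rng.random((64, 64, 3)))
print(ref.shape)  # (32, 32, 3)
```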

4.2. Experimental Results

Since no existing work has explicitly considered the simultaneous correction of picture degradation and colorization, we compared Pik-Fix with models developed for image-to-image translation (Pix2pix (pix2pix2017)), image restoration (Wan et al. (wan2020bringing)), and colorization (Deoldify (jantic), He et al. (he2018deep), and InstColorization (su2020instance)). For fair comparisons, we do not evaluate InstColorization and He et al. (which do not restore degradations) on Pascal VOC with degradation. Likewise, we did not compare against Wan et al. (which does not colorize) on DIV2K or Pascal VOC without degradation. All compared methods were trained from scratch using the training strategies and code provided by their authors.

4.2.1. Quantitative Comparison

Table 1 compares the performance of Pik-Fix and the other models on two public datasets under three scenarios: DIV2K without degradation, Pascal VOC without degradation, and Pascal VOC with degradation. Pik-Fix delivered the best results on all three evaluation metrics as compared with these state-of-the-art models. For example, on the DIV2K dataset without degradation, Pik-Fix achieved scores of 23.95 PSNR, 0.925 SSIM, and 0.120 LPIPS, better than any of the compared models.

Table 2 shows the results obtained on the RealOld dataset, where again Pik-Fix achieved the highest performance scores among all compared models. These results highlight the resilience of Pik-Fix when transferring from synthetic training data to real-world old photos.

4.2.2. Qualitative Comparison

Figure 3 provides a qualitative comparison on the DIV2K dataset. Pik-Fix produced pictures having vivid, realistic colors, while the compared models delivered incomplete colorization. Fig. 4 shows results on the RealOld dataset, demonstrating that Pik-Fix is able to simultaneously perform picture restoration and colorization, producing perceptually satisfying results.

Method PSNR↑ SSIM↑ LPIPS↓
Pix2pix 16.80 0.684 0.320
DeOldify 17.14 0.723 0.287
He et al. 16.72 0.707 0.314
InstColorization 16.86 0.715 0.312
Wan et al. 16.99 0.709 0.303
Ours 17.20 0.758 0.258
Table 2. Quantitative comparisons of restoration/colorization performance of the compared models on the RealOld dataset.
Method Top 1 Top 2 Top 3 Top 4 Top 5
Pix2Pix(pix2pix2017) 3.1 11.8 24.3 41.6 67.7
Deoldify(jantic) 10.5 33.9 56.5 79.3 92.8
He et al. (he2018deep) 4.6 26.3 44.1 61.9 83.2
InstColorization (su2020instance) 8.0 26.3 53.6 75.5 92.2
Wan et al. (wan2020bringing) 23.1 38.6 48.9 59.1 72.0
Ours 50.6 65.9 75.3 85.1 94.4
Table 3. User rankings of algorithm performance on the RealOld dataset. The percentage (%) of users choosing each model ranking is shown.
Figure 5. Sensitivity analysis of reference image selection.

4.2.3. User Study

We conducted a user study to compare the visual results of all the methods. We randomly selected 100 old photos from the RealOld dataset and asked 15 users to rank the results based on their subjective visual impressions. The aggregated rankings are presented in Table 3. Pik-Fix was selected as the single top performer in 50.6% of the judgments and outperformed all of the other methods at every ranking level, further illustrating the strong performance of the Pik-Fix picture restoration and colorization engine.
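The cumulative Top-k percentages reported in Table 3 can be derived from raw per-judgment rankings as sketched below; the data layout is an assumption, and the numbers here are toy values rather than the study's data:

```python
import numpy as np

def topk_table(rankings, k_max=5):
    """Convert per-judgment rankings into cumulative Top-k percentages.
    rankings[t][m] is the rank (1 = best) that judgment t assigned to
    method m; returns one row of percentages per method."""
    r = np.asarray(rankings)                    # (judgments, methods)
    table = []
    for m in range(r.shape[1]):
        table.append([100.0 * np.mean(r[:, m] <= k)
                      for k in range(1, k_max + 1)])
    return table

# Toy example: two judgments over three methods.
table = topk_table([[1, 2, 3], [2, 1, 3]], k_max=3)
```

Each row is non-decreasing and ends at 100% once k reaches the number of methods, matching the shape of Table 3.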

4.3. Ablation Studies

Multi-scale SPHist. We conducted three experiments on the DIV2K dataset to evaluate the effectiveness of the multi-scale SPHist algorithm: 1) following (he2018deep), the transferred ab channels of the reference picture and the luminance channel of the input are concatenated and fed into the backbone of the colorization sub-net; 2) instead of computing the multi-scale SPHist of the reference image, the multi-scale raw ab channels of the reference image are used as input; 3) only a single-scale color histogram is fused with the shallower layers of the encoder. The results reported in Table 4 validate the importance and usefulness of the proposed multi-scale SPHist.

Multi-scale Similarity Maps. To validate the efficacy of multi-scale similarity maps relative to a single similarity map, we conducted two experiments: 1) we used the single-scale similarity map proposed in (zhang2019deep); 2) no similarity map was applied to the reference image. Table 5 shows the benefits brought by the use of multi-scale similarity maps.

Multi-level RDN. To study the possible performance gains brought by building a multi-level Residual Dense Network, we also tried the original RDN (rdn) as the backbone for old photo restoration and tested the modified system on the Pascal dataset (pascal) with degradation. The results in Table 6 show that the multi-level design significantly improves the quality of the restored outputs.

Method PSNR↑ SSIM↑ LPIPS↓
Input ab fusion 22.978 0.902 0.130
Multi-scale ab fusion 23.233 0.910 0.127
Single-scale histogram fusion 23.631 0.906 0.125
Multi-scale histogram fusion 23.952 0.925 0.120
Table 4. Ablation study of multi-scale SPHist on DIV2K.
Method PSNR↑ SSIM↑ LPIPS↓
No similarity map 22.817 0.910 0.131
Single-scale similarity map 23.803 0.922 0.126
Multi-scale similarity map 23.952 0.925 0.120
Table 5. Ablation study of multi-scale similarity maps on DIV2K.
Method PSNR↑ SSIM↑ LPIPS↓
Single-level RDN 21.89 0.818 0.190
Multi-level RDN 22.22 0.828 0.186
Table 6. Ablation study of multi-level RDN on Pascal with degradation.

Sensitivity to Reference Image Selection. To examine the robustness of our model to the selection of reference images, we compared results on the DIV2K dataset using reference pictures ranging from the computed most similar one (rank 1) to the least similar one (rank 10) among the ten selected reference pictures. As shown in Fig. 5, the performance of Pik-Fix is robust against reference picture selection, with only slight degradations: PSNR fell from 23.95 (rank 1) to 22.25 (rank 10), SSIM fell from 0.925 to 0.899, and LPIPS rose from 0.12 to 0.15.

5. Concluding Remarks

We propose the first end-to-end system, called Pik-Fix, that is able to simultaneously restore and colorize old photos. The overall system contains several sub-networks, each designed to handle a single defect, but trained holistically. A hierarchical restoration subnet recovers the luminance channel from physical and capture distortions, followed by a colorization subnet that uses space-preserving color histograms computed from the reference picture to estimate the chroma components, conditioned on luminance similarity. Our method learns efficiently from a limited amount of training data, using a hybrid of learning- and example-based structures and a similarity subnet that aligns the reference picture with the input. Extensive experimental results show that Pik-Fix attains excellent performance, both visually and numerically, on synthetic and real old photo datasets, as compared with state-of-the-art models. Moreover, we also created the first publicly available real-world old photo dataset repaired by Photoshop experts, which we hope will facilitate further research on deep learning-based old photo restoration and colorization.