A Coarse-to-Fine Framework for Learned Color Enhancement with Non-Local Attention

06/08/2019 ∙ by Chaowei Shan, et al. ∙ USTC

Automatic color enhancement aims to automatically and adaptively adjust photos to expected styles and tones. For current learned methods in this field, global harmonious perception and local details are hard to consider well in a single model simultaneously. To address this problem, we propose a coarse-to-fine framework with non-local attention for color enhancement. Within our framework, we divide the enhancement process into channel-wise enhancement and pixel-wise refinement, performed by two cascaded Convolutional Neural Networks (CNNs). In channel-wise enhancement, our model predicts a global linear mapping for the RGB channels of input images to perform global style adjustment. In pixel-wise refinement, we learn a refining mapping using residual learning for local adjustment. Further, we adopt a non-local attention block to capture long-range dependencies from global information for subsequent fine-grained local refinement. We evaluate the proposed framework on a commonly used benchmark and conduct experiments to demonstrate the contribution of each technical component.




1 Introduction

Photo retouching using professional software like Adobe Photoshop or Lightroom is inconvenient for amateur photographers because it requires considerable skill and experience to interactively operate elementary photo enhancement tools (e.g., white balance, contrast, and saturation) step-by-step to achieve the expected adjustments. Due to this complexity, manually adjusting a large number of photos is also time-consuming.

Figure 1: Example result. (a) is a raw image. (b) is the corresponding retouched result of our model.

To address the above problems, automatic color enhancement techniques have been proposed to adaptively map photos to satisfying styles and tones (see the example in Figure 1). The challenge of the task is that the optimal mapping of a pixel is usually highly non-linear and depends not only on the global color distribution of the image but also on local and contextual information [1, 2, 3]. That is, the mapping from the raw image to the enhanced result is non-linear and spatially varying.

Several works have been devoted to automatic color enhancement. [4] built a dataset of 5,000 example input-output pairs and trained a global adjustment model on it. [1] proposed a local adjustment method that finds candidate images in the dataset and searches for the best transformation of each pixel. [5] proposed a learning-to-rank method to enhance images step-by-step like human photographers.

More recently, deep learning techniques have shown powerful capacity in vision tasks [6, 7, 8]. They have also brought large improvements to image color enhancement by learning from large amounts of paired raw-retouched images, such as the MIT-Adobe FiveK dataset [4]. Among these methods, [2] used a fully-connected network to learn the transformation of each pixel with hand-crafted features. [9] proposed to learn image processing operators through a fully convolutional network.

Figure 2: Overview of the proposed coarse-to-fine framework. The enhancement process is divided into channel-wise enhancement and pixel-wise refinement. In channel-wise enhancement, CENet predicts a 12-dim vector (i.e., a weight matrix $W$ and a bias vector $b$) as the parameters of a linear mapping for the RGB channels of the input image. Then pixel-wise refinement is performed by predicting a refined residual image with PRNet. PRNet includes 3 stages: downsampling, feature transformation and upsampling.

In addition, by regarding color enhancement as an image-to-image translation task [10], conditional Generative Adversarial Networks (GANs) [11, 12] such as [10, 13] can also be applied to transform raw images into retouched images. Based on this, [14] collected aligned image pairs of the same scene taken by a phone camera and a DSLR camera, and trained a GAN model to learn the mapping between the paired images. [3] proposed several improvements to current GAN models for image enhancement. [15] proposed an aesthetic-driven image enhancement model based on adversarial learning. These GAN-based methods are also effective for color enhancement. However, generating high-resolution retouched images with realistic effects is still a challenge [16, 17].

Moreover, Deep Reinforcement Learning (DRL) has also contributed to image enhancement. For instance, [17] proposed to learn local exposures with deep reinforcement adversarial learning. [16] proposed an easily interpretable step-by-step enhancement method. The authors treated the task as a Markov Decision Process (MDP) and used a DRL algorithm to learn a sequence of global elementary enhancement actions. Global enhancement methods like [16] are effective at adjusting images to states with global harmonious perception. However, in such global methods, the adjustment of every pixel is independent of the pixel's position, which prevents spatially varying adjustment. In comparison, pixel-wise enhancement methods allow more flexible local adjustment but can also introduce local artifacts, especially when handling high-resolution images [16], which breaks long-range perceptual color consistency and degrades the global perceptual style. In view of this observation, we propose a coarse-to-fine model with non-local attention for automatic color enhancement.

In our model, we divide the automatic color enhancement process into two mappings: a channel-wise enhancement and a pixel-wise refinement. We learn each mapping using a CNN. In this way, channel-wise enhancement performs global adjustment and generates coarse enhanced results with global harmonious perception without introducing complex local artifacts. After that, pixel-wise refinement is applied for local adjustment. With this design, we learn refined residual images that adjust local details on top of the coarse enhanced results, instead of directly learning the mapping from the original image to the target image. The final results therefore keep global harmonious perception and show better details. Another important innovation in this paper is to apply non-local attention blocks [8] within our pixel-wise refinement network, which helps maintain long-range perceptual color consistency by capturing long-distance context from global information for local adjustment. The ablation experiments indicate the effectiveness of the two proposed enhancers, and the proposed model outperforms recent works in quantitative or qualitative comparisons.

This paper is organized as follows: In Section 2, we describe our methods in detail. In Section 3, we present experiments and analysis. In Section 4, we conclude this work.

2 The proposed method

Our coarse-to-fine model consists of two enhancers: a channel-wise enhancement network (named CENet) for global adjustment and a pixel-wise refinement network (named PRNet) for local adjustment. We adopt residual learning [18] to train these two CNNs, which is efficient for tasks in which the input images and the ground truth are largely similar [3, 18]. Our model can be formulated as:

$$\hat{I} = I + R_{ce} + R_{pr} \tag{1}$$

where $\hat{I}$ and $I$ are the retouched image and the original input image, respectively. $R_{ce}$ is the residual image predicted by CENet, and $R_{pr}$ is another residual image predicted by PRNet.
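The additive structure of this formulation can be sketched in a few lines of Python. This is an illustration only: `coarse_to_fine`, `toy_ce` and `toy_pr` are hypothetical names, the two residual functions are stand-ins rather than trained networks, and feeding the coarse result (not the raw input) into the refinement stage is an assumption based on the framework overview.

```python
from typing import Callable, List

Pixel = List[float]   # one RGB pixel with values in [0, 1]
Image = List[Pixel]   # flattened image: a list of pixels


def add_images(a: Image, b: Image) -> Image:
    """Element-wise sum of two images of the same size."""
    return [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(a, b)]


def coarse_to_fine(
    image: Image,
    ce_residual: Callable[[Image], Image],
    pr_residual: Callable[[Image], Image],
) -> Image:
    """Add a global (coarse) residual, then a local (fine) residual."""
    coarse = add_images(image, ce_residual(image))
    # Assumption: the refinement stage sees the coarse result, not the raw input.
    return add_images(coarse, pr_residual(coarse))


# Hypothetical stand-ins for the two trained networks.
def toy_ce(image: Image) -> Image:
    return [[0.1 * c for c in p] for p in image]   # same rule for every pixel


def toy_pr(image: Image) -> Image:
    return [[-0.01, 0.0, 0.01] for _ in image]     # small per-pixel correction


raw = [[0.2, 0.4, 0.6], [0.5, 0.5, 0.5]]
retouched = coarse_to_fine(raw, toy_ce, toy_pr)
```

Because both stages only add residuals, either one can be dropped without changing the other, which is what makes the ablation variants in Section 3.3 straightforward to define.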

2.1 Channel-wise Enhancement

In [16], a sequence of linear arithmetic operations is applied to the RGB channels step-by-step. Suppose $I_t$ is the output image after $t$ steps of linear mapping. We summarize their step-by-step enhancement process as:

$$p_t = W_t \, p_{t-1} + b_t \tag{2}$$

where $p_t$ is a 3-dim vector that represents an arbitrary pixel in image $I_t$. $W_t$ and $b_t$ are a $3 \times 3$ weight matrix and a 3-dim bias vector, respectively. The tuple ($W_t$, $b_t$) defines a unique linear enhancement operation as an element of the discrete, finite enhancement operation set $\mathcal{A}$. For example, to decrease the brightness, $W_t$ is set to a diagonal matrix and $b_t$ to a fixed vector.
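Because each step is an affine map of the pixel, applying several steps in order is itself one affine map. The sketch below checks this numerically. The helper names (`apply_step`, `compose`) and the two example operations are hypothetical; the numeric values are illustrative and are not the operations used in [16].

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Step = Tuple[Matrix, Vector]   # (weight matrix, bias vector)


def mat_vec(w: Matrix, p: Vector) -> Vector:
    return [sum(w[r][c] * p[c] for c in range(3)) for r in range(3)]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def apply_step(step: Step, p: Vector) -> Vector:
    """One linear enhancement step on an RGB pixel: w @ p + b."""
    w, b = step
    return [x + y for x, y in zip(mat_vec(w, p), b)]


def compose(steps: Sequence[Step]) -> Step:
    """Fold steps (applied in order) into one equivalent (w, b) pair."""
    w_total: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b_total: Vector = [0.0, 0.0, 0.0]
    for w, b in steps:
        # w @ (w_total @ p + b_total) + b == (w @ w_total) @ p + (w @ b_total + b)
        b_total = apply_step((w, b), b_total)
        w_total = mat_mul(w, w_total)
    return w_total, b_total


# Illustrative operations only; these values are not taken from [16].
darken: Step = ([[0.9, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.9]], [0.0, 0.0, 0.0])
warm: Step = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.05, 0.0, -0.05])

pixel = [0.5, 0.4, 0.3]
step_by_step = apply_step(warm, apply_step(darken, pixel))
single_map = apply_step(compose([darken, warm]), pixel)
```

Here `step_by_step` and `single_map` agree up to floating-point rounding, so a model that predicts one (matrix, bias) pair can express anything a finite sequence of such steps can.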

We observe that a sequence of linear enhancement operations is equivalent to a single direct linear mapping from the original image to the final result. Thus, instead of learning the operation sequence by DRL [16], we directly learn a single channel-wise linear mapping ($W$, $b$) with a neural network (i.e., CENet) by end-to-end training. As presented in Figure 2, CENet consists of ResNet50 [7] (with the classifier layers removed) followed by 3 fully-connected layers that predict ($W$, $b$) for each input image. Unlike the finite and fixed enhancement operations in [16], we predict the elements of $W$ and $b$ in continuous space, which can be considered an extension of the discrete operation set and provides more flexible and general global adjustments. After that, the residual image $R_{ce}$ can be calculated from ($W$, $b$) and the input image $I$ as follows:


$$r = W p + b - p \tag{3}$$

where $p$ represents an arbitrary pixel in the input image $I$, $r$ represents the pixel at the same location in the CENet residual image $R_{ce}$, and ($W$, $b$) is the predicted linear mapping. The final channel-wise enhancement result is equal to $I + R_{ce}$. Because the linear mapping is differentiable, efficient end-to-end training can be applied by using mini-batch gradient descent [19] to minimize the MSE loss, which is defined as:


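$$L(\theta) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| I^{(i,j)} + R_{ce}^{(i,j)} - I_{gt}^{(i,j)} \right\|_2^2 \tag{4}$$

where $m$ and $n$ are the width and height of the images, respectively, the superscripts $i$ and $j$ represent the column and row index of a pixel, respectively, $I_{gt}$ is the ground-truth retouched image, and $\theta$ represents the parameters of CENet.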
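To make the channel-wise stage concrete, the sketch below applies a 12-dim prediction to every pixel of a small image and scores the result with an MSE. It is a minimal illustration under stated assumptions: the function names are hypothetical, the layout of the 12 values (first 9 as a row-major 3x3 matrix, last 3 as the bias) is assumed, and the loss is normalized by the number of pixels only.

```python
from typing import List, Sequence, Tuple

Pixel = List[float]
Image = List[Pixel]   # flattened image: a list of RGB pixels in [0, 1]


def split_params(v: Sequence[float]) -> Tuple[List[List[float]], List[float]]:
    """Split a 12-dim prediction into a 3x3 matrix and a 3-dim bias.

    Assumption: the first 9 values are the matrix in row-major order.
    """
    if len(v) != 12:
        raise ValueError("expected 12 parameters")
    return [list(v[0:3]), list(v[3:6]), list(v[6:9])], list(v[9:12])


def channel_wise_enhance(image: Image, v: Sequence[float]) -> Image:
    """Apply the same linear mapping w @ p + b to every pixel."""
    w, b = split_params(v)
    return [[sum(w[r][c] * p[c] for c in range(3)) + b[r] for r in range(3)]
            for p in image]


def mse(pred: Image, target: Image) -> float:
    """Squared RGB distance averaged over pixels (assumed normalization)."""
    total = sum((x - y) ** 2 for pp, pt in zip(pred, target) for x, y in zip(pp, pt))
    return total / len(pred)


identity = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
# Illustrative values: the red output mixes all three input channels,
# and the blue output receives a constant offset.
mix = [0.8, 0.1, 0.1, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.1]

raw = [[0.1, 0.2, 0.3], [0.4, 0.4, 0.4]]
unchanged = channel_wise_enhance(raw, identity)
coarse = channel_wise_enhance(raw, mix)
loss = mse(coarse, [[0.13, 0.2, 0.4], [0.4, 0.4, 0.4]])
```

Every pixel receives the same mapping regardless of its position, which is why this stage can set the global style but cannot make spatially varying corrections.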

We note that the piecewise enhancer in [15] might look similar to ours, but the two methods differ. [15] predicts a parameter set for 3 piecewise functions, each of which formulates the adjustment of one channel in CIELab color space. Their adjustment of each channel is independent of the other 2 channels. Unlike [15], we predict parameters for a channel-wise linear mapping in which the adjustment of each channel is a linear combination of all 3 channels, which can be viewed as adjusting one channel with extra information from the other 2 channels.

2.2 Pixel-wise Refinement

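In pixel-wise refinement, we adopt PRNet to predict the color residual $R_{pr}$, which can be formulated as:

$$R_{pr} = F_{pr}(I + R_{ce}) \tag{5}$$

where $F_{pr}$ denotes the mapping of PRNet and $I + R_{ce}$ is the coarse result of channel-wise enhancement. Similar to equation 4, PRNet is also trained by using mini-batch gradient descent to minimize the MSE between $I + R_{ce} + R_{pr}$ and the ground truth.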

Our PRNet has a basic architecture similar to the networks in [13, 20]. It can be divided into 3 stages, as presented in Figure 2. First, the downsampling stage includes one stride-1 convolution and two stride-2 convolutions. Each convolution layer is followed by a batch normalization layer [21] and a ReLU [22]. We set the number of feature maps in the first convolution layer to 16. Second, the feature transformation stage consists of one non-local block [8] and three residual blocks [7]. The non-local operation in a non-local block uses all elements of the input features, whereas a convolution only sums up weighted features within a local receptive field. Therefore, the non-local block can combine both non-local and local information [8], which helps PRNet capture more long-range dependencies from global features for pixel-wise adjustment and improves performance. In practice, we tried placing the non-local block at different locations; the experimental results indicated that placing it at the front of the stage leads to relatively better performance. Third, the upsampling stage is symmetric with the downsampling stage: it includes two stride-2 deconvolutions and one stride-1 convolution, which restores the features to RGB channels.
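The difference between a non-local operation and a local convolution can be seen in a small sketch. The code below follows the softmax-normalized (embedded Gaussian) form of the non-local operation described in [8], but it is a simplification: the learned embeddings and the output projection are replaced by identity maps, the feature map is a plain list of positions, and which variant PRNet actually uses is not stated in this section. `non_local_block` is a hypothetical name.

```python
import math
from typing import List

Feature = List[float]   # channel vector at one spatial position


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_block(features: List[Feature]) -> List[Feature]:
    """Non-local operation over flattened positions, plus a residual connection.

    Simplification: the learned embeddings and output projection of [8]
    are replaced by identity maps.
    """
    out = []
    for x_i in features:
        # Similarity of position i to every position j, near or far.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        weights = softmax(scores)
        # Weighted sum over all positions, not just a local window.
        y_i = [sum(w * x_j[c] for w, x_j in zip(weights, features))
               for c in range(len(x_i))]
        out.append([a + b for a, b in zip(x_i, y_i)])   # residual connection
    return out


# Four positions of a flattened 2x2 feature map with 2 channels each.
feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
attended = non_local_block(feature_map)
```

Positions 0 and 2 are not adjacent, yet they receive identical outputs because each one attends to every position with weights that depend only on feature similarity. This is the kind of long-range consistency the block is meant to provide.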

Figure 3: Qualitative comparisons of different results. (a) and (h) are the raw image and the ground truth, respectively. (b)-(f) are the results of our models. (g) is the result of DPE [3].

3 Evaluation

In this section, we first describe the implementation details and the experimental settings for the ablation study. We then present our experimental results and the corresponding analyses.

3.1 Dataset

We evaluate our method on the MIT-Adobe FiveK dataset [4], which consists of 5000 raw images, each enhanced by 5 professional photographers. Following common practice, we select the results of photographer C as the labels and validate performance on RANDOM250 [1, 2, 16], which is a subset of MIT-Adobe FiveK. The training set consists of the remaining 4750 image pairs. We keep the width-to-height ratio of the images and resize them to 500 pixels on the longer edge.
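The resizing rule and the split sizes amount to simple arithmetic, sketched below. `resized_shape` is a hypothetical helper, and rounding the shorter edge to the nearest integer is an assumption, since the rounding rule is not specified.

```python
from typing import Tuple


def resized_shape(width: int, height: int, longer_edge: int = 500) -> Tuple[int, int]:
    """Scale (width, height) so the longer edge equals `longer_edge`.

    Rounding the shorter edge to the nearest integer is an assumption.
    """
    longest = max(width, height)
    new_width = max(1, round(width * longer_edge / longest))
    new_height = max(1, round(height * longer_edge / longest))
    return new_width, new_height


TOTAL_PAIRS = 5000                       # MIT-Adobe FiveK
TEST_PAIRS = 250                         # RANDOM250
TRAIN_PAIRS = TOTAL_PAIRS - TEST_PAIRS   # remaining pairs used for training

landscape = resized_shape(3000, 2000)    # 3:2 landscape input
portrait = resized_shape(2000, 3000)     # 2:3 portrait input
```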

3.2 Implementation Details

For training CENet and PRNet, image pairs are padded to $500 \times 500$, so the network can adjust arbitrary images with edges no longer than 500 pixels. Image values are normalized to [0, 1]. We adopt the SGD optimizer with a momentum of 0.9. The learning rate is initialized to 0.01 and multiplied by a factor of 0.1 every 10k steps. We set the batch size to 16 and stop training after 200 epochs.
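The learning-rate schedule and the padding step can be written down directly. The sketch below is an illustration under assumptions: `learning_rate` and `pad_to_square` are hypothetical helpers, the padded size is taken to be 500 x 500 (implied by the 500-pixel edge limit), and padding with zeros on the right and bottom is assumed because the padding position and value are not specified.

```python
from typing import List

Row = List[float]


def learning_rate(step: int, base: float = 0.01, gamma: float = 0.1,
                  decay_every: int = 10_000) -> float:
    """Step decay: multiply the rate by `gamma` every `decay_every` steps."""
    return base * gamma ** (step // decay_every)


def pad_to_square(image: List[Row], size: int = 500, fill: float = 0.0) -> List[Row]:
    """Pad a 2-D (single-channel) image on the right and bottom to size x size.

    Padding position and fill value are assumptions.
    """
    height, width = len(image), len(image[0])
    if height > size or width > size:
        raise ValueError("image is larger than the padded size")
    padded = [row + [fill] * (size - width) for row in image]
    padded += [[fill] * size for _ in range(size - height)]
    return padded


small = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # 2 rows, 3 columns
padded = pad_to_square(small, size=4)
rates = [learning_rate(s) for s in (0, 9_999, 10_000, 25_000)]
```

The rate stays at 0.01 for the first 10k steps, drops to about 0.001 at step 10k, and reaches about 0.0001 by step 25k.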

3.3 Ablation Experiments

To evaluate every module in our coarse-to-fine framework, we conduct a series of ablation experiments as below:


  • CE: This method adjusts raw images by channel-wise linear mapping using our CENet.

  • PR: Directly enhance the raw images using PRNet with 18 residual blocks in feature transformation stage.

  • PRNL: The same as PR but with 1 non-local block and 3 residual blocks in feature transformation stage.

  • CE+PR: Coarse-to-fine method with both CENet and PRNet with 3 residual blocks in feature transformation stage.

  • CE+PRNL: The same as CE+PR except for an additional non-local block in the feature transformation stage of PRNet.

3.4 Results Analysis

Ablation Study. We first evaluate all the methods defined in Section 3.3. Table 1 shows the quantitative results of those methods. Like previous methods [1, 2, 3, 16], we compute the error in CIELab color space and the PSNR to compare the numerical magnitude of the enhancements. SSIM is measured to quantitatively compare local artifacts [3, 16]. From the results, we draw the following conclusions:

Method     Error (LAB)   PSNR (dB)   SSIM
CE         10.32         22.85       0.893
PR         10.93         22.07       0.882
PRNL        9.32         23.90       0.905
CE+PR       9.50         23.89       0.906
CE+PRNL     9.10         24.19       0.915
Table 1: Quantitative performance of our methods on RANDOM250 [1, 2, 16].
  1. CE+PR outperforms both CE and PR by a clear margin on all quantitative metrics. This is because CE lacks local adjustments (see the red box in Figure 3(c)), although its results are close to the ground truth in global color style. PR, in turn, is likely to cause complex artifacts due to its flexible local adjustment; its results are more blurred, with local artifacts (see the blue box in Figure 3(b)). In contrast, our coarse-to-fine framework CE+PR keeps a harmonious global color style and refined details in the final results, which leads to quantitative and qualitative improvements.

  2. PRNL also outperforms PR, and its performance is close to that of CE+PR. This indicates that the non-local attention block in PRNet provides a different way to reduce artifacts by capturing long-range dependencies between features.

  3. CE+PRNL achieves the best performance among our methods. Compared with CE+PR and PRNL, it reduces the error in CIELab color space by 0.22 to 0.40, improves the PSNR by around 0.3 dB, and improves the SSIM by around 0.01 (Table 1), which indicates that both embedding a non-local block in CE+PR and adding CE in front of PRNL offer further improvements.
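As a reminder of what the PSNR column measures, here is a minimal sketch of the metric for images normalized to [0, 1]. The peak value of 1.0 and the averaging over all pixels and channels are assumptions; the exact evaluation protocol is not spelled out in this section, and `psnr` is a hypothetical helper name.

```python
import math
from typing import List

Image = List[List[float]]   # flattened image: a list of RGB pixels in [0, 1]


def psnr(pred: Image, target: Image, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    diffs = [(x - y) ** 2 for pp, pt in zip(pred, target) for x, y in zip(pp, pt)]
    mse = sum(diffs) / len(diffs)
    if mse == 0.0:
        return math.inf               # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


target = [[0.6, 0.6, 0.6], [0.2, 0.2, 0.2]]
pred = [[0.5, 0.5, 0.5], [0.3, 0.3, 0.3]]
score = psnr(pred, target)            # every value is off by 0.1
```

With every value off by 0.1, the mean squared error is 0.01 and the score is about 20 dB. Because the scale is logarithmic, a gain of 0.3 dB corresponds to a reduction of roughly 7% in mean squared error.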

Method                Error (LAB)   PSNR (dB)   SSIM
Exemplar-based [1]    15.01         -           -
DeepNet-based [2]      9.85         -           -
DeepRL-based [16]     10.99         -           0.905
CE (ours)             10.32         22.85       0.893
DPE [3]                9.93         23.89       0.906
CE+PRNL (ours)         9.10         24.19       0.915
Table 2: Quantitative performance comparison of different methods on RANDOM250 [1, 2, 16].

Benchmark Comparison. In Table 2, we compare quantitative performance with other methods. We tested DPE [3] on the same dataset with the official implementation. Among channel-wise enhancement methods, our CE achieves a lower error than the DeepRL-based method [16] because our channel-wise linear enhancement operations are more flexible than its fixed operations. As for pixel-wise retouching results, our CE+PRNL has better quantitative performance than [1, 2, 3, 16]. In Figure 3, we can see that our results are also reasonable compared with the ground truth and competitive with DPE [3]. All these results indicate that our model is effective for image color enhancement.

4 Conclusion

In this work, we propose a coarse-to-fine automatic color enhancement framework, which consists of channel-wise enhancement and pixel-wise refinement. In channel-wise enhancement, we learn a linear mapping for the RGB channels. In pixel-wise refinement, refined residual images are predicted for local adjustments. Experimental results demonstrate that each component in our framework contributes to the final performance. In addition, our fully-equipped model outperforms related methods on the benchmark.


  • [1] Sung Ju Hwang, Ashish Kapoor, and Sing Bing Kang, “Context-based automatic local image enhancement,” in European Conference on Computer Vision. Springer, 2012, pp. 569–582.
  • [2] Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu, “Automatic photo adjustment using deep neural networks,” ACM Transactions on Graphics (TOG), 2016.
  • [3] Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang, “Deep photo enhancer: Unpaired learning for image enhancement from photographs with GANs,” in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2018), June 2018.
  • [4] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand, “Learning photographic global tonal adjustment with a database of input / output image pairs,” in The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • [5] J. Yan, S. Lin, S. B. Kang, and X. Tang, “A learning-to-rank approach for image color enhancement,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
  • [6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012.
  • [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [8] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [9] Qifeng Chen, Jia Xu, and Vladlen Koltun, “Fast image processing with fully-convolutional networks,” in IEEE International Conference on Computer Vision, 2017.
  • [10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” CVPR, 2017.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014.
  • [12] Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [13] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • [14] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool, “Dslr-quality photos on mobile devices with deep convolutional networks,” in the IEEE Int. Conf. on Computer Vision (ICCV), 2017.
  • [15] Yubin Deng, Chen Change Loy, and Xiaoou Tang, “Aesthetic-driven image enhancement by adversarial learning,” in 2018 ACM Multimedia Conference. ACM, 2018.
  • [16] Jongchan Park, Joon-Young Lee, Donggeun Yoo, and In So Kweon, “Distort-and-recover: Color enhancement using deep reinforcement learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [17] Runsheng Yu, Wenyu Liu, Yasen Zhang, Zhi Qu, Deli Zhao, and Bo Zhang, “Deepexposure: Learning to expose photos with asynchronously reinforced adversarial learning,” in Advances in Neural Information Processing Systems 31, 2018.
  • [18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [19] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, 1989.
  • [20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016.
  • [21] Sergey Ioffe and Christian Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • [22] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML), 2010.