
DCCF: Deep Comprehensible Color Filter Learning Framework for High-Resolution Image Harmonization

Image color harmonization algorithms aim to automatically match the color distribution of foreground and background images captured in different conditions. Previous deep learning based models neglect two issues that are critical for practical applications, namely high-resolution (HR) image processing and model comprehensibility. In this paper, we propose a novel Deep Comprehensible Color Filter (DCCF) learning framework for high-resolution image harmonization. Specifically, DCCF first downsamples the original input image to its low-resolution (LR) counterpart, then learns four human-comprehensible neural filters (i.e. hue, saturation, value and attentive rendering filters) in an end-to-end manner, and finally applies these filters to the original input image to get the harmonized result. Benefiting from the comprehensible neural filters, we can provide a simple yet efficient handler for users to cooperate with the deep model to get the desired results with very little effort when necessary. Extensive experiments demonstrate the effectiveness of the DCCF learning framework: it outperforms the state-of-the-art post-processing method on the iHarmony4 dataset at the images' full resolutions, achieving a 7.63% relative improvement on MSE together with a corresponding gain on PSNR.


1 Introduction

Image composition, which aims at generating a realistic image with a given foreground and background, is one of the most widely used technologies in photo editing. However, since the foreground and background may be captured in different conditions, simple cutting and pasting operations cannot make them compatible in color space, as shown in Fig. 1. Therefore, photo editors spend a lot of time manually tuning the color distribution when accomplishing real-world composition tasks.


Figure 1: Illustration of color harmonization

In the past decades, a large number of automatic color harmonization algorithms have been proposed. Traditional methods [3, 24, 25, 26, 23, 17, 32, 31] tend to extract low-level handcrafted features to match the color statistics of the foreground to the background, which may perform poorly when the contents of foreground and background are vastly different. Since Tsai et al. [33] proposed a data-driven deep learning framework for color harmonization, the research community has made rapid progress over a short period of time, and deep learning based methods have become the mainstream. However, we argue that previous deep learning based color harmonization methods [33, 7, 29, 21, 6, 4, 11] have neglected two problems that are critical for practical applications.

First, high-resolution (HR) images are rarely taken into account in previous works when deep color harmonization models are designed and evaluated. Previous deep models in color harmonization follow the evaluation protocol proposed by Tsai et al. [33], which resizes the original images to 256×256 or 512×512 resolution and calculates the objective metrics (i.e. MSE and PSNR) at this low resolution to evaluate model performance, instead of at the original image resolution. The principal reason is that these methods simply employ UNet-style [27] networks to directly predict pixel-level RGB values, which is memory- and computation-intensive, so that even modern GPUs cannot handle HR images. However, color harmonization frequently needs to be applied to HR images in real-world applications, where the resolution is 3000×3000 or even higher. Therefore, previous deep models that perform well at low resolutions may perform poorly when applied to real-world HR images.

Second, model comprehensibility and manual control mechanisms are rarely considered in previous works. Imagine the scenario where the harmonization result of the network is flawed and the photo editor wants to make some modifications based on the network's prediction to avoid tuning from scratch, such as the hue adjustment in Fig. 1. It is thus essential for a friendly color harmonization system to provide a human-understandable way of cooperating with the deep model. However, previous methods employ variants of the common image-to-image translation framework [16] that directly predicts the harmonization result. It is nearly impossible to provide comprehensible tools for humans to interact with these deep models, because the prediction processes are "black boxes" that are inscrutable to photo editors.

Inspired by the idea of learning desired image transformations that reduce computing and memory burdens by a large margin for image enhancement [8], in this paper we propose a novel Deep Comprehensible Color Filter (DCCF) learning framework for high-resolution image harmonization. Specifically, we first downsample the input to its low-resolution (such as 256×256) counterpart, then learn four comprehensible neural filters (i.e. hue, saturation, value and attentive rendering filters) in a novel end-to-end manner with supervisions constructed from both RGB and HSV color spaces, and finally apply these filters to the original input image to get the harmonized result. Compared with previous deep learning based color harmonization methods that may fail for high-resolution images, our neural filter learning framework is insensitive to image resolution and performs well on datasets whose resolutions range from 480p to 4K. Besides, because the parameters of the filters (especially the hue, saturation and value filters) are forced to learn decoupled, meaningful chromatic functions, it becomes possible to provide comprehensible tools for humans to interact with these deep models in the traditional chromatic way they are familiar with. It is worth noting that learning comprehensible neural filters is not easy: our experiments show that learning weights directly from supervisions on the hue, saturation and value channels causes poor performance. To handle this, we construct three novel supervision maps that approximate the effects of HSV color space while making the deep model converge well.
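For concreteness, the sketch below traces this downsample-predict-apply flow in NumPy. It is a minimal illustration, not the released implementation: `backbone`, `predict_filter_maps` and `apply_filters` are hypothetical callables standing in for the components described above.

```python
import numpy as np
import cv2  # used only for resizing in this sketch

def harmonize(image_hr, mask_hr, backbone, predict_filter_maps, apply_filters, lr=256):
    """image_hr: HxWx3 float image in [0, 1]; mask_hr: HxW foreground mask."""
    h, w = image_hr.shape[:2]
    # 1) Downsample the image and mask to the low-resolution branch (e.g. 256x256).
    image_lr = cv2.resize(image_hr, (lr, lr), interpolation=cv2.INTER_LINEAR)
    mask_lr = cv2.resize(mask_hr.astype(np.float32), (lr, lr))
    # 2) Extract foreground-aware features and predict per-pixel filter parameters.
    feats = backbone(np.dstack([image_lr, mask_lr]))            # e.g. 256x256x32 features
    filter_maps_lr = predict_filter_maps(feats)                 # dict of 256x256xC filter maps
    # 3) Upsample the lightweight filter maps (not the features) to full resolution.
    filter_maps_hr = {k: cv2.resize(v, (w, h)) for k, v in filter_maps_lr.items()}
    # 4) Apply value -> saturation -> hue -> attentive rendering on the original image.
    return apply_filters(image_hr, filter_maps_hr)
```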

We train and evaluate our approach on the open-source iHarmony4 dataset [6] at the original image resolutions, which range from 480p (HCOCO) to 4K (HAdobe5k). Since previous deep learning based color harmonization models perform poorly when directly applied to HR images, we compare against them combined with various post-processing methods. Extensive experiments demonstrate that our approach makes the prediction process comprehensible and outperforms these methods as well. We also provide a simple handler with which humans can cooperate with the learned deep model to make desired modifications on top of the network's prediction, avoiding tuning from scratch.

In a nutshell, our contributions are threefold.


  • We propose an effective end-to-end deep neural filter learning framework that is insensitive to image resolution, which makes deep learning based color harmonization practical for real-world high-resolution images.

  • To the best of our knowledge, we are the first to design four types of novel neural filters (i.e. hue, saturation, value and attentive rendering filters), together with their learning functions and learning strategies, that make the prediction process and result comprehensible to humans in the image harmonization task. Meanwhile, we provide a simple yet efficient handler for users to cooperate with the deep model to get the desired results with very little effort when necessary.

  • Our approach achieves state-of-the-art performance on the color harmonization benchmark for high-resolution images and outperforms the state-of-the-art post-processing method in terms of relative improvements on both MSE and PSNR.

2 Related Work

Image harmonization. In this subsection, we focus on deep learning based methods. These methods regard color harmonization as a black-box image-to-image translation task. [33] apply the well-known encoder-decoder U-Net structure with skip connections and train the network with multi-task learning, simultaneously predicting pixel values and semantic segmentation. [29] insert a pretrained semantic segmentation branch into the encoder backbone and introduce a learnable alpha-blending mask to borrow useful information from the input image; both works use semantic features in their networks. [6, 4] make composite images harmonious via domain transfer. [7, 13] both use attention mechanisms in their networks. [1] propose a generative adversarial network (GAN) architecture for automatic image compositing, which considers geometric, color, and boundary consistency at the same time. [11] solve image harmonization via separable harmonization of reflectance and illumination, where reflectance is harmonized through a material-consistency penalty and illumination is harmonized by learning and transferring light from background to foreground. Recently, some image harmonization works have started to focus on high-resolution images. [18] use a self-supervised learning strategy to train the network with small local patches of high-resolution images, but at inference it still follows the two-stage post-processing strategy. [15, 28] learn global parameters to adjust image attributes such as lightness and saturation. [10] learns pixel-wise curves to perform low-light image enhancement.

Smart upsampling. Processing high-resolution images is difficult due to the huge computational burden of deep networks and limited GPU memory. A common approach to accelerate high-resolution processing is to first downsample the image, apply the time-consuming operator at low resolution, and upsample back. To preserve edge gradients, guided filter upsampling [14] uses the original high-resolution input as a guidance map. [9] fit a transformation recipe from a compressed input/output pair and then apply the recipe to the high-quality input. Bilateral guided upsampling [2] approximates the operator with grids of local affine transformations and applies them to the high-resolution input, thus controlling the operator complexity. [8] predict the local affine model with a fully convolutional network that is trained end to end and captures multi-scale semantic information. [34] propose a guided filter layer that uses point-wise convolution to approximate the median filter, so it can be plugged into networks and optimized jointly. [19] introduce an extra network to learn deformable offsets for each pixel, so the interpolation neighbourhood is predicted online during upsampling. [5, 35] learn 3D lookup tables (LUTs) to obtain high-resolution results, but the learned transformation still lacks interpretable meaning.

Figure 2: An overview of our proposed color harmonization framework. It consists of two primary parts: the comprehensible neural color filter module and the high resolution assembly module. Given an input image and the corresponding foreground mask, a low-resolution feature extraction backbone first downsamples them to a low-resolution version, such as 256×256, and employs an encoder-decoder network to extract foreground-aware high-level semantic features. The comprehensible neural color filter module then learns the value filter, saturation filter, hue filter and attentive rendering filter simultaneously based on the features extracted from the backbone. Each filter learns the parameters of a transformation function in a per-pixel manner. The high resolution assembly module finally extracts and upsamples the specific channel of each DCCF's output to assemble the final result. In short, the input image is unharmonious; the successive intermediate outputs are value-harmonized, saturation-harmonized and hue-harmonized in turn; and the final output is a refinement of the hue-harmonized result by an attention module.

3 Methodology

3.1 Framework Overview

The neural filter learning framework for high-resolution image color harmonization is illustrated in Fig. 2. It consists of two primary parts: comprehensible neural color filter module and high resolution assembly module.

Firstly, given an original input image (H×W×3) and the corresponding foreground mask (H×W×1), the low-resolution feature extraction backbone downsamples them to their low-resolution counterparts (256×256), then concatenates them as the input (256×256×4) to extract foreground-aware high-level semantic representations (256×256×32). The choice of backbone structure is flexible, and the iDIH-HRNet architecture [29] is used in this paper.

Subsequently, the comprehensible neural color filter module generates a series of deep comprehensible color filters (DCCFs) with spatial shape 256×256, where each pixel holds learnable parameters that define a transformation function to be applied to the input image. Gathering the functions of all pixels builds up a filter map. The design of DCCFs and their cooperating mechanism are detailed in Section 3.2.

Finally, the high resolution assembly module upsamples these filter maps to their full-resolution counterparts so that they can be applied at the resolution of the original input image. Meanwhile, since each DCCF only changes a specific aspect of the image, an assembly strategy is required to ensure there are no conflicts between the filters' operating procedures. The details are discussed in Section 3.3.

The entire network is trained in an end-to-end manner and benefits from the supervision of the full-resolution images. Moreover, we observe that traditional losses in RGB color space are not sufficient for achieving state-of-the-art quality. We therefore propose auxiliary losses in Section 3.4 for each DCCF's output to ensure that it functions as expected.

3.2 Comprehensible Neural Color Filter Module

The comprehensible neural color filter module plays a core role in our proposed high-resolution image color harmonization framework. We take inspiration from the well-known HSV color model, which is widely used in the photo editing community. Compared with RGB color space, HSV is much more intuitive and easier for humans to use when tuning colors with a computer.

Our module consists of four neural filters, namely the value filter, saturation filter, hue filter and attentive rendering filter, as illustrated in Fig. 2. Each filter is generated by a 1×1 convolutional layer (the attentive rendering filter has an extra sigmoid layer for normalization) built on top of the low-resolution feature extraction backbone.

3.2.1 Value Filter


Figure 3: Illustration of the pixel-level value adjustment function/curve. The value-filtered result illustrated in Fig. 2 can be regarded as the result whose value is well tuned. Zoom for better view.

The customized pointwise nonlinear value transformation function is defined as:

(1)   $\hat{V}(p) = b(p) + \sum_{k=0}^{K-1} w_k(p)\,\mathrm{ReLU}\!\left(V(p) - \tfrac{k}{K}\right)$

where $V(p)$ denotes the value channel of the input image in HSV color space at pixel $p$, $b(p)$ and $w_k(p)$ are learnable parameters, and $K$ is a hyper-parameter which we set to 8 in this paper. The function can be regarded as an arbitrary nonlinear curve approximated by a stack of parameterized ReLUs: $b(p)$ controls the lower bound of the value range, and the $w_k(p)$ control the nonlinearity of the curve. The parameters $b(p)$ and $w_k(p)$ are stored for each pixel along the channel dimension of the value filter.

We argue that different local regions should have different adjustment curves for better harmonization quality. As illustrated in Fig. 3, the two marked points have a large gap in the original value distribution (the left is darker, the right is brighter); our DCCF successfully allocates proper curves for these two regions, while a global adjustment degrades the overall aesthetic.
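As a concrete illustration of this per-pixel curve (a sketch assuming the ReLU-stack parameterization reconstructed in Eq. (1), not the authors' exact code), the following NumPy function applies K parameterized ReLU pieces to the value channel:

```python
import numpy as np

def apply_value_filter(v, b, w):
    """Per-pixel nonlinear value curve (sketch of Eq. (1) as reconstructed above).

    v: HxW value channel in [0, 1]
    b: HxW per-pixel offset (lower bound of the adjusted value)
    w: HxWxK per-pixel slopes of the K ReLU pieces
    """
    K = w.shape[-1]
    knots = np.arange(K, dtype=np.float32) / K        # curve knots 0, 1/K, ..., (K-1)/K
    pieces = np.maximum(0.0, v[..., None] - knots)    # HxWxK ReLU responses
    return np.clip(b + np.sum(w * pieces, axis=-1), 0.0, 1.0)

# Usage with K = 8 pieces (the hyper-parameter value used in the paper):
v = np.random.rand(256, 256).astype(np.float32)
b = np.zeros((256, 256), dtype=np.float32)
w = np.full((256, 256, 8), 0.125, dtype=np.float32)   # one mild example curve per pixel
v_adjusted = apply_value_filter(v, b, w)
```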

3.2.2 Saturation Filter


Figure 4: Illustration of the pixel-level saturation adjustment. The saturation-filtered result illustrated in Fig. 2 is the intermediate result. The change of saturation is consistent with the predicted distributions. Zoom for better view.

We use a single parameter per pixel to control saturation. The customized nonlinear saturation transformation function for each pixel is defined as:

(2)   $\hat{x}_c = x_c + \lambda_S \cdot g\!\left(x_c - \mathrm{med}(x_R, x_G, x_B)\right)$

where $x_c$ denotes the R, G or B value of the pixel, $\lambda_S$ is our learned parameter, and $g(\cdot)$ is a monotonic function used to avoid saturation overflow.

If $\lambda_S > 0$, the channel values below the pixel's median are suppressed while the values above it are enhanced, so the saturation increases; the opposite holds when $\lambda_S < 0$. We visualize the effectiveness of $\lambda_S$ in Fig. 4. DCCF allocates a positive $\lambda_S$ for most of the pixels in this de-saturated input image and obtains an enhanced result.
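The snippet below sketches this behaviour under the median-push form reconstructed in Eq. (2); the choice of tanh as the bounded monotonic function g is our own assumption for illustration:

```python
import numpy as np

def apply_saturation_filter(rgb, lam):
    """Per-pixel saturation push (sketch). rgb: HxWx3 in [0,1]; lam: HxW learned parameter."""
    med = np.median(rgb, axis=-1, keepdims=True)   # per-pixel median channel
    g = np.tanh(rgb - med)                         # bounded monotonic push (our choice of g)
    return np.clip(rgb + lam[..., None] * g, 0.0, 1.0)

# lam > 0 pushes channels above the median up and those below it down,
# widening the channel spread (more saturation); lam < 0 does the opposite.
```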

3.2.3 Hue Filter

We define an affine color transformation function for each pixel in RGB color space as:

(3)   $\hat{x} = \mathbf{R}\,x + \mathbf{t}$

where $x$ denotes the RGB values of one pixel in the image, and the learnable 3×4 affine transformation matrix $[\mathbf{R} \,|\, \mathbf{t}]$ contains a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$.

We suppose that one could find a suitable rotation matrix in RGB color space that is equivalent to a corresponding radian movement on the hue ring in HSV color space [12], which is further discussed in the supplementary. Based on this assumption, it is equivalent to learn an affine color transformation function in RGB color space whose rotation part provides the parameters of the corresponding hue rotation function in HSV color space. We refer readers to [12] for technical details. Note that [12] requires an extra linearization between sRGB and RGB space, which is mainly a gamma correction and is thus compatible with our learnable value curve (Eq. (1)).
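A minimal sketch of applying such a per-pixel 3×4 affine transform in NumPy (illustrative only; the tensor layout is an assumption):

```python
import numpy as np

def apply_hue_filter(rgb, affine):
    """Apply a per-pixel 3x4 affine color transform (sketch).

    rgb:    HxWx3 image in [0, 1]
    affine: HxWx3x4 tensor; affine[..., :3] is the rotation part R and
            affine[..., 3] the translation t, i.e. out = R @ rgb + t per pixel.
    """
    R = affine[..., :3]
    t = affine[..., 3]
    out = np.einsum('hwij,hwj->hwi', R, rgb) + t
    return np.clip(out, 0.0, 1.0)
```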

3.2.4 Attentive Rendering Filter

We employ a simple yet effective attentive rendering filter, similar to the attention mask in [29], to further improve the harmonization result after the hue filter.

For inference, we take the previous filters' harmonization result $\tilde{I}$ and the input image $I$ and perform alpha blending as illustrated in Fig. 2:

(4)   $\hat{I} = \alpha \odot (\mathbf{A}\,\tilde{I}) + (1 - \alpha) \odot I$

where $\alpha$ is the per-pixel parameter ranging in $[0, 1]$ used to smartly borrow information from the input image, and $\mathbf{A}$ is an extra affine matrix used to refine the appearance of $\tilde{I}$.
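The blending step can be sketched as follows (assuming the reconstructed Eq. (4); `alpha` and `A` stand for the per-pixel attention and refinement parameters predicted by this filter):

```python
import numpy as np

def attentive_render(harmonized, original, alpha, A):
    """Alpha-blend the filtered result with the input image (sketch of Eq. (4)).

    harmonized, original: HxWx3 in [0, 1]; alpha: HxW in [0, 1]; A: HxWx3x4 refinement."""
    refined = np.einsum('hwij,hwj->hwi', A[..., :3], harmonized) + A[..., 3]
    blended = alpha[..., None] * refined + (1.0 - alpha[..., None]) * original
    return np.clip(blended, 0.0, 1.0)
```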

3.3 High Resolution Assembly Module

The biggest reduction in computation comes from the design choice that each DCCF is generated in the low-resolution branch. We then upsample each DCCF's filter map to match the resolution of the original input image. The effectiveness of this step rests on the common assumption that neighbouring regions require similar tuning filters.

Afterwards, we propose a split-and-concat strategy to assemble the results of applying each filter. Specifically, as shown in Fig. 2, we use the value filter, saturation filter and hue filter to extract the harmonized value, saturation and hue channels respectively, assemble these three channels into a harmonized image, and finally use the attentive rendering filter to obtain the final harmonized image. We illustrate the implementation details of saturation assembling as an example in Fig. 5.

Figure 5: Illustration of assembly module details. We take the procedure of the saturation filter as an example. The engaged channel (i.e. saturation) is colored for visualization.
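A minimal sketch of the split-and-concat assembly, using matplotlib's HSV conversion for illustration (the released code may implement the conversion differently):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def assemble_hsv(x_v, x_s, x_h):
    """Split-and-concat assembly (sketch): take V from the value-filtered image,
    S from the saturation-filtered image and H from the hue-filtered image."""
    v = rgb_to_hsv(np.clip(x_v, 0, 1))[..., 2]
    s = rgb_to_hsv(np.clip(x_s, 0, 1))[..., 1]
    h = rgb_to_hsv(np.clip(x_h, 0, 1))[..., 0]
    return hsv_to_rgb(np.stack([h, s, v], axis=-1))
```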

3.4 Training Loss

In the following description, we use superscripts to distinguish the low-resolution and high-resolution streams.

3.4.1 High Resolution Supervision

Since the area of the foreground region varies a lot among training examples, we adopt the foreground-normalized MSE loss [29] between the ground truth and both the intermediate and final predicted results. This loss uses the area of the foreground mask as a normalization factor to stabilize the gradient on the foreground object. Differently from [29], our loss is calculated on both the low-resolution and high-resolution streams.
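For reference, a sketch of the foreground-normalized MSE as we understand it from [29]; the `min_area` constant used to guard against tiny foregrounds is an assumed value:

```python
import numpy as np

def fg_normalized_mse(pred, target, mask, min_area=100.0):
    """Foreground-normalized MSE (sketch following our reading of [29]).

    pred, target: HxWx3 images; mask: HxW foreground mask in {0, 1}.
    The squared error is normalized by the foreground area so that small and large
    foregrounds contribute comparable gradients; min_area is an assumed lower bound."""
    sq_err = np.sum((pred - target) ** 2, axis=-1)
    area = max(float(mask.sum()), min_area)
    return float(sq_err.sum() / area)
```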

3.4.2 Auxiliary HSV Loss

A straightforward solution to supervising the filters is to use the standard HSV decomposition equations to obtain the HSV channels. However, we observe that this strategy introduces high-frequency content in the output channels, as visualized in Fig. 6(a)-(f), which may degrade the convergence of the network according to our experiments in Fig. 6(g).

Therefore, we heuristically design an approximated version of the HSV loss to stabilize network training. It is mainly based on a combination of several differentiable basic image processing filters (e.g. whitening, blurring, blending) that yields smooth approximations of the three attributes, which benefits the training procedure. The implementation details are given in the supplementary.

Auxiliary HSV losses for hue, saturation and value are calculated with MSE in the low-resolution stream only, due to memory considerations. We also apply total variation regularization on the predicted filters to increase smoothness. The overall training loss is defined as follows, where the λ's are hyper-parameters:

(5)   $\mathcal{L} = \mathcal{L}_{mse}^{lr} + \mathcal{L}_{mse}^{hr} + \lambda_{hsv}\,(\mathcal{L}_H + \mathcal{L}_S + \mathcal{L}_V) + \lambda_{tv}\,\mathcal{L}_{tv}$
(a) Standard V (b) Smooth V (c) Standard S (d) Smooth S (e) Standard H (f) Smooth H
(g) Ablations on the loss function: testing errors on iHarmony4 with different loss contributions.
Figure 6: Visualization of the standard HSV decomposition and our ad-hoc smoothed version. The smoothed versions of H, S and V keep the global chromatic properties while making the network converge better, as demonstrated in sub-figure (g).

4 Experiments


Method
Entire Dataset HCOCO HAdobe5k HFlickr Hday2night
MSE PSNR MSE PSNR MSE PSNR MSE PSNR MSE PSNR
Input image 177.99 31.22 73.03 33.53 354.46 27.63 270.99 28.20 113.07 33.91
iDIH-HRNet[29] - - 19.96 38.25 - - 93.50 32.42 71.01 35.77
iDIH-HRNet[29]+BU 43.56 34.98 34.40 35.45 37.82 35.47 104.69 30.91 50.87 37.41
iDIH-HRNet[29]+GF[14] 35.47 36.00 25.93 36.70 34.51 36.03 85.05 32.01 49.90 37.67
iDIH-HRNet[29]+BGU[2] 26.85 37.24 18.53 37.90 26.71 37.50 66.26 33.19 51.96 37.23
DCCF 24.65 37.87 17.07 38.66 23.34 37.75 64.77 33.60 55.76 37.40
Table 1: Quantitative performance comparison on the iHarmony4 test sets. We are the first to evaluate at the original resolution on this dataset. The best results are in bold. We use the recent state-of-the-art network iDIH-HRNet [29] and several post-upsampling methods as our baselines. '-' means results could not be obtained due to memory limitations. Our method is trained in an end-to-end manner and outperforms the baselines. More quantitative results with different backbones are shown in the supplementary.
(a) Input (b) BU (c) GF (d) BGU (e) DCCF (f) GT
Figure 7: Visualization of high-resolution results. Foregrounds are marked with a red contour. Bilinear upsampling, guided filter upsampling and bilateral guided upsampling are denoted BU, GF [14] and BGU [2] respectively. GT denotes ground truth. Our DCCF has not only a better global appearance but also refined high-resolution details. Zoom for better view. For more visual results please refer to the supplementary materials.

In this section, we first describe the experimental setups and implementation details, then compare our approach with the state of the art quantitatively and qualitatively. Finally, we carry out some ablation studies and provide a simple comprehensible interface to interact with our model. We also present more results and potential limitations in the supplementary materials.

4.1 Experimental Setups

We use iHarmony4 [6] as our experimental dataset, which contains 73146 images. It consists of 4 subsets: HCOCO, HFlickr, HAdobe5k and HDay2night. The image resolution varies from 640×480 to 6048×4032, which is difficult for learning based color harmonization algorithms to process at the original full resolution. We refer readers to [6] for dataset details.

Lacking the ability to process high resolutions, previous methods [33, 7, 29, 21, 6, 4, 11] resize all images in the dataset to 256×256 and evaluate their performance via Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) at this extremely low resolution. However, we argue that evaluating algorithms at the images' original full resolution is much more meaningful for practical applications. In this paper, we therefore report MSE and PSNR at the images' original full resolution instead of 256×256.

4.2 Implementation Details

Our DCCF learning framework is differentiable and can be stacked on the head of any deep feature extraction network. In this paper, we adopt the recent state-of-the-art harmonization network iDIH-HRNet [29] as our backbone. For the feature extraction backbone, we downsample the inputs (images and corresponding foreground masks) to 256×256, following the common setting of previous deep harmonization models. For the detailed training procedure and hyper-parameter settings, please refer to our official PyTorch [22] code: https://github.com/rockeyben/DCCF.

4.3 Comparison with Baselines

In order to evaluate the effectiveness of our proposed DCCF learning framework, we construct two kinds of baselines: (1) applying recent state-of-the-art methods directly on the original input images to get full-resolution harmonized results; (2) applying recent state-of-the-art methods on low-resolution inputs (256×256) to predict low-resolution harmonized images and then adopting variants of state-of-the-art post-processing methods to get the final full-resolution harmonized results. In this paper, we choose iDIH-HRNet, as provided by Sofiiuk et al. [29], as the deep model, and Bilinear Upsampling (BU), Guided Filter upsampling [14] (GF) and Bilateral Guided Upsampling [2] (BGU) as post-processing methods. For a fair comparison, we adopt the same low-resolution (i.e. 256×256) feature extractor as [29] for our DCCF learning framework. The performance comparison is shown in Table 1 and some harmonization results are shown in Fig. 7. For a comparison of efficiency metrics such as inference time and memory usage, please refer to the supplementary.

Applying [29] directly at full resolution (first row in Table 1) performs poorly. The principal reason is that [29] is designed and trained at a resolution of 256×256; directly applying the model at the original full resolution during testing leads to serious feature misalignment. Moreover, this strategy fails on the HAdobe5k subset (max resolution: 6048×4032) due to memory limitations.

Applying post-processing to the low-resolution predictions of [29] solves the memory problem. However, BU leads to blurring, especially for the high-resolution subset HAdobe5k (see Fig. 7). We therefore adopt the more advanced post-processing algorithms GF [14] and BGU [2], which take the original full-resolution image as detail guidance to mitigate the blurring caused by upsampling. Table 1 shows that these methods outperform bilinear upsampling by a large margin, with the best one, BGU [2], achieving 26.85 on MSE and 37.24 on PSNR. Even so, the best post-processing result falls behind our approach: DCCF achieves 24.65 on MSE and 37.87 on PSNR, a relative improvement on both MSE and PSNR compared with [29] + BGU [2].

4.4 Qualitative Results

We conduct two evaluations to compare the subjective visual quality of DCCF with the other methods, presented in Table 2. First, we adopt LPIPS [36] to evaluate the visual perceptual similarity between the harmonized image and the ground-truth reference; it computes a feature distance between two images, and a lower score indicates a better result. Second, we randomly select 20 images, present the DCCF result together with the baseline results on screen after shuffling, and ask 12 users to judge global appearance and detail texture with scores from 1 to 5 (higher is better). Our DCCF achieves the best result on both metrics, which is consistent with the quantitative performance.

Method iDIH-HRNet[29] +BU iDIH-HRNet[29]+GF [14] iDIH-HRNet[29]+BGU[2] DCCF
LPIPS [36] 0.0459 0.0291 0.0201 0.0186
User Score 2.0541 2.6583 3.3041 3.5583
Table 2: Qualitative results. We evaluate visual perceptual quality with the DNN-based image quality assessment metric LPIPS [36] and a user study.

4.5 Ablation Studies

In this subsection, we carry out a number of ablations to analyze our DCCF learning framework from the aspect of filter and loss function design.

Filter design: An evaluation of filter design is shown in Table 3a. DBL [8] is an end-to-end "black-box" bilateral learning method proposed for image enhancement; we adapt it to our DCCF learning framework to process high-resolution image harmonization. "DCCFs w.o. attention" is our DCCF learning method without the attentive rendering filter. Even DCCFs w.o. attention improves on the DBL filter [8] by 1.56 (5.58%) on MSE, which demonstrates that the performance of our model does not come only from end-to-end training: our divide, conquer and assemble strategy, which learns explicit, meaningful parameters, also benefits the color harmonization task. DCCFs with attention further improves on the DBL filter [8] by 3.27 (11.71%) on MSE.

Loss functions: The impact of the loss functions in our DCCF learning framework is shown in Table 3b. Note that the standard H channel is an angle while our approximated H is a scalar, so we train the standard H with a cosine distance and the approximated smooth H with a Euclidean distance. The numerical results show that supervision from HSV color space is essential for our DCCF learning framework: simply adding losses on the standard HSV channels remarkably decreases MSE from 35.17 to 27.86. The principal reason may be that the parameters of our DCCFs (except for the last attentive rendering filter) are designed from practical tuning criteria in HSV color space used by color artists and have explicit chromatic meaning, so the model converges better when supervisory signals from HSV color space are added, as demonstrated in Fig. 6(g). It is worth noting that using the smooth approximated HSV loss described in Section 3.4.2 instead of the standard HSV loss further decreases MSE to 24.65, which demonstrates the effectiveness of the proposed smoothed HSV loss.

Method MSE PSNR
DBL [8] 27.92 37.48
DCCFs w.o. attention 26.36 37.80
DCCFs with attention 24.65 37.87
(a) Ablations for filter design

Method MSE PSNR
RGB losses only 35.17 36.81
+ standard HSV 27.86 37.39
+ smooth HSV 24.65 37.87
(b) Ablations for loss function

Table 3: Ablation studies on the iHarmony4 dataset. (1) For filter design, DBL [8] is a "black-box" per-pixel linear filter applied directly to RGB images; DCCFs with attention achieves the best result. (2) For the losses, supervision from HSV is essential for the DCCF learning framework; the supervision we constructed from HSV (i.e. smooth HSV) improves over standard HSV by 3.21 (11.65%) on MSE.

4.6 Comprehensible Interaction with Deep Model

Figure 8: Illustration of comprehensible interaction with the deep harmonization model in the parameter space of hue adjustment. The abscissa represents the hue-angle parameter and the ordinate the strength parameter. Sampled parameter values and their results are listed. Zoom for better view.

Benefiting from the comprehensible neural filters, we can provide a simple yet efficient handler for users to cooperate with the deep model to get the desired results with very little effort when necessary. We provide two adjustable parameters in each of the three dimensions of hue, saturation and value for users to express their color adjustment intentions. Due to space limitations, we only explain hue adjustment as an example; the other two dimensions are similar and are detailed in the supplementary.

For hue, we define two parameters that represent the angle on the hue circle and the strength of the user's color intention, respectively. We calculate the desired rotation matrix mentioned in Eq. (3) as:

(6)

The final rotation matrix is then obtained by combining this user-specified rotation with the rotation predicted by the deep model; applying it to the image takes both the user's global intention and the locally adaptive adjustments from the deep model into account.
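One plausible realization of this combination is sketched below; the blending rule and the function and parameter names are our own assumptions, not the released interface:

```python
import numpy as np

def hue_rotation_about_gray_axis(theta):
    """Rotation of RGB space about the gray diagonal [1,1,1] by theta radians
    (Rodrigues formula); rotating about this axis shifts hue (see Sec. 12.1)."""
    k = np.ones(3) / np.sqrt(3.0)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def combine_user_intention(R_pred, theta, beta):
    """Hypothetical combination rule: blend the user's global hue rotation (angle
    theta, strength beta in [0, 1]) with the per-pixel rotations R_pred (HxWx3x3)
    predicted by the hue filter, then compose them."""
    R_user = beta * hue_rotation_about_gray_axis(theta) + (1.0 - beta) * np.eye(3)
    return np.einsum('ij,hwjk->hwik', R_user, R_pred)
```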

In short, users can express their color intention through the angle parameter and decide how much color is injected by controlling the strength parameter, as illustrated in Fig. 8. It is worth noting that when users interact with the deep model in one dimension (such as hue above), they need not worry about side effects in the other two dimensions of the network's prediction.

5 Conclusion

In this paper, we propose comprehensible image processing filters to tackle the image harmonization problem. By gradually modifying an image's attributes (value, saturation and hue), we obtain results that are not only of high quality but also understandable. This also facilitates human cooperation with deep models when performing image harmonization. We further leverage these filters to handle high-resolution images in a simple yet effective way. We hope that DCCF can open a new direction for image harmonization.

References

  • [1] B. Chen and A. Kae (2019) Toward realistic image compositing with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8415–8424.
  • [2] J. Chen, A. Adams, N. Wadhwa, and S. W. Hasinoff (2016) Bilateral guided upsampling. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–8.
  • [3] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and Y. Xu (2006) Color harmonization. In ACM SIGGRAPH, pp. 624–630.
  • [4] W. Cong, L. Niu, J. Zhang, J. Liang, and L. Zhang (2021) BargainNet: background-guided domain translation for image harmonization. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
  • [5] W. Cong, X. Tao, L. Niu, J. Liang, X. Gao, Q. Sun, and L. Zhang (2022) High-resolution image harmonization via collaborative dual transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18470–18479.
  • [6] W. Cong, J. Zhang, L. Niu, L. Liu, Z. Ling, W. Li, and L. Zhang (2020) DoveNet: deep image harmonization via domain verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8394–8403.
  • [7] X. Cun and C. Pun (2020) Improving the harmony of the composite image by spatial-separated attention module. IEEE Transactions on Image Processing (TIP) 29, pp. 4759–4771.
  • [8] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand (2017) Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–12.
  • [9] M. Gharbi, Y. Shih, G. Chaurasia, J. Ragan-Kelley, S. Paris, and F. Durand (2015) Transform recipes for efficient cloud photo enhancement. ACM Transactions on Graphics (TOG) 34 (6), pp. 1–12.
  • [10] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong (2020) Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1780–1789.
  • [11] Z. Guo, H. Zheng, Y. Jiang, Z. Gu, and B. Zheng (2021) Intrinsic image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16367–16376.
  • [12] P. Haeberli (1993) Matrix operations for image processing. Grafica Obscura website.
  • [13] G. Hao, S. Iizuka, and K. Fukui (2020) Image harmonization with attention-based deep feature modulation. In British Machine Vision Conference (BMVC).
  • [14] K. He, J. Sun, and X. Tang (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35 (6), pp. 1397–1409.
  • [15] Y. Hu, H. He, C. Xu, B. Wang, and S. Lin (2018) Exposure: a white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37 (2), pp. 26.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134.
  • [17] J. Jia, J. Sun, C. Tang, and H. Shum (2006) Drag-and-drop pasting. ACM Transactions on Graphics (TOG) 25 (3), pp. 631–637.
  • [18] Y. Jiang, H. Zhang, J. Zhang, Y. Wang, Z. Lin, K. Sunkavalli, S. Chen, S. Amirghodsi, S. Kong, and Z. Wang (2021) SSH: a self-supervised framework for image harmonization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4832–4841.
  • [19] B. Kim, J. Ponce, and B. Ham (2019) Deformable kernel networks for guided depth map upsampling. CoRR abs/1903.11286.
  • [20] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele (2007) Joint bilateral upsampling. ACM Transactions on Graphics (TOG) 26 (3), pp. 96.
  • [21] J. Ling, H. Xue, L. Song, R. Xie, and X. Gu (2021) Region-aware adaptive instance normalization for image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9361–9370.
  • [22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NeurIPS) 32, pp. 8026–8037.
  • [23] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. In ACM SIGGRAPH, pp. 313–318.
  • [24] F. Pitie, A. C. Kokaram, and R. Dahyot (2005) N-dimensional probability density function transfer and its application to color transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Vol. 2, pp. 1434–1439.
  • [25] F. Pitié and A. Kokaram (2007) The linear Monge-Kantorovitch linear colour mapping for example-based colour transfer. In Proceedings of the European Conference on Visual Media Production (CVMP), pp. 1–9.
  • [26] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer Graphics and Applications (CG&A) 21 (5), pp. 34–41.
  • [27] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
  • [28] J. Shi, N. Xu, Y. Xu, T. Bui, F. Dernoncourt, and C. Xu (2021) Learning by planning: language-guided global image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13590–13599.
  • [29] K. Sofiiuk, P. Popenova, and A. Konushin (2021) Foreground-aware semantic representations for image harmonization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1620–1629.
  • [30] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] K. Sunkavalli, M. K. Johnson, W. Matusik, and H. Pfister (2010) Multi-scale image harmonization. ACM Transactions on Graphics (TOG) 29 (4), pp. 1–10.
  • [32] M. W. Tao, M. K. Johnson, and S. Paris (2010) Error-tolerant image compositing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 31–44.
  • [33] Y. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M. Yang (2017) Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3789–3797.
  • [34] H. Wu, S. Zheng, J. Zhang, and K. Huang (2018) Fast end-to-end trainable guided filter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1838–1847.
  • [35] H. Zeng, J. Cai, L. Li, Z. Cao, and L. Zhang (2020) Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • [36] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

6 Introduction

For a better understanding of the details, we provide several supplementary sections that demonstrate the mechanism and capability of DCCF.

In Sections 7 and 9, we provide more experimental results on high-resolution images along with an efficiency comparison. In Section 10, we further show low-resolution results to demonstrate the efficacy of the designed filters.

In Section 12.1, we show that a constrained rotation matrix in RGB space can actually rotate the color angle in HSV space. Together with the disentanglement in the high resolution assembly module described in Section 12.2, we also show that a DCCF hue filter can learn an unconstrained rotation matrix whose effect can be projected onto the corresponding HSV plane so as to affect only the intended HSV channel.

In Section 13, we demonstrate the numerical procedures of the standard and smoothed (ours) HSV supervision strategies.

In Section 14, we show the capability of DCCF to adjust a specific image attribute; each filter of this family can be inspected individually to verify its sub-task. In Section 15 and Section 16, we provide more visualization results of comprehensible interaction and high-resolution results. In Section 17, we discuss the potential limitations of our framework.

7 High-Resolution Results with Different Backbones

Note that our DCCF can be plugged into different backbones; we select DIH [33] and SAM [7] for this experiment. [33] shares a similar U-Net [27] architecture with iDIH-HRNet [29], except for the extra pretrained visual features from HRNet [30]. [7] uses a spatial-separated attention module in the decoder to aggregate semantic information. For a fair comparison, we keep the learning strategy the same as for DCCF when training these backbones: we use RandomResizedCrop and RandomHorizontalFlip as data augmentation, the foreground-normalized MSE [29] as the training loss, and XavierGluon (gaussian, magnitude=0.2) as the weight initializer. We use bilinear upsampling (BU), guided filter (GF [14]) and bilateral guided upsampling (BGU [2]) as post-processing methods. The experiments in Table 4 show that DCCF consistently outperforms these baselines, which demonstrates the robustness and generality of our framework.

Method Entire Dataset HCOCO HAdobe5k HFlickr Hday2night
MSE PSNR MSE PSNR MSE PSNR MSE PSNR MSE PSNR

DIH [33]
- - 36.39 36.56 - - 186.38 30.78 61.89 35.40
DIH [33] + BU 51.28 34.39 39.35 34.92 44.99 34.81 129.48 30.14 51.07 36.77
DIH [33] + GF[14] 43.10 35.35 30.73 36.11 41.57 35.34 109.99 31.13 50.00 37.09
DIH [33] + BGU[2] 34.37 36.47 23.16 37.21 34.11 36.63 90.24 32.18 51.78 36.64
DCCF-DIH 33.39 36.87 21.60 37.81 34.09 36.77 89.86 32.24 49.93 37.23
SAM [7] - - 33.43 36.93 - - 186.70 30.78 57.48 36.05
SAM [7] + BU 44.02 35.02 36.36 35.30 33.01 35.95 112.74 30.77 41.84 37.44
SAM [7] + GF[14] 35.88 36.05 27.83 36.53 29.63 36.58 93.06 31.85 40.94 37.79
SAM [7] + BGU[2] 27.94 37.18 20.25 37.73 24.71 37.68 73.82 32.99 42.49 37.28
DCCF-SAM 26.74 37.59 18.41 38.43 24.39 37.65 73.18 33.12 44.25 37.36
Table 4: Quantitative results on the iHarmony4 original-resolution test sets with other backbones. '-' means results could not be obtained due to memory limitation. 'DCCF-*' means DCCF filters backboned by *.

8 Comparison with CDTNet

We finetune our model under weaker resolution settings (1024×1024, 2048×2048) on the HAdobe5k subset to compare with the recent high-resolution harmonization method CDTNet [5]. To make a fair comparison, we use the same SAM-256 backbone as [5] (i.e. CDTNet-256). The results are shown in Table 5. Our DCCF performs better than CDTNet-256 at the higher resolution setting (2048×2048). It is also observed that the performance of [5] drops significantly as the resolution increases, while our method remains stable. Note that the other high-resolution experiments in our paper are conducted under a much stronger setting: the original resolution of HAdobe5k ranges up to 6048×4032.

Resolution Method MSE PSNR fMSE SSIM
1024 CDTNet-256 21.24 38.77 152.13 0.9868
1024 DCCF 21.12 38.38 171.17 0.9852
2048 CDTNet-256 29.02 37.66 198.85 0.9845
2048 DCCF 21.35 38.47 174.78 0.9856
Table 5: Quantitative comparison with CDTNet [5] on the HAdobe5k subset.

9 Efficiency Comparison

To compare efficiency between DCCF and existing post-processing upsampling methods, we test these methods at different resolutions from 1024×1024 to 3072×3072. The experiment is conducted on an x86-64 machine (72 cores, Ubuntu 18.04) with a 12 GB NVIDIA Titan X GPU. We report CPU time (T-C), GPU time (T-G) and memory usage (Mem) as evaluation metrics. Each method is warmed up with 10 forward passes and timings are averaged over another 20. As for the parameters that influence the efficiency metrics, GF [14] is run with a fixed kernel size and BGU [2] with its default grid size.

The experiments in Table 6 show that GF [14] leads on most efficiency metrics, mainly because it only involves a few basic box filters; however, it loses too much accuracy compared to DCCF (MSE/PSNR: 24.65/37.87 vs. 35.47/36.00). BGU [2] explicitly estimates bilateral grids that contain image-to-image transformation coefficients, so its accuracy is higher than that of GF [14], but its efficiency falls far behind since it needs an extra optimization procedure. DCCF therefore achieves a good trade-off between performance and efficiency.


Method
T-C (ms) T-G (ms) Mem (MB) T-C (ms) T-G (ms) Mem (MB) T-C (ms) T-G (ms) Mem (MB)
iDIH-HRNet[29] 420 231 1641 41040 907 4233 139768 2042 8551
iDIH-HRNet[29] + GF[14] 642 80.2 983 2001 160 1513 10181 391 2483
iDIH-HRNet[29] + BGU[2] 9932 - 2893 20803 - 4042 29836 - 8173
DCCF-iDIH-HRNet 762 104 1259 3289 286 2607 6517 545 4845

Table 6: Efficiency comparison between DCCF and advanced post-processing methods at different resolutions from 1024×1024 to 3072×3072. 'T-C' is CPU time (ms), 'T-G' is GPU time (ms) and 'Mem' is memory usage (MB). '-' means no official implementation is available.

10 Low-Resolution Results

To further show the efficacy of our designed comprehensible filters, we compare our approach with other state-of-the-art deep models [33, 7, 6, 11, 29] on the iHarmony4 low-resolution (256×256) test sets. The results are shown in Table 7. Interestingly, our approach also slightly outperforms the previous best method [29] on the entire iHarmony4 dataset at low resolution, which may be due to the appropriate filter design and the extra supervision from the auxiliary HSV losses in our framework.

Method Entire Dataset HCOCO HAdobe5k HFlickr Hday2night
MSE PSNR MSE PSNR MSE PSNR MSE PSNR MSE PSNR

DIH [33]
76.77 33.41 51.85 34.69 92.65 32.28 163.38 29.55 82.34 34.62

SAM [7]
59.67 34.35 41.07 35.47 63.40 33.77 143.45 30.03 76.61 34.50

DoveNet [6]
52.36 34.75 36.72 35.83 52.32 34.34 133.14 30.21 54.05 35.18

IntrinsicIH [11]
38.71 35.90 24.92 37.16 43.02 35.20 105.13 31.34 55.53 35.96

iDIH-HRNet [29]
22.81 38.18 14.35 39.53 23.43 37.18 61.42 33.84 45.09 38.08


DCCF
22.05 38.50 14.87 39.52 19.90 38.27 60.41 33.94 49.32 37.88
Table 7: Quantitative results on the iHarmony4 low-resolution test sets.

11 Ablation Study on Operation Order

According to our survey, many designers and artists in the Photoshop community tend to harmonize an image in the order value, saturation, hue. We regard this as a common convention and design our framework in that order. However, the operation order of the DCCF filters is itself meaningful, so we conduct a corresponding ablation experiment. We investigate all combinations ('VSH', 'HVS', 'SHV', 'VHS', 'HSV', 'SVH') on low-resolution images in Table 8. Note that the results differ slightly from Table 7 because Table 8 uses fewer training epochs, but all ablations in Table 8 share the same parameter setting. A tentative conclusion is that the order affects the final results and that different subsets require different optimal operation orders. A possible solution is to introduce reinforcement learning (RL) to decide the best operation order for a given image; this may inspire future work.

Operation Order Entire Dataset HCOCO HAdobe5k HFlickr Hday2night
MSE PSNR MSE PSNR MSE PSNR MSE PSNR MSE PSNR
VSH (default) 22.52 38.57 14.20 39.73 22.54 38.12 61.80 33.92 45.54 37.51
HVS 22.90 38.53 14.45 39.66 23.66 38.12 60.32 33.93 49.78 37.67
SHV 22.88 38.43 14.90 39.48 21.81 38.17 63.94 33.74 41.64 37.94
VHS 22.11 38.63 14.02 39.71 21.98 38.35 59.54 33.98 51.77 37.58
HSV 22.45 38.57 14.67 39.57 21.29 38.45 61.97 33.87 45.78 37.69
SVH 22.94 38.53 14.25 39.69 23.08 38.06 61.77 33.90 59.10 37.33
Table 8: Ablation study of different orders.

12 Hue Filter and Disentanglement

12.1 Hue Filter

Let a vector of RGB values denote one pixel of the image, with a corresponding point in HSV space. Let a learnable 3×4 affine transformation matrix in RGB space contain a rotation matrix and a translation vector, and consider a rotation by some angle on the hue ring in HSV space.

According to the theory in [12], to rotate the hue by a given angle, we perform a 3D rotation of the RGB colors about the diagonal vector [1.0 1.0 1.0], as illustrated in Fig. 9. The resulting matrix rotates the hue of the input RGB colors; a rotation of 120° exactly maps Red into Green, Green into Blue and Blue into Red. The matrix treatment in [12] makes the approximation that the diagonal axis in RGB space is equivalent to the hue axis in HSV space. This transformation has one problem, however: the luminance of the input colors is not preserved. This can be fixed by shearing the value plane to make it horizontal.

(a) 3D Rotation (b) Diagonal (c) Hue
Figure 9: Illustration of the rotation matrix. Viewing the 3D RGB cube model (a) from the diagonal perspective, we obtain the hexagon in (b), which is an approximation of the real color circle in hue space (c). Therefore, the rotation is equivalent in the above three models.

We suppose that one could find a suitable rotation matrix in RGB color space that is equivalent to the hue rotation of [12]. Therefore, it is possible to learn an affine color transformation function in RGB color space whose rotation part provides the parameters of the corresponding hue rotation function in HSV space; this rotation part is exactly the desired linear transformation of the hue filter. Clearly, such a linear transformation maps one RGB point to another RGB point, and the corresponding HSV point moves accordingly. To avoid modification along the V and S axes, we perform HSV disentanglement by projecting the movement onto the H plane.
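The following short NumPy check illustrates the claim numerically (Rodrigues construction of the rotation about the gray diagonal; illustrative only):

```python
import numpy as np

def rot_about_gray(theta):
    """Rodrigues rotation about the unit gray diagonal [1,1,1]/sqrt(3)."""
    k = np.ones(3) / np.sqrt(3.0)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A 2*pi/3 (120 degree) rotation about the gray axis permutes the color axes:
red = np.array([1.0, 0.0, 0.0])
green_like = rot_about_gray(2.0 * np.pi / 3.0) @ red
print(np.round(green_like, 6))   # approximately [0, 1, 0]: pure red maps to pure green
```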

12.2 Effect of HSV Disentanglement

(a) Input (b) Hue filter applied in RGB (c) Disentangle
Figure 10: Illustration of disentanglement. The high resolution assembly module extracts the H channel of the hue filter's output and concatenates it with the input's V and S. This operation prevents the hue filter from changing V and S.

We show the importance of the disentanglement in the high resolution assembly module in Fig. 10. Taking the hue filter as an example, naively applying it in RGB space actually changes V and S simultaneously. DE (Disentangle) avoids this by extracting the H channel of the output and concatenating it with the input's V and S. This ensures that the action of the hue filter does not corrupt the results of the value and saturation filters.

Note that this disentanglement strategy is equivalent to the projection described in Section 12.1: we only keep the movement on the H plane. Similarly, applying the value filter + DE and the saturation filter + DE generates projection paths on the V plane and the S plane, respectively. The orthogonality of HSV space ensures that these paths do not interfere with each other.

13 Auxiliary HSV Loss

13.1 Standard HSV Decomposition

The numerical conversion from RGB to HSV is performed as:

(7)   $V = \max(R, G, B)$

(8)   $S = \begin{cases} \big(V - \min(R, G, B)\big)/V, & V > 0 \\ 0, & V = 0 \end{cases}$

(9)   $H = 60^{\circ} \times \begin{cases} \big((G - B)/C\big) \bmod 6, & V = R \\ (B - R)/C + 2, & V = G \\ (R - G)/C + 4, & V = B \end{cases}, \qquad C = V - \min(R, G, B)$

This expression tends to produce noise points because it is based on raw numerical values rather than physical characteristics.
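For completeness, a direct NumPy implementation of the decomposition in Eqs. (7)-(9) (a sketch; the numerical guards against division by zero are our own additions, and H is normalized to [0, 1)):

```python
import numpy as np

def rgb_to_hsv_standard(rgb):
    """Standard per-pixel HSV decomposition of Eqs. (7)-(9) (sketch).

    rgb: HxWx3 in [0, 1]; returns HxWx3 with H in [0, 1), S and V in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                    # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-8), 0.0)
    safe_c = np.maximum(c, 1e-8)
    h = np.where(v == r, ((g - b) / safe_c) % 6.0, 0.0)
    h = np.where(v == g, (b - r) / safe_c + 2.0, h)
    h = np.where(v == b, (r - g) / safe_c + 4.0, h)
    h = np.where(c == 0, 0.0, h) / 6.0          # hue undefined for gray pixels, set to 0
    return np.stack([h, s, v], axis=-1)
```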

13.2 Smoothing V

The generation of the smoothed V is straightforward: we apply a Gaussian blur to the original V decomposition, with the variance scale and kernel size fixed in our implementation.

13.3 Smoothing S

(a) Input (b) Red (c) Green (d) Blue (e) Cyan (f) Magenta (g) Yellow (h) Black (i) White (j) Neutral
Figure 11: Procedure of the smoothed saturation map, arranged from left to right, top to bottom. Note that (j) is also the final result of our smoothed S.

Different from Eq. (8), we follow the selective color operation in Photoshop to get a smooth map. First, we perform a selective color adjustment by setting all the colored tones to -100%: red, yellow, green, cyan, blue, and magenta (R, G, B, C, M, Y). Then, for the blacks, whites and neutrals (B, W, N), we enhance them to 100%. Note that the parameter used here ranges from -100% to 100%; it is exactly the same parameter used in our saturation filter. The detailed procedure is shown in Fig. 11. The numerical expression is:

(10)
(11)

where the color categories are [R, G, B(Blue), C, M, Y, B(Black), W, N]; the adjustment is -100% for the first six (the colored tones) and +100% for the last three (black, white, neutral), and the mask denotes the regions belonging to each color.

The result is a color map that shows saturation levels across the scene: darker shades of gray are less saturated, and lighter shades are more saturated.

13.4 Smoothing H

As for the smoothed H, we first convert the RGB image into HSV space, set the V channel to 0.8 and the S channel to 0.5, then convert it back to RGB space.
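A minimal sketch of this smoothed-H construction using matplotlib's HSV conversion (illustrative; the released implementation may differ):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def smooth_h_map(rgb):
    """Smoothed-H supervision map (sketch): fix V = 0.8 and S = 0.5 so that only
    hue variation remains, then convert back to RGB."""
    hsv = rgb_to_hsv(np.clip(rgb, 0.0, 1.0))
    hsv[..., 2] = 0.8   # value
    hsv[..., 1] = 0.5   # saturation
    return hsv_to_rgb(hsv)
```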

14 Intermediate Result Visualizations

We show that the intermediate results of DCCF match their design purpose in Fig. 14. The first image is the input; the next images are the outputs after the value, saturation and hue filters; the following rows show the input's H, S and V maps, their harmonized counterparts, and the ground truths.

As illustrated in the upper part of Fig. 14, each intermediate result of the value, saturation and hue filters not only changes its corresponding image attribute but also maintains reasonable visual quality. This is non-trivial, since we do not apply a direct RGB loss to these intermediate results; we only apply the auxiliary HSV loss to their specific channels. It means users can choose any intermediate output as their desired result if they only want to change some of these attributes, whereas previous works only provide a single final result (the output of the attentive rendering filter in our framework). This brings more flexibility and robustness for users approaching DCCF.

We further illustrate the comprehensibility in the bottom part of Fig. 14: the modified channel is closer to the ground truth after the operation of the corresponding DCCF. This also demonstrates the effect of the HSV loss in our framework.

15 Comprehensible Interaction

(a) Illustration of interactive value adjustment.
(b) Illustration of interactive saturation adjustment.
Figure 12: Interactive adjustment. Upper: users’ global adjustment. Bottom: interactive adjustment with DCCF.

To interact with the value filter in standard image processing software, users provide a curve to adjust value. In our framework, users can set the curve parameters to approximate such a curve. As shown in Fig. 12(a), the first row visualizes several curves provided by the user. The second row is the result of directly applying these curves to the RGB image, which degrades image quality since every pixel shares the same tuning curve. The third row is the result of applying a weighted fusion of the user's global curve and DCCF's predicted filter map.

To interact with the saturation filter, users can change image saturation through a single global parameter. As shown in Fig. 12(b), we set a series of such values to generate the first row; this is a global adjustment in which every pixel shares the same parameter and can therefore suffer from over-saturation. The second row is the weighted fusion of the user's parameter and DCCF's predicted filter map.
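A toy sketch of such a weighted fusion (the 50/50 weight and the function name are assumptions for illustration):

```python
import numpy as np

def fuse_user_saturation(lam_pred, lam_user, weight=0.5):
    """Hypothetical weighted fusion of a user's global saturation setting (scalar)
    with the per-pixel map predicted by DCCF (HxW); the 0.5 weight is an assumption."""
    return weight * lam_user + (1.0 - weight) * lam_pred
```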

16 More High-Resolution Visualizations

We provide more visualizations of final results compared with previous methods in Fig. 15. Since DCCF is an end-to-end framework, it has strong transformation capability while maintaining high-resolution details.

17 Limitations

(a) BGU [2]
(b) DCCF
(c) GT
Figure 13: Potential limitation. The second row shows amplified details. It is observed that even though DCCF learns a better color adjustment, details start to blur in high-frequency regions such as leaves.

Since the high-resolution result is guided by the low-resolution stream in our framework, the claim of insensitivity to resolution holds only if the processed image shares enough information across all signal frequencies. For content with extremely high frequencies, the method may fail to reharmonize images properly; see Fig. 13.

Figure 14: Intermediate results.
(a) Mask (b) Input (c) BU (d) GF [20] (e) BGU [2] (f) Ours (g) GT
Figure 15: Visualization of high-resolution results.