
Very Lightweight Photo Retouching Network with Conditional Sequential Modulation

by   Yihao Liu, et al.

Photo retouching aims at improving the aesthetic visual quality of images that suffer from photographic defects such as poor contrast, over/under exposure, and inharmonious saturation. In practice, photo retouching can be accomplished by a series of image processing operations. As most commonly-used retouching operations are pixel-independent, i.e., the manipulation on one pixel is uncorrelated with its neighboring pixels, we can take advantage of this property and design a specialized algorithm for efficient global photo retouching. We analyze these global operations and find that they can be mathematically formulated by a Multi-Layer Perceptron (MLP). Based on this observation, we propose an extremely lightweight framework, the Conditional Sequential Retouching Network (CSRNet). Benefiting from the use of 1×1 convolutions, CSRNet contains fewer than 37K trainable parameters, orders of magnitude fewer than existing learning-based methods. Experiments show that our method achieves state-of-the-art performance on the benchmark MIT-Adobe FiveK dataset, both quantitatively and qualitatively. In addition to achieving global photo retouching, the proposed framework can be easily extended to learn local enhancement effects. The extended model, namely CSRNet-L, also achieves competitive results in various local enhancement tasks. Codes will be available.





1 Introduction

Fig. 1: Left: Compared with existing state-of-the-art methods, our method achieves superior performance with far fewer parameters (1/13 of HDRNet [17] and 1/250 of White-Box [21]). The diameter of each circle represents the number of trainable parameters. Right: Image retouching examples. Our method can generate more natural and visually pleasing retouched images than other methods. Best viewed in color.

Photo retouching can significantly improve the visual quality of photographs through a sequence of image processing operations, such as brightness and contrast adjustments. Manual retouching requires specialized skills and training, and is thus challenging for casual users. Even for professional retouchers, dealing with large collections requires tedious, repetitive editing work. This creates a need for automatic photo retouching methods. Such a method can be deployed in smartphones to help ordinary people get visually pleasing photos, or it can be built into photo editing software to provide an editing reference for experts.

The aim of photo retouching is to generate an aesthetically pleasing image from a low-quality input, which may suffer from under/over-exposure or unsaturated color tone. Recent learning-based methods tend to treat photo retouching as a special case of image enhancement or image-to-image translation. They use CNNs to learn either the transformation matrix [17, 47] or an end-to-end mapping [22, 23, 9] from input/output pairs. Generally, photo retouching needs to adjust the global color tones, while other image enhancement/translation tasks focus on local patterns or even change the image textures. Moreover, photo retouching is naturally a sequential process, which can be decomposed into several simple operations. This property does not always hold for image enhancement and image-to-image translation problems. As most state-of-the-art algorithms [17, 38, 9] are not specialized for photo retouching, they generally design complex network structures to deal with both global and local transformations, which could largely restrict their performance and implementation efficiency.

When investigating the retouching process, we find that most commonly-used retouching operations (e.g., white balancing, saturation controlling and color curve adjustment/tone-mapping) are global operations that do not change local statistics. In real-world scenarios, global operations are essential for photo retouching, while local operations are optional. For example, the retouched images in the well-known benchmark MIT-Adobe FiveK dataset [5] contain only global tonal adjustment. The reinforcement learning (RL) based methods [21, 34] include only global operations in their action space. These observations motivate us to take advantage of global operations and design an efficient algorithm especially for “global” photo retouching. After that, we can simply extend this method to achieve local retouching for other local operations.

For preparation, we revisit several retouching operations adopted in RL based methods [21, 34]. Mathematically, these operations (e.g., white-balancing, saturation controlling, contrast adjustment) are pixel-independent and location-independent. In other words, the manipulation on one pixel is uncorrelated with neighboring pixels or pixels on specific positions. The input pixels can be mapped to the output pixels via pixel-wise mapping functions, without the need of local image features. We find that these pixel-wise functions can be approximated by a simple Multi-layer Perceptron (MLP). Different adjustment operations can share similar network structures but with different parameters. Then the input image can be sequentially processed by a set of MLPs to generate the final output.

Based on the above observation, we propose an extremely lightweight network, the Conditional Sequential Retouching Network (CSRNet), for fast global photo retouching. The key idea is to implicitly mimic the sequential processing procedure and model the editing operations in an end-to-end trainable network. The framework consists of two modules: the base network and the condition network. The base network adopts a fully convolutional structure whose unique feature is that all filters are of size 1×1, indicating that each pixel is processed independently. Therefore, the base network can be regarded as an MLP for individual pixels. To realize retouching operations, we modulate the intermediate features of the base network using the proposed Global Feature Modulation (GFM) layers, whose parameters are controlled by the condition network. The condition network accepts the input image and generates a condition vector, which is then broadcast to different layers of the base network for feature modulation. This procedure is just like a sequential editing process operated on different stages of the MLP (see Figure 4). These two modules are jointly optimized with expert-adjusted image pairs. It is noteworthy that we name our method “sequential modulation” just for better understanding and illustration. In fact, our method does not explicitly model the step-wise retouching operations as RL-based methods do. Instead, CSRNet implicitly models the whole process and generalizes well to undefined/unseen operations.

The proposed network is highly efficient for deployment in practical applications. It has a very simple architecture, which contains only six plain convolutional layers in total, without any complex building blocks or even residual connections. Such a compact network achieves state-of-the-art performance on the MIT-Adobe FiveK dataset [5] with fewer than 37K parameters, 1/13 of HDRNet [17] and 1/90 of DPE [9] (see Figure 1 and Table II). Note that even the first super-resolution CNN [12], with three layers, already contains 56K parameters. We have also conducted extensive ablation studies on various settings, including different handcrafted global priors, network structures and feature modulation strategies. Further, in terms of feature modulation, we find it important to normalize the conditional inputs. We propose a novel and effective unit normalization (UN) to improve the training stability and enhance the performance. More explorations of feature normalization and scaling techniques are also discussed.

While we have successfully achieved global photo retouching, the next step is to extend the proposed method to local enhancement operations. Note that CSRNet is very flexible in network design. We only need two simple modifications on the base network. The first is to replace all 1×1 filters with larger ones, which significantly enlarges the receptive field. The second is to replace GFM with the proposed Spatial Feature Modulation (SFM). SFM achieves spatially variant modulation and is thus more suitable for local operations. The modified framework, namely CSRNet-L, can handle both global and local retouching operations, and obtains results comparable to state-of-the-art methods in various local enhancement tasks.

A preliminary version of this work has been published in ECCV 2020 [20]. The present work improves the initial version in significant ways. Firstly, we conduct more experiments and ablation studies to further improve the performance of CSRNet. Secondly, more analyses of retouching operations and intuitive explanations are added to the initial version. We conduct a demonstration experiment on simulating retouching operations, which clearly consolidates the theoretical analysis (Section 5.2). Thirdly, based on the proposed modulation network, we introduce a unit normalization (UN) operation on the generated condition vector. This operation not only improves the training stability, but also brings further performance improvement. We also explore different feature normalization/scaling operations. Finally, we extend our method to handle local enhancement effects. With two simple modifications, the extended model CSRNet-L achieves competitive performance with other methods in local enhancement tasks.

2 Related Work

Photo retouching and image enhancement have been studied for decades for their versatile applications in computer vision, image processing and aesthetic photograph editing [17, 7, 29, 48, 13]. In this section, we briefly review the recent progress on image retouching and enhancement. Traditional algorithms have proposed various operations and filters to enhance the visual quality of images, such as histogram equalization, local Laplacian operator [2], fast bilateral filtering [14], and color correction methods based on the gray-world [15] or gray-edge [37] assumption. Since Bychkovsky et al. [5] collected the large-scale MIT-Adobe FiveK dataset, which contains input and expert-retouched image pairs, plenty of learning-based enhancement algorithms have been developed to continuously push the performance. Generally, these learning-based methods can be divided into three groups: physical-modeling-based methods, image-to-image translation methods and reinforcement learning methods.

Physical-modeling-based methods attempt to estimate the intermediate parameters of the proposed physical models or assumptions for image enhancement. Based on the Retinex theory of color vision [29], several algorithms were developed for image exposure correction by estimating the reflectance and illumination with learnable models [16, 44, 49, 38]. Zero-DCE [18] formulated light enhancement as a task of image-specific curve estimation with a deep network, which did not require any paired or unpaired data during training. Yan et al. [43] adopted a multi-layer perceptron with a set of image descriptors to predict pixel-wise color transforms. By postulating that the enhanced output image can be expressed as local pointwise transformations of the input image, Gharbi et al. [17] combined bilateral grid [8] and bilateral guided upsampling models [7], then constructed a CNN model to predict the affine transformation coefficients in bilateral space for real-time image enhancement. Bianco et al. [3] leveraged CNNs to learn parametric functions for color enhancement, which decoupled the inference of the parameters and the color transformation. SpliNet [4] estimated a global color transform for the enhancement of raw images. It predicted one set of control points from the input raw image for each color channel and interpolated these control points with natural cubic splines. Recently, Zeng et al. [47] proposed image-adaptive 3-dimensional lookup tables (3D LUTs) to achieve fast and robust photo enhancement. Their method learns multiple basis 3D LUTs and a small convolutional neural network (CNN) simultaneously in an end-to-end manner.

Methods of the second group treat image enhancement as an image-to-image translation problem, directly learning the end-to-end mapping between the input and the enhanced image without modelling intermediate parameters. In practice, generative adversarial networks (GANs) have shown great potential for image transferring tasks [24, 51, 10, 50, 41]. Lee et al. [30] introduced a method that can transfer the global color and tone statistics of the chosen exemplars to the input photo by a selective style ranking technique from a large photo collection. Ignatov et al. explored translating ordinary photos into DSLR-quality images by residual convolutional neural networks [22] and weakly supervised generative adversarial networks [23]. Yan et al. [42] formulated the color enhancement task as a learning-to-rank problem in which ordered pairs of images are used for training. Chen et al. [9] utilized an improved two-way generative adversarial network (GAN) that can be trained in an unpaired-learning manner. Zamir et al. [46] developed a multi-scale approach which maintains spatially precise high-resolution representations and receives strong contextual information from the low-resolution representations. Kim et al. [26] combined a global enhancement network (GEN) and a local enhancement network (LEN) in one framework to achieve both paired and unpaired image enhancement. GEN performs the channel-wise intensity transform while LEN conducts spatial filtering to refine GEN results. Deng et al. [11] proposed an aesthetic-driven image enhancement model with adversarial learning (EnhanceGAN), which requires weak supervision (binary labels on image aesthetic quality). Chai et al. [6] extended the method of [3] and pursued a GAN-based CNN that can be trained using either paired or unpaired images by determining the coefficients of a parametric color transformation. Ni et al. [33] developed a quality attention generative adversarial network (QAGAN), which was designed to learn domain-relevant quality attention directly from the low-quality and high-quality image domains. PieNet [27] was the first deep learning approach proposed for personalized image enhancement; it represented the users’ preferences in latent vectors and then guided the network to achieve personalization.

Reinforcement learning is adopted for image retouching with the aim of explicitly simulating the step-wise retouching process. Hu et al. [21] presented a White-Box photo post-processing framework that learns to make decisions based on the current state of the image. Park et al. [34] cast the color enhancement problem as a Markov Decision Process (MDP) where each action is defined as a global color adjustment operation and selected by a Deep Q-Network [32]. Yu et al. [45] exploited deep reinforcement learning to learn multiple local exposure operations, in which an adversarial learning method is adopted to approximate the Aesthetic Evaluation (AE) function. In [28], Satoshi and Toshihiko incorporated image editing software (such as Adobe Photoshop) into a GAN-based reinforcement learning framework, where the generator worked as the agent to select the software’s parameters.

Fig. 2: Illustration of the equivalent MLPs for the corresponding retouching operations. Commonly-used retouching operations can be regarded as classic MLPs applied to the input image. Moreover, operations like (a) white-balancing, (b) saturation controlling and (c) color curve adjustment/tone-mapping can further be regarded as MLPs applied to individual pixels, since these operations are pixel-independent and only contain inner-pixel connections. However, operations like (d) contrast adjustment require global information and contain cross-pixel connections. The condition network can collaborate with the base network, providing global features and facilitating cross-pixel connections. For simplicity, the illustrations for tone-mapping and contrast adjustment only show the single-channel case.

3 Analysis of Retouching Operations

Photo retouching is accomplished by a series of image processing operations, such as the manipulation of brightness/contrast, the adjustment in each color channel, and the controlling of saturation/hue/tones. We mathematically find that these pixel-independent operations can be approximated or formulated by multi-layer perceptrons (MLPs). This observation motivates the design of our method. Below is our analysis on some representative retouching operations. The proposed framework is depicted in Section 4.

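Global brightness change. Given an input image $I$, the global brightness is described as the average value of its luminance map: $I_Y = 0.299\,I_R + 0.587\,I_G + 0.114\,I_B$, where $I_R$, $I_G$, $I_B$ represent the RGB channels, respectively. One simple way to adjust the brightness is to multiply each pixel by a scalar:

$$I'_Y(x, y) = \alpha\, I_Y(x, y), \tag{1}$$

where $I'_Y(x, y)$ is the adjusted pixel value, $\alpha$ is a scalar, and $(x, y)$ indicates the pixel location in an $M \times N$ image. We can formulate the adjustment formula (1) into the representation of an MLP:

$$Y = f(W^{T} X + b), \tag{2}$$

where $X \in \mathbb{R}^{MN}$ is the vector flattened from the input image, and $W \in \mathbb{R}^{MN \times MN}$ and $b \in \mathbb{R}^{MN}$ are weights and biases. $f(\cdot)$ is the activation function. When $W = \mathrm{diag}(\alpha, \alpha, \ldots, \alpha)$, $b = \vec{0}$, and $f$ is the identity mapping $f(x) = x$, the MLP (2) is equivalent to the brightness adjustment formula (1).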
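The equivalence between formula (1) and MLP (2) can be checked numerically. The following is a minimal sketch in plain Python; the 2×2 luminance values, the value of `alpha` and the helper names are illustrative assumptions, not taken from the paper.

```python
# Brightness scaling written as a one-layer MLP Y = f(W^T X + b)
# with W = diag(alpha, ..., alpha), b = 0 and f the identity mapping.

def mlp_layer(weights, bias, x):
    """Compute W^T x + b for a square weight matrix given as nested lists."""
    n = len(x)
    return [sum(weights[i][j] * x[i] for i in range(n)) + bias[j] for j in range(n)]

def brightness_weights(alpha, n):
    """Diagonal weight matrix diag(alpha, ..., alpha) of size n x n."""
    return [[alpha if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical 2x2 luminance map, flattened to a vector of MN = 4 values.
pixels = [0.10, 0.25, 0.50, 0.80]
alpha = 1.2

direct = [alpha * p for p in pixels]                                   # formula (1)
via_mlp = mlp_layer(brightness_weights(alpha, 4), [0.0] * 4, pixels)   # formula (2)
```

Because the weight matrix is diagonal, each output unit reads exactly one input unit, which is what makes the operation pixel-independent.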

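Contrast adjustment. Contrast represents the difference in luminance or color maps. Among many definitions of contrast, we adopt a widely-used contrast adjustment formula:

$$I'(x, y) = \alpha\, I(x, y) + (1 - \alpha)\, \bar{I}, \tag{3}$$

where $\bar{I} = \frac{1}{MN} \sum_{x, y} I(x, y)$ is the image mean and $\alpha$ is the adjustment coefficient. When $\alpha = 1$, the image will remain the same. The above formula is applied on each channel of the image. We can construct a three-layer MLP that is equivalent to the contrast adjustment operation. For simplicity, the following derivation is for a single-channel image, and it can be easily generalized to RGB images (refer to the derivation of white-balancing in the supplementary material). As in Figure 3, the input layer has $MN$ units covering all pixels of the input image, the middle layer includes $MN + 1$ hidden units and the last layer contains $MN$ output units. This can be formalized as:

$$Y = f\big(W_2^{T} f(W_1^{T} X + b_1) + b_2\big), \tag{4}$$

where $X \in \mathbb{R}^{MN}$, $W_1 \in \mathbb{R}^{MN \times (MN+1)}$, $b_1 \in \mathbb{R}^{MN+1}$, $W_2 \in \mathbb{R}^{(MN+1) \times MN}$, $b_2 \in \mathbb{R}^{MN}$. Let $W_1 = [A, B]$ with $A \in \mathbb{R}^{MN \times MN}$ and $B \in \mathbb{R}^{MN \times 1}$, and let $W_2 = [C; D]$ with $C \in \mathbb{R}^{MN \times MN}$ stacked on top of $D \in \mathbb{R}^{1 \times MN}$. When $A = \mathrm{diag}(\alpha, \ldots, \alpha)$, $B = \frac{1}{MN}[1, \ldots, 1]^{T}$, $C = \mathrm{diag}(1, \ldots, 1)$, $D = (1 - \alpha)[1, \ldots, 1]$, $b_1 = b_2 = \vec{0}$ and $f(x) = x$, the hidden layer holds the $MN$ scaled pixels $\alpha I(x, y)$ together with the mean $\bar{I}$, and the above MLP (4) is equivalent to the contrast adjustment formula (3).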
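The three-layer construction can be sketched in a few lines of plain Python. This is a sketch under stated assumptions: it takes the contrast formula to be the common form output = alpha * pixel + (1 - alpha) * mean, and the particular split of the weights (scaled pixels plus one averaging unit in the hidden layer) is one valid choice, not necessarily the exact assignment used by the authors. All function names are hypothetical.

```python
def matvec_t(weights, x):
    """Return W^T x for W given as a list of len(x) rows."""
    rows, cols = len(weights), len(weights[0])
    return [sum(weights[i][j] * x[i] for i in range(rows)) for j in range(cols)]

def contrast_direct(pixels, alpha):
    """Assumed contrast formula: alpha * I + (1 - alpha) * mean(I)."""
    mean = sum(pixels) / len(pixels)
    return [alpha * p + (1.0 - alpha) * mean for p in pixels]

def contrast_mlp(pixels, alpha):
    """Same operation as an MLP with zero biases and identity activations."""
    n = len(pixels)
    # W1 is n x (n + 1): the first n columns scale each pixel by alpha,
    # the last column averages all pixels (the cross-pixel connections).
    w1 = [[(alpha if i == j else 0.0) for j in range(n)] + [1.0 / n] for i in range(n)]
    hidden = matvec_t(w1, pixels)   # [alpha * x_1, ..., alpha * x_n, mean]
    # W2 is (n + 1) x n: identity on the scaled pixels, plus (1 - alpha)
    # times the mean unit added to every output unit.
    w2 = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    w2.append([1.0 - alpha] * n)
    return matvec_t(w2, hidden)

pixels = [0.1, 0.4, 0.6, 0.9]       # hypothetical single-channel image, MN = 4
expected = contrast_direct(pixels, 1.5)
result = contrast_mlp(pixels, 1.5)
```

The single averaging unit is the only place where pixels interact, which is why contrast adjustment needs global information while the other operations do not.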

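White-balancing. In [21, 34], the operation for white-balancing is described as follows:

$$\big(I'_R(x, y),\, I'_G(x, y),\, I'_B(x, y)\big) = \big(\alpha_R I_R(x, y),\, \alpha_G I_G(x, y),\, \alpha_B I_B(x, y)\big), \tag{5}$$

where $\alpha_R$, $\alpha_G$, $\alpha_B$ are the adjustment scalars for each color channel. The above operation can be represented as an MLP used on a single pixel. Note that the following derivation applies to three channels for each pixel location; therefore, there are $3MN$ input units in total.

$$Y = f(W^{T} X + b), \tag{6}$$

where $X \in \mathbb{R}^{3MN}$ is the vector flattened from the input image, $W \in \mathbb{R}^{3MN \times 3MN}$ and $b \in \mathbb{R}^{3MN}$ are weights and biases, and $f(\cdot)$ is the activation function. Let $X$ be ordered pixel by pixel, with the R, G and B values of each pixel adjacent.
When $W = \mathrm{diag}(\alpha_R, \alpha_G, \alpha_B, \ldots, \alpha_R, \alpha_G, \alpha_B)$, $b = \vec{0}$ and $f(x) = x$, the above MLP (6) is equivalent to the white-balancing operation (5), as shown in Figure 2(a). Because $W$ repeats the same $3 \times 3$ block for every pixel, the MLP acts on each pixel independently.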
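As a small illustration of the single-pixel view, white-balancing can be written as the same 3×3 diagonal layer applied at every location. The sketch below assumes per-channel gains multiply the R, G and B values; the image, the gains and the function names are made up for illustration.

```python
def white_balance_pixel(rgb, gains):
    """Per-pixel MLP: W = diag(gains), b = 0, identity activation."""
    weights = [[gains[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
    return [sum(weights[i][j] * rgb[i] for i in range(3)) for j in range(3)]

def white_balance_image(image, gains):
    """Slide the same per-pixel MLP over every location of an H x W x 3 image."""
    return [[white_balance_pixel(px, gains) for px in row] for row in image]

image = [[[0.2, 0.4, 0.6], [0.5, 0.5, 0.5]]]   # hypothetical 1 x 2 RGB image
gains = (1.1, 1.0, 0.9)                        # hypothetical alpha_R, alpha_G, alpha_B
balanced = white_balance_image(image, gains)
```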

Other operations, like saturation controlling, color curve adjustment/tone-mapping, can also be regarded as MLPs. (Please refer to the supplementary materials.)

Discussions. So far, we have shown that most commonly-used retouching operations can be formulated as classic MLPs. These operations are pixel-independent and location-independent, i.e., the manipulation on one pixel is uncorrelated with its neighboring pixels or pixels at specific positions. That is why we can use a diagonal matrix as the MLP weights. Further, operations like brightness change, white-balancing, saturation controlling and tone-mapping can also be viewed as MLPs applied to a single pixel, which is similar to the MLPconv proposed in [31]. The correlation between MLPs and convolutions has been revealed in MLPconv [31] and SRCNN [12]. Inspired by this discovery, we design the base network in the proposed method as a fully convolutional network in which all filters are of size 1×1, so that it acts like an MLP that works on individual pixels and slides over the input image. Some operations, like contrast adjustment, may require global information that relates to all pixels in the image (e.g., the image mean value). Such global information can be provided by the condition network in our method. According to the analysis above, we propose a comprehensible framework for efficient photo retouching.
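To make the link between 1×1 convolutions and per-pixel MLPs concrete, here is a minimal pure-Python sketch of a 1×1 convolution followed by ReLU. The weights, the toy image and the helper names are illustrative assumptions. Since the same linear map is applied at every location, rearranging the pixels before or after the layer gives the same result.

```python
def conv1x1(image, weights, bias):
    """1x1 convolution: the same linear map (C_out x C_in) at every pixel."""
    return [[[sum(w * c for w, c in zip(w_row, px)) + b
              for w_row, b in zip(weights, bias)]
             for px in row]
            for row in image]

def relu(image):
    return [[[max(0.0, c) for c in px] for px in row] for row in image]

def flip_horizontal(image):
    return [row[::-1] for row in image]

# Hypothetical 1 x 3 RGB image and a layer mapping 3 channels to 2.
image = [[[0.1, 0.2, 0.3], [0.9, 0.5, 0.1], [0.4, 0.4, 0.4]]]
weights = [[1.0, -1.0, 0.5], [0.2, 0.3, 0.5]]
bias = [0.0, -0.1]

out = relu(conv1x1(image, weights, bias))
out_of_flipped = relu(conv1x1(flip_horizontal(image), weights, bias))
```

Here `out_of_flipped` equals `flip_horizontal(out)`, which is the location-independence property described above.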

4 Methodology

Our method aims at fast automatic image retouching with an extremely low parameter and computation cost. Based on the analysis in Section 3, we propose a Conditional Sequential Retouching Network (CSRNet) which is specialized for efficient image retouching. It consists of a base network and a complementary condition network, which collaborate with each other to achieve global tonal adjustment of images, as demonstrated in Section 5. Then, we illustrate the intrinsic working mechanism of CSRNet from two perspectives. We also describe how to achieve different retouching styles and to control the enhancing strength. Furthermore, our method can be easily extended to learning stylistic local effects, which will be discussed in Section 6.

Fig. 3: (a) An MLP on individual pixels. A pixel-independent operation, such as brightness change, white-balancing or saturation controlling, can be viewed as an MLP used on individual pixels. (b) CSRNet (k: kernel size; n: number of feature maps; s: stride). The proposed network consists of four key components: base network, condition network, GFM and UN.

4.1 Conditional Sequential Modulation Network

The proposed conditional sequential retouching network (CSRNet) contains a base network and a condition network as shown in Figure 3(b). The base network takes the low-quality image as input and generates the retouched image. The condition network estimates the global features from the input image, and afterwards influences the base network by conditional feature modulation operations.

4.1.1 Network Structure

The structures of the base network and the condition network are described as follows.

Base network. The base network adopts a fully convolutional structure with convolutional layers and ReLU activations. One unique trait of the base network is that all filters are of size 1×1, meaning that each pixel in the input image is manipulated independently. Hence, the base network can be regarded as an MLP, which works on each pixel independently and slides over the input image, similar to MLPconv in [31]. Based on the analysis in Section 3, the base network is theoretically capable of handling pixel-independent retouching operations. Moreover, since all the filters are of size 1×1, the network has very few parameters.

Condition network. Global information/priors are indispensable for image retouching. For example, contrast adjustment requires the average luminance of the image. To allow the base network to incorporate global features, a condition network is proposed to collaborate with the base network. The condition network is like an encoder that contains three blocks, each consisting of a series of convolutional, ReLU and downsampling layers. The output of the condition network is a condition vector, which is broadcast into the base network using global feature modulation (GFM). Network details are depicted in Section 4 and Figure 3(b).

4.1.2 Global Feature Modulation

To enable the network to handle operations that require global features, we modulate the intermediate features of the base network by element-wise multiplication and addition operations. The proposed conditional feature modulation can be formulated as follows:

$$GFM(x_i) = \alpha \odot x_i + \beta, \tag{7}$$

where $\odot$ denotes the element-wise multiplication operation, $x_i$ is the feature map to be modulated in the base network, and $\alpha$, $\beta$ are affine parameters that are estimated from the outputs of the condition network. Since $\alpha$ and $\beta$ are vectors containing global statistics and the modulation is global-wise for each feature map in the base network, we call this operation Global Feature Modulation (GFM). Due to the effectiveness of GFM, the base network and condition network can collaborate well to achieve global tonal adjustment of images with only a few parameters.
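A minimal sketch of the modulation step, assuming one scale and one shift value per feature channel that is applied uniformly over all spatial positions. The nested-list layout and the name `gfm` are illustrative choices, not the authors' implementation.

```python
def gfm(feature_maps, scale, shift):
    """Global Feature Modulation: per-channel scale and shift.

    feature_maps: list of C channels, each an H x W nested list.
    scale, shift: length-C lists, one value per channel, which would be
    predicted from the condition vector in the full network.
    """
    return [[[a * v + b for v in row] for row in channel]
            for channel, a, b in zip(feature_maps, scale, shift)]

# Hypothetical features with C = 2 channels of size 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
modulated = gfm(features, scale=[2.0, 0.0], shift=[0.1, 1.0])
```

Every position of a channel receives the same scale and shift, so the modulation carries global information without introducing any spatially varying behaviour.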

4.1.3 Unit Normalization

To improve the stability and robustness, we propose a unit normalization (UN) operation on the condition vector to restrict its numerical range.

$$UN(x) = \sqrt{n}\, \frac{x}{\lVert x \rVert_2}, \tag{8}$$

where $x$ is an $n$-dimensional input vector. The unit normalization is similar to weight normalization [35], but it acts on the feature vectors instead of the convolutional parameters. UN can be regarded as a kind of vector unitization with a scaling coefficient related to the vector dimension. Hence, this operation makes the network focus more on the direction of the vector than on its absolute value. After unit normalization, all the generated condition vectors fall on an $n$-sphere (hypersphere) with radius $\sqrt{n}$. In terms of feature modulation, we find that the UN operation plays an important role in the retouching performance. More explorations on the normalization/scaling strategies are presented in Section 5.4.
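The sketch below illustrates one plausible form of unit normalization: divide the vector by its L2 norm and rescale by the square root of its dimension. The exact scaling coefficient is an assumption here (the text only says it is related to the vector dimension), and the small `eps` term is added purely for numerical safety.

```python
import math

def unit_normalize(vector, eps=1e-8):
    """Assumed form: sqrt(n) * x / ||x||_2 (eps avoids division by zero)."""
    n = len(vector)
    norm = math.sqrt(sum(v * v for v in vector))
    return [math.sqrt(n) * v / (norm + eps) for v in vector]

condition = [3.0, 4.0, 0.0, 0.0]              # hypothetical 4-dimensional condition vector
normalized = unit_normalize(condition)
rescaled_input = unit_normalize([10.0 * v for v in condition])
```

Both calls return approximately the same vector, showing that only the direction of the condition vector survives the normalization.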

Fig. 4: Illustration for two perspectives of the proposed framework. (a) View 1, pixel-level: The base network can be regarded as an MLP that works on individual pixels. As analysed in Section 3, such a three-layer MLP is able to approximate most image retouching operations. The condition network is added to provide image global information. (b) View 2, space-level: The intermediate features of the base network can be naturally viewed as color maps (e.g., RGB to YUV). Thus, our method can be regarded as a series of retouching operations on the corresponding color spaces.

4.2 Illustration

To facilitate understanding, we illustrate how CSRNet works from two perspectives. We use a simple yet standard setting: the base network includes three layers.

Pixel-level view. We regard the base network as an MLP that works on individual pixels, as shown in Figure 4(a). From this perspective, the base network is made up of three fully-connected layers that perform feature extraction, non-linear mapping and reconstruction, respectively. As demonstrated in Section 3, such a three-layer MLP is able to approximate most image retouching operations. Then we add the condition network, and see how these two modules work collectively. As the GFM is equivalent to a “multiply+addition” operation, it can be easily merged into the filters. In this way, the condition network adjusts the filter weights of the base network. For the last layer, the modulation operation can be modeled as an extra layer. This layer performs one-to-one mapping, which changes the “average intensity and dynamic range” of the output pixel, just like “brightness/contrast adjustment”. Combining the base network and the condition network, we obtain a different MLP for each input image, allowing image-specific photo retouching. To support this pixel-level view, we have conducted a demonstration experiment that uses the proposed framework to simulate the procedures of several retouching operations. The results are shown in Section 5.2.
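The claim that a "multiply+addition" modulation can be merged into the filters follows from linearity, and can be sketched for a single pixel as below. The sketch assumes the modulation is applied directly to the output of a 1×1 convolution, before any non-linearity; the function names and numbers are hypothetical.

```python
def linear(px, weights, bias):
    """A 1x1 convolution at one pixel: weights is C_out x C_in."""
    return [sum(w * c for w, c in zip(row, px)) + b for row, b in zip(weights, bias)]

def modulate(features, scale, shift):
    """Per-channel 'multiply + addition' modulation."""
    return [a * f + s for f, a, s in zip(features, scale, shift)]

def merge(weights, bias, scale, shift):
    """Fold the modulation into the layer: W' = scale * W, b' = scale * b + shift."""
    merged_w = [[a * w for w in row] for row, a in zip(weights, scale)]
    merged_b = [a * b + s for b, a, s in zip(bias, scale, shift)]
    return merged_w, merged_b

px = [0.2, 0.5, 0.7]
weights = [[0.5, -0.2, 0.1], [0.3, 0.3, 0.4]]
bias = [0.05, -0.1]
scale, shift = [1.5, 0.8], [0.02, 0.3]        # would come from the condition network

two_step = modulate(linear(px, weights, bias), scale, shift)
merged_w, merged_b = merge(weights, bias, scale, shift)
one_step = linear(px, merged_w, merged_b)
```

Since `scale` and `shift` change with the input image, the merged weights do too, which is the sense in which each image gets its own MLP.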

Space-level view. We can also regard the intermediate features as color maps, while the color space transformation can be realized by a linear combination of color maps (e.g., RGB to YUV). Specifically, the input image is initialized in the RGB space. As depicted in Figure 4(b), the first and second layers of the base network project the input into high-dimensional color spaces, and the last layer transforms the color maps back to the RGB space. The GFM performs a linear transformation on intermediate features, and can thus be regarded as a retouching operation on the mapped color space. In summary, the base network performs color decomposition, the condition network generates editing parameters, and GFM sequentially adjusts the intermediate color maps.
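As a hand-built analogy for the space-level view, the sketch below maps a pixel from RGB to YUV with a fixed linear transform, applies a scale-and-shift in that space, and maps back. The matrices are the commonly quoted BT.601 analog YUV coefficients (rounded, so the round trip is only approximate); in CSRNet the color spaces are learned rather than fixed, so this is an illustration, not the network itself.

```python
RGB_TO_YUV = [
    [0.299, 0.587, 0.114],
    [-0.14713, -0.28886, 0.436],
    [0.615, -0.51499, -0.10001],
]
YUV_TO_RGB = [
    [1.0, 0.0, 1.13983],
    [1.0, -0.39465, -0.58060],
    [1.0, 2.03211, 0.0],
]

def apply_matrix(matrix, px):
    """Linear combination of color channels at one pixel (a 1x1 convolution)."""
    return [sum(m * c for m, c in zip(row, px)) for row in matrix]

def retouch_in_yuv(rgb, scale, shift):
    """Project to YUV, modulate each color map, project back to RGB."""
    yuv = apply_matrix(RGB_TO_YUV, rgb)
    yuv = [a * c + b for c, a, b in zip(yuv, scale, shift)]
    return apply_matrix(YUV_TO_RGB, yuv)

unchanged = retouch_in_yuv([0.2, 0.6, 0.9], scale=[1.0, 1.0, 1.0], shift=[0.0, 0.0, 0.0])
brighter = retouch_in_yuv([0.5, 0.5, 0.5], scale=[1.2, 1.0, 1.0], shift=[0.0, 0.0, 0.0])
```

Scaling only the Y map brightens the pixel while leaving its chroma untouched, which mirrors how a modulation layer can act as an editing operation in an intermediate color space.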

Note that although we name our method “sequential modulation”, the network does not necessarily mimic the sequential processing with explicit retouching operations. Instead, the sequential processing in our method can be viewed as several stages of implicit retouching processes. Compared with forcing the network to model standard retouching operations, our method implicitly models the whole process, and is thus able to generalize to undefined/unseen operations. It can also be flexibly expanded to larger models for better performance.

4.3 Discussion

In this part, we show the merits of CSRNet by comparing it with other state-of-the-art methods. First, we adopt pixel-wise operations (1×1 filters), which better preserve the edges and textures of the original input image. In contrast, GAN-based methods [9, 24] tend to change local patterns and generate undesired artifacts (see Figure 5, Pix2Pix). Such artifacts are also observed in many other GAN-based applications [51, 41, 9, 25] if the network is not elaborately designed. Second, we use a global modulation strategy, which maintains color consistency across the image. By contrast, HDRNet [17] predicts a transformation coefficient for each pixel, and thus leads to abrupt color changes (see Figure 5, HDRNet) in various local regions. Third, we use a unified CNN framework with supervised learning, which could produce images of higher quality than RL-based methods [21, 34] (see Figure 5, White-box and Distort-and-recover). The most appealing trait of our method is that it is very lightweight and compact, with extremely few parameters and little computation cost. In addition, although CSRNet is specially designed for photo retouching, it can be extended and generalized to local enhancement with simple modifications. Experiments have demonstrated that the extended version of CSRNet can also deal with local enhancement tasks with satisfactory performance on par with other algorithms. More details are described in Section 6.

5 Learning Global Adjustments

5.1 Experimental Setup

Dataset and Metrics. MIT-Adobe FiveK [5] is a commonly-used photo retouching dataset with 5, 000 RAW images and corresponding retouched versions produced by five experts (A/B/C/D/E). We follow the previous methods [21, 9, 38, 17] to use the retouched results of expert C as the ground truth (GT). We adopt the same pre-processing procedure as [21] 111 and all the images are resized to 500px on the long edge. We randomly select 500 images for testing and the remaining 4,500 images for training. We use PSNR, SSIM and the Mean L2 error in CIE L*a*b space 222CIE L*a*b* (CIELAB) is a color space specified by the International Commission on Illumination. It describes all the colors visible to the human eye and was created to serve as a device-independent model to be used as a reference. to evaluate the performance.
Implementation Details. The base network contains 3 convolutional layers with channel size 64 and kernel size 1×1. The condition network also contains three convolutional layers with channel size 32. The kernel size of the first convolutional layer is set to 7×7 to increase the receptive field, while the others are 3×3. Each convolutional layer downsamples features to half size with a stride of 2. We use a global average pooling layer at the end of the condition network to obtain a 32-dimensional condition vector. The condition vector is then transformed by fully connected layers to generate the parameters of channel-wise scaling and shifting operations. In total, there are 6 fully connected layers for 3 scaling operations and 3 shifting operations. During training, the mini-batch size is set to 1. L1 loss is adopted as the loss function. The learning rate is initialized as

and is decayed by a factor of 2 every iterations. All experiments run

iterations. We use PyTorch framework and train all models on GTX 2080Ti GPUs. It takes only 5 hours to train the model.
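The architecture description above can be cross-checked against the reported model size with a short parameter count. This is a sketch under stated assumptions: every layer carries a bias, the base network uses 1×1 kernels, and the condition network uses a 7×7 kernel followed by two 3×3 kernels. The helper names are hypothetical. Under these assumptions the count reproduces the 36,489 parameters reported for CSRNet.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def csrnet_params(base_layers=3, base_kernel=1, base_width=64,
                  cond_width=32, cond_kernels=(7, 3, 3)):
    """Return (base, condition, fully connected) parameter counts."""
    # Output channels of each base layer; the last layer maps back to RGB.
    channels = [base_width] * (base_layers - 1) + [3]

    base, c_in = 0, 3
    for c_out in channels:
        base += conv_params(c_in, c_out, base_kernel)
        c_in = c_out

    cond, c_in = 0, 3
    for k in cond_kernels:
        cond += conv_params(c_in, cond_width, k)
        c_in = cond_width

    # One fully connected layer for scaling and one for shifting per base layer.
    fc = sum(2 * linear_params(cond_width, c) for c in channels)
    return base, cond, fc


base, cond, fc = csrnet_params()
print(base, cond, fc, base + cond + fc)
```

The base network alone accounts for 4,611 parameters, so most of the budget sits in the condition network and the fully connected layers that produce the modulation parameters.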

5.2 Simulating Retouching Operations

To support the analysis in Section 3 and Section 4, we use the proposed network to simulate the procedures of several retouching operations, including global brightness change, tone-mapping and contrast adjustment. Specifically, we adopt images retouched by expert C as inputs and apply retouching operations with specified adjustment parameters on the inputs as supervision labels. Then we utilize the base network and the proposed CSRNet to learn such mappings.

Theoretically, the base network can perfectly handle operations like global brightness change and tone-mapping, because these pixel-independent operations are equivalent to MLPs applied to individual pixels. For contrast adjustment, however, the base network alone should not suffice, since it cannot extract global information such as the image mean value.

  • The parameters for tone-mapping are set to .

TABLE I: Demonstration experiment on simulating retouching operations. Our method can successfully handle commonly used retouching operations, which is consistent with the theoretical analysis. The results for “contrast” adjustment also show the significance of adopting the condition network.

The results are shown in Table I. As expected, the base network can successfully deal with the pixel-independent operations (images are essentially identical when the PSNR exceeds 50dB). Nevertheless, we observe that the base network alone is unable to handle contrast adjustment, which requires global information. We can solve this problem by introducing the condition network. As we can see, the PSNR rises from 28dB to 60dB, demonstrating the effectiveness of the condition network in providing the required global information. This simulation experiment consolidates the theoretical analysis and the practical design of the proposed framework.
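The difference between the two kinds of operations can be illustrated with a small sketch. It assumes the common textbook forms of the operations (brightness as a constant offset, contrast as a scaling of each pixel's deviation from the image mean); the function names and sample values are illustrative and may differ from the exact formulation used in Section 3.

```python
def adjust_brightness(pixels, delta):
    """Pixel-independent: each output depends only on its own input value."""
    return [p + delta for p in pixels]


def adjust_contrast(pixels, alpha):
    """Needs a global statistic: the mean over the whole image."""
    mean = sum(pixels) / len(pixels)
    return [alpha * p + (1.0 - alpha) * mean for p in pixels]


# Two hypothetical single-channel images that share the pixel value 0.2.
dark = [0.1, 0.2, 0.3]
bright = [0.2, 0.7, 0.9]

# Brightness maps the shared value identically in both images.
print(adjust_brightness(dark, 0.1)[1], adjust_brightness(bright, 0.1)[0])

# Contrast maps the same value differently because the image means differ.
print(adjust_contrast(dark, 1.5)[1], adjust_contrast(bright, 1.5)[0])
```

A per-pixel mapping such as a stack of 1×1 convolutions sees only the value 0.2 in both cases, so it cannot produce two different outputs for it. This is the gap that a globally computed condition vector fills.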

Fig. 5: Visual comparison with state-of-the-art methods on the MIT-Adobe FiveK dataset. For each of the four examples, the panels are: input, Distort-and-recover, White-box, DPE, MIRNet (first row); Pix2Pix, HDRNet, 3D-LUT, CSRNet (ours), GT (second row).

5.3 Comparison with State-of-the-art Methods

To reveal the effectiveness of our method, we compare CSRNet with eight state-of-the-art methods: DUPE [38], HDRNet [17], DPE [9], MIRNet [46], 3D-LUT [47], White-Box [21], Distort-and-Recover [34] and Pix2Pix [24]. (Pix2Pix uses conditional generative adversarial networks to achieve image-to-image translation and is also applicable to image enhancement.) These methods are all well-known and representative in photo retouching, image enhancement or image translation.

Method                    Running time  PSNR (dB)  SSIM   L2 error (L*a*b*)  #params
White-Box [21]            1028.91 ms    18.59      0.797  17.42              8,561,762
Distort-and-Recover [34]  4063.35 ms    19.54      0.800  15.44              259,263,320
HDRNet [17]               6.03 ms       22.65      0.880  11.83              482,080
DUPE [38]                 8.47 ms       20.22      0.829  16.63              998,816
MIRNet [46]               252.60 ms     19.37      0.806  16.51              31,787,419
Pix2Pix [24]              181.98 ms     21.41      0.749  13.26              11,383,427
3D-LUT [47]               1.60 ms       23.12      0.874  11.26              593,516
CSRNet (ours)             1.92 ms       23.86      0.897  10.57              36,489

DPE [9]                   17.73 ms      23.76      0.881  10.60              3,335,395
CSRNet (ours)             1.92 ms       24.37      0.902  9.52               36,489
TABLE II: Quantitative comparison with state-of-the-art methods on the MIT-Adobe FiveK dataset (expert C).

Quantitative Comparison. We compare CSRNet with state-of-the-art methods in terms of PSNR, SSIM, and the mean L2 error in L*a*b* space. For White-Box, DUPE, DPE and MIRNet, we directly test their released pretrained models on our testing set, since their training code is unavailable. For HDRNet, Distort-and-Recover, 3D-LUT and Pix2Pix, we retrain the models on our training set using their public implementations. Because the released DPE model is trained on another input version of MIT-Adobe FiveK, we additionally train our model on that input version for a fair comparison. As shown in Table II, the proposed CSRNet outperforms all compared methods on all three metrics with the fewest parameters (36,489); the closest competitor in PSNR, 3D-LUT, reaches 23.12dB against 23.86dB for CSRNet. Specifically, White-Box and Distort-and-Recover are reinforcement-learning-based methods, which require millions of parameters but achieve unsatisfactory quantitative results. This is because they are not directly supervised by the ground truth image. HDRNet and DUPE solve the color enhancement problem by estimating the illumination map and require relatively few parameters (less than one million). Since the released model of DUPE is trained for under-exposed images, we can also refer to the result (23.04dB) provided in their paper. Pix2Pix and DPE both utilize generative adversarial networks and perform well quantitatively. However, they contain many more parameters than CSRNet. Under the same experimental setting, CSRNet outperforms DPE and Pix2Pix in all three metrics with far fewer parameters. 3D-LUT is a recent method that learns image-adaptive 3D lookup tables for image enhancement. It achieves good quantitative results with relatively few parameters. However, the design of learning lookup tables limits its application: it is only applicable to global photo retouching and cannot easily be extended to local enhancement tasks. In contrast, the proposed CSRNet is a flexible framework that can be easily extended to local enhancement tasks, as detailed in Section 6.

Running Time Comparison. Owing to the specialized design of the base/condition network and the utilization of 1×1 convolution, CSRNet enjoys fast and efficient inference at 1.92ms per image (roughly 480p in size). From Table II, we can observe that RL-based methods are quite time-consuming. HDRNet [17] is proposed for real-time image enhancement, yet our method is about three times faster than HDRNet. 3D-LUT [47] is slightly faster at 1.60ms, due to its much lower memory cost. It learns three adaptive 3D lookup tables, which are directly used to map input RGB values to output RGB values. However, learning such lookup tables requires about 16 times more trainable parameters than our method. Moreover, 3D-LUT can only be used for global retouching and cannot adapt to local enhancement tasks. In summary, CSRNet is very lightweight and efficient, which is significant for deployment in real applications.

Visual Comparison. The results of the visual comparison are shown in Figure 5. The input images from the MIT-Adobe FiveK dataset are generally taken under low-light conditions. Distort-and-recover tends to generate over-exposed output. White-box and DPE appear to only increase the brightness but fail to modify the original tone, which is oversaturated. The outputs of MIRNet tend to be dark and unsaturated. The enhanced images obtained by Pix2Pix contain artifacts. HDRNet outputs images with unnatural color in some regions (e.g., green color on the face and messy color in the flower). The results of 3D-LUT may contain color contamination in the white sky areas, especially in the pink flower example in Figure 5. DPE also produces images with color contamination in the background: the background of the pink flower image is slightly bluish, and the overall tone of the house image is blue as well. In conclusion, our method generates the most realistic images among all methods. More visual results are shown in the supplementary materials.

Fig. 6: Ranking results of user study. Rank 1 means the best visual quality. Our method is favored by users in most cases.

User Study. We have conducted a user study with 20 participants for subjective evaluation. The participants are asked to rank four retouched image versions (HDRNet [17], DPE [9], expert-C (GT) and ours) according to the aesthetic visual quality. 50 images are randomly selected from the testing set and shown to each participant. The four retouched versions are displayed on the screen in random order. Users are asked to pay attention to whether the color is vivid, whether there are artifacts and whether the local color is harmonious. Since HDRNet and DPE are representative model-based and GAN-based methods, respectively, we choose them for the comparison. As suggested in Figure 6, our results are ranked above those of HDRNet and DPE, with 553 images ranked first or second: 245 images of our method are ranked first, second only to expert C, and 308 images are ranked second, ahead of the other methods. Note that, in the MIT-Adobe FiveK dataset [5], some of the GT images seem to be darker, due to the retoucher’s personal stylistic preference. In practice, however, we find that users tend to prefer images with brighter and more vivid color. Hence, in some cases, GT images are ranked last.

5.4 Exploration on Normalization Strategies

In Section 4.1.3, we introduce the unit normalization (UN) operation to normalize the condition vectors, which can enhance the training stability and improve the performance. We further explore various feature normalization methods: 1) Without any normalization (None). 2) Softmax normalization. 3) Sigmoid normalization. 4) Softmax scaling. 5) Sigmoid scaling. 6) Z-score normalization (standardization). 7) Min-max normalization (rescaling). 8) The proposed unit normalization (UN). Their formulas are listed in Table III.


Normalization/scaling Formula PSNR
None 23.58
Softmax 23.47
Sigmoid 23.62
Softmax scaling 23.48
Sigmoid scaling 23.61
Z-score 23.69
Min-max 23.68
UN (proposed) 23.86
TABLE III: Exploration on various feature normalization methods. The proposed unit normalization (UN) significantly improves the performance. the transformed feature. mean value. standard deviation.

As shown in Table III, adopting a proper normalization operation can help improve the performance. Sigmoid normalization, Sigmoid scaling, Z-score, Min-max and UN all bring PSNR improvements. Specifically, the quantitative results of Z-score and Min-max normalization are similar, improving the PSNR by 0.11 dB and 0.10 dB, respectively. In contrast, Softmax normalization and Softmax scaling cause a performance drop of about 0.1 dB. Among all these operations, the proposed UN operation achieves the highest PSNR value, improving the PSNR by 0.28 dB over the unnormalized baseline.
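The normalization families compared above can be sketched as follows. The formula column of Table III is not available in this text, so the block uses the standard textbook definitions, and it assumes that unit normalization rescales the condition vector to unit L2 length, which is how the term is commonly used and may differ from the definition in Section 4.1.3. The two "scaling" variants are omitted because their exact form cannot be recovered here. All function names are illustrative.

```python
import math


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def z_score(v):
    mu = sum(v) / len(v)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v)) or 1.0
    return [(x - mu) / sigma for x in v]


def min_max(v):
    lo, hi = min(v), max(v)
    span = (hi - lo) or 1.0  # avoid division by zero for constant vectors
    return [(x - lo) / span for x in v]


def unit_norm(v):
    # Assumption: "unit normalization" means division by the L2 norm.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


condition = [3.0, 4.0]  # hypothetical 2-D condition vector
print(unit_norm(condition))
```

One property worth noting under this assumption: L2 normalization preserves the direction of the condition vector and discards only its magnitude, whereas softmax and min-max also change the relative spacing of its entries.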

Qualitatively, the visual results of the various feature normalization/scaling operations also differ. As shown in Figure 7, in some special cases, if the condition vector is not well normalized, the model confuses the foreground object color with the background color and outputs a wrong background color. Such a problem can also be observed in other methods, such as DPE, MIRNet, HDRNet and 3D-LUT. By adopting appropriate normalization operations in our method, this issue can be resolved. Further, compared with other common normalization strategies, the proposed UN generates more vivid and saturated retouched results. In the second example in Figure 7, the output produced by Softmax is somewhat whitish, with unsaturated and low-contrast color, while UN generates a more saturated and vibrant yellow wall. In summary, both quantitative and qualitative results demonstrate the effectiveness of the proposed UN operation.

Fig. 7: Visual comparison among various normalization or scaling operations on the condition vector. For each of the two examples, the panels are: input, None, Softmax (first row); Sigmoid, Softmax scaling, Sigmoid scaling (second row); Z-score, Min-max, UN (third row). In the first example (rows 1 to 3), the background color of the input image is nearly all white (an extreme value). Without UN, CSRNet tends to confuse the foreground object color with the background color, and outputs a pink background. By introducing UN, the problem can be resolved. In the second example (rows 4 to 6), adopting UN helps produce more vivid and saturated colors (the yellow wall).

5.5 Ablation study

In this section, we investigate our CSRNet in four aspects: unit normalization, base network, modulation strategy, and condition network. We present all the results in PSNR.

Base network. The base network of our CSRNet contains 3 convolutional layers with kernel size 1×1 and channel number 64. As mentioned before, we assume that the base network with kernel size 1×1 performs like a stacked color decomposition, and each layer represents a different color space of the input image. Here, we explore the base network by changing its kernel size and increasing the number of layers. Besides, we remove the condition network to verify whether the base network can deal with image retouching alone.

From Table IV, we can observe that the base network cannot solve the image retouching problem well without the condition network. Specifically, when we expand the filter size to 3×3 and increase the number of layers to 7, there is only a marginal improvement (0.2 dB) in terms of PSNR.

Considering the cases with the condition network, if we fix the number of layers and expand the kernel size to 3×3, there is no improvement. Therefore, the processing of the base network is purely pixel-independent and can be achieved by 1×1 filters. If we fix the kernel size to 1×1 and increase the number of layers to 7, the performance improves slightly (0.08 dB). Since more layers require more parameters, we adopt a lightweight architecture with only three layers.

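Condition      Base layers  Base kernel size  PSNR   #params
w/o condition  3            1×1               20.47  4,611
w/o condition  3            3×3               20.69  40,451
w/o condition  7            3×3               20.67  188,163
w condition    3            1×1               23.86  36,489
w condition    3            3×3               23.75  72,329
w condition    5            1×1               23.88  53,257
w condition    5            3×3               23.72  154,633
w condition    7            1×1               23.94  70,025
w condition    7            3×3               23.73  236,937
TABLE IV: Results of ablation study for the base network.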
Condition network      Handcrafted prior  Dim     PSNR   #params
w/o condition network  None               0       20.47  4,611
w/o condition network  brightness         1       21.47  5,135
w/o condition network  average intensity  3       21.93  5,659
w/o condition network  histograms         768     22.90  206,089
w condition network    None (ours)        32      23.86  36,489
w condition network    brightness         1+32    23.41  36,751
w condition network    average intensity  3+32    23.67  37,275
w condition network    histograms         768+32  23.51  237,705
TABLE V: Results of ablation study for the condition network.
Modulation strategy  PSNR   #params
w/o condition        20.47  4,611
concat               23.31  29,891
AdaFM (3×3)          23.52  71,073
AdaFM (5×5)          23.43  140,241
SFTNet               23.68  36,489
GFM                  23.86  36,489
TABLE VI: Results of ablation study for the modulation strategy.
Fig. 8: The first row (from expert A to expert E) shows smooth transition effects between different styles by image interpolation. In the second row (from the raw input to the CSRNet output), we use image interpolation to control the retouching strength from the input image to the automatic CSRNet retouched result (learned with GT of expert B). We denote the interpolation coefficient for each image.

Condition network. The condition network aims to estimate a condition vector that represents global information of the input image. Alternatively, we can use other handcrafted global priors to control the base network, such as brightness, average intensity, and histograms. Here, we investigate the effectiveness of these global priors. For brightness, we convert the RGB image to a gray image and take the mean value of the gray image as the global prior. For average intensity, we compute the mean value of each channel of the RGB image. For histograms, we generate the histogram of each channel of the RGB image and concatenate them into a single vector. We directly use these handcrafted features in place of the condition vector that is fed to the fully connected layers. Besides, we combine the global priors with our condition network to control the base network. In particular, we concatenate the global prior with the condition vector produced by the condition network. From Table V, on the MIT-Adobe FiveK dataset, all three global priors largely improve the performance compared with the base network alone, which means that global priors are essential for image retouching. However, simply concatenating a handcrafted global prior with the predicted condition vector does not bring further improvement. In conclusion, our CSRNet can already extract effective global supporting features.
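The three handcrafted priors can be sketched as below, with dimensions matching the 1, 3 and 768 entries of Table V. Several details are assumptions rather than the authors' implementation: 8-bit integer RGB input, the ITU-R BT.601 luma weights for the gray conversion, 256 bins per channel, and histograms normalized to sum to one per channel. The function names are hypothetical.

```python
def brightness_prior(image):
    """1-D prior: mean of the gray image (BT.601 luma weights assumed)."""
    gray = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in image]
    return [sum(gray) / len(gray)]


def average_intensity_prior(image):
    """3-D prior: mean value of each RGB channel."""
    n = len(image)
    return [sum(px[c] for px in image) / n for c in range(3)]


def histogram_prior(image):
    """768-D prior: three concatenated 256-bin channel histograms."""
    n = len(image)
    hist = [0.0] * (3 * 256)
    for px in image:
        for c in range(3):
            hist[c * 256 + px[c]] += 1.0 / n  # each channel sums to 1
    return hist


# Hypothetical two-pixel image given as a list of (r, g, b) tuples.
image = [(0, 0, 0), (255, 255, 255)]
print(len(brightness_prior(image)),
      len(average_intensity_prior(image)),
      len(histogram_prior(image)))
```

Each of these vectors would be fed to the fully connected layers in place of, or concatenated with, the 32-dimensional learned condition vector.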

Modulation strategy. Our framework adopts GFM to modulate the intermediate features under different conditions. Here, we compare different modulation strategies: concatenation, AdaFM [19], and SFTNet [39]. For concatenation, we concatenate the condition vector directly with the input image. For AdaFM, we use kernel sizes 3×3 and 5×5. For SFTNet, we remove all the stride operations and the global average pooling in the condition network. The modified condition network therefore generates a condition map, allowing spatial feature modulation on the intermediate features.

From Table VI, we observe that although AdaFM and SFTNet can achieve local feature modulation, they cannot surpass the proposed GFM when the base network and condition network are jointly optimized. This suggests that global image retouching mainly depends on the overall context rather than specific spatial information. As for AdaFM, simply expanding its kernel size does not bring improvement. In conclusion, conditional global image retouching can be effectively achieved by GFM, which only scales and shifts the intermediate features.
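The parameter cost of each modulation strategy can be estimated with a short calculation. This is a sketch under explicit assumptions, none of which are confirmed implementation details: the condition network uses 7×7, 3×3 and 3×3 convolutions with 32 channels, AdaFM predicts one k×k depth-wise filter plus a bias for each modulated channel from the 32-dimensional condition vector, and concatenation appends the condition vector to the three input channels. Under these assumptions the totals coincide with the #params column of Table VI.

```python
COND_DIM = 32
BASE_CHANNELS = (64, 64, 3)  # output channels of the three base layers


def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


def base_params(in_channels=3):
    """Base network with 1x1 convolutions."""
    total, c_in = 0, in_channels
    for c_out in BASE_CHANNELS:
        total += conv_params(c_in, c_out, 1)
        c_in = c_out
    return total


# Assumed condition network: 7x7, 3x3, 3x3 convolutions with 32 channels.
COND_NET = conv_params(3, COND_DIM, 7) + 2 * conv_params(COND_DIM, COND_DIM, 3)


def gfm_total():
    # One scale and one shift value per modulated channel.
    heads = sum(2 * linear_params(COND_DIM, c) for c in BASE_CHANNELS)
    return base_params() + COND_NET + heads


def adafm_total(k):
    # Assumed: a k x k depth-wise filter plus a bias per modulated channel.
    heads = sum(linear_params(COND_DIM, c * (k * k + 1)) for c in BASE_CHANNELS)
    return base_params() + COND_NET + heads


def concat_total():
    # The condition vector is appended to the RGB input of the base network.
    return base_params(in_channels=3 + COND_DIM) + COND_NET


print(concat_total(), adafm_total(3), adafm_total(5), gfm_total())
```

The calculation makes the trade-off visible: AdaFM's cost grows with the square of its kernel size, while GFM needs only two values per channel.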

5.6 Multiple Styles and Strength Control

Photo retouching is a highly ill-posed problem. Different photographers may have different preferences on retouching styles. Our method enjoys the flexibility of multiple style learning. Once the model is trained on a specific retouching style, we can easily transfer it to another retouching style by finetuning only the condition network, which is much faster than training from scratch. In contrast, other methods generally require retraining the whole model on new datasets.

Expert  PSNR (finetune)  PSNR (scratch)
A       22.49            22.35
B       25.63            25.60
D       23.01            23.12
E       23.93            23.89
TABLE VII: Performance for multiple styles (A/B/D/E).

Here, we transfer the retouching model of expert C to experts A, B, D, and E. From Table VII, we can observe that finetuning the condition network achieves results comparable to training from scratch. This indicates that the fixed base network performs like a stacked color decomposition and has the flexibility to be modulated to different retouching styles.

Given retouched outputs of different styles, users can achieve smooth transition effects between different styles by using “image interpolation”, which produces intermediate stylistic images between them:

$$I_{\text{out}} = (1 - \alpha)\, I_1 + \alpha\, I_2,$$

where $I_1$ and $I_2$ are the two images to be interpolated, and $\alpha$ is the coefficient controlling the combined styles. In Figure 8, the output style changes continuously from expert A to expert E.

Besides, for a given style, users can also control the retouching strength by image interpolation between the input image and the retouched one. For example, if the automatically retouched output is too bright, users may wish to decrease the overall luminance, as shown in Figure 8. There are also other alternatives for realizing continuous output effects, such as DNI [40], AdaFM [19] and DynamicNet [36]. As photo retouching consists of only pixel-wise operations, the simplest image interpolation is already enough to achieve satisfactory results. More results can be found in the supplementary materials.
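Both uses of image interpolation, blending two styles and controlling retouching strength, reduce to the same per-pixel convex combination. The sketch below assumes images flattened into lists of floats and the convention that a coefficient of 0 returns the first image and 1 returns the second; the function name and sample values are illustrative.

```python
def interpolate(image_a, image_b, alpha):
    """Per-pixel blend: alpha = 0 gives image_a, alpha = 1 gives image_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same size")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(image_a, image_b)]


# Hypothetical pixel values: the raw input and an automatically retouched output.
raw_input = [0.0, 1.0]
retouched = [1.0, 0.0]

# A coefficient of 0.25 applies a quarter of the retouching strength.
print(interpolate(raw_input, retouched, 0.25))
```

Passing two retouched outputs of different experts instead of the raw input and one output gives the style transition; no retraining or network evaluation beyond the two endpoints is needed.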

Fig. 9: The framework of CSRNet-L. Compared with CSRNet, there are three major modifications for achieving local adjustment. 1) Enlarge the filter size in the base network. 2) The condition network does not downsample the feature maps, and thus outputs a condition map rather than a vector. 3) Adopt the spatial feature modulation strategy. The modified parts in the framework are highlighted in red.
Fig. 10: From global to local. Panels for each example: input, GT, CSRNet (1×1 + GFM), CSRNet-L (3×3 + SFM). By expanding the filter size of the base network and adopting SFM, CSRNet-L can achieve local effects, demonstrating the extensibility and superiority of the proposed framework.

6 Extension to Learning Local Effects

Thanks to the utilization of 1×1 convolutions and the corresponding condition network, the proposed method contains only a few parameters and is efficient for global tonal adjustment. Besides, our method can also be extended to learn local enhancement by expanding the base network and adopting a spatial feature modulation strategy. The extended network that can achieve local effects is denoted as CSRNet-L. The overall framework is depicted in Figure 9.

6.1 Framework of CSRNet-L

Local adjustments require complex spatially varying manipulations on local regions, such as adjusting local contrast, performing stylistic effects and stressing the foreground color tone. To realize such local adjustments, the model needs to be capable of learning to capture local patterns and perform location-aware operations. Although our method is proposed for global image adjustments, it can be easily extended to learn complicated local adjustments with two simple modifications, which are marked with red color in Figure 9.

Expanding the base network. A 1×1 convolution is not enough to handle local enhancement, since it manipulates pixels independently without considering neighborhood information. We increase the filter size of the base network to 3×3, so that it can learn more complex nonlinear mappings and deal with local patterns.

Spatial feature modulation. Local effects require spatially variant operations on different regions. However, convolution operations are spatially invariant, which means that the same convolution filter weights are applied at all positions. This property conflicts with the goal of local enhancement, which manipulates images according to pixel locations. Furthermore, once the network is trained, all the filter weights are fixed and applied to all image samples. To address this, we extend the global feature modulation (GFM) to spatial feature modulation (SFM), which was first introduced in SFT-Net [39] for semantic super-resolution. The formula of SFM is similar to that of GFM:

$$\mathrm{SFM}(\mathbf{x}) = \boldsymbol{\alpha} \odot \mathbf{x} + \boldsymbol{\beta},$$

where $\odot$ denotes element-wise multiplication, $\mathbf{x}$ is the feature maps to be modulated in the base network, and $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ are learned modulation parameters with the same size as $\mathbf{x}$. Note that, in GFM, the modulation parameters are only $C$-dimensional vectors for global adjustment, where $C$ is the number of channels. When using SFM in the base network, the condition network outputs a set of condition maps instead of condition vectors, as shown in Figure 9. Hence, SFM can help the network perform spatially variant and location-specific manipulations.
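The contrast between the two modulation schemes can be made concrete with a small sketch that operates on nested lists indexed as [channel][row][column]. It assumes both schemes take the scale-and-shift form described in the text; the function names `gfm` and `sfm` and the toy values are illustrative and not taken from the authors' code.

```python
def gfm(features, scale, shift):
    """Global modulation: one (scale, shift) pair per channel."""
    return [
        [[scale[c] * v + shift[c] for v in row] for row in fmap]
        for c, fmap in enumerate(features)
    ]


def sfm(features, scale_map, shift_map):
    """Spatial modulation: one (scale, shift) pair per channel and position."""
    return [
        [
            [scale_map[c][i][j] * v + shift_map[c][i][j]
             for j, v in enumerate(row)]
            for i, row in enumerate(fmap)
        ]
        for c, fmap in enumerate(features)
    ]


# Hypothetical single-channel 2x2 feature map.
features = [[[1.0, 2.0], [3.0, 4.0]]]

# GFM treats every position of the channel identically.
print(gfm(features, scale=[2.0], shift=[1.0]))

# SFM can leave one pixel untouched while rewriting its neighbor.
print(sfm(features,
          scale_map=[[[1.0, 0.0], [2.0, 1.0]]],
          shift_map=[[[0.0, 5.0], [0.0, 0.0]]]))
```

When the scale and shift maps are constant within each channel, `sfm` reduces to `gfm`, so GFM can be seen as the spatially uniform special case.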

Method    Base layers  Base kernel  Modulation  #params     LLF (PSNR / SSIM)  Foreground (PSNR / SSIM)  Xpro (PSNR / SSIM)  Watercolor (PSNR / SSIM)
Input     -            -            -           -           18.92 / 0.648      23.31 / 0.933             18.56 / 0.925       19.62 / 0.657
CSRNet    7            1×1          GFM         70,025      21.38 / 0.769      26.55 / 0.940             28.87 / 0.965       21.97 / 0.689
CSRNet-L  3            3×3          SFM         72,329      25.03 / 0.904      27.13 / 0.943             29.48 / 0.968       24.35 / 0.832
Pix2Pix   -            -            -           11,383,427  24.76 / 0.879      24.12 / 0.884             26.85 / 0.926       22.43 / 0.739
HDRNet    -            -            -           482,080     24.94 / 0.871      26.73 / 0.943             30.30 / 0.975       23.50 / 0.800

TABLE VIII: Quantitative results on local effect datasets. By extending CSRNet, our CSRNet-L successfully achieves local effect adjustments and obtains competitive performance with far fewer parameters.

6.2 Learning Stylistic Local Effects

To demonstrate the effectiveness of CSRNet-L for local adjustments, we conduct experiments with four stylistic local effects. It is shown that our extended method can attain comparable performance with other specially designed networks.

Datasets. We evaluate the performance of our method on four local enhancement tasks:

  • Fast Local Laplacian Filter [1]. A multi-scale operator for edge-preserving detail enhancement. We apply this operator on images retouched by expert C of the MIT-Adobe FiveK dataset [5].

The following three datasets are proposed by [43]. 115 images from Flickr are selected and retouched by a professional photographer using Photoshop. 70 images were chosen for training and the remaining 45 images for testing. The photographer performed a wide range of operations to adjust the images and created three different stylistic local effects. Here we make a brief introduction to them.

  • Foreground Pop-Out. The photographer increased both the contrast and color saturation of foreground salient objects, while suppressing those of the background. Consequently, the foreground is highlighted and seems to “pop out”.

  • Local Xpro. This effect was produced by generalizing the prevailing “cross processing” effect in a local manner. (Cross processing, abbreviated to Xpro, is the deliberate processing of photographic film in a chemical solution intended for a different type of film. Cross-processed photographs are often characterized by unnatural colors and high contrast.) The photographer first isolated different image regions and then applied a series of operations on each region according to the semantic content within that region.

  • Watercolor. This effect makes the images look like “watercolor” painting style which tends to be brighter and less saturated. Please refer to [43] for detailed procedures of creating this effect.

Implementation details. The network structure is similar to that in Section 4.1.1, except that we expand the filter size of the base network to 3×3 and adopt SFM instead of GFM, as depicted in Figure 9. The downsampling operations are removed from the condition network so that the output size is the same as the feature map size in the base network. To achieve local effects, we first train the base network, and then add the condition network for joint training. For training the base network, the learning rate is initialized to , while the learning rate is initialized to for joint training. We find that this two-stage training strategy obtains more stable and better performance on local enhancement tasks.

Fig. 11: Visual comparison on local effect datasets. Each row shows, from left to right: input, Pix2Pix, HDRNet, CSRNet-L, GT. Rows (a) and (b): Local Laplacian Filter. Rows (c) and (d): Foreground Pop-Out effect. Rows (e) and (f): Local Xpro effect. Rows (g) and (h): Watercolor effect. The proposed CSRNet-L successfully achieves local adjustments and obtains competitive quantitative and qualitative performance on several local effect datasets. Please zoom in for best view.

6.3 Experimental Results

In this section, we first compare CSRNet-L with a deeper CSRNet version, showing the effectiveness of extending CSRNet from global adjustment to local enhancement. Then we compare with other local enhancement methods, including Pix2Pix and HDRNet. The experimental results reveal that CSRNet-L can achieve comparable performance on several local effect datasets yet with much fewer parameters.

6.3.1 Extending CSRNet: from global to local

We demonstrate the effectiveness of extending CSRNet, in terms of enlarging the filter size and adopting spatial feature modulation (SFM). As mentioned above, 1×1 filters are not able to achieve local adjustment, so enlarging the filter size in the base network is necessary. We expand the filter size from 1×1 to 3×3, so that the base network can learn local patterns and perform local operations. Since enlarging the filter size brings more parameters, we deepen the original CSRNet to 7 layers for a fair comparison. As shown in the third and fourth rows of Table VIII, by enlarging the filter size and utilizing SFM, the network obtains a distinct performance gain in terms of PSNR and SSIM. Figure 10 shows that CSRNet-L can successfully achieve local effects, while CSRNet with 1×1 filters only performs global adjustment and fails to realize local enhancement.

6.3.2 Comparison with other methods

We compare our results with HDRNet [17] and Pix2Pix [24]. Since Pix2Pix is a well-known image-to-image translation framework and HDRNet is a state-of-the-art image enhancement method, we use them as baselines for comparison. For fairness, we retrain their models on each local effect dataset. The PSNR values between input images and ground truth images are shown in the second row of Table VIII. It numerically reflects the gap between the inputs and the stylized outputs.

As shown in Table VIII, Pix2Pix includes over 11 million parameters but obtains the lowest PSNR among the three local enhancement methods on every dataset. The images produced by Pix2Pix usually contain artifacts and incorrect color tones. CSRNet-L also outperforms HDRNet in PSNR except on the “Local Xpro” dataset. Specifically, CSRNet-L contains only 72k parameters (about 15% of HDRNet) but reaches 25.03dB on the LLF dataset, which exceeds Pix2Pix and HDRNet by 0.27dB and 0.09dB, respectively. CSRNet-L also achieves the best quantitative performance on the “Foreground Pop-out” and “Watercolor” datasets, while it obtains the second best performance on the “Local Xpro” dataset.

Visual comparisons are shown in Figure 11. It can be observed that CSRNet-L successfully reproduces the effect of the local Laplacian filter, which sharpens the input image and enhances the image details. In contrast, HDRNet cannot reproduce this effect well (see the neck of the swan in the second row of Figure 11). As for the “Foreground Pop-out” dataset, Pix2Pix and HDRNet fail to accurately highlight the foreground object, as displayed in the third and fourth rows of Figure 11. HDRNet tends to confuse the colors in local regions and mix up the foreground and the background. There are also some errors in Pix2Pix when removing the background color. Similarly, HDRNet cannot produce the watercolor effect well, since its outputs still retain the style of realistic photos. Pix2Pix can transfer the image into watercolor style well, but the results are prone to artifacts (see the seventh and eighth rows of Figure 11). In conclusion, by extending CSRNet, our method can easily achieve local effects and obtain results comparable to other methods with far fewer parameters. This shows the effectiveness of the proposed framework.

6.3.3 Effect of Training strategy

For local adjustment, we find that the training strategy plays an important role in the final performance. To optimize CSRNet-L, we first train the base network, then add the condition network and train them jointly. Table IX shows that this two-stage training strategy obtains better performance than training from scratch. In fact, if we train the base network and condition network together from scratch, the performance is sometimes even worse than that of the base network alone. Compared with training from scratch, CSRNet-L with the finetuning strategy improves the PSNR by 0.15dB, 0.51dB, 0.36dB and 0.27dB on the four local effect datasets, respectively. It also improves the stability of training.

Settings     Strategy   LLF     Foreground   Xpro    Watercolor
Only Base    scratch    22.26   26.94        29.38   24.19
Base + GFM   scratch    23.75   26.82        28.96   24.13
Base + SFM   scratch    24.88   26.62        29.12   24.08
Base + GFM   finetune   23.81   26.99        29.37   24.18
Base + SFM   finetune   25.03   27.13        29.48   24.35

TABLE IX: Effectiveness of SFM and training strategy (PSNR in dB).
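The gains quoted for the training strategy can be re-derived from Table IX with a few lines of arithmetic. The following is a minimal sketch: the PSNR values are copied from the table, while the dictionary layout and the helper name `psnr_gain` are illustrative assumptions and not part of the original work.

```python
# PSNR values (dB) copied from Table IX; the data layout is illustrative.
DATASETS = ("LLF", "Foreground", "Xpro", "Watercolor")
TABLE_IX = {
    ("Only Base", "scratch"): (22.26, 26.94, 29.38, 24.19),
    ("Base + GFM", "scratch"): (23.75, 26.82, 28.96, 24.13),
    ("Base + SFM", "scratch"): (24.88, 26.62, 29.12, 24.08),
    ("Base + GFM", "finetune"): (23.81, 26.99, 29.37, 24.18),
    ("Base + SFM", "finetune"): (25.03, 27.13, 29.48, 24.35),
}


def psnr_gain(row_a, row_b):
    """Per-dataset PSNR difference (row_a minus row_b), rounded to 0.01 dB."""
    return tuple(round(a - b, 2) for a, b in zip(TABLE_IX[row_a], TABLE_IX[row_b]))


# Two-stage (finetune) training versus joint training from scratch, both with SFM.
finetune_gain = psnr_gain(("Base + SFM", "finetune"), ("Base + SFM", "scratch"))

# Joint training from scratch versus the base network alone.
scratch_vs_base = psnr_gain(("Base + SFM", "scratch"), ("Only Base", "scratch"))

for name, g, s in zip(DATASETS, finetune_gain, scratch_vs_base):
    print(f"{name:<11} finetune-vs-scratch {g:+.2f} dB, scratch-vs-base {s:+.2f} dB")
```

The second comparison is negative on three of the four datasets, which is consistent with the observation that training from scratch can be worse than using the base network alone.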

6.3.4 Effectiveness of conditional modulation

To validate the effectiveness of SFM, we conduct experiments on three settings: (1) the base network alone, without modulation; (2) the base network with GFM; (3) the base network with SFM. The results are summarized in Table IX. When equipped with GFM, the PSNR rises from 22.26dB to 23.81dB on the LLF dataset, but on the other datasets the performance does not improve noticeably (compare the “Only Base” row with the “Base + GFM, finetune” row in the table). This indicates that GFM has little effect for local enhancement. However, after we adopt SFM, the performance improves over the base network alone by 2.77dB, 0.19dB, 0.10dB, and 0.16dB on the four local effect datasets, respectively. This shows that SFM plays a crucial role in local adjustment.
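The difference between the two modulation types can be illustrated with a toy example. This is a minimal sketch that assumes both GFM and SFM apply an affine scale-and-shift to intermediate features, with GFM using one pair of parameters per channel and SFM using one pair per pixel. The function names, the nested-list feature layout, and all numeric values are hypothetical and chosen only for illustration.

```python
def gfm(features, scales, shifts):
    """Global modulation: one (scale, shift) pair per channel, shared by all pixels."""
    return [
        [[scales[c] * v + shifts[c] for v in row] for row in channel]
        for c, channel in enumerate(features)
    ]


def sfm(features, scale_maps, shift_maps):
    """Spatial modulation: a separate (scale, shift) pair for every pixel of every channel."""
    return [
        [
            [scale_maps[c][y][x] * v + shift_maps[c][y][x] for x, v in enumerate(row)]
            for y, row in enumerate(channel)
        ]
        for c, channel in enumerate(features)
    ]


# Toy feature map: 1 channel, 2x2 pixels (hypothetical values).
feat = [[[1.0, 2.0], [3.0, 4.0]]]

# GFM changes every pixel by the same affine rule.
global_out = gfm(feat, scales=[2.0], shifts=[0.5])

# SFM can leave the left column untouched and modify only the right column.
scale_maps = [[[1.0, 2.0], [1.0, 2.0]]]
shift_maps = [[[0.0, 0.5], [0.0, 0.5]]]
local_out = sfm(feat, scale_maps, shift_maps)

print(global_out)
print(local_out)
```

Under this assumption, GFM is the special case of SFM in which the scale and shift maps are constant over the image, which gives one intuition for why it cannot express region-specific effects.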

7 Conclusion

In this work, we present an efficient image retouching network with extremely few parameters. Our key idea is to mimic the sequential processing procedure and implicitly model the editing operations in an end-to-end trainable network. The proposed CSRNet (Conditional Sequential Retouching Network) consists of a base network and a condition network. The base network acts like an MLP for individual pixels, while the condition network extracts global features to generate a condition vector. The condition vector is then transformed to modulate the intermediate features of the base network by global feature modulation (GFM). Extensive experiments show that our method achieves state-of-the-art performance on the benchmark MIT-Adobe FiveK dataset quantitatively and qualitatively. Besides achieving global tonal adjustment, the proposed framework can be extended to learn local effects as well. By expanding the base network and introducing spatial feature modulation (SFM), our extended method, named CSRNet-L, can successfully achieve local effect adjustment and attain performance comparable to that of several existing methods.
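The statement that the base network acts like an MLP on individual pixels can be made concrete with a small sketch. It assumes a stack of 1×1 convolutions with ReLU activations between layers, which is equivalent to running the same MLP on every pixel. The weights, layer sizes, and function names below are hypothetical and are not the trained CSRNet parameters.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def pixel_mlp(pixel, layers):
    """Apply a small MLP to one RGB pixel; `layers` is a list of (weights, biases)."""
    x = list(pixel)
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:  # no activation after the final layer
            x = [relu(v) for v in x]
    return x


def base_network(image, layers):
    """Equivalent of stacked 1x1 convolutions: the same MLP runs on every pixel."""
    return [[pixel_mlp(p, layers) for p in row] for row in image]


# Hypothetical weights: 3 -> 2 -> 3 channels.
layers = [
    ([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]], [0.0, 0.1]),
    ([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], [0.0, 0.0, 0.0]),
]

image = [[(0.2, 0.4, 0.6), (0.8, 0.1, 0.3)]]  # 1x2 RGB image
swapped = [[image[0][1], image[0][0]]]        # same pixels, swapped positions

out = base_network(image, layers)
out_swapped = base_network(swapped, layers)

# Pixel independence: moving a pixel moves its output and changes nothing else.
print(out[0][0] == out_swapped[0][1], out[0][1] == out_swapped[0][0])
```

Because each output pixel depends only on the corresponding input pixel, such a network can only express global, position-independent operations, which is why the extension to local effects needs spatially varying modulation.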


This work is partially supported by the National Natural Science Foundation of China (61906184), Science and Technology Service Network Initiative of Chinese Academy of Sciences (KFJ-STS-QYZX-092), Shenzhen Basic Research Program (JSGG20180507182100698, CXB201104220032A), the Joint Lab of CAS-HK, Shenzhen Institute of Artificial Intelligence and Robotics for Society.


  • [1] M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand (2014) Fast local laplacian filters: theory and applications. ACM Transactions on Graphics (TOG) 33 (5), pp. 1–14. Cited by: 1st item.
  • [2] M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand (2014) Fast local laplacian filters: theory and applications. ACM Transactions on Graphics (TOG) 33 (5), pp. 1–14. Cited by: §2.
  • [3] S. Bianco, C. Cusano, F. Piccoli, and R. Schettini (2019) Learning parametric functions for color image enhancement. In International Workshop on Computational Color Imaging, pp. 209–220. Cited by: §2, §2.
  • [4] S. Bianco, C. Cusano, F. Piccoli, and R. Schettini (2020) Personalized image enhancement using neural spline color transforms. IEEE Transactions on Image Processing 29, pp. 6223–6236. Cited by: §2.
  • [5] V. Bychkovsky, S. Paris, E. Chan, and F. Durand (2011) Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR 2011, pp. 97–104. Cited by: §1, §1, §2, §5.1, §5.3, 1st item.
  • [6] Y. Chai, R. Giryes, and L. Wolf (2020) Supervised and unsupervised learning of parameterized color enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 992–1000. Cited by: §2.
  • [7] J. Chen, A. Adams, N. Wadhwa, and S. W. Hasinoff (2016) Bilateral guided upsampling. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–8. Cited by: §2.
  • [8] J. Chen, S. Paris, and F. Durand (2007) Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics (TOG) 26 (3), pp. 103–es. Cited by: §2.
  • [9] Y. Chen, Y. Wang, M. Kao, and Y. Chuang (2018) Deep photo enhancer: unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6306–6314. Cited by: §1, §1, §2, §4.3, §5.1, §5.3, §5.3, TABLE II.
  • [10] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §2.
  • [11] Y. Deng, C. C. Loy, and X. Tang (2018) Aesthetic-driven image enhancement by adversarial learning. In Proceedings of the 26th ACM international conference on Multimedia, pp. 870–878. Cited by: §2.
  • [12] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §1, §3.
  • [13] Y. Dong, Y. Liu, H. Zhang, S. Chen, and Y. Qiao (2020) FD-gan: generative adversarial networks with fusion-discriminator for single image dehazing. In AAAI, pp. 10729–10736. Cited by: §2.
  • [14] F. Durand and J. Dorsey (2002) Fast bilateral filtering for the display of high-dynamic-range images. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 257–266. Cited by: §2.
  • [15] G. D. Finlayson and E. Trezzi (2004) Shades of gray and colour constancy. In Color and Imaging Conference, Vol. 2004, pp. 37–41. Cited by: §2.
  • [16] X. Fu, D. Zeng, Y. Huang, X. Zhang, and X. Ding (2016) A weighted variational model for simultaneous reflectance and illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2782–2790. Cited by: §2.
  • [17] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand (2017) Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–12. Cited by: Fig. 1, §1, §1, §2, §4.3, §5.1, §5.3, §5.3, §5.3, TABLE II, §6.3.2.
  • [18] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong (2020) Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1780–1789. Cited by: §2.
  • [19] J. He, C. Dong, and Y. Qiao (2019) Modulating image restoration with continual levels via adaptive feature modification layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11056–11064. Cited by: §5.5, §5.6.
  • [20] J. He, Y. Liu, Y. Qiao, and C. Dong (2020) Conditional sequential modulation for efficient global image retouching. In European Conference on Computer Vision, pp. 679–695. Cited by: §1.
  • [21] Y. Hu, H. He, C. Xu, B. Wang, and S. Lin (2018) Exposure: a white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37 (2), pp. 1–17. Cited by: Fig. 1, §1, §1, §2, §3, §4.3, §5.1, §5.3, TABLE II.
  • [22] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3277–3285. Cited by: §1, §2.
  • [23] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool (2018) WESPE: weakly supervised photo enhancer for digital cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 691–700. Cited by: §1, §2.
  • [24] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, Cited by: §2, §4.3, §5.3, TABLE II, §6.3.2.
  • [25] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §4.3.
  • [26] H. Kim, Y. J. Koh, and C. Kim (2020) Global and local enhancement networks for paired and unpaired image enhancement. In European Conference on Computer Vision, pp. 339–354. Cited by: §2.
  • [27] H. Kim, Y. J. Koh, and C. Kim (2020) PieNet: personalized image enhancement network. In European Conference on Computer Vision, pp. 374–390. Cited by: §2.
  • [28] S. Kosugi and T. Yamasaki (2020) Unpaired image enhancement featuring reinforcement-learning-controlled image editing software. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11296–11303. Cited by: §2.
  • [29] E. H. Land (1977) The retinex theory of color vision. Scientific american 237 (6), pp. 108–129. Cited by: §2.
  • [30] J. Lee, K. Sunkavalli, Z. Lin, X. Shen, and I. So Kweon (2016) Automatic content-aware color and tone stylization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2470–2478. Cited by: §2.
  • [31] M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §3, §4.1.1.
  • [32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §2.
  • [33] Z. Ni, W. Yang, S. Wang, L. Ma, and S. Kwong (2020) Unpaired image enhancement with quality-attention generative adversarial network. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 1697–1705. Cited by: §2.
  • [34] J. Park, J. Lee, D. Yoo, and I. So Kweon (2018) Distort-and-recover: color enhancement using deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5928–5936. Cited by: §1, §1, §2, §3, §4.3, §5.3, TABLE II.
  • [35] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. External Links: 1602.07868 Cited by: §4.1.3.
  • [36] A. Shoshan, R. Mechrez, and L. Zelnik-Manor (2019-10) Dynamic-net: tuning the objective without re-training for synthesis tasks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §5.6.
  • [37] J. Van De Weijer, T. Gevers, and A. Gijsenij (2007) Edge-based color constancy. IEEE Transactions on image processing 16 (9), pp. 2207–2214. Cited by: §2.
  • [38] R. Wang, Q. Zhang, C. Fu, X. Shen, W. Zheng, and J. Jia (2019) Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6849–6857. Cited by: §1, §2, §5.1, §5.3.
  • [39] X. Wang, K. Yu, C. Dong, and C. Change Loy (2018) Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 606–615. Cited by: §5.5, §6.1.
  • [40] X. Wang, K. Yu, C. Dong, X. Tang, and C. C. Loy (2019) Deep network interpolation for continuous imagery effect transition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1692–1701. Cited by: §5.6.
  • [41] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2, §4.3.
  • [42] J. Yan, S. Lin, S. Bing Kang, and X. Tang (2014) A learning-to-rank approach for image color enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2987–2994. Cited by: §2.
  • [43] Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu (2016) Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG) 35 (2), pp. 1–15. Cited by: §2, 3rd item, §6.2.
  • [44] Z. Ying, G. Li, Y. Ren, R. Wang, and W. Wang (2017) A new low-light image enhancement algorithm using camera response model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3015–3022. Cited by: §2.
  • [45] R. Yu, W. Liu, Y. Zhang, Z. Qu, D. Zhao, and B. Zhang (2018) Deepexposure: learning to expose photos with asynchronously reinforced adversarial learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2153–2163. Cited by: §2.
  • [46] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M. Yang, and L. Shao (2020) Learning enriched features for real image restoration and enhancement. In European Conference on Computer Vision, pp. 492–511. Cited by: §2, §5.3, TABLE II.
  • [47] H. Zeng, J. Cai, L. Li, Z. Cao, and L. Zhang (2020) Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2, §5.3, §5.3, TABLE II.
  • [48] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §2.
  • [49] Q. Zhang, G. Yuan, C. Xiao, L. Zhu, and W. Zheng (2018) High-quality exposure correction of underexposed photos. In Proceedings of the 26th ACM international conference on Multimedia, pp. 582–590. Cited by: §2.
  • [50] W. Zhang, Y. Liu, C. Dong, and Y. Qiao (2019) Ranksrgan: generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3096–3105. Cited by: §2.
  • [51] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, Cited by: §2, §4.3.