Everyone has their life’s precious moments captured in photographs. These photos may tell stories of old memories such as a wedding or a birthday party. Although modern cameras offer many techniques to correct colors and enhance image quality, the natural color style may not express those stories well. Therefore, many powerful photo editing tools (e.g., Lightroom) have been released to enrich the preciousness of photographs. However, professional tools require professional skills and knowledge of photography, which makes it difficult for end-users to beautify their photos; they may create an unexpected color style. Therefore, many photo applications provide fixed filters to make beautifying photos easier. Unfortunately, the filters are limited and sometimes do not meet the user’s expectations. Experienced users regularly try to mimic the color style of a well-retouched photo using a professional tool. Besides, retouched photos with similar contexts give users an overview of their intended color style. This reveals a correlation between human behavior and style transfer tasks. In this work, we present a novel training scheme based on blending and retouching photos for color style transfer; in addition, we design a specific neural network, named Deep Preset, for this task.
Photorealistic Style Transfer (PST). The seminal work on Neural Style Transfer (NST) by Gatys et al. presents an optimization-based method that transfers an artistic style to a photo using convolutional neural networks. Subsequent works achieve large improvements, reaching real-time performance and creating novel ways to transform contextual features based on mean and standard deviation (AdaIN) or whitening and coloring (WCT). However, these methods are designed for artistic stylization rather than photorealistic stylization, where retaining structural details is challenging. Therefore, Luan et al. propose a regularization for NST to prevent distortion. However, their optimization-based method has a long computational time, and its results are still distorted. Li et al. propose an enhanced photo stylization method, PhotoWCT, based on WCT with post-processing such as smoothing and filtering. Building on PhotoWCT, Yoo et al. present a progressive strategy that transfers style in a single pass and propose Wavelet Corrected Transfer (WCT²) with wavelet pooling/unpooling. Furthermore, they do not need any post-processing; however, their performance still relies on semantic masks. Recently, An et al. propose an asymmetric auto-encoder, PhotoNet, with two modules, Bottleneck Feature Aggregation (BFA) and Instance Normalized Skip Links (INSL), requiring neither post-processing nor guided masks. Although PhotoNet can retain structural details with a well-transferred photo style, its architecture is not well explored; they thus apply Network Architecture Search (NAS) to find an efficient network architecture for photorealistic style transfer (PhotoNAS) under a network complexity constraint. However, for blending and retouching photos, these PST methods go too far: they transfer exact colors, with degradation, while end-users desire a color style similar to a well-retouched photo, especially in sensitive cases such as portraits. The mentioned works transfer the exact colors of a retouched reference rather than learning the color transformation/style representation, i.e., "what beautifies the reference". Consequently, their results show overflowed colors due to mis-transferring the color style, as shown in Figure 1.
In this work, we define the color style as a set of low-level image transformation methods converting a photo with natural colors (likely the original colors taken by the camera) into a retouched one. In photo editing tools, each low-level image transformation method is controlled by a parameter (setting) specifying how strongly the method affects the content, and a preset contains a set of such parameters chosen by an experienced user for a specific photo. Based on this definition, we present a novel training scheme for color style transfer with ground-truth, obtained by applying various user-generated presets. Having ground-truth helps our model converge in the right direction rather than relying on extracted features and mixing methods. Furthermore, we propose Deep Preset to 1) learn well-generalized features representing the color style transforming the input (natural) into the reference (retouched), 2) estimate the preset applied to the reference, and 3) synthesize the well-retouched input.
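As a minimal illustration of this training scheme, the sketch below generates a (content, ground-truth) pair by applying a toy two-setting preset to a natural photo. The `exposure`/`saturation` operations are simplified stand-ins for Lightroom's actual low-level transforms, not the paper's implementation.

```python
import numpy as np

def apply_preset(img, preset):
    """Apply a simplified 'preset' (exposure + saturation) to an RGB image in [0, 1].

    This is a toy stand-in for a real preset, which bundles dozens of
    low-level color transforms.
    """
    out = img * (2.0 ** preset["exposure"])                   # exposure: scale brightness
    gray = out.mean(axis=-1, keepdims=True)                   # per-pixel luminance proxy
    out = gray + (out - gray) * (1.0 + preset["saturation"])  # push colors away from gray
    return np.clip(out, 0.0, 1.0)

def make_training_pair(natural_img, preset):
    """A (content, ground-truth) pair: the retouched photo is the training target."""
    return natural_img, apply_preset(natural_img, preset)
```

Because the retouched version is generated deterministically from the natural photo, the model can be supervised with an exact ground-truth target instead of only perceptual similarity to a reference.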
Color Blending. Nowadays, most digital cameras apply post-processing techniques to correct colors and enhance image quality, providing natural colors for captured photos before saving. Afterward, photographers blend the colors according to their purposes, such as giving a wedding photo a vintage color tone. In detail, they adjust the photo settings representing low-level color transformation methods (e.g., Hue, Saturation, etc.), as shown in Figure 2. End-users without photo editing knowledge instead use color settings prepared by professional photographers for a specific context. This opens a novel scheme for generating training photos that share the same color style. The ill-posed problem is how to generalize the color transformation and enhance color style transfer. In our work, we take two approaches: 1) we let the proposed Deep Preset predict the preset, the color transformation behind the retouched reference, so that features extracted across many contexts define the transformation; however, predicting an accurate preset is difficult, so we 2) add a positive pair-wise loss function minimizing the distances between same-preset-applied photos in latent space. Consequently, the extracted features are robust, and Deep Preset can efficiently transfer the reference's style to another photo.
Our contributions are as follows:
We present a novel way to train deep neural networks for color style transfer by applying user-generated presets. To our knowledge, ours is the first work to train color style transfer with ground-truth.
We propose a specific deep neural network, named Deep Preset, to transfer the color style of a reference to a photo. As a result, our work outperforms previous works qualitatively and quantitatively in color style transfer.
Our scheme, which optimizes the distances between the latent features of photos retouched by the same preset, shows the capability of enhancing both the color transformation and the stylized output.
Our work can automatically beautify a photo by selecting a suitable reference among well-retouched photos based on a perceptual measurement.
2 Deep Preset
Blending and retouching photos helps everyone enhance the preciousness of their life's moments captured in photographs; color style is a means of expression. However, it is not easy to create a plausible color style for a given context. Users thus search for a well-retouched photo with the same contextual information to serve as a reference (a). Even when a suitable sample is found, it is difficult for end-users without photography knowledge to retouch their photos using a powerful photo editing application (b). In this work, we solve (b) using our proposed Deep Preset, which can synthesize a color style similar to a reference for a given photo. Instead of considering photorealistic style, Deep Preset considers which color-shifting methods (preset) have been applied to the reference, and learns the features representing the cross-content color transformation from natural colors (input) to retouched ones (reference). Regarding problem (a), we also provide a strategy to find a reference among a set of well-retouched photos by matching contextual information. Consequently, end-users can retouch their photos in one click. Additionally, we minimize the distance between photos with the same color transformation to enhance the generated color style and the preset estimation, further improving our performance.
Our Deep Preset learns the cross-content color transformation from a natural photo to a reference and generates the stylized photo. Furthermore, it predicts the applied preset, representing the hyper-parameters of the post-processing color-shifting techniques used to retouch the reference, as shown in Figure 3. Besides, while predicting the applied preset, we extract embeddings from the reference and from a random photo retouched by the same preset as the reference.
Our advantages are as follows:
Our models can converge in the right direction thanks to the ground-truth; meanwhile, previous works rely heavily on perceptual features and mixing methods.
Learning the color transformation, rather than transferring exact colors in latent space, can reduce the sensitivity of color style transfer.
Our positive pair-wise loss function helps stabilize preset estimation and enhances the generated color style.
2.2 Network Architecture
We adopt U-Net, an encoder-decoder architecture, to design Deep Preset. Our network includes four main components: a transformation encoder T, a content encoder E, linear layers L, and a decoder (generator) G.
First of all, the encoder T leverages the content c and the reference r to synthesize feature maps representing the color transformation. Meanwhile, the encoder E extracts contextual information, preparing for blending features between T and E. Afterwards, the linear layers L leverage the final feature map of T to extract the transformation embedding z and estimate the preset p̂, as follows:

z, p̂ = L(T(c, y)),

where y can be the reference r or another photo r′ retouched by the same preset, and p̂ is the estimated preset. Finally, the generator G leverages the concatenated features between E and T to synthesize the stylized photo ŷ, as:

ŷ = G([E(c), T(c, r)]),

where [·, ·] represents concatenations of the extracted feature maps between E and T corresponding to the feeding order, as shown in Figure 3.
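The forward pass described above can be sketched as follows. This is a toy, dependency-free version in which the two encoders, the linear head, and the generator are reduced to a few array operations; the component names T, E, L, G and all internals are illustrative, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def T(content, reference):
    """Transformation encoder: takes the (content, reference) pair, returns features."""
    pair = np.concatenate([content, reference], axis=-1)   # stack along channels
    return pair.mean(axis=(0, 1))                          # toy global feature vector

def E(content):
    """Content encoder: contextual features of the content alone."""
    return content.mean(axis=(0, 1))

def L(feat, n_settings=69):
    """Linear head: transformation embedding z and estimated preset p_hat."""
    z = np.tanh(feat)                         # toy embedding
    W = rng.standard_normal((n_settings, feat.size))
    p_hat = np.tanh(W @ feat)                 # preset values bounded by Tanh
    return z, p_hat

def G(content_feat, transform_feat, content):
    """Generator: blends content features with transformation features (toy version)."""
    blended = np.concatenate([content_feat, transform_feat])
    shift = np.tanh(blended).mean()           # collapse to a single color shift
    return np.clip(content + shift, 0.0, 1.0)

def deep_preset_forward(content, reference):
    t = T(content, reference)
    z, p_hat = L(t)
    y_hat = G(E(content), t, content)
    return y_hat, z, p_hat
```

The sketch preserves the data flow of the equations: the pair (c, r) feeds T, the linear head yields (z, p̂), and the generator combines E(c) with T's features to produce ŷ.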
Technical details. The encoder T contains five Down-sampling Layers (DL), each preceded by Max Pooling (excluding the first one), and the encoder E has the same structure. Meanwhile, the decoder G includes five Up-sampling Layers (UL) with Bi-linear Pooling (excluding the fifth layer) and a final convolutional module with a Tanh activation function to synthesize the stylized output. To avoid aliasing, we adopt the works [18, 6] and apply a blur filter to all pooling modules. All convolutional modules in DL and UL are followed by a Sample-based Evolving Normalization-Activation. The linear layers L consist of three fully-connected layers with the Leaky ReLU activation function; the activation of the last layer is replaced by a Tanh function to estimate the applied preset, as shown in Figures 3 and 4. The first convolution module in each encoder uses a large kernel to observe features over a large receptive field; the remaining convolution modules use a smaller kernel.
2.3 Loss functions
In this work, we propose a new scheme to train color style transfer with ground-truth; therefore, our loss functions are based on the ground-truth rather than on extracted features of the content and reference images. Consequently, our models can converge in the right direction, closer to the ground-truth. We apply the Mean Square Error (MSE) to directly minimize the distance between our stylized output ŷ and the ground-truth y as:

L_MSE = (1/N) Σ_i ||ŷ_i − y_i||²,

where N is the batch size. Additionally, we adopt the perceptual loss LPIPS to enhance contextual details as:

L_LPIPS = (1/N) Σ_i LPIPS(ŷ_i, y_i).

Additionally, we also estimate the preset, i.e., the applied low-level image transformation. The preset estimation error is measured as:

L_P = (1/N) Σ_i ||p̂_i − p_i||,

where the dimension of the preset p is the number of hyper-parameters representing low-level image transformation methods such as color-shifting. However, predicting an exact preset is difficult due to the many possible cases; furthermore, in human experience, presets can produce visually similar, though not identical, color tones. Therefore, to enhance the color transformation representation, we randomly select a photo having the same color style as the reference and extract its embedding z′, as described in Equation 1. The distance between z and z′, the so-called positive pair-wise error, is optimized as:

L_PP = ||z − z′||².

Finally, our total loss function is a weighted sum of L_MSE, L_LPIPS, L_P, and L_PP, where the weights are set empirically.
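The combined objective can be sketched as below, assuming ground-truth photos, applied presets, and embeddings are available. The LPIPS term is abstracted as a caller-supplied `perceptual` callable, and the weights are placeholders rather than the paper's values (which are not recoverable from this excerpt).

```python
import numpy as np

def mse_loss(y_hat, y):
    """L_MSE: pixel-wise distance to the ground-truth (possible because GT exists)."""
    return np.mean((y_hat - y) ** 2)

def preset_loss(p_hat, p):
    """Preset estimation error between predicted and applied settings vectors."""
    return np.mean(np.abs(p_hat - p))

def pairwise_loss(z_ref, z_same_preset):
    """Positive pair-wise loss: embeddings of two photos retouched by the SAME
    preset should coincide in latent space."""
    return np.mean((z_ref - z_same_preset) ** 2)

def total_loss(y_hat, y, p_hat, p, z, z_prime, perceptual,
               w_mse=1.0, w_lpips=1.0, w_preset=1.0, w_pp=1.0):
    # `perceptual` stands in for LPIPS; the four weights are placeholders.
    return (w_mse * mse_loss(y_hat, y)
            + w_lpips * perceptual(y_hat, y)
            + w_preset * preset_loss(p_hat, p)
            + w_pp * pairwise_loss(z, z_prime))
```

Note that every term except the perceptual one is computed directly against ground-truth quantities (target photo, applied preset, same-preset embedding), which is what distinguishes this scheme from reference-feature-matching losses.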
2.4 Data Preparation
In this section, we describe how we collect and pre-process the data for training and evaluation.
Lightroom presets. Our data processing is based on Lightroom, one of the most powerful photo editing applications available today. We collect 510 user-generated presets: 500 presets for training and 10 for testing. Additionally, we select only 69 of the settings provided by Lightroom (e.g., White Balance, Hue, and more). Each setting has a value representing how strongly colors are shifted in a specific way. Therefore, a preset with 69 settings is assigned a 69-dimensional vector. All elements are normalized based on the min and max values that end-users can set.
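A possible encoding of a preset as a normalized vector is sketched below, assuming min-max normalization to [-1, 1] (matching the Tanh output range of the preset head). The setting names and ranges are illustrative, not Lightroom's actual 69 settings.

```python
import numpy as np

def preset_to_vector(settings, ranges):
    """Encode a preset (dict of Lightroom-style settings) as a normalized vector.

    `ranges` maps each setting name to the (min, max) values an end-user can set;
    each value is min-max normalized to [-1, 1]. Names and ranges here are
    hypothetical examples.
    """
    vec = []
    for name, (lo, hi) in ranges.items():
        v = settings.get(name, 0.0)                 # missing settings default to 0
        vec.append(2.0 * (v - lo) / (hi - lo) - 1.0)
    return np.asarray(vec)
```

With 69 entries in `ranges`, this yields the 69-dimensional target vector that the preset-estimation head regresses against.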
Training data. We script Lightroom to generate photos using high-definition photos from Flickr2K and 501 pre-processed presets, including the base color style. Since our training target is to convert a photo with correct colors (natural) into a retouched version, we only choose photos with correct colors, likely the original colors taken by a camera. All training photos are resized so that the shorter dimension has a fixed size while preserving the aspect ratio. Besides, we deal with JPEG distortion, since photos are stored as JPEG for efficiency.
Testing data. Different from the training setup, we conduct experiments on natural and retouched photos from the DIV2K validation set. Regarding the color style, we utilize user-generated presets to stylize the references. Since a content image retouched by same-preset-applied photos should have the same color style, we select one content image to be stylized by many reference images. Additionally, we also create a set with natural style only, including many content images and one reference image. Eventually, we obtain a fixed number of samples for each set. All photos are stored in JPEG format and resized on-the-fly using bicubic interpolation while testing.
Table 1 (partial): quantitative results. Each group of four columns reports H-Corr (higher is better), H-CHI (lower is better), PSNR in dB (higher is better), and LPIPS (lower is better); the three groups correspond to the two test settings described in Section 2.4 and their overall average.

| Method | H-Corr | H-CHI | PSNR | LPIPS | H-Corr | H-CHI | PSNR | LPIPS | H-Corr | H-CHI | PSNR | LPIPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our presets w/o PP loss | 0.6792 | 273.98 | 23.66 | 0.1176 | 0.6039 | 745.45 | 21.62 | 0.1102 | 0.6416 | 509.71 | 22.62 | 0.1139 |
| Our presets with PP loss | 0.6573 | 145.1861 | 23.12 | 0.1288 | 0.5815 | 453.50 | 20.94 | 0.1262 | 0.6194 | 299.34 | 22.03 | 0.1275 |
| Ours w/o PP loss | 0.7188 | 126.29 | 23.79 | 0.1039 | 0.6678 | 990.82 | 21.96 | 0.1015 | 0.6933 | 558.56 | 22.87 | 0.1027 |
| Ours with PP loss | 0.7269 | 145.47 | 24.01 | 0.0993 | 0.6743 | 959.43 | 22.24 | 0.0966 | 0.7006 | 552.45 | 23.12 | 0.0980 |
2.5 Training details
We train our models using the Adam optimizer. To diversify the training samples, we apply random crops, random rotations, and random flips in both horizontal and vertical directions. All photos are normalized to a fixed range.
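The augmentation pipeline might look like the following sketch. The crop size, the rotation handling (reduced here to 90-degree rotations), and the [-1, 1] normalization are placeholder choices, since the exact values are not given in this excerpt.

```python
import numpy as np

def augment(img, rng):
    """Random square crop + 90-degree rotation + horizontal/vertical flips.

    Input is an HxWx3 array in [0, 1]; output is normalized to [-1, 1]
    (a common choice matching a Tanh output, assumed here).
    """
    h, w, _ = img.shape
    size = min(h, w) // 2                            # placeholder crop size
    y0 = rng.integers(0, h - size + 1)
    x0 = rng.integers(0, w - size + 1)
    out = img[y0:y0 + size, x0:x0 + size]            # random crop
    out = np.rot90(out, k=rng.integers(0, 4))        # stand-in for random rotation
    if rng.random() < 0.5:
        out = out[:, ::-1]                           # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                           # vertical flip
    return 2.0 * out - 1.0                           # normalize
```

The same preset must be applied to the photo before augmentation so the (content, ground-truth) pair stays aligned pixel-for-pixel.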
3 Experimental results
3.1 On our positive pair-wise loss function
In this section, we conduct an ablation study on our positive pair-wise (PP) loss function. The encoder of Deep Preset learns the cross-content color transformation with an auxiliary regression task, preset prediction. However, it is difficult to estimate an accurate preset, so the features representing the color transformation are unstable. Therefore, we enhance the color transformation representation by optimizing the distances between photos having the same color style in latent space. The extracted features thus become robust to the transformation, which stabilizes the predicted preset during training, as shown in Figure 6. For comparison, we train two models, with and without the PP loss, under the same conditions and compare them qualitatively with visual evidence and quantitatively on the DIV2K validation set using Peak Signal-to-Noise Ratio (PSNR), the perceptual metric LPIPS, and the Chi-squared distance between histograms (H-CHI). As a result, our model with the PP loss function improves color style transfer. In detail, the model with PP loss gives a more yellowish color tone, matching the retouched reference. Quantitatively, stylized results with PP loss have a higher PSNR and a lower LPIPS, showing better quality; furthermore, they have a smaller Chi-squared distance between histograms, proving better color style transfer, as shown in Figure 5. Although the presets predicted by the model without PP loss are more accurate and can mimic the color style of the reference, as shown in Figure 7, that model still cannot match direct stylization, as the overall results in Table 1 show.
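A plain implementation of the histogram Chi-squared distance (H-CHI) used in this evaluation could look like the following; the bin count and the absence of per-channel weighting are assumptions, not values from the paper.

```python
import numpy as np

def chi2_histogram_distance(img_a, img_b, bins=32, eps=1e-10):
    """Chi-squared distance between per-channel color histograms (H-CHI).

    Inputs are HxWx3 arrays in [0, 1]; smaller values mean more similar
    color distributions. `bins` is a placeholder choice.
    """
    total = 0.0
    for c in range(img_a.shape[-1]):
        ha, _ = np.histogram(img_a[..., c], bins=bins, range=(0.0, 1.0))
        hb, _ = np.histogram(img_b[..., c], bins=bins, range=(0.0, 1.0))
        total += np.sum((ha - hb) ** 2 / (ha + hb + eps))  # chi-squared per channel
    return total
```

Unlike PSNR or LPIPS, this metric ignores spatial structure entirely, which is why it is used here as a proxy for how well the *color style* (rather than the content) was transferred.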
3.2 Comparison to recent works
Previous works present creative techniques to mix contextual features extracted from content and style. They effectively transfer the textures and colors of a reference into another photo, for example, changing day to night or summer to winter. However, they go too far for color style transfer. This work defines the color style as the color transformation converting a photo with natural colors into its retouched version based on color-shifting methods. The proposed Deep Preset learns this color style and transforms the base colors, rather than transferring exact colors. To prove our proficiency, we compare our work to the mentioned works quantitatively and qualitatively. Besides, reference-guided colorization can be treated as color style transfer; therefore, we also compare this work to the interactive colorization of Zhang et al. Since our scheme provides ground-truth, we utilize similarity metrics such as histogram correlation (H-Corr), histogram Chi-squared distance (H-CHI), Peak Signal-to-Noise Ratio (PSNR), and the perceptual metric LPIPS for quantitative comparison on the DIV2K validation set. For qualitative comparison, we show photos taken by a DSLR camera and the stylized versions generated by the mentioned works.
Quantitative comparison. We compare our work to previous works in two settings: one content image with many same-color-style references, and many content images with one same-color-style reference, as described in Section 2.4. The reference-guided colorization DeepPriors outperforms the other previous works overall. Their colorization removes the color channels and then colorizes the black-and-white photo based on a given reference; the colors of the output thus correlate strongly with the ground-truth. However, they still suffer from color overflow. Meanwhile, this work outperforms the previous works in generating a similar color style, with a histogram correlation of 0.7006 and a histogram chi-squared distance of 552.45. Furthermore, our results also have the highest quality, with an average PSNR of 23.12 dB and an LPIPS of 0.098, as shown in Table 1.
Qualitative comparison. To make our quantitative results more reliable, we conduct experiments on portraits taken by DSLR cameras with three presets. The proposed Deep Preset can clearly beautify the content using a well-retouched reference; meanwhile, the previous works try to transfer exact colors, leading to abnormal colors. In detail, in the first two columns, the previous works transfer the red color of the rose in the reference to the girl's hair band with overexposed skin, while DeepPriors synthesizes a monochrome tone. In the next two columns, all methods except DeepPriors synthesize a correct color for the girl's uniform; however, they all exhibit color pollution, such as red patches on the face and uniform. Meanwhile, our work provides a plausible color with a fresher look, matching the reference. For the last two columns, we directly check the Hue-Saturation-Lightness (HSL) values at the color picker location. Our result gives the (H, S, L) values closest to the ground-truth; meanwhile, DeepPriors, FPS, WCT², and PhotoNAS deviate further, as shown in Figure 8. See more results in our supplementary materials.
Our work does not treat the colors of the content as a transfer destination; instead, it changes the base colors of the content. Additionally, we only train our model on the natural-to-retouched transformation. Therefore, we conduct experiments on "stylizing a retouched content" and on "what happens if we use an artistic image as the reference".
On retouched contents. Deep Preset is trained to convert a photo with natural colors into its retouched version; therefore, retouched-to-retouched transformation is an out-of-distribution case. Nevertheless, our model implicitly learns the cross-content color transformation from input to reference in any style and behaves the same as in the natural-to-retouched scheme, as shown in Figure 9. Our work thus outperforms others in the retouched-to-retouched domain.
On artistic styles. Artistic paintings are out-of-distribution since we only train on camera-taken photos. Still, our work can slightly beautify a photo using a painting as a reference. This shows that the contextual information of the content is preserved during training and then blended with the color-shifting-like transformation. Therefore, the proposed Deep Preset can retain the structural details with colors homologous to the reference. For example, in left-to-right order, our result is brighter with the first style, the uniform has a richer blue with the second, and the color tone shifts slightly toward yellow, especially in the girl's hair, with the third, as shown in Figure 10.
Trade-off between preset prediction and positive pair-wise minimization in color transformation. As shown in Figure 7, the presets predicted without the positive pair-wise (PP) loss give a promising transformation mimicking the overall color of the reference. For example, the predicted presets turn the content bluish to match the costume in the third column, and greenish to match the grass in the last column. However, given references retouched by the same preset, the predicted presets should provide the same color style, according to our hypothesis. The presets predicted with PP loss show this stability across various contexts, though their overall performance is worse, as evaluated in Table 1. As mentioned, predicting an accurate preset is challenging, which makes it difficult to define the features representing the color transformation. Therefore, we lower the expectation of predicting presets and concentrate on training the features to enhance the color transformation with PP loss. Our direct stylization thus outperforms the preset-based approach.
Failed Cases. Our stylization fails when the reference falsifies the color transformation (a blue uniform turned purple) in the first row, or when the reference has unstable lighting. For example, a near-overexposed reference (on the man's costume) gives a near-overexposed result in the first row, and similarly a low-light reference does in the second row, as shown in Figure 11.
Automatic Beautification. Automatic colorization works can be treated as automatic beautification: they beautify a black-and-white photo by giving it plausible colors based on the training data. However, they aim to synthesize correct colors rather than retouched ones. Although DeepPriors provides a scheme to transfer the colors from a reference to a black-and-white photo, it still suffers from mismatched and overflowed colors. In our case, end-users choose a retouched reference having a similar context to their photo. This suggests a scheme for the proposed Deep Preset: select a reference among well-retouched photos by matching perceptual information, so that the photo is retouched automatically. Visit our repository at https://minhmanho.github.io/deep_preset for more information.
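The reference-selection step can be sketched as a nearest-neighbor search under a perceptual distance; `perceptual_distance` below stands in for LPIPS (or any contextual similarity), so the exact metric is an assumption of this sketch.

```python
import numpy as np

def select_reference(content, candidates, perceptual_distance):
    """Pick the well-retouched photo whose context best matches the content.

    `candidates` is a list of retouched reference photos; `perceptual_distance`
    is any callable returning a scalar distance (LPIPS in the paper's setup).
    Returns the index of the best-matching reference.
    """
    dists = [perceptual_distance(content, ref) for ref in candidates]
    return int(np.argmin(dists))
```

The selected reference is then fed to Deep Preset together with the content, yielding one-click beautification.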
We define the color style as a color transformation based on low-level image transformations, especially color-shifting methods. On that basis, we first present a novel scheme to train color style transfer with ground-truth. Then, we propose Deep Preset to transfer the color style of a well-retouched reference to a photo; it is also designed to predict the applied preset behind the retouched reference. The experiments show that the representation shared by photos having the same color style can be learned not only while predicting the same preset but also in latent space. Consequently, our positive pair-wise loss, optimizing the distances between the representative features of same-preset-applied photos, enhances the color transformation, as shown in Figure 5. As a result, the proposed Deep Preset outperforms previous works in color style transfer quantitatively and qualitatively.
- Adobe Lightroom. https://www.adobe.com/products/photoshop-lightroom.html
- An, J., et al. (2020) Ultrafast photorealistic style transfer via neural architecture search. In AAAI.
- Gatys, L. A., Ecker, A. S., and Bethge, M. (2016) Image style transfer using convolutional neural networks. In CVPR, pp. 2414–2423.
- Huang, X. and Belongie, S. (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1501–1510.
- Johnson, J., Alahi, A., and Fei-Fei, L. (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
- Karras, T., Laine, S., and Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410.
- Kingma, D. P. and Ba, J. (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Li, Y., et al. (2017) Universal style transfer via feature transforms. In NeurIPS, pp. 386–396.
- Li, Y., et al. (2018) A closed-form solution to photorealistic image stylization. In ECCV, pp. 453–468.
- Lim, B., et al. (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshops.
- Liu, H., Brock, A., Simonyan, K., and Le, Q. V. (2020) Evolving normalization-activation layers. arXiv preprint arXiv:2004.02967.
- Luan, F., et al. (2017) Deep photo style transfer. In CVPR, pp. 4990–4998.
- Ronneberger, O., Fischer, P., and Brox, T. (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
- Timofte, R., et al. (2018) NTIRE 2018 challenge on single image super-resolution: methods and results. In CVPR Workshops.
- Yoo, J., et al. (2019) Photorealistic style transfer via wavelet transforms. In ICCV, pp. 9036–9045.
- Zhang, R., et al. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595.
- Zhang, R., et al. (2017) Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG) 36(4).
- Zhang, R. (2019) Making convolutional networks shift-invariant again. In ICML.