Portrait photography has a vast range of applications, such as weddings, birthdays, graduations, anniversaries, advertisements, personal records, and creative work. To ensure high quality of the final photos, photographers tend to capture as many raw photos with high dynamic range as possible. However, a set of raw photos might look flat and present inconsistent tones due to variations in subject view, illumination condition, background context, and camera settings, as shown in the top row of Figure 1. Fast retouching of a large set of raw photos is therefore necessary before they are sent to customers for photo selection, which is followed by fine-grained editing.
While a set of common standards or styles of pre-retouching are widely accepted and followed, most portrait photos are retouched manually, which is very tedious and time-consuming given the large number of highly redundant raw photos. Automatic portrait photo retouching is thus highly desired, as it can save a huge amount of tedious human labor and significantly improve the efficiency of the entire portrait photography pipeline, bringing a better experience for both photographers and customers.
Different from general-purpose photo retouching tasks, portrait photo retouching (PPR) has two special and practical requirements: human-region priority (HRP) and group-level consistency (GLC). HRP means that the human regions in a portrait photo should receive higher priority and more attention. The first row of Figure 1(a) shows a set of typical examples, where the backgrounds are over-exposed while the human regions are under-exposed. For such cases, retouching should improve the exposure of human regions while preserving as many details as possible in the backgrounds. GLC requires a group of portrait photos, which are usually taken of the same subject in the same scene but with different subject views, lighting conditions and even camera settings, to be adjusted to a consistent tone, as shown in the bottom row of Figure 1.
To the best of our knowledge, existing general-purpose photo retouching or enhancement datasets [1, 13, 9, 2] and models [26, 20, 12, 17, 29, 7, 30, 5, 3, 10, 32] do not address the above two requirements and thus can hardly satisfy the demands of automatic PPR. To facilitate research on this important and high-frequency task, in this paper, we construct the first large-scale PPR dataset, which contains high-quality raw portrait photos organized in groups and is referred to as the PPR10K dataset hereafter. The raw photos are captured by various DSLR camera devices, covering a wide range of scenes, subjects, lighting conditions and camera settings. Each raw photo is independently adjusted by three expert retouchers with rich experience in professional photography studios, resulting in three versions of high-quality retouched targets. Besides the highly informative raw photos and their retouched results, we also provide a high-resolution human-region mask for each photo to make better use of HRP. Each group of photos is elaborately adjusted to ensure GLC. We believe this dataset will provide a valuable benchmark to facilitate the research on automatic PPR.
With the PPR10K dataset, we define a set of objective measures to evaluate the performance of automatic PPR in terms of both HRP and GLC. We also propose corresponding learning strategies to improve the retouching quality of trained PPR models. Specifically, we define human-region weighted measures based on the provided mask, which contribute to achieving better visual quality on subject areas. Explicitly defining the GLC on the image space is very challenging because of the large variations of content in a group of photos. We find that the GLC can be reliably evaluated based on the statistics in CIELAB color space. We also propose an efficient way to simulate the intra-group variations using individual images, which is proven effective to improve the GLC performance. Considering that previous general-purpose photo retouching models obtain poor performance on the PPR task, we re-implemented representative state-of-the-art photo retouching and enhancement methods, and report their performance on our dataset for a convenient and fair comparison.
The contributions of this paper are two-fold. First, we construct the first large-scale and high-quality PPR dataset with human-region masks and group-level consistent targets, providing a valuable benchmark to facilitate research on this important task. Second, we propose a set of objective measures and learning strategies to evaluate and optimize the PPR models. Extensive experiments verified the effectiveness of the proposed dataset, measures and learning strategies both quantitatively and qualitatively.
2 Related Works
2.1 Photo Enhancement Datasets
High-quality datasets are the foundation of learning-based photo enhancement or retouching research [1, 13, 9, 2]. Bychkovsky et al. [1] constructed the pioneering FiveK dataset, which contains raw photos of general scenes together with five versions of retouched targets. This dataset has successfully facilitated the research of automatic photo retouching and enhancement [7, 3, 32]. Ignatov et al. [13] constructed the DPED dataset with the aim of learning a mapping from low-quality photos captured by mobile devices to their counterparts captured by high-end DSLR cameras. This dataset mainly consists of photos of general scenes such as landscapes and street views, and serves as a benchmark for the general-purpose photo enhancement task. There are also datasets focusing on enhancing the dynamic range and contrast of photos [9, 2], where the ground-truths are elaborately generated by fusing multiple frames.
Despite these great efforts, the above datasets are constructed on general scenes, where portrait photos account for only a minority and receive no special treatment. In addition, they only consider the visual quality of each individual photo rather than that of a group of photos, which is the common case in portrait photography. As a result, the models trained on them are unsuitable for the PPR task. In this paper, we elaborately construct a larger-scale PPR dataset, fulfilling the HRP and GLC requirements of portrait photography.
2.2 Photo Retouching Methods
Photo retouching [12, 17, 24, 19, 2, 16, 8, 6, 31, 18] aims to enhance the visual aesthetic quality of an image, which is conventionally achieved via professional tools, e.g., CameraRaw (https://www.adobe.io/apis/creativecloud/camera-raw.html), or hand-crafted operations like look-up tables (LUTs). However, these manual tools rely heavily on the empirical knowledge and perceptual aesthetic judgment of well-trained artists, and are therefore beyond the abilities of non-professional users. Some learning-based methods [1, 29, 14, 21, 22] based on hand-crafted features have been developed, yet they can hardly satisfy practical demands due to their limited representation capacity against the vast range of image contents and lighting conditions.
Various deep-learning-based schemes [20, 30, 3, 5, 25, 7] have recently been presented, benefiting from the FiveK dataset [1] and deep convolutional neural networks [11, 23]. Most of these deep models, however, are limited by the input resolution or the processing time in practice. For real-time processing of high-resolution images, e.g., images with more than 24M pixels, Gharbi et al. [7] proposed the HDRNet, which puts most computation on downsampled images. He et al. [10] proposed to approximate a sequence of base operations such as brightness or contrast adjustments via a light-weight MLP, while Zeng et al. [32] proposed to learn an image-adaptive 3-dimensional LUT (3D LUT), which can retouch 4K images at a speed of more than 500 fps with appealing tones. Nonetheless, the above-mentioned methods do not address the HRP and GLC requirements, partially due to the lack of training data. In this paper, based on the constructed dataset, we propose two learning strategies to improve the performance of PPR, providing a benchmark for further research.
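To make the look-up-table idea concrete, the following is a minimal NumPy sketch of applying one fixed 3D LUT to an image with trilinear interpolation. It is an illustration under stated assumptions, not the implementation of [32]: the function names `identity_lut` and `apply_lut`, the `(n, n, n, 3)` LUT layout, and the float RGB range [0, 1] are all hypothetical choices made here, and the image-adaptive part (predicting the LUT from the image) is not modeled.

```python
import numpy as np

def identity_lut(n):
    """Build an (n, n, n, 3) LUT whose entry [i, j, k] is the color (i, j, k) / (n - 1)."""
    grid = np.linspace(0.0, 1.0, n)
    r, g, b = np.meshgrid(grid, grid, grid, indexing="ij")
    return np.stack([r, g, b], axis=-1)

def apply_lut(image, lut):
    """Map a float RGB image of shape (H, W, 3) in [0, 1] through a 3D LUT (n >= 2)."""
    n = lut.shape[0]
    pos = np.clip(image, 0.0, 1.0) * (n - 1)           # continuous grid coordinates
    lo = np.minimum(np.floor(pos).astype(int), n - 2)  # lower corner of the enclosing cell
    frac = pos - lo                                     # position inside the cell, in [0, 1]
    out = np.zeros(image.shape, dtype=np.float64)
    # Accumulate the 8 cell corners, each weighted by its trilinear coefficient.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                wr = frac[..., 0] if dr else 1.0 - frac[..., 0]
                wg = frac[..., 1] if dg else 1.0 - frac[..., 1]
                wb = frac[..., 2] if db else 1.0 - frac[..., 2]
                corner = lut[lo[..., 0] + dr, lo[..., 1] + dg, lo[..., 2] + db]
                out += (wr * wg * wb)[..., None] * corner
    return out
```

Because the lookup touches only eight LUT entries per pixel and involves no convolution, its cost grows linearly with the number of pixels, which is consistent with the high throughput reported for LUT-based retouching.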
3 The PPR10K Dataset
As discussed before, existing photo retouching datasets and models cannot fulfill the requirements of PPR. To solve these problems, we construct a large-scale and high-quality PPR (PPR10K) dataset.
Challenges: To construct a valuable PPR dataset that fulfills the real-world requirements, we have to overcome several challenges. First, the photos should be in raw format and of high quality. However, unlike the abundant and easily available compressed JPG images, raw photos are much more difficult to obtain on the internet. Second, the dataset should be large-scale and cover a wide range of real cases in terms of shooting purpose, human subjects, background scenes, lighting conditions and camera devices, which further increases the cost of data collection. Third, high-quality retouched results (with both good visual quality and group-level consistency) and human-region masks should be provided to learn effective PPR models. These requirements make the labeling process expensive and cumbersome.
Data Collection and Selection: To obtain as many raw portrait photos as possible, we negotiated with many individual photographers and professional photography studios to purchase raw portrait photos in groups for research purposes only. We also purchased photos from several paid material websites that provide portrait photos in raw format. During data collection, we carefully controlled the diversity of raw photos in terms of shooting purpose (e.g., wedding, birthday, graduation, anniversary, advertisement, personal recording and creation), human subjects (including babies, children, young adults, couples and people from around the world), background scenes (including indoor and outdoor), lighting conditions (from day to night, winter to summer), and camera devices (covering a wide series of high-end DSLR cameras from Canon, Nikon and Sony). The diversity of the collected photos is shown in Figure 2.
We initially collected a large pool of raw photos and then conducted several rounds of selection. We first discarded photos without human subjects, photos of low quality (e.g., with severe motion blur or defocus), and photos containing inappropriate information. We then carefully checked the photos group by group, removing outliers (photos with very different content from the rest of the group) and duplicates (photos with almost the same content). After the screening, we obtained the final set of portrait photos, organized in groups; each group contains photos of the same subjects captured in the same scene at consecutive times. Two typical groups of photos are shown in Figure 1.
Data Labeling: To obtain high-quality ground-truths, we hired three expert retouchers, all of whom have years of experience working in the professional photographic industry, to retouch the raw photos independently using CameraRaw in Photoshop. Each retoucher was required, based on their own domain knowledge, to retouch the raw photos to satisfy the output standards of professional portrait photography studios, with two major requirements. First, each photo should be retouched to be visually pleasing to ordinary viewers, especially in the human regions. Second, a group of photos should be adjusted to have a consistent tone. Retouchers were allowed to use any operation in CameraRaw as long as it did not change content or introduce geometric distortions. In addition, the retouching of each expert was required to be self-consistent among similar scenes, which is important for learning a stable and robust retouching model. We also hired another expert to double-check the retouched results and conducted several rounds of feedback and repair to ensure the high quality of the ground-truths. The retouching styles of the three experts are shown in the supplementary file.
Considering the high priority of human regions and their complicated illumination in portrait photos, we also provide human-region masks for learning better retouching models. To save annotation cost, the masks were first generated using an internally developed portrait segmentation algorithm, which was trained on a set of human matting datasets and supports segmenting photos of up to 100 megapixels. We then manually checked and refined the failure cases in difficult scenes such as underwater, extremely low-light, glass-reflection and occluded cases.
Discussions: Despite the high-quality of our constructed dataset, it leaves several challenges in learning an effective portrait retouching model. First, both the raw photos and human-region masks have very high resolution ranging from 4K to 8K, which requires retouching models to be highly efficient. Second, the diversity of content and lighting conditions in various scenes requires the models to be flexible and content-adaptive. Third, the demand of group-level consistency requires the models to be robust and stable, which is critical for practical applications.
4 Measures and Learning Strategies
Based on the PPR10K dataset, we define a set of measures to quantitatively evaluate the performance of a PPR method. We also propose learning strategies to optimize the HRP and GLC requirements of the PPR task.
4.1 Basic Measures
We first define two basic measures: the peak signal-to-noise ratio (PSNR) and the CIELAB color difference $\Delta E_{ab}$ [28]. Given an input portrait photo $\mathbf{X}$, denote by $\hat{\mathbf{Y}}$ and $\mathbf{Y}$ its version predicted by a PPR model and the target retouched by a human expert, respectively. Their conversions to the Lab color space are denoted by $\mathbf{X}^{Lab}$, $\hat{\mathbf{Y}}^{Lab}$ and $\mathbf{Y}^{Lab}$. Similar to the PSNR, which is defined based on the $L_2$-distance in sRGB color space, the color difference is defined as the $L_2$-distance in CIELAB color space:
$$\Delta E_{ab} = \left\| \hat{\mathbf{Y}}^{Lab} - \mathbf{Y}^{Lab} \right\|_2 .$$
Compared to the sRGB color space, the CIELAB color space is more perceptually uniform and is widely used to tune the tones of photos [27].
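The two basic measures can be sketched in NumPy as follows. This is a minimal illustration rather than the evaluation code used in the paper: the function names are hypothetical, images are assumed to be 8-bit sRGB arrays of shape (H, W, 3), the sRGB-to-CIELAB conversion uses the standard D65 formulas, and the color difference is reported as the per-pixel Euclidean distance averaged over the image, which is one common reading of the $L_2$-distance (the paper's exact normalization is not recoverable from the text).

```python
import numpy as np

# Linear-sRGB to XYZ matrix (D65); its row sums give the reference white.
_RGB2XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                     [0.2126729, 0.7151522, 0.0721750],
                     [0.0193339, 0.1191920, 0.9503041]])

def srgb_to_lab(rgb8):
    """Convert an (H, W, 3) uint8 sRGB image to CIELAB."""
    rgb = rgb8.astype(np.float64) / 255.0
    # Undo the sRGB transfer curve to get linear light.
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Project to XYZ and normalize by the white point so that white maps to (1, 1, 1).
    xyz = (lin @ _RGB2XYZ.T) / _RGB2XYZ.sum(axis=1)
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def psnr(pred8, target8):
    """PSNR in dB between two 8-bit sRGB images."""
    diff = pred8.astype(np.float64) - target8.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(255.0 ** 2 / mse))

def delta_e_ab(pred8, target8):
    """Mean per-pixel Euclidean distance in CIELAB (assumed normalization)."""
    diff = srgb_to_lab(pred8) - srgb_to_lab(target8)
    return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=-1))))
```

As a sanity check, pure white and pure black differ only in lightness (L = 100 versus L = 0), so their color difference under this sketch is 100 and their PSNR is 0 dB.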
4.2 Human-centered Measures
Considering the higher priority of human regions in portrait photos, we further define two human-centered measures, which are obtained by putting higher weights on human regions than on background regions, leveraging the human-region masks provided in the PPR10K dataset. Given a photo of $H \times W$ resolution, we construct its weighting matrix $\mathbf{W} = [w_{i,j}] \in \mathbb{R}^{H \times W}$, where $w_{i,j}$ takes a higher value for human regions and a lower value for background regions; the weights are fixed empirically. The human-centered PSNR ($PSNR^{HC}$) and color difference ($\Delta E_{ab}^{HC}$) can be defined accordingly. To save space, we only provide the formula of $\Delta E_{ab}^{HC}$:
$$\Delta E_{ab}^{HC} = \left\| \mathbf{W} \circ \left( \hat{\mathbf{Y}}^{Lab} - \mathbf{Y}^{Lab} \right) \right\|_2 ,$$
where $\circ$ denotes the element-wise matrix multiplication.
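The mask-based weighting can be illustrated with a short NumPy sketch of a human-centered PSNR. The names `weight_map` and `psnr_hc` are hypothetical, and so are the default weights (1.0 for human pixels, 0.5 for background pixels) and the normalization by the total number of pixels; they are placeholders chosen for illustration, not values taken from the paper.

```python
import numpy as np

def weight_map(mask, w_human=1.0, w_background=0.5):
    """Turn a binary human-region mask (H, W) into a per-pixel weight matrix.

    The default weights are illustrative assumptions only.
    """
    return np.where(mask > 0, w_human, w_background).astype(np.float64)

def psnr_hc(pred8, target8, mask, w_human=1.0, w_background=0.5):
    """PSNR computed on the weighted difference W * (pred - target)."""
    w = weight_map(mask, w_human, w_background)[..., None]  # broadcast over RGB
    diff = pred8.astype(np.float64) - target8.astype(np.float64)
    mse = np.mean((w * diff) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(255.0 ** 2 / mse))
```

With these placeholder weights, an error that lies entirely in the background is halved before squaring, so the human-centered PSNR is about 6 dB higher than the unweighted PSNR for the same prediction, while an error of the same size on the subject is penalized in full.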
4.3 Group-level Consistency Measure
Different from the above measures, which are based on individual photos, the group-level consistency (GLC) measures the variations in tone and color among a group of photos. This measure can hardly be defined in the image space since the image contents in a group of photos are not aligned (refer to Figure 1). A reasonable GLC measure should be sensitive to changes in global tone and color appearance while being robust to changes in image content. It is worth mentioning that the content change is limited within a group of photos, since they share the same subject and a similar background. Inspired by the practice in white balance [6], we define the GLC measure based on the statistics of color components.
Specifically, given a group of predicted photos $\{\hat{\mathbf{Y}}_k\}_{k=1}^{K}$, we first calculate the mean color components $m_k^c$ of each photo to represent its global tone and color appearance. The GLC measure is then defined as the variance of the mean color components:
$$GLC_c = \frac{1}{K} \sum_{k=1}^{K} \left( m_k^c - \bar{m}^c \right)^2, \qquad \bar{m}^c = \frac{1}{K} \sum_{k=1}^{K} m_k^c ,$$
where $c$ denotes a color channel or a combination of channels. Through extensive quantitative studies, we empirically found that a GLC measure based on a combination of channels in the CIELAB color space is the most suitable and stable choice. The study details can be found in the supplementary file.
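The variance-of-means definition can be sketched in a few lines of NumPy. The function name `glc` is hypothetical, the inputs are assumed to be images already converted to CIELAB, and two details are assumptions made here rather than facts from the paper: the default channel selection (indices 1 and 2, i.e. a and b) and the way channels are combined (summing the per-channel population variances).

```python
import numpy as np

def glc(group, channels=(1, 2)):
    """Variance of per-photo mean color components across a group.

    group:    list of (H, W, C) arrays; photos may have different sizes.
    channels: indices of the color channels to include (illustrative default).
    """
    # One mean vector per photo summarizes its global tone and color.
    means = np.array([img.reshape(-1, img.shape[-1]).mean(axis=0) for img in group])
    # Population variance over the group, summed over the selected channels.
    return float(np.var(means[:, list(channels)], axis=0).sum())
```

Since only per-photo means enter the computation, the measure does not require pixel alignment between photos, which matches the motivation for defining it on color statistics instead of in the image space.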
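Table 1 (excerpt): results of the 3D LUT [32] baseline on the three expert sets, at low resolution (LR, 360p) and original high resolution (HR).
| Row | Method | Dataset | PSNR (LR) | PSNR (HR) | $\Delta E_{ab}$ (LR) | $\Delta E_{ab}$ (HR) | $PSNR^{HC}$ (LR) | $PSNR^{HC}$ (HR) | $\Delta E_{ab}^{HC}$ (LR) | $\Delta E_{ab}^{HC}$ (HR) | GLC (LR) | GLC (HR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3D LUT [32] | PPR10K-a | 25.64 | 25.15 | 6.97 | 7.25 | 28.89 | 28.39 | 4.53 | 4.71 | 11.47 | 11.05 |
| 9 | 3D LUT [32] | PPR10K-b | 24.70 | 24.30 | 7.71 | 7.97 | 27.99 | 27.59 | 4.99 | 5.16 | 9.90 | 9.52 |
| 15 | 3D LUT [32] | PPR10K-c | 25.18 | 24.78 | 7.58 | 7.85 | 28.49 | 28.09 | 4.92 | 5.09 | 13.51 | 13.16 |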
4.4 Learning Strategies
Optimizing the basic measures and human-centered measures is straightforward. We simply employ the human-region weighted MSE loss on sRGB color images to optimize a model with these measures:
$$\mathcal{L}_{mse} = \left\| \mathbf{W} \circ \left( \hat{\mathbf{Y}} - \mathbf{Y} \right) \right\|_2^2 ,$$
where all entries of $\mathbf{W}$ are set to the same value for the basic measures. For the human-centered measures, human regions are assigned a larger weight than backgrounds to accelerate training.
Explicitly optimizing the GLC measure is complicated since it introduces considerable additional cost, including reading and processing a group of photos and color space conversion. To simplify and speed up the training process, we introduce a strategy to simulate the group-level variation using a single image, the pipeline of which is shown in Figure 3. Specifically, given an input $\mathbf{X}$, we randomly crop two overlapping patches $\mathbf{X}_1$ and $\mathbf{X}_2$ to mimic the view change in a group of photos. We further randomly adjust the tonal attributes of the two crops, such as temperature and exposure, to synthesize changes in lighting condition and camera setting. We feed the two synthesized crops into a PPR model, obtaining two predictions $\hat{\mathbf{Y}}_1$ and $\hat{\mathbf{Y}}_2$ and their overlapped regions $\hat{\mathbf{Y}}_1^{o}$ and $\hat{\mathbf{Y}}_2^{o}$. The GLC can be optimized using the following constraint:
$$\mathcal{L}_{glc} = \left\| \hat{\mathbf{Y}}_1^{o} - \hat{\mathbf{Y}}_2^{o} \right\|_2^2 .$$
The total loss is calculated as $\mathcal{L} = \mathcal{L}_{mse} + \lambda \mathcal{L}_{glc}$, where $\lambda$ is a constant parameter to balance the two losses. The value of $\lambda$ is kept fixed in the experiments.
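The single-image simulation described above can be sketched as follows. This is a simplified NumPy illustration under several assumptions, not the training code of the paper: the names `overlapping_corners` and `simulated_glc_loss` are hypothetical, the tonal adjustment is reduced to a random exposure gain in a made-up range [0.8, 1.2], the crop size of 64 is arbitrary, the consistency term is averaged over the overlap, and `model` is any callable that maps a float image in [0, 1] to an output of the same spatial size.

```python
import numpy as np

def overlapping_corners(h, w, crop, rng):
    """Pick top-left corners of two crop x crop windows that always overlap."""
    y1 = int(rng.integers(0, h - crop + 1))
    x1 = int(rng.integers(0, w - crop + 1))
    # Shifting by at most half a crop guarantees a non-empty overlap.
    s = crop // 2
    y2 = int(np.clip(y1 + rng.integers(-s, s + 1), 0, h - crop))
    x2 = int(np.clip(x1 + rng.integers(-s, s + 1), 0, w - crop))
    return (y1, x1), (y2, x2)

def simulated_glc_loss(image, model, crop=64, rng=None):
    """Mean squared difference of two predictions on their shared region."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    (y1, x1), (y2, x2) = overlapping_corners(h, w, crop, rng)
    # Stand-in for tonal jitter: an independent exposure gain per crop.
    g1, g2 = rng.uniform(0.8, 1.2, size=2)
    p1 = model(np.clip(image[y1:y1 + crop, x1:x1 + crop] * g1, 0.0, 1.0))
    p2 = model(np.clip(image[y2:y2 + crop, x2:x2 + crop] * g2, 0.0, 1.0))
    # Overlap rectangle in the coordinates of the full image.
    top, left = max(y1, y2), max(x1, x2)
    bottom, right = min(y1, y2) + crop, min(x1, x2) + crop
    o1 = p1[top - y1:bottom - y1, left - x1:right - x1]
    o2 = p2[top - y2:bottom - y2, left - x2:right - x2]
    return float(np.mean((o1 - o2) ** 2))
```

A model whose output tone does not depend on the input exposure yields zero loss on the overlap, whereas a model that simply passes the input through is penalized in proportion to the gap between the two simulated exposures. In training, this term would be added to the weighted MSE with a balancing constant, as in the total loss above.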
5 Experiments
5.1 Experiment Settings
Datasets: We employed two datasets in our experiments: the constructed PPR10K dataset and the general-purpose FiveK [1] dataset. The PPR10K dataset is randomly divided, group by group, into a training set and a testing set. The FiveK dataset is randomly divided into a training set with 4,500 images and a validation set with 500 images, following the common practice. Input images are pre-processed into 16-bit TIFF format via CameraRaw to preserve as much information from the raw file as possible, while the target images are converted into the 8-bit sRGB color space for convenient display on common devices. To speed up the training process, training images are resized to 360p resolution (measured on the short side of the images). The testing images have two versions: the 360p resolution and the original resolution ranging from 4K to 8K.
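The arithmetic behind this pre-processing can be sketched as follows. The helper names `short_side_size` and `to_unit_float` are hypothetical; the sketch only computes the resized dimensions and normalizes integer pixel data, and leaves the actual resampling to whichever image library is used.

```python
import numpy as np

def short_side_size(height, width, short=360):
    """Output (height, width) after scaling so the short side equals `short`."""
    scale = short / min(height, width)
    return round(height * scale), round(width * scale)

def to_unit_float(img):
    """Map an integer image (e.g. uint16 input or uint8 target) to floats in [0, 1]."""
    return img.astype(np.float64) / np.iinfo(img.dtype).max
```

For example, a 4000 x 6000 photo is reduced to 360 x 540, and both the 16-bit inputs and the 8-bit targets end up in the same [0, 1] range, so that a loss between them is well defined.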
Baseline Methods: In practice, PPR needs to process very high-resolution photos, which rules out most previous photo retouching/enhancement models because of their heavy computational and memory costs. We therefore employ three competitive and efficient models in our experiments, namely HDRNet [7], CSRNet [10] and 3D LUT [32], using the source code released by the authors. To better model such a large-scale and diverse dataset, for the 3D LUT [32] method, we employ multiple LUTs and use ResNet-18 [11] (initialized with weights pre-trained on ImageNet [4]) as the scene classifier.
Data Augmentation: Besides the commonly used data augmentation methods such as flipping and rotation, we also augment the training images by adjusting visual attributes in CameraRaw, i.e., temperature, tint, exposure, highlights, contrasts and saturation, to enrich the lighting and color distributions of the training set. The augmentation details can be found in the supplementary material.
5.2 Baseline Performance
We first evaluate the baseline performance of the three state-of-the-art photo retouching/enhancement methods on our PPR10K dataset. We retrained each model on each of the three expert sets independently and report their performance under five measures (PSNR, $\Delta E_{ab}$, $PSNR^{HC}$, $\Delta E_{ab}^{HC}$, GLC) in Table 1 (rows 1-3, 7-9, 13-15). Each measure is evaluated at two resolutions: 360p low resolution (LR) and original high resolution (HR). Several observations can be made from the results.
First, all three models obtain reasonable results on the basic measures (PSNR and $\Delta E_{ab}$), which indicates that the annotations of the three experts are of high quality and self-consistent. Among the three versions, the retouching style of expert-a is relatively easier to learn, as expected, since this expert prefers rendering a stronger and more stable tonal style for all scenes, leading to a compact target space that is easier to model. In contrast, the other two experts prefer a mild rendition to preserve the naturalness of photos (visual examples are provided in the supplementary file). Among the three models, 3D LUT [32] achieves consistently better performance in most cases than HDRNet [7] and CSRNet [10]. Given its high performance and high efficiency, we choose 3D LUT as the baseline model to study the proposed learning strategies in Section 5.4.
5.3 Models Trained on FiveK and PPR10K
This section compares the PPR performance of the three methods by training them on the FiveK dataset and on our PPR10K dataset, respectively. We employed the commonly used expert C as the target on the FiveK dataset to train the three models. Input images were processed to have the same format as in our PPR10K dataset. We evaluated the trained models on the three testing sets of PPR10K and report the quantitative results in Table 2. Qualitative comparisons are shown in Figure 4. As expected, all models trained on the FiveK dataset obtain much worse performance on all measures compared to their counterparts trained on the PPR10K dataset (refer to Table 1), because of the domain gap between general-purpose photo enhancement and PPR. As shown in Figure 4, the results obtained by the FiveK models have two obvious problems. First, the tone and color appearance of each individual photo is unpleasing especially on the human regions. Specifically, the girl’s face is dark in the shadow with unnatural color. Second, the retouched results in a group have large variations on both global tone and local contrast. For example, the third photo has obviously higher brightness and more natural color compared to the first one. In contrast, models trained on our PPR10K dataset achieve not only better individual visual quality but also higher group-level consistency.
5.4 Effectiveness of the Learning Strategies
This section evaluates the effectiveness of the proposed learning strategies using the 3D LUT model. On each of the three PPR10K sets, we trained three 3D LUT models, using only the HRP strategy, only the GLC strategy, and both strategies, respectively, and report the results in Table 1 (rows 4-6, 10-12, 16-18).
One can see that using the HRP loss brings better results on most individual measures. This is reasonable since all three experts paid special attention to the human regions during their retouching. Putting higher weights on human regions thus leads to better individual retouching quality. Two typical visual examples are shown in Figure 5. One can see that using the HRP loss leads to better visual quality (brighter faces and more natural temperature on both examples) on the human regions.
Using the GLC loss slightly deteriorates the four individual measures but improves the GLC measure. A qualitative example of learning with the GLC loss is shown in Figure 6. As shown in the figure, compared to the results obtained by the baseline 3D LUT, the color of the background tends to be more consistent when the GLC loss is employed. Specifically, the color of the curtain in Figure 6 (b, d, f) varies with the baseline 3D LUT, while it remains a consistent pink when the GLC loss is employed. Another observation is that combining the GLC and HRP losses further improves the GLC measure. This is possibly because jointly optimizing the HRP and GLC losses enables the model to learn complementary information and consequently achieve a good trade-off between individual visual quality and group-level consistency.
6 Conclusion
We constructed a large-scale PPR dataset, which is, to the best of our knowledge, the first of its kind. We collected high-quality raw portrait photos with diverse contents from individual photographers and professional photography studios. After careful screening, the selected portrait photos were organized into groups. High-quality human-region masks are provided in the dataset. We invited three expert retouchers to label the photos with priority given to the human regions and to the tonal consistency within each group of photos. We defined a set of human-centered and group-level consistency measures to faithfully evaluate the performance of a PPR model, and accordingly proposed learning strategies to train high-quality PPR models. Extensive experiments were conducted to demonstrate the value of the constructed dataset and the effectiveness of the proposed measures and learning strategies.
References
[1] (2011) Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR.
[2] (2018) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing 27 (4), pp. 2049–2062.
[3] (2018) Deep photo enhancer: unpaired learning for image enhancement from photographs with GANs. In CVPR.
[4] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
[5] (2018) Aesthetic-driven image enhancement by adversarial learning. In ACM MM.
[6] (2004) Shades of gray and colour constancy. In Color and Imaging Conference, Vol. 2004, pp. 37–41.
[7] (2017) Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–12.
[8] (2011) Computational color constancy: survey and experiments. IEEE Transactions on Image Processing 20 (9), pp. 2475–2489.
[9] (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 35 (6).
[10] (2020) Conditional sequential modulation for efficient global image retouching. In ECCV.
[11] (2016) Deep residual learning for image recognition. In CVPR.
[12] (2018) Exposure: a white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37 (2), pp. 1–17.
[13] (2017) DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV.
[14] (1997) Properties and performance of a center/surround retinex. IEEE Transactions on Image Processing 6 (3), pp. 451–462.
[15] (2016) A software platform for manipulating the camera imaging pipeline. In ECCV.
[16] (1997) Contrast enhancement using brightness preserving bi-histogram equalization. IEEE Transactions on Consumer Electronics 43 (1), pp. 1–8.
[17] (2020) Unpaired image enhancement featuring reinforcement-learning-controlled image editing software. In AAAI.
[18] (2008) Display adaptive tone mapping. In ACM SIGGRAPH 2008 papers, pp. 1–10.
[19] (2008) Enhancement of color images by scaling the DCT coefficients. IEEE Transactions on Image Processing 17 (10), pp. 1783–1794.
[20] (2018) Distort-and-recover: color enhancement using deep reinforcement learning. In CVPR, pp. 5928–5936.
[21] (1996) Multi-scale retinex for color image enhancement. In ICIP.
[22] (2011) Automatic exact histogram specification for contrast enhancement and visual system based quantitative evaluation. IEEE Transactions on Image Processing 20 (5), pp. 1211–1220.
[23] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[24] (1998, May 5) Photographic film reproducing apparatus using object brightness and exposure correction amount to develop photographed images. Google Patents. Note: US Patent 5,748,287.
[25] (2019) Underexposed photo enhancement using deep illumination estimation. In CVPR.
[26] (2013) Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Transactions on Image Processing 22 (9), pp. 3538–3548.
[27] (2020) CIELAB color space. Wikipedia, the free encyclopedia. Note: https://en.wikipedia.org/w/index.php?title=CIELAB_color_space&oldid=982533770
[28] (2020) Color difference. Wikipedia, the free encyclopedia. Note: https://en.wikipedia.org/w/index.php?title=Color_difference&oldid=979429323
[29] (2014) A learning-to-rank approach for image color enhancement. In CVPR.
[30] Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG) 35 (2), pp. 1–15.
[31] (2012) Automatic exposure correction of consumer photographs. In ECCV.
[32] (2020) Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence.