DR2S : Deep Regression with Region Selection for Camera Quality Evaluation

09/21/2020 ∙ by Marcelin Tworski, et al. ∙ Télécom Paris 6

In this work, we tackle the problem of estimating a camera's capability to preserve fine texture details under a given lighting condition. Importantly, our texture preservation measurement should coincide with human perception. Consequently, we formulate our problem as a regression one and we introduce a deep convolutional network to estimate a texture quality score. At training time, we use ground-truth quality scores provided by expert human annotators in order to obtain a subjective quality measure. In addition, we propose a region selection method to identify the image regions that are better suited for measuring perceptual quality. Finally, our experimental evaluation shows that our learning-based approach outperforms existing methods and that our region selection algorithm consistently improves the quality estimation.




I Introduction

With the rapid rise of smartphone photography, many people have become more interested in their photographs and in capturing high-quality images. Smartphone makers responded by putting an increasing emphasis on improving their cameras and image processing systems. In particular, sophisticated imaging pipelines are now used in the Image Signal Processor (ISP) of any smartphone. Such software makes use of recent advances such as multi-camera fusion [24], HDR+ [9] or deep learning-based processing to improve image quality. Despite the high performance of those techniques, their increasing complexity leads to a correspondingly increased need for quality measurement.

Several attributes are important to evaluate an image: target exposure and dynamic range, color (saturation and white balance), texture, noise and various artifacts that can affect the quality of the final image [3]. When it comes to camera evaluation, a camera can be evaluated for its capabilities in low-light, zoom, or shallow depth-of-field simulation.

In this work, we are interested in estimating the quality of a camera for multiple lighting conditions. More specifically, we aim at evaluating camera capabilities to preserve fine texture details. We also refer to this problem as texture quality estimation. Texture detail preservation should be measured in a way that reflects human perception. A typical way to evaluate the quality of a set of cameras consists of comparing shots of the same visual content in a controlled environment. The common visual content is usually referred to as a chart. The motivation for using the same chart when comparing different cameras is twofold. First, it facilitates the direct comparison of different cameras. In particular, when it comes to subjective evaluation, humans can more easily provide pairwise preferences than an absolute quality score. Second, when the common noise-free chart is known, this reference can be explicitly included in the quality measurement process. In this context, the Modulation Transfer Function (MTF) is a widely-used tool for evaluating the perceived sharpness and resolution, which are essential dimensions of texture quality. However, MTF-based methods suffer from important drawbacks. First, they were originally designed for conventional optical systems that can be modeled as linear. Consequently, non-linear processing in the camera processing pipeline, such as multi-image fusion or deep learning-based image enhancement, may lead to inaccurate quality evaluation [25].

Second, these methods assume that the norm of the device transfer function is a reliable measure of texture quality. However, in this work, we argue that this assumption fails to account for many nuances of human perception. Some recent works have shown that the magnitude of image transformations does not always coincide with the perceived impact of the transformations [28]. Consequently, we advocate that human judgment should be included more explicitly in the texture quality measurement process.

As a consequence, we propose DR2S, a Deep Regression method with Region Selection. Our contributions are threefold. First, we formulate the problem of assessing the texture quality of a device as a regression problem and we propose to use a deep convolutional network (ConvNet) for estimating quality scores. We aim at obtaining a score that is close to a subjective quality judgment: to this end, we use annotations provided by expert human observers as ground-truth at training time. Second, we propose an algorithm to identify the regions in a chart that are better suited to measure perceptual quality. Finally, we perform an extensive evaluation study that shows that our learning-based approach performs better than existing methods for texture quality evaluation and that our region selection algorithm leads to further improvement in texture quality measurement.

II Related Works

In this section, we review existing works on texture quality assessment. Existing methods can be classified into two main categories: MTF-based and learning-based methods.

II-A MTF-based methods

A simple model for a camera consists of a linear system that produces an image $y$ as the convolution of the point spread function $k$ and the incoming radiant flux $x$, i.e., $y = k * x$. In the frequency domain, $Y(f) = K(f)\,X(f) + N(f)$, where we also consider additive noise $N$. The modulation transfer function is $|K(f)|$ and it is commonly used to characterize an optical acquisition device [2].

Acutance is a single-value metric calculated from the device MTF. To compute this value, a contrast sensitivity function (CSF), modeling the spatial response of the human visual system, is used to weight the values of the MTF for the different spatial frequencies. The CSF depends on a set of parameters named viewing conditions, which describe how an observer would look at the resulting image. These parameters are usually the printing height and the viewing distance. The acutance is defined as the ratio of the integral of the CSF-weighted MTF to the integral of the CSF.
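The acutance ratio described above can be sketched numerically as follows. This is a minimal illustration, assuming the MTF and CSF are already sampled on a common frequency grid; the band-pass CSF shape and the exponential MTF used in the example are placeholders chosen for illustration, not the functions used in the paper.

```python
import numpy as np

def acutance(freqs, mtf, csf):
    """Ratio of the integral of the CSF-weighted MTF to the integral of the CSF."""
    freqs = np.asarray(freqs, dtype=float)
    mtf = np.asarray(mtf, dtype=float)
    csf = np.asarray(csf, dtype=float)

    def integrate(values):
        # Trapezoidal rule on the (possibly non-uniform) frequency grid.
        return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(freqs)))

    return integrate(mtf * csf) / integrate(csf)

# Illustrative inputs only: a hypothetical band-pass CSF and a smooth low-pass MTF.
freqs = np.linspace(0.0, 30.0, 301)
csf = freqs * np.exp(-0.2 * freqs)
mtf = np.exp(-0.05 * freqs)
print(acutance(freqs, mtf, csf))
```

With this definition, an ideal device whose MTF equals 1 at every frequency has an acutance of 1, and any attenuation in frequencies where the CSF is large lowers the value.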

The key assumption in MTF-based methods is that the noise-free content is available to estimate the transfer function. The noise-free content is often referred to as the full reference. Therefore, these methods are usually used only with synthetic visual charts. To the best of our knowledge, the only work estimating acutance on natural scenes was proposed by Van Zwanenberg et al. [25]. This method uses edge detection in the scene to compute the MTF. Early methods use charts containing a blur spot or a slanted edge for this computation. Loebich et al. [17] propose a method using the Siemens star. Cao et al. propose to use the Dead-Leaves model [18, 8], and introduce an associated method in [4], which is shown to be more appropriate to describe fine detail rendering since its textures are more challenging for devices. This chart is usually referred to as the Dead-Leaves chart. In this chart, the reference image is generated with occluding disks that have a random center location, radius and gray-scale value. Importantly, digital camera systems present high-frequency noise, which affects the MTF estimation by dominating the signal in the higher frequencies. Consequently, estimating the noise power spectral density (PSD) is key to obtaining an accurate acutance evaluation, and this task is not easily performed on the textured region. As a consequence, the noise PSD is typically estimated on a uniform patch. It is important to note that only the PSD of the reference image, and not the reference image itself, is needed for the acutance computation. For this reason, this method is referred to as Reduced-Reference (RR) acutance in the rest of this paper. However, this approach is hindered by the denoising algorithms integrated into cameras. Not only do these algorithms interfere with the noise PSD estimation, but they also behave differently in uniform and textured regions. In this context, Kirk et al. [14] propose to compute the MTF using the cross-power spectral density of the resulting and the reference images. This method assumes an effective registration of the chart. Sumner et al. [22] then modified Kirk's computation in order to make it more robust to slight misregistration. Since this method takes full advantage of the reference image, it is referred to as Full-Reference (FR) acutance in the rest of this paper.

In conclusion, state-of-the-art techniques typically provide a good estimation of a device's MTF and thus of its acutance. However, it has been shown that acutance itself does not always reflect human quality judgment well. This observation calls for learning-based methods that aim at reproducing the scores of human experts evaluating the images.

II-B Learning-based methods

In contrast to the MTF-based techniques described in the previous section, learning-based methods require annotated datasets. Early datasets (LIVE [21], CSIQ [15], TID2008 [20] and its extension TID2013 [19]) consist of noise-free images and subjective preference scores for different types and levels of artificially introduced distortions. These distortions mostly correspond to compression or transmission scenarios. Only some of them, such as intensity shift, additive Gaussian noise, contrast changes or chromatic aberrations, are relevant to the problem of camera evaluation, and those distortions do not encompass the complexity of real systems. Conversely, the KonIQ10k [12] dataset consists of samples from a larger public media database with unknown distortions claimed to be authentic. Similarly, the LIVE In the Wild [7] database consists of 1161 images shot with various devices, each image having unique content. For both datasets, crowd-sourced studies were conducted in order to collect opinions about the quality of these images. Interestingly, the size of these datasets is sufficiently large to allow the use of deep learning approaches. In particular, Varga et al. [26] propose to fine-tune a ResNet-101 network [11] equipped with spatial pyramid pooling layers [10], so that it accepts full images of non-fixed size, to predict the user preference score. This deep approach leads to the best performance on these authentic distortion benchmarks.

Despite the good results of this method, it cannot be directly applied to our problem. One of the reasons is that we aim at estimating the quality of the full image produced by the camera: resizing it would alter the fine details we try to evaluate, and without resizing we run into memory issues. To tackle this problem, we propose to work on patches randomly cropped from the input image.

More recently, Yu et al. [27] collected a dataset of 12,853 natural photos from Flickr and used Amazon Mechanical Turk to rate them according to image quality defects: exposure, white balance, color saturation, noise, haze, undesired blur, composition. They used a multi-column CNN based on GoogLeNet [23]: the first layers are shared, and the network then splits into one branch per attribute. For learning efficiency purposes, they frame the regression of each defect as an 11-class classification task.

Nevertheless, these deep learning approaches [10, 27] tackle a slightly different problem than ours. They are designed to evaluate image quality attributes for any input image while we address the problem of evaluating devices using a known common chart.

To address the limitations of both MTF-based and learning-based methods, we now introduce our novel deep regression framework for texture quality estimation.

Fig. 1: Pipeline overview of DR2S: Our method is divided into three main stages. First, we perform a naive training using patches randomly cropped over the chart. Second, we compute a map that indicates discriminant regions. Third, we perform a final training using only a selected region.

III Method

III-A DR2S Overview

In this section, we detail the proposed method for estimating texture quality. We formulate this task as a regression problem. We assume that we have a training dataset composed of $N$ color images $\{I_n\}_{n=1}^{N}$ of dimensions $H \times W$ with the corresponding texture quality scores $\{y_n\}_{n=1}^{N}$. We consider that these images depict the same chart and are taken with different cameras and different lighting conditions. Since we are interested in estimating the camera quality, each score corresponds to a quality score for one device at a specific lighting condition. Note that several training images can be taken with the same device. Importantly, since we aim at computing quality scores that coincide with human judgment, the ground-truth texture quality scores are provided by human annotators (see Sec. IV-A for more details). We aim at training a ConvNet with parameters $\theta$.

Our method is based on the observation that not all image regions are equally suited to predict the overall image quality. For example, regions with uniform textures will be rendered similarly by any device independently of its quality. Conversely, other regions with rich and fine texture details are much more discriminating since they will be captured differently by different devices. Based on this observation, we propose the algorithm illustrated in Fig. 1. The method is divided into three stages. First, we train a first neural network ignoring this problem of unsuited regions. Second, we employ this network for estimating which image regions are the most suitable for measuring image quality. Finally, the network is re-trained using only selected regions. In the following, we provide the details of the three stages of our algorithm.

III-B Initial Training

The goal of this first stage is to train a neural network that can be used to identify relevant image regions. To this end, we propose to train a network to regress the quality score from an input image patch. This initial network is later used in the second stage of our pipeline to identify discriminant regions (see Sec. III-C). We train this deep ConvNet on random crops extracted from the training images. These crops are randomly selected across the images with a uniform distribution. In all our experiments, we use the widely used ResNet-50 network pre-trained on ImageNet [6], where the final classification layer is replaced by a linear fully connected layer. Since some patches are not discriminant, this initial training suffers from instability and optimization issues. Consequently, we use the Huber loss, which reports better performance in the presence of noisy samples [5]:

$$\mathcal{L}_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}\,(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta, \\ \delta \left( |y - \hat{y}| - \frac{1}{2}\,\delta \right) & \text{otherwise,} \end{cases}$$

where $y$ and $\hat{y}$ denote the annotated and predicted scores, and $\delta$ is a threshold. At the end of this stage, we obtain a network that estimates the quality of a device from an input patch.
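As an illustration of why the Huber loss is less sensitive to noisy samples, the following NumPy sketch implements its standard definition: quadratic for small errors and linear beyond the threshold. The function name and the default threshold are illustrative choices, not values taken from the paper.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic when |error| <= delta, linear otherwise."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# A small error is penalized quadratically, a large one only linearly.
print(huber_loss([0.0], [0.1], delta=0.5))
print(huber_loss([0.0], [1.0], delta=0.5))
```

Because the penalty grows only linearly for large errors, patches whose content carries little information about quality contribute bounded gradients during the initial training.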

III-C Region Selection

In the second stage of our pipeline, we use our previously trained network to predict image regions that are most discriminating for quality measurement. We produce a map that indicates the relevance of each location of the chart to estimate texture quality. This map will allow us to select a suitable region to train our convNet in the last stage of our pipeline.

In order to estimate a single map for all the training images, we first register all the training images. We employ the following algorithm to align the images on the image with the highest resolution (47 MP). First, we detect points of interest. Then, we extract local AKAZE descriptors [1] and, finally, we estimate a homography for every image. Image warping is implemented using bicubic sampling. Note that, while the map computation requires this warping alignment step that may affect the performance, training and prediction can be performed on the original images. We now assume that the training images are registered.

To estimate the map, we propose to use the network trained in the first stage. Let $F \in \mathbb{R}^{H' \times W' \times C}$ be the feature tensor outputted by the backbone network for a given input image. In our case, since we employ a ResNet-50 network, $F$ corresponds to the tensor before the Global Average Pooling (GAP) layer. Since ResNet-50 is a fully convolutional network, the dimensions $H'$ and $W'$ depend on the input image dimensions, while the number of channels is fixed (i.e., $C = 2048$). Let $\mathbf{w} \in \mathbb{R}^{C}$ and $b \in \mathbb{R}$ be the trained parameters of the final regression layer obtained in the first stage. The network prediction is given by:

$$\hat{y} = \sigma\left( \mathbf{w}^\top \left( \frac{1}{H' W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_{i,j} \right) + b \right),$$

where $\sigma$ denotes the sigmoid function. While the network returns one single output scalar per input image, we want to obtain one value per pixel location. In order to adapt the class activation map framework [29] to our regression setting, we propose to compute a score for every feature map location $(i, j)$:

$$S_{i,j} = \sigma\left( \mathbf{w}^\top F_{i,j} + b \right),$$

where $F_{i,j} \in \mathbb{R}^{C}$ denotes the feature vector at the location $(i, j)$. Note that the resulting map has a dimension $H' \times W'$ that is different from the initial input image size $H \times W$. This size difference depends on the network architecture. In the case of ResNet-50, we obtain a ratio of 32 between the input and the feature map dimensions. Therefore, we resize the score map to the dimension $H \times W$ using bicubic sampling. This procedure is applied to every image of the training set. Thus, we obtain the set of score maps $\{S_n\}_{n=1}^{N}$.

We propose to define the confidence score map $M$ as the location-wise variance of the score maps $\{S_n\}_{n=1}^{N}$ over the whole training set. The motivation for this choice is that discriminating regions have a higher variance than non-discriminating ones. Indeed, we observe that the scores produced by the ConvNet over non-discriminative regions tend to have small variance. Conversely, on discriminating patches, the network predicts values with a wide range, leading to high variance.
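The two computations described in this section, a per-location score and a location-wise variance over registered score maps, can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the per-location score is taken to be the regression layer followed by the sigmoid applied to each feature vector, random arrays stand in for real backbone features, and the bicubic resizing to the input resolution is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def location_scores(features, weights, bias):
    """features: (H', W', C) tensor, weights: (C,), bias: scalar -> (H', W') score map."""
    return sigmoid(features @ weights + bias)

def confidence_map(score_maps):
    """score_maps: (N, H, W) registered score maps -> (H, W) location-wise variance."""
    return np.var(np.asarray(score_maps, dtype=float), axis=0)

# Toy example: random features replace the real ResNet-50 activations.
rng = np.random.default_rng(0)
weights = rng.normal(size=8)
maps = np.stack([location_scores(rng.normal(size=(4, 5, 8)), weights, 0.0)
                 for _ in range(6)])
print(confidence_map(maps).shape)  # one variance value per feature-map location
```

Locations where every image yields nearly the same score get a variance close to zero, which matches the intuition that such regions do not help to tell devices apart.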

III-D Final Training and Prediction

In the last stage of our pipeline, we select the chart region with the highest value in the confidence score map. In our preliminary experiments, we observed that using a region width approximately six times larger than the network input size leads to good performance. In our case, we use a square region. In this region, we select random patches that are used as a training set. We re-train the network, starting again from the ImageNet pre-trained weights.

At test time, given an image with an unknown quality score, we extract patches in the selected region. The final score is given by the average of the predictions over the different patches.
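The selection and test-time averaging can be sketched as below. Two assumptions are made that the text does not specify: the "highest confidence" region is interpreted as the square window with the largest summed confidence, and `predict_fn` is a hypothetical stand-in for the re-trained ConvNet applied to one patch.

```python
import numpy as np

def best_square_region(conf_map, size):
    """Return (top, left) of the size x size window with the largest summed confidence."""
    conf_map = np.asarray(conf_map, dtype=float)
    h, w = conf_map.shape
    if size > h or size > w:
        raise ValueError("region size exceeds the map dimensions")
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = conf_map.cumsum(axis=0).cumsum(axis=1)
    # Sum of every size x size window, computed from four corner lookups.
    sums = (integral[size:, size:] - integral[:-size, size:]
            - integral[size:, :-size] + integral[:-size, :-size])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left)

def predict_device_score(image, top, left, size, patch, predict_fn, n_patches=16, seed=0):
    """Average the predictions of predict_fn over random patches inside the region."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_patches):
        y = top + int(rng.integers(0, size - patch + 1))
        x = left + int(rng.integers(0, size - patch + 1))
        scores.append(predict_fn(image[y:y + patch, x:x + patch]))
    return float(np.mean(scores))
```

The integral image keeps the window search linear in the number of pixels, which matters for high-resolution charts; other selection rules (for example, the window centered on the maximum of the map) would fit the same interface.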

IV Experiments

In this section, we perform a thorough experimental evaluation of the proposed pipeline. We implemented our method using TensorFlow and Keras. When training the ConvNets (stages 1 and 3), we employ the Adam optimizer following [16], with a starting learning rate that is decayed by 0.1 every 40 epochs, for a total of 120 epochs. In our model, we assume that all the images have the same resolution. The reason for this choice is that we want the image details to be analyzed at the same scale, as a human observer would do. In practice, the resolution depends on the device. Therefore, we preprocess all the training images by resizing them to the highest resolution of the dataset using bicubic upsampling. This solution is preferred to downsampling to a common lower resolution since texture quality is not invariant to downsampling. In addition, due to possible lens shading, we remove the sides of the images.
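The learning-rate schedule can be written as a small step-decay function. This sketch assumes that "a decay of 0.1 every 40 epochs" means multiplying the learning rate by 0.1 at epochs 40 and 80; the initial value passed in the example is a placeholder, since the starting learning rate is not recoverable from the text.

```python
def step_decay_lr(epoch, initial_lr, factor=0.1, step=40):
    """Learning rate after multiplying by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# Placeholder initial learning rate, for illustration only.
for epoch in (0, 39, 40, 80, 119):
    print(epoch, step_decay_lr(epoch, initial_lr=1e-3))
```

A function of this form can typically be passed to a per-epoch scheduler callback in the training framework.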

IV-A Datasets

Charts and devices

As there is no well-established reference dataset for our problem, we collected annotated data using three different charts.

  • Still-Life: First, we use the chart displayed in Fig. 3(a). This dataset is referred to as Still-Life. The chart is designed to evaluate several image quality attributes and to present diverse content: vivid colors for color rendering, fine details, uniform zones, portraits as well as resolution lines and a low-quality Dead-Leaves version. Images are acquired using 140 different smartphones and cameras from different brands commonly available in the consumer market. In Fig. 2, we provide an example patch captured using three different cameras. The left image corresponds to a high-quality device while the two others are obtained with low-quality devices. It illustrates the nature of the distortions that appear in this dataset with different intensities. To obtain a larger database and predictions robust to lighting conditions, we shoot the chart using five different lighting conditions: 5 lux tungsten, 20 lux tungsten, 100 lux tungsten, 300 lux TL84, 1000 lux D65. Note that this process is repeated for every device.

  • Gray-DL: Second, we employ the Dead-Leaves chart proposed in [4]. As mentioned in Sec. II-A, this chart depicts gray-scale circles with random radii and locations. In all our experiments, we refer to this dataset as Gray-DL. We use the same five lighting conditions and devices as for the Still-Life chart.

  • Color-DL: Finally, we complete our experiments using the Dead-Leaves chart proposed in [22]. In contrast to Gray-DL, this chart is colored (an image can be found at [13]). For this chart, we employed a limited number of devices. More precisely, we employ only 14 devices with the same five lighting conditions as for the other charts. The low number of devices is especially challenging for learning-based approaches.

Fig. 2: Patches of high-quality (left) and low-quality (center and right) images from our Still-Life dataset.
Fig. 3: Charts used in our experiments: (a) Still-Life chart, (b) Gray-DL chart. The Still-Life chart contains many diverse objects with varying colors and textures while the Gray-DL chart depicts random gray-scale circles.

In order to train and evaluate the different methods, we need to provide a ground-truth annotation for each device. Note that the annotation must be provided for each pair of device and lighting condition. To obtain quality annotations that are a reliable proxy of the perceived visual quality, annotations are provided by human experts. Images to be evaluated were inserted among a fixed set of 42 references, and very high-quality prints were provided to help the annotators judge the authenticity of the details. Annotators were asked to compare the images using the same field of view for every image, on calibrated high-quality monitors where images are displayed without any down-sampling but with a possible digital zoom for the lower-resolution image. Each position among the set of references is assigned a score between 0 and 1. In the case of the Dead-Leaves charts, since the charts are unnatural images, human perceptual annotation is problematic. Therefore, we chose to use the annotations obtained on the Still-Life chart also for the Dead-Leaves charts, rather than annotating these images. The Still-Life chart contains diverse textures similar to what real images would contain. In this way, we obtain a subjective device evaluation in a setting more similar to real-life scenarios.

IV-B Metrics

In our problem, relying on standard classification or regression metrics is not straightforward. Indeed, MTF-based methods predict a quality score that is not directly comparable to the score provided by human annotators. A straightforward alternative could consist in computing the correlation between the predictions and the annotations. However, the underlying assumption that the predictions of each method correlate linearly with our annotations may not hold and may bias the evaluation. Therefore, we decided to rely on two distinct metrics based on the correlation of the rank order. First, we adopt the Spearman Rank-Order Correlation Coefficient (SROCC), defined as the linear correlation coefficient of the ranks of predictions and annotations. Second, we report the Kendall Rank-Order Correlation Coefficient (KROCC), defined as the difference between the numbers of concordant and discordant pairs divided by the number of possible pairs. The key advantage of this second metric lies in its robustness to outliers.
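Both rank-order metrics follow directly from the definitions above. The NumPy sketch below is a minimal illustration: SROCC is computed as the Pearson correlation of average ranks, and KROCC as the tau-a variant (ties count as neither concordant nor discordant), which is an assumption since the text does not state how ties are handled.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def srocc(pred, target):
    """Spearman: linear correlation coefficient of the ranks."""
    return float(np.corrcoef(average_ranks(pred), average_ranks(target))[0, 1])

def krocc(pred, target):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(pred)
    i, j = np.triu_indices(n, k=1)
    signs = np.sign(pred[i] - pred[j]) * np.sign(target[i] - target[j])
    return float(signs.sum() / (n * (n - 1) / 2))

# One swapped pair out of six: SROCC = 0.8, KROCC = 2/3.
print(srocc([0.1, 0.4, 0.35, 0.8], [1, 2, 3, 4]))
print(krocc([0.1, 0.4, 0.35, 0.8], [1, 2, 3, 4]))
```

Because both metrics depend only on ranks, any monotonic but non-linear relation between predictions and annotations (for instance an acutance value versus a human score) still yields a perfect score of 1.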

For all visual charts, the dataset is split into training and test sets as follows. First, among the devices we use in our experiments, several are produced by the same brand. To avoid bias between training and test, we impose no brand overlap between training and test sets. Second, as a consequence of this constraint, a limited number of brands may appear in the test set. To avoid evaluation biases towards specific brands, we use $k$-fold cross-validation.
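A brand-disjoint split of this kind can be sketched with the standard library alone. The greedy balancing rule, the function name and the device and brand labels below are illustrative assumptions; the paper does not describe how brands are assigned to folds, and the number of folds is left as a parameter.

```python
from collections import defaultdict

def brand_disjoint_folds(device_brands, k):
    """Split devices into k test folds so that no brand appears in two folds.

    device_brands: dict mapping a device name to its brand.
    """
    by_brand = defaultdict(list)
    for device, brand in sorted(device_brands.items()):
        by_brand[brand].append(device)
    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest brands first, each into the smallest fold.
    for brand in sorted(by_brand, key=lambda b: (-len(by_brand[b]), b)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_brand[brand])
    return folds

# Hypothetical devices and brands, for illustration only.
devices = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C", "d1": "D"}
for test_fold in brand_disjoint_folds(devices, k=2):
    train = [d for d in devices if d not in test_fold]
    print("test:", test_fold, "train:", train)
```

For each fold, the training set is the complement of the test fold, so every brand is seen either at training time or at test time, never both.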

In order to measure the impact of the number of devices on the performance, we perform experiments with a variable number of devices. For all the experiments on the Gray-DL and Still-Life charts, we report the results obtained using subsets of 20, 60, 100 and 140 devices. For a given number of devices, each experiment is performed over the same set of devices. Note that, for every method, the complete pipeline is repeated independently for every subset.

IV-C Ablation study

In order to experimentally justify our proposed method, we compare three different versions of our model:

  • Random Patch: In this approach, random patches are selected from the whole chart at both training and testing time.

  • Random Region: We then restrict the random patch extraction to a single zone, chosen randomly. We report the average over five random regions.

  • Selected Region: In this model we employ our full pipeline as described in Sec. III. In particular, training and test are performed using the selected region.

In these three models, we employ a ResNet-50 backbone trained using the same optimization hyper-parameters.

Model                        | SROCC (20) | SROCC (60) | SROCC (100) | SROCC (140) | KROCC (20) | KROCC (60) | KROCC (100) | KROCC (140)
Random Patch                 | 0.626      | 0.818      | 0.784       | 0.806       | 0.433      | 0.617      | 0.588       | 0.613
Random Region                | 0.795      | 0.863      | 0.866       | 0.879       | 0.606      | 0.680      | 0.682       | 0.700
Selected Region (Full model) | 0.830      | 0.912      | 0.890       | 0.900       | 0.638      | 0.740      | 0.716       | 0.728
TABLE I: Ablation study: we measure the impact of region selection by comparing three baseline models on the Still-Life chart. SROCC and KROCC metrics are reported; the number of devices is given in parentheses.

The results obtained on the Still-Life chart are reported in Table I. First, when using the Random Patch variant, the model trained on 20 devices performs poorly in terms of both SROCC and KROCC compared to the other variants. In this case, we see that at least 60 devices are required to reach satisfying performance. Second, we observe that restricting random patch extraction to a randomly selected region leads to better performance than extracting patches over the whole chart. The gain is visible for every number of devices and for both metrics. A possible explanation is that the decreased diversity in content leads to a ConvNet that is specialized in a specific region of the chart. In other words, the benefit of a more restrained input diversity is larger than the benefit of a larger and more diverse training set. Finally, our full model reaches the best performance for both metrics and for every number of devices. This better performance, independent of the training subset, demonstrates the robustness of the proposed method. Interestingly, with 20 devices we obtain performance similar to that of the Random Patch model with 140 devices. Overall, this ablation study illustrates the benefit of selecting specific regions for texture quality measurement.

IV-D Qualitative analysis of our region selection

In order to further study the outcome of our region selection algorithm, we display the resulting map (Fig. 4) of relevant zones.

Fig. 4: Normalized discriminant-region map (better viewed with a digital zoom). For display, we employ histogram equalization for normalization and obtain values from 0 to 1.

We observe that uniform regions are considered by our algorithm as the least discriminant for texture quality assessment. In particular, this is visible in the bottom-right regions on the black square patch. On the contrary, regions with low contrast and many small details appear to be more discriminant (see around the banknote region). Results on wooden regions seem to depend on wood grain.

This analysis is performed considering all the images. We now propose to analyze the regions that discriminate devices among only low-quality or only high-quality images. For this analysis, the test set is split according to the ground-truth score. In this way, we compute two discriminant maps. Two small crops of these two maps are shown in Fig. 5. Interestingly, we observe that restricting our analysis to high-quality or low-quality images leads to differences in results. For example, we observe that the resolution lines (in the bottom row of Fig. 5) discriminate for low-quality images, but not for higher-quality images. Conversely, areas exhibiting only very fine details are not the most useful for low-quality images. In particular, the forehead of the man is not discriminant among low-quality images, while this region is highly discriminant among high-quality images. It shows that regions with very fine details are discriminant only among high-quality images since these details are completely distorted by all the low-quality devices.

(a) Low quality images
(b) High quality images
Fig. 5: Comparison of discriminant-region maps for high-quality and low-quality images. We display two patches extracted from the confidence maps obtained when using texture quality maps only from high-quality (i.e., left) and low-quality (i.e., right) images.

IV-E Comparison to state of the art

Method          | Chart      | SROCC (20) | SROCC (60) | SROCC (100) | SROCC (140) | KROCC (20) | KROCC (60) | KROCC (100) | KROCC (140)
RR Acutance [4] | Gray-DL    | 0.704      | 0.794      | 0.747       | 0.788       | 0.533      | 0.595      | 0.592       | 0.592
ResNet [11]     | Gray-DL    | 0.641      | 0.795      | 0.792       | 0.824       | 0.464      | 0.598      | 0.592       | 0.630
DR2S (Ours)     | Still-Life | 0.830      | 0.912      | 0.890       | 0.900       | 0.638      | 0.740      | 0.716       | 0.728
TABLE II: Comparison of deep learning systems on different charts to [4]. SROCC and KROCC metrics are reported; the number of devices is given in parentheses.

In this section, we compare the performance of our approach to existing methods. This comparison is twofold since both the methods and the charts need to be compared. We perform two sets of experiments. In the first set of experiments, we compare different methods on the two large datasets recorded with the Gray-DL and the Still-Life charts and the same 140 devices. The second set of experiments consists of a comparison of the devices on the Color-DL chart. This second set of experiments is highly challenging for learning-based methods because of the limited amount of training data.

Large database experiments

First, in our preliminary experiments, we observed that, for this experiment, adding a small amount of Gaussian noise and a random change in exposure leads to better performance. This data augmentation is performed on the fly on every training patch. Our main competitor is the RR acutance method proposed in [4]. For the acutance computation, viewing conditions were set to a printing height of 120 centimeters and a viewing distance of 100 centimeters. Note that the RR acutance method is intrinsically designed for the Dead-Leaves charts and cannot be used for the Still-Life chart. We include a second deep learning-based method for the Gray-DL chart in our comparison. This approach consists of a ResNet-50 [11] where the classification layer is replaced by 3 additional fully-connected layers and a linear regression layer. For this approach, we employ the Random Patch strategy described in Sec. IV-C inside the texture region. Importantly, we do not report the performance of DR2S on the Gray-DL chart since the chart is designed to be uniformly discriminant for texture quality assessment.
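The on-the-fly augmentation mentioned in this paragraph (a small amount of Gaussian noise and a random exposure change) could look like the following NumPy sketch. The multiplicative gain used to model the exposure change, the noise standard deviation and the gain range are all assumptions made for illustration; the paper does not give these values.

```python
import numpy as np

def augment_patch(patch, rng, noise_std=0.01, max_exposure_shift=0.1):
    """Apply a random exposure gain and additive Gaussian noise to a patch in [0, 1]."""
    patch = np.asarray(patch, dtype=float)
    # Exposure change modeled as a global multiplicative gain (assumption).
    gain = 1.0 + rng.uniform(-max_exposure_shift, max_exposure_shift)
    noisy = patch * gain + rng.normal(0.0, noise_std, size=patch.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
patch = rng.uniform(size=(224, 224, 3))  # stand-in for a training crop
augmented = augment_patch(patch, rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

Drawing a new gain and noise realization for every patch at every iteration means the network never sees exactly the same input twice, which is the usual motivation for applying such perturbations on the fly.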

Quantitative results are reported in Table II. First, we observe that with a limited number of devices for training (e.g., 20 devices), RR Acutance performs better than ResNet-50. However, the proposed approach clearly outperforms the texture-MTF based method in both SROCC and KROCC (see the 20-device columns of Table II). It shows that when few training samples are available, selecting the appropriate regions is essential for good performance. ResNet performance increases with the number of devices: with 140 devices, ResNet-50 clearly outperforms RR Acutance according to both metrics, showing the potential of learning-based methods. While comparisons between results obtained using different charts must be interpreted with care, this result clearly shows that a learning-based approach can be intrinsically better than acutance-based methods using the exact same input images. Finally, our DR2S method on the Still-Life chart leads to the best results according to both metrics and for every number of devices.

Small database experiments

Concerning the second set of experiments, we compared the different methods on the Color-DL chart. Note that the 14 devices of the Color-DL chart are a subset of the devices of the Gray-DL and Still-Life charts. Consequently, we also performed experiments using only these 14 devices on these two other charts. Note that this setting is very challenging for the two learning-based methods (ResNet and DR2S) because of the limited amount of training data. Therefore, for the two learning methods, we perform two experiments. First, training and testing are performed using 14-fold cross-validation on the exact same data as the other methods. Second, we train our model on the complete database and test on the 14 devices in common with the Color-DL chart. These two variants are referred to as Restricted and Full. Again, we do not report the performance of DR2S on the Gray-DL and Color-DL charts since these charts are designed to be uniformly discriminant. Results are reported in Table III.
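The two protocols can be sketched as train/test splits over device identifiers. This is a hedged illustration: the function names and device ids are hypothetical, the 14-fold cross-validation is assumed to hold out one device per fold, and excluding the tested device from training in the Full variant is an assumption made here to avoid leakage rather than a detail stated in the text. The database size of 140 devices is used only as an example.

```python
def restricted_folds(small_devices):
    """Leave-one-device-out folds over the small device set (one fold per device)."""
    folds = []
    for held_out in small_devices:
        train = [d for d in small_devices if d != held_out]
        folds.append((train, [held_out]))
    return folds


def full_folds(all_devices, small_devices):
    """Train on the complete database, test on each device of the small set.

    Assumption of this sketch: the tested device is removed from the
    training set of its own fold.
    """
    folds = []
    for held_out in small_devices:
        train = [d for d in all_devices if d != held_out]
        folds.append((train, [held_out]))
    return folds


# Hypothetical device identifiers; the first 14 stand in for the Color-DL subset.
all_devices = ["device_%03d" % i for i in range(140)]
small_devices = all_devices[:14]

restricted = restricted_folds(small_devices)
full = full_folds(all_devices, small_devices)
print(len(restricted), len(restricted[0][0]))  # 14 folds, 13 training devices each
print(len(full), len(full[0][0]))              # 14 folds, 139 training devices each
```

In both variants the test devices are identical, so the resulting correlations are computed over the same 14 devices and differ only in the amount of training data.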

Method               Chart       SROCC  KROCC
FR Acutance [22]     Color-DL    0.701  0.544
RR Acutance [4]      Gray-DL     0.714  0.552
ResNet - Restricted  Gray-DL     0.640  0.463
ResNet - Full        Gray-DL     0.780  0.598
DR2S - Restricted    Still-Life  0.746  0.569
DR2S - Full          Still-Life  0.873  0.702

TABLE III: State-of-the-art comparison: performance on the 14-device database. Deep learning systems on the perceptual and Gray-DL charts are compared to [4] and [22].

First, we observe that the MTF-based methods perform similarly on the color and gray-scale dead-leaves charts. This shows that the better performance of the proposed model on the Still-Life chart is not due to the lack of colors in Gray-DL but to its content. Second, using the restricted database, both learning-based methods, ResNet and DR2S, underperform MTF-based predictions. However, when the amount of training data is sufficient, both methods outperform FR Acutance and RR Acutance.

V Conclusion and Future Work

In this paper, we proposed DR2S, a method that learns to estimate a perceptual quality score. To this end, our algorithm selects the chart region that is the most suitable for texture quality assessment. Our results also suggest that, if enough training samples are available, learning-based methods outperform MTF-based methods. A limitation of our method is that we select only a single region. However, texture quality is known to be multi-dimensional. Consequently, as future work, we plan to extend our method to multiple regions in order to highlight several complementary discriminant features and better measure the intrinsic qualities of a device.


  • [1] P. F. Alcantarilla and T. Solutions (2011) Fast explicit diffusion for accelerated features in nonlinear scale spaces. IEEE Trans. Pattern. Anal. Mach. Intell.
  • [2] G.D. Boreman (2001) Modulation transfer function in optical and electro-optical systems. Society of Photo Optical.
  • [3] M. Čadík, M. Wimmer, L. Neumann, and A. Artusi (2006) Image attributes and quality for evaluation of tone mapping operators. In National Taiwan University.
  • [4] F. Cao, F. Guichard, and H. Hornung (2009) Measuring texture sharpness of a digital camera. In Digital Photography V.
  • [5] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat (2018) On regression losses for deep depth estimation. In 2018 25th IEEE International Conference on Image Processing (ICIP).
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on Computer Vision and Pattern Recognition.
  • [7] D. Ghadiyaram and A. C. Bovik (2016) Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25 (1), pp. 372–387.
  • [8] Y. Gousseau and F. Roueff (2007) Modeling occlusion and scaling in natural images. Multiscale Modeling & Simulation.
  • [9] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (TOG).
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • [12] V. Hosu, H. Lin, T. Sziranyi, and D. Saupe (2020) KonIQ-10k: an ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing.
  • [13] Imatest website.
  • [14] L. Kirk, P. Herzer, U. Artmann, and D. Kunz (2014) Description of texture loss using the dead leaves target: current issues and a new intrinsic approach. In Digital Photography X.
  • [15] E. C. Larson and D. M. Chandler (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging.
  • [16] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud (2020) A comprehensive analysis of deep regression. IEEE TPAMI.
  • [17] C. Loebich, D. Wueller, B. Klingen, and A. Jaeger (2007) Digital camera resolution measurements using sinusoidal siemens stars. In Digital Photography III.
  • [18] G. Matheron (1975) Random sets and integral geometry. Wiley series in probability and mathematical statistics: Probability and mathematical statistics, Wiley.
  • [19] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al. (2015) Image database tid2013: peculiarities, results and perspectives. Signal Processing: Image Communication.
  • [20] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics.
  • [21] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing.
  • [22] R. C. Sumner, R. Burada, and N. Kram (2017) The effects of misregistration on the dead leaves crosscorrelation texture blur analysis. Electronic Imaging.
  • [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • [24] M. C. Trinidad, R. M. Brualla, F. Kainz, and J. Kontkanen (2019) Multi-view image fusion. In Proceedings of the IEEE International Conference on Computer Vision.
  • [25] O. van Zwanenberg, S. Triantaphillidou, R. Jenkin, and A. Psarrou (2019) Edge detection techniques for quantifying spatial imaging system performance and image quality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
  • [26] D. Varga, D. Saupe, and T. Szirányi (2018) DeepRN: a content preserving deep architecture for blind image quality assessment. In 2018 IEEE International Conference on Multimedia and Expo (ICME).
  • [27] N. Yu, X. Shen, Z. Lin, R. Mech, and C. Barnes (2018) Learning to detect multiple photographic defects. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).
  • [28] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition.