Local Blur Mapping: Exploiting High-Level Semantics by Deep Neural Networks

12/05/2016 · Kede Ma et al. · University of Waterloo · The University of Sydney

The human visual system excels at detecting local blur of visual images, but the underlying mechanism remains poorly understood. Traditional views of blur such as reduction in local or global high-frequency energy and loss of local phase coherence have fundamental limitations. For example, they cannot reliably discriminate flat regions from blurred ones. Here we argue that high-level semantic information is critical in successfully detecting local blur. Therefore, we resort to deep neural networks that are proficient in learning high-level features and propose the first end-to-end local blur mapping algorithm based on a fully convolutional network (FCN). We empirically show that high-level features of deeper layers indeed play a more important role than low-level features of shallower layers in resolving challenging ambiguities for this task. We test the proposed method on a standard blur detection benchmark and demonstrate that it significantly advances the state-of-the-art (ODS F-score of 0.853). In addition, we explore the use of the generated blur map in three applications, including blur region segmentation, blur degree estimation, and blur magnification.


I Introduction

Blur is one of the most common image degradations, arising from a number of sources including atmospheric scatter, camera shake, defocus, and object motion. It is also deliberately introduced by photographers to create visually pleasing effects that draw attention to the humans or objects of interest. Given a natural photographic image, the goal of local blur mapping is to label every pixel as either blurry or non-blurry, resulting in a blur map. Local blur mapping is an important component in many image processing and computer vision systems. For image quality assessment, blur is an indispensable factor that affects perceptual image quality [1, 2]. For example, the worst-quality images scored by human subjects in the LIVE Challenge Database [3] mainly suffer from motion and/or out-of-focus blur. For object detection, the identified blurred regions may be excluded for efficient and robust object localization [4]. Other applications that may benefit from local blur mapping include image restoration [5, 6], photo editing [7], depth recovery [8, 9], and image segmentation [10].

Fig. 1: Challenges in local blur mapping. The (blue, red) and (green, yellow) pairs of framed patches appear similar in terms of local structural features and complexity, making it difficult for local feature-based approaches to identify blur. By contrast, semantic information is helpful in making the distinction. (a) Test image from the blur detection benchmark [11]. (b) Ground truth blur map. (c) Blur map produced by the proposed DBM.

The human visual system (HVS) is good at identifying the blurry parts of an image with amazing speed [12, 13], but the underlying mechanism is not well understood. A traditional view of blur is that it reduces the energy (either globally or locally) at high frequencies. Several low-level features have been hand-crafted to exploit this observation; among them, power spectral slopes [14, 11] and image gradient statistics [15, 14, 11] are representative. Another view of blur perception is that it arises from the disruption of local phase coherence at precisely localized features (e.g., step edges) [16]; therefore, a coarse-to-fine phase prediction may serve as an indication of blur [16]. Nearly all previous local blur mappers rely on these two assumptions, either explicitly or implicitly, with limited success. In particular, they fail to discriminate flat regions from blurred ones, and they often mix up structures with and without blurring. A visual example is shown in Fig. 1, where both the blue- and red-framed patches appear smooth but have different origins: the texture of the sand in the blue-framed patch is lost due to severe blur, while the car body in the red-framed patch is flat in nature. On the other hand, the green- and yellow-framed patches at the top have similar local structural features with similar complexities, but the former suffers from blur while the latter does not. All of these make local blur mapping difficult for local feature-based algorithms.

In this regard, we argue that the fundamental problem with existing approaches is their ignorance of high-level semantic information in natural images, which is crucial in successfully identifying local blur. Therefore, we resort to deep convolutional neural networks (CNN), which have advanced the state-of-the-art in many high-level vision tasks such as image classification [17], object detection [4], and semantic segmentation [18]. Specifically, we develop the first fully convolutional network (FCN) [18] for end-to-end and image-to-image blur mapping [11], which we name deep blur mapper (DBM). By fully convolutional, we mean that all the learnable filters in the network are convolutional and no fully connected layers are involved. As a result, DBM allows input of arbitrary size, encodes spatial information thoroughly for better prediction, and maintains a relatively low computational cost. We adopt various architectures with different depths by trimming the 16-layer VGGNet [19] at different convolutional stages. By doing so, we empirically show that high-level features from deeper layers are more important than low-level features from shallower layers in resolving challenging ambiguities for local blur mapping, which conforms to our perspective on blur perception. We also experiment with more advanced directed acyclic graph based architectures that better combine low-level spatial and high-level semantic information, but they yield no substantial improvement, which again verifies the critical role of high-level semantics in this task. Due to the limited number of training samples, we initialize all networks with weights pre-trained on the semantic segmentation task [18], which contain rich high-level information about what an input image constitutes. DBM is tested on a standard blur detection benchmark [11] and outperforms state-of-the-art methods by a large margin.

Our contributions are three-fold. First, we provide a new perspective on blur perception, in which high-level semantic information plays a critical role. Second, we show that it is possible to learn an end-to-end and image-to-image local blur mapper based on FCNs [18], which resolves challenging ambiguities such as differentiating flat regions from blurred ones, and structures with blurring from those without. Third, we explore three potential applications of the generated blur maps, i.e., blur region segmentation, blur degree estimation, and blur magnification.

The rest of the paper is organized in the following manner. Section II reviews the related work of local blur mapping with emphasis on statistical analysis of traditional hand-crafted low-level features. Section III details the proposed DBM based on FCNs and their alternative architectures. Section IV conducts extensive comparative and ablation experiments to validate the promise of DBM. Section V concludes the paper.

II Related Work

Computational blur analysis is a long-standing problem in vision and image processing research, with early work dating back to the 1960s [20]. Most researchers in this field focus on the image deblurring problem, which aims to restore a sharp image from a blurred version [21, 22]. Because the problem is ill-posed, many algorithms assume uniform camera motion and the availability of the structure of the point spread function (PSF). Blind image deblurring takes a step further by simultaneously recovering the PSF and the latent unblurred image. It is frequently cast as a maximum a posteriori estimation problem, characterizing the unblurred image using natural image statistics as priors [23, 15, 24]. In practice, the PSF is often spatially varying, making blind image deblurring algorithms unstable and unsatisfactory. By contrast, blur mapping itself has been little investigated. Early works on blur mapping quantify the overall blur degree of an image and cannot perform dense prediction. For example, Marziliano et al. analyzed the spread of edges [25]. A similar approach was proposed in [26] by estimating the thickness of object contours. Zhang and Bergholm [27] designed a Gaussian difference signature to model the diffuseness caused by out-of-focus blur. The images under consideration are usually uniformly blurred with a Gaussian PSF, so the results can be directly linked to perceptual quality but cannot be generalized to the non-Gaussian and non-uniform blur encountered in the real world.

Fig. 2: Traditional low-level features fail to differentiate between flat and blurry regions, and between structures with and without blurring. (a) Local gradient statistics. (b) Local power spectral slopes. Although the gradient distribution of blurred patches exhibits a sharp peak at zero and a less heavy tail, which is distinctive from structured patches, it cannot tell, for example, whether structured patches undergo blurring or not. The same is observed for local power spectral slopes. We extract patches from images in the blur detection benchmark [11] to draw (a) and use the four patches in Fig. 1 to draw (b).

Only recently has local blur mapping become an active research topic. Rugna and Konik [28] identified blurry regions by exploiting the observation that they are more invariant to low-pass filtering. Blind deconvolution-based methods have also been investigated to segment motion-blurred [15] and defocus-blurred [29] regions. Zhuo and Sim [30] exploited the fact that the difference between blurred and re-blurred patches is insignificant, and estimated the defocus blur amount by the ratio between the gradients of the input and re-blurred images. Javaran et al. [31] followed a similar idea to [30] and characterized the difference in the DCT domain. Su et al. [32] examined the singular value information of blurry and non-blurry regions. Chakrabarti et al. [33] adopted the local Fourier transform to analyze directional blur. Liu et al. [14] manually designed three local features based on spectrum, gradient, and color information for blurry region extraction. Their features were later improved by Shi et al. [11] and combined with the responses of learned local filters to jointly analyze blurry regions in a multi-scale fashion. Pang et al. [34] described spatially varying blur by kernel-specific features. Chen et al. [35] adopted fast defocus belief propagation. Tang et al. [36] presented a blur metric based on the log averaged spectrum residual, whose maps are refined in a coarse-to-fine manner by exploiting neighborhood similarity. Yi and Eramian [37] explored local binary patterns in the context of blur identification and found that blurry regions have fewer local binary patterns than sharp regions. Zhu and Karam [38] quantified the level of spatially varying blur by integrating directional edge spread and just noticeable blur. More recently, Golestaneh and Karam [39] computed blur detection maps based on a high-frequency multi-scale fusion and sort transform of gradient magnitudes.

All the above-mentioned methods are based on hand-crafted low-level features and cannot robustly tell which parts of an image are truly blurred and which are flat in nature, or which structures have been blurred and which have not. To take a closer look, we perform a statistical analysis of two representative low-level features, namely local gradient statistics and local power spectral slopes, as shown in Fig. 2. The local gradient statistics are drawn by extracting more than one million patches from images in the blur detection benchmark [11]. It is widely recognized that the gradient distribution of blurred patches exhibits a sharper peak at zero and a less heavy tail than that of structured patches [40]. However, it cannot tell, for example, whether structured patches undergo blurring or not, or whether smooth patches are flat in nature or severely blurred. The same is true for local power spectral slope-based measures, as can be seen, for example, on the four patches in Fig. 1.
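For reference, the two low-level features analyzed in Fig. 2 can be computed roughly as follows. This is a generic sketch, not the exact implementation behind the figure: the histogram range, radial binning, and line-fitting choices below are our own assumptions.

import numpy as np

def gradient_histogram(patch, bins=41):
    """Histogram of horizontal/vertical gradients of a grayscale patch in [0, 1]."""
    gx = np.diff(patch, axis=1).ravel()
    gy = np.diff(patch, axis=0).ravel()
    hist, edges = np.histogram(np.concatenate([gx, gy]),
                               bins=bins, range=(-1.0, 1.0), density=True)
    return hist, edges

def power_spectral_slope(patch):
    """Slope of the radially averaged log power spectrum vs. log frequency.

    Blurred patches lose high-frequency energy, so their spectra fall off
    more steeply (more negative slope) than those of sharp structured patches.
    """
    f = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
    power = np.abs(f) ** 2
    h, w = patch.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h / 2.0, xx - w / 2.0).astype(int)
    counts = np.bincount(radius.ravel())
    radial = np.bincount(radius.ravel(), power.ravel()) / np.maximum(counts, 1)
    freqs = np.arange(1, min(h, w) // 2)  # skip the DC component
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs] + 1e-12), 1)
    return slope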

A closely related area is image sharpness measurement [41, 42], which aims to extract the sharp regions of an image. The results may be combined into an overall sharpness score (global assessment) or refined into a sharpness map (local assessment). Most existing sharpness measurement algorithms rely on assumptions similar to those of blur mapping, but there are subtle differences. For example, in sharpness assessment, flat and blurry regions can both be regarded as non-sharp, whereas in blur mapping, discriminating between them is a must for a successful algorithm.

III Deep Blur Mapper

At a high level, we feed an image of arbitrary size into an FCN, and the network outputs a blur map of the same size, with each entry ranging between 0 and 1 to indicate the probability of the corresponding pixel being blurred. The size mismatch is resolved by in-network upsampling. Through a standard stochastic gradient descent (SGD) training procedure, our network is able to learn a complex mapping from raw image pixels to blur perception.

III-A Training and Testing

Given a training image set $\{(X_k, Y_k)\}_{k=1}^{K}$, where $X_k$ is the $k$-th raw input image and $Y_k$ is the corresponding ground truth binary blur map, our goal is to learn an FCN that produces a blur map with high accuracy. It is convenient to drop the subscript $k$ without ambiguity due to the image-wise operation. We denote all layer parameters in the network as $\mathbf{W}$. The loss function is a sum over the per-pixel losses between the prediction $\hat{Y} = \{\hat{y}_j\}$ and the ground truth $Y = \{y_j\}$, where $j$ indicates the spatial coordinate. We consider the cross entropy loss

$$\ell(\mathbf{W}) = -\sum_{j} \Big[\, y_j \log P(\hat{y}_j = 1 \mid X; \mathbf{W}) + (1 - y_j) \log P(\hat{y}_j = 0 \mid X; \mathbf{W}) \,\Big], \qquad (1)$$

where $P(\hat{y}_j = 1 \mid X; \mathbf{W})$ is implemented by the sigmoid function on the $j$-th activation. Eq. (1) can easily be extended to account for class imbalance by weighting the loss according to the proportion of positive and negative labels. Although the labels in the blur detection database [11] are mildly unbalanced, we find using the class-balanced cross entropy loss unnecessary. In addition, many probability distribution measures can be adopted as alternatives to the cross entropy loss, such as the fidelity loss from quantum physics [43]. We find in our experiments that the fidelity loss gives very similar performance. Therefore, we choose the cross entropy loss throughout the paper.

After training, the optimal layer parameters $\mathbf{W}^{\star}$ are learned. Given a test image $X$, we perform a standard forward pass to obtain the predicted blur map

$$\hat{Y} = f(X; \mathbf{W}^{\star}). \qquad (2)$$
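As a concrete illustration of the training objective in Eq. (1), the following NumPy sketch evaluates the per-pixel sigmoid cross entropy over one image. The paper trains with Caffe; this snippet and its variable names are only an illustrative sketch of the loss, not the authors' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(activations, labels, eps=1e-12):
    """Sigmoid cross entropy of Eq. (1), summed over all spatial locations.

    activations: H x W array of raw per-pixel network outputs.
    labels:      H x W binary ground truth blur map (1 = blurred).
    """
    p = sigmoid(activations)  # P(y_j = 1 | X; W)
    loss = -(labels * np.log(p + eps) + (1.0 - labels) * np.log(1.0 - p + eps))
    return loss.sum()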
                           FCN configuration
I              II             III            IV             V
3 weight       5 weight       8 weight       11 weight      14 weight
layers         layers         layers         layers         layers
----------------------------------------------------------------------
                    input image of arbitrary size
conv3-64       conv3-64       conv3-64       conv3-64       conv3-64
conv3-64       conv3-64       conv3-64       conv3-64       conv3-64
conv1-1        max-pooling    max-pooling    max-pooling    max-pooling
               conv3-128      conv3-128      conv3-128      conv3-128
               conv3-128      conv3-128      conv3-128      conv3-128
               conv1-1        max-pooling    max-pooling    max-pooling
                              conv3-256      conv3-256      conv3-256
                              conv3-256      conv3-256      conv3-256
                              conv3-256      conv3-256      conv3-256
                              conv1-1        max-pooling    max-pooling
                                             conv3-512      conv3-512
                                             conv3-512      conv3-512
                                             conv3-512      conv3-512
                                             conv1-1        max-pooling
                                                            conv3-512
                                                            conv3-512
                                                            conv3-512
                                                            conv1-1
                      in-network upsampling
                            sigmoid
TABLE I: The FCN configurations. The depth increases from left to right. We follow the convention in [19] and denote a convolutional layer as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU nonlinearity is omitted here for brevity.

III-B Network Architectures and Alternatives

Inspired by recent works [44, 45] that successfully fine-tune deep neural networks pre-trained on image classification for edge detection, we analyze several linearly cascaded FCNs based on the 16-layer VGGNet architecture [19], which has been extremely popular and extensively studied in image classification. Specifically, we trim the VGGNet up to the last convolutional layer of each stage, i.e., conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3, respectively, resulting in five FCNs with different depths. For each network, we add a convolutional layer with a 1×1 receptive field, which performs a linear summation of the input channels. Finally, we add an in-network upsampling followed by a sigmoid nonlinearity to counteract the size mismatch between the generated and ground truth blur maps. For each configuration, we discard all fully connected layers to make it fully convolutional as in [18], because doing so significantly reduces the computational complexity with only a mild loss of representational power. As a result, we speed up computation and reduce memory consumption at both the training and test stages. Moreover, we choose to drop the max-pooling layer immediately after the last convolutional layer of each stage, so as to keep spatial information as fine as possible and make the later interpolation easier. The detailed configurations of the five FCNs are summarized in Table I, and Configuration V, which is used as the default architecture for performance comparison, is shown in Fig. 3.

Fig. 3: Configuration V, the default architecture for local blur mapping, obtained by trimming the VGG16 network. The height and width of the boxes represent the spatial sizes of the filter responses, which depend on the size of the input image. The depth of the boxes indicates the number of filters used in each layer. ReLU and max-pooling layers are omitted for simplicity. After the last convolutional layer, our mapper performs in-network upsampling to obtain the final blur map at the input resolution.

The five network configurations, characterized by different depths, favor different types of information. Configuration I retains the spatial information intact, which is ideal for dense prediction. However, due to its shallow structure, it can only extract low-level information and fails to learn powerful semantic information from the image. By contrast, Configuration V has a very deep structure consisting of stacks of 3×3 convolutional filters. It is therefore capable of transforming low-level features into high-level semantic information, but it sacrifices fine spatial information due to max-pooling, and in-network upsampling has to be performed to recover the spatial resolution. With the five configurations, we are able to empirically study the relative importance of spatial and semantic information in local blur mapping. As will be clear in Section IV-B1, semantic information plays the dominant role in local blur mapping, whereas spatial information is less relevant.
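To make the default architecture concrete, the PyTorch sketch below assembles a Configuration V-style network: the thirteen VGG16 convolutional layers with the final max-pooling dropped, a 1×1 convolution that linearly sums the 512 channels, fixed bilinear upsampling back to the input resolution, and a sigmoid. The paper's model is trained in Caffe and initialized from semantic-segmentation weights, so this re-implementation (and its ImageNet initialization) is an illustrative assumption rather than the authors' exact setup; layer indices follow torchvision's VGG16 definition.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ConfigV(nn.Module):
    """Sketch of Configuration V: 13 VGG16 conv layers + a 1x1 scoring conv."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True)
        # Keep conv1_1 ... conv5_3 (with their ReLUs) and drop the max-pooling
        # layer that would follow the last convolutional stage.
        self.features = nn.Sequential(*list(vgg.features.children())[:30])
        self.score = nn.Conv2d(512, 1, kernel_size=1)  # linear channel summation

    def forward(self, x):
        h, w = x.shape[2:]
        x = self.features(x)   # downsampled 16x by the four remaining poolings
        x = self.score(x)
        # In-network upsampling, fixed to bilinear interpolation.
        x = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
        return torch.sigmoid(x)  # per-pixel probability of being blurred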

We continue by discussing several more sophisticated architecture designs that better combine low-level and high-level features. We first briefly introduce FCNs with skip layers. The original FCNs adapt classification nets for dense semantic segmentation [18] by transforming fully connected layers into convolutional ones. To combat the coarse spatial information in deeper layers, which limits the scale of the details in the upsampled output, Long et al. [18] introduced skip layers that combine the responses of the final prediction layer with those of shallower layers that carry finer spatial information. It is straightforward to adapt this architecture to the blur mapping task, and we include FCN-8s, a top-performing architecture with reasonable complexity, in our experiment. Moreover, to make the learning process of hidden layers direct and transparent, Lee et al. [46] proposed deeply supervised nets (DSN), which add side output layers to the convolutional layers of each stage. In the case of the 16-layer VGGNet adopted for edge detection [45], five side outputs are produced right after the conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3 layers, respectively. All side outputs are fused into a final output with learnable fusion weights. The final output, together with all side outputs, contributes to the loss. We include two variants of DSN: training with weighted fusion only, and training with weighted fusion and deep supervision. As will be clear in Section IV-B3, incorporating low-level features through these sophisticated architectures often impairs performance compared with the default Configuration V.

IV Experiments

In this section, we first provide thorough implementation details on training and testing the proposed DBM. We then describe the experimental protocol and analyze the five FCN configurations, from which we choose Configuration V as our default architecture to compare with nine state-of-the-art methods. Finally, we analyze various aspects of DBM with emphasis on the role of high-level semantics. All models are trained and tested with Caffe [47].

IV-A Implementations

We first describe data preparation and preprocessing. To the best of our knowledge, the blur detection benchmark built by Shi et al. [11] is the only database publicly available for this task. It contains 1,000 images with human-labeled blur regions, among which 296 are partially motion-blurred and 704 are defocus-blurred. Since the number of training samples is limited, we only divide the database into training and test sets, without holding out a validation set for hyper-parameter tuning. It turns out that the only critical hyper-parameter is the learning rate: as long as it is set to a reasonably small value that keeps the gradients from blowing up, no noticeable differences in the final results are observed. The training set contains the 500 images with odd indices and the test set contains the 500 images with even indices. DBM allows input of arbitrary size; we try various input sizes and find that it is insensitive to input size variations. We take advantage of this observation and resize all images to a smaller fixed resolution in order to reduce GPU memory cost and speed up training and testing. Another preprocessing step is to subtract the mean RGB value computed on the ImageNet database [48]. We also try to augment the training samples with modest geometric (flipping, rotating, and scaling) and photometric (brightness, contrast, saturation, and gamma mapping) transformations that do not hurt their perceptual quality or high-level semantics, but this does not yield noticeable improvement. Therefore, the results reported in the paper are obtained without data augmentation.

We initialize the layer parameters with weights from a full 16-layer VGGNet pre-trained on the semantic segmentation task [18] and fine-tune them by SGD with momentum. Training is regularized by weight decay. The learning rate is initially set to a small value and follows a polynomial decay schedule; the learning rate for biases is doubled. The in-network upsampling layer is fixed to bilinear interpolation; although the interpolation weights are learnable, the additional performance gain is marginal. Learning stops when the maximum number of iterations is reached, and the final weights are used for testing.
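The polynomial decay mentioned above follows Caffe's "poly" learning-rate policy, lr = base_lr · (1 − iter/max_iter)^power. A minimal sketch is given below; the base_lr, power, and max_iter values are placeholders of ours, since the exact settings are not reproduced here.

def poly_lr(base_lr, iteration, max_iter, power):
    """Caffe-style polynomial learning-rate decay."""
    return base_lr * (1.0 - float(iteration) / max_iter) ** power

# Hypothetical values, for illustration only.
for it in range(0, 10001, 2500):
    print(it, poly_lr(base_lr=1e-4, iteration=it, max_iter=10000, power=0.9))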

Fig. 4: The precision-recall curves of the five configurations, trained on the odd-indexed subset and tested on the even-indexed subset of the blur detection benchmark [11]. Configuration V, which favors high-level semantics, performs the best at all recall levels compared with the shallower configurations and is therefore adopted as the default architecture.
Algorithm ODS OIS AP
Configuration I 0.769 0.790 0.704
Configuration II 0.788 0.816 0.772
Configuration III 0.815 0.850 0.825
Configuration IV 0.836 0.870 0.855
Configuration V 0.853 0.884 0.880
TABLE II: Results of the five configurations, trained on the odd-indexed subset and tested on the even-indexed subset of the blur detection benchmark [11]
Fig. 5: Comparisons of blur maps generated by the five configurations. (a) Test image from the blur detection benchmark [11]. (b) Configuration I. (c) Configuration II. (d) Configuration III. (e) Configuration IV. (f) Configuration V. (g) Ground truth.
Fig. 6: The precision-recall curves on the even-indexed test subset (DBM is trained on the odd-indexed subset). DBM boosts the precision over the entire recall range.

IV-B Experimental Results

IV-B1 Configuration Comparison

We first compare the five DBM configurations to investigate the roles of spatial and semantic information. Quantitative performance is evaluated using the precision-recall curve, which is drawn by concatenating all test images into one vector rather than averaging the curves over individual test images. We also summarize performance using three standard criteria [49]: (1) the optimal dataset scale (ODS) F-score, obtained by finding the optimal threshold for all images in the dataset; (2) the optimal image scale (OIS) F-score, obtained by averaging the best F-scores over all images; and (3) average precision (AP), obtained by averaging the precision over all recall levels. We draw the precision-recall curves in Fig. 4, from which we observe that performance increases with the depth of the configuration at all recall levels. The performance gain is also clear in terms of the ODS, OIS, and AP values in Table II. As a configuration goes deeper, it puts more emphasis on high-level features that contain rich semantic information, even though it sacrifices fine spatial information. It is therefore clear that the semantic information learned by deeper configurations plays a more important role than spatial information. This is not surprising because blur maps are expected to be relatively uniform and to consist of several clustered, connected regions; it stands in stark contrast to edge maps [45], where edges scatter across the whole image and spatial information at different scales is essential for accurate detection. To take a closer look, we show the blur maps generated by the five architectures in Fig. 5, from which we can see that shallower configurations can only make use of low-level (gradient-based) features and tend to mark smooth regions as blurry. Although they retain finer spatial information, the generated maps are less relevant to blur perception. By contrast, deeper configurations appear to make decisions based on extracted semantic information and are less affected by the appearance of the subjects in the scene; they generate blur maps in closer agreement with the ground truths. In summary, we adopt Configuration V, which makes the best use of high-level semantics and delivers the best performance, as our default architecture in the rest of the paper.
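For reference, the three criteria can be computed from the raw blur maps roughly as follows. This sketch follows the standard definitions from [49]; the threshold grid and the trapezoidal approximation of AP are our own choices and may differ in detail from the evaluation code actually used.

import numpy as np

def f_score(precision, recall, eps=1e-12):
    return 2 * precision * recall / (precision + recall + eps)

def pr_at_threshold(pred, gt, t):
    """Precision/recall of a thresholded blur map against binary ground truth."""
    binary = pred >= t
    tp = np.logical_and(binary, gt).sum()
    return tp / max(binary.sum(), 1), tp / max(gt.sum(), 1)

def evaluate(preds, gts, thresholds=np.linspace(0.01, 0.99, 99)):
    """ODS / OIS / AP for lists of predicted maps (in [0, 1]) and binary labels."""
    # Dataset-level P/R per threshold: images are concatenated, not averaged.
    all_pred = np.concatenate([p.ravel() for p in preds])
    all_gt = np.concatenate([g.ravel() for g in gts]).astype(bool)
    ds_pr = np.array([pr_at_threshold(all_pred, all_gt, t) for t in thresholds])
    ods = f_score(ds_pr[:, 0], ds_pr[:, 1]).max()
    # Best F-score per image, averaged over images.
    ois = np.mean([max(f_score(*pr_at_threshold(p, g.astype(bool), t))
                       for t in thresholds) for p, g in zip(preds, gts)])
    # Average precision: area under the dataset-level P-R curve.
    order = np.argsort(ds_pr[:, 1])
    ap = np.trapz(ds_pr[order, 0], ds_pr[order, 1])
    return ods, ois, ap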

IV-B2 Main Results

We next compare DBM with nine existing methods: Liu08 [14], Chakrabarti10 [33], Zhuo11 [30], Su11 [32], Shi14 [11], Chen16 [35], Tang16 [36], Yi16 [37], and HiFST [39], all of which are based on hand-crafted low-level features. The blur maps of each method are either obtained from the original authors or generated by the publicly available implementation with default settings. The precision-recall curves are shown in Fig. 6. DBM achieves the highest precision at every recall level, often by a clear margin. It is interesting to note that previous methods experience precision drops at low recall levels. This is no surprise because traditional methods tend to assign flat regions high blur confidence and misclassify them as blurry even at relatively large thresholds. By contrast, DBM automatically learns rich discriminative features, especially high-level semantics, which accurately discriminate flat regions from blurred ones, resulting in nearly perfect precision at low recall levels. Moreover, DBM exhibits a less steep decline in the middle recall range. This may result from the accurate classification of structures with and without blurring. Table III lists the ODS, OIS, and AP results, from which we observe that DBM significantly advances the state-of-the-art with an ODS F-score of 0.853.

Algorithm ODS OIS AP
Liu08 [14] 0.766 0.811 0.745
Chakrabarti10 [33] 0.758 0.797 0.757
Zhuo11 [30] 0.761 0.862 0.687
Su11 [32] 0.782 0.822 0.721
Shi14 [11] 0.776 0.813 0.843
Chen16 [35] 0.771 0.867 0.823
Tang16 [36] 0.765 0.864 0.774
Yi16 [37] 0.803 0.841 0.765
HiFST [39] 0.813 0.851 0.717
DBM 0.853 0.884 0.880
TABLE III: Results trained on the odd-indexed subset and tested on the even-indexed subset of the blur detection benchmark [11]
Fig. 7: Representative blur mapping results on the blur detection benchmark [11]. (a) Test images. (b) Su11 [32]. (c) Shi14 [11]. (d) DBM. (e) Ground truths. The proposed DBM shows a clear advantage in terms of accuracy over Su11 [32] and Shi14 [11], and is more consistent with the ground truths.

To better investigate the effectiveness of DBM at detecting local blur, we show some blur maps generated by DBM in Fig. 7 and compare them with those of two representative methods, Su11 [32] and Shi14 [11]. DBM is able to robustly detect local blur in complex foregrounds and backgrounds. First, it handles blurred regions across different scales, from the small figure of the motorcyclist to the girl who occupies most of the frame. Second, it is capable of identifying flat regions, such as the car body in the first row, the clothes, and the road sign, as non-blurry. Third, it is barely affected by strong structures that remain after blurring and labels those regions correctly. All of this stands in stark contrast to previous methods, which confuse flat with blurry regions and are severely biased by strong structures that survive blurring. Moreover, DBM labels images with high confidence: the vast majority of pixels in the test images receive predicted values close to either 1 (blurry) or 0 (non-blurry).

Fig. 8: Comparisons of blur maps generated by the variants of DBM. (a) Test image from the blur detection benchmark [11]. (b) Training from scratch. (c) Training with FCNs and skip layers [18] (FCN-8s). (d) Training with weighted fusion only [45]. (e) Training with weighted fusion and deep supervision [45] (DSN). (f) DBM. (g) Ground truth.

IV-B3 Importance of High-Level Semantics

Besides analyzing the importance of high-level semantics via architectures of different depths, we conduct another series of experiments to show that the learned high-level features indeed play a crucial role in our blur mapper. We first train DBM from scratch without semantically meaningful initializations, with the hyper-parameters manually tuned to give the best performance. The results, shown in the first row of Table IV, are unsatisfactory; this is expected since poor initializations can stall learning due to the instability of gradients in deep nets. By contrast, an initialization that is more informative for the blur mapping task (in this case, one transferred from semantic segmentation) is likely to guide SGD toward better local minima, resulting in a more meaningful blur map (Fig. 8 (f)).

We then investigate more advanced network architectures that make better use of low-level features from shallower layers, including FCNs with skip layers (FCN-8s) [18], weighted fusion of side outputs [45], and weighted fusion of side outputs with deep supervision (DSN) [45]. The results are shown in Table IV and Fig. 8. We observe that although incorporating low-level features produces blur maps with somewhat finer spatial detail (similar to what we observed in Fig. 5), it dilutes the benefits of high-level features and results in erroneous and non-uniform blur assignments. This is expected because low-level features mainly carry edge information of the input image and do not help blur detection much. FCN-8s [18], which treats low-level and high-level features with equal importance, impairs performance the most. The weighted fusion scheme without deep supervision learns to assign importance weights to the side outputs; the side outputs generated by deeper convolutional layers receive substantially larger fusion weights than those from shallower layers, and we observe a slight performance improvement over FCN-8s. The weighted fusion scheme with deep supervision directly regularizes the low-level features using the ground truth and delivers slightly better ODS and OIS results than DBM. In summary, DBM, which interpolates solely from high-level feature maps, achieves performance comparable to its most sophisticated variant, DSN, and ranks best in terms of AP. Its blur maps are also perceptually more reasonable and closer to the ground truths. These results manifest the central role of high-level semantics in local blur mapping.

Model ODS OIS AP
Training from scratch 0.833 0.876 0.856
FCN-8s 0.840 0.874 0.847
Fusion (w/o deep supervision) 0.844 0.877 0.865
Fusion (with deep supervision) 0.854 0.889 0.876
DBM 0.853 0.884 0.880
TABLE IV: Comparing DBM with its variants to identify the role of high-level semantics
Algorithm ODS OIS AP
Liu08 [14] 0.753 0.803 0.749
Chakrabarti10 [33] 0.741 0.788 0.741
Zhuo11 [30] 0.746 0.853 0.676
Su11 [32] 0.775 0.814 0.712
Shi14 [11] 0.765 0.804 0.831
Chen16 [35] 0.752 0.859 0.823
Tang16 [36] 0.746 0.846 0.755
Yi16 [37] 0.786 0.829 0.741
HiFST [39] 0.804 0.841 0.706
DBM 0.852 0.885 0.876
TABLE V: Results trained on the even-indexed subset and tested on the odd-indexed subset of the blur detection benchmark [11]
Algorithm            Time (s)
Chakrabarti10 [33]   0.7 ± 0.3
Zhuo11 [30]          11.0 ± 0.2
Su11 [32]            5.1 ± 0.1
Chen16 [35]          1.0 ± 0.1
Tang16 [36]          1.4 ± 0.2
Yi16 [37]            20.1 ± 2.8
HiFST [39]           99.7 ± 1.4
DBM (CPU)            1.8 ± 0.2
DBM (GPU)            0.027 ± 0.004
TABLE VI: Average execution time (mean ± std) in seconds on images from the blur detection benchmark [11]
Fig. 9: The precision-recall curves trained on the even-indexed subset and tested on the odd-indexed subset. DBM achieves similarly superior performance when the roles of the two subsets are swapped, indicating that it does not depend on a specific training set.
Fig. 10: The blur region segmentation results. (a) Shi14 [11]. (b) DBM.

IV-B4 Independence of Training Data

It is important to verify the generalizability of DBM by showing that it does not depend on a specific training set. We therefore swap the training and test sets in our setting; in other words, we train DBM on the even-indexed subset and test it on the odd-indexed subset. We observe in Fig. 9 and Table V that similarly superior performance is achieved in terms of the precision-recall curve, ODS, OIS, and AP. This verifies that DBM does not rely on any particular training set, as long as the set is diverse enough to cover various natural scenes and causes of blur.

IV-B5 More Training Data

Deep learning algorithms have dominated many computer vision tasks, at least in part due to the availability of large amounts of labeled training data. In local blur mapping, however, we are limited by the number of training images available in the existing benchmark. Here we explore whether more training data with novel content further benefit DBM. To do this, we randomly sample images from the test subset, incorporate them into the training set, and test DBM on the remaining test images; the result averaged over several such trials is reported. We observe that adding more training images further improves performance. This indicates that we may further boost the performance and enhance the robustness of DBM by training it on a larger dataset.

IV-B6 Running Time

We compare the execution time of DBM with existing methods on images from the blur detection benchmark, using the same desktop machine for all CPU timings. From Table VI, we see that DBM keeps a good balance between prediction performance and computational complexity. When the GPU mode is activated (we adopt an NVIDIA GTX Titan X GPU), DBM runs significantly faster than existing methods, enabling real-time applications.

In summary, we have empirically shown that a linearly cascaded FCN-based DBM that exploits high-level semantics delivers superior performance in local blur mapping, whereas low-level features that better encode gradient and spatial information are less relevant to this task.

IV-C Applications

In this subsection, we explore three potential applications that benefit from the blur maps generated by DBM: 1) blur region segmentation, 2) blur degree estimation, and 3) blur magnification.

IV-C1 Blur Region Segmentation

The goal of image segmentation is to partition an image into multiple regions that are perceptually more meaningful and easier to analyze [50]. It is difficult for fully automatic segmentation algorithms to work well on all images, so many interactive segmentation tools have been proposed, which require users to manually create a mask that roughly indicates which parts belong to the foreground and background. The blur map produced by DBM provides a useful mask to initialize segmentation without human intervention. Here we adopt GrabCut [51], a popular interactive image segmentation method based on graph cuts, and map pixels from the highest to the lowest blur confidence to foreground, probable foreground, probable background, and background, respectively. The implementation we use is based on OpenCV version 3.2 with default settings. We compare our results with Shi14 [11] in Fig. 10 and observe that DBM does a better job of segmenting images into blurred and clear regions. By contrast, Shi14 [11] mislabels flat regions in the foreground and blurred structures in the background, and segments images into disconnected parts.
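A minimal OpenCV sketch of this initialization is given below. The confidence cut-points (0.9, 0.5, 0.1) are our own illustrative choices, not the thresholds used in the paper.

import cv2
import numpy as np

def segment_blur_region(image_bgr, blur_map, iters=5):
    """Initialize GrabCut from a blur map instead of user scribbles.

    image_bgr: H x W x 3 uint8 image.
    blur_map:  H x W array in [0, 1] produced by the blur mapper.
    Returns a boolean mask of the segmented blurred region.
    """
    mask = np.full(blur_map.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    mask[(blur_map >= 0.5) & (blur_map < 0.9)] = cv2.GC_PR_FGD
    mask[blur_map >= 0.9] = cv2.GC_FGD   # confidently blurred
    mask[blur_map < 0.1] = cv2.GC_BGD    # confidently clear
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))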

Fig. 11: Overall blur degree estimation based on our blur maps. The dog pictures in the first row are ranked from left to right in order of increasing estimated blur degree.

IV-C2 Blur Degree Estimation

Our blur map can also serve as an estimate of the overall blur degree of an image. Since each entry in the blur map indicates the blur degree of the corresponding pixel, we implement a straightforward blur degree measure as the average value of the blur map. More sophisticated pooling strategies that take visual attention into account could also be incorporated to boost performance. Fig. 11 shows a set of dog pictures ranked from left to right by increasing estimated blur degree, from which we can see that DBM robustly extracts blurred regions with high confidence and that the ranking is in close agreement with human perception of blur.
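The measure itself is just a spatial average; a two-line sketch is given for completeness (attention-weighted pooling, mentioned above, would replace the uniform mean with a weighted one).

import numpy as np

def blur_degree(blur_map):
    """Overall blur degree: the spatial average of the per-pixel blur map."""
    return float(np.mean(blur_map))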

IV-C3 Blur Magnification

A shallow depth of field is often preferred in creative photography, such as portraiture, but the small lenses of current cameras embedded in mobile devices limit the degree of defocus blur they can produce. With the blurred regions extracted, it is easy to drive a computational photography approach that increases defocus for blur magnification [7]. Here we implement a naive blur magnifier by convolving pixels whose blur confidence exceeds a threshold with a uniform Gaussian kernel. We compare DBM with Shi14 [11] in Fig. 12. It is clear that DBM is barely affected by structures that survive blurring and delivers a perceptually more consistent result, with smooth transitions from clear to blurred regions.
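A sketch of such a naive magnifier, assuming a confidence threshold of 0.5 and an arbitrary Gaussian width (both values are placeholders of ours):

import cv2
import numpy as np

def magnify_blur(image_bgr, blur_map, threshold=0.5, sigma=5.0):
    """Re-blur the pixels the mapper marks as blurred; leave clear pixels intact."""
    blurred = cv2.GaussianBlur(image_bgr, (0, 0), sigma)
    mask = (blur_map >= threshold).astype(np.float32)[..., None]
    out = mask * blurred + (1.0 - mask) * image_bgr.astype(np.float32)
    return out.astype(np.uint8)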

V Conclusion and Discussion

In this paper, we have explored local blur mapping of natural images, emphasizing the importance of high-level semantic information. We opt for CNNs as the tool to exploit high-level features and develop the first end-to-end, image-to-image blur mapper based on an FCN. The proposed DBM significantly outperforms previous methods and successfully resolves challenging ambiguities such as differentiating flat regions from blurred ones. In the future, it remains to be seen how low-level features and high-level semantics interact with each other and how they can be combined to predict visual blur perception.

Fig. 12: The blur magnification results. (a) Test image from the blur detection benchmark [11]. (b) Ground truth blur map. (c) Magnification by Shi14 [11]. (d) Blur map by Shi14 [11]. (e) Magnification by DBM. (f) Blur map by DBM.

DBM occasionally fails. For example, if the motion-blurred subject happens to be surrounded by a large flat background, as shown in Fig. 13, it is difficult to extract accurate and useful semantic information for local blur mapping. A potential solution is to retrain DBM on a larger database with more scene structure variations. Another limitation of DBM is that it generates blur maps with coarse boundaries, which may be improved by simultaneously learning a reconstruction network for boundary refinement [52]. These issues will be investigated in our future research.

Fig. 13: Failure case of DBM. (a) Test image from the blur detection benchmark [11]. (b) Blur map produced by DBM. (c) Ground truth.

Acknowledgements

The authors would like to thank Dr. Wangmeng Zuo and Dr. Dongwei Ren for deeply insightful comments on blur perception, Kai Zhang and Faqiang Wang for sharing their expertise on CNNs, and Zhengfang Duanmu for helpful advice on debugging. We thank the NVIDIA Corporation for donating a GPU for this research.

References

  • [1] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [2] K. Ma, Q. Wu, Z. Wang, Z. Duanmu, H. Yong, H. Li, and L. Zhang, “Group MAD competition - A new methodology to compare objective image quality models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1664–1673.
  • [3] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2016.
  • [4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [5] S. Dai and Y. Wu, “Removing partial blur in a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2544–2551.
  • [6] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, “Accurate blur models vs. image priors in single image super-resolution,” in IEEE International Conference on Computer Vision, 2013, pp. 2832–2839.
  • [7] S. Bae and F. Durand, “Defocus magnification,” Computer Graphics Forum, vol. 26, no. 3, pp. 571–579, 2007.
  • [8] G. Mather, “The use of image blur as a depth cue,” Perception, vol. 26, no. 9, pp. 1147–1158, 1997.
  • [9] J. Shi, L. Xu, and J. Jia, “Just noticeable defocus blur detection and estimation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 657–665.
  • [10] P. Favaro and S. Soatto, “A variational approach to scene reconstruction and image segmentation from motion-blur cues,” in IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 631–637.
  • [11] J. Shi, L. Xu, and J. Jia, “Discriminative blur detection features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2965–2972.
  • [12] M. A. Webster, M. A. Georgeson, and S. M. Webster, “Neural adjustments to image blur,” Nature Neuroscience, vol. 5, no. 9, pp. 839–840, 2002.
  • [13] B. A. Wandell, Foundations of Vision.   Sinauer Associates, 1995.
  • [14] R. Liu, Z. Li, and J. Jia, “Image partial blur detection and classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [15] A. Levin, “Blind motion deblurring using image statistics,” in Advances in Neural Information Processing Systems, 2006, pp. 841–848.
  • [16] Z. Wang and E. P. Simoncelli, “Local phase coherence and the perception of blur,” in Advances in Neural Information Processing Systems, 2004, pp. 1435–1442.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [18] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representation, 2015.
  • [20] D. Slepian, “Restoration of photographs blurred by image motion,” Bell System Technical Journal, vol. 46, no. 10, pp. 2353–2362, 1967.
  • [21] M. Cannon, “Blind deconvolution of spatially invariant image blurs with phase,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 1, pp. 58–63, 1976.
  • [22] L. Xu and J. Jia, “Two-phase kernel estimation for robust motion deblurring,” in European Conference on Computer Vision, 2010, pp. 157–170.
  • [23] J. Miskin and D. J. MacKay, “Ensemble learning for blind image separation and deconvolution,” in Advances in Independent Component Analysis, 2000, pp. 123–141.
  • [24] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” ACM Transactions on Graphics, vol. 26, no. 3, pp. 701–710, 2007.
  • [25] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, “A no-reference perceptual blur metric,” in IEEE International Conference on Image Processing, 2002, pp. 57–60.
  • [26] J. H. Elder and S. W. Zucker, “Local scale control for edge detection and blur estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 7, pp. 699–716, 1998.
  • [27] W. Zhang and F. Bergholm, “Multi-scale blur estimation and edge type classification for scene analysis,” International Journal of Computer Vision, vol. 24, no. 3, pp. 219–250, 1997.
  • [28] J. Da Rugna and H. Konik, “Automatic blur detection for meta-data extraction in content-based retrieval context,” in SPIE Internet Imaging, 2003, pp. 285–294.
  • [29] L. Kovacs and T. Sziranyi, “Focus area extraction by blind deconvolution for defining regions of interest,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1080–1085, 2007.
  • [30] S. Zhuo and T. Sim, “Defocus map estimation from a single image,” Pattern Recognition, vol. 44, no. 9, pp. 1852–1858, 2011.
  • [31] T. A. Javaran, H. Hassanpour, and V. Abolghasemi, “Automatic estimation and segmentation of partial blur in natural images,” The Visual Computer, vol. 33, no. 2, pp. 151–161, 2017.
  • [32] B. Su, S. Lu, and C. L. Tan, “Blurred image region detection and classification,” in ACM International Conference on Multimedia, 2011, pp. 1397–1400.
  • [33] A. Chakrabarti, T. Zickler, and W. T. Freeman, “Analyzing spatially-varying blur,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2512–2519.
  • [34] Y. Pang, H. Zhu, X. Li, and X. Li, “Classifying discriminative features for blur detection,” IEEE Transactions on Cybernetics, vol. 46, no. 10, pp. 2220–2227, 2016.
  • [35] D.-J. Chen, H.-T. Chen, and L.-W. Chang, “Fast defocus map estimation,” in IEEE International Conference on Image Processing, 2016, pp. 3962–3966.
  • [36] C. Tang, J. Wu, Y. Hou, P. Wang, and W. Li, “A spectral and spatial approach of coarse-to-fine blurred image region detection,” IEEE Signal Processing Letters, vol. 23, no. 11, pp. 1652–1656, 2016.
  • [37] X. Yi and M. Eramian, “LBP-based segmentation of defocus blur,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1626–1638, 2016.
  • [38] T. Zhu and L. J. Karam, “Efficient perceptual-based spatially varying out-of-focus blur detection,” in IEEE International Conference on Image Processing, 2016, pp. 2673–2677.
  • [39] S. Alireza Golestaneh and L. J. Karam, “Spatially-varying blur detection based on multiscale fused and sorted transform coefficients of gradient magnitudes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5800–5809.
  • [40] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
  • [41] R. Ferzli and L. J. Karam, “A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB),” IEEE Transactions on Image Processing, vol. 18, no. 4, pp. 717–728, 2009.
  • [42] R. Hassen, Z. Wang, and M. M. Salama, “Image sharpness assessment based on local phase coherence,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2798–2810, 2013.
  • [43] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information.   Cambridge University Press, 2010.
  • [44] G. Bertasius, J. Shi, and L. Torresani, “DeepEdge: A multi-scale bifurcated deep network for top-down contour detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4380–4389.
  • [45] S. Xie and Z. Tu, “Holistically-nested edge detection,” in IEEE International Conference on Computer Vision, 2015, pp. 1395–1403.
  • [46] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets.” in International Conference on Artificial Intelligence and Statistics, 2015, pp. 562–570.
  • [47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014, pp. 675–678.
  • [48] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [49] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011.
  • [50] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [51] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 309–314, 2004.
  • [52] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction and refinement for semantic segmentation,” in European Conference on Computer Vision, 2016, pp. 519–534.