Image blurring comes in two main flavors: defocus blur, which is caused by defocusing, and motion blur, which is caused by camera or object motion. Blur detection aims to detect the blurred regions in an image, regardless of the cause, and to separate them from the sharp regions. This task plays a significant part in many potential applications, such as salient object detection [3, 31], defocus magnification [2, 1], image quality assessment [32, 35], image deblurring [27, 20], image refocusing [42, 43], and blur reconstruction [36, 22].
In the past decades, a series of blur detection methods based on hand-crafted features have been proposed. These methods exploit various hand-crafted features related to the image gradient [39, 11, 15, 38] and frequency [27, 28, 33, 2]. They tend to measure the amount of feature information contained in different image regions to detect blurriness, since blurred regions usually contain fewer details than sharp ones. However, these hand-crafted features are usually not good at differentiating sharp regions from a complex background, and they cannot understand semantics well enough to extract sharp regions from a similar background.
2 Related Work
We review previous work on blur detection and on semantic segmentation. Previous work on blur detection shows a trend towards deep learning networks. Since blur detection can be treated as an image segmentation problem, semantic segmentation shows how to obtain a more accurate segmentation result for blur detection.
2.1 Previous Achievement
Previous methods of blur detection can be divided into two categories: methods based on traditional hand-crafted features and methods based on deep neural networks.
In the first category, various hand-crafted features exploit gradient and frequency information that can describe image regions. For example, Su et al. [30] combine the gradient distribution pattern of the alpha channel with a blur metric based on singular value distributions to detect the blurred region. Shi et al. [27] make use of a series of gradient, Fourier domain, and data-driven local filter features to enhance the discriminative power for differentiating blurred and unblurred image regions. Considering that feature extractors based on local information cannot reliably distinguish just noticeable blur from unblurred structures, Shi et al. [28] improve the feature extractors via sparse representation and image decomposition. Yi et al. [39] design a sharpness metric based on local binary patterns and use it to separate the in-focus and defocused image regions. Tang et al. [33] design a log averaged spectrum residual metric to obtain a coarse blur map, and propose an iterative updating mechanism that refines the blur map from coarse to fine based on the intrinsic relevance of similar neighbouring image regions. Using the discrete cosine transform, Golestaneh et al. [11] compute blur detection maps based on a high-frequency multi-scale fusion and sort transform of gradient magnitudes.
Due to their outstanding performance in high-level feature extraction and parameter learning, deep convolutional neural networks (DCNNs) have reached a new state of the art in blur detection. Park et al. [24] combine deep patch-level and hand-crafted features to estimate the degree of defocus. Huang et al. [13] design a patch-level CNN to learn blur features, apply this network at three coarse-to-fine scales, and optimally fuse the multi-scale blur likelihood maps to generate better blur detection results. Patch-level DCNN methods are time-consuming, because the network must be run thousands of times to process a raw image. Zhao et al. [44] propose a multi-stream bottom-top-bottom fully convolutional network, which integrates low-level cues and high-level semantic information for defocus blur detection and leverages a multi-stream strategy to handle the sensitivity of the defocus degree to image scale. Ma et al. [18] exploit high-level information to separate the blurred regions through an end-to-end fully convolutional network. To increase the efficiency of the network, Tang et al. [34] propose a blur detection network that recurrently fuses and refines multi-scale features. Zhao et al. [45] design the Cross-Ensemble Network with two groups of defocus blur detectors, which are alternately optimized with cross-negative and self-negative correlation losses to enhance the diversity of features.
With the application of DCNNs in computer vision, more solutions have been proposed for blur detection. Making the network deeper or wider to capture more useful features has proven to work, but this brute-force approach incurs unnecessary computational cost. In our work, we attempt to design a more carefully structured neural network that solves blur detection more efficiently.
2.2 Image Segmentation
Fully convolutional networks (FCNs) [17], which are trained end-to-end and pixel-to-pixel on semantic segmentation, exceed the previous best results without further machinery. Various improved versions of FCNs have been applied to region proposals [25], contour detection [29], depth regression [9], optical flow [19], and weakly-supervised semantic segmentation [37], which further advance the state of the art in image processing. Some classical architectures perform well in image segmentation; representative examples are the DeepLab models [5, 6, 7] and U-Net [26].
In the DeepLab models [5, 6, 7], dilated convolution and dense conditional random field (CRF) inference are used to improve the output resolution. ParseNet [16] normalizes features for fusion and captures context with global pooling. The “deconvolutional network” proposed in [40] restores resolution with stacks of learned deconvolution and unpooling layers. In [10], stacked deconvolutional networks are applied to semantic segmentation and achieve an outstanding result without a CRF. Almost all kinds of effective tricks in convolutional networks have been applied to FCNs to improve their results.
The U-shaped network was first proposed in [26] to address biomedical image segmentation, for which only a few training samples are available. To make the best use of the limited samples, U-Net [26] combines skip layers and learned deconvolution to fuse the features from different levels of one image for a more precise result. Because of its outstanding performance on biomedical datasets, which have simple semantic information and a few fixed features, many further studies build on it, such as VNet [21], a U-shaped network that uses three-dimensional convolutions; UNet++ [46], a U-shaped network with denser skip connections; Attention U-Net [23], which combines U-shaped networks with an attention mechanism; ResUNet-a [8], which implements the U-shaped network with residual convolution blocks; TernausNet [14], which uses a pre-trained encoder to improve the U-shaped network; MDU-Net [41], which densely connects the multiple scales of the U-shaped network; and LinkNet [4], which modifies the original U-shaped network for efficiency.
The achievements of U-shaped networks provide many valuable references for solving blur detection. Considering that blur detection also has a few fixed features, we design our network based on U-Net.
3 Proposed MSD-Unet
Our model consists of two parts: a group of extractors and a U-shaped network. First, we use the group of extractors to capture multi-scale texture information from the images. Then, we feed the extracted feature maps into each contracting step of the U-shaped network to integrate them. Finally, we use a softmax layer to map the feature matrix to the segmentation result. The whole model is shown in detail in Figure 3.
3.1 Basic Components
To improve the efficiency of the extractors, we replace standard convolution with dilated convolution, which enlarges the receptive field without increasing the number of parameters. We choose the U-shaped network to fuse the texture information at different scales because its skip connections concatenate the shallow feature matrices with the deep feature matrices, which helps us make the best use of both texture information and semantic information.
Dilated convolution. Dilated convolution, also called atrous convolution, was originally developed in algorithms for wavelet decomposition [12]. The main idea of dilated convolution is to insert holes between the elements of the convolutional kernel to increase its receptive field. Dilated convolution can thus improve the ability of a convolution kernel to extract more features with a fixed number of parameters. As Figure 4 shows, a dilated convolution kernel has a receptive field as large as that of a bigger standard convolution kernel, while keeping the same number of parameters as the original kernel. Dilated convolutional kernels can have different dilation rates. If we set the center of the convolution kernel as the origin of the coordinates, then for a 2-D convolution kernel of size $S_o \times S_o$, the result of dilation is as follows:
\[ S_d = S_o + (S_o - 1)(r - 1), \]
where $S_d$ is the size of the dilated convolution kernel, $S_o$ is the size of the original convolution kernel, and $r$ is the dilation factor, and
\[ K_d(x, y) = \begin{cases} K_o(x / r,\; y / r), & \text{if } x \bmod r = 0 \text{ and } y \bmod r = 0, \\ 0, & \text{otherwise}, \end{cases} \]
where $K_d(x, y)$ is a single parameter in the dilated convolution kernel and $K_o(x, y)$ is a single parameter in the original convolution kernel. Figure 4 shows a convolutional kernel turned into a dilated convolutional kernel. With deep learning methods, we can use a deeper network to capture more abstract features. However, whether a region is blurred depends on direct, low-level features. Thus, we need to increase the receptive field by expanding the size of the convolution kernel without making the network deeper. In our method, we exploit dilated convolutions to design a group of extractors that extract texture information without requiring additional parameters.
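The dilation of a kernel described above can be illustrated with a minimal sketch in plain Python. It assumes the common convention in which a dilation rate of 1 leaves the kernel unchanged and a rate of r places r - 1 zeros between neighbouring weights; the function name `dilate_kernel` and the 3x3 example values are hypothetical and are not taken from the model.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring weights of a square kernel."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    size = len(kernel)
    # Size of the dilated kernel: original size plus the inserted holes.
    dilated_size = size + (size - 1) * (rate - 1)
    dilated = [[0.0] * dilated_size for _ in range(dilated_size)]
    for i in range(size):
        for j in range(size):
            # Original weights land on every `rate`-th position; holes stay zero.
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated


# Hypothetical 3x3 kernel, dilated with rate 2.
kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]
dilated = dilate_kernel(kernel, rate=2)
nonzero = sum(1 for row in dilated for w in row if w != 0.0)
print(len(dilated), nonzero)  # 5 9
```

The dilated kernel spans a 5x5 window but still holds only the nine original weights, which is why the receptive field grows while the parameter count stays fixed.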
Skip connections. Skip connections combine direct shallow features with abstract deep features, which lets the network attend to both shallow texture information and deep semantic information and yields a more precise result. The more convolution layers are stacked, the more high-level abstract information can be extracted. Traditional encoder-decoder architectures can extract high-level semantic information and perform well in panoramic segmentation, which contains abundant high-level information. However, if we have to segment images whose data contain little high-level information, as in cell segmentation, MRI image segmentation, or satellite image segmentation, we should efficiently exploit the low-level information. The skip connections retain the low-level features from the shallow layers and combine them with the high-level features from the deep layers, which makes the best use of both. For our task, the low-level information of gradient and frequency can describe the absolute degree of blur, and the high-level information of global semantics can help to judge whether a region is blurred. As a result, the skip connections make our model robust to various backgrounds.
3.2 Model Details
Our model is based on U-Net [26], so the two have a similar structure. The U-shaped architecture can be seen as having two paths: the contracting path and the expansive path. However, to combine it with the multi-scale texture extractors, we modify the contracting path of U-Net so that it receives texture feature matrices of a different scale at every stage. In this section, we describe the details of our model in terms of the extractors and the U-shaped network.
Extractor. We design the extractors to capture multi-scale texture features. First, the source image is fed into dilated convolution layers that use various dilation rates and a single kernel size. Second, we connect the output of the dilated convolution layers to strided regular convolution layers with the ReLU activation function, which shrink the feature maps to fit the feature size of each contracting step in the U-shaped architecture. After that, we add further regular convolution layers at the end to smooth the features of different scales. We keep all the extractors independent because independent extractors, each connected to the contracting path at a different level of the U-shaped architecture, make the model more robust to different scales.
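To illustrate how strided convolutions can shrink an extractor output to the size expected at a given contracting level, the sketch below applies the standard convolution output-size formula used by common deep learning frameworks. All concrete values are assumptions chosen for illustration: a 256-pixel input, 3x3 kernels, stride 2, dilation rates 1 to 4, and the pairing of one dilation rate with each level are hypothetical and are not the settings of the model.

```python
def conv_out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def extractor_out_size(n, dilation, num_strided, kernel=3):
    """Size after one size-preserving dilated conv and `num_strided` stride-2 convs."""
    # Padding of dilation * (kernel - 1) // 2 keeps the size for odd kernels.
    n = conv_out_size(n, kernel, padding=dilation * (kernel - 1) // 2,
                      dilation=dilation)
    for _ in range(num_strided):
        # Each stride-2 convolution (kernel 3, padding 1) halves the size.
        n = conv_out_size(n, kernel, stride=2, padding=1)
    return n


# Hypothetical setting: one extractor per contracting level, levels 0..3.
sizes = [extractor_out_size(256, dilation=d, num_strided=level)
         for level, d in enumerate([1, 2, 3, 4])]
print(sizes)  # [256, 128, 64, 32]
```

Under these assumptions, an extractor feeding level k needs k stride-2 convolutions so that its output matches the feature map that has been pooled k times in the contracting path.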
U-shaped architecture. The contracting path receives the outputs of the texture extractors and integrates them through concatenation, convolution, and pooling, which makes the feature matrices shrink in the length and width dimensions and grow in the channel dimension. The expansive path uses transposed convolutions to restore the resolution of the feature matrices and concatenates them, through skip connections, with the feature matrices of the same size from the contracting path. The skip layers of the U-shaped architecture concatenate the feature channels of the two paths in the upsampling part, which allows the network to propagate semantic information to higher-resolution layers that contain local texture information.
As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a U-shaped architecture. The contracting path follows the typical architecture of a U-Net [26], which consists of the repeated application of two convolutions, each followed by a rectified linear unit (ReLU), and a max pooling operation for downsampling. The input feature maps of every step in the contracting path combine the output of the previous step with the output of the corresponding extractor. The expansive path is the same as in U-Net: each step consists of a transposed convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.
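The data flow through the two paths can be traced with a small bookkeeping sketch that follows only channel counts and spatial sizes. It is an illustration under stated assumptions, not the configuration of the model: the input size of 256, the 64 base channels, the depth of 4, the 32 extractor channels, the doubling of channels per stage, and the choice that the first stage receives only its extractor output are all hypothetical.

```python
def trace_u_shape(size, base_channels, depth, extractor_channels):
    """Trace channel counts and spatial sizes through a U-shaped network."""
    down = []  # (input_channels, output_channels, size) per contracting stage
    prev_out = 0  # assumption: the first stage has no previous stage output
    for level in range(depth):
        out = base_channels * 2 ** level
        # Stage input: previous stage output concatenated with the extractor output.
        down.append((prev_out + extractor_channels, out, size))
        prev_out = out
        if level < depth - 1:
            size //= 2  # max pooling halves width and height
    up = []  # (channels_after_concat, output_channels, size) per expansive stage
    channels = prev_out
    for level in reversed(range(depth - 1)):
        channels //= 2  # transposed convolution halves the channel count
        size *= 2       # and doubles the resolution
        skip = down[level][1]  # skip connection from the same-size stage
        up.append((channels + skip, channels, size))
    return down, up


down, up = trace_u_shape(size=256, base_channels=64, depth=4,
                         extractor_channels=32)
for stage in down + up:
    print(stage)
```

With these hypothetical values, the last expansive stage returns to the input resolution, and every concatenation pairs feature maps of equal spatial size, which is the property the skip connections rely on.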
4.1 Dataset and Implementation
We run our experiments on two publicly available benchmark datasets for blur detection. CUHK [27] is a classical blur detection dataset, in which 296 images are partially motion-blurred and 704 images are defocus-blurred. DUT [44] is a newer defocus blur detection dataset, which consists of 500 test images and 600 training images. We split the CUHK blur dataset into a training set of 800 images and a test set of 200 images with the same ratio of motion-blurred to defocus-blurred images. Since the number of training samples is limited, we enlarge the training set by horizontal flipping at each orientation. Because some state-of-the-art methods were designed solely for defocus blur detection, when we compare with these methods on the CUHK blur dataset we use only the 704 defocus-blurred images from CUHK and split them into a training set of 604 images and a test set of 100 images. Our experiments are performed on these three datasets (CUHK, DUT, CUHK-defocus).
We implement our model in PyTorch and train it on a machine equipped with an Nvidia Tesla M40 GPU with 12 GB of memory. We optimise the network using the stochastic gradient descent (SGD) algorithm with momentum and weight decay; the learning rate is reduced by a factor of 0.1 every 25 epochs. We train with a batch size of 16 and resize the input images to a fixed size. We use our enhanced training set of 5200 images to train the model for a total of 100 epochs, which takes five hours.
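The step-wise learning rate decay can be written as a short, framework-free sketch. It assumes that "reduced by a factor of 0.1" means the learning rate is multiplied by 0.1 every 25 epochs; the initial value `BASE_LR` is a hypothetical placeholder, since the actual initial learning rate is not recoverable here. In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=25` and `gamma=0.1`.

```python
def step_lr(base_lr, epoch, step_size=25, gamma=0.1):
    """Learning rate at `epoch` when it is multiplied by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


BASE_LR = 0.01  # hypothetical initial value, for illustration only

# Learning rate at the boundaries of a 100-epoch run.
schedule = {epoch: step_lr(BASE_LR, epoch) for epoch in (0, 24, 25, 50, 75, 99)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.1e}")
```

Over 100 epochs this schedule produces four plateaus, so the last 25 epochs run at one thousandth of the initial learning rate.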
4.2 Evaluation Criteria and Comparison
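Precision and Recall. We vary the threshold used to produce a segmentation of the sharpness maps to draw a precision-recall curve:
\[ \text{Precision} = \frac{|R \cap R_g|}{|R|}, \qquad \text{Recall} = \frac{|R \cap R_g|}{|R_g|}, \]
where $R$ is the set of pixels in the segmented blurred region and $R_g$ is the set of pixels in the ground truth blurred region. The threshold value is sampled at every integer within the intensity interval of the map.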
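A precision-recall curve of this kind can be computed as in the following sketch, which treats the detected region and the ground truth region as sets of pixel coordinates. The 8-bit threshold range 0 to 255, the toy 3x3 maps, and the convention that precision is 1.0 when no pixel is detected are assumptions made for illustration.

```python
def precision_recall(pred, truth, threshold):
    """Precision and recall of the region where `pred` reaches `threshold`."""
    detected = {(y, x) for y, row in enumerate(pred)
                for x, v in enumerate(row) if v >= threshold}
    target = {(y, x) for y, row in enumerate(truth)
              for x, v in enumerate(row) if v}
    hits = len(detected & target)
    # Assumed convention: an empty detection counts as perfectly precise.
    precision = hits / len(detected) if detected else 1.0
    recall = hits / len(target) if target else 1.0
    return precision, recall


def pr_curve(pred, truth, thresholds):
    """One (precision, recall) pair per threshold."""
    return [precision_recall(pred, truth, t) for t in thresholds]


# Hypothetical 3x3 prediction map (0..255) and binary ground truth.
pred = [[200, 180, 40],
        [220, 90, 10],
        [30, 20, 5]]
truth = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
curve = pr_curve(pred, truth, thresholds=range(0, 256))
print(curve[50], curve[100])  # (0.75, 1.0) (1.0, 1.0)
```

Lower thresholds enlarge the detected region, which keeps recall high but admits false detections; higher thresholds do the opposite, and the curve summarises this trade-off.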
F-measure. The F-measure, which is an overall performance measurement, is defined as:
\[ F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}, \]
where $\beta^2$ is the weighting parameter ranging from 0 to 1. Following [39], $\beta^2$ is set to emphasize the precision. Precision stands for the percentage of sharp pixels being correctly detected, and Recall is the fraction of detected sharp pixels in relation to the ground truth number of sharp pixels. A larger F-measure value means a better result.
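The weighted F-measure can be sketched as a single function of precision and recall. The standard weighted form with a squared weighting parameter is assumed; the default `beta_sq=0.3` is a value often used in the related saliency literature and is only an illustrative assumption here, because the value used in this work is not recoverable from the text.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall.

    A `beta_sq` below 1 weights precision more heavily than recall.
    """
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta_sq) * precision * recall / denom


# With beta_sq < 1, a precise detector scores higher than an equally
# "lopsided" detector that favours recall.
precise = f_measure(0.9, 0.5)
permissive = f_measure(0.5, 0.9)
print(precise > permissive)  # True
```

Setting `beta_sq=1.0` recovers the ordinary F1 score, in which precision and recall contribute equally.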
Mean Absolute Error (MAE). MAE provides a good measure of the dissimilarity between the ground truth and the blur map:
\[ \text{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| G(x, y) - M(x, y) \right|, \]
where $x, y$ stand for pixel coordinates, $G$ is the ground truth map, and $M$ is the result map. $W$ and $H$ denote the width and height of $M$ (or $G$), respectively. A smaller MAE value usually means a more accurate result.
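The mean absolute error between two maps of equal size reduces to a few lines of plain Python. The sketch assumes both maps are stored as lists of rows with values scaled to the range 0 to 1; the 2x2 example maps are hypothetical.

```python
def mean_absolute_error(truth, result):
    """Average absolute per-pixel difference between two equally sized maps."""
    height, width = len(truth), len(truth[0])
    total = sum(abs(g - m)
                for g_row, m_row in zip(truth, result)
                for g, m in zip(g_row, m_row))
    return total / (width * height)


# Hypothetical 2x2 ground truth and result maps.
truth_map = [[1.0, 0.0],
             [1.0, 0.0]]
result_map = [[0.5, 0.0],
              [1.0, 0.5]]
print(mean_absolute_error(truth_map, result_map))  # 0.25
```

Unlike precision and recall, this measure needs no threshold, so it also penalises results that are only weakly confident about the correct region.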
We compare our method against 9 other state-of-the-art methods, including deep learning-based methods and hand-crafted feature methods: DeF [34], CENet [45], BTBNet [44], DBM [18], HIFST [11], SS [33], LBP [39], JNB [28], and DBDF [27]. Figure 5 shows visual comparison results for some defocus-blurred cases. These cases include various scenes with cluttered or similar backgrounds and contain complex object boundaries, which make it hard to separate the sharp regions from the images. Figure 6 shows visual comparison results for some motion-blurred cases, using the methods that can be applied to motion blur detection.
We also draw precision-recall curves and F-measure curves to study the capabilities of these methods statistically. Figure 7 shows that our method makes progress on all three tests, especially on the CUHK dataset, which contains both defocus-blurred and motion-blurred images. Our method boosts the precision over the entire recall range. Furthermore, in Figure 8, the F-measure curves of our method are the best on each dataset.
Table 1 shows that our method consistently performs favourably against the other methods on the three datasets, which indicates the superiority of our method over the other approaches.
4.3 Ablation Analysis
Effectiveness of Skip layers.
Although U-shaped networks with skip layers have been applied in BTBNet, we run supplementary experiments to verify the significance of skip connections. As a controlled comparison, we build a new model that is the same as our original model except that it has no skip layers, and train it on the CUHK blur dataset. Comparing the results in Figure 9, we find that the model without skip connections cannot precisely segment the edges of objects. As a result, the model without skip connections has a lower F-measure score and a higher MAE score in Table 2.
Effectiveness of multi-scale Extractors.
The multi-scale extractors with dilated convolution aim to extract multi-scale texture features to improve the precision of the blur map. To verify their effect, we compare our network with the classical U-Net [26], which resembles our network but does not have multi-scale extractors. In Figure 10, the results of the U-Net without the multi-scale extractors are disturbed by backgrounds of shallow depth. Owing to the multi-scale extractors, our model is sensitive enough to the degree of blur to accurately separate the blurred region. As a result, our model has a higher F-measure score and a lower MAE score in Table 3. We also try replacing the 1-dilated convolution kernels with normal convolution kernels that have the same receptive field. As shown in Table 3, our model performs slightly worse than the model using normal convolution kernels, but it saves millions of parameters by using dilated convolutions.
In this work, we regard blur detection as image segmentation. We design a group of multi-scale extractors with dilated convolution to capture texture information of images at different scales. We then combine the extractors with the U-shaped network to fuse the shallow texture information with the deep semantic information. Taking advantage of the multi-scale texture information and the semantic information, our method performs better on scenes that have cluttered or similar backgrounds and contain complex object boundaries. Experimental results on three datasets show that our method performs better than the state-of-the-art methods in blur detection. Although our method makes progress, its performance is limited by the richness of the dataset. In the future, we will study how to improve the generalization of our model.
- [1] (2007) Defocus magnification. In Computer Graphics Forum, Vol. 26.
- [2] (2013) Defocus map estimation from a single image via spectrum contrast. Optics Letters 38 (10), pp. 1706–1708.
- [3] (2016) Salient object detection via weighted low rank matrix recovery. IEEE Signal Processing Letters PP (99), pp. 1–1.
- [4] (2017) Linknet: exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4.
- [5] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- [6] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- [7] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818.
- [8] (2019) ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data. arXiv preprint arXiv:1904.00592.
- [9] (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.
- [10] (2019) Stacked deconvolutional network for semantic segmentation. IEEE Transactions on Image Processing.
- [11] (2017) Spatially-varying blur detection based on multiscale fused and sorted transform coefficients of gradient magnitudes. In CVPR, pp. 596–605.
- [12] (1990) A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pp. 286–297.
- [13] (2018) Multiscale blur detection by learning discriminative deep features. Neurocomputing 285, pp. 154–166.
- [14] (2018) Ternausnet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746.
- [15] (2017) Edge-based defocus blur estimation with adaptive scale selection. IEEE Transactions on Image Processing 27 (3), pp. 1126–1137.
- [16] (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579.
- [17] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- [18] (2018) Deep blur mapping: exploiting high-level semantics by deep neural networks. IEEE Transactions on Image Processing 27 (10), pp. 5155–5166.
- [19] (2018) Self-supervised segmentation by grouping optical-flow. In Proceedings of the European Conference on Computer Vision (ECCV).
- [20] (2011) Coded apertures for defocus deblurring. In Symposium Iberoamericano de Computacion Grafica, Vol. 5, pp. 1.
- [21] (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- [22] (2016) Layered reconstruction for defocus and motion blur. Google Patents. US Patent 9,483,869.
- [23] (2018) Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
- [24] (2017) A unified approach of multi-scale deep and hand-crafted features for defocus estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1736–1745.
- [25] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- [26] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- [27] (2014) Discriminative blur detection features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2965–2972.
- [28] (2015) Just noticeable defocus blur detection and estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 657–665.
- [29] (2019) Building corner detection in aerial images with fully convolutional networks. Sensors 19 (8), pp. 1915.
- [30] (2011) Blurred image region detection and classification. In Proceedings of the 19th ACM International Conference on Multimedia, pp. 1397–1400.
- [31] (2017) Focus prior estimation for salient object detection. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 1532–1536.
- [32] (2017) An effective edge-preserving smoothing method for image manipulation. Digital Signal Processing 63, pp. 10–24.
- [33] (2016) A spectral and spatial approach of coarse-to-fine blurred image region detection. IEEE Signal Processing Letters 23 (11), pp. 1652–1656.
- [34] (2019) DeFusionNET: defocus blur detection via recurrently fusing and refining multi-scale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2700–2709.
- [35] (2008) Blind image quality assessment for measuring image blur. In 2008 Congress on Image and Signal Processing, Vol. 1, pp. 467–470.
- [36] (2016) AllFocus: patch-based video out-of-focus blur reconstruction. IEEE Transactions on Circuits and Systems for Video Technology 27 (9), pp. 1895–1908.
- [37] (2019) Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8256–8265.
- [38] (2017) Estimating defocus blur via rank of local patches. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5371–5379.
- [39] (2016) LBP-based segmentation of defocus blur. IEEE Transactions on Image Processing 25 (4), pp. 1626–1638.
- [40] (2010) Deconvolutional networks. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2528–2535.
- [41] (2018) MDU-net: multi-scale densely connected u-net for biomedical image segmentation. arXiv preprint arXiv:1812.00352.
- [42] (2009) Single image focus editing. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1947–1954.
- [43] (2011) Single-image refocusing and defocusing. IEEE Transactions on Image Processing 21 (2), pp. 873–882.
- [44] (2018) Defocus blur detection via multi-stream bottom-top-bottom fully convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3080–3088.
- [45] (2019) Enhancing diversity of defocus blur detectors via cross-ensemble network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8905–8913.
- [46] (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11.