Single Image Dehazing Using Ranking Convolutional Neural Network

01/15/2020 ∙ by Yafei Song, et al. ∙ Beihang University

Single image dehazing, which aims to recover the clear image solely from an input hazy or foggy image, is a challenging ill-posed problem. An analysis of existing approaches shows that their common key step is to estimate the haze density of each pixel. To this end, various approaches often rely on heuristically designed haze-relevant features. Several recent works instead learn the features automatically by directly exploiting Convolutional Neural Networks (CNN). However, such features may be insufficient to fully capture the intrinsic attributes of hazy images. To obtain effective features for single image dehazing, this paper presents a novel Ranking Convolutional Neural Network (Ranking-CNN). In Ranking-CNN, a novel ranking layer is proposed to extend the structure of CNN so that the statistical and structural attributes of hazy images can be simultaneously captured. By training Ranking-CNN in a well-designed manner, powerful haze-relevant features can be automatically learned from massive hazy image patches. Based on these features, haze can be effectively removed by using a haze density prediction model trained through random forest regression. Experimental results show that our approach outperforms several previous dehazing approaches on synthetic and real-world benchmark images. Comprehensive analyses are also conducted to interpret the proposed Ranking-CNN from both theoretical and experimental aspects.


I Introduction

In real-world scenarios, small particles suspended in the atmosphere (e.g., droplets and dust) often scatter the light. As a consequence, the clarity of an image can be seriously degraded, which may decrease the performance of many multimedia processing systems, e.g., content-based image retrieval [Gao:2016:TMM]. Image enhancement methods [Wang:2016:TMM, Xu:2014:TMM] can only alleviate this problem slightly. It is thus still valuable to develop effective dehazing methods that recover the clear image from an input hazy or foggy image.

Fig. 1: Both statistical and structural attributes of image patches are useful for dehazing. For example, the grass patches can be dehazed according to their color statistics (i.e., the statistical attributes of image patches), while the haze over the fence can be removed according to its gradients (i.e., the structural attributes of image patches).

In the past decades, the problem of haze formation has been extensively studied in atmospheric optics [Timofeev:2008:BOOK]. It is widely acknowledged that a hazy image can be regarded as a convex combination of scene radiance and atmospheric light [He:2009:CVPR, Tang:2014:CVPR, Fattal:2014:TOG, Berman:2016:CVPR, Ren:2016:ECCV, Cai:2016:TIP, Wang:2017:TMM]. The combination coefficient is often called the transmission. As a result, the task of image dehazing can be formulated as recovering the scene radiance from a hazy image by estimating the atmospheric light and the transmission.

Under this formulation, two kinds of dehazing approaches have been proposed in the literature. Some of them propose to dehaze an image with the assistance of additional information, e.g., scene depth [Kopf:2008:TOG] or images taken under different weather conditions [Nayar:1999:ICCV, Narasimhan:2000:CVPR]. However, such additional information may not always be available, which prevents the use of these dehazing approaches in many real-world scenarios. In contrast, other approaches propose to directly dehaze a single image, which is an ill-posed problem since the atmospheric light and the transmission need to be simultaneously recovered for each image pixel. To address this issue, these approaches often assume that the atmospheric light is constant for every pixel in one input image, so that it can be estimated first in a pre-processing step. After that, the dehazing process can be simplified to a transmission estimation problem. For instance, He et al. [He:2009:CVPR] propose the dark channel prior, which is proved to be effective in transmission estimation. Tang et al. [Tang:2014:CVPR] incorporate four types of features to train a regression model for transmission prediction. Fattal [Fattal:2014:TOG] utilizes the local color-lines prior in clear images to estimate the transmission. Berman et al. [Berman:2016:CVPR] further propose the non-local haze-line prior. In many cases, these approaches achieve impressive performance. However, for each prior, there are often images that do not satisfy it. Therefore, heuristically designed priors (or features) may be insufficient to fully capture the intrinsic attributes of hazy images.

Inspired by the impressive success of Convolutional Neural Networks (CNN) [LeCun:1998:IEEE], e.g., in image classification/annotation [Krizhevsky:2012:NIPS, Wu:2015:TBD], object detection [Erhan:2014:CVPR], semantic segmentation [Farabet:2013:TPAMI], and image denoising [Xie:2012:NIPS, Agostinelli:2013:NIPS], this paper aims to automatically learn haze-relevant features from massive hazy images. Two recent works [Ren:2016:ECCV, Cai:2016:TIP] hold the same basic idea and adopt CNN to perform image dehazing. Ren et al. [Ren:2016:ECCV] directly estimate the whole transmission map from an input image under the multi-scale FCN (fully convolutional network) framework [Long_2015_CVPR]. Cai et al. [Cai:2016:TIP] use a regression network to estimate the transmission of each pixel from its local surrounding patch. However, these two works mainly exploit existing layers to construct their CNNs. In contrast, we propose a new layer, named the ranking layer, derived from our insight into this problem, which can facilitate the learning of haze-relevant features.

By analysing the mechanism of existing image dehazing methods, we find that statistical attributes are essential, e.g., the dark channel prior [He:2009:CVPR], haze-line prior [Berman:2016:CVPR] and color-lines prior [Fattal:2014:TOG]. But the classical CNN, while capturing the structural attributes well (e.g., the fence in Fig. 1), may lack the ability to capture the statistical attributes (e.g., the grass in Fig. 1). To alleviate this problem, we propose a novel ranking layer which can be embedded in the structure of a classical CNN to form the Ranking-CNN. A Ranking-CNN can capture the structural and statistical attributes simultaneously. As a straightforward method, an end-to-end regression network could be established to estimate the transmission of each pixel from its surrounding local patch. However, since the regression target is only a real value between 0 and 1, when training the network with the back-propagation algorithm, the gradient may be small and not robust. Therefore, it is difficult to effectively train the deep network. To this end, the regression problem is converted into a classification problem. Then the Ranking-CNN can be effectively trained on massive hazy image patches, and various types of haze-relevant features can be automatically learned. Based on these features, the random forest is further adopted to train a regression model to predict the transmission. Experimental results on plentiful synthetic and real-world images show that the proposed approach outperforms several previous outstanding approaches.

The main contributions of this paper include: First, we propose a novel ranking layer as well as its forward and backward computations, and theoretical analyses illuminate its excellent ability to capture statistical attributes. Second, by incorporating the ranking layer into the classical CNN, we construct a Ranking-CNN to learn effective haze-relevant features, which demonstrates impressive performance in image dehazing. Third, we benchmark the proposed dehazing approach and several state-of-the-art methods on extensive qualitative and quantitative experiments, in which the proposed approach achieves satisfactory performance.

The rest of this paper is organized as follows. Section II presents related works. Section III formulates the problem and overviews our pipeline. Each step is then explained in detail in Section IV. Finally, we show the experimental results in Section V and conclude this paper in Section VI.

II Related Work

In the past two centuries, the interaction of light with the atmosphere has been widely studied [Middleton:1957:BOOK, Mccartney:1976:Optics, Timofeev:2008:BOOK], a field known as atmospheric optics. Based on this physical phenomenon, and depending on whether additional information is used, there are mainly two kinds of image dehazing methods. We review the related works from this perspective. In addition, we briefly introduce several representative works on deep neural networks.

Image dehazing with additional information. Early methods usually use additional information to dehaze images. Nayar and Narasimhan [Nayar:1999:ICCV, Narasimhan:2000:CVPR] restore the scene structure from multiple images captured under different weather conditions, after which the clear image can be recovered. Schechner et al. [Schechner:2001:CVPR] observe that the scattered atmospheric light is usually partially polarized, and take two or more images through a polarizer at different orientations for image dehazing. Shwartz et al. [Shwartz:2006:CVPR] automatically recover the parameters of the atmospheric light needed by polarizer-based image dehazing methods. Kopf et al. [Kopf:2008:TOG] use the geometry of the scene to dehaze images by manually registering the hazy image to a 3D scene. However, as such additional information is usually difficult to obtain, these methods have many limitations.

Fig. 2: System framework of our approach. Given a hazy image, we first estimate a global atmospheric light and use a pre-trained Ranking-CNN to extract haze-relevant features for each pixel from its surrounding patch. After that, the initial transmission is estimated via a random forest regression model, which is then refined through a guided filter. Finally, the clear image is recovered through single image dehazing.

Single image dehazing. As single image dehazing is an ill-posed problem, various priors and hypotheses have been proposed to tackle it. Oakley and Bu [Oakley:2007:TIP] assume a constant air-light and estimate it by finding the minimum of a global cost function. Tan [Tan:2008:CVPR] removes the haze layer based on the observations that clear images have more contrast and that the transmission tends to be smooth. Fattal [Fattal:2008:TOG] assumes that the shading and transmission functions are locally statistically uncorrelated. Tarel and Hautière [Tarel:2009:ICCV] propose a fast algorithm whose complexity is a linear function of the image size. Kratz and Nishino [Kratz:2009:ICCV] assume that the albedo and depth are statistically independent, and formulate a factorial Markov random field to estimate the transmission. He et al. [He:2009:CVPR] observe that the lowest value across channels in a local patch of a clear image tends to be zero, which is called the dark channel prior. Wen et al. [Wen:2013:ISCAS] further develop the underwater dark channel prior for image enhancement. Gibson et al. [Gibson:2012:TIP] investigate the effects of dehazing on image and video coding, and further [Gibson:2013:ICIP] use a locally adaptive Wiener filter to refine the estimated haze density. Yan et al. [Yan:2013:SIGATB] reduce the amplified noise in images restored from dense haze. Fattal [Fattal:2014:TOG] utilizes the color-lines prior in local image patches, and Sulami et al. [Sulami:2014:ICCP] apply the color-lines prior to estimate an appropriate global constant atmospheric light vector. Wang and Fan [Wang:2014:TIP] propose a multiscale depth fusion (MDF) method with local Markov regularization to blend multi-level details of chromaticity priors. Zhu et al. [Zhu:2015:TIP] propose a color attenuation prior and further apply a linear model for haze removal. Wang et al. [Wang:2017:TMM] propose a fast method based on linear transformation. Each prior applies to a range of hazy images; however, there are often images that do not satisfy it. To this end, this paper aims at automatically learning from massive data.

Recently, several learning-based image dehazing methods have emerged. Tang et al. [Tang:2014:CVPR] train a regression model to estimate the transmission by incorporating four types of haze-relevant features. Two recent works [Ren:2016:ECCV, Cai:2016:TIP] also adopt CNN to perform image dehazing. Ren et al. [Ren:2016:ECCV] directly estimate the whole transmission map from an input image via a multi-scale CNN under the FCN framework [Long_2015_CVPR]. Cai et al. [Cai:2016:TIP] use a regression network to estimate the transmission of each pixel from its surrounding patch. However, these works mainly exploit existing hand-crafted features or classical CNNs. In contrast, we propose a novel Ranking-CNN to simultaneously capture statistical and structural attributes, both of which are essential for single image dehazing.

Deep neural networks. Deep neural networks, also known as deep learning or feature learning, are more powerful than shallow learning algorithms [Hinton:2006:NC]. Many researchers use deep learning to perform high-level computer vision tasks and significantly improve the performance, such as image classification [Krizhevsky:2012:NIPS, He:2016:CVPR], object detection [Erhan:2014:CVPR, Szegedy:2015:CVPR], and semantic labelling [Farabet:2013:TPAMI, Long_2015_CVPR, Chen:2015:ICLR]. Researchers have also applied deep neural networks to tackle low-level problems and obtained promising results. Xie et al. [Xie:2012:NIPS] propose Stacked Sparse Denoising Auto-encoders (SSDA) to perform image denoising and inpainting. Agostinelli et al. [Agostinelli:2013:NIPS] further propose the adaptive multi-column stacked sparse denoising autoencoder (AMC-SSDA) to tackle multiple types of noise. Schuler et al. [Schuler:2013:CVPR] train a multi-layer perceptron to perform image deconvolution and obtain satisfactory results. Cho et al. [Cho:2016:ECCV] apply CNN to image matting, and Shen et al. [Shen:2016:ECCV] focus on portrait matting. These works demonstrate that deep neural networks can achieve satisfactory results not only on high-level problems but also on low-level ones.

III Overview

To dehaze an image, we first briefly formulate the formation process of a hazy image. Under hazy or foggy weather, the scene radiance is scattered by the small particles suspended in the atmosphere. With increasing scene depth, the camera sensor captures less scene radiance but more atmospheric light. Thus, the formation of a hazy image can be described as a convex combination of the scene radiance J and the atmospheric light A, which can be formulated as [Nayar:1999:ICCV]

$$\mathbf{I}(x) = \mathbf{J}(x)\,t(x) + \mathbf{A}\,\bigl(1 - t(x)\bigr), \tag{1}$$

where $\mathbf{I}(x)$ is a pixel $x$ of the hazy image I and $t(x)$ is its transmission. As a consequence, the problem of single image dehazing can be described as recovering the scene radiance $\mathbf{J}(x)$ from the hazy pixel $\mathbf{I}(x)$. From (1), we have

$$\mathbf{J}(x) = \frac{\mathbf{I}(x) - \mathbf{A}}{t(x)} + \mathbf{A}. \tag{2}$$

Note that the dehazing process in (2) is ideal and may require slight variations in building the computational model for image dehazing. From (2), we find that the dehazing problem can be decomposed into three subproblems:

  1. Estimate the atmospheric light $\mathbf{A}$;

  2. Predict the transmission $t(x)$;

  3. Recover the scene radiance $\mathbf{J}(x)$.
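As a quick sanity check, the formation model (1) and its ideal inversion (2) can be sketched in a few lines of NumPy. This is a minimal illustration with toy values, not the authors' code; the symbols follow the equations above.

```python
import numpy as np

def synthesize_hazy(J, t, A):
    """Haze formation model of Eq. (1): I(x) = J(x) t(x) + A (1 - t(x))."""
    return J * t + A * (1.0 - t)

def dehaze_ideal(I, t, A):
    """Ideal inversion of Eq. (2): J(x) = (I(x) - A) / t(x) + A."""
    return (I - A) / t + A

# Round-trip check on a single RGB pixel.
J = np.array([0.2, 0.5, 0.8])   # scene radiance
A = np.array([1.0, 1.0, 1.0])   # atmospheric light
t = 0.6                          # transmission
I = synthesize_hazy(J, t, A)
J_rec = dehaze_ideal(I, t, A)
```

Since (2) is the exact algebraic inverse of (1), `J_rec` reproduces `J` whenever `t > 0`; the practical difficulty lies entirely in estimating `t` and `A`.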

To address these subproblems, the system framework of our approach is shown in Fig. 2. Specifically, since transmission prediction is often considered to be the key and most challenging subproblem in image dehazing [He:2009:CVPR, Tang:2014:CVPR, Fattal:2014:TOG], we propose the Ranking-CNN for this subproblem. Similar to the solutions in [He:2009:CVPR, Tang:2014:CVPR], we also assume that the atmospheric light is constant for all image pixels. Then, we calculate the dark channel of the input hazy image using the approach in [He:2009:CVPR], and the atmospheric light at any pixel is estimated by averaging the RGB color of the pixels with the largest dark channel values.
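The pre-processing step above might be sketched as follows in NumPy. The paper follows [He:2009:CVPR] but does not state the exact patch size or the fraction of pixels kept in this excerpt, so `patch` and `frac` are illustrative assumptions.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Per-pixel minimum over RGB, followed by a local minimum filter
    over a patch x patch window (the dark channel of He et al.)."""
    h, w, _ = img.shape
    mins = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode='edge')
    dark = np.empty_like(mins)
    for y in range(h):
        for x in range(w):
            dark[y, x] = padded[y:y + patch, x:x + patch].min()
    return dark

def estimate_atmospheric_light(img, dark, frac=0.001):
    """Average the RGB colors of the pixels with the largest
    dark-channel values (frac is an assumed fraction)."""
    flat = dark.ravel()
    n = max(1, int(frac * flat.size))
    idx = np.argsort(flat)[-n:]          # pixels with the largest dark-channel values
    return img.reshape(-1, 3)[idx].mean(axis=0)

img = np.full((8, 8, 3), 0.7)            # toy uniform image
dark = dark_channel(img)
A = estimate_atmospheric_light(img, dark)
```

The double loop is written for clarity; a real implementation would use an erosion filter instead.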

Once the atmospheric light is estimated, we only have to focus on predicting the transmission for every pixel according to its local features. To extract haze-relevant features, the proposed Ranking-CNN extends the structure of the classical CNN by adding a novel ranking layer so that the statistical and structural attributes of hazy image patches can be simultaneously captured. Based on the haze-relevant features, a transmission prediction model is then trained using the random forest regressor. The random forest regressor is adopted for several of its advantages, e.g., it can measure the importance of features and it is less prone to severe over-fitting. This regression model can be used to obtain the initial transmission for every pixel in the input image. To avoid edge artifacts, a guided filter is applied to refine the initial transmission, and the refined transmission is combined with the estimated global atmospheric light for image dehazing. As the Ranking-CNN model and the regression model are trained on massive amounts of data, they are effective for different input hazy images. Thus, we only need to train one unique Ranking-CNN model and one unique regression model, which are then used to dehaze any input hazy image.

IV The Approach

In this section, we first introduce what the ranking layer is and how to add it to the structure of the classical CNN so as to construct the Ranking-CNN. After that, we describe the implementation details of the Ranking-CNN and show how to learn haze-relevant features. Finally, we demonstrate how to dehaze an input image using the features extracted by the Ranking-CNN.

IV-A Ranking Layer

By analysing the mechanism of existing image dehazing methods, we find that two types of attributes may influence the performance of transmission estimation, including statistical attributes (e.g., dark channel prior [He:2009:CVPR] and color-lines prior [Fattal:2014:TOG]) and structural attributes (e.g., boundaries [Meng:2013:ICCV]). Inspired by this observation, we propose to automatically learn haze-relevant features through CNN so as to simultaneously capture these two types of attributes. However, CNN performs impressively on capturing the structural attributes due to the usage of convolutional layers, while it often lacks the ability to extract statistical attributes. Thus it is necessary to modify the structure of the classical CNN so as to enhance its ability in extracting haze-relevant features. Toward this end, we propose to add a ranking layer to the classical CNN so as to construct a novel Ranking-CNN.

The input of a ranking layer consists of a set of feature maps, the same as for a common layer of classical CNNs. The ranking layer retains the values of all the elements in a feature map and only changes their ordering: it operates separately on each input feature map and outputs a ranked feature map with the same dimension (as shown in Fig. 3). Let $\mathbf{f}$ be an input feature map with $N$ elements and $\hat{\mathbf{f}}$ be its ranked version, and denote the $i$th elements of $\mathbf{f}$ and $\hat{\mathbf{f}}$ as $f_i$ and $\hat{f}_i$, respectively. As shown in Fig. 4 (a), in the forward propagation of a ranking layer, the element $\hat{f}_i$ corresponds to the $i$th smallest element in $\mathbf{f}$, whose index is denoted as $\rho(i)$, i.e., $\hat{f}_i = f_{\rho(i)}$. To facilitate the operations in the backward propagation, we record these pair-wise correspondences between the elements of the input and output feature maps as $\rho$.

Fig. 3: A ranking layer operates separately on each input feature map and only changes the ordering of elements in each feature map rather than modifying their values. Note that a feature map is actually a 2D matrix; here we turn it into a 1D vector by sampling elements column-wise for a better viewing experience.
Fig. 4: The forward and backward propagation of the ranking layer on a specific feature map. In the forward propagation, the ranking layer sorts all the elements in a feature map and records the correspondence $\rho$ between the input and output feature maps. In the backward propagation, the ranking layer propagates the partial derivatives from the output feature map to the input feature map according to the correspondence $\rho$.
Fig. 5: The generation of training data and the structure of the Ranking-CNN. One million training patches are synthesized by adding random haze to 100k clear image patches sampled from 400 clear images. The Ranking-CNN is constructed by adding a ranking layer to the structure of a classical CNN (C: convolution; P: max pooling; R: ranking; F: fully-connected).

Fig. 6: An example to show how the ranking layer facilitates the statistical attributes extraction, e.g., the contrast. For (a) various feature maps, classical CNN may need (b) different convolutional filters to compute the contrast. However, if (c) the feature maps are ranked, only (d) one unique filter is needed.

Based on the pair-wise correspondences $\rho$ between the elements of an input feature map and its ranked output, the backward propagation at the ranking layer can be conducted. As the ranking layer only changes the ordering of elements in each feature map, the partial derivative of the loss function $\mathcal{L}$ with respect to each output feature can be directly passed to its corresponding input feature as

$$\frac{\partial \mathcal{L}}{\partial f_{\rho(i)}} = \frac{\partial \mathcal{L}}{\partial \hat{f}_i}. \tag{3}$$

In Fig. 4 (b), we pick a specific feature map and visually explain the backward propagation of the ranking layer. Note that the ranking layer is parameter-free: no parameter needs to be learned in the backward propagation other than passing the derivatives.

The ranking layer operates separately on each feature map and sorts the elements of an input feature map in ascending order. Since the output feature map is ordered, extracting its statistical attributes, e.g., its contrast, becomes easier. As shown in Fig. 6(a), for various feature maps, a classical CNN may need different convolutional filters (Fig. 6(b)) to compute the contrast. However, if every feature map is ranked (Fig. 6(c)), only one unique convolutional filter (Fig. 6(d)) is needed. As a whole, a ranked feature map facilitates the extraction of statistical attributes, while the value of each feature, which is actually computed through classical convolutional or pooling operations, still preserves the structural attributes.

We finally analyse the computational complexity of the ranking layer. The forward propagation actually performs a sort operation, so its serial complexity is $O(N \log N)$ for a feature map with $N$ elements, and it can be accelerated with standard parallel sorting schemes. The backward propagation directly propagates the derivatives according to the correspondence $\rho$, which takes $O(N)$ serially and constant time per element in parallel.
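The forward and backward passes can be sketched in NumPy as follows. This is a minimal illustration, not the authors' Caffe layer; `rho` names the recorded input-output correspondence.

```python
import numpy as np

def ranking_forward(f):
    """Sort the flattened feature map in ascending order and record the
    correspondence rho such that f_hat[i] == f[rho[i]]."""
    flat = f.ravel()
    rho = np.argsort(flat, kind='stable')
    f_hat = flat[rho].reshape(f.shape)
    return f_hat, rho

def ranking_backward(grad_out, rho, shape):
    """Route each output derivative back to its input position (Eq. (3));
    the layer is parameter-free, so only derivatives are passed."""
    grad_in = np.empty(grad_out.size)
    grad_in[rho] = grad_out.ravel()
    return grad_in.reshape(shape)

f = np.array([[3.0, 1.0], [2.0, 4.0]])
f_hat, rho = ranking_forward(f)           # values preserved, order changed
g = ranking_backward(np.array([[10.0, 20.0], [30.0, 40.0]]), rho, f.shape)
```

Here `f[0, 0] = 3.0` is the third-smallest element, so in the backward pass it receives the derivative of the third output position (`30.0`), matching the routing in Fig. 4 (b).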

IV-B Learning Haze-relevant Features

Given the ranking layer and the CNN, three issues still need to be addressed to learn haze-relevant features, including: 1) generating training data; 2) determining the structure of Ranking-CNN; 3) optimizing the parameters of Ranking-CNN.

Due to the lack of large-scale benchmarks, it is difficult to collect sufficient training data. Thus we address the first issue by generating massive synthesized hazy image patches for training the Ranking-CNN. As shown in Fig. 5, we first collect 400 clear images from the Internet, covering various types of scenes such as mountain, forest, grass, city, building, and street. From these images, we randomly select 100k clear image patches of a fixed resolution. Based on these patches, we follow the formation process of a hazy image in (1) to generate massive hazy patches. Given a clear patch, we choose a random transmission between 0 and 1 and assume that the transmission on each small image patch is constant, so the hazy patches can be synthesized by simulating the formation process in (1). Since the main objective of Ranking-CNN is to learn haze-relevant features for transmission prediction, we use the same atmospheric light for all patches in the synthesis process (i.e., $\mathbf{A} = \mathbf{1}$). Finally, we obtain one million synthesized hazy patches for learning haze-relevant features.
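The patch-synthesis procedure might be sketched as follows. The 16x16 patch size, the number of hazy patches per clear patch, and the unit atmospheric light are illustrative assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_patches(clear_patches, n_per_patch=10):
    """Add random haze to clear patches following Eq. (1), with an assumed
    atmospheric light A = 1 and a constant transmission per patch."""
    hazy, labels = [], []
    for J in clear_patches:
        for _ in range(n_per_patch):
            t = rng.uniform(0.0, 1.0)          # random transmission
            hazy.append(J * t + 1.0 * (1.0 - t))
            labels.append(t)
    return np.stack(hazy), np.array(labels)

# Toy stand-ins for 16x16 RGB clear patches.
clear = rng.uniform(0.0, 1.0, size=(5, 16, 16, 3))
hazy, t = synthesize_patches(clear)
```

Because each synthesized patch keeps its generating transmission as the label, supervision comes for free; this is what makes large-scale training feasible without a real-world hazy/clear benchmark.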

Before training the Ranking-CNN, we have to determine its structure. As shown in Fig. 5, our Ranking-CNN has ten layers. The first layer is the input layer, which consists of the RGB channels of a color image patch. The second layer is a convolutional layer, where the R, G and B maps are convolved with convolutional kernels to generate a set of feature maps. The third layer is a max pooling layer that sub-samples each input feature map over non-overlapping windows. The fourth layer is the ranking layer, which operates separately on each input feature map: it sorts all the elements of the feature map and outputs a ranked feature map with the same dimension, whose elements are in ascending order from top-left to bottom-right. The fifth layer is a convolutional layer with 32 feature maps, and the sixth layer is a convolutional layer with the same configuration as the fifth. The seventh layer is another max pooling layer, identical to the third. The eighth and ninth layers are fully-connected layers, and the output layer is a fully-connected layer with 10 output values. We use the rectified linear unit (ReLU) activation function [Hinton:2010:ICML] for all convolutional layers and the first fully-connected layer. For each hazy patch with transmission $t$, the 10D output vector (denoted as $\mathbf{y}$) is expected to approximate the label vector $\mathbf{l}$, where each $l_k$ is a binary variable that can be calculated as

$$l_k = \begin{cases} 1, & \text{if } \frac{k-1}{10} < t \le \frac{k}{10}, \\ 0, & \text{otherwise}, \end{cases} \qquad k = 1, \dots, 10. \tag{4}$$

In other words, we treat the Ranking-CNN as a multi-class classifier and try to optimize its parameters via maximizing the classification accuracy.
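The transmission-to-label quantization can be sketched as below. The exact bin boundaries of (4) are our reconstruction; any uniform 10-bin quantization of (0, 1] has this form.

```python
import numpy as np

def transmission_to_label(t, n_bins=10):
    """Quantize a transmission t in (0, 1] into one of n_bins classes and
    return a one-hot label vector, in the spirit of Eq. (4)."""
    k = min(int(np.ceil(t * n_bins)) - 1, n_bins - 1)  # bin index in [0, n_bins)
    k = max(k, 0)
    label = np.zeros(n_bins)
    label[k] = 1.0
    return label

label = transmission_to_label(0.31)   # falls in the 4th bin (index 3)
```

Training the network against these one-hot labels turns the ill-conditioned regression into a 10-way classification, whose soft-max gradients are larger and more robust.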

Intuitively, we could train an end-to-end network that predicts the transmission by replacing the output layer with a linear regression layer. However, since the output of the linear regression layer is only a real value between 0 and 1, it is difficult to effectively train the deep network. To facilitate the training process, we adopt a two-stage training scheme. That is, we first convert the problem into a 10-category classification problem and train a Ranking-CNN model for classification. After that, the output layer is discarded and the output of the second fully-connected layer is used as features for training a random forest regressor to predict the transmission. In this manner, the training process of the Ranking-CNN is easier, and the learned features, when combined with the random forest regressor, still deliver impressive performance in image dehazing.
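The second stage, features in, random-forest regression out, can be illustrated with a deliberately tiny forest of depth-1 trees. The real model uses full trees with tree and feature-subset counts not stated in this excerpt, and the 64-D features below are random stand-ins for Ranking-CNN activations.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y, n_feat):
    """One depth-1 tree: try a random feature subset, split each candidate
    at its median, keep the split with the lowest squared error."""
    feats = rng.choice(X.shape[1], size=n_feat, replace=False)
    best = None
    for j in feats:
        thr = np.median(X[:, j])
        left = X[:, j] <= thr
        if left.all() or (~left).all():
            continue
        pl, pr = y[left].mean(), y[~left].mean()
        err = ((y - np.where(left, pl, pr)) ** 2).mean()
        if best is None or err < best[0]:
            best = (err, j, thr, pl, pr)
    return best

def fit_forest(X, y, n_trees=50, n_feat=16):
    """Each tree sees a bootstrap sample and a random feature subset."""
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))
        stump = fit_stump(X[idx], y[idx], n_feat)
        if stump is not None:
            forest.append(stump)
    return forest

def forest_predict(forest, X):
    """Average the per-tree predictions."""
    preds = [np.where(X[:, j] <= thr, pl, pr) for _, j, thr, pl, pr in forest]
    return np.mean(preds, axis=0)

# Toy stand-in: transmission depends on one of 64 "CNN feature" dimensions.
X = rng.uniform(0.0, 1.0, size=(400, 64))
t = X[:, 0]
forest = fit_forest(X, t)
mse = ((forest_predict(forest, X) - t) ** 2).mean()
baseline = ((t.mean() - t) ** 2).mean()
```

Even this crude ensemble beats the constant-mean predictor, which is the essential property the two-stage scheme relies on; in practice one would use a mature implementation such as scikit-learn's `RandomForestRegressor`.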

To train the Ranking-CNN model, we minimize a soft-max loss function to optimize the parameters of the network. The loss function is defined as

$$\mathcal{L} = -\log y_{k^*}, \tag{5}$$

where $y_{k^*}$ is the $k^*$th element of the soft-max output $\mathbf{y}$, and $k^*$ is the index such that $l_{k^*} = 1$. To optimize the parameters of the Ranking-CNN, we use the back-propagation algorithm with the stochastic gradient descent solver [LeCun:1998:IEEE], with fixed settings of the initial learning rate, momentum and mini-batch size. As shown in previous literature [Cai:2016:TIP, Ren:2016:ECCV], it is helpful to decrease the learning rate along with the training process. Therefore, we update the learning rate as

(6)

where $r$ is the index of the training iteration on each mini-batch. In the experiments, we perform several epochs over the whole training data, and the 64D output of the second fully-connected layer is used for transmission prediction.

IV-C Image Dehazing

Based on the learned features, we further use the Random Forest [Breiman:2001:ML] to learn a regression model between the transmission and the haze-relevant features. Our random forest model consists of a fixed number of trees, and each tree randomly selects a subset of the feature dimensions. For efficiency, we randomly select a subset of the hazy image patches to train the regression model. Note that we set the atmospheric light to a constant vector (i.e., $\mathbf{A} = \mathbf{1}$) during the training process. To relax this condition at test time, we first apply white balance to the input image using the estimated atmospheric light A, dividing each channel of the input image by the corresponding channel of the estimated A:

$$\mathbf{I}^{w}_{c}(x) = \frac{\mathbf{I}_{c}(x)}{\mathbf{A}_{c}}, \qquad c \in \{r, g, b\}. \tag{7}$$

Thus $\mathbf{I}^{w}$ can be regarded as having an atmospheric light of 1, and the transmissions of $\mathbf{I}^{w}$ and I are the same.
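A short sketch of the white-balance normalization in (7), showing that the normalized image keeps the same transmission while its atmospheric light becomes 1 (toy values, channel-wise division as described above):

```python
import numpy as np

def white_balance(I, A):
    """Divide each channel by the estimated atmospheric light (Eq. (7)),
    so the normalized image has an atmospheric light of 1."""
    return I / A.reshape(1, 1, 3)

A = np.array([0.8, 0.9, 1.0])
t = 0.5
J = np.full((4, 4, 3), 0.4)
I = J * t + A.reshape(1, 1, 3) * (1 - t)   # hazy image via Eq. (1)
Iw = white_balance(I, A)
# Iw == (J / A) * t + 1 * (1 - t): same transmission, atmospheric light 1.
```

This is why a model trained only with A = 1 can still be applied to images with arbitrary (constant) atmospheric light.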

In the training data synthesis process, we also assume that the transmission coefficients are locally consistent. However, we do not hold this assumption in the dehazing process. Instead, we extract the haze-relevant features for every pixel of the input image by feeding the patch centred at that pixel to the Ranking-CNN. With these features, the regression model is applied to estimate the transmission $t(x)$ for each pixel $x$. To avoid artifacts near object edges, we further use the guided filter [He:2013:TPAMI] to smooth the initially estimated transmission for efficiency; Laplacian matting [Levin:2008:TPAMI] can be used instead to obtain more satisfactory results around edges. After obtaining the transmission and the atmospheric light for each pixel $x$, we can dehaze the input image by applying the ideal dehazing process in (2). Moreover, to avoid strong fluctuations of the recovered pixels when the transmission is very small, we impose a lower bound $t_0$ on the transmission. Thus we get the clear image as

$$\mathbf{J}(x) = \frac{\mathbf{I}(x) - \mathbf{A}}{\max\bigl(t(x),\, t_0\bigr)} + \mathbf{A}. \tag{8}$$
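The clamped recovery in (8) can be sketched as follows; the lower bound `t0 = 0.1` is an illustrative choice, as the excerpt leaves the value unspecified.

```python
import numpy as np

def recover_radiance(I, t, A, t0=0.1):
    """Invert Eq. (1) per pixel while clamping the transmission from below
    (Eq. (8)) to avoid amplifying noise where t(x) is very small."""
    t_clamped = np.maximum(t, t0)[..., None]   # broadcast over RGB channels
    return (I - A) / t_clamped + A

A = np.array([1.0, 1.0, 1.0])
t = np.array([[0.6, 0.02]])                    # second pixel below the bound
J = np.array([[[0.2, 0.4, 0.6], [0.3, 0.3, 0.3]]])
I = J * t[..., None] + A * (1 - t[..., None])  # synthesize via Eq. (1)
J_rec = recover_radiance(I, t, A)
```

The first pixel is recovered exactly; the second is deliberately under-corrected (0.86 rather than 0.3 per channel) because its transmission was clamped, trading residual haze for stability.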

As the exposure is determined according to the hazy scene, the dehazed image usually tends to be underexposed, i.e., the luminance of J(x) is usually much lower than that of the input $\mathbf{I}(x)$. Therefore, we adaptively increase the exposure as $\mathbf{J}'(x) = \gamma\,\mathbf{J}(x)$, where $\gamma$ is the exposure factor. As there are many regions that tend to be gray in the input hazy image, the dehazed image would be overexposed if $\gamma$ were too large. As a compromise, a log function is used in our method to set $\gamma$:

(9)

In this way, the exposure can be increased and overexposure can be avoided at the same time.

V Experiments

We first compare the dehazed results of our method and several previous methods on both synthetic and real benchmark images. Then we explore the influence of the ranking layer and quantitatively compare the features learned by our Ranking-CNN with previous haze-relevant features.

V-A Comparisons with Previous Approaches

He et al. [He:2009:CVPR] Tang et al. [Tang:2014:CVPR] Zhu et al. [Zhu:2015:TIP] Berman et al. [Berman:2016:CVPR] Ren et al. [Ren:2016:ECCV] Cai et al. [Cai:2016:TIP] Ours
Aloe 0.100 / 0.191 0.060 / 0.087 0.175 / 0.141 0.060 / 0.086 - / 0.195 0.089 / 0.096 0.051 / 0.070
Art 0.116 / 0.176 0.077 / 0.098 0.114 / 0.145 0.099 / 0.123 - / 0.210 0.094 / 0.122 0.061 / 0.076
Barn 0.079 / 0.089 0.061 / 0.063 0.075 / 0.079 0.128 / 0.049 - / 0.174 0.075 / 0.079 0.051 / 0.055
Bull 0.050 / 0.122 0.035 / 0.091 0.184 / 0.265 0.049 / 0.102 - / 0.337 0.110 / 0.202 0.023 / 0.061
Cones 0.084 / 0.102 0.043 / 0.044 0.106 / 0.110 0.055 / 0.071 - / 0.178 0.081 / 0.087 0.034 / 0.036
Dolls 0.061 / 0.110 0.038 / 0.069 0.152 / 0.201 0.067 / 0.095 - / 0.272 0.076 / 0.132 0.032 / 0.060
Flower 0.059 / 0.105 0.046 / 0.066 0.146 / 0.172 0.066 / 0.145 - / 0.239 0.098 / 0.135 0.045 / 0.068
Teddy 0.092 / 0.135 0.055 / 0.060 0.124 / 0.126 0.092 / 0.125 - / 0.167 0.082 / 0.089 0.054 / 0.061
Tsukuba 0.068 / 0.093 0.077 / 0.123 0.173 / 0.253 0.060 / 0.113 - / 0.329 0.117 / 0.182 0.077 / 0.125
Venus 0.042 / 0.074 0.046 / 0.103 0.159 / 0.239 0.051 / 0.163 - / 0.310 0.114 / 0.196 0.035 / 0.079
Average 0.075 / 0.120 0.053 / 0.080 0.141 / 0.173 0.073 / 0.107 - / 0.241 0.094 / 0.132 0.046 / 0.069
TABLE I: The errors on the stereo images of Dataset-Syn. Left values indicate the error in transmission; right values indicate the error in the image.
Fig. 7: Representative dehazed results on Dataset-Syn. We can see that Zhu et al. [Zhu:2015:TIP] usually underestimate the transmission, while [He:2009:CVPR] and [Tang:2014:CVPR] usually overestimate the haze, e.g., on the light pink pig, the light brown heads, the red teddy and the gray areas.

The dehazed results and comparisons can be found in Fig. 7, Fig. 8 and Fig. 9, which are obtained respectively on synthetic hazy images with ground-truth clear images and transmissions, on captured hazy images with known clear images, and on real benchmark hazy images without ground-truth. The experimental results show that our method achieves better results than several previous methods, both quantitatively and qualitatively. In our experiments, we implement our Ranking-CNN to learn haze-relevant features on top of the open-source deep learning framework Caffe [Jia:2014:caffe]. We reimplement the methods of [He:2009:CVPR] and [Tang:2014:CVPR], and directly use the published results or code of the other referenced methods [Fattal:2014:TOG, Zhu:2015:TIP, Berman:2016:CVPR, Ren:2016:ECCV, Cai:2016:TIP].

Methods: He = He et al. [He:2009:CVPR]; Tang = Tang et al. [Tang:2014:CVPR]; Zhu = Zhu et al. [Zhu:2015:TIP]; Berman = Berman et al. [Berman:2016:CVPR]; Ren = Ren et al. [Ren:2016:ECCV]; Cai = Cai et al. [Cai:2016:TIP].

Image   | He    | Tang  | Zhu   | Berman | Ren   | Cai   | Ours
Aloe    | 0.169 | 0.175 | 0.091 | 0.130  | 0.169 | 0.313 | 0.134
Art     | 0.136 | 0.087 | 0.073 | 0.090  | 0.079 | 0.231 | 0.064
Barn    | 0.054 | 0.046 | 0.070 | 0.087  | 0.061 | 0.117 | 0.041
Bull    | 0.064 | 0.075 | 0.046 | 0.065  | 0.087 | 0.206 | 0.053
Cones   | 0.093 | 0.064 | 0.053 | 0.057  | 0.104 | 0.213 | 0.057
Dolls   | 0.103 | 0.074 | 0.066 | 0.083  | 0.089 | 0.201 | 0.057
Flower  | 0.070 | 0.049 | 0.052 | 0.080  | 0.068 | 0.212 | 0.037
Teddy   | 0.126 | 0.108 | 0.067 | 0.141  | 0.129 | 0.197 | 0.089
Tsukuba | 0.065 | 0.060 | 0.072 | 0.057  | 0.655 | 0.259 | 0.048
Venus   | 0.048 | 0.059 | 0.053 | 0.105  | 0.071 | 0.190 | 0.050
Average | 0.093 | 0.080 | 0.064 | 0.089  | 0.092 | 0.214 | 0.063
TABLE II: The errors on Dataset-Cap. Values indicate error in dehazed image.
Fig. 8: Representative dehazed results on Dataset-Cap. Our method achieves satisfactory results on images with light haze, which illustrates its robustness.
Fig. 9: Representative results obtained by our approach and previous methods. Our method achieves visually better results on many real benchmark images. In particular, it suffers less from overestimation and color shifts, e.g., on the faces of the two actresses and the green trees.

In order to perform a quantitative comparison, like some previous methods [Tang:2014:CVPR, Ren:2016:ECCV, Cai:2016:TIP], we synthesize ten hazy images based on the stereo benchmark images published in [Scharstein:2002:IJCV, Scharstein:2003:CVPR, Scharstein:2007:CVPR]; this dataset is denoted as Dataset-Syn. Each image in this dataset has two types of ground-truth: a haze-free image and a ground-truth transmission map. To be fair, we follow the experimental set-up in [Tang:2014:CVPR] and set the transmission of each pixel as a function of its disparity. Table I shows the errors in transmission and in image of our method and of [He:2009:CVPR, Tang:2014:CVPR, Zhu:2015:TIP, Berman:2016:CVPR, Ren:2016:ECCV, Cai:2016:TIP]. The error in transmission is calculated between the estimated and ground-truth transmission maps, and the error in image is calculated between the dehazed and haze-free images. Overall, our method achieves the best results, with the lowest average errors in both estimated transmission and dehazed image among these methods. Two dehazed results are illustrated in Fig. 7: [He:2009:CVPR] usually overestimates the haze, e.g., on the light pink pig, the light brown heads, the red teddy and the gray areas. [Tang:2014:CVPR] also suffers from this problem, as the multi-scale dark channel features are the most important features in their method. In contrast, our method suffers less from overestimation.
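As a rough illustration, the synthesis above can be sketched with the standard atmospheric scattering model I = J·t + A·(1 − t). The exact disparity-to-transmission mapping of [Tang:2014:CVPR] is not reproduced here, so the normalization, the scale factor `k`, and the lower clip on `t` below are assumptions:

```python
import numpy as np

def synthesize_hazy(clear, disparity, airlight=1.0, k=1.0):
    """Synthesize a hazy image via the scattering model I = J*t + A*(1 - t).

    The disparity-to-transmission mapping here is only illustrative:
    larger disparity = closer scene point = higher transmission.
    """
    # Normalize disparity to [0, 1] as a proxy for scene depth.
    d = disparity.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-8)
    t = np.clip(k * d, 0.05, 1.0)      # per-pixel transmission, kept positive
    t3 = t[..., None]                  # broadcast over the RGB channels
    hazy = clear * t3 + airlight * (1.0 - t3)
    return hazy, t
```

Distant pixels (low disparity, hence low transmission) are pushed toward the airlight, which is exactly the degradation that the dehazing model must later invert.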

Though Dataset-Syn is synthesized following the hazy image formation model (1), the physical process may not follow it precisely. Therefore, inspired by the construction of the image matting benchmark [Rhemann:2008:CVPR, Rhemann:2009:CVPR], we design a process to directly capture a hazy image together with its corresponding clear version. We first use a Lenovo 22" monitor to display each clear image in Dataset-Syn and capture it with a Canon 650D DSLR camera. After that, an ultrasonic humidifier is used to fill vapour between the monitor and the camera, and the camera captures the hazy image with all other settings and parameters unchanged. Since everything except the vapour is unchanged, the captured clear image can be regarded as the ground-truth of its corresponding captured hazy image. This captured dataset is denoted as Dataset-Cap. Table II shows the errors in image of our method and of [He:2009:CVPR, Tang:2014:CVPR, Zhu:2015:TIP, Berman:2016:CVPR, Ren:2016:ECCV, Cai:2016:TIP]. Our method still achieves the best results. Two dehazed results are illustrated in Fig. 8. As Fig. 7 and Fig. 8 show, the haze in Dataset-Syn is dense while that in Dataset-Cap is light. Our method achieves the best performance on both datasets, which means that it is more robust to different haze densities than the other methods.

Finally, we conduct a subjective test to visually compare the results of our method and those of [He:2009:CVPR], [Tang:2014:CVPR], [Fattal:2014:TOG], [Berman:2016:CVPR], [Cai:2016:TIP] and [Ren:2016:ECCV] on benchmark images. For a fair comparison, we use the results published by [He:2009:CVPR], [Tang:2014:CVPR], [Fattal:2014:TOG], and generate the results using the code published by [Berman:2016:CVPR], [Cai:2016:TIP], [Ren:2016:ECCV]. Since each paper only publishes results on a portion of the images, we obtain a set of pair-wise comparisons in total. Fifteen subjects are invited to perform the experiment. All subjects have normal or corrected-to-normal visual acuity and normal color vision. The results are shown on a normal display placed in a room with fluorescent lamps. In each pair-wise comparison, four images are shown in a grid: the top-left is the input hazy image; the top-right is the ground-truth haze-free image (if it exists, otherwise the input hazy image); the bottom-left and bottom-right are the dehazed results of two methods, randomly assigned to the left or right. Each image is shown at a reduced resolution, and can be shown at its original resolution in a new window by clicking on it. Each subject is requested to observe each comparison and determine which dehazed result is better. Note that the identities of the compared methods are blind to the subjects. Among all pair-wise comparisons, our method achieves the first place and outperforms the other methods most often, while [He:2009:CVPR] takes the second place. These results, together with the objective performance, indicate that our method performs the best among the referenced methods in both objective and subjective experiments.

We also show some dehazed results on real-world images in Fig. 9. Our method achieves visually better results on many real benchmark images. In particular, it suffers less from overestimation and color shifts, e.g., on the faces of the actresses and the green trees.

V-B Performance Analysis

Beyond the performance comparisons, in this section we conduct a number of small experiments to validate our approach from multiple perspectives. For quantitative evaluation, we further generate hazy patches with synthetic transmissions as a validation set.

Features comparison. In the first experiment, we compare the performance of various types of features: the 64D features learned by the Ranking-CNN, the 64D features learned by the classical CNN (i.e., the Ranking-CNN with the ranking layer removed), the 325D features designed by [Tang:2014:CVPR], and the combination of the Ranking-CNN features with the features of [Tang:2014:CVPR]. The 325D features consist of multi-scale dark channel priors, local max contrasts, hue disparity and multi-scale local max saturation. For efficiency, we randomly select a subset of the training patches used by the Ranking-CNN to train the random forest regression model. Figure 10 shows the error in transmission on the validation set for each feature combination. The features from the Ranking-CNN outperform those from [Tang:2014:CVPR], and also achieve a lower error than the classical CNN features. If we combine our Ranking-CNN features with the features of [Tang:2014:CVPR], the error decreases only slightly, which means that our Ranking-CNN features not only capture most of the information in the previous hand-crafted features, but also automatically learn additional information from massive data. This experiment also shows that both structural features (e.g., CNN features) and statistical features (e.g., the features of [Tang:2014:CVPR]) are useful for transmission estimation.
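The regression step above can be sketched as follows; scikit-learn stands in for the paper's own random-forest implementation, and both the 64D features and the transmissions are synthetic stand-ins for the Ranking-CNN outputs:

```python
# Minimal sketch: regress transmission from 64D patch features with a
# random forest, then measure the mean absolute error in transmission
# (the left-hand quantity reported in Table I).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 64))                 # stand-in for Ranking-CNN features
# Synthetic transmission that depends on the features, with mild noise.
t = np.clip(X[:, 0] * 0.8 + 0.1 + rng.normal(0, 0.02, 1000), 0, 1)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X[:800], t[:800])                   # train on 800 patches
pred = rf.predict(X[800:])                 # predict on held-out patches
mae = np.abs(pred - t[800:]).mean()        # error in transmission
```

In the paper the same pipeline is fed real Ranking-CNN features extracted from massive hazy patches; only the regressor interface is shown here.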

Fig. 10: The errors on the validation set when different features are used for dehazing: the features of [Tang:2014:CVPR], the features learned by the classical CNN, the features learned by the Ranking-CNN, and the combination of the Ranking-CNN features with those of [Tang:2014:CVPR].

We further explore the importance of each feature dimension in the combined set, which consists of the 64D features from the Ranking-CNN and the 325D features of [Tang:2014:CVPR]. All these features are used to train a random forest regressor, from which the importance of each feature dimension can be obtained. As illustrated in Fig. 11, we plot the importance of each feature dimension obtained from the trained regressor. It is obvious that our Ranking-CNN features are more important than the previous features of [Tang:2014:CVPR]. Moreover, the summed importance of the Ranking-CNN features is much larger than that of the previous features, which shows that the Ranking-CNN features are powerful and remarkably outperform the heuristically designed features of [Tang:2014:CVPR].
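The importance analysis can be sketched like this; the data is synthetic and only the group sizes (64D learned + 325D hand-crafted) follow the paper, so the particular importance split below is an artifact of the toy target:

```python
# Sketch: concatenate two feature groups, fit a random forest, and sum
# the per-dimension importances of each group, as done for Fig. 11.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
cnn_feats = rng.random((n, 64))            # stand-in for Ranking-CNN features
hand_feats = rng.random((n, 325))          # stand-in for hand-crafted features
X = np.hstack([cnn_feats, hand_feats])
# Toy target that depends mostly on the first (learned) group.
t = np.clip(cnn_feats[:, :4].mean(axis=1) + 0.05 * hand_feats[:, 0], 0, 1)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, t)
imp = rf.feature_importances_              # normalized: sums to 1 over all dims
cnn_importance = imp[:64].sum()            # total importance of learned group
hand_importance = imp[64:].sum()           # total importance of hand-crafted group
```

`feature_importances_` is the mean impurity decrease attributed to each dimension across the trees, which is the quantity plotted per dimension in Fig. 11.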


Fig. 11: The importance of features learned from Ranking-CNN and features heuristically designed in [Tang:2014:CVPR].

We also compare the features generated from different layers by training the regression model on each. On the same validation dataset, the 64D features generated from the second fully-connected layer achieve a lower error than both the 64D features from the first fully-connected layer and the 128D features from the second pooling layer. This may imply that the features from deeper layers are more powerful. We thus adopt the 64D features generated from the second fully-connected layer for transmission estimation.

Fig. 12: The error in transmission on the validation set using our Ranking-CNN when the ranking layer is placed at different locations.
Fig. 13: The classification accuracy on the validation dataset when two ranking layers are placed at all possible locations in the CNN.

The location of the ranking layer. In the third experiment, we study the performance of the Ranking-CNN when the ranking layer is placed at different locations: after the first convolutional layer, after the first pooling layer (as in Fig. 5), after the second and third convolutional layers, and after the second pooling layer, respectively. Figure 12 shows the resulting error in transmission on the validation set after training. When the ranking layer is placed at the fourth layer (after the first pooling layer), our dehazing model achieves the minimal error. To explain this phenomenon, we rethink the problem from another perspective: what features would be extracted without the ranking layer? As stated in [Zeiler:2014:ECCV], the shallow layers of a classical CNN extract low-level features like boundaries and contrasts, while the deep layers extract high-level features like patterns and objects. In existing studies, haze has been shown to be tightly correlated with low-level structural features like boundaries [Meng:2013:ICCV], as well as with statistical information like the dark channel prior [He:2009:CVPR] and the color-lines prior [Fattal:2014:TOG]. By inserting the ranking layer after the first pooling layer, the network can make full use of the low-level structural features like boundaries, while additional statistical information is incorporated through the ranking operations. By fusing both low-level structural and statistical information, the proposed method achieves the best performance with the ranking layer at the fourth layer.
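For concreteness, here is a plausible numpy sketch of a parameter-free ranking layer: the forward pass sorts the responses within each feature map, converting activations into order statistics, and the backward pass scatters each gradient back to the position its value came from. The sorting region and other details of the authors' implementation may differ:

```python
import numpy as np

def ranking_forward(x):
    """x: (C, H, W) feature maps. Sort the values of each map ascending."""
    C, H, W = x.shape
    flat = x.reshape(C, -1)
    order = np.argsort(flat, axis=1)       # remember where each value came from
    ranked = np.take_along_axis(flat, order, axis=1).reshape(C, H, W)
    return ranked, order

def ranking_backward(grad_out, order, shape):
    """Route each output gradient back to its source position (no parameters)."""
    C, H, W = shape
    flat_grad = grad_out.reshape(C, -1)
    grad_in = np.empty_like(flat_grad)
    # grad_in[order[i]] = flat_grad[i]: the inverse of the forward permutation.
    np.put_along_axis(grad_in, order, flat_grad, axis=1)
    return grad_in.reshape(C, H, W)
```

Because the layer is only a permutation, gradients pass through unchanged, which is consistent with the convergence behavior discussed later.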

In the fourth experiment, we explore the performance of a Ranking-CNN that uses two ranking layers, placed at all possible pairs of locations in the CNN. As shown in Fig. 13, the best performance is achieved by placing one ranking layer at the fourth layer (i.e., after the first pooling layer) and the other at the seventh layer (i.e., after the second convolutional layer). However, the improvement over the Ranking-CNN with a single ranking layer is marginal on the validation dataset. Considering the additional computational cost of the ranking layer, we adopt a single ranking layer for image dehazing.

Fig. 14: The visualization of 64 filters randomly sampled from the second convolutional layer of the Ranking-CNN.

Visualizing the network. In the fifth experiment, we try to explain why the features learned by the Ranking-CNN are useful for image dehazing. We randomly select and visualize 64 filters from the second convolutional layer in Fig. 14. These filters provide cues as to which elements in a local patch of a feature map should be referred to when extracting haze-relevant features. For instance, the filter at the top-left corner may imply that the largest value in a local patch should be considered. This is somewhat similar to the mechanism of the dark channel prior, with the main difference that various types of haze-relevant features are extracted by referring to different combinations of elements in a local patch. In this manner, the Ranking-CNN extracts an over-complete set of haze-relevant features, which are then weighted and selected by the random forest regressor, leading to its impressive dehazing performance.

The size of training data. In the sixth experiment, we explore the influence of the amount of synthetic training data used to train the Ranking-CNN model. As shown in Fig. 15, the accuracy of the Ranking-CNN on the same validation set increases when two million training samples are used instead of one million, while the training time is doubled as well. This implies that our method could be further improved simply by generating more synthetic training data. Considering the efficiency of the training stage, we use 1 million training samples in all other experiments.

Fig. 15: The classification accuracy of two Ranking-CNN models on the same validation set, which are trained on one million or two million synthetic training samples, respectively. We can see that more training data generally bring better performance.
Fig. 16: The classification accuracy of our Ranking-CNN when different numbers of epochs are performed in the training process. Our Ranking-CNN converges as quickly as the classical CNN.

The convergence speed. In the seventh experiment, we compare the convergence speeds of the Ranking-CNN and the classical CNN. As shown in Fig. 16, the convergence speed of the Ranking-CNN is comparable to that of the classical CNN. This may be because, in the ranking layer, the partial derivatives of the loss function with respect to each output feature can be passed directly to its corresponding input feature. Since the ranking layer is parameter-free, adding it does not dramatically increase the difficulty of training the network.

Different regressors. In the second experiment, we test the performance of different types of regressors. Besides the random forest, we select three other regressors: a linear regressor, a logistic regressor and an SVM regressor (with a radial basis function kernel). On the same validation dataset, the random forest regressor achieves the lowest error, so we employ it for transmission estimation.
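A sketch of such a comparison with scikit-learn stand-ins might look as follows; the features and transmissions are synthetic, and the logistic regressor is omitted because scikit-learn's logistic model is a classifier rather than a regressor:

```python
# Fit several regressors on the same synthetic feature/transmission data
# and compare their mean absolute errors on a held-out split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.random((1500, 64))                        # stand-in patch features
t = np.clip(np.sin(3 * X[:, 0]) * 0.4 + 0.3, 0, 1)  # nonlinear toy transmission

Xtr, Xte, ttr, tte = X[:1000], X[1000:], t[:1000], t[1000:]
models = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "linear": LinearRegression(),
    "svr (rbf)": SVR(kernel="rbf"),
}
errors = {name: np.abs(m.fit(Xtr, ttr).predict(Xte) - tte).mean()
          for name, m in models.items()}
```

On a nonlinear target like this, the linear model is handicapped by construction; the point is only the evaluation protocol, not the ranking of regressors on real haze data.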

The end-to-end method. To explore the performance of an end-to-end method, we replace the output layer of our Ranking-CNN with a linear regression layer, so that the modified network directly predicts the transmission. Though this is more efficient, the performance is unsatisfactory: its mean error on the test data is higher than that of our method. The reason may be that, when the network is trained as a regressor to predict the transmission, the Euclidean loss is the common choice, which we also use in this experiment, whereas the classification network is trained with the soft-max loss. The soft-max loss function is steeper than the Euclidean loss, which means that the classification network can be updated more effectively. Moreover, since the classification network outputs a probability for each label rather than a single real value, the learned features tend to be more diverse.
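The steepness argument can be illustrated numerically: for a one-hot target, the cross-entropy −log p grows without bound as the probability p assigned to the true class shrinks, while the squared error (1 − p)² saturates at 1, so confidently wrong predictions receive a far stronger training signal under the soft-max loss:

```python
import math

def ce_loss(p_true):
    """Cross-entropy when the true class receives probability p_true."""
    return -math.log(p_true)

def l2_loss(p_true):
    """Squared error against a one-hot target for the same probability."""
    return (1.0 - p_true) ** 2

# As p_true shrinks, cross-entropy explodes while the l2 loss stays below 1.
losses = [(p, ce_loss(p), l2_loss(p)) for p in (0.5, 0.1, 0.01)]
```

This is only an intuition-building toy, not a reproduction of the paper's training curves.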

Running time. The last experiment concerns the running time. Our experiments are performed on a 3.1GHz PC with an NVIDIA GeForce GTX980 GPU, and our feature learning and extraction algorithm is implemented on top of Caffe. On the GPU, one training epoch over all the training samples and feature extraction for all patches each complete in a matter of seconds, and feature extraction with the Ranking-CNN takes a time on the same order as with the classical CNN. We use a C implementation of the random forest: training the regression model on our training samples takes about two minutes on the CPU, and predicting the initial transmissions of the patches takes on the order of seconds. The remaining parts of our method are implemented in MATLAB and take several seconds for a typical image. Our method achieves satisfactory performance quantitatively and qualitatively; its weakness is efficiency. Compared with several previous methods, our method takes more time, mainly because we extract features and estimate the transmission for every pixel according to its local patch. However, as the transmissions within a local patch are correlated, we could estimate the transmissions of multiple pixels simultaneously in future work, which could reduce the running time by more than an order of magnitude.

Vi Conclusion

This paper presents a method to dehaze an image based on features that are automatically learned from massive hazy images. To this end, a novel ranking layer is proposed to form the Ranking-CNN, which can learn haze-relevant features more effectively than the classical CNN. Equipped with the ranking layer, our Ranking-CNN can capture structural and statistical features simultaneously. Based on the learned features, a regression model is trained to predict haze density for effective haze removal. Experimental results show that our Ranking-CNN features are effective, and that the proposed dehazing method achieves satisfactory results on synthetic and real-world data. Since we extract features for every pixel, the weakness of our method is its efficiency, which should be improved in future work, e.g., by adopting the FCN framework [Long_2015_CVPR] to reduce redundant computations.

References