Joint Transmission Map Estimation and Dehazing using Deep Networks

08/02/2017 ∙ by He Zhang, et al. ∙ Johns Hopkins University

Single image haze removal is an extremely challenging problem due to its inherent ill-posed nature. Several prior-based and learning-based methods have been proposed in the literature to solve this problem and they have achieved superior results. However, most of the existing methods assume a constant atmospheric light model and tend to follow a two-step procedure involving prior-based methods for estimating the transmission map, followed by calculation of the dehazed image using a closed-form solution. In this paper, we relax the constant atmospheric light assumption and propose a novel unified single image dehazing network that jointly estimates the transmission map and performs dehazing. In other words, our new approach provides an end-to-end learning framework, where the inherent transmission map and the dehazed result are learned directly from the loss function. Extensive experiments on synthetic and real datasets with challenging hazy images demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods.


I Introduction

Haze is the obscuration of the lower atmosphere, typically caused by the presence of suspended particles in the air such as dust, smoke and other dry particulates. The presence of haze usually reduces the visibility range, thus affecting the quality of images captured by camera sensors that are later processed by computer vision systems. A sample hazy image is shown on the left side of Figure 1. It can be clearly observed that the existence of haze in an image greatly obscures the background scene. The problem of estimating a clear image from a single hazy input image is commonly referred to as dehazing. Image dehazing has attracted significant interest in the computer vision and image processing communities in recent years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21].

The deterioration of image quality is captured by the following mathematical model [22]:

$I(x) = J(x)\,t(x) + A\,(1 - t(x)),$  (1)

where $x$ is the location in the image coordinates, $I$ represents the observed hazy image, $J$ is the image before degradation, $A$ is the global atmospheric light, and $t$ is the transmission map. The transmission map contains the per-pixel attenuation information that affects the light reaching the camera sensor, and it is a function of the scene depth as shown below:

$t(x) = e^{-\beta d(x)},$  (2)

where $\beta$ is the attenuation coefficient of the atmosphere and $d$ is the depth map. One can view (1) as the superposition of two components: 1. the direct attenuation $J(x)\,t(x)$, and 2. the airlight $A\,(1 - t(x))$. Direct attenuation represents the effect of scattering and the eventual decay of light before it reaches the camera sensor. Airlight results from the scattering of environmental light and causes a shift in the apparent brightness of the scene; note that airlight is a function of the scene depth $d(x)$ and the global atmospheric light $A$. As can be observed from Eq. (1), image dehazing is an inherently ill-posed problem, which has been addressed in different ways. Many previous methods overcome this issue by using extra information such as multiple images of the same scene [7] or depth information [6] to determine a solution. However, no such extra information is available for the problem of single image dehazing. To tackle this issue, different priors have been incorporated into the optimization framework, such as the dark-channel prior [5], color-lines [23] and the haze-line prior [4]. For example, based on the observation that there always exists one channel that is significantly dark in captured outdoor images, the dark-channel prior [5] is leveraged in the optimization framework to guarantee that the dehazed images satisfy this prior. Different from the dark-channel prior, [4] leverages the haze-line prior, based on the observation that color clusters in a clear image can be approximated as haze-lines in RGB space. More recently, several learning-based methods have also been proposed, where different learning algorithms such as random forest regression and Convolutional Neural Networks (CNNs) are trained to predict the transmission map [3, 1, 2, 8]. Many existing methods make an important assumption of constant atmospheric light (meaning that the intensity of the atmospheric light is independent of its spatial location $x$) in the image degradation model (1) and tend to follow a two-step procedure: first, they learn the mapping from the input hazy image to its corresponding transmission map, and then, using the estimated transmission map, they calculate the clear image by reformulating Eq. (1) as

$J(x) = \dfrac{I(x) - A\,(1 - t(x))}{t(x)}.$  (3)
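To make the degradation model concrete, the following minimal NumPy sketch synthesizes a hazy image from Eqs. (1)-(2) and recovers the clear image with the closed-form inversion of Eq. (3). It assumes images normalized to [0, 1] and a scalar (constant) atmospheric light; the clipping of $t$ away from zero is a common numerical safeguard, not part of the model.

    import numpy as np

    def synthesize_haze(J, d, A=0.8, beta=1.0):
        """Eq. (1) with the transmission t = exp(-beta * d) from Eq. (2)."""
        t = np.exp(-beta * d)                      # per-pixel transmission
        t3 = t[..., None]                          # broadcast over RGB channels
        return J * t3 + A * (1.0 - t3), t

    def dehaze_closed_form(I, t, A=0.8, t_min=0.1):
        """Eq. (3): J = (I - A * (1 - t)) / t, with t clipped to avoid
        division by near-zero transmission."""
        t3 = np.clip(t, t_min, 1.0)[..., None]
        return np.clip((I - A * (1.0 - t3)) / t3, 0.0, 1.0)

    # Toy example: a random "clear" image and a linear depth ramp.
    J = np.random.rand(240, 320, 3)
    d = np.tile(np.linspace(0.0, 2.0, 320), (240, 1))
    I, t = synthesize_haze(J, d)
    J_hat = dehaze_closed_form(I, t)
    print(np.abs(J - J_hat).max())                 # ~0: exact when t > t_min

Most prior methods implement exactly this two-step recipe, estimating $t$ (and $A$) first and then inverting Eq. (3); the network proposed in this paper instead learns the inversion jointly.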
Fig. 1: Sample image dehazing result using the proposed method. Left: Input hazy image. Right: Dehazed result.

As a result, most of the previous methods consider the tasks of transmission map estimation and dehazing as two separate tasks, with the exception of Li et al. [8]. By doing so, they are unable to accurately capture the transformation between the transmission map and the dehazed image. Motivated by this observation, we relax the constant atmospheric light assumption [24, 25] and propose to jointly learn the transmission map and the dehazed image from an input hazy image using a deep CNN-based network. Relaxing the constant atmospheric light hypothesis within a certain adjustable limit not only allows us to exploit the benefits of multi-task learning but also enables us to regress on losses defined in the image space. By enforcing the network to learn the transmission map, we still follow the popular image degradation model (1). This joint learning enables the network to implicitly learn the atmospheric light and hence avoids the need for manual calculation. On the other hand, previous learning-based CNN methods [1, 2] use the Euclidean loss when generating the transmission map, which may result in blurry effects and hence poor-quality dehazed images [26]. To tackle this issue, we incorporate a gradient loss combined with an adversarial loss to generate better transmission maps with sharper edges.

Fig. 2: Overview of the proposed multi-task method for image dehazing. The proposed method consists of three modules: (a) Hazy feature extraction, (b) Transmission map estimation, and (c) Guided image dehazing. First, the transmission map is estimated from the input hazy image and concatenated with high-dimensional feature maps. These concatenated maps are fed into the guided dehazing module to estimate the dehazed image. The transmission map estimation module is trained using a GAN framework. The image dehazing module is trained by minimizing a combination of perceptual loss and Euclidean loss.

Figure 2 gives an overview of the proposed single image dehazing method. Our network consists of three parts: 1. Transmission map estimation, 2. Hazy image feature extraction, and 3. a dehazing network guided by the transmission map and hazy image features. The transmission map estimation module is learned using a combination of adversarial loss, gradient loss and pixel-wise Euclidean loss. The transmission maps from this module are concatenated with the output of the hazy image feature extraction module and processed by the dehazing network; hence, the transmission maps are directly involved in the dehazing procedure via the concatenation operator. The dehazing network is learned by optimizing a weighted combination of perceptual loss and pixel-wise Euclidean loss to generate perceptually better results. Shown in Figure 1 is a sample dehazed image produced by the proposed method.

This paper makes the following contributions:


  • A novel method for joint transmission map estimation and image dehazing using deep networks is proposed. This is enabled by relaxing the constant atmospheric light assumption, thus allowing the network to implicitly learn the transformation from the input hazy image to the transmission map and from the transmission map to the dehazed image.

  • We propose to use the recently introduced Generative Adversarial Network (GAN) framework for learning the transmission map.

  • By performing a joint learning of transmission map and image dehazing, we are able to minimize losses defined in the image space such as perceptual loss and pixel-wise Euclidean loss, thereby generating perceptually better results with high quality details.

  • Extensive experiments on synthetic and real image datasets are conducted to demonstrate the effectiveness of the proposed method.

II Related Work

We briefly review recent works on image dehazing and some commonly used losses in various CNN-based image reconstruction tasks.

II-A Single Image Dehazing

Early methods tend to address the dehazing problem by including certain prior assumptions. For example, the authors in [27] recover the contrast of each patch relying on the assumption that haze greatly decreases the contrast of color images. Kratz and Nishino [28] proposed to model the image with a factorial Markov random field in which the scene albedo and depth are two statistically independent latent layers. He et al. [5] proposed the dark-channel prior based on the observation that RGB images of outdoor scenes tend to have one channel that is significantly dark. Building on the dark-channel prior, Meng et al. [29] imposed a specific boundary constraint during the estimation of the transmission map. More recently, Berman et al. [4] proposed a non-local prior method based on the observation that the colors of a haze-free image can be well represented by a few hundred distinct colors that fall into several tight clusters in RGB space.

The success of CNNs in modeling the non-linear mapping between input and output has also inspired researchers to explore CNN-based algorithms for low-level vision tasks such as image dehazing [1, 2, 8]. Unlike previous prior-based methods for estimating the transmission map, Cai et al. [2] train an end-to-end CNN to directly estimate the transmission map from the input hazy image. More recently, Ren et al. [1] proposed a multi-scale deep architecture to regress the transmission map in a coarse-to-fine fashion. However, the methods of both Ren et al. [1] and Cai et al. [2] still rely on a two-step procedure and hence the whole algorithm is not optimized end-to-end. Most recently, Li et al. [8] proposed an all-in-one dehazing network, where a linear embedding is leveraged to encode the transmission map and the atmospheric light into a single variable. Though these CNN-based learning methods achieve superior performance over recent state-of-the-art methods, they limit their capabilities by learning a mapping only between the input hazy image and the transmission map. This is mainly because these methods are based on the popular image degradation model given by (1), which assumes a constant atmospheric light. In contrast, we relax this assumption and thus enable the network to learn a transformation from the input hazy image to the transmission map and from the transmission map to the dehazed image. By doing this, we are also able to use losses defined in the image domain to learn the network. In the following sub-sections, two different losses that we use to improve the performance of the proposed network are reviewed.

II-B Loss Functions

Loss functions form an important and integral part of a learning process, especially in CNN-based reconstruction tasks. Initial work on CNN-based image regression tasks optimized the pixel-wise L2-norm (Euclidean loss) or L1-norm between the predicted and ground truth images [30, 31, 32]. Since these losses operate at the per-pixel level, their ability to capture high-level perceptual/contextual details such as sharp edges and complicated contours is limited, and they tend to produce blurred results. In order to overcome this issue, we use two different loss functions: an adversarial loss and a perceptual loss for learning the transmission map and the dehazed image, respectively.

II-B1 Adversarial loss

The adversarial loss, formulated in the Generative Adversarial Networks (GAN) work by Goodfellow et al. [33], has been widely used for generating realistic images. A GAN consists of a generator and a discriminator that are jointly optimized. While the generator's goal is to synthesize images that are similar in distribution to the training images, the discriminator's job is to identify whether the images fed to it are real or synthesized (fake). After the success of this method in generating realistic images, the concept has been explored in different formulations for various applications such as data augmentation [34], paired and unpaired 2D/3D image-to-image translation [35, 36, 37, 38], image super-resolution [39], image inpainting [40, 41, 42] and image de-raining [43]. In our work, we propose to use the GAN framework as an additional loss function to guide the learning of the transmission map, which, when optimized appropriately, will generate realistic transmission maps.
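For reference, the underlying minimax objective from [33] is, in its standard unconditional form (the conditional variant used later additionally feeds the input hazy image to both networks):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].$$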

II-B2 Perceptual loss

Many researchers have argued and demonstrated through their results that it is better to optimize a perceptual loss function in various applications [44, 45, 46, 47]. The perceptual loss is usually defined using high-level features extracted from a pre-trained convolutional network. The aim is to minimize the perceptual difference between the reconstructed image and the ground truth image. Perceptually superior results have been obtained for both super-resolution and artistic style transfer [48, 49, 15, 50]. In this work, a perceptual loss based on the VGG-16 architecture [51] is used to train the network for dehazing.

III Proposed Method

The proposed network is illustrated in Figure 2 and consists of the following modules: 1. Transmission map estimation, 2. Hazy image feature extraction, and 3. Transmission-guided image dehazing. The first module learns to estimate transmission maps from the corresponding input hazy images, the second module extracts haze-relevant features from the input hazy image, and the third module learns to perform image dehazing by combining the feature information extracted from the hazy image with the estimated transmission map. In what follows, we explain these modules in detail.

III-A Transmission Map Estimation

The task of predicting the transmission map from a given input hazy image is considered a pixel-level image regression task. In other words, the aim is to learn a pixel-wise non-linear mapping from a given input image to the corresponding transmission map by minimizing the loss between them. In contrast to the method used by Ren et al. in [1], our method uses an adversarial loss in addition to the pixel-wise Euclidean loss to learn better quality transmission maps. Also, the network architecture used in this work is very different from the one used in [1].

For incorporating the adversarial loss, the transmission map estimation is learned in the Conditional Generative Adversarial Network (CGAN) framework [52]. Similar to earlier works on GANs for image reconstruction tasks [43, 53, 39], the proposed network for learning the transmission map consists of two sub-networks: a generator G and a discriminator D. The goal is to train G to produce samples from the training distribution such that the synthesized samples are indistinguishable from the actual distribution by the discriminator D. The generator sub-network G is motivated by the success of encoder-decoder structures in pixel-wise image reconstruction [54, 55, 53]. In this work, we adopt a 'U-Net'-based structure [54] as the generator for transmission map estimation. Rather than concatenating the symmetric layers during training, shortcut connections [56] are used to connect the symmetric layers with the aim of addressing the vanishing gradient problem in deep networks. To better capture the semantic information and make the generated transmission map indistinguishable from the ground truth transmission map, a CNN-based differentiable discriminator is used as a 'guidance' to guide the generator in generating better transmission maps. The proposed generator network is as follows (the shortcut connections are omitted here):

CP(15)-CBP(30)-CBP(60)-CBP(120)-CBP(120)-CBP(120)-CBP(120)-CBP(120)-TCBR(120)-TCBR(120)-TCBR(120)-TCBR(120)-TCBR(60)-TCBR(30)-TCBR(15)-TC(1)-TanH,

where C represents a convolutional layer, TC represents a transposed convolution layer, P indicates PReLU [57], B indicates batch normalization [58], and R denotes ReLU. The number in brackets represents the number of output feature maps of the corresponding layer.

To ensure that the estimated transmission map is indistinguishable from the ground truth, a learned discriminator sub-network is designed to classify whether each input image is real or fake. Inspired by the success of patch-based discriminators in distinguishing real from fake, we adopt a 70×70 patch discriminator, where 70×70 indicates the receptive field of the discriminator, to generate visually pleasing and sharper results. [59] also explores other ways to make the images sharper. The structure of the discriminator is defined as follows:

CB(48)-CBP(96)-CBP(192)-CBP(384)-CBP(384)-C(1)-Sigmoid.
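The layer strings above can be read as stacks of small reusable blocks. The following PyTorch sketch shows one plausible realization; the kernel size, stride and padding are our assumptions for illustration, as the paper does not list them in this notation, and the discriminator is assumed here to score the 1-channel transmission map.

    import torch.nn as nn

    def CP(cin, cout):   # convolution + PReLU
        return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                             nn.PReLU())

    def CBP(cin, cout):  # convolution + batch norm + PReLU
        return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                             nn.BatchNorm2d(cout), nn.PReLU())

    def TCBR(cin, cout): # transposed convolution + batch norm + ReLU
        return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                             nn.BatchNorm2d(cout), nn.ReLU())

    # Patch discriminator: CB(48)-CBP(96)-CBP(192)-CBP(384)-CBP(384)-C(1)-Sigmoid.
    discriminator = nn.Sequential(
        nn.Conv2d(1, 48, 4, stride=2, padding=1), nn.BatchNorm2d(48),
        CBP(48, 96), CBP(96, 192), CBP(192, 384), CBP(384, 384),
        nn.Conv2d(384, 1, 4, stride=1, padding=1), nn.Sigmoid(),
    )

The generator string CP(15)-…-TC(1)-TanH composes the same kinds of blocks in an encoder-decoder layout, with shortcut connections added between symmetric layers.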

Furthermore, we propose to employ a gradient-based loss function in order to enforce consistency between the gradients of the estimated and target transmission maps. The use of the gradient loss is inspired by its success in several other tasks such as depth estimation [60, 61].

III-B Hazy Feature Extraction and Guided Image Dehazing

A possible solution to image dehazing is to directly learn an end-to-end non-linear mapping between the estimated transmission map and the desired dehazed output. However, as shown in [53], while learning a mapping from a transmission-map-like image to an RGB color image is possible, one may lose some information due to the absence of the albedo and the lighting information.

To generate a better dehazed image and make the whole process (estimation of the transmission map and the dehazed image) end-to-end, we propose a deep transmission-guided network for single image dehazing by relaxing the assumption of constant atmospheric light. Inspired by guided filtering [62, 63, 64], where a guidance image is leveraged to guide the generation of high-quality results (e.g., a depth map), a set of convolutional layers with symmetric skip connections are stacked in front and serve as a hazy image feature extractor. The hazy feature extraction part extracts deep features from the input hazy image. These feature maps are then concatenated with the estimated transmission map, and the concatenation is fed into the guided image dehazing module. This module consists of another set of CNN layers with non-linearities and essentially acts as a fusion CNN whose task is to learn a mapping from the transmission map and the high-dimensional feature maps to the dehazed image. (Note that our network is quite different from the network proposed in [62] in the sense that the proposed network is a multi-task learning network with a single input, while the network in [62] is a single-task network with two inputs.) To learn this network, a perceptual loss function based on the VGG-16 architecture [51] is used in addition to the pixel-wise Euclidean loss. The use of perceptual loss greatly enhances the visual appeal of the results. Details of the network structure for the hazy feature extraction and guided image dehazing modules are as follows:

CP(20)-CBP(40)-CBP(80)-C(1)-Conca(2)-CP(80)-CBP(40)-CBP(20)-C(3)-TanH,

where Conca indicates concatenation.

In summary, a non-linear mapping from the input hazy image and transmission map to dehazed image is learned in a multi-task end-to-end fashion. By learning this mapping, we enforce our network to implicitly learn the estimation of atmospheric light, thereby avoiding the “manual” estimation as followed by some of the existing methods.
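A compact sketch of this pipeline is given below, under the assumption of stride-1 3×3 convolutions (the paper lists only layer types and channel widths) and with the symmetric skip connections omitted for brevity:

    import torch
    import torch.nn as nn

    class GuidedDehaze(nn.Module):
        """Hazy feature extraction + guided dehazing, following
        CP(20)-CBP(40)-CBP(80)-C(1)-Conca(2)-CP(80)-CBP(40)-CBP(20)-C(3)-TanH."""
        def __init__(self):
            super().__init__()
            conv = lambda cin, cout: nn.Conv2d(cin, cout, 3, padding=1)
            self.extract = nn.Sequential(          # hazy feature extraction
                conv(3, 20), nn.PReLU(),
                conv(20, 40), nn.BatchNorm2d(40), nn.PReLU(),
                conv(40, 80), nn.BatchNorm2d(80), nn.PReLU(),
                conv(80, 1),
            )
            self.fuse = nn.Sequential(             # guided image dehazing
                conv(2, 80), nn.PReLU(),
                conv(80, 40), nn.BatchNorm2d(40), nn.PReLU(),
                conv(40, 20), nn.BatchNorm2d(20), nn.PReLU(),
                conv(20, 3), nn.Tanh(),
            )

        def forward(self, hazy, t_map):
            feats = self.extract(hazy)             # 1-channel deep feature map
            x = torch.cat([feats, t_map], dim=1)   # Conca(2): fuse with t
            return self.fuse(x)                    # dehazed RGB estimate

    # Usage on a toy batch: hazy RGB input plus its estimated transmission map.
    J_hat = GuidedDehaze()(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))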

Fig. 3: Transmission map estimation results for Ablation 1 (Input, T-L2, T-L2-G, T-L2-G-GAN, Target). It can be observed that the gradient loss enables sharper edges and the GAN framework helps preserve the structural information of each object.

III-C Training Loss

As discussed earlier, the proposed method involves the joint learning of two tasks: transmission map estimation and dehazing. Accordingly, to train the network, we define two losses, $L^t$ and $L^d$, respectively for the two tasks.

III-C1 Transmission map loss

To overcome the issue of blurred results due to the minimization of the Euclidean error alone, the transmission map estimation network is learned by minimizing a weighted combination of the Euclidean error, an adversarial error and a gradient loss. The transmission map loss is defined as

$L^t = L^t_E + \lambda_a L^t_A + \lambda_g L^t_G,$  (4)

where $\lambda_a$ and $\lambda_g$ are two weights, $L^t_E$ is the pixel-wise Euclidean loss, $L^t_A$ is the adversarial loss and $L^t_G$ is the two-directional gradient loss. These losses are defined as follows:

$L^t_E = \dfrac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H}\left\| G_t(I)^{w,h} - t^{w,h} \right\|_2^2,$  (5)

$L^t_A = -\log\!\left(D_t\!\left(G_t(I)\right)\right),$  (6)

where $I$ is a 3-channel input hazy image, $t$ is the ground truth transmission map, $W \times H$ indicates the dimension of the input image and transmission map, $G_t$ is the generator sub-network for generating the transmission map and $D_t$ is the discriminator sub-network. The directional gradient loss, which has been used in other applications [65, 66], is defined as

$L^t_G = \sum_{w=1}^{W}\sum_{h=1}^{H}\left\| H_x\!\left(G_t(I)\right)^{w,h} - H_x(t)^{w,h} \right\|_2^2 + \left\| H_y\!\left(G_t(I)\right)^{w,h} - H_y(t)^{w,h} \right\|_2^2,$  (7)

where $H_x$ and $H_y$ are operators that compute image gradients along rows (horizontal) and columns (vertical), respectively, and $W \times H$ indicates the width and height of the output feature map.

Traditional techniques for transmission map estimation employ only the Euclidean loss ($L^t_E$) to learn the network weights. However, the Euclidean loss is known to introduce blur in the generated output. Hence, the use of additional loss functions (adversarial loss and gradient loss) incorporates further constraints into the learning framework. Specifically, the adversarial loss ($L^t_A$) enforces the network to generate transmission maps that are closer to the input distribution, and the gradient loss ($L^t_G$) ensures consistency between the gradients of the target and estimated transmission maps. The weights $\lambda_a$ and $\lambda_g$ are set using validation.
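The full transmission loss of Eq. (4) can be prototyped as in the sketch below; finite differences stand in for the gradient operators $H_x$ and $H_y$, and the weights are left as arguments since they are chosen by validation:

    import torch
    import torch.nn.functional as F

    def transmission_loss(t_pred, t_gt, d_fake, lambda_a, lambda_g):
        """Eq. (4): Euclidean + adversarial + two-directional gradient loss.
        d_fake is the discriminator score D_t(G_t(I)) on the predicted map."""
        l_e = F.mse_loss(t_pred, t_gt)                      # Eq. (5)
        l_a = -torch.log(d_fake + 1e-8).mean()              # Eq. (6)
        # Eq. (7): horizontal and vertical finite differences.
        dx_p = t_pred[..., :, 1:] - t_pred[..., :, :-1]
        dx_t = t_gt[..., :, 1:] - t_gt[..., :, :-1]
        dy_p = t_pred[..., 1:, :] - t_pred[..., :-1, :]
        dy_t = t_gt[..., 1:, :] - t_gt[..., :-1, :]
        l_g = F.mse_loss(dx_p, dx_t) + F.mse_loss(dy_p, dy_t)
        return l_e + lambda_a * l_a + lambda_g * l_g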

III-C2 Dehazing loss

The dehazing network is learned by minimizing a weighted combination of the pixel-wise Euclidean loss and the perceptual loss between the ground-truth dehazed image and the network output, defined as follows:

$L^d = L^d_E + \lambda_p L^d_P,$  (8)

where $\lambda_p$ is a weighting factor, $L^d_E$ is the pixel-wise Euclidean loss and $L^d_P$ is the perceptual loss, respectively defined as

$L^d_E = \dfrac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H}\left\| G_d(I)^{w,h} - J^{w,h} \right\|_2^2,$  (9)

$L^d_P = \dfrac{1}{C_j W_j H_j}\left\| V\!\left(G_d(I)\right) - V(J) \right\|_2^2,$  (10)

where $I$ is a 3-channel input hazy image, $J$ is the ground truth dehazed image, $W \times H$ is the dimension of the input image and the dehazed image, $G_d$ is the proposed network, $V$ represents a non-linear CNN transformation, and $C_j$, $W_j$, $H_j$ are the dimensions of a certain high-level layer of $V$. Similar to the idea proposed in [45], we aim to minimize the distance between high-level features along with the pixel-wise Euclidean loss. In our method, we compute the feature loss at layer relu3_1 in the VGG-16 model [51] (see https://github.com/ruimashita/caffe-train/blob/master/vgg.trainval.prototxt). Note that the dehazing loss is also back-propagated to the transmission estimation module.
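A sketch of the dehazing loss of Eqs. (8)-(10) using torchvision's VGG-16 (index 11 of the feature stack corresponds to relu3_1; the ImageNet normalization of the inputs is omitted here for brevity):

    import torch.nn.functional as F
    from torchvision.models import vgg16

    # Frozen feature extractor up to relu3_1; used only to compare
    # high-level representations, never updated.
    vgg_relu3_1 = vgg16(pretrained=True).features[:12].eval()
    for p in vgg_relu3_1.parameters():
        p.requires_grad = False

    def dehazing_loss(J_pred, J_gt, lambda_p):
        """Eq. (8): pixel-wise Euclidean loss plus VGG perceptual loss."""
        l_e = F.mse_loss(J_pred, J_gt)                            # Eq. (9)
        l_p = F.mse_loss(vgg_relu3_1(J_pred), vgg_relu3_1(J_gt))  # Eq. (10)
        return l_e + lambda_p * l_p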

III-D Discussion

Relaxing the condition of constant atmospheric light enables the network to be trained in an end-to-end fashion, allowing it to implicitly learn the transformation from the input hazy image to the transmission map and from the transmission map to the dehazed image. While this allows more flexibility in the learning process, it introduces more complexity into the model. Hence, to efficiently learn the network parameters, the transmission map is estimated explicitly, since it preserves information about the portion of light that is not scattered and reaches the camera. Furthermore, the additional adversarial and gradient loss functions introduce strong regularization, enabling better estimation of the transmission map.

IV Experiments

In this section, we present the details and results of various experiments conducted on synthetic and real datasets containing a variety of hazy conditions. First, we describe the datasets used in our experiments. Then, we discuss the details of the training procedure. Next, we discuss the results of the ablation study conducted to understand the improvements obtained by the various modules of the proposed method. Finally, we compare the results of the proposed network with recent state-of-the-art methods. Through these experiments, we attempt to demonstrate the superiority of the proposed method and the effectiveness of its various components.

Fig. 4: Dehazed image results and zoomed-in regions for Ablation 2 (Input, I-L2-noT, I-L2-T, I-L2-Per-T, Target). It can be observed that the introduction of the transmission map reduces color distortion and the use of perceptual loss enables high-quality dehazed results.

IV-A Datasets

Since it is extremely difficult to collect a dataset that contains a large number of hazy/clear/transmission-map image pairs, training and test datasets are synthesized using (1), following the idea proposed in [3, 2, 1]. All the training and test samples are obtained from the NYU Depth dataset [67]. More specifically, given a haze-free image, we randomly sample four pairs of atmospheric light $A$ and scattering coefficient $\beta$ to generate its corresponding hazy images and transmission maps. An initial set of 600 images is randomly chosen from the NYU dataset. From each image in this initial set, 4 training images are generated using randomly sampled atmospheric light and scattering coefficient, for a total of 2400 training images. In a similar way, a test dataset consisting of 300 images is obtained. We ensure that none of the training images appear in the test set. By varying $A$ and $\beta$, we generate training data covering a variety of conditions.
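The per-image synthesis step can be sketched as follows; the sampling ranges for $A$ and $\beta$ are illustrative assumptions, since the exact ranges are not listed here:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_training_samples(J, depth, n=4):
        """Given one clear NYU image J and its depth map, synthesize n
        (hazy image, transmission map) pairs via Eqs. (1)-(2)."""
        samples = []
        for _ in range(n):
            A = rng.uniform(0.5, 1.0)        # assumed atmospheric light range
            beta = rng.uniform(0.5, 1.5)     # assumed scattering range
            t = np.exp(-beta * depth)[..., None]
            I = J * t + A * (1.0 - t)        # Eq. (1)
            samples.append((I, t[..., 0]))
        return samples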

As discussed in [1, 3], the image content is independent of its corresponding depth. Even though the training images are from an indoor dataset [67] and hence the depths of all images are relatively shallow, we can modify the value of the attenuation coefficient $\beta$ to vary the haze concentration, so that the dataset can also be used for outdoor image dehazing. The experimental results also demonstrate the effectiveness of this training dataset.

To demonstrate the effectiveness of the proposed method on real-world data, we also created a test dataset including 30 hazy images downloaded from the Internet.

IV-B Training Details

The entire network is trained on an Nvidia Titan-X GPU using the Torch framework [68]. The weights $\lambda_a$ and $\lambda_g$ for the loss in estimating the transmission map and $\lambda_p$ for the loss in single image dehazing are chosen using the validation set. During training, we use ADAM [69] as the optimization algorithm with a fixed learning rate and a batch size of 10 images, and all training samples are resized to a fixed resolution. To efficiently train the multi-task network, we leverage a stage-wise training strategy: first, the transmission map estimation module is trained using the loss in Eq. (4); then, the entire network is fine-tuned using both Eq. (8) and Eq. (4).
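The stage-wise schedule can be summarized as the sketch below (a simplification: the discriminator's own update and hyper-parameter values are omitted; transmission_loss and dehazing_loss follow Eqs. (4) and (8)):

    import itertools
    import torch

    def train_stagewise(g_t, g_d, d_t, loader, epochs1, epochs2,
                        transmission_loss, dehazing_loss,
                        lambda_a, lambda_g, lambda_p):
        """Stage 1 trains the transmission module with Eq. (4); Stage 2
        fine-tunes the whole network with Eq. (4) + Eq. (8)."""
        opt_t = torch.optim.Adam(g_t.parameters())
        opt_all = torch.optim.Adam(itertools.chain(g_t.parameters(),
                                                   g_d.parameters()))
        for _ in range(epochs1):                    # Stage 1
            for I, t_gt, J_gt in loader:
                t_hat = g_t(I)
                loss = transmission_loss(t_hat, t_gt, d_t(t_hat),
                                         lambda_a, lambda_g)
                opt_t.zero_grad()
                loss.backward()
                opt_t.step()
        for _ in range(epochs2):                    # Stage 2: joint fine-tuning
            for I, t_gt, J_gt in loader:
                t_hat = g_t(I)
                J_hat = g_d(I, t_hat)               # dehazing guided by t_hat
                # The dehazing loss reaches g_t through t_hat, so the
                # transmission module also benefits from image-space losses.
                loss = (transmission_loss(t_hat, t_gt, d_t(t_hat),
                                          lambda_a, lambda_g)
                        + dehazing_loss(J_hat, J_gt, lambda_p))
                opt_all.zero_grad()
                loss.backward()
                opt_all.step()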

IV-C Ablation Study

In order to demonstrate the improvements obtained by different modules for both transmission maps and dehazed images, we conduct two ablation studies for estimating transmission maps and dehazed images, separately.

Ablation 1: This ablation study demonstrates the effectiveness of different modules in the transmission map estimation block and it consists of the following experiments:
1) Transmission map estimation using only L2 loss (T-L2),
2) Transmission map estimation using L2 loss and gradient loss (T-L2-G), and
3) Transmission map estimation using L2 loss, gradient loss and adversarial loss (T-L2-G-GAN).

Method             Input    T-L2     T-L2-G   T-L2-G-GAN   Target
Transmission map   0.4523   0.9052   0.9257   0.9388       1.0000
TABLE I: Quantitative SSIM results for Ablation 1 evaluated on the synthetic dataset (transmission map estimation).

Sample results are shown in Fig. 3. It can be observed that the introduction of the gradient loss (T-L2-G) eliminates halo artifacts near complicated edges [26]. Furthermore, the introduction of the discriminator (the GAN framework, T-L2-G-GAN) effectively refines local regions and enables sharper reconstructions, thereby preserving the structure of each object. Results of the quantitative analysis on the synthetic dataset are presented in Table I. The effect of the different modules in the proposed network can be clearly observed from this table.

Ablation 2: Similarly, another ablation study is conducted to demonstrate the improvements obtained by different modules for dehazing images. This ablation study involves the following experiments:
1) Image dehazing using L2 loss without estimation of transmission map (I-L2-noT),
2) Image dehazing using L2 loss with estimation of transmission map (I-L2-T), and
3) Image dehazing using L2 loss and perceptual loss with estimation of transmission map (I-L2-Per-T).

Sample results are shown in Fig. 4. It can be observed that the method without transmission map estimation (I-L2-noT) is unable to accurately estimate the haze level and depth (both are inherently captured in the transmission map) and hence its dehazed results tend to contain some color distortion. The introduction of the transmission map estimation branch helps to generate better quality images, as can be seen by comparing the second and third columns in Fig. 4. Furthermore, the addition of the perceptual loss (I-L2-Per-T) generates better dehazed images with high-quality details (observe the zoomed-in parts in Fig. 4). We also compare the inference running time for each ablation study, as tabulated in Table III. It can be observed that multi-task learning results in a slight increase in training complexity and inference time; however, it leads to substantial improvements in dehazing quality. The introduction of the gradient and perceptual loss functions increases the training time but does not affect the inference time.

Method          Input    I-L2-noT   I-L2-T   I-L2-Per-T   Target
Dehazed image   0.7041   0.8835     0.9002   0.9133       1.0000
TABLE II: Quantitative SSIM results for Ablation 2 evaluated on the synthetic dataset (dehazed image).

Method     Input   I-L2-noT   I-L2-T   I-L2-Per-T
Time (s)   2.65    3.33       3.33     3.33
TABLE III: Average running time for Ablation 2 evaluated on the synthetic dataset.
Method         Input    He et al. [5]   Zhu et al. [70]   Ren et al. [1]   Berman et al. [4, 71]   Li et al. [8]   Our
Transmission   N/A      0.8739          0.8326            N/A              0.8675                  N/A             0.9388
Image          0.7041   0.8642          0.8567            0.8203           0.7959                  0.8842          0.9133
TABLE IV: Quantitative SSIM results on the synthetic dataset.
Fig. 5: Dehazing results on a synthetic image (Input, He et al. [5], Zhu et al. [70], Ren et al. [1], Berman et al. [4, 71], Li et al. [8], Our, Target), where the first row corresponds to the estimated transmission maps and the last row corresponds to the dehazed images.
Fig. 6: Visual comparison of dehazing results on synthetic images used by previous methods [2, 8] (Input, He et al. [5], Zhu et al. [70], Ren et al. [1], Berman et al. [4, 71], Li et al. [8], Our, Target).
Fig. 7: Qualitative comparison of dehazing on real-world images presented in previous dehazing papers (Input, He et al. [5], Zhu et al. [70], Ren et al. [1], Berman et al. [4, 71], Li et al. [8], Our). It can be observed from the highlighted regions that previous methods may produce undesirable effects such as artifacts and color over-saturation in the output images.

Fig. 8: Qualitative comparison of dehazing on the real-world dataset (Input, He et al. [5], Zhu et al. [70], Ren et al. [1], Berman et al. [4, 71], Li et al. [8], Our). Results on two sample images from a set of images downloaded from the Internet.

IV-D Comparison with State-of-the-art Methods

To demonstrate the improvements achieved by the proposed method, it is compared against recent state-of-the-art methods on synthetic and real datasets.

Evaluation on the synthetic dataset: The synthetic dataset, as described in Section IV-A, is used for training and evaluating the network. Due to the availability of ground-truth images, we conduct both qualitative and quantitative evaluations.

Figure 5 shows results of the proposed method compared with recent state-of-the-art methods ([5, 70, 4, 71, 1, 8]) on a sample image from the test split of the synthetic dataset. After carefully analyzing these results, we observed that the recent best methods result in either incomplete removal of haze or over-correction, which reduces the visual appeal of the image. Even though [4] is able to achieve good performance in the presence of moderate haze, its dehazed results tend to contain color shifts. In contrast, the proposed method achieves better dehazing for a variety of haze conditions. Similar observations can be made regarding the quality of the transmission maps estimated by the proposed multi-task method compared with existing methods. It can be noted that the previous methods are unable to accurately estimate the relative depth in a given image, resulting in lower-quality dehazed images. In contrast, the proposed method not only estimates high-quality transmission maps, but also achieves better quality dehazing.

The quantitative performance of the proposed method is compared against five state-of-the-art methods [5, 70, 1, 4, 8] using SSIM [72]. The quantitative results are tabulated in Table IV. It can be observed from this table that the proposed method achieves the best performance in terms of SSIM. Note that we have attempted to obtain the best possible results for the other methods by fine-tuning their respective parameters based on the source code released by the authors, and we kept the parameters consistent across all experiments. As the code released by [1, 8] does not output the estimated transmission map, transmission estimation results corresponding to [1, 8] are not included in the discussion.
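For reproducibility, SSIM [72] can be computed per image with scikit-image; a minimal sketch on hypothetical arrays (on older scikit-image versions, use multichannel=True in place of channel_axis):

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    # Hypothetical dehazed output and ground truth, float arrays
    # in [0, 1] with shape (H, W, 3).
    J_hat = np.random.rand(256, 256, 3)
    J_gt = np.random.rand(256, 256, 3)
    score = ssim(J_hat, J_gt, channel_axis=2, data_range=1.0)
    print(f"SSIM: {score:.4f}")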

Furthermore, we also evaluate the proposed method on the synthetic images used by previous methods [2, 8]. Results are shown in Fig. 6. It can be clearly observed that Berman et al. [4, 71] and the proposed method achieve the best visual performance among all. However, looking closer at the upper-right part of Fig. 6, the method of Berman et al. [4, 71] tends to introduce a color shift and hence degrades the overall performance.

Fig. 9: Qualitative comparison of dehazing on the real-world dataset (Input, He et al. [5], Zhu et al. [70], Ren et al. [1], Berman et al. [4, 71], Li et al. [8], Our). Top row: results on a sample image from the real-world dataset provided by previous methods. Bottom two rows: results on two sample images from a set of images downloaded from the Internet.

Evaluation on the real dataset: In addition to the synthetic dataset, we also conducted evaluation experiments on a real dataset consisting of hazy images from the real world, collected from the Internet. Since ground truths are not available for such images, we do not use this dataset for training and perform only qualitative evaluations.

Fig. 10: More dehazing results on real-world images. The first row shows the original hazy images and the second row shows the dehazed images produced by the proposed method.

Figure 7 shows a comparison of results on four sample images used in earlier methods. Yellow rectangles are used to highlight the improvements obtained using the proposed method. Though the existing methods seem to achieve good visual performance in the top row, it can be observed from the highlighted region that these methods may produce undesirable effects such as artifacts and color over-saturation in the output images. For the bottom two rows, the existing methods either make the image darker due to over-estimation of dark pixels or are unable to perform complete dehazing. For example, the learning-based methods [1, 8] underestimate the thickness of the haze, resulting in partial dehazing. Even though Berman et al. [4, 71] leave less haze in the output, the resulting images tend to be darker, as the haze-line is difficult to detect under heavy haze conditions. In contrast, the proposed method achieves near-complete dehazing with visually appealing results while avoiding undesirable effects in the output images.

Furthermore, we also illustrate three qualitative examples of dehazing results on real-world hazy images for the different methods. The methods of He et al. [5], Li et al. [8] and Ren et al. [1] perform well but tend to leave haze in the output, leading to a loss of color contrast. Even though Berman et al. [4, 71] perform better, they tend to over-estimate the haze level, resulting in darker output images. Overall, our proposed method is able to overcome the problems of the other methods and achieves the best visual performance.

In Fig. 9, we present a very challenging hazy image to illustrate the results. The visual comparison here also confirms our findings from the previous experiments. In particular, from the highlighted yellow rectangle, it can be observed that our method better recovers the Mandarin characters hidden behind the haze.

Through these experiments on real data, we demonstrate that the proposed method, although trained on a synthetic dataset, generalizes well to real-world conditions.

Run Time Comparison: The proposed method is also evaluated for its computational complexity. On average, our method processes 512×512 images at 18 frames per second (fps), thus providing real-time performance. Furthermore, the proposed method is compared against several recent methods, as shown in Table V. The proposed method is comparable in run time to Li et al. [8] while achieving better performance. On average, it takes about 3.3 s to dehaze an image.

Method     He et al. [5] (M)   Zhu et al. [70] (M)   Ren et al. [1] (M)   Berman et al. [4] (M)   Li et al. [8] (P)   Our (P)
Time (s)   25.08               3.92                  3.75                 8.41                    3.18                3.33
TABLE V: Average running time on the synthesized dataset. M: Matlab implementation, P: Python implementation.

IV-E Failure Cases

Fig. 11: Failure case of the proposed method (left: input image; right: dehazed image).

Although the proposed method generalizes well to most outdoor scenes, it results in saturation in certain regions of some images. For example, as shown in the dehazed image in Fig. 11, the central part of the sky is not recovered appropriately and looks over-exposed. This is primarily due to the rarity of similar samples during training, and it is a common problem in most existing methods.

Though the use of synthetic samples to avoid the need for expensive annotations has proven effective for single image dehazing, the performance gap between results on synthetic and real-world images illustrates some of the limitations of learning from synthetic data. Hence, it is necessary to explore new ways of leveraging synthetic data in order to obtain better generalization to real-world images.

V Conclusion

This paper presented a new multi-task end-to-end CNN-based network that jointly learns to estimate the transmission map and perform image dehazing. In contrast to existing methods that consider transmission map estimation and single image dehazing as two separate tasks, we bridge the gap between them via multi-task learning. This is achieved by relaxing the constant atmospheric light assumption in the standard image degradation model. In other words, we enforce the network to estimate the transmission map and use it for further dehazing, thereby still following the standard image degradation model. Experiments were conducted on multiple datasets (synthetic and real) and the results were compared against several recent methods. Further, detailed ablation studies were conducted to understand the significance of the different components of the proposed method.

Acknowledgement

This work was supported by an ARO grant W911NF-16-1-0126.

References