The Effectiveness of Instance Normalization: a Strong Baseline for Single Image Dehazing

05/08/2018 ∙ by Zheng Xu, et al. ∙ University of Maryland

We propose a novel deep neural network architecture for the challenging problem of single image dehazing, which aims to recover the clear image from a degraded hazy image. Instead of relying on hand-crafted image priors or explicitly estimating the components of the widely used atmospheric scattering model, our end-to-end system directly generates the clear image from an input hazy image. The proposed network has an encoder-decoder architecture with skip connections and instance normalization. We adopt the convolutional layers of the pre-trained VGG network as encoder to exploit the representation power of deep features, and demonstrate the effectiveness of instance normalization for image dehazing. Our simple yet effective network outperforms the state-of-the-art methods by a large margin on the benchmark datasets.




1 Introduction

Images captured in the wild often suffer from degraded visibility, color, and contrast caused by haze, fog and smoke. Recovering high-quality clear images from degraded images (a.k.a. image dehazing) is beneficial for both low-level image processing and high-level computer vision tasks. For image processing, dehazed images are more visually appealing. For computer vision, dehazed images can improve the robustness of vision systems that often assume clear images as input. Typical applications that benefit from image dehazing include image super-resolution, visual surveillance, and autonomous driving. Image dehazing is highly desired because of the increasing demand for deploying visual systems in real-world applications.

Image dehazing is a challenging problem. The effect of haze is caused by atmospheric absorption and scattering that depend on the distance of the scene points from the camera. In computer vision, the hazy image is often described by a simplified physical model, i.e., the atmospheric scattering model [25, 27, 15, 23],

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \tag{1}$$

where $I(x)$ is the observed hazy image, $J(x)$ is the scene radiance (clear image), $t(x)$ is the medium transmission map, and $A$ is the global atmospheric light. When the atmosphere is homogeneous, $t(x)$ can be further expressed as a function of the scene depth $d(x)$ and the scattering coefficient $\beta$ of the atmosphere as $t(x) = e^{-\beta d(x)}$. The goal of image dehazing is to recover the clear image $J(x)$ from the hazy image $I(x)$. Single image dehazing is particularly challenging. It is under-constrained because haze is dependent on many factors, including the unknown depth information that is difficult to recover from a single image.
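To make the model concrete, the following minimal sketch applies the standard form of the atmospheric scattering model, I(x) = J(x) t(x) + A (1 - t(x)) with t(x) = exp(-beta d(x)), to a toy single-channel image. The function names and the toy values are illustrative assumptions, not part of the proposed method. The inversion only works here because the depth and atmospheric light are known, which is exactly what a single hazy image does not provide.

```python
import math

def transmission(depth, beta):
    """t(x) = exp(-beta * d(x)) for a homogeneous atmosphere."""
    return [[math.exp(-beta * d) for d in row] for row in depth]

def synthesize_haze(clear, depth, airlight, beta):
    """Apply I(x) = J(x) t(x) + A (1 - t(x)) to a single-channel image.

    clear and depth are H x W nested lists; clear values lie in [0, 1].
    """
    t = transmission(depth, beta)
    return [
        [j * t_xy + airlight * (1.0 - t_xy) for j, t_xy in zip(j_row, t_row)]
        for j_row, t_row in zip(clear, t)
    ]

def recover_clear(hazy, depth, airlight, beta):
    """Invert the model when t and A are known: J = (I - A) / t + A."""
    t = transmission(depth, beta)
    return [
        [(i - airlight) / t_xy + airlight for i, t_xy in zip(i_row, t_row)]
        for i_row, t_row in zip(hazy, t)
    ]

clear = [[0.2, 0.8], [0.5, 0.1]]   # toy scene radiance J
depth = [[0.0, 1.0], [2.0, 5.0]]   # toy depth map d
hazy = synthesize_haze(clear, depth, airlight=0.9, beta=1.0)
restored = recover_clear(hazy, depth, airlight=0.9, beta=1.0)
```

Pixels at zero depth are unchanged, while distant pixels approach the atmospheric light, which is why haze removes contrast in far regions of a scene.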

The atmospheric scattering model (1) has been extensively used in previous methods for single image dehazing [10, 36, 38, 15, 26, 11, 4, 6]. These works either separately or jointly estimate the transmission map and the atmospheric light to generate the clear image from a hazy image. Due to the under-constrained nature of single image dehazing, the success of previous methods often relies on hand-crafted priors such as dark channel prior [15], contrast color-lines [11], color attenuation prior [44], and non-local prior [4]. However, these priors are not always satisfied in practice. For example, dark channel prior is known to be unreliable for areas that are similar to the atmospheric light.

More recent works learn convolutional neural networks (CNNs) to estimate components in the atmospheric scattering model for image dehazing [5, 28, 22, 24, 42, 41]. These methods are often trained with limited (synthetic) images, and use only a few layers of convolutional filters. The learned shallow networks have limited capacity to represent or process images, which makes it difficult for them to surpass the prior-based methods. In contrast, training deep neural networks with large-scale data has made significant progress and achieved state-of-the-art performance in many vision tasks [21, 35, 16]. Moreover, the deep features extracted by a pre-trained deep network are used as powerful image representations in many applications, such as domain invariant recognition [8], perceptual evaluation [43], and characterizing image statistics [12]. More recently, the architecture of CNNs itself has been recognized as a prior for image processing [40]. In this paper, we study how to unleash the power of deep networks for single image dehazing.

We propose an encoder-decoder architecture as an end-to-end system for single image dehazing. We exploit the representation power of deep features by adopting the convolutional layers of the deep VGG net [35] as our encoder, and pre-train the encoder on a large-scale image classification task [31]. We add skip connections with instance normalization between the encoder and decoder, and then train the decoder with both reconstruction loss and VGG perceptual loss [43]. We show that the recently proposed instance normalization [39], which is designed for image style transfer, is also effective in image dehazing. The proposed method effectively learns the statistics of clear images based on the deep feature representation, which benefits the dehazing process on the input image. Our approach outperforms the state-of-the-art results by a large margin on a recently released benchmark dataset [23], and performs surprisingly well in several cross-domain experiments. Our method depends on neither the explicit atmospheric scattering model nor the hand-crafted image priors, and only exploits the deep network architecture and pre-trained models to tackle the under-constrained single image dehazing problem. Our simple yet effective network can serve as a strong baseline for future study in this topic.

2 Related work

Traditional methods focus on representing human knowledge as priors for image processing. Tan [36] assumes higher contrast of clear images and proposes a patch-based contrast-maximization method. Fattal [10] assumes the transmission and surface shading are locally uncorrelated, and estimates the albedo of the scene. Dark channel prior (DCP) [15] assumes local patches contain low intensity pixels in at least one color channel and hence estimates the transmission map. Fast visibility restoration (FVR) [38] is a filtering approach based on atmospheric veil inference and corner preserving smoothing. Meng et al. [26] use boundary constraint and contextual regularization (BCCR), and Chen et al. [6] use gradient residual minimization (GRM) to suppress artifacts. Tang et al. [37] combine priors by learning with a random forest model. Color attenuation prior (CAP) [44] assumes a linear model of the brightness and the saturation and then learns the coefficients. Berman et al. [4] assume each color cluster in the clear image becomes a line in RGB space, and propose non-local image dehazing (NLD).

There is an increasing interest in applying convolutional neural networks (CNNs) for image dehazing. DehazeNet [5] and multi-scale convolutional neural networks (MSCNN) [28] are trained to estimate the transmission map. AOD-Net [22] estimates a new variable based on the transformation of the atmospheric scattering model. Zhang and Patel [42] and Li et al. [24] estimate the transmission map and atmospheric light by separate CNNs. Yang et al. [41] adversarially train generators for components of the atmospheric scattering model. These methods use relatively small CNNs and do not exploit pre-trained deep networks for image representation. A few days before our submission, we noticed a preprint [7] that also uses pre-trained deep networks. The proposed method is quite different from [7]: we use an encoder-decoder with skip connections, while Cheng et al. [7] only use feature maps extracted from one layer of the pre-trained network as input; we study instance normalization and demonstrate its effectiveness; we train an end-to-end system from hazy image to clear image, while Cheng et al. [7] estimate the transmission map and atmospheric light; we can generate impressive results without explicitly applying the atmospheric scattering model.

Deep neural networks can be used as “priors” for image generation and image processing. The architecture of CNNs itself can be a constraint for image processing [40] and image generation [20, 14]. Pre-trained deep networks can be used as general purpose feature extractors [8] and perceptual metrics [43]. The second-order information of the features extracted by a pre-trained network describes the style of images [12]. Instance normalization layers that effectively change the statistics of deep features are widely used for image style transfer [39, 9, 13, 17].

Figure 1: The proposed network: encoder-decoder with skip connections and instance normalization (IN); convolutional layers of pre-trained VGG [35] are used as encoder; reconstruction loss and VGG perceptual loss are used for training decoder and IN layers.

3 VGG-based U-Net with instance normalization

We propose an end-to-end encoder-decoder network architecture for single image dehazing, as shown in fig. 1. The input is a hazy image, and the output is the desired clear image. We introduce different components of the network in the following paragraphs of this section.

Encoder. Our encoder uses the convolutional layers of the VGG net [35] pre-trained on the ImageNet large-scale image classification task [31]. VGG net contains five blocks of convolutional layers, and we use the first three blocks and the first convolutional layer of the fourth block. Each block contains several convolutional layers, and each convolutional layer is equipped with ReLU as activation function. The width (number of channels) and size (height and width) of convolutional layers are shown in fig. 1. There is a max pooling layer of stride two between blocks, which enlarges the receptive field of higher layers. The width of the convolutional layers is doubled after the subsampling of feature maps by max pooling.
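As a rough illustration of how width and size evolve through the encoder, the sketch below lists the feature-map shape at each of the four stages. The channel widths (64, 128, 256, 512) are the standard VGG values and are an assumption here rather than numbers read from fig. 1; the function name is hypothetical. The sketch also assumes size-preserving 3x3 convolutions and 2x2 max pooling of stride two, as in standard VGG.

```python
def vgg_encoder_shapes(height, width):
    """Feature-map shapes (channels, height, width) at the four encoder stages.

    Assumes 3x3 convolutions with padding 1 (spatial size preserved) and
    2x2 max pooling with stride 2 between blocks, as in standard VGG.
    """
    widths = [64, 128, 256, 512]  # standard VGG channel widths, blocks 1-4
    shapes = []
    h, w = height, width
    for block, channels in enumerate(widths):
        if block > 0:
            h, w = h // 2, w // 2  # max pooling halves the spatial size
        shapes.append((channels, h, w))
    return shapes

shapes = vgg_encoder_shapes(256, 256)
```

Each pooling step halves the height and width while the number of channels doubles, so the deepest encoder features are eight times smaller per side than the input.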

The pre-trained VGG net is a powerful feature extractor for perceptual metrics [43] and image statistics [12]. Our encoder is deep and wide, and the extracted deep features are capable of capturing the semantic information of the input image. We fix the encoder during training to exploit the power of the pre-trained VGG net as “priors”, and to avoid overfitting to the relatively small number of samples in image dehazing datasets.

Decoder and skip connection. Our decoder is designed to be roughly symmetric to the encoder. The decoder also contains four blocks, and each block contains several convolutional layers. The last layer of each of the first three decoder blocks is a transposed convolutional layer that upsamples the feature maps. We use ReLU activation for convolutional and transposed convolutional layers except for the last layer, where we use Tanh as activation function.

We add skip connections from the output of the first convolutional layer of encoder blocks 1, 2, 3 to the input of decoder blocks 4, 3, 2, respectively, by concatenating (cat) the feature maps. Hence our deep encoder-decoder network has a U-Net [29, 19] structure, except that our skip connections are based on blocks instead of layers. We use trainable instance normalization for skip connections, and have instance normalization before each convolutional layer in the decoder except the first one. Our deep encoder-decoder network has large capacity, and the skip connections let information flow smoothly, which makes the large network easier to train.

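Instance normalization. We briefly review instance normalization [39], and discuss our motivation in applying instance normalization for single image dehazing. Let $x \in \mathbb{R}^{N \times C \times H \times W}$ represent the feature map of a convolutional layer from a minibatch of samples, where $N$ is the batch size, $C$ is the width of the layer (number of channels), and $H$ and $W$ are the height and width of the feature map. $x_{nijk}$ denotes the element at height $j$, width $k$ of the $i$th channel from the $n$th sample, and the instance normalization layer can be written as,

$$y_{nijk} = \gamma_i \, \frac{x_{nijk} - \mu_{ni}}{\sqrt{\sigma_{ni}^2 + \epsilon}} + \beta_i, \tag{2}$$

where $\gamma_i$ and $\beta_i$ are learnable affine parameters, $\epsilon$ is a very small constant, and

$$\mu_{ni} = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} x_{nijk}, \qquad \sigma_{ni}^2 = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( x_{nijk} - \mu_{ni} \right)^2$$

represent the mean and variance for each feature map per channel per sample.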
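The definition above can be sketched in a few lines of plain Python. This is an illustrative reference implementation over nested lists, not the code used in our experiments; the function name and the default value of the small constant (1e-5) are assumptions.

```python
import math

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance normalization over an N x C x H x W nested list.

    gamma and beta hold one learnable scale and shift per channel.
    """
    out = []
    for sample in x:
        normed_sample = []
        for i, fmap in enumerate(sample):
            # Statistics are computed per sample and per channel.
            values = [v for row in fmap for v in row]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            scale = gamma[i] / math.sqrt(var + eps)
            normed_sample.append(
                [[scale * (v - mean) + beta[i] for v in row] for row in fmap]
            )
        out.append(normed_sample)
    return out

# Two single-channel samples that differ only by a factor of ten in intensity.
x = [[[[1.0, 2.0], [3.0, 4.0]]], [[[10.0, 20.0], [30.0, 40.0]]]]
y = instance_norm(x, gamma=[2.0], beta=[0.5])
```

After normalization the two samples are nearly identical, and each output feature map has mean equal to the shift parameter. This removes per-image intensity and contrast differences before the learned affine parameters set the target statistics.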

If we replace the instance level variables $\mu_{ni}, \sigma_{ni}$ with batch level variables $\mu_i, \sigma_i$ that are estimated for all samples of a minibatch, we get the well-known batch normalization layer [18]. We show that instance normalization is preferred over batch normalization for single image dehazing in our experimental ablation study.
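The difference between the two layers lies only in which elements the mean and variance are pooled over. The sketch below, with hypothetical helper names, computes both kinds of statistics for a toy minibatch; the batch statistics correspond to training-mode batch normalization and ignore the running averages used at test time.

```python
def channel_values(sample, i):
    """Flatten channel i of one C x H x W sample into a list."""
    return [v for row in sample[i] for v in row]

def mean_var(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def instance_stats(x):
    """One (mean, variance) pair per sample and channel: N x C entries."""
    return [
        [mean_var(channel_values(sample, i)) for i in range(len(sample))]
        for sample in x
    ]

def batch_stats(x):
    """One (mean, variance) pair per channel, pooled over the minibatch."""
    num_channels = len(x[0])
    return [
        mean_var([v for sample in x for v in channel_values(sample, i)])
        for i in range(num_channels)
    ]

# Two single-channel 2x2 feature maps: a dim sample and a bright sample.
x = [[[[0.0, 0.2], [0.2, 0.4]]], [[[0.6, 0.8], [0.8, 1.0]]]]
per_instance = instance_stats(x)  # separate statistics for each sample
per_batch = batch_stats(x)        # one set of statistics shared by both
```

In this toy batch the per-instance means differ between the dim and bright samples, while the batch statistics mix them into a single mean with a much larger variance. With batch statistics, the normalization of one image therefore depends on the other images in the minibatch.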

The learnable affine parameters of instance normalization shift the first and second order statistics (mean and variance) of the feature maps. Instance normalization is effective for image style transfer, and the style of images can be represented by learned affine parameters [9]. Shifting the statistics of deep features extracted by pre-trained networks has achieved impressive results for arbitrary style transfer [17]. Shifting the statistics of images is intuitive for dehazing; however, it can be difficult to decide the exact amount to change because haze depends on the unknown depth. The deep features extracted by a pre-trained VGG net contain semantic information to effectively infer depth for haze, and hence the learned affine parameters effectively shift the statistics of images. We apply instance normalization on the deep features extracted by the pre-trained VGG net for single image dehazing.

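Training loss. Our network is trained with both reconstruction loss and VGG perceptual loss. Denoting the training pairs of hazy image and clear image as $\{(I_m, J_m)\}_{m=1}^{M}$, we use the mean squared loss,

$$\min_{\theta} \; \frac{1}{M} \sum_{m=1}^{M} \left\| f_\theta(I_m) - J_m \right\|_2^2 + \lambda \left\| \phi(f_\theta(I_m)) - \phi(J_m) \right\|_2^2, \tag{3}$$

where $f_\theta$ denotes our network, $\theta$ represents the trainable instance normalization and decoder layers in our network, $\phi$ represents the perceptual function, and $\lambda$ is a hyperparameter. We set $\lambda$ to a fixed value, and use the features extracted by the first convolutional layer of the third block from the pre-trained VGG net as the perceptual function.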
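The structure of the combined loss can be sketched as follows. The feature function here is a hypothetical toy stand-in (differences of neighboring pixels) used only to keep the example self-contained; in our network the perceptual function is a layer of the pre-trained VGG net. The weight `lam` is left as a free parameter, and the images are flat lists for brevity.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def dehazing_loss(output, target, features, lam):
    """Reconstruction loss plus lam times the feature-space (perceptual) loss."""
    reconstruction = mse(output, target)
    perceptual = mse(features(output), features(target))
    return reconstruction + lam * perceptual

def toy_features(image):
    """Hypothetical stand-in for VGG features: neighboring pixel differences."""
    return [b - a for a, b in zip(image, image[1:])]

output = [0.1, 0.4, 0.3, 0.9]  # toy network output
target = [0.0, 0.5, 0.5, 1.0]  # toy groundtruth clear image
loss_reconstruction_only = dehazing_loss(output, target, toy_features, lam=0.0)
loss_combined = dehazing_loss(output, target, toy_features, lam=0.5)
```

Setting `lam` to zero recovers the plain reconstruction loss, which is the setting used for most entries of an ablation that isolates the effect of the perceptual term.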

4 Experiments

In this section, we conduct various experiments on both synthetic and natural images to demonstrate the effectiveness of the proposed method. The atmospheric scattering model is widely used to synthesize images for both training and testing. The hazy images are synthesized from groundtruth clear images and groundtruth depth images [23, 3], or estimated depth images [32].

We train our model on the recently released RESIDE-standard dataset [23]. RESIDE-standard contains 13,990 images for training, and 500 images for testing. These images are generated from existing indoor depth datasets, NYU2 [34] and Middlebury stereo [33]. The atmospheric scattering model is used, where the atmospheric light is randomly chosen in (0.7, 1.0) for each channel, and the scattering coefficient is randomly selected in (0.6, 1.8).
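A minimal sketch of drawing such synthesis parameters is given below. The uniform distribution and the function name are assumptions, since only the ranges are specified; the dataset's actual sampling procedure may differ.

```python
import random

def sample_haze_parameters(rng):
    """Draw one atmospheric light value per RGB channel and one scattering coefficient."""
    airlight = [rng.uniform(0.7, 1.0) for _ in range(3)]  # assumed uniform
    beta = rng.uniform(0.6, 1.8)                          # assumed uniform
    return airlight, beta

rng = random.Random(0)  # fixed seed so the draw is repeatable
airlight, beta = sample_haze_parameters(rng)
```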

We also apply our model trained on RESIDE-standard for cross-domain evaluation on the D-Hazy [3], I-Haze [1] and O-Haze [2] datasets. The D-Hazy dataset [3] is another synthetic dataset, which contains 23 images synthesized from Middlebury and 1449 images synthesized from NYU2, with its own setting of the atmospheric light and scattering coefficient. Though the D-Hazy dataset uses the same clean images as RESIDE-standard, the generated hazy images are quite different. I-Haze [1] and O-Haze [2] are two recently released datasets of natural indoor and outdoor images, respectively. I-Haze contains 35 pairs of indoor images and O-Haze contains 45 pairs of outdoor images, where the hazy images are generated by using a physical haze machine.

We compare our results quantitatively and qualitatively with previous methods. We compare with the prior-based methods DCP [15], FVR [38], BCCR [26], GRM [6], CAP [44] and NLD [4]. We also compare with the learning-based methods DehazeNet [5], MSCNN [28], and AOD-Net [22]. We have provided a brief review of these baseline methods in section 2. We use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics for quantitative evaluation. For the benchmark evaluation on RESIDE-standard, all the learning-based methods are trained on the same dataset. For cross-domain evaluation on D-Hazy, O-Haze and I-Haze, we use the released best pre-trained model for the learning-based baseline methods.
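For reference, PSNR follows directly from the mean squared error between the dehazed output and the groundtruth. The sketch below uses the standard definition on flat lists of pixel values in [0, 1]; the function name is hypothetical, and the benchmark's own evaluation code may differ in details such as value range or color handling. SSIM is more involved and is not sketched here.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [0.0, 0.0, 0.0, 0.0]
estimate = [0.1, 0.1, 0.1, 0.1]   # uniform error of 0.1, so MSE is 0.01
score = psnr(reference, estimate)  # about 20 dB
```

Because PSNR is logarithmic in the error, halving the root mean squared error raises the score by about 6 dB.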

We train our model by SGD with minibatch size 16 and learning rate 0.1 for 60 epochs, and linearly decrease the learning rate after 30 epochs. We use momentum 0.9 and weight decay for all our experiments. We will release our PyTorch code and pre-trained models.
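One way to read this schedule is sketched below: the learning rate stays at 0.1 for the first 30 epochs and then ramps down linearly. The end point of the ramp is not specified, so reaching zero at epoch 60 is an assumption, as are the function name and the zero-based epoch index.

```python
def learning_rate(epoch, base_lr=0.1, hold_epochs=30, total_epochs=60):
    """Learning rate for a zero-based epoch index.

    Constant for the first hold_epochs, then a linear ramp that would reach
    zero at total_epochs (the end point is an assumption).
    """
    if epoch < hold_epochs:
        return base_lr
    remaining = total_epochs - epoch
    return base_lr * remaining / (total_epochs - hold_epochs)

schedule = [learning_rate(epoch) for epoch in range(60)]
```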

4.1 Quantitative evaluation on benchmark dataset

Method           PSNR     SSIM
DCP [15]         16.62    0.8179
FVR [38]         15.72    0.7483
BCCR [26]        16.88    0.7913
GRM [6]          18.86    0.8553
CAP [44]         19.05    0.8364
NLD [4]          17.29    0.7489
DehazeNet [5]    21.14    0.8472
MSCNN [28]       17.57    0.8102
AOD-Net [22]     19.06    0.8504
Ours             27.79    0.9556

Table 1: Quantitative results on RESIDE-standard dataset [23].

We present the performance of our network and baseline methods on the RESIDE-standard benchmark dataset [23] in table 1. Our network and the learning-based baselines [5, 28, 22] are trained on the provided synthetic data, and evaluated on the separate testing set. We evaluate our results with the metrics provided by [23], and compare with the baseline results reported in [23]. The learning-based methods perform slightly better than the prior-based methods. CAP [44], which has a learning phase for the coefficients of its linear model, performs best in PSNR among the prior-based methods. DehazeNet [5], which uses a relatively small network to predict components, performs best in PSNR among the baseline methods.

Our approach outperforms all the baseline methods on both PSNR and SSIM by a large margin. The synthetic data for both training and testing are generated by the atmospheric scattering model, and the baseline methods explicitly use the atmospheric scattering model. In contrast, our approach only uses instance normalization to transform the statistics of deep features. The superior performance of our network on the benchmark dataset demonstrates the effectiveness of deep networks and instance normalization for single image dehazing.

4.2 Ablation study

PSNR    18.24     25.67     26.00     25.99     26.38
SSIM    0.7945    0.9442    0.9414    0.9385    0.9519

Skip    IN        NA        BN        IN        Perceptual
Dec     BN        IN        IN        IN        loss
PSNR    26.89     26.57     27.67     27.75     27.79
SSIM    0.9535    0.9381    0.9543    0.9549    0.9556

Table 2: Ablation study on RESIDE-standard dataset.

We provide more discussion on the proposed network. We verify the effectiveness of instance normalization with an ablation study on network structures, as shown in table 2. We use no normalization (NA), batch normalization (BN), or instance normalization (IN) for skip connections and decoders, respectively. The normalization layers are added before each convolutional layer of the decoder except for the first layer. All the results in table 2 are obtained by only using the reconstruction loss (i.e., setting the weight of the perceptual term to zero in loss function (3)) except for the last one, where IN and the combined loss are used. We train and evaluate our network on the RESIDE-standard dataset.

First, comparing the NA results in table 2 with previous best results in table 1, our encoder-decoder only achieves competitive results. Second, adding normalization to either skip connections or decoder significantly improves the performance of our network. The normalization layers for decoder are implicitly applied to the features from the skip connections, which makes the result of only normalizing decoder slightly better than only normalizing skip connections. Third, instance normalization works better than batch normalization, which demonstrates the effectiveness of shifting the mean and variance of deep features at instance level.

Finally, the perceptual loss only helps a little for quantitative evaluation, but it can help generate more visually appealing output images. We show a qualitative example in fig. 2, which includes the hazy input, the groundtruth clear image, and the outputs of our network without normalization layers and without perceptual loss (NA-NA), our network with instance normalization and without perceptual loss (IN-IN), and our network with instance normalization and perceptual loss (IN-IN-Percep). We enlarge the bottom left corner of the results to show more details. The results of IN-IN look much better than NA-NA. The enlarged area of the result with perceptual loss (IN-IN-Percep) looks sharper and clearer.

Figure 2: An example of qualitative results in ablation study. We zoom in the bottom left corner of the images to show more details in the second row.

4.3 Cross-domain evaluation

                 D-Hazy-NYU [3]     D-Hazy-MB [3]      I-Haze [1]         O-Haze [2]
Method           PSNR     SSIM      PSNR     SSIM      PSNR     SSIM      PSNR     SSIM
DCP [15]         11.56    0.6695    12.13    0.6752    13.41    0.4930    17.01    0.4875
CAP [44]         13.29    0.7266    14.36    0.7526    15.27    0.5603    16.68    0.4810
DehazeNet [5]    13.02    0.7256    13.78    0.7342    16.73    0.6263    17.90    0.5514
MSCNN [28]       13.67    0.7413    13.97    0.7488    15.93    0.5896    16.27    0.4947
AOD-Net [22]     12.44    0.7147    13.48    0.7470    15.00    0.5828    16.22    0.4142
Ours             18.11    0.8268    15.63    0.7338    16.04    0.6332    17.46    0.5337

Table 3: Quantitative results for cross-domain evaluation.

In this section, we focus on the cross-domain performance by evaluating our network trained on RESIDE-standard [23] on the cross domain datasets, D-Hazy [3], I-Haze [1] and O-Haze [2]. We compare with baseline methods that have publicly available code, and these are strong baselines according to benchmark evaluation in table 1. For learning-based methods DehazeNet [5], MSCNN [28], and AOD-Net [22], we use the best model the authors have released. MSCNN [28] and AOD-Net [22] are trained with synthetic images similar to RESIDE-standard, while DehazeNet [5] is trained with patches of web images.

We present the quantitative results in table 3, where we use bold to label the best results and underline to label the second best results. Our approach achieves the best results, or close to the best results, for all the cross-domain evaluations. Our first observation is that the learning-based methods [5, 28, 22], including ours, generalize reasonably well and perform as well as or better than the prior-based methods [15, 44].

Our network performs well on the cross-domain D-Hazy dataset [3]. Particularly, our approach outperforms all baseline methods by a large margin on the images synthesized from the NYU depth dataset. The D-Hazy dataset is synthesized from the same clear images as our training data RESIDE-standard, but uses different parameters of the atmospheric scattering model. This suggests that our trained network has effectively captured the statistics of the deep features of the desired clear images.

I-Haze [1] and O-Haze [2] images look quite different from our training images, and our network may have difficulty in inferring the exact statistics of deep features for these images. DehazeNet [5] may have gained some advantage on these two datasets because it is trained on patches of web images. Our approach still produces competitive results compared with DehazeNet [5], and outperforms all the other baselines. Notice again that our network does not use the powerful atmospheric scattering model, and is only trained on a limited number of indoor synthetic images. The cross-domain evaluation further demonstrates the power of deep features and instance normalization in our approach.

4.4 Qualitative evaluation

Figure 3: Qualitative evaluation on cross-domain dataset. The four examples are from D-Hazy-NYU [3], D-Hazy-MB [3], I-Haze [1] and O-Haze [2], respectively. Best viewed in color and zoomed in.

We present qualitative results from cross-domain evaluation in fig. 3. The images are from D-Hazy-NYU [3], D-Hazy-MB [3], I-Haze [1] and O-Haze [2], respectively. We show the hazy image and groundtruth clear image, and compare our results with DCP [15], CAP [44], DehazeNet [5], MSCNN [28], and AOD-Net [22]. We use the best released model for the learning-based baselines [5, 28, 22], and train our network on RESIDE-standard [23].

Our network makes the best effort to remove haze and recover the real color of images, as shown in fig. 3. The results of the baselines still have a large amount of undesired haze and look blurry (rows 2, 3, 4). Particularly, the baselines have difficulty in dark areas of the image, and DCP also has difficulty in areas of white and blue walls (rows 1, 3). For the outdoor image (row 4), our network produces some artifacts due to the significant domain difference between the desired images and the indoor training images. Using regularizers such as total variation [30] may help reduce these artifacts, and we plan to investigate it in the future. Our simple yet effective network generates visually appealing results, without depending on extra constraints like the atmospheric scattering model.

5 Discussion

We proposed a simple yet effective end-to-end system for single image dehazing. Our network has an encoder-decoder architecture with skip connections. We manipulated the statistics of deep features extracted by the pre-trained VGG net and demonstrated the effectiveness of instance normalization for image dehazing. Moreover, without explicitly using the atmospheric scattering model, our approach outperforms previous methods by a large margin on the benchmark datasets. Notice that both the training and testing data are generated by the atmospheric scattering model, and the baseline methods all explicitly use the model. Our network effectively learns the transformation from hazy image to clear image with limited synthetic data, and generalizes reasonably well.

The atmospheric scattering model is powerful and has been successfully deployed for image dehazing in the past decade. However, the atmospheric scattering model, as a simplified model, also constrains the learnable components to be “linearly” combined by element-wise multiplication and summation, which may not be ideal for training deep models. Our study sheds light on the power of deep neural networks and the deep features extracted by pre-trained networks for single image dehazing, and encourages rethinking how to effectively exploit the physical model for haze. How will the physical model help when training powerful deep networks? It is still an open question, and our approach serves as a strong baseline for future study.

Our network outperforms state-of-the-art methods by a large margin on the benchmark dataset, and achieves competitive results on cross-domain evaluation. The key idea of our approach is to apply instance normalization to shift the statistics of deep features for image dehazing. For cross-domain evaluation, it may be difficult to effectively infer the desired statistics of deep features of clear images that are quite different from the training data. We expect that the generalization ability can be significantly improved by training on large-scale natural images. In the future, we will explore adversarial training to use unpaired hazy and clear images that are easier to collect from the web.