Motivated by the success of deep learning in high-level vision tasks [19, 11, 13, 31], numerous deep models have been developed for low-level vision tasks, e.g., image super-resolution [6, 7, 5, 17, 18, 22], inpainting [27, 24], noise removal [4, 15, 34], image filtering [36, 24], image deraining [8, 38], and dehazing [28, 2]. Although these models achieve impressive performance, their network architectures strongly resemble those developed for high-level classification tasks.
Existing methods are based on either plain neural networks or residual learning networks. As demonstrated in [1, 27], plain neural networks cannot outperform state-of-the-art traditional approaches on a number of low-level vision problems, e.g., super-resolution. Low-level vision tasks usually involve the estimation of two components: low-frequency structures and high-frequency details. It is challenging for a single network to learn both components simultaneously. As a result, going deeper with plain neural networks does not always lead to better performance.
Residual learning has been shown to be an effective approach to achieving performance gains with deeper networks. Residual learning algorithms assume that the main structure is given and focus on estimating the residual (details) using a deep network. These methods work well when the recovered structures are perfect or nearly so. However, when the main structure is not well recovered, they perform poorly, because the final result is a combination of the structures and details. Figure 1 shows image super-resolution results by the VDSR method with structures recovered by different methods. The residual network cannot correct low-frequency errors in the structures (Figure 1(b)).
To address this issue, we propose a dual convolutional neural network (DualCNN) that can jointly estimate the structures and details. A DualCNN consists of two branches, one shallow sub-network to estimate the structures and one deep sub-network to estimate the details. The modular design of a DualCNN makes it a flexible framework for a variety of low-level vision problems. When trained end-to-end, DualCNNs perform favorably against state-of-the-art methods that have been specially designed for each individual task.
2 Related Work
Numerous deep learning methods have been developed for low-level vision tasks. A comprehensive review is beyond the scope of this work and we discuss the most related ones in this section.
The SRCNN  method uses a three-layer plain convolutional neural network (CNN) for super-resolution. As the SRCNN method is less effective in recovering image details, Kim et al.  propose the residual learning  algorithm based on a deeper network. The VDSR algorithm uses the bicubic interpolation of the low-resolution input as the structure of the high-resolution image and estimates the residual details using a 20-layer CNN. However, if the image structure is not well recovered, the generated result is likely to contain substantial artifacts, as shown in Figure 1.
Numerous algorithms based on CNNs have been developed to remove noise/artifacts [4, 15, 34] and unwanted components, e.g., rainy/dirty pixels [8, 38]. These methods are based on plain models, residual learning models, or recurrent models. In addition, these methods estimate either the output directly using a plain network or the details using a residual network. However, plain networks cannot recover fine details [13, 17], and residual networks cannot correct structural errors.
For edge-preserving filtering, Xu et al.  develop a CNN model to approximate a number of filters. Liu et al.  use a hybrid network to approximate a number of edge-preserving filters. These methods aim to preserve the main structures and remove details using a single network, but this imposes a difficult learning task. In this work, we show that it is critical to accurately estimate both the structures and the details for low-level vision tasks.
In image dehazing, existing CNN-based methods [2, 28] mainly focus on estimating the transmission map from an input. Given an estimated transmission map, the atmospheric light can be computed using the air light model. As such, errors in the transmission maps are propagated into the light estimation process. For more accurate results, it is necessary to jointly estimate the transmission map and atmospheric light in one model, which DualCNNs are designed for.
A common theme is that we need to design a new network for every low-level vision task. In this paper, we show that low-level vision problems usually involve the estimation of two components: structures and details. Thus we develop a single framework, called DualCNN, that can be flexibly applied to a variety of low-level vision problems, including the four tasks discussed above.
3 Proposed Algorithm
As shown in Figure 2, the proposed dual model consists of two branches, Net-S and Net-D, which estimate the structure and detail components of the target signal from the input, respectively. Take image super-resolution as an example: given a low-resolution image, we first use its bicubic upsampling as the input. Our dual network then learns the structures and details according to the formation model of the image decomposition.
Dual composition loss function.
Let $X$, $S$, and $D$ denote the ground truth label, the output of Net-S, and the output of Net-D, respectively. The dual composition loss function enforces that the recovered structure and detail can generate the ground truth label through the given formation model:
$$\mathcal{L}_X = \|\phi(S) + \psi(D) - X\|^2, \qquad (1)$$
where the forms of the functions $\phi$ and $\psi$ are known and depend on the domain knowledge of each task. For example, $\phi$ and $\psi$ are identity functions for image decomposition problems (e.g., filtering) and restoration problems (e.g., super-resolution, denoising, and deraining), in which case the formation model reduces to $X = S + D$. We will show that $\phi$ and $\psi$ can take more general forms to deal with specific problems.
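As a concrete illustration, the composition constraint of (1) for restoration tasks (identity $\phi$ and $\psi$) can be sketched in a few lines of NumPy. The function name and the toy structure/detail split below are illustrative choices, not from the paper:

```python
import numpy as np

def composition_loss(S, D, X, phi=lambda s: s, psi=lambda d: d):
    """Dual composition loss (1): || phi(S) + psi(D) - X ||^2."""
    residual = phi(S) + psi(D) - X
    return np.sum(residual ** 2)

# For restoration problems phi and psi are identities, so any exact
# structure/detail split of the label drives the loss to zero.
X = np.random.rand(8, 8)   # ground-truth label
S = np.clip(X, 0.2, 0.8)   # a coarse "structure" component
D = X - S                  # the complementary "detail" component
loss = composition_loss(S, D, X)   # ~0 for an exact split
```

Passing only one of the two components (e.g., setting the detail to zero) leaves a nonzero residual, which is exactly what the composition term penalizes.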
3.1 Regularization of the DualCNN Model
The proposed DualCNN model has two branches, which may cause instability if only the composition loss (1) is used. For example, if Net-S and Net-D have the same structure, symmetric solutions exist. To obtain a stable solution, we use individual loss functions to regularize the two branches. The loss functions for Net-S and Net-D are defined as
$$\mathcal{L}_S = \|S - S_{gt}\|^2, \qquad (2)$$
$$\mathcal{L}_D = \|D - D_{gt}\|^2, \qquad (3)$$
where $S_{gt}$ and $D_{gt}$ are the ground truths corresponding to the outputs of Net-S and Net-D. Consequently, the overall loss function used to train the DualCNN is
$$\mathcal{L} = \alpha \mathcal{L}_X + \beta \mathcal{L}_S + \gamma \mathcal{L}_D, \qquad (4)$$
where $\alpha$, $\beta$, and $\gamma$ are non-negative trade-off weights. Our framework can also use other loss functions, e.g., the perceptual loss for style transfer.
We use the SGD method to minimize the loss function (4) when training a DualCNN. In the training stage, the gradients with respect to the outputs of Net-S and Net-D can be obtained by
$$\frac{\partial \mathcal{L}}{\partial S} = 2\alpha\,\phi'(S)\big(\phi(S)+\psi(D)-X\big) + 2\beta\,(S - S_{gt}), \quad \frac{\partial \mathcal{L}}{\partial D} = 2\alpha\,\psi'(D)\big(\phi(S)+\psi(D)-X\big) + 2\gamma\,(D - D_{gt}), \qquad (5)$$
where $\phi'(S)$ and $\psi'(D)$ denote the derivatives of $\phi$ and $\psi$ with respect to $S$ and $D$.
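With identity $\phi$ and $\psi$, the gradient of the overall loss (4) with respect to the structure output reduces to $2\alpha(S+D-X)+2\beta(S-S_{gt})$, which can be verified against a central finite difference. This is a numerical sanity check written as a NumPy sketch; the weight names `a`, `b`, `g` and the random test values are assumptions for illustration:

```python
import numpy as np

def total_loss(S, D, X, S_gt, D_gt, a, b, g):
    # Overall loss (4) with identity phi and psi
    return (a * np.sum((S + D - X) ** 2)
            + b * np.sum((S - S_gt) ** 2)
            + g * np.sum((D - D_gt) ** 2))

def grad_S(S, D, X, S_gt, a, b):
    # Analytic gradient with phi'(S) = 1
    return 2 * a * (S + D - X) + 2 * b * (S - S_gt)

rng = np.random.default_rng(0)
S, D, X, S_gt, D_gt = (rng.standard_normal(6) for _ in range(5))
a, b, g = 0.1, 0.9, 0.9

# Central finite difference along the first coordinate
eps = 1e-6
e = np.zeros(6); e[0] = eps
numeric = (total_loss(S + e, D, X, S_gt, D_gt, a, b, g)
           - total_loss(S - e, D, X, S_gt, D_gt, a, b, g)) / (2 * eps)
analytic = grad_S(S, D, X, S_gt, a, b)[0]
```

Because the loss is quadratic in $S$, the central difference agrees with the analytic gradient up to rounding error.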
In the test stage, we compute the high-quality output $\tilde{X}$ from the outputs of Net-S and Net-D according to the formation model,
$$\tilde{X} = \phi(S) + \psi(D). \qquad (6)$$
3.2 Extension to Other Low-Level Vision Problems
Aside from image decomposition and restoration problems, the proposed model can handle other low-level vision problems by modifying the composition loss function (1). Here we use image dehazing as an example.
The image dehazing process can be described by the air light model,
$$I(x) = J(x)\,t(x) + A\,\big(1 - t(x)\big), \qquad (7)$$
where $I$ is the hazy image, $J$ is the haze-free image, $A$ is the atmospheric light, and $t$ is the medium transmission map, which describes the portion of the light that reaches the camera from scene surfaces. With the formation model (7), we let Net-S estimate the atmospheric light (i.e., $S = A$) and Net-D estimate the transmission map (i.e., $D = t$), and set $\phi(S) = S\,(1 - D)$ and $\psi(D) = J \odot D$ in (1) within the DualCNN framework. As a result, the composition loss function (1) for image dehazing becomes
$$\mathcal{L}_X = \|J \odot D + S\,(1 - D) - X\|^2, \qquad (8)$$
where the ground truth label $X$ is the hazy image and $J$ is its haze-free counterpart.
The other two loss functions (2) and (3) remain the same. In the training phase, we use the same method to generate the atmospheric light $A$ and the transmission map $t$, and to construct hazy/haze-free image pairs. The implementation details of the training stage are presented in Section 4.4.
In the test phase, the clear image can be reconstructed from the outputs of Net-D and Net-S, i.e.,
$$J(x) = \frac{I(x) - S\,\big(1 - D(x)\big)}{\max\big(D(x),\, t_0\big)}, \qquad (9)$$
where $t_0$ is used to prevent division by zero; a typical value is $t_0 = 0.1$.
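A minimal NumPy sketch of the air light model (7) and its inversion in the test phase makes the roles of the two branch outputs concrete. The array shapes and values below are arbitrary test data, not the paper's settings:

```python
import numpy as np

def synthesize_haze(J, t, A):
    """Air light model (7): I = J*t + A*(1 - t)."""
    return J * t + A * (1 - t)

def dehaze(I, S, D, t0=0.1):
    """Invert (7) with Net-S output S -> atmospheric light and
    Net-D output D -> transmission; t0 guards against division by zero."""
    return (I - S * (1 - D)) / np.maximum(D, t0)

J = np.random.rand(4, 4)       # haze-free image
t = np.full((4, 4), 0.6)       # transmission map (> t0 here)
A = 0.9                        # atmospheric light
I = synthesize_haze(J, t, A)
J_rec = dehaze(I, A, t)        # recovers J exactly when t > t0
```

When the estimated transmission exceeds $t_0$ everywhere, the inversion is exact; the clamp only affects thin-transmission regions where the division would otherwise be unstable.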
4 Experimental Results
We evaluate DualCNNs on several low-level vision tasks including super-resolution, edge-preserving smoothing, deraining and dehazing. The main results are presented in this section and more results can be found in the supplementary material. The trained models are publicly available on the authors’ websites.
Motivated by the success of SRCNN and VDSR for super-resolution, we use 3 convolution layers, each followed by a ReLU function, for the network Net-S. Following the SRCNN design, the filter sizes of the three layers are 9×9, 1×1, and 5×5, and the depths are 64, 32, and 1, respectively. For the network Net-D, we use 20 convolution layers, each followed by a ReLU function; as in VDSR, the filter size of each layer is 3×3 and the depth of each layer is 64. The batch size is set to be and the learning rate is . Although each branch of the proposed model is similar to SRCNN or VDSR, both our analysis and experimental results show that the proposed model is significantly different from these methods and achieves better results.
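One way to see why the deep branch suits detail estimation is its receptive field: assuming stride-1 convolutions with 3×3 kernels as in VDSR, a 20-layer stack sees a 41×41 neighborhood, while an SRCNN-style 9-1-5 stack sees only 13×13. The helper below is a generic sketch of this calculation, not code from the paper:

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolution layers:
    each k x k layer grows the field by (k - 1) pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

rf_net_d = receptive_field([3] * 20)   # 20 layers of 3x3 -> 41
rf_net_s = receptive_field([9, 1, 5])  # SRCNN-style stack -> 13
```

The much larger receptive field of the deep branch lets it aggregate context for high-frequency details, while the shallow branch suffices for the smoother structure component.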
Figure 3: Super-resolution results: (a) GT, (b) Bicubic, (c) SelfEx, (d) SRCNN, (e) VDSR, (f) ESPCN, (g) SRGAN, (h) Ours.
Average running time (seconds): 0.88, 99.04, 0.55, 4.85, 5.19.
Table 3: PSNR results of Xu et al., Liu et al., VDSR, Net-S, and Ours.
Figure 4: Filtering results: (a) Xu et al., (b) Liu et al., (c) Net-D, (d) Net-S, (e) Ours, (f) RTV.
4.1 Image Super-resolution
For image super-resolution, we generate the training data by randomly sampling 250 thousand patches from 291 natural images in . We apply a Gaussian filter to each ground truth label $X$ to obtain the structure ground truth $S_{gt}$. The detail ground truth $D_{gt}$ is the difference between the ground truth label $X$ and the structure $S_{gt}$.
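The structure/detail targets described above can be sketched with a separable Gaussian filter in NumPy. The filter width `sigma` below is an arbitrary choice for illustration, not the paper's setting:

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian smoothing with edge padding (NumPy only)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    pad = np.pad(img, r, mode='edge')
    tmp = np.apply_along_axis(lambda m: np.convolve(m, k, 'valid'), 0, pad)
    return np.apply_along_axis(lambda m: np.convolve(m, k, 'valid'), 1, tmp)

X = np.random.rand(16, 16)     # ground-truth label
S_gt = gaussian_blur(X, 1.5)   # structure target for Net-S
D_gt = X - S_gt                # detail target for Net-D
```

By construction the two targets sum back to the label, so they satisfy the composition constraint exactly.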
For this application, we set $\phi$ and $\psi$ to be identity functions. The weights $\alpha$, $\beta$, and $\gamma$ in the loss function (4) are set to be , , and , respectively. To increase the accuracy, we use the pre-trained models of SRCNN and VDSR as the initializations of Net-S and Net-D.
We present quantitative and qualitative comparisons against the state-of-the-art methods including A+, SelfEx, SRCNN, ESPCN, SRGAN, and VDSR. Table 1 shows quantitative evaluations on benchmark datasets. Overall, the proposed method performs favorably against the state-of-the-art methods. Note that each branch of a DualCNN is similar to either SRCNN or VDSR. However, the results generated by a DualCNN have the highest average PSNR values, suggesting the effectiveness of the proposed dual model. Figure 3 shows some super-resolution results by the evaluated methods. The proposed algorithm preserves the main structures better than the state-of-the-art methods.
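The PSNR values reported in the comparisons follow the standard definition; a reference implementation, assuming image intensities in $[0, 1]$, is:

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((8, 8))
est = np.full((8, 8), 0.1)   # uniform error of 0.1 -> MSE = 0.01 -> 20 dB
```

For 8-bit images the same formula applies with `peak=255`; only the peak value changes, not the structure of the computation.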
4.2 Edge-preserving Filtering
Similar to the methods in [36, 24], we apply the DualCNN to learn edge-preserving image filters including L0 smoothing, relative total variation (RTV), and the weighted median filter (WMF). We generate the training data by randomly sampling 1 million patches (clear/filtered pairs) from 200 natural images in . Each image patch is of pixels, and the other settings for generating the training data are the same as those used in .
For this application, as our goal is to learn the filtered image, which does not contain rich details, we set the weights $\alpha$, $\beta$, and $\gamma$ in the loss function (4) to be , , and , respectively. We further let the structure ground truth $S_{gt}$ be the ground truth label $X$.
We evaluate the proposed DualCNN model against the methods of [36, 24] using the dataset from . Table 3 summarizes the PSNR results. Note that Xu et al. use image gradients to train their model, and their final results are reconstructed by solving a constrained optimization problem. Thus their method performs better when approximating L0 smoothing. However, our method does not need these additional steps and generates high-quality filtered images with significant improvements over the state-of-the-art deep-learning-based methods, particularly on RTV and WMF.
We note that the architecture of Net-D is similar to that of VDSR. As such, we retrain the network of VDSR for these problems. The results in Table 3 suggest that only using residual learning does not always generate high-quality filtered images.
Figure 4 shows the filtering results when approximating RTV. The state-of-the-art methods [36, 24] fail to smooth the structures (e.g., the eyes in the green boxes) that are supposed to be removed by the RTV filter (Figure 4(f)). In addition, the results with only one branch (i.e., Net-S) have lower PSNR values (Table 3) and retain some tiny structures (Figure 4(d)). In contrast, jointly learning structures and details yields more accurate results, and the filtered images are significantly closer to the ground truth.
4.3 Image Deraining
Deraining aims to recover clear contents from rainy images. This process can be regarded as separating the input into details (rain streaks) and structures (the clear image). We evaluate the proposed DualCNN on this task.
To train the proposed DualCNN for image deraining, we generate the training data by randomly sampling 1 million patches (rainy/clear pairs) from the rainy image dataset used in . The size of each image patch used in the training stage is pixels. Following the settings used in learning image filtering, we let the structure ground truth $S_{gt}$ be the ground truth label $X$ (i.e., the clear image patch). The weights $\alpha$, $\beta$, and $\gamma$ in the loss function (4) are set to be , , and , respectively. We use the test dataset  to evaluate the effectiveness of the proposed method.
Figure 5: Deraining results: (a) Input, (b) SPM, (c) ID-CGAN, (d) Net-D, (e) Net-S, (f) Ours.
Table 4: Deraining methods compared: SPM, PRM, CNN, GMM, ID-CGAN, Net-S, and Ours.
Figure 6: Real-example deraining results: (a) Input, (b) ID-CGAN, (c) Fu et al., (d) Ours.
Table 5: Dehazing methods compared: He et al., Meng et al., Ren et al., and Ours.
Figure 7: Dehazing results: (a) Input, (b) He et al., (c) Tarel et al., (d) Cai et al., Ren et al., (e) Estimated atmospheric light, (f) Estimated transmission map, (g) Ours.
Figure 5 shows deraining results from the evaluated methods. The proposed algorithm accurately estimates both clear details and structures from the input image. The plain CNN-based methods and Net-S all generate results with obvious rain streaks, demonstrating the advantage of simultaneously recovering structures and details with the DualCNN.
We further evaluate the DualCNN using real examples. Figure 6 shows a real example. We note that the algorithm in  develops a deep detail network for image deraining, where the derained images are obtained by extracting details from the input. However, its performance depends on whether the image decomposition step is able to extract the details. The results shown in Figure 6(c) demonstrate that this algorithm fails to generate clear images. In contrast, our method generates much clearer results than the state-of-the-art algorithms.
4.4 Image Dehazing
As discussed in Section 3.2, the proposed method can be applied to the image dehazing. Similar to the method in , we synthesize the hazy image dataset using the NYU depth dataset  and generate the training data by randomly sampling 1 million patches including hazy/clear pairs (/), atmospheric light (), transmission map (). The size of each image patch used in training stage is pixels. The weights , and are set to be 0.1, 0.9, and 0.9, respectively.
We quantitatively evaluate our method on the synthetic hazy images . As summarized in Table 5, the proposed method performs favorably against the state-of-the-art methods for image dehazing. The dehazed images in Figure 7 show that the proposed method recovers the atmospheric light (Figure 7(e)) and the transmission map (Figure 7(f)) well, thereby facilitating the recovery of the clear image (Figure 7(g)).
5 Analysis and Discussion
In this section, we further analyze the proposed method and compare it with the most related methods.
Effect of the architectures of DualCNN.
Lin et al.  develop a bilinear model to extract complementary features for fine-grained visual recognition. By contrast, the proposed DualCNN is motivated by the decomposition of a signal into structures and details. More importantly, the formulation of the proposed model facilitates incorporating the domain knowledge of each individual application. Thus, the DualCNN model can be effectively applied to numerous low-level vision problems, e.g., super-resolution, image filtering, deraining, and dehazing.
Figure 8: Results on image filtering and deraining: (a) Input, (b) GT, (c) SRCNN, (d) VDSR, (e) Cascade, (f) Ours.
Numerous deep learning methods have been developed based on a single branch for low-level vision problems, e.g., SRCNN and VDSR. One natural question is why deeper architectures do not necessarily lead to better performance. In principle, a sufficiently deep neural network has the capacity to solve any problem given enough training data. However, it is non-trivial to train very deep CNN models for these problems while ensuring high efficiency and simplicity.
For experimental validation, we use the SRCNN and a deeper model, i.e., VDSR, for image filtering and deraining. The experimental settings are discussed in Section 4.
Sample results using the VDSR model are shown in Figure 8. While the residual learning approach (i.e., VDSR) performs better than the plain CNN model (i.e., SRCNN), its generated images still contain blurry boundaries or rain streaks (Figure 8(d)).
Although the proposed DualCNN consists of two parallel branches, an alternative is to combine Net-S and Net-D in a cascaded manner, as shown in Figure 9. In this cascade model, the first stage estimates the main structure while the second stage estimates the details. This network architecture is similar to ResNet . However, the cascaded architecture does not generate results of the same quality as the proposed DualCNN (Figure 8(e) and Table 6).
Effect of the loss functions in DualCNN.
|($\beta$, $\gamma$)||(0, 0)||(1, 0)||(0, 1)||(9, 9)|
Different architectures in DualCNN.
We have used different network structures for the two branches of the DualCNN in the experiments in Section 4. It is also interesting to examine using the same structure for both branches. To this end, we set both branches of a DualCNN to the network structure of SRCNN and train the DualCNN with the same settings used in the image super-resolution experiment. The trained DualCNN generates results with higher average PSNR/SSIM values (30.3690/0.8603) than those of SRCNN (30.1496/0.8551) for upsampling on the Set5 dataset.
We further quantitatively evaluate the DualCNN with identical branches on image deraining using the synthetic rainy dataset . Similar to the image super-resolution experimental settings, the two branches of the DualCNN are set to the network structure of SRCNN  (SDCNN-S) and the network structure of VDSR  (SDCNN-D), respectively. Table 8 shows that the DualCNN with the deeper model generates better results when the architectures of the two branches are the same. However, the DualCNN where one branch is SRCNN and the other is VDSR performs better than SDCNN-D. This is mainly because the main structures of the input images are similar to those of the output images; using a deeper model for Net-S introduces errors in the learning stage.
In this paper, we propose a novel dual convolutional neural network, called DualCNN, for low-level vision tasks. From an input signal, the DualCNN recovers the structure and detail components, which can generate the target signal according to the problem formulation of a specific task. We analyze the properties of the DualCNN and show that it is a generic framework that can be effectively and efficiently applied to numerous low-level vision tasks, including image super-resolution, filtering, deraining, and dehazing. Experimental results show that the DualCNN performs favorably against state-of-the-art methods that have been specially designed for each task.
-  H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.
-  B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. Dehazenet: An end-to-end system for single image haze removal. IEEE TIP, 25(11):5187–5198, 2016.
-  Y.-L. Chen and C.-T. Hsu. A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In ICCV, pages 1968–1975, 2013.
-  C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In ICCV, pages 576–584, 2015.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199, 2014.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE TPAMI, 38(2):295–307, 2016.
-  C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In ECCV, pages 391–407, 2016.
-  D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In ICCV, pages 633–640, 2013.
-  X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE TIP, 26(6):2944–2956, 2017.
-  X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley. Removing rain from single images via a deep detail network. In CVPR, pages 3855–3863, 2017.
-  R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
-  K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. In CVPR, pages 1956–1963, 2009.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
-  V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In NIPS, pages 769–776, 2008.
-  L.-W. Kang, C.-W. Lin, and Y.-H. Fu. Automatic single-image-based rain streaks removal via image decomposition. IEEE TIP, 21(4):1742–1755, 2012.
-  J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.
-  J. Kim, J. K. Lee, and K. M. Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, pages 1637–1645, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
-  Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown. Rain streak removal using layer priors. In CVPR, pages 2736–2744, 2016.
-  R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia. Video super-resolution via deep draft-ensemble learning. In ICCV, pages 531–539, 2015.
-  T.-Y. Lin, A. Roy Chowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.
-  S. Liu, J. Pan, and M.-H. Yang. Learning recursive filters for low-level vision via a hybrid neural network. In ECCV, pages 560–576, 2016.
-  D. R. Martin, C. C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, pages 416–425, 2001.
-  G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan. Efficient image dehazing with boundary constraint and contextual regularization. In ICCV, pages 617–624, 2013.
-  J. S. J. Ren, L. Xu, Q. Yan, and W. Sun. Shepard convolutional neural networks. In NIPS, pages 901–909, 2015.
-  W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang. Single image dehazing via multi-scale convolutional neural networks. In ECCV, pages 154–169, 2016.
-  W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874–1883, 2016.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, pages 746–760, 2012.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, pages 1988–1996, 2014.
-  J. Tarel, N. Hautière, L. Caraffa, A. Cord, H. Halmaoui, and D. Gruyer. Vision enhancement in homogeneous and heterogeneous fog. IEEE Intell. Transport. Syst. Mag., 4(2):6–20, 2012.
-  R. Timofte, V. D. Smet, and L. J. V. Gool. A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pages 111–126, 2014.
-  J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In NIPS, pages 350–358, 2012.
-  L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0 gradient minimization. ACM TOG, 30(6):174:1–174:12, 2011.
-  L. Xu, J. S. J. Ren, Q. Yan, R. Liao, and J. Jia. Deep edge-aware filters. In ICML, pages 1669–1678, 2015.
-  L. Xu, Q. Yan, Y. Xia, and J. Jia. Structure extraction from texture via relative total variation. ACM TOG, 31(6):139:1–139:10, 2012.
-  H. Zhang, V. Sindagi, and V. M. Patel. Image de-raining using a conditional generative adversarial network. CoRR, abs/1701.05957, 2017.
-  Q. Zhang, L. Xu, and J. Jia. 100+ times faster weighted median filter (WMF). In CVPR, pages 2830–2837, 2014.