Multi-level Wavelet Convolutional Neural Networks

07/06/2019 ∙ by Pengju Liu, et al. ∙ 7

In computer vision, convolutional networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and is thus detrimental to further operations such as feature extraction and analysis. Recently, dilated filtering has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect can cause a sparse sampling of input images with checkerboard patterns. To address this problem, we propose a novel multi-level wavelet CNN (MWCNN) model to achieve a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while at the same time increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct the high-resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement of dilated filtering and a generalization of average pooling, and can be applied not only to image restoration tasks but also to any CNN requiring a pooling operation. The experimental results demonstrate the effectiveness of the proposed MWCNN for tasks such as image denoising, single image super-resolution, JPEG image artifacts removal and object classification.




I Introduction

Nowadays, convolutional networks have become the dominant technique behind many computer vision tasks, e.g. image restoration [1, 2, 5, 3, 4] and object classification [6, 7, 8, 9, 10]. With continual progress, CNNs are extensively and easily learned on large-scale datasets, sped up by increasingly advanced GPU devices, and often achieve state-of-the-art performance in comparison with traditional methods. The popularity of CNNs in computer vision can be attributed to two aspects. First, existing CNN-based solutions dominate on several simple tasks by outperforming other methods by a large margin, such as single image super-resolution (SISR) [1, 2, 11], image denoising [5], image deblurring [12], compressed imaging [13], and object classification [6]. Second, CNNs can be treated as a modular part and plugged into traditional methods, which also promotes the widespread use of CNNs [12, 14, 15].

CNNs in computer vision can be viewed as a non-linear map from the input image to the target. In general, a larger receptive field improves the fitting ability of CNNs and promotes accurate performance by taking more spatial context into account. The receptive field can be enlarged by increasing the network depth, enlarging the filter size, or using a pooling operation. Increasing the network depth or enlarging the filter size inevitably results in higher computational cost. Pooling can enlarge the receptive field and guarantee efficiency by directly reducing the spatial resolution of the feature map, but it may result in information loss. Recently, dilated filtering [8] was proposed to trade off between receptive field size and efficiency by inserting “zero holes” in convolutional filtering. However, the receptive field of dilated filtering with a fixed factor greater than 1 only takes into account a sparse sampling of the input with checkerboard patterns, and thus inherently suffers from the gridding effect [16]. Based on the above analysis, one should be careful when enlarging the receptive field in order to avoid both an increased computational burden and a potential sacrifice in performance. As can be seen from Figure 1, even though DRRN [17] and MemNet [19] enjoy larger receptive fields and higher PSNR than VDSR [2] and DnCNN [5], they are nevertheless orders of magnitude slower.
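The three ways of enlarging the receptive field can be compared with the standard layer-by-layer recurrence. The sketch below is illustrative only: the layer stacks are hypothetical examples rather than any network from the paper, and a 2×2 stride-2 layer stands in for either pooling or a Haar DWT, since both cover the same input window.

```python
def receptive_field(layers):
    """Receptive field along one axis of a stack of layers.

    Each layer is a tuple (kernel_size, stride, dilation).
    """
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance, in input pixels, between adjacent units of the current layer
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical 6-layer stacks of 3x3 convolutions.
plain = [(3, 1, 1)] * 6
dilated = [(3, 1, 2)] * 6
# Three convolutions, one 2x2 stride-2 downsampling layer, three more convolutions.
downsampled = [(3, 1, 1)] * 3 + [(2, 2, 1)] + [(3, 1, 1)] * 3

print(receptive_field(plain))        # 13
print(receptive_field(dilated))      # 25
print(receptive_field(downsampled))  # 20
```

Downsampling enlarges the receptive field while every later convolution runs on a quarter of the pixels, whereas dilation keeps the full-resolution cost of the plain stack.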

Fig. 1: The running time vs. PSNR value of representative CNN models, including SRCNN [1], FSRCNN [18], ESPCN [4], VDSR [2], DnCNN [5], RED30 [20], LapSRN [3], DRRN [17], MemNet [19] and our MWCNN. The receptive field of each model is also provided. The PSNR and time are evaluated on Set5 with the scale factor $\times 4$, running on a GTX1080 GPU.

In an attempt to address the problems stated above, we propose an efficient CNN-based approach that trades off performance against efficiency. More specifically, we propose a multi-level wavelet CNN (MWCNN) that uses the discrete wavelet transform (DWT) to replace pooling operations. Due to the invertibility of DWT, no image information or intermediate features are lost by the proposed downsampling scheme. Moreover, both frequency and location information of feature maps are captured by DWT [21, 22], which helps preserve detailed texture when using a multi-frequency feature representation. For image restoration tasks, we adopt the inverse wavelet transform (IWT) with an expansion convolutional layer to restore the resolution of feature maps, with the U-Net architecture [23] as the backbone. In addition, element-wise summation is adopted to combine feature maps, thus enriching the feature representation.

In relation to relevant works, we show that dilated filtering can be interpreted as a special variant of MWCNN, and that the proposed method is more general and effective in enlarging the receptive field. Using an ensemble of such networks trained with embedded multi-level wavelet, we achieve PSNR/SSIM values that improve upon the best known results in image restoration tasks such as image denoising, SISR and JPEG image artifacts removal. For the task of object classification, the proposed MWCNN achieves higher performance than the same networks with pooling layers. As shown in Figure 1, although MWCNN is moderately slower than LapSRN [3], DnCNN [5] and VDSR [2], it has a much larger receptive field and achieves a higher PSNR value.

This paper is an extension of our previous work [24]. Compared to the former work [24], we propose a more general approach for improving performance, further extend it to a high-level task, and provide more analysis and discussion. To sum up, the contributions of this work include:

  • A novel MWCNN model to enlarge receptive field with better tradeoff between efficiency and restoration performance by introducing wavelet transform.

  • Promising detail preserving due to the good time-frequency localization property of DWT.

  • A general approach to embedding wavelet transform in any CNNs where pooling operation is employed.

  • State-of-the-art performance on image denoising, SISR, JPEG image artifacts removal, and classification.

The remainder of the paper is organized as follows. Sec. II briefly reviews the development of CNNs for image restoration and classification. Sec. III describes the proposed MWCNN model in detail. Sec. IV reports the experimental results in terms of performance evaluation. Finally, Sec. V concludes the paper.

II Related Work

In this section, the development of CNNs for image restoration tasks is briefly reviewed. In particular, we discuss relevant works on incorporating DWT in CNNs. Finally, relevant object classification works are introduced.

II-A Image Restoration

Image restoration aims at recovering the latent clean image $x$ from its degraded observation $y$. For decades, research on image restoration has been conducted from the viewpoints of both prior modeling and discriminative learning [25, 26, 27, 28, 29, 30]. Recently, with their booming development, CNN-based methods have achieved state-of-the-art performance over the traditional methods.

II-A1 Improving Performance and Efficiency of CNNs for Image Restoration

In early attempts, CNN-based methods did not work well on some image restoration tasks. For example, the methods of [31, 32, 33] could not achieve state-of-the-art denoising performance compared to BM3D [27] in 2007. In [34], a multi-layer perceptron (MLP) achieved performance comparable to BM3D by learning the mapping from noisy patches to clean patches. In 2014, Dong et al. [1] for the first time adopted only a 3-layer FCN without pooling for SISR, which realizes only a small receptive field but achieves state-of-the-art performance. Then, Dong et al. [35] proposed a 4-layer ARCNN for JPEG image artifacts reduction.

Recently, deeper networks have been increasingly used for image restoration. For SISR, Kim et al. [2] stacked a 20-layer CNN with residual learning and adjustable gradient clipping. Subsequently, several designs, for example, very deep networks [36, 5, 37], symmetric skip connections [20], residual units [11], Laplacian pyramids [3], and recursive architectures [38, 17], were also suggested to enlarge the receptive field. However, the receptive field of those methods is enlarged by increasing the network depth, which may have limited potential to extend to deeper networks.

For a better tradeoff between speed and performance, a 7-layer FCN with dilated filtering was presented as a denoiser by Zhang et al. [12]. Santhanam et al. [39] adopted pooling/unpooling to obtain and aggregate multi-context representations for image denoising. In [40], Zhang et al. proposed operating the CNN denoiser on downsampled subimages. Guo et al. [41] utilized a U-Net [23] based CNN as a non-blind denoiser. Owing to the special nature of SISR, the receptive field size and efficiency can be better traded off by taking the low-resolution (LR) images as input and zooming in on features with an upsampling operation [18, 4, 42]. Nevertheless, this strategy can only be adopted for SISR, and is not suitable for other tasks such as image denoising and JPEG image artifacts removal.

II-A2 Universality of Image Restoration

On account of the similarity of tasks such as image denoising, SISR, and JPEG image artifacts removal, a model suggested for one task may be easily extended to other image restoration tasks simply by retraining the same network. For example, both DnCNN [5] and MemNet [19] have been evaluated on all three of these tasks. Moreover, CNN denoisers can also serve as a kind of plug-and-play prior. Thus, any restoration task can be tackled by sequentially applying CNN denoisers within unrolled inference [12]. To provide an explicit functional for defining the regularization induced by denoisers, Romano et al. [14] further proposed a regularization-by-denoising framework. In [43] and [44], the LR image together with the blur kernel is incorporated into CNNs for non-blind SR. These methods not only promote the application of CNNs in low-level vision, but also present solutions for deploying CNN denoisers in other image restoration tasks.

II-A3 Incorporating DWT in CNNs

Several studies have also incorporated the wavelet transform into CNNs. Bae et al. [45] proposed a wavelet residual network (WavResNet) based on the finding that CNN learning can benefit from learning on wavelet subbands with features having more channels. For recovering missing details in subbands, Guo et al. [46] proposed a deep wavelet super-resolution (DWSR) method. Subsequently, deep convolutional framelets (DCF) [47, 48] were developed for low-dose CT and inverse problems. However, only one-level wavelet decomposition is considered in WavResNet and DWSR, which may restrict the application of the wavelet transform. Inspired by the viewpoint of decomposition, DCF processes each subband independently, which ignores the dependency between these subbands. In contrast, our MWCNN uses a multi-level wavelet transform to enlarge the receptive field with barely any increase in computational burden.

II-B Object Classification

AlexNet [6] is an 8-layer network for object classification, and for the first time achieved state-of-the-art performance over other methods on the ILSVRC2012 dataset. In this method, filters of different sizes are adopted for extracting and enhancing features. However, Simonyan et al. [7] found that using only $3 \times 3$ convolutional filters with a deeper architecture can realize a larger receptive field and achieve better performance than AlexNet. Yu et al. [8] adopted dilated convolution to enlarge the receptive field size without increasing the computational burden. Later, the residual block [9, 10], inception model [49], pyramid architecture [50], doubly CNN [51] and other architectures [53, 52] were proposed for object classification. Some variants of the pooling operation, such as parallel grid pooling [54] and gated mixture of second-order pooling [55], were also proposed to enhance the feature extractor or feature representation and thereby improve performance. In general, a pooling operation, such as average pooling or max pooling, is often adopted for downsampling features and enlarging the receptive field, but it can result in significant information loss. To avoid this downside, we adopt DWT as our downsampling layer in place of the pooling operation without changing the main architecture, resulting in more power for enhancing feature representation.

III Method

In this section, we first briefly introduce the concept of multi-level wavelet packet transform (WPT) and provide our motivation. We then formally present our MWCNN based on multi-level WPT, and describe its network architecture for image restoration and object classification. Finally, discussion is presented to analyze the connection of MWCNN with average pooling and dilated filtering.

(a) Multi-level WPT architecture
(b) Embed CNN blocks
Fig. 2: From WPT to MWCNN. Intuitively, WPT can be seen as a special case of our MWCNN without CNN blocks, as shown in (a) and (b). By inserting CNN blocks into WPT, we design our MWCNN as in (b). Our MWCNN is thus a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping.
Fig. 3: Multi-level wavelet-CNN architecture. It consists of two parts: the contracting and expanding subnetworks. Each solid box corresponds to a multi-channel feature map, and the number of channels is annotated on the top of the boxes. The number of convolutional layers is set to 24. Moreover, our MWCNN can be further extended to higher levels (e.g., $\geq 4$) by duplicating the configuration of the 3rd level subnetwork.

III-A From multi-level WPT to MWCNN

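Given an image $x$, we can use 2D DWT [56] with four convolutional filters, i.e. the low-pass filter $f_{LL}$ and the high-pass filters $f_{LH}$, $f_{HL}$, $f_{HH}$, to decompose $x$ into four subband images, i.e. $x_{LL}$, $x_{LH}$, $x_{HL}$, and $x_{HH}$. Note that the four filters have fixed parameters with convolutional stride 2 during the transformation. Taking the Haar wavelet as an example, the four filters are defined as

$$
f_{LL} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\quad
f_{LH} = \begin{bmatrix} -1 & -1 \\ 1 & 1 \end{bmatrix},\quad
f_{HL} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \end{bmatrix},\quad
f_{HH} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}. \tag{1}
$$

It is evident that $f_{LL}$, $f_{LH}$, $f_{HL}$, and $f_{HH}$ are orthogonal to each other and form a $4 \times 4$ invertible matrix. The operation of DWT is defined as $x_{LL} = (f_{LL} \otimes x)\downarrow_2$, $x_{LH} = (f_{LH} \otimes x)\downarrow_2$, $x_{HL} = (f_{HL} \otimes x)\downarrow_2$, and $x_{HH} = (f_{HH} \otimes x)\downarrow_2$, where $\otimes$ denotes the convolution operator, and $\downarrow_2$ denotes the standard downsampling operator with factor 2. In other words, DWT mathematically involves four fixed convolution filters with stride 2 to implement the downsampling operator. Moreover, according to the theory of the Haar transform [56], the $(i, j)$-th values of $x_{LL}$, $x_{LH}$, $x_{HL}$ and $x_{HH}$ after the 2D Haar transform can be written as

$$
\begin{aligned}
x_{LL}(i,j) &= x(2i-1,2j-1) + x(2i-1,2j) + x(2i,2j-1) + x(2i,2j),\\
x_{LH}(i,j) &= -x(2i-1,2j-1) - x(2i-1,2j) + x(2i,2j-1) + x(2i,2j),\\
x_{HL}(i,j) &= -x(2i-1,2j-1) + x(2i-1,2j) - x(2i,2j-1) + x(2i,2j),\\
x_{HH}(i,j) &= x(2i-1,2j-1) - x(2i-1,2j) - x(2i,2j-1) + x(2i,2j).
\end{aligned} \tag{2}
$$

Although the downsampling operation is deployed, due to the biorthogonal property of DWT, the original image $x$ can be accurately reconstructed without information loss by the IWT, i.e., $x = IWT(x_{LL}, x_{LH}, x_{HL}, x_{HH})$. For the Haar wavelet, the IWT can be defined as follows:

$$
\begin{aligned}
x(2i-1,2j-1) &= \big(x_{LL}(i,j) - x_{LH}(i,j) - x_{HL}(i,j) + x_{HH}(i,j)\big)/4,\\
x(2i-1,2j) &= \big(x_{LL}(i,j) - x_{LH}(i,j) + x_{HL}(i,j) - x_{HH}(i,j)\big)/4,\\
x(2i,2j-1) &= \big(x_{LL}(i,j) + x_{LH}(i,j) - x_{HL}(i,j) - x_{HH}(i,j)\big)/4,\\
x(2i,2j) &= \big(x_{LL}(i,j) + x_{LH}(i,j) + x_{HL}(i,j) + x_{HH}(i,j)\big)/4.
\end{aligned} \tag{3}
$$

Generally, the subband images $x_{LL}$, $x_{LH}$, $x_{HL}$, and $x_{HH}$ can be sequentially decomposed by DWT for further processing in multi-level WPT [57, 22]. To get the results of two-level WPT, DWT is separately utilized to decompose each subband image ($x_{LL}$, $x_{LH}$, $x_{HL}$, or $x_{HH}$) into four subband images $x_{i,LL}$, $x_{i,LH}$, $x_{i,HL}$, and $x_{i,HH}$, where $i \in \{LL, LH, HL, HH\}$ indexes the parent subband. Recursively, the results of three or higher levels of WPT can be obtained. Correspondingly, the reconstruction of the subband images at each level is implemented by the completely inverse operation via IWT. The above-mentioned process of decomposition and reconstruction of an image is illustrated in Figure 2(a). If we treat the filters of WPT as convolutional filters with pre-defined weights, one can see that WPT is a special case of FCN without the nonlinearity layers. The original image can be first decomposed by WPT and then accurately reconstructed by inverse WPT without any information loss.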
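The invertibility of the Haar DWT can be checked numerically. The following is a minimal NumPy sketch, assuming the unnormalized ±1 Haar filters (so the inverse carries a factor of 1/4), even image dimensions, and 0-based array indexing; the function names `haar_dwt` and `haar_iwt` are illustrative and not taken from the authors' code.

```python
import numpy as np


def haar_dwt(x):
    """One-level 2D Haar DWT of an array with even height and width."""
    a = x[0::2, 0::2]  # x(2i-1, 2j-1) in 1-based notation
    b = x[0::2, 1::2]  # x(2i-1, 2j)
    c = x[1::2, 0::2]  # x(2i,   2j-1)
    d = x[1::2, 1::2]  # x(2i,   2j)
    ll = a + b + c + d
    lh = -a - b + c + d
    hl = -a + b - c + d
    hh = a - b - c + d
    return ll, lh, hl, hh


def haar_iwt(ll, lh, hl, hh):
    """Inverse of haar_dwt: interleave the four recovered pixel grids."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll - lh - hl + hh) / 4
    x[0::2, 1::2] = (ll - lh + hl - hh) / 4
    x[1::2, 0::2] = (ll + lh - hl - hh) / 4
    x[1::2, 1::2] = (ll + lh + hl + hh) / 4
    return x


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 6))
subbands = haar_dwt(image)
restored = haar_iwt(*subbands)
print(np.allclose(restored, image))  # True: downsampling by DWT loses nothing
```

Each subband has half the height and width of the input, so the four subbands together hold exactly as many values as the original image.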

In image processing applications such as image denoising and compression, some operations, e.g., soft-thresholding and quantization, are usually required to process the decomposition part [59, 58], as shown in Figure 2(a). These operations can be treated as a kind of nonlinearity tailored to a specific task. In this work, we further extend WPT to multi-level wavelet-CNN (MWCNN) by plugging CNN blocks into the traditional WPT-based method, as illustrated in Figure 2(b). Due to the biorthogonal property of WPT, our MWCNN can use subsampling and upsampling operations safely without incurring information loss. Our MWCNN is thus a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping. Moreover, DWT can be treated as a downsampling operation and extended to any CNN where a pooling operation is required.
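The statement that MWCNN reduces to WPT when every CNN block is the identity can be illustrated with a recursive decomposition. This is a sketch under the same assumptions as the Haar example (unnormalized ±1 filters, even dimensions at every level); `wpt` and `inverse_wpt` are hypothetical helper names.

```python
import numpy as np


def haar_dwt(x):
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return [a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d]


def haar_iwt(ll, lh, hl, hh):
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll - lh - hl + hh) / 4
    x[0::2, 1::2] = (ll - lh + hl - hh) / 4
    x[1::2, 0::2] = (ll + lh - hl - hh) / 4
    x[1::2, 1::2] = (ll + lh + hl + hh) / 4
    return x


def wpt(x, levels):
    """Multi-level wavelet packet transform: every subband is decomposed again."""
    bands = [x]
    for _ in range(levels):
        bands = [sub for band in bands for sub in haar_dwt(band)]
    return bands


def inverse_wpt(bands):
    """Undo wpt by merging consecutive groups of four subbands level by level."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [haar_iwt(*bands[i:i + 4]) for i in range(0, len(bands), 4)]
    return bands[0]


x = np.random.default_rng(0).standard_normal((16, 16))
bands = wpt(x, levels=2)
print(len(bands), bands[0].shape)         # 16 (4, 4)
print(np.allclose(inverse_wpt(bands), x))  # True
```

In MWCNN the list of subbands at each level would be concatenated along the channel axis and passed through a CNN block instead of being forwarded unchanged.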

Fig. 4: Illustration of average pooling, dilated filter and the proposed MWCNN. Take one CNN block as an example: (a) sum-pooling with factor 2 leads to the most significant information loss which is not suitable for image restoration; (b) dilated filtering with rate 2 is equal to shared parameter convolution on sub-images; (c) the proposed MWCNN first decomposes an image into 4 sub-bands and then concatenates them as input of CNN blocks. IWT is then used as an upsampling layer to restore resolution of the image.

III-B Network architecture

III-B1 Image Restoration

As mentioned in Sec. III-A, we design the MWCNN architecture for image restoration based on the principle of the WPT, as illustrated in Figure 2(b). The key idea is to insert CNN blocks into WPT before (or after) each level of DWT. As shown in Figure 3, each CNN block is a 3-layer FCN without pooling, and takes both low-frequency and high-frequency subbands as inputs. More concretely, each layer contains convolution with $3 \times 3$ filters (Conv) and rectified linear unit (ReLU) operations. Only Conv is adopted in the last layer for predicting the residual result. The number of convolutional layers is set to 24. For more details on the setting of MWCNN, please refer to Figure 3.

Our MWCNN modifies U-Net in three aspects. (i) In conventional U-Net, pooling and deconvolution are utilized as downsampling and upsampling layers. In comparison, DWT and IWT are used in MWCNN. (ii) After DWT, we deploy additional CNN blocks to reduce the number of feature map channels for compact representation and for modeling inter-band dependency. Convolution is then adopted to increase the number of feature map channels, and IWT is utilized to upsample the feature map. In conventional U-Net, by contrast, convolution layers are used to increase the number of feature map channels, pooling leaves the number of channels unchanged, and deconvolution layers are directly adopted to upsample the feature map. (iii) In MWCNN, element-wise summation is used to combine the feature maps from the contracting and expanding subnetworks, whereas conventional U-Net adopts concatenation. Compared to our previous work [24], we have made several improvements: (i) Instead of directly decomposing input images by DWT, we first use conv blocks to extract features from the input, which is empirically shown to be beneficial for image restoration. (ii) In the 3rd hierarchical level, we use more feature maps to enhance feature representation. In our implementation, the Haar wavelet is adopted as the default wavelet in MWCNN. Other wavelets, e.g., Daubechies 2 (DB2), are also considered in our experiments.

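Denote by $\Theta$ the network parameters of MWCNN, and let $\mathcal{F}(y; \Theta)$ be the network output. Let $\{(y_i, x_i)\}_{i=1}^{N}$ be a training set, where $y_i$ is the $i$-th input image and $x_i$ is the corresponding ground-truth image. Then the objective function for learning MWCNN is given by

$$
\mathcal{L}(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| \mathcal{F}(y_i; \Theta) - x_i \right\|_F^2. \tag{4}
$$
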
The ADAM algorithm [60] is adopted to train MWCNN by minimizing the objective function.

III-B2 Extend to Object Classification

Similar to image restoration, DWT is employed as a downsampling operation to replace the pooling operation, usually without a corresponding upsampling operation. A compression filter with $1 \times 1$ Conv is subsequently applied after the DWT. Note that we do not modify other blocks or the loss function. With this modification, features can be further selected and enhanced with adaptive learning. Moreover, DWT can replace the pooling operation in any CNN, so that the information of feature maps is transmitted to the next layer without loss. DWT can thus be seen as a safe downsampling module that can be plugged into any CNN without changing the network architecture, and may help extract more powerful features for different tasks.
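As a rough illustration of DWT used as a drop-in downsampling layer, the sketch below applies a Haar DWT to every channel of a feature map and then compresses the channels. It assumes the compression filter is a 1×1 convolution; the channel counts and the random weights are hypothetical stand-ins for learned parameters, and the function names are illustrative.

```python
import numpy as np


def dwt_downsample(feat):
    """Per-channel Haar DWT: (C, H, W) -> (4C, H/2, W/2)."""
    a = feat[:, 0::2, 0::2]
    b = feat[:, 0::2, 1::2]
    c = feat[:, 1::2, 0::2]
    d = feat[:, 1::2, 1::2]
    ll = a + b + c + d
    lh = -a - b + c + d
    hl = -a + b - c + d
    hh = a - b - c + d
    # Stack the four subbands of all channels along the channel axis.
    return np.concatenate([ll, lh, hl, hh], axis=0)


def conv1x1(feat, weight):
    """1x1 convolution as a per-pixel linear map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feat)


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 32, 32))      # hypothetical 16-channel feature map
weight = 0.1 * rng.standard_normal((16, 64))  # hypothetical stand-in for learned weights

subbands = dwt_downsample(feat)         # spatial size halves, channels quadruple
compressed = conv1x1(subbands, weight)  # channels are reduced back to 16
print(subbands.shape, compressed.shape)  # (64, 16, 16) (16, 16, 16)
```

Unlike pooling, the intermediate tensor still contains every value of the input in transformed form, and the learned compression decides what to keep.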

III-C Discussion

III-C1 Connection to Pooling Operation

The DWT in the proposed MWCNN is closely related to the pooling operation and dilated filtering. Using the Haar wavelet as an example, we explain the connection between DWT and average pooling. According to the definition of average pooling with factor 2, the $(i, j)$-th value of the feature map $x_l$ in the $l$-th layer after pooling can be written as

$$
x_l(i,j) = \frac{1}{4}\big(x_{l-1}(2i-1,2j-1) + x_{l-1}(2i-1,2j) + x_{l-1}(2i,2j-1) + x_{l-1}(2i,2j)\big), \tag{5}
$$

where $x_{l-1}$ is the feature map before the pooling operation. It is clear that Eq. (5) is the same, up to a constant factor, as the low-frequency component of DWT in Eq. (2), which also means that all the high-frequency information is lost during the pooling operation. In Figure 4, the feature map is first decomposed into four sub-images with stride 2. The average pooling operation can be treated as summing all sub-images with a fixed coefficient to generate a new sub-image. In comparison, DWT uses all sub-images with four fixed orthogonal weights to obtain four new sub-images. By taking all the subbands into account, MWCNN can therefore avoid the information loss caused by conventional subsampling, and may benefit restoration and classification. Hence, average pooling can be seen as a simplified variant of the proposed MWCNN.
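The relation between average pooling and the low-frequency subband can be checked on a toy example. The sketch assumes unnormalized Haar filters, so the LL subband equals four times the 2×2 average; the two input patches are made up for illustration.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling with stride 2 on an array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def haar_dwt(x):
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d


# Two different patches with the same mean.
textured = np.array([[1.0, 3.0], [5.0, 7.0]])
flat = np.full((2, 2), 4.0)

print(avg_pool2(textured), avg_pool2(flat))  # both [[4.]]: pooling cannot tell them apart

ll_t, lh_t, hl_t, hh_t = haar_dwt(textured)
ll_f, lh_f, hl_f, hh_f = haar_dwt(flat)
print(ll_t / 4, ll_f / 4)  # LL / 4 reproduces average pooling
print(lh_t, hl_t, hh_t)    # non-zero high-frequency subbands keep the texture
print(lh_f, hl_f, hh_f)    # all zero for the flat patch
```

The high-frequency subbands are exactly the information that pooling discards and that DWT passes on to the next CNN block.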

III-C2 Connection to Dilated Filtering

To illustrate the connection between MWCNN and dilated filtering, we first give the definition of dilated filtering with factor 2:

$$
(x_l \otimes_2 k)(i,j) = \sum_{\substack{p+2s=i,\\ q+2t=j}} x_l(p,q)\, k(s,t), \tag{6}
$$

where $\otimes_2$ means the convolution operation with dilation factor 2, $(s, t)$ is the position in the convolutional kernel $k$, and $(p, q)$ is the position within the range of convolution of the feature $x_l$. Eq. (6) can be decomposed into two steps, sampling and convolving. The sampled patch is obtained by sampling at center position $(i, j)$ of $x_l$ with an interval of one pixel under the constraint $p + 2s = i$, $q + 2t = j$. Then the value $x_{l+1}(i, j)$ is obtained by convolving the sampled patch with the kernel $k$. Therefore, dilated filtering with factor 2 can be expressed as first decomposing an image into four sub-images and then applying the shared standard convolutional kernel to those sub-images, as illustrated in Figure 4. We rewrite Eq. (6) for obtaining the pixel value $x_{l+1}(2i-1, 2j-1)$ as follows:

$$
x_{l+1}(2i-1,2j-1) = \sum_{\substack{p+2s=2i-1,\\ q+2t=2j-1}} x_l(p,q)\, k(s,t). \tag{7}
$$

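Then the pixel values $x_{l+1}(2i-1, 2j)$, $x_{l+1}(2i, 2j-1)$ and $x_{l+1}(2i, 2j)$ can be obtained in the same way. In fact, the values of $x_l$ at the positions $(2i-1, 2j-1)$, $(2i-1, 2j)$, $(2i, 2j-1)$ and $(2i, 2j)$ can be obtained by applying IWT to the subband images $x_{LL}$, $x_{LH}$, $x_{HL}$ and $x_{HH}$ of $x_l$ based on Eq. (3). Therefore, dilated filtering can be represented as convolution with the subband images as follows:

$$
\begin{aligned}
x_{l+1}(2i-1,2j-1) &= \big((x_{LL} - x_{LH} - x_{HL} + x_{HH}) \otimes k\big)(i,j)/4,\\
x_{l+1}(2i-1,2j) &= \big((x_{LL} - x_{LH} + x_{HL} - x_{HH}) \otimes k\big)(i,j)/4,\\
x_{l+1}(2i,2j-1) &= \big((x_{LL} + x_{LH} - x_{HL} - x_{HH}) \otimes k\big)(i,j)/4,\\
x_{l+1}(2i,2j) &= \big((x_{LL} + x_{LH} + x_{HL} + x_{HH}) \otimes k\big)(i,j)/4.
\end{aligned} \tag{8}
$$
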
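Different from dilated filtering, the definition of MWCNN in Figure 4 can be given as

$$
x_{l+1} = x_l^{DWT} \otimes k, \tag{9}
$$

where $x_l^{DWT} = \mathrm{Concat}(x_{LL}, x_{LH}, x_{HL}, x_{HH})$, and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation. If $k$ is a group convolution [61] with factor 4, the equation can be rewritten as:

$$
x_{l+1}(i,j) = \mathrm{Concat}\big((x_{LL} \otimes k_1)(i,j),\ (x_{LH} \otimes k_2)(i,j),\ (x_{HL} \otimes k_3)(i,j),\ (x_{HH} \otimes k_4)(i,j)\big), \tag{10}
$$

where $k_1, \ldots, k_4$ denote the kernels of the four groups. Note that $x_l$ can be accurately reconstructed from $x_l^{DWT}$ by using IWT. Compared to Eq. (8), the weights of each subband and the corresponding convolutions are different. This means that our MWCNN reduces to dilated filtering if the subbands are replaced by the sub-images obtained after IWT in Eq. (3), and the convolution kernels are shared across them. Hence, dilated filtering can be seen as a variant of the proposed MWCNN.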
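The claim that dilated filtering with factor 2 equals a shared standard convolution on four sub-images can be verified numerically. The sketch below makes several assumptions for simplicity: a single channel, zero "same" padding, even image dimensions, an odd-sized kernel, and cross-correlation in place of flipped convolution (as is common in CNN libraries). The function names are illustrative.

```python
import numpy as np


def conv2d(x, k, dilation=1):
    """Zero-padded 'same' cross-correlation of a 2D array with an odd-sized kernel."""
    kh, kw = k.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for s in range(kh):
        for t in range(kw):
            # Shifted view of the padded input for kernel tap (s, t).
            out += k[s, t] * xp[s * dilation:s * dilation + x.shape[0],
                                t * dilation:t * dilation + x.shape[1]]
    return out


def dilated_via_subimages(x, k):
    """Split x into four stride-2 sub-images, filter each with the same kernel, re-interleave."""
    out = np.empty(x.shape, dtype=float)
    for di in (0, 1):
        for dj in (0, 1):
            out[di::2, dj::2] = conv2d(x[di::2, dj::2], k)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

direct = conv2d(x, k, dilation=2)
split = dilated_via_subimages(x, k)
print(np.allclose(direct, split))  # True
```

Because the four sub-images are filtered independently with one shared kernel, no information is exchanged between them, which is the restriction that MWCNN lifts by convolving across all subbands.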

Fig. 5: Illustration of the gridding effect. Taking 3-layer CNNs as an example: (a) dilated filtering with rate 2 suffers from a large amount of information loss, (b) two neighboring pixels are based on information from totally non-overlapping locations, and (c) our MWCNN avoids both drawbacks.

Compared with dilated filtering, MWCNN can also avoid the gridding effect. As depth increases, dilated filtering with a fixed factor greater than 1 only considers a sparse sampling of units in a checkerboard pattern, resulting in a large amount of information loss (see Figure 5). Another problem with dilated filtering is that two neighboring output pixels may be computed from input information at totally non-overlapping units (see Figure 5), which may cause inconsistency of local information. Figure 5 also illustrates the receptive field of MWCNN, which is quite different from that of dilated filtering. With dense sampling, the convolution filter takes multi-frequency information as input, and the receptive field is doubled after DWT. One can see that MWCNN is able to address the sparse sampling and local inconsistency problems, and is expected to benefit restoration quantitatively.
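The gridding effect can be made concrete by tracking which input offsets reach a single output unit. The sketch below works along one axis (the 2D support is the Cartesian product of two such sets) and assumes three stacked 3×3 stride-1 layers, matching the 3-layer example of Figure 5; the helper name `support_1d` is hypothetical.

```python
def support_1d(num_layers, kernel=3, dilation=1):
    """Input offsets along one axis that influence one output unit
    after `num_layers` stacked stride-1 convolutions."""
    taps = [dilation * (t - kernel // 2) for t in range(kernel)]
    offsets = {0}
    for _ in range(num_layers):
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


# Three layers of dilated filtering with factor 2.
dilated = support_1d(3, dilation=2)
span = dilated[-1] - dilated[0] + 1
coverage = (len(dilated) / span) ** 2  # fraction of the 2D window that is sampled

# Supports of two adjacent output pixels: shifted by one pixel, never overlapping.
neighbour = {o + 1 for o in dilated}
overlap = set(dilated) & neighbour

# DWT first (each half-resolution unit sees a full 2x2 block), then three dense layers.
half_res = support_1d(3, dilation=1)
mwcnn = sorted({2 * o + d for o in half_res for d in (0, 1)})

print(dilated)             # [-6, -4, -2, 0, 2, 4, 6]
print(round(coverage, 2))  # 0.29
print(len(overlap))        # 0
print(mwcnn)               # every offset from -6 to 7, with no gaps
```

Under these assumptions dilated filtering touches only about 29% of the pixels inside its 13×13 window, while the DWT-based stack covers a contiguous 14×14 window.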

IV Experiments

In this section, we first describe the application of MWCNN to image restoration. Then ablation experiments are presented to analyze the contribution of each component. Finally, the proposed MWCNN is extended to object classification.

IV-A Experimental Setting for Image Restoration

| Dataset | Noise level | BM3D [27] | TNRD [26] | DnCNN [5] | IRCNN [12] | RED30 [20] | MemNet [19] | FFDNet [40] | MWCNN(P) | MWCNN |
|---|---|---|---|---|---|---|---|---|---|---|
| Set12 | 15 | 32.37 / 0.8952 | 32.50 / 0.8962 | 32.86 / 0.9027 | 32.77 / 0.9008 | - | - | 32.75 / 0.9027 | 33.15 / 0.9088 | 33.20 / 0.9089 |
| Set12 | 25 | 29.97 / 0.8505 | 30.05 / 0.8515 | 30.44 / 0.8618 | 30.38 / 0.8601 | - | - | 30.43 / 0.8634 | 30.79 / 0.8711 | 30.84 / 0.8718 |
| Set12 | 50 | 26.72 / 0.7676 | 26.82 / 0.7677 | 27.18 / 0.7827 | 27.14 / 0.7804 | 27.34 / 0.7897 | 27.38 / 0.7931 | 27.32 / 0.7903 | 27.74 / 0.8056 | 27.79 / 0.8060 |
| BSD68 | 15 | 31.08 / 0.8722 | 31.42 / 0.8822 | 31.73 / 0.8906 | 31.63 / 0.8881 | - | - | 31.63 / 0.8902 | 31.86 / 0.8947 | 31.91 / 0.8952 |
| BSD68 | 25 | 28.57 / 0.8017 | 28.92 / 0.8148 | 29.23 / 0.8278 | 29.15 / 0.8249 | - | - | 29.19 / 0.8289 | 29.41 / 0.8360 | 29.46 / 0.8370 |
| BSD68 | 50 | 25.62 / 0.6869 | 25.97 / 0.7021 | 26.23 / 0.7189 | 26.19 / 0.7171 | 26.35 / 0.7245 | 26.35 / 0.7294 | 26.29 / 0.7245 | 26.53 / 0.7366 | 26.58 / 0.7382 |
| Urban100 | 15 | 32.34 / 0.9220 | 31.98 / 0.9187 | 32.67 / 0.9250 | 32.49 / 0.9244 | - | - | 32.43 / 0.9273 | 33.17 / 0.9357 | 33.22 / 0.9361 |
| Urban100 | 25 | 29.70 / 0.8777 | 29.29 / 0.8731 | 29.97 / 0.8792 | 29.82 / 0.8839 | - | - | 29.92 / 0.8886 | 30.66 / 0.9026 | 30.74 / 0.9035 |
| Urban100 | 50 | 25.94 / 0.7791 | 25.71 / 0.7756 | 26.28 / 0.7869 | 26.14 / 0.7927 | 26.48 / 0.7991 | 26.64 / 0.8024 | 26.52 / 0.8057 | 27.42 / 0.8371 | 27.53 / 0.8393 |
TABLE I: Average PSNR(dB)/SSIM results of the competing methods for image denoising with noise levels 15, 25 and 50 on datasets Set12, BSD68 and Urban100. Red color indicates the best performance.
Fig. 6: Image denoising results of “Test044” (Set68) with noise level of 50. Panels: Ground Truth, Noisy Image, BM3D [27], TNRD [26], DnCNN [5], IRCNN [12], RED30 [20], MemNet [19], FFDNet [40], MWCNN.

IV-A1 Training set

To train our MWCNN, we adopt DIV2K [62] as our training dataset. Concretely, DIV2K contains 800 images with about 2K resolution for training, 100 images for validation, and 100 images for testing. In the training stage, we crop patches from the training images, with the patch size chosen according to the receptive field of MWCNN.

For image denoising, we consider three noise levels, i.e., $\sigma$ = 15, 25 and 50, and evaluate our denoising method on three datasets, i.e., Set12 [5], BSD68 [63], and Urban100 [64]. For SISR, we take the bicubic upsampling result as the input to MWCNN with three specific scale factors, i.e., $\times 2$, $\times 3$ and $\times 4$. Four widely used datasets, Set5 [65], Set14 [66], BSD100 [63] and Urban100 [64], are adopted to evaluate SISR performance. For JPEG image artifacts removal, we follow the setting used in [35], i.e., four compression quality settings $Q$ = 10, 20, 30 and 40 for the JPEG encoder. Two datasets, Classic5 [35] and LIVE1 [67], are used for evaluating our method.

IV-A2 Network training

In image restoration, an MWCNN model is learned for each degradation setting. The ADAM algorithm [60] is adopted for optimization, and we use a mini-batch size of 24. The learning rate is decayed exponentially over the 200 epochs. Rotation and/or flip based data augmentation is used during mini-batch learning. The MatConvNet package [68] with an NVIDIA GTX1080 GPU is utilized for training and testing.
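Rotation and flip based augmentation is often implemented as the eight symmetries of a square patch. The sketch below assumes that common 8-way scheme, which the text does not specify, and the `augment` helper and its `mode` argument are hypothetical. The same mode must be applied to the degraded input and to its ground-truth patch.

```python
import numpy as np


def augment(patch, mode):
    """Return one of 8 rotation/flip variants of a 2D patch.

    mode % 4 selects the number of 90-degree rotations;
    mode >= 4 additionally applies a left-right flip.
    """
    out = np.rot90(patch, k=mode % 4)
    if mode >= 4:
        out = np.fliplr(out)
    return out


rng = np.random.default_rng(0)
clean = np.arange(16, dtype=float).reshape(4, 4)
noisy = clean + rng.standard_normal(clean.shape)

# Draw one mode per training pair and apply it to both images.
mode = int(rng.integers(0, 8))
noisy_aug, clean_aug = augment(noisy, mode), augment(clean, mode)

variants = [augment(clean, m) for m in range(8)]
print(len({v.tobytes() for v in variants}))  # 8 distinct variants
```

Since the noise in the sketch is added before augmentation, the pair stays pixel-aligned for any mode.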

IV-B Quantitative and qualitative evaluation on Image Restoration Tasks

Comprehensive experiments are conducted to evaluate our 24-layer MWCNN using the same setting as in Sec. III-B on three representative image restoration tasks, respectively. Here, we also provide the results of our previous work and denote it as MWCNN(P) [24].

IV-B1 Image denoising

For image denoising, only gray images are used for training and evaluation, because most denoising methods are trained and tested only on gray images. We compare with two classic denoising methods, i.e., BM3D [27] and TNRD [26], and five CNN-based methods, i.e., DnCNN [5], IRCNN [12], RED30 [20], MemNet [19], and FFDNet [40]. Table I lists the average PSNR/SSIM results of the competing methods on these three datasets. Since RED30 [20] and MemNet [19] do not provide models trained on noise levels 15 and 25, we use the symbol ‘-’ instead. Our MWCNN outperforms all the competing methods. It is worth noting that, according to Table I, our MWCNN outperforms DnCNN and FFDNet by about 0.3 to 0.6 dB in terms of PSNR on Set12, and by roughly 0.2 to 0.4 dB on BSD68. On Urban100, our MWCNN also achieves favorable performance compared with the competing methods. In particular, when the noise level is 50, the average PSNR of our MWCNN is about 0.6 dB higher than that of DnCNN on Set12, and more than 1.2 dB higher on Urban100. Figure 6 shows the denoising results of the image “Test044” from Set68 with the noise level $\sigma = 50$. One can see that our MWCNN is promising in removing noise while recovering image details and structures, and obtains visually more pleasant results than the competing methods, owing to the reversibility of WPT during downsampling and upsampling.
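For reference, the PSNR figures discussed here follow the usual definition based on the mean squared error. The sketch below assumes 8-bit images with a peak value of 255; the exact evaluation protocol (border cropping, rounding) is not stated in the text, so none is assumed.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


reference = np.zeros((4, 4))
estimate = np.full((4, 4), 25.5)  # constant error of 10% of the peak value
print(psnr(reference, estimate))  # approximately 20 dB
```

A gain of 0.5 dB corresponds to reducing the mean squared error by roughly 11%, since 10^(-0.05) ≈ 0.89.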

| Dataset | Scale | VDSR [2] | DnCNN [5] | RED30 [20] | SRResNet [11] | LapSRN [3] | DRRN [17] | MemNet [19] | WaveResNet [45] | SRMDNF [44] | MWCNN(P) | MWCNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Set5 | 2 | 37.53 / 0.9587 | 37.58 / 0.9593 | 37.66 / 0.9599 | - | 37.52 / 0.9590 | 37.74 / 0.9591 | 37.78 / 0.9597 | 37.57 / 0.9586 | 37.79 / 0.9601 | 37.91 / 0.9600 | 37.95 / 0.9605 |
| Set5 | 3 | 33.66 / 0.9213 | 33.75 / 0.9222 | 33.82 / 0.9230 | - | - | 34.03 / 0.9244 | 34.09 / 0.9248 | 33.86 / 0.9228 | 34.12 / 0.9250 | 34.18 / 0.9272 | 34.21 / 0.9273 |
| Set5 | 4 | 31.35 / 0.8838 | 31.40 / 0.8845 | 31.51 / 0.8869 | 32.05 / 0.8902 | 31.54 / 0.8850 | 31.68 / 0.8888 | 31.74 / 0.8893 | 31.52 / 0.8864 | 31.96 / 0.8925 | 32.12 / 0.8941 | 32.14 / 0.8951 |
| Set14 | 2 | 33.03 / 0.9124 | 33.04 / 0.9118 | 32.94 / 0.9144 | - | 33.08 / 0.9130 | 33.23 / 0.9136 | 33.28 / 0.9142 | 33.09 / 0.9129 | 33.05 / 0.8985 | 33.70 / 0.9182 | 33.71 / 0.9182 |
| Set14 | 3 | 29.77 / 0.8314 | 29.76 / 0.8349 | 29.61 / 0.8341 | - | - | 29.96 / 0.8349 | 30.00 / 0.8350 | 29.88 / 0.8331 | 30.04 / 0.8372 | 30.16 / 0.8414 | 30.14 / 0.8413 |
| Set14 | 4 | 28.01 / 0.7674 | 28.02 / 0.7670 | 27.86 / 0.7718 | 28.49 / 0.7783 | 28.19 / 0.7720 | 28.21 / 0.7720 | 28.26 / 0.7723 | 28.11 / 0.7699 | 28.41 / 0.7816 | 28.41 / 0.7816 | 28.58 / 0.7882 |
| BSD100 | 2 | 31.90 / 0.8960 | 31.85 / 0.8942 | 31.98 / 0.8974 | - | 31.80 / 0.8950 | 32.05 / 0.8973 | 32.08 / 0.8978 | 31.92 / 0.8965 | 32.23 / 0.8999 | 32.23 / 0.8999 | 32.30 / 0.9002 |
| BSD100 | 3 | 28.82 / 0.7976 | 28.80 / 0.7963 | 28.92 / 0.7993 | - | - | 28.95 / 0.8004 | 28.96 / 0.8001 | 28.86 / 0.7987 | 28.97 / 0.8030 | 29.12 / 0.8060 | 29.18 / 0.8106 |
| BSD100 | 4 | 27.29 / 0.7251 | 27.23 / 0.7233 | 27.39 / 0.7286 | 27.56 / 0.7354 | 27.32 / 0.7280 | 27.38 / 0.7284 | 27.40 / 0.7281 | 27.32 / 0.7266 | 27.62 / 0.7355 | 27.62 / 0.7355 | 27.67 / 0.7357 |
| Urban100 | 2 | 30.76 / 0.9140 | 30.75 / 0.9133 | 30.91 / 0.9159 | - | 30.41 / 0.9100 | 31.23 / 0.9188 | 31.31 / 0.9195 | 30.96 / 0.9169 | 32.30 / 0.9296 | 32.30 / 0.9296 | 32.36 / 0.9306 |
| Urban100 | 3 | 27.14 / 0.8279 | 27.15 / 0.8276 | 27.31 / 0.8303 | - | - | 27.53 / 0.8378 | 27.56 / 0.8376 | 27.28 / 0.8334 | 27.57 / 0.8401 | 28.13 / 0.8514 | 28.19 / 0.8520 |
| Urban100 | 4 | 25.18 / 0.7524 | 25.20 / 0.7521 | 25.35 / 0.7587 | 26.07 / 0.7839 | 25.21 / 0.7560 | 25.44 / 0.7638 | 25.50 / 0.7630 | 25.36 / 0.7614 | 26.27 / 0.7890 | 26.27 / 0.7890 | 26.37 / 0.7891 |
TABLE II: Average PSNR(dB) / SSIM results of the competing methods for SISR with scale factors 2, 3 and 4 on datasets Set5, Set14, BSD100 and Urban100. Red color indicates the best performance.
Fig. 7: Single image super-resolution: results on “253027” (BSD100) with upscaling factor of 4. Panels: Ground Truth, VDSR [2], DnCNN [5], RED30 [20], SRResNet [11], LapSRN [3], DRRN [17], MemNet [19], WaveResNet [45], MWCNN.

IV-B2 Single image super-resolution

We also train our MWCNN on the SISR task using only the luminance channel, i.e., Y in YCbCr color space, following [1]. Interpolation is used for image degradation, and the degraded image is upsampled by interpolation before being sent to the network. For qualitative comparisons, we use the source code of nine CNN-based methods, including VDSR [2], DnCNN [5], RED30 [20], SRResNet [11], LapSRN [3], DRRN [17], MemNet [19], WaveResNet [45] and SRMDNF [44]. Since the source code of SRResNet is not released, its results in Table II are incomplete. The results of LapSRN with scale factor 3 are not listed since they are not reported in the authors’ paper.
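To make the luminance-only setting concrete, the sketch below extracts the Y channel from 8-bit RGB pixels. It assumes the ITU-R BT.601 "studio swing" convention (Y in the range 16 to 235), which is widely used in super-resolution evaluation; the text above does not state which YCbCr variant is used, so the coefficients and the helper names `rgb_to_y` and `luminance_channel` are assumptions.

```python
def rgb_to_y(r, g, b):
    """Luma of one 8-bit RGB pixel under ITU-R BT.601 (assumed convention, Y in [16, 235])."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def luminance_channel(rgb_image):
    """Map an image given as rows of (r, g, b) tuples to rows of Y values."""
    return [[rgb_to_y(*pixel) for pixel in row] for row in rgb_image]

# Hypothetical 1x3 image: black, white and pure red pixels.
tiny = [[(0, 0, 0), (255, 255, 255), (255, 0, 0)]]
print(luminance_channel(tiny))
```

Only this single channel is then restored by the network; PSNR and SSIM are likewise computed on Y.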

Table II summarizes the average PSNR/SSIM results of the competing methods on the four datasets by citing the results in their respective papers. In terms of both PSNR and SSIM indexes, the proposed MWCNN outperforms other methods in all cases. Compared with the well-known VDSR, our MWCNN achieves a notable PSNR gain of about 0.4 to 0.8 dB on Set5 and Set14. Surprisingly, our MWCNN outperforms VDSR by a large gap of about 1.0 to 1.6 dB on Urban100. Even though SRMDNF is trained on RGB space, it is still slightly weaker than our MWCNN. It can also be seen that our MWCNN outperforms WaveResNet by no less than 0.3 dB. We provide qualitative comparisons with the competing methods on the image “253027” from BSD100 in Figure 7. As one can see, our MWCNN can correctly recover the fine and detailed textures, and produce sharp edges due to the frequency and location characteristics of DWT.
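The quoted PSNR gains over VDSR can be checked directly from Table II. The short sketch below copies the VDSR and MWCNN PSNR values for scale factors 2, 3 and 4 from that table and prints the per-scale differences; the variable names are illustrative.

```python
# PSNR (dB) for scale factors 2, 3, 4, copied from Table II.
vdsr = {
    "Set5": [37.53, 33.66, 31.35],
    "Set14": [33.03, 29.77, 28.01],
    "Urban100": [30.76, 27.14, 25.18],
}
mwcnn = {
    "Set5": [37.95, 34.21, 32.14],
    "Set14": [33.71, 30.14, 28.58],
    "Urban100": [32.36, 28.19, 26.37],
}

# Gain of MWCNN over VDSR for every dataset and scale factor.
gains = {
    name: [round(m - v, 2) for m, v in zip(mwcnn[name], vdsr[name])]
    for name in vdsr
}
for name, values in gains.items():
    print(name, values)
```

The differences fall between roughly 0.4 and 0.8 dB on Set5 and Set14, and between roughly 1.0 and 1.6 dB on Urban100, matching the ranges stated in the text.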

Dataset | Quality | JPEG | ARCNN [35] | TNRD [26] | DnCNN [5] | MemNet [19] | MWCNN(P) | MWCNN
--- | --- | --- | --- | --- | --- | --- | --- | ---
Classic5 | 10 | 27.82 / 0.7595 | 29.03 / 0.7929 | 29.28 / 0.7992 | 29.40 / 0.8026 | 29.69 / 0.8107 | 30.01 / 0.8195 | 30.03 / 0.8201
Classic5 | 20 | 30.12 / 0.8344 | 31.15 / 0.8517 | 31.47 / 0.8576 | 31.63 / 0.8610 | 31.90 / 0.8658 | 32.16 / 0.8701 | 32.20 / 0.8708
Classic5 | 30 | 31.48 / 0.8744 | 32.51 / 0.8806 | 32.78 / 0.8837 | 32.91 / 0.8861 | - | 33.43 / 0.8930 | 33.46 / 0.8934
Classic5 | 40 | 32.43 / 0.8911 | 33.34 / 0.8953 | - | 33.77 / 0.9003 | - | 34.27 / 0.9061 | 34.31 / 0.9063
LIVE1 | 10 | 27.77 / 0.7730 | 28.96 / 0.8076 | 29.15 / 0.8111 | 29.19 / 0.8123 | 29.45 / 0.8193 | 29.69 / 0.8254 | 29.70 / 0.8260
LIVE1 | 20 | 30.07 / 0.8512 | 31.29 / 0.8733 | 31.46 / 0.8769 | 31.59 / 0.8802 | 31.83 / 0.8846 | 32.04 / 0.8885 | 32.07 / 0.8886
LIVE1 | 30 | 31.41 / 0.9000 | 32.67 / 0.9043 | 32.84 / 0.9059 | 32.98 / 0.9090 | - | 33.45 / 0.9153 | 33.46 / 0.9155
LIVE1 | 40 | 32.35 / 0.9173 | 33.63 / 0.9198 | - | 33.96 / 0.9247 | - | 34.45 / 0.9301 | 34.47 / 0.9300

TABLE III: Average PSNR(dB) / SSIM results of the competing methods for JPEG image artifacts removal with quality factors 10, 20, 30 and 40 on datasets Classic5 and LIVE1. Red color indicates the best performance.
Fig. 8: JPEG image artifacts removal: visual results on “carnivaldolls” (LIVE1) with quality factor of 10. Panels: Ground Truth, ARCNN [35], TNRD [26], DnCNN [5], MemNet [19], MWCNN.

IV-B3 JPEG image artifacts removal

We apply our method to JPEG image artifacts removal to further demonstrate the applicability of our MWCNN to image restoration. Here, both the JPEG encoder and JPEG image artifacts removal are applied to the Y channel only. Following [35], we consider four quality factor settings for the JPEG encoder, i.e., 10, 20, 30 and 40. In our experiments, MWCNN is compared to four competing methods, i.e., ARCNN [35], TNRD [26], DnCNN [5], and MemNet [19]. The results of MemNet [19] and TNRD [26] are incomplete according to their papers and released source code.

Table III shows the average PSNR/SSIM results of the competing methods on Classic5 and LIVE1. Our MWCNN outperforms the other methods in terms of quantitative metrics for all four quality factors. Compared to ARCNN on Classic5, our MWCNN is about 1 dB higher in terms of PSNR. One can see that the PSNR value of MWCNN is 0.2 to 0.3 dB higher than that of the second best method (i.e., MemNet [19]). For perceptual comparison, we also provide results on the image “carnivaldolls” from LIVE1 with the quality factor of 10. Compared with other methods, our MWCNN is more effective in removing artifacts and restoring detailed textures and sharp salient edges.

Image Denoising

Size | FFDNet [40] | DnCNN [5] | RED30 [20] | MemNet [19] | MWCNN
--- | --- | --- | --- | --- | ---
256×256 | 0.006 | 0.0143 | 1.362 | 0.8775 | 0.0437
512×512 | 0.012 | 0.0487 | 4.702 | 3.606 | 0.0844
1024×1024 | 0.038 | 0.1688 | 15.77 | 14.69 | 0.3343

Single Image Super-Resolution

Size | VDSR [2] | LapSRN [3] | DRRN [17] | MemNet [19] | MWCNN
--- | --- | --- | --- | --- | ---
256×256 | 0.0172 | 0.0229 | 3.063 | 0.8774 | 0.0397
512×512 | 0.0575 | 0.0357 | 8.050 | 3.605 | 0.0732
1024×1024 | 0.2126 | 0.1411 | 25.23 | 14.69 | 0.2876

JPEG Image Artifacts Removal

Size | ARCNN [35] | TNRD [26] | DnCNN [5] | MemNet [19] | MWCNN
--- | --- | --- | --- | --- | ---
256×256 | 0.0277 | 0.009 | 0.0157 | 0.8775 | 0.0413
512×512 | 0.0532 | 0.028 | 0.0568 | 3.607 | 0.0789
1024×1024 | 0.1613 | 0.095 | 0.2012 | 14.69 | 0.2717

TABLE IV: Running time (in seconds) of the competing methods for the three tasks on images of size 256×256, 512×512 and 1024×1024: image denoising is tested on noise level 50, SISR is tested on scale 2, and JPEG image deblocking is tested on quality factor 10.

IV-B4 Running time

As mentioned previously, the efficiency of CNNs is also an important measure of network performance. We consider the CNN-based methods with source code and list the GPU running time of the competing methods for the three tasks in Table IV. Note that the Nvidia cuDNN-v7.0 deep learning library with CUDA 9.2 is adopted to accelerate the GPU computation under the Ubuntu 16.04 system. In comparison to the state-of-the-art methods, i.e., RED30 [20], DRRN [17] and MemNet [19], our MWCNN costs far less time while obtaining better performance in terms of PSNR/SSIM metrics. Meanwhile, our MWCNN is moderately slower than the other methods but achieves higher PSNR/SSIM indexes. This means that the effectiveness of MWCNN should be attributed to the incorporation of CNN and DWT rather than to an increase of network depth/width.
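Running times such as those in Table IV are typically obtained by averaging several timed runs after a few warm-up runs. The following is a minimal, framework-agnostic sketch of such a harness, not the authors' measurement script; `average_runtime`, its parameters and the dummy workload are hypothetical. For GPU models, the device would additionally have to be synchronized before each clock read, which is framework-specific and omitted here.

```python
import time

def average_runtime(fn, *args, warmup=2, repeats=10):
    """Average wall-clock time in seconds of fn(*args) over `repeats` runs.

    A few warm-up calls are made first so that one-off setup costs
    (caching, lazy initialization) do not distort the measurement.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

def dummy_restoration(size):
    """Stand-in workload (hypothetical): touches every pixel of a size x size image."""
    return sum(i * j for i in range(size) for j in range(size))

for size in (64, 128):
    print(size, average_runtime(dummy_restoration, size))
```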

IV-C Comparison of MWCNN variants

Using image denoising and JPEG image artifacts removal as examples, we conduct two groups of experiments: (i) ablation experiments to demonstrate where the improved performance comes from, and (ii) comparisons with related methods, such as wavelet-based approaches and dilated filtering, to verify the effectiveness of the proposed method. Note that the 24-layer MWCNN is employed as our baseline, and all the MWCNN variants are designed using the same architecture for fair comparison.

Image Denoising (noise level 50)

Dataset | U-Net [23] | U-Net [23]+S | U-Net [23]+D | MWCNN (P+C) | MWCNN (Haar) | MWCNN (DB2) | MWCNN (HD)
--- | --- | --- | --- | --- | --- | --- | ---
Set12 | 27.42 / 0.079 | 27.41 / 0.074 | 27.46 / 0.080 | 27.76 / 0.081 | 27.79 / 0.075 | 27.81 / 0.127 | 27.77 / 0.091
Set68 | 26.30 / 0.076 | 26.29 / 0.071 | 26.21 / 0.075 | 26.54 / 0.077 | 26.58 / 0.072 | 26.59 / 0.114 | 26.57 / 0.086
Urban100 | 26.68 / 0.357 | 26.72 / 0.341 | 26.99 / 0.355 | 27.46 / 0.354 | 27.53 / 0.346 | 27.55 / 0.576 | 27.50 / 0.413

JPEG Image Artifacts Removal (quality factor 10)

Dataset | U-Net [23] | U-Net [23]+S | U-Net [23]+D | MWCNN (P+C) | MWCNN (Haar) | MWCNN (DB2) | MWCNN (HD)
--- | --- | --- | --- | --- | --- | --- | ---
Classic5 | 29.61 / 0.093 | 29.60 / 0.082 | 29.68 / 0.097 | 30.02 / 0.091 | 30.03 / 0.083 | 30.04 / 0.185 | 29.99 / 0.115
LIVE1 | 29.36 / 0.112 | 29.36 / 0.109 | 29.43 / 0.120 | 29.69 / 0.120 | 29.70 / 0.111 | 29.71 / 0.234 | 29.68 / 0.171

TABLE V: Performance comparison of ablation experiments in terms of average PSNR (dB) / running time (in seconds): image denoising is tested on noise level 50 and JPEG image deblocking is tested on quality factor 10.

IV-C1 Ablation experiments

Ablation experiments are provided to verify the effectiveness of the additionally embedded wavelet transform: (i) the default U-Net with the same architecture as MWCNN, (ii) U-Net+S: using sum connection instead of concatenation, and (iii) U-Net+D: adopting learnable conventional downsampling filters, i.e., convolution with stride 2, to replace max pooling. We also compare with a modified MWCNN(P) that adds one CNN layer right after the input and another CNN layer before the output, denoted as MWCNN(P+C). Three MWCNN variants with different wavelet transforms are also considered: (i) MWCNN (Haar): the default MWCNN with Haar wavelet, (ii) MWCNN (DB2): MWCNN with Daubechies-2 wavelet, and (iii) MWCNN (HD): MWCNN with Haar in the contracting subnetwork and Daubechies-2 in the expanding subnetwork.

Table V lists the PSNR and running time results of these methods. We have the following observations. (i) The ablation experiments indicate that adopting sum connection instead of concatenation slightly improves efficiency with almost no decrease in PSNR. (ii) Due to the biorthogonal and time-frequency localization properties of wavelets, our wavelet-based method is more powerful for image restoration. The pooling operation causes the loss of high-frequency information, which makes it difficult to recover the damaged image. Our MWCNN also clearly outperforms the U-Net+D method, which adopts learnable downsampling filters. This indicates that learning alone is not enough and that violating invertibility can cause information loss. (iii) Compared with MWCNN (P+C), the proposed method still performs slightly better even though MWCNN (P+C) has more layers, thereby verifying the effectiveness of the proposed method. (iv) Compared with MWCNN (DB2) and MWCNN (HD), using the Haar wavelet for downsampling and upsampling offers the best trade-off: MWCNN (DB2) is only 0.01 to 0.02 dB higher in PSNR but takes noticeably longer to run, and MWCNN (HD) is both slower and slightly less accurate. MWCNN (Haar) has a running time similar to that of U-Net and lower than that of the dilated CNNs (Table VI) while achieving higher PSNR, which demonstrates the effectiveness of MWCNN in trading off performance and efficiency.
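The invertibility argument in observation (ii) can be illustrated with a one-level 2D Haar transform. The sketch below is a dependency-free illustration, not the paper's implementation: it assumes an orthonormal Haar filter bank applied to non-overlapping 2×2 blocks, the subband names (LL, LH, HL, HH) follow one common convention, and the function names are hypothetical. It shows that the four half-resolution subbands reconstruct the input exactly, whereas average pooling keeps only a scaled copy of the LL subband.

```python
def haar_dwt2(x):
    """One-level 2D Haar DWT of an image with even height and width.

    Returns four half-resolution subbands (LL, LH, HL, HH).
    Subband naming varies between references; this is one common choice.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (low-pass)
            row_lh.append((a + b - c - d) / 2)  # difference between rows
            row_hl.append((a - b + c - d) / 2)  # difference between columns
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh

def haar_iwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution image."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + p + q + r) / 2
            x[2 * i][2 * j + 1] = (s + p - q - r) / 2
            x[2 * i + 1][2 * j] = (s - p + q - r) / 2
            x[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2
    return x

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt2(image)
print(haar_iwt2(ll, lh, hl, hh) == image)  # lossless round trip
print([[v / 2 for v in row] for row in ll])  # equals 2x2 average pooling
```

Feeding all four subbands to the next convolution layer therefore halves the spatial resolution without discarding information, while a network that keeps only the LL subband behaves like one with average pooling.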

Image Denoising (noise level 50)

Dataset | Dilated [8] | Dilated-2 | DCF [47] | DCF+R | WaveResNet [45] | MWCNN (Haar)
--- | --- | --- | --- | --- | --- | ---
Set12 | 27.45 / 0.181 | 24.81 / 0.185 | 27.38 / 0.081 | 27.61 / 0.081 | 27.49 / 0.179 | 27.79 / 0.075
Set68 | 26.35 / 0.142 | 24.32 / 0.174 | 26.30 / 0.075 | 26.43 / 0.075 | 26.38 / 0.143 | 26.58 / 0.072
Urban100 | 26.56 / 0.764 | 24.18 / 0.960 | 26.65 / 0.354 | 27.18 / 0.354 | - / - | 27.53 / 0.346

JPEG Image Artifacts Removal (quality factor 10)

Dataset | Dilated [8] | Dilated-2 | DCF [47] | DCF+R | WaveResNet [45] | MWCNN (Haar)
--- | --- | --- | --- | --- | --- | ---
Classic5 | 29.72 / 0.287 | 29.49 / 0.302 | 29.57 / 0.104 | 29.88 / 0.104 | - / - | 30.03 / 0.083
LIVE1 | 29.49 / 0.354 | 29.26 / 0.376 | 29.38 / 0.155 | 29.63 / 0.155 | - / - | 29.70 / 0.111

TABLE VI: Performance comparison of MWCNN variants in terms of average PSNR (dB) / running time (in seconds): image denoising is tested on noise level 50 and JPEG image deblocking is tested on quality factor 10.

IV-C2 More MWCNN variants

We further compare the PSNR results with those of related methods. Two 24-layer dilated CNNs are provided: (i) Dilated: the hybrid dilated convolution [16], which suppresses the gridding effect, and (ii) Dilated-2: the dilation factor of all layers is set to 2, which exhibits the gridding effect. The WaveResNet method in [45] is provided for comparison. Moreover, since the source code of deep convolutional framelets (DCF) is not available, a re-implementation of DCF without residual learning [48] is also considered in the experiments. DCF with residual learning (denoted as DCF+R) is provided for fair comparison.

Table VI lists the PSNR and running time results of these methods. We have the following observations. (i) The gridding effect, with its sparse sampling and inconsistency of local information, indeed has an adverse influence on restoration performance. (ii) The worse performance of DCF also indicates that independent processing of subbands harms intra-frequency information dependency. (iii) The fact that our MWCNN method slightly outperforms the DCF+R method means that the adverse influence of independent processing can be eliminated after several non-linear operations.
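The sparse sampling behind the gridding effect can be seen by tracking which input positions can influence a single output position. The one-dimensional sketch below stacks 3-tap convolutions with given dilation rates and reports the fraction of the receptive field that is actually sampled. The rates (2, 2, 2) mirror the Dilated-2 setting; the rates (1, 2, 3) are only an illustrative hybrid schedule, not necessarily the configuration used in the experiments, and the function names are hypothetical.

```python
def reachable_offsets(dilations, kernel_size=3):
    """Input offsets (1D) that can influence one output position after
    stacking convolutions with the given dilation rates."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        offsets = {o + k * d for o in offsets for k in range(-half, half + 1)}
    return offsets

def coverage(dilations):
    """Fraction of positions inside the receptive field that are sampled."""
    offsets = reachable_offsets(dilations)
    span = max(offsets) - min(offsets) + 1
    return len(offsets) / span

print(sorted(reachable_offsets([2, 2, 2])))  # only even offsets are reached
print(coverage([2, 2, 2]), coverage([1, 2, 3]))
```

With a constant dilation factor of 2, every odd offset is skipped, so neighbouring output pixels depend on disjoint sets of input pixels; mixing dilation rates fills these holes.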

IV-C3 Hierarchical level of MWCNN

Here, we discuss the suitable level for our MWCNN. Although it can be extended to higher levels, a higher level also brings a deeper network and a heavier computational burden. Thus, we select a level that balances efficiency and performance. Table VII reports the PSNR and running time results of MWCNNs with levels 0 to 4 (i.e., MWCNN-0 to MWCNN-4). We note that MWCNN-0 is a 6-layer CNN without WPT. In terms of the PSNR metric, MWCNN-3 is clearly better than MWCNN-1 and MWCNN-2, while only negligibly weaker than MWCNN-4. Meanwhile, the speed of MWCNN-3 is also moderate compared with other levels. Based on the above analysis, we choose MWCNN-3 as the default setting.

Dataset | MWCNN-0 | MWCNN-1 | MWCNN-2 | MWCNN-3 | MWCNN-4
--- | --- | --- | --- | --- | ---
Set12 | 26.84 / 0.017 | 27.25 / 0.041 | 27.64 / 0.064 | 27.79 / 0.073 | 27.80 / 0.087
Set68 | 25.71 / 0.016 | 26.21 / 0.039 | 26.47 / 0.060 | 26.58 / 0.070 | 26.59 / 0.081
Urban100 | 25.98 / 0.087 | 26.53 / 0.207 | 27.12 / 0.298 | 27.53 / 0.313 | 27.55 / 0.334

TABLE VII: Average PSNR (dB) / running time (in seconds) of MWCNNs with different levels on Gaussian denoising with the noise level of 50.
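The choice of level 3 can be read off the marginal gains between consecutive levels. The sketch below uses the Set12 row of Table VII, taking the five entries as MWCNN-0 to MWCNN-4 as described in the text, and reports the PSNR gain and extra running time of each additional level; the function name `marginal_gains` is illustrative.

```python
# Set12 row of Table VII, levels 0 to 4.
psnr_db = [26.84, 27.25, 27.64, 27.79, 27.80]
seconds = [0.017, 0.041, 0.064, 0.073, 0.087]

def marginal_gains(psnr_values, time_values):
    """PSNR gain (dB) and extra time (s) when moving from level n-1 to level n."""
    steps = []
    for level in range(1, len(psnr_values)):
        gain = round(psnr_values[level] - psnr_values[level - 1], 2)
        extra = round(time_values[level] - time_values[level - 1], 3)
        steps.append((level, gain, extra))
    return steps

for level, gain, extra in marginal_gains(psnr_db, seconds):
    print(f"MWCNN-{level}: +{gain} dB for +{extra} s")
```

The gain shrinks with each level and is about 0.01 dB for the step from level 3 to level 4, which costs more additional time than the step from level 2 to level 3.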

IV-D Extend to Object Classification

Using object classification as an example, we test our MWCNN on six well-known benchmarks: CIFAR-10, CIFAR-100 [69], SVHN [70], MNIST [71], ImageNet1K [6] and Places365 [72]. CIFAR-10 and CIFAR-100 each consist of 60,000 32×32 colour images, in 10 and 100 classes respectively, including 50,000 images for training and 10,000 images for testing. MNIST [71] is a handwritten digit database with 28×28 resolution, which has a training set of 60,000 examples and a test set of 10,000 examples. The SVHN dataset contains more than 600,000 digit images. ImageNet1K is resized to two resolutions, 32×32 denoted as ImageNet32 and 64×64 denoted as ImageNet64. Places365 is also resized for training and testing in our work. Here, we modify and compare several CNN methods, such as pre-activation ResNet (PreResNet) [10], All-CNN [73], WideResNet [53], PyramidNet [50], DenseNet [52] and ResNet [9]. Specifically, we follow [55] to verify our MWCNN on the ResNet architecture. In our MWCNN, DWT followed by convolution replaces the average pooling operation, as described in Sec. III-B2; this variant is denoted as ‘MW’, while the original CNN is denoted as ‘Base’.

Table VIII shows the detailed accuracy results for the competing methods PreResNet [10], All-CNN [73], WideResNet [53], PyramidNet [50] and DenseNet [52] on CIFAR-10, CIFAR-100, SVHN, MNIST and ImageNet32. Table IX and Table X show the Top-1 and Top-5 error of ResNet [9] on ImageNet64 and Places365. One can see that MWCNN consistently improves on the original CNN thanks to the DWT used. Our MWCNN is quite different from DCF [48]: DCF combines CNN with DWT during decomposition, where different CNNs are deployed for each subband. However, the results in Table VI indicate that independent processing of subbands is not suitable for image restoration. In contrast, MWCNN incorporates DWT into CNN from the perspective of enlarging the receptive field without information loss, allowing DWT to be embedded into any CNN with pooling operations. By taking all subbands as input, MWCNN is more powerful in modeling inter-band dependency. Moreover, our MWCNN is formulated as a single, generic, plug-and-play module that can be used as a direct replacement for the downsampling operation without any adjustments to the network architecture.
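For reference, the Top-1 and Top-5 error reported for the ResNet experiments is the percentage of test samples whose true label is not among the k highest-scoring classes. The sketch below computes this quantity from raw class scores; the function name `top_k_error` and the toy scores are illustrative assumptions and are unrelated to the reported results.

```python
def top_k_error(scores, labels, k):
    """Percentage of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)

# Toy example: 3 samples, 4 classes (hypothetical scores).
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
labels = [1, 1, 3]
print(top_k_error(scores, labels, 1), top_k_error(scores, labels, 2))
```

Lower is better for both numbers, and the Top-5 error can never exceed the Top-1 error on the same predictions.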

Model | Variant | CIFAR-10 | CIFAR-100 | SVHN | MNIST | ImageNet32
--- | --- | --- | --- | --- | --- | ---
DenseNet-BC-100 (k=12) [52] | Base | 95.40 | 77.38 | 98.03 | 99.69 | 48.32
DenseNet-BC-100 (k=12) [52] | MW | 95.57 | 77.70 | 98.11 | 99.74 | 49.77
AllConv [73] | Base | 91.58 | 67.57 | 98.06 | 99.67 | 30.12
AllConv [73] | MW | 93.08 | 71.06 | 98.09 | 99.70 | 36.64
PyramidNet-164 [50] | Base | 96.09 | 80.35 | 97.98 | 99.72 | 55.30
PyramidNet-164 [50] | MW | 96.11 | 80.49 | 98.17 | 99.74 | 55.51
PreResNet-164 [10] | Base | 95.29 | 77.32 | 98.04 | 99.67 | 46.16
PreResNet-164 [10] | MW | 95.72 | 77.67 | 98.09 | 99.72 | 47.88
WideResNet-28-10 [53] | Base | 96.56 | 81.30 | 98.19 | 99.70 | 57.51
WideResNet-28-10 [53] | MW | 96.60 | 81.42 | 98.36 | 99.75 | 57.66

TABLE VIII: Accuracy on CIFAR-10, CIFAR-100, SVHN, MNIST and ImageNet32.
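Model | Variant | Top-1 | Top-5
--- | --- | --- | ---
ResNet18-512d [9] | Base | 49.08 | 24.25
ResNet18-512d [9] | MW | 48.46 | 23.96
ResNet50 [9] | Base | 43.28 | 19.39
ResNet50 [9] | MW | 41.70 | 18.05
ResNet50-512d [9] | Base | 41.42 | 18.14
ResNet50-512d [9] | MW | 41.27 | 17.70

TABLE IX: Top-1 and Top-5 error on ImageNet64.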
Model | Variant | Top-1 | Top-5
--- | --- | --- | ---
ResNet18-512d [9] | Base | 49.96 | 19.19
ResNet18-512d [9] | MW | 49.56 | 18.88

TABLE X: Top-1 and Top-5 error on Places365.

V Conclusion

In this paper, we present MWCNN to achieve a better trade-off between receptive field size and efficiency. To this end, DWT is introduced as a downsampling operation to reduce spatial resolution and enlarge the receptive field, and it can be embedded into any CNN that uses a pooling operation. More specifically, MWCNN takes both low-frequency and high-frequency subbands as input and can safely perform downsampling without information loss. In addition, WPT can be treated as pre-defined parameters to ease network learning. We first design an architecture for image restoration based on U-Net, which consists of a contracting subnetwork and an expanding subnetwork (for object classification, only the contracting subnetwork is employed). Due to the invertibility of DWT and its frequency and location properties, the proposed MWCNN is effective in recovering detailed textures and sharp structures from degraded observations. Extensive experiments demonstrate the effectiveness and efficiency of MWCNN on three restoration tasks, i.e., image denoising, SISR, and JPEG image artifacts removal, and on the object classification task with different CNN methods. In future work, we aim to design novel network architectures that extend MWCNN to more restoration tasks such as image deblurring. We will also investigate flexible MWCNN models for handling blind restoration tasks. High-level dense prediction tasks, such as object detection and image segmentation, are often accomplished by adopting pooling for downsampling and then performing upsampling for dense prediction. Therefore, the limitations of the pooling operation may still be unfavorable in these tasks. We will also extend the proposed MWCNN to these tasks to better preserve fine-scale features.


This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61872118, 61773002 and 61671182.


  • [1] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
  • [2] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
  • [3] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [4] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [5] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, PP(99):1–1, 2016.
  • [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [8] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
  • [11] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [12] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
  • [13] A. Adler, D. Boublil, M. Elad, and M. Zibulevsky. A deep learning approach to block-based compressed sensing of images. arXiv preprint arXiv:1606.01519, 2016.
  • [14] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (red). arXiv preprint arXiv:1611.02862, 2016.
  • [15] S. Yan, X. Xu, D. Xu, S. Lin, and X. Li. Image classification with densely sampled image windows and generalized adaptive multiple kernel learning. IEEE Transactions on Cybernetics, 45(3):381–390, 2015.
  • [16] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502, 2017.
  • [17] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [18] C. Dong, C. L. Chen, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407, 2016.
  • [19] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In IEEE Conference on International Conference on Computer Vision, 2017.
  • [20] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
  • [21] I. Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, 1990.
  • [22] I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
  • [23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
  • [24] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo. Multi-level wavelet-cnn for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
  • [25] M. R. Banham and A. K. Katsaggelos. Digital image restoration. IEEE Signal Processing Magazine, 14(2):24–41, 1997.
  • [26] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2015.
  • [27] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
  • [28] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
  • [29] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
  • [30] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
  • [31] F. Agostinelli, M. R. Anderson, and H. Lee. Robust image denoising with multi-column deep neural networks. In Advances in Neural Information Processing Systems, pages 1493–1501, 2013.
  • [32] V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009.
  • [33] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In International Conference on Neural Information Processing Systems, pages 341–349, 2012.
  • [34] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
  • [35] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE Conference on International Conference on Computer Vision, pages 576–584, 2015.
  • [36] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1132–1140, 2017.
  • [37] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, 2018.
  • [38] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.
  • [39] V. Santhanam, V. I. Morariu, and L. S. Davis. Generalized deep image to image regression. IEEE Conference on Computer Vision and Pattern Recognition, pages 5609–5619, 2017.
  • [40] K. Zhang, W. Zuo, and L. Zhang. FFDNet: Toward a fast and flexible solution for CNN based image denoising. IEEE Transactions on Image Processing, 2018.
  • [41] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang. Toward convolutional blind denoising of real photographs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [42] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  • [43] G. Riegler, S. Schulter, M. Ruther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE Conference on International Conference on Computer Vision, 2015.
  • [44] K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [45] W. Bae, J. Yoo, and J. C. Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1141–1149, 2017.
  • [46] T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga. Deep wavelet prediction for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
  • [47] Y. Han and J. C. Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Transactions on Medical Imaging, pages 1418–1429, 2018.
  • [48] J. C. Ye and Y. S. Han. Deep convolutional framelets: A general deep learning for inverse problems. Society for Industrial and Applied Mathematics, 2018.
  • [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • [50] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6307–6315, 2017.
  • [51] S. Zhai, Y. Cheng, Z. M. Zhang, and W. Lu. Doubly convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1082–1090, 2016.
  • [52] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
  • [53] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
  • [54] A. Takeki, D. Ikami, G. Irie, and K. Aizawa. Parallel grid pooling for data augmentation. In European Conference on Computer Vision, 2018.
  • [55] Q. Wang, Z. Gao, J. Xie, W. Zuo, and P. Li. Global gated mixture of second-order pooling for improving deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1277–1286, 2018.
  • [56] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
  • [57] A. N. Akansu and R. A. Haddad. Multiresolution signal decomposition: transforms, subbands, and wavelets. Academic Press, 2001.
  • [58] A. S. Lewis and G. Knowles. Image compression using the 2-D wavelet transform. IEEE Transactions on Image Processing, 1(2):244–250, 1992.
  • [59] S. G. Chang, B. Yu, and M. Vetterli. Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9(9):1532–1546, 2000.
  • [60] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2015.
  • [61] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
  • [62] E. Agustsson and R. Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1122–1131, 2017.
  • [63] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE Conference on International Conference Computer Vision, volume 2, pages 416–423, 2001.
  • [64] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
  • [65] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
  • [66] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
  • [67] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):193–201, 2009.
  • [68] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In the 23rd ACM international conference on Multimedia, pages 689–692, 2015.
  • [69] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report, Citeseer, 2009.
  • [70] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, volume 2011, page 5, 2011.
  • [71] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [72] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [73] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop, 2015.