I Introduction
Nowadays, convolutional neural networks (CNNs) have become the dominant technique behind many computer vision tasks, e.g., image restoration [1, 2, 5, 3, 4] and object classification [6, 7, 8, 9, 10]. With continual progress, CNNs are extensively and easily learned on large-scale datasets, sped up by increasingly advanced GPU devices, and often achieve state-of-the-art performance in comparison with traditional methods. The popularity of CNNs in computer vision can be attributed to two aspects. First, existing CNN-based solutions dominate several simple tasks by outperforming other methods by a large margin, such as single image super-resolution (SISR) [1, 2, 11], image denoising [5], image deblurring [12], compressed imaging [13], and object classification [6]. Second, CNNs can be treated as a modular part and plugged into traditional methods, which also promotes their widespread use [12, 14, 15].
Actually, CNNs in computer vision can be viewed as a nonlinear map from the input image to the target. In general, a larger receptive field helps improve the fitting ability of CNNs and promotes accurate performance by taking more spatial context into account. The receptive field can be enlarged by increasing the network depth, enlarging the filter size, or using pooling operations. However, increasing the network depth or enlarging the filter size inevitably results in higher computational cost. Pooling can enlarge the receptive field and guarantee efficiency by directly reducing the spatial resolution of the feature maps. Nevertheless, it may result in information loss. Recently, dilated filtering [8] was proposed to trade off between receptive field size and efficiency by inserting “zero holes” into convolutional filters. However, the receptive field of dilated filtering with a fixed factor greater than 1 only takes into account a sparse sampling of the input in a checkerboard pattern, and thus inherently suffers from the gridding effect [16]. Based on the above analysis, one should be careful when enlarging the receptive field in order to avoid both increasing the computational burden and sacrificing performance. As can be seen from Figure 1, even though DRRN [17] and MemNet [19] enjoy larger receptive fields and higher PSNR than VDSR [2] and DnCNN [5], they are nevertheless orders of magnitude slower.
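To make the three ways of enlarging the receptive field concrete, the following minimal sketch computes the receptive field of a layer stack from kernel size, stride and dilation. The function name `receptive_field` and the three example stacks are illustrative assumptions, not configurations taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of layers.

    Each layer is a tuple (kernel_size, stride, dilation).
    """
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between two adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stacks, chosen only to illustrate the trade-off.
deep = receptive_field([(3, 1, 1)] * 20)                       # depth only
pooled = receptive_field([(3, 1, 1)] * 4 + [(2, 2, 1)] + [(3, 1, 1)] * 4)  # one 2x2 pooling
dilated = receptive_field([(3, 1, 2)] * 8)                     # dilation factor 2
plain = receptive_field([(3, 1, 1)] * 8)                       # same depth, no pooling or dilation
print(deep, pooled, dilated, plain)
```

With eight 3x3 layers, inserting a single stride-2 downsampling or using dilation 2 grows the receptive field well beyond the plain stack, without adding layers or enlarging filters.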
To address the problems stated above, we propose an efficient CNN-based approach that aims at a trade-off between performance and efficiency. More specifically, we propose a multi-level wavelet CNN (MWCNN) that utilizes the discrete wavelet transform (DWT) to replace the pooling operations. Due to the invertibility of DWT, no image information or intermediate features are lost by the proposed downsampling scheme. Moreover, both frequency and location information of feature maps is captured by DWT [21, 22], which helps preserve detailed texture when using multi-frequency feature representation. For image restoration tasks, we adopt the inverse wavelet transform (IWT) with an expansion convolutional layer to restore the resolution of feature maps, where the U-Net architecture [23] is used as the backbone. Also, element-wise summation is adopted to combine feature maps, thus enriching the feature representation.
In terms of the relation with relevant works, we show that dilated filtering can be interpreted as a special variant of MWCNN, and that the proposed method is more general and effective in enlarging the receptive field. Using an ensemble of such networks trained with embedded multi-level wavelet, we achieve PSNR/SSIM values that improve upon the best known results in image restoration tasks such as image denoising, SISR and JPEG image artifacts removal. For the task of object classification, the proposed MWCNN can achieve higher performance than its counterparts adopting pooling layers. As shown in Figure 1, although MWCNN is moderately slower than LapSRN [3], DnCNN [5] and VDSR [2], it has a much larger receptive field and achieves higher PSNR values.
This paper is an extension of our previous work [24]. Compared to the former work [24], we propose a more general approach for improving performance, further extend it to high-level tasks, and provide more analysis and discussion. To sum up, the contributions of this work include:

A novel MWCNN model that enlarges the receptive field with a better trade-off between efficiency and restoration performance by introducing the wavelet transform.

Promising detail preservation due to the good time-frequency localization property of DWT.

A general approach to embedding the wavelet transform in any CNN where a pooling operation is employed.

State-of-the-art performance on image denoising, SISR, JPEG image artifacts removal, and classification.
The remainder of the paper is organized as follows. Sec. II briefly reviews the development of CNNs for image restoration and classification. Sec. III describes the proposed MWCNN model in detail. Sec. IV reports the experimental results in terms of performance evaluation. Finally, Sec. V concludes the paper.
II Related Work
In this section, the development of CNNs for image restoration tasks is briefly reviewed. In particular, we discuss relevant works on incorporating DWT in CNNs. Finally, relevant object classification works are introduced.
II-A Image Restoration
Image restoration aims at recovering the latent clean image $\mathbf{x}$ from its degraded observation $\mathbf{y}$. For decades, research on image restoration has been conducted from the viewpoints of both prior modeling and discriminative learning [25, 26, 27, 28, 29, 30]. Recently, with their booming development, CNN-based methods have achieved state-of-the-art performance over traditional methods.
II-A1 Improving Performance and Efficiency of CNNs for Image Restoration
In early attempts, CNN-based methods did not work well on some image restoration tasks. For example, the methods of [31, 32, 33] could not achieve state-of-the-art denoising performance compared to BM3D [27] from 2007. In [34], a multi-layer perceptron (MLP) achieved performance comparable to BM3D by learning the mapping from noisy patches to clean patches. In 2014, Dong et al. [1] for the first time adopted a 3-layer FCN without pooling for SISR, which has only a small receptive field but achieves state-of-the-art performance. Then, Dong et al. [35] proposed a 4-layer ARCNN for JPEG image artifacts reduction.
Recently, deeper networks have been increasingly used for image restoration. For SISR, Kim et al. [2] stacked a 20-layer CNN with residual learning and adjustable gradient clipping. Subsequently, several designs, for example, very deep networks [36, 5, 37], symmetric skip connections [20], residual units [11], Laplacian pyramids [3], and recursive architectures [38, 17], were also suggested to enlarge the receptive field. However, the receptive field of those methods is enlarged by increasing the network depth, which may have limited potential to extend to even deeper networks.
For a better trade-off between speed and performance, a 7-layer FCN with dilated filtering was presented as a denoiser by Zhang et al. [12]. Santhanam et al. [39] adopted pooling/unpooling to obtain and aggregate multi-context representations for image denoising. In [40], Zhang et al. proposed to operate the CNN denoiser on downsampled subimages. Guo et al. [41] utilized a U-Net [23] based CNN as a non-blind denoiser. On account of the speciality of SISR, the receptive field size and efficiency can be better traded off by taking the low-resolution (LR) images as input and zooming in on features with an upsampling operation [18, 4, 42]. Nevertheless, this strategy can only be adopted for SISR and is not suitable for other tasks such as image denoising and JPEG image artifacts removal.
II-A2 Universality of Image Restoration
On account of the similarity of tasks such as image denoising, SISR, and JPEG image artifacts removal, a model suggested for one task may be easily extended to other image restoration tasks simply by retraining the same network. For example, both DnCNN [5] and MemNet [19] have been evaluated on all three tasks. Moreover, CNN denoisers can also serve as a kind of plug-and-play prior. Thus, any restoration task can be tackled by sequentially applying the CNN denoisers within unrolled inference [12]. To provide an explicit functional for defining the regularization induced by denoisers, Romano et al. [14] further proposed a regularization-by-denoising framework. In [43] and [44], the LR image together with the blur kernel is incorporated into CNNs for non-blind SR. These methods not only promote the application of CNNs in low-level vision, but also present solutions for deploying CNN denoisers in other image restoration tasks.
II-A3 Incorporating DWT in CNNs
Several studies have also incorporated the wavelet transform into CNNs. Bae et al. [45] proposed a wavelet residual network (WavResNet) based on the finding that CNN learning can benefit from learning on wavelet subbands with features having more channels. For recovering missing details in subbands, Guo et al. [46] proposed a deep wavelet super-resolution (DWSR) method. Subsequently, deep convolutional framelets (DCF) [47, 48] were developed for low-dose CT and inverse problems. However, only one-level wavelet decomposition is considered in WavResNet and DWSR, which may restrict the application of the wavelet transform. Inspired by the viewpoint of decomposition, DCF independently processes each subband, which inherently ignores the dependency between these subbands. In contrast, our MWCNN considers the multi-level wavelet transform to enlarge the receptive field while barely increasing the computational burden.
II-B Object Classification
AlexNet [6] is an 8-layer network for object classification, and for the first time achieved state-of-the-art performance over other methods on the ILSVRC2012 dataset. In this method, filters of different sizes are adopted for extracting and enhancing features. However, Simonyan et al. [7] found that using only $3\times 3$ convolutional filters with a deeper architecture can realize a larger receptive field and achieve better performance than AlexNet. Yu et al. [8] adopted dilated convolution to enlarge the receptive field without increasing the computational burden. Later, residual blocks [9, 10], the inception model [49], pyramid architecture [50], doubly CNN [51] and other architectures [53, 52] were proposed for object classification. Some variants of the pooling operation, such as parallel grid pooling [54] and gated mixture of second-order pooling [55], were also proposed to enhance the feature extractor or feature representation. In general, pooling operations, such as average pooling and max pooling, are often adopted for downsampling features and enlarging the receptive field, but they can result in significant information loss. To avoid this downside, we adopt DWT as our downsampling layer in place of the pooling operation without changing the main architecture, which enhances the feature representation.
III Method
In this section, we first briefly introduce the concept of multi-level wavelet packet transform (WPT) and provide our motivation. We then formally present our MWCNN based on multi-level WPT, and describe its network architecture for image restoration and object classification. Finally, a discussion is presented to analyze the connection of MWCNN with average pooling and dilated filtering.
III-A From multi-level WPT to MWCNN
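Given an image $\mathbf{x}$, we can use the 2D DWT [56] with four convolutional filters, i.e., the low-pass filter $\mathbf{f}_{LL}$ and the high-pass filters $\mathbf{f}_{LH}$, $\mathbf{f}_{HL}$, $\mathbf{f}_{HH}$, to decompose $\mathbf{x}$ into four subband images, i.e., $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$, and $\mathbf{x}_4$. Note that the four filters have fixed parameters and are applied with convolutional stride 2 during the transformation. Taking the Haar wavelet as an example, the four filters are defined as
$$
\mathbf{f}_{LL} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\quad
\mathbf{f}_{LH} = \begin{bmatrix} -1 & -1 \\ 1 & 1 \end{bmatrix},\quad
\mathbf{f}_{HL} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \end{bmatrix},\quad
\mathbf{f}_{HH} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}. \tag{1}
$$
It is evident that $\mathbf{f}_{LL}$, $\mathbf{f}_{LH}$, $\mathbf{f}_{HL}$, and $\mathbf{f}_{HH}$ are orthogonal to each other and form a $4\times 4$ invertible matrix. The operation of DWT is defined as $\mathbf{x}_1 = (\mathbf{f}_{LL} \otimes \mathbf{x})\downarrow_2$, $\mathbf{x}_2 = (\mathbf{f}_{LH} \otimes \mathbf{x})\downarrow_2$, $\mathbf{x}_3 = (\mathbf{f}_{HL} \otimes \mathbf{x})\downarrow_2$, and $\mathbf{x}_4 = (\mathbf{f}_{HH} \otimes \mathbf{x})\downarrow_2$, where $\otimes$ denotes the convolution operator, and $\downarrow_2$ means the standard downsampling operator with factor 2. In other words, DWT mathematically involves four fixed convolution filters with stride 2 to implement the downsampling operator. Moreover, according to the theory of the Haar transform [56], the $(i,j)$-th value of $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ and $\mathbf{x}_4$ after the 2D Haar transform can be written as
$$
\begin{aligned}
\mathbf{x}_1(i,j) &= \mathbf{x}(2i-1,2j-1) + \mathbf{x}(2i-1,2j) + \mathbf{x}(2i,2j-1) + \mathbf{x}(2i,2j),\\
\mathbf{x}_2(i,j) &= -\mathbf{x}(2i-1,2j-1) - \mathbf{x}(2i-1,2j) + \mathbf{x}(2i,2j-1) + \mathbf{x}(2i,2j),\\
\mathbf{x}_3(i,j) &= -\mathbf{x}(2i-1,2j-1) + \mathbf{x}(2i-1,2j) - \mathbf{x}(2i,2j-1) + \mathbf{x}(2i,2j),\\
\mathbf{x}_4(i,j) &= \mathbf{x}(2i-1,2j-1) - \mathbf{x}(2i-1,2j) - \mathbf{x}(2i,2j-1) + \mathbf{x}(2i,2j).
\end{aligned} \tag{2}
$$
Although the downsampling operation is deployed, due to the biorthogonal property of DWT, the original image $\mathbf{x}$ can be accurately reconstructed without information loss by the IWT, i.e., $\mathbf{x} = IWT(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)$. For the Haar wavelet, the IWT can be defined as follows:
$$
\begin{aligned}
\mathbf{x}(2i-1,2j-1) &= \big(\mathbf{x}_1(i,j) - \mathbf{x}_2(i,j) - \mathbf{x}_3(i,j) + \mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i-1,2j) &= \big(\mathbf{x}_1(i,j) - \mathbf{x}_2(i,j) + \mathbf{x}_3(i,j) - \mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i,2j-1) &= \big(\mathbf{x}_1(i,j) + \mathbf{x}_2(i,j) - \mathbf{x}_3(i,j) - \mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i,2j) &= \big(\mathbf{x}_1(i,j) + \mathbf{x}_2(i,j) + \mathbf{x}_3(i,j) + \mathbf{x}_4(i,j)\big)/4.
\end{aligned} \tag{3}
$$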
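The invertibility of the one-level Haar DWT can be checked numerically. The following is a minimal NumPy sketch, assuming the unnormalized 2x2 Haar sums and differences for the forward transform and a division by 4 in the inverse; the sign convention and the function names `haar_dwt` and `haar_iwt` are assumptions made for illustration.

```python
import numpy as np


def haar_dwt(x):
    """One-level 2D Haar DWT of an (H, W) array with even H and W."""
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    x1 = a + b + c + d    # low-frequency subband
    x2 = -a - b + c + d   # three high-frequency subbands
    x3 = -a + b - c + d
    x4 = a - b - c + d
    return x1, x2, x3, x4


def haar_iwt(x1, x2, x3, x4):
    """Inverse of haar_dwt: rebuild the (2h, 2w) image from four (h, w) subbands."""
    h, w = x1.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (x1 - x2 - x3 + x4) / 4
    x[0::2, 1::2] = (x1 - x2 + x3 - x4) / 4
    x[1::2, 0::2] = (x1 + x2 - x3 - x4) / 4
    x[1::2, 1::2] = (x1 + x2 + x3 + x4) / 4
    return x


rng = np.random.default_rng(0)
image = rng.random((8, 8))
subbands = haar_dwt(image)
restored = haar_iwt(*subbands)
print([s.shape for s in subbands])   # four subbands at half resolution
print(np.allclose(restored, image))  # reconstruction check
```

Each subband has half the spatial resolution of the input, yet the four together carry exactly the same information, which is what makes the DWT usable as a lossless downsampling step.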
Generally, the subband images $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$, and $\mathbf{x}_4$ can be sequentially decomposed by DWT for further processing in multi-level WPT [57, 22]. To get the results of two-level WPT, DWT is separately utilized to decompose each subband image $\mathbf{x}_i$ ($i = 1, 2, 3$, or $4$) into four subband images $\mathbf{x}_{i,1}$, $\mathbf{x}_{i,2}$, $\mathbf{x}_{i,3}$, and $\mathbf{x}_{i,4}$. Recursively, the results of three- or higher-level WPT can be obtained. Correspondingly, the reconstruction of the subband images at each level is implemented by the inverse operation via IWT. The above-mentioned process of decomposition and reconstruction of an image is illustrated in Figure 2(a). If we treat the filters of WPT as convolutional filters with pre-defined weights, one can see that WPT is a special case of FCN without the nonlinearity layers. Obviously, the original image can be first decomposed by WPT and then accurately reconstructed by inverse WPT without any information loss.
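The recursive decomposition can be sketched by treating the Haar filters as one fixed 4x4 mixing matrix applied to the four pixels of every 2x2 block. The matrix layout (rows ordered LL, LH, HL, HH), the channel ordering, and the names `HAAR`, `dwt` and `iwt` below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Rows: LL, LH, HL, HH weights for the (top-left, top-right,
# bottom-left, bottom-right) pixels of each 2x2 block.
HAAR = np.array([[ 1,  1,  1,  1],
                 [-1, -1,  1,  1],
                 [-1,  1, -1,  1],
                 [ 1, -1, -1,  1]], dtype=np.float64)


def dwt(x):
    """(C, H, W) -> (4C, H/2, W/2): every channel is split into four subbands."""
    blocks = np.stack([x[:, 0::2, 0::2], x[:, 0::2, 1::2],
                       x[:, 1::2, 0::2], x[:, 1::2, 1::2]])  # (4, C, H/2, W/2)
    sub = np.tensordot(HAAR, blocks, axes=1)                  # mix the four block positions
    return sub.reshape(-1, *sub.shape[2:])


def iwt(y):
    """(4C, h, w) -> (C, 2h, 2w): exact inverse, since HAAR @ HAAR.T = 4 * I."""
    sub = y.reshape(4, -1, *y.shape[1:])
    blocks = np.tensordot(HAAR.T / 4, sub, axes=1)
    c, h, w = blocks.shape[1:]
    x = np.empty((c, 2 * h, 2 * w))
    x[:, 0::2, 0::2], x[:, 0::2, 1::2] = blocks[0], blocks[1]
    x[:, 1::2, 0::2], x[:, 1::2, 1::2] = blocks[2], blocks[3]
    return x


rng = np.random.default_rng(0)
image = rng.random((1, 16, 16))
level1 = dwt(image)    # 4 subbands
level2 = dwt(level1)   # every subband decomposed again: 16 subbands
restored = iwt(iwt(level2))
print(level1.shape, level2.shape, np.allclose(restored, image))
```

Because no nonlinearity sits between the two levels, this is exactly the "FCN with pre-defined weights" view of WPT; inserting learned blocks between the `dwt` and `iwt` calls is where a network would depart from it.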
In image processing applications such as image denoising and compression, some operations, e.g., soft-thresholding and quantization, are usually required to process the decomposition part [59, 58], as shown in Figure 2(a). These operations can be treated as a kind of nonlinearity tailored to the specific task. In this work, we further extend WPT to a multi-level wavelet-CNN (MWCNN) by plugging CNN blocks into the traditional WPT-based method, as illustrated in Figure 2(b). Due to the biorthogonal property of WPT, our MWCNN can use subsampling and upsampling operations safely without incurring information loss. Obviously, our MWCNN is a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping. Moreover, DWT can be treated as a downsampling operation and extended to any CNN where a pooling operation is required.
III-B Network architecture
III-B1 Image Restoration
As mentioned in Sec. III-A, we design the MWCNN architecture for image restoration based on the principle of WPT, as illustrated in Figure 2(b). The key idea is to insert CNN blocks into WPT before (or after) each level of DWT. As shown in Figure 3, each CNN block is a 3-layer FCN without pooling, and takes both low-frequency subbands and high-frequency subbands as inputs. More concretely, each layer contains convolution with filters (Conv), and rectified linear unit (ReLU) operations. Only Conv is adopted in the last layer for predicting the residual result. The number of convolutional layers is set to 24. For more details on the setting of MWCNN, please refer to Figure 3.
Our MWCNN modifies U-Net in three aspects. (i) In conventional U-Net, pooling and deconvolution are utilized as downsampling and upsampling layers. In comparison, DWT and IWT are used in MWCNN. (ii) After DWT, we deploy another CNN block to reduce the number of feature map channels for compact representation and for modeling inter-band dependency. Convolution is then adopted to increase the number of feature map channels, and IWT is utilized to upsample the feature maps. In comparison, in conventional U-Net, pooling has no effect on the number of feature map channels, convolution layers are used to increase the feature map channels after pooling, and deconvolution layers are directly adopted to zoom in on the feature maps for upsampling. (iii) In MWCNN, element-wise summation is used to combine the feature maps from the contracting and expanding subnetworks, while concatenation is adopted in conventional U-Net.
Compared to our previous work [24], we have made several improvements: (i) Instead of directly decomposing input images by DWT, we first use conv blocks to extract features from the input, which is empirically shown to be beneficial for image restoration. (ii) In the 3rd hierarchical level, we use more feature maps to enhance feature representation. In our implementation, the Haar wavelet is adopted as the default wavelet in MWCNN. Other wavelets, e.g., Daubechies 2 (DB2), are also considered in our experiments.
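Denote by $\Theta$ the network parameters of MWCNN, and let $\mathcal{F}(\mathbf{y}; \Theta)$ be the network output. Let $\{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^{N}$ be a training set, where $\mathbf{y}_i$ is the $i$-th input image and $\mathbf{x}_i$ is the corresponding ground-truth image. Then the objective function for learning MWCNN is given by
$$
\mathcal{L}(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\|\mathcal{F}(\mathbf{y}_i;\Theta)-\mathbf{x}_i\right\|_F^2. \tag{4}
$$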
The ADAM algorithm [60] is adopted to train MWCNN by minimizing the objective function.
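As a small illustration of the training objective, the sketch below evaluates a squared-error loss over a batch of network outputs and ground-truth images. It assumes the common form with a 1/(2N) factor in front of the summed squared Frobenius norms; the function name `restoration_objective` and the toy arrays are hypothetical.

```python
import numpy as np


def restoration_objective(outputs, targets):
    """1/(2N) * sum_i ||outputs_i - targets_i||_F^2 for arrays of shape (N, H, W).

    The 1/(2N) scaling is an assumption made for this sketch.
    """
    n = outputs.shape[0]
    return np.sum((outputs - targets) ** 2) / (2 * n)


# Toy batch: two 2x2 "images" whose predictions are off by one everywhere.
outputs = np.zeros((2, 2, 2))
targets = np.ones((2, 2, 2))
loss = restoration_objective(outputs, targets)
print(loss)
```

In practice the value would be minimized with a stochastic optimizer such as ADAM over mini-batches rather than evaluated on the full training set.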
III-B2 Extension to Object Classification
As in image restoration, DWT is employed as the downsampling operation in place of pooling, usually without a corresponding upsampling operation. The compression filter with Conv is subsequently applied after the DWT. Note that we do not modify other blocks or the loss function. With this improvement, features can be further selected and enhanced through adaptive learning. Moreover, the pooling operation in any CNN can be replaced by the DWT operation, so that the information of feature maps is transmitted to the next layer without loss. DWT can thus be seen as a safe downsampling module that can be plugged into any CNN without changing the network architecture, and may help extract more powerful features for different tasks.
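The replacement of a pooling layer can be sketched as a fixed Haar DWT followed by a learnable channel-compression convolution. The sketch below assumes a pointwise (1x1) compression convolution and a 16-channel feature map; the filter size, the channel counts, and the names `dwt_downsample` and `compress` are all assumptions, since the text does not fix them here.

```python
import numpy as np

# Rows: LL, LH, HL, HH weights for the four pixels of each 2x2 block.
HAAR = np.array([[ 1,  1,  1,  1],
                 [-1, -1,  1,  1],
                 [-1,  1, -1,  1],
                 [ 1, -1, -1,  1]], dtype=np.float64)


def dwt_downsample(feat):
    """(C, H, W) -> (4C, H/2, W/2) using fixed Haar filters with stride 2."""
    blocks = np.stack([feat[:, 0::2, 0::2], feat[:, 0::2, 1::2],
                       feat[:, 1::2, 0::2], feat[:, 1::2, 1::2]])
    sub = np.tensordot(HAAR, blocks, axes=1)
    return sub.reshape(-1, *sub.shape[2:])


def compress(feat, weight):
    """Pointwise (1x1) convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feat)


rng = np.random.default_rng(0)
feat = rng.random((16, 32, 32))                 # hypothetical feature map before downsampling
weight = rng.standard_normal((16, 64)) * 0.1    # learnable in a real network; random here
out = compress(dwt_downsample(feat), weight)
print(out.shape)  # same shape a 2x2 pooling layer would produce
```

The output has the shape a stride-2 pooling layer would give, so the surrounding blocks need no change, while the compression weights decide which mixture of the four subbands to pass on.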
III-C Discussion
III-C1 Connection to Pooling Operation
The DWT in the proposed MWCNN is closely related to the pooling operation and dilated filtering. Using the Haar wavelet as an example, we explain the connection between DWT and average pooling. For average pooling with factor 2, the $(i,j)$-th value of the feature map $\mathbf{x}_l$ in the $l$-th layer after pooling can be written as
$$
\mathbf{x}_l(i,j) = \frac{1}{4}\big(\mathbf{x}_{l-1}(2i-1,2j-1) + \mathbf{x}_{l-1}(2i-1,2j) + \mathbf{x}_{l-1}(2i,2j-1) + \mathbf{x}_{l-1}(2i,2j)\big), \tag{5}
$$
where $\mathbf{x}_{l-1}$ is the feature map before the pooling operation. It is clear that Eq. 5 coincides, up to a constant scaling factor, with the low-frequency component of DWT in Eq. 2, which also means that all the high-frequency information is lost during the pooling operation. In Figure 4, the feature map is first decomposed into four subimages with stride 2. The average pooling operation can be treated as summing all subimages with a fixed coefficient to generate a new subimage. In comparison, DWT uses all subimages with four fixed orthogonal weights to obtain four new subimages. By taking all the subbands into account, MWCNN can therefore avoid the information loss caused by conventional subsampling, which may benefit restoration and classification. Hence, average pooling can be seen as a simplified variant of the proposed MWCNN.
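The relation between average pooling and the low-frequency Haar subband can be verified directly. The sketch below assumes an unnormalized Haar low-pass sum, so the two agree up to a factor of 4; the helper names `avg_pool2` and `haar_ll` are hypothetical.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def haar_ll(x):
    """Unnormalized low-frequency Haar subband: sum over every 2x2 block."""
    return x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]


rng = np.random.default_rng(0)
feat = rng.random((6, 6))
pooled = avg_pool2(feat)
same_as_ll = np.allclose(pooled, haar_ll(feat) / 4)

# A different input with the same block averages: pooling cannot tell them apart.
other = feat.copy()
other[0, 0] += 1.0
other[0, 1] -= 1.0
collides = np.allclose(avg_pool2(other), pooled)
print(same_as_ll, collides)
```

The second check shows why pooling is not invertible: the change made to `other` lives entirely in the high-frequency subbands, which pooling discards and the full DWT keeps.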
III-C2 Connection to Dilated Filtering
To illustrate the connection between MWCNN and dilated filtering, we first give the definition of dilated filtering with factor 2:
(6) 
where means convolution operation with dilation factor 2, is the position in convolutional kernel , and is the position within the range of convolution of feature . Eq. (6) can be decomposed into two steps, sampling and convolving. Sampled patch is obtained by sampling at center position of with one interval pixel under the constraint . Then the value is obtained by convolving sampled patch with kernel . Therefore, dilated filtering with factor 2 can be expressed as first decomposing an image into four subimages and then using the shared standard convolutional kernel on those subimages, as illustrated in Figure 4. We rewrite Eq. (6) for obtaining the pixel value as follows:
(7) 
Then the pixel value , and can be obtained in the same way. Actually, the value of at the position , , and can be obtained by applying IWT on subband images , , and based on Eq. (3). Therefore, dilated filtering can be represented as convolution with the subband images as follows:
(8) 
Different from dilated filtering, the definition of MWCNN in Figure 4 can be given as
where , and denotes concatenate operation. If is group convolution [61] with factor 4, the equation can be rewritten as:
(9) 
Note that can accurately reconstruct by using IWT. Compared to Eq. (8), the weights of each subband and the corresponding convolution are different. That means that our MWCNN can be reduced to dilated filtering if the subbands are replaced by subimages after IWT in Eq. (3), and the convolution in is shared to each other. Hence, the dilated filtering can be seen as a variant of the proposed MWCNN.
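The observation that dilated filtering with factor 2 amounts to applying one shared standard kernel to four stride-2 subimages can be checked numerically. This is a minimal sketch using "valid" cross-correlation (the CNN notion of convolution) without padding; the helper `correlate_valid` and the random test data are assumptions for illustration.

```python
import numpy as np


def correlate_valid(x, k, dilation=1):
    """'Valid' 2D cross-correlation with an optional dilation factor."""
    kh, kw = k.shape
    out_h = x.shape[0] - dilation * (kh - 1)
    out_w = x.shape[1] - dilation * (kw - 1)
    out = np.zeros((out_h, out_w))
    for q in range(kh):
        for t in range(kw):
            out += k[q, t] * x[q * dilation:q * dilation + out_h,
                               t * dilation:t * dilation + out_w]
    return out


rng = np.random.default_rng(0)
feature = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))

dilated = correlate_valid(feature, kernel, dilation=2)  # 4x4 output

# Split the feature map into four subimages and apply the SAME kernel to each.
matches = []
for a in (0, 1):
    for b in (0, 1):
        sub_out = correlate_valid(feature[a::2, b::2], kernel)  # standard filtering
        matches.append(np.allclose(sub_out, dilated[a::2, b::2]))
print(matches)
```

Each subimage is processed in isolation with identical weights, which is the restriction that MWCNN lifts by mixing the subbands with different, jointly learned filters.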
Compared with dilated filtering, MWCNN can also avoid the gridding effect. With the increase of depth, dilated filtering with a fixed factor greater than 1 only considers a sparse sampling of units in a checkerboard pattern, resulting in a large amount of information loss (see Figure 5). Another problem with dilated filtering is that two neighboring output pixels may be computed from totally non-overlapping input units (see Figure 5), which may cause inconsistency of local information. Figure 5 also illustrates the receptive field of MWCNN, which is quite different from that of dilated filtering. With dense sampling, the convolution filter takes multi-frequency information as input, and the receptive field is doubled after DWT. One can see that MWCNN is able to address the sparse sampling and local inconsistency problems, and is expected to benefit restoration quantitatively.
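The gridding effect can be visualized by tracking which input pixels are able to influence a single output pixel. The sketch below assumes a stack of three 3x3 layers and compares dilation 1 with a fixed dilation of 2; the function `influence_mask`, the layer count and the grid size are illustrative choices.

```python
import numpy as np


def influence_mask(num_layers, dilation, size=17):
    """Boolean mask of input pixels that can affect the centre output pixel after
    stacking `num_layers` 3x3 convolutions with the given dilation.

    `size` must be large enough that np.roll never wraps around the border.
    """
    mask = np.zeros((size, size), dtype=bool)
    mask[size // 2, size // 2] = True
    for _ in range(num_layers):
        grown = np.zeros_like(mask)
        for dy in (-dilation, 0, dilation):
            for dx in (-dilation, 0, dilation):
                grown |= np.roll(mask, (dy, dx), axis=(0, 1))
        mask = grown
    return mask


def extent(mask):
    """Side length of the smallest window containing all influencing pixels."""
    rows = np.flatnonzero(mask.any(axis=1))
    return int(rows[-1] - rows[0] + 1)


dense = influence_mask(num_layers=3, dilation=1)
dilated = influence_mask(num_layers=3, dilation=2)
print(extent(dense), int(dense.sum()))      # window side, pixels actually used
print(extent(dilated), int(dilated.sum()))
```

With dilation 2 the window is wider, but the number of contributing pixels stays the same and every second row and column is skipped, so the output pixel next to the centre draws on a completely disjoint set of inputs.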
IV Experiments
In this section, we first describe the application of MWCNN to image restoration. Then ablation experiments are presented to analyze the contribution of each component. Finally, the proposed MWCNN is extended to object classification.
IV-A Experimental Setting for Image Restoration
Table I: Average PSNR (dB) / SSIM results of the competing methods for image denoising. "-" marks settings without available results.

| Dataset | Noise level | BM3D [27] | TNRD [26] | DnCNN [5] | IRCNN [12] | RED30 [20] | MemNet [19] | FFDNet [40] | MWCNN(P) | MWCNN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Set12 | 15 | 32.37 / 0.8952 | 32.50 / 0.8962 | 32.86 / 0.9027 | 32.77 / 0.9008 | - | - | 32.75 / 0.9027 | 33.15 / 0.9088 | 33.20 / 0.9089 |
| Set12 | 25 | 29.97 / 0.8505 | 30.05 / 0.8515 | 30.44 / 0.8618 | 30.38 / 0.8601 | - | - | 30.43 / 0.8634 | 30.79 / 0.8711 | 30.84 / 0.8718 |
| Set12 | 50 | 26.72 / 0.7676 | 26.82 / 0.7677 | 27.18 / 0.7827 | 27.14 / 0.7804 | 27.34 / 0.7897 | 27.38 / 0.7931 | 27.32 / 0.7903 | 27.74 / 0.8056 | 27.79 / 0.8060 |
| BSD68 | 15 | 31.08 / 0.8722 | 31.42 / 0.8822 | 31.73 / 0.8906 | 31.63 / 0.8881 | - | - | 31.63 / 0.8902 | 31.86 / 0.8947 | 31.91 / 0.8952 |
| BSD68 | 25 | 28.57 / 0.8017 | 28.92 / 0.8148 | 29.23 / 0.8278 | 29.15 / 0.8249 | - | - | 29.19 / 0.8289 | 29.41 / 0.8360 | 29.46 / 0.8370 |
| BSD68 | 50 | 25.62 / 0.6869 | 25.97 / 0.7021 | 26.23 / 0.7189 | 26.19 / 0.7171 | 26.35 / 0.7245 | 26.35 / 0.7294 | 26.29 / 0.7245 | 26.53 / 0.7366 | 26.58 / 0.7382 |
| Urban100 | 15 | 32.34 / 0.9220 | 31.98 / 0.9187 | 32.67 / 0.9250 | 32.49 / 0.9244 | - | - | 32.43 / 0.9273 | 33.17 / 0.9357 | 33.22 / 0.9361 |
| Urban100 | 25 | 29.70 / 0.8777 | 29.29 / 0.8731 | 29.97 / 0.8792 | 29.82 / 0.8839 | - | - | 29.92 / 0.8886 | 30.66 / 0.9026 | 30.74 / 0.9035 |
| Urban100 | 50 | 25.94 / 0.7791 | 25.71 / 0.7756 | 26.28 / 0.7869 | 26.14 / 0.7927 | 26.48 / 0.7991 | 26.64 / 0.8024 | 26.52 / 0.8057 | 27.42 / 0.8371 | 27.53 / 0.8393 |
IV-A1 Training set
To train our MWCNN, we adopt DIV2K [62] as our training dataset. Concretely, DIV2K contains 800 images with about 2K resolution for training, 100 images for validation, and 100 images for testing. Due to the receptive field of MWCNN being , we crop patches with the size of from the training images in the training stage.
For image denoising, we consider three noise levels, i.e., 15, 25 and 50, and evaluate our denoising method on three datasets, i.e., Set12 [5], BSD68 [63], and Urban100 [64]. For SISR, we take the upsampled LR image as the input to MWCNN with three specific scale factors, i.e., ×2, ×3 and ×4, respectively. Four widely used datasets, Set5 [65], Set14 [66], BSD100 [63] and Urban100 [64], are adopted to evaluate SISR performance. For JPEG image artifacts removal, we follow the setting used in [35], i.e., four compression quality settings of 10, 20, 30 and 40 for the JPEG encoder. Two datasets, Classic5 [35] and LIVE1 [67], are used for evaluating our method.
IV-A2 Network training
In image restoration, a MWCNN model is learned for each degradation setting. The ADAM algorithm [60] with , and is adopted for optimization and we use a minibatch size of 24. The learning rate is decayed exponentially from to
in the 200 epochs. Rotation and/or flip based data augmentation is used during mini-batch learning. The MatConvNet package [68] with an NVIDIA GTX1080 GPU is utilized for training and testing.
IV-B Quantitative and qualitative evaluation on Image Restoration Tasks
Comprehensive experiments are conducted to evaluate our 24-layer MWCNN, using the same setting as in Sec. III-B, on three representative image restoration tasks. Here, we also provide the results of our previous work [24], denoted as MWCNN(P).
IV-B1 Image denoising
For image denoising, only gray images are used for training and evaluation, since most denoising methods are trained and tested only on gray images. We compare with two classic denoising methods, i.e., BM3D [27] and TNRD [26], and five CNN-based methods, i.e., DnCNN [5], IRCNN [12], RED30 [20], MemNet [19], and FFDNet [40]. Table I lists the average PSNR/SSIM results of the competing methods on these three datasets. Since RED30 [20] and MemNet [19] do not provide models trained for noise levels 15 and 25, the corresponding entries are marked as unavailable. All the competing methods perform worse than our MWCNN. It is worth noting that our MWCNN outperforms DnCNN and FFDNet by a clear margin in terms of PSNR on Set12, and slightly surpasses them on BSD68 (see Table I). On Urban100, our MWCNN generally achieves favorable performance when compared with the competing methods. Specifically, the average PSNR of our MWCNN can be 0.5 dB higher than that of DnCNN on Set12, and 1.2 dB higher on Urban100, when the noise level is 50. Figure 6 shows the denoising results on the noisy image “Test044” from Set68. One can see that our MWCNN is promising in removing noise while recovering image details and structures, and can obtain visually more pleasant results than the competing methods due to the reversibility of WPT during downsampling and upsampling.
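Since the comparisons are reported in PSNR, a short reference computation may help when reproducing such numbers. This is a minimal sketch of the standard PSNR definition for 8-bit images; the function name `psnr`, the peak value of 255, and the toy images are assumptions, and border cropping or other evaluation conventions of the compared methods are not modeled.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


clean = np.full((4, 4), 100, dtype=np.uint8)
off_by_one = np.full((4, 4), 101, dtype=np.uint8)
value = psnr(clean, off_by_one)
print(round(value, 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a reduction of several percent in squared error.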
Table II: Average PSNR (dB) / SSIM results of the competing methods for SISR. "-" marks settings without available results.

| Dataset | Scale | VDSR [2] | DnCNN [5] | RED30 [20] | SRResNet [11] | LapSRN [3] | DRRN [17] | MemNet [19] | WaveResNet [45] | SRMDNF [44] | MWCNN(P) | MWCNN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Set5 | ×2 | 37.53 / 0.9587 | 37.58 / 0.9593 | 37.66 / 0.9599 | - | 37.52 / 0.9590 | 37.74 / 0.9591 | 37.78 / 0.9597 | 37.57 / 0.9586 | 37.79 / 0.9601 | 37.91 / 0.9600 | 37.95 / 0.9605 |
| Set5 | ×3 | 33.66 / 0.9213 | 33.75 / 0.9222 | 33.82 / 0.9230 | - | - | 34.03 / 0.9244 | 34.09 / 0.9248 | 33.86 / 0.9228 | 34.12 / 0.9250 | 34.18 / 0.9272 | 34.21 / 0.9273 |
| Set5 | ×4 | 31.35 / 0.8838 | 31.40 / 0.8845 | 31.51 / 0.8869 | 32.05 / 0.8902 | 31.54 / 0.8850 | 31.68 / 0.8888 | 31.74 / 0.8893 | 31.52 / 0.8864 | 31.96 / 0.8925 | 32.12 / 0.8941 | 32.14 / 0.8951 |
| Set14 | ×2 | 33.03 / 0.9124 | 33.04 / 0.9118 | 32.94 / 0.9144 | - | 33.08 / 0.9130 | 33.23 / 0.9136 | 33.28 / 0.9142 | 33.09 / 0.9129 | 33.05 / 0.8985 | 33.70 / 0.9182 | 33.71 / 0.9182 |
| Set14 | ×3 | 29.77 / 0.8314 | 29.76 / 0.8349 | 29.61 / 0.8341 | - | - | 29.96 / 0.8349 | 30.00 / 0.8350 | 29.88 / 0.8331 | 30.04 / 0.8372 | 30.16 / 0.8414 | 30.14 / 0.8413 |
| Set14 | ×4 | 28.01 / 0.7674 | 28.02 / 0.7670 | 27.86 / 0.7718 | 28.49 / 0.7783 | 28.19 / 0.7720 | 28.21 / 0.7720 | 28.26 / 0.7723 | 28.11 / 0.7699 | 28.41 / 0.7816 | 28.41 / 0.7816 | 28.58 / 0.7882 |
| BSD100 | ×2 | 31.90 / 0.8960 | 31.85 / 0.8942 | 31.98 / 0.8974 | - | 31.80 / 0.8950 | 32.05 / 0.8973 | 32.08 / 0.8978 | 31.92 / 0.8965 | 32.23 / 0.8999 | 32.23 / 0.8999 | 32.30 / 0.9002 |
| BSD100 | ×3 | 28.82 / 0.7976 | 28.80 / 0.7963 | 28.92 / 0.7993 | - | - | 28.95 / 0.8004 | 28.96 / 0.8001 | 28.86 / 0.7987 | 28.97 / 0.8030 | 29.12 / 0.8060 | 29.18 / 0.8106 |
| BSD100 | ×4 | 27.29 / 0.7251 | 27.23 / 0.7233 | 27.39 / 0.7286 | 27.56 / 0.7354 | 27.32 / 0.7280 | 27.38 / 0.7284 | 27.40 / 0.7281 | 27.32 / 0.7266 | 27.62 / 0.7355 | 27.62 / 0.7355 | 27.67 / 0.7357 |
| Urban100 | ×2 | 30.76 / 0.9140 | 30.75 / 0.9133 | 30.91 / 0.9159 | - | 30.41 / 0.9100 | 31.23 / 0.9188 | 31.31 / 0.9195 | 30.96 / 0.9169 | 32.30 / 0.9296 | 32.30 / 0.9296 | 32.36 / 0.9306 |
| Urban100 | ×3 | 27.14 / 0.8279 | 27.15 / 0.8276 | 27.31 / 0.8303 | - | - | 27.53 / 0.8378 | 27.56 / 0.8376 | 27.28 / 0.8334 | 27.57 / 0.8401 | 28.13 / 0.8514 | 28.19 / 0.8520 |
| Urban100 | ×4 | 25.18 / 0.7524 | 25.20 / 0.7521 | 25.35 / 0.7587 | 26.07 / 0.7839 | 25.21 / 0.7560 | 25.44 / 0.7638 | 25.50 / 0.7630 | 25.36 / 0.7614 | 26.27 / 0.7890 | 26.27 / 0.7890 | 26.37 / 0.7891 |
IV-B2 Single image super-resolution
We also train our MWCNN on the SISR task with only the luminance channel, i.e., Y in the YCbCr color space, following [1]. Interpolation is used for image degradation, and upsampling by interpolation is applied before sending the degraded image to the network. For qualitative comparisons, we use the source codes of nine CNN-based methods, including VDSR [2], DnCNN [5], RED30 [20], SRResNet [11], LapSRN [3], DRRN [17], MemNet [19], WaveResNet [45] and SRMDNF [44]. Since the source code of SRResNet is not released, its results shown in Table II are incomplete. The results of LapSRN with scale factor 3 are not listed here since they are not reported in the authors’ paper.
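Working on the luminance channel requires converting RGB inputs to Y first. The sketch below assumes the ITU-R BT.601 "studio swing" conversion (the convention used, for example, by MATLAB's `rgb2ycbcr`), with RGB values in [0, 1] and Y in [16, 235]; whether the experiments use exactly this variant is an assumption, and the function name `rgb_to_y` is hypothetical.

```python
import numpy as np

# BT.601 luma weights scaled to the 16..235 range (assumed convention).
_Y_WEIGHTS = np.array([65.481, 128.553, 24.966])


def rgb_to_y(rgb):
    """Luminance (Y of YCbCr) for a float RGB array of shape (..., 3) in [0, 1]."""
    return 16.0 + rgb @ _Y_WEIGHTS


white = rgb_to_y(np.array([1.0, 1.0, 1.0]))
black = rgb_to_y(np.array([0.0, 0.0, 0.0]))
image_y = rgb_to_y(np.random.default_rng(0).random((4, 4, 3)))
print(white, black, image_y.shape)
```

PSNR and SSIM values computed on Y differ from those computed on RGB, so the same channel convention has to be used for every method in a comparison.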
Table II summarizes the average PSNR/SSIM results of the competing methods on the four datasets, citing the results from their respective papers. In terms of both PSNR and SSIM, the proposed MWCNN outperforms the other methods in all cases. Compared with the well-known VDSR, our MWCNN achieves a notable PSNR gain of about 0.4 to 0.8 dB on Set5 and Set14. Surprisingly, our MWCNN outperforms VDSR by a large gap of about 1.0 to 1.6 dB on Urban100. Even though SRMDNF is trained in RGB space, it is still slightly weaker than our MWCNN. It can also be seen that our MWCNN outperforms WaveResNet by no less than 0.3 dB. We provide qualitative comparisons with the competing methods on the image “253027” from BSD100 in Figure 7. As one can see, our MWCNN can correctly recover the fine and detailed textures, and produce sharp edges due to the frequency and location characteristics of DWT.
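Table III: Average PSNR (dB) / SSIM results of the competing methods for JPEG image artifacts removal. "-" marks settings without available results.

| Dataset | Quality factor | JPEG | ARCNN [35] | TNRD [26] | DnCNN [5] | MemNet [19] | MWCNN(P) | MWCNN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classic5 | 10 | 27.82 / 0.7595 | 29.03 / 0.7929 | 29.28 / 0.7992 | 29.40 / 0.8026 | 29.69 / 0.8107 | 30.01 / 0.8195 | 30.03 / 0.8201 |
| Classic5 | 20 | 30.12 / 0.8344 | 31.15 / 0.8517 | 31.47 / 0.8576 | 31.63 / 0.8610 | 31.90 / 0.8658 | 32.16 / 0.8701 | 32.20 / 0.8708 |
| Classic5 | 30 | 31.48 / 0.8744 | 32.51 / 0.8806 | 32.78 / 0.8837 | 32.91 / 0.8861 | - | 33.43 / 0.8930 | 33.46 / 0.8934 |
| Classic5 | 40 | 32.43 / 0.8911 | 33.34 / 0.8953 | - | 33.77 / 0.9003 | - | 34.27 / 0.9061 | 34.31 / 0.9063 |
| LIVE1 | 10 | 27.77 / 0.7730 | 28.96 / 0.8076 | 29.15 / 0.8111 | 29.19 / 0.8123 | 29.45 / 0.8193 | 29.69 / 0.8254 | 29.70 / 0.8260 |
| LIVE1 | 20 | 30.07 / 0.8512 | 31.29 / 0.8733 | 31.46 / 0.8769 | 31.59 / 0.8802 | 31.83 / 0.8846 | 32.04 / 0.8885 | 32.07 / 0.8886 |
| LIVE1 | 30 | 31.41 / 0.9000 | 32.67 / 0.9043 | 32.84 / 0.9059 | 32.98 / 0.9090 | - | 33.45 / 0.9153 | 33.46 / 0.9155 |
| LIVE1 | 40 | 32.35 / 0.9173 | 33.63 / 0.9198 | - | 33.96 / 0.9247 | - | 34.45 / 0.9301 | 34.47 / 0.9300 |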
IV-B3 JPEG image artifacts removal
We apply our method to JPEG image artifacts removal to further demonstrate the applicability of our MWCNN to image restoration. Here, both the JPEG encoder and JPEG image artifacts removal are applied only to the Y channel. Following [35], we consider four settings of the quality factor, i.e., 10, 20, 30 and 40, for the JPEG encoder. In our experiments, MWCNN is compared to four competing methods, i.e., ARCNN [35], TNRD [26], DnCNN [5], and MemNet [19]. The results of MemNet [19] and TNRD [26] are incomplete, according to their papers and released source codes.
Table III shows the average PSNR/SSIM results of the competing methods on Classic5 and LIVE1. Our MWCNN obtains better performance than the other methods in terms of quantitative metrics for all four quality factors. Compared to ARCNN on Classic5, our MWCNN achieves a PSNR gain of about 1 dB. One can see that the PSNR values of MWCNN can be 0.2 to 0.3 dB higher than those of the second best method (i.e., MemNet [19]). For perceptual comparison, we also provide results on the image “carnivaldolls” from LIVE1 with the quality factor of 10. Compared with the other methods, our MWCNN is more effective in removing artifacts and restoring detailed textures and sharp salient edges.
Image Denoising

| Size | FFDNet [40] | DnCNN [5] | RED30 [20] | MemNet [19] | MWCNN |
|---|---|---|---|---|---|
| 256×256 | 0.006 | 0.0143 | 1.362 | 0.8775 | 0.0437 |
| 512 | 0.012 | 0.0487 | 4.702 | 3.606 | 0.0844 |
| 1024 | 0.038 | 0.1688 | 15.77 | 14.69 | 0.3343 |

Single Image Super-Resolution

| Size | VDSR [2] | LapSRN [3] | DRRN [17] | MemNet [19] | MWCNN |
|---|---|---|---|---|---|
| 256×256 | 0.0172 | 0.0229 | 3.063 | 0.8774 | 0.0397 |
| 512 | 0.0575 | 0.0357 | 8.050 | 3.605 | 0.0732 |
| 1024 | 0.2126 | 0.1411 | 25.23 | 14.69 | 0.2876 |

JPEG Image Artifacts Removal

| Size | ARCNN [35] | TNRD [26] | DnCNN [5] | MemNet [19] | MWCNN |
|---|---|---|---|---|---|
| 256×256 | 0.0277 | 0.009 | 0.0157 | 0.8775 | 0.0413 |
| 512 | 0.0532 | 0.028 | 0.0568 | 3.607 | 0.0789 |
| 1024 | 0.1613 | 0.095 | 0.2012 | 14.69 | 0.2717 |
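To make the running-time comparison concrete, the sketch below computes how each denoising method compares with MWCNN at the 256×256 size, using the values from the table above. The dictionary and helper names are illustrative only.

```python
# Running times for image denoising at 256x256, copied from the table above.
denoise_time_256 = {
    "FFDNet": 0.006,
    "DnCNN": 0.0143,
    "RED30": 1.362,
    "MemNet": 0.8775,
    "MWCNN": 0.0437,
}

def relative_cost(times, baseline="MWCNN"):
    """Return each method's running time divided by the baseline's."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}

for name, ratio in relative_cost(denoise_time_256).items():
    print(f"{name:7s} {ratio:6.2f}x")
```

With these numbers, RED30 and MemNet take roughly 31 and 20 times as long as MWCNN, whereas DnCNN and FFDNet are faster than MWCNN.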
IV-B4 Running time
As mentioned previously, the efficiency of CNNs is also an important measure of network performance. We consider the CNN-based methods with available source code and list the GPU running time of the competing methods for the three tasks in Table IV. Note that the Nvidia cuDNN v7.0 deep learning library with CUDA 9.2 is adopted to accelerate the GPU computation under Ubuntu 16.04. In comparison with the state-of-the-art methods, i.e., RED30 [20], DRRN [17] and MemNet [19], MWCNN costs far less time while obtaining better PSNR/SSIM results. Meanwhile, MWCNN is moderately slower than the remaining methods but achieves higher PSNR/SSIM values. This indicates that the effectiveness of MWCNN should be attributed to the incorporation of DWT into the CNN rather than to an increase in network depth or width.
IV-C Comparison of MWCNN variants
Using image denoising and JPEG image artifacts removal as examples, we focus on two groups of comparisons: (i) ablation experiments that demonstrate where the improved performance comes from, and (ii) related methods, such as wavelet-based approaches and dilated filtering, which are included to verify the effectiveness of the proposed method. Note that the 24-layer MWCNN is employed as our baseline, and all MWCNN variants are designed using the same architecture for fair comparison.
Each entry is PSNR / running time.

| Task | Dataset | U-Net [23] | U-Net [23]+S | U-Net [23]+D | MWCNN (P+C) | MWCNN (Haar) | MWCNN (DB2) | MWCNN (HD) |
|---|---|---|---|---|---|---|---|---|
| Image Denoising | Set12 | 27.42 / 0.079 | 27.41 / 0.074 | 27.46 / 0.080 | 27.76 / 0.081 | 27.79 / 0.075 | 27.81 / 0.127 | 27.77 / 0.091 |
| Image Denoising | Set68 | 26.30 / 0.076 | 26.29 / 0.071 | 26.21 / 0.075 | 26.54 / 0.077 | 26.58 / 0.072 | 26.59 / 0.114 | 26.57 / 0.086 |
| Image Denoising | Urban100 | 26.68 / 0.357 | 26.72 / 0.341 | 26.99 / 0.355 | 27.46 / 0.354 | 27.53 / 0.346 | 27.55 / 0.576 | 27.50 / 0.413 |
| JPEG Image Artifacts Removal | Classic5 | 29.61 / 0.093 | 29.60 / 0.082 | 29.68 / 0.097 | 30.02 / 0.091 | 30.03 / 0.083 | 30.04 / 0.185 | 29.99 / 0.115 |
| JPEG Image Artifacts Removal | LIVE1 | 29.36 / 0.112 | 29.36 / 0.109 | 29.43 / 0.120 | 29.69 / 0.120 | 29.70 / 0.111 | 29.71 / 0.234 | 29.68 / 0.171 |
IV-C1 Ablation experiments
Ablation experiments are provided to verify the effectiveness of the embedded wavelet transform. Three U-Net baselines are considered: (i) the default U-Net with the same architecture as MWCNN, (ii) U-Net+S, which uses sum connections instead of concatenation, and (iii) U-Net+D, which adopts learnable conventional downsampling filters, i.e., convolution with stride 2, to replace max pooling. We also compare with a modified MWCNN(P) that adds one CNN layer right after the input and another CNN layer before the output, denoted as MWCNN (P+C). Three MWCNN variants with different wavelet transforms are also considered: (i) MWCNN (Haar), the default MWCNN with the Haar wavelet, (ii) MWCNN (DB2), MWCNN with the Daubechies-2 wavelet, and (iii) MWCNN (HD), MWCNN with Haar in the contracting subnetwork and Daubechies-2 in the expanding subnetwork.
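The difference between concatenation and sum connections can be illustrated by counting the weights of the convolution that follows the skip connection. This is a rough sketch only: the channel width of 64 and the 3×3 kernel are assumed values chosen for illustration, not settings confirmed by this section.

```python
def conv_weights(in_channels, out_channels, kernel=3):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return in_channels * out_channels * kernel * kernel

channels = 64  # assumed feature width, for illustration only

# Concatenation stacks encoder and decoder features: 2C channels enter the conv.
concat_weights = conv_weights(2 * channels, channels)
# Element-wise sum keeps C channels, so the following conv has half the weights.
sum_weights = conv_weights(channels, channels)

print(concat_weights, sum_weights)
```

Under these assumptions the convolution after a sum connection needs half as many weights, which is one plausible reason a sum connection can be slightly more efficient.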
Table V lists the PSNR and running time results of these methods. We have the following observations. (i) The ablation experiments indicate that adopting sum connections instead of concatenation can slightly improve efficiency with almost no decrease in PSNR. (ii) Owing to the biorthogonal and time-frequency localization properties of wavelets, our wavelet-based method is more powerful for image restoration. The pooling operation causes the loss of high-frequency information and makes it difficult to recover the damaged image. MWCNN clearly outperforms U-Net+D, which adopts learnable downsampling filters. This indicates that learning alone is not enough and that violating invertibility can cause information loss. (iii) Compared with MWCNN (P+C), the proposed method still performs slightly better even though MWCNN (P+C) has more layers, which further verifies the effectiveness of the proposed method. (iv) MWCNN (DB2) is only 0.01 to 0.02 dB higher in PSNR than MWCNN (Haar) at a substantially higher running time, and MWCNN (HD) is both slower and slightly less accurate, so using the Haar wavelet for downsampling and upsampling is the best choice overall. MWCNN (Haar) runs about as fast as U-Net, and faster than the dilated CNNs in Table VI, while achieving higher PSNR, which demonstrates the effectiveness of MWCNN in trading off performance and efficiency.
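The invertibility argument can be checked on a toy image. The following is a minimal pure-Python sketch of a one-level 2-D Haar transform and its inverse, contrasted with 2×2 max pooling. The function names and the test image are hypothetical, the sub-band naming convention varies between references, and the authors' implementation may differ.

```python
def haar_dwt2(x):
    """One-level 2-D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands [LL, LH, HL, HH].
    """
    rows, cols = len(x) // 2, len(x[0]) // 2
    bands = [[[0.0] * cols for _ in range(rows)] for _ in range(4)]
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            bands[0][i][j] = (a + b + c + d) / 2  # LL: scaled local average
            bands[1][i][j] = (a - b + c - d) / 2  # left column minus right column
            bands[2][i][j] = (a + b - c - d) / 2  # top row minus bottom row
            bands[3][i][j] = (a - b - c + d) / 2  # diagonal difference
    return bands

def haar_idwt2(bands):
    """Inverse of haar_dwt2: rebuilds the full-resolution image."""
    ll, lh, hl, hh = bands
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, g = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + h + v + g) / 2
            x[2 * i][2 * j + 1] = (s - h + v - g) / 2
            x[2 * i + 1][2 * j] = (s + h - v - g) / 2
            x[2 * i + 1][2 * j + 1] = (s - h - v + g) / 2
    return x

def max_pool2(x):
    """2x2 max pooling with stride 2; three of every four values are discarded."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)
print(haar_idwt2(bands) == image)  # the four sub-bands reconstruct the image
print(max_pool2(image))            # pooled output alone cannot recover the image
```

Both operations halve the spatial resolution, but the Haar transform keeps all sixteen values (spread over four sub-bands), whereas max pooling keeps four.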
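Each entry is PSNR / running time; "- / -" marks results that are not reported.

| Task | Dataset | Dilated [8] | Dilated2 | DCF [47] | DCF+R | WaveResNet [45] | MWCNN (Haar) |
|---|---|---|---|---|---|---|---|
| Image Denoising | Set12 | 27.45 / 0.181 | 24.81 / 0.185 | 27.38 / 0.081 | 27.61 / 0.081 | 27.49 / 0.179 | 27.79 / 0.075 |
| Image Denoising | Set68 | 26.35 / 0.142 | 24.32 / 0.174 | 26.30 / 0.075 | 26.43 / 0.075 | 26.38 / 0.143 | 26.58 / 0.072 |
| Image Denoising | Urban100 | 26.56 / 0.764 | 24.18 / 0.960 | 26.65 / 0.354 | 27.18 / 0.354 | - / - | 27.53 / 0.346 |
| JPEG Image Artifacts Removal | Classic5 | 29.72 / 0.287 | 29.49 / 0.302 | 29.57 / 0.104 | 29.88 / 0.104 | - / - | 30.03 / 0.083 |
| JPEG Image Artifacts Removal | LIVE1 | 29.49 / 0.354 | 29.26 / 0.376 | 29.38 / 0.155 | 29.63 / 0.155 | - / - | 29.70 / 0.111 |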
IV-C2 More MWCNN variants
We further compare the PSNR results with more related variants. Two 24-layer dilated CNNs are provided: (i) Dilated, which uses the hybrid dilated convolution [16] to suppress the gridding effect, and (ii) Dilated2, in which the dilation factor of all layers is set to 2 and which is therefore subject to the gridding effect. The WaveResNet method in [45] is also provided for comparison. Moreover, since the source code of deep convolutional framelets (DCF) [48] is not available, we include our reimplementation of DCF without residual learning. DCF with residual learning (denoted as DCF+R) is provided for fair comparison.
Table VI lists the PSNR and running time results of these methods. We have the following observations. (i) The gridding effect, with its sparse sampling and inconsistency of local information, has a clear adverse influence on restoration performance. (ii) The worse performance of DCF also indicates that independent processing of subbands harms intra-frequency information dependency. (iii) The fact that MWCNN slightly outperforms DCF+R means that the adverse influence of independent processing can be eliminated after several non-linear operations.
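The gridding effect mentioned in observation (i) can be illustrated in one dimension by enumerating which input positions can influence a single output after several stacked dilated convolutions. This is a sketch under stated assumptions: three layers with 3-tap kernels are used, and the hybrid schedule (1, 2, 3) is an assumed example rather than the schedule used in the experiments.

```python
def reachable_offsets(dilations, kernel=3):
    """Input offsets (1-D) that can influence one output of stacked dilated convs."""
    half = kernel // 2
    offsets = {0}
    for d in dilations:
        offsets = {o + k * d for o in offsets for k in range(-half, half + 1)}
    return sorted(offsets)

fixed = reachable_offsets([2, 2, 2])   # every layer uses dilation factor 2
hybrid = reachable_offsets([1, 2, 3])  # assumed hybrid dilation schedule

print(len(fixed), fixed)
print(len(hybrid), hybrid)
```

Both stacks span offsets from -6 to 6, but the fixed factor of 2 only ever touches the even offsets, which is the sparse checkerboard-style sampling behind the gridding effect, while the hybrid schedule covers every offset in the span.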
IV-C3 Hierarchical level of MWCNN
Here, we discuss the suitable number of levels for MWCNN. Although MWCNN can be extended to higher levels, a higher level also brings a deeper network and a heavier computational burden. Thus, we select a level that better balances efficiency and performance. Table VII reports the PSNR and running time results of MWCNNs with levels 0 to 4 (i.e., MWCNN0 to MWCNN4). We note that MWCNN0 is a 6-layer CNN without WPT. In terms of PSNR, MWCNN3 is clearly better than MWCNN1 and MWCNN2, while being only negligibly weaker than MWCNN4. Meanwhile, the speed of MWCNN3 is also moderate compared with the other levels. Based on the above analysis, we choose MWCNN3 as the default setting.
Each entry is PSNR / running time.

| Dataset | MWCNN0 | MWCNN1 | MWCNN2 | MWCNN3 | MWCNN4 |
|---|---|---|---|---|---|
| Set12 | 26.84 / 0.017 | 27.25 / 0.041 | 27.64 / 0.064 | 27.79 / 0.073 | 27.80 / 0.087 |
| Set68 | 25.71 / 0.016 | 26.21 / 0.039 | 26.47 / 0.060 | 26.58 / 0.070 | 26.59 / 0.081 |
| Urban100 | 25.98 / 0.087 | 26.53 / 0.207 | 27.12 / 0.298 | 27.53 / 0.313 | 27.55 / 0.334 |
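The level selection can be made explicit by looking at the marginal PSNR gain and the extra running time of each additional level. The sketch below uses the Set12 row of the table above; the helper name is illustrative.

```python
# (PSNR, running time) on Set12 for MWCNN0..MWCNN4, copied from the table above.
set12 = [(26.84, 0.017), (27.25, 0.041), (27.64, 0.064), (27.79, 0.073), (27.80, 0.087)]

def marginal_gains(results):
    """PSNR gain and extra running time of each level over the previous one."""
    return [
        (round(p1 - p0, 2), round(t1 - t0, 3))
        for (p0, t0), (p1, t1) in zip(results, results[1:])
    ]

for level, (gain, cost) in enumerate(marginal_gains(set12), start=1):
    print(f"level {level - 1} -> {level}: +{gain:.2f} dB, +{cost:.3f} running time")
```

On Set12, going from level 3 to level 4 adds only 0.01 dB while adding more running time than the step from level 2 to level 3, which is consistent with choosing MWCNN3 as the default.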
IV-D Extend to Object Classification
Using object classification as an example, we test MWCNN on six widely used benchmarks: CIFAR-10, CIFAR-100 [69], SVHN [70], MNIST [71], ImageNet-1K [6] and Places365 [72]. CIFAR-10 and CIFAR-100 each consist of 60,000 32×32 colour images, in 10 and 100 classes respectively, including 50,000 images for training and 10,000 images for testing. MNIST [71] is a handwritten digit database with 28×28 resolution, which has a training set of 60,000 examples and a test set of 10,000 examples. The SVHN dataset contains more than 600,000 digit images. ImageNet-1K is used in resized form at two resolutions, 32×32 (denoted as ImageNet32) and 64×64 (denoted as ImageNet64). Places365 is also resized for training and testing in our work. Here, we modify and compare several CNN methods, such as pre-activation ResNet (PreResNet) [10], AllCNN [73], WideResNet [53], PyramidNet [50], DenseNet [52] and ResNet [9]. Specifically, we follow [55] to verify MWCNN on the ResNet architecture. As in our MWCNN, we use DWT followed by convolution in place of the average pooling operation, as described in Sec. III-B2, and denote the result as ‘MW’, while the original CNN is denoted as ‘Base’.
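The relation between average pooling and the DWT-based replacement can be seen on a single 2×2 block. In the sketch below, the convolution after the DWT is modelled as a 1×1 convolution acting on the four sub-band values at one location; this is an assumption made for simplicity, and the block values and weights are hypothetical.

```python
def haar_block(a, b, c, d):
    """Haar coefficients of one 2x2 block [[a, b], [c, d]]."""
    return [
        (a + b + c + d) / 2,  # low-frequency (LL) coefficient
        (a - b + c - d) / 2,  # left-minus-right detail
        (a + b - c - d) / 2,  # top-minus-bottom detail
        (a - b - c + d) / 2,  # diagonal detail
    ]

def avg_pool_block(a, b, c, d):
    """Average pooling of the same 2x2 block."""
    return (a + b + c + d) / 4

def mix(coeffs, weights):
    """A 1x1 convolution at one location: weighted sum over sub-band channels."""
    return sum(w * v for w, v in zip(weights, coeffs))

block = (3.0, 1.0, 4.0, 2.0)  # hypothetical feature values
coeffs = haar_block(*block)

# Weights (0.5, 0, 0, 0) make DWT + 1x1 conv reproduce average pooling exactly;
# other weights can additionally draw on the three detail coefficients.
pooled_via_dwt = mix(coeffs, [0.5, 0.0, 0.0, 0.0])
print(pooled_via_dwt, avg_pool_block(*block))
```

In this view, average pooling is the special case that keeps only the low-frequency sub-band, whereas the DWT-based downsampling passes all four sub-bands to the following convolution.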
Table VIII shows the accuracy of the competing methods PreResNet [10], AllCNN [73], WideResNet [53], PyramidNet [50] and DenseNet [52] on CIFAR-10, CIFAR-100, SVHN, MNIST and ImageNet32. Table IX and Table X show the Top-1 and Top-5 errors of ResNet [9] on ImageNet64 and Places365, respectively. One can see that MWCNN consistently surpasses the original CNN, which we attribute to the DWT used for downsampling. Our MWCNN is quite different from DCF [48]: DCF combines CNN with DWT during decomposition, where different CNNs are deployed for each subband. However, the results in Table VI indicate that independent processing of subbands is not suitable for image restoration. In contrast, MWCNN incorporates DWT into CNN from the perspective of enlarging the receptive field without information loss, which allows DWT to be embedded into any CNN with pooling operations. By taking all subbands as input, MWCNN is more powerful in modeling inter-band dependency. Moreover, MWCNN is formulated as a single, generic, plug-and-play module that can directly replace the downsampling operation without any adjustment to the network architecture.
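| Model | Variant | CIFAR-10 | CIFAR-100 | SVHN | MNIST | ImageNet32 |
|---|---|---|---|---|---|---|
| DenseNet-BC-100 (k=12) [52] | Base | 95.40 | 77.38 | 98.03 | 99.69 | 48.32 |
| DenseNet-BC-100 (k=12) [52] | MW | 95.57 | 77.70 | 98.11 | 99.74 | 49.77 |
| AllConv [73] | Base | 91.58 | 67.57 | 98.06 | 99.67 | 30.12 |
| AllConv [73] | MW | 93.08 | 71.06 | 98.09 | 99.70 | 36.64 |
| PyramidNet-164 [50] | Base | 96.09 | 80.35 | 97.98 | 99.72 | 55.30 |
| PyramidNet-164 [50] | MW | 96.11 | 80.49 | 98.17 | 99.74 | 55.51 |
| PreResNet-164 [10] | Base | 95.29 | 77.32 | 98.04 | 99.67 | 46.16 |
| PreResNet-164 [10] | MW | 95.72 | 77.67 | 98.09 | 99.72 | 47.88 |
| WideResNet-28-10 [53] | Base | 96.56 | 81.30 | 98.19 | 99.70 | 57.51 |
| WideResNet-28-10 [53] | MW | 96.60 | 81.42 | 98.36 | 99.75 | 57.66 |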
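ImageNet64

| Model | Variant | Top-1 | Top-5 |
|---|---|---|---|
| ResNet18-512d [9] | Base | 49.08 | 24.25 |
| ResNet18-512d [9] | MW | 48.46 | 23.96 |
| ResNet50 [9] | Base | 43.28 | 19.39 |
| ResNet50 [9] | MW | 41.70 | 18.05 |
| ResNet50-512d [9] | Base | 41.42 | 18.14 |
| ResNet50-512d [9] | MW | 41.27 | 17.70 |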
Places365

| Model | Variant | Top-1 | Top-5 |
|---|---|---|---|
| ResNet18-512d [9] | Base | 49.96 | 19.19 |
| ResNet18-512d [9] | MW | 49.56 | 18.88 |
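The Top-1 and Top-5 errors reported for ImageNet64 and Places365 follow the usual top-k definition: a sample counts as an error when its true label is not among the k highest-scoring classes. A minimal sketch is given below; the scores and labels are made-up toy data with three classes, so k = 2 stands in for k = 5.

```python
def top_k_error(scores, labels, k):
    """Percentage of samples whose true label is not among the k highest scores."""
    misses = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda cls: sample_scores[cls],
            reverse=True,
        )
        if label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)

# Hypothetical class scores for four samples and three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 0]

print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

Since the top-k set grows with k, the Top-5 error can never exceed the Top-1 error, which matches the ordering of the two columns in the tables.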
V Conclusion
In this paper, we present MWCNN to better trade off receptive field size and efficiency. To this end, DWT is introduced as a downsampling operation that reduces spatial resolution and enlarges the receptive field, and it can be embedded into any CNN that uses pooling. More specifically, MWCNN takes both low-frequency and high-frequency subbands as input and can therefore perform downsampling without information loss. In addition, WPT can be treated as a set of predefined parameters that eases network learning. We first design an architecture for image restoration based on U-Net, which consists of a contracting subnetwork and an expanding subnetwork (for object classification, only the contracting subnetwork is employed). Owing to the invertibility of DWT and its frequency and location properties, the proposed MWCNN is effective in recovering detailed textures and sharp structures from degraded observations. Extensive experiments demonstrate the effectiveness and efficiency of MWCNN on three restoration tasks, i.e., image denoising, SISR, and JPEG image artifacts removal, and on object classification with different CNN architectures. In future work, we aim to design novel network architectures that extend MWCNN to more restoration tasks such as image deblurring. We will also investigate flexible MWCNN models for handling blind restoration tasks. High-level dense prediction tasks, such as object detection and image segmentation, are often accomplished by adopting pooling for downsampling and then performing upsampling for dense prediction. The limitations of pooling may therefore be unfavorable in these tasks as well, and we will extend the proposed MWCNN to them to better preserve fine-scale features.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61872118, 61773002 and 61671182.
References
 [1] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
 [2] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
 [3] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [4] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
 [5] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, PP(99):1–1, 2016.
 [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [8] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
 [11] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [12] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
 [13] A. Adler, D. Boublil, M. Elad, and M. Zibulevsky. A deep learning approach to block-based compressed sensing of images. arXiv preprint arXiv:1606.01519, 2016.
 [14] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (red). arXiv preprint arXiv:1611.02862, 2016.
 [15] S. Yan, X. Xu, D. Xu, S. Lin, and X. Li. Image classification with densely sampled image windows and generalized adaptive multiple kernel learning. IEEE Transactions on Cybernetics, 45(3):381–390, 2015.
 [16] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502, 2017.
 [17] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [18] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407, 2016.
 [19] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In IEEE International Conference on Computer Vision, 2017.
 [20] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
 [21] I. Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, 1990.
 [22] I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
 [23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
 [24] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo. Multi-level wavelet-CNN for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
 [25] M. R. Banham and A. K. Katsaggelos. Digital image restoration. IEEE Signal Processing Magazine, 14(2):24–41, 1997.
 [26] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2015.
 [27] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
 [28] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
 [29] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
 [30] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
 [31] F. Agostinelli, M. R. Anderson, and H. Lee. Robust image denoising with multi-column deep neural networks. In Advances in Neural Information Processing Systems, pages 1493–1501, 2013.
 [32] V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009.
 [33] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In International Conference on Neural Information Processing Systems, pages 341–349, 2012.
 [34] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
 [35] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE International Conference on Computer Vision, pages 576–584, 2015.
 [36] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1132–1140, 2017.
 [37] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, 2018.
 [38] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.
 [39] V. Santhanam, V. I. Morariu, and L. S. Davis. Generalized deep image to image regression. IEEE Conference on Computer Vision and Pattern Recognition, pages 5609–5619, 2017.
 [40] K. Zhang, W. Zuo, and L. Zhang. FFDNet: Toward a fast and flexible solution for CNN based image denoising. IEEE Transactions on Image Processing, 2018.
 [41] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang. Toward convolutional blind denoising of real photographs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [42] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
 [43] G. Riegler, S. Schulter, M. Ruther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE International Conference on Computer Vision, 2015.
 [44] K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [45] W. Bae, J. Yoo, and J. C. Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1141–1149, 2017.
 [46] T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga. Deep wavelet prediction for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
 [47] Y. Han and J. C. Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Transactions on Medical Imaging, pages 1418–1429, 2018.
 [48] J. C. Ye and Y. S. Han. Deep convolutional framelets: A general deep learning for inverse problems. Society for Industrial and Applied Mathematics, 2018.
 [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [50] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6307–6315, 2017.
 [51] S. Zhai, Y. Cheng, Z. M. Zhang, and W. Lu. Doubly convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1082–1090, 2016.
 [52] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
 [53] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
 [54] A. Takeki, D. Ikami, G. Irie, and K. Aizawa. Parallel grid pooling for data augmentation. In European Conference on Computer Vision, 2018.
 [55] Q. Wang, Z. Gao, J. Xie, W. Zuo, and P. Li. Global gated mixture of second-order pooling for improving deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1277–1286, 2018.
 [56] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
 [57] A. N. Akansu and R. A. Haddad. Multiresolution signal decomposition: transforms, subbands, and wavelets. Academic Press, 2001.
 [58] A. S. Lewis and G. Knowles. Image compression using the 2D wavelet transform. IEEE Transactions on Image Processing, 1(2):244–250, 1992.
 [59] S. G. Chang, B. Yu, and M. Vetterli. Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9(9):1532–1546, 2000.
 [60] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [61] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
 [62] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1122–113, 2017.
 [63] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, volume 2, pages 416–423, 2001.
 [64] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
 [65] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
 [66] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
 [67] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):193–201, 2009.
 [68] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In the 23rd ACM international conference on Multimedia, pages 689–692, 2015.
 [69] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report, Citeseer, 2009.
 [70] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, volume 2011, page 5, 2011.
 [71] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [72] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [73] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop, 2015.