Convolutional neural networks (CNNs) have become the dominant technique behind many computer vision tasks, e.g., image restoration [1, 2, 5, 3, 4] and object classification [6, 7, 8, 9, 10]. With continual progress, CNNs are readily trained on large-scale datasets, accelerated by increasingly advanced GPU devices, and often achieve state-of-the-art performance in comparison with traditional methods. The popularity of CNNs in computer vision can be attributed to two aspects. First, existing CNN-based solutions outperform other methods by a large margin on several basic tasks, such as single image super-resolution (SISR) [1, 2, 11], image denoising, image deblurring, compressed imaging, and object classification. Second, CNNs can be treated as a modular part and plugged into traditional methods, which also promotes their widespread use [12, 14, 15].
A CNN in computer vision can be viewed as a non-linear map from the input image to the target. In general, a larger receptive field helps improve the fitting ability of CNNs and promotes accurate performance by taking more spatial context into account. The receptive field can be enlarged by increasing the network depth, enlarging the filter size, or using pooling operations. However, increasing the network depth or enlarging the filter size inevitably results in higher computational cost. Pooling enlarges the receptive field and preserves efficiency by directly reducing the spatial resolution of the feature map, but it may result in information loss. Recently, dilated filtering was proposed to trade off between receptive field size and efficiency by inserting “zero holes” into convolutional filters. However, the receptive field of dilated filtering with a fixed factor greater than 1 only takes into account a sparse sampling of the input in a checkerboard pattern, and thus inherently suffers from the gridding effect. Based on the above analysis, the receptive field should be enlarged with care to avoid both increasing the computational burden and sacrificing performance. As can be seen from Figure 1, even though DRRN and MemNet enjoy larger receptive fields and higher PSNR than VDSR and DnCNN, they are orders of magnitude slower.
To address the problems stated above, we propose an efficient CNN-based approach that aims at a better trade-off between performance and efficiency. More specifically, we propose a multi-level wavelet CNN (MWCNN) that utilizes the discrete wavelet transform (DWT) to replace the pooling operations. Due to the invertibility of DWT, no image information or intermediate features are lost by the proposed downsampling scheme. Moreover, both frequency and location information of feature maps is captured by DWT [21, 22], which helps preserve detailed texture when using multi-frequency feature representation. We adopt the inverse wavelet transform (IWT) with an expansion convolutional layer to restore the resolution of feature maps in image restoration tasks, where the U-Net architecture is used as the backbone. In addition, element-wise summation is adopted to combine feature maps, thus enriching feature representation.
In terms of relation to relevant works, we show that dilated filtering can be interpreted as a special variant of MWCNN, and that the proposed method is more general and effective in enlarging the receptive field. Using an ensemble of such networks trained with embedded multi-level wavelet, we achieve PSNR/SSIM values that improve upon the best known results in image restoration tasks such as image denoising, SISR and JPEG image artifacts removal. For object classification, the proposed MWCNN achieves higher performance than the same networks with pooling layers. As shown in Figure 1, although MWCNN is moderately slower than LapSRN, DnCNN and VDSR, it has a much larger receptive field and achieves higher PSNR.
This paper is an extension of our previous work. Compared to the former work, we propose a more general approach for improving performance, extend it to a high-level task, and provide more analysis and discussion. To sum up, the contributions of this work include:
A novel MWCNN model to enlarge receptive field with better tradeoff between efficiency and restoration performance by introducing wavelet transform.
Promising detail preservation due to the good time-frequency localization property of DWT.
A general approach to embedding the wavelet transform in any CNN where a pooling operation is employed.
State-of-the-art performance on image denoising, SISR, JPEG image artifacts removal, and classification.
The remainder of the paper is organized as follows. Sec. II briefly reviews the development of CNNs for image restoration and classification. Sec. III describes the proposed MWCNN model in detail. Sec. IV reports the experimental results in terms of performance evaluation. Finally, Sec. V concludes the paper.
II Related Work
In this section, the development of CNNs for image restoration tasks is briefly reviewed. In particular, we discuss relevant works on incorporating DWT in CNNs. Finally, relevant object classification works are introduced.
II-A Image Restoration
Image restoration aims at recovering the latent clean image from its degraded observation. For decades, image restoration has been studied from the viewpoints of both prior modeling and discriminative learning [25, 26, 27, 28, 29, 30]. Recently, with their rapid development, CNN-based methods have achieved state-of-the-art performance, surpassing traditional methods.
II-A1 Improving Performance and Efficiency of CNNs for Image Restoration
Early CNN-based methods did not work well on some image restoration tasks. For example, the methods of [31, 32, 33] could not achieve state-of-the-art denoising performance compared to BM3D in 2007. In , a multi-layer perceptron (MLP) achieved performance comparable to BM3D by learning the mapping from noisy patches to clean patches. In 2014, Dong et al. adopted, for the first time, a 3-layer FCN without pooling for SISR, which has only a small receptive field but achieves state-of-the-art performance. Then, Dong et al. proposed a 4-layer ARCNN for JPEG image artifacts reduction.
Recently, deeper networks are increasingly used for image restoration. For SISR, Kim et al. stacked a 20-layer CNN with residual learning and adjustable gradient clipping. Subsequently, other designs, for example, very deep networks [36, 5, 37], symmetric skip connections, residual units, Laplacian pyramids, and recursive architectures [38, 17], have also been suggested to enlarge the receptive field. However, the receptive field of those methods is enlarged by increasing the network depth, which may have limited potential to extend to deeper networks.
For a better trade-off between speed and performance, a 7-layer FCN with dilated filtering was presented as a denoiser by Zhang et al. Santhanam et al. adopted pooling/unpooling to obtain and aggregate multi-context representations for image denoising. In , Zhang et al. proposed operating the CNN denoiser on downsampled sub-images. Guo et al. utilized a U-Net-based CNN as a non-blind denoiser. Owing to the particular nature of SISR, the receptive field size and efficiency can be better traded off by taking the low-resolution (LR) image as input and enlarging features with an upsampling operation [18, 4, 42]. Nevertheless, this strategy can only be adopted for SISR and is not suitable for other tasks, such as image denoising and JPEG image artifacts removal.
II-A2 Universality of Image Restoration
Owing to the similarity of tasks such as image denoising, SISR, and JPEG image artifacts removal, a model suggested for one task may be easily extended to other image restoration tasks simply by retraining the same network. For example, both DnCNN and MemNet have been evaluated on all three tasks. Moreover, CNN denoisers can also serve as a kind of plug-and-play prior. Thus, any restoration task can be tackled by sequentially applying CNN denoisers within unrolled inference. To provide an explicit functional for defining the regularization induced by denoisers, Romano et al. further proposed a regularization-by-denoising framework. In  and , the LR image together with the blur kernel is incorporated into CNNs for non-blind SR. These methods not only promote the application of CNNs in low-level vision, but also present solutions for deploying CNN denoisers in other image restoration tasks.
II-A3 Incorporating DWT in CNNs
Several studies have also incorporated the wavelet transform into CNNs. Bae et al. proposed a wavelet residual network (WavResNet), based on the finding that CNN learning can benefit from learning on wavelet subbands with features having more channels. For recovering missing details in subbands, Guo et al. proposed a deep wavelet super-resolution (DWSR) method. Subsequently, deep convolutional framelets (DCF) [47, 48] were developed for low-dose CT and inverse problems. However, only one-level wavelet decomposition is considered in WavResNet and DWSR, which may restrict the application of the wavelet transform. Inspired by the decomposition viewpoint, DCF processes each subband independently, which ignores the dependency between subbands. In contrast, our MWCNN considers the multi-level wavelet transform to enlarge the receptive field with barely any increase in computational burden.
II-B Object Classification
AlexNet is an 8-layer network for object classification that, for the first time, achieved state-of-the-art performance on the ILSVRC2012 dataset. In this method, filters of different sizes are adopted for extracting and enhancing features. However, Simonyan et al. found that using only $3\times 3$ convolutional filters with a deeper architecture can realize a larger receptive field and achieve better performance than AlexNet. Yu et al. adopted dilated convolution to enlarge the receptive field without increasing the computational burden. Later, residual blocks [9, 10], the inception model, pyramid architectures, doubly CNNs and other architectures [53, 52] were proposed for object classification. Some modifications of the pooling operation, such as parallel grid pooling and gated mixture of second-order pooling, were also proposed to enhance the feature extractor or the feature representation. In general, pooling operations, such as average pooling and max pooling, are often adopted for downsampling features and enlarging the receptive field, but they can result in significant information loss. To avoid this downside, we adopt DWT as our downsampling layer, replacing the pooling operation without changing the main architecture, which yields a more powerful feature representation.
In this section, we first briefly introduce the concept of multi-level wavelet packet transform (WPT) and provide our motivation. We then formally present our MWCNN based on multi-level WPT, and describe its network architecture for image restoration and object classification. Finally, discussion is presented to analyze the connection of MWCNN with average pooling and dilated filtering.
III-A From multi-level WPT to MWCNN
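Given an image $\mathbf{x}$, we can use the 2D DWT with four convolutional filters, i.e., the low-pass filter $\mathbf{f}_{LL}$ and the high-pass filters $\mathbf{f}_{LH}$, $\mathbf{f}_{HL}$, $\mathbf{f}_{HH}$, to decompose $\mathbf{x}$ into four subband images, i.e., $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$, and $\mathbf{x}_4$. Note that the four filters have fixed parameters with convolutional stride 2 during the transformation. Taking the Haar wavelet as an example, the four filters are defined as
$$\mathbf{f}_{LL}=\begin{bmatrix}1&1\\1&1\end{bmatrix},\quad \mathbf{f}_{LH}=\begin{bmatrix}-1&-1\\1&1\end{bmatrix},\quad \mathbf{f}_{HL}=\begin{bmatrix}-1&1\\-1&1\end{bmatrix},\quad \mathbf{f}_{HH}=\begin{bmatrix}1&-1\\-1&1\end{bmatrix}. \tag{1}$$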
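It is evident that $\mathbf{f}_{LL}$, $\mathbf{f}_{LH}$, $\mathbf{f}_{HL}$, and $\mathbf{f}_{HH}$ are orthogonal to each other and form a $4\times 4$ invertible matrix. The operation of DWT is defined as $\mathbf{x}_1=(\mathbf{f}_{LL}\otimes\mathbf{x})\downarrow_2$, $\mathbf{x}_2=(\mathbf{f}_{LH}\otimes\mathbf{x})\downarrow_2$, $\mathbf{x}_3=(\mathbf{f}_{HL}\otimes\mathbf{x})\downarrow_2$, and $\mathbf{x}_4=(\mathbf{f}_{HH}\otimes\mathbf{x})\downarrow_2$, where $\otimes$ denotes the convolution operator and $\downarrow_2$ denotes the standard downsampling operator with factor 2. In other words, DWT mathematically involves four fixed convolution filters with stride 2 to implement the downsampling operator. Moreover, according to the theory of the Haar transform, the $(i,j)$-th values of $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ and $\mathbf{x}_4$ after the 2D Haar transform can be written as
$$\begin{aligned}
\mathbf{x}_1(i,j)&=\mathbf{x}(2i-1,2j-1)+\mathbf{x}(2i-1,2j)+\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j),\\
\mathbf{x}_2(i,j)&=-\mathbf{x}(2i-1,2j-1)-\mathbf{x}(2i-1,2j)+\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j),\\
\mathbf{x}_3(i,j)&=-\mathbf{x}(2i-1,2j-1)+\mathbf{x}(2i-1,2j)-\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j),\\
\mathbf{x}_4(i,j)&=\mathbf{x}(2i-1,2j-1)-\mathbf{x}(2i-1,2j)-\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j).
\end{aligned}\tag{2}$$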
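Although a downsampling operation is deployed, due to the biorthogonal property of DWT, the original image $\mathbf{x}$ can be accurately reconstructed without information loss by the IWT, i.e., $\mathbf{x}=IWT(\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3,\mathbf{x}_4)$. For the Haar wavelet, the IWT can be defined as follows:
$$\begin{aligned}
\mathbf{x}(2i-1,2j-1)&=\big(\mathbf{x}_1(i,j)-\mathbf{x}_2(i,j)-\mathbf{x}_3(i,j)+\mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i-1,2j)&=\big(\mathbf{x}_1(i,j)-\mathbf{x}_2(i,j)+\mathbf{x}_3(i,j)-\mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i,2j-1)&=\big(\mathbf{x}_1(i,j)+\mathbf{x}_2(i,j)-\mathbf{x}_3(i,j)-\mathbf{x}_4(i,j)\big)/4,\\
\mathbf{x}(2i,2j)&=\big(\mathbf{x}_1(i,j)+\mathbf{x}_2(i,j)+\mathbf{x}_3(i,j)+\mathbf{x}_4(i,j)\big)/4.
\end{aligned}\tag{3}$$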
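The one-level Haar decomposition and its inverse can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes the unnormalized $\pm 1$ Haar filters with stride 2 and a division by 4 in the inverse, an input with even height and width, and the hypothetical function names `haar_dwt2` and `haar_iwt2`.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT (unnormalized +/-1 filters, stride 2).

    x: array of shape (H, W) with even H and W (an assumption of this sketch).
    Returns four subbands, each of shape (H/2, W/2).
    """
    a = x[0::2, 0::2]  # x(2i-1, 2j-1) in 1-based notation
    b = x[0::2, 1::2]  # x(2i-1, 2j)
    c = x[1::2, 0::2]  # x(2i,   2j-1)
    d = x[1::2, 1::2]  # x(2i,   2j)
    x1 = a + b + c + d    # low-pass subband
    x2 = -a - b + c + d   # high-pass subband (LH filter)
    x3 = -a + b - c + d   # high-pass subband (HL filter)
    x4 = a - b - c + d    # high-pass subband (HH filter)
    return x1, x2, x3, x4

def haar_iwt2(x1, x2, x3, x4):
    """Inverse of haar_dwt2: interleaves the four recovered pixel groups."""
    h, w = x1.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (x1 - x2 - x3 + x4) / 4
    x[0::2, 1::2] = (x1 - x2 + x3 - x4) / 4
    x[1::2, 0::2] = (x1 + x2 - x3 - x4) / 4
    x[1::2, 1::2] = (x1 + x2 + x3 + x4) / 4
    return x
```

Applying `haar_iwt2(*haar_dwt2(x))` to any even-sized array returns `x` up to floating-point error, which is the invertibility property the downsampling scheme relies on.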
Generally, the subband images $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$, and $\mathbf{x}_4$ can be sequentially decomposed by DWT for further processing in multi-level WPT [57, 22]. To get the results of two-level WPT, DWT is separately utilized to decompose each subband image $\mathbf{x}_i$ ($i=1$, $2$, $3$, or $4$) into four subband images $\mathbf{x}_{i,1}$, $\mathbf{x}_{i,2}$, $\mathbf{x}_{i,3}$, and $\mathbf{x}_{i,4}$. Recursively, the results of three or higher levels of WPT can be obtained. Correspondingly, the reconstruction of the subband images at each level is implemented by the exactly inverse operation via IWT. The above-mentioned process of decomposition and reconstruction of an image is illustrated in Figure 2(a). If we treat the filters of WPT as convolutional filters with pre-defined weights, one can see that WPT is a special case of FCN without the nonlinearity layers. The original image can be first decomposed by WPT and then accurately reconstructed by inverse WPT without any information loss.
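The recursive decomposition and reconstruction described above can be sketched as follows. The sketch assumes the unnormalized Haar filters, image sides divisible by $2^{\text{levels}}$, and a subband ordering chosen here for convenience; the helper names `dwt`, `iwt`, `wpt` and `inverse_wpt` are hypothetical.

```python
import numpy as np

def dwt(x):
    """Haar DWT applied to every map in a stack: (N, H, W) -> (4N, H/2, W/2)."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    bands = [a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d]
    return np.concatenate(bands, axis=0)

def iwt(y):
    """Inverse of dwt: (4N, H, W) -> (N, 2H, 2W)."""
    x1, x2, x3, x4 = np.split(y, 4, axis=0)
    n, h, w = x1.shape
    x = np.empty((n, 2 * h, 2 * w), dtype=np.float64)
    x[:, 0::2, 0::2] = (x1 - x2 - x3 + x4) / 4
    x[:, 0::2, 1::2] = (x1 - x2 + x3 - x4) / 4
    x[:, 1::2, 0::2] = (x1 + x2 - x3 - x4) / 4
    x[:, 1::2, 1::2] = (x1 + x2 + x3 + x4) / 4
    return x

def wpt(image, levels):
    """Multi-level wavelet packet transform: every subband is decomposed again."""
    y = image[np.newaxis].astype(np.float64)
    for _ in range(levels):
        y = dwt(y)
    return y  # shape (4**levels, H / 2**levels, W / 2**levels)

def inverse_wpt(y, levels):
    """Undo wpt by applying the IWT once per level."""
    for _ in range(levels):
        y = iwt(y)
    return y[0]
```

With no processing between `wpt` and `inverse_wpt`, the input is recovered exactly (up to floating-point error); inserting learned blocks between the levels is what turns this pipeline into a network.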
In image processing applications such as image denoising and compression, some operations, e.g., soft-thresholding and quantization, are usually required to process the decomposition part [59, 58], as shown in Figure 2(a). These operations can be treated as a kind of nonlinearity tailored to the specific task. In this work, we further extend WPT to the multi-level wavelet-CNN (MWCNN) by plugging CNN blocks into the traditional WPT-based method, as illustrated in Figure 2(b). Due to the biorthogonal property of WPT, our MWCNN can use subsampling and upsampling operations safely without incurring information loss. Our MWCNN is thus a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping. Moreover, DWT can be treated as a downsampling operation and extended to any CNN where a pooling operation is required.
III-B Network architecture
III-B1 Image Restoration
As mentioned in Sec. III-A, we design the MWCNN architecture for image restoration based on the principle of the WPT, as illustrated in Figure 2(b). The key idea is to insert CNN blocks into WPT before (or after) each level of DWT. As shown in Figure 3, each CNN block is a 3-layer FCN without pooling, and takes both low-frequency and high-frequency subbands as inputs. More concretely, each layer contains convolution with $3\times 3$ filters (Conv) and rectified linear unit (ReLU) operations. Only Conv is adopted in the last layer for predicting the residual result. The number of convolutional layers is set to 24. For more details on the setting of MWCNN, please refer to Figure 3.
Our MWCNN modifies U-Net in three aspects. (i) In conventional U-Net, pooling and deconvolution are utilized as downsampling and upsampling layers. In comparison, DWT and IWT are used in MWCNN. (ii) After DWT, we deploy another CNN block to reduce the number of feature map channels for compact representation and for modeling inter-band dependency. Convolution is then adopted to increase the number of feature map channels, and IWT is utilized to upsample the feature map. In comparison, in conventional U-Net, convolution layers are used to increase the number of feature map channels, while pooling has no effect on the number of channels. For upsampling, deconvolution layers are directly adopted to enlarge the feature map. (iii) In MWCNN, element-wise summation is used to combine the feature maps from the contracting and expanding subnetworks, whereas conventional U-Net adopts concatenation. Compared to our previous work, we have made several improvements: (i) Instead of directly decomposing input images by DWT, we first use conv blocks to extract features from the input, which is empirically shown to be beneficial for image restoration. (ii) In the 3rd hierarchical level, we use more feature maps to enhance feature representation. In our implementation, the Haar wavelet is adopted as the default wavelet in MWCNN. Other wavelets, e.g., Daubechies 2 (DB2), are also considered in our experiments.
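Denote by $\Theta$ the network parameters of MWCNN, and let $\mathcal{F}(\mathbf{y};\Theta)$ be the network output. Let $\{(\mathbf{y}_i,\mathbf{x}_i)\}_{i=1}^{N}$ be a training set, where $\mathbf{y}_i$ is the $i$-th input image and $\mathbf{x}_i$ is the corresponding ground-truth image. Then the objective function for learning MWCNN is given by
$$\mathcal{L}(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|\mathcal{F}(\mathbf{y}_i;\Theta)-\mathbf{x}_i\right\|_F^2. \tag{4}$$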
The ADAM algorithm  is adopted to train MWCNN by minimizing the objective function.
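A squared-error objective of the kind commonly used for restoration networks can be sketched as below. The $1/(2N)$ scaling and the squared Frobenius norm are assumptions of this sketch, and `restoration_objective` is a hypothetical name; the arrays stand in for network outputs and ground-truth images.

```python
import numpy as np

def restoration_objective(outputs, targets):
    """Sum of squared errors over a batch, scaled by 1 / (2N).

    outputs, targets: arrays of shape (N, ...) holding N network outputs and
    the N corresponding ground-truth images.
    """
    outputs = np.asarray(outputs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    n = outputs.shape[0]
    return float(np.sum((outputs - targets) ** 2) / (2 * n))
```

In practice the gradient of this quantity with respect to the network parameters would be supplied to the optimizer by an automatic differentiation framework.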
III-B2 Extension to Object Classification
Similar to image restoration, DWT is employed as a downsampling operation to replace the pooling operation, usually without an upsampling operation. A compression filter with Conv is subsequently applied after the DWT. Note that we do not modify other blocks or the loss function. With this modification, features can be further selected and enhanced through adaptive learning. Moreover, DWT can be used instead of the pooling operation in any CNN that uses pooling, so that the information in the feature maps is transmitted to the next layer without loss. DWT can thus be seen as a safe downsampling module that can be plugged into any CNN without changing the network architecture, and it may help extract more powerful features for different tasks.
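The shape bookkeeping of replacing a pooling layer by DWT plus a compression filter can be sketched as follows. The sketch assumes the unnormalized Haar filters and models the compression filter as a pointwise ($1\times 1$) convolution with random weights; both the pointwise choice and the names `dwt_downsample`, `pointwise_conv` and `max_pool2` are assumptions made for illustration.

```python
import numpy as np

def dwt_downsample(feat):
    """Haar DWT on a feature map: (C, H, W) -> (4C, H/2, W/2), no values discarded."""
    a = feat[:, 0::2, 0::2]
    b = feat[:, 0::2, 1::2]
    c = feat[:, 1::2, 0::2]
    d = feat[:, 1::2, 1::2]
    return np.concatenate(
        [a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d], axis=0
    )

def pointwise_conv(feat, weight):
    """1x1 convolution: mixes channels with a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, feat)

def max_pool2(feat):
    """2x2 max pooling with stride 2, for shape comparison only."""
    c, h, w = feat.shape
    return feat.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
weight = rng.standard_normal((16, 64))      # stands in for learned weights
subbands = dwt_downsample(feat)             # (64, 4, 4): same element count as feat
compressed = pointwise_conv(subbands, weight)  # (16, 4, 4): same shape as pooling output
pooled = max_pool2(feat)                    # (16, 4, 4)
```

Because the compressed output has the same shape as the pooled output, the rest of the network can stay unchanged, while the channel mixing is learned rather than fixed.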
III-C1 Connection to Pooling Operation
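The DWT in the proposed MWCNN is closely related to the pooling operation and dilated filtering. Using the Haar wavelet as an example, we explain the connection between DWT and average pooling. According to the theory of average pooling with factor 2, the $(i,j)$-th value of the feature map $\mathbf{x}_l$ in the $l$-th layer after pooling can be written as
$$\mathbf{x}_l(i,j)=\big(\mathbf{x}_{l-1}(2i-1,2j-1)+\mathbf{x}_{l-1}(2i-1,2j)+\mathbf{x}_{l-1}(2i,2j-1)+\mathbf{x}_{l-1}(2i,2j)\big)/4, \tag{5}$$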
where $\mathbf{x}_{l-1}$ is the feature map before the pooling operation. It is clear that Eq. 5 is the same as the low-frequency component of DWT in Eq. 2 up to a constant scaling factor, which also means that all the high-frequency information is lost during the pooling operation. In Figure 4, the feature map is first decomposed into four sub-images with stride 2. The average pooling operation can be treated as summing all sub-images with fixed coefficients to generate a new sub-image. In comparison, DWT uses all sub-images with four fixed orthogonal weights to obtain four new sub-images. By taking all the subbands into account, MWCNN can therefore avoid the information loss caused by conventional subsampling, which may benefit restoration and classification. Hence, average pooling can be seen as a simplified variant of the proposed MWCNN.
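The relation between average pooling and the low-frequency Haar subband is easy to check numerically. The sketch below assumes the unnormalized Haar low-pass filter (all ones), so the two differ by a factor of 4; `average_pool2` and `haar_lowpass` are hypothetical helper names.

```python
import numpy as np

def average_pool2(x):
    """2x2 average pooling with stride 2 on an (H, W) map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def haar_lowpass(x):
    """Low-frequency Haar subband: sum over each non-overlapping 2x2 block."""
    return x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
# Average pooling keeps only the (rescaled) low-frequency subband.
same_up_to_scale = np.allclose(average_pool2(x), haar_lowpass(x) / 4)
```

The three high-frequency subbands have no counterpart in the pooled output, which is the information that pooling discards.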
III-C2 Connection to Dilated Filtering
To illustrate the connection between MWCNN and dilated filtering, we first give the definition of dilated filtering with factor 2:
where means convolution operation with dilated factor 2, is the position in convolutional kernel , and is the position within the range of convolution of feature . Eq. (6) can be decomposed into two steps, sampling and convolving. A sampled patch is obtained by sampling at center position of with one interval pixel under the constraint . Then the value is obtained by convolving the sampled patch with kernel . Therefore, dilated filtering with factor 2 can be expressed as first decomposing an image into four sub-images and then applying the shared standard convolutional kernel to those sub-images, as illustrated in Figure 4. We rewrite Eq. (6) for obtaining the pixel value as follows:
Then the pixel value , and can be obtained in the same way. Actually, the value of at the position , , and can be obtained by applying IWT on subband images , , and based on Eq. (3). Therefore, dilated filtering can be represented as convolution with the subband images as follows:
Different from dilated filtering, the definition of MWCNN in Figure 4 can be given as
where , and denotes concatenate operation. If is group convolution  with factor 4, the equation can be rewritten as:
Note that can accurately reconstruct by using IWT. Compared to Eq. (8), the weights of each subband and the corresponding convolutions are different. This means that our MWCNN reduces to dilated filtering if the subbands are replaced by the sub-images obtained after IWT in Eq. (3), and the convolutions in are shared with each other. Hence, dilated filtering can be seen as a variant of the proposed MWCNN.
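The statement that dilated filtering with factor 2 equals a shared standard convolution applied to four sub-images can be verified numerically. The sketch below assumes a $3\times 3$ kernel, zero padding, cross-correlation (the usual CNN convention) and even image sides; `conv2d_same` and `dilated_via_subimages` are hypothetical names.

```python
import numpy as np

def conv2d_same(x, k, dilation=1):
    """Zero-padded 3x3 cross-correlation of an (H, W) map with the given dilation."""
    r = dilation
    h, w = x.shape
    xp = np.pad(x, r)  # zero padding of width `dilation` on every side
    y = np.zeros((h, w))
    for p in (-1, 0, 1):
        for q in (-1, 0, 1):
            y += k[p + 1, q + 1] * xp[r + r * p : r + r * p + h,
                                      r + r * q : r + r * q + w]
    return y

def dilated_via_subimages(x, k):
    """Split x into four stride-2 sub-images, filter each with the same
    standard (dilation 1) kernel, and interleave the results."""
    y = np.empty(x.shape)
    for a in (0, 1):
        for b in (0, 1):
            y[a::2, b::2] = conv2d_same(x[a::2, b::2], k, dilation=1)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10))
k = rng.standard_normal((3, 3))
equivalent = np.allclose(conv2d_same(x, k, dilation=2), dilated_via_subimages(x, k))
```

Each output pixel of the dilated filter only sees one of the four sub-images, which is the sparse sampling behind the gridding effect; filtering the wavelet subbands instead mixes all four.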
Compared with dilated filtering, MWCNN can also avoid the gridding effect. As depth increases, dilated filtering with a fixed factor greater than 1 only considers a sparse sampling of units in a checkerboard pattern, resulting in a large amount of information loss (see Figure 5). Another problem with dilated filtering is that two neighboring output pixels may be computed from completely non-overlapping input units (see Figure 5), which may cause inconsistency of local information. Figure 5 also illustrates the receptive field of MWCNN, which is quite different from that of dilated filtering. With dense sampling, the convolution filter takes multi-frequency information as input, and the receptive field is doubled after DWT. One can see that MWCNN addresses the sparse sampling and local inconsistency problems well, and is expected to benefit restoration quantitatively.
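The doubling of the receptive field after DWT follows from the standard receptive-field recursion for stacked layers. The sketch below is a generic calculation, not taken from the paper: it treats DWT as a $2\times 2$ filter with stride 2 and compares three $3\times 3$ convolutions with and without a preceding DWT; the layer configurations are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per axis) of a stack of layers.

    layers: iterable of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

conv = (3, 1, 1)
dwt = (2, 2, 1)  # 2x2 Haar filters applied with stride 2

plain = receptive_field([conv] * 3)             # 7
with_dwt = receptive_field([dwt] + [conv] * 3)  # 14
dilated = receptive_field([(3, 1, 2)] * 3)      # 13, but sampled sparsely
```

The dilated stack reaches a similar extent, but only every second pixel inside it contributes, whereas the units after DWT depend on all pixels of the covered region.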
In this section, we first describe the application of MWCNN to image restoration. Then ablation experiments are presented to analyze the contribution of each component. Finally, the proposed MWCNN is extended to object classification.
IV-A Experimental Setting for Image Restoration
Table I: Average PSNR (dB) / SSIM results of the competing methods for image denoising.
| Dataset | Noise level | BM3D | TNRD | DnCNN | IRCNN | RED30 | MemNet | FFDNet | MWCNN(P) | MWCNN |
|---|---|---|---|---|---|---|---|---|---|---|
| Set12 | 15 | 32.37 / 0.8952 | 32.50 / 0.8962 | 32.86 / 0.9027 | 32.77 / 0.9008 | - | - | 32.75 / 0.9027 | 33.15 / 0.9088 | 33.20 / 0.9089 |
| Set12 | 25 | 29.97 / 0.8505 | 30.05 / 0.8515 | 30.44 / 0.8618 | 30.38 / 0.8601 | - | - | 30.43 / 0.8634 | 30.79 / 0.8711 | 30.84 / 0.8718 |
| Set12 | 50 | 26.72 / 0.7676 | 26.82 / 0.7677 | 27.18 / 0.7827 | 27.14 / 0.7804 | 27.34 / 0.7897 | 27.38 / 0.7931 | 27.32 / 0.7903 | 27.74 / 0.8056 | 27.79 / 0.8060 |
| BSD68 | 15 | 31.08 / 0.8722 | 31.42 / 0.8822 | 31.73 / 0.8906 | 31.63 / 0.8881 | - | - | 31.63 / 0.8902 | 31.86 / 0.8947 | 31.91 / 0.8952 |
| BSD68 | 25 | 28.57 / 0.8017 | 28.92 / 0.8148 | 29.23 / 0.8278 | 29.15 / 0.8249 | - | - | 29.19 / 0.8289 | 29.41 / 0.8360 | 29.46 / 0.8370 |
| BSD68 | 50 | 25.62 / 0.6869 | 25.97 / 0.7021 | 26.23 / 0.7189 | 26.19 / 0.7171 | 26.35 / 0.7245 | 26.35 / 0.7294 | 26.29 / 0.7245 | 26.53 / 0.7366 | 26.58 / 0.7382 |
| Urban100 | 15 | 32.34 / 0.9220 | 31.98 / 0.9187 | 32.67 / 0.9250 | 32.49 / 0.9244 | - | - | 32.43 / 0.9273 | 33.17 / 0.9357 | 33.22 / 0.9361 |
| Urban100 | 25 | 29.70 / 0.8777 | 29.29 / 0.8731 | 29.97 / 0.8792 | 29.82 / 0.8839 | - | - | 29.92 / 0.8886 | 30.66 / 0.9026 | 30.74 / 0.9035 |
| Urban100 | 50 | 25.94 / 0.7791 | 25.71 / 0.7756 | 26.28 / 0.7869 | 26.14 / 0.7927 | 26.48 / 0.7991 | 26.64 / 0.8024 | 26.52 / 0.8057 | 27.42 / 0.8371 | 27.53 / 0.8393 |
IV-A1 Training set
To train our MWCNN, we adopt DIV2K as our training dataset. Concretely, DIV2K contains 800 images with about 2K resolution for training, 100 images for validation, and 100 images for testing. Due to the receptive field of MWCNN being , we crop patches with the size of from the training images in the training stage.
For image denoising, we consider three noise levels, i.e., $\sigma$ = 15, 25 and 50, and evaluate our denoising method on three datasets, i.e., Set12, BSD68, and Urban100. For SISR, we take the upsampled image as the input to MWCNN with three specific scale factors, i.e., $\times 2$, $\times 3$ and $\times 4$. Four widely used datasets, Set5, Set14, BSD100 and Urban100, are adopted to evaluate SISR performance. For JPEG image artifacts removal, we follow the setting used in , i.e., four compression quality settings $Q$ = 10, 20, 30 and 40 for the JPEG encoder. Two datasets, Classic5 and LIVE1, are used for evaluating our method.
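How a noisy test input at a given noise level can be synthesized, and how PSNR is computed, can be sketched as follows. The sketch assumes additive white Gaussian noise on the 0 to 255 intensity scale without clipping or quantization, and a peak value of 255 for PSNR; these conventions and the names `add_gaussian_noise` and `psnr` are assumptions, not details stated in the text.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma.

    img: float array on the [0, 255] scale; rng: a numpy random Generator.
    """
    return img + rng.normal(0.0, sigma, size=img.shape)

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

For example, a clean image corrupted with `sigma = 25` has a PSNR of roughly 20 dB before denoising, which gives a baseline against which the restored results can be read.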
IV-A2 Network training
In image restoration, an MWCNN model is learned for each degradation setting. The ADAM algorithm with , and is adopted for optimization, and we use a mini-batch size of 24. The learning rate is decayed exponentially from to in the 200 epochs. Rotation and/or flip based data augmentation is used during mini-batch learning. The MatConvNet package with an NVIDIA GTX1080 GPU is utilized for training and testing.
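Rotation/flip augmentation and an exponentially decayed learning rate can be sketched as below. The eight-mode scheme (four rotations, each optionally flipped) and the geometric interpolation between two endpoint rates are common choices assumed here; the endpoint values in the usage line are placeholders, not the values used in the paper, and the names `augment` and `exponential_lr` are hypothetical.

```python
import numpy as np

def augment(patch, mode):
    """Return one of 8 rotated/flipped versions of a 2D patch.

    mode in 0..7: rotate by (mode % 4) quarter turns, then flip
    left-right when mode >= 4.
    """
    out = np.rot90(patch, k=mode % 4)
    if mode >= 4:
        out = np.fliplr(out)
    return out

def exponential_lr(epoch, num_epochs, lr_start, lr_end):
    """Learning rate decayed geometrically from lr_start (epoch 0)
    to lr_end (last epoch)."""
    t = epoch / (num_epochs - 1)
    return lr_start * (lr_end / lr_start) ** t

# Placeholder endpoints for illustration only.
schedule = [exponential_lr(e, 200, 1e-3, 1e-4) for e in range(200)]
```

During mini-batch learning, a mode would typically be drawn at random for each patch, and the same mode applied to the degraded input and its ground truth.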
IV-B Quantitative and qualitative evaluation on Image Restoration Tasks
Comprehensive experiments are conducted to evaluate our 24-layer MWCNN using the same setting as in Sec. III-B on three representative image restoration tasks, respectively. Here, we also provide the results of our previous work and denote it as MWCNN(P) .
IV-B1 Image denoising
For image denoising, only gray images are used for training and evaluation, since most denoising methods are only trained and tested on gray images. We compare with two classic denoising methods, i.e., BM3D and TNRD, and five CNN-based methods, i.e., DnCNN, IRCNN, RED30, MemNet, and FFDNet. Table I lists the average PSNR/SSIM results of the competing methods on these three datasets. Since RED30 and MemNet do not provide models trained at noise levels 15 and 25, we use the symbol ‘-’ instead. All the competing methods perform worse than our MWCNN. It is worth noting that our MWCNN outperforms DnCNN and FFDNet by about dB in terms of PSNR on Set12, and slightly surpasses them by dB on BSD68. On Urban100, our MWCNN generally achieves favorable performance when compared with the competing methods. Specifically, the average PSNR of our MWCNN is 0.5 dB higher than that of DnCNN on Set12, and 1.2 dB higher on Urban100, when the noise level is 50. Figure 6 shows the denoising results of the image “Test044” from BSD68 with the noise level . One can see that our MWCNN is promising in removing noise while recovering image details and structures, and obtains visually more pleasant results than the competing methods, owing to the reversibility of WPT during downsampling and upsampling.
Table II: Average PSNR (dB) / SSIM results of the competing methods for SISR.
| Dataset | Scale factor | VDSR | DnCNN | RED30 | SRResNet | LapSRN | DRRN | MemNet | WaveResNet | SRMDNF | MWCNN(P) | MWCNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Set5 | 2 | 37.53 / 0.9587 | 37.58 / 0.9593 | 37.66 / 0.9599 | - | 37.52 / 0.9590 | 37.74 / 0.9591 | 37.78 / 0.9597 | 37.57 / 0.9586 | 37.79 / 0.9601 | 37.91 / 0.9600 | 37.95 / 0.9605 |
| Set5 | 3 | 33.66 / 0.9213 | 33.75 / 0.9222 | 33.82 / 0.9230 | - | - | 34.03 / 0.9244 | 34.09 / 0.9248 | 33.86 / 0.9228 | 34.12 / 0.9250 | 34.18 / 0.9272 | 34.21 / 0.9273 |
| Set5 | 4 | 31.35 / 0.8838 | 31.40 / 0.8845 | 31.51 / 0.8869 | 32.05 / 0.8902 | 31.54 / 0.8850 | 31.68 / 0.8888 | 31.74 / 0.8893 | 31.52 / 0.8864 | 31.96 / 0.8925 | 32.12 / 0.8941 | 32.14 / 0.8951 |
| Set14 | 2 | 33.03 / 0.9124 | 33.04 / 0.9118 | 32.94 / 0.9144 | - | 33.08 / 0.9130 | 33.23 / 0.9136 | 33.28 / 0.9142 | 33.09 / 0.9129 | 33.05 / 0.8985 | 33.70 / 0.9182 | 33.71 / 0.9182 |
| Set14 | 3 | 29.77 / 0.8314 | 29.76 / 0.8349 | 29.61 / 0.8341 | - | - | 29.96 / 0.8349 | 30.00 / 0.8350 | 29.88 / 0.8331 | 30.04 / 0.8372 | 30.16 / 0.8414 | 30.14 / 0.8413 |
| Set14 | 4 | 28.01 / 0.7674 | 28.02 / 0.7670 | 27.86 / 0.7718 | 28.49 / 0.7783 | 28.19 / 0.7720 | 28.21 / 0.7720 | 28.26 / 0.7723 | 28.11 / 0.7699 | 28.41 / 0.7816 | 28.41 / 0.7816 | 28.58 / 0.7882 |
| BSD100 | 2 | 31.90 / 0.8960 | 31.85 / 0.8942 | 31.98 / 0.8974 | - | 31.80 / 0.8950 | 32.05 / 0.8973 | 32.08 / 0.8978 | 31.92 / 0.8965 | 32.23 / 0.8999 | 32.23 / 0.8999 | 32.30 / 0.9002 |
| BSD100 | 3 | 28.82 / 0.7976 | 28.80 / 0.7963 | 28.92 / 0.7993 | - | - | 28.95 / 0.8004 | 28.96 / 0.8001 | 28.86 / 0.7987 | 28.97 / 0.8030 | 29.12 / 0.8060 | 29.18 / 0.8106 |
| BSD100 | 4 | 27.29 / 0.7251 | 27.23 / 0.7233 | 27.39 / 0.7286 | 27.56 / 0.7354 | 27.32 / 0.7280 | 27.38 / 0.7284 | 27.40 / 0.7281 | 27.32 / 0.7266 | 27.62 / 0.7355 | 27.62 / 0.7355 | 27.67 / 0.7357 |
| Urban100 | 2 | 30.76 / 0.9140 | 30.75 / 0.9133 | 30.91 / 0.9159 | - | 30.41 / 0.9100 | 31.23 / 0.9188 | 31.31 / 0.9195 | 30.96 / 0.9169 | 32.30 / 0.9296 | 32.30 / 0.9296 | 32.36 / 0.9306 |
| Urban100 | 3 | 27.14 / 0.8279 | 27.15 / 0.8276 | 27.31 / 0.8303 | - | - | 27.53 / 0.8378 | 27.56 / 0.8376 | 27.28 / 0.8334 | 27.57 / 0.8401 | 28.13 / 0.8514 | 28.19 / 0.8520 |
| Urban100 | 4 | 25.18 / 0.7524 | 25.20 / 0.7521 | 25.35 / 0.7587 | 26.07 / 0.7839 | 25.21 / 0.7560 | 25.44 / 0.7638 | 25.50 / 0.7630 | 25.36 / 0.7614 | 26.27 / 0.7890 | 26.27 / 0.7890 | 26.37 / 0.7891 |
IV-B2 Single image super-resolution
We also train our MWCNN on the SISR task using only the luminance channel, i.e., Y in the YCbCr color space, following . Interpolation is used for image degradation, and the degraded image is upsampled by interpolation before being sent to the network. For qualitative comparisons, we use the source codes of nine CNN-based methods, including VDSR, DnCNN, RED30, SRResNet, LapSRN, DRRN, MemNet, WaveResNet and SRMDNF. Since the source code of SRResNet is not released, its results shown in Table II are incomplete. The results of LapSRN with scale $\times 3$ are not listed since they are not reported in the authors’ paper.
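Extracting the luminance channel can be sketched as follows. The text does not state which RGB-to-YCbCr convention is used; the sketch assumes the ITU-R BT.601 "studio range" luma (values in [16, 235] for RGB inputs in [0, 1]), which is a common choice in SISR evaluation, and `rgb_to_y` is a hypothetical helper name.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luminance (Y of YCbCr, BT.601 studio range) from an RGB image.

    rgb: float array of shape (..., 3) with values in [0, 1].
    Returns an array of shape (...) with values in [16, 235].
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b
```

Under this assumption, both training and PSNR/SSIM evaluation would operate on the single-channel output of `rgb_to_y`, while the chroma channels are left to plain interpolation.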
Table II summarizes the average PSNR/SSIM results of the competing methods on the four datasets, citing the results reported in their respective papers. In terms of both PSNR and SSIM, the proposed MWCNN outperforms the other methods in all cases. Compared with the well-known VDSR, our MWCNN achieves a notable PSNR gain of about 0.4~0.8 dB on Set5 and Set14. On Urban100, our MWCNN outperforms VDSR by a large margin of about 1.0~1.6 dB. Even though SRMDNF is trained in RGB space, it is still slightly weaker than our MWCNN. It can also be seen that our MWCNN outperforms WaveResNet by no less than 0.3 dB. We provide qualitative comparisons with the competing methods on the image “253027” from BSD100 in Figure 7. As one can see, our MWCNN correctly recovers fine and detailed textures and produces sharp edges, owing to the frequency and location characteristics of DWT.
Table III: Average PSNR (dB) / SSIM results of the competing methods for JPEG image artifacts removal.
| Dataset | Quality factor | JPEG | ARCNN | TNRD | DnCNN | MemNet | MWCNN(P) | MWCNN |
|---|---|---|---|---|---|---|---|---|
| Classic5 | 10 | 27.82 / 0.7595 | 29.03 / 0.7929 | 29.28 / 0.7992 | 29.40 / 0.8026 | 29.69 / 0.8107 | 30.01 / 0.8195 | 30.03 / 0.8201 |
| Classic5 | 20 | 30.12 / 0.8344 | 31.15 / 0.8517 | 31.47 / 0.8576 | 31.63 / 0.8610 | 31.90 / 0.8658 | 32.16 / 0.8701 | 32.20 / 0.8708 |
| Classic5 | 30 | 31.48 / 0.8744 | 32.51 / 0.8806 | 32.78 / 0.8837 | 32.91 / 0.8861 | - | 33.43 / 0.8930 | 33.46 / 0.8934 |
| Classic5 | 40 | 32.43 / 0.8911 | 33.34 / 0.8953 | - | 33.77 / 0.9003 | - | 34.27 / 0.9061 | 34.31 / 0.9063 |
| LIVE1 | 10 | 27.77 / 0.7730 | 28.96 / 0.8076 | 29.15 / 0.8111 | 29.19 / 0.8123 | 29.45 / 0.8193 | 29.69 / 0.8254 | 29.70 / 0.8260 |
| LIVE1 | 20 | 30.07 / 0.8512 | 31.29 / 0.8733 | 31.46 / 0.8769 | 31.59 / 0.8802 | 31.83 / 0.8846 | 32.04 / 0.8885 | 32.07 / 0.8886 |
| LIVE1 | 30 | 31.41 / 0.9000 | 32.67 / 0.9043 | 32.84 / 0.9059 | 32.98 / 0.9090 | - | 33.45 / 0.9153 | 33.46 / 0.9155 |
| LIVE1 | 40 | 32.35 / 0.9173 | 33.63 / 0.9198 | - | 33.96 / 0.9247 | - | 34.45 / 0.9301 | 34.47 / 0.9300 |
IV-B3 JPEG image artifacts removal
We apply our method to JPEG image artifacts removal to further demonstrate the applicability of our MWCNN to image restoration. Here, both the JPEG encoder and JPEG image artifacts removal focus only on the Y channel. Following , we consider four settings of the quality factor, i.e., $Q$ = 10, 20, 30 and 40, for the JPEG encoder. In our experiments, MWCNN is compared to four competing methods, i.e., ARCNN, TNRD, DnCNN, and MemNet. The results of MemNet and TNRD are incomplete, in accordance with their papers and released source codes.
Table III shows the average PSNR/SSIM results of the competing methods on Classic5 and LIVE1. Our MWCNN obtains superior performance to the other methods in terms of quantitative metrics for all four quality factors. Compared to ARCNN on Classic5, our MWCNN is about 1 dB higher in terms of PSNR. One can see that the PSNR of MWCNN is 0.2~0.3 dB higher than that of the second best method (i.e., MemNet). In addition to the quantitative comparison, we also provide a perceptual comparison on the image “carnivaldolls” from LIVE1 with a quality factor of 10. Compared with other methods, our MWCNN is more effective in removing artifacts and restoring detailed textures and sharp salient edges.
Table IV: GPU running time of the competing methods on the three tasks.
| Image Denoising |
| Size | FFDNet | DnCNN | RED30 | MemNet | MWCNN |
| Single Image Super-Resolution |
| Size | VDSR | LapSRN | DRRN | MemNet | MWCNN |
| JPEG Image Artifacts Removal |
| Size | ARCNN | TNRD | DnCNN | MemNet | MWCNN |
IV-B4 Running time
As mentioned previously, the efficiency of CNNs is also an important measure of network performance. We consider the CNN-based methods with available source code and list the GPU running time of the competing methods for the three tasks in Table IV. Note that the Nvidia cuDNN-v7.0 deep learning library with CUDA 9.2 is adopted to accelerate the GPU computation under Ubuntu 16.04. In comparison to the state-of-the-art methods, i.e., RED30 , DRRN  and MemNet , MWCNN takes far less time yet obtains better PSNR/SSIM results. Meanwhile, MWCNN is moderately slower than the remaining methods but achieves higher PSNR/SSIM values. This indicates that the effectiveness of MWCNN should be attributed to the incorporation of DWT into the CNN rather than to an increase in network depth/width.
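For readers who wish to reproduce this kind of running-time comparison, the following Python sketch shows one common way to measure per-image inference time: a few untimed warm-up calls followed by the median over several timed runs. The function `time_inference` and the stand-in workload `dummy_restore` are hypothetical names introduced here; this is not the measurement code used for Table IV. On a GPU, the device would additionally have to be synchronized before each clock reading, because kernels are launched asynchronously.

```python
import statistics
import time


def time_inference(restore, image, warmup=3, runs=10):
    """Median wall-clock seconds per call of restore(image)."""
    for _ in range(warmup):  # warm-up calls are not timed
        restore(image)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        restore(image)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without a GPU or a trained model.
def dummy_restore(image):
    return [[2.0 * v for v in row] for row in image]


image = [[float(i + j) for j in range(64)] for i in range(64)]
seconds = time_inference(dummy_restore, image)
print(f"median time per image: {seconds:.6f} s")
```

The median is used rather than the mean so that a single slow run, for example one disturbed by another process, does not dominate the reported figure.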
IV-C Comparison of MWCNN variants
Using image denoising and JPEG image artifacts removal as examples, we mainly focus on two groups of MWCNN variants: (i) ablation experiments to demonstrate where the improved performance comes from, and (ii) related methods, such as wavelet-based approaches and dilated filtering, to verify the effectiveness of the proposed method. Note that the 24-layer MWCNN is employed as our baseline, and all the MWCNN variants are designed using the same architecture for fair comparison.
|Image Denoising ()|
|Dataset||U-Net ||U-Net +S||U-Net +D||MWCNN (P+C)||MWCNN (Haar)||MWCNN (DB2)||MWCNN (HD)|
|Set12||27.42 / 0.079||27.41 / 0.074||27.46 / 0.080||27.76 / 0.081||27.79 / 0.075||27.81 / 0.127||27.77 / 0.091|
|Set68||26.30 / 0.076||26.29 / 0.071||26.21 / 0.075||26.54 / 0.077||26.58 / 0.072||26.59 / 0.114||26.57 / 0.086|
|Urban100||26.68 / 0.357||26.72 / 0.341||26.99 / 0.355||27.46 / 0.354||27.53 / 0.346||27.55 / 0.576||27.50 / 0.413|
|JPEG Image Artifacts Removal ()|
|Classic5||29.61 / 0.093||29.60 / 0.082||29.68 / 0.097||30.02 / 0.091||30.03 / 0.083||30.04 / 0.185||29.99 / 0.115|
|LIVE1||29.36 / 0.112||29.36 / 0.109||29.43 / 0.120||29.69 / 0.120||29.70 / 0.111||29.71 / 0.234||29.68 / 0.171|
IV-C1 Ablation experiments
Ablation experiments are provided to verify the effectiveness of the additionally embedded wavelet transform: (i) U-Net: the default U-Net with the same architecture as MWCNN, (ii) U-Net+S: using sum connection instead of concatenation, and (iii) U-Net+D: adopting learnable conventional downsampling filters, i.e., convolution with stride 2, to replace max pooling. We also compare with a modified MWCNN(P), denoted as MWCNN (P+C), which adds one CNN layer right after the input and another CNN layer before the output. Three MWCNN variants with different wavelet transforms are also considered: (i) MWCNN (Haar): the default MWCNN with the Haar wavelet, (ii) MWCNN (DB2): MWCNN with the Daubechies-2 wavelet, and (iii) MWCNN (HD): MWCNN with Haar in the contracting subnetwork and Daubechies-2 in the expanding subnetwork.
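One plausible reason why a sum connection can be cheaper than concatenation is that the convolution following the skip connection sees fewer input channels. The Python sketch below counts the weights of that convolution in both cases. The feature width `C`, the 3×3 kernel and the helper `conv_weights` are assumptions chosen for illustration only; they are not the layer configuration of the networks compared in this section.

```python
def conv_weights(in_channels, out_channels, kernel=3):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


C = 160  # hypothetical feature width at one scale

# Concatenation stacks skip and decoder features: 2C input channels.
concat_cost = conv_weights(2 * C, C)
# Element-wise sum keeps the channel count at C.
sum_cost = conv_weights(C, C)

print("concat:", concat_cost, "sum:", sum_cost, "ratio:", concat_cost / sum_cost)
```

Only the first convolution after each skip connection is affected, which would be consistent with a small rather than a large change in overall running time.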
Table V lists the PSNR and running time results of these methods. We have the following observations. (i) The ablation experiments indicate that adopting sum connection instead of concatenation slightly improves efficiency with almost no decrease in PSNR. (ii) Due to the biorthogonal and time-frequency localization properties of wavelets, our wavelet-based method is more powerful for image restoration. The pooling operation causes loss of high-frequency information, which makes the degraded image harder to recover. MWCNN also clearly outperforms U-Net+D, which adopts learnable downsampling filters. This indicates that learning alone is not enough and that violating invertibility can cause information loss. (iii) Compared with MWCNN (P+C), the proposed method still performs slightly better even though MWCNN (P+C) has more layers, which verifies the effectiveness of the proposed method. (iv) Compared with MWCNN (DB2) and MWCNN (HD), MWCNN (Haar) attains nearly the same PSNR (within about 0.02 dB of MWCNN (DB2)) at a clearly lower running time, so using the Haar wavelet for downsampling and upsampling is the best choice overall. MWCNN (Haar) is no slower than U-Net and the dilated CNN but achieves higher PSNR, which demonstrates the effectiveness of MWCNN in trading off performance against efficiency.
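The role of invertibility in observation (ii) can be illustrated with a small NumPy sketch. It implements one level of the orthonormal 2-D Haar transform on a single-channel array and contrasts it with 2×2 max pooling. The function names `haar_dwt2`, `haar_idwt2` and `max_pool2` are hypothetical, and the code is a minimal illustration rather than the implementation used in the experiments.

```python
import numpy as np


def haar_dwt2(x):
    """One level of 2-D orthonormal Haar DWT: (H, W) -> (4, H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    low = (a + b + c + d) / 2            # low-frequency (approximation) band
    det1 = (a + b - c - d) / 2           # three detail bands
    det2 = (a - b + c - d) / 2
    det3 = (a - b - c + d) / 2
    return np.stack([low, det1, det2, det3])


def haar_idwt2(bands):
    """Exact inverse of haar_dwt2: (4, H/2, W/2) -> (H, W)."""
    low, det1, det2, det3 = bands
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + det1 + det2 + det3) / 2
    x[0::2, 1::2] = (low + det1 - det2 - det3) / 2
    x[1::2, 0::2] = (low - det1 + det2 - det3) / 2
    x[1::2, 1::2] = (low - det1 - det2 + det3) / 2
    return x


def max_pool2(x):
    """2x2 max pooling: keeps one value out of every four."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
bands = haar_dwt2(x)
print("subbands:", bands.shape, "total values:", bands.size)
print("perfect reconstruction:", np.allclose(haar_idwt2(bands), x))
print("values kept by max pooling:", max_pool2(x).size, "of", x.size)
```

Both operations halve the spatial resolution, but the Haar transform keeps all values in its four subbands and can be inverted exactly, whereas max pooling discards three quarters of them.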
|Image Denoising ()|
|Dataset||Dilated ||Dilated-2||DCF ||DCF+R||WaveResNet ||MWCNN (Haar)|
|Set12||27.45 / 0.181||24.81 / 0.185||27.38 / 0.081||27.61 / 0.081||27.49 / 0.179||27.79 / 0.075|
|Set68||26.35 / 0.142||24.32 / 0.174||26.30 / 0.075||26.43 / 0.075||26.38 / 0.143||26.58 / 0.072|
|Urban100||26.56 / 0.764||24.18 / 0.960||26.65 / 0.354||27.18 / 0.354||- / -||27.53 / 0.346|
|JPEG Image Artifacts Removal ()|
|Classic5||29.72 / 0.287||29.49 / 0.302||29.57 / 0.104||29.88 / 0.104||- / -||30.03 / 0.083|
|LIVE1||29.49 / 0.354||29.26 / 0.376||29.38 / 0.155||29.63 / 0.155||- / -||29.70 / 0.111|
IV-C2 More MWCNN variants
We further compare the PSNR results with more related variants. Two 24-layer dilated CNNs are provided: (i) Dilated: hybrid dilated convolution  is adopted to suppress the gridding effect, and (ii) Dilated-2: the dilation factor of all layers is set to 2, which gives rise to the gridding effect. The WaveResNet method in  is also provided for comparison. Moreover, since the source code of deep convolutional framelets (DCF) is not available, we re-implement DCF without residual learning . DCF with residual learning (denoted as DCF+R) is also provided for fair comparison.
Table VI lists the PSNR and running time results of these methods. We have the following observations. (i) The gridding effect, with its sparse sampling and inconsistency of local information, indeed has an adverse influence on restoration performance. (ii) The worse performance of DCF also indicates that independent processing of subbands harms intra-frequency information dependency. (iii) The fact that MWCNN only slightly outperforms DCF+R suggests that the adverse influence of independent processing can be largely alleviated after several non-linear operations.
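The sparse sampling behind the gridding effect in observation (i) can be made concrete with a one-dimensional Python sketch. It enumerates which input offsets can influence a single output after stacking 3-tap filters with given dilation factors. The helper `reachable_offsets` and the mixed rates `[1, 2, 3]` are illustrative assumptions; they are not the exact configurations of the Dilated and Dilated-2 networks.

```python
from itertools import product


def reachable_offsets(dilations):
    """Input offsets (1-D) that can reach one output through stacked 3-tap filters."""
    taps = [(-r, 0, r) for r in dilations]  # tap positions of each layer
    return sorted({sum(combo) for combo in product(*taps)})


fixed = reachable_offsets([2, 2, 2])   # every layer uses dilation factor 2
hybrid = reachable_offsets([1, 2, 3])  # illustrative mixed dilation factors

print("fixed factor 2:", fixed)
print("mixed factors: ", hybrid)
```

With a fixed factor of 2 only even offsets are reachable, so every second input sample is never seen by a given output. Mixed factors cover the same span without holes.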
IV-C3 Hierarchical level of MWCNN
Here, we discuss the suitable number of levels for MWCNN. Although MWCNN can be extended to higher levels, a higher level also brings a deeper network and a heavier computational burden. Thus, we select the level that gives a good balance between efficiency and performance. Table VII reports the PSNR and running time results of MWCNNs with levels 0 to 4 (i.e., MWCNN-0 to MWCNN-4). We note that MWCNN-0 is a 6-layer CNN without WPT. In terms of PSNR, MWCNN-3 is much better than MWCNN-1 and MWCNN-2, while only negligibly weaker than MWCNN-4. Meanwhile, the speed of MWCNN-3 is moderate compared with the other levels. Based on the above analysis, we choose MWCNN-3 as the default setting.
|Dataset||MWCNN-0||MWCNN-1||MWCNN-2||MWCNN-3||MWCNN-4|
|Set12||26.84 / 0.017||27.25 / 0.041||27.64 / 0.064||27.79 / 0.073||27.80 / 0.087|
|Set68||25.71 / 0.016||26.21 / 0.039||26.47 / 0.060||26.58 / 0.070||26.59 / 0.081|
|Urban100||25.98 / 0.087||26.53 / 0.207||27.12 / 0.298||27.53 / 0.313||27.55 / 0.334|
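The choice of level can be read off the table above by looking at the marginal PSNR gain and the additional running time of each extra level. The Python sketch below does this for the Set12 row. It assumes that the five value columns correspond to MWCNN-0 through MWCNN-4 in order and that the second number in each cell is the running time in seconds; the variable names are illustrative.

```python
# Set12 row of the table above: PSNR (dB) and running time per level.
# Column order (levels 0..4) and time unit (seconds) are assumptions.
levels = [0, 1, 2, 3, 4]
psnr = [26.84, 27.25, 27.64, 27.79, 27.80]
secs = [0.017, 0.041, 0.064, 0.073, 0.087]

gains = [round(psnr[i] - psnr[i - 1], 2) for i in levels[1:]]
extra = [round(secs[i] - secs[i - 1], 3) for i in levels[1:]]

for lvl, g, t in zip(levels[1:], gains, extra):
    print(f"level {lvl - 1} -> {lvl}: +{g:.2f} dB, +{t:.3f} s")
```

Under these assumptions the gain shrinks from roughly 0.4 dB per level to 0.01 dB when going from level 3 to level 4, while the running time keeps growing, which matches the choice of MWCNN-3 as the default.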
IV-D Extension to Object Classification
Using object classification as an example, we test MWCNN on six well-known benchmarks: CIFAR-10, CIFAR-100 , SVHN , MNIST , ImageNet1K  and Places365 . CIFAR-10 and CIFAR-100 each consist of 60,000 32×32 colour images, in 10 and 100 classes respectively, including 50,000 images for training and 10,000 images for testing. MNIST  is a handwritten digit database with 28×28 resolution, which has a training set of 60,000 examples and a test set of 10,000 examples. The SVHN dataset contains more than 600,000 digit images. ImageNet1K is used in a resized form at two resolutions, 32×32 (denoted as ImageNet32) and 64×64 (denoted as ImageNet64). Places365 is also resized for training and testing in our work. Here, we modify and compare several CNN methods, such as pre-activation ResNet (PreResNet) , All-CNN , WideResNet , PyramidNet , DenseNet  and ResNet . Specifically, we follow  to verify MWCNN on the ResNet architecture. As in our MWCNN, we replace the avg-pooling operation with DWT followed by convolution, as described in Sec. III-B2, and denote the result as ‘MW’, while the original CNN is denoted as ‘Base’.
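The replacement of average pooling by DWT followed by convolution can be sketched in NumPy as follows. The sketch assumes a Haar transform applied per channel and a 1×1 convolution that maps the resulting 4C channels back to C; the kernel size, the random weights and the helper names `avg_pool2`, `haar_dwt_channels` and `conv1x1` are assumptions made for illustration, since the details of Sec. III-B2 are not restated here.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array."""
    ch, h, w = x.shape
    return x.reshape(ch, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def haar_dwt_channels(x):
    """Orthonormal Haar DWT per channel: (C, H, W) -> (4C, H/2, W/2)."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    low = (a + b + c + d) / 2
    det1 = (a + b - c - d) / 2
    det2 = (a - b + c - d) / 2
    det3 = (a - b - c + d) / 2
    return np.concatenate([low, det1, det2, det3], axis=0)


def conv1x1(x, weight):
    """1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 32, 32))
weight = 0.1 * rng.standard_normal((16, 64))  # stand-in for learned weights

base = avg_pool2(feat)                          # 'Base': (16, 16, 16)
subbands = haar_dwt_channels(feat)              # (64, 16, 16)
mw = conv1x1(subbands, weight)                  # 'MW': (16, 16, 16)

print(base.shape, subbands.shape, mw.shape)
# The low-frequency Haar band is average pooling scaled by 2.
print(np.allclose(subbands[:16], 2 * base))
```

In this sketch the low-frequency band carries the same information as average pooling, and the three detail bands supply what pooling would discard, so the subsequent convolution can learn which parts to keep.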
Table VIII shows the detailed accuracy results of the competing methods PreResNet , All-CNN , WideResNet , PyramidNet  and DenseNet  on CIFAR-10, CIFAR-100, SVHN, MNIST and ImageNet32. Table IX and Table X show the Top-1 and Top-5 errors of ResNet  on ImageNet64 and Places365. One can see that MWCNN surpasses the original CNN, which we attribute to the DWT used. Our MWCNN is quite different from DCF : DCF combines CNN with DWT during decomposition, where different CNNs are deployed for each subband. However, the results in Table VI indicate that independent processing of subbands is not suitable for image restoration. In contrast, MWCNN incorporates DWT into CNN from the perspective of enlarging the receptive field without information loss, which allows DWT to be embedded into any CNN with pooling operations. By taking all subbands as input, MWCNN is more powerful in modeling inter-band dependency. Moreover, MWCNN is formulated as a single, generic, plug-and-play module that can directly replace the downsampling operation without any adjustment to the network architecture.
In this paper, we present MWCNN to better trade off the receptive field and efficiency. To this end, DWT is introduced as a downsampling operation to reduce spatial resolution and enlarge the receptive field, and it can be embedded into any CNN that uses pooling. More specifically, MWCNN takes both low-frequency and high-frequency subbands as input and can thus perform downsampling without information loss. In addition, WPT can be treated as pre-defined parameters to ease network learning. We first design an architecture for image restoration based on U-Net, which consists of a contracting subnetwork and an expanding subnetwork (for object classification, only the contracting subnetwork is employed). Due to the invertibility of DWT and its frequency and location properties, the proposed MWCNN is effective in recovering detailed textures and sharp structures from degraded observations. Extensive experiments demonstrate the effectiveness and efficiency of MWCNN on three restoration tasks, i.e., image denoising, SISR, and JPEG image artifacts removal, and on object classification with different CNN methods. In future work, we aim to design novel network architectures to extend MWCNN to more restoration tasks such as image deblurring. We will also investigate flexible MWCNN models for handling blind restoration tasks. High-level dense prediction tasks, such as object detection and image segmentation, are often accomplished by adopting pooling for downsampling and then performing upsampling for dense prediction. Therefore, the limitations of the pooling operation may still be unfavorable in these tasks. We will also extend the proposed MWCNN to these tasks to better preserve fine-scale features.
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61872118, 61773002 and 61671182.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
-  J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, PP(99):1–1, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
-  A. Adler, D. Boublil, M. Elad, and M. Zibulevsky. A deep learning approach to block-based compressed sensing of images. arXiv preprint arXiv:1606.01519, 2016.
-  Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (red). arXiv preprint arXiv:1611.02862, 2016.
-  S. Yan, X. Xu, D. Xu, S. Lin, and X. Li. Image classification with densely sampled image windows and generalized adaptive multiple kernel learning. IEEE Transactions on Cybernetics, 45(3):381–390, 2015.
-  P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502, 2017.
-  Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  C. Dong, C. L. Chen, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407, 2016.
-  Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In IEEE International Conference on Computer Vision, 2017.
-  X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
-  I. Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961–1005, 1990.
-  I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
-  O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
-  P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo. Multi-level wavelet-cnn for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
-  M. R. Banham and A. K. Katsaggelos. Digital image restoration. IEEE Signal Processing Magazine, 14(2):24–41, 1997.
-  Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2015.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
-  S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
-  U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
-  J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
-  F. Agostinelli, M. R. Anderson, and H. Lee. Robust image denoising with multi-column deep neural networks. In Advances in Neural Information Processing Systems, pages 1493–1501, 2013.
-  V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009.
-  J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In International Conference on Neural Information Processing Systems, pages 341–349, 2012.
-  H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
-  C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE International Conference on Computer Vision, pages 576–584, 2015.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1132–1140, 2017.
-  Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, 2018.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.
-  V. Santhanam, V. I. Morariu, and L. S. Davis. Generalized deep image to image regression. IEEE Conference on Computer Vision and Pattern Recognition, pages 5609–5619, 2017.
-  K. Zhang, W. Zuo, and L. Zhang. FFDNet: Toward a fast and flexible solution for CNN based image denoising. IEEE Transactions on Image Processing, 2018.
-  S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang. Toward convolutional blind denoising of real photographs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  G. Riegler, S. Schulter, M. Ruther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE International Conference on Computer Vision, 2015.
-  K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  W. Bae, J. Yoo, and J. C. Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1141–1149, 2017.
-  T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga. Deep wavelet prediction for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
-  Y. Han and J. C. Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Transactions on Medical Imaging, pages 1418–1429, 2018.
-  J. C. Ye and Y. S. Han. Deep convolutional framelets: A general deep learning for inverse problems. Society for Industrial and Applied Mathematics, 2018.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6307–6315, 2017.
-  S. Zhai, Y. Cheng, Z. M. Zhang, and W. Lu. Doubly convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1082–1090, 2016.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
-  A. Takeki, D. Ikami, G. Irie, and K. Aizawa. Parallel grid pooling for data augmentation. In European Conference on Computer Vision, 2018.
-  Q. Wang, Z. Gao, J. Xie, W. Zuo, and P. Li. Global gated mixture of second-order pooling for improving deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1277–1286, 2018.
-  S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
-  A. N. Akansu and R. A. Haddad. Multiresolution signal decomposition: transforms, subbands, and wavelets. Academic Press, 2001.
-  A. S. Lewis and G. Knowles. Image compression using the 2-D wavelet transform. IEEE Transactions on Image Processing, 1(2):244–250, 1992.
-  S. G. Chang, B. Yu, and M. Vetterli. Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9(9):1532–1546, 2000.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2015.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
-  E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1122–1131, 2017.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, volume 2, pages 416–423, 2001.
-  J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
-  R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
-  A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):193–201, 2009.
-  A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In the 23rd ACM international conference on Multimedia, pages 689–692, 2015.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report, Citeseer, 2009.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, volume 2011, page 5, 2011.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop, 2015.