I. Introduction
Compressive sensing (CS) is an emerging sampling technique that overcomes the Nyquist-Shannon sampling theorem under the assumption of signal sparsity [1, 5, 2, 3, 4]. A CS measurement y ∈ ℝ^M is captured from a sparse or compressible signal x ∈ ℝ^N via

y = Φx + n,    (1)

where Φ ∈ ℝ^{M×N} denotes the sampling matrix, n represents the additive noise, and the ratio r = M/N is named the subrate. With a random matrix Φ, CS enables simple sampling, a compressing encoder, and computational security [6, 7]. CS became an active research topic with numerous applications in medical imaging, image restoration, hyperspectral imaging, wireless communication, etc. However, CS faces three major challenges.

Sampling complexity. With a Gaussian matrix [1], signals are guaranteed to be recovered with high probability, but at an extremely high burden on computation and storage. Tremendous effort has sought to alleviate the computational complexity with block-based [2], separable [3], and structured sampling [5, 4].

Reconstruction quality. Over the past decade, various reconstruction models have exploited image priors such as non-local similarity [51], low-rank approximation [22], etc. Signal priors are further utilized at the sampling side, but are limited to generic priors (i.e., a low-frequency prior) because the to-be-sampled signal is unavailable. Researchers developed multiscale CS [9, 10, 11], which captures more low-frequency components. However, sampling and reconstruction are often designed separately, thus limiting their performance.

Reconstruction complexity. With linear projection sampling, CS shifts most of the complexity to the decoder, thereby demanding tremendous complexity for reconstruction [52]. Also, conventional CS often relies on image priors and iterative optimization, making the reconstruction even more time-consuming. Various efforts have sought to develop fast methods [53] but failed to maintain high reconstruction quality.
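To make the storage side of the sampling-complexity argument concrete, the following is a minimal NumPy sketch (with assumed sizes, not the paper's implementation): at the same subrate, a dense Gaussian matrix over a whole 64×64 image stores far more entries than a single 32×32 block matrix reused across the image.

```python
import numpy as np

def frame_based_sample(x, subrate, rng):
    """Frame-based CS: one dense Gaussian matrix over the whole image."""
    n = x.size
    m = int(subrate * n)
    phi = rng.standard_normal((m, n)) / np.sqrt(m)   # M x N Gaussian matrix
    return phi @ x.ravel(), phi.size                 # measurements, stored entries

def block_based_sample(x, subrate, b, rng):
    """Block-based CS: a small matrix reused on every b x b block."""
    m = int(subrate * b * b)
    phi = rng.standard_normal((m, b * b)) / np.sqrt(m)
    h, w = x.shape
    blocks = (x.reshape(h // b, b, w // b, b)
               .transpose(0, 2, 1, 3).reshape(-1, b * b))
    return blocks @ phi.T, phi.size

rng = np.random.default_rng(0)
img = rng.random((64, 64))
_, full_cost = frame_based_sample(img, 0.1, rng)
_, block_cost = block_based_sample(img, 0.1, 32, rng)
print(full_cost // block_cost)  # 16: block-based stores 16x fewer entries here
```

The ratio grows quadratically with image size, which is why frame-based sensing quickly becomes impractical for full images.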
With massive data and huge computation for training, deep learning (DL) has shown state-of-the-art performance in many image restoration tasks such as image super-resolution [13, 14], denoising [20, 23, 14, 21], etc. By modeling the sampling as a deep learning layer, deep learning-based compressive imaging (DCI) jointly learns to sense and reconstruct signals in an end-to-end framework [15], improving the reconstruction quality while reducing the reconstruction complexity. However, DCI also faces the sampling complexity issue when dealing with high-dimensional signals such as images and video. Therefore, research in DCI has been limited to single-scale, block-based CS [15, 16, 17, 18, 33, 49]. Recently, researchers have started integrating image priors such as multiscale structure (wavelet decomposition [14, 26]) and non-local structure [24, 25] into DL-based image restoration to further improve the reconstruction quality. However, most research focuses on reconstruction and frequently overlooks the importance of sampling; as a result, little attention has been paid to the sampling stage, especially multiscale sampling.

This work conducts the first analysis of the learned sampling matrices of single-scale DCI, which mimic conventional multiscale CS and produce similar aliasing artifacts. Motivated by multiscale CS [11, 30], we design a novel multiscale architecture to efficiently sample and reconstruct images at multiple scales.
In summary, we make the following contributions:

We propose an end-to-end multiscale DCI with Pyramid, Wavelet, and Scale-space decomposition. We jointly learn to decompose, sample, and reconstruct images at multiple scales.

We present a three-phase training framework with an initialization network and two enhanced reconstruction networks.

We analyze the learned decomposition and learned sampling matrices, and demonstrate the high reconstruction quality of multiscale DCI over state-of-the-art sensing schemes.
We also investigate the importance of nonlinearity at the sampling stage. To reduce the sampling complexity, we use a combination of frame-based multiscale decomposition, block-based sampling, and frame-based reconstruction. This work focuses on deterministic sampling: our multiscale sampling matrices are fixed after the training process.
II. Related Work
II-A. Deep Learning for Image Restoration
Deep learning (DL) for image restoration has been an active research area in recent years. Researchers started with a simple multilayer perceptron [19], then moved to more advanced architectures: CNNs [20], residual learning [21], U-Net-like networks [14], and generative models (GANs) [54, 55]. Conventional restoration utilizes internal information (i.e., low-frequency, perceptual, and non-local priors) more often than external information. The dictionary learning approach does benefit from external information, but only on a small scale. Meanwhile, DL takes advantage of large-scale datasets to learn the mapping from degraded to clean images, also known as a deep prior [43, 23, 44]. Despite outperforming state-of-the-art conventional algorithms (sparse coding, low-rank approximation [51, 22]), DL methods [14, 44, 20] (excluding GANs) favor external information and thus lose quality on images with strong local structure. GAN-based methods [55], on the other hand, create artificial-looking structures in the reconstructed images that do not exist in the originals. Thus, researchers have integrated well-known priors into the DL framework by exploiting iterative solutions [44], non-local structure [24, 25], the low-frequency prior [14, 26], etc.

II-B. Compressive Sensing Meets Deep Learning
II-B1. Block-based vs. Structured Frame-based
To realize deep learning-based compressive imaging (DCI), the linear CS acquisition is modeled as a layer in a deep network. For instance, frame-based sensing is equivalent to a fully connected layer (FCL) without activation and bias terms [15], and block-based sampling to a convolution with a large stride [18], as in Fig. 1. Since FCL-based DCI faces the same sampling complexity issue as frame-based sensing, it was trained with small datasets or small image blocks [15, 17], thereby introducing blocking artifacts. Block-based DCI with frame-based recovery was preferred [18, 33] to reduce complexity without the blocking artifact. Also, DCI can learn to reconstruct images from measurements of fixed [15, 16, 25] or learned sampling matrices [17, 18, 30]. To reduce frame-based sensing complexity, Iliadis et al. [27] learned a binary sampling matrix, Nguyen et al. [28] developed a sparse ternary matrix network with only the values -1, 0, +1, Cui et al. [29] sparsified the learned sampling matrix, and our previous work [30] used separable DCI.

II-B2. Single-Scale vs. Multiple-Scale
Acquiring images in the original domain or in a multilevel decomposition domain (e.g., the wavelet transform domain) is named single-scale or multiscale sampling, respectively. While the advantage of multiscale over single-scale sampling has been well studied in the literature [9, 10, 11, 41], most DCI research focuses on single-scale sampling. Despite being trained for single-scale sampling, the learned single-scale matrix [18] mimics conventional multiscale sampling: it captures more low-frequency components and results in aliasing artifacts in the reconstructed images. Therefore, it would be easier for the network to learn multiscale features with a multiscale architecture.
Multiscale DCI was introduced in [33, 49, 32] and in our initial work [41]. LAPRAN [32] defined a set of measurements corresponding to a given resolution. The low-resolution image is recovered first and utilized to recover the higher-resolution images. As a result, its low-resolution measurements capture more low-frequency components, thus following the multiscale prior. This method demands a heuristic measurement allocation for each resolution. MSCSNet [33] trained a set of measurements at a given subrate and reused them at larger subrates; therefore, its low-subrate measurements correspond to low-frequency components. SCSNet [49] is a scalable framework that supports multiple levels of reconstruction at multiple levels of quality. Similar to LAPRAN, it favors low-frequency content, especially at low-frequency layers. However, SCSNet aimed to address the varying-subrate problem (i.e., one network for multiple subrates) rather than designing a multiscale DCI. Therefore, a rigorous study of multiscale DCI is still lacking.

Method  Deep Compressive Imaging  Deep AutoEncoder

Type  End-to-End  End-to-End
Encoder  Less complexity  Same complexity as decoder
  Large kernel size  Small kernel size
  Large stride  Small stride
  Without padding  With padding
  Without bias  With bias
Decoder  More complexity  Same complexity as encoder
  Small kernel size  Small kernel size
  Small stride  Small stride
  With padding  With padding
II-B3. Deep Compressive Imaging vs. Deep AutoEncoder
A Deep AutoEncoder (DAE) [34] is an unsupervised artificial neural network that learns a compact latent representation of signals, similar to CS. We can interpret DCI as an asymmetric DAE whose encoder is much simpler than its decoder (with the exception of [50], which designed a DCI with equivalent encoder and decoder complexity). Also, DCI removes the nonlinear activation and bias to keep the encoder simple and compatible with conventional CS. DCI uses a considerably large kernel size, while DAE uses small kernel sizes. Additionally, convolution in DCI is non-overlapping, with the stride equal to the sampling block size. The summary is given in Table I.

Note that it is possible to implement nonlinear activations and bias terms in the encoder (i.e., via post-processing), but only for one sampling layer to remain practical. Thus, we study the impact of nonlinearity and bias in Section V-B.
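The equivalence between block-based sampling and a non-overlapping strided convolution described above can be checked numerically. The following is an illustrative NumPy sketch (not the paper's code): each output channel of a b×b-kernel, stride-b convolution without bias or activation equals one row of the flattened block matrix Φ applied to each block.

```python
import numpy as np

def conv_sample(x, kernels, b):
    """Non-overlapping convolution: kernel b x b, stride b, no bias or
    activation. kernels has shape (m, b, b): one output channel per
    CS measurement."""
    m = len(kernels)
    h, w = x.shape
    out = np.empty((m, h // b, w // b))
    flat = kernels.reshape(m, -1)              # each kernel = one row of Phi
    for i in range(0, h, b):
        for j in range(0, w, b):
            out[:, i // b, j // b] = flat @ x[i:i+b, j:j+b].ravel()
    return out

rng = np.random.default_rng(1)
x = rng.random((8, 8))
k = rng.random((3, 4, 4))                      # m = 3 measurements per 4x4 block
y = conv_sample(x, k, 4)
# identical to applying the flattened matrix Phi to the top-left block
assert np.allclose(y[:, 0, 0], k.reshape(3, -1) @ x[:4, :4].ravel())
```

This is why a learned sampling layer can be dropped into any convolutional framework while remaining a valid linear CS operator.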
III. Multi-Scale Deep Compressive Imaging
This section proposes multiscale deep compressive imaging (MSDCI), which consists of three main parts: (i) multiscale sampling, (ii) initial reconstruction, and (iii) a multiscale enhanced reconstruction network, as shown in Fig. 2.
III-A. Multi-Scale Sampling Network
Conventional multiscale sampling first linearly decomposes images into multiple scales (e.g., by wavelet decomposition), then adaptively acquires measurements at each scale [10, 12]. As the signal at each scale has a different level of sparsity, the number of measurements per scale is heuristically allocated, thereby limiting the final performance. We observe that single-scale sensing tends to prefer low-frequency content similar to multiscale sampling, especially at low subrates. As shown in Fig. 8, the sorted energy of the learned matrix of single-scale CSNet approximates the sorted energy of the multiscale wavelet at the low-low subband. Without additional information, CSNet struggles to learn multiscale features and produces kernels of similar, large variance. Therefore, providing multiscale guidance is likely to help the network better capture multiscale features.
Therefore, we design a multiscale sampling architecture with multiscale decomposition and multiscale sampling, both modeled as convolution layers in a deep network to enable end-to-end learning. Firstly, a convolution layer without activation and bias is used to mimic the linear decomposition as

x_l = f_l ⊛ x,  l = 1, …, L,    (2)

where ⊛ denotes convolution, f_l is the decomposition filter, x_l is the decomposed signal at level l, and L is the total number of levels. It is straightforward to capture the signal at each level independently by

y_l = Φ_l x_l.    (3)

From this equation, we can acquire the multiscale measurements with the multiscale sampling matrix

y = [y_1; …; y_L] = diag(Φ_1, …, Φ_L) [x_1; …; x_L].    (4)

However, this approach requires the number of measurements at each level to be allocated heuristically or adaptively. To overcome this drawback, we sample measurements across multiple decomposed scales as

y = Φ [x_1; …; x_L].    (5)
In both approaches, the multiscale matrix is equivalent to a single sampling matrix, thereby remaining compatible with the sequential sampling scheme. By jointly learning to decompose and sense at multiple scales, we can capture measurements more efficiently. Additionally, after the training process, our multiscale sampling matrices are fixed for all test images.
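Eqs. (2) and (5) can be sketched as follows. This is a simplified NumPy illustration with assumed 3×3 decomposition filters and a random Φ (not the trained network): one convolution layer produces L scale images, and a single matrix samples their concatenation jointly, so no per-scale measurement allocation is needed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def decompose(x, filters):
    """Eq. (2) sketch: linear decomposition as a 'same'-padded 3x3
    convolution, one filter per scale, no bias or activation."""
    win = sliding_window_view(np.pad(x, 1), (3, 3))     # (H, W, 3, 3)
    return np.stack([(win * f).sum(axis=(-1, -2)) for f in filters])

def sample_across_scales(scales, phi):
    """Eq. (5) sketch: a single matrix applied jointly to all decomposed
    scales, avoiding heuristic per-scale measurement allocation."""
    return phi @ np.concatenate([s.ravel() for s in scales])

rng = np.random.default_rng(2)
img = rng.random((16, 16))
filters = rng.random((4, 3, 3))                         # L = 4 learned filters
phi = rng.random((25, 4 * 16 * 16))                     # ~0.1 subrate on 16x16
y = sample_across_scales(decompose(img, filters), phi)
print(y.shape)  # (25,)
```

In the actual network both steps are convolution layers, so the whole pipeline stays linear and end-to-end trainable.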
III-B. Multi-Phase Reconstruction Network
III-B1. Initial Reconstruction
Similar to [18], this subnetwork mimics the matrix inversion to recover images from multiscale measurements. We can either (i) independently recover each scale and then perform the inverse decomposition to deliver the reconstructed image, or (ii) directly recover the image in the original domain. In our experiments, we select the latter approach for better reconstruction quality. Similar to [18], a reshape-and-concatenate layer is used after each convolution to form the 2D recovered image. Our initial recovery is very simple, with one convolution and without bias or activation.
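A minimal sketch of this initial recovery, under our reading that the layer acts as a learned per-block linear inverse followed by reshape-and-concatenate (here the learned weights are stood in by a pseudo-inverse):

```python
import numpy as np

def initial_recover(y_blocks, w, b, grid):
    """y_blocks: (num_blocks, m) measurements; w: (b*b, m) learned
    'inverse' matrix. Maps measurements back to b*b pixels per block,
    then reshapes and concatenates blocks into a 2D image."""
    pix = y_blocks @ w.T                                # (num_blocks, b*b)
    gh, gw = grid
    return (pix.reshape(gh, gw, b, b)
               .transpose(0, 2, 1, 3).reshape(gh * b, gw * b))

rng = np.random.default_rng(3)
b, m = 8, 6
phi = rng.random((m, b * b))
blocks = rng.random((4, b * b))                         # a 2x2 grid of 8x8 blocks
y = blocks @ phi.T
w = np.linalg.pinv(phi)                                 # stand-in for learned weights
img0 = initial_recover(y, w, b, (2, 2))
print(img0.shape)  # (16, 16)
```

In the network, w is learned end-to-end rather than computed analytically, which is what allows the later phases to improve on a plain pseudo-inverse.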
III-B2. Enhanced Reconstruction
The initial recovered image is subsequently enhanced by (i) simple convolutions and (ii) multiscale wavelet convolutions [14]. The first enhancement network is a set of five convolutions (stride 1, zero padding 1) followed by ReLU activations. We set the number of feature maps to 64 (except for the first and last layers), which is identical to CSNet's architecture [18]. The second enhancement network uses the multi-level wavelet convolution network (MWCNN) [14], which applies convolutions on top of decomposed wavelet features to effectively capture features at multiple scales. By utilizing MWCNN, our MSDCI network can take advantage of multiscale structure in decomposition, sampling, and reconstruction.

III-C. Training
III-C1. Loss Function
Motivated by many image restoration methods [14, 20, 18, 33, 49], we select the Euclidean loss (L2 norm) as the objective function. We compute the average L2-norm error over all training samples as

L(θ) = (1/2T) Σ_{t=1}^{T} ||F(x_t; θ) − x_t||_2^2,    (6)

where T is the total number of training samples, x_t denotes an image sample, and F(·; θ) is the network function at a setting θ.
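Eq. (6) reduces to the following small worked computation (a sketch; the symbol names mirror the equation, not the paper's code):

```python
import numpy as np

def l2_loss(recon, target):
    """Eq. (6): average squared L2 error over T training samples,
    with the conventional 1/(2T) scaling."""
    t = len(recon)
    return sum(np.sum((r - x) ** 2) for r, x in zip(recon, target)) / (2 * t)

targets = [np.zeros((4, 4)), np.ones((4, 4))]
recons = [np.zeros((4, 4)), np.zeros((4, 4))]   # second sample is off by 1 everywhere
print(l2_loss(recons, targets))  # 4.0 = 16 squared errors / (2 * 2 samples)
```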
III-C2. Multi-Phase Training
As mentioned in [16], the better the initial image, the higher the reconstruction quality that can be achieved. They first learned the initial reconstruction and then enhanced its quality through a residual subnetwork. Motivated by this approach, we propose a three-phase training process: (Phase 1) initial reconstruction, (Phase 2) enhanced reconstruction by convolution, and (Phase 3) enhanced reconstruction by MWCNN, as visualized in Fig. 2. We train each phase sequentially from Phase 1 to Phase 3 following the end-to-end setting. The network trained at the i-th phase is used as initialization for the (i+1)-th phase.
In each phase, learning rates of 0.001, 0.0005, and 0.0001 are used, each for 50 epochs. In Phase 3, we use the MWCNN pretrained for Gaussian denoising as initialization. The adaptive moment estimation (Adam) method is used for optimization. In general, only convolution, ReLU, and batch normalization layers are used for reconstruction in our proposed deep compressive sensing network.
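The learning-rate schedule above can be sketched as a simple step function (our assumption: each rate is held for 50 epochs within a phase):

```python
def learning_rate(epoch):
    """Per-phase step schedule from the text: 0.001, 0.0005, then 0.0001,
    each held for 50 epochs (so 150 epochs per phase)."""
    rates = [0.001, 0.0005, 0.0001]
    return rates[min(epoch // 50, len(rates) - 1)]

print(learning_rate(0), learning_rate(60), learning_rate(120))
# 0.001 0.0005 0.0001
```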
Dilation  Kernel size  Padding  Stride  Block size
1  32  
2  32  
3  32  
4  32  
Image  Rate  MH  GSR  DBCS  CSNet  MSCSNet  SCSNet  PDCI  WDCI  SSDCI  
Phase  1  2  3  1  2  3  1  2  3
Set5  0.1  28.57  29.98  31.31  32.30  32.82  32.77  30.84  32.59  33.27  30.66  32.44  33.39  30.50  32.69  33.63 
0.821  0.865  0.894  0.902  0.909  0.983  0.862  0.906  0.914  0.855  0.904  0.917  0.863  0.941  0.942  
0.2  32.09  34.17  34.55  35.63  36.23  36.15  34.04  35.94  36.28  34.06  35.75  36.56  33.81  35.94  36.68  
0.888  0.926  0.940  0.945  0.949  0.941  0.924  0.948  0.949  0.925  0.941  0.951  0.924  0.949  0.953  
0.3  34.07  36.83  36.54  37.90  38.43  38.45  36.11  37.95  38.42  36.51  38.20  38.74  36.22  38.42  38.92  
0.916  0.949  0.958  0.963  0.966  0.963  0.948  0.963  0.964  0.952  0.965  0.967  0.952  0.966  0.968  
Set14  0.1  26.38  27.50  28.54  28.91  29.29  29.22  27.87  29.15  29.56  27.81  29.10  29.67  27.69  29.22  29.69 
0.728  0.771  0.832  0.812  0.820  0.818  0.783  0.816  0.827  0.778  0.815  0.828  0.784  0.823  0.837  
0.2  29.47  31.22  31.21  31.86  32.26  32.19  30.63  32.13  32.43  30.69  32.05  32.51  30.43  32.23  32.72  
0.824  0.864  0.890  0.891  0.896  0.824  0.873  0.893  0.896  0.874  0.893  0.900  0.872  0.898  0.904  
0.3  31.37  33.74  33.08  33.99  34.34  34.51  32.52  34.05  34.32  32.86  34.30  34.71  32.63  34.50  34.90  
0.869  0.907  0.921  0.928  0.919  0.928  0.912  0.927  0.929  0.917  0.930  0.934  0.917  0.933  0.937  
The best results in each training phase of MSDCI and across all methods are in bold and red, respectively. Multiscale methods are in blue.
IV. Proposed Decomposition Methods for MSDCI
IV-A. Linear Image Decomposition
In general, images can be decomposed into multiple layers, each carrying a different amount of signal information for image analysis. We aim to study linear image decompositions, as they can be easily integrated into the linear projection framework of CS [12]. Without loss of generality, we linearly decompose an image x into L layers by the Pyramid decomposition formula

x_l = D G_l x,    (7)

where D denotes a downsampling matrix and G_l represents the smoothing matrix at the l-th layer. Various linear decomposition models can be derived from eq. (7), such as: (i) Scale-space decomposition, by setting D to an identity matrix; (ii) multiresolution decomposition, by setting G_l to an identity matrix; (iii) wavelet decomposition, by using high- and low-pass filters.

IV-B. Multi-Scale DCI with Decomposition
This section realizes various decomposition methods in a deep multiscale sampling network, each with a corresponding multiscale sampling subnetwork. Detailed parameters of all frameworks are presented in Fig. 3.
IV-B1. Wavelet-based DCI (WDCI)
We implement the discrete Haar wavelet transform (DWT) as a layer in a deep network. Given an input image, the DWT outputs four frequency bands at half the spatial size. Then, we perform multiscale block sampling with convolutions across the decomposed channels, without bias and activation [14]. The block size and the number of measurements per block are set by the given sampling rate.
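A one-level 2D Haar DWT can be sketched as follows (a minimal NumPy version; in the network it is realized as a fixed stride-2 convolution with four output channels):

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT: an H x W image -> four H/2 x W/2 subbands
    (LL, LH, HL, HH). Each subband is a fixed 2x2 stride-2 filter."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]    # the four pixels of each 2x2 cell
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2               # low-low: local average
    lh = (a + b - c - d) / 2               # low-high: vertical detail
    hl = (a - b + c - d) / 2               # high-low: horizontal detail
    hh = (a - b - c + d) / 2               # high-high: diagonal detail
    return ll, lh, hl, hh

x = np.arange(16.0).reshape(4, 4)          # a linear ramp image
ll, lh, hl, hh = haar_dwt2(x)
print(ll.shape)  # (2, 2)
```

On a linear ramp the HH band is exactly zero, illustrating how the DWT concentrates smooth-image energy in the LL band, which is what the multiscale sampler exploits.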
IV-B2. Scale-space DCI (SSDCI)
In scale-space analysis [35], a signal is decomposed into multiple layers using a set of Gaussian smoothing filters. To integrate the scale space into a deep network, we model the smoothing process as convolution so that the decomposition can also be trained (unlike the fixed decomposition kernel in WDCI). To be comparable with WDCI, we use four convolutions with different kernel sizes to output four decomposed features. Multiscale sampling is then performed with convolutions without bias and activation. Note that scale-space decomposition shares similarity with GoogLeNet [36], which exploits multiple kernel sizes.
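A sketch of this scale-space decomposition (illustrative NumPy only: box filters stand in for the learned smoothing kernels, and the kernel sizes 1, 3, 5, 7 are our assumption, since the exact sizes are not listed above):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def smooth(x, k):
    """'Same'-padded k x k box smoothing (a stand-in for a learned
    smoothing convolution)."""
    p = k // 2
    win = sliding_window_view(np.pad(x, p, mode='edge'), (k, k))
    return win.mean(axis=(-1, -2))

def scale_space(x, ksizes=(1, 3, 5, 7)):
    """Four parallel smoothings of the full-resolution image, one per
    scale; larger kernels keep progressively less high-frequency detail."""
    return np.stack([smooth(x, k) for k in ksizes])

rng = np.random.default_rng(4)
img = rng.random((16, 16))
scales = scale_space(img)
print(scales.shape)  # (4, 16, 16)
```

Unlike the DWT, every scale stays at full resolution here, which matches the observation later that SSDCI retains high-frequency content more easily.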
IV-B3. Pyramid-based DCI (PDCI)
The Pyramid decomposition [37] is equivalent to scale-space decomposition with additional downsampling operators. To avoid using two separate operations for downscaling and sampling, we use dilated convolutions [38] instead. A dilated convolution is a subsampled convolution with a fixed sampling grid: it is equivalent to first downscaling the image (by nearest-neighbor downsampling) and then applying a conventional convolution. Unlike WDCI and SSDCI, PDCI captures the image at each scale independently due to the difference in resolution (or dilation factor). The detailed settings for PDCI are given in Table II.
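The stated equivalence between a dilated convolution and downsample-then-convolve can be checked numerically. This is a minimal NumPy sketch with a 2×2 kernel at one aligned position (not the network's layer):

```python
import numpy as np

def dilated_response(x, k, d, i, j):
    """Response of a 2x2 kernel with dilation d at position (i, j):
    the kernel taps are spaced d pixels apart on the full-resolution grid."""
    idx = np.array([0, d])
    return float((x[np.ix_(i + idx, j + idx)] * k).sum())

rng = np.random.default_rng(5)
x = rng.random((8, 8))
k = rng.random((2, 2))
d = 2
x_small = x[::d, ::d]                        # keep every d-th pixel
dense = float((x_small[:2, :2] * k).sum())   # ordinary convolution on small image
assert np.isclose(dilated_response(x, k, d, 0, 0), dense)
```

This is why a single dilated sampling layer can replace the explicit downscale-then-sample pair in the pyramid.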
V. Experimental Results
For the experiments, we used the DIV2K dataset [13] to generate training data from grayscale image patches. Our MSDCIs are implemented in MatConvNet [39]. Table IV summarizes the related work used for comparison. For a fair comparison, we selected the same block size for single-scale and multiscale sensing, except for PDCI, as in Table V. We denote the MSDCI variants at different training phases as WDCI, SSDCI, and PDCI. Set5, Set14 [13], and additional standard test images are evaluated.
Method  Single-scale  Multiscale

Conventional  MH [40]  MRKCS [11]
  GSR [22]  
  RSRM [7]  
Deep Learning  ReconNet [15]  MSCSNet [33]
  DRNet [16]  SCSNet [49]
  DBCS [17]  PDCI
  CSNet [18]  WDCI
  KCSNet [30]  SSDCI

Proposed methods are highlighted in bold.
Method  Decomp. Image Size  Decomp. Filter  Sampling Matrix  
Size  No.  kernel size  No. meas. /block  
0.1  0.2  0.3  
CSNet      102  204  307  
WDCI  4  102  204  307  
SSDCI  1  26  51  102  
1  
1  
1  
PDCI  1  26  51  102  
1  26  51  102  
1  26  51  102  
1  24  52  101 
V-A. Multi-Phase Training Performance
The efficiency of multi-phase training is examined via the reconstruction quality in Table III. In Phase 1, the best reconstruction quality was achieved by PDCI at the low subrate (0.1) and by WDCI at the high subrates (0.2 and 0.3). The main reason is better capture of the low-frequency components: the higher-dilation convolution handles the low-frequency band in PDCI, and the DWT cleanly separates low and high frequencies in WDCI. At Phase 2, with better reconstruction, the learned decompositions in SSDCI and PDCI offered higher PSNR than the fixed decomposition kernel of WDCI. Because of multiscale sampling, all multiscale methods showed a PSNR gain and reconstruction quality similar to the single-scale (CSNet) and multiscale DCIs (MSCSNet and SCSNet).
At Phase 3, we utilized multiscale reconstruction with MWCNN, which adds significant parameters and complexity, thereby greatly improving the final reconstruction performance. In order of increasing reconstruction quality, the methods rank as PDCI, WDCI, then SSDCI. Thanks to its learned decomposition, SSDCI showed a PSNR improvement over WDCI. On the other hand, our PDCI presented the poorest reconstruction quality due to favoring low-frequency components and sampling each decomposed scale independently.
What stands out from Table III is the increase in reconstruction quality with each training phase: compared to Phase 1, Phase 2 and Phase 3 improved the average PSNR on both Set5 and Set14. From Table VIII, Phase 2 and Phase 3 add 0.21 and 16.17 million more parameters to the reconstruction, respectively. In general, more parameters yield higher learning capability and thus better reconstruction performance.
V-B. Evaluation of Decomposition Methods
V-B1. Reconstruction Performance
The number of parameters is not linearly proportional to the learning capability. While multiscale sampling accounts for only a small fraction of the parameters in Phase 3, it still plays a significant role. With slightly fewer parameters, SSDCI still outperformed both WDCI and PDCI; the differences are similar to those in Phase 2. This is because SSDCI learns a better decomposition than the fixed wavelet decomposition of the DWT. Also, the decomposed images of SSDCI are at a higher resolution than in WDCI and PDCI, making it easier to capture high-frequency content.
V-B2. Learned Decomposition
We show four decomposed images for our MSDCIs in Fig. 4. It is clearly shown that the PDCI and SSDCI networks can learn to decompose the image with gradually reduced feature ranges between scales. With the Haar DWT, WDCI divides images into four distinct subbands with a significant difference in energy. Meanwhile, SSDCI departs from the conventional scale space (i.e., each scale being a Gaussian-smoothed image) and somewhat mimics the decomposition of WDCI: the first level looks like the low-low subband (i.e., the image content is visible), while the other scales look similar to the high-frequency parts. The intensity also gradually decreases with scale. In PDCI, we observe a similar pattern to SSDCI, but less distinctive between scales. Also, the differences between scales in SSDCI and PDCI are not as substantial as in WDCI.
Image  rate  GSR  MRKCS  RSRM  ReconNet  DRNet  CSNet  KCSNet  SCSNet  PDCI  WDCI  SSDCI 
Lena  0.1  30.97  32.87  33.31  26.89  28.65  32.15  33.09  32.36  32.63  32.90  33.05 
0.866  0.821  0.887  0.749  0.800  0.879  0.887  0.882  0.887  0.891  0.895  
0.2  34.44  35.85  36.41      35.21  35.08  35.41  35.75  35.86  36.14  
0.914  0.919  0.925  0.924  0.913  0.910  0.926  0.929  0.931  
0.3  36.47  37.72  38.00      37.33    37.51  37.44  37.85  38.11  
0.936  0.940  0.941  0.945  0.946  0.944  0.948  0.950  
Peppers  0.1  31.45  33.40  32.42  26.21  28.32  32.06  31.00  32.57  33.17  33.48  33.52 
0.843  0.853  0.845  0.716  0.769  0.858  0.861  0.863  0.869  0.87  0.876  
0.2  34.12  35.52  34.89      34.42  32.80  35.05  35.32  35.47  35.58  
0.88  0.891  0.881  0.891  0.881  0.896  0.897  0.899  0.901  
0.3  35.65  36.59  36.31      35.84    36.49  36.42  36.68  36.74  
0.905  0.909  0.905  0.910  0.915  0.913  0.915  0.918  
Mandrill  0.1  19.93  21.92  20.12  19.70  20.18  22.26  22.18  22.29  22.48  22.5  22.62 
0.508  0.549  0.491  0.411  0.455  0.592  0.581  0.597  0.611  0.610  0.624  
0.2  22.22  23.61  22.52      24.08  23.51  24.19  24.40  24.44  24.57  
0.682  0.688  0.659  0.749  0.697  0.754  0.762  0.767  0.775  
0.3  23.92  25.13  24.40      25.72    25.88  25.89  26.05  26.31  
0.775  0.780  0.751  0.833  0.838  0.838  0.842  0.854  
Boats  0.1  27.55  28.78  27.82  24.35  24.35  29.08  28.99  29.41  29.72  29.66  29.87 
0.773  0.786  0.762  0.636  0.636  0.812  0.802  0.822  0.833  0.833  0.841  
0.2  31.34  31.88  32.00      32.05  30.93  32.47  32.76  32.82  33.10  
0.862  0.865  0.865  0.884  0.852  0.891  0.894  0.896  0.899  
0.3  33.72  33.73  34.09      33.98    34.41  34.29  34.73  35.05  
0.904  0.899  0.901  0.911  0.916  0.914  0.919  0.922  
Cameraman  0.1  32.12  34.51  34.44  26.03  28.46  31.15  32.98  31.36  33.46  33.10  33.56 
0.913  0.928  0.927  0.798  0.848  0.918  0.930  0.925  0.940  0.942  0.945  
0.2  37.15  38.92  36.57      34.59  36.37  36.82  38.95  39.46  39.68  
0.958  0.967  0.969  0.961  0.963  0.975  0.977  0.980  0.982  
0.3  40.58  42.37  42.54      37.47    40.92  42.84  44.10  44.27  
0.977  0.984  0.981  0.976  0.990  0.989  0.992  0.993  
Man  0.1  27.74  29.58  28.21  25.30  26.51  29.84  29.86  30.03  30.17  30.21  30.44 
0.781  0.812  0.774  0.660  0.714  0.833  0.829  0.840  0.845  0.847  0.855  
0.2  30.63  32.31  32.39      32.55  31.81  32.77  32.87  32.87  33.05  
0.867  0.885  0.885  0.907  0.883  0.915  0.912  0.914  0.917  
0.3  32.83  34.33  34.94      34.52    34.76  34.55  34.94  35.06  
0.921  0.924  0.925  0.939  0.942  0.939  0.944  0.946  
Average  0.1  28.29  30.18  29.39  24.75  26.29  29.42  29.68  29.97  30.27  30.31  30.51 
0.781  0.792  0.792  0.662  0.712  0.815  0.815  0.817  0.831  0.832  0.839  
0.2  31.65  33.02  32.46      32.15  31.75  32.79  33.34  33.49  33.69  
0.861  0.869  0.864  0.886  0.865  0.890  0.895  0.898  0.901  
0.3  33.86  34.98  35.05      34.14    35.00  35.24  35.73  35.92  
0.903  0.906  0.901  0.919  0.925  0.923  0.927  0.931  
Multiscale methods are in blue and the best performance is in bold.
V-B3. Learned Multi-Scale Sampling Matrices
We visualize the learned measurement matrices in Fig. 5. We select part of each learned matrix, reshape the kernels, and concatenate them into a larger image for better visibility. For CSNet, we select 100 of the 102 filters at subrate 0.1. In general, it is difficult to observe any pattern in the learned kernels of CSNet. Single-scale sampling like CSNet learns to capture both low- and high-frequency content, resulting in the similar variance of the learned kernels in Fig. 8.
For WDCI, we visualize a subset of the 102 sampling kernels at subrate 0.1 across the four wavelet scales. At each scale, the learned kernels are concatenated to form a larger image for better visualization. The horizontal, vertical, and diagonal patterns of the learned sampling matrix correspond to the low-high (LH), high-low (HL), and high-high (HH) wavelet bands, respectively. In Fig. 8, unlike single-scale CSNet, there is a significant difference in the variance of the learned kernels between the low-low (LL) band and the other bands in WDCI, thus verifying the effectiveness of multiscale sampling.
Instead of reducing the dimension of the decomposed images, SSDCI reduces the total number of filters. There are 26 filters at subrate 0.1; we show 25 of them, reshaped and concatenated, in Fig. 5. Similarly, we illustrate the learned sampling matrices of PDCI with 25 filters for each scale in Fig. 6. It is shown that the network can learn to sample efficiently at multiple scales, as indicated by the distinctive variance at different scales. The learned sampling matrices of both PDCI and SSDCI are smoother for the low-frequency measurements than for the high-frequency ones, but the difference is not as significant as for the learned sampling matrices of WDCI.
V-B4. Learned Multi-Scale Measurements
We visualize the learned measurements in Fig. 7. For CSNet at subrate 0.1, we take 100 measurements, reshape them, and concatenate them to form a larger measurement image; we do similarly for WDCI. The measurements of SSDCI and PDCI have different sizes at subrate 0.1, so we select only a subset of measurements to form a larger image for better visualization.
It is easy to observe that SSDCI captures edge features more efficiently, with stronger visible structures in its measurements. On the other hand, it also reveals more information about the sampled images: as visualized in Fig. 7, the head and hair regions of the Lena image are clearly visible in the SSDCI measurements compared to the other methods. From Fig. 7, the order of increasing revealed information and increasing sampling efficiency is CSNet, PDCI, WDCI, and SSDCI. This is the cost of improving performance via the learnable decomposition in SSDCI.
V-B5. Linear vs. Nonlinear in Multi-Scale Sampling
The impact of nonlinearity and the bias term at the sampling stage is evaluated in Table VII. The convolution layer is either (i) linear, (ii) linear with a bias term, or (iii) nonlinear with ReLU and without a bias term. While nonlinearity often helps in an autoencoder, it significantly reduces the reconstruction quality in all phases. As captured measurements range from negative to positive, ReLU sets all negative measurements to zero and thus reduces the amount of information at the sampling stage. In contrast, adding a bias increases the number of network parameters and thus slightly improves reconstruction quality. However, the bias term only shows improvement for the simple networks of Phase 1 and Phase 2; with a complex reconstruction network like Phase 3, adding a bias term at the sampling stage degrades the PSNR. Therefore, we conclude that linearity is important at the sampling stage to preserve the input signal information, and we suggest using the first few layers of a deep convolutional network without nonlinear activation. This observation agrees with the linear bottleneck model in MobileNetV2 [61].
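The information loss caused by ReLU at the sampling stage can be seen in a two-pixel toy example (illustrative, not from the paper): two distinct signals become indistinguishable once their negative measurements are clipped.

```python
import numpy as np

phi = np.array([[1.0, -1.0],               # a difference (high-frequency) filter
                [1.0,  1.0]])              # a sum (low-frequency) filter
x = np.array([0.2, 0.9])
y = phi @ x                                # [-0.7, 1.1]: the sign carries information
relu_y = np.maximum(y, 0)                  # [0.0, 1.1]: negative measurement is lost

# A different signal now collides with x after ReLU:
x2 = np.array([0.55, 0.55])
y2 = np.maximum(phi @ x2, 0)               # also [0.0, 1.1]
assert not np.allclose(phi @ x, phi @ x2)  # linear measurements distinguish them
assert np.allclose(relu_y, y2)             # ReLU'd measurements do not
```

With zero-mean sampling kernels, roughly half of all measurements are negative, so ReLU at the encoder discards a substantial fraction of the captured information; this matches the PSNR drop reported in Table VII.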
Rate  WDCI (Phase 1)  WDCI (Phase 2)  WDCI (Phase 3)
Entries give PSNR for linear / linear with bias / ReLU sampling.
Set5  0.1  30.66/ 30.82/ 24.57  32.44/ 32.57/ 30.71  33.39/ 33.34/ 32.47 
0.2  34.06/ 34.08/ 25.56  35.82/ 35.97/ 32.18  36.56/ 36.61/ 35.75  
Set14  0.1  27.81/ 29.90/ 23.27  29.10/ 29.16/ 27.81  29.67/ 29.56/ 28.91 
0.2  30.69/ 30.69/ 23.81  32.05/ 32.19/ 28.79  32.51/ 32.45/ 31.02 
rate  CSNet  SCSNet  WDCI  PDCI  SSDCI  
Phase  0  1  2  3  0  1  2  3  0  1  2  3
parameters (weights)  0.1  0.321  0.340  0.104  0.209  0.321  16.492  0.039  0.143  0.255  16.426  0.027  0.033  0.145  16.316 
0.2  0.532  0.551  0.210  0.420  0.532  16.703  0.076  0.286  0.398  16.568  0.052  0.065  0.177  16.348  
0.3  0.741  0.760  0.314  0.629  0.741  16.912  0.114  0.429  0.541  16.711  0.079  0.099  0.211  16.381  
Variables (intermediate features)
0.1  135.03  371.48  0.288  0.813  135.29  544.24  1.101  1.625  136.11  545.05  2.134  2.648  137.13  546.07 
0.2  135.06  372.03  0.315  0.839  135.32  544.26  1.154  1.678  136.16  545.10  2.149  2.674  137.15  546.10  
0.3  135.08  372.59  0.341  0.865  135.35  544.29  1.206  1.730  136.21  545.15  2.176  2.700  137.18  546.12  
Phase 0 denotes the complexity of the sampling stage alone (without reconstruction).
V-C. Comparison with State-of-the-Art Sampling Schemes
V-C.1 Reconstruction Quality
For Set5 and Set14 in Table III, all multiscale networks outperformed the single-scale sampling schemes. While the previous MS-DCI works (MS-CSNet and SCSNet) produced similar performance to each other, they showed an average improvement of 0.50-0.62 dB over the single-scale DCIs (CSNet and DBCS) and 1.98-2.79 dB over conventional CS (MH and GSR). Our proposed MS-DCIs showed steady improvement over each training phase. At Phase 3, SSDCI, WDCI, and PDCI gained more than 3.33 dB, 3.09 dB, and 2.97 dB over CSNet on Set5 at subrate 0.1. Moreover, our SSDCI outperformed MS-CSNet and SCSNet with 0.47-0.86 dB and 0.39-0.47 dB gains on Set5 and Set14, respectively. Both WDCI and PDCI improved 0.45-0.62 dB and 0.33-0.50 dB over SCSNet at subrate 0.1. WDCI still offered higher performance than MS-CSNet and SCSNet at high subrates; PDCI yielded better PSNR at subrate 0.2 and similar or lower PSNR at subrate 0.3. In general, the smaller the subrate (i.e., the more ill-posed the problem), the larger the reconstruction improvement achieved by our MS-DCIs. Overall, in order of decreasing performance: SSDCI, WDCI, PDCI, MS-CSNet, SCSNet.
For the images in Table VI, without learning the sampling matrices, ReconNet and DR2-Net performed up to 3.88 dB worse than conventional CS with RSRM and the other networks that jointly learn sampling and reconstruction. The single-scale DCI, CSNet, showed a 0.4-1.5 dB gain over GSR but fell behind conventional single-scale RSRM (by around 0.66-0.84 dB) and multiscale CS with MR-KCS (by 0.03-0.89 dB). Our multiscale networks (PDCI, WDCI, and SSDCI) outperformed both the single-scale CSNet and KCSNet and the multiscale SCSNet. Thanks to jointly learning the decomposition, sampling, and reconstruction, SSDCI demonstrated the best reconstruction, with up to a 6.83 dB gain over CSNet and SCSNet for a highly compressible image (i.e., Cameraman) at subrate 0.3. For complex images like Lena and Mandrill, SSDCI still showed a 0.36-0.93 dB gain over CSNet. Compared to SSDCI, WDCI and PDCI showed comparable or lower reconstruction quality at high subrates.
Visual quality is presented in Figs. 9 and 10. Conventional CS (GSR and RSRM) preserved the high-frequency detail regions well (e.g., Lena's hat) but created false edge artifacts in complex regions (e.g., the boats and Lena's hair), because the low-rank assumption does not hold for complex texture regions. In contrast, loss of high frequency was observed in conventional multiscale CS (i.e., MR-KCS) and all DCI methods, especially for regions with strong local structure like Lena's hat. The effectiveness of our multiscale sampling is demonstrated by the best visual quality of SSDCI, followed by WDCI and PDCI. The other multiscale methods (SCSNet, KCSNet, and MR-KCS) suffer from aliasing artifacts.
V-C.2 Complexity
The complexity of DCI is expressed by the number of learned parameters (i.e., learning capability) and the number of variables (i.e., the size of the intermediate features) in Table VIII. Generally, sampling requires much less computation than reconstruction, especially at Phase 3. The complexity and performance of our MS-DCIs increase after each reconstruction phase. Meanwhile, the fixed decomposition of WDCI needs more parameters but fewer variables than PDCI and SSDCI.
At Phase 2, our MS-DCIs required a similar number of parameters (weights) and variables (intermediate features) as the single-scale CSNet but offered a 0.15-0.30 dB improvement in PSNR. SCSNet improved reconstruction quality by slightly increasing the parameters and doubling the variables. At Phase 3, we added parameters and variables over Phase 2; compared to SCSNet, our MS-DCIs used about 30 times more parameters but only about 1.5 times the variables. One might conclude that the PSNR improvement of our proposed MS-DCIs comes solely from the multiscale reconstruction rather than the sampling method. However, at the same reconstruction complexity, with only a slight difference in sampling complexity, SSDCI still gained 0.20 dB over WDCI and PDCI. Therefore, we conclude that the sampling architecture has a significant impact on the final reconstruction performance. Also, a reduction in parameters comes at the cost of increasing variables.
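To make the two complexity measures in Table VIII concrete, here is a minimal sketch (hypothetical layer sizes, not the paper's exact networks) of how parameters and intermediate variables are counted for a single convolution layer:

```python
# Parameters = learned weights (+ bias); variables = entries of the output
# feature map that must be stored as intermediate features.
def conv2d_complexity(c_in, c_out, k, h, w, bias=False):
    """Return (num_parameters, num_output_features) for a k x k convolution
    mapping c_in -> c_out channels on an h x w input ('same' padding, stride 1)."""
    params = c_out * c_in * k * k + (c_out if bias else 0)
    features = c_out * h * w
    return params, features

# Example: a 3x3 conv mapping 1 -> 64 channels on a 96x96 input
p, f = conv2d_complexity(1, 64, 3, 96, 96)
print(p, f)  # 576 parameters, 589824 intermediate features
```

This also shows why a bias term adds only `c_out` parameters (the "slight" increase discussed for Table VII), while widening or deepening the network inflates the variable count much faster than the parameter count.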
For runtime complexity, Fig. 11 shows the running time (in seconds) versus the average reconstruction quality (PSNR in dB) for six images, measured on a PC running Windows 10 Home Edition with MATLAB 2019a, 32 GB RAM, an NVIDIA RTX 2080 Ti GPU, and MatConvNet 1.0-beta25. The DCI schemes were averaged over 1000 runs per image, and the conventional CS schemes (MR-KCS, RSRM, and GSR) were averaged over 10 runs. Conventional single-scale CS (RSRM and GSR) showed poor quality and slow reconstruction. Conventional multiscale CS (MR-KCS) offered higher quality and faster reconstruction but was still significantly slower than the DCIs. SSDCI is always slower than WDCI and PDCI but performs better at Phases 2 and 3. WDCI and PDCI were clearly better than CSNet at the same running time. SCSNet had similar PSNR to SSDCI at the same running time and was slower than PDCI. All MS-DCIs at Phase 3 were better than SCSNet in both performance and running time.
V-D Discussions and Future Work
One of the most obvious applications of our multiscale framework is learned image/video compression [58], which currently compresses images at a single scale. Since multiscale processing is known to benefit both compressive sensing and image compression in JPEG2000 [57], jointly learning the multiscale decomposition and multiscale image compression should further improve the compression ratio. A second research topic that could utilize the multiscale sampling concept is coded imaging: multiscale sampling can be modeled via multiple levels of exposure. For instance, some pixels are exposed briefly to capture fast-moving objects (high frequency) while others are exposed longer to capture stationary objects (low frequency). A third application is MRI sampling and reconstruction [60]. Since the MRI image is captured in the Fourier domain, it is possible to mimic the Fourier transform and radial sampling as a layer to enable end-to-end multiscale sampling.
Our MS-DCI offers an alternative explanation for the multi-level wavelet convolution network (MWCNN) [14]. In MWCNN, the authors decompose features independently by wavelet before performing convolution across multiple channels. MWCNN was interpreted as an efficient pooling method that avoids the gridding artifact of conventional pooling. However, a convolution layer can also be interpreted as a CS sampling scheme with linear projection (see Fig. 1). In our view, conventional convolution (CNN) and MWCNN correspond to single-scale and multiscale (wavelet) sampling, respectively. As we have demonstrated the advantage of multiscale over single-scale sampling, MWCNN can capture multiscale features more efficiently than a conventional CNN and thus achieves higher reconstruction performance. Additionally, our results indicate that the wavelet is not the best decomposition for multiscale sampling; under this interpretation, a scale-space CNN should outperform MWCNN.
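To make the MWCNN-as-sampling view concrete, here is a minimal sketch (our illustration, not MWCNN's implementation) of the 2x2 Haar decomposition as a fixed, linear, stride-2 projection of the kind Fig. 1 interprets as CS sampling:

```python
import numpy as np

def haar_decompose(img):
    """One level of the 2x2 Haar transform: four fixed orthonormal filters
    applied at stride 2, producing LL, LH, HL, HH subbands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

img = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_decompose(img)
# Unlike ReLU-based pooling, this linear projection preserves all signal
# energy (it is orthonormal), so no measurement information is discarded:
print(np.allclose((ll**2 + lh**2 + hl**2 + hh**2).sum(), (img**2).sum()))
```

Because the projection is invertible, MWCNN's wavelet "pooling" loses nothing, which matches the linearity argument made for the sampling stage in Section V.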
VI Conclusion
This work proposed a novel multiscale deep learning-based compressive imaging network to improve sampling efficiency and reconstruction quality. The proposed framework learns not only to decompose and sample images at multiple scales but also to reconstruct them at multiple scales. We demonstrated the importance of the sampling network in improving the final reconstruction performance at marginal additional complexity. We proposed a three-phase training scheme to further improve training efficiency and reconstruction quality. The characteristics of the learned decompositions for multiscale sensing were investigated, including the Pyramid, Wavelet, and Scale-space decompositions. After training, our multiscale sampling matrices are fixed and can be applied in sequential imaging systems.
Acknowledgment
This work was supported in part by the National Research Foundation of Korea (NRF) grant 2017R1A2B2006518, funded by the Ministry of Science and ICT.
References
 [1] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info. Theo., vol. 52, no. 4, pp. 1289–1306, 2006.
 [2] L. Gan, “Block compressed sensing of natural images,” Proc. IEEE Inter. Conf. Digital Sig. Process., pp. 1–4, 2007.
 [3] M. F. Duarte and R. G. Baraniuk, “Kronecker Compressive Sensing,” IEEE Trans. Image Process., vol. 21, no. 2, pp. 494 – 504, Feb. 2012.
 [4] W. Yin, S. Morgan, J. Yang, and Y. Zhang, “Practical compressive sensing with Toeplitz and circulant matrices,” in Proc. SPIE Visual Comm. Image Process. (VCIP), vol. 7744, pp. 77440K, 2010.
 [5] T. T. Do, L. Gan, N. H. Nguyen, and T. D. Tran, “Fast and efficient compressive sensing using structurally random matrices,” IEEE Trans. Sig. Process., vol. 60, no. 1, pp. 139 – 154, Jan. 2012.
 [6] J. N. Laska and et al., “Democracy in action: Quantization, saturation, and compressive sensing,” J. Appl. Comp. Harm. Anal., vol. 31, no. 3, pp. 429 – 443, Nov. 2011.
 [7] T. N. Canh and B. Jeon, “Restricted Random Matrix for Compressive Imaging,” submitted to Digital Image Processing, 2019.
 [8] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Trans. Circ. Syst. Video Tech., vol. 22, no. 12, Dec. 2012.
 [9] Y. Tsaig and D. L. Donoho, “Extensions of compressed sensing,” J. Sig. Process., vol. 86, no. 3, pp. 549–571, Mar. 2006.
 [10] J. E. Fowler, S. Mun, and E. W. Tramel, “Multiscale block compressed sensing with smoothed projected Landweber reconstruction,” IEEE Conf. European Sig. Process. Conf., Aug. 2011.
 [11] T. N. Canh, and et al., “Multiscale/multiresolution Kronecker compressive imaging,” Proc. IEEE Inter. Conf. Image Process., 2015.
 [12] T. N. Canh, and et al., “Decompositionbased multiscale compressed sensing,” Proc. Inter. Workshop Adv. Image Process. (IWAIT), 2018.
 [13] R. Timofte et al., “NTIRE 2017 challenge on single image super-resolution: methods and results,” Proc. IEEE Conf. Comp. Vis. Patt. Recog. Workshops (CVPRW), 2017.
 [14] P. Liu and et al., “Multilevel Wavelet Convolutional Networks,” IEEE Access, vol 7. June 2019.
 [15] K. Kulkarni et al., “ReconNet: Non-iterative reconstruction of images from compressively sensed random measurements,” Proc. IEEE Inter. Conf. Comp. Vis. Patt. Recog., pp. 440–458, 2016.
 [16] H. Yao et al., “DR2-Net: Deep residual reconstruction network for image compressive sensing,” arXiv:1702.05743, 2017.
 [17] A. Adler et al., “A deep learning approach to block-based compressed sensing of images,” Proc. IEEE Inter. Works. Multi. Signal Process., 2017.
 [18] W. Shi et al., “Deep network for compressed image sensing,” Proc. IEEE Inter. Conf. Mult. Expo (ICME), pp. 877–882, 2017.
 [19] H. C. Burger et al., ”Image denoising: Can plain neural networks compete with BM3D?,” IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [20] K. Zhang et al., ”Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [21] T. Wang, M. Sun and K. Hu, ”Dilated Deep Residual Network for Image Denoising,” arXiv:1708.05473, 2018.
 [22] J. Zhang, D. Zhao, and W. Gao, “Groupbased sparse representation for image restoration,” IEEE Trans. Image Process., vol. 23, no. 8, pp. 3336 – 3351, Aug. 2014.
 [23] K. Zhang, W. Zuo, and L. Zhang, “Deep Plug-and-Play Super-Resolution for Arbitrary Blur Kernels,” in Proc. IEEE Conf. Comp. Vis. Patt. Recog. (CVPR), 2019.
 [24] S. Lefkimmiatis, ”Non-Local Color Image Denoising with Convolutional Neural Networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [25] B. Ahn and N. I. Cho, ”Block-Matching Convolutional Neural Network for Image Denoising,” arXiv:1704.00524, 2017.
 [26] E. Kang et al., ”Wavelet Domain Residual Network (WavResNet) for Low-Dose X-ray CT Reconstruction,” arXiv:1703.01383, 2017.
 [27] M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “DeepBinaryMask: Learning a Binary Mask for Video Compressive Sensing,” arXiv:1607.03343, 2016.
 [28] D. M. Nguyen and et al., “Deep learning sparse ternary projections for compressed sensing of images,” in Proc. IEEE Global Conf. Signal Info. Process. (GlobalSIP), 2017.
 [29] W. Cui et al., “Deep neural network based sparse measurement matrix for image compressed sensing,” IEEE Inter. Conf. Image Process., 2018.
 [30] T. N. Canh and B. Jeon, “Deep learningbased Kronecker compressive imaging,” in Proc. IEEE Inter. Conf. Consum. Elect.–Asia, 2018.
 [31] W. Shi and et al., “Image compressed sensing using convolutional neural network,” IEEE Transaction on Image Processing, 2019.
 [32] K. Xu and et al., “LAPRAN: A Scalable Laplacian Pyramid Reconstructive Adversarial Network for Flexible Compressive Sensing Reconstruction,” in Proc. European Conf. Comp. Vis. (ECCV), 2018.
 [33] W. Shi and et al., “Multiscale deep networks for image compressed sensing,” Proc. IEEE Inter. Conf. Image Process. (ICIP), 2018.
 [34] X. Mao et al., “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” Adv. Neural Info. Process. Sys., 2016.
 [35] T. Lindeberg, Scalespace theory in computer vision, Kluwer, 1994.
 [36] C. Szegedy et al., “Going deeper with convolutions,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [37] P. Burt et al., “The Laplacian pyramid as a compact image code,” IEEE Transactions on Communications, vol. 31, no. 4, pp. 532–540, 1983.
 [38] F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions,” International Conference on Learning Representations, 2016.
 [39] A. Vedaldi and et al., “Matconvnet: Convolutional neural networks for Matlab,” in Proc. ACM Inter. Conf. Multi., pp. 689 – 692, 2015.
 [40] C. Chen and et al., “Compressed sensing recovery of images and video using multihypothesis predictions,” in Proc. Asilomar Conf. Signals, Systems and Computers, pp. 1193–1198, 2011.
 [41] T. N. Canh and B. Jeon, “Multiscale deep compressive sensing network,” in IEEE Inter. Conf. Visual Comm. Image Process (VCIP), 2018.
 [42] D. G. Lowe, ”Distinctive Image Features from Scale-Invariant Keypoints,” Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [43] J. H. R. Chang et al., ”One network to solve them all - solving linear inverse problems using deep projection models,” IEEE Inter. Conf. Comp. Vis. (ICCV), 2017.
 [44] U. Dmitry and et al., ”Deep image prior”, Proc. IEEE Comp. Vis. Patt. Recog. (CVPR), 2018.
 [45] C. Metzler et al., ”Learned D-AMP: Principled neural network based compressive image recovery,” Adv. Neural Info. Process. Sys., 2017.
 [46] Y. Wu and et al., ”Deep compressed sensing,” arXiv:1905.06723, 2019.
 [47] U. S. Kamilov, and et al, ”Learning optimal nonlinearities for iterative thresholding algorithms,” IEEE Sig. Process. Lett., vol. 23, no. 5, 2016.
 [48] R. Heckel and P. Hand, ”Deep decoder: Concise image representation from untrained nonconvolutional networks,” arXiv:1810.03982, 2018.
 [49] W. Shi and et al., ”Scalable convolutional neural network for image compressed sensing,” Proc. IEEE Comp. Vis. Patt. Recog. (CVPR), 2019.
 [50] A. Mousavi et al., ”A data-driven and distributed approach to sparse signal representation and recovery,” Proc. Inter. Conf. Learn. Repr. (ICLR), 2019.
 [51] T. N. Canh and et al., ”Compressive sensing reconstruction via decomposition,” Signal Process: Image Comm., vol. 49, 2016.
 [52] T. Goldstein and et al., ”The stone transform: Multiresolution image enhancement and compressive video,” IEEE Trans. Image Process. (TIP), vol. 24, no. 12, 2015.
 [53] K. Q. Dinh et al., ”Iterative weighted recovery for block-based compressive sensing of image/video at low subrates,” IEEE Trans. Circ. Syst. Video Tech., vol. 27, no. 11, 2017.
 [54] J. Chen et al., ”Image blind denoising with generative adversarial network based noise modeling,” Proc. IEEE Conf. Comp. Vis. Patt. Recog. (CVPR), 2018.
 [55] C. Ledig et al., ”Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” Proc. IEEE Conf. Comp. Vis. Patt. Recog. (CVPR), 2017.
 [56] B. C. Csáji, Approximation with artificial neural networks, Faculty of Sciences, Eötvös Loránd University, 2001.
 [57] D. Taubman and M. Marcellin, JPEG2000 Image Compression Fundamentals, Standards and Practice, Springer Science & Business Media, 2012.
 [58] K. Nakanishi et al., ”Neural multi-scale image compression,” Asian Conference on Computer Vision (ACCV), 2018.
 [59] M. Yoshida et al., ”Joint optimization for compressive video sensing and reconstruction under hardware constraints,” European Conference on Computer Vision, 2018.
 [60] Y. Dai et al., ”Compressed sensing MRI via a multiscale dilated residual convolution network,” Elsevier Magnetic Resonance Imaging, vol. 63, 2019.
 [61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, ”MobileNetV2: Inverted Residuals and Linear Bottlenecks,” Proc. IEEE/CVF Conf. Comp. Vis. Patt. Recog. (CVPR), pp. 4510–4520, 2018.