Compressive sensing (CS) theory demonstrates that a signal can be recovered with high probability from far fewer measurements than prescribed by the Nyquist sampling theorem when the signal is sparse in a certain transform domain. The reduced sampling rate allows low-cost and efficient data compression, thereby relieving the burden on data storage and transmission bandwidth. These inherent merits make CS very desirable in a range of applications, including the single-pixel camera, magnetic resonance imaging [23, 44], video CS, and snapshot compressive imaging.
In a compressive image sensing method, for an image $x \in \mathbb{R}^{N}$, the sampling stage first performs fast sampling of $x$ to obtain the linear random measurements $y = \Phi x \in \mathbb{R}^{M}$. Here, $\Phi \in \mathbb{R}^{M \times N}$ is the sensing matrix with $M \ll N$, and $M/N$ denotes the CS sampling ratio. In the recovery stage, our goal is to infer the original image $x$ given $y$. Such an inverse problem is typically under-determined because the number of unknowns $N$ is much larger than the number of observations $M$. To address this problem, traditional CS methods [54, 53, 2] exploit sparsity as an image prior and find the sparsest signal consistent with the measurements by iteratively optimizing a sparsity-regularized problem. Although these methods usually have theoretical guarantees and are interpretable, they inevitably suffer from the high computational cost dictated by the iterative calculations.
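The sampling stage described above can be illustrated with a minimal sketch. It assumes a Gaussian random sensing matrix and a toy 16-element signal; the names `make_sensing_matrix`, `sample`, `phi`, `x`, and `y` are hypothetical and chosen only for illustration.

```python
import random

def make_sensing_matrix(m, n, seed=0):
    """Return an m x n Gaussian random sensing matrix as a list of rows."""
    rng = random.Random(seed)
    scale = 1.0 / (m ** 0.5)  # common normalization, assumed here
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]

def sample(phi, x):
    """Compute the linear measurements y = phi @ x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]

n = 16                      # number of unknowns (toy signal length)
ratio = 0.25                # CS sampling ratio
m = int(ratio * n)          # number of observations
x = [float(i % 4) for i in range(n)]
phi = make_sensing_matrix(m, n)
y = sample(phi, x)
print(len(x), len(y))       # far more unknowns than observations
```

With 16 unknowns and only 4 observations, infinitely many signals reproduce the same `y`, which is why a prior (sparsity, or a learned network) is needed for recovery.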
In contrast to conventional CS methods, neural networks solve the image CS reconstruction problem by directly learning the inverse mapping from the compressive measurements to the original images. Recently, with the advent of deep learning (DL), diverse data-driven deep neural network models for CS have been shown to achieve impressive reconstruction quality and efficient recovery speed [18, 26, 34, 51, 45, 33, 52, 55, 48, 47, 36]. In addition, DL based CS methods often jointly learn the sampling and the reconstruction network to further improve the performance [34, 47, 52, 55].
In the existing CS literature, DL based CS methods can be divided into two categories. The first is deep unfolding methods [26, 51, 45, 52, 55, 47], which leverage a deep neural network to mimic iterative restoration algorithms. They attempt to maintain the merits of both iterative recovery methods and data-driven network methods by mapping each iteration into a network layer. The deep unfolding approaches can extend the representation capacity of iterative algorithms and avoid the limited interpretability of deep neural networks. However, because these methods follow the structure of traditional optimization processes, they inevitably constrain the full potential of deep neural networks.
The second group is the straightforward methods [18, 34, 42, 33, 35, 46], which are free from any handcrafted constraint. These methods reconstruct images through a single feed-forward pass of the learned convolutional neural network (CNN) given the measurement. However, the principle of local processing limits CNN in terms of receptive fields and brings challenges in capturing long-range dependencies. Moreover, the weight sharing of the convolution layer makes the interactions between images and filters content-independent. Numerous efforts have been devoted to addressing these problems, such as enlarging the kernel size of convolution, using multi-scale reconstruction, dynamic convolution, and the attention mechanism. Sun et al. explore the non-local prior to guide the network in view of the long-range dependency problem. Furthermore, Sun et al. adopt a dual-path attention network for CS, where the recovery structure is divided into structure and texture paths. Despite improving the ability of context modeling to some extent, these approaches are still unable to escape the limitation of locality imposed by the CNN architecture.
Unlike prior convolution-based deep neural networks, the transformer, designed initially for sequence-to-sequence prediction in the NLP domain, is well-suited to modeling global contexts owing to its self-attention-based architecture. Inspired by the success of the transformer in NLP, several researchers have recently attempted to integrate the transformer into computer vision tasks, including image classification, image processing [6, 21, 39], and image generation. With its simple and general-purpose neural architecture, the transformer has been considered as an alternative to CNN in pursuit of better performance. However, a naïve application of the transformer to CS reconstruction may not produce results competitive with those of CNN. The reason is that the transformer captures high-level semantics through global self-attention, which is helpful for image classification but lacks the low-level details needed for image restoration. In general, CNN has better generalization ability and faster convergence speed thanks to its strong biases towards feature locality and spatial invariance, making it very efficient for images. The transformer has a higher model capacity because it is less restricted by inductive biases, enabling self-attention layers to learn the inherent characteristics of larger datasets well. Moreover, the rapidly growing computational complexity and memory consumption for high-resolution reconstruction are further challenges in applying the transformer to CS.
To cope with the above issues and further refine the reconstruction quality, we propose CSformer, an effective and efficient transformer based method for image CS. CSformer combines the detailed spatial information from CNN with the global context provided by the transformer. We design a hybrid framework that gradually increases the feature map resolution while reducing the dimension, enhancing the feature representation by multi-scale features while reducing memory cost and computational complexity. The proposed approach is an end-to-end compressive image sensing method composed of adaptive sampling and recovery. In the sampling module, images are measured block-by-block by the learned sampling matrix. In the reconstruction stage, we employ a progressive reconstruction strategy, and the CNN features are aligned with the layer-wise representations from the transformer. On the one hand, the progressive reconstruction can process the multi-scale feature maps, which is helpful for representation learning and reduces the number of parameters. On the other hand, CSformer benefits from an elaborate combination of local and global context by combining the two types of features at each resolution. Compared with the prevalent CNN-based methods, CSformer benefits from several aspects: (1) the self-attention mechanism ensures content-dependency between the image and the attention weights, (2) the CNN supplies the locality that the transformer lacks, while the transformer models long-range dependencies, and (3) progressive reconstruction balances complexity and efficiency. To the best of our knowledge, CSformer is the first work to apply the transformer to CS. Experimental results demonstrate that our method achieves promising performance and outperforms existing iterative methods and DL based methods.
The main contributions of this work can be summarized as follows:
We propose CSformer, a hybrid framework that couples transformer with CNN for adaptive sampling and reconstruction of image CS. The proposed CSformer inherits both local features from CNN and global representations from transformer.
To make full use of the complementary features of transformer and CNN, we introduce progressive reconstruction to aggregate the multi-scale features, which are thoughtfully designed for image CS to balance the complexity and performance with spatial variance.
Extensive experiments on various datasets demonstrate the superiority of the proposed CSformer. We reveal the great potential of transformer in combination with CNN for CS.
II Related Work
In this section, we present the related works. We first review the existing CS methods for natural images in Section II-A. Then we provide a brief overview of the recent development of vision transformers in Section II-B.
II-A Compressive Sensing
CS methods can be classified into two categories: iterative optimization based conventional methods and data-driven based DL methods. Furthermore, we can divide the deep network based approaches into deep unfolding methods and deep straightforward methods.
II-A1 Iterative Optimization based Conventional Methods
The conventional methods mainly rely on sparsity priors to recover the signal from the under-sampled measurements. Some approaches obtain the reconstruction by linear programming based on $\ell_1$ minimization. Examples of such algorithms include basis pursuit (BP), least absolute shrinkage and selection operator (LASSO), the iterative shrinkage/thresholding algorithm (ISTA), and the alternating direction method of multipliers (ADMM). In addition, some works improve the recovery performance by exploring image priors. TVAL3 utilizes total variation (TV) regularization to reconstruct images by enhancing the local smoothness. D-AMP considers the denoising perspective of approximate message passing (AMP) for iterative CS reconstruction. In general, all the above methods suffer from high computational complexity due to the iterative calculations.
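To make the cost of iterative recovery concrete, the following is a minimal sketch of ISTA for the sparsity-regularized least-squares problem (minimizing half the squared residual plus a weighted absolute-value penalty). It is a textbook form rather than any cited implementation; the helper names, the toy matrix, and the step size (the reciprocal of the squared Frobenius norm, a conservative bound on the Lipschitz constant) are assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def matvec(a, x):
    """Compute a @ x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def rmatvec(a, r):
    """Compute a.T @ r."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]

def objective(a, y, x, lam):
    """0.5 * ||a x - y||^2 + lam * ||x||_1."""
    res = [p - q for p, q in zip(matvec(a, x), y)]
    return 0.5 * sum(r * r for r in res) + lam * sum(abs(v) for v in x)

def ista(a, y, lam, n_iter=200):
    """Run n_iter ISTA iterations starting from the zero vector."""
    step = 1.0 / sum(v * v for row in a for v in row)  # safe step size
    x = [0.0] * len(a[0])
    for _ in range(n_iter):
        res = [p - q for p, q in zip(matvec(a, x), y)]   # a x - y
        grad = rmatvec(a, res)                           # gradient of data term
        x = [soft_threshold(xi - step * gi, step * lam)
             for xi, gi in zip(x, grad)]
    return x

a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # toy 2 x 3 sensing matrix
y = [1.0, 0.0]
x_hat = ista(a, y, lam=0.1)
print(objective(a, y, x_hat, 0.1) <= objective(a, y, [0.0, 0.0, 0.0], 0.1))
```

Every iteration needs a full pass through the sensing matrix and its transpose, and hundreds of iterations are typical, which is the computational burden referred to above.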
II-A2 Deep Unfolding Methods
Deep neural networks have been developed for image CS in the last few years. Deep unfolding methods combine traditional iterative reconstruction with deep neural networks. Such methods map each iteration into a network layer that preserves the interpretability and performance. Inspired by D-AMP, Metzler et al. implement a learned D-AMP (LDAMP), which unfolds the iterative D-AMP algorithm and combines it with a denoising CNN. In analogy to LDAMP, AMP-Net also applies a denoising prior, whereas it has an additional deblocking module and uses a learned sampling matrix.
Moreover, ISTA-Net+ and ISTA-Net++ design deep networks to mimic the ISTA algorithm for CS reconstruction. The difference is that ISTA-Net++ uses a cross-block learnable sampling strategy and achieves multi-ratio sampling and reconstruction in one model. OPINE-Net can also be regarded as a variant of ISTA-Net+, except that OPINE-Net simultaneously explores adaptive sampling and recovery. Besides building on AMP and ISTA, Yang et al. propose ADMM-CSNet to reconstruct images with high accuracy and speed by learning the sparse representations, model parameters, and ADMM algorithm from different types of images. The main drawback of the unfolding approaches is their limited support for parallel training and hardware acceleration, owing to their sophisticated and iterative structure.
II-A3 Deep Straightforward Methods
Instead of specific priors, the deep straightforward methods directly exploit the modeling power of DL, free from any constraints. ReconNet, considered the first deep network based method to bring CNN to CS reconstruction, aims to recover the image from CS measurements via CNN. Its reconstruction quality and computational complexity are both superior to those of traditional iterative algorithms. Jointly learning the sampling with the reconstruction in the whole network further improves the reconstruction performance. Instead of fixing the sampling matrix, Shi et al. implement a convolution layer to replace it and propose a deep network named CSNet to recover the image. They further extend their model to learn a binary sampling matrix and a bipolar sampling matrix. DR2-Net adopts a fully connected layer to perform the sampling, then stacks several residual learning blocks to improve reconstruction quality. Sun et al. design a 3-D encoder and decoder with channel attention motivated skip links and introduce non-local regularization for exploring the long-range dependencies. Sun et al. propose a dual-path attention network dubbed DPA-Net for CS reconstruction. Two path networks are embedded in DPA-Net for learning structure and texture, respectively, and are then combined by the attention module.
The original transformer is designed for natural language processing (NLP), in which the multi-head self-attention and feed-forward MLP layers excel at handling long-range dependencies of sequence data. Inspired by the power of the transformer in NLP, the pioneering work of ViT splits an image into flattened patches, successfully extending the transformer to the image classification task. Swin transformer designs a hierarchical transformer architecture with shifted window-based multi-head attention to reduce the computation cost. Since then, the transformer has become a model on a par with CNN, and transformer based applications in computer vision have mushroomed. Yang et al. develop a pre-trained model named image processing transformer (IPT) for several low-level computer vision tasks. They exploit the capability of the transformer by using large-scale pre-training, and IPT outperforms state-of-the-art methods on super-resolution, denoising, and deraining tasks. Uformer borrows the structure of U-Net to build a transformer that further improves the performance on low-level vision tasks. Liang et al. use a stack of residual Swin transformer blocks to achieve state-of-the-art performance on image restoration tasks. In addition, TransGAN proposes a generative adversarial network (GAN) [14, 28, 27] architecture using a pure transformer for image generation. On the other hand, many works aim to combine the strengths of CNN and transformer effectively. Xie et al. utilize CNN to extract feature representations and a transformer to model the long-range dependency for 3D medical image segmentation. CoAtNet unifies depthwise convolution and self-attention via relative attention. Peng et al. propose a dual backbone to combine CNN with a visual transformer for visual recognition. ConViT introduces a gated positional self-attention mechanism to bring the convolutional inductive bias to the transformer.
Fig. 1 illustrates the network architecture of the proposed CSformer for adaptive sampling and reconstruction. The sampling module samples block by block within image patches, which are split from the image in a non-overlapping manner. The sampling matrix is replaced by learned convolution kernels in each patch. The reconstruction module comprises a linear initialization module, an input projection module, an output projection module, a CNN stem, and a transformer stem, learning an end-to-end mapping from CS measurements to the recovered images. One stream of the CS measurement goes through the linear initialization module, which consists of two consecutive operations, a convolution and a pixelshuffle layer, to obtain the initial reconstruction. The other stream of the CS measurement passes through an input projection that contains several convolution layers followed by a pixelshuffle layer to obtain the input feature, which matches the input feature sizes for the CNN and the transformer. The trunk recovery network consists of a CNN stem and a transformer stem. Each stem contains four blocks with upsample layers to progressively reconstruct features until they match the patch size. In both branches, convolution features are used to provide local information that complements the features of the transformer. The output of the trunk recovery network is projected from the transformer output to a single channel by the output projection. CSformer reconstructs the final patches by summing the initial reconstruction and the trunk recovery. Finally, we merge all patches to obtain the final image.
CSformer samples and reconstructs the whole image by merging the fixed patches. Suppose that is the patch of input whole image . The sampling operation takes place in patch . We process the block-based CS (BCS) in patch , which decomposes a patch into non-overlapping blocks. Then the number of blocks is
. Each block is vectorized and subsequently sampled by the measurement matrix. Suppose that is the block of input patch . The corresponding measurement is obtained by , where and represents the sampling ratio. Then the measurement of the input patch
is obtained by stacking each block. In this paper, the sampling process is replaced by the convolution operation with appropriately sized filters and stride, as shown in Fig.2. The sampling convolution can be formulated as:
where corresponds to a convolution layer without bias consisting of filters with size, and the stride equals . After applying the convolution operation on the patch , we can obtain the final total CS measurement . As shown in Fig. 2, the CS measurement of size can be acquired from an input patch of size with sampling ratio 0.25 by exploiting a convolution layer using filters of kernel size , stride . In this case, , and . In fact, adopting learned convolutional kernels instead of a fixed sampling matrix can efficiently exploit the characteristics of the image and makes the output measurement easier to use in the following reconstruction module.
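The equivalence between block-wise sampling and a bias-free convolution whose stride equals its kernel size can be sketched as follows. The patch size, block size, and filter values below are toy assumptions chosen for readability, not the settings used in the experiments, and `block_sample` and `measurement_shape` are hypothetical helper names.

```python
def block_sample(patch, filters, block):
    """Apply each block x block filter with stride `block` and no bias.

    Returns a nested list of shape (H / block, W / block, number of filters).
    """
    h, w = len(patch), len(patch[0])
    out = []
    for top in range(0, h, block):
        row_out = []
        for left in range(0, w, block):
            # One inner product per filter: a row of the sampling matrix
            # applied to the vectorized block.
            meas = [sum(f[i][j] * patch[top + i][left + j]
                        for i in range(block) for j in range(block))
                    for f in filters]
            row_out.append(meas)
        out.append(row_out)
    return out

def measurement_shape(h, w, block, ratio):
    """Shape of the measurement tensor for a given sampling ratio."""
    m = round(ratio * block * block)   # measurements per block
    return (h // block, w // block, m)

patch = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # toy 4x4 patch
filters = [[[1.0, 0.0], [0.0, 0.0]],   # picks the top-left pixel of a block
           [[1.0, 1.0], [1.0, 1.0]]]   # sums the block
meas = block_sample(patch, filters, block=2)
print(meas[0][0], meas[1][1])
print(measurement_shape(8, 8, 4, 0.25))
```

Each filter plays the role of one row of the block sampling matrix, so the number of filters divided by the number of pixels in a block gives the sampling ratio.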
Given the CS measurements, traditional BCS usually obtains the initial reconstructed block by , where is the reconstruction of , and is the pseudo-inverse matrix of . In the CSformer initialization process, we utilize a convolution to replace . The difference is that we can directly apply the convolution layer to the to recover the initial patch. The initialization first adopts filters of kernel size to convert the measurement dimension to . Subsequently, a pixelshuffle layer is employed to obtain the original patch . For instance, a measurement with size is transformed to the initial reconstruction with size at the CS ratios of
. In summary, we use the convolution and pixelshuffle to obtain each initial reconstruction, which is a more efficient way as the output is directly a tensor instead of a vector.
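The pixelshuffle rearrangement used after the initialization convolution can be sketched in plain Python. The index convention below follows the common depth-to-space layout (the one used by PyTorch's `PixelShuffle`), which is assumed here; `pixel_shuffle` is a hypothetical helper name, and the input is a toy tensor.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C * r * r, H, W) tensor into (C, H * r, W * r).

    x is a nested list indexed as x[channel][row][col].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r * r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]      # channel feeding offset (i, j)
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out

# Four channels at one spatial position become one 2 x 2 block.
x = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
print(pixel_shuffle(x, 2))
```

Under this convention, a convolution that outputs one channel per pixel of a block, followed by a pixelshuffle with the block size as the ratio, directly yields a 2D patch rather than a vector that would need reshaping.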
III-C CNN Stem
The measurement is taken as the input of the input projection module that contains several convolution layers followed by a pixelshuffle layer to obtain feature with size (by default we set ). The CNN stem is composed of multiple stages. The first stage takes the projected output feature as input. Then the feature passes through the first convolution block to obtain feature with size
with 1 as the padding size, and the output channel is the same as the input channel. Thus, the resolution and channel size are kept consistent after a convolution block.
To scale up to a higher-resolution feature, we add an upsample module before each of the remaining convolution blocks. The upsample module first adopts bicubic upsampling to upscale the resolution of the previous feature, and then a convolutional layer is used to halve the channel dimension. Thus, the output features of the CNN stem can be represented by , where .
III-D Transformer Stem
Transformer stem aims to provide further guidance for global restoration with progressive features according to the convolution features. As shown in Fig. 3(b), each transformer block stacks transformer network. The input of transformer is the aggregation feature that bridges the convolution features and transformer features.
III-D1 Feature Aggregation
The aggregation feature fuses the local features from CNN and the global features from transformer by concatenation. The feature layouts of the CNN stem and the transformer stem differ, so we need to reshape the CNN features to align with the transformer features. The 2D feature map of CNN with size needs to be flattened to a 1D sequence for transformer. As can be seen from Fig. 3, the aggregation feature is taken as the input to the transformer blocks by concatenating these two features. It is worth mentioning that the input aggregation feature of the first transformer block is concatenated by and . In this way, the first transformer block makes full use of the information in the measurements and introduces local features of CNN. This also aligns with the observation of many studies [40, 9, 32] that introducing locality in early layers is beneficial for feature representation in transformer.
After the first transformer block, we obtain the transformer feature with size . The misalignment between the transformer feature and the next-stage CNN feature must then be eliminated. We first reshape the 1D sequence of to 2D feature map with the size . Subsequently, a pixelshuffle layer is used to upsample the resolution by ratio and reduce the channel dimension to a quarter of the input. This completes the alignment of the spatial and channel dimensions between the transformer features and the CNN features. Then the aggregation feature is obtained by concatenating the transformer feature and CNN feature. The aggregation feature can be expressed by , where .
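The bookkeeping of resolutions and channel dimensions across the two stems can be traced with a short sketch. It assumes four stages, an upsampling factor of 2 per stage (consistent with a pixelshuffle that quarters the channel dimension), transformer blocks that preserve their input dimension, and a toy starting feature of 8 x 8 with 128 channels; all of these values and the name `trace_shapes` are illustrative assumptions.

```python
def trace_shapes(h0, w0, c0, stages=4, up=2):
    """Return per-stage (height, width, channels) for the CNN features
    and for the aggregation features fed to each transformer block."""
    # CNN stem: the first block keeps the size; each later stage doubles
    # the resolution and halves the channel dimension.
    cnn = [(h0, w0, c0)]
    for _ in range(stages - 1):
        h, w, c = cnn[-1]
        cnn.append((h * up, w * up, c // 2))

    agg = []
    trans_c = c0   # first block: input-projection feature joins the first CNN feature
    for i, (h, w, c) in enumerate(cnn):
        if i > 0:
            # Pixelshuffle on the previous transformer output: resolution
            # grows by `up`, channels shrink by up * up.
            trans_c = agg[-1][2] // (up * up)
        agg.append((h, w, trans_c + c))   # concatenation along channels
    return cnn, agg

cnn_shapes, agg_shapes = trace_shapes(8, 8, 128)
for c_shape, a_shape in zip(cnn_shapes, agg_shapes):
    print(c_shape, a_shape)
```

Under these assumptions the aggregation feature always carries twice as many channels as the CNN feature at the same stage, and both halve from one stage to the next while the resolution doubles.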
III-D2 Window-based Transformer
The standard transformer  takes a series of sequences (tokens) as input and computes self-attention globally between all tokens. However, if we take each pixel as one token in transformer for CS reconstruction, the sequences grow as the resolution increases, resulting in explosive computational complexity for larger resolution. For instance, even a image will lead to sequences and have cost of self-attention. To address the above issue, CSformer performs window-based transformer. Given an input fusion feature of transformer, we partition feature into non-overlapping windows. Then the feature is split into the size of , where is the total number of windows. The multi-head self-attention is computed in each window. In each window, the feature is computed by the self-attention, where is the number of heads in the multi-head self attention. First, the query, key, and value matrices are computed as:
where , and are the projection matrices with the size . Subsequently, the self-attention can be formulated by:
where denotes the self-attention operation, is the softmax function, and is the learnable relative position encoding. The multi-head self-attention is performed for times self-attention in parallel and concatenates the results to obtain the output. The multi-head self-attention (MSA) based on the windows significantly reduces the computational and GPU memory cost.
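A minimal single-head sketch of window partitioning and attention inside one window is given below. It assumes the usual scaled dot-product form with an additive position bias (scores divided by the square root of the feature dimension before the softmax), which is the common Swin-style formulation and may differ in detail from the formulas of this section; function names, window sizes, and inputs are hypothetical.

```python
import math

def window_partition(feat, win):
    """Split an H x W grid of tokens into non-overlapping win x win windows."""
    h, w = len(feat), len(feat[0])
    return [[feat[top + i][left + j] for i in range(win) for j in range(win)]
            for top in range(0, h, win) for left in range(0, w, win)]

def attention_pairs(h, w, win=None):
    """Number of query-key pairs: global if win is None, else per-window."""
    tokens = h * w
    if win is None:
        return tokens * tokens
    return (tokens // (win * win)) * (win * win) ** 2

def softmax(row):
    peak = max(row)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def window_attention(q, k, v, bias):
    """softmax(q k^T / sqrt(d) + bias) v for the tokens of one window."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
                  for j, kj in enumerate(k)]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

grid = [[4 * r + c for c in range(4)] for r in range(4)]
print(window_partition(grid, 2)[0])
print(attention_pairs(64, 64), attention_pairs(64, 64, win=8))
```

For a 64 x 64 token grid with 8 x 8 windows (toy values), the number of query-key pairs drops by a factor equal to the number of windows, which is where the savings in computation and memory come from.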
Then, the output of MSA passes through a multi-layer perceptron (MLP) consisting of two fully-connected layers with GELU activation for nonlinear transformation. As shown in Fig. 3(b), the layer norm is inserted before MSA and MLP and the whole transformer process can be formulated as follows,
After the transformer feature reaches the input resolution , the output projection module is used to project the transformer feature to the image space. Before passing through the output projection, we first reshape the transformer feature to a 2D feature. Output projection consists of two convolution layers followed by a tanh activation function, which maps the transformer feature to single-channel reconstruction patches. Then we sum up the reconstruction patches with the initial reconstruction patches to obtain the final patches and merge all patches to obtain the final reconstructed image .
III-E Loss Function
We optimize the parameters of CSformer by minimizing the mean square error (MSE) between the output reconstructed image and the ground-truth image as follows,
It is worth mentioning that the proposed scheme is based on patch reconstruction while the loss function is computed on the whole image. As such, we attenuate the blocking artifacts without other post-processing deblocking modules.
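The idea of reconstructing per patch but computing the loss on the merged image can be sketched as follows. The row-major patch ordering and the per-pixel mean normalization of the MSE are assumptions made for this illustration, and `merge_patches` and `mse` are hypothetical helper names.

```python
def merge_patches(patches, rows, cols):
    """Tile a row-major list of square patches into one image.

    patches[r * cols + c] is the patch at grid position (r, c).
    """
    p = len(patches[0])            # patch height (patches assumed square)
    image = []
    for r in range(rows):
        for i in range(p):
            line = []
            for c in range(cols):
                line.extend(patches[r * cols + c][i])
            image.append(line)
    return image

def mse(a, b):
    """Mean squared error between two images of equal size."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n

patches = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
image = merge_patches(patches, rows=1, cols=2)
print(image)
print(mse(image, [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 9.0]]))
```

Because the error is measured after merging, discontinuities at patch borders contribute to the loss in the same way as errors inside a patch, so the network is trained against them directly.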
In this section, we first introduce the training settings and evaluation datasets in Section IV-A. Section IV-B shows the experimental results of our method compared with state-of-the-art methods on different test datasets. Section IV-C analyzes the effectiveness of the proposed approach by comparing the results with those of some variants of CSformer. Section IV-D compares the retraining performance and the computational time.
IV-A Experimental Settings
IV-A1 Dataset and Metrics
Training a vision transformer is known to be data-hungry. Therefore, we use the COCO 2017 unlabeled images dataset for training, which is a large-scale dataset that consists of over 123K images of high diversity. To reduce the training time, we only use a quarter of the whole training set, i.e., around 40K images. We evaluate our method on various widely used benchmark datasets, including Set11, BSD68, Set5, Set14, and Urban100. The Set11 and BSD68 datasets are composed of 11 and 68 gray images, respectively. The Urban100 dataset contains 100 challenging high-resolution city images. The Set5 and Set14 datasets have 5 and 14 images with different resolutions. Fig. 4 displays the visual samples of each dataset. We utilize the luminance components of color images for both training and testing. The test images are divided into overlapping patches for testing in the real implementation. The reconstruction results are reported under a range of sampling ratios from 0.1 to 0.5. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are adopted as the evaluation measures.
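For reference, PSNR can be computed from the mean squared error as in the sketch below. It assumes 8-bit luminance images with a peak value of 255; the function name `psnr` and the toy inputs are illustrative.

```python
import math

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    n = len(ref) * len(ref[0])
    err = sum((x - y) ** 2
              for row_r, row_c in zip(ref, rec)
              for x, y in zip(row_r, row_c)) / n
    if err == 0:
        return float("inf")      # identical images
    return 10.0 * math.log10(peak ** 2 / err)

ref = [[0.0, 0.0], [0.0, 0.0]]
rec = [[25.5, 25.5], [25.5, 25.5]]   # uniform error of one tenth of the peak
print(psnr(ref, rec))
```

A uniform error of one tenth of the peak value corresponds to 20 dB, and each further tenfold reduction of the error amplitude adds 20 dB.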
IV-A2 Training Details
The training images are cropped into images as input, i.e., . The size of the fixed patches is . The sampling convolutional kernel size in the sampling process is set to be , i.e., convolution layer with stride . The output feature dimension of input projection is set to . The window size of window-based multi-head self-attention is set to be for all transformer blocks. Each transformer block stacks
transformer network. We use one NVIDIA 2080Ti GPU to train our model in PyTorch, and the model is optimized by the Adam optimizer. The learning rate is initially set as and the cosine decay strategy is adopted to decrease the learning rate to . The number of iterations is 50,000, and the training time is about 1.5 days.
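A cosine decay schedule of the kind mentioned above can be sketched as follows. The start and end learning rates used in the example (1e-4 and 1e-6) are placeholder assumptions and not the values used in the experiments; only the 50,000-iteration horizon is taken from the text, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

total = 50000
for step in (0, 12500, 25000, 37500, 50000):
    print(step, cosine_lr(step, total, lr_max=1e-4, lr_min=1e-6))
```

The schedule changes slowly at the start and at the end and fastest in the middle of training.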
IV-B Performance Comparisons
To facilitate comparisons, we evaluate the performance of our CSformer on five widely used test sets, and compare our method with four recent representative state-of-the-art DL based CS methods, including CSNet, DPA-Net, OPINE-Net, and AMP-Net. The results of the other methods are obtained using their public pre-trained models.
To display the comprehensive performance comparisons over multiple datasets, we utilize two commonly-used average measures to evaluate the average performance over the five test databases, as suggested in . The two average measures can be defined as follows:
where denotes the total number of datasets ( in this paper), represents the value of the performance index (e.g. PSNR, SSIM) on the -th dataset, and is the corresponding weight on the -th dataset. The first average measure is the Direct Average with . The second average measure is the Weighted Average, where is set as the number of images in the -th dataset (e.g. 11 for the Set11 dataset, 100 for the Urban100 dataset).
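The two average measures can be illustrated with a short sketch. It assumes the usual normalized form (the weighted sum of per-dataset scores divided by the sum of weights, with unit weights for the Direct Average). The dataset sizes are those stated in the text, while the PSNR values are made-up placeholders, not results from this work.

```python
def weighted_average(scores, weights):
    """Sum of weight * score divided by the sum of weights."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def direct_average(scores):
    """Weighted average with every weight equal to 1."""
    return weighted_average(scores, [1.0] * len(scores))

# Number of images per dataset: Set11, BSD68, Set5, Set14, Urban100.
sizes = [11, 68, 5, 14, 100]
# Placeholder PSNR values in dB, for illustration only.
psnr_values = [30.0, 28.0, 33.0, 29.0, 27.0]

print(direct_average(psnr_values))
print(weighted_average(psnr_values, sizes))
```

The Weighted Average is dominated by the larger datasets (BSD68 and Urban100), whereas the Direct Average treats each dataset equally regardless of its size.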
Table I shows the average PSNR and SSIM performance of different methods at different CS ratios across all five datasets. It can be observed that the proposed CSformer achieves both the highest PSNR and the highest SSIM results for different ratios on all datasets. Our approach achieves a large gap (1 to 2 dB) across all CS ratios on the Urban100 dataset, which contains more images with larger resolution. The Direct Average and Weighted Average show that our proposed CSformer outperforms all state-of-the-art models under comparison. The improvement in performance is mainly attributed to the powerful feature representation ability obtained by bridging two strong neural networks, CNN and transformer. Experimental results demonstrate that CSformer generalizes well and recovers images effectively even at very low sampling ratios, while achieving the best performance at all sampling ratios.
In Fig. 5, we show the reconstructed images of all the methods at CS ratios of 10% and 50%. The proposed CSformer is able to recover finer details and clearer edges than other methods. Fig. 6 shows the qualitative comparison of the reconstructed image and the absolute residual intensity map for different methods at a CS ratio of 4%. The absolute residual intensity map is the intensity map of the absolute residual between the recovered image and the ground-truth image. As shown in Fig. 6, our CSformer can recover more fine details and structure thanks to the help that CNN provides to the transformer. Compared to DPA-Net, which uses a dual-path CNN structure, the clarity of our results is significantly improved. Compared to the deep unfolding methods OPINE-Net and AMP-Net, our CSformer reduces the artifacts and provides a more reasonable reconstruction. The visual quality results at a CS ratio of 25% are shown in Fig. 7. The improvement can be seen more clearly in the residual map, where the reconstructed texture detail of our approach is finer. The visual quality comparisons clearly demonstrate the effectiveness of the proposed CSformer. Overall, the quantitative and qualitative comparisons with several competing methods verify the superiority of CSformer.
IV-C Ablation Studies
This subsection first presents the ablation studies on the feature dimension and feature aggregation. Subsequently, network structure is analyzed to investigate the effects of the dual structure in our CSformer. Moreover, we visualize the feature map and the feature similarity to verify that our hybrid framework effectively bridges CNN and transformer.
IV-C1 Feature Dimension
Table II shows the results for different dimensions, where the subscript represents the dimension of . The smaller CSformer is capable of achieving good performance on the five datasets. The CSformer outperforms CSformer at most CS ratios. The largest improvement appears on the Urban100 dataset, with an average of 0.4 dB. In addition, there are about 0.2 dB PSNR gains on Set11 and Set14. The larger CSformer achieves gains of around 0.1 to 0.2 dB over the second one but has the largest number of parameters. To balance the performance and model size, we adopt for our CSformer by default.
IV-C2 Feature Aggregation
CSformer adopts the concatenation operator to aggregate features from different stems. To illustrate the effectiveness of this choice, we construct a variant in which the CNN features and transformer features are added rather than concatenated. It is worth mentioning that replacing the concatenation operation directly with the addition operation will cause the feature dimension of the transformer block to be halved. Thus, for a fair comparison, we modify the output dimension of the CNN block to keep the input dimension of the transformer stem unchanged when using additive fusion. The CSformer using additive feature fusion has 9.04 M parameters, and the CSformer using concatenating feature aggregation has 6.71 M parameters. Fig. 8 shows the PSNR results on the Set11 dataset. The concatenating feature aggregation shows superior PSNR performance at different sampling ratios and has fewer parameters. The additive feature fusion achieves a close performance when CS ratios are less than 50%. The gap is most obvious at a CS ratio of 50%, which shows that concatenation can make better use of the complementarity of the CNN features and transformer features at higher sampling ratios. The same pattern is observed on the Urban100 dataset, as shown in Fig. 9. The default concatenation aggregation has superior performance compared to additive fusion on Urban100. The improvement is up to 2.06 dB at a 50% CS ratio and around 0.1 to 0.3 dB at ratios from 1% to 25%.
IV-C3 Dual Stem
CSformer is a dual-stem model, aiming to couple the efficiency of convolution in extracting local features with the power of transformer in modeling global representations. To evaluate the benefits of these two branches, we build a single-path model, named “SPT”, which only uses transformer for reconstruction. For a fair comparison, we add one more convolution before the transformer block and set to maintain the consistency of resolution and dimension in the transformer block while keeping all others unchanged. The testing is implemented on the Set11 dataset and the Urban100 dataset, as depicted in Table III and Table IV. CSformer shows a better result on the Set11 dataset at a CS ratio of 1%, while showing a slight performance drop relative to SPT at other ratios. This is partly due to the increase in the number of parameters and partly reflects the powerful modeling capability of the transformer network. On the Urban100 dataset, CSformer shows superior PSNR performance at different CS ratios, with gains of at most 0.84 dB. The gap between these two methods grows with the sampling ratio and is largest at a CS ratio of 50%. The improvement of CSformer is more noticeable at high ratios. The reason is that the trunk recovery network recovers the residuals relative to the initial reconstruction, while under high sampling ratios the initial reconstruction is already relatively sufficient. Therefore, the detailed and local information provided by CNN is more helpful for the final reconstruction. Meanwhile, CSformer plays a more critical role on the Urban100 dataset than on the Set11 dataset. The reason can be attributed to the fact that the Urban100 dataset has more textured data, making the local information more helpful for the reconstruction. In this case, the convolution operation is more efficient and practical for local image feature extraction.
IV-C4 Image Loss
The proposed CSformer computes the loss between whole images rather than between patches: the output patches are merged into an image before the loss is calculated, which reduces blocking artifacts. In Fig. 11, we compare this image loss with a version of CSformer that calculates the loss between the input patches and the output patches. Experiments on five test datasets show that the image loss adopted by CSformer significantly improves performance without additional post-processing modules, especially at higher sampling rates.
IV-C5 Feature Analysis
We investigate the differences between the internal feature representations of the CNN and the transformer through feature visualization and feature similarity. In the first analysis, we visualize the feature maps in Fig. 10. Compared with CSformer, SPT tends to activate global areas more than local regions. Moreover, with the help of the local information extracted by the CNN, CSformer preserves detailed textures better than SPT. The figure illustrates the ability of CSformer to bridge local features and global representations: convolution enhances the locality of the features starting from the early layers. This early local intervention is a helpful complement to the transformer features.
In Fig. 12, we extract the CNN features and transformer features from the CNN stem and the transformer stem, respectively, and analyze them from the perspective of representation similarity using centered kernel alignment. Note that the transformer features already contain the CNN features, since the fused features are the input of the transformer block. We observe that the lower layers of the transformer block are similar to the deep layers of the CNN. This shows that the transformer captures long-range dependencies from the beginning, whereas the CNN relies on stacking layers to strengthen long-distance feature dependencies. It also indicates that the CNN features play a more critical role in the early layers than in the deep layers. The middle layers show weak similarity, which indicates that the transformer features are dominant there. The deep layers show moderate similarity, which illustrates that CSformer balances local and global representations in the deep layers.
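As a rough illustration of the similarity measure, the sketch below implements the linear variant of centered kernel alignment in NumPy. The choice of the linear kernel, the function name `linear_cka`, and the random feature matrices are assumptions made for this example; the text does not state which kernel or implementation was used.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, n_features).

    The two matrices must share n_samples but may differ in n_features.
    """
    # Center every feature column so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical features: 100 samples from two layers of different width.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((100, 16))
feats_b = rng.standard_normal((100, 24))

# An orthogonal rotation of feats_a; linear CKA is invariant to such rotations.
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
rotated = feats_a @ q

same_score = linear_cka(feats_a, feats_a)
rotated_score = linear_cka(feats_a, rotated)
random_score = linear_cka(feats_a, feats_b)
print(same_score, rotated_score, random_score)
```

In a layer-by-layer comparison such as the one in Fig. 12, each entry of the similarity map would be one such score computed between the flattened features of a CNN layer and a transformer layer over the same inputs.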
|Direct Average||1%||21.80 (-1.41)||21.82 (-1.39)||23.21|
|Direct Average||4%||26.22 (-1.19)||26.00 (-1.41)||27.41|
|Direct Average||10%||30.05 (-0.67)||29.71 (-1.01)||30.72|
|Direct Average||25%||34.32 (-0.65)||34.18 (-0.79)||34.97|
|Direct Average||50%||39.00 (-1.32)||39.38 (-0.94)||40.32|
|Weighted Average||1%||21.42 (-1.13)||21.39 (-1.16)||22.55|
|Weighted Average||4%||25.09 (-1.23)||25.04 (-1.28)||26.32|
|Weighted Average||10%||28.72 (-0.70)||28.26 (-1.16)||29.42|
|Weighted Average||25%||32.94 (-0.69)||32.71 (-0.92)||33.63|
|Weighted Average||50%||37.68 (-1.25)||38.06 (-0.87)||38.93|
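The parenthesized values in the table above appear to be the difference, in dB, between each of the first two result columns and the last column. The short sketch below checks this reading on the "Direct Average" rows. Because the extracted table carries no column headers, the code refers to the columns only by position, and the helper name `gaps` is hypothetical.

```python
# CS ratio -> (first column, second column, last column), PSNR values in dB,
# copied from the "Direct Average" rows of the table above.
rows = {
    "1%": (21.80, 21.82, 23.21),
    "4%": (26.22, 26.00, 27.41),
    "10%": (30.05, 29.71, 30.72),
    "25%": (34.32, 34.18, 34.97),
    "50%": (39.00, 39.38, 40.32),
}

def gaps(row):
    """Return each leading column minus the last column, rounded to 0.01 dB."""
    *others, reference = row
    return tuple(round(value - reference, 2) for value in others)

for ratio, row in rows.items():
    print(ratio, gaps(row))
```

Under this reading, the computed gaps match the parenthesized entries in every "Direct Average" row.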
IV-D Analysis on the Retraining Performance and Running Time
We retrain AMP-Net and OPINE-Net on the COCO dataset to show their performance with a larger training set; the results are given in Table V. The original AMP-Net is trained on the BSD500 dataset, and OPINE-Net is trained on the T91 dataset. As shown in Table V, CSformer achieves the highest PSNR when all methods use the same training dataset. Compared with the models trained on BSD500 and T91, the two retrained methods improve on some datasets and decline on others.
Table VI lists the number of parameters of various CS methods at a CS ratio of 50%, together with the time needed to reconstruct an image. Although our method uses both a transformer and a CNN, its total parameter count is still 30% lower than that of DPA-Net, which uses a dual-path CNN structure. Although the running time increases, the proposed CSformer achieves the best performance and generalization capability.
In this paper, we propose a novel dual-stem network named CSformer, which bridges CNN and transformer networks for adaptive sampling and reconstruction in CS. The sampling stage adaptively learns the sampling matrix and samples the image block by block. In the reconstruction stage, we design a dual-stem structure that combines the two types of features and gradually increases the feature resolution to reduce memory cost and computational complexity. Experiments show that CSformer effectively exploits the complementarity of the transformer and the CNN, outperforming the pure single-path transformer. The proposed CSformer achieves the best performance on various test sets at different CS ratios compared with existing DL-based methods. CSformer is the first work to extend the vision transformer to CS, and it shows great potential for improving CS performance.
-  (2010) An augmented lagrangian approach to the constrained optimization formulation of imaging inverse problems. IEEE Transactions on Image Processing 20 (3), pp. 681–695. Cited by: §II-A1.
-  (2017) Error bounds for compressed sensing algorithms with group sparsity: a unified approach. Applied and Computational Harmonic Analysis 43 (2), pp. 212–232. Cited by: §I.
-  (2010) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §IV-D.
-  (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2 (1), pp. 183–202. Cited by: §II-A1.
-  (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, pp. 1–10. Cited by: §IV-A1.
-  (2021) Pre-trained image processing transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12299–12310. Cited by: §I, §II-B.
-  (2001) Atomic decomposition by basis pursuit. SIAM review 43 (1), pp. 129–159. Cited by: §II-A1.
-  (2021) ConViT: improving vision transformers with soft convolutional inductive biases. CoRR abs/2103.10697. Cited by: §II-B.
-  (2021) CoAtNet: marrying convolution and attention for all data sizes. CoRR abs/2106.04803. Cited by: §II-B, §III-D1.
-  (2009) Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences 106 (45), pp. 18914–18919. Cited by: §II-A1.
-  (2006) Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §I.
-  (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Cited by: §I, §II-B.
-  (2008) Single-pixel imaging via compressive sampling. IEEE signal processing magazine 25 (2), pp. 83–91. Cited by: §I.
-  (2014) Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §II-B.
-  (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §IV-A1.
-  (2021) TransGAN: two transformers can make one strong gan. CoRR abs/2102.07074. Cited by: §I.
-  (2019) Similarity of neural network representations revisited. In Proceedings of the International Conference on Machine Learning, pp. 3519–3529. Cited by: §IV-C5.
-  (2016) Reconnet: non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 449–458. Cited by: §I, §I, §II-A3, §IV-A1, §IV-D.
-  (2013) An efficient augmented lagrangian method with applications to total variation minimization. Computational Optimization and Applications 56 (3), pp. 507–530. Cited by: §II-A1.
-  (2017) Structured sparse representation with union of data-driven linear and multilinear subspaces model for compressive video sampling. IEEE Transactions on Signal Processing 65 (19), pp. 5062–5077. Cited by: §I.
-  (2021) SwinIR: image restoration using swin transformer. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1833–1844. Cited by: §I, §I, §II-B.
-  (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10012–10022. Cited by: §II-B.
-  (2007) Sparse mri: the application of compressed sensing for rapid mr imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 58 (6), pp. 1182–1195. Cited by: §I.
-  (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2, pp. 416–423. Cited by: §IV-A1.
-  (2016) From denoising to compressed sensing. IEEE Transactions on Information Theory 62 (9), pp. 5117–5144. Cited by: §II-A1.
-  (2017) Learned d-amp: principled! neural network based compressive image recovery. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1773–1784. Cited by: §I, §I, §II-A2.
-  (2020) Towards unsupervised deep image enhancement with generative adversarial network. IEEE Transactions on Image Processing 29, pp. 9140–9151. Cited by: §II-B.
-  (2020) Unpaired image enhancement with quality-attention generative adversarial network. In Proceedings of the ACM International Conference on Multimedia, pp. 1697–1705. Cited by: §II-B.
-  (2018) A gabor feature-based quality assessment model for the screen content images. IEEE Transactions on Image Processing 27 (9), pp. 4516–4528. Cited by: §IV-B.
-  (2019) Pytorch: an imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, pp. 8026–8037. Cited by: §IV-A2.
-  (2021) Conformer: local features coupling global representations for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 367–376. Cited by: §II-B.
-  (2021) Do vision transformers see like convolutional neural networks?. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §III-D1.
-  (2020) Image compressed sensing using convolutional neural network. IEEE Transactions on Image Processing 29, pp. 375–388. Cited by: §I, §I, §II-A3, §IV-B.
-  (2017) Deep networks for compressed image sensing. In Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 877–882. Cited by: §I, §I, §II-A3.
-  (2020) Dual-path attention network for compressed sensing image reconstruction. IEEE Transactions on Image Processing 29, pp. 9482–9495. Cited by: §I, §II-A3, §IV-B.
-  (2020) Learning non-locally regularized compressed sensing network with half-quadratic splitting. IEEE Transactions on Multimedia 22 (12), pp. 3236–3248. Cited by: §I, §I, §II-A3.
-  (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §II-A1.
-  (2017) Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §I, §II-B, §III-D2.
-  (2021) Uformer: A general u-shaped transformer for image restoration. CoRR abs/2106.03106. Cited by: §I, §II-B.
-  (2021) Early convolutions help transformers see better. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §III-D1.
-  (2021) CoTr: efficiently bridging CNN and transformer for 3d medical image segmentation. CoRR abs/2103.03024. Cited by: §II-B.
-  (2018) Lapran: a scalable laplacian pyramid reconstructive adversarial network for flexible compressive sensing reconstruction. In Proceedings of the European Conference on Computer Vision, pp. 485–500. Cited by: §I.
-  (2020) Learning texture transformer network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5791–5800. Cited by: §II-B.
-  (2017) DAGAN: deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction. IEEE transactions on medical imaging 37 (6), pp. 1310–1321. Cited by: §I.
-  (2018) ADMM-csnet: a deep learning approach for image compressive sensing. IEEE transactions on pattern analysis and machine intelligence 42 (3), pp. 521–538. Cited by: §I, §I, §II-A2.
-  (2019) Dr2-net: deep residual reconstruction network for image compressive sensing. Neurocomputing 359, pp. 483–493. Cited by: §I, §II-A3.
-  (2021) ISTA-net++: flexible deep unfolding network for compressive sensing. In Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–6. Cited by: §I, §I, §II-A2.
-  (2021) COAST: controllable arbitrary-sampling network for compressive sensing. IEEE Transactions on Image Processing 30, pp. 6066–6080. Cited by: §I.
-  (2020) Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1447–1457. Cited by: §I.
-  (2010) On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces, pp. 711–730. Cited by: §IV-A1.
-  (2018) ISTA-net: interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1828–1837. Cited by: §I, §I, §II-A2.
-  (2020) Optimization-inspired compact deep compressive sensing. IEEE Journal of Selected Topics in Signal Processing 14 (4), pp. 765–774. Cited by: §I, §I, §II-A2, §IV-B.
-  (2014) Image compressive sensing recovery using adaptively learned sparsifying basis via l0 minimization. Signal Processing 103, pp. 114–126. Cited by: §I.
-  (2012) Image compressive sensing recovery via collaborative sparsity. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 2 (3), pp. 380–391. Cited by: §I.
-  (2020) AMP-net: denoising-based deep unfolding for compressive image sensing. IEEE Transactions on Image Processing 30, pp. 1487–1500. Cited by: §I, §I, §II-A2, §IV-B.