Neural Sparse Representation for Image Restoration

Inspired by the robustness and efficiency of sparse representation in sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks. Our method structurally enforces sparsity constraints upon hidden neurons. The sparsity constraints are favorable for gradient-based learning algorithms and attachable to convolution layers in various networks. Sparsity in neurons enables computation saving by only operating on non-zero components without hurting accuracy. Meanwhile, our method can magnify representation dimensionality and model capacity with negligible additional computation cost. Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks, including image super-resolution, image denoising, and image compression artifacts removal. Code is available at


1 Introduction

Sparse representation plays a critical role in image restoration problems, such as image super-resolution Yang et al. (2010); Zeyde et al. (2010); Yang et al. (2012), denoising Elad and Aharon (2006), compression artifacts removal Zhao et al. (2016), and many others Mairal et al. (2007, 2008). These tasks are inherently ill-posed, where the input signal usually has insufficient information while the output has infinitely many solutions w.r.t. the same input. Thus, it is commonly believed that sparse representation is more robust to handle the considerable diversity of solutions.

Sparse representation in sparse coding is typically high-dimensional but with a limited number of non-zero components. Input signals are represented as sparse linear combinations of tokens from a dictionary. High dimensionality implies a larger dictionary size and generally leads to better restoration accuracy, since a larger dictionary can sample the underlying signal space more thoroughly and thus represent any query signal more precisely. Besides, sparsity, which limits the number of non-zero elements, works as an essential image prior that has been extensively investigated and exploited to make restoration robust. Sparsity also brings computational efficiency by ignoring the zero parts.

Deep convolutional neural networks for image restoration extend the sparse coding based methods with repeatedly cascaded structures. The deep network based approach was first introduced to improve performance in Dong et al. (2014) and conceptually connected with previous sparse coding based methods. A simple network, with two convolutional layers bridged by a non-linear activation layer, can be interpreted as follows: the activation denotes the sparse representation, the non-linearity enforces sparsity, and the convolutional kernels constitute the dictionary. SRResNet Ledig et al. (2017) extended the basic structure with a skip connection to form a residual block and cascaded a large number of blocks to construct very deep residual networks.

Sparsity of hidden representation in deep neural networks cannot be obtained by iterative optimization as in sparse coding, since deep networks are feed-forward during inference. Sparsity of neurons is commonly achieved by ReLU activation in Glorot et al. (2011), which thresholds negative values to zero independently in each neuron. Still, the roughly 50% sparsity it yields on random vectors is far from the sparsity definition, which restricts the overall number of non-zero components. Conversely, sparsity constraints are more actively used on model parameters to achieve network pruning Han et al. (2015). However, hidden representation dimensionality is reduced in pruned networks, and accuracy may suffer.
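The 50% figure can be checked with a minimal sketch. It assumes the pre-activation values are zero-mean Gaussian, which is an illustrative stand-in for "random vectors" and not a setting taken from the paper; the helper names are hypothetical.

```python
import random

def relu(v):
    """Threshold negative entries to zero, element-wise."""
    return [x if x > 0.0 else 0.0 for x in v]

def zero_fraction(v):
    """Fraction of exactly-zero entries in a vector."""
    return sum(1 for x in v if x == 0.0) / len(v)

rng = random.Random(0)
# Assumption: zero-mean Gaussian values stand in for pre-activation features.
pre_activation = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
sparsity = zero_fraction(relu(pre_activation))
print(f"fraction of zeros after ReLU: {sparsity:.3f}")
```

About half of the entries survive, so the number of non-zero components still grows linearly with the representation width, which is the gap the paragraph above points out.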

In this paper, we propose a method that structurally enforces sparsity constraints upon hidden neurons in deep networks while keeping the representation high-dimensional. Given high-dimensional neurons, we divide them into groups along channels and allow only one group of neurons to be non-zero at a time. The adaptive selection of the non-sparse group is modeled by tiny side networks upon context features. Computation is also saved, because it is only performed on the non-zero group. However, the selecting operation is not differentiable, so it is difficult to embed the side networks for joint training. We therefore relax the sparse constraints to soft ones, which can be approximately reduced to a sparse linear combination of multiple convolution kernels instead of a hard selection. We further introduce an additional cardinal dimension to decompose sparsity prediction into sub-problems by splitting each sparse group and concatenating after a cardinal-independent combination of parameters.

To demonstrate the significance of neural sparse representation, we conduct extensive experiments on image restoration tasks, including image super-resolution, denoising, and compression artifacts removal. Our experiments conclude that: (1) dedicated constraints are essential to achieve neural sparsity representation and benefit deep networks; (2) our method can significantly reduce computation cost and improve accuracy, given the same size of model footprint; (3) our method can dramatically enlarge the model capacity and boost accuracy with negligible additional computation cost.

2 Related Work

2.1 Sparse coding and convolutional networks

Here we briefly review the application of sparsity in image restoration and its relation to convolutional networks. Considering image super-resolution as an example of image restoration, the sparse coding based method Yang et al. (2010) assumes that the input image signal $X$ can be represented by a sparse linear combination $\alpha$ over a dictionary $D_1$, which is typically learned from training images, as

$$X \approx D_1 \alpha, \quad \text{for some } \alpha \in \mathbb{R}^n,\ \|\alpha\|_0 \ll n. \tag{1}$$

In Yang et al. (2012), a coupled dictionary, $D_2$, for the restored image signal $Y$ is jointly learned with $D_1$ as well as its sparse representation $\alpha$ by

$$Y \approx D_2 \alpha. \tag{2}$$

Convolutional networks, which consist of stacked convolutional layers and non-linear activation functions, can be interpreted with the concepts from sparse coding Dong et al. (2014). Given, for instance, a small piece of network with two convolutional layers with kernels $W_1$, $W_2$ and a non-linear function $F$, the image restoration process can be formalized as

$$Y = W_2 * F(W_1 * X). \tag{3}$$

The convolution operation $*$ with $W_1$ is equivalent to projecting the input image signal $X$ onto dictionary $D_1$. The convolution operation $*$ with $W_2$ corresponds to the projection of the signal representation on dictionary $D_2$. This two-convolutional-layer structure is widely used as a basic residual block and stacked in multiple blocks to form very deep residual networks in recent advances Ledig et al. (2017); Lim et al. (2017) of image restoration.

Dimensionality of hidden representation or number of kernels in each convolutional layer determines the size of dictionary memory and learning capacity of models. However, unlike sparse coding, representation dimensionality in deep models is usually restricted by running speed or memory usage.

2.2 Sparsity in parameters and pruning

Exploring the sparsity of model parameters can potentially improve robustness Guo et al. (2018), but sparsity in parameters is neither sufficient nor necessary for a sparse representation. Furthermore, group sparsity upon channels and suppression of parameters close to zero can achieve node pruning He et al. (2014); Han et al. (2015); Frankle and Carbin (2019); Liu et al. (2019); Yu et al. (2019b), which dramatically reduces inference computation cost. Despite its efficiency, node pruning proportionally reduces representation dimensionality rather than making the representation sparse, limits representation diversity, and leads to accuracy regression.

2.3 Thresholding and gating

A thresholding function, ReLU Nair and Hinton (2010) for example, plays a similar role of imposing sparsity constraints Glorot et al. (2011) by setting negative values to zero, and contributes a significant performance improvement over previous activation functions, e.g., hyperbolic tangent. Although ReLU statistically gives 50% sparsity over random vectors, there is still a significant gap between this and the sparsity definition in Eq. 1. A gating mechanism, in Squeeze-and-Excitation Hu et al. (2018); Zhang et al. (2018), for example, scales hidden neurons with adaptive sigmoid gates and slightly improves sparsity besides giving noticeable accuracy improvements. Both thresholding and gating are applied independently to hidden neurons and cannot inherently guarantee the global sparsity in Eq. 1.

3 Methodology

We propose novel sparsity constraints to achieve sparse representation in deep neural networks. Relaxed soft restrictions are more friendly to gradient-based training. Additional cardinal dimension refines the constraints and improves the diversity of sparse representation.

3.1 Sparsity in hidden neurons

Unlike the methods discussed in Section 2.3, which only consider local sparsity for each neuron independently, our approach enforces global sparsity between groups. Specifically, the hidden neurons are divided into $k$ groups with $c$ nodes in each group, and only one group is allowed to contain non-zero values. Correspondingly, the convolution kernels can also be divided according to the hidden neurons they connect to. Then only the kernels connected to non-zero neurons need to be computed. Formally, for the network structure in Eq. 3, the convolution kernels are divided as $W_1 = [(W_1^1)^T, (W_1^2)^T, \dots, (W_1^k)^T]^T$ and $W_2 = [W_2^1, W_2^2, \dots, W_2^k]$. Then Eq. 3 can be rewritten as

$$Y = [W_2^1, \dots, W_2^k] * F\big([(W_1^1)^T, \dots, (W_1^k)^T]^T * X\big) = \sum_{i=1}^{k} W_2^i * F(W_1^i * X). \tag{4}$$

When the sparsity constraints only allow the $i$th group of neurons to have non-zero components, Eq. 4 can be reduced, as shown in Figure 1, formally as

$$Y = W_2^i * F(W_1^i * X). \tag{5}$$

Figure 1: Illustration of computation reduction in two-layer neural networks with sparse hidden nodes in a simplified matrix multiplication example. Left: network with sparsity constraints, which only allow one group with $c$ hidden nodes to be non-zero over $kc$ nodes in total. Right: reduced computation with only $W_1^i$ and $W_2^i$, since other activation nodes are zero. (Grayscale reflects magnitude of matrix values. Matrix multiplication is in right to left order.)
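The group decomposition can be sketched in the same simplified matrix-multiplication setting as Figure 1. This is an illustrative sketch, not the authors' implementation: convolutions are replaced by plain matrix-vector products, ReLU is assumed for the non-linearity, and all sizes and helper names are hypothetical.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(x, 0.0) for x in v]

def dense_two_layer(w1, w2, x):
    """Two-layer network with full matrices: y = W2 F(W1 x)."""
    return matvec(w2, relu(matvec(w1, x)))

def grouped_two_layer(w1_groups, w2_groups, x, active=None):
    """Sum over hidden groups; if `active` is given, compute only that group."""
    indices = range(len(w1_groups)) if active is None else [active]
    out = [0.0] * len(w2_groups[0])
    for i in indices:
        y_i = matvec(w2_groups[i], relu(matvec(w1_groups[i], x)))
        out = [a + b for a, b in zip(out, y_i)]
    return out

rng = random.Random(0)
k, c, n_in, n_out = 3, 4, 5, 2  # illustrative sizes
x = [rng.uniform(-1, 1) for _ in range(n_in)]
w1_groups = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(c)] for _ in range(k)]
w2_groups = [[[rng.uniform(-1, 1) for _ in range(c)] for _ in range(n_out)] for _ in range(k)]

# Stack the groups: W1 stacks rows (k*c x n_in), W2 concatenates columns (n_out x k*c).
w1 = [row for group in w1_groups for row in group]
w2 = [[v for group in w2_groups for v in group[r]] for r in range(n_out)]

y_dense = dense_two_layer(w1, w2, x)
y_grouped = grouped_two_layer(w1_groups, w2_groups, x)
y_single = grouped_two_layer(w1_groups, w2_groups, x, active=1)
```

`y_dense` and `y_grouped` agree because the element-wise non-linearity acts on each group separately. `y_single` touches only one group's parameters, so it costs roughly 1/k of the dense computation, and it equals the full output whenever every other group's activations are zero.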

The proposed sparsity is supposed to pick the node group with the largest amplitude, which cannot be done without computing the values of all the nodes. In our approach, the selection of the only non-zero group is instead modeled by a multi-layer perceptron (MLP) with respect to the input signal $X$.

Regular convolution operations need the kernels to be shared across all pixels. Hence the selection should also be identical across spatial positions. Inspired by the Squeeze-and-Excitation Hu et al. (2018); Zhang et al. (2018) operation, we add a pooling operation before the MLP and a broadcasting operation for group selection. The above procedure can be formalized as

$$i = \arg\max_{j \in \{1, \dots, k\}} \mathrm{MLP}(\mathrm{Pool}(X))_j. \tag{6}$$

Note that, as in most patch-based algorithms Yang et al. (2010); Zhang et al. (2018) for image restoration, the pooling operation should be with respect to a specific patch size instead of the whole image.
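The pool, MLP, arg-max pipeline can be sketched as follows. The MLP depth, hidden width, average pooling, and random parameters are all assumptions made for illustration; the text only states that a pooling operation precedes an MLP and that the result is shared over the patch.

```python
import random

def avg_pool(patch):
    """Average every channel of a patch stored as channels x pixels."""
    return [sum(channel) / len(channel) for channel in patch]

def linear(weights, bias, v):
    """Affine map: weights (list of rows) times v, plus bias."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def select_group(patch, params):
    """Pool the patch, run a small two-layer MLP, return the arg-max group index."""
    w_a, b_a, w_b, b_b = params
    hidden = [max(h, 0.0) for h in linear(w_a, b_a, avg_pool(patch))]
    scores = linear(w_b, b_b, hidden)
    return max(range(len(scores)), key=lambda j: scores[j])

rng = random.Random(0)
channels, pixels, hidden_dim, k = 8, 16, 4, 3  # illustrative sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = (rand_matrix(hidden_dim, channels), [0.0] * hidden_dim,
          rand_matrix(k, hidden_dim), [0.0] * k)
patch = rand_matrix(channels, pixels)

# One index per patch; it is broadcast to every pixel of that patch.
group_index = select_group(patch, params)
```

Because the MLP sees only one pooled vector per patch, its cost is independent of the patch's pixel count, which is why the side network stays tiny compared with the convolutions.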

Comparison to thresholding and gating. The proposed method limits the number of non-zero entities to under $1/k$ of all the nodes in the hidden-layer representation, which is closer to the sparsity definition in Eq. 1 than the thresholding and gating methods discussed in Section 2.3. The proposed method also reduces computation cost by $k$ times by only considering the adaptively selected group, which is not possible with thresholding and gating methods.

Comparison to node pruning. Node pruning is designed to diminish activation nodes by zeroing all the related trainable parameters. The pruned nodes stay at zero no matter how the input signal varies, which substantially reduces representation dimensionality. In our method, the sparsity adaptively depends on the input. Although the representation inherently keeps its high dimensionality, our method saves computation and memory cost like narrow models.

3.2 Relaxed soft sparsity

Similar to the L0-norm in sparse coding, the adaptive sparse group selection in Eq. 6 is not differentiable and therefore cannot be jointly learned with neural networks. Although the Gumbel trick Jang et al. (2017) has been proposed to re-parameterize the $\arg\max$ with respect to a conditional probability distribution, it does not achieve convincing results in our experiment settings.

The sparsity constraints are relaxed by substituting the selection with softmax as a smooth approximation of max. Instead of predicting an index over $k$ groups, the MLP is relaxed to predict probabilities $\beta \in \mathbb{R}^k$ over groups with the softmax function $\sigma(\cdot)$ by

$$\beta = \sigma(\mathrm{MLP}(\mathrm{Pool}(X))). \tag{7}$$

Then, the two-layer structure in Eq. 4 is updated to an adaptive weighted sum of groups as

$$Y = \sum_{i=1}^{k} \beta_i\, W_2^i * F(W_1^i * X). \tag{8}$$

With weighted summation, Eq. 8 cannot be directly reduced as Eq. 5, since none of the group weights is exactly zero. Fortunately, given the sparse assumption on softmax outputs, $\exists\, i$ s.t. $\beta_i \gg \beta_j \to 0,\ \forall j \neq i$, and a piece-wise linear activation function $F$, ReLU for example, it can be shown that the weighted sum of hidden neurons can be approximately reduced to a weighted sum of parameters $W^i$, as shown in Figure 2, formally as

$$Y \approx \Big(\sum_{i=1}^{k} \beta_i W_2^i\Big) * F\Big(\Big(\sum_{i=1}^{k} \beta_i W_1^i\Big) * X\Big). \tag{9}$$

Note that the two sets of weights $\beta$ applied to $W_1$ and $W_2$ need not be identical to achieve the approximation. Our experiments show that independently predicting weights for $W_1$ and $W_2$ benefits accuracy.

Figure 2: Illustration of weighted neurons in soft sparsity constraints and the reduced counterpart with a weighted sum of parameters. Left: network with soft sparsity constraints, where weights are applied to neurons in groups. Right: approximate reduction by first computing the weighted sum of parameter groups into a small slice and then applying it to features.

In this way, networks restricted by soft sparse constraints can be as efficient as those with hard constraints. The only additional computation cost, from the interpolation of convolution kernels, is negligible compared with the convolution operations on the image.
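The quality of this reduction can be probed numerically. The sketch below compares weighting the group outputs against mixing the group parameters first, again using matrix-vector products in place of convolutions and ReLU as the piece-wise linear activation. The sizes, random weights, and function names are illustrative assumptions, and the same weights are used for both layers for simplicity.

```python
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(x, 0.0) for x in v]

def weighted_sum(mats, beta):
    """Element-wise sum_i beta_i * mats[i] for same-shaped matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(b * m[r][c] for b, m in zip(beta, mats)) for c in range(cols)]
            for r in range(rows)]

def weighted_neurons(w1_groups, w2_groups, beta, x):
    """Run every group, then weight and sum the group outputs."""
    out = [0.0] * len(w2_groups[0])
    for b, w1, w2 in zip(beta, w1_groups, w2_groups):
        y = matvec(w2, relu(matvec(w1, x)))
        out = [o + b * v for o, v in zip(out, y)]
    return out

def weighted_params(w1_groups, w2_groups, beta, x):
    """Mix the group parameters first, then run one small two-layer network."""
    return matvec(weighted_sum(w2_groups, beta),
                  relu(matvec(weighted_sum(w1_groups, beta), x)))

rng = random.Random(0)
k, c, n_in, n_out = 2, 4, 3, 2  # illustrative sizes
x = [rng.uniform(-1, 1) for _ in range(n_in)]
w1_groups = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(c)] for _ in range(k)]
w2_groups = [[[rng.uniform(-1, 1) for _ in range(c)] for _ in range(n_out)] for _ in range(k)]

def max_gap(beta):
    a = weighted_neurons(w1_groups, w2_groups, beta, x)
    b = weighted_params(w1_groups, w2_groups, beta, x)
    return max(abs(p - q) for p, q in zip(a, b))

gap_one_hot = max_gap([1.0, 0.0])      # hard selection: both forms coincide
gap_sharp = max_gap([0.999, 0.001])    # near one-hot: small gap
gap_uniform = max_gap([0.5, 0.5])      # far from sparse: approximation not justified
print(gap_one_hot, gap_sharp, gap_uniform)
```

With one-hot weights the two forms are identical, and the gap shrinks in proportion to the weight left on the non-selected groups. The second form runs a single group-sized network, which is where the efficiency of the hard constraint is recovered.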

Comparison to conditional convolution. CondConv Yang et al. (2019) has a similar operation, an adaptive weighted sum of convolution kernels, to our relaxed soft sparsity approach. However, CondConv uses the sigmoid function to normalize the weights of kernels instead of the softmax function used in our method. Hence, no sparsity constraints are explicitly applied in CondConv, and our experiments show that sparsity is very important for model accuracy.

3.3 Cardinality over sparsity groups

Modeling sparsity between groups with a simple MLP is challenging, especially when the dimensionality $c$ per group grows. Also, bonding channels within pre-defined groups limits the diversity of the sparsity patterns. Inspired by group convolution in ResNeXt Xie et al. (2017), we split the $c$ nodes per sparsity group into $d$ cardinal groups, and each cardinal group with $c/d$ nodes is independently constrained along the $k$ sparsity groups, as shown in Figure 3. Formally, the averaging weights are extended to a matrix $\gamma \in \mathbb{R}^{d \times k}$, with softmax applied along the $k$ axis, and the weighted averaged convolution kernel becomes

$$\bar{W} = \mathrm{concat}_{i=1}^{d} \Big(\sum_{j=1}^{k} \gamma_{i,j}\, W^{i,j}\Big), \tag{10}$$

where $\gamma_{i,j}$ and $W^{i,j}$ are the weight and kernel of the $i$th cardinal group and $j$th sparsity group, and $\mathrm{concat}$ is the concatenation operation along the axis of output channels. Notably, with cardinal grouping, the Squeeze-and-Excitation Hu et al. (2018) operation becomes a particular case of our approach when $d = c$, $k = 1$, and the MLP activation is substituted with the sigmoid function.

Figure 3: Illustration of our method. Features of an image patch are first spatially pooled and fed into an MLP with softmax activation to predict the sparsity constraints $\gamma \in \mathbb{R}^{d \times k}$. The softmax function is performed along the $k$ axis. The convolution kernel $W$ is divided into $k$ sparsity groups with $c$ channels per group. Each group is further divided into $d$ cardinal groups with $c/d$ channels per group. The cardinal-independent weighted sum is performed as in Eq. 10. Finally, the aggregated kernel convolves with the original features. (Colors reflect sparsity groups and grayscale reflects magnitude of matrix values.)
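A minimal sketch of the cardinal-independent weighted sum follows. It assumes each sparsity group's kernel is flattened to a matrix with one row per output channel, and that cardinal groups are contiguous blocks of output channels; both layout choices, the symbol names, and the toy values are illustrative assumptions rather than details confirmed by the text.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_kernel(kernels, gamma):
    """Per cardinal group, mix the sparsity groups, then concatenate the results.

    kernels: k matrices, each with c rows (output channels) and m columns.
    gamma:   d rows x k columns; row i weights the k groups for cardinal group i.
    Returns a c x m matrix, the size of a single sparsity group.
    """
    k, c, d = len(kernels), len(kernels[0]), len(gamma)
    if c % d != 0:
        raise ValueError("channels per group must be divisible by the cardinality")
    per_cardinal = c // d
    mixed = []
    for i in range(d):  # cardinal groups, concatenated along output channels
        for r in range(i * per_cardinal, (i + 1) * per_cardinal):
            mixed.append([
                sum(gamma[i][j] * kernels[j][r][col] for j in range(k))
                for col in range(len(kernels[0][r]))
            ])
    return mixed

# Toy example: k = 2 sparsity groups, c = 4 channels per group, d = 2 cardinal groups.
kernels = [[[1.0] * 3 for _ in range(4)], [[3.0] * 3 for _ in range(4)]]
gamma = [softmax([0.0, 0.0]), softmax([10.0, -10.0])]  # softmax along the k axis
mixed_kernel = aggregate_kernel(kernels, gamma)
```

In the toy example the first cardinal group averages the two sparsity groups while the second almost exclusively picks the first one, so different channel blocks of the same layer can follow different sparsity patterns.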

4 Experiments

4.1 Settings

Datasets and benchmarks. We use multiple datasets for image super-resolution, denoising, and compression artifacts removal separately. For image super-resolution, models are trained with the DIV2K Timofte et al. (2017) dataset, which contains 800 high-quality (2K resolution) images. DIV2K also comes with 100 validation images, which are used for the ablation study. The datasets for benchmark evaluation include Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2010), BSD100 Martin et al. (2001) and Urban100 Huang et al. (2015) with three up-scaling factors: x2, x3 and x4. For image denoising, the training set consists of 200 images from the training split and 200 images from the testing split of the Berkeley Segmentation Dataset (BSD) Martin et al. (2001), as in Zhang et al. (2017). The datasets for benchmark evaluation include Set12, BSD68 Martin et al. (2001) and Urban100 Huang et al. (2015) with additive white Gaussian noise (AWGN) of level 15, 25, 50. For compression artifacts removal, the training set consists of the 91 images in Yang et al. (2010) and the 200 training images in Martin et al. (2001). The datasets for benchmark evaluation include LIVE1 Sheikh et al. (2006) and Classic5 with JPEG compression quality 10, 20, 30 and 40. Evaluation metrics include PSNR and SSIM Wang et al. (2004) for predicted image quality in luminance or grayscale; only DIV2K is evaluated in RGB channels. FLOPs per pixel is used to measure efficiency, because the runtime complexity is proportional to the input image size for fully convolutional models.
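For reference, PSNR follows the standard definition 10 log10(peak^2 / MSE). The sketch below is a generic implementation of that formula, not the authors' evaluation script; it assumes 8-bit images (peak value 255) passed as flat lists of pixel values, and the function name is hypothetical.

```python
import math

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Every pixel is off by one grey level, so the mean squared error is 1.
reference = [0, 255, 128, 64]
restored = [1, 254, 129, 63]
value = psnr(reference, restored)
print(f"{value:.2f} dB")
```

Since PSNR is logarithmic in the error, differences of a few hundredths of a dB in the result tables correspond to small but consistent changes in mean squared error.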

Training settings. Models are trained with natural images and their degraded counterparts. Online data augmentation includes random flipping and rotation during training. Training is based on image patches randomly sampled 100 times per image per epoch, for 30 epochs in total. Models are optimized with the L1 distance and the Adam optimizer. The initial learning rate is 0.001 and is multiplied by 0.2 at epochs 20 and 25.
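The learning-rate schedule described above can be written as a small step function. This sketch assumes epochs are counted from zero and that the decay takes effect at the start of the milestone epoch; neither detail is stated in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.2, milestones=(20, 25)):
    """Step schedule: multiply the base rate by `decay` once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed

# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]
```

The rate stays at 0.001 for the first 20 epochs, drops to 0.0002 for the next five, and finishes at 0.00004.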

4.2 Ablation study

We conduct ablation study to prove the significance of neural sparse representation. The experiments are evaluated on DIV2K validation set for image super-resolution with x2 up-scaling under PSNR. We use WDSR Yu et al. (2019a) networks with 16 residual blocks, 32 neurons and 4x width multiplier as the baseline, and set for sparsity groups by default.

Sparsity constraints. Sparsity constraints are essential for representation sparsity. We implement the hard sparsity constraints with Gumbel-softmax to simulate the gradient of hardmax and compare it with the soft sparsity achieved by the softmax function. The temperature in softmax also controls the sharpness of the output distribution. When the temperature is small, softmax outputs are sharper and closer to hardmax, so the gradient vanishes. When the temperature is large, softmax outputs are smoother, which contradicts our sparsity assumption for the approximation in Eq. 9. We also compare them with a similar model that uses the sigmoid function as the MLP activation instead of sparsity constraints, as in CondConv Yang et al. (2019). Results in Table 1 show that Gumbel-based hard-sparsity methods are not feasible and are even worse than the baseline without sparsity groups. The temperature must be initialized with a proper value to achieve better results, which coincides with the above analysis. Sigmoid also gets worse results than softmax because sigmoid cannot guarantee sparsity, which also agrees with our comparison in the previous section.
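The effect of temperature on sharpness can be seen in a few lines. The logits and temperature values below are arbitrary illustrative choices, not the settings used in the experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: smaller values sharpen, larger values flatten."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical MLP outputs for four groups
for t in (0.1, 1.0, 10.0):
    probs = softmax(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all the mass sits on one group, which matches the sparsity assumption but leaves little gradient for the other groups. At a high temperature the weights approach uniform, which keeps gradients flowing but breaks the assumption behind the parameter-mixing approximation.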

| Sparsity | N/A | Sigmoid | Gumbel | Softmax | Softmax | Softmax |
|---|---|---|---|---|---|---|
| PSNR | 34.76 | 34.81 | 34.45 | 34.86 | 34.87 | 34.83 |

Table 1: Comparison of different sparsity constraints.
Figure 4: Comparison of cardinality.

Cardinality. The cardinal dimension reduces the actual dimensionality and the dependency between channels in sparsity groups, and improves the diversity of linear combination weights over convolution kernels. Results of models with different cardinalities in Fig. 4 show that increasing cardinality consistently benefits accuracy. We also compare with the Squeeze-and-Excitation (SE) model, which is a special case of our method, under the same FLOPs. Our models significantly outperform the SE model.

Efficiency. Our method can save computation by approximately $k$ times with $k$ sparsity groups while keeping the same model size, or number of parameters. The models in each column of Table 2 have the same model size, and the results show that our method can save at least half of the computation without hurting accuracy, uniformly for various model sizes.

Capacity. Our method can also extend the model capacity, or number of parameters, by $k$ times with $k$ sparsity groups at only negligible additional computation cost. The models in each column of Table 3 have the same computation cost, and the results show that our method can continually improve accuracy by extending model capacity up to 16 times.
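The two trade-offs can be made concrete with a back-of-the-envelope count. The sketch below counts weights and per-pixel multiply-accumulates for a two-convolution block with 3x3 kernels, ignoring biases, the side MLP, and the kernel mixing, which the text describes as negligible. The channel sizes are illustrative and the helper names are hypothetical.

```python
def conv_cost(c_in, c_out, ksize=3):
    """Weights of a conv layer, which equals its multiply-accumulates per output pixel."""
    return ksize * ksize * c_in * c_out

def block_cost(features, hidden, groups=1):
    """Return (stored parameters, per-pixel MACs) for a two-conv block.

    `hidden` is the number of hidden channels actually convolved per pixel;
    `groups` kernel groups of that width are stored, and one mixed kernel is applied.
    """
    macs = conv_cost(features, hidden) + conv_cost(hidden, features)
    return groups * macs, macs

baseline = block_cost(32, 128)               # dense block, no sparsity groups
same_size = block_cost(32, 64, groups=2)     # efficiency setting: same parameters
same_flops = block_cost(32, 128, groups=4)   # capacity setting: same computation

print("baseline  ", baseline)
print("same size ", same_size)
print("same FLOPs", same_flops)
```

In the efficiency setting the stored parameters match the baseline while the per-pixel work halves; in the capacity setting the per-pixel work matches the baseline while the stored parameters grow with the number of groups.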

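| Group size | 2 blocks | 4 blocks | 8 blocks | 16 blocks |
|---|---|---|---|---|
| N/A | 33.91 | 34.29 | 34.56 | 34.76 |
| 2 | 33.92 | 34.30 | 34.57 | 34.77 |
| 3 | 33.86 | 34.23 | 34.51 | 34.71 |
| 4 | 33.81 | 34.21 | 34.41 | 34.68 |
| # params (M) | 0.15 | 0.30 | 0.60 | 1.2 |

Table 2: Models with the same size, by number of residual blocks. FLOPs is inversely proportional to group size.

| Group size | 2 blocks | 4 blocks | 8 blocks | 16 blocks |
|---|---|---|---|---|
| N/A | 33.91 | 34.29 | 34.56 | 34.76 |
| 2 | 33.98 | 34.38 | 34.65 | 34.83 |
| 4 | 34.07 | 34.45 | 34.70 | 34.87 |
| 8 | 34.14 | 34.50 | 34.74 | 34.89 |
| 16 | 34.17 | 34.56 | 34.77 | 34.91 |
| FLOPs (M) | 0.15 | 0.30 | 0.60 | 1.2 |

Table 3: Models with the same FLOPs, by number of residual blocks. Model size is proportional to group size.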

Visualization of kernel selection. Since it is difficult to directly visualize the sparsity of a high-dimensional hidden representation, we take the selection of kernels as a surrogate. As shown in Fig. 5, in the first block, the weights are almost binary everywhere and only depend on color and low-level cues. In later blocks, the weights are smoother and more attentive to high-frequency positions with more complicated texture. The last layer is more correlated with semantics, for example, the tree branches in the first image and the lion in the second image.

Figure 5: Visualization of kernel selection for a network with 2x sparsity groups and 4 residual blocks. The first column shows input images. Later columns show softmax MLP outputs of the second convolutional layer in each block. Blue and yellow denote the two groups respectively.

4.3 Main results

In this section, we apply our method on top of state-of-the-art methods and compare the results on image super-resolution, image denoising, and image compression artifact removal.

| Dataset | Scale | Bicubic | VDSR | EDSR (S) | Sparse + | EDSR (L) | Sparse - |
|---|---|---|---|---|---|---|---|
| Set5 | x2 | 33.66 / 0.9299 | 37.53 / 0.9587 | 37.99 / 0.9604 | 38.02 / 0.9610 | 38.11 / 0.9601 | 38.23 / 0.9614 |
| | x3 | 30.39 / 0.8682 | 33.66 / 0.9213 | 34.37 / 0.9270 | 34.43 / 0.9277 | 34.65 / 0.9282 | 34.62 / 0.9289 |
| | x4 | 28.42 / 0.8104 | 31.35 / 0.8838 | 32.09 / 0.8938 | 32.25 / 0.8957 | 32.46 / 0.8968 | 32.55 / 0.8987 |
| Set14 | x2 | 30.24 / 0.8688 | 33.03 / 0.9124 | 33.57 / 0.9175 | 33.60 / 0.9191 | 33.92 / 0.9195 | 33.94 / 0.9203 |
| | x3 | 27.55 / 0.7742 | 29.77 / 0.8314 | 30.28 / 0.8418 | 30.37 / 0.8443 | 30.52 / 0.8462 | 30.57 / 0.8475 |
| | x4 | 26.00 / 0.7027 | 28.01 / 0.7674 | 28.58 / 0.7813 | 28.66 / 0.7836 | 28.80 / 0.7876 | 28.79 / 0.7876 |
| B100 | x2 | 29.56 / 0.8431 | 31.90 / 0.8960 | 32.16 / 0.8994 | 32.26 / 0.9008 | 32.32 / 0.9013 | 32.34 / 0.9020 |
| | x3 | 27.21 / 0.7385 | 28.82 / 0.7976 | 29.09 / 0.8052 | 29.15 / 0.8074 | 29.25 / 0.8093 | 29.26 / 0.8100 |
| | x4 | 25.96 / 0.6675 | 27.29 / 0.7251 | 27.57 / 0.7357 | 27.61 / 0.7372 | 27.71 / 0.7420 | 27.72 / 0.7414 |
| Urban100 | x2 | 26.88 / 0.8403 | 30.76 / 0.9140 | 31.98 / 0.9272 | 32.57 / 0.9329 | 32.93 / 0.9351 | 33.02 / 0.9367 |
| | x3 | 24.46 / 0.7349 | 27.14 / 0.8279 | 28.15 / 0.8527 | 28.43 / 0.8587 | 28.80 / 0.8653 | 28.83 / 0.8663 |
| | x4 | 23.14 / 0.6577 | 25.18 / 0.7524 | 26.04 / 0.7849 | 26.24 / 0.7919 | 26.64 / 0.8033 | 26.61 / 0.8025 |
| Manga109 | x2 | 30.80 / 0.9339 | 37.22 / 0.9750 | 38.55 / 0.9769 | 38.94 / 0.9776 | 39.10 / 0.9773 | 39.31 / 0.9782 |
| | x3 | 26.95 / 0.8556 | 32.01 / 0.9340 | 33.45 / 0.9439 | 33.77 / 0.9462 | 34.17 / 0.9476 | 34.27 / 0.9484 |
| | x4 | 24.89 / 0.7866 | 28.83 / 0.8870 | 30.35 / 0.9067 | 30.63 / 0.9106 | 31.02 / 0.9148 | 31.10 / 0.9145 |
| DIV2K validation | x2 | 31.01 / 0.8923 | 33.66 / 0.9290 | 34.61 / 0.9372 | 34.87 / 0.9395 | 35.03 / 0.9407 | 35.07 / 0.9410 |
| | x3 | 28.22 / 0.8124 | 30.09 / 0.8590 | 30.92 / 0.8734 | 31.10 / 0.8767 | 31.26 / 0.8795 | 31.30 / 0.8797 |
| | x4 | 26.66 / 0.7512 | 28.17 / 0.8000 | 28.95 / 0.8178 | 29.10 / 0.8223 | 29.25 / 0.8261 | 29.29 / 0.8263 |
| FLOPs (M) | | - | 0.67 | 1.4 | 1.4 | 43 | 9.5 |

Table 4: Public image super-resolution benchmark results and DIV2K validation results in PSNR / SSIM. The better results with small and large EDSR are underlined and in bold respectively.

Super-resolution. We apply our method on top of EDSR Lim et al. (2017), a state-of-the-art single image super-resolution method, and also compare with bicubic upsampling and VDSR Kim et al. (2016). As shown in Table 4, the small EDSR(S) has 16 residual blocks and 64 neurons per layer, and our sparse+ model extends it to 4 sparsity groups with cardinality 16 and outperforms it on all benchmarks with 4x model capacity but negligible additional computation cost. The large EDSR(L) has 32 residual blocks and 256 neurons per layer, and our sparse- model has 32 residual blocks, 128 neurons per layer, and 4 sparsity groups with cardinality 16. The two thus have a similar model footprint and on-par benchmark accuracy but differ by about 4x in computation cost.

Denoising. We compare our method with state-of-the-art image denoising methods: BM3D Dabov et al. (2007), WNNM Gu et al. (2014) and DnCNN Zhang et al. (2017). As shown in Table 5, our baseline model is a residual network with 16 blocks, 32 neurons per layer, and 2x width multiplier Yu et al. (2019a), and has a similar footprint to DnCNN but better performance because of residual connections. Our sparse- model with 2 sparsity groups and 1x width multiplier keeps the same model size as the baseline but gains a 2x computation reduction with better performance. Our sparse+ model adds 2 sparsity groups over the baseline model, doubles model capacity, and boosts performance with negligible computation cost.

| Dataset | Noise | BM3D | WNNM | DnCNN | Baseline | Sparse - | Sparse + |
|---|---|---|---|---|---|---|---|
| Set12 | 15 | 32.37 / 0.8952 | 32.70 / 0.8982 | 32.86 / 0.9031 | 32.97 / 0.9044 | 33.00 / 0.9048 | 33.04 / 0.9054 |
| | 25 | 29.97 / 0.8504 | 30.28 / 0.8557 | 30.44 / 0.8622 | 30.59 / 0.8655 | 30.63 / 0.8667 | 30.68 / 0.8676 |
| | 50 | 26.72 / 0.7676 | 27.05 / 0.7775 | 27.18 / 0.7829 | 27.40 / 0.7939 | 27.46 / 0.7954 | 27.51 / 0.7969 |
| BSD68 | 15 | 31.07 / 0.8717 | 31.37 / 0.8766 | 31.73 / 0.8907 | 31.79 / 0.8925 | 31.81 / 0.8928 | 31.83 / 0.8931 |
| | 25 | 28.57 / 0.8013 | 28.83 / 0.8087 | 29.23 / 0.8278 | 29.30 / 0.8311 | 29.33 / 0.8319 | 29.35 / 0.8327 |
| | 50 | 25.62 / 0.6864 | 25.87 / 0.6982 | 26.23 / 0.7189 | 26.35 / 0.7272 | 26.37 / 0.7265 | 26.39 / 0.7274 |
| Urban100 | 15 | 32.35 / 0.9220 | 32.97 / 0.9271 | 32.68 / 0.9255 | 32.94 / 0.9309 | 32.96 / 0.9316 | 33.05 / 0.9324 |
| | 25 | 29.70 / 0.8777 | 30.39 / 0.8885 | 29.97 / 0.8797 | 30.33 / 0.8930 | 30.36 / 0.8932 | 30.48 / 0.8959 |
| | 50 | 25.95 / 0.7791 | 26.83 / 0.8047 | 26.28 / 0.7874 | 26.76 / 0.8118 | 26.84 / 0.8113 | 26.95 / 0.8122 |
| FLOPs (M) | | - | - | 0.55 | 0.59 | 0.30 | 0.60 |

Table 5: Benchmark image denoising results in PSNR / SSIM for various noise levels. Training and testing protocols follow Zhang et al. (2017). The best results are in bold and the second best are underlined.

Compression artifact removal. We compare our method with state-of-the-art image compression artifact removal methods: JPEG, SA-DCT Foi et al. (2007), ARCNN Dong et al. (2015) and DnCNN Zhang et al. (2017). As shown in Table 6, baseline and sparse- models have the same structure as the ones in denoising. Our method consistently saves computation and improves performance on all the benchmark datasets and different JPEG compression qualities.

| Dataset | Quality | JPEG | SA-DCT | ARCNN | DnCNN | Baseline | Sparse - |
|---|---|---|---|---|---|---|---|
| LIVE1 | 10 | 27.77 / 0.7905 | 28.65 / 0.8093 | 28.98 / 0.8217 | 29.19 / 0.8123 | 29.36 / 0.8179 | 29.39 / 0.8183 |
| | 20 | 30.07 / 0.8683 | 30.81 / 0.8781 | 31.29 / 0.8871 | 31.59 / 0.8802 | 31.73 / 0.8832 | 31.79 / 0.8839 |
| | 30 | 31.41 / 0.9000 | 32.08 / 0.9078 | 32.69 / 0.9166 | 32.98 / 0.9090 | 33.17 / 0.9116 | 33.21 / 0.9121 |
| | 40 | 32.35 / 0.9173 | 32.99 / 0.9240 | 33.63 / 0.9306 | 33.96 / 0.9247 | 34.18 / 0.9273 | 34.23 / 0.9276 |
| Classic5 | 10 | 27.82 / 0.7800 | 28.88 / 0.8071 | 29.04 / 0.8111 | 29.40 / 0.8026 | 29.54 / 0.8085 | 29.56 / 0.8087 |
| | 20 | 30.12 / 0.8541 | 30.92 / 0.8663 | 31.16 / 0.8694 | 31.63 / 0.8610 | 31.72 / 0.8634 | 31.72 / 0.8635 |
| | 30 | 31.48 / 0.8844 | 32.14 / 0.8914 | 32.52 / 0.8967 | 32.91 / 0.8861 | 33.07 / 0.8885 | 33.08 / 0.8891 |
| | 40 | 32.43 / 0.9011 | 33.00 / 0.9055 | 33.34 / 0.9101 | 33.77 / 0.9003 | 33.94 / 0.9028 | 33.96 / 0.9031 |
| FLOPs (M) | | - | - | 0.11 | 0.55 | 0.59 | 0.30 |

Table 6: Compression artifacts reduction benchmark results in PSNR / SSIM for various compression qualities. Training and testing protocols follow Zhang et al. (2017). The best results are in bold.

5 Conclusions

In this paper, we have presented a method that structurally enforces sparsity constraints upon hidden neurons to achieve sparse representation in deep neural networks. Our method trades off between sparsity and differentiability, and is jointly learnable with deep networks iteratively. Our method is packaged as a standalone module that can substitute for convolution layers in various models. Evaluation and visualization both illustrate the importance of sparsity in hidden representation for multiple image restoration tasks. The improved sparsity further enables optimization of model efficiency and capacity simultaneously.


  • [1] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pp. 135.1–135.10. Cited by: §4.1.
  • [2] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing 16 (8), pp. 2080–2095. Cited by: §4.3.
  • [3] C. Dong, Y. Deng, C. Change Loy, and X. Tang (2015) Compression artifacts reduction by a deep convolutional network. In ICCV, pp. 576–584. Cited by: §4.3.
  • [4] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In ECCV, pp. 184–199. Cited by: §1, §2.1.
  • [5] M. Elad and M. Aharon (2006) Image denoising via learned dictionaries and sparse representation. In CVPR, Vol. 1, pp. 895–900. Cited by: §1.
  • [6] A. Foi, V. Katkovnik, and K. Egiazarian (2007) Pointwise shape-adaptive dct for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing 16 (5), pp. 1395–1411. Cited by: §4.3.
  • [7] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §2.2.
  • [8] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §1, §2.3.
  • [9] S. Gu, L. Zhang, W. Zuo, and X. Feng (2014) Weighted nuclear norm minimization with application to image denoising. In CVPR, pp. 2862–2869. Cited by: §4.3.
  • [10] Y. Guo, C. Zhang, C. Zhang, and Y. Chen (2018) Sparse dnns with improved adversarial robustness. In Advances in neural information processing systems, pp. 242–251. Cited by: §2.2.
  • [11] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, §2.2.
  • [12] T. He, Y. Fan, Y. Qian, T. Tan, and K. Yu (2014) Reshaping deep neural network for fast decoding by node-pruning. In ICASSP, pp. 245–249. Cited by: §2.2.
  • [13] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §2.3, §3.1, §3.3.
  • [14] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In CVPR, pp. 5197–5206. Cited by: §4.1.
  • [15] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-Softmax. In ICLR. Cited by: §3.2.
  • [16] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: §4.3.
  • [17] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4681–4690. Cited by: §1, §2.1.
  • [18] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPR workshops, pp. 136–144. Cited by: §2.1, §4.3.
  • [19] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In ICLR. Cited by: §2.2.
  • [20] J. Mairal, M. Elad, and G. Sapiro (2007) Sparse representation for color image restoration. IEEE Transactions on Image Processing 17 (1), pp. 53–69. Cited by: §1.
  • [21] J. Mairal, G. Sapiro, and M. Elad (2008) Learning multiscale sparse representations for image and video restoration. Multiscale Modeling & Simulation 7 (1), pp. 214–241. Cited by: §1.
  • [22] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Vol. 2, pp. 416–423. Cited by: §4.1.
  • [23] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Cited by: §2.3.
  • [24] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15 (11), pp. 3440–3451. Cited by: §4.1.
  • [25] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR workshops, pp. 114–125. Cited by: §4.1.
  • [26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.1.
  • [27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500. Cited by: §3.3.
  • [28] B. Yang, G. Bender, Q. V. Le, and J. Ngiam (2019) CondConv: conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems, pp. 1305–1316. Cited by: §3.2, §4.2.
  • [29] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang (2012) Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21 (8), pp. 3467–3478. Cited by: §1, §2.1.
  • [30] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873. Cited by: §1, §2.1, §3.1, §4.1.
  • [31] J. Yu, Y. Fan, and T. Huang (2019) Wide activation for efficient image and video super-resolution. In BMVC. Cited by: §4.2, §4.3.
  • [32] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2019) Slimmable neural networks. In ICLR. Cited by: §2.2.
  • [33] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730. Cited by: §1, §4.1.
  • [34] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §4.1, §4.3, Table 5, Table 6.
  • [35] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301. Cited by: §2.3, §3.1.
  • [36] C. Zhao, J. Zhang, S. Ma, X. Fan, Y. Zhang, and W. Gao (2016) Reducing image compression artifacts by structural sparse representation and quantization constraint prior. IEEE Transactions on Circuits and Systems for Video Technology 27 (10), pp. 2057–2071. Cited by: §1.