Neural Sparse Representation for Image Restoration
Inspired by the robustness and efficiency of sparse representation in sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks. Our method structurally enforces sparsity constraints upon hidden neurons. The sparsity constraints are favorable for gradient-based learning algorithms and attachable to convolution layers in various networks. Sparsity in neurons enables computation saving by only operating on non-zero components without hurting accuracy. Meanwhile, our method can magnify representation dimensionality and model capacity with negligible additional computation cost. Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks, including image super-resolution, image denoising, and image compression artifacts removal. Code is available at https://github.com/ychfan/nsr
Sparse representation plays a critical role in image restoration problems, such as image super-resolution Yang et al. (2010); Zeyde et al. (2010); Yang et al. (2012), denoising Elad and Aharon (2006), compression artifacts removal Zhao et al. (2016), and many others Mairal et al. (2007, 2008). These tasks are inherently ill-posed, where the input signal usually has insufficient information while the output has infinitely many solutions w.r.t. the same input. Thus, it is commonly believed that sparse representation is more robust to handle the considerable diversity of solutions.
Sparse representation in sparse coding is typically high-dimensional but has only a limited number of non-zero components. Input signals are represented as sparse linear combinations of tokens from a dictionary. High dimensionality implies a larger dictionary and generally leads to better restoration accuracy, since a larger dictionary samples the underlying signal space more thoroughly and thus represents any query signal more precisely. In addition, sparsity, which limits the number of non-zero elements, works as an essential image prior that has been extensively investigated and exploited to make restoration robust. Sparsity also brings computational efficiency, because zero components can be ignored.
Deep convolutional neural networks for image restoration extend the sparse coding based methods with repeatedly cascaded structures. The deep network based approach was first introduced to improve performance in Dong et al. (2014) and was conceptually connected with previous sparse coding based methods. A simple network with two convolutional layers bridged by a non-linear activation layer can be interpreted as follows: the activation denotes the sparse representation, the non-linearity enforces sparsity, and the convolutional kernels constitute the dictionary. SRResNet Ledig et al. (2017) extended the basic structure with a skip connection to form a residual block and cascaded a large number of blocks to construct very deep residual networks.
Sparsity of hidden representation in deep neural networks cannot be obtained by iterative optimization as in sparse coding, since deep networks are feed-forward during inference. Sparsity of neurons is commonly achieved by the ReLU activation Glorot et al. (2011), which thresholds negative values to zero independently in each neuron. Still, its 50% sparsity on random vectors is far from the sparsity definition on the overall number of non-zero components. In contrast, sparsity constraints are more actively applied to model parameters to achieve network pruning Han et al. (2015). However, hidden representation dimensionality is reduced in pruned networks, and accuracy may suffer.
In this paper, we propose a method that structurally enforces sparsity constraints upon hidden neurons in deep networks while keeping the representation high-dimensional. Given high-dimensional neurons, we divide them into groups along channels and allow only one group of neurons to be non-zero at a time. The adaptive selection of the non-sparse group is modeled by tiny side networks upon context features. Computation is also saved because it is performed only on the non-zero group. However, the selecting operation is not differentiable, so it is difficult to embed the side networks for joint training. We therefore relax the sparse constraints to soft ones, which approximately reduce to a sparse linear combination of multiple convolution kernels instead of a hard selection. We further introduce an additional cardinal dimension that decomposes sparsity prediction into sub-problems by splitting each sparse group and concatenating the results after a cardinal-independent combination of parameters.
To demonstrate the significance of neural sparse representation, we conduct extensive experiments on image restoration tasks, including image super-resolution, denoising, and compression artifacts removal. Our experiments conclude that: (1) dedicated constraints are essential to achieve neural sparse representation and benefit deep networks; (2) our method can significantly reduce computation cost and improve accuracy, given the same model footprint; (3) our method can dramatically enlarge the model capacity and boost accuracy with negligible additional computation cost.
Here we briefly review the application of sparsity in image restoration and its relation to convolutional networks. Considering image super-resolution as an example of image restoration, the sparse coding based method Yang et al. (2010) assumes that the input image signal $X$ can be represented by a sparse linear combination $\alpha$ over a dictionary $D_1$, which typically is learned from training images as
$$D_1^* = \arg\min_{D_1} \|X - D_1 \alpha\|_2^2 + \lambda \|\alpha\|_0. \tag{1}$$
In Yang et al. (2012), a coupled dictionary, $D_2$, for the restored image signal $Y$ is jointly learned with $D_1$ as well as its sparse representation $\alpha$ by
$$D_1^*, D_2^* = \arg\min_{D_1, D_2} \|X - D_1 \alpha\|_2^2 + \|Y - D_2 \alpha\|_2^2 + \lambda \|\alpha\|_0. \tag{2}$$
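As a rough illustration of how a sparse code with a bounded number of non-zero components can be obtained for a fixed dictionary, the sketch below uses orthogonal matching pursuit, a standard greedy solver that is swapped in here for illustration and is not necessarily the solver used in the cited works. The function name `omp`, the dictionary size, and the orthonormal toy dictionary are all assumptions chosen to keep the demo exact.

```python
import numpy as np

def omp(dictionary, signal, n_nonzero):
    """Greedy sparse coding: use at most `n_nonzero` dictionary atoms (columns)."""
    support = []                                  # indices of the selected atoms
    coef = np.zeros(dictionary.shape[1])
    residual = signal.astype(float).copy()
    for _ in range(n_nonzero):
        scores = np.abs(dictionary.T @ residual)  # correlation of every atom with the residual
        if support:
            scores[support] = -np.inf             # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # least-squares fit of the signal on the atoms chosen so far
        sub = dictionary[:, support]
        weights, *_ = np.linalg.lstsq(sub, signal, rcond=None)
        residual = signal - sub @ weights
    coef[support] = weights
    return coef

rng = np.random.default_rng(0)
# Toy dictionary with orthonormal atoms (an assumption that makes recovery exact).
atoms, _ = np.linalg.qr(rng.standard_normal((16, 16)))
true_coef = np.zeros(16)
true_coef[[3, 11]] = [2.0, -1.5]
x = atoms @ true_coef
alpha = omp(atoms, x, n_nonzero=2)
print(np.count_nonzero(alpha), np.allclose(atoms @ alpha, x))
```

Real dictionaries for image restoration are usually overcomplete and learned from data, in which case greedy recovery is only approximate.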
Convolutional networks, which consist of stacked convolutional layers and non-linear activation functions, can be interpreted with the concepts from sparse coding Dong et al. (2014). Given, for instance, a small piece of network with two convolutional layers with kernels $W_1$, $W_2$ and a non-linear function $F$, the image restoration process can be formalized as
$$Y = W_2 * F(W_1 * X). \tag{3}$$
The convolution operation $*$ with $W_1$ is equivalent to projecting the input image signal $X$ onto dictionary $D_1$. The convolution operation $*$ with $W_2$ corresponds to the projection of the signal representation onto dictionary $D_2$. This structure of two convolutional layers is widely used as a basic residual block and stacked to form very deep residual networks in recent advances Ledig et al. (2017); Lim et al. (2017) in image restoration.
Dimensionality of hidden representation or number of kernels in each convolutional layer determines the size of dictionary memory and learning capacity of models. However, unlike sparse coding, representation dimensionality in deep models is usually restricted by running speed or memory usage.
Exploring the sparsity of model parameters can potentially improve robustness Guo et al. (2018), but sparsity in parameters is neither sufficient nor necessary for sparse representation. Furthermore, group sparsity upon channels and suppression of parameters close to zero can achieve node pruning He et al. (2014); Han et al. (2015); Frankle and Carbin (2019); Liu et al. (2019); Yu et al. (2019b), which dramatically reduces inference computation cost. Despite its efficiency, node pruning proportionally reduces representation dimensionality rather than increasing sparsity, limits representation diversity, and leads to accuracy regression.
Thresholding functions, ReLU Nair and Hinton (2010) for example, play a similar role in imposing sparsity constraints Glorot et al. (2011) by setting negative values to zero, and contribute to significant performance improvement over previous activation functions, e.g., hyperbolic tangent. However, ReLU statistically gives only 50% sparsity over random vectors, which leaves a significant gap to the sparsity definition in Eq. 1. Gating mechanisms, in Squeeze-and-Excitation Hu et al. (2018); Zhang et al. (2018) for example, scale hidden neurons with adaptive sigmoid gates and slightly improve sparsity in addition to noticeably improving accuracy. Both thresholding and gating are applied independently to hidden neurons and cannot inherently guarantee the global sparsity in Eq. 1.
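The 50% figure can be checked numerically. The sketch below, with illustrative sizes that are our own assumption, measures the fraction of zeros after ReLU on random Gaussian vectors and contrasts it with a rule that keeps only one of several channel groups per vector.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def zero_fraction(x):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(x == 0.0))

rng = np.random.default_rng(0)
hidden = rng.standard_normal((1000, 256))   # 1000 random pre-activation vectors
relu_sparsity = zero_fraction(relu(hidden))

# Group sparsity for comparison: keep only one of n_groups channel groups per vector.
n_groups = 4
groups = hidden.reshape(1000, n_groups, 256 // n_groups)
energy = np.abs(groups).sum(axis=2)         # total amplitude of every group
keep = energy.argmax(axis=1)                # index of the group that is kept
mask = np.zeros_like(groups)
mask[np.arange(1000), keep] = 1.0
group_sparsity = zero_fraction((groups * mask).reshape(1000, 256))

print(relu_sparsity, group_sparsity)
```

With four groups, exactly three quarters of the entries are zero by construction, whereas the ReLU fraction only hovers around one half and is not guaranteed for any individual vector.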
We propose novel sparsity constraints to achieve sparse representation in deep neural networks. Relaxed soft restrictions are more friendly to gradient-based training. Additional cardinal dimension refines the constraints and improves the diversity of sparse representation.
Unlike the methods discussed in Section 2.3, which only consider local sparsity for each neuron independently, our approach enforces global sparsity between groups. Specifically, the hidden neurons are divided into $k$ groups with $c$ nodes in each group, and only one group is allowed to contain non-zero values. Correspondingly, the convolution kernels can also be divided according to the hidden neurons they connect to. Then only the kernels connected to non-zero neurons need to be computed. Formally, for the network structure in Eq. 3, the kernels of the two layers, $W_1$ and $W_2$, are divided as $W_1 = [(W_1^1)^T, (W_1^2)^T, \dots, (W_1^k)^T]^T$ and $W_2 = [W_2^1, W_2^2, \dots, W_2^k]$. Then Eq. 3 can be rewritten as
$$Y = [W_2^1, W_2^2, \dots, W_2^k] * F\big([(W_1^1)^T, (W_1^2)^T, \dots, (W_1^k)^T]^T * X\big). \tag{4}$$
When the sparsity constraints only allow the $i$th group of neurons to have non-zero components, Eq. 4 can be reduced, as shown in Figure 1, and formally as
$$Y = W_2^i * F(W_1^i * X). \tag{5}$$
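The reduction from Eq. 4 to Eq. 5 can be verified with a minimal sketch. It assumes 1x1 convolutions, so that each convolution becomes a matrix product over pixels, and uses ReLU as the non-linearity; the array names and sizes are illustrative assumptions rather than settings from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
k, c = 4, 8                        # k groups, c hidden channels per group (illustrative)
c_in, c_out, n_pix = 3, 3, 100

x = rng.standard_normal((c_in, n_pix))     # input: channels x pixels
w1 = rng.standard_normal((k, c, c_in))     # first-layer kernels, split into k groups
w2 = rng.standard_normal((k, c_out, c))    # second-layer kernels, split the same way

def relu(t):
    return np.maximum(t, 0.0)

def full_forward(x, w1, w2, active):
    """Compute all k*c hidden channels, then zero every group except `active`."""
    hidden = relu(np.einsum("kci,ip->kcp", w1, x))   # (k, c, n_pix)
    mask = np.zeros((k, 1, 1))
    mask[active] = 1.0
    return np.einsum("koc,kcp->op", w2, hidden * mask)

def reduced_forward(x, w1, w2, active):
    """Only touch the kernels of the active group."""
    return w2[active] @ relu(w1[active] @ x)

i = 2
y_full = full_forward(x, w1, w2, i)
y_reduced = reduced_forward(x, w1, w2, i)
print(np.allclose(y_full, y_reduced))
```

The reduced path multiplies by one group of kernels per layer instead of all of them, which is where the computation saving comes from.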
The proposed sparsity is supposed to pick the node group with the largest amplitude, which cannot be determined without computing the values of all the nodes. In our approach, the selection of the only non-zero group is instead modeled by a multi-layer perceptron (MLP) with respect to the input signal $X$.
Regular convolution operations need the kernels to be shared across all pixels. Hence the selection should also be identical across spatial positions. Inspired by the Squeeze-and-Excitation Hu et al. (2018); Zhang et al. (2018) operation, we propose to add a pooling operation before the MLP and a broadcasting operation for group selection. The above procedure can be formalized as
$$i = \arg\max_{j \in [1, k]} \mathrm{MLP}(\mathrm{Pool}(X))_j. \tag{6}$$
Note that, as in most patch-based algorithms Yang et al. (2010); Zhang et al. (2018) for image restoration, the pooling operation should be performed over a specific patch size instead of the whole image.
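The selection procedure of Eq. 6 can be sketched as follows. The function `select_group`, the single hidden layer of the MLP, and all sizes are hypothetical choices made for illustration; the text above only fixes the order of pooling, MLP, and selection.

```python
import numpy as np

def select_group(x, mlp_w1, mlp_b1, mlp_w2, mlp_b2):
    """x: (channels, height, width) feature patch. Returns (group index, logits)."""
    pooled = x.mean(axis=(1, 2))                        # average pooling over the patch
    hidden = np.maximum(mlp_w1 @ pooled + mlp_b1, 0.0)  # one hidden layer with ReLU (assumed depth)
    logits = mlp_w2 @ hidden + mlp_b2                   # one score per sparsity group
    return int(np.argmax(logits)), logits

rng = np.random.default_rng(0)
channels, k, hidden_dim = 16, 4, 8
params = (rng.standard_normal((hidden_dim, channels)), np.zeros(hidden_dim),
          rng.standard_normal((k, hidden_dim)), np.zeros(k))
patch = rng.standard_normal((channels, 48, 48))
group, logits = select_group(patch, *params)
print(group, logits.shape)
```

Because the index is computed once per patch, the same group is broadcast to every pixel of that patch, so the selected kernels stay shared spatially.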
Comparison to thresholding and gating. With $k$ sparsity groups, the proposed method limits the number of non-zero entries to at most $1/k$ of all the nodes in the hidden-layer representation, which is closer to the sparsity definition in Eq. 1 than the thresholding and gating methods discussed in Section 2.3. The proposed method also reduces computation cost by $k$ times by only considering the adaptively selected group, which is not possible with thresholding and gating methods.
Comparison to node pruning. Node pruning is designed to eliminate activation nodes by zeroing all the related trainable parameters. The pruned nodes stay at zero no matter how the input signal varies, which substantially reduces representation dimensionality. In our method, the sparsity pattern adaptively depends on the input. Although our method inherently keeps the representation high-dimensional, it saves computation and memory cost like a narrow model.
Similar to the L0-norm in sparse coding, the adaptive sparse group selection in Eq. 6 is not differentiable and therefore cannot be jointly learned with neural networks by gradient descent. Although the Gumbel trick Jang et al. (2017) has been proposed to re-parameterize the $\arg\max$ with respect to a conditional probability distribution, it does not achieve convincing results in our experiment settings.
The sparsity constraints are relaxed by substituting the selection with softmax as a smooth approximation of max. Instead of predicting an index over $k$ groups, the MLP is relaxed to predict a probability $\beta$ over the groups with the softmax function $\sigma(\cdot)$ by
$$\beta = \sigma(\mathrm{MLP}(\mathrm{Pool}(X))) \in [0, 1]^k. \tag{7}$$
Then, the two-layer structure in Eq. 4 is updated to an adaptive weighted sum of groups as
$$Y = \sum_{i=1}^{k} \beta_i \, W_2^i * F(W_1^i * X). \tag{8}$$
With weighted summation, Eq. 8 cannot be directly reduced as in Eq. 5, since none of the group weights is exactly zero. Fortunately, given the sparse assumption on the softmax outputs, $\exists\, i$ s.t. $\beta_i \gg \beta_j \to 0, \forall j \neq i$, and a piece-wise linear activation function $F$, ReLU for example, it can be proved that the weighted sum of hidden neurons can be approximately reduced to a weighted sum of parameters $W^i$, as shown in Figure 2, and formally as
$$Y \approx \Big(\sum_{i=1}^{k} \beta_i W_2^i\Big) * F\Big(\Big(\sum_{i=1}^{k} \beta_i W_1^i\Big) * X\Big). \tag{9}$$
Note that the two $\beta$ applied to $W_1$ and $W_2$ do not need to be identical to achieve the approximation. Our experiments show that independently predicting weights for $W_1$ and $W_2$ benefits accuracy.
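The quality of this approximation can be probed numerically. The sketch below again assumes 1x1 convolutions written as matrix products, ReLU as the activation, and random kernels; it compares the weighted sum of per-group outputs with the output of a single network built from mixed kernels, once for nearly one-hot weights and once for nearly uniform weights. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, c, c_in, c_out, n_pix = 4, 8, 3, 3, 200
x = rng.standard_normal((c_in, n_pix))
w1 = rng.standard_normal((k, c, c_in))
w2 = rng.standard_normal((k, c_out, c))

def relu(t):
    return np.maximum(t, 0.0)

def softmax(z, temperature=1.0):
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def sum_of_neurons(beta):
    # weighted sum over groups of per-group outputs: needs all k groups
    per_group = np.stack([w2[i] @ relu(w1[i] @ x) for i in range(k)])
    return np.tensordot(beta, per_group, axes=1)

def sum_of_kernels(beta):
    # mix the kernels first, then run a single narrow two-layer network
    w1_mix = np.tensordot(beta, w1, axes=1)
    w2_mix = np.tensordot(beta, w2, axes=1)
    return w2_mix @ relu(w1_mix @ x)

def relative_error(beta):
    exact, approx = sum_of_neurons(beta), sum_of_kernels(beta)
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)

logits = np.array([4.0, 0.0, 0.0, 0.0])
err_sparse = relative_error(softmax(logits, temperature=0.2))    # nearly one-hot weights
err_dense = relative_error(softmax(logits, temperature=100.0))   # nearly uniform weights
print(err_sparse, err_dense)
```

The gap between the two forms shrinks as the weights concentrate on one group, which is why the approximation relies on the sparse assumption for the softmax outputs.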
In this way, networks restricted by soft sparse constraints can be as efficient as those with hard constraints. And the only additional computation cost from the interpolation of convolution kernels is negligible comparing with convolution operations with the image.
Comparison to conditional convolution. CondConv Yang et al. (2019) uses a similar adaptive weighted sum of convolution kernels as our relaxed soft sparsity approach. However, CondConv uses the sigmoid function to normalize the kernel weights instead of the softmax function in our method. Hence, no sparsity constraints are explicitly applied in CondConv, and our experiments show that sparsity is very important for model accuracy.
Modeling sparsity between groups with a simple MLP is challenging, especially when the dimensionality $c$ per group grows. Also, binding channels within pre-defined groups limits the diversity of the sparsity patterns. Inspired by group convolution in ResNeXt Xie et al. (2017), we split the $c$ nodes per sparsity group into $d$ cardinal groups, and each cardinal group with $c/d$ nodes is independently constrained along the $k$ sparsity groups, as shown in Figure 3. Formally, the averaging weights are extended to a matrix $\beta \in \mathbb{R}^{d \times k}$ with rows $\beta_i = \sigma(\mathrm{MLP}_i(\mathrm{Pool}(X)))$, then the weighted averaged convolution kernel becomes
$$W = \mathrm{concat}_{i=1}^{d} \Big(\sum_{j=1}^{k} \beta_{i,j} W^{i,j}\Big), \tag{10}$$
where $\beta_{i,j}$ and $W^{i,j}$ are the weight and the kernel of the $i$th cardinal group and $j$th sparsity group, and $\mathrm{concat}$ is the concatenation operation along the axis of output channels. Notably, with cardinal grouping, the Squeeze-and-Excitation Hu et al. (2018) operation becomes a particular case of our approach when $d = c$ and $k = 1$, and the MLP activation is substituted with the sigmoid function.
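The kernel combination with cardinal groups in Eq. 10 can be sketched as below. The function `combine_kernels`, the tensor layout (sparsity group, output channel, remaining kernel axes), and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """
    kernels: (k, c, ...) kernels, k sparsity groups with c output channels each.
    weights: (d, k) combination weights, one distribution per cardinal group.
    Returns one kernel of shape (c, ...).
    """
    k, c = kernels.shape[:2]
    rest = kernels.shape[2:]
    d = weights.shape[0]
    assert c % d == 0, "output channels must split evenly into cardinal groups"
    # split the c output channels of every sparsity group into d cardinal groups
    split = kernels.reshape(k, d, c // d, -1)
    # weighted sum over sparsity groups, independently for each cardinal group
    mixed = np.einsum("dk,kdmr->dmr", weights, split)
    # concatenate the cardinal groups back along the output-channel axis
    return mixed.reshape((c,) + rest)

rng = np.random.default_rng(0)
kernels = rng.standard_normal((3, 4, 2))      # k=3 sparsity groups, c=4 channels, 2 inputs
one_hot = np.array([[1.0, 0.0, 0.0],          # cardinal group 0 picks sparsity group 0
                    [0.0, 0.0, 1.0]])         # cardinal group 1 picks sparsity group 2
combined = combine_kernels(kernels, one_hot)
print(combined.shape)
```

With one-hot weights, different halves of the output channels come from different sparsity groups, a pattern that a single set of weights shared by all channels cannot express.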
Datasets and benchmarks. We use multiple datasets for image super-resolution, denoising, and compression artifacts removal separately. For image super-resolution, models are trained with the DIV2K Timofte et al. (2017) dataset, which contains 800 high-quality (2K resolution) images. DIV2K also comes with 100 validation images, which are used for the ablation study. The datasets for benchmark evaluation include Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2010), BSD100 Martin et al. (2001) and Urban100 Huang et al. (2015) with three up-scaling factors: x2, x3 and x4. For image denoising, the training set consists of 200 images from the training split and 200 images from the testing split of the Berkeley Segmentation Dataset (BSD) Martin et al. (2001), as in Zhang et al. (2017). The datasets for benchmark evaluation include Set12, BSD68 Martin et al. (2001) and Urban100 Huang et al. (2015) with additive white Gaussian noise (AWGN) of level 15, 25 and 50. For compression artifacts removal, the training set consists of the 91 images in Yang et al. (2010) and the 200 training images in Martin et al. (2001). The datasets for benchmark evaluation include LIVE1 Sheikh et al. (2006) and Classic5 with JPEG compression quality 10, 20, 30 and 40. Evaluation metrics include PSNR and SSIM Wang et al. (2004) for predicted image quality in luminance or grayscale; only DIV2K is evaluated in RGB channels. FLOPs per pixel is used to measure efficiency, because the runtime complexity is proportional to the input image size for fully convolutional models.
Training settings. Models are trained with natural images and their degraded counterparts. Online data augmentation includes random flipping and rotation during training. In each epoch, 100 patches are randomly sampled per image, and training runs for 30 epochs in total. Models are optimized with L1 distance and the ADAM optimizer. The initial learning rate is 0.001 and is multiplied by 0.2 at epochs 20 and 25.
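The step schedule described in the training settings can be written as a small helper. The function name and the convention that epochs are counted from zero are assumptions; only the initial rate, the decay factor, and the milestone epochs come from the text.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.2, milestones=(20, 25)):
    """Step schedule: multiply the rate by `decay` once for every milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * decay ** passed

# Learning rate for each of the 30 training epochs (0-indexed, an assumption).
schedule = [learning_rate(e) for e in range(30)]
print(schedule[0], schedule[20], schedule[29])
```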
We conduct an ablation study to demonstrate the significance of neural sparse representation. The experiments are evaluated on the DIV2K validation set for image super-resolution with x2 up-scaling under PSNR. We use WDSR Yu et al. (2019a) networks with 16 residual blocks, 32 neurons and a 4x width multiplier as the baseline, and set $k = 4$ sparsity groups by default.
Sparsity constraints. Sparsity constraints are essential for representation sparsity. We implement the hard sparsity constraints with Gumbel-softmax to simulate the gradient of hardmax and compare it with soft sparsity achieved by the softmax function. The temperature in softmax also controls the sharpness of the output distribution. When the temperature is small, softmax outputs are sharper and closer to hardmax, and thus the gradient vanishes. When the temperature is large, softmax outputs are smoother, which contradicts the sparsity assumption behind the approximation in Eq. 9. We also compare them with a similar model that, as in CondConv Yang et al. (2019), uses the sigmoid function as the MLP activation instead of sparsity constraints. Results in Table 1 show that Gumbel-based hard-sparsity methods are not feasible and are even worse than the baseline without sparsity groups. The temperature needs to be initialized with a proper value to achieve better results, which coincides with the above analysis. Sigmoid also gets worse results than softmax because sigmoid cannot guarantee sparsity, which also agrees with our comparison in the previous section.
Sparsity | N/A | Sigmoid | Gumbel | Softmax | Softmax | Softmax |
---|---|---|---|---|---|---|
PSNR | 34.76 | 34.81 | 34.45 | 34.86 | 34.87 | 34.83 |
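The role of the temperature can be illustrated with a few lines of NumPy. The logits and temperature values below are arbitrary assumptions and are not the settings behind Table 1; the helper `top_gradient` evaluates the derivative of the largest softmax output with respect to its own logit, which equals $p(1-p)/T$.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def top_gradient(logits, temperature):
    """Derivative of the largest softmax output w.r.t. its own logit: p (1 - p) / T."""
    p = softmax(logits, temperature).max()
    return p * (1.0 - p) / temperature

logits = [2.0, 1.0, 0.5, 0.0]
for t in (0.1, 1.0, 10.0):
    print(t, np.round(softmax(logits, t), 3), top_gradient(logits, t))
```

At a low temperature the output is nearly one-hot and the gradient through the dominant entry is close to zero, while at a high temperature the weights approach a uniform distribution and are no longer sparse.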
Cardinality. The cardinal dimension reduces the actual dimensionality and dependency between channels in sparsity groups and improves the diversity of linear combination weights over convolution kernels. Results of models with different cardinalities in Fig. 4 show that increasing cardinality consistently benefits accuracy. We also compare with the Squeeze-and-Excitation (SE) model, which is a special case of our method, under the same FLOPs, and our models significantly outperform the SE model.
Efficiency. Our method can approximately save computation by $k$ times with $k$ sparsity groups while keeping the same model size, i.e., number of parameters. Results in Table 2 have the same model size in each column and show that our method can save at least half of the computation without hurting accuracy, uniformly across model sizes.
Capacity. Our method can also extend model capacity, i.e., number of parameters, by $k$ times with $k$ sparsity groups at only negligible additional computation cost. Results in Table 3 have the same computation cost in each column and show that our method can continually improve accuracy by extending model capacity up to 16 times.
Group size | 2 blocks | 4 blocks | 8 blocks | 16 blocks |
---|---|---|---|---|
N/A | 33.91 | 34.29 | 34.56 | 34.76 |
2 | 33.92 | 34.30 | 34.57 | 34.77 |
3 | 33.86 | 34.23 | 34.51 | 34.71 |
4 | 33.81 | 34.21 | 34.41 | 34.68 |
# params (M) | 0.15 | 0.30 | 0.60 | 1.2 |

Group size | 2 blocks | 4 blocks | 8 blocks | 16 blocks |
---|---|---|---|---|
N/A | 33.91 | 34.29 | 34.56 | 34.76 |
2 | 33.98 | 34.38 | 34.65 | 34.83 |
4 | 34.07 | 34.45 | 34.70 | 34.87 |
8 | 34.14 | 34.50 | 34.74 | 34.89 |
16 | 34.17 | 34.56 | 34.77 | 34.91 |
FLOPs (M) | 0.15 | 0.30 | 0.60 | 1.2 |
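The two trade-offs can be made concrete with a back-of-the-envelope cost model for one two-layer residual block. The helper `block_cost` is hypothetical; it assumes 3x3 kernels, counts multiply-accumulates per pixel, and ignores biases, the side MLP, and the kernel mixing, which the text treats as negligible. The widths follow the ablation baseline (32 neurons with a 4x width multiplier).

```python
def block_cost(channels, hidden, groups=1, kernel=3):
    """
    Parameters and per-pixel multiply-accumulates of one two-layer block.
    `hidden` is the hidden width actually computed per pixel (one sparsity group);
    `groups` is the number of kernel sets that are stored.
    """
    per_layer = kernel * kernel * channels * hidden
    macs_per_pixel = 2 * per_layer          # only one group is evaluated
    params = groups * 2 * per_layer         # every group has to be stored
    return params, macs_per_pixel

baseline = block_cost(channels=32, hidden=128)              # no sparsity groups
capacity = block_cost(channels=32, hidden=128, groups=4)    # more parameters, same compute
efficiency = block_cost(channels=32, hidden=32, groups=4)   # same parameters, less compute
print(baseline, capacity, efficiency)
```

Under these assumptions, adding four groups at fixed hidden width multiplies the parameter count by four at unchanged compute, while shrinking the per-group width by four keeps the parameter count and divides the compute by four.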
Visualization of kernel selection. Since it is difficult to directly visualize the sparsity of a high-dimensional hidden representation, we take the selection of kernels as a surrogate. As shown in Fig. 5, in the first block, the weights are almost binary everywhere and only depend on color and low-level cues. In later blocks, the weights are smoother and more attentive to high-frequency positions with more complicated texture. The last layer is more correlated with semantics, for example, tree branches in the first image and the lion in the second image.
In this section, we build our method on top of state-of-the-art methods and compare the results on image super-resolution, image denoising, and image compression artifact removal.
Dataset | Scale | Bicubic | VDSR | EDSR (S) | Sparse + | EDSR (L) | Sparse - |
---|---|---|---|---|---|---|---|
Set5 | x2 | 33.66 / 0.9299 | 37.53 / 0.9587 | 37.99 / 0.9604 | 38.02 / 0.9610 | 38.11 / 0.9601 | 38.23 / 0.9614 |
Set5 | x3 | 30.39 / 0.8682 | 33.66 / 0.9213 | 34.37 / 0.9270 | 34.43 / 0.9277 | 34.65 / 0.9282 | 34.62 / 0.9289 |
Set5 | x4 | 28.42 / 0.8104 | 31.35 / 0.8838 | 32.09 / 0.8938 | 32.25 / 0.8957 | 32.46 / 0.8968 | 32.55 / 0.8987 |
Set14 | x2 | 30.24 / 0.8688 | 33.03 / 0.9124 | 33.57 / 0.9175 | 33.60 / 0.9191 | 33.92 / 0.9195 | 33.94 / 0.9203 |
Set14 | x3 | 27.55 / 0.7742 | 29.77 / 0.8314 | 30.28 / 0.8418 | 30.37 / 0.8443 | 30.52 / 0.8462 | 30.57 / 0.8475 |
Set14 | x4 | 26.00 / 0.7027 | 28.01 / 0.7674 | 28.58 / 0.7813 | 28.66 / 0.7836 | 28.80 / 0.7876 | 28.79 / 0.7876 |
B100 | x2 | 29.56 / 0.8431 | 31.90 / 0.8960 | 32.16 / 0.8994 | 32.26 / 0.9008 | 32.32 / 0.9013 | 32.34 / 0.9020 |
B100 | x3 | 27.21 / 0.7385 | 28.82 / 0.7976 | 29.09 / 0.8052 | 29.15 / 0.8074 | 29.25 / 0.8093 | 29.26 / 0.8100 |
B100 | x4 | 25.96 / 0.6675 | 27.29 / 0.7251 | 27.57 / 0.7357 | 27.61 / 0.7372 | 27.71 / 0.7420 | 27.72 / 0.7414 |
Urban100 | x2 | 26.88 / 0.8403 | 30.76 / 0.9140 | 31.98 / 0.9272 | 32.57 / 0.9329 | 32.93 / 0.9351 | 33.02 / 0.9367 |
Urban100 | x3 | 24.46 / 0.7349 | 27.14 / 0.8279 | 28.15 / 0.8527 | 28.43 / 0.8587 | 28.80 / 0.8653 | 28.83 / 0.8663 |
Urban100 | x4 | 23.14 / 0.6577 | 25.18 / 0.7524 | 26.04 / 0.7849 | 26.24 / 0.7919 | 26.64 / 0.8033 | 26.61 / 0.8025 |
Manga109 | x2 | 30.80 / 0.9339 | 37.22 / 0.9750 | 38.55 / 0.9769 | 38.94 / 0.9776 | 39.10 / 0.9773 | 39.31 / 0.9782 |
Manga109 | x3 | 26.95 / 0.8556 | 32.01 / 0.9340 | 33.45 / 0.9439 | 33.77 / 0.9462 | 34.17 / 0.9476 | 34.27 / 0.9484 |
Manga109 | x4 | 24.89 / 0.7866 | 28.83 / 0.8870 | 30.35 / 0.9067 | 30.63 / 0.9106 | 31.02 / 0.9148 | 31.10 / 0.9145 |
DIV2K validation | x2 | 31.01 / 0.8923 | 33.66 / 0.9290 | 34.61 / 0.9372 | 34.87 / 0.9395 | 35.03 / 0.9407 | 35.07 / 0.9410 |
DIV2K validation | x3 | 28.22 / 0.8124 | 30.09 / 0.8590 | 30.92 / 0.8734 | 31.10 / 0.8767 | 31.26 / 0.8795 | 31.30 / 0.8797 |
DIV2K validation | x4 | 26.66 / 0.7512 | 28.17 / 0.8000 | 28.95 / 0.8178 | 29.10 / 0.8223 | 29.25 / 0.8261 | 29.29 / 0.8263 |
FLOPs (M) | | - | 0.67 | 1.4 | 1.4 | 43 | 9.5 |
Super-resolution. We build our method on top of EDSR Lim et al. (2017), a state-of-the-art single image super-resolution method, and also compare with bicubic upsampling and VDSR Kim et al. (2016). As shown in Table 4, the small EDSR (S) has 16 residual blocks and 64 neurons per layer, and our sparse+ model extends it to 4 sparsity groups with cardinality 16 and outperforms it on all benchmarks with 4x model capacity but negligible additional computation cost. The large EDSR (L) has 32 residual blocks and 256 neurons per layer, and our sparse- model has 32 residual blocks, 128 neurons per layer, and 4 sparsity groups with cardinality 16. The two have a similar model footprint and on-par benchmark accuracy, but differ by 4x in computation cost.
Denoising. We compare our method with state-of-the-art image denoising methods: BM3D Dabov et al. (2007), WNNM Gu et al. (2014) and DnCNN Zhang et al. (2017). As shown in Table 5, our baseline model is a residual network with 16 blocks, 32 neurons per layer, and a 2x width multiplier Yu et al. (2019a); it has a similar footprint to DnCNN but better performance because of residual connections. Our sparse- model with 2 sparsity groups and 1x width multiplier keeps the model size of the baseline but gains a 2x computation reduction with better performance. Our sparse+ model adds 2 sparsity groups over the baseline model, doubles model capacity, and boosts performance with negligible additional computation cost.
Dataset | Noise | BM3D | WNNM | DnCNN | Baseline | Sparse - | Sparse + |
---|---|---|---|---|---|---|---|
Set12 | 15 | 32.37 / 0.8952 | 32.70 / 0.8982 | 32.86 / 0.9031 | 32.97 / 0.9044 | 33.00 / 0.9048 | 33.04 / 0.9054 |
Set12 | 25 | 29.97 / 0.8504 | 30.28 / 0.8557 | 30.44 / 0.8622 | 30.59 / 0.8655 | 30.63 / 0.8667 | 30.68 / 0.8676 |
Set12 | 50 | 26.72 / 0.7676 | 27.05 / 0.7775 | 27.18 / 0.7829 | 27.40 / 0.7939 | 27.46 / 0.7954 | 27.51 / 0.7969 |
BSD68 | 15 | 31.07 / 0.8717 | 31.37 / 0.8766 | 31.73 / 0.8907 | 31.79 / 0.8925 | 31.81 / 0.8928 | 31.83 / 0.8931 |
BSD68 | 25 | 28.57 / 0.8013 | 28.83 / 0.8087 | 29.23 / 0.8278 | 29.30 / 0.8311 | 29.33 / 0.8319 | 29.35 / 0.8327 |
BSD68 | 50 | 25.62 / 0.6864 | 25.87 / 0.6982 | 26.23 / 0.7189 | 26.35 / 0.7272 | 26.37 / 0.7265 | 26.39 / 0.7274 |
Urban100 | 15 | 32.35 / 0.9220 | 32.97 / 0.9271 | 32.68 / 0.9255 | 32.94 / 0.9309 | 32.96 / 0.9316 | 33.05 / 0.9324 |
Urban100 | 25 | 29.70 / 0.8777 | 30.39 / 0.8885 | 29.97 / 0.8797 | 30.33 / 0.8930 | 30.36 / 0.8932 | 30.48 / 0.8959 |
Urban100 | 50 | 25.95 / 0.7791 | 26.83 / 0.8047 | 26.28 / 0.7874 | 26.76 / 0.8118 | 26.84 / 0.8113 | 26.95 / 0.8122 |
FLOPs (M) | | - | - | 0.55 | 0.59 | 0.30 | 0.60 |
Compression artifact removal. We compare our method with state-of-the-art image compression artifact removal methods, SA-DCT Foi et al. (2007), ARCNN Dong et al. (2015) and DnCNN Zhang et al. (2017), as well as the uncorrected JPEG input. As shown in Table 6, the baseline and sparse- models have the same structure as the ones in denoising. Our method consistently saves computation and improves performance on all the benchmark datasets and JPEG compression qualities.
Dataset | Quality | JPEG | SA-DCT | ARCNN | DnCNN | Baseline | Sparse - |
---|---|---|---|---|---|---|---|
LIVE1 | 10 | 27.77 / 0.7905 | 28.65 / 0.8093 | 28.98 / 0.8217 | 29.19 / 0.8123 | 29.36 / 0.8179 | 29.39 / 0.8183 |
LIVE1 | 20 | 30.07 / 0.8683 | 30.81 / 0.8781 | 31.29 / 0.8871 | 31.59 / 0.8802 | 31.73 / 0.8832 | 31.79 / 0.8839 |
LIVE1 | 30 | 31.41 / 0.9000 | 32.08 / 0.9078 | 32.69 / 0.9166 | 32.98 / 0.9090 | 33.17 / 0.9116 | 33.21 / 0.9121 |
LIVE1 | 40 | 32.35 / 0.9173 | 32.99 / 0.9240 | 33.63 / 0.9306 | 33.96 / 0.9247 | 34.18 / 0.9273 | 34.23 / 0.9276 |
Classic5 | 10 | 27.82 / 0.7800 | 28.88 / 0.8071 | 29.04 / 0.8111 | 29.40 / 0.8026 | 29.54 / 0.8085 | 29.56 / 0.8087 |
Classic5 | 20 | 30.12 / 0.8541 | 30.92 / 0.8663 | 31.16 / 0.8694 | 31.63 / 0.8610 | 31.72 / 0.8634 | 31.72 / 0.8635 |
Classic5 | 30 | 31.48 / 0.8844 | 32.14 / 0.8914 | 32.52 / 0.8967 | 32.91 / 0.8861 | 33.07 / 0.8885 | 33.08 / 0.8891 |
Classic5 | 40 | 32.43 / 0.9011 | 33.00 / 0.9055 | 33.34 / 0.9101 | 33.77 / 0.9003 | 33.94 / 0.9028 | 33.96 / 0.9031 |
FLOPs (M) | | - | - | 0.11 | 0.55 | 0.59 | 0.30 |
In this paper, we have presented a method that structurally enforces sparsity constraints upon hidden neurons to achieve sparse representation in deep neural networks. Our method trades off between sparsity and differentiability, and is jointly learnable with deep networks through iterative training. Our method is packed as a standalone module and can substitute for convolution layers in various models. Evaluation and visualization both illustrate the importance of sparsity in hidden representation for multiple image restoration tasks. The improved sparsity further enables optimization of model efficiency and capacity simultaneously.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.