Learning Sparse Masks for Efficient Image Super-Resolution

by Longguang Wang, et al.

Current CNN-based super-resolution (SR) methods process all locations equally, with computational resources being uniformly assigned in space. However, since high-frequency details mainly lie around edges and textures, fewer computational resources are required for flat regions. Therefore, existing CNN-based methods involve much redundant computation in flat regions, which increases their computational cost and limits their application on mobile devices. To address this limitation, we develop an SR network (SMSR) that learns sparse masks to prune redundant computation conditioned on the input image. Within our SMSR, spatial masks learn to identify "important" locations, while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately located and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41% of FLOPs reduced for ×2 SR.






1 Introduction

The goal of single image super-resolution (SR) is to recover a high-resolution (HR) image from a single low-resolution (LR) observation. Due to the powerful feature representation and model fitting capabilities of deep neural networks, CNN-based SR methods have achieved significant performance improvements over traditional ones. Recently, many efforts have been made toward real-world applications, including few-shot SR [Shocher2018Zero, Soh2020Meta], blind SR [Gu2019Blind, Zhang2020Deep], and scale-arbitrary SR [Hu2019Meta, Wang2020Learning]. With the popularity of intelligent edge devices like smartphones and wearable devices, efficient SR is also in great demand [Hui2018Fast, Ahn2018Fast].

Since the pioneering work of SRCNN [Dong2014Learning], deeper networks have been extensively studied for image SR. In VDSR [Kim2016Accurate], SRCNN was first deepened to 20 layers. Then, a very deep and wide architecture with over 60 layers was introduced in EDSR [Lim2017Enhanced]. Later, Zhang et al. further increased the network depth to over 100 and 400 layers in RDN [Zhang2018Residual] and RCAN [Zhang2018Image], respectively. Although a deep network usually improves SR performance, it also leads to high computational cost and limits applications on mobile devices. To address this problem, several efforts have been made to reduce model size through information distillation [Hui2018Fast] and efficient feature reuse [Ahn2018Fast]. Nevertheless, these networks still involve redundant computation. Compared to an HR image, the details missing from its LR counterpart mainly lie in edge and texture regions. Consequently, fewer computational resources are required in flat regions. However, these CNN-based SR methods process all locations equally, resulting in much redundant computation within flat regions.
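The claim that most of an image needs little computation can be made concrete with a minimal sketch (not part of SMSR; the function name and threshold are illustrative assumptions): counting the fraction of pixels whose local gradient magnitude is small shows how dominant flat regions typically are.

```python
import numpy as np

def flat_region_ratio(img, threshold=0.02):
    """Fraction of 'flat' pixels, i.e. pixels whose local gradient
    magnitude falls below a small (illustrative) threshold.

    img: 2D array with values in [0, 1].
    """
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)            # per-pixel gradient magnitude
    return float((mag < threshold).mean())

# A toy image: two flat halves separated by one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
ratio = flat_region_ratio(img)        # only the two columns at the edge are non-flat
```

On natural images this ratio is usually large, which is exactly the redundancy a uniform CNN cannot exploit.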

In this paper, we propose a sparse mask SR (SMSR) network that skips redundant computation for efficient image SR. Specifically, we learn spatial masks to dynamically identify “important” regions (e.g., edge and texture regions) and channel masks to mark redundant channels in those “unimportant” regions. These two kinds of masks work jointly to accurately locate redundant computation. During training, we soften these binary masks using the Gumbel softmax trick to make them differentiable. During inference, we use sparse convolution to skip the redundant computation.
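As a rough illustration of how such a binary spatial mask can be softened, here is a minimal NumPy sketch of the Gumbel softmax relaxation over a per-location {skip, keep} decision. This is a hedged sketch, not the authors' implementation: the function name, logit layout, and toy values are all assumptions.

```python
import numpy as np

def gumbel_softmax_mask(logits, tau=1.0, hard=True, rng=None):
    """Relax a binary spatial mask with the Gumbel softmax trick.

    logits: (H, W, 2) unnormalized scores for {skip, keep} per location.
    With hard=False the mask stays soft (a differentiable surrogate
    during training); with hard=True it is binarized, as at inference.
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau                  # perturbed, temperature-scaled logits
    y -= y.max(axis=-1, keepdims=True)           # numerically stable softmax
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    keep = soft[..., 1]                          # soft probability of "keep"
    return (keep > 0.5).astype(np.float32) if hard else keep

# Toy logits: a 2x2 "edge" region strongly favors "keep", the rest "skip".
logits = np.zeros((4, 4, 2))
logits[..., 0] = 20.0                 # strong "skip" score everywhere ...
logits[1:3, 1:3] = [0.0, 20.0]        # ... except a central "important" region
mask = gumbel_softmax_mask(logits, rng=np.random.default_rng(0))
```

Lowering `tau` sharpens the soft mask toward binary; in practice a straight-through estimator would pass gradients through the hard threshold.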

Our main contributions can be summarized as follows: 1) We develop an SMSR network to dynamically skip redundant computation for efficient image SR. 2) We propose to locate redundant computation by learning spatial and channel masks. These two kinds of masks work jointly for fine-grained localization of redundant computation. 3) Experimental results show that our SMSR achieves state-of-the-art performance with better inference efficiency. For example, our SMSR outperforms previous methods on the Set14 dataset with a notable FLOPs reduction and a speedup on mobile devices.

Figure 1: Absolute difference maps in the luminance channel.
Figure 2: Visualization of feature maps after the ReLU layer in the first backbone block of RCAN. Note that sparsity is defined as the ratio of activated pixels in the corresponding channels.

2 Related Work

In this section, we first review several major works on CNN-based single image SR. Then, we discuss CNN acceleration techniques related to our work, including adaptive inference and network pruning.

Single Image SR. CNN-based methods have dominated the research of single image SR due to their strong representation and fitting capabilities. Dong et al. [Dong2014Learning] first introduced CNNs to learn an LR-to-HR mapping for single image SR. Kim et al. [Kim2016Accurate] then proposed a deeper network with 20 layers (namely, VDSR). Recently, deeper networks have been extensively studied for image SR. Lim et al. [Lim2017Enhanced] proposed a very deep and wide network (namely, EDSR) by cascading modified residual blocks. Zhang et al. [Zhang2018Residual] further combined residual learning and dense connections to build RDN with over 100 layers. Although these networks achieve state-of-the-art performance, their high computational cost and memory footprint limit their applications on mobile devices. To address this problem, several lightweight networks have been developed [Lai2017Deep, Hui2018Fast, Ahn2018Fast]. Specifically, lightweight distillation blocks are used for feature learning in IDN [Hui2018Fast], while a cascading mechanism is introduced to encourage efficient feature reuse in CARN [Ahn2018Fast]. Different from these manually designed networks, Chu et al. [Chu2019Fast] developed a compact architecture using neural architecture search. Although existing lightweight SR networks successfully reduce the model size, they still involve redundant computation, which hinders them from achieving better computational efficiency.

Adaptive Inference. Adaptive inference techniques [Wang2018SkipNet, Ren2018SBNet, Mullapudi2018HydraNets, Graham20183D, Li2019Improved] have attracted increasing interest since they can adapt the network structure to the input. One active branch of adaptive inference techniques dynamically selects an inference path at the level of layers. Specifically, Wu et al. [Wu2018BlockDrop] proposed a BlockDrop approach for ResNets to dynamically drop several residual blocks for efficiency. Mullapudi et al. [Mullapudi2018HydraNets] proposed HydraNets with multiple branches and used a gating approach to dynamically choose a subset of them at test time. Another popular branch dynamically identifies “unimportant” regions and skips the computation within them. On top of ResNets, Figurnov et al. [Figurnov2017Spatially] proposed a spatially adaptive computation time (SACT) mechanism to stop computation at a spatial position once the features become “good enough”. Liu et al. [Liu2020Deep] introduced adaptive inference for SR by producing a map of local network depth to adapt the number of convolutional layers applied at different locations. However, these methods only focus on spatial redundancy without considering redundancy in the channel dimension.

Network Pruning. Network pruning [Han2015Learning, Liu2017Learning, Luo2017ThiNet] is widely used to remove redundant parameters. As a popular branch of network pruning methods, structured pruning approaches prune the network at the level of channels or even layers [Li2017Pruning, Liu2017Learning, Luo2017ThiNet, He2019Filter]. Specifically, Li et al. [Li2017Pruning] used the ℓ1 norm to measure the importance of different filters and then pruned the less important ones. Liu et al. [Liu2017Learning] imposed a sparsity constraint on the scaling factors of batch normalization layers and identified channels with lower scaling factors as less informative. Different from these static structured pruning methods, Lin et al. [Lin2017Runtime] conducted runtime neural network pruning conditioned on the input image. Recently, Gao et al. [Gao2019Dynamic] introduced a feature boosting and suppression method to dynamically prune unimportant channels at inference time. Nevertheless, these pruning methods treat all spatial locations equally, without taking their different importance into consideration.
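The ℓ1-norm criterion of Li et al. [Li2017Pruning] can be sketched in a few lines. This is a simplified NumPy illustration under stated assumptions: the function name and keep ratio are made up here, and a real pruner would also remove the matching input channels of the next layer.

```python
import numpy as np

def prune_filters_l1(weights, keep_ratio=0.5):
    """Keep the filters with the largest l1 norms (static structured pruning).

    weights: (out_channels, in_channels, k, k) convolution kernel.
    Returns the pruned kernel and the indices of the kept filters.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(weights.shape[0] * keep_ratio)))
    kept = np.sort(np.argsort(-norms)[:n_keep])   # strongest filters, in original order
    return weights[kept], kept

# Four 3x3 filters with clearly different magnitudes.
w = np.stack([np.full((1, 3, 3), v) for v in (1.0, 0.1, 2.0, 0.01)])
pruned, kept = prune_filters_l1(w, keep_ratio=0.5)
```

Because the ranking is computed once from the trained weights, the pruned structure is fixed for every input, which is precisely the limitation the runtime and dynamic methods above address.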