
ShuffleMixer: An Efficient ConvNet for Image Super-Resolution

05/30/2022
by   Long Sun, et al.
Nanjing University

Being lightweight and efficient is a critical driver for the practical application of image super-resolution (SR) algorithms. We propose a simple and effective approach, ShuffleMixer, for lightweight image super-resolution that explores large convolutions and channel split-shuffle operations. In contrast to previous SR models that simply stack multiple small-kernel convolutions or complex operators to learn representations, we explore a large-kernel ConvNet for mobile-friendly SR design. Specifically, we develop a large depth-wise convolution and two projection layers based on channel splitting and shuffling as the basic component to mix features efficiently. Since the contexts of natural images are strongly locally correlated, using only large depth-wise convolutions is insufficient to reconstruct fine details. To overcome this problem while maintaining the efficiency of the proposed module, we introduce Fused-MBConvs into the proposed network to model the local connectivity of different features. Experimental results demonstrate that the proposed ShuffleMixer is about 6x smaller than state-of-the-art methods in terms of model parameters and FLOPs while achieving competitive performance. In NTIRE 2022, our primary method won the model complexity track of the Efficient Super-Resolution Challenge [23]. The code is available at https://github.com/sunny2109/MobileSR-NTIRE2022.


1 Introduction

Single image super-resolution (SISR) aims to recover a high-resolution image from a low-resolution one. This classic problem has recently attracted considerable attention due to the rapid development of high-definition devices such as Ultra-High-Definition televisions and smartphones like the Samsung Galaxy S22 Ultra, iPhone 13 Pro Max, and HUAWEI P50 Pro. It is therefore of great interest to develop an efficient and effective method for estimating high-resolution images so that they can be better displayed on these devices.

Recently, convolutional neural network (CNN) based SR models Dong et al. (2016a, b); Ahn et al. (2018); Hui et al. (2019); Lim et al. (2017); Zhang et al. (2018) have achieved impressive reconstruction performance. However, these networks extract local features hierarchically and rely heavily on stacking deeper or more complex models to enlarge the receptive field for performance improvements. As a result, the required computational budget makes these heavy SR models difficult to deploy on resource-constrained mobile devices in practical applications Zhang et al. (2021).

Figure 1: Model complexity and performance comparison between our proposed ShuffleMixer family and other lightweight methods on B100 Arbelaez et al. (2011) for SR. Circle sizes indicate the numbers of parameters. ShuffleMixer achieves a better trade-off.

To alleviate the burden of heavy SR models, various methods have been proposed to reduce model complexity or speed up runtime, including efficient operation design Sandler et al. (2018); Ma et al. (2018); Tan and Le (2021); Dong et al. (2016b); Hui et al. (2019); Ahn et al. (2018); Shi et al. (2016); Zhang et al. (2020); Li et al. (2022); Liu et al. (2022), neural architecture search Chu et al. (2021); Song et al. (2020), knowledge distillation Gao et al. (2019); He et al. (2020), and structural re-parameterization Ding et al. (2022); Li et al. (2022); Zhang et al. (2021). These methods are mainly based on improved small spatial convolutions or advanced training strategies, and large kernel convolutions are rarely explored. Moreover, they mostly optimize a single efficiency indicator and do not perform well in real resource-constrained settings. Obtaining a better trade-off between complexity, latency, and SR quality therefore remains imperative.

A large receptive field involves more feature interactions, which helps reconstruct finer details in tasks such as super-resolution that require dense per-pixel predictions. Recent vision transformer (ViT) based approaches Dosovitskiy et al. (2021); Liu et al. (2021); Mehta and Rastegari (2022); Liang et al. (2021) employ a multi-head self-attention (MHSA) mechanism to learn long-range feature representations, which leads to their state-of-the-art performance on various vision tasks. However, MHSA is ill-suited to enlarging the receptive field of an efficient SR design, since its complexity grows quadratically with the input resolution (which is usually large and fixed during SR training). Regular convolution with large kernels is a simple but heavyweight alternative for obtaining large effective receptive fields. To make large kernel convolutions practical, using depth-wise convolutions with large kernel sizes Liu et al. (2022); Trockman and Kolter (2022); Ding et al. (2022) is an effective option. However, since depth-wise convolutions share connection weights across spatial locations and keep channels independent, it is challenging for them to capture sufficient feature interactions. In lightweight network design, it is therefore essential to improve the learning capability of depth-wise convolutions (DW Convs).

In this paper, we develop a simple and effective network named ShuffleMixer that introduces large kernel convolutions for lightweight SR design. The core idea is to fuse non-local and local spatial locations within a feature mixing block with fewer parameters and FLOPs. Specifically, we employ depth-wise convolutions with large kernel sizes to aggregate spatial information from a large region. For channel mixing, we introduce channel splitting and shuffling strategies to reduce model parameters and computational cost and improve network capability. We then build an effective shuffle mixer layer based on these two operators. To further improve the learning capability, we embed the Fused-MBConv into the mixer layer to boost local connectivity. Taken together, we find that the ShuffleMixer network with a simple module can obtain state-of-the-art performance. Figure 1 shows that our ShuffleMixer achieves a better trade-off with the least parameters and FLOPs among all existing lightweight SR methods.

The contributions of this paper are summarized as follows: (1) We develop an efficient SR design by exploring a large kernel ConvNet that involves more useful information for image SR. (2) We introduce a channel splitting and shuffling operation to perform the feature mixing of the channel projection efficiently. (3) To better explore the local connectivity among cross-group features from the shuffle mixer layer, we utilize Fused-MBConvs in the proposed SR design. We formulate the aforementioned modules into an end-to-end trainable network named ShuffleMixer. Experimental results show that ShuffleMixer is about 6x smaller than the state-of-the-art methods in terms of model parameters and FLOPs while achieving competitive performance.

2 Related Work

CNN-based Efficient SR. CNN-based methods adopt various strategies to reduce model complexity. FSRCNN Dong et al. (2016b) and ESPCN Shi et al. (2016) employ post-upsampling layers to significantly reduce the computational burden incurred by predefined upscaled inputs. Ahn et al. (2018) use group convolutions and cascading connections upon a recursive network to save parameters. Hui et al. (2019) propose a lightweight information multi-distillation network (IMDN) that aggregates features by applying feature splitting and concatenation operations, and the improved IMDN variants Zhang et al. (2020); Li et al. (2022) won the AIM 2020 and NTIRE 2022 Efficient SR challenges. Meanwhile, an increasingly popular approach is to search for a well-constrained architecture by treating the design as a multi-objective evolution problem Chu et al. (2021); Song et al. (2020). Another branch compresses and accelerates a heavy deep model through knowledge distillation He et al. (2020); Gao et al. (2019). Note that fewer parameters and FLOPs do not necessarily mean faster runtime on mobile devices, because FLOPs ignore several important latency-related factors such as memory access cost (MAC) and degree of parallelism Ma et al. (2018); Sandler et al. (2018). In this paper, we analyze the factors affecting the efficiency of SR models and develop a mobile-friendly SR network.

Transformer-based SR. Transformers were initially proposed for language tasks, stacking multi-head self-attention and feed-forward MLP layers to learn long-range relations among their inputs. Dosovitskiy et al. (2021) first applied a vision transformer to image recognition. Since then, ViT-based models have become increasingly applicable to both high-level and low-level vision tasks. For image super-resolution, Chen et al. (2021) develop a pre-trained image processing transformer (IPT) that directly applies the vanilla ViT to non-overlapping patch embeddings. Liang et al. (2021) follow the Swin Transformer Liu et al. (2021) and propose a window-based self-attention model for image restoration tasks, achieving excellent results. Window-based self-attention is much more computationally efficient than global self-attention, but it is still a time-consuming and memory-intensive operation.

Models with Large Kernels. AlexNet Krizhevsky et al. (2012) is a classic large-kernel convolutional neural network that inspired many subsequent works. The Global Convolutional Network Peng et al. (2017) uses symmetric, separable large filters to improve semantic segmentation performance. Due to their high computational cost and large number of parameters, large convolutional filters fell out of favor after VGG-Net Simonyan and Zisserman (2014). However, large convolution kernels have recently regained attention with the development of efficient convolution techniques and new architectures such as transformers and MLPs. ConvMixer Trockman and Kolter (2022) replaces the mixer component of ViTs Liu et al. (2021); Dosovitskiy et al. (2021) or MLPs Tolstikhin et al. (2021) with large kernel depth-wise convolutions. ConvNeXt Liu et al. (2022) uses depth-wise kernels to redesign a standard ResNet and achieves results comparable to Transformers. RepLKNet Ding et al. (2022) enlarges the convolution kernel to 31x31 to build a pure CNN model, which obtains better results than the Swin Transformer Liu et al. (2021) on ImageNet. Unlike these methods that focus on building big models for high-level vision tasks, we explore the possibility of large convolution kernels for lightweight model design in image super-resolution.

3 Proposed Method

We aim to develop an efficient large-kernel CNN model for the SISR task. To meet the efficiency goal, we introduce key designs in the feature mixing block, which is employed to encode information efficiently. This section first presents the overall pipeline of our proposed ShuffleMixer network. Then, we formulate the feature mixing block, which acts as the basic module for building the ShuffleMixer network. Finally, we describe the training loss function.

3.1 ShuffleMixer Architecture

The overall ShuffleMixer architecture. Given a low-resolution image $I_{LR} \in \mathbb{R}^{H \times W \times C}$, where $C$ and $H \times W$ denote the number of channels and the spatial resolution, respectively ($C = 3$ for a color image), the proposed ShuffleMixer first extracts a shallow feature $X_0$ by a convolution layer. Then, we develop a feature mixing block (FMB) consisting of two shuffle mixer layers and a Fused-MBConv Tan and Le (2021), which takes the feature $X_0$ as input and produces deeper features. Next, we utilize an upsampler module with scale factor $s$ to upscale the spatial resolution of the features generated by a sequence of FMBs. To save as many parameters of the enlargement module as possible, we only use a convolutional layer followed by a pixel-shuffling layer Shi et al. (2016); for the ×4 scale factor, we progressively upsample the resolution by applying this upsampler twice. Finally, we use a convolutional layer to map the upscaled feature to a residual image $I_R$ and add it to the bilinearly upscaled input to obtain the final high-resolution image $I_{SR} = I_R + \mathcal{B}_s(I_{LR})$, where $\mathcal{B}_s$ denotes bilinear interpolation with scale factor $s$. In the following, we explain the proposed method in detail.
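For concreteness, the following is a minimal PyTorch sketch of this pipeline, assuming 3x3 convolutions for the shallow feature extractor and the final mapping layer and a repeated x2 pixel-shuffle stage for the x4 setting; the class name, the fmb_factory argument, and these kernel choices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleMixerSR(nn.Module):
    """Sketch of the overall pipeline: shallow conv -> stack of feature mixing
    blocks -> pixel-shuffle upsampler, with a bilinear skip from the LR input."""

    def __init__(self, dim=32, n_blocks=5, scale=4, fmb_factory=None):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, dim, 3, padding=1)   # shallow feature extraction (assumed 3x3)
        blocks = [fmb_factory() for _ in range(n_blocks)] if fmb_factory else []
        self.body = nn.Sequential(*blocks)
        # x4 is reached by repeating a x2 upsampler twice; x2/x3 use a single stage
        stages = [2, 2] if scale == 4 else [scale]
        ups = []
        for r in stages:
            ups += [nn.Conv2d(dim, dim * r * r, 3, padding=1), nn.PixelShuffle(r)]
        self.upsampler = nn.Sequential(*ups)
        self.tail = nn.Conv2d(dim, 3, 3, padding=1)   # map features to the residual image

    def forward(self, lr):
        base = F.interpolate(lr, scale_factor=self.scale, mode="bilinear",
                             align_corners=False)     # bilinearly upscaled LR input
        x = self.upsampler(self.body(self.head(lr)))
        return self.tail(x) + base                    # I_SR = I_R + B_s(I_LR)
```

For example, `ShuffleMixerSR(dim=32, n_blocks=5, scale=4)(torch.rand(1, 3, 64, 64))` returns a 1x3x256x256 tensor.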

Figure 2: An overview of the proposed ShuffleMixer. Our method includes feature extraction, feature mixing, and upsampling. The key component is the Feature Mixing Block (FMB), containing two (a) shuffle mixer layers and one (d) Fused-MBConv (FMBConv). Each shuffle mixer layer is composed of two (b) channel projection modules and a large kernel depth-wise convolution. Channel projection in turn includes channel splitting and shuffling operations, (c) point-wise MLP layers, skip connections, and layer norms.

The Feature Mixing Block is developed to explore local and non-local information for better feature representations. To effectively obtain non-local feature interactions, we apply shuffle mixer layers to the extracted feature $X_0$, as illustrated in Figure 2(a). For each shuffle mixer layer, we employ large kernel DW Convs to mix features across spatial locations. This operation enjoys a large effective receptive field with few parameters, which helps encode more spatial information to reconstruct complete and accurate structures. As investigated in Table 3, depth-wise convolutions with larger kernel sizes consistently improve SR performance while maintaining computational efficiency.

To mix features at channel locations, we employ point-wise MLPs to perform channel projection. With the large depth-wise convolution in place, the computational cost of the shuffle mixer layer is dominated by the channel projections. We therefore introduce a channel splitting and shuffling (CSS) strategy Ma et al. (2018) to improve the efficiency of this step. Specifically, the input feature $X$ is first split into $X_1$ and $X_2$ along the channel dimension; a point-wise MLP then performs channel mixing on the split feature $X_1$; finally, a channel shuffling operation is employed to enable the exchange of information across the concatenated features. The parameter complexity of the channel projection layer thus drops from $C \times C$ to $\frac{C}{2} \times \frac{C}{2}$. This procedure can be formulated as follows:

$$[X_1, X_2] = \mathrm{Split}(X), \quad \hat{X}_1 = W_2\,\sigma\!\left(W_1\,\mathrm{LN}(X_1)\right), \quad \hat{X} = \mathrm{Shuffle}\!\left([\hat{X}_1, X_2]\right), \qquad (1)$$

where $\sigma$ is the SiLU nonlinearity function Elfwing et al. (2018), $W_1$ and $W_2$ are the point-wise convolutions, $\mathrm{LN}$ is layer normalization, and $\mathrm{Split}(\cdot)$ and $\mathrm{Shuffle}(\cdot)$ represent the splitting and shuffling of features in the channel dimension. This splitting operation limits representational capability, since the other half of the input tensor is excluded from channel interactions, and the channel shuffle operation cannot guarantee that all features are processed. Inspired by the MobileNetV2 block Sandler et al. (2018), we therefore repeat the channel projection layer and arrange the two copies before and after the large depth-wise convolution to learn visual representations. From our study, as listed in Table 2, the enhanced mixer layer achieves quite similar results to the ConvMixer block Trockman and Kolter (2022) while using fewer parameters and FLOPs.
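As a rough illustration of the CSS projection and the surrounding layer, a minimal PyTorch sketch follows; the LayerNorm placement, skip connections, expansion ratio, and default 7x7 kernel are assumptions based on the description above rather than an exact reproduction of the released code.

```python
import torch
import torch.nn as nn

class ChannelProjection(nn.Module):
    """Channel splitting and shuffling (CSS) projection: split the channels in half,
    mix one half with a point-wise MLP, concatenate, then shuffle the two groups."""

    def __init__(self, dim, mlp_ratio=2):
        super().__init__()
        half = dim // 2
        self.norm = nn.LayerNorm(half)                     # applied before the point-wise MLP
        self.mlp = nn.Sequential(                          # W1, SiLU, W2 as 1x1 convolutions
            nn.Conv2d(half, half * mlp_ratio, 1),
            nn.SiLU(inplace=True),
            nn.Conv2d(half * mlp_ratio, half, 1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                         # channel split
        y = self.norm(x1.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x1 = x1 + self.mlp(y)                              # skip connection (assumed placement)
        out = torch.cat([x1, x2], dim=1)
        b, c, h, w = out.shape                             # channel shuffle with two groups
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleMixerLayer(nn.Module):
    """Channel projection -> large-kernel depth-wise conv -> channel projection."""

    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.proj_in = ChannelProjection(dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.proj_out = ChannelProjection(dim)

    def forward(self, x):
        x = self.proj_in(x)
        x = x + self.dwconv(x)                             # spatial mixing (assumed residual)
        return self.proj_out(x)
```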

Since the content of natural images is locally correlated, the stacked FMB modules alone do not fully exploit local features, and more capacity is required to model feature representations for better SR performance. Therefore, we embed a few convolutional blocks into the proposed model to enhance local connectivity. Concretely, we add a Fused-MBConv after every two shuffle mixer layers. The original Fused-MBConv contains a 3x3 expansion convolution, an SE layer Hu et al. (2018) (i.e., the commonly used channel attention), and a 1x1 reduction convolution. Using such a Fused-MBConv significantly increases parameters and FLOPs, which motivated us to make some changes to match our computational budget. We first remove the SE layer, as the SiLU function can be treated as a gating mechanism to some extent. Note also that inference becomes much slower as the hidden dimension expands. Instead of expanding the hidden channels rapidly with a large expansion factor (the default is usually 6), we limit the number of hidden channels of this expansion convolution by adding only a small fixed number of channels (experimentally set to 16), as shown in Figure 2(d). We also study several operations for this mixing process; more details are given in Sec. 4.3.
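The slimmed-down block can be sketched as below; the 3x3 expansion and 1x1 reduction follow the standard Fused-MBConv layout, while the additive channel expansion of 16 and the residual connection are read from the description above and should be treated as assumptions.

```python
import torch.nn as nn

class SlimFusedMBConv(nn.Module):
    """Fused-MBConv variant used as a local-mixing block: 3x3 expansion conv + SiLU,
    then a 1x1 reduction conv; the SE layer is removed and the hidden width is only
    dim + extra channels instead of a 6x expansion."""

    def __init__(self, dim, extra=16):
        super().__init__()
        hidden = dim + extra
        self.expand = nn.Conv2d(dim, hidden, 3, padding=1)  # fused 3x3 expansion
        self.act = nn.SiLU(inplace=True)                    # SiLU doubles as a light gate
        self.reduce = nn.Conv2d(hidden, dim, 1)             # 1x1 reduction back to dim

    def forward(self, x):
        return x + self.reduce(self.act(self.expand(x)))    # residual connection (assumed)
```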

3.2 Learning Strategy

To constrain the network training, a straightforward way is to ensure that the content of the network output is close to that of the ground truth image:

$$\mathcal{L}_{\mathrm{pixel}} = \left\| I_{SR} - I_{HR} \right\|_1, \qquad (2)$$

where $I_{SR}$ and $I_{HR}$ denote the output result and the corresponding ground truth HR image, respectively. We note that using only the pixel-wise loss function does not effectively help high-frequency detail estimation Cho et al. (2021). We accordingly employ a frequency constraint to regularize network training. The overall loss function for network training is defined as:

$$\mathcal{L} = \left\| I_{SR} - I_{HR} \right\|_1 + \lambda \left\| \mathcal{F}(I_{SR}) - \mathcal{F}(I_{HR}) \right\|_1, \qquad (3)$$

where $\mathcal{F}$ denotes the fast Fourier transform and $\lambda$ is a weight parameter that is set to 0.1 empirically.
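A compact PyTorch sketch of this objective is given below; using an L1 distance on the FFT magnitudes mirrors the frequency loss of Cho et al. (2021) and is an assumption about the exact form.

```python
import torch

def sr_loss(sr, hr, lam=0.1):
    """Pixel-domain L1 loss plus an FFT-domain L1 term weighted by lam (0.1 here)."""
    pixel = torch.mean(torch.abs(sr - hr))
    sr_fft = torch.fft.fft2(sr, dim=(-2, -1))      # 2D FFT over the spatial dimensions
    hr_fft = torch.fft.fft2(hr, dim=(-2, -1))
    freq = torch.mean(torch.abs(sr_fft - hr_fft))  # magnitude of the complex difference
    return pixel + lam * freq
```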

4 Experimental Results

4.1 Datasets and implementation

Datasets. Following existing methods Li et al. (2020); Liang et al. (2021); Li et al. (2022), we train our models on the DF2K dataset, a merger of DIV2K Timofte et al. (2017) and Flickr2K Lim et al. (2017), which contains 3450 (800 + 2650) high-quality images. We adopt standard protocols to generate LR images by bicubic downscaling of the reference HR images. During the testing stage, we evaluate our models with the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) on five publicly available benchmark datasets: Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2012), B100 Arbelaez et al. (2011), Urban100 Huang et al. (2015), and Manga109 Matsui et al. (2015). All PSNR and SSIM results are calculated on the Y channel of the YCbCr color space.
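For reference, Y-channel PSNR can be computed as in the sketch below, using the standard ITU-R BT.601 RGB-to-Y conversion; border cropping, which SR evaluations often apply, is omitted here.

```python
import torch

def rgb_to_y(img):
    """Y channel of YCbCr (ITU-R BT.601) for an RGB tensor in [0, 1], shape (..., 3, H, W)."""
    r, g, b = img[..., 0, :, :], img[..., 1, :, :], img[..., 2, :, :]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr_y(sr, hr):
    """PSNR on the Y channel, assuming both images are in [0, 1]."""
    mse = torch.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```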

Implementation details. We train our model on RGB channels and augment the input patches with random horizontal flips and rotations. In each training mini-batch, we randomly crop 64 patches from the LR images as input. The proposed model is trained by minimizing the L1 loss and the frequency loss Cho et al. (2021) with the Adam optimizer Kingma and Ba (2015) for 300,000 iterations in total. The learning rate is kept constant throughout training. All experiments are conducted with the PyTorch framework on an Nvidia Tesla V100 GPU.
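A hypothetical training loop matching these settings might look as follows; the learning-rate value and patch sizes below are placeholders (the original numbers were lost in extraction), and the random tensors stand in for augmented DF2K crops.

```python
import torch

# Reuses the ShuffleMixerSR and sr_loss sketches above; all numeric values are placeholders.
model = ShuffleMixerSR(dim=32, n_blocks=5, scale=4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # constant LR (placeholder value)

for it in range(300_000):                                   # 300k iterations in total
    lr_patches = torch.rand(64, 3, 48, 48)                  # stand-in for 64 augmented LR crops
    hr_patches = torch.rand(64, 3, 192, 192)                # matching x4 HR crops
    loss = sr_loss(model(lr_patches), hr_patches, lam=0.1)  # L1 + frequency loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```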

We provide two models that differ in the number of feature channels and the DW Conv kernel size; both use 5 FMB modules. The ShuffleMixer model uses 64 feature channels and the ShuffleMixer-Tiny model uses 32. The training code and models will be made publicly available.

Method Params FLOPs Set5 Set14 B100 Urban100 Manga109
Scale ×2
SRCNN Dong et al. (2016a) 57K 53G 36.66/0.9542 32.42/0.9063 31.36/0.8879 29.50/0.8946 35.74/0.9661
FSRCNN Dong et al. (2016b) 12K 6G 37.00/0.9558 32.63/0.9088 31.53/0.8920 29.88/0.9020 36.67/0.9694
ESPCN Shi et al. (2016) 21K 5G 36.83/0.9564 32.40/0.9096 31.29/0.8917 29.48/0.8975 -
VDSR Kim et al. (2016b) 665K 613G 37.53/0.9587 33.03/0.9124 31.90/0.8960 30.76/0.9140 37.22/0.9729
DRCN Kim et al. (2016a) 1,774K 17,974G 37.63/0.9588 33.04/0.9118 31.85/0.8942 30.75/0.9133 37.63/0.9723
LapSRN Lai et al. (2017) 813K 30G 37.52/0.9590 33.08/0.9130 31.80/0.8950 30.41/0.9100 37.27/0.9740
CARN-M Ahn et al. (2018) 412K 91G 37.53/0.9583 33.26/0.9141 31.92/0.8960 31.23/0.9193 -
CARN Ahn et al. (2018) 1,592K 223G 37.76/0.9590 33.52/0.9166 32.09/0.8978 31.92/0.9256 -
EDSR-baseline Lim et al. (2017) 1,370K 316G 37.99/0.9604 33.57/0.9175 32.16/0.8994 31.98/0.9272 38.54/0.9769
FALSR-A Chu et al. (2021) 1021K 235G 37.82/0.9595 33.55/0.9168 32.12/0.8987 31.93/0.9256 -
IMDN Hui et al. (2019) 694K 161G 38.00/0.9605 33.63/0.9177 32.19/0.8996 32.17/0.9283 38.88/0.9774
LAPAR-C Li et al. (2020) 87K 35G 37.65/0.9593 33.20/0.9141 31.95/0.8969 31.10/0.9178 37.75/0.9752
LAPAR-A Li et al. (2020) 548K 171G 38.01/0.9605 33.62/0.9183 32.19/0.8999 32.10/0.9283 38.67/0.9772
ECBSR-M16C64 Zhang et al. (2021) 596K 137G 37.90/0.9615 33.34/0.9178 32.10/0.9018 31.71/0.9250 -
SMSR Wang et al. (2021) 985K 132G 38.00/0.9601 33.64/0.9179 32.17/0.8990 32.19/0.9284 38.76/0.9771
ShuffleMixer-Tiny(Ours) 108K 25G 37.85/0.9600 33.33/0.9153 31.99/0.8972 31.22/0.9183 38.25/0.9761
ShuffleMixer(Ours) 394K 91G 38.01/0.9606 33.63/0.9180 32.17/0.8995 31.89/0.9257 38.83/0.9774
Scale ×3
SRCNN Dong et al. (2016a) 57K 53G 32.75/0.9090 29.28/0.8209 28.41/0.7863 26.24/0.7989 30.59/0.9107
FSRCNN Dong et al. (2016b) 12K 5G 33.16/0.9140 29.43/0.8242 28.53/0.7910 26.43/0.8080 30.98/0.9212
VDSR Kim et al. (2016b) 665K 613G 33.66/0.9213 29.77/0.8314 28.82/0.7976 27.14/0.8279 32.01/0.9310
DRCN Kim et al. (2016a) 1,774K 17,974G 33.82/0.9226 29.76/0.8311 28.80/0.7963 27.15/0.8276 32.31/0.9328
CARN-M Ahn et al. (2018) 412K 46G 33.99/0.9236 30.08/0.8367 28.91/0.8000 27.55/0.8385 -
CARN Ahn et al. (2018) 1,592K 119G 34.29/0.9255 30.29/0.8407 29.06/0.8034 28.06/0.8493 -
EDSR-baseline Lim et al. (2017) 1,555K 160G 34.37/0.9270 30.28/0.8417 29.09/0.8052 28.15/0.8527 33.45/0.9439
IMDN Hui et al. (2019) 703K 72G 34.36/0.9270 30.32/0.8417 29.09/0.8046 28.17/0.8519 33.61/0.9445

LAPAR-C Li et al. (2020) 99K 28G 33.91/0.9235 30.02/0.8358 28.90/0.7998 27.42/0.8355 32.54/0.9373
LAPAR-A Li et al. (2020) 594K 114G 34.36/0.9267 30.34/0.8421 29.11/0.8054 28.15/0.8523 33.51/0.9441
SMSR Wang et al. (2021) 993K 68G 34.40/0.9270 30.33/0.8412 29.10/0.8050 28.25/0.8536 33.68/0.9445
ShuffleMixer-Tiny(Ours) 114K 12G 34.07/0.9250 30.14/0.8382 28.94/0.8009 27.54/0.8373 33.03/0.9400
ShuffleMixer(Ours) 415K 43G 34.40/0.9272 30.37/0.8423 29.12/0.8051 28.08/0.8498 33.69/0.9448
Scale ×4
SRCNN Dong et al. (2016a) 57K 53G 30.48/0.8628 27.49/0.7503 26.90/0.7101 24.52/0.7221 27.66/0.8505
FSRCNN Dong et al. (2016b) 12K 5G 30.71/0.8657 27.59/0.7535 26.98/0.7150 24.62/0.7280 27.90/0.8517
ESPCN Shi et al. (2016) 25K 1G 30.52/0.8697 27.42/0.7606 26.87/0.7216 24.39/0.7241 -
VDSR Kim et al. (2016b) 665K 613G 31.35/0.8838 28.01/0.7674 27.29/0.7251 25.18/0.7524 28.83/0.8809
DRCN Kim et al. (2016a) 1,774K 17,974G 31.53/0.8854 28.02/0.7670 27.23/0.7233 25.14/0.7510 28.98/0.8816
LapSRN Lai et al. (2017) 813K 149G 31.54/0.8850 28.19/0.7720 27.32/0.7280 25.21/0.7560 29.09/0.8845
CARN-M Ahn et al. (2018) 412K 33G 31.92/0.8903 28.42/0.7762 27.44/0.7304 25.62/0.7694 -
CARN Ahn et al. (2018) 1,592K 91G 32.13/0.8937 28.60/0.7806 27.58/0.7349 26.07/0.7837 -
EDSR-baseline Lim et al. (2017) 1,518K 114G 32.09/0.8938 28.58/0.7813 27.57/0.7357 26.04/0.7849 30.35/0.9067
IMDN Hui et al. (2019) 715K 41G 32.21/0.8948 28.58/0.7811 27.56/0.7353 26.04/0.7838 30.45/0.9075
LAPAR-C Li et al. (2020) 115K 25G 31.72/0.8884 28.31/0.7740 27.40/0.7292 25.49/0.7651 29.50/0.8951
LAPAR-A Li et al. (2020) 659K 94G 32.15/0.8944 28.61/0.7818 27.61/0.7366 26.14/0.7871 30.42/0.9074
ECBSR-M16C64 Zhang et al. (2021) 603K 35G 31.92/0.8946 28.34/0.7817 27.48/0.7393 25.81/0.7773 -
SMSR Wang et al. (2021) 1006K 42G 32.12/0.8932 28.55/0.7808 27.55/0.7351 26.11/0.7868 30.54/0.9085
ShuffleMixer-Tiny(Ours) 113K 8G 31.88/0.8912 28.46/0.7779 27.45/0.7313 25.66/0.7690 29.96/0.9006
ShuffleMixer(Ours) 411K 28G 32.21/0.8953 28.66/0.7827 27.61/0.7366 26.08/0.7835 30.65/0.9093
Table 1: Comparisons on multiple benchmark datasets for efficient SR networks. All results are calculated on the Y channel. FLOPs are calculated with respect to the corresponding HR image size. Best and second-best performance are marked in red and blue, respectively. Dashes indicate results not reported in previous works.
Figure 3: Visual comparisons for SR on the Set14 (ppt3) and Urban100 (img078, img095) datasets. Each example shows (a) the HR patch and the results of (b) Bicubic, (c) VDSR Kim et al. (2016b), (d) DRCN Kim et al. (2016a), (e) LapSRN Lai et al. (2017), (f) CARN Ahn et al. (2018), (g) IMDN Hui et al. (2019), and (h) ShuffleMixer. The proposed algorithm recovers images with clearer structures.
Figure 4: Visual comparisons for SR on the historical dataset Lai et al. (2017) (img004), showing (a) the LR patch and the results of (b) Bicubic, (c) SelfEx Huang et al. (2015), (d) CARN Ahn et al. (2018), (e) LAPAR-A Li et al. (2020), and (f) ShuffleMixer. Compared with the results in (b)-(e), the super-resolved image (f) generated by our approach is much clearer with fewer artifacts.

4.2 Comparisons with State-of-the-Art Methods

To evaluate the performance of our approach, we compare the proposed ShuffleMixer with state-of-the-art lightweight frameworks, including SRCNN Dong et al. (2016a), FSRCNN Dong et al. (2016b), VDSR Kim et al. (2016b), DRCN Kim et al. (2016a), LapSRN Lai et al. (2017), CARN Ahn et al. (2018), EDSR-baseline Lim et al. (2017), FALSR-A Chu et al. (2021), IMDN Hui et al. (2019), LAPAR Li et al. (2020), ECBSR Zhang et al. (2021), and SMSR Wang et al. (2021).

Table 1 shows quantitative comparisons on the benchmark datasets for upscaling factors of ×2, ×3, and ×4. In addition to the PSNR/SSIM metrics, we also list the number of parameters and FLOPs, where FLOPs are measured for super-resolving an LR image to a fixed HR resolution. In Figure 1, we compare FLOPs and the number of parameters on the B100 dataset. Our ShuffleMixer model obtains competitive results with even fewer parameters and FLOPs. In particular, ShuffleMixer has a similar number of parameters to CARN-M but outperforms it by a large margin on all benchmark datasets. Even with only 113K parameters, ShuffleMixer-Tiny achieves better performance than many existing methods, and the ShuffleMixer family achieves similarly strong results across the other scale factors.

Although IMDN Hui et al. (2019), LAPAR-A Li et al. (2020), and SMSR Wang et al. (2021) obtain comparable PSNR/SSIM performance, ShuffleMixer requires considerably less model complexity. We also compare GPU runtime with fast and lightweight SR models, namely CARN Ahn et al. (2018), CARN-M Ahn et al. (2018), and LAPAR-A Li et al. (2020), and the proposed method has fast inference speed: ShuffleMixer-Tiny and ShuffleMixer take 0.016s and 0.021s, respectively, to reconstruct an HR image, compared with 0.017s, 0.019s, and 0.031s for CARN-M, CARN, and LAPAR-A. Note that PyTorch has poor support for large-kernel depth-wise convolution; employing optimized depth-wise convolutions can further accelerate the inference of our method, as suggested in Ding et al. (2022). All these results demonstrate the effectiveness of our method.

Figure 3 presents visual comparisons on the Set14 and Urban100 datasets. The qualitative results demonstrate that our proposed method produces more visually pleasing results, with better-recovered structures and details.

We further evaluate our approach on real low-quality images. One example from the historical dataset Lai et al. (2017) is shown in Figure 4. The results by  Huang et al. (2015); Li et al. (2020) show visible artifacts. Our method and CARN Ahn et al. (2018) generate smooth details, but our results have a clearer structure.

(a) Shuffle mixer layer | (b) Feature mixing block
Variant    Baseline  CSS    CDC    | Conv   S-Conv  C-Conv  S-ResBlock  S-FMBConv
Params (K)  55.9     24.7   35.5   | 81.7   81.7    128     128         113
FLOPs (G)   5.2      3.2    3.8    | 6.9    6.9     9.9     9.9         8.9
PSNR (dB)   29.96    29.83  29.99  | 30.12  30.16   30.20   30.24       30.21
SSIM        0.8288   0.8231 0.8259 | 0.8299 0.8305  0.8316  0.8327      0.8321
Table 2: Ablation studies of the shuffle mixer layer (a) and the feature mixing block (b) on the DIV2K validation set Timofte et al. (2017). FLOPs are measured with fvcore on a fixed LR input size.
Kernel Size PSNR(dB)/SSIM Params(K) FLOPs(G)
3×3   30.21/0.8321 113 8.9
5×5   30.24/0.8326 118 9.2
7×7   30.28/0.8342 125 9.7
9×9   30.29/0.8339 136 10.4
11×11 30.28/0.8339 148 11.2
13×13 30.29/0.8337 164 12.2
Table 3: Experimental results with different kernel sizes for the depth-wise convolution. The PSNR is evaluated on the ×4 DIV2K validation set, and the FLOPs are computed on a fixed LR input size.
Figure 5: Visualization of feature maps before the upsampler module. We show the average features over the channel dimension.

4.3 Analysis and Discussions

The core ideas of ShuffleMixer lie in the shuffle mixer layer, the feature mixing block, and the large kernel convolution. In this subsection, we evaluate each of them on the proposed tiny model trained on the DIV2K dataset Timofte et al. (2017).

Effectiveness of the shuffle mixer layer. To verify the efficiency of the shuffle mixer layer, we use 10 ConvMixer Trockman and Kolter (2022) blocks to build a baseline model. Unlike the original ConvMixer module, we replace BatchNorm with LayerNorm and apply it only before the point-wise MLP layer, because BatchNorm tends to introduce artifacts in the generated results Lim et al. (2017); Wang et al. (2018). The kernel size of the depth-wise convolution is set to 3, and the number of channels is 32. When applying the channel splitting and shuffling (CSS) strategy, the number of parameters is reduced from 55.9K to 24.7K, but the performance is 0.13dB lower than the baseline. This result reflects that the split operation limits the representational capability of the channel projection layer. To compensate for this PSNR drop, we repeat the CSS-based projection layer to enable more cross-group feature mixing (denoted by CDC). Table 2(a) shows a quantitative comparison in which CDC achieves performance similar to the baseline model while reducing parameters from 55.9K to 35.5K and FLOPs from 5.2G to 3.8G.

Effectiveness of the feature mixing block. To validate the effectiveness of the proposed feature mixing block, we take the CDC model as the baseline and first embed a convolution layer after every two shuffle mixer layers, which yields a gain of 0.13dB over the baseline. To further analyze the effect of the feature fusion manner, we study S-Conv (element-wise summation of the input and output features followed by a convolution layer) and C-Conv (concatenation of the input and output features along the channel dimension followed by a convolution layer). Table 2(b) shows that both improve over the baseline; C-Conv achieves better PSNR performance but at a higher computational cost. Figure 5 shows the feature maps averaged over the channel axis before the upsampler module, illustrating that enhancing the local connectivity between feature elements helps capture finer high-frequency content. Based on S-Conv, we additionally replace the convolution layer with basic residual blocks (S-ResBlock) and Fused-MBConv (S-FMBConv). Table 2(b) shows that S-FMBConv obtains a balanced trade-off between model complexity and SR performance. Thus, we choose S-FMBConv to strengthen the local connectivity between features in this paper.

Effectiveness of large depth-wise convolution. To demonstrate the effect of a large kernel, we test kernel sizes ranging from 3×3 to 13×13 pixels separately. Table 3 shows that using a larger kernel size improves performance. In particular, the PSNR of the model using a 7×7 depth-wise convolution is 0.07dB higher than that of the 3×3 variant, while adding only 12K parameters and 0.8G FLOPs. In addition, we note that when the kernel size is larger than 7×7 pixels, the performance gains are minor. Thus, the kernel size is set to 7×7 pixels as a trade-off between accuracy and model complexity in this paper.

5 Conclusion

In this paper, we have proposed a lightweight deep model, ShuffleMixer, to solve the image super-resolution problem. ShuffleMixer contains a shuffle mixer layer with a large effective receptive field to extract non-local feature representations efficiently. We have introduced the Fused-MBConv to model the local connectivity of features generated by the shuffle mixer layer, which is critical for improving SR performance. We evaluate the proposed ShuffleMixer both qualitatively and quantitatively on commonly used benchmarks. Experimental results demonstrate that the proposed ShuffleMixer is much more efficient than state-of-the-art methods while achieving competitive performance.

Broader Impact

This paper is an exploratory work on lightweight and efficient image super-resolution using a large-kernel ConvNet. This approach can be deployed in some resource-constrained environments to improve image quality, such as processing pictures taken by smartphones and reducing bandwidth during video calls or meetings. However, super-resolution technology has also brought some negative effects, such as criminals using this technology to enhance people’s facial or body features, thereby allowing identity information to leak. It is worth noting that the positive social impact of image super-resolution far outweighs the potential problems. We call on people to use this technology and its derivative applications without harming the personal interests of the public.

References

  • N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, pp. 252–268. Cited by: §1, §1, §2, Figure 3, Figure 4, §4.2, §4.2, §4.2, Table 1.
  • P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. PAMI 33 (5), pp. 898–916. Cited by: Figure 1, §4.1.
  • M. Bevilacqua, A. Roumy, C. Guillemot, and M. A. Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pp. 135.1–135.10. Cited by: §4.1.
  • H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao (2021) Pre-trained image processing transformer. In CVPR, pp. 12299–12310. Cited by: §2.
  • S. Cho, S. Ji, J. Hong, S. Jung, and S. Ko (2021) Rethinking coarse-to-fine approach in single image deblurring. In ICCV, pp. 4641–4650. Cited by: §3.2, §4.1.
  • X. Chu, B. Zhang, H. Ma, R. Xu, and Q. Li (2021) Fast, accurate and lightweight super-resolution with neural architecture search. In ICPR, pp. 59–64. Cited by: §1, §2, §4.2, Table 1.
  • X. Ding, X. Zhang, Y. Zhou, J. Han, G. Ding, and J. Sun (2022) Scaling up your kernels to 31x31: revisiting large kernel design in cnns. arXiv preprint arXiv:2203.06717. Cited by: §1, §1, §2, §4.2.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2016a) Image super-resolution using deep convolutional networks. PAMI 38 (2), pp. 295–307. Cited by: §1, §4.2, Table 1.
  • C. Dong, C. C. Loy, and X. Tang (2016b) Accelerating the super-resolution convolutional neural network. In ECCV, pp. 391–407. Cited by: §1, §1, §2, §4.2, Table 1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. ICLR. Cited by: §1, §2, §2.
  • S. Elfwing, E. Uchibe, and K. Doya (2018) Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, pp. 3–11. Cited by: §3.1.
  • Q. Gao, Y. Zhao, G. Li, and T. Tong (2019) Image super-resolution using knowledge distillation. In ACCV, pp. 527–541. Cited by: §1, §2.
  • Z. He, T. Dai, J. Lu, Y. Jiang, and S. Xia (2020) Fakd: feature-affinity based knowledge distillation for efficient image super-resolution. In ICIP, pp. 518–522. Cited by: §1, §2.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §3.1.
  • J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In CVPR, pp. 5197–5206. Cited by: Figure 4, §4.1, §4.2.
  • Z. Hui, X. Gao, Y. Yang, and X. Wang (2019) Lightweight image super-resolution with information multi-distillation network. In ACM MM, pp. 2024–2032. Cited by: §1, §1, §2, Figure 3, §4.2, §4.2, Table 1.
  • J. Kim, J. Kwon Lee, and K. Mu Lee (2016a) Deeply-recursive convolutional network for image super-resolution. In CVPR, pp. 1637–1645. Cited by: Figure 3, §4.2, Table 1.
  • J. Kim, J. K. Lee, and K. M. Lee (2016b) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: Figure 3, §4.2, Table 1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, Cited by: §2.
  • W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pp. 624–632. Cited by: Figure 3, Figure 4, §4.2, §4.2, Table 1.
  • W. Li, K. Zhou, L. Qi, N. Jiang, J. Lu, and J. Jia (2020) LAPAR: linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. In NeurIPS, pp. 20343–20355. Cited by: Figure 4, §4.1, §4.2, §4.2, §4.2, Table 1.
  • Y. Li, K. Zhang, L. V. Gool, R. Timofte, et al. (2022) NTIRE 2022 challenge on efficient super-resolution: methods and results. In CVPR Workshops, Cited by: ShuffleMixer: An Efficient ConvNet for Image Super-Resolution, §1, §2, §4.1.
  • J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte (2021) SwinIR: image restoration using swin transformer. In ICCV Workshops, pp. 1833–1844. Cited by: §1, §2, §4.1.
  • B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, pp. 1132–1140. Cited by: §1, §4.1, §4.2, §4.3, Table 1.
  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In ICCV, pp. 10012–10022. Cited by: §1, §2, §2.
  • Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie (2022) A convnet for the 2020s. arXiv preprint arXiv:2201.03545. Cited by: §1, §1, §2.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet v2: practical guidelines for efficient cnn architecture design. In ECCV, pp. 116–131. Cited by: §1, §2, §3.1.
  • Y. Matsui, K. Ito, Y. Aramaki, T. Yamasaki, and K. Aizawa (2015) Sketch-based manga retrieval using manga109 dataset. arXiv preprint arXiv:1510.04389. Cited by: §4.1.
  • S. Mehta and M. Rastegari (2022) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. In ICLR, Cited by: §1.
  • C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, pp. 4353–4361. Cited by: §2.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1, §2, §3.1.
  • W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883. Cited by: §1, §2, §3.1, Table 1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, and Y. Wang (2020) Efficient residual dense block search for image super-resolution. In AAAI, pp. 12007–12014. Cited by: §1, §2.
  • M. Tan and Q. Le (2021) EfficientNetV2: smaller models and faster training. In ICML, pp. 10096–10106. Cited by: §1, §3.1.
  • R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §4.1, §4.3, Table 2.
  • I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy (2021) MLP-mixer: an all-mlp architecture for vision. In NeurIPS, pp. 24261–24272. Cited by: §2.
  • A. Trockman and J. Z. Kolter (2022) Patches are all you need?. In ICLR, Cited by: §1, §2, §3.1, §4.3.
  • L. Wang, X. Dong, Y. Wang, X. Ying, Z. Lin, W. An, and Y. Guo (2021) Exploring sparsity in image super-resolution for efficient inference. In CVPR, pp. 4917–4926. Cited by: §4.2, §4.2, Table 1.
  • X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV Workshops, Cited by: §4.3.
  • R. Zeyde, M. Elad, and M. Protter (2012) On single image scale-up using sparse-representations. In Curves and Surfaces, pp. 711–730. Cited by: §4.1.
  • K. Zhang, M. Danelljan, Y. Li, and et al. (2020) AIM 2020 challenge on efficient super-resolution: methods and results. In ECCV Workshops, pp. 5–40. Cited by: §1, §2.
  • X. Zhang, H. Zeng, and L. Zhang (2021) Edge-oriented convolution block for real-time super resolution on mobile devices. In ACM MM, pp. 4034–4043. Cited by: §1, §1, §4.2, Table 1.
  • Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301. Cited by: §1.