1 Introduction
Single image super-resolution (SISR) aims to recover a high-resolution image from a low-resolution one. This classic problem has attracted much attention recently due to the rapid development of high-definition devices such as Ultra-High Definition televisions, the Samsung Galaxy S22 Ultra, the iPhone 13 Pro Max, and the HUAWEI P50 Pro. It is therefore of great interest to develop an efficient and effective method to estimate high-resolution images that can be better displayed on these devices.
Recently, convolutional neural network (CNN) based SR models Dong et al. (2016a, b); Ahn et al. (2018); Hui et al. (2019); Lim et al. (2017); Zhang et al. (2018) have achieved impressive reconstruction performance. However, these networks extract local features hierarchically and rely heavily on stacking deeper or more complex models to enlarge the receptive field for performance improvements. As a result, the required computational budget makes these heavy SR models difficult to deploy on resource-constrained mobile devices in practical applications Zhang et al. (2021).
To reduce the complexity of heavy SR models, various methods have been proposed to shrink model size or speed up runtime, including efficient operation design Sandler et al. (2018); Ma et al. (2018); Tan and Le (2021); Dong et al. (2016b); Hui et al. (2019); Ahn et al. (2018); Shi et al. (2016); Zhang et al. (2020); Li et al. (2022); Liu et al. (2022), neural architecture search Chu et al. (2021); Song et al. (2020), knowledge distillation Gao et al. (2019); He et al. (2020), and structural re-parameterization Ding et al. (2022); Li et al. (2022); Zhang et al. (2021). These methods mainly rely on improved small spatial convolutions or advanced training strategies, and large kernel convolutions are rarely explored. Moreover, they mostly focus on a single efficiency indicator and do not perform well in real resource-constrained tasks. Thus, obtaining a better trade-off between complexity, latency, and SR quality is imperative.
A large receptive field involves more feature interactions, which helps reconstruct more refined results in tasks such as super-resolution that require dense per-pixel predictions. Recent vision transformer (ViT)-based approaches Dosovitskiy et al. (2021); Liu et al. (2021); Mehta and Rastegari (2022); Liang et al. (2021) employ a multi-head self-attention (MHSA) mechanism to learn long-range feature representations, leading to state-of-the-art performance in various vision tasks. However, MHSA is ill-suited to enlarging the receptive field of an efficient SR design: its complexity grows quadratically with the input resolution, which is usually large and kept constant during SR training. Regular convolution with large kernels is a simple but heavyweight way to obtain a large receptive field. To make large kernel convolutions practical, depth-wise convolutions with large kernel sizes Liu et al. (2022); Trockman and Kolter (2022); Ding et al. (2022) are an effective alternative. However, since depth-wise convolutions share connection weights across spatial locations and keep channels independent, capturing sufficient feature interactions is challenging. In lightweight network design, it is therefore essential to improve the learning capability of depth-wise convolutions (DW Convs).
In this paper, we develop a simple and effective network named ShuffleMixer that introduces large kernel convolutions into lightweight SR design. The core idea is to fuse non-local and local spatial information within a feature mixing block using few parameters and FLOPs. Specifically, we employ depth-wise convolutions with large kernel sizes to aggregate spatial information from a large region. For channel mixing, we introduce channel splitting and shuffling strategies to reduce model parameters and computational cost while improving network capability. We then build an effective shuffle mixer layer based on these two operators. To further improve the learning capability, we embed the Fused-MBConv into the mixer layer to boost local connectivity. Taken together, we find that the ShuffleMixer network, built from these simple modules, obtains state-of-the-art performance. Figure 1 shows that our ShuffleMixer achieves a better trade-off, with the fewest parameters and FLOPs among existing lightweight SR methods.
The contributions of this paper are summarized as follows: (1) We develop an efficient SR design by exploring a large-kernel ConvNet that involves more useful information for image SR. (2) We introduce a channel splitting and shuffling operation to perform the feature mixing of the channel projection efficiently. (3) To better exploit the local connectivity among cross-group features from the shuffle mixer layer, we utilize Fused-MBConvs in the proposed SR design. We formulate the aforementioned modules into an end-to-end trainable network named ShuffleMixer. Experimental results show that ShuffleMixer is substantially smaller than state-of-the-art methods in terms of model parameters and FLOPs while achieving competitive performance.
2 Related Work
CNN-based Efficient SR. CNN-based methods adopt various ways to reduce model complexity. FSRCNN Dong et al. (2016b) and ESPCN Shi et al. (2016) employ post-upsampling layers to significantly reduce the computational burden caused by predefined upsampled inputs. Ahn et al. (2018) use group convolution and cascading connections upon a recursive network to save parameters. Hui et al. (2019) propose a lightweight information multi-distillation network (IMDN) to aggregate features by applying feature splitting and concatenation operations, and the improved IMDN variants Zhang et al. (2020); Li et al. (2022) won the AIM2020 and NTIRE2022 Efficient SR challenges. Meanwhile, an increasingly popular approach is to search for a well-constrained architecture by treating the design as a multi-objective evolution problem Chu et al. (2021); Song et al. (2020). Another branch compresses and accelerates a heavy deep model through knowledge distillation He et al. (2020); Gao et al. (2019). Note that fewer parameters and FLOPs do not necessarily mean faster runtime on mobile devices, because FLOPs ignore several important latency-related factors such as memory access cost (MAC) and the degree of parallelism Ma et al. (2018); Sandler et al. (2018). In this paper, we analyze the factors affecting the efficiency of SR models and develop a mobile-friendly SR network.
Transformer-based SR. Transformers were initially proposed for language tasks and stack multi-head self-attention and feed-forward MLP layers to learn long-range relations among their inputs. Dosovitskiy et al. (2021) first applied a vision transformer to image recognition. Since then, ViT-based models have become increasingly applicable to both high-level and low-level vision tasks. For image super-resolution, Chen et al. (2021) develop a pre-trained image processing transformer (IPT) that directly applies the vanilla ViT to non-overlapped patch embeddings. Liang et al. (2021) follow the Swin Transformer Liu et al. (2021) and propose a window-based self-attention model for image restoration tasks, achieving excellent results. Window-based self-attention is much more computationally efficient than global self-attention, but it is still a time-consuming and memory-intensive operation.
Models with Large Kernels. AlexNet Krizhevsky et al. (2012) is a classic large-kernel convolutional neural network that inspired many subsequent works. The Global Convolutional Network Peng et al. (2017) uses symmetric, separable large filters to improve semantic segmentation performance. Due to their high computational cost and large number of parameters, large convolutional filters fell out of favor after VGG-Net Simonyan and Zisserman (2014). However, large convolution kernels have recently regained attention with the development of efficient convolution techniques and new architectures such as transformers and MLPs. ConvMixer Trockman and Kolter (2022) replaces the mixer component of ViTs Liu et al. (2021); Dosovitskiy et al. (2021) or MLPs Tolstikhin et al. (2021) with large-kernel depth-wise convolutions. ConvNeXt Liu et al. (2022) uses depth-wise kernels to redesign a standard ResNet and achieves results comparable to Transformers. RepLKNet Ding et al. (2022) enlarges the convolution kernel to 31×31 to build a pure CNN model, which obtains better results than the Swin Transformer Liu et al. (2021) on ImageNet. Unlike these methods that focus on building big models for high-level vision tasks, we explore the possibility of large convolution kernels for lightweight model design in image super-resolution.
3 Proposed Method
We aim to develop an efficient large-kernel CNN model for the SISR task. To meet the efficiency goal, we introduce key designs into the feature mixing block, which is employed to encode information efficiently. This section first presents the overall pipeline of our proposed ShuffleMixer network. Then, we formulate the feature mixing block, which acts as the basic module for building the ShuffleMixer network. Finally, we provide details of the training loss function.
3.1 ShuffleMixer Architecture
The overall ShuffleMixer architecture. Given a low-resolution image $I_{LR} \in \mathbb{R}^{H \times W \times C}$, where $C$ and $H \times W$ denote the number of channels and the spatial resolution, respectively; for a color image, $C$ is 3. The proposed ShuffleMixer first extracts shallow features $F_0$ from $I_{LR}$ with a convolutional layer. Then, we develop a feature mixing block (FMB) consisting of two shuffle mixer layers and a Fused-MBConv Tan and Le (2021), which takes $F_0$ as input and produces deeper features. Next, we utilize an upsampler module with a scale factor $s$ to upscale the spatial resolution of the features generated by a sequence of FMBs. To keep the upsampling module as small as possible, we only use a convolutional layer and a pixel shuffling layer Shi et al. (2016). For the $\times 4$ scale factor, we progressively upsample the resolution by applying the upsampler twice. Finally, we use a convolutional layer to map the upscaled features to the residual image $I_R$ and add it to the bilinearly upscaled input to obtain the final high-resolution image: $I_{SR} = I_R + \mathcal{B}_s(I_{LR})$, where $\mathcal{B}_s$ denotes bilinear interpolation with scale factor $s$. In the following, we explain the proposed method in detail.
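As a rough illustration of this pipeline, the following PyTorch sketch wires the pieces together under several assumptions: the head and tail convolutions are taken to be 3×3, the pre-shuffle convolution to be 1×1, and `fmb_placeholder` stands in for the Feature Mixing Block described next; none of these choices are asserted to match the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fmb_placeholder(channels):
    # Stand-in for the Feature Mixing Block (two shuffle mixer layers + Fused-MBConv), sketched later.
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU())


class ShuffleMixerSketch(nn.Module):
    def __init__(self, channels=64, n_blocks=5, scale=2, block=fmb_placeholder):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, channels, 3, padding=1)              # shallow feature extraction
        self.body = nn.Sequential(*[block(channels) for _ in range(n_blocks)])
        # one (conv + pixel shuffle) step per x2/x3 factor; x4 upsamples progressively in two x2 steps
        factors = [2, 2] if scale == 4 else [scale]
        self.up = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels * r * r, 1), nn.PixelShuffle(r)) for r in factors
        ])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)              # map upscaled features to the residual image

    def forward(self, lr):
        feat = self.body(self.head(lr))
        residual = self.tail(self.up(feat))
        base = F.interpolate(lr, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return residual + base                                        # I_SR = I_R + bilinear(I_LR)


if __name__ == "__main__":
    sr = ShuffleMixerSketch(scale=2)(torch.randn(1, 3, 48, 48))
    print(sr.shape)  # torch.Size([1, 3, 96, 96])
```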
An overview of the proposed ShuffleMixer. Our method includes feature extraction, feature mixing, and upsampling. The key component is the Feature Mixing Block (FMB), containing two (a) shuffle mixer layers and one (d) Fused-MBConv (FMBConv). Each shuffle mixer layer is composed of two (b) channel projection modules and a large kernel depth-wise convolution. Channel projection then includes channel splitting and shuffling operations, (c) point-wise MLP layers, skip connections, and layer norms.
The Feature Mixing Block is developed to explore local and non-local information for better feature representations. To effectively obtain non-local feature interactions, we apply shuffle mixer layers to the extracted features, as illustrated in Figure 2(a). In each shuffle mixer layer, we employ a large-kernel DW Conv to mix features at spatial locations. This operation enjoys a large effective receptive field with few parameters, which helps encode more spatial information to reconstruct complete and accurate structures. As shown in Table 3, depth-wise convolutions with larger kernel sizes consistently improve SR performance while maintaining computational efficiency.
To mix features at channel locations, we employ point-wise MLPs to perform channel projection. With the help of depth-wise convolution, the computational cost of the shuffle mixer layer is mainly caused by the channel projections. We further introduce a channel splitting and shuffling (CSS) strategy Ma et al. (2018) to improve the efficiency of this step. Specifically, the input feature $X$ is first split into $X_1$ and $X_2$ along the channel dimension; a point-wise MLP then performs channel mixing on the split feature $X_1$; finally, a channel shuffling operation enables the exchange of information across the concatenated feature. Since each point-wise convolution now operates on half of the channels, its parameter complexity drops from $C^2$ to $(C/2)^2$. This procedure can be formulated as follows:
$$[X_1, X_2] = \mathrm{Split}(X), \qquad \hat{X}_1 = W_2\,\sigma(W_1 X_1), \qquad \hat{X} = \mathrm{Shuffle}([\hat{X}_1, X_2]),$$
where $\sigma$ is the SiLU nonlinearity function Elfwing et al. (2018), $W_1$ and $W_2$ are the point-wise convolutions, and $\mathrm{Split}(\cdot)$ and $\mathrm{Shuffle}(\cdot)$ represent the splitting and shuffling of features in the channel dimension. This splitting operation limits representational capability since it excludes the other half of the input tensor from channel interactions, and the channel shuffle operation cannot guarantee that all features are processed. Inspired by the MobileNetv2 block Sandler et al. (2018), we therefore repeat the channel projection layer and arrange the two copies before and after the large depth-wise convolution to learn visual representations. As listed in Table 2, the enhanced mixer layer achieves results quite similar to the ConvMixer block Trockman and Kolter (2022) while using fewer parameters and FLOPs.
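The shuffle mixer layer described above can be sketched in PyTorch as follows. This is a minimal interpretation rather than the official implementation: LayerNorm is approximated with `GroupNorm(1, ·)` over channels, the point-wise MLP is taken as two 1×1 convolutions with a SiLU in between (the hidden width is assumed equal to the input width), and the skip connection and residual placement are assumptions.

```python
import torch
import torch.nn as nn


class ChannelProjection(nn.Module):
    """Channel splitting -> point-wise MLP on one half -> channel shuffle (sketch)."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.norm = nn.GroupNorm(1, half)                      # channel-wise LayerNorm stand-in
        self.mlp = nn.Sequential(nn.Conv2d(half, half, 1), nn.SiLU(), nn.Conv2d(half, half, 1))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                             # channel splitting
        x1 = x1 + self.mlp(self.norm(x1))                      # mix only half of the channels (skip assumed)
        x = torch.cat([x1, x2], dim=1)
        b, c, h, w = x.shape                                   # channel shuffle across the two groups
        return x.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)


class ShuffleMixerLayer(nn.Module):
    """Two channel projections placed before and after a large-kernel depth-wise convolution."""
    def __init__(self, dim, kernel_size=7):                    # kernel size here is illustrative
        super().__init__()
        self.proj_in = ChannelProjection(dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj_out = ChannelProjection(dim)

    def forward(self, x):
        x = self.proj_in(x)
        x = x + self.dwconv(x)                                 # spatial mixing over a large receptive field
        return self.proj_out(x)
```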
Since the content of natural images is locally correlated, the stacked FMB modules do not fully exploit local features, and more capacity is required to model feature representations for better SR performance. Therefore, we embed a few convolutional blocks into the proposed model to enhance local connectivity. Concretely, we add a Fused-MBConv after every two shuffle mixer layers. The original Fused-MBConv contains a 3×3 expansion convolution, an SE layer Hu et al. (2018) (i.e., the commonly used channel attention), and a 1×1 reduction convolution. Using such a Fused-MBConv significantly increases parameters and FLOPs, which motivated us to make some changes to match our computational budget. We first remove the SE layer, as the SiLU function can be treated as a gating mechanism to some extent. We also note that inference becomes much slower as the hidden dimension expands. Instead of expanding the hidden channels rapidly with a large factor (the default expansion factor is usually set to 6), we limit the number of output channels of this expansion convolution to a modest expansion (experimentally set to 16), as shown in Figure 2(d). We also study several operations for this mixing process; more details can be found in Sec 4.3.
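A sketch of this slimmed Fused-MBConv, under the assumption that the 3×3 expansion adds a fixed small number of channels (interpreting the experimentally chosen 16) and that the block keeps a residual connection; the SE layer of the original Fused-MBConv is dropped as described above.

```python
import torch.nn as nn


class SlimFusedMBConv(nn.Module):
    """Fused-MBConv without the SE layer and with only a small channel expansion (sketch)."""
    def __init__(self, dim, extra=16):                        # `extra` interprets the paper's 16; the exact rule may differ
        super().__init__()
        hidden = dim + extra
        self.expand = nn.Conv2d(dim, hidden, 3, padding=1)    # 3x3 expansion convolution
        self.act = nn.SiLU()                                  # SiLU doubles as a soft gating mechanism
        self.reduce = nn.Conv2d(hidden, dim, 1)               # 1x1 reduction convolution

    def forward(self, x):
        return x + self.reduce(self.act(self.expand(x)))      # residual connection assumed
```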
3.2 Learning Strategy
To constrain the network training, a straightforward way is to ensure that the content of the network output is close to that of the ground truth image:
$$\mathcal{L}_{pixel} = \|I_{SR} - I_{HR}\|_1,$$
where $I_{SR}$ and $I_{HR}$ denote the output result and the corresponding ground truth HR image. We note that only using the pixel-wise loss function does not effectively help the estimation of high-frequency details Cho et al. (2021). We accordingly employ a frequency constraint to regularize network training. The proposed loss function for the network training is defined as:
$$\mathcal{L} = \|I_{SR} - I_{HR}\|_1 + \lambda\,\|\mathcal{F}(I_{SR}) - \mathcal{F}(I_{HR})\|_1,$$
where $\mathcal{F}$ denotes the Fast Fourier transform and $\lambda$ is a weight parameter that is set to 0.1 empirically.
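A compact sketch of this training objective; the L1 distance in the frequency domain follows the frequency reconstruction loss of Cho et al. (2021) and is assumed here to be the form used in the paper.

```python
import torch


def sr_loss(sr, hr, lam=0.1):
    """Pixel-wise L1 loss plus an FFT-domain L1 constraint weighted by lam (0.1 in the paper)."""
    pixel = torch.mean(torch.abs(sr - hr))
    freq = torch.mean(torch.abs(torch.fft.fft2(sr) - torch.fft.fft2(hr)))
    return pixel + lam * freq


# usage: loss = sr_loss(model(lr_batch), hr_batch)
```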
4 Experimental Results
4.1 Datasets and implementation
Datasets. Following existing methods Li et al. (2020); Liang et al. (2021); Li et al. (2022), we train our models on the DF2K dataset, a merger of DIV2K Timofte et al. (2017) and Flickr2K Lim et al. (2017), which contains 3450 (800 + 2650) high-quality images. We adopt standard protocols to generate LR images by bicubic downscaling of the reference HR images. During the testing stage, we evaluate our models with the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) on five publicly available benchmark datasets: Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2012), B100 Arbelaez et al. (2011), Urban100 Huang et al. (2015), and Manga109 Matsui et al. (2015). All PSNR and SSIM results are calculated on the Y channel of the YCbCr color space.
Implementation details. We train our model in RGB channels and augment the input patches with random horizontal flips and rotations. In each training mini-batch, we randomly crop 64 patches from the LR images as the input. The proposed model is trained by minimizing the L1 loss and the frequency loss Cho et al. (2021) with the Adam Kingma and Ba (2015) optimizer for 300,000 total iterations. The learning rate is kept constant throughout training. All experiments are conducted with the PyTorch framework on an Nvidia Tesla V100 GPU.
We provide two models that differ in the number of feature channels and the DW Conv kernel size; both use 5 FMB modules. The ShuffleMixer model uses 64 channels and the ShuffleMixer-Tiny model uses 32 channels, with different depth-wise convolution kernel sizes. The training code and models will be available to the public.
Table 1: Quantitative comparisons with lightweight SR methods (Params, FLOPs, and PSNR/SSIM on five benchmarks). ×2 upscaling:

| Method | Params | FLOPs | Set5 | Set14 | B100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SRCNN Dong et al. (2016a) | 57K | 53G | 36.66/0.9542 | 32.42/0.9063 | 31.36/0.8879 | 29.50/0.8946 | 35.74/0.9661 |
| FSRCNN Dong et al. (2016b) | 12K | 6G | 37.00/0.9558 | 32.63/0.9088 | 31.53/0.8920 | 29.88/0.9020 | 36.67/0.9694 |
| ESPCN Shi et al. (2016) | 21K | 5G | 36.83/0.9564 | 32.40/0.9096 | 31.29/0.8917 | 29.48/0.8975 | - |
| VDSR Kim et al. (2016b) | 665K | 613G | 37.53/0.9587 | 33.03/0.9124 | 31.90/0.8960 | 30.76/0.9140 | 37.22/0.9729 |
| DRCN Kim et al. (2016a) | 1,774K | 17,974G | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 | 37.63/0.9723 |
| LapSRN Lai et al. (2017) | 813K | 30G | 37.52/0.9590 | 33.08/0.9130 | 31.80/0.8950 | 30.41/0.9100 | 37.27/0.9740 |
| CARN-M Ahn et al. (2018) | 412K | 91G | 37.53/0.9583 | 33.26/0.9141 | 31.92/0.8960 | 31.23/0.9193 | - |
| CARN Ahn et al. (2018) | 1,592K | 223G | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | - |
| EDSR-baseline Lim et al. (2017) | 1,370K | 316G | 37.99/0.9604 | 33.57/0.9175 | 32.16/0.8994 | 31.98/0.9272 | 38.54/0.9769 |
| FALSR-A Chu et al. (2021) | 1021K | 235G | 37.82/0.9595 | 33.55/0.9168 | 32.12/0.8987 | 31.93/0.9256 | - |
| IMDN Hui et al. (2019) | 694K | 161G | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 | 38.88/0.9774 |
| LAPAR-C Li et al. (2020) | 87K | 35G | 37.65/0.9593 | 33.20/0.9141 | 31.95/0.8969 | 31.10/0.9178 | 37.75/0.9752 |
| LAPAR-A Li et al. (2020) | 548K | 171G | 38.01/0.9605 | 33.62/0.9183 | 32.19/0.8999 | 32.10/0.9283 | 38.67/0.9772 |
| ECBSR-M16C64 Zhang et al. (2021) | 596K | 137G | 37.90/0.9615 | 33.34/0.9178 | 32.10/0.9018 | 31.71/0.9250 | - |
| SMSR Wang et al. (2021) | 985K | 132G | 38.00/0.9601 | 33.64/0.9179 | 32.17/0.8990 | 32.19/0.9284 | 38.76/0.9771 |
×3 upscaling:

| Method | Params | FLOPs | Set5 | Set14 | B100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SRCNN Dong et al. (2016a) | 57K | 53G | 32.75/0.9090 | 29.28/0.8209 | 28.41/0.7863 | 26.24/0.7989 | 30.59/0.9107 |
| FSRCNN Dong et al. (2016b) | 12K | 5G | 33.16/0.9140 | 29.43/0.8242 | 28.53/0.7910 | 26.43/0.8080 | 30.98/0.9212 |
| VDSR Kim et al. (2016b) | 665K | 613G | 33.66/0.9213 | 29.77/0.8314 | 28.82/0.7976 | 27.14/0.8279 | 32.01/0.9310 |
| DRCN Kim et al. (2016a) | 1,774K | 17,974G | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.15/0.8276 | 32.31/0.9328 |
| CARN-M Ahn et al. (2018) | 412K | 46G | 33.99/0.9236 | 30.08/0.8367 | 28.91/0.8000 | 27.55/0.8385 | - |
| CARN Ahn et al. (2018) | 1,592K | 119G | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | - |
| EDSR-baseline Lim et al. (2017) | 1,555K | 160G | 34.37/0.9270 | 30.28/0.8417 | 29.09/0.8052 | 28.15/0.8527 | 33.45/0.9439 |
| IMDN Hui et al. (2019) | 703K | 72G | 34.36/0.9270 | 30.32/0.8417 | 29.09/0.8046 | 28.17/0.8519 | 33.61/0.9445 |
| LAPAR-C Li et al. (2020) | 99K | 28G | 33.91/0.9235 | 30.02/0.8358 | 28.90/0.7998 | 27.42/0.8355 | 32.54/0.9373 |
| LAPAR-A Li et al. (2020) | 594K | 114G | 34.36/0.9267 | 30.34/0.8421 | 29.11/0.8054 | 28.15/0.8523 | 33.51/0.9441 |
| SMSR Wang et al. (2021) | 993K | 68G | 34.40/0.9270 | 30.33/0.8412 | 29.10/0.8050 | 28.25/0.8536 | 33.68/0.9445 |
×4 upscaling:

| Method | Params | FLOPs | Set5 | Set14 | B100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SRCNN Dong et al. (2016a) | 57K | 53G | 30.48/0.8628 | 27.49/0.7503 | 26.90/0.7101 | 24.52/0.7221 | 27.66/0.8505 |
| FSRCNN Dong et al. (2016b) | 12K | 5G | 30.71/0.8657 | 27.59/0.7535 | 26.98/0.7150 | 24.62/0.7280 | 27.90/0.8517 |
| ESPCN Shi et al. (2016) | 25K | 1G | 30.52/0.8697 | 27.42/0.7606 | 26.87/0.7216 | 24.39/0.7241 | - |
| VDSR Kim et al. (2016b) | 665K | 613G | 31.35/0.8838 | 28.01/0.7674 | 27.29/0.7251 | 25.18/0.7524 | 28.83/0.8809 |
| DRCN Kim et al. (2016a) | 1,774K | 17,974G | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.14/0.7510 | 28.98/0.8816 |
| LapSRN Lai et al. (2017) | 813K | 149G | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7280 | 25.21/0.7560 | 29.09/0.8845 |
| CARN-M Ahn et al. (2018) | 412K | 33G | 31.92/0.8903 | 28.42/0.7762 | 27.44/0.7304 | 25.62/0.7694 | - |
| CARN Ahn et al. (2018) | 1,592K | 91G | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | - |
| EDSR-baseline Lim et al. (2017) | 1,518K | 114G | 32.09/0.8938 | 28.58/0.7813 | 27.57/0.7357 | 26.04/0.7849 | 30.35/0.9067 |
| IMDN Hui et al. (2019) | 715K | 41G | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 | 30.45/0.9075 |
| LAPAR-C Li et al. (2020) | 115K | 25G | 31.72/0.8884 | 28.31/0.7740 | 27.40/0.7292 | 25.49/0.7651 | 29.50/0.8951 |
| LAPAR-A Li et al. (2020) | 659K | 94G | 32.15/0.8944 | 28.61/0.7818 | 27.61/0.7366 | 26.14/0.7871 | 30.42/0.9074 |
| ECBSR-M16C64 Zhang et al. (2021) | 603K | 35G | 31.92/0.8946 | 28.34/0.7817 | 27.48/0.7393 | 25.81/0.7773 | - |
| SMSR Wang et al. (2021) | 1006K | 42G | 32.12/0.8932 | 28.55/0.7808 | 27.55/0.7351 | 26.11/0.7868 | 30.54/0.9085 |
Figure 3: Visual comparisons of (a) HR patch, (b) Bicubic, (c) VDSR Kim et al. (2016b), (d) DRCN Kim et al. (2016a), (e) LapSRN Lai et al. (2017), (f) CARN Ahn et al. (2018), (g) IMDN Hui et al. (2019), and (h) ShuffleMixer on ppt3 from Set14 and img078 and img095 from Urban100.
Figure 4: Visual comparisons of (a) LR patch, (b) Bicubic, (c) SelfEx Huang et al. (2015), (d) CARN Ahn et al. (2018), (e) LAPAR-A Li et al. (2020), and (f) ShuffleMixer on img004 from the historical dataset.
4.2 Comparisons with State-of-the-Art Methods
To evaluate the performance of our approach, we compare the proposed ShuffleMixer with state-of-the-art lightweight frameworks, including SRCNN Dong et al. (2016a), FSRCNN Dong et al. (2016b), VDSR Kim et al. (2016b), DRCN Kim et al. (2016a), LapSRN Lai et al. (2017), CARN Ahn et al. (2018), EDSR-baseline Lim et al. (2017), FALSR-A Chu et al. (2021), IMDN Hui et al. (2019), LAPAR Li et al. (2020), ECBSR Zhang et al. (2021), and SMSR Wang et al. (2021).
Table 1 shows quantitative comparisons on benchmark datasets for the upscaling factors of ×2, ×3, and ×4. In addition to the PSNR/SSIM metrics, we also list the number of parameters and FLOPs. The number of FLOPs is measured when super-resolving an LR image to a fixed HR output resolution. In Figure 1, we compare FLOPs and the number of parameters on the B100 dataset. Our ShuffleMixer model obtains competitive results with even fewer parameters and FLOPs. In particular, ShuffleMixer has a similar number of parameters to CARN-M, but our model outperforms it by a large margin on all benchmark datasets. Even with only 113K parameters, ShuffleMixer-Tiny achieves better performance than many existing methods. The proposed ShuffleMixer family achieves similarly strong performance for the ×3 and ×4 scale factors.
Although IMDN Hui et al. (2019), LAPAR-A Li et al. (2020), and SMSR Wang et al. (2021) obtain comparable PSNR/SSIM performance, ShuffleMixer requires considerably less model complexity. Meanwhile, we compare the GPU runtime with fast and lightweight SR models: CARN Ahn et al. (2018), CARN-M Ahn et al. (2018), and LAPAR-A Li et al. (2020); the proposed method has a fast inference speed. Our ShuffleMixer-Tiny and ShuffleMixer take 0.016s and 0.021s, respectively, to reconstruct an HR image. As a comparison, the runtimes are 0.017s, 0.019s, and 0.031s for CARN-M, CARN, and LAPAR-A. Note that PyTorch has poor support for large-kernel depth-wise convolution; employing optimized depth-wise convolutions can further accelerate our method, as suggested in Ding et al. (2022). All these results demonstrate the effectiveness of our method.
Figure 3 presents visual comparisons on the Set14 and Urban100 datasets. The qualitative results demonstrate that our proposed method produces more visually pleasing results, with better recovered structures and details.
We further evaluate our approach on real low-quality images. One example from the historical dataset Lai et al. (2017) is shown in Figure 4. The results by Huang et al. (2015); Li et al. (2020) show visible artifacts. Our method and CARN Ahn et al. (2018) generate smooth details, but our results have a clearer structure.
Table 2: Ablation studies on (a) the shuffle mixer layer and (b) the feature mixing block.
Table 3: Ablation study on the kernel size of the depth-wise convolution (columns: Kernel Size, LR Size, PSNR(dB)/SSIM, Params(K), FLOPs(G)).
4.3 Analysis and Discussions
The core ideas of ShuffleMixer lie in the shuffle mixer layer, the feature mixing block, and the large kernel convolution. In this subsection, we evaluate each of them on the proposed tiny model trained on the DIV2K dataset Timofte et al. (2017).
Effectiveness of the shuffle mixer layer. To verify the efficiency of the shuffle mixer layer, we use 10 ConvMixer Trockman and Kolter (2022) blocks to build a baseline model. Unlike the original ConvMixer module, we replace BatchNorm with LayerNorm and apply it only before the point-wise MLP layer, because BatchNorm tends to introduce artifacts in the generated results Lim et al. (2017); Wang et al. (2018). The kernel size of the depth-wise convolution is set to 3, and the number of channels is 32. When applying the channel splitting and shuffling (CSS) strategy, the number of parameters is reduced from 55.9K to 24.7K, but the performance is also 0.13dB lower than the baseline. This result reflects that the split operation limits the representational capability of the channel projection layer. To compensate for this PSNR drop, we repeat the CSS-based projection layer to enable more cross-group feature mixing (denoted by CDC). Table 2(a) shows a quantitative comparison, where we find that CDC achieves performance similar to the baseline model while reducing parameters from 55.9K to 35.5K and FLOPs from 5.2G to 3.8G.
Effectiveness of the feature mixing block. To validate the effectiveness of the proposed feature mixing block, we take the CDC model as the baseline and first embed a convolutional layer after every two shuffle mixer layers, which yields a gain of 0.13dB over the baseline. To further analyze the effect of the feature fusion manner, we study S-Conv (an element-wise summation of the input and output features followed by a convolutional layer) and C-Conv (a concatenation of the input and output features along the channel dimension followed by a convolutional layer). Table 2(b) shows that both improve over the baseline; C-Conv achieves better PSNR performance while incurring more computational cost. Figure 5 shows the average feature map along the channel axis before the upsampler module, which illustrates that enhancing local connectivity between feature elements helps capture finer high-frequency content. Based on S-Conv, we additionally replace the convolutional layer with basic residual blocks (S-ResBlock) and Fused-MBConv (S-FMBConv). Table 2(b) shows that S-FMBConv obtains a balanced trade-off between model complexity and SR performance. Thus, we choose S-FMBConv to strengthen the local connectivity between features in this paper.
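For reference, the S-Conv and C-Conv fusion variants compared in Table 2(b) can be sketched as follows; the 3×3 kernel size of the fusion convolution is an assumption, and S-FMBConv simply replaces the convolution in S-Conv with the slim Fused-MBConv sketched earlier.

```python
import torch
import torch.nn as nn


class SConvFusion(nn.Module):
    """S-Conv: element-wise sum of the block's input and output, followed by a convolution."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x_in, x_out):
        return self.conv(x_in + x_out)


class CConvFusion(nn.Module):
    """C-Conv: channel-wise concatenation of input and output, followed by a convolution back to dim channels."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, x_in, x_out):
        return self.conv(torch.cat([x_in, x_out], dim=1))
```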
Effectiveness of large depth-wise convolution. To demonstrate the effect of a large kernel, we test depth-wise convolutions with a range of kernel sizes and compare their performance. Table 3 shows that using a larger kernel size improves performance. In particular, the PSNR of the model with the kernel size we adopt is 0.07dB higher than that of the smallest kernel while only adding 12K parameters and 0.8G FLOPs. We also note that the performance gains become minor once the kernel grows beyond this size. Thus, we set the kernel size accordingly as a trade-off between accuracy and model complexity in this paper.
5 Conclusion
In this paper, we have proposed a lightweight deep model for image super-resolution. The proposed model, ShuffleMixer, contains a shuffle mixer layer with a large effective receptive field to extract non-local feature representations efficiently. We have introduced the Fused-MBConv to model the local connectivity of features generated by the shuffle mixer layer, which is critical for improving SR performance. We evaluate the proposed ShuffleMixer both qualitatively and quantitatively on commonly used benchmarks. Experimental results demonstrate that ShuffleMixer is much more efficient than state-of-the-art methods while achieving competitive performance.
This paper is an exploratory work on lightweight and efficient image super-resolution using a large-kernel ConvNet. The approach can be deployed in resource-constrained environments to improve image quality, for example when processing pictures taken by smartphones or reducing bandwidth during video calls and meetings. However, super-resolution technology can also have negative effects; for instance, criminals could use it to enhance facial or body features and thereby leak identity information. We note that the positive social impact of image super-resolution far outweighs the potential problems, and we call on people to use this technology and its derivative applications without harming the personal interests of the public.
- Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, pp. 252–268. Cited by: §1, §1, §2, Figure 3, Figure 4, §4.2, §4.2, §4.2, Table 1.
- Contour detection and hierarchical image segmentation. PAMI 33 (5), pp. 898–916. Cited by: Figure 1, §4.1.
- Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pp. 135.1–135.10. Cited by: §4.1.
- Pre-trained image processing transformer. In CVPR, pp. 12299–12310. Cited by: §2.
- Rethinking coarse-to-fine approach in single image deblurring. In ICCV, pp. 4641–4650. Cited by: §3.2, §4.1.
- Fast, accurate and lightweight super-resolution with neural architecture search. In ICPR, pp. 59–64. Cited by: §1, §2, §4.2, Table 1.
- Scaling up your kernels to 31x31: revisiting large kernel design in cnns. arXiv preprint arXiv:2203.06717. Cited by: §1, §1, §2, §4.2.
- Image super-resolution using deep convolutional networks. PAMI 38 (2), pp. 295–307. Cited by: §1, §4.2, Table 1.
- Accelerating the super-resolution convolutional neural network. In ECCV, pp. 391–407. Cited by: §1, §1, §2, §4.2, Table 1.
- An image is worth 16x16 words: transformers for image recognition at scale. ICLR. Cited by: §1, §2, §2.
- Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, pp. 3–11. Cited by: §3.1.
- Image super-resolution using knowledge distillation. In ACCV, pp. 527–541. Cited by: §1, §2.
- Fakd: feature-affinity based knowledge distillation for efficient image super-resolution. In ICIP, pp. 518–522. Cited by: §1, §2.
- Squeeze-and-excitation networks. In CVPR, Cited by: §3.1.
- Single image super-resolution from transformed self-exemplars. In CVPR, pp. 5197–5206. Cited by: Figure 4, §4.1, §4.2.
- Lightweight image super-resolution with information multi-distillation network. In ACM MM, pp. 2024–2032. Cited by: §1, §1, §2, Figure 3, §4.2, §4.2, Table 1.
- Deeply-recursive convolutional network for image super-resolution. In CVPR, pp. 1637–1645. Cited by: Figure 3, §4.2, Table 1.
- Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: Figure 3, §4.2, Table 1.
- Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.
- ImageNet classification with deep convolutional neural networks. In NeurIPS, Cited by: §2.
- Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pp. 624–632. Cited by: Figure 3, Figure 4, §4.2, §4.2, Table 1.
- LAPAR: linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. In NeurIPS, pp. 20343–20355. Cited by: Figure 4, §4.1, §4.2, §4.2, §4.2, Table 1.
- NTIRE 2022 challenge on efficient super-resolution: methods and results. In CVPR Workshops, Cited by: ShuffleMixer: An Efficient ConvNet for Image Super-Resolution, §1, §2, §4.1.
- SwinIR: image restoration using swin transformer. In ICCV Workshops, pp. 1833–1844. Cited by: §1, §2, §4.1.
- Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, pp. 1132–1140. Cited by: §1, §4.1, §4.2, §4.3, Table 1.
- Swin transformer: hierarchical vision transformer using shifted windows. In ICCV, pp. 10012–10022. Cited by: §1, §2, §2.
- A convnet for the 2020s. arXiv preprint arXiv:2201.03545. Cited by: §1, §1, §2.
- ShuffleNet v2: practical guidelines for efficient cnn architecture design. In ECCV, pp. 116–131. Cited by: §1, §2, §3.1.
- Sketch-based manga retrieval using manga109 dataset. arXiv preprint arXiv:1510.04389. Cited by: §4.1.
- MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. In ICLR, Cited by: §1.
- Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, pp. 4353–4361. Cited by: §2.
- MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1, §2, §3.1.
- Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pp. 1874–1883. Cited by: §1, §2, §3.1, Table 1.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
- Efficient residual dense block search for image super-resolution. In AAAI, pp. 12007–12014. Cited by: §1, §2.
- EfficientNetV2: smaller models and faster training. In ICML, pp. 10096–10106. Cited by: §1, §3.1.
- NTIRE 2017 challenge on single image super-resolution: methods and results. In CVPR Workshops, Cited by: §4.1, §4.3, Table 2.
- MLP-mixer: an all-mlp architecture for vision. In NeurIPS, pp. 24261–24272. Cited by: §2.
- Patches are all you need?. In ICLR, Cited by: §1, §2, §3.1, §4.3.
- Exploring sparsity in image super-resolution for efficient inference. In CVPR, pp. 4917–4926. Cited by: §4.2, §4.2, Table 1.
- ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV Workshops, Cited by: §4.3.
- On single image scale-up using sparse-representations. In Curves and Surfaces, pp. 711–730. Cited by: §4.1.
- AIM 2020 challenge on efficient super-resolution: methods and results. In ECCV Workshops, pp. 5–40. Cited by: §1, §2.
- Edge-oriented convolution block for real-time super resolution on mobile devices. In ACM MM, pp. 4034–4043. Cited by: §1, §1, §4.2, Table 1.
- Image super-resolution using very deep residual channel attention networks. In ECCV, pp. 286–301. Cited by: §1.