SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers

08/14/2023
by   Xijun Wang, et al.

This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. Nowadays, CNNs and Transformers have been successful in a variety of tasks. Especially for Transformers, a growing number of works achieve state-of-the-art performance in the computer vision community. Therefore, researchers have started to explore the mechanisms behind these architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weight have been considered keys to designing effective base models. However, some issues remain to be addressed: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Inspired by the above analyses and to solve the mentioned problems, we design a general module that incorporates these design keys to enhance both CNNs and Transformers. SCSC introduces an efficient spatial cross-scale encoder and a spatial embed module to capture assorted features in one layer. On the face recognition task, FaceResNet with SCSC improves accuracy by 2.7% with fewer FLOPs and parameters. On the ImageNet classification task, Swin Transformer with SCSC achieves even better performance with 22% fewer FLOPs, and ResNet with SCSC improves by 5.3% with similar complexity. Furthermore, a traditional network (e.g., ResNet) embedded with SCSC can match Swin Transformer's performance.
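To make the design keys concrete, below is a minimal PyTorch sketch of a spatial cross-scale block in this spirit: the input channels are split into groups, each group is processed by a cheap depthwise convolution at a different dilation (i.e., scale), and a pointwise convolution fuses the groups. The class name SCSCSketch, the dilation choices, and the residual fusion are illustrative assumptions, not the paper's exact SCSC implementation.

```python
import torch
import torch.nn as nn

class SCSCSketch(nn.Module):
    """Minimal sketch of a spatial cross-scale convolution block.

    Assumptions (not from the paper): channels split evenly across
    scales, depthwise 3x3 convs with growing dilation per scale, and
    a pointwise fusion step with a residual connection.
    """

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        assert channels % len(dilations) == 0
        self.split = channels // len(dilations)
        # One cheap depthwise conv per channel group; dilation widens
        # the receptive field without a large dense kernel.
        self.branches = nn.ModuleList(
            nn.Conv2d(self.split, self.split, kernel_size=3,
                      padding=d, dilation=d, groups=self.split)
            for d in dilations
        )
        # Pointwise conv plays the role of a fusion / "spatial embed"
        # step, mixing information across the scale-specific groups.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.split(x, self.split, dim=1)
        multi_scale = torch.cat(
            [branch(g) for branch, g in zip(self.branches, groups)],
            dim=1,
        )
        return x + self.act(self.fuse(multi_scale))

x = torch.randn(1, 96, 56, 56)   # e.g. a Swin-T stage-1 feature map
print(SCSCSketch(96)(x).shape)   # torch.Size([1, 96, 56, 56])
```

Depthwise per-scale branches keep the FLOP count low while still covering both small and large receptive fields in a single layer, which matches the efficiency motivation stated in the abstract.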

Related research

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? (09/12/2021)
Transformers have sprung up in the field of computer vision. In this wor...

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations (11/25/2022)
The formidable accomplishment of Transformers in natural language proces...

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition (02/03/2023)
As a de facto solution, the vanilla Vision Transformers (ViTs) are encou...

SLIC: Self-Conditioned Adaptive Transform with Large-Scale Receptive Fields for Learned Image Compression (04/19/2023)
Learned image compression has achieved remarkable performance. Transform...

LightViT: Towards Light-Weight Convolution-Free Vision Transformers (07/12/2022)
Vision transformers (ViTs) are usually considered to be less light-weigh...

Scaling up Kernels in 3D CNNs (06/21/2022)
Recent advances in 2D CNNs and vision transformers (ViTs) reveal that la...

DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers (09/21/2021)
Dynamic networks have shown their promising capability in reducing theor...
