Are Large Kernels Better Teachers than Transformers for ConvNets?

05/30/2023 ∙ by Tianjin Huang, et al.
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets. While Transformers have achieved state-of-the-art (SOTA) performance in various fields with ever-larger models and labeled datasets, small-kernel ConvNets are considered more suitable for resource-limited applications due to their efficient convolution operations and compact weight sharing. KD is widely used to boost the performance of small-kernel ConvNets. However, previous research has shown that distilling knowledge (e.g., global information) from Transformers to small-kernel ConvNets is not very effective, presumably because of their disparate architectures. We hereby carry out a first-of-its-kind study unveiling that modern large-kernel ConvNets, a compelling competitor to Vision Transformers, are remarkably more effective teachers for small-kernel ConvNets, owing to their more similar architectures. Our findings are backed up by extensive experiments on both logit-level and feature-level KD "out of the box", with no dedicated architectural or training-recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters, with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. We also find that beneficial characteristics of large-kernel ConvNets, e.g., larger effective receptive fields, can be seamlessly transferred to students through this large-to-small-kernel distillation. Code is available at: <https://github.com/VITA-Group/SLaK>.
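For readers unfamiliar with the two KD variants the abstract mentions, below is a minimal PyTorch sketch of standard logit-level KD (Hinton-style KL divergence on temperature-softened logits) and a common FitNets-style feature-level KD (L2 matching of projected feature maps). The function names, the `proj` channel-alignment layer, the assumption that models return `(logits, features)`, and the loss weights `T`, `alpha`, and `beta` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=4.0):
    """Logit-level KD: KL divergence between temperature-softened
    teacher and student class distributions."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def feature_kd_loss(student_feat, teacher_feat, proj):
    """Feature-level KD (FitNets-style): L2 loss between teacher feature maps
    and student feature maps passed through `proj` (e.g., a 1x1 conv that
    aligns channel dimensions -- an assumption, not the paper's design)."""
    return F.mse_loss(proj(student_feat), teacher_feat)

def kd_training_step(student, teacher, proj, images, labels, alpha=0.5, beta=1.0):
    """One illustrative training step: hard-label cross-entropy plus both KD terms.
    Teacher is a frozen large-kernel ConvNet; student is a small-kernel ConvNet.
    Assumes both models return (logits, feature_maps)."""
    with torch.no_grad():
        t_logits, t_feat = teacher(images)
    s_logits, s_feat = student(images)
    loss = (
        F.cross_entropy(s_logits, labels)               # supervision from labels
        + alpha * logit_kd_loss(s_logits, t_logits)     # logit-level KD
        + beta * feature_kd_loss(s_feat, t_feat, proj)  # feature-level KD
    )
    return loss
```

The weighted-sum combination of the three losses is a common convention in the KD literature; the paper's actual training recipe and loss weighting may differ.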

Related research:

∙ Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs (03/13/2022)
  We revisit large kernel design in modern convolutional neural networks (...

∙ More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity (07/07/2022)
  Transformers have quickly shined in the computer vision world since the ...

∙ SPIN: An Empirical Evaluation on Sharing Parameters of Isotropic Networks (07/21/2022)
  Recent isotropic networks, such as ConvMixer and vision transformers, ha...

∙ Transformer-based Knowledge Distillation for Efficient Semantic Segmentation of Road-driving Scenes (02/27/2022)
  For scene understanding in robotics and automated driving, there is a gr...

∙ Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results (04/07/2022)
  ImageNet serves as the primary dataset for evaluating the quality of com...

∙ Sliced Recursive Transformer (11/09/2021)
  We present a neat yet effective recursive operation on vision transforme...

∙ Demystify Transformers & Convolutions in Modern Image Deep Networks (11/10/2022)
  Recent success of vision transformers has inspired a series of vision ba...
