InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

11/10/2022
by Wenhai Wang, et al.

Compared with the rapid progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, benefits from increasing parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on input and task information. As a result, InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, as ViTs do. The effectiveness of our model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieved a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
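To make the idea of input-conditioned sampling concrete, the following is a minimal sketch of a generic modulated deformable convolution evaluated at a single output location of a single-channel feature map. It is not the authors' implementation and not necessarily the exact operator used in InternImage. The function names (`bilinear_sample`, `deformable_conv_at`) are hypothetical, and the offsets and modulation scalars are passed in by hand here, whereas in a real network they would be predicted from the input by a learned layer.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x).

    Positions outside the map contribute zero (zero padding).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deformable_conv_at(feature, weights, offsets, masks, cy, cx):
    """Output of a 3x3 modulated deformable convolution at (cy, cx).

    weights: 9 kernel weights (row-major over the 3x3 grid)
    offsets: 9 (dy, dx) pairs added to the regular grid positions
    masks:   9 modulation scalars, one per sampling point
    (All three are illustrative inputs, assumed to be supplied by the caller.)
    """
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        oy, ox = offsets[k]
        sample = bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
        out += weights[k] * masks[k] * sample
    return out


# Toy 4x4 feature map whose value grows by 1 per column and 4 per row.
feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
weights = [1.0 / 9.0] * 9
masks = [1.0] * 9

# Zero offsets reduce to an ordinary 3x3 convolution (here, a mean filter).
regular = deformable_conv_at(feature, weights, [(0.0, 0.0)] * 9, masks, 1, 1)

# Shifting every sampling point half a pixel to the right moves the
# receptive field without changing the kernel weights.
shifted = deformable_conv_at(feature, weights, [(0.0, 0.5)] * 9, masks, 1, 1)
```

With all offsets at zero the operator behaves like a fixed 3x3 kernel; non-zero offsets let the same nine weights aggregate from positions that depend on the input, which is the sense in which the receptive field becomes adaptive.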

research
03/13/2022

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

We revisit large kernel design in modern convolutional neural networks (...
research
07/07/2022

More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

Transformers have quickly shined in the computer vision world since the ...
research
06/23/2021

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

In this paper, we present Vision Permutator, a conceptually simple and d...
research
11/09/2022

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models rule the field of Audio Tagging, ou...
research
04/26/2017

Deep Convolutional Neural Network to Detect J-UNIWARD

This paper presents an empirical study on applying convolutional neural ...
research
06/02/2021

Container: Context Aggregation Network

Convolutional neural networks (CNNs) are ubiquitous in computer vision, ...
research
10/20/2022

Large-batch Optimization for Dense Visual Predictions

Training a large-scale deep neural network in a large-scale dataset is c...
