Container: Context Aggregation Network

06/02/2021
by Peng Gao, et al.

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers – originally introduced in natural language processing – have been increasingly adopted in computer vision. While early adopters continued to employ CNN backbones, the latest networks are end-to-end, CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution, without any traditional convolutional or Transformer components, can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method for aggregating spatial context in a neural network stack. We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, leading to the faster convergence often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, improvements of 6.6, 7.3, 6.9 and 6.6 points respectively over a ResNet-50 backbone of comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT under the DINO framework.
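The unified view described in the abstract can be made concrete with a short sketch. Below is a minimal single-head PyTorch illustration (our own assumption for exposition, not the paper's reference implementation): the affinity matrix applied to the values is a learnable mixture of a dynamic, input-dependent component (self-attention, as in Transformers) and a static, input-independent component (as in convolutions and MLP-Mixers). The class name ContextAggregation and the fully dense static matrix are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Hedged single-head sketch of the unified context-aggregation view:
        Y = (alpha * A_dyn + beta * A_sta) @ V
    A_dyn is an input-dependent affinity (Transformer-style self-attention);
    A_sta is a learned, input-independent affinity (convolution / MLP-Mixer
    style). alpha and beta are learnable mixing scalars."""

    def __init__(self, dim, num_tokens):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Static affinity: one learned N x N matrix (MLP-Mixer-like).
        # A convolution would instead constrain it to a banded/local,
        # translation-shared pattern.
        self.a_static = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.alpha = nn.Parameter(torch.ones(1))  # weight on dynamic path
        self.beta = nn.Parameter(torch.ones(1))   # weight on static path
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        a_dyn = (q @ k.transpose(-2, -1)) * self.scale
        a_dyn = a_dyn.softmax(dim=-1)              # (batch, N, N), input-dependent
        affinity = self.alpha * a_dyn + self.beta * self.a_static
        return self.proj(affinity @ v)

x = torch.randn(2, 196, 64)                        # e.g. 14x14 patch tokens
y = ContextAggregation(dim=64, num_tokens=196)(x)
print(y.shape)                                     # torch.Size([2, 196, 64])
```

Setting alpha to zero recovers a purely static (convolution/MLP-Mixer-like) mixer, while setting beta to zero recovers plain self-attention, which is the sense in which the three architecture families are special cases of one aggregation scheme.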


