CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector

10/24/2021
by   Weiqiang Jin, et al.

Following the success of Bidirectional Encoder Representations from Transformers (BERT) in natural language processing (NLP), the multi-head attention transformer has become increasingly prevalent in computer vision (CV) research. However, applying it to complex tasks such as visual detection and semantic segmentation remains a challenge. Although several Transformer-based architectures, such as DETR and ViT-FRCNN, have been proposed for object detection, they inevitably suffer from reduced discrimination accuracy and computational efficiency because of the enormous number of learnable parameters and the heavy computational complexity of the traditional self-attention operation. To alleviate these issues, we present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD), which is built on top of the Convolutional vision Transformer (CvT) combined with the efficient Attentive Single Shot MultiBox Detector (ASSD). We provide comprehensive empirical evidence showing that our model, CvT-ASSD, achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO. Code has been released in a public GitHub repository at https://github.com/albert-jin/CvT-ASSD.
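As a rough illustration of the self-attention cost mentioned in the abstract, the sketch below counts the entries of the attention score matrix for a feature map treated as a token sequence. It also compares the case where keys and values are spatially downsampled before attention. The stride-2 downsampling, the 56x56 feature map size, and the function names `token_count` and `attention_score_entries` are assumptions made for illustration; the abstract does not describe how CvT-ASSD is configured.

```python
def token_count(height: int, width: int, stride: int = 1) -> int:
    """Number of tokens after subsampling an H x W feature map by `stride`
    (ceiling division in each spatial dimension)."""
    rows = (height + stride - 1) // stride
    cols = (width + stride - 1) // stride
    return rows * cols


def attention_score_entries(num_queries: int, num_keys: int, num_heads: int = 1) -> int:
    """Entries in the query-key score matrices: one (queries x keys) matrix per head."""
    return num_heads * num_queries * num_keys


# Hypothetical 56x56 feature map, single head.
queries = token_count(56, 56)                 # every position is a query
full_keys = token_count(56, 56)               # standard self-attention
reduced_keys = token_count(56, 56, stride=2)  # assumed stride-2 keys/values

full_cost = attention_score_entries(queries, full_keys)
reduced_cost = attention_score_entries(queries, reduced_keys)

print(f"full attention entries:    {full_cost}")
print(f"reduced attention entries: {reduced_cost}")
print(f"reduction factor:          {full_cost / reduced_cost:.1f}x")
```

The count grows with the square of the number of tokens in the standard case, which is why shrinking the key and value sequences is a common way to make attention cheaper on high-resolution feature maps.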


