Toward Transformer-Based Object Detection

12/17/2020
by Josh Beal, et al.

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that, compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator limits such architectures to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone for a common detection task head to produce competitive COCO results. The model we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution for complex vision tasks such as object detection.
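The core architectural move, reinterpreting the ViT's patch-token outputs as a 2-D feature map that a standard Faster R-CNN head can consume, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the ViTBackbone wrapper below, its toy dimensions, and the anchor settings are illustrative assumptions standing in for the pretrained ViT encoder used in the paper.

import torch
import torch.nn as nn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

class ViTBackbone(nn.Module):
    """Toy ViT-style encoder that exposes its patch tokens as a 2-D feature map."""
    def __init__(self, image_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = image_size // patch_size          # e.g. a 14x14 patch grid
        self.out_channels = dim                       # FasterRCNN reads this attribute
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.randn(1, self.grid ** 2, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.to_patches(x).flatten(2).transpose(1, 2)  # [B, N, D]
        tokens = self.encoder(tokens + self.pos)
        # Reinterpret the patch tokens as the spatial map a detection head expects.
        return tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)

backbone = ViTBackbone()
anchors = AnchorGenerator(sizes=((32, 64, 128, 256),),
                          aspect_ratios=((0.5, 1.0, 2.0),))
# Pin the input resolution so it matches the fixed positional embeddings.
model = FasterRCNN(backbone, num_classes=91,
                   rpn_anchor_generator=anchors,
                   min_size=224, max_size=224)

model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 224, 224)])
print(preds[0]["boxes"].shape, preds[0]["scores"].shape)

In eval mode the model returns per-image dicts of boxes, labels, and scores. The reshape at the end of forward() is the essential step: it lets the transformer stand in for a CNN backbone without any change to the region-proposal and box-head machinery, which is what allows a common detection head to sit on top of the ViT.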

