Divert More Attention to Vision-Language Tracking

07/03/2022
by   Mingzhe Guo, et al.

Relying on Transformers for complex visual feature learning, object tracking has witnessed a new standard among state-of-the-art (SOTA) methods. However, this advancement is accompanied by larger training data and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that reliance on Transformers is not necessary: pure ConvNets remain competitive and are even better, being more economical and training-friendly, for achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking using ConvNets alone. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and an asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, improving a CNN-based Siamese tracker by 14.5 and even outperforming several Transformer-based SOTA trackers. Beyond empirical results, we theoretically analyze our approach to evidence its effectiveness. By revealing the potential of VL representation, we hope the community will divert more attention to VL tracking, and we aim to open more possibilities for future tracking beyond Transformers. Code and models will be released at https://github.com/JudasDie/SOTS.
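The abstract does not spell out the internals of the ModaMixer. As a hedged illustration only, one natural way to mix a language embedding into convolutional visual features is channel-wise gating: the language representation produces one weight per visual channel, reweighting the feature map and being combined with a residual connection. All names, shapes, and the sigmoid/residual choices below are assumptions for the sketch, not the paper's exact design.

```python
import numpy as np

def modamixer(visual_feat, language_emb):
    """Sketch of a channel-wise modality mixer.

    visual_feat:  (C, H, W) convolutional feature map
    language_emb: (C,) language embedding, one scalar per channel
    Returns a mixed feature map of shape (C, H, W).
    """
    # Turn the language embedding into a per-channel gate in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-language_emb))        # sigmoid, shape (C,)
    # Reweight each visual channel by its language-derived gate.
    mixed = visual_feat * gate[:, None, None]          # broadcast over H, W
    # Residual connection preserves the original visual signal.
    return mixed + visual_feat

# Toy usage: 4 channels, 3x3 spatial map, zero language embedding.
C, H, W = 4, 3, 3
vis = np.ones((C, H, W))
lang = np.zeros(C)           # sigmoid(0) = 0.5, so each channel scales by 1.5
out = modamixer(vis, lang)
```

The residual form means the language signal modulates, rather than replaces, the visual features, which is one way a single mixing module can be dropped into an existing CNN-based Siamese tracker without retraining it from scratch.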


Related research:

- Divert More Attention to Vision-Language Object Tracking (07/19/2023): Multimodal vision-language (VL) learning has noticeably pushed the tende...
- SwinTrack: A Simple and Strong Baseline for Transformer Tracking (12/02/2021): Transformer has recently demonstrated clear potential in improving visua...
- Efficient Visual Tracking with Exemplar Transformers (12/17/2021): The design of more complex and powerful neural network models has signif...
- Siamese Transformer Pyramid Networks for Real-Time UAV Tracking (10/17/2021): Recent object tracking methods depend upon deep networks or convoluted a...
- SeqTrack: Sequence to Sequence Learning for Visual Object Tracking (04/27/2023): In this paper, we present a new sequence-to-sequence learning framework ...
- Transparent Object Tracking with Enhanced Fusion Module (09/13/2023): Accurate tracking of transparent objects, such as glasses, plays a criti...
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks (12/05/2021): We present a Siamese-like Dual-branch network based on solely Transforme...
