Learning Spatio-Temporal Transformer for Visual Tracking

03/31/2021
by Bin Yan, et al.

In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, object prediction requires only a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and needs no postprocessing steps such as cosine windowing or bounding box smoothing, which largely simplifies existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
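To make the idea of direct corner estimation concrete, the sketch below shows one plausible way to turn two corner score maps into a bounding box. It is an illustration under assumptions, not the authors' implementation: the abstract does not say how the score maps are decoded, so the soft-argmax (expected coordinate under a softmax) used here is an assumed choice, and the names `soft_argmax` and `corners_to_box` are hypothetical.

```python
import math


def soft_argmax(score_map):
    """Expected (x, y) position under a softmax over a 2D score map.

    score_map is a list of rows; x indexes columns and y indexes rows.
    Assumption: corners are decoded as an expectation, which the abstract
    does not specify.
    """
    peak = max(s for row in score_map for s in row)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


def corners_to_box(top_left_map, bottom_right_map):
    """Combine two corner score maps into a box (x1, y1, x2, y2)."""
    x1, y1 = soft_argmax(top_left_map)
    x2, y2 = soft_argmax(bottom_right_map)
    return x1, y1, x2, y2
```

Because the box comes straight from the two maps, a decoder of this kind involves no proposals, anchors, or hand-tuned postprocessing, which matches the pipeline simplification the abstract describes.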

research
04/27/2023

SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

In this paper, we present a new sequence-to-sequence learning framework ...
research
08/11/2020

Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

We propose an improved discriminative model prediction method for robust...
research
09/04/2019

'Skimming-Perusal' Tracking: A Framework for Real-Time and Robust Long-term Tracking

Compared with traditional short-term tracking, long-term tracking poses ...
research
11/04/2022

Real-Time Target Sound Extraction

We present the first neural network model to achieve real-time and strea...
research
04/01/2021

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Tracking multiple objects in videos relies on modeling the spatial-tempo...
research
09/06/2023

Efficient Training for Visual Tracking with Deformable Transformer

Recent Transformer-based visual tracking models have showcased superior ...
research
07/31/2019

On the difficulty of learning and predicting the long-term dynamics of bouncing objects

The ability to accurately predict the surrounding environment is a found...
