Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

05/31/2022
by Xin Yu, et al.

The recent trend in multiple object tracking (MOT) is heading towards leveraging deep learning to boost tracking performance. In this paper, we propose a novel solution named TranSTAM, which leverages a Transformer to effectively model both the appearance features of each object and the spatial-temporal relationships among objects. TranSTAM consists of two major parts: (1) the encoder uses the self-attention mechanism of the Transformer to learn discriminative features for each tracklet; (2) the decoder adopts the standard cross-attention mechanism to model the affinities between the tracklets and the detections by taking both spatial-temporal and appearance features into account. TranSTAM has two major advantages: (1) it is based solely on the encoder-decoder architecture and has a compact network design, and is therefore computationally efficient; (2) it learns spatial-temporal and appearance features within one model, and therefore achieves better tracking accuracy. The proposed method is evaluated on multiple public benchmarks, including MOT16, MOT17, and MOT20, and it achieves a clear improvement in both IDF1 and HOTA over previous state-of-the-art approaches on all of these benchmarks. Our code is available at <https://github.com/icicle4/TranSTAM>.
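
To make the two-stage idea in the abstract concrete, the following is a minimal NumPy sketch of self-attention over tracklet features followed by cross-attention scores between detections and tracklets. It is not the authors' implementation: the function names, the feature dimension, the random features, the choice of detections as queries, and the omission of learned projections and multiple heads are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention with no learned
    # projections (an assumption): each tracklet feature is refined
    # by attending to all tracklets.
    d = x.shape[-1]
    weights = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return weights @ x

def cross_attention_affinity(detections, tracklets):
    # Scaled dot-product scores with detections as queries and
    # tracklets as keys (an assumed assignment of roles). Each row is
    # normalized into a distribution over tracklets.
    d = tracklets.shape[-1]
    return softmax(detections @ tracklets.T / np.sqrt(d), axis=-1)

# Hypothetical inputs: 3 tracklets and 4 detections, each described by
# an 8-dimensional feature standing in for combined spatial-temporal
# and appearance cues.
rng = np.random.default_rng(0)
tracklets = rng.normal(size=(3, 8))
detections = rng.normal(size=(4, 8))

encoded = self_attention(tracklets)                        # "encoder" step
affinity = cross_attention_affinity(detections, encoded)   # "decoder" step
print(affinity.shape)  # (4, 3)
```

Each row of `affinity` can be read as how strongly one detection matches each existing tracklet. A full tracker would additionally need learned projections, a rule for starting new tracks, and an assignment step, none of which are specified in the abstract.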
