Learning Spatial-Frequency Transformer for Visual Object Tracking

08/18/2022
by Chuanming Tang, et al.

Recent trackers adopt the Transformer to complement or replace the widely used ResNet as their backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to fit the Transformer, which ignores the spatial prior of the target object and may lead to sub-optimal results. In addition, many works demonstrate that self-attention is actually a low-pass filter, independent of the input features or keys/queries. That is to say, it may suppress the high-frequency components of the input features while preserving or even amplifying the low-frequency information. To handle these issues, in this paper we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The output is fed into a Softmax layer and then decomposed into two components, i.e., the direct signal and the high-frequency signal. The low- and high-pass branches are rescaled and combined to achieve an all-pass filter, so that the high-frequency features are well preserved through the stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm, termed SFTransT. A cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between the search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of our proposed framework.
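
Below is a minimal PyTorch sketch of how a GPHA-style self-attention layer could look, based only on the description above: a Gaussian spatial prior, whose bandwidths are predicted by dual MLPs, is added to the Query-Key similarity matrix, and the Softmax output is split into a direct (low-frequency) component and a high-frequency residual that are rescaled before being recombined. All names (GPHAAttention, sigma_mlp_h/w, lp_scale, hp_scale), tensor shapes, and the exact form of the prior are assumptions for illustration, not the authors' released implementation.

# Hedged sketch of a GPHA-style self-attention block, following the abstract only.
# The module names, the per-head Gaussian parameterization, and the DC/high-frequency
# split are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPHAAttention(nn.Module):
    def __init__(self, dim, num_heads=8, feat_size=20):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Dual MLPs predicting per-head Gaussian bandwidths (assumption: one MLP
        # for each axis of the spatial grid).
        self.sigma_mlp_h = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_heads))
        self.sigma_mlp_w = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_heads))

        # Precompute squared spatial distances between all token positions on the
        # feat_size x feat_size grid (tokens are the flattened 2D feature map).
        ys, xs = torch.meshgrid(torch.arange(feat_size), torch.arange(feat_size), indexing="ij")
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
        d2 = (pos[:, None, :] - pos[None, :, :]) ** 2                     # (N, N, 2)
        self.register_buffer("dist2_h", d2[..., 0])
        self.register_buffer("dist2_w", d2[..., 1])

        # Learnable rescaling of the low-pass (direct) and high-pass branches.
        self.lp_scale = nn.Parameter(torch.ones(num_heads))
        self.hp_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Standard scaled dot-product similarity between Query and Key features.
        sim = (q @ k.transpose(-2, -1)) * self.scale                      # (B, H, N, N)

        # Gaussian spatial prior injected into the similarity matrix before Softmax.
        sig_h = F.softplus(self.sigma_mlp_h(x).mean(1)) + 1e-4            # (B, H)
        sig_w = F.softplus(self.sigma_mlp_w(x).mean(1)) + 1e-4
        prior = -(self.dist2_h[None, None] / (2 * sig_h[..., None, None] ** 2)
                  + self.dist2_w[None, None] / (2 * sig_w[..., None, None] ** 2))
        attn = torch.softmax(sim + prior, dim=-1)                         # (B, H, N, N)

        # Decompose into a direct (DC / low-frequency) part and the residual
        # high-frequency part, rescale each branch, and recombine (all-pass).
        dc = attn.mean(dim=-1, keepdim=True).expand_as(attn)
        hf = attn - dc
        attn = (self.lp_scale.view(1, -1, 1, 1) * dc
                + self.hp_scale.view(1, -1, 1, 1) * hf)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Toy usage: 400 tokens from a 20x20 search-region feature map, 256 channels.
x = torch.randn(2, 400, 256)
print(GPHAAttention(dim=256, num_heads=8, feat_size=20)(x).shape)  # torch.Size([2, 400, 256])

Note that because the attention rows no longer sum to one after rescaling, the layer behaves as an all-pass combination of the low- and high-frequency branches rather than a pure low-pass averaging, which is the property the abstract emphasizes.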

