RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

03/26/2023
by Yabin Zhu, et al.

Existing Transformer-based RGBT tracking methods either use cross-attention to fuse the two modalities, or use self-attention and cross-attention to model both modality-specific and modality-shared information. However, the significant appearance gap between modalities limits the feature representation ability of each modality during fusion. To address this problem, we propose a novel Progressive Fusion Transformer, called ProFormer, which progressively integrates single-modality information into the multimodal representation for robust RGBT tracking. In particular, ProFormer first uses a self-attention module to collaboratively extract the multimodal representation, and then uses two cross-attention modules to interact it with the features of the two modalities respectively. In this way, modality-specific information can be well activated in the multimodal representation. Finally, a feed-forward network fuses the two interacted multimodal representations to further enhance the final multimodal representation. In addition, existing learning methods for RGBT trackers either fuse multimodal features into one for final classification, or exploit the relationship between the unimodal branches and the fused branch through a competitive learning strategy. However, they either ignore the learning of the single-modality branches or leave one branch poorly optimized. To solve these problems, we propose a dynamically guided learning algorithm that adaptively uses well-performing branches to guide the learning of the other branches, enhancing the representation ability of each branch. Extensive experiments demonstrate that our proposed ProFormer achieves new state-of-the-art performance on the RGBT210, RGBT234, LasHeR, and VTUAV datasets.
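The fusion scheme described in the abstract (joint self-attention, then two modality-specific cross-attentions, then feed-forward fusion) lends itself to a compact sketch. The PyTorch module below is a minimal, hedged illustration of that idea; the class name, layer layout, and all hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ProgressiveFusionBlock(nn.Module):
    """Illustrative sketch of the progressive fusion idea (not the paper's code)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Self-attention jointly models the concatenated RGB + thermal tokens,
        # producing an initial multimodal representation.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Two cross-attention modules let the multimodal representation query
        # each modality separately, re-activating modality-specific cues.
        self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_tir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Feed-forward network fuses the two interacted representations.
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, rgb, tir):
        # rgb, tir: (batch, tokens, dim) features from each modality.
        joint = torch.cat([rgb, tir], dim=1)
        fused, _ = self.self_attn(joint, joint, joint)
        fused = self.norm1(fused + joint)
        # The multimodal representation attends to each single modality.
        act_rgb, _ = self.cross_rgb(fused, rgb, rgb)
        act_tir, _ = self.cross_tir(fused, tir, tir)
        out = self.ffn(torch.cat([act_rgb, act_tir], dim=-1))
        return self.norm2(out + fused)
```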
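The dynamically guided learning algorithm can likewise be sketched. One plausible reading is that the branch with the currently lowest task loss acts as the guide, and the other branches are pulled toward its (detached) prediction. The guide-selection rule, the KL-divergence guidance term, and the `guide_weight` parameter below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def guided_loss(logits_per_branch, target, guide_weight=0.5):
    """logits_per_branch: list of (batch, classes) logits from, e.g., the RGB,
    thermal, and fused branches; target: (batch,) class indices."""
    task_losses = [F.cross_entropy(l, target) for l in logits_per_branch]
    # Adaptively pick the currently best-performing branch as the guide.
    guide_idx = min(range(len(task_losses)), key=lambda i: task_losses[i].item())
    guide_prob = F.softmax(logits_per_branch[guide_idx].detach(), dim=-1)
    loss = sum(task_losses)
    for i, logits in enumerate(logits_per_branch):
        if i == guide_idx:
            continue
        # Distill the guide branch's prediction into the weaker branches.
        log_prob = F.log_softmax(logits, dim=-1)
        loss = loss + guide_weight * F.kl_div(
            log_prob, guide_prob, reduction="batchmean")
    return loss
```

Because the guide is re-selected every step, no branch is starved of supervision: each always receives its own task loss, plus guidance whenever another branch is performing better.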


Related research

12/03/2021
LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences
Learning modality-fused representations and processing unaligned multimo...

04/09/2023
RGB-T Tracking Based on Mixed Attention
RGB-T tracking involves the use of images from both visible and thermal ...

01/17/2023
Cooperation Learning Enhanced Colonic Polyp Segmentation Based on Transformer-CNN Fusion
Traditional segmentation methods for colonic polyps are mainly designed ...

11/08/2022
DepthFormer: Multimodal Positional Encodings and Cross-Input Attention for Transformer-Based Segmentation Networks
Most approaches for semantic segmentation use only information from colo...

01/02/2018
Learning Multimodal Word Representation via Dynamic Fusion Methods
Multimodal models have been proven to outperform text-based models on le...

07/26/2020
Challenge-Aware RGBT Tracking
RGB and thermal source data suffer from both shared and specific challen...

10/19/2021
Bilateral-ViT for Robust Fovea Localization
The fovea is an important anatomical landmark of the retina. Detecting t...
