Visual Prompt Multi-Modal Tracking

03/20/2023
by Jiawen Zhu, et al.

Visible-modal object tracking has given rise to a series of downstream multi-modal tracking tasks. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is to fully fine-tune the RGB-based parameters. Although effective, this approach is suboptimal because downstream data are scarce and the resulting models transfer poorly. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model pre-trained at scale, while introducing only a few trainable parameters (less than 1% of the model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
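To make the paradigm concrete, here is a minimal PyTorch sketch of prompt learning for multi-modal tracking in the spirit of the abstract: a pre-trained RGB backbone is frozen, and only a small bottleneck module that turns auxiliary-modality features (depth, thermal, or event) into prompts is trained. The module names (ModalPrompter, PromptTracker), the shared frozen encoder for both modalities, and all dimensions are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class ModalPrompter(nn.Module):
    """Trainable bottleneck that turns auxiliary-modality features into
    prompts added onto the frozen backbone's RGB features (illustrative)."""
    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        # A narrow bottleneck keeps the trainable parameter count tiny
        # (ViPT reports fewer than 1% trainable parameters overall).
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, rgb_feat, aux_feat):
        prompt = self.up(self.act(self.down(aux_feat)))
        return rgb_feat + prompt  # prompt-modulated representation

class PromptTracker(nn.Module):
    """Frozen RGB foundation backbone plus a trainable prompt module."""
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained RGB model
        self.prompter = ModalPrompter(dim)

    def forward(self, rgb_tokens, aux_tokens):
        rgb_feat = self.backbone(rgb_tokens)
        aux_feat = self.backbone(aux_tokens)  # shared frozen encoder (an assumption)
        return self.prompter(rgb_feat, aux_feat)

# Only the prompt module's parameters are handed to the optimizer.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = PromptTracker(backbone, dim=256)
optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

rgb = torch.randn(2, 64, 256)  # (batch, tokens, dim) RGB search-region tokens
aux = torch.randn(2, 64, 256)  # matching depth/thermal/event tokens
feats = model(rgb, aux)        # fused features for a downstream tracking head
```

The design choice this sketch illustrates is that gradient updates touch only the bottleneck, so the foundation model's representations are reused intact while the prompts steer them toward the auxiliary modality.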

Related research

06/20/2023
MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
Prompt tuning, like CoOp, has recently shown promising vision recognizin...

04/12/2023
Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation
With the continuous improvement of computing power and deep learning alg...

08/31/2023
RGB-T Tracking via Multi-Modal Mutual Prompt Learning
Object tracking based on the fusion of visible and thermal images, know...

03/08/2022
Multi-Modal Mixup for Robust Fine-tuning
Pre-trained large-scale models provide a transferable embedding, and the...

08/11/2023
Foundation Model is Efficient Multimodal Multitask Model Selector
This paper investigates an under-explored but important problem: given a...

05/15/2023
Mode Approximation Makes Good Vision-Language Prompts
With the advance of large-scale model technologies, parameter-efficient ...

02/12/2023
Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for Multi-Modal Fact Verification
Multi-modal fact verification has become an important but challenging is...
