DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

04/02/2023
by Qiangqiang Wu, et al.

In this paper, we study masked autoencoder (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline relies heavily on spatial cues while ignoring temporal relations for frame reconstruction, which leads to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos. We show that DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2× faster pre-training speed. Moreover, we find that motion diversity in pre-training videos is more important than scene diversity for improving performance on VOT and VOS. Our pre-trained DropMAE model can be loaded directly into existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git.
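The idea of spatial-attention dropout can be illustrated with a small sketch. The abstract does not specify how the dropout is made adaptive, so the code below is only an assumption-laden simplification: it drops within-frame attention logits uniformly at random rather than adaptively, and the function name `spatial_attention_dropout`, its parameters, and the two-frame token layout are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def spatial_attention_dropout(logits, frame_ids, drop_ratio, rng):
    """Softmax attention with random dropout of within-frame logits.

    logits     : square list of lists, logits[q][k] for query q and key k
    frame_ids  : frame index of each token (same length as logits)
    drop_ratio : fraction of same-frame keys to drop per query (assumed knob)
    rng        : random.Random instance, for reproducibility

    NOTE: uniform random dropping is an assumption made for this sketch;
    the abstract describes an adaptive scheme without giving its details.
    """
    n = len(logits)
    weights = []
    for q in range(n):
        # Candidate keys: same frame as the query, excluding the query itself
        # so that every row keeps at least one finite logit.
        same_frame = [k for k in range(n)
                      if frame_ids[k] == frame_ids[q] and k != q]
        n_drop = int(drop_ratio * len(same_frame))
        dropped = set(rng.sample(same_frame, n_drop))

        # Dropped logits become -inf, so they get zero weight after softmax.
        row = [-math.inf if k in dropped else logits[q][k] for k in range(n)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two frames with three tokens each and uniform logits.
frame_ids = [0, 0, 0, 1, 1, 1]
logits = [[0.0] * 6 for _ in range(6)]

no_drop = spatial_attention_dropout(logits, frame_ids, 0.0, random.Random(0))
full_drop = spatial_attention_dropout(logits, frame_ids, 1.0, random.Random(0))

# Attention mass that token 0 (frame 0) places on tokens of frame 1.
print(sum(no_drop[0][3:]), sum(full_drop[0][3:]))
```

With uniform logits, removing the within-frame keys shifts attention mass toward the other frame, which is the intuition behind pushing the model to rely on temporal cues during reconstruction.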


