Revisiting Pre-training in Audio-Visual Learning

02/07/2023
by Ruoxuan Feng, et al.

Pre-training has achieved tremendous success in enhancing model performance on various tasks, yet it has been found to perform worse than training from scratch in some uni-modal settings. This raises the question: are pre-trained models always effective in the more complex multi-modal scenario, especially for heterogeneous modalities such as audio and visual? We find that the answer is no. Specifically, we explore the effects of pre-trained models in two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the "dead channel" phenomenon caused by abnormal BatchNorm parameters hinders the utilization of model capacity. We therefore propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for the target task. In multi-modal joint learning, we find that a strong pre-trained uni-modal encoder can negatively affect the encoder of the other modality. To alleviate this problem, we introduce a two-stage Fusion Tuning strategy that takes better advantage of the pre-trained knowledge while making the uni-modal encoders cooperate through an adaptive masking method. Experimental results show that our methods further exploit the potential of pre-trained models and boost performance in audio-visual learning.
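The "dead channel" issue can be made concrete: when an encoder for one modality (e.g., audio) is initialized from a checkpoint pre-trained on another (e.g., ImageNet images), BatchNorm channels whose affine scale is near zero suppress their outputs almost entirely, so the corresponding filters contribute nothing to the target task. The PyTorch sketch below only illustrates that diagnosis and a naive repair; the threshold, the `reinit_dead_batchnorm` helper, and the plain reset scheme are assumptions for exposition, not the adaptive re-initialization (ABRi) proposed in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


def find_dead_channels(bn: nn.BatchNorm2d, eps: float = 1e-3) -> torch.Tensor:
    """Return indices of channels whose affine scale (gamma) is ~0.

    A near-zero gamma scales the normalized activation to almost nothing,
    so the channel is effectively "dead" for the target task.
    """
    gamma = bn.weight.detach().abs()
    return (gamma < eps).nonzero(as_tuple=True)[0]


@torch.no_grad()
def reinit_dead_batchnorm(model: nn.Module, eps: float = 1e-3) -> None:
    """Naive illustration: reset affine parameters and running statistics
    of dead channels so they can be trained again on the target modality.
    (The paper's ABRi re-initializes adaptively; this is not that method.)
    """
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d) and module.weight is not None:
            dead = find_dead_channels(module, eps)
            if dead.numel() == 0:
                continue
            module.weight[dead] = 1.0        # reset scale
            module.bias[dead] = 0.0          # reset shift
            module.running_mean[dead] = 0.0  # reset running statistics
            module.running_var[dead] = 1.0


# Example: cross-modal initialization of an "audio" encoder from
# ImageNet-pre-trained weights, followed by repairing dead BatchNorm channels.
audio_encoder = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
reinit_dead_batchnorm(audio_encoder)
```

ABRi, as its name indicates, re-initializes the affected BatchNorm parameters adaptively for the target task rather than resetting them to fixed defaults as in the sketch above.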
