Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing

07/05/2023
by Jie Fu, et al.

Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances and to identify their event categories, using only video-level category labels for training. Most previous methods focus on refining the supervision for each modality or on extracting useful cross-modality information for more reliable feature learning. None of them addresses the imbalanced feature learning between modalities in this task. In this paper, to balance the feature learning processes of the different modalities, a dynamic gradient modulation (DGM) mechanism is explored, in which a novel and effective metric function is designed to measure the imbalance in feature learning between the audio and visual modalities. Furthermore, our analysis indicates that the multimodal confusing calculation hampers the precise measurement of this imbalance, which in turn weakens the effectiveness of the DGM mechanism. To cope with this issue, a modality-separated decision unit (MSDU) is designed for more precise measurement of imbalanced feature learning between the audio and visual modalities. Comprehensive experiments on public benchmarks demonstrate the effectiveness of the proposed method.
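The abstract does not give the metric function or the modulation rule, so the following is only a rough sketch of the general idea behind gradient modulation, not the paper's DGM. It assumes a hypothetical per-modality confidence score (`audio_score`, `visual_score`), a generic ratio-based imbalance measure, and a hypothetical strength parameter `alpha`; all names here are illustrative. The modality that is currently learning faster has its gradients scaled down, while the other is left untouched.

```python
import math


def modulation_coefficients(audio_score, visual_score, alpha=1.0, eps=1e-8):
    """Return (k_audio, k_visual) gradient scaling factors in (0, 1].

    ASSUMPTION: the scores are hypothetical positive confidence values for
    each modality; the ratio-based imbalance measure below is a generic
    stand-in, not the metric function proposed in the paper.
    """
    ratio = (audio_score + eps) / (visual_score + eps)
    if ratio > 1.0:
        # Audio dominates: damp audio gradients, keep visual gradients as-is.
        return 1.0 - math.tanh(alpha * (ratio - 1.0)), 1.0
    # Visual dominates (or both are balanced): damp visual gradients.
    inverse = 1.0 / ratio
    return 1.0, 1.0 - math.tanh(alpha * (inverse - 1.0))


def modulate(grads, coeff):
    """Scale a flat list of gradient values by a modulation coefficient."""
    return [g * coeff for g in grads]


if __name__ == "__main__":
    k_audio, k_visual = modulation_coefficients(audio_score=0.9, visual_score=0.3)
    audio_grads = modulate([0.2, -0.4], k_audio)    # strongly damped
    visual_grads = modulate([0.2, -0.4], k_visual)  # unchanged
    print(k_audio, k_visual, audio_grads, visual_grads)
```

In a real training loop the coefficients would be recomputed at every step and applied to the gradients of each modality's encoder before the optimizer update. How the scores are obtained (for example, from modality-separated predictions, as the MSDU suggests) is left open here.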


research
07/21/2020

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

In this paper, we introduce a new problem, named audio-visual video pars...
research
04/25/2022

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

This paper focuses on the weakly-supervised audio-visual video parsing t...
research
07/12/2022

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Weakly-supervised audio-visual violence detection aims to distinguish sn...
research
11/22/2017

Integrating both Visual and Audio Cues for Enhanced Video Caption

Video caption refers to generating a descriptive sentence for a specific...
research
05/30/2021

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

For multimodal tasks, a good feature extraction network should extract i...
research
11/24/2021

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Recognizing and localizing events in videos is a fundamental task for vi...
research
04/23/2017

Learning weakly supervised multimodal phoneme embeddings

Recent works have explored deep architectures for learning multimodal sp...
