Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

05/27/2023
by   Yung-Hsuan Lai, et al.

Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on its modality-aligned setting, i.e., both the audio and visual modalities are assumed to signal the prediction target. With the Look, Listen, and Parse (LLP) dataset, we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video given only weak labels. Such weak video-level labels only tell which events happen, without specifying the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality-specific labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 average F-score (Type@AV). Surprisingly, we find that modality-independent teachers outperform their modality-fused counterparts, since they are immune to noise from the other, potentially unaligned modality. Moreover, our best model achieves new state-of-the-art results on all LLP metrics by a substantial margin (+5.4 F-score for Type@AV). VALOR further generalizes to Audio-Visual Event Localization, where it also achieves new state-of-the-art results. Code is available at: https://github.com/Franklin905/VALOR.
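
To make the label-elaboration idea concrete, below is a minimal sketch of how modality-independent teachers can split a weak video-level label set into visual and audio labels. It assumes CLIP as the visual teacher and CLAP as the audio teacher via Hugging Face transformers; the checkpoints, prompt templates, and thresholds are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of VALOR-style label elaboration, assuming CLIP as the
# visual teacher and CLAP as the audio teacher (Hugging Face transformers).
# Checkpoints, prompt templates, and thresholds are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor, ClapModel, ClapProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

@torch.no_grad()
def elaborate_labels(frames, waveform, weak_events, v_thr=0.25, a_thr=0.25):
    """Split a weak video-level label set into per-modality labels.

    frames:      list of PIL images sampled from the video
    waveform:    1-D numpy array, resampled to 48 kHz (CLAP's expected rate)
    weak_events: event names from the weak video-level annotation
    Returns (visual_events, audio_events), both subsets of weak_events.
    """
    # Visual teacher: does any sampled frame match the event description?
    v_in = clip_proc(text=[f"a photo of {e}" for e in weak_events],
                     images=frames, return_tensors="pt", padding=True)
    v_out = clip(**v_in)
    # image_embeds / text_embeds come out L2-normalized, so the dot
    # product is a cosine similarity of shape (frames x events).
    v_sim = (v_out.image_embeds @ v_out.text_embeds.T).max(dim=0).values

    # Audio teacher: does the soundtrack match the event description?
    a_in = clap_proc(text=[f"the sound of {e}" for e in weak_events],
                     audios=[waveform], sampling_rate=48_000,
                     return_tensors="pt", padding=True)
    a_out = clap(**a_in)
    a_sim = (a_out.audio_embeds @ a_out.text_embeds.T).squeeze(0)

    visual = [e for e, s in zip(weak_events, v_sim) if s > v_thr]
    audible = [e for e, s in zip(weak_events, a_sim) if s > a_thr]
    return visual, audible
```

An event that passes only the CLIP check would then be trained as visual-only, and vice versa; querying each teacher independently, as the abstract notes, keeps noise in one modality from corrupting the other's labels.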


Related Research

03/31/2022 · Investigating Modality Bias in Audio Visual Video Parsing
We focus on the audio-visual video parsing (AVVP) problem that involves ...

07/12/2022 · Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
Weakly-supervised audio-visual violence detection aims to distinguish sn...

04/19/2018 · Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Audio-visual representation learning is an important task from the persp...

06/12/2021 · Multi-level Attention Fusion Network for Audio-visual Event Recognition
Event classification is inherently sequential and multimodal. Therefore,...

04/25/2022 · Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
This paper focuses on the weakly-supervised audio-visual video parsing t...

09/10/2023 · Multimodal Fish Feeding Intensity Assessment in Aquaculture
Fish feeding intensity assessment (FFIA) aims to evaluate the intensity ...

12/21/2021 · Decompose the Sounds and Pixels, Recompose the Events
In this paper, we propose a framework centering around a novel architect...
