Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

04/19/2018
by Sanjeel Parekh, et al.

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
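
To make the multiple instance learning idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each video is a bag of visual region scores and a separate bag of audio segment scores, that each bag is aggregated by a per-class maximum, and that the two modalities are fused by addition. The abstract specifies none of these choices, and the names `bag_scores` and `video_prediction` are hypothetical.

```python
def bag_scores(instance_scores):
    """Max-pool per-class scores over the instances of one bag.

    instance_scores: list of instances, each a list of per-class scores.
    Returns (pooled scores per class, index of the winning instance per class).
    """
    num_classes = len(instance_scores[0])
    pooled, winners = [], []
    for c in range(num_classes):
        column = [inst[c] for inst in instance_scores]
        best = max(range(len(column)), key=column.__getitem__)
        pooled.append(column[best])
        winners.append(best)
    return pooled, winners


def video_prediction(visual_scores, audio_scores):
    """Predict a video-level label from two independently pooled bags.

    Assumption: modalities are fused by summing pooled scores. Because each
    bag is pooled on its own, the winning visual region and the winning
    audio segment do not have to coincide in time.
    """
    v_pooled, v_idx = bag_scores(visual_scores)
    a_pooled, a_idx = bag_scores(audio_scores)
    fused = [v + a for v, a in zip(v_pooled, a_pooled)]
    label = max(range(len(fused)), key=fused.__getitem__)
    # Return the label together with the instances that explain it.
    return label, v_idx[label], a_idx[label]
```

Only the video-level label would be needed to train such a model, while the winning instance indices give a coarse localization of the visual region and audio segment that support the prediction.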


