Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

11/09/2018 · by Sanjeel Parekh, et al.

We tackle the problem of audio-visual scene analysis for weakly labeled data. To this end, we build upon our previous audio-visual representation learning framework to perform object classification in noisy acoustic environments and to integrate an audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization (NMF) for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments on a challenging dataset of musical instrument performance videos. We also show encouraging visual object localization results.
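The abstract's source enhancement relies on non-negative matrix factorization of the audio. As a minimal, hedged illustration (not the paper's actual pipeline, whose loss, rank, and features are not given here), the sketch below factors a toy non-negative "spectrogram" `V` into spectral templates `W` and activations `H` with standard Lee–Seung multiplicative updates, then reconstructs a single component as a crude per-source estimate:

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative V (F x T) as W (F x k) @ H (k x T).

    Uses Lee-Seung multiplicative updates minimizing Frobenius error;
    eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update templates
    return W, H

# Toy "magnitude spectrogram": two spectral patterns with random activations.
rng = np.random.default_rng(1)
W_true = np.abs(rng.normal(size=(64, 2)))
H_true = np.abs(rng.normal(size=(2, 100)))
V = W_true @ H_true

W, H = nmf(V, k=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Keeping only one component gives a rank-1 "source" spectrogram estimate;
# an inverse STFT (not shown) would map this back to a waveform.
V_src = np.outer(W[:, 0], H[0])
```

In a separation setting, each component (or group of components) of `W @ H` can be turned into a soft mask over the mixture spectrogram; how components are assigned to objects here is learned via the weak labels, which this sketch does not model.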

Related research

- Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events (04/19/2018)
- Exploring modality-agnostic representations for music classification (06/02/2021)
- Self-supervised object detection from audio-visual correspondence (04/13/2021)
- Audio-Visual Model Distillation Using Acoustic Images (04/16/2019)
- Audio Bank: A High-Level Acoustic Signal Representation for Audio Event Recognition (04/11/2023)
- Vision-Infused Deep Audio Inpainting (10/24/2019)
- Mood Classification Using Listening Data (10/22/2020)
