Class-aware Sounding Objects Localization via Audiovisual Correspondence

12/22/2021
by   Di Hu, et al.
0

Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance.

READ FULL TEXT

page 1

page 4

page 9

page 10

page 11

page 14

page 17

research
10/12/2020

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Discriminatively localizing sounding objects in cocktail-party, i.e., mi...
research
03/29/2023

Audio-Visual Grouping Network for Sound Localization from Mixtures

Sound source localization is a typical and challenging task that predict...
research
11/04/2014

Do Convnets Learn Correspondence?

Convolutional neural nets (convnets) trained from massive labeled datase...
research
05/12/2023

Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation

In this paper, we focus on a recently proposed novel task called Audio-V...
research
05/23/2017

Look, Listen and Learn

We consider the question: what can be learnt by looking at and listening...
research
10/24/2022

I see what you hear: a vision-inspired method to localize words

This paper explores the possibility of using visual object detection tec...
research
12/20/2013

Multi-View Priors for Learning Detectors from Sparse Viewpoint Data

While the majority of today's object class models provide only 2D boundi...

Please sign up or login with your details

Forgot password? Click here to reset