Egocentric Audio-Visual Object Localization

03/23/2023
by Chao Huang, et al.

Humans naturally perceive surrounding scenes by unifying sound and sight from a first-person view. Likewise, machines can approach human-like perception by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and make two observations: 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) out-of-view sound components can arise as wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly: its effect is mitigated by estimating the temporal geometric transformation between frames and using it to update the visual representations. To tackle the second issue, we propose a cascaded feature enhancement module that improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we exploit the naturally available audio-visual temporal synchronization as “free” self-supervision, avoiding costly labeling. We also annotate and release the Epic Sounding Object dataset for evaluation. Extensive experiments show that our method achieves state-of-the-art localization performance on egocentric videos and generalizes to diverse audio-visual scenes.
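The geometry-aware temporal aggregation idea can be sketched as: estimate a geometric transformation between each past frame and the current one, warp the past visual features into the current view, then aggregate them. The snippet below is a minimal, hypothetical NumPy illustration of that alignment-then-aggregation pattern, not the paper's implementation: it substitutes a toy integer translation for the learned temporal geometric transformation, and the function names `warp_features` and `aggregate` are assumptions introduced here.

```python
import numpy as np

def warp_features(feat, shift):
    """Warp a 2D feature map by an integer (dy, dx) translation.

    A toy stand-in for applying the estimated geometric transformation
    (the paper's transformation is learned, not a fixed translation).
    Out-of-bounds samples are clamped to the border.
    """
    dy, dx = shift
    H, W = feat.shape
    ys = np.clip(np.arange(H) + dy, 0, H - 1)
    xs = np.clip(np.arange(W) + dx, 0, W - 1)
    return feat[np.ix_(ys, xs)]

def aggregate(frames, shifts):
    """Align each frame's features to the current view, then average.

    `frames` is a list of 2D feature maps; `shifts` holds the per-frame
    (dy, dx) transformations relative to the current frame.
    """
    aligned = [warp_features(f, s) for f, s in zip(frames, shifts)]
    return np.mean(aligned, axis=0)
```

With a zero shift the warp is the identity, so averaging identical frames returns the original features; a nonzero shift shows how misaligned features would be brought back into the current view before aggregation.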


