Egocentric Auditory Attention Localization in Conversations

03/28/2023
by Fiona Ryan, et al.

In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal
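To make the task concrete, the model must map a camera wearer's egocentric view plus multichannel audio to a spatial heatmap over the field of view, peaking at the person being listened to. The sketch below is purely illustrative of that input/output structure, not the paper's architecture: it fuses a grid of per-location visual features with a single audio embedding (all shapes, features, and the linear scoring weights `w` are hypothetical) and applies a softmax over locations to produce a normalized attention heatmap.

```python
import numpy as np

def predict_attention_heatmap(visual_feats, audio_emb, w):
    """Toy fusion of visual and audio features into an attention heatmap.

    visual_feats: (H, W, Dv) grid of per-location visual features
    audio_emb:    (Da,) global embedding of the multichannel audio
    w:            (Dv + Da,) linear scoring weights (illustrative stand-in
                  for the paper's learned audiovisual reasoning)
    Returns an (H, W) heatmap that sums to 1.
    """
    H, W, _ = visual_feats.shape
    # Broadcast the audio embedding to every spatial location, then
    # concatenate so each cell carries both modalities.
    audio_map = np.broadcast_to(audio_emb, (H, W, audio_emb.shape[-1]))
    fused = np.concatenate([visual_feats, audio_map], axis=-1)  # (H, W, Dv+Da)
    scores = fused @ w  # one scalar attention score per location
    # Softmax over all locations yields a probability heatmap.
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

In the actual approach, a deep network replaces the linear scoring and reasons over spatiotemporal audiovisual features; this snippet only shows the shape of the prediction problem.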


Related research:

- Development of a Conversation State Prediction System (07/03/2021): With the evolution of the concept of Speaker diarization using LSTM, it ...
- Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation (03/20/2023): The images and sounds that we perceive undergo subtle but geometrically ...
- The Right to Talk: An Audio-Visual Transformer Approach (08/06/2021): Turn-taking has played an essential role in structuring the regulation o...
- Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization (01/06/2022): Augmented reality devices have the potential to enhance human perception...
- Conditional Generation of Audio from Video via Foley Analogies (04/17/2023): The sound effects that designers add to videos are designed to convey a ...
- The Spatial Selective Auditory Attention of Cochlear Implant Users in Different Conversational Sound Levels (03/03/2021): In multi speakers environments, cochlear implant (CI) users may attend t...
- Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset (08/20/2020): Automatic speech-based affect recognition of individuals in dyadic conve...
