No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

11/01/2022
by Jose Vargas-Quiros, et al.

Recognizing who is speaking in a crowded scene is a key challenge toward understanding the social interactions taking place within it. Detecting speaking status from body movement alone opens the door to analyzing social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. In video-based action recognition, a bounding box is traditionally used to localize and segment out the target subject before recognizing the action taking place within it. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that selecting local features around pose keypoints improves generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of the subjects, we investigate the role of cross-contamination in this effect. We additionally use acceleration measured by wearable sensors for the same task, and present a multimodal approach combining both methods.
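To make the keypoint-based filtering idea concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the distance-based filtering rule, the pixel radius, and the weighted late-fusion step are all illustrative assumptions. The core idea it demonstrates is that local features (e.g., dense-trajectory points) are kept only if they fall near a detected pose keypoint of the target subject, which both cuts cross-contamination from neighboring people and shrinks the feature set.

```python
import numpy as np

def filter_features_by_keypoints(feat_xy, keypoints_xy, radius=20.0):
    """Keep only local features within `radius` pixels of any pose keypoint.

    feat_xy:      (N, 2) array of (x, y) locations of local features
                  in a frame (e.g., dense-trajectory points).
    keypoints_xy: (K, 2) array of the subject's detected pose keypoints.
    radius:       pixel radius around each keypoint (a tunable assumption).
    Returns the retained features and the boolean keep-mask.
    """
    # Pairwise distances between every feature and every keypoint: shape (N, K).
    d = np.linalg.norm(feat_xy[:, None, :] - keypoints_xy[None, :, :], axis=-1)
    # A feature survives if it lies near at least one keypoint.
    keep = (d <= radius).any(axis=1)
    return feat_xy[keep], keep

def late_fuse(p_video, p_accel, w=0.5):
    # Hypothetical score-level fusion of per-modality speaking probabilities;
    # one plausible way to combine the video and acceleration streams.
    return w * p_video + (1.0 - w) * p_accel

# Toy usage: 500 candidate feature locations, 2 pose keypoints.
feats = np.random.rand(500, 2) * 100.0
kps = np.array([[10.0, 20.0], [50.0, 60.0]])
kept, mask = filter_features_by_keypoints(feats, kps, radius=15.0)
print(f"kept {mask.sum()} of {len(feats)} local features")
```

In a crowded scene this kind of spatial gating typically discards the large majority of candidate features, which is consistent with the efficiency gain the abstract describes; the radius and fusion weight would need to be tuned per dataset.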

