EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset

03/13/2021
by Curtis G. Northcutt, et al.

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state of the art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural-conversations dataset containing multi-modal human communication data captured simultaneously from the participants’ egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped, word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking within 5% of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79% relative to a single perspective. Both applications exploit EgoCom’s synchronous multi-perspective data to improve performance on embodied AI tasks.
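To make the two baseline applications more concrete, the sketches below illustrate the kind of approaches the abstract describes. Both are minimal illustrations under assumed data structures and rules, not the paper's actual implementations.

For (1), turn-taking prediction, one minimal Bayesian baseline (an assumption for illustration; the paper's baselines may also condition on audio and video features) is a smoothed speaker-to-speaker transition model estimated from labeled turns:

```python
from collections import defaultdict
from typing import Dict, List

def fit_turn_transitions(turn_sequences: List[List[str]], alpha: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Estimate P(next speaker | current speaker) with Dirichlet(alpha) smoothing."""
    counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    speakers = set()
    for seq in turn_sequences:
        speakers.update(seq)
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1.0
    posterior: Dict[str, Dict[str, float]] = {}
    for cur in speakers:
        total = sum(counts[cur].values()) + alpha * len(speakers)
        posterior[cur] = {nxt: (counts[cur][nxt] + alpha) / total for nxt in speakers}
    return posterior

def predict_next_speaker(posterior: Dict[str, Dict[str, float]], current: str) -> str:
    """Return the most probable next speaker under the fitted transition model."""
    return max(posterior[current], key=posterior[current].get)
```

For example, fitting on sequences of speaker labels such as [["A", "B", "A", "C"], ["B", "A", "B"]] and calling predict_next_speaker(posterior, "A") returns the speaker who most often follows A in the training turns.

For (2), multi-speaker transcription, EgoCom's simultaneous egocentric capture means each participant's recording yields its own speech-to-text output. One simple way to combine them (an assumed max-confidence fusion rule; the paper's combination method may differ) is to keep, for each time window, the words from whichever perspective the recognizer was most confident about:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # recognizer confidence in [0, 1]

def fuse_transcripts(perspectives: List[List[Word]], window: float = 1.0) -> List[Word]:
    """Naive max-confidence fusion of per-perspective ASR word lists.

    For each fixed-length time window, keep the words from whichever
    egocentric perspective has the highest mean recognizer confidence.
    Illustrative only; not the combination method reported in the paper.
    """
    horizon = max((w.end for words in perspectives for w in words), default=0.0)
    fused: List[Word] = []
    t = 0.0
    while t < horizon:
        best_words: List[Word] = []
        best_conf = -1.0
        for words in perspectives:
            in_window = [w for w in words if t <= w.start < t + window]
            if not in_window:
                continue
            mean_conf = sum(w.confidence for w in in_window) / len(in_window)
            if mean_conf > best_conf:
                best_conf, best_words = mean_conf, in_window
        fused.extend(best_words)
        t += window
    return fused
```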

Related research

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis (03/10/2022)
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset (01/16/2023)
Non-Verbal Communication Analysis in Victim-Offender Mediations (11/25/2014)
Time Domain Audio Visual Speech Separation (04/07/2019)
WEMAC: Women and Emotion Multi-modal Affective Computing dataset (03/01/2022)
Ethically Collecting Multi-Modal Spontaneous Conversations with People that have Cognitive Impairments (09/30/2020)
