EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset

by Curtis G. Northcutt, et al.

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants’ egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped, word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking within 5% of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79% relative to a single perspective. Both applications exploit EgoCom’s synchronous multi-perspective data to augment the performance of embodied AI tasks.
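The multi-speaker transcription idea above can be illustrated with a minimal sketch: each egocentric perspective yields time-stamped word hypotheses with confidence scores, and competing hypotheses from overlapping time windows are resolved by keeping the most confident one. This is an assumed, simplified fusion scheme for illustration only, not the paper's exact method; the field names (`start`, `word`, `conf`) and the windowing heuristic are hypothetical.

```python
def fuse_transcripts(perspectives, window=0.5):
    """Merge per-perspective word lists into one global transcript.

    perspectives: list of word lists, one per egocentric capture, where
        each word is a dict like {"start": seconds, "word": str, "conf": float}.
    window: words whose start times fall in the same time bucket are
        treated as competing hypotheses; the most confident one wins.
    """
    buckets = {}
    for words in perspectives:
        for w in words:
            key = round(w["start"] / window)  # assign word to a time bucket
            best = buckets.get(key)
            if best is None or w["conf"] > best["conf"]:
                buckets[key] = w  # keep the highest-confidence hypothesis
    # Emit the winning words in temporal order
    return [buckets[k]["word"] for k in sorted(buckets)]

# Two perspectives disagree on the second word; the fused transcript
# keeps the more confident hypothesis from perspective b.
a = [{"start": 0.0, "word": "hello", "conf": 0.9},
     {"start": 0.6, "word": "word", "conf": 0.4}]
b = [{"start": 0.6, "word": "world", "conf": 0.8}]
print(fuse_transcripts([a, b]))  # → ['hello', 'world']
```

Because every participant wears a recorder, each speaker is loudest in their own stream, so per-perspective confidences naturally favor the right hypothesis for each turn; this is the intuition behind combining simultaneous egocentric captures.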



