Seeing and Hearing Egocentric Actions: How Much Can We Learn?

10/15/2019
by   Alejandro Cartas, et al.
8

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed focusing on a single modality. In particular, a limited number of works have considered to integrate the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18 improvement over the state of the art on verb classification.

READ FULL TEXT

page 1

page 2

page 6

page 7

research
09/11/2022

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Currently, action recognition is predominately performed on video data a...
research
06/03/2019

How Much Does Audio Matter to Recognize Egocentric Object Interactions?

Sounds are an important source of information on our daily interactions ...
research
05/20/2023

Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

The adoption of multimodal interactions by Voice Assistants (VAs) is gro...
research
06/18/2017

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

Audio-visual recognition (AVR) has been considered as a solution for spe...
research
07/16/2022

Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Perception of auditory events is inherently multimodal relying on both a...
research
08/10/2023

Ensemble Modeling for Multimodal Visual Action Recognition

In this work, we propose an ensemble modeling approach for multimodal ac...
research
07/17/2023

Clarifying the Half Full or Half Empty Question: Multimodal Container Classification

Multimodal integration is a key component of allowing robots to perceive...

Please sign up or login with your details

Forgot password? Click here to reset