EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

08/22/2019
by   Evangelos Kazakos, et al.

We focus on multi-modal fusion for egocentric action recognition and propose a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities (RGB, Flow and Audio) and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset, EPIC-Kitchens, on all metrics using the public leaderboard.
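
The fusion order described in the abstract (sample each modality within a temporal window, fuse, then aggregate over time with shared weights) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the window parameterisation, the single linear layer with ReLU used for fusion, and the averaging used for temporal aggregation are all assumptions made for illustration, and the per-modality features are taken as precomputed vectors.

```python
import random

MODALITIES = ("rgb", "flow", "audio")


def sample_binding_window(num_frames, num_segments, window, rng):
    """Sparse temporal sampling (hypothetical helper).

    The clip is split into equal segments. For each segment, every modality
    gets its own index within +/- window // 2 of the segment centre, so the
    modalities may be temporally offset from one another.
    """
    seg_len = num_frames / num_segments
    picks = []
    for k in range(num_segments):
        center = int((k + 0.5) * seg_len)
        lo = max(0, center - window // 2)
        hi = min(num_frames - 1, center + window // 2)
        # randint is inclusive on both ends
        picks.append({m: rng.randint(lo, hi) for m in MODALITIES})
    return picks


def fuse(features, weights, bias):
    """Mid-level fusion (assumed form): concatenate, then linear layer + ReLU."""
    x = features["rgb"] + features["flow"] + features["audio"]  # list concat
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def temporal_binding_forward(streams, picks, weights, bias):
    """Fuse within each window first, then aggregate over time.

    The same `weights` and `bias` are reused for every segment, mirroring the
    shared fusion weights over time mentioned in the abstract. Averaging is an
    assumed choice of temporal aggregation.
    """
    fused = [
        fuse({m: streams[m][idx[m]] for m in MODALITIES}, weights, bias)
        for idx in picks
    ]
    n = len(fused)
    return [sum(col) / n for col in zip(*fused)]
```

A late-fusion baseline would instead aggregate each modality over time separately and only then combine the per-modality outputs; in this sketch the combination happens inside every window before any averaging.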

research
08/24/2022

Modality Mixer for Multi-modal Action Recognition

In multi-modal action recognition, it is important to consider not only ...
research
06/27/2018

Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition

In this report, our approach to tackling the task of ActivityNet 2018 Ki...
research
12/31/2014

ModDrop: adaptive multi-modal gesture recognition

We present a method for gesture detection and localisation based on mult...
research
06/27/2021

Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization

State of the art architectures for untrimmed video Temporal Action Local...
research
06/21/2018

CaloriNet: From silhouettes to calorie estimation in private environments

We propose a novel deep fusion architecture, CaloriNet, for the online e...
research
05/02/2018

Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction

We propose a tri-modal architecture to predict Big Five personality trai...
research
06/03/2021

Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment

First person action recognition is an increasingly researched topic beca...