ModDrop: adaptive multi-modal gesture recognition

12/31/2014
by Natalia Neverova, et al.

We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique makes the classifier robust to missing signals in one or several channels, so that it produces meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
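The core of ModDrop, as described in the abstract, is dropping entire modality channels at random during training so the fused network learns cross-modality correlations without becoming dependent on any single input. The sketch below is a minimal, hypothetical illustration of that idea in NumPy (function name, probability, and array shapes are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def moddrop(modalities, drop_prob=0.5, rng=None):
    """Illustrative ModDrop-style channel dropping (not the paper's exact code).

    modalities : list of np.ndarray, one feature array per input channel
                 (e.g. depth, colour, skeleton, audio features)
    drop_prob  : probability of zeroing out each modality independently
    rng        : optional np.random.Generator for reproducibility
    """
    rng = rng or np.random.default_rng()
    out = []
    for x in modalities:
        # Keep the whole channel with probability (1 - drop_prob),
        # otherwise replace it with zeros of the same shape.
        keep = rng.random() >= drop_prob
        out.append(x if keep else np.zeros_like(x))
    return out

# Example: three toy modality feature vectors fed through moddrop each step
mods = [np.ones((4,)), np.ones((8,)), np.ones((2, 3))]
dropped = moddrop(mods, drop_prob=0.5, rng=np.random.default_rng(42))
```

Applied at every training step, this forces the fusion layers to see all subsets of channels, which is what yields meaningful predictions when some modalities are missing at test time.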


