MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

05/12/2023
by Xinyu Gong, et al.

In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D.
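The abstract names modality dropout training as one ingredient of the proposed fusion module. The following is a minimal, hypothetical PyTorch-style sketch of the general idea only, not the authors' implementation: the class name ModalityDropoutFusion, the mean-pooling fusion, and the drop probability p_drop are all illustrative assumptions.

```python
# Illustrative sketch of modality dropout training (assumed details, not the
# paper's actual code). Whole modalities are randomly dropped during training
# so the fused head learns to cope with missing modalities at inference.
import torch
import torch.nn as nn


class ModalityDropoutFusion(nn.Module):
    def __init__(self, dim: int, p_drop: float = 0.5):
        super().__init__()
        self.p_drop = p_drop  # per-modality probability of being dropped
        self.head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU()
        )

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) embeddings, one per modality.
        if self.training:
            # Sample a keep mask; re-sample one modality if all were dropped.
            keep = torch.rand(len(feats)) > self.p_drop
            if not keep.any():
                keep[torch.randint(len(feats), (1,))] = True
            feats = [f for f, k in zip(feats, keep) if k]
        # Mean-pool the surviving modalities so the head always sees a
        # fixed-size input, whatever subset of modalities is present.
        fused = torch.stack(feats, dim=0).mean(dim=0)
        return self.head(fused)


# Example usage with three hypothetical modality encodings:
fusion = ModalityDropoutFusion(dim=256)
video, audio, imu = (torch.randn(8, 256) for _ in range(3))
out = fusion([video, audio, imu])  # (8, 256) fused embedding
```

Because the surviving modalities are mean-pooled into a fixed-size vector, the same head can in principle run at inference with any subset of modalities, which is exactly the property the missing-modality scenario tests.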

Related research:

11/25/2022 · Towards Good Practices for Missing Modality Robust Action Recognition
Standard multi-modal models assume the use of the same modalities in tra...

04/16/2023 · Robust Cross-Modal Knowledge Distillation for Unconstrained Videos
Cross-modal distillation has been widely used to transfer knowledge acro...

03/06/2022 · Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
With the assumption that a video dataset is multimodality annotated in w...

09/09/2021 · M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks
In this paper, we aim to advance the research of multi-modal pre-trainin...

12/04/2020 · Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment
The natural world is abundant with concepts expressed via visual, acoust...

09/09/2022 · MIntRec: A New Dataset for Multimodal Intent Recognition
Multimodal intent recognition is a significant task for understanding hu...

05/26/2023 · LANISTR: Multimodal Learning from Structured and Unstructured Data
Multimodal large-scale pretraining has shown impressive performance gain...
