HMS: Hierarchical Modality Selection for Efficient Video Recognition

04/20/2021
by   Zejia Weng, et al.

Videos are multimodal in nature. Conventional video recognition pipelines typically fuse multimodal features for improved performance. However, this is not only computationally expensive but also neglects the fact that different videos rely on different modalities for prediction. This paper introduces Hierarchical Modality Selection (HMS), a simple yet effective multimodal learning framework for efficient video recognition. HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on the fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis. This is achieved by the collaboration of three LSTMs organized in a hierarchical manner. In particular, each LSTM that operates on a high-cost modality contains a gating module, which takes lower-level features and historical information as inputs to adaptively determine whether to activate its corresponding modality; otherwise, it simply reuses historical information. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate that the proposed approach can effectively exploit multimodal information for improved classification performance while requiring much less computation.
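The hierarchical gating design lends itself to a compact sketch: a low-cost audio LSTM runs on every time step, while the appearance and motion LSTMs each carry a small gate that looks at the audio hidden state and their own history to decide whether to consume the expensive feature or simply carry the previous state forward. The PyTorch code below illustrates this idea as a minimal sketch; the module names, dimensions, and the straight-through gate are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the hierarchical gating idea described in the abstract.
# Module names, dimensions, and the straight-through gate are assumptions
# for illustration only; they are not the paper's exact implementation.
import torch
import torch.nn as nn


class GatedModalityLSTM(nn.Module):
    """LSTM cell for a high-cost modality that is activated per time step.

    A small gating MLP looks at the low-cost (audio) hidden state and this
    branch's previous hidden state; if the gate fires, the expensive feature
    is consumed and the state is updated, otherwise the state is carried over.
    """

    def __init__(self, feat_dim, hidden_dim, audio_dim):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, feat, audio_h, state):
        h, c = state
        # Hard 0/1 decision with a straight-through estimator so the gate
        # stays trainable (one common choice; the paper may differ).
        prob = torch.sigmoid(self.gate(torch.cat([audio_h, h], dim=-1)))
        decision = (prob > 0.5).float() + prob - prob.detach()
        new_h, new_c = self.cell(feat, (h, c))
        h = decision * new_h + (1 - decision) * h
        c = decision * new_c + (1 - decision) * c
        return (h, c), decision


class HMSSketch(nn.Module):
    """Audio LSTM runs on every step; appearance and motion are gated."""

    def __init__(self, audio_dim=128, rgb_dim=2048, flow_dim=2048, hidden=512):
        super().__init__()
        self.audio_lstm = nn.LSTMCell(audio_dim, hidden)
        self.rgb_branch = GatedModalityLSTM(rgb_dim, hidden, hidden)
        self.flow_branch = GatedModalityLSTM(flow_dim, hidden, hidden)
        self.classifier = nn.Linear(3 * hidden, 239)  # e.g., FCVID has 239 classes

    def forward(self, audio, rgb, flow):
        B, T, _ = audio.shape
        hidden = self.audio_lstm.hidden_size
        zeros = lambda: (audio.new_zeros(B, hidden), audio.new_zeros(B, hidden))
        a_state, r_state, f_state = zeros(), zeros(), zeros()
        for t in range(T):
            a_state = self.audio_lstm(audio[:, t], a_state)
            # In the real system, rgb[:, t] / flow[:, t] would only be extracted
            # when the gate fires; here they are precomputed for simplicity.
            r_state, _ = self.rgb_branch(rgb[:, t], a_state[0], r_state)
            f_state, _ = self.flow_branch(flow[:, t], a_state[0], f_state)
        fused = torch.cat([a_state[0], r_state[0], f_state[0]], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = HMSSketch()
    audio = torch.randn(2, 16, 128)
    rgb = torch.randn(2, 16, 2048)
    flow = torch.randn(2, 16, 2048)
    print(model(audio, rgb, flow).shape)  # torch.Size([2, 239])

The computational saving in the paper comes from skipping the expensive appearance and motion feature extraction whenever the gate stays closed; the precomputed tensors above are only a simplification to keep the example self-contained.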

Related research

06/30/2021 · Attention Bottlenecks for Multimodal Fusion
Humans perceive the world by concurrently processing and fusing high-dim...

03/06/2022 · Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
With the assumption that a video dataset is multimodality annotated in w...

12/03/2019 · LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
This paper presents LiteEval, a simple yet effective coarse-to-fine fram...

06/14/2017 · Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Videos are inherently multimodal. This paper studies the problem of how ...

11/04/2020 · Mutual Modality Learning for Video Action Classification
The construction of models for video action classification progresses ra...

10/27/2017 · Multi-modal Aggregation for Video Classification
In this paper, we present a solution to Large-Scale Video Classification...

09/19/2019 · HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities
Multimodal datasets contain an enormous amount of relational information...
