Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

05/06/2020
by   Sunder Ali Khowaja, et al.

Two-stream networks provide an alternative way of exploiting spatiotemporal information for the action recognition problem. Nevertheless, most two-stream variants fuse homogeneous modalities, which cannot efficiently capture the action-motion dynamics in videos. Moreover, existing studies cannot extend the streams beyond the number of modalities. To address these limitations, we propose hybrid and hierarchical fusion (HHF) networks. Hybrid fusion handles non-homogeneous modalities and introduces a cross-modal learning stream for effective modeling of motion dynamics, extending the networks from the existing two-stream variants to three and six streams. Hierarchical fusion, in turn, makes the modalities consistent by modeling long-term temporal information and combines multiple streams to improve recognition performance. The proposed network architecture comprises three fusion tiers: the hybrid fusion itself; a long-term fusion pooling layer, which models long-term dynamics from the RGB and optical flow modalities; and an adaptive weighting scheme for combining the classification scores from the several streams. We show that the hybrid fusion yields representations different from the base modalities for training the cross-modal learning stream. Extensive experiments show that the proposed six-stream HHF network outperforms existing two- and four-stream networks, achieving state-of-the-art accuracies of 97.2% on UCF101 and 76.7% on HMDB51, two datasets widely used in action recognition studies.
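The final fusion tier described above combines the classification scores of the individual streams with adaptive weights. A minimal sketch of that idea, assuming per-stream raw class scores are first converted to probabilities and then mixed with per-stream weights (the weight values, stream names, and function names here are illustrative assumptions, not the paper's actual code):

```python
import math

def softmax(scores):
    """Convert a raw class-score vector to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(stream_scores, weights):
    """Weighted sum of per-stream class probabilities.

    stream_scores: list of raw score vectors, one per stream.
    weights: per-stream weights, assumed to sum to 1 (in the paper these
    would be chosen adaptively; fixed values are used here for illustration).
    """
    probs = [softmax(s) for s in stream_scores]
    n_classes = len(stream_scores[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(n_classes)]

# Example: three streams (e.g. RGB, optical flow, cross-modal), 4 classes.
rgb   = [2.0, 0.5, 0.1, 0.0]
flow  = [1.5, 1.0, 0.2, 0.1]
cross = [1.8, 0.3, 0.5, 0.2]
fused = fuse_streams([rgb, flow, cross], weights=[0.4, 0.3, 0.3])
prediction = max(range(len(fused)), key=fused.__getitem__)
```

Because each stream's probabilities sum to 1 and the weights sum to 1, the fused vector is itself a valid probability distribution, so the predicted class is simply its argmax.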


Related research

- Cross-Modal Message Passing for Two-stream Fusion (04/30/2019)
- Learning Gating ConvNet for Two-Stream based Methods in Action Recognition (09/12/2017)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition (01/31/2020)
- Cross-Enhancement Transform Two-Stream 3D ConvNets for Pedestrian Action Recognition of Autonomous Vehicles (08/19/2019)
- Memory Based Attentive Fusion (07/16/2020)
- Semantic Image Networks for Human Action Recognition (01/21/2019)
- Reversing Two-Stream Networks with Decoding Discrepancy Penalty for Robust Action Recognition (11/20/2018)
