Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition

06/27/2018
by Dongliang He, et al.

This report describes in detail our approach to the ActivityNet 2018 Kinetics-600 challenge. Although existing state-of-the-art methods for this task perform spatial-temporal modelling with either end-to-end frameworks such as I3D or two-stage frameworks (i.e., CNN+RNN), video modelling is far from being solved. For this challenge, we propose the spatial-temporal network (StNet) for better joint spatial-temporal modelling and comprehensive video understanding. In addition, because video sources contain multi-modal information, we integrate both early-fusion and late-fusion strategies for multi-modal information via our proposed improved temporal Xception network (iTXN). Our single StNet RGB model achieves 78.99% top-1 precision on the Kinetics-600 validation set, and our improved temporal Xception network, which integrates the RGB, flow and audio modalities, reaches 82.35%. After model ensembling, we achieve a top-1 precision of 85.0% on the validation set and rank No. 1 among all submissions.
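
The two fusion strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' iTXN: the abstract does not specify how the modalities are combined, so the sketch assumes the common textbook forms, with early fusion as concatenation of per-modality feature vectors and late fusion as a weighted average of per-modality class probabilities. All function names, weights and toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(features_by_modality):
    """Assumed form of early fusion: concatenate per-modality feature
    vectors (in dict insertion order) into one vector for a shared classifier."""
    fused = []
    for features in features_by_modality.values():
        fused.extend(features)
    return fused


def late_fusion(logits_by_modality, weights=None):
    """Assumed form of late fusion: weighted average of the class
    probabilities predicted independently by each modality's model."""
    names = list(logits_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total_weight = sum(weights[name] for name in names)
    num_classes = len(logits_by_modality[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(logits_by_modality[name])
        for c in range(num_classes):
            fused[c] += weights[name] * probs[c] / total_weight
    return fused


def top1_accuracy(score_rows, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(score_rows, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy example with three classes and three modalities (made-up numbers).
toy_logits = {
    "rgb": [2.0, 0.0, 0.0],
    "flow": [0.0, 1.0, 0.0],
    "audio": [0.0, 0.0, 0.5],
}
fused_scores = late_fusion(toy_logits)
```

Under these assumptions, late fusion needs no retraining when a modality is added, because only the averaging weights change, whereas early fusion requires a classifier trained on the concatenated features. An ensemble of several models can be scored the same way by treating each model as one entry passed to `late_fusion`.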

research
08/22/2019

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

We focus on multi-modal fusion for egocentric action recognition, and pr...
research
05/05/2023

Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition

Human Activity Recognition is an important task in many human-computer c...
research
02/17/2023

Dynamic Spatial-temporal Hypergraph Convolutional Network for Skeleton-based Action Recognition

Skeleton-based action recognition relies on the extraction of spatial-te...
research
10/13/2019

VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning

Multi-modal information is essential to describe what has happened in a ...
research
10/27/2017

Multi-modal Aggregation for Video Classification

In this paper, we present a solution to Large-Scale Video Classification...
research
12/13/2022

SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance

Real-time monocular 3D reconstruction is a challenging problem that rema...
research
06/12/2018

Qiniu Submission to ActivityNet Challenge 2018

In this paper, we introduce our submissions for the tasks of trimmed act...
