FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification

09/22/2022
by   Pu Jin, et al.
Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and high mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture the long-term temporal dependencies that are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, that models not only holistic features but also temporal relations for aerial video classification. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway, which learns a general feature of both frame appearances and short-term temporal variations, and (2) a temporal relation pathway, which captures multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. A novel fusion module then spatiotemporally integrates the two features, refining the holistic features with the multi-scale temporal relations to yield more discriminative video representations. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results on both, demonstrating its effectiveness and good generalization across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.
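To make the two-pathway idea above concrete, here is a minimal NumPy sketch (not the authors' implementation): a stand-in holistic pathway that pools per-frame features over time, a stand-in temporal relation pathway that projects ordered multi-frame samples at several scales, and a simple sigmoid-gated fusion. All function names, the random projections, and the gating scheme are illustrative assumptions, not the actual FuTH-Net layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def holistic_pathway(frames):
    # Stand-in for the 3D-CNN pathway: average per-frame features over time,
    # capturing appearance plus short-term variation in one vector.
    # frames: (T, D) array of per-frame feature vectors
    return frames.mean(axis=0)                      # (D,)

def temporal_relation_pathway(frames, scales=(2, 3)):
    # Stand-in for multi-scale temporal relations: for each scale k, sample
    # k ordered frames, concatenate them, and project with a random
    # (untrained, hypothetical) linear map back to D dimensions.
    T, D = frames.shape
    relations = []
    for k in scales:
        idx = np.sort(rng.choice(T, size=k, replace=False))  # ordered frames
        concat = frames[idx].reshape(-1)                     # (k*D,)
        W = rng.standard_normal((D, k * D)) * 0.01           # hypothetical weights
        relations.append(W @ concat)                         # (D,)
    return np.mean(relations, axis=0)                        # (D,)

def fuse(holistic, relation):
    # Toy fusion: refine the holistic feature with a sigmoid gate derived
    # from the relation feature (a simple stand-in for the fusion module).
    gate = 1.0 / (1.0 + np.exp(-relation))
    return holistic * gate                                   # (D,)

frames = rng.standard_normal((8, 16))   # 8 frames, 16-dim features per frame
video_repr = fuse(holistic_pathway(frames), temporal_relation_pathway(frames))
print(video_repr.shape)                 # (16,)
```

In the real model both pathways are learned end to end and the fusion happens spatiotemporally over feature maps; the sketch only shows how a relation-derived signal can modulate a holistic representation.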


Related Research

01/19/2023
Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition
Spatial and temporal modeling is one of the most core aspects of few-sho...

09/13/2022
Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
Video action segmentation and recognition tasks have been widely applied...

12/07/2021
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Action detection is an essential and challenging task, especially for de...

03/02/2023
AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning
We propose a novel approach for aerial video action recognition. Our met...

07/22/2020
Video-ception Network: Towards Multi-Scale Efficient Asymmetric Spatial-Temporal Interactions
Previous video modeling methods leverage the cubic 3D convolution filter...

08/27/2019
Temporal Reasoning Graph for Activity Recognition
Despite great success has been achieved in activity analysis, it still h...

01/30/2020
ERA: A Dataset and Deep Learning Benchmark for Event Recognition in Aerial Videos
Along with the increasing use of unmanned aerial vehicles (UAVs), large ...
