Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos

09/13/2022 · Junbin Zhang, et al.

Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, computationally expensive visual models to understand videos comprehensively. However, few studies directly employ a graph model to reason about the video. The graph model offers fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method named Semantic2Graph that turns video action segmentation and recognition into node classification on a graph. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges model long-term spatio-temporal relations, while the semantic features are embeddings of the label text derived from a textual prompt. A graph neural network (GNN) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph improves over state-of-the-art results on the GTEA and 50Salads datasets. Multiple ablation experiments further confirm that semantic features improve model performance and that semantic edges enable Semantic2Graph to capture long-term dependencies at low cost.
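To make the graph construction concrete, the following is a minimal sketch of how a frame-level graph with the three described edge types (temporal, semantic, self-loop) might be assembled. The function `build_frame_graph`, the `window` parameter, and the label-sharing rule for semantic edges are illustrative assumptions, not the paper's exact construction.

```python
def build_frame_graph(num_frames, labels, window=1):
    """Hypothetical sketch of Semantic2Graph-style graph construction.

    Nodes are video frames. Edges come in three types:
      - temporal: connect frames within a small local window (both directions)
      - semantic: connect frames that share the same action label (long-range)
      - self:     connect each node to itself (self-loop)
    Returns a list of (src, dst, edge_type) tuples.
    """
    edges = []
    for i in range(num_frames):
        edges.append((i, i, "self"))  # self-loop edge
        # temporal edges to nearby frames, forward and backward
        for j in range(i + 1, min(i + window + 1, num_frames)):
            edges.append((i, j, "temporal"))
            edges.append((j, i, "temporal"))
    # semantic edges link frames with the same action label,
    # giving the GNN long-term context at low cost
    for i in range(num_frames):
        for j in range(num_frames):
            if i != j and labels[i] == labels[j]:
                edges.append((i, j, "semantic"))
    return edges


# Toy usage: 3 frames, first two share an action label
edges = build_frame_graph(3, labels=[0, 0, 1])
```

In practice each node would also carry a fused attribute vector (visual features from a backbone, structural features, and label-text embeddings), and the edge lists would feed a message-passing GNN for node classification.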

Related research

- 08/22/2023 · How Much Temporal Long-Term Context is Needed for Action Segmentation?
  Modeling long-term context in videos is crucial for many fine-grained ta...

- 09/22/2022 · FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification
  Unmanned aerial vehicles (UAVs) are now widely applied to data acquisiti...

- 10/29/2021 · Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval
  The task of cross-modal retrieval between texts and videos aims to under...

- 03/29/2021 · Unified Graph Structured Models for Video Understanding
  Accurate video understanding involves reasoning about the relationships ...

- 04/23/2022 · Long-term Spatio-temporal Forecasting via Dynamic Multiple-Graph Attention
  Many real-world ubiquitous applications, such as parking recommendations...

- 02/18/2018 · Structured Label Inference for Visual Understanding
  Visual data such as images and videos contain a rich source of structure...

- 06/01/2020 · Temporal Aggregate Representations for Long Term Video Understanding
  Future prediction requires reasoning from current and past observations ...
