Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks

01/27/2023
by Chen Pang, et al.

To achieve accurate skeleton-based action recognition, most prior methods combine Graph Convolution Networks (GCNs) with attention-based methods in a serial way. However, they regard the human skeleton as a complete graph, which results in less variation between different actions (e.g., the connection between the elbow and head in the action “clapping hands”). To address this, we propose a novel Contrastive GCN-Transformer Network (ConGT), which fuses the spatial and temporal modules in a parallel way. ConGT consists of two parallel streams: a Spatial-Temporal Graph Convolution stream (STG) and a Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations that maintain the natural topology of the human skeleton. The STT is devised to acquire action representations that capture the global relationships among joints. Since the action representations produced by these two streams have different characteristics, and each stream knows little about the other, we introduce the contrastive learning paradigm to guide their output representations of the same sample to be as close as possible in a self-supervised manner. Through contrastive learning, the streams learn from each other and enrich the action features by maximizing the mutual information between the two types of action representations. To further improve action recognition accuracy, we introduce the Cyclical Focal Loss (CFL), which focuses on confident training samples in early training epochs, with an increasing focus on hard samples during the middle epochs. Experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art performance in action recognition.
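The idea of pulling the two streams' representations of the same sample together can be sketched with a symmetric InfoNCE objective. This is only an illustration: the abstract does not state which contrastive loss ConGT uses, so the InfoNCE form, the function name `info_nce`, the argument names `z_stg` and `z_stt`, and the temperature value are all assumptions, and random vectors stand in for the actual stream outputs.

```python
import numpy as np

def info_nce(z_stg, z_stt, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings, shape (N, D).

    Row i of z_stg and row i of z_stt are treated as a positive pair
    (same sample, different stream); all other rows act as negatives.
    Hypothetical helper, not the paper's implementation.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = z_stg / np.linalg.norm(z_stg, axis=1, keepdims=True)
    b = z_stt / np.linalg.norm(z_stt, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The positive for row i sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average both directions: STG -> STT and STT -> STG.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))

# Stand-in embeddings: 8 samples, 16 dimensions per stream.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
z_b = z_a + 0.01 * rng.normal(size=(8, 16))  # nearly aligned "other stream"
print(info_nce(z_a, z_b))        # small: positive pairs agree
print(info_nce(z_a, z_b[::-1]))  # large: positive pairs are mismatched
```

Minimizing a loss of this kind is one standard way to maximize a lower bound on the mutual information between two views, which matches the motivation given above; in practice it would be added to the classification loss and computed on the outputs of the two streams.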


research
09/07/2021

GCsT: Graph Convolutional Skeleton Transformer for Action Recognition

Graph convolutional networks (GCNs) achieve promising performance for sk...
research
03/07/2023

Learning Discriminative Representations for Skeleton Based Action Recognition

Human action recognition aims at classifying the category of human actio...
research
02/05/2023

Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition

Contrastive learning has been successfully leveraged to learn action rep...
research
11/16/2016

Joint Network based Attention for Action Recognition

By extracting spatial and temporal characteristics in one network, the t...
research
08/27/2019

Cooperative Cross-Stream Network for Discriminative Action Representation

Spatial and temporal stream model has gained great success in video acti...
research
08/29/2019

DWnet: Deep-Wide Network for 3D Action Recognition

We propose in this paper a deep-wide network (DWnet) which combines the ...
research
06/30/2023

SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network

Recent advancements in technology have expanded the possibilities of hum...
