MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

03/02/2022
by Jinlu Zhang, et al.

Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering the body joints in all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (i.e., Human3.6M, MPI-INF-3DHP, and HumanEva) to evaluate the proposed method. The results show that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and 7.6% MPJPE.
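
The alternating design described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes parameter-free scaled dot-product attention (no learned projections, heads, residual connections, normalization, or regression head), a spatial-then-temporal order inside each layer, and hypothetical names (`self_attention`, `spatial_block`, `temporal_block`, `mixed_encoder`). It only shows which axis each block attends over and that every input frame yields an output frame.

```python
import math


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors.

    Assumption: a real transformer block would use learned query/key/value
    projections; they are omitted here to keep the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * value[c] for w, value in zip(weights, tokens)) / total
            for c in range(dim)
        ])
    return out


def spatial_block(x):
    """x[t][j] is the feature vector of joint j in frame t.

    Attends across the joints of one frame at a time (inter-joint correlation).
    """
    return [self_attention(frame) for frame in x]


def temporal_block(x):
    """Attends across frames separately for each joint (per-joint motion)."""
    num_frames, num_joints = len(x), len(x[0])
    per_joint = [
        self_attention([x[t][j] for t in range(num_frames)])
        for j in range(num_joints)
    ]
    # Transpose back to the [frame][joint] layout.
    return [[per_joint[j][t] for j in range(num_joints)] for t in range(num_frames)]


def mixed_encoder(x, depth=2):
    """Alternates the two blocks; the order within a layer is an assumption."""
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x


# Toy input: 3 frames, 2 joints, 2 features per joint (made-up numbers).
poses = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, 2.0]],
    [[1.0, 0.2], [0.0, 3.0]],
]
encoded = mixed_encoder(poses)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 3 2 2
```

The output keeps the input's frame and joint layout, which mirrors the sequence-to-sequence idea of producing a result for every frame rather than only the central one.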
