P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

03/15/2022
by Wenkang Shan, et al.

This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, we propose a self-supervised pre-training sub-task, termed masked pose modeling. The human joints in the input sequence are randomly masked in both the spatial and temporal domains. A general form of denoising auto-encoder is then used to recover the original 2D poses, which lets the encoder learn to capture spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator that predicts the 3D pose in the current frame. In particular, an MLP block is used as the spatial feature extractor in STMO, which yields better performance than other methods. In addition, a temporal downsampling strategy is proposed to reduce data redundancy. Extensive experiments on two benchmarks show that our method outperforms state-of-the-art methods with fewer parameters and less computational overhead. For example, our P-STMO model achieves 42.1 mm MPJPE on the Human3.6M dataset when using 2D poses from CPN as inputs, while running 1.5 to 7.1 times faster than state-of-the-art methods. Code is available at https://github.com/paTRICK-swk/P-STMO.
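The following is a minimal NumPy sketch of three ideas named in the abstract: random masking of joints in the temporal and spatial domains, temporal downsampling, and the MPJPE metric. It is not the authors' implementation (see the linked repository for that). The function names, the mask ratio, the number of masked joints per frame, the choice to zero out masked entries, and the fixed-stride downsampling are all assumptions made for illustration; the abstract does not specify them. Only the MPJPE definition (mean Euclidean distance between predicted and ground-truth joints) is standard.

```python
import numpy as np


def mask_pose_sequence(poses_2d, temporal_ratio=0.5, spatial_count=2, rng=None):
    """Randomly mask a 2D pose sequence of shape (T, J, 2).

    Hypothetical helper: the ratios and the zero fill are illustrative
    assumptions, not values taken from the paper.
    Returns the masked copy and a boolean mask of shape (T, J), True = masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, num_joints, _ = poses_2d.shape
    mask = np.zeros((num_frames, num_joints), dtype=bool)

    # Temporal masking: hide every joint in a random subset of frames.
    n_masked_frames = int(round(temporal_ratio * num_frames))
    masked_frames = rng.choice(num_frames, size=n_masked_frames, replace=False)
    mask[masked_frames, :] = True

    # Spatial masking: in each remaining frame, hide a few random joints.
    for t in range(num_frames):
        if not mask[t].all():
            joints = rng.choice(num_joints, size=spatial_count, replace=False)
            mask[t, joints] = True

    masked = poses_2d.copy()
    masked[mask] = 0.0  # assumption: masked joints are zero-filled
    return masked, mask


def temporal_downsample(poses, stride):
    """Keep every `stride`-th frame (assumed fixed-stride scheme)."""
    return poses[::stride]


def mpjpe(pred_3d, target_3d):
    """Mean per-joint position error: average Euclidean distance over
    all joints and frames, in the units of the inputs (e.g. mm)."""
    return float(np.linalg.norm(pred_3d - target_3d, axis=-1).mean())
```

In a pre-training loop along these lines, a denoising auto-encoder would receive the masked sequence and be trained to reconstruct the original 2D poses; at evaluation time, `mpjpe` would compare predicted and ground-truth 3D joints.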


research
06/29/2023

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Estimating 3D human poses only from a 2D human pose sequence is thorough...
research
09/14/2023

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Recently, large-scale pre-trained language-image models like CLIP have s...
research
01/08/2019

A Spatial-temporal 3D Human Pose Reconstruction Framework

3D human pose reconstruction from single-view camera is a difficult and ...
research
01/18/2023

HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation

Transformer-based approaches have been successfully proposed for 3D huma...
research
05/03/2022

In Defense of Image Pre-Training for Spatiotemporal Recognition

Image pre-training, the current de-facto paradigm for a wide range of vi...
research
09/06/2021

Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation

3D human shape and pose estimation is the essential task for human motio...
research
07/31/2017

Recurrent 3D Pose Sequence Machines

3D human articulated pose recovery from monocular image sequences is ver...
