IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation

08/06/2022
by Zhongwei Qiu, et al.

Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequences of 2D poses, but they cannot model contextual depth features effectively because visual depth cues are lost during 2D pose estimation. In this paper, we simplify the paradigm into an end-to-end framework, Instance-guided Video Transformer (IVT), which learns spatiotemporal contextual depth information from visual features and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is in charge of predicting the 3D pose of one human instance. These tokens contain body structure information because they are extracted under the guidance of joint offsets from the human center to the corresponding body joints. Then, these tokens are sent into IVT to learn spatiotemporal contextual depth. In addition, we propose a cross-scale instance-guided attention mechanism to handle the varying scales among multiple persons. Finally, the 3D pose of each person is decoded from the instance-guided tokens by coordinate regression. Experiments on three widely used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
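
To make the idea of an instance-guided token more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a token is formed by sampling a feature map at the human center plus each predicted joint offset and concatenating the sampled vectors. The function name `instance_guided_token`, the nearest-neighbor sampling, and the toy feature map are all illustrative assumptions; the paper may use a different sampling and aggregation scheme.

```python
from typing import List, Tuple

# Feature map indexed as [row][col][channel]; a stand-in for backbone features.
FeatureMap = List[List[List[float]]]


def clamp(value: int, low: int, high: int) -> int:
    """Restrict an index to the valid range [low, high]."""
    return max(low, min(value, high))


def instance_guided_token(
    features: FeatureMap,
    center: Tuple[float, float],
    joint_offsets: List[Tuple[float, float]],
) -> List[float]:
    """Hypothetical token builder: sample features at center + offset per joint.

    Assumption: nearest-neighbor sampling and plain concatenation.
    """
    height = len(features)
    width = len(features[0])
    cx, cy = center
    token: List[float] = []
    for dx, dy in joint_offsets:
        # Locations outside the map are clamped to the nearest border pixel.
        col = clamp(round(cx + dx), 0, width - 1)
        row = clamp(round(cy + dy), 0, height - 1)
        token.extend(features[row][col])
    return token


# Toy 4x4 map with 2 channels: channel 0 stores the row, channel 1 the column.
features = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
token = instance_guided_token(
    features,
    center=(1.0, 1.0),
    joint_offsets=[(0.0, 0.0), (2.0, 1.0), (-5.0, 0.0)],
)
print(token)
```

In this sketch each person yields one token per frame whose length is the number of joints times the channel count; a sequence of such tokens across frames is what a video transformer would then attend over.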

research
04/12/2023

Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation

Human pose estimation has seen widespread use of transformer models in r...
research
03/16/2022

DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation

This paper proposes a simple baseline framework for video-based 2D/3D hu...
research
01/19/2022

Poseur: Direct Human Pose Regression with Transformers

We propose a direct, regression-based approach to 2D human pose estimati...
research
03/31/2017

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

Deep ConvNets have been shown to be effective for the task of human pose...
research
08/16/2023

Agglomerative Transformer for Human-Object Interaction Detection

We propose an agglomerative Transformer (AGER) that enables Transformer-...
research
02/13/2019

3D Robot Pose Estimation from 2D Images

This paper considers the task of locating articulated poses of multiple ...
research
05/05/2017

Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation

Human pose estimation using deep neural networks aims to map input image...
