Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos

09/20/2022
by Yilin Wen, et al.

Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Because hand pose estimation and action recognition operate at different temporal granularities yet are semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices. We will open-source code and data to facilitate future research.
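The two-level hierarchy described in the abstract can be illustrated with a small data-flow sketch. This is not the authors' implementation: an unparameterized single-head self-attention stands in for each transformer encoder, and the window length, the mean-pooling step, and the names `self_attention`, `split_windows`, and `hierarchical_forward` are all assumptions made for illustration. The sketch also passes only generic per-frame features to the second stage, whereas the paper aggregates per-frame pose and object information.

```python
import math


def self_attention(tokens):
    """Unparameterized single-head self-attention over a list of vectors.

    Hypothetical stand-in for a transformer encoder: it has no learned
    projections and no feed-forward layer.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return out


def split_windows(frames, window):
    """Split a clip into consecutive, non-overlapping short windows."""
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def hierarchical_forward(frame_features, window=4):
    # Stage 1 (short-term): attend only within each short window and keep
    # one output token per frame, as a per-frame pose head would need.
    per_frame = []
    for segment in split_windows(frame_features, window):
        per_frame.extend(self_attention(segment))

    # Stage 2 (long-term): attend across all per-frame tokens of the clip,
    # then mean-pool into one clip-level feature (pooling is an assumption).
    clip_tokens = self_attention(per_frame)
    dim = len(clip_tokens[0])
    action_feature = [
        sum(t[d] for t in clip_tokens) / len(clip_tokens) for d in range(dim)
    ]
    return per_frame, action_feature


# Toy clip: 10 frames with 3-dimensional features each.
frames = [[math.sin(0.3 * t), math.cos(0.3 * t), 0.1 * t] for t in range(10)]
per_frame, action_feature = hierarchical_forward(frames, window=4)
print(len(per_frame), len(action_feature))  # prints: 10 3
```

The point of the sketch is the difference in temporal scope: the first stage only sees frames within a short window, while the second stage sees every per-frame token in the clip before producing a single clip-level output.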

research
12/18/2022

2D Pose Estimation based Child Action Recognition

We present a graph convolutional network with 2D pose estimation for the...
research
04/08/2017

First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

In this work we study the use of 3D hand poses to recognize first-person...
research
06/13/2022

A Training Method For VideoPose3D With Ideology of Action Recognition

Action recognition and pose estimation from videos are closely related t...
research
03/13/2016

Pose for Action - Action for Pose

In this work we propose to utilize information about human actions to im...
research
07/07/2017

The 2017 Hands in the Million Challenge on 3D Hand Pose Estimation

We present the 2017 Hands in the Million Challenge, a public competition...
research
03/06/2023

EvHandPose: Event-based 3D Hand Pose Estimation with Sparse Supervision

Event camera shows great potential in 3D hand pose estimation, especiall...
research
11/10/2020

Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos

Taking advantage of human pose data for understanding human activities h...
