Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

08/20/2023
by   Haoyuan Li, et al.

Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves on the state of the art, outperforming the previously best results by 4.2 [...] and PVE metrics, respectively, while being 40 [...] approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.
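The abstract reports results on the PVE metric without defining it. As a rough illustration only, PVE (per-vertex error) is commonly computed as the mean Euclidean distance between corresponding predicted and ground-truth mesh vertices. The sketch below assumes that common definition; the function name `per_vertex_error` and the toy vertices are hypothetical and are not taken from the CoordFormer code, whose evaluation protocol (alignment, units, averaging over people and frames) may differ.

```python
import math


def per_vertex_error(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding 3D vertices.

    Both arguments are sequences of (x, y, z) tuples in the same order
    and the same unit. This is an illustrative helper, not the paper's code.
    """
    if len(pred_vertices) != len(gt_vertices):
        raise ValueError("meshes must have the same number of vertices")
    if not pred_vertices:
        raise ValueError("meshes must contain at least one vertex")
    distances = [math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices)]
    return sum(distances) / len(distances)


# Toy three-vertex "mesh" (hypothetical values).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0), (0.0, 1.0, 0.0)]
pve = per_vertex_error(pred, gt)  # per-vertex distances are 3, 4 and 0
```

In practice the metric is averaged over all people and frames in the test set, so a lower value means the recovered surface lies closer to the ground-truth mesh.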

research
08/27/2020

CenterHMR: a Bottom-up Single-shot Method for Multi-person 3D Mesh Recovery from a Single Image

In this paper, we propose a method to recover multi-person 3D mesh from ...
research
05/06/2021

Body Meshes as Points

We consider the challenging multi-person 3D body mesh estimation task in...
research
01/02/2023

Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

In recent years, the Transformer architecture has shown its superiority ...
research
07/15/2022

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Active speaker detection (ASD) in videos with multiple speakers is a cha...
research
11/24/2019

Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate

This work proposes a new human-related video processing task named 3D pa...
research
03/24/2022

Occluded Human Mesh Recovery

Top-down methods for monocular human mesh recovery have two stages: (1) ...
research
08/17/2023

Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling

To bridge the physical and virtual worlds for rapidly developed VR/AR ap...