Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

05/25/2023
by   Nicola Messina, et al.

Due to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
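To make the retrieval task concrete, the following is a minimal sketch of how a text query could be matched against a motion collection once both modalities have been encoded. It is not the authors' implementation: it assumes that text and motion are mapped to vectors in a shared space and compared by cosine similarity, which the abstract does not state explicitly. The function names, the toy three-dimensional embeddings, and the use of recall@k as a metric are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_motions(text_embedding, motion_embeddings):
    """Return motion ids sorted from most to least similar to the text query.

    Assumption: both modalities were already encoded into the same space.
    """
    scores = {
        motion_id: cosine_similarity(text_embedding, embedding)
        for motion_id, embedding in motion_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant motion appears in the top-k results, else 0.0.

    Assumption: recall@k is used here only as a common retrieval metric;
    the abstract does not name the metrics used in the paper.
    """
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


# Hypothetical toy embeddings standing in for encoder outputs.
motion_embeddings = {
    "walk_forward": [0.9, 0.1, 0.0],
    "wave_hand": [0.1, 0.9, 0.2],
    "jump": [0.0, 0.2, 0.9],
}
text_embedding = [0.8, 0.2, 0.1]  # e.g. an encoded "a person walks forward"

ranking = rank_motions(text_embedding, motion_embeddings)
print(ranking)
print(recall_at_k(ranking, "walk_forward", k=1))
```

In a real system the toy vectors would be replaced by the outputs of a trained text encoder and a trained motion encoder, and the ranking would be computed over the whole motion collection.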


research
05/07/2023

Cross-Modal Retrieval for Motion and Text via MildTriple Loss

Cross-modal retrieval has become a prominent research topic in computer ...
research
09/02/2023

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Generating 3D human motion based on textual descriptions has been a rese...
research
07/13/2022

Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning

We propose a new transformer model for the task of unsupervised learning...
research
04/18/2020

Attention, please: A Spatio-temporal Transformer for 3D Human Motion Prediction

In this paper, we propose a novel architecture for the task of 3D human ...
research
08/02/2023

Spatio-Temporal Branching for Motion Prediction using Motion Increments

Human motion prediction (HMP) has emerged as a popular research topic du...
research
11/29/2022

UDE: A Unified Driving Engine for Human Motion Generation

Generating controllable and editable human motion sequences is a key cha...
research
06/18/2021

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Combining Natural Language with Vision represents a unique and interesti...
