Cross-Modal Retrieval for Motion and Text via MildTriple Loss

05/07/2023
by Sheng Yan, et al.

Cross-modal retrieval has become a prominent research topic in computer vision and natural language processing, driven by advances in image-text and video-text retrieval. However, cross-modal retrieval between human motion sequences and text has not received sufficient attention despite its broad application value, such as helping virtual reality applications better understand users' actions and language. The task presents several challenges: jointly modeling the two modalities, understanding person-centered information in text, and learning behavior features from 3D human motion sequences. Previous work on motion data modeling relied mainly on autoregressive feature extractors, which may forget earlier information. We instead propose a model built from simple yet powerful transformer-based motion and text encoders, which learn representations from the two modalities and capture long-term dependencies. Furthermore, different human motions can share the same atomic actions, and this overlap can cause semantic conflicts, which led us to explore a new triplet loss function, MildTriple Loss. It leverages the similarity between samples in the intra-modal space to guide soft-hard negative sample mining in the joint embedding space, both to train the triplet loss and to reduce the violations caused by false negative samples. We evaluated our model and method on the latest HumanML3D and KIT Motion-Language datasets, achieving 62.9% recall for motion retrieval and 71.5% recall for text retrieval (both measured as R@10) on the HumanML3D dataset. Our code is available at https://github.com/eanson023/rehamot.
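
The idea of using intra-modal similarity to keep likely false negatives out of triplet mining can be illustrated with a minimal sketch. This is not the MildTriple Loss from the paper or the linked repository: it swaps in a simpler technique, a hinge-based triplet loss with hardest-negative mining plus a hard similarity threshold, and the function name `masked_hardest_triplet_loss`, the parameters `margin` and `sim_threshold`, and the toy data are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def masked_hardest_triplet_loss(motion_emb, text_emb, intra_sim,
                                margin=0.2, sim_threshold=0.8):
    """Bidirectional hinge triplet loss with hardest-negative mining.

    ASSUMPTION: a simplified stand-in, not the paper's MildTriple Loss.
    motion_emb, text_emb: (N, D) arrays; row i of each forms a matched pair.
    intra_sim: (N, N) symmetric similarity between samples inside one
        modality (for example text-to-text). Pairs at or above
        sim_threshold are treated as likely false negatives and skipped.
    """
    m = l2_normalize(np.asarray(motion_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    scores = m @ t.T                 # (N, N) cross-modal cosine similarity
    pos = np.diag(scores)            # similarity of each matched pair
    n = scores.shape[0]

    # Exclude the positives themselves and the likely false negatives.
    excluded = (np.asarray(intra_sim) >= sim_threshold) | np.eye(n, dtype=bool)
    neg_scores = np.where(excluded, -np.inf, scores)

    hardest_text = neg_scores.max(axis=1)    # per motion: hardest text negative
    hardest_motion = neg_scores.max(axis=0)  # per text: hardest motion negative

    # If every negative is excluded, the hinge term evaluates to zero.
    loss_m2t = np.maximum(0.0, margin + hardest_text - pos)
    loss_t2m = np.maximum(0.0, margin + hardest_motion - pos)
    return float((loss_m2t + loss_t2m).mean())


# Toy batch: motion 1 overlaps heavily with text 0 (a shared atomic action).
text = np.eye(3)
motion = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.0, 0.0, 1.0]])

no_filter = np.zeros((3, 3))          # nothing is flagged as a false negative
with_filter = np.zeros((3, 3))
with_filter[0, 1] = with_filter[1, 0] = 0.9   # samples 0 and 1 look alike

loss_plain = masked_hardest_triplet_loss(motion, text, no_filter)
loss_masked = masked_hardest_triplet_loss(motion, text, with_filter)
print(loss_plain, loss_masked)
```

In this toy batch, flagging samples 0 and 1 as semantically similar removes the only margin violation, so the filtered loss is lower than the unfiltered one. A soft variant would down-weight such pairs rather than drop them outright.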

research
07/11/2022

Intra-Modal Constraint Loss For Image-Text Retrieval

Cross-modal retrieval has drawn much attention in both computer vision a...
research
05/25/2023

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Due to recent advances in pose-estimation methods, human motion can be e...
research
05/02/2023

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

In this paper, we present TMR, a simple yet effective approach for text ...
research
01/23/2019

Exploring Uncertainty in Conditional Multi-Modal Retrieval Systems

We cast visual retrieval as a regression problem by posing triplet loss ...
research
11/07/2022

Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval

The heterogeneity gap problem is the main challenge in cross-modal retri...
research
04/20/2023

Image-text Retrieval via preserving main Semantics of Vision

Image-text retrieval is one of the major tasks of cross-modal retrieval....
research
02/18/2021

Hierarchical Similarity Learning for Language-based Product Image Retrieval

This paper aims for the language-based product image retrieval task. The...
