HunYuan_tvr for Text-Video Retrieval

04/07/2022
by Shaobo Min, et al.

Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases or between single frames and words. In this paper, we propose a novel method, named HunYuan_tvr, that explores hierarchical cross-modal interactions by simultaneously modeling video-sentence, clip-phrase, and frame-word relationships. Considering the intrinsic semantic relations between frames, HunYuan_tvr first performs self-attention to explore frame-wise correlations and adaptively clusters correlated frames into clip-level representations. The clip-wise correlations are then explored to aggregate the clip representations into a compact one that describes the video globally. In this way, we construct hierarchical video representations at frame, clip, and video granularities, and likewise explore word-wise correlations to form word-phrase-sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to explore the cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HunYuan_tvr to achieve a comprehensive multi-modal understanding. Further boosted by adaptive label denoising and marginal sample enhancement, HunYuan_tvr obtains new state-of-the-art results on several benchmarks, including DiDeMo and ActivityNet (e.g., Rank@1 of 55.0).
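The hierarchical contrastive objective described above can be sketched as a sum of symmetric InfoNCE losses, one per granularity pair (frame-word, clip-phrase, video-sentence). This is a minimal illustration, not the paper's implementation: the function names, the mean-pooling of frame/word and clip/phrase sets, and the temperature value are assumptions for brevity, whereas the paper builds its clip and phrase representations via self-attention-based clustering.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a, b of shape (B, D).

    Matched pairs sit on the diagonal of the similarity matrix; all other
    entries in the same row/column act as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(a.size(0))          # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(frame_emb, clip_emb, video_emb,
                                  word_emb, phrase_emb, sent_emb):
    """Sum contrastive losses over three granularity pairs.

    frame_emb: (B, Tf, D), clip_emb: (B, Tc, D), video_emb: (B, D)
    word_emb:  (B, Tw, D), phrase_emb: (B, Tp, D), sent_emb: (B, D)
    Frame/word and clip/phrase sets are mean-pooled here for simplicity;
    the paper instead derives them via attention-based clustering.
    """
    return (info_nce(frame_emb.mean(1), word_emb.mean(1)) +
            info_nce(clip_emb.mean(1), phrase_emb.mean(1)) +
            info_nce(video_emb, sent_emb))
```

A usage sketch with random embeddings: `hierarchical_contrastive_loss(torch.randn(4, 12, 256), torch.randn(4, 3, 256), torch.randn(4, 256), torch.randn(4, 10, 256), torch.randn(4, 3, 256), torch.randn(4, 256))` returns a scalar loss suitable for backpropagation.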

Related research

- Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions (07/28/2023)
- Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval (09/12/2021)
- Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning (04/13/2018)
- Correspondence Matters for Video Referring Expression Comprehension (07/21/2022)
- Semantic Grouping Network for Video Captioning (02/01/2021)
- Semi-Parametric Video-Grounded Text Generation (01/27/2023)
- Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames (10/16/2022)
