HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

03/28/2021
by Song Liu, et al.

Video-text retrieval has become a hot research topic with the explosion of multimedia data on the Internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) limited exploitation of the transformer architecture, in which different layers have different feature characteristics; 2) an end-to-end training mechanism that limits negative interactions to the samples within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level to achieve multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on-the-fly, which contributes to the generation of more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our methods.
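To make the MoCo-style idea mentioned in the abstract concrete, the following is a minimal sketch of generic momentum contrast with a queue of negatives. It is not the HiT implementation: the function names (`momentum_update`, `info_nce`), the toy 2-D embeddings, the queue size, and the values `m=0.999` and `temperature=0.07` are all illustrative assumptions and do not come from the abstract.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(query_params, key_params, m=0.999):
    """Exponential moving average: the key encoder slowly tracks the query encoder.

    Hypothetical helper; parameters are flat lists of floats for illustration.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query, one positive key, and a list of negative keys."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Assumed setup: a fixed-size queue of text keys from earlier batches serves
# as negatives, so the number of negatives is not tied to the mini-batch size.
text_queue = deque(maxlen=4)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_queue.append(key)

video_query = [1.0, 0.0]        # toy video embedding from the query encoder
matched_text_key = [0.9, 0.1]   # toy text embedding from the key encoder

loss = info_nce(video_query, matched_text_key, list(text_queue))

# After the step, the newest key is enqueued; the oldest is dropped once full.
text_queue.append(matched_text_key)
```

In this sketch the queue decouples the number of negatives from the mini-batch size, while the momentum update keeps queued keys reasonably consistent with the current encoder. A cross-modal variant would presumably keep one queue per modality (text keys for video queries, video keys for text queries), but that detail is an assumption here.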

research
05/10/2022

Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

In this paper, we present a cross-modal recipe retrieval framework, Tran...
research
01/17/2023

USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval

As a fundamental and challenging task in bridging language and vision do...
research
10/09/2022

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

In this paper, we re-examine the task of cross-modal clip-sentence retri...
research
10/16/2022

Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames

Cross-modal video retrieval aims to retrieve the semantically relevant v...
research
11/01/2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity...
research
03/09/2023

Improving Video Retrieval by Adaptive Margin

Video retrieval is becoming increasingly important owing to the rapid em...
research
03/28/2022

Image-text Retrieval: A Survey on Recent Research and Development

In the past few years, cross-modal image-text retrieval (ITR) has experi...
