ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

10/09/2022
by Adriano Fragomeni, et al.

In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve retrieval performance. We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for the video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.
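
The two ingredients named in the abstract, a local temporal context around a clip and a contrastive loss in the shared embedding space, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`context_window`, `symmetric_contrastive_loss`), the window half-width `m`, the clamping at video boundaries, the cosine similarity, and the temperature value are all assumptions made for illustration, and the transformer encoder itself is omitted.

```python
import math


def context_window(num_clips, index, m):
    """Indices of clip `index` plus up to `m` neighbours on each side.

    Assumption: the window is simply clamped at the start and end of the
    video; the abstract does not say how boundaries are handled.
    """
    lo = max(0, index - m)
    hi = min(num_clips - 1, index + m)
    return list(range(lo, hi + 1))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def symmetric_contrastive_loss(clip_embs, text_embs, temperature=0.07):
    """Generic symmetric cross-modal contrastive loss (illustrative only).

    clip_embs[i] and text_embs[i] are assumed to be a matching pair; every
    other pairing in the batch is treated as a negative.
    """
    n = len(clip_embs)
    sims = [[cosine(c, t) / temperature for t in text_embs] for c in clip_embs]

    def nll(scores, target):
        # Negative log-softmax of the target entry, computed stably.
        mx = max(scores)
        log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
        return log_z - scores[target]

    # Clip-to-sentence direction: each row of the similarity matrix.
    clip_to_text = sum(nll(sims[i], i) for i in range(n)) / n
    # Sentence-to-clip direction: each column of the similarity matrix.
    text_to_clip = sum(
        nll([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (clip_to_text + text_to_clip)
```

In a full model, the clips selected by `context_window` would be fed to the context encoder, and the enhanced embedding of the central clip would take the place of `clip_embs[i]` in the loss. The loss is small when matching clip and sentence embeddings are aligned and grows when they are mismatched.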


research
08/09/2019

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

We address the problem of cross-modal fine-grained action retrieval betw...
research
03/28/2021

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Video-Text Retrieval has been a hot research topic with the explosion of...
research
09/30/2021

CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations

Contrastive learning allows us to flexibly define powerful losses by con...
research
03/16/2022

Learning video retrieval models with relevance-aware online mining

Due to the amount of videos and related captions uploaded every hour, de...
research
02/24/2023

Deep Learning for Video-Text Retrieval: a Review

Video-Text Retrieval (VTR) aims to search for the most relevant video re...
research
12/29/2022

BagFormer: Better Cross-Modal Retrieval via bag-wise interaction

In the field of cross-modal retrieval, single encoder models tend to per...
research
04/10/2020

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Existing dominant approaches for cross-modal video-text retrieval task a...
