Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

04/13/2018
by Huijuan Xu, et al.

We propose a novel method for retrieving clips from untrimmed videos based on natural language queries. This cross-modal retrieval task plays a key role in visual-semantic understanding and requires both localizing clips in time and computing their similarity to the query sentence. Current methods generate sentence and video embeddings and then compare them using a late fusion approach, but this ignores the word order in queries and prevents more fine-grained comparisons. Motivated by the need for fine-grained multi-modal feature fusion, we propose an early fusion embedding approach that combines video and language information at the word level. Furthermore, we use the inverse task of dense video captioning as a side task to improve the learned embedding. Our full model combines these components with an efficient proposal pipeline that accurately localizes candidate video clips. We present a comprehensive experimental validation on two large-scale text-to-clip datasets (Charades-STA and DiDeMo) and attain state-of-the-art retrieval results with our model.
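
The contrast between late and early fusion can be illustrated with a toy sketch. It is not the architecture from the paper, which the abstract does not specify. As assumptions made purely for illustration, the late fusion branch mean-pools word vectors into a sentence embedding before comparing it with the clip embedding, and the early fusion branch feeds each word vector, concatenated with the clip feature, through a small tanh recurrence. All function names, dimensions, and weights below are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (order-insensitive by construction)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def late_fusion_score(word_vecs, clip_vec):
    """Embed the sentence on its own, then compare it with the clip once."""
    return cosine(mean_pool(word_vecs), clip_vec)


def early_fusion_score(word_vecs, clip_vec, w_in, w_rec, w_out):
    """Fuse the clip feature with every word before a recurrent update.

    w_in:  hidden x (word_dim + clip_dim) input weights (hypothetical)
    w_rec: hidden x hidden recurrent weights (hypothetical)
    w_out: hidden-sized readout weights (hypothetical)
    """
    hidden = [0.0] * len(w_rec)
    for word in word_vecs:
        fused = list(word) + list(clip_vec)  # word-level fusion of both modalities
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[j], fused))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return sum(w * h for w, h in zip(w_out, hidden))


# Toy inputs: two 2-d word vectors and one 2-d clip feature (all made up).
WORDS = [[1.0, 0.0], [0.0, 1.0]]
CLIP = [0.5, 0.5]
W_IN = [[0.5, -0.3, 0.2, 0.1], [-0.4, 0.6, 0.1, -0.2]]
W_REC = [[0.3, -0.2], [0.1, 0.4]]
W_OUT = [1.0, -1.0]

if __name__ == "__main__":
    reversed_words = WORDS[::-1]
    print("late  :", late_fusion_score(WORDS, CLIP),
          late_fusion_score(reversed_words, CLIP))
    print("early :", early_fusion_score(WORDS, CLIP, W_IN, W_REC, W_OUT),
          early_fusion_score(reversed_words, CLIP, W_IN, W_REC, W_OUT))
```

In this sketch, reversing the query words leaves the pooled late fusion score unchanged, while the early fusion score changes because the clip feature interacts with each word in sequence. A trained model would learn the weights rather than fix them by hand.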

