ExCL: Extractive Clip Localization Using Natural Language Descriptions

04/04/2019
by Soham Ghosh, et al.

The task of retrieving clips within videos based on a given natural language query requires cross-modal reasoning over multiple frames. Prior approaches such as sliding window classifiers are inefficient, while ranking-based approaches driven by text-clip similarity, such as segment proposal networks, are far more complicated. To select the most relevant video clip for a given text description, we propose a novel extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video; this removes the need to retrieve and re-rank multiple proposal segments. Using recurrent networks, we encode the two modalities into a joint representation, which is then used in different variants of start-end frame predictor networks. Through extensive experimentation and ablative analysis, we demonstrate that our simple and elegant approach significantly outperforms the state of the art on two datasets and has comparable performance on a third.
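
To make the extractive idea concrete, here is a minimal sketch of how a clip could be decoded once a model has produced one start score and one end score per frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_span`, the additive scoring, and the constraint that the start frame must not come after the end frame are all assumptions introduced here, and the scores would in practice come from the start-end frame predictor networks described above.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float]) -> Tuple[int, int]:
    """Return (start, end) frame indices maximizing
    start_scores[start] + end_scores[end] subject to start <= end.

    Hypothetical helper: the per-frame scores are assumed to be given.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best_start = 0              # index of the highest start score seen so far
    best = (0, 0)               # best (start, end) pair found so far
    best_total = float("-inf")  # score of that pair

    # Single pass: for each candidate end frame, the best valid start is the
    # highest-scoring start at or before it.
    for end, end_score in enumerate(end_scores):
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        total = start_scores[best_start] + end_score
        if total > best_total:
            best_total = total
            best = (best_start, end)
    return best


if __name__ == "__main__":
    start = [0.1, 2.0, 0.3, 0.2]
    end = [0.5, 0.1, 0.4, 1.5]
    print(best_span(start, end))
```

Because a single (start, end) pair is read directly from the per-frame scores, no set of candidate segments has to be generated and re-ranked, which is the contrast with proposal-based approaches that the abstract draws.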
