Video and Text Matching with Conditioned Embeddings

10/21/2021
by Ameen Ali et al.

We present a method for matching a text sentence from a given corpus to a given video clip, and vice versa. Traditionally, video and text matching is done by learning a shared embedding space, where the encoding of one modality is independent of the other. In this work, we encode the dataset items in a way that takes the query's relevant information into account. The power of the method is shown to arise from pooling the interaction data between words and frames. Since the encoding of the video clip depends on the sentence it is compared with, the representation needs to be recomputed for each potential match. To this end, we propose an efficient shallow neural network. Its training employs a hierarchical triplet loss that is extendable to paragraph/video matching. The method is simple, provides explainability, and achieves state-of-the-art results for both sentence-clip and video-text matching by a sizable margin across five datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC. We also show that our conditioned representation can be transferred to video-guided machine translation, where we improve over the current results on VATEX. Source code is available at https://github.com/AmeenAli/VideoMatch.
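To make the idea of a sentence-conditioned clip encoding concrete, below is a minimal PyTorch sketch of word-frame interaction pooling feeding a shallow scoring network, trained with a triplet-style margin term. The feature dimensions, the dot-product pooling, and the names ConditionedMatcher and triplet_term are assumptions made for illustration; the authors' actual method is in the repository linked above.

```python
# A minimal sketch of sentence-conditioned clip scoring via word-frame
# interaction pooling followed by a shallow scoring network. Dimensions,
# the attention-style pooling, and the class/function names are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedMatcher(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        # A shallow scoring head keeps per-pair re-encoding cheap, since the
        # clip representation must be recomputed for every candidate sentence.
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, frames, words):
        # frames: (num_frames, dim), words: (num_words, dim)
        # Pairwise word-frame interactions (here: scaled dot products).
        sim = frames @ words.t() / frames.size(-1) ** 0.5  # (num_frames, num_words)
        attn = sim.softmax(dim=1)                          # weight words per frame
        conditioned = attn @ words                         # sentence-conditioned frame features
        pooled = conditioned.mean(dim=0)                   # pool interactions over frames
        return self.score(pooled).squeeze(-1)              # scalar match score


def triplet_term(pos_score, neg_score, margin=0.2):
    # One ranking term; the hierarchical triplet loss described in the abstract
    # would combine such terms at both the sentence-clip and paragraph-video levels.
    return F.relu(margin - pos_score + neg_score)
```

Because the clip encoding depends on the sentence it is compared with, retrieval with this kind of model scores each clip-sentence pair separately rather than indexing precomputed embeddings, which is why the scoring network has to stay shallow.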


Related research:

- Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020 (06/23/2020): Video-guided machine translation as one of multimodal neural machine tra...
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos (03/22/2023): Sequential video understanding, as an emerging video understanding task,...
- Learning a Text-Video Embedding from Incomplete and Heterogeneous Data (04/07/2018): Joint understanding of video and language is an active research area wit...
- Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction (04/23/2016): This paper strives to find the sentence best describing the content of a...
- SLTUNET: A Simple Unified Model for Sign Language Translation (05/02/2023): Despite recent successes with neural models for sign language translatio...
- Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective (10/05/2022): Visual-Semantic Embedding (VSE) aims to learn an embedding space where r...
- Sentence Specified Dynamic Video Thumbnail Generation (08/12/2019): With the tremendous growth of videos over the Internet, video thumbnails...
