Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

04/07/2018
by   Antoine Miech, et al.
0

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks. Code is available at: https://github.com/antoine77340/Mixture-of-Embedding-Experts

READ FULL TEXT

page 2

page 5

page 14

research
06/07/2019

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Learning text-video embeddings usually requires a dataset of video clips...
research
08/22/2023

Multi-event Video-Text Retrieval

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of ma...
research
04/01/2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Our objective in this work is video-text retrieval - in particular a joi...
research
10/21/2021

Video and Text Matching with Conditioned Embeddings

We present a method for matching a text sentence from a given corpus to ...
research
03/14/2022

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

In this work we present a new State-of-The-Art on the text-to-video retr...
research
06/08/2022

Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners

Domain generalization (DG) aims at learning generalizable models under d...
research
07/31/2019

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

The rapid growth of video on the internet has made searching for video c...

Please sign up or login with your details

Forgot password? Click here to reset