Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

12/08/2021
by   Nina Shvetsova, et al.

Multi-modal learning from video data has recently seen increased attention, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, obtaining an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
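The combinatorial loss described above can be sketched as a sum of contrastive terms over all disjoint modality combinations. The following is a minimal illustrative sketch, not the authors' implementation: it assumes the fusion transformer has already produced one embedding per single modality ("v", "a", "t") and per fused pair ("va", "vt", "at"), and uses a standard symmetric InfoNCE term between each disjoint pair of combinations; the function names and temperature value are hypothetical.

```python
import itertools
import numpy as np

def nce_loss(x, y, temperature=0.05):
    """Symmetric InfoNCE between two batches of embeddings (hypothetical helper)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # L2-normalize rows
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    sim = x @ y.T / temperature                        # (B, B) similarity matrix

    def xent(logits):
        # cross-entropy with matching indices (the diagonal) as positives
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(sim) + xent(sim.T))

def combinatorial_loss(embeds, temperature=0.05):
    """Sum contrastive losses over every pair of disjoint modality combinations.

    For keys {"v", "a", "t", "va", "vt", "at"} this yields six terms:
    (v,a), (v,t), (a,t), (v,at), (a,vt), (t,va).
    Overlapping pairs such as ("v", "va") are skipped.
    """
    total = 0.0
    for k1, k2 in itertools.combinations(embeds, 2):
        if set(k1) & set(k2):          # shared modality letter => not disjoint
            continue
        total += nce_loss(embeds[k1], embeds[k2], temperature)
    return total
```

Training on single modalities and fused pairs simultaneously is what lets a single set of transformer weights serve any subset of input modalities at test time.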

Related research

- 05/27/2020 - AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings
  In this paper, we solve for the problem of generalized zero-shot learnin...

- 07/21/2020 - Multi-modal Transformer for Video Retrieval
  The task of retrieving video content relevant to natural language querie...

- 08/24/2023 - Preserving Modality Structure Improves Multi-Modal Learning
  Self-supervised learning on large-scale multi-modal datasets allows lear...

- 11/21/2022 - Unifying Vision-Language Representation Space with Single-tower Transformer
  Contrastive learning is a form of distance learning that aims to learn i...

- 04/26/2021 - Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
  Multimodal self-supervised learning is getting more and more attention a...

- 12/23/2016 - DeMIAN: Deep Modality Invariant Adversarial Network
  Obtaining common representations from different modalities is important ...

- 09/01/2022 - Zero-Shot Multi-Modal Artist-Controlled Retrieval and Exploration of 3D Object Sets
  When creating 3D content, highly specialized skills are generally needed...
