Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

04/26/2021
by   Brian Chen, et al.

Multimodal self-supervised learning is receiving growing attention because it not only allows large networks to be trained without human supervision but also enables searching and retrieving data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space which, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
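The two ingredients the abstract describes can be illustrated with a minimal NumPy sketch: a symmetric instance-level contrastive (InfoNCE) loss between paired video and text embeddings, plus a k-means clustering step over the joint space whose assignments act as semantic pseudo-labels. This is a simplified illustration of the general idea, not the authors' exact MCN implementation; the function names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.1):
    # Normalize both modalities to the unit sphere.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(v))              # matching pairs lie on the diagonal
    # Symmetric InfoNCE: video-to-text and text-to-video directions.
    log_softmax = lambda x: x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    loss_v2t = -log_softmax(logits)[labels, labels].mean()
    loss_t2v = -log_softmax(logits.T)[labels, labels].mean()
    return (loss_v2t + loss_t2v) / 2

def cluster_pseudo_labels(embeddings, k=4, iters=10, seed=0):
    # Plain k-means over the joint embedding space; cluster assignments
    # serve as cross-modal semantic pseudo-labels (a stand-in for the
    # paper's multimodal clustering step).
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((embeddings[:, None] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = embeddings[assign == j].mean(axis=0)
    return assign
```

In a full training pipeline, the contrastive term pulls matched video/text pairs together while the cluster assignments let semantically similar but unpaired instances share a group, which is what enables zero-shot retrieval across modalities.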

Related research

12/08/2021  Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Multi-modal learning from video data has seen increased attention recent...

01/07/2019  Self-Supervised Learning from Web Data for Multimodal Retrieval
Self-Supervised learning from multimodal image and text data allows deep...

06/29/2020  Self-Supervised MultiModal Versatile Networks
Videos are a rich source of multi-modal supervision. In this work, we le...

03/06/2020  Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
One of the key factors of enabling machine learning models to comprehend...

07/20/2023  Language-based Action Concept Spaces Improve Video Self-Supervised Learning
Recent contrastive language image pre-training has led to learning highl...

03/06/2023  CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
Multimodal contrastive pretraining has been used to train multimodal rep...

02/07/2022  Reasoning for Complex Data through Ensemble-based Self-Supervised Learning
Self-supervised learning deals with problems that have little or no avai...
