DeepAI AI Chat
Log In Sign Up

Self-Supervised MultiModal Versatile Networks

06/29/2020
by   Jean-Baptiste Alayrac, et al.
Google
82

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network – a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.

READ FULL TEXT
04/22/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

We present a framework for learning multimodal representations from unla...
05/23/2017

Look, Listen and Learn

We consider the question: what can be learnt by looking at and listening...
10/31/2020

Multimodal and self-supervised representation learning for automatic gesture recognition in surgical robotics

Self-supervised, multi-modal learning has been successful in holistic re...
06/23/2021

Learning Multimodal VAEs through Mutual Supervision

Multimodal VAEs seek to model the joint distribution over heterogeneous ...
01/11/2023

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

Self-supervised learning in vision-language processing exploits semantic...
06/27/2023

Semi-supervised Multimodal Representation Learning through a Global Workspace

Recent deep learning models can efficiently combine inputs from differen...
04/26/2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning is getting more and more attention a...