DeepAI AI Chat
Log In Sign Up

Self-Supervised MultiModal Versatile Networks

by   Jean-Baptiste Alayrac, et al.

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network – a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.


VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

We present a framework for learning multimodal representations from unla...

Look, Listen and Learn

We consider the question: what can be learnt by looking at and listening...

Multimodal and self-supervised representation learning for automatic gesture recognition in surgical robotics

Self-supervised, multi-modal learning has been successful in holistic re...

Learning Multimodal VAEs through Mutual Supervision

Multimodal VAEs seek to model the joint distribution over heterogeneous ...

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

Self-supervised learning in vision-language processing exploits semantic...

Semi-supervised Multimodal Representation Learning through a Global Workspace

Recent deep learning models can efficiently combine inputs from differen...

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning is getting more and more attention a...