Video Understanding as Machine Translation

06/12/2020 ∙ by Bruno Korbar, et al. ∙ Facebook

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).







1 Introduction

Labeling and curating video understanding datasets is a laborious and expensive process. Most recent video datasets are collected according to one of two possible paradigms: 1) download large amounts of Web videos, and manually label them according to a set of predefined action classes Soomro et al. (2012); Kuehne et al. (2011); Kay et al. (2017); Gu et al. (2018); or 2) draft a set of textual descriptions of actions of interest, and use human actors to act them out Schuldt et al. (2004); Sigurdsson et al. (2016); Goyal et al. (2017). Besides being time-consuming and costly, these approaches are limited to a closed-world understanding of videos. For example, a model trained on an ontology of professional sports classes is unlikely to be useful to recognize everyday kitchen activities. This raises a crucial research question: how can we learn generic video representations that are useful for multiple downstream tasks, without having to label or enact videos with every possible observable action?

Recent work in self-supervised video representation learning attempts to provide a solution. Such approaches typically define pretext tasks that transform the video in a certain way, and train the model to predict that transformation Misra et al. (2016); Wei et al. (2018); Benaim et al. (2020). While these transformations could be defined purely in the video pixel space, several approaches exploit the multimodal nature of videos and propose learning transformations from one modality to the other Arandjelovic and Zisserman (2017); Owens and Efros (2018); Korbar et al. (2018); Ng et al. (2018); Miech et al. (2019). These modalities include optical flow/motion, audio, and more recently, speech transcribed to text using automatic speech recognition (ASR). Recently introduced datasets such as Cooking312K Sun et al. (2019b) and HowTo100M Miech et al. (2019) leverage YouTube videos with associated ASR text to train joint video-and-text embeddings that have significantly improved performance on multiple downstream video and video + language tasks Miech et al. (2019, 2020); Sun et al. (2019a).

Figure 1:

Overview of the proposed architecture. A multi-modal encoder-decoder transformer is trained on HowTo100M to reconstruct perturbed text, and finetuned on several downstream tasks (i.e., classification, question answering, captioning and retrieval). During finetuning, we combine the trained multi-modal encoder with either the trained decoder (for open-ended text generation and ranking) or a simple linear layer (if candidate outputs are provided—e.g., labels for classification).

The typical approach followed in such joint embedding models is that of metric learning. Specifically, models are trained to map different modalities into a common embedding space such that distances between modalities in this space preserve instance information. The embedding is typically optimized using a ranking Chechik et al. (2010) or contrastive Oord et al. (2018) objective on the pairwise distances between modalities. While effective, such metrics are sensitive to the set of negatives used to optimize the model. Easy negatives (e.g., audio-image pairs where the audio is taken from a video sequence unrelated to that used to extract the image features) make the metric learning too easy and do not yield informative gradients. On the other hand, hard negatives (e.g., using overlapping temporal segments for the two modalities) make the optimization overly challenging and lead to underfitting. Curriculum strategies that render the learning increasingly difficult Korbar et al. (2018); Han et al. (2019) have been shown to be beneficial but require the challenging hand-design of time-evolving sampling policies. The second drawback of contrastive approaches is that they rely on large batches of negatives to learn effective representations Chen et al. (2020). However, the batch size is effectively limited by the GPU memory size. Memory banks Wu et al. (2018) and queue-based models He et al. (2020) stretch the limits of contrastive learning with small mini-batches but introduce approximations caused by stale embeddings during learning.

To address these shortcomings we propose VideoTranslate, a fully generative solution to joint embedding learning based on an encoder-decoder architecture. Unlike ranking or contrastive approaches, our method does not require any negative examples, thus eliminating the need for careful design of sampling strategies and making our training amenable to reasonable batch sizes. We frame the task as a machine translation problem from one modality to another, effectively treating the video modality as a different “language” to be translated into text. Specifically, we experiment with the problem of predicting the narration of instructional videos using the recently introduced large-scale HowTo100M dataset Miech et al. (2019). However, instead of optimizing a model to determine whether a given pair of spoken text and video “match,” we train a language generation model to produce the spoken text associated with the video clip. As one can imagine, learning a model that accurately translates video into spoken text is a very hard task, which would render the optimization simply too difficult. To ease the problem, we propose to provide as additional input to our model a “garbled” version of the spoken text. The noise-corrupted text serves as an “anchor” to successfully tackle the problem of speech generation, which would be overly ambiguous without any language context as input.

Our use of noise perturbation on textual input is inspired by recent work Sun et al. (2019b, a) proposing to train visual-linguistic models by “filling-in” the blanks in accompanying text. However, these prior models predict missing tokens independently and thus cannot directly be used for fully-generative text synthesis (e.g., captioning) unless this is reformulated as filling-in the blanks of a partly masked-out output. Unlike encoder-only architectures such as BERT Devlin et al. (2019), our solution enables the learning of an autoregressive decoder trained to generate text from the multimodal embedding computed from the encoder. This renders our approach fully generative and permits its use at test time without auxiliary text input. Moreover, our encoder-decoder architecture can be transferred and finetuned on several downstream tasks, even novel ones, instead of producing features for a separately-trained model.

Our experiments demonstrate that VideoTranslate can be transferred effectively to a variety of video understanding tasks without the need to alter the model or to append discriminative heads (e.g., classifiers). We merely cast each of these tasks as text prediction and use generative finetuning to further adapt the model to each downstream task. While some of the tasks are naturally defined in the language domain (such as question answering or captioning), we show that our system can obtain excellent performance even on classic discriminative-learning problems such as video retrieval or classification. In fact, our method achieves good accuracy on classification and retrieval even without any form of finetuning on the target dataset, demonstrating strong performance in open-world video understanding. In terms of absolute performance, our model with finetuning achieves the best reported accuracy on the captioning benchmarks of YouCook2 and TVC, as well as on the visual question answering task in TVQA.

2 Related Work

Self-supervised video representations.

Inspired by the recent success of self-supervised representation learning in language Radford et al. (2018); Raffel et al. (2019); Peters et al. (2018); Liu et al. (2019) and images He et al. (2020); Misra and van der Maaten (2020); Chen et al. (2020), there has been a growing interest in applying similar techniques to learn video representations. Typical approaches use a pretext task defined by predicting transformations of the video data, such as temporal ordering Misra et al. (2016); Wei et al. (2018); Xu et al. (2019); Lee et al. (2017); Fernando et al. (2017), color or geometric transformations Vondrick et al. (2018); Jing et al. (2018), multiple views, motion/speed, and optical flow Tian et al. (2019); Wang et al. (2019b); Jayaraman and Grauman (2015); Benaim et al. (2020). Although promising, such approaches have still lagged far behind fully supervised video representations Carreira and Zisserman (2017); Tran et al. (2018). Our work, instead, leverages free supervision in the form of synchronized modalities available with videos, specifically speech transcribed using ASR, which we discuss next.

Multi-modal self-supervision.

Videos are generally accompanied by a number of auxiliary data sources which can serve as potent sources of free supervision. The most informative such source is associated metadata, such as tags or titles from social media, which have shown strong transfer performance Ghadiyaram et al. (2019); Li and Wang (2020). In the case of video, there has been a larger focus on using accompanying modalities such as audio Korbar et al. (2018); Owens and Efros (2018); Arandjelovic and Zisserman (2017, 2018); Alwassel et al. (2019) and speech, transcribed to text using ASR Miech et al. (2019); Sun et al. (2019a, b); Miech et al. (2020). Our work is most closely related to the latter thread, with the key difference being our generative translation-based formulation Raffel et al. (2019) compared to the contrastive metric learning used in most prior work Miech et al. (2020); Sun et al. (2019a).

General-purpose language models.

Prior work in general-purpose encoder-based language models Devlin et al. (2019); Liu et al. (2019), pre-trained with self-supervision on large textual corpora, led to strong improvements after finetuning on various downstream classification Wang et al. (2018, 2019a) and ranking Wu et al. (2019); Karpukhin et al. (2020) tasks. More recently, pretrained encoder-decoder models, such as BART Lewis et al. (2019) and T5 Raffel et al. (2019), achieve further improvements on both discriminative and generative tasks. Our work exploits an encoder-decoder architecture, initialized with T5 weights, in order to learn to generate text from multi-modal representations.

3 Technical Approach

We argue that many video-understanding tasks can be reformulated as text-generation problems. For example, classifying an action in a video is nothing more than naming it with words. We use an encoder-decoder architecture as a general solution that enables training with varied supervision (i.e., captions and free-form labels in HowTo100M) and transfer learning to new tasks (e.g., VQA). Our encoder is multi-modal and can be combined during fine-tuning and inference with either (i) the trained decoder for open-ended text generation (e.g., for captioning) or (ii) a simple linear layer if the output space is constrained (e.g., for classification or multiple-choice VQA). An overview of our model and the overall approach can be seen in Figure 1.


3.1 A Unified Framework for Video-To-Text Translation

Given a task τ, a video v, and a target text t corresponding to the ground-truth answer, we train our model Φ to generate the desired text from the video and a noisy version of the text itself, i.e., t̂ = Φ(τ, v, η(t)). The model is trained to minimize a loss L(t̂, t) between the predicted text t̂ and the ground-truth text t, where the noise function η generates a garbled version of the input text that the model is trained to reconstruct. As further discussed below, we use both masking and shuffling as noise perturbations. We discuss each component of the model next.


Our unified video-to-text model is composed of four main building blocks: a video feature extractor F, a text embedding G, a multimodal encoder E, and a text decoder D. The video feature extractor is implemented as an R(2+1)D-34 network Tran et al. (2018) pre-trained on the IG65M dataset Ghadiyaram et al. (2019). It generates a latent video representation, which we spatially average-pool and pass through a linear layer to transform the embedding dimension to d, such that it matches the embedding dimension of the text features discussed next. We refer to this video representation as F(v). The text embedding is implemented via a lookup table for words of dimensions V × d, where V is the vocabulary size (30K in all our experiments). We encode the task, the noise-perturbed text, and a special [CLS] token using this embedding model, generating embedding tensors of dimensions L_task × d, L_text × d, and 1 × d, respectively. Here L_task and L_text are the maximum number of tokens used to represent the task and text, respectively. Next, the multimodal encoder E is implemented as the encoder part of a T5 model Raffel et al. (2019), based on the Transformer architecture Vaswani et al. (2017) and pre-trained on the “Colossal Clean Crawled Corpus” (C4) Raffel et al. (2019). We concatenate the video features with the text, task and [CLS] embeddings from the embedding model above and pass them through the multimodal encoder. The result is a tensor that incorporates the interaction of text, task and video features using self-attention, and provides d-dimensional features for each input token. Finally, we pass this resulting feature tensor to the text decoder D. The text decoder is implemented using the T5 model decoder, also pretrained on C4. It takes as input the encoded features and outputs a tensor of size L_text × V, which defines probability distributions over the words in the dictionary; these can be used to either sample or evaluate candidate text answers. Note that our model is not limited to operating only on fixed-length strings: while L_text is the maximum number of tokens supported, we pad shorter strings with dummy tokens, and both our text encoder and decoder ignore the padded tokens when encoding or generating the output. The noise function used above is implemented by shuffling the input tokens and masking out n% of them, where n is randomly sampled from a uniform distribution between 0 and 100.
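As a concrete illustration, the noise function can be sketched in a few lines (a toy stand-in operating on a token list rather than subword ids; the [MASK] symbol is a placeholder of our choosing):

```python
import random

def garble(tokens, rng=None):
    """Toy sketch of the noise perturbation: shuffle the input tokens,
    then mask out n% of them, with n drawn uniformly from [0, 100]."""
    rng = rng or random.Random(0)
    tokens = list(tokens)                    # work on a copy
    rng.shuffle(tokens)                      # shuffling perturbation
    pct = rng.uniform(0, 100)                # per-sample masking ratio
    n_mask = round(len(tokens) * pct / 100)
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = "[MASK]"                 # masking perturbation
    return tokens
```

The model receives this garbled sequence as text input and is trained to reconstruct the original order and the masked words.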

We note that our model design is surprisingly straightforward. Indeed, the primary contribution of this work is to show that video can be treated just as another “language” by directly feeding CNN features extracted from it into an unmodified encoder-decoder architecture originally designed to address a variety of NLP tasks. Despite its simplicity, our experiments demonstrate that this unified vision/NLP framework can outperform specialized designs on a variety of downstream video tasks.


We optimize all the parameters of the model to minimize the loss, defined as the mean cross-entropy of the word distributions predicted by the model. Our framework of video-to-text translation is flexible enough to allow us to train the model with multiple tasks concurrently without having to add a specialized network head for each task. Our strategy makes it possible to devote the entire capacity of the encoder-decoder to all tasks, without facing the typical multi-task dilemma of where to split the shared trunk into task-specific heads. Inference for all tasks is done through the same exact encoder-decoder model, with different task inputs triggering different executions through the network for the same video input.
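For concreteness, the loss can be sketched as follows (a toy, list-based stand-in for the batched tensor computation; positions holding padding tokens, which the model ignores, are assumed to be filtered out beforehand):

```python
import math

def mean_cross_entropy(pred_dists, target_ids):
    """Mean cross-entropy between predicted per-position word
    distributions and ground-truth token ids. `pred_dists` is a list of
    probability distributions over the vocabulary, one per output
    position; `target_ids` gives the index of the correct word at each
    position."""
    losses = [-math.log(dist[tok]) for dist, tok in zip(pred_dists, target_ids)]
    return sum(losses) / len(losses)
```

With a two-word vocabulary and two output positions, e.g., mean_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]) ≈ 0.49.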

The primary training of our model is done on the weakly-labelled HowTo100M dataset Miech et al. (2019), which includes 136M clips with accompanying textual narrations automatically extracted from the audio channel using ASR. Each clip also comes with a textual description of the activity performed in the video. The textual description corresponds to one of 23K total categories, labeled automatically using the clip metadata, and is about 8 tokens long on average. We represent narration and description with fixed maximum numbers of tokens, clipping any longer texts to these lengths. We leverage both types of annotations by performing joint multi-task training, forcing the network to predict both the activity (using CLASSIFY as the task to trigger a classification inference) as well as the narration (using CAPTION as the task to trigger a captioning inference) associated with each clip in the mini-batch. For added training efficiency, we re-use each mini-batch of examples twice in a row, performing back-to-back updates first with respect to the narration and then to the description. This is done by simply updating the task and the corresponding target text between the two updates.

Note that our formulation of “video understanding as machine translation” makes it possible to use the model resulting from this training directly on different downstream tasks without any further finetuning, as the text generated by the model is a flexible output representation: e.g., it can be used to perform classification with “unseen” class labels without modification or retraining. As shown in our experiments, our model used in this zero-shot setting already achieves good accuracy on many of our downstream tasks. This is a challenging feat considering the domain gap that separates HowTo100M from the downstream datasets (e.g., EpicKitchens contains only first-person view videos recorded in kitchens, which look substantially different from the varied instructional videos of HowTo100M).


Finetuning on the downstream tasks further elevates the accuracy of our model. We experiment with two finetuning schemes. The first is generative finetuning, which consists of minimizing the cross-entropy loss on the generated text, using as ground truth the labels of the downstream task in textual form. As discussed in the experiments, this also makes it possible to use multiple sources of supervision when different types of annotations are available for the downstream dataset. For example, EpicKitchens includes verb and noun class labels, as well as action descriptions, which we can leverage simultaneously in our unified multi-task generative setting. For classification tasks, we also experiment with appending and training a linear layer as a classification “head” on top of the multimodal encoder embedding computed for the [CLS] token. We refer to this strategy as discriminative finetuning.


In the case of discriminative finetuning, at test time we use the output of the linear layer as the prediction. This is typically done for CLASSIFY tasks. In the case of generative finetuning we consider two alternatives. When no candidate outputs are provided (e.g., for captioning) we decode open-ended text Massarelli et al. (2019). When candidates are provided (e.g., multiple-choice VQA, retrieval) we follow Nogueira et al. (2020) and rank the possible choices in textual form (i.e., candidate answers for multiple-choice VQA, candidate class labels for classification) according to the decoder logits for these candidate outputs.
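A heavily simplified sketch of this candidate-ranking step (for illustration we score each candidate with a fixed table of per-token log-probabilities, whereas the real decoder conditions each token on the tokens generated so far):

```python
def score_candidate(token_logprob, candidate_tokens):
    """Log-probability the (simplified) decoder assigns to a candidate:
    the sum of its per-token log-probabilities."""
    return sum(token_logprob[tok] for tok in candidate_tokens)

def rank_candidates(token_logprob, candidates):
    """Order textual candidates by decoder score, best first."""
    return sorted(candidates,
                  key=lambda c: score_candidate(token_logprob, c),
                  reverse=True)
```

Note that summed log-probabilities penalize longer candidates; real systems often length-normalize the score, a detail omitted here.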

4 Experiments

In this section, we experimentally evaluate our model. We start with an overview of the implementation used in our experiments, followed by a discussion of the downstream tasks and results.

4.1 Implementation details

Model configuration.

In the appendix, we present a comprehensive evaluation reporting the effect of different design choices in our model on downstream performance. Here we summarize the best configuration drawn from that empirical study. The input to our video feature extractor is a sequence of 32 frames uniformly spaced out in the video clip. We apply both masking and shuffling to the input text during training; the amount of masking is randomly chosen from a uniform distribution between 0% and 100% for each mini-batch. We generatively train our model to perform both captioning and classification on HowTo100M, as this results in better performance on all downstream tasks compared to single-task training. The video feature extractor (R(2+1)D) and the T5 encoder-decoder are jointly trained on HowTo100M, starting from their pretrained versions on IG65M and C4, respectively, in order to reduce the cost of training. While we envision that it may be possible to learn both our video network and the T5 model from scratch using HowTo100M data, we reserve this for future investigation. For our encoder-decoder, we use the t5-large model with 770M parameters.

Training hyper-parameters.

Our model is trained on the HowTo100M dataset using Adam with two distinct parameter groups, one for the video feature extractor and one for the multi-modal transformer, each with its own base learning rate. The model is trained for 250K iterations using the inverse square-root schedule, following the procedure from Raffel et al. (2019), with an initial warm-up phase. We set a constant batch size of 8 examples per GPU, and distribute the training over 128 NVIDIA V100 GPUs. For specifics of video and text pre-processing please refer to the appendix.

4.2 Experimental results on downstream tasks

Validation Set Test Set (S1) Test Set (S2)
Method Finetuning Verb Noun Action Verb Noun Action Verb Noun Action
R(2+1)D-34 yes 56.0 34.8 24.9
R(2+1)D-152 Ghadiyaram et al. (2019) yes 57.3 35.7 25.6 65.2 45.1 34.5 58.4 36.9 26.1
GBlend Wang et al. (2019c) yes 59.2 36.1 25.6 66.7 48.5 37.1 58.3 36.7 26.6
BAIDU Wang et al. (2019d) yes 63.2 39.1 29.0 69.8 53.3 41.4 59.7 34.2 25.1
VideoTranslate (ours) no 53.3 34.0 24.6 57.8 35.1 24.9 52.6 34.1 24.8
VideoTranslate (ours) yes: gen 56.3 35.6 25.7 63.8 44.0 34.0 57.2 36.2 26.3
VideoTranslate+gen (ours) yes: gen-MT 58.0 36.5 26.0 65.4 46.0 37.0 58.5 36.9 26.8
VideoTranslate (ours) yes: discr 59.4 37.8 27.8 66.1 48.4 37.2 58.6 37.0 26.9
Table 1: Video classification: comparison to the state-of-the-art on EPIC-Kitchens.

4.2.1 Video Classification

We evaluate the classification performance of our model on EPIC-Kitchens Damen et al. (2018), an egocentric video classification dataset with about 28K training video clips, each labeled with one of 352 nouns and one of 125 verbs. We report accuracy on the separate noun and verb classification tasks, as well as on the combined (noun, verb) classification (known as action classification). We measure performance on the validation set as well as on both splits of the test set (using the online evaluation server), corresponding to seen (S1) and unseen (S2) kitchens. We chose EPIC-Kitchens as the classification test bed for our method since it requires recognizing human-object interactions akin to those found in HowTo100M, unlike other video classification datasets which focus exclusively on humans. At the same time, the ego-centric aspect and the restriction to kitchen settings make this a good benchmark to assess the generalization ability of our model. We present top-1 results obtained with several variants of VideoTranslate in Table 1 (see the appendix for top-5 numbers). It can be noted that VideoTranslate already provides decent accuracy without any form of training on EPIC-Kitchens. Generative finetuning effectively finetunes our model to generate the EPIC-Kitchens labels as captions. This yields a significant boost in accuracy over the results without finetuning: 3.0%, 1.6%, and 1.1% on verb, noun, and action, respectively, on the validation set. We also note that the performance of VideoTranslate with generative finetuning is already superior to that achieved by R(2+1)D-34 (i.e., our video feature extractor) pretrained on IG-65M and discriminatively finetuned on EPIC-Kitchens. Thus, this already shows the added value of our modeling approach.
Since EPIC-Kitchens videos come with longer action descriptions in the form of grammatically correct sentences, we also experiment with a multi-task version of generative finetuning (gen-MT), where we finetune VideoTranslate to generate both captions and class labels (using two different prompts). At test time, we first generate a caption from the video, and then feed the predicted caption as additional text input to VideoTranslate when generating class labels. This yields a substantial additional gain. Finally, we also present accuracies obtained by finetuning VideoTranslate discriminatively. This produces the best results overall for our model: compared to R(2+1)D-34, the gain on the validation set is 3.4%, 3.0%, and 2.9% on verb, noun, and action, respectively. Notably, this variant of our model achieves the best reported numbers for noun and action on the unseen-kitchens test split (S2), which is indicative of the generalization ability of VideoTranslate.

4.2.2 Captioning

We measure the captioning performance of our system on YouCook2 Zhou et al. (2018), TVC Lei et al. (2020), and MSR-VTT Xu et al. (2016). YouCook2 is a cooking video dataset from YouTube with 14K video clips and associated textual descriptions. MSR-VTT contains 200K generic video clips and associated captions. TVC consists of 4198 TV-show videos, with a caption and a subtitle in English for each clip. For TVC, we present results both with and without using subtitles as auxiliary input to the encoder-decoder. For all of these benchmarks we measure performance in terms of BLEU4 score on the standard dataset splits defined by the authors. The captions for YouCook2 and MSR-VTT are generated using a top-k generation strategy. The captions for TVC are generated using the greedy decoding strategy outlined in Lei et al. (2020).
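For readers unfamiliar with top-k generation, one sampling step can be sketched as follows (a generic illustration of the technique, not the paper's exact implementation):

```python
import math
import random

def top_k_sample(logits, k, rng=None):
    """Sample one token id from the k highest-scoring logits, after
    restricting the softmax to that support."""
    rng = rng or random.Random(0)
    # indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i]) for i in top]  # unnormalized softmax terms
    return rng.choices(top, weights=weights)[0]   # choices renormalizes
```

Repeating this step, feeding each sampled token back into the decoder, produces the full caption.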

The results are summarized in Tables 2 and 3. Without any form of finetuning on the downstream dataset, VideoTranslate already achieves good accuracy. With generative finetuning, VideoTranslate outperforms all previous models on both YouCook2 and TVC, and it achieves performance approaching the best reported results on MSR-VTT.

Table 4 provides a few qualitative captioning examples.

YouCook2 MSR-VTT
Method Finetuning BLEU4 METEOR BLEU4 METEOR
VideoBERT Sun et al. (2019b) yes 4.3 11.9 - -
CBT Sun et al. (2019a) yes 5.1 13.0 - -
ORG Zhang et al. (2020) yes - - 43.6 28.8
VideoTranslate (ours) no 3.0 7.4 21.2 12.9
VideoTranslate (ours) yes: gen 5.3 13.4 41.7 28.5
Table 2: Captioning results on YouCook2 and MSR-VTT.
Method Subtitles BLEU4 METEOR
MMT Lei et al. (2020) yes 10.53 16.61
VideoTranslate (ours) no 9.70 15.31
VideoTranslate (ours) yes 11.26 16.97
Table 3: Captioning results on TVC. All models are finetuned.
Decoder out                      Gold
washing the the vegetables       still stirring vegetables
cut t ing with the the knife     take knife
put t ing the the kettle down    put down counter
put away the the sponge          place sponge away
Table 4: Examples of captions generated by VideoTranslate for EPIC-Kitchens videos (video input frames not shown).
YouCook2 MSR-VTT
Method Finetuning R@1 R@10 R@1 R@10
Miech et al. (2019) no 6.1 24.8 7.5 29.6
Miech et al. (2019) yes 8.2 35.3 14.9 52.8
Miech et al. (2020) no 15.1 51.2 9.9 32.4
VideoTranslate (ours) no 8.4 37.0 12.2 38.8
VideoTranslate (ours) yes: gen 11.6 43.9 14.7 52.8
Table 5: Text-based video retrieval on YouCook2 and MSR-VTT.

4.2.3 Text-based Retrieval

To assess the ability of our system to retrieve clips given text queries, we again use YouCook2 as well as MSR-VTT. We use the dataset splits and the exact same evaluation protocol defined in Miech et al. (2019), where performance is measured as recall@K (R@K). We reformulate retrieval as a generative task suitable for our model by evaluating the text queries under the probability distribution generated for each candidate video clip. This allows us to evaluate our model directly, without any form of finetuning. Table 5 compares our model against the methods presented in Miech et al. (2019) and Miech et al. (2020). We observe that VideoTranslate without any form of finetuning is already competitive, despite not being trained on these datasets or even this task. Conversely, we note that the training proposed in Miech et al. (2019, 2020) directly optimizes the models to discriminate between matching and non-matching text-video pairs, which is effectively the task considered here. Generative finetuning of VideoTranslate (in the form of captioning) on these downstream datasets further elevates the performance of our approach, despite not directly optimizing our model for the metric considered on these benchmarks.

4.2.4 Video Question Answering

We use the TVQA Lei et al. (2018) benchmark to assess question answering performance. TVQA is defined over the same set of videos as TVC, and contains several multiple-choice question-answer pairs per video. We use the validation sets defined by the authors to report performance. Note that, since VQA annotations are not available in HowTo100M, we introduce a new prompt denoting this task during finetuning on TVQA, passing the question as textual input to our model. In order to provide additional context to the model, we additionally experiment with concatenating the question with subtitles Petroni et al. (2020). Table 6 compares the results achieved by our approach against those of the state-of-the-art on this benchmark. Generative finetuning of VideoTranslate already yields better results than those reported in prior work. Note that we use beam search with beam size 5 for generation at test time.

Since each question involves a fixed set of answers, we also consider discriminative finetuning of our model, feeding both the question and each candidate answer as text input and training a linear layer on top of the resulting embedding obtained for the [CLS] token. As expected, this improves the results of VideoTranslate over those obtained with generative finetuning. Finally, we also consider fusing the predictions of the generative and the discriminative versions of our model. This is done by applying a softmax nonlinearity over the fixed set of answers to the scores obtained with the generative model. This yields a probability distribution over answers that can be averaged with that of the discriminative version of our model. As shown in Table 6, this fusion further elevates the accuracy, producing a large gain of 13.62% in top-1 accuracy on the test set compared to the best reported number in the literature for the setting involving no subtitles.
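The fusion step can be sketched in a few lines (a minimal sketch; the equal-weight average follows the description above):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(generative_scores, discriminative_probs):
    """Turn the generative model's per-answer scores into a probability
    distribution with a softmax, then average it with the discriminative
    model's per-answer probabilities."""
    gen_probs = softmax(generative_scores)
    return [(g + d) / 2 for g, d in zip(gen_probs, discriminative_probs)]
```

The fused scores remain a valid probability distribution over the candidate answers, and the answer with the highest fused probability is returned.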

Method Finetuning Subtitle as input Val Test*
TVQA Lei et al. (2018) yes no 45.03 -
BERT QA Yang et al. (2020) yes no 48.95 49.23
VideoTranslate (ours) yes:gen no 58.72 58.39
VideoTranslate (ours) yes:discr no 61.09 60.55
VideoTranslate (ours) yes: gen+discr no 63.01 62.84
TVQA Lei et al. (2018) yes yes 67.70 -
BERT QA Yang et al. (2020) yes yes 72.41 72.23
STAGE Lei et al. (2019) yes yes 70.50 -
VideoTranslate (ours) yes:gen yes 73.51 73.45
VideoTranslate (ours) yes:discr yes 75.19 75.01
VideoTranslate (ours) gen+discr yes: gen+discr yes 76.38 76.22
Table 6: Visual Question Answering results on the TVQA dataset.
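The score fusion described above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the equal-weight average is an assumption (the text only says the two distributions are averaged):

```python
import numpy as np

def fuse_answer_scores(gen_scores, discr_probs):
    """Fuse generative answer scores with discriminative probabilities.

    gen_scores: unnormalized scores (e.g., sequence log-likelihoods) that
    the generative model assigns to each candidate answer.
    discr_probs: probability distribution over the same answers produced
    by the discriminative head.
    """
    # Softmax over the fixed answer set turns the generative scores into
    # a probability distribution, as described in the text.
    e = np.exp(gen_scores - np.max(gen_scores))
    gen_probs = e / e.sum()
    # Simple average of the two distributions; argmax gives the answer.
    return 0.5 * (gen_probs + discr_probs)

# Five candidate answers for one question (illustrative values).
gen_scores = np.array([-3.2, -1.1, -4.0, -2.5, -0.9])
discr_probs = np.array([0.05, 0.30, 0.05, 0.10, 0.50])
fused = fuse_answer_scores(gen_scores, discr_probs)
```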

5 Conclusions

In this work we present an approach that formulates video understanding as machine translation. We argue that this provides several advantages. First, our fully-generative approach bypasses the problem of easy/hard negative sampling that mars contrastive learning methods. Second, it allows us to devote our entire encoder-decoder architecture to addressing multiple tasks simultaneously, without the need to design task-specific heads. Finally, casting video understanding as open-ended text generation enables strong transfer performance even without finetuning. Software and models described in this work will be made available upon publication.

Broader Impact

The broader impact of this work falls predominantly in the application areas of video understanding systems, such as action recognition, multimedia search, and human-computer interaction. The authors do not foresee major ethical issues associated with this work. As with most machine learning systems, our approach is susceptible to biases present in the distribution of the data. This is particularly true for self-supervised approaches such as ours. Since most videos used in our work originate in English-speaking, Western regions of the world, we anticipate our learned representations to be most effective for applications in these geographic areas. However, as future work we plan to use an internationalized version of the training set Sigurdsson et al. (2020), which should help assuage such biases.

Acknowledgments

The authors would like to thank Patrick Lewis for discussions, and Shubho Sengupta for help with the infrastructure and debugging.


  • [1] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2.
  • [2] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In ICCV, Cited by: §1, §2.
  • [3] R. Arandjelovic and A. Zisserman (2018) Objects that sound. In ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cited by: §2.
  • [4] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel (2020) SpeedNet: learning the speediness in videos. In CVPR, Cited by: §1, §2.
  • [5] J. Carreira and A. Zisserman (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, Cited by: §2.
  • [6] G. Chechik, V. Sharma, U. Shalit, and S. Bengio (2010) Large scale online learning of image similarity through ranking. JMLR. Cited by: §1.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §2.
  • [8] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In European Conference on Computer Vision (ECCV), Cited by: Appendix B, §4.2.1.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.
  • [10] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR, Cited by: §2.
  • [11] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, Cited by: §C.2, §C.2, Table 8, §2, §3.1, Table 1.
  • [12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic (2017) The "something something" video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5843–5851. Cited by: §1.
  • [13] C. Gu, C. Sun, D. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, Cited by: §1.
  • [14] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §1, §2.
  • [16] D. Jayaraman and K. Grauman (2015) Learning image representations tied to ego-motion. In ICCV, Cited by: §2.
  • [17] L. Jing, X. Yang, J. Liu, and Y. Tian (2018) Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387. Cited by: §2.
  • [18] V. Karpukhin, B. Oguz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. External Links: Link Cited by: §2.
  • [19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1.
  • [20] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, Cited by: §1, §1, §2.
  • [21] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV, Cited by: §1.
  • [22] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, Cited by: §2.
  • [23] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: localized, compositional video question answering. In EMNLP, Cited by: §4.2.4, Table 6.
  • [24] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2019) TVQA+: spatio-temporal grounding for video question answering. External Links: 1904.11574 Cited by: Table 6.
  • [25] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. External Links: 2001.09099 Cited by: §4.2.2, Table 3.
  • [26] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. External Links: Link Cited by: §2.
  • [27] T. Li and L. Wang (2020) Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691. Cited by: §2.
  • [28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2, §2.
  • [29] L. Massarelli, F. Petroni, A. Piktus, M. Ott, T. Rocktäschel, V. Plachouras, F. Silvestri, and S. Riedel (2019) How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587. Cited by: §3.1.
  • [30] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In CVPR, Cited by: §1, §2, §4.2.3, Table 5.
  • [31] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, Cited by: Appendix B, §C.4, §1, §1, §2, §3.1, §4.2.3, Table 5.
  • [32] I. Misra and L. van der Maaten (2020) Self-supervised learning of pretext-invariant representations. In CVPR, Cited by: §2.
  • [33] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, Cited by: §1, §2.
  • [34] J. Y. Ng, J. Choi, J. Neumann, and L. S. Davis (2018) Actionflownet: learning motion representation for action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1616–1624. Cited by: §1.
  • [35] R. Nogueira, Z. Jiang, and J. Lin (2020) Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713. Cited by: §3.1.
  • [36] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1.
  • [37] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In ECCV, Cited by: §1, §2.
  • [38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §2.
  • [39] F. Petroni, P. Lewis, A. Piktus, T. Rocktäschel, Y. Wu, A. H. Miller, and S. Riedel (2020) How context affects language models’ factual predictions. arXiv preprint arXiv:2005.04611. Cited by: §4.2.4.
  • [40] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.
  • [41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints. External Links: 1910.10683, Link Cited by: §A.2, Appendix B, §C.1, §2, §2, §2, §3.1, §4.1.
  • [42] C. Schuldt, I. Laptev, and B. Caputo (2004) Recognizing human actions: A local svm approach. In ICPR, Cited by: §1.
  • [43] G. A. Sigurdsson, J. Alayrac, A. Nematzadeh, L. Smaira, M. Malinowski, J. Carreira, P. Blunsom, and A. Zisserman (2020) Visual grounding in video for unsupervised word translation. In CVPR, Cited by: Broader Impact.
  • [44] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, Cited by: §1.
  • [45] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01. Cited by: §1.
  • [46] C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §1, §1, §2, Table 2.
  • [47] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: A joint model for video and language representation learning. In ICCV, Cited by: §1, §1, §2, Table 2.
  • [48] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.
  • [49] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: Appendix B, §C.2, §2, §3.1.
  • [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §3.1.
  • [51] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV, Cited by: §2.
  • [52] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275. External Links: Link Cited by: §2.
  • [53] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pp. 353–355. External Links: Link Cited by: §2.
  • [54] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, Cited by: §2.
  • [55] W. Wang, D. Tran, and M. Feiszli (2019) What makes training multi-modal networks hard?. CoRR abs/1905.12681. External Links: Link Cited by: Table 8, Table 1.
  • [56] X. Wang, Y. Wu, L. Zhu, and Y. Yang (2019) Baidu-uts submission to the epic-kitchens action recognition challenge 2019. CoRR abs/1906.09383. External Links: Link Cited by: Table 8, Table 1.
  • [57] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In CVPR, Cited by: §1, §2.
  • [58] L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2019) Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814. Cited by: §2.
  • [59] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1.
  • [60] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, Cited by: §2.
  • [61] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 5288–5296. Cited by: §4.2.2.
  • [62] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura (2020) BERT representations for video question answering. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pp. 1545–1554. Cited by: Table 6.
  • [63] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha (2020) Object relational graph with teacher-recommended learning for video captioning. CoRR abs/2002.11566. External Links: Link Cited by: Table 2.
  • [64] L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 7590–7598. Cited by: Appendix B, §4.2.2.


Appendix A Input pre-processing

In this section, we discuss the pre-processing details for video and text.

A.1 Video pre-processing

Our video network takes as input RGB frames of size . We apply standard data augmentation transformations (multi-scale random crop, random horizontal flip, and Z normalization) to all input videos at training time. At inference time, we do not flip the videos and use a centre crop as opposed to the multi-scale random crop.
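The train/test transform split above can be sketched as follows. This is a simplified NumPy sketch (a single-scale random crop stands in for the multi-scale crop; function names are ours):

```python
import numpy as np

def center_crop(frame, size):
    """Deterministic center crop, used at inference time."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def train_transform(frame, size, mean, std, rng=np.random):
    """Training-time augmentation sketch: random crop (single-scale here,
    simplifying the multi-scale crop), random horizontal flip, then
    Z normalization (subtract mean, divide by std)."""
    h, w = frame.shape[:2]
    top = rng.randint(0, h - size + 1)
    left = rng.randint(0, w - size + 1)
    crop = frame[top:top + size, left:left + size]
    if rng.rand() < 0.5:
        crop = crop[:, ::-1]  # horizontal flip
    return (crop - mean) / std
```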

Since the datasets considered in our experiments are encoded at different frames-per-second (FPS) and have varying lengths, we adopt dataset-specific strategies to form the video input.

HowTo100M, YouCook2, MSR-VTT:

We do not apply any normalization to the videos themselves. Before loading an annotated segment into memory, we check whether it is longer than 6 seconds; if so, we randomly select a 6-second window within the segment to prevent out-of-memory errors during decoding.
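The 6-second capping can be sketched as follows (a minimal sketch with our own function name):

```python
import random

def cap_segment(start, end, max_len=6.0, rng=random):
    """If an annotated segment exceeds max_len seconds, pick a random
    max_len-second window inside it; otherwise return it unchanged.
    Times are in seconds."""
    if end - start <= max_len:
        return start, end
    new_start = start + rng.uniform(0.0, (end - start) - max_len)
    return new_start, new_start + max_len
```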

TVC, TVQA:

These datasets share the same videos, provided as a series of frames extracted at 3 frames per second. In order to form the video input, we load the frames of a sequence into memory, repeating each frame twice (making the frame rate effectively 6 FPS).


EPIC-Kitchens:

This dataset is given as a series of high-resolution (up to pixels on the shorter side), high frame-rate (up to 90 FPS) GoPro videos. In order to reduce the decoding cost, we re-sample each video to 24 FPS at the lowest common resolution of pixels.

A.2 Text pre-processing

We tokenize the text input following exactly the procedure from Raffel et al. (2019). We prepend the <CLS> token to the input and append the <SEP> token to the end of the masked text input, before any additional context if one is used. Note that we do not remove stop words from the HowTo100M annotations.
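The input assembly can be sketched as follows (a minimal sketch; the function name and list-of-tokens representation are our assumptions):

```python
def build_text_input(masked_tokens, context_tokens=None,
                     cls="<CLS>", sep="<SEP>"):
    """Assemble the text input: <CLS> + masked text + <SEP>, followed by
    any additional context (e.g., subtitles), if provided."""
    tokens = [cls] + list(masked_tokens) + [sep]
    if context_tokens:
        tokens += list(context_tokens)
    return tokens
```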

Appendix B Study of design choices

Here we report the results of an empirical study aimed at determining the effects of different design choices on downstream performance. In order to reduce the computational cost of this broad evaluation, we perform training using the smaller t5-small transformer on a subset of HowTo100M consisting of 200k randomly sampled videos. All models are trained for 240k iterations. We measure effects on the downstream tasks of EPIC Kitchens Damen et al. (2018) (EK) classification (top-1 accuracy %) and YouCook2 Zhou et al. (2018) retrieval (R@10 %), without any fine-tuning on these benchmarks (i.e., in a zero-shot setting). Table 7 provides the quantitative results of this study which we summarize next.

Video input. We experimented with 32-frame inputs obtained by either (a) sampling the frames with uniform spacing so as to span the entire clip or (b) sampling 32 consecutive frames at 16 FPS from a random starting time within the clip. The former option yields better results (a gain of 1.1% in accuracy on EK and of 3.4% in R@10 on YouCook2).
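The two sampling strategies can be sketched as follows (a minimal sketch; the 30 FPS clip rate in strategy (b) is an illustrative assumption):

```python
import numpy as np

def uniform_indices(num_frames, out_frames=32):
    """Strategy (a): out_frames indices uniformly spaced over the clip."""
    return np.linspace(0, num_frames - 1, out_frames).round().astype(int)

def consecutive_indices(num_frames, out_frames=32, clip_fps=30,
                        sample_fps=16, rng=np.random):
    """Strategy (b): out_frames consecutive frames read at sample_fps,
    starting from a random point in the clip."""
    step = max(1, round(clip_fps / sample_fps))
    span = (out_frames - 1) * step
    start = rng.randint(0, max(1, num_frames - span))
    return start + step * np.arange(out_frames)
```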

Video feature pooling. We found that applying temporal pooling to the video features degrades downstream performance by 2.5% on EK and 1.4% on YouCook2.

Input text perturbations. As mentioned, training a model to generate text from video alone is a very hard task. Conversely, simply providing the associated narration as input reduces target generation to an identity mapping. Thus, during training we present the model with perturbed text. Following Raffel et al. (2019), we evaluate masking and shuffling as text perturbations. Training with video features only yields low performance (0.9% and 2.3% on EK and YouCook2, respectively), and adding unmodified text significantly eases the task but does not improve performance on the downstream tasks (1.3% and 8.9%). Training with perturbed text gives a boost in downstream performance compared to training with full text or no text at all: shuffling the input text yields 8.6% and 11.4% on EK and YouCook2, and masking it further boosts the results to 11.9% and 17.2%. Finally, we found that training with both forms of text perturbation yields the best downstream performance (14.3% and 19.3%).
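The two perturbations can be sketched as follows (a minimal sketch: the mask token name and masking probability are illustrative; the actual procedure follows Raffel et al. (2019)):

```python
import random

MASK = "<mask>"  # placeholder mask token, illustrative only

def perturb_text(tokens, mask_prob=0.15, shuffle=True, rng=random):
    """Shuffle and/or mask the narration tokens so that reproducing the
    target narration is no longer an identity mapping."""
    out = list(tokens)
    if shuffle:
        rng.shuffle(out)  # in-place random reordering
    return [MASK if rng.random() < mask_prob else t for t in out]
```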

Multi-task training. We found that training our model to perform both captioning and classification on HowTo100M Miech et al. (2019) results in better downstream accuracy compared to single-task training (+2.2% on EK and +8.1% on YouCook2).

Training with frozen video embeddings. Due to the computational costs of end-to-end training of video models, most approaches use fixed video or image features as an input. We found that training the video encoder jointly with the transformer is crucial in our model. Training the transformer with a frozen R(2+1)D Tran et al. (2018) model fails to converge and produces consistently worse performance on all tasks (0.5% on EK and 3.2% on YouCook2).

Transformer size. We found that performance on the downstream tasks improves as we increase the number of transformer parameters. Due to memory constraints, the largest transformer we use is a t5-large model with 770M parameters, which outperforms the identical setup with the t5-base model (220M parameters) on every task.

Video input | YouCook2 R@10 | EK top-1
32 frames uniformly spaced | 19.3 | 14.3
32 consecutive frames | 15.9 | 13.2
(a) Video input.

Temporal pooling | YouCook2 R@10 | EK top-1
Yes | 17.9 | 11.8
No | 19.3 | 14.3
(b) Video feature pooling.

Text input | YouCook2 R@10 | EK top-1
None | 2.3 | 0.9
Unmodified | 8.9 | 1.3
Shuffled | 11.4 | 8.6
Masked | 17.2 | 11.9
Shuffled & Masked | 19.3 | 14.3
(c) Text perturbation ablation.

Training task | YouCook2 R@10 | EK top-1
Captioning | 9.4 | 9.8
Classification | 11.2 | 12.1
Multi-task | 19.3 | 14.3
(d) Training task.

Video embedding | YouCook2 R@10 | EK top-1
Fixed | 3.2 | 0.5
Finetune last layer only | 18.2 | 11.9
Finetune all | 19.3 | 14.3
(e) Video embedding.

Size (params) | YouCook2 R@10 | EK top-1
Small (60M) | 19.3 | 14.3
Base (220M) | 19.5 | 15.1
Large (770M) | 19.9 | 16.7
(f) Transformer size.

Table 7: Effects of design choices on downstream tasks (YouCook2 R@10 %, EPIC-Kitchens top-1 accuracy %).

Appendix C Additional experimental details

Here we present task-specific hyper-parameters, details about dataset splits, and additional training information. In general, all learning rates (LR) are scaled according to the total number of GPUs in a distributed training run.
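The learning-rate handling above can be sketched as follows. This is a minimal sketch: the linear scaling rule is our assumption (the text only says LRs are scaled with the number of GPUs), and the warm-up is the linear ramp mentioned in the following subsections:

```python
def scaled_lr(base_lr, num_gpus, base_gpus=1):
    """Scale the base LR with the total number of GPUs in the run
    (assuming a linear scaling rule)."""
    return base_lr * num_gpus / base_gpus

def warmup_lr(step, warmup_steps, target_lr):
    """Linear warm-up: ramp the LR from 0 to target_lr over
    warmup_steps iterations, then hold it constant."""
    return target_lr * min(1.0, step / warmup_steps)
```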

C.1 General hyper-parameters

For our discriminative finetuning, we apply LR decay at given intervals. We use the equivalent of 1 epoch of iterations as a linear warm-up wherever we use multiple nodes for finetuning.

For generative finetuning we largely follow the procedure in Raffel et al. (2019), i.e., we finetune the entire encoder-decoder model for steps with a constant learning rate. The only deviations from the procedure described in Raffel et al. (2019) are: 1) using iterations as a linear warm-up and 2) setting the finetuning learning rate to .

C.2 Action classification - EPIC-Kitchens

Dataset splits.

We optimize our hyper-parameters on the same validation set of unseen kitchens used in Ghadiyaram et al. (2019): videos by persons 01 to 25 ( segments) are used for training, and the remaining ones are used for validation ().

Finetuning parameters - baseline model.

For the finetuning of the baseline R(2+1)D-34 Tran et al. (2018) model we use the same hyper-parameters as in Ghadiyaram et al. (2019): the total training amounts to 27 epochs with a batch size of 6 clips per GPU. The base learning rate of is decayed every 9 epochs. We found that using 1 epoch for a linear warm-up helps with the consistency of the experiments.

Finetuning parameters - VideoTranslate.

For discriminative finetuning, we use a LR of for the video feature extractor and the transformer encoder, and a LR of for the linear layers trained on top of the representations. All layers are trained for iterations, and the learning rates are scaled down every iterations.

Top-5 Accuracy Results

In Table 8 we present the EK recognition results in terms of top-5 accuracy. All results are computed as an average of predictions on 10 uniformly sampled clips from every video.

Validation Test-S1 Test-S2
Method Finetuning Verb Noun Action Verb Noun Action Verb Noun Action
T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5
R(2+1)D-34 yes 56.0 80.2 34.8 58.0 24.9 41.9 63.2 87.4 46.0 69.6 34.1 54.0 55.1 80.5 33.2 56.1 23.5 38.7
R(2+1)D-152 Ghadiyaram et al. (2019) yes 57.3 81.1 35.7 58.7 25.6 42.7 65.2 87.4 45.1 67.8 34.5 53.8 58.4 84.1 36.9 60.3 26.1 42.7
GBlend Wang et al. (2019c) yes 59.2 84.5 36.1 58.5 25.6 43.5 66.7 88.9 48.5 71.7 37.1 56.2 58.3 81.3 36.7 60.3 26.6 43.6
BAIDU Wang et al. (2019d) yes 63.2 84.6 39.1 65.0 29.0 49.8 69.8 91.0 53.3 76.7 41.4 63.6 59.7 82.7 34.2 62.4 25.1 46.0
Ours - 0shot no 53.3 75.5 34.0 58.0 24.6 41.8 57.8 78.5 35.1 59.8 24.9 42.6 52.6 74.2 34.1 58.7 24.8 42.0
Ours yes:gen 56.3 79.6 35.6 58.5 25.7 42.8 63.8 86.9 44.0 67.1 34.0 53.0 57.2 82.6 36.2 59.1 26.3 42.8
Ours yes: gen-MT 58.0 82.6 36.5 60.1 26.0 43.7 65.4 87.9 46.0 69.3 37.0 54.0 58.5 83.7 36.9 59.9 26.8 43.6
Ours yes:disc 59.4 83.9 37.8 59.0 27.8 45.1 66.1 88.5 48.4 69.9 37.2 55.8 58.6 84.1 37.0 60.5 26.9 44.5
Table 8: Action classification: comparison to the state-of-the-art on EPIC-Kitchens.
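The 10-clip averaging used for the numbers above can be sketched as follows (a minimal NumPy sketch with our own function name):

```python
import numpy as np

def video_prediction(clip_logits):
    """Average per-clip softmax distributions over the clips sampled
    uniformly from a video; the argmax gives the video-level label.

    clip_logits: array of shape (num_clips, num_classes)."""
    e = np.exp(clip_logits - clip_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)

# Two clips, three classes (illustrative values).
logits = np.array([[2.0, 0.5, 0.1],
                   [1.5, 1.0, 0.2]])
video_probs = video_prediction(logits)
```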

C.3 Captioning - TVC/YouCook2/MSR-VTT

We finetune our model for each dataset separately. We use the same general hyper-parameters for each of the datasets (they are all fine-tuned in a generative fashion).

C.4 Retrieval - YouCook2/MSR-VTT

We fine-tune these models for captioning following the general procedure above, but on the dataset splits provided by Miech et al. (2019).