Sound and Visual Representation Learning with Multiple Pretraining Tasks

01/04/2022
by   Arun Balajee Vasudevan, et al.

Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks. Specifically, for this study, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models in downstream task performance. As a check of applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL, and DetCo by +2.06, +1.56, and +1.61 AP on COCO detection. Code will be made publicly available.
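To make the idea of a temporal gap prediction pretext task more concrete, the following is a minimal sketch of how training pairs for such a task could be sampled. The abstract does not specify how the gap is defined or discretized, so everything here is an assumption for illustration: the function name `sample_gap_pair`, the fixed list of candidate gaps, and the choice to treat the task as classification over gap indices are all hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def sample_gap_pair(
    signal: Sequence[float],
    segment_len: int,
    gaps: Sequence[int],
    rng: random.Random,
) -> Tuple[List[float], List[float], int]:
    """Cut two segments from `signal` separated by a randomly chosen gap.

    Returns (first_segment, second_segment, label), where `label` is the
    index into `gaps` of the gap (in samples) between the two segments.
    This is a hypothetical formulation, not the paper's exact task.
    """
    # Make sure even the largest candidate gap fits inside the signal.
    if 2 * segment_len + max(gaps) > len(signal):
        raise ValueError("signal too short for the segment length and gaps")

    # Pick the gap class first; it becomes the self-supervised label.
    label = rng.randrange(len(gaps))
    gap = gaps[label]

    # Pick a start position so that both segments stay in bounds.
    span = 2 * segment_len + gap
    start = rng.randint(0, len(signal) - span)

    first = list(signal[start:start + segment_len])
    second_start = start + segment_len + gap
    second = list(signal[second_start:second_start + segment_len])
    return first, second, label


if __name__ == "__main__":
    rng = random.Random(0)
    toy_signal = [float(i) for i in range(100)]  # stand-in for an audio track
    a, b, y = sample_gap_pair(toy_signal, segment_len=10, gaps=(0, 5, 20), rng=rng)
    print(len(a), len(b), y)
```

In a sketch like this, an encoder would embed both segments and a small classifier head would predict `label`; no manual annotation is needed because the label comes from the sampling procedure itself.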


