Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

06/13/2021
by Shaobo Min, et al.

Cross-modal correlation provides inherent supervision for unsupervised video representation learning. Existing methods focus on distinguishing different video clips by their visual and audio representations. Human visual perception, however, can attend to the regions where sounds are made, and auditory perception can likewise ground the frequencies of sounding objects; we call this property bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework. This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property. The CMAC approach aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal, and performs a similar alignment for frequency grounding on the acoustic attention. Accompanied by a remoulded cross-modal contrastive loss that considers additional within-modal interactions, the CMAC approach works effectively for enforcing the bidirectional alignment. Extensive experiments on six downstream benchmarks demonstrate that CMAC can improve the state-of-the-art performance on both visual and audio modalities.
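
The attention-consistency idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the dot-product attention, the KL-divergence consistency loss, and all names (`attention`, `attention_consistency_loss`, `visual_query`, `audio_query`, the toy feature values) are assumptions chosen only to make the idea concrete. It compares an attention map computed from one modality alone with a target attention map guided by the other modality, once over visual regions and once over audio frequency bands.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(features, query):
    # One weight per feature vector (a region or a frequency band).
    return softmax([dot(f, query) for f in features])

def kl_divergence(target, pred, eps=1e-12):
    # KL(target || pred); an assumed choice of consistency measure.
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))

def attention_consistency_loss(features, own_query, other_modality_embedding):
    # pred: attention from the modality's own (hypothetical) learned query.
    # target: attention guided by the other modality's embedding.
    pred = attention(features, own_query)
    target = attention(features, other_modality_embedding)
    return kl_divergence(target, pred)

# Toy visual side: three regions, guided by an audio embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_embedding = [2.0, 0.0]
visual_query = [0.0, 0.0]  # untrained query: uniform attention

# Toy audio side: two frequency bands, guided by a video embedding.
freq_bands = [[0.2, 0.9], [0.8, 0.1]]
video_embedding = [0.0, 1.5]
audio_query = [0.0, 1.5]  # already matches the guidance

visual_loss = attention_consistency_loss(regions, visual_query, audio_embedding)
audio_loss = attention_consistency_loss(freq_bands, audio_query, video_embedding)
total_loss = visual_loss + audio_loss  # bidirectional alignment term
```

In this toy setup the visual term is positive because the uniform visual attention disagrees with the audio-guided target, while the audio term vanishes because its query already matches the guidance. In a full system such a term would presumably be added to the contrastive loss mentioned in the abstract; how the two are weighted is not specified here.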

research
07/20/2022

Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

Multiple modalities for certain information provide a variety of perspec...
research
11/28/2019

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

The visual and audio modalities are highly correlated yet they contain d...
research
04/27/2020

Audio-Visual Instance Discrimination with Cross-Modal Agreement

We present a self-supervised learning approach to learn audio-visual rep...
research
10/20/2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

We introduce the task of spatially localizing narrated interactions in v...
research
08/26/2021

Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

In this work, we consider the problem of sequence-to-sequence alignment ...
research
10/12/2022

Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning

Learning medical visual representations directly from paired radiology r...
research
08/22/2023

DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment

Cross-modal garment synthesis and manipulation will significantly benefi...
