Contrastive Positive Sample Propagation along the Audio-Visual Event Line

11/18/2022
by Jinxing Zhou, et al.

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments that contain an AVE and to identify its category. Learning discriminative features for each video segment is pivotal to this task. Unlike existing work, which focuses on audio-visual feature fusion, this paper proposes a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The key contribution of CPSP is to use the available full or weak labels as a prior for constructing exact positive and negative samples for contrastive learning. Specifically, CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP), and segment-level and video-level positive sample activation (PSA_S and PSA_V). Three new contrastive objectives (ℒ_avpsp, ℒ_spsa, and ℒ_vpsa) are proposed and introduced into both fully and weakly supervised AVE localization. To draw a complete picture of contrastive learning in AVE localization, we also study self-supervised positive sample propagation (SSPSP). As a result, CPSP helps obtain refined audio-visual features that are distinguishable from the negatives, which benefits the classifier's predictions. Extensive experiments on the AVE dataset and the newly collected VGGSound-AVEL100k dataset verify the effectiveness and generalization ability of our method.
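
To make the idea of label-guided positive and negative sampling concrete, here is a minimal sketch in Python with NumPy. It is not the paper's ℒ_avpsp, ℒ_spsa, or ℒ_vpsa, whose exact definitions are not given in this abstract. Instead it uses the generic supervised contrastive formulation, in which samples sharing a label are treated as positives and all other samples as negatives. The function name `label_guided_contrastive_loss`, the `temperature` parameter, and the toy features are assumptions made for illustration only.

```python
import numpy as np


def label_guided_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's objective).

    features: (N, D) array, one feature vector per segment (hypothetical input).
    labels:   (N,) array of category ids used as the prior for positives/negatives.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]

    # Positives: same label, excluding the anchor itself.
    not_self = ~np.eye(n, dtype=bool)
    positive = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positive.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast

    # Cosine similarity scaled by the temperature.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.clip(norms, 1e-12, None)
    sim = (z @ z.T) / temperature

    # Numerically stable log-sum-exp over all samples except the anchor.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Average log-probability of the positives, per anchor that has positives.
    pos_log_prob = np.where(positive, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())


# Toy usage: two tight clusters of segment features.
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned_loss = label_guided_contrastive_loss(toy_features, [0, 0, 1, 1])
misaligned_loss = label_guided_contrastive_loss(toy_features, [0, 1, 0, 1])
```

In this toy setup, labels that agree with the feature clusters give a lower loss than labels that cut across them, which is the behavior a label prior is meant to encourage in the learned features.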
