Improving Audio-Visual Video Parsing with Pseudo Visual Labels

by Jinxing Zhou, et al.

Audio-Visual Video Parsing is the task of predicting the events that occur in each video segment for each modality. It is often performed in a weakly supervised manner, where only video-level event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known event labels for each modality. However, these labels are still limited to the video level, and the temporal boundaries of the events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the CLIP model to estimate the events in each video segment based on the visual modality, which yields segment-level pseudo labels. A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness. A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever a high forward binary cross entropy loss occurs. We perform extensive experiments on the LLP dataset and demonstrate that our method can generate high-quality segment-level pseudo labels with the help of our newly proposed loss and the label denoising strategy. Our method achieves state-of-the-art audio-visual video parsing performance.
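The two labeling steps described in the abstract, segment-level pseudo label generation and flipping labels that incur a high binary cross entropy loss, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the score matrix `similarity` (standing in for per-segment CLIP scores), the masking by video-level labels, and both threshold values are assumptions made purely for illustration.

```python
import math

def segment_pseudo_labels(similarity, video_labels, threshold=0.5):
    """Assign a 0/1 label to every (segment, class) pair.

    similarity[t][c] is a hypothetical score in [0, 1] for segment t and
    class c (assumed to come from a model such as CLIP). A class is only
    kept if it also appears in the video-level label (an assumption here).
    """
    return [
        [1 if video_labels[c] == 1 and row[c] >= threshold else 0
         for c in range(len(video_labels))]
        for row in similarity
    ]

def bce(prob, label, eps=1e-7):
    """Binary cross entropy of one predicted probability against a 0/1 label."""
    p = min(max(prob, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def flip_noisy_labels(labels, predictions, loss_threshold=2.0):
    """Flip a pseudo label when the model's prediction strongly disagrees.

    loss_threshold is an illustrative value, not one taken from the paper.
    """
    return [
        [1 - y if bce(p, y) > loss_threshold else y
         for y, p in zip(label_row, pred_row)]
        for label_row, pred_row in zip(labels, predictions)
    ]

# Two segments, two classes; only class 0 is present at the video level.
similarity = [[0.9, 0.8], [0.2, 0.7]]
video_labels = [1, 0]
pseudo = segment_pseudo_labels(similarity, video_labels)

# Hypothetical model predictions after some training.
predictions = [[0.95, 0.1], [0.99, 0.05]]
denoised = flip_noisy_labels(pseudo, predictions)
```

In this toy run the second segment's label for class 0 starts at 0 because its score is below the threshold, and it is flipped to 1 because the model confidently predicts the event there, which makes the loss against the 0 label large.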



Investigating Modality Bias in Audio Visual Video Parsing

We focus on the audio-visual video parsing (AVVP) problem that involves ...

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

We focus on the weakly-supervised audio-visual video parsing task (AVVP)...

Query-based Video Summarization with Pseudo Label Supervision

Existing datasets for manually labelled query-based video summarization ...

Timestamp-Supervised Action Segmentation in the Perspective of Clustering

Video action segmentation aims to slice the video into several action se...

Self-Learning with Rectification Strategy for Human Parsing

In this paper, we solve the sample shortage problem in the human parsing...

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

This paper focuses on the weakly-supervised audio-visual video parsing t...

Positive Sample Propagation along the Audio-Visual Event Line

Visual and audio signals often coexist in natural environments, forming ...
