Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space

05/30/2023
by Xiaogang Peng, et al.

In recent years, weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data using only video-level labels. Despite advances in this field, the Euclidean neural networks used in prior research struggle to capture highly discriminative representations because of limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. Our framework comprises a detour fusion module for multimodal fusion, which alleviates modality inconsistency between audio and visual signals. Additionally, we contribute two branches of fully hyperbolic graph convolutional networks that mine feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events. Extensive experiments on the XD-Violence benchmark demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
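To make the idea of embedding snippets in hyperbolic space more concrete, here is a minimal sketch, not the HyperVD implementation. The abstract does not say which model of hyperbolic space is used, so the Lorentz (hyperboloid) model with curvature -1 is an assumption here, and the function names `exp_map_origin`, `lorentz_inner`, and `hyperbolic_distance` are hypothetical. The sketch lifts a Euclidean feature vector onto the hyperboloid and measures the geodesic distance between two lifted points.

```python
import math


def exp_map_origin(v):
    """Lift a Euclidean (tangent-space) vector v onto the unit hyperboloid.

    Assumption: Lorentz model with curvature -1; the origin is (1, 0, ..., 0).
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin.
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in v]


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def hyperbolic_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # Clamp to 1.0 so rounding error cannot push acosh outside its domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


# Two toy "snippet features" (illustrative values, not real data).
snippet_a = exp_map_origin([0.3, 0.4])
snippet_b = exp_map_origin([-0.6, 0.8])
print(hyperbolic_distance(snippet_a, snippet_b))
```

In this model the distance from the origin to a lifted point equals the Euclidean norm of the original vector, while distances between points far from the origin grow much faster than in Euclidean space. That extra room is the usual motivation for hyperbolic embeddings.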

research
07/12/2022

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Weakly-supervised audio-visual violence detection aims to distinguish sn...
research
02/10/2023

Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

Learning discriminative features for effectively separating abnormal eve...
research
04/19/2018

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

Audio-visual representation learning is an important task from the persp...
research
04/05/2021

Can audio-visual integration strengthen robustness under multimodal attacks?

In this paper, we propose to make a systematic study on machines multise...
research
09/27/2019

wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval

Given a video and a sentence, the goal of weakly-supervised video moment...
research
05/30/2021

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

For multimodal tasks, a good feature extraction network should extract i...
research
03/08/2022

Skating-Mixer: Multimodal MLP for Scoring Figure Skating

Figure skating scoring is a challenging task because it requires judging...
