Rule-embedded network for audio-visual voice activity detection in live musical video streams

10/27/2020
by Yuanbo Hou, et al.

Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. Using visual information, this paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out non-target sound. Experiments show that: 1) with cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
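
The abstract does not spell out the exact form of the rule, so the following is only a minimal sketch of the masking idea, assuming the rule can be approximated by an element-wise product of frame-level scores. The function names, the example scores, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
def mask_audio_with_visual(audio_probs, visual_probs):
    """Mask frame-level audio scores with frame-level visual scores.

    Both inputs are hypothetical per-frame probabilities in [0, 1]:
    audio_probs[t]  - audio branch's belief that some voice is present at frame t
    visual_probs[t] - visual branch's belief that the anchor is vocalizing at frame t
    The element-wise product is an assumed stand-in for the paper's rule.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual sequences must have the same number of frames")
    return [a * v for a, v in zip(audio_probs, visual_probs)]


def detect_voice(audio_probs, visual_probs, threshold=0.5):
    """Return one boolean per frame: True where the masked score reaches the threshold."""
    masked = mask_audio_with_visual(audio_probs, visual_probs)
    return [p >= threshold for p in masked]


if __name__ == "__main__":
    # Frame 2 has a loud voice in the audio, but the visual branch
    # indicates the anchor is not vocalizing, so the frame is suppressed.
    audio = [0.9, 0.8, 0.9, 0.1]
    visual = [0.9, 0.1, 0.8, 0.9]
    print(detect_voice(audio, visual))  # [True, False, True, False]
```

In this sketch a frame counts as target voice only when both modalities agree, which is one simple way a visual mask could filter out non-target sound such as background singing.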

research
06/21/2021

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Many previous audio-visual voice-related works focus on speech, ignoring...
research
04/07/2022

Musical Information Extraction from the Singing Voice

Music information retrieval is currently an active research area that ad...
research
03/05/2022

Audio-visual speech separation based on joint feature representation with cross-modal attention

Multi-modal based speech separation has exhibited a specific advantage o...
research
04/05/2022

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

In this paper, we address the problem of lip-voice synchronisation in vi...
research
03/25/2021

Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation

In this paper, we address the problem of separating individual speech si...
research
04/11/2016

Kernel-based Sensor Fusion with Application to Audio-Visual Voice Activity Detection

In this paper, we address the problem of multiple view data fusion in th...
research
10/31/2019

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on ...
