Audio-Visual Wake Word Spotting System For MISP Challenge 2021

04/19/2022
by   Yanguang Xu, et al.

This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The goal of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video stream, the provided region of interest (ROI) is used to obtain the visual representation. A multi-layer CNN then learns audio and visual representations, which are fed into our two-branch attention-based fusion network built on architectures such as the transformer and conformer. The focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by majority voting to achieve our final score of 0.091.
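The focal loss mentioned in the abstract down-weights easy examples so that fine-tuning concentrates on hard far-field cases. Below is a minimal PyTorch sketch of a binary focal loss for wake-word classification; the hyper-parameter values and function name are illustrative assumptions, not the authors' reported settings.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch for wake-word spotting.

    logits:  raw model outputs, shape (batch,)
    targets: 0/1 wake-word labels as floats, shape (batch,)
    gamma, alpha: focal-loss hyper-parameters (illustrative defaults,
    not the values used in the paper).
    """
    # Per-example binary cross-entropy on the raw logits
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Probability assigned to the true class
    p_t = torch.exp(-bce)
    # Class-balancing weight alpha for positives, (1 - alpha) for negatives
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # Modulating factor (1 - p_t)^gamma down-weights easy examples
    loss = alpha_t * (1.0 - p_t) ** gamma * bce
    return loss.mean()
```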
