The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

03/11/2023
by   Zhe Wang, et al.

The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve “who spoke when” using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which focuses on addressing “who spoke what when” using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, as well as the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.

research
09/15/2023

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Previous Multimodal Information based Speech Processing (MISP) challenge...
research
02/17/2022

A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

Audio-only-based wake word spotting (WWS) is challenging under noisy con...
research
08/18/2022

Deploying Enhanced Speech Feature Decreased Audio Complaints at SVT Play VOD Service

At Public Service Broadcaster SVT in Sweden, background music and sounds...
research
06/18/2023

STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization

This report introduces our novel method named STHG for the Audio-Visual ...
research
06/14/2019

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Speech is a means of communication which relies on both audio and visual...
research
06/04/2020

Third DIHARD Challenge Evaluation Plan

This paper introduces the third DIHARD challenge, the third in a series ...
research
03/21/2022

Audio visual character profiles for detecting background characters in entertainment media

An essential goal of computational media intelligence is to support unde...
