Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

03/27/2018
by Szu-Jui Chen, et al.

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge. It aims to promote the development of noisy ASR in the speech processing community by providing 1) a state-of-the-art yet simplified single system that is comparable to the complicated top systems in the challenge, and 2) a publicly available, reproducible recipe in the main repository of the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue (GEV) beamforming with bidirectional long short-term memory (BLSTM) mask estimation. We also propose to use a time delay neural network (TDNN) trained with the lattice-free maximum mutual information (LF-MMI) criterion on data augmented with all six microphone channels plus the enhanced data after beamforming. Finally, we use an LSTM language model for lattice and n-best rescoring. The final system achieves a word error rate (WER) of 2.74% on the real test set of the 6-channel track, which corresponds to second place in the challenge. In addition, the proposed baseline recipe computes four speech enhancement measures on the simulated test set: short-time objective intelligibility (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ), and signal-to-distortion ratio (SDR). The recipe therefore also provides an experimental platform for speech enhancement studies using these performance measures.
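The front end pairs a GEV beamformer with mask estimation. As a rough illustration of the beamforming step only, the NumPy sketch below assumes that time-frequency masks for speech and noise have already been estimated (by a BLSTM in the paper; here they are simply inputs) and that the multichannel signal is already in the STFT domain. The function names `masked_psd`, `gev_weights`, and `apply_beamformer` are hypothetical, the diagonal-loading constant is an assumption, and this is not the code used in the Kaldi recipe. It also omits any postfilter, such as the blind analytic normalization often paired with GEV beamformers.

```python
import numpy as np

# Hypothetical helpers: a sketch of mask-based GEV (max-SNR) beamforming.
# Masks are assumed to be given (e.g. produced by a BLSTM mask estimator).


def masked_psd(stft, mask):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    stft: complex array, shape (F, T, M) = (frequency, frame, microphone)
    mask: real array in [0, 1], shape (F, T)
    returns: complex array, shape (F, M, M)
    """
    # Sum over frames of mask * y y^H, where element (m, n) is y_m * conj(y_n)
    psd = np.einsum("ft,ftm,ftn->fmn", mask, stft, stft.conj())
    # Normalize by the total mask weight in each frequency bin
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return psd / norm[:, None, None]


def gev_weights(phi_xx, phi_nn, reg=1e-6):
    """Max-SNR (GEV) beamformer weights, one vector per frequency bin.

    Solves phi_xx w = lambda * phi_nn w and keeps the eigenvector with the
    largest eigenvalue. Shapes: phi_xx, phi_nn (F, M, M) -> weights (F, M).
    """
    num_freq, num_mics, _ = phi_xx.shape
    eye = np.eye(num_mics)
    weights = np.empty((num_freq, num_mics), dtype=complex)
    for f in range(num_freq):
        # Assumed diagonal loading keeps the noise PSD positive definite
        load = reg * np.trace(phi_nn[f]).real / num_mics
        chol = np.linalg.cholesky(phi_nn[f] + load * eye)  # phi_nn = L L^H
        chol_inv = np.linalg.inv(chol)
        # Whitened speech PSD: L^{-1} phi_xx L^{-H} (Hermitian)
        whitened = chol_inv @ phi_xx[f] @ chol_inv.conj().T
        # eigh returns eigenvalues in ascending order, so take the last column
        _, vecs = np.linalg.eigh(whitened)
        # Map the eigenvector back to the original problem: w = L^{-H} u
        weights[f] = chol_inv.conj().T @ vecs[:, -1]
    return weights


def apply_beamformer(stft, weights):
    """Beamformer output w_f^H y_{f,t}; returns shape (F, T)."""
    return np.einsum("fm,ftm->ft", weights.conj(), stft)
```

A typical call would be `apply_beamformer(Y, gev_weights(masked_psd(Y, speech_mask), masked_psd(Y, noise_mask)))`, followed by an inverse STFT. The generalized eigenproblem is reduced to an ordinary Hermitian one through a Cholesky whitening of the noise PSD, which keeps the sketch dependent on NumPy alone. Note that GEV weights have an arbitrary gain and phase in each frequency bin, which is one reason practical systems add a normalization postfilter.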
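The headline result is reported as WER: the number of word substitutions, deletions, and insertions needed to turn each hypothesis into its reference, summed over the test set and divided by the total number of reference words. A minimal Python sketch of that definition follows. The helper names `word_errors` and `corpus_wer` are hypothetical, and the sketch skips the text normalization that a real scoring pipeline would typically apply.

```python
def word_errors(reference, hypothesis):
    """Return (substitutions + deletions + insertions, reference length)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion of ref[i-1]
                          curr[j - 1] + 1,     # insertion of hyp[j-1]
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Corpus-level WER in percent: total errors / total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words
```

For example, `corpus_wer([("the cat sat", "the cat sat"), ("a b c d", "a x c")])` counts one substitution and one deletion over seven reference words, about 28.6%. Errors are pooled over the corpus before dividing, rather than averaging per-utterance rates, so long utterances weigh more.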


research
03/27/2018

Student-Teacher Learning for BLSTM Mask-based Speech Enhancement

Spectral mask estimation using bidirectional long short-term memory (BLS...
research
06/24/2021

SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing

This paper presents the details of the SRIB-LEAP submission to the Confe...
research
07/01/2017

Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments

Multichannel linear filters, such as the Multichannel Wiener Filter (MWF...
research
11/18/2022

AVATAR submission to the Ego4D AV Transcription Challenge

In this report, we describe our submission to the Ego4D AudioVisual (AV)...
research
02/03/2022

The RoyalFlush System of Speech Recognition for M2MeT Challenge

This paper describes our RoyalFlush system for the track of multi-speake...
research
01/20/2021

The JHU ASR System for VOiCES from a Distance Challenge 2019

This paper describes the system developed by the JHU team for automatic ...
research
11/06/2018

Language model integration based on memory control for sequence to sequence speech recognition

In this paper, we explore several new schemes to train a seq2seq model t...
