Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition

09/17/2021
by   Felix Weninger, et al.
0

In this paper, we propose a dual-encoder ASR architecture for joint modeling of close-talk (CT) and far-talk (FT) speech, in order to combine the advantages of CT and FT devices for better accuracy. The key idea is to add an encoder selection network to choose the optimal input source (CT or FT) and the corresponding encoder. We use a single-channel encoder for CT speech and a multi-channel encoder with Spatial Filtering neural beamforming for FT speech, which are jointly trained with the encoder selection. We validate our approach on both attention-based and RNN Transducer end-to-end ASR systems. The experiments are done with conversational speech from a medical use case, which is recorded simultaneously with a CT device and a microphone array. Our results show that the proposed dual-encoder architecture obtains up to 9 reduction when using both CT and FT input, compared to the best single-encoder system trained and tested in matched condition.

READ FULL TEXT
research
10/14/2022

Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition

TV subtitles are a rich source of transcriptions of many types of speech...
research
11/12/2018

Stream attention-based multi-array end-to-end speech recognition

Automatic Speech Recognition (ASR) using multiple microphone arrays has ...
research
10/23/2021

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

The advances in attention-based encoder-decoder (AED) networks have brou...
research
04/19/2019

An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch Conditions

Sequence-to-sequence (S2S) modeling is becoming a popular paradigm for a...
research
06/17/2019

Multi-Stream End-to-End Speech Recognition

Attention-based methods and Connectionist Temporal Classification (CTC) ...
research
03/31/2022

CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition

Previous research has looked into ways to improve speech emotion recogni...
research
11/15/2019

Sample Drop Detection for Distant-speech Recognition with Asynchronous Devices Distributed in Space

In many applications of multi-microphone multi-device processing, the sy...

Please sign up or login with your details

Forgot password? Click here to reset