Multi-Channel Transformer Transducer for Speech Recognition

08/30/2021
by Feng-Ju Chang, et al.

Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computation cost, and low latency, so that it is suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, our MCTT outperforms stagewise multi-channel models with a transformer transducer by up to 6.01%. In addition, MCTT outperforms the multi-channel transformer by up to 11.62% and is 15.8 times faster in terms of inference speed. We further show that we can reduce the computational cost of MCTT by constraining the future and previous context in attention computations.
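The last point, constraining the future and previous context in attention, can be illustrated with a small sketch. The abstract does not describe the masking scheme actually used in MCTT, so the code below is only a generic limited-context attention mask written under that assumption. The names `limited_context_mask`, `attended_pairs`, `left_context`, and `right_context` are hypothetical and chosen for illustration.

```python
# A minimal sketch, assuming a simple sliding-window mask; this is not
# taken from the paper. All names here are hypothetical.

def limited_context_mask(num_frames, left_context, right_context):
    """Return a boolean matrix where mask[t][s] is True when frame t
    may attend to frame s, i.e. t - left_context <= s <= t + right_context."""
    mask = []
    for t in range(num_frames):
        row = [
            (t - left_context) <= s <= (t + right_context)
            for s in range(num_frames)
        ]
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# With a window at least as wide as the sequence, every pair is allowed.
full = limited_context_mask(6, left_context=5, right_context=5)

# Restricting to 2 previous frames and 1 future frame allows fewer pairs.
limited = limited_context_mask(6, left_context=2, right_context=1)

print(attended_pairs(full), attended_pairs(limited))
```

For a fixed window, the number of allowed pairs grows linearly with the number of frames rather than quadratically, and a small `right_context` bounds how many future frames must arrive before a frame can be processed, which is what matters for streaming.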

research
10/22/2020

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great succes...
research
02/10/2020

End-to-End Multi-speaker Speech Recognition with Transformer

Recently, fully recurrent neural network (RNN) based end-to-end models h...
research
02/08/2021

End-to-End Multi-Channel Transformer for Speech Recognition

Transformers are powerful neural architectures that allow integrating di...
research
03/31/2022

Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

Recently, the end-to-end training approach for multi-channel ASR has sho...
research
03/02/2023

LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion

As a key component of automated speech recognition (ASR) and the front-e...
research
11/01/2019

Long-distance Detection of Bioacoustic Events with Per-channel Energy Normalization

This paper proposes to perform unsupervised detection of bioacoustic eve...
research
09/09/2020

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

We introduce VoiceFilter-Lite, a single-channel source separation model ...