Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

03/31/2021
by Timo Lohrenz, et al.

Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet it remains mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving to achieve optimal fusion by investigating different fusion levels in an example single-microphone setting with fusion of standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training. Employing only the magnitude feature encoder in inference, we show consistent improvements on Wall Street Journal (WSJ) with a language model and on Librispeech, without any increase in runtime or number of parameters. Combining two such multi-encoder-trained models by a simple late fusion in inference, we achieve state-of-the-art performance for transformer-based models on WSJ with a significant relative WER reduction of 19% compared to the current benchmark approach.

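To make the multi-encoder learning idea concrete, below is a minimal PyTorch sketch of a decoder cross-attention block that combines two encoder-decoder attention outputs with a fixed weight during training and falls back to the magnitude-feature stream alone at inference. The class and parameter names (MultiEncoderCrossAttention, lambda_fusion) and the shared attention projections are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiEncoderCrossAttention(nn.Module):
    """Cross-attention over two encoder memories (e.g. magnitude and phase features).

    During training, the two encoder-decoder attention outputs are combined with a
    weight lambda_fusion; at inference, only the primary (magnitude) memory is
    attended to, so runtime and parameter count match a single-encoder model.
    Sharing one attention module across both streams is an assumption of this
    sketch, not necessarily the paper's exact design.
    """

    def __init__(self, d_model: int, n_heads: int, lambda_fusion: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lambda_fusion = lambda_fusion

    def forward(self, query, memory_mag, memory_phase=None):
        # Encoder-decoder attention on the magnitude-feature encoder output.
        out_mag, _ = self.attn(query, memory_mag, memory_mag)
        if self.training and memory_phase is not None:
            # Training only: also attend to the phase-feature encoder output
            # and form the weighted combination of the two attention outputs.
            out_phase, _ = self.attn(query, memory_phase, memory_phase)
            return (1.0 - self.lambda_fusion) * out_mag + self.lambda_fusion * out_phase
        # Inference: magnitude-only path, no extra cost.
        return out_mag


# Hypothetical usage with random tensors in place of real encoder/decoder states.
layer = MultiEncoderCrossAttention(d_model=256, n_heads=4, lambda_fusion=0.5)
q = torch.randn(8, 20, 256)           # (batch, target_len, d_model) decoder queries
mem_mag = torch.randn(8, 100, 256)    # magnitude-feature encoder output
mem_phase = torch.randn(8, 100, 256)  # phase-feature encoder output
layer.train(); y_train = layer(q, mem_mag, mem_phase)  # weighted two-stream attention
layer.eval();  y_infer = layer(q, mem_mag)             # single-stream at inference
```

Restricting the phase-stream attention to training time is what allows the abstract's claim that inference cost stays identical to a single-encoder model.
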
Related research:

07/02/2021 - Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition
Recently, attention-based encoder-decoder (AED) models have shown high p...

07/21/2021 - Multi-Stream Transformers
Transformer-based encoder-decoder models produce a fused token-wise repr...

06/23/2023 - Upscaling Global Hourly GPP with Temporal Fusion Transformer (TFT)
Reliable estimates of Gross Primary Productivity (GPP), crucial for eval...

09/22/2017 - Attention-based Wav2Text with Feature Transfer Learning
Conventional automatic speech recognition (ASR) typically performs multi...

09/13/2022 - Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition
Attention layers are an integral part of modern end-to-end automatic spe...

03/21/2023 - Automatic evaluation of herding behavior in towed fishing gear using end-to-end training of CNN and attention-based networks
This paper considers the automatic classification of herding behavior in...

08/13/2020 - Large-scale Transfer Learning for Low-resource Spoken Language Understanding
End-to-end Spoken Language Understanding (SLU) models are made increasin...
