Investigations on Phoneme-Based End-To-End Speech Recognition

05/19/2020
by Albert Zeyer, et al.

Common end-to-end models such as CTC or encoder-decoder-attention models use characters or subword units like BPE as output labels. We systematically compare grapheme-based and phoneme-based output labels. The phoneme-based labels can be single phonemes without context (~40 labels), or multiple phonemes combined into one output label, which yields phoneme-based subwords. For this purpose, we introduce phoneme-based BPE labels. In further experiments, we extend the phoneme set with auxiliary units so that homophones (different words with the same pronunciation) can be discriminated. This enables a very simple and efficient decoding algorithm. We perform the experiments on Switchboard 300h and show that our phoneme-based models are competitive with the grapheme-based models.
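
The two label ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy pronunciations, the function names (`learn_phoneme_bpe`, `encode`, `add_homophone_markers`), the `_` joiner for merged units, and the `#1`, `#2` marker scheme for homophones are all assumptions chosen for illustration. The first part runs standard BPE merging over phoneme sequences instead of characters; the second appends an auxiliary unit to words that share a pronunciation, so that each label sequence maps back to exactly one word.

```python
from collections import Counter


def apply_merge(units, pair):
    """Replace each adjacent occurrence of `pair` with one merged unit."""
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + "_" + units[i + 1])  # "_" joiner is an assumption
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return tuple(merged)


def learn_phoneme_bpe(pronunciations, num_merges):
    """Learn BPE merges where the base symbols are phonemes, not characters."""
    corpus = [tuple(p) for p in pronunciations]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for units in corpus:
            for a, b in zip(units, units[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(units, best) for units in corpus]
    return merges


def encode(phonemes, merges):
    """Apply the learned merges, in order, to one phoneme sequence."""
    units = tuple(phonemes)
    for pair in merges:
        units = apply_merge(units, pair)
    return list(units)


def add_homophone_markers(lexicon):
    """Append an auxiliary unit (#1, #2, ...) to words sharing a pronunciation.

    The marker scheme is hypothetical; it only illustrates the idea of
    extending the phoneme set so that homophones become distinguishable.
    """
    by_pron = {}
    for word, phones in lexicon.items():
        by_pron.setdefault(tuple(phones), []).append(word)
    marked = {}
    for phones, words in by_pron.items():
        for k, word in enumerate(sorted(words), start=1):
            suffix = (f"#{k}",) if len(words) > 1 else ()
            marked[word] = phones + suffix
    return marked


# Toy data (hypothetical pronunciations in an ARPAbet-like notation).
toy_prons = [["K", "AE", "T"], ["K", "AE", "B"], ["K", "AE", "N"], ["AE", "T"]]
merges = learn_phoneme_bpe(toy_prons, num_merges=1)
encoded = encode(["K", "AE", "T"], merges)

toy_lexicon = {"two": ["T", "UW"], "too": ["T", "UW"], "cat": ["K", "AE", "T"]}
marked = add_homophone_markers(toy_lexicon)
# With markers, every label sequence identifies a single word.
unique_lookup = {units: word for word, units in marked.items()}
```

In this toy setup the pair `("K", "AE")` occurs three times and is merged first, so "cat" is encoded as `["K_AE", "T"]`. Because the marked pronunciations are unique, mapping a recognized label sequence back to a word reduces to a dictionary lookup, which is one way to read the abstract's remark about simple decoding.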

research
04/05/2017

Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

End-to-end training of deep learning-based models allows for implicit le...
research
07/13/2018

Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units

In this paper, we present an end-to-end automatic speech recognition sys...
research
05/08/2018

Improved training of end-to-end attention models for speech recognition

Sequence-to-sequence attention-based models on subword units allow simpl...
research
08/30/2018

End-to-end Speech Recognition with Adaptive Computation Steps

In this paper, we present Adaptive Computation Steps (ACS) algorithm, wh...
research
03/28/2020

Serialized Output Training for End-to-End Overlapped Speech Recognition

This paper proposes serialized output training (SOT), a novel framework ...
research
12/08/2019

VM-Net: Mesh Modeling to Assist Segmentation in Volumetric Data

CNN-based volumetric methods that label individual voxels now dominate t...
research
09/01/2021

Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition

Contextual knowledge is important for real-world automatic speech recogn...
