An improved hybrid CTC-Attention model for speech recognition
Recently, end-to-end speech recognition with a hybrid model consisting of connectionist temporal classification (CTC) and an attention-based encoder-decoder has achieved state-of-the-art results. In this paper, we propose a novel CTC decoder structure motivated by our experiments, and we explore the relation between decoding performance and the depth of the encoder. We also apply an attention smoothing mechanism to acquire more context information for subword-based decoding. Taken together, these strategies allow us to achieve a word error rate (WER) of 4.43% on the test-clean subset of the LibriSpeech corpus, which is by far the best reported WER for end-to-end ASR systems on this dataset.
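The two mechanisms named in the abstract can be illustrated with a minimal sketch. It assumes that the hybrid objective is a weighted interpolation of the CTC and attention losses, and that attention smoothing replaces the exponential in the softmax with a logistic sigmoid before normalization, which is one common formulation. The abstract does not confirm either detail, and the function names and the weight value 0.3 are hypothetical.

```python
import numpy as np


def softmax_attention(scores):
    """Standard attention weights: exponentiate the scores, then normalize."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def smoothed_attention(scores):
    """Smoothed attention weights (assumed variant): sigmoid instead of exp.

    The sigmoid is bounded in (0, 1), so no single frame can dominate the
    normalization, and the weights spread over more encoder frames.
    """
    sig = 1.0 / (1.0 + np.exp(-scores))
    return sig / sig.sum()


def hybrid_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the two losses; ctc_weight=0.3 is an illustrative value."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical attention scores for four encoder frames.
scores = np.array([4.0, 1.0, 0.5, 0.0])
sharp = softmax_attention(scores)
smooth = smoothed_attention(scores)
print("softmax :", np.round(sharp, 3))
print("smoothed:", np.round(smooth, 3))
print("hybrid loss:", hybrid_loss(2.0, 4.0, ctc_weight=0.5))
```

With these scores, the softmax places most of its weight on the first frame, while the smoothed variant distributes weight more evenly across frames. That wider spread is the sense in which smoothing can supply more context per decoding step, which matters when each output unit is a subword spanning several frames.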