Deep Sparse Conformer for Speech Recognition

09/01/2022
by Xianchao Wu, et al.

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by combining the transformer's ability to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions: sparser and deeper. We adapt a sparse self-attention mechanism with 𝒪(L log L) time complexity and memory usage. A deep normalization strategy is applied to the residual connections to make it possible to train Conformers with on the order of a hundred blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves CERs of 5.52%, 4.03% and 4.50% on the three evaluation sets, respectively, and 4.16%, 2.84% and 3.20% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and 100 encoder layers.
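
The block layout described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: each module (`ffn1`, `attn`, `conv`, `ffn2`) is a hypothetical placeholder that maps one feature vector to another, whereas real attention and convolution modules operate over a whole sequence. The helper `deep_norm_residual` and its scaling constant `alpha` are likewise assumptions, showing one common way a deep normalization scheme up-weights the residual branch before layer normalization; the abstract does not specify the exact formula used.

```python
import math
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]  # hypothetical stand-in for a trained sub-module


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual(x: Vector, module: Module) -> Vector:
    """Standard residual connection: x + module(x)."""
    return [xi + mi for xi, mi in zip(x, module(x))]


def half_step_residual(x: Vector, ffn: Module) -> Vector:
    """Macaron-style half-step residual: x + 0.5 * ffn(x)."""
    return [xi + 0.5 * fi for xi, fi in zip(x, ffn(x))]


def conformer_block(x: Vector, ffn1: Module, attn: Module,
                    conv: Module, ffn2: Module) -> Vector:
    """Two half-step feed-forward layers sandwich attention and convolution,
    followed by a post layer normalization."""
    x = half_step_residual(x, ffn1)
    x = residual(x, attn)
    x = residual(x, conv)
    x = half_step_residual(x, ffn2)
    return layer_norm(x)


def deep_norm_residual(x: Vector, module: Module, alpha: float) -> Vector:
    """Assumed deep-normalization residual: LayerNorm(alpha * x + module(x)).
    `alpha` is a hypothetical depth-dependent constant, not taken from the source."""
    return layer_norm([alpha * xi + mi for xi, mi in zip(x, module(x))])
```

With a toy module such as `lambda v: [2.0 * t for t in v]` passed for all four sub-modules, `conformer_block` returns a vector whose entries sum to approximately zero, as expected after the final layer normalization.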
