A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese

05/16/2018 ∙ by Shiyu Zhou, et al.

The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this choice has been challenged by sequence-to-sequence attention-based models, which integrate an acoustic, a pronunciation and a language model into a single neural network. On English ASR tasks, previous work has shown that graphemes can outperform phonemes as the modeling unit of sequence-to-sequence attention-based models. In this paper, we study modeling units on Mandarin Chinese ASR tasks using sequence-to-sequence attention-based models with the Transformer. Five modeling units are explored: context-independent phonemes (CI-phonemes), syllables, words, sub-words and characters. Experiments on the HKUST datasets demonstrate that the lexicon-free modeling units can outperform the lexicon-related modeling units in terms of character error rate (CER). Among the five modeling units, the character based model performs best and establishes a new state-of-the-art CER of 26.64% on the HKUST datasets without a hand-designed lexicon or extra language model integration, which corresponds to a 4.8% relative improvement over the existing best CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network.

1 Introduction

Conventional ASR systems consist of three components: an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are trained independently. CD-states and CD-phonemes are the dominant modeling units in such systems [1, 2, 3]. Recently, however, this design has been challenged by sequence-to-sequence attention-based models. These models commonly comprise an encoder, which consists of multiple recurrent neural network (RNN) layers that model the acoustics, and a decoder, which consists of one or more RNN layers that predict the output sub-word sequence. An attention layer acts as the interface between the encoder and the decoder: it selects frames in the encoder representation that the decoder should attend to in order to predict the next sub-word unit [4]. In [5], Tara et al. experimentally verified that the grapheme-based sequence-to-sequence attention-based model can outperform the corresponding phoneme-based model on English ASR tasks. This result is notable because it suggests that a hand-designed lexicon might be removed from ASR systems. As is well known, generating a pronunciation lexicon is laborious and time-consuming. Without a hand-designed lexicon, the design of ASR systems would be greatly simplified. Furthermore, the latest work shows that an attention-based encoder-decoder architecture achieves a new state-of-the-art WER on an English voice search task using word piece models (WPM), which are sub-word units ranging from graphemes all the way up to entire words [6].

Given the outstanding performance of grapheme-based modeling units on English ASR tasks, we conjecture that sequence-to-sequence attention-based models may not need a hand-designed lexicon on Mandarin Chinese ASR tasks either. In Mandarin Chinese, if the hand-designed lexicon is removed, the modeling units can be words, sub-words and characters. Character-based sequence-to-sequence attention-based models have been investigated on Mandarin Chinese ASR tasks in [7, 8], but the performance of different modeling units has not been compared before. Building on our work [9], which shows that a syllable based model with the Transformer can perform better than its CI-phoneme based counterpart, we investigate five modeling units on Mandarin Chinese ASR tasks: CI-phonemes, syllables (pinyins with tones), words, sub-words and characters. The Transformer is chosen as the basic architecture of the sequence-to-sequence attention-based model in this paper [9, 10]. Experiments on the HKUST datasets confirm our hypothesis that the lexicon-free modeling units, i.e. words, sub-words and characters, can outperform the lexicon-related modeling units, i.e. CI-phonemes and syllables. Among the five modeling units, the character based model with the Transformer achieves the best result and establishes a new state-of-the-art CER of 26.64% on the HKUST datasets without a hand-designed lexicon or extra language model integration, which is a 4.8% relative reduction in CER compared to the existing best CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network with a separate RNN-LM integration [11].

The rest of the paper is organized as follows. After an overview of the related work in Section 2, Section 3 describes the proposed method in detail. We then show experimental results in Section 4 and conclude this work in Section 5.

2 Related work

Sequence-to-sequence attention-based models have achieved promising results on English ASR tasks, and various modeling units have been studied recently, such as CI-phonemes, CD-phonemes, graphemes and WPM [4, 5, 6, 12]. In [5], Tara et al. first explored a sequence-to-sequence attention-based model trained with phonemes for ASR tasks and compared the modeling units of graphemes and phonemes. They experimentally verified that the grapheme-based sequence-to-sequence attention-based model can outperform the corresponding phoneme-based model on English ASR tasks. Furthermore, WPM, which are sub-word units ranging from graphemes all the way up to entire words, have been explored as modeling units in [6]. This approach achieved a new state-of-the-art WER on an English voice search task.

Although sequence-to-sequence attention-based models perform very well on English ASR tasks, there is relatively little related work on Mandarin Chinese ASR tasks. Chan et al. [7] first proposed a Character-Pinyin sequence-to-sequence attention-based model for Mandarin Chinese ASR tasks, in which the Pinyin information was used during training to improve the performance of the character model. Instead of using a joint Character-Pinyin model, [8] directly used Chinese characters as the network output by mapping the one-hot character representation to an embedding vector via a neural network layer. In addition, [13] compared the modeling units of characters and syllables with sequence-to-sequence attention-based models.

Besides characters, this paper also investigates words and sub-words as modeling units on Mandarin Chinese ASR tasks. Sub-word units encoded by byte pair encoding (BPE), which iteratively replaces the most frequent pair of characters with a single, unused symbol, have been explored on neural machine translation (NMT) tasks to address the out-of-vocabulary (OOV) problem in open-vocabulary translation [14]. We extend this approach to Mandarin Chinese ASR tasks. BPE is capable of encoding an open vocabulary with a compact symbol vocabulary of variable-length sub-word units, and it requires no shortlist.

3 System overview

3.1 ASR Transformer model architecture

The Transformer follows the same encoder-decoder structure as other sequence-to-sequence attention-based models, except that it relies entirely on self-attention and position-wise, fully connected layers for both the encoder and decoder [15]. The encoder maps an input sequence of symbol representations $x = (x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then generates an output sequence $y = (y_1, \dots, y_m)$ of symbols one element at a time.

The ASR Transformer architecture used in this work is the same as in our work [9] and is shown in Figure 1. It stacks multi-head attention (MHA) [15] and position-wise, fully connected layers for both the encoder and decoder. The encoder is composed of a stack of $N$ identical layers. Each layer has two sub-layers: the first is an MHA, and the second is a position-wise fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, followed by layer normalization. The decoder is similar to the encoder, except that it inserts a third sub-layer that performs MHA over the output of the encoder stack. To prevent leftward information flow and preserve the auto-regressive property in the decoder, the self-attention sub-layers in the decoder mask out all values corresponding to illegal connections. In addition, positional encodings [15] are added to the inputs at the bottoms of the encoder and decoder stacks to inject information about the relative or absolute position of the tokens in the sequence.
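The two mechanisms named above, decoder masking and positional encodings, can be illustrated with a small sketch. It assumes the sinusoidal encoding from [15]; the text does not say which positional encoding variant is used here, and the function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Assumption: the sinusoidal form described in [15]; other variants exist.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    for row in causal_mask(3):
        print(row)
```

Entries that are False in the mask correspond to the "illegal connections" to future positions, which would be excluded from the softmax in decoder self-attention.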

The difference between the NMT Transformer [15] and the ASR Transformer is the input of the encoder. We add a linear transformation with a layer normalization to convert the log-Mel filterbank features to the model dimension $d_{model}$ for dimension matching, which is marked by a dotted line in Figure 1.
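As a rough illustration of this input layer, the sketch below projects one acoustic feature vector to the model dimension and then normalizes it. The order (projection first, normalization second), the absence of learned gain and bias in the normalization, and the names `project_frame` and `layer_norm` are assumptions made for the example.

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


def project_frame(frame, weight, bias):
    """Map one feature frame to the model dimension, then apply layer norm.

    weight has one row per output dimension, each row as long as the frame.
    """
    projected = [
        sum(w * x for w, x in zip(row, frame)) + b
        for row, b in zip(weight, bias)
    ]
    return layer_norm(projected)


if __name__ == "__main__":
    # Toy sizes: a 2-dimensional frame projected to 3 dimensions.
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(project_frame([1.0, 2.0], weight, [0.0, 0.0, 0.0]))
```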

Figure 1: The architecture of the ASR Transformer.

3.2 Modeling units

Five modeling units are compared on Mandarin Chinese ASR tasks: CI-phonemes, syllables, words, sub-words and characters. Table 1 summarizes the number of output units for each modeling unit investigated in this paper. Table 2 shows an example of the various modeling units.

| Modeling units | Number of outputs |
|---|---|
| CI-phonemes | |
| Syllables | |
| Characters | |
| Sub-words | |
| Words | |

Table 1: Different modeling units explored in this paper.

3.2.1 CI-phoneme and syllable units

CI-phoneme and syllable units are compared in our work [9], in which CI-phonemes without silence (phonemes with tones) are employed in the CI-phoneme based experiments and syllables (pinyins with tones) in the syllable based experiments. Extra tokens (i.e. an unknown token, a padding token, and sentence start and end tokens) are appended to the outputs of both the CI-phoneme based model and the syllable based model. Standard tied-state cross-word triphone GMM-HMMs are first trained with maximum likelihood estimation to generate CI-phoneme alignments on the training set. Syllable alignments are then generated from these CI-phoneme alignments according to the lexicon, which can handle multiple pronunciations of the same word in Mandarin Chinese.

During decoding, the outputs are CI-phoneme sequences or syllable sequences. To convert them into word sequences, a greedy cascading decoder with the Transformer [9] is used. First, the ASR Transformer computes the best CI-phoneme or syllable sequence from the observation using beam search. Then, the NMT Transformer chooses the best word sequence from that best CI-phoneme or syllable sequence, again using beam search. By cascading these two Transformer models, we assume that the posterior probability of the word sequence given the observation can be approximated. Each of the two stages uses its own beam size.
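The control flow of the greedy cascade can be sketched as follows. The two search functions are hypothetical stand-ins for the ASR and NMT Transformers: each is assumed to take an input and a beam size and return a list of (sequence, score) hypotheses. The toy hypotheses, scores and beam sizes are illustrative only and are not values from this work.

```python
def greedy_cascade(observation, asr_beam_search, nmt_beam_search, asr_beam, nmt_beam):
    """Keep only the single best unit sequence from stage one, then decode words."""
    unit_hypotheses = asr_beam_search(observation, asr_beam)
    best_units = max(unit_hypotheses, key=lambda hyp: hyp[1])[0]
    word_hypotheses = nmt_beam_search(best_units, nmt_beam)
    return max(word_hypotheses, key=lambda hyp: hyp[1])[0]


def toy_asr(observation, beam_size):
    """Hypothetical stand-in for the ASR Transformer (syllable output)."""
    hypotheses = [
        (["YI1", "ZHONG3", "XIN4", "NIAN4"], -1.2),
        (["YI1", "ZHONG1", "XIN4", "NIAN4"], -2.5),
    ]
    return hypotheses[:beam_size]


def toy_nmt(units, beam_size):
    """Hypothetical stand-in for the NMT Transformer (syllables to words)."""
    table = {
        ("YI1", "ZHONG3", "XIN4", "NIAN4"): [
            (["一种", "信念"], -0.3),
            (["一种", "信", "念"], -1.9),
        ],
    }
    return table.get(tuple(units), [([], 0.0)])[:beam_size]


if __name__ == "__main__":
    print(greedy_cascade("toy observation", toy_asr, toy_nmt, 2, 2))
```

The cascade is greedy because only the top-scoring unit sequence is passed to the second stage; lower-ranked unit hypotheses are discarded rather than rescored jointly.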

3.2.2 Sub-word units

Sub-word units used in this paper are generated by BPE (https://github.com/rsennrich/subword-nmt) [14], which iteratively replaces the most frequent pair of characters or character sequences with a single, unused symbol. First, the symbol vocabulary is initialized with the character vocabulary, and each word is represented as a sequence of characters plus a special end-of-word symbol ‘@@’, which allows the original tokenization to be restored. Then, all symbol pairs are counted iteratively, and each occurrence of the most frequent pair (‘A’, ‘B’) is replaced with a new symbol ‘AB’. Each merge operation produces a new symbol that represents a character n-gram. Frequent character n-grams (or whole words) are eventually merged into a single symbol. The final symbol vocabulary size is therefore equal to the size of the initial vocabulary plus the number of merge operations, which is the hyperparameter of this algorithm [14].

BPE is capable of encoding an open vocabulary with a compact symbol vocabulary of variable-length sub-word units, and it requires no shortlist. After BPE encoding, the sub-word units range from characters all the way up to entire words. Thus there are no OOV words with BPE, and high-frequency sub-words are preserved.
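The merge procedure described above can be sketched in a few lines. This is a minimal illustration of BPE merge learning in the spirit of [14], not the subword-nmt implementation itself; the end-of-word placeholder `</w>`, the function names and the toy word frequencies are assumptions made for the example.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges, end_of_word="</w>"):
    """Learn up to num_merges merge operations from word frequencies."""
    # Each word starts as space-separated characters plus an end-of-word symbol.
    vocab = {" ".join(list(w) + [end_of_word]): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = learn_bpe({"信念": 5, "信心": 3, "一种": 4}, num_merges=2)
    print(merges)
    print(vocab)
```

Each learned merge adds exactly one symbol, which is why the final vocabulary size equals the initial character vocabulary plus the number of merge operations.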

In our experiments, we choose the number of merge operations , which generates the number of sub-words units from the training transcripts. After appended with extra tokens, the total number of outputs is .

3.2.3 Word and character units

For word units, we collect all words from the training transcripts. Appended with extra tokens, the total number of outputs is .

For character units, all Mandarin Chinese characters together with the English words in the training transcripts are collected and appended with the extra tokens to form the output vocabulary. (Footnote 2: we manually delete two tokens that are not Mandarin Chinese characters.)
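A character-level vocabulary of this kind could be collected as in the sketch below. The regular expression (basic CJK block plus ASCII words), the extra-token spellings and the function names are all assumptions for illustration; the actual tokenization rules used in this work are not specified.

```python
import re

# Assumption: English words are runs of ASCII letters (optionally with an
# apostrophe) and Chinese characters fall in the basic CJK block.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[\u4e00-\u9fff]")

# Hypothetical spellings of the unknown, padding, start and end tokens.
EXTRA_TOKENS = ["<unk>", "<pad>", "<s>", "</s>"]


def to_character_units(transcript):
    """Split a transcript into Chinese characters, keeping English words whole."""
    return TOKEN_PATTERN.findall(transcript)


def build_vocabulary(transcripts, extra_tokens=EXTRA_TOKENS):
    """Collect the distinct units from all transcripts and prepend extra tokens."""
    units = sorted({u for t in transcripts for u in to_character_units(t)})
    return list(extra_tokens) + units


if __name__ == "__main__":
    print(to_character_units("一种 信念 ok"))
    print(build_vocabulary(["一种 信念", "ok 信"]))
```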

4 Experiment

4.1 Data

The HKUST corpus (LDC2005S15, LDC2005T32), a corpus of Mandarin Chinese conversational telephone speech, was collected and transcribed by the Hong Kong University of Science and Technology (HKUST) [16]. It contains 150 hours of speech, with 873 calls in the training set and 24 calls in the test set. All experiments are conducted using 80-dimensional log-Mel filterbank features, computed with a 25ms window shifted every 10ms. The features are normalized via mean subtraction and variance normalization on a per-speaker basis. Similar to [17, 18], at the current frame $t$, these features are stacked with 3 frames to the left and downsampled to a 30ms frame rate. As in [11], we generate more training data by linearly scaling the audio lengths by two factors (speed perturb.), which improves the performance in our experiments.
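The frame stacking and downsampling step can be sketched as below. Keeping every third 10ms frame gives the 30ms rate, and each kept frame is concatenated with its 3 left neighbours, so 80-dimensional inputs become 320-dimensional. Repeating the first frame as padding at the start of the utterance is an assumption, as is the function name.

```python
def stack_and_downsample(frames, left_context=3, stride=3):
    """Stack each kept frame with its left context and keep every stride-th frame.

    frames: list of feature vectors at a 10 ms frame rate.
    Returns vectors of size (left_context + 1) * feature_dim at a 30 ms rate
    when stride is 3.
    """
    stacked = []
    for t in range(0, len(frames), stride):
        window = []
        for offset in range(left_context, -1, -1):
            # Assumption: repeat the first frame when the context runs off the start.
            window.extend(frames[max(t - offset, 0)])
        stacked.append(window)
    return stacked


if __name__ == "__main__":
    # Seven toy 1-dimensional frames whose value equals their index.
    toy_frames = [[float(i)] for i in range(7)]
    for vector in stack_and_downsample(toy_frames):
        print(vector)
```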

| Modeling units | Example |
|---|---|
| CI-phonemes | Y IY1 JH UH3 NG3 X IY4 N4 N IY4 AE4 N4 |
| Syllables | YI1 ZHONG3 XIN4 NIAN4 |
| Characters | 一 种 信 念 |
| Sub-words | 一种 信@@ 念 |
| Words | 一种 信念 |

Table 2: An example of various modeling units in this paper.

4.2 Training

We perform our experiments on the base model and the big model (i.e. D512-H8 and D1024-H16 respectively) of the Transformer from [15]. The two models share the same basic architecture but differ in their parameter settings. Table 3 lists the experimental parameters of the two models. The Adam algorithm [19] with gradient clipping and warmup is used for optimization. During training, label smoothing is employed [20]. After training, the last 20 checkpoints are averaged to make the performance more stable [15].
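The three training techniques mentioned here can be sketched as small helper functions. The warmup schedule follows the formula in [15]; the label smoothing variant (spreading the smoothing mass over the non-target classes) is one common choice and an assumption, and all numeric values in the usage example are placeholders rather than the settings of this work.

```python
def warmup_learning_rate(step, d_model, warmup_steps):
    """Learning rate schedule from [15]: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def smooth_labels(target_index, num_classes, smoothing):
    """Return a smoothed target distribution that still sums to one."""
    off_value = smoothing / (num_classes - 1)
    distribution = [off_value] * num_classes
    distribution[target_index] = 1.0 - smoothing
    return distribution


def average_checkpoints(checkpoints):
    """Average parameters element-wise over checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(warmup_learning_rate(step=100, d_model=512, warmup_steps=4000))
    print(smooth_labels(target_index=2, num_classes=5, smoothing=0.1))
    print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```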

model D512-H8 D1024-H16

Table 3: Experimental parameters configuration.

In the CI-phoneme and syllable based models, we cascade an ASR Transformer and an NMT Transformer to generate word sequences from the observation. However, we do not employ an NMT Transformer in the word, sub-word and character based models, since the beam search results from the ASR Transformer are already at the Chinese character level. The total parameters of the different modeling units are listed in Table 4.

| model | D512-H8 (ASR) | D1024-H16 (ASR) | D512-H8 (NMT) |
|---|---|---|---|
| CI-phonemes | | | |
| Syllables | | | |
| Words | | | |
| Sub-words | | | |
| Characters | | | |

Table 4: Total parameters of different modeling units.

4.3 Results

As described in Section 3.2, the modeling units of words, sub-words and characters are lexicon-free: they do not need a hand-designed lexicon. In contrast, the modeling units of CI-phonemes and syllables need a hand-designed lexicon.

Our results are summarized in Table 5. The lexicon-free modeling units, i.e. words, sub-words and characters, clearly outperform the lexicon-related modeling units, i.e. CI-phonemes and syllables, on the HKUST datasets. This confirms our hypothesis that sequence-to-sequence attention-based models can remove the need for a hand-designed lexicon on Mandarin Chinese ASR tasks. Moreover, the sub-word based model performs better than its word based counterpart. This indicates that sub-words are superior to words as a modeling unit, since sub-word units encoded by BPE have fewer outputs and no OOV problem. However, the sub-word based model performs worse than the character based model. A possible reason is that the modeling unit of sub-words is larger than that of characters, which makes it more difficult to train. We will conduct experiments on larger datasets and compare the performance of the sub-word and character modeling units in future work. Finally, among the five modeling units, the character based model with the Transformer achieves the best result. This demonstrates that characters are a suitable modeling unit for Mandarin Chinese ASR tasks with sequence-to-sequence attention-based models, which can greatly simplify the design of ASR systems.

| Modeling units | Model | CER |
|---|---|---|
| CI-phonemes [9] | D512-H8 | |
| | D1024-H16 | 30.65 |
| | D1024-H16 (speed perturb) | |
| Syllables [9] | D512-H8 | |
| | D1024-H16 | |
| | D1024-H16 (speed perturb) | 28.77 |
| Words | D512-H8 | |
| | D1024-H16 | |
| | D1024-H16 (speed perturb) | 27.42 |
| Sub-words | D512-H8 | |
| | D1024-H16 | |
| | D1024-H16 (speed perturb) | 27.26 |
| Characters | D512-H8 | |
| | D1024-H16 | |
| | D1024-H16 (speed perturb) | 26.64 |

Table 5: Comparison of different modeling units with the Transformer on HKUST datasets in CER (%).
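For reference, CER is conventionally computed as the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below assumes this standard definition; the scoring tool actually used for the results above is not stated, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_unit in enumerate(reference, start=1):
        current = [i]
        for j, hyp_unit in enumerate(hypothesis, start=1):
            cost = 0 if ref_unit == hyp_unit else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """CER in percent over a list of reference/hypothesis string pairs."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


def relative_reduction(baseline, new):
    """Relative reduction in percent of `new` with respect to `baseline`."""
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    print(character_error_rate(["一种信念"], ["一种新年"]))
    print(relative_reduction(28.0, 26.64))
```

With the figures reported in this paper, `relative_reduction(28.0, 26.64)` evaluates to roughly 4.86, in line with the relative improvement of 4.8% quoted in the abstract.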

4.4 Comparison with previous works

In Table 6, we compare our experimental results with other model architectures from the literature on the HKUST datasets. First, our best results with the different modeling units are comparable or superior to the best result of the deep multidimensional residual learning with 9 LSTM layers [21], which is a hybrid LSTM-HMM system with CD-states as its modeling unit. The best CER of the character based model with the Transformer on the HKUST datasets is lower than the best CER of the deep multidimensional residual learning with 9 LSTM layers. This shows the superiority of the sequence-to-sequence attention-based model over the hybrid LSTM-HMM system.

Moreover, our best results with the modeling units of words, sub-words and characters are superior to the existing best CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network with a separate RNN-LM integration [11], which is the state of the art on the HKUST datasets to the best of our knowledge. The character based model with the Transformer establishes a new state-of-the-art CER of 26.64% on the HKUST datasets without a hand-designed lexicon or extra language model integration. This is a relative reduction in CER compared to the joint CTC-attention based encoder-decoder network when no external language model is used, and a 4.8% relative reduction in CER compared to the existing best CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network with a separate RNN-LM [11].

| model | CER |
|---|---|
| LSTMP-9800P512-F444 [21] | |
| CTC-attention+joint dec. (speed perturb., one-pass) | |
| +VGG net | |
| +RNN-LM (separate) [11] | 28.0 |
| CI-phonemes-D1024-H16 [9] | 30.65 |
| Syllables-D1024-H16 (speed perturb) [9] | 28.77 |
| Words-D1024-H16 (speed perturb) | 27.42 |
| Sub-words-D1024-H16 (speed perturb) | 27.26 |
| Characters-D1024-H16 (speed perturb) | 26.64 |

Table 6: CER (%) on HKUST datasets compared to previous works.

5 Conclusions

In this paper we compared five modeling units on Mandarin Chinese ASR tasks using a sequence-to-sequence attention-based model with the Transformer: CI-phonemes, syllables, words, sub-words and characters. We experimentally verified that the lexicon-free modeling units, i.e. words, sub-words and characters, can outperform the lexicon-related modeling units, i.e. CI-phonemes and syllables, on the HKUST datasets. This suggests that sequence-to-sequence attention-based models may remove the need for a hand-designed lexicon on Mandarin Chinese ASR tasks. Among the five modeling units, the character based model achieves the best result and establishes a new state-of-the-art CER of 26.64% on the HKUST datasets without a hand-designed lexicon or extra language model integration, which corresponds to a 4.8% relative improvement over the existing best CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network. Moreover, we find that the sub-word based model with the Transformer, encoded by BPE, achieves a promising result, although it is slightly worse than its character based counterpart.

6 Acknowledgements

The authors would like to thank Chunqi Wang and Feng Wang for insightful discussions.

References