Analysis of Multilingual Sequence-to-Sequence speech recognition systems

11/07/2018 ∙ by Martin Karafiat, et al.

This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set composed of Babel data, we first show the effectiveness of multi-lingual training with stacked bottle-neck (SBN) features. Then we explore various architectures and training strategies of multi-lingual seq2seq models based on CTC-attention networks including combinations of output layer, CTC and/or attention component re-training. We also investigate the effectiveness of language-transfer learning in a very low resource scenario when the target language is not included in the original multi-lingual training data. Interestingly, we found multilingual features superior to multilingual models, and this finding suggests that we can efficiently combine the benefits of the HMM system with the seq2seq system through these multilingual feature techniques.




1 Introduction

The sequence-to-sequence (seq2seq) model proposed in [1, 2, 3] is a neural network (NN) architecture for performing sequence classification. Later, it was also adopted to perform speech recognition [4, 5, 6]. The model allows the main blocks of ASR (acoustic model, alignment model and language model) to be integrated into a single neural network architecture. Recent ASR advancements in connectionist temporal classification (CTC) [6, 5] and attention-based [4, 7] approaches have generated significant interest in the speech community in using seq2seq models. However, outperforming conventional hybrid RNN/DNN-HMM models with seq2seq requires a huge amount of data [8]. Intuitively, this is due to the range of roles this model needs to perform: alignment and language modeling along with acoustic-to-character label mapping.

Multilingual approaches have been used in hybrid RNN/DNN-HMM systems for tackling the problem of low-resource data. These include language adaptive training and shared layer retraining [9]. Parameter sharing investigated in our previous work [10] seems to be the most beneficial.

Existing multilingual approaches for seq2seq modeling mainly focus on CTC. A multilingual CTC proposed in [11] uses a universal phone set, FST decoder and language model. The authors also use a linear hidden unit contribution (LHUC) [12] technique to rescale the hidden unit outputs for each language as a way to adapt to a particular language. Another work [13] on multilingual CTC shows the importance of language adaptive vectors as auxiliary input to the encoder in a multilingual CTC model. The decoder used there is based on a simple greedy search that applies $\arg\max$ on every time frame. An extensive analysis of multilingual CTC performance with limited data is performed in [14]. Here, the authors use a word-level FST decoder integrated with CTC during decoding.

On a similar front, attention models are explored within a multilingual setup in [15, 16], where an attempt was made to build an attention-based seq2seq model from multiple languages. The data is simply pooled together, assuming the target languages are seen during training. Although our prior study [17] performs a preliminary investigation of transfer learning techniques to address languages unseen during training, it is not an extensive study covering various multilingual techniques.

In this paper, we explore the multilingual training approaches [18, 19] in hybrid RNN/DNN-HMMs and we incorporate them into the seq2seq models. In our recent work [19], we showed the multilingual acoustic models (BLSTM particularly) to be superior to multilingual acoustic features in RNN/DNN-HMM systems. Consequently, similar experiments are performed in this paper on a sequence-to-sequence scheme.

The main motivations and contributions of this work are as follows:

  • To incorporate the existing multilingual approaches in a joint CTC-attention [20] framework.

  • To compare various multilingual approaches: multilingual features, model architectures, and transfer learning.

2 Sequence-to-Sequence Model

In this work, we use the attention-based approach [2] as it provides an effective methodology to perform sequence-to-sequence training. Considering the limitations of attention in performing monotonic alignment [21, 22], we choose to use the CTC loss function to aid the attention mechanism in both training and decoding.

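Let $X = (x_t \mid t = 1, \dots, T)$ be a $T$-length speech feature sequence and $C = (c_l \mid l = 1, \dots, L)$ be an $L$-length grapheme sequence. A multi-objective learning framework $\mathcal{L}_{\mathrm{MOL}}$ proposed in [20] is used in this work to unify the attention loss $p^{*}_{\mathrm{att}}(C|X)$ and the CTC loss $p_{\mathrm{ctc}}(C|X)$ with a linear interpolation weight $\lambda$, as follows:

$$\mathcal{L}_{\mathrm{MOL}} = \lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p^{*}_{\mathrm{att}}(C|X) \tag{1}$$

The unified model benefits from both effective sequence-level training and the monotonic alignment afforded by the CTC loss.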

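$p^{*}_{\mathrm{att}}(C|X)$ represents the posterior probability of the character label sequence $C$ w.r.t. the input sequence $X$ based on the attention approach, which is decomposed with the probabilistic chain rule, as follows:

$$p^{*}_{\mathrm{att}}(C|X) \approx \prod_{l=1}^{L} p(c_l \mid c^{*}_1, \dots, c^{*}_{l-1}, X) \tag{2}$$

where $c^{*}_l$ denotes the ground truth history. A detailed explanation of the attention mechanism is given later.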

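Similarly, $p_{\mathrm{ctc}}(C|X)$ represents the CTC posterior probability:

$$p_{\mathrm{ctc}}(C|X) \approx \sum_{Z \in \mathcal{Z}(C)} p(Z|X) \tag{3}$$

where $Z = (z_t \mid t = 1, \dots, T)$ is a CTC state sequence composed of the original grapheme set and the additional blank symbol, and $\mathcal{Z}(C)$ is the set of all possible sequences given the character sequence $C$.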
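The sum over CTC state sequences can be illustrated with a small NumPy sketch. It assumes frame-level posteriors are already available as a matrix `probs` (a hypothetical stand-in for the encoder output after a softmax) and that the state sequence factorizes over frames; the function names are illustrative and not taken from the paper or from any toolkit. The forward recursion is compared against a brute-force enumeration of all paths, which is only feasible for toy sizes.

```python
import itertools
import numpy as np

def ctc_posterior(probs, labels, blank=0):
    """Sum p(Z|X) over all CTC paths Z that collapse to `labels`.

    probs[t, k] is the (assumed) posterior of symbol k at frame t.
    """
    T = probs.shape[0]
    # Extended label sequence: blank, c1, blank, c2, ..., blank
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # A blank may be skipped only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    result = alpha[T - 1, S - 1]
    if S > 1:
        result += alpha[T - 1, S - 2]
    return float(result)

def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for z in path:
        if z != prev and z != blank:
            out.append(z)
        prev = z
    return out

def ctc_posterior_bruteforce(probs, labels, blank=0):
    """Enumerate every path explicitly (toy sizes only)."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == list(labels):
            total += float(np.prod([probs[t, z] for t, z in enumerate(path)]))
    return total

# Toy posteriors: 4 frames, 3 symbols (index 0 is the blank)
rng = np.random.default_rng(0)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
fast = ctc_posterior(probs, [1, 2])
slow = ctc_posterior_bruteforce(probs, [1, 2])
```

Both functions should agree up to floating-point error; the recursion simply avoids enumerating the exponentially many paths.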

The following paragraphs explain the encoder, attention decoder, CTC, and joint decoding used in our approach.


Encoder:

In our approach, both CTC and attention use the same encoder function:

$$h_t = \mathrm{Encoder}(X) \tag{4}$$

where $h_t$ is an encoder output state at frame $t$. As $\mathrm{Encoder}(\cdot)$, we use a bidirectional LSTM (BLSTM).

Attention Decoder:

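Location-aware attention mechanism [23] is used in this work. The output of location-aware attention is:

$$a_{lt} = \mathrm{LocationAttention}\left(\{a_{l-1,t'}\}_{t'=1}^{T}, q_{l-1}, h_t\right) \tag{5}$$

Here, $a_{lt}$ acts as the attention weight, $q_{l-1}$ denotes the decoder hidden state, and $h_t$ is the encoder output state defined in (4). The location-attention function is given by a convolution $*$ with kernel $K$ and maps the attention weight of the previous label $a_{l-1}$ to a multi-channel view $f_t$ for better representation:

$$f_t = K * a_{l-1} \tag{6}$$

$$e_{lt} = g^{\top} \tanh\left(\mathrm{Lin}(q_{l-1}) + \mathrm{Lin}(h_t) + \mathrm{LinB}(f_t)\right) \tag{7}$$

$$a_{lt} = \mathrm{Softmax}\left(\{e_{lt}\}_{t=1}^{T}\right) \tag{8}$$

Here, (7) provides the unnormalized attention vectors computed with the learnable vector $g$, linear transformation $\mathrm{Lin}(\cdot)$, and affine transformation $\mathrm{LinB}(\cdot)$. Normalized attention weights are obtained in (8) by a standard softmax operation. Finally, the context vector $r_l$ is obtained as a weighted sum of the encoder output states $h_t$ over all frames, with the attention weight:

$$r_l = \sum_{t=1}^{T} a_{lt} h_t \tag{9}$$

The decoder function is an LSTM layer which decodes the next character output label $c_l$ from the previous label $c_{l-1}$, the hidden state $q_{l-1}$ of the decoder and the attention output $r_l$:

$$p(c_l \mid c_1, \dots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}) \tag{10}$$

This equation is incrementally applied to form $p^{*}_{\mathrm{att}}$ in (2).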
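One step of location-aware attention can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the parameter names (`K`, `W_q`, `W_h`, `W_f`, `b`, `g`), all dimensions, and the random initialization are hypothetical, and the convolution uses "same" padding so that the output keeps one value per frame.

```python
import numpy as np

def location_aware_attention(a_prev, q_prev, h, params):
    """One attention step over T encoder states.

    a_prev: (T,)    attention weights of the previous label
    q_prev: (Dq,)   previous decoder hidden state
    h:      (T, Dh) encoder output states
    Returns the new attention weights (T,) and the context vector (Dh,).
    """
    K, W_q, W_h, W_f, b, g = (params[k] for k in ("K", "W_q", "W_h", "W_f", "b", "g"))
    # Multi-channel view of the previous weights: one convolution per kernel
    f = np.stack([np.convolve(a_prev, k, mode="same") for k in K], axis=1)  # (T, C)
    # Unnormalised scores: g^T tanh(Lin(q) + Lin(h_t) + LinB(f_t))
    e = np.tanh(q_prev @ W_q + h @ W_h + f @ W_f + b) @ g  # (T,)
    # Softmax over frames (shifted for numerical stability)
    w = np.exp(e - e.max())
    a = w / w.sum()
    # Context vector: attention-weighted sum of encoder states
    r = a @ h
    return a, r

# Hypothetical toy dimensions
T, Dh, Dq, Da, C, width = 6, 4, 3, 5, 2, 3
rng = np.random.default_rng(0)
params = {
    "K": rng.normal(size=(C, width)),
    "W_q": rng.normal(size=(Dq, Da)),
    "W_h": rng.normal(size=(Dh, Da)),
    "W_f": rng.normal(size=(C, Da)),
    "b": np.zeros(Da),
    "g": rng.normal(size=Da),
}
a_prev = np.full(T, 1.0 / T)  # uniform weights before the first label
a, r = location_aware_attention(a_prev, rng.normal(size=Dq), rng.normal(size=(T, Dh)), params)
```

The returned weights `a` would be passed back in as `a_prev` when decoding the next label, which is what lets the convolution see where the model attended previously.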

Connectionist temporal classification (CTC):

Unlike the attention approach, CTC does not use any specific decoder network. Instead, it invokes two important components to perform character-level training and decoding. The first is an RNN-based encoding module $p(Z|X)$. The second contains a language model and a state transition module. The CTC formalism is a special case [6] of the hybrid DNN-HMM framework with the Bayes rule applied to obtain $p(C|X)$.

Joint decoding:

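Once we have both CTC and attention-based seq2seq models trained, both are jointly used for decoding as below:

$$\log p_{\mathrm{hyp}}(c_l \mid c_1, \dots, c_{l-1}, X) = \alpha \log p_{\mathrm{ctc}}(c_l \mid c_1, \dots, c_{l-1}, X) + (1 - \alpha) \log p_{\mathrm{att}}(c_l \mid c_1, \dots, c_{l-1}, X) \tag{11}$$

Here, $\log p_{\mathrm{hyp}}$ is the final score used during beam search. $\alpha$ controls the weight between the attention and CTC models. $\alpha$ and the multi-task learning weight $\lambda$ in (1) are set differently in our experiments.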
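The effect of the decoding weight can be seen in a small sketch that ranks candidate extensions of a beam hypothesis by a weighted sum of CTC and attention log-probabilities. The candidate labels and their probabilities below are made-up toy values, and `joint_score` and `best_candidate` are hypothetical helper names; a real beam search would also carry the score of the prefix.

```python
import math

def joint_score(log_p_ctc, log_p_att, alpha):
    """Weighted combination of CTC and attention log-probabilities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * log_p_ctc + (1.0 - alpha) * log_p_att

def best_candidate(candidates, alpha):
    """Pick the label with the highest joint score."""
    return max(candidates, key=lambda c: joint_score(c[1], c[2], alpha))[0]

# Toy candidates: (label, log p_ctc, log p_att)
candidates = [
    ("a", math.log(0.6), math.log(0.2)),
    ("b", math.log(0.3), math.log(0.7)),
]

ctc_only = best_candidate(candidates, alpha=1.0)   # CTC prefers "a"
att_only = best_candidate(candidates, alpha=0.0)   # attention prefers "b"
mixed = best_candidate(candidates, alpha=0.3)
```

With these toy numbers the two models disagree, and the choice of the weight decides which one wins.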

3 Data

The experiments are conducted using the BABEL speech corpus collected during the IARPA Babel program. The corpus is mainly composed of conversational telephone speech (CTS), but some scripted recordings and far-field recordings are present as well. Table 1 presents the details of the languages used for training and evaluation in this work. We also evaluate on training languages to see the effect of multilingual training on them. Therefore, Tok Pisin and Georgian from the “train” languages and Assamese and Swahili from the “target” languages are taken for evaluation.

Usage    Language      Train                  Eval                   # of
                       # spkrs.   # hours     # spkrs.   # hours     characters
Train    Cantonese     952        126.73      120        17.71       3302
         Bengali       720        55.18       121        9.79        66
         Pashto        959        70.26       121        9.95        49
         Turkish       963        68.98       121        9.76        66
         Vietnamese    954        78.62       120        10.9        131
         Haitian       724        60.11       120        10.63       60
         Tamil         724        62.11       121        11.61       49
         Kurdish       502        37.69       120        10.21       64
         Tok Pisin     482        35.32       120        9.88        55
         Georgian      490        45.35       120        12.30       35
Target   Assamese      720        54.35       120        9.58        66
         Swahili       491        40.0        120        10.58       56

Table 1: Details of the BABEL data used for experiments.

4 Sequence to sequence model setup

The baseline systems are built on 80-dimensional Mel-filter bank (fbank) features extracted from the speech samples using a sliding window of 25 ms with a 10 ms stride. The KALDI toolkit [24] is used to perform the feature processing. The “fbank” features are then fed to a seq2seq model with the following configuration:

The Bi-RNN [25] models mentioned above use an LSTM [26] cell followed by a projection layer (BLSTMP). In our experiments below, we use only a character-level seq2seq model based on CTC and attention, as mentioned above. Thus, in the following experiments, we use character error rate (% CER) as the measure of model performance. All models are trained in ESPnet, an end-to-end speech processing toolkit [27].
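For reference, character error rate can be computed as the edit distance between reference and hypothesis character sequences, divided by the number of reference characters. The sketch below is a generic illustration rather than the scoring script used in the experiments; in particular, treating spaces as ordinary characters and pooling errors over all utterances are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Character error rate in percent, pooled over all utterances."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total

# Toy example: one deletion over five reference characters
example_cer = cer(["abc", "de"], ["abc", "d"])
```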

5 Multilingual features

Multilingual features are trained separately from the seq2seq model according to a setup from our previous RNN/DNN-HMM work [19]. This allows us to easily combine traditional DNN techniques, such as GMM-based alignments for NN target estimation, phoneme units and frame-level randomization, with the seq2seq model. Multilingual features incorporate additional knowledge from non-target languages into the features, which could better guide the seq2seq model.

5.1 Stacked Bottle-Neck feature extraction

The original idea of Stacked Bottle-Neck feature extraction is described in [28]. The scheme consists of two NN stages: the first reads a short temporal context; its output is stacked, down-sampled, and fed into the second NN, which reads longer temporal information.

The first-stage bottle-neck NN input features are 24 log Mel filter bank outputs concatenated with fundamental frequency features. Conversation-side based mean subtraction is applied and 11 consecutive frames are stacked. A Hamming window followed by a discrete cosine transform (DCT) retaining coefficients 0 to 5 is applied on the time trajectory of each parameter, resulting in 37×6 = 222 coefficients at the first-stage NN input.
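The dimensionality of the first-stage input can be checked with a short NumPy sketch. It assumes 37 parameters per frame (24 filter bank outputs plus 13 fundamental-frequency-related values, a split inferred only from 222 / 6 = 37), an unnormalised DCT-II basis, and hypothetical function names; the actual feature pipeline may differ in these details.

```python
import numpy as np

def dct_basis(n_frames, n_coeffs):
    """Unnormalised DCT-II basis, shape (n_coeffs, n_frames)."""
    n = np.arange(n_frames)
    k = np.arange(n_coeffs)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_frames))

def first_stage_input(feats, center, context=5, n_coeffs=6):
    """Compress the temporal trajectory of every parameter around `center`.

    feats: (T, D) mean-normalised per-frame features.
    `center` must satisfy context <= center < T - context.
    Returns a vector of D * n_coeffs values.
    """
    window = feats[center - context:center + context + 1]     # (11, D)
    weighted = window * np.hamming(window.shape[0])[:, None]  # Hamming-weighted trajectories
    coeffs = dct_basis(window.shape[0], n_coeffs) @ weighted  # (6, D): DCT coefficients 0..5
    return coeffs.T.reshape(-1)  # coefficients of each parameter kept together

# Toy features: 50 frames of 37 parameters (random stand-ins)
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 37))
nn_input = first_stage_input(feats, center=20)
```

Keeping only six DCT coefficients per trajectory reduces the 11 × 37 = 407 stacked values to 222 while preserving the slow temporal variation.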

In this work, the first-stage NN has 4 hidden layers with 1500 units in each except the bottle-neck (BN) one. The BN layer has 80 neurons. The neurons in the BN layer have linear activations, as found optimal in [29]. 21 consecutive frames from the first-stage NN are stacked, down-sampled (every 5th frame is taken) and fed into the second-stage NN, whose architecture is similar to the first-stage NN except for a BN layer with only 30 neurons. Both neural networks were trained jointly, as suggested in [29], in the CNTK toolkit [30] with a block-softmax final layer [31]. Context-independent phoneme states are used as the training targets for the feature-extraction NN; otherwise the size of the final layer would be prohibitive.
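The stacking between the two stages can be sketched as below. The reading of "21 consecutive frames, down-sampled by 5" as the frame offsets -10, -5, 0, +5, +10 is an assumption, as are the function name and the random stand-in for the first-stage outputs; under that reading the second-stage input has 5 × 80 = 400 values.

```python
import numpy as np

def second_stage_input(bn, center, span=10, step=5):
    """Stack first-stage BN outputs at sub-sampled offsets around `center`.

    bn: (T, 80) first-stage bottle-neck outputs.
    `center` must satisfy span <= center < T - span.
    """
    offsets = np.arange(-span, span + 1, step)  # [-10, -5, 0, 5, 10]
    return bn[center + offsets].reshape(-1)

# Toy first-stage outputs: 100 frames of 80-dimensional BN features
rng = np.random.default_rng(0)
bn = rng.normal(size=(100, 80))
stacked = second_stage_input(bn, center=50)
```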

Finally, the BN outputs from the second-stage NN are used as features for further experiments and are denoted as “Mult11-SBN”.

5.2 Results

Figure 1 presents the performance curve of the seq2seq model for the four “train” and “target” languages discussed in Section 3 as the amount of training data is varied. It shows a significant performance drop for the baseline “fbank”-based systems when the amount of training data is reduced.

On the other hand, the multilingual features present: 1) significantly smaller performance degradation than baseline “fbank” features on small amounts of training data. 2) consistent improvement on both train (seen) and target (unseen) languages where we only use train (seen) languages in feature extractor training data. 3) significant improvement even on the full training set, i.e., 1.6%-5.0% absolute (Table 2 summarizes the full training set results).

Figure 1: Monolingual models trained on top of multilingual features.
Features      Swahili   Amharic   Tok Pisin   Georgian
FBANK         28.6      45.3      32.2        34.8
Mult11-SBN    26.4      40.4      26.8        33.2

Table 2: Monolingual models trained on top of multilingual features.

6 Multilingual models

Next, we focus on the training of multilingual seq2seq models. As our models are character-based, the multilingual training dictionary is created by concatenating the character sets of all train languages, and the system is trained in the same way as the monolingual one on the concatenated data.
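Building such a shared dictionary amounts to taking the union of the characters seen in all training transcripts. The sketch below is a hedged illustration: the special symbols, the language codes, and the toy transcripts are hypothetical, and the real recipe may order or filter characters differently.

```python
def build_char_vocab(transcripts_by_lang, specials=("<blank>", "<unk>", "<eos>")):
    """Map every character seen in any language to an integer id.

    transcripts_by_lang: dict from language name to a list of transcripts.
    The special symbols are placeholders and occupy the first ids.
    """
    chars = set()
    for transcripts in transcripts_by_lang.values():
        for text in transcripts:
            chars.update(text)  # adds each character, including spaces
    symbols = list(specials) + sorted(chars)
    return {symbol: index for index, symbol in enumerate(symbols)}

# Toy data: two languages sharing the character "b"
vocab = build_char_vocab({"lang1": ["ab"], "lang2": ["bc"]})
```

Characters shared between languages receive a single id, so languages with overlapping scripts share output units while languages with disjoint scripts simply enlarge the output layer.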

6.1 Direct decoding from multilingual NN

As the multilingual net is trained to convert a sequence of input features into a sequence of output characters, any language from the training set, or an unknown language with a compatible set of characters, can be directly decoded. Naturally, characters from the wrong language can be generated, as the system also needs to perform language identification (LID). Adding LID information as an additional feature, similarly to [32], complicates the system. Therefore, we experimented with “fine-tuning” the system to the target language by running a few epochs only on the desired language data. This strengthens the target language characters and therefore makes the system less prone to language- and character-set-mismatch errors.

The first two rows of Table 3 show a significant performance degradation from monolingual to multilingual seq2seq models, caused by characters from the wrong language being output in about 20% of test utterances. However, no out-of-language characters are observed after “fine-tuning”, and absolute improvements of 4.7% (Tok Pisin) and 1.5% (Georgian) over the monolingual baseline are reached.

Model                      Tok Pisin   Georgian
Monolingual                32.2        34.8
Multilingual               37.2        51.1
Multilingual-fine tuned    27.5        33.3

Table 3: Multilingual fine tuning of seq2seq model.
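The share of utterances affected by wrong-language output can be measured with a simple check against the target character set. This is an illustrative sketch with made-up hypotheses and a hypothetical function name, not the analysis script behind the figure quoted above.

```python
def out_of_language_rate(hyps, target_chars):
    """Percentage of hypotheses containing a character outside `target_chars`."""
    flagged = sum(1 for hyp in hyps if any(ch not in target_chars for ch in hyp))
    return 100.0 * flagged / len(hyps)

# Toy example: one of five hypotheses contains a foreign character ("x")
rate = out_of_language_rate(["abc", "abx", "a", "b", "c"], set("abc"))
```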

As mentioned above, the multilingual NN can be fine-tuned to the target language if its character set is compatible with the training set. Figure 2 shows results on Swahili, which is not part of the training set. As in the experiments with multilingual features in Figure 1, the multilingual seq2seq systems are especially effective on small amounts of data, but they also beat the baseline models on the full 50h language set.

Figure 2: Fine-tuning of multilingual NN on Swahili.

6.2 Language-Transfer learning

Language-transfer learning is necessary if the target language characters differ from those of the training set. The whole process can be described in three steps: 1) randomly initialize the output layer parameters, 2) train only the new parameters while freezing the remaining ones, and 3) “fine-tune” the whole NN. Experiments are conducted on which output-side parameters are re-initialized: the output softmax (Out), the attention (Att), and the CTC parts. Table 4 compares all combinations and shows that retraining only the output softmax gives the best or near-best result for every language.
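The three steps can be sketched with a plain dictionary of parameter arrays. Everything here is hypothetical: the parameter names, the shapes, the initialization scale, and the choice of which parameters count as "new" (which is exactly the variable explored in the experiments). A real system would apply the same bookkeeping to the optimizer of its training toolkit.

```python
import numpy as np

def prepare_transfer(params, new_shapes, rng, scale=0.01):
    """Step 1: replace the listed parameters with random arrays of the new shape."""
    adapted = dict(params)  # other parameters are carried over unchanged
    for name, shape in new_shapes.items():
        adapted[name] = rng.normal(0.0, scale, size=shape)
    return adapted

def trainable_names(params, new_names, stage):
    """Steps 2 and 3: which parameters receive gradient updates."""
    if stage == "new_only":    # step 2: everything else stays frozen
        return sorted(new_names)
    if stage == "fine_tune":   # step 3: the whole network is updated
        return sorted(params)
    raise ValueError(f"unknown stage: {stage}")

# Toy multilingual model with a 50-character output layer (made-up sizes)
rng = np.random.default_rng(0)
params = {
    "encoder.w": rng.normal(size=(4, 4)),
    "att.w": rng.normal(size=(4, 4)),
    "dec.out.w": rng.normal(size=(4, 50)),
}
# Target language with 56 characters: only the output layer is re-initialised
new_shapes = {"dec.out.w": (4, 56)}
adapted = prepare_transfer(params, new_shapes, rng)
stage2 = trainable_names(adapted, new_shapes, "new_only")
stage3 = trainable_names(adapted, new_shapes, "fine_tune")
```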

Language transfer   Swahili   Amharic   Tok Pisin   Georgian
                    %CER      %CER      %CER        %CER
Monoling.           28.6      45.3      32.2        34.8
Out                 27.4      41.2      27.7        33.6
Att+Out             27.5      41.2      28.3        34.2
CTC+Out             27.6      41.2      27.9        33.7
Att+CTC+Out         28.0      42.1      27.6        34.1

Table 4: Multilingual Language Transfer

Figure 3: Comparison of various multilingual approaches on Swahili.

Finally, Figure 3 summarizes the comparison between multilingual features for the seq2seq model and language-transfer learning of the multilingual seq2seq model. Interestingly, contrary to our previous observations on DNN-HMM systems [19], we found multilingual features superior to language-transfer learning in the seq2seq case.

7 Conclusions

We have presented various multilingual approaches in seq2seq systems, including multilingual features and multilingual models, by leveraging our multilingual DNN-HMM expertise. In contrast to our findings for DNN-HMM systems [19], multilingual features are more effective than multilingual models in seq2seq systems. This is probably due to the efficient fusion of two complementary approaches in the final system: explicit GMM-HMM alignment incorporated in the BN features, and seq2seq models. With this finding, we will further explore efficient combinations of the DNN-HMM and seq2seq systems in future work.


  • [1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [4] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
  • [5] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, 2014, vol. 14, pp. 1764–1772.
  • [6] Alex Graves, “Supervised sequence labelling,” in Supervised sequence labelling with recurrent neural networks, pp. 5–13. Springer, 2012.
  • [7] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
  • [8] Andrew Rosenberg, Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, and Michael Picheny, “End-to-end speech recognition and keyword search on low-resource languages,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5280–5284.
  • [9] Sibo Tong, Philip N Garner, and Hervé Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” Tech. Rep., 2017.
  • [10] Martin Karafiát, Murali Karthick Baskar, Pavel Matějka, Karel Veselý, František Grézl, and Jan Černocký, “Multilingual BLSTM and speaker-specific vector adaptation in 2016 BUT Babel system,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 637–643.
  • [11] Sibo Tong, Philip N Garner, and Hervé Bourlard, “Multilingual training and cross-lingual adaptation on CTC-based acoustic model,” arXiv preprint arXiv:1711.10025, 2017.
  • [12] Pawel Swietojanski and Steve Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 171–176.
  • [13] Markus Müller, Sebastian Stüker, and Alex Waibel, “Language adaptive multilingual CTC speech recognition,” in International Conference on Speech and Computer. Springer, 2017, pp. 473–482.
  • [14] Siddharth Dalmia, Ramon Sanabria, Florian Metze, and Alan W. Black, “Sequence-based multi-lingual low resource speech recognition,” in ICASSP. 2018, pp. 4909–4913, IEEE.
  • [15] Shinji Watanabe, Takaaki Hori, and John R Hershey, “Language independent end-to-end architecture for joint language identification and speech recognition,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 265–271.
  • [16] Shubham Toshniwal, Tara N Sainath, Ron J Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao, “Towards language-universal end-to-end speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • [17] Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, and Takaaki Hori, “Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling,” in IEEE Workshop on Spoken Language Technology (SLT), 2018.
  • [18] Zoltan Tuske, David Nolden, Ralf Schluter, and Hermann Ney, “Multilingual mrasta features for low-resource keyword search and speech recognition systems,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7854–7858.
  • [19] Martin Karafiát, K. Murali Baskar, Pavel Matějka, Karel Veselý, František Grézl, and Jan Černocký, “Multilingual BLSTM and speaker-specific vector adaptation in 2016 BUT Babel system,” in Proceedings of SLT 2016. 2016, pp. 637–643, IEEE Signal Processing Society.
  • [20] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [21] Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel, “Self-attentional acoustic models,” in 19th Annual Conference of the International Speech Communication Association (InterSpeech 2018), 2018.
  • [22] Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” CoRR, vol. abs/1712.05382, 2017.
  • [23] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems. 2015, vol. 2015-January, pp. 577–585, Neural information processing systems foundation.
  • [24] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in Automatic Speech Recognition and Understanding, 2011 IEEE Workshop on. IEEE, 2011, pp. 1–4.
  • [25] Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [26] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [27] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “ESPnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
  • [28] Martin Karafiát, František Grézl, Mirko Hannemann, Karel Veselý, Igor Szoke, and Jan “Honza” Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Proceedings of Interspeech 2014, Singapore, September 2014.
  • [29] Karel Veselý, Martin Karafiát, and František Grézl, “Convolutive bottleneck network features for LVCSR,” in Proceedings of ASRU 2011, 2011, pp. 42–47.
  • [30] Amit Agarwal et al., “An introduction to computational networks and the computational network toolkit,” Tech. Rep. MSR-TR-2014-112, August 2014.
  • [31] Karel Veselý, Martin Karafiát, František Grézl, Miloš Janda, and Ekaterina Egorova, “The language-independent bottleneck features,” in Proceedings of IEEE 2012 Workshop on Spoken Language Technology. 2012, pp. 336–341, IEEE Signal Processing Society.
  • [32] Suyoun Kim and Michael L. Seltzer, “Towards language-universal end-to-end speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.