1 Introduction
Nowadays deep neural network hidden Markov models (DNN-HMMs) [1, 2] provide effective performance in speech recognition: concrete applications range from mobile voice search, transcription of broadcast news, videos or conversations, to recognition in noisy environments [6, 7, 8].
The availability of large training corpora for a given application domain makes it possible to train a DNN with many layers and parameters, improving classification performance. Conversely, in the absence of sufficient training data, e.g. for under-resourced languages, the number of DNN parameters that can be reliably estimated is greatly reduced and, consequently, classification performance is not always satisfactory. Recognition of children's speech is an application domain often characterized by a shortage of training data, even for major languages.
As an alternative to training all DNN parameters from scratch, adapting an existing DNN with the available small data set is a viable approach. This has been investigated in [9, 10], where an initial DNN trained on adult speakers is adapted using a limited set of children's data.
Another approach to addressing the lack of training data is multi-task learning. This approach has been demonstrated effective for multi-lingual speech recognition, especially when the size of the training data for each language is small [11, 12, 13, 14, 15]. The reason is that the shared hidden layers of the DNN used to estimate the emission probabilities of HMM states in a hybrid Automatic Speech Recognition (ASR) system are language independent if the DNN itself is trained on multi-lingual data. This DNN can be used to initialize a new one, which is then trained only with data of the target language. When the size of the training data is small, only a subset of the connection weights, usually those of the output layer, is re-estimated. This training procedure is often called transfer learning, to indicate that an initial set of learned parameters is transferred to the final acoustic model used by the ASR system.
In this work we address the problem of automatic speech recognition of children speaking a non-native language, specifically: (a) Italian students speaking both English and German, and (b) German students speaking English.
It is known that non-native speakers articulate sounds very differently from native ones, because they tend to use the phonology of their mother language, giving rise to two types of errors: mispronunciation, when they aim at a wrong target sound, and phonological interference, when they use their original set of phones. In the past, several approaches have been proposed to take into account the pronunciation errors of non-native speakers [17, 18], spanning from the usage of non-native pronunciation lexica [19, 20, 21, 22, 23] to acoustic model adaptation using either native or non-native data [24, 25, 26, 27].
As previously mentioned, we use transfer learning to adapt the multi-lingual DNN trained on native data from Italian, German and English children. Basically, only the weights of the output layer of the network are updated, through backpropagation, using data from non-native speakers of a given language, while the weights of the lower layers are frozen and remain unchanged during adaptation. In addition, we propose to use multi-lingual data in the adaptation phase as well, i.e. to update the weights of the output layer of the original DNN with all available non-native data.
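The output-layer-only update described above can be illustrated with a minimal numpy sketch. All sizes, features and labels below are hypothetical stand-ins for the real fMLLR features and HMM-state targets; the actual systems are trained with a full ASR toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): input features, hidden units, HMM states, frames.
n_in, n_hid, n_out, n_frames = 20, 32, 5, 400

# "Pre-trained" network: hidden weights are frozen during adaptation.
W_hid = rng.normal(scale=0.5, size=(n_in, n_hid))   # frozen
W_out = rng.normal(scale=0.1, size=(n_hid, n_out))  # re-estimated
W_hid_before = W_hid.copy()

# Synthetic "non-native adaptation" frames with state labels.
X = rng.normal(size=(n_frames, n_in))
y = rng.integers(0, n_out, size=n_frames)

def forward(X):
    H = np.tanh(X @ W_hid)              # frozen hidden representation
    logits = H @ W_out
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)   # softmax over HMM states
    return H, P

def cross_entropy(P, y):
    return float(-np.log(P[np.arange(len(y)), y] + 1e-12).mean())

_, P = forward(X)
loss_before = cross_entropy(P, y)

# Adaptation: gradient steps on the output layer only; W_hid never changes.
lr = 0.1
for _ in range(300):
    H, P = forward(X)
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0      # d(loss)/d(logits)
    W_out -= lr * (H.T @ G) / len(y)    # update restricted to W_out

_, P = forward(X)
loss_after = cross_entropy(P, y)
```

Because the hidden layers are fixed, this reduces to a convex (multinomial logistic regression) problem on the frozen hidden features, which is why a small adaptation set suffices.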
The novelties of this work are:
- the application of both multi-task learning and transfer learning to the recognition of the voices of children speaking a foreign language;
- the usage of (non-native) multi-lingual data for updating the weights of the original transferred DNN.
Experimental results reported in the paper show that: (a) the multi-lingual DNN yields performance on native data similar to that achieved with mono-lingual networks, i.e. networks trained only with data from a single language; (b) the multi-lingual DNN performs significantly better on non-native data than the mono-lingual DNNs; (c) the usage of multi-lingual data in the adaptation process further improves performance on non-native data.
This paper is organized as follows: Section 2 describes the experimental data used for testing the proposed approaches; Section 3 gives details of the acoustic models, language models and IPA-based lexicon used in the multi-lingual ASR system employed in this work; Section 4 reports experiments and related results. Finally, Section 5 concludes the paper, presenting directions for future work.
2 Speech Data
In this work we exploited speech data collected within the European funded project PF-Star (2002-2004). During the PF-Star project, a considerable amount of speech data was collected from English, German, Italian and Swedish children; for one of the Italian corpora see also . For the purposes of this work, speech produced by English, German and Italian children in the three languages was considered, as shown in Table 1. As already mentioned, data from native speakers were used to train Acoustic Models (AMs), while testing was carried out on both native and non-native data sets; all the children are in the age range 9-10. In addition, a non-native adaptation set was used in transfer learning; performance, measured in terms of Word Error Rate (WER), was computed on both native and non-native evaluation sets.
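WER, the metric used throughout the paper, is the Levenshtein (edit) distance between the recognized and reference word sequences, normalized by the reference length. A small self-contained sketch:

```python
def wer(ref, hyp):
    """Word Error Rate: word-level Levenshtein distance between the
    reference and the hypothesis, divided by the reference length."""
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution against a three-word reference -> WER = 1/3
example = wer("the cat sat".split(), "the bat sat".split())
```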
| |Italian|German|English|
|Italian|train + eval|ada + eval|ada + eval|
|German|–|train + eval|ada + eval|
|English|–|–|train + eval|
Table 1: rows indicate the children's mother tongue, columns the spoken language.
|mono-lingual native training corpora|
|mono-lingual native eval corpora|
Table 2 reports details about the mono-lingual training and eval data - in all cases, native children's speech - in terms of number of speakers, duration, number of running words and lexicon size. Table 3 presents statistics about the non-native speech data. In particular, Italian children produced both English and German speech, while German children produced English speech. In all cases, the speech data was split into an ada set (used to perform transfer learning) and an eval set (used for evaluation purposes only). There is no speaker overlap among the training, ada and eval sets.
|non-native ada corpora|
|non-native eval corpora|
3 ASR system
As mentioned in the Introduction, a multi-lingual DNN was first trained on data from native speakers.
3.1 Multi-lingual DNN
The acoustic models are trained following a standard DNN-HMM recipe: the preliminary HMM-GMM system is trained on the usual 13 mel-frequency cepstral coefficients (MFCCs), which are then mean/variance normalized; fMLLR-transformed coefficients are then estimated and used as input features for the DNN. The learning procedure features layer-wise pre-training based on Restricted Boltzmann Machines, per-frame cross-entropy training and sequence-discriminative training (lattice framework and state Minimum Bayes Risk criterion). Besides the mono-lingual DNNs trained on native German, Italian and English speech, a multi-lingual model is derived from a shared lexicon (see Section 3.3 for details): Table 4 reports the number of phonetic units used by the mono-lingual lexica as well as by the multi-lingual lexicon; it also reports the size of the output layer of the DNNs trained for mono-lingual and multi-lingual speech recognition.
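The mean/variance normalization step of the front-end can be sketched as follows. This is a per-utterance sketch over random stand-in MFCC frames; actual recipes may apply the normalization per speaker and then estimate fMLLR transforms on top of it.

```python
import numpy as np

def cmvn(feats):
    """Cepstral mean and variance normalization: each of the 13 MFCC
    dimensions is shifted to zero mean and scaled to unit variance
    over the frames of the utterance."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / np.maximum(std, 1e-8)

rng = np.random.default_rng(1)
utterance = rng.normal(loc=3.0, scale=2.0, size=(200, 13))  # fake MFCC frames
normalized = cmvn(utterance)
```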
3.2 Language Models
Since the focus of this paper is on acoustic modeling, we did not deal with Out-Of-Vocabulary (OOV) words, which would have complicated the analysis of the results. Also, to avoid using different LMs for native and non-native corpora, we decided to build a single LM for each language. For this reason, the lexicon of each language has to contain at least all the words included in the native eval, non-native ada and non-native eval sets. Given that the lexicon size is rather large (1289 to 5042 words), we decided to use a bigram LM with Witten-Bell smoothing, which ensures a reasonable perplexity over the ada + eval data, as reported in Table 5.
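As an illustration of the smoothing scheme, here is a sketch of interpolated Witten-Bell bigram estimation over a toy corpus. The actual LMs were of course built from the full training texts, and toolkit implementations may differ in back-off details.

```python
from collections import Counter

def witten_bell_bigram(sentences):
    """Interpolated Witten-Bell bigram estimates: the bigram ML estimate
    is interpolated with the unigram, with interpolation weight
    lambda(h) = c(h) / (c(h) + T(h)), where T(h) is the number of
    distinct words observed after history h."""
    unigrams = Counter()
    bigrams = Counter()
    followers = {}  # history -> set of distinct continuations
    for words in sentences:
        for w in words:
            unigrams[w] += 1
        for h, w in zip(words, words[1:]):
            bigrams[(h, w)] += 1
            followers.setdefault(h, set()).add(w)
    total = sum(unigrams.values())

    def prob(w, h):
        p_uni = unigrams[w] / total
        c_h = sum(c for (hh, _), c in bigrams.items() if hh == h)
        t_h = len(followers.get(h, ()))
        if c_h == 0:
            return p_uni  # unseen history: fall back to the unigram
        lam = c_h / (c_h + t_h)
        return lam * bigrams[(h, w)] / c_h + (1 - lam) * p_uni

    return prob, set(unigrams)

prob, vocab = witten_bell_bigram([
    "the cat sat".split(),
    "the cat ran".split(),
    "a dog ran".split(),
])
mass = sum(prob(w, "cat") for w in vocab)  # smoothed distribution sums to 1
```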
3.3 Lexicon
Concerning the lexicon, we have at our disposal grapheme-to-phoneme converters for the three languages, which were used in the past to build mono-lingual ASR systems. For this work, we decided to convert all the mono-lingual phones into IPA format, shown here as ASCII sequences. Of course, some of the choices we made (for instance, replacing geminate consonants with simple ones in the Italian lexicon) are questionable and could be revised in the future. Table 6 contains the list of all the 67 phones resulting from the merging of the three lexica. Of these, 18 phones are common to all three languages, 13 are common to exactly 2 languages (9 de+en, 2 it+de, 2 it+en), and 36 are present in one language only (16 de, 14 en, 6 it).
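The breakdown of shared and language-specific phones can be computed with simple set operations. The sketch below uses hypothetical miniature inventories in place of the real 67 IPA phones of Table 6; only the counting logic is illustrated.

```python
# Hypothetical miniature phone inventories (stand-ins for the real
# Italian, German and English IPA sets of Table 6).
it = {"a", "b", "c", "d"}
de = {"a", "b", "e", "f"}
en = {"a", "c", "e", "g"}

merged = it | de | en  # the merged multi-lingual phone inventory

def shared_by(phone):
    """Number of languages whose inventory contains this phone."""
    return sum(phone in lang for lang in (it, de, en))

common_all = {p for p in merged if shared_by(p) == 3}  # shared by all three
common_two = {p for p in merged if shared_by(p) == 2}  # shared by exactly two
unique_one = {p for p in merged if shared_by(p) == 1}  # language-specific
```

Applied to the real lexica, the same computation yields the figures quoted above: 67 merged phones, of which 18 are common to all three languages, 13 to exactly two, and 36 to one only.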
4 Experiments and Results
Table 7 reports the WER results obtained with the four reference acoustic models: the three mono-lingual models, trained on clean data from native children speaking Italian, German and English, perform slightly better than the multi-lingual model trained on the three corpora using the multi-lingual lexicon. Nevertheless, the multi-lingual system is more robust against non-native speech: the off-diagonal WERs, which represent the performance for young students speaking a foreign language, are significantly lower when the target language is English (39.8% against 53.2% and 31.7% against 40.4%), whilst they are similar in the case of German (i.e. 18.5% against 17.4%).
The next experiment shows the effectiveness of transfer learning in this application context: the baseline DNNs are adapted to non-native speech using limited data from the target domain. The adaptation sets comprise data of Italian students reading German and English sentences and of German students reading English text. The adaptation is implemented by keeping all the layers of the DNNs fixed except the last one; a few additional learning iterations are then performed using the adaptation material. We evaluated three transfer learning modalities: m1) the mono-lingual model is adapted to non-native speakers using adaptation data of the single target language; m2) similarly, the multi-lingual model is adapted to a single target language; m3) the multi-lingual model is adapted to multi-lingual non-native speech using the three adaptation sets at the same time (i.e. English and German data from Italian speakers and English data from German speakers).
The results presented in Table 8 demonstrate that the multi-lingual system copes better with the acoustic mismatch introduced by non-native speakers. Indeed, also in this case, the multi-lingual system exhibits better behavior, producing more effective models for non-native speech: the WERs for Italian students speaking German and English decrease from 11.1% to 9.6% and from 16.0% to 15.4%, respectively; a larger gain is observed for Germans speaking English (WER from 18.3% to 15.2%).
|adapted mono-lingual AMs (m1)|
|adapted multi-lingual AM (m2)|
|adapted multi-lingual AM (m3)|
The next experiment (see Table 9) investigates the case where, starting from the multi-lingual model, we adapt with merged sets: the Italian set merges the adaptation data of Italian students speaking German and English, while the English set merges the data of German and Italian students speaking English. In the first case, we use adaptation sentences coming from speakers with the same mother tongue (Italian); vice versa, in the second case, the adaptation set is defined by the language spoken by the students, regardless of their mother tongue. Also in this case, the gain with respect to the mono-lingual case is noticeable. Moreover, this combination leads to the best results (highlighted in bold) in two cases; as expected, the pairs with no adaptation data (German-to-English and Italian-to-German, respectively) produce worse results.
|adapted multi-lingual AM with Italian speakers|
|adapted multi-lingual AM with English utterances|
As a consequence, we conclude that transfer learning in the context of children's non-native speech, where usually a limited amount of data is available for training, can be successfully applied to mitigate the acoustic mismatch. Moreover, it seems evident that the hidden layers of the multi-lingual DNN build a more general representation of the phonetic space, which turns out to be well suited for adaptation to non-native speakers.
5 Conclusions
In this work we have investigated the application of transfer learning for adapting a multi-lingual DNN, trained on native speech from three languages (Italian, German and English), to non-native data. The approach is implemented by updating the output layer of the DNN with small adaptation sets; the experimental results confirm the validity of this technique and show the positive effect of a multi-lingual model in compensating for the pronunciation differences of non-native speakers.
As future work we plan to further address non-native acoustic modeling, experimenting with other types of acoustic features and exploiting a-priori knowledge about the phonetics of the first and foreign languages. In particular, the investigation of alternative lexica that take into account the pronunciation variations introduced by non-native students seems promising.
-  G. Hinton, L. Deng, D. Yu, and Y. Wang, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic Modeling Using Deep Belief Networks,” IEEE Trans. on Audio Speech and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
-  K. Yao, D. Yu, F. Seide, H. Su, and Y. Gong, “Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition,” in Proc. of ICASSP, Kyoto (Japan), March, 25-30 2012, pp. 366–369.
-  G.E. Dahl, D. Yu, L. Deng, and A. Acero, “Context Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE Trans. on Audio Speech and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
-  F. Seide, G. Li, and D. Yu, “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks,” in Proc. of Interspeech, Florence (Italy), 2011, pp. 437–440.
-  M. Seltzer, D. Yu, and Y. Q. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in Proc. of ICASSP, Vancouver, Canada, May 2013, pp. 7398–7402.
-  J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ’CHiME’ Speech Separation and Recognition Challenge: Dataset, task and baselines,” in Proc. of IEEE ASRU Workshop, Scottsdale, Arizona (USA), 2015.
-  S. Renals and P. Swietojanski, “Neural networks for distant speech recognition,” in Proc. of Hands-free Speech Communication and Microphone Arrays (HSCMA) Wokshop, Villers-les-Nancy, 2014, pp. 172–176.
-  M. Matassoni, D. Falavigna, and D. Giuliani, “DNN adaptation for recognition of children speech through automatic utterance selection,” in IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 644–651.
-  R. Serizel and D. Giuliani, “Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children,” Natural Language Engineering, vol. FirstView, pp. 1–26, 7 2016.
-  Y. Miao and F. Metze, “Improving low-resource cd-dnnhmm using dropout and multilingual dnn training,” in Interspeech, 2013, pp. 2237–2241.
-  N. T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, and H. Bourlard, “Multilingual deep neural network based acoustic modeling for rapid language adaptation,” in Proc. of ICASSP, Florence, Italy, May, 4-9 2014, pp. 7639–7643.
-  J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in Proc. of ICASSP, 2013, pp. 7304–7308.
-  A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep neural networks,” in Proc. of ICASSP, 2013, pp. 7319–7323.
-  Z. Tang, L. Li, and D. Wang, “Multi-task recurrent model for speech and speaker recognition,” arXiv preprint arXiv:1603.09643, 2016.
-  M. Russell, “Analysis of Italian children’s English pronunciation,” http://archive.is/http://www.eee.bham.ac.uk/russellm, 2007.
-  G. Bouselmi, D. Fohr, and J. P. Haton, “Fully automated non-native speech recognition using confusion-based acoustic model integration,” in Proc. of Eurospeech, 2005, pp. 1369–1372.
-  G. Bouselmi, D. Fohr, I. Illina, and J. P. Haton, “Multilingual non-native speech recognition using phonetic confusion-based acoustic model modification and graphemic constraints,” in Proc. of ICSLP, 2006, pp. 109–112.
-  Z. Wang and T. Schultz, “Non-native spontaneous speech recognition through polyphone decision tree specialization,” in Proc. of Eurospeech, 2003, pp. 1449–1452.
-  Z. Wang, T. Schultz, and A. Waibel, “Comparison of acoustic model adaptation techniques on non-native speech,” in Proc. of ICASSP, 2003, pp. 540–543.
-  Y. R. Oh, J. S. Yoon, and H. K. Kim, “Adaptation based on pronunciation variability analysis for non native speech recognition,” in Proc. of ICASSP, 2006, pp. 137–140.
-  H. Strik, K.P. Truong, F. de Wet, and C. Cucchiarini, “Comparing different approaches for automatic pronunciation error detection,” Speech Communication, vol. 51, no. 10, pp. 845–852, 2009.
-  S. Steidl, G. Stemmer, C. Hacker, and E. Nöth, “Adaptation in the pronunciation space for non-native speech recognition,” in Proc. of ICSLP, 2004, pp. 2901–2904.
-  R. Duan, T. Kawahara, M. Dantsuji, and J. Zhang, “Articulatory modeling for pronunciation error detection without non-native training data based on dnn transfer learning,” IEICE Transactions on Information and Systems, vol. E100D, no. 9, pp. 2174–2182, 2017.
-  W. Li, S. M. Siniscalchi, N. F. Chen, and C. Lee, “Improving non-native mispronunciation detection and enriching diagnostic feedback with dnn-based speech attribute modeling,” in Proc. of ICASSP, 2016, vol. 2016-May, pp. 6135–6139.
-  A. Lee and J. Glass, “Mispronunciation detection without nonnative training data,” in Proc. of Interspeech, 2015, vol. 2015-January, pp. 643–647.
-  A. Das and M. Hasegawa-Johnson, “Cross-lingual transfer learning during supervised training in low resource scenarios,” Proc. of Interspeech, vol. 2015-January, pp. 3531–3535, 2015.
-  A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, S. Steidl, and M. Wong, “The PF-STAR children’s speech corpus,” in Proc. of Eurospeech, 01 2005, pp. 2761–2764.
-  M. Gerosa, D. Giuliani, and F. Brugnara, “Acoustic variability and automatic recognition of children’s speech,” Speech Communication, vol. 49, no. 10, pp. 847 – 860, 2007, Intrinsic Speech Variations.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Veselý, “The Kaldi Speech Recognition Toolkit,” in Proc. of IEEE ASRU Workshop, Hawaii (US), December 2011.
-  K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative Training of Deep Neural Networks,” in Proc. of Interspeech, Lyon, France, August 2013, pp. 2345–2349.