Language identification (LID) lends itself to a wide range of applications, e.g., mixed-lingual (code-switching) speech recognition. Early methods were based on statistical models of phonetic or acoustic units [1, 2, 3]. More recent methods are based on probabilistic acoustic modeling, among which the i-vector model is perhaps the most successful [4, 5].
Recently, deep neural models have attracted much attention in LID. Lopez-Moreno et al. [6] proposed a DNN-based approach that uses a DNN to discriminate between languages at the frame level; the language posteriors of an utterance are obtained by simply averaging all of its frame-level posteriors. An RNN-based approach was later proposed by Gonzalez-Dominguez et al. [7], which obtained better performance with far fewer parameters than the DNN-based model. Owing to its advantage in temporal modeling, the RNN approach has been followed by a number of researchers, e.g., [8, 9]. Other neural model structures have also been investigated, e.g., CNN [10, 11] and TDNN [12, 13]. Compared to the i-vector approach, which is based on a probabilistic model, these pure neural methods show a clear advantage on short utterances (e.g., 23 seconds) [6, 7, 9]. The main advantage of the neural methods over the i-vector model is that they are discriminative and can learn complex decision boundaries between languages, provided that sufficient data is available. Moreover, the feature learning power of deep neural nets often provides better robustness against noise and speaker variation, which is highly important for LID.
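The utterance-level scoring described above for the DNN approach can be illustrated with a minimal sketch. The function name `utterance_posterior` and the toy posterior values are assumptions made for illustration; they are not taken from the cited systems.

```python
def utterance_posterior(frame_posteriors):
    """Average frame-level language posteriors into one utterance-level posterior."""
    if not frame_posteriors:
        raise ValueError("need at least one frame")
    n_frames = len(frame_posteriors)
    n_langs = len(frame_posteriors[0])
    return [sum(frame[k] for frame in frame_posteriors) / n_frames
            for k in range(n_langs)]


# Toy example: 3 frames, 2 languages (made-up values).
frames = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
utt = utterance_posterior(frames)
# The identified language is the one with the largest averaged posterior.
decision = max(range(len(utt)), key=lambda k: utt[k])
```

Although the second frame favors the second language, the averaged posterior still selects the first one, which shows how averaging smooths out frame-level errors.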
Deep neural models are also used in a hybrid way, i.e., to generate features [14] or alignments [15, 16] for an i-vector model. In this case, a phonetic DNN/RNN model is trained for phone discrimination as in automatic speech recognition (ASR). This model is then used to produce bottleneck features or acoustic alignments for constructing the i-vector model. By using phonetic information, which is directly related to LID, the i-vector model can be consistently improved.
The two approaches above each have their own disadvantages. The pure neural approach relies entirely on raw features, ignoring phonetic information, which has been known to be important since the beginning of LID research. The hybrid system is still based on a probabilistic model that (1) involves a strong Gaussian assumption that is not suitable for dealing with complex class (here, language) boundaries; and (2) requires relatively many speech frames to estimate a reliable i-vector, which makes it unsuitable for many real applications that require quick identification, e.g., code-switching ASR.
In this paper, we follow the pure neural model scheme, and enhance the existing models by introducing phonetic information as an auxiliary feature. Due to the clear advantage of the LSTM-RNN in both ASR and LID, we adopt this model in the study, though the idea of leveraging phonetic information is applicable to any neural model. The architecture is illustrated in Figure 1. It involves a phonetic RNN, which is trained to discriminate phones as in ASR and produces phonetic features once the training is done, and an LID RNN, which receives the phonetic features and uses them together with the raw acoustic features to perform LID. This model has the following properties:
- The phonetic RNN can be trained with flexible objectives. It can be discriminative for phones (as in ASR) or for both phones and languages, following the multi-task learning principle [17].
- The phonetic RNN can be trained independently of the LID RNN. This means that it can be trained using data of any languages, even ones that are totally different from the target languages of the LID task. This is particularly attractive when the LID task is to discriminate low-resource languages.
- The phonetic feature extraction and propagation are flexible. The feature can be extracted from any place in the phonetic RNN, and can be propagated to any place in the LID RNN.
This architecture is reminiscent of the early phonetic recognition and language modeling (PRLM) approach, where a phone recognizer is used as a front-end to decode phonetic units, followed by a phonetic LM to perform scoring. The two RNNs in our architecture can be regarded as corresponding to the phone recognizer and the LM respectively, although the structure is much more flexible than the historic model. In fact, if the phonetic features are derived from the output layer of the phonetic RNN and are propagated to the input layer of the LID RNN, and if the raw feature is omitted, we obtain a PRLM system where the LID RNN is essentially an RNN-based phone LM. This architecture was recently studied by Salamea et al. [19].
2 Model structure
We choose the LSTM-RNN as the phonetic RNN and the LID RNN components in the study. One reason for this choice is that the LSTM-RNN has been demonstrated to perform well in both the pure neural approach and the hybrid approach. Another reason is that this structure (phonetic feature plus RNN LID) is in accordance with our motivation to model the phonetic dynamics as in the earlier PRLM approach. This section first describes the LSTM-RNN structure used in the study, and then presents the phone-aware LID system.
2.1 LSTM structure
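The associated computation is given as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fr} r_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o) \\
m_t &= o_t \odot h(c_t) \\
r_t &= W_{rm} m_t \\
p_t &= W_{pm} m_t \\
y_t &= W_{yr} r_t + W_{yp} p_t + b_y
\end{aligned}
$$

In the above equations, the $W$ terms denote weight matrices, and those associated with the cells ($W_{ic}$, $W_{fc}$, $W_{oc}$) were constrained to be diagonal in our implementation. The $b$ terms denote bias vectors. $x_t$ and $y_t$ are the input and output symbols respectively; $i_t$, $f_t$, $o_t$ represent respectively the input, forget and output gates; $c_t$ is the cell and $m_t$ is the cell output. $r_t$ and $p_t$ are two output components derived from $m_t$, where $r_t$ is recurrent and fed to the next time step, while $p_t$ is not recurrent and contributes to the present output only. $\sigma(\cdot)$ is the logistic sigmoid function, and $g(\cdot)$ and $h(\cdot)$ are non-linear activation functions, often chosen to be the hyperbolic tangent. $\odot$ denotes element-wise multiplication.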
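To make the recurrence concrete, the following NumPy sketch performs one time step of an LSTM with a recurrent and a non-recurrent projection, in the style of the architecture of [20]. It is an illustration under stated assumptions, not the Kaldi implementation used in the experiments: the parameter names, the layer sizes, the random initialization, and the use of tanh for both non-linear activations are all assumptions, and the diagonal cell weights are stored as vectors.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstmp_step(x_t, r_prev, c_prev, P):
    """One step of an LSTM with recurrent (r) and non-recurrent (p) projections."""
    # Gates; diagonal cell weights act element-wise on the cell state.
    i_t = sigmoid(P["W_ix"] @ x_t + P["W_ir"] @ r_prev + P["w_ic"] * c_prev + P["b_i"])
    f_t = sigmoid(P["W_fx"] @ x_t + P["W_fr"] @ r_prev + P["w_fc"] * c_prev + P["b_f"])
    # Cell update (tanh assumed for the cell-input activation).
    c_t = f_t * c_prev + i_t * np.tanh(P["W_cx"] @ x_t + P["W_cr"] @ r_prev + P["b_c"])
    o_t = sigmoid(P["W_ox"] @ x_t + P["W_or"] @ r_prev + P["w_oc"] * c_t + P["b_o"])
    m_t = o_t * np.tanh(c_t)          # cell output
    r_t = P["W_rm"] @ m_t             # recurrent projection, fed to the next step
    p_t = P["W_pm"] @ m_t             # non-recurrent projection, used for output only
    y_t = P["W_yr"] @ r_t + P["W_yp"] @ p_t + P["b_y"]
    return y_t, r_t, c_t


def init_params(n_in, n_cell, n_rec, n_nonrec, n_out, seed=0):
    """Small random parameters, for demonstration only."""
    rng = np.random.default_rng(seed)
    shapes = {"W_rm": (n_rec, n_cell), "W_pm": (n_nonrec, n_cell),
              "W_yr": (n_out, n_rec), "W_yp": (n_out, n_nonrec), "b_y": (n_out,)}
    for gate in "ifco":
        shapes["W_" + gate + "x"] = (n_cell, n_in)
        shapes["W_" + gate + "r"] = (n_cell, n_rec)
        shapes["b_" + gate] = (n_cell,)
    for gate in "ifo":  # diagonal cell weights stored as vectors
        shapes["w_" + gate + "c"] = (n_cell,)
    return {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}


# Run the recurrence over 6 random input frames (hypothetical sizes).
P = init_params(n_in=5, n_cell=8, n_rec=4, n_nonrec=3, n_out=2)
r, c = np.zeros(4), np.zeros(8)
for x in np.random.default_rng(1).standard_normal((6, 5)):
    y, r, c = lstmp_step(x, r, c, P)
```

Only the recurrent projection and the cell state are carried across time steps; the non-recurrent projection influences the current output alone.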
2.2 Phone-aware LID system
As a preliminary work to demonstrate the concept, we design a simple phone-aware LID system as shown in Figure 3, where both the phonetic RNN and the LID RNN involve a single LSTM layer. Although the phonetic features can be extracted from any place in the phonetic RNN, we choose to use the output of the recurrent projection layer. Similarly, the receiver of the phonetic features is also flexible, and we will investigate the performance of different choices. The configuration shown in Figure 3 uses the non-linear function $g(\cdot)$ as the receiver. With this configuration, most of the computation of the LID RNN remains the same, except that the cell value should be updated as follows:

$$
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + W'_{cr} r'_t + b_c)
$$

where $r'_t$ is the phonetic feature propagated from the phonetic RNN and $W'_{cr}$ is the weight matrix associated with it.
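The modified cell update can be sketched as follows, assuming that the phonetic feature enters the cell-input non-linearity through an additional linear term. The names `phn_t` (the phonetic feature) and `W_cp` (its weight matrix) are hypothetical, tanh is assumed for the non-linearity, and all sizes and values are arbitrary.

```python
import numpy as np


def phone_aware_cell_update(x_t, r_prev, c_prev, i_t, f_t, phn_t,
                            W_cx, W_cr, W_cp, b_c):
    """Cell update where the phonetic feature is added inside the non-linearity."""
    g_input = W_cx @ x_t + W_cr @ r_prev + W_cp @ phn_t + b_c
    return f_t * c_prev + i_t * np.tanh(g_input)


rng = np.random.default_rng(0)
n_in, n_rec, n_cell, n_phn = 5, 4, 8, 3  # hypothetical sizes
x_t = rng.standard_normal(n_in)
r_prev = rng.standard_normal(n_rec)
c_prev = rng.standard_normal(n_cell)
i_t = rng.uniform(size=n_cell)           # gate activations lie in (0, 1)
f_t = rng.uniform(size=n_cell)
phn_t = rng.standard_normal(n_phn)
W_cx = rng.standard_normal((n_cell, n_in))
W_cr = rng.standard_normal((n_cell, n_rec))
W_cp = rng.standard_normal((n_cell, n_phn))
b_c = np.zeros(n_cell)

c_new = phone_aware_cell_update(x_t, r_prev, c_prev, i_t, f_t, phn_t,
                                W_cx, W_cr, W_cp, b_c)
# With a zero phonetic weight matrix the update reduces to the vanilla LSTM cell.
c_plain = phone_aware_cell_update(x_t, r_prev, c_prev, i_t, f_t, phn_t,
                                  W_cx, W_cr, np.zeros_like(W_cp), b_c)
```

Using a gate as the receiver instead would amount to adding the same kind of linear term inside the corresponding sigmoid rather than inside the cell-input non-linearity.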
3.1 Database and configurations
The experiments were conducted with the Babel corpus. This corpus was collected as part of the IARPA (Intelligence Advanced Research Projects Activity) Babel program, with the aim of developing speech technologies for low-resource languages. In this paper, we chose speech data of four languages from the Babel corpus to conduct the study: Assamese, Georgian, Bengali, and Turkish. For each language, a training dataset and a development dataset were officially provided. The training dataset contains both conversational and scripted speech, whereas the development dataset contains only conversational speech. We used the entire training set of each language in the model training, but randomly selected utterances from the development set of each language to perform the test. The training data sets from the four languages are as follows: Assamese (language collection release IARPA-babel102b-v0.5a) hours, Georgian (language collection release IARPA-babel404b-v1.0a) hours, Bengali (language collection release IARPA-babel103b-v0.4b) hours, and Turkish (language collection release IARPA-babel105b-v0.5) hours. The average length of the test utterances is seconds, ranging from seconds to seconds.
The raw feature used for the RNN models is -dimensional Fbanks, with a symmetric -frame window to splice neighboring frames. All the experiments were conducted with Kaldi [22]. The default configurations of the Kaldi WSJ s5 nnet3 recipe were used to train the phonetic RNN and the LID RNN.
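Frame splicing with a symmetric window can be sketched as below. The function name `splice`, the context width, the feature dimension, and the edge handling (repeating the first and last frames) are assumptions for illustration; they do not restate the exact configuration used in the experiments.

```python
import numpy as np


def splice(feats, context):
    """Concatenate each frame with `context` neighbors on each side.

    feats has shape (n_frames, dim); the result has shape
    (n_frames, (2 * context + 1) * dim). Edge frames are padded by repetition.
    """
    n_frames = feats.shape[0]
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    # Shifted views of the padded sequence, ordered from left to right context.
    return np.concatenate([padded[k:k + n_frames]
                           for k in range(2 * context + 1)], axis=1)


# Toy input: 4 frames of 2-dimensional features, one frame of context per side.
feats = np.arange(8, dtype=float).reshape(4, 2)
spliced = splice(feats, context=1)
```

Each output row holds the left neighbor, the current frame, and the right neighbor, so the network sees a short temporal context at every step.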
3.2 Baseline results
As the first step, we build three RNN baseline systems using the speech data of two languages, Assamese and Georgian. The three RNN baselines are a multilingual ASR system (AG-ASR), an LID system (AG-LID), and an ASR-LID multi-task system (AG-MLT). For the AG-ASR, the phone sets of the two languages are merged and the softmax group involves all the state targets, which is in our experiment. The ASR performance in terms of word error rate (WER) is and for Assamese and Georgian on the whole development dataset, respectively. The training and decoding follow the standard WSJ s5 nnet3 recipe of Kaldi. For the AG-LID, the output layer consists of two units, corresponding to the two languages respectively. The training procedure is similar to the one used for training the AG-ASR model. The AG-MLT model involves two groups of targets, and the training utilizes the labels of both phones and languages.
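A multi-task objective with two groups of targets can be sketched as the sum of two cross-entropy terms, one per softmax group. This is a sketch under stated assumptions rather than the recipe's actual objective: the source does not say how the two terms are weighted, so the `lang_weight` parameter, the function names, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def multitask_loss(phone_logits, lang_logits, phone_label, lang_label,
                   lang_weight=1.0):
    """Per-frame loss: phone cross-entropy plus weighted language cross-entropy."""
    phone_ce = -log_softmax(phone_logits)[phone_label]
    lang_ce = -log_softmax(lang_logits)[lang_label]
    return phone_ce + lang_weight * lang_ce


# Toy frame: 4 phone-state targets and 2 language targets, uniform logits.
loss = multitask_loss([0.0, 0.0, 0.0, 0.0], [0.0, 0.0], phone_label=2, lang_label=1)
```

With uniform logits the loss equals log 4 + log 2, the entropy of the two uniform distributions, which gives a quick sanity check of the implementation.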
The LID for Assamese and Georgian can be conducted by either AG-LID or AG-MLT, using the language posteriors they produce. The performance results with these two systems, in terms of and equal error rate (EER), are shown in Table 1. Both the frame-level performance and the utterance-level performance are reported. For the utterance-level results, the frame-level posteriors are averaged to produce the utterance-level posterior, with which the LID is conducted.
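For reference, the EER can be approximated from a set of trial scores as sketched below. Each trial pairs an utterance with a hypothesized language; it is a target trial if the hypothesis is correct and a non-target trial otherwise. The function name and the toy scores are assumptions, and the threshold sweep is a simple approximation rather than the scoring tool used in the experiments.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < th for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2.0
    return eer


# Toy scores (made up): one target trial scores below one non-target trial.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

The same function can be applied to frame-level posteriors or to the averaged utterance-level posteriors, which yields the two kinds of results reported.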
The results in Table 1 indicate that both the LID RNN and the multi-task LID RNN are capable of language discrimination, and that the multi-task RNN performs better. This is expected, as the ASR targets can help to regularize the model training and alleviate the impact of variability factors such as noise and speaker.
3.3 Phonetic feature
The three baseline RNNs will be used as the candidates of the phonetic RNN. We visualize the discriminative power of the phonetic features produced by these RNNs using PCA. Specifically, test utterances are randomly selected from the test set for each language, and these utterances are fed into the phonetic RNN frame by frame. For each frame, the phonetic feature is read from the recurrent projection layer of the tested RNN, and is then projected into the -dimensional space by PCA. Figure 4 presents the distribution of the features for Assamese and Georgian, the two languages ‘known’ in the model training. Figure 5 shows the distribution for Bengali and Turkish, two languages that are ‘unknown’ in the model training. Figure 6 shows the distribution of the features of all four languages. It can be observed that all three RNNs possess a certain discriminative capability for both the known and unknown languages. Comparing the three models, the features generated by the ASR-based RNN are clearly worse, and the features generated by the multi-task RNN look more discriminative. Note that the phonetic features of Assamese and Bengali overlap heavily, no matter which phonetic RNN is used. This means that the four-language LID task will be highly difficult, as we will see shortly.
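The PCA projection used for this kind of visualization can be sketched with a singular value decomposition. The function name `pca_project`, the random stand-in features, and the choice of two output dimensions are assumptions for illustration only.

```python
import numpy as np


def pca_project(features, n_components):
    """Project row-vector features onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in for frame-level phonetic features: 200 frames, 16 dimensions.
feats = np.random.default_rng(0).standard_normal((200, 16))
proj = pca_project(feats, n_components=2)
```

In practice the projection would be fitted on the pooled frames of all languages, so that the points of different languages share the same axes when plotted.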
3.4 Phone-aware LID on known languages
Due to the clear advantage of the AG-MLT in language discrimination, we first choose this model as the phonetic RNN to produce phonetic features. The LID RNNs are trained to discriminate the two known languages, Assamese and Georgian. The results are shown in Table 2, where four configurations for the ‘receiver’ of the phonetic feature are tested: the input gate, the forget gate, the output gate and the non-linear function $g(\cdot)$. Compared to the results with the baseline RNNs (Table 1), introducing the phonetic feature leads to a clear performance improvement, at both the frame level and the utterance level, in terms of both metrics.
We then use the AG-ASR and AG-LID models as the phonetic RNN, with only the best configurations above, i.e., with the non-linear function $g(\cdot)$ or the output gate as the receiver. Using AG-ASR yields better performance than both the baseline and AG-LID, which further supports our conjecture that phonetic information is valuable for neural-based LID.
3.5 Phone-aware LID on unknown languages
We now test the generalizability of the phonetic feature. Specifically, we use the feature to help discriminate two new languages, i.e., languages that are unknown during the phonetic RNN training, which are Bengali and Turkish in our experiment. To test the gain from the phonetic feature, an LID RNN trained on the two target languages, Bengali and Turkish, denoted by BT-LID, is used as the baseline. For simplicity, we only test the scenario where the AG-MLT is used as the phonetic RNN, and the non-linear function $g(\cdot)$ is used as the feature receiver.
The results are shown in Table 3. It can be seen that although the phonetic RNN has no knowledge of the two target languages, the phonetic feature it produces is still highly valuable for the LID task. This is understandable, as phonetic units are often shared by human languages, and so the phonetic information that the phonetic RNN provides is generally valuable. This in fact demonstrates that the phonetic RNN can be trained very flexibly, using speech data of any languages. This is particularly interesting for languages with too little training data to obtain a reasonable phonetic RNN.
3.6 Phone-aware LID on four languages
The final experiment tests the LID performance on all four languages. The baseline system is the RNN LID model trained with the data of the four languages, denoted by AGBT-LID. For the phone-aware system, the AG-MLT is used to produce the phonetic feature, and the non-linear function $g(\cdot)$ is used as the feature receiver. The results are shown in Table 4. Again, the performance is greatly improved by involving the phonetic feature.
We presented a phone-aware LSTM-RNN model for language identification. Our argument is that phonetic information is important for LID. This information was successfully used in historical phonetic models such as the famous PRLM system, but it has been largely ignored by the present pure acoustic methods, whether the i-vector model or the pure neural model. For the LSTM-RNN model in particular, the inherent power of the model in modeling temporal dynamics is largely wasted when no phonetic information is involved. The phone-aware architecture we proposed in this paper employs a deep neural model to produce phonetic features, and these features are propagated to the vanilla LSTM-RNN LID system. Our experiments, conducted on the data of four languages of the Babel corpus, demonstrated that the phone-aware model can dramatically improve the performance of the LSTM-RNN LID system. In the future, we will test the phone-aware approach on more languages and under more complex conditions. In particular, we expect that with the phonetic information, RNN-based LID may be significantly improved on long utterances, thanks to the phonetic normalization it provides.
- [1] L. F. Lamel and J.-L. Gauvain, “Language identification using phone-based acoustic likelihoods,” in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, vol. 1. IEEE, 1994, pp. I–293.
- [2] M. A. Zissman et al., “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Transactions on speech and audio processing, vol. 4, no. 1, p. 31, 1996.
- [3] H. Li, B. Ma, and C.-H. Lee, “A vector space modeling approach to spoken language identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 271–284, 2007.
- [4] N. Dehak, A.-C. Pedro, D. Reynolds, and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 857–860.
- [5] D. Martínez, O. Plchot, L. Burget, O. Glembek, and P. Matejka, “Language recognition in ivectors space,” Proceedings of Interspeech, Firenze, Italy, pp. 861–864, 2011.
- [6] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic language identification using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5337–5341.
- [7] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, “Automatic language identification using long short-term memory recurrent neural networks,” in Interspeech, 2014, pp. 2155–2159.
- [8] G. Gelly, J.-L. Gauvain, V. Le, and A. Messaoudi, “A divide-and-conquer approach for language identification based on recurrent neural networks,” Interspeech 2016, pp. 3231–3235, 2016.
- [9] R. Zazo, A. Lozano-Diez, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez, “Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks,” PloS one, vol. 11, no. 1, p. e0146917, 2016.
- [10] A. Lozano-Diez, R. Zazo Candil, J. González Domínguez, D. T. Toledano, and J. Gonzalez-Rodriguez, “An end-to-end approach to language identification in short utterances using convolutional neural networks,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. International Speech and Communication Association, 2015.
- [11] M. Jin, Y. Song, I. Mcloughlin, L.-R. Dai, and Z.-F. Ye, “LID-senone extraction via deep neural networks for end-to-end language identification,” in Proc. of Odyssey, 2016.
- [12] M. Kotov and M. Nastasenko, “Language identification using time delay neural network d-vector on short utterances,” in Speech and Computer: 18th International Conference, SPECOM 2016, Budapest, Hungary, August 23-27, 2016, Proceedings, vol. 9811. Springer, 2016, p. 443.
- [13] D. Garcia-Romero and A. McCree, “Stacked long-term tdnn for spoken language recognition,” Interspeech 2016, pp. 3226–3230, 2016.
- [14] Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, “I-vector representation based on bottleneck features for language identification,” Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013.
- [15] L. Ferrer, Y. Lei, M. McLaren, and N. Scheffer, “Study of senone-based deep neural network approaches for spoken language recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 1, pp. 105–116, 2016.
- [16] Y. Tian, L. He, Y. Liu, and J. Liu, “Investigation of senone-based long-short term memory rnns for spoken language recognition,” Odyssey 2016, pp. 89–93, 2016.
- [17] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
- [18] P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, “Brno university of technology system for nist 2005 language recognition evaluation,” in Speaker and Language Recognition Workshop, 2006. IEEE Odyssey 2006: The. IEEE, 2006, pp. 1–7.
- [19] C. Salamea, L. F. D’Haro, R. de Córdoba, and R. San-Segundo, “On the use of phone-gram units in recurrent neural networks for language identification,” Odyssey 2016, pp. 117–123, 2016.
- [20] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014, pp. 338–342.
- [21] D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of deep neural networks with natural gradient and parameter averaging,” arXiv preprint arXiv:1410.7455, 2014.
- [22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.