Oriental languages, including various languages spoken in east, northeast and southeast Asia, belong to several language families, including Austroasiatic languages (e.g., Vietnamese, Cambodian), Tai–Kadai languages (e.g., Thai, Lao), Hmong–Mien languages (e.g., some dialects in south China), Sino-Tibetan languages (e.g., Mandarin Chinese), Altaic languages (e.g., Korean, Japanese), and Indo-European languages (e.g., Russian) [2, 3, 4]. These languages are generally believed to be genetically unrelated and to have developed from diverse cultures. However, they share many features as a result of historical migration and international trade. For example, many languages in the so-called Mainland Southeast Asia (MSEA) linguistic area possess a particular syllable structure that involves monosyllabic morphemes, lexical tone, and a fairly large inventory of consonants [5]. Another example is the significant influence of Chinese on Korean, Japanese, Vietnamese and many languages in southeast Asia. In the modern period, English has become the most influential language, resulting in numerous English-originated words in almost all oriental languages.
The complex acoustic and linguistic patterns of oriental languages have attracted much interest in a multitude of research areas, including comparative phonetics, evolutionary linguistics, second language acquisition, and sociolinguistics. In particular, the diverse evolution paths of these languages and their complicated interactions offer a valuable opportunity for studying mixlingual and multilingual phenomena.
Despite the broad interest, data resources for oriental languages are far from abundant. One possible reason is that many of these languages are spoken by a relatively small population, and most of the speakers are in developing countries. Some effort has been devoted to building data resources for oriental languages. For example, the annual oriental COCOSDA (OC) workshop promotes speech and language resource construction for oriental languages, and the ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) journal calls for original research on oriental languages, especially languages with limited resources (https://mc.manuscriptcentral.com/tallip). Some projects, e.g., the Babel program (https://www.iarpa.gov/index.php/research-programs/babel), although not specifically aimed at oriental languages, do involve Vietnamese, Thai, Lao and some other low-resource languages in southeast Asia. In spite of these efforts, resource construction and the corresponding research on oriental languages are still rather limited, except for one or two rich-resource languages such as Chinese and Japanese.
To promote research on oriental languages, particularly on multilingual speech and language processing, the Center for Speech and Language Technologies (CSLT) at Tsinghua University and Speechocean jointly organized an oriental language recognition (OLR) challenge at APSIPA 2016. This event called for a competition on a language recognition task covering seven oriental languages. To support this event, Speechocean released a multilingual speech database, AP16-OL7, and made it free for the challenge participants. This paper presents the data profile of the database, the evaluation rules of the challenge, and a baseline system that the participants can refer to.
Note that several databases can already be used for multilingual research, for example Polyphone [6], GlobalPhone [7], the NTT multilingual database (http://www.ntt-at.com/product/speech2002/), SPEECHDAT-CAR [8], Speechdat-E [9], Babel [10], and the multilingual databases created by the new Babel project. To the best of our knowledge, AP16-OL7 is the first multilingual speech database specifically designed for oriental languages.
2 Database profile
Table 1: Basic information of the AP16-OL7 database.

| Code | Description | Channel | Training & Dev: No. of Speakers | Training & Dev: Utt./Spk. | Training & Dev: Total Utt. | Test: No. of Speakers | Test: Utt./Spk. | Test: Total Utt. |
|---|---|---|---|---|---|---|---|---|
| ct-cn | Cantonese in Mainland China and Hong Kong | Mobile | 18 | 320 | 5759 | 6 | 320 | 1920 |
| zh-cn | Mandarin in China | Mobile | 18 | 300 | 5398 | 6 | 300 | 1800 |
| id-id | Indonesian in Indonesia | Mobile | 18 | 320 | 5751 | 6 | 320 | 1920 |
| ja-jp | Japanese in Japan | Mobile | 18 | 320 | 5742 | 6 | 320 | 1920 |
| ru-ru | Russian in Russia | Mobile | 18 | 300 | 5390 | 6 | 300 | 1800 |
| ko-kr | Korean in Korea | Mobile | 18 | 300 | 5396 | 6 | 300 | 1800 |
| vi-vn | Vietnamese in Vietnam | Mobile | 18 | 300 | 5400 | 6 | 300 | 1800 |
Male and female speakers are balanced.
The number of total utterances might be slightly smaller than expected, due to the quality check.
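The size of this gap can be read off Table 1 directly. The following is a small sketch that transcribes the training/dev columns of the table and compares the nominal total (speakers times utterances per speaker) with the reported total; the variable and function names are ours, chosen for illustration.

```python
# Training/dev columns of Table 1:
# language code -> (speakers, utterances per speaker, reported total utterances)
train = {
    "ct-cn": (18, 320, 5759),
    "zh-cn": (18, 300, 5398),
    "id-id": (18, 320, 5751),
    "ja-jp": (18, 320, 5742),
    "ru-ru": (18, 300, 5390),
    "ko-kr": (18, 300, 5396),
    "vi-vn": (18, 300, 5400),
}

def removed_by_quality_check(table):
    """Per language: nominal total (speakers * utt/spk) minus the reported total."""
    return {code: spk * utt - total for code, (spk, utt, total) in table.items()}

removed = removed_by_quality_check(train)
for code, count in removed.items():
    print(code, count)
```

Under this reading, the difference is at most a few utterances per language, and Vietnamese shows no difference at all.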
The AP16-OL7 database was originally created by Speechocean for various speech processing tasks (mainly speech recognition). The entire database involves seven datasets, each in a particular language. The seven languages are: Mandarin, Cantonese, Indonesian, Japanese, Russian, Korean, and Vietnamese. The data volume for each language is about 10 hours of speech signals recorded by 24 speakers (12 males and 12 females), and each speaker recorded about 300 utterances in reading style. The signals were recorded by mobile phones, with a sampling rate of 16 kHz and a sample size of 16 bits. Each dataset was split into a training set consisting of 18 speakers, and a test set consisting of 6 speakers. For Mandarin, Cantonese, Vietnamese and Indonesian, the recording was conducted in a quiet environment. For Russian, Korean and Japanese, there are two recording sessions for each speaker: the first session was recorded in a quiet environment and the second was recorded in a noisy environment. The basic information of the AP16-OL7 database is presented in Table 1.
Besides the speech signals, the AP16-OL7 database also provides lexicons of all the seven languages, and transcriptions of all the training utterances. These resources allow training acoustic-based or phonetic-based language recognition systems. Training phone-based speech recognition systems is also possible, though large vocabulary recognition systems are not well supported, due to the lack of large-scale language models.
The AP16-OL7 database is freely available for the participants of the AP16-OLR challenge and the APSIPA 2016 special session on multilingual speech and language processing. It is also available for any academic and industrial users, subject to a slightly different licence from SpeechOcean (http://speechocean.com).
3 AP16-OLR challenge
Based on the AP16-OL7 database, we organized an oriental language recognition (OLR) challenge (http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/ASR-events-AP16-details). Following the definition of NIST LRE15 [11], the task of the challenge is defined as follows: Given a segment of speech and a language hypothesis (i.e., a target language of interest to be detected), the task is to decide whether that target language was in fact spoken in the given segment (yes or no), based on an automated analysis of the data contained in the segment. The AP16-OLR evaluation plan also follows the principles of NIST LRE15: it focuses on the closed-set condition, and allows no additional training materials besides AP16-OL7. The evaluation details are described as follows.
3.1 System input/output
The input to the OLR system is a set of speech segments in unknown languages (but within the languages of AP16-OL7). The task of the OLR system is to determine the confidence that a language is contained in a speech segment. More specifically, for each speech segment, the OLR system outputs a score vector $\langle \ell_1 : s_1, \ell_2 : s_2, \ldots, \ell_7 : s_7 \rangle$, where $s_i$ represents the confidence that language $\ell_i$ is spoken in the speech segment. Each score $s_i$ is interpreted as follows: if $s_i \geq 0$, the decision is that language $\ell_i$ is contained in the segment; otherwise it is not. The scores should be comparable across languages and segments. This is consistent with the principle of LRE15, but differs from that of LRE09 [12], where an explicit decision is required for each trial.
In summary, the output of an OLR submission will be a text file, where each line contains a speech segment plus a score vector for this segment.
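As an illustration only, the sketch below renders scores in one possible layout: a header line listing the language codes of Table 1, followed by one line per segment holding the segment identifier and seven scores. The whitespace-separated layout, the function name `format_scores`, and the segment identifiers are assumptions made here; the layout actually required is the one defined by the challenge organizers.

```python
import math

# Language codes as listed in Table 1
LANGS = ["ct-cn", "zh-cn", "id-id", "ja-jp", "ru-ru", "ko-kr", "vi-vn"]

def format_scores(scores, segments):
    """Render a score table as text (hypothetical layout).

    scores   : dict mapping segment id -> dict mapping language code -> score
    segments : every trial that must appear in the output
    A missing segment or language is written as -inf.
    """
    lines = ["segment " + " ".join(LANGS)]
    for seg in segments:
        row = scores.get(seg, {})
        vals = [row.get(lang, -math.inf) for lang in LANGS]
        lines.append(seg + " " + " ".join(f"{v:.4f}" for v in vals))
    return "\n".join(lines) + "\n"

# Hypothetical usage: the second segment was never scored
text = format_scores(
    {"seg_0001": {"zh-cn": 1.25, "ja-jp": -0.5}},
    ["seg_0001", "seg_0002"],
)
print(text)
```

Writing every expected segment, even unscored ones, keeps the file aligned with the trial list so that no trial is silently dropped.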
3.2 Test condition
No additional training materials may be used.
All the trials should be processed. Scores of lost trials will be interpreted as $-\infty$.
Each test segment should be processed independently. Knowledge from other test segments (e.g., the score distribution of all the test segments) may not be used.
Speaker information may not be used.
Listening to any speech segments is not allowed.
3.3 Evaluation metrics
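As in LRE15, the AP16-OLR challenge chooses $C_{avg}$ as the principal evaluation metric. It is built on the pair-wise loss for a particular target/non-target language pair:

$$C(L_t, L_n) = P_{Target} \, P_{Miss}(L_t) + (1 - P_{Target}) \, P_{FA}(L_t, L_n)$$

where $L_t$ and $L_n$ are the target and non-target languages, respectively; $P_{Miss}$ and $P_{FA}$ are the missing and false alarm probabilities, respectively.
$P_{Target}$ is the prior probability for the target language, which is set to 0.5 in the evaluation. Then the principal metric $C_{avg}$ is defined as the average of the above pair-wise performance:

$$C_{avg} = \frac{1}{N} \sum_{L_t} \left\{ P_{Target} \, P_{Miss}(L_t) + \sum_{L_n} P_{Non\text{-}Target} \, P_{FA}(L_t, L_n) \right\}$$

where $N$ is the number of languages, and $P_{Non\text{-}Target} = (1 - P_{Target}) / (N - 1)$.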
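The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring tool: for each target language it adds the weighted miss probability to the weighted false alarm probabilities against every other language, then averages over target languages. The function name `c_avg`, the default decision threshold of 0, and the default target prior of 0.5 are assumptions of this sketch, and every language is assumed to have at least one test segment.

```python
def c_avg(scores, labels, p_target=0.5, threshold=0.0):
    """Average detection cost over all target languages.

    scores : list of per-segment score lists; scores[k][i] is the confidence
             that language i is spoken in segment k
    labels : list of true language indices, one per segment
    A trial (segment k, language i) is accepted when scores[k][i] >= threshold.
    """
    n_langs = len(scores[0])
    p_non_target = (1.0 - p_target) / (n_langs - 1)

    def accept_rate(true_lang, hyp_lang):
        # Fraction of segments of true_lang accepted by the detector for hyp_lang
        rows = [s for s, l in zip(scores, labels) if l == true_lang]
        return sum(1 for s in rows if s[hyp_lang] >= threshold) / len(rows)

    total = 0.0
    for t in range(n_langs):
        p_miss = 1.0 - accept_rate(t, t)      # target segments rejected
        cost_t = p_target * p_miss
        for n in range(n_langs):
            if n != t:
                # Segments of language n wrongly accepted as language t
                cost_t += p_non_target * accept_rate(n, t)
        total += cost_t
    return total / n_langs
```

With these defaults, a system that accepts every trial and a system that rejects every trial both obtain a cost of 0.5, which gives a useful reference point when reading the baseline numbers.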
4 Baseline results
We present baseline language recognition systems based on the i-vector model, and evaluate their performance in terms of the metrics defined by the AP16-OLR challenge. The purpose of these experiments is not to present a competitive submission, but rather to demonstrate that the AP16-OL7 database is a reasonable data resource for language recognition research.
4.1 Experimental setup
The baseline system was constructed based on the i-vector model [13, 14]. The static acoustic features involved 19-dimensional Mel frequency cepstral coefficients (MFCCs) and the log energy. These static features were augmented by their first and second order derivatives, resulting in 60-dimensional feature vectors. The UBM involved 2,048 Gaussian components and the dimensionality of the i-vectors was 400. Linear discriminant analysis (LDA) was employed to promote language-related information. The dimensionality of the LDA projection space was set to 6.
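The feature dimensionality can be checked with a short sketch: 20 static coefficients (19 MFCCs plus log energy) stacked with their first and second order derivatives give 3 x 20 = 60 dimensions per frame. The derivative estimator below is the common regression formula over a window of two frames on each side with edge frames repeated; this choice, like the function names, is an assumption of the sketch, since the text does not specify how the derivatives were computed.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a list of equal-length feature frames."""
    n_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))

    def frame_at(t):
        # Clamp the index so that edge frames reuse the first/last frame
        return frames[min(max(t, 0), n_frames - 1)]

    out = []
    for t in range(n_frames):
        row = [0.0] * len(frames[t])
        for n in range(1, window + 1):
            ahead, behind = frame_at(t + n), frame_at(t - n)
            for d in range(len(row)):
                row[d] += n * (ahead[d] - behind[d]) / denom
        out.append(row)
    return out

def add_deltas(static):
    """Concatenate static features with first- and second-order derivatives."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy input: 100 frames of 20 static coefficients, rising by 20 per frame
static = [[float(t * 20 + d) for d in range(20)] for t in range(100)]
features = add_deltas(static)
print(len(features), len(features[0]))
```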
With the i-vectors (either original or after LDA transform), the score of a trial on a particular language can be simply computed as the cosine distance between the test i-vector and the mean i-vector of the training segments that belong to that language. We refer to this as ‘cosine distance scoring’. A more powerful scoring approach is to employ various discriminative models. In our experiment, we trained a support vector machine (SVM) for each language to determine the score that a test i-vector belongs to that language. The SVMs were trained on the i-vectors of all the training segments, following the one-versus-rest scheme. We refer to this scoring approach as ‘SVM-based scoring’.
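Cosine distance scoring as described above can be sketched as follows. The code computes the mean training i-vector of each language and scores a test i-vector by its cosine with that mean, so a larger value means a closer match. The function names and the toy two-dimensional vectors are illustrative assumptions; real i-vectors would come from the trained extractor.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_scores(test_ivector, train_ivectors, train_labels, languages):
    """Score one test i-vector against the mean i-vector of every language."""
    scores = {}
    for lang in languages:
        members = [x for x, l in zip(train_ivectors, train_labels) if l == lang]
        scores[lang] = cosine(test_ivector, mean_vector(members))
    return scores

# Toy example with two languages and 2-dimensional "i-vectors"
train_x = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
train_y = ["zh-cn", "zh-cn", "ja-jp", "ja-jp"]
print(cosine_scores([1.0, 0.0], train_x, train_y, ["zh-cn", "ja-jp"]))
```

Because the cosine ignores vector length, the score depends only on the direction of the i-vector. The same functions apply unchanged to LDA-transformed vectors.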
4.2 Visualization with T-SNE 
To provide an intuitive understanding of the discriminative capability of i-vectors on languages, the i-vectors of all the segments in the test set are plotted in a two-dimensional space via T-SNE [15]. Fig. 1 shows the original i-vectors, and Fig. 2 shows the i-vectors after LDA transform, where each color/shape represents a particular language. It can be seen that for the original i-vectors, each language is split into several clusters, mainly corresponding to different speakers. After the LDA transform, speaker information is suppressed and the language identity becomes more prominent.
4.3 Performance results
The primary evaluation metric in AP16-OLR is $C_{avg}$. In addition, we present the performance in terms of equal error rate (EER), minimum detection cost function (DCF), detection error tradeoff (DET) curve, and identification rate (IDR). These metrics evaluate the system from different perspectives, offering a more complete picture of the verification/identification capability of the baseline system.
The results are shown in Table 2. The rows ‘i-vector’ and ‘L-vector’ present the results with cosine distance scoring; ‘i-vector-SVM’ and ‘L-vector-SVM’ present the results with SVM-based scoring. ‘Linear’, ‘Poly’ (degree=), and ‘RBF’ represent the three commonly used kernel functions. It can be seen that LDA leads to consistent performance gains, and SVM-based scoring tends to outperform cosine distance scoring.
4.3.2 EER and DCF results
EER and DCF are also widely used to measure the performance of verification systems. Compared to $C_{avg}$, these two metrics do not depend on the hard decisions but on the quality of the scoring, and therefore evaluate the verification system from a different angle. The results for these two metrics are presented in Table 2. It can be seen that similar conclusions can be drawn from these results as from the $C_{avg}$ results.
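To make the contrast with a fixed-threshold metric concrete, the sketch below approximates the EER by sweeping the decision threshold over all observed scores and returning the point where the miss and false alarm rates are closest. Pooling the trials of all languages into one target list and one non-target list is an assumption of this sketch, as is the function name.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from pooled target and non-target trial scores."""
    best_gap, eer = float("inf"), 1.0
    for th in sorted(set(target_scores) | set(nontarget_scores)):
        # Target trials scoring below the threshold are misses
        p_miss = sum(1 for s in target_scores if s < th) / len(target_scores)
        # Non-target trials at or above the threshold are false alarms
        p_fa = sum(1 for s in nontarget_scores if s >= th) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2.0
    return eer
```

Since only the ordering of the scores matters here, shifting or rescaling all scores leaves the result unchanged, which is why this metric reflects scoring quality rather than calibration.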
4.4 DET curve
The DET curve is another popular way to evaluate verification systems. Compared to $C_{avg}$, EER and DCF, the DET curve presents the performance at all operating points, and can therefore evaluate a verification system in a more systematic way. Experimental results are shown in Fig. 3. The black circles mark the operating points where the DCFs are obtained. Again, similar conclusions can be drawn as with $C_{avg}$, EER and DCF.
4.4.1 IDR results
Note that in the OLR challenge, the target languages are known a priori, and the confidence scores are comparable across languages. This means that OLR can be treated as a language identification task, in which the language obtaining the highest score in a trial is regarded as the identification result. For such an identification task, IDR is a widely used metric [16], which treats errors on all languages as equally serious. IDR is formally defined as follows:

$$IDR = \frac{M_c}{M_c + M_i}$$

where $M_c$ and $M_i$ are the numbers of correctly and incorrectly identified utterances, respectively. Table 2 presents the IDR results of the baseline system. We can observe similar trends as with the verification metrics: $C_{avg}$, EER, DCF and the DET curve.
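The identification rule and the resulting rate can be sketched as follows: the language with the highest score is taken as the identified language, and IDR is the number of correctly identified utterances divided by the total number of utterances. The function name and the toy scores are illustrative assumptions.

```python
def identification_rate(scores, labels):
    """Fraction of segments whose top-scoring language is the true one.

    scores : list of per-segment score lists, one score per language
    labels : list of true language indices, one per segment
    """
    n_correct = n_incorrect = 0
    for row, true_lang in zip(scores, labels):
        # Identification result: index of the highest score in this trial
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted == true_lang:
            n_correct += 1
        else:
            n_incorrect += 1
    return n_correct / (n_correct + n_incorrect)

# Toy example: the third segment is misidentified
toy_scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_labels = [0, 1, 1]
print(identification_rate(toy_scores, toy_labels))
```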
We presented the data profile of the AP16-OL7 database that was released to support the AP16-OLR challenge at APSIPA 2016. The evaluation rules of the challenge were described, and a baseline system was presented. We showed that the AP16-OL7 database is a suitable data resource for language recognition research.
This work was supported by the National Science Foundation of China (NSFC) under the project No. 61371136, and the MESTDC PhD Foundation Project No. 20130002120011. It was also supported by SpeechOcean.
[1] P. Sidwell and R. Blench, “The Austroasiatic Urheimat: the Southeastern Riverine Hypothesis,” Dynamics of Human Diversity, p. 315, 2011.
[2] S. R. Ramsey, The Languages of China. Princeton University Press, 1987.
[3] M. Shibatani, The Languages of Japan. Cambridge University Press, 1990.
[4] B. Comrie, G. Stone, and M. Polinsky, The Russian Language in the Twentieth Century. Oxford University Press, 1996.
[5] N. J. Enfield, “Areal linguistics and Mainland Southeast Asia,” Annual Review of Anthropology, vol. 34, pp. 181–206, 2005.
[6] J. J. Godfrey, “Multilingual speech databases at LDC,” in Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 23–26.
[7] T. Schultz, “GlobalPhone: a multilingual speech and text database developed at Karlsruhe University,” in INTERSPEECH, 2002.
[8] A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri, S. Euler, and J. Allen, “SpeechDat-Car: a large speech database for automotive environments,” in LREC, 2000.
[9] H. van den Heuvel, J. Boudy, Z. Bakcsi, J. Cernockỳ, V. Galunov, J. Kochanina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski et al., “SpeechDat-E: five Eastern European speech databases for voice-operated teleservices completed,” in INTERSPEECH, 2001, pp. 2059–2062.
[10] P. Roach, S. Arnfield, W. J. Barry, J. Baltova, M. Boldea, A. Fourcin, W. Gonet, R. Gubrynowicz, E. Hallum, L. Lamel et al., “Babel: an Eastern European multi-language database,” in ICSLP, vol. 96, 1996, pp. 1892–1893.
[11] “The 2015 NIST language recognition evaluation plan (LRE15),” NIST, 2015, ver. 22-3.
[12] “The 2009 NIST language recognition evaluation plan (LRE09),” NIST, 2009, ver. 6.
[13] N. Dehak, P. G. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[14] N. Dehak, P. A. Torres-Carrasquillo, D. A. Reynolds, and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in INTERSPEECH, 2011, pp. 857–860.
[15] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, 2008.
[16] B. Yin, E. Ambikairajah, and F. Chen, “Hierarchical language identification based on automatic language clustering,” in INTERSPEECH, 2007.