1 Introduction
Language identification (LID) refers to identifying the language spoken in an utterance. LID is usually deployed at the front end of speech processing systems, such as automatic speech recognition (ASR), so LID technology plays an important role in multilingual interaction applications. However, difficult issues still degrade the performance of LID systems, such as cross-channel mismatch, the lack of training resources, and noisy environments.
The oriental languages span several of the world's language families, including Austroasiatic languages (e.g., Vietnamese, Khmer), Tai-Kadai languages (e.g., Thai, Lao), Hmong-Mien languages (e.g., some dialects in south China), Sino-Tibetan languages (e.g., Mandarin Chinese), Altaic languages (e.g., Korean, Japanese) and Indo-European languages (e.g., Russian) [4, 5, 1].
A dialect, often referring to a variety of a specific language, is also a typical linguistic phenomenon. Different dialects may be treated as different languages for speech processing. As an oriental country, China has 56 ethnic groups, and each ethnic group has its own unique dialect(s). Some of these Chinese dialects share part of the writing system with Mandarin Chinese, but their totally different pronunciations result in even more complicated multilingual phenomena.
The oriental language recognition (OLR) challenge is organized annually, aiming to promote research on multilingual phenomena and advance the development of language recognition technologies. The challenge has been conducted four times since 2016, namely AP16-OLR, AP17-OLR, AP18-OLR and AP19-OLR, each attracting dozens of teams around the world.
AP19-OLR involved even more languages and focused on three challenging tasks: (1) short-utterance LID, which was inherited from AP18-OLR; (2) cross-channel LID; (3) zero-resource LID. In the first task, the system submitted by the Innovem-Tech team achieved the best $C_{avg}$ and EER. In the second task, the system submitted by the Samsung SSLab team achieved the best $C_{avg}$ and EER, and in the third task, the system submitted by the XMUSPEECH team did so. From these results, one can see that the cross-channel condition remains challenging. More details about the past four challenges can be found on the challenge website (http://olr.cslt.org).
Based on the experience of the last four challenges and the call from industrial applications, we propose the fifth OLR challenge. This new challenge, denoted AP20-OLR, will be hosted by APSIPA ASC 2020. It involves more languages/dialects and focuses on more practical and challenging tasks: (1) cross-channel LID, as in the last challenge; (2) dialect identification, where three dialect resources are provided for training, but three other languages are also included in the test set, forming an open-set dialect identification task; and (3) noisy LID, which reflects another real-life demand on speech technology: dealing with low-SNR conditions.
In the rest of the paper, we present the data profile and the evaluation plan of the AP20-OLR challenge. To help participants build their own submissions, two types of baseline systems are provided, based on Kaldi and PyTorch respectively.
Table 1: Basic information of the AP16-OL7 (top) and AP17-OL3 (bottom) databases.

| Code | Description | Channel | Spk. (train) | Utt./Spk. (train) | Total Utt. (train) | Spk. (test) | Utt./Spk. (test) | Total Utt. (test) |
|------|-------------|---------|------|------|------|------|------|------|
| ct-cn | Cantonese in China Mainland and Hongkong | Mobile | 24 | 320 | 7559 | 6 | 300 | 1800 |
| zh-cn | Mandarin in China | Mobile | 24 | 300 | 7198 | 6 | 300 | 1800 |
| id-id | Indonesian in Indonesia | Mobile | 24 | 320 | 7671 | 6 | 300 | 1800 |
| ja-jp | Japanese in Japan | Mobile | 24 | 320 | 7662 | 6 | 300 | 1800 |
| ru-ru | Russian in Russia | Mobile | 24 | 300 | 7190 | 6 | 300 | 1800 |
| ko-kr | Korean in Korea | Mobile | 24 | 300 | 7196 | 6 | 300 | 1800 |
| vi-vn | Vietnamese in Vietnam | Mobile | 24 | 300 | 7200 | 6 | 300 | 1800 |

| Code | Description | Channel | Spk. (train) | Utt./Spk. (train) | Total Utt. (train) | Spk. (dev/test) | Utt./Spk. (dev/test) | Total Utt. (dev/test) |
|------|-------------|---------|------|------|------|------|------|------|
| ka-cn | Kazakh in China | Mobile | 86 | 50 | 4200 | 86 | 20 | 1800 |
| ti-cn | Tibetan in China | Mobile | 34 | 330 | 11100 | 34 | 50 | 1800 |
| uy-id | Uyghur in China | Mobile | 353 | 20 | 5800 | 353 | 5 | 1800 |

Male and female speakers are balanced. The total number of utterances may be slightly smaller than expected, due to the quality check.
2 Database profile
Participants of AP20-OLR can request the following datasets for system construction. All of these data can be used to train their submission systems:
AP16-OL7: The standard database for AP16-OLR, including AP16-OL7-train, AP16-OL7-dev and AP16-OL7-test.
AP17-OL3: A dataset provided by the M2ASR project, involving three new languages. It contains AP17-OL3-train and AP17-OL3-dev.
AP17-OLR-test: The standard test set for AP17-OLR. It contains AP17-OL7-test and AP17-OL3-test.
AP18-OLR-test: The standard test set for AP18-OLR. It contains AP18-OL7-test and AP18-OL3-test.
AP19-OLR-dev: The development set for AP19-OLR. It contains AP19-OLR-dev-task2 and AP19-OLR-dev-task3.
AP19-OLR-test: The standard test set for AP19-OLR. It contains AP19-OL7-test and AP19-OL3-test.
AP20-OLR-dialect: The newly provided training set, including three kinds of Chinese dialects.
THCHS30: The THCHS30 database (plus the accompanying resources) published by CSLT, Tsinghua University.
Besides the speech signals, the AP16-OL7 and AP17-OL3 databases also provide lexicons of all the 10 languages, as well as the transcriptions of all the training utterances. These resources allow training acoustic-based or phonetic-based language recognition systems. Training phone-based speech recognition systems is also possible, though large vocabulary recognition systems are not well supported, due to the lack of large-scale language models.
A test dataset, AP20-OLR-test, will be provided on the date of result submission; it includes three parts corresponding to the three LID tasks.
The AP16-OL7 database was originally created by Speechocean, targeting various speech processing tasks. It was provided as the standard training and test data in AP16-OLR. The entire database involves 7 datasets, each in a particular language. The seven languages are: Mandarin, Cantonese, Indonesian, Japanese, Russian, Korean and Vietnamese. The data volume for each language is about hours of speech signals recorded in reading style. The signals were recorded by mobile phones, with a sampling rate of kHz and a sample size of bits.
For Mandarin, Cantonese, Vietnamese and Indonesian, the recording was conducted in a quiet environment. For Russian, Korean and Japanese, there are two recording sessions for each speaker: the first session was recorded in a quiet environment and the second in a noisy environment. The basic information of the AP16-OL7 database is presented in Table 1, and the details of the database can be found on the challenge website or in the description paper.
The AP17-OL7 database is a dataset provided by SpeechOcean. It contains the same 7 languages as AP16-OL7, each containing utterances. The recording conditions are the same as AP16-OL7. This database is used as part of the test set for the AP17-OLR challenge.
The AP17-OL3 database contains 3 languages: Kazakh, Tibetan and Uyghur, all of which are minority languages in China. This database is part of the Multilingual Minorlingual Automatic Speech Recognition (M2ASR) project, which is supported by the National Natural Science Foundation of China (NSFC). The project is a three-party collaboration including Tsinghua University, the Northwest National University, and Xinjiang University. The aim of this project is to construct speech recognition systems for five minority languages in China (Kazakh, Kirgiz, Mongolian, Tibetan and Uyghur). However, our ambition is beyond that scope: we hope to construct a full set of linguistic and speech resources and tools for the five languages, and make them open and free for research purposes. We call this the M2ASR Free Data Program. All the data resources, including the tools published in this paper, are released on the website of the project (http://m2asr.cslt.org).
The sentences of each language in AP17-OL3 are randomly selected from the original M2ASR corpus. The data volume for each language in AP17-OL3 is about hours of speech signals recorded in reading style. The signals were recorded by mobile phones, with a sampling rate of kHz and a sample size of bits. We selected utterances for each language as the development set (AP17-OL3-dev), and the rest is used as the training set (AP17-OL3-train). The test set of each language involves utterances, and is provided separately, denoted AP17-OL3-test. Compared to AP16-OL7, AP17-OL3 contains much more variation in terms of recording conditions and the number of speakers, which inevitably increases the difficulty of the challenge task. The information of the AP17-OL3 database is summarized in Table 1.
The AP18-OLR-test database is the standard test set for AP18-OLR, which contains AP18-OL7-test and AP18-OL3-test. Like the AP17-OL7-test database, AP18-OL7-test contains the same target languages, each containing utterances, while AP18-OL7-test also contains utterances from several interference languages. The recording conditions are the same as AP17-OL7-test. Like the AP17-OL3-test database, AP18-OL3-test contains the same languages, each containing utterances. The recording conditions are the same as AP17-OL3-test.
The AP19-OLR-test database is the standard test set for AP19-OLR, which includes 3 parts corresponding to the 3 LID tasks respectively, namely AP19-OLR-short, AP19-OLR-channel and AP19-OLR-zero.
AP20-OLR-dialect is the training set provided by SpeechOcean. It includes three kinds of Chinese dialects, namely Hokkien, Sichuanese and Shanghainese, with about 8000 utterances per dialect. The signals were recorded by mobile phones, with a sampling rate of kHz and a sample size of bits.
The AP20-OLR-test database is the standard test set for AP20-OLR, which includes 3 parts corresponding to the 3 LID tasks respectively, namely AP20-OLR-channel-test, AP20-OLR-dialect-test and AP20-OLR-noisy-test.
AP20-OLR-channel-test: This subset is designed for the cross-channel LID task. It contains six of the ten target languages, but the utterances were recorded with different recording equipment and in different environments. The six languages are Cantonese, Indonesian, Japanese, Russian, Korean and Vietnamese. Each language has about 1800 utterances.
AP20-OLR-dialect-test: This subset is designed for the dialect identification task, including three dialects: Hokkien, Sichuanese and Shanghainese. Considering real-life situations, three other non-target languages, namely Mandarin, Malay and Thai, are included in this subset to form an open-set dialect identification task, and there may be some cross-channel utterances as well. Each dialect/language has about 1800 utterances.
AP20-OLR-noisy-test: This subset is designed for the noisy LID task. It contains five of the ten target languages, but the utterances were recorded in noisy environments (low SNR). The five languages are Cantonese, Japanese, Russian, Korean and Mandarin. Each language has about 1800 utterances.
3 AP20-OLR challenge
Following the definition of NIST LRE15 , the task of the LID challenge is defined as follows: Given a segment of speech and a language hypothesis (i.e., a target language of interest to be detected), the task is to decide whether that target language was in fact spoken in the given segment (yes or no), based on an automated analysis of the data contained in the segment.
The AP20-OLR challenge includes three tasks as follows:
Task 1: cross-channel LID is a closed-set identification task, which means the language of each utterance is among the known traditional target languages, but the utterances were recorded over different channels.
Task 2: dialect identification is an open-set identification task, in which three non-target languages are added to the test set alongside the three target dialects.
Task 3: noisy LID, where noisy test data of the target languages will be provided.
3.1 System input/output
The input to the LID system is a set of speech segments in unknown languages. For task 1 and task 3, these speech segments are within the six or five known target languages, respectively. For task 2, the three target dialects of the speech segments are the same as the three dialects in AP20-OLR-dialect. The task of the LID system is to determine the confidence that a language is contained in a speech segment. More specifically, for each speech segment, the LID system outputs a score vector $(c_1, c_2, \ldots, c_N)$, where $c_i$ represents the confidence that language $i$ is spoken in the speech segment. The scores should be comparable across languages and segments. This is consistent with the principles of LRE15, but differs from that of LRE09, where an explicit decision is required for each trial.
In summary, the output of an OLR submission will be a text file, where each line contains a speech segment plus a score vector for this segment, e.g.,
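As an illustration, each line pairs a segment ID with its per-language scores. The helper below is a hypothetical sketch: the segment ID format, the score ordering and the numeric precision are our assumptions, not the official submission format.

```python
# Hypothetical sketch of formatting one line of an OLR score file:
# a segment ID followed by one confidence score per candidate language.
def format_score_line(segment_id, scores):
    """Join a segment ID and its per-language scores into one output line."""
    return " ".join([segment_id] + [f"{s:.4f}" for s in scores])

line = format_score_line("utt_00001", [0.92, -1.35, -2.07])
print(line)  # utt_00001 0.9200 -1.3500 -2.0700
```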
3.2 Training condition
The use of additional training materials is forbidden, including the use of non-speech data for data augmentation purposes. The only resources allowed are: AP16-OL7, AP17-OL3, AP17-OLR-test, AP18-OLR-test, AP19-OLR-test, AP19-OLR-dev, AP20-OLR-dialect and THCHS30.
3.3 Test condition
All the trials should be processed. Scores of lost trials will be interpreted as $-\infty$.
The speech segments in each task should be processed independently, and each test segment in a group should be processed independently as well. Using knowledge from other test segments (e.g., the score distribution of all the test segments) is not allowed.
Using speaker information is not allowed.
Listening to any speech segments is not allowed.
3.4 Evaluation metrics
As in LRE15, the AP20-OLR challenge chooses the pair-wise cost

$$C(L_t, L_n) = P_{target} \, P_{miss}(L_t) + (1 - P_{target}) \, P_{fa}(L_t, L_n)$$

where $L_t$ and $L_n$ are the target and non-target languages, respectively, and $P_{miss}$ and $P_{fa}$ are the missing and false alarm probabilities, respectively. $P_{target}$ is the prior probability for the target language, which is set to 0.5 in the evaluation. The principal metric is then defined as the average of the above pair-wise performance:

$$C_{avg} = \frac{1}{N} \sum_{L_t} \left\{ P_{target} \, P_{miss}(L_t) + \frac{1 - P_{target}}{N - 1} \sum_{L_n} P_{fa}(L_t, L_n) \right\}$$

where $N$ is the number of languages and $P_{target} = 0.5$. For the open-set testing condition, all interfering languages are treated as one unknown language in the computation of $C_{avg}$. We have provided the evaluation scripts for system development.
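As a minimal sketch, the average pair-wise cost can be computed as below, assuming the per-language miss probabilities and pair-wise false-alarm probabilities have already been estimated from the trials (function and variable names are ours, not taken from the official evaluation scripts):

```python
# Sketch of the LRE15-style average cost: p_miss[t] is the miss probability
# for target language t; p_fa[t][n] is the false-alarm probability of
# deciding t when non-target language n was actually spoken.
def c_avg(p_miss, p_fa, p_target=0.5):
    n = len(p_miss)
    total = 0.0
    for t in range(n):
        # average false-alarm probability over the n-1 non-target languages
        fa = sum(p_fa[t][m] for m in range(n) if m != t) / (n - 1)
        total += p_target * p_miss[t] + (1 - p_target) * fa
    return total / n

# A perfect system (no misses, no false alarms) has zero cost.
print(c_avg([0.0, 0.0], [[0, 0], [0, 0]]))  # 0.0
```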
4 Baseline systems
Two types of baseline systems were constructed: the i-vector model baseline and the extended TDNN x-vector model baselines. Feature extraction and the back-ends were all conducted with Kaldi. To provide more options, we built the i-vector and x-vector models with Kaldi, and implemented an x-vector model with PyTorch as well. The Kaldi and PyTorch recipes of these baselines can be downloaded from the challenge website (http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2020).
We trained the baseline systems with a combined dataset including AP16-OL7, AP17-OL3 and AP17-OLR-test, and the target number of each system refers to the number of all languages, i.e., ten. Before training, we adopted data augmentation, including speed and volume perturbation, to increase the amount and diversity of the training data. For speed perturbation, we applied a speed factor of 0.9 or 1.1 to slow down or speed up the original recording. For volume perturbation, a random volume factor was applied. Finally, the two augmented copies of each original recording were added to the original data set to obtain a 3-fold combined training set.
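The 3-fold augmentation above can be sketched as follows. This is a toy stand-in, not the Kaldi/sox recipe: speed perturbation is approximated by crude index resampling, and volume perturbation by a random gain.

```python
import random

def speed_perturb(signal, factor):
    """Crude resampling by index mapping: factor 1.1 shortens the signal,
    0.9 lengthens it. (sox/Kaldi do proper resampling; this only
    illustrates the idea.)"""
    n_out = int(round(len(signal) / factor))
    return [signal[min(len(signal) - 1, int(i * factor))] for i in range(n_out)]

def volume_perturb(signal, gain):
    """Scale every sample by a gain factor."""
    return [s * gain for s in signal]

def augment(utterance, rng):
    """Return the original plus two perturbed copies -> 3-fold training set."""
    return [
        utterance,
        speed_perturb(utterance, rng.choice([0.9, 1.1])),
        volume_perturb(utterance, rng.uniform(0.5, 1.5)),
    ]

copies = augment([0.0, 0.1, 0.2, 0.3], random.Random(0))
print(len(copies))  # 3
```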
The acoustic features were 20-dimensional Mel-frequency cepstral coefficients (MFCCs) plus 3-dimensional pitch features, and energy-based VAD was used to filter out non-speech frames.
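A toy sketch of energy-based VAD is shown below; the thresholding rule (mean log energy plus an offset) is our assumption, not Kaldi's compute-vad configuration.

```python
import math

def energy_vad(frames, offset=0.0):
    """frames: list of per-frame sample lists. Returns a keep/drop decision
    per frame based on its log energy vs. the mean log energy + offset."""
    log_e = [math.log(sum(s * s for s in f) + 1e-10) for f in frames]
    thresh = sum(log_e) / len(log_e) + offset
    return [e > thresh for e in log_e]

frames = [[0.0] * 10, [0.5] * 10, [0.6] * 10]  # one silent frame, two voiced
print(energy_vad(frames))  # [False, True, True]
```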
The back-end was the same for all three tasks once the embeddings were extracted from the model. Linear discriminant analysis (LDA) trained on the enrollment set was employed to promote language-related information. The dimensionality of the LDA projection space was set to 100. After LDA projection and centering, logistic regression (LR) trained on the enrollment set was used to compute the score of a trial on a particular language.
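The back-end pipeline can be sketched with scikit-learn on toy embeddings. This is an assumed stand-in, not the Kaldi back-end: the toy data, cluster layout and 2-dimensional LDA space (100-dimensional in the baseline) are all ours.

```python
# Sketch of the back-end: LDA projection to promote language-related
# directions, centering, then multi-class logistic regression scoring.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lang, per_lang, dim = 3, 50, 20
# Toy "embeddings": one well-separated Gaussian cluster per language.
X = np.concatenate([rng.normal(loc=3.0 * k, size=(per_lang, dim))
                    for k in range(n_lang)])
y = np.repeat(np.arange(n_lang), per_lang)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)  # 100 dims in practice
Z = lda.transform(X)
Z -= Z.mean(axis=0)                    # centering after projection
lr = LogisticRegression(max_iter=1000).fit(Z, y)

scores = lr.predict_log_proba(Z[:1])   # per-language scores for one trial
print(scores.shape)  # (1, 3)
```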
4.1 i-vector system
The i-vector baseline system was constructed based on the i-vector model. The acoustic features were augmented by their first- and second-order derivatives, resulting in 69-dimensional feature vectors. The UBM involved 2,048 Gaussian components and the dimensionality of the i-vectors was 600.
4.2 x-vector system
Compared to the traditional x-vector, the extended TDNN x-vector structure uses a slightly wider temporal context in the TDNN layers and interleaves dense layers between the TDNN layers, which leads to a deeper x-vector model. This deep structure was trained to classify the languages in the training data with the cross-entropy (CE) loss. After training, embeddings called 'x-vectors' were extracted from the affine component of the penultimate layer. Two implementations of this model were built on Kaldi and PyTorch, respectively.
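A condensed PyTorch sketch of an x-vector-style network is given below; the layer sizes and depth are assumptions and far smaller than the extended TDNN baseline. It shows the three ingredients named above: frame-level TDNN layers (1-D convolutions), statistics pooling over time, and a segment-level embedding taken from the penultimate affine layer.

```python
import torch
import torch.nn as nn

class TinyXVector(nn.Module):
    def __init__(self, feat_dim=23, emb_dim=512, n_langs=10):
        super().__init__()
        # Frame-level TDNN layers: 1-D convolutions over time.
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, emb_dim)  # after stats pooling
        self.output = nn.Linear(emb_dim, n_langs)      # CE over languages

    def forward(self, x):                  # x: (batch, feat_dim, frames)
        h = self.frame(x)
        # Statistics pooling: mean and std over the time axis.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.embedding(stats)       # the 'x-vector'
        return self.output(torch.relu(xvec)), xvec

model = TinyXVector()
logits, xvec = model(torch.randn(4, 23, 100))  # 23 = 20 MFCC + 3 pitch dims
print(logits.shape, xvec.shape)  # torch.Size([4, 10]) torch.Size([4, 512])
```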
4.2.1 Implementation details on Kaldi
A chunk size between 60 and 80 frames was used with sequential sampling when preparing the training examples. The model was optimized with the SGD optimizer and a mini-batch size of 128. Kaldi's parallel training and sub-model fusion strategy was used.
[Table: baseline performance on the referenced development sets, covering cross-channel LID and dialect identification.]
[Table: baseline performance on the evaluation sets, covering cross-channel LID, dialect identification and noisy LID.]
4.2.2 Implementation details on Pytorch
The chunk size was 100 frames, with language-balanced sampling used when preparing the training examples. The language-balanced sampling ensured that the numbers of examples across languages were roughly the same, by repeatedly sampling from languages with fewer training frames. The model was optimized with the Adam optimizer and a mini-batch size of 512. Warm restarts were used to control the learning rate, and feature dropout was used to enhance robustness.
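The language-balanced sampling described above can be sketched as follows; the helper and its parameters are hypothetical, not the baseline's actual sampler.

```python
import random

def balanced_examples(utts_by_lang, per_lang, seed=0):
    """utts_by_lang: {lang: [utt ids]}. Draw per_lang examples per language,
    sampling with replacement so under-represented languages are repeated."""
    rng = random.Random(seed)
    examples = []
    for lang, utts in utts_by_lang.items():
        examples.extend((lang, rng.choice(utts)) for _ in range(per_lang))
    rng.shuffle(examples)
    return examples

# "ja" has half as many utterances as "zh" but contributes equally.
batch = balanced_examples({"zh": ["a", "b"], "ja": ["c"]}, per_lang=4)
print(len(batch))  # 8
```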
4.3 Performance results
The primary evaluation metric in AP20-OLR is $C_{avg}$. Besides that, we also present performance in terms of the equal error rate (EER). These metrics evaluate system performance from different perspectives, offering a whole picture of the capability of the tested system. The performance of the baselines is evaluated on the AP20-OLR-test database, but we also provide results on referenced development sets. For task 1, we choose the cross-channel subset of AP19-OLR-test as the referenced development set. For task 2, the dialect test subset of AP19-OLR-dev-task3, which contains the three target dialects (Hokkien, Sichuanese and Shanghainese), and the test subset of AP19-OLR-eval-task3, which contains three non-target (interfering) languages (Catalan, Greek and Telugu), are combined as the referenced development set. Since no noisy data set was provided in the past OLR challenges, we do not present a referenced development set for task 3. The referenced development sets are meant to help estimate system performance when participants reproduce the baseline systems or prepare their own systems; participants are also encouraged to design their own development sets.
4.3.1 Cross-channel LID
The first task identifies cross-channel utterances. The enrollment sets are subsets of the 3-fold combined training set mentioned above, in which the utterances of the same languages as in the test sets are retained, namely AP20-ref-dev-task1 and AP20-ref-enroll-task1. On the referenced development set, an inconsistency between the two metrics, $C_{avg}$ and EER, is observed: the x-vector model on PyTorch achieves the best $C_{avg}$, but its EER is much worse than that of the i-vector model. On the evaluation set, the x-vector model on PyTorch achieves the best performance in terms of both $C_{avg}$ and EER. Meanwhile, the trend of the baselines' performance differs between the referenced development set and the evaluation set, since the channel conditions might differ between these two test sets.
4.3.2 Dialect Identification
The second task identifies three target dialects from six languages/dialects. On the referenced development set, it should be noted that AP19-OLR-dev-task3-test contains only 500 utterances per target dialect, and two of the interfering languages in AP19-OLR-eval-task3-test are European languages, while the languages in the training set are all oriental languages; the performance of the PyTorch x-vector baseline is therefore relatively unsatisfactory, owing to its stronger fitting of the training data compared with the Kaldi models. On the evaluation set, the three non-target languages are Asian languages, and the best performance in terms of both $C_{avg}$ and EER is achieved by the PyTorch x-vector model. The Kaldi x-vector and i-vector models perform similarly on both the referenced development set and the evaluation set in terms of $C_{avg}$.
4.3.3 Noisy LID
The evaluation process for task 3 can be seen as identifying languages from five target languages under the noisy testing condition. No referenced development set is given for this task. The referenced enrollment set is a subset of the 3-fold combined training set mentioned above, in which the utterances of the same languages as in the test sets were retained, namely AP20-ref-enroll-task3. On the evaluation set, the x-vector model on PyTorch achieves the best performance in terms of both $C_{avg}$ and EER. The performance of the Kaldi x-vector and i-vector models is close on the evaluation set.
5 Conclusions
In this paper, we presented the data profile, the definitions of the three tasks, and the baselines of the AP20-OLR challenge. In this challenge, besides the data sets presented in past challenges, new dialect training/development data sets are provided for participants, and more languages are included in the test sets. The AP20-OLR challenge comprises three tasks: (1) cross-channel LID, (2) dialect identification and (3) noisy LID. The i-vector and x-vector frameworks, implemented with Kaldi and PyTorch, are provided as baseline systems to help participants construct a reasonable starting system. Given the results of the baseline systems, the three tasks defined by AP20-OLR are rather challenging and worthy of careful study. All the data resources are free for the participants, and the recipes of the baseline systems can be freely downloaded.
This work was supported by the National Natural Science Foundation of China No.61876160 and No.61633013.
We would like to thank Ming Li at Duke Kunshan University, Xiaolei Zhang at Northwestern Polytechnical University for their help in organizing this AP20-OLR challenge.
References
B. Comrie, G. Stone and M. Polinsky (1996) The Russian Language in the Twentieth Century. Oxford University Press.
A. Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
D. Povey et al. (2011) The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
S. R. Ramsey (1987) The Languages of China. Princeton University Press.
M. Shibatani (1990) The Languages of Japan. Cambridge University Press.
P. Sidwell and R. Blench (2011) The Austroasiatic Urheimat: the southeastern riverine hypothesis. In Dynamics of Human Diversity, pp. 315.
D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey and S. Khudanpur (2018) Spoken language recognition using x-vectors. In Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, pp. 105–111.
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
(2018) AP18-OLR challenge: three tasks and their baselines. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 596–600.
(2016) AP17-OLR challenge: data, plan, and baseline. In APSIPA ASC.
(2019) AP19-OLR challenge: three tasks and their baselines. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1917–1921.
NIST (2009) The 2009 NIST language recognition evaluation plan (LRE09), ver. 6.
NIST (2015) The 2015 NIST language recognition evaluation plan (LRE15), ver. 22-3.
J. Villalba et al. (2019) State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18. In Proc. Interspeech 2019, pp. 1488–1492.
(2016) AP16-OL7: a multilingual database for oriental languages and a language recognition baseline. In APSIPA ASC.
D. Wang and X. Zhang (2015) THCHS-30: a free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
(2017) M2ASR: ambitions and first year progress. In OCOCOSDA.