Spoken language identification, henceforth LID, is the problem of determining the identity of the language in a spoken utterance [li2013spoken]. In today’s globalized world, LID systems can facilitate a wide range of cross-lingual speech and communication technologies such as spoken language translation [waibel2000multilinguality, fugen2007simultaneous, bangalore2012real] and multilingual spoken document retrieval [chelba2008retrieval]. Furthermore, LID-aware transfer of language resources has been shown to be effective for multilingual ASR in low-resource settings [muller2016language, muller2015using, nguyen2014multilingual, cutler2014language].
Earlier work has addressed the LID task using the so-called phonotactic approach. In this paradigm, the acoustic signal is first transduced into a sequence of discrete symbols (e.g., phones), then probabilistic models are utilized to obtain language likelihoods [lamel1994language, li2005phonotactic] [kenny2010bayesian, garcia2011analysis, martinez2011language, su2016factor]. Currently, end-to-end deep neural networks (DNNs) are predominant for LID and outperform GMMs, especially for short utterances [mateju2018using, shen2018feature, shon2018convolutional, lopez2014automatic, gonzalez2014automatic].
The findings of the popular language guessing game, the Great Language Game [skirgaard2017some], have shown that discriminating between closely related languages is a difficult task for humans. On the other hand, neural LID models have shown striking performance discriminating between spoken varieties of Arabic [bulut2017utd, gelly2016language, shon2018convolutional], Slavic languages [mateju2018using], and languages in accented speech samples from multilingual speakers [titus2020improving]. For instance, the best neural LID model in [mateju2018using] reports an error rate as low as 1.2% when discriminating between 11 Slavic languages. Generally speaking, the impressive performance of DNN-based LID reported in the literature gives the impression that LID is almost a solved problem.
However, previous works have developed their models using disjoint splits of the same dataset, where the training and evaluation samples have similar, if not identical, acoustic conditions (i.e., same domain). The impact of dataset bias [tzeng2017adversarial] on LID robustness has not yet been investigated with a systematic evaluation across datasets. In this paper, we aim to fill this gap and focus on the challenging case of LID for short utterances of related languages (i.e., Slavic languages) in a cross-domain setting. We investigate the following questions:
RQ1 To what degree do neural LID models for related languages generalize to another domain with different acoustic conditions?
RQ2 Are different low-level speech features equally robust under domain mismatch?
RQ3 Can we adapt LID models to a new domain without using labelled data in the new domain? If yes, what are the factors that affect the adaptability of the model?
To address these research questions, we conduct a series of LID experiments with datasets from two domains: (1) Read speech recordings from the Slavic subset of the GlobalPhone speech database [schultz2013globalphone], and (2) Slavic broadcast recordings collected and distributed in [mateju2018using, nouza2016asr] for LID (RQ1). We also compare the performance of spectral (MFSCs) and cepstral (MFCCs) speech features within- and across-domain (RQ2). Finally, we apply adversarial domain confusion [ganin2015unsupervised] to adapt our model to a target domain, analyze predictions from the adapted model, and visualize its representations compared to the baseline (RQ3).
2 LID with Deep Neural Networks
2.1 Problem Definition
We define the LID task as a discriminative sequence classification problem. First, a variable-length utterance is transformed by an acoustic front-end into a sequence of acoustic observations $\mathbf{x}_{1:T} = (\mathbf{x}_1, \dots, \mathbf{x}_T)$, where $\mathbf{x}_t \in \mathbb{R}^{k}$ is a low-level feature vector at timestep $t$. Given a sequence $\mathbf{x}_{1:T}$, the goal is to predict the spoken language $\hat{y}$. Using a deep neural network as a classification model, the LID problem can be defined as
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid \mathbf{x}_{1:T}; \theta)$$
where $\mathcal{Y}$ is a finite set of languages, $\theta$ denotes the model’s parameters learned in a supervised approach, and $P(y \mid \mathbf{x}_{1:T}; \theta)$ represents a posterior probability of the language label.
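The decision rule above can be illustrated with a minimal sketch. The logits below are made-up numbers standing in for a network's output on one utterance, and the function names are hypothetical; only the softmax and the arg max correspond to the definition in the text.

```python
import math

# Label set: the six Slavic languages used in the experiments of this paper.
LANGUAGES = ["BUL", "HRV", "CZE", "POL", "RUS", "UKR"]

def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_language(logits, languages=LANGUAGES):
    """Return the language with the highest posterior, plus the full posterior."""
    posterior = softmax(logits)
    best = max(range(len(languages)), key=lambda i: posterior[i])
    return languages[best], posterior

# Made-up logits for illustration only (not taken from any trained model).
label, posterior = predict_language([0.2, -1.0, 2.5, 0.1, 0.0, -0.3])
```

Because softmax is monotonic, taking the arg max over the logits directly would select the same language; the posterior is only needed when calibrated scores are of interest.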
2.2 LID Model Overview
Our LID model consists of a 3-layer 1D convolutional network followed by a 2-layer fully-connected feed-forward network, as schematized in Fig. 1(a). We refer to the convolutional block $G_f$ as a high-level feature extractor that transforms the input sequence $\mathbf{x}_{1:T}$ into an $m$-dimensional feature vector $\mathbf{f}$, i.e., $\mathbf{f} = G_f(\mathbf{x}_{1:T}; \theta_f)$. Then, the feed-forward layers transform $\mathbf{f}$ into a logit vector $\mathbf{z}$ via a series of non-linear transformations, i.e., $\mathbf{z} = G_y(\mathbf{f}; \theta_y)$, followed by a softmax function that maps $\mathbf{z}$ into a probability distribution over the language space. We refer to the fully-connected block of the model $G_y$ as a language classifier. The parameters of the network, $\theta_f$ and $\theta_y$, are learned jointly in an end-to-end approach given a dataset $\mathcal{D} = \{(\mathbf{x}_{1:T}^{(i)}, y^{(i)})\}_{i=1}^{N}$ of labelled samples in one domain. The objective function is to minimize
$$E(\theta_f, \theta_y) = \mathcal{L}_y(\theta_f, \theta_y)$$
where $\mathcal{L}_y$ is the loss of the language classifier.
2.3 Domain Adaptive LID
In this paper, we explore a well-established domain adaptation technique that has been successfully applied to many vision and speech recognition problems [ganin2015unsupervised, shinohara2016adversarial, meng2017unsupervised]. This technique aims to minimize the discrepancy between two domains given a dataset $\mathcal{D}_{\mathrm{tgt}} = \{\mathbf{x}_{1:T}^{(j)}\}_{j=1}^{M}$ of unlabelled samples in the target domain, in addition to the source labelled samples $\mathcal{D}_{\mathrm{src}} = \{(\mathbf{x}_{1:T}^{(i)}, y^{(i)})\}_{i=1}^{N}$.
To improve the LID model’s out-of-domain generalization, the feature representations emerging from the model should be both language-discriminative and domain-invariant. This objective can be achieved if the model is encouraged during training to build up representations that are good predictors of the spoken language but do not encode domain-related information. To this end, a fully-connected feed-forward block $G_d$ is added to the network to predict the domain given $\mathbf{f}$ (see Fig. 1(b)). We view $G_d$ as a domain classifier with a separate set of parameters $\theta_d$, which are learned by exploiting the domain labels of source and target samples. That is, each training sample in the source domain is augmented with a domain label $d = 0$, while each training sample in the target domain is augmented with a domain label $d = 1$. We seek the parameters $\theta_d$ that minimize the loss of the domain classifier. On the other hand, the feature extractor is trained such that $\mathbf{f}$ is uninformative for the domain classifier. Thus, we seek the parameters $\theta_f$ that maximize the domain classifier loss. This procedure is an instance of adversarial learning where different blocks in the network are trained with competing objectives. The overall objective function is to minimize, with respect to $\theta_f$ and $\theta_y$,
$$E(\theta_f, \theta_y, \theta_d) = \mathcal{L}_y(\theta_f, \theta_y) - \lambda\, \mathcal{L}_d(\theta_f, \theta_d)$$
where $\mathcal{L}_y$ is the loss of the language classifier, $\mathcal{L}_d$ is the loss of the domain classifier, and $\lambda$ is a parameter that controls the contribution of the domain classifier’s loss to the overall loss, while $\theta_d$ is chosen to minimize $\mathcal{L}_d$. In practice, this adversarial loss is realized with a special layer that reverses the direction of the gradient signal coming from the domain classifier’s loss into the feature extractor during backpropagation, which is referred to as a gradient reversal layer. We refer the reader to the original paper for a detailed overview of the training procedure [ganin2015unsupervised].
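The effect of gradient reversal can be made concrete with a scalar toy model. This is only a sketch under strong simplifying assumptions: the "feature extractor" is a single weight `w_f`, the "domain classifier" is a single weight `w_d` followed by a sigmoid, and the gradients are written out by hand. All names are hypothetical; in a deep learning framework the same idea is usually expressed as a custom layer that is the identity in the forward pass and scales the gradient by a negative factor in the backward pass.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def domain_gradients(x, d, w_f, w_d, lam):
    """Gradients of the domain loss for a scalar toy model.

    f = w_f * x plays the role of the feature, p = sigmoid(w_d * f) is the
    predicted probability of the target domain (d = 1), and the domain loss
    is binary cross-entropy.
    """
    f = w_f * x                      # feature extractor (forward pass)
    p = sigmoid(w_d * f)             # domain classifier (forward pass)
    dloss_dlogit = p - d             # derivative of BCE w.r.t. the logit
    grad_w_d = dloss_dlogit * f      # ordinary gradient for the domain classifier
    grad_f = dloss_dlogit * w_d      # gradient arriving at the feature
    grad_f_reversed = -lam * grad_f  # gradient reversal: flip sign, scale by lambda
    grad_w_f = grad_f_reversed * x   # what the feature extractor receives
    return grad_w_d, grad_w_f

# One target-domain sample (d = 1) with arbitrary toy weights.
grad_w_d, grad_w_f = domain_gradients(x=1.0, d=1, w_f=0.5, w_d=2.0, lam=1.0)
```

A gradient descent step on `w_d` lowers the domain loss, while the same step on `w_f` uses the reversed gradient and therefore raises it, which is the competing behaviour described above.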
3 Experimental Setup
3.1 Datasets for Slavic LID
GlobalPhone Read Speech (GRS) We use the Slavic portion of the multilingual GlobalPhone speech database [schultz2013globalphone] which includes read speech recordings from native speakers of six Slavic languages: Bulgarian (BUL), Croatian (HRV), Czech (CZE), Polish (POL), Russian (RUS), and Ukrainian (UKR). The utterances vary in length and quality across languages. We set the minimum utterance length to 3 seconds and segment longer utterances into non-overlapping 3-second speech segments. Our final training subset consists of 8,000 utterances per language. We use the same splits as in .
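The segmentation step can be sketched as follows. The function name is hypothetical, the 16 kHz sampling rate is the one stated for both datasets, and dropping a trailing remainder shorter than 3 seconds is an assumption, since the text does not say how leftovers are handled.

```python
def segment_utterance(num_samples, sample_rate=16000, segment_seconds=3.0):
    """Return (start, end) sample indices of non-overlapping fixed-length segments.

    Utterances shorter than one segment yield no segments. A trailing
    remainder shorter than a full segment is dropped (assumption).
    """
    seg_len = int(round(sample_rate * segment_seconds))
    return [(start, start + seg_len)
            for start in range(0, num_samples - seg_len + 1, seg_len)]

# A 10-second utterance yields three 3-second segments; the last second is dropped.
segments = segment_utterance(10 * 16000)
```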
Radio Broadcast Speech (RBS) A large collection of Slavic recordings was compiled by harvesting online radio broadcasts in [mateju2018using, nouza2016asr]. The original dataset contains recordings for 11 Slavic languages. We use the same subset of six languages as in the GRS dataset. The extracted utterances are either segments of professional news reports or of spontaneous speech during discussions. Occasionally, the utterances include background music and different sorts of acoustic noise. We sample 8,000 and 500 utterances per language from the training split as our training and validation sets, respectively. This dataset does not include any speaker IDs. Thus, we cannot confirm whether training and evaluation speakers are disjoint.
3.2 Low-level Feature Extraction
In our experiments, we use the first 13 coefficients of MFSCs and MFCCs, with the zeroth coefficient being the average frame energy, as low-level speech features. While previous works usually refer to MFSCs as mel-filterbanks [shon2018convolutional], we use the term MFSCs to refer to mel-frequency spectral features that are correlated [mohameddeep]. Since both datasets in our study are sampled at 16 kHz, we extract frames of 400 samples with 160 samples overlap, which corresponds to 25 ms and 10 ms, respectively. We normalize the features to have utterance-level zero mean and unit variance.
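The framing arithmetic and the utterance-level normalization can be checked with a short sketch. Two assumptions are made here and are not confirmed by the text: the 160-sample figure is treated as the shift between consecutive frame starts, and no padding is applied at the utterance edges. The function names are hypothetical, and features are represented as plain lists of frames.

```python
import math

SAMPLE_RATE = 16000   # Hz, as stated in the text
FRAME_LENGTH = 400    # samples per frame
FRAME_STEP = 160      # samples; assumed to be the shift between frame starts

def samples_to_ms(n, sample_rate=SAMPLE_RATE):
    """Convert a number of samples to milliseconds."""
    return 1000.0 * n / sample_rate

def num_frames(num_samples, frame_length=FRAME_LENGTH, frame_step=FRAME_STEP):
    """Number of full frames when no padding is applied (assumption)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step

def normalize_utterance(features):
    """Per-dimension zero mean and unit variance over one utterance.

    `features` is a list of frames, each a list of coefficients.
    Constant dimensions are left at zero instead of dividing by zero.
    """
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features]

# Toy 3-frame, 2-dimensional "utterance" for illustration.
normalized = normalize_utterance([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```

Under these assumptions, a 3-second utterance (48,000 samples) yields 298 frames.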
3.3 Model Architecture and Hyperparameters
We use a 3-layer 1D convolution over the temporal dimension with 128, 256, and 512 filters and widths of 5, 10, and 10 for each layer, respectively, and keep the stride at 1. We apply batch normalization and ReLU non-linearity following each convolutional operation. We apply max pooling to downsample the representation only at the end of the convolution block. For the language classifier, we use 2 fully-connected layers (512 → 512 → 6) before the softmax for both the non-adapted and the adapted LID models.
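To make the dimensions of the convolutional block easier to follow, the sketch below traces the temporal length and the number of convolution weights through the three layers. It assumes 13-dimensional input frames, no padding, and that the final max pooling is taken over the whole time axis; the padding and the pooling span are assumptions not stated in the text. Batch normalization parameters are not counted.

```python
# (filters, kernel width) for the three 1D convolutions described in the text.
CONV_LAYERS = [(128, 5), (256, 10), (512, 10)]

def conv_output_length(length, kernel_width, stride=1, padding=0):
    """Standard output-length formula for a 1D convolution."""
    return (length + 2 * padding - kernel_width) // stride + 1

def trace_conv_block(num_frames, input_dim=13):
    """Return [(channels, time, params)] per layer and the pooled vector size."""
    layers = []
    channels, length = input_dim, num_frames
    for filters, width in CONV_LAYERS:
        params = channels * width * filters + filters  # weights + biases
        length = conv_output_length(length, width)     # no padding (assumption)
        channels = filters
        layers.append((channels, length, params))
    pooled_dim = channels  # global max pooling over time (assumption)
    return layers, pooled_dim

# 298 frames is an illustrative input length for a 3-second utterance.
layers, pooled_dim = trace_conv_block(298)
```

Under these assumptions, the pooled representation has 512 dimensions, which matches the input size of the fully-connected language classifier.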
Domain-Adaptive Model For our adapted LID models, we use a 3-layer feed-forward network (512 → 1024 → 1024 → 2) as the domain classifier. For the adaptation factor $\lambda$, we use a gradually increasing value to suppress the noise from the feature extractor during the initial phase of the training procedure. We experiment with two variants of the domain-adaptive model: (1) DA-LID I: an identical configuration to [ganin2015unsupervised], where the convolutional block of the model is considered as the feature extractor, and (2) DA-LID II: we consider the feature extractor as the convolutional block as well as the first layer of the fully-connected block; thus, the reversed gradient signal from the domain classifier is back-propagated into all layers of the network except the final layer before the softmax of the language classifier.
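The text does not specify the exact schedule for the gradually increasing adaptation factor. One common choice, shown here purely as an assumption, is the schedule proposed in [ganin2015unsupervised], where the factor grows from 0 towards 1 as a function of training progress. The function name and the default `gamma=10.0` follow that work and are not confirmed for the present experiments.

```python
import math

def adaptation_factor(progress, gamma=10.0):
    """Schedule from the cited work: rises smoothly from 0 towards 1.

    `progress` is the fraction of training completed, in [0, 1].
    Whether this exact schedule was used here is an assumption.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

# Adaptation factor at the start, midpoint, and end of training.
schedule = [adaptation_factor(p) for p in (0.0, 0.5, 1.0)]
```

Starting at zero means the feature extractor initially receives no signal from the domain classifier, which is still poorly trained at that stage.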
Training Details We use cross-entropy loss for both $\mathcal{L}_y$ and $\mathcal{L}_d$. The Adam optimizer is used with a learning rate of 0.001. We train our models with a batch size of 256 for 50 epochs and observe the validation performance during training.
4 Experimental Results
We now present and discuss the results of our experiments. To make the results comparable across datasets and prevent undesirable effects due to utterance length mismatch, we train and evaluate each of our LID models on 3-second utterances. Since the GRS evaluation data is imbalanced, we report balanced accuracy [brodersen2010balanced]. We also evaluated our models with average cost ($C_{\mathrm{avg}}$), which we do not report for the sake of conciseness.
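Balanced accuracy is the mean of the per-class recalls, so each language contributes equally regardless of how many evaluation utterances it has. A minimal sketch follows, with a hypothetical function name and made-up labels.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every language counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Made-up imbalanced example: 8 Czech and 2 Ukrainian utterances.
y_true = ["CZE"] * 8 + ["UKR"] * 2
y_pred = ["CZE"] * 8 + ["CZE", "UKR"]
score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case, plain accuracy is 0.9, while balanced accuracy is 0.75 because half of the minority-class utterances are misclassified.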
4.1 Cross-Domain Evaluation
Table 1 presents the results of the cross-domain evaluation on both datasets without adaptation. Even though our LID models are not heavily regularized, the in-domain performance is always above 95%, and MFSC and MFCC features yield comparable performance. On the other hand, out-of-domain (OOD) evaluation shows a considerable drop in accuracy in each cross-domain setting. It is interesting to observe that the drop in accuracy is more pronounced for MFCC features, and MFSCs seem to be more robust under domain shift. The impact of domain shift is more pronounced in the GRS → RBS direction.
4.2 Adaptation Results
In our adaptation experiments, we investigate two transfer tasks: GRS → RBS and RBS → GRS. The results are shown in Table 2. The adapted models consistently improve the accuracy compared to the source-only non-adapted baseline with both features and in both directions. Our DA-LID II model yields the best results, which suggests that the domain discrepancy is present not only in the convolutional layers, but also in the fully-connected layers that are more distant from the input. We discuss the results for each direction below.
|Direction||Features||None||DA-LID I||DA-LID II|
|GRS → RBS||MFSCs||43.27||47.22 (+9.1)||50.56 (+16.6)|
|GRS → RBS||MFCCs||37.80||41.77 (+12.6)||44.55 (+17.9)|
|RBS → GRS||MFSCs||54.01||72.12 (+33.5)||86.99 (+61.1)|
|RBS → GRS||MFCCs||50.97||66.50 (+30.5)||90.56 (+77.7)|
RBS → GRS Both adapted models yield significant improvements over non-adapted models. The MFCC-based DA-LID II boosts OOD accuracy from 50.97% to 90.56% with a relative accuracy gain of 77.7%.
GRS → RBS Even though adapted models improve over the baseline, the improvements in this direction are less impressive than what is observed in the RBS → GRS direction. Our MFSC-based DA-LID II model performs best and improves the accuracy by 16.6% compared to the baseline.
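The figures in parentheses in the table are relative gains over the non-adapted baseline. As a sanity check, the sketch below recomputes them for the DA-LID II entries of the RBS → GRS direction, using the accuracies from the table. The function name is hypothetical.

```python
def relative_gain(baseline, adapted):
    """Relative accuracy gain over the baseline, in percent."""
    return 100.0 * (adapted - baseline) / baseline

# RBS -> GRS rows of the table above (accuracies in percent).
mfsc_gain = relative_gain(54.01, 86.99)  # MFSCs, DA-LID II
mfcc_gain = relative_gain(50.97, 90.56)  # MFCCs, DA-LID II
```

Rounded to one decimal place, these evaluate to 61.1 and 77.7, matching the table. Note that a relative gain is not a difference in percentage points: the MFCC-based model gains about 39.6 points in absolute terms.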
The performance gap between the two directions in our experiments was surprising at first. In retrospect, it should not be, as the two directions are not equally challenging. The RBS dataset is more diverse in terms of the number of unique speakers and background noise. An LID model trained on the RBS dataset has learned to extract language ID features from noisy speech signals, thus it is expected to be more generic and perform well on clean speech signals even under domain shift. This finding is consistent with what has been reported in the domain adaptation literature on how source domain diversity affects adaptability of the model to new domains [ganin2015unsupervised]. On the other hand, if the model has not been exposed to noisy speech signals during training, it is unlikely to perform well on noisy signals even if the representation discrepancy has been minimized, which is the case in the GRS → RBS direction. This suggests that alternative adversarial training procedures that add noise to the input representation could be explored to encourage the model to transform the noisy input signals into noise-robust representations. Moreover, our experiments show that MFCCs are more sensitive to input variations due to domain shift, thus MFCC-based models in both directions tend to benefit more from adaptation in terms of relative accuracy gain compared to their MFSC-based counterparts, with only one exception.
5 Adaptive Model Analysis
In this section, we seek to understand why unsupervised adaptation with adversarial training improves OOD performance. We analyze the results of the RBS → GRS transfer task to get insights into the factors that lead to the significant improvement.
5.1 Fine-grained Performance Analysis
Fig. 2 shows the performance per language measured by $F_1$ score. In the non-adapted case, we observe a much higher variance between languages compared to the adapted models. For example, while the non-adapted model achieves up to 70% on Czech, it drops to 18.6% on Ukrainian, which is only slightly better than chance level (16.7%). We inspected the performance on Ukrainian in the other direction and found that the $F_1$ score is even worse than chance level. We hypothesized that the acoustic conditions of the Czech recordings in the two domains are similar, while the discrepancy is maximal in the case of Ukrainian. To validate this hypothesis, we manually inspected several Ukrainian utterances from the GRS dataset. We found that most utterances are characterized by unnatural pauses and hesitations that distort the speech signal and are uniformly distributed across Ukrainian training and evaluation speakers in the GRS dataset. This effect adds to the discrepancy due to domain shift since RBS utterances are more naturally flowing speech than the read speech from the GRS dataset, despite the occasional background noise. In particular, this effect creates abnormal patterns that hinder non-adapted LID performance in two ways: (1) if these patterns are not uniformly distributed across languages and observed during training, the network exploits them as shortcuts to discriminate between languages, and (2) if these patterns are encountered during OOD inference, the distorted signal causes a failure because the model has not been exposed to such patterns during training. Both cases lead to poor OOD generalization when training on a single-domain dataset. However, since these patterns are only present in one dataset, they are good predictors of the domain. Therefore, adversarial training with domain confusion prevents the models from exploiting such dataset-specific artifacts, which consistently yields a better OOD generalization. The advantage of adversarial training is demonstrated in Fig. 2. Our adapted model boosts the $F_1$ score on Ukrainian from 18.6% to 96.0%, which is surprisingly the highest in this direction.
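For reference, a per-language score of this kind can be computed as sketched below. The sketch assumes that the per-language score is the $F_1$ score, i.e., the harmonic mean of precision and recall for one language treated as the positive class. The labels are made up, and the function name is hypothetical. The chance level of 16.7% corresponds to uniform guessing among six languages.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one language: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this language
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up predictions for illustration only.
y_true = ["UKR", "UKR", "CZE", "CZE"]
y_pred = ["UKR", "CZE", "CZE", "CZE"]
f1_ukr = per_class_f1(y_true, y_pred, "UKR")
f1_cze = per_class_f1(y_true, y_pred, "CZE")

chance_level = 1 / 6  # six languages, uniform guessing
```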
5.2 Visualizing the Representations
In Fig. 3, we visualize the representations using the t-SNE algorithm [maaten2008visualizing]. We sample a set of 1800 data points from each domain and obtain the representations from the last hidden layer of the MFCC-based LID models: (a) source-only non-adapted LID, (b) DA-LID I, and (c) DA-LID II. Fig. 3 shows how adaptation aligns the distributions of the extracted representations from the two domains.
We have investigated the problem of spoken language identification for closely related languages in a cross-domain setting, using deep convolutional neural networks as discriminative models. While our experiments have confirmed that they perform very well within-domain, our cross-domain evaluation has revealed that neural models generalize poorly to a novel dataset with acoustic conditions that differ from those observed during training. To improve the robustness of our models against domain mismatch, we have applied unsupervised domain adaptation with gradient reversal and shown that our adaptive models generalize better across domains. Our analysis has shown that adversarial training prevents the model from exploiting dataset-specific artifacts, thus leading to better out-of-domain generalization. We have identified the diversity of the speech samples in the source domain as the major factor that affects the adaptability of the model to a new domain. Given a diverse source dataset, our adaptive models achieved relative accuracy improvements of up to 77.7%.
We would like to thank the anonymous reviewers for their insightful suggestions and comments. We extend our gratitude to Marius Mosbach for his valuable feedback and fruitful discussions on this research. This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project ID 232722074, SFB 1102.