Improved Language Identification Through Cross-Lingual Self-Supervised Learning

07/08/2021 ∙ by Andros Tjandra, et al. ∙ Facebook

Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrained speech in multiple languages and not just on English. We show that models pre-trained on many languages perform better and enable language identification systems that require very little labeled data to perform well. Results on a 25 language setup show that with only 10 minutes of labeled data per language, a cross-lingually pre-trained model can achieve over 93% accuracy.






1 Introduction

Automatic speech recognition (ASR) has seen large improvements through better modeling [9, 11, 29] and the use of unlabeled data [25, 31, 22, 6, 2]. Despite a sizeable body of work on multilingual speech recognition [4, 20, 12, 3, 5, 26, 16, 19, 23], the vast majority of systems are trained for a single language. However, in many real-world settings, we wish to transcribe speech data in different languages and it is crucial to route utterances to the system trained for the language at hand. Language identification (LID) is typically used to identify the language of an utterance and the accuracy of this component is crucial to prevent poor ASR performance.

Language identification has been tackled with conventional methods [33] as well as with modern neural networks [21]. Most of these approaches are trained purely with labeled data; however, unlabeled data is typically much easier to collect. Self-supervised learning leverages unlabeled data to obtain good data representations that can then be fine-tuned for a particular downstream task [27, 6, 24, 2].

Prior work on LID has explored the use of a wav2vec 2.0 model pre-trained only on English data [10]. In this paper we extend this work by considering cross-lingually trained self-supervised models [7]. In particular, we pre-train models on up to 25 different languages and then fine-tune them with as little as one hour of labeled data for LID to enable systems for low-resource languages. The audio data used here is sampled from public social media videos, which presents unique challenges such as a variety of speaking styles and variable recording quality. Moreover, our approach does not use any auxiliary features as are commonly used to improve performance. We also investigate different pooling strategies to aggregate the pre-trained context representations in order to perform LID. We adapt wav2vec 2.0 to use log-mel spectrogram features as input, similar to [32]. Our experiments show strong performance using as little as ten minutes of labeled data per language compared to models trained on much larger amounts of labeled data. Furthermore, multilingual pre-trained models achieve better performance than models trained on only a single language.

Figure 1: a) Log-mel wav2vec architecture.  b) An illustration of how wav2vec generates context representations and solves the contrastive task.

2 Pre-training Wav2Vec

In this section, we describe our modifications to the original wav2vec 2.0 model architecture and the cross-lingual training strategy.

2.1 Log-spectrogram Wav2Vec

A wav2vec model [2, 24] consists of multiple convolution and Transformer [28] layers and it operates on top of raw-waveform features. In this paper, we use log-mel spectrogram [32] as the input features instead of raw-waveform. We present our modified wav2vec architecture in Figure 1(a).

We define our input feature as X ∈ R^(T×d), where T is the number of frames in an utterance and d is the input dimension for each frame (e.g., d = 80 for an 80-dimensional log-mel spectrogram). From here, the input features are fed into a feature encoder which is followed by a context encoder.

2.1.1 Feature encoder

A feature encoder takes the input X and outputs latent speech representations Z. The feature encoder consists of a time-stacking layer and a linear layer. Due to the quadratic time and memory cost of Transformer layers, reducing the input length greatly improves training and inference efficiency. The time-stacking layer is defined as a function f : R^(T×d) → R^((T/k)×(k·d)), which stacks k consecutive frames into a single frame.
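As a concrete illustration, the time-stacking step can be sketched as follows (a minimal numpy sketch; the function name and the 4× factor, implied by the 80→320 dimensions in Sec. 5.2, are our assumptions):

```python
import numpy as np

def time_stack(x: np.ndarray, k: int) -> np.ndarray:
    """Stack every k consecutive frames of a (T, d) feature matrix
    into one frame, yielding a (T // k, k * d) matrix."""
    T, d = x.shape
    T_out = T // k
    # Drop trailing frames that do not fill a complete stack.
    return x[: T_out * k].reshape(T_out, k * d)

# 80-dim log-mel features stacked by a factor of 4 -> 320-dim frames
feats = np.random.randn(601, 80)
print(time_stack(feats, k=4).shape)  # (150, 320)
```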

2.1.2 Context encoder

A context encoder takes the input produced by the feature encoder block and outputs context representations C. The context encoder consists of a linear layer, layer normalization [18], a convolution layer, multiple Transformer layers [28], and another linear layer. Compared to standard Transformers, wav2vec 2.0 replaces fixed positional embeddings with relative positional embeddings computed by a temporal convolution followed by a GELU activation function. A residual connection sums the output of the convolution with its input, and layer normalization is applied. We feed the output into multiple Transformer layers, which are followed by a linear projection.

2.2 Quantization block

A quantization block takes the output Z of the feature encoder layer and produces a quantized representation Q. A linear layer is first applied on top of the input Z. Product quantization [15, 1] with G groups and V codebook entries per group is then applied to the projected input. The results of the G groups are concatenated and another linear layer is applied to generate the quantized representation Q. A Gumbel softmax function [14] makes the discrete choice of a codebook index by maximum value fully differentiable.
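To make the selection mechanism concrete, here is a numpy sketch of hard codebook selection via the Gumbel-max trick (naming and shapes are our assumptions; in training, the softmax relaxation with a straight-through estimator is what makes the choice differentiable, which plain numpy does not capture):

```python
import numpy as np

def gumbel_select(logits: np.ndarray, codebook: np.ndarray,
                  tau: float = 1.0, rng=None) -> np.ndarray:
    """Choose one entry per group from a (G, V, e) codebook using
    (G, V) logits perturbed by Gumbel noise, then concatenate groups."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    idx = np.argmax((logits + gumbel) / tau, axis=-1)   # (G,) hard choices
    chosen = codebook[np.arange(len(idx)), idx]         # (G, e)
    return np.concatenate(chosen)                       # (G * e,)

G, V, e = 2, 320, 384   # 2 groups, 320 entries per group (cf. Sec. 5.2)
q = gumbel_select(np.random.randn(G, V), np.random.randn(G, V, e))
print(q.shape)  # (768,)
```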

3 Cross-lingual pre-training

For multilingual pre-training, we collected large quantities of unlabeled speech in multiple languages and combined them into a single multilingual dataset to train cross-lingual speech representations (XLSR; [7]). The audio data used here is sampled from public social media videos, involving unconstrained, natural speech that is filled with background music and noise, with various speaking styles, disfluencies, accents, and un-cued speaker and language switching. This presents an interesting and challenging application of self-supervised learning that directly complements other recent work on self-supervised learning, which focused on datasets based on audio-books that are clean and restricted to a single domain [24, 2, 32], with a few exceptions [13].

In our setting, the amount of data varies for each language, and we therefore follow the procedure of [7]. For each batch, we sample the data from a multinomial distribution p_l ∝ (n_l / N)^α, where n_l is the total number of hours of language l, N = Σ_l n_l is the total number of hours, and α is a smoothing hyper-parameter that controls the balance between high-resource and low-resource languages.
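The resampling weights can be computed as below (a sketch; the paper's text does not spell out the α values used, so the numbers here are purely illustrative):

```python
import numpy as np

def sampling_weights(hours, alpha: float):
    """p_l ∝ (n_l / N)^alpha; alpha = 1 samples proportionally to data
    size, while smaller alpha upweights low-resource languages."""
    n = np.asarray(hours, dtype=float)
    p = (n / n.sum()) ** alpha
    return p / p.sum()

hours = [5000, 500, 50]                 # hypothetical per-language hours
print(sampling_weights(hours, 1.0))     # proportional to raw hours
print(sampling_weights(hours, 0.5))     # flatter across languages
```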

Figure 1(b) illustrates how the blocks interact to solve the contrastive task. The model needs to identify the true quantized latent q_t among a set of K + 1 candidates Q_t. The K false quantized latents are uniformly sampled from any masked time-step. We define this contrastive loss as:

    L_m = -log( exp(sim(c_t, q_t) / κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃) / κ) )

where sim(a, b) = aᵀb / (‖a‖ ‖b‖) is the cosine similarity between context representations and quantized latent speech representations, and κ is a temperature.
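A numpy sketch of this contrastive term for a single masked time-step (function and variable names are ours; the temperature value is illustrative):

```python
import numpy as np

def contrastive_loss(c_t, q_pos, q_negs, kappa=0.1):
    """-log softmax over cosine similarities between the context vector
    c_t and the true quantized latent plus K distractors."""
    cands = np.vstack([q_pos[None, :], q_negs])           # (K + 1, e)
    sims = cands @ c_t / (np.linalg.norm(cands, axis=1)
                          * np.linalg.norm(c_t))
    logits = sims / kappa
    logits -= logits.max()                                # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
c = rng.standard_normal(768)
negs = rng.standard_normal((100, 768))
# Loss is near zero when the positive matches c_t, large otherwise.
print(contrastive_loss(c, c, negs), contrastive_loss(c, negs[0], negs[1:]))
```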

We add a diversity loss to encourage equal usage of the codebook entries in each codebook group (Sec. 2.2). The diversity loss is designed to maximize the entropy of the averaged softmax probability over the codebook entries for each codebook group g. We define the loss as:

    L_d = (1 / (G·V)) · Σ_{g=1}^{G} Σ_{v=1}^{V} p̄_{g,v} · log p̄_{g,v}

where p̄_{g,v} is the softmax probability, computed without Gumbel noise and averaged over samples and time-steps, of codebook entry v in group g.
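In code, the diversity term amounts to the following (a sketch; `pbar` is our name for the averaged probabilities p̄):

```python
import numpy as np

def diversity_loss(pbar: np.ndarray) -> float:
    """Scaled negative entropy of the averaged codebook probabilities
    pbar of shape (G, V); minimized when every entry is used equally."""
    G, V = pbar.shape
    return float(np.sum(pbar * np.log(pbar + 1e-9)) / (G * V))

uniform = np.full((2, 320), 1 / 320)           # every entry used equally
peaked = np.zeros((2, 320)); peaked[:, 0] = 1  # codebook collapse
print(diversity_loss(uniform) < diversity_loss(peaked))  # True
```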

Our final loss is defined as:

    L = L_m + λ · L_d

where λ is a hyperparameter that controls the weight of the diversity loss.

4 LID finetuning

Figure 2: A Wav2Vec encoder with pooling and softmax projection layer for utterance-level language id (LID) classification.

We illustrate our LID classifier architecture in Figure 2. After pre-training a log-mel wav2vec model, we take the latest checkpoint and use it to initialize the bottom part of an LID classifier. The context representations C are summarized by a pooling function g. We explore several pooling operations:

  1. Average pooling.

  2. Max-pooling.

  3. Concatenated average and max-pooling.
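The pooling options above can be sketched as follows (numpy, our naming):

```python
import numpy as np

def pool(c: np.ndarray, strategy: str) -> np.ndarray:
    """Aggregate context representations c of shape (T', d) into one vector."""
    if strategy == "avg":
        return c.mean(axis=0)
    if strategy == "max":
        return c.max(axis=0)
    if strategy == "avg+max":
        return np.concatenate([c.mean(axis=0), c.max(axis=0)])
    raise ValueError(f"unknown strategy: {strategy}")

c = np.random.randn(300, 768)   # 300 frames of 768-dim context vectors
print(pool(c, "avg").shape, pool(c, "avg+max").shape)  # (768,) (1536,)
```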


After the pooling layer, we append a randomly initialized linear layer with L output dimensions, one per language, and normalize the output via a softmax function.

We minimize the cross-entropy loss:

    L_CE = -Σ_{l=1}^{L} y_l · log p(l | x)

where p(l | x) is the model probability that speech utterance x belongs to language l, and y_l = 1 if l is the target class, otherwise y_l = 0.
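Putting the head together, a numpy sketch of the projection, softmax, and loss for one utterance (weight names and sizes are our assumptions):

```python
import numpy as np

def lid_loss(pooled, W, b, target):
    """Project the pooled vector to L language logits, apply a softmax,
    and return the cross-entropy against the target language index."""
    logits = pooled @ W + b
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

d, L = 768, 25
rng = np.random.default_rng(0)
loss = lid_loss(rng.standard_normal(d), np.zeros((d, L)), np.zeros(L), 3)
print(round(loss, 3))  # log(25) ≈ 3.219 for a uniform classifier
```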

5 Experimental setup

5.1 Dataset

Figure 3: List of 25 languages with the number of unlabeled hours used for pre-training.

We conducted the experiments on our in-house datasets. The training data consists of de-identified public videos with no personally identifiable information (PII), from which only the audio is used. We show the list of all 25 languages used in this paper and the amount of unlabeled data in Figure 3. We consider English (en), Spanish (es), Arabic (ar), Indonesian (id), Vietnamese (vi), Portuguese (pt), Thai (th), Hindi (hi), Italian (it), French (fr), Turkish (tr), Tagalog (tl), Urdu (ur), German (de), Chinese (zh), Malayalam (ml), Bengali (bn), Russian (ru), Burmese (my), Malay (ms), Tamil (ta), Marathi (mr), Kannada (kn), Sinhalese (si), and Japanese (ja). The dataset contains a substantial amount of accented speech and covers local dialects of languages that are spoken across the globe.

For the input features, we extract an 80-dimensional log-mel spectrogram with a 25 millisecond window size and a 10 millisecond step size. We normalize each feature dimension by subtracting the mean and dividing by the standard deviation. We compute these statistics from a small portion of the pre-training data.
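For instance (a sketch with synthetic numbers standing in for real log-mel features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Estimate per-dimension statistics on a small sample of pre-training data.
sample = rng.standard_normal((10_000, 80)) * 3.0 + 5.0
mean, std = sample.mean(axis=0), sample.std(axis=0)

# Normalize an utterance's (T, 80) log-mel features with those statistics.
feats = rng.standard_normal((200, 80)) * 3.0 + 5.0
normalized = (feats - mean) / std
print(normalized.mean(), normalized.std())  # close to 0 and 1, respectively
```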

5.2 Pre-training setup

Here, we describe each module configuration inside our pre-trained wav2vec model. Inside a feature encoder, we have:

  • Time-stride layer: reduces the input sequence length by a factor of 4 (80 input dimensions, 320 output dimensions).

  • Linear layer: 320 input dimension, 512 output dimension.

Inside a context encoder, we have:

  • Linear layer: 512 input dimension, 1024 output dimension.

  • 1D Convolution layer: 1024 input dimension, 1024 output dimension, kernel size 48, filter groups 16.

  • Transformer: 24 layers, 1024 input dimension, 16 attention heads, 4096 feedforward dimension, GELU activation function, pre-layer norm [30].

  • Linear layer: 1024 input dimension, 768 output dimension.

To generate quantized targets for the contrastive task, we feed the latent speech representations Z into a quantization block with:

  • Linear layer: 512 input dimension, 768 output dimensions.

  • Gumbel VQ: 320 codebook entries, 2 groups.

  • Linear layer: 768 input and output dimension.

For masking over the latent speech representations Z, we sample a set of starting indices and mask the subsequent frames at each. Overall, this model has 300 million parameters.

We pre-trained three models with the same architecture but different inputs:

  1. wav2vec 2.0 En is trained on English only,

  2. XLSR-7 is trained on 7 languages (en, es, fr, de, ru, my, ja). We resample the data with smoothing parameter α (Sec. 3).

  3. XLSR-25 is trained on all 25 languages. We resample the data with smoothing parameter α (Sec. 3).

We use the same diversity loss weight λ for all experiments. All models are trained using the Adam optimizer [17], with one learning rate for wav2vec 2.0 En and another for XLSR-{7,25}, for up to 300,000 updates. We also apply weight decay. We anneal the learning rate with a linear decay schedule: warm-up for the first 32,000 updates, then linear decay to 0. We crop the input sequence length to at most 2,000 frames (equal to 20 seconds). For each update step, wav2vec 2.0 En computes the gradient from 18 hours of data, and XLSR-{7,25} compute the gradient from 36 hours of data.

5.3 Fine-tuning setup

In the fine-tuning step, we randomly crop the audio into 6-second chunks and extract an 80-dimensional log-mel spectrogram. On top of the wav2vec encoder, we add a pooling layer and a linear layer with L output dimensions. We prepare fine-tuning datasets with different numbers of languages, 7 languages (en, es, fr, de, ru, my, ja) and 25 languages (all languages in Figure 3), and with different amounts of supervised data per language: 10 minutes, 1 hour, 10 hours, 100 hours, and 1000 hours. All fine-tuning models are trained with the Adam optimizer, a fixed learning rate schedule with a 10% warm-up phase, and weight decay. We consider several fine-tuning scenarios: from scratch (randomly initialized wav2vec), from the monolingual wav2vec 2.0 En checkpoint, from the XLSR-7 checkpoint, and from the XLSR-25 checkpoint.

6 Results

Labeled data   Model            Accuracy (%)
1 h            Scratch          45.5
               wav2vec 2.0 En   93.3
               XLSR-7           98.0
               XLSR-25          97.6
10 h           Scratch          77.9
               wav2vec 2.0 En   98.6
               XLSR-7           98.8
               XLSR-25          98.5
100 h          Scratch          89.1
               wav2vec 2.0 En   99.2
               XLSR-7           99.1
               XLSR-25          99.2
Table 1: LID test accuracy for the seven language setup (English (en), Spanish (es), German (de), French (fr), Russian (ru), Burmese (my), Japanese (ja)) using different amounts of labeled training data per language (1 h - 100 h). We compare training an LID model only on labeled data (Scratch), pre-training on English data only (wav2vec 2.0 En) as well as on seven or 25 languages (XLSR-7/XLSR-25), followed by fine-tuning on the labeled data.

6.1 Language identification for seven languages

We first report the benefit of pre-training for language identification on a small seven language setup. Table 1 shows that pre-training is particularly beneficial when little labeled training data is available (Scratch vs. the other models). The test accuracy is computed on a test set containing a total of 1200 hours across the 7 languages. Monolingual pre-training (wav2vec 2.0 En) is very competitive, but cross-lingual pre-training performs better when very little labeled data is available (e.g., 1 h). Increased amounts of labeled data enable the model to learn more about the different languages, as can be seen in the improved performance of Scratch at 100 h; with very little labeled data, however, this effect is limited. XLSR-25 generally performs slightly worse than XLSR-7, and we attribute this to the fact that the former learns representations for many more languages than required for the seven language LID task. This splits model capacity and results in slightly better performance for XLSR-7.

The ability to train LID models with very little labeled data is important in order to be able to extend speech technology to the thousands of languages and dialects spoken around the world.

Labeled data   Model            Accuracy (%)
10 min         Scratch          14.8
               wav2vec 2.0 En   63.4
               XLSR-7           72.6
               XLSR-25          93.5
1 h            Scratch          32.0
               wav2vec 2.0 En   90.6
               XLSR-7           91.0
               XLSR-25          96.0
10 h           Scratch          66.6
               wav2vec 2.0 En   96.4
               XLSR-7           95.5
               XLSR-25          97.0
100 h          Scratch          89.8
               wav2vec 2.0 En   98.2
               XLSR-7           97.9
               XLSR-25          98.0
1000 h         Scratch          91.3
               wav2vec 2.0 En   98.4
               XLSR-7           98.2
               XLSR-25          98.4
Table 2: LID test accuracy for our 25 language setup (cf. Table 1).

6.2 Language identification for 25 languages

Next, we consider a larger setup in which the classifier needs to discriminate between 25 different languages. The test accuracy for the 25 language experiments is computed on a test set containing a total of 3700 hours across the 25 languages. Table 2 shows that training with labeled data only (Scratch) performs particularly poorly with just 10 minutes of labeled data, while XLSR-25 achieves over 93% accuracy. Pre-training on more languages performs better with little labeled data (XLSR-25 vs. XLSR-7/wav2vec 2.0 En); in the high-resource labeled data regime, the labeled data itself provides sufficient learning signal. Scratch training similarly improves with more labeled data, but even with large amounts of labeled data there remains a sizeable gap to the pre-trained models.

Figure 4: LID accuracy when using the output of different Transformer blocks of the XLSR-25 model as input to the LID classifier. Accuracy is in terms of the 25 language setup and using one hour of labeled data per language for fine-tuning.

6.3 Ablations

Next, we explore the effect of using representations from different Transformer blocks of the pre-trained models. For this section, we focus on the 25 language setup using one hour of labeled data per language for fine-tuning. Figure 4 shows that the middle and upper parts of the pre-trained model (from the 8th to the 24th layer) perform significantly better than the lower part (from the 2nd to the 6th layer). The result suggests that we could prune up to two-thirds of the context encoder blocks, which reduces time and memory usage during fine-tuning and inference while maintaining good accuracy. Additionally, by keeping only eight Transformer blocks, we reduce the number of parameters from 300 million down to 100 million.

Context aggregation strategy   Accuracy (%)
Average pooling                96.0
Max pooling                    95.6
Avg+Max+Min pooling            95.9
Avg+Max pooling                96.0
Class token                    95.6
Table 3: LID accuracy for different strategies to aggregate the context representations of an XLSR-25 model on the 25 language setup using one hour of labeled data per language for fine-tuning.

So far we have used average pooling to aggregate the output of the pre-trained models for a given speech utterance into a single vector. Table 3 compares this strategy to max pooling, concatenated average+max+min pooling, concatenated average+max pooling, and simply prepending a class ("[CLS]") token at the first time-step of the LID classifier [8]. The results suggest that average pooling works very well compared to the alternatives and provides a simple way to aggregate the context information.

7 Conclusion

In this paper, we demonstrated the benefit of using self-supervised pre-trained representations learned on unlabeled speech data to improve language identification. We showed that pre-training is more effective than training LID models solely on labeled data, and that cross-lingual representations are particularly effective in low-resource setups where little labeled data is available. This is important for enabling speech technology for many more languages spoken around the world. Using only 10 minutes of labeled data per language, LID can achieve an accuracy of over 93% on a 25 language setup. We also find that we can prune up to two-thirds of the pre-trained model while achieving the same accuracy. For future work, we may explore how to make these models more efficient at inference time, since pre-trained models are still very large.


  • [1] A. Baevski, S. Schneider, and M. Auli (2019) Vq-wav2vec: self-supervised learning of discrete speech representations. In International Conference on Learning Representations, Cited by: §2.2.
  • [2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33. Cited by: §1, §1, §2.1, §3.
  • [3] H. Bourlard, J. Dines, M. Magimai-Doss, P. N. Garner, D. Imseng, P. Motlicek, H. Liang, L. Saheer, and F. Valente (2011) Current trends in multilingual speech processing. Sadhana. Cited by: §1.
  • [4] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, D. Povey, et al. (2010) Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §1.
  • [5] J. Cho, M. K. Baskar, R. Li, M. Wiesner, S. H. Mallidi, N. Yalta, M. Karafiat, S. Watanabe, and T. Hori (2018) Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling. In 2018 IEEE Spoken Language Technology Workshop (SLT), Cited by: §1.
  • [6] Y. Chung and J. Glass (2018) Speech2vec: a sequence-to-sequence framework for learning word embeddings from speech. In Proc. of Interspeech, Cited by: §1, §1.
  • [7] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979. Cited by: §1, §3, §3.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §6.3.
  • [9] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. of ICASSP, Cited by: §1.
  • [10] Z. Fan, M. Li, S. Zhou, and B. Xu (2020) Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185. Cited by: §1.
  • [11] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020) Conformer: convolution-augmented transformer for speech recognition. Proc. of Interspeech. Cited by: §1.
  • [12] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean (2013) Multilingual acoustic models using distributed deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §1.
  • [13] W. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli (2021) Robust wav2vec 2.0: analyzing domain shift in self-supervised pre-training. arXiv. Cited by: §3.
  • [14] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2.2.
  • [15] H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §2.2.
  • [16] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee (2019) Large-scale multilingual speech recognition with a streaming end-to-end model. In Interspeech 2019, Cited by: §1.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • [18] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. ArXiv e-prints, pp. arXiv–1607. Cited by: §2.1.2.
  • [19] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan (2019) Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In Proc. of ICASSP, Cited by: §1.
  • [20] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C. Lee (2009) A study on multilingual acoustic modeling for large vocabulary asr. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §1.
  • [21] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno (2014) Automatic language identification using deep neural networks. In Proc. of ICASSP, Cited by: §1.
  • [22] D. S. Park, Y. Zhang, Y. Jia, W. Han, C. Chiu, B. Li, Y. Wu, and Q. V. Le (2020) Improved noisy student training for automatic speech recognition. In Proc. of Interspeech, Cited by: §1.
  • [23] V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert (2020) Massively multilingual asr: 50 languages, 1 model, 1 billion parameters. arXiv abs/2007.03001. Cited by: §1.
  • [24] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. Proc. Interspeech 2019, pp. 3465–3469. Cited by: §1, §2.1, §3.
  • [25] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert (2020) End-to-end ASR: from supervised to semi-supervised learning with modern architectures. Proc. of ICML workshop on Self-supervision in Audio and Speech (SAS). Cited by: §1.
  • [26] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao (2018) Multilingual speech recognition with a single end-to-end model. In Proc. of ICASSP, Cited by: §1.
  • [27] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. In Proc. of NIPS, Cited by: §1.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §2.1.2, §2.1.
  • [29] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, and et al. (2020) Transformer-based acoustic modeling for hybrid speech recognition. In Proc. of ICASSP, Cited by: §1.
  • [30] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. Cited by: 3rd item.
  • [31] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert (2020) Iterative pseudo-labeling for speech recognition. In Proc. of Interspeech, Cited by: §1.
  • [32] Y. Zhang, J. Qin, D. S. Park, W. Han, C. Chiu, R. Pang, Q. V. Le, and Y. Wu (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504. Cited by: §1, §2.1, §3.
  • [33] M. A. Zissman (1996) Comparison of four approaches to automatic language identification of telephone speech. IEEE Transactions on speech and audio processing. Cited by: §1.