Spoken languages vary greatly across regions, and these distinctions derive from the phonetics of local dialects and language backgrounds. Despite the high performance reported by state-of-the-art English automatic speech recognition (ASR) systems, accented speech recognition remains an unsolved real-world challenge because of the great variability of accents and their complex characteristics [kat1999fast]. It is difficult for ASR models to adapt to unseen accents whose pronunciations and tones differ markedly from those of the accents used for training. A common solution is to increase the amount of training data and expose the model to different accents, which improves the model's robustness to speakers' accents by introducing variation. However, such approaches are costly and do not scale, because high-quality speech data with different accents is difficult to collect. Data augmentation techniques such as noise injection [narayanan2014joint] and speed perturbation [hori2017advances] have been proposed to reduce the reliance on high-resource data. In this work, we explore training approaches for fast adaptation to unseen accents instead of augmenting the training data. We apply model-agnostic meta-learning (MAML) [finn2017model] to teach the model to learn new tasks faster and more efficiently, and our approach can easily be applied to few-shot learning. A few studies have explored joint and multi-task training for multi-accent speech recognition models [sun2018domain, jain2018improved, jain2019multi]. However, none has thoroughly investigated few-shot learning on the cross-accented speech recognition task.
In this paper, we introduce a cross-accented speech recognition task derived from an existing dataset, CommonVoice [ardila2019common], to move toward building a robust speech recognition system. The motivation of this work is to establish a benchmark for evaluating cross-accented speech recognition. We introduce an accent-agnostic model by applying meta-learning as a learning-to-learn method for fast accent adaptation. The trained model is able to rapidly adapt to recognize speech with unseen accents. We train our transformer [vaswani2017attention] speech recognition model on a set of accents via meta-learning and fine-tune the trained model with a few samples of target accented speech. Experimental results show that our approach adapts to new accents more quickly and effectively than joint training and, interestingly, that it is also able to handle zero-shot predictions.
2 Related Work
Meta-learning is a sub-field of machine learning that designs models to learn new tasks in a new setting with a few training examples [schmidhuber1992learning, thrun2012learning]. In a recent work, [finn2017model] propose model-agnostic meta-learning (MAML) and show the application of meta-learning in a deep learning framework. Several meta-learning-based models have been proposed for solving few-shot image classification [ravi2016optimization, vinyals2016matching, santoro2016meta] and natural language processing applications, such as text classification [yu2018diverse], dialogue response generation [madotto2019paml, qian-yu-2019-domain], low-resource machine translation [gu2018meta], semantic parsing [huang2018natural], and sales prediction [lin2019learning]. [gu2018meta] make the interesting finding that MAML is able to generalize the model in the low-resource machine translation task without any fine-tuning steps or when there is no information on the target accent. In speech applications, [hsu2019meta] show the practicality of applying MAML to cross-lingual speech recognition, while in another line of work, MAML has been applied to learn how to adapt to the speaker [klejch2018learning, klejch2019speaker].
2.2 Accented Speech Recognition
Existing studies on accented speech recognition mainly focus on applying accent-invariant acoustic features and adaptation methods that allow the model to accommodate accented speech. [zheng2005accent, najafian2014unsupervised] introduce acoustic features and adaptation methods for recognizing accented speech. Meanwhile, [jain2018improved, jain2019multi] and [viglino2019end] explore a multi-task architecture that jointly learns an accent classifier and an acoustic model. [jain2019multi] propose a mixture of expert models to segregate accent-specific and phone-specific speech variability in a joint framework, and [sun2018domain] propose an adversarial training objective to help the model learn accent-invariant features. In this work, we explore the possibility of recognizing speech with unseen accents, and extend MAML to enable fast adaptation through few-shot learning in the cross-accent setting.
3 Cross-Accented Speech Recognition
In this section, we present the architecture of our transformer-based speech recognition model and the proposed meta-learning method for fast adaptation on the cross-accented speech recognition task.
3.1 Transformer Speech Recognition Model
We build our model using a sequence-to-sequence transformer ASR [vaswani2017attention, dong2018speech, winata2019code, winata2019lightweight] to learn to predict graphemes from the speech input. Our model passes audio inputs through a learnable feature extractor module to generate input embeddings. The encoder takes the input embeddings generated by the feature extractor module. The decoder then receives the encoder outputs and applies multi-head attention to its input to calculate the logits of the outputs. To generate the probability of the outputs, we compute the softmax function of the logits. We apply a mask in the attention layer to avoid any information flow from future tokens, and we train our model to predict the next character given the previous characters by maximizing the log probability:

$$\max_{\theta} \sum_{i} \log P(y_i \mid x, y'_{<i}; \theta), \tag{1}$$

where $x$ denotes the character inputs, $y_i$ is the next predicted character, and $y'_{<i}$ is the ground truth of the previous characters. At inference time, we generate the sequence using beam search in an auto-regressive manner. We then maximize the following scoring function:

$$\alpha P(Y \mid X) + \gamma \sqrt{wc(Y)}, \tag{2}$$

where $\alpha$ is the parameter to control the decoding probability from the decoder, and $\gamma$ is the parameter to control the effect of the word count $wc(Y)$, as suggested in [winata2019code] and [winata2019lightweight].
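The decoding score described above can be illustrated with a small sketch. It assumes the score has the form $\alpha P(Y \mid X) + \gamma \sqrt{wc(Y)}$, where $wc(Y)$ is the number of words in the hypothesis. The function names, the example hypotheses, and the weight values are hypothetical and chosen only for illustration; they are not the settings used in the experiments.

```python
import math


def hypothesis_score(log_prob, text, alpha, gamma):
    """Combine decoder probability and a word-count bonus for one hypothesis.

    log_prob: log P(Y|X) of the hypothesis under the decoder.
    alpha:    weight on the decoder probability (assumed form).
    gamma:    weight on the square root of the word count (assumed form).
    """
    prob = math.exp(log_prob)
    word_count = len(text.split())
    return alpha * prob + gamma * math.sqrt(word_count)


def rerank(hypotheses, alpha, gamma):
    """Sort (text, log_prob) pairs from best to worst score."""
    return sorted(
        hypotheses,
        key=lambda h: hypothesis_score(h[1], h[0], alpha, gamma),
        reverse=True,
    )


if __name__ == "__main__":
    # Two made-up beam candidates: a shorter, slightly more probable one
    # and a longer, slightly less probable one.
    beams = [("the cat", math.log(0.50)), ("the cat sat", math.log(0.45))]
    for gamma in (0.1, 0.5):  # illustrative values only
        best_text, _ = rerank(beams, alpha=1.0, gamma=gamma)[0]
        print(f"gamma={gamma}: best hypothesis = {best_text!r}")
```

In this toy case a small $\gamma$ keeps the more probable hypothesis on top, while a larger $\gamma$ favors the longer one, which is the length effect the word-count term is meant to control.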
| Accent | # Samples | Duration (hr) |
|---|---|---|
| Hong Kong (hk) | 1,181 | 1.21 |
| New Zealand (nz) | 6,070 | 7.06 |
| South Atlantic (sa) | 212 | 0.23 |
| United States (us) | 145,692 | 163.89 |

Table 1: Statistics of the English dataset by accent.
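| Pre-training | Accent | MAML 0-shot | MAML 5%-shot | MAML 25%-shot | MAML all-shot | Joint 0-shot | Joint 5%-shot | Joint 25%-shot | Joint all-shot |
|---|---|---|---|---|---|---|---|---|---|
| No | Bermuda | 33.22 ± 0.46 | 32.73 ± 0.47 | 31.85 ± 0.48 | 29.90 ± 0.60 | 38.92 ± 0.55 | 37.84 ± 0.50 | 36.23 ± 0.56 | 36.12 ± 0.65 |
| No | Philippines | 50.08 ± 0.56 | 48.22 ± 0.69 | 45.94 ± 0.64 | 44.43 ± 0.69 | 50.58 ± 0.81 | 49.72 ± 0.80 | 48.27 ± 0.85 | 45.47 ± 0.93 |
| No | Wales | 33.66 ± 0.83 | 33.31 ± 0.77 | 31.63 ± 0.86 | 29.70 ± 0.87 | 37.04 ± 0.68 | 37.43 ± 0.69 | 35.60 ± 0.80 | 32.37 ± 0.87 |
| Yes | Bermuda | 28.25 ± 0.47 | 28.64 ± 0.42 | 26.59 ± 0.43 | 25.71 ± 0.43 | 31.42 ± 0.57 | 31.43 ± 0.56 | 30.05 ± 0.44 | 27.64 ± 0.40 |
| Yes | Philippines | 40.99 ± 0.51 | 40.07 ± 0.52 | 39.06 ± 0.44 | 37.48 ± 0.42 | 43.17 ± 0.83 | 41.98 ± 0.76 | 40.56 ± 0.77 | 38.79 ± 0.69 |
| Yes | Wales | 25.91 ± 0.73 | 25.55 ± 0.86 | 23.94 ± 0.73 | 23.40 ± 0.64 | 29.14 ± 0.49 | 28.54 ± 0.52 | 26.70 ± 0.49 | 25.01 ± 0.56 |

Table 2: Average word error rate (% WER) ± standard error (SE) in the mixed-region setting.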
3.2 Fast Adaptation via Meta-Learning
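Model-agnostic meta-learning (MAML) [finn2017model] learns to quickly adapt to a new task from a number of different tasks using a gradient descent procedure. In this paper, we apply MAML to effectively learn from a set of accents and quickly adapt to a new accent in the few-shot setting. We denote our Transformer ASR as $f_{\theta}$, parameterized by $\theta$. Our dataset consists of a set of accents $\mathcal{A} = \{A_1, A_2, \dots, A_n\}$, and for each accent $A_i$, we split the data into $A_i^{tra}$ and $A_i^{val}$, then update $\theta$ into $\theta'_i$ by computing gradient descent updates on $A_i^{tra}$:

$$\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{A_i^{tra}}(f_{\theta}), \tag{3}$$

where $\alpha$ is the fast adaptation learning rate. During training, the model parameters are trained to optimize the performance of the adapted model $f_{\theta'_i}$ on the unseen $A_i^{val}$. The meta-objective is defined as follows:

$$\min_{\theta} \sum_{A_i \sim \mathcal{A}} \mathcal{L}_{A_i^{val}}(f_{\theta'_i}) = \sum_{A_i \sim \mathcal{A}} \mathcal{L}_{A_i^{val}}\big(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{A_i^{tra}}(f_{\theta})}\big), \tag{4}$$

where $\mathcal{L}_{A_i^{val}}(f_{\theta'_i})$ is the loss evaluated on $A_i^{val}$. We collect the loss from a batch of accents and perform the meta-optimization as follows:

$$\theta \leftarrow \theta - \beta \sum_{A_i \sim \mathcal{A}} \nabla_{\theta} \mathcal{L}_{A_i^{val}}(f_{\theta'_i}), \tag{5}$$

where $\beta$ is the meta step size and $f_{\theta'_i}$ is the adapted network on accent $A_i$. The meta-gradient update step is performed to achieve a good initialization for our model, so that we can then optimize the model with a small number of samples from the target accents in the fine-tuning step. In this work, we use the first-order approximation of MAML, following [gu2018meta] and [finn2018pmaml]; thus, Equation 5 is reformulated as:

$$\theta \leftarrow \theta - \beta \sum_{A_i \sim \mathcal{A}} \nabla_{\theta'_i} \mathcal{L}_{A_i^{val}}(f_{\theta'_i}). \tag{6}$$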
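The first-order update described above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: each "accent" is replaced by a small linear-regression task with its own train/validation split, the model is a two-parameter line rather than a Transformer ASR, and the names `mse_grad`, `fomaml_step`, `alpha`, and `beta` are hypothetical. The point is the structure of the update: one inner gradient step on the training split, then a meta-update that uses the validation gradient taken at the adapted parameters.

```python
def mse_grad(theta, data):
    """Gradient of the mean squared error of y_hat = w * x + b w.r.t. (w, b)."""
    w, b = theta
    n = len(data)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return [grad_w, grad_b]


def fomaml_step(theta, tasks, alpha, beta):
    """One first-order MAML meta-update over a batch of tasks.

    tasks: list of (train_split, val_split), each a list of (x, y) pairs.
    alpha: fast adaptation (inner) learning rate.
    beta:  meta step size (outer learning rate).
    """
    meta_grad = [0.0] * len(theta)
    for train_split, val_split in tasks:
        # Inner step: adapt the shared parameters on the task's training split.
        inner_grad = mse_grad(theta, train_split)
        adapted = [p - alpha * g for p, g in zip(theta, inner_grad)]
        # First-order approximation: take the validation gradient at the
        # adapted parameters and ignore second-order terms.
        val_grad = mse_grad(adapted, val_split)
        meta_grad = [m + g for m, g in zip(meta_grad, val_grad)]
    # Outer step: move the shared initialization along the summed gradient.
    return [p - beta * m for p, m in zip(theta, meta_grad)]


if __name__ == "__main__":
    # Three toy tasks (stand-ins for accents) with different slopes.
    tasks = []
    for slope in (1.0, 2.0, 3.0):
        train_split = [(x, slope * x) for x in (-1.0, 0.5, 2.0)]
        val_split = [(x, slope * x) for x in (-2.0, 1.0, 1.5)]
        tasks.append((train_split, val_split))

    theta = [0.0, 0.0]
    for _ in range(100):
        theta = fomaml_step(theta, tasks, alpha=0.05, beta=0.02)
    print("meta-learned initialization:", theta)
```

Full MAML would differentiate through the inner step; the first-order variant shown here avoids that second-order computation, which is why it is commonly preferred for large models.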
We use the CommonVoice Dataset [ardila2019common], a multilingual open-accented dataset collected by Mozilla; we downloaded the data in December 2019. In this work, we only use the English dataset and keep only speech data with an accent label. There are 16 accents listed in the dataset, and we split the dataset into groups according to the accent label. The statistics of the English dataset are shown in Table 1. Note that the dataset is imbalanced and some accents have very limited data. The pre-trained models are trained on the LibriSpeech corpus [panayotov2015librispeech], a 960-hour training corpus of English read speech derived from audio books in the LibriVox project, sampled at 16 kHz. The accents vary and are not labeled, but the majority are US English. The LibriSpeech Dataset can be downloaded at http://www.openslr.org/12/ and the list of LibriVox accents can be found at https://wiki.librivox.org/index.php/Accents_Table
4.2 Experimental Setup
We preprocess raw audio input into a spectrogram before we feed it into our model. Our model utilizes a VGG model [simonyan2014very], a 6-layer CNN architecture, as the feature extractor. Our transformer model consists of two transformer encoder layers and four transformer decoder layers. The transformer has an inner (feed-forward) dimension of 2048, a model dimension of 512, and an embedding dimension of 512. We use 8 heads for multi-head attention. In total, our model has around 10.2M parameters. For both the MAML and joint training models, we end the training process after 200k iterations. In the pre-training setting, we pre-train the model on the LibriSpeech Dataset for 1M iterations, and then continue training on the CommonVoice Dataset for another 100k iterations for all approaches. During the fine-tuning step, we run 10 iterations for each sample. We evaluate our model using beam search with the decoding parameters $\alpha$ and $\gamma$ and a beam size of 5. In the pre-training setting, we downsample the CommonVoice speech data to 16 kHz, following the LibriSpeech Dataset audio sample rate.
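For quick reference, the setup above can be collected into a small configuration object. This is a sketch under assumptions: the class name and field names are hypothetical, and the three dimensions reported in the text (2048, 512, 512) are assumed here to be the feed-forward inner dimension, the model dimension, and the embedding dimension, in that order. The derived per-head dimension follows from the usual multi-head attention convention of splitting the model dimension evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ASRConfig:
    """Hypothetical container for the hyperparameters listed in the text."""

    num_encoder_layers: int = 2
    num_decoder_layers: int = 4
    dim_inner: int = 2048  # assumed: feed-forward inner dimension
    dim_model: int = 512   # assumed: model (hidden) dimension
    dim_emb: int = 512     # assumed: embedding dimension
    num_heads: int = 8
    beam_size: int = 5
    sample_rate_hz: int = 16_000
    train_iterations: int = 200_000        # MAML and joint training
    pretrain_iterations: int = 1_000_000   # LibriSpeech pre-training
    resume_iterations: int = 100_000       # CommonVoice after pre-training
    finetune_iterations_per_sample: int = 10

    def __post_init__(self):
        # Multi-head attention splits the model dimension evenly across heads.
        if self.dim_model % self.num_heads != 0:
            raise ValueError("dim_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        return self.dim_model // self.num_heads


if __name__ == "__main__":
    config = ASRConfig()
    print(config)
    print("per-head dimension:", config.head_dim)
```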
We train and evaluate the effectiveness of our fast adaptation method in two settings: (1) the mixed-region setting and (2) the cross-region setting. In the former, we train on ten accents (af, au, en, hk, in, ir, my, nz, sg, and us) sampled from all regions, validate the model on the ca, sc, and sa accents, and test it on the be, ph, and wa accents. In the latter, we train on five accents (au, en, ir, nz, and us) from specific regions, validate the model on the ca, sc, and sa accents, and test it on the af, hk, in, ph, and sg accents, which come from other regions. We evaluate model performance using the word error rate (WER) and run each experiment ten times using different test folds. Each fold consists of 100 samples randomly drawn from the test data. In the few-shot scenarios, we split the data of each test accent into training and testing sets: 75% of the data is allocated for training, and the remainder for testing. We will release the code and dataset manifests used in the experiments for reproducibility. We report the average and standard error over all folds in the zero-shot (0%-shot), 5%-shot, 25%-shot, and all-shot (100%-shot) settings. In addition, we investigate the usefulness of pre-training on a large English corpus before fine-tuning the model.
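The evaluation protocol above (WER per fold, then mean and standard error across folds) can be sketched as follows. This is a minimal illustration with assumptions: WER is computed at the corpus level within a fold (total word edits divided by total reference words), the standard error uses the sample standard deviation, and the function names and example numbers are hypothetical rather than taken from the released code.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_tok != hyp_tok),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate in percent over a list of utterance pairs."""
    errors = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    total_words = sum(len(r.split()) for r in references)
    return 100.0 * errors / total_words


def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) of per-fold scores."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    print(wer(["the cat sat"], ["the cat sit"]))
    # Made-up per-fold WERs standing in for the ten test folds.
    fold_wers = [30.0, 32.0, 34.0]
    print(mean_and_standard_error(fold_wers))
```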
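| Pre-training | Accent | MAML 0-shot | MAML 5%-shot | MAML 25%-shot | MAML all-shot | Joint 0-shot | Joint 5%-shot | Joint 25%-shot | Joint all-shot |
|---|---|---|---|---|---|---|---|---|---|
| No | Africa | 40.38 ± 1.11 | 38.31 ± 1.20 | 36.36 ± 1.01 | 34.64 ± 1.01 | 41.56 ± 1.04 | 41.40 ± 1.08 | 39.34 ± 1.34 | 38.32 ± 1.17 |
| No | Hong Kong | 42.04 ± 0.74 | 40.20 ± 0.89 | 38.29 ± 0.78 | 35.61 ± 0.71 | 44.84 ± 0.65 | 44.88 ± 0.67 | 44.09 ± 0.66 | 41.28 ± 0.59 |
| No | India | 62.07 ± 0.90 | 54.60 ± 1.46 | 51.71 ± 1.06 | 47.85 ± 1.00 | 63.09 ± 0.82 | 56.76 ± 1.08 | 53.89 ± 1.00 | 50.73 ± 0.98 |
| No | Philippines | 50.06 ± 0.74 | 48.17 ± 0.71 | 47.71 ± 0.78 | 45.05 ± 0.82 | 53.22 ± 0.97 | 52.60 ± 0.99 | 51.64 ± 0.78 | 48.12 ± 0.76 |
| No | Singapore | 55.75 ± 0.85 | 55.76 ± 0.83 | 54.43 ± 0.68 | 52.71 ± 1.06 | 57.87 ± 0.64 | 57.21 ± 0.67 | 55.15 ± 0.69 | 53.59 ± 0.72 |
| Yes | Africa | 32.63 ± 1.25 | 31.75 ± 1.19 | 31.09 ± 1.22 | 29.75 ± 1.01 | 34.61 ± 1.22 | 33.42 ± 1.18 | 33.12 ± 1.12 | 31.63 ± 1.13 |
| Yes | Hong Kong | 36.06 ± 0.56 | 36.04 ± 0.71 | 32.38 ± 0.71 | 32.15 ± 0.62 | 37.43 ± 0.77 | 36.51 ± 0.57 | 35.88 ± 0.51 | 34.18 ± 0.77 |
| Yes | India | 54.50 ± 1.41 | 48.73 ± 1.31 | 46.15 ± 1.35 | 43.54 ± 1.35 | 55.43 ± 1.36 | 50.52 ± 1.26 | 48.63 ± 1.32 | 46.58 ± 1.07 |
| Yes | Philippines | 43.73 ± 0.94 | 42.96 ± 1.01 | 40.80 ± 1.03 | 40.14 ± 0.98 | 45.16 ± 0.98 | 44.64 ± 1.04 | 42.38 ± 0.88 | 41.74 ± 0.98 |
| Yes | Singapore | 49.45 ± 0.55 | 48.40 ± 0.56 | 46.62 ± 0.62 | 46.17 ± 0.67 | 52.06 ± 0.71 | 50.48 ± 0.70 | 49.43 ± 0.69 | 47.11 ± 0.66 |

Table 3: Average word error rate (% WER) ± standard error (SE) in the cross-region setting.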
5 Results and Discussion
5.1 Quantitative Analysis
As shown in Table 2, MAML consistently outperforms joint training in the mixed-region setting, with a margin of up to 4% WER in the zero-shot and few-shot settings. In general, for both MAML and joint training, the WER drops steadily as more data is added for fine-tuning. Using the model pre-trained on the LibriSpeech Dataset significantly boosts the performance of all models, by around 5% to 8% WER. In the all-shot setting, the results are similar to those in the 5%-shot and 25%-shot settings. We observe that the WER improvement from pre-training is higher for the Wales accent than for the Bermuda and Philippines accents, since the majority of the LibriSpeech Dataset is US-accented speech, which is far more acoustically similar to the accent of Wales than to that of Bermuda or the Philippines.
5.2 Cross-region Performance
We show the cross-region performance in Table 3. As expected, the performance on the Philippines accent degrades slightly (that is, the WER increases) when we remove Asian accents from the training data. Interestingly, focusing only on the Philippines accent results in Table 2 and Table 3, MAML in the cross-region setting yields a WER similar to that of joint training in the mixed-region setting. The same does not hold for joint training in the cross-region setting. Based on these empirical results, we conclude that MAML is far more accent-agnostic than joint training. In sum, the model trained with MAML performs better than the one trained with joint training and learns more accent-invariant representations.
5.3 Effectiveness of Few-Shot Fine-tuning
We first investigate the number of samples needed before fine-tuning starts to improve performance. We start by training the model with a very small number of samples, from one to ten, where each sample consists of approximately 4 seconds of audio. We observe that the model does not adapt to the target accent with such a minuscule amount of data. We believe that our model is not able to capture the information from a very short audio sample because of the large acoustic variation in the data. Therefore, we increase the minimum threshold to 5% of the training data, at which point the model starts to adapt to the target accent.
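The few-shot subsets discussed here can be constructed as in the sketch below. It is only an illustration under assumptions: the 75%/25% split is applied per target accent, the shot percentages (0%, 5%, 25%, 100%) are taken relative to the 75% adaptation pool, subset sizes are rounded up, and the function names and utterance IDs are hypothetical.

```python
import math
import random


def split_target_accent(utterances, train_fraction=0.75, seed=0):
    """Shuffle one accent's utterances and split them into an adaptation
    pool (used for fine-tuning) and a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def few_shot_subset(train_pool, shot_fraction):
    """Take the first `shot_fraction` of the adaptation pool.

    Using a prefix of one fixed shuffle keeps the subsets nested, so the
    5%-shot data is contained in the 25%-shot data, and so on.
    """
    size = math.ceil(len(train_pool) * shot_fraction)
    return train_pool[:size]


if __name__ == "__main__":
    # Hypothetical utterance IDs for a single target accent.
    utterances = [f"utt_{i:04d}" for i in range(200)]
    pool, test_set = split_target_accent(utterances)
    for fraction in (0.0, 0.05, 0.25, 1.0):
        subset = few_shot_subset(pool, fraction)
        print(f"{int(fraction * 100)}%-shot: {len(subset)} fine-tuning samples")
    print("held-out test samples:", len(test_set))
```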
As Figure 3 and Figure 4 show, MAML generally performs better than joint training in all settings. With more target accented speech data, the model achieves a lower WER in both the mixed-accent and cross-accent settings. We observe that MAML is also effective for models without pre-training on the LibriSpeech Dataset, and that its WER decreases much faster than that of joint training.
We further investigate the fast adaptability of the MAML approach compared to the all-shot setting. As shown in Tables 2 and 3, the MAML approach with 25%-shot fine-tuning performs similarly to or even better than the joint approach with all-shot fine-tuning, in both the mixed-accent and cross-accent settings. In the all-shot setting, the MAML approach further improves the performance and outperforms the joint training approach in all experimental settings. In light of these experimental results, we infer that MAML adapts quickly to low-resource unseen accented data.
In this paper, we introduce a cross-accented speech recognition task derived from an existing dataset, CommonVoice, and establish a new benchmark for evaluating cross-accented speech recognition in the mixed-region and cross-region scenarios. We apply a fast adaptation method via the model-agnostic meta-learning (MAML) approach to learn a robust speech recognition system that rapidly adapts to unseen accents. Based on the empirical results, MAML consistently outperforms the non-meta-learning baseline in all settings, with an improvement of up to around 4% WER over joint training in both the mixed-region and cross-region scenarios. Notably, MAML with less data (25%-shot) achieves results comparable to joint training with all training data (all-shot). We also further improve the performance of our model by adding pre-training on a large speech corpus.
This work has been partially funded by ITF/319/16FP and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government, and School of Engineering Ph.D. Fellowship Award, the Hong Kong University of Science and Technology, and RDC 1718050-0 of EMOS.AI.