Speech-driven interfaces have become increasingly common, and COVID-19 has accelerated interest in speech as one of the primary modes of interaction with a variety of systems. In the case of kiosks used for ticketing, banking or informational purposes, developers of speech interfaces need to minimize latency and resource consumption while maintaining a high level of performance in speech-related tasks, like automatic speech recognition (ASR). At the same time, kiosks placed in transportation or business hubs (e.g., airports or international hotels) need to be able to support multiple languages.
Running any speech or language model on an edge system is non-trivial due to the size of parameter sets in modern speech and language models and the accelerated hardware needed to run many of these neural network based models. If multiple models are needed for multiple languages, resource consumption and increased inference times could very easily prevent developers from deploying applications in any environment, much less at the edge. In fact, common enterprise ASR systems require developers to deploy separate, dissimilar instances of model servers for each supported language, which complicates infrastructure and could result in reliability or maintenance issues.
Beyond language, accent and other demographic factors have been shown to dramatically impact ASR performance [1, 14]. Those demographic factors need to be addressed via larger, more computationally expensive speech models or via models fine-tuned for particular demographic groups.
2 Related Work
Many speech-driven systems at the edge detect a wake word and transfer audio to the cloud or another remote endpoint for processing. Such a system may not be desirable in cases where latency and privacy are a concern. For example, per Google Cloud Speech-to-Text best practice recommendations, 16-bit, 16kHz mono PCM audio data needs to be sent to API endpoints with a frame size of 100ms. Thus, round trip latency would be around 4s for an upload of around 3.2KB/second, which is beyond acceptable levels for many applications.
In terms of language support, developers often leverage slot-filling. For example, developers might require a user to select a preferred language via a touchscreen or speak the name of their preferred language. Both of these interactions could be awkward as they represent interactions that do not naturally occur in conversation. Moreover, the former implementation prevents truly contactless interactions, and the latter presents challenges in dealing with alternate language names, accents, and demographic factors.
Researchers have tried to extend ASR systems that rely on phonetically inspired acoustic models to support multiple languages. Certain of these extensions pool phonemes from all languages into a single set and others manage separate sets (see  and references therein). Either approach requires the management of expertly crafted linguistic information, and the burden of curating that information grows with the number of languages.
There have also been an increasing number of attempts to create end-to-end (E2E) models that recognize speech in multiple languages. For example, one such effort utilizes a single model trained with parameters shared across 51 languages, and the model ends up containing over a billion parameters. The scale of this type of model would make it a challenge to utilize on edge devices because of resource consumption and/or the need for specialized hardware accelerators. Other attempts at E2E multilingual ASR exhibit similar characteristics.
In other work, the authors use an E2E architecture called RNN Transducer (RNN-T) that shows promise in edge applications. A larger, adapted version of RNN-T takes as input a vector representing one of 9 supported languages. The authors assume that the language is either specified manually by a user of the model or determined automatically by a language identification system, but they do not integrate any specific spoken language identification system.
In still another vein of research [18, 16, 21, 11, 2, 4], researchers have tried to integrate spoken language identification directly into a joint speech recognition and language ID model. Generally, this work adds the prediction of one or more language identifications into the prediction of other outputs, such as decoded characters. While these models can reduce the overall number of models needed to support multilingual ASR, they can exhibit degradation in performance for some or all of the supported languages in comparison to the performance of corresponding monolingual ASR models. Moreover, the authors are not aware of any of these studies that integrate identifications of accent into speech recognition models.
While the present study is related to recent work in spoken language identification and E2E ASR, it capitalizes on a new, simultaneous combination of spoken language identification, spoken accent identification, and fine-tuned E2E ASR, a combination that was not considered in these earlier studies. In particular, we investigate the efficacy of utilizing such a combination on edge devices for high quality multilingual ASR.
An overview of the proposed Dynamic ASR approach (or Dyn-ASR, which we pronounce "dinosaur") to processing multilingual speech is presented in Fig. 1. In this approach, we assume WAV file inputs, which are first pre-processed to normalize the audio, trim silence, and format it. Audio fed to speech recognition models is formatted as 16-bit, 16kHz. Audio fed to language identification and accent identification models is formatted as 16-bit, 8kHz, and we artificially repeat the input audio to fill at least 10 seconds.
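The repetition step for the identification models can be illustrated with a minimal sketch. The function name `repeat_to_min_duration` and the choice to tile whole copies of the clip (rather than truncating to exactly 10 seconds) are assumptions made for illustration; the text only states that the audio is repeated to fill at least 10 seconds.

```python
import math


def repeat_to_min_duration(samples, sample_rate=8000, min_seconds=10.0):
    """Tile a clip with whole copies of itself until it lasts at least min_seconds.

    `samples` is a sequence of PCM sample values. The 8 kHz default matches the
    rate used for the identification models; whole-copy tiling is an assumption.
    """
    if not samples:
        raise ValueError("cannot repeat an empty clip")
    target_len = math.ceil(min_seconds * sample_rate)
    if len(samples) >= target_len:
        return list(samples)  # already long enough, return an unchanged copy
    repeats = math.ceil(target_len / len(samples))
    return list(samples) * repeats


# A 3-second clip at 8 kHz is tiled four times, giving 12 seconds of audio.
clip = [0] * (3 * 8000)
padded = repeat_to_min_duration(clip)
```

Tiling whole copies keeps every repetition intact, at the cost of producing clips slightly longer than the minimum.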
After pre-processing, language and accent identification is performed. For both language and accent identification we utilize a model with two LSTM layers, each having 200 units and each followed by batch normalization. One such model is used to classify the input audio into a language class. Then, we utilize a separate accent identification model (corresponding to the identified language) to further classify the input audio into an accent class.
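The two-stage classification described above can be sketched as follows. This is only an outline of the control flow: the classifiers are passed in as plain callables, and the names `identify_language_and_accent`, `language_model`, and `accent_models` are hypothetical. Languages without an accent model (in this sketch, any language missing from the `accent_models` mapping) yield no accent label.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A classifier maps pre-processed audio samples to a class label (assumed interface).
Classifier = Callable[[Sequence[float]], str]


def identify_language_and_accent(
    audio: Sequence[float],
    language_model: Classifier,
    accent_models: Dict[str, Classifier],
) -> Tuple[str, Optional[str]]:
    """Run language identification first, then the accent model for that language."""
    language = language_model(audio)
    accent_model = accent_models.get(language)
    # Only languages with a dedicated accent model receive an accent label.
    accent = accent_model(audio) if accent_model is not None else None
    return language, accent


# Stub classifiers standing in for the trained LSTM models.
stub_accent_models = {
    "english": lambda audio: "india",
    "mandarin": lambda audio: "taiwan",
}
result_en = identify_language_and_accent([0.0], lambda audio: "english", stub_accent_models)
result_ta = identify_language_and_accent([0.0], lambda audio: "tamil", stub_accent_models)
```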
An ASR model is trained for each language and accent pair that is to be supported by the system. We fine-tune these language and accent specific ASR models from general (i.e., not accent specific) ASR models. Because we are targeting edge applications, we experimented with several different phonetically inspired and E2E model architectures that were optimized for edge devices using OpenVINO and/or compact by nature. These models included Deep Neural Network (DNN) acoustic models and RNN-T, Conformer, DeepSpeech, and QuartzNet E2E models. In the end, we found that the Conformer and/or QuartzNet E2E models fulfilled our constraints in terms of ASR performance and system resource consumption.
Depending on the size of the models and system constraints, each of the ASR models can be loaded into memory when an application implementing the Fig. 1 process starts, or each ASR model could be loaded into memory on-the-fly. In any event, the model corresponding to the identified language and accent pair is dynamically chosen or loaded into memory after the language and accent is identified. In this way, multiple compact monolingual models can be utilized dynamically to recognize speech in multiple languages without significantly sacrificing the performance of speech recognition or exceeding edge device memory or processor constraints.
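The two loading strategies above (load everything at start-up, or load on-the-fly) can be sketched with a small registry keyed by language and accent. The class name `ModelRegistry` and the `loader` callable are hypothetical; the actual model loading code depends on the framework in use and is not specified here.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

ModelKey = Tuple[str, str]  # (language, accent)


class ModelRegistry:
    """Holds one ASR model per (language, accent) pair, loading each at most once."""

    def __init__(
        self,
        loader: Callable[[str, str], Any],
        preload: Iterable[ModelKey] = (),
    ) -> None:
        self._loader = loader
        self._models: Dict[ModelKey, Any] = {}
        # Eager strategy: load the listed models when the application starts.
        for language, accent in preload:
            self.get(language, accent)

    def get(self, language: str, accent: str) -> Any:
        """Return the model for the pair, loading it on first use (lazy strategy)."""
        key = (language, accent)
        if key not in self._models:
            self._models[key] = self._loader(language, accent)
        return self._models[key]


# A stand-in loader that records which models were loaded.
load_log = []


def fake_loader(language, accent):
    load_log.append((language, accent))
    return f"asr-{language}-{accent}"


registry = ModelRegistry(fake_loader, preload=[("english", "usa")])
first = registry.get("mandarin", "taiwan")   # loaded on demand
second = registry.get("mandarin", "taiwan")  # served from memory
```

Passing every supported pair in `preload` trades memory for a constant selection time, while an empty `preload` defers the cost to the first request for each pair.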
To test the Fig. 1 process for multilingual speech recognition on edge devices, we evaluated (i) the performance of our language and accent identification models; (ii) the performance of our language and accent specific ASR models; and (iii) the performance of an implementation of the full Fig. 1 process with respect to resource consumption.
In the following, we trained and tested our models/methods on English, Tamil, and Mandarin speech data. The English data was segmented into 8 accents (Scotland, Australia, England, India, USA, China, Malaysia, other) and the Mandarin data was segmented into 3 accents (Mainland, Taiwan, Hong Kong).
For transcribed speech data with corresponding language and accent labels, we relied on Mozilla’s Common Voice data, the Speech Accent Archive from George Mason University (SAA), and the Singapore National Speech Corpus (NSC). For additional Tamil speech data, we used Microsoft’s Indian Language Speech Corpus. We used the SoX utility to normalize the speech files to 16kHz, 16-bit WAV files for training and testing ASR tasks and 8kHz, 16-bit WAV files for training and testing language and accent identification tasks.
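As an illustration of the format conversion, the sketch below builds a SoX command line for each target rate. The helper names and the file paths are hypothetical, and forcing a single channel with `-c 1` is an assumption based on the mono input format mentioned later for the resource tests; the exact SoX invocation used in this work is not given.

```python
import subprocess


def sox_convert_command(src, dst, sample_rate):
    """Build a SoX command that writes `dst` as 16-bit mono audio at `sample_rate` Hz."""
    return ["sox", src, "-r", str(sample_rate), "-b", "16", "-c", "1", dst]


def convert(src, dst, sample_rate):
    """Run the conversion; requires the `sox` binary to be installed."""
    subprocess.run(sox_convert_command(src, dst, sample_rate), check=True)


# 16 kHz copies for ASR, 8 kHz copies for language and accent identification.
asr_cmd = sox_convert_command("clip.wav", "clip_16k.wav", 16000)
lid_cmd = sox_convert_command("clip.wav", "clip_8k.wav", 8000)
```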
4.2 LID and accent identification
As mentioned in Section 3, we utilize one LSTM-based model for language identification and one LSTM-based model per language for accent identification. For our combination of English, Tamil, and Mandarin, that means that we have 1 spoken language identification model and 2 spoken accent identification models (one for English and one for Mandarin). We sampled 38,400 samples per language to train the models. To train the accent identification models we utilized rejection sampling due to the unbalanced nature of the accent data.
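One common way to apply rejection sampling to unbalanced classes is to accept each example with a probability inversely proportional to the size of its class, so that every class contributes roughly the same expected number of examples. The sketch below follows that rule; the acceptance probability and the function name `balance_by_rejection` are assumptions, since the exact sampling scheme is not described here.

```python
import random
from collections import Counter


def balance_by_rejection(examples, rng):
    """Keep each (item, label) pair with probability rarest_count / class_count.

    Examples from the rarest class are always kept; examples from larger
    classes are rejected more often, which evens out the expected class sizes.
    """
    counts = Counter(label for _, label in examples)
    rarest = min(counts.values())
    kept = []
    for item, label in examples:
        if rng.random() < rarest / counts[label]:
            kept.append((item, label))
    return kept


# Toy data: 100 clips with one accent label and 10 with another.
data = [(f"clip{i}", "usa") for i in range(100)]
data += [(f"clip{i}", "scotland") for i in range(100, 110)]
balanced = balance_by_rejection(data, random.Random(0))
kept_counts = Counter(label for _, label in balanced)
```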
Our language identification model gives 84.99% accuracy on the 3 language classes. On English accents, we achieve 74.41% accuracy across the 8 accents, and we achieve 79.83% accuracy across the 3 Mandarin accents. The more crucial step in the Fig. 1 approach is language identification, because language identification determines whether the ASR model used will correspond to the spoken language or to another language entirely. Correct accent identification will further improve recognition accuracy, but to a lesser degree. Our results here show that executing a spoken language identification model prior to selection of an ASR model could result in choosing a proper model for at least 8-9 out of 10 inferences. Additionally, we found that using a single model for LID and accent identification would not achieve comparable accuracy on a similarly-sized model. Using a larger model for combined LID and accent identification would also slow down the time-to-ASR for the combined system.
4.3 Fine-tuned ASR
Assuming a proper language identification, we also wanted to validate the idea that switching between monolingual ASR models (each fine-tuned for a particular accent) could both: (i) outperform individual models trained on data corresponding to multiple accents; and (ii) allow us to avoid more complicated and/or larger multi-accent data and models. We created a test set of Indian, Chinese, and Malaysian accented English by selecting these accents out of the SAA. We then evaluated ASR models fine-tuned on each of these accents alongside publicly available pre-trained models. For this evaluation, we chose English because of the availability of multiple pre-trained models for comparison and because it is one of the languages considered in our other experiments.
The ASR models we fine-tuned were based on the QuartzNet architecture and fine-tuned on Indian, Chinese, and Malaysian accented English data from the Singapore National Speech Corpus. When evaluating these models (collectively referred to below as the models of the Dynamic ASR system, or Dyn-ASR), we loaded and utilized each of the models for the corresponding annotated accent. This simulates the best case scenario when loading language and accent specific models in the process illustrated in Fig. 1. Of course, in any implementation of the Fig. 1 process, the performance of the Dyn-ASR models will depend on the performance of the language and accent identification models, but this evaluation gives us a baseline for evaluating the set of ASR models themselves.
The pre-trained models that we used as a reference are DeepSpeech trained on US English (DS), QuartzNet trained on LibriSpeech (QN-LS), and QuartzNet trained on multiple accents (QN-Multi). The results of this comparison are presented in Table 1.
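The metric behind this comparison is not named in the text here. Assuming the conventional word error rate (WER), a comparison of this kind could be scored with a sketch like the following, which computes the word-level edit distance between a reference transcript and a model hypothesis. The function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer_exact = word_error_rate("please call stella", "please call stella")
wer_one_sub = word_error_rate("please call stella", "please tall stella")
```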
4.4 Resource consumption
To evaluate resource consumption, we created an implementation of the Fig. 1 approach for English and Mandarin. We compare the resource consumption of this implementation (Dyn-ASR below) with the Vosk speech recognition toolkit server (both an English instance, VS-EN, and a Mandarin instance, VS-CMN), Mozilla’s DeepSpeech implementation trained on US English (MDS-EN), and PaddlePaddle’s DeepSpeech implementation trained on Mandarin (PDS-CMN). Note that the authors had difficulty in finding any practical, publicly available system natively supporting multilingual ASR models or integrating spoken language identification. As such, multiple instances and versions of these systems had to be deployed, which demonstrates the operational barriers to practically deploying a multilingual ASR system.
While there are portable versions of the Vosk servers for each language, we picked the server version that would give the best quality speech recognition results. For the input audio data, we utilized two languages (English and Mandarin), each with two accents: US and Chinese accented English, and Mainland and Taiwanese accented Mandarin. Table 2 includes the system resource consumption for each solution.
Table 2. Systems compared (columns): Dyn-ASR | VS-EN | VS-CMN | MDS-EN | PDS-CMN
To ensure that we could provide each solution with whatever resources it could consume, we ran all of the ASR solutions on a Core i9 system (i9-7920X CPU), which has 12 cores and 24 threads, with 64GB of system memory and 500GB of storage. All the audio file inputs were 16kHz, 16-bit PCM mono. Results were captured in terms of the number of CPU cores utilized, memory usage, and total inference time, and are included in Fig. 2.
As shown in Fig. 2 part a, the Dyn-ASR container has not been pinned to a CPU core, and thus it ends up using as many cores as needed to complete inference at a constant time of less than 1 second (see Fig. 2 part c). It also uses minimal incremental memory as shown in Fig. 2 part b (around 10MB). A combination of the VS-EN + VS-CMN systems or the MDS-EN + PDS-CMN systems would need to be assembled to match the multilingual ASR capabilities of Dyn-ASR, yet any of these combinations would exceed the memory usage of the example Dyn-ASR system and increase the complexity of deployed infrastructure. Further, neither of these combinations (VS-EN + VS-CMN or MDS-EN + PDS-CMN) would solve the problem of selecting the correct ASR model corresponding to the input language, a capability that is built into the Dyn-ASR system. These characteristics together make the Dyn-ASR approach appealing for edge deployments.
5 Conclusions and Future Work
We introduced a new approach to multilingual speech recognition that selectively uses monolingual ASR models fine-tuned for particular accents. The particular recognition model used for each inference is determined on-the-fly using a language identification model and an accent identification model. An implementation of this approach for English and Mandarin behaved favorably in terms of resource consumption as compared to other publicly available ASR solutions and also shows promise in terms of recognition performance. This work explored certain model architectures, but we are exploring still other architectures along with further optimization using Intel’s OpenVINO toolkit. The authors would also like to integrate a step in the processing that uses a text-based model and/or probabilities from the language/accent identification to deal with misidentified languages.
-  (2010) The impact of accents on automatic recognition of south african english speech: a preliminary investigation. SAICSIT ’10 10 (), pp. 187–192. Cited by: §1.
-  (2012) Integrating language identification to improve multilingual speech recognition. Cited by: §2.
-  AI and machine learning best practices guide. Note: Available at https://cloud.google.com/speech-to-text/docs/best-practices#frame_size (2020/09/18) Cited by: §2.
-  (2015) A real-time end-to-end multilingual speech recognition architecture. IEEE Journal of Selected Topics in Signal Processing 9, pp. 749–759. Cited by: §2.
-  OpenVINO deep learning workbench: comprehensive analysis and tuning of neural networks inference. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 783–787. Cited by: §3.
-  (2020) Conformer: convolution-augmented transformer for speech recognition. ArXiv abs/2005.08100. Cited by: §3.
-  (2014) Deep speech: scaling up end-to-end speech recognition. ArXiv abs/1412.5567. Cited by: §3, §4.4.
-  (2019) Large-scale multilingual speech recognition with a streaming end-to-end model. In INTERSPEECH, Cited by: §2.
-  (2020) Quartznet: deep automatic speech recognition with 1d time-channel separable convolutions. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6124–6128. Cited by: §3.
-  (2002) Insights into the memory demands of speech recognition algorithms. In Proc. of the 2nd Annual Workshop on Memory Performance Issues, Cited by: §1.
-  (2019) Towards code-switching asr for end-to-end ctc models. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6076–6080. Cited by: §2.
-  (2020) Universal phone recognition with a multilingual allophone system. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8249–8253. Cited by: §2.
-  (2020-05-19) Deep learning for spoken language identification; syväoppiminen puhutun kielen tunnistamisessa. G2 Pro gradu, diplomityö, Aalto University, (en). Cited by: §2.
-  (2020-05) Artie bias corpus: an open dataset for detecting demographic bias in speech applications. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 6462–6468 (English). Cited by: §1.
-  (2020) Massively multilingual asr: 50 languages, 1 model, 1 billion parameters. ArXiv abs/2007.03001. Cited by: §2.
-  (2020) Streaming end-to-end bilingual asr systems with joint language identification. ArXiv abs/2007.03900. Cited by: §2.
-  The growth of contactless and voice interfaces in a post-pandemic world. Note: Available at https://blog.soundhound.com/the-growth-of-voice-user-interfaces-in-a-post-pandemic-world-8bded452520b (2020/04/09) Cited by: §1.
-  (2018) An end-to-end language-tracking speech recognizer for mixed-language speech. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4919–4923. Cited by: §2.
-  VOSK server github project. Note: Available at https://alphacephei.com/vosk/server (2020/06/21) Cited by: §4.4.
-  (2017) Language independent end-to-end architecture for joint language identification and speech recognition. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 265–271. Cited by: §2.
-  (2012) Integration of language identification into a recognition system for spoken conversations containing code-switches. In SLTU, Cited by: §2.