Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

by   Qijie Shao, et al.

General accent recognition (AR) models tend to extract low-level information directly from spectrograms, and thus often overfit significantly to speakers or channels. Since an accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents becomes an easier task when accent shift is used as input. However, due to the lack of a native utterance to serve as an anchor, estimating the accent shift is difficult. In this paper, we propose linguistic-acoustic similarity based accent shift (LASAS) for AR tasks. For an accented speech utterance, after mapping the corresponding text vector to multiple accent-associated spaces as anchors, its accent shift can be estimated from the similarities between the acoustic embedding and those anchors. Then, we concatenate the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation. Compared with pure acoustic embedding, the bimodal representation is richer and clearer, taking full advantage of both linguistic and acoustic information, which effectively improves AR performance. Experiments on the Accented English Speech Recognition Challenge (AESRC) dataset show that our method achieves 77.42% accuracy on the Test set, a 6.94% relative improvement over a competitive system in the challenge.





1 Introduction

Accents are special pronunciations that are generally affected by many factors, e.g., region, native language, and education level [1]. Accent recognition (AR), or accent classification, can be used for advertising recommendations and regionally differentiated services. Furthermore, AR is an important precursor to many speech tasks, such as automatic speech recognition (ASR) and voice assistants, where it has a significant impact on performance. Therefore, high-performance AR solutions have received extensive attention recently.

Some earlier research directly applied x-vector based speaker recognition models [2] to AR tasks by simply changing the speaker labels to accent labels [3, 4, 5]. Such models tend to extract low-level features (such as frequency and timbre) directly from spectrograms, resulting in significant overfitting to speakers or channels [6]. In recent years, considering that accent is language-related, researchers began to alleviate overfitting by introducing linguistic information, e.g., from ASR. Shi et al. [7] proposed initializing an AR model with a well-trained ASR encoder and achieved an obvious improvement. Many researchers [8, 9, 10] used multi-task architectures to combine AR and ASR into a unified model, which leads the model to focus more on linguistic information. Furthermore, Hämäläinen et al. [11] used self-supervised pre-trained models [12, 13] to extract linguistic and acoustic embeddings respectively and concatenated them for an AR task.

While significant progress has been achieved with the help of linguistic information, the accuracy of AR models is still not ideal for downstream tasks. In [14, 15], researchers showed that an accent can be considered a set of deviations or shifts from standard pronunciation. Therefore, if a native utterance with the same words as the accented utterance could be obtained, the accent shift could be measured, and distinguishing different accents would become an easier task for AR models with accent shift as input. Similar methods [16, 17, 18] have been applied in pronunciation assessment tasks: the pronunciation similarity between utterances of the same transcript from native and non-native speakers is evaluated. Specifically, accent shift was extracted by aligning an accented utterance with a standard pronunciation and comparing their differences. Results showed that accent shift can effectively represent the fine-grained accent characteristics that distinguish native from non-native speakers.

However, in AR tasks, it is difficult to obtain such paired utterances with the same transcripts. In the above methods, the native utterance acts as an anchor for accent shift extraction; in this study, we consider building a virtual anchor to replace it. In [19, 20, 21], word embeddings were regarded as anchors to extract emotion shift from visual and acoustic features. Inspired by these works, in this paper we propose linguistic-acoustic similarity based accent shift (LASAS) for AR tasks. Specifically, we align the accented speech and its corresponding text by forced alignment and map the resulting phoneme-level text sequence into multiple Euclidean spaces. The obtained mapped text vectors are regarded as anchors for accent shift estimation, which tries to capture the pronunciation variants of the same word across different accents. Then we use the scaled dot-product to calculate the similarities between the acoustic embeddings extracted by a Conformer [22] encoder and the mapped text vectors. These similarities show different shift distances and directions for different accents, so their combination can be regarded as an accent shift. Finally, we concatenate the accent shift with a dimension-reduced text vector to obtain a bimodal representation for AR model training. Compared with pure acoustic embedding, the bimodal representation is richer and clearer, taking full advantage of both linguistic and acoustic information, which effectively improves AR performance. Extensive experiments on the AESRC challenge dataset [7] demonstrate that LASAS is effective and has a clear physical meaning in line with linguistics, and that it clearly outperforms the direct concatenation of text vector and acoustic embeddings. With ASR-generated text, LASAS achieves 77.42% accuracy on the Test set, significantly surpassing a competitive system in the challenge.

Figure 1: The framework of our accent recognition system. (a) is the overall system architecture schematic. (b) is the illustration of linguistic-acoustic similarity based accent shift (LASAS) block.

2 Method

2.1 Overall System Structure

The purpose of our research is to measure the shift of accent with the help of linguistic information. By incorporating the shift of each accent into the AR training, the model is able to learn more discriminative accent representations over the conventional acoustic features, resulting in better recognition performance. To achieve this goal, we propose a LASAS-based AR model.

As shown in Fig. 1 (a), our model consists of three major blocks: (1) Input block, which takes acoustic features such as log-Mel filter banks (Fbank) or Mel-frequency cepstral coefficients (MFCC) and aligned text as inputs. We feed the acoustic features into a Conformer [22] encoder and concatenate the outputs of certain encoder layers as acoustic embeddings. (2) LASAS block, which maps the aligned text and the acoustic embeddings to calculate similarities as accent shifts, from which linguistic-acoustic bimodal representations are obtained. (3) AR block, which takes these bimodal representations as input. First, context-sensitive accent information is extracted through a lightweight Transformer [23] encoder. Then, the utterance-level accent prediction is obtained by a DNN classifier followed by a statistical pooling layer.

2.2 Input Block

Compared with standard pronunciation, accent shift is often manifested in special words or phonemes. Therefore, to evaluate it, pronunciation units need to be given first. In this paper, we use the byte pair encoding (BPE) [24] method to obtain subwords, and then take them as pronunciation units.
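Since subword units drive the rest of the pipeline, a toy byte pair encoding learner may help make the idea concrete. This is a minimal sketch of the standard BPE merge procedure [24], not the tokenizer actually used in the paper; the word list and merge count are made up for illustration:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a word list (toy illustration)."""
    # Represent each word as a tuple of symbols (initially characters).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the current vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge greedily, left to right, in every word.
        merged_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

merges = learn_bpe(["low", "lower", "lowest", "newer", "wider"], 4)
```

The learned merges become the subword inventory; each aligned frame is then labeled with the subword it belongs to.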

In [17], fine-grained accent features of pronunciation units are extracted by comparing a standard pronunciation with an accented pronunciation. The two utterances are aligned in time before comparison, so that the accented and standard frames at the same time step correspond to the same pronunciation unit; only then is a frame-by-frame comparison meaningful. As mentioned above, this method needs pairs of accented and standard utterances with the same text, which are difficult to obtain in an AR task. We therefore use aligned text to construct virtual anchors instead of standard pronunciation utterances to measure accent shift; that is, we need an additional ASR system for alignment. The text mentioned here can be manually transcribed or generated by an ASR system, and the two options are compared in the experiments later.

In addition, accents are related to human voice information. The study in [7] shows that the encoder of an ASR model can be used to extract voice-related acoustic embeddings, so we use a Conformer block, a state-of-the-art ASR architecture, as the encoder. Finally, the outputs of several encoder layers are concatenated to form the acoustic embedding.
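As a shape-level illustration of this embedding construction, the sketch below concatenates the outputs of a few chosen encoder layers along the feature axis. The layer indices and dimensions are assumptions, and random arrays stand in for real Conformer layer outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained encoder: each "layer" output is (frames, dim).
# In the paper a Conformer encoder is used; random arrays here only
# illustrate the shapes involved.
frames, dim, num_layers = 50, 256, 12
layer_outputs = [rng.standard_normal((frames, dim)) for _ in range(num_layers)]

# Concatenate the outputs of selected layers (e.g. 3, 6, 9) along the
# feature axis to form the acoustic embedding.
selected = [3, 6, 9]  # 1-indexed layer ids, as in Table 1
acoustic_embedding = np.concatenate(
    [layer_outputs[i - 1] for i in selected], axis=-1
)
```

Concatenating several layers lets the AR model draw on both lower-level voice information and higher-level linguistic information at once.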

2.3 LASAS Block

As mentioned in [7], by comparing the differences between a standard pronunciation and an accented utterance, we can extract the physical accent shift. For AR tasks, however, this physical shift cannot be obtained because of the lack of data pairs. If we instead build a virtual shift that shows different shift distances and directions on the same pronunciation unit for different accents, such a shift is also meaningful for the subsequent AR model.

To achieve this goal, an anchor is needed that is related to the speech content and aligned with the speech. We map the aligned text vectors into multiple Euclidean spaces as an anchor set. This ensures that when the same pronunciation unit appears in utterances with different accents, the anchor is the same. The anchors are defined as:

$A_i = T W_t^i, \quad i = 1, 2, \dots, N$

where $N$ is the number of mapping spaces, $T$ is a frame-level aligned text vector (the one-hot subword representation of dimension $D_t$), and $W_t^i \in \mathbb{R}^{D_t \times C}$ is the $i$-th text mapping matrix. $C$ is the dimension of the hidden features, shown in Fig. 1 (b).

Then, we map the acoustic embedding to the same dimension as the text anchors and calculate their similarity frame by frame with the scaled dot-product [23]. This way of measuring the similarity of two vectors of different dimensions is widely used in the attention mechanism [23, 25]. Details are as follows:

$S_i = \frac{(E W_a^i) \cdot A_i}{\sqrt{C}}, \quad i = 1, 2, \dots, N$

where $E$ is the acoustic embedding of dimension $D_a$, $W_a^i \in \mathbb{R}^{D_a \times C}$ is the $i$-th acoustic mapping matrix, and $\sqrt{C}$ is the scale factor. $S_i$ is the similarity value in the $i$-th mapping space.
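A minimal numpy sketch of this step may help. The matrix names and dimensions below are assumptions (random instead of trained); only the shapes and the frame-wise scaled dot-product follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

frames, vocab, d_a, C, N = 50, 100, 768, 256, 8

# One-hot aligned text (one subword id per frame) and an acoustic embedding.
text_ids = rng.integers(0, vocab, size=frames)
T = np.eye(vocab)[text_ids]                      # (frames, vocab)
E = rng.standard_normal((frames, d_a))           # (frames, d_a)

# N mapping pairs; trainable in the real model, random here for illustration.
W_t = rng.standard_normal((N, vocab, C)) * 0.01  # text mapping matrices
W_a = rng.standard_normal((N, d_a, C)) * 0.01    # acoustic mapping matrices

similarities = []
for i in range(N):
    anchor = T @ W_t[i]                          # i-th anchor, (frames, C)
    v_a = E @ W_a[i]                             # mapped acoustics, (frames, C)
    # Frame-wise scaled dot-product, as in attention [23].
    s_i = np.sum(v_a * anchor, axis=-1) / np.sqrt(C)   # (frames,)
    similarities.append(s_i)

S = np.stack(similarities, axis=-1)              # (frames, N) similarity values
```

Each column of `S` is one mapping space's view of how close the acoustics sit to the text anchor at every frame.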

As shown in Fig. 1 (b), we use multiple sets of mappings to obtain similarities. All $W_t^i$ and $W_a^i$ are trainable, so different mappings can measure the accent shift from different aspects, similar to the multi-head attention mechanism [23]. A single similarity value can only indicate the proximity between one mapped acoustic vector and one mapped text anchor, but the combination of multiple similarities can reflect the shift directions and degrees of different accents. Therefore, the accent shift is calculated as:

$\mathrm{Shift} = \mathrm{Concat}(S_1, S_2, \dots, S_N)$
Accent shift is a relative representation, so the AR model also needs the reference of this shift, i.e., the pronunciation unit corresponding to the accent shift. Since each anchor $A_i$ mentioned above is related to different accents, the reference should preferably contain only pronunciation unit information, so none of the $A_i$ is suitable as a reference. Therefore, we set up a separate dimension-reduction matrix $W_r$ to reduce the dimension of the input one-hot text vector, representing the pure subword information. Then, we directly concatenate the accent shift and the dimension-reduced subword vector to form a linguistic-acoustic bimodal representation, which is used as the input for the subsequent AR model. We believe that explicitly preserving the accent shift is better for an AR task than adding it to the subword reference [19]. The bimodal representation is obtained as:

$B = \mathrm{Concat}(\mathrm{Shift},\; T W_r)$
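The concatenation described above can be sketched as follows; the reduced dimension and the random matrices are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

frames, vocab, N, d_r = 50, 100, 8, 32

# Accent shift: the N per-frame similarity values stacked together
# (computed as in the LASAS block; random here for shape illustration).
shift = rng.standard_normal((frames, N))

# Dimension-reduce the one-hot subword vectors with a separate matrix W_r
# so the reference carries only pronunciation-unit information.
text_ids = rng.integers(0, vocab, size=frames)
T = np.eye(vocab)[text_ids]                     # (frames, vocab)
W_r = rng.standard_normal((vocab, d_r)) * 0.01
reference = T @ W_r                             # (frames, d_r)

# Linguistic-acoustic bimodal representation: keep the shift explicit
# by concatenating it with the reference rather than adding them.
bimodal = np.concatenate([shift, reference], axis=-1)
```

Keeping the shift and the subword reference as separate slices of the feature vector is what lets the AR model read the shift explicitly.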
2.4 Accent Recognition Block

Since the bimodal representation is frame-level information, we first use a lightweight Transformer encoder with only a few layers to extract context-sensitive accent information. After that, similar to common AR models, a DNN is used to reduce the dimension of the representations. At last, an utterance-level accent prediction is extracted after a statistical pooling layer.
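The statistical pooling step, which maps frame-level features to a single utterance-level vector, is commonly implemented as a mean-and-standard-deviation concatenation, as in x-vector systems [2]; the paper does not spell out its exact variant, so the sketch below assumes that common form:

```python
import numpy as np

def statistical_pooling(frame_features):
    """Pool frame-level features (frames, dim) into one utterance-level
    vector by concatenating the per-dimension mean and standard deviation."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
frames, dim = 50, 64
pooled = statistical_pooling(rng.standard_normal((frames, dim)))
```

The pooled vector is then fed to the final classifier layer to produce the utterance-level accent prediction.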

3 Experiments

3.1 Data And Experimental Settings

In this study, we use an open-source English accent recognition challenge dataset, AESRC [7], for experiments. Specifically, we use the official Dev and Track1 Test sets for evaluation; there is no speaker overlap among the three sets. The duration of each accent in the training set is 20 hours, while in each test set it is 2 hours.

As mentioned above, we use an ASR model to align the text, initialize the Conformer encoder, and generate the text in Table 3. This ASR model is trained on the Librispeech (960 hours) [26] and AESRC datasets with the WeNet toolkit [27]. It contains 12 Conformer layers, and no language model is used. The CER of this ASR model on the AESRC Dev set is .

In all of our experiments, the Dev set does not participate in training. Our experiments use SpecAugment and model averaging (top 5), and no other data augmentation. We set the attention dimension of the Conformer and Transformer encoders, and the parameter $C$ of the LASAS block in Fig. 1 (b), to 256. The number of Transformer and DNN layers is 3, and the DNN dimension is halved at each layer. To exclude interference, we use real text in the training and test stages, except for the systems in Table 3 that use ASR-generated text.

3.2 Results And Analysis

3.2.1 Effectiveness of LASAS

In Table 1, Baseline1 is similar to the official baseline in the AESRC challenge, with the encoder replaced from Transformer to Conformer. For further comparison, we remove the LASAS block from Fig. 1 (a) and denote the remaining parts as Baseline2. System 3 is a typical LASAS-based AR model and is used as the base model for the following experiments. As shown in Table 1, system 3 is significantly better than Baseline1 and Baseline2. Without LASAS, even though Baseline2 is more refined and complex, its performance is not significantly improved over Baseline1. These experiments demonstrate that LASAS is effective.

ID  System     Configure                Dev Acc (%)  Test Acc (%)
1   Baseline1  Conformer(12L)+DNN       73.53        68.08
2   Baseline2  Conformer(3,6,9L)        75.68        67.71
3   LASAS      Conformer(3,6,9L)+LASAS  84.05        75.64
Table 1: Results of LASAS effectiveness experiments.
Figure 2: T-SNE transform of subword “”.
Figure 3: PCA transform of subword “” and “”. The circles are Gaussian contours. The straight lines connect the mean subword centroids of all accents and each individual accent subword centroid, which show accent shift distances.

We choose subwords “” and “” to visualize accent shift, because some accents show a significant shift on “”, while most accents have little shift on “”. We average the frame-level accent shifts of each subword segment in the Dev and Test sets, and then project the shifts into 2-dimensional space by t-SNE and PCA transforms, respectively. As shown in Fig. 2 and Fig. 3, the distinguishability of the accent shift is significantly higher than that of the acoustic embedding before LASAS, and the shift distance of “” is significantly larger than that of “”. In particular, for the subword “” in the Indian accent, the clustering effect is prominent in Fig. 2, and the shift distance in Fig. 3 is significantly larger than for the British and American accents. This is because “” is pronounced like “” in the Indian accent, which is unusual in other accents. These phenomena confirm that the accent shift extracted by LASAS conforms to the laws of linguistics.

3.2.2 Architecture of LASAS

Table 2 shows the effects of LASAS with different structures. First, comparing systems 3, 4, and 5, it makes sense to change the number of mapping spaces: increasing N significantly improves AR performance, which indicates that each mapping space learns different accent features. From system 6, we can see that better results can be obtained with richer acoustic embeddings. Finally, in system 7, we replace the reference from the dimension-reduced text with the acoustic embedding. The results of system 7 show that, even though the representation still includes the accent shift, using an acoustic embedding as the reference is not as good as the LASAS setup in system 3. This shows that using clear and simple linguistic information as a reference allows the model to learn accent information more easily.

ID  N   Enc Layers  Reference  Dev Acc (%)  Test Acc (%)
3   8   3,6,9       Text       84.05        75.64
4   4   3,6,9       Text       83.46        74.22
5   16  3,6,9       Text       85.42        77.16
6   8   1-12        Text       84.63        78.19
7   8   3,6,9       Acoustic   81.47        73.3
Table 2: Results of the structure comparison of LASAS.

3.2.3 Comparison of LASAS Versus Direct Concatenation

Considering that introducing linguistic information may improve AR performance even without LASAS, we conducted comparative experiments by directly concatenating (DC) the text vector and acoustic embedding as the bimodal representation in Fig. 1. As shown in Fig. 4, when the outputs of layers 1-3 are used as acoustic embeddings, both methods are less effective. When layers 4-6 are used, LASAS significantly outperforms DC and achieves the best results. After that, as deeper layers are used, the two approaches get closer. This is because the middle layers of the encoder contain more human-voice information, which LASAS requires, while the deeper layers contain much linguistic but little accent-related information, which degrades performance when used for LASAS. Overall, LASAS utilizes the linguistic-acoustic bimodal information more fully than DC.

Figure 4: Comparison of LASAS with direct concatenation (DC) of the acoustic embedding and text vector.

3.2.4 Comparison of Final System Versus AESRC Top-N

Table 3 shows the results of LASAS and the top-level schemes in the AESRC challenge [7]. In systems 16 and 17, a phone posteriorgram (PPG) feature extracted from a TDNN-based ASR model is used as the input for the AR model. In addition, system 16 used model fusion and many kinds of data augmentation, including TTS. Without model fusion and comprehensive data augmentation (except SpecAugment), our schemes achieve performance similar to system 17 on the Dev set. In system 18, researchers trained a hybrid CTC/attention-based ASR model and made the model learn to predict text and accent at the same time by transfer learning. In system 19, an ASR-AR multitask architecture is used, and the ASR-generated features, which contain sufficient implicit linguistic information, are used for the AR task. Our schemes significantly surpass systems 18 and 19, which fully proves the advantages of LASAS.

ID  System                           Text (Train+Test)  Dev Acc (%)  Test Acc (%)
16  Top1: AR (PPG) + Data Aug (TTS)  -                  91.3         83.63
17  Top1 w/o Data Aug                -                  84.51        -
18  Top2: AR+ASR Fusion              -                  80.98        72.39
19  Top3: AR+ASR Multitask           -                  81.1         69.63
20  LASAS: N=16, Enc=(4-9)L          Real+Real          85.39        77.79
21  LASAS: N=16, Enc=(4-9)L          Real+ASR           85.12        77.18
22  LASAS: N=16, Enc=(4-9)L          ASR+ASR            84.88        77.42
Table 3: Results of LASAS and AESRC top-level schemes. Real and ASR in the table represent using real text or ASR-generated text, respectively. Details of the ASR model are given in Section 3.1.

In systems 20-22, we further optimize the model according to the conclusions in Section 3.2.2 and Section 3.2.3, using the outputs of layers 4-9 of the Conformer encoder and setting the number of mapping spaces to 16. Then, we investigate the impact of text reliability on LASAS. As expected, in system 20, LASAS achieves the best results when using manually annotated transcripts (real text). If real text is used for training and ASR-generated text in the test stage (system 21), the accuracy on the Dev and Test sets drops by 0.27% and 0.61% absolute, respectively. If ASR-generated text is also used in the training stage (system 22), this gap is alleviated, and the accuracy on the Test set drops by only 0.37% absolute relative to system 20. This means that LASAS is highly robust to text errors: since it integrates information over the whole utterance, it maintains good performance even if some characters are wrong. At last, system 22 achieves 84.88% and 77.42% accuracy on the Dev and Test sets, a 4.82% and 6.94% relative improvement over system 18, respectively.
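The relative improvements quoted for the final system can be recomputed from the Table 3 accuracies, assuming the comparison system is Top2 (system 18):

```python
# Recompute the relative improvements of the final LASAS system (system 22)
# over the Top2 challenge system (system 18) from the Table 3 accuracies.
def rel_improvement(new, old):
    """Relative improvement in percent."""
    return 100.0 * (new - old) / old

dev = rel_improvement(84.88, 80.98)   # Dev: ~4.82% relative
test = rel_improvement(77.42, 72.39)  # Test: ~6.95% relative (quoted as 6.94%)
```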

4 Conclusions

In this paper, we propose a LASAS-based AR model. We extract the accent shift with a LASAS block and concatenate it with a text reference to form a bimodal representation for AR tasks. Experiments show that our method effectively improves AR performance. We visualize the accent shift and show its rationality. Furthermore, our scheme significantly outperforms the method that directly concatenates acoustic embedding and text vector, and also shows superior performance on the AESRC challenge dataset. In the future, we hope to extend LASAS to accented ASR tasks or text-dependent speaker verification tasks.


  • [1] R. Lippi-Green, English with an accent: Language, ideology, and discrimination in the United States.   Routledge, 2012.
  • [2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
  • [3] M. A. T. Turan, E. Vincent, and D. Jouvet, “Achieving multi-accent ASR via unsupervised acoustic model adaptation,” in Proc. Interspeech, 2020.
  • [4] A. Hanani and R. Naser, “Spoken Arabic dialect recognition using x-vectors,” Natural Language Engineering, pp. 691–700, 2020.
  • [5] S. Shon, A. Ali, Y. Samih, H. Mubarak, and J. Glass, “ADI17: A fine-grained Arabic dialect identification dataset,” in Proc. ICASSP, 2020, pp. 8244–8248.
  • [6] S. A. Chowdhury, A. M. Ali, S. Shon, and J. R. Glass, “What does an end-to-end dialect identification model learn about non-dialectal information?” in Proc. Interspeech, 2020, pp. 462–466.
  • [7] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie, “The accented English speech recognition challenge 2020: Open datasets, tracks, baselines, results and methods,” in Proc. ICASSP, 2021, pp. 6918–6922.
  • [8] Z. Zhang, Y. Wang, and J. Yang, “Accent recognition with hybrid phonetic features,” Sensors, 2021.
  • [9] J. Zhang, Y. Peng, P. Van Tung, H. Xu, H. Huang, and E. S. Chng, “E2E-based multi-task learning approach to joint speech and accent recognition,” arXiv preprint arXiv:2106.08211, 2021.
  • [10] Q. Gao, H. Wu, Y. Sun, and Y. Duan, “An end-to-end speech accent recognition method based on hybrid CTC/attention transformer ASR,” in Proc. ICASSP, 2021, pp. 7253–7257.
  • [11] M. Hämäläinen, K. Alnajjar, N. Partanen, and J. Rueter, “Finnish dialect identification: The effect of audio and text,” arXiv preprint arXiv:2111.03800, 2021.
  • [12] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, pp. 12449–12460, 2020.
  • [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
  • [14] J. C. Wells and J. C. Wells, Accents of English: Volume 1.   Cambridge University Press, 1982.
  • [15] M. Tjalve and M. Huckvale, “Pronunciation variation modelling using accent features,” in Proc. EuroSpeech, 2005.
  • [16] A. Lee, Y. Zhang, and J. Glass, “Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams,” in Proc. ICASSP, 2013, pp. 8227–8231.
  • [17] A. Lee and J. Glass, “Pronunciation assessment via a comparison-based system,” in Speech and Language Technology in Education, 2013.
  • [18] Y. Xiao, F. K. Soong, and W. Hu, “Paired phone-posteriors approach to ESL pronunciation quality assessment,” in Proc. Interspeech, 2018.
  • [19] Y. Wang, Y. Shen, Z. Liu, P. P. Liang, A. Zadeh, and L.-P. Morency, “Words can shift: Dynamically adjusting word representations using nonverbal behaviors,” in Proc. AAAI, 2019, pp. 7216–7223.
  • [20] W. Rahman, M. K. Hasan, S. Lee, A. Zadeh, C. Mao, L.-P. Morency, and E. Hoque, “Integrating multimodal information in large pretrained transformers,” in Proc. ACL, 2020, p. 2359.
  • [21] M. Nicolao, A. V. Beeston, and T. Hain, “Automatic assessment of English learner pronunciation using discriminative classifiers,” in Proc. ICASSP, 2015, pp. 5351–5355.
  • [22] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, 2017.
  • [24] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proc. ACL, 2016.
  • [25] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015.
  • [26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [27] B. Zhang, D. Wu, C. Yang, X. Chen, Z. Peng, X. Wang, Z. Yao, X. Wang, F. Yu, L. Xie et al., “Wenet: Production first and production ready end-to-end speech recognition toolkit,” arXiv e-prints, pp. arXiv–2102, 2021.