In sociolinguistics, an accent is a manner of pronouncing a language. Anyone who speaks a language does so in an accent. The way native speakers of a language pronounce it is generally taken as the standard or reference accent for that language. When non-native speakers speak that language, say an Indian person speaking English, the phonological requirements of the non-native language (here, English) interact with the phonological knowledge of their first language, say Hindi. This influences their manner of speaking, giving rise to what is considered a non-native accent.
Accents are interesting in their own right because they bear on a wide variety of social issues, such as the acceptance of speakers into a community and the indication of class in society, as well as linguistic issues such as those pertaining to the phonology of languages. This alone warrants a better understanding of accents. There is, however, another fundamental reason for studying them: speech always carries an accent. Since spoken communication is an important form of communication, studying accents is essential for designing technologies built to interact with human speech.
1.1 Indian Accents in English
The Internet has made English the lingua franca for conveying information about science, culture, sports and society across the world. Continued advancements in technologies supporting speech, in the form of audio and video media, have led to an increase in the use of spoken English on the web. Since these speakers come from many different linguistic backgrounds, English is spoken in many different accents across the world.
English has become an important language of communication among the younger generation of India because of its status as the language of formal education. A large number of young Indians are bilingual, i.e. they speak one of the 22 Indian languages as their first language alongside English. An implication of this is that when they speak English, interference from the phonology of their first language, e.g. Malayalam, gives rise to an accent. This accent is generally very distinct and readily identifiable, for example, as the Malayalam English accent, the Telugu English accent, the Bangla English accent, and so on.
Interestingly, this younger generation is also a large and growing group of users of speech-based technology through hand-held devices and voice assistants. Voice assistants have become very good at recognizing English spoken in a native accent; non-native accented speech, however, continues to be a challenge for them. If the automatic speech recognition (ASR) system of a voice assistant has a priori knowledge that the speaker will speak with a certain accent, it can be primed to listen for certain features in the voice, leading to greater recognition accuracy. For this technology to succeed, it becomes pertinent to identify and process accents in addition to the semantic content of the speech. Owing to its large number of speakers and vast variety of accents, English spoken within India is an excellent resource for creating and testing technology whose success is contingent on detecting, identifying and understanding native and non-native accents.
2 The Database
A key requirement for developing speech-based technology is access to a well-curated database of speech samples. Some of the widely used datasets for specific ASR tasks are very well labelled, either manually or through automation. For example, Google AudioSet is a massive dataset for audio event detection that includes more than 2 million manually-labelled 10-second sound clips belonging to over 600 classes. Similarly, VoxCeleb is a speaker identification dataset containing audio clips extracted from interviews of celebrities.
In this section, we first establish certain key requirements for constructing an accent database well-suited for ASR tasks. Then we survey a few existing accent datasets. Next, we describe our approach and setup for collecting our database, AccentDB (https://accentdb.github.io/). Finally, we present an analysis of the distribution of speech samples that constitute AccentDB.
2.1 Key Requirements
The following are some of the key requirements for an accent database suitable for ASR systems.
1. Variety of Speakers: In order to represent the speaker differences, the database should ideally contain spoken material from a wide range of speakers.
2. Words vs. Sentences: The pronunciation patterns of words spoken in isolation differ from those in connected speech, due to suprasegmental phenomena such as elision and assimilation. Therefore, for purposes pertaining to the processing of spoken sentences, the database should contain sentence-length material.
3. Uniformity of Content: For the sake of isolating and identifying accents, it is necessary to have uniformity in the speech material across speakers. One way to address this is to have all the speakers speak the same sentences, preferably at the same speed. A related requirement is for the speech material to be phonetically balanced, so that no specific phonemes get over-represented in the database.
4. Semantic Requirement: The sentences should be meaningful, which prevents semantic factors from affecting their pronunciation.
2.2 Existing Accent Databases
Various attempts have been made in the past at creating accent-focused speech databases with varied data sources, speakers, accents and corpora. carlos-isabel-antonio created a word database with 20 speakers per accent from a total of 6 countries. They used a small corpus of around 200 isolated English words, each spoken twice in a row by each speaker. deep-learning-for-video-games presented a collection of British and American accents in the form of utterances from non-playable characters of the video game "Dragon Age: Origins" (BioWare, 2009), with the accents labelled manually by three individuals.
Two of the most popular datasets used for accent-related tasks are the Foreign Accented English (FAE) corpus and the Speech Accent Archive. FAE comprises 4925 telephonic utterances of English by native speakers of 22 different languages. The subjects spoke about themselves for 20 seconds, and the recordings were rated on a 4-point scale to determine the strength of accent. The Speech Accent Archive is a crowd-sourced collection of recordings of an English passage (colloquially referred to as "Please call Stella.") together with publicly available information about each speaker's demographic and linguistic background (Speech Accent Archive, George Mason University). The passage has been spoken by more than 2000 speakers covering over 100 accents and 30 languages, but a significant number of samples are not tagged with the correct accent. Because the database is crowd-sourced, there is no independent supervision of the accent label assigned to a recorded audio sample. For instance, a speaker whose first language is Bengali/Bangla might mark his samples as belonging to the Bangla accent even if his accent has neutralized after living in the UK for many years. Another drawback of such crowd-sourcing approaches is that neither the recording environment nor the recording hardware is consistent across speakers, which introduces significant noise into the samples. The lack of a correct label for each sample adds to the difficulty of using any supervised learning algorithm for speech recognition tasks.
Table 1: The first five Harvard Sentences.
1. The birch canoe slid on the smooth planks.
2. Glue the sheet to the dark blue background.
3. It's easy to tell the depth of a well.
4. These days a chicken leg is a rare dish.
5. Rice is often served in round bowls.
Table 2: Number of samples, duration and number of speakers for each accent (the native accents, e.g. American, were generated via Amazon Polly).
The CMU Festvox project has a dataset titled CMU-Arctic, which contains speech samples in native English accents. In CMU-Indic, another dataset in the Festvox project, the content is not uniform across samples: they are spoken not in one language with different accents but in different languages altogether. The samples do incorporate certain manifestations of an accent, as is evident from samples in an Indian language such as Gujarati, but accent classification on such data entails modelling two attributes at once - the difference in utterances and the accent itself.
2.3 Introducing AccentDB
To fulfil the aforementioned key requirements and to avoid the issues faced by some existing databases, we created a multi-pair parallel corpus of well-structured and labelled accent data. The database, AccentDB, contains speech recordings in 9 accents: 4 non-native accents (Bangla, Malayalam, Odiya and Telugu), 1 metropolitan Indian accent referred to as "Indian", and 4 native accents, namely American, Australian, British and Welsh. The number of samples, total duration and number of speakers per accent are listed in Table 2.
AccentDB was collected using the Harvard Sentences, which are phonetically balanced sentences that use specific phonemes at the same frequency as they appear in the English language. The sentences are neither too short nor too long, making them suitable for the proper manifestation of accents in sentence-level speech. The Harvard Sentences comprise 72 sets of 10 sentences each; the first five sentences are listed in Table 1. We ensure that the corpus is parallel by recording a minimum of the same 25 sets across all 4 non-native accents. Additionally, we compile recordings of all 72 sets across the remaining 5 accents.
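The parallel structure of the corpus can be sketched as a manifest over the shared (accent, set, sentence) grid. The accent names come from the paper; the manifest layout itself is illustrative, not AccentDB's actual file format:

```python
from itertools import product

ACCENTS = ["bangla", "malayalam", "odiya", "telugu"]
SETS = range(1, 26)    # the 25 Harvard sets shared by all four accents
SENTS = range(1, 11)   # 10 sentences per set

# One manifest row per (accent, set, sentence). The corpus is parallel
# because every accent covers the same (set, sentence) grid.
manifest = [
    {"accent": a, "set": s, "sentence": n}
    for a, s, n in product(ACCENTS, SETS, SENTS)
]
```

Any (set, sentence) pair thus indexes four recordings with identical text but different accents, which is what makes pairwise accent modelling straightforward.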
2.4 Collection of Speech Data
The data for the non-native accents, namely Bangla, Malayalam, Odiya and Telugu, was collected by the authors. For the recording task, we recruited volunteers whom we identified as having strong non-native English accents in their daily conversations. These speakers were also required to have been native speakers of at least one Indian language since childhood. The demographics of the speakers can be found in Table 3.
The data was collected as audio recordings made inside a professionally-designed soundproof booth. The text of the sentences was presented to the participants on a computer screen through a web app (http://speech-recorder.herokuapp.com/) designed specifically for this purpose. The participants were asked to read the sentences aloud. The speech samples were recorded using the following equipment:
Microphone: Audio-Technica AT2005USB Cardioid Dynamic Microphone
Recorder: Tascam DR-05 Linear PCM Recorder
Each set was repeated thrice to account for the speech variations in each sentence spoken by the same speaker.
For the 4 native accents, namely British, Welsh, American and Australian, and the metropolitan Indian accent, we generated speech samples using Amazon Polly's Text-to-Speech API (https://aws.amazon.com/polly/). The API was driven by a Speech Synthesis Markup Language (SSML) file (HarvardSentences.ssml) containing the Harvard Sentences.
2.5 Cleaning and Post-processing
Any noise or other unwanted events (sneezes, giggles, etc.) introduced during recording were sliced out using the Audacity software. The cleaned audio files, consisting of more than an hour of recordings from each speaker, were split into one audio file per sentence using a pre-computed silence threshold: a split was created wherever the energy level stayed below the threshold for at least a fixed duration. We then trimmed the silences at the beginning and end of each sample to create richer data. The processed audio files were organized into directories tagged with the accent of the speaker.
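A minimal sketch of the energy-based splitting described above, operating on per-frame energies in plain Python. The threshold and minimum-silence length are placeholders for the paper's pre-computed values, which are not stated:

```python
def split_on_silence(energy, thresh, min_silent_frames):
    """Given per-frame energies, return (start, end) frame spans of speech,
    splitting wherever energy stays below thresh for min_silent_frames
    consecutive frames. Trailing silence is trimmed from each span."""
    spans, start, silent_run = [], None, 0
    for i, e in enumerate(energy):
        if e < thresh:
            silent_run += 1
            if start is not None and silent_run >= min_silent_frames:
                # close the current speech span, excluding the silent tail
                spans.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i       # a new speech span begins
            silent_run = 0
    if start is not None:
        spans.append((start, len(energy)))
    return spans

# Toy energy track: speech, 3 silent frames, speech again.
spans = split_on_silence([1, 1, 0, 0, 0, 1, 1], thresh=0.5, min_silent_frames=3)
```

Each returned span maps back to sample indices by multiplying by the frame length, yielding one audio file per sentence.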
2.6 Separability of AccentDB: An Analysis
Understanding the distribution of the AccentDB speech recordings provides insight into the quality of the collected data. To use speech samples for any computational task or mathematical representation, they must first be converted to feature vectors. Mel-Frequency Cepstral Coefficient (MFCC) extraction is a widely used technique for representing audio files as vectors. MFCC extraction generally produces very high-dimensional vectors (for example, google-attention use 40 MFCC dimensions per audio frame). We concatenated the MFCC features of each frame to obtain a high-dimensional acoustic vector for the full length of a clip. Since modelling the distribution of high-dimensional data is difficult, we performed dimensionality reduction to obtain a set of principal variables and reduce the number of random variables under consideration. Dimensionality reduction techniques, when applied to speech, learn projections from high-dimensional acoustic spaces into lower-dimensional spaces.
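The flatten-then-project pipeline can be sketched in NumPy. The MFCC matrices below are random stand-ins (a real pipeline would extract them with an audio library), and PCA is computed directly via SVD of the centred data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-clip MFCC matrices of shape (n_frames, n_mfcc);
# sizes are illustrative (40 MFCC dims per frame as in the cited setup).
n_clips, n_frames, n_mfcc = 60, 100, 40
mfcc = rng.normal(size=(n_clips, n_frames, n_mfcc))

# Concatenate the frames of each clip into one acoustic vector.
X = mfcc.reshape(n_clips, -1)          # (60, 4000)

# PCA via SVD: centre, decompose, project onto the top-2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                     # X2 has shape (60, 2)
```

The 2-D projections X2 are what get scatter-plotted per accent; UMAP and t-SNE replace the linear projection step with non-linear embeddings but consume the same flattened vectors.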
Principal Component Analysis of the acoustic vectors shows that the recordings from each accent in our database follow a definite convexity (Fig. 1(a)). We also performed Uniform Manifold Approximation and Projection with 20 and 50 neighbours to show that the speech samples from one accent lie close to each other (Fig. 1(b) & Fig. 1(c)). Further, t-SNE projections of the data (Figures 1(d), 1(e) & 1(f)) show the separability of the accents, establishing that the speech samples collected in AccentDB model their respective accents distinctively and are well-suited for use in machine learning tasks.
3 Accent Classification
Table 4: Accuracy of the classification models on the three tasks.

| Task | Type | MLP | CNN | CNN (with attention) |
| Indian vs. Non-Indian | 2-class classification | 100.0% | 100.0% | 100.0% |
| Non-native Indian Accents | 4-class classification | 98.3% | 98.6% | 99.0% |
| All accents | 9-class classification | 98.4% | 99.3% | 99.5% |
Table 5: Binary classification with training and testing sets drawn from different speakers, for the accent pairs Bangla-Telugu, Bangla-Malayalam and Telugu-Malayalam.
Accent classification is an important step in tasks such as speech profiling and speaker identification. Current state-of-the-art ASR systems are already within striking range of human-level performance in terms of word error rate (WER). Accent classification can also be used to help ASR systems generalize better to unseen data by augmenting the training dataset with more relevant features (speech-augmention-for-asr; specaugment). Accent is one such highly relevant feature of human communication, and hence accent classification has been crucial in the combined modelling of speech.
Over the years, multiple approaches have been used to tackle accent classification for speech recognition. These include the classical methods of Gaussian mixture models (GMMs) and hidden Markov models (HMMs), machine learning models using Support Vector Machines (SVMs), and, more recently, deep neural architectures such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks.
An early architecture proposed for this task by carlos-isabel-antonio used parallel ergodic nets with context-dependent HMM units for word-level accent identification; their system obtained a global accuracy score of % on word-level speech data comprising different accents. ge2015accent used purely acoustic features to build a GMM-based accent classifier optimized using Heteroscedastic Linear Discriminant Analysis (HLDA); using the FAE dataset, they achieved a success rate of % across accents.
Recent advancements in deep learning architectures have proven a great success in a variety of speech recognition tasks, including accent classification. yang2018joint highlighted the importance of accent information for acoustic modeling and presented a joint end-to-end model for multi-accent speech recognition that achieves a significant improvement in word error rate. They used a bi-directional LSTM model with average pooling, trained with a Connectionist Temporal Classification (CTC) loss function. bird2019accent explored a variety of techniques for accent classification on diphthong vowel sounds collected from speakers from Mexico and the United Kingdom, achieving a classification accuracy of % using an ensemble of Random Forest and LSTM models.
3.1 Experimental Setup

We ran classification experiments on our database using two standard baseline neural architectures - a multi-layer perceptron (MLP) and a CNN. We evaluated the classification models in three setups: (i) classifying between the Indian accents collected in our database and the non-Indian accents obtained from AWS Polly, (ii) classifying amongst the 4 collected Indian accents, and (iii) classifying amongst all 9 accents in AccentDB.
3.1.1 Pre-processing

Each audio file was divided into 10 ms segments with a 1 ms overlap between segments. All the samples were less than seconds in duration and were hence padded to a standardized input dimension. For each of these segments, we extracted MFCC features, so the final input for n audio files is a vector of dimension (n, , ). This two-dimensional, image-like vector for each audio file was used as the input to the first convolutional layer in all CNN-based models. For the MLP models, the input vector was created by flattening the image to one dimension.
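The framing-and-padding step might look as follows in NumPy. The 10 ms segment length and 1 ms overlap follow the text, while the 500-frame cap and 16 kHz sampling rate are assumed purely for illustration:

```python
import numpy as np

def frame_and_pad(signal, sr, seg_ms=10, overlap_ms=1, max_frames=500):
    """Slice a 1-D signal into seg_ms windows overlapping by overlap_ms,
    then zero-pad the frame axis to max_frames for a fixed input shape.
    (max_frames is an assumed cap, not the paper's value.)"""
    seg = int(sr * seg_ms / 1000)            # samples per segment
    hop = seg - int(sr * overlap_ms / 1000)  # step size given the overlap
    frames = np.stack([signal[s:s + seg]
                       for s in range(0, len(signal) - seg + 1, hop)])
    return np.pad(frames, ((0, max_frames - len(frames)), (0, 0)))

padded = frame_and_pad(np.zeros(16000), sr=16000)   # 1 s of silence
```

MFCC extraction would then map each fixed-shape frame matrix to the (frames, features) "image" fed to the first convolutional layer; the MLP instead receives the same array flattened to one dimension.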
3.1.2 Model Architecture and Training
The MLP model consists of multiple fully connected layers stacked together. The CNN model uses a combination of 1D convolutional and max-pooling layers, followed by multiple fully connected layers. A softmax activation in the final layer of each model produces the class probabilities. We used the Adam and RMSProp optimizers with a learning rate of , together with the cross-entropy loss function. Dropout was applied in the dense layers for regularization. A variety of batch sizes was tried during training to achieve the best results. For evaluation, we held out % of the total data as a test set.
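The final-layer softmax and the cross-entropy loss mentioned above can be made concrete with a small NumPy sketch; the logits are toy values, not outputs of the trained models:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)   # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class per example."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

logits = np.array([[2.0, 0.5, 0.1],   # toy final-layer outputs, 3 classes
                   [0.2, 3.0, 0.4]])
p = softmax(logits)                   # rows are class probabilities
loss = cross_entropy(p, np.array([0, 1]))
```

In the actual models this pairing is what the optimizer minimizes; frameworks typically fuse the two steps for stability.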
As the next step, we augmented the CNN with an attention mechanism. Attention has been applied successfully in machine translation and image captioning. Promising results have also been obtained for speech-based tasks, e.g. in work solving acoustic scene classification using a Convolutional, Long Short-Term Memory, Deep Neural Network (CLDNN) and several attention-based LSTM models. We took this motivation further and applied attention to the accent data to analyze which segments of the audio are given more importance by our classification model. We used multiple variations of attention: firstly, 1D and 2D variations based on the number of dimensions attended over. In the 1D version, the attention vector is shared across the input dimensions, which correspond to the number of MFCC features used. In the 2D version, separate attention probability vectors are learnt for each input feature dimension. We also varied the layer to whose output attention is applied.
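The 1D versus 2D attention variants can be illustrated on a single clip's frame-by-feature matrix. The sizes and random weights below are placeholders, not those of the trained model:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
T, F = 111, 20                       # frames x features, illustrative sizes
H = rng.normal(size=(T, F))          # per-frame features from an earlier layer

# "1D" attention: one scalar weight per frame, shared across all F features.
w = rng.normal(size=F)
alpha_1d = softmax(H @ w)            # (T,) distribution over frames
pooled_1d = alpha_1d @ H             # (F,) attention-weighted frame summary

# "2D" attention: a separate distribution over frames per feature dimension.
W = rng.normal(size=(F, F))
alpha_2d = softmax(H @ W, axis=0)    # (T, F); each column sums to 1
pooled_2d = (alpha_2d * H).sum(axis=0)  # (F,)
```

Plotting alpha_1d against time is what yields the per-word attention spikes discussed later for the Malayalam-accent clips.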
3.2 Results

We evaluated our MLP, CNN and attention-CNN models on the three classification tasks described in Section 3.1. The accuracy results are summarized in Table 4. All the models performed exceptionally well, with the CNN models having a slight edge in accuracy, as expected. In the binary classification setup in particular, the models detected the correct class with 100% accuracy. These high accuracies can also be attributed to the quality and separability of the dataset, as discussed in Section 2.6.
3.3 Train on One, Test on Other
Speech classification models tend to overfit when they have a large number of trainable parameters but the training data is not extensive, leading to poor generalization from training samples to unseen samples. To test whether the models described above perform well on unseen data, we evaluated them in a challenging setup. Three accents in AccentDB - Bangla, Telugu and Malayalam - were chosen for this experiment, and the data for each accent was split into two parts by speaker. We trained binary classifier models on pairs of accents using only one half of the data of each accent (i.e. the data of one speaker per accent), and tested them on the unseen half (i.e. the other speaker). Table 5 shows that our classifier models generalized well to test data consisting of samples from speakers never seen during training.
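The speaker-disjoint split can be sketched in plain Python; the speaker IDs and file names below are hypothetical:

```python
def speaker_split(samples, train_speakers):
    """Split (accent, speaker, clip) records so that test speakers are never
    seen in training - one speaker per accent trains, the other tests."""
    train = [s for s in samples if s["speaker"] in train_speakers]
    test = [s for s in samples if s["speaker"] not in train_speakers]
    return train, test

samples = [
    {"accent": "bangla", "speaker": "B1", "clip": "b1_001.wav"},
    {"accent": "bangla", "speaker": "B2", "clip": "b2_001.wav"},
    {"accent": "telugu", "speaker": "T1", "clip": "t1_001.wav"},
    {"accent": "telugu", "speaker": "T2", "clip": "t2_001.wav"},
]
train, test = speaker_split(samples, train_speakers={"B1", "T1"})
```

Splitting by speaker rather than by clip is the design choice that makes the test a genuine generalization check: no acoustic trace of a test speaker leaks into training.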
3.4 Interpreting Attention
We analyzed the attention scores by plotting them against the corresponding MFCC features for two audio files in the Malayalam accent. In Figure 2, we observe a clear spike around the word "Four", while in Figure 3 the spikes correspond to timestamps around the words "depth" and "well". These can be attributed to the different pronunciations of particular phoneme sequences: for example, "depth" contains sounds that do not occur next to each other in the phonology of Indian languages, so each participant falls back on their own phonology to pronounce the word.
4 Accent Neutralization
State-of-the-art ASR systems often do not perform well on rare non-native accents, primarily due to the unavailability of good-quality training data. We present our dataset of Indian accents to augment the training data of existing ASR systems and help make them more robust. ASR systems that perform well on native accents can further be improved for rare accents through accent neutralization: processing a non-native audio file to make it sound like a native accent on which the ASR system performs well. The accent neutralization performed here involves extracting and transforming the para-linguistic and non-linguistic features of a source accent into those of a target accent while preserving the linguistic features. Acoustic feature conversion has been explored in other speech processing tasks as well: voice-conversion-challenge devised a challenge to better understand transformations of voice identity among speakers, and for accent conversion, foreign-english-accent-conversion proposed a method to create accented samples of words by leveraging the difference between a dialect and General American English. Their model learns generalizations that would otherwise have to be written manually as rules by phonologists.
With the success of neural networks in speech modelling, recent works have attempted end-to-end accented speech recognition. The experiments of bearman-accent-stanford and end-to-end-speech use datasets consisting primarily of native accents (such as American, British, Australian and Canadian) along with Indian, but the models' performance is limited by the absence of non-native accent data. As reported in the next sections, we used the data collected in AccentDB to train and test deep neural networks on the task of non-native accent neutralization. We propose these transformation models as an inference-time pre-processing step for ASR systems, to overcome the challenges associated with low-resource accents.
4.1 Pairwise Neutralization
A pairwise accent neutralization system consists of a set of individual models, each of which converts MFCC feature vectors of samples in a source accent into those of samples in a target accent. This set of converters can be used in conjunction with an accent classification system: an input audio file is routed to the converter corresponding to the accent class predicted by the classifier, namely the converter that maps the predicted accent to the given target accent (Figure 4). The pre-processing step for this experiment is the same as described in Section 3.1.1.
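The routing logic - classify the source accent, then dispatch to the matching pairwise converter - might be sketched as below, with toy stand-ins for the trained classifier and converters:

```python
def neutralize(features, classify, converters, target="american"):
    """Route a clip's features to the pairwise converter for its predicted
    source accent. classify and converters stand in for trained models."""
    source = classify(features)
    if source == target:
        return features                      # already in the target accent
    return converters[(source, target)](features)

# Toy stand-ins: the classifier tags everything as 'bangla'; the converter
# is an identity placeholder for a trained pairwise autoencoder.
converters = {("bangla", "american"): lambda x: x}
out = neutralize([0.1, 0.2], classify=lambda x: "bangla",
                 converters=converters)
```

Keying the converter table by (source, target) makes explicit why the pairwise scheme scales quadratically in the number of accents, a cost the single multi-source model in Section 4.2 is designed to avoid.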
4.1.1 Model Architecture
We trained a stacked denoising autoencoder network consisting of a series of convolutional and pooling layers, followed by deconvolutional layers (required for upsampling) and pooling layers. The output of each layer is passed through a tanh activation function. We also trained a second, similar network to evaluate reconstruction on reversed pairs, where the source and target files are swapped. The autoencoder's loss function is the feature-wise mean squared error between the input and output vectors. We used the RMSProp optimizer with a learning rate of . The convolutional layers act as feature extractors for the input MFCC feature vectors and learn to encode them into a dense representation; the deconvolutional layers learn transformations on this dense representation to reconstruct the MFCC features of the target accent.
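The feature-wise mean-squared-error reconstruction loss can be written directly in NumPy; the array shapes here are illustrative, not those of the actual MFCC tensors:

```python
import numpy as np

def featurewise_mse(pred, target):
    """MSE computed per MFCC feature (averaged over frames), then averaged
    across features - the reconstruction loss sketched for the autoencoder."""
    per_feature = ((pred - target) ** 2).mean(axis=0)   # (n_features,)
    return per_feature.mean()

pred = np.zeros((5, 3))      # 5 frames x 3 features, all reconstructed as 0
target = np.ones((5, 3))     # target accent features, all 1
loss = featurewise_mse(pred, target)   # every entry off by 1 -> loss 1.0
```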
The reconstructed feature vectors obtained from the autoencoder model were evaluated on the classification accuracy metric using the CNN classifiers trained in Section 3. We performed this experiment for neutralizing the Bangla and Indian accents into 4 native accents. Our model performed very well on non-native to native accent neutralization, achieving an accuracy of >% on all experiments, and we obtained an accuracy of >% when converting from native to non-native accents as well. The model can also neutralize one non-native accent into another, as shown with the Odiya-Malayalam pair. The results in Table 6 show that the transformations learnt by our model can be used effectively as a pre-processing step enabling ASR systems to work well on non-native accents.
Table 7: Preliminary results for multi-source, multi-target neutralization.

| Model | 1 source, 2 targets | 2 sources, 2 targets |
| CNN Autoencoder + Skip Connections | 52.15% | 66.73% |
| LSTM Autoencoder + Skip Connections | 53.46% | 70.08% |
4.2 Multi-source, Multi-target Accent Neutralization

Extending the neutralization task to a set of n accents requires training a separate pairwise model for every (source, target) pair, i.e. a number of models quadratic in n. Moreover, any device using such a system would first need to identify the source accent before choosing the appropriate pairwise model. To overcome both of these challenges, we present a single model that can be trained over multiple accents to neutralize samples from Sn source accents to Tn target accents.
4.2.1 Training Setup

To train a single model with (source, target) pairs of samples belonging to multiple accents, we added a marker to each training pair, similar to the zero-shot neural machine translation system of Johnson et al. (2017). The MFCC feature vectors of the source accent samples were prefixed with a one-hot encoded representation of the target accent label (one dimension per target accent). Hence, each training pair is transformed as follows:

(Si, Tj) → ((LTj . Si), Tj)

where Si denotes input files of the i-th source accent, Tj denotes target files of the j-th accent, LTj denotes the one-hot label of the j-th target accent, and (.) represents the concatenation operation.
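The label-prefixing transformation can be sketched as follows; the list of target accents and the 8-dimensional MFCC vector are illustrative stand-ins:

```python
import numpy as np

# Assumed target-accent inventory for illustration (order fixes the one-hot).
ACCENTS = ["american", "australian", "british", "welsh"]

def prefix_target_label(mfcc_vec, target_accent):
    """Prepend a one-hot target-accent label LTj to a flattened source MFCC
    vector Si, mirroring the zero-shot-translation-style target marker."""
    label = np.zeros(len(ACCENTS))
    label[ACCENTS.index(target_accent)] = 1.0
    return np.concatenate([label, mfcc_vec])

x = prefix_target_label(np.ones(8), "british")   # label dims + MFCC dims
```

The same source vector can thus appear in training paired with different targets, distinguished only by the prefix, which is what lets one model serve every (source, target) combination.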
4.2.2 Experiments and Results
We used the prefixed inputs to run experiments in two setups. We started with a set of accents in which all samples were from the same source accent and were to be neutralized into two different target accents. We then experimented with the same set of accents but with samples from two different source accents. The convolutional autoencoder described in Section 4.1.1 was augmented with skip connections to propagate the target accent information, in the form of the label vector, through every layer up to the last one. We also experimented with a stacked LSTM autoencoder with skip connections. Table 7 compiles our preliminary results.
5 Conclusion and Future Work
We presented AccentDB, a well-labelled parallel database of non-native accents that will aid the development of machine learning models for speech recognition. A parallel corpus is better suited to tasks such as accent neutralization, where each source sample should correspond to a target sample with the same vocabulary so that the differences in accent can be modelled easily. We evaluated accent classification models in a variety of settings and discussed an interpretation of attention scores for analyzing audio frames. Finally, we showed the applicability of autoencoder models to accent neutralization. Future work includes enriching the database with more accents and a larger variety of speakers in terms of age and gender. We would also like to add a single-word database, ideally labelled for phonemes, so that the data is free of the effects of suprasegmental features.
6 Acknowledgements

This study was funded by the OPERA grant from BITS Pilani, provided to Dr. Pranesh Bhargava.
7 Bibliographical References
- Bahdanau, D., Cho, K., and Bengio, Y. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577-585.
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017) Audio Set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.
- IEEE (1969) IEEE recommended practice for speech quality measurements. IEEE No. 297-1969, pp. 1-24.
- Johnson, M., Schuster, M., Le, Q. V., et al. (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339-351.
- Kingma, D. P. and Ba, J. (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kominek, J. and Black, A. W. (2004) The CMU Arctic speech databases. In Proc. SSW5-2004.
- Ladefoged, P. (1993) A Course in Phonetics. 3rd edition, Harcourt, Fort Worth.
- (2007) CSLU: Foreign Accented English Release 1.2, LDC2007S08. Linguistic Data Consortium.
- Audacity Team. Audacity® software is copyright © 1999-2019 Audacity Team.
- Nagrani, A., Chung, J. S., and Zisserman, A. (2017) VoxCeleb: a large-scale speaker identification dataset. CoRR.
- Saon, G., et al. (2017) English conversational telephone speech recognition by humans and machines. CoRR abs/1703.02136.
- Tieleman, T. and Hinton, G. (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, pp. 3371-3408.
- Weinberger, S. H. and Kunath, S. A. (2011) The Speech Accent Archive: towards a typology of English accents. Language and Computers 73.
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048-2057.