Automatic speech recognition (ASR) is increasingly used, e.g., in emergency response centers, domestic voice assistants, and search engines. Because spoken language plays a paramount role in our lives, it is critical that ASR systems are able to deal with the variability in the way people speak (e.g., due to speaker differences, demographics, different speaking styles, and differently abled users). ASR systems promise to deliver objective interpretation of human speech.
State-of-the-art ASR systems are based on deep neural networks (DNNs). DNNs are often considered to be a harbour of objectivity because they follow a clear path against the set parameters applied to the provided dataset. Although studies on bias in ASR are only nascent, practice and recent evidence are troubling, suggesting that state-of-the-art ASR systems do not recognise the speech of everyone equally well. This evidence ranges from anecdotal (e.g., the Google Home of author O.S. typically does not recognise the speech of her 8-year-old daughter) to research- and policy-oriented. For instance, ASR systems have been shown to struggle with speech variance due to gender, age, speech impairment, race, and accents. Several studies on different languages have found gender differences: although most studies report that female speech is recognised better than male speech (Arabic [shariah2013effects], English [koenecke2020racial, adda2005speech, goldwater2010words], and French [adda2005speech]), the reverse pattern is also found (French [garnerin2019gender], English [tatman2017gender]). However, no difference in the recognition of male and female speech was found in a follow-up study of the latter study [tatman2017effects] nor in [garnerin2019gender]. [shariah2013effects] found that speakers younger than 30 years of age were better recognised than those older than 30 years. Moreover, ASR for child speech has proven more challenging than that for adult speech, due to children’s shorter vocal tracts, slower and more variable speaking rate, and inaccurate articulation [qian2017bidirectional]. A speech impairment is known to cause many problems for standard ASR systems, e.g., for impairments related to dysarthria [laureano2019study], stroke survival, oral cancer [halpern2020detecting] or cleft lip and palate [schuster2006evaluation].
Additionally, recent studies demonstrate how dominant voice assistants perpetuate a racial divide by misrecognising the speech of black speakers more often than that of white speakers [koenecke2020racial, tatman2017effects]. Finally, ASR systems are typically trained on speech from native speakers of a “standard” variant of a language, inadvertently discriminating not only against non-native speakers, whose speech yields high error rates [wu2020see, palanica2019you], but also against speakers of regional or sociolinguistic variants of the language (English [koenecke2020racial, tatman2017gender, tatman2017effects], Arabic [shariah2013effects]).
There are many factors that can cause this bias. First, the composition of the training data plays an important role. Moreover, a speaker with a type of language usage that deviates from the training data transcripts can lead to a mismatch with the language model. Articulation differences (i.e., differences in sound realisations) due to differences in speaking style, (regional/non-native) accent, vocal tract differences (e.g., due to gender, age) can lead to a mismatch between the speaker and the trained acoustic models (AMs). Additionally, a slower or faster speaking rate will result in a mismatch with the AMs. Another source of a possible bias is that the transcriptions can be biased. Anecdotal evidence (from author B.M.H. on the Jasmin-CGN corpus [cucchiarini2006jasmin], see also Section 3.3) suggests that production errors of children are corrected (“normalised” towards what should have been said) in a more lenient way than those of non-native speakers (transcriptions tend to be more verbatim, including restarts), which leads to an increase in out-of-vocabulary (OOV) words and consequently an underestimation of the recognition performance for the latter group. Importantly, bias also creeps in far before the datasets are collected and deployed, e.g. when framing the problem, preparing the data and collecting it. Caliskan et al. showed that language corpora actually contain human-like biases [caliskan2017semantics]. Moreover, possibly, bias can be due to the specific architectures and algorithms used in ASR system development.
Powered by these concerns and equipped with a broad understanding of bias, our overarching goal in this project is to uncover bias in a standard DNN-based ASR system to work towards proactive bias mitigation in ASR systems. In this paper, we systematically investigate how well a state-of-the-art ASR system for Dutch recognises speech from different groups of speakers in order to quantify the bias in a standard, state-of-the-art Dutch ASR system (code: https://github.com/syfengcuhk/jasmin). In other words, we investigate how well the ASR system can deal with the diversity in speech. In contrast to the work described above, which typically focused on one to three dimensions, here we investigate possible bias against gender, age (children and older adults), regional accents and non-native accents. We compare word error rates (WERs), but also carry out an in-depth analysis of which sounds are particularly prone to misrecognition in order to understand where bias is occurring. In this work, we focus on bias in the dataset, with a particular focus on bias due to articulation differences. Based on our findings, we suggest potential bias mitigation strategies.
2 Experimental set-up
2.1.1 Dutch Spoken Corpus (CGN)
The CGN corpus [oostdijk2000spoken] is used to train the standard-purpose ASR system in this study. CGN contains Dutch recordings spoken by speakers (age range 18-65 years old) from all over the Netherlands (NL) and Flanders (FL, in Belgium). It covers speaking styles including but not limited to read speech, broadcast news (BN) and conversational telephone speech (CTS). In this study, only the CGN data from NL is used, and its training and test data partition follows [leeuwen2009results]. The total amount of training material is 483 hours, spoken by 1185 female and 1678 male speakers.
2.1.2 Jasmin-CGN corpus
The Jasmin-CGN corpus [cucchiarini2006jasmin], which is an extension of the CGN corpus, is used to evaluate the standard-purpose ASR system trained in this study on the dimensions of gender, age, regional and non-native accent. (The training data of both CGN and Jasmin-CGN are recorded under a wide variety of recording conditions, which are likely to be non-overlapping between the two corpora; this might lead to an additional ASR performance deterioration on Jasmin-CGN.) Specifically, we use the speech from the following groups:
DC: native children; age 7–11; 12h 21m of speech;
DT: native teenagers; age 12–16; 12h 21m of speech;
DOA: native older adults; age 65+; 9h 26m of speech.
These speakers come from four different regions in the Netherlands: W: West, T: Transitional, N: North, S: South. Moreover, we were interested in testing the standard ASR trained on NL Dutch on another variant of Dutch: Flemish Dutch. Here, we follow the same age division as for the Dutch speakers:
FC: Flemish children; age 7–11; 6h 10m of speech;
FT: Flemish teenagers; age 12–16; 6h 10m of speech;
FOA: Flemish older adults; age 65+; 5h 5m of speech.
Table 1 shows the number of speakers broken down by gender (female, male) for each age group and each region. In this study, FL is treated as a “region” similar to W, T, N, and S.
Finally, we have two groups of non-native speakers from the Netherlands, children and adults, with a wide range of native languages, such as Turkish and Moroccan Arabic:
NNC: non-native children; age 7–16; 12h 21m of speech;
NNA: non-native adults; age 18–60; 12h 21m of speech.
Table 2 shows the number of non-native children and adults broken down by gender (female, male); for the adults, the numbers are also broken down by Dutch proficiency level according to the Common European Framework (CEF; A1 is the lowest level).
The Jasmin-CGN corpus consists of read speech and human-machine interaction (HMI) speech, both of which are used in the experiments.
2.2 State-of-the-art ASR system for Dutch
We adopt a hybrid DNN-HMM architecture [dahl2011context] for training an ASR system, using Kaldi [povey2011kaldi]. We tested different mainstream DNN AM structures such as TDNNF, TDNN-LSTM and TDNN-BLSTM on the CGN test sets (BN and CTS) and found TDNN-BLSTM to be the best; TDNN-BLSTM is therefore used throughout our experiments. The TDNN-BLSTM model consists of three TDNN layers of dimension 1024, with three pairs of forward-backward LSTM layers of cell dimension 1024 on top. The model is trained with the lattice-free maximum mutual information (LF-MMI) criterion [povey2016purely]. We applied data augmentation techniques including speed perturbation [ko2015audio], reverberation [ko2017study] and noise [snyder2015musan] to the CGN training material, increasing the total hours of training data nine-fold, in order to increase our AM’s robustness towards different recording conditions in the evaluation data. The input features to the AM are 40-dimensional high-resolution MFCCs. The AM is trained for 4 epochs. Context-dependent phone alignments used to train the AM are obtained by forced alignment using a GMM-HMM trained beforehand with the same training data as that for the TDNN-BLSTM. The language model (LM) in our ASR system is an RNNLM [xu2018neural]. It consists of 3 TDNN layers interleaved with 2 LSTM layers. To apply the RNNLM, a tri-gram LM is first used to generate N-best results. After that, the RNNLM rescores the N-best results to obtain the final recognition results. The RNNLM and the tri-gram LM are trained using the training data transcriptions in CGN.
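The two-pass decoding described above can be illustrated with a minimal sketch of N-best rescoring. This is not the Kaldi implementation used in this study: the `Hypothesis` fields, the linear interpolation of the two LM log-probabilities, the weights, and all scores are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: str
    am_logp: float     # acoustic log-score from the first pass
    ngram_logp: float  # tri-gram LM log-probability of the word sequence
    rnnlm_logp: float  # RNNLM log-probability of the same word sequence


def rescore(nbest: List[Hypothesis],
            rnnlm_weight: float = 0.5,
            lm_scale: float = 1.0) -> List[Hypothesis]:
    """Re-rank an N-best list, best hypothesis first.

    Assumed scheme: the LM score is a linear interpolation of the tri-gram
    and RNNLM log-probabilities, added to the acoustic score.
    """
    def total(h: Hypothesis) -> float:
        lm = (1.0 - rnnlm_weight) * h.ngram_logp + rnnlm_weight * h.rnnlm_logp
        return h.am_logp + lm_scale * lm

    return sorted(nbest, key=total, reverse=True)


# Hypothetical scores, not taken from any experiment.
nbest = [
    Hypothesis("ik sal gaan", am_logp=-9.0, ngram_logp=-6.5, rnnlm_logp=-9.0),
    Hypothesis("ik zal gaan", am_logp=-10.0, ngram_logp=-6.0, rnnlm_logp=-4.0),
]

first_pass = rescore(nbest, rnnlm_weight=0.0)  # tri-gram LM only
second_pass = rescore(nbest, rnnlm_weight=0.5)  # with RNNLM rescoring
print(first_pass[0].words, "->", second_pass[0].words)
```

With these toy scores, the first pass prefers the hypothesis with the better acoustic score, while the RNNLM term moves the other hypothesis to the top of the list.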
2.3 Experiments and Evaluation
In our experiments, the potential bias due to gender, age, regional and non-native accents is estimated for read speech and HMI speech separately. This allows us to investigate whether the size of the potential bias is influenced by the speaking style of the person. Read speech is typically well articulated, and in general, ASR systems tend to perform well on read speech. HMI speech is less well prepared than read speech and possibly allows for more speaker-dependent articulations and differences in word usage, which might be more problematic for ASR systems, and consequently have an influence on the size of the bias.
The potential bias is estimated in terms of differences in WER between the different speaker groups. Additionally, we carry out an in-depth analysis at the phoneme level to investigate whether certain phonemes are prone to misrecognition, and thus to what extent atypical pronunciations are a possible source of bias. To that end, we use a phoneme error rate (PER) based technique. The PER is calculated as follows: first, the word-level ground-truth and hypothesised (by the ASR system) transcripts are converted to phoneme-level sequences using the Dutch lexicon in CGN. Second, the ground-truth and hypothesised phoneme sequences are aligned using the Levenshtein distance, after which the PER is calculated (the source code of the analysis method can be found at: https://github.com/karkirowle/relative_phoneme_analysis).
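The two steps of the PER calculation can be sketched as follows. This is a minimal illustration rather than the code in the repository referenced above: the two-word lexicon is hypothetical (the study uses the CGN lexicon), and attributing only substitutions and deletions to a reference phoneme, with insertions counted separately, is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (reference, hypothesis) phoneme


def to_phonemes(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Step 1: convert a word sequence to a phoneme sequence via a lexicon."""
    return [p for w in words for p in lexicon[w]]


def align(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Step 2: Levenshtein alignment; None marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    pairs.reverse()
    return pairs


def phoneme_error_stats(pairs: List[Pair]) -> Tuple[Counter, Counter, int]:
    """Count occurrences and errors per reference phoneme, plus insertions."""
    totals, errors, insertions = Counter(), Counter(), 0
    for r, h in pairs:
        if r is None:
            insertions += 1
            continue
        totals[r] += 1
        if h != r:
            errors[r] += 1
    return totals, errors, insertions


# Hypothetical lexicon entries for the "sullen vs zullen" type of confusion.
lexicon = {"zullen": ["z", "Y", "l", "@", "n"], "sullen": ["s", "Y", "l", "@", "n"]}
ref = to_phonemes(["zullen"], lexicon)
hyp = to_phonemes(["sullen"], lexicon)

pairs = align(ref, hyp)
totals, errors, insertions = phoneme_error_stats(pairs)
overall_per = (sum(errors.values()) + insertions) / sum(totals.values())
per_phoneme = {p: errors[p] / totals[p] for p in totals}
print(overall_per, per_phoneme)
```

The per-phoneme rates in `per_phoneme` are the quantities that a ranking of the most misrecognised phonemes would be based on.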
3.1 Baseline results
Since there are no standard read speech and HMI test sets in CGN (which was used for training our ASR), the ASR system was first evaluated on the CGN standard BN and CTS test sets for reference. The ASR achieved 5.5% WER on the BN set (female speech: 5.5%; male speech: 5.4%), and 20.8% WER on the CTS set (female speech: 17.9%; male speech: 23.2%).
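As a small worked example of how a gender gap can be read off these baseline numbers, the sketch below computes the absolute WER difference (in percentage points) and, as an additional assumption not used elsewhere in this paper, the difference relative to the female WER. The function name `wer_gap` is hypothetical; the WER values are those reported above.

```python
from typing import Tuple


def wer_gap(wer_a: float, wer_b: float) -> Tuple[float, float]:
    """Return (absolute gap in points, gap relative to wer_a) for wer_b - wer_a."""
    absolute = wer_b - wer_a
    return absolute, absolute / wer_a


# Baseline WERs (%) on the CGN test sets, as reported in the text.
baseline = {
    "BN": {"female": 5.5, "male": 5.4},
    "CTS": {"female": 17.9, "male": 23.2},
}

for name, wers in baseline.items():
    abs_gap, rel_gap = wer_gap(wers["female"], wers["male"])
    print(f"{name}: male - female = {abs_gap:+.1f} points ({rel_gap:+.1%} relative)")
```

On BN the gap is negligible (about 0.1 points), whereas on CTS male speech is recognised 5.3 points worse than female speech.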
3.2 Word recognition results
The WER averaged over all speakers was 36.2% on read speech and 47.5% on HMI speech. Table 3 shows the WER per age group, for the female and male speech separately and averaged over both genders (column Avg), for read speech and HMI separately. The top rows report the results for the native Dutch speakers per age group; the bottom rows for the non-native speakers per age group. The WERs per gender, averaged over all age groups (row Avg), over the native (row AvgD) and non-native (row AvgN) Dutch speakers, respectively, are also shown.
Table 3 shows that, in general, female speech is better recognised than male speech. This is true for all native and non-native groups and for both speech styles. The female-male WER difference is the largest in DOA and the smallest in DC, for both the read and HMI speech styles.
Looking at the different age groups, Table 3 shows that among the native speakers, DT achieves the best WER in both read and HMI speech, followed by DOA, while DC was the worst recognised. Among the non-native speakers, the performance of NNC and NNA does not differ much (absolute 1.8% and 0.3% in read and HMI speech, respectively). To gain a better understanding of the WER of the different age groups, Figure 1 illustrates the per-speaker WER distribution of these groups on read speech. It shows that speaker-level WERs in DOA are more variable than in DT. Manual checking of some DOA speakers with high per-speaker WERs (>50%) suggested that their speech was not that well articulated (possibly due to their old age (>75)). Figure 1 also shows that the per-speaker WER distributions of the two non-native groups do not differ much.
Comparing the native (D-) with the non-native (NN-) groups shows that speech of native speakers is recognised much better than that of non-native speakers of Dutch. The worst recognised native speech (DC, i.e., Dutch children) has a read speech WER that is around 20% absolute better than that of the best non-native age group (NNC, i.e., the non-native children).
Table 4 provides a closer look at the WERs for the different Dutch proficiency levels (CEF) of the non-native adult speakers (NNA), separated by gender.
Perhaps surprisingly, we do not see a reduction in WER with an increase in the CEF level.
Finally, Tables 3 and 4 show that for each group, the WER of HMI speech is consistently worse than that of read speech. Overall, the absolute WER difference between read speech and HMI speech is around 13.7% for native speakers and around 5.5% for non-native speakers of Dutch.
Table 5 shows the WERs with regard to regional accents for the four large regions in the Netherlands (W, T, N and S) and Flanders (FL) per age group. The average WER results of read speech (Table 5a) and HMI speech (Table 5b) in each age group are shown in the gray rows, and the WER results broken down by gender (female, male) are shown in the white rows. Table 5 shows that speech spoken by people from Flanders (FL) achieved the worst WER performance in all age groups except for the older adults (DOA/FOA), where region S was the worst. This holds for both read and HMI speech. For read speech, among the four regions in the Netherlands, no region was consistently recognised worse than others. For HMI speech, region S in general was the worst recognised.
Looking at the Dutch age groups, Table 5a shows that for the children and teenagers (DC and DT), the read speech WER varies much less across the four regions (5% for DC, 6% for DT) than for the older Dutch speakers (DOA, 19%). The same observation is made for HMI speech in Table 5b (2% for DC, 8% for DT, 18% for DOA). This suggests that older speakers in the Netherlands typically have stronger regional accents than children and teenagers. Specifically, in DOA, region S has the highest WER, and within this group, male speech is recognised worse than female speech by an absolute WER difference of 8% for read speech and 3.3% for HMI speech.
3.3 Error analyses
We analyse the sources of recognition errors by the ASR through a systematic analysis of the phoneme errors, and a qualitative analysis of the dataset, supplemented by post-hoc quantitative results when appropriate. We first report the general findings and then four main variables are assessed: non-native accents, age groups, regional accents and gender.
In general, the PER of // (IPA symbol, similarly hereinafter), /S/ and /Z/ seems to be consistently high for all age groups; however, these phonemes occur rarely (<50) in most groups. To account for this, we only report the top-5 phonemes for which there are at least 50 occurrences.
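The selection rule described above (rank phonemes by PER, but only among those with at least 50 occurrences) can be sketched as follows. The function name `top_k_phonemes` and all counts are hypothetical and serve only to illustrate the filtering step.

```python
from typing import Dict, List, Tuple


def top_k_phonemes(totals: Dict[str, int],
                   errors: Dict[str, int],
                   k: int = 5,
                   min_count: int = 50) -> List[Tuple[str, float]]:
    """Return the k phonemes with the highest PER among sufficiently frequent ones."""
    rates = [
        (phoneme, errors.get(phoneme, 0) / count)
        for phoneme, count in totals.items()
        if count >= min_count  # drop rare phonemes with unreliable PER estimates
    ]
    rates.sort(key=lambda item: item[1], reverse=True)
    return rates[:k]


# Hypothetical occurrence and error counts for one speaker group.
totals = {"S": 12, "h": 400, "Y": 250, "@": 3000, "z": 600}
errors = {"S": 9, "h": 160, "Y": 90, "@": 600, "z": 150}

ranking = top_k_phonemes(totals, errors, k=3)
print(ranking)
```

In this toy example, /S/ has the highest raw PER (9 of 12) but is excluded because it occurs fewer than 50 times, so the ranking starts with /h/.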
First, regarding non-native vs. native accents, we find that the top-5 misrecognised phonemes in group NNA for A1 are /œy/, /Y/, /y/, /ø:/, /h/; for A2: /œy/, /Z/, /Y/, /y/, /j/; for B1: /øey/, /’ø:/, /Y/, /h/, /j/. For native speakers in DC, DT and DOA, these phonemes are /S/, /h/, /Y/, /z/, /@/. The results show that /øey/ is difficult for non-native (adult) speakers, while not being difficult for native speakers. This sound is known to be difficult to acquire for many second-language learners of Dutch.
Looking at the PER differences for the different age groups of native Dutch speakers, we find that the top-5 misrecognised phonemes for DC are: /S/, /Y/, /h/, /@/, /j/; for DT: /S/, /Y/, /ø:/, /h/, /@/; for DOA: //, /h/, /x/, /E/, /@/. Based on this, we hypothesise that native children and teenagers produce sibilant pronunciations that confuse the ASR system, which is confirmed by certain substitution errors (sullen vs zullen, zal vs saul).
For the regional variants of Dutch, the breakdown of the PER by regions W, T, N, S and FL is as follows: W: /S/, /h/, /@/, /Y/, /E/; T: /S/, /h/, /z/, /@/, /Y/; N: /S/, /h/, /Y/, /O/, /z/; S: /h/, /x/, /S/, /E/, /@/; FL: /Z/, /S/, /y/, /œy/, /Au/. Based on this, the finding that older speakers in region S have the highest WER (see Table 5) can be explained by the presence of /x/ and /h/. It is known that Dutch in the south of NL realises /x/ and /h/ differently from “standard” Dutch and is closer to the Dutch spoken in FL (soft g vs hard g), and that the speech of the elderly is more affected by regional accent than that of children and teenagers. Interestingly, region FL has more problems with sibilants.
Finally, regarding gender, the top-5 misrecognised phonemes for male speakers in groups DC, DT, DOA, NNC and NNA are: /S/, /Z/, //, /ø:/, /Au/. For female speakers, the top-5 phonemes are: /S/, /øey/, /Y/, /h/, /ø:/. We see that /Z/, // and /Au/ are the main sources of errors for male speech. Overall, male speech seems to achieve consistently higher PERs than female speech irrespective of the phoneme identity.
4 General discussion and conclusion
In this paper, we have shown that an ASR system can perpetuate the existing bias in society. We have quantified bias in a state-of-the-art standard Dutch ASR system with regards to diversity in gender, age, regional accents and non-native accents.
We found that female speech was better recognised than male speech. This result adds to a growing set of findings that male and female speech are not recognised equally well [koenecke2020racial, tatman2017gender, shariah2013effects, adda2005speech, goldwater2010words]. Teenagers’ speech is the best recognised, followed by that of senior people (over 65 y/o), while children’s speech is the worst recognised. The problems of the ASR with recognising children’s speech are not surprising: the large difference between children’s speech and adults’ speech [qian2017bidirectional] leads to a large mismatch of the children’s speech with the AM. The worse recognition of the older adults’ speech, especially of those over 75 y/o, is possibly due to less clear articulation. Possibly, the speech of the teenagers resembles the speech of the adult speakers in CGN the most.
Speech of native Dutch speakers is much better recognised than that of non-native speakers, irrespective of age. This is in line with the qualitative findings reported in [wu2020see, palanica2019you], and is not surprising: Non-native speakers typically have an accent, meaning that the match with the AM is worse than that of native speakers. Interestingly, for the non-native speakers, no correlation was found between Dutch proficiency level and the ASR performance. A reason might be that at the A1, A2 and B1 levels the focus is primarily on vocabulary and grammar rather than pronunciation, so they are all considered relatively low proficiency levels. Another reason might be that the proficiency level is not a good proxy for the strength of the accent.
For native Dutch speakers, the speech from Flanders (FL) obtained the worst ASR performance, worse than that of all the regions in the Netherlands. The much higher WER results for the FL speech are well explained by the large accent difference between Dutch spoken in the Netherlands (used to train the ASR system) and that spoken in Flanders. For regions within the Netherlands, we found that regional accents have a stronger influence on the recognition of senior people’s speech than on that of children and teenagers.
We found HMI speech to be consistently worse recognised than read speech. This confirms that the size of the bias is influenced by the speaking style of the person.
The above results show that bias in the training data plays a critical role in the performance difference of an ASR system for a diverse range of speech.
In this paper, we have focused on bias that can be quantified. However, owing to the foundational nature of bias, it is impossible to remove bias that creeps into datasets [kudina2021co]. This becomes a priority in responsible ASR system development: framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice. A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset [koenecke2020racial, caliskan2017semantics]. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, etc. provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR.
This project has received funding from the EU’s H2020 research and innovation programme under MSC grant agreement No 766287. The Department of Head and Neck Oncology and Surgery of the Netherlands Cancer Institute receives a research grant from Atos Medical (Hörby, Sweden), which contributes to the existing infrastructure for quality of life research.