
Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

by   Chunxi Liu, et al.

It is well known that many machine learning systems demonstrate bias towards specific groups of individuals. This problem has been studied extensively in the Facial Recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial Speech Recognition results on "Casual Conversations" – a publicly released 846 hour corpus designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of metadata, including age, gender, and skin tone. The entire corpus has been manually transcribed, allowing for detailed ASR evaluations across these metadata. Multiple ASR models are evaluated, including models trained on LibriSpeech, 14,000 hour transcribed, and over 2 million hour untranscribed social media videos. Significant differences in word error rate across gender and skin tone are observed at times for all models. We are releasing human transcripts from the Casual Conversations dataset to encourage the community to develop a variety of techniques to reduce these statistical biases.





1 Introduction

The problem of algorithmic bias in machine learning (ML) systems is generally well known and well studied. When performance on certain groups of individuals is specifically impacted, a perception of unfairness can result. In the context of ML, the term “Fairness” often refers to various attempts at correction of such errors or statistical biases. There is particular emphasis on advancing Fairness with respect to variables that correspond to societally sensitive characteristics or traditionally marginalized communities, such as gender, ethnicity, sexual orientation, disability, etc.

Some particularly publicized examples of bias in machine learning systems have occurred in the area of Facial Recognition, when the performance of multiple commercially available systems was demonstrated to be much poorer on individuals with certain skin tones, and also to vary as a function of gender [3]. It therefore seems reasonable to assume that if Facial Recognition systems are prone to unfairly distributed errors, then other systems that involve human-machine interactions might also be prone to similar shortcomings. In particular, in the context of Automatic Speech Recognition (ASR), several studies analyzing gender, race and dialect bias have been reported in articles appearing in various media outlets [34, 2, 1, 15, 33, 29]. Summarizing the findings, there are ASR performance differences across gender, race, and dialect which amplify the historical biases previously known to exist. This creates an impetus for the ML community to devise remedies, e.g. [34] explicitly states that “Everyone deserves their voice to be heard”.

The focus of this paper is to announce the release of publicly available data that can be specifically used to investigate some of these issues in the context of ASR, and present preliminary results investigating such issues. Specifically, we are augmenting the already released “Casual Conversations” dataset [14] with accurate manual transcriptions that can be used to evaluate existing models with respect to various potential biasing factors and also train new models directly from the dataset itself with respect to these factors.

“Casual Conversations” is composed of over 45,000 videos of approximately 1 minute duration each, collected from a set of 3,011 participants and comprising approximately 846 hours of data. The videos feature paid actors who agreed to participate in the project. Participants casually speak about various topics and sometimes depict a range of facial expressions. They explicitly provided age and gender labels themselves. Also, a group of trained annotators labeled the participants’ apparent skin tone using the Fitzpatrick scale. The videos were recorded in the U.S. with a diverse set of adults in various age, gender and apparent skin tone groups. These videos were originally intended to be used for assessing the performance of already trained models in computer vision and audio applications for the purposes permitted in the data user agreement. The agreement prevents a user from developing models that predict the values of these labels, but one may measure the performance of an algorithm as a function of these labels.

The original data release of Casual Conversations did not contain any transcriptions, and thus could not be easily used to examine Fairness issues in the context of speech recognition. As such, we have had the data manually transcribed to permit such an evaluation to take place, and also provide the transcriptions to the broader speech community. We then performed a preliminary evaluation of speech recognition performance across different speech recognition models as a function of gender, age and skin tone labels. While the phenotypical property of skin tone may not be as operative a variable in measuring comparative error for ASR as it is for computer vision, we opted to leverage the existing labels from the Casual Conversations dataset due to the likelihood skin tone may correlate with other characteristics that could be drivers of disparate ASR performance. The rest of the paper describes the transcription process followed, the models employed, presents recognition results, and draws some preliminary conclusions about bias and fairness in the context of the Casual Conversations data.

2 Previous work in Fairness in Speech Recognition

As machine learning systems have been utilized more frequently in decision making, biases or unfairness in the outcomes of these systems have recently become an active research area [6, 5, 4]. ASR, which is one of the application areas of machine learning, is also subject to these fairness concerns. For instance, it has been shown that there is a performance gap between male and female speakers [32] as well as between black and white speakers [18]. The reason for unfairness in many of these cases is attributed to representation in the training data, i.e. having limited amounts of training data for certain groups of subjects, e.g. fewer female speakers than male speakers in a speech corpus [9]. A recent study [8] also confirmed the hypothesis that ASR systems can perpetuate societal biases. In [10], a commonly used benchmark dataset, namely LibriSpeech [27], is subsampled to investigate the impact of the amount of training data on ASR performance. The conclusion of [10] is that individual variability, irrespective of gender, affects the final performance more significantly, hence intra-variability within gender groups is also found to be an important factor to investigate. These studies mainly focused on analyzing the existing biases in ASR outputs. One approach to reducing ASR performance gaps between different sensitive groups is provided in [30], which builds on the counterfactual fairness idea proposed in [20].

Since most ASR datasets do not come with speaker attribute labels such as age, gender, dialect, etc, most of the previous studies evaluated fairness of the ASR systems on a limited number of corpora. There are recent efforts on curating data for under-represented groups such as African-Americans in speech studies [16] or for various demographic groups [24]. The “Casual Conversations” dataset provides data from a larger set of subgroups than [16], and also contains conversational speech rather than read speech as in [24].

3 Data Transcription

In order to produce transcriptions usable by the community, verbatim transcriptions were produced with various mark-ups. Specifically, hesitations and disfluencies (“uh”, “um”…) and repeated words were kept as spoken. Colloquial wordings (“gonna”, “sorta”) were maintained. Non-speech sounds, like music and laughter, were tagged. Numbers were spelled out as words as spoken, as were emails and URLs. Common named entities made up of acronyms (e.g., NASA, USA) were left spelled as colloquially written, though.

In terms of metadata related to the transcription, the text was punctuated. Long pauses were all indicated with the tag no-speech. Speaker turns associated with primary speakers (the interviewee) and secondary speakers (the interviewer, and occasional third parties who were not the official interviewee) were marked and time-stamped. An example of a transcript produced following the above process is:

[0.000] [secondary_0.240_secondary] would you rather work from home, or in an office and why? [/secondary_2.903_secondary/] <no-speech> [3.890] [primary_4.183_primary] um <no-speech> [7.345] I prefer a mix of both, because <no-speech> [11.170] I like to have the structure of the office, <no-speech> just to <colloquial>kinda</colloquial> create a routine, but I do prefer some days [18.010] being able to work from home, because it’s just a <no-speech> [21.010] more convenient option, sometimes, when life gets busy. [/primary_23.512_primary/] [secondary_23.655_secondary] mhm. <no-speech> <spk_noise> alright. [/secondary_25.515_secondary/] [25.560]

In total, 846.1 hours of manual transcriptions have been produced. Many of the recordings contain both speech from the primary speaker and the interviewer, and sometimes a third speaker as well. Since the metadata annotations refer to each primary speaker, we first remove videos with no primary speaker speaking. Second, we convert and segment each video into audio files via manual time stamps, such that each resulting segment only contains the primary speaker’s speech. This leaves in total 572.6 hours of audio from the original 846.1 hours. After segmentation, the longest utterance is about 224 seconds.
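The segmentation step described above can be illustrated with a small parser for the speaker-turn tags. This is only a sketch under assumptions: it relies on the tag layout visible in the example transcript, the function name `primary_segments` is hypothetical, the sample string is an abbreviated version of the example, and the authors' actual tooling is not described in the paper.

```python
import re

# One primary-speaker turn, following the tag layout of the example transcript:
# [primary_<start>_primary] ... [/primary_<end>_primary/]
PRIMARY_TURN = re.compile(
    r"\[primary_(\d+\.\d+)_primary\](.*?)\[/primary_(\d+\.\d+)_primary/\]",
    re.DOTALL,
)
# Inline time stamps such as [7.345]
TIME_STAMP = re.compile(r"\[\d+\.\d+\]")


def primary_segments(transcript):
    """Return (start_sec, end_sec, text) for each primary-speaker turn."""
    segments = []
    for start, body, end in PRIMARY_TURN.findall(transcript):
        # Drop inline time stamps and collapse whitespace.
        text = " ".join(TIME_STAMP.sub(" ", body).split())
        segments.append((float(start), float(end), text))
    return segments


# Abbreviated version of the example transcript (assumption: same tag layout).
example = (
    "[0.000] [secondary_0.240_secondary] would you rather work from home? "
    "[/secondary_2.903_secondary/] [3.890] [primary_4.183_primary] um [7.345] "
    "I prefer a mix of both [/primary_23.512_primary/] [25.560]"
)
for start, end, text in primary_segments(example):
    print(f"{start:.3f}-{end:.3f}s: {text}")
```

The returned start and end times are what one would pass to an audio cutting tool so that each resulting segment contains only the primary speaker's turn.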

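| model | overall WER | gender: female | gender: male | gender: other | gender rel. gap | age: 18-30 | age: 31-45 | age: 46-85 | skin type: 1 | skin type: 2 | skin type: 3 | skin type: 4 | skin type: 5 | skin type: 6 | skin type rel. gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LibriSpeech | 34.3 | 31.8 | 37.1 | 60.0 | 16% | 36.1 | 35.1 | 30.9 | 27.5 | 30.9 | 34.6 | 34.0 | 37.5 | 37.2 | 37% |
| Video, supervised | 13.9 | 11.9 | 16.3 | 31.8 | 37% | 13.9 | 13.9 | 13.8 | 11.2 | 13.1 | 14.0 | 13.7 | 15.0 | 14.8 | 34% |
| Video, semi-supervised | 9.8 | 8.5 | 11.3 | 24.0 | 32% | 9.7 | 9.8 | 9.8 | 7.8 | 9.3 | 10.4 | 9.9 | 10.1 | 10.0 | 33% |
| Video, teacher | 8.6 | 7.5 | 9.9 | 21.6 | 33% | 8.4 | 8.6 | 8.6 | 6.9 | 8.3 | 9.0 | 8.4 | 8.9 | 8.8 | 30% |
| # of hours | 573 | 312 | 249 | 0.2 | - | 198 | 188 | 174 | 22 | 160 | 135 | 49 | 89 | 118 | |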

Table 1: WER results on the complete Casual Conversations dataset. Rel. gap either refers to the relative WER difference between female and male, or refers to the largest relative WER difference between all pairwise skin types, with the corresponding pair indicated in bold.
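As a worked example of the "rel. gap" columns, the sketch below recomputes both gaps for the "Video, supervised" row, reading the relative gap as the larger WER divided by the smaller WER, minus one. The helper name `relative_gap` is hypothetical, and because the table entries are already rounded, recomputing from them can differ from the published percentage by a point in other rows.

```python
def relative_gap(wers):
    """Largest relative WER difference among subgroups: max / min - 1."""
    wers = list(wers)
    return max(wers) / min(wers) - 1.0


# Values copied from the "Video, supervised" row of Table 1.
gender = {"female": 11.9, "male": 16.3}
skin_type = {1: 11.2, 2: 13.1, 3: 14.0, 4: 13.7, 5: 15.0, 6: 14.8}

print(f"gender rel. gap: {relative_gap(gender.values()):.0%}")        # 37%
print(f"skin type rel. gap: {relative_gap(skin_type.values()):.0%}")  # 34%
```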

Figure 1: Confidence intervals of each test statistic (Eq. 1) for pairwise skin types, shown in two panels: the LibriSpeech model and the semi-supervised video model. A red line indicates that the WER difference between the two subgroups is statistically significant; a black line indicates that it is not.

4 Speech Recognition Models

We built a series of recurrent neural network transducer (RNN-T) [12] ASR models with respective sets of training data and configurations:

  1. LibriSpeech model: a full-context conformer transducer model [13, 35] trained on LibriSpeech. RNN-T output labels consist of a blank label and 1023 wordpieces generated by the unigram language model algorithm from the SentencePiece toolkit [19]. Four 80-dimensional log Mel filter bank features are concatenated with stride 4 to form a 320-dimensional vector, which a linear layer maps to a 512-dimensional encoder input. The encoder has 17 conformer layers with embedding dimension 512, 8 attention heads, feed-forward network (FFN) size 2048, and convolution kernel size 15. Following [21], we remove the original relative positional encoding, and reuse the existing convolution module for positional encoding by swapping the order of the convolution and multi-head self-attention modules. The prediction network is a 2-layer LSTM of 512 hidden units and dropout …. The joint network has 1024 hidden units, and a softmax layer of 1024 units for blank and wordpieces. The word error rates (WERs) on test-clean and test-other are … and …, respectively.

  2. Video model, supervised: a streaming emformer model [31] trained on 14K hours of manually transcribed social media videos. The video dataset is a collection of public and de-identified English videos, and contains a diverse range of speakers, accents, topics, and acoustic conditions. The input feature stride is 6. The encoder network has 24 simplified emformer layers without memory bank, each with embedding dimension 512, 8 attention heads, and two macaron-like feed-forward network (FFN) modules [13] with FFN size 2048. The prediction and joint networks are the same as above, except that 4095 wordpieces are used instead. The model has about 140M parameters.

  3. Video model, semi-supervised: a semi-supervised streaming emformer model trained on over 2 million hours of social media videos. 14K hours are manually transcribed as above, and the rest are unlabeled data decoded by progressively larger teacher models.

  4. Video model, semi-supervised teacher: a final teacher model of one billion parameters, trained on over 2 million hours of social media videos.
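The input frame stacking used by the LibriSpeech model (four 80-dimensional frames concatenated with stride 4 into one 320-dimensional vector) can be sketched in plain Python as follows. The function name `stack_frames` is hypothetical, and dropping trailing frames that do not fill a complete group is an assumption, since the paper does not say how the remainder is handled.

```python
def stack_frames(frames, n_stack=4, stride=4):
    """Concatenate n_stack consecutive frames, advancing by stride frames.

    Trailing frames that do not fill a complete group are dropped (assumption).
    """
    stacked = []
    for start in range(0, len(frames) - n_stack + 1, stride):
        merged = []
        for frame in frames[start:start + n_stack]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Ten dummy 80-dimensional frames standing in for log Mel filter bank features.
frames = [[float(t)] * 80 for t in range(10)]
stacked = stack_frames(frames)
print(len(stacked), len(stacked[0]))  # 2 320
```

With stride 4 the encoder sees one quarter as many time steps as the raw feature sequence; the video models use a stride of 6 instead.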

For all ASR training, we applied SpecAugment [28], alignment restricted training [23] to improve training throughput, and auxiliary chenone prediction criteria to improve model convergence and performance [22]. For all neural network implementations, we used an in-house extension of the PyTorch-based fairseq [25] toolkit. All experiments used the Adam optimizer [17], a tri-stage [28] learning rate schedule with peak learning rate …, multi-GPU training, and the mixed precision training supported in fairseq.
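For readers unfamiliar with the tri-stage schedule, the following is a generic sketch of its usual shape: a linear warm-up to the peak learning rate, a constant hold, then an exponential decay. It is not the fairseq implementation, and every number here (peak rate, step counts, start and end scales) is a hypothetical placeholder, since the values used in the paper are not given.

```python
import math


def tri_stage_lr(step, peak_lr, warmup_steps, hold_steps, decay_steps,
                 init_scale=0.01, final_scale=0.01):
    """Learning rate at a given update step under a three-stage schedule."""
    init_lr = init_scale * peak_lr
    final_lr = final_scale * peak_lr
    if step < warmup_steps:
        # Stage 1: linear warm-up from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    step -= warmup_steps
    if step < hold_steps:
        # Stage 2: hold at the peak learning rate.
        return peak_lr
    step -= hold_steps
    if step <= decay_steps:
        # Stage 3: exponential decay from peak_lr down to final_lr.
        decay = -math.log(final_scale) / decay_steps
        return peak_lr * math.exp(-decay * step)
    return final_lr


# Hypothetical settings, for illustration only.
for s in (0, 100, 300, 600):
    print(s, tri_stage_lr(s, peak_lr=1e-3, warmup_steps=100,
                          hold_steps=200, decay_steps=300))
```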

5 Bootstrap Confidence Interval Method

In most speech recognition evaluations, the test data remains fixed and the nature of the speech recognition model used for decoding is the critical variable. Typical tests for statistical significance utilized for such comparisons are described in [26, 11]. However, in our case, the test data is different for each primary speaker. This implies that a different test of statistical significance is needed. We decided to conduct significance tests using the bootstrap method [7], which is designed to compare data drawn from disparate populations and does not rely on the assumption that the underlying data have a normal distribution.

Assume we would like to conduct statistical testing to determine whether a significant difference exists between the WERs of two subgroups $g_1$ and $g_2$, where $g_1$ and $g_2$ can denote female and male, or any two skin types. WERs for subgroups $g_1$ and $g_2$ are denoted as $\mathrm{WER}_{g_1}$ and $\mathrm{WER}_{g_2}$. Then we define the test statistic $\delta$ as the empirical WER ratio minus one:

$$\delta = \frac{\mathrm{WER}_{g_1}}{\mathrm{WER}_{g_2}} - 1 \qquad (1)$$

In the bootstrap method [7], each bootstrap sample is generated by the following process:

  1. subjects are repeatedly sampled with replacement from the original subgroups $g_1$ and $g_2$, and the number of samples is equal to the number of subjects in the respective original subgroup. Then $\mathrm{WER}_{g_1}$ and $\mathrm{WER}_{g_2}$ are calculated on the samples.

  2. a parameter estimate for $\delta$ is then calculated by Eq. 1.

Thus we generate $B$ (e.g., $B = 1{,}000$) random bootstrap samples, and all bootstrap parameter estimates are ordered from the lowest to highest. Then the percentile bootstrap confidence interval (CI) of the test statistic, denoted as $[\delta_{\mathrm{lo}}, \delta_{\mathrm{hi}}]$, is obtained such that

$$P(\delta_{\mathrm{lo}} \le \delta \le \delta_{\mathrm{hi}}) = 1 - \alpha$$

e.g., a 95% percentile bootstrap CI via 1,000 bootstrap samples is the interval between the 2.5th quantile and 97.5th quantile of 1,000 bootstrap parameter estimates. If the CI does not cover the point 0, we claim the WER difference between two subgroups is statistically significant. We apply the confidence interval method instead of hypothesis tests – not only for significance tests, but also for quantifying the uncertainty of the test statistic.
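The procedure above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: it assumes each subject is summarized by a pair (number of word errors, number of reference words), picks the CI endpoints as simple order statistics of the sorted estimates, and uses made-up per-subject counts. The function names are hypothetical.

```python
import random


def wer_ratio_minus_one(group_a, group_b):
    """Test statistic: WER of group_a divided by WER of group_b, minus one.

    Each group is a list of (num_errors, num_reference_words), one per subject.
    """
    wer_a = sum(e for e, _ in group_a) / sum(n for _, n in group_a)
    wer_b = sum(e for e, _ in group_b) / sum(n for _, n in group_b)
    return wer_a / wer_b - 1.0


def bootstrap_ci(group_a, group_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling subjects with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample as many subjects as each original subgroup contains.
        sample_a = rng.choices(group_a, k=len(group_a))
        sample_b = rng.choices(group_b, k=len(group_b))
        estimates.append(wer_ratio_minus_one(sample_a, sample_b))
    estimates.sort()
    lo = estimates[round(n_boot * alpha / 2)]
    hi = estimates[round(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical per-subject counts, for illustration only.
group_a = [(18 + i % 5, 100) for i in range(50)]
group_b = [(8 + i % 5, 100) for i in range(50)]
lo, hi = bootstrap_ci(group_a, group_b)
significant = not (lo <= 0.0 <= hi)  # CI excluding 0 means a significant gap
print(f"CI = [{lo:.3f}, {hi:.3f}], significant = {significant}")
```

Resampling whole subjects rather than individual utterances keeps all utterances of a speaker together, which matches step 1 of the procedure.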

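| model | overall WER | gender: female | gender: male | gender rel. gap | skin type: 1 | skin type: 2 | skin type: 3 | skin type: 4 | skin type: 5 | skin type: 6 | skin type rel. gap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Video, supervised | 11.8 | 9.8 | 14.2 | 45% | 9.7 | 11.7 | 12.4 | 12.6 | 11.0 | 11.8 | 30% |
| Video, supervised + fine-tuning | 8.1 | 6.8 | 9.6 | 42% | 6.7 | 8.1 | 8.9 | 8.5 | 7.3 | 7.8 | 33% |
| Video, semi-supervised | 8.4 | 7.1 | 9.8 | 37% | 6.9 | 8.5 | 9.3 | 8.8 | 7.4 | 8.0 | 35% |
| Video, semi-supervised + fine-tuning | 7.2 | 6.0 | 8.5 | 42% | 6.0 | 7.2 | 8.0 | 7.6 | 6.4 | 6.9 | 34% |
| # of hours | 295 | 161 | 128 | - | 11 | 84 | 69 | 25 | 45 | 60 | |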

Table 2: WER results on a test split, after model fine-tuning on a train split. There is no subject of gender “other” present in this test subset.

6 Results

The paper that introduced the Casual Conversations data reported on a number of aspects of the metadata: gender, age, skin tone, and lighting conditions [14]. We report on the same metadata categories except for lighting condition, which we did not expect to impact speech recognition performance. While skin tone, a purely visual characteristic, is unlikely to directly impact speech recognition performance (as it does for computer vision models), we expect that it correlates with other characteristics that may have such an impact, so we opt to report results along that dimension.

6.1 Evaluating off-the-shelf ASR models

We first decode the complete 573 hour Casual Conversations audio with each RNN-T model above (Section 4), and report overall WERs and WERs on each subgroup in Table 1. There appears to be a large performance gap between female and male speakers, with every model performing better on female speakers, especially the video models. We perform the significance tests described in Section 5, and the WER differences are significant for all models. We did not observe much difference in WER by age, except for slightly better performance on the oldest age category for the LibriSpeech model.
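For reference, the subgroup WERs in this kind of evaluation are typically computed as the total word-level edit distance divided by the total number of reference words. The sketch below shows that computation; it assumes plain whitespace tokenization, and the text normalization actually applied by the authors (handling of tags, punctuation, casing) is not specified in the paper. The function names and sample sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def subgroup_wer(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


# Made-up reference/hypothesis pairs, for illustration only.
pairs = [
    ("i prefer a mix of both", "i prefer mix of both"),  # one deletion
    ("um alright", "um all right"),                      # one substitution, one insertion
]
print(f"WER = {subgroup_wer(pairs):.3f}")  # 3 errors / 8 words = 0.375
```

Pooling errors and words across a subgroup before dividing weights long utterances more than short ones, which is the usual convention for corpus-level WER.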

For skin type, we observed noticeable WER differences between various pairs, most frequently for the LibriSpeech model. We then perform significance tests on the pairwise skin type WERs, and compute the confidence interval (CI) of the test statistic (Eq. 1) for each pair of subgroups (Section 5). As shown in Figure 1, if the CI does not include the point 0, we conclude that the WER difference between a skin type pair is significant. A narrower CI in Figure 1 indicates a smaller variance in the bootstrap parameter estimates, i.e., WER ratios. The further the complete CI lies from the point 0, the more significant the suggested performance difference.

We find that the LibriSpeech model has the most occurrences of significant WER differences between subgroups, and video models have the fewest. We believe that, for video models, the 2 million hour pseudo-label training data in addition to the transcribed social media videos are more heterogeneous datasets than LibriSpeech, which contains read speech from audiobooks. Therefore, the training data diversity in social media videos may have helped reduce WER differences between subgroups. Note that, although the video teacher model - trained on the same amounts of data as the video student model - provides better overall accuracies, it does not provide more evenly distributed error rates. Although we do not suggest that skin type has a direct effect on acoustic properties, it may be a proxy for other unobserved characteristics that result in disparate ASR performance.

6.2 Evaluating the fine-tuned ASR models with in-domain data

We further use the train/valid/test data split provided in the original dataset release [14], to investigate if the unevenly distributed WERs can be reduced by fine-tuning pretrained ASR models with in-domain data (footnote 1). The training data split is 248 hours in total, and the data size of skin type 1 to 6 is 10h, 67h, 59h, 22h, 39h, 52h, respectively. The results and test data size are shown in Table 2.

Footnote 1: Given the amount of transcribed speech available, fine-tuning an existing high-performing ASR model provides much better overall accuracies than training a model with Casual Conversations data only. In addition, in line with the data use agreement in [14], we only use metadata categories (gender, age and skin tone) for ASR model evaluation, and we do not use metadata for any model training purposes.

For both the supervised and semi-supervised video models, we observe large WER reductions after 2,000 fine-tuning updates. However, in either case, the relative WER differences between subgroups are not reduced, which suggests the model’s unbalanced accuracies may not be simply resolved by such in-domain fine-tuning processes. Since the relative amounts of data across our metadata categories are unknown for the video dataset, we cannot say that the WER differences between subgroups result from inadequate representation of certain categories of metadata in the broader training data. However, given (i) the large variability in the 2 million hour video collection, and (ii) the available training data of each skin tone in the train split (used during fine-tuning), we still observe unfairly distributed errors across metadata. This suggests there are more fundamental underlying variables associated with the speech styles in the Casual Conversations corpus that deserve additional investigation.

7 Conclusion

We have leveraged the existing metadata categories from the Casual Conversations dataset, and then performed an ASR performance evaluation across different models as a function of gender, age, and skin tone categories. Firstly, large accuracy gaps are consistently observed across gender, while no clear bias is found towards any age group.

Secondly, we acknowledge that skin tone is a visual indicator which is suboptimal in measuring an auditory phenomenon. However, significant WER differences are observed in various comparisons across skin tone categories, suggesting skin tone may correlate with other characteristics that could drive performance differences between subgroups. The comparative error rates of various ASR models also indicate that ASR models trained on sizable and potentially more diverse training data - i.e., data that more likely contains a diverse range of attributes or subgroup representations - can provide more evenly distributed accuracies, but do not reduce the differences to zero.

The transcriptions will all be released to the community by the time of paper dissemination; we hope these interesting results inspire the community to continue these investigations to achieve a deeper understanding of the underlying variables affecting speech recognition performance, eventually permitting us to build robust speech systems without having to collect massive amounts of data from each and every target population.


  • [1] AI voice recognition racially biased against black voices. Note: 2021-09-10 Cited by: §1.
  • [2] Bridging the gender gap in AI. Note: 2021-09-10 Cited by: §1.
  • [3] J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §1.
  • [4] A. Chouldechova and M. G’Sell (2017) Fairer and more accurate, but for whom?. arXiv preprint arXiv:1707.00046. Cited by: §2.
  • [5] A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: §2.
  • [6] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226. Cited by: §2.
  • [7] B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC press. Cited by: §5, §5.
  • [8] S. Feng, O. Kudina, B. M. Halpern, and O. Scharenborg (2021) Quantifying bias in automatic speech recognition. arXiv preprint arXiv:2103.15122. Cited by: §2.
  • [9] M. Garnerin, S. Rossato, and L. Besacier (2019) Gender representation in French Broadcast Corpora and its impact on ASR performance. In Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery, pp. 3–9. Cited by: §2.
  • [10] M. Garnerin, S. Rossato, and L. Besacier (2021) Investigating the impact of gender representation in asr training data: a case study on librispeech. GeBNLP 2021, pp. 86. Cited by: §2.
  • [11] L. Gillick and S.J. Cox (1989) Some statistical issues in the comparison of speech recognition algorithms. In Proc. ICASSP, Cited by: §5.
  • [12] A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §4.
  • [13] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. In Proc. Interspeech, Cited by: item 1, item 2.
  • [14] C. Hazirbas, J. Bitton, B. Dolhansky, J. Pan, A. Gordo, and C. C. Ferrer (2021-06) Casual conversations: a dataset for measuring fairness in AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2289–2293. Cited by: §1, §6.2, §6, footnote 1.
  • [15] How to overcome cultural bias in voice AI design. Note: 2021-09-10 Cited by: §1.
  • [16] T. Kendall and C. Farrington (2020) The corpus of regional African American Language. The Online Resources for African American Language Project Version 2020.05. External Links: Link Cited by: §2.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. ICLR, Cited by: §4.
  • [18] A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel (2020) Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences 117 (14), pp. 7684–7689. Cited by: §2.
  • [19] T. Kudo and J. Richardson (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: item 1.
  • [20] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva (2017) Counterfactual fairness. arXiv preprint arXiv:1703.06856. Cited by: §2.
  • [21] B. Li, A. Gulati, J. Yu, T. N. Sainath, C. Chiu, A. Narayanan, S. Chang, R. Pang, Y. He, J. Qin, et al. (2021) A better and faster end-to-end model for streaming ASR. In Proc. ICASSP, Cited by: item 1.
  • [22] C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig (2021) Improving rnn transducer based asr with auxiliary tasks. In Proc. SLT, Cited by: §4.
  • [23] J. Mahadeokar, Y. Shangguan, D. Le, G. Keren, H. Su, T. Le, C. Yeh, C. Fuegen, and M. L. Seltzer (2021) Alignment restricted streaming recurrent neural network transducer. In Proc. SLT, Cited by: §4.
  • [24] J. Meyer, L. Rauchenstein, J. D. Eisenberg, and N. Howell (2020) Artie bias corpus: an open dataset for detecting demographic bias in speech applications. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 6462–6468. Cited by: §2.
  • [25] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §4.
  • [26] D.S. Pallet, W.M. Fisher, and J.G. Fiscus (1990) Tools for the analysis of benchmark speech recognition tests. In Proc. ICASSP, Cited by: §5.
  • [27] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) LibriSpeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, Cited by: §2.
  • [28] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. Proc. Interspeech. Cited by: §4.
  • [29] Racist algorithms. Note: 2021-09-10 Cited by: §1.
  • [30] L. Sari, M. Hasegawa-Johnson, and C. Yoo (2021 (accepted)) Counterfactually fair automatic speech recognition. IEEE Transactions on Audio, Speech & Language Processing. Cited by: §2.
  • [31] Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer (2021) Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In Proc. ICASSP, Cited by: item 2.
  • [32] R. Tatman (2017) Gender and dialect bias in YouTube’s automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 53–59. Cited by: §2.
  • [33] Understanding gender and racial bias in AI. Note: 2021-09-10 Cited by: §1.
  • [34] Voice recognition still has significant race and gender biases. Note: 2021-09-10 Cited by: §1.
  • [35] C. Yeh, Y. Wang, Y. Shi, C. Wu, F. Zhang, J. Chan, and M. L. Seltzer (2021) Streaming attention-based models with augmented memory for end-to-end speech recognition. In Proc. SLT, Cited by: item 1.