The human voice contains an uncanny amount of personal information. Decades of research have correlated behavioural, demographic, physiological, sociological and many other individual characteristics to a person’s voice (Singh2019Profiling). For example, voice quality can reveal sensitive attributes of a speaker such as their age, anatomy, health status, medical conditions, identity, intoxication, emotional state, stress, and truthfulness. While untrained human listeners may not be able to discern all these details, automated voice processing has changed this. Speaker recognition automatically recognises the identity of a human speaker from personal information contained in their voice signal (Furui1994Overview). Today, speaker recognition permeates private and public life. Speaker recognition systems are deployed at scale in call centres, on billions of mobile phones and on voice-enabled consumer devices such as smart speakers. They grant access not only to personal devices in intimate moments, but also to essential public services for vulnerable user groups. For example, in Mexico speaker recognition is used to allow senior citizens to provide a telephonic proof of life to receive their pension (veridas2021).
In this paper we study bias in speaker recognition systems. Bias in machine learning is a source of unfairness (mehrabi2019survey) that can have harmful consequences, such as discrimination (wachter2021bias). Bias and discrimination in the development of face recognition technologies (buolamwini2018gendershades; Raji2019actionable; Raji2021aboutface), natural language processing (Bolukbasi2016Man) and automated speech recognition (addadecker2005do; tatman2017effects; koenecke2020racial; Toussaint2021Characterising) have been studied for several years. To date, bias in speaker recognition has received very limited attention, despite the technology being pervasive, and extremely sensitive to demographic attributes of speakers. This makes it a matter of urgency to investigate bias in these systems.
Drawing on Suresh and Guttag’s Framework for Understanding Sources of Harm (Suresh2021Framework), we present the first detailed study on bias in the speaker recognition domain. We approach this work as a combination of an analytical and empirical evaluation focused on the VoxCeleb Speaker Recognition Challenge (Nagrani2020Voxsrc), one of the most popular benchmarks in the domain with widely used datasets. Our study shows that existing benchmark datasets, learning mechanisms, evaluation practices, aggregation habits and post-processing choices in the speaker recognition domain produce systems that are biased against female and non-US speakers. We summarise our contributions as follows:
- We present an evaluation framework for quantifying performance disparities in speaker verification, a speaker recognition task that serves as the biometrics of voice.
- Using this framework, we conduct the first evaluation of bias in speaker verification, showing that bias exists at every stage of the machine learning pipeline in speaker verification development.
- Informed by our evaluation, we recommend research directions to address bias in automated speaker recognition.
Our paper is structured as follows. In Section 2 and 3 we review related work and provide a background on speaker recognition, its evaluation and supporting infrastructure for its development. We then present the empirical experiment setup and bias evaluation framework in Section 4. In Section 5 we present our findings of bias in model building and implementation, and in Section 6 our findings of bias in data generation. We discuss our findings, and make recommendations for mitigating bias in speaker recognition in Section 7. Finally, we conclude in Section 8.
2. Related Work
In this section we provide a background on speaker recognition within its historical development context and present evidence of bias in the domain. We then introduce the theoretical framework on which we base our analytical and empirical evaluation of bias in speaker verification.
2.1. Historical Development of Automated Speaker Recognition
The historical development of automated speaker recognition reflects that of facial recognition in many aspects. Similar to the development of facial recognition systems (Raji2021aboutface), research in early speaker recognition systems was supported by defense agencies, with envisioned applications in national security domains such as forensics (greenberg2020two). The systems relied on datasets constructed from telephone corpora and their development was greatly accelerated through coordinated, regular competitions and benchmarks. Since its inception, research into speaker recognition has enabled voice-based applications for access control, transaction authentication, forensics, law enforcement, speech data management, personalisation and many others (Reynolds2002Overview). As a voice-based biometric, speaker verification is viewed as having several advantages: it is non-intrusive, system users have historically considered it to be non-threatening, microphone sensors are ubiquitously available in telephone and mobile systems or can be installed at low cost if they are not, and in many remote applications speech is the only form of biometrics available (Reynolds2002Overview). Given the proliferation of speaker recognition systems in digital surveillance technologies, concerns over their pervasive, hidden and invasive nature are rising (eff2021catalog).
2.2. Evidence of Bias in Automated Speaker Recognition
It is well established that sensitive speaker characteristics such as age, accent and sex affect the performance of speaker recognition (hansen2015speaker). In acknowledgement of this, past works in speech science, like research promoted through the 2013 Speaker Recognition Evaluation in Mobile Environments challenge, have reported speaker recognition performance separately for male and female speakers (Khoury2013speaker). The submissions to the challenge made it clear that bias is a cause for concern: all 12 submitted systems performed worse for females than for males on the evaluation set. On average the error rate for females was 49.35% greater than for males. Despite being acknowledged, these performance differences went unquestioned and were attributed solely to an unbalanced training set with a male:female speaker ratio of 2:1. In later works the discrepancy between female and male speakers is still evident and reported, but remains unquestioned and unaddressed (Park2016Speaker). In practice, a common approach to avoiding sex-based bias has been to develop separate models for female and male speakers (Kinnunen2009Overview). While this may be insufficient to eradicate bias, generating separate feature sets for female and male speakers can reduce it (Mazairafernandez2015improving). Beyond speaker sex, evaluating demographic performance gaps based on other speaker attributes is less common, and intersectional speaker subgroups have not been considered.
Since the adoption of Deep Neural Networks (DNNs) for speaker recognition, practices of evaluating system performance for speaker subgroups seem to have disappeared. Several system properties beyond performance have been considered in recent years, such as robustness (Bai2021Speaker) and privacy (Nautsch2019preserving). However, research in robustness and privacy in speaker recognition does not address the glaring gap that remains in the domain: system performance appears biased against speaker groups based on their demographic attributes. Only one recent study investigates bias in end-to-end deep learning models based on speaker age and sex (Fenu2021Fair), reconfirming the importance of balanced training sets.
In automated speech recognition, which is concerned with the linguistic content of voice data, not with speaker identity, recent studies have provided evidence that commercial automated caption systems have a higher word error rate for speakers of colour (tatman2017effects). Similar racial disparities exist in commercial speech-to-text systems, which are strongly influenced by pronunciation and dialect (koenecke2020racial). Considering their shared technical backbone with facial recognition systems, and shared data input with automated speech recognition systems, we expect that bias and harms identified in these domains will also exist in speaker recognition systems. Mounting evidence of bias in facial and speech recognition, the abundance of historic evidence of bias in speaker recognition, and the absence of research on fairness in speaker recognition provide a strong motivation for the work that we present in this paper.
2.3. Framework for Understanding Sources of Harm Through the Machine Learning Life Cycle
We draw on Suresh and Guttag’s (Suresh2021Framework) framework for understanding sources of harm through the machine learning life cycle to ground our investigation into bias in automated speaker recognition. Suresh and Guttag divide the machine learning life cycle into two streams and identify seven sources of bias related harms across the two streams: 1) the data generation stream can contain historical, representational and measurement bias; and 2) the model building and implementation stream can contain learning, aggregation, evaluation and deployment bias. Historical bias replicates bias, like stereotypes, that are present in the world as is or was. Representation bias underrepresents a subset of the population in the sample, resulting in poor generalization for that subset. Measurement bias occurs in the process of designing features and labels to use in the prediction problem. Aggregation bias arises when data contains underlying groups that should be treated separately, but that are instead subjected to uniform treatment. Learning bias concerns modeling choices and their effect on amplifying performance disparities across samples. Evaluation bias is attributed to a benchmark population that is not representative of the user population, and to evaluation metrics that provide an oversimplified view of model performance. Finally, deployment bias arises when the application context and usage environment do not match the problem space as it was conceptualised during model development.
Next we introduce automated speaker recognition, and then show analytically and empirically how these seven types of bias manifest in the speaker recognition development ecosystem.
3. Background on Automated Speaker Recognition
Speaker recognition refers to the collection of data processing tasks that identify a speaker by their voice (Furui1994Overview). Core tasks in speaker recognition are speaker identification, which determines a speaker’s identity from a subset of speakers, speaker verification, which validates if a speaker’s identity matches the identity of a stored speech utterance, and speaker diarisation, which is concerned with partitioning speech to distinguish between different speakers (Bai2021Speaker). While technical implementation details differ in the three areas, their communities overlap, they share datasets and participate in the same competitions. We focus our investigation in this paper on speaker verification. However, as the tasks have evolved together, many of the biases that we uncover in speaker verification also apply to speaker identification and diarisation. In this section we provide a high level overview of speaker verification and its evaluation, as well as its supporting ecosystem of competitions and benchmarks. We refer the reader to (Bai2021Speaker) for a detailed technical survey on state-of-the-art speaker recognition, and to (Kinnunen2009Overview) for a review on the classical speaker recognition literature prior to the advent of Deep Neural Networks (DNNs).
3.1. Speaker Verification Overview
A speaker verification system determines whether a candidate speaker matches the identity of a registered speaker by comparing a candidate speaker’s speech signal (i.e. trial utterance) to the speech signal of a registered speaker (i.e. enrollment utterance). Speaker verification is classified based on its training data as text-dependent if speech signals are fixed phrases or text-independent if not, and as prompted if speech was produced by reading text or spontaneous if not (greenberg2020two). Spontaneous text-independent speech is the type of speech that occurs naturally when a speaker interacts with a voice assistant or a call centre agent. It presents the most general speaker verification task and has been the focus of research advances in recent years.
As shown in Figure 1, many speaker verification systems consist of two stages: a front-end that generates speaker embeddings for enrollment and trial utterances, and a back-end that computes a similarity score for the two resultant embeddings. Alternatively, end-to-end speaker verification directly learns a similarity score from training utterances (heigold2016endtoend). Modern speaker verification systems use DNNs to learn the front-end embedding, or to train the end-to-end system (Bai2021Speaker). As the final step of the speaker verification process, the score output is compared to a threshold. Speaker identity is accepted if the score lies above the threshold, and rejected if it lies below.
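The two-stage pipeline can be sketched as follows. This is an illustrative sketch, not the challenge implementation: the cosine-similarity back-end shown here is one common scoring choice, the embedding vectors are assumed to come from an arbitrary front-end, and all function names are ours.

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, trial_emb: np.ndarray) -> float:
    """Back-end: cosine similarity between two speaker embeddings."""
    enroll = enroll_emb / np.linalg.norm(enroll_emb)
    trial = trial_emb / np.linalg.norm(trial_emb)
    return float(np.dot(enroll, trial))

def verify(enroll_emb: np.ndarray, trial_emb: np.ndarray, threshold: float) -> bool:
    """Final step: accept the claimed identity if the score lies above the threshold."""
    return cosine_score(enroll_emb, trial_emb) > threshold
```

In an end-to-end system the separate embedding and scoring steps would collapse into a single learned function mapping a pair of utterances directly to a score.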
3.2. Speaker Verification Evaluation
To evaluate speaker verification systems, scores are generated for many pairs of enrollment and trial utterances. The utterance pairs are labelled as being from the same or from different speakers. Two typical score distributions generated from many same and different speaker utterance pairs are shown in Figure 2. After calibrating the speaker verification system to a threshold (e.g. equal error rate or detection cost), utterance pairs with a score value to the left of the threshold are classified as different speakers and the trial utterance is rejected. Utterance pairs with a score value to the right of the threshold are classified as the same speaker, and accepted. As the two distributions overlap, classification is not perfect. At a particular threshold value there will be false positives, i.e. utterance pairs of different speakers with a score value to the right of the threshold, and false negatives, i.e. utterance pairs of the same speakers with a score value to the left of the threshold.
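The classification step described above reduces to thresholding scores. A minimal sketch, assuming trial scores and same/different-speaker labels are given as NumPy arrays (function name is ours):

```python
import numpy as np

def error_rates(scores: np.ndarray, same_speaker: np.ndarray, threshold: float):
    """FPR: fraction of different-speaker pairs falsely accepted at the threshold.
    FNR: fraction of same-speaker pairs falsely rejected at the threshold."""
    accept = scores >= threshold
    fpr = float(np.mean(accept[~same_speaker]))  # false positives among different-speaker pairs
    fnr = float(np.mean(~accept[same_speaker]))  # false negatives among same-speaker pairs
    return fpr, fnr
```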
Speaker verification performance is determined by its false positive rate (FPR) and false negative rate (FNR) at the threshold value to which the system has been calibrated (greenberg2020two). It is accepted that the two error rates present a trade-off, and that selecting an appropriate threshold is an application-specific design decision (NIST2020). The threshold value is determined by balancing the FPR and FNR error rates for a particular objective, such as obtaining an equal error rate (EER) for FPR and FNR, or minimising a cost function. The detection cost function (DCF) is a weighted sum of FPR and FNR across threshold values, with weights determined by the application requirements. To compare performance across models, systems are frequently tuned to the threshold value at the minimum of the DCF, and the corresponding detection cost value is reported as a metric. Various detection cost functions have been proposed over time, such as the one in the NIST SRE 2019 Evaluation Plan (NIST2019).
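In our notation, the detection cost at a threshold $\theta$ can be written as follows (a reconstruction; the NIST SRE 2019 Evaluation Plan additionally normalises this cost by that of a default system, with costs and prior set by the plan):

```latex
C_{Det}(\theta) = C_{FN} \cdot P_{Target} \cdot \mathrm{FNR}(\theta)
               + C_{FP} \cdot (1 - P_{Target}) \cdot \mathrm{FPR}(\theta)
```

where $C_{FN}$ and $C_{FP}$ are the application-dependent costs of false negatives and false positives, and $P_{Target}$ is the prior probability that a trial is a same-speaker pair.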
Speech science literature recommends that detection error trade-off (DET) curves (greenberg2020two) are used to visualise the trade-off between FPR and FNR, and to consider system performance across various thresholds. DET curves visualise the FPR and FNR at different operating thresholds on the x- and y-axis of a normal deviate scale (Martin1997Det) (see Figure 7 in the Appendix as example). They can be used to analyse the inter-model performance (across models), and are also recommended for analysing intra-model performance (across speaker subgroups in one model).
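The normal deviate (probit) transform underlying DET curves can be computed as in the sketch below, which sweeps all observed scores as candidate thresholds; function names are ours, and a real DET plot would draw these coordinates with matplotlib.

```python
import numpy as np
from scipy.stats import norm

def det_curve(scores: np.ndarray, same_speaker: np.ndarray):
    """Return (FPR, FNR) coordinates in normal-deviate units for a DET plot."""
    thresholds = np.sort(np.unique(scores))
    fprs, fnrs = [], []
    for t in thresholds:
        accept = scores >= t
        fprs.append(np.mean(accept[~same_speaker]))  # different-speaker pairs accepted
        fnrs.append(np.mean(~accept[same_speaker]))  # same-speaker pairs rejected
    # Probit (normal deviate) scale used by DET curves; clip to avoid +/- infinity
    x = norm.ppf(np.clip(fprs, 1e-6, 1 - 1e-6))
    y = norm.ppf(np.clip(fnrs, 1e-6, 1 - 1e-6))
    return x, y
```

Plotting one such curve per speaker subgroup on shared axes yields the intra-model comparison recommended above.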
3.3. Competitions and Benchmarks
Speaker recognition challenges have played an important role in evaluating and benchmarking advances in speaker verification. They were first initiated within the Information Technology Laboratory of the US National Institute of Standards and Technology (NIST) to conduct evaluation driven research on automated speaker recognition (greenberg2020two). The NIST Speaker Recognition Evaluation (SRE) challenges and their associated evaluation plans have been important drivers of speaker verification evaluation. In addition, new challenges have emerged over time to address the requirements of emerging applications and tasks. Table 1 summarises recent challenges, their organisers and the metrics used for evaluation. Most challenges have adopted the minimum of the detection cost function (minDCF), recommended by the NIST SREs, as their primary metric. As the NIST SREs have modified this function over time, different challenges use different versions of the metric. In the remainder of this paper we evaluate bias in the VoxCeleb Speaker Recognition Challenge (SRC).
|Challenge||Organiser||Year(s)||Evaluation metrics|
|NIST SRE (greenberg2020two)||US National Inst. of Standards & Tech.||1996 - 2021||Detection Cost Function|
|SRE in Mobile Env’s (Khoury2013speaker)||Idiap Research Institute||2013||DET curve, EER|
|Speakers in the Wild SRC (mclaren2016speakers)||at Interspeech 2016||2016||minDCF (SRE2016)|
|VoxCeleb SRC (Nagrani2020Voxsrc)||Oxford Visual Geometry Group||2019 - 2021||minDCF (SRE2018), EER|
|Far-Field SVC (qin2020interspeech)||at Interspeech 2020||2020||minDCF, EER|
|Short Duration SVC (zeinali2019shortduration)||at Interspeech 2021||2020 - 2021||minDCF (SRE08)|
|SUPERB benchmark (yang2021superb)||CMU, JHU, MIT, NTU, Facebook AI||2021||EER|
4. Experiment Setup for Evaluating Bias in the VoxCeleb SRC
Launched in 2019, the objective of the VoxCeleb Speaker Recognition Challenge (SRC) is to "probe how well current [speaker recognition] methods can recognize speakers from speech obtained 'in the wild'" (voxcelebsrc2021). The challenge has four tracks: open and closed fully supervised speaker verification, closed self-supervised speaker verification and open speaker diarisation. It serves as a well-known benchmark, and has received several hundred submissions over the past three years. The popularity of the challenge and its datasets make it a suitable candidate for our evaluation, representative of the current ecosystem. We evaluate group bias in the speaker verification track of the VoxCeleb SRC.
4.1. Baseline Models
The challenge has released two pre-trained baseline models (heo2020clova) trained on the VoxCeleb 2 training set (Nagrani2020a) with close to 1 million speech utterances of 5994 speakers. 61% of speakers are male and 29% of speakers have a US nationality, which is the most represented nationality. The baseline models are based on a 34-layer ResNet trunk architecture. ResNetSE34V2 (heo2020clova) is a larger model that has been optimized for predictive performance. ResNetSE34L (chung2020defence) is a smaller model that has been optimized for fast execution. We downloaded and used the baseline models as black-box predictors in our evaluation. The technical details of the baseline models are summarised in Table 3 in the Appendix.
4.2. Evaluation Dataset
We evaluate the baseline models on three established evaluation sets that can be constructed from the VoxCeleb 1 dataset (Nagrani2020a). VoxCeleb 1 was released in 2017 with the goal of creating a large scale, text-independent speaker recognition dataset that mimics unconstrained, real-world speech conditions, in order to explore the use of deep convolutional neural networks for speaker recognition tasks (Nagrani2017Voxceleb). The dataset contains 153 516 short clips of audio-visual utterances of 1251 celebrities in challenging acoustic environments (e.g. background chatter, laughter, speech overlap) extracted from YouTube videos. The dataset also includes metadata for speaker sex and nationality, and is disjoint from VoxCeleb 2 which is used for training. Three different evaluation sets have been designed for testing speaker verification with VoxCeleb 1. We consider all three evaluation sets in our analysis. The evaluation sets are discussed in detail in Section 5.3.
4.3. Speaker Subgroups and Bias Evaluation Measures
We select speaker subgroups based on speaker attributes that are known to affect a speaker’s voice: sex, which is the biological difference between male and female speakers, and accent, which we approximate with the speaker’s nationality (Singh2019Profiling). We then establish bias by evaluating performance disparities between these subgroups using existing evaluation measures in speaker verification.
Our first technique for establishing bias is to plot the DET curves for all subgroups, and to compare the subgroups’ DET curves to the overall curve. As speaker verification systems must operate on the DET curve, this presents the theoretical performance boundary of the model across subgroups. Secondly, we consider bias at the threshold to which the system has been calibrated, which ultimately presents the operating point of the system. Here we consider an unbiased system as one that has equal false positive and true positive (or false negative) rates across subgroups, in line with the definition of equalized odds (Hardt2016equality). We compare each subgroup’s performance to the overall system performance to facilitate comparison across a large number of subgroups, and thus deviate slightly from the formal definition of equalized odds. We use $C_{Det}$ as defined in Equation 1 to determine the calibration threshold, and quantify the relative bias towards each subgroup as the ratio of the subgroup cost to the overall cost at the threshold value where $C_{Det}$ is minimized for the overall system.
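Writing $\theta^{*}$ for the threshold that minimises the overall $C_{Det}$, the subgroup bias ratio takes the following form (notation is ours):

```latex
\mathrm{bias}_{sg} \;=\; \frac{C_{Det}^{sg}(\theta^{*})}{C_{Det}^{overall}(\theta^{*})}
```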
If the subgroup bias is greater than 1, the subgroup performance is worse than the overall performance, and the speaker verification model is prejudiced against that subgroup. Conversely, if the subgroup bias is less than one, the model favours the subgroup. If the ratio is exactly 1, the model is unbiased for that subgroup.
4.4. Black-box Bias Evaluation Framework
We designed a framework that replicates a real evaluation scenario to evaluate bias in the VoxCeleb SRC benchmark. The code for the evaluation framework has been released as an open-source Python library (https://github.com/hidden-url). Figure 3 shows an overview of the framework. We start with pairs of single-speaker speech utterances in the evaluation dataset as input, and use the baseline models, ResNetSE34V2 and ResNetSE34L, as black-box predictors. The baseline models output scores for all utterance pairs in the evaluation set. We set the threshold to the value that minimizes the overall system cost of the DCF, and accept or reject speakers in utterance pairs accordingly. The predicted binary acceptance is then compared to the true labels of the utterance pairs to determine false positive and false negative predictions. Using the demographic metadata for speakers, we allocate each utterance pair to a subgroup based on the demographics of the enrollment utterance. From these inputs we evaluate bias by establishing the FPR, FNR and thus $C_{Det}$ at the threshold value for each subgroup. We also plot DET curves from the output scores for each subgroup. The evaluation is repeated for each of the three VoxCeleb 1 evaluation sets.
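The core of such an evaluation loop can be sketched as follows. This is a simplified illustration, not the released library: the cost parameters are placeholders rather than the challenge's exact DCF weights, and all names are ours.

```python
import numpy as np

def dcf(fpr: float, fnr: float, p_target=0.05, c_fp=1.0, c_fn=1.0) -> float:
    """Detection cost at one operating point (unnormalised, placeholder weights)."""
    return c_fn * p_target * fnr + c_fp * (1 - p_target) * fpr

def min_dcf_threshold(scores: np.ndarray, same_speaker: np.ndarray):
    """Threshold that minimises the overall detection cost over observed scores."""
    best_t, best_c = None, np.inf
    for t in np.unique(scores):
        accept = scores >= t
        fpr = np.mean(accept[~same_speaker])
        fnr = np.mean(~accept[same_speaker])
        c = dcf(fpr, fnr)
        if c < best_c:
            best_t, best_c = t, c
    return best_t, best_c

def subgroup_bias(scores: np.ndarray, same_speaker: np.ndarray, groups: np.ndarray):
    """Ratio of each subgroup's cost to the overall cost at the overall minDCF threshold."""
    t, overall = min_dcf_threshold(scores, same_speaker)
    ratios = {}
    for g in np.unique(groups):
        m = groups == g
        accept = scores[m] >= t
        fpr = np.mean(accept[~same_speaker[m]])
        fnr = np.mean(~accept[same_speaker[m]])
        ratios[g] = dcf(fpr, fnr) / overall
    return ratios
```

A ratio above 1 flags a subgroup that fares worse than the system overall at the shared calibration threshold, mirroring the bias measure defined in §4.3.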
5. Bias in Model Building and Implementation in Speaker Verification
In this section we identify sources of bias that arise during the model building and implementation stage in speaker verification. In the machine learning pipeline this stage involves model definition and training, evaluation and real-world deployment. The types of bias that arise in these processes are aggregation bias, learning bias, evaluation bias and deployment bias. We found evidence of each type of bias in our evaluation of the VoxCeleb SRC and present our findings in the remainder of the section.
5.1. Aggregation Bias
Aggregation bias arises when data contains underlying groups that should be treated separately, but that are instead subjected to uniform treatment.
We evaluate aggregation bias by plotting disaggregated DET performance curves for speaker subgroups based on nationality and sex. In Figure 4 we show the DET curves for female (left) and male (right) speakers across 11 nationalities for the ResNetSE34V2 model evaluated on the VoxCeleb 1-H evaluation set. The dotted black DET curve shows the overall performance across all subgroups. DET curves above the dotted line have a high likelihood of performing worse than average, while DET curves below the dotted line will generally perform better than average. It is easy to see that the DET curves of female speakers lie mostly above the average DET curve, while those of male speakers lie below it. The model is thus likely to perform worse than average for females, and better for males. Figure 8 in the Appendix shows DET subplots for each nationality, highlighting disparate performance across nationalities.
The triangular markers show the FPR and FNR at the threshold where the overall system DCF is minimized. The markers for male and female speaker subgroups are dispersed, indicating that the aggregate system calibration results in significant operating performance variability across subgroups. Table 4 in the Appendix shows the bias ratio for all subgroups. With the exception of US female speakers, all female subgroups have a bias ratio greater than 1, and thus perform worse than average.
The DET curves and bias ratios demonstrate disparate performance based on speaker sex and nationality. They also show that the model is fit to the dominant population in the training data, US speakers. The trends in aggregation bias that we observe for ResNetSE34V2 are evident in all three evaluation sets, as well as in ResNetSE34L. They indicate that speaker verification models do not identify all speaker subgroups equally well, and validate that performance disparities between male and female speakers identified in the past (Mazairafernandez2015improving) still exist in DNN speaker verification models today.
5.2. Learning Bias
Learning bias concerns modeling choices and their effect on amplifying performance disparities across samples.
The ResNetSE34V2 and ResNetSE34L models are built with different architectures and input features. The two architectures have been designed for different goals: to optimize predictive performance and to reduce inference time, respectively. We evaluate the potential effect of learning bias in speaker verification by comparing the predictive performance of the two models across subgroups. In Figure 5 we plot the bias ratio of both models for all subgroups. On the dotted line both models perform equally well for a subgroup. The greater the distance between a marker and the line, the greater the performance disparity between the models for that subgroup. As described in §4.3, subgroup performance is worse than average if the bias ratio is greater than 1, and better than average if it is less than 1. We make three observations. Firstly, the bias ratio of male speaker subgroups is close to or less than 1 for both models, indicating that at the threshold value males experience better than average performance in both models. Secondly, the bias ratio for male and female US speakers is equal across the two models, indicating that performance disparities are not amplified for the dominant group. Thirdly, neither of the two models reduces performance disparities definitively: ResNetSE34V2 has a lower bias ratio for 7 subgroups, ResNetSE34L for 10 subgroups.
In addition to examining bias ratios, we have plotted the DET curves for both models across subgroups in Figure 9 in the Appendix. We observe that ResNetSE34L increases the distance between the DET curves for male and female speakers with nationalities from the UK, USA and Ireland, indicating that the model increases performance disparities between male and female speakers of these nationalities. For Australian, Indian and Canadian speakers the distance between the DET curves for males and females remains unchanged, while for Norwegian speakers they lie closer together. Together these results provide evidence of learning bias, showing that modeling choices such as the architecture and feature input can amplify performance disparities in speaker verification. The disparities tend to negatively affect female speakers and minority nationalities.
5.3. Evaluation Bias
Evaluation bias is attributed to a benchmark population that is not representative of the user population, and to evaluation metrics that provide an oversimplified view of model performance.
Representative benchmark datasets are particularly important during machine learning development, as benchmarks have disproportionate power to scale bias across applications if models overfit to the data in the benchmark (Suresh2021Framework). Three evaluation sets can be constructed from the VoxCeleb 1 dataset to benchmark speaker verification models. VoxCeleb 1 test contains utterance pairs of 40 speakers whose names start with ’E’. VoxCeleb 1-E includes the entire dataset, with utterance pairs sampled randomly. VoxCeleb 1-H is considered a hard test set that contains only utterance pairs where speakers have the same sex and nationality. Speakers have only been included in VoxCeleb 1-H if there are at least 5 unique speakers with the same sex and nationality. All three evaluation sets contain a balanced count of utterance pairs from same and different speakers. We have calculated the speaker and utterance level demographics for each evaluation set from the dataset’s metadata, and summarise the attributes of the evaluation sets in Table 2.
|VoxCeleb 1 test||VoxCeleb 1-E||VoxCeleb 1-H|
|unique speakers||40||1 251||1 190|
|unique utterance pairs||37 720||579 818||550 894|
|speaker pairing details||-||random sample||same sex, nationality|
|speaker pair inclusion criteria||name starts with ’E’||all||>=5 same sex, nationality speakers|
|female / male speakers (%)||38 / 62||45 / 55||44 / 56|
|female / male utterances (%)||29.5 / 70.5||41.8 / 58.2||41.1 / 58.9|
|count of nationalities||9||36||11|
|top 1 nationality (% speakers / utterances)||US (62.5 / 59.6)||US (63.9 / 61.4)||US (67.1 / 64.7)|
|top 2 nationality (% speakers / utterances)||UK (12.5 / 13.9)||UK (17.2 / 18.3)||UK (18.1 / 19.3)|
|top 3 nationality (% speakers / utterances)||Ireland (7.5 / 6.7)||Canada (4.3 / 3.8)||Canada (4.5 / 3.9)|
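The VoxCeleb 1-H inclusion rule described above can be illustrated with a short sketch, assuming speaker metadata is available as (speaker_id, sex, nationality) tuples (function and variable names are ours):

```python
from collections import Counter

def hard_eval_speakers(metadata):
    """Keep a speaker only if at least 5 unique speakers share their
    (sex, nationality) combination, mirroring the VoxCeleb 1-H criterion."""
    counts = Counter((sex, nat) for _, sex, nat in metadata)
    return [spk for spk, sex, nat in metadata if counts[(sex, nat)] >= 5]
```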
Several observations can be made from the summary in Table 2: the VoxCeleb 1 dataset suffers from representation bias (see Section 6.2), and all three evaluation sets overrepresent male speakers and US nationals. Furthermore, the sample size of VoxCeleb 1 test is too small for a defensible evaluation. Its inclusion criterion based on speakers’ names introduces additional representation bias into the evaluation set, as names strongly correlate with language, culture and ethnicity.
In addition to these obvious observations, the summary also reveals subtler discrepancies. Speaker verification evaluation is done on utterance pairs. Demographic representation is thus important on the speaker level to ensure that the evaluation set includes a variety of speakers, and on the utterance level to ensure that sufficient speech samples are included for each individual speaker. A significant mismatch in demographic representation between the speaker and utterance level is undesirable. If the representation of a subgroup is higher on the speaker level than the utterance level, this misrepresents the demographics that matter during evaluation and may indicate underrepresentation of individual speakers. Conversely, if the representation of a subgroup is lower on the speaker level, this increases the utterance count per speaker, suggesting overrepresentation of individual speakers. When considering utterances instead of speakers, the representation of females in relation to males decreases from 61% to 42% for VoxCeleb 1 test, from 82% to 72% for VoxCeleb 1-E and from 79% to 70% for VoxCeleb 1-H. The evaluation sets thus not only contain fewer female speakers, they also contain fewer utterances for each female speaker, which reduces the quality of evaluation for female speakers.
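The speaker- versus utterance-level comparison can be reproduced with a small helper, assuming metadata as a mapping from speaker to (sex, utterance count); an illustrative sketch in our own notation:

```python
def representation_ratio(speakers: dict):
    """speakers: speaker_id -> (sex, n_utterances).
    Returns the female-to-male ratio at the speaker level and at the utterance level."""
    f_spk = sum(1 for sex, _ in speakers.values() if sex == "f")
    m_spk = sum(1 for sex, _ in speakers.values() if sex == "m")
    f_utt = sum(n for sex, n in speakers.values() if sex == "f")
    m_utt = sum(n for sex, n in speakers.values() if sex == "m")
    return f_spk / m_spk, f_utt / m_utt
```

A lower ratio at the utterance level than at the speaker level indicates that female speakers contribute fewer utterances each, as observed for all three VoxCeleb 1 evaluation sets.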
We evaluate ResNetSE34V2 with the three evaluation sets and plot the resulting DET curves in Figure 6. The DET curve of VoxCeleb 1 test is irregular, confirming that this evaluation set is too small for a valid evaluation. In an FPR range between 0.1% and 5%, which is a reasonable operating range for speaker verification, model performance is similar on VoxCeleb 1 test and VoxCeleb 1-E. The curve of VoxCeleb 1-H lies significantly above those of the other two evaluation sets, indicating that the model performs worse on this evaluation set. Our empirical results illustrate that model performance is highly susceptible to the evaluation set, and show how evaluation bias can affect the assessment of speaker verification models.
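A DET curve is simply the set of (FPR, FNR) operating points obtained by sweeping the decision threshold over the trial scores. A minimal sketch of the underlying computation (real DET plots additionally warp both axes to the normal-deviate scale):

```python
import numpy as np

def det_points(scores, labels):
    """(FPR, FNR) operating points obtained by sweeping the threshold
    over the trial scores; labels: 1 = target trial, 0 = impostor trial."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(np.unique(scores))
    # A trial is accepted when its score is at or above the threshold.
    fnr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    return thresholds, fpr, fnr
```

As the threshold increases, the FPR falls while the FNR rises; plotting the two against each other for each evaluation set (or each subgroup) produces the curves compared in Figure 6.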
The two dominant metrics used in speaker verification benchmarks, including the VoxCeleb SRC, are the equal error rate (EER) and the minimum value of the detection cost function, $C_{Det}^{min}$ (see Table 1). Both error metrics give rise to evaluation bias. The EER presents an oversimplified view of model performance, as it cannot weight false positives and false negatives differently. Yet, most speaker verification applications strongly favour either a low FPR or a low FNR (greenberg2020two). For this reason the NIST SREs do not promote the use of the EER for speaker verification evaluation (greenberg2020two), which makes it particularly concerning that new challenges like the SUPERB benchmark evaluate only the EER (yang2021superb). $C_{Det}^{min}$ can weight FPR and FNR, but has its own shortcomings. Firstly, the detection cost function has been updated over the years, and different versions of the metric are in use. This is impractical for consistent evaluation of applications across time. Secondly, the cost function is only useful if the FPR and FNR weighting reflects the requirements of the application. Determining appropriate weights is a normative design decision that has received very limited attention in the research community. In benchmarks the weights are typically not adjusted, which oversimplifies real-life evaluation scenarios. Finally, $C_{Det}^{min}$ presents a limited view of a model’s performance at a single threshold value. While DET curves can provide a holistic view of the performance of speaker verification models across thresholds, many recent research papers do not show them, and those that do show only aggregate curves.
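Both metrics can be computed from the same list of trial scores. A sketch with illustrative NIST-style cost parameters; the weight values below are assumptions, not the VoxCeleb SRC settings:

```python
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Equal error rate and minimum detection cost from raw trial scores.

    (p_target, c_miss, c_fa) are illustrative cost parameters; labels:
    1 = target trial, 0 = impostor trial."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(np.unique(scores))
    fnr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    # EER: the operating point where FNR and FPR cross.
    i = np.argmin(np.abs(fnr - fpr))
    eer = (fnr[i] + fpr[i]) / 2
    # minDCF: the smallest weighted cost over all thresholds.
    dcf = p_target * c_miss * fnr + (1 - p_target) * c_fa * fpr
    return eer, dcf.min()
```

The sketch makes the text's point concrete: the EER fixes the FPR/FNR trade-off at their crossing point, while the cost weights encode a normative judgement about which error matters more.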
The aggregate form of current evaluation practices, based on aggregate metrics and optimised for average performance, hides the nature of harm that arises from evaluation bias. Ultimately, what matters when a speaker verification system is deployed are the FPR and FNR. False positives pose a security risk, as they grant unauthorized speakers access to the system. False negatives pose a risk of exclusion, as they deny authorized speakers access to the system. We consider the FPR and FNR for subgroups at the overall minimum threshold $\theta_{min}^{overall}$ in relation to the average FPR and FNR in Table 5 in the Appendix. US male speakers have FPR and FNR ratios of 1, indicating that this subgroup will experience error rates in line with the average. On the other end of the spectrum, Indian female speakers have a FPR and FNR that are 13 and 1.3 times greater than average respectively, indicating that this subgroup is exposed to a significant security risk, and to a higher risk of exclusion.
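Disaggregating the error rates at the deployed threshold makes these risks visible. A sketch of the subgroup-to-average ratios described above (the data layout and threshold are illustrative assumptions):

```python
import numpy as np

def subgroup_error_ratios(scores, labels, groups, threshold):
    """FPR and FNR of each subgroup relative to the overall rates at a
    fixed operating threshold. Assumes every subgroup has both target
    (label 1) and impostor (label 0) trials."""
    scores, labels, groups = map(np.asarray, (scores, labels, groups))

    def rates(mask):
        fnr = (scores[mask & (labels == 1)] < threshold).mean()
        fpr = (scores[mask & (labels == 0)] >= threshold).mean()
        return fpr, fnr

    fpr_all, fnr_all = rates(np.ones(len(scores), dtype=bool))
    return {
        g: (rates(groups == g)[0] / fpr_all, rates(groups == g)[1] / fnr_all)
        for g in np.unique(groups)
    }
```

A ratio of 1 means a subgroup experiences average error rates; a ratio well above 1, as reported for Indian female speakers in Table 5, flags a disproportionate security or exclusion risk.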
5.4. Deployment Bias
Deployment bias arises when the application context and usage environment do not match the problem space as it was conceptualised during model development.
Advancements in speaker verification research have been funded by governments to advance intelligence, defense and justice objectives (greenberg2020two). The underlying use cases of speaker verification in these domains have been biometric identification and authentication. Through this lens, the speaker verification problem space has been conceptualised to minimise false positives, which result in security breaches. Research on evaluation, and consequently also model development, has thus focused on attaining low FPRs. This dominant but limited view promotes deployment bias in new use cases, which require evaluation practices and evaluation datasets tailored to their context.
Today, speaker verification is used in a wide range of audio-based applications, from voice assistants on smart speakers and mobile phones to call centers. In all of these, a low FPR is necessary to ensure system security. In voice assistants, false positives also affect user privacy, as positive classifications trigger voice data to be sent to service providers for downstream processing (Schonherr2020Unacceptable). When used in forensic applications, false positives can amplify existing bias in decision-making systems, for example in the criminal justice system (machinebias2016). Even if the FPR is low, the speaker verification system will have a high FNR as a trade-off, and the consequences of this must be considered. The FNR affects usability and can lead to a denial of service from voice-based user interfaces. The more critical the service, the higher the risk of harm associated with the FNR. Consider, for example, the previously mentioned speaker verification system used as proof of life for pensioners (veridas2021). As long as the system identifies a pensioner correctly, it spares them the need to travel to administrative offices, saving them time, money and physical strain. If the system has disparate FNRs between demographic subgroups, some populations will be subjected to a greater burden of travel.
Evaluation practices aside, many speaker verification applications will suffer from deployment bias when evaluated on the utterance pairs in the VoxCeleb 1 evaluation datasets. Voice assistants in homes, cars, offices and public spaces are geographically bound, and speakers using them will frequently share a nationality, language and accent. These user and usage contexts should be reflected in the evaluation sets. The VoxCeleb 1 evaluation sets with randomly generated utterance pairs (i.e. VoxCeleb 1 test and -E) are inadequate to capture speaker verification performance in these application scenarios. Even VoxCeleb 1-H, which derives its abbreviation -H from being considered the hard evaluation set, is inadequate to evaluate speaker verification performance in very common voice assistant scenarios, such as distinguishing family members. Furthermore, the naming convention of the evaluation sets promotes a limited perspective on speaker verification application contexts: naming VoxCeleb 1-H the hard evaluation set creates a false impression that the randomly generated utterance pairs of VoxCeleb 1-E are the typical evaluation scenario.
The operating threshold of a speaker verification system is calibrated after model training (see §3.2). This post-processing step amplifies aggregation bias (discussed in §5.1) and deployment bias due to the application context (discussed above). The operating threshold is set in a calibration process that tunes a speaker verification system to a particular evaluation set. If the evaluation set does not take the usage environment and the characteristics of speakers in the environment into consideration, this can give rise to further deployment bias due to post-processing. As discussed above, the VoxCeleb 1 evaluation sets encompass a very limited perspective on application scenarios, and thresholds tuned to these evaluation sets will suffer from deployment bias due to post-processing in many contexts.
The speaker verification system threshold is typically calibrated on the overall evaluation set. This gives rise to threshold bias, which we consider a form of aggregation bias that arises during post-processing and deployment. Instead of calibrating the threshold on the overall evaluation set, it could be tuned for each subgroup individually. Using the detection cost function as an example, this means setting the threshold for a subgroup to the value where $C_{Det}$ is minimized for the subgroup (i.e. $\theta_{min}^{subgroup}$). If the detection cost at the subgroup’s minimum is smaller than at the overall minimum, then the subgroup benefits from being tuned to its own threshold. By calculating the ratio of the subgroup’s detection cost at the overall minimum threshold to the subgroup’s minimum detection cost, we can get an intuition of the extent of threshold bias. If the threshold bias is greater than 1, the subgroup will benefit from being tuned to its own threshold; the greater the ratio, the greater the threshold bias and the more the subgroup will benefit from tuning to its own minimum. Table 4 in the Appendix shows the threshold bias for all subgroups. It is clear that all subgroups would perform better if tuned to their own threshold. Notably, female speakers have a mean threshold bias of 1.37 and will experience greater benefit than male speakers, whose mean threshold bias is 1.09. Visually, the effect of calibrating subgroups to their own threshold can be seen in Figure 10 in the Appendix.
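The threshold-bias ratio can be sketched as follows, assuming for simplicity equal miss and false-alarm weights in the detection cost (the paper's exact cost parameters are not reproduced here):

```python
import numpy as np

def threshold_bias(scores, labels, groups, thresholds):
    """Ratio of each subgroup's detection cost at the overall minimum
    threshold to its cost at the subgroup's own minimum threshold.

    A ratio > 1 means the subgroup would benefit from its own threshold.
    Uses an unweighted cost (FNR + FPR) as a simplifying assumption."""
    scores, labels, groups = map(np.asarray, (scores, labels, groups))

    def dcf(mask, t):
        fnr = (scores[mask & (labels == 1)] < t).mean()
        fpr = (scores[mask & (labels == 0)] >= t).mean()
        return fnr + fpr

    all_mask = np.ones(len(scores), dtype=bool)
    t_overall = min(thresholds, key=lambda t: dcf(all_mask, t))
    bias = {}
    for g in np.unique(groups):
        mask = groups == g
        t_sub = min(thresholds, key=lambda t: dcf(mask, t))
        bias[g] = dcf(mask, t_overall) / dcf(mask, t_sub)
    return bias
```

In the toy test below, the overall minimum sits at a threshold that is fine for one subgroup (bias 1.0) but doubles the cost for the other (bias 2.0), mirroring the disparity reported in Table 4.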
6. Bias in Data Generation in Speaker Verification
Having analyzed bias in model building and implementation in detail, we now present evidence of bias in the data generation stage in speaker verification with the VoxCeleb SRC benchmark. The end goal of the data generation stage is to create training, test and benchmark datasets. The stage involves data generation, population definition and sampling, measurement and pre-processing. The types of bias that arise in these processes are historical bias, representation bias and measurement bias. We discuss historical bias and representation bias in the VoxCeleb 1 dataset in the remainder of the section.
6.1. Historical Bias
Historical bias replicates biases, like stereotypes, that are present in the world as is or was.
The VoxCeleb 1 dataset was constructed with a fully automated data processing pipeline from open-source audio-visual media (Nagrani2017Voxceleb). The candidate speakers for the dataset were sourced from the VGG Face dataset (Parkhi2015Deep)
, which is based on the intersection of the most searched names in the Freebase knowledge graph and the Internet Movie Database (IMDB). After searching for and downloading video clips of the identified celebrities, further processing was done to track faces, identify active speakers and verify each speaker’s identity, using the HOG-based face detector (King2009Dlibml), SyncNet (Chung2017out) and the VGG Face CNN (Simonyan2015very) respectively. If the face of a speaker was correctly identified, the clip was included in the dataset.
Bias in facial recognition technologies is well known (buolamwini2018gendershades; Raji2019actionable; Raji2021aboutface), and historic bias pervades the automated data generation process of VoxCeleb. The VoxCeleb 1 inclusion criteria subject the dataset to the same bias that has been exposed in facial recognition verification technology and reinforce popularity bias based on search results (mehrabi2019survey). Moreover, the data processing pipeline directly translates bias in facial recognition systems into the speaker verification domain, as failures in the former will result in speaker exclusion from VoxCeleb 1.
6.2. Representation Bias
Representation bias underrepresents a subset of the population in its sample, resulting in poor generalization for that subset.
As the preceding sections have shown, representation bias in the VoxCeleb 1 dataset contributes to several forms of bias in speaker verification. The dataset is skewed towards males and US nationals, as can be seen in Figure 11 in the Appendix. Recent work on age recognition with the VoxCeleb datasets (Hechmi2021voxceleb) shows that speakers between the ages of 20 and 50 are most represented in the dataset, indicating that representation bias is also evident across speaker age. Accent (which we approximate from nationality), sex and age account for only some of the attributes of a speaker’s voice that affect automated speaker recognition (Singh2019Profiling). As a celebrity dataset, VoxCeleb 1 is not representative of the broader public, and likely contains representation bias that affects many other sensitive speaker attributes.
In this paper we have presented an in-depth study of bias in speaker verification, a core task in automated speaker recognition. We have provided empirical and analytical evidence of sources of bias at every stage of the speaker verification machine learning development workflow. Our study highlights that speaker verification performance degradation due to demographic attributes of speakers is significant, and can be attributed to aggregation, learning, evaluation, deployment, historical and representation bias. Our findings echo concerns similar to those raised in the evaluation of facial recognition technologies (Raji2021aboutface). While our findings are specific to speaker verification, they can, for the most part, be extended to automated speaker recognition more broadly. Below we present recommendations for mitigating bias in automated speaker recognition and discuss limitations of our work.
Evaluation Datasets that Reflect Real Usage Scenarios.
We have shown that speaker verification evaluation is extremely sensitive to the evaluation set. The three evaluation sets that can be constructed from the VoxCeleb 1 dataset induce evaluation bias, and are insufficient for evaluating many real-world speaker verification application scenarios. Representative datasets for evaluating speaker verification performance are thus needed. An appropriate evaluation set should contain utterance pairs that reflect the reality of the application context. This requires guidelines for constructing application-specific utterance pairs for evaluation. In particular, the specification and sufficient representation of subgroups should be considered carefully, as this in itself can introduce a source of bias. Previous research on diversity and inclusion in subgroup selection (Mitchell2020Diversity) presents a starting point that can inform the design of fairer speaker verification evaluation datasets. If evaluation datasets are named to indicate an aspect of their quality, they should do so responsibly so as not to mislead developers. VoxCeleb 1-E, for example, could be named the easy, and VoxCeleb 1-H the heterogeneous evaluation set.
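Guidelines for application-specific utterance pairs could start from a trial generator that constrains impostor trials to a shared usage context. A sketch; the metadata schema and context labels are assumptions for illustration, not the VoxCeleb trial format:

```python
import itertools
import random

def context_trial_pairs(speakers, seed=0):
    """Generate same-context evaluation trials.

    `speakers` maps speaker_id -> {"context": ..., "utts": [...]}. Target
    trials (label 1) pair utterances of the same speaker; impostor trials
    (label 0) pair different speakers that share a context (e.g. the same
    household or nationality), mirroring the confusable speakers a
    deployed system must actually distinguish."""
    rng = random.Random(seed)
    trials = []
    # Target trials: two utterances from the same speaker.
    for sid, info in speakers.items():
        u1, u2 = rng.sample(info["utts"], 2)
        trials.append((1, u1, u2))
    # Impostor trials: pairs of different speakers with the same context.
    for (s1, i1), (s2, i2) in itertools.combinations(speakers.items(), 2):
        if i1["context"] == i2["context"]:
            trials.append((0, rng.choice(i1["utts"]), rng.choice(i2["utts"])))
    return trials
```

Compared to randomly generated pairs, such context-constrained impostor trials produce harder, more realistic evaluation sets for geographically bound applications such as in-home voice assistants.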
Evaluation Metrics that Consider the Consequences of Errors.
Considering the consequences of errors across application contexts is necessary to reduce deployment bias in speaker verification. Speaker verification evaluation and testing should carefully consider the choice and parameters of error metrics to present robust evaluations and comparisons across models for specific application contexts. To this end, guidelines are needed for designing application-specific error metrics, and for evaluating bias with these metrics. Such guidelines should determine acceptable FPR and FNR ranges, and guide normative decisions pertaining to the selection of weights for cost functions. Alternative evaluation metrics, such as those used for privacy-preserving speaker verification (Maouche2020comparative; Nautsch2020privacy), should also be studied for evaluating bias. Lastly, to assess aggregation bias in speaker verification, disaggregated evaluation across speaker subgroups is needed. DET curves, which have a long history in speaker verification evaluation, should be used for visualizing model performance across speaker subgroups. Additionally, error metrics should be computed and compared across subgroups to mitigate evaluation bias.
Approaches for Mitigating Bias.
While evaluation, representation and historical bias can be addressed with new methods for collecting and generating datasets, and with application-specific evaluation practices, interventions are also needed to address learning, deployment, aggregation and measurement bias. We suggest some interventions for mitigating these types of bias. Speaker verification performance will improve for all subgroups if each is tuned to its own threshold rather than to the overall threshold. Developing engineering approaches to dynamically select the optimal threshold for subgroups or individual speakers will thus improve the performance of speaker verification in deployed applications. As subgroup membership is typically not known at run time, this is a challenging task for future work. Previous work in keyword spotting, a different automated speech processing task, has shown that performance disparities across speaker subgroups can be attributed to model input features and the data sample rate at which the voice signal was recorded (Toussaint2021Characterising). Assessing sources of measurement bias due to data processing and input features, and studying approaches for mitigating measurement bias, thus remain important areas for future work. Further research is also required to study bias in learning algorithms in cloud-based and on-device settings.
Our work presents the first study of bias in speaker verification development and does not study bias in commercial products, which we position as an area for future work. Our aim was to study typical development and evaluation practices in the speaker verification community, not to compare speaker verification algorithms. We thus designed a case study with a confined scope, using publicly available benchmark models as black-box predictors. Our findings should be interpreted with this in mind, and not be seen as a generic evaluation of all speaker verification models. We constructed demographic subgroups based on those included in the VoxCeleb 1-H evaluation set. Some subgroups thus have insufficient sample sizes, which affects the quality of our empirical evaluation for these subgroups. However, as discussed in detail in §5.3, small subgroups of minority speakers are in themselves a source of representation bias that needs to be addressed. We observed that the performance differences that we identified between male and female speakers, and across nationalities, persist across minority and dominant subgroups. Finally, as the language of speakers in the dataset is not specified, we assumed all speech utterances to be in English, and used nationality as a proxy for accent rather than language. Future work should investigate bias in speaker verification across languages, and determine accent in a more robust manner.
Automated speaker recognition is deployed on billions of smart devices and in services such as call centres. In this paper we study bias in speaker verification, the biometrics of voice, which is a core task in automated speaker recognition. We present an in-depth empirical and analytical study of bias in a benchmark speaker verification challenge, and show that bias exists at every stage of the machine learning development workflow. Most affected by bias are female speakers and non-US nationalities, who experience significant performance degradation due to aggregation, learning, evaluation, deployment, historical and representation bias. Our findings lay a strong foundation for future work on bias and fairness in automated speaker recognition.
Acknowledgements. This research was partially supported by projects funded by the EU Horizon 2020 research and innovation programme under GA No. 101021808 and GA No. 952215.
Appendix A Appendix
a.1. Speaker Recognition Evaluation
Detection Error Trade-off Curves
Summary of Technical Details of VoxCeleb SRC Baseline Models
| Alternative name in publication: | performance optimised model, H/ASP | Fast ResNet-34 |
| Additional training procedures: | data augmentation (noise & room impulse response) | - |
| Parameters: | 8 million | 1.4 million |
| Frame-level aggregation: | attentive statistical pooling | self-attentive pooling |
| Loss function: | angular prototypical softmax loss | angular prototypical loss |
| Input features: | 64 dim log Mel filterbanks | 40 dim Mel filterbanks |
| Window (width x step): | 25ms x 10ms | 25ms x 10ms |
| Optimized for: | predictive performance | fast execution |
a.2. Evaluation Bias
a.3. Learning Bias
a.4. Aggregation and Post-processing Deployment Bias
Figure 10 shows DET curves and thresholds for ResNetSE34V2 for male and female speakers of Indian, UK and USA nationalities evaluated on VoxCeleb 1-H. We use the following conventions: triangle markers show the FPR and FNR at the overall minimum threshold $\theta_{min}^{overall}$, cross markers show the FPR and FNR at the subgroup minimum threshold $\theta_{min}^{subgroup}$, and dotted black lines and markers are used for the overall DET curve and threshold. The DET curve of female Indian speakers lies far above the overall aggregate, indicating that irrespective of the threshold, the model will always perform worse than aggregate for this subgroup. In the operating region around the tuned thresholds, the model also performs worse for female speakers from both the UK and the USA. Being tuned to $\theta_{min}^{overall}$ does not affect the FNR and improves the FPR of USA female and male speakers. For other speaker subgroups, especially UK females and Indian females and males, either the FPR or the FNR deteriorates significantly when tuned to the overall minimum. For all subgroups, the threshold at the subgroup minimum, $\theta_{min}^{subgroup}$, shifts the FPR and FNR closer to those of the overall minimum threshold, suggesting that performance will improve when optimising thresholds for subgroups individually.
a.5. Application Context Deployment Bias