
Improving fairness in speaker verification via Group-adapted Fusion Network

Modern speaker verification models use deep neural networks to encode utterance audio into discriminative embedding vectors. During training, these networks are typically optimized to differentiate arbitrary speakers. This process biases the learning of fine voice characteristics towards dominant demographic groups, which can lead to an unfair performance disparity across groups. The disparity is observed especially for underrepresented demographic groups that share similar voice characteristics. In this work, we investigate the fairness of speaker verification models on controlled datasets with imbalanced gender distributions, providing direct evidence that model performance suffers for underrepresented groups. To mitigate this disparity we propose the group-adapted fusion network (GFN), a modular architecture based on group embedding adaptation and score fusion. We show that our method alleviates model unfairness by improving speaker verification both overall and for individual groups. Given imbalanced group representation in training, our proposed method achieves relative reductions in overall equal error rate (EER) of 9.6% to 29.0%, reduces minority group EER by 13.7% to 18.6%, and narrows EER disparity by 20.0% to 25.4%, compared to baselines. The method is applicable to other types of training data skew in speaker recognition systems.





1 Introduction

A speaker verification system answers the question of who is speaking based on a recording of a spoken utterance. With smart home and mobile applications becoming more ubiquitous, speaker verification systems are playing an important role in enabling convenient and secure access to personalized services through natural conversational interactions, such as playing one’s favorite music, checking one’s calendar, and conducting financial transactions via voice commands. Users need to be able to count on such personalization features working reliably regardless of the speaker’s linguistic or demographic background.

Modern deep speaker verification models, such as d-vectors [1, 2, 3, 4], x-vectors [5, 6], and other variants [7, 8, 9], are typically trained on large datasets to minimize average speaker identification loss. Such a training paradigm can cause models to overlook distinctive voice characteristics for underrepresented groups (such as gender groups, nonnative speakers, or regional accents) in the training data. The resulting lower verification performance for minority groups affects their fair access to services enabled by voice verification technologies. Meanwhile, common speaker verification performance metrics typically measure the overall model performance across all speakers and do not reflect equity of performance over different demographic groups. Prior work has reported that insufficient training data from minority demographic groups could impair performance fairness in state-of-the-art automated speech recognition systems and speaker verification models [10, 11]. Similar fairness issues have also been identified in other areas [12], such as face recognition [13] and recommender systems [14].

A common way to improve model fairness is to collect more annotated training data from minority groups, which can be prohibitively expensive and time-consuming. Here we propose an algorithmic approach to overcome fairness issues arising from typical speaker verification systems, consisting of two major components, jointly called group-adapted fusion network (GFN). First, we use group-wise embedding adaptation to improve the front-end embedding encoder’s ability to extract better discriminative features within a demographic group with similar voice characteristics. Second, we fuse scores from different embedding encoders via learnable weights to improve generalization and prevent overfitting. Embedding adaptation has been applied in few-shot learning or transfer learning settings to refine task-specific features, with applications in computer vision [15, 16] and speech [17], among others.

We illustrate the fairness problem and demonstrate the effectiveness of our solution for the case of gender imbalance in the training data. By constructing training sets with various degrees of imbalance, as well as metrics for performance disparity, we aim to systematically probe model unfairness, understand its causes, and offer a generalizable solution. Although our approach is only evaluated on imbalanced gender groups, it is applicable to other demographic groups (e.g., children, the elderly) affected by underrepresentation.

Our main contributions are as follows: We provide direct evidence that imbalanced group representation in speaker recognition training sets can lead to model unfairness. We propose a general, modular architecture based on group embedding adaptation and score fusion that alleviates model unfairness. Our approach also comes with a set of tools to rigorously inspect, evaluate, and analyze model unfairness and proposed solutions. Finally, the training and evaluation datasets used in this work are available for other uses.¹

2 Group-adapted Fusion Network

Figure 1: Overview of the proposed group-adapted fusion network, which consists of front-end group adaptation encoders (left) that extract group-wise embeddings and a back-end score fusion model (right) that fuses scores from all embedding encoders.

The architecture of the group-adapted fusion network is shown in Figure 1. It consists of front-end encoders to extract base and group-adapted embeddings and a back-end score fusion model to fuse base scores and group-specific scores to generate the fused score. The core idea is motivated by ensemble learning [18] and mixture-of-experts [19, 20], where multiple expert networks model complementary data characteristics and are fused at the score level. Another precedent is speaker verification based on adaptation transforms that are gender-specific but applied uniformly to all speakers [21].

2.1 Group Embedding Adaptation

The base and group-adapted encoders are three separate deep neural networks with the same architecture (ResNet-34 variants [22]). All are trained with metric learning objectives [22]. We first train the base encoder with gender-mixed data to capture generic voice characteristics, and then fine-tune the pre-trained base encoder with gender-specific training data. In the training stage, inputs to each encoder are mini-batches (batch size $N$) of audio features $\mathbf{X} \in \mathbb{R}^{N \times T \times d}$ (e.g., log Mel filter banks), where $T$ is the number of audio frames and $d$ is the feature dimension. Outputs from each encoder are length-normalized embeddings $\mathbf{e} \in \mathbb{R}^{N \times D}$, where $D$ is the embedding dimension. Embeddings are fed into the metric loss function for network training. We use $\mathbf{e}_b$, $\mathbf{e}_f$, and $\mathbf{e}_m$ to denote the base embeddings, female-adapted embeddings, and male-adapted embeddings produced by the corresponding encoders.

2.2 Score Fusion

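We next leverage a score fusion model to aggregate the three embeddings and predict whether an utterance pair is from the same speaker. In the score fusion stage, the inputs are utterance-level base embedding pairs $(\mathbf{e}_b^{1}, \mathbf{e}_b^{2})$ and female- and male-adapted embedding pairs $(\mathbf{e}_f^{1}, \mathbf{e}_f^{2})$ and $(\mathbf{e}_m^{1}, \mathbf{e}_m^{2})$. First, we compute cosine similarities $s_b$, $s_f$, and $s_m$ between pairs for base, female-adapted and male-adapted embeddings, respectively:

$$ s_k = \cos\left(\mathbf{e}_k^{1}, \mathbf{e}_k^{2}\right) = \frac{\mathbf{e}_k^{1} \cdot \mathbf{e}_k^{2}}{\lVert \mathbf{e}_k^{1} \rVert \, \lVert \mathbf{e}_k^{2} \rVert}, \qquad k \in \{b, f, m\}. $$

We then use a score fusion model $g_\theta$ to aggregate the three similarity scores into the fused score $s = g_\theta(s_b, s_f, s_m)$ for speaker verification. We employ a multilayer perceptron (MLP) with the three similarity scores as inputs and $\theta$ as the learnable model weights. To train the score fusion model, we construct positive and negative training pairs from the training set for contrastive learning. Positive utterance pairs are sampled from the same speaker; negative pairs are formed by sampling utterances from different speakers of both same and different gender. We train the fusion model with binary cross-entropy loss

$$ \mathcal{L} = -\frac{1}{P} \sum_{i=1}^{P} \left[ y_i \log s_i + (1 - y_i) \log (1 - s_i) \right], $$

where $P$ is the number of training pairs, $s_i$ is the fused score output of the $i$-th utterance pair, and $y_i$ is the corresponding label ($y_i = 1$ indicates the paired utterances are from the same speaker, $y_i = 0$ otherwise).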
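The scoring path described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`cosine_similarity`, `init_fusion_mlp`, `fusion_mlp`, `bce_loss`), the random weights, and the toy embeddings are hypothetical, and the 3 → 32 → 32 → 1 layer split is one reading of the "three-layer MLP" with 32-unit ReLU hidden layers and a sigmoid output described in Section 3.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (N, D) embedding arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def init_fusion_mlp(rng, hidden=32):
    """Random weights for a 3 -> 32 -> 32 -> 1 MLP (layer split is an assumption)."""
    sizes = [3, hidden, hidden, 1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def fusion_mlp(params, scores):
    """Map (N, 3) similarity scores to (N,) fused scores in (0, 1)."""
    h = scores
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)        # ReLU hidden layers
    w, b = params[-1]
    logits = (h @ w + b)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid output

def bce_loss(fused, labels, eps=1e-12):
    """Mean binary cross-entropy between fused scores and 0/1 labels."""
    fused = np.clip(fused, eps, 1.0 - eps)
    return -np.mean(labels * np.log(fused) + (1 - labels) * np.log(1 - fused))

rng = np.random.default_rng(0)
n_pairs, dim = 8, 512
groups = ("base", "female", "male")
# Toy stand-ins for the embedding pairs produced by the three encoders.
pairs = {g: (rng.normal(size=(n_pairs, dim)), rng.normal(size=(n_pairs, dim)))
         for g in groups}
scores = np.stack([cosine_similarity(*pairs[g]) for g in groups], axis=1)
labels = rng.integers(0, 2, size=n_pairs).astype(float)

params = init_fusion_mlp(rng)
fused = fusion_mlp(params, scores)
loss = bce_loss(fused, labels)
```

In an actual training loop the weights in `params` would be updated by gradient descent on `loss`; the sketch only shows the forward pass and the loss value.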

During inference, given a pair of utterances for verification, we first extract their base, female- and male-adapted embeddings from the three encoders. The cosine similarities of the three embedding pairs are then fed into the score fusion model, producing the fused score $s$. If $s$ is greater than a predefined threshold, we predict that the two utterances are from the same speaker; otherwise, they are deemed to be from different speakers.

3 Data and Experiments

Data. We use VoxCeleb1 [3] and VoxCeleb2 [4] datasets, which have gender information for each speaker, to construct customized training and evaluation datasets. We constructed training datasets with different gender ratios from subsets of the VoxCeleb2 dataset. These subsets have the same total of 2,500 speakers but with different numbers of male and female speakers. The female-to-male (F:M) gender ratio ranges from 9:1 to 1:9, as shown in Table 1. We will refer to them as VoxCeleb2-GRC (gender ratio controlled) datasets. To accurately evaluate model fairness based on gender, we also constructed an evaluation dataset based on VoxCeleb1. We call it the VoxCeleb1-F (Fairness) dataset. VoxCeleb1-F strictly controls for the presence of positive and negative trials with same or different genders, as shown in Table 2. We use F and M to denote the female and male groups, respectively.

F:M ratio   Female speakers   Male speakers   Female utterances   Male utterances
9:1         2,250             250             387,322             45,181
4:1         2,000             500             341,500             95,157
1:1         1,250             1,250           214,919             228,823
1:4         500               2,000           86,616              372,133
1:9         250               2,250           43,482              419,853
Table 1: VoxCeleb2-GRC datasets with different gender ratios.
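As a rough illustration of how such ratio-controlled speaker subsets could be drawn, the sketch below splits a fixed budget of 2,500 speakers according to an F:M ratio and samples speaker IDs from each gender pool. The function names, the random seed, and the placeholder speaker IDs are assumptions made for this example; the paper does not describe its sampling code.

```python
import random

def speakers_per_group(female_parts, male_parts, total=2500):
    """Split `total` speakers according to an F:M ratio such as 9:1."""
    n_female = total * female_parts // (female_parts + male_parts)
    return n_female, total - n_female

def sample_subset(female_ids, male_ids, female_parts, male_parts,
                  total=2500, seed=0):
    """Draw a gender-ratio-controlled speaker subset without replacement."""
    rng = random.Random(seed)
    n_f, n_m = speakers_per_group(female_parts, male_parts, total)
    return rng.sample(female_ids, n_f), rng.sample(male_ids, n_m)

ratios = [(9, 1), (4, 1), (1, 1), (1, 4), (1, 9)]
counts = {f"{f}:{m}": speakers_per_group(f, m) for f, m in ratios}

# Hypothetical speaker ID pools; real IDs would come from the dataset metadata.
female_pool = [f"F{i:05d}" for i in range(3000)]
male_pool = [f"M{i:05d}" for i in range(3000)]
females, males = sample_subset(female_pool, male_pool, 1, 9)
```

With these definitions, `counts` reproduces the speaker columns of Table 1 for all five ratios.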

Model Training. All ResNet-34 encoders were trained on the VoxCeleb2-GRC datasets with 40-dimensional log Mel filter bank features. Models were trained for 300 epochs on a single GPU with angular prototypical loss [22] and minibatches of 400 2-second utterance segments. We used the Adam optimizer with a learning-rate decay factor of 0.95 per epoch. The output embedding dimension is 512. The group-adapted encoders were obtained by fine-tuning the base encoder on the single-gender training subsets with the same training parameters for an additional 300 epochs. The score fusion model is a three-layer MLP, where each hidden layer has 32 units with ReLU activation. The fusion network outputs a single fused score based on a sigmoid function. We trained the score fusion model with 200,000 randomly sampled positive and negative pairs constructed from VoxCeleb2. Positive and negative training pairs were combined and shuffled on the fly during training, with a minibatch size of 1,000 3-second segments, 50 training epochs, and a learning rate of 0.001.

Evaluation. False accept rate (FAR) is the fraction of imposter speakers’ utterances that are falsely accepted; false rejection rate (FRR) is the fraction of true speakers’ utterances that are falsely rejected. Equal error rate (EER) is the rate at which FAR and FRR are equal. We mainly use EER to characterize the performance of speaker verification models on different subsets of speakers. To probe model fairness according to gender, we define group-wise EERs: $\mathrm{EER}[F]$, where the EER is derived from FAR and FRR when the true speakers are female speakers; $\mathrm{EER}[M]$ is defined correspondingly. We denote the overall EER as $\mathrm{EER}[All]$. Given the group-wise EERs, we can characterize the model unfairness across groups via the disparity score (DS), $\mathrm{DS} = \left| \mathrm{EER}[F] - \mathrm{EER}[M] \right|$. The usage of different trials for computing these EER metrics is depicted in Table 2. Cosine similarity scores or their fused versions are used to compute EERs. The cosine similarity between two utterances is evaluated according to the protocol in [4]. For each pair, ten 3-second temporal crops are sampled from each utterance and used to compute the mean similarity between all crops. EERs are reported in percent (%).
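The metrics above can be made concrete with a small NumPy sketch. It is an illustrative approximation rather than the evaluation code used in the paper: `equal_error_rate` sweeps the observed scores as thresholds and reports the point where FAR and FRR are closest, whereas evaluation toolkits often interpolate the ROC curve instead. The function names and the toy scores are assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER (in %) by sweeping every observed score as a threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)  # sorted unique scores
    # FAR: impostor trials accepted; FRR: target trials rejected.
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0

def disparity_score(eer_female, eer_male):
    """Absolute gap between the two group-wise EERs."""
    return abs(eer_female - eer_male)

# Toy trials: label 1 = same speaker, 0 = different speakers.
separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])

# Group-wise EERs quoted in Section 4.1 for Q/RN at F:M = 9:1.
ds_q_rn = disparity_score(3.52, 7.22)
```

To obtain group-wise EERs, the same function would be applied to the subset of trials assigned to each group in Table 2.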

4 Results

4.1 Model Fairness with Imbalanced Group Representation

We first investigate the impact of an imbalanced training dataset on model fairness in typical deep speaker verification models. We consider two state-of-the-art deep neural networks [22]: a quarter-channel ResNet-34 baseline (Q/RN) with 1.4M parameters and a larger half-channel ResNet-34 baseline (H/RN) with 5.6M parameters. The two baselines are trained on the VoxCeleb2-GRC datasets and evaluated on the VoxCeleb1-F dataset.

As shown in Figure 2 (all results in tabular form are also available on the website given in Footnote 1), when the total number of speakers in the training set is kept the same, the majority group has a better group-wise EER than the minority group. That is, increasing dominance of one gender group (e.g., from ratio 4:1 to 9:1) leads to an increasing performance gap, or model unfairness, indicated by increasing DS values (e.g., from 1.71 to 3.70). For example, when F:M = 9:1 in the training set, the Q/RN baseline has a female-group EER of 3.52, which is much better than the male-group EER of 7.22. A similar gap between female EER and male EER is observed in the setting of F:M = 1:9. When the training set is balanced, both baselines achieve roughly equal group-wise EERs for both genders, indicated by the lowest DS values. Notably, overall EER increases as the training dataset becomes increasingly imbalanced, even though total training data sizes remain similar. (The differences between operating points for group-wise EERs and overall EER are less than 0.09.)

Of the two baselines, H/RN achieves better group-wise and overall EERs across all gender ratios, and a smaller DS than Q/RN. We attribute the fairness improvement to the larger learning capacity of the H/RN baseline. However, the H/RN baseline, which is four times the size of the Q/RN baseline, provides only a relatively small improvement in group-wise EERs and overall EER (5%).

Gender trials   Trial count   VoxCeleb1-F: [F]   [M]   [All]
Positive F-F    150,000
Negative F-F    150,000
Negative M-F    150,000
Positive M-M    150,000
Negative M-M    150,000
Table 2: Trial statistics in VoxCeleb1-F evaluation datasets.

Figure 2: VoxCeleb1-F evaluation results from models trained on VoxCeleb2-GRC datasets. Q/RN, H/RN: baseline ResNet models; GFN: gender-adapted fusion of networks.

4.2 Improving Fairness via Group-adapted Fusion Network

Now we consider the performance of the proposed group-adapted fusion network (GFN) model. The GFN has 4.3M parameters, which is around 3 times that of Q/RN and smaller than H/RN. Compared to the two baselines (in Figure 2), the GFN achieves better group-wise and overall EERs regardless of gender group imbalance in the training sets. In particular, GFN achieves a female group EER of 3.12 (realizing 11.4% and 6.9% improvement relative to Q/RN and H/RN, resp.), and a male group EER of 5.88 (18.6%/13.7% improvement) in the F:M=9:1 setting. For overall EERs, GFN achieves an EER of 5.84 (13.9%/9.6% improvement) and 5.08 (28.6%/29.0% improvement) in the F:M = 9:1 and 1:9 settings, respectively.

Note that GFN offers larger relative improvements for the minority group than for the majority group, thereby reducing the model unfairness, or disparity score, in the imbalanced cases. Specifically, GFN achieves a DS of 2.76 (25.4%/20.2% improvement relative to Q/RN and H/RN) and 2.23 (24.2%/21.2% relative improvement) in the F:M = 9:1 and 1:9 settings, respectively.
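The relative improvements quoted for the F:M = 9:1 setting follow from the group-wise EERs reported in Sections 4.1 and 4.2. The short check below recomputes them for the Q/RN comparison; the helper name `relative_improvement` is introduced here for illustration, and only numbers stated in the text are used.

```python
def relative_improvement(baseline, improved):
    """Relative reduction in percent, where lower values are better."""
    return 100.0 * (baseline - improved) / baseline

# Group-wise EERs (%) at F:M = 9:1, as quoted in the text.
q_rn = {"F": 3.52, "M": 7.22}
gfn = {"F": 3.12, "M": 5.88}

# Disparity score: absolute gap between female and male EER.
ds_q_rn = abs(q_rn["F"] - q_rn["M"])   # 3.70
ds_gfn = abs(gfn["F"] - gfn["M"])      # 2.76

female_gain = round(relative_improvement(q_rn["F"], gfn["F"]), 1)
male_gain = round(relative_improvement(q_rn["M"], gfn["M"]), 1)
ds_gain = round(relative_improvement(ds_q_rn, ds_gfn), 1)
```

The minority (male) group gains 18.6% against 11.4% for the majority (female) group, which is why the disparity score shrinks by 25.4%.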

4.3 Embedding Visualization and Analysis

To shed light on the cause of model unfairness and the effects of GFN modeling, we first use t-SNE to visualize utterance embeddings (10,000 female and male base embeddings from Q/RN trained on the F:M = 1:1 dataset) in a 2D space. As shown in Figure 3(a), utterances of the same gender tend to aggregate in separate regions of the embedding space, which is consistent with the perception that same-gender voices sound relatively similar compared to between-gender differences. We hypothesize that in an imbalanced training setting, adapting encoders separately for different genders allows each encoder to improve the representation of different subregions of the embedding space, thus alleviating the bias toward the dominant group found in the single-model framework.

Figure 3: Visualization of learned speaker embeddings using the t-SNE method. (a) Male and female speaker utterance embeddings from Q/RN model; (b, c) Utterance embeddings of 8 randomly-sampled speakers from Q/RN model (b) and GFN model (c).

The t-SNE method can visualize the benefit of adapting and fusing group-specific embeddings through the clustering and separation of embeddings from different speakers. Figures 3(b) and 3(c) show the low-dimensional utterance embeddings from the baseline Q/RN encoder and from the concatenated three GFN embeddings, respectively. The adapted encoders tend to generate more compact speaker clusters with more separation between speakers, compared to the baseline encoder. The compactness of speaker clusters can also be quantified via silhouette coefficients (SC) [23] used in clustering analysis, implemented by Scikit-learn [24]. SC measures how similar an utterance is to its own speaker cluster (compactness) compared to other speaker clusters (separation); a higher SC is better. We compute the SC from 80 utterance embeddings of 8 randomly-sampled speakers. The mean SC from the Q/RN baseline is 0.64, while the mean SC from the adapted encoder is 0.83, indicating that the adapted encoder extracts better embeddings for speaker differentiation purposes.
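To make the silhouette coefficient concrete, the sketch below computes the mean SC directly with NumPy. It is a hand-written illustration, not the Scikit-learn routine the paper used (Scikit-learn provides `sklearn.metrics.silhouette_score` for this purpose). The Euclidean distance, the function name `mean_silhouette`, and the synthetic embeddings are assumptions; the source does not state which distance metric was applied.

```python
import numpy as np

def mean_silhouette(embeddings, speaker_ids):
    """Mean silhouette coefficient with Euclidean distances."""
    x = np.asarray(embeddings, dtype=float)
    ids = np.asarray(speaker_ids)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    speakers = np.unique(ids)
    values = []
    for i in range(len(x)):
        same = ids == ids[i]
        n_same = same.sum() - 1
        if n_same == 0:
            values.append(0.0)              # singleton clusters score 0
            continue
        # a: mean distance to own cluster (self-distance is 0, so the sum is safe)
        a = dist[i, same].sum() / n_same
        # b: mean distance to the nearest other speaker cluster
        b = min(dist[i, ids == s].mean() for s in speakers if s != ids[i])
        values.append((b - a) / max(a, b))
    return float(np.mean(values))

# Synthetic stand-in: 8 speakers x 10 utterances, tightly clustered.
rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 16))
ids = np.repeat(np.arange(8), 10)
tight = centers[ids] + 0.05 * rng.normal(size=(80, 16))
sc_tight = mean_silhouette(tight, ids)

# Shuffling the speaker labels destroys the cluster structure.
sc_shuffled = mean_silhouette(tight, rng.permutation(ids))
```

Values close to 1 indicate compact, well-separated speaker clusters, which is the direction of the 0.64 to 0.83 change reported above.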

4.4 Ablation Studies

Several ablation studies reveal the benefits of different aspects of our model, as summarized in Figure 4. We first examine the impact of using only one adapted encoder without fusing other encoders. Using a single male-adapted encoder (M-FT) typically degrades the overall EER, particularly the female group-wise EER although it might improve the male group-wise EER slightly when the male group is the minority. A similar phenomenon is observed when only using a female-adapted encoder (F-FT). Since male and female voices have distinct characteristics, we hypothesize that fine-tuning an encoder to one gender subset can cause the encoder to forget the learned features of the other gender. Therefore, it is necessary to fuse separately-adapted encoders to achieve better performance. To further verify the efficacy of the score fusion strategy we compare GFN’s 3-layer MLP score fusion model with an equal-weight score (ES) fusion strategy (i.e., 1/3 for each input score). ES fusion achieves better group-wise EER than the baselines for gender-imbalanced settings, but gives overall worse performance than GFN.

Figure 4: VoxCeleb1-F evaluation results from models trained on VoxCeleb2-GRC datasets under various ablation study conditions. Black dashed lines are results from our GFN models.

Additionally, we consider an alternative embedding adaptation method named “gender batching with weighted loss” (GBWL), which fine-tunes the Q/RN base model by 1) alternating all-female and all-male mini-batches and 2) applying a reciprocal weight to the loss of the majority-gender mini-batches (e.g., when training on F:M = 1:9, we scale the loss from all-male mini-batches by 1/9). The GBWL method significantly degrades group-wise and overall EER in the baseline Q/RN (and similarly for H/RN). This shows the benefit and necessity of performing embedding adaptation using separate networks.
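The GBWL weighting scheme can be sketched as follows. This is only an illustration of the two rules stated above (alternating single-gender mini-batches, reciprocal weight on the majority gender); the function names and the representation of a training schedule as a list of `(gender, weight)` tuples are assumptions, not the authors' code.

```python
from itertools import cycle, islice

def gbwl_loss_weights(n_female, n_male):
    """Reciprocal loss weight for the majority gender, 1.0 for the minority."""
    if n_female >= n_male:
        return {"F": n_male / n_female, "M": 1.0}
    return {"F": 1.0, "M": n_female / n_male}

def gbwl_schedule(weights, n_steps):
    """Alternate all-female and all-male mini-batches with their loss weights."""
    return [(g, weights[g]) for g in islice(cycle(("F", "M")), n_steps)]

# F:M = 1:9 training set from Table 1: 250 female and 2,250 male speakers.
weights = gbwl_loss_weights(250, 2250)
schedule = gbwl_schedule(weights, 4)
```

Here the all-male mini-batches receive a weight of 1/9, matching the example in the text, while all-female mini-batches keep a weight of 1.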

We also investigated the gated mixture-of-experts (MOE) [20] strategy for fusing scores, which incorporates both adapted embeddings and similarity scores as inputs. However, gated-MOE achieves worse performance in the current score fusion training setting.

5 Conclusion

We have analyzed the effect of imbalanced training data on group fairness of modern speaker verification models, by manipulating the gender balance in a VoxCeleb-based data set. The results show that there is a direct relationship between training set imbalance and verification accuracy on the test set, both overall and for the underrepresented group. To improve performance fairness we developed a modular classifier architecture based on group-adapted encoders that are fused at the score level. Our approach achieves improvements in both overall and group-wise metrics, and reduces the performance gap between groups in scenarios with imbalanced training data. Specifically, our proposed method achieves relative reductions in overall EER of 9.6% to 29.0%, in minority group EER of 13.7% to 18.6%, and narrows EER disparity by 20.0% to 25.4%, compared to baselines. Note that our approach can be generalized to more problematic scenarios, such as children and elderly demographic groups, by incorporating and fusing more group-adapted encoders. Additionally, robust backend scoring approaches such as PLDA [25, 26] are worth exploring to alleviate model unfairness. Finally, performing group-adaptation and fusion on the training dataset might introduce additional overfitting risk, which should be taken into account when applying a trained GFN to out-of-domain datasets.

6 Acknowledgments

We thank our colleague Victor Rozgic and Alexa Speaker Understanding team members and managers for their input and support.


  • [1] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in Proc. IEEE ICASSP, 2016, pp. 5115–5119.
  • [2] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. IEEE ICASSP, 2018, pp. 4879–4883.
  • [3] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv:1706.08612, 2017.
  • [4] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” arXiv:1806.05622, 2018.
  • [5] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017, pp. 999–1003.
  • [6] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
  • [7] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv:1705.02304, 2017.
  • [8] Daniel Garcia-Romero, Gregory Sell, and Alan McCree, “Magneto: X-vector magnitude estimation network plus offset for improved speaker recognition,” in Proc. Odyssey Speaker and Language Recognition Workshop, 2020, pp. 1–8.
  • [9] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv:2005.07143, 2020.
  • [10] Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel, “Racial disparities in automated speech recognition,” Proc. National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020.
  • [11] Gianni Fenu, Mirko Marras, Giacomo Medda, and Giacomo Meloni, “Fair voice biometrics: Impact of demographic imbalance on group fairness in speaker recognition,” Proc. Interspeech 2021, pp. 1892–1896, 2021.
  • [12] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan, “A survey on bias and fairness in machine learning,” ACM Computing Surveys (CSUR), vol. 54, no. 6, pp. 1–35, 2021.
  • [13] Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang, “Racial faces in the wild: Reducing racial bias by information maximization adaptation network,” in Proc. IEEE/CVF International Conference on Computer Vision, 2019, pp. 692–702.
  • [14] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Li Wei, Yi Wu, Lukasz Heldt, Zhe Zhao, Lichan Hong, Ed H Chi, et al., “Fairness in recommendation ranking through pairwise comparisons,” in Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2212–2220.
  • [15] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha, “Few-shot learning via embedding adaptation with set-to-set functions,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8808–8817.
  • [16] Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang, “Finding task-relevant features for few-shot learning by category traversal,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1–10.
  • [17] Zhenning Tan, Yuguang Yang, Eunjung Han, and A. Stolcke, “Improving speaker identification for shared devices by adapting embeddings to speaker subsets,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2021.
  • [18] Zhi-Hua Zhou, “Ensemble learning,” in Machine Learning, pp. 181–210. Springer, 2021.
  • [19] Michael I Jordan and Robert A Jacobs, “Hierarchical mixtures of experts and the em algorithm,” Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
  • [20] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv:1701.06538, 2017.
  • [21] Andreas Stolcke, Sachin S. Kajarekar, Luciana Ferrer, and Elizabeth Shriberg, “Speaker recognition with session variability normalization based on MLLR adaptation transforms,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 1987–1998, Sept. 2007.
  • [22] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han, “In defence of metric learning for speaker recognition,” in Proc. Interspeech, 2020, pp. 2977–2981.
  • [23] Peter J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
  • [24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [25] Luciana Ferrer, Mitchell McLaren, and Niko Brümmer, “A speaker verification backend with robust performance across conditions,” Computer Speech & Language, vol. 71, pp. 101258, 2022.
  • [26] Shreyas Ramoji, Prashant Krishnan, and Sriram Ganapathy, “NPLDA: A deep neural PLDA model for speaker verification,” in Proc. Odyssey Speaker and Language Recognition Workshop, Tokyo, 2020, pp. 202–209.