Adversarial Attacks on GMM i-vector based Speaker Verification Systems

by   Xu Li, et al.
The Chinese University of Hong Kong

This work investigates the vulnerability of Gaussian Mixture Model (GMM) i-vector based speaker verification (SV) systems to adversarial attacks, and the transferability of adversarial samples crafted from GMM i-vector based systems to x-vector based systems. In detail, we formulate the GMM i-vector based system as a scoring function, and leverage the fast gradient sign method (FGSM) to generate adversarial samples through this function. These adversarial samples are used to attack both GMM i-vector and x-vector based systems. We measure the vulnerability of the systems by the degradation of equal error rate and false acceptance rate. Experimental results show that GMM i-vector based systems are seriously vulnerable to adversarial attacks, and the generated adversarial samples prove to be transferable and pose threats to neural network speaker embedding based systems (e.g. x-vector systems).





1 Introduction

Automatic speaker verification (ASV) systems aim at confirming a spoken utterance against a speaker identity claim. Through decades of development, the speaker verification community has made great progress and applied this technology in many biometric authentication scenarios, such as voice activation on phones and online bank authentication.

However, past research has shown that ASV systems are vulnerable to malicious attacks via spoofing and fake speech, such as imitated [26, 24], replayed [24, 25], synthetic [18, 19] and converted [11] speech. Such spoofing and fake speech are crafted to sound as similar as possible to the target person's voice. Moreover, ASV systems can also be spoofed even when the spoofing speech sounds, to human perception, like the voice of a non-target person. This exposes the systems to other dangerous situations, such as an attacker controlling someone's voice-operated devices in place of the real owner, unbeknownst to him/her. These situations can be achieved by adversarial attacks.

According to [22, 5, 8], deep neural networks (DNNs) with impressive performance can be vulnerable to simple adversarial attacks in many tasks, such as face recognition [1, 17], image classification [13, 2], and speech recognition [3]. However, to the best of our knowledge, the only work that applied adversarial attacks to ASV systems is [12], where the authors verified the vulnerability of an end-to-end ASV system to adversarial attacks. Briefly, there are three representative ASV frameworks: i-vector speaker embedding based systems [4, 10, 16, 7], neural network (NN) speaker embedding based systems [23, 20] and end-to-end approaches [27, 9, 21]. While end-to-end based systems have proved to be vulnerable to adversarial attacks, the robustness of the other approaches, including GMM i-vector based systems and NN speaker embedding based systems (e.g. the x-vector systems we implement), still remains to be explored. Importantly, GMM i-vector based systems are widely applied to biometric authentication, and it is imperative to investigate their robustness to such adversarial attacks.

Adversarial attacks have two main scenarios: white box attack and black box attack. A white box attack allows the attacker to access the complete parameters of the system, and adversarial samples can be constructed by optimizing the input while fixing the system parameters. A black box attacker only has access to the system's input and output, and adversarial samples are crafted from substitute systems. To generate an adversarial sample, a designed subtle perturbation is added to the input so that the system outputs a wrong prediction, while there is nearly no perceptible difference between the spoofing and original input.

This work focuses on the robustness of GMM i-vector systems under adversarial attacks, and the adversarial samples' transferability to x-vector systems. One of the simplest attack methods, i.e. the fast gradient sign method (FGSM) [8], is adopted to generate adversarial samples from GMM i-vector systems. These samples are used to perform white and black box attacks on GMM i-vector systems, and black box attacks on x-vector systems. Our code will be made open-source.
This paper is organized as follows. Section 2 and 3 introduce the ASV systems and FGSM adopted in our experiments, respectively. Experiment setup and results are illustrated in Section 4 and 5. Section 6 concludes this work.

2 Automatic Speaker Verification Systems

In this work, GMM i-vector and x-vector systems are involved in our experiments. Both kinds of systems have two parts: a front-end for utterance-level speaker embedding extraction and a back-end for speaker similarity scoring. In our experiments, we adopt the probabilistic linear discriminant analysis (PLDA) back-end for both kinds of systems.

2.1 Gaussian Mixture Model i-vector extraction

The GMM i-vector extractor [4] is illustrated in Fig. 1. It consists of a Gaussian mixture model-universal background model (GMM-UBM) and a total variability matrix (T matrix). Given the acoustic features of an utterance, the GMM-UBM is used to extract the zero-order (N) and first-order (F̃) statistics by Baum-Welch statistics computation. The statistics are combined with the T matrix to extract the i-vector w as in Eq. 1:

    w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F̃    (1)

where I is the identity matrix and Σ is the covariance matrix of the GMM-UBM.
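To make the i-vector posterior computation of Eq. 1 concrete, the following is a minimal numpy sketch with toy dimensions. The GMM-UBM covariance is assembled here as a diagonal matrix purely for illustration, and all variable names and sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, R = 8, 4, 3                              # mixtures, feature dim, i-vector dim

T = rng.standard_normal((C * F, R))            # total variability (T) matrix
Sigma = np.diag(rng.uniform(0.5, 1.5, C * F))  # UBM covariance (toy, diagonal)
N = np.repeat(rng.uniform(1.0, 10.0, C), F)    # zero-order stats, one per mixture
F_stats = rng.standard_normal(C * F)           # centered first-order stats

# Eq. 1: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F~
Sigma_inv = np.linalg.inv(Sigma)
precision = np.eye(R) + T.T @ Sigma_inv @ (N[:, None] * T)
w = np.linalg.solve(precision, T.T @ Sigma_inv @ F_stats)
print(w.shape)  # the extracted i-vector has dimension R
```

The `N[:, None] * T` term applies the per-component zero-order statistics without materializing a full diagonal matrix, a common trick when C·F is large.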

2.2 x-vector extraction

The x-vector extractor [20] leverages DNNs to produce speaker-discriminative embeddings. It consists of frame- and utterance-level extractors. At the frame level, acoustic features are fed forward through several layers of time delay neural network (TDNN). At the utterance level, a statistics pooling layer aggregates all the frame-level outputs from the last frame-level layer and computes their mean and standard deviation. The mean and standard deviation are concatenated together and propagated through utterance-level layers and finally a softmax output layer. In the testing stage, given the acoustic features of an utterance, the embedding layer output is extracted as the x-vector.
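The statistics pooling step described above can be sketched in a few lines of numpy; the frame count and layer width below are illustrative assumptions rather than the exact architecture of [20].

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 512))  # (num_frames, last frame-level layer dim)

# Statistics pooling: aggregate variable-length frame outputs into one
# fixed-length utterance-level vector by concatenating mean and std dev.
pooled = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
print(pooled.shape)  # twice the frame-level dimension, regardless of num_frames
```

Because the pooled vector's size does not depend on the number of frames, the utterance-level layers that follow can be ordinary fixed-input dense layers.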

2.3 Probabilistic linear discriminant analysis back-end

PLDA is a supervised version of factor analysis [6]. It models an i-vector/x-vector x by Eq. 2:

    x = m + Φ y + ε    (2)

where m is a global bias term, the columns of Φ provide a basis of the speaker-specific subspace, and y is a latent speaker-identity vector. The residual term ε has a Gaussian distribution with zero mean and a full covariance matrix Λ. The model parameters {m, Φ, Λ} are estimated with the expectation-maximization (EM) algorithm on the training set.

Figure 1: The illustration of GMM i-vector extractor

In the testing stage, the score s is estimated as a log likelihood ratio of two conditional probabilities (Eq. 3):

    s = log [ p(x_e, x_t | H_s) / p(x_e, x_t | H_d) ]    (3)

where x_e and x_t are the i-vectors/x-vectors extracted from the enrollment and test utterances, respectively, H_s is the hypothesis that the two utterances belong to the same identity, and H_d is the opposite hypothesis.
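A minimal sketch of the PLDA log likelihood ratio of Eq. 3, assuming centered vectors (m = 0) and using the standard two-covariance Gaussian form of the ratio; all dimensions, names, and the identity residual covariance are toy assumptions.

```python
import numpy as np

def gauss_logpdf(x, cov):
    # Log-density of a zero-mean multivariate Gaussian.
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

rng = np.random.default_rng(0)
d, q = 6, 2
Phi = rng.standard_normal((d, q))   # speaker-subspace basis (toy)
Lam = np.eye(d)                     # residual covariance (toy)
B = Phi @ Phi.T                     # between-speaker covariance
W = B + Lam                         # total covariance of a single vector

def plda_llr(xe, xt):
    # log p(xe, xt | H_s) - log p(xe, xt | H_d): under H_s the two vectors
    # share the latent y, giving cross-covariance B; under H_d they don't.
    z = np.concatenate([xe, xt])
    cov_same = np.block([[W, B], [B, W]])
    cov_diff = np.block([[W, np.zeros((d, d))], [np.zeros((d, d)), W]])
    return gauss_logpdf(z, cov_same) - gauss_logpdf(z, cov_diff)

y = rng.standard_normal(q)
same = plda_llr(Phi @ y + rng.standard_normal(d),
                Phi @ y + rng.standard_normal(d))
diff = plda_llr(Phi @ rng.standard_normal(q) + rng.standard_normal(d),
                Phi @ rng.standard_normal(q) + rng.standard_normal(d))
print(same, diff)
```

In practice the score is computed from precomputed matrix terms rather than by building the 2d-dimensional joint covariances per trial, but the block form above makes the two hypotheses explicit.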


3 Adversarial Samples Generation

In the general case, we denote a system function as f(·; θ) with parameters θ. Given the input x, we search for an adversarial perturbation δ to be added to x so as to maximize the loss function L between the system's prediction and the ground truth y, as shown in Eq. 4:

    δ = argmax_{‖δ‖_p ≤ ε} L(f(x + δ; θ), y)    (4)

Note that the system parameters θ are fixed and we revise the input to maximize the deviation between the system prediction and the ground truth, so that the system gives a wrong prediction. The constraint that the p-norm of δ is within a perturbation degree ε guarantees that humans cannot distinguish the adversarial sample x + δ from the original x.

3.1 Fast gradient sign method

In this work, the fast gradient sign method (FGSM) is adopted to solve Eq. 4, with the norm specialized as the ∞-norm. The solution is given in Eq. 5:

    δ = ε · sign(∇_x L(f(x; θ), y))    (5)

where the sign function takes the sign of the gradient. FGSM perturbs x with a small perturbation degree ε multiplied by the sign of the gradient to generate the adversarial sample x + δ.
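The one-step update of Eq. 5 can be sketched as follows. Here the "system" is a stand-in linear score whose gradient is known in closed form, purely for illustration; in the paper's setting the gradient would come from differentiating the actual scoring pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(100)    # stand-in system parameters (kept fixed)
x = rng.standard_normal(100)        # original input features
eps = 0.3                           # perturbation degree

# For the toy score theta @ x, the gradient w.r.t. x is simply theta.
grad = theta
x_adv = x + eps * np.sign(grad)     # Eq. 5: one-step FGSM update

# The L_inf constraint holds by construction: no feature moves more than eps.
print(np.max(np.abs(x_adv - x)), theta @ x_adv - theta @ x)
```

Because every coordinate moves by exactly ±ε in the direction of the gradient sign, the score shift for this linear toy is ε · Σ|θ_i|, the largest increase achievable under the L_inf budget.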

3.2 Problem formulation & evaluation metrics

In this work, we generate adversarial samples from GMM i-vector systems. We denote the whole system by two parts: an i-vector extraction function f(·; θ₁) with parameters θ₁ and a PLDA scoring function S(·, ·; θ₂) with parameters θ₂. The acoustic feature sequence of each utterance is denoted by x. Normalization steps, such as i-vector normalization, are implicitly included within the functions.

In the testing stage, each trial consists of one enrollment utterance x_e and one test utterance x_t. To mimic a realistic model attack, we fix x_e and the system parameters θ₁, θ₂, and revise x_t of all trials to perform adversarial attacks. For the GMM i-vector framework, there is no explicit ground truth and loss function for each output score, so we optimize the output score according to whether the trial is a target or non-target trial. For target trials, i.e. the enrollment and test utterances belong to one person, we minimize the similarity score, while for non-target trials, we maximize it. The problem formulation and its solution are shown in Eq. 6, 7 and 8:

    δ = argmax_{‖δ‖_∞ ≤ ε} k · S(f(x_e; θ₁), f(x_t + δ; θ₁); θ₂)    (6)
    δ = ε · sign(∇_{x_t} k · S(f(x_e; θ₁), f(x_t; θ₁); θ₂))    (7)
    x_t^adv = x_t + δ    (8)

where k = −1 for target trials and k = +1 for non-target trials.
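The trial-dependent attack direction can be illustrated with a toy sketch: the test features are pushed to lower the score on target trials and raise it on non-target trials. The gradient stand-in `v` and the sign variable `k` are illustrative assumptions, not the paper's actual scoring pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(50)        # stand-in for the score gradient w.r.t. x_t
x_t = rng.standard_normal(50)      # test-utterance features
eps = 0.5                          # perturbation degree

for is_target in (True, False):
    k = -1.0 if is_target else 1.0        # minimize vs. maximize the score
    delta = eps * np.sign(k * v)          # FGSM step on k * score
    x_adv = x_t + delta
    shift = v @ (x_adv - x_t)             # first-order score change
    assert (shift < 0) == is_target      # score drops only for target trials
```

Flipping the single scalar k is enough to repurpose the same FGSM machinery for both trial types, which is why no explicit loss function needs to be defined per score.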


In this work, we measure the vulnerability of the systems to adversarial attacks by the degradation of the systems' equal error rate (EER) and false acceptance rate (FAR). EER is the rate at which the false acceptance and false rejection rates are equal. It reflects the balanced system error rate in terms of acceptance and rejection. The FAR threshold is fixed as the EER threshold of the system without attack. FAR reflects the prediction error rate over non-target trials.
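The two metrics can be computed from score lists as in the following sketch, which uses synthetic Gaussian scores; the threshold sweep is a simple illustrative implementation, not the paper's evaluation code.

```python
import numpy as np

def eer_and_threshold(target, nontarget):
    # Sweep candidate thresholds and return the operating point where the
    # false rejection rate (FRR) and false acceptance rate (FAR) cross.
    ts = np.sort(np.concatenate([target, nontarget]))
    frr = np.array([np.mean(target < t) for t in ts])
    far = np.array([np.mean(nontarget >= t) for t in ts])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, ts[i]

rng = np.random.default_rng(0)
target = rng.normal(2.0, 1.0, 1000)       # same-speaker trial scores (toy)
nontarget = rng.normal(-2.0, 1.0, 1000)   # different-speaker trial scores (toy)
eer, thr = eer_and_threshold(target, nontarget)

# FAR of an attacked system, evaluated at the clean system's EER threshold:
attacked_nontarget = nontarget + 3.0      # attack pushes impostor scores up
far_attacked = np.mean(attacked_nontarget >= thr)
print(eer, far_attacked)
```

Fixing the threshold at the clean EER point, as done here, is what lets the attacked FAR be compared directly against the unattacked error rate.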

4 Experiment Setup

The dataset used in this experiment is Voxceleb1 [15], which consists of short clips of human speech. There are 148,642 utterances from 1,251 speakers in total. Consistent with [15], 4,874 utterances from 40 speakers are reserved for testing, to generate trials and perform adversarial attacks. The remaining utterances are used for training our SV models. In addition, we apply data augmentation [20] when training the x-vector embedding networks.

Mel-frequency cepstral coefficients (MFCCs) and log power magnitude spectra (LPMSs) are adopted as acoustic features in this experiment. To extract MFCCs, pre-emphasis with a coefficient of 0.97 is adopted, a Hamming window with a size of 25 ms and a step size of 10 ms is applied for framing, and finally 24 cepstral coefficients are kept. For LPMSs, a Blackman window with a size of 8 ms and a step size of 4 ms is adopted, and no pre-emphasis is applied.
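The LPMS settings above can be sketched directly with numpy framing and an FFT. The 16 kHz sampling rate and the test tone are illustrative assumptions (Voxceleb1 audio is 16 kHz, but the paper does not state the rate here).

```python
import numpy as np

sr = 16000                                     # assumed sampling rate
win, hop = int(0.008 * sr), int(0.004 * sr)    # 8 ms Blackman window, 4 ms step
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s toy 440 Hz tone

# Frame the signal (no pre-emphasis, per the LPMS setup), window each frame,
# then take the log power of the one-sided magnitude spectrum.
window = np.blackman(win)
frames = np.stack([signal[i:i + win] * window
                   for i in range(0, len(signal) - win + 1, hop)])
lpms = np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)
print(lpms.shape)  # (num_frames, win // 2 + 1)
```

Because LPMS is a nearly invertible representation (only phase is discarded), perturbed LPMS frames can later be turned back into audio using the original phase, as done in the ABX test of Section 4.3.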

4.1 ASV model configuration

In the GMM i-vector system setup, only voice activity detection (VAD) is applied for preprocessing the acoustic features. A 2048-mixture UBM with full covariance matrices is used together with a T matrix defining a 400-dimension i-vector space. The i-vectors are centered and length-normalized before PLDA modeling.

In the x-vector system, cepstral mean and variance normalization (CMVN) and VAD are adopted to preprocess the acoustic features. The setup of the x-vector embedding network largely follows [20]. The extracted x-vectors are centered, projected using a 200-dimension LDA, and then length-normalized before PLDA modeling.

Table 1: EER (%) of the target systems under attack with increasing perturbation degrees ε (first column: ε = 0, i.e. no attack; the remaining ε values are not recoverable from the source).

    LPMS-ivec attacks MFCC-ivec | 7.20 | 8.83 | 13.82 | 50.02 | 69.04 | 74.62 | 74.59 | 63.24
    MFCC-ivec attacks MFCC-xvec | 6.62 | 8.52 | 14.06 | 57.43 | 74.32 | 60.85 | 54.07 | 51.34
    LPMS-ivec attacks MFCC-xvec | 6.62 | 7.42 |  9.49 | 25.47 | 37.51 | 43.89 | 48.48 | 48.39

Table 2: EER (%) of the GMM i-vector systems under attack with increasing perturbation degrees ε (first column: ε = 0).

    MFCC-ivec |  7.20 | 81.78 | 97.64 | 50.25 | 50.72
    LPMS-ivec | 10.24 | 94.04 | 99.95 | 99.77 | 88.60

Table 3: FAR (%) of the GMM i-vector systems under attack with increasing perturbation degrees ε (first column: ε = 0).

    MFCC-ivec |  7.20 | 82.91 | 96.87 | 18.14 | 16.65
    LPMS-ivec | 10.24 | 96.78 | 99.99 | 99.64 | 69.95

4.2 Adversarial attack configuration

In this work, we train three ASV models: an MFCC based GMM i-vector system (MFCC-ivec), an LPMS based GMM i-vector system (LPMS-ivec) and an MFCC based x-vector system (MFCC-xvec), and perform white and black box attacks. Two white box attacks are performed on the MFCC-ivec and LPMS-ivec systems respectively, and three black box attacks in the following mutual-attack settings: LPMS-ivec attacks MFCC-ivec (cross feature), MFCC-ivec attacks MFCC-xvec (cross model architecture) and LPMS-ivec attacks MFCC-xvec (cross both). The last two settings are designed to investigate the adversarial samples' transferability to x-vector systems.

4.3 ABX test

To evaluate the auditory indistinguishability of the adversarial audios compared with the original audios, we perform the ABX test [14], a forced-choice test for identifying detectable differences between two choices of sensory stimuli. The adversarial samples are generated from the LPMS-ivec system by FGSM with ε equal to 1. Each adversarial audio is reconstructed from the perturbed LPMSs and the phase of its corresponding original audio. In this work, 50 randomly selected original-adversarial audio (A and B) pairs are presented to listeners, and from each pair one audio is chosen as the audio X. Eight listeners join this test, and each is asked to decide which of A and B is the audio X.

5 Experimental Results

Figure 2: FAR (%) of the target systems under attack with different perturbation degrees (ε).

5.1 ABX test results

Experimental results show that the average accuracy of the ABX test is 51.5%, close to chance level, which verifies that humans cannot distinguish the adversarial audios from the corresponding original audios.

5.2 White box attack

We perform white box attacks on the two GMM i-vector systems, i.e. the MFCC and LPMS based systems. The EERs of the systems under different perturbation degrees ε are shown in Table 2. In particular, the columns where ε equals 0 exhibit the performance of the systems without attack. We observe that both systems are most vulnerable when ε equals 1, where the EERs are both increased by around 90%. Meanwhile, the FARs of the systems are shown in Table 3. The FARs of both systems are also increased by around 90% when ε equals 1. These two observations confirm that GMM i-vector systems are vulnerable to white box attacks. Another observation is that the attack success rate increases first and then decreases as ε increases, rather than keeping increasing with ε. The initial increase (small perturbation) is due to the adversarial perturbation effect, but a large perturbation introduces so much distortion that it mitigates this effect and drives the attack success rate down. Technically speaking, the scoring function S (in Eq. 6) is actually non-convex w.r.t. the input, and a large perturbation breaks the linear approximation of S around x_t assumed by FGSM. This decreases the attack effectiveness.

5.3 Black box attack

As configured in Section 4.2, three black box attack settings are involved, i.e. LPMS-ivec attacks MFCC-ivec (cross feature), MFCC-ivec attacks MFCC-xvec (cross model architecture) and LPMS-ivec attacks MFCC-xvec (cross both). EERs and FARs are shown in Table 1 and Fig. 2 respectively. We observe that EER is increased by 67.42%, 67.70% and 41.86% in the cross-feature, cross-model-architecture and cross-both settings, respectively. Also, FAR is increased by around 60%, 50% and 30% in these three settings respectively. These observations confirm that both GMM i-vector and x-vector systems are vulnerable to black box attacks; the systems are most fragile to the cross-feature attack, while relatively more robust to the cross-both attack. We also observe that the ε at which the target system is most vulnerable differs across source-target attack settings. In some cases, ε must increase to a large value to achieve the most serious attack, e.g. 20 in the cross-feature setting. In such cases, the spoofing audio carries noise that humans can easily perceive, but humans can still confirm that the spoofing and original audio belong to the same person. Some adversarial samples and the corresponding system responses are provided online. Since x-vector systems prove to be vulnerable to black box attacks, we expect they are also vulnerable to the more aggressive white box attack.

6 Conclusion

In this work, we investigate the vulnerability of GMM i-vector systems to adversarial attacks and the adversarial samples' transferability to x-vector systems. Experimental results show that GMM i-vector systems are vulnerable to both white and black box attacks. The generated adversarial samples also prove transferable to NN speaker embedding based systems, e.g. x-vector systems. Future work will focus on protecting SV systems against such adversarial attacks.


  • [1] A. Agarwal, R. Singh, M. Vatsa, and N. Ratha (2018) Are image-agnostic universal adversarial perturbations for face recognition difficult to detect?. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1–7. Cited by: §1.
  • [2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §1.
  • [3] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §1.
  • [4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2010) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798. Cited by: §1, §2.1.
  • [5] J. Ebrahimi, D. Lowd, and D. Dou (2018) On adversarial examples for character-level neural machine translation. arXiv preprint arXiv:1806.09030. Cited by: §1.
  • [6] B. Fruchter (1954) Introduction to factor analysis. Cited by: §2.3.
  • [7] D. Garcia-Romero and C. Espy-Wilson (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [8] I. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1.
  • [9] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer (2016) End-to-end text-dependent speaker verification. In ICASSP, pp. 5115–5119. Cited by: §1.
  • [10] P. Kenny (2012) A small footprint i-vector extractor. In The Speaker and Language Recognition Workshop, Cited by: §1.
  • [11] T. Kinnunen, Z. Wu, K. Lee, F. Sedlak, E. Chng, and H. Li (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: the case of telephone speech. In ICASSP, pp. 4401–4404. Cited by: §1.
  • [12] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet (2018) Fooling end-to-end speaker verification with adversarial examples. In ICASSP, pp. 1962–1966. Cited by: §1.
  • [13] Y. Liu, X. Chen, C. Liu, and D. Song (2016) Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770. Cited by: §1.
  • [14] W. Munson and M. Gardner (1950) Standardizing auditory tests. The Journal of the Acoustical Society of America 22 (5), pp. 675–675. Cited by: §4.3.
  • [15] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §4.
  • [16] S. Prince and J. Elder (2007) Probabilistic linear discriminant analysis for inferences about identity. In IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §1.
  • [17] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. Cited by: §1.
  • [18] V. Shchemelinin and K. Simonchik (2013) Examining vulnerability of voice verification systems to spoofing attacks by means of a tts system. In ICSC, pp. 132–137. Cited by: §1.
  • [19] V. Shchemelinin, M. Topchina, and K. Simonchik (2014) Vulnerability of voice verification systems to spoofing attacks by tts voices based on automatically labeled telephone speech. In International Conference on Speech and Computer, pp. 475–481. Cited by: §1.
  • [20] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In ICASSP, pp. 5329–5333. Cited by: §1, §2.2, §4.1, §4.
  • [21] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur (2016) Deep neural network-based speaker embeddings for end-to-end speaker verification. In IEEE Spoken Language Technology Workshop (SLT), pp. 165–170. Cited by: §1.
  • [22] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • [23] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez (2014) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP, pp. 4052–4056. Cited by: §1.
  • [24] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li (2015) Spoofing and countermeasures for speaker verification: a survey. speech communication 66, pp. 130–153. Cited by: §1.
  • [25] Z. Wu, S. Gao, E. Cling, and H. Li (2014) A study on replay attack and anti-spoofing for text-dependent speaker verification. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–5. Cited by: §1.
  • [26] E. Zetterholm, M. Blomberg, and D. Elenius (2004) A comparison between human perception and a speaker verification system score of a voice imitation. evaluation 119, pp. 116–4. Cited by: §1.
  • [27] S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong (2016) End-to-end attention based text-dependent speaker verification. In IEEE Spoken Language Technology Workshop (SLT), pp. 171–178. Cited by: §1.