In recent years, voice user interfaces (VUIs) have been integrated into various platforms, such as smartphones and smart appliances, and are shaping up to become the hubs of our increasingly connected lives. With VUIs now in widespread use, speaker recognition systems, which identify a person from the characteristics of their voice, can be seamlessly integrated and used for various security-enhanced applications, such as remote voice authentication to prevent fraud in financial services, voice-matched assistants that respond only to the owner’s voice, and even suspect identification and criminal detection [5, 1].
Deep neural networks (DNNs), with their superiority over prior state-of-the-art models (e.g., universal background model-Gaussian mixture model) [7, 8], have become the computational core of speaker recognition systems. However, recent studies have shown that DNN models are vulnerable to adversarial inputs in various fields (e.g., computer vision [3, 2] and speaker verification ). The most related work  generates adversarial examples against an end-to-end speaker verification model, which is a binary speaker recognition system that verifies whether a voice was uttered by the claimed speaker. However, adversarial attacks against the more complex multi-class speaker recognition models remain unexplored. Moreover, this attack  is an individual (i.e., non-universal) attack that requires generating a different perturbation for each voice input; the considerable time spent training a perturbation for each individual input makes real-time attacks impossible.
In this paper, we build the first real-time, universal, and robust targeted adversarial attack on X-vector, a state-of-the-art DNN-based multi-class speaker recognition model. The attack crafts an audio-agnostic universal perturbation that can be added to any voice input of any enrolled speaker to deceive the speaker recognition system, causing it to output an adversary-desired (targeted) speaker label. The universal perturbation is built by repeatedly playing back a fixed-length universal noise segment so that it fits voice inputs of various lengths. Additionally, unlike the existing digital attack  that feeds adversarial examples directly to the speaker verification model, we go one step further and build robust adversarial attacks by estimating the sound distortions introduced by physical-world propagation, so that the adversarial examples remain effective when played over the air. Experiments on a public dataset of speakers show the effectiveness and robustness of our proposed attack with a high attack success rate of over . The achieved attack launching time is only around , which is speedup over contemporary non-universal attacks.
2 Related Work
Adversarial Attack on Speech Recognition.
Recent studies have successfully produced adversarial examples against automatic speech recognition (ASR) systems (i.e., speech-to-text), which is the most prevalent application in the audio space. For instance, Vaidya et al. [17, 2] generate noise-like adversarial sounds that make ASR models output adversary-desired text transcriptions. Nonetheless, humans perceive the generated adversarial examples as noise, which may draw considerable attention in practical attacks. To solve this problem, Carlini et al.  propose to craft adversarial samples by adding unnoticeable perturbations to the original speech, misleading the model into transcribing the adversarial examples as adversary-desired text. Moreover, CommanderSong  can embed any malicious command into regular songs, which ASR systems recognize as malicious commands while humans still perceive them as ordinary music. However, all the aforementioned ASR adversarial attacks are individual attacks that solve an optimization problem for each input audio, which incurs a high run-time cost (e.g., several hours) per input. Alternatively, a more recent work  produces a single universal perturbation that can fool ASR systems into making transcription errors. That work addresses only the untargeted case, in which the adversary cannot specify the desired transcription when generating the adversarial example.
Adversarial Attack on Speaker Recognition. Different from speech recognition systems, speaker recognition (a.k.a. voice recognition) focuses on extracting individual-dependent voice characteristics through embedding methods to identify speakers regardless of their speech content. DNNs are increasingly used in the embedding layers of speaker recognition models because of their scalable embedding performance [7, 8]. However, few studies have explored the vulnerability of DNN-based speaker recognition systems. To the best of our knowledge, the only related study  builds adversarial examples against an end-to-end speaker verification model, which is a binary speaker recognition system. Moreover, this attack is an individual attack, which requires a long time to craft a different perturbation for each voice input, and it does not consider the sound distortions caused by practical over-the-air playback. To bridge these gaps, in this paper we explore the possibility of launching real-time, universal, targeted, and robust adversarial attacks against a multi-class speaker recognition system, with speakers in our testing model.
3 Real-time, Universal, and Robust Adversarial Examples
3.1 Target Speaker Recognition Model
In this work, the DNN-embedding-based X-vector system  is used as the speaker recognition system since it has shown a significant improvement over standard i-vector models and has been further examined in many follow-up studies (e.g., [14, 11]). The architecture of the X-vector system is shown in Figure 1. Specifically, for an input audio, the system first extracts mel-frequency cepstral coefficient (MFCC) features using a sliding window. The extracted features are then passed to a time-delay neural network (TDNN) structure  that operates on audio frames. The statistics pooling layer takes the output of the final frame-level layer as input, aggregates over the input segment, and computes its mean and standard deviation. Subsequently, hidden layers map the concatenated statistics into the final embeddings. In the recognition phase, probabilistic linear discriminant analysis (PLDA) uses the embedding to compute the probability that the input audio belongs to each enrolled speaker and outputs the speaker label with the highest score.
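The statistics pooling step described above can be illustrated with a minimal sketch. It is not the Kaldi or TensorFlow implementation used in this work: the function name `statistics_pooling` is hypothetical, the frame-level features are plain Python lists, and the standard deviation is assumed to be the population form (dividing by the number of frames).

```python
import math


def statistics_pooling(frames):
    """Aggregate frame-level features into one segment-level vector.

    frames: list of T frame vectors, each a list of D floats.
    Returns 2*D floats: the per-dimension means followed by the
    per-dimension (population) standard deviations.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of every feature dimension over all frames of the segment.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Standard deviation of every feature dimension over the same frames.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    # Concatenate the statistics, as the later hidden layers expect.
    return means + stds
```

Because the output size depends only on the feature dimension and not on the number of frames, utterances of any length map to a fixed-size vector, which is what lets the subsequent hidden layers produce a single embedding per utterance.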
3.2 Challenges and Threat Model
Challenges. Generating such a real-time, universal, and robust adversarial example against a speaker recognition system in practice raises a number of challenges:
(1) Real-time Adversarial Attack. Crafting an adversarial noise for a speaker’s speech with a conventional optimization-based approach is usually very time-consuming, which rules out many practical attack scenarios, such as playing the adversarial noise from a hidden loudspeaker in real time alongside the speaker’s voice input.
(2) Universal Targeted Adversarial Example. Using an audio-agnostic universal perturbation to make the speaker recognition system misclassify any enrolled speaker’s input audio as the adversary-desired speaker requires building a universal mapping from the audio sources to the adversary-desired target. The proposed algorithm needs to generalize to audio inputs of various lengths spoken by different speakers with various accents.
(3) Robust Adversarial Example. The attack performance is inevitably affected by sound distortions caused by attenuation and multi-path effects when the adversarial examples are played over the air. Thus, the generated adversarial perturbation needs to be robust enough to remain effective under such real-world distortions.
Threat Model. In this work, we consider the white-box threat model, in which the adversary has full knowledge of the target speaker recognition model and its parameters. To build a robust adversarial attack that accounts for sound distortions in the room where the attack will be launched, we assume the adversary has access to the room’s layout. As shown in Figure 2, we aim to find a single audio-agnostic universal perturbation that can be applied to arbitrary enrolled speakers’ input audio to mislead the speaker recognition system into outputting the specific adversary-desired speaker label. Additionally, we aim to build a more robust adversarial perturbation that remains effective when played over the air in simulated acoustic room environments.
3.3 Real-time, Universal and Robust Adversarial Attacks
Most existing targeted adversarial attacks fool DNN-based systems by building a different adversarial perturbation for each individual input. In contrast, in this paper we explore how to build a single universal perturbation that can be directly applied to any utterance of an arbitrary speaker, making the speaker recognition system output the adversary-desired speaker label. Such a universal perturbation greatly shortens the attack launching time, making real-time attacks possible.
To clearly present the steps of our perturbation generation, we model the target speaker recognition system, X-vector, as a function $f(\cdot)$, which takes as input an utterance $x$ and outputs a predicted speaker label. We define $F(\cdot)$ as the function of all DNN layers (including PLDA) that computes the probabilities of classifying $x$ as each of the profiled speakers. The voice is recognized as the speaker with the highest calculated probability, $f(x) = \arg\max_i F(x)_i$. Therefore, to launch a universal targeted adversarial attack with targeted speaker label $y_t$, we aim to find a perturbation $\delta$ that achieves $f(x + \delta) = y_t$ for arbitrary $x$.
To build such a universal attack, we need a general solution that keeps the generated perturbation effective for all utterances regardless of their speakers, accents, speech content, and length. To overcome the issue of varying utterance length, we dynamically construct the universal perturbation $\delta$ based on the length of the input utterance $x$:
$$\delta = \mathrm{Crop}\big(\mathrm{Repeat}(\hat{\delta}),\, x\big),$$
where $\hat{\delta}$ is a short-length adversarial perturbation (e.g., in our work), and $\mathrm{Repeat}(\hat{\delta})$ is a vector constructed by repeating $\hat{\delta}$. $\mathrm{Crop}(\cdot, \cdot)$ crops the first input to the length of the second input. With this process, the derived perturbation can be applied to audio inputs of any length.
To minimize the distortion between the adversarial example and the original voice, $\delta$ is clipped to a pre-defined range. The generated adversarial example with the clipped $\delta$ can be formulated as:
$$x + \mathrm{Clip}_{\epsilon}(\delta),$$
where $\mathrm{Clip}_{\epsilon}(\cdot)$ performs element-wise clipping of $\delta$. Values of $\delta$ outside the interval $[-\epsilon, \epsilon]$ are clipped to the interval edges, and $\epsilon$ is our pre-defined attack strength.
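The repeat-and-crop construction and the element-wise clipping described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in this work: the function names `build_perturbation`, `clip`, and `make_adversarial` are hypothetical, audio is represented as plain lists of float samples, and the attack strength is passed in as `eps`.

```python
def build_perturbation(short_noise, utterance_len):
    """Repeat short_noise end to end, then crop to utterance_len samples."""
    if not short_noise:
        raise ValueError("short_noise must be non-empty")
    # Ceiling division: enough repetitions to cover the whole utterance.
    repeats = -(-utterance_len // len(short_noise))
    return (short_noise * repeats)[:utterance_len]


def clip(values, eps):
    """Clip every sample to the interval [-eps, eps]."""
    return [max(-eps, min(eps, v)) for v in values]


def make_adversarial(utterance, short_noise, eps):
    """Add the clipped, length-matched perturbation to the utterance."""
    delta = clip(build_perturbation(short_noise, len(utterance)), eps)
    return [x + d for x, d in zip(utterance, delta)]
```

Since only the short noise segment is stored, the same trained segment can be tiled over an utterance of any duration at attack time without further optimization.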
Moreover, to preserve the effectiveness of the adversarial example when played over the air, we first mimic the sound distortions during playback and recording by estimating the room impulse response (RIR), $h$, which characterizes the acoustic propagation (e.g., reverberations) in a room environment. The details of how to estimate the RIR $h$ based on the room setting are provided in Section 3.4. Then, we iteratively derive the targeted adversarial example through the following objective:
$$\text{find } \hat{\delta} \quad \text{s.t.} \quad f(x') = y_t, \qquad x' = h * \big(x + \mathrm{Clip}_{\epsilon}(\delta)\big),$$
where $y_t$ is the targeted speaker label, $*$ denotes the convolution operation, and $x'$ is the estimated adversarial example recorded by the microphone. It is important to note that the estimated RIR represents a specific mapping from the played sound to the recorded sound for a particular placement of the loudspeaker and microphone in the room. To make the generated adversarial examples robust in various environmental settings, we estimate multiple RIRs, $H = \{h_1, h_2, \ldots\}$, in various environments. To make the adversarial perturbation survive all these environments, we randomly select one RIR in $H$ for each training step when updating the perturbation based on each training utterance. In addition, as directly solving the non-linear constrained non-convex problem is difficult, we iteratively solve the following optimization problem:
$$\arg\min_{\hat{\delta}} \; \max\Big(\max_{i \neq t} F(x')_i - F(x')_t,\; -\kappa\Big),$$
where $F(x')_i$ with $i \neq t$ represents the output probabilities of all speakers except the targeted speaker, while $F(x')_t$ denotes the predicted probability of the targeted speaker. $\kappa$ is a configurable parameter which represents attack confidence and is set to in our implementation. To generate the universal perturbation, we iteratively modify the trainable sequence, $\hat{\delta}$, which is used for constructing $\delta$, over the entire training dataset until the desired attack success rate is reached. For each training utterance, if the predicted probability of the targeted class is already larger than that of every other class, the update of the perturbation is skipped and training moves on to the next sample.
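The loss described above, which compares the targeted speaker's probability against the best competing speaker subject to a confidence margin, can be sketched as follows. This is an assumption-laden illustration: the margin form is the widely used targeted-attack hinge loss, the names `targeted_margin_loss` and `target_already_wins` are hypothetical, and the model output is stood in for by a plain list of per-speaker probabilities (with at least two speakers).

```python
def targeted_margin_loss(probs, target, kappa=0.0):
    """Hinge-style targeted loss on a list of per-speaker probabilities.

    Returns max(best non-target probability - target probability, -kappa).
    The loss stops decreasing once the target leads by at least kappa.
    """
    best_other = max(p for i, p in enumerate(probs) if i != target)
    return max(best_other - probs[target], -kappa)


def target_already_wins(probs, target):
    """True if the target has strictly the highest probability.

    In that case the perturbation update for this utterance is skipped.
    """
    return all(probs[target] > p for i, p in enumerate(probs) if i != target)
```

In a full training loop, one room impulse response would be drawn at random per step (for example with `random.choice`), the utterance plus clipped perturbation would be convolved with it, and the gradient of this loss with respect to the short trainable noise segment would drive the update.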
3.4 Room Impulse Response Estimation
Acoustic propagation in a room is commonly modeled as a linear time-invariant system. Thus the recorded signal $x'$ can be expressed as a deterministic function of the played signal $x$: $x' = h * x$, where $h$ is the estimated room impulse response (RIR), and $*$ denotes the convolution operation. To simulate the play-over-the-air process in the physical world, we incorporate the RIR generated by an acoustic room simulator  into the adversarial example training phase. Specifically, the simulator can adjust several parameters, including the size of a 3D shoe-box room, the locations of the audio sources and microphones, and the reverberation rate. Optimizing with the simulated RIR increases the robustness of the generated adversarial example and consequently enables over-the-air attacks in practice.
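The linear time-invariant model above amounts to convolving the played signal with the impulse response. The sketch below shows that operation with a direct implementation; it does not call the room simulator, and the function name `convolve` as well as the toy impulse response in the usage note are hypothetical.

```python
def convolve(signal, rir):
    """Full linear convolution of a played signal with an impulse response.

    The output has len(signal) + len(rir) - 1 samples: the tail beyond
    the original length is the reverberation that rings on afterwards.
    """
    if not signal or not rir:
        return []
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            # Each played sample arrives again, scaled, k samples later.
            out[n + k] += s * h
    return out
```

For example, a toy impulse response `[1.0, 0.5]` models a direct path plus one echo at half amplitude one sample later, so `convolve([1.0, 2.0, 3.0], [1.0, 0.5])` gives `[1.0, 2.5, 4.0, 1.5]`. A simulated RIR is far longer, and in practice an FFT-based convolution would be preferred for speed.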
4 Experimental Results
4.1 Experimental Methodology
Dataset. We evaluate our proposed attack on an English multi-speaker corpus provided in CSTR voice cloning toolkit (VCTK) . In total, the dataset contains utterances spoken by speakers with various accents. The dataset is divided into a training and a testing set with a ratio of 4:1.
. In our TensorFlow-implemented X-vector system, 30-dimensional MFCC features with a frame length of are extracted. A pre-trained X-vector DNN embedding model provided in Kaldi  is used in the model. The baseline model achieves a classification accuracy of on testing utterances from speakers.
Evaluation Metrics. (1) Attack Success Rate: The ratio between the number of succeeded attacks and the total number of attack attempts; (2) Noise Level: We quantify the relative noise level of the perturbation with respect to the original audio in decibels (dB): .
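Both metrics are simple to compute, as the sketch below illustrates. The noise-level formula is not reproduced in the text above, so the definition used here is an assumption: it takes the ratio of peak absolute amplitudes, a convention common in the audio adversarial example literature. The function names `attack_success_rate` and `noise_level_db` are hypothetical.

```python
import math


def attack_success_rate(predicted_labels, target):
    """Fraction of attack attempts classified as the targeted speaker."""
    if not predicted_labels:
        raise ValueError("need at least one attack attempt")
    hits = sum(1 for label in predicted_labels if label == target)
    return hits / len(predicted_labels)


def noise_level_db(perturbation, audio):
    """Relative noise level in dB (assumed peak-amplitude definition).

    Computes 20 * log10(max|perturbation| / max|audio|); more negative
    values mean a quieter perturbation relative to the speech.
    """
    peak_noise = max(abs(v) for v in perturbation)
    peak_audio = max(abs(v) for v in audio)
    if peak_noise == 0 or peak_audio == 0:
        raise ValueError("both signals need a non-zero peak")
    return 20.0 * math.log10(peak_noise / peak_audio)
```

Under this assumed definition, a perturbation whose peak is one hundredth of the speech peak has a noise level of about -40 dB.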
4.2 Attack Evaluation
Effectiveness of Universal Targeted Attack. To evaluate the effectiveness of our proposed universal targeted attack, we choose each of the enrolled speakers in turn as the targeted speaker and treat the remaining speakers as victims. In total, we generated universal adversarial perturbations, trying to make the speaker recognition system classify the victims’ utterances as the targeted speakers. As shown in Table 1, by adjusting attack strength , the noise level ranges from to . As discussed in the previous study , such a noise level is considered quasi-imperceptible to humans. For instance, is comparable to the difference between a person talking and the ambient noise in a quiet room. For each value, the minimum, maximum, and average attack success rates among all attack attempts targeting speakers are calculated. We can observe that when the noise level is , a high average attack success rate of can be reached. When the noise level decreases to dB, the average attack success rate still remains over , which illustrates the effectiveness of our proposed universal targeted attack.
Robustness Analysis Using Room Simulator. An acoustic room simulator toolkit  is used to simulate audio propagation in a room environment. Specifically, a modeled room with a size of is used, and the locations of the loudspeaker and the microphone are chosen randomly in the room for RIR estimation. For the estimated RIRs, locations are used to build the universal, targeted, and robust adversarial perturbation, and the remaining locations are used for testing. Table 2 summarizes the results of our practical universal adversarial perturbation. We can observe that the universal adversarial perturbations trained with RIRs remain effective after the over-the-air simulation. In particular, the practical universal perturbation generated with a noise level of can still achieve an average attack success rate of . For comparison, we test an adversarial perturbation of the same noise level trained without RIRs in the simulated room environment. In that case, the average attack success rate decreases significantly to . This shows that our approach effectively improves the robustness of the generated adversarial examples.
Speedup on Attack Time. Unlike conventional individual attacks, which require building an adversarial perturbation for each individual voice input, our proposed universal attack generates a single perturbation that causes an arbitrary speaker’s utterances to be identified as the adversary-desired speaker. Thus, an adversarial attack can be launched simply by playing the pre-generated universal perturbation near the victim speaker. To show that real-time attacks are possible, we compare the attack launching time of the conventional individual targeted attack method  and our proposed universal attack for a given audio signal. In particular, the conventional targeted attack requires at least to deploy, measured on a Tesla V100 GPU with memory, while our proposed universal method takes only an average of , which results in a speedup.
5 Conclusion
This paper proposes a real-time, universal, and robust targeted adversarial attack against speaker recognition systems. The proposed attack builds a universal perturbation that can be added to any enrolled speaker’s voice input to fool the system into outputting any adversary-desired speaker label. The robustness of the adversarial perturbations is also greatly improved by using an acoustic room simulator to estimate the sound distortions associated with playing the audio over the air. Evaluation on a public dataset of speakers shows the effectiveness and robustness of our proposed attack.
Acknowledgments This research is supported in part by the National Science Foundation grants CNS1801630 and CCF1909963, the Army Research Office grant W911NF-18-1-0221 and the Air Force Research Laboratory grant FA8750-18-2-0058.
-  (2019-Oct.) Security as unique as your voice. Note: https://www.chase.com/personal/voice-biometrics
-  (2016) Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pp. 513–530.
-  (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7.
-  (2016) CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. The Centre for Speech Technology Research (CSTR).
-  (2019-Sep.) Voice match and media on Google Home. Note: https://support.google.com/googlenest/answer/7342711?hl=en
-  (2018) Fooling end-to-end speaker verification with adversarial examples. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1962–1966.
-  (2014) A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699.
-  (2015) Advances in deep neural network approaches to speaker recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4814–4818.
-  (2019) Universal adversarial perturbations for speech recognition systems. arXiv preprint arXiv:1905.03828.
-  (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
-  (2019) Probing the information encoded in x-vectors. arXiv preprint arXiv:1909.06351.
-  (2018) Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355.
-  (2017) Deep neural network embeddings for text-independent speaker verification. In Interspeech, pp. 999–1003.
-  (2019) Speaker recognition for multi-speaker conversations using x-vectors. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5796–5800.
-  (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
-  (2015) Cocaine noodles: exploiting the gap between human and machine speech recognition. In 9th USENIX Workshop on Offensive Technologies (WOOT 15).
-  (2018) CommanderSong: a systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.