Since Krizhevsky et al. [krizhevsky2012imagenet] won the ImageNet challenge [russakovsky2015imagenet] in 2012 with a deep neural network (DNN), more and more DNN-based biometric systems, such as fingerprint, face, and speaker recognition, have been deployed in daily life. However, these systems are at risk of attack because deep models are vulnerable to adversarial examples [goodfellow2014explaining], i.e., inputs that have been intentionally perturbed. Attacking deep models and finding their weaknesses helps us avoid these risks and design methods to defend against such attacks. Among the widely deployed biometric systems, previous works mainly focus on vision-based systems; audio-based systems, such as speaker recognition, have not been well studied despite their wide deployment. In this paper, we focus on attacking speaker recognition models by generating universal adversarial perturbations (UAPs), which are independent of the input samples and can be applied to the whole dataset.
Before Moosavi-Dezfooli et al. [Moosavi2017Universal] found UAPs, generating adversarial examples to spoof well-trained deep models had already become an emerging topic, after Szegedy et al. [szegedy2013intriguing] found that DNNs are vulnerable to adversarial examples with intentional, imperceptible perturbations. Following [szegedy2013intriguing], other optimization methods, such as Adam [carlini2017towards], the Fast Gradient Sign Method (FGSM) [goodfellow2014explaining], or the genetic algorithm [su2019one], were used to find perturbations for an input image. Later, Moosavi-Dezfooli et al. [Moosavi2017Universal] demonstrated that there exists a small universal perturbation that can spoof a well-trained DNN image classifier with high probability. Subsequently, Hayes et al. [hayes2018learning] crafted UAPs with a generative network that synthesizes the perturbation from input noise sampled from the normal distribution; they improved the attack success rate and showed transferability across different models for the same dataset. Motivated by the existence of UAPs in image classification, in this paper we attempt to find UAPs for speaker recognition systems by designing a generative model [hayes2018learning].
In addition to attacks on vision-based systems, attacks on speaker recognition systems have also been studied for a long time. Before DNNs were used in speaker recognition, replay and synthesis attacks had been studied to reduce the risk in voice verification systems [korshunov2016overview]. In recent years, with the wide deployment of DNN-based systems, attacking DNN-based speaker recognition models has drawn more and more attention. Gong et al. [gong2017crafting] crafted adversarial examples using FGSM to attack a well-trained speech verification model and showed that deep models are vulnerable to adversarial attacks. However, [gong2017crafting] provides no evidence on large-scale datasets. Using the same optimization method, Kreuk et al. [kreuk2018fooling] presented white-box attacks on a deep end-to-end network for text-dependent speaker verification on NTIMIT [jankowski1990ntimit] and YOHO [campbell1995testing].
In this paper, we attempt to generate the UAPs by learning the mapping from the low-dimensional normal distribution to the universal perturbation subspace via a generative model, given the fact that the UAPs are not unique [Moosavi2017Universal]. We demonstrate the effectiveness of our proposed method by attacking the state-of-the-art speaker recognition model [ravanelli2018speaker] under non-targeted and targeted settings on TIMIT [garofolo1993darpa] and LibriSpeech [panayotov2015librispeech] datasets. Our contributions can be summarized as follows:
We demonstrate the existence of the UAPs for the well-trained speaker recognition model, which are the potential risks for the widely deployed speaker recognition systems in our daily life.
We can synthesize different UAPs efficiently by mapping the normal distribution into the UAP subspace with the generative model. The experimental results show that our model achieves an SER of 97.0 with an SNR of 49.87 dB and a PESQ of 3.00 in the non-targeted attack on the TIMIT dataset, indicating the effectiveness of our proposed model.
The ablation study for the UAPs shows that our proposed model can learn useful universal patterns, map the low-dimensional normal distribution into the UAPs subspace, and generate UAPs that perform much better than the random perturbations.
2 Related Works
2.1 UAPs Generation
The existence of UAPs has been demonstrated in many areas [wu2019g, neekhara2019universal] since Moosavi-Dezfooli et al. [Moosavi2017Universal] found UAPs in image classification. Here we mainly review UAP generation models for image classification and audio-based systems that are related to our work. Different from the iterative optimization method used in [Moosavi2017Universal], Hayes et al. [hayes2018learning] crafted UAPs with a generative network that synthesizes the perturbation from input noise sampled from the normal distribution; they improved the attack success rate and showed transferability across different models for the same dataset. In addition to the works on image classification, UAP generation for audio-based systems has also been proposed recently. Neekhara et al. [neekhara2019universal] iteratively searched for UAPs with minimal norm under the constraint of a high attack success rate, and only one UAP can be found per optimization run. Our work differs from the above two works in two aspects: (1) our work focuses on an unexplored task, speaker recognition, to study the potential risk of widely deployed authentication systems; (2) our generative attacker can synthesize different UAPs efficiently once trained, an approach that has been demonstrated to be more effective than iterative methods in image classification attacks [hayes2018learning].
2.2 Speaker Recognition Attack
Attacking speaker recognition models has drawn researchers’ attention because: (1) attacking deep models has become a hot topic in the machine learning community; (2) speaker recognition/verification systems have been widely deployed in daily life. Gong et al. [gong2017crafting] crafted adversarial examples iteratively to attack a speaker recognition model trained on a small dataset and demonstrated the existence of adversarial examples for speaker recognition models. Subsequently, Kreuk et al. [kreuk2018fooling] attempted to fool an end-to-end speaker verification model trained on MFCC features by optimizing the perturbation using FGSM [goodfellow2014explaining]. However, these white-box attacks need gradients in the testing phase. In this paper, we propose a semi-white-box attack model to learn UAPs, which is more practical than white-box methods in real scenarios because: (1) our generative attacker needs no gradients in the testing stage; (2) the adversarial perturbations are universal and can spoof the well-trained speaker recognition model for any input speech.
3 Proposed Method
As illustrated in Fig. 1, our generative attacker aims to map the input noise $z$, which is sampled from the low-dimensional normal distribution $\mathcal{N}(0, I)$, into a UAP; the subsequent well-trained speaker recognition model is then spoofed by the adversarial example, i.e., the input perturbed by the generated UAP. Given a speech signal $x$ and its speaker label $y$, the non-targeted attack on the speaker recognition model with UAPs can be formulated as:
\[ \min_{G} \; D(x, \hat{x}) \quad \text{s.t.} \quad \hat{y} \neq y, \qquad \hat{x} = x + G(z), \quad \hat{y} = \arg\max F(\hat{x}), \]
where $G$ is the generative attack model that synthesizes the UAP from the noise $z$, $\hat{y}$ is the prediction for the adversarial example $\hat{x}$, $F$ is a well-trained state-of-the-art speaker recognition model, and $D$ is a distance function that measures the distortion between the raw signal and the adversarial example. For the targeted attack, we change the constraint on $\hat{y}$ from $\hat{y} \neq y$ to $\hat{y} = y_t$, in which $y_t$ is the target class.
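The two success conditions (non-targeted: the prediction differs from the true speaker; targeted: the prediction equals the target speaker) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `apply_uap`, `attack_succeeds`, and the two-class `toy_victim` are hypothetical names introduced here only for illustration.

```python
from typing import List, Optional, Sequence


def apply_uap(speech: Sequence[float], uap: Sequence[float]) -> List[float]:
    """Add a perturbation of matching length to the waveform, sample by sample."""
    if len(speech) != len(uap):
        raise ValueError("speech and perturbation must have the same length")
    return [s + u for s, u in zip(speech, uap)]


def predicted_class(scores: Sequence[float]) -> int:
    """Index of the highest class score (the arg max)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def attack_succeeds(scores: Sequence[float], true_label: int,
                    target: Optional[int] = None) -> bool:
    """Non-targeted: prediction != true label. Targeted: prediction == target."""
    pred = predicted_class(scores)
    if target is None:
        return pred != true_label
    return pred == target


def toy_victim(x: Sequence[float]) -> List[float]:
    """Hypothetical 2-class stand-in for the victim model (NOT SincNet)."""
    mean = sum(x) / len(x)
    return [mean, -mean]


speech = [0.2, 0.1, 0.3]       # clean input, classified as class 0 by the toy model
uap = [-0.5, -0.5, -0.5]       # illustrative perturbation, deliberately large
adversarial = apply_uap(speech, uap)
```

With these toy values the clean input is assigned class 0 and the perturbed input class 1, so both the non-targeted condition (true label 0) and the targeted condition (target 1) hold.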
3.1 The Victim Model
We use the state-of-the-art speaker recognition model SincNet [ravanelli2018speaker] as our victim model. SincNet achieved state-of-the-art performance on the TIMIT [garofolo1993darpa] and LibriSpeech [panayotov2015librispeech] datasets by replacing the first convolution layer with learnable band-pass filters. Given the frequency band $[f_1, f_2]$, the learnable band-pass filter can be described as:
\[ g[n, f_1, f_2] = 2 f_2\, \mathrm{sinc}(2\pi f_2 n) - 2 f_1\, \mathrm{sinc}(2\pi f_1 n), \]
where $\mathrm{sinc}(x) = \sin(x)/x$, and $f_1$ and $f_2$ are the learnable parameters. By using band-pass filters rather than generic convolution filters in the first layer, the model is more interpretable and achieves better results [ravanelli2018speaker].
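To make the band-pass parameterization concrete, the sketch below builds a truncated sinc band-pass kernel from two cutoff frequencies, using the standard difference of two ideal low-pass responses. It is only an illustration under stated assumptions: the kernel length (251), the example band (300 to 3400 Hz), and the helper names are chosen here, and the windowing that a practical implementation would apply is omitted.

```python
import math
from typing import List


def sinc(x: float) -> float:
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def band_pass_kernel(f1_hz: float, f2_hz: float,
                     sample_rate: int = 16000, length: int = 251) -> List[float]:
    """Impulse response 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), n centered at 0.

    Cutoffs are normalized by the sample rate; `length` must be odd so that the
    kernel is symmetric around n = 0.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd")
    if not 0.0 <= f1_hz < f2_hz <= sample_rate / 2:
        raise ValueError("need 0 <= f1 < f2 <= Nyquist")
    f1, f2 = f1_hz / sample_rate, f2_hz / sample_rate
    half = length // 2
    return [2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
            for n in range(-half, half + 1)]


def gain(kernel: List[float], freq_hz: float, sample_rate: int = 16000) -> float:
    """Magnitude response of a symmetric kernel at one frequency."""
    half = len(kernel) // 2
    w = 2 * math.pi * freq_hz / sample_rate
    return abs(sum(k * math.cos(w * (i - half)) for i, k in enumerate(kernel)))


kernel = band_pass_kernel(300.0, 3400.0)
```

Only the two cutoffs define the filter, which is why such a layer has far fewer free parameters than an unconstrained convolution kernel of the same length.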
3.2 The Framework
Our model aims to spoof the well-trained speaker recognition model with UAPs. As illustrated in Fig. 1, the Generator is a generative network with several upsampling blocks that synthesizes the UAP from 100-dimensional input noise (following [hayes2018learning]) sampled from the standard normal distribution $\mathcal{N}(0, I)$. Subsequently, the UAP is scaled to control the distortion before being added to the raw input speech, whose real label is Tom in the example.
The Speaker Recognition model, which is fixed and well-trained, is spoofed by the adversarial example, i.e., the input speech with the UAP added, and mistakenly predicts the speaker as Jerry.
Since the UAPs are not unique [Moosavi2017Universal], we use a Generator to learn the mapping from the normal distribution into the UAP subspace. We use several UpBlocks to synthesize the high-dimensional UAPs from the low-dimensional noise, and a convolution layer, batchnorm [ioffe2015batch], and ReLU [nair2010rectified] are used in each UpBlock.
The optimization objective of our model is to find adversarial examples that attack the well-trained speaker recognition model successfully with the smallest distortion. Given the input noise $z$, the raw speech data $x$, and its class label $y$, the goal above can be formulated as follows:
\[ \mathcal{L}(z, x, y) = \mathcal{L}_R(x + \delta, y) + \lambda\, \mathcal{L}_D(x, \delta), \]
where $\mathcal{L}_R$ denotes the attack success Rate term, $\mathcal{L}_D$ denotes the Distortion term, $\lambda$ is the hyper-parameter that controls the trade-off between $\mathcal{L}_R$ and $\mathcal{L}_D$, and $\delta = G(z)$ is the UAP. During optimization, this objective function is maximized.
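In non-targeted attacks, a successful attack means that the victim model predicts a wrong class, so we can optimize the attack success rate by reducing the prediction probability for the true class and increasing the prediction probability for any wrong class. To spoof the victim model with minimal cost, we increase the probability of the top-1 class other than the true class. So $\mathcal{L}_R$ can be formulated as follows:
\[ \mathcal{L}_R = \min\Big( \max_{i \neq y} F(\hat{x})_i - F(\hat{x})_y,\; m \Big), \]
where $F(\hat{x})_i$ is the victim model's output for class $i$ on the adversarial example $\hat{x}$, and $m$ is a threshold that stops the optimization for a sample once its margin is large enough. In targeted attacks, the attack is successful as long as the predicted class is the target class. Given the target class $y_t$, $\mathcal{L}_R$ can be formulated as follows:
\[ \mathcal{L}_R = \min\Big( F(\hat{x})_{y_t} - \max_{i \neq y_t} F(\hat{x})_i,\; m \Big). \]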
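The attack term described above can be read as a clamped margin between class scores. The sketch below is one such reading, not a confirmed reproduction of the authors' loss: it assumes the term is maximized, that the threshold caps the margin so that already-fooled samples stop contributing, and that the scores are unnormalized model outputs. The function name `attack_term` is hypothetical.

```python
from typing import Optional, Sequence


def attack_term(scores: Sequence[float], true_label: int, margin: float,
                target: Optional[int] = None) -> float:
    """Clamped margin to be maximized (illustrative assumption).

    Non-targeted: (best wrong-class score) - (true-class score), capped at `margin`.
    Targeted:     (target-class score) - (best other score), capped at `margin`.
    """
    if target is None:
        best_other = max(s for i, s in enumerate(scores) if i != true_label)
        gap = best_other - scores[true_label]
    else:
        best_other = max(s for i, s in enumerate(scores) if i != target)
        gap = scores[target] - best_other
    # Once the gap exceeds the threshold the value is constant, so a
    # gradient-based optimizer receives no further signal from this sample.
    return min(gap, margin)
```

For example, with scores `[5.0, 2.0, 1.0]` and true label 0 the non-targeted term is -3.0 (the model is still correct), while with scores `[1.0, 20.0, 2.0]` the gap of 19.0 is capped at a threshold of 10.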
The distortion of UAPs can be measured in two aspects: objective quality and perceptual quality. We use the signal-to-noise ratio (SNR) and the Perceptual Evaluation of Speech Quality (PESQ) score [rix2001perceptual] to evaluate the objective and perceptual quality of the adversarial examples, respectively. SNR is defined as:
\[ \mathrm{SNR} = 10 \log_{10} \frac{\|x\|_2^2}{\|\delta\|_2^2}, \]
where $x$ is the raw speech and $\hat{x} = x + \delta$ is the adversarial example with the perturbation $\delta$. PESQ, an ITU-T recommendation [itu2000pesq], is an integrated model for measuring speech distortion in telephony. It is a full-reference algorithm with a score range of $[-0.5, 4.5]$ that measures the perceptual quality of the speech after temporal alignment. It is worth mentioning that PESQ is not differentiable, so we only use it in the testing phase. In the training phase, the distortion term is only the SNR, which we optimize by minimizing the norm of the perturbations.
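As a small worked example of the objective quality measure, the following sketch computes the SNR in decibels as the ratio of signal energy to perturbation energy. It assumes the standard energy-ratio definition; the function name `snr_db` and the handling of the zero-energy edge cases are choices made here.

```python
import math
from typing import Sequence


def snr_db(speech: Sequence[float], perturbation: Sequence[float]) -> float:
    """10 * log10(signal energy / perturbation energy), in dB."""
    if len(speech) != len(perturbation):
        raise ValueError("speech and perturbation must have the same length")
    signal_energy = sum(s * s for s in speech)
    noise_energy = sum(p * p for p in perturbation)
    if noise_energy == 0.0:
        return math.inf      # no perturbation at all
    if signal_energy == 0.0:
        return -math.inf     # silent input, any perturbation dominates
    return 10.0 * math.log10(signal_energy / noise_energy)
```

A perturbation whose amplitude is 1% of the signal amplitude has 1e-4 of its energy, which corresponds to 40 dB; halving the perturbation amplitude raises the SNR by about 6 dB, which is why minimizing the perturbation norm increases the SNR.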
Inference is not straightforward because the input data have variable lengths. In the training phase, we can clip the data into slices of a fixed length, but in the testing phase we cannot simply drop the data beyond the UAP length. In our implementation, we use a simple but effective repeat+clip method: we repeat the UAP until it is at least as long as the input sample and then clip it to match the input length. In our experiments, we conduct a comparison to study the influence of different UAP lengths.
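The repeat+clip step can be sketched in a few lines of Python. This is a minimal illustration under the assumption that waveforms are plain sequences of samples; the function name `repeat_and_clip` is introduced here and is not taken from the authors' code.

```python
from typing import List, Sequence


def repeat_and_clip(uap: Sequence[float], target_length: int) -> List[float]:
    """Tile a fixed-length perturbation and cut it to exactly `target_length` samples."""
    if not uap:
        raise ValueError("the perturbation must not be empty")
    if target_length < 0:
        raise ValueError("target_length must be non-negative")
    repeats = -(-target_length // len(uap))   # ceiling division
    return (list(uap) * repeats)[:target_length]
```

For instance, a 3200-sample UAP applied to a 50000-sample utterance is repeated 16 times (51200 samples) and the last 1200 samples are discarded; an input shorter than the UAP simply receives the first part of the UAP.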
4 Experimental Results
4.1 Datasets and Metric
Datasets: Following [ravanelli2018speaker], which proposed our victim model, we conduct the experiments on the TIMIT [garofolo1993darpa] (462 speakers in total) and LibriSpeech [panayotov2015librispeech] (2484 speakers in total) datasets. The training/testing split follows the official implementation of [ravanelli2018speaker], in which 2310/1386 samples are used for training/testing on TIMIT, and 14481/7452 samples are used for training/testing on LibriSpeech.
Metric: We use the sentence error rate (SER) to represent the attack success rate in the non-targeted attack. In targeted attacks we use the prediction target rate (PTR), because the attack is successful as long as the prediction is the target class. The distortion is measured by SNR (for objective quality) and PESQ (for perceptual quality), as introduced in subsection 3.3. We use the official open-source implementation (https://github.com/dennisguse/ITU-T_pesq) for PESQ in our experiments.
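The two success metrics reduce to simple counting over the test sentences. The sketch below assumes both are reported as percentages over per-sentence predictions; the function names are hypothetical.

```python
from typing import Sequence


def sentence_error_rate(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of sentences whose predicted speaker differs from the true speaker."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(p != y for p, y in zip(predictions, labels))
    return 100.0 * errors / len(labels)


def prediction_target_rate(predictions: Sequence[int], target: int) -> float:
    """Percentage of sentences predicted as the chosen target speaker."""
    if not predictions:
        raise ValueError("need at least one prediction")
    hits = sum(p == target for p in predictions)
    return 100.0 * hits / len(predictions)
```

Note that a high SER only requires predictions to be wrong, whereas a high PTR requires them to agree on one specific speaker, which is the stricter condition.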
4.2 Implementation Details
Our Generator can only synthesize UAPs of a fixed length, so in the training phase we randomly select a fixed-length slice from the raw speech data. In our experiments, we synthesize UAPs of 200 ms, i.e., 3200 dimensions, because the data have a sampling rate of 16000 Hz. We use the pretrained victim model released by the authors of [ravanelli2018speaker] (https://github.com/mravanelli/SincNet). The trade-off hyper-parameter $\lambda$ is tuned in our experiments, and the scale factor is kept fixed because the perturbations are already constrained to a small scale by the distortion term in our optimization objective. The threshold for the non-targeted/targeted attack is set to 10/0 after tuning to get a good trade-off between the attack success rate and the distortion. Besides, we initialize the biases and weights of the last convolution layer to zero to ensure that no perturbation is added to the signals at the beginning of training. The code, data, and pretrained models will be released soon.
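The length arithmetic and the random slicing used for training can be sketched as follows. The helper names `samples_for` and `random_slice` are hypothetical, and a seeded `random.Random` is used here only to keep the example reproducible.

```python
import random
from typing import List, Sequence

SAMPLE_RATE = 16000  # Hz, as stated for the datasets used in the experiments


def samples_for(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples covering `duration_ms` milliseconds of audio."""
    return duration_ms * sample_rate // 1000


def random_slice(speech: Sequence[float], length: int,
                 rng: random.Random) -> List[float]:
    """Pick one contiguous slice of `length` samples at a uniformly random offset."""
    if len(speech) < length:
        raise ValueError("utterance is shorter than the requested slice")
    start = rng.randrange(len(speech) - length + 1)
    return list(speech[start:start + length])


uap_length = samples_for(200)                 # 200 ms at 16 kHz
utterance = [float(i) for i in range(20000)]  # placeholder waveform
training_slice = random_slice(utterance, uap_length, random.Random(0))
```

Here 200 ms at 16000 samples per second gives 0.2 × 16000 = 3200 samples, and the 400 ms and 600 ms settings correspond to 6400 and 9600 samples.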
4.3 Non-Targeted Attack
We conduct the non-targeted attack on the TIMIT and LibriSpeech datasets to demonstrate the effectiveness of our proposed model. The results on these two datasets are shown in Table 1. We can observe from the results that:
For non-targeted attacks, the UAPs exist and our model manages to map the normal distribution into the UAP subspace, because it can synthesize UAPs that attack the well-trained speaker recognition model with a high success rate.
On the TIMIT dataset, with a tuned trade-off hyper-parameter $\lambda$, the UAPs generated by our model can attack the well-trained speaker recognition model with an SNR of 49.87 dB and a PESQ of 3.00, which means that the noise is noticeable but not intrusive.
On the LibriSpeech dataset, with a tuned trade-off hyper-parameter $\lambda$, the UAPs generated by our model can attack the well-trained speaker recognition model with an SNR of 31.15 dB and a PESQ of 2.33, which means that the noise is noticeable and a little intrusive.
On both the TIMIT and LibriSpeech datasets, by tuning $\lambda$, we can control the trade-off between the attack success rate and the quality of the adversarial examples.
It is worth mentioning that the random perturbations in non-targeted attack can also achieve a high SER as long as the perturbations are intense enough, so a high SER here cannot provide enough evidence that our model has learned some useful universal patterns. In Section 5, we will compare the UAPs generated by our model with the random perturbations to show that our model has learned the universal patterns.
4.4 Targeted Attack
In this subsection, we show our model’s effectiveness in the targeted attack. We fix the trade-off hyper-parameter $\lambda$ to 3000/2000 for the TIMIT/LibriSpeech dataset, respectively. We randomly select 5 speakers from each dataset as the targets, and attack the victim model so that it misclassifies any input sample as the target class. The attack results are shown in Table 2. Some conclusions can be drawn from the results:
For the targeted attack, the UAPs exist and our model successfully synthesizes UAPs on both the TIMIT and LibriSpeech datasets.
On the TIMIT dataset, we achieve a PTR of 97.2 on average with an SNR of 48.53 dB and a PESQ of 2.48, which means that the noise is noticeable and a little intrusive, given the fact that the targeted attack is more challenging than the non-targeted attack.
On the LibriSpeech dataset, we achieve a PTR of 64.1 on average with an SNR of 29.94 dB and a PESQ of 2.11. This is not as good as the result on the TIMIT dataset, because the number of speakers in LibriSpeech (2484) is much larger than in TIMIT (462).
Besides, the high success rate here can demonstrate that our model has learned the universal patterns for the targeted attack. Although random perturbations are able to achieve a high SER in non-targeted attack, they will fail to achieve a high PTR in targeted attack because they are random. Thus a high PTR in targeted attack can demonstrate that our model has learned the useful universal patterns.
5 Ablation Study
5.1 The Length of UAPs
We can only generate UAPs of a fixed length, but the input signals have variable lengths, so we use the repeat+clip method to match the UAP to each input.
The length of the UAPs may affect the performance of our model, so in this subsection we study how the UAP length influences the attack performance. We generate UAPs with durations of 200 ms, 400 ms, and 600 ms on the TIMIT dataset, and we plot the SER-SNR and SER-PESQ curves to take both the quality of the adversarial examples and the attack success rate into account. Besides, we compare our generated UAPs with random noise, i.e., random perturbations sampled from a normal distribution (its standard deviation $\sigma$ is tuned to get five results with different trade-offs between SNR/PESQ and SER).
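For the random baseline, the noise level that yields a desired SNR follows directly from the SNR definition: for zero-mean Gaussian noise the expected power per sample equals the variance, so a target SNR in dB is reached with a standard deviation of sqrt(P_x / 10^(SNR/10)), where P_x is the mean signal power. The sketch below illustrates this relation; the zero mean, the helper names, and the 440 Hz test tone are assumptions made here, not details from the experiments.

```python
import math
import random
from typing import List, Sequence


def sigma_for_snr(speech: Sequence[float], target_snr_db: float) -> float:
    """Standard deviation of zero-mean Gaussian noise giving the target SNR on average."""
    signal_power = sum(s * s for s in speech) / len(speech)
    return math.sqrt(signal_power / 10 ** (target_snr_db / 10))


def random_perturbation(length: int, sigma: float, rng: random.Random) -> List[float]:
    """Draw `length` independent samples from a zero-mean normal distribution."""
    return [rng.gauss(0.0, sigma) for _ in range(length)]


def empirical_snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Measured SNR in dB for one concrete noise draw."""
    signal_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10(signal_energy / noise_energy)


# One second of a 440 Hz tone at 16 kHz as a placeholder signal.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
sigma = sigma_for_snr(tone, target_snr_db=40.0)
noise = random_perturbation(len(tone), sigma, random.Random(0))
measured = empirical_snr_db(tone, noise)
```

Sweeping the target SNR in this way gives a set of random perturbations at different distortion levels, which is the kind of baseline curve the learned UAPs are compared against.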
As illustrated in Fig. 2, a curve lying above another means that the corresponding model achieves a higher SER at the same SNR/PESQ, i.e., it performs better than the other. From Fig. 2 (a), we can observe that: (1) the UAPs generated by our model perform better than the random perturbations, indicating that our model has learned useful universal patterns; (2) with SNR below 50 dB, UAPs of different lengths achieve comparable performance, but with SNR above 50 dB, the shorter the UAPs are, the better they perform; (3) with a UAP length of 600 ms, the UAPs generated by our model may not perform as well as the random perturbations when the SNR is higher than 52 dB. The reason may be that the models for longer UAPs are more difficult to train, while we train the models for all UAP lengths for the same number of epochs to make a fair comparison. From Fig. 2 (b), similar conclusions can be drawn, except that the UAPs generated by our model perform much better at PESQ higher than 2.5, because the random perturbations struggle to achieve good perceptual quality (PESQ). On both the objective quality (SNR) and the perceptual quality (PESQ), the UAPs generated by our model perform better than the random perturbations, demonstrating that our model has learned useful universal patterns to attack the well-trained speaker recognition model.
5.2 Noise Interpolation
To demonstrate that our model is able to map the noise into the UAP subspace, we synthesize UAPs from noise interpolated between two random noise vectors and evaluate these UAPs by attacking the well-trained speaker recognition model. With an interpolation parameter $\alpha \in [0, 1]$, the interpolated noise is obtained by:
\[ z = \alpha z_1 + (1 - \alpha) z_2, \]
where $z_1$ and $z_2$ are two low-dimensional noise vectors sampled from the normal distribution. As illustrated in Table 3, with different $\alpha$, the UAPs generated by our model achieve similar SER, SNR, and PESQ, which indirectly validates that our model can map the noise distribution into the UAP subspace.
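The interpolation experiment can be sketched as a linear blend of two noise vectors. The convention used below (weight alpha on the first vector and 1 - alpha on the second) is an assumption, as are the function name and the fixed seed; the 100-dimensional size follows the noise dimension stated for the Generator.

```python
import random
from typing import List, Sequence


def interpolate(z1: Sequence[float], z2: Sequence[float], alpha: float) -> List[float]:
    """Linear interpolation alpha * z1 + (1 - alpha) * z2 between two noise vectors."""
    if len(z1) != len(z2):
        raise ValueError("noise vectors must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z1, z2)]


rng = random.Random(0)
z1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(100)]
# Five evenly spaced interpolation points, each of which would be fed to the Generator.
blends = [interpolate(z1, z2, k / 4) for k in range(5)]
```

If every blended vector still yields an effective UAP, the Generator has learned a mapping that is well behaved between the sampled points rather than memorizing a few isolated perturbations.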
6 Conclusion
In this paper, we demonstrated the existence of UAPs for speaker recognition, and we proposed a generative network that maps the low-dimensional noise space into the UAP subspace to synthesize UAPs efficiently. Experimental results showed that our model can generate UAPs that fool the state-of-the-art speaker recognition model with a high success rate. The ablation study provided evidence that our model had learned useful universal patterns for attacking the well-trained speaker recognition model. We envision that our work will provide a benchmark for universal attacks on speaker recognition.