Deep network based biometric systems, such as fingerprint/face/speaker recognition, have been widely deployed in our daily life. Meanwhile, finding the weaknesses of these recognition systems and attacking them have drawn more and more attention. Although much work has been done on vision-based systems [16, 6, 3], attacks on speaker recognition have not been well-studied. There are two main applications of attacking speaker recognition systems and finding adversarial examples: (1) disturbing speaker recognition systems when they are not wanted; (2) helping improve the performance and robustness of speaker recognition systems. In this work, we focus on attacking speaker recognition and present a model, as well as its optimization method, to attack a well-trained state-of-the-art deep speaker recognition model by adding perturbations to the input speech, as illustrated in Fig. 1.
Attacking deep neural networks (DNNs) has become an emerging topic along with the development of DNNs, since weaknesses of DNNs were first exposed by Szegedy et al. On vision tasks, optimization methods such as L-BFGS, Adam, or genetic algorithms are used to modify input pixels to obtain adversarial examples. However, these methods need gradients or iterations during the testing phase, which is not practical in realistic scenarios. Baluja et al. proposed adversarial transformation networks (ATNs), which train a separate attacker network to transform all inputs into adversarial ones.
Based on the work on vision tasks, several methods have been proposed to attack automatic speech recognition (ASR) models. Alzantot et al. proposed to attack an ASR model via a gradient-free genetic algorithm that generates adversarial examples iteratively. However, unlike for visual images, the psychoacoustic model shows that humans perceive no difference if the distortion stays under certain hearing thresholds. Therefore, Schönherr et al. and Szurley et al. proposed to optimize the attack with the psychoacoustic model and add perturbations under the hearing threshold. Our work differs from these audio attack works in two aspects. First, our work focuses on a different task: attacking the speaker recognition model. Second, our model is based on ATNs, which need no gradients during the testing phase and are fast at inference. Besides, although the replay attack has been explored, learning-based attacks have not been well-studied. Our contributions can be summarized as follows:
We attempt to attack the state-of-the-art speaker recognition model and find that it is vulnerable to the attack.
We propose a model to attack the speaker recognition model. In non-targeted experiments conducted on the TIMIT dataset, we achieve a high sentence error rate (SER) with an SNR up to 57.2 dB and a PESQ up to 4.2, at a speed faster than real time.
We present an optimization method to train our model, and experimental results show that our method can achieve a trade-off between the attack success rate and the perceptual quality.
2 Speaker Recognition Attacker
As illustrated in Fig. 2, our proposed speaker recognition attacker is a trainable network that adds generative perturbations to the input speech in order to attack a pretrained speaker recognition model. A pretrained phoneme recognition model is used to help train the attacker. Given a speech x and its speaker label y, the non-targeted attack on the speaker recognition model can be formulated as

min_A d(x, A(x))   s.t.   f(A(x)) ≠ y,

where A is the attack model that transforms the input speech signal into an adversarial example, f is the well-trained speaker recognition model, and d is a metric function measuring the distance between two samples (e.g., a norm). The constraint is changed to f(A(x)) = t, with target speaker class t, for the targeted attack.
2.1 Attacker Network
The proposed attacker network is a fully convolutional residual network with 5 convolution blocks in total in the residual branch, as illustrated in Fig. 2. A 1-D convolution, batch normalization, and ReLU are applied in every convolution block. All convolution layers share the same kernel size and channel width. To increase the receptive field, we use different dilations in the different convolution layers. Besides, we initialize the weight and bias of the last convolution layer to zero so that our model adds no perturbation at the start of training, which is important for the optimization to keep the perturbations on a small scale.
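The architecture described above can be sketched as follows. Since the exact kernel size, channel width, and dilation schedule are elided here, the values below (kernel size 3, 64 channels, dilations 1/2/4/8) are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k, d):
    # Each block: dilated 1-D convolution -> batch norm -> ReLU.
    # Padding keeps the output length equal to the input length.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, k, padding=d * (k - 1) // 2, dilation=d),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )

class Attacker(nn.Module):
    """Fully convolutional residual attacker network (sketch)."""

    def __init__(self, channels=64, k=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = [conv_block(1, channels, k, dilations[0])]
        for d in dilations[1:]:
            layers.append(conv_block(channels, channels, k, d))
        # Fifth convolution maps back to one channel; zero-initialized so
        # the network adds no perturbation at the start of training.
        final = nn.Conv1d(channels, 1, k, padding=(k - 1) // 2)
        nn.init.zeros_(final.weight)
        nn.init.zeros_(final.bias)
        layers.append(final)
        self.residual = nn.Sequential(*layers)

    def forward(self, x):                # x: (batch, 1, samples)
        return x + self.residual(x)      # identity + learned perturbation

x = torch.randn(2, 1, 3200)              # two 200 ms frames at 16 kHz
attacker = Attacker()
attacker.eval()
with torch.no_grad():
    y = attacker(x)
assert y.shape == x.shape                # fully convolutional: length preserved
assert torch.allclose(y, x)              # zero-init => no perturbation yet
```

Because every layer is convolutional, the same network runs on inputs of arbitrary length at test time.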
An intuitive way to train the attacker network is gradient ascent; in practice, however, this fails because a well-trained speaker recognition model backpropagates almost zero gradient through the softmax layer. Motivated by the Wasserstein GAN, which optimizes the Wasserstein distance between two distributions, we avoid this vanishing-gradient problem by optimizing directly on the immediate activation before the softmax layer. On the other hand, we also need to ensure that the perturbations are imperceptible: a norm penalty is used to constrain the scale of the perturbations, and we additionally take phoneme information into account via a pretrained phoneme recognition network to preserve perceptual quality. In summary, we optimize our attacker network from three aspects:
L = [ z_{k1}(A(x)) − z_{k2}(A(x)) ] + α · max( ‖A(x) − x‖ − ε, 0 ) + β · KL( p(x) ‖ p(A(x)) ),

where z(x) is the immediate activation before the softmax layer of the speaker recognition model for input x, p(x) is the output distribution of the softmax layer of the phoneme recognition model for input x, and k_1/k_2 are the indices of the 1st/2nd largest values in z(x). In the last term, we use the Kullback–Leibler divergence (KLD) to measure the distance between the two phoneme distributions. In the second term, ε is a hyper-parameter that gives a margin within which the perturbations are considered imperceptible, and α and β fuse the three loss items. For the targeted attack, the loss is the same except that the first term is changed to

z_{k1}(A(x)) − z_t(A(x)),

where t denotes the target speaker class.
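The three-part objective described above can be sketched as below. The margin, the fusion weights, and the exact form of each term are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def attacker_loss(z_adv, y, delta, phn_clean, phn_adv,
                  margin=1e-3, alpha=1.0, beta=1.0):
    """Composite non-targeted training loss (sketch; values are placeholders).

    z_adv:     pre-softmax speaker activations for the adversarial input
    y:         true speaker labels
    delta:     perturbation A(x) - x
    phn_clean/phn_adv: phoneme softmax distributions for clean/adversarial input
    """
    # (1) Speaker term: drive the true class below its strongest competitor,
    # working on the activation *before* softmax to avoid vanishing gradients.
    top2 = z_adv.topk(2, dim=1).indices
    true_logit = z_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    runner_up = torch.where(top2[:, 0] == y, top2[:, 1], top2[:, 0])
    competitor_logit = z_adv.gather(1, runner_up.unsqueeze(1)).squeeze(1)
    loss_spk = (true_logit - competitor_logit).mean()

    # (2) Norm term: penalize perturbation energy only above a margin
    # under which it is assumed imperceptible.
    loss_norm = F.relu(delta.pow(2).mean() - margin)

    # (3) Phoneme term: KL divergence between phoneme posteriors of the
    # clean and adversarial speech, preserving the spoken content.
    loss_phn = F.kl_div(phn_adv.log(), phn_clean, reduction='batchmean')

    return loss_spk + alpha * loss_norm + beta * loss_phn

# Toy check: zero perturbation, identical phoneme posteriors, so only
# the speaker margin term contributes (2.0 - 1.0 = 1.0).
z = torch.tensor([[2.0, 1.0, 0.0]])
y = torch.tensor([0])
delta = torch.zeros(1, 1, 10)
p = torch.full((1, 4), 0.25)
loss = attacker_loss(z, y, delta, p, p)
assert torch.allclose(loss, torch.tensor(1.0))
```

For a targeted variant, the speaker term would instead reward the target class logit relative to the current top class.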
In the training stage, input speech of variable length is split into fixed-length frames because of the fully connected layer in the speaker recognition model. In the testing stage, however, the input speech can be of arbitrary length because our attacker network is fully convolutional. Inference is fast because the attacker network is lightweight, with only 5 convolution blocks and small kernels.
3 Experimental Result
3.1 Experimental Setup
Pretrained Speaker/Phoneme Recognition Model. We use the state-of-the-art speaker recognition model, SincNet, as the target model to attack. SincNet replaces the first layer of a CNN with a group of learnable bandpass filters; in this way, the network is more interpretable and shows better performance. Besides, SincNet also works on phoneme recognition (https://github.com/mravanelli/pytorch-kaldi). In our experiments, we use the officially released pretrained SincNet model for speaker recognition. The phoneme recognition model is built with Pytorch-Kaldi and achieves a low frame error rate on the TIMIT dataset (we use the same train/test split for phoneme and speaker recognition, which is different from the typical split for phoneme recognition).
Dataset and Metric. Following the setting in prior work, we conduct experiments on TIMIT (462 speakers, train chunk) to demonstrate the performance of our proposed model. The signal-to-noise ratio (SNR) and the Perceptual Evaluation of Speech Quality (PESQ) score are used to evaluate the objective and perceptual quality, respectively. SNR is calculated as

SNR = 10 log_10 ( P_x / P_e ),

where P_x and P_e are the mean squares of the input signal and the error, respectively. PESQ is an integrated perceptual model that maps the distortion to a prediction of the subjective mean opinion score (MOS) in the range from −0.5 to 4.5, and is an ITU-T recommendation. Following the works on attacking image classification [16, 6], we use the classification error rate (CER) to evaluate the performance of our attacker. For the non-targeted attack, the sentence error rate (SER), defined as the CER over speech sentences, is used to measure our attacker's performance. For the targeted attack, an attack is successful as long as the prediction is the target, so we use the prediction target rate (PTR), the percentage of predictions over the testing set that hit the target, to measure our attacker's performance.
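A minimal sketch of the SNR metric, computing the mean-square powers of the clean signal and of the perturbation; the 440 Hz test tone and the noise level are arbitrary illustrative choices:

```python
import numpy as np

def snr_db(clean, adversarial):
    """SNR in dB between a clean signal and its adversarial version.

    P_x and P_e are the mean squares of the input signal and of the
    error (perturbation), respectively.
    """
    error = adversarial - clean
    p_signal = np.mean(clean ** 2)
    p_error = np.mean(error ** 2)
    return 10.0 * np.log10(p_signal / p_error)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)                  # 440 Hz tone, 1 s at 16 kHz
x_adv = x + 1e-3 * rng.standard_normal(x.size)   # small additive perturbation
snr = snr_db(x, x_adv)                           # roughly 57 dB for this noise level
```

The smaller the perturbation energy relative to the speech, the higher the SNR; PESQ, by contrast, requires the full perceptual model of ITU-T P.862.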
The speech sentences, sampled at 16 kHz, are split into 200 ms frames with a 10 ms stride. The data are normalized before being fed into the attacker model and de-normalized at its output. The hyper-parameters are set after fine-tuning, and we use the Adam optimizer to train the attack model. Data, code, and pretrained models have been released on our project home page (https://smallflyingpig.github.io/speaker-recognition-attacker/main).
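The framing step can be sketched as follows: at 16 kHz, a 200 ms frame is 3200 samples and a 10 ms stride is 160 samples. The per-frame zero-mean/unit-variance normalization shown is an assumption, since the exact normalization scheme is not specified here:

```python
import numpy as np

def split_frames(signal, sr=16000, frame_ms=200, stride_ms=10):
    """Split a speech signal into fixed-length overlapping frames."""
    frame_len = sr * frame_ms // 1000      # 3200 samples for 200 ms at 16 kHz
    hop = sr * stride_ms // 1000           # 160 samples for a 10 ms stride
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

def normalize(frames):
    # Per-frame standardization before the attacker; the statistics are
    # kept so the output can be de-normalized afterwards.
    mean = frames.mean(axis=1, keepdims=True)
    std = frames.std(axis=1, keepdims=True) + 1e-8
    return (frames - mean) / std, mean, std

x = np.random.randn(16000)                 # 1 s of audio
frames = split_frames(x)
assert frames.shape == (81, 3200)          # 1 + (16000 - 3200) // 160 = 81
normed, mean, std = normalize(frames)
```

De-normalization is simply `normed * std + mean`, applied to the attacker's output frames.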
3.2 Non-targeted Attack
To demonstrate the effectiveness of our proposed model, we conduct non-targeted attack experiments on the TIMIT dataset. The results are shown in Table 1, and some conclusions can be drawn:
Our proposed model successfully attacks the well-trained state-of-the-art speaker recognition model, achieving a high SER on the testing set. Meanwhile, the perturbations are small enough to be imperceptible to humans, because the SNR is up to 57.2 dB and the PESQ is no less than 4.2, indicating the efficiency of our model.
The results in the 2nd and 4th rows show that the corresponding loss term improves both the performance of the attacker model (measured by SER) and the quality of the adversarial examples (measured by SNR and PESQ).
Besides, we also examine the distribution of the perturbations in the frequency domain to study whether they show frequency selectivity. The frequency distributions of the perturbations from the attacker model are shown in Fig. 3. The spectrogram energy distribution shows that: (1) the perturbations are full-band, and all frequencies are useful for the attack; (2) the energy in the 7k–8k Hz band is significantly stronger than in other bands, indicating that the high-frequency band strongly affects the speaker recognition performance. The frequency characteristics of the perturbations have a great influence on model performance and signal quality, but a deeper look into this question is beyond the scope of this paper.
3.3 Targeted Attack
Besides the non-targeted attack, we also evaluate our model on the targeted attack, with the hyper-parameters fixed. Five speakers are randomly selected from the 462 speakers of the TIMIT dataset as targets. The attack results are shown in Table 2, and they demonstrate that:
Our model can attack the speaker recognition model with a high PTR on average over the five targets. Meanwhile, the perturbations are small enough, as reflected by the high SNR.
The PESQ (3.48 on average) is still good, although not as high as in the non-targeted attack, which is expected given that the targeted attack is more challenging than the non-targeted one.
3.4 Real-time Attack
Our attack model is very lightweight, with only 5 convolution blocks, so it is very fast. To verify that it is fast enough to process speech data in real time, we calculate the real-time factor (RTF) over the testing set. RTF is defined as the ratio of the processing time to the input duration, and a system is real-time if RTF < 1. We test our model in CPU mode on a machine with an Intel(R) Core(TM) i7-6700K @ 3.4 GHz CPU and obtain an average RTF of 0.042, indicating that our attacker is more than 20 times faster than the real-time requirement even in CPU mode.
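Measuring the RTF amounts to timing the forward pass and dividing by the audio duration. The trivial stand-in function below replaces the actual attacker model, which is an assumption for illustration:

```python
import time

def real_time_factor(process, audio, duration_s):
    """RTF = processing time / input duration; the system is real-time iff RTF < 1."""
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return elapsed / duration_s

# Toy "attacker" standing in for the real model: scales each sample.
audio = [0.0] * 16000                      # 1 s of audio at 16 kHz
rtf = real_time_factor(lambda a: [s * 0.999 for s in a], audio, duration_s=1.0)
assert rtf < 1.0                           # comfortably real-time for this stand-in
```

In practice, the RTF should be averaged over the whole testing set, as done above for the reported 0.042.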
In this paper, we proposed a model that attacks a speaker recognition model by training a lightweight attacker network to add perturbations to the input speech. Experiments show that our model is effective and efficient for both non-targeted and targeted attacks. This is a pioneering work on the learning-based speaker recognition attack, and we have established a corresponding benchmark for such studies. In the future, black-box and transferable attacks will be explored, because the gradient of the target is usually inaccessible in real scenarios.
-  (2018) Did you hear that? adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554. Cited by: §1.
-  (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §2.2.
-  (2018) Learning to attack: adversarial transformation networks. In AAAI, pp. 2687–2695. Cited by: §1, §1, §2.1.
-  (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
-  (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N 93. Cited by: 2nd item, §3.1.
-  (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §3.1.
-  (2001) ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Cited by: §3.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
-  (2016) Overview of btas 2016 speaker anti-spoofing competition. In 2016 IEEE 8th international conference on biometrics theory, applications and systems (BTAS), pp. 1–6. Cited by: §1.
-  (2019) The PyTorch-Kaldi speech recognition toolkit. In Proc. of ICASSP. Cited by: §3.1.
-  (2018) Interpretable convolutional filters with SincNet. arXiv preprint arXiv:1811.09725. Cited by: §3.1.
-  (2018) Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028. Cited by: Table 1, §3.1.
-  (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 2, pp. 749–752. Cited by: §3.1.
-  (2018) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665. Cited by: §1.
-  (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §1.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1, §3.1.
-  (2019) Perceptual based adversarial audio attacks. arXiv preprint arXiv:1906.06355. Cited by: §1.