In recent years, attacks on and defenses of speaker recognition systems have attracted increasing attention. Because speaker recognition is one of the most prominent biometric authentication methods, its security is extremely important. Prior work has found that speaker recognition systems face not only spoofing attacks [wu2015asvspoof, wu2015spoofing, wu2016anti], including impersonation, replay, speech synthesis, and voice conversion, but also adversarial attacks. In [das2020attacker], Das et al. gave an overview of the attacker’s perspective on speaker verification.
Adversarial attacks are usually conducted with adversarial examples, which are constructed by adding imperceptible perturbations to an input so that the model misclassifies it. Adversarial examples were first proposed by Szegedy et al. [szegedy2013intriguing] in computer vision tasks, who showed that a network is vulnerable to a small crafted perturbation in the training set. Goodfellow et al. [goodfellow2014explaining] proposed an effective approach, the fast gradient sign method (FGSM), to generate adversarial examples through the linearization of the loss function. Since then, various experimental results have shown that adversarial examples can successfully influence a variety of models [kurakin2016adversarial, miyato2018virtual].
Beyond image tasks, speech-related tasks can also be affected by adversarial examples. Plenty of work has focused on attacking automatic speech recognition (ASR) systems with adversarial examples. In [carlini2018audio], Carlini et al. demonstrated the effectiveness of targeted audio adversarial examples on an end-to-end ASR system. With optimization-based attacks, they were able to turn any audio waveform into any target transcription. Instead of using a norm to measure the maximum perturbation introduced as above, Schönherr et al. [schonherr2018adversarial] introduced a new type of adversarial example based on psychoacoustic hiding and successfully attacked the Kaldi ASR system [povey2011kaldi]. Next, Qin et al. [qin2019imperceptible] extended this idea and developed effectively imperceptible audio adversarial examples by leveraging the psychoacoustic principle of auditory masking.
In the speaker recognition area, adversarial examples can likewise be used both to attack and to defend the system. In [kreuk2018fooling], Kreuk et al. used adversarial examples to fool a speaker verification (SV) system by adding a peculiar noise to the original speaker examples. In our previous work [wang2019adversarial], we added adversarial perturbations at the feature level to conduct a non-targeted attack on an SV system. We also explored using adversarial examples for model regularization and improved the robustness of the SV system. Xie et al. [xie2020real] made a DNN-based speaker recognition system identify the speaker as any target label by adding audio-agnostic universal perturbations to the speaker’s voice input. In [li2020universal], Li et al. proposed to generate universal adversarial perturbations (UAPs) by learning the mapping from a low-dimensional normal distribution to the universal perturbation subspace via a generative model. However, the aforementioned adversarial examples are mostly restricted to slight changes of the original signal at the audio sample level, without considering how humans perceive sound.
In this study, inspired by the work in [schonherr2018adversarial, qin2019imperceptible], we propose to generate inaudible adversarial perturbations for targeted attacks on speaker recognition directly at the waveform level. We use the x-vector speaker recognition system proposed in [snyder2018x] as our baseline and conduct targeted white-box attacks on it. To generate the inaudible adversarial perturbations, we adopt the concept of frequency masking, in which one faint but audible sound becomes inaudible in the presence of another, louder audible sound. Our experimental results on the Aishell-1 corpus [bu2017aishell] demonstrate that the inaudible adversarial perturbations achieve better targeted attack performance than previous norm-based adversarial examples. To further compare the frequency-masking-based approach with previous ones, we evaluate them with both subjective and objective metrics. The results show that the adversarial perturbations generated by the proposed method are less audible, even with larger absolute energy. Finally, we conduct targeted attacks using the music portion of the MUSAN corpus [snyder2015musan], which is a completely unrelated non-speech dataset. Experiments show that even non-speech audio can achieve a high speaker attack success rate.
The rest of the paper is organized as follows. In Section 2, we detail the generation of the inaudible adversarial perturbations. In Section 3, we describe the experimental setup. Experimental results and analysis are presented in Section 4. We conclude in Section 5.
2 Inaudible adversarial perturbations
In this section, we describe how we generate inaudible perturbations that can conduct targeted speaker attacks. Figure 1 shows an overview of the generation of adversarial examples based on frequency masking.
2.1 Adversarial example generation
An adversarial example is defined as an instance with an imperceptible, intentional perturbation that causes a well-trained model to make a false prediction. Conventional approaches typically generate adversarial perturbations by performing gradient descent with respect to the input sample. Specifically, given an input speech $x$, its speaker label $y$, an arbitrary target label $y'$ and a well-trained speaker recognition model $f(\cdot)$, the adversarial perturbation $\delta$ can be generated by
$$\min_{\delta}\ \ell\big(f(x+\delta),\, y'\big) \quad \text{s.t.} \quad \|\delta\| < \epsilon, \tag{1}$$
where $\ell(\cdot)$ is the loss function. The hyperparameter $\epsilon$ is used to control the maximum perturbation generated.
2.2 Frequency masking
Our goal is to generate adversarial perturbations that are indistinguishable to human hearing, rather than merely keeping the noise added to the clean speech samples small. To achieve this, we use frequency masking, the phenomenon in which one faint but audible sound (the maskee) becomes inaudible in the presence of another, louder audible sound (the masker) [lin2015principles]. An adversarial perturbation is therefore inaudible as long as it falls under the masking threshold of the original speech. In [lin2015principles], Lin et al. described an algorithm for computing the masking threshold, which consists of three steps.
STEP 1: Identification of maskers
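To obtain the frequency masking threshold of the original speech, the raw time-domain audio signal is first converted into a time-frequency representation by the short-time Fourier transform (STFT). The STFT output $s_x(k)$ refers to the $k$-th bin of the spectrum at frame $x$. Then, the power spectral density (PSD) of $s_x(k)$ can be computed as
$$p_x(k) = 10\log_{10}\left|\frac{1}{N}s_x(k)\right|^2, \tag{2}$$
where $N$ is the STFT window length.
After that, the PSD estimate $p_x(k)$ is normalized to a sound pressure level (SPL) of 96 dB,
$$\bar{p}_x(k) = 96 - \max_{k}\{p_x(k)\} + p_x(k). \tag{3}$$
The normalized PSD estimate of a reasonable masker must satisfy three constraints. First, it must be a local maximum,
$$\bar{p}_x(k-1) \le \bar{p}_x(k) \quad \text{and} \quad \bar{p}_x(k) \ge \bar{p}_x(k+1). \tag{4}$$
Second, it must be larger than the absolute threshold of hearing (ATH),
$$\bar{p}_x(k) \ge \mathrm{ATH}(k). \tag{5}$$
Finally, within any group of maskers that lie within 0.5 Bark (a psychoacoustically motivated frequency scale) of each other, only the masker with the highest SPL is retained,
$$\bar{p}_x(k) \ge \bar{p}_x(k') \quad \text{for every masker } k' \text{ with } |b(k) - b(k')| \le 0.5, \tag{6}$$
where $b(k)$ is the frequency of bin $k$ on the Bark scale.
Since the masking effect is additive in the logarithmic domain, the SPL of each masker can be further smoothed by
$$\bar{p}^{m}_x(k) = 10\log_{10}\left[10^{\bar{p}_x(k-1)/10} + 10^{\bar{p}_x(k)/10} + 10^{\bar{p}_x(k+1)/10}\right]. \tag{7}$$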
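Step 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the Bark mapping (Zwicker's approximation) and the ATH curve (Terhardt's approximation) are commonly used forms that the text does not specify, the `1e-20` floor is an added numerical guard, and all function names are hypothetical.

```python
import math

def bark(freq_hz):
    # Assumed Bark mapping (Zwicker's approximation).
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

def ath_db(freq_hz):
    # Assumed ATH curve in dB SPL (Terhardt's approximation); requires freq_hz > 0.
    f = freq_hz / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

def normalized_psd(spectrum, n_fft):
    # PSD of one STFT frame in dB, shifted so that its maximum sits at 96 dB SPL.
    psd = [10.0 * math.log10(max(abs(s / n_fft) ** 2, 1e-20)) for s in spectrum]
    peak = max(psd)
    return [96.0 - peak + p for p in psd]

def find_maskers(psd_bar, freqs_hz):
    # Constraints 1 and 2: a local maximum that is at least as loud as the ATH.
    candidates = [
        k for k in range(1, len(psd_bar) - 1)
        if psd_bar[k - 1] <= psd_bar[k] >= psd_bar[k + 1]
        and psd_bar[k] >= ath_db(freqs_hz[k])
    ]
    # Constraint 3: keep a candidate only if no stronger one lies within 0.5 Bark.
    kept = [
        k for k in candidates
        if all(psd_bar[k] >= psd_bar[j] for j in candidates
               if abs(bark(freqs_hz[k]) - bark(freqs_hz[j])) <= 0.5)
    ]

    # Smooth each surviving masker over its two neighbours in the power domain.
    def smooth(k):
        return 10.0 * math.log10(sum(10.0 ** (psd_bar[j] / 10.0) for j in (k - 1, k, k + 1)))

    return [(k, smooth(k)) for k in kept]

# Toy frame: 16 kHz audio, 2048-point STFT, a strong tone at bin 128 and a weaker one at bin 130.
n_fft, sample_rate = 2048, 16000
freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
frame = [1e-5] * len(freqs)
frame[128], frame[130] = 100.0, 10.0
maskers = find_maskers(normalized_psd(frame, n_fft), freqs)
```

In the toy frame, the two tones are only about 16 Hz apart, far less than 0.5 Bark, so only the stronger one at bin 128 survives as a masker.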
STEP 2: Calculation of individual masking thresholds
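An individual masking threshold $T[b(i), b(j)]$ describes how much the masker at frequency index $i$ contributes to the masking effect on the maskee at frequency index $j$, where $b(i)$ and $b(j)$ are the masker’s and maskee’s frequencies on the Bark scale. The individual masking thresholds can be calculated as
$$T[b(i), b(j)] = \bar{p}^{m}_x(i) + \Delta_m[b(i)] + \mathrm{SF}[b(i), b(j)], \tag{8}$$
where $\Delta_m[b(i)] = -6.025 - 0.275\,b(i)$ and $\mathrm{SF}[b(i), b(j)]$ is a two-slope spread function.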
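To make Step 2 concrete, the sketch below evaluates the individual masking threshold for a single masker and maskee. The constants of the masker offset and of the two-slope spread function are assumptions taken from the formulation commonly used with this psychoacoustic model; the text itself does not list them, and the function names are hypothetical.

```python
def spread_function(b_masker, b_maskee, masker_spl):
    # Two-slope spread function in dB: a steep fixed slope below the masker,
    # and a level-dependent, shallower slope above it.
    delta_b = b_maskee - b_masker
    if delta_b <= 0:
        return 27.0 * delta_b
    upper_slope = -27.0 + 0.37 * max(masker_spl - 40.0, 0.0)
    return upper_slope * delta_b

def individual_threshold(b_masker, b_maskee, masker_spl):
    # Masker SPL plus a frequency-dependent offset plus the spread to the maskee.
    offset = -6.025 - 0.275 * b_masker
    return masker_spl + offset + spread_function(b_masker, b_maskee, masker_spl)

# A 96 dB masker at 8.5 Bark, evaluated at, below and above its own frequency.
at_masker = individual_threshold(8.5, 8.5, 96.0)
one_bark_below = individual_threshold(8.5, 7.5, 96.0)
one_bark_above = individual_threshold(8.5, 9.5, 96.0)
```

The asymmetry is the point of the two slopes: a loud masker hides sounds above its frequency much more effectively than sounds below it.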
STEP 3: Calculation of global masking threshold
After the individual masking thresholds are obtained, the global masking threshold can be calculated by combining them with the absolute threshold of hearing. The global masking threshold $\theta_x(i)$ at frequency index $i$ is calculated according to
$$\theta_x(i) = 10\log_{10}\left[10^{\mathrm{ATH}(i)/10} + \sum_{j=1}^{N_m} 10^{T[b(j), b(i)]/10}\right], \tag{9}$$
where $\mathrm{ATH}(i)$ is the SPL of the threshold in quiet at frequency index $i$, $N_m$ is the number of maskers, and $T[b(j), b(i)]$ is the corresponding individual masking threshold. More details on the calculation of the masking threshold can be found in [lin2015principles].
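Step 3 amounts to summing powers rather than decibels. A minimal sketch, assuming all inputs are given in dB SPL for one frequency index (the function name is hypothetical):

```python
import math

def global_threshold(ath_db, individual_db):
    # Convert the threshold in quiet and every individual threshold to power,
    # add them up, and convert the sum back to dB.
    power = 10.0 ** (ath_db / 10.0) + sum(10.0 ** (t / 10.0) for t in individual_db)
    return 10.0 * math.log10(power)

no_maskers = global_threshold(3.0, [])               # falls back to the threshold in quiet
two_maskers = global_threshold(-100.0, [60.0, 60.0])  # two equal contributions add about 3 dB
```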
2.3 Optimization procedure
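Given an input speech $x$, its speaker label $y$, an arbitrary target speaker label $y'$, where $y' \neq y$, and a well-trained x-vector speaker recognition model $f(\cdot)$, the additional loss function that pushes the perturbation under the masking threshold can be defined as
$$\ell_\theta(x, \delta) = \frac{1}{\lfloor N/2 \rfloor + 1} \sum_{k=0}^{\lfloor N/2 \rfloor} \max\{\bar{p}_\delta(k) - \theta_x(k),\, 0\}, \tag{10}$$
where $\bar{p}_\delta(k)$ is the normalized PSD estimate of $\delta$ at the $k$-th frequency bin and $N$ is the STFT window length. The inaudible adversarial perturbation can be generated by
$$\min_{\delta}\ \ell\big(f(x+\delta),\, y'\big) + \alpha \cdot \ell_\theta(x, \delta), \tag{11}$$
where $\ell(\cdot)$ drives the adversarial example to fool the well-trained speaker recognition system into predicting the arbitrary target label $y'$, and $\ell_\theta(\cdot)$ constrains the normalized PSD estimate of the perturbation to be inaudible. The hyperparameter $\alpha$ scales the two losses.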
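The masking loss can be read as a hinge penalty on every frequency bin in which the perturbation rises above the masking threshold. The sketch below assumes the penalty is averaged over bins and that the classification loss is supplied by the caller; the names are hypothetical.

```python
def masking_loss(psd_delta, theta):
    # Mean over frequency bins of how far the perturbation PSD exceeds the threshold (dB).
    return sum(max(p - t, 0.0) for p, t in zip(psd_delta, theta)) / len(theta)

def total_loss(classification_loss, psd_delta, theta, alpha):
    # Targeted classification loss plus the scaled masking penalty.
    return classification_loss + alpha * masking_loss(psd_delta, theta)

# Only the middle bin exceeds its threshold, by 10 dB.
loss = masking_loss([50.0, 70.0, 30.0], [60.0, 60.0, 60.0])
```

Bins that already lie under the threshold contribute nothing, so the optimizer is free to place perturbation energy wherever the original speech masks it.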
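The whole optimization procedure is separated into two stages. In Attack Stage1, we focus on finding a relatively small perturbation using a common norm-based algorithm as defined in Eq. (1). The perturbation $\delta$ is initialized to a zero vector and the bound $\epsilon$ is gradually reduced from a large value. In each iteration, $\delta$ is updated by
$$\delta \leftarrow \mathrm{clip}_{\epsilon}\Big(\delta - lr_1 \cdot \mathrm{sign}\big(\nabla_{\delta}\, \ell(f(x+\delta),\, y')\big)\Big), \tag{12}$$
where $lr_1$ is the learning rate of this stage.
In Attack Stage2, we further optimize the above perturbation by introducing the frequency-masking-based loss as defined in Eq. (11). The scale $\alpha$ is adaptively updated from its initial value based on the performance of the attack. In each iteration, $\delta$ is updated to be inaudible through
$$\delta \leftarrow \delta - lr_2 \cdot \nabla_{\delta}\Big[\ell\big(f(x+\delta),\, y'\big) + \alpha \cdot \ell_\theta(x, \delta)\Big], \tag{13}$$
where $lr_2$ is the learning rate of this stage.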
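One possible reading of the two-stage procedure is sketched below. It assumes a signed-gradient step with element-wise clipping in Attack Stage1 and a plain gradient step in Attack Stage2, with the gradients computed elsewhere by automatic differentiation. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)

def stage1_step(delta, grad, lr, eps):
    # Signed-gradient step, then clip every sample to [-eps, eps].
    return [max(-eps, min(eps, d - lr * sign(g))) for d, g in zip(delta, grad)]

def stage2_step(delta, grad, lr):
    # Plain gradient step on the combined loss; no explicit norm bound.
    return [d - lr * g for d, g in zip(delta, grad)]

def update_scale(alpha, attack_succeeded, up, down):
    # Grow alpha after a success (more weight on inaudibility), shrink it after a failure.
    return alpha * up if attack_succeeded else alpha * down

# Hypothetical numbers, for illustration only.
delta = stage1_step([0.0, 0.0, 0.0], grad=[0.5, -2.0, 0.0], lr=100.0, eps=50.0)
refined = stage2_step([1.0, -1.0], grad=[0.5, 0.5], lr=2.0)
alpha = update_scale(0.05, attack_succeeded=True, up=1.2, down=0.8)
```

Separating the stages this way lets the first stage find any perturbation that reaches the target speaker, while the second stage trades attack strength against audibility through the scale parameter.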
3 Experimental setup
We use the Mandarin Aishell-1 corpus [bu2017aishell] as the evaluation dataset. The entire corpus contains 400 speakers (214 female, 186 male), sampled at 16 kHz, and is split into training, development and test sets with no speaker overlap. The training set is used to train the x-vector baseline, while the test set is used to evaluate the baseline system. To conduct inaudible adversarial targeted attacks, we randomly choose 10 female speakers (denoted as F) and 10 male speakers (denoted as M) from the training set, each with 100 utterances, as the original speaker set. Another 10 female speakers (denoted as F’) and 10 male speakers (denoted as M’) are selected as the attack targets. We arrange these selected sets into 4 test modes. The first uses the 10 male original speakers to attack the 10 male target speakers, denoted as M2M’. Similarly, the other three test modes are M2F’, F2M’ and F2F’.
In addition, we use the music portion of the MUSAN corpus [snyder2015musan] as our non-speech dataset, which consists of western art music (e.g., Baroque, Romantic, and Classical) and popular genres (e.g., jazz, bluegrass, hip-hop, etc.). We randomly choose 200 pieces of western art music and cut them into 1000 segments of 6 seconds each. This subset is used as the original waveforms to attack the selected male target speakers.
3.2 Experimental setup
We use the x-vector system [snyder2018x] as our baseline. 30-dimensional Mel-frequency cepstral coefficient (MFCC) features are extracted as the input for all experiments. The configuration of the x-vector network is exactly the same as in [snyder2018x]: a 5-layer TDNN with ReLU followed by batch normalization is used for extracting frame-level hidden features. The number of hidden nodes is 512 and the dimension of the frame-level hidden features for pooling is 1500. Each frame-level feature is generated from a 15-frame context of acoustic features. A pooling layer aggregates the frame-level features and is followed by 2 fully-connected layers with ReLU activation functions and batch normalization, and a softmax output layer. The equal error rate (EER) of the x-vector baseline system is 4.27%. Note that we use the whole sentence as input instead of chunks as in [snyder2018x], because we need to compute the gradient with respect to the sentence-level perturbation.
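The 15-frame context follows from the temporal offsets of the five TDNN layers. The sketch below assumes the layer offsets of the standard x-vector architecture in [snyder2018x], which are not restated here; the helper name is hypothetical.

```python
def total_context(layer_offsets):
    # Each layer widens the receptive field by its largest left and right offsets.
    left = sum(-min(offsets) for offsets in layer_offsets)
    right = sum(max(offsets) for offsets in layer_offsets)
    return left + right + 1

# Assumed frame-level layer offsets of the standard x-vector TDNN.
XVECTOR_OFFSETS = [
    (-2, -1, 0, 1, 2),  # layer 1
    (-2, 0, 2),         # layer 2
    (-3, 0, 3),         # layer 3
    (0,),               # layer 4
    (0,),               # layer 5
]

context_frames = total_context(XVECTOR_OFFSETS)
```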
After training the x-vector baseline system, we calculate the speaker prediction for the original utterances with their true labels. The accuracy for the M set is 95.9%, while the accuracy for the F set is 97.9%. We also calculate the prediction accuracy for the original utterances with assigned target speakers. All the results of the four test modes are 0.00%.
3.2.2 Inaudible adversarial perturbations
We first compute the STFT of original speech to get the time-frequency representations. The window type of STFT is the modified Hann window with a length of 2048 and a hop length of 512. In Attack Stage1, the learning rate is set to be and the will be updated times for each mini-batch. We use the norm to measure the perturbation bound. The starts from 2000 and will multiply when attacking successfully. In Attack Stage2, the learning rate is and the total training step for each mini-batch is . The scale parameter begins with and will increase to when attacking successfully or decrease to
when it fails. All systems are implemented in PyTorch [paszke2017automatic] and optimized with the Adam optimizer [kingma2014adam].
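For orientation, the STFT settings above translate into the following time and frequency resolution at the 16 kHz sampling rate of the corpus. The sketch assumes that the FFT size equals the window length and, for the frame count, that no padding is applied; both are assumptions rather than details given in the text.

```python
sample_rate = 16000   # Hz, from the corpus description
win_length = 2048     # samples
hop_length = 512      # samples

num_bins = win_length // 2 + 1                  # one-sided spectrum
bin_width_hz = sample_rate / win_length         # frequency resolution per bin
window_ms = 1000.0 * win_length / sample_rate   # duration of one analysis window
hop_ms = 1000.0 * hop_length / sample_rate      # time shift between frames

def num_frames(num_samples):
    # Assumes no padding: only windows that fit entirely inside the signal are counted.
    return 1 + (num_samples - win_length) // hop_length

frames_in_6s = num_frames(6 * sample_rate)
```

Each frame therefore yields 1025 bins spaced 7.8125 Hz apart, which is the grid on which the masking threshold is evaluated.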
3.2.3 Evaluation metrics
We use several metrics to measure the performance of the proposed method. First, we compute the attack success rate to evaluate the performance of targeted attacks on speaker recognition. Formally, it is computed as
$$\text{Attack success rate} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}} \times 100\%, \tag{14}$$
where $N_{\mathrm{total}}$ is the total number of test utterances and $N_{\mathrm{success}}$ is the number of utterances that attack successfully. In addition, the perceptual evaluation of speech quality (PESQ) [rix2001perceptual] and the signal-to-noise ratio (SNR) are computed to measure the distortion of the generated adversarial examples. Finally, we conduct a subjective evaluation to assess the adversarial examples in terms of human perception.
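Two of these metrics are simple enough to sketch directly. The SNR definition below, the ratio of clean-signal energy to perturbation energy in dB, is an assumed standard form, since the text does not spell it out; PESQ requires a dedicated implementation and is not shown. The function names are hypothetical.

```python
import math

def attack_success_rate(predicted, targets):
    # Percentage of adversarial utterances classified as their assigned target speaker.
    hits = sum(1 for p, t in zip(predicted, targets) if p == t)
    return 100.0 * hits / len(targets)

def snr_db(clean, perturbation):
    # Energy of the clean signal relative to the energy of the added perturbation.
    signal_power = sum(x * x for x in clean)
    noise_power = sum(d * d for d in perturbation)
    return 10.0 * math.log10(signal_power / noise_power)

rate = attack_success_rate([1, 2, 3, 4], [1, 2, 0, 4])
snr = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, 0.1, -0.1, -0.1])
```

A lower SNR means a more energetic perturbation, which is why an energy-based metric alone cannot tell whether the perturbation is audible.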
4 Experimental results and analysis
4.1 Inaudible adversarial targeted attack
Table 1 reports the attack success rate for all four test modes. Since the optimization procedure is separated into two stages (Section 2.3), we test the adversarial examples generated in each stage, denoted as Attack Stage1 and Attack Stage2, respectively. System Attack Stage1 attacks with the adversarial examples generated in Attack Stage1, which only focuses on finding a small perturbation; the targeted attack succeeds against the speaker model in 72.6%, 73.8%, 73.3% and 71.3% of cases in the four test modes. System Attack Stage2 uses the frequency masking method to generate inaudible adversarial perturbations; its targeted attack success rates in the four test modes are 98.5%, 97.6%, 96.7% and 93.8%. Adversarial examples from both stages can thus conduct targeted speaker attacks, and the higher success rate of System Attack Stage2 indicates the effectiveness of the inaudible adversarial perturbations in targeted attacks.
4.2 Objective evaluation and subjective listener evaluation
After conducting the attacks, we analyze the adversarial examples from each attack stage. Fig. 2 shows the objective performance of the generated adversarial examples. The objective performance of the Attack Stage1 adversarial examples is slightly better than that of Attack Stage2. The reason is that frequency masking only hides the perturbation under the masking threshold and does not decrease the energy of the perturbation. We therefore also perform a subjective test on the similarity between the adversarial examples and the original waveform, to find out whether the perturbations generated in Attack Stage2 are inaudible to listeners.
To subjectively evaluate the performance of both attack stages, we conduct an ABX preference test. 20 utterance pairs are chosen randomly from the four test modes as evaluation speech, and each pair is judged by 30 participants. The two voices in each pair are the adversarial examples generated in Attack Stage1 and Attack Stage2, respectively. Participants were asked to judge which one is more similar to the original voice.
Table 2 summarizes the ABX test results. Attack Stage2 obtains a better preference score than Attack Stage1 ($p$-value < 0.05). The result indicates that frequency masking makes the perturbations less audible when generating the adversarial examples, even with larger absolute energy. Samples of the generated adversarial examples can be found at https://pengchengguo.github.io/inaudible-advex-for-sv.
Table 2: ABX preference test between the adversarial examples of the two attack stages.
| Attack Stage1 | Neutral | Attack Stage2 |
4.3 Non-speech targeted attack
We also use music as the original input to conduct the targeted speaker attack. We pair each music segment with a target speaker label and measure the attack success rate. The results are shown in Table 3. We first test the system on the original music waveforms with the target speaker labels and obtain a prediction accuracy of 0.00%. With the adversarial examples generated in Attack Stage1 and Attack Stage2, we achieve attack success rates of 77.0% and 91.5%, respectively. This result demonstrates the effectiveness of the inaudible adversarial perturbations even when applied to a completely unrelated waveform.
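Table 3: Targeted attack success rate with music as the original input.
| Before Attack | Attack Stage1 | Attack Stage2 |
| 0.00% | 77.0% | 91.5% |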
5 Conclusion
In this study, we have proposed a targeted attack on speaker recognition systems that generates inaudible adversarial perturbations. In particular, the psychoacoustic principle of frequency masking is used to generate the adversarial examples. We constrain the perturbation to lie under the masking threshold of the original audio, instead of bounding it with a common distortion measure. Experiments on the Aishell-1 corpus show that our approach yields up to a 98.5% attack success rate against target speakers of either gender, while remaining indistinguishable to listeners. In the subjective listening evaluation, the frequency-masking-based adversarial perturbations obtain a 68.67% preference, which indicates that they are less audible, even with larger absolute energy. Furthermore, the results demonstrate that the approach remains effective when non-speech data, such as music, is used to conduct targeted speaker attacks.
In future work, we will explore more challenging scenarios, including both white-box and black-box targeted attacks, as well as defenses against adversarial examples. Over-the-air targeted attacks [xie2020real] and defenses are also part of our future plans.