Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification

06/11/2020 ∙ by Xu Li, et al. ∙ Tencent ∙ The Chinese University of Hong Kong

Recently, adversarial attacks on automatic speaker verification (ASV) systems have attracted widespread attention, as they pose severe threats to ASV systems. However, methods to defend against such attacks are limited. Existing approaches mainly focus on retraining ASV systems with adversarial data augmentation, and countermeasure robustness against different attack settings is insufficiently investigated. Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training. A VGG-like binary classification detector is introduced and demonstrated to be effective in detecting adversarial samples. To investigate detector robustness in a realistic defense scenario where unseen attack settings exist, we analyze various attack settings and observe that the detector is robust (6.27% EER_det degradation in the worst case) against unseen substitute ASV systems, but it has weak robustness (50.37% EER_det degradation in the worst case) against unseen perturbation methods. The weak robustness against unseen perturbation methods shows a direction for developing stronger countermeasures.


1 Introduction

Automatic speaker verification (ASV) systems aim at confirming a claimed speaker identity against a spoken utterance. They have been widely applied in commercial devices and authorization tools, such as household smart speakers, car navigation systems and e-banking authorization. However, recent studies have shown that a well-trained ASV system can be deceived by malicious attacks [kinnunen2012vulnerability, shchemelinin2014vulnerability, wu2015spoofing]. In the last decade, the speaker verification community has held several ASVspoof challenge competitions [wu2015asvspoof, kinnunen2017asvspoof, todisco2019asvspoof] to develop countermeasures, mainly against replay [williams2019speech, cai2019dku], speech synthesis [hanilcci2015classifiers, wu2012detecting] and voice conversion [wu2012detecting, correia2014preventing] attacks.

Very recently, another threat, namely adversarial attacks, has been explored on ASV systems. Adversarial attacks slightly perturb the input so that the system makes incorrect decisions. Kreuk et al. [kreuk2018fooling] added adversarial perturbations to testing utterances to attack an end-to-end ASV system, and the attack was verified to be successful in both cross-feature and cross-corpus settings. Li et al. [li2020adversarial] extended these studies to other ASV frameworks and observed adversarial transferability, where samples crafted on one ASV system can also attack another. Some works have also explored adversarial attacks in practical real-time scenarios [chen2019real, li2020practical, xie2020real] and attacks on spoofing countermeasures [liu2019adversarial].

Apart from the effective perturbations that pose severe threats to ASV systems, the perturbation variations caused by different attack settings also make defense approaches difficult to develop. In a realistic attack, different substitute ASV systems can be used to craft adversarial samples and perform effective attacks on the target ASV system in a transferable way [li2020adversarial]. The choice of the substitute ASV system, as one of the attack settings, results in different perturbation patterns. Perturbation patterns also vary greatly across perturbation methods [kurakin2016adversarial] and across perturbation configurations, e.g. perturbation degrees. Hence, countermeasure robustness against different attack settings, including the substitute ASV system, the perturbation method and the perturbation configuration, is another important concern.

Defense approaches against adversarial attacks have been investigated mostly in the image domain [song2017pixeldefend, xu2017feature, gong2017adversarial], while those explored in the ASV area are still very limited. Wang et al. [wang2019adversarial] leveraged adversarial samples in training an end-to-end ASV system as a regularization to improve system robustness. Wu et al. [wu2020defense] adopted a combination of spatial smoothing [xu2017feature] and adversarial training [goodfellow2014explaining] to strengthen countermeasures against adversarial samples. Both methods are found to be effective; however, they require retraining a well-developed ASV system with adversarial data augmentation. To the best of our knowledge, no existing work investigates countermeasure robustness against different settings of attacks spoofing ASV systems.

Inspired by [gong2017adversarial, samizade2019adversarial], this work makes the first attempt to defend ASV systems against adversarial attacks with a separate detection network. A VGG-like [nagrani2020voxceleb] binary classification system is introduced to capture the difference between adversarial and genuine samples and predict whether an input is adversarial or not. A separate detection countermeasure has the following advantages: 1) It separates defense and speaker verification into two independent stages, which avoids retraining a well-developed ASV model. 2) Since most existing countermeasures for replay and synthetic speech attacks are based on a separate detection network [williams2019speech, cai2019dku, hanilcci2015classifiers], the proposed approach makes it feasible to develop a unified countermeasure against all spoofing attacks.

In a realistic defense scenario, attack settings cannot be accessed by the defender, so the proposed detector can be degraded by unseen attack settings. To investigate detector robustness in such a realistic scenario and provide directions for developing stronger countermeasures, this work also gives a robustness discussion based on unseen attack settings, including substitute ASV systems, perturbation methods and perturbation degrees. In this work, the three most representative ASV frameworks are used as variations: the Gaussian mixture model (GMM) i-vector system [dehak2010front], the time delay neural network (TDNN) x-vector system [snyder2018x] and the ResNet-34 r-vector system [zeinali2019but]. Two of the most effective perturbation methods, i.e. the basic iterative method (BIM) [kurakin2016adversarial] and the Jacobian-based saliency map approach (JSMA) [papernot2016limitations], are applied to generate adversarial samples.

The contributions of this work include: 1) designing a dedicated defense network against adversarial attacks, rather than augmenting adversarial samples into ASV training; 2) introducing a VGG-like network and demonstrating its effectiveness in detecting adversarial samples; 3) investigating detector robustness against unseen attack settings, uncovering a lack of robustness against unseen perturbation methods and thereby providing directions for developing stronger countermeasures.

The remainder of this paper is organized as follows: Section 2 details the process of adversarial sample generation. The proposed adversarial sample detection network is described in Section 3. Section 4 analyzes the experimental results. Finally, Section 5 summarizes this work.

2 Adversarial Samples Generation

In a speaker verification task, given acoustic features x_e of the enrollment utterance and x_t of the testing utterance, a well-trained system function f(·; θ) with parameters θ predicts a similarity score f(x_e, x_t; θ), which indicates the speaker similarity between the enrollment and testing utterances. From an adversary's perspective, the goal is to optimize a perturbation δ to be added to x_t so that the system behaves incorrectly: either falsely rejecting the true target's voice or falsely accepting the imposter's voice. The optimization problem can be formulated as Eq. 1 and 2,

δ̂ = argmax_δ  k · f(x_e, x_t + δ; θ),   s.t. ||δ||_p ≤ ε        (1)

k = -1 for target trials,   k = +1 for non-target trials          (2)

where k is the trial indicator, and the constraint that the ℓ_p-norm of δ stays within the perturbation degree ε guarantees a subtle perturbation, so that humans cannot perceive the difference between adversarial and genuine samples.

To investigate detector robustness against different attack settings, we leverage three different ASV system architectures and two perturbation methods in our experiments. The details of the ASV systems and perturbation methods are described in Sections 2.1 and 2.2, respectively.

2.1 ASV systems

In this work, we adopt the following three ASV systems: GMM i-vector with a probabilistic linear discriminant analysis (PLDA) back-end [dehak2010front], TDNN x-vector with a PLDA back-end [snyder2018x] and ResNet-34 r-vector with a cosine back-end [zeinali2019but]. All three ASV models adopt cepstral features with the same settings as in [li2020adversarial] as input features.

The i-vector system [dehak2010front] uses a universal background model with 2048 Gaussian mixtures with full covariance matrices. The total variability matrix projects utterance statistics into a 400-dimensional i-vector space. The i-vectors are centered and length-normalized before PLDA modeling.

The x-vector system adopts the network architecture in [snyder2018x], except that the additive angular margin softmax (AAM-softmax) loss [xiang2019margin] is used for training the extractor. Extracted x-vectors are centered and projected by a 200-dimensional LDA, then length-normalized before PLDA modeling.

The r-vector system has the same architecture as in [zeinali2019but], and the additive margin softmax (AM-softmax) loss [xiang2019margin] is used for training the network. Extracted r-vectors are centered and length-normalized before cosine scoring.

2.2 Perturbation methods

In this work, BIM [kurakin2016adversarial] and JSMA [papernot2016limitations] are employed to generate adversarial samples.

BIM perturbs the genuine input x_t along the gradient of the objective w.r.t. x_t in a multiple-step manner to generate adversarial samples. It optimizes the perturbation with the norm constraint parameter p in Eq. 1 set to ∞. Starting from the genuine input x_t^0 = x_t, the input is perturbed iteratively as follows:

x_t^{n+1} = clip_{x_t, ε}( x_t^n + α · sign(∇_{x_t^n} f(x_e, x_t^n; θ)) ),   for n = 0, 1, ..., N-1        (3)

where sign(·) is a function that takes the sign of the gradient, α absorbs the trial indicator and its absolute value is the step size, N is the number of iterations, and clip_{x_t, ε}(·) holds the norm constraint by applying element-wise clipping such that ||x_t^{n+1} - x_t||_∞ ≤ ε. In our experiments, N is set to 5, and |α| is set to the perturbation degree ε divided by N.
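As a concrete illustration, the BIM update in Eq. 3 can be sketched as below. This is a minimal PyTorch sketch, assuming a differentiable scoring function score_fn(x_enroll, x_test) that returns a scalar similarity score; the function and variable names are illustrative, not from the paper.

import torch

def bim_attack(score_fn, x_enroll, x_test, epsilon, n_iter=5, target_trial=True):
    # alpha absorbs the trial indicator: push the score down for target trials,
    # up for non-target trials; |alpha| = epsilon / N as described in the text.
    alpha = (epsilon / n_iter) * (-1.0 if target_trial else 1.0)
    x_adv = x_test.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        score = score_fn(x_enroll, x_adv)            # scalar similarity score
        grad, = torch.autograd.grad(score, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # element-wise clipping keeps the perturbation inside the L_inf ball of radius epsilon
            x_adv = torch.min(torch.max(x_adv, x_test - epsilon), x_test + epsilon)
        x_adv = x_adv.detach()
    return x_adv

In the experiments, such a scoring function would correspond to one of the substitute ASV systems (e.g. a PLDA or cosine back-end score), but the exact interface above is an assumption for illustration.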

JSMA is another effective perturbation method for crafting adversarial samples. Unlike BIM, which adds perturbations to the whole input, JSMA perturbs only one bit at a time. In each iteration, it selects the bit with the most significant effect on the output to be perturbed. For this purpose, a saliency score is calculated for each bit, and the bit with the highest score is chosen to be perturbed. We formulate the algorithm specialized to our case, as shown in Algorithm 1. The saliency computation at Step 4 takes the absolute value of the gradient while masking out the bits that have already reached the constraint boundary: S = |∇_{x_t^n} f(x_e, x_t^n; θ)| ⊙ M, where |·| denotes the element-wise absolute value and ⊙ is the element-wise product operator. In this work, N is set to 300 iterations, and the step size α is set to half of the perturbation degree ε.

0:  Input: x_e, x_t, f(·; θ), α, ε, N
1:  x_t^0 ← x_t,  M ← 1,  n ← 0
2:  for n < N do
3:     ∇ ← ∇_{x_t^n} f(x_e, x_t^n; θ)
4:     S ← |∇| ⊙ M
5:     (i, j) ← argmax_{(i, j)} S
6:     x_t^{n+1} ← x_t^n,  x_t^{n+1}[i, j] ← x_t^n[i, j] + α · sign(∇[i, j])
7:     if |x_t^{n+1}[i, j] − x_t[i, j]| ≥ ε then
8:        M[i, j] ← 0
9:     end if
10:    n ← n + 1
11:  end for
12:  return x_t^N
Algorithm 1 JSMA perturbation method
x_e and x_t are acoustic features of the enrollment and testing utterances, respectively. f(·; θ) is the system function with parameters θ, α is the step size, ε is the perturbation degree, and N is the number of iterations. M is a mask matrix of the same size as x_t, initialized as an all-one matrix 1.
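For concreteness, a minimal PyTorch sketch of Algorithm 1 follows, under the same assumptions as the BIM sketch above (a differentiable score_fn); how the trial direction is handled here is an assumption, not taken from the paper.

import torch

def jsma_attack(score_fn, x_enroll, x_test, alpha, epsilon, n_iter=300, target_trial=True):
    direction = -1.0 if target_trial else 1.0     # assumed trial-direction handling
    x_adv = x_test.clone().detach()
    mask = torch.ones_like(x_test)                # M: bits still allowed to be perturbed
    for _ in range(n_iter):
        x_in = x_adv.clone().requires_grad_(True)
        score = score_fn(x_enroll, x_in)
        grad, = torch.autograd.grad(score, x_in)  # Step 3: gradient w.r.t. the current input
        saliency = grad.abs() * mask              # Step 4: mask out saturated bits
        idx = torch.argmax(saliency)              # Step 5: most salient bit (flattened index)
        x_adv = x_adv.clone()
        x_adv.view(-1)[idx] += direction * alpha * grad.view(-1)[idx].sign()   # Step 6
        # Steps 7-9: freeze the bit once it reaches the L_inf boundary
        if (x_adv.view(-1)[idx] - x_test.view(-1)[idx]).abs() >= epsilon:
            mask.view(-1)[idx] = 0.0
    return x_adv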

2.3 Dataset generation

Our experiments are conducted on the VoxCeleb1 [nagrani2017voxceleb] dataset, which consists of 153,516 short clips of human speech from 1,251 speakers in total. Following the data partitioning in [nagrani2017voxceleb], 148,642 utterances from 1,211 speakers are used to train the ASV systems, and the remaining 4,874 utterances from 40 speakers are used for testing the systems and generating adversarial samples. The corpus [nagrani2017voxceleb] provides 37,720 trials in total, consisting of enrollment-testing utterance pairs selected from the testing utterances.

For each attack configuration, specified by the substitute ASV system, perturbation method and perturbation degree, we generate an “adversary-genuineness” dataset consisting of both adversarial and genuine samples. To make a balanced dataset, for each genuine utterance, we randomly select one trial in which that utterance appears and use it to generate an adversarial counterpart according to the attack configuration via Eq. 1. Each “adversary-genuineness” dataset thus contains around 9K utterances, including around 4.5K adversarial and 4.5K genuine utterances.

To evaluate the detection network, we separate the “adversary-genuineness” dataset of each attack configuration into training and testing subsets, with 30 speakers’ data for training and 10 speakers’ data for testing. The speaker partitioning for training and testing is consistent across all attack configurations. This guarantees that source utterances (either a genuine utterance or an adversarial utterance generated from it) in the testing subsets are not observed during training.
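As an illustration, the speaker-disjoint split described above could look like the following sketch; the data layout and variable names are illustrative assumptions, not the paper's actual tooling.

import random

def split_by_speaker(utterances, train_ratio=0.75, seed=0):
    """Split an 'adversary-genuineness' dataset so that train/test speakers are disjoint.

    utterances: list of dicts like {"path": ..., "speaker": ..., "label": "genuine" or "adversarial"}
    With 40 test-set speakers, train_ratio=0.75 corresponds to the 30/10 speaker split.
    A fixed seed keeps the speaker partition consistent across attack configurations.
    """
    speakers = sorted({u["speaker"] for u in utterances})
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * train_ratio)
    train_spk = set(speakers[:n_train])
    train = [u for u in utterances if u["speaker"] in train_spk]
    test = [u for u in utterances if u["speaker"] not in train_spk]
    return train, test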

3 Adversarial Samples Detection

In this section, we present our proposed system for detecting adversarial samples. We adopt Mel-frequency cepstral coefficients (MFCCs) unfolded in time as the input feature map forwarded to our detection network. Pre-emphasis with a coefficient of 0.97 is applied. A Hamming window with a size of 25 ms and a step size of 10 ms is applied to extract each frame, and 24 cepstral coefficients are kept. Other possible features will be investigated in future work.
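A minimal feature-extraction sketch matching this description is given below, using librosa; the 16 kHz sampling rate is an assumption about the corpus, and the function names come from librosa rather than the paper.

import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=24):
    # Pre-emphasis 0.97, 25 ms Hamming window, 10 ms hop, 24 cepstral coefficients.
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        win_length=int(0.025 * sr), window="hamming",
    )
    return mfcc.T          # shape: (num_frames, 24), i.e. MFCCs unfolded in time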

Before designing the detector architecture, we note two properties of adversarial samples: 1) the deviation between adversarial and genuine samples is subtle and localized on the feature map, so convolutional layers at the bottom are suited to capturing such deviations; 2) the adversarial characteristics exist throughout the whole utterance, so a pooling layer can be adopted to aggregate utterance-level statistics for the decision. Based on these considerations, we introduce a VGG-like network structure [nagrani2020voxceleb] to detect adversarial samples. The detailed architecture configuration is given in Table 1. Four convolutional layers at the bottom capture local feature patterns. A statistics pooling layer aggregates the mean and standard deviation of the last convolutional layer's outputs and forwards them to dense layers. Finally, two dense layers project the statistics into a 2-dimensional output space to predict whether the input is genuine or adversarial. The network is trained with the Adam [kingma2014adam] optimizer with an initial learning rate of 0.001.
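A hedged PyTorch sketch of such a detector is shown below. The convolutional channel and kernel sizes are illustrative assumptions (the paper's exact configuration is in Table 1); the statistics pooling, the two 512-dimensional dense layers with dropout 0.2, the 2-way softmax output and the Adam optimizer with learning rate 0.001 follow the description above, while the cross-entropy loss is an assumed but standard choice.

import torch
import torch.nn as nn

class AdvDetector(nn.Module):
    """VGG-like binary detector: Conv2D stack -> statistics pooling over time -> dense layers."""

    def __init__(self, n_ceps=24):
        super().__init__()
        # Four Conv2D blocks capture local deviations on the MFCC feature map
        # (channel and kernel sizes here are illustrative, not the paper's exact values).
        channels = [1, 32, 64, 128, 128]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.conv = nn.Sequential(*layers)
        stat_dim = 2 * channels[-1] * (n_ceps // 16)   # mean + std, frequency axis halved 4 times
        self.fc = nn.Sequential(
            nn.Linear(stat_dim, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 2),                         # genuine vs. adversarial logits
        )

    def forward(self, x):              # x: (batch, 1, n_frames, n_ceps)
        h = self.conv(x)               # (batch, C, T', F')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling over time
        return self.fc(stats.flatten(1))

# Adam with an initial learning rate of 0.001, as in the text; cross-entropy loss assumed.
model = AdvDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()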

4 Experiments

4.1 Evaluation metrics

To verify the effectiveness of adversarial attacks, we evaluate ASV system performance before and under adversarial attacks in terms of the equal error rate (EER) and the minimum detection cost function with target trial priors of 0.01 and 0.001, i.e. minDCF(0.01) and minDCF(0.001). When evaluating detector performance, we report the detection accuracy (DA) over the “adversary-genuineness” testing subset. In addition, regardless of the operating point, we use the detector’s log-softmax output at the adversarial class node as the adversarial score, and compute an EER (EER_det) from the adversarial scores over the testing subset.
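For reference, a small sketch of how EER_det could be computed from the detector's outputs is shown below; scikit-learn's roc_curve is used for convenience, and the convention that class index 1 is the adversarial class is an assumption.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_curve

def detection_eer(model, features, labels):
    # Adversarial score = log-softmax output at the adversarial class (index 1 assumed).
    with torch.no_grad():
        scores = F.log_softmax(model(features), dim=1)[:, 1].cpu().numpy()
    fpr, tpr, _ = roc_curve(labels, scores)    # labels: 1 = adversarial, 0 = genuine
    fnr = 1.0 - tpr
    i = np.argmin(np.abs(fnr - fpr))           # operating point where FAR is closest to FRR
    return (fpr[i] + fnr[i]) / 2.0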

Layer Structure Activation
Conv2D ReLU
Statistics Pooling - -
Flatten - -
Dense1 512, dropout 0.2 ReLU
Dense2 512, dropout 0.2 ReLU
Output 2 Softmax
Table 1: Detailed configurations of the proposed detector.

4.2 Adversarial attack performance

The attack results on the x-vector system are shown in Table 2. The results on the i-vector and r-vector systems show similar trends. From Table 2, we observe that ASV system performance drops severely under both perturbation methods. Moreover, the attack effectiveness increases as the perturbation degree increases. However, perturbations with a higher degree are easier to detect, as discussed in Section 4.3. This suggests a trade-off for attackers between crafting perturbations that are effective and perturbations that cannot be easily detected.

                       EER (%)   minDCF(0.01)   minDCF(0.001)
genuine                5.97      0.515          0.695
BIM                    39.87     0.995          0.996
                       95.02     1              1
                       99.96     1              1
JSMA                   20.41     0.880          0.932
                       48.28     0.995          0.995
                       60.22     1              1
Table 2: The x-vector system performance under different attack configurations. Within each perturbation method, rows are ordered by increasing perturbation degree.
BIM-xvec                     training: small ε   medium ε   large ε
evaluation: small ε          99.83               48.65      48.61
evaluation: medium ε         99.82               100.00     87.01
evaluation: large ε          99.83               100.00     100.00
JSMA-xvec                    training: small ε   medium ε   large ε
evaluation: small ε          99.44               59.84      48.61
evaluation: medium ε         99.83               100.00     98.41
evaluation: large ε          99.83               100.00     100.00
Table 3: Detection accuracy (%) against perturbation degrees. Rows (evaluation) and columns (training) are ordered by increasing perturbation degree ε; diagonal entries are in-domain.

4.3 Robustness against perturbation degree

In this section, we discuss detector robustness against the perturbation degree. Adversarial samples crafted from the x-vector system with the BIM and JSMA perturbation methods are involved. The detection accuracy (DA) under different conditions is shown in Table 3. The diagonal results are based on in-domain evaluation and reflect that the proposed binary classifier is effective, distinguishing the adversarial and genuine data distributions with an accuracy over 99%. It is also observed that the detector generalizes well from adversarial samples with a small perturbation to those with a larger perturbation. However, performance drops greatly in the reverse direction. This indicates that we should craft small perturbations to develop the detector, so that it can defend ASV systems well against adversarial samples with equal or higher perturbation degrees.

4.4 Robustness against substitute ASV systems

In this section, we investigate detector robustness against substitute ASV systems. We conduct experiments with the BIM perturbation method attacking three different ASV systems, i.e. the i-vector, x-vector and r-vector systems. Experiments based on the JSMA method yield similar observations. The DA and EER_det are shown in Table 4. From the DA results, we observe that the detector achieves high detection accuracy in most unseen cases, even though the system architectures are totally different. To see how the detector characterizes adversarial samples crafted from in-domain and unseen ASV systems, we visualize the output adversarial score distribution for genuine samples and for in-domain and unseen adversarial samples, e.g. training on the r-vector system while evaluating on the i-vector system, as shown in Fig. 1. It shows that the detector generalizes well to unseen ASV systems by assigning most adversarial samples high adversarial scores. In some cases where a low DA occurs, e.g. training on the i-vector system while evaluating on the r-vector system (72.45%), the detector still achieves an acceptable EER_det (6.27%). This indicates that the detector still works well, but a shifted operating point is needed to detect adversarial samples.

DA (%) training
BIM-ivec BIM-xvec BIM-rvec
evaluation BIM-ivec 99.87 99.78 99.44
BIM-xvec 99.65 99.83 99.39
BIM-rvec 72.45 76.38 99.70
EER_det (%) training
BIM-ivec BIM-xvec BIM-rvec
evaluation BIM-ivec 0 0.18 0.55
BIM-xvec 0.46 0.18 0.65
BIM-rvec 6.27 5.90 0.28
Table 4: Detector performance against ASV systems

4.5 Robustness against perturbation methods

In this section, we investigate detector robustness against perturbation methods. Detector performance is evaluated on adversarial samples crafted by BIM and JSMA attacking the x-vector system, as shown in Table 5. We observe an EER_det of 10.15% when generalizing from JSMA to BIM and 50.55% from BIM to JSMA. This indicates that the generalizability is not symmetric and can degrade to nearly random guessing in some cases (50.55% EER_det). This phenomenon shows a limited detector robustness against unseen perturbation methods. A detector trained on a combination of both methods performs well on both, which suggests enlarging the training dataset to include as many existing perturbation methods as possible to enhance the model's robustness. To deal with unseen perturbation methods, we believe that a proper combination of observed perturbation methods can reinforce the detector's robustness against them. We leave this combination strategy for future studies.

Figure 1: The adversarial score distribution for genuine samples, and adversarial samples crafted from in-domain and unseen ASV systems (training: BIM-rvec, evaluation: BIM-ivec).
DA (%) training
BIM JSMA combined
evaluation BIM 99.83 57.73 99.48
JSMA 48.61 99.44 99.09
EER_det (%) training
BIM JSMA combined
evaluation BIM 0.18 10.15 0.46
JSMA 50.55 0.46 0.92
Table 5: Detector performance against perturbation methods

5 Conclusion

This work proposes to defend ASV systems against adversarial attacks using a separate detection network, which makes it feasible to develop a unified countermeasure against all malicious attacks. A VGG-like network architecture is introduced to determine whether an input is a genuine or an adversarial sample. Our results demonstrate that the proposed detection network is effective in detecting adversarial samples. To investigate detector robustness in a realistic defense scenario where unseen attack settings exist, we further analyze various attack settings. We observe that the detector is relatively robust against unseen substitute ASV systems, while the generalizability across perturbation methods is not symmetric and detector performance can drop greatly in some cases. The weak robustness against unseen perturbation methods points to a direction for developing stronger countermeasures. The utilization of observed perturbation methods to improve detector robustness against unseen perturbation methods will be investigated in future studies.

References