Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification

10/19/2019 · Songxiang Liu et al.

High-performance spoofing countermeasure systems for automatic speaker verification (ASV) have been proposed in the ASVspoof 2019 challenge. However, the robustness of such systems under adversarial attacks has not been studied yet. In this paper, we investigate the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks with the fast gradient sign method (FGSM) and the projected gradient descent (PGD) method. We implement high-performing countermeasure models from the ASVspoof 2019 challenge and conduct adversarial attacks on them. We also compare the performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of model parameters. The experimental results show that all implemented countermeasure models are vulnerable to FGSM and PGD attacks in the white-box setting, and that the more dangerous black-box attacks are also effective.


1 Introduction

Automatic speaker verification (ASV) aims to confirm that a given utterance is pronounced by a specified speaker. It is now a mature technology for biometric authentication [27, 11, 7, 26, 25, 9, 19, 12]. Modern speaker verification systems harness the combination of several modules to tackle the problem of ASV. In [25], for example, Gaussian mixture models (GMMs) are trained to model the acoustic features and a likelihood ratio is used for scoring. Recently, ASV systems based on deep learning models have required fewer concepts and heuristics than traditional speaker verification systems and have achieved considerable performance improvements. Heigold et al. [9] proposed an integrated end-to-end model that directly learns a mapping from a test utterance and a few reference utterances to a verification score, resulting in a compact structure and sufficiently good performance. However, past research has shown that ASV systems are vulnerable to malicious attacks using spoofed and fake audio, such as synthetic, converted and replayed speech.

Anti-spoofing countermeasures for speaker verification systems have aroused keen interest, and several novel studies have been carried out [6, 30, 32, 5, 34, 13, 3, 17, 4, 24, 15]. The ASVspoof 2019 challenge [31], a community-led challenge, attracted more than 60 international industrial and academic teams to investigate spoofing countermeasures for ASV. Both the logical access (LA) and physical access (PA) scenarios are taken into account. The LA scenario involves fake audio synthesized by modern text-to-speech synthesis (TTS) and voice conversion (VC) models. The PA scenario involves replayed audio recorded in reverberant environments under different acoustic and replay configurations. Several teams achieve excellent performance in detecting spoofing and reinforcing the robustness of ASV systems under both the LA and PA scenarios [31]. Yet according to [29, 2], machine learning models with impressive performance can be vulnerable to adversarial attacks [8]. Are the spoofing countermeasures for ASV in [31] robust enough to defend against adversarial examples?

An adversarial example is generated by adding a tiny perturbation to a normal instance. It is very similar to the original instance and may even be visually or acoustically indistinguishable to humans, but a suitably chosen perturbation leads a neural network to classify it incorrectly, possibly as any target class. Szegedy et al. [29] showed that well-trained neural networks for image classification can succumb to adversarial attacks. The vulnerability of automatic speech recognition (ASR) neural network models under adversarial attacks is demonstrated in [2], where an adversarial example can be transcribed as any phrase. However, to the best of our knowledge, the robustness of spoofing countermeasure systems for ASV under adversarial attacks has not been studied yet. In this paper, we implement several high-performance spoofing countermeasure models from the ASVspoof 2019 challenge and assess the reliability of these models under attack by adversarial examples.

Adversarial attacks fall into two main scenarios: white-box attacks and black-box attacks. In a white-box attack, the adversary has full knowledge of the target model internals, and adversarial examples are generated by an optimization over the input space while the model's parameters are kept fixed. Black-box attacks, such as [23], have no access to the target model internals, only to its inputs and outputs. With paired data acquired from available training data or by querying the online target model, a substitute for the target model can be trained. Adversarial examples are then easily crafted with the substitute and used to attack the target model. A successful attack in the black-box scenario, to some extent, guarantees the success of a white-box attack, because a black-box attack requires less information and is therefore more difficult than a white-box attack. In this paper, for the sake of completeness, both white-box and black-box attacks are adopted to assess the reliability of spoofing countermeasure systems for ASV. Many adversarial attack approaches exist [8, 1, 22, 21, 28]; in this paper, we adopt the fast gradient sign method (FGSM) [8] and the projected gradient descent (PGD) method [21].

This paper is the first to investigate the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks. We compare the performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of parameters. We implement countermeasure models that achieve comparable or even better anti-spoofing performance than some high-performance models in the ASVspoof 2019 challenge, and we successfully attack them under both white-box and black-box attack scenarios. All our code will be made open-source (code is available at https://github.com/ano-demo/AdvAttacksASVspoof).

The paper is organized as follows. Section 2 provides a detailed description of two ASV spoofing countermeasure models. In Section 3, we introduce the procedure of adversarial audio generation with two adversarial attack methods, i.e., the FGSM and the PGD method. In Section 4, we report the experimental setups. The experimental results and discussion are given in Section 5. Finally, we conclude the paper in Section 6.

2 ASV Spoofing Countermeasure Models

As of the submission of this paper, only a few top-ranking systems in the ASVspoof 2019 challenge have accessible and complete model descriptions. We choose two kinds of models, proposed by team T44 and team T45, to conduct adversarial attack experiments. According to the results of the ASVspoof 2019 challenge, the overall best-performing single system for the LA scenario was proposed by team T45. The authors adopt an angular margin based softmax (A-Softmax) [20] loss rather than the traditional softmax with cross-entropy loss to train a Light CNN (LCNN) architecture [33]. The system proposed by team T44 is ranked 3rd and 14th for the PA and LA scenarios, respectively. Team T44 adopts a Squeeze-Excitation extended ResNet (SENet) as one of the models in their submitted system. We provide brief descriptions of these two kinds of models in the next two subsections.

2.1 LCNN model and A-Softmax

Figure 1: The Max-Feature-Map (MFM) activation function for a convolutional layer.

The LCNN architecture was successfully adopted for replay attack detection in [17] and outperformed the other proposed systems in the ASVspoof 2017 challenge [14]. The LCNN architecture adopts the Max-Feature-Map (MFM) activation, which is based on the Max-Out activation function. The MFM activation function for a convolutional layer is illustrated in Fig. 1 and is defined as

$y_{ij}^{k} = \max\left(x_{ij}^{k},\; x_{ij}^{k+N}\right), \quad k = 1, \dots, N, \quad (1)$

where $x$ is the input feature map of size $H \times W \times 2N$ and $y$ is the output feature map of size $H \times W \times N$.
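For concreteness, a minimal PyTorch sketch of the MFM activation in Eq. (1) is given below; it is our own illustration rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class MFM(nn.Module):
    """Max-Feature-Map activation for convolutional feature maps (Eq. 1).

    Splits the 2N input channels into two halves of N channels and keeps the
    element-wise maximum, halving the channel dimension.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2N, H, W) -> (batch, N, H, W)
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)


if __name__ == "__main__":
    feat = torch.randn(4, 64, 100, 50)   # batch of spectral feature maps
    print(MFM()(feat).shape)             # torch.Size([4, 32, 100, 50])
```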

To enhance the anti-spoofing performance of the LCNN architecture, team T45 utilizes an A-Softmax loss to train the model. A-Softmax enables the model to learn angularly discriminative features. Geometrically, the A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold. A-Softmax is represented as

$L_{\mathrm{AS}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\lVert \boldsymbol{x}_i \rVert \psi(\theta_{y_i,i})}}{e^{\lVert \boldsymbol{x}_i \rVert \psi(\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\lVert \boldsymbol{x}_i \rVert \cos(\theta_{j,i})}}, \quad \psi(\theta_{y_i,i}) = (-1)^{k}\cos(m\,\theta_{y_i,i}) - 2k, \quad (2)$

where $N$ is the number of training samples, $(\boldsymbol{x}_i, y_i)$ are the training pairs of features and their labels, $\theta_{j,i}$ is the angle between $\boldsymbol{x}_i$ and the $j$-th column of the weights in the fully connected classification layer, $m$ is an integer that controls the size of the angular margin between classes, and $k \in \{0, \dots, m-1\}$ is the index of the angular interval $[k\pi/m, (k+1)\pi/m]$ containing $\theta_{y_i,i}$.
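The following is a hedged PyTorch sketch of the A-Softmax loss in Eq. (2), following the SphereFace formulation [20]; it is our own illustration (the annealing between softmax and A-Softmax often used during training is omitted), not the code used in the experiments.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ASoftmaxLoss(nn.Module):
    """Sketch of the A-Softmax loss of Eq. (2) with angular margin m."""

    def __init__(self, feat_dim: int, num_classes: int, m: int = 4):
        super().__init__()
        self.m = m
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cos(theta_{j,i}) between each embedding and each normalized class weight.
        w = F.normalize(self.weight, dim=1)
        cos_theta = F.linear(F.normalize(x, dim=1), w).clamp(-1 + 1e-7, 1 - 1e-7)
        x_norm = x.norm(dim=1, keepdim=True)

        # psi(theta) = (-1)^k * cos(m * theta) - 2k, applied to the target class only.
        theta = torch.acos(cos_theta)
        k = torch.floor(self.m * theta / math.pi)
        psi = (1.0 - 2.0 * (k % 2)) * torch.cos(self.m * theta) - 2.0 * k

        one_hot = F.one_hot(labels, cos_theta.size(1)).to(cos_theta.dtype)
        logits = x_norm * (one_hot * psi + (1.0 - one_hot) * cos_theta)
        return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    loss_fn = ASoftmaxLoss(feat_dim=32, num_classes=2, m=4)
    emb, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
    print(loss_fn(emb, y).item())
```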

2.2 Squeeze-Excitation ResNet model

The Squeeze-Excitation network (SENet) adaptively recalibrates channel-wise feature responses by explicitly modelling dependencies between channels, which has shown great merit in image classification tasks [10]. In this paper, we implement a model with a network architecture similar to that of SENet34 in [16]. The overall architecture of an SENet module is illustrated in Fig. 2.

Figure 2: The Squeeze-Excitation network (SENet) module, where the stride and the reduction factor are customizable parameters.
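As a rough illustration of the SE operation summarized in Fig. 2, a generic Squeeze-Excitation block can be sketched in PyTorch as follows; the `reduction` factor value and the surrounding residual path are assumptions, and this is not the authors' exact SENet12 module.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-Excitation block: global pooling -> bottleneck -> channel gates."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel gates in (0, 1)
        return x * w                      # recalibrate channel responses


if __name__ == "__main__":
    fmap = torch.randn(2, 128, 75, 40)
    print(SEBlock(128, reduction=16)(fmap).shape)  # torch.Size([2, 128, 75, 40])
```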

3 Adversarial Audio Generation

To execute an adversarial attack, we consider the model parameters as fixed and optimize over the input space. Specifically, in this paper, an adversarial example $\tilde{\boldsymbol{x}}$ is a perturbed version of the input spectral feature $\boldsymbol{x}$:

$\tilde{\boldsymbol{x}} = \boldsymbol{x} + \boldsymbol{\delta}, \quad (3)$

where $\boldsymbol{\delta}$ is small enough such that the reconstructed speech signal of the perturbed version is perceptually indistinguishable from the original signal by humans, but causes the network to make an incorrect decision. Searching for a suitable $\boldsymbol{\delta}$ can be formulated as solving the following optimization problem:

$\boldsymbol{\delta}^{*} = \arg\max_{\boldsymbol{\delta} \in \mathcal{S}} L(\boldsymbol{x} + \boldsymbol{\delta}, y; \boldsymbol{\theta}), \quad (4)$

where $L$ is the loss function, $y$ is the label of $\boldsymbol{x}$, and $\mathcal{S}$ is the set of allowed perturbations that formalizes the manipulative power of the adversary. In this paper, $\mathcal{S}$ is a small $\ell_\infty$-norm ball, that is, $\mathcal{S} = \{\boldsymbol{\delta} : \lVert \boldsymbol{\delta} \rVert_\infty \le \epsilon\}$.

To solve the optimization problem (4), in this paper, we adopt the fast gradient sign method (FGSM) [8] and the projected gradient descent (PGD) method [21].

3.1 Fast gradient sign method

The first adversarial attack method, abbreviated as the FGSM, consists of taking a single step along the direction of the gradient, i.e.,

$\boldsymbol{\delta} = \epsilon \cdot \mathrm{sign}\left(\nabla_{\boldsymbol{x}} L(\boldsymbol{x}, y; \boldsymbol{\theta})\right), \quad (5)$

where the function $\mathrm{sign}(\cdot)$ simply takes the sign of the gradient $\nabla_{\boldsymbol{x}} L(\boldsymbol{x}, y; \boldsymbol{\theta})$. Therefore, given an utterance with spectral feature $\boldsymbol{x}$, the adversarial spectral feature can simply be computed as $\tilde{\boldsymbol{x}} = \boldsymbol{x} + \boldsymbol{\delta}$. While the FGSM benefits from being the simplest adversarial attack method, it is often relatively inefficient at solving the maximization problem (4).
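A minimal sketch of the FGSM attack in Eq. (5) on spectral features, assuming a PyTorch countermeasure `model` that outputs class logits (the feature shape used in the demo is hypothetical):

```python
import torch
import torch.nn.functional as F


def fgsm_attack(model, features, labels, epsilon):
    """Single-step FGSM (Eq. 5): x_adv = x + epsilon * sign(grad_x L)."""
    features = features.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    with torch.no_grad():
        adv = features + epsilon * features.grad.sign()
    return adv.detach()


if __name__ == "__main__":
    # Toy stand-in for a countermeasure; the (600, 863) feature shape is hypothetical.
    toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(600 * 863, 2))
    x, y = torch.randn(2, 600, 863), torch.tensor([0, 1])
    print(fgsm_attack(toy, x, y, epsilon=1.0).shape)
```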

3.2 Projected gradient descent method

Unlike the FGSM, which is a single-step method, the PGD method is iterative. Starting from the original input $\boldsymbol{x}^{0} = \boldsymbol{x}$, the input is iteratively updated as follows:

$\boldsymbol{x}^{t+1} = \mathrm{clip}_{\boldsymbol{x},\epsilon}\left(\boldsymbol{x}^{t} + \alpha \cdot \mathrm{sign}\left(\nabla_{\boldsymbol{x}} L(\boldsymbol{x}^{t}, y; \boldsymbol{\theta})\right)\right), \quad t = 0, 1, \dots, T-1, \quad (6)$

where $\alpha$ is the step size, $T$ is the number of iterations and the function $\mathrm{clip}_{\boldsymbol{x},\epsilon}(\cdot)$ applies element-wise clipping such that $\lVert \boldsymbol{x}^{t+1} - \boldsymbol{x} \rVert_{\infty} \le \epsilon$. We take $\boldsymbol{x}^{T}$ as the final perturbed spectral feature. Intuitively, the PGD method can be thought of as iteratively applying small-step FGSM while forcing the perturbed input to stay within the admissible set at every step. The PGD method allows for more effective attacks but is naturally more computationally expensive than the FGSM.

The performance of PGD is still limited by the possibility of getting stuck at local optima of the loss function. To mitigate this problem, random restarts are incorporated into the PGD method [21]. The PGD method with random restarts is executed for multiple runs: in each run, the initial location of the adversarial example is randomly selected within the admissible perturbation set and the PGD update is applied a fixed number of times. The final adversarial example is the one resulting in the maximum loss.
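The PGD attack with random restarts can be sketched as follows (our own illustration; for simplicity the best restart is selected at the batch level rather than per utterance):

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, epsilon, alpha, num_iters=10, num_restarts=5):
    """PGD within an l-inf ball of radius epsilon (Eq. 6), with random restarts.

    Each restart starts from a random point inside the ball; the restart whose
    final example attains the largest loss is returned.
    """
    x = x.detach()
    best_adv, best_loss = x.clone(), torch.tensor(float("-inf"))

    for _ in range(num_restarts):
        # Random start inside the admissible perturbation set.
        adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
        for _ in range(num_iters):
            adv = adv.clone().detach().requires_grad_(True)
            loss = F.cross_entropy(model(adv), y)
            grad, = torch.autograd.grad(loss, adv)
            with torch.no_grad():
                adv = adv + alpha * grad.sign()
                # Element-wise clipping so that ||adv - x||_inf <= epsilon.
                adv = torch.min(torch.max(adv, x - epsilon), x + epsilon)
        final_loss = F.cross_entropy(model(adv.detach()), y)
        if final_loss > best_loss:
            best_adv, best_loss = adv.detach(), final_loss
    return best_adv
```

In the experimental setup of Section 4.2, the step size `alpha` is tied to `epsilon` through Eq. (7).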

4 Experimental setups

This paper uses the ASVspoof 2019 dataset, which encompasses partitions for the assessment of the LA and PA scenarios. We only utilize the LA partition, which is itself divided into training, development and evaluation sets. We use the raw log power magnitude spectrum computed from the signal as acoustic features. Following [18], the FFT spectrum is extracted with a Blackman window of size 1724 and a step size of 0.0081 s. Only the first 600 frames of each utterance are used as input for all trained models. No additional preprocessing techniques such as voice activity detection (VAD), pre-emphasis or dereverberation are adopted.
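A sketch of this front-end using librosa is shown below; the 16 kHz sampling rate (and hence the roughly 130-sample hop implied by the 0.0081 s step size) and the handling of short utterances are assumptions on our part.

```python
import numpy as np
import librosa


def log_power_spectrum(wav_path, n_fft=1724, hop_seconds=0.0081, max_frames=600):
    """Raw log power magnitude spectrum with a Blackman window (Section 4)."""
    y, sr = librosa.load(wav_path, sr=None)        # keep the native sampling rate
    hop = int(round(hop_seconds * sr))             # ~130 samples at 16 kHz
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop, window="blackman")
    log_power = np.log(np.abs(spec) ** 2 + 1e-10)  # (n_fft // 2 + 1, frames)
    # Keep only the first 600 frames; shorter utterances would need padding.
    return log_power[:, :max_frames].T
```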

The adversarial attacks are conducted as follows. We train a spoofing countermeasure model using the training set. Hyper-parameters of the countermeasure model are tuned using the development set. We evaluate the anti-spoofing performance and generate adversarial examples using the evaluation set. When generating adversarial examples, we add the adversarial perturbation to the acoustic feature vectors using the FGSM or the PGD method introduced in Section 3. The perturbed acoustic features are then reconstructed into waveforms to attack the well-trained countermeasure model.
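The reconstruction step can be sketched as follows, reusing the phase of the original signal as described later in Section 4.3 (our own sketch; the hop length mirrors the assumed front-end above):

```python
import numpy as np
import librosa


def reconstruct_waveform(perturbed_log_power, original_wav, n_fft=1724, hop=130):
    """Rebuild audio from a perturbed log power spectrum and the original phase."""
    spec = librosa.stft(original_wav, n_fft=n_fft, hop_length=hop, window="blackman")
    phase = np.angle(spec)[:, : perturbed_log_power.shape[0]]
    # Invert log|S|^2 -> |S|, then pair with the original phase.
    magnitude = np.exp(0.5 * perturbed_log_power.T)[:, : phase.shape[1]]
    adv_spec = magnitude * np.exp(1j * phase)
    return librosa.istft(adv_spec, hop_length=hop, window="blackman")
```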

4.1 Details of countermeasure model implementation

Three countermeasure models are trained, which we term LCNN-big, LCNN-small and SENet12. LCNN-big has the same network architecture as in [18]. LCNN-small has a network architecture similar to LCNN-big, but with fewer parameters. SENet12 has a network architecture similar to that in [16], but with fewer parameters. The numbers of parameters of these three models are shown in Table 1. The LCNN-big model has larger capacity than the LCNN-small and SENet12 models in terms of the number of parameters, while the LCNN-small and SENet12 models have roughly equal capacity. The detailed network architectures of the LCNN-small and SENet12 models are shown in Table 2 and Table 3, respectively.

The LCNN-big and LCNN-small models are trained using the A-Softmax loss function, while SENet12 is trained using the original softmax with cross-entropy loss. The margin constant $m$ in Eq. (2) is set to 4. To mitigate overfitting, a dropout rate of 0.75 is used when training the LCNN models. We found that adding a relatively large L2 regularization also helps mitigate overfitting, so we set the weight decay rate to 0.001 when training all three models. We use the Adam optimizer with a constant learning rate of 0.001 and the default momentum parameters ($\beta_1$, $\beta_2$) to update the model parameters in all training cases.
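The corresponding training configuration can be sketched in PyTorch as follows (the placeholder `model` stands for any of the three countermeasure networks):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder for LCNN-big / LCNN-small / SENet12
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,               # constant learning rate
    betas=(0.9, 0.999),    # default momentum parameters
    weight_decay=1e-3,     # relatively large L2 regularization against overfitting
)
dropout = nn.Dropout(p=0.75)   # dropout rate used inside the LCNN models
```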

During the training stage, we apply early stopping according to the classification accuracy on the development set. During the inference stage, we take the cosine similarity between the input to the last fully connected (FC) layer and the weight vector of that layer corresponding to the bona fide class as the score of the utterance.
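A sketch of this scoring rule is given below; the index of the bona fide class in the final FC layer is a hypothetical choice for illustration.

```python
import torch
import torch.nn.functional as F


def bonafide_score(embedding: torch.Tensor, fc_weight: torch.Tensor,
                   bonafide_idx: int = 1) -> torch.Tensor:
    """Cosine similarity between the utterance embedding (input to the last FC
    layer) and the FC weight vector of the bona fide class."""
    w = fc_weight[bonafide_idx]                       # (feat_dim,)
    return F.cosine_similarity(embedding, w.unsqueeze(0), dim=1)


if __name__ == "__main__":
    emb = torch.randn(4, 32)    # batch of utterance embeddings
    W = torch.randn(2, 32)      # last FC layer weight (2 classes)
    print(bonafide_score(emb, W))
```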

Model | LCNN-big | LCNN-small | SENet12
Num. of parameters | 10M | 0.51M | 0.48M
Table 1: Number of parameters of the LCNN-big, LCNN-small and SENet12 models.
Type Filter / Stride Output
Conv_1 /
MFM_2
MaxPool_3
Conv_4
MFM_5
BatchNorm_6
Conv_7
MFM_8
MaxPool_9
BatchNorm_10
Conv_11
MFM_12
BatchNorm_13
Conv_14
MFM_15
MaxPool_16
Conv_17
MFM_18
BatchNorm_19
Conv_20
MFM_21
BatchNorm_22
Conv_23
MFM_24
BatchNorm_25
Conv_26
MFM_27
MaxPool_28
FC_29 64
MFM_30 32
BatchNorm_31 32
FC_32 2
Table 2: LCNN-small network architecture.
Type Filter / Stride Output
Conv /
BatchNorm
ReLU
MaxPool
SEResNet Module
SEResNet Module
SEResNet Module
SEResNet Module
Global AvgPool 128
FC 2
Table 3: SENet12 network architecture.

4.2 Adversarial attacks

We apply both white-box and black-box adversarial attacks on the trained countermeasure models with the FGSM and the PGD method. We investigate adversarial attacks under various levels of manipulative power of the adversary on the audio. In both the FGSM and PGD attack settings, $\epsilon$ in Eq. (5) and Eq. (6) is chosen from the set {0.1, 1, 5}. To make the level of manipulative power of the PGD attack consistent with that of the FGSM attack, the step size $\alpha$ and $\epsilon$ in the PGD attack scenario satisfy the relationship

$\alpha \cdot T = \epsilon, \quad (7)$

where $T$ is the number of iterations. For example, if $\epsilon = 1$ and the number of iterations is 10, then $\alpha$ is set to 0.1. The number of random restarts is 5 in all experiments.
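As a small illustration of Eq. (7), the PGD step size can be derived from the perturbation budget as follows (the 10-iteration setting mirrors the example above):

```python
def pgd_step_size(epsilon: float, num_iters: int = 10) -> float:
    """Step size alpha satisfying alpha * num_iters = epsilon (Eq. 7)."""
    return epsilon / num_iters


# epsilon values used in the experiments, each with 5 random restarts.
for eps in (0.1, 1, 5):
    print(eps, pgd_step_size(eps))   # alpha = 0.01, 0.1, 0.5
```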

4.3 XAB listening test

To achieve a valid adversarial attack, it is important that the adversarial audio examples sound indistinguishable from the original audio signals to human ears. We conduct an XAB listening test, which is a standard way to assess detectable differences between two choices of sensory stimuli. The adversarial audio signals are generated by attacking the LCNN-big model with the PGD method. Each adversarial audio signal is reconstructed from the perturbed log power magnitude spectrum and the phase spectrum of its corresponding original audio signal. We present to listeners 50 randomly chosen adversarial-original audio pairs (i.e., A and B), and from each pair we randomly choose one as the reference audio (i.e., X). Five listeners take part in the XAB listening test, in which they are asked to choose from A and B the audio that sounds more like the reference audio X.

5 Results and discussion

5.1 Countermeasure performance

The anti-spoofing performance of the three countermeasure models, LCNN-big, LCNN-small and SENet12, is evaluated via the minimum normalized tandem detection cost function (t-DCF) and the equal error rate (EER), as shown in Table 4. Note that the LCNN-big model achieves performance comparable to that reported in [18], and the SENet12 model performs even better than the best-performing single model reported in [16].

Figure 3: Spectrograms of (a) the original (reference) audio, (b) the adversarial audio and (c) the adversarial perturbation for the utterance LA_E_1001227. The attack is conducted on the LCNN-big model using the PGD method with 10 iterations and 5 random restarts.
Figure 4: Black-box attack performance of the FGSM and the PGD method: (a) LCNN-big and LCNN-small; (b) LCNN-big and SENet12; (c) LCNN-small and SENet12.
Model | Dev min t-DCF | Dev EER(%) | Eval min t-DCF | Eval EER(%)
LCNN-big | 0.0010 | 0.047 | 0.1052 | 3.875
LCNN-small | 0 | 0.002 | 0.1577 | 6.226
SENet12 | 0 | 0 | 0.1737 | 6.077
Table 4: Anti-spoofing performance of the three countermeasure models, LCNN-big, LCNN-small and SENet12, on the development (Dev) and evaluation (Eval) sets.
Model | FGSM (ε=0.1) | PGD (ε=0.1) | FGSM (ε=1) | PGD (ε=1) | FGSM (ε=5) | PGD (ε=5)
LCNN-big | 4.691 | 6.256 | 36.504 | 54.382 | 48.457 | 93.119
LCNN-small | 7.613 | 17.419 | 34.670 | 73.649 | 48.375 | 89.845
SENet12 | 7.737 | 13.896 | 24.936 | 62.681 | 51.626 | 87.220
Table 5: White-box attack performance, in EER(%), of the FGSM and the PGD method under different perturbation budgets ε.

5.2 Adversarial attack results

The subjective XAB listening test described in Subsection 4.3 results in an average classification accuracy of 48.4%, close to the 50% chance level, which confirms that the adversarial perturbations are essentially imperceptible and hence that the adversarial audio attacks are valid. Spectrograms of the original audio, the adversarial audio and the adversarial perturbation for the utterance LA_E_1001227 are shown in Fig. 3, where the attack is conducted on the LCNN-big model using the PGD method with 10 iterations and 5 random restarts. The adversarially perturbed spectrogram is almost visually indistinguishable from that of the original audio signal. If we set the EER point of the evaluation set as the operating point, the utterance LA_E_1001227 is wrongly classified as "bona fide" in this setting. In the following subsections, we use the EER as the metric to evaluate the performance of adversarial attacks carried out with the FGSM or the PGD method, under both white-box and black-box scenarios.
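For reference, a minimal numpy sketch of the EER computation used as the attack metric is given below (a practical implementation would use the standard ROC-based interpolation):

```python
import numpy as np


def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: threshold where false acceptance = false rejection."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false acceptance
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bona = rng.normal(1.0, 1.0, 1000)    # synthetic scores for illustration
    spoof = rng.normal(-1.0, 1.0, 1000)
    print(compute_eer(bona, spoof))      # roughly 0.16 for these Gaussians
```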

5.2.1 White-box attacks

The white-box attack performance of the FGSM and the PGD method using different values of $\epsilon$ is shown in Table 5. The EERs of all three countermeasure models increase as $\epsilon$ grows. PGD attacks attain larger EERs than FGSM attacks for all three countermeasure models and all settings of $\epsilon$ under the white-box attack scenario. The EERs of all three models under the FGSM attacks reach nearly 50% when $\epsilon = 5$. As for PGD, the EERs exceed 50% when $\epsilon = 1$ and exceed 85% when $\epsilon = 5$, which would result in a reversed classification decision if the operating point is pre-defined by the EER point on the evaluation set. We conclude from the white-box attack results that the reliability of all three countermeasure models is challenged and broken down by FGSM or PGD attacks under the white-box scenario, and that the PGD method is more effective than the FGSM. Research on more advanced countermeasure models should be done to keep pace with today's white-box adversarial attacks.

5.2.2 Black-box attacks

In this part, we study adversarial attacks with the FGSM and the PGD method in the black-box scenario. As shown in Fig. 4, the black-box attacks achieve a resounding success in all the mutual attack settings: LCNN-big with LCNN-small, LCNN-big with SENet12 and LCNN-small with SENet12. In a large fraction of the attack scenarios, the PGD method attains larger EERs than the FGSM, indicating that the PGD method tends to generate more powerful adversarial examples. According to Fig. 4(a) and 4(b), adversarial examples generated by LCNN-big, which has considerably more parameters, are more powerful: they attack the smaller models, LCNN-small and SENet12, with larger EERs, while adversarial examples generated by LCNN-small and SENet12 fail to attack LCNN-big with such large EERs. So, in our experimental setup, we can safely conclude that adversarial examples from the small models are outperformed by those from the large model in terms of attack performance, for both the FGSM and PGD methods. According to the red dotted line with triangle markers and the red dash-dot line with circle markers in Fig. 4(a) and Fig. 4(b), the adversarial examples generated by LCNN-big attain greater EERs when attacking LCNN-small than when attacking SENet12, suggesting that adversarial attacks transfer more easily between models with similar structures. According to Fig. 4(c), the adversarial examples generated by LCNN-small are more effective attackers than those generated by SENet12.

6 Conclusions

In this paper, we have investigated the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks using the FGSM and the PGD method. We have also compared the performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of parameters. We implemented three countermeasure models, i.e., LCNN-big, LCNN-small and SENet12, and conducted adversarial attacks on them. The experimental results show that all three models are vulnerable to FGSM and PGD attacks under the white-box attack scenario, and that the more dangerous black-box attacks are also effective. For future work, we would like to adopt adversarial training methods to improve the robustness of countermeasure models and make them less vulnerable to adversarial attacks.

7 Acknowledgements

Songxiang Liu was supported by the General Research Fund from the Research Grants Council of Hong Kong SAR Government (Project No. 14208718). Haibin Wu and Hung-yi Lee were supported by the Ministry of Science and Technology of Taiwan (Project No. 108-2636-E-002-001).

References

  • [1] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
  • [2] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §1, §1.
  • [3] N. Chen, Y. Qian, H. Dinkel, B. Chen, and K. Yu (2015) Robust deep feature for spoofing detection—the SJTU system for ASVspoof 2015 challenge. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [4] Z. Chen, Z. Xie, W. Zhang, and X. Xu (2017) ResNet and model fusion for automatic spoofing detection.. In INTERSPEECH, pp. 102–106. Cited by: §1.
  • [5] H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. Lee, and J. Yamagishi (2018) ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements. In Odyssey 2018 The Speaker and Language Recognition Workshop, Cited by: §1.
  • [6] N. W. Evans, T. Kinnunen, and J. Yamagishi (2013) Spoofing and countermeasures for automatic speaker verification.. In Interspeech, pp. 925–929. Cited by: §1.
  • [7] D. Garcia-Romero and C. Y. Espy-Wilson (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth annual conference of the international speech communication association, Cited by: §1.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §3.
  • [9] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer (2016) End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119. Cited by: §1.
  • [10] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §2.2.
  • [11] A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, and M. W. Mason (2011) I-vector based speaker recognition on short utterances. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, pp. 2341–2344. Cited by: §1.
  • [12] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam (2014) Deep neural networks for extracting baum-welch statistics for speaker recognition. In Proc. Odyssey, pp. 293–298. Cited by: §1.
  • [13] E. Khoury, T. Kinnunen, A. Sizov, Z. Wu, and S. Marcel (2014) Introducing i-vectors for joint anti-spoofing and speaker verification. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [14] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee (2017) The asvspoof 2017 challenge: assessing the limits of replay spoofing attack detection. Cited by: §2.1.
  • [15] C. Lai, A. Abad, K. Richmond, J. Yamagishi, N. Dehak, and S. King (2019) Attentive filtering networks for audio replay attack detection. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6316–6320. Cited by: §1.
  • [16] C. Lai, N. Chen, J. Villalba, and N. Dehak (2019) ASSERT: anti-spoofing with squeeze-excitation and residual networks. arXiv preprint arXiv:1904.01120. Cited by: §2.2, §4.1, §5.1.
  • [17] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin (2017) Audio replay attack detection with deep learning frameworks.. In Interspeech, pp. 82–86. Cited by: §1, §2.1.
  • [18] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov (2019) STC antispoofing systems for the asvspoof2019 challenge. arXiv preprint arXiv:1904.05576. Cited by: §4.1, §4, §5.1.
  • [19] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren (2014) A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699. Cited by: §1.
  • [20] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220. Cited by: §2.
  • [21] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §3.2, §3.
  • [22] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1.
  • [23] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1.
  • [24] Y. Qian, N. Chen, and K. Yu (2016) Deep features for automatic spoofing detection. Speech Communication 85, pp. 43–52. Cited by: §1.
  • [25] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn (2000) Speaker verification using adapted gaussian mixture models. Digital signal processing 10 (1-3), pp. 19–41. Cited by: §1.
  • [26] A. Senior and I. Lopez-Moreno (2014) Improving dnn speaker independence with i-vector inputs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 225–229. Cited by: §1.
  • [27] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1.
  • [28] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §1.
  • [29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1.
  • [30] M. Todisco, H. Delgado, and N. Evans (2016-06) A new feature for automatic speaker verification anti-spoofing: constant q cepstral coefficients. Odyssey 2016. External Links: Link, Document Cited by: §1.
  • [31] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee (2019) ASVspoof 2019: future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441. Cited by: §1.
  • [32] G. Valenti, H. Delgado, M. Todisco, N. Evans, and L. Pilati (2018-06) An end-to-end spoofing countermeasure for automatic speaker verification using evolving recurrent neural networks. Odyssey 2018 The Speaker and Language Recognition Workshop. External Links: Link, Document Cited by: §1.
  • [33] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896. Cited by: §2.
  • [34] Z. Wu, T. Kinnunen, E. S. Chng, H. Li, and E. Ambikairajah (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–5. Cited by: §1.