Advanced voice conversion (VC) and text-to-speech (TTS) technologies make it easy to create a high-quality synthetic voice. However, synthetic voices can be misused to attack automatic speaker verification (ASV) systems [1], an attack now referred to as a presentation attack (PA) by the ISO/IEC 30107-1 standard [2]. They can also be abused to fool humans and have led to an issue known as deepfakes. These concerns call for reliable PA and deepfake detection methods.
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Most PA and deepfake detection methods, or spoofing countermeasures (CMs) in general, are based on a binary classification scheme. Given an input speech trial, the CM extracts frames of acoustic features and computes a score that indicates how likely the input trial is to be bona fide, that is, a real human voice. It then makes a decision by comparing the score with an application-dependent threshold. Most CMs use deep neural networks (DNNs) to detect artifacts in input trials, and many of them have achieved impressive results on benchmark databases [3].
However, a CM well trained on a closed-set database is likely to misclassify unseen trials from unknown attacks and unseen bona fide trials from mismatched unknown domains [4, 5].¹ These trials are sometimes referred to as “known-unknown” and “unknown-unknown” in the machine learning field [6], and in this study, we simply refer to them as being unknown to the CM. While data augmentation techniques [7] can make a CM more robust, they cannot cover all unknown conditions. Rather than being forced to make a binary decision, a practical CM should abstain from making decisions on trials that are difficult to judge. Such a CM is illustrated in Fig. 1. The option to abstain is desirable when a classification error incurs a high risk, regardless of whether it is a false positive or a false negative. If a CM abstains, the input trial can be scrutinized by other CMs or a human expert. Although it is not investigated in this study, an active learning strategy [8] can be used to collect the trials annotated by the human expert and fine-tune the CM.

¹ Although benchmark databases such as those from the ASVspoof challenges intentionally keep unknown attacks in the evaluation set, the labels of the evaluation set are released to the public after the challenges. Network architectures and other hyper-parameters of CMs can unintentionally overfit the evaluation set.
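The abstention rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `decide`, the two threshold arguments, the returned labels, and the example values are all hypothetical.

```python
def decide(cm_score, confidence, cm_threshold, conf_threshold):
    """Return 'abstain', 'bona fide', or 'spoofed' for one trial.

    All names and thresholds are illustrative assumptions.
    """
    # Abstain first: a low-confidence trial is deferred to another CM
    # or to a human expert instead of being classified.
    if confidence < conf_threshold:
        return "abstain"
    # Otherwise make the usual binary decision on the CM score.
    return "bona fide" if cm_score >= cm_threshold else "spoofed"


# (CM score, confidence) pairs with made-up values.
trials = [(2.3, 0.9), (-1.7, 0.8), (0.4, 0.2)]
decisions = [decide(s, c, cm_threshold=0.0, conf_threshold=0.5) for s, c in trials]
print(decisions)
```

The third trial has a positive CM score but a low confidence score, so the sketch defers it rather than accepting it as bona fide.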
Classification with abstention is an established topic in the machine learning field, and most methods require either a separately trained model to decide whether to abstain [9, 10, 11] or a non-trainable scoring module [12, 13]. Out-of-distribution (OOD) data, which is likely to be misclassified by a classifier, usually receives a low confidence score and can thus be identified.
Inspired by the aforementioned studies, this study investigates whether and how abstention can be usefully introduced to DNN-based speech spoofing CMs. On the basis of two high-performance CMs, we compared a few confidence estimators on the ASVspoof 2019 logical access (LA) database [14] and an additional evaluation set with unknown trials from the Voice Conversion Challenges (VCC) [15, 16]. The results demonstrate that simply using the probability from a softmax as the confidence score leads to overconfidence, which is consistent with findings from other studies [17, 18]. An energy-based confidence scoring method [12] and a confidence branch [10] achieved acceptable performance, and both helped the CM to identify unknown trials from the VCC test set. By abstaining on trials whose confidence score was lower than the threshold, the CM achieved a lower equal error rate (EER) on the remaining trials.
2 Confidence estimators
2.1 Max probability from CM
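The first estimator is a simple plug-in to a CM. Suppose the hidden layers in the CM scoring module have converted the input trial into an utterance-level vector $\boldsymbol{h}$. We can compute a logit vector $\boldsymbol{l} = [l_1, l_2]^\top$ using an affine transformation and get the output probability using a softmax function. This is written as

$$P_j = \frac{\exp(l_j)}{\sum_{k=1}^{2}\exp(l_k)}, \quad l_j = \boldsymbol{w}_j^\top \boldsymbol{h} + b_j, \tag{1}$$

where $P_1$ and $P_2$ denote the probability of being bona fide and spoofed, respectively, and where $\boldsymbol{w}_j^\top$ and $b_j$ are the $j$-th row of the transformation matrix and the $j$-th element of the bias vector, respectively. The confidence score is the larger of the two probabilities, $c = \max_j P_j$.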
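As a minimal sketch of this estimator, the snippet below turns two logits into softmax probabilities and takes the larger one as the confidence. The function names and the example logits are illustrative assumptions. With two classes the score cannot fall below 0.5.

```python
import math


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob_confidence(logits):
    """Confidence = largest class probability (hypothetical helper name)."""
    return max(softmax(logits))


print(max_prob_confidence([0.0, 0.0]))   # undecided trial
print(max_prob_confidence([8.0, -8.0]))  # large logit gap
```

A large gap between the logits pushes the score towards 1 regardless of whether the trial resembles the training data, which is one way to see why this estimator can be overconfident.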
2.2 Energy-based confidence score
Given the logits $l_1$ and $l_2$, the second estimator computes an energy-based score as $c = \log\sum_{j=1}^{2}\exp(l_j)$ [12]. The value of $c$ is argued to be linearly aligned with the logarithm of the unconditional model likelihood $p(\boldsymbol{x};\theta)$ of the input trial $\boldsymbol{x}$, where $\theta$ is the parameter set of the CM. Therefore, known attacks are likely to receive a higher score than unknown ones.
The energy-based confidence score does not add model parameters and can be used on pre-trained CMs. When some unknown trials are available, it is also possible to re-train the CM so that the confidence scores of unknown and known trials can be better discriminated. This requires an energy-based training loss [12].
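A minimal sketch of the energy-based score follows. It assumes the common log-sum-exp form with the temperature fixed to 1; the function name and the example logits are hypothetical.

```python
import math


def energy_confidence(logits):
    """Negative free energy: log of the sum of exponentiated logits.

    Assumes temperature 1; computed with the max-shift trick for stability.
    """
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))


# Both trials get a softmax probability very close to 1,
# but their energy-based scores remain clearly different.
print(energy_confidence([10.0, -10.0]))
print(energy_confidence([20.0, -20.0]))
```

Unlike the max probability, this score is not squashed into a bounded interval, so it can still separate trials whose softmax output has already saturated.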
2.3 Negative Mahalanobis distance
Assuming that the utterance-level vectors $\boldsymbol{h}$ of trials from the same attack type follow a Gaussian distribution, we can use the negative Mahalanobis distance as a confidence estimator. Given the $\boldsymbol{h}$ of an input trial, we compute

$$c = \max_{k}\, -(\boldsymbol{h}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{h}-\boldsymbol{\mu}_k), \tag{2}$$

where $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the sample mean vector and the covariance matrix of the $k$-th class. Note that the class here can be bona fide or any known attack in the training set. This method has been used for OOD detection [11] and CM scoring [20]. Here, we assume that a trial far away from the known classes (and hence with a smaller score) is likely to be unknown and misclassified by the CM.
When some unknown trials are available, we can fine-tune the CM to tighten the Gaussian distributions of the known classes. This is done in this work using an outlier exposure training loss [21].
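The negative Mahalanobis distance can be sketched as below. Two points are assumptions made for illustration: each class has its own non-singular covariance matrix, and the score is taken with respect to the closest class (the maximum over classes). The helper names and the toy two-dimensional statistics are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def neg_mahalanobis(h, means, covs):
    """Negative squared Mahalanobis distance to the closest known class."""
    scores = []
    for mu, cov in zip(means, covs):
        d = [x - m for x, m in zip(h, mu)]
        z = solve(cov, d)  # cov^{-1} d without forming the inverse
        scores.append(-sum(di * zi for di, zi in zip(d, z)))
    return max(scores)


# Toy statistics for two known classes in a 2-D embedding space.
means = [[0.0, 0.0], [4.0, 0.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 1.0]]]
print(neg_mahalanobis([2.0, 0.0], means, covs))
print(neg_mahalanobis([10.0, 10.0], means, covs))
```

A vector far from every class mean receives a strongly negative score, which is the behaviour the estimator relies on to flag unknown trials.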
Table 1: Confidence estimators compared in this study.
|Estimator||Score range||Trainable?||Use of unknown data|
|Max prob.||[0.5, 1]||No||Not applicable|
|Energy score||(-∞, ∞)||No||Usable for CM training|
|Neg. M-dist.||(-∞, 0]||Trained after CM||Usable for CM training|
|Conf. branch||(0, 1)||Jointly trained with CM||Not applicable|
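Table 2: Evaluation results on test sets E1 and E2. EER and FPR are in %; "-" marks values that could not be computed.
|Train set||CM scoring||Confidence scoring||E1: C_llr||E1: EER||E1: EER at TPR=95%||E1: FPR at TPR=95%||E1: AUROC||E1: AUPR||E2: C_llr||E2: EER||E2: EER at TPR=95%||E2: FPR at TPR=95%||E2: AUROC||E2: AUPR|
|T1||AM softmax||Max prob.||0.64||4.64||-||-||0.51||0.38||0.65||5.50||-||-||0.51||0.35|
|T1||plain softmax||Max prob.||0.45||3.33||3.43||70.98||0.78||0.64||0.41||5.55||2.82||72.91||0.70||0.49|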
2.4 Confidence branch
The fourth confidence estimator adds a trainable module to the CM [10]. It learns to map the vector $\boldsymbol{h}$ of the input trial into a confidence score $c = \sigma(g(\boldsymbol{h}))$, where $\sigma$ is the Sigmoid function and $g$ denotes the confidence branch. The branch is jointly trained with the CM by minimizing the loss over the training data $\mathcal{D}$:

$$\mathcal{L} = \sum_{(\boldsymbol{x}, y)\in\mathcal{D}} \Big[ -\sum_{j=1}^{2} \mathbb{1}(y=j)\log \tilde{P}_j - \lambda \log c \Big], \tag{3}$$

where $\tilde{P}_j = c P_j + (1-c)\,\mathbb{1}(y=j)$, $\mathbb{1}(\cdot)$ is an indicator function, $y$ is the target label, $\lambda$ is the weight of the regularization term, and $P_j$ is the softmax output probability of class $j$ from the CM scoring module with parameter set $\theta$.
When the CM predicts a small confidence score $c$, the value of $\tilde{P}_j$ for the target class is increased, while that of the non-target class is decreased. This means that the CM can ask for more hints from the target label when it is less confident in classifying the input. However, the regularization term $-\lambda\log c$ prevents the CM from predicting a small $c$ for all trials. Therefore, a well-trained CM is expected to predict a small $c$ only for trials that are difficult to classify.
We implemented the confidence branch using two linear layers, where the first layer used 128 hidden units and the Tanh activation function. We followed the official implementation and used a budget mechanism to tune the hyper-parameter. We also observed that it is essential to balance the ratio of bona fide and spoofed trials in each mini-batch.
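The trade-off between asking for hints and paying the regularization penalty can be illustrated for a single trial. The sketch below follows the general form of the confidence-branch loss in [10]; the function name, the argument names (in particular `lam` for the penalty weight), and the numbers are illustrative assumptions rather than the configuration used in the experiments.

```python
import math


def confidence_branch_loss(probs, target, c, lam):
    """Per-trial loss of a confidence branch.

    probs  : softmax output of the CM, e.g. [P(bona fide), P(spoofed)]
    target : index of the correct class
    c      : confidence in (0, 1] predicted by the branch
    lam    : weight of the -log(c) penalty (hypothetical name)
    """
    # Interpolate the prediction towards the one-hot target ("hints").
    mixed = [c * p + (1.0 - c) * (1.0 if j == target else 0.0)
             for j, p in enumerate(probs)]
    nll = -math.log(mixed[target])
    penalty = -math.log(c)
    return nll + lam * penalty


probs = [0.2, 0.8]  # the CM favours the wrong class for target 0
print(confidence_branch_loss(probs, 0, c=1.0, lam=0.1))
print(confidence_branch_loss(probs, 0, c=0.5, lam=0.1))
```

For this misclassified trial, lowering the confidence reduces the total loss as long as the penalty weight is small, which is the mechanism that lets the branch learn to flag difficult trials.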
2.5 Supervised binary known-unknown classifier
The last estimator in this study is a standalone DNN that predicts the confidence score from input acoustic features. It has the same network structure as the CM but the target class is either known or unknown. Bona fide and spoofed trials in the original training set are treated as known, and trials from other databases are unknown.
The confidence estimators used in this study are summarized in Tab. 1. We included them because they cover various application scenarios. When the CM scoring module has been trained and cannot be fine-tuned, the max-probability, energy-based score, or M-distance can be used. If it is possible to train the CM while only known data is available, the confidence branch can be used. If we collect new unknown training data, we can choose to build a standalone confidence estimator or update the Gaussian statistics for the M-distance after tuning the CM.
3 Experiments
3.1 Databases and protocols
We used three databases: the ASVspoof 2019 LA database [14], bona fide and TTS trials collected from Blizzard Challenge 2019 (BC19) [22] and ESPNet [23], and bona fide and VC trials from the Voice Conversion Challenge (VCC) 2018 and 2020 [15, 16]. The BC19, ESPNet, and VCC datasets contain more types of spoofed trials than LA, and the sets of speakers are disjoint in all databases.
To simulate real application scenarios, we prepared different training and test sets, which are listed in Tab. 3. The LA test set was split into two subsets. LA test kn. contains bona fide trials and four spoofing attacks (A08, A09, A16, and A19), and LA test unk. contains the rest of the spoofed data in the test set. Note that A16 and A19 are known because they used exactly the same TTS/VC algorithms as two attackers in the training set. A08 and A09 are treated as known in this study because they use a TTS framework similar to some spoofing attacks in the training set.
Using T1 and E1 is equivalent to the official protocol of ASVspoof 2019 LA. Using E2 simulates a scenario in which some test trials are unknown. The use of T2 and T3 simulates a scenario in which a small number of unknown trials can be used to train the confidence estimator. However, these unknown trials are disjoint from those in the test set.
Table 3: Training and test sets. Numbers in parentheses are trial counts (bona fide / spoofed).
|Set||Name||Known trials||Unknown trials|
|Train set||T1||LA trn. (2,580 / 22,800)||-|
|Train set||T2||LA trn. (2,580 / 22,800)||ESPNet (250 / 2,000)|
|Train set||T3||LA trn. (2,580 / 22,800)||BC19 (100 / 7,625)|
|Test set||E1||LA test kn. (7,355 / 19,656)||LA test unk. (0 / 44,226)|
|Test set||E2||LA test kn. (7,355 / 19,656)||VCC (770 / 49,467)|
3.2 Model configurations and training recipes
We followed our previous study [24] to configure the CMs since they performed well on the ASVspoof 2019 LA database. The acoustic features were linear frequency cepstrum coefficients (LFCC) extracted with a frame length of 20 ms, a frame shift of 10 ms, and a 512-point FFT. The LFCC vector per frame had 60 dimensions, including static, delta, and delta-delta components. We compared two back-end classifiers in the experiment. While both are based on the light CNN (LCNN) [25] with two bi-directional LSTM layers and an average pooling layer, one uses a plain softmax, and the other uses an additive margin softmax (AM-softmax) [19]. These CMs were combined with the confidence estimators for the experiments.
The training recipe was borrowed from our previous study: the Adam optimizer [26], a mini-batch size of 64, and a learning rate that was halved every ten epochs. Each model was trained on an Nvidia Tesla A100 card for three rounds, and the results were averaged. Voice activity detection and feature normalization were not applied.
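The step-decay schedule in this recipe can be written as a one-line function. The function and argument names are hypothetical, and the initial learning rate is left as a parameter; the value of 1.0 in the example is a placeholder, not the setting used in the experiments.

```python
def learning_rate(epoch, initial_lr, halve_every=10):
    """Step decay: halve the learning rate after every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)


# Placeholder initial value, chosen only to make the decay easy to read.
for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch, initial_lr=1.0))
```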
3.3 Evaluation metrics
The following evaluation metrics were used. The first set, consisting of the EER and the log-likelihood-ratio cost $C_{\mathrm{llr}}$ [27], was used to evaluate the CMs in the conventional scenario without discriminating between known and unknown trials. $C_{\mathrm{llr}}$ was used because it is a measure of both discrimination and calibration.
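For readers unfamiliar with the calibration-sensitive metric, the sketch below computes the log-likelihood-ratio cost as defined in [27]. It assumes that the CM scores can be read as natural-log likelihood ratios in which positive values favour bona fide; the function name and the toy scores are illustrative.

```python
import math


def cllr(bona_scores, spoof_scores):
    """Log-likelihood-ratio cost for natural-log LLR scores."""
    def softplus(x):
        # log(1 + exp(x)), computed without overflow
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    pos = sum(softplus(-s) for s in bona_scores) / len(bona_scores)
    neg = sum(softplus(s) for s in spoof_scores) / len(spoof_scores)
    return 0.5 * (pos + neg) / math.log(2.0)  # convert nats to bits


print(cllr([0.0], [0.0]))             # uninformative scores
print(cllr([5.0, 6.0], [-5.0, -6.0])) # well separated and well calibrated
print(cllr([-5.0], [5.0]))            # confidently wrong
```

Scores that carry no information give a cost of 1, while confidently wrong scores are penalized well above 1, which is why the metric reflects calibration as well as discrimination.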
The second set of metrics was for the confidence estimators. By treating known and unknown as the positive and negative classes, respectively, we computed the false positive rate (FPR) at the threshold for which the true positive rate (TPR) is 95%. This metric is used in other studies [13, 21]. We also computed the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR).
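The two threshold-related metrics can be sketched as follows, with known trials as positives and a trial accepted when its confidence is at or above the threshold. The function names and the toy scores are assumptions; AUPR is omitted to keep the example short, and the example uses a TPR target of 75% only because there are four known trials.

```python
def fpr_at_tpr(known_scores, unknown_scores, target_tpr=0.95):
    """Return (FPR, threshold) at the highest threshold reaching `target_tpr`."""
    ranked = sorted(known_scores, reverse=True)
    n = len(ranked)
    # Smallest number of accepted known trials whose rate reaches the target.
    k = next(i for i in range(1, n + 1) if i / n >= target_tpr)
    threshold = ranked[k - 1]
    fpr = sum(s >= threshold for s in unknown_scores) / len(unknown_scores)
    return fpr, threshold


def auroc(known_scores, unknown_scores):
    """Probability that a known trial outscores an unknown one (ties count 0.5)."""
    wins = sum((k > u) + 0.5 * (k == u)
               for k in known_scores for u in unknown_scores)
    return wins / (len(known_scores) * len(unknown_scores))


known = [0.9, 0.8, 0.7, 0.6]
unknown = [0.85, 0.3, 0.2, 0.1]
print(fpr_at_tpr(known, unknown, target_tpr=0.75))
print(auroc(known, unknown))
```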
Finally, another EER for the CMs was computed over trials whose confidence score was larger than the threshold at TPR=95%. This EER measures how well a CM discriminates between bona fide and spoofed trials that it is confident about.
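The EER over retained trials can be sketched as below. The EER routine is a simple threshold sweep rather than an interpolated ROC computation, and the tuple layout, function names, and toy trials are illustrative assumptions. The sketch also assumes that both classes remain after filtering.

```python
def eer(bona_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores."""
    best_gap, best_eer = None, None
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)     # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoofed accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def eer_on_confident(trials, conf_threshold):
    """EER over trials whose confidence exceeds the threshold.

    `trials` holds (cm_score, confidence, is_bona_fide) tuples.
    """
    kept = [t for t in trials if t[1] > conf_threshold]
    bona = [s for s, _, is_bona in kept if is_bona]
    spoof = [s for s, _, is_bona in kept if not is_bona]
    return eer(bona, spoof)


trials = [
    (2.0, 0.9, True), (1.5, 0.8, True), (-0.5, 0.2, True),
    (-2.0, 0.9, False), (-1.0, 0.7, False), (0.8, 0.1, False),
]
eer_all = eer([s for s, _, b in trials if b], [s for s, _, b in trials if not b])
print(eer_all, eer_on_confident(trials, conf_threshold=0.5))
```

In this toy set the two hard trials also have low confidence, so dropping them lowers the EER; on real data the gain depends on how well the confidence score tracks classification difficulty.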
3.4 Results and discussions
The evaluation results are listed in Tab. 2. Note that the CMs with a confidence branch were trained with the loss in Eq. (3), and their $C_{\mathrm{llr}}$ and EER were hence different from those of the other CMs, which used a non-trainable confidence estimator. The CMs with the energy-based confidence estimator were also re-trained when using T2 or T3.
Which confidence estimator is effective? To answer this question, we first focus on the condition using the training set T1, test set E1, and the CMs with the AM softmax. Among the three confidence estimators, the max-probability-based one performed poorly. As shown in Fig. 1(a), the confidence score was close to the maximum value of 1.0 for most of the unknown test trials (in red color), and it was impossible to compute FPR at TPR=95%. This indicates that the CM was overconfident, which is consistent with the findings in other studies [17, 18]. In comparison, the energy-based method and the confidence branch produced relatively useful confidence scores. Although the FPR at TPR 95% was higher than 80% for both methods, the AUROC and AUPR improved. Particularly, as shown in Fig. 1(b), the confidence branch produced confidence scores that varied for the three types of test trials even though most of the CM scores were either -1 or 1.
For T1-E1, the max probability, energy-based scoring, and confidence branch improved the AUROC values when the CM used the plain softmax rather than the AM one. However, although the results of the max-probability method were close to the other two methods, Fig. 1(c) shows that many of the unknown spoofed trials (in red color) still received a confidence score close to 1.0. In contrast, the confidence scores from the other two methods were more dispersed as Figs. 1(d) and 1(e) show. The M-distance-based method was not competitive as the results demonstrate.
The energy-based score and confidence branch were good candidates for the task. Interestingly, the energy-based confidence scores were highly correlated with the CM scores. This is partially due to the large numeric difference between the two logits. When one logit is much larger than the other, the confidence score $\log\sum_j \exp(l_j)$ becomes approximately $\max_j l_j$ and correlates with the CM score, which is computed from the same logits. This also indicates that the logits from the plain softmax contain useful information for confidence estimation.
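The approximation behind this observation can be spelled out as a short worked identity. It assumes that the energy-based score is the log-sum-exp of two logits, written here as $l_1$ and $l_2$ (the symbol names are assumptions):

```latex
\begin{align}
\log\left(e^{l_1} + e^{l_2}\right)
  &= \max(l_1, l_2) + \log\left(1 + e^{-|l_1 - l_2|}\right) \\
  &\approx \max(l_1, l_2) \quad \text{when } |l_1 - l_2| \gg 0 .
\end{align}
```

For example, with $l_1 = 8$ and $l_2 = -8$ the correction term is $\log(1 + e^{-16}) \approx 1.1 \times 10^{-7}$, so the score is essentially the larger logit.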
When is the confidence estimation useful for the CM? In the condition T1-E1, the EER at TPR=95% was not lower than the original EER in most cases. The CM cannot better discriminate bona fide and spoofed trials on which the CM is confident. One possible reason is that the CM has been unintentionally overfitted to the attackers in the LA test set (see the footnote on the 1st page). Therefore, to reveal the usefulness of confidence estimation, we need to examine the performance on real unknown trials in the test set E2. From the results for T1-E2, we observed that the EER at TPR=95% was lower than the original EER for all confidence estimators except the max-probability-based one for the AM softmax CM. As Fig. 1(g) on the energy-based method shows, the bona fide (in green color) and spoofed (in red color) trials from VCC had smaller confidence scores than those from LA (in grey and blue color). This is expected since the trials from VCC were quite different from those in the CM’s training set. Avoiding making decisions on these low-confidence trials is a reasonable strategy for the CM.
Can we improve the confidence estimator if we have some unknown spoofing data? A comparison of the metrics across T1, T2, and T3 indicates that it was not effective to use unknown training data to fine-tune the CM with an M-distance- or energy-based confidence estimator. In the case of using the energy-based estimator on T2, a comparison between Figs. 1(e) and 1(f) shows that the confidence score of unknown spoofed test trials was pushed towards that of the known ones. Although the M-distance’s FPR was slightly improved when evaluating on E1, it was still higher than 90%. Note that the M-distance achieved similar FPR, AUROC, and AUPR values for T2-E1 and T3-E1 because the confidence scores were similar.
Last but not least, the estimator trained in a supervised manner produced an AUROC smaller than or around 0.5. One possible reason is that the trials for T2 and T3 were too different from those for T1 and the test set. The confidence estimator learned a simple decision boundary that separated T2 and T3 from T1 but did not generalize to the spoofing attacks in the test set.
4 Conclusion
This study investigated speech spoofing CMs that can opt for abstention. This is implemented by augmenting the CMs with a confidence estimator and comparing the confidence score of an input trial against a decision threshold. We compared various methods to estimate the confidence score and conducted experiments on a mix of speech databases. The results on the ASVspoof 2019 LA database demonstrated that the energy-based confidence score can be a convenient method for estimating the confidence of pre-trained CMs. The confidence branch is also a potential candidate. Another experiment with unknown spoofed trials from the VCC database showed that the CM can reduce its misclassification rate if it refrains from classifying low-confidence trials.
This study also found that it is difficult to use unknown attacks to improve the confidence estimation performance. Using high-quality spoofed trials from BC19 and ESPNet for training turned out to be detrimental. Future work will look into this issue.
References
- [1] Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in Proc. Interspeech, 2013, pp. 925–929.
- [2] ISO/IEC JTC1 SC37 Biometrics, ISO/IEC 30107-1. Information Technology - Biometric presentation attack detection - Part 1: Framework, 2016.
- [3] Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi H. Kinnunen, Ville Vestman, Massimiliano Todisco, Hector Delgado, Md Sahidullah, Junichi Yamagishi, and Kong Aik Lee, “ASVspoof 2019: Spoofing Countermeasures for the Detection of Synthesized, Converted and Replayed Speech,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 2, pp. 252–265, Apr. 2021.
- [4] Dipjyoti Paul, Md Sahidullah, and Goutam Saha, “Generalization of spoofing countermeasures: A case study with ASVspoof 2015 and BTAS 2016 corpora,” in Proc. ICASSP. IEEE, 2017, pp. 2047–2051.
- [5] Rohan Kumar Das, Jichen Yang, and Haizhou Li, “Assessing the scope of generalized countermeasures for anti-spoofing,” in Proc. ICASSP. IEEE, 2020, pp. 6589–6593.
- [6] Joshua Attenberg, Panos Ipeirotis, and Foster Provost, “Beat the machine: Challenging humans to find a predictive model’s “unknown unknowns”,” Journal of Data and Information Quality (JDIQ), vol. 6, no. 1, pp. 1–17, 2015.
- [7] Rohan Kumar Das, “Known-unknown Data Augmentation Strategies for Detection of Logical Access, Physical Access and Speech Deepfake Attacks: ASVspoof 2021,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 29–36.
- [8] Burr Settles, “Active Learning Literature Survey,” Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
- [9] Heinrich Jiang, Been Kim, Melody Y Guan, and Maya Gupta, “To trust or not to trust a classifier,” in Proc. NIPS, 2018, pp. 5546–5557.
- [10] Terrance DeVries and Graham W Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
- [11] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin, “A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks,” in Proc. NIPS, 2018, pp. 7167–7177.
- [12] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li, “Energy-based Out-of-distribution Detection,” in Proc. NIPS, 2020, vol. 33, pp. 21464–21475.
- [13] Dan Hendrycks and Kevin Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in Proc. ICLR, 2017.
- [14] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sébastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-François Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, and Zhen-Hua Ling, “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, Nov. 2020.
- [15] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods,” in Proc. Odyssey, 2018, pp. 195–202.
- [16] Zhao Yi, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhen-Hua Ling, and Tomoki Toda, “Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion,” in Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80–98.
- [17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger, “On calibration of modern neural networks,” in Proc. ICML. PMLR, 2017, pp. 1321–1330.
- [18] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic, “Revisiting the Calibration of Modern Neural Networks,” arXiv preprint arXiv:2106.07998, 2021.
- [19] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
- [20] Nanxin Chen, Yanmin Qian, Heinrich Dinkel, Bo Chen, and Kai Yu, “Robust deep feature for spoofing detection - The SJTU system for ASVspoof 2015 challenge,” in Proc. Interspeech, 2015, pp. 2097–2101.
- [21] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich, “Deep anomaly detection with outlier exposure,” in Proc. ICLR, 2019.
- [22] Zhizheng Wu, Zhihang Xie, and Simon King, “The blizzard challenge 2019,” in Proc. Blizzard Challenge Workshop, 2019.
- [23] Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan, “Espnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit,” in Proc. ICASSP. IEEE, 2020, pp. 7654–7658.
- [24] Xin Wang and Junichi Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection,” in Proc. Interspeech, 2021, pp. 4259–4263.
- [25] Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC Antispoofing Systems for the ASVspoof2019 Challenge,” in Proc. Interspeech, 2019, pp. 1033–1037.
- [26] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2014.
- [27] David A Van Leeuwen and Niko Brümmer, “An introduction to application-independent evaluation of speaker recognition systems,” in Speaker classification I, pp. 330–353. Springer, 2007.