An Automatic Speaker Verification (ASV) system is a biometric authentication method that works with human speech. Like any type of protection, it may be susceptible to tampering by so-called spoofing attacks, which can be classified into four kinds: impersonation, voice conversion (VC), replay attacks and text-to-speech (TTS). There are two ways to deal with spoofing attacks. The first is to improve the reliability of ASV systems themselves. The second is to use specialized algorithms, called countermeasures (CM), which detect unauthorized access.
The ASVspoof initiative and the first ASVspoof Challenge 2015 were created to standardize research in the field, along with its metrics and databases . The competition has been held four times, in 2015, 2017, 2019 and 2021. The latest edition contained three tasks :
Logical Access (LA) task: spoofing data is created by VC and TTS algorithms and transmitted over telephone and VoIP networks.
Physical Access (PA) task: CMs for replay attacks.
Speech Deepfake (DF) task: a combination of the other tasks, but without speaker verification.
The new Spoofing Aware Speaker Verification (SASV) Challenge 2022 is a continuation of the ASVspoof series. However, in ASVspoof the CMs work with a fixed ASV system. The organizers of the SASV Challenge believe that joint optimization of the CM and ASV systems can lead to more robust models. This paper compares different approaches to solving this challenge.
Existing architectures for CM systems can be divided into two types: those that work with the raw signal as input, and those that apply time-frequency transforms to create spectrograms. In , the authors showed that spoofing artifacts lie in particular frequency sub-bands rather than in the full band. Because of that, the performance of spectrogram-based methods relies on the time-frequency algorithm's resolution in the sub-band where the spoofing attack left its marks. Different spectrograms enhance different frequencies, which leads to their contrasting capabilities. Hence, combining architectures with different front-ends can result in a robust model. Raw-input solutions learn by themselves which frequencies are helpful during the training stage. However, their aptitude is still limited. Thus, an architecture with great discriminating power should use fusion over models with different input types.
Fusion methods that appear in the literature are usually applied to scores obtained from different CM models. It has been shown that non-linear types of fusion perform better than linear ones . However, averaging or a weighted sum with pre-normalization are still the most common [1, 7, 8, 9, 10]. Other reported methods, where scores from CMs are stacked into one feature vector, are presented below:
All of these methods, except the Decision Tree, have been shown to improve performance and surpass single-model systems.
Fusion over embeddings occurs much less often. The authors of  created a Fusion-Layer, which combines the outputs of model sub-parts. The SASV Challenge 2022 organizers used a 3-layer MLP with embeddings from both the ASV and CM systems .
In this paper we explore and compare the performance of different fusion methods. Unlike existing solutions, which mostly use scores, we take embeddings from the CM and ASV systems. Moreover, only one ensemble method of Decision Trees has been applied in anti-spoofing research: bagging (specifically, Random Forest). We propose using another Decision Tree ensemble method, boosting. Concretely, fusion is done using CatBoost , and it significantly outperforms the SASV Challenge 2022 baselines on all metrics. In addition, we test Random Fourier Features (RFF)  with a logistic regression base, which is an RBF SVM approximation method.
The remainder of this paper is organized as follows. Section 2 describes the CM and ASV systems whose embeddings are taken for fusion. Section 3 describes the experimental setup, whose results are reported in Section 4. Conclusions are presented in Section 5.
2 CM and ASV systems
This section describes the CM and ASV systems used in our experiments.
2.1 Countermeasures
For the CM system we took five different models: two with raw input and three with spectrogram front-ends.
Firstly, we used AASIST , which is the baseline CM subsystem in the SASV Challenge 2022. It is a spectro-temporal graph attention network working with raw input. Sending a waveform through AASIST results in a 190-dimensional embedding. Scores are the outputs of the last fully-connected (FC) layer (using softmax is an option).
Secondly, we adapted the RawNet2  architecture for our training pipeline; specifically, Dropout layers were added. RawNet2 was a baseline CM system for the ASVspoof Challenge 2021 . It consists of a Sinc-layer, ResBlocks and a GRU-layer, ending with two FC-layers. Embeddings are 1024-dimensional vectors obtained from the output of the first classifying FC-layer. Scores come from the last layer.
Finally, we used three LCNN-based models  with different front-ends: STFT, MEL and CQT spectrograms. These time-frequency transform methods enhance different frequency sub-bands, which makes it possible to create a robust model by combining them. LCNN is a classic CNN model with Max-Feature-Map operations as activation functions. Embeddings are the outputs of the penultimate FC-layer and have 80 dimensions for each type. Scores come from the last layer.
2.2 Automatic Speaker Verification
For the ASV system we used ECAPA-TDNN, proposed in , which is also the baseline ASV subsystem in the SASV Challenge 2022. This subsystem consists of Squeeze-Excitation Res2Blocks, Statistic Pooling and Multi-layer Feature Aggregation. ECAPA-TDNN returns two 160-dimensional vectors for the enrollment and test utterances respectively. The score is the cosine similarity between the embeddings.
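The cosine-similarity scoring above can be sketched as follows; this is a minimal illustration, where the random 160-dimensional vectors are stand-ins for real ECAPA-TDNN embeddings:

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between enrollment and test embeddings."""
    num = float(enroll_emb @ test_emb)
    den = float(np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-12)
    return num / den

# stand-in embeddings with the dimensionality stated above
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=160), rng.normal(size=160)
score = cosine_score(enroll, test)  # lies in [-1, 1]
```

Trials with a cosine score above a tuned threshold would be accepted as the same speaker.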
3 Experimental Setup
[Table 1: performance of fusion methods over embeddings; rows include SVM (Linear kernel), SVM (RBF kernel), SVM (Polynomial kernel), RFF (logistic-regression-based), Score-Sum (Baseline 1) and 3-layer MLP (Baseline 2); numerical values not recovered.]
In this section we describe how we trained our models and fused them for the subsequent evaluation.
3.1 Training Stage
Our ASV system, ECAPA-TDNN, was pre-trained on the VoxCeleb2 dataset . It is the same as the SASV Challenge 2022 organizers' ASV subsystem.
All CMs were trained on the ASVspoof 2019 LA train partition . We also took the pre-trained AASIST, as was done in the competition. However, the RawNet2 and LCNN-based models were trained from scratch with data preprocessing and data augmentation.
In the DF task of the ASVspoof Challenge, speech waveforms were encoded and then decoded with different lossy codecs. This added distortion to the data and complicated the detection of spoofing attacks. However, compressed audio is common in practice, so we believe that a robust CM intended for real-life use should handle these encoding-decoding perturbations. To this end, we added random compression with the mp3 and aac codecs as data augmentation.
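One way to implement such an augmentation is to round-trip each training file through a lossy codec. The sketch below only builds hypothetical ffmpeg commands; the encoder names, bitrate and augmentation probability are our illustrative assumptions, not the paper's settings:

```python
import random

CODECS = {"mp3": "libmp3lame", "aac": "aac"}  # assumed ffmpeg encoder names

def build_compress_cmd(src: str, dst: str, codec: str, bitrate: str = "64k"):
    """Return an ffmpeg argument list that re-encodes src with a lossy codec.

    Decoding dst back to wav afterwards yields the encoded-decoded waveform.
    """
    return ["ffmpeg", "-y", "-i", src, "-c:a", CODECS[codec], "-b:a", bitrate, dst]

def maybe_augment(src: str, p: float = 0.5, rng=random.Random(0)):
    """With probability p, pick mp3 or aac at random and return the command."""
    if rng.random() >= p:
        return None  # leave this sample uncompressed
    codec = rng.choice(list(CODECS))
    return build_compress_cmd(src, f"compressed.{codec}", codec)
```

In a real pipeline the returned command would be run with `subprocess` and the result decoded back to PCM before feature extraction.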
A recent study shows that the ASVspoof Challenge database has a very uneven distribution of silence: the bona fide samples have much longer silences than many of the attacks, especially the TTS ones . This leads to a serious problem: models partially base their predictions on the duration of silence in the data. It was shown that the same architectures, trained on samples with trimmed silence, get much worse results. We believe that a strong biometric authentication system has to deal with silence-free data and reject silent samples: indeed, anyone who wants access to protected information should provide their biometry, and there is no biometry in silence. Therefore, we preprocess the data for the RawNet2 and LCNN-based models by removing silences with a simple magnitude-based Voice Activity Detection (VAD) algorithm, improving their ability to detect spoofing artifacts.
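A magnitude-based VAD of this kind can be sketched as follows; the frame length and threshold here are illustrative choices, not the exact values from our pipeline:

```python
import numpy as np

def trim_silence(x: np.ndarray, frame_len: int = 400, threshold_db: float = -35.0):
    """Drop non-overlapping frames whose RMS is far below the loudest frame."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    # frame level in dB relative to the loudest frame
    level_db = 20.0 * np.log10(rms / (rms.max() + 1e-12))
    keep = level_db > threshold_db
    return frames[keep].ravel()
```

Frames whose energy is within `threshold_db` of the loudest frame are kept; everything else, including leading and trailing silence, is discarded before feature extraction.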
We trained RawNet2 and LCNN-based models for 30 epochs with the Adam optimizer.
3.2 Fusion methods over embeddings
We have already discussed why fusion of CMs with different input types is better than a single system. We hypothesize that, in the same manner, fusion of CMs with and without silence trimming results in a well-performing model that detects artifacts and does not base its predictions on silence duration alone.
For this study we take embeddings from each sub-model and concatenate them into one long feature vector, which is used as the input for the final classifier. The classifier is then trained on embeddings from the training data. The RawNet2 and LCNN-based models' embeddings are taken from preprocessed and encoded-decoded data. For each LCNN-based model, the input spectrogram is split into three equal-sized parts; we get embeddings for each part and stack them together.
Final scores for the development and evaluation sets are obtained from the trained classifier. We again use the same preprocessing and encoding-decoding for the RawNet2 and LCNN-based models.
The resulting pipeline of our system is given in Figure 1. The final feature vector is a 2288-dimensional vector of numerical features.
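The assembly of the final feature vector can be sketched as follows. The arrays below are random stand-ins with the embedding sizes from Section 2; the exact composition that yields the stated 2288 dimensions (e.g., whether any scores are appended) is not fully spelled out, so the total in this sketch is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in per-utterance embeddings (dimensions from Section 2)
asv_enroll = rng.normal(size=160)   # ECAPA-TDNN, enrollment utterance
asv_test = rng.normal(size=160)     # ECAPA-TDNN, test utterance
aasist = rng.normal(size=190)
rawnet2 = rng.normal(size=1024)
# each LCNN front-end (STFT, MEL, CQT): spectrogram split into three
# equal parts, one 80-dim embedding per part, stacked together
lcnn = [rng.normal(size=(3, 80)).ravel() for _ in range(3)]

features = np.concatenate([asv_enroll, asv_test, aasist, rawnet2, *lcnn])
```

The resulting vector is what the final classifier consumes, one row per trial.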
The methods tested as the final classifier are as follows:
3-layer MLP. The same architecture that was used in the SASV Challenge 2022 baseline, but with our 2288-dimensional vector as input: three FC-layers with 256, 128 and 64 output features respectively, LeakyReLU as the activation, and one final classifying FC-layer.
Logistic Regression. Iterations are limited to  (we also tried a larger value, but its performance was worse). The regularization coefficient is , where  is the number of utterances in the ASVspoof 2019 LA train partition. All other parameters are set to default (we used scikit-learn v0.24.2 ).
SVM with a linear kernel. Iterations are limited to , the regularization coefficient is . All other parameters are set to default.
SVM with an RBF kernel. Iterations are limited to , the regularization coefficient is . All other parameters are set to default.
SVM with a polynomial kernel. Iterations are limited to , the regularization coefficient is , and the degree is set to . All other parameters are set to default.
RFF with logistic regression. The classifier uses linear principal component analysis (PCA) with  dimensions and a Standard Scaler. The number of random features is set to . Iterations of the inner logistic regression are limited to , the regularization coefficient is . All other parameters are set to default.
GMM. The number of components is set to . The number of EM iterations is . All other parameters are set to default.
Random Forest. The number of estimators is , all other parameters are set to default.
CatBoost. The number of estimators is , all other parameters are set to default (we used catboost v1.0.4 ).
3.3 Fusion methods over scores
For completeness of our study, we tested whether we could improve the results by fusing scores. First of all, we replicated SASV Challenge 2022 Baseline 2 (called RB2 for brevity). Then we took RB2 and our best model (OBM) from the embedding experiments and stacked their scores for each utterance into one 4-dimensional feature vector.
We then tested the same classifiers, except the MLP and RFF, because the input size is too small for them. For Logistic Regression and the SVM-based methods, the iterations were unlimited and the regularization coefficient was set to .
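The score-level fusion can be sketched as follows. The synthetic 4-dimensional score vectors stand in for the stacked RB2 and OBM scores, and the degree-7 polynomial kernel matches our best-performing configuration; the class separation in the synthetic data is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 400
# 1 = target bona fide trial, 0 = non-target or spoofed trial
y = rng.integers(0, 2, size=n)
# synthetic stand-in scores: two per utterance from RB2, two from OBM,
# shifted upwards for positive trials to mimic informative sub-systems
scores = y[:, None] * 1.5 + rng.normal(scale=0.5, size=(n, 4))

fusion = SVC(kernel="poly", degree=7)
fusion.fit(scores, y)
acc = fusion.score(scores, y)
```

With real data, `scores` would hold the four stacked sub-system scores per trial, and the fitted SVM's decision values would serve as the final SASV scores.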
[Table 2: performance of fusion methods over scores; rows include SVM (Linear kernel), SVM (RBF kernel), SVM (Polynomial kernel), Score-Sum (Baseline 1) and 3-layer MLP (Baseline 2); numerical values not recovered.]
4 Results
The results for the different fusion methods over embeddings described in Section 3.2 are shown in terms of SV-EER, SPF-EER and SASV-EER in Table 1. The performance of the SASV Challenge 2022 baseline systems and of ECAPA-TDNN without any CM is given at the bottom of the table. SV-EER measures how well the model distinguishes target and non-target trials; SPF-EER, target and spoofed trials. SASV-EER is the general metric, where non-target and spoofed trials are treated equally.
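All three EERs can be computed the same way, varying only which trials form the negative class (non-target for SV-EER, spoofed for SPF-EER, and their union for SASV-EER). A minimal sketch:

```python
import numpy as np

def eer(pos: np.ndarray, neg: np.ndarray) -> float:
    """Approximate equal error rate: sweep a threshold over all observed
    scores and return the error level where the false-acceptance and
    false-rejection rates meet."""
    best = 1.0
    for t in np.sort(np.concatenate([pos, neg])):
        far = float(np.mean(neg >= t))  # negatives wrongly accepted
        frr = float(np.mean(pos < t))   # positives wrongly rejected
        best = min(best, max(far, frr))
    return best

# SASV-EER pools non-target and spoofed trials as the negative class, e.g.:
# eer(target_scores, np.concatenate([nontarget_scores, spoof_scores]))
```

Production evaluation code typically computes the exact crossing point of the ROC curve; this threshold sweep is a simple approximation that is exact when the score distributions separate cleanly.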
We can clearly see that the GMM approach works like a random binary classifier. This was expected, because the embeddings hardly resemble Gaussian vectors. Hence, GMM should not be used for fusion over embeddings.
Unlike the results from , where the polynomial SVM outperformed the RBF and linear ones, in our experiment the SVM with the RBF kernel performs best of the three. Logistic regression matches the RBF SVM in terms of SPF-EER and the polynomial SVM in terms of SASV-EER, but its SV-EER is equal to random classification. The linear SVM's results are poor as well. The reason for this performance of logistic regression and the linear SVM is the linear inseparability of bona fide and non-target trials, which can be inferred from the superiority of the non-linear approaches. The 3-layer MLP has the same problem, caused by the small training set and narrow layers with a large input size. In theory, Random Fourier Features should approximate an SVM with an RBF kernel. We can see from Table 1 that RFF indeed performs similarly to the SVM methods, but much worse than the RBF SVM itself.
Random Forest, a bagging ensemble method of Decision Trees, exceeds Baseline 2 in terms of SPF-EER and has results similar to the polynomial SVM on the other metrics. Thus, Random Forest is better than the SVM with a polynomial kernel in our study.
The final tested method is CatBoost. It surpasses all competing methods on all metrics, with a huge margin in SV-EER and SASV-EER. Moreover, it reduces the error rates obtained by the baselines by a large relative margin for SV-EER, SPF-EER and SASV-EER on the development and evaluation sets. Only the single ECAPA-TDNN outperforms the CatBoost approach on SV-EER, which is not surprising, because adding countermeasures worsens verification ability.
To further improve the results, we took the model with the CatBoost fusion method (OBM) and combined it with RB2 in our second experiment, described in Section 3.3. Its results are presented in Table 2.
Scores are more likely than embeddings to follow a Gaussian mixture. The GMM approach still deteriorates the performance of the best single model; however, it is far from random classification. Hence, GMM is not hopeless as a score-fusion method.
Logistic Regression and the SVM-based methods have similar results. The polynomial SVM performs best of the four, with only a slight degradation in terms of SPF-EER and SASV-EER on the evaluation set. These four methods outperform OBM on the evaluation set and are slightly worse on the development set. Hence, these methods act as a kind of regularization.
Both Decision Tree ensemble methods yield unsatisfactory results; however, they are better than GMM.
Thus, our best results are obtained using an SVM with a 7-degree polynomial kernel for score fusion of RB2 and OBM, where OBM is a CatBoost fusion of the ECAPA-TDNN, AASIST, RawNet2 and three LCNN models' embeddings. This system is our submission to the SASV Challenge 2022.
5 Conclusions
This paper reports a comparison of different fusion methods over embeddings. The results show that the CatBoost approach, which has not appeared in anti-spoofing studies before, outperforms all other methods by a huge margin. Moreover, we trimmed silence in the audio data for some of our sub-models, which was demonstrated in  to worsen performance, and still obtained the desired results. This indicates how robust the CatBoost method is. Other fusion-over-embeddings approaches can still achieve satisfactory results, but only for distinguishing target and spoofed trials; they perform poorly on the SV-EER and SASV-EER metrics.
However, for fusion over scores everything is different. Logistic Regression and the SVM-based methods outperform the best single system on the evaluation set and act as a regularization. Random Forest and CatBoost did not achieve great results; thus, these methods are better used with embeddings than with scores. Finally, GMM should only be used as a score-fusion approach, and its performance highly depends on the distribution of the sub-models' scores. In addition, our study confirms that non-linear fusion methods are better than linear ones. This paper is submitted to INTERSPEECH 2022.
-  J.-w. Jung, H. Tak, H.-j. Shim, H.-S. Heo, B.-J. Lee, S.-W. Chung, H.-G. Kang, H.-J. Yu, N. Evans, and T. Kinnunen, “SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan,” arXiv preprint arXiv:2201.10283, 2022.
-  Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
-  Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth annual conference of the international speech communication association, 2015.
-  H. Delgado, N. Evans, T. Kinnunen, K. A. Lee, X. Liu, A. Nautsch, J. Patino, M. Sahidullah, M. Todisco, X. Wang et al., “ASVspoof 2021: Automatic speaker verification spoofing and countermeasures challenge evaluation plan,” arXiv preprint arXiv:2109.00535, 2021.
-  H. Tak, J. Patino, A. Nautsch, N. Evans, and M. Todisco, “An Explainability Study of the Constant Q Cepstral Coefficient Spoofing Countermeasure for Automatic Speaker Verification,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 2020, pp. 333–340.
-  ——, “Spoofing Attack Detection Using the Non-Linear Fusion of Sub-Band Classifiers,” in Proc. Interspeech 2020, 2020, pp. 1106–1110.
-  X. Wang and J. Yamagishi, “A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection,” in Proc. Interspeech 2021, 2021, pp. 4259–4263.
-  J. Cáceres, R. Font, T. Grau, J. Molina, and B. V. SL, “The Biometric Vox system for the ASVspoof 2021 challenge,” in Proc. ASVspoof2021 Workshop, 2021.
-  T. Chen, E. Khoury, K. Phatak, and G. Sivaraman, “Pindrop Labs’ Submission to the ASVspoof 2021 Challenge,” in Proc. ASVspoof 2021 Workshop, 2021.
-  W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for Automatic Speaker Verification Replay Spoofing Attack : On Data Augmentation, Feature Representation, Classification and Fusion,” in Proc. Interspeech 2017, 2017, pp. 17–21.
-  H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with RawNet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373.
-  W. H. Kang, J. Alam, and A. Fathan, “CRIM’s system description for the ASVspoof 2021 Challenge,” in Proc. ASVspoof 2021 Workshop, 2021.
-  X. Chen, Y. Zhang, G. Zhu, and Z. Duan, “UR channel-robust synthetic speech detection system for ASVspoof 2021,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 75–82.
-  H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 1–8.
-  L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “CatBoost: unbiased boosting with categorical features,” Advances in neural information processing systems, vol. 31, 2018.
-  A. Rahimi and B. Recht, “Random Features for Large-Scale Kernel Machines,” in Proceedings of the 20th International Conference on Neural Information Processing Systems, ser. NIPS’07. Curran Associates Inc., 2007, p. 1177–1184.
-  J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks,” arXiv preprint arXiv:2110.01200, 2021.
-  X. Wu, R. He, Z. Sun, and T. Tan, “A Light CNN for Deep Face Representation with Noisy Labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
-  G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC Antispoofing Systems for the ASVspoof2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1033–1037.
-  B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090.
-  X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020.
-  N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, K. Böttinger, and J. Williams, “Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 55–60.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.