Recent advances in audio spoofing techniques, such as voice conversion, speech synthesis, and replay attacks, threaten the reliability of automatic speaker verification (ASV) systems. To protect ASV systems against spoofing attacks, the ASVspoof initiative leads research on audio spoofing detection by collecting and providing common datasets and by organizing challenges [1, 2, 3]. ASVspoof 2019 focuses on countermeasures for three major attack types and comprises two data conditions: logical access (LA) and physical access (PA). LA covers voice conversion and speech synthesis, while PA covers replay attacks.
The PA task requires the physical process of recording and playback for data collection, whereas spoofed data for the LA task can be generated easily through a computerized process. This requirement makes data collection for the PA task difficult. The scale of datasets for replay spoofing detection is therefore limited compared to other domains, such as speaker recognition, demonstrating the difficulty of gathering replayed utterances. For example, the recently released ASVspoof 2019 dataset contains utterances from only 40 speakers. These difficulties in collecting replay attack data thus hinder progress in replay attack detection research.
This study is mainly motivated by the work of Shim et al., which demonstrated that information about replay configurations, including playback devices, the environment, and recording devices, can aid in detecting replay attacks. Because replay configurations are embedded in replayed speech, they may provide information that differentiates replayed speech from bona-fide speech. Inspired by this concept, we assume that learning overall acoustic configurations can help generalize to unseen acoustic configurations. Because the amount of replay spoofing data is limited, generalization to unseen conditions, especially channel mismatch conditions, is an important problem in replay spoofing detection. However, conventional studies can use only labeled data to learn acoustic configurations.
To overcome this limitation, we propose a novel scheme that utilizes self-supervised learning. The proposed scheme trains on various audio segments from YouTube source data without explicit supervision of acoustic configurations for the PA task. Acoustic configurations refer to the diverse variables present in the process of a voice being uttered by a speaker and recorded through a microphone; for example, microphone type, location, and noise level. Self-supervised learning is a method that trains deep neural networks (DNNs) with automatically generated labels instead of human-annotated labels. Naturally available information, such as relevant context, correlations, and embedded metadata, can be used to generate labels automatically. Self-supervised learning makes it possible to use external data unrelated to the PA task and helps avoid overfitting to the limited PA dataset.
Specifically, we define learning acoustic configurations as a binary classification scheme using self-supervised learning. We compare two audio segments and determine whether they lie within a single utterance, thereby learning acoustic configurations without any explicit labels. We hypothesize that two audio segments have the same acoustic configuration if they are extracted from the same utterance, and different acoustic configurations if they are extracted from different utterances, even when both originate from a single speaker. To the best of our knowledge, this is the first approach using self-supervised learning for spoofing detection. With the proposed approach, speech data published for various purposes can be used to develop a spoofing detection system.
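The pair-composition rule above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the paper's code: the function names, the list-of-samples waveform representation, and the +1/-1 label convention are our assumptions.

```python
import random


def crop(wave, seg_len, rng):
    """Randomly crop a fixed-length segment from a waveform (a list of samples)."""
    start = rng.randrange(0, len(wave) - seg_len + 1)
    return wave[start:start + seg_len]


def make_pair(utterances, seg_len, positive, rng):
    """Build one self-supervised training pair from a single speaker's utterances.

    positive=True : both segments are cropped from the same utterance -> label +1
    positive=False: segments are cropped from two different utterances -> label -1
    """
    if positive:
        u = rng.choice(utterances)
        return crop(u, seg_len, rng), crop(u, seg_len, rng), 1
    u1, u2 = rng.sample(utterances, 2)
    return crop(u1, seg_len, rng), crop(u2, seg_len, rng), -1
```

Because both branches draw from one speaker's utterances, the only signal distinguishing positive from negative pairs is the acoustic configuration, not speaker identity.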
2 Self-supervised learning
The performance of DNNs is affected by the amount of training data as well as the network capabilities. In the image domain, pre-training with large-scale datasets, such as ImageNet, and fine-tuning for other tasks is a commonly used approach. The benefit of this approach is that a wide variety of data can help initialize parameters and enable fast convergence. Furthermore, when there are few data for the target task, learning about other tasks can help reduce overfitting problems. Although large-scale datasets with labels lead to higher performance, the cost of collecting and annotating large-scale datasets is enormous. Therefore, research on utilizing unlabeled data has been actively conducted in recent years, and self-supervised learning is among the solutions.
Self-supervised learning is a method that trains DNNs using naturally available, relevant context from the data without human-annotated labels. A self-supervised scheme has the advantage of being able to use unlabeled data, deriving supervisory signals from the data itself to learn representations. Self-supervised learning consists of training a pretext task followed by a downstream task, which can be thought of as pre-training and main training, respectively. The pretext task is used to generate useful feature representations as prerequisite knowledge related to the target task. After training the pretext task, the learned parameters are transferred to the downstream task for fine-tuning. Self-supervised learning was first exploited for context prediction in the image domain, predicting the relative position of random pairs of patches within an image without information about the original image. Self-supervised learning is not limited to relative positioning and has been applied to various tasks, such as image transformations [10] in images and predicting sequence order in videos. It has also been exploited in the audio domain to extract speaker embeddings.
3 Proposed method
We propose a two-phase framework for replay spoofing detection that first trains a DNN to learn acoustic configurations using self-supervised learning and then performs conventional training for replay spoofing detection. The underlying hypothesis is that pre-training on acoustic configurations can improve the generalization of the DNN toward unseen acoustic configurations. Acoustic configurations include the environment in which the speaker speaks, the microphone type, and the distance between the microphone and the speaker. However, it is difficult to obtain labels for such acoustic configurations. In the ASVspoof 2017 dataset, each device is explicitly labeled; in the ASVspoof 2019 dataset, however, devices are labeled into only three quality levels (upper, middle, and lower), making it difficult to use this information.
With self-supervised learning, detailed information such as microphone type and environment is not necessary. Figure 1 illustrates the proposed method of pre-training acoustic configurations using self-supervised learning. We pre-train the DNN on the utterances of various individuals, audio devices, and surroundings included in the YouTube-sourced VoxCeleb datasets. Although an unlabeled dataset is generally used for training the pretext task in self-supervised learning, we compose pairs only within a speaker to exclude the impact of speaker information; the speaker labels used for pair construction were generated automatically from the YouTube data. This design is based on our comparison of pair composition within an identical speaker versus between different speakers: composing pairs within the same speaker performed better, as it excludes speaker information and focuses on acoustic configurations.
The training objective uses the cosine similarity between the embeddings of the two segments and the corresponding label of the pair. In particular, if the two segments are from the same utterance, the label is set to 1; otherwise, the label is set to -1.
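One plausible realization of this cosine-similarity objective is a cosine-embedding-style pair loss, sketched below in plain Python. The margin term and function names are illustrative assumptions; frameworks such as PyTorch provide an equivalent `nn.CosineEmbeddingLoss` with the same +1/-1 target convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def pair_loss(emb_a, emb_b, label, margin=0.0):
    """Pull same-utterance pairs (label=+1) together: loss = 1 - cos.
    Push different-utterance pairs (label=-1) apart: loss = max(0, cos - margin)."""
    cos = cosine_similarity(emb_a, emb_b)
    if label == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)
```

Minimizing this loss drives embeddings of segments from the same utterance toward cosine similarity 1, and embeddings of segments from different utterances toward (or below) the margin.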
An outline of the proposed system is provided in Figure 2. First, the DNN is trained to predict acoustic configurations in Phase 1. Pairs of segments are fed to the DNN, and the cosine distance is used to determine whether the two segments are similar or dissimilar. Here, similar means that the two segments are from a single utterance, while dissimilar means they are from different utterances. In Figure 2, one pair is composed of the segments with solid lines and another of the segments with dashed lines. After the pretext task is trained, we train the DNN for replay detection, the ultimate goal of the downstream task, in Phase 2. Replay detection is a binary classification task with two categories: bona-fide and spoofed (replayed). We initialize the DNN with the pre-trained weights and, in the experiments, compare the effects of freezing the weights of some layers. We follow the general scheme of training a DNN-based spoofing detection system to determine whether the input is spoofed.
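The Phase 2 initialize-then-freeze step can be illustrated with a minimal, framework-free sketch. The dict-based model, function name, and layer names here are hypothetical; in PyTorch the same effect is achieved by loading a state dict and setting `requires_grad = False` on frozen parameters.

```python
def transfer_and_freeze(pretrained, model, freeze_up_to):
    """Copy pretext-task weights into the downstream model, then return the
    names of layers that remain trainable. The first `freeze_up_to` layers
    are frozen, i.e., excluded from further gradient updates."""
    trainable = []
    for i, name in enumerate(model["layers"]):
        if name in pretrained:
            model["weights"][name] = pretrained[name]  # initialization transfer
        if i >= freeze_up_to:
            trainable.append(name)  # only these layers are updated in Phase 2
    return trainable
```

With `freeze_up_to=0`, all layers stay trainable and the pre-trained weights serve purely as initialization, which is the configuration the experiments in Section 5 ultimately favor.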
4 Experimental settings
4.1 Datasets
We used the VoxCeleb1 and VoxCeleb2 datasets for the proposed pre-training approach [4, 14]. Both datasets consist of utterances cropped from YouTube videos and were originally designed for text-independent speaker verification. VoxCeleb1 contains 153,516 utterances from 1,251 speakers across 22,496 video clips, while VoxCeleb2 contains 1,128,246 utterances from 6,112 speakers across 150,480 video clips. We used the entire VoxCeleb1 and VoxCeleb2 datasets for pretext task training; we also experimented with only half of the VoxCeleb1 dataset to compare the effect of the number of utterances on performance.
For detecting replay spoofing as the downstream task, we used the ASVspoof 2019 PA dataset. This dataset consists of 54,000 utterances (5,400 bona-fide, 48,600 spoofed) for training, 29,700 utterances (5,400 bona-fide, 24,300 spoofed) for development, and 137,457 utterances (18,090 bona-fide, 119,367 spoofed) for evaluation. The training and development data were created under 27 different acoustic configurations, combining three room sizes, three levels of reverberation, and three speaker-to-ASV microphone distances. For replayed utterances, nine different replay configurations were created by combining three attack-to-speaker recording distances and three loudspeaker quality levels.
4.2 Experimental configuration
Magnitude spectrograms of 2,048 points were extracted for all datasets, including VoxCeleb1, VoxCeleb2, and the ASVspoof 2019 PA dataset. The window length and shift size were 50 ms and 30 ms, respectively. We applied pre-emphasis before extracting spectrograms and did not apply any normalization to the acoustic features. Utterances of varying duration were used in the test phase for both the pretext and downstream tasks; for the training phases, they were cropped to lengths of 200 and 120 for the pretext and downstream tasks, respectively. We compared batch sizes of 16 and 32 in the pretext and downstream tasks. We implemented the DNNs using PyTorch, a deep learning library. We used a modified end-to-end (E2E) DNN architecture that demonstrated comparable results in the ASVspoof 2019 PA condition. We used only the CNN without the gated recurrent unit (GRU) layers to enable a simple comparison between the conventional replay spoofing detection scheme and the proposed self-supervised scheme.
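For concreteness, the framing and pre-emphasis described above can be sketched as follows. The 16 kHz sampling rate and the 0.97 pre-emphasis coefficient are common defaults assumed here; the paper does not state them.

```python
def preemphasize(x, coeff=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1]."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def num_frames(num_samples, sr=16000, win_ms=50, shift_ms=30):
    """Number of spectrogram frames for a 50 ms window with a 30 ms shift."""
    win = sr * win_ms // 1000      # 800 samples at 16 kHz
    shift = sr * shift_ms // 1000  # 480 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift
```

Under these assumptions, one second of audio yields 32 frames, so the 200- and 120-frame training crops correspond to roughly 6 s and 3.7 s of speech.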
The E2E DNN has the same residual blocks as in the study by He et al., which include convolutional layers and batch normalization; leaky rectified linear unit activations [19] are used after each batch normalization. Table 1 depicts the overall architecture used in this study. For all DNNs in these experiments, weight decay was applied, and training used the Adam optimizer. Learning rates of 0.001, 0.0005, and 0.0001 were used for both tasks.
|Layer|Config|Output shape|
|BN||(120, 1025, 16)|
|Res block|5|(15, 17, 128)|
|MaxPool|Pool|(1, 1, 128)|
|AvgPool|Pool|(1, 1, 128)|
5 Results and analysis
All results in our experiments are measured in terms of equal error rate (EER). Table 2 presents a comparison between the baseline and our proposed system, where VoxCeleb1 is used for pre-training. Because the batch size and learning rate become especially important when using a pre-trained model, we adjusted both. We observed that a smaller learning rate in the pretext task and a larger batch size led to improved performance.
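A minimal sketch of how an EER can be computed from per-utterance scores is shown below. This is a simple threshold sweep over the observed scores; production toolkits typically interpolate the DET curve instead, so treat this as an approximation for illustration.

```python
def compute_eer(bona_scores, spoof_scores):
    """Equal error rate: the operating point where the false-rejection rate on
    bona-fide trials equals the false-acceptance rate on spoofed trials
    (higher score = more bona-fide). Returns the EER as a fraction in [0, 1]."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)   # bona-fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

For example, perfectly separated score distributions give an EER of 0, while fully overlapping ones give 0.5 (i.e., 50%).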
Table 3 presents results using different datasets for training the pretext task, with the batch size of the downstream task fixed at 32. A larger number of utterances for training the pretext task led to improved performance. This suggests that learning from more varied acoustic configurations can improve replay spoofing detection, and that the proposed method may improve further with more diverse datasets. We also compared freezing points, where we fix the weights up to a certain layer and do not train them; however, freezing layers degraded performance in all cases. Therefore, the pre-trained parameters were used only for initialization.
|Pretext dataset (# of POIs)|LR (pretext)|LR (downstream)|Dev EER (%)|Eval EER (%)|
|Half of VoxCeleb1|1e-4|1e-4|3.22|6.94|
|VoxCeleb1|1e-4|1e-4|2.69|5.51|
|VoxCeleb2|1e-4|1e-4|2.74|4.66|
6 Conclusion
In this study, we proposed a pre-training scheme using self-supervised learning for replay attack spoofing detection. To overcome the limited availability of data for audio spoofing research, we hypothesized that training on acoustic configurations using audio datasets unrelated to replay spoofing could improve replay spoofing detection performance. We utilized self-supervised learning to train acoustic configurations. To the best of our knowledge, this is the first attempt to use self-supervised learning for spoofing detection. We achieved a relative error rate reduction of 27%, with EERs of 4.64% and 6.36% for the same DNN with and without the proposed pre-training scheme, respectively. In future work, we intend to improve performance by modifying the training methods with various model architectures.
References
-  Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, and Aleksandr Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  Tomi Kinnunen, Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong Aik Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech 2017, 2017, pp. 2–6.
-  Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi H. Kinnunen, and Kong Aik Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Proc. Interspeech 2019, 2019, pp. 1008–1012.
-  Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
-  Hye jin Shim, Jee weon Jung, Hee-Soo Heo, Sung-Hyun Yoon, and Ha-Jin Yu, “Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes,” in 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), 2018, pp. 172–176.
-  Hemant A Patil and Madhu R Kamble, “A survey on replay attack detection for automatic speaker verification (ASV) system,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1047–1053.
-  Carl Doersch, Abhinav Gupta, and Alexei A Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
-  Longlong Jing and Yingli Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” arXiv preprint arXiv:1902.06162, 2019.
-  Spyros Gidaris, Praveer Singh, and Nikos Komodakis, “Unsupervised representation learning by predicting image rotations,” ArXiv, vol. abs/1803.07728, 2018.
-  Richard Zhang, Phillip Isola, and Alexei A Efros, “Colorful image colorization,” in European conference on computer vision. Springer, 2016, pp. 649–666.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in European Conference on Computer Vision. Springer, 2016, pp. 527–544.
-  Themos Stafylakis, Johan Rohdin, Oldřich Plchot, Petr Mizera, and Lukáš Burget, “Self-Supervised Speaker Embeddings,” in Proc. Interspeech 2019, 2019, pp. 2863–2867.
-  Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, “Introduction to voice presentation attack detection and recent advances,” in Handbook of Biometric Anti-Spoofing, pp. 321–361. Springer, 2019.
-  Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017.
-  Jee weon Jung, Hye jin Shim, Hee-Soo Heo, and Ha-Jin Yu, “Replay Attack Detection with Complementary High-Resolution Information Using End-to-End DNN for the ASVspoof 2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1083–1087.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448–456.
-  Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, 2013.
-  Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.