
Self-supervised pre-training with acoustic configurations for replay spoofing detection

by   Hye-Jin Shim, et al.

Large datasets are well known to be key to the recent advances in deep learning. However, dataset construction, especially for replay spoofing detection, requires the physical process of playing an utterance and re-recording it, which hinders the construction of large-scale datasets. To compensate for the limited availability of replay spoofing datasets, in this study, we propose a method for pre-training acoustic configurations using external data unrelated to replay attacks. Here, acoustic configurations refer to variables present in the process of a voice being uttered by a speaker and recorded through a microphone. Specifically, we select pairs of audio segments and train the network to determine whether the acoustic configurations of two segments are identical. We conducted experiments using the ASVspoof 2019 physical access dataset, and the results revealed that our proposed method reduced the relative error rate by over 37%.



1 Introduction

Recent advances in audio spoofing techniques, such as voice conversion, speech synthesis, and replay attacks, threaten the reliability of automatic speaker verification (ASV) systems. To protect ASV systems against spoofing attacks, the ASVspoof initiative leads research on audio spoofing detection by collecting and providing a common dataset and holding competitions for vigorous research [1, 2, 3]. ASVspoof 2019 focuses on countermeasures for three major attack types, and comprises two data conditions: logical access (LA) and physical access (PA). LA covers voice conversion and speech synthesis, while PA covers replay attacks.

The PA task requires the physical process of recording and playback for data collection, whereas in the LA task spoofed data can be generated easily through a computerized process. This requirement makes the collection of data for the PA task difficult. The scale of datasets for replay spoofing detection is therefore limited compared to other domains, such as speaker recognition [4]. For example, the recently released ASVspoof 2019 dataset contains utterances from only 40 speakers. Thus, difficulties in collecting replay attack data hinder research on replay attack detection.

This study is mainly motivated by the work of Shim et al. [5], which demonstrated that information on replay configurations, including playback devices, the environment, and replay devices, can aid in detecting replay attacks. Because replay configurations are present in replayed speech, they may provide information that can differentiate between replayed speech and bona-fide speech. Inspired by this concept, we assume that learning overall acoustic configurations can help a model generalize to unseen acoustic configurations. Because the size of replay spoofing data is limited, generalization to unseen conditions, especially channel mismatch conditions [6], is an important problem in replay spoofing detection. However, only labeled data can be used to learn acoustic configurations in conventional studies.

Figure 1: Illustration of the proposed self-supervised learning scheme for pre-training acoustic configuration. The proposed method compares two audio segments to determine whether they have identical acoustic configurations. We assumed that two audio segments have an identical acoustic configuration if they are extracted from the same utterance, and have different acoustic configurations if they are extracted from different utterances even if they originate from the same speaker.

To overcome this limitation, we propose a novel scheme that utilizes self-supervised learning [7]. The proposed scheme trains on various audio segments from YouTube source data without explicit supervision of acoustic configurations for the PA task. Acoustic configurations refer to the diverse variables that are present in the process of a voice being uttered by a speaker and recorded through a microphone. For example, microphone type, location, and noise level can be included in the acoustic configuration. Self-supervised learning is a method that trains deep neural networks (DNNs) with automatically generated labels instead of human-annotated labels. Naturally available content, such as relevant context, correlations, and embedded metadata, can be used to automatically generate labels. Self-supervised learning makes it possible to use external data that are unrelated to the PA task, and helps avoid overfitting to the limited PA dataset.

Specifically, we define learning acoustic configurations as a binary classification task trained with self-supervised learning. We compare two segments of audio and determine whether they come from a single utterance, which lets the network learn acoustic configurations without any additional context. We hypothesize that two audio segments have the same acoustic configurations if they are extracted from the same utterance, and have different acoustic configurations if they are extracted from different utterances, even if they originate from a single speaker. To the best of our knowledge, this is the first approach using self-supervised learning for spoofing detection. Using the proposed approach, speech data published for various purposes can be used to develop a spoofing detection system.

2 Self-supervised learning

The performance of DNNs is affected by the amount of training data as well as the network capabilities. In the image domain, pre-training with large-scale datasets, such as ImageNet, and fine-tuning for other tasks is a commonly used approach. The benefit of this approach is that a wide variety of data can help initialize parameters and enable fast convergence. Furthermore, when there are few data for the target task, learning about other tasks can help reduce overfitting problems [8]. Although large-scale datasets with labels lead to higher performance, the cost of collecting and annotating large-scale datasets is enormous. Therefore, research on utilizing unlabeled data has been actively conducted in recent years, and self-supervised learning is among the solutions.

Self-supervised learning is a method that trains DNNs with naturally available relevant context from data without human-annotated labels [7]. Its advantage is that unlabeled data can be used to derive supervisory signals for representation learning. Self-supervised learning consists of training a pretext task and a downstream task, which can be thought of as pre-training and main training, respectively. The pretext task is trained to generate useful feature representations that serve as prerequisite knowledge for the target task. After training the pretext task, the learned parameters are transferred to the downstream task for fine-tuning [8]. Self-supervised learning was first exploited for context prediction in the image domain [7], where the relative position of random pairs of patches within an image is predicted without information about the original image. Self-supervised learning is not limited to relative positioning and can be used in various applications, such as image transformation [9] and colorization [10] in images, and predicting sequence order [11] in videos. Self-supervised learning has also been exploited in the audio domain to extract speaker embeddings [12].

3 Proposed method

We propose a two-phase framework for replay spoofing detection that first trains a DNN to learn acoustic configurations using self-supervised learning, and then performs conventional training for replay spoofing detection. The underlying hypothesis is that pre-training with acoustic configurations can improve the generalization of the DNN toward unseen acoustic configurations. Acoustic configurations include the environment in which the speaker speaks, microphone types, and the distance between the microphone and the speaker. However, it is difficult to obtain labels for such acoustic configurations. In the ASVspoof2017 dataset, each device is explicitly labeled; however, the ASVspoof2019 dataset is labeled into three levels of the device quality: upper, middle, and lower levels, thus making it difficult to use the available information.

Using self-supervised learning, detailed information, such as microphone types and environment, is not necessary. Figure 1 illustrates the proposed method of pre-training acoustic configurations using self-supervised learning. We pre-train the DNN using the utterances of various individuals, audio devices, and surroundings contained in the VoxCeleb datasets, which are sourced from YouTube. Although an unlabeled dataset is generally used for training the pretext task in self-supervised learning, we compose each pair within a single speaker to exclude the impact of speaker information. The speaker labels used for pair construction were automatically generated from YouTube data using the method described in [4]. This choice is based on a comparison of pairs composed within the same speaker and pairs composed across different speakers: composing pairs within the same speaker performed better, presumably because it excludes speaker information and lets the network focus on acoustic configurations.
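The pair construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sample_pair` and `crop`, the 50/50 ratio of same-utterance to different-utterance pairs, and the representation of an utterance as a list of frames are all hypothetical. Each speaker is assumed to have at least two utterances, each at least `segment_len` frames long.

```python
import random


def crop(frames, segment_len, rng):
    """Cut a random contiguous segment of segment_len frames from an utterance."""
    start = rng.randrange(len(frames) - segment_len + 1)
    return frames[start:start + segment_len]


def sample_pair(utterances_by_speaker, segment_len, rng):
    """Return (segment_a, segment_b, label) for one within-speaker pair.

    label is 1 when both segments come from the same utterance
    (assumed identical acoustic configuration) and -1 when they come
    from two different utterances of the same speaker.
    """
    speaker = rng.choice(sorted(utterances_by_speaker))
    utterances = utterances_by_speaker[speaker]
    same = rng.random() < 0.5  # assumed 50/50 balance, not from the paper
    if same:
        utt_a = utt_b = rng.choice(utterances)
    else:
        utt_a, utt_b = rng.sample(utterances, 2)  # two distinct utterances
    label = 1 if same else -1
    return crop(utt_a, segment_len, rng), crop(utt_b, segment_len, rng), label
```

Because both segments always come from one speaker, the label cannot be predicted from speaker identity alone, which matches the motivation given above for within-speaker pairs.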

We used cosine similarity to compare two segments. The loss function is presented in Equation (1), where $x_1$ and $x_2$ are embeddings from randomly selected audio segments, $\cos(x_1, x_2)$ is the cosine similarity between two vectors, and $y$ is the corresponding label of the pair $(x_1, x_2)$. In particular, if two segments are in the same utterance, the label is set to 1; otherwise, the label is set to -1.
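As an illustration of a cosine-similarity loss with labels of 1 and -1, the sketch below implements the standard cosine embedding loss: 1 - cos(x1, x2) for a pair labeled 1, and max(0, cos(x1, x2) - margin) for a pair labeled -1. That the paper's Equation (1) takes exactly this form is an assumption based on the label convention; the `margin` parameter and the function names are hypothetical.

```python
import math


def cosine_similarity(x1, x2):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


def cosine_pair_loss(x1, x2, y, margin=0.0):
    """Cosine embedding loss for one pair (assumed form, see lead-in).

    y == 1  : segments from the same utterance, pull embeddings together.
    y == -1 : segments from different utterances, push similarity below margin.
    """
    cos = cosine_similarity(x1, x2)
    if y == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)
```

With this form, a same-utterance pair with identical embeddings has zero loss, while a different-utterance pair is penalized only when its embeddings remain positively correlated beyond the margin.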


An outline of the proposed system is provided in Figure 2. First, the DNN is trained for predicting acoustic configurations in Phase 1. Pairs of segments are fed to the DNN and the cosine distance is used to determine whether two segments are similar or dissimilar. Here, similar signifies that two segments are from a single utterance, while dissimilar signifies that two segments are from different utterances. In Figure 2, a pair is composed of segments with a solid line and segments with a dashed line. After the pretext task is trained, we train the DNN for replay detection, which is the ultimate goal of the downstream task in Phase 2. Replay detection is a binary classification task with two categories: bona-fide and spoofed (replayed). We initialize the DNN using pre-trained weights, and freeze the weights of some layers and compare the effects in the experiments. We follow a general scheme of training spoofing detection using DNN architecture to determine whether the input is spoofed [13].

Figure 2: Pipeline of the proposed system. There are two phases: self-supervised pretext task training of acoustic configurations, and supervised downstream task training of replay spoofing detection. After the pretext task is trained, knowledge is transferred to the downstream task through the parameters learned from the pretext task.

4 Experimental settings

4.1 Dataset

We used the VoxCeleb1 and VoxCeleb2 datasets for the proposed pre-training approach [4, 14]. Both datasets consist of utterances cropped from YouTube videos and were originally designed for text-independent speaker verification. VoxCeleb1 consists of 153,516 utterances of 1,251 speakers from 22,496 video clips, while VoxCeleb2 consists of 1,128,246 utterances of 6,112 speakers from 150,480 video clips. We used the entire VoxCeleb1 and VoxCeleb2 datasets for pretext task training; in addition, we experimented with only half of the VoxCeleb1 dataset to compare the effect of the number of utterances on performance.

For detecting replay spoofing as a downstream task, we used the ASVspoof 2019 PA dataset [3]. This dataset consisted of 54,000 utterances (5,400 bona-fide, 48,600 spoofed) for training, 29,700 utterances (5,400 bona-fide, 24,300 spoofed) for development, and 137,457 utterances (18,090 bona-fide, 119,367 spoofed) for evaluation. Data for training and development were created under 27 different acoustic configurations, with combinations of three room sizes, three levels of reverberation, and three speaker-to-ASV microphone distances. For replayed utterances, nine different replay configurations were created using combinations of three attack-to-speaker recording distances and three loudspeaker quality levels.
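The counts quoted above can be checked with a short sketch. The level names in `LEVELS` are hypothetical placeholders for the three values of each factor; only the numbers of factors and levels, and the utterance counts, come from the text.

```python
from itertools import product

LEVELS = ("low", "mid", "high")  # placeholder names for the three levels of each factor

# Room size x reverberation x speaker-to-ASV microphone distance.
acoustic_configs = list(product(LEVELS, repeat=3))
# Attack-to-speaker recording distance x loudspeaker quality.
replay_configs = list(product(LEVELS, repeat=2))

# (bona-fide, spoofed) utterance counts per partition, as stated in the text.
partitions = {
    "train": (5400, 48600),
    "dev": (5400, 24300),
    "eval": (18090, 119367),
}
totals = {name: bona + spoof for name, (bona, spoof) in partitions.items()}

print(len(acoustic_configs), len(replay_configs))
print(totals)
```

The training partition holds nine spoofed utterances for every bona-fide one (48,600 / 5,400), which is consistent with one replayed copy per replay configuration.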

4.2 Experimental configuration

Magnitude spectrograms of 2,048 points were extracted for all datasets, including VoxCeleb1, VoxCeleb2, and the ASVspoof 2019 PA dataset. The window length and shift size were 50 ms and 30 ms, respectively. We applied pre-emphasis before extracting spectrograms and did not apply any normalization to the acoustic features. In the test phase of both the pretext task and the downstream task, utterances of varying duration were used; in the training phases, utterances were cropped to lengths of 200 and 120 frames for the pretext task and downstream task, respectively. We compared batch sizes of 16 and 32 in the pretext and downstream tasks. We implemented the DNNs using PyTorch, a deep learning library [15]. We used a modified end-to-end (E2E) DNN architecture that demonstrated comparable results in the ASVspoof 2019 PA condition [16]. We used only a CNN, without gated recurrent unit (GRU) layers, to perform a simple comparison between the conventional replay spoofing detection scheme and the proposed self-supervised scheme.
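To make the feature dimensions concrete, the sketch below derives the spectrogram shape from the stated settings and shows a first-order pre-emphasis filter. The 16 kHz sampling rate, the absence of padding, and the pre-emphasis coefficient of 0.97 are assumptions, since the text does not state them; the function names are hypothetical.

```python
def stft_shape(num_samples, sample_rate=16000, n_fft=2048, win_ms=50, shift_ms=30):
    """Return (frames, frequency_bins) for a magnitude spectrogram without padding."""
    win = sample_rate * win_ms // 1000    # 800 samples at 16 kHz
    hop = sample_rate * shift_ms // 1000  # 480 samples at 16 kHz
    frames = 0 if num_samples < win else 1 + (num_samples - win) // hop
    bins = n_fft // 2 + 1                 # one-sided spectrum of a 2,048-point FFT
    return frames, bins


def pre_emphasis(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))]


def samples_for_frames(frames, sample_rate=16000, win_ms=50, shift_ms=30):
    """Number of samples needed to produce exactly `frames` frames."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * shift_ms // 1000
    return win + (frames - 1) * hop
```

Under these assumptions, a 2,048-point FFT gives 1,025 frequency bins, which agrees with the frequency dimension in Table 1, and crops of 120 and 200 frames correspond to roughly 3.6 s and 6.0 s of audio.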

The E2E DNN has the same residual blocks as in the study by He et al. [17], which include convolutional layers and batch normalization [18] layers. Leaky rectified linear unit (LReLU) activation functions [19] are used after each batch normalization. Table 1 depicts the overall architecture used in this study. All DNNs in these experiments were trained with weight decay and the Adam [20] optimizer. Learning rates of 0.001, 0.0005, and 0.0001 were used for both tasks.

Layer      Type      Output shape
Conv1      Conv, BN  (120, 1025, 16)
Res block  5         (15, 17, 128)
MaxPool    Pool      (1, 1, 128)
AvgPool    Pool      (1, 1, 128)
Dense      FC        (64,)
Output     FC        (2,)

Table 1: DNN architecture. The numbers in the Output shape column refer to the frame (time), frequency, and number of filters. Conv, BN, and FC indicate convolutional layer, batch normalization, and fully-connected layer, respectively.

5 Results and analysis

All results in our experiments are measured with equal error rates (EERs). Table 2 presents a comparison between the baseline and our proposed system where VoxCeleb1 is used for pre-training. Because the batch size and learning rate are particularly important when using a pre-trained model, we adjusted the learning rates and batch sizes. We observed that a smaller learning rate in the pretext task and a larger batch size led to improved performance.
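For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed utterances accepted) equals the false rejection rate (bona-fide utterances rejected). The sketch below is a simple threshold sweep, assuming higher scores mean "more likely bona-fide"; it is not the official ASVspoof scoring tool, and the function name is hypothetical.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned
    value is the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects every trial
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

On large score sets, an interpolated ROC-based computation would be more precise; the sweep above is only meant to show what the reported numbers measure.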

Table 3 presents results using different datasets for training the pretext task when the batch size of the downstream task was fixed to 32. A larger number of utterances for training the pretext task led to improved performance. This suggests that learning with various acoustic configurations can lead to improved replay spoofing detection performance. Therefore, the proposed method may improve spoofing detection performance further by using more diverse datasets. We also compared freezing points, where the weights up to a certain layer are fixed and not trained. However, freezing layers degraded the performance in all cases. Therefore, pre-trained parameters were used only for initialization purposes.

                               batch size 32    batch size 16
system     pre_lr   main_lr    val     eval     val     eval
baseline   -        1e-4       2.91    6.89     3.22    6.60
                    5e-4       4.24    6.88     4.29    6.36
VoxCeleb1  1e-4     1e-4       2.69    5.51     2.75    5.38
                    5e-4       2.50    5.15     3.44    5.85
           5e-4     1e-4       4.90    6.00     5.87    7.0
                    5e-4       3.83    5.18     6.68    7.79
           1e-3     1e-4       6.27    9.38     5.87    8.85
                    5e-4       4.98    7.24     6.75    7.41

Table 2: Comparison results of baseline and VoxCeleb1 with batch sizes of 16 and 32. Pre_lr and main_lr refer to the learning rates of the pretext task and downstream task, respectively, while batch size refers to the batch size of the downstream task when the batch size of the pretext task was fixed to 16. Val and eval refer to the performance on the validation set and evaluation set, respectively.
Dataset (# of POIs)   pre_lr   main_lr   val    eval
Half of VoxCeleb1     1e-4     1e-4      3.22   6.94
                               5e-4      4.03   6.66
VoxCeleb1 (1,251)     1e-4     1e-4      2.69   5.51
                               5e-4      2.5    5.15
VoxCeleb2 (6,112)     1e-4     1e-4      2.74   4.66
                               5e-4      2.44   4.64

Table 3: Performance comparison between experiments using different datasets for training the pretext task. When predicting acoustic configurations, each pair is constructed within a single speaker to exclude the speaker's information, and half of VoxCeleb1 is composed of only half of the speakers of VoxCeleb1. POI: Person of Interest.

6 Conclusion

In this study, we proposed a pre-training scheme using self-supervised learning for replay attack spoofing detection. To overcome the limited availability of data for audio spoofing research, we hypothesized that training for acoustic configurations using audio datasets unrelated to replay spoofing could improve the performance of replay spoofing detection. We utilized self-supervised learning for training acoustic configurations. To the best of our knowledge, this is the first attempt to use self-supervised learning for spoofing detection. We achieved a relative error rate reduction of 37% with an EER of 4.64% and 6.36% for the same DNN with and without the proposed pre-training scheme, respectively. In future work, we intend to improve performance by modifying the training methods with various model architectures.


References

  • [1] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, and Aleksandr Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [2] Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong Aik Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech 2017, 2017, pp. 2–6.
  • [3] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi H. Kinnunen, and Kong Aik Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Proc. Interspeech 2019, 2019, pp. 1008–1012.
  • [4] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
  • [5] Hye jin Shim, Jee weon Jung, Hee-Soo Heo, Sung-Hyun Yoon, and Ha-Jin Yu, “Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes,” in 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), 2018, pp. 172–176.
  • [6] Hemant A Patil and Madhu R Kamble, “A survey on replay attack detection for automatic speaker verification (ASV) system,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1047–1053.
  • [7] Carl Doersch, Abhinav Gupta, and Alexei A Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
  • [8] Longlong Jing and Yingli Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” arXiv preprint arXiv:1902.06162, 2019.
  • [9] Spyros Gidaris, Praveer Singh, and Nikos Komodakis, “Unsupervised representation learning by predicting image rotations,” ArXiv, vol. abs/1803.07728, 2018.
  • [10] Richard Zhang, Phillip Isola, and Alexei A Efros, “Colorful image colorization,” in European conference on computer vision. Springer, 2016, pp. 649–666.
  • [11] Ishan Misra, C Lawrence Zitnick, and Martial Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in European Conference on Computer Vision. Springer, 2016, pp. 527–544.
  • [12] Themos Stafylakis, Johan Rohdin, Oldřich Plchot, Petr Mizera, and Lukáš Burget, “Self-Supervised Speaker Embeddings,” in Proc. Interspeech 2019, 2019, pp. 2863–2867.
  • [13] Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, “Introduction to voice presentation attack detection and recent advances,” in Handbook of Biometric Anti-Spoofing, pp. 321–361. Springer, 2019.
  • [14] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
  • [15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017.
  • [16] Jee weon Jung, Hye jin Shim, Hee-Soo Heo, and Ha-Jin Yu, “Replay Attack Detection with Complementary High-Resolution Information Using End-to-End DNN for the ASVspoof 2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1083–1087.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
  • [18] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, 2015, ICML’15, pp. 448–456.
  • [19] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, 2013.
  • [20] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.