In recent years, advances in automatic speaker recognition have made audio spoofing attacks possible and created a need for countermeasures to prevent such attacks [22, 26, 2, 3, 24, 31, 23, 13, 4]. The ASVspoof challenges provide a common platform for researchers to verify and compare various approaches [30, 10, 25]. Among these challenges, the 2019 challenge provides two scenarios, with controlled configurations for research, for countermeasures against audio spoofing attacks: the logical access and physical access (PA) scenarios. In this study, we concentrate on the physical access scenario, which is also referred to as replay attack detection. The task is a binary classification problem in which an input audio is classified as either bona-fide (also called genuine) or spoofed (replayed).
Recently, deep neural networks (DNNs) have displayed promising performance across a range of academic and industrial tasks, including those in the image and audio domains [32, 5]. One of the research objectives of DNN-related studies is to train a DNN as a feature extractor, where an N-dimensional representation vector from the last hidden layer is used as the feature for the target task (e.g. d-vector/x-vector speaker embeddings for speaker recognition; referred to as ‘code’ throughout this paper) [27, 22, 21, 12]. Regarding this research objective, a number of studies have devised techniques to either subtract certain information from, or incorporate it into, the representation vector [7, 1]. For example, Heo et al. [7] devised an adversarial scheme called the cosine adversarial network (CAN), which trains the code to become orthogonal to the basis vectors of subsidiary information that is known to be an obstacle for the target task. As another example, Chen et al. [16] used a multi-task learning (MTL) framework [1] to incorporate phoneme information into the code for the speaker verification task, under the hypothesis that a code which also includes phoneme information would better represent the input utterance.
In this study, we further analyze the role of various categories of subsidiary information in replay attack spoofing detection. Specifically, we determine whether each category of subsidiary information is beneficial for spoofing detection by observing the loss and equal error rates (EERs) when each category of subsidiary information is either removed or added. For this goal, we use two frameworks introduced in the previous paragraph: CAN for subtracting and MTL for adding various categories of subsidiary information. We analyze all five categories of subsidiary information (‘Room Size’, ‘Reverberation’, ‘Speaker-to-ASV distance’, ‘Attacker-to-Speaker distance’, and ‘Replay Device Quality’) provided in the ASVspoof 2019 PA scenario dataset.
Our study differs from conventional studies that deal with subsidiary information [20, 11, 15] in the following two ways. First, we analyzed the individual effect of each category of subsidiary information separately, whereas previous studies merely added combinations of different categories of subsidiary information to the code. Second, we analyzed in more detail whether each category of subsidiary information is required for replay spoofing detection, both by including it in the code and by attempting to exclude it from the code.
Through this analysis, we show that none of the five categories of subsidiary information resides sufficiently in the code of conventional systems: our experiments demonstrate that subsidiary information classification cannot be conducted using a code trained for replay attack detection as a binary classification task (bona-fide/spoofed). By including various categories of subsidiary information in the code, we obtained performance improvements in replay spoofing detection under the closed set condition.
The rest of this paper is organized as follows: Section 2 introduces the baseline end-to-end (E2E) DNN model used for replay attack spoofing detection. The two frameworks, for subtracting and adding a category of subsidiary information, are addressed in Section 3. The experiments and the analysis of the results are presented in Sections 4 and 5, respectively. The paper is concluded in Section 6.
| Layer | Input: (120, 1025) | Output shape |
| Conv1 | Conv(3,3,128) | (120, 1025, 16) |
| Res block | ×9 | (15, 17, 128) |
| AvgPool | Pool(1, 17) | (15, 128) |
Table 1: Architecture of the E2E DNN. The numbers in ‘Output shape’ refer to frame (time), frequency, and the number of filters, respectively. The frame length of 120 in ‘Input’ applies only to the training phase, for mini-batch construction; in the test phase, utterances of varying duration are input to the model. All convolutional layers have a filter length of 3 in the frame dimension and 7 in the frequency dimension. The three numbers inside the brackets of a convolutional layer refer to the stride size in the frame and frequency dimensions and the number of filters. Batch normalization and the first activation are omitted in the first residual block, following [6].
2 End-to-end DNN Baseline
2.1 Model architecture
An E2E DNN incorporates both feature extraction and classification into a single DNN to fully exploit the merit of a data-driven training scheme. In this study, we use the E2E DNN of [9], which showed promising performance at the ASVspoof 2019 PA challenge despite its simple processing pipeline (footnote 1: A GitHub link with the full code of our implementation of the model will be released after the anonymity period ends).
Table 1 describes the E2E DNN architecture used in this study. This network takes spectrograms as input and outputs the result of replay attack detection using an output layer with two nodes indicating bona-fide and spoofed, respectively. We used residual blocks that comprise convolutional layers, batch normalization layers [19], and LReLU activations [14], following the identity mapping [6] of He et al. The residual blocks are used to extract multiple segment-level embeddings from an input spectrogram. Then, the frequency axis is averaged using an average pooling layer. A gated recurrent unit (GRU) layer then extracts an utterance-level embedding, followed by a fully-connected layer (the code).
2.2 Ring loss
A number of loss functions are being studied to supplement the softmax output [29, 33]. Among these loss functions, Ring loss [33], which was first proposed for the face recognition task, has shown promising results. Ring loss normalizes the code (from the last hidden layer) toward a learnt norm value $R$ by penalizing how far the norm of the code is from the current $R$. Ring loss can be defined by:

$$L_R = \frac{\lambda}{2m} \sum_{i=1}^{m} \left( \lVert \mathcal{F}(x_i) \rVert_2 - R \right)^2 \qquad (1)$$

where $\lambda$ refers to the weight of the Ring loss (the weight of the CCE loss is assumed to be 1), $m$ refers to the mini-batch size, $R$ refers to the radius of the ring (norm value), and $\mathcal{F}(x_i)$ is the code of the $i$-th sample. In this study, we compare experimental results both with and without Ring loss (see Section 5).
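The Ring loss described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class name `RingLoss`, the argument names, and the initial radius of 1.0 are assumptions made for the example.

```python
import torch
import torch.nn as nn


class RingLoss(nn.Module):
    """Penalizes the squared gap between each code's L2 norm and a learnt radius."""

    def __init__(self, loss_weight: float = 1.0, init_radius: float = 1.0):
        super().__init__()
        self.loss_weight = loss_weight
        # The radius is a trainable scalar; its initial value here is an assumption.
        self.radius = nn.Parameter(torch.tensor(init_radius))

    def forward(self, code: torch.Tensor) -> torch.Tensor:
        # code: (batch, code_dim); one L2 norm per sample.
        norms = code.norm(p=2, dim=1)
        # weight / (2 * batch) * sum((norm - radius)^2), written with a mean.
        return self.loss_weight * 0.5 * ((norms - self.radius) ** 2).mean()


# Toy check: two codes with norms 5 and 1, radius 1.
codes = torch.tensor([[3.0, 4.0], [0.0, 1.0]])
ring = RingLoss(loss_weight=1.0, init_radius=1.0)
loss = ring(codes)
loss.backward()  # the radius receives a gradient, so it is learnt jointly
```

In training, this term would be added to the CCE loss of the output layer, so that the encoder and the radius are updated together.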
3 Frameworks for exploiting
In this section, we introduce two frameworks for subtracting and adding a certain category of subsidiary information: cosine adversarial network (CAN) [7] and multi-task learning (MTL) [1]. To introduce both frameworks, we reinterpret the E2E DNN as illustrated in Figure 1. From this point of view, the E2E DNN explained in the previous section is a combination of an ‘encoder’, which extracts the code from a spectrogram, and a ‘primary’ model, which conducts binary classification using the code.
3.1 Cosine Adversarial Network
The cosine adversarial network (CAN) was proposed to eliminate subsidiary information from the code extracted by an encoder [7]. Eliminating subsidiary information that is known to decrease the performance of the target task has been observed to improve performance. For example, removing channel information improved performance in speaker verification.
In this study, we use the CAN framework to examine whether various categories of subsidiary information are included in the code for replay spoofing attack detection when the model is trained for a binary classification task. Five categories of subsidiary information are analyzed using the metadata provided with the ASVspoof 2019 PA dataset: ‘Room Size’, ‘Reverberation’, ‘Speaker-to-Mic distance’, ‘Attacker-to-Speaker distance’, and ‘Replay Device Quality’.
The CAN framework is trained by repeating the three phases shown in Figure 2. In the first phase, the parameters of the subsidiary model are frozen, and the encoder and the primary model are trained. This procedure is equivalent to the training of the E2E DNN introduced in Section 2. In the second phase, the parameters of the encoder and the primary model are frozen, and the subsidiary model is trained. If the loss decreases well in this process, it can be interpreted that the subsidiary information actually exists in the current code; if the loss does not decrease well, the result corresponds to the first of the three cases described at the end of this subsection (subsidiary information does not reside in the code). In the third phase, the parameters of the primary model and the subsidiary model are frozen, and the encoder is trained to exclude subsidiary information through the adversarial process. Here, the encoder removes the subsidiary information by training the code to be orthogonal to the basis vectors of the subsidiary model trained in the previous phase. By repeating these three phases, CAN trains the DNN to perform the primary task while excluding subsidiary information.
We interpret the results of the CAN in three cases, as follows.
1. The encoder and the primary model are successfully trained, but the subsidiary loss does not decrease while the parameters of the encoder are frozen: subsidiary information does not reside in the code sufficiently to conduct subsidiary information classification when the DNN is trained as a binary classifier.
2. The CAN framework is successfully trained, but the performance of replay detection decreases as subsidiary information is excluded: subsidiary information is helpful for conducting replay detection.
3. The CAN framework is successfully trained, and performance increases: subsidiary information is an obstacle for replay attack detection.
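The three training phases described in this subsection can be sketched in PyTorch as below. This is a simplified illustration under stated assumptions, not the authors' code: the encoder is replaced by a toy linear layer, the subsidiary model is a single linear layer whose weight rows play the role of the basis vectors, each phase runs for one step (the paper uses a different phase ratio), and the helper `cosine_adversarial_loss` is a hypothetical name for a mean absolute cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

encoder = nn.Sequential(nn.Linear(20, 64), nn.LeakyReLU())  # toy stand-in for the encoder
primary = nn.Linear(64, 2)     # bona-fide / spoofed
subsidiary = nn.Linear(64, 3)  # three levels of one subsidiary category

opt_phase1 = torch.optim.Adam(
    list(encoder.parameters()) + list(primary.parameters()), lr=5e-4)
opt_phase2 = torch.optim.Adam(subsidiary.parameters(), lr=5e-4)
opt_phase3 = torch.optim.Adam(encoder.parameters(), lr=5e-4)


def cosine_adversarial_loss(code, basis):
    """Mean |cos| between every code and every basis vector (0 if all orthogonal)."""
    cos = F.normalize(code, dim=1) @ F.normalize(basis, dim=1).t()
    return cos.abs().mean()


def train_step(x, y_primary, y_subsidiary):
    # Phase 1: train encoder + primary model; the subsidiary model is untouched.
    opt_phase1.zero_grad()
    loss1 = F.cross_entropy(primary(encoder(x)), y_primary)
    loss1.backward()
    opt_phase1.step()

    # Phase 2: encoder frozen (no gradient flows into it); train the subsidiary model.
    opt_phase2.zero_grad()
    with torch.no_grad():
        code = encoder(x)
    loss2 = F.cross_entropy(subsidiary(code), y_subsidiary)
    loss2.backward()
    opt_phase2.step()

    # Phase 3: subsidiary model frozen; push the code toward orthogonality
    # with the subsidiary model's weight rows.
    opt_phase3.zero_grad()
    loss3 = cosine_adversarial_loss(encoder(x), subsidiary.weight.detach())
    loss3.backward()
    opt_phase3.step()
    return loss1.item(), loss2.item(), loss3.item()


x = torch.randn(8, 20)
y_pri = torch.randint(0, 2, (8,))
y_sub = torch.randint(0, 3, (8,))
losses = train_step(x, y_pri, y_sub)
```

Monitoring the phase-2 loss over repeated steps is what distinguishes the first case above from the other two: if it does not decrease while the encoder is frozen, the code carries too little subsidiary information to classify.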
3.2 Multi-task Learning Framework
The multi-task learning (MTL) framework was proposed by Caruana et al. to train a DNN that can conduct more than one task simultaneously [1]. In a number of studies, this framework has successfully incorporated additional information into the code by adding output layers. Training with more than a single task has also been reported to yield better generalization, which is interpreted as a result of the network not being overfitted to a single task. For example, Chen et al. reported better performance in speaker verification by explicitly training phoneme information together with the speaker information [16]. In replay attack detection, [20] showed that information regarding the subsidiary task can be included in the code and that this improves performance on the ASVspoof 2017 dataset. However, [20] used a combination of various categories of subsidiary information, which does not reveal the role of each category. In this study, we analyze the effect of adding each category of subsidiary information via the MTL framework.
The MTL framework is trained using the following equation:

$$L = \alpha L_{pri} + \beta L_{sub} \qquad (2)$$

where $L$ refers to the final loss, $L_{pri}$ and $L_{sub}$ refer to the categorical cross-entropy (CCE) losses of the primary and subsidiary tasks, and $\alpha$ and $\beta$ refer to their respective weights. Compared with the CAN framework, the MTL framework can be interpreted as training the encoder, the primary model, and the subsidiary model concurrently.
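The weighted sum of the two CCE losses can be sketched in PyTorch as follows. The names `MTLHeads` and `mtl_loss` are hypothetical, and the primary weight of 1.0 is an assumption for the example; the subsidiary weight of 0.5 is the value reported in the experimental settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTLHeads(nn.Module):
    """Primary and subsidiary output layers that share the same code."""

    def __init__(self, code_dim: int = 64, n_subsidiary: int = 3):
        super().__init__()
        self.primary = nn.Linear(code_dim, 2)              # bona-fide / spoofed
        self.subsidiary = nn.Linear(code_dim, n_subsidiary)

    def forward(self, code):
        return self.primary(code), self.subsidiary(code)


def mtl_loss(pri_logits, sub_logits, y_pri, y_sub, w_pri=1.0, w_sub=0.5):
    """Weighted sum of the primary and subsidiary cross-entropy losses."""
    return (w_pri * F.cross_entropy(pri_logits, y_pri)
            + w_sub * F.cross_entropy(sub_logits, y_sub))


heads = MTLHeads()
code = torch.randn(4, 64)  # stand-in for the encoder output
pri_logits, sub_logits = heads(code)
y_pri = torch.tensor([0, 1, 1, 0])
y_sub = torch.tensor([2, 0, 1, 1])
loss = mtl_loss(pri_logits, sub_logits, y_pri, y_sub)
loss.backward()  # one backward pass produces gradients for both heads
```

Because both losses are back-propagated in the same step, the encoder (omitted here) would receive gradients from the primary and the subsidiary task at once, in contrast to the alternating phases of CAN.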
3.3 Modified MTL Framework for Replay Spoofing Detection
When the MTL framework is utilized in replay attack spoofing detection to add subsidiary information, overlaps can occur between the tasks. For example, when ‘Replay Quality’ information is used and a bona-fide utterance is input, there is no label for the subsidiary task. In the authors’ previous study, this confusion between the primary task and the subsidiary task was further analyzed and a novel scheme was proposed (footnote 2: Referred to as the authors’ work for now; the papers will be cited after the anonymity period ends).
In the modified scheme, there exists only one output layer. The number of nodes of the output layer is equal to the number of labels defined for the subsidiary task plus one, where the additional node is for bona-fide input. In comparison experiments conducted in the authors’ previous study using the ASVspoof 2017 dataset [10], this scheme clearly outperformed the original MTL framework. Comparison experiments of this scheme with the MTL framework on the ASVspoof 2019 PA dataset are described in Table 4. Note that because the architecture used is E2E, we use the value of the bona-fide node directly as the score.
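The single-output-layer scheme can be illustrated with a small label-mapping sketch. The node ordering (bona-fide at index 0), the helper names, and the use of a softmax before reading the bona-fide node are assumptions made for this example; the text only states that the value of the bona-fide node is used as the score.

```python
import torch
import torch.nn.functional as F

BONA_FIDE = 0  # assumed index of the additional bona-fide node


def merged_label(is_bona_fide, subsidiary_label=None):
    """Map an utterance to one class index in the merged output layer.

    Bona-fide utterances have no subsidiary label and take index 0;
    spoofed utterances take 1 + their subsidiary label.
    """
    if is_bona_fide:
        return BONA_FIDE
    return 1 + subsidiary_label


def bona_fide_score(logits):
    """Use the (softmax-normalized) bona-fide node directly as the detection score."""
    return F.softmax(logits, dim=1)[:, BONA_FIDE]


# Three replay-quality levels plus one bona-fide node gives four output nodes.
labels = [merged_label(True), merged_label(False, 0), merged_label(False, 2)]
scores = bona_fide_score(torch.zeros(2, 4))
```

With this mapping, a bona-fide input never needs a placeholder subsidiary label, which is the overlap the modified scheme is meant to avoid.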
4 Experimental Settings
All experiments in this paper were conducted using PyTorch [17], a deep learning library in Python (footnote 3: A GitHub link to the full experimental code will be released after the anonymity period ends).
We used the ASVspoof 2019 PA dataset for all experiments [25]. Table 2 describes the subset configuration. This dataset comprises 20 speakers (8 male, 12 female), and all utterances are recorded at a 16-kHz sampling rate with 16-bit resolution. The training subset comprises 27 different acoustic configurations in total, which both bona-fide and replayed utterances share: a combination of three ‘Room Size’, three ‘Reverberation’, and three ‘Speaker-to-ASV Mic distance’ levels. Nine different replay configurations apply only to replayed utterances: a combination of three ‘Attacker-to-talker distance’ and three ‘Replay Device Quality’ levels. Note that the development set is a closed set configuration and the evaluation set is an open set configuration (e.g. utterances in the development set share the levels of ‘Reverberation’ with the training set, whereas the evaluation set includes unknown levels of ‘Reverberation’).
4.2 Acoustic Feature
Magnitude spectrograms with a 2048-point fast Fourier transform were used for all experiments in this study. The window length and the shift size were 50 ms and 30 ms, respectively, following [9]. Neither mean nor standard deviation normalization was applied. In the training phase, 120 randomly selected consecutive frames were used for mini-batch construction. In the test (inference) phase, all frames were used.
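The feature extraction above can be sketched with `torch.stft`. The Hann window, the centered framing, and the function names are assumptions for this example, since the text specifies only the FFT size, window length, and shift; at a 16-kHz sampling rate, 50 ms and 30 ms correspond to 800 and 480 samples.

```python
import torch

SAMPLE_RATE = 16_000
N_FFT = 2048
WIN_LENGTH = SAMPLE_RATE * 50 // 1000  # 50 ms -> 800 samples
HOP_LENGTH = SAMPLE_RATE * 30 // 1000  # 30 ms -> 480 samples


def magnitude_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Return a (frames, 1025) magnitude spectrogram of a 1-D waveform."""
    window = torch.hann_window(WIN_LENGTH)  # window type is an assumption
    spec = torch.stft(
        waveform,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=window,
        center=True,
        return_complex=True,
    )
    # No mean or variance normalization is applied.
    return spec.abs().transpose(0, 1)


def random_crop(spec: torch.Tensor, n_frames: int = 120) -> torch.Tensor:
    """Pick n_frames consecutive frames at a random offset (training only)."""
    if spec.size(0) < n_frames:
        raise ValueError("utterance is shorter than the crop length")
    start = torch.randint(0, spec.size(0) - n_frames + 1, (1,)).item()
    return spec[start:start + n_frames]


waveform = torch.randn(4 * SAMPLE_RATE)  # 4 s of noise as a stand-in utterance
full = magnitude_spectrogram(waveform)   # used as-is at test time
crop = random_crop(full)                 # (120, 1025) training input
```

The 1025 frequency bins follow from the 2048-point FFT (2048 / 2 + 1), which matches the input width listed for the model.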
4.3 Experimental Settings: Common for Baseline E2E DNN, CAN and MTL
The encoder consists of nine residual blocks, a GRU layer, and a code layer. The encoder is common to the E2E DNN, CAN, and MTL. Each residual block has a filter size of (3, 7) and a stride size of (2, 4), where the asymmetry was set considering the size of the input spectrogram ((120, 1025) for training). The GRU layer has 512 nodes. The code layer is a fully-connected layer with 64 nodes. We used LReLU [14] for all non-linearities.
We used the AMSGrad [18] optimizer with a learning rate of 0.0005. A weight decay of 0.0001 was applied to all parameters. The mini-batch size is 32. For models with Ring loss, the weights for the Ring loss and the CCE loss are the same.
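In PyTorch, one common way to obtain AMSGrad is the `amsgrad=True` flag of `torch.optim.Adam`; the sketch below shows the stated hyperparameters under that assumption. The placeholder `model` stands in for the full network, and whether the authors' weight decay matches Adam's built-in L2-style `weight_decay` is not stated in the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 2)  # placeholder for the encoder + primary model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,            # learning rate 0.0005
    weight_decay=1e-4,  # weight decay 0.0001 on all parameters
    amsgrad=True,       # AMSGrad variant of Adam
)
BATCH_SIZE = 32
```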
For each epoch, we constructed the training set from all 5400 bona-fide utterances and 5400 randomly selected spoofed utterances to balance the number of samples per class. In our internal comparison experiments, this method improved performance.
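The per-epoch balancing can be sketched in plain Python as below. The function name, the label convention (0 for bona-fide, 1 for spoofed), and the size of the spoofed pool in the usage lines are placeholders chosen for illustration.

```python
import random


def build_epoch(bona_fide, spoofed, rng):
    """Pair every bona-fide utterance with an equal-sized random spoofed subset."""
    chosen = rng.sample(spoofed, k=len(bona_fide))  # sampled without replacement
    epoch = [(utt, 0) for utt in bona_fide] + [(utt, 1) for utt in chosen]
    rng.shuffle(epoch)
    return epoch


rng = random.Random(0)
bona_fide_ids = [f"bona_{i}" for i in range(5400)]
spoofed_ids = [f"spoof_{i}" for i in range(20000)]  # placeholder pool size
epoch = build_epoch(bona_fide_ids, spoofed_ids, rng)
```

Calling `build_epoch` again at the start of each epoch draws a fresh spoofed subset, so over many epochs the model still sees most of the spoofed data.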
The primary model consists of only one output layer (fully-connected, 2 nodes indicating bona-fide/spoof) with no other hidden layers, for all E2E DNN, CAN, and MTL.
4.4 Experimental Settings: CAN
For experiments with the CAN framework, the subsidiary model consists of two fully-connected hidden layers with 128 nodes and an output layer with three nodes, following [7]. The relative proportions of the first, second, and third phases were set to 3:1:1, where the first phase refers to Figure 2-(a), which trains the encoder and the primary model (this is identical to the baseline E2E training).
4.5 Experimental Settings: MTL
For experiments with the MTL framework, the subsidiary model consists of an output layer that has three nodes, following [20]. The weight of the subsidiary loss in Eq. 2 was set to 0.5. When using replay-related subsidiary information (‘Attacker-to-Talker’ and ‘Replay Quality’), we added one node to the subsidiary model’s output layer for bona-fide input.
5 Results and analysis
5.1 Removing subsidiary information (CAN)
Experiments conducted on all five categories showed the same result: the loss of the subsidiary task did not decrease while the encoder was frozen (the first of the three cases described in Section 3.1). We interpret this result to mean that the subsidiary information does not naturally reside in the code sufficiently to conduct subsidiary information classification when the DNN is trained using a binary classification scheme (bona-fide/replayed). In other words, a spoofing detector trained with a simple binary classification scheme could not utilize the subsidiary information. Based on this interpretation, the results in [20, 11], which reported performance improvements through multi-task learning, suggest that adding subsidiary information can enhance the discriminative power of the code.
We omit the table for the performance of the CAN framework because we concluded that the subsidiary information, which the CAN framework aims to remove, does not reside in the code sufficiently to conduct subsidiary information classification, making the EER meaningless. Note that for all five categories of subsidiary information, removing subsidiary information with the CAN framework worsened the EER of the primary task (replay attack detection). Figure 3 depicts an example of the losses when training CAN to remove a category of subsidiary information from the code. During the first phase (left part of the figure), the encoder and the primary model are successfully trained. However, after the encoder is frozen, the loss of the subsidiary model does not decrease. In additional experiments where the encoder was not frozen when training the subsidiary model, the loss of the subsidiary task decreased successfully, but we did not achieve improvements in terms of EER.
5.2 Adding subsidiary information (MTL)
Tables 3 and 4 describe the results of applying the MTL framework and the modified MTL framework, respectively. Performance on the development set (closed set configuration) improved for all five categories of subsidiary information, with the relative error rate reduction (RER) exceeding 30 % in the best cases (‘Reverberation’ in Table 3, ‘Attacker-to-Talker’ in Table 4). However, on the evaluation set (open set configuration), we gained only a minor performance improvement of RER 3 %, and only with ‘Reverberation’ and ‘Replay Quality’.
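For reference, the RER is conventionally computed as the drop in error rate relative to the baseline. The sketch below assumes that standard definition; the function name is hypothetical and the numbers in the usage line are made-up values for illustration, not results from this paper.

```python
def relative_error_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative error rate reduction in percent; positive means the new system is better."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Illustrative values only: an EER falling from 10.0 % to 7.0 % is a 30 % RER.
example_rer = relative_error_reduction(10.0, 7.0)
```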
We attribute the performance difference between the development set and the evaluation set mainly to two reasons. First, the different configurations of the closed set and the open set brought about performance differences. The training and development sets share common configurations for each category of subsidiary information. However, although five categories of subsidiary information are used, the evaluation set comprises open set replay configurations. Second, in each category of subsidiary information, the sub-category labels are too few and too ambiguous. For example, for ‘Replay Quality’ in the ASVspoof 2019 PA dataset, there exist only three different labels: ‘perfect’, ‘high’, and ‘low’. We conclude that three sub-categories with rather ambiguous labels are too few to generalize toward unknown open set configurations. Given this conclusion, we consider that the successful results with the MTL framework on the ASVspoof 2017 evaluation dataset (open set) occurred owing to its actual replay device labels.
The modified MTL framework also demonstrated performance improvement only for the closed set configuration (development set). In the results of this study, using the MTL framework to add various categories of subsidiary information did not show significant improvement on the evaluation set. However, the significant improvement on the development set still demonstrates that, when detailed subsidiary information labels are provided, adding this information can effectively improve the performance of replay attack detection, and that better generalization to open set configurations can be expected when a wide range of specific subsidiary labels is used. From the experimental results, we conclude that various categories of subsidiary information can effectively aid replay attack detection, but that the process needs a wide range of specific labels for each category of subsidiary information.
6 Conclusion
In this study, we analyzed what sort of information is needed to conduct replay attack spoofing detection. For this purpose, we employed an E2E DNN and analyzed what information is included in the code. Two frameworks, CAN and MTL, were utilized to subtract from the code and add to the code, respectively, a certain category of subsidiary information provided in the ASVspoof 2019 PA dataset. Surprisingly, for all five categories of subsidiary information, the code extracted by the frozen encoder did not contain enough relevant information to train the subsidiary model when the encoder had been trained using a binary classification scheme (bona-fide/replayed). By adding various categories of subsidiary information to the code, performance improvement was measured in the closed set configuration, with an RER over 30 % for ‘Reverberation’ and ‘Attacker-to-Talker’. We interpret these two results as follows: the simple binary classification scheme is not appropriate for utilizing subsidiary information that is helpful for replay spoofing attack detection. However, in the open set configuration, only minor performance improvements were observed. In our analysis, the minor performance improvements in the open set configuration are due to the ambiguity and small number of labels for each category of subsidiary information. As future work, we intend to study different frameworks for including various categories of subsidiary information in the code to further benefit from additional information.
References
[1] (1998) Multitask learning: a knowledge-based source of inductive bias. Learning to Learn, Springer, US.
[2] (2017) End-to-end spoofing detection with raw waveform CLDNNs. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4860–4864.
[3] (2017) Experimental analysis of features for replay attack detection: results on the ASVspoof 2017 challenge. In Interspeech, pp. 7–11.
[4] (2018) Detection of replay-spoofing attacks using frequency modulation features. In Proc. Interspeech 2018, pp. 636–640.
[5] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[6] (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
[7] (2019) Cosine similarity-based adversarial process. arXiv preprint arXiv:1907.00542.
[8] (2018) A complete end-to-end speaker verification system using deep neural networks: from raw signals to verification result. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5349–5353.
[9] (2019) Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 challenge. In Interspeech (to appear).
[10] (2017) The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection.
[11] (2017) Audio replay attack detection with deep learning frameworks. In Interspeech, pp. 82–86.
[12] (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304.
[13] (2018) Multiple phase information combination for replay attacks detection. In Proc. Interspeech 2018, pp. 656–660.
[14] (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
[15] (2017) Replay attack detection using DNN for channel discrimination. In Interspeech, pp. 97–101.
[16] (2015) Multi-task learning for text-dependent speaker verification. In Interspeech.
[17] (2017) Automatic differentiation in PyTorch. NIPS-W.
[18] (2018) On the convergence of Adam and beyond.
[19] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
[20] (2018) Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes. In 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pp. 172–176.
[21] (2018) Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model. arXiv preprint arXiv:1809.04437.
[22] (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
[23] (2018) Relative phase shift features for replay spoof detection system. SLTU.
[24] (2017) Independent modelling of high and low energy speech frames for spoofing detection. In Interspeech, pp. 2606–2610.
[25] (2019) ASVspoof 2019: future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441.
[26] (2018) End-to-end audio replay attack detection using deep convolutional networks with attention. In Interspeech, Hyderabad, India.
[27] (2014) Deep neural networks for small footprint text-dependent speaker verification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056.
[28] (2017) Generalized end-to-end loss for speaker verification. arXiv preprint arXiv:1710.10467.
[29] (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
[30] (2015) ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In Sixteenth Annual Conference of the International Speech Communication Association.
[31] (2015) Spoofing speech detection using high dimensional magnitude and phase features: the NTU approach for ASVspoof 2015 challenge. In Sixteenth Annual Conference of the International Speech Communication Association.
[32] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
[33] (2018) Ring loss: convex feature normalization for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5089–5097.