Recent advances in deep neural networks (DNNs) have improved the performance of speaker verification (SV) systems, including in short-duration and far-field scenarios [bhattacharya2019deep, jung2019rawnet, 9053871, jin2007far, 9004029]. However, SV systems are known to be vulnerable to spoofing attacks such as replay attacks, voice conversion, and speech synthesis. These vulnerabilities have inspired research into presentation attack detection (PAD), which classifies given utterances as spoofed or bona fide [wu2015asvspoof, kinnunen2017asvspoof, todisco2019asvspoof]; notably, many DNN-based systems have achieved promising results [lai2019assert, jung2019replay, lavrentyeva2019stc].
Table 1 demonstrates the vulnerability of conventional SV systems when faced with replay attacks. The performance is reported using the three types of equal error rates (EERs) described in Table 2 [todisco2018integrated]. Table 2 shows the target and non-target trials used to calculate each EER, denoted by 1 and 0, respectively. Zero-effort (ZE)-EER describes conventional SV performance without the presence of replay attacks. PAD-EER denotes the errors of replay attack detection. Integrated (Int)-EER describes overall performance, including both ZE and replayed non-target trials. Hereafter, we refer to “replay spoofing-aware SV” as the integrated speaker verification (ISV) task and report its performance using Int-EER. Results show that the EER degrades to 33.72% with replayed utterances; this fatal performance degradation supports the necessity of a spoofing-aware ISV system. In this paper, PAD covers replay attacks only, because the official integrated trials of PAD and ASV are provided only for ASVspoof2017, which contains replay attacks exclusively.
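For reference, each of the three EERs can be computed from the score lists of a trial partition: target vs. ZE non-target for ZE-EER, target vs. replayed for PAD-EER, and target vs. both pooled for Int-EER. The following is a minimal numpy sketch of an EER computation, not the official evaluation tooling:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection rate
    of target trials equals the false-acceptance rate of non-target trials."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)      # sweep the threshold in ascending order
    labels = labels[order]
    # after passing each sorted score, targets below it are rejected and
    # non-targets above it are accepted
    frr = np.cumsum(labels) / len(target_scores)
    far = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)
    i = np.argmin(np.abs(frr - far))
    return float((frr[i] + far[i]) / 2.0)
```

With perfectly separated scores the EER is zero; overlapping score distributions (as with replayed non-targets) push it upward.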
While a number of studies have worked to develop independent systems for SV and PAD, few have sought to integrate the SV and PAD systems [sahidullah2016integrated, todisco2018integrated, sizov2015joint, dhanush2017factor, li2019multi, li2020joint]. More specifically, this handful of studies proposed approaches such as cascaded, parallel [sahidullah2016integrated, todisco2018integrated], and joint systems [sizov2015joint, li2019multi, li2020joint]. Most existing studies use common features to integrate the two tasks for system efficiency. Section 2 further takes up this existing body of work.
In this study, we propose two spoofing-aware frameworks for the ISV task, illustrated in Figure 1. The first proposed framework expands existing work by proposing a monolithic end-to-end (E2E) architecture. More specifically, it conducts speaker identification (SID) and PAD to train a common feature using multi-task learning (MTL) [MultitaskLearning]. Concurrently, it uses the embeddings to compose trials and conduct the ISV task. Using the sum of the SID, PAD, and ISV losses, the entire DNN is jointly optimized. However, based on tendencies observed during internal experiments, we hypothesize that training a common feature for the ISV task may not be ideal because the properties required for each task differ: the PAD task representation uses device and channel information, while the SV task needs to remove it (this is further discussed in Section 3).
Based on our hypothesis, we propose a novel modular approach using a separate DNN.
This approach takes two speaker embeddings (one for enrollment, one for test) and a PAD prediction as input to make the ISV decision. It adopts a two-phase approach: in the first phase, the speaker identifier and PAD system are trained separately; in the second phase, speaker embeddings are extracted from the pretrained speaker identifier [d-vector], and these embeddings and the PAD prediction results are fed to a separate DNN module. Using this framework, we achieved a 21.77% relative improvement in terms of Int-EER (trials available at https://www.asvspoof.org/index2017.html).
The contributions of this paper are:
- Propose a novel E2E framework which jointly optimizes the SID, PAD, and ISV tasks
- Experimentally validate the hypothesis that the discriminative information required for the SV and PAD tasks may be distinct, requiring separate front-end modeling
- Propose a separate modular back-end DNN which takes speaker embeddings and PAD predictions as input to make the ISV decision
The remainder of the paper is organized as follows. Section 2 details related work on integrating ASV and PAD. Section 3 introduces the two proposed frameworks. Section 4 presents our experiments and results, and Section 5 concludes the paper.
2 Related work
In this section, we introduce the two studies most relevant to ours [todisco2018integrated, li2019multi, li2020joint]. First, Todisco et al. [todisco2018integrated] proposed separate modelling with two Gaussian back-end systems and a unified threshold for the SV and PAD tasks. That study explored various acoustic features to find which ones best suited both tasks simultaneously. Because its authors are organizers of the ASVspoof challenges, the study also released official trials for the ISV task. For our purposes, it is important to highlight that these trials include both ZE and replayed non-target trials, which we use throughout this paper. However, Todisco et al. [todisco2018integrated] reported the average of two EERs, ZE-EER and PAD-EER, because they separately modeled two Gaussian mixture models, one per task.
Meanwhile, Li et al. [li2019multi, li2020joint] extended Todisco's work [todisco2018integrated] by proposing an integrated ISV system and were the first to report an Int-EER. More specifically, they proposed a three-phase training framework for extracting an embedding for the ISV task, followed by a probabilistic linear discriminant analysis (PLDA) back-end. In the first phase, MTL [MultitaskLearning] was employed to train a common embedding for both SV and PAD tasks. In the second and third phases, the embedding was adapted to fit the ISV task. However, because the DNN was adapted in the third phase to fit the enrollment speakers, the approach has limitations in real-world scenarios. In addition, because performance was reported using self-configured trials, direct EER comparison is difficult.
In this study, we first propose an E2E framework, illustrated in Figure 1-(a), that extends the work of Li et al. [li2019multi, li2020joint] in two aspects. First, we adopt a single-phase training approach using three loss functions for SID, PAD, and ISV. Second, our framework directly outputs a spoofing-aware score without a separate back-end system.
3 Integrated speaker verification
In this section, we describe the two proposed frameworks for conducting speaker verification aware of replay spoofing attacks, as shown in Figure 1.
3.1 End-to-end monolithic approach
We first propose an E2E monolithic approach. This architecture simultaneously trains all components, including SID, PAD, and ISV, using a common feature, as illustrated in Figure 1-(a). The loss function for training the proposed E2E architecture comprises three components: a categorical cross-entropy (CCE) loss for SID, a binary cross-entropy (BCE) loss for PAD, and a two-class BCE loss for ISV. When a mini-batch is input for training, the proposed system first conducts SID and PAD within an MTL framework. Then, it composes a number of trials, each consisting of two embeddings: one for enrollment and the other for test. The ISV prediction is made by feeding the two embeddings forward through a few fully-connected layers. The entire DNN is jointly optimized using the sum of the three loss functions. The objective function is defined as follows:
L_E2E = L_SID + L_PAD + L_ISV,

where L_SID refers to the CCE loss for SID, L_PAD is the BCE loss for PAD, and L_ISV denotes the CCE loss of ISV.
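The summed objective can be sketched with toy per-example losses; the numbers below are illustrative only and not the authors' training code:

```python
import numpy as np

def cce(probs, label):
    """Per-example categorical cross-entropy given softmax probabilities."""
    return -np.log(probs[label])

def bce(p, y):
    """Per-example binary cross-entropy given a sigmoid probability."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy outputs for a single training example
sid_probs = np.array([0.1, 0.7, 0.2])  # softmax over training speakers
pad_prob = 0.9                          # P(bona fide) from the PAD head
isv_prob = 0.8                          # P(target trial) from the ISV head

# joint optimization uses the plain sum of the three losses
l_total = cce(sid_probs, 1) + bce(pad_prob, 1.0) + bce(isv_prob, 1.0)
```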
However, through our experiments we found consistent evidence that it is difficult to extract a common representation, i.e., feature, for performing both SV and PAD tasks. Therefore, we hypothesize that, although the SV and PAD tasks are closely related in this scenario, the discriminative information required for each task collides. Speaker embeddings for the SV task require robustness to device and channel differences; meanwhile, the representation for the PAD task exploits exactly this information [shim2018replay]. Also, both bona-fide and replayed utterances include the same speaker information, making it a less discriminative factor for the PAD task, whereas it is the key information for the SV task. The study of Sahidullah et al. [sahidullah2016integrated], which concluded that the SV and PAD systems should exist independently, supports our hypothesis. To validate our hypothesis, we conduct experiments using separately trained SV and PAD systems and MTL-based systems, further detailed in Section 4.3.
3.2 Back-end modular approach
We also propose a novel modular approach using a separate DNN that takes speaker embeddings and PAD predictions as input to make an ISV decision. Figure 1-(b) illustrates this second proposed system. We use the LCNN architecture [LCNN] to extract both speaker embeddings and spoofing predictions; this choice is based on its success in various spoofing detection studies [lavrentyeva2017audio, lavrentyeva2019stc].
Based on the hypothesis addressed in the previous subsection, we design an integrated system using a two-phase approach. In the first phase, we separately train an SID system, from whose last hidden layer speaker embeddings are extracted, and a PAD system that outputs a spoofing prediction. Then, we train the ISV system using two speaker embeddings (one for enrollment and the other for test) extracted from the SID system, together with a PAD label, as input. This system has an output layer with two nodes: the first node indicates “acceptance” and the second indicates “rejection” for both ZE and replay trials.
In Figure 1-(b), the part trained in phase 2 is the proposed back-end ISV system. It takes two speaker embeddings and the multiplication of the two embeddings as input, and a module of four fully-connected layers outputs a scalar that indicates whether the two utterances were spoken by the same speaker. The fully-connected layers comprise 256 nodes each, and the output layer comprises one node with a sigmoid activation function.
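As a concrete sketch of this sub-module's input, assuming "multiplication" means the element-wise product of the two embeddings (the variable names here are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 1024  # speaker embedding dimensionality used in this paper

# placeholder embeddings standing in for the pretrained SID outputs
enroll_emb = rng.standard_normal(emb_dim)
test_emb = rng.standard_normal(emb_dim)

# input to the back-end SV sub-module: both embeddings plus their
# element-wise product, concatenated into one vector
sv_input = np.concatenate([enroll_emb, test_emb, enroll_emb * test_emb])
```

The concatenated vector, of dimensionality 3 × 1024, then feeds the four 256-node fully-connected layers described above.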
Next, the SV result, the PAD prediction, and their multiplication are fed to a fully-connected layer to make the final decision. In an ideal scenario, the multiplication of the SV result and the PAD prediction would be high only when both SV and PAD are positive, and low otherwise; we assume this multiplication additionally informs the final decision. The objective function for the back-end modular approach comprises a loss for the SV task and a loss for the final decision, defined as:
L_modular = α · L_SV + L_ISV,

where L_SV and L_ISV refer to the BCE loss of the SV task and the CCE loss of the ISV task, respectively, and α signifies the weight for the SV loss. We note that training the proposed back-end DNN with only L_ISV results in overfitting.
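A toy numpy sketch of the final-decision input and the weighted objective follows; the name alpha for the SV-loss weight and the per-example loss values are our illustration, with alpha set to 20 as in the best system of Section 4.3:

```python
import numpy as np

def bce(p, y):
    """Per-example binary cross-entropy."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy SV output and PAD label for one training trial
sv_score = 0.9   # scalar output of the back-end SV sub-module (target trial)
pad_label = 1.0  # actual PAD label used during training (1 = bona fide)

# final-decision input: SV result, PAD prediction, and their multiplication
decision_input = np.array([sv_score, pad_label, sv_score * pad_label])

alpha = 20.0                 # weight on the SV loss
l_sv = bce(sv_score, 1.0)    # BCE loss of the SV task
l_isv = -np.log(0.8)         # toy two-class CCE loss of the ISV task
l_total = alpha * l_sv + l_isv
```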
Based on a number of experiments that we omit for the sake of brevity, we found two key components that made our proposed back-end DNN framework successful. First, we model ZE and replay trials as separate score distributions. Figures 2-(a) and (b) respectively illustrate the score distributions of the evaluation trials for the SV baseline and the proposed modular back-end DNN. In Figure 2-(a), the score refers to the cosine similarity of the two embeddings. Here, the score distribution of replay non-target trials severely overlaps with that of target trials. In our analysis, this results from embeddings that consider only speaker information, in which replayed and bona-fide utterances coincide. In various experiments, it was impossible to model both replay and ZE non-target trials with the same score distribution: when one kind of non-target trial was successfully modeled, the other resulted in a near-uniform distribution. Therefore, we aim to separate the two non-target score distributions, specifically by modelling the ZE non-target and replay non-target scores with distinct means, with the replay non-target distribution centered at zero. To do so, we sequentially apply rectified linear unit (ReLU) and sigmoid activation functions to the output of SV, before the last hidden layer for ISV. Figure 2-(b) demonstrates the score distribution of the proposed method. The results show that the three types of evaluation trials are modeled as intended (i.e., well generalized), although these trials comprise unknown speakers and replay conditions.
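One way to read this activation scheme: composing ReLU with sigmoid floors every non-positive SV output at 0.5, while positive outputs spread over (0.5, 1). A minimal numpy check of the composition:

```python
import numpy as np

def squash(x):
    """Apply ReLU, then sigmoid, to an SV output score."""
    return 1.0 / (1.0 + np.exp(-np.maximum(x, 0.0)))
```

Because ReLU maps all negative scores to zero and sigmoid(0) = 0.5, the composed activation cannot emit values below 0.5, which helps force the two non-target score distributions apart.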
Second, we use actual PAD labels instead of the PAD predictions of the spoofing DNN in the training phase. This is based on empirical comparisons in which using PAD predictions during training worsened performance. In our analysis, using PAD labels in the training phase was more helpful because even a small number of misclassified utterances among the PAD predictions can disrupt the training of the proposed DNN. Notably, we empirically observed model collapse when training the proposed modular DNN with PAD predictions.
4 Experiments & results
4.1 Dataset

All experiments in this study were conducted using the ASVspoof2017-v2 dataset [delgado2018asvspoof] (for the ASVspoof2019 dataset, no official trials exist for integrated systems). To evaluate the proposed integrated systems, we used the trials reported in [todisco2018integrated]. We used the training and development sets, comprising 2267 bona-fide and 2457 replayed utterances from 18 speakers, to train all systems. To evaluate speaker verification and spoofing detection performance, we measured the ZE-EER and the PAD-EER using the ASVspoof2017 joint PAD+ASV evaluation trials, which comprise 1106 target, 18624 ZE, and 10878 replayed trials. We use target & ZE trials for ZE-EER and target & replayed trials for PAD-EER evaluations.
4.2 Experimental configurations
We found that relatively thin LCNN structures improved performance on ASVspoof2017-v2, likely because of the small size of the dataset; with so little data, minute changes to the DNN greatly influenced performance, so a relatively thin structure remained particularly helpful. To derive a value between 0 and 1 for the PAD task, we used a network architecture identical to that of [lavrentyeva2019stc] but replaced the angular margin softmax activation [wang2018cosface] with a sigmoid function. We also modified the architecture for the SV task based on [lavrentyeva2019stc]. Speaker embeddings had a dimensionality of 1024.
4.3 Results analysis
[Table columns: Train loss | DNN arch | ZE-EER (SV) | PAD-EER]
Table 3 describes the results of the proposed E2E framework with a monolithic approach. System #1 refers to the proposed architecture that jointly optimizes the SID, PAD, and ISV losses (Figure 1-(a)). System #2-SE is the result of applying squeeze-excitation (SE) [hu2018squeeze], based on its recent application to PAD [lai2019assert]. System #3 describes the result of assigning three max feature map (MFM) blocks [LCNN] each to SID and to PAD after the first three shared MFM blocks. Because most systems' performance measures deteriorated compared to the SV baseline, we conclude that the monolithic E2E approach is not ideal for the ISV task. While the results differed from our expectations, they nevertheless serve as a springboard for establishing a new hypothesis.
Table 4 addresses the validation of our hypothesis from Section 3, motivated by the results in Table 3, that the discriminative information for the SV and PAD tasks is distinct. To validate the hypothesis, we trained our SV and PAD baselines with and without an additional loss for extracting common embeddings. Here, the first and third rows refer to the SV and PAD baselines, and the second and fourth rows to the use of the MTL framework. The results demonstrate that, for both baselines, adopting an additional loss function degraded performance.
[Table row: Todisco et al. [todisco2018integrated] | 4.71 | 18.11 | -]
Table 5 summarizes various attempts to improve the performance of the proposed back-end modular approach. The comparison of Systems #4 and #5 shows the effectiveness of using the multiplication of the SV result and PAD prediction for the ISV task. System #6 refers to the result of weighting the SV task in the training phase, where we set the SV loss weight to 20. System #7 shows the result of reducing the number of nodes per hidden layer.
Finally, Table 6 compares our proposed modular approach with the SV baseline and existing work [todisco2018integrated] using official trials. The results demonstrate that the proposed approach stabilizes the unbalanced performance between ZE-EER and PAD-EER. Compared with the SV baseline, which does not consider PAD attacks, we achieved a relative improvement of 21.77%. Note that we were unable to compare the Int-EER with that of Todisco et al. [todisco2018integrated], although it is the only study that reported performance using the official trials: because it proposed a unified threshold for conducting the SV and PAD tasks, Int-EER results over the full trial list do not exist.
5 Conclusions

In this paper, we investigated the integration of speaker verification and replay spoofing detection. We proposed two methods for their integration: an E2E monolithic approach and a back-end modular approach. The proposed E2E approach simultaneously trains SID, PAD, and ISV using a common feature. The experimental results of the E2E approach led us to hypothesize that the discriminative information for SID and PAD differs. Based on this hypothesis, we proposed a framework using a separate back-end DNN that takes speaker embeddings and a PAD prediction, extracted from pretrained SV and PAD systems, as input. The effectiveness of our proposed systems was verified using official trials for the ISV task, where we achieved an EER of 15.63%. We expect the proposed method to further enhance performance as improved speaker embeddings and PAD predictions become available.