Multi-Task Siamese Neural Network for Improving Replay Attack Detection

by Patrick von Platen, et al.

Automatic speaker verification systems are vulnerable to audio replay attacks, which bypass security by replaying recordings of authorized speakers. Replay attack (RA) detection systems built upon Residual Neural Networks (ResNets) have yielded astonishing results on the public benchmark ASVspoof 2019 Physical Access challenge. With most teams using fine-tuned feature extraction pipelines and model architectures, however, the generalizability of such systems remains questionable. In this work, we analyse the effect that discriminative feature learning in a multi-task learning (MTL) setting can have on the generalizability and discriminability of RA detection systems. We use a popular ResNet architecture optimized by the cross-entropy criterion as our baseline and compare it to the same architecture optimized by MTL using Siamese Neural Networks (SNN). It can be shown that SNN outperform the baseline by a relative 26.8 % equal error rate (EER). We further enhance the model’s architecture and demonstrate that SNN with an additional reconstruction loss yield another significant improvement of a relative 13.8 % EER.


1 Introduction

Automatic speaker verification (ASV) systems are increasingly used in a variety of applications. However, ASV systems are vulnerable to audio spoofing attacks, which attempt to gain unauthorized access by manipulating the audio input. Replay attacks (RAs) are among the most popular and effective audio spoofing attacks. In an RA, the attacker fools the ASV system by replaying a recording of an authorized speaker. Considering how effective and cheap RAs are, it is necessary in practice to augment an ASV system with an RA detection system.

The public benchmark ASVspoof initiative started with the ASVspoof 2015 challenge, which dealt with text-to-speech and voice conversion spoofing attacks [1]. ASVspoof 2017 [2] was the first challenge concerned with RA detection and thus created a benchmark data set consisting of voice command recordings. ASVspoof 2019 [3] then introduced a much larger corpus of longer and text-independent recordings for RA detection.

The performance of RA detection systems is thought to be highly dependent on their input feature processing [4]. Correspondingly, earlier work has largely dealt with handcrafted feature processing, and it has been found that high-frequency and phase information can be helpful for RA detection (e.g. in [5, 6]). Popular input features that emerged include linear frequency cepstral coefficients (LFCC) [7] and group delay (GD) grams [8]. In recent years, input features derived from shorter handcrafted feature processing pipelines, such as the log power magnitude spectra (LOGSPEC) [9], have attracted more interest. In contrast to LFCC, LOGSPEC preserves much more of the information present in the original raw signal and thus relies on deep neural networks (DNNs) as powerful feature extractors [9, 10, 11, 12, 13]. Overall, there is currently no consensus about the best input feature for RA detection.

As the quality of recording and replaying devices improves, detecting the difference between genuine and spoofed audio is becoming more difficult. Thus, it becomes necessary to improve the discriminability and generalizability of RA detection systems. Besides common regularization techniques, like data augmentation and Dropout (cf. [10, 13]), multiple teams have used discriminative loss functions and multi-task learning (MTL) [14] for better feature discrimination and generalization (cf. [9, 12, 15]).

Siamese Neural Networks (SNN) [16] have been shown to significantly improve the discriminability and generalizability of models [17]. In this paper, we propose to use SNN in an MTL setting for RA detection. More generally, we investigate to what extent adding discriminative loss functions in an MTL setting can improve the performance of RA detection systems on the ASVspoof 2019 challenge Physical Access (PA) data. The analysis is conducted on multiple input features. We ensure that none of the systems rely on additional data or labels and that all of our settings follow a real-world deployment scenario. Our main contributions include: 1) Proposal of SNN in an MTL setting for improved discriminability and generalizability of RA detection systems; 2) Extensive analysis of discriminative loss functions on multiple input features; 3) Enhancement of a popular architecture for RA detection with second-order statistics pooling; 4) Combination of reconstruction loss (ReL) with SNN in an MTL setting.

2 Related Work

Convolutional neural networks (CNNs), and especially deep residual neural networks (ResNets) [18], have yielded state-of-the-art performance on the ASVspoof 2019 PA data set [9, 10]. To deal with a data set much smaller than the one ResNet was originally designed for, [9, 10, 15] significantly reduce the size of their models by scaling down the number of kernels employed in each of the CNN layers. A key component in the architecture of their models is the projection of ResNet’s three-dimensional tensor output to a one-dimensional vector for further binary classification. In [15], a recurrent layer processes the tensor along the time dimension and outputs the last hidden state. A simpler and apparently more effective approach is to use a global average pooling (GAP) layer instead [9, 10]. Given the success of ResNet with GAP, we use this architecture as our baseline in this study. In other fields of research, it has been shown that using second-order statistics in addition to first-order statistics yields better feature embeddings for utterance-level classification tasks, e.g. in [19]. This led us to extend the GAP layer to additionally perform variance pooling.

In [15], MTL has been applied to RA detection in the form of center loss (CL), which has been shown to greatly improve the discriminability of a model [20]. CL comprises the cross-entropy (CE) loss and the intra-class variance loss of the feature embeddings, weighted by a hyperparameter that controls the intra-class compactness [20]. SNN are known to significantly improve the discriminability and generalizability of a model [17] and have been found to be effective for similarity assessment in computer vision [21]. By using a pair of input features during training, SNN simultaneously increase the inter-class variance and decrease the intra-class variance of the embedded input features. Since CL can be seen as a special case of SNN (the centroid used in CL can be seen as one of the inputs in the input pair used for SNN), SNN are expected to improve the discriminability of the model further. This inspired us to propose SNN in an MTL setting for RA detection.

Another loss function that is easily applicable in the MTL setting is ReL. ReL is an unsupervised loss function and is usually employed in autoencoders to improve the network’s ability to maintain the most distinctive information about the input features in compressed form. When added to a standard CE loss function, ReL can act as an effective regularizer by encouraging the network to learn robust feature embeddings.


3 Proposed Approach

3.1 Audio Preprocessing & Feature Extraction

In a real-world application, the utterance input can be considered as a continuous buffer of audio input. We set the buffer size to 8.5 seconds to keep the audio processing step simple and easy to deploy. Therefore, all utterances are cut or zero-padded to a fixed length of 8.5 seconds and only utterance-level input is considered.
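The cut-or-pad step can be illustrated with a minimal sketch. It assumes a sampling rate of 16 kHz, which is not stated in the text, and uses the 8.5 s input length reported in Section 4; the function name `fix_length` is hypothetical.

```python
def fix_length(samples, sample_rate=16000, seconds=8.5):
    """Cut or zero-pad a 1-D sequence of audio samples at its end.

    sample_rate=16000 is an assumption made for this sketch.
    """
    target = int(round(sample_rate * seconds))
    if len(samples) >= target:
        return list(samples[:target])                        # cut the tail
    return list(samples) + [0.0] * (target - len(samples))   # zero-pad the tail


# Toy example with a tiny "sampling rate" so the result is easy to inspect.
short = fix_length([0.1, -0.2, 0.3], sample_rate=4, seconds=1.5)   # target length 6
long_ = fix_length(list(range(10)), sample_rate=4, seconds=1.5)
```

Operating only on the end of the signal keeps the leading part of the utterance untouched and avoids any dependency on voice activity detection.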

The models are tested on the three input features: linear frequency filterbank features (LFBANK), LOGSPEC and GD grams. LFBANK correspond to the conventionally used LFCC features without the discrete cosine decorrelation step. We chose to leave out this decorrelation step because neural networks are known to act as excellent decorrelators.

We used modified GD grams as defined in Eqs. (28) and (29) in [8] with and because the GD grams as formalized in [6] and [10] did not yield any reasonable results in our experiments. For all input features, the short-time Fourier transform employed a window size of 50 ms and a window shift of 15 ms. LFBANK subsequently applies 80 filters (cf. [23]) without any delta coefficients. The resulting input dimension for GD gram/LOGSPEC and LFBANK is and , respectively.

Confirming the observations made in [9] and [23], the input features are not normalized, but simply scaled down to be in the range from to .

3.2 ResNet for Replay Attack Detection

Similar to [10], the RA detection system is built upon a “thin” 34-layer ResNet, which is presented in detail in Table 1. The ResNet blocks (i.e. Res1 - Res4) employ the “full pre-activation” residual unit proposed in [24]. Due to differences in the input dimensions between LOGSPEC/GD gram and LFBANK, slightly different stride kernels are used (cf. Table 1).

Layer Kernel Filters Input feature Stride
Table 1: Architecture of “thin” 34-layer ResNet

The ResNet is followed by a GAP layer as explained in Eq. (1) in [10]. Extending GAP to the retrieval of second-order statistics, we define a global average and variance pooling (GAVP) layer that extracts both the mean and the variance from all feature maps of ResNet’s last CNN layer. To keep the number of parameters constant, the pooling layer is followed by a dense layer if GAP is employed and a dense layer if GAVP is employed. The final dense layer (called “Out” in Fig. 1) following the GAP or GAVP layer has a single output neuron with Sigmoid activation function, which yields the class probability of the input feature. All layers except the pooling and final layer make use of the Rectified Linear Unit (ReLU). The model has about 1.34 million trainable parameters.
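To make the GAVP idea concrete, the following is a minimal sketch in plain Python rather than the Keras layer used in the experiments. It assumes that the mean and the (population) variance are taken over both the frequency and the time axis of each feature map and that the two statistics are concatenated; the function name and the concatenation order are illustrative choices, not taken from the paper.

```python
def global_avg_var_pool(feature_maps):
    """feature_maps: list of C maps, each a list of rows (frequency x time).

    Returns a vector of length 2*C: the C means followed by the C variances.
    """
    means, variances = [], []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
        means.append(mean)
        variances.append(var)
    return means + variances


maps = [
    [[1.0, 3.0], [5.0, 7.0]],   # mean 4, variance 5
    [[2.0, 2.0], [2.0, 2.0]],   # mean 2, variance 0
]
pooled = global_avg_var_pool(maps)   # [4.0, 2.0, 5.0, 0.0]
```

Compared with GAP, the pooled vector is twice as long for the same number of feature maps, which is why the size of the following dense layer has to change if the parameter count is to stay constant.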

3.3 Multi-Task Learning with Siamese Neural Networks

SNN are made of two sub-networks that share the same set of trainable parameters, so that a pair of input features is used as input during training. Besides the conventional CE loss, which is computed for each sub-network individually, a distance loss between the feature embeddings of the two sub-networks is calculated (cf. Fig. 1). A common choice for the distance loss is the hinge loss (cf. [25]), which depends on a margin, on a label term that takes one value if the input feature labels are equal and another value otherwise, and on a distance metric of choice, for which we empirically found the cosine similarity to work best. During training, SNN then aim at minimizing the sum of the two CE losses and the distance loss, with each loss contributing with equal weight.
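The interplay of the three loss terms can be sketched as follows. The exact form of the hinge loss used in this work is not reproduced here; the sketch instead assumes one common margin-based formulation on top of the cosine similarity (the same form as PyTorch's `CosineEmbeddingLoss`). The margin value of 0.5 and all function names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pair_distance_loss(emb1, emb2, same_label, margin=0.5):
    """Assumed margin-based pair loss on the cosine similarity."""
    cos = cosine_similarity(emb1, emb2)
    if same_label:
        return 1.0 - cos               # pull embeddings of equal labels together
    return max(0.0, cos - margin)      # push embeddings of different labels apart


def binary_cross_entropy(p, label):
    eps = 1e-12                        # avoids log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))


def snn_loss(p1, y1, emb1, p2, y2, emb2, margin=0.5):
    """Sum of both CE losses and the distance loss, each with equal weight."""
    return (binary_cross_entropy(p1, y1)
            + binary_cross_entropy(p2, y2)
            + pair_distance_loss(emb1, emb2, y1 == y2, margin))


total = snn_loss(0.9, 1, [1.0, 0.0], 0.9, 1, [1.0, 0.0])
```

In this example both inputs carry the same label and have identical embeddings, so the distance term vanishes and only the two CE terms contribute.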

Optionally, two ReLs, one for each sub-network, can be added to the overall loss. In this case a shared decoder (with a negligible number of parameters) is used to reconstruct the pair of input features from the outputs of the last convolutional layer, and the reconstruction error is measured with the Frobenius norm. The decoder consists of three consecutive deconvolution layers, each of which upsamples the input using the stride kernel and which employ kernels of size respectively. The decoder outputs are “mean-pooled” over their output feature maps and finally zero-padded to have exactly the same dimension as their respective input feature matrices. The complete architecture of SNN is illustrated in Fig. 1.
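A minimal sketch of such a reconstruction loss is given below. It assumes that the decoder output is padded with zeros at the bottom and right, and that the plain (not squared) Frobenius norm of the difference is used; neither detail is confirmed by the text, and the helper names are hypothetical.

```python
import math


def zero_pad(matrix, rows, cols):
    """Pad a 2-D list with zeros at the bottom/right up to rows x cols."""
    padded = [list(r) + [0.0] * (cols - len(r)) for r in matrix]
    padded += [[0.0] * cols for _ in range(rows - len(matrix))]
    return padded


def reconstruction_loss(x, x_rec):
    """Frobenius norm of the difference between input and padded reconstruction."""
    rec = zero_pad(x_rec, len(x), len(x[0]))
    return math.sqrt(sum((a - b) ** 2
                         for row_x, row_r in zip(x, rec)
                         for a, b in zip(row_x, row_r)))


x = [[1.0, 2.0, 2.0], [0.0, 0.0, 4.0]]
x_rec = [[1.0, 2.0]]                      # decoder output smaller than the input
loss = reconstruction_loss(x, x_rec)      # sqrt(2^2 + 4^2) = sqrt(20)
```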

As can be noted from Eq. (1), the space of possible training samples for SNN includes all pair-wise combinations of the training data with itself, which is prohibitively large. A simple remedy taken in this study is to control the data set’s size by a hyperparameter numSamples. Before every epoch, a data set is created by the following simple but effective sampling procedure:

function CreateSNNDataSet()
     for  in to  do
         for  in to  do

First, no data sample in the genuine (or spoofed, respectively) subset is used twice before every other data sample of that subset has been sampled at least once, which, with numSamples set accordingly, almost certainly ensures that all data samples are used in every epoch. Second, the space of possible sample pairs is explored broadly by shuffling the order of both subsets before every epoch. Third, by sampling from either subset with even probability, the smaller of the two subsets is upsampled so that the resulting data set is balanced.
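The three properties described above can be reproduced with a short sketch. It is not the original procedure, whose pseudocode is only partially preserved; it assumes that the two pools are the genuine and the spoofed training utterances, that labels are encoded as 1 (genuine) and 0 (spoofed), and that an exhausted pool is reshuffled before it is reused. `CyclingPool` and `create_snn_dataset` are hypothetical names.

```python
import random


class CyclingPool:
    """Yields items in shuffled order; every item is used once before any repeats."""

    def __init__(self, items, rng):
        self.items = list(items)
        self.rng = rng
        self.rng.shuffle(self.items)
        self.pos = 0

    def next(self):
        if self.pos == len(self.items):      # pool exhausted: start a new pass
            self.rng.shuffle(self.items)
            self.pos = 0
        item = self.items[self.pos]
        self.pos += 1
        return item


def create_snn_dataset(genuine, spoofed, num_samples, seed=None):
    """Build num_samples pairs; call once before every epoch."""
    rng = random.Random(seed)
    pools = {1: CyclingPool(genuine, rng), 0: CyclingPool(spoofed, rng)}
    pairs = []
    for _ in range(num_samples):
        pair = []
        for _slot in range(2):
            label = rng.choice([0, 1])       # even probability for either class
            pair.append((pools[label].next(), label))
        pairs.append(tuple(pair))
    return pairs


pairs = create_snn_dataset(["g0", "g1"], ["s%d" % i for i in range(18)],
                           num_samples=50, seed=0)
```

Because each slot of a pair picks its class with probability one half, the smaller genuine pool is cycled through more often than the spoofed pool, which yields the balancing effect described in the text.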

Figure 1: A sketch map of SNN for MTL showing the loss functions for RA detection.

4 Experimental Setup and Results

In all experiments, the models were evaluated on the PA subset of the ASVspoof 2019 corpus [3]. PA consists of 48600 spoofed and 5400 genuine utterances in the training (train) data set, 24300 spoofed and 5400 genuine utterances in the development (dev) data set and 116640 spoofed and 18090 genuine utterances in the evaluation (eval) data set. The models were optimized by Adam with , , learning rate and weight decay, which was tuned for each experiment separately. Training was stopped if the equal error rate (EER) on the dev data set did not improve over 15 consecutive epochs. The models were implemented with the Keras framework.
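As an illustration of the stopping criterion, the sketch below computes an EER from two lists of scores and checks the patience rule. It assumes that a higher score means "more likely genuine" and approximates the EER at the threshold where false acceptance and false rejection rates are closest; the official evaluation tooling may interpolate differently, and both function names are hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Returns the EER as a fraction in [0, 1]; higher score = more likely genuine."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)   # genuine rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)      # spoofed accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def should_stop(dev_eers, patience=15):
    """True if the dev EER has not improved over the last `patience` epochs."""
    if len(dev_eers) <= patience:
        return False
    best_before = min(dev_eers[:-patience])
    return min(dev_eers[-patience:]) >= best_before


eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])   # 0.25
```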


First, we analysed the effect of audio input length on the performance of a simplified ResNet model using LFBANK on the eval set. Increasing the input length from 5.0 s to 6.5 s and eventually to 8.5 s improved the EER from 9.31 % to 6.75 % and finally to 6.22 %. In this experiment, we simply cut or padded the end of the audio to the specific length. Based on existing literature (e.g. [23]), this can be explained by cues in the leading and trailing silence, which can lead to better performance. Considering these findings and our practical application, we decided to use an input length of 8.5 s and to cut and pad at the end of the audio from now on (so that we do not rely on voice activity detection in practical applications).

We then analyse the proposed model architecture with GAP [10]. As one baseline, the model was trained using simple CE loss [15]. As another baseline, the model was trained using CL. For CL, we found that yields the best results. The baselines are compared to SNN as described in Section 3.3. For SNN, numSamples was set to and the margin was set to . In all training setups, we used a batch size of . The CE, CL and SNN setups were all evaluated using LFBANK, LOGSPEC and GD gram as input. In a final step, the systems were systematically fused by means of logistic regression with the Bosaris toolkit [27] using the dev data set for calibration.

Due to the data imbalance of 9 to 1 in the training set, we adopted the weighted CE loss for the CE and CL setups, with the CE weight for spoofed utterance input set to . To improve training stability, the bias of the output neuron was initialized to (cf. [28]) if weighted CE was used. The results can be seen in Table 2.

Setup   Input Feature   Dev    Eval
CE      LFBANK          3.70   5.17
CE      GD Gram         6.20   8.63
CE      LOGSPEC         1.98   2.79
CE      Fused           -      2.22
CL      LFBANK          2.76   4.06
CL      GD Gram         4.44   7.13
CL      LOGSPEC         1.37   2.33
CL      Fused           -      1.70
SNN     LFBANK          2.73   3.66
SNN     GD Gram         3.53   5.89
SNN     LOGSPEC         1.15   2.25
SNN     Fused           -      1.52
Table 2: Comparison of the CE, CL and SNN setups for different input features. Results are reported in % EER.

It can be seen that both MTL setups, CL and SNN, outperform the single-task learning setup CE, by a relative 23.4 % EER and 31.5 % EER, respectively, averaged over all input features. SNN further outperforms CL by a relative margin of 10.6 % EER. During training, we observed that the MTL setups converged faster and also seemed to generalize better, as the EER on the dev data set decreased much more smoothly.

In the second experiment, we took the best performing model for LOGSPEC input as our new baseline. First, we analysed the effect of extracting second-order statistics in addition to first-order statistics from the CNN feature maps by replacing the GAP layer with a GAVP layer. Second, we extended SNN with two additional reconstruction loss functions according to Eq. (2) for both GAP and GAVP. Empirically, it was found that the reconstruction losses are much smaller than the remaining loss terms, so that the ReL is scaled by a weighting factor of . Because we experienced RAM overflow issues with the two ReL setups, the batch size used in training was reduced to and numSamples set to to have the same number of steps per epoch as before. The results are shown in Table 3.

Model   Loss        Dev    Eval
GAVP    SNN         0.83   2.01
GAP     SNN + ReL   0.66   2.08
GAVP    SNN + ReL   0.76   1.94
Table 3: Comparison of the GAVP and ReL extensions of SNN for LOGSPEC input feature. Results are reported in % EER.

It can be seen that both using the GAVP layer and adding ReL give a significant performance boost compared to the SNN baseline with GAP. Consequently, the best single-system performance of 1.94 % EER on the eval data set is achieved by SNN with GAVP and ReL, which outperforms the CE baseline by a relative 30.5 % EER while having the same number of parameters.
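The relative improvements can be re-derived from the eval columns of Tables 2 and 3 with a few lines of arithmetic. The sketch assumes that the first block of rows in Table 2 belongs to the CE baseline and the third block to SNN, following the order in which the setups are introduced in the text, and that the best single system is the one with 1.94 % EER in Table 3.

```python
def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


# Eval EERs in %, copied from Table 2 (assumed row assignment: CE first, SNN third).
ce_eval = {"LFBANK": 5.17, "GD gram": 8.63, "LOGSPEC": 2.79}
snn_eval = {"LFBANK": 3.66, "GD gram": 5.89, "LOGSPEC": 2.25}

per_feature = {f: relative_improvement(ce_eval[f], snn_eval[f]) for f in ce_eval}
snn_vs_ce = sum(per_feature.values()) / len(per_feature)             # about 26.8

best_single = 1.94                                                   # best eval EER in Table 3
best_vs_snn = relative_improvement(snn_eval["LOGSPEC"], best_single)  # about 13.8
best_vs_ce = relative_improvement(ce_eval["LOGSPEC"], best_single)    # about 30.5
```

Under these assumptions the three values agree with the relative 26.8 and 13.8 figures in the abstract and the relative 30.5 % quoted above.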

5 Conclusion

We have thoroughly analysed discriminative feature learning in an MTL setting for RA detection and found that SNN significantly outperform the baseline on multiple input features. We explain this improvement as follows. First, SNN greatly improve the discriminability of the model by explicitly increasing the inter-class variance of the feature embeddings. Second, because SNN sample from a very large pool of possible sample pairs, each giving a different gradient signal, the model is regularized much better during training. We then further improve upon SNN by adding ReL and replacing GAP with GAVP. This leads to a single-system EER of 1.94 % and can be justified by better regularization induced by ReL and more discriminative feature embeddings thanks to the extraction of first- and second-order statistics.

6 References