Audio-replay attack detection countermeasures

05/24/2017 ∙ by Galina Lavrentyeva, et al. ∙ 0

This paper presents the Speech Technology Center (STC) replay attack detection systems proposed for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017. In this study we focused on comparison of different spoofing detection approaches. These were GMM based methods, high level features extraction with simple classifier and deep learning frameworks. Experiments performed on the development and evaluation parts of the challenge dataset demonstrated stable efficiency of deep learning approaches in case of changing acoustic conditions. At the same time SVM classifier with high level features provided a substantial input in the efficiency of the resulting STC systems according to the fusion systems results.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In recent years, due to increasing security concerns in all aspects of our daily lives, the need for convenient and non-intrusive authentication methods has grown. Automatic speaker verification (ASV) offers a low-cost and reliable solution for identification problem when voice services are provided. It is already used in social security entitlement, immigration control and election management. Speaker recognition systems are widely used in customer identification during call to a call center, Internet-banking systems and other fields of e-commerce. However, despite the fact that it has reached the point of mass market adoption this technology is acknowledged to be vulnerable to spoofing attacks [1].

According to the [2] ASV spoofing attacks can be classified into direct and indirect attacks acсording to the stage they are applied to. Indirect ones require access to the system and attack inner modules (feature extraction module, voice models, classification results). In opposite, direct attacks use the recording stage or transmission level and is focused on the substitution of the input data. Since speaker verification is mostly used in automatic systems without face-to-face contact, direct attacks are more likely to be used by criminals due to implementation simplicity. The most well-known spoofing attacks are impersonation, replay attack (RA), voice conversion (VC) attack and text-to-speech (TTS) attack [3], [4].

Unlike other spoofing types, impersonation does not leave any traces of the recording and playback devices or signal processing, as it is genuine speech of a non-target speaker. It can be detected by reliable speaker verification system [5]. TTS and VC detection methods were the topic of Automatic Speaker Verification Spoofing (ASVspoof) 2015 Challenge [6]. Results presented during that challenge confirmed the great potential in detection of VC and TTS. Compared to these spoofing types RA is much more simpler as it’s realisation does not require specific audio signal processing knowledges. In RA fraudster usually uses pre-recorded speech samples of the target speaker that can be easily prepared via low-cost recording devices or smartphones. Due to this RA is the most available and therefore critical spoofing technique.

For today there is a small number of studies addressed to the RA detection. The most part of solutions presented for text-dependent ASV are based on the comparison of the test utterance with the stored utterance recorded during the registration. Vulnerability of text-independent ASV to RA was considered in [3] and [7]. These papers show the serious increase in false acceptance rate of ASV system in case of RA presence. RA detection methods for text-independent case are mostly based on additional noise detection, specific for pre-defined acoustic conditions.

The second edition of ASVspoof Challenge conducted in 2017 was aimed to promote the development of RA countermeasures reliable to both known and unknown conditions [8] which can vary greatly. ASVspoof 2017 was focused on a standalone RA detection task for text-dependent case considered without ASV system and any pre-recorded enroll data.

In this paper we described Speech Technology Center (STC) RA detection systems proposed for ASVspoof 2017. Here we investigated and compared different approaches for spoofing detection. These were Gaussian Mixture Model (GMM) based systems, systems based on high level features with simple classifier and deep learning approaches.

2 Automatic Speaker Verification Spoofing Detection Challenge 2017

ASVspoof Challenge was organized in order to assess the potential to detect RA ”in the wild”, specifically in varying acoustic conditions. For this purpose the spoofing database was collected using text-dependent RedDots data [9]. RedDots corpus was replayed and recaptured in heterogeneous acoustic environment (open lab space, balcony, etc.). For spoofing trials 15 different playback and 16 recording devices were used, including smartphones and high-quality speakers. The original RedDots records were used as genuine speech trials. This dataset was divided into training, development and evaluation parts. The evaluation part contained no information about spoofing trials, devices and recording conditions. Spoofing trials from evaluation part were prepared with the use of devices that were not used in the recording process of the training or development data. In this way they presented previously unforeseen spoofing attacks.

3 GMM approach

Authors of [10] insist that spoofing detection methods should be more focused on the feature engineering rather than on models complication. They proposed the system based on constant Q transform cepstral coefficients (CQCC) and simple GMM. This approach showed impressive results on the ASVspoof 2015 dataset and achieved the 72% improvement over the best system participated in the challenge. That is why the reference implementation of this system was provided as a Baseline system by organisers of ASVspoof 2017.

3.1 Baseline

Constant Q transform (CQT) is widely used in music signal processing. By using geometrically spaced frequency bins it overcomes the lack of frequency resolution at lower frequencies and time resolution at higher frequencies that can be produced by Fourier transform with regular space frequency bins. Figure


demonstrates the example of CQT spectrum. CQCC estimated according to the scheme in Figure

2 were used as input features in the Baseline system.

In the Back-End the Baseline system used standard 2-class GMM classifier. 512-component models were trained with an expectation-maximisation (EM) algorithm with random initialisation for genuine and spoofed speech, respectively. For each utterance the log-likelihood score was obtained from GMM models and the final score was computed as the log-likelihood ratio: , where

is a sequence of test utterance feature vectors,

denotes the likelihood function, and and represent the GMM for genuine and spoofed speech.

Figure 1: Log power magnitude CQT spectrum
Figure 2: CQCC features extraction

3.2 Baseline modifications

The first modifications we tried were normalization techniques for acoustic features. We considered mean and variance normalization on the log power spectrum (mvn) and cepstral coefficients (cmvn) in different combinations. We have also tried acoustic features that were effective for VC and TTS detection. These were Mel Wavelet Packet Coefficients (MWPC), based on applying the multiresolution wavelet transform and phase-based features CosPhasePC


Experiments results obtained on the development part (Table 1) confirm the efficiency of normalization for CQCC based systems. Systems with other Front-End features did not outperform these results.

Features CQCC MWPC MWPC CosPhasePC
Normalization - mvn cmvn mvn+cmvn - cmvn -
train EER (%) 0.03 1.81 1.25 1.1 0.4 0.23 0.43
dev EER (%) 10.35 8.74 11.86 9.85 8.81 19.17 24.64
Table 1: Experimental results for GMM-based systems obtained on the development and evaluation parts of ASVspoof 2017 database (EER %)

4 I-vector based system


The most efficient spoofing detection systems on ASVspoof 2015 were based on high level features modelling by standard i-vector approach [12]. Inspired by its success for VC ans TTS spoofing detection we proposed similar approach for RA detection. We experimented with variety of acoustic features from ASVspoof 2015, such as MFCC, CosPhasePC and MWPC features. According to our observations the best system was the system based on the Linear Prediction Cepstral Coefficients (LPCC).

78 LPCC coefficients were obtained using the Hanning window function with a seconds window size and a seconds step for FFT power spectrum estimation [13]. I-vectors of the Total Variability space were extracted from the whole speaker utterances. Here we used the the 128-component Gaussian mixture model of the described features for diagonal covariance UBM (Universal Background Model), and the dimension of the T-matrix was 200. After centering and length-normalization i-vectors were used as an input for a linear kernel SVM classifier. For SVM training the efficient LIBLINEAR [14] library was used.

4.2 Phrase-dependent system

During our investigations we suggested that phrase-dependent system can perform a higher accuracy than text-independent system. We compared i-vector based system with the similar systems trained for several different phrases independently. Our experiments presented in Table 2 showed reduction in spoofing detection in comparison with the common systems trained on the whole training dataset. This effect can be explained by the insufficient size of the training data in a phrase-dependent case which leads to fast overfitting.

UBM phrase-dependent common common common
T-matrix phrase-dependent phrase-dependent common common
SVM phrase-dependent phrase-dependent phrase-dependent common
train EER (%) 0.398 0.133 0.401 1.459
dev EER (%) 11.702 12.368 11.447 9.95
Table 2: Experimental results for the i-vector based system obtained on the development and evaluation parts of ASVspoof 2017 database (EER %)

5 Deep learning approaches

Deep learning approaches have already achieved remarkable performance in many classification and recognition tasks. The success of CNN in video classification [15], image classification [16, 17]

, face recognition

[18] prompts to apply such approaches for ASV anti-spoofing tasks. The idea of employing CNN for spoofing detection is not new and was used in face spoofing detection [19], [20].

Several experimental results confirm the efficiency of CNN based approaches for synthetic speech detection. For example in [21]

authors demonstrated high efficiency of deep neural network (DNN), CNN and recurrent neural network (RNN) architectures for VC and TTS spoofing detection on the base of ASVspoof 2015 dataset. They also proposed a stacked CNN+RNN architecture and demonstrated its state-of-the-art performance. It is particularly important to note that CNN architecture showed similar to CNN+RNN results and their fusion outperformed the best individual system twice in terms of detection quality. In

[22] temporal CNN architectures were proposed for VC and TTS detection. This approach also achieved notable results on the ASVspoof 2015 corpora.

In this section we describe several systems based on CNN frameworks for RA spoofing detection. Such problem can be reduced to the detection of local spectral artifacts presented in the reproduced replay attack that distinguish it from the genuine speech. For this purpose the CNN was used as a robust feature extractor from the input signal representation in a time-frequency domain.

5.1 Unified shape time-frequency features

In our research we chose 2 types of features based on spectrum estimation of the utterance. To prepare CNN input acoustic features we used the normalized log power magnitude spectrum obtained via:

  • constant Q transform (CQT) [10]

  • Fast Fourier Transform (FFT)

  • Discrete wavelet Transform (DWT), obtained by Daubechies wavelets db4

Special attention should be paid here to the fact that CNN input data should have a unified form. We considered two techniques for obtaining a unified time-frequency (T-F) shape of features. First one truncates the spectrum along the time axis with a fixed size. During this procedure short files are extended by repeating their contents if necessary to match the required length. The other technique uses a sliding window approach with a fixed window size.

5.2 Deep Learning architectures

It is known from image processing that the choice of convolutional neural network architecture is a critical task and greatly affects learning result. In this research we investigated several deep neural network architectures, demonstrated the best results in RA detection.

5.2.1 Inception CNN based system

The first proposed neural network architecture was CNN with inception modules (ICNN) (

[23]). The proposed architecture was the reduced version of GoogLeNet and contained 3 inception modules. Inception module acts as multiple convolution filter inputs, that are processed on the same input in parallel. It also does pooling at the same time. All resulting feature maps are then concatenated before going to the next layer. This allows the model to pick the best convolutions and take advantages of multi-level feature extraction by recovering both local feature via smaller convolutions and high abstracted features with larger convolutions.

The ICNN was applied for high level feature extraction from the log power magnitude CQT spectrum. The last fully-connected (FC) layer with softmax activation function was used to discriminate between genuine and spoofing classes during the training process. And a low-dimensional high-level audio representation was extracted from the penultimate FC layer.

To obtain unified time-frequency representation of the audio-signal we used sliding window technique with window size and 90% overlapping. For ICNN input we applied mean variance normalization.

Since in this space of high level features genuine and replay spoofing classes are well separated, it was enough to use the simplest one-component models for each class distribution modelling. We used the standard 2-class GMM classifier (1 GMM for genuine speech and 1 for spoofed speech) trained on the training part of the ASVspoof 2017 database with EM algorithm. The score for an input signal was computed as the loglikelihood ratio. In this scenario we extracted high-level features independently for each sliding window and all high-level features corresponding to one utterance were used to estimate GMM likelihood. However, it should be mentioned that such deep neural approach can be used for End-To-End solution without additional classifier.

5.2.2 Light CNN

The second CNN we explored was the reduced version of the LCNN proposed in [24] with a smaller number of filters in each layer. LCNN consisted of 5 convolution layers, 4 Network in Network (NIN) layers [25]

, 10 Max-Feature-Map layers, 4 max-pooling layers and 2 fully connected layers


. The proposed LCNN used Max-Feature-Map activation function that allows to reduce CNN architecture. In contrast to commonly used Rectified Linear Unit function that suppresses a neuron by a threshold (or bias), MFM suppresses a neuron by a competitive relationship. In this way being applied in particular MFM selects more informative features. Each convolution layer was a combination of two independent convolutional parts calculated for layer’s input. MFM activation function was used then to calculate element-wise maximum of these parts. Max-Pooling layers with kernel of size

and stride of size

were used for time and frequency dimensions reduction. Described CNN was used to obtain high-level audio features similar to the ICNN based system and simple GMM classifier was used at the evaluation stage.

We proposed a number of systems based on LCNN high-level features extractor with different acoustic features. used truncated features obtained from the normalized CQT spectrograms with size. Additionally we explored FFT based features instead of CQT: system used truncated features of size and was based on the sliding window features extraction with window and 90% overlapping along time axis. Alternative system was based on DWT implementation with sliding window of size and 83.4% overlapping.

5.2.3 Stacking CNN and RNN

We also probed the combined CNN + RNN architecture from [21] for RA spoofing detection. In this stacked architecture CNN was used as a feature extractor and RNN modeled the long-term dependencies of a speech sequence. Both CNN and RNN were optimized jointly through the back-propagation algorithm. In this implementation CNN+RNN was used as End-to-End solution.

CNN itself was a reduced version of LCNN. But unlike LCNN from the previous systems max-pooling was applied with the stride along the frequency axis to compress frequency information and stride along the time axis to save time dimensionality.

The RNN part consists of two gated recurrent units

[27] forming the bidirectional gated recurrent unit (BGRU). The first GRU was responsible for the forward pass, while the second GRU performed the backward pass. The last output vectors of both forward and backward passes were taken further to obtain two -dimensional vectors. Such BGRU unit was applied to each channel of CNN’s output resulting in tensor. Weights were shared between each channel’s unit to prevent overfitting. The flattened output of RNN was used as an input to the fully-connected layers with MFM activations resulting in probability of the utterance being spoofed.

System based on this architecture, , used truncated features extracted from log magnitude power FFT spectrum. But due to the limited computational resources we reduced the input data dimension to .

Alternatively we used CNN+RNN architecture for system based on the ensemble empirical mode decomposition (EEMD) features.These features were obtained by the following algorithm with the use of libEEMD library [28]:

  1. Let be the FFT spectrogram of the original signal

  2. Get the first empirical mode of the signal using the EEMD with ensemble size of and noise strength equals to

  3. Compute as the FFT spectrogram of the signal

6 Evaluation results and discussion

The experimental results of all described individual systems on development and evaluation parts of ASVspoof 2017 corpus are presented in Table 3.

Individual system EER (%)
Dev set Eval set
10.35 30.17
9.85 17.4
9.95 12.82
5.73 15.11
4.53 7.34
5.25 11.91
4.80 16.54
8.71 16.41
7.51 10.68
9.94 18.93
Fusion system Dev set Eval set
Primary: , , 3.95 6.73
Contrastive: , , , , 2.62 7.56
, ,
Table 3: Evaluation results

The best result for development and evaluation sets was demonstrated by LCNN system with FFT truncated features. Similar system with CQT-based features showed poor stability on the evaluation set. This can be explained by the poor robustness of the CQT features, which is also approved by results of the baseline system.

The sliding window technique demonstrated worse results compared to the truncated approach on the evaluation set. A possible reason for this is that using spectrograms of the whole utterances (in most cases) as CNN input leads to more accurate text-dependent deep model.

Our CNN+RNN combination also performed worse RA detection quality than a single LCNN. We explain performance degradation by the reduced frequency resolution in the spectrum estimation.

Summarizing all results of our individual systems for the final submission of the ASVspoof 2017 Challenge we prepared two solutions that use fusion of individual systems on the score level with Bosaris toolkit [29]. Our primary system used fusion of , and systems scores. And our Contrastive system combined , , , , , and systems. Comparing two fusion systems on the development and evaluation parts we can see that complicated fusion of 7 systems have less performance than simpler fusion of , , which are completely different in architectures, features and presumably detect different artefacts. Despite the noticeable quality reduction for some individual systems on the eval part the difference in 1% of EER for fusion systems can be explained by the impressive results of . According to Bosaris fusion model, this system has the biggest weight in both fusion solutions.However, we suppose that in case of more common RA detection problem with more various conditions complex system will have better spoofing detection quality.

7 Conlusions

In this paper we explored the applicability of several different approaches for replay attack spoofing detection. We investigated state-of-the-art methods from VC and TTS spoofing detection and deep learning approaches. Our experiments conducted on the ASVspoof 2017 dataset confirmed high efficiency of deep learning frameworks for spoofing detection ”in the wild”. EER of the best individual CNN system was 7.34%. At the same time SVM classifier with high level features provides a substantial input into the efficiency of the resulting STC systems according to the fusion systems results. Our primary system based on systems score fusion provided 6.73% EER on the evaluation set.


  • [1] M. Sébastien, M. Nixon, and S. Li, eds., Handbook of biometric anti-spoofing: trusted biometrics under spoofing attacks. Springer, 2014.
  • [2] M. Faundez-Zanuy, M. Hagmüller, and G. Kubin, “Speaker verification security improvement by means of speech watermarking,” Speech Communication, vol. 48, no. 12, pp. 1608 – 1619, 2006. {NOLISP} 2005.
  • [3] J. Villalba and E. Lleida, “Speaker verification performance degradation against spoofing and tampering attacks,” in FALA 2010, pp. 131–134, 2010.
  • [4] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130 – 153, 2015.
  • [5] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
  • [6] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” Training, vol. 10, no. 15, p. 3750, 2015.
  • [7] J. Villalba and E. Lleida, “Preventing replay attacks on speaker verification systems,” in IEEE International Carnahan Conference on Security Technology, ICCST 2011, pp. 284–291, IEEE, 2011.
  • [8] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in (submitted to Interspeech 2017), 2017.
  • [9] K.-A. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. A. van Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, H. Li, T. Stafylakis, M. J. Alam, A. Swart, and J. Perez, “The reddots data collection for speaker recognition.,” in INTERSPEECH, pp. 2996–3000, ISCA, 2015.
  • [10] M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification antispoofing: Constant q cepstral coefficients,” Processings of Odyssey 2016, 2016.
  • [11] S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, and V. Shchemelinin, “Stc anti-spoofing systems for the asvspoof 2015 challenge,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [12] M. Li, X. Zhang, Y. Yan, and S. Narayanan, “Speaker verification using sparse representations on total variability i-vectors,” Processings of INTERSPEECH 2011, 2011.
  • [13] D. P. and W. Ellis, “PLP and RASTA (and MFCC, and inversion) in Matlab,” 2005. online web resource.
  • [14] “Liblinear library, url =”
  • [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in

    Proceedings of the IEEE conference on Computer Vision and Pattern Recognition

    , pp. 1725–1732, 2014.
  • [16] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [17]

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in

    Advances in neural information processing systems, pp. 1097–1105, 2012.
  • [18] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708, 2014.
  • [19] J. Galbally, S. Marcel, and J. Fierrez, “Biometric antispoofing methods: A survey in face recognition,” IEEE Access, vol. 2, pp. 1530–1552, 2014.
  • [20] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” arXiv preprint arXiv:1408.5601, 2014.
  • [21] C. Zhang, C. Yu, and J. H. L. Hansen, “An investigation of deep learning frameworks for speaker verification anti-spoofing,” IEEE Journal of Selected Topics in Signal Processing, no. 99, 2011.
  • [22] X. Tian, X. Xiao, C. E. Siong, and H. Li, “Spoofing speech detection using temporal convolutional neural network,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.
  • [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Computer Vision and Pattern Recognition, 2015.
  • [24] X. W. X., R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Journal of Selected Topics in Signal Processing, 2015.
  • [25] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [26] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” (submitted to Interspeech-2017), 2017.
  • [27] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
  • [28] “libeemd library, url =”
  • [29] “Bosaris toolkit, url =”