Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis

by   Mingrui Yuan, et al.

This paper proposes a deep multi-speaker text-to-speech (TTS) model for spoofing speaker verification (SV) systems. The proposed model employs one network to synthesize time-downsampled mel-spectrograms from text input and another network to convert them to linear-frequency spectrograms, which are further converted to the time domain using the Griffin-Lim algorithm. Both networks are trained separately under the generative adversarial networks (GAN) framework. Spoofing experiments on two state-of-the-art SV systems (i-vectors and Google's GE2E) show that the proposed system can successfully spoof these systems with a high success rate. Spoofing experiments on anti-spoofing systems (i.e., binary classifiers for discriminating real and synthetic speech) also show a high spoof success rate when such anti-spoofing systems' structures are exposed to the proposed TTS system.




1 Introduction

Speaker verification (SV) is the task of verifying whether an utterance belongs to the speaker it is claimed to come from. It is a widely used biometric and has been studied for decades. One widely used traditional method is the Gaussian mixture model (GMM) with a universal background model (UBM) [1]. Later, the i-vector method was proposed, which maps high-dimensional statistics from the UBM into a low-dimensional representation [2]. Recently, deep learning methods have shown significant advances in verification accuracy over traditional methods [3, 4, 5, 6].

Similar to verification systems using other biometrics, SV systems face the problem of fake identification, which is termed presentation attack or spoofing. Spoofing has two main forms. One is physical access (PA), which includes direct imitation and replay. The other is logical access (LA), which includes speech synthesis and voice conversion. This paper focuses on the spoofing effects of text-to-speech (TTS) synthesis on SV systems. TTS has been investigated for decades. Early methods use unit selection [7], formant synthesis [8], and hidden Markov models (HMM) [9], among others. The spoofing effects of synthetic speech from these traditional methods have been investigated in [10, 11]. Recent years have witnessed a surge of deep learning speech synthesis models such as WaveNet [12] and Tacotron [13]; they can generate speech that is hard to distinguish from real speech by listening. However, the spoofing effects of synthetic speech from deep learning models have not been properly investigated [14].

In [15], a voice conversion (VC) system is proposed to spoof SV systems. It is trained using feedback from black-box SV systems for better spoofing effects. For TTS systems, a deep learning model based on GAN [16] is proposed to spoof an SV system by synthesizing mel-spectrograms. While promising results are reported, the spoofed SV system takes mel-spectrograms as input instead of time-domain signals. This is impractical in real life. In [17], another GAN system is proposed to design a TTS system, which incorporates an anti-spoofing system as the discriminator. This system, however, is only evaluated on the feature distributions of the synthetic speech with and without GAN training; no evaluation is performed on spoofing SV systems.

In this paper, we propose a Wasserstein GAN-based multi-speaker TTS system, built on an existing TTS architecture [18], to spoof SV systems (source code at https://github.com/MingruiYuan/SpoofSV). This system consists of two sub-models. The first model synthesizes a time-downsampled mel-spectrogram from text input using speaker embeddings of the target identity. The second model then converts the mel-spectrogram to a linear-frequency spectrogram, which is finally converted to the time domain using the Griffin-Lim algorithm [19]. We perform adversarial training for each sub-model. Experiments are conducted on spoofing two state-of-the-art SV systems (i-vectors and Google GE2E) in a black-box condition, and results show a high spoof rate of the proposed TTS system.

Our contributions are the following: 1) We proposed a multi-speaker TTS spoofing system using Wasserstein GAN training; 2) Comprehensive spoofing experiments showed a high spoof rate on two state-of-the-art SV systems in the black-box condition; 3) Our experiments also uncovered threats of TTS spoofing to anti-spoofing SV systems when their model structures are not kept confidential.

2 Text-to-speech Model

Our proposed TTS model follows the two-stage process in [18]. In the first stage we use a Text2Mel network to convert the input text into a time-downsampled mel-spectrogram “spoken” by the target speaker. In the second stage we use a Spectrogram Super-resolution Network (SSRN) to convert the time-downsampled mel-spectrogram into the linear-frequency spectrogram. The Text2Mel model works in an online fashion: it processes acoustic features frame by frame. Previously generated frames of the features are fed back as input to Text2Mel to generate the next frame.

2.1 Text2Mel Network


Text2Mel consists of a text encoder (TEnc), an audio and speaker encoder (ASEnc) and an audio decoder (ADec). TEnc takes text embeddings as inputs. The text embeddings are obtained by mapping each character through a trainable lookup table. These embeddings are then processed by subsequent layers of TEnc to obtain an output tensor. ASEnc has two input branches. One accepts the time-downsampled mel-spectrogram of previously generated audio frames and the other accepts the speaker embedding of the target speaker extracted by the Deep Speaker model [5]. The outputs of the two branches are added together and then processed by subsequent layers to obtain an output tensor.

An attention mechanism is employed to align the text input and the generated mel-spectrogram. This is implemented through a trainable attention matrix $A$ of size $N \times T$, where $N$ is the total number of characters of the text input and $T$ is the total number of frames of the to-be-generated mel-spectrogram. $A_{nt}$ is the probability of the $t$-th frame of the mel-spectrogram being generated from the $n$-th character of the input text. As the alignment between text and its speech utterance is monotonic, we do not allow the alignment path to move backward in either dimension during generation. In addition, we do not allow the path to skip 2 or more positions, leaving a valid step size of 0, 1 or 2 in both dimensions. This ensures a roughly continuous alignment path while still allowing speed changes in the synthesized speech.
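The step constraint above can be sketched as a greedy rule applied to one attention column at a time. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `constrained_path` and the toy attention values are hypothetical, and only the character index is constrained because generation here advances exactly one frame per step.

```python
def constrained_path(attention, max_step=2):
    """Pick one character index per frame from an N x T attention matrix.

    attention[n][t] is the weight of character n for frame t. Between
    consecutive frames the chosen index may stay put or advance by at most
    `max_step` characters, so the path never moves backward.
    """
    num_chars = len(attention)
    num_frames = len(attention[0])
    path = []
    prev = 0  # the path starts at (or near) the first character
    for t in range(num_frames):
        last_allowed = min(prev + max_step, num_chars - 1)
        candidates = range(prev, last_allowed + 1)
        # Greedy choice: the allowed character with the largest weight.
        best = max(candidates, key=lambda n: attention[n][t])
        path.append(best)
        prev = best
    return path


# Toy 4-character, 6-frame attention matrix (hypothetical values).
toy_attention = [
    [0.9, 0.6, 0.1, 0.0, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.7, 0.2, 0.1],
    [0.0, 0.0, 0.1, 0.1, 0.7, 0.1],
]
path = constrained_path(toy_attention)
```

In the last frame the unconstrained maximum sits on the first character, but the rule keeps the path on the last character because moving backward is not allowed.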

Finally, ADec takes a concatenated tensor as input and predicts a new frame of the time-downsampled mel-spectrogram at each time step, which is then appended to the generated spectrogram and fed to ASEnc in the next time step.

In all modules of Text2Mel, 1D dilated convolutional layers are used to model short and long contextual information. Highway convolutional layers, following highway networks [20], are applied to improve training efficiency. The detailed structure of ASEnc is shown in Figure 1, while TEnc, ADec and SSRN follow the same structure as in [18].
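A highway convolutional layer can be sketched as a dilated convolution whose output is split into a gate and a candidate, which are then mixed with the input in the general highway form of [20]. The sketch below is an assumption-laden illustration in NumPy: the function names, the "same" padding, and the choice to produce gate and candidate from a single convolution with twice the channels are not taken from the paper.

```python
import numpy as np


def dilated_conv1d(x, weight, bias, dilation):
    """'Same'-padded 1D dilated convolution (odd kernel size assumed).

    x: (C_in, T), weight: (C_out, C_in, K), bias: (C_out,)
    """
    c_out, c_in, k = weight.shape
    pad = dilation * (k - 1) // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    t_len = x.shape[1]
    out = np.tile(bias[:, None], (1, t_len)).astype(float)
    for j in range(k):
        # Tap j reads the padded input shifted by j * dilation frames.
        segment = x_pad[:, j * dilation : j * dilation + t_len]
        out += weight[:, :, j] @ segment
    return out


def highway_conv1d(x, weight, bias, dilation):
    """Highway layer: y = gate * candidate + (1 - gate) * x.

    weight has 2 * C output channels: the first C give the gate
    (after a sigmoid), the last C give the candidate.
    """
    c = x.shape[0]
    h = dilated_conv1d(x, weight, bias, dilation)
    gate = 1.0 / (1.0 + np.exp(-h[:c]))
    return gate * h[c:] + (1.0 - gate) * x


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))   # 4 channels, 10 frames
weight = np.zeros((8, 4, 3))       # 2 * 4 output channels, kernel size 3
bias = np.zeros(8)
y = highway_conv1d(x, weight, bias, dilation=3)
# With all-zero parameters the gate is 0.5 and the candidate is 0, so y = 0.5 * x.
```

The gated skip path lets the layer pass its input through unchanged when the gate is near zero, which is the property highway networks rely on to ease training of deep stacks.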

2.2 Spectrogram Super-resolution Network (SSRN)

SSRN converts the time-downsampled mel-spectrogram from Text2Mel to the linear-frequency spectrogram. It uses transposed convolutional layers and a series of 1D dilated convolutional layers to achieve this super-resolution along both the time and frequency axes. Finally, the Griffin-Lim algorithm [19] is used to estimate the phase spectrogram to obtain the time-domain waveform of the generated speech.
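The phase estimation step can be illustrated with a compact NumPy version of the Griffin-Lim iteration: start from random phase, then alternate between inverting the spectrogram and re-analyzing the result while keeping the target magnitude. This is a sketch only; the helper names, the unpadded framing, the iteration count and the small toy sizes are assumptions, not the settings of the paper's code.

```python
import numpy as np


def stft(signal, n_fft, hop, window):
    """Complex STFT with shape (n_fft // 2 + 1, frames), no padding."""
    frames = 1 + (len(signal) - n_fft) // hop
    out = np.empty((n_fft // 2 + 1, frames), dtype=complex)
    for i in range(frames):
        out[:, i] = np.fft.rfft(signal[i * hop : i * hop + n_fft] * window)
    return out


def istft(spec, n_fft, hop, window):
    """Least-squares inverse STFT via windowed overlap-add."""
    frames = spec.shape[1]
    length = n_fft + hop * (frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i in range(frames):
        segment = np.fft.irfft(spec[:, i], n=n_fft)
        signal[i * hop : i * hop + n_fft] += segment * window
        norm[i * hop : i * hop + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_fft, hop, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    window = np.hanning(n_fft)
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop, window)
        # Keep the target magnitude, update only the phase estimate.
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop, window)))
    return istft(magnitude * phase, n_fft, hop, window)


# Toy usage: recover a waveform from the magnitude spectrogram of a sinusoid.
t = np.arange(64 + 16 * 19)
target = np.abs(stft(np.sin(0.2 * t), 64, 16, np.hanning(64)))
waveform = griffin_lim(target, n_fft=64, hop=16, n_iter=20)
```

In the paper's setting the magnitude would be the linear-frequency spectrogram produced by SSRN, with the window and hop sizes given in Section 3.1.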

Figure 1: The proposed text-to-speech model. Architecture of ASEnc is illustrated, while that of TEnc, ADec and SSRN follows the same structure as that in [18].

2.3 Training

Training of Text2Mel and SSRN is performed separately, as shown in Figure 1. Training of Text2Mel requires pairs of text and time-downsampled mel-spectrograms, while training of SSRN only requires pairs of time-downsampled mel-spectrograms and linear-frequency spectrograms. As both networks consist only of 1D dilated convolutional layers, sequential models are avoided and all frames of each time-downsampled mel-spectrogram can be reconstructed at the same time with teacher forcing in the training stage. This is the key to better training efficiency. The reconstruction loss functions for Text2Mel and SSRN are formulated as

$$L_{\text{Text2Mel}} = L_1(\hat{S}, S) + L_{\text{CE}}(\hat{S}, S) + L_{\text{att}}(A) \quad (1)$$

$$L_{\text{SSRN}} = L_1(\hat{S}, S) + L_{\text{CE}}(\hat{S}, S) \quad (2)$$

where $\hat{S}$ denotes the reconstructed (mel-)spectrogram and $S$ denotes the corresponding ground truth. Both models use the $L_1$ and cross-entropy losses to assess the reconstruction quality. For Text2Mel, the loss also includes an attention loss term $L_{\text{att}}(A)$, computed from the attention matrix $A$ and a weight matrix $W$ that has high weights off the diagonal. This term penalizes the attention matrix if it contains significant energy off the diagonal. The rationale is that the alignment between text and its speech utterance usually lies along the diagonal, assuming a stable speaking speed.
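The three loss terms described above can be sketched in NumPy as follows. Several details are assumptions rather than statements from the paper: spectrogram values are assumed to be normalized to [0, 1] so that an element-wise binary cross-entropy applies, and the weight matrix uses a Gaussian-shaped guided-attention form with width `g = 0.2`, which is one common way to obtain high weights off the diagonal. All function names are hypothetical.

```python
import numpy as np


def l1_loss(pred, target):
    """Mean absolute error between reconstructed and ground-truth values."""
    return np.mean(np.abs(pred - target))


def cross_entropy_loss(pred, target, eps=1e-7):
    """Element-wise binary cross-entropy, assuming values lie in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return np.mean(-target * np.log(pred) - (1.0 - target) * np.log(1.0 - pred))


def attention_weight(num_chars, num_frames, g=0.2):
    """Weight matrix that is 0 on the diagonal and grows away from it."""
    n = np.arange(num_chars)[:, None] / num_chars
    t = np.arange(num_frames)[None, :] / num_frames
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))


def attention_loss(attention, g=0.2):
    """Mean of the attention energy weighted by its distance from the diagonal."""
    w = attention_weight(*attention.shape, g=g)
    return np.mean(attention * w)


diagonal = np.eye(5)                  # perfectly diagonal alignment
anti_diagonal = np.fliplr(np.eye(5))  # alignment far from the diagonal
loss_diag = attention_loss(diagonal)
loss_anti = attention_loss(anti_diagonal)
```

A perfectly diagonal alignment incurs no penalty, while the anti-diagonal one is penalized, which matches the stated rationale of a roughly stable speaking speed.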

When training the two sub-models, we also incorporate discriminators that are trained to discriminate real from synthetic (mel-)spectrograms. The discriminators consist of 1D convolutional layers, highway convolutional layers, and 1D (adaptive) average pooling layers. Model details are omitted due to space limits but can be found in the open-source code. We use Wasserstein GAN with gradient penalty (WGAN-GP) [21] because it achieves better results than the vanilla GAN in our experiments. The output of each discriminator is therefore a confidence value: a (mel-)spectrogram with a higher value is more likely to be real. Following [22], the final loss function for Text2Mel and SSRN combines two parts: the reconstruction loss of each sub-model in Eqs. (1) and (2), and the loss from the discriminator. The two parts are normalized by their averages in each batch so that they have the same weight.
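For reference, the standard WGAN-GP objectives from [21] have the form below. The symbols ($D$ for the discriminator, $x$ for a real sample, $\tilde{x}$ for a synthetic sample, $\hat{x}$ for a random interpolation between them, $\lambda$ for the gradient penalty coefficient) follow the usual WGAN-GP notation and are not taken from this paper; the exact form used for the two sub-models may differ.

```latex
% Discriminator (critic) loss: higher D(.) means "more likely real".
L_D = \mathbb{E}_{\tilde{x}}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad \hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]

% Adversarial part of the generator loss.
L_G^{\text{adv}} = -\,\mathbb{E}_{\tilde{x}}\big[D(\tilde{x})\big]
```

The sign convention agrees with the text: the discriminator is pushed to output higher values for real (mel-)spectrograms, and the generator is rewarded when its outputs receive high values.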

3 Experiments

3.1 Training Spoofing Models

We use the entire VCTK corpus [23] to train our TTS model. This corpus contains 108 valid English speakers (p315 is excluded because its texts are missing), and each speaker has around 400 utterances (about 0.5 hour). All audio files are downsampled from 48 kHz to 22.05 kHz and all texts are converted to lower case. We compute the spectrogram using an STFT with a Hann window of size 1024 and a hop size of 256. For the time-downsampled mel-spectrogram, we use 80 mel filterbanks and select 1 frame out of every 4 frames. The optimizer is Adam [24] and the batch size is 16. For every update of the generator, the discriminator is updated 5 times. The gradient penalty term is weighted by a fixed coefficient. Layer normalization [25] is applied before each activation function, as we found it useful in the experiments. We select three models trained for different numbers of iterations to perform experiments:

M1: (Text2Mel) 500k-(SSRN) 300k. M2: 700k-500k. M3: 1000k-800k.
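The framing parameters above translate into concrete frame rates, as the short calculation below shows. It is only a sketch: it assumes no padding (only complete windows are counted) and that the first frame of every group of 4 is the one kept, neither of which is stated in the text. The function names are hypothetical.

```python
SAMPLE_RATE = 22050  # Hz, after downsampling from 48 kHz
WIN_SIZE = 1024      # STFT window size in samples
HOP_SIZE = 256       # STFT hop size in samples
DOWNSAMPLE = 4       # keep 1 mel frame out of every 4


def num_stft_frames(num_samples):
    """Number of complete STFT frames, assuming no padding."""
    if num_samples < WIN_SIZE:
        return 0
    return 1 + (num_samples - WIN_SIZE) // HOP_SIZE


def downsample_frames(frames):
    """Keep the first frame of every group of DOWNSAMPLE frames (assumed)."""
    return frames[::DOWNSAMPLE]


frame_period_ms = 1000 * HOP_SIZE / SAMPLE_RATE  # spacing of full-rate frames
coarse_period_ms = DOWNSAMPLE * frame_period_ms  # spacing of Text2Mel frames
freq_bins = WIN_SIZE // 2 + 1                    # linear-frequency bins

one_second_frames = num_stft_frames(SAMPLE_RATE)
coarse_frames = downsample_frames(list(range(one_second_frames)))
```

Under these assumptions Text2Mel predicts one 80-dimensional frame roughly every 46 ms, and SSRN restores both the 4x time resolution and the 513 linear-frequency bins.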

3.2 Spoofing Effects on SV Systems

We choose two state-of-the-art SV systems to spoof: i-vectors [2] as provided by Kaldi, and an open-source implementation [26] of Google’s deep learning system GE2E [6]. We also use the VCTK corpus to train the SV systems. We split the speakers in the corpus into training and test sets with three different schemes. S1: (Train) 42-(Test) 66. S2: 60-48. S3: 88-20. We use the default settings of Kaldi’s aishell example and of the GE2E GitHub repository to train the SV systems.

To investigate spoofing effects, we create a mixed set containing 50% real and 50% synthetic utterances to perform speaker verification. We randomly select 3 real utterances of each test speaker for enrollment. We then randomly choose another 20 real utterances and 20 synthetic utterances of each speaker for verification. The synthetic utterances are synthesized by the proposed TTS system on Harvard Sentences.

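| Model-Split | i-vectors SR | i-vectors EER | GE2E SR | GE2E EER |
|---|---|---|---|---|
| M1-S1 | 42.12% | 1.97% | 57.90% | 19.35% |
| M1-S2 | 42.40% | 1.88% | 77.36% | 18.64% |
| M1-S3 | 26% | 1.5% | 60.69% | 18.57% |
| M2-S1 | 70.15% | 2.42% | 62.39% | 19.06% |
| M2-S2 | 72.08% | 2.08% | 80.20% | 18.49% |
| M2-S3 | 41% | 0.5% | 65.07% | 18.73% |
| M3-S1 | 74.47% | 2.27% | 69.80% | 19.88% |
| M3-S2 | 66.98% | 1.67% | 72.48% | 19.68% |
| M3-S3 | 57% | 1.5% | 69.16% | 17.19% |
| Average | 54.69% | 1.75% | 68.34% | 18.74% |

Table 1: Spoofing effects on speaker verification systems using three trained models (M1, M2, M3) of the proposed TTS system with three data split schemes (S1, S2, S3). For each SV system, the first column is the spoof rate (SR) and the second is the equal error rate (EER) on real utterances.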

We propose the spoof rate (SR) to quantify the spoofing effects. It is defined as the percentage of synthetic speech utterances that are accepted by the SV system as their claimed identities. SR clearly depends on the decision threshold of the SV system. In this experiment, we set the threshold of each SV system at its equal error rate (EER) operating point on real speech utterances in the test set. In Table 1, we report the SR of all three models under the three train-test split schemes. We also report the EER on real utterances as a control measure of the performance of the SV systems. Table 1 shows that all three models achieve a high spoof rate on both SV systems under all three data split schemes. The average spoof rate is 54.69% for i-vectors and 68.34% for GE2E. This shows the significant vulnerability of both SV systems to attacks from our TTS system. The EER of i-vectors on real utterances is below 2.5% in all settings, showing that this SV system is well trained. The EER of GE2E is much higher, suggesting that it is not well trained on our limited dataset. In fact, according to the GE2E paper [6], 18K speakers are used to train the model, whereas our training set has fewer than 100 speakers in all three data split schemes. It is possible that a better trained GE2E model would be more robust to our TTS attack; further investigation is needed to confirm this.
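The evaluation protocol described above can be sketched in a few lines of plain Python: find the threshold where false acceptance and false rejection on real speech are closest, then count how many synthetic utterances score at or above it. The function names and the toy scores are hypothetical, and the threshold search over observed scores is one simple way to approximate the EER point, not necessarily what Kaldi or the GE2E code does.

```python
def error_rates(genuine, impostor, threshold):
    """False acceptance and false rejection rates on real speech."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr


def eer_threshold(genuine, impostor):
    """Observed score at which FAR and FRR are closest to each other."""
    def gap(threshold):
        far, frr = error_rates(genuine, impostor, threshold)
        return abs(far - frr)
    return min(sorted(set(genuine) | set(impostor)), key=gap)


def spoof_rate(synthetic, threshold):
    """Fraction of synthetic utterances accepted as the claimed speaker."""
    return sum(s >= threshold for s in synthetic) / len(synthetic)


# Toy verification scores (higher means "more likely the claimed speaker").
genuine = [0.9, 0.8, 0.7, 0.4]      # real speech, correct identity claim
impostor = [0.5, 0.3, 0.2, 0.1]     # real speech, wrong identity claim
synthetic = [0.85, 0.6, 0.45, 0.2]  # TTS speech, claiming the target identity

threshold = eer_threshold(genuine, impostor)
far, frr = error_rates(genuine, impostor, threshold)
sr = spoof_rate(synthetic, threshold)
```

On these toy scores the threshold lands at 0.5, where both error rates on real speech are 25%, and half of the synthetic utterances are accepted.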

In Figure 2, we vary the threshold of the SV systems and plot the spoof rate against the false rejection rate (FRR), where a false rejection is the rejection of a real speech utterance that indeed belongs to the claimed identity. An ideal SV system that is robust to spoofing would show a monotonically decreasing curve very close to the origin. The curves in Figure 2, however, lie at a clear distance from the origin for all models and data split schemes. Taking the M3-S3 i-vectors curve as an example, lowering the SR below 10% would require an FRR as high as 15%, which would not be acceptable in practice. This again shows the vulnerability of i-vectors and GE2E to attacks from our proposed TTS system.
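The trade-off curve can be traced by sweeping the decision threshold over the observed scores and recording the FRR and SR at each operating point. The sketch below uses hypothetical function names and toy scores; it also shows how to read off the smallest FRR needed to keep the SR under a given cap, which is the kind of reading made from Figure 2.

```python
def spoof_vs_frr(genuine, synthetic):
    """Return (FRR, SR) pairs, one per candidate threshold, in threshold order."""
    curve = []
    for threshold in sorted(set(genuine) | set(synthetic)):
        frr = sum(s < threshold for s in genuine) / len(genuine)
        sr = sum(s >= threshold for s in synthetic) / len(synthetic)
        curve.append((frr, sr))
    return curve


def min_frr_for_spoof_rate(curve, max_sr):
    """Smallest FRR among operating points whose SR is at most max_sr."""
    feasible = [frr for frr, sr in curve if sr <= max_sr]
    return min(feasible) if feasible else None


# Toy scores (higher means "more likely the claimed speaker").
genuine = [0.9, 0.8, 0.7, 0.4]
synthetic = [0.85, 0.6, 0.45, 0.2]

curve = spoof_vs_frr(genuine, synthetic)
frr_at_quarter = min_frr_for_spoof_rate(curve, 0.25)
frr_at_zero = min_frr_for_spoof_rate(curve, 0.0)
```

Raising the threshold lowers the SR only by rejecting more genuine utterances, so a system whose curve stays far from the origin has no operating point that is both secure and usable.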

Figure 2: Spoof rate (horizontal axis) vs. false rejection rate (FRR) (vertical axis) of the three trained models (M1, M2, M3) on three data split schemes (S1, S2, S3). SV systems: i-vectors (blue solid line), GE2E (red dashed line).

3.3 Spoofing Effects on Anti-spoofing Systems

We further evaluate the proposed TTS model by spoofing anti-spoofing systems. Here, the anti-spoofing systems are binary classifiers that discriminate real from synthetic speech. We perform this evaluation in two conditions. In the blackbox condition, the anti-spoofing system is treated as a blackbox and its model structure is not revealed to the TTS system. In the whitebox condition, the model structure of the anti-spoofing system is revealed. The anti-spoofing systems make two types of errors: 1) false acceptance of synthetic speech, and 2) false rejection of real speech. As the test set contains equal numbers of real and synthetic utterances, we report the equal error rate (EER) as the evaluation measure.

For the blackbox condition, we choose the GMM-based anti-spoofing system provided by ASVspoof2019. It takes Linear Frequency Cepstral Coefficients (LFCC) as input features, and is trained on the logical access (LA) part of the ASVspoof2019 dataset [27]. We compose two test sets, each of which is a mix of real speech utterances (50%) and synthetic utterances (50%). The two sets differ in their synthetic utterances. In the first set, they are synthesized by the proposed TTS model. In the second set, they are downloaded from https://google.github.io/tacotron/ and are synthesized using other high-quality TTS models. The resulting EER is 0.47% on the first set and 3.74% on the second. These low values show the difficulty of spoofing anti-spoofing systems in the blackbox condition.

For the whitebox condition, we choose as anti-spoofing systems two variants of the discriminator that we use in the GAN training of our model. Each variant has a structure similar to the original discriminator: the first removes an average pooling layer and the second inserts a convolutional layer. The resulting EER is 42.56% for the first variant and 36.21% for the second. These high EER values show that both anti-spoofing systems fail to discriminate synthetic from real speech. This suggests that anti-spoofing systems, when their structures are disclosed, can be very vulnerable to TTS spoofing attacks.

4 Conclusion and Future Work

This paper proposed a deep multi-speaker TTS model for spoofing SV systems. GAN training was employed to train the two sub-models of the system. Experiments on spoofing state-of-the-art SV systems revealed their significant vulnerability to attacks from the proposed TTS system. Experiments on anti-spoofing systems also revealed their vulnerability when their model structures are disclosed. For future work, we plan to use reinforcement learning to improve the spoofing capability on blackbox SV systems. We also plan to design stronger anti-spoofing systems to defend against TTS attacks.