Attention-Based Models for Text-Dependent Speaker Verification

10/28/2017 ∙ by F A Rezaur Rahman Chowdhury, et al. ∙ Washington State University

Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline.




1 Introduction

Speaker verification (SV) is the process of verifying, based on a set of reference enrollment utterances, whether a verification utterance belongs to a known speaker. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. In this study, we focus on the “OK Google” and “Hey Google” global passwords, as they relate to the Voice Match feature of Google Home [1, 2].


I-vector [3] based systems in combination with verification back-ends such as Probabilistic Linear Discriminant Analysis (PLDA) [4] have been the dominating paradigm of SV in previous years. More recently, with the rise of deep learning [5] in various machine learning applications, more efforts have focused on using neural networks for speaker verification. Currently, the most promising approaches are end-to-end integrated architectures that simulate the enrollment-verification two-stage process during training.

For example, in [6] the authors propose architectures that resemble the components of an i-vector + PLDA system. Such an architecture allows the network parameters to be bootstrapped from pretrained i-vector and PLDA models for better performance. However, this initialization stage also constrains the type of network architectures that can be used: only Deep Neural Networks (DNN) can be initialized from classical i-vector and PLDA models. In [7], we have shown that Long Short-Term Memory (LSTM) networks [8] can achieve better performance than DNNs for integrated end-to-end architectures in TD-SV scenarios.

However, one challenge in our architecture introduced in [7] is that silence and background noise are not well captured. Although our speaker verification runs on a short 800ms window that is segmented by the keyword detector [9, 10], the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built using only the frames corresponding to phonemes. Thus, we propose to use an attention layer [11, 12, 13] as a soft mechanism to emphasize the most relevant elements of the input sequence.

This paper is organized as follows. In Sec. 2, we first briefly review our LSTM-based d-vector baseline approach trained with the end-to-end architecture [7]. In Sec. 3, we introduce how we add the attention mechanism to our baseline architecture, covering different scoring functions, layer variants, and weights pooling methods. In Sec. 4, we set up experiments to compare attention-based models against our baseline model, and present the EER results on our testing set. Conclusions are drawn in Sec. 5.

2 Baseline Architecture

Our end-to-end training architecture [7] is described in Fig. 1. For each training step, a tuple of one evaluation utterance $\mathbf{x}_{j\sim}$ and $N$ enrollment utterances $\mathbf{x}_{kn}$ (for $n = 1, \dots, N$) is fed into our LSTM network: $\{\mathbf{x}_{j\sim}, (\mathbf{x}_{k1}, \dots, \mathbf{x}_{kN})\}$, where $\mathbf{x}$ represents the features (log-mel-filterbank energies) from a fixed-length segment, $j$ and $k$ represent the speakers of the utterances, and $j$ may or may not equal $k$. The tuple includes a single utterance from speaker $j$, and $N$ different utterances from speaker $k$. We call a tuple positive if $\mathbf{x}_{j\sim}$ and the $N$ enrollment utterances are from the same speaker, i.e., $j = k$, and negative otherwise. We generate positive and negative tuples alternately.

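For each utterance, let the output of the LSTM’s last layer at frame $t$ be a fixed dimensional vector $\mathbf{h}_t$, where $1 \le t \le T$. We take the last frame output as the d-vector $\boldsymbol{\omega} = \mathbf{h}_T$ (Fig. 4a), and build a new tuple from the d-vectors of the evaluation utterance and the $N$ enrollment utterances: $\{\boldsymbol{\omega}_{j\sim}, (\boldsymbol{\omega}_{k1}, \dots, \boldsymbol{\omega}_{kN})\}$. The centroid of tuple $(\boldsymbol{\omega}_{k1}, \dots, \boldsymbol{\omega}_{kN})$ represents the voiceprint built from $N$ utterances, and is defined as follows:

$$ \mathbf{c}_k = \mathbb{E}_n[\boldsymbol{\omega}_{kn}] = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\omega}_{kn}. \tag{1} $$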


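The similarity between the evaluation d-vector $\boldsymbol{\omega}_{j\sim}$ and the centroid $\mathbf{c}_k$ is defined using the cosine similarity function:

$$ s = w \cdot \cos(\boldsymbol{\omega}_{j\sim}, \mathbf{c}_k) + b, \tag{2} $$

with learnable $w$ and $b$. The tuple-based end-to-end loss is finally defined as:

$$ L_T(\boldsymbol{\omega}_{j\sim}, \mathbf{c}_k) = \delta(j,k)\,\bigl(1 - \sigma(s)\bigr) + \bigl(1 - \delta(j,k)\bigr)\,\sigma(s). \tag{3} $$

Here $\sigma(x) = 1/(1 + e^{-x})$ is the standard sigmoid function and $\delta(j,k)$ equals $1$ if $j = k$, otherwise equals $0$. The end-to-end loss function encourages a larger value of $s$ when $k = j$, and a smaller value of $s$ when $k \neq j$. Considering the update for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [14].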
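The voiceprint centroid, the scaled cosine similarity, and the tuple loss can be illustrated with a few lines of NumPy. This is only a sketch under stated assumptions: the function names, the 64-dimensional d-vectors, and the values chosen for the learnable scale `w` and offset `b` are illustrative, and the loss is written in the form that is minimized when a positive tuple has a large similarity and a negative tuple has a small one.

```python
import numpy as np

def centroid(enroll_dvectors):
    """Average the N enrollment d-vectors (one per row) into a voiceprint."""
    return np.mean(enroll_dvectors, axis=0)

def scaled_cosine_similarity(eval_dvector, voiceprint, w=1.0, b=0.0):
    """s = w * cos(eval_dvector, voiceprint) + b.

    w and b are learnable in the paper; here they are plain arguments.
    """
    cos = np.dot(eval_dvector, voiceprint) / (
        np.linalg.norm(eval_dvector) * np.linalg.norm(voiceprint))
    return w * cos + b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tuple_loss(s, same_speaker):
    """Loss for one tuple: 1 - sigmoid(s) if positive, sigmoid(s) if negative."""
    p = sigmoid(s)
    return 1.0 - p if same_speaker else p

# Illustrative data: N = 3 enrollment d-vectors of dimension 64 (assumed sizes).
rng = np.random.default_rng(0)
enroll = rng.normal(size=(3, 64))
voiceprint = centroid(enroll)

# A hypothetical evaluation d-vector close to the voiceprint.
evaluation = voiceprint + 0.1 * rng.normal(size=64)
s = scaled_cosine_similarity(evaluation, voiceprint, w=10.0, b=-5.0)
loss_if_positive = tuple_loss(s, same_speaker=True)
loss_if_negative = tuple_loss(s, same_speaker=False)
```

For a fixed similarity `s`, the two losses sum to one, so a tuple that scores well as a positive scores correspondingly badly as a negative.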

Figure 1: Our baseline end-to-end training architecture as introduced in [7].

3 Attention-based model

3.1 Basic attention layer

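In our baseline end-to-end training, we directly take the last frame output as d-vector $\boldsymbol{\omega} = \mathbf{h}_T$. Alternatively, we could learn a scalar score $e_t \in \mathbb{R}$ for the LSTM output $\mathbf{h}_t$ at each frame $t$:

$$ e_t = f(\mathbf{h}_t), \quad t = 1, \dots, T. \tag{4} $$

Then we can compute the normalized weights $\alpha_t \in [0, 1]$ using these scores:

$$ \alpha_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)}, \tag{5} $$

such that $\sum_{t=1}^{T} \alpha_t = 1$. And finally, as shown in Fig. 4b, we form the d-vector $\boldsymbol{\omega}$ as the weighted average of the LSTM outputs at all frames:

$$ \boldsymbol{\omega} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t. \tag{6} $$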
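The basic attention layer amounts to a softmax over per-frame scores followed by a weighted average of the LSTM outputs. The following is a minimal NumPy sketch, assuming the scores are already available as a length-$T$ vector; the sizes `T = 80` and `m = 64` and the random inputs are illustrative stand-ins, not values taken from a trained model.

```python
import numpy as np

def softmax(scores):
    """Normalize scores into non-negative weights that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def attention_dvector(lstm_outputs, scores):
    """Weighted average of LSTM outputs.

    lstm_outputs: array of shape (T, m), one row per frame.
    scores: array of shape (T,), one scalar score per frame.
    Returns the weights (T,) and the d-vector (m,).
    """
    weights = softmax(scores)
    dvector = weights @ lstm_outputs
    return weights, dvector

# Illustrative shapes only: 80 frames of 64-dimensional outputs.
T, m = 80, 64
rng = np.random.default_rng(0)
h = rng.normal(size=(T, m))
e = rng.normal(size=T)
alpha, omega = attention_dvector(h, e)
```

When all scores are equal, the weights are uniform and the d-vector reduces to the plain average of the frame outputs.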


3.2 Scoring functions

By using different scoring functions in Eq. (4), we get different attention layers:

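  • Bias-only attention, $f_{\mathrm{BO}}(\mathbf{h}_t) = b_t$, where $b_t$ is a scalar. Note this attention does not depend on the LSTM output $\mathbf{h}_t$.

  • Linear attention, $f_{\mathrm{L}}(\mathbf{h}_t) = \mathbf{w}_t^{\top} \mathbf{h}_t + b_t$, where $\mathbf{w}_t$ is an $m$-dimensional vector, and $b_t$ is a scalar.

  • Shared-parameter linear attention, $f_{\mathrm{SL}}(\mathbf{h}_t) = \mathbf{w}^{\top} \mathbf{h}_t + b$, where the $m$-dimensional vector $\mathbf{w}$ and scalar $b$ are the same for all frames.

  • Non-linear attention, $f_{\mathrm{NL}}(\mathbf{h}_t) = \mathbf{v}_t^{\top} \tanh(\mathbf{W}_t \mathbf{h}_t + \mathbf{b}_t)$, where $\mathbf{W}_t$ is an $m' \times m$ matrix, and $\mathbf{b}_t$ and $\mathbf{v}_t$ are $m'$-dimensional vectors. The dimension $m'$ can be tuned on a development dataset.

  • Shared-parameter non-linear attention, $f_{\mathrm{SNL}}(\mathbf{h}_t) = \mathbf{v}^{\top} \tanh(\mathbf{W} \mathbf{h}_t + \mathbf{b})$, where the same $\mathbf{W}$, $\mathbf{b}$ and $\mathbf{v}$ are used for all frames.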


In all the above scoring functions, all the parameters are trainable within the end-to-end architecture [7].
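The five scoring functions differ only in whether they look at the LSTM output and whether their parameters are tied across frames. The sketch below writes each one for a whole utterance at once, assuming `tanh` as the non-linearity and assuming the per-frame parameters are stacked along a leading time axis; the function names and array layouts are choices made for this illustration.

```python
import numpy as np

def f_bias_only(h, b):
    """b: (T,) per-frame bias. The LSTM output h is ignored."""
    return np.array(b, dtype=float)

def f_linear(h, w, b):
    """h: (T, m); w: (T, m) per-frame weight vectors; b: (T,)."""
    return np.sum(w * h, axis=1) + b

def f_shared_linear(h, w, b):
    """h: (T, m); w: (m,) and scalar b shared by all frames."""
    return h @ w + b

def f_nonlinear(h, W, b, v):
    """h: (T, m); W: (T, m', m); b and v: (T, m'), all per-frame."""
    hidden = np.tanh(np.einsum('tij,tj->ti', W, h) + b)
    return np.sum(v * hidden, axis=1)

def f_shared_nonlinear(h, W, b, v):
    """h: (T, m); W: (m', m); b and v: (m',), shared by all frames."""
    return np.tanh(h @ W.T + b) @ v
```

Each function returns one score per frame, which can then be normalized with a softmax. The per-frame variants need parameters for every frame position, so they presuppose fixed-length segments, whereas the shared-parameter variants have a parameter count that is independent of the segment length.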

Figure 4: (a) LSTM-based d-vector baseline [7]. (b) Basic attention layer.

3.3 Attention layer variants

Apart from the basic attention layer described in Sec. 3.1, here we introduce two variants: cross-layer attention, and divided-layer attention.

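For cross-layer attention (Fig. 7a), the scores $e_t$ and weights $\alpha_t$ are not computed using the outputs of the last LSTM layer $\{\mathbf{h}_t\}_{t=1}^{T}$, but the outputs of an intermediate LSTM layer $\{\mathbf{h}'_t\}_{t=1}^{T}$, e.g. the second-to-last layer:

$$ e_t = f(\mathbf{h}'_t). $$

However, the d-vector $\boldsymbol{\omega}$ is still the weighted average of the last layer output $\mathbf{h}_t$.

For divided-layer attention (Fig. 7b), we double the dimension of the last layer LSTM output $\mathbf{h}_t$, and equally divide its dimension into two parts: part-a $\mathbf{h}^{a}_t$, and part-b $\mathbf{h}^{b}_t$. We use part-a to build the d-vector, while using part-b to learn the scores:

$$ e_t = f(\mathbf{h}^{b}_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)}, \qquad \boldsymbol{\omega} = \sum_{t=1}^{T} \alpha_t \mathbf{h}^{a}_t. $$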
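The two variants change only where the scores come from, not how the d-vector is averaged. A minimal sketch follows, assuming the scoring function is passed in as a callable and, for the divided-layer case, that part-a is the first half of the output dimensions and part-b the second half (the text does not specify the ordering).

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))  # shift by the max for stability
    return exp / np.sum(exp)

def cross_layer_attention(h_last, h_intermediate, score_fn):
    """Scores come from an intermediate layer; the average uses the last layer.

    h_last: (T, m); h_intermediate: (T, m_mid); score_fn maps (T, d) -> (T,).
    """
    alpha = softmax(score_fn(h_intermediate))
    return alpha @ h_last

def divided_layer_attention(h_last_doubled, score_fn):
    """Split a (T, 2m) output into part-a (averaged) and part-b (scored)."""
    m = h_last_doubled.shape[1] // 2
    part_a = h_last_doubled[:, :m]   # assumed ordering: first half builds the d-vector
    part_b = h_last_doubled[:, m:]   # second half is used only to learn the scores
    alpha = softmax(score_fn(part_b))
    return alpha @ part_a

# Illustrative usage with a hypothetical shared linear scoring function.
rng = np.random.default_rng(0)
T, m = 80, 64
w = rng.normal(size=m)
score_fn = lambda h: h @ w

omega_cross = cross_layer_attention(
    rng.normal(size=(T, m)), rng.normal(size=(T, m)), score_fn)
omega_divided = divided_layer_attention(rng.normal(size=(T, 2 * m)), score_fn)
```

In both cases the resulting d-vector has the same dimension `m` as in the basic attention layer, so the rest of the training pipeline is unchanged.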

Figure 7: Two variants of the attention layer: (a) cross-layer attention; (b) divided-layer attention.

3.4 Weights pooling

Another variation of the basic attention layer is that, instead of directly using the normalized weights to average LSTM outputs, we can optionally perform maxpooling on the attention weights. This additional pooling mechanism can potentially make our network more robust to temporal variations of the input signals. We have experimented with two maxpooling methods (Fig. 8):


  • Sliding window maxpooling: We run a sliding window on the weights, and for each window, only keep the largest value, and set other values to 0.

  • Global top-$K$ maxpooling: Only keep the $K$ largest values in the weights, and set all other values to 0.
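Both pooling methods can be sketched as masking operations on the weight vector. The text leaves two details open, so this sketch makes explicit assumptions: when sliding windows overlap, a weight is kept if it is the maximum of at least one window, and the surviving weights are not renormalized. The default window size and step mirror the values reported in the experiments.

```python
import numpy as np

def sliding_window_maxpool(weights, window=10, step=5):
    """Keep a weight only if it is the largest value in at least one window."""
    weights = np.asarray(weights, dtype=float)
    keep = np.zeros(len(weights), dtype=bool)
    for start in range(0, len(weights) - window + 1, step):
        keep[start + np.argmax(weights[start:start + window])] = True
    return np.where(keep, weights, 0.0)

def global_topk_maxpool(weights, k):
    """Keep the k largest weights and set all other weights to 0."""
    weights = np.asarray(weights, dtype=float)
    if not 1 <= k <= len(weights):
        raise ValueError("k must be between 1 and the number of weights")
    keep = np.zeros(len(weights), dtype=bool)
    keep[np.argsort(weights)[-k:]] = True
    return np.where(keep, weights, 0.0)

example = np.array([0.1, 0.4, 0.2, 0.3])
pooled_window = sliding_window_maxpool(example, window=2, step=2)
pooled_topk = global_topk_maxpool(example, k=2)
```

With non-overlapping windows of size 2, the example keeps one weight per window; with overlapping windows, the same frame can win several windows, so fewer weights survive than there are windows.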

Figure 8: Different pooling methods on attention weights. The $t$-th pixel corresponds to the weight $\alpha_t$, and a brighter intensity means a larger value of the weight.

4 Experiments

4.1 Datasets and basic setup

To fairly compare different attention techniques, we use the same training and testing datasets for all our experiments.

Our training dataset is a collection of anonymized user voice queries, which is a mixture of “OK Google” and “Hey Google”. It has around 150M utterances from around 630K speakers. Our testing dataset is a manual collection consisting of 665 speakers. It is divided into two enrollment sets and two verification sets, one for each of “OK Google” and “Hey Google”. The enrollment and verification sets contain, on average, 4.5 and 10 utterances per speaker, respectively.

We report the speaker verification Equal Error Rate (EER) on the four combinations of enrollment set and verification set.
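The EER is the operating point at which the false accept rate equals the false reject rate. The paper does not describe how it is computed, so the following is only one common approximation: sweep a threshold over all observed scores and report the point where the two error rates are closest. The function name and the convention that a trial is accepted when its score is at or above the threshold are assumptions of this sketch.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (as a fraction) with a threshold sweep.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, 1.0
    for threshold in thresholds:
        far = np.mean(nontarget_scores >= threshold)  # impostors accepted
        frr = np.mean(target_scores < threshold)      # true speakers rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores, not taken from the paper's evaluation.
separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.6, 0.4], [0.5, 0.3])
```

Multiplying the returned fraction by 100 gives a percentage comparable to the EER(%) values in the tables.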

Our baseline model is a 3-layer LSTM, where each layer has dimension 128, with a projection layer [15] of dimension 64. On top of the LSTM is a linear layer of dimension 64. The acoustic parametrization consists of 40-dimensional log-mel-filterbank coefficients computed over a window of 25ms with 15ms of overlap. The same acoustic features are used for both keyword detection [10] and speaker verification.

The keyword spotting system isolates segments of length $T = 80$ frames (800ms) that only contain the global password, and these segments form the tuples mentioned above. The two keywords are mixed together using the MultiReader technique introduced in [16].

4.2 Basic attention layer

First, we compare the baseline model with the basic attention layer (Sec. 3.1) using different scoring functions (Sec. 3.2). The results are shown in Table 1. As we can see, while bias-only and linear attention bring little improvement to the EER, non-linear attention improves the performance significantly, especially with shared parameters. (For the intermediate dimension of the non-linear scoring functions, we use $m' = m$, so that the matrices $\mathbf{W}_t$ and $\mathbf{W}$ are square.)

| Enroll | Verify | Non-attention baseline | $f_{\mathrm{BO}}$ | $f_{\mathrm{L}}$ | $f_{\mathrm{SL}}$ | $f_{\mathrm{NL}}$ | $f_{\mathrm{SNL}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OK Google | OK Google | 0.88 | 0.85 | 0.81 | 0.80 | 0.79 | 0.78 |
| OK Google | Hey Google | 2.77 | 2.97 | 2.74 | 2.75 | 2.69 | 2.66 |
| Hey Google | OK Google | 2.19 | 2.30 | 2.28 | 2.23 | 2.14 | 2.08 |
| Hey Google | Hey Google | 1.05 | 1.04 | 1.03 | 1.03 | 1.00 | 1.01 |
| Average | | 1.72 | 1.79 | 1.72 | 1.70 | 1.66 | 1.63 |

Table 1: Evaluation EER(%): Non-attention baseline model vs. basic attention layer using different scoring functions (bias-only, linear, shared-parameter linear, non-linear, and shared-parameter non-linear).

4.3 Variants

To compare the basic attention layer with the two variants (Sec. 3.3), we use the scoring function that performed best in the previous experiment: the shared-parameter non-linear scoring function $f_{\mathrm{SNL}}$. From the results in Table 2, we can see that divided-layer attention performs slightly better than basic attention and cross-layer attention (in our experiments, the cross-layer attention scores are learned from the second-to-last layer), at the cost of doubling the dimension of the last LSTM layer.

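| Enroll | Verify | Basic | Cross-layer | Divided-layer |
| --- | --- | --- | --- | --- |
| OK Google | OK Google | 0.78 | 0.81 | 0.75 |
| OK Google | Hey Google | 2.66 | 2.61 | 2.44 |
| Hey Google | OK Google | 2.08 | 2.03 | 2.07 |
| Hey Google | Hey Google | 1.01 | 0.97 | 0.99 |
| Average | | 1.63 | 1.61 | 1.56 |

Table 2: Evaluation EER(%): Basic attention layer vs. variants, all using $f_{\mathrm{SNL}}$ as scoring function.

| Enroll | Verify | No pooling | Sliding window | Top-$K$ |
| --- | --- | --- | --- | --- |
| OK Google | OK Google | 0.75 | 0.72 | 0.72 |
| OK Google | Hey Google | 2.44 | 2.37 | 2.63 |
| Hey Google | OK Google | 2.07 | 1.88 | 1.99 |
| Hey Google | Hey Google | 0.99 | 0.95 | 0.94 |
| Average | | 1.56 | 1.48 | 1.57 |

Table 3: Evaluation EER(%): Different pooling methods for attention weights, all using $f_{\mathrm{SNL}}$ and divided-layer attention.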

4.4 Weights pooling

To compare different pooling methods on the attention weights as introduced in Sec. 3.4, we use the divided-layer attention with the shared-parameter non-linear scoring function. For sliding window maxpooling, we experimented with different window sizes and steps, and found that a window size of 10 frames and a step of 5 frames perform the best in our evaluations. Also, for global top-$K$ maxpooling, we tuned $K$ and report the best-performing setting. The results are shown in Table 3. We can see that sliding window maxpooling further improves the EER, whereas global top-$K$ maxpooling does not improve the average EER over no pooling.

We also visualize the attention weights of a training batch for different pooling methods in Fig. 9. An interesting observation is that, when there’s no pooling, we can see a clear 4-strand or 3-strand pattern in the batch. This pattern corresponds to the “O-kay-Goo-gle” 4-phoneme or “Hey-Goo-gle” 3-phoneme structure of the keywords.

When we apply sliding window maxpooling or global top-$K$ maxpooling, the attention weights are much larger near the end of the utterance. This is easy to understand: the LSTM has accumulated more information near the end than at the beginning, and is thus more confident in producing the d-vector.

Figure 9: Visualized attention weights for different pooling methods. In each image, the x-axis is time, and the y-axis is for different utterances in a training batch. (a) No pooling; (b) Sliding window maxpooling, where the window size is 10 and the step is 5; (c) Global top-$K$ maxpooling.

5 Conclusions

In this paper, we experimented with different attention mechanisms for our keyword-based text-dependent speaker verification system [7]. From our experimental results, the best practice is to: (1) Use a shared-parameter non-linear scoring function; (2) Use a divided-layer attention connection to the last layer output of the LSTM; and (3) Apply a sliding window maxpooling on the attention weights. After combining all these best practices, we improved the EER of our baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement. The same attention mechanisms, especially the ones using shared-parameter scoring functions, could potentially be used to improve text-independent speaker verification models [16] and speaker diarization systems [17].
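As a quick check of the headline figure, the relative improvement follows directly from the two average EER values quoted above:

$$ \frac{1.72 - 1.48}{1.72} = \frac{0.24}{1.72} \approx 0.1395 \approx 14\%. $$

The absolute reduction is 0.24 percentage points of EER; the 14% figure is relative to the baseline's 1.72%.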