Generalized End-to-End Loss for Speaker Verification

10/28/2017 ∙ by Li Wan, et al.

In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, a model trained with the new loss function decreases EER by more than 10% while reducing the training time by >60%. We also introduce the MultiReader technique, which allows us to do domain adaptation: training a more accurate model that supports multiple keywords (i.e. "OK Google" and "Hey Google") as well as multiple dialects.




1 Introduction

1.1 Background

Speaker verification (SV) is the process of verifying whether an utterance belongs to a specific speaker, based on that speaker’s known utterances (i.e., enrollment utterances), with applications such as Voice Match [1, 2].

Depending on the restrictions of the utterances used for enrollment and verification, speaker verification models usually fall into one of two categories: text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). In TD-SV, the transcript of both enrollment and verification utterances is phonetically constrained, while in TI-SV, there are no lexicon constraints on the transcript of the enrollment or verification utterances, exposing a larger variability of phonemes and utterance durations [3, 4]. In this work, we focus on TI-SV and a particular subtask of TD-SV known as global password TD-SV, where the verification is based on a detected keyword, e.g. “OK Google” [5, 6].

In previous studies, i-vector based systems have been the dominating approach for both TD-SV and TI-SV applications [7]. In recent years, more effort has focused on using neural networks for speaker verification, and the most successful systems use end-to-end training [8, 9, 10, 11, 12]. In such systems, the neural network output vectors are usually referred to as embedding vectors (also known as d-vectors). As with i-vectors, such embeddings can then be used to represent utterances in a fixed-dimensional space, in which other, typically simpler, methods can be used to disambiguate among speakers.

1.2 Tuple-Based End-to-End Loss

In our previous work [13], we proposed the tuple-based end-to-end (TE2E) model, which simulates the two-stage process of runtime enrollment and verification during training. In our experiments, the TE2E model combined with LSTM [14] achieved the best performance at the time. For each training step, a tuple of one evaluation utterance $\mathbf{x}_{j\sim}$ and $M$ enrollment utterances $\mathbf{x}_{km}$ (for $m = 1, \ldots, M$) is fed into our LSTM network: $\{\mathbf{x}_{j\sim}, (\mathbf{x}_{k1}, \ldots, \mathbf{x}_{kM})\}$, where $\mathbf{x}$ represents the features (log-mel-filterbank energies) from a fixed-length segment, $j$ and $k$ represent the speakers of the utterances, and $j$ may or may not equal $k$. The tuple includes a single utterance from speaker $j$ and $M$ different utterances from speaker $k$. We call a tuple positive if $\mathbf{x}_{j\sim}$ and the $M$ enrollment utterances are from the same speaker, i.e., $j = k$, and negative otherwise. We generate positive and negative tuples alternately.

For each input tuple, we compute the L2 normalized response of the LSTM: $\{\mathbf{e}_{j\sim}, (\mathbf{e}_{k1}, \ldots, \mathbf{e}_{kM})\}$. Here each $\mathbf{e}$ is an embedding vector of fixed dimension that results from the sequence-to-vector mapping defined by the LSTM. The centroid of tuple $(\mathbf{e}_{k1}, \ldots, \mathbf{e}_{kM})$ represents the voiceprint built from $M$ utterances, and is defined as follows:

$$\mathbf{c}_k = \frac{1}{M} \sum_{m=1}^{M} \mathbf{e}_{km} \tag{1}$$

The similarity is defined using the cosine similarity function:

$$s = w \cdot \cos(\mathbf{e}_{j\sim}, \mathbf{c}_k) + b \tag{2}$$

with learnable $w$ and $b$. The TE2E loss is finally defined as:

$$L_T(\mathbf{e}_{j\sim}, \mathbf{c}_k) = \delta(j,k)\,\big(1 - \sigma(s)\big) + \big(1 - \delta(j,k)\big)\,\sigma(s) \tag{3}$$

Here $\sigma(x) = 1/(1 + e^{-x})$ is the standard sigmoid function and $\delta(j,k)$ equals 1 if $j = k$, otherwise equals 0. The TE2E loss function encourages a larger value of $s$ when $k = j$, and a smaller value of $s$ when $k \neq j$. Considering the update for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [15].
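The TE2E computation for a single tuple can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes the loss is 1 − σ(s) for a positive tuple and σ(s) for a negative one (which matches the behaviour described above), and the function names and the default values `w=1.0`, `b=0.0` are placeholders chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def te2e_loss(eval_emb, enroll_embs, same_speaker, w=1.0, b=0.0):
    """Loss for one tuple: one evaluation embedding vs. the enrollment centroid.

    w and b stand in for the learnable scale and offset (placeholder values).
    """
    dim = len(eval_emb)
    # Centroid (voiceprint): element-wise mean of the enrollment embeddings.
    centroid = [sum(e[d] for e in enroll_embs) / len(enroll_embs) for d in range(dim)]
    s = w * cosine(eval_emb, centroid) + b
    # Positive tuples want large s, negative tuples want small s.
    return 1.0 - sigmoid(s) if same_speaker else sigmoid(s)

# A positive tuple whose evaluation embedding matches the enrollment centroid
# is penalized less than the same pair labeled as negative.
pos = te2e_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], same_speaker=True)
neg = te2e_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], same_speaker=False)
```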

1.3 Overview

In this paper, we introduce a generalization of our TE2E architecture. This new architecture constructs tuples from input sequences of various lengths in a more efficient way, leading to a significant boost in performance and training speed for both TD-SV and TI-SV. This paper is organized as follows: in Sec. 2.1 we give the definition of the GE2E loss; Sec. 2.2 gives the theoretical justification for why GE2E updates the model parameters more effectively; Sec. 2.3 introduces a technique called “MultiReader”, which enables us to train a single model that supports multiple keywords and languages; finally, we present our experimental results in Sec. 3.

Figure 1: System overview. Different colors indicate utterances/embeddings from different speakers.

2 Generalized End-to-End Model

Generalized end-to-end (GE2E) training is based on processing a large number of utterances at once, in the form of a batch that contains $N$ speakers and, on average, $M$ utterances from each speaker, as depicted in Figure 1.

2.1 Training Method

We fetch $N \times M$ utterances to build a batch. These utterances are from $N$ different speakers, and each speaker has $M$ utterances. Each feature vector $\mathbf{x}_{ji}$ ($1 \le j \le N$ and $1 \le i \le M$) represents the features extracted from speaker $j$ utterance $i$.

Similar to our previous work [13], we feed the features extracted from each utterance $\mathbf{x}_{ji}$ into an LSTM network. A linear layer is connected to the last LSTM layer as an additional transformation of the last frame response of the network. We denote the output of the entire neural network as $f(\mathbf{x}_{ji}; \mathbf{w})$, where $\mathbf{w}$ represents all parameters of the neural network (including both the LSTM layers and the linear layer). The embedding vector (d-vector) is defined as the L2 normalization of the network output:

$$\mathbf{e}_{ji} = \frac{f(\mathbf{x}_{ji}; \mathbf{w})}{\lVert f(\mathbf{x}_{ji}; \mathbf{w}) \rVert_2} \tag{4}$$

Here $\mathbf{e}_{ji}$ represents the embedding vector of the $j$th speaker’s $i$th utterance. The centroid of the embedding vectors from the $j$th speaker, $[\mathbf{e}_{j1}, \ldots, \mathbf{e}_{jM}]$, is defined as $\mathbf{c}_j$ via Equation 1.

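The similarity matrix $\mathbf{S}_{ji,k}$ is defined as the scaled cosine similarities between each embedding vector $\mathbf{e}_{ji}$ and all centroids $\mathbf{c}_k$ ($1 \le j, k \le N$, and $1 \le i \le M$):

$$\mathbf{S}_{ji,k} = w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b \tag{5}$$

where $w$ and $b$ are learnable parameters. We constrain the weight to be positive, $w > 0$, because we want the similarity to be larger when cosine similarity is larger. The major differences between TE2E and GE2E are as follows: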

  • TE2E’s similarity (Equation 2) is a scalar value that defines the similarity between embedding vector $\mathbf{e}_{j\sim}$ and a single tuple centroid $\mathbf{c}_k$.

  • GE2E builds a similarity matrix (Equation 5) that defines the similarities between each $\mathbf{e}_{ji}$ and all centroids $\mathbf{c}_k$.
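To make the shape of the similarity matrix concrete, here is a minimal sketch in plain Python. It assumes a batch stored as `embeddings[j][i]` (speaker j, utterance i), computes one centroid per speaker as the mean of that speaker's embeddings, and fills a nested list `S[j][i][k]` with scaled cosine similarities. The function names and the default scale and offset are illustrative assumptions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(embeddings, w=1.0, b=0.0):
    """Return S where S[j][i][k] = w * cos(e_ji, c_k) + b.

    embeddings[j][i] is the embedding of utterance i from speaker j.
    w and b are placeholders for the learnable scale and offset.
    """
    centroids = [mean_vector(speaker) for speaker in embeddings]
    return [[[w * cosine(e, c) + b for c in centroids] for e in speaker]
            for speaker in embeddings]

# Two speakers, two utterances each, with perfectly separated embeddings.
batch = [[[1.0, 0.0], [1.0, 0.0]],
         [[0.0, 1.0], [0.0, 1.0]]]
S = similarity_matrix(batch)
```

Each row `S[j][i]` holds one similarity per speaker centroid, so a single forward pass scores every utterance against every speaker in the batch, where TE2E would score one evaluation utterance against one centroid.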

Figure 1 illustrates the whole process with features, embedding vectors, and similarity scores from different speakers, represented by different colors.

Figure 2: GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.

During training, we want the embedding of each utterance to be similar to the centroid of all that speaker’s embeddings, while at the same time far from other speakers’ centroids. As shown in the similarity matrix in Figure 1, we want the similarity values of colored areas to be large, and the values of gray areas to be small. Figure 2 illustrates the same concept in a different way: we want the blue embedding vector to be close to its own speaker’s centroid (blue triangle), and far from the other speakers’ centroids (red and purple triangles), especially the closest one (red triangle). Given an embedding vector $\mathbf{e}_{ji}$, all centroids $\mathbf{c}_k$, and the corresponding similarity matrix $\mathbf{S}_{ji,k}$, there are two ways to implement this concept:

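Softmax We put a softmax on $\mathbf{S}_{ji,k}$ for $k = 1, \ldots, N$ that makes the output equal to 1 iff $k = j$, otherwise makes the output equal to 0. Thus, the loss on each embedding vector $\mathbf{e}_{ji}$ could be defined as:

$$L(\mathbf{e}_{ji}) = -\mathbf{S}_{ji,j} + \log \sum_{k=1}^{N} \exp(\mathbf{S}_{ji,k}) \tag{6}$$

This loss function means that we push each embedding vector close to its centroid and pull it away from all other centroids.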

Contrast The contrast loss is defined on positive pairs and most aggressive negative pairs, as:

$$L(\mathbf{e}_{ji}) = 1 - \sigma(\mathbf{S}_{ji,j}) + \max_{1 \le k \le N,\; k \neq j} \sigma(\mathbf{S}_{ji,k}) \tag{7}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. For every utterance, exactly two components are added to the loss: (1) A positive component, which is associated with a positive match between the embedding vector and its true speaker’s voiceprint (centroid). (2) A hard negative component, which is associated with a negative match between the embedding vector and the voiceprint (centroid) with the highest similarity among all false speakers.

In Figure 2, the positive term corresponds to pushing $\mathbf{e}_{ji}$ (blue circle) towards $\mathbf{c}_j$ (blue triangle). The negative term corresponds to pulling $\mathbf{e}_{ji}$ (blue circle) away from $\mathbf{c}_k$ (red triangle), because $\mathbf{c}_k$ is more similar to $\mathbf{e}_{ji}$ compared with $\mathbf{c}_{k'}$. Thus, contrast loss allows us to focus on difficult pairs of embedding vector and negative centroid.

In our experiments, we find both implementations of GE2E loss are useful: contrast loss performs better for TD-SV, while softmax loss performs slightly better for TI-SV.
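The two per-utterance variants can be compared side by side on a single row of the similarity matrix. The sketch below assumes `sim_row[k]` holds the similarity of one embedding to centroid k and `j` is the index of its true speaker. The softmax variant is written as a negative log-softmax and the contrast variant as a positive term plus the hardest negative term. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_loss(sim_row, j):
    """Negative log-softmax of the true-speaker similarity."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(sim_row)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in sim_row))
    return -sim_row[j] + log_sum_exp

def contrast_loss(sim_row, j):
    """Positive term plus the single hardest (most similar) negative term."""
    hardest_negative = max(sigmoid(s) for k, s in enumerate(sim_row) if k != j)
    return 1.0 - sigmoid(sim_row[j]) + hardest_negative

row = [5.0, -5.0, -5.0]                 # embedding is close to centroid 0 only
good = softmax_loss(row, j=0)           # true speaker 0: small loss
bad = softmax_loss(row, j=1)            # true speaker 1: large loss
```

Note that the contrast variant ignores every negative centroid except the most similar one, while the softmax variant lets all centroids contribute through the log-sum-exp term.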

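In addition, we observed that removing $\mathbf{e}_{ji}$ when computing the centroid of the true speaker makes training stable and helps avoid trivial solutions. So, while we still use Equation 1 when calculating negative similarity (i.e., $k \neq j$), we instead use Equation 8 when $k = j$:

$$\mathbf{c}_j^{(-i)} = \frac{1}{M-1} \sum_{m=1,\, m \neq i}^{M} \mathbf{e}_{jm} \tag{8}$$

$$\mathbf{S}_{ji,k} = \begin{cases} w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_j^{(-i)}) + b & \text{if } k = j; \\ w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b & \text{otherwise.} \end{cases} \tag{9}$$

Combining Equations 4, 6, 7 and 9, the final GE2E loss $L_G$ is the sum of all losses over the similarity matrix ($1 \le j \le N$, and $1 \le i \le M$):

$$L_G(\mathbf{x}; \mathbf{w}) = L_G(\mathbf{S}) = \sum_{j,i} L(\mathbf{e}_{ji}) \tag{10}$$
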
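Putting the pieces together, the following sketch computes a batch-level loss with the softmax variant. It assumes that, for the true speaker, the centroid is recomputed without the utterance being scored (the exclusion described above), and that the batch loss is the plain sum of the per-utterance losses. Names and the default `w`, `b` values are illustrative, and each speaker is assumed to have at least two utterances.

```python
import math

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ge2e_softmax_loss(embeddings, w=1.0, b=0.0):
    """Sum of softmax-variant losses over all utterances in the batch.

    embeddings[j][i]: embedding of utterance i from speaker j (at least 2 per speaker).
    """
    centroids = [mean_vector(speaker) for speaker in embeddings]
    total = 0.0
    for j, speaker in enumerate(embeddings):
        for i, e in enumerate(speaker):
            # True-speaker centroid excludes the utterance being scored.
            own = mean_vector([u for m, u in enumerate(speaker) if m != i])
            row = [w * cosine(e, own if k == j else centroids[k]) + b
                   for k in range(len(embeddings))]
            total += -row[j] + math.log(sum(math.exp(s) for s in row))
    return total

separated = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
loss_separated = ge2e_softmax_loss(separated)
loss_mixed = ge2e_softmax_loss(mixed)
```

Without the exclusion, each embedding contributes to its own centroid, which inflates the true-speaker similarity; leaving it out removes that shortcut.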
2.2 Comparison between TE2E and GE2E

Consider a single batch in GE2E loss update: we have $N$ speakers, each with $M$ utterances. Each single step update will push all $N \times M$ embedding vectors toward their own centroids, and pull them away from the other centroids.

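This mirrors what happens with all possible tuples in the TE2E loss function [13] for each $\mathbf{x}_{ji}$. Assume we randomly choose $P$ utterances from speaker $j$ when comparing speakers:

  1. Positive tuples: $\{\mathbf{x}_{ji}, (\mathbf{x}_{j,i_1}, \ldots, \mathbf{x}_{j,i_P})\}$ for $1 \le i_p \le M$ and $p = 1, \ldots, P$. There are $\binom{M}{P}$ such positive tuples.

  2. Negative tuples: $\{\mathbf{x}_{ji}, (\mathbf{x}_{k,i_1}, \ldots, \mathbf{x}_{k,i_P})\}$ for $k \neq j$ and $1 \le i_p \le M$ for $p = 1, \ldots, P$. For each $\mathbf{x}_{ji}$, we have to compare with all other $N-1$ centroids, where each set of those $N-1$ comparisons contains $\binom{M}{P}$ tuples.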

Each positive tuple is balanced with a negative tuple, thus the total number is the maximum number of positive and negative tuples times 2. So, the total number of tuples in TE2E loss is:

$$2 \times \max\left( \binom{M}{P},\; (N-1)\binom{M}{P} \right) \ge 2(N-1) \tag{11}$$

The lower bound of Equation 11 occurs when $P = M$. Thus, each update for $\mathbf{x}_{ji}$ in our GE2E loss is identical to at least $2(N-1)$ steps in our TE2E loss. The above analysis shows why GE2E updates models more efficiently than TE2E, which is consistent with our empirical observations: GE2E converges to a better model in shorter time (see Sec. 3 for details).
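The counting argument can be checked numerically. The sketch below assumes the number of ways to pick P enrollment utterances out of M is the binomial coefficient C(M, P), that negatives are counted once per other speaker, and that the total is twice the larger of the two counts. The batch sizes in the usage lines are arbitrary illustrative numbers, not values from the experiments.

```python
from math import comb

def te2e_tuple_count(n_speakers, m_utts, p_enroll):
    """Number of TE2E tuples covered by one GE2E update for a single utterance."""
    positive = comb(m_utts, p_enroll)                      # same-speaker tuples
    negative = (n_speakers - 1) * comb(m_utts, p_enroll)   # one set per other speaker
    # Positives and negatives are balanced, so count the larger set twice.
    return 2 * max(positive, negative)

# Illustrative sizes only: 4 speakers with 5 utterances each.
lower_bound = te2e_tuple_count(4, 5, 5)   # P = M gives the minimum, 2 * (N - 1)
larger = te2e_tuple_count(4, 5, 2)        # fewer enrollment utterances, more tuples
```

Because C(M, P) is at least 1 and equals 1 when P = M, the count never drops below 2(N − 1), which is the lower bound quoted above.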

2.3 Training with MultiReader

Consider the following case: we care about the model application in a domain with a small dataset $D_1$. At the same time, we have a larger dataset $D_2$ in a similar, but not identical domain. We want to train a single model that performs well on dataset $D_1$, with the help from $D_2$:

$$\mathbf{w} = \arg\min_{\mathbf{w}} \; \mathbb{E}_{x \in D_1}\big[L(x; \mathbf{w})\big] + \alpha\, \mathbb{E}_{x \in D_2}\big[L(x; \mathbf{w})\big] \tag{12}$$

This is similar to the regularization technique: in normal regularization, we use $\alpha \lVert \mathbf{w} \rVert_2^2$ to regularize the model. But here, we use $\mathbb{E}_{x \in D_2}[L(x; \mathbf{w})]$ for regularization. When dataset $D_1$ does not have sufficient data, training the network on $D_1$ can lead to overfitting. Requiring the network to also perform reasonably well on $D_2$ helps to regularize the network.

This can be generalized to combine $K$ different, possibly extremely unbalanced, data sources: $D_1, \ldots, D_K$. We assign a weight $\alpha_k$ to each data source, indicating the importance of that data source. During training, in each step we fetch one batch/tuple of utterances from each data source, and compute the combined loss as: $L(D_1, \ldots, D_K) = \sum_{k=1}^{K} \alpha_k\, \mathbb{E}_{x_k \in D_k}[L(x_k; \mathbf{w})]$, where each $L(x_k; \mathbf{w})$ is the loss defined in Equation 10.
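The combination step itself is a weighted sum, sketched below. It assumes that one batch loss has already been computed per data source in the current step. The function name and the example weights are hypothetical.

```python
def multireader_loss(batch_losses, weights):
    """Weighted sum of per-source batch losses for one training step.

    batch_losses[k]: loss of the batch fetched from data source k in this step.
    weights[k]: importance assigned to data source k.
    """
    if len(batch_losses) != len(weights):
        raise ValueError("need exactly one weight per data source")
    return sum(a * loss for a, loss in zip(weights, batch_losses))

# Hypothetical step: a large source and a small source, weighted 1.0 and 0.5.
combined = multireader_loss([2.0, 4.0], [1.0, 0.5])
```

Because every step draws one batch from each source regardless of its size, a small source is not drowned out the way it would be if all utterances were pooled and sampled uniformly.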

3 Experiments

In our experiments, the feature extraction process is the same as [6]. The audio signals are first transformed into frames of width 25ms and step 10ms. Then we extract 40-dimension log-mel-filterbank energies as the features for each frame. For TD-SV applications, the same features are used for both keyword detection and speaker verification. The keyword detection system will only pass the frames containing the keyword into the speaker verification system. These frames form a fixed-length (usually 800ms) segment. For TI-SV applications, we usually extract random fixed-length segments after Voice Activity Detection (VAD), and use a sliding window approach for inference (discussed in Sec. 3.2).
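As a small illustration of the framing step, the sketch below converts the 25ms width and 10ms step into sample indices. The 16 kHz sample rate is an assumption made for the example (the text does not state it), and the log-mel-filterbank computation itself is left out.

```python
def frame_bounds(num_samples, sample_rate=16000, width_ms=25, step_ms=10):
    """Return (start, end) sample indices of each full analysis frame.

    sample_rate is an assumed value; width_ms and step_ms follow the text.
    """
    width = sample_rate * width_ms // 1000   # samples per frame
    step = sample_rate * step_ms // 1000     # samples between frame starts
    return [(start, start + width)
            for start in range(0, num_samples - width + 1, step)]

# One second of audio at the assumed rate.
bounds = frame_bounds(16000)
```

Consecutive frames overlap by 15ms under these settings, so each audio sample contributes to up to three frames.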

Our production system uses a 3-layer LSTM with projection [16]. The embedding vector (d-vector) size is the same as the LSTM projection size. For TD-SV, we use hidden nodes and the projection size is . For TI-SV, we use hidden nodes with projection size . When training the GE2E model, each batch contains speakers and utterances per speaker. We train the network with SGD using initial learning rate , and decrease it by half every 30M steps. The L2-norm of gradient is clipped at  [17], and the gradient scale for projection node in LSTM is set to . Regarding the scaling factor in loss function, we also observed that a good initial value is , and the smaller gradient scale of on them helped to smooth convergence.
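The two optimizer details mentioned above, halving the learning rate every 30M steps and clipping the gradient by its L2 norm, can be sketched as follows. The initial learning rate and the clipping threshold are left as parameters, and the function names are illustrative.

```python
import math

def halved_learning_rate(initial_lr, step, halve_every=30_000_000):
    """Learning rate after `step` updates, halved once per `halve_every` steps."""
    return initial_lr * 0.5 ** (step // halve_every)

def clip_by_l2_norm(grad, max_norm):
    """Rescale the gradient so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

# Placeholder values chosen only to exercise the functions.
lr_later = halved_learning_rate(1.0, 60_000_000)
clipped = clip_by_l2_norm([3.0, 4.0], 1.0)
```

Clipping rescales the whole gradient vector by one factor, so its direction is preserved and only its magnitude is capped.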

3.1 Text-Dependent Speaker Verification

Though existing voice assistants usually only support a single keyword, studies show that users prefer that multiple keywords are supported at the same time. For multi-user on Google Home, two keywords are supported simultaneously: “OK Google” and “Hey Google”.

Enabling speaker verification on multiple keywords falls between TD-SV and TI-SV, since the transcript is neither constrained to a single phrase, nor completely unconstrained. We solve this problem using the MultiReader technique (Sec. 2.3). MultiReader has a great advantage compared to simpler approaches, e.g. directly mixing multiple data sources together: It handles the case when different data sources are unbalanced in size. In our case, we have two data sources for training: 1) An “OK Google” training set from anonymized user queries with  M utterances and  K speakers; 2) A mixed “OK/Hey Google” training set that is manually collected with  M utterances and  K speakers. The first dataset is larger than the second by a factor of 125 in the number of utterances and 35 in the number of speakers.

For evaluation, we report the Equal Error Rate (EER) on four cases: enroll with either keyword, and verify on either keyword. All evaluation datasets are manually collected from 665 speakers with an average of enrollment utterances and evaluation utterances per speaker. The results are shown in Table 1. As we can see, MultiReader brings around 30% relative improvement on all four cases.

Test data (Enroll → Verify)    Mixed data EER (%)    MultiReader EER (%)
OK Google → OK Google          1.16                  0.82
OK Google → Hey Google         4.47                  2.99
Hey Google → OK Google         3.30                  2.30
Hey Google → Hey Google        1.69                  1.15
Table 1: MultiReader vs. directly mixing multiple data sources.
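The "around 30%" figure can be reproduced from the EER values in Table 1. The short sketch below computes the relative EER reduction for each of the four enroll/verify cases. The helper name is illustrative.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# (mixed-data EER, MultiReader EER) pairs taken from Table 1.
table1 = {
    "OK Google -> OK Google": (1.16, 0.82),
    "OK Google -> Hey Google": (4.47, 2.99),
    "Hey Google -> OK Google": (3.30, 2.30),
    "Hey Google -> Hey Google": (1.69, 1.15),
}
improvements = {case: round(relative_improvement(mixed, multi), 1)
                for case, (mixed, multi) in table1.items()}
```

The four cases come out at roughly 29% to 33%, which is what the text summarizes as around 30%.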

We also performed more comprehensive evaluations in a larger dataset collected from  K different speakers and environmental conditions, from both anonymized logs and manual collections. We use an average of enrollment utterances and evaluation utterances per speaker. Table 2 summarizes average EER for different loss functions trained with and without MultiReader setup. The baseline model is a single layer LSTM with nodes and an embedding vector size of  [13]. The second and third rows’ model architecture is 3-layer LSTM. Comparing the 2nd and 3rd rows, we see that GE2E is about better than TE2E. Similar to Table 1, here we also see that the model performs significantly better with MultiReader. While not shown in the table, it is also worth noting that the GE2E model took about less training time than TE2E.

Model Embed Loss Multi Average
Architecture Size Reader EER (%)
[13] TE2E No 3.30
Yes 2.78
TE2E No 3.55
Yes 2.67
GE2E No 3.10
Yes 2.38
Table 2: Text-dependent speaker verification EER.

3.2 Text-Independent Speaker Verification

For TI-SV training, we divide training utterances into smaller segments, which we refer to as partial utterances. While we don’t require all partial utterances to be of the same length, all partial utterances in the same batch must be of the same length. Thus, for each batch of data, we randomly choose a time length $t$ within a fixed range of frames, and enforce that all partial utterances in that batch are of length $t$ (as shown in Figure 3).
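A minimal sketch of this batch construction is shown below. It assumes each utterance is a list of per-frame feature vectors, that the length bounds are passed in by the caller, and that utterances shorter than the sampled length are skipped (a simplification for the example, not something the text specifies). The function name and the toy sizes are hypothetical.

```python
import random

def make_partial_batch(utterances, min_len, max_len, rng=random):
    """Crop one random segment of a shared random length from each utterance.

    Returns (t, segments), where every segment has exactly t frames.
    Utterances shorter than t are skipped (assumption made for this sketch).
    """
    t = rng.randint(min_len, max_len)         # one length per batch
    segments = []
    for frames in utterances:
        if len(frames) < t:
            continue
        start = rng.randint(0, len(frames) - t)
        segments.append(frames[start:start + t])
    return t, segments

# Toy "utterances": frame indices stand in for feature vectors.
toy_rng = random.Random(0)
toy_utts = [list(range(n)) for n in (10, 12, 20)]
t, batch = make_partial_batch(toy_utts, 5, 8, toy_rng)
```

Sampling the length once per batch keeps every segment in the batch the same shape, so the batch can be stacked without padding, while different batches still expose the model to different durations.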

During inference, for every utterance we apply a fixed-size sliding window with overlap between consecutive windows. We compute the d-vector for each window. The final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average (as shown in Figure 4).
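The aggregation can be sketched as follows. The window size and overlap fraction are parameters supplied by the caller, and `embed_fn` is a hypothetical stand-in for the trained network that maps a window of frames to a d-vector. The sketch only shows the windowing, per-window L2 normalization, and averaging.

```python
import math

def sliding_windows(num_frames, window, overlap):
    """(start, end) frame indices of full windows with the given overlap fraction."""
    hop = max(1, int(round(window * (1.0 - overlap))))
    return [(s, s + window) for s in range(0, num_frames - window + 1, hop)]

def utterance_dvector(frames, embed_fn, window, overlap):
    """L2-normalize each window-wise d-vector, then average element-wise."""
    spans = sliding_windows(len(frames), window, overlap)
    if not spans:
        raise ValueError("utterance is shorter than one window")
    normalized = []
    for start, end in spans:
        v = embed_fn(frames[start:end])       # embed_fn: hypothetical model call
        norm = math.sqrt(sum(x * x for x in v))
        normalized.append([x / norm for x in v])
    dim = len(normalized[0])
    return [sum(v[d] for v in normalized) / len(normalized) for d in range(dim)]

# Toy example: a constant "model" and ten dummy frames.
spans = sliding_windows(10, 4, 0.5)
dvec = utterance_dvector([0.0] * 10, lambda seg: [3.0, 4.0], window=4, overlap=0.5)
```

Normalizing before averaging gives every window equal weight in the final d-vector, regardless of the magnitude of the raw network output for that window.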

Our TI-SV models are trained on around 36M utterances from 18K speakers, which are extracted from anonymized logs. For evaluation, we use an additional 1000 speakers with an average of 6.3 enrollment utterances and 7.2 evaluation utterances per speaker. Table 3 shows the performance comparison between different training loss functions. The first column is a softmax that predicts the speaker label for all speakers in the training data. The second column is a model trained with TE2E loss. The third column is a model trained with GE2E loss. As shown in the table, GE2E performs better than both softmax and TE2E. The EER performance improvement is larger than 10%. In addition, we also observed that GE2E training was faster than training with the other loss functions.

Figure 3: Batch construction process for training TI-SV models.
Figure 4: Sliding window used for TI-SV.
Softmax    TE2E [13]    GE2E
4.06       4.13         3.55
Table 3: Text-independent speaker verification EER (%).

4 Conclusions

In this paper, we proposed the generalized end-to-end (GE2E) loss function to train speaker verification models more efficiently. Both theoretical and experimental results verified the advantage of this novel loss function. We also introduced the MultiReader technique to combine different data sources, enabling our models to support multiple keywords and multiple languages. By combining these two techniques, we produced more accurate speaker verification models.