1 Introduction

1.1 Background

Speaker verification (SV) is the process of verifying whether an utterance belongs to a specific speaker, based on that speaker's known utterances (i.e., enrollment utterances), with applications such as Voice Match [1, 2].

Depending on the restrictions of the utterances used for enrollment and verification, speaker verification models usually fall into one of two categories: text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). In TD-SV, the transcript of both enrollment and verification utterances is phonetically constrained, while in TI-SV, there are no lexical constraints on the transcript of the enrollment or verification utterances, exposing a larger variability of phonemes and utterance durations [3, 4]. In this work, we focus on TI-SV and a particular subtask of TD-SV known as global password TD-SV, where the verification is based on a detected keyword, e.g. "OK Google" [5, 6].
In previous studies, i-vector based systems have been the dominating approach for both TD-SV and TI-SV applications [7]. In recent years, more efforts have been focusing on using neural networks for speaker verification, while the most successful systems use end-to-end training [8, 9, 10, 11, 12]. In such systems, the neural network output vectors are usually referred to as embedding vectors (also known as d-vectors). Similarly to the case of i-vectors, such embeddings can be used to represent utterances in a fixed-dimensional space, in which other, typically simpler, methods can be used to disambiguate among speakers.
1.2 Tuple-Based End-to-End Loss
In our previous work [13], we proposed the tuple-based end-to-end (TE2E) model, which simulates the two-stage process of runtime enrollment and verification during training. In our experiments, the TE2E model combined with LSTM [14] achieved the best performance at the time. For each training step, a tuple of one evaluation utterance $x_{j\sim}$ and $M$ enrollment utterances $x_{km}$ (for $m = 1, \dots, M$) is fed into our LSTM network: $\{x_{j\sim}, (x_{k1}, \dots, x_{kM})\}$, where $x$ represents the features (log-mel-filterbank energies) from a fixed-length segment, $j$ and $k$ represent the speakers of the utterances, and $j$ may or may not equal $k$. The tuple includes a single utterance from speaker $j$ and $M$ different utterances from speaker $k$. We call a tuple positive if $x_{j\sim}$ and the $M$ enrollment utterances are from the same speaker, i.e., $j = k$, and negative otherwise. We generate positive and negative tuples alternately.
For each input tuple, we compute the L2 normalized response of the LSTM: $\{e_{j\sim}, (e_{k1}, \dots, e_{kM})\}$. Here each $e$ is an embedding vector of fixed dimension that results from the sequence-to-vector mapping defined by the LSTM. The centroid of tuple $(e_{k1}, \dots, e_{kM})$ represents the voiceprint built from $M$ utterances, and is defined as follows:

$$c_k = \mathbb{E}_m[e_{km}] = \frac{1}{M}\sum_{m=1}^{M} e_{km} \qquad (1)$$
The similarity is defined using the cosine similarity function:

$$s = w \cdot \cos(e_{j\sim}, c_k) + b \qquad (2)$$

with learnable $w$ and $b$. The TE2E loss is finally defined as:

$$L_T(e_{j\sim}, c_k) = \delta(j, k)\big(1 - \sigma(s)\big) + \big(1 - \delta(j, k)\big)\,\sigma(s) \qquad (3)$$
Here $\sigma(x) = 1/(1 + e^{-x})$ is the standard sigmoid function and $\delta(j, k)$ equals 1 if $j = k$, and equals 0 otherwise. The TE2E loss function encourages a larger value of $s$ when $k = j$, and a smaller value of $s$ when $k \neq j$. Considering the update for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [15].
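To make the tuple computation concrete, here is a minimal NumPy sketch of the TE2E loss for a single tuple (Equations 1–3); the embedding dimension, the random inputs, and the initial values of $w$ and $b$ are illustrative assumptions, not values prescribed by the original TE2E work.

```python
# Minimal sketch of the TE2E loss for one tuple (Eqs. 1-3).
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def te2e_loss(e_eval, e_enroll, same_speaker, w=10.0, b=-5.0):
    """e_eval: evaluation embedding; e_enroll: (M, dim) enrollment embeddings."""
    c_k = e_enroll.mean(axis=0)                 # tuple centroid (Eq. 1)
    s = w * cosine(e_eval, c_k) + b             # scaled cosine similarity (Eq. 2)
    sigma_s = 1.0 / (1.0 + np.exp(-s))
    # Positive tuple (j == k): push sigma(s) toward 1; negative: toward 0 (Eq. 3).
    return 1.0 - sigma_s if same_speaker else sigma_s

rng = np.random.default_rng(0)
e_eval = rng.normal(size=8)                     # toy 8-dim embeddings
e_enroll = rng.normal(size=(4, 8))              # M = 4 enrollment utterances
print(te2e_loss(e_eval, e_enroll, same_speaker=True))
```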
In this paper, we introduce a generalization of our TE2E architecture. This new architecture constructs tuples from input sequences of various lengths in a more efficient way, leading to a significant boost of performance and training speed for both TD-SV and TI-SV. This paper is organized as follows: In Sec. 2.1 we give the definition of the GE2E loss; Sec. 2.2 is the theoretical justification for why GE2E updates the model parameters more effectively; Sec. 2.3 introduces a technique called “MultiReader”, which enables us to train a single model that supports multiple keywords and languages; Finally, we present our experimental results in Sec. 3.
2 Generalized End-to-End Model
Generalized end-to-end (GE2E) training is based on processing a large number of utterances at once, in the form of a batch that contains $N$ speakers and $M$ utterances from each speaker on average, as is depicted in Figure 1.
2.1 Training Method
We fetch $N \times M$ utterances to build a batch. These utterances are from $N$ different speakers, and each speaker has $M$ utterances. Each feature vector $x_{ji}$ ($1 \le j \le N$ and $1 \le i \le M$) represents the features extracted from speaker $j$'s utterance $i$.
Similar to our previous work [13], we feed the features extracted from each utterance $x_{ji}$ into an LSTM network. A linear layer is connected to the last LSTM layer as an additional transformation of the last frame response of the network. We denote the output of the entire neural network as $f(x_{ji}; \mathbf{w})$, where $\mathbf{w}$ represents all parameters of the neural network (including both LSTM layers and the linear layer). The embedding vector (d-vector) is defined as the L2 normalization of the network output:

$$e_{ji} = \frac{f(x_{ji}; \mathbf{w})}{\|f(x_{ji}; \mathbf{w})\|_2} \qquad (4)$$

Here $e_{ji}$ represents the embedding vector of the $j$th speaker's $i$th utterance. The centroid of the embedding vectors from the $j$th speaker $[e_{j1}, \dots, e_{jM}]$ is defined as $c_j$ via Equation 1.
The similarity matrix $S_{ji,k}$ is defined as the scaled cosine similarities between each embedding vector $e_{ji}$ and all centroids $c_k$ ($1 \le j \le N$, $1 \le i \le M$, and $1 \le k \le N$):

$$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b \qquad (5)$$

where $w$ and $b$ are learnable parameters. We constrain the weight to be positive, $w > 0$, because we want the similarity to be larger when the cosine similarity is larger. The major difference between TE2E and GE2E is as follows:
- TE2E's similarity (Equation 2) is a scalar value that defines the similarity between embedding vector $e_{j\sim}$ and a single tuple centroid $c_k$.
- GE2E builds a similarity matrix (Equation 5) that defines the similarities between each $e_{ji}$ and all centroids $c_k$.
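As a concrete illustration of this difference, the following NumPy sketch builds the GE2E similarity matrix of Equation 5 from a batch of raw network outputs; the batch shape, embedding dimension, and random inputs are toy assumptions.

```python
# Minimal sketch of the GE2E similarity matrix (Eqs. 1, 4, 5).
import numpy as np

N, M, dim = 4, 5, 8                       # toy: speakers, utterances/speaker, d-vector size
rng = np.random.default_rng(0)
net_out = rng.normal(size=(N, M, dim))    # stand-in for the network outputs f(x_ji; w)

e = net_out / np.linalg.norm(net_out, axis=-1, keepdims=True)  # Eq. 4: d-vectors
c = e.mean(axis=1)                                             # Eq. 1: centroids
c = c / np.linalg.norm(c, axis=-1, keepdims=True)              # unit length for cosine
w_scale, b = 10.0, -5.0                   # learnable scale (w > 0) and offset
S = w_scale * np.einsum('jid,kd->jik', e, c) + b               # Eq. 5: S[j, i, k]
print(S.shape)                            # (N, M, N): every e_ji against every c_k
```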
Figure 1 illustrates the whole process with features, embedding vectors, and similarity scores from different speakers, represented by different colors.
During training, we want the embedding of each utterance to be similar to the centroid of all that speaker's embeddings, while at the same time, far from other speakers' centroids. As shown in the similarity matrix in Figure 1, we want the similarity values of the colored areas to be large, and the values of the gray areas to be small. Figure 2 illustrates the same concept in a different way: we want the blue embedding vector to be close to its own speaker's centroid (blue triangle), and far from the other centroids (red and purple triangles), especially the closest one (red triangle). Given an embedding vector $e_{ji}$, all centroids $c_k$, and the corresponding similarity matrix $S_{ji,k}$, there are two ways to implement this concept:
Softmax We put a softmax on $S_{ji,k}$ for $k = 1, \dots, N$ that makes the output equal to 1 iff $k = j$, and 0 otherwise. Thus, the loss on each embedding vector $e_{ji}$ could be defined as:

$$L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}) \qquad (6)$$
This loss function means that we push each embedding vector close to its centroid and pull it away from all other centroids.
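Here is a minimal sketch of this softmax variant (Equation 6), written to consume a similarity matrix S of shape (N, M, N) like the one built above; the stabilized log-sum-exp is a standard implementation choice of this sketch, not something the paper specifies.

```python
# Minimal sketch of the GE2E softmax loss (Eq. 6), summed over all e_ji.
import numpy as np

def ge2e_softmax_loss(S):
    N, M, _ = S.shape
    m = S.max(axis=-1, keepdims=True)                      # for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(S - m).sum(axis=-1))
    pos = S[np.arange(N)[:, None], np.arange(M)[None, :], np.arange(N)[:, None]]
    return (lse - pos).sum()                               # -S_ji,j + logsumexp over k

S = np.random.default_rng(0).normal(size=(4, 5, 4))        # toy similarity matrix
print(ge2e_softmax_loss(S))
```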
Contrast The contrast loss is defined on positive pairs and the most aggressive negative pairs, as:

$$L(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{\substack{1 \le k \le N \\ k \neq j}} \sigma(S_{ji,k}) \qquad (7)$$

where $\sigma(x)$ is the sigmoid function. For every utterance, exactly two components are added to the loss: (1) A positive component, which is associated with a positive match between the embedding vector and its true speaker's voiceprint (centroid). (2) A hard negative component, which is associated with a negative match between the embedding vector and the voiceprint (centroid) with the highest similarity among all false speakers.
In Figure 2, the positive term corresponds to pushing $e_{ji}$ (blue circle) towards $c_j$ (blue triangle). The negative term corresponds to pulling $e_{ji}$ (blue circle) away from $c_k$ (red triangle), because $c_k$ is more similar to $e_{ji}$ compared with $c_{k'}$ (purple triangle). Thus, the contrast loss allows us to focus on difficult pairs of embedding vector and negative centroid.
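For comparison, here is a minimal sketch of the contrast variant (Equation 7) on the same (N, M, N) similarity matrix; masking the true speaker's column with -inf before taking the max is an implementation choice of this sketch.

```python
# Minimal sketch of the GE2E contrast loss (Eq. 7), summed over all e_ji.
import numpy as np

def ge2e_contrast_loss(S):
    N, M, _ = S.shape
    sig = 1.0 / (1.0 + np.exp(-S))                         # sigma(S_ji,k)
    j_idx, i_idx = np.arange(N)[:, None], np.arange(M)[None, :]
    pos = sig[j_idx, i_idx, j_idx]                         # sigma(S_ji,j)
    mask = np.broadcast_to(np.eye(N, dtype=bool)[:, None, :], sig.shape)
    hard_neg = np.where(mask, -np.inf, sig).max(axis=-1)   # hardest k != j
    return (1.0 - pos + hard_neg).sum()

S = np.random.default_rng(0).normal(size=(4, 5, 4))        # toy similarity matrix
print(ge2e_contrast_loss(S))
```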
In our experiments, we find both implementations of GE2E loss are useful: contrast loss performs better for TD-SV, while softmax loss performs slightly better for TI-SV.
In addition, we observed that removing $e_{ji}$ when computing the centroid of the true speaker makes training stable and helps avoid trivial solutions. So, while we still use Equation 1 when calculating negative similarity (i.e., $k \neq j$), we instead use Equation 8 when $k = j$:

$$c_j^{(-i)} = \frac{1}{M - 1} \sum_{\substack{m=1 \\ m \neq i}}^{M} e_{jm} \qquad (8)$$

$$S_{ji,k} = \begin{cases} w \cdot \cos(e_{ji}, c_j^{(-i)}) + b & \text{if}\ k = j; \\ w \cdot \cos(e_{ji}, c_k) + b & \text{otherwise.} \end{cases} \qquad (9)$$

Combining Equations 6, 7, and 9, the final GE2E loss $L_G$ is the sum of all losses over the similarity matrix ($1 \le j \le N$, and $1 \le i \le M$):

$$L_G(\mathbf{x}; \mathbf{w}) = L_G(\mathbf{S}) = \sum_{j,i} L(e_{ji}) \qquad (10)$$
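The following sketch folds the leave-one-out centroid of Equations 8–9 into the similarity-matrix computation; renormalizing the exclusive centroids to unit length mirrors the cosine in Equation 9 and is otherwise an implementation detail of this sketch.

```python
# Minimal sketch of the similarity matrix with the exclusive centroid (Eqs. 8-9).
import numpy as np

def similarity_matrix(e, w=10.0, b=-5.0):
    """e: (N, M, dim) L2-normalized embeddings -> S: (N, M, N)."""
    N, M, _ = e.shape
    c = e.mean(axis=1)                                        # Eq. 1, used for k != j
    c = c / np.linalg.norm(c, axis=-1, keepdims=True)
    S = w * np.einsum('jid,kd->jik', e, c) + b                # Eq. 5
    c_excl = (e.sum(axis=1, keepdims=True) - e) / (M - 1)     # Eq. 8: drop e_ji itself
    c_excl = c_excl / np.linalg.norm(c_excl, axis=-1, keepdims=True)
    j_idx, i_idx = np.arange(N)[:, None], np.arange(M)[None, :]
    S[j_idx, i_idx, j_idx] = w * np.einsum('jid,jid->ji', e, c_excl) + b  # Eq. 9, k == j
    return S

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 5, 8))
e = e / np.linalg.norm(e, axis=-1, keepdims=True)
print(similarity_matrix(e).shape)                             # (4, 5, 4)
```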
2.2 Comparison between TE2E and GE2E
Consider a single batch in the GE2E loss update: we have $N$ speakers, each with $M$ utterances. Each single step update will push all $N \times M$ embedding vectors toward their own centroids, and pull them away from the other centroids.
This mirrors what happens with all possible tuples in the TE2E loss function [13] for each $e_{ji}$. Assume we randomly choose $P$ utterances from each speaker when comparing speakers:

- Positive tuples: $\{e_{ji}, (e_{j,i_1}, \dots, e_{j,i_P})\}$ for $1 \le i_p \le M$ and $i_p \neq i$. There are $\binom{M-1}{P}$ such positive tuples.
- Negative tuples: $\{e_{ji}, (e_{k,i_1}, \dots, e_{k,i_P})\}$ for $k \neq j$ and $1 \le i_p \le M$. For each $e_{ji}$, we have to compare with all other $N - 1$ centroids, where each set of those comparisons contains $\binom{M}{P}$ tuples.
Each positive tuple is balanced with a negative tuple, thus the total number is the maximum number of positive and negative tuples times 2. So, the total number of tuples in the TE2E loss is:

$$2 \times \max\left(\binom{M-1}{P},\ (N-1)\binom{M}{P}\right) \ge 2(N-1) \qquad (11)$$

The lower bound of Equation 11 occurs when $P = M$. Thus, each update for $e_{ji}$ in our GE2E loss is identical to at least $2(N-1)$ steps in our TE2E loss. The above analysis shows why GE2E updates models more efficiently than TE2E, which is consistent with our empirical observations: GE2E converges to a better model in shorter time (see Sec. 3 for details).
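A quick numeric check of Equation 11, using the batch shape N = 64, M = 10 from Sec. 3 as an example:

```python
# Count the TE2E tuples matched by one GE2E update of e_ji (Eq. 11).
from math import comb

N, M = 64, 10                                # batch shape used in Sec. 3
for P in (M // 2, M):
    n_tuples = 2 * max(comb(M - 1, P), (N - 1) * comb(M, P))
    print(f"P={P}: {n_tuples} tuples")       # P = M hits the lower bound 2(N-1) = 126
```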
2.3 Training with MultiReader
Consider the following case: we care about the model application in a domain with a small dataset $D_1$. At the same time, we have a larger dataset $D_2$ in a similar, but not identical, domain. We want to train a single model that performs well on dataset $D_1$, with the help from $D_2$:

$$L(D_1, D_2; \mathbf{w}) = \mathbb{E}_{x \in D_1}[L(x; \mathbf{w})] + \alpha\, \mathbb{E}_{x \in D_2}[L(x; \mathbf{w})] \qquad (12)$$

This is similar to the regularization technique: in normal regularization, we use $\alpha \|\mathbf{w}\|_2^2$ to regularize the model. But here, we use $\alpha\, \mathbb{E}_{x \in D_2}[L(x; \mathbf{w})]$ for regularization. When dataset $D_1$ does not have sufficient data, training the network on $D_1$ can lead to overfitting. Requiring the network to also perform reasonably well on $D_2$ helps to regularize the network.

This can be generalized to combine $K$ different, possibly extremely unbalanced, data sources: $D_1, \dots, D_K$. We assign a weight $\alpha_k$ to each data source, indicating the importance of that data source. During training, in each step we fetch one batch/tuple of utterances from each data source, and compute the combined loss as:

$$L(D_1, \dots, D_K) = \sum_{k=1}^{K} \alpha_k\, \mathbb{E}_{x_k \in D_k}[L(x_k; \mathbf{w})] \qquad (13)$$

where each $L(x_k; \mathbf{w})$ is the loss defined in Equation 10.
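Here is a minimal sketch of one MultiReader training step (Equation 13); the per-source loss function and the alpha weights below are illustrative placeholders, since the paper does not publish its exact weights.

```python
# Minimal sketch of the MultiReader combined loss (Eq. 13):
# one batch per data source per step, weighted by the source's importance alpha_k.
import numpy as np

def multireader_loss(batches, alphas, loss_fn):
    """batches: one batch per source D_k; alphas: importance weight per source."""
    return sum(a * loss_fn(batch) for a, batch in zip(alphas, batches))

rng = np.random.default_rng(0)
batches = [rng.normal(size=(4, 5, 4)) for _ in range(2)]   # e.g., from D1 and D2
toy_loss = lambda S: np.abs(S).sum()                       # stand-in for L_G (Eq. 10)
print(multireader_loss(batches, alphas=[1.0, 0.3], loss_fn=toy_loss))
```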
3 Experiments

In our experiments, the feature extraction process is the same as [6]. The audio signals are first transformed into frames of width 25ms and step 10ms. Then we extract 40-dimension log-mel-filterbank energies as the features for each frame. For TD-SV applications, the same features are used for both keyword detection and speaker verification. The keyword detection system will only pass the frames containing the keyword into the speaker verification system. These frames form a fixed-length (usually 800ms) segment. For TI-SV applications, we usually extract random fixed-length segments after Voice Activity Detection (VAD), and use a sliding window approach for inference (discussed in Sec. 3.2).
Our production system uses a 3-layer LSTM with projection [16]. The embedding vector (d-vector) size is the same as the LSTM projection size. For TD-SV, we use 128 hidden nodes and the projection size is 64. For TI-SV, we use 768 hidden nodes with projection size 256. When training the GE2E model, each batch contains $N = 64$ speakers and $M = 10$ utterances per speaker. We train the network with SGD using initial learning rate 0.01, and decrease it by half every 30M steps. The L2-norm of the gradient is clipped at 3 [17], and the gradient scale for the projection node in the LSTM is set to 0.5. Regarding the scaling factor $(w, b)$ in the loss function, we also observed that a good initial value is $(w, b) = (10, -5)$, and the smaller gradient scale of 0.01 on them helped to smooth convergence.
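As a small illustration of the optimization schedule described above, here is a sketch of the learning-rate decay and global-norm gradient clipping; the helper implementations are generic, not the authors' production code.

```python
# Sketch of the training schedule: SGD at 0.01, halved every 30M steps,
# with the gradient's global L2-norm clipped at 3 (cf. [17]).
import numpy as np

def learning_rate(step, base_lr=0.01, halve_every=30_000_000):
    return base_lr * 0.5 ** (step // halve_every)

def clip_by_global_norm(grads, max_norm=3.0):
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

print(learning_rate(0), learning_rate(45_000_000))         # 0.01 0.005
```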
3.1 Text-Dependent Speaker Verification
Though existing voice assistants usually only support a single keyword, studies show that users prefer that multiple keywords are supported at the same time. For multi-user on Google Home, two keywords are supported simultaneously: “OK Google” and “Hey Google”.
Enabling speaker verification on multiple keywords falls between TD-SV and TI-SV, since the transcript is neither constrained to a single phrase, nor completely unconstrained. We solve this problem using the MultiReader technique (Sec. 2.3). MultiReader has a great advantage compared to simpler approaches, e.g. directly mixing multiple data sources together: it handles the case where different data sources are unbalanced in size. In our case, we have two data sources for training: 1) an "OK Google" training set from anonymized user queries with ~150M utterances and ~630K speakers; 2) a mixed "OK/Hey Google" training set that is manually collected with ~1.2M utterances and ~18K speakers. The first dataset is larger than the second by a factor of 125 in the number of utterances and 35 in the number of speakers.
For evaluation, we report the Equal Error Rate (EER) on four cases: enroll with either keyword, and verify on either keyword. All evaluation datasets are manually collected from 665 speakers with an average of 4.5 enrollment utterances and 10 evaluation utterances per speaker. The results are shown in Table 1. As we can see, MultiReader brings around 30% relative improvement on all four cases.
Table 1: MultiReader vs. directly mixing multiple data sources.

| Test data (Enroll → Verify)  | Mixed data EER (%) | MultiReader EER (%) |
|------------------------------|--------------------|---------------------|
| OK Google → OK Google        | 1.16               | 0.82                |
| OK Google → Hey Google       | 4.47               | 2.99                |
| Hey Google → OK Google       | 3.30               | 2.30                |
| Hey Google → Hey Google      | 1.69               | 1.15                |
We also performed more comprehensive evaluations on a larger dataset collected from ~83K different speakers and environmental conditions, from both anonymized logs and manual collections. We use an average of 7.3 enrollment utterances and 5 evaluation utterances per speaker. Table 2 summarizes the average EER for different loss functions trained with and without the MultiReader setup. The baseline model is a single-layer LSTM with 512 nodes and an embedding vector size of 128 [13]. The model architecture of the second and third rows is the 3-layer LSTM. Comparing the 2nd and 3rd rows, we see that GE2E is about 10% better than TE2E. Similar to Table 1, here we also see that the model performs significantly better with MultiReader. While not shown in the table, it is also worth noting that the GE2E model took about 60% less training time than TE2E.
3.2 Text-Independent Speaker Verification
For TI-SV training, we divide training utterances into smaller segments, which we refer to as partial utterances. While we don't require all partial utterances to be of the same length, all partial utterances in the same batch must be of the same length. Thus, for each batch of data, we randomly choose a time length $t$ within $[lb, ub] = [140, 180]$ frames, and enforce that all partial utterances in that batch are of length $t$ (as shown in Figure 3).
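A minimal sketch of this batching scheme follows; the toy utterance lengths and the 40-dimensional features match the frontend described above, while everything else is an illustrative assumption.

```python
# Minimal sketch of partial-utterance batching for TI-SV training:
# one random length t in [140, 180] frames per batch, every utterance cropped to t.
import numpy as np

rng = np.random.default_rng(0)
utterances = [rng.normal(size=(n, 40)) for n in (220, 305, 190)]  # (frames, 40 mels)

def make_partial_batch(utts, lb=140, ub=180):
    t = int(rng.integers(lb, ub + 1))                 # same length for the whole batch
    starts = [int(rng.integers(0, u.shape[0] - t + 1)) for u in utts]
    return np.stack([u[s:s + t] for u, s in zip(utts, starts)])

print(make_partial_batch(utterances).shape)           # (3, t, 40)
```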
During inference time, for every utterance we apply a sliding window of fixed size $(lb + ub)/2 = 160$ frames with 50% overlap. We compute the d-vector for each window. The final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average (as shown in Figure 4).
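Here is a minimal sketch of this inference procedure; the `embed` function below is a placeholder for the trained LSTM's sequence-to-vector mapping, with its 256-dim output matching the TI-SV projection size given above.

```python
# Minimal sketch of sliding-window inference: 160-frame windows with 50% overlap,
# one d-vector per window, L2-normalized and then averaged element-wise.
import numpy as np

rng = np.random.default_rng(0)

def embed(window):
    """Placeholder for the LSTM mapping f(x; w); returns a toy 256-dim vector."""
    return rng.normal(size=256)

def utterance_dvector(frames, win=160, overlap=0.5):
    step = int(win * (1 - overlap))                    # 80-frame hop
    d = [embed(frames[s:s + win]) for s in range(0, len(frames) - win + 1, step)]
    d = [v / np.linalg.norm(v) for v in d]             # L2-normalize each window
    return np.mean(d, axis=0)                          # element-wise average

print(utterance_dvector(np.zeros((400, 40))).shape)    # (256,)
```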
Our TI-SV models are trained on around 36M utterances from 18K speakers, which are extracted from anonymized logs. For evaluation, we use an additional 1000 speakers with an average of 6.3 enrollment utterances and 7.2 evaluation utterances per speaker. Table 3 shows the performance comparison between different training loss functions. The first column is a softmax that predicts the speaker label for all speakers in the training data. The second column is a model trained with TE2E loss. The third column is a model trained with GE2E loss. As shown in the table, GE2E performs better than both softmax and TE2E. The EER performance improvement is larger than 10%. In addition, we also observed that GE2E training was about 3× faster than the other loss functions.
4 Conclusions

In this paper, we proposed the generalized end-to-end (GE2E) loss function to train speaker verification models more efficiently. Both theoretical and experimental results verified the advantage of this novel loss function. We also introduced the MultiReader technique to combine different data sources, enabling our models to support multiple keywords and multiple languages. By combining these two techniques, we produced more accurate speaker verification models.
References

- [1] Yury Pinsky, "Tomato, tomahto. Google Home now supports multiple users," https://www.blog.google/products/assistant/tomato-tomahto-google-home-now-supports-multiple-users, 2017.
- [2] Mihai Matei, "Voice Match will allow Google Home to recognize your voice," https://www.androidheadlines.com/2017/10/voice-match-will-allow-google-home-to-recognize-your-voice.html, 2017.
- [3] Tomi Kinnunen and Haizhou Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
- [4] Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, Javier Ortega-García, Dijana Petrovska-Delacrétaz, and Douglas A Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 2004, pp. 430–451, 2004.
- [5] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.
- [6] Rohit Prabhavalkar, Raziel Alvarez, Carolina Parada, Preetum Nakkiran, and Tara N Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4704–4708.
- [7] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
- [8] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4052–4056.
- [9] Yu-hsin Chen, Ignacio Lopez-Moreno, Tara N Sainath, Mirkó Visontai, Raziel Alvarez, and Carolina Parada, "Locally-connected and convolutional neural networks for small footprint speaker recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- [10] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.
- [11] Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, and Yifan Gong, "End-to-end attention based text-dependent speaker verification," CoRR, vol. abs/1701.00562, 2017.
- [12] Seyed Omid Sadjadi, Sriram Ganapathy, and Jason W. Pelecanos, "The IBM 2016 speaker recognition system," CoRR, vol. abs/1602.07291, 2016.
- [13] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.
- [14] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [15] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
- [16] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- [17] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.