A trend in speaker recognition has focused on building systems utilizing fixed-size, compact speaker embeddings. Examples include i-vectors [dehak_2011_1], x-vectors [snyder_2018_1], d-vectors [heigold_2016_1, wan_2018_1], residual network variants [desplanques_2020_1] and adapted network representations [tan2021_1]. These embeddings are used directly for cosine-based scoring as a similarity measure, or are included as part of a backend. The backend is trained either separately or jointly with the embedding network. For example, a PLDA model [prince_2007_1] was trained separately as part of a backend to score x-vectors [snyder_2018_1]. In contrast, there are examples of jointly training an embedding extractor with a PLDA backend [silnova_2020_1] or a decision network [pelecanos2021_1].
With the introduction of neural network Transformers [vaswani2017_1], the speaker recognition community has considered how this work may be leveraged [tong_2020_1, safari_2020_1, mary_2020_1, wang_2022_1]. Typically, the transformer encoder is involved in generating a speaker embedding. A key element of the Transformer model is the attention mechanism. In this work, we are interested in how attention can be used to compare utterance embeddings.
Interestingly, some work has previously examined how attention could be applied as part of the backend in an end-to-end trained system. For example, in [Li2020TextIndependentSV], the authors utilize a mutual-attention approach to perform a weighted combination of enrollment frames given an intermediate test utterance embedding, and vice versa. To compute the speaker similarity, both the enrollment and test utterances are required, rather than just their speaker embeddings. For the work in [Jung2021graphatt], the authors extract embeddings from short audio chunks and then use stacked modules of attention modeling to compute the final similarity score. Both approaches require scoring parameters to be estimated as part of the training process.
From an engineering perspective, having a single neural network that computes the speaker embedding, without the need for additional parameters or models to compute the scores, simplifies the deployment of a speaker recognition system. It can also reduce risk related to version control [wang2020version] by avoiding the need to handle separate embedding generation and scoring networks.
Inspired by the Transformer work [vaswani2017_1], we propose the use of a parameter-free attentive scoring mechanism. An open source attentive scoring implementation based on Lingvo [shen2019lingvo] is provided at: https://github.com/google/speaker-id/tree/master/attentive_scoring. To the best of our knowledge, this is the first paper to perform attentive scoring using a parameter-free comparison of speaker embeddings. In contrast to [Li2020TextIndependentSV], this approach does not need the original recordings once the speaker embeddings are generated. In addition, we perform a study of how performance changes with different types of feature normalization (Layer norm [ba2016_1], L2-norm, and no normalization) and modeling complexity (i.e., the number of keys). We also evaluate scenarios with independent and tied queries/keys, as well as an approach for handling many enrollment utterances.
2 Parameter-free attentive scoring
2.1 Core approach
In this section, we detail an attentive scoring mechanism based on the work of [vaswani2017_1]. First, we assume that a jointly trained embedding network generates the fixed-dimensional speaker representations for both enrollment and test utterances. A test utterance representation is composed of $N_q$ query vectors $\{q_j\}_{j=1}^{N_q}$ and corresponding value vectors $\{\tilde{v}_j\}_{j=1}^{N_q}$. For enrollment (potentially with multiple utterances), there are $N_k$ key vectors $\{k_i\}_{i=1}^{N_k}$ and corresponding value vectors $\{v_i\}_{i=1}^{N_k}$. Keys and queries are of dimension $d_k$, while values are of dimension $d_v$. Let a parameter $\alpha$ represent a scaling factor (or temperature parameter [Hinton2015DistillingTK]) for the softmax function; it can be jointly trained or directly specified as a hyperparameter (for example, [vaswani2017_1] set $\alpha = 1/\sqrt{d_k}$). The parameter-free (apart from $\alpha$) attentive score $s$ may be calculated as follows:
$$s = \sum_{j=1}^{N_q} \sum_{i=1}^{N_k} w_{ij}\, (\tilde{v}_j \cdot v_i) \quad (1)$$
$$w_{ij} = \frac{\exp(\alpha\, q_j \cdot k_i)}{\sum_{i'=1}^{N_k} \exp(\alpha\, q_j \cdot k_{i'})} \quad (2)$$
Here $(\cdot)$ represents the dot product.
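As a concrete illustration, the scoring in Equations 1 and 2 can be sketched in NumPy. The array layout and the helper name are our own assumptions for illustration, not part of any released implementation:

```python
import numpy as np

def attentive_score(q, k, v_test, v_enroll, alpha=1.0):
    """Parameter-free attentive score (sketch).

    q: (Nq, dk) test queries; k: (Nk, dk) enrollment keys;
    v_test: (Nq, dv) test values; v_enroll: (Nk, dv) enrollment values;
    alpha: softmax scaling factor (temperature).
    """
    logits = alpha * (q @ k.T)                    # (Nq, Nk) query-key dot products
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # softmax over keys, per query
    sims = v_test @ v_enroll.T                    # (Nq, Nk) value dot products
    return float((w * sims).sum())                # weighted combination
```

With a single key and query, the softmax weight is 1 and the score reduces to a plain dot product of the two value vectors.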
The scoring function can be interpreted simply as a weighted combination of dot products across embeddings (i.e. the value vectors), where the weights are determined using the softmax calculation. In addition, the softmax weight function enables value vectors to be compared across multiple key/query indexes with an emphasis on the most relevant pairings. We found that normalization is important for enhancing performance and we discuss this next.
2.2 Normalization
In this work we examine different types of normalization and evaluate their effectiveness in the experiments section.
2.2.1 Layer normalization
One approach is Layer Normalization [ba2016_1], where the features across a single example are transformed to have zero mean and unit standard deviation. This is followed by the application of a per-element bias and gain. While there are other possible configurations, we apply Layer Normalization at the utterance representation level.
2.2.2 Query, key and value normalization
In Equations 1 and 2, the key and value vectors may be used as is or they can each be normalized. One approach we assess is to apply L2 length normalization to the query, key and value vectors. If the value vectors are L2 length normalized, the attentive scoring function effectively calculates a weighted combination of cosine similarity scores.
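A minimal sketch of this per-vector L2 normalization (the helper name is ours): when the rows of a value matrix are normalized this way, the value dot products in Equation 1 become cosine similarities.

```python
import numpy as np

def l2_normalize_rows(x, eps=1e-12):
    # Scale each row vector to unit Euclidean length; eps guards against
    # division by zero for all-zero rows.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)
```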
2.2.3 Global length normalization
Equation 1 can be interpreted as a dot product between enrollment and test value vector representations, with the softmax weight $w_{ij}$ split as $\sqrt{w_{ij}}$ between enrollment and test. Let $u^{(e)}$ be the vector formed by stacking the blocks $\sqrt{w_{ij}}\, v_i$ and $u^{(t)}$ the vector formed by stacking the blocks $\sqrt{w_{ij}}\, \tilde{v}_j$, for $i = 1, \dots, N_k$ and $j = 1, \dots, N_q$. The score can be calculated as follows:
$$s = u^{(e)} \cdot u^{(t)} \quad (3)$$
We can apply L2 length normalization to these representations to obtain a result that relates to a cosine or correlation calculation for the concatenated vector representations $u^{(e)}$ and $u^{(t)}$. The normalized score may be calculated as:
$$s_{\text{norm}} = \frac{u^{(e)} \cdot u^{(t)}}{\sqrt{\|u^{(e)}\|^2}\, \sqrt{\|u^{(t)}\|^2}} \quad (4)$$
Here, $\|u^{(e)}\|^2$ and $\|u^{(t)}\|^2$ represent the squares of the Euclidean norms of the corresponding vectors $u^{(e)}$ and $u^{(t)}$, with $\|u^{(e)}\|^2 = \sum_{i,j} w_{ij}\, \|v_i\|^2$ and $\|u^{(t)}\|^2 = \sum_{i,j} w_{ij}\, \|\tilde{v}_j\|^2$. This normalized dot product is evaluated in the experiments as the Global L2-norm system.
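Under this interpretation, the Global L2-norm score can be computed without explicitly forming the stacked vectors, since the squared norms decompose into weighted sums of per-vector squared norms. A sketch under our own naming assumptions:

```python
import numpy as np

def global_l2_attentive_score(q, k, v_test, v_enroll, alpha=1.0):
    # Softmax attention weights over enrollment keys, one row per test query.
    logits = alpha * (q @ k.T)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                       # (Nq, Nk)
    num = (w * (v_test @ v_enroll.T)).sum()                 # u_t . u_e
    # Squared norms of the stacked, sqrt(w)-weighted value representations.
    sq_e = (w * (np.linalg.norm(v_enroll, axis=1) ** 2)[None, :]).sum()
    sq_t = (w * (np.linalg.norm(v_test, axis=1) ** 2)[:, None]).sum()
    return float(num / np.sqrt(sq_e * sq_t))
```

For identical enrollment and test representations with a single key and query, the score is 1, as expected for a cosine-style similarity.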
2.3 Implementation considerations
This section covers the implementation considerations related to attentive scoring.
2.3.1 Generating keys, queries and values
First we discuss how to generate the key/query/value vector representations from the speaker embedding network. A speaker embedding network based on Conformers [gulati2020conformer] is used to generate a fixed dimensional representation. This is followed by a linear layer (without activation) where the number of output nodes is equivalent to the number of parameters needed to estimate the key/query/value vectors. The relevant vectors are unpacked from the speaker representation generated by the embedding network. Figure 1a shows how a packed speaker representation containing 2 query-key-value vectors is unpacked into independent queries (2 dimensions), keys (2 dimensions), and values (3 dimensions). Figure 1b shows how a query-key-value vector is unpacked when the queries and keys have values that are tied (i.e. they are the same).
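The unpacking in Figure 1 amounts to simple slicing of the network's packed output vector. A sketch using the dimensions from the figure; the [queries | keys | values] packing order and the helper name are our assumptions:

```python
import numpy as np

def unpack_qkv(rep, num=2, dk=2, dv=3, tied=False):
    # rep: packed 1-D representation from the embedding network's linear layer.
    rep = np.asarray(rep)
    off = 0
    if tied:
        # Queries and keys share the same values (as in Figure 1b).
        q = k = rep[off:off + num * dk].reshape(num, dk)
        off += num * dk
    else:
        # Independent queries and keys (as in Figure 1a).
        q = rep[off:off + num * dk].reshape(num, dk)
        off += num * dk
        k = rep[off:off + num * dk].reshape(num, dk)
        off += num * dk
    v = rep[off:off + num * dv].reshape(num, dv)
    return q, k, v
```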
There is an important clarification regarding enrollment utterances and the effective number of keys/queries/values. As presented in Figure 1, the effective number of enrollment keys (and similarly queries and values) for a speaker is the number of enrollment utterances for the speaker multiplied by the number of keys per utterance. Since there is only one test utterance involved in a trial, the test side uses the per-utterance number of queries. In the experiments there are up to 6 enrollment utterances per speaker. So, for the case of 6 enrollment segments with $N_k$ keys per utterance, the effective number of enrollment keys would be equal to $6 N_k$.
2.3.2 Setting the softmax temperature
One clarification is how the softmax scaling parameter $\alpha$ is set. In the attention paper [vaswani2017_1], it is suggested that the softmax scaling parameter ($\alpha$ in Equation 2) can be specified manually as $1/\sqrt{d_k}$. It is a scaling factor (or temperature) that governs how much the highest score dominates the softmax function. In this work, we explicitly train the scaling factor, which can be included with the output of the speaker embedding network if required.
2.3.3 Handling many enrollment utterances
Another consideration is the handling of many enrollment utterances, since an embedding is generated and kept for each utterance. Having many enrollment utterances can make the size of the effective speaker representation significantly larger. In practical applications, we aim to keep this size reasonable. Approaches to address this include hierarchical clustering, variability modeling and, most straightforwardly, a simple average of the enrollment utterance representations. We examine the performance of simple averaging in the experiments in Section 4.5.
3 System description
This section covers a description of the system used to generate the utterance embeddings, details on the attentive scoring approach and information on system training. First we discuss how the utterance embeddings are generated (see Figure 2).
The feature extraction front-end essentially consists of calculating log Mel-filterbank energy feature vectors which are then stacked. More specifically, given the speech signal, automatic gain control (AGC) is first applied[prabhavalkar2015automatic]. This is followed by framing into 32ms Hanning windows with a 10ms frame shift. For each frame, 128 log Mel-filterbank energies are calculated across a 125-7500Hz bandwidth. The final features consist of 4 stacked frames sampled every 3 frames. The resulting features are 512 dimensions with a 30ms frame shift.
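The frame stacking step can be sketched as follows; the function name and the edge handling (dropping incomplete trailing stacks) are our assumptions:

```python
import numpy as np

def stack_frames(feats, stack=4, stride=3):
    # feats: (T, 128) log Mel-filterbank features at a 10 ms frame shift.
    # Concatenate `stack` consecutive frames, sampling every `stride` frames,
    # yielding 512-dimensional features at an effective 30 ms shift.
    last_start = feats.shape[0] - stack + 1
    stacked = [np.concatenate(feats[t:t + stack])
               for t in range(0, last_start, stride)]
    return np.stack(stacked)
```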
These features are fed into a stack of conformer layers [gulati2020conformer]. Our system uses a stack of 12 conformer layers with a native dimensionality of 512. Each layer has a relative positional embedding dimension of 512, and the attention mechanisms consist of 4 heads. The convolutional component layers span 32 elements. The conformer network generates a 512-dimensional output for each input frame. Finally, we perform a stack by 2 with a stride of 2 after the third conformer layer, and insert a non-linear projection layer with an output dimension of 512 after the fourth conformer layer. To handle the ‘stack by 2’ processing, we stream packets of 2 frames at inference time.
An attentive temporal pooling (weighted averaging) layer [wang2022attentive] is applied to the output of the conformer network to aggregate the first and second order statistics over the duration of the recording. The weighted mean and standard deviation statistics of the conformer outputs are calculated. The weight is a 0 to 1 value generated for each frame by taking a linear combination of the conformer outputs, adding a bias term, and passing it through a sigmoid non-linearity. These per-sample weights are used to generate the running mean and standard deviation terms which are then concatenated to give 1024 dimensional features for each frame. In many situations, only the last frame of the model output is used, but intermediate statistics are available if they are needed for various streaming applications. Other works have considered related approaches [Chowdhury2018_1, zhu2018_1].
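A sketch of the attentive statistics pooling described above, computed in batch form over a whole utterance rather than as running statistics (the parameter names are ours):

```python
import numpy as np

def attentive_stats_pooling(x, w_vec, b=0.0):
    # x: (T, D) conformer outputs; w_vec: (D,) and b: scalar are the learned
    # linear-combination parameters producing a per-frame sigmoid weight.
    a = 1.0 / (1.0 + np.exp(-(x @ w_vec + b)))   # (T,) weights in (0, 1)
    a = a / a.sum()                               # normalize for weighted stats
    mean = a @ x                                  # weighted mean, (D,)
    var = a @ ((x - mean) ** 2)                   # weighted variance, (D,)
    std = np.sqrt(np.maximum(var, 1e-12))
    return np.concatenate([mean, std])            # (2D,) pooled statistics
```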
Given the cumulative mean and standard deviation statistics at the final frame, the system applies an affine transformation followed by a ReLU [nair2010_1, Fukushima1980_1] non-linearity to give 512 output dimensions. This is followed by a linear transformation that has a range of output nodes depending on the number of required parameters. For example, for the cosine calculation, the numbers of output nodes tested were 256, 512 and, although not necessary, 2,304. For the attentive scoring approach, there were up to 36,864 output nodes.
The utterance embedding extractor is used to generate representations for both enrollment and test utterances. These representations are compared using the cosine similarity (as a baseline) or the attentive scoring mechanism to generate a speaker similarity score. During training, these scores are scaled (using a trained scaling parameter) and are optimized according to a generalized end-to-end extended-set softmax loss [pelecanos2021_1]. Optimization is performed across randomly generated mini-batches, where each mini-batch compares the statistics of 128 utterances from 16 speakers. More details are available in [pelecanos2021_1, wan_2018_1]. In this work, the system is trained with Adam optimization [Kingma2015_1] and uses a warm-up (ramp-up) schedule of 50k steps before allowing the learning rate to decrease until it reaches 500k training steps (a similar approach was used in [vaswani2017_1]).
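The warm-up schedule can be sketched as below; the peak learning rate and the inverse-square-root decay shape are assumptions in the spirit of [vaswani2017_1], not values reported in this work:

```python
def learning_rate(step, peak=1e-3, warmup=50_000):
    # Linear ramp-up over the first `warmup` steps, then decay with an
    # inverse-square-root profile (hypothetical peak value for illustration).
    if step < warmup:
        return peak * step / warmup
    return peak * (warmup / step) ** 0.5
```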
4 Experiments
In this section, we describe the experimental data and perform various experiments examining the effect of normalization, query/key configuration, the number of keys, and enrollment utterance consolidation.
4.1 Training and evaluation data
Table 1 caption: Includes vendor collected data from languages outside of those listed, as well as LibriVox, CN-Celeb, and LDC sourced data.
The training set in Table 1 consists mostly of vendor collected speech queries from different language varieties using devices such as laptops and cell phones. It also includes LibriVox [librivox_2020_1], CN-Celeb [fan_2020_1] and LDC sourced data (Fisher [Cieri2004_1], Mixer 4/5 [Brandschain2020_1], TIMIT [garofolo_1993_1]).
For system training, we apply data augmentation techniques based on noise and room simulation effects [lippmann1987multi, ko2017study, kim2017generation]. Similar augmentation techniques [garcia-romero_2012_1, lei_2012_1, avila_2014_1, snyder_2018_1, huang_2019_1] were previously used for speaker recognition. Noise is added to the training utterances with an SNR ranging from 3dB to 15dB. The signal amplitude is also varied from a scale of 0.01x to 1.2x.
Performance is evaluated by averaging the Equal Error Rate (EER) across the 9 language varieties listed in Table 1. We evaluate performance across two dimensions. The first is whether speaker enrollment consists of a single utterance or multiple (up to 6) utterances. For the single utterance enrollment task, we select the same trials as in the multiple enrollment utterance task, except that only the first enrollment utterance is chosen for enrollment. It is expected that the multiple enrollment case should have a significantly lower EER than the single enrollment result. The second factor considers if the speech is clean (the original data) or noisy. The noisy data is simply the clean test segment data (not the enrollment segments) with previously unseen noise (at 3-15dB SNR) and reverberation effects applied.
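For reference, the EER for a single trial list can be estimated with a simple threshold sweep. This is a basic sketch of the standard metric (production toolkits typically use ROC-based interpolation instead):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep every observed score as a threshold and return the operating
    # point where false-accept and false-reject rates are closest.
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for thr in np.sort(np.concatenate([tgt, non])):
        frr = np.mean(tgt < thr)    # targets rejected below the threshold
        far = np.mean(non >= thr)   # non-targets accepted at/above it
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```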
4.2 Embedding normalization approaches
Table 2 compares the cosine similarity baseline systems with different attentive scoring configurations using tied keys and queries. Empirically, we found that the best cosine similarity result was achieved by calculating the cosine similarity between the test utterance representation and the average of the L2-normalized enrollment utterance representations. For cosine similarity scoring we compare three results. The first has an output embedding size of 256. We also include the result for an embedding size of 512, which is the maximum appropriate embedding size given a native conformer model dimension of 512. Unlike attentive scoring (which is a non-linear scoring function), there should be no additional benefit to increasing dimensionality past the native dimensionality when using a linear transform followed by cosine similarity. This is supported by the third row of cosine results, which has an output dimension of 2,304. Across the 3 baseline systems, apart from one EER result, all numbers are comparable and have a very similar task average.
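The best cosine baseline described above (averaging the L2-normalized enrollment representations before a cosine comparison) can be sketched as follows; the helper name is ours:

```python
import numpy as np

def cosine_baseline_score(test_emb, enroll_embs, eps=1e-12):
    # Average the L2-normalized enrollment embeddings into a speaker model,
    # then score with cosine similarity against the test embedding.
    def unit(x):
        return x / max(np.linalg.norm(x), eps)
    model = np.mean([unit(e) for e in enroll_embs], axis=0)
    return float(np.dot(unit(test_emb), unit(model)))
```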
We also compare the baseline systems to attention models with different query/key/value normalization schemes. The first attention result is without normalization (i.e. just using the raw linear layer outputs). The next result considers the use of Layer Normalization [ba2016_1]. In this implementation, we apply layer normalization on a per utterance representation basis. The results may be different if normalization was applied on a per-key and per-value vector basis. The next two rows of results (Key & Value L2-Norm) explore the utility of applying L2 length normalization of both queries/keys and values. There is some improvement observed for both the 8 and 32 key-value pairs over previous rows of results. However, even better results (last two rows, Key & Global L2-Norm) are achieved when each key vector is first L2-normalized followed by applying Global L2-normalization (see Section 2.2.3 for details). Of these two sets of results, the best result is for 32 key-value pairs and represents a 10% relative improvement in average task EER over the best cosine similarity baseline.
4.3 Tied versus independent query-key estimation
In this section (see Table 3) we compare performance differences between a system where the queries and keys are identical and a system where queries are estimated independently of the keys. The results are mixed and may depend on the particular configuration of the queries-keys-values (i.e., fewer/more queries or keys). For the Key & Value L2-Norm (A) system, with only 8 queries/keys, independent query and key estimation provides an improvement in 3 of the 4 cases. For the Key & Global L2-Norm (B) system, with significantly more queries/keys at 32, tying the queries and keys is consistently better. Having independent queries and keys allows the model to set each query and its corresponding key to be very different, resulting in increased flexibility in the final scoring function. With tied queries and keys, the attentive scoring function does not have this flexibility: it is forced to give high scores to speaker representations that are identical, even if they are derived from recordings containing only noise.
4.4 Varying the number of keys
It is also helpful to understand performance across different numbers of keys. Table 4 shares performance numbers as the number of keys is increased from 1 and doubled up to 128 keys. With our non-linear scoring function, we may be able to improve performance by increasing the dimensionality of the output representation beyond the native model dimensionality of 512. As highlighted earlier, this would not be the case for the regular cosine similarity. In examining the results, the performance improves as the number of keys is increased. The best task average (overall) result is reached at 64 keys.
Table 4 column headers: Value, Single Enroll EER (%), Multi Enroll EER (%), Task Average.
4.5 Managing many enrollment utterances
For the proposed attentive scoring models, one embedding is generated for each utterance in enrollment. For the case where only a few utterances are involved, this is a non-issue. For situations where there is a large number of utterances for speaker enrollment (such as when adaptation is allowed), there is the practical consideration of storing many enrollment representations for each speaker. There are several approaches to this problem. A simple approach is to average the utterance embeddings. In these experiments we average before applying further processing such as normalization. Table 5 presents three sets of results. The first row of results relates to the Key & Value L2-Norm (A) result (from Table 2). This represents the regular attentive scoring system where all enrollment utterances are used jointly in both system training and evaluation. The second row of results is determined by using the same system as before except during evaluation the system uses the mean of the enrollment utterances and considers it as a single enrollment representation. The last row of results involves the mean being calculated for both system training and evaluation. Given that all systems are trained using multiple enrollment utterances for a speaker, it is interesting to note that either enrollment utterance averaging approach is reasonable for the Multi Enroll tasks. However, if averaging is done as part of training, a noteworthy performance decrease is observed for the Single Enroll tasks.
5 Conclusions
We proposed a parameter-free attentive scoring approach to meet the objectives of improving performance while using a relatively simple (parameter-free) scoring mechanism. We evaluated different normalization techniques, and the results suggest that applying per query/key L2-normalization followed by global L2-normalization was the most effective. We also observed that using independently estimated queries and keys (versus tied queries and keys) gave mixed results; the approach may be helpful depending on the evaluation data and the attentive scoring configuration. Furthermore, we evaluated the key and value L2-norm system across different numbers of keys and found that selecting 64 keys gave the best task average EER. This type of system was also shown to work well when scoring trials using a single average representation of multiple enrollment utterances.
Future work could consider other vector normalization techniques and query-key-value configurations. Another path is to consider better approaches for converting from frame-based network outputs to the final fixed-dimensional representations. Currently we capture (single-head) attention-based mean and standard deviation statistics; however, frame-based pooling using multi-headed attention could be an appropriate extension.