
Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

02/24/2022
by Rajeev Rikhye, et al.

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.



1 Introduction

Speaker-conditioned speech models are a class of speech models that are conditioned on a target speaker embedding, allowing the model to produce personalized outputs. For example, in personalized speaker separation, prior knowledge of a target speaker's voice profile is used to suppress overlapping speech from non-target speakers [Wang2019, Wang2020, wang2018deep, zmolikova2017speaker, vzmolikova2017learning, delcroix2018single, xu2020spex]. In personalized Automatic Speech Recognition (ASR), a speaker's voice profile is also used to improve the overall recognition accuracy [he2018streaming, bellpasr, denisov2019end, shi2021improving]. Additionally, in personalized Voice Activity Detection (VAD), the target speaker profile is used to determine when the target speaker begins or stops talking, which in turn improves the accuracy of downstream components such as ASR [ding2019personal].

While beneficial, speaker-conditioned speech models are often limited to only a single enrolled user. This makes them incompatible with many devices, such as smart displays and smart speakers, which currently support multiple users [multiuser]. One naive approach to mitigate this would be to run multiple passes of the same model, one pass for each enrolled user. This approach is, however, computationally expensive and unacceptable for on-device applications. As such, extending speaker-conditioned models to support multiple users remains an open and relevant problem [kanda2020joint, han2020continuous].

To overcome this limitation, we previously described an attention-based [vaswani2017attention] speaker selection mechanism, and extended VoiceFilter-Lite to support an arbitrary number of enrolled users [rikhye2021multiuser] (see Fig. 1). This multi-user VoiceFilter-Lite model significantly reduces speech recognition Word Error Rate (WER) and speaker verification Equal Error Rate (EER) when the input audio contains overlapping speech. We also demonstrated how this multi-user VoiceFilter-Lite model is critical to a personalized keyphrase detection system on shared devices [rikhye2021personalized, rikhye2021multiuser] by reducing the false rejection rate caused by speaker mis-identification. Although its performance is promising, the original multi-user model suffered from two issues. First, on single-user evaluations, the multi-user model had worse performance than a single-user model on both speech recognition and speaker verification tasks. This is undesirable, as we do not wish to degrade performance on devices with a single enrolled user. Second, the original multi-user model seemed to overfit the training data and failed to generalize well to unseen combinations of enrolled users. These limitations raise serious concerns regarding the deployment of the multi-user VoiceFilter-Lite model in production environments.

In this paper, we focus on addressing these limitations by exploring variations of each component of the multi-user VoiceFilter-Lite model, and develop a new version of the model that closes the performance gap on single-user evaluations. In summary, the original contributions of this paper include:

  1. We introduce a dual learning rate scheduler where the AttentionNet is independently trained with a learning rate that is an order of magnitude smaller than the VoiceFilterNet. Experiments in Section 4.2 show that the dual learning rate scheduler prevents the AttentionNet from overfitting and significantly improves model quality.

  2. We introduce FiLM [perez2017film, o2021conformer, narayanan2021cross] as an efficient way to condition the VoiceFilterNet on the attended embedding. Doing so reduces the model size from 3.47 MB to 3.23 MB, and significantly improves the performance of the model on a speaker verification task, as shown in Section 4.3.

  3. As a complement to the original multi-user VoiceFilter-Lite paper [rikhye2021multiuser], we carefully compared different implementations of aggregating multiple enrolled speaker embeddings into a single embedding, and confirmed that the attention mechanism is critical to the performance, as shown in Section 4.1.

  4. Using a combination of the best practices from above, the new multi-user VoiceFilter-Lite model performs identically to the single-user model when there is only one enrolled user, and at the same time significantly reduces speaker verification EER when there are multiple enrolled users. The resulting model meets the quality bar for deployment to production environments.

2 Methods

2.1 Review of VoiceFilter-Lite

VoiceFilter-Lite is a targeted voice separation model for streaming, on-device automatic speech recognition (ASR) [Wang2020], as well as text-independent speaker verification (TI-SV) [rikhye2021personalized]. It assumes that the target speaker has completed an offline enrollment process [enrollmentblog, wang2020version], which uses a speaker recognition model to produce an aggregated embedding vector $\mathbf{e}$ that represents the voice characteristics of this speaker. In this work, we use the d-vector embedding [wan2018generalized] trained with the generalized end-to-end extended-set softmax loss [pelecanos2021dr] as the speaker embedding.

Let $\mathbf{x}^{(t)}$ be the input feature frame at time $t$ from the speech to be processed. This feature is first frame-wise concatenated with the d-vector $\mathbf{e}$, then fed into an LSTM network [hochreiter1997long] followed by a fully connected neural network to produce a mask $\mathbf{M}^{(t)}$:

$\mathbf{M}^{(t)} = \mathrm{FC}\big(\mathrm{LSTM}\big(\mathrm{Concat}(\mathbf{x}^{(t)}, \mathbf{e})\big)\big)$    (1)

At runtime, the mask $\mathbf{M}^{(t)}$ is element-wise multiplied with the input $\mathbf{x}^{(t)}$ to produce the final enhanced features. Separately, we also use another LSTM-based neural network followed by a fully connected layer to estimate the noise type (either overlapping or non-overlapping speech) from the input $\mathbf{x}^{(t)}$. This noise type prediction is then used during inference to deactivate the VoiceFilter-Lite model when the input frame contains no overlapping speech. For more details, we refer the reader to [Wang2020].
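
To make the runtime behavior concrete, the following minimal sketch shows the masking and deactivation logic described above. It is an illustration rather than the actual implementation: the frame dimension, the enhance_frame helper, and the 0.5 decision threshold are assumptions.

```python
import numpy as np

def enhance_frame(x_t, mask_t, p_overlap, threshold=0.5):
    """Apply the predicted mask only when the noise-type predictor indicates
    overlapping speech; otherwise pass the frame through unchanged.
    The 0.5 threshold is an illustrative choice, not a value from the paper."""
    if p_overlap >= threshold:
        return mask_t * x_t   # element-wise suppression of non-target speech
    return x_t                # VoiceFilter-Lite effectively deactivated

rng = np.random.default_rng(0)
x_t = rng.random(512)       # one input feature frame (dimension assumed)
mask_t = rng.random(512)    # mask from Eq. (1), values in [0, 1]
enhanced = enhance_frame(x_t, mask_t, p_overlap=0.9)
```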

We have previously demonstrated that the VoiceFilter-Lite model is an important component of text-independent speaker verification [rikhye2021personalized]. In particular, in the presence of overlapping background speech (the multitalker scenario), speaker verification tends to fail. Adding VoiceFilter-Lite to the feature frontend of speaker verification helps to suppress overlapping speech, which improves the accuracy of target speaker verification and, in turn, reduces the false rejection rate of personalized keyphrases. For more details, we refer the reader to [rikhye2021personalized].

2.2 Review of multi-user VoiceFilter-Lite

Figure 1: Overall architecture of the multi-user VoiceFilter-Lite model proposed in [rikhye2021multiuser]. This model comprises two parts: an AttentionNet, which computes the most relevant speaker from a noisy frame, and a VoiceFilterNet, which is identical to the single-user VoiceFilter-Lite model [Wang2020].

To extend the VoiceFilter-Lite model to support multiple enrolled users, we added an AttentionNet to the VoiceFilter-Lite model, as illustrated in Fig. 1. This AttentionNet uses an attention mechanism to compute the most relevant speaker embedding, given an input frame $\mathbf{x}^{(t)}$, over an inventory of multiple speaker embeddings $\{\mathbf{e}_1, \dots, \mathbf{e}_K\}$.

The AttentionNet comprises two parts: the PreNet and the ScorerNet. The PreNet is a stack of three LSTM layers that computes, for each frame, a compressed representation of the features in the stacked filterbank, referred to as the key vector $\mathbf{k}^{(t)}$:

$\mathbf{k}^{(t)} = \mathrm{PreNet}\big(\mathbf{x}^{(t)}\big)$    (2)

This compressed representation is then individually combined with each of the enrolled speaker embeddings in the ScorerNet to generate a score $s_i^{(t)}$ for each enrolled speaker. The attention weights $\mathbf{a}^{(t)}$ are the softmax over these scores:

$s_i^{(t)} = \mathrm{ScorerNet}\big(\mathbf{k}^{(t)}, \mathbf{e}_i\big), \quad i = 1, \dots, K$    (3)
$\mathbf{a}^{(t)} = \mathrm{softmax}\big(s_1^{(t)}, \dots, s_K^{(t)}\big)$    (4)

Finally, the attended embedding $\mathbf{e}_{\mathrm{att}}^{(t)}$ is the dot product of these attention weights and the matrix of enrolled speaker embeddings. In this way, the ScorerNet selects, among the $K$ enrolled speaker embeddings, the one that is most relevant to the compressed representation, and therefore the most probable speaker in that frame:

$\mathbf{e}_{\mathrm{att}}^{(t)} = \sum_{i=1}^{K} a_i^{(t)} \, \mathbf{e}_i$    (5)

This attended embedding is used as the conditioning input to the VoiceFilterNet, which is identical to the original VoiceFilter-Lite described previously:

$\mathbf{M}^{(t)} = \mathrm{FC}\big(\mathrm{LSTM}\big(\mathrm{Concat}(\mathbf{x}^{(t)}, \mathbf{e}_{\mathrm{att}}^{(t)})\big)\big)$    (6)
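
To make Eqs. (2)-(5) concrete, the following minimal NumPy sketch traces the data flow from an input frame to the attended embedding. The PreNet and ScorerNet here are single-layer stand-ins for the LSTM and feedforward stacks described above, and all dimensions and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, KEY_DIM, EMB_DIM, K = 512, 128, 256, 4   # illustrative sizes

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pre_net(x_t, w):
    """Single-layer stand-in for the LSTM PreNet: frame -> key vector, Eq. (2)."""
    return np.tanh(x_t @ w)

def scorer_net(k_t, e_i, w):
    """Single-layer stand-in for the ScorerNet: (key, embedding) -> score, Eq. (3)."""
    return float(np.concatenate([k_t, e_i]) @ w)

w_pre = rng.normal(size=(FRAME_DIM, KEY_DIM)) * 0.05
w_score = rng.normal(size=(KEY_DIM + EMB_DIM,)) * 0.05

x_t = rng.normal(size=FRAME_DIM)        # one input frame
E = rng.normal(size=(K, EMB_DIM))       # enrolled speaker embeddings e_1..e_K

k_t = pre_net(x_t, w_pre)
scores = np.array([scorer_net(k_t, e_i, w_score) for e_i in E])
a_t = softmax(scores)                   # attention weights, Eq. (4)
e_att = a_t @ E                         # attended embedding, Eq. (5)
```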

Both the AttentionNet and VoiceFilterNet in the multi-user VoiceFilter-Lite model are jointly trained with an Adam optimizer [kingma2014adam] using a weighted linear combination of the following three loss functions:

  1. $L_{\mathrm{asym}}$: an asymmetric L2 loss for signal reconstruction;

  2. $L_{\mathrm{noise}}$: a noise type prediction loss for adaptive suppression at runtime;

  3. $L_{\mathrm{att}}$: an attention loss that measures how well the attention weights predict the target speaker:

    $L_{\mathrm{att}} = \sum_t d\big(\mathbf{a}^{(t)}, \mathbf{a}_{\mathrm{gt}}\big) + \lambda \, R\big(\mathbf{a}^{(t)}\big)$    (7)

    where $d(\cdot, \cdot)$ is a distance between the predicted and ground truth attention weights, $\mathbf{a}_{\mathrm{gt}}$ is the ground truth attention weights, $R(\cdot)$ is a regularization term, and $\lambda$ is the weight of the regularization term. The ground truth attention weights are the one-hot encoding of the position of the target speaker embedding. For example, if there are four enrolled speakers and the target speaker is the second enrolled speaker, then $\mathbf{a}_{\mathrm{gt}} = (0, 1, 0, 0)$. This loss ensures that the attention weights of the non-target speakers tend towards $0$, while the target speaker weight tends to $1$.

For more details on the performance of this model on a variety of tasks, we refer the reader to [rikhye2021multiuser].

2.3 Closing the performance gap between the single-user and multi-user models

Figure 2: Anatomy of the proposed new AttentionNet with FiLM-based speaker modulation.

As previously discussed in Section 1, the original multi-user VoiceFilter-Lite described in Section 2.2 suffers from performance degradation on single-user evaluations when compared with single-user VoiceFilter-Lite models, which prevents us from deploying such models in production environments. However, we observed an interesting fact: the loss functions of the multi-user VoiceFilter-Lite model look reasonable during training, even though evaluation performance degrades. This implies that the attention mechanism in the original multi-user VoiceFilter-Lite model is likely overfitting the training data, and specifically, the combinations of enrolled speakers in the training data.

To address this overfitting issue, we use a dual learning rate schedule, where the AttentionNet is trained independently and with a smaller learning rate than the VoiceFilterNet. Doing so ensures smaller weight updates for the AttentionNet, allowing the optimizer to more effectively minimize the loss function to produce an optimal solution. We found that this approach prevents the AttentionNet from memorizing the training data, which in turn allows it to generalize better to unseen examples.
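
The following toy sketch illustrates the dual learning rate idea on two arbitrary parameter groups. It is not the paper's training code: the learning rates are illustrative placeholders (an order of magnitude apart, as described above), and plain gradient descent stands in for the Adam optimizer actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameter groups standing in for the two sub-networks.
attention_params   = {"scorer_w": rng.normal(size=(64, 64))}
voicefilter_params = {"lstm_w": rng.normal(size=(256, 256))}

# Illustrative learning rates, an order of magnitude apart as described above;
# the exact values used in the paper are not given in this text.
LR_VOICEFILTER = 1e-4
LR_ATTENTION = 1e-5

def sgd_step(params, grads, lr):
    """Plain gradient descent (the paper trains with Adam; SGD keeps the sketch short)."""
    for name, grad in grads.items():
        params[name] -= lr * grad

# One hypothetical training step: the same total loss yields gradients for both
# groups, but each group is updated with its own learning rate.
grads_att = {k: rng.normal(size=v.shape) for k, v in attention_params.items()}
grads_vf  = {k: rng.normal(size=v.shape) for k, v in voicefilter_params.items()}
sgd_step(attention_params, grads_att, LR_ATTENTION)
sgd_step(voicefilter_params, grads_vf, LR_VOICEFILTER)
```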

To further improve the VoiceFilter-Lite model for speaker verification, we also replace the frame-wise concatenation operation between the attended embedding and the input features with feature-wise linear modulation (FiLM). In FiLM, the input features are modulated by the embedding via the following affine transformation:

$\mathrm{FiLM}\big(\mathbf{x}^{(t)}\big) = \gamma\big(\mathbf{e}_{\mathrm{att}}^{(t)}\big) \odot \mathbf{x}^{(t)} + \beta\big(\mathbf{e}_{\mathrm{att}}^{(t)}\big)$    (8)
$\mathbf{M}^{(t)} = \mathrm{FC}\big(\mathrm{LSTM}\big(\mathrm{FiLM}(\mathbf{x}^{(t)})\big)\big)$    (9)

where $\gamma(\cdot)$ and $\beta(\cdot)$ are two different fully connected neural networks, and $\odot$ denotes the element-wise product. We use two-layer FC networks, where the final layer projects the attended embedding to the same dimension as the input features, followed by an activation function. Unlike concatenation, FiLM learns to influence each input frame in an element-wise fashion by applying an affine transformation. As a result, the attended embedding is able to scale features in the input frame up or down, negate them, or even selectively threshold them, allowing more fine-grained control than simple concatenation. Furthermore, FiLM only requires two modulation vectors ($\gamma$ and $\beta$) per input frame, making it a computationally more efficient conditioning method. Numerous studies have described the benefit of using FiLM in ASR [kim2017dynamic, yousefi2021speaker] and speech enhancement [narayanan2021cross, o2021conformer], demonstrating FiLM's broad relevance for speaker-conditioned speech models.
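
A minimal NumPy sketch of the FiLM conditioning in Eqs. (8) and (9) is shown below. The hidden layer size, the tanh hidden activation, and the random weights are assumptions; the sketch only illustrates how the attended embedding produces a per-frame scale and shift of the same dimension as the input features.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, EMB_DIM, HIDDEN = 512, 256, 128   # illustrative sizes

def two_layer_fc(e, w1, b1, w2, b2):
    """Two-layer FC projecting the attended embedding to the frame dimension."""
    h = np.tanh(e @ w1 + b1)   # hidden activation is an assumption
    return h @ w2 + b2

# Separate networks for the scale (gamma) and shift (beta) terms.
params = {name: (rng.normal(size=(EMB_DIM, HIDDEN)) * 0.05,
                 np.zeros(HIDDEN),
                 rng.normal(size=(HIDDEN, FRAME_DIM)) * 0.05,
                 np.zeros(FRAME_DIM)) for name in ("gamma", "beta")}

def film(x_t, e_att):
    """FiLM conditioning: element-wise affine modulation of one input frame, Eq. (8)."""
    gamma = two_layer_fc(e_att, *params["gamma"])
    beta = two_layer_fc(e_att, *params["beta"])
    return gamma * x_t + beta   # feature-wise scale and shift

x_t = rng.normal(size=FRAME_DIM)   # one stacked-filterbank frame
e_att = rng.normal(size=EMB_DIM)   # attended speaker embedding
x_mod = film(x_t, e_att)           # same dimension as x_t, unlike concatenation
```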

3 Experimental Setup

3.1 Experimental design

Although the multi-user VoiceFilter-Lite model supports an arbitrary number of enrolled speaker embeddings as side input, there are additional constraints to consider when implementing this model in TFLite [alvarez2016efficient, shangguan2019optimizing]. Since TFLite does not support inputs with an unknown dimension, we had to pre-define a maximal number of enrolled speaker embeddings $K$ in our implementation. Then, at runtime, if the actual number of enrolled speakers is smaller than $K$, we use an all-zero vector as the embedding of any missing speaker. Thus, in the experiments to be shown in Section 4, for simplicity, we first assume the maximal number of speakers is $K = 2$. Then, in Section 4.4, we demonstrate that the observations from these experiments remain valid when we extend the model to $K = 4$.
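
The zero-padding convention can be sketched as follows; the function name and the embedding dimension are assumptions, and K_MAX is set to four to match the largest model evaluated in Section 4.4.

```python
import numpy as np

EMB_DIM = 256   # d-vector dimension (assumed)
K_MAX = 4       # maximal number of enrolled speakers baked into the TFLite graph

def pad_embeddings(enrolled, k_max=K_MAX, emb_dim=EMB_DIM):
    """Pad the enrolled-speaker inventory with all-zero vectors so that the
    side input always has the fixed shape (k_max, emb_dim)."""
    inventory = np.zeros((k_max, emb_dim))
    for i, e in enumerate(enrolled[:k_max]):
        inventory[i] = e
    return inventory

# Two enrolled users on a device whose model supports up to four.
rng = np.random.default_rng(0)
enrolled = [rng.random(EMB_DIM) for _ in range(2)]
inventory = pad_embeddings(enrolled)   # rows 2 and 3 remain all-zero
```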

Furthermore, in this paper, we focus only on addressing the multi-talker speaker verification challenge, especially for the multi-user multi-talker case. For example, when both speaker A and speaker B have enrolled their voices on the device (multi-user), and at runtime speaker A and a non-enrolled speaker C speak at the same time (multi-talker), we expect the speaker verification system to accept the input, because it contains speech from one of the enrolled users (i.e. speaker A).

For consistency, the acoustic feature frontend and the speaker verification model we used in our experiments are exactly the same as the ones used in [rikhye2021multiuser].

3.2 Model topology

For all the models in our experiments, the VoiceFilterNet has 3 LSTM layers, each with 256 nodes, and a fully connected layer with sigmoid activation function. The noise type prediction network has 2 LSTM layers, each with 128 nodes, and a fully connected layer with 64 nodes. In the multi-user setup, the PreNet has 3 LSTM layers, each with 128 nodes; the ScorerNet has two feedforward layers, each with 64 nodes.
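
For illustration only, the layer sizes above can be sketched with Keras as below. Input shapes, hidden activations, and the extra output heads are assumptions; the actual streaming TFLite implementation is not published with the paper.

```python
import tensorflow as tf

FRAME_DIM = 512   # hypothetical stacked-filterbank frame dimension

voicefilter_net = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(FRAME_DIM, activation="sigmoid"),  # per-frame mask
])

noise_predictor = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(64),   # followed by a noise-type output head (assumed)
])

pre_net = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128, return_sequences=True),
])

scorer_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),    # scalar score per (key, embedding) pair (assumed)
])
```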

3.3 Training and evaluation data

All the VoiceFilter-Lite models in our experiments are trained on a combination of: (1) the LibriSpeech training set [panayotov2015librispeech]; and (2) a vendor-collected dataset of English speech queries. To generate the noisy inputs, we augment these training data with different noise sources (speech and non-speech) and with different room configurations [lippmann1987multi, ko2017study, kim2017generation], using a signal-to-noise ratio (SNR) drawn from a uniform distribution. In the multi-user setup, each training utterance is attached with both the target speaker embedding and $K-1$ randomly sampled speaker embeddings from other speakers. For example, for a 4-enrolled-user model, we randomly sample 3 speaker embeddings from other speakers. To ensure that we train on all possible speaker combinations (e.g. 1, 2, and 3 enrolled users), we use a dropout probability to randomly replace each non-target speaker embedding with an all-zero vector.
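
The per-utterance embedding sampling can be sketched as follows. The dropout probability of 0.5 is a placeholder, since the exact value is not given above, and all function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, K = 256, 4
P_DROP = 0.5   # illustrative dropout probability; the paper's exact value is not given here

def make_training_inventory(target_emb, other_embs, k=K, p_drop=P_DROP):
    """Attach the target embedding plus (k - 1) randomly sampled non-target
    embeddings, each independently replaced by an all-zero vector with
    probability p_drop, so that 1..k enrolled-user combinations all occur."""
    idx = rng.choice(len(other_embs), size=k - 1, replace=False)
    inventory = [target_emb]
    for i in idx:
        keep = rng.random() > p_drop
        inventory.append(other_embs[i] if keep else np.zeros(EMB_DIM))
    return np.stack(inventory)   # the target's row index defines the one-hot ground truth

target = rng.random(EMB_DIM)
others = [rng.random(EMB_DIM) for _ in range(100)]
inventory = make_training_inventory(target, others)   # shape (4, 256)
```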

For evaluation, we use a vendor-provided English speech query dataset. The enrollment list comprises 8,069 utterances from 1,434 speakers, while the test list comprises 194,890 utterances from 1,241 speakers. The interfering speech is drawn from a separate English dev-set consisting of 220,092 utterances from 958 speakers. During evaluation, we apply different noise sources and room configurations to the data. We use "Clean" to denote the original non-noisified data, although they could already be quite noisy. The non-speech noise source consists of ambient noises recorded in cafes, vehicles, and quiet environments, as well as audio clips of music and sound effects downloaded from Getty Images [getty]. The speech noise source is a distinct development set whose speakers do not overlap with the testing set. We evaluate on reverberant room conditions, which consist of 3 million convolutional room impulse responses generated by a room simulator [kim2017generation], with three SNR values: -5 dB, 0 dB, and 5 dB.

4 Experimental Results

Model Name                    Num. of   Clean   Non-speech Noise       Speech Noise
                              speakers          -5dB   0dB    5dB      -5dB    0dB    5dB
No VoiceFilter                   -       0.71    5.04   2.23   1.50    12.40   8.29   5.13
Single-user VoiceFilter          1       0.71    5.01   2.19   1.48     3.97   2.42   1.65
Multi-user VoiceFilter:
  Averaging Model                1       0.71    5.01   2.20   1.48     4.75   2.66   1.72
                                 2       0.71    5.02   2.21   1.48     7.12   3.91   2.29
  Concat Model                   1       0.71    5.01   2.20   1.48     4.57   2.58   1.72
                                 2       0.71    5.02   2.21   1.48     7.41   3.98   2.29
  AttentionNet                   1       0.71    5.01   2.22   1.49     3.94*  2.37*  1.64*
  + Weighted Sum Model           2       0.72    5.03   2.21   1.47     7.11*  3.55*  1.99*
  AttentionNet                   1       0.71    5.01   2.20   1.48     3.92   2.39   1.62
  + Concat Top-K Model           2       0.72    5.02   2.22   1.49     7.23   3.77   2.13

Table 1: Equal Error Rate (EER) of text-independent speaker verification with different VoiceFilter-Lite models in the frontend and different numbers of enrolled users. The multi-user VoiceFilter-Lite models all use dual learning rates. Values marked with * belong to the best model (AttentionNet + Weighted Sum Model).

4.1 Experiment 1 - Attention is required for accurate voice separation

The aim of our first experiment is to determine whether the AttentionNet is required. There are two naive alternative approaches for feeding the enrolled speaker embeddings to the VoiceFilter-Lite model without the AttentionNet:

  • Averaging Model: The attended embedding is the average (arithmetic mean) of all enrolled speaker embeddings.

  • Concat Model: The attended embedding is an unordered concatenation of all enrolled speaker embeddings. To preserve the size of the attended embedding, we linearly project this concatenated vector down to the size of a single speaker embedding.

We also explore two variations of the attention-based model, where we generate the attended embedding as follows (all four aggregation strategies are sketched in code after this list):

  • Weighted Sum Model: We take the dot product (see Eq. 5) between the attention weights and the speaker embeddings to compute a weighted sum of all the enrolled speaker embeddings. This is different from the Averaging Model, which uses an identical weight for each enrolled speaker.

  • Concat Top-K Model: Out of the $K$ enrolled speaker embeddings, we pick the top-$k$ embeddings with the largest attention weights, and concatenate them in the order of the corresponding attention weights. It is important to note that this is different from the naive Concat Model, because here the concatenated speaker embeddings are ordered by their attention weights.
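
For concreteness, the following NumPy sketch contrasts the four aggregation strategies. The attention weights, projection matrices, and dimensions are illustrative stand-ins rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, K, TOP_K = 256, 4, 2          # illustrative sizes

E = rng.normal(size=(K, EMB_DIM))      # enrolled speaker embeddings
a = np.array([0.7, 0.1, 0.1, 0.1])     # attention weights from the ScorerNet (illustrative)
W_proj = rng.normal(size=(K * EMB_DIM, EMB_DIM)) * 0.01        # for the Concat Model
W_proj_topk = rng.normal(size=(TOP_K * EMB_DIM, EMB_DIM)) * 0.01

# Averaging Model: uniform weights, no attention needed.
e_avg = E.mean(axis=0)

# Concat Model: unordered concatenation projected back to a single embedding.
e_concat = np.concatenate(E) @ W_proj

# Weighted Sum Model: dot product of attention weights and embeddings (Eq. 5).
e_wsum = a @ E

# Concat Top-K Model: keep the TOP_K highest-scoring embeddings, ordered by weight.
order = np.argsort(a)[::-1][:TOP_K]
e_topk = np.concatenate(E[order]) @ W_proj_topk
```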

The evaluation results are shown in Table 1. From this table, we make three key observations. First, compared to "No VoiceFilter", adding a VoiceFilter-Lite model (either single-user or multi-user) to the feature frontend of the text-independent speaker verification system significantly reduces the equal error rate for speaker verification, confirming our previous results [rikhye2021personalized, rikhye2021multiuser]. Since VoiceFilter-Lite is disabled when there is no overlapping speech, we observe no difference in the EER for the non-speech noise cases across all models.

Second, amongst the multi-user VoiceFilter-Lite models, we see that neither the Averaging Model nor the Concat Model performs as well as the attention-based multi-user VoiceFilter-Lite models. This result suggests that attention, which finds the most relevant target speaker, is required for good performance. Furthermore, since the multi-user VoiceFilter-Lite models with AttentionNet have single-user EERs that closely match the single-user VoiceFilter-Lite model, we can confidently say that the attention mechanism is indeed able to generalize to unseen examples and to correctly identify the target speaker.

Third, between the two AttentionNet models, we find that the Weighted Sum Model outperforms the Concat Top-K Model for the two-enrolled-speaker case. Similarly, we notice that the Averaging Model also performs better than the Concat Model for the same two-enrolled-speaker case. This suggests that concatenating the two speaker embeddings, with or without ordering, and then projecting them to 256 dimensions does not retain sufficient information for the VoiceFilterNet to identify and enhance speech features of the target speaker in the input data. Rather, a weighted sum of the speaker embeddings is a much better predictor of the target speaker embedding. The difference in single-user EER between the Averaging Model and the Weighted Sum Model further reinforces the fact that the AttentionNet is selecting the correct speaker.

Taken together, the results of our first experiment indicate that the AttentionNet with weighted sum is critical to the multi-user VoiceFilter-Lite model. The simpler, non-attention-based strategies are insufficient for such tasks. In all multi-user VoiceFilter-Lite models in subsequent sections, we will use the AttentionNet + Weighted Sum Model configuration.

Model Name                      Num. spk   Clean   Speech Noise
                                                   -5dB    0dB    5dB
No VFL                             -        0.71   12.40   8.29   5.13
Single-user VFL (higher LR)        1        0.71    3.97   2.42   1.65
Single-user VFL (smaller LR)       1        0.71    6.67   3.79   2.24
Multi-user VFL (higher LR)         1        0.71    4.13   2.50   1.68
                                   2        0.71   10.39   6.79   4.31
Multi-user VFL (smaller LR)        1        0.71    6.97   3.88   2.25
                                   2        0.71    9.52   5.24   2.81
Multi-user VFL (dual LR)           1        0.71    4.02*  2.51*  1.73*
                                   2        0.71    8.46*  4.84*  3.43*

Table 2: EER of text-independent speaker verification with different VoiceFilter-Lite (VFL) models. Here, we vary the learning rate (LR). Each model is trained for 25 million steps. All models use the weighted sum attention mechanism. "Num. spk" is the number of enrolled speakers during evaluation. For the "dual LR" setup (marked with *), the VoiceFilterNet uses the higher LR and the AttentionNet uses the smaller LR.

4.2 Experiment 2 - Dual learning rate schedule helps to avoid AttentionNet overfitting

One observation we made in our previous multi-user VoiceFilter-Lite study [rikhye2021multiuser] is that the attention mechanism tends to overfit and memorize training data. Our next experiment is aimed at addressing this limitation by tuning the learning rate of the model.

Evaluation results are shown in Table 2. Since changing the learning rate or model architecture does not affect performance on non-speech background noise (see Table 1), we omit the non-speech noise results from the next two tables.

First, for the single-user VoiceFilter-Lite model, we notice that using the smaller learning rate results in a significantly worse model, with a much higher EER across all SNR values, compared to the model trained with the higher learning rate. Second, for the multi-user VoiceFilter-Lite model, we observe a regression in the EER with two enrolled users at the higher learning rate. This suggests that, with a higher learning rate, the AttentionNet tends to overfit the training data and fails to generalize to the evaluation data. Therefore, we implemented a dual learning rate scheduler where the AttentionNet is trained with the smaller learning rate, while the VoiceFilterNet is trained with the larger learning rate. As shown in Table 2, this significantly improves both the single-user and two-user performance of the model.

Model Name                      Num. spk   Clean   Speech Noise
                                                   -5dB   0dB    5dB
No VFL                             -        0.71   12.40  8.29   5.13
Single-user VFL                    1        0.71    3.97  2.42   1.65
Multi-user VFL + Concat Cond.      1        0.71    4.02  2.51   1.73
                                   2        0.71    8.46  4.84   3.49
Multi-user VFL + FiLM Cond.        1        0.71    3.94* 2.37*  1.64*
                                   2        0.71    7.11* 3.55*  1.99*

Table 3: EER of text-independent speaker verification with different VoiceFilter-Lite (VFL) models. Here, the attended embedding conditioning mechanism is changed. All multi-user VFL models use the Weighted Sum attention and the dual learning rate schedule. Values marked with * belong to the best model (FiLM conditioning).

Model Name                    Num. of   Clean   Non-speech Noise       Speech Noise
                              speakers          -5dB   0dB    5dB      -5dB    0dB    5dB
No VFL                           -       0.71    5.04   2.23   1.50    12.40   8.29   5.13
Single-user VFL                  1       0.71    5.01   2.19   1.48     3.97   2.42   1.65
Best Two-user VFL                1       0.71    5.01   2.22   1.49     3.94   2.37   1.64
                                 2       0.72    5.03   2.21   1.47     7.11   3.55   1.99
Previously Published             1       0.71    5.03   2.21   1.47     7.32   3.90   2.19
Four-user VFL                    2       0.72    5.04   2.21   1.49     9.34   5.18   2.78
[rikhye2021multiuser]            3       0.72    5.01   2.22   1.49    10.36   5.73   3.01
                                 4       0.72    5.05   2.21   1.49    10.99   6.10   3.14
New Four-user VFL                1       0.71    5.03   2.21   1.47     4.32   2.54   1.71
                                 2       0.71    5.03   2.21   1.47     7.59   4.21   2.50
                                 3       0.72    5.03   2.22   1.49     8.05   5.14   2.85
                                 4       0.72    5.03   2.21   1.50     9.78   5.38   2.99

Table 4: EER of text-independent speaker verification with different VoiceFilter-Lite (VFL) models. Here, the best 2-enrolled-user and best 4-enrolled-user models are compared with the previously published four-user model [rikhye2021multiuser].

4.3 Experiment 3 - FiLM-based speaker conditioning improves model performance

So far, we have shown that having an AttentionNet and training it with a smaller learning rate than the VoiceFilterNet is necessary for good performance in reducing EER when the multi-user VoiceFilter-Lite model is present in the text-independent speaker verification frontend. Another aspect of the model that can be further optimized is how the attended embedding is used by the VoiceFilterNet.

There are several ways in which the attended embedding can be used to condition the VoiceFilterNet:

  • Concat-Conditioned Model: The attended embedding is concatenated with each input frame before being fed into the VoiceFilterNet LSTM stack. This increases the dimensions of the input frame by the size of the attended embedding (256 dimensions).

  • FiLM-Conditioned Model: An affine transformation, shown in Eq.  8, is applied to each input frame. This affine transformation allows the attended embedding to modulate the input frame in a feature-wise manner. This does not change the dimensions of the input frame.

Evaluation results for these different models are shown in Table 3. In these experiments, we keep the AttentionNet architecture the same (Weighted Sum Model) and use a dual learning rate schedule. We observe that the multi-user VoiceFilter-Lite model that uses FiLM to condition the input frames with the attended embedding performs significantly better than the model that uses concatenation.

4.4 Experiment 4 - Same observations hold for four enrolled users

Finally, we demonstrate that the best two-user model can be easily extended to support four enrolled users. In the following experiments, we trained a four-user model with the same model architecture. Evaluation results for this model are shown in Table 4.

Compared to our previously published model [rikhye2021multiuser], the new four-user model, which uses FiLM and dual learning rates (see Fig. 2), results in a significantly lower EER for all speaker combinations. Interestingly, we observe a regression in EER between the best two-user VoiceFilter-Lite model and the four-user model for the 1-speaker and 2-speaker evaluations. One reason for this could be that there are fewer 1-speaker and 2-speaker examples when training the four-user model than when training the two-user model, due to the way we process our training data (see Section 3.3). In fact, for the four-user model, only a small fraction of the training data contains one or two enrolled users. As a result, during evaluation, the model does not generalize as well on 1-speaker or 2-speaker evaluations as on 3-speaker and 4-speaker evaluations. To address this issue, one of our future work directions is to balance our training data according to the realistic distribution of the number of users on shared devices, as well as to make the model more robust to unbalanced data.

5 Conclusions

In this paper, we devised a series of experiments to evaluate the impact of various design choices in the multi-user VoiceFilter-Lite model. We confirmed that an attention mechanism is critical for the multi-user model to function well, and that it cannot be replaced by naive aggregation logic such as averaging or concatenating all enrolled speaker embeddings. We found that training the attention mechanism with a learning rate that is an order of magnitude smaller than the rest of the model addresses the overfitting issue, and is critical to closing the performance gap between single-user and multi-user models on single-user evaluations. Additionally, the performance of the model can be further improved by using FiLM to condition the model on the attended speaker embedding.

Although all experiments in this paper are carried out for multi-user VoiceFilter-Lite, it is important to note that the proposed attention-based speaker selection mechanism is a generic solution that can be applied to any speaker-conditioned speech model. This is crucial as most smart home devices, such as smart displays and smart speakers, usually support multiple enrolled users. Thus, as future work, we would like to apply the best practices from multi-user VoiceFilter-Lite to other speaker-conditioned speech models, including personalized ASR and personal VAD.

References