Personal VAD: Speaker-Conditioned Voice Activity Detection

08/12/2019 ∙ by Shaojin Ding, et al. ∙ Texas A&M University

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For every frame, personal VAD outputs the scores for three classes: non-speech, target speaker speech, and non-target speaker speech. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition network are combined to perform the same task.




1 Introduction

In modern speech processing systems, voice activity detection (VAD) usually sits upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signal, but also significantly reduces the overall computational cost, because the VAD model is much smaller than the components it gates.

A typical VAD system uses a frame-level classifier with acoustic features to make speech/non-speech decisions for each audio frame (e.g. with 25ms width and 10ms step). Poor VAD systems could either mistakenly accept background noise as speech or falsely reject speech. Falsely accepting non-speech as speech largely slows down the downstream automatic speech recognition (ASR) processing. It is also computationally expensive, as ASR models are normally much larger than VAD models. On the other hand, falsely rejecting speech leads to deletion errors in ASR transcriptions (a few milliseconds of missed audio could remove an entire word). The VAD needs to work accurately in challenging environments, including noisy conditions, reverberant environments and environments with competing speech. Significant research has been devoted to finding the optimal VAD features and models [1, 2, 3, 4, 5]. In the literature, LSTM-based VAD is a popular architecture for sequential modeling of the VAD task, showing state-of-the-art performance [3, 5].

In many scenarios, especially on-device speech recognition [6], computational resources such as CPU, memory, and battery are typically limited. In such cases, we wish to run the computationally intensive components such as speech recognition only when the target user is talking to the device. Falsely triggering such components in the background, when only speech from other talkers or TV noise is present, would cause battery drain and a bad user experience. A tiny model that passes through only speech from the target user is therefore needed, which is our motivation for developing the personal VAD system. To the best of our knowledge, this is the first work that aims at detecting the voice activity of a target speaker.

The proposed personal VAD is a VAD-alike neural network, conditioned on the target speaker embedding or the speaker verification score. Instead of determining whether a frame is speech or non-speech as in standard VAD, personal VAD extends the determination to three classes: non-speech, target speaker speech, and non-target speaker speech. We propose four different architectures to achieve personal VAD, as described in Section 2.2. In the training of personal VAD, we first treat it as a three-class classification problem and use cross entropy loss to optimize the model. In addition, we noticed that discriminating between non-speech and non-target speaker speech is less important than discriminating between target speaker speech and the other two classes. Therefore, we further propose a weighted pairwise loss that makes the model focus on the more important distinctions, as introduced in Section 2.3. We evaluate the model on the LibriSpeech dataset [7], with the experiment setup described in Section 3.2 and results presented in Section 3.5. Conclusions are drawn in Section 4.

2 Approach

2.1 Recap of speaker verification system

Personal VAD relies on a pre-trained text-independent speaker recognition model to encode the speaker identity into embedding vectors. In this work, we use the “d-vector” model introduced in [8], which has been successfully applied to various applications including speaker diarization [9, 10], speech synthesis [11], source separation [12], and speech translation [13]. We retrained the model using data from 8 languages for language robustness and better performance. During inference, the model produces embeddings on sliding windows, and a final aggregated embedding named “d-vector” is used to represent the voice characteristics of this utterance. The cosine similarity between two d-vector embeddings can be used to measure the similarity of two voices.
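As a minimal sketch of the similarity measure mentioned above, the following computes the cosine similarity between two embedding vectors. The toy 4-dimensional vectors and the function name `cosine_similarity` are illustrative assumptions; real d-vectors in this work are 256-dimensional and come from the speaker recognition model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for d-vectors (hypothetical values).
enrolled = [0.5, 0.1, -0.3, 0.8]
same_voice = [1.0, 0.2, -0.6, 1.6]    # same direction, different scale
other_voice = [-0.8, 0.3, 0.5, 0.1]   # points in a different direction

print(cosine_similarity(enrolled, same_voice))
print(cosine_similarity(enrolled, other_voice))
```

Because cosine similarity ignores vector length, the scaled copy of the enrolled vector scores as a perfect match, while the unrelated vector scores lower.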

In a real application, users are required to follow an enrollment process before enabling speaker verification or personal VAD. During enrollment, d-vector embeddings are computed from the target user’s recordings, and stored on the device. Since the enrollment is a one-off experience and can happen on server-side, we can assume the embeddings are available at runtime with no cost.

Figure 1: Four different architectures to implement personal VAD: (a) SC: Run standard VAD and frame-level speaker verification independently, and combine their results. This is used as a baseline for other approaches. (b) ST: Concatenate frame-level speaker verification score with acoustic features to train a personal VAD model. (c) ET: Concatenate speaker embedding with acoustic features to train a personal VAD model. (d) SET: Concatenate both speaker verification score and speaker embedding with acoustic features to train a personal VAD model.

2.2 System architecture

A personal VAD system should produce frame-level class labels for three categories: non-speech (ns), target speaker speech (tss), and non-target speaker speech (ntss). We implemented four different architectures to achieve personal VAD, as illustrated by Fig. 1. All four architectures rely on the embedding of the target speaker, which is acquired via the enrollment process.

2.2.1 Score combination (SC)

As shown in Fig. 1(a), this approach does not require training any new model, so we use it as a baseline for other approaches. A standard VAD model and a speaker recognition model run independently on the acoustic features. The standard VAD produces softmax probabilities for speech (s) and non-speech (ns) for each frame. The speaker recognition model produces an embedding at each frame, and verifies it against the target speaker embedding. The resulting cosine similarity can be rescaled and combined with the probability of s to obtain probabilities of tss and ntss, and the probability of ns is directly used for class ns.
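The text does not specify how the cosine similarity is rescaled or combined, so the following is only one possible sketch of score combination. It assumes a linear rescaling of the cosine similarity from [-1, 1] to [0, 1] and a multiplicative split of the speech probability; the function name `combine_scores` and both of these choices are assumptions, not the authors' stated method.

```python
def combine_scores(p_speech, p_nonspeech, cosine):
    """Turn standard VAD probabilities plus a cosine score into three class scores.

    Assumption: the cosine similarity is mapped linearly from [-1, 1] to [0, 1]
    and used to split the speech probability between tss and ntss.
    """
    target_weight = (cosine + 1.0) / 2.0
    return {
        "ns": p_nonspeech,                         # used directly, as in the text
        "tss": p_speech * target_weight,           # speech attributed to the target
        "ntss": p_speech * (1.0 - target_weight),  # speech attributed to others
    }


# One frame: the VAD is fairly sure this is speech, and the voice
# resembles the enrolled speaker (hypothetical numbers).
frame_scores = combine_scores(p_speech=0.9, p_nonspeech=0.1, cosine=0.6)
print(frame_scores)
```

With this particular split, the three scores still sum to one whenever the two VAD probabilities do, which keeps them comparable to the outputs of a trained three-class model.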

There are two major disadvantages of this architecture. First, it runs a window-based speaker recognition model at the frame level, and this mismatch could cause significant performance degradation. Training a frame-level speaker recognition model instead is not scalable, because utterances of different lengths are difficult to batch. Second, this architecture requires running a speaker recognition system at runtime, which can be expensive since speaker recognition models are usually much bigger than VAD models.

2.2.2 Score conditioned training (ST)

As shown in Fig. 1(b), this approach uses the speaker recognition model to produce a cosine similarity score for each frame, and concatenates that score to the acoustic features. Our acoustic features are 40-dimensional log Mel filterbank energies, so after concatenation the feature becomes 41-dimensional. We train a new personal VAD network that takes the concatenated features as input, and outputs the three class labels for each frame.

This approach still requires running the speaker recognition model at runtime. But since the personal VAD model is trained with the speaker verification scores as input, it is expected to perform better than simply combining the scores of two individually trained systems.

2.2.3 Embedding conditioned training (ET)

As shown in Fig. 1(c), this approach directly concatenates the target speaker embedding (acquired via enrollment process) with the acoustic features to train a new personal VAD network to output the three labels at the frame level. Our embedding is 256-dimensional, so the concatenated feature here is 296-dimensional.

Since this approach does not require running the big speaker recognition model at runtime, it is the most lightweight solution among all architectures.

One way to interpret this approach is to view it as a knowledge distillation [14] process from the large speaker recognition model to the small personal VAD model.

2.2.4 Score and embedding conditioned training (SET)

As shown in Fig. 1(d), this approach concatenates both the frame-level speaker recognition score and the target speaker embedding to the acoustic features to train a new personal VAD model. The concatenated feature is 297-dimensional.

This approach makes use of the most information from the speaker recognition system. However, it still requires running the speaker recognition model at runtime.
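The input dimensions quoted for ST, ET, and SET follow directly from the concatenation. The sketch below builds the per-frame input for each architecture from placeholder values; the helper name `build_features` and the order in which the pieces are concatenated are assumptions made for illustration.

```python
def build_features(acoustic, embedding=None, score=None):
    """Concatenate acoustic features with optional speaker conditioning.

    The concatenation order (acoustic, score, embedding) is an assumption.
    """
    features = list(acoustic)
    if score is not None:
        features.append(score)        # frame-level cosine similarity (ST, SET)
    if embedding is not None:
        features.extend(embedding)    # enrolled target speaker embedding (ET, SET)
    return features


acoustic = [0.0] * 40      # 40-dimensional log Mel filterbank energies
embedding = [0.0] * 256    # 256-dimensional target speaker embedding
score = 0.7                # placeholder speaker verification score

st_input = build_features(acoustic, score=score)
et_input = build_features(acoustic, embedding=embedding)
set_input = build_features(acoustic, embedding=embedding, score=score)

print(len(st_input), len(et_input), len(set_input))
```

Note that the embedding is the same for every frame of an utterance, whereas the score changes per frame and must be recomputed by the speaker recognition model at runtime.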

2.3 Weighted pairwise loss

With an input frame $x$ and the corresponding ground truth label $y \in \{ns, tss, ntss\}$, personal VAD can be thought of as a ternary classification problem. The network outputs the unnormalized distribution of $x$ over the three classes, denoted as $z = f(x; w)$, where $w$ is the parameters of the network. We use $z^{k}$ to denote the unnormalized probability of the $k$-th class. To train the model, we minimize the cross entropy loss as:

$$L_{CE}(y, z) = -\log \frac{\exp(z^{y})}{\sum_{k} \exp(z^{k})},$$

where $k \in \{ns, tss, ntss\}$.

However, in personal VAD, our goal is to detect the voice activity from only the target speaker. Audio frames that are classified into class ns and ntss will be discarded similarly by downstream components. As a result, confusion errors between <ns,ntss> have less impact on system performance than errors between <tss,ntss> and <tss,ns>. Inspired by Tuplemax loss [15], here we propose a weighted pairwise loss to model the different tolerance to each class pair. Given $y$ and $z$, we define weighted pairwise loss as:

$$L_{WPL}(y, z) = -\mathbb{E}_{k \neq y}\left[ w_{\langle k, y \rangle} \cdot \log \frac{\exp(z^{y})}{\exp(z^{y}) + \exp(z^{k})} \right],$$

where $w_{\langle k, y \rangle}$ is the weight between class $k$ and class $y$. By setting a lower weight for <ns,ntss> errors than for <tss,ntss> and <tss,ns> errors, we make the model more tolerant to confusion between <ns,ntss> and focus it on distinguishing tss from ns and ntss.
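The two losses can be sketched in plain Python as follows. This is one plausible reading of the weighted pairwise loss, in which each pair term is a two-class softmax between the true class and one other class, weighted and then averaged over the other classes. The weight values in `example_weights` are hypothetical and chosen only to show the effect of down-weighting the <ns,ntss> pair.

```python
import math

CLASSES = ("ns", "tss", "ntss")


def cross_entropy(y, z):
    """Standard softmax cross entropy; z maps class name -> unnormalized score."""
    m = max(z[k] for k in CLASSES)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z[k] - m) for k in CLASSES))
    return log_sum - z[y]


def weighted_pairwise_loss(y, z, weights):
    """Average of weighted two-class softmax losses between y and each other class.

    weights maps frozenset({class_a, class_b}) -> pair weight.
    """
    others = [k for k in CLASSES if k != y]
    total = 0.0
    for k in others:
        d = z[k] - z[y]
        # -log(exp(z_y) / (exp(z_y) + exp(z_k))) == log(1 + exp(d)), computed stably.
        pair_loss = max(d, 0.0) + math.log1p(math.exp(-abs(d)))
        total += weights[frozenset((k, y))] * pair_loss
    return total / len(others)


# Hypothetical weights: pairs involving tss count fully, <ns,ntss> counts less.
example_weights = {
    frozenset(("tss", "ns")): 1.0,
    frozenset(("tss", "ntss")): 1.0,
    frozenset(("ns", "ntss")): 0.5,
}
uniform_weights = {pair: 1.0 for pair in example_weights}

# A non-speech frame that the model confuses with non-target speech.
z = {"ns": 0.2, "tss": -2.0, "ntss": 0.4}
print(weighted_pairwise_loss("ns", z, uniform_weights))
print(weighted_pairwise_loss("ns", z, example_weights))
```

Lowering the <ns,ntss> weight shrinks the penalty for this kind of confusion, while the loss for a frame whose true class is tss is unchanged because neither of its pairs involves that weight.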

3 Experiments

3.1 Datasets

We use the LibriSpeech dataset [7] to train and evaluate the proposed method. The training set contains 960 hours of speech, of which 460 hours are “clean” speech and the other 500 hours are “noisy” speech. The testing set also consists of both “clean” and “noisy” speech.

We also considered using datasets in which each utterance contains natural speaker turns, such as those used for speaker diarization [9]. However, these datasets do not contain enrollment utterances for individual speakers, so they are not applicable for our personal VAD setup. Thus we have to concatenate utterances to simulate speaker turns (see Section 3.2.1), and then noisify the utterances to mitigate the concatenation artifacts (see Section 3.2.2).

In all the experiments, we use the concatenated LibriSpeech training set to train the models. We use both the original LibriSpeech testing set and the concatenated LibriSpeech testing set for evaluation, as specified in the following subsections. For all the datasets, we use forced alignment to produce the frame-level ground truth labels used in training and evaluation.

                      |        Without MTR          |          With MTR           |
Method          Loss  |  tss    ns     ntss   mAP   |  tss    ns     ntss   mAP   |  Network parameters
SC (baseline)   CE    |  0.886  0.970  0.872  0.900 |  0.777  0.908  0.768  0.801 |  4.95 million
ST              CE    |  0.956  0.968  0.956  0.957 |  0.905  0.885  0.905  0.901 |  4.95 million
ET              CE    |  0.932  0.962  0.946  0.946 |  0.878  0.873  0.890  0.883 |  0.13 million
SET             CE    |  0.970  0.969  0.972  0.969 |  0.938  0.888  0.938  0.928 |  4.95 million
ET              WPL   |  0.955  0.965  0.961  0.959 |  0.916  0.883  0.920  0.912 |  0.13 million

Table 1: Architecture and loss function comparison results.

SC: Score combination, the baseline system. ST: Score conditioned training. ET: Embedding conditioned training. SET: Score and embedding conditioned training. CE: Cross entropy loss. WPL: Weighted pairwise loss (). We report the Average Precision (AP) for each class, and the mean Average Precision (mAP) over all the classes. Network parameters include 4.88 million parameters from the speaker recognition model, if it is used during inference.

3.2 Experimental settings

3.2.1 Utterance concatenation

In the training corpora of standard VAD, each utterance usually only contains the speech from one single speaker. However, personal VAD aims to find the voice activity of a target speaker in a conversation with multiple speakers engaged. Therefore, we cannot directly use the standard VAD training corpora to train personal VAD. To simulate the conversational speech, we concatenate utterances from multiple speakers into a longer utterance, and then we randomly select one of the speakers as the target speaker in the concatenated utterance.

To generate a concatenated utterance, we draw a random number $n$ indicating the number of utterances used for concatenation from a uniform distribution:

$$n \sim \mathrm{Uniform}(a, b),$$

where $a$ and $b$ are the minimum and maximum number of utterances used for concatenation. The waveforms from the $n$ randomly selected utterances are concatenated, and one of the speakers is assumed as the target speaker of the concatenated utterance. At the same time, we modify the VAD ground truth label of each frame according to the target speaker: “non-speech” frames remain the same, while “speech” frames are modified to either “target speaker speech” or “non-target speaker speech” according to whether the source utterance is from the target speaker.
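The relabeling step described above can be sketched as follows. Only the frame labels are handled here; the waveforms would be concatenated in the same order. The function name, the data layout (a list of `(speaker_id, frame_labels)` pairs), sampling without replacement, and the bounds passed in the example are all assumptions for illustration.

```python
import random


def concatenate_utterances(pool, min_utts, max_utts, rng):
    """Simulate a multi-speaker utterance and relabel its frames.

    pool: list of (speaker_id, frame_labels), labels being "speech" or "non-speech".
    Returns the chosen target speaker and the relabeled frame sequence.
    """
    n = rng.randint(min_utts, max_utts)   # uniform over the inclusive integer range
    chosen = rng.sample(pool, n)          # assumption: sample without replacement
    target = rng.choice(chosen)[0]        # one of the speakers becomes the target
    labels = []
    for speaker, frame_labels in chosen:
        for label in frame_labels:
            if label == "non-speech":
                labels.append("ns")       # non-speech frames are unchanged
            elif speaker == target:
                labels.append("tss")      # speech from the target speaker
            else:
                labels.append("ntss")     # speech from any other speaker
    return target, labels


# Tiny hypothetical pool: four speakers, three frames each.
pool = [
    ("spk_a", ["non-speech", "speech", "speech"]),
    ("spk_b", ["speech", "speech", "non-speech"]),
    ("spk_c", ["speech", "non-speech", "speech"]),
    ("spk_d", ["non-speech", "speech", "non-speech"]),
]
target, labels = concatenate_utterances(pool, 1, 3, random.Random(0))
print(target, labels)
```

Comparing by speaker identity rather than by utterance means that a second utterance from the target speaker, if one is drawn, is also labeled as target speaker speech.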

In our experiments, we generated concatenated utterances for training set and concatenated utterances for testing sets. We set and for both sets.

3.2.2 Multistyle training

For both training and evaluation, we apply a data augmentation technique named “multistyle training” (MTR) on our datasets to avoid domain overfitting and mitigate concatenation artifacts. During MTR, the original (concatenated) source utterance is noisified with multiple randomly selected noise sources, using a randomly selected room configuration. Our noise sources include 827 audios of ambient noises recorded in cafes, 786 audios recorded in silent environments, and 6433 YouTube segments containing background music or noise. We generated 3 million room configurations using a room simulator to cover different reverberation conditions. The distribution of the signal-to-noise ratio (SNR) of our MTR is shown in Fig. 2.

Figure 2: Histogram of SNR (dB) of our multistyle training.

3.3 Model configuration

The acoustic features are 40-dimensional log Mel filterbank energies, extracted on frames with 25ms width and 10ms step. For both the standard VAD model and the personal VAD model, we used a 2-layer LSTM network with 64 neurons, followed by a fully-connected layer with 64 neurons. We also tried larger networks but did not see performance improvements, possibly due to the limited variety in training data. We used TensorFlow [16] for training and inference. During training, we used Adam Optimizer with a learning rate of . For the models with weighted pairwise loss, we fixed the weights $w_{\langle tss,ns \rangle}$ and $w_{\langle tss,ntss \rangle}$ and explored different values for $w_{\langle ns,ntss \rangle}$. To reduce the model size and accelerate the runtime inference, we quantize the parameters of the model to 8-bit integer values following [17].
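As a rough sanity check, the parameter count of the ET model can be estimated from this configuration. The sketch assumes standard LSTM layers (four gates with biases, no peepholes or projection), 64 units in each of the two layers, and a final 3-way output layer on top of the fully-connected layer; none of these details are stated explicitly in the text.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    """Parameters of a fully-connected layer with bias."""
    return (input_dim + 1) * units


def personal_vad_params(input_dim, lstm_units=64, fc_units=64, num_classes=3):
    """Two stacked LSTM layers, one hidden dense layer, one output layer (assumed)."""
    total = lstm_params(input_dim, lstm_units)
    total += lstm_params(lstm_units, lstm_units)
    total += dense_params(lstm_units, fc_units)
    total += dense_params(fc_units, num_classes)
    return total


# ET input: 40 acoustic features + 256-dimensional speaker embedding.
et_params = personal_vad_params(40 + 256)
# With 8-bit quantization, each parameter takes one byte.
et_size_kb = et_params / 1000

print(et_params, et_size_kb)
```

Under these assumptions the estimate is about 0.13 million parameters, in line with the ET entry in Table 1, and about 130 KB after 8-bit quantization, in line with the model size quoted in the abstract.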

3.4 Metrics

To evaluate the performance of the proposed method, we computed the Average Precision (AP) for each class and the mean Average Precision (mAP) over all the classes. We adopted the micro-mean when computing mAP to take class imbalance into account.
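The exact AP and micro-mean definitions used in the experiments are not spelled out here, so the following is a sketch under common conventions: AP is the mean of the precision values at the rank of each positive example, and the micro-mean pools the (frame, class) decisions of all classes into one ranking before computing AP. The function names and the toy scores are assumptions.

```python
def average_precision(labels, scores):
    """AP for binary labels (1 = positive) ranked by descending score.

    Ties are broken by input order, which is an arbitrary choice.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank   # precision at this positive's rank
    if hits == 0:
        raise ValueError("no positive labels")
    return precision_sum / hits


def micro_map(true_classes, class_scores, classes):
    """Micro-averaged AP: pool one-vs-rest decisions of all classes together."""
    flat_labels, flat_scores = [], []
    for y, scores in zip(true_classes, class_scores):
        for k in classes:
            flat_labels.append(1 if y == k else 0)
            flat_scores.append(scores[k])
    return average_precision(flat_labels, flat_scores)


# Per-class AP on a toy ranking: positives at ranks 1 and 3.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Micro-mean over two frames that are both classified correctly.
frames_true = ["tss", "ns"]
frames_scores = [
    {"ns": 0.1, "tss": 0.8, "ntss": 0.1},
    {"ns": 0.7, "tss": 0.2, "ntss": 0.1},
]
toy_micro = micro_map(frames_true, frames_scores, ("ns", "tss", "ntss"))
print(toy_ap, toy_micro)
```

Pooling all decisions means that classes with more frames contribute more to the final number, which is how a micro-mean accounts for class imbalance.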

3.5 Results

Figure 3: Mean Average Precision (mAP) of personal VAD (ET) with different values of $w_{\langle ns,ntss \rangle}$ in weighted pairwise loss.

We conducted three groups of experiments to evaluate the proposed method. First, we compared the four architectures for personal VAD. Following this, we examined the effectiveness of weighted pairwise loss and compared it against conventional cross entropy loss. Finally, we evaluate personal VAD on a standard VAD task, to see if personal VAD can replace standard VAD without performance degradation.

3.5.1 Architecture comparisons

In the first group of experiments, we compared the performance of the four personal VAD architectures described in Fig. 1. We evaluated these systems on the concatenated LibriSpeech testing set. Additionally, to explore the performance of personal VAD on noisy speech, we also applied MTR on the testing set. We report the evaluation results on the testing set with and without MTR in Table 1. Results show that ST, ET, and SET outperform the baseline SC system on AP for tss and ntss and on mAP in all cases. When applying MTR to the testing set, we observed a larger performance gap between the proposed methods and the baseline. Among the proposed systems, SET achieved the highest AP for tss, and ST slightly outperforms ET. However, both ST and SET require running the speaker recognition model to compute the cosine similarity score at inference time, which largely increases both the number of parameters in the model and the inference computational cost. By contrast, ET obtained 0.932 (without MTR) / 0.878 (with MTR) AP for class tss on the testing set with a model of only 0.13 million parameters (about 40 times smaller), which is more appropriate for on-device applications.

3.5.2 Loss function comparisons

In the second group of experiments, we compared the proposed weighted pairwise loss against the conventional cross entropy loss. Here, we only consider ET architecture, as it is much more lightweight while achieving reasonably good performance. Similarly, we evaluated the systems on the concatenated LibriSpeech testing set with and without MTR.

In Fig. 3, we plot the AP for tss against different values of $w_{\langle ns,ntss \rangle}$ in weighted pairwise loss. From the results, we observed that using a smaller value of $w_{\langle ns,ntss \rangle}$ than $w_{\langle tss,ns \rangle}$ and $w_{\langle tss,ntss \rangle}$ will improve the performance. The best performance is reached when setting , with detailed results listed in Table 1.

3.5.3 Personal VAD for standard VAD tasks

If we want to replace a standard VAD component with personal VAD, we also need to guarantee that the performance degradation on a standard speech/non-speech task is minimal. We evaluated two personal VAD models (ET architecture with cross entropy loss, ET architecture with weighted pairwise loss) on the non-concatenated LibriSpeech testing data (so each utterance has only the target speaker). The results are shown in Table 2. We can see that the AP for class speech (s) is very close between personal VAD and standard VAD, which justifies replacing standard VAD by personal VAD.

                           |  Without MTR   |   With MTR     |
Method               Loss  |  s      ns     |  s      ns     |
Standard VAD         CE    |  0.992  0.975  |  0.975  0.918  |
Personal VAD (ET)    CE    |  0.991  0.965  |  0.979  0.893  |
Personal VAD (ET)    WPL   |  0.991  0.967  |  0.979  0.901  |

Table 2: Evaluation on a standard VAD task. We report the Average Precision (AP) for speech (s) and non-speech (ns).

4 Conclusions

In this paper, we proposed four different architectures to implement personal VAD, a system that detects the voice activity of a target user. Among the different architectures, using a single small network that takes acoustic features and the enrolled target speaker embedding as inputs achieves near-optimal performance with the smallest runtime computational cost. To model the tolerance to different types of errors, we proposed a new loss function, the weighted pairwise loss, which performed better than a conventional cross entropy loss in our experiments. Our experiments also show that personal VAD and standard VAD perform comparably on a standard VAD task. In summary, our findings suggest that, by focusing only on the desired target speaker, a personal VAD can reduce the overall computational cost of speech recognition systems operating in noisy environments.