Speaker recognition (SR) with short utterances is an important challenge in realistic settings, where the test utterance may be short. While recent advances in deep learning have made it possible to achieve impressive performance on speaker recognition tasks such as speaker verification (SV) and speaker identification (SI), these tasks remain challenging in practical settings (e.g. short duration, unseen-speaker SI). As these issues have grown in importance, several works have been proposed to address them. For SR with short utterances, [gao2018improved, hajavi2019deep] introduce task-specific feature extractors that extract as much information as possible from a short utterance, and [xie2019utterance, hajavi2019deep] introduce aggregation methods that attend to informative frames in the output of the feature extractor. Beyond these approaches, various other attempts have been made to deal with short utterances. However, many of them do not provide a fundamental solution for real situations.
In this work, we aim to tackle this problem by meta-learning with imbalance length pairs. Specifically, we organize each episode with a support set of long utterances and a query set of relatively short utterances. By optimizing over a sequence of such episodes, we can train our network to match long-short utterance pairs better than conventional (also referred to as ‘vanilla’) training, where utterances of the same length are optimized at once. Yet, a crucial problem here is that the query samples may not be discriminative against all classes in the training set. Therefore, we further classify every sample in the episode over the whole set of training classes (also referred to as ‘global classification’). In doing so, the embedding of a short utterance becomes discriminative against other classes and is matched to its own long utterance at the same time (see Figure 1).
Also, an SR system should be robust to speech duration to cope with practical situations, since the same person can speak at different speeds and speaking rate varies from person to person. To deal with this problem, we meta-learn the model by simulating the real situation. Specifically, the query utterance has a variable length that is shorter than that of the support utterance, like an enrollment and test pair in practice. This lets the model encounter various scenarios during meta-learning and thus yields a model that is more robust to the duration of the test utterance. Also, the consistent framework between training and testing enables the model to verify and classify unseen speakers well (see Figure 2).
Our proposed learning scheme is based on ResNet34 [he2016deep], which is widely used in SR. To verify the effectiveness of the proposed training scheme in isolation, we use only temporal average pooling (TAP) for aggregation and a non-margin metric loss. We also use 40-dimensional log-Mel filterbank features as input to reduce time complexity, since execution time should be short in realistic settings. We experiment on realistic settings such as short-utterance SV and unseen-speaker SI, in addition to the conventional experimental setting (full-utterance SV). We use the VoxCeleb datasets [nagrani2017voxceleb, chung2018voxceleb2] to compare directly with other models. Our model obtains state-of-the-art results on short utterances, and we report results on various test data and settings for ease of comparison and reproduction.
Our main contributions are as follows:
We propose a novel meta-learning method for short utterance speaker recognition, in which each episode is composed of a support and query pair with imbalanced utterance lengths.
We propose a training procedure that combines meta-learning and global classification to obtain well-matched and discriminative embeddings.
We validate our model on the VoxCeleb datasets in various realistic settings, including speaker verification with short duration and unseen speaker identification, and achieve state-of-the-art results.
2 Related Work
DNN based speaker embedding: Recently, speaker recognition (SR) has achieved impressive performance through DNN based methods [variani2014deep, li2017deep, snyder2018x, zhang2018text]. The key components of a DNN based system are the feature extractor, the aggregation of temporal features, and the optimization. First, many SR systems use 1D or 2D convolutional neural networks and recurrent neural networks as feature extractors. These extractors make it possible to capture the time and frequency properties of the speaker features (MFCC, Mel-filterbank). The extracted frame-level features are summarized into a fixed-length vector by aggregation methods. There have been many works that aim to capture intrinsic speaker information, such as attentive statistic pooling (ASP) [okabe2018attentive], self attentive pooling (SAP) [cai2018exploring], learnable dictionary encoding (LDE) [cai2018exploring] and spatial pyramid encoding (SPE) [jung2019spatial]. This yields utterance-level features, which are generally optimized with a softmax over fully-connected layers. However, this is not enough to make the embeddings discriminative in the embedding space. To compensate for this weakness, recent work applies angular margin-based metrics that reduce per-class variance, such as A-softmax [liu2017sphereface], AM-softmax [wang2018additive] and AAM-softmax [deng2019arcface].
Metric-based meta-learning for few-shot classification: Speaker verification and speaker identification for unseen speakers are a kind of few-shot learning task. In few-shot classification, the goal is to correctly classify an unlabeled query set given only a small labeled support set. Since labeled data is limited, the classifier for each task must rely on the meta-knowledge accumulated over previous tasks, which is called meta-learning. One of the most popular approaches is metric-based meta-learning [kye2020transductive, vinyals2016matching, snell2017prototypical, liu2018learning], which allows the model to directly measure the distance between samples with a metric. The best-known metric-based approaches are Matching Networks [vinyals2016matching] and Prototypical Networks [snell2017prototypical], which use cosine distance and Euclidean distance, respectively. In this work, we use cosine similarity as the distance metric, and the loss for each episode is calculated in a similar way to Prototypical Networks [snell2017prototypical].
Speaker recognition for short utterance: Since a short utterance carries very little inherent information about the speaker, speaker recognition for short utterances is still challenging. To tackle this problem, [gao2018improved, hajavi2019deep, xie2019utterance] propose network architectures with aggregation techniques that extract as much information as possible from short speech. NetVLAD/GhostVLAD [xie2019utterance] is an attention-based pooling method with learnable dictionary encoding, and time-distributed voting (TDV) [hajavi2019deep] combines short-cut connection information through a weighted sum. TDV shows an impressive improvement over GhostVLAD for short utterances. However, TDV performs well only on short utterances and relatively poorly on long ones. In addition to these methods, there have been many other attempts to solve this problem, such as knowledge distillation [jung2019short], generative adversarial networks [zhang2018vector], and angular margin-based methods [huang2018angular, gusev2020deep].
In this work, we tackle a practical setting for unseen speaker recognition, where the test utterance is shorter than the enrollment utterance. A suitable model for this situation should therefore be able not only to match long and short utterances well but also to classify unseen speakers well. We introduce a meta-learning scheme in which the support set and query set consist of long and short utterances, respectively. We also classify the support and query sets over the whole set of training classes to make the embeddings discriminative while the imbalance length pairs are matched.
3.1 Problem Definition
In $N$-way $K$-shot classification, we first sample $N$ classes randomly from the entire set of classes, and then sample $K$ and $M$ examples from each class for the support set and query set, respectively. We define the episode sampling distribution as $p(\tau)$. As a result, we have a support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}$ and a query set $\mathcal{Q} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{N \times M}$, where $y, \tilde{y} \in \{1, \dots, N\}$ are the class labels.
The goal of classification in an episode $\tau$ is to correctly classify the query examples in $\mathcal{Q}$ given the support set $\mathcal{S}$. Since $\mathcal{S}$ includes only a few examples for each class, conventional learning algorithms will mostly fail due to overfitting (e.g. consider 1-shot classification). Thus, most existing approaches tackle this problem by meta-learning over a task distribution $p(\tau)$, such that later tasks can benefit from the knowledge obtained over previous training episodes.
One of the most popular and successful approaches for few-shot classification is the metric-based approach. We aim to learn an embedding function $f_\theta(x)$ that maps an input $x$ to an $l$-dimensional metric space. The support set and query set are then mapped into this space, such that we can measure the distance between class prototypes and query embeddings. In this paper, we use cosine similarity as the distance metric.
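The episode construction described in this section can be sketched as follows. This is a minimal illustration rather than the training code used in the experiments: the function name `sample_episode`, the dictionary layout and the rule of skipping speakers with too few utterances are assumptions made for the example.

```python
import random


def sample_episode(utterances_by_speaker, n_way, k_shot, m_query, rng=random):
    """Sample one N-way K-shot episode with M query examples per class.

    utterances_by_speaker: dict mapping a speaker id to a list of utterance ids.
    Returns (support, query), each a list of (utterance_id, episode_label) pairs,
    where episode_label runs from 0 to n_way - 1.
    """
    # Assumption: only speakers with enough utterances are eligible, so the
    # support and query examples of a class never overlap.
    eligible = [spk for spk, utts in utterances_by_speaker.items()
                if len(utts) >= k_shot + m_query]
    speakers = rng.sample(eligible, n_way)

    support, query = [], []
    for label, speaker in enumerate(speakers):
        chosen = rng.sample(utterances_by_speaker[speaker], k_shot + m_query)
        support += [(utt, label) for utt in chosen[:k_shot]]
        query += [(utt, label) for utt in chosen[k_shot:]]
    return support, query
```

With `n_way=100`, `k_shot=1` and `m_query=2`, each call would return 100 support and 200 query examples, one draw from the episode distribution.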
3.2 Meta-Learning for imbalance length pair
Despite large improvements in speaker recognition, recognition of short-duration speech remains very challenging. Our proposed learning paradigm addresses this problem from a realistic perspective. As shown in [kanagasundaram2019study], in the conventional training setting, a mismatch in length between training and test speech can degrade performance on short utterances. Likewise, as shown in [gusev2020deep], a model trained with short segments performs better on short test utterances than a model trained with relatively long speech, but performs relatively poorly on long speech. This trade-off is critical in realistic settings, since it leaves the embedding of either the enrollment or the test utterance insufficiently discriminative.
How, then, can we train a model that is robust to utterance duration? To tackle this problem, we meta-learn the model with episodes in which the support set and query set form imbalance length pairs. In a practical setting, once a long utterance is enrolled it is fixed, and test utterances of variable, short length are then fed into the system. To simulate this situation in the training phase, we make the support utterances longer than the query utterances, while the query set consists of variable-length features that are shorter than the support utterances.
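One way to realize such an imbalance length pair is sketched below, using the range given in the experiment setting (query length between half and the full support length). The helper names, the uniform draw of the query length and the assumption that every utterance has at least `support_len` frames are illustrative choices, not details taken from the paper.

```python
import random


def random_crop(frames, length, rng):
    """Return a random contiguous crop of `length` frames."""
    start = rng.randint(0, len(frames) - length)
    return frames[start:start + length]


def make_imbalance_pair(support_utt, query_utt, support_len, rng=random):
    """Crop the support to a fixed length and the query to a shorter, variable one.

    Assumption: the query length is drawn uniformly between half and the full
    support length, and both utterances hold at least `support_len` frames.
    """
    query_len = rng.randint(support_len // 2, support_len)
    support = random_crop(support_utt, support_len, rng)
    query = random_crop(query_utt, query_len, rng)
    return support, query
```

Whether the query length is redrawn per sample or once per episode is left open here; the sketch draws it once per call.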
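As in [snell2017prototypical], we form each prototype as the average of the support set and pull the query examples toward their prototype. First, we define $\mathcal{S}_c$ as the set of support examples in class $c$ and then compute the prototype of each class in the episode, where $f_\theta$ is the embedding function:

$$P_c = \frac{1}{|\mathcal{S}_c|} \sum_{(x, y) \in \mathcal{S}_c} f_\theta(x) \tag{1}$$

Then, we compute the distance between each query and each prototype. In this work, we use cosine similarity as the distance metric:

$$d\big(f_\theta(\tilde{x}), P_c\big) = \frac{f_\theta(\tilde{x})^{\top} P_c}{\lVert P_c \rVert_2} = \lVert f_\theta(\tilde{x}) \rVert_2 \cdot \cos\big(f_\theta(\tilde{x}), P_c\big) \tag{2}$$

which can be seen as cosine similarity with an input-wise length scale. We can then predict the probability of each class $c$ as follows:

$$p(\tilde{y} = c \mid \tilde{x}, \mathcal{S}; \theta) = \frac{\exp\big(d(f_\theta(\tilde{x}), P_c)\big)}{\sum_{c'=1}^{N} \exp\big(d(f_\theta(\tilde{x}), P_{c'})\big)} \tag{3}$$

which can also be called the normalized softmax. Next, we compute the episode loss over the query set $\mathcal{Q}$:

$$L_e = \frac{1}{|\mathcal{Q}|} \sum_{(\tilde{x}, \tilde{y}) \in \mathcal{Q}} -\log p(\tilde{y} \mid \tilde{x}, \mathcal{S}; \theta) \tag{4}$$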
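The steps of this subsection (class prototypes, scaled cosine distance, normalized softmax and episode loss) can be written out as a short numerical sketch. It assumes that "cosine similarity with an input-wise length scale" means the dot product of the query embedding with the unit-normalized prototype, and that the loss is averaged over the query set; plain Python lists stand in for the network's embeddings, and all function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_prototypes(support, n_way):
    """Average the support embeddings of each class.

    support: list of (embedding, label) pairs with labels in 0..n_way-1.
    """
    protos = []
    for c in range(n_way):
        members = [emb for emb, label in support if label == c]
        dim = len(members[0])
        protos.append([sum(m[i] for m in members) / len(members)
                       for i in range(dim)])
    return protos


def distance(query, proto):
    """Assumed metric: ||query|| * cos(query, proto), i.e. only the prototype
    is normalized to unit length."""
    return dot(query, proto) / math.sqrt(dot(proto, proto))


def log_softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def episode_loss(support, query, n_way):
    """Mean negative log-likelihood of the query labels under the prototypes."""
    protos = class_prototypes(support, n_way)
    total = 0.0
    for emb, label in query:
        log_p = log_softmax([distance(emb, p) for p in protos])
        total -= log_p[label]
    return total / len(query)
```

Because the query norm is not removed, a longer query embedding sharpens the softmax for that query, which is the input-wise scaling effect mentioned above.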
3.3 Global classification
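With the meta-learning scheme, we can pull variable-length short utterances close to relatively long utterances. However, optimization within the episode alone may not be sufficient to produce discriminative embeddings. Inspired by [kye2020transductive], we additionally classify the support and query samples against the whole set of training classes. By optimizing support and query samples of different lengths at once, we can reduce the per-class variance caused by utterance duration. At the same time, we can make the embeddings discriminative against all other classes in the training set. Following [kye2020transductive], we assume a set of global prototypes, one for each class:

$$w = \{ w_c \in \mathbb{R}^{l} \}_{c=1}^{C} \tag{5}$$

where $C$ is the number of classes in the entire training set and $l$ is the dimension of the embedding. For each sample $x \in \mathcal{S} \cup \mathcal{Q}$, with $f_\theta$ the embedding function, we predict the probability of each class as follows:

$$p(y = c \mid x; \theta, w) = \frac{\exp\big(d(f_\theta(x), w_c)\big)}{\sum_{c'=1}^{C} \exp\big(d(f_\theta(x), w_{c'})\big)} \tag{6}$$

and compute the global loss:

$$L_g = \frac{1}{|\mathcal{S}| + |\mathcal{Q}|} \sum_{(x, y) \in \mathcal{S} \cup \mathcal{Q}} -\log p(y \mid x; \theta, w) \tag{7}$$

where $d(\cdot, \cdot)$ is the distance metric described in Eq. (2) and $y$ here denotes the label of $x$ among the $C$ training classes. Note that global classification is conducted on both support and query samples. Finally, our learning objective combines the episode loss in Eq. (4) with the global loss in Eq. (7):

$$L = \mathbb{E}_{\tau \sim p(\tau)} \big[ L_e + \lambda L_g \big] \tag{8}$$

where $\lambda$ is the balance factor of the loss, which we simply set to 1. We approximate the expectation over the task distribution via Monte-Carlo (MC) approximation with a single sample during training. This combined objective allows the model to match imbalance length pairs while the same pairs are classified over the whole set of training classes.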
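The global classification term and its combination with the episode loss can be sketched in the same style. The sketch assumes the same scaled cosine metric as in the episode (dot product with the unit-normalized prototype), a mean over all support and query samples, and a simple weighted sum with balance factor `lam`; the function names are illustrative.

```python
import math


def scaled_cosine(x, w):
    """||x|| * cos(x, w): dot product with w normalized to unit length."""
    num = sum(a * b for a, b in zip(x, w))
    return num / math.sqrt(sum(b * b for b in w))


def cross_entropy(scores, target):
    """Negative log-softmax of the target score."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def global_loss(samples, global_protos):
    """Classify every support AND query sample over all training classes.

    samples: list of (embedding, training_class_index) pairs.
    global_protos: one learnable prototype vector per training class.
    """
    losses = [cross_entropy([scaled_cosine(emb, w) for w in global_protos], y)
              for emb, y in samples]
    return sum(losses) / len(losses)


def combined_loss(episode_loss_value, samples, global_protos, lam=1.0):
    """Episode loss plus the balance factor times the global loss."""
    return episode_loss_value + lam * global_loss(samples, global_protos)
```

In an actual training loop the global prototypes would be trainable parameters updated together with the embedding network; here they are fixed lists so that the arithmetic can be checked by hand.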
| Model | Feature | Training set | minDCF / EER (%) |
|---|---|---|---|
| i-vectors+PLDA [nagrani2017voxceleb] | NR | Vox1(D) | 0.73 / 8.8 |
| VGG-M (TAP+C) [nagrani2017voxceleb] | Spec-512 | Vox1(D) | 0.71 / 7.8 |
| ResNet34 (SAP+A) [cai2018exploring] | MFB-64 | Vox1(D) | 0.622 / 4.40 |
| ResNet34 (SPE+A) [jung2019spatial] | MFB-64 | Vox1(D) | 0.402 / 4.03 |
| TDNN (ASP+SM) [okabe2018attentive] | MFCC-40* | Vox1(D) | 0.406 / 3.85 |
| ResNet34 (TAP+NS) | MFB-40 | Vox1(D) | 0.418 / 3.81 |
| UtterIdNet (TDV+SM) [hajavi2019deep] | Spec-257 | Vox2(D) | NR / 4.26 |
| Thin ResNet34 [xie2019utterance] | Spec-257 | Vox2(D) | NR / 3.22 |
| ResNet34 (SPE+A) [jung2019spatial] | MFB-64 | Vox2(D) | 0.245 / 2.61 |
| ResNet34 (TAP+NS) | MFB-40 | Vox2(D) | 0.234 / 2.08 |
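| Model | Feature | Training set | Test set | EER (%) at 1 s | EER (%) at 2 s | EER (%) at 5 s |
|---|---|---|---|---|---|---|
| Thin ResNet34 (GhostVLAD+SM) [xie2019utterance] | Spec-257 | Vox2(D) | Vox1(D+T) | 12.71 | 6.59 | 3.34 |

Table 2: Verification performance on short utterances. G: Global classification; M: Meta-learning; Spec: Spectrogram; D: Development set; T: Test set; *: With data augmentation.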
We evaluate our method in various settings on the VoxCeleb datasets. VoxCeleb1 [nagrani2017voxceleb] and VoxCeleb2 [chung2018voxceleb2] are large-scale text-independent speaker recognition datasets, containing 1251 and 5994 speakers, respectively. The speakers in these two datasets do not overlap. Verification results are reported as equal error rate (EER) and minimum detection cost function (minDCF, or $C_{det}^{min}$) at $P_{target} = 0.01$. Verification trials are scored using cosine similarity. For unseen speaker identification, the average accuracy over 1000 randomly generated episodes is reported with 95% confidence intervals.
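For reference, scoring a trial by cosine similarity and estimating the EER from a list of scored trials can be sketched as below. This is a simple threshold sweep written for illustration, not the evaluation tool used for the reported numbers; tied scores are processed one at a time, so the result is an approximation when many trials share the same score.

```python
import math


def cosine_similarity(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def equal_error_rate(scores, labels):
    """Approximate EER for similarity scores (higher means same speaker).

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Lowers the acceptance threshold one trial at a time and returns the mean
    of the false acceptance and false rejection rates where they are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    accepted_target = accepted_nontarget = 0
    # Starting point: threshold above every score (accept nothing).
    best_gap, eer = 1.0, 0.5
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            accepted_target += 1
        else:
            accepted_nontarget += 1
        far = accepted_nontarget / n_nontarget
        frr = 1.0 - accepted_target / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```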
4.2 Experiment setting
We use 40-dimensional log Mel-filterbank (MFB) features as input, with a frame length of 25 ms and a 15 ms overlap between adjacent frames. Inputs are mean-normalized along the time axis, without any voice activity detection (VAD) or data augmentation. Each training episode is 1-shot 100-way, and the number of query examples for each class is set to 2. For memory efficiency, we set the length of the support utterances to 2 seconds and the length of each query to between half and the full support length. For vanilla training, we use fixed-length speech of 2 seconds. The frame-level feature extractor is ResNet34 with 32-64-128-256 channels for the residual stages. The extracted features are aggregated with temporal average pooling (TAP) and passed through a fully-connected layer to give a 256-dimensional embedding. We use the SGD optimizer with Nesterov momentum of 0.9 and set the weight decay to 0.0001. We set the initial learning rate to 0.1 and decay it by a factor of 10 until convergence. Every experiment is run on a single NVIDIA 2080Ti GPU.
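To make the framing concrete: a 25 ms window with a 15 ms overlap corresponds to a 10 ms hop. The sketch below counts the resulting frames and applies mean normalization along the time axis. The 16 kHz sampling rate and the absence of padding are assumptions made for the example, and the function names are illustrative.

```python
def num_frames(duration_s, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis windows in a segment (no padding assumed)."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    n_samples = int(duration_s * sample_rate)
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


def mean_normalize(features):
    """Subtract the per-dimension mean over time.

    features: list of frames, each a list of filterbank values.
    """
    n = len(features)
    dim = len(features[0])
    means = [sum(frame[i] for frame in features) / n for i in range(dim)]
    return [[frame[i] - means[i] for i in range(dim)] for frame in features]
```

Under these assumptions a 2-second support segment gives 198 frames and a 1-second query gives 98 frames, each with 40 filterbank values.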
4.3 Speaker verification for full utterance
We first examine the results of full-duration SV to analyze the advantage of our training scheme. All results in Table 1 are evaluated on the original VoxCeleb1 [nagrani2017voxceleb] test trials. For a fair comparison, we report baselines without VAD and data augmentation, except for the x-vector [snyder2018x] based model [okabe2018attentive]. On VoxCeleb1, our proposed model outperforms previous state-of-the-art models. With the same backbone (i.e. ResNet34), our model achieves superior performance without any additional aggregation or margin-based metric, even though these generally lead to better performance. Further, our model outperforms the time-delay neural network (TDNN) with attentive statistic pooling [okabe2018attentive]. On the much larger VoxCeleb2 dataset [chung2018voxceleb2], our model consistently outperforms the other baseline models.
4.4 Speaker verification for short utterance
We first describe the test setting, then compare against previous state-of-the-art models for short utterances. We test our model on two sets of trials. The first is the original VoxCeleb1 test trial list, the same one used to evaluate full utterances. The second is the full VoxCeleb1 dataset (1251 speakers in total). The enrollment utterance is used at full duration, while the test utterance is randomly cropped to 1, 2 or 5 seconds. If the test utterance is shorter than required, we extend it by duplicating its own segment.
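The cropping rule for test utterances can be sketched as follows. How exactly a too-short utterance is duplicated is not specified, so repeating it from the start until the target length is reached is an assumption of this example, as is the function name.

```python
import random


def fit_to_length(frames, target_len, rng=random):
    """Randomly crop a long utterance, or repeat a short one, to `target_len`."""
    if len(frames) >= target_len:
        start = rng.randint(0, len(frames) - target_len)
        return frames[start:start + target_len]
    # Assumption: a short utterance is tiled from its beginning and truncated.
    repeats = -(-target_len // len(frames))  # ceiling division
    return (frames * repeats)[:target_len]
```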
To demonstrate the effectiveness of our model, we perform an ablation study on VoxCeleb1. We observe that time-distributed voting (TDV) [hajavi2019deep], which was proposed specifically for short segments, outperforms temporal average pooling by a slight margin. However, the third row shows that a model trained only with meta-learning outperforms both TDV and the conventionally trained model (see first row). Further, our proposed model, which combines meta-learning with global classification, outperforms the other baselines trained on VoxCeleb1 by a large margin.
For comparison with previous state-of-the-art models [hajavi2019deep, xie2019utterance, gusev2020deep], we trained the model on the VoxCeleb2 dataset and tested it on the full VoxCeleb1 dataset. We use the same trials as described in [gusev2020deep]: for every speaker, 100 positive pairs and 100 negative pairs are randomly generated. The bottom rows of Table 2 show that our model outperforms the other baselines by a significantly large margin for 1-2 seconds. Since our model does not use any additional aggregation technique or margin-based optimization, we attribute this improvement to our combined learning scheme. Furthermore, our model uses only 40-dimensional features, whereas the other baselines use more than twice as many dimensions. Given the results in [gusev2020deep], this suggests that the performance gap could be even larger if we used higher-dimensional inputs. For the comparison with [hajavi2019deep], since UtterIdNet is not publicly available, we compared against TDV instead.
Our performance gain has two sources. First, we compose each training episode from imbalance length pairs, where the utterance length of the query is variable and shorter than that of the support set. In Table 3, we observe that the variable short query setting outperforms both equal length pairs and fixed long-short pairs. Note that [wang2019centroid, anand2019few] are a kind of equal length pair. In our proposed setting, the model encounters a different length pair configuration in each episode, and is thus meta-learned to be good at matching imbalance length pairs and robust to speech duration. Second, to make the embeddings more discriminative, we classify both support and query samples against the whole set of training classes. Unlike the conventional method, which classifies utterances of the same length in each batch, our combined scheme classifies utterances of different lengths at once. This reduces the variance caused by duration and increases the separation between class clusters. By combining these two components, our proposed model achieves state-of-the-art performance on short utterances while also performing well on full utterances.
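As an illustration of the per-speaker trial generation, the sketch below draws 100 positive and 100 negative pairs for each speaker. The exact protocol follows [gusev2020deep] and is not reproduced here; sampling with replacement across trials and choosing the impostor speaker uniformly are assumptions of this example, as is the function name.

```python
import random


def make_trials(utts_by_speaker, n_pos=100, n_neg=100, rng=random):
    """Return (enroll_utt, test_utt, is_target) trials for every speaker.

    Assumptions: every speaker has at least two utterances, and pairs are
    drawn independently, so the same pair may appear more than once.
    """
    speakers = list(utts_by_speaker)
    trials = []
    for spk in speakers:
        own = utts_by_speaker[spk]
        others = [s for s in speakers if s != spk]
        for _ in range(n_pos):
            enroll, test = rng.sample(own, 2)  # two distinct utterances
            trials.append((enroll, test, 1))
        for _ in range(n_neg):
            impostor = rng.choice(others)
            trials.append((rng.choice(own),
                           rng.choice(utts_by_speaker[impostor]), 0))
    return trials
```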
4.5 Unseen speaker identification
We now evaluate the performance of our model on unseen SI. We trained the model on the VoxCeleb2 dataset and tested it on the whole VoxCeleb1 dataset. To keep the setting similar to verification, we enroll each speaker with one utterance, and every enrollment utterance is set to 5 seconds so that speakers are classified fairly. We randomly sample $N$ speakers from the VoxCeleb1 dataset, and then sample 1 and 5 utterances from each speaker for enrollment and testing, respectively. Test utterances shorter than required are handled as in Section 4.4. As shown in Table 4, our proposed method outperforms vanilla training in every setting, and the performance gap increases as the number of classes grows. In general, identification performance decreases as the number of speakers increases and as the utterance length decreases.
We proposed a novel meta-learning scheme for short-duration speaker recognition. To simulate the practical setting during training, we propose an episode composition in which the support and query sets have imbalanced lengths. Our meta-learning scheme is combined with global classification, resulting in a well-discriminated embedding space. On the VoxCeleb datasets, we validate our model in various settings and obtain state-of-the-art performance on short utterance speaker verification.
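The identification protocol can be sketched as nearest-enrollment classification by cosine similarity, with the mean accuracy over episodes reported together with a 95% confidence interval. The normal approximation (1.96 standard errors) is an assumption about how the interval is computed, and the function names are illustrative.

```python
import math
import statistics


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def episode_accuracy(enrolled, tests):
    """Fraction of test embeddings assigned to the correct enrolled speaker.

    enrolled: one embedding per speaker; the list index is the speaker label.
    tests: list of (embedding, speaker_label) pairs.
    """
    correct = 0
    for emb, label in tests:
        scores = [cosine(emb, e) for e in enrolled]
        if scores.index(max(scores)) == label:
            correct += 1
    return correct / len(tests)


def mean_with_ci95(accuracies):
    """Mean accuracy and the half-width of a normal-approximation 95% CI."""
    mean = statistics.mean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * std_err
```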
This material is based upon work supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10063424, Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots).