Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings

10/25/2018 ∙ by Jee-weon Jung, et al. ∙ 0

An input utterance of short duration is one of the most critical factors degrading the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that takes input utterances with short durations of 2.05 seconds. To this end, we propose an approach using a teacher-student learning framework that maximizes the cosine similarity of two speaker embeddings extracted from a long and a short utterance. In the proposed architecture, phonetic-level features, each representing a segment of approximately 130 ms, are extracted using convolutional layers, and gated recurrent units aggregate the phonetic-level features into an utterance-level speaker embedding. Experiments were conducted on the VoxCeleb 1 dataset using deep neural networks that take raw waveforms as input and output speaker embeddings. The equal error rates without short utterance compensation are 8.72% and 12.80% for 3.59 s and 2.05 s evaluation utterances, respectively. The proposed model with compensation exhibits an equal error rate of 10.08%, recovering more than 65% of the performance degradation.




1 Introduction

Recent speaker verification systems generally operate on utterance-level features such as i-vectors and speaker embeddings from deep neural networks (DNNs) [1, 2]. Utterance-level features extracted from short utterances carry uncertainty owing to insufficient phonetic information, a well-known cause of performance degradation in speaker verification systems. To compensate for this uncertainty caused by short utterances, Saeidi et al. proposed an uncertainty propagation method in the i-vector space [3]. Yamamoto et al. proposed a DNN-based compensation system that transforms an i-vector extracted from a short utterance into an i-vector corresponding to a long utterance, showing that phonetic information can be effectively used for compensating short utterances [4]. However, an actual improvement in performance could not be obtained through this approach. We assume that this limitation arises because it is difficult to compensate for the missing phonetic information using already-extracted utterance-level features [3, 4, 5].

Unlike most previous studies, which compensate utterance-level features after they have been extracted, we propose a novel integrated short-utterance-compensation system based on phonetic-level features, which text-independently extracts speaker embeddings directly from short utterances with a duration of 2.05 s. The phonetic-level feature is an intermediate concept between frame-level and utterance-level features, representing segments of approximately 130 ms; this duration is known to be appropriate for representing phonetic information based on conventional phonetic knowledge [6, 7]. Figure 1-(a) illustrates the concept of phonetic-level features. A gated recurrent unit (GRU) and a teacher-student (TS) learning framework are used to efficiently compensate short utterances using phonetic information.

The remainder of this paper is organized as follows: Section 2 describes the speaker embedding system, Section 3 introduces the teacher-student learning framework, Section 4 discusses the proposed short utterance compensation system, Section 5 describes the experimental settings and result analysis, and Section 6 concludes the study.

Figure 1: (a) Conceptual illustration of various levels of features based on CNN-GRU network (b) Workflow of the proposed teacher-student learning-based short utterance compensation system.

2 Raw waveform speaker embedding models

Recent advances in deep neural networks (DNNs) have resulted in several successful speaker embedding systems that directly model raw waveforms [8, 9, 10]. We expected that pre-processing suitable for speaker verification and short utterance compensation could be learned through DNN training, because raw waveforms are modeled without any hand-crafted feature extraction. In this study, we use the raw waveform CNN-LSTM (RWCNN-LSTM) architecture proposed in [8] with the following two modifications: leaky rectified linear unit (LReLU) activation [11] was used instead of ReLU activation, and the long short-term memory layer was replaced with a gated recurrent unit (GRU) layer. Comparative experimental results show that these two modifications lead to an additional relative decrease of 10% in the equal error rate (EER). The resulting RWCNN-GRU model comprises convolutional blocks followed by one GRU layer and two fully-connected layers. The convolutional blocks transform raw waveforms into phonetic-level features (addressed in Section 4) that represent segments of approximately 130 ms. The outputs of the last convolutional block are fed to the GRU layer, which produces a fixed-dimensional utterance-level representation; the last fully-connected layer's LReLU activation is then used as the speaker embedding. Speaker verification is performed by comparing the cosine similarity between two speaker embeddings. In this research, the teacher and student DNNs have identical architectures; however, the sequence length of the output of the last convolutional block (which can also be thought of as the timestep of the GRU input) varies with the length of the input utterance.
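Verification by cosine similarity between two embeddings, as described above, can be sketched in plain Python (the function names and the threshold value are illustrative, not from the paper; the actual system extracts embeddings with the Keras model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial if the two embeddings are similar enough.
    The threshold here is a placeholder; in practice it is tuned
    on a development set (e.g. at the equal error rate point)."""
    return cosine_similarity(enroll_emb, test_emb) >= threshold
```

Two embeddings pointing in the same direction score 1.0 regardless of magnitude, which is why cosine scoring is insensitive to embedding norm.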

3 Teacher-student learning

Teacher-student (TS) learning uses two DNNs, a teacher and a student, in which the student DNN is trained using soft labels provided by the pre-trained teacher DNN. In this framework, after the teacher network is trained, the student network is trained to have an output distribution similar to that of the teacher network. This framework was first proposed for model compression and is also widely used for compensating far-field utterances [12, 13, 14]. In this paper, the TS framework is applied to short utterance compensation for the first time, to the best of our knowledge. When TS learning is used for short utterance compensation, the KL-divergence objective function can be written as

$$L_{KL} = -\sum_{j}\sum_{i} f^{T}_{i}(x^{L}_{j}) \log f^{S}_{i}(x^{S}_{j}), \tag{1}$$

where $i$ and $j$ refer to the speaker and utterance indices, respectively; $x^{L}_{j}$ and $x^{S}_{j}$ refer to the long and short crop of the same utterance, respectively; and $f^{T}$ and $f^{S}$ are the output distributions of the teacher and student DNNs, respectively. Equation 1 shows that TS learning trains the student DNN's output distribution to match that of the teacher DNN despite being provided with short utterances.
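The conventional TS objective described here is a cross-entropy between teacher and student posteriors, summed over utterances. A minimal pure-Python sketch (variable names are illustrative):

```python
import math

def ts_kl_loss(teacher_probs, student_probs):
    """Per Equation 1: sum over utterances j of
    -sum_i f_T(spk_i | long crop) * log f_S(spk_i | short crop).

    teacher_probs / student_probs: lists of per-utterance posterior
    distributions over speakers; the teacher sees the long crop and
    the student the short crop of the same utterance.
    """
    loss = 0.0
    for p_t, p_s in zip(teacher_probs, student_probs):
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return loss
```

Up to the constant entropy of the teacher posterior, this is the KL divergence, so it is minimized exactly when the student reproduces the teacher's distribution.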

4 Proposed short utterance compensation system

Figure 2: Speaker embeddings visualized using the t-SNE algorithm [15]. The five different colors represent five randomly selected speakers from the evaluation set. A triangle denotes the mean of the speaker embeddings extracted from long utterances.

The core concept of the proposed system is to directly compensate the speaker embedding using TS learning. In conventional TS learning, the student DNN's output distribution is compared with that of the teacher DNN. However, because the ultimate goal of a short utterance compensation system is to make the speaker embedding of a short utterance identical to that of a long utterance, we propose an objective function that compares speaker embeddings directly. The objective function of the proposed TS learning can be written as an extension of Equation 1:

$$L = L_{KL} + \sum_{j} \mathrm{dist}\big(e^{T}(x^{L}_{j}),\, e^{S}(x^{S}_{j})\big), \tag{2}$$

where $e^{T}$ and $e^{S}$ denote the speaker embeddings of the teacher and student DNNs, respectively, and $\mathrm{dist}$ denotes a measure of the distance between two embeddings, such as the cosine similarity or mean squared error.
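A single-utterance sketch of the combined objective, using cosine distance as the dist term (an equal weighting of the two terms is assumed here for illustration; names are not from the paper's code):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def combined_ts_loss(teacher_probs, student_probs, teacher_emb, student_emb):
    """KL term on the output posteriors plus a cosine-distance term
    on the speaker-embedding layers, for one utterance."""
    kl = -sum(pt * math.log(ps) for pt, ps in zip(teacher_probs, student_probs))
    dist = 1.0 - cos_sim(teacher_emb, student_emb)  # cosine distance
    return kl + dist
```

The dist term pulls the student embedding toward the teacher's, while the KL term keeps the student's output posteriors (and thus its speaker discrimination) aligned with the teacher's.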

In [16], Zhang et al. noted that when short utterance compensation is performed, the speaker embedding of the short utterance may become close to that of the long utterance, but the discriminative power of the compensated embedding cannot be ensured. Consistent with this observation, in our experiments using the cosine loss as the sole objective function, the performance did improve, although not considerably. To maintain the discriminative power of the speaker embedding during short utterance compensation, the conventional KL-divergence term is included in the proposed objective function. Superior results were obtained using both losses as the final objective function.

The approach presented herein differs from existing short utterance compensation approaches in two aspects. The first is that short utterances are compensated at the phonetic level rather than the utterance level. Previous studies exploited an additional compensation system to transform speaker embeddings extracted from short utterances after utterance-level feature extraction, because the uncertainty caused by the lack of phonetic information is observed in utterance-level features. However, compensating phonetic-level features appears to be a more direct solution, because the uncertainty arises in the process of extracting utterance-level features from phonetic-level features. In the proposed approach, the transformation is performed within the network: the GRU layer moves the phonetic-level features of a short utterance toward the optimal position derived from the corresponding long utterance with abundant phonetic information. To compensate at the phonetic level, convolutional blocks are exploited to extract phonetic-level features. Figure 2 shows the decrease in uncertainty achieved by the GRU layer with the proposed method on the evaluation set (unseen data). Comparing the average embedding of the last convolutional block's feature map ((a), (c)) with the embedding after the GRU layer ((b), (d)), we conclude that the GRU layer increases the discriminative power for each speaker. Comparing the embeddings from the GRU layer with and without the proposed method (i.e., cases (b) and (d)), we confirm that the uncertainty of each speaker caused by short utterances is significantly reduced by applying the proposed method.

The second difference pertains to compensating short utterances and maintaining discriminative power simultaneously using TS learning. The speaker embedding layers are compared using the cosine similarity metric (compensation), proposed herein for the first time to the best of our knowledge, while the conventional KL-divergence loss is also used (discriminative power). The overall proposed system is depicted in Figure 1-(b).

5 Experiments

5.1 Dataset

In all the experiments described herein, we used the VoxCeleb dataset, which comprises approximately 330 hours of audio from 1,251 speakers at a sampling rate of 16 kHz [17]. The utterances have average and minimum durations of 8.2 s and 4 s, respectively, in a text-independent scenario. The evaluation trials and the training/evaluation subset division follow the dataset's guidelines. To evaluate performance on long and short utterances, utterances of the evaluation set were cropped to lengths of 3.59 s (59,049 samples) and 2.05 s (32,805 samples); we took the center part of each utterance to compose the evaluation sets.
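The center-cropping step can be sketched as follows (the function name is illustrative; sample counts are from the text, assuming 16 kHz audio):

```python
def center_crop(samples, target_len):
    """Take the center target_len samples of an utterance, e.g.
    59,049 samples for the 3.59 s set or 32,805 for the 2.05 s set.
    Utterances already shorter than the target are left unchanged."""
    if len(samples) <= target_len:
        return samples
    start = (len(samples) - target_len) // 2
    return samples[start:start + target_len]
```

Since the VoxCeleb evaluation utterances are at least 4 s long, both crop lengths always fit inside an evaluation utterance.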

5.2 Experiment configurations

All experiments were conducted using Keras, a Python library, with a TensorFlow backend [18, 19, 20]. The RWCNN-LSTM system with the two modifications described in Section 2 was used for both the teacher and student DNN architectures. The teacher DNN takes as input the raw waveform corresponding to 59,049 samples (3.59 s). It comprises one strided convolutional layer with a stride of 3 and six residual convolution blocks that do not reduce the length of the input sequence (the residual block is identical to that employed in [8]). After each residual convolution block, a max pooling layer with a stride of 3 is applied. The output shape of the last convolution block is (27, 512), where 27 is the sequence length and 512 is the number of kernels in the last convolutional layer. The sequence length 27 is derived from 59,049 / (3 × 3^6), where 59,049 is the number of input samples, the factor of 3 accounts for the strided convolution, and 3^6 accounts for the six max poolings. The GRU layer has 512 units and the two fully-connected layers have 1,024 nodes each. The multi-step training proposed in [8, 21] is used for training the teacher DNN. The weights of the teacher DNN are frozen when the student DNN is trained.

The student DNN is initialized using the weights of the teacher DNN, as this process has been shown to ease training [22]. The architecture of the student DNN is identical to that of the teacher DNN, except that it takes as input a raw waveform of 32,805 samples (2.05 s), which means that the output shape of its last convolution block is (15, 512).
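The sequence lengths above follow from the stride bookkeeping: one strided convolution with stride 3 followed by six max poolings with stride 3 divides the input length by 3^7 = 2,187. A short check of both cases, and of the ~130 ms phonetic-segment claim:

```python
def gru_timesteps(n_samples, conv_stride=3, n_pools=6, pool_stride=3):
    """Sequence length at the GRU input for the RWCNN-GRU front-end:
    one strided convolution followed by n_pools max pooling layers."""
    length = n_samples // conv_stride
    for _ in range(n_pools):
        length //= pool_stride
    return length

# Each GRU timestep therefore covers 3^7 = 2,187 samples, i.e. ~137 ms
# at 16 kHz, close to the ~130 ms phonetic segment cited in the paper.
SAMPLES_PER_STEP = 3 ** 7
SEGMENT_MS = SAMPLES_PER_STEP / 16000 * 1000
```

Running this gives 27 timesteps for the teacher's 59,049-sample input and 15 for the student's 32,805-sample input, matching the (27, 512) and (15, 512) shapes stated in the text.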

Stochastic gradient descent with a learning rate of 0.001 and a momentum of 0.9 was used as the optimizer for training the teacher DNN; the same optimizer with a learning rate of 0.01 was used for the student DNN.

5.3 Results and analysis

The baseline performances are presented in Table 1. Without duration restriction, the EER was 7.51%. The EER increased by 46% relative when the duration of the evaluation utterances was reduced from 3.59 s to 2.05 s (8.72% to 12.80%). Using short utterances in the training phase, one of the well-known approaches to short utterance compensation, yielded only a 5% relative improvement in EER. This is presumably because the short-utterance duration considered herein is less than that used in other text-independent studies (configurations of 5 or 10 s are typical).


System                     full-length eval   3.59 s eval   2.05 s eval
RWCNN-GRU (3.59 s train)   7.51               8.72          12.80
RWCNN-GRU (2.05 s train)   -                  -             12.08


Table 1: Performance of the baseline systems with different durations. “Full-length eval” corresponds to the use of various length utterances without modification. The numbers represent EERs (%)

Table 2 presents the results of the proposed approaches. Conventional TS learning, which uses the KL-divergence loss on the output layer, did not show a noticeable improvement. The proposed method that directly compares the speaker embedding layers (using only the 'dist' term in Equation 2) performed better, with EERs of 10.98% and 10.80% for the mean squared error and cosine similarity distance metrics, respectively. The best result was achieved by using both the KL-divergence of the output layer and the cosine similarity of the speaker embedding layer, which compensated for more than 65% of the performance degradation caused by shortening the input utterance. We interpret the additional gain from comparing both the output and the speaker embedding in light of Zhang et al.'s research [16], which suggests that a compensated feature can become similar to that of the long utterance without a corresponding increase in its discriminative power. In this view, comparing speaker embeddings makes the embedding of the student DNN equivalent to that of the teacher DNN, while the KL-divergence between output layers helps maintain its discriminative power.


Systems                                  EER (%)
Output (KL-Div) (original TS)            12.46
Embedding (MSE)                          10.98
Embedding (Cos Sim)                      10.80
Embedding (Cos Sim) + Output (KL-Div)    10.08


Table 2: Evaluation of the proposed systems using the modified 2.05 s evaluation set. "Embedding" and "Output" refer to the layers compared between the teacher and student networks; values in brackets indicate the metric.

6 Discussion and future work

In this paper, we proposed a text-independent short utterance speaker verification system that operates on utterances with a duration of 2.05 s. The proposed system does not transform the utterance-level feature of the short utterance as in conventional approaches; rather, it directly extracts compensated speaker embeddings from short utterances by focusing on phonetic-level compensation. This is because we expected that the key to compensating short utterances lies in the phonetic information, whose absence causes the uncertainty of utterance-level features. To process phonetic information, phonetic-level features representing segments of approximately 130 ms were extracted using a CNN and then aggregated to the utterance level using a GRU layer. The effectiveness of the defined phonetic-level features was indirectly demonstrated by the performance improvement of the speaker verification system with short utterance compensation. In future work, we will analyze the information contained in phonetic-level features and construct phonetic-level features using speech recognition systems. After establishing a clear definition of phonetic-level features, we aim to develop a system that can simultaneously compensate for short utterances of various lengths, rather than a single specific length (2.05 s was considered herein).


  • [1] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
  • [3] R. Saeidi and P. Alku, “Accounting for uncertainty of i-vectors in speaker recognition using uncertainty propagation and modified imputation,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [4] H. Yamamoto and T. Koshinaka, “Denoising autoencoder-based speaker feature restoration for utterances of short duration,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [5] I. Yang, H. Heo, S. Yoon, and H. Yu, “Applying compensation techniques on i-vectors extracted from short-test utterances for speaker verification using deep neural network,” in Proc. ICASSP. IEEE, 2017.
  • [6] G. Peterson and I. Lehiste, “Duration of syllable nuclei in english,” The Journal of the Acoustical Society of America, vol. 32, no. 6, pp. 693–703, 1960.
  • [7] M. Ordin and L. Polyanskaya, “Acquisition of speech rhythm in a second language by learners with rhythmically different native languages,” The Journal of the Acoustical Society of America, vol. 138, no. 2, pp. 533–544, 2015.
  • [8] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, “Avoiding speaker overfitting in end-to-end dnns using raw waveform for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3583–3587, 2018.
  • [9] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, “A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5349–5353.
  • [10] H. Muckenhirn, M. Magimai-Doss, and S. Marcel, “Towards directly modeling raw speech signal for speaker verification using cnns,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4884–4888.
  • [11] A. L. Maas, A. Y. Hannun, and A. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, 2013, vol. 30, p. 3.
  • [12] J. Li, R. Zhao, J. Huang, and Y. Gong, “Learning small-size dnn with output-distribution-based criteria,” in Fifteenth annual conference of the international speech communication association, 2014.
  • [13] J. Li, R. Zhao, Z. Chen, C. Liu, X. Xiao, G. Ye, and Y. Gong, “Developing far-field speaker system via teacher-student learning,” arXiv preprint arXiv:1804.05166, 2018.
  • [14] J. Kim, M. El-Khamy, and J. Lee, “Bridgenets: Student-teacher transfer learning based on recursive neural networks and its application to distant speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5719–5723.
  • [15] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, 2008.
  • [16] J. Zhang, N. Inoue, and K. Shinoda, “I-vector transformation using conditional generative adversarial networks for short utterance speaker verification,” Proceedings of INTERSPEECH, Hyderabad, India, 2018.
  • [17] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in Interspeech, 2017.
  • [18] F. Chollet et al., “Keras,” 2015.
  • [19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2015.
  • [20] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
  • [21] H. S. Heo, J. W. Jung, I. H. Yang, S. H. Yoon, and H. J. Yu, “Joint training of expanded end-to-end DNN for text-dependent speaker verification,” Proc. Interspeech 2017, pp. 1532–1536, 2017.
  • [22] R. Pang, T. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C. Chiu, “Compression of end-to-end models,” 2018.