Phonetic-attention scoring for deep speaker features in speaker verification

11/08/2018 · by Lantian Li, et al. · Tsinghua University

Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network trained to discriminate speakers from short speech segments. By pooling the frame-level features, utterance-level representations, called d-vectors, can be derived and used in the automatic speaker verification (ASV) task. This simple average pooling, however, is inherently sensitive to the phonetic content of the utterance. An interesting idea borrowed from machine translation is the attention-based mechanism, where the contribution of an input word to the translation at a particular time is weighted by an attention score. This score reflects the relevance of the input word to the present translation. We can use the same idea to align utterances with different phonetic contents. This paper proposes a phonetic-attention scoring approach for d-vector systems. In this approach, an attention score is computed for each frame pair. This score reflects the similarity of the two frames in phonetic content, and is used to weight the contribution of this frame pair in the utterance-based scoring. The new scoring approach emphasizes frame pairs with similar phonetic content, which essentially provides a soft alignment for utterances with arbitrary phonetic content. Experimental results show that, compared with naive average pooling, the phonetic-attention scoring approach delivers consistent performance improvement in both text-dependent and text-independent ASV tasks.




1 Introduction

Automatic speaker verification (ASV) is an important biometric authentication technology with a broad range of applications. Current ASV approaches can be categorized into two groups: the statistical model approach and the neural model approach. The most prominent statistical models for ASV include the Gaussian mixture model-universal background model (GMM-UBM) [1], the joint factor analysis model [2] and the i-vector model [3, 4, 5]. As for the neural model approach, Ehsan et al. proposed the first successful implementation [6], where frame-level speaker features were extracted from a deep neural network (DNN), and utterance-level speaker representations (‘d-vectors’) were derived by averaging the frame-level features, i.e., average pooling. This work was followed by numerous researchers [7, 8, 9, 10].

The neural-based approach is essentially a feature learning approach, i.e., learning frame-level speaker features from raw speech. In previous work, we found that with this feature learning, speakers can be discriminated by very short speech segments [10], either a word or a cough [11]. However, with the conventional d-vector pipeline, this strong frame-level discriminative power cannot be fully exploited at the utterance level, due to the simple average pooling. This shortcoming was quickly identified by researchers, and hence almost all studies after Ehsan et al. [6] chose to learn representations of segments rather than frames, the so-called end-to-end approach [8, 12, 13, 14]. However, frame-level feature learning possesses its own advantages in both generalizability and ease of training [15], and aligns with our long-term goal of deciphering speech signals [16]. An ideal approach, therefore, is to keep the feature learning framework but solve the problem caused by average pooling.

To understand the problem of average pooling, first notice that feature pooling is equivalent to score pooling. To make the presentation clear, we consider the simple inner product score:

$$s(u, u') = \langle v, v' \rangle, \quad v = \frac{1}{T}\sum_{t=1}^{T} f_t, \quad v' = \frac{1}{T'}\sum_{t'=1}^{T'} f'_{t'}$$

where $u$ and $u'$ are the two utterances in test, $t$ and $t'$ denote frames, $f_t$ and $f'_{t'}$ are frame-level speaker features, and $v$ and $v'$ are the utterance-level d-vectors. A simple rearrangement leads to:

$$s(u, u') = \frac{1}{T T'} \sum_{t=1}^{T} \sum_{t'=1}^{T'} \langle f_t, f'_{t'} \rangle$$

This formula indicates that with average pooling, the utterance-level score $s(u, u')$ is the average of the frame-level scores $\langle f_t, f'_{t'} \rangle$. Most importantly, the scores of all the frame pairs are equally weighted, which is obviously suboptimal, as the reliability of scores from different frame pairs may be substantially different. In particular, a pair of frames in the same phonetic context may result in a much more reliable frame-level score than a pair in different phonetic contexts, as demonstrated by the fact that text-dependent ASV generally outperforms text-independent ASV. This indicates that a key problem of the average pooling method is that phonetic variation may cause serious performance degradation. This partly explains why d-vector systems are mostly successful in text-dependent tasks.
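The equivalence of feature pooling and score pooling can be checked numerically. The snippet below uses random vectors as stand-in speaker features (the frame counts and dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 frames of utterance u, 8-dim speaker features
Y = rng.standard_normal((7, 8))   # 7 frames of utterance u'

# Feature pooling: inner product of the two d-vectors (frame averages).
score_pooled = np.dot(X.mean(axis=0), Y.mean(axis=0))

# Score pooling: average of all frame-pair inner products.
score_pairs = np.mean(X @ Y.T)

# The two scores coincide, confirming the rearrangement above.
assert np.isclose(score_pooled, score_pairs)
```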

A simple idea is to discriminate frame pairs with similar and different phonetic content, and put more emphasis on frame pairs covering similar phones. This can be formulated by:

$$s(u, u') = \sum_{t=1}^{T} \sum_{t'=1}^{T'} w(t, t') \langle f_t, f'_{t'} \rangle \quad (1)$$

where $w(t, t')$ represents the weight for the frame pair $(t, t')$, computed from the similarity of their phonetic content. This is essentially a soft-alignment approach that aligns two utterances with respect to phonetic content, where $w(t, t')$ represents the alignment degree of frames $t$ and $t'$, derived from the phonetic information of the two frames.

The idea of soft alignment was motivated by the attention mechanism in neural machine translation (NMT) [17], where the contribution of an input word to the translation at a particular time is weighted by an attention score, and this attention score reflects the relevance of the input word to the present translation. We therefore name our new scoring model in Eq. (1) phonetic-attention scoring. By paying more attention to frame pairs with similar phonetic content, this new scoring approach essentially turns a text-independent task into a text-dependent task, hence partly solving the problem that phone variation causes for naive average pooling.

In the next section, we briefly describe the attention mechanism. The phonetic-attention scoring approach is presented in Section 3, and related work is discussed in Section 4. The experiments are reported in Section 5, and the paper is concluded in Section 6.

2 Attention mechanism

The attention mechanism was first proposed in [17] within the framework of sequence-to-sequence learning, and was applied to NMT. Recently, this model has been widely used in many sequential learning tasks, e.g., speech recognition [18]. In a nutshell, the attention approach looks at all the input elements (e.g., words in a sentence or frames in an utterance) at each decoding time, and computes an attention weight for each element that reflects the relevance of that element to the present decoding. Based on these attention weights, the information of the input elements is collected and used to guide decoding. As shown in Fig. 1, at decoding time $t$, the attention weight $\alpha_{t,i}$ is computed for each input element $x_i$ (more precisely, the annotation of $x_i$, denoted by $h_i$), formally written as:

$$\alpha_{t,i} = \eta\big( v(s_t, h_i) \big)$$

where $s_t$ is the decoding status at time $t$, and $v(\cdot, \cdot)$ is a value function that can be in any form. $\eta$ is a normalization function (usually softmax) that ensures $\sum_i \alpha_{t,i} = 1$. The decoding for $y_t$ is then formally written as:

$$y_t = g\Big( s_t, \sum_i \alpha_{t,i} h_i \Big)$$

where $g(\cdot)$ is the decoding model. In the conventional setting, $v$ is a parametric function, e.g., a neural net, whose parameters are jointly optimized with other parts of the model, e.g., the decoding model $g$.
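The attention computation can be sketched in a few lines. The annotations, decoder state, and dot-product value function below are toy assumptions for illustration, not the learned model of [17]:

```python
import numpy as np

def softmax(e):
    """Normalization function eta: maps scores to weights summing to 1."""
    e = e - e.max()
    z = np.exp(e)
    return z / z.sum()

# Toy annotations h_i and a current decoding state s_t (hypothetical values).
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three input annotations
s_t = np.array([1.0, 0.0])                           # decoding state at time t

# Value function v(s_t, h_i): here a simple dot product (a hand-picked choice).
e = H @ s_t
alpha = softmax(e)   # attention weights alpha_{t,i}
c_t = alpha @ H      # weighted sum of annotations, fed to the decoding model g

assert np.isclose(alpha.sum(), 1.0)
```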

Figure 1: Attention mechanism in sequence to sequence model.

3 Phonetic-attention scoring

We borrow the architecture shown in Fig. 1 to build our phonetic-attention model in Eq. (1). Since our purpose is to align two existing sequences rather than sequence-to-sequence generation, the structure can be largely simplified. First, the recurrent connections in both the input and output sequences can be omitted. Second, in Fig. 1 the value function $v$ is learned from data; for our scoring model, we have a clear goal of aligning utterances by phonetic content, so we can design the value function by hand (although learning it with this prior may help). This leads to the phonetic-attention model shown in Fig. 2.

Figure 2: Diagram of the phonetic-attention model.

The architecture and the associated scoring method can be summarized into the following four steps:

(1) For both the enrollment and test utterances, compute the frame-level speaker features from a speaker recognition DNN, denoted by $\{x_i\}$ and $\{y_j\}$, respectively. Additionally, compute the frame-level phonetic features from a speech recognition DNN, denoted by $\{p_i\}$ and $\{q_j\}$.

(2) For each frame $j$ in the test utterance, compute the attention weight $\alpha_{j,i}$ for each frame $i$ in the enrollment utterance:

$$\alpha_{j,i} = \eta\big( \mathrm{KL}^{-1}(q_j, p_i) \big)$$

where $\mathrm{KL}^{-1}$ denotes the reciprocal of the KL distance between the two phonetic features, and $\eta$ normalizes the weights over $i$. This step is represented by the red dashed line in Fig. 2.

(3) Compute the matching score of frame $j$ in the test utterance as follows:

$$s_j = \sum_i \alpha_{j,i} \langle y_j, x_i \rangle$$

This step is represented by the green solid line in Fig. 2.

(4) Compute the matching score of the two utterances by averaging the frame-level matching scores:

$$s = \frac{1}{m} \sum_j s_j$$

where $m$ is the number of frames in the test utterance.
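The four steps above can be sketched as follows. The `kl` helper and the plain NumPy arrays are illustrative assumptions; in the actual system, the speaker features and phone posteriors come from the two DNN front-ends:

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL distance between two phone posterior vectors (step 2)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def phonetic_attention_score(X, Y, P, Q):
    """Score a test utterance (Y, Q) against an enrollment utterance (X, P).

    X, Y: frame-level speaker features (step 1);
    P, Q: frame-level phone posteriors (step 1).
    """
    scores = []
    for y, q in zip(Y, Q):
        # Step 2: attention weights as the normalized reciprocal KL distance.
        w = np.array([1.0 / (kl(q, p) + 1e-10) for p in P])
        w /= w.sum()
        # Step 3: weighted frame-level matching score for this test frame.
        scores.append(float(w @ (X @ y)))
    # Step 4: average over all test frames.
    return float(np.mean(scores))
```

Note that with identical phone posteriors on every frame, the weights become uniform and the score reduces exactly to the average-pooling score, which makes the method a strict generalization of the baseline.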

4 Related work

The attention mechanism has been studied by several authors in ASV, e.g., [13, 19, 20]. However, most of these proposals used the attention mechanism to produce better frame pooling, while we use it to produce a better utterance alignment. In essence, these methods learn which frames should contribute to the speaker embedding, while our approach learns which frame pairs should contribute to the matching score. Moreover, most of these studies do not use phonetic knowledge explicitly, except [13].

Another work relevant to ours is the segmental dynamic time warping (SDTW) approach proposed by Mohamed et al. [21]. This work shares our idea of aligning frame-level speaker features; however, their alignment is based on local temporal continuity, while ours is based on global phonetic content.

5 Experiments

5.1 Data

5.1.1 Training data

The data used to train the d-vector systems is the CSLT-7500 database, which was collected by CSLT@Tsinghua University. It consists of speakers and utterances. The sampling rate is kHz and the precision is -bit. Data augmentation is applied to cover more acoustic conditions, for which the MUSAN corpus [22] is used to provide additive noise, and the room impulse responses (RIRS) corpus [23] is used to generate reverberated samples.

5.1.2 Evaluation data

(1) CIIH: a dataset containing short commands used in the intelligent home scenario. It contains recordings of short commands from speakers, and each command consists of Chinese characters. For each speaker, every command is recorded times, amounting to utterances per speaker. This dataset is used to evaluate the text-dependent (TD) task.

(2) DSDB: a dataset involving digit strings. It contains speakers, each speaking Chinese digit strings. Each string contains Chinese digits, and is about seconds. For each speaker, utterances are randomly sampled as enrollment, and the rest are used for test. This dataset is used to evaluate the text-prompted (TP) task.

(3) ALI-WILD: a dataset collected via the Ali crowdsourcing platform. It covers unconstrained real-world scenarios, and contains speakers and speech segments. We designed two test conditions: a short-duration scenario Ali(S) where the duration of the enrollment is seconds and the test is seconds, and a long-duration scenario Ali(L) where the duration of the enrollment is seconds and the test is seconds. This dataset is used to evaluate the text-independent (TI) task.

5.2 Settings

The DNN model used to produce frame-level speaker features is a 9-layer time-delay neural network (TDNN), where the slicing parameters are {-, -, , +, +}, {-, +}, {}, {-, +}, {}, {-, +}, {}, {-, +}, {}. Except for the last hidden layer, which involves neurons, the size of all other layers is . Once the DNN had been fully trained, -dimensional deep speaker features were extracted from the last hidden layer. The model was trained using the Kaldi toolkit [24]. Based on this model, we built a standard d-vector system with naive average pooling, denoted by Baseline.

The phonetic-attention model requires frame-level phonetic features. We built a DNN-HMM hybrid system using Kaldi following the WSJ S5 recipe. The training used hours of Chinese speech data. The model is a TDNN, and each layer contains nodes. The output layer contains units, corresponding to the number of GMM senones. Once the model was trained, -dimensional phone posteriors were derived from the output layer and used as phonetic features. The phonetic-attention system based on the phone posteriors is denoted by Att-Post. Another type of phonetic feature can be derived from the final affine layer. To compress the size of the feature vector, singular value decomposition (SVD) was applied to decompose the final affine matrix into two low-rank matrices, where the rank was set to . The -dimensional activations were read from the low-rank layer of the decomposed matrix; we call these bottleneck features. The phonetic-attention system based on the bottleneck features is denoted by Att-BN.
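The SVD-based compression can be sketched as follows; the matrix sizes and the rank below are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical final affine matrix W (senones x hidden dim) and a chosen rank r.
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 64))
r = 16

# Decompose W ~= A @ B via truncated SVD: A is (200, r), B is (r, 64).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]
B = Vt[:r]            # the low-rank "bottleneck" layer

# Bottleneck features: activations at the low-rank layer for a hidden vector h.
h = rng.standard_normal(64)
bn = B @ h            # r-dimensional bottleneck feature
assert bn.shape == (r,)
```

Reading activations from the low-rank inner layer rather than the output layer gives a compact, continuous representation of the phonetic content.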

Finally, we built a phone-blind attention system where the attention weight is computed from the speaker feature itself, rather than phonetic features. This approach is similar to the work in [19, 20], though the attention function is not trained. This system is denoted by Att-Spk.

5.3 Results

The results in terms of equal error rate (EER) are shown in Table 1, where the baseline system is based on naive average pooling, while the three attention-based systems use attention models based on different features. For each system, we report results with two frame-level metrics: cosine distance, and cosine distance after LDA. The LDA model was trained on CSLT-7500, and the dimensionality of its projection space was set to . There are four tasks in total: the TD task on CIIH, the TP task on DSDB, the TI short-duration task on Ali(S), and the TI long-duration task on Ali(L). The best performance is marked in bold face.

From these results, it can be seen that on all tasks the attention-based systems outperform the baseline system, indicating that the naive average pooling is indeed problematic. When comparing the three attention-based systems, we find they perform quite differently across tasks. On the TD task CIIH and the TP task DSDB, the phone-blind attention system Att-Spk seems slightly superior, while on the TI tasks Ali(S) and Ali(L), the two phonetic-attention systems are clearly better. This observation is understandable: on the TD and TP tasks, the phonetic content of the enrollment and test utterances is largely identical, so an appropriate alignment can be found easily, even by phone-blind attention. On the TI tasks, however, the phonetic variation is much more complex, so additional phonetic information is required to align the enrollment and test utterances. Finally, comparing the two phonetic-attention systems, Att-BN is consistently better. This indicates that the bottleneck feature is a more compact representation of the phonetic content.

Systems    Metric   CIIH (TD)   DSDB (TP)   Ali(S) (TI)   Ali(L) (TI)
Baseline   Cosine   3.71        1.02        9.24          4.95
           LDA      2.49        0.70        5.84          2.44
Att-Spk    Cosine   3.27        0.95        9.07          4.95
           LDA      2.11        0.65        5.80          2.50
Att-Post   Cosine   3.28        0.97        9.12          4.85
           LDA      2.22        0.69        5.76          2.32
Att-BN     Cosine   3.20        0.98        9.11          4.84
           LDA      2.18        0.70        5.69          2.31
Table 1: Performance in EER(%) of the different systems on the four tasks.

5.4 Analysis

To better understand the different behaviors of the phone-blind attention and the phonetic attention, we plot the alignments they produce on two samples, one from the TD task and one from the TI task, shown in Fig. 3 and Fig. 4. (The observations on the TD and TP tasks are quite similar, so the figure for the TP task is omitted.) It can be seen that on the TD task, the two attention approaches produce similar alignments, though the alignment produced by the phonetic attention is more concentrated. This is not surprising, as the phonetic features are short-term and change more quickly than the speaker features. Actually, this might be a key problem of the present implementation of the phonetic attention: the concentration means fewer frames in one utterance are aligned to each frame of the other utterance, leading to less reliable scores. Nevertheless, the explicit phonetic information does provide much more accurate alignments in the TI scenario, where the phonetic variation is complex and phone-blind attention may produce rather poor alignments. This can be seen in Fig. 4: the aligned segments produced by the phonetic attention show clear sloped patterns, which are more realistic than the flat patterns produced by the phone-blind attention.

Figure 3: Alignment produced by the phone-blind and phonetic attentions on the TD task.
Figure 4: Alignment produced by the phone-blind and phonetic attentions on the TI task.

6 Conclusions

This paper proposed a phonetic-attention scoring approach for d-vector speaker recognition systems. This approach uses frame-level phonetic information to produce a soft alignment between the enrollment and test utterances, and computes the matching score by emphasizing the aligned frame pairs. We tested the method on text-dependent, text-prompted and text-independent tasks, and found that it delivered consistent performance improvement over the baseline system. The phonetic attention was also compared with a naive phone-blind attention; the results showed that the phone-blind attention worked well in text-dependent and text-prompted tasks, but failed in text-independent tasks. Analysis was conducted to explain this observation. In future work, we will study speaker features that change more slowly, e.g., vowel-only features. It would also be interesting to learn the value function.


  • [1] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000.
  • [2] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
  • [3] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [4] Sergey Ioffe, “Probabilistic linear discriminant analysis,” Computer Vision–ECCV, pp. 531–542, 2006.
  • [5] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in ICASSP. IEEE, 2014, pp. 1695–1699.
  • [6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in ICASSP. IEEE, 2014, pp. 4052–4056.
  • [7] Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015.
  • [8] David Snyder, Pegah Ghahremani, Daniel Povey, Daniel Garcia-Romero, Yishay Carmiel, and Sanjeev Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Spoken Language Technology Workshop. IEEE, 2016, pp. 165–170.
  • [9] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
  • [10] Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Dong Wang, “Deep speaker feature learning for text-independent speaker verification,” in Interspeech, 2017, pp. 1542–1546.
  • [11] Miao Zhang, Yixiang Chen, Lantian Li, and Dong Wang, “Speaker recognition with cough, laugh and ‘wei’,” arXiv preprint arXiv:1706.07860, 2017.
  • [12] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in ICASSP. IEEE, 2016, pp. 5115–5119.
  • [13] Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, and Yifan Gong, “End-to-end attention based text-dependent speaker verification,” in Spoken Language Technology Workshop. IEEE, 2016, pp. 171–178.
  • [14] Chunlei Zhang and Kazuhito Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Interspeech, 2017.
  • [15] Dong Wang, Lantian Li, Zhiyuan Tang, and Thomas Fang Zheng, “Deep speaker verification: Do we need end to end?,” arXiv preprint arXiv:1706.07859, 2017.
  • [16] Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Thomas Fang Zheng, “Deep factorization for speech signal,” in ICASSP. IEEE, 2018.
  • [17] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [18] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP. IEEE, 2016, pp. 4960–4964.
  • [19] FA Rezaur rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, “Attention-based models for text-dependent speaker verification,” in ICASSP. IEEE, 2018, pp. 5359–5363.
  • [20] Yi Liu, Liang He, Weiwei Liu, and Jia Liu, “Exploring a unified attention-based pooling framework for speaker verification,” arXiv preprint arXiv:1808.07120, 2018.
  • [21] Mohamed Adel, Mohamed Afify, and Akram Gaballah, “Text-independent speaker verification based on deep neural networks and segmental dynamic time warping,” arXiv preprint arXiv:1806.09932, 2018.
  • [22] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
  • [23] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP. IEEE, 2017, pp. 5220–5224.
  • [24] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in Workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.