Automatic speaker verification (ASV) [Tomi_summaryasv2010] aims to determine whether two speech segments are from the same speaker or not. It finds applications in forensics, surveillance, access control, and home electronics.
While the field has long been dominated by approaches such as i-vectors [Dehak_ivector2011], the focus has recently shifted to non-linear deep neural networks (DNNs). They have been found to surpass previous solutions in many cases.
Representative DNN approaches include d-vector [Variani_dvector2014], deep speaker [Baidu_deepspeaker2017] and x-vector [Snyder_xvector2018]. As illustrated in Figure 1, DNNs are used to extract a fixed-sized speaker embedding from each utterance. These embeddings can then be used for speaker comparison with a back-end classifier. The network input and output consist of a sequence of acoustic feature vectors and a vector of speaker posteriors, respectively. The DNN learns the input-output mapping through a number of intermediate layers, including temporal pooling (necessary for the extraction of a fixed-sized embedding). A number of improvements to this core framework have been proposed, including hybrid frame-level layers [Snyder_extended_xvector2019], use of multi-task learning [You_multitask_xvector2019] and alternative loss functions [Li_angular_softmax2018], to name a few. In addition, practitioners often use external data [Ko_rir2017, Snyder_musan2015] to augment training data. This forces the DNN to extract speaker-related attributes regardless of input perturbations.
While a substantial amount of work has been devoted to improving DNN architectures, loss functions, and data augmentation recipes, the same cannot be said about acoustic features. There are, however, at least two important reasons to study feature extraction. First, data-driven models can only be as good as their input data, that is, the features. Second, in collaborative settings, it is customary to fuse several ASV systems. These systems should not only perform well in isolation, but be sufficiently diverse as well. One way to achieve diversity is to train systems with different features.
The acoustic features used to train deep speaker embedding extractors are typically standard mel-frequency cepstral coefficients (MFCCs) or intermediate representations needed in MFCC extraction: raw spectrum [Nagrani_vox1_2017], mel-spectrum or mel-filterbank outputs. There are a few exceptions where the feature extractor is also learnt as part of the DNN architecture (e.g. [Ravanelli_sincnet2018]), although their empirical performance often falls behind hand-crafted feature extraction schemes. This raises the question of whether deep speaker embedding extractors might be improved by simple plug-and-play of other hand-crafted feature extractors in place of MFCCs. Such methods are abundant in the past ASV literature [Tomi_multitaper2012, Rajan_apgdf2013, Kim_pncc2016], and in the context of related tasks such as spoofing attack detection [Sahid_feature_synthetic2015, Hanilci_feature_reply_attack2015]. An extensive study in the context of DNN-based ASV is, however, missing. Our study aims to fill this gap.
MFCCs are obtained from the power spectrum of a specific time-frequency representation, the short-term Fourier transform (STFT). MFCCs are therefore subject to certain shortcomings of the STFT. They also lack specificity to the short-term phase of the signal. We therefore include a number of alternative features based on the short-term power spectrum and short-term phase. Additionally, we include fundamental frequency and methods that leverage long-term processing beyond a short-time frame. Improvements over MFCCs are often motivated by robustness to additive noise, improved statistical properties, or closer alignment with human perception. The selected 14 features and their categorization, detailed below, are inspired by [Sahid_feature_synthetic2015] and [Hanilci_feature_reply_attack2015]. For generality, we carry out experiments on two widely adopted datasets, VoxCeleb [Nagrani_vox1_2017] and speakers-in-the-wild (SITW) [Mclaren_sitw2016]. To the best of our knowledge, this is the first extensive re-assessment of acoustic features for DNN-based ASV.
2 Feature Extraction Methods
In this section, we provide a comprehensive list of feature extractors with a brief description of each method. Table 1 summarizes the selected feature extractors along with their parameter settings and references to earlier ASV studies.
| Category | Feature (dim.) | Configuration details | Previous work on ASV |
|---|---|---|---|
| Short-term magnitude power spectral features | MFCC (30) | Baseline, no. of FFT coefficients=512 | [Snyder_xvector2018, Snyder_extended_xvector2019] |
| | CQCC (60) | CQCC_v2.0 package (http://www.audio.eurecom.fr/software/CQCC_v2.0.zip) | [Todisco_artf_cqcc2016] |
| | LPCC (30) | LP order=30 | [Jing_lpcc_asv2014] |
| | PLPCC (30) | LP order=30, bark-scale filterbank | [Alam_multitaper_plpcc_2013] |
| | SCFC (30) | No. filters=30 | [Kua_scfc_scmc2010] |
| | SCMC (30) | No. filters=30 | [Kua_scfc_scmc2010] |
| | Multi-taper (30) | MFCC with SWCE windowing, no. tapers=8 | [Tomi_multitaper2012, Alam_multitaper_plpcc_2013] |
| Short-term phase spectral features | MGDF (30) | , , first 30 coeff. from DCT | [Rajan_mgdf2009, Thiruvaran_phase_features2007] |
| | APGDF (30) | LP order=30 | [Rajan_apgdf2013] |
| | CosPhase (30) | First 30 coeff. from DCT | - |
| | CMPOC (30) | , First 30 coeff. from DCT | - |
| Short-term features with long-term processing | MHEC (30) | No. of filters in Gammatone filter bank=20 | [Sadjadi_mhec2015] |
| | PNCC (30) | First 30 coeff. from DCT | [Wang_PNCC_asv2016] |
| Fundamental frequency features | MFCC+pitch (33) | Kaldi pitch extractor, MFCC (30) with pitch (3) | [Adami_prosody_asv2007] |
2.1 Short-term magnitude power spectral features
Mel frequency cepstral coefficients. MFCCs are computed by integrating the STFT power spectrum with overlapped band-pass filters on the mel-scale, followed by log compression and discrete cosine transform (DCT). Following [Tomi_summaryasv2010], a desired number of lower-order coefficients is retained. Standard MFCCs form our baseline features.
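The MFCC steps described above (power spectrum, mel-scale filterbank integration, log compression, DCT) can be sketched as follows. This is a minimal illustration and not the Kaldi implementation used in the experiments. The 16 kHz sampling rate, the triangular filter shape, the number of filters and the helper names are assumptions made for this example; only the 512-point FFT and the 30 output coefficients come from Table 1.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular, overlapping band-pass filters spaced evenly on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return x @ basis.T

def mfcc(frames, sample_rate=16000, n_fft=512, n_filters=30, n_ceps=30):
    """frames: array of shape (n_frames, frame_length) holding raw speech frames."""
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, n_fft)) ** 2      # STFT power spectrum
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return dct_ii(np.log(mel_energy + 1e-10), n_ceps)             # log compression + DCT
```

For 25 ms frames at the assumed 16 kHz rate, `mfcc(frames)` with `frames` of shape `(n_frames, 400)` returns an array of shape `(n_frames, 30)`.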
Multi-taper mel frequency cepstral coefficients (Multi-taper). Viewing each short-term frame of speech as a realization of a random process, the windowed STFT used in MFCC extraction is known to have high variance. To alleviate this, a multi-taper spectrum estimator is adopted [Tomi_multitaper2012]. It uses several window functions (tapers) to obtain a low-variance power spectrum estimate, given by $\hat{S}(f) = \sum_{j=1}^{K} \lambda(j) \left| \sum_{n=0}^{N-1} w_j(n)\, x(n)\, e^{-i 2\pi n f / N} \right|^2$. Here, $x(n)$ denotes the $N$ samples of the frame, $w_j(n)$ is the $j$-th taper (window) and $\lambda(j)$ is its corresponding weight. The number of tapers, $K$, is an integer (typically between 4 and 8). There are a number of alternative taper sets to choose from: Thomson window [Thomson_multitaper1982], sinusoidal model (SWCE) [Hansson_multitaper_swce2009] and multi-peak [Hansson_multitaper_multipeak1995]. In this study, we chose SWCE. A detailed introduction to this spectrum estimator, with experiments on conventional ASV, can be found in [Tomi_multitaper2012].
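As an illustration of the multi-taper idea, the sketch below forms a weighted sum of several tapered periodograms of one frame. It uses orthonormal sine tapers with uniform weights, which is a simplifying assumption: the SWCE weighting used in this study is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def sine_tapers(n_samples, n_tapers):
    """Orthonormal sine tapers, one taper per row."""
    n = np.arange(1, n_samples + 1)
    j = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (n_samples + 1)) * np.sin(np.pi * j * n / (n_samples + 1))

def multitaper_power_spectrum(frame, n_tapers=8, n_fft=512, weights=None):
    """Weighted sum of the power spectra obtained with each taper."""
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(frame.size, n_tapers)
    if weights is None:
        # Uniform weights (assumption); SWCE would assign taper-dependent weights.
        weights = np.full(n_tapers, 1.0 / n_tapers)
    eigenspectra = np.abs(np.fft.rfft(tapers * frame, n_fft, axis=1)) ** 2
    return np.asarray(weights) @ eigenspectra
```

Replacing the single-window power spectrum in an MFCC pipeline with this estimate, and keeping the remaining steps unchanged, gives multi-taper MFCCs.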
Linear prediction cepstral features. An alternative to MFCCs for cepstral feature computation is to start from an all-pole [Makhoul_lp_clasic1975] representation of the signal. Linear prediction cepstral coefficients (LPCCs) are derived from the linear prediction coefficients (LPCs) by a recursive operation [Rabiner_basicspeech1993]. The same method applies to perceptual LPCCs (PLPCCs), which additionally apply a series of perceptual processing steps at the initial stage [Hermansky_plp_classic1990].
Spectral subband centroid features. Spectral subband centroid based features were introduced and investigated in statistical ASV [Kua_scfc_scmc2010]. We consider two types of spectral centroid features: spectral centroid magnitude (SCM) and subband centroid frequency (SCF). They are computed as weighted averages of the normalized energy of the subband magnitude and frequency, respectively. SCFs are used directly as SCF coefficients (SCFCs), while log compression and DCT are applied to SCMs to obtain SCM coefficients (SCMCs). For more details, see [Kua_scfc_scmc2010].
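One possible reading of the two centroid quantities is sketched below: SCF as the magnitude-weighted mean frequency of each subband, and SCM as the frequency-weighted mean magnitude. The rectangular, equal-width subbands and the exact normalization are assumptions of this example and may differ from the formulation in [Kua_scfc_scmc2010].

```python
import numpy as np

def subband_centroids(magnitude, freqs, n_subbands=30, eps=1e-10):
    """Return (SCF, SCM) per subband for one frame.

    magnitude: magnitude spectrum of the frame (one value per frequency bin)
    freqs:     centre frequency of each bin in Hz
    """
    magnitude = np.asarray(magnitude, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Equal-width rectangular subbands (assumption made for simplicity).
    edges = np.linspace(0, magnitude.size, n_subbands + 1).astype(int)
    scf = np.empty(n_subbands)
    scm = np.empty(n_subbands)
    for k in range(n_subbands):
        m = magnitude[edges[k]:edges[k + 1]]
        f = freqs[edges[k]:edges[k + 1]]
        weighted = np.sum(f * m)
        scf[k] = weighted / (np.sum(m) + eps)   # centroid frequency of the subband
        scm[k] = weighted / (np.sum(f) + eps)   # centroid magnitude of the subband
    return scf, scm
```

With `freqs = np.fft.rfftfreq(512, d=1.0 / 16000)` and a 257-bin magnitude spectrum, the SCF values could be used directly as SCFCs, while the SCM values would still need log compression and DCT to become SCMCs.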
Constant-Q cepstral coefficients (CQCCs). The constant-Q transform (CQT) was introduced in [YoungBerg_cqt1978]. It has been applied in music signal processing [Schorkhuber_cqt2010], spoofing detection [Todisco_cqcc2016] as well as in ASV [Delgado_icmc2016]. Different from the STFT, the CQT produces a time-frequency representation with variable resolution. The resulting CQT power spectrum is log-compressed and uniformly sampled, followed by DCT to yield CQCCs. Further details can be found in [Todisco_cqcc2016].
2.2 Short-term phase features
Modified group delay function (MGDF). MGDF was introduced in [Murphy_mgdf2003] with application to phone recognition and further applied to speaker recognition [Rajan_mgdf2009]. It is a parametric representation of the phase spectrum, defined as $\tau_m(k) = \mathrm{sign} \cdot \left| \frac{X_R(k) Y_R(k) + X_I(k) Y_I(k)}{|S(k)|^{2\gamma}} \right|^{\alpha}$, where $k$ is the frequency index; $X_R(k)$ and $X_I(k)$ are the real and imaginary parts of the discrete Fourier transform (DFT) of the speech samples $x(n)$; $Y_R(k)$ and $Y_I(k)$ are the real and imaginary parts of the DFT of $n\,x(n)$. $\mathrm{sign}$ is the sign of $X_R(k) Y_R(k) + X_I(k) Y_I(k)$, while $\alpha$ and $\gamma$ are the control parameters; $|S(k)|$ is a smoothed magnitude spectrum. The cepstral-like coefficients used as features are then obtained from the function outputs by log-compression and DCT.
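A minimal sketch of the modified group delay computation for one frame is given here, using the commonly cited form sign * |(X_R Y_R + X_I Y_I) / S^(2*gamma)|^alpha. The cepstral smoothing used to obtain the smoothed spectrum, the lifter length and the function name are assumptions of this example. The two control parameters are left as required arguments, since no particular values are implied.

```python
import numpy as np

def modified_group_delay(frame, alpha, gamma, n_fft=512, n_lifter=8):
    """Modified group delay function of one frame, for bins 0..n_fft/2."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(frame.size)
    X = np.fft.fft(frame, n_fft)        # DFT of x(n)
    Y = np.fft.fft(n * frame, n_fft)    # DFT of n * x(n)

    # Smoothed magnitude spectrum via low-quefrency cepstral liftering (assumption).
    cepstrum = np.fft.ifft(np.log(np.abs(X) + 1e-10)).real
    cepstrum[n_lifter:n_fft - n_lifter + 1] = 0.0
    S = np.exp(np.fft.fft(cepstrum).real)

    half = n_fft // 2 + 1
    numerator = X.real * Y.real + X.imag * Y.imag
    tau = numerator[:half] / S[:half] ** (2.0 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```

As a sanity check, with `alpha=1.0` and `gamma=0.0` the function reduces to the plain group delay weighted by the power spectrum, so a unit impulse delayed by 5 samples yields a constant value of 5 at all frequencies.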
All-pole group delay function (APGDF). An alternative phase representation of the signal was proposed for ASV in [Rajan_apgdf2013]. The group delay function is computed by differentiating the unwrapped phase of the all-pole spectrum. The main advantage of APGDF over MGDF is that it has fewer control parameters.
Cosine phase function (cosphase). The cosine of the phase has been applied to spoofing attack detection [Sahid_feature_synthetic2015, Wu_mgdf_synthetic2013]. The unwrapped DFT phase is first normalized to $[-1, 1]$ using the cosine operation, and then processed with DCT to derive the cosphase coefficients.
Constant-Q magnitude-phase octave coefficients (CMPOCs). Unlike the previous DFT-based features, CMPOCs utilize the CQT. The magnitude-phase spectrum (MPS) from CQT is computed as , where and denote magnitude and phase of CQT. The MPS is then segmented according to the octave, and processed with log-compression and DCT to derive CMPOCs. CMPOCs have so far been studied for playback attack detection [Yang_cmpoc2018].
2.3 Short-term features with long-term processing
We use the term ‘long-term processing’ to refer to methods that use information across a longer context of consecutive frames.
Mean Hilbert envelope coefficients (MHECs). Proposed in [Sadjadi_mhec2015] for i-vector based ASV, MHEC applies Gammatone filterbanks on the speech signal. The output of each channel of the filterbank is then processed to compute temporal envelopes as , where is the so-called ‘analytical signal’ and denotes its Hilbert transform [Cohen_tfanalysis_textbook1995]. and represent time and channel index respectively. The envelopes are low-pass filtered, framed and averaged to compute energies. Finally, the energies are transformed to cepstral-like coefficients by log-compression and DCT. More details can be found in [Sadjadi_mhec2015].
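The temporal envelope step can be illustrated with a short sketch that builds the analytic signal of one filterbank channel and takes its magnitude. Taking the magnitude (rather than, for example, the squared magnitude) is an assumption of this example, and the Gammatone filtering, low-pass smoothing, framing and DCT stages of MHEC are not shown.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal x + i * H{x}, computed by zeroing negative frequencies."""
    x = np.asarray(x, dtype=float)
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0                        # keep the DC component once
    if n % 2 == 0:
        gain[n // 2] = 1.0               # keep the Nyquist component once
        gain[1:n // 2] = 2.0             # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def hilbert_envelope(x):
    """Temporal envelope of a (band-pass filtered) channel signal."""
    return np.abs(analytic_signal(x))
```

For a pure cosine with an integer number of cycles in the analysis window, the envelope is constant and equal to the cosine amplitude, which is a convenient check of the implementation.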
Power-normalized cepstral coefficients (PNCCs). To generate PNCCs, the input waveform is first processed by Gammatone filterbanks and fed into a cascade of non-linear time-varying operations, aimed at suppressing the impact of noise and reverberation. Mean power normalization is performed at the output of this series of operations to minimize the potentially detrimental effect of amplitude scaling. Cepstral features are then obtained by power-law non-linearity and DCT. PNCCs have been applied to speech recognition [Kim_pncc2016] as well as i-vector based ASV [Wang_PNCC_asv2016].
2.4 Fundamental frequency features
Aside from the various types of features above, an initial investigation of the effect of harmonic information was conducted. For simplicity and comparability, the pitch extraction algorithm from [Pegah_pitchasr2014], based on the normalized cross correlation function (NCCF), was employed to extract 3-dimensional pitch vectors. They are then appended to MFCCs. In the rest of the paper, we refer to this feature as MFCC+pitch.
We trained the neural network on the dev part of VoxCeleb1 [Nagrani_vox1_2017], consisting of 1211 speakers. We used two evaluation sets, one for a matched train-test condition and the other for a relatively mismatched condition. The first one was from the test part of the same VoxCeleb1 dataset, consisting of 40 speakers, and the other one was from the development part of SITW under the “core-core” condition, consisting of 119 speakers. The VoxCeleb1 evaluation consists of 18860 genuine trials and the same number of imposter trials. The corresponding SITW partition, on the other hand, has 2597 genuine and 335629 imposter trials. We will refer to the two datasets as ‘Voxceleb1-E’ and ‘SITW-DEV’, respectively.
3.2 Feature configuration
We extracted all features with a frame length of 25 ms and a frame shift of 10 ms. We applied a Hamming [Harris_windowing_classic1980] window in all cases except for the multi-taper feature. In Table 1, we describe the associated control parameters (if applicable) and the implementation details for each feature extractor. As for post-processing, we applied energy-based speech activity detection (SAD) and utterance-level cepstral mean normalization (CMN) [Tomi_summaryasv2010], except for MFCC+pitch, where the additional components contain the probability of voicing (POV).
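The framing and post-processing steps can be sketched as follows: 25 ms Hamming-windowed frames with a 10 ms shift, a simple energy-based speech activity decision, and utterance-level mean normalization. The 16 kHz sampling rate and the SAD rule (keep frames within a fixed margin of the loudest frame) are assumptions of this example; the energy-based SAD actually used in the experiments may follow a different rule.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
    """Split a waveform into overlapping Hamming-windowed frames."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    n_frames = 1 + (signal.size - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def energy_sad(frames, margin_db=30.0):
    """Boolean mask of frames whose energy is within margin_db of the maximum.

    margin_db is a hypothetical threshold chosen for illustration only.
    """
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    return energy_db > energy_db.max() - margin_db

def cmn(features):
    """Utterance-level cepstral mean normalization (subtract the per-dimension mean)."""
    return features - features.mean(axis=0, keepdims=True)
```

A typical use would be to compute features per frame, keep the rows selected by `energy_sad`, and then apply `cmn` to the retained feature matrix.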
3.3 ASV system configuration
To compare different feature extractors, we trained an x-vector system for each of them, as illustrated in Figure 1. We replicated the DNN configuration from [Snyder_xvector2018]. We trained the model using the data described above without any data augmentation. This helps to assess the inherent robustness of individual features. We extracted a 512-dimensional speaker embedding for each test utterance. The embeddings were length-normalized and centered before being transformed using a 200-dimensional linear discriminant analysis (LDA), followed by scoring with a probabilistic linear discriminant analysis (PLDA) [Ioffe_plda2006] classifier.
The verification accuracy was measured by equal error rate (EER) and minimum detection cost function (minDCF) with target speaker prior and two costs . Detection error trade-off (DET) curves for all feature extraction methods are also presented. We used Kaldi (https://github.com/kaldi-asr/kaldi) for computing EER and minDCF, and BOSARIS (https://sites.google.com/site/bosaristoolkit/) for plotting the DET curves.
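For readers unfamiliar with the two metrics, the sketch below computes EER and a normalized minDCF from lists of genuine and impostor scores by sweeping the decision threshold over all observed scores. It is a simple quadratic-time illustration and not the Kaldi implementation used in the experiments. The target prior and the two costs are required arguments of the hypothetical `compute_min_dcf` function, since no particular values are implied here.

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores):
    """Miss and false-alarm rates at every candidate threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.concatenate([np.sort(np.concatenate([genuine, impostor])), [np.inf]])
    # A trial is accepted when its score is at or above the threshold.
    miss = np.array([(genuine < t).mean() for t in thresholds])
    false_alarm = np.array([(impostor >= t).mean() for t in thresholds])
    return miss, false_alarm

def compute_eer(genuine_scores, impostor_scores):
    """Equal error rate: the point where miss and false-alarm rates coincide."""
    miss, false_alarm = error_rates(genuine_scores, impostor_scores)
    best = np.argmin(np.abs(miss - false_alarm))
    return (miss[best] + false_alarm[best]) / 2.0

def compute_min_dcf(genuine_scores, impostor_scores, p_target, c_miss, c_fa):
    """Minimum detection cost, normalized by the cost of the best trivial system."""
    miss, false_alarm = error_rates(genuine_scores, impostor_scores)
    cost = c_miss * p_target * miss + c_fa * (1.0 - p_target) * false_alarm
    return cost.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

For example, with genuine scores `[0.2, 0.6, 0.7, 0.9]` and impostor scores `[0.1, 0.3, 0.4, 0.65]`, one of each class falls on the wrong side of the threshold 0.6, so `compute_eer` returns 0.25.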
As a sanity check, we first conducted a preliminary experiment on the effectiveness of dynamic features, with results reported in Table 2. We extended the baseline by appending delta and double-delta coefficients to the static MFCCs. According to the table, adding delta features did not improve performance. This might be because the frame-level network layers already capture information across neighboring frames. In the remainder, we utilize static features only.
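Delta and double-delta coefficients of the kind mentioned above are commonly computed with a regression over neighboring frames. The sketch below uses the standard regression formula with a context of two frames on each side and edge replication at the utterance boundaries; the window size and the function name are assumptions of this example rather than the exact configuration of the experiment.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta features; features has shape (n_frames, dim)."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Repeat the first and last frame so that every frame has full context.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features)
    for n in range(1, window + 1):
        ahead = padded[window + n:window + n + n_frames]
        behind = padded[window - n:window - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

def add_dynamic_features(static):
    """Stack static, delta and double-delta features along the feature axis."""
    d = delta(static)
    dd = delta(d)
    return np.hstack([static, d, dd])
```

Applied to 30-dimensional static MFCCs, `add_dynamic_features` yields 90-dimensional vectors per frame.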
Table 3 summarizes the results for both corpora. On Voxceleb1-E, we found that MFCCs outperform most of the alternative features in terms of EER, with SCMCs as the only exception. This may indicate the effectiveness of information related to subband energies. However, SCFCs did not outperform SCMCs, which suggests that the subband magnitudes may be more important than their frequencies. Concerning phase spectral features, MGDFs were behind the other features. This might be due to sub-optimal control parameter settings. CMPOCs reached a 27.6% relative reduction in EER compared to CQCCs, which highlights the effectiveness of phase features in the CQT-based feature category. Moreover, while competitive EER and the best minDCF can be observed for MFCC+pitch, LPCCs and PLPCCs did not perform as well. This indicates the potential importance of explicit harmonic information. This finding is further supported by the SITW-DEV results. A similar observation can be made for multi-taper MFCCs, which reaffirms the efficacy of multi-taper windowing reported for conventional ASV.
Focusing on SITW-DEV, the most competitive features include those from the phase and ‘long-term’ categories. PNCCs reached the best performance in both metrics, outperforming baseline MFCCs by 25.1% relative in terms of EER. This might be due to the robustness-enhancing operations integrated in the pipeline, recalling that SITW-DEV represents more challenging and mismatched data conditions. While not outperforming the baseline on Voxceleb1-E, SCFCs yielded competitive numbers along with SCMCs, which further indicates the usefulness of subband information. The best performance within the phase category, obtained by cosphase, reflects the advantage of the cosine normalizer relative to the group delay function. An additional benefit of cosphase over group delay features is that it has fewer control parameters.
Next, we addressed simple equal-weighted linear score fusion. We considered two sets of features: 1) MFCCs, SCMCs and Multi-taper; 2) MFCCs, cosphase and PNCCs. The former set of extractors shares similar spectral operations, while the latter covers more diverse speech attributes. Results are presented at the bottom of Table 3. On Voxceleb1-E, we can see further improvement for both fused systems, especially for the first one, which reached the lowest overall EER, outperforming the baseline by 16.3% relative. On SITW-DEV, however, the best performance was still held by a single system. This indicates that simple equal-weighted linear score-level fusion may be more effective for relatively matched conditions.
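Equal-weighted linear score fusion amounts to averaging, trial by trial, the scores produced by the individual systems. A minimal sketch is given below; the function name and the example scores are hypothetical, and no score normalization or calibration is applied before averaging.

```python
import numpy as np

def fuse_scores(system_scores):
    """Equal-weighted linear fusion.

    system_scores: sequence of score lists, one per system, all covering
    the same trials in the same order. Returns one fused score per trial.
    """
    scores = np.vstack([np.asarray(s, dtype=float) for s in system_scores])
    return scores.mean(axis=0)

# Hypothetical scores of three systems on the same two trials.
fused = fuse_scores([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Since all weights are equal, the fused score is dominated by whichever system has the widest score range, which is one reason why more elaborate fusion schemes usually calibrate the individual scores first.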
Finally, the DET curves for all systems, including the fused ones, are shown in Figure 2; they agree with the findings in Table 3. Concerning Voxceleb1-E, the two fusion systems are in general closer to the origin than any of the single systems, which is consistent with the observation above. Concerning SITW-DEV, PNCCs confirm their superior performance, but towards the bottom-right region both spectral centroid features stand out, which may indicate that they favor systems that are less strict on false alarms.
This paper presents an extensive re-assessment of various acoustic feature extractors for DNN-based ASV systems. We evaluated them on VoxCeleb1 and SITW, covering matched and mismatched conditions. We achieved improvements over MFCCs, especially on SITW, which represents a more mismatched testing condition. We also found that alternative methods such as spectral centroids, group delay function, and integrated noise suppression can be useful for DNN systems. They should thus be revisited and extended to more scenarios in future work. Finally, we made an initial attempt at score-level fusion, obtaining competitive performance, which indicates the potential of such an approach.
This work was partially supported by Academy of Finland (project 309629) and Inria Nancy Grand Est.