1.1 Introduction
Footnote 1: The organization of this chapter is inspired by Nanxin Chen's Center for Language and Speech Processing seminar talk "Advances in speech representation for speaker recognition".
Speech is the main medium we use to communicate with others, and it therefore carries rich information of interest. Upon hearing speech, in addition to identifying its content, it is natural to ask: Who is the speaker? What is the speaker's nationality? What is his or her emotion?
Speaker Recognition is the collection of techniques that either identify or verify the speaker-related information in segments of speech utterances, and Automatic Speaker Recognition is speaker recognition performed by machines. Figure 1.1 gives an overview of the speaker information in speech. Speaker information is embedded in speech, but it is often corrupted by channel effects to some degree. Channel effects can be environmental noise and, more often, recording noise, since automatic speaker recognition is performed on speech recordings. There is also other speaker-related information of interest, such as age, emotion, and language.
This chapter first gives an overview of Automatic Speaker Verification. Several major speaker verification techniques, from the earlier Gaussian Mixture Models to the recent neural models, are then presented.
1.1.1 Speaker Identification vs. Verification
Speaker Recognition is concerned with speaker-related information, and Automatic Speaker Recognition refers to the machines that perform speaker recognition for humans. Speaker Recognition can be categorized into Speaker Identification and Speaker Verification according to the testing protocol (Figure 1.2). As with any machine learning model, Automatic Speaker Recognition requires training data and testing data. Speaker Identification determines which training speaker, if any, matches the speaker of a testing utterance, and hence it is a closed-set problem. Speaker Verification, on the other hand, verifies whether the speakers of a pair of utterances match. The pair consists of an enrollment utterance and a testing utterance, whose speaker may not have been seen during training, and hence it is a more challenging open-set problem.
This thesis work focuses on Automatic Speaker Verification.
1.1.2 General Processing Pipeline
Figure 1.3 describes the four main stages of Automatic Speaker Recognition (and thus also of Verification). Most systems include these four stages in their design. Feature Processing extracts low-level feature descriptors from the speech waveform, such as Mel-Frequency Cepstral Coefficients (MFCC), FilterBank, Perceptual Linear Predictive (PLP) Analysis, or bottleneck features. Clustering differentiates acoustic units so that they can be processed separately; it is commonly adopted in speaker recognition, for example with a Gaussian Mixture Model (GMM). Summarization converts variable-length frame-level features into a fixed-length utterance-level feature, for example with i-vectors or average pooling. Backend Processing performs scoring and decision making, for example with a Support Vector Machine (SVM), Cosine Similarity, or Probabilistic Linear Discriminant Analysis (PLDA).
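There are various metrics that define how well a system performs, such as the Decision Cost Function (DCF) and the Equal Error Rate (EER). DCF is defined as:
$$C_{\mathrm{Det}} = C_{\mathrm{Miss}} \times P_{\mathrm{Miss} \mid \mathrm{Target}} \times P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \times P_{\mathrm{FalseAlarm} \mid \mathrm{NonTarget}} \times (1 - P_{\mathrm{Target}}) \tag{1.1}$$
where $C_{\mathrm{Miss}}$ and $C_{\mathrm{FalseAlarm}}$ are the costs of a miss and of a false alarm, and $P_{\mathrm{Target}}$ is the prior probability of a target trial.
EER is the operating point at which the False Alarm Rate equals the False Negative (miss) Rate. We adopt EER in this thesis because of its common use in Automatic Speaker Recognition work.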
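As an illustration of the EER definition, the following minimal sketch sweeps every observed score as a decision threshold and reports the point where the miss rate and the false alarm rate are closest. The function name `compute_eer` and the toy scores are assumptions made for this example; they are not taken from the experiments in this thesis.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold), using each observed score as a candidate threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # A trial is accepted when its score is >= the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])     # false negative rate
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])   # false alarm rate
    idx = int(np.argmin(np.abs(miss - fa)))                               # closest crossing point
    return (miss[idx] + fa[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical): higher means "more likely the same speaker".
eer, threshold = compute_eer([2.0, 1.5, 0.2, 3.1], [-1.0, 0.5, -0.3, 0.1])
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits typically interpolate between neighboring thresholds; the sketch simply averages the two rates at the closest point.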
Speaker Recognition at its core optimizes a Sequence-to-One mapping function. From the task perspective, it is supposedly easier than Sequence-to-Sequence tasks, since it produces only one output per sequence. From the data perspective, however, it is much harder. Compared to automatic speech recognition or machine translation, which are Sequence-to-Sequence mappings, there is far less supervision available for automatic speaker recognition. For example, a 100-second YouTube video could contain more than 100 spoken words but only one speaker identity. In addition to data, channel effects have been the major bottleneck in previous research on speaker recognition (Figure 1.1). Advances in the field have produced techniques that aim to address them, such as Joint Factor Analysis, but channel effects still play a significant role. This is one reason why the most fundamental task in speech, voice activity detection, remains a research problem.
Automatic Speaker Recognition techniques are transferable to the aforementioned tasks: Language Recognition dehak2011language, Age Estimation chen2018measuring Ghahremani2018, Emotion Classification Cho2018, and Spoofing Attacks Detection lai2018attentive.
1.2 Adapted Gaussian Mixture Models (GMM-UBM)
In the 1990s, Gaussian Mixture Model (GMM) based systems were the dominant approach to automatic speaker verification. Building on GMM, the Gaussian Mixture Model-Universal Background Model (GMM-UBM) approach trains a large speaker-independent GMM, referred to as the UBM, and adapts the UBM to specific speaker models via Bayesian adaptation reynolds2000speaker. GMM-UBM is the basis for later work such as the i-vectors, which collect sufficient statistics from a UBM, and it is one of the most important developments in automatic speaker verification.
1.2.1 Likelihood Ratio Detector
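The task of speaker verification is to determine whether a test utterance $X$ is spoken by a given speaker $S$. GMM-UBM defines two models: a Background Model (UBM) and a Speaker Model (GMM). If the likelihood of $X$ under the $S$-dependent GMM is larger than its likelihood under the $S$-independent UBM, then $X$ is taken to be spoken by $S$, and vice versa. This decision is expressed as a likelihood ratio:
$$\Lambda(X) = \frac{p(X \mid \lambda_{S})}{p(X \mid \lambda_{\mathrm{UBM}})} \tag{1.2}$$
where $\lambda_{S}$ and $\lambda_{\mathrm{UBM}}$ denote the parameters of the speaker GMM and of the UBM, and $\Lambda(X)$ is called the likelihood ratio detector. Figure 1.4 is an illustration of $\Lambda(X)$.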
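The likelihood ratio test can be sketched in the log domain with diagonal-covariance GMMs. This is a minimal illustration rather than the system used in this thesis: the helper names, the two-component toy models, and the decision threshold of zero are all assumptions made for the example.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a diagonal GMM with C mixtures."""
    frames = np.asarray(frames, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))  # (T, C)
    log_joint = np.log(weights) + log_gauss
    peak = log_joint.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_joint - peak).sum(axis=1))   # log-sum-exp over mixtures
    return float(log_frame.mean())

def log_likelihood_ratio(frames, speaker_gmm, ubm):
    """log p(X | speaker GMM) - log p(X | UBM), averaged over frames."""
    return gmm_log_likelihood(frames, *speaker_gmm) - gmm_log_likelihood(frames, *ubm)

# Hypothetical toy models: the speaker GMM is the UBM with shifted means.
weights = np.array([0.5, 0.5])
variances = np.ones((2, 2))
ubm = (weights, np.array([[-2.0, 0.0], [2.0, 0.0]]), variances)
speaker = (weights, np.array([[-2.0, 1.0], [2.0, 1.0]]), variances)

target_frames = np.array([[2.0, 1.0], [-2.0, 1.1], [1.9, 0.9]])
impostor_frames = np.array([[2.0, 0.0], [-2.0, 0.0]])
accept_target = log_likelihood_ratio(target_frames, speaker, ubm) > 0.0
accept_impostor = log_likelihood_ratio(impostor_frames, speaker, ubm) > 0.0
```

Working with log-likelihoods avoids numerical underflow on long utterances, and comparing the log ratio to zero corresponds to comparing the likelihood ratio to one.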
One basic assumption of GMM-UBM is that human speech can be decomposed into speaker-independent and speaker-dependent characteristics. Speaker-independent characteristics are traits shared across human speech, examples of which could be pitch and vowels. Speaker-dependent characteristics are traits unique to each speaker, an example of which could be accent. GMM-UBM builds on this assumption. First, speaker-independent characteristics are modeled by a large GMM, the UBM. Since it should capture traits shared across all speakers, the UBM is trained on a large amount of data, usually the whole training set. Second, speaker-dependent characteristics, which are present in the enrollment data, are obtained by adapting the UBM. The UBM is trained with the EM algorithm, and the speaker model adaptation is done via MAP estimation.
Another motivation for splitting speaker modeling into two steps is that there is often very little enrollment data. For example, setting up a smartphone fingerprint reader usually takes only a couple of seconds. Enrollment data collected this way is too little to build a powerful model. On the other hand, a large amount of unlabelled data is available for training, but it does not come from the user. GMM-UBM is one solution that takes advantage of large unlabelled data to build a speaker-specific model by adaptation.
1.2.3 MAP Estimation
MAP estimation is illustrated in Figure 1.5. Given the parameters of the UBM (mixture weights $w$, mixture means $\mu$, mixture variances $\sigma^{2}$) and some enrollment data, MAP estimation linearly adapts $w$, $\mu$, and $\sigma^{2}$. In reynolds2000speaker, all of $w$, $\mu$, and $\sigma^{2}$ are adapted, although it is common to adapt only the mixture means and keep the weights and variances fixed.
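Mean-only MAP adaptation can be sketched as follows. The update follows the relevance-factor form of reynolds2000speaker, in which each adapted mean is an interpolation between the enrollment statistics and the UBM mean. The function name, the default relevance factor of 16, and the toy UBM are assumptions made for this illustration.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Adapt the means of a diagonal-covariance UBM to enrollment frames of shape (T, D)."""
    frames = np.asarray(frames, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]                           # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))  # (T, C)
    log_joint = np.log(weights) + log_gauss
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)            # Pr(i | x_t): mixture posteriors
    n = post.sum(axis=0)                               # zeroth-order statistics, shape (C,)
    first = post.T @ frames                            # first-order statistics, shape (C, D)
    expected = first / np.maximum(n, 1e-10)[:, None]   # E_i(x): posterior-weighted mean
    alpha = (n / (n + relevance))[:, None]             # data-dependent adaptation coefficient
    return alpha * expected + (1.0 - alpha) * means

# Hypothetical 1-D UBM with two mixtures and 16 enrollment frames near the second mixture.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[-2.0], [2.0]])
ubm_vars = np.ones((2, 1))
enrollment = np.full((16, 1), 3.0)
adapted = map_adapt_means(enrollment, ubm_weights, ubm_means, ubm_vars)
```

Mixtures that see little enrollment data get a small adaptation coefficient and stay close to the UBM, which is what makes the approach robust to short enrollment utterances.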
1.3 Joint Factor Analysis (JFA)
Joint Factor Analysis was proposed to compensate for the shortcomings of GMM-UBM. Referring to Figure 1.5, the UBM is adapted via MAP to a speaker-dependent GMM. If we consider only mean adaptation, we can stack the mean vectors of all Gaussian mixtures into one large vector, termed the "supervector". Let $\mu_i \in \mathbb{R}^{F}$ be the mean of the $i$-th mixture, where $F$ is the feature dimension, and assume there are $C$ mixtures in the UBM. Then the UBM supervector is $m = [\mu_1^{\top}, \ldots, \mu_C^{\top}]^{\top} \in \mathbb{R}^{CF}$. Let us further denote the true speaker mean supervector as $M$; MAP estimation is then essentially a high-dimensional mapping from $m$ to $M$. This is not ideal, since MAP adapts not only the speaker-specific information but also the channel effects (Figure 1.1). Another disadvantage of representing a speaker with a mean supervector is that its dimension is very large. For example, it is common to have $F = 39$ (with deltas and double-deltas) and $C = 1024$, so $M$ ends up as a supervector of almost 40,000 dimensions ($39 \times 1024 = 39{,}936$).
JFA addresses the problem by splitting the supervector into speaker-independent, speaker-dependent, channel-dependent, and residual subspaces lei2011joint, with each subspace represented by a low-dimensional vector. JFA is formulated as follows:
$$M = m + Vy + Ux + Dz \tag{1.3}$$
where $V$, $U$, and $D$ are the matrices for the speaker-dependent, channel-dependent, and residual subspaces respectively ($V$ and $U$ are low rank; $D$ is diagonal in the standard formulation), and $y$, $x$, and $z$ are the corresponding factors. With JFA, a low-dimensional speaker vector $y$ is extracted. Compared to GMM-UBM's $M$, $y$ is of much lower dimension (300 vs. 40,000) and is intended to be free of channel effects.
1.4 Front-End Factor Analysis (i-vectors)
One empirical finding suggested that the channel vector in JFA also contains speaker information, and a subsequent modification of JFA was proposed that has been one of the most dominant speech representations of the last decade: the i-vectors dehak2011front. The modified formula is:
$$M = m + Tw \tag{1.4}$$
where $T$ is the total variability matrix (also low rank) and $w$ is the i-vector. Comparing this to Equation 1.3, there is only one low-rank matrix, which models both speaker and channel variabilities. Figure 1.6 is a simple illustration of how JFA and i-vectors convert the supervectors to a low-dimensional embedding.
After $w$ is extracted, it is used to represent the speaker. In Figure 1.3, we refer to i-vectors as a summarization step, since an utterance of variable length is reduced, via its supervector, to a fixed-length low-dimensional vector. In dehak2011front, SVM and cosine similarity are used for backend processing. However, i-vectors with a PLDA backend became the more popular combination.
1.5 Robust DNN Embeddings (x-vectors)
i-vector systems have produced several state-of-the-art results on speaker-related tasks. However, like many statistical systems, an i-vector system is composed of several independent (unsupervised) subsystems trained with different objectives: a UBM for collecting sufficient statistics, an i-vector extractor for extracting i-vectors, and a scoring backend (usually PLDA). The x-vector system is a supervised DNN-based speaker recognition system that aims to combine the clustering and summarization steps in Figure 1.3 into one snyder2017deep snyder2018x. The DNN is based on Network-In-Network lin2013network and is trained to classify different speakers (Figure 1.7). The outputs of the layers after the statistics pooling layer can be used as the speaker embeddings, or x-vectors. Since x-vectors are based on a DNN, which requires a lot of data, x-vector systems also use data augmentation, adding noise and reverberation to increase the total amount of data. x-vectors do not necessarily outperform i-vectors on speaker recognition, especially when data and computational resources are limited.
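The pooling step that turns variable-length frame-level activations into a fixed-length vector can be sketched as below. This is a simplified illustration that assumes the pooling layer concatenates the per-dimension mean and standard deviation over time; the function name and the random toy inputs are hypothetical.

```python
import numpy as np

def statistics_pooling(frame_features):
    """Map frame-level features of shape (T, D) to one utterance-level vector of shape (2 * D,)."""
    frame_features = np.asarray(frame_features, dtype=float)
    mean = frame_features.mean(axis=0)   # first-order statistics over time
    std = frame_features.std(axis=0)     # second-order statistics over time
    return np.concatenate([mean, std])

# Two utterances of different lengths yield embeddings of identical size.
rng = np.random.default_rng(0)
short_utt = statistics_pooling(rng.normal(size=(50, 4)))
long_utt = statistics_pooling(rng.normal(size=(200, 4)))
```

Because the output size depends only on the feature dimension, the layers after pooling can be ordinary fully-connected layers regardless of utterance duration.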
1.6 Learnable Dictionary Encoding (LDE)
The x-vector framework is not truly end-to-end, since it uses a separately trained PLDA for scoring. An elegant end-to-end framework, Learnable Dictionary Encoding, explores several pooling layers and loss functions cai2018exploring and shows that it is possible to combine the clustering, summarization, and backend processing steps in Figure 1.3.
Instead of a feed-forward deep neural network, LDE employs ResNet34 he2016deep in its framework. In addition, in contrast to the x-vector DNN in Figure 1.7, where there are a few layers after the pooling layer, LDE has only one fully-connected layer (for classification) after its pooling layer. LDE uses an LDE layer for pooling (or summarization), shown in Figure 1.8.
i-vector and x-vector systems require a separately trained backend (PLDA) for scoring. LDE showed that with Angular Softmax Losses liu2017sphereface a separate backend is not necessary, and hence the whole framework is end-to-end.
The Feature Processing step in Figure 1.3 extracts low-level feature descriptors from the raw waveform, and several earlier works showed that transforms based on Fourier analysis can effectively capture the information in speech signals. Conventional low-level speech features include Log-Spectrogram, Log-Filterbank, Mel-Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Predictive (PLP) Analysis. DNN-based speech recognition systems hinton2012deep, GMM-UBM systems reynolds2000speaker, and i-vector systems dehak2011front are based on MFCC; x-vector systems snyder2018x and LDE cai2018exploring are based on Log-Filterbank; the Attentive Filtering Network lai2018attentive is based on Log-Spectrogram. We established our baseline on MFCC, and this chapter introduces MFCC and the MFCC configuration used in our experiments in Chapter 4.
2.2 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC is one of the most standard and common low-level features in automatic speaker recognition systems. The procedure of MFCC extraction is as follows:
1. Take the Short-Term Fourier Transform (STFT) of the waveform. This step gives us a Spectrogram.
2. Apply Mel-scale filters. This step gives us a Filterbank.
3. Take the logarithm of the powers in all Mel bins. The logarithm is also taken for Log-Spectrogram and Log-Filterbank.
4. Apply the Discrete Cosine Transform (DCT), and keep several cepstral coefficients. This step decorrelates the features and reduces the dimensionality.
A visual comparison of Log-Spectrogram, Log-Filterbank, and MFCC is shown in Figure 2.1. We can see that there is more structure in Log-Spectrogram and Log-Filterbank, and that MFCC has fewer dimensions than the former two.
2.3 MFCC Details
Our experiments (see Chapter 4 for more details) are conducted on the LibriSpeech corpus panayotov2015librispeech, in which speech utterances are recorded at 16 kHz. We used the standard 25 ms frame length and 10 ms frame shift for STFT computation, 40 Mel filters, and kept 24 cepstral coefficients after the DCT. The first- and second-order derivatives (deltas and double-deltas) are computed during UBM training. Details of our MFCC configuration are given in Table 2.1.
Table 2.1: MFCC configuration.
| Parameter | Value |
| Sampling Frequency | 16000 Hz |
| Frame Length for STFT | 25 ms |
| Frame Shift for STFT | 10 ms |
| High Frequency Cutoff for Mel Bins | 7600 Hz |
| Low Frequency Cutoff for Mel Bins | 20 Hz |
| Number of Mel Bins | 40 |
| Number of Cepstral Coefficients after DCT | 24 |
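The extraction steps and the configuration above can be sketched end to end with NumPy. This is an illustrative approximation, not the feature extractor used for the experiments: the Hamming window, the 512-point FFT, the triangular filter shape, the unscaled DCT-II, and all function names are assumptions made for this example.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_bins, n_fft, sample_rate, low_hz, high_hz):
    """Triangular filters equally spaced on the Mel scale, shape (num_bins, n_fft // 2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_bins + 2))
    fft_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    fbank = np.zeros((num_bins, n_fft // 2 + 1))
    for i in range(num_bins):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def mfcc(waveform, sample_rate=16000, frame_ms=25, shift_ms=10, num_bins=40,
         num_ceps=24, low_hz=20.0, high_hz=7600.0, n_fft=512):
    """Return MFCC features of shape (num_frames, num_ceps); needs at least one full frame."""
    waveform = np.asarray(waveform, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + (len(waveform) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(num_frames)[:, None]
    frames = waveform[idx] * np.hamming(frame_len)
    # Step 1: STFT power spectrum (Spectrogram).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Step 2: Mel-scale filters (Filterbank).
    fbank = power @ mel_filterbank(num_bins, n_fft, sample_rate, low_hz, high_hz).T
    # Step 3: logarithm (Log-Filterbank); the small constant avoids log(0).
    log_fbank = np.log(fbank + 1e-10)
    # Step 4: DCT-II over the Mel bins, keeping the first num_ceps coefficients.
    n = np.arange(num_bins)
    dct_basis = np.cos(np.pi / num_bins * (n[None, :] + 0.5) * np.arange(num_ceps)[:, None])
    return log_fbank @ dct_basis.T

# One second of a 440 Hz tone sampled at 16 kHz (synthetic input for illustration).
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(16000) / 16000.0)
features = mfcc(tone)
```

With a 25 ms frame (400 samples) and a 10 ms shift (160 samples), one second of audio yields 98 frames of 24 coefficients each.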
Predictive coding is a well-motivated and well-developed research area in neuroscience. Its central idea is that the current and past states of a system contain relevant information about its future states. Meanwhile, one long-standing research question in speech processing has been how to extract global information from noisy speech recordings. In speech recognition, this can be framed as retrieving phone labels from the recordings. In speaker recognition, the same question can be framed as retrieving utterance-level attributes, such as the speaker's identity or sentiment, from the recordings. Could we harness the concept of predictive coding to design a model that extracts representations invariant to noise? Contrastive Predictive Coding (CPC) connects the ideas of predictive coding and representation learning. This chapter gives a background overview of predictive coding in neuroscience (Section 3.2), a background of CPC (Section 3.3), and CPC models (Section 3.4). Lastly, the application of CPC to speaker verification is presented (Section 3.5).
3.2 Predictive Coding in Neuroscience
In a famous study by hubel1968receptive, the visual Receptive Field (RF) in the monkey striate cortex was studied. A macaque monkey was presented with line stimuli of different orientations while RF responses in the striate cortex were recorded. The experiment showed that cells responded optimally (with high firing rates) to particular line orientations, as illustrated in Figure 3.1. The interesting question to ask here is: why don't neurons always respond in proportion to the stimulus magnitude?
Predictive coding is one prominent theory that aims to provide a possible explanation. Predictive coding states that the human brain can be modeled by a framework that constantly generates hypotheses and corrects its internal states through an error feedback loop. Since neighboring neurons are likely to be correlated, predictive coding implies that the RF response of a neuron can be predicted from the RF responses of its surroundings, and therefore a strong stimulus does not always correspond to a strong RF response. The first hierarchical model with several levels of predictive coding was proposed for visual processing in rao1999predictive. Each level receives a prediction from the previous level and calculates the residual error between the prediction and reality. To achieve efficient coding, only the residual error is propagated forward to the next level, while the next prediction for the current level is made, as illustrated in Figure 3.2.
The study of rao1999predictive suggested the importance of feedback connection in addition to feedforward information transmission for visual processing. However, the key insight of how predictive coding is connected to representation learning is that by learning to predict, the model should implicitly retain properties or structures of the input.
3.3 Contrastive Predictive Coding (CPC)
3.3.1 Connection to Predictive Coding
Contrastive Predictive Coding (CPC) was proposed in oord2018representation as a new unsupervised representation learning framework. One challenging aspect of representation learning on high-dimensional signals is noise. The primary goal of CPC is to extract high-level representations, or slowly varying features wiskott2002slow, from a sensory signal full of low-level noise. Predictive coding, in turn, retains properties or structures of the input (Section 3.2). To predict the future, the model has to infer global properties or structures from the past, and therefore has to separate global information from noise. One example is a TV series. After watching several episodes, most people can generally predict some of the plot in the next few episodes. But only the few who know the entire series and its history very well can make plot predictions beyond five episodes. These few people have "mastered" the TV series, such that they can tell the important plot developments from those that are minor in comparison. CPC leverages this idea and could therefore be powerful for separating high-level representations from noise.
However, how do we quantify high-level representation and monitor how well the model is learning? To quantify high-level representation, CPC calculates the mutual information between the sensory signal $x$ and the global information $c$. Let us refer back to the TV series example. The correct predictions of the plot in future episodes are often hidden as several key points in previous episodes. In terms of mutual information, the sensory signal $x$ is the plot of the future episodes, and the global information $c$ is the set of key points, such as an important plot twist or character development. Section 3.3.2 gives a background of mutual information.
What metric should we use to train the predictive coding model? Figure 3.2 shows the original hierarchical model of predictive coding proposed for visual processing, and from the figure we can see that the residual error is calculated during the feedforward pass. A straightforward implementation of the residual error could be the L1 loss $\lVert x - g(c) \rVert_{1}$ or the Mean Squared Error (MSE) $\lVert x - g(c) \rVert_{2}^{2}$ between the prediction $g(c)$ and the actual value $x$, where $c$ is some learnable latent representation and $g$ is a mapping from the latent space to the input space. In fact, this implementation dates back to the 1960s, when MSE was used to train predictive coding models for speech coding atal1970adaptive. Predictive Coding Network, another unsupervised learning framework based on predictive coding, is trained with the L1 loss lotter2016deep. However, both the L1 loss and the MSE loss require a mapping function, namely a decoder, that computes $p(x \mid c)$. In our TV series example, $p(x \mid c)$ is saying, "tell me all the details of the future plot $x$ given the several key points $c$." Intuitively, this is a hard task and unnecessary for our purpose, since we are interested in high-level representations. To get around this issue, CPC models the mutual information directly with the noise-contrastive estimation technique, which is introduced in Section 3.3.3.
3.3.2 Mutual Information
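Mutual information denotes the amount of information shared between two variables. Given two random variables $X$ and $Y$, mutual information is defined as
$$I(X; Y) = H(X) - H(X \mid Y)$$
where $H(X)$ is the entropy of $X$ and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$. $H(X)$ is defined as
$$H(X) = -\sum_{x} p(x) \log p(x)$$
and $H(X \mid Y)$ is defined as
$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$$
With the above definitions, we can show the following:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
First we expand the definition of $I(X; Y)$ as:
$$I(X; Y) = -\sum_{x} p(x) \log p(x) + \sum_{x, y} p(x, y) \log p(x \mid y)$$
Then by substituting $p(x) = \sum_{y} p(x, y)$ and applying Bayes' rule,
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
We can also easily show that if $X$ and $Y$ are independent, their mutual information is zero:
Given that $X$ and $Y$ are independent, $p(x, y) = p(x)\, p(y)$. By definition, we can rewrite $I(X; Y)$ as:
$$I(X; Y) = \sum_{x, y} p(x)\, p(y) \log \frac{p(x)\, p(y)}{p(x)\, p(y)}$$
and therefore, we have:
$$I(X; Y) = \sum_{x, y} p(x)\, p(y) \log 1 = 0$$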
In the context of representation learning, mutual information gives us a quantitative measure of how well a model learns the global information. Let us look back at the TV series example again. If a person has only limited memory and has successfully observed the key developments, denoted as $c$, over the past episodes, those developments are likely to be highly relevant to the upcoming episodes, denoted as $x$. We can say that their mutual information $I(x; c)$ is high. However, given the limited amount of memory everyone has, if the person remembered only the minor plot developments, denoted as $c'$, the mutual information $I(x; c')$ is most likely to be low.
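The definition of mutual information for discrete variables can be checked numerically. The sketch below computes the mutual information, in nats, from a joint probability table; the function name and the two toy tables are assumptions made for this illustration.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in nats of a discrete joint distribution given as a 2-D table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                  # make sure the table sums to one
    px = joint.sum(axis=1, keepdims=True)        # marginal p(x), shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)        # marginal p(y), shape (1, |Y|)
    mask = joint > 0                             # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Independent variables: the joint table is the outer product of the marginals.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
# Fully dependent variables: knowing one determines the other.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives a mutual information of zero up to floating-point error, while the fully dependent table gives $\log 2$ nats, the entropy of a fair binary variable.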
3.3.3 Noise-Contrastive Estimation (NCE)
Noise-Contrastive Estimation (NCE) is a technique for estimating the parameters of parametric density functions gutmann2012noise. Let us consider a set of observations $X = (x_1, \ldots, x_{T_d})$, where $x_t \in \mathbb{R}^{n}$. In real-world examples, $n$ is often large, and the goal of density estimation is to find, or give an accurate estimate of, the underlying data distribution, i.e. the probability density function (pdf) $p_d$, from the observable set $X$. NCE assumes that $p_d$ comes from a parameterized family of functions:
$$p_d \in \{ p_m(\cdot\,; \alpha) \}_{\alpha}$$
where $\alpha$ is a set of parameters. Put another way, there exists some $\alpha^{*}$ such that the following is true,
$$p_d(\cdot) = p_m(\cdot\,; \alpha^{*})$$
Now, let us denote any estimate of $\alpha^{*}$ as $\hat{\alpha}$. Then the following must hold for any pdf $p_m(\cdot\,; \hat{\alpha})$:
$$\int p_m(u; \hat{\alpha})\, du = 1, \qquad p_m(u; \hat{\alpha}) \geq 0$$
If these two constraints are satisfied for all $\alpha$, then we say $p_m$ is normalized; otherwise, $p_m$ is unnormalized. It is common for models to be unnormalized, such as the Gibbs distribution. Let us give these unnormalized parametric models a name, $p_m^{0}(\cdot\,; \alpha)$. To normalize $p_m^{0}$, we would need to calculate the partition function $Z(\alpha)$:
$$Z(\alpha) = \int p_m^{0}(u; \alpha)\, du$$
and $p_m^{0}$ can be normalized by $p_m^{0}(\cdot\,; \alpha) / Z(\alpha)$.
Everything so far is reasonable, except that in real-world examples $Z(\alpha)$ is rarely tractable, so $p_m^{0}$ remains unnormalized. One simple solution NCE proposes is: why not make the normalizer an additional parameter gutmann2012noise? Let us define the new pdf accordingly:
$$\ln p_m(\cdot\,; \theta) = \ln p_m^{0}(\cdot\,; \alpha) + c$$
where $\theta = (\alpha, c)$ and $c$ plays the role of $-\ln Z(\alpha)$. The estimate $\hat{\theta}$ is now not subject to the two constraints above, since $c$ provides a scaling factor. The intuition here is that instead of calculating $Z(\alpha)$ to normalize $p_m^{0}$ for all $\alpha$, only $p_m(\cdot\,; \hat{\theta})$ is normalized.
However, Maximum Likelihood Estimation only works for normalized pdfs, and $p_m(\cdot\,; \theta)$ is not normalized for all $\theta$. NCE is therefore proposed for estimating unnormalized parametric pdfs.
3.3.3.1 Density Estimation in a Supervised Setting
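The goal of density estimation is to give an accurate description of the underlying probability density of an observable data set $X$ with unknown density $p_d$. The intuition of NCE is that by comparing $X$ against a known set $Y$, which has a known density $p_n$, we can get a good grasp of what $p_d$ looks like. More concretely, by drawing samples from $Y$ with the known pdf $p_n$ and samples from $X$, we can estimate the density ratio $p_d / p_n$. With $p_d / p_n$ and $p_n$, we have the target density $p_d$.
By discriminating the data samples $X$ from the noise samples $Y$ with a simple classifier, in this case logistic regression, we show that NCE obtains an estimate of the probability density ratio.
Let $X = (x_1, \ldots, x_{T_d})$ and $Y = (y_1, \ldots, y_{T_n})$ be two observable sets, and let $U = X \cup Y = (u_1, \ldots, u_{T_d + T_n})$. $X$ is drawn from an unknown pdf $p_d$, which we model with $p_m(\cdot\,; \theta)$, and $Y$ is drawn from a known pdf $p_n$. Since $Y$ is not our target, it is commonly referred to as the "noise". We also assign each data point $u_t$ in $U$ a label $C_t$: $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$. From the above settings, the likelihood distributions are then:
$$p(u \mid C = 1; \theta) = p_m(u; \theta), \qquad p(u \mid C = 0) = p_n(u)$$
The prior distributions are:
$$P(C = 1) = \frac{T_d}{T_d + T_n}, \qquad P(C = 0) = \frac{T_n}{T_d + T_n}$$
The probability of the data is thus:
$$p(u; \theta) = \frac{T_d\, p_m(u; \theta) + T_n\, p_n(u)}{T_d + T_n}$$
With Bayes' rule, we can derive the posterior distributions of $C = 1$ and $C = 0$. Writing $\nu = T_n / T_d$,
$$P(C = 1 \mid u; \theta) = \frac{p_m(u; \theta)}{p_m(u; \theta) + \nu\, p_n(u)}$$
Similarly, we can get
$$P(C = 0 \mid u; \theta) = \frac{\nu\, p_n(u)}{p_m(u; \theta) + \nu\, p_n(u)}$$
$P(C = 1 \mid u; \theta)$ can further be expressed as,
$$P(C = 1 \mid u; \theta) = \frac{1}{1 + \nu\, \frac{p_n(u)}{p_m(u; \theta)}}$$
Now, we can denote our target density ratio with a new variable $r$:
$$r(u; \theta) = \frac{p_m(u; \theta)}{p_n(u)}, \qquad h(u; \theta) = P(C = 1 \mid u; \theta) = \frac{1}{1 + \nu / r(u; \theta)}$$
$C_t$ follows a Bernoulli distribution with value $0$ or $1$. We can write the log-likelihood as:
$$\ell(\theta) = \sum_{t=1}^{T_d + T_n} \Big[ C_t \ln P(C_t = 1 \mid u_t; \theta) + (1 - C_t) \ln P(C_t = 0 \mid u_t; \theta) \Big] = \sum_{t=1}^{T_d} \ln h(x_t; \theta) + \sum_{t=1}^{T_n} \ln \big[ 1 - h(y_t; \theta) \big]$$
Optimizing $\ell(\theta)$ with respect to the parameters $\theta$ leads to an estimate of $r(u; \theta)$, which is the density ratio we want. If we take a step back, we can see that $-\ell(\theta)$ is in fact a cross-entropy loss. In a supervised setting, NCE gives us a density estimate!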
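The classification view of NCE can be illustrated on a one-dimensional toy problem. In the sketch below, the data follow a standard Gaussian, the unnormalized model is $\exp(-\alpha u^{2} / 2)$ with a learned log-normalizer $c$, and the noise is a wider Gaussian with known density. Every name and setting here (the noise scale, the sample sizes, fixing $\alpha = 1$, and the grid search over $c$ in place of gradient-based optimization) is an assumption made for the example.

```python
import numpy as np

def log_model(u, alpha, c):
    """Log of the model p_m(u; theta): unnormalized Gaussian plus the learned log-normalizer c."""
    return -0.5 * alpha * u ** 2 + c

def log_noise(u, sigma_n):
    """Log-density of the known Gaussian noise distribution p_n."""
    return -0.5 * (u / sigma_n) ** 2 - np.log(sigma_n * np.sqrt(2.0 * np.pi))

def nce_objective(x, y, alpha, c, sigma_n):
    """Log-likelihood of the data-vs-noise labels, divided by the number of data samples."""
    nu = len(y) / len(x)
    g_x = log_model(x, alpha, c) - log_noise(x, sigma_n)   # log density ratio on data samples
    g_y = log_model(y, alpha, c) - log_noise(y, sigma_n)   # log density ratio on noise samples
    log_h_x = -np.logaddexp(0.0, np.log(nu) - g_x)         # ln h(x) with h = 1 / (1 + nu * exp(-g))
    log_1mh_y = -np.logaddexp(0.0, g_y - np.log(nu))       # ln (1 - h(y))
    return (log_h_x.sum() + log_1mh_y.sum()) / len(x)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20000)   # data samples from the "unknown" density
y = rng.normal(0.0, 2.0, size=20000)   # noise samples from the known density

# Grid search over the log-normalizer c, keeping alpha at its true value.
c_grid = np.linspace(-3.0, 1.0, 81)
objective = [nce_objective(x, y, 1.0, c, 2.0) for c in c_grid]
c_hat = c_grid[int(np.argmax(objective))]
true_c = -0.5 * np.log(2.0 * np.pi)    # analytic -ln Z for the standard Gaussian
```

The maximizer `c_hat` should land close to `true_c`: the classifier recovers the normalizing constant without ever evaluating the partition function.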
3.3.3.2 The NCE Estimator
3.4 Representation Learning with CPC
3.4.1 Single Autoregressive Model
As mentioned in the previous sections, mutual information gives the model a good criterion to measure how much global information is preserved. We can explicitly write out the formula for mutual information:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)}$$
In speech, we can take $X$ to be the waveform of any utterance and $Y$ to be the global information, such as the speaker label. Therefore, the mutual information we are interested in becomes:
$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}$$
where $x$ represents the utterance and $c$ represents the speaker label. In oord2018representation, the NCE objective is introduced for model training, and the term $\frac{p(x \mid c)}{p(x)}$ is selected as the density ratio to be estimated in NCE. We will prove later why $\frac{p(x \mid c)}{p(x)}$ is selected. The NCE objective is subsequently named the NCE loss.
Figure 3.4 is an illustration of the CPC model proposed in oord2018representation. The model takes raw waveforms $x$ as input and transforms them to a latent space $z$ with an encoder. In the latent space, a Recurrent Neural Network is trained with the NCE loss to learn the global context $c$.
3.4.1.1 NCE Loss
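CPC selects $\frac{p(x \mid c)}{p(x)}$ as the density ratio to be estimated in the NCE estimator. We can denote it with $f$:
$$f(x, c) \propto \frac{p(x \mid c)}{p(x)}$$
We can see that $f$ is unnormalized, and this is the reason why we started off with NCE. In addition, $p(x)$ and $p(x \mid c)$ cannot be computed explicitly. An alternative is to model $f$ with a log-bilinear model, which signifies how relevant the input is to the context:
$$f(x, c) = \exp\left(x^{\top} c\right)$$
Referring back to the model in Figure 3.4, we can see that $c$ is modeled by the context vector of the recurrent neural network, and $x$ can be modeled by either the waveform $x$ or the latent representation $z$. Since we would like the model to learn high-level information, it makes more sense to model $x$ with $z$. Therefore, $f$ becomes:
$$f(x, c) = \exp\left(z^{\top} c\right)$$
However, the dimensions of the context vector and of the latent space do not always agree. A simple solution is to add a matrix that makes the dimensions conform. Let $z \in \mathbb{R}^{D_z}$ and $c \in \mathbb{R}^{D_c}$. We define a matrix $W \in \mathbb{R}^{D_z \times D_c}$, and the density ratio becomes:
$$f(x, c) = \exp\left(z^{\top} W c\right)$$
We are now ready to define the NCE loss for training the CPC model. Recall from the discussion of density estimation in a supervised setting (Section 3.3.3) that NCE gives an estimate of the density ratio by classifying data samples against noise samples. Consider a batch of utterances $X = \{x_1, \ldots, x_N\}$ that includes $1$ data sample and $N - 1$ noise samples, where the positive sample comes from the data distribution $p(x \mid c)$ and the noise samples come from the noise distribution $p(x)$. The NCE loss is defined as:
$$\mathcal{L}_{\mathrm{NCE}} = -\mathbb{E}_{X} \left[ \log \frac{f(x_t, c_t)}{\sum_{x_j \in X} f(x_j, c_t)} \right]$$
where $x_t$ is any frame segment from utterance $x$, $c_t$ is the corresponding global context for frame segment $x_t$, $\frac{f(x_t, c_t)}{\sum_{x_j \in X} f(x_j, c_t)}$ is the prediction of the model, and the normalization by $\sum_{x_j \in X} f(x_j, c_t)$ amounts to taking the softmax over $X$.
However, the current loss has nothing to do with predictive coding (Section 3.2), where a prediction of the future is made from the context and the residual error is propagated back to correct the context lotter2016deep. Similarly, the CPC model also incorporates future frame predictions. We can modify $\mathcal{L}_{\mathrm{NCE}}$ as:
$$\mathcal{L}_{\mathrm{NCE}} = -\mathbb{E}_{X} \left[ \sum_{k=1}^{K} \log \frac{f(x_{t+k}, c_t)}{\sum_{x_j \in X} f(x_j, c_t)} \right]$$
where instead of computing the loss only with the density ratio of the current frame, $f(x_t, c_t)$, we also calculate the density ratio of future frames up to $K$ frames in the future, $f(x_{t+k}, c_t)$.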
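The batch form of the NCE loss with a log-bilinear score can be sketched with NumPy. This is a single-step illustration (one prediction per context, no sum over future frames) and not the training code of this thesis; the function name, the array shapes, and the toy inputs are assumptions. Each context is scored against every latent vector in the batch, and the matching pair on the diagonal is treated as the positive sample.

```python
import numpy as np

def nce_loss(z, c, W):
    """NCE loss for latents z (N, Dz), contexts c (N, Dc), and a projection W (Dz, Dc).

    Row j of z is the positive sample for row j of c; the other rows act as noise samples.
    """
    scores = z @ W @ c.T                                    # scores[j, i] = z_j^T W c_i = log f(x_j, c_i)
    scores = scores - scores.max(axis=0, keepdims=True)     # stabilize the softmax
    log_softmax = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))            # positives sit on the diagonal

# An uninformative score function: every candidate is equally likely.
rng = np.random.default_rng(0)
z_random = rng.normal(size=(8, 5))
c_random = rng.normal(size=(8, 3))
loss_uninformative = nce_loss(z_random, c_random, np.zeros((5, 3)))

# A perfectly aligned toy case: each context matches only its own latent vector.
z_aligned = 10.0 * np.eye(4)
c_aligned = 10.0 * np.eye(4)
loss_aligned = nce_loss(z_aligned, c_aligned, np.eye(4))
```

With an all-zero `W` the loss equals $\log N$, the value of a uniform guess over the batch, while the aligned case drives the loss toward zero. This is consistent with the role of $\log N$ in the mutual information bound.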
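3.4.1.2 Connection to Mutual Information
Why does CPC select $\frac{p(x \mid c)}{p(x)}$ as the density ratio to be estimated in the NCE estimator? How does it connect to mutual information?
We will show that minimizing the NCE loss results in maximizing the mutual information. First, we prove that optimizing $\mathcal{L}_{\mathrm{NCE}}$ makes the density ratio $f(x, c)$ converge to $\frac{p(x \mid c)}{p(x)}$.
Lemma. $f(x, c)$ will converge to $\frac{p(x \mid c)}{p(x)}$ by optimizing $\mathcal{L}_{\mathrm{NCE}}$, where $p(x \mid c)$ is the data distribution and $p(x)$ is the noise distribution.
Proof. The prediction of $\mathcal{L}_{\mathrm{NCE}}$ is $\frac{f(x_t, c_t)}{\sum_{x_j \in X} f(x_j, c_t)}$. Let us denote the optimal probability of classifying the positive sample correctly as $p(d = t \mid X, c_t)$, where $d$ is the index of the positive sample (a classification is correct if the sample comes from the data distribution, and incorrect if it comes from the noise distribution):
$$p(d = t \mid X, c_t) = \frac{p(x_t \mid c_t) \prod_{l \neq t} p(x_l)}{\sum_{x_j \in X} p(x_j \mid c_t) \prod_{l \neq j} p(x_l)} = \frac{\frac{p(x_t \mid c_t)}{p(x_t)}}{\sum_{x_j \in X} \frac{p(x_j \mid c_t)}{p(x_j)}}$$
Comparing the prediction with $p(d = t \mid X, c_t)$, we have
$$f(x_t, c_t) \propto \frac{p(x_t \mid c_t)}{p(x_t)}$$
Therefore, $f(x, c)$ will converge to $\frac{p(x \mid c)}{p(x)}$. ∎
Now, with the optimal $f(x, c)$, we can prove that the mutual information satisfies $I(x; c) \geq \log N - \mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$, where $\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$ is the optimal loss. Minimizing the NCE loss will therefore result in maximizing a lower bound on the mutual information $I(x; c)$.
Lemma. The lower bound for $I(x; c)$ is $\log N - \mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$.
Proof. We first rewrite $\mathcal{L}_{\mathrm{NCE}}$ by separating the positive sample and the negative samples explicitly,
$$\mathcal{L}_{\mathrm{NCE}} = -\mathbb{E}_{X} \left[ \log \frac{f(x_t, c_t)}{f(x_t, c_t) + \sum_{x_j \in X_{\mathrm{neg}}} f(x_j, c_t)} \right]$$
where $X_{\mathrm{neg}}$ is the set of negative samples in batch $X$, in which there are $N - 1$ samples. By substituting the optimal density ratio in $\mathcal{L}_{\mathrm{NCE}}$, we get the optimal loss $\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$:
$$\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}} = -\mathbb{E}_{X} \left[ \log \frac{\frac{p(x_t \mid c_t)}{p(x_t)}}{\frac{p(x_t \mid c_t)}{p(x_t)} + \sum_{x_j \in X_{\mathrm{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)}} \right] = \mathbb{E}_{X} \left[ \log \left( 1 + \frac{p(x_t)}{p(x_t \mid c_t)} \sum_{x_j \in X_{\mathrm{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)} \right) \right]$$
Then, we simplify the term $\sum_{x_j \in X_{\mathrm{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)}$. Since $\frac{p(x_j \mid c_t)}{p(x_j)}$ is the ratio of two continuous probability densities, it is also continuous, and thus we can write the expectation term as an integral over the noise distribution:
$$\sum_{x_j \in X_{\mathrm{neg}}} \frac{p(x_j \mid c_t)}{p(x_j)} \approx (N - 1)\, \mathbb{E}_{x_j \sim p(x)} \left[ \frac{p(x_j \mid c_t)}{p(x_j)} \right] = (N - 1) \int p(x_j)\, \frac{p(x_j \mid c_t)}{p(x_j)}\, dx_j = N - 1$$
Substituting back into $\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$, we get:
$$\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}} \approx \mathbb{E}_{X} \left[ \log \left( 1 + \frac{p(x_t)}{p(x_t \mid c_t)} (N - 1) \right) \right]$$
In addition, since the random variables $x_t$ and $c_t$ are both sampled from the same distribution, $p(x_t \mid c_t) \geq p(x_t)$ (the uncertainty of a random variable becomes smaller once another variable is fixed). Therefore we have the following relationship:
$$\mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}} \geq \mathbb{E}_{X} \left[ \log \left( \frac{p(x_t)}{p(x_t \mid c_t)}\, N \right) \right] = -I(x; c) + \log N$$
Therefore, the lower bound for $I(x; c)$ is:
$$I(x; c) \geq \log N - \mathcal{L}_{\mathrm{NCE}}^{\mathrm{opt}}$$
Minimizing the loss $\mathcal{L}_{\mathrm{NCE}}$ will lead to maximizing the lower bound on the mutual information $I(x; c)$. ∎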
3.4.2 Shared Encoder Approach
The originally proposed CPC model contains only one autoregressive model, a unidirectional RNN. The context vectors that a unidirectional RNN produces for the first few frames of a speech signal can be inaccurate, since the RNN has seen only a few frames. It is therefore common to use a bidirectional RNN instead, for example in machine translation. However, similar to language models such as the n-gram language model, the CPC model is trained to predict future frames, and a bidirectional RNN, which takes in the whole sequence, contradicts our NCE training objective.
We took inspiration from peters2018deep, which uses two separate RNNs, one for the forward sequence and one for the backward sequence. The two RNNs are jointly trained, and their hidden states are later concatenated for next-word prediction. We propose the shared encoder approach: two autoregressive models in the same latent space, illustrated in Figure 3.5. Compared to the single autoregressive model, the shared encoder approach has an additional autoregressive model for the backward sequence. The two autoregressive models make frame predictions separately but are optimized jointly with the loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{NCE}}\left(f^{\mathrm{fwd}}\right) + \mathcal{L}_{\mathrm{NCE}}\left(f^{\mathrm{bwd}}\right)$$
where $\mathcal{L}_{\mathrm{NCE}}(\cdot)$ is the NCE loss of Section 3.4.1 evaluated with the given density ratio, $f^{\mathrm{fwd}}$ is the density ratio from the autoregressive model trained on the forward sequence, and $f^{\mathrm{bwd}}$ is the density ratio from the second autoregressive model trained on the backward sequence. Similar to peters2018deep, we concatenate the context vectors (hidden states) from the two autoregressive models during inference for the downstream task (speaker verification).
3.4.3 Detailed Implementation
Most of the CPC model implementation conforms to oord2018representation with minor modifications. The raw waveform is input to the encoder without being processed with Voice Activity Detection or Mean Variance Normalization. In each training iteration, a segment of 1.28 seconds (or 20480 data points) is randomly extracted from the original waveform for every utterance, before inputting to the encoder. The encoder is a five layers 1-dimensional Convolutional Neural Network (CNN) with a 160 downsampling factor. For each of the five layers, the filter (kernel) sizes are, the strides are
, and the zero paddings are. All five layers have 512 hidden dimension. In oord2018representation, the autoregressive model is implemented as a GRU with 256 hidden dimension, and the context vector (hidden state) is used as the CPC feature for downstream tasks. However for standard speaker verification systems, 256 input feature dimension would cost weeks to train and therefore it is impractical. We explored three CPC models with different GRU hidden dimension, and a comparison of the three CPC models are detailed in Figure 3.1. CDCK2 and CDCK5 are variants of the single autoregressive model approach, while CDCK6 is based on the shared encoder approach.
|CPC model ID||
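The segment length and downsampling factor quoted in this subsection imply a fixed number of encoder frames per training segment. The short sketch below works through that arithmetic and shows one way to crop a random segment. The constant names, the helper `random_segment`, and the 3-second dummy waveform are assumptions made for illustration; boundary effects of the convolutions are ignored.

```python
import random

SAMPLE_RATE = 16000      # Hz, LibriSpeech sampling rate
SEGMENT_SAMPLES = 20480  # samples per training segment (1.28 s)
DOWNSAMPLING = 160       # overall downsampling factor of the encoder

def random_segment(waveform, length=SEGMENT_SAMPLES, rng=random):
    """Randomly crop `length` consecutive samples.

    Assumes the utterance has at least `length` samples.
    """
    start = rng.randrange(len(waveform) - length + 1)
    return waveform[start:start + length]

segment_seconds = SEGMENT_SAMPLES / SAMPLE_RATE        # 1.28 seconds
frames_per_segment = SEGMENT_SAMPLES // DOWNSAMPLING   # 128 encoder frames
frame_shift_ms = 1000 * DOWNSAMPLING / SAMPLE_RATE     # 10 ms between frames

waveform = [0.0] * (3 * SAMPLE_RATE)   # hypothetical 3-second utterance
segment = random_segment(waveform, rng=random.Random(0))
```

Under these numbers, each encoder output frame covers a 10 ms hop, so a 1.28-second segment yields about 128 frames for the autoregressive model.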
To implement the NCE loss, we draw negative samples from utterances other than the current one. This can be conveniently implemented by using the other samples in the same batch as the negative samples. The advantage of this implementation is that the negative samples are obtained within a single forward pass of the batch. Finally, the timestep for future frame prediction is set to 12, and the batch size is set to 64 for all CPC models. Figure 3.6 visualizes the details of our CPC model implementation.
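The in-batch negative sampling described above can be sketched as follows. This is a minimal NumPy illustration for a single prediction timestep, not the implementation used in this work: the function name `in_batch_nce`, the dot-product score, and the random inputs are assumptions for illustration only.

```python
import numpy as np

def in_batch_nce(pred, target):
    """NCE-style contrastive loss with in-batch negatives.

    pred:   (B, D) predicted future-frame embeddings, one per utterance.
    target: (B, D) true encoder outputs at the same future timestep.
    Row i of `target` is the positive for row i of `pred`; the other
    B - 1 rows come from different utterances and act as negatives.
    """
    scores = pred @ target.T                               # (B, B) score matrix
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_softmax))                  # positives on the diagonal
    accuracy = np.mean(np.argmax(scores, axis=1) == np.arange(len(pred)))
    return loss, accuracy

rng = np.random.default_rng(0)
target = rng.standard_normal((64, 512))                    # batch size 64, 512 dims
pred = target + 0.1 * rng.standard_normal((64, 512))       # near-perfect predictions
loss, acc = in_batch_nce(pred, target)

# With uninformative predictions the loss equals log(batch size).
chance_loss, _ = in_batch_nce(np.zeros((64, 512)), target)
```

Because the whole score matrix comes from one matrix product, no separate sampling pass is needed. In a full model the loss would be averaged over all prediction timesteps.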
3.5 CPC-based Speaker Verification System
Since the CPC feature captures high-level information about the input signal, it could contain relevant speaker information. We are interested in the effectiveness of the CPC feature in speaker verification, and in how it fits into a standard speaker verification system. Figure 3.7 describes our CPC-based speaker verification system. The CPC model is trained on the training data, and frame-level representations are extracted by the model. To get a fixed-length utterance-level representation, we either average temporally across all frames of each utterance, or train an additional summarization system, the i-vector extractor. After obtaining the utterance-level representations, we first apply mean and length normalization to all of them, and then train a Linear Discriminant Analysis (LDA) to reduce the feature dimension per utterance. Lastly, a decision generator, the PLDA model, is trained to produce the log-likelihood ratio for each trial before the EER is computed. Figure 3.8 describes the testing pipeline for the CPC-based speaker verification system.
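The first two back-end steps above, temporal average pooling followed by mean and length normalization, can be sketched in a few lines of NumPy. The function names and the random stand-in features are assumptions for illustration. Scaling to unit L2 norm is one common convention for length normalization and may differ from the toolkit actually used.

```python
import numpy as np

def average_pool(frames):
    """Collapse (T, D) frame-level features into one (D,) utterance vector."""
    return frames.mean(axis=0)

def mean_length_normalize(utt_vectors, mean=None):
    """Subtract the global mean, then scale every vector to unit L2 norm."""
    if mean is None:
        mean = utt_vectors.mean(axis=0)   # estimated on the training set
    centered = utt_vectors - mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, 1e-12), mean

rng = np.random.default_rng(0)
# Three hypothetical utterances with different frame counts and 256-dim features.
utterances = [rng.standard_normal((t, 256)) for t in (120, 300, 75)]
pooled = np.stack([average_pool(u) for u in utterances])   # shape (3, 256)
normalized, train_mean = mean_length_normalize(pooled)
```

At test time the mean estimated on the training data would be passed in through the `mean` argument, so that enrollment and test vectors are centered consistently.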
We tested our CPC model on the LibriSpeech corpus. The LibriSpeech corpus is a 1000-hour speech data set based on LibriVox's audio books panayotov2015librispeech, and it consists of male and female speakers reading segments of book chapters. Each recording is identified by its speaker, chapter, and segment. For example, 1320-122612-0000 means 'Segment 0000 of Chapter 122612 read by Speaker 1320.' The speech data is recorded at 16 kHz. The LibriSpeech corpus is partitioned into 7 subsets, and the description of each subset is summarized in Figure 4.1. In our experiments, we used the train-clean-100, train-clean-360, and train-other-500 subsets for training. Dev-other and dev-test are used for validation and CPC model selection. Finally, we report our speaker verification results on test-clean.
4.2 Speaker Verification Trial List
Since LibriSpeech was originally created for speech recognition, we had to create the speaker verification trial list manually. The trial list contains three columns: enrollment ID, test ID, and target/nontarget. The enrollment ID column lists the speech recordings that are enrolled, the test ID column lists the recordings tested against the enrollment recordings, and the target/nontarget column indicates whether the speaker of the given test recording matches the speaker of the given enrollment recording. Table 4.1 contains three example trials.
|enrollment ID||test ID||target/nontarget|
We prepared our trial list in two different ways. The first trial list is created by randomly selecting half of the LibriSpeech recordings as enrollment and the other half as test. There are a total of 1716019 trials in the first trial list. The second trial list is created in the same manner, but we made sure that there is no overlap in chapters spoken by the same speaker. For example, the trial '1320-122617-0000 1320-122617-0025 target' is allowed in the first trial list but not in the second trial list. The two trial lists described above are available for download: first trial list (https://drive.google.com/open?id=10h9GH_vi-BRBT_L_xmSM1ZumQ__jRBmx) and second trial list (https://drive.google.com/open?id=1FDOU1iNSdGT-IMCQnuuJCWV421168x4H).
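The difference between the two trial lists can be made concrete with a small sketch. It assumes that every enrollment recording is paired with every test recording, which the text does not state, and the function names are hypothetical. The recording IDs are ones quoted elsewhere in this document.

```python
from itertools import product

def parse_utt_id(utt_id):
    """Split 'speaker-chapter-segment' into its three fields."""
    speaker, chapter, segment = utt_id.split("-")
    return speaker, chapter, segment

def make_trials(enroll_ids, test_ids, allow_same_chapter=True):
    """Return (enrollment ID, test ID, target/nontarget) tuples."""
    trials = []
    for enroll, test in product(enroll_ids, test_ids):
        e_spk, e_chap, _ = parse_utt_id(enroll)
        t_spk, t_chap, _ = parse_utt_id(test)
        same_speaker = e_spk == t_spk
        if same_speaker and e_chap == t_chap and not allow_same_chapter:
            continue  # second-list rule: drop same-speaker, same-chapter pairs
        trials.append((enroll, test, "target" if same_speaker else "nontarget"))
    return trials

enroll_ids = ["1320-122617-0000"]
test_ids = ["1320-122617-0025", "1320-122612-0000", "2830-3980-0028"]
first_list = make_trials(enroll_ids, test_ids, allow_same_chapter=True)
second_list = make_trials(enroll_ids, test_ids, allow_same_chapter=False)
```

Here `first_list` keeps all three trials, while `second_list` drops the pair that shares both speaker 1320 and chapter 122617.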
4.3 Speaker Verification EER
We present the model training results and speaker verification error rates of the three CPC models we implemented in Table 4.2. CDCK2 and CDCK5 are trained for 60 iterations, and CDCK6 is trained for 30 iterations due to time limitations. CDCK5 has around 1.8 million fewer model parameters than CDCK2 and CDCK6 because its GRU hidden dimension is 40, which is significantly smaller. As expected, due to their larger model size, CDCK2 and CDCK6 have smaller NCE losses and higher positive sample prediction accuracies than CDCK5. Furthermore, CDCK6 attains higher prediction accuracies with half the training iterations, which suggests that the shared encoder approach is more powerful than the single autoregressive model approach.
|CPC model ID||
number of epoch
Figures 4.2, 4.3, and 4.4 show the future frame positive sample prediction accuracies for CDCK2, CDCK5, and CDCK6 respectively. Figures 4.5, 4.6, and 4.7 show the NCE losses for CDCK2, CDCK5, and CDCK6 respectively. The reported loss and accuracy are computed on the dev set, and we can see that the losses decrease while the prediction accuracies increase over training iterations. Note that the NCE loss is averaged over all future prediction timesteps, whereas the prediction accuracy is calculated only on the last timestep. In our implementation, the number of future prediction timesteps is set to 12. Therefore, the NCE loss is averaged over 12 timesteps, but the positive sample prediction accuracy is measured on the 12th timestep only.
After the CPC models are trained, the context vectors (hidden states) of the models are extracted as the CPC features. These features are used as the input features for speaker verification. We explored two approaches to summarization in the speaker verification system described in Figure 3.7: temporal average pooling and i-vectors. In the first approach, temporal average pooling, frame-level features are averaged across frames to get a fixed-length utterance-level feature for each utterance. The speaker verification results of the CPC features and the baseline MFCC features with temporal average pooling are summarized in Table 4.3. We can first see that the speaker verification EER on the first trial list is significantly lower than that on the second trial list. This is expected, since the second trial list contains no speaker-chapter overlap between enrollment and test, which makes it harder. Secondly, CPC features show significant improvement over MFCC. Specifically, features from the CDCK2 model improve on the baseline EER on both trial lists (see Table 4.3). Although CDCK6 showed lower NCE loss and higher prediction accuracies during training, its features performed worse than the ones from CDCK2.
|Feature||Feature Dim||Summarization||LDA Dim||1st EER||2nd EER|
The second approach to summarization in speaker verification is i-vectors, which also gives a fixed-length utterance-level feature for each utterance. However, as mentioned earlier, the input feature dimension for i-vectors is usually below 60. A feature dimension of 256 would take weeks to train an i-vector extractor. Therefore, dimension reduction on the frame-level CPC features is performed before summarization. We chose Principal Component Analysis (PCA) for reducing the CPC feature dimension because PCA is a linear transform and we do not want to introduce extra nonlinearity into the learned feature. Table 4.4 summarizes the CPC features after the PCA transform with their corresponding PCA variance ratios; the feature dimensions are all smaller than or equal to 60 after PCA.
|Feature w PCA||Original Feature||PCA Dim||PCA Variance Ratio|
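The PCA step and the variance ratio reported for it can be illustrated with a short NumPy sketch based on the singular value decomposition. The function names and the synthetic low-rank data are assumptions for illustration, not the toolkit or features used in the experiments.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Fit PCA on (N, D) frames.

    Returns the mean, a (D, n_components) projection matrix, and the
    fraction of total variance retained by the kept components.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2
    ratio = variances[:n_components].sum() / variances.sum()
    return mean, vt[:n_components].T, ratio

def apply_pca(frames, mean, components):
    """Linear projection only: no nonlinearity is introduced."""
    return (frames - mean) @ components

rng = np.random.default_rng(0)
# Synthetic stand-in for 256-dim frames that actually lie in a 10-dim subspace.
frames = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 256))
mean, components, ratio = fit_pca(frames, n_components=10)
reduced = apply_pca(frames, mean, components)
```

On real CPC features the variance ratio would be below 1, and the number of components would be chosen to trade retained variance against i-vector training cost.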
Table 4.5 presents the results of various MFCC, CPC, and combined MFCC and CPC features for speaker verification with i-vectors. The i-vectors system with MFCC alone serves as the baseline on the two trial lists. We trained three i-vectors systems with CPC features after PCA: CDCK2-60, CDCK5-24, and CDCK6-60. We can see that these features improve on the baseline EER on the first trial list, although the relative improvements are much smaller compared to their counterparts in Table 4.3. Furthermore, on the second trial list, MFCC with i-vectors outperforms CPC with i-vectors.
|Feature||Feature Dim||Summarization||1st EER||2nd EER|
|MFCC + CDCK2-36||60||i-vectors||3.62||6.898|
|MFCC + CDCK5-24||48||i-vectors||3.712||6.962|
|MFCC + CDCK6-36||60||i-vectors||3.691||6.765|
Since MFCC and CPC are two very different feature extraction methods, they should capture different aspects of the speech signal, which may be complementary for speaker verification. We fused MFCC and CPC features before i-vectors by simply concatenating the two feature vectors. The last three rows of Table 4.5 show the results of fusing MFCC with CPC features after PCA. We can see that the best combinations improve on MFCC i-vectors on both trial lists.
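Frame-level fusion by concatenation, as described above, amounts to stacking the two feature matrices along the feature axis. The sketch below assumes the two streams are already frame-aligned. The 24-dimensional MFCC size is inferred from the fused dimensions in Table 4.5 (60 minus 36, and 48 minus 24), and the function name and dummy arrays are illustrative only.

```python
import numpy as np

def fuse_features(mfcc, cpc):
    """Concatenate two frame-aligned feature matrices along the feature axis."""
    if mfcc.shape[0] != cpc.shape[0]:
        raise ValueError("MFCC and CPC must have the same number of frames")
    return np.concatenate([mfcc, cpc], axis=1)

n_frames = 128
mfcc = np.zeros((n_frames, 24))   # 24-dim MFCC (inferred, see above)
cpc = np.ones((n_frames, 36))     # e.g. CDCK2 features reduced to 36 dims by PCA
fused = fuse_features(mfcc, cpc)  # shape (128, 60)
```

The fused matrix is then passed to the i-vector extractor in place of either feature alone.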
4.4 Feature Visualizations
It is good practice to visualize speech features, so we visualize the CPC features and compare them to MFCC. Since the CPC features from models CDCK2 and CDCK6 have 256 dimensions, which may contain too much visual detail, we chose to visualize the CPC features from CDCK5, which have 40 dimensions. Figures 4.8 and 4.9 are visual comparisons of MFCC and CPC features on two randomly picked LibriSpeech test-clean utterances: 2830-3980-0028 and 5105-28241-0017. We also visualize the CPC features after the PCA transform, CDCK5-24. Looking at the visualizations, CPC and MFCC bear very little similarity: they differ in both structure and magnitude. However, one observation worth noting about the CPC features is that there are several feature bins whose values remain in a small range over time, which suggests that the CPC features learn some global information that persists over time.
4.5 Speaker Verification DET Curves
To examine the tradeoff between false alarm and miss rates, we plotted the Detection Error Tradeoff (DET) curves for the CPC-based and MFCC-based speaker verification systems. Figures 4.10 and 4.11 are the DET curves for the MFCC and CPC fusion-based i-vectors speaker verification system. For both trial lists, we can see that the fusion features reduce the miss and false alarm probabilities compared to the baseline.
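The points on a DET curve, and the EER reported throughout this chapter, can both be obtained by sweeping a decision threshold over the trial scores. The following is a minimal pure-Python sketch under simplifying assumptions: scores are log-likelihood ratios with no ties, labels are 1 for target and 0 for nontarget, and the EER is approximated by the operating point where the two error rates are closest. The function names are hypothetical.

```python
def det_points(scores, labels):
    """Sweep thresholds over sorted scores; return (false alarm, miss) rate pairs."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 1.0)]   # threshold above every score: nothing is accepted
    accepted_target = accepted_nontarget = 0
    for _, is_target in ranked:
        if is_target:
            accepted_target += 1
        else:
            accepted_nontarget += 1
        points.append((accepted_nontarget / n_nontarget,
                       1.0 - accepted_target / n_target))
    return points

def equal_error_rate(scores, labels):
    """Approximate the EER as the point where |false alarm - miss| is smallest."""
    fa, miss = min(det_points(scores, labels), key=lambda p: abs(p[0] - p[1]))
    return (fa + miss) / 2.0

separable = equal_error_rate([2.0, 1.5, 1.0, 0.5], [1, 1, 0, 0])
interleaved = equal_error_rate([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A DET plot draws these same (false alarm, miss) pairs on normal-deviate axes; the sketch stops at the raw rates and does no plotting.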
5.1 CPC as an Alternative Feature for Speaker Verification
Common speech and speaker recognition systems employ deterministic Fourier-Transform-based features, such as MFCC, filter banks, or Perceptual Linear Predictive (PLP) features. In this work, we explored an unsupervised learned feature, CPC, for the speaker verification task. We showed that CPC attains competitive speaker verification accuracy on the LibriSpeech corpus, and we present it as a potential alternative feature for future speaker verification research.
5.2 i-vectors is not an Ideal Summarization Method for CPC
i-vectors is one of the most popular features for speech analysis tasks. It is widely used for speaker recognition, language identification, speech recognition, etc. However, one constraint that i-vectors imposes on the input feature is that it must be multi-Gaussian distributed. If the input feature does not follow a multi-Gaussian distribution, GMM-UBM, and hence i-vectors, is unlikely to work. From our experiments, we observed that i-vectors, used to summarize frame-level features into an utterance-level feature, is not an ideal summarization method for CPC, in contrast to MFCC i-vectors. Compare Table 4.3, which shows the speaker verification EER of CPC features with average pooling, and Table 4.5, which shows the EER of CPC features with i-vectors. CPC shows very strong results over MFCC with average pooling as the summarization method. On the other hand, when i-vectors is used as the summarization method, CPC does not show a clear advantage over MFCC. One speculation is that CPC features are not multi-Gaussian distributed, and hence there may be a better summarization method, such as x-vectors, which does not assume any distribution on the input features.
5.3 CPC Complements MFCC for i-vectors Speaker Verification
We observed that CPC complements MFCC in the i-vectors speaker verification system. Table 4.5 contains the results of CPC and MFCC feature fusion with i-vectors, which give improvements over both MFCC i-vectors and CPC i-vectors. Similarly, Figures 4.10 and 4.11 show the fusion i-vectors DET curves, which are better than those of the CPC features in Figures 4.12 and 4.13. Therefore, we hypothesize that CPC complements MFCC for i-vectors-based speaker verification on the LibriSpeech corpus. However, whether this holds for all speech data is left for future work.
5.4 Future Work
Looking ahead, there are several directions for this work worth exploring. We list five potential improvements and applications that we would like to work on in the near future.
5.4.1 Density Estimation Methods
First of all, we followed oord2018representation and used Noise Contrastive Estimation (NCE) to estimate the density ratio for learning high-level representations. There are other possible density estimation methods we could experiment with, such as Importance Sampling. We are curious about the effectiveness of NCE and how it compares to other density estimation methods.
The LibriSpeech corpus is a relatively clean (little noise) dataset that was originally made for speech recognition. Although the results we presented show potential, we still have to test on publicly recognized speaker verification datasets. In addition, we manually created our own trial lists since LibriSpeech does not provide one, so we could not compare our findings to other speaker verification systems. We are planning to conduct CPC model refinements and speaker verification experiments on NIST SRE16 with the data in Table 5.1 (based on https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v1).
|Corpus||LDC Catalog No.|
|SWBD2 Phase 1||LDC98S75|
|SWBD2 Phase 2||LDC99S79|
|SWBD2 Phase 3||LDC2002S06|
|SWBD Cellular 1||LDC2001S13|
|SWBD Cellular 2||LDC2004S07|
|SRE2006 Test 1||LDC2011S10|
|SRE2006 Test 2||LDC2012S01|
5.4.3 CPC x-vectors
As mentioned previously, i-vectors may not be the ideal summarization method for CPC. We plan to conduct x-vectors snyder2018x speaker verification experiments after switching to SRE16.
5.4.4 Language Identification
We would also like to conduct CPC experiments on language identification (based on https://github.com/kaldi-asr/kaldi/tree/master/egs/lre07), which uses techniques from speaker recognition. Since CPC is designed to capture global information, it should learn some degree of language information in addition to the speaker information of a speech signal.
5.4.5 Domain Adaptation for Speaker Recognition
Finally, we would like to apply CPC to domain adaptation for speaker recognition. Although there are signs that CPC may not generalize well to unseen conditions (see 4.5), we are interested to see how CPC can be used in that context.