TristouNet: Triplet Loss for Speaker Turn Embedding
TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.
Given a speech sequence and a claimed identity, speaker verification aims at accepting or rejecting the identity claim. It is a supervised binary classification task usually addressed by comparing the test speech sequence to the enrollment sequence uttered by the speaker whose identity is claimed. Speaker identification is the task of determining which speaker (from a predefined set of speakers) has uttered the sequence. It is a supervised multiclass classification task addressed by looking for the enrollment sequence most similar to the test speech sequence. Speaker diarization is the task of partitioning an audio stream into homogeneous temporal segments according to the identity of the speaker. It is broadly addressed as a series of three steps: speech activity detection, speaker change detection (i.e. finding boundaries between any two different speakers), and speech turn clustering.
Whether we address speaker verification, speaker identification, or speaker diarization, it all boils down to finding the best pair $(f, d)$ of representation function $f$ and comparison function $d$ with the following ideal property. Given a speech sequence $x_i$ uttered by a given speaker, any speech sequence $x_j$ uttered by the same speaker should be closer to $x_i$ than any speech sequence $x_k$ uttered by a different one:

$$d(f(x_i), f(x_j)) < d(f(x_i), f(x_k)) \qquad (1)$$
The i-vector approach has become the de facto standard as far as speaker recognition is concerned. Hence, given a common i-vector implementation, the objective for challenge participants is to design the best comparison function. In this paper, we address the dual problem: choosing the euclidean distance as the comparison function, we want to find a representation function that has the property described in Equation 1.
The i-vector approach has also become the state of the art for speaker diarization. However, due to its sensitivity to sequence duration, it is only used once short speech turns have been clustered into larger groups using the Bayesian Information Criterion (BIC) or Gaussian divergence. These two techniques are still commonly used for short (i.e. shorter than 5 seconds) speech turn segmentation and clustering. In this paper, we show that the proposed embedding outperforms both approaches and leads to better speaker change detection results.
Using the triplet loss to train euclidean embeddings has recently been successfully applied to face recognition and clustering. We use the triplet loss and triplet sampling strategy proposed in that work. Going with the euclidean distance and unitary embeddings was also inspired by it. The main difference lies in the choice of the neural network architecture used for the embedding. While convolutional neural networks are particularly adapted to (multi-dimensional) image processing and were used there, we went with recurrent neural networks (more precisely, bi-directional long short-term memory networks, BiLSTM) that are particularly adapted to sequence modeling and have already been used for speech processing.
Earlier work trained a multilayer perceptron (MLP) where the input consists of cepstral coefficients extracted from a sequence of frames, and the output layer has one output per speaker in the training set. The activations of (bottleneck) hidden layers are then used as the representation function for later speaker recognition experiments using (back then, state-of-the-art) Gaussian Mixture Models (GMM). The main limitation of this kind of approach is summarized nicely by Yella et al. in their recent paper: "we hypothesize that the hidden layers of a network trained in this fashion should transform spectral features into a space more conducive to speaker discrimination." In other words, we are not quite sure of the efficiency of this internal representation, as it is not the one being optimized during training. These approaches still require a carefully designed comparison function (based on GMMs or Hidden Markov Models). Our approach is different in that the representation function is the one being optimized with respect to the fixed euclidean distance.
That being said, their work is very similar to ours in that their neural network is given pairs of sequences as input, and is trained using binary cross-entropy loss to decide whether the two sequences are from the same speaker or from two different speakers. With pairs of 500ms speech sequences, they report a 35% error rate on this task. As depicted in Figure 1, the main difference with our approach lies in the fact that we use triplets of sequences (instead of pairs) and optimize the shared embedding directly thanks to the triplet loss (in place of the intermediate binary cross-entropy loss).
Recently, LSTMs have been particularly successful for automatic speech recognition. They have also been applied recently to speaker adaptation for acoustic modelling [15, 16]. However, to the best of our knowledge, this is the first time they are used for an actual speaker comparison task, and a fortiori for speaker turn embedding.
Figure 1 summarizes the main idea behind triplet loss embeddings. During training, the triplet sampling module generates triplets $(x_a, x_p, x_n)$ where $x_a$ are features extracted from a sequence (called anchor) of a given speaker, $x_p$ are from another sequence (called positive) from the same speaker, and $x_n$ are from a sequence (called negative) from a different speaker. Then, all three feature vectors (or sequences of vectors, in our case) are passed through the neural network embedding $f$. Finally, the triplet loss minimizes the distance between the embeddings of the anchor and positive, and maximizes the distance between the embeddings of the anchor and negative.
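Let $T$ be the set of all possible triplets $\tau = (x_a, x_p, x_n)$ (anchor, positive, negative) in the training set. The triplet loss is motivated by Equation 1 introduced earlier, and tries to achieve an even better separation between positive and negative pairs by adding a safety margin $\alpha$. For any triplet $\tau \in T$, we want $\Delta_\tau + \alpha < 0$ where

$$\Delta_\tau = \| f(x_a) - f(x_p) \|_2^2 - \| f(x_a) - f(x_n) \|_2^2$$

and $f$ denotes the embedding. More precisely, the loss that we try to minimize is defined as

$$J = \sum_{\tau \in T} \max(0, \Delta_\tau + \alpha)$$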
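As an illustration, the loss can be sketched in a few lines of NumPy. This is a minimal sketch assuming the standard hinge formulation on squared euclidean distances between unit-norm embeddings; the function names and the margin value of 0.2 are illustrative assumptions, not taken from the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project vectors onto the unit hypersphere (along the last axis)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over triplets of max(0, d(a, p)^2 - d(a, n)^2 + margin).

    Inputs are arrays of shape (n_triplets, dim) holding embeddings that
    were already computed. `margin` is an illustrative value (assumption).
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    delta = d_ap - d_an
    return float(np.sum(np.maximum(0.0, delta + margin)))


# Toy example: 5 random triplets of 16-dimensional unit-norm embeddings.
rng = np.random.default_rng(0)
a, p, n = (l2_normalize(rng.normal(size=(5, 16))) for _ in range(3))
loss = triplet_loss(a, p, n)
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin contributes exactly zero, which is why such triplets bring nothing to training.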
As discussed thoroughly in the original triplet loss paper, it is neither efficient nor effective to generate all possible triplets. Instead, one should focus on triplets that violate the margin constraint. Any other triplet would not contribute to the loss and would only make training slower. Though we do plan to test other triplet sampling strategies in the future, we chose to go with the one called "hard negative" in that paper.
More precisely, after each epoch, we repeat the following sampling process. First, we randomly sample the same number of sequences from each speaker of the training set, which yields a set of anchor-positive pairs. Then, for each of those pairs, we randomly choose one negative out of all negative candidates, such that the resulting triplet satisfies the hard-negative selection criterion.
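The sampling process can be sketched as follows. This is a hedged illustration, not the released implementation: it assumes that a candidate negative qualifies when the resulting triplet violates the margin constraint, and that negatives are drawn from the same pool of sampled sequences. The names `sample_triplets`, `per_speaker` and `margin`, and their default values, are hypothetical.

```python
import itertools

import numpy as np


def sample_triplets(embeddings, labels, per_speaker=3, margin=0.2, seed=0):
    """Return (anchor, positive, negative) index triplets.

    Assumption: a negative is eligible when
    d(a, p)^2 - d(a, n)^2 + margin > 0, i.e. the triplet has non-zero loss.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # 1. Draw the same number of sequences from each speaker.
    chosen = []
    for speaker in np.unique(labels):
        idx = np.flatnonzero(labels == speaker)
        size = min(per_speaker, len(idx))
        chosen.extend(rng.choice(idx, size=size, replace=False))
    chosen = np.array(chosen)

    triplets = []
    # 2. Every same-speaker pair among them is an anchor-positive pair.
    for a, p in itertools.combinations(chosen, 2):
        if labels[a] != labels[p]:
            continue
        d_ap = np.sum((embeddings[a] - embeddings[p]) ** 2)
        # 3. Pick one random negative among the eligible candidates.
        candidates = chosen[labels[chosen] != labels[a]]
        d_an = np.sum((embeddings[candidates] - embeddings[a]) ** 2, axis=1)
        eligible = candidates[d_ap - d_an + margin > 0]
        if len(eligible) > 0:
            triplets.append((int(a), int(p), int(rng.choice(eligible))))
    return triplets


# Toy data: 3 speakers, 4 sequences each, 16-dimensional unit embeddings.
rng = np.random.default_rng(1)
emb = rng.normal(size=(12, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = np.repeat([0, 1, 2], 4)
triplets = sample_triplets(emb, labels)
```

Because the embeddings change as training progresses, the set of eligible negatives changes too, which is why the sampling is repeated after each epoch.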
Figure 2 depicts the topology of TristouNet (triplet loss for speaker turn neural network; the name is colloquial French for "gloomy"), the neural network we propose for sequence embedding. Two Long Short-Term Memory (LSTM) recurrent networks both take the feature sequence as input. The first LSTM processes the sequence in chronological order, while the second goes backward. Average pooling is applied to their respective sequences of outputs. This leads to two output vectors of equal dimension, which are then concatenated into a single vector. Returning only the average output has one advantage: projecting variable-length input sequences into a fixed-dimension space. However, in this paper, we only used fixed-length input sequences in order to evaluate how well the approach performs depending on the duration. Two fully connected layers are then stacked. The final output is L2-normalized, constraining the final embedding to live on the unit hypersphere.
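The pooling, concatenation and normalization steps can be illustrated at the level of array shapes. In this sketch the two LSTM outputs are replaced by random stand-ins, the weights are random, and the tanh activation and the layer size of 16 are assumptions made for illustration only; it is not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)


def embed(forward_out, backward_out, w1, w2):
    """Map two (n_frames, units) output sequences to one unit-norm vector."""
    # Average pooling over time removes the dependency on sequence length.
    pooled = np.concatenate([forward_out.mean(axis=0),
                             backward_out.mean(axis=0)])
    # Two stacked fully connected layers (tanh is an assumed activation).
    hidden = np.tanh(pooled @ w1)
    output = np.tanh(hidden @ w2)
    # L2 normalization places the embedding on the unit hypersphere.
    return output / np.linalg.norm(output)


units = 16  # assumed layer size
w1 = rng.normal(size=(2 * units, units))
w2 = rng.normal(size=(units, units))

# Stand-ins for LSTM outputs on a short and a long sequence.
short = embed(rng.normal(size=(25, units)), rng.normal(size=(25, units)), w1, w2)
long_ = embed(rng.normal(size=(250, units)), rng.normal(size=(250, units)), w1, w2)
```

Both calls return a vector of the same size and unit norm, regardless of the number of input frames.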
The ETAPE TV subset contains 29 hours of TV broadcast (18h for training, 5.5h for development and 5.5h for test) from three French TV channels with news, debates, and entertainment. Fine "who speaks when" annotations were obtained on a subset of the training and development set using the following two-step process: automatic forced alignment of the manual speech transcription followed by manual boundary adjustment by trained phoneticians. Overall, this leads to a training set of 13.8h containing different speakers, and a development set of 4.2h containing 61 speakers (out of which 18 are also in the training set). Due to coarser annotations, the test set is not used in this paper.
Feature extraction. 35-dimensional acoustic features are extracted every 20ms on a 32ms window using the Yaafe toolkit: 11 Mel-Frequency Cepstral Coefficients (MFCC), their first and second derivatives, and the first and second derivatives of the energy. Both BIC and Gaussian divergence baselines rely on the same set of features (without derivatives, because this leads to better performance).
We use the Keras deep learning library for training TristouNet. The number of outputs is set to 16 for every layer. In particular, this means that the sequence embeddings live on the 16-dimensional unit hypersphere. The same activation function is used for every layer as well. Every model (one for each sequence duration 500ms, 1s, 2s and 5s) is trained for 50 epochs, using the margin proposed in the original paper, and the RMSProp optimizer. Finally, the triplet sampling uses a fixed number of random sequences per speaker, for a total of 143520 triplets per epoch.
Reproducible research. github.com/hbredin/TristouNet provides Python code to reproduce the experiments.
This first set of experiments aims at evaluating the intrinsic quality of the learned embedding.
Protocol. 100 sequences are extracted randomly for each of the 61 speakers in the ETAPE development set. The "same/different" experiment consists in a binary classification task: given any two of those sequences, decide whether they were uttered by the same speaker, or two different speakers. This is achieved by thresholding the computed distance between sequences. We compare several approaches: Gaussian divergence, Bayesian Information Criterion, and the proposed embedding with euclidean distance.
Evaluation metric. Two types of errors exist: a false positive is triggered when two sequences from two different speakers are incorrectly classified as uttered by the same speaker, and a false negative is when two sequences from the same speaker are classified as uttered by two different speakers. Moving the decision threshold trades one type of error for the other, yielding different false negative and false positive rates (FNR, FPR). We report the equal error rate (EER), i.e. the value of FPR and FNR when they are equal.
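The EER can be estimated from a list of pairwise distances and same/different labels. The sketch below is one simple way to do it, assuming that a pair is classified as "same speaker" when its distance is at most the threshold, and approximating the EER at the threshold where FPR and FNR are closest; the function name is hypothetical.

```python
import numpy as np


def equal_error_rate(distances, same_speaker):
    """Approximate EER by scanning every observed distance as a threshold."""
    distances = np.asarray(distances, dtype=float)
    same = np.asarray(same_speaker, dtype=bool)
    best_gap, eer = np.inf, None
    for threshold in np.unique(distances):
        accept = distances <= threshold      # predicted "same speaker"
        fpr = np.mean(accept[~same])         # different pairs accepted
        fnr = np.mean(~accept[same])         # same pairs rejected
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy trials: three same-speaker pairs and three different-speaker pairs.
distances = [0.10, 0.20, 0.30, 0.25, 0.80, 0.90]
same_speaker = [True, True, True, False, False, False]
eer = equal_error_rate(distances, same_speaker)
```

With a finite number of trials FPR and FNR rarely coincide exactly, hence the choice of averaging them at the closest crossing point.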
Training. Figures 3 and 4 illustrate how the intrinsic quality of the embedding (of 2s sequences) improves over time, during training. Figure 3 clearly shows how the discriminative power of the embedding improves every 10 epochs: same and different speaker(s) distance distributions are progressively separating until convergence and no further significant improvement is observed.
Results. Figure 5 summarizes the results. As expected, embeddings of longer sequences get better performance: EER decreases steadily as the sequence duration grows from 500ms to 5s. Most importantly, our approach significantly outperforms the commonly used approaches (BIC and Gaussian divergence) for 2s sequence comparison, in both absolute and relative EER terms. Note how the 500ms embedding is almost as good as the (four times longer) 2s BIC baseline approach.
Speaker change detection consists in finding the boundaries between speech turns of two different speakers. It is often used as a first step before speech turns clustering in speaker diarization approaches.
Protocol. For each file in the ETAPE development set, we compute the distance between two (left and right) 2s sliding windows, every 100ms. Peak detection is then applied to the resulting 1-dimensional signal by looking for local maxima within a 1s context. A final thresholding step removes small peaks and only keeps large ones as speaker changes.
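The peak detection and thresholding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the distance signal is already computed with one value every 100ms, the context is interpreted as 1s on each side of a candidate peak, and the threshold value is arbitrary. The function name `detect_changes` is hypothetical.

```python
import numpy as np


def detect_changes(distance, step=0.1, context=1.0, threshold=0.5):
    """Return the times (in seconds) of local maxima above `threshold`.

    Times are relative to the position of the first sliding window.
    """
    distance = np.asarray(distance, dtype=float)
    half = int(round(context / step))  # context expressed in samples
    changes = []
    for i, value in enumerate(distance):
        lo, hi = max(0, i - half), min(len(distance), i + half + 1)
        is_local_max = value == distance[lo:hi].max()
        if is_local_max and value >= threshold:
            changes.append(round(i * step, 3))
    return changes


# Toy signal: a large peak, a nearby smaller one, and a small isolated one.
signal = np.zeros(100)
signal[30], signal[33], signal[70] = 0.9, 0.7, 0.3
changes = detect_changes(signal)
```

Here the peak at index 33 is discarded because a larger one lies within its context, and the peak at index 70 is discarded by the final threshold. Varying `threshold` moves the system along the purity/coverage trade-off.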
Evaluation metric. Given the set $R$ of reference speech turns, and the set $H$ of hypothesized segments, coverage is:

$$\mathrm{coverage}(R, H) = \frac{\sum_{r \in R} \max_{h \in H} |r \cap h|}{\sum_{r \in R} |r|}$$

where $|s|$ is the duration of segment $s$ and $r \cap h$ is the intersection of segments $r$ and $h$. Purity is the dual metric where the roles of $R$ and $H$ are interchanged. Over-segmentation (i.e. detecting too many speaker changes) would result in high purity but low coverage, while missing lots of speaker changes would decrease purity, which is critical for subsequent speech turn agglomerative clustering.
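Both metrics are straightforward to compute on segments represented as (start, end) pairs. The sketch below assumes coverage is the duration-weighted fraction of each reference turn covered by its best-overlapping hypothesized segment, with purity obtained by swapping the two arguments; the helper names are illustrative.

```python
def duration(segment):
    """Duration of a (start, end) segment, in seconds."""
    return max(0.0, segment[1] - segment[0])


def overlap(a, b):
    """Duration of the intersection of two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def coverage(reference, hypothesis):
    """Share of reference duration covered by the best matching segments."""
    total = sum(duration(r) for r in reference)
    covered = sum(max(overlap(r, h) for h in hypothesis) for r in reference)
    return covered / total


def purity(reference, hypothesis):
    """Dual of coverage: the roles of the two sets are interchanged."""
    return coverage(hypothesis, reference)


reference = [(0.0, 10.0), (10.0, 20.0)]
over_segmented = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0), (15.0, 20.0)]
under_segmented = [(0.0, 20.0)]

over = (coverage(reference, over_segmented), purity(reference, over_segmented))
under = (coverage(reference, under_segmented), purity(reference, under_segmented))
```

In this toy example, detecting too many changes gives a coverage of 0.5 with a purity of 1.0, whereas missing the only change gives a coverage of 1.0 with a purity of 0.5.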
Results. Figure 6 summarizes the results obtained when varying the value of the final threshold. Embedding-based speaker change detection clearly outperforms both BIC- and divergence-based approaches. Though it does not improve the best achievable purity compared to the divergence-based approach, embedding-based speaker change detection does improve coverage significantly: at the same purity, it reaches higher coverage than both BIC- and divergence-based approaches. In other words, it means that hypothesized speech turns are longer on average, with the same level of purity.
The impact of this major improvement on the overall performance of a complete speaker diarization system (including speech activity detection and speech turn clustering) has yet to be quantified. It would also be valuable to evaluate how the approach generalizes to variable-length sequences (this is already supported, only not tested yet), as well as its application to speaker recognition. Furthermore, possible future work would be to investigate the use of deeper or wider neural network architectures. Replacing the triplet loss with the center loss recently proposed for face recognition might also be a promising research direction.
Acknowledgement. This work was supported by ANR through the ODESSA and MetaDaTV projects. Thanks to "LSTM guru" Grégory Gelly for fruitful discussions.