In recent years, deep learning classifiers and representations have surpassed the performance of shallow and fully probabilistic counterparts in several speech recognition and computer vision tasks, often by a large margin. A key ingredient of this success was the availability of large annotated datasets, which enabled very deep architectures to be trained using supervised learning. The availability of large in-domain corpora played a major role in building robust speaker recognition models, too. The success of Joint Factor Analysis and i-vectors can largely be attributed to such corpora, which enabled modeling correlations between acoustic units [1, 2]. More recently, deep learning architectures have outperformed such methods on most speaker recognition benchmarks [3].
On the other hand, these architectures require the training datasets to be labelled with respect to speaker, which was not the case with i-vectors. As an unsupervised model, an i-vector extractor does not require utterances with associated speaker labels for training. Labelled utterances are needed merely for training the backend classifier (typically a PLDA), which requires much less data thanks to its relatively small number of trainable parameters.
In this paper, we introduce a training architecture capable of learning speaker embeddings with only a few or no speaker labels. The structure we add to the standard speaker embedding network is a decoder network, which learns to reconstruct speech segments at the frame level using a mean-squared error loss. A key idea of the method is to condition the decoder not merely on the embedding extracted by the encoder (i.e. the embedding extractor), but also on the phonetic sequence of the decoding speech segment, as estimated by an independently trained Automatic Speech Recognition (ASR) model. Such conditioning allows a decoding loss over speech frames to be used for learning in an end-to-end fashion with standard backpropagation. It moreover enables learning speaker embeddings that capture only the idiosyncratic characteristics of a speaker, rather than irrelevant information about the phonetic sequence. The latter property is further improved by extracting two different segments from an utterance: the first for feeding the encoder and extracting the embedding, and the second for serving as the target of the decoder, together with its associated phone sequence.
We show that the proposed decoder loss can be combined with the standard x-vector architecture and loss (i.e. cross-entropy over training speakers) yielding significant improvement. Finally, we consider a semi-supervised learning scenario, where only a small fraction of the training utterances contain speaker labels and we show how the proposed architecture can leverage both labelled and unlabelled utterances. All our experiments are conducted on VoxCeleb and Speakers In The Wild benchmarks.
2 Related work
2.1 Speaker recognition using autoencoders
There have been several attempts in speaker recognition to make use of reconstruction losses. Most of them are based on (plain or variational) autoencoders, either in an unsupervised way or using speaker labels [4, 5, 6]. Other such approaches aim at reducing the phonetic variability of short segments by learning a mapping from short segments to the whole utterance [7]. The main weakness of these methods is that they operate over fixed, utterance-level representations, typically i-vectors. Our approach of conditioning the reconstruction on the estimated phone sequence of each segment could be employed to revisit such approaches in an end-to-end fashion. Other recent approaches that enhance the x-vector architecture with an adversarial loss are also relevant, since they propose joint training of the network with auxiliary losses and structures which are removed at runtime [8, 9, 10].
2.2 Speech synthesis, recognition, and factorization
Recently, speaker embeddings have been deployed in text-to-speech (TTS) and voice conversion [11, 12, 13]. The embeddings are typically extracted using a pretrained network (e.g. a d-vector or x-vector extractor), which may be fine-tuned to the task. Conditioning the decoder on speaker embeddings (together with the text of the target utterance) is crucial for training multispeaker TTS systems and producing synthetic speech for target speakers unseen during training. Although our method shares certain similarities with this family of TTS methods (especially in the decoder), our goals are different. Rather than employing a pretrained speaker recognition model to extract embeddings, we demonstrate that speaker-discriminative training is feasible using merely a reconstruction loss over speech segments and training the overall network jointly. Finally, a recently introduced approach for integrating ASR and TTS into a single cycle during training has certain similarities with our method [14], and the same holds for the deep factorization method for speech proposed in [15].
2.3 Self-supervised learning
The approach of extracting speaker embeddings by reconstructing different parts of a sequence can be considered an application of self-supervised learning, where a network is trained with a loss on a pretext task, without the need for human annotation. Models using self-supervised learning for initialization are now state-of-the-art in several domains and tasks, such as action recognition, reinforcement learning, and natural language understanding [16, 17, 18, 19].
3 The proposed architecture
In this section we describe the network used in training and we provide rationale for certain algorithmic choices. The architecture is depicted in Figure 1. Architectural details are given in Section 4.2.
3.1 Notation

We denote by $X = \{x_t\}_{t=1}^{T}$ the frame sequence of an utterance, and by $P$ the corresponding estimated phone sequence. We also denote an augmented version of the same utterance by $\tilde{X}$, where the tilde indicates corruption by noise, reverberation, or another data augmentation scheme. The encoder consumes segments extracted from $\tilde{X}$ at random positions, denoted by $\tilde{X}_e$. Let also $X_d$ be another segment of the same utterance (without data augmentation), and let $P_d$ denote its corresponding estimated phone sequence. The encoder part of the network (i.e. the embedding extractor) is denoted by
$e = f(\tilde{X}_e; \theta_e)$
and it is a function parametrized by $\theta_e$.
3.2 Purely self-supervised training
We train the network using a decoder, which is a network implementing the following function:
$\hat{X}_d = g(e, P_d; \theta_d).$
The decoder is parametrized by $\theta_d$ and receives as input the embedding $e$ and the estimated phone sequence $P_d$. Note that the embedding and the target frames correspond in general to different segments of the same utterance. The architecture is trained using the mean-squared error (MSE) loss, i.e.
$\mathcal{L}_{\mathrm{dec}} = \frac{1}{T_d} \sum_{t=1}^{T_d} \| x_t - \hat{x}_t \|^2,$
where $x_t$ and $\hat{x}_t$ denote the frames of $X_d$ and $\hat{X}_d$, respectively.
The above equations show that the encoder and decoder can be trained jointly using standard backpropagation.
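As a concrete illustration of the decoding loss, here is a minimal NumPy sketch; all names and dimensions are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def decoding_mse(target_frames, predicted_frames):
    """Frame-averaged mean-squared error between the clean target segment
    and the decoder's reconstruction; both are (num_frames, feat_dim)."""
    diff = target_frames - predicted_frames
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
X_d = rng.standard_normal((200, 30))   # 200 target frames, 30-dim features
X_hat = X_d + 0.1                      # a poor, constant-offset reconstruction
assert decoding_mse(X_d, X_d) == 0.0   # perfect reconstruction gives zero loss
assert decoding_mse(X_d, X_hat) > 0.0
```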
3.3 Combining self-supervision with cross-entropy over speakers
In case the utterance has a speaker label, we may combine the decoding loss with the standard cross-entropy loss over speakers. The speaker classifier estimates the posterior distribution over the set of training speakers, i.e.
$P(s \mid e; \theta_s);$
it is parametrized by $\theta_s$ and has a softmax as its final layer. The cross-entropy (CE) loss can be added to the decoding loss and the overall network can be trained jointly, i.e.
$\mathcal{L} = \delta \, \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathcal{L}_{\mathrm{dec}},$
where $\delta \in \{0, 1\}$ indicates whether the particular utterance has a speaker label, and $\lambda$ is a scalar for balancing the two losses.
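The combination of the two losses can be sketched as follows; the loss values and balancing weight in the example are arbitrary:

```python
def combined_loss(ce_loss, dec_loss, has_label, lam=1.0):
    """L = delta * L_CE + lam * L_dec, with delta = 1 for labelled
    utterances and 0 otherwise."""
    delta = 1.0 if has_label else 0.0
    return delta * ce_loss + lam * dec_loss

# Unlabelled utterances contribute only through the decoding loss.
assert combined_loss(2.0, 0.5, has_label=False) == 0.5
assert combined_loss(2.0, 0.5, has_label=True, lam=2.0) == 3.0
```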
3.4.1 Encoding and decoding segments
The rationale for defining the encoding and decoding sequences as different segments of the same utterance is to encourage the encoder to learn embeddings that encode information about the way a speaker pronounces acoustic events unseen in the encoding sequence. Note also that we keep the decoding target sequence clean (i.e. without augmentation). We do so to encourage the encoder to learn how to denoise speech sequences, an approach similar to denoising autoencoders [20].
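The segment-sampling logic described above can be sketched as follows, assuming framewise features; the function name and the fixed segment length are illustrative:

```python
import numpy as np

def sample_enc_dec_segments(clean, augmented, seg_len, rng):
    """Draw an augmented segment for the encoder and an independently
    positioned clean segment (the decoder target) from one utterance."""
    num_frames = clean.shape[0]
    start_e = int(rng.integers(0, num_frames - seg_len + 1))
    start_d = int(rng.integers(0, num_frames - seg_len + 1))
    enc_seg = augmented[start_e:start_e + seg_len]   # noisy input to the encoder
    dec_seg = clean[start_d:start_d + seg_len]       # clean decoder target
    return enc_seg, dec_seg, start_d                 # start_d locates the phone labels

rng = np.random.default_rng(1)
clean = rng.standard_normal((300, 30))
augmented = clean + 0.1 * rng.standard_normal((300, 30))
enc, dec, start_d = sample_enc_dec_segments(clean, augmented, 200, rng)
assert enc.shape == (200, 30) and dec.shape == (200, 30)
assert 0 <= start_d <= 100
```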
3.4.2 Representation of the phonetic sequence
For passing the phonetic sequence to the decoder, we choose estimated phones as the phonetic units, in the form of one-hot vectors. Clearly, there are several other options, such as bottleneck features, senones, or characters.
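Turning framewise phone labels into one-hot vectors is straightforward; a minimal sketch, using the 166 classes described later in the paper:

```python
import numpy as np

def phones_to_one_hot(phone_ids, num_classes=166):
    """Map a framewise sequence of phone indices to one-hot vectors."""
    phone_ids = np.asarray(phone_ids)
    out = np.zeros((len(phone_ids), num_classes))
    out[np.arange(len(phone_ids)), phone_ids] = 1.0
    return out

P = phones_to_one_hot([0, 5, 165])
assert P.shape == (3, 166)
assert P.sum() == 3.0 and P[1, 5] == 1.0
```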
One of the reasons why we did not use bottleneck features is that they inevitably carry speaker-discriminative information, as experience shows (recall that speaker recognition is feasible with plain bottleneck features [21]). Therefore, passing bottleneck features to the decoder could create information leakage, preventing the embedding from capturing useful speaker-discriminative information. Another drawback of bottlenecks is that they would tie the remaining network to a specific bottleneck extractor. Contrarily, symbolic entities such as phones allow different ASR models to be used for estimating the phone sequence, or even the ground-truth phones to be used, when available.
On the other hand, senones would result in a much larger and harder-to-train decoding network, while senone posteriors would be much less spiky compared to phone posteriors, and hence far from resembling one-hot vectors. Furthermore, passing senones to the decoder seems unnecessary; the decoder can recover the context-dependence of each phone since it is conditioned on the overall phone sequence.
Finally, using characters would require additional complexity to align the two sequences, such as the attention mechanisms employed in TTS approaches. For these reasons, we consider phones in the form of one-hot vectors to be the appropriate representation and level of granularity for this task and setup.
3.4.3 Semi and weakly supervised learning
The proposed architecture defines a principled way of leveraging unlabelled data in x-vector training. There are other losses for supervised training with which it may be combined, such as the triplet loss [22, 23, 24]. Note that neither the cross-entropy nor the triplet loss can leverage unlabelled utterances (e.g. by splitting the same utterance into multiple segments), unless one assumes that each utterance in the training set comes from a different speaker (which is typically not the case). On the other hand, our self-supervised method requires only the knowledge that two segments belong to the same speaker, and it can be extended to encode and decode segments coming from different utterances of the same speaker. This would make it suitable also for certain weakly supervised learning settings, where labels indicate merely that two or more utterances come from the same speaker, without excluding the possibility that other utterances belong to that speaker as well. In such cases, a training criterion that requires neither exclusive labels (as the cross-entropy loss does) nor negative pairwise labels (as the triplet loss does) seems to be the only principled method for learning speaker representations.
4 Experiments
4.1 VoxCeleb and SITW datasets
We evaluate the systems on the Speakers In The Wild (SITW) [25] core-core eval set and the VoxCeleb 1 test set [26]. We use the SITW core-core development set for tuning the various hyperparameters of the systems. For preparing the data, we use the Kaldi [27] SITW recipe (sitw/v2). This recipe uses VoxCeleb 1 and 2 [26, 28] as training data. We use the recipe as is, except that we do not include the VoxCeleb 1 test set in the training set. The number of speakers in the training set is 7146 and the number of utterances is 2081192, including augmentations. For the semi-supervised experiments, we randomly selected 1000 speakers, with 227998 utterances in total.
4.2 Implementation and training details
4.2.1 Implementation and decoder
We use the TensorFlow toolkit [29] for implementing our systems. As a baseline, we use the standard Kaldi x-vector architecture [3], i.e. five TDNN layers with ReLU activation functions followed by batch normalization, followed by a pooling layer that accumulates means and standard deviations, followed by two feed-forward layers with ReLU and batch normalization, and finally a softmax layer for classifying speakers. Different from Kaldi, we apply a global normalization on the input features, and batch normalization also after the pooling layer. As discussed above, the loss is CE over the training speakers.
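The statistics-pooling layer mentioned above can be sketched as follows (a simplified NumPy version; the actual implementation pools over variable-length minibatched sequences):

```python
import numpy as np

def stats_pooling(frame_features):
    """Concatenate per-dimension mean and standard deviation over frames,
    turning a (num_frames, dim) sequence into a fixed 2*dim vector."""
    mu = frame_features.mean(axis=0)
    sigma = frame_features.std(axis=0)
    return np.concatenate([mu, sigma])

H = np.random.default_rng(3).standard_normal((200, 512))
pooled = stats_pooling(H)
assert pooled.shape == (1024,)  # 512 means + 512 standard deviations
```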
The reconstruction network (i.e. the decoder) consists of five layers that operate framewise. Its input is the phone labels represented as one-hot vectors, and its output is the predicted 30-dimensional feature vectors. The input layer is either (a) a feed-forward layer or (b) a TDNN layer with a context of three frames on each side (denoted by ctx). The other layers are feed-forward layers with an output dimension of 166 (i.e. the same as the number of phone labels). All layers except the last one are followed by ReLU and batch normalization. The embedding is appended to the input of each layer. The loss for the reconstruction is the MSE between the real and predicted features.
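A NumPy sketch of the framewise decoder, variant (a) (feed-forward input layer), with the embedding concatenated to the input of every layer; batch normalization is omitted, the embedding dimension is a toy value, and all weights are random, so this only illustrates the data flow:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def decoder_forward(phone_one_hot, embedding, weights):
    """Framewise decoder: each layer sees its input concatenated with the
    (tiled) embedding; all layers but the last apply ReLU."""
    num_frames = phone_one_hot.shape[0]
    tiled_emb = np.tile(embedding, (num_frames, 1))
    h = phone_one_hot
    for i, (W, b) in enumerate(weights):
        h = np.concatenate([h, tiled_emb], axis=1) @ W + b
        if i < len(weights) - 1:
            h = relu(h)
    return h  # (num_frames, 30) predicted feature vectors

rng = np.random.default_rng(2)
emb_dim, in_dim = 8, 166                    # toy embedding size; 166 phone classes
weights = []
for out_dim in [166, 166, 166, 166, 30]:    # five layers, 30-dim output
    weights.append((0.01 * rng.standard_normal((in_dim + emb_dim, out_dim)),
                    np.zeros(out_dim)))
    in_dim = out_dim
phones = np.eye(166)[rng.integers(0, 166, size=10)]
pred = decoder_forward(phones, rng.standard_normal(emb_dim), weights)
assert pred.shape == (10, 30)
```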
In the experiments, we use minibatches containing 150 segments. The lengths of the segments are 2-4 s. We use the ADAM optimizer [31], starting with a learning rate of 1e-2, which we then halve whenever the loss on a validation set does not improve for 32 epochs, where an epoch is defined as 400 minibatches. In the semi-supervised experiment, each batch contains 150 labelled segments and 150 unlabelled segments, and only the labelled segments are used to calculate the speaker classification loss.
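The learning-rate schedule can be sketched as a minimal stateful helper; the class and parameter names are illustrative:

```python
class HalvingSchedule:
    """Halve the learning rate whenever the validation loss has not
    improved for `patience` consecutive epochs."""

    def __init__(self, lr=1e-2, patience=32):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= 0.5
                self.stale = 0
        return self.lr

sched = HalvingSchedule(lr=1e-2, patience=3)
for loss in [1.0, 0.9, 0.95, 0.95, 0.95]:  # three epochs with no improvement
    lr = sched.step(loss)
assert abs(lr - 5e-3) < 1e-12
```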
4.2.2 The ASR model
The frame-level phone labels are generated using the official Kaldi [27] Tedlium speech recognition recipe (s5_r3). This recipe uses a TDNN-based acoustic model with i-vector adaptation and an RNN-based language model. Phone posteriors are obtained from the lattices using the forward-backward algorithm and then converted to hard labels. There are 39 phones, each coming in four different versions depending on its position in the word, plus a silence (SIL) and a noise (NSN) class with 5 versions each, resulting in 166 phone classes.
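The size of the phone-label inventory follows directly from this description:

```python
# 39 phones in 4 word-position variants, plus SIL and NSN in 5 variants each.
num_phone_classes = 39 * 4 + 2 * 5
assert num_phone_classes == 166
```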
4.2.3 PLDA Backend
We used a backend identical to the one in the Kaldi x-vector recipe. This backend involves a preprocessing step which first reduces the x-vector dimension by LDA from 512 to 128, and then applies a non-standard variant of length-norm (see https://github.com/kaldi-asr/kaldi/blob/master/src/ivector/plda.cc). The backend was implemented in Python based on our in-house toolkit Pytel. For the fully supervised experiments we, like the Kaldi recipe, use the 200k longest utterances, resulting in 6298 speakers. For the semi-supervised experiments, we use all of these utterances whose speaker is among the 1000 randomly selected ones, resulting in 899 speakers and 31785 utterances.
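For illustration, the Kaldi-style length normalization scales a vector so that its squared norm equals its dimensionality, rather than 1; this sketch reflects our reading of the referenced plda.cc, not a verified re-implementation:

```python
import numpy as np

def kaldi_style_length_norm(x):
    """Scale x so that ||x||^2 equals dim(x), instead of 1 as in
    conventional length normalization."""
    dim = x.shape[-1]
    return x * np.sqrt(dim) / np.linalg.norm(x)

v = kaldi_style_length_norm(np.array([3.0, 4.0]))
assert abs(np.linalg.norm(v) ** 2 - 2.0) < 1e-9
```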
4.3 Experimental Results
4.3.1 Fully labelled training set
The results using the standard training set of VoxCeleb are given in Table 1.
| System | SITW EER (%) | SITW minDCF | VoxCeleb EER (%) | VoxCeleb minDCF |
| --- | --- | --- | --- | --- |
| Self, cln, ctx | 4.844 | 0.485 | 5.758 | 0.536 |
| Self, cln, ctx, same | 4.760 | 0.478 | 6.347 | 0.580 |
| Spk+Self, cln, ctx | 3.362 | 0.353 | 3.759 | 0.344 |
The results show that the self-supervised models are capable of extracting speaker-discriminant embeddings. The use of clean (cln) decoding utterances yields slightly better results, as it enforces the encoder to act as a denoiser. Moreover, conditioning the decoding TDNN on a 7-frame phonetic context (ctx) is clearly beneficial compared to conditioning it merely on the phone of the target frame.
We also observe that using same segments for encoding and decoding (same) yields inferior performance on VoxCeleb, while on SITW their performance is equivalent. A plausible explanation is that VoxCeleb contains shorter segments compared to SITW. Hence, as encoding and decoding on different segments encourages the network to learn how to reconstruct phonetic subsequences unseen in the encoding segments, it is expected to be more beneficial for short durations.
When the two losses are combined (Spk+Self), the model clearly outperforms plain x-vector (Spk). In this case, the self-supervised loss has a regularization effect, constraining the network to learn representations that generalize well to unseen speakers. Again, the use of context yields superior performance, although in this case the differences are less significant.
4.3.2 Partly labelled training set
In this set of experiments, we assume that only a fraction of the training utterances is labelled. Hence, in the results we provide in Table 2, the PLDA is trained with 899 VoxCeleb speakers out of the 6298 used in the fully supervised experiments.
| System | SITW EER (%) | SITW minDCF | VoxCeleb EER (%) | VoxCeleb minDCF |
| --- | --- | --- | --- | --- |
| Spk, full set | 4.347 | 0.424 | 4.804 | 0.472 |
| Self, cln, ctx | 5.793 | 0.548 | 6.644 | 0.589 |
| Self, cln, ctx, same | 5.768 | 0.548 | 7.503 | 0.612 |
| Spk+Self, cln, ctx | 5.416 | 0.488 | 6.315 | 0.534 |
In the first two experiments in Table 2 we use the standard CE-over-speakers loss. We observe a severe degradation when the number of speakers used to train the x-vector baseline is reduced to 1000 (Spk, baseline). For comparison, we report the experimental results where the full set of speakers is used for training the x-vector model (Spk, full set).
The results using only self-supervision with context are clearly superior to those of pure x-vectors, due to the capacity of self-supervision in leveraging all available utterances during training. Moreover, when the two losses are combined, the results become even better, especially in terms of minDCF. Finally, we observe again the gains in performance by using different encoding and decoding segments.
5 Conclusions and future work
In this paper, we introduced a new way of training speaker embedding extractors using self-supervision. We showed that a typical TDNN-based extractor can be trained without speaker labels, using a decoder network to approximate, in the MSE sense, a speech segment of the same utterance. A key idea for enabling decoding is the conditioning of the decoder on both the embedding and the phonetic sequence of the decoding segment, as estimated by an ASR model. Furthermore, we showed that the proposed loss can be combined with the standard cross-entropy, yielding notable improvements. Finally, we demonstrated its effectiveness in semi-supervised learning, i.e. when only a small fraction of the training set is labelled. Both additional networks we introduced (the decoder and the ASR model) are needed only during training, leaving the standard x-vector architecture unchanged at runtime.
The proposed approach can be extended in several ways. The method of conditioning the decoder on the phonetic sequence of the speech segment paves the way for revisiting methods such as variational autoencoders in an end-to-end fashion. Speech synthesis approaches may also benefit from the proposed method, e.g. by training embedding extractors jointly with TTS from scratch. Finally, there is large room for improvement in the architecture (e.g. by using a recurrent or attentive decoder, or a deeper and wider encoder), in the training scheme (e.g. by varying the duration of encoding and decoding segments), and in the way the existing speaker labels are used in training (e.g. by extracting the two segments from different utterances of the same speaker).
-  P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
-  N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
-  J.-T. Chien and C.-W. Hsu, “Variational manifold learning for speaker recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4935–4939.
-  J. Villalba, N. Brümmer, and N. Dehak, “Tied variational autoencoder backends for i-vector speaker recognition.” in INTERSPEECH, 2017, pp. 1004–1008.
-  A. Silnova, N. Brummer, D. Garcia-Romero, D. Snyder, and L. Burget, “Fast variational bayes for heavy-tailed plda applied to i-vectors and x-vectors,” in INTERSPEECH 2018, 2018.
-  J. Guo, N. Xu, K. Qian, Y. Shi, K. Xu, Y. Wu, and A. Alwan, “Deep neural network based i-vector mapping for speaker verification using short utterances,” Speech Communication, vol. 105, pp. 92–102, 2018.
-  W. Ding and L. He, “MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks,” in INTERSPEECH, 2018.
-  J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, and O. Plchot, “Speaker verification using end-to-end adversarial language adaptation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
-  G. Bhattacharya, J. Monteiro, J. Alam, and P. Kenny, “Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification,” arXiv preprint arXiv:1811.03063, 2018.
-  A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep voice 2: Multi-speaker neural text-to-speech,” in Advances in Neural Information Processing Systems, 2017, pp. 2962–2970.
-  Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie et al., “Sample efficient adaptive text-to-speech,” arXiv preprint arXiv:1809.10460, 2018.
-  Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018, pp. 4485–4495.
-  A. Tjandra, S. Sakti, and S. Nakamura, “End-to-end feedback loss in speech chain framework via straight-through estimator,” arXiv preprint arXiv:1810.13107, 2018.
-  L. Li, D. Wang, Y. Chen, Y. Shi, Z. Tang, and T. F. Zheng, “Deep factorization for speech signal,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
-  O. Wiles, A. Koepke, and A. Zisserman, “Self-supervised learning of a facial attribute embedding from video,” in BMVC, 2018.
-  P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, “Time-contrastive networks: Self-supervised learning from video,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1134–1141.
-  M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.
-  A. Lozano-Diez, A. Silnova, P. Matejka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey, vol. 2016. ISCA Bilbao, Spain, 2016, pp. 352–357.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
-  C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances.” in Interspeech, 2017, pp. 1487–1491.
-  C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
-  M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The 2016 speakers in the wild speaker recognition evaluation,” in INTERSPEECH, 2016.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in INTERSPEECH, 2018.
-  M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in INTERSPEECH, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
-  D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.