With the development of various optimization techniques, deep learning has become a powerful tool for numerous applications, including speech and image recognition. To build high-performance models, supervised learning is the most popular methodology, in which labeled samples are used for optimizing model parameters. It is known that deep neural networks (e.g., ResNet [1]) having more than a million parameters outperform hand-crafted feature extraction methods. As such, optimizing parameters with a well-designed objective function is one of the most important research topics in deep learning.
In recent years, supervised metric learning methods for deep neural networks have attracted attention. Examples of these include triplet loss [2] and prototypical episode loss [3], which predispose a network to minimize within-class distance and maximize between-class distance. They are also effective for text-independent speaker verification, as shown in [4], because cosine similarity between utterances from the same speaker is directly maximized in the training phase.
Meanwhile, unsupervised learning methods have advanced greatly, thanks to large-scale collections of unlabeled samples. Some studies have recently shown that self-supervised learning achieves performance very close to that of supervised learning. For example, the simple framework for contrastive learning of visual representations (SimCLR) [5] provides superior image representations by introducing the contrastive NT-Xent loss with data augmentation on unlabeled images. For speaker verification, these methods motivate us to explore unsupervised and semi-supervised ways to learn speaker embeddings by effectively using unlabeled utterances.
In general, supervised learning and unsupervised learning depend on different methodologies. However, supervised metric learning and unsupervised contrastive learning share a common idea to maximize or minimize the similarity between samples. This implies the possibility of unifying these two learning frameworks.
In this paper, we propose a semi-supervised contrastive learning framework based on generalized contrastive loss (GCL). GCL provides a unified formulation of two different losses from supervised metric learning and unsupervised contrastive learning. Thus, it naturally works as a loss function for semi-supervised learning. In experiments, we applied the proposed framework to text-independent speaker verification on the VoxCeleb dataset. We demonstrated that GCL enables the network to learn speaker embeddings in three manners (supervised, semi-supervised, and unsupervised learning) without any changes in the definition of the loss function.
2 Related Work
2.1 Supervised Metric Learning
Supervised metric learning is a framework to learn a metric space from a given set of labeled training samples. For recognition problems, such as audio and image recognition, the goal is typically to learn the semantic distance between samples.
A recent trend in supervised metric learning is to design a loss function at the top of a deep neural network. Examples include contrastive loss for Siamese networks [6], triplet loss for triplet networks [2], and episode loss for prototypical networks [3]. To measure the distance between samples, Euclidean distance is often used with these losses.
For face identification from images, measuring similarity by cosine similarity often improves the performance. ArcFace [7], CosFace [8], and SphereFace [9] are popular implementations of this idea. Their effectiveness is also shown in speaker verification from audio samples with some extended loss definitions, such as ring loss [10, 11]. One of the best choices for speaker verification is angle-prototypical loss, which introduces cosine similarity to episode loss, as shown in [4] with thorough experiments.
2.2 Unsupervised Contrastive Learning
Unsupervised learning is a framework to train a model from a given set of unlabeled training samples. Classic methods for unsupervised learning include clustering methods such as k-means clustering [12]. Most of them are statistical approaches with some objectives based on means and variances.
Recently, self-supervised learning has proven to be effective for pre-training deep neural networks. For example, Jigsaw [13] and Rotation [14] define a pretext task on unlabeled data and pre-train networks for image recognition by solving it. Deep InfoMax [15] and its multiscale extension AMDIM [16] focus on mutual information between representations extracted from multiple views of a context. SimCLR [5] introduces contrastive learning using data augmentation. The effectiveness of contrastive learning is also shown in MoCo and MoCo v2 [17, 18]. These methods achieve performance comparable with that of supervised learning in tasks of image representation learning.
Cross-modal approaches are also effective if more than one source is available. For speaker verification, Nagrani et al. [19] proposed a cross-modal self-supervised learning method, which uses face images as supervision of audio signals to identify speakers.
2.3 Semi-Supervised Learning
Semi-supervised learning is a framework to train a model from a set consisting of both labeled and unlabeled samples. To effectively incorporate information from unlabeled samples into the parameter optimization step, a regularization term is often introduced into the objective function. For example, consistency regularization [20] is used to penalize sensitivity to augmented unlabeled samples.
3.1 Supervised Metric Learning
Let $\mathcal{D}_s = \{(x_i, y_i)\}$ be a training dataset for supervised learning, which consists of pairs of a sample $x_i$ and its discrete class label $y_i$. The goal of supervised metric learning is to learn a metric function $d(x, x')$, which assigns a small distance between samples belonging to the same class and a relatively large distance between samples from different classes. Assuming that the training phase has $T$ iterations for parameter updates, a mini-batch $\mathcal{B}$ is sampled from $\mathcal{D}_s$ at each iteration. For convenience, two-step sampling is often used. First, a set of $N$ different classes is randomly sampled from the set of training classes. We denote the sampled classes by $\{c_j\}_{j=1}^{N}$. Second, $M$ independent samples are randomly sampled from each of the $N$ classes. We denote the samples from the class $c_j$ as $\{x_{ji}\}_{i=1}^{M}$. As a result, a mini-batch consists of $N \times M$ samples.
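The two-step sampling described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `sample_minibatch`, its arguments, and the choice to sample without replacement are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_minibatch(labels, num_classes, samples_per_class, rng=None):
    """Two-step sampling: pick classes first, then samples within each class.

    labels[i] is the class label of training sample i.
    Returns a dict mapping each sampled class to a list of sample indices.
    """
    rng = rng or random.Random()
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Assumption: only classes with enough samples are eligible,
    # because this sketch samples without replacement.
    eligible = [c for c, idx in by_class.items() if len(idx) >= samples_per_class]
    # Step 1: sample distinct classes.
    classes = rng.sample(eligible, num_classes)
    # Step 2: sample distinct items from each selected class.
    return {c: rng.sample(by_class[c], samples_per_class) for c in classes}


# Toy dataset: 4 classes with 3 samples each.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
batch = sample_minibatch(labels, num_classes=3, samples_per_class=2,
                         rng=random.Random(0))
```

With 3 classes and 2 samples per class, the resulting mini-batch holds 6 sample indices in total, matching the product of the two sampling sizes.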
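As an example of supervised metric learning, we show the training process of a prototypical network [3]. The main idea of a prototypical network is to make a prototype representation of each class and to minimize the distance between a query sample and its corresponding prototype. Its loss for parameter updates is computed as follows:
1. Sample a mini-batch $\mathcal{B}$ from the training dataset $\mathcal{D}_s$ and split it into a query set $\mathcal{Q} = \bigcup_{j} \mathcal{Q}_j$ and a support set $\mathcal{S} = \bigcup_{j} \mathcal{S}_j$, where $\mathcal{Q}_j$ and $\mathcal{S}_j$ contain the samples from the $j$-th sampled class.
2. Extract query representations from $\mathcal{Q}$ by
$$z_{ji} = f_\theta(x_{ji}), \quad x_{ji} \in \mathcal{Q}_j, \tag{1}$$
where $f_\theta$ is a neural network for embedding (i.e., a network without the final loss layer) and $\theta$ is a set of parameters.
3. Construct prototype representations from $\mathcal{S}$ by
$$p_j = \frac{1}{|\mathcal{S}_j|} \sum_{x \in \mathcal{S}_j} f_\theta(x). \tag{2}$$
4. From a representation batch $\mathcal{Z} = \{z_{ji}\} \cup \{p_j\}$, compute the episode loss defined by
$$L = -\frac{1}{|\mathcal{Q}|} \sum_{j=1}^{N} \sum_{i=1}^{|\mathcal{Q}_j|} \log \frac{S(z_{ji}, p_j)}{\sum_{k=1}^{N} S(z_{ji}, p_k)}, \tag{3}$$
where $N$ is the number of classes in the mini-batch, $S(z, z') = \exp(-d(z, z'))$ is the exponential function of negative distance between representations $z, z'$, and $d(z, z') = \|z - z'\|_2^2$ is the squared Euclidean distance.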
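The episode loss computation in the steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the authors' code: embeddings are given directly as lists of floats (the embedding network is omitted), and the helper names `squared_euclidean`, `prototypes`, and `episode_loss` are hypothetical.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototypes(support):
    """support: dict class -> list of embeddings. Returns class -> mean embedding."""
    return {c: [sum(col) / len(vecs) for col in zip(*vecs)]
            for c, vecs in support.items()}


def episode_loss(queries, protos):
    """queries: list of (class, embedding) pairs.

    Averages the negative log-softmax over negative squared distances
    from each query to all prototypes.
    """
    total = 0.0
    for cls, q in queries:
        logits = {c: -squared_euclidean(q, p) for c, p in protos.items()}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_norm - logits[cls]
    return total / len(queries)


# Toy episode: two classes, two support embeddings each.
support = {0: [[0.0, 0.0], [0.0, 2.0]], 1: [[4.0, 0.0], [4.0, 2.0]]}
protos = prototypes(support)
# Queries that coincide with their own prototype give a loss near zero.
good = episode_loss([(0, [0.0, 1.0]), (1, [4.0, 1.0])], protos)
# Queries that coincide with the wrong prototype give a large loss.
bad = episode_loss([(1, [0.0, 1.0]), (0, [4.0, 1.0])], protos)
```

In the toy episode, the loss is close to zero when each query sits on its own class prototype and grows to roughly the squared distance between the prototypes when the queries are swapped.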
3.2 Unsupervised Contrastive Learning
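Let $\mathcal{D}_u = \{x_i\}$ be a training dataset for unsupervised learning, which consists of unlabeled samples $x_i$. The goal of unsupervised learning is to train networks without any manually attached labels.
As an example of unsupervised learning, we show the training process of SimCLR [5]. SimCLR maximizes the similarity between representations of two augmented samples $a(x)$ and $a'(x)$, where $a$ and $a'$ are two randomly selected augmentation functions. Its loss for parameter updates is computed as follows:
1. Sample a mini-batch $\mathcal{B} = \{x_i\}_{i=1}^{N}$ from $\mathcal{D}_u$.
2. Extract the first representation by
$$z_{i1} = f_\theta(a_i(x_i)). \tag{4}$$
Note that $a_i$ is randomly selected from a set of augmentation functions $\mathcal{A}$ for each $x_i$.
3. Extract the second representation by
$$z_{i2} = f_\theta(a'_i(x_i)). \tag{5}$$
4. From a representation batch $\mathcal{Z} = \{z_{i1}, z_{i2}\}_{i=1}^{N}$, compute the NT-Xent loss [5] defined by
$$L = \frac{1}{2N} \sum_{i=1}^{N} \left( l_{i1} + l_{i2} \right), \tag{6}$$
$$l_{i1} = -\log \frac{S(z_{i1}, z_{i2})}{\sum_{k=1}^{N} S(z_{i1}, z_{k2}) + \sum_{k \neq i} S(z_{i1}, z_{k1})}, \tag{7}$$
$$l_{i2} = -\log \frac{S(z_{i2}, z_{i1})}{\sum_{k=1}^{N} S(z_{i2}, z_{k1}) + \sum_{k \neq i} S(z_{i2}, z_{k2})}. \tag{8}$$
Here $S(z, z') = \exp(\mathrm{sim}(g_\phi(z), g_\phi(z')) / \tau)$ is the exponential of similarity between representations $z, z'$, where $\mathrm{sim}$ is the cosine similarity, $g_\phi$ is a fully connected layer with a parameter $\phi$, and $\tau$ is a hyper-parameter.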
We note that by omitting the second summation in the denominator of Eq. (7) or (8) we obtain Eq. (3). This opens a way to bridge the two losses for supervised metric learning and unsupervised contrastive learning.
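The NT-Xent computation described in this section can be sketched in plain Python, following the loss as defined for SimCLR. The sketch rests on a few assumptions: the inputs are already-projected embeddings (the fully connected projection layer is omitted), the similarity is cosine similarity, and the function names and the default temperature of 0.5 are chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(first, second, temperature=0.5):
    """first[i] and second[i] are embeddings of two augmentations of sample i."""
    n = len(first)
    z = first + second  # 2N views; view k and view (k + N) % 2N are a positive pair
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # The denominator covers every other view: the other augmentation
        # branch and the remaining views of the same branch.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_norm - pos_logit
    return total / (2 * n)


# Two samples whose augmented views agree, then the same views mismatched.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Restricting the list comprehension to views from the other augmentation branch only would drop the same-branch terms from the denominator, which is the simplification that gives the loss the same form as the episode loss.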
4 Proposed Method
This section presents 1) Generalized contrastive loss (GCL) and 2) GCL for semi-supervised learning. GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and thus it naturally works as a loss function for semi-supervised learning.
4.1 Generalized Contrastive Loss
is a fourth-order affinity tensor, $[\cdot]_+$ denotes the Macaulay brackets, i.e., the ramp function $[x]_+ = \max(x, 0)$, and is a constant to avoid a division by zero. Note that a positive value for predisposes two representations and to be close to each other, whereas a negative value does the opposite. The episode loss can be viewed as a special case of GCL when is made from a mini-batch of labeled samples via prototypes, as shown in Sec. 3.1, and the affinity tensor is defined by
Note that is the category index and is the sample index in this case.
The NT-Xent loss can also be viewed as a special case of GCL when is made from a mini-batch of unlabeled samples using augmentation, as shown in Sec. 3.2, and the affinity tensor is defined by
Note that is the sample index and is the augmentation type index in this case.
4.2 GCL for Semi-Supervised Learning
In semi-supervised learning, a training dataset includes both labeled and unlabeled samples. Thus, a mini-batch is given by a pair of a set of labeled samples and a set of unlabeled samples . To apply GCL to , its representation batch is constructed by , where
is a representation batch of given from a supervised metric learning method and
is a representation batch of given from an unsupervised contrastive learning method, as shown in Figure 1.
The GCL for semi-supervised learning is then defined on by
where denote labeled or unlabeled samples. Note that affinity tensor becomes a sixth-order tensor to predispose similarity between and to be close or far.
Here, we provide an example definition of for semi-supervised learning. Compared with NT-Xent loss, we relax the affinity between unlabeled samples because some labeled samples are available for training.
This definition is effective for semi-supervised learning for speaker verification, where labeled utterances are from a pre-defined set of speakers and unlabeled utterances are from another (different) set of unknown speakers. For the similarity measure, we use . This definition is used in .
5.1 Experimental Settings
We used the VoxCeleb dataset [23, 24] for evaluating our proposed framework. The training set (voxceleb_2_dev) consists of 1,092,009 utterances of 5,994 speakers. The test set (voxceleb_1_test) consists of 37,611 enrollment-test utterance pairs. The equal error rate (EER) was used as an evaluation measure.
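For readers unfamiliar with the metric, the EER is the error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of one way to estimate it from trial scores by sweeping thresholds; the function name and the toy scores are hypothetical, and evaluation toolkits typically interpolate the error curves rather than picking the closest threshold.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from verification trials.

    scores: similarity score per trial (higher means more likely same speaker).
    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    impostor = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    # Sweep every observed score as a threshold, plus one that rejects all trials.
    for threshold in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list: four same-speaker and four different-speaker pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)  # 0.25 for this toy list
```

In the toy list, one impostor pair scores above one genuine pair, so at the balancing threshold one of four impostor trials is accepted and one of four genuine trials is rejected.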
For semi-supervised learning experiments, we randomly selected a subset of speakers from the set of 5,994. We used their labeled samples and the remaining unlabeled samples for training. This is the same evaluation setting as proposed in [21]. For unsupervised learning experiments, we did not use speaker labels. This evaluation setting is more difficult than the cross-modal self-supervised setting in [19] because we did not use videos (face images) for training. For supervised learning experiments, we used all labeled samples for training. This is the official evaluation setting on the VoxCeleb dataset.
We used the ResNet18 convolutional network with an input of 40-dimensional filter bank features. For data augmentation to construct a representation batch from unlabeled samples, we used four Kaldi data augmentation schemes with the MUSAN (noise, music, and babble) and the RIR (room impulse response) datasets. For semi-supervised learning, 10 % of samples in each mini-batch were unlabeled and the others were labeled.
|Method|Training Scenario|Additional Data/Model|EER (%)|
|---|---|---|---|
|SSL embedding [21]|Semi-supervised|Speech recognition|6.31|
|Cross-modal [19]|Unsupervised|Video (face images)|20.09|
Table 1 summarizes EERs for semi-supervised, unsupervised, and supervised learning settings. The results demonstrate that GCL enables the learning of speaker embeddings in the three different settings without any changes in the definition of the loss function.
For semi-supervised learning experiments, we compared the results with those of [21] by using the same number of labeled speakers. The results show that our framework achieves comparable performance. Note that the method in [21] uses an automatic speech recognition model pre-trained on another dataset, but we did not use such pre-trained models. Comparison with a supervised learning method is shown in Figure 2. We see that adding unlabeled utterances improved the performance, in particular when the number of available labeled utterances was small.
For unsupervised learning experiments, our method outperformed the cross-modal self-supervised method in [19]. Note that our method did not use any visual information, such as face images, for supervision. Audio-visual unsupervised learning with our framework is promising as a next step.
For supervised learning experiments, our method achieves a 2.56% EER without using data augmentation. However, there is still room for improvement, because training the same network with Softmax and AM-Softmax losses (training with Softmax and fine-tuning with AM-Softmax) achieves a 1.81% EER. Introducing a more effective network structure, such as ECAPA-TDNN [26] or AutoSpeech-NAS [25], to our framework would also be interesting as future work.
This paper proposed a semi-supervised contrastive learning framework with GCL. We showed via experiments on the VoxCeleb dataset that the proposed GCL enables a network to learn speaker embeddings in three manners, namely, supervised learning, semi-supervised learning, and unsupervised learning. Furthermore, this was accomplished without making any changes to the definition of the loss function.
This work was partially supported by the Japan Science and Technology Agency, ACT-X Grant JPMJAX1905, and the Japan Society for the Promotion of Science, KAKENHI Grant 19K22865.
- [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [2] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition (SIMBAD), pages 84–92, 2015.
- [3] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.
- [4] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2020.
- [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
- [6] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1742, 2006.
- [7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 4690–4699, 2019.
- [8] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5265–5274, 2018.
- [9] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 212–220, 2017.
- [10] Yutong Zheng, Dipan K. Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5089–5097, 2018.
- [11] Yi Liu, Liang He, and Jia Liu. Large margin softmax loss for speaker verification. In Proceedings of Interspeech, 2019.
- [12] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pages 129–137, 1982.
- [13] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 69–84, 2016.
- [14] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- [15] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- [16] Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 15509–15519, 2019.
- [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
- [18] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
- [19] Arsha Nagrani, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. Disentangled speech embeddings using cross-modal self-supervision. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6829–6833, 2020.
- [20] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1163–1171, 2016.
- [21] Themos Stafylakis, Johan Rohdin, Oldrich Plchot, Petr Mizera, and Lukas Burget. Self-supervised speaker embeddings. In Proceedings of Interspeech, 2019.
- [22] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883, 2018.
- [23] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of Interspeech, 2017.
- [24] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep speaker recognition. In Proceedings of Interspeech, 2018.
- [25] Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, and Zhangyang Wang. AutoSpeech: Neural architecture search for speaker recognition. arXiv preprint arXiv:2005.03215, 2020.
- [26] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143, 2020.
The complete form of the proposed GCL is defined over a representation batch by
where is an affinity tensor, is the similarity between and given an affinity value , and is a normalization or clipping function.
Table 2 summarizes how to obtain popular loss functions from GCL. We hope this provides an overview of recent progress and helps other researchers develop new unsupervised, semi-supervised, and supervised learning methods.
6.1 Affinity Type
Four types of affinity tensor definitions are used in Table 2. With all of them, a positive value for predisposes two representations and to be close to each other, whereas a negative value does the opposite. The density of increases in the order of Types 1 to 4, as shown in Figure 3. Definitions of the types are given below. Note that is assumed for simplicity.
Type 1 makes pairs and its output is if two samples are from the same class (i.e., ) and if two samples are from different classes (i.e., ). An example definition of this type is given by
Type 2 makes triplets where . With respect to an anchor , is marked as positive and is marked as negative. An example definition of this type is given by
Type 3 makes -tuples . With respect to an anchor , is marked as positive and all the others are marked as negative. The definition of this type is given by
Type 4 makes -tuples . With respect to an anchor , is marked as positive and all the others are marked as negative. The definition of this type is given by
6.2 Representation Batch
Table 2 gives three types of definition for the representation batch .
Labeled: With labeled samples for supervised learning, denotes the -th representation from class . A representation is defined by sample representation or a statistical representation, such as a mean representation (prototype) computed from some samples in , specifically, . Here, is a mini-batch of labeled samples.
Labeled+weights: This type uses parameters as prototypes, where is a representation from class and is a weight parameter for class .
Unlabeled: With unlabeled samples for unsupervised learning, denotes the representation of the -th sample with the -th augmentation. With this type, prototypes can also be introduced in the same way as prototypes are constructed for labeled samples, that is, by taking the mean of representations from more than one augmentation function.