
Disentangled Speaker Representation Learning via Mutual Information Minimization

The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework that separates speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To minimize the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), an upper-bound estimator of MI. Our framework has a 3-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on the Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate that our proposed framework is effective for disentanglement. Also, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder with the VoxCeleb datasets. We then fine-tuned the speaker embedding model in the disentanglement framework with the FFSVC2022 dataset. The experimental results show that fine-tuning an existing pre-trained model within the disentanglement framework is valid and can further improve performance.



1 Introduction

Speaker verification is the task of determining whether the input speech is spoken by the claimed speaker or not [1]. The general speaker verification framework consists of an embedding extraction step and a scoring step. In the embedding extraction step, audio with variable duration is converted into a single fixed-dimensional vector representation called a speaker embedding, which is assumed to contain speaker-relevant information. With a sophisticated speaker embedding, even a simple scoring method such as cosine similarity or Euclidean distance has shown high speaker verification performance [2, 3, 4]. Therefore, most studies have focused on how to extract a good speaker embedding from input speech.
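As a small illustration of the scoring step, the sketch below scores a trial with cosine similarity between an enrollment embedding and a test embedding. The function names and the decision threshold of 0.5 are hypothetical choices for this example and do not come from the paper; in practice the threshold is tuned on development trials.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enroll_emb, test_emb, threshold=0.5):
    """Score a trial and accept it when the score reaches the threshold.

    The threshold value is a hypothetical placeholder.
    """
    score = cosine_similarity(enroll_emb, test_emb)
    return score, score >= threshold
```

Because the score depends only on the angle between the two vectors, rescaling an embedding does not change the decision.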

With the development of deep learning, various studies have used neural networks to extract speaker representations, called deep speaker embeddings, which reflect the speaker’s characteristics well [5, 6, 7]. Despite the success of deep speaker embedding methods, performance still degrades in mismatched conditions (e.g., device, noise, language). To solve this problem, there has been a demand for robust speaker embeddings that are unaffected by the domain mismatch due to speaker-irrelevant factors.

Traditionally, data augmentation is the most common approach for training neural networks robust to domain mismatch. For speaker verification, simulated reverberation [8], additive noise [9], and SpecAugment [10] can be good options for data augmentation to increase the number of acoustic environments that might be encountered in the inference phase [11, 12]. While these methods are proven to be effective when there are insufficient data on target conditions, they can only indirectly mitigate the domain mismatch problem.

Unlike the methods described above, various studies have attempted to disentangle speaker-irrelevant variability from the speaker embedding directly. Recently, adversarial learning-based domain adaptation methods have been studied. The works in [13, 14, 15] utilized a gradient reversal layer (GRL) to prevent the speaker embedding network from learning the information needed for the sub-task (i.e., noise classification). Although gradient reversal techniques have proven effective for performance improvement, training a network with GRL is known to be unstable and sensitive to the hyper-parameter setting. As an alternative to GRL, domain adversarial training similar to the generative adversarial network (GAN) framework was exploited to maximize the error on the sub-task [16, 17]. However, these domain adaptation methods have the limitation that adaptation is applied to the feature space shared by both speaker-relevant and speaker-irrelevant factors. Therefore, the speaker embedding is inherently hindered by speaker-independent factors. Also, adversarial training is known to be difficult and unstable [18].

Alternatively, there have been several approaches that minimize the correlation between speaker and speaker-independent embeddings in distinct spaces. For instance, the joint factor embedding (JFE) [18] framework simultaneously extracts speaker and nuisance (i.e., non-speaker) embeddings and maximizes the entropy (or uncertainty) on their opposite task, while minimizing the correlation between the two embeddings using the mean absolute Pearson’s correlation (MAPC) computed batch-wise. Similarly, [19, 20] divided features into speaker and residual embeddings and increased their uncertainty on the contrary task, and [21] minimized mutual information via the mutual information neural estimator (MINE) with GRL. Additionally, these works adopted an autoencoder framework for training a merged embedding to maintain the complete information of the input speech [19, 20, 21]. However, naively increasing uncertainty on the other task does not guarantee disentanglement.

For learning disentangled representations, mutual information (MI) minimization has gained considerable interest in various machine learning tasks [22, 23, 24]. Since the exact computation of MI in high-dimensional space is intractable when only samples are available, several prominent MI estimators have been proposed [25, 26, 27, 28]. Among them, the contrastive log-ratio upper bound (CLUB) [25] is an MI upper bound estimator that uses the difference of conditional probabilities between positive and negative sample pairs in a contrastive learning manner. As our goal is to learn disentangled speaker embeddings, we utilize CLUB to explicitly reduce the interdependence between opposite latent representations.

In this work, we propose an effective learning framework for disentangled speaker representation via MI minimization. To learn a speaker embedding that is not only soundly disentangled but also has high speaker discrimination ability, we construct a 3-stage structure: the Front-end Encoder Network, the Decoupling Block, and the Classifier and MI Estimator parts. Through this framework, we explicitly learn disentangled representations and obtain a practically good speaker embedding.

The rest of this paper is organized as follows: Section 2 describes the MI estimation and CLUB, and Section 3 presents the proposed framework. Then, the experiments and results are addressed in Sections 4 and 5, respectively. Finally, we conclude in Section 6.

2 Mutual Information Upper Bound Estimation

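Mutual information (MI) is a quantity that measures the amount of dependency between two random variables. For two continuous random variables $x$ and $y$, MI is defined as follows:

$$I(x; y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]$$

where $p(x, y)$ is the joint distribution, and $p(x)$ and $p(y)$ denote the marginal distributions.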

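Since our goal is to learn disentangled representations, MI minimization between two random variables is required. Therefore, we focus on the contrastive log-ratio upper bound (CLUB) [25], an MI upper bound estimator. For two given random variables $x$ and $y$, CLUB is formulated as follows:

$$I_{\mathrm{CLUB}}(x; y) = \mathbb{E}_{p(x, y)}\left[\log p(y \mid x)\right] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}\left[\log p(y \mid x)\right]$$

As the conditional distribution $p(y \mid x)$ is intractable in our framework, we approximate it using a variational distribution $q_\theta(y \mid x)$ with parameters $\theta$. In practice, a variational CLUB (vCLUB) is obtained as follows:

$$I_{\mathrm{vCLUB}}(x; y) = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(y_i \mid x_i) - \frac{1}{N}\sum_{j=1}^{N}\log q_\theta(y_j \mid x_i)\right]$$

where $\{(x_i, y_i)\}_{i=1}^{N}$ are sample pairs drawn from the joint distribution $p(x, y)$. vCLUB is no longer guaranteed to be an MI upper bound since we approximate $p(y \mid x)$ by $q_\theta(y \mid x)$. However, if the Kullback–Leibler (KL) divergence between the conditional and variational distributions is small enough, it can be a reliable MI upper bound estimator. Let $q_\theta(x, y) = q_\theta(y \mid x)\,p(x)$ be the variational joint distribution; then the KL divergence between $p(x, y)$ and $q_\theta(x, y)$ is as follows:

$$\mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)\,p(x)}{q_\theta(y \mid x)\,p(x)}\right] = \mathbb{E}_{p(x, y)}\left[\log p(y \mid x)\right] - \mathbb{E}_{p(x, y)}\left[\log q_\theta(y \mid x)\right]$$

where the first term does not depend on $\theta$. Consequently, minimizing $\mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right)$ is equivalent to maximizing $\mathbb{E}_{p(x, y)}\left[\log q_\theta(y \mid x)\right]$ with respect to $\theta$. We train the variational network $q_\theta$ by minimizing the negative log-likelihood loss function as follows:

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(y_i \mid x_i)$$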
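The sample-based vCLUB estimate described in this section can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the variational family `make_log_q` (a one-dimensional Gaussian whose mean is `slope * x`) and the synthetic data are hypothetical, and the variational parameters are fixed by hand instead of being trained.

```python
import math
import random


def gaussian_log_pdf(y, mean, var):
    """Log-density of a one-dimensional Gaussian N(y; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def vclub(xs, ys, log_q):
    """Sample-based vCLUB.

    For each x_i, compare the log-likelihood of its paired y_i (positive pair)
    with the average log-likelihood of every y_j in the batch (negative pairs).
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        positive = log_q(ys[i], xs[i])
        negative = sum(log_q(ys[j], xs[i]) for j in range(n)) / n
        total += positive - negative
    return total / n


def make_log_q(slope, var):
    """Hypothetical variational family: q(y | x) = N(y; slope * x, var)."""
    return lambda y, x: gaussian_log_pdf(y, slope * x, var)


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(500)]
dependent = [x + 0.1 * random.gauss(0.0, 1.0) for x in xs]  # y depends on x
independent = [random.gauss(0.0, 1.0) for _ in xs]           # y ignores x

est_dep = vclub(xs, dependent, make_log_q(1.0, 0.01))
est_ind = vclub(xs, independent, make_log_q(0.0, 1.0))
```

When the variational distribution ignores `x`, the positive and negative terms cancel and the estimate is zero up to rounding. For the dependent pair the estimate is positive and lies above the true MI of this toy model, `0.5 * log(101)`, which is the behavior expected of an upper bound.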

Figure 1: The overall training framework for disentangled speaker and device embeddings, comprising the front-end encoder, the decoupling block, the speaker and device classifiers, and three MI estimators with their variational networks.

3 Proposed Framework

In this work, our proposed framework is constructed under the Far-Field Speaker Verification Challenge 2022 (FFSVC 2022) [29] scenario to explore more practical cases. FFSVC 2022 provides a far-field dataset collected from 155 real speakers in complex environments with multiple conditions. In particular, the datasets of FFSVC 2022 consist of noisy speech samples recorded under far-field conditions with different devices (i.e., tablet, telephone, and microphone array). In these settings, we learn disentangled speaker and device representations. As shown in Figure 1, the overall proposed framework is composed of three parts: the Front-end Encoder Network (Section 3.1), the Decoupling Block (Section 3.2), and the Classifier and MI Estimator (Section 3.3).

3.1 Front-end Encoder Network

Given a sequence of frame-level acoustic feature vectors, the front-end encoder network extracts an utterance-level initial embedding x. To efficiently capture global and local information, we adopt the multi-scale feature aggregation conformer (MFA-Conformer) [30] backbone and the channel and context-dependent statistic pooling [12] for the front-end network.

If the proposed system is trained from scratch using a dataset with a limited number of speakers (i.e., the FFSVC2022 training set), our disentanglement framework may not work properly (discussed in Section 5). To address this, we first force the initial embedding to obtain sufficient speaker discrimination ability by pre-training the front-end encoder with a large-scale dataset that includes many different speakers but no device labels. Then we fine-tune the whole network with the dataset containing device labels but limited speakers to effectively focus on disentangling the speaker and device factors latent in the initial embedding.

3.2 Decoupling Block

To explicitly divide the initial embedding extracted from the front-end encoder into latent speaker and device representations, we deploy the decoupling block, as shown in Figure 2. The speaker and device embeddings are obtained via multi-layer perceptron (MLP) modules in the decoupling block. Each MLP module sequentially comprises a fully-connected (FC) layer, a batch-normalization (BN) layer, and a rectified linear unit (ReLU) activation function. Two fixed-dimensional embedding vectors, the speaker embedding and the device embedding, are learned to represent the input speech’s speaker and device characteristics, respectively. For the evaluation, the speaker embeddings are extracted, and the similarities are calculated to perform the verification.

3.3 Classifier and MI Estimator

Analogous to the previous disentanglement approaches [17, 18, 22, 24, 31], we follow a multitask learning strategy that includes the classification and MI minimization-based disentanglement tasks. As shown in Figure 1, the classification task consists of the speaker and device classifiers. For the MI minimization-based disentanglement task, there are three MI estimators.

Speaker classifier: To force the speaker embeddings to discriminate their speaker labels, we adopt the combination of the additive angular margin (AAM) softmax [32] and the angular prototypical (AP) loss [33], which has shown strong performance in this field [4, 34]. Given $N$ pairs of speaker embeddings and labels $y_i$, the speaker classification loss function is formulated as follows:

$$\mathcal{L}_{\mathrm{spk}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\alpha\cos(\theta_{y_i,i}+m)}}{e^{\alpha\cos(\theta_{y_i,i}+m)}+\sum_{j\neq y_i}e^{\alpha\cos\theta_{j,i}}}-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{S_{i,i}}}{\sum_{k=1}^{N}e^{S_{i,k}}}$$

where $N$ is the batch size, $\alpha$ is a scale factor, $m$ is a margin, $\cos\theta_{j,i}$ is the normalized dot product between the $j$-th class weight of the speaker classifier and the $i$-th speaker embedding, and $S_{i,k}$ denotes the cosine similarity between an utterance of the $i$-th speaker and a different utterance of the $k$-th speaker in the batch, so that $S_{i,i}$ compares two different utterances of the $i$-th speaker.
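To make the AAM-softmax term concrete, the following is a minimal pure-Python sketch. It is a simplified illustration and not the authors' code: the function names are hypothetical, the default scale of 30 and margin of 0.2 follow the values reported in the implementation details, and the usual safeguards for angles whose sum with the margin exceeds pi are omitted.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def aam_softmax_loss(embeddings, labels, class_weights, scale=30.0, margin=0.2):
    """Mean AAM-softmax loss over a batch.

    The margin is added to the angle between each embedding and the weight
    vector of its true class before the scaled softmax cross-entropy.
    """
    normalized_weights = [l2_normalize(w) for w in class_weights]
    total = 0.0
    for embedding, label in zip(embeddings, labels):
        unit = l2_normalize(embedding)
        logits = []
        for j, weight in enumerate(normalized_weights):
            cosine = sum(a * b for a, b in zip(unit, weight))
            cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding
            if j == label:
                cosine = math.cos(math.acos(cosine) + margin)
            logits.append(scale * cosine)
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[label]
    return total / len(embeddings)
```

With the margin set to zero the function reduces to a scaled cosine softmax, so the margin can be read as an extra penalty that the true class has to overcome.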

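Device classifier: As in the speaker classifier, the device embeddings are trained to identify their device labels. The device classification loss is defined as AAM softmax:

$$\mathcal{L}_{\mathrm{dev}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\alpha\cos(\vartheta_{c_i,i}+m)}}{e^{\alpha\cos(\vartheta_{c_i,i}+m)}+\sum_{j\neq c_i}e^{\alpha\cos\vartheta_{j,i}}}$$

where $N$ is the batch size, $\alpha$ is a scale factor, $m$ is a margin, $c_i$ is the device label of the $i$-th sample, and $\cos\vartheta_{j,i}$ is the normalized dot product between the $j$-th class weight of the device classifier and the $i$-th device embedding.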


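Figure 2: Front-end encoder and decoupling block networks.

MI estimator between embeddings: To minimize the MI between speaker and device embeddings, we adopt the variational CLUB estimator, which calculates the MI upper bound via the difference of variational distributions between positive and negative sample pairs. The MI upper bound between the speaker embedding $z_s$ and the device embedding $z_d$ is estimated as:

$$\hat{I}(z_s; z_d) = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(z_{d,i} \mid z_{s,i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_\theta(z_{d,j} \mid z_{s,i})\right]$$

where $q_\theta(z_d \mid z_s)$ is the variational network with trainable parameters $\theta$ for approximating $p(z_d \mid z_s)$, i.e., representing $\mathcal{N}\left(z_d;\, \mu(z_s),\, \mathrm{diag}(\sigma^2(z_s))\right)$. The variational distribution is estimated via the isotropic Gaussian with a diagonal covariance matrix, as shown in Figure 3 (left network). $\mu$ and $\sigma$ are obtained via the last two MLP layers of $q_\theta$. The parameters of the variational network are optimized independently of the parameters of the main networks by minimizing the following negative log-likelihood:

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(z_{d,i} \mid z_{s,i})$$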
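The negative log-likelihood of a Gaussian variational network with diagonal covariance can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the predicted means and log-variances are passed in as plain lists instead of being produced by a trained network, and parameterizing the variance through its logarithm is an assumption made here for numerical convenience.

```python
import math


def diag_gaussian_log_prob(target, mean, log_var):
    """Log-density of one vector under N(mean, diag(exp(log_var)))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
        for t, m, lv in zip(target, mean, log_var)
    )


def variational_nll(targets, means, log_vars):
    """Batch-averaged negative log-likelihood used to fit the variational network."""
    log_probs = [
        diag_gaussian_log_prob(t, m, lv)
        for t, m, lv in zip(targets, means, log_vars)
    ]
    return -sum(log_probs) / len(log_probs)
```

Here `targets` would hold the embeddings being predicted and `means` and `log_vars` the two outputs of the variational network for the conditioning embeddings. Minimizing this quantity over the network parameters is the maximum-likelihood fit that keeps the variational distribution close to the true conditional.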


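MI estimators between embeddings and labels: To reduce the interdependence between embeddings and labels, two further estimators compute the MI upper bounds between the speaker embedding $z_s$ and the device label $y_d$, and between the device embedding $z_d$ and the speaker label $y_s$, respectively, through the variational CLUB as follows:

$$\hat{I}(z_s; y_d) = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\phi(y_{d,i} \mid z_{s,i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_\phi(y_{d,j} \mid z_{s,i})\right]$$

$$\hat{I}(z_d; y_s) = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\psi(y_{s,i} \mid z_{d,i}) - \frac{1}{N}\sum_{j=1}^{N}\log q_\psi(y_{s,j} \mid z_{d,i})\right]$$

where $q_\phi$ and $q_\psi$ are the variational networks with trainable parameters $\phi$ and $\psi$, respectively, as shown in Figure 3 (right network). The output of each network is a softmax activation that approximates the corresponding conditional label distribution, $p(y_d \mid z_s)$ or $p(y_s \mid z_d)$. The variational parameters $\phi$ and $\psi$ are optimized by minimizing the corresponding negative log-likelihoods, in the same way as in the MI estimator between the two embeddings.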

3.4 Total Objective Function

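Finally, the main networks (i.e., the front-end encoder, the decoupling block, and the speaker and device classifiers) are jointly trained with the following total objective function:

$$\mathcal{L}_{\mathrm{total}} = \lambda_1\mathcal{L}_{\mathrm{spk}} + \lambda_2\mathcal{L}_{\mathrm{dev}} + \lambda_3\hat{I}(z_s; z_d) + \lambda_4\hat{I}(z_s; y_d) + \lambda_5\hat{I}(z_d; y_s)$$

where $\mathcal{L}_{\mathrm{spk}}$ and $\mathcal{L}_{\mathrm{dev}}$ are the speaker and device classification losses, $\hat{I}(\cdot\,;\cdot)$ are the vCLUB estimates of the MI among the speaker embedding $z_s$, the device embedding $z_d$, the speaker label $y_s$, and the device label $y_d$, and $\lambda_1, \ldots, \lambda_5$ are weighting factors to balance each loss term. Algorithm 1 summarizes the overall disentangled representation learning framework, including the optimizer, the learning rate, and the number of updates for the variational networks per epoch. The main and variational networks are updated alternately.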

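Figure 3: Variational networks. The left network is the Gaussian variational network used to estimate the MI between the speaker and device embeddings, and the right network is the softmax-output variational network used to estimate the MI between an embedding and a label.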
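Algorithm 1 Overall Disentangled Speaker and Device Representations Learning Framework.
The front-end encoder is initialized with the pre-trained front-end encoder.
for each epoch do
        // Variational networks update
        for each variational-network update step do
               for each variational network do
                      update its parameters by minimizing its negative log-likelihood
               end for
        end for
        // Main networks update
        for each mini-batch do
               update the main networks by minimizing the total objective function
        end for
end for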
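| Pre-training dataset (front-end encoder) | Fine-tuning dataset (whole network) | Objective function | EER (%) | MinDCF |
|---|---|---|---|---|
| VoxCeleb | - | Only using initial embedding from pre-trained front-end encoder | 12.09 | 0.722 |
| - | FFSVC2022 | JFE [18] | 11.98 | 0.688 |
| VoxCeleb | FFSVC2022 | JFE [18] | 7.02 | 0.460 |
| - | FFSVC2022 | Speaker classification loss | 11.83 | 0.668 |
| VoxCeleb | FFSVC2022 | Speaker classification loss | 7.08 | 0.468 |
| - | FFSVC2022 | Speaker and device classification losses | 12.20 | 0.690 |
| VoxCeleb | FFSVC2022 | Speaker and device classification losses | 7.15 | 0.473 |
| - | FFSVC2022 | Classification losses and MI between speaker and device embeddings | 12.06 | 0.718 |
| VoxCeleb | FFSVC2022 | Classification losses and MI between speaker and device embeddings | 7.03 | 0.467 |
| - | FFSVC2022 | Classification losses and MIs between embeddings and labels | 12.00 | 0.703 |
| VoxCeleb | FFSVC2022 | Classification losses and MIs between embeddings and labels | 6.99 | 0.461 |
| - | FFSVC2022 | Total objective function | 11.95 | 0.684 |
| VoxCeleb | FFSVC2022 | Total objective function | 6.95 | 0.450 |

Table 1: Speaker verification performances (EER and MinDCF on the development set) on the FFSVC 2022 development trial protocol. The JFE [18] results are from our re-implementation.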

4 Experiments

4.1 Datasets

To pre-train the front-end encoder network, we employ the development sets of the VoxCeleb2 and VoxCeleb1 datasets [35, 36, 37], which consist of 1,092,009 and 148,642 utterances from 5,994 and 1,211 speakers, respectively. The VoxCeleb dataset is one of the most popular corpora for large-scale text-independent speaker verification. The speech samples were extracted from YouTube video clips and are degraded by real-world noises, including background chatter, laughter, overlapping speech, and room acoustics. The front-end encoder network was trained in a fully supervised manner with the speaker classifier.

When fine-tuning the whole network with the pre-trained front-end encoder, we use the FFSVC2022 training dataset, which is composed of the training, development, and supplementary sets of the FFSVC 2020 challenge [38]. The FFSVC2022 training dataset contains a total of 2,548,351 utterances from 155 speakers, of which we only utilize samples longer than 1 second (i.e., 2,542,392 utterances). The FFSVC2022 dataset was collected with four recording devices (i.e., iPhone, Android phone, iPad, and normal/circular microphone array) at six different distances (i.e., 0 m, 25 cm, 1 m, 1.5 m, 3 m, and 5 m). For our disentangled representation learning framework, we fine-tuned the whole network using the utterances with corresponding speaker and device labels.

Figure 4: t-SNE visualization of speaker embedding space results. The input utterances are randomly sampled from the FFSVC2022 dataset. The sampled utterances consist of ten speakers (0004, 0008, 0009, 0030, 0050, 0127, 0161, 0183, 0252, 0277) under three recording devices (iPhone-IPH, microphone array-MIC, and iPad-IPD). Speaker embeddings with speaker labels (left) and device labels (right) are learned using (a) only the speaker classification objective function, (b) the JFE objective function [18], and (c) the proposed objective function.

4.2 Evaluation Protocol and Metrics

To evaluate the system performance, we adopt the development trial protocol provided by the FFSVC2022 challenge, which was used to tune hyper-parameters and validate the model performance during the competition period [29]. Since the FFSVC2022 development trial protocol contains speech samples collected from real speakers in multiple environments, we can evaluate the system performance in realistic scenarios with multiple conditions. We report two performance metrics: the equal error rate (EER) and the minimum detection cost function (MinDCF). The EER is the error rate at the operating point where the false alarm rate (FAR) and the false reject rate (FRR) are equal, and the MinDCF is defined as the minimum value of the weighted sum of the FAR and FRR. The MinDCF weights are determined by three parameters: the prior probability of a target trial and the costs of a miss and of a false alarm.
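The two metrics can be computed from a list of trial scores and target/non-target labels, as in the sketch below. It is illustrative only: the function names are hypothetical, thresholds are swept over the observed scores without interpolation, and the cost is normalized by the cost of the best trivial system, which is a common convention assumed here. The three MinDCF parameters are left as required arguments.

```python
def error_rates(scores, labels):
    """Yield (FAR, FRR) for each threshold; a trial is accepted if score >= threshold.

    labels use 1 for target trials and 0 for non-target trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    for threshold in sorted(set(scores)) + [float("inf")]:
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        yield false_accepts / n_nontarget, false_rejects / n_target


def compute_eer(scores, labels):
    """Approximate EER at the threshold where FAR and FRR are closest."""
    far, frr = min(error_rates(scores, labels), key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0


def compute_min_dcf(scores, labels, p_target, c_miss, c_fa):
    """Minimum of the weighted FRR/FAR sum, normalized by the trivial-system cost."""
    costs = (
        c_miss * p_target * frr + c_fa * (1.0 - p_target) * far
        for far, frr in error_rates(scores, labels)
    )
    return min(costs) / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Both functions return fractions, so an EER of 0.0695 corresponds to 6.95%.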

4.3 Model Architectures

For the front-end encoder network, we adopt the MFA-Conformer [30] architecture, which is the multi-scale feature-aggregated encoder for extracting speaker embedding based on the convolution-augmented transformer. We use six conformer layers which consist of the multi-headed self-attention module (MHSA), the convolution module (CM), the feed-forward module (FFM), and the sub-sampling layer (SSL). For the MHSA, the encoder dimension, the number of attention heads, the dropout rate, and the kernel size are set to 256, 4, 0.1, and 15, respectively. For the CM, the kernel size is set to 15. For the FFM, FC layers with the dimension of 2,048 are used. For the SSL, a convolution layer with a sub-sampling rate of 2 is employed. We aggregate the frame-level output features to the 192-dimensional initial embedding x via the channel and context-dependent statistic pooling [12]. In the decoupling block, there are three MLP layers, as shown in Figure 2, where each MLP layer consists of FC-ReLU-BN sequentially. From the outputs of the last two MLP layers, the 192-dimensional speaker and device embeddings are obtained. The dimension of the hidden and last FC layers for the variational network is set to 1,024 and 192, respectively.

4.4 Baseline: Joint Factor Embedding (JFE)

To compare with an existing disentanglement method, we adopt joint factor embedding (JFE) [18]. The JFE framework simultaneously learns speaker and nuisance (device) embeddings, where the cross-entropy on their main tasks is minimized while the entropy on their opposite tasks is maximized. Also, the negative MAPC between the two embeddings is jointly minimized. For our experimental setting, the speaker and device embeddings are optimized using the JFE objective function, a weighted combination of these three terms.
4.5 Implementation Details

We made use of the PyTorch library and conducted experiments using NVIDIA GeForce RTX 3090 GPUs in parallel. During both the pre-training and fine-tuning phases, we randomly cropped each input utterance to a 200-frame segment and then applied MUSAN noises [9] or simulated room impulse responses (RIRs) [8] for data augmentation. If an input utterance was shorter than 200 frames, we duplicated it and randomly selected a 200-frame segment. Acoustic features are 80-dimensional log mel-filterbanks computed with a Hamming window of 25 ms, a hop size of 10 ms, and a 512-point FFT. Mean and variance normalization is applied to the log mel-filterbanks. The AAM-softmax loss function [32] employs a margin of 0.2 and a scale of 30. The AP loss function [33] uses a prototype with one utterance. We adopted the Adam optimizer with a weight decay of 2e-5. For the pre-training phase, we scheduled the learning rate via cosine annealing with warm-up restarts (SGDR) [39] with a cycle size of 25 epochs, a maximum learning rate of 1e-3, and a decreasing rate of 0.8 for two cycles. In the fine-tuning phase, we set the hyper-parameters of the SGDR scheduler to a cycle size of 4 epochs, a maximum learning rate of 1e-5, and a minimum learning rate of 1e-8 for one cycle. The total objective function uses five weighting factors, and the JFE objective function uses three.
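The segment selection described above can be sketched as follows. This is an assumption-laden illustration: `crop_or_repeat` is a hypothetical helper, frames are represented as a plain list, and the number of duplications for short utterances (repeat until at least 200 frames are available) is a guess at a detail the text does not specify.

```python
import random


def crop_or_repeat(frames, segment_len=200, rng=random):
    """Return a random contiguous segment of segment_len frames.

    Utterances shorter than segment_len are duplicated end to end until they
    are long enough, then cropped like any other utterance.
    """
    if not frames:
        raise ValueError("cannot crop an empty utterance")
    repeated = list(frames)
    while len(repeated) < segment_len:
        repeated.extend(frames)
    start = rng.randint(0, len(repeated) - segment_len)
    return repeated[start:start + segment_len]
```

Passing a seeded `random.Random` instance as `rng` makes the cropping reproducible, which can help when comparing training runs.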

5 Results

5.1 Speaker Verification Performance

Table 1 shows the speaker verification performances on the FFSVC 2022 development set. We report the experimental results of seven systems to compare the verification performance of the proposed methods with the baseline and to analyze the effect of each objective function term in the proposed framework, i.e., the speaker classification loss, the device classification loss, and the three MI terms.

In Table 1, the first row shows the result using only the initial embedding x from the pre-trained front-end encoder without fine-tuning. The second row in Table 1 indicates the performance of the JFE baseline described in Section 4.4. The systems from the third to seventh rows in Table 1 show the results using the speaker embeddings fine-tuned with (third row) the speaker classification loss, (fourth row) the multi-task learning of speaker and device classification losses, (fifth row) the multi-task learning including the estimated MI between the speaker and device embeddings, (sixth row) the multi-task learning including the estimated MIs between the embeddings and labels, and (seventh row) the total objective loss. Also, for each system, we report the results of fine-tuning with a randomly initialized front-end encoder from scratch.

As shown in Table 1, where the upper values in each row indicate the performances without pre-training, applying the regularization terms, i.e., multi-task learning and the CLUB estimators, did not significantly improve the speaker verification performance and in some cases even degraded it. However, utilizing the front-end encoder pre-trained on a large-scale dataset without device labels significantly improved the system performance. This shows that the proposed framework can work effectively when the speaker and device factors latent in the shared embedding x are separated after securing sufficient speaker discrimination ability. Comparing the third and fourth rows of Table 1 in the cases using the pre-trained front-end encoder, we observed that multi-task learning alone does not help improve the verification performance. However, jointly employing the MI regularization terms led to a consistent performance improvement, as shown in the fifth, sixth, and seventh rows. Finally, we obtained the best result using the final objective function, achieving an EER of 6.95% and a MinDCF of 0.450 on the FFSVC2022 development trial protocol. These results outperform those of the JFE baseline system in the second row of Table 1.

5.2 Visualization of Speaker Embedding Space

We also investigate the effect of our proposed framework in embedding space by visualizing the speaker representations learned using the three different training strategies, i.e., (a) only speaker classification loss, (b) JFE objective function [18], and (c) the proposed objective function. Figure 4 (a), (b), and (c) show the t-SNE plots of speaker embeddings of ten speakers and three devices. Embedding points are colored by speaker labels in the left parts of Figure 4 while colored by device labels in the right parts.

As shown in the left parts of Figure 4 (a), (b), and (c), the speaker embeddings are well separated between different speakers. However, from the view of the device label in the right parts of Figure 4, the embedding points of different devices are highly overlapped, making it difficult to identify their own color (red, blue, and green). In particular, it is observed that the embedding points in the right part of Figure 4 (c) are more evenly dispersed over different devices compared to those in the right parts of Figure 4 (a) and (b). This shows that the speaker embedding extracted from the proposed framework is well-discriminated in the main task while indistinguishable in the sub-task. Furthermore, the speaker embedding learned via our proposed framework demonstrates a more disentangled visualization result than the speaker embeddings obtained from other training strategies, i.e., only speaker classification loss and JFE objective function.

6 Conclusion

In this paper, we propose a novel framework for disentangling speaker representation from speaker-irrelevant factors in a direct manner. The proposed framework can explicitly reduce the mutual information by minimizing the estimation of its upper bound. Through mutual information minimization, the interdependence of decoupled speaker and device embedding is removed. Experimental results demonstrate that our approach can improve the speaker verification performance by taking advantage of the pre-trained front-end encoder. Also, visualization of speaker embedding space shows that device-dependent factor in speaker embedding is dispersed, from which we can assert that their inter-dependency is lost.


This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-00456, Development of Ultra-high Speech Quality Technology for Remote Multi-speaker Conference System).