The deep learning paradigm aims to describe the world around us by means of a hierarchy of representations that are progressively combined to form representations of higher-level abstractions. These representations should capture intermediate concepts, features, or latent variables that are useful for solving the machine learning tasks of interest. Most commonly, deep neural networks are trained in a supervised way; learning meaningful representations in an unsupervised fashion is more challenging, but could be especially useful in semi-supervised settings.
Several approaches have been proposed for deep unsupervised learning in the last decade. Notable examples are deep autoencoders, Restricted Boltzmann Machines (RBMs), variational autoencoders [9] and, more recently, Generative Adversarial Networks (GANs) [10]. GANs are often used in the context of generative modeling, where they attempt to minimize a measure of discrepancy between a distribution generated by a neural network and the data distribution. Beyond generative modeling, some works have extended this framework to learn features that are invariant to different domains [11] or to noise conditions [12]. Moreover, we recently witnessed some remarkable attempts to learn unsupervised representations by minimizing or maximizing Mutual Information (MI) [13, 14, 15, 16]. This measure is a fundamental quantity for estimating the statistical dependence between random variables and is defined as the Kullback-Leibler (KL) divergence between the joint distribution over these random variables and the product of their marginal distributions. As opposed to other metrics, such as correlation, MI can capture complex non-linear relationships between random variables. MI, however, is difficult to compute directly, especially in high-dimensional spaces. The aforementioned works found that it is possible to maximize or minimize the MI within a framework that closely resembles that of GANs. Additionally, [14] has further shown that it is even possible to explicitly estimate it by exploiting the Donsker-Varadhan bound.
Here we attempt to learn good speaker representations by maximizing the mutual information between two encoded random chunks of speech sampled from the same sentence. Our architecture employs both an encoder, which transforms raw speech samples into a compact feature vector, and a discriminator. The latter is alternately fed with samples from the joint distribution (i.e., two encoded speech vectors from the same sentence) and with encoded samples drawn from the product of marginal distributions (i.e., two vectors coming from different sentences, which likely belong to different speakers). The discriminator is jointly trained with the encoder to maximize the separability of the two distributions. The encoder is based on SincNet [20, 21], which processes the raw input waveforms with learnable band-pass filters based on sinc functions. This neural model is useful for learning speech representations directly from raw waveforms and turned out to be an important component of our unsupervised framework.
The experimental results show that our approach learns useful speaker features, leading to promising results on speaker identification and verification tasks. Our experiments are conducted in both unsupervised and semi-supervised settings and compare different objective functions for the discriminator.
2 Speaker Representations based on MI
The mutual information between two random variables $z_1$ and $z_2$ is defined as follows:

$$MI(z_1, z_2) = \int_{z_1} \int_{z_2} p(z_1, z_2) \log \frac{p(z_1, z_2)}{p(z_1)\,p(z_2)} \, dz_1 \, dz_2 = D_{KL}\big(p(z_1, z_2) \,\|\, p(z_1)\,p(z_2)\big)$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence between the joint distribution $p(z_1, z_2)$ and the product of marginals $p(z_1)\,p(z_2)$. The MI is minimized when the random variables $z_1$ and $z_2$ are statistically independent (i.e., the joint distribution is equal to the product of marginals) and is maximized when the two variables contain the same information (in which case the MI is simply the entropy of any one of the variables).
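The two extreme cases mentioned above can be checked numerically on a toy discrete example. The following is only an illustrative sketch: it uses the discrete counterpart of the definition (a sum instead of an integral), and the function name `mutual_information` and the two toy joint distributions are assumptions made for this example, not part of the proposed system.

```python
import math

def mutual_information(joint):
    """MI (in nats) of a discrete joint distribution given as a 2-D list."""
    p_a = [sum(row) for row in joint]          # marginal of the first variable
    p_b = [sum(col) for col in zip(*joint)]    # marginal of the second variable
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_ab in enumerate(row):
            if p_ab > 0.0:                     # terms with zero mass contribute 0
                mi += p_ab * math.log(p_ab / (p_a[i] * p_b[j]))
    return mi

# Independent variables: the joint equals the product of the marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Identical variables: the two variables always take the same value.
identical = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
entropy_fair_bit = math.log(2.0)               # entropy of one fair binary variable
```

In this toy case `mi_independent` is zero, while `mi_identical` coincides with the entropy of a single fair binary variable, in agreement with the two limiting cases described in the text.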
The system depicted in Fig. 1 aims to derive a compact representation that summarizes speaker identities. The encoder $f_\Theta$, with parameters $\Theta$, is fed by $N$ speech samples and outputs a vector composed of $M$ real values, while the discriminator $g_\Phi$, with parameters $\Phi$, is fed by two speaker representations and outputs a real scalar. We learn the parameters $\Theta$ and $\Phi$ of the encoder and the discriminator such that we maximize the mutual information $MI(z_1, z_2)$:

$$(\hat{\Theta}, \hat{\Phi}) = \arg\max_{\Theta, \Phi} \; MI(z_1, z_2)$$

where the two representations $z_1 = f_\Theta(c_1)$ and $z_2 = f_\Theta(c_2)$ are obtained by encoding the speech chunks $c_1$ and $c_2$, which are randomly sampled from the same sentence. Note that one reliable factor that is shared across chunks within each utterance is the speaker identity (the underlying assumption is that each sentence contains a single speaker). The maximization of $MI(z_1, z_2)$ should thus, hopefully, be able to disentangle this constant factor from the other variables (e.g., phonemes) that characterize the speech signal but are not shared across chunks of the same utterance.
As shown in Alg. 1, the maximization of MI relies on a sampling strategy that draws positive and negative samples from the joint and the product of marginal distributions, respectively. As discussed so far, the positive samples $(z_1, z_2)$ are simply derived by randomly sampling speech chunks from the same sentence. Negative samples $(z_1, z_{rand})$, instead, are obtained by randomly sampling from another utterance, which likely belongs to a different speaker. A set of positive and negative examples is sampled to form a minibatch $X$. The minibatch feeds the discriminator $g_\Phi$, which is jointly trained with the encoder. Given $z_1$, the discriminator has to decide whether its other input ($z_2$ or $z_{rand}$) comes from the same sentence or from a different one (and generally a different speaker). Unlike in the GAN framework, the encoder and the discriminator are not adversarial here but must cooperate to maximize the discrepancy between the joint and the product of marginal distributions. In other words, we play a max-max game rather than a min-max one, which makes it easier to monitor the progress of training (compared to GAN training), simply by tracking the average loss of the discriminator.
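The sampling strategy described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_pairs`, the chunk names `c1`, `c2`, `c_rand`, and the toy "waveforms" are hypothetical, and the encoder is deliberately left out, so the pairs contain raw chunks instead of encoded vectors.

```python
import random

def sample_pairs(utterances, chunk_len, rng):
    """Draw one positive and one negative pair of raw speech chunks.

    `utterances` is a list of waveforms (lists of samples). Each waveform is
    assumed to contain a single speaker and at least `chunk_len` samples.
    """
    def random_chunk(wave):
        start = rng.randrange(len(wave) - chunk_len + 1)
        return wave[start:start + chunk_len]

    # Pick two distinct utterances: the anchor sentence and another sentence.
    anchor_id, other_id = rng.sample(range(len(utterances)), 2)
    c1 = random_chunk(utterances[anchor_id])      # anchor chunk
    c2 = random_chunk(utterances[anchor_id])      # same sentence: positive
    c_rand = random_chunk(utterances[other_id])   # other sentence: negative
    return (c1, c2), (c1, c_rand)

# Toy data: utterance u is a constant signal with value u, so the origin of
# every chunk can be read off from its samples.
rng = random.Random(0)
utterances = [[float(u)] * 50 for u in range(4)]
positive, negative = sample_pairs(utterances, chunk_len=10, rng=rng)
```

In a full system, each chunk would be passed through the encoder before the pair is given to the discriminator. Note that the negative chunk comes from a different sentence, which only likely (not certainly) belongs to a different speaker.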
Different objective functions can be used for the discriminator. The simplest solution, adopted in several previous works, consists of using the standard binary cross-entropy (BCE) loss (the output layer must be based on a sigmoid when using BCE):

$$L(\Theta, \Phi) = \mathbb{E}_{X_p}\big[\log g_\Phi(z_1, z_2)\big] + \mathbb{E}_{X_n}\big[\log\big(1 - g_\Phi(z_1, z_{rand})\big)\big]$$

where $\mathbb{E}_{X_p}$ and $\mathbb{E}_{X_n}$ denote the expectation over positive and negative samples, respectively. Such a metric estimates the Jensen-Shannon divergence between two distributions rather than the KL divergence. Consequently, this loss does not optimize the exact KL-based definition of MI, but a similar divergence between two distributions. Unlike standard MI, this metric is bounded (i.e., its maximum is zero), which makes the convergence of the architecture more numerically stable.
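As a rough illustration of the BCE objective and of its upper bound at zero, the sketch below evaluates it on raw discriminator scores. It assumes that the scores are the pre-sigmoid outputs of the discriminator, so the sigmoid is applied inside the function; the names `softplus` and `bce_objective` and the toy scores are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_objective(pos_scores, neg_scores):
    """Mean log(sigmoid(s)) over positives plus mean log(1 - sigmoid(s)) over negatives.

    Uses log(sigmoid(s)) = -softplus(-s) and log(1 - sigmoid(s)) = -softplus(s).
    The objective is to be maximized and can never exceed zero.
    """
    pos_term = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    neg_term = sum(-softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term + neg_term

# Discriminator at chance level: every score is 0, i.e. sigmoid output 0.5.
chance = bce_objective([0.0, 0.0], [0.0, 0.0])
# Well-separated scores: the objective approaches its maximum of zero.
separated = bce_objective([20.0, 25.0], [-20.0, -30.0])
```

Here `chance` equals $-2\log 2$, while `separated` is slightly below zero, which is the bound referred to in the text.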
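As an alternative, it is possible to directly optimize the MI with the MINE objective [14]:

$$L(\Theta, \Phi) = \mathbb{E}_{X_p}\big[g_\Phi(z_1, z_2)\big] - \log\Big(\mathbb{E}_{X_n}\big[e^{g_\Phi(z_1, z_{rand})}\big]\Big)$$

MINE explicitly estimates MI by exploiting a lower bound based on the Donsker-Varadhan representation of the KL divergence.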
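A minimal sketch of a Donsker-Varadhan style objective, computed from discriminator scores, is given below. It assumes the usual form of the bound (mean score on positive pairs minus the log of the mean exponentiated score on negative pairs); the function names and the toy scores are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def mine_objective(pos_scores, neg_scores):
    """Donsker-Varadhan lower bound on MI, estimated on one minibatch.

    pos_scores: discriminator outputs on pairs from the joint distribution.
    neg_scores: discriminator outputs on pairs from the product of marginals.
    """
    return sum(pos_scores) / len(pos_scores) - log_mean_exp(neg_scores)

uninformative = mine_objective([0.0, 0.0], [0.0, 0.0])   # constant scores
small_gap = mine_objective([2.0], [-1.0])
large_gap = mine_objective([100.0], [-100.0])
```

Unlike the BCE objective, this quantity has no fixed upper limit: it keeps growing as the scores of positive and negative pairs move apart (compare `small_gap` and `large_gap`).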
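The third alternative explored in this work is the Noise Contrastive Estimation (NCE) loss proposed in [15], which is defined as follows:

$$L(\Theta, \Phi) = \mathbb{E}_{X}\Big[g_\Phi(z_1, z_2) - \log \sum_{x_j \in X} e^{g_\Phi(x_j)}\Big]$$

where the minibatch $X$ is composed of a single positive sample and a set of negative samples, and the sum runs over all the samples $x_j$ in $X$. In [15] it is proved that maximizing this loss maximizes a lower bound on MI.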
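The sketch below evaluates a contrastive objective of this kind for a single minibatch. It follows the form commonly used in contrastive predictive coding, where the positive score is compared against the log-sum-exp of all scores in the minibatch; the exact variant used in the experiments may differ, and the function name and toy scores are hypothetical.

```python
import math

def nce_objective(pos_score, neg_scores):
    """Log-probability of picking the positive pair among all pairs in the minibatch.

    pos_score:  discriminator output for the single positive pair.
    neg_scores: discriminator outputs for the negative pairs.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return pos_score - log_sum_exp

# With identical scores the positive pair is picked with probability 1/4.
uniform = nce_objective(0.0, [0.0, 0.0, 0.0])
# With a clearly higher positive score the objective increases.
confident = nce_objective(10.0, [-10.0, -10.0, -10.0])
```

With one positive and three negatives, `uniform` equals $-\log 4$, and the value increases as the positive pair receives a higher score than the negatives.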
All the aforementioned objectives are based on the idea of maximizing a discrepancy between the joint and product of marginal distributions. Nevertheless, such losses might be more or less easy to optimize within the proposed framework.
The unsupervised representations are then used to train a speaker-id classifier in a standard supervised way. Beyond unsupervised learning, this paper explores two semi-supervised variations for learning speaker representations. The first one is based on pre-training the encoder in an unsupervised way and fine-tuning it together with the speaker-id classifier. As an alternative, we jointly train encoder, discriminator, and speaker-id networks from scratch. This way, the gradient computed within the encoder depends not only on the supervised loss but also on the unsupervised objective. The latter approach turned out to be very effective, since the unsupervised gradient acts as a powerful regularizer.
Similarly to [23, 24, 25, 26, 27], we propose to directly process raw waveforms rather than using standard MFCC or FBANK features. These hand-crafted features were originally designed from perceptual evidence, and there are no guarantees that such inputs are optimal for all speech-related tasks. Standard features, in fact, smooth the speech spectrum, possibly hindering the extraction of crucial narrow-band speaker characteristics such as pitch and formants, which are important clues to the speaker identity. To better process raw audio, the encoder is based on SincNet [20, 21], a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass sinc-based filters are directly learned from data, making SincNet suitable for processing high-dimensional audio.
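To make the idea of a filter described only by two cutoff frequencies more concrete, the sketch below builds a band-pass FIR filter as the difference of two ideal low-pass (sinc) filters. It is an illustration under stated assumptions, not the SincNet implementation: the Hamming window, the filter length of 101 taps, the 16 kHz sampling rate, and the 300-3400 Hz band are all choices made for this example, and the cutoffs are fixed here rather than learned.

```python
import math

def sinc(x):
    """Unnormalized sinc: sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def sinc_bandpass(f_low, f_high, length, sample_rate):
    """Band-pass FIR taps: difference of two low-pass sinc filters, Hamming-windowed."""
    f1, f2 = f_low / sample_rate, f_high / sample_rate   # cutoffs in cycles/sample
    center = (length - 1) / 2.0
    taps = []
    for n in range(length):
        t = n - center
        ideal = (2.0 * f2 * sinc(2.0 * math.pi * f2 * t)
                 - 2.0 * f1 * sinc(2.0 * math.pi * f1 * t))
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        taps.append(ideal * window)
    return taps

def gain(taps, freq, sample_rate):
    """Magnitude of the filter's frequency response at `freq` (in Hz)."""
    w = 2.0 * math.pi * freq / sample_rate
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return math.hypot(re, im)

taps = sinc_bandpass(f_low=300.0, f_high=3400.0, length=101, sample_rate=16000.0)
in_band = gain(taps, 1000.0, 16000.0)       # frequency inside the pass band
out_of_band = gain(taps, 6000.0, 16000.0)   # frequency outside the pass band
```

The whole filter is determined by `f_low` and `f_high`, whereas a standard convolutional filter of the same length would have 101 free parameters. This is the reduction in the number of learned parameters that the text refers to.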
3 Related Work
Similarly to this work, other attempts have recently been made to learn unsupervised representations with mutual information. In [13], a GAN that minimizes MI has been proposed for Independent Component Analysis (ICA). The paper introduced a simple strategy to derive samples from the joint and from the product of marginal distributions and proposed to train an encoder and a discriminator to minimize the Jensen-Shannon divergence. A similar approach can be used to maximize MI. In [15], the authors proposed a method called Contrastive Predictive Coding (CPC), which learns representations by predicting the future in a latent space. It uses an autoregressive model optimized with a probabilistic contrastive loss. In [16], the authors introduced DeepInfoMax (DIM), an architecture that learns representations based on both local and high-level global information. In [22], Deep Graph Infomax (DGI) extends this approach to graph-structured data.
This paper employs an encoder-discriminator architecture similar to that of the aforementioned works. To the best of our knowledge, the novelty of our approach is the following. First of all, we propose a strategy to draw positive and negative samples for speaker-id. The difference from other sampling approaches is that DIM and DGI maximize the MI between local and global representations and CPC relies on future predictions, while our method is based on random local sampling only. This strategy has been effective especially when coupled with SincNet [20, 21], which is used here for the first time in an unsupervised framework. Our contribution also extends the previous works by addressing semi-supervised learning, where encoder, discriminator, and speaker-id classifier are jointly trained from scratch. Finally, this paper compares several objective functions for MI optimization in a speech task.
4 Experimental Setup
The proposed method has been evaluated on both speaker identification and verification tasks. In particular, this work considers a challenging but realistic speaker recognition scenario: for all the adopted corpora, we employed only 12-15 seconds of training material for each speaker, and we tested the system performance on short sentences lasting from 2 to 6 seconds. In the spirit of reproducible research, we release the code of SincNet [20, 21, 28] on GitHub (https://github.com/mravanelli/SincNet/). In the following, an overview of the experimental setting is provided.
4.1 Corpora
To provide experimental evidence on datasets with different numbers of speakers, this paper considers the TIMIT (462 spks, train chunk) [29] and Librispeech (2484 spks) [30] corpora. Non-speech intervals at the beginning and end of each sentence were removed. The Librispeech sentences with internal silences lasting more than 125 ms were split into multiple chunks. To address text-independent speaker recognition, the calibration sentences of TIMIT (i.e., the utterances with the same text for all speakers) have been removed. For TIMIT, five sentences for each speaker were used for training, while the remaining three were used for test. For the Librispeech corpus, the speech sentences have been randomly selected to exploit 12-15 seconds of training material for each speaker and test sentences lasting 2-6 seconds.
4.2 DNN Setup
The waveform of each speech sentence was split into chunks of 200 ms (with 10 ms overlap), which were fed into the SincNet encoder. The first layer of the encoder performs sinc-based convolutions, using 80 filters of length samples. The architecture then employs two standard convolutional layers, both using 60 filters of length 5. Layer normalization [31] was used for both the input samples and for all convolutional layers (including the SincNet input layer). Next, two fully-connected layers composed of 2048 and 1024 neurons (normalized with batch normalization [32, 33]) were applied. All hidden layers use leaky-ReLU non-linearities. The parameters of the sinc-layer were initialized using mel-scale cutoff frequencies, while the rest of the network was initialized with the well-known “Glorot” initialization scheme [35]. Both the discriminator and the speaker-id classifier are fed by the encoder output and consist of MLPs based on a single ReLU layer. Frame-level speaker classification was obtained from the speaker-id network by applying a softmax output layer, which provides a set of posterior probabilities over the targeted speakers. A sentence-level classification was derived by averaging the frame predictions and voting for the speaker that maximizes the average posterior. Training used the RMSprop optimizer, with a learning rate, , , and minibatches of size 128. All the hyper-parameters of the architecture were tuned on TIMIT, then inherited for Librispeech as well.
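The sentence-level decision rule described above (averaging the frame-level posteriors and picking the speaker with the highest average) can be sketched as follows. The function name `sentence_prediction` and the toy posteriors are hypothetical and serve only as an illustration.

```python
def sentence_prediction(frame_posteriors):
    """Average frame-level posteriors and return (best_speaker, averaged_posteriors).

    frame_posteriors: list of per-frame posterior vectors, one value per speaker.
    """
    n_frames = len(frame_posteriors)
    n_speakers = len(frame_posteriors[0])
    averaged = [
        sum(frame[k] for frame in frame_posteriors) / n_frames
        for k in range(n_speakers)
    ]
    best_speaker = max(range(n_speakers), key=averaged.__getitem__)
    return best_speaker, averaged

# Three frames, three candidate speakers. The second frame alone would favor
# speaker 1, but the averaged posterior favors speaker 0.
frames = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
speaker, averaged = sentence_prediction(frames)
```

Averaging before taking the maximum lets confident frames outweigh isolated frame-level errors, as in the second frame of this toy example.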
The speaker verification system was derived from the speaker-id neural network using the d-vector technique. The d-vector [36, 37] was extracted from the last hidden layer of the speaker-id network. A speaker-dependent d-vector was computed and stored for each enrollment speaker by L2-normalizing and averaging the d-vectors of the different speech chunks. The cosine distance between enrollment and test d-vectors was then calculated, and a threshold was applied to it to accept or reject the speaker. Ten utterances from impostors were randomly selected for each sentence coming from a genuine speaker. Note that, to assess our approach on a standard open-set speaker verification task, all the enrollment and test utterances were taken from a speaker pool different from that used for training the speaker-id DNN.
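The verification steps described above (L2 normalization, averaging over chunks, cosine distance, thresholding) can be sketched as follows. This is a minimal illustration: the function names, the two-dimensional toy vectors, and the threshold value of 0.5 are hypothetical, and in practice the d-vectors would come from the last hidden layer of the speaker-id network.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def speaker_dvector(chunk_dvectors):
    """L2-normalize the chunk-level d-vectors and average them."""
    normalized = [l2_normalize(v) for v in chunk_dvectors]
    dim = len(normalized[0])
    return [sum(v[k] for v in normalized) / len(normalized) for k in range(dim)]

def cosine_distance(a, b):
    """One minus the cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def accept(enrolled, test, threshold):
    """Accept the claimed identity if the distance is within the threshold."""
    return cosine_distance(enrolled, test) <= threshold

enrolled = speaker_dvector([[1.0, 0.0], [2.0, 0.2]])    # toy enrollment chunks
genuine_trial = accept(enrolled, [3.0, 0.1], threshold=0.5)
impostor_trial = accept(enrolled, [0.0, 1.0], threshold=0.5)
```

In this toy case the first trial is accepted and the second is rejected. Moving the threshold trades false acceptances against false rejections.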
5 Results
In this section, we summarize our experimental activity. To ensure more accurate comparisons, five runs varying the initialization seeds were conducted for each experiment. The results reported in the following thus show the average performance.
5.1 Speaker Identification
Tab. 1 reports the sentence-level classification error rates (CER%) achieved when using binary cross-entropy (BCE), MINE, and Noise Contrastive Estimation (NCE) losses.
The table highlights that the features learned with the proposed approach clearly embed information on the speaker identity, leading to a CER(%) ranging from 2.15% to 1.21% in all the considered settings. The best performance is achieved with the standard binary cross-entropy. Similarly to previous work, we have observed that this bounded metric is more stable and easier to optimize. Both the MINE and NCE objectives are unbounded and their value can grow indefinitely during training, eventually causing numerical issues. This result suggests that a direct optimization of the mutual information is not crucial, and that similar measures that maximize a meaningful divergence between two distributions can be a valid alternative. The table also shows that SincNet outperforms a standard CNN. This confirms the promising achievements obtained in [20, 21] in a standard supervised setting. SincNet, in fact, turns out to converge faster and to a better solution, thanks to the compact sinc filters that make learning from high-dimensional raw samples significantly easier. This achievement highlights the importance of using proper architectures in unsupervised learning.
Tab. 2 extends previous speaker-id results to other training modalities, including supervised and semi-supervised learning.
The table shows that standard supervised learning outperforms a purely unsupervised approach. The internal representations learned in a supervised way, in fact, are likely to be more tuned to the specific task solved by the network. Nevertheless, when we pass from unsupervised to semi-supervised learning, we observe the best results. In particular, the joint semi-supervised framework (i.e., the approach that jointly trains encoder, discriminator, and speaker classifier from scratch) yields the best performance, achieving a CER(%) of 0.65% on TIMIT and 0.59% on Librispeech. The internal representations discovered in this way are influenced by both the supervised and the unsupervised loss. The latter acts as a powerful regularizer, which allows the neural network to find robust features that are “general” and tuned to the specific task at the same time.
5.2 Speaker Verification
In this sub-section, we finally extend our validation to speaker verification. Tab. 3 reports the Equal Error Rate (EER%) achieved with the Librispeech corpus using SincNet.
All models show promising performance, leading to an EER lower than 1% in all cases. The evidence that emerged from the speaker identification task is confirmed for speaker verification as well. The best performance is obtained with the joint semi-supervised approach, leading to a 13% relative performance gain over a standard supervised learning baseline. Although a detailed comparison with i-vectors is out of the scope of this paper, it is worth mentioning that our best i-vector system achieves an EER of 1.1%, rather far from what is achieved with the DNN systems. It is well known in the literature that i-vectors provide competitive performance when more training material and longer test sentences are employed. Under the conditions faced in this work, neural networks achieve better generalization.
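For readers less familiar with the metric, the sketch below shows one simple way to approximate the EER from a set of trial scores. It is an illustration only: the function name, the threshold sweep over the observed scores, and the toy scores are assumptions made for this example, and the scores are taken to be similarities (higher means more likely the same speaker), i.e., the opposite orientation of a cosine distance.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is greater than or equal to the threshold.
    Returns the average of the false acceptance and false rejection rates at
    the threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy trials: one genuine score (0.4) falls below one impostor score (0.6).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer_overlap = equal_error_rate(genuine, impostor)
eer_separable = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

With these toy scores, `eer_overlap` is 0.25 (one of four genuine trials rejected and one of four impostor trials accepted at the same threshold), while perfectly separable scores give an EER of zero.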
6 Conclusions and Future Work
This paper proposed a method for learning speaker representations by maximizing mutual information. The experiments have shown promising performance on both speaker identification and verification tasks and have highlighted better results when adopting the standard binary cross-entropy loss, which turned out to be more stable and easier to optimize than the other metrics. They also highlighted the importance of using proper architectures for unsupervised learning. In particular, SincNet significantly outperformed a standard CNN, confirming its effectiveness when processing raw audio waveforms. The best results are obtained with end-to-end semi-supervised learning, where an ecosystem of neural networks composed of an encoder, a discriminator, and a speaker-id classifier is jointly trained.
In the future, we will scale up our experiments to address other popular speaker recognition tasks, such as VoxCeleb. Although this study targeted speaker recognition only, we believe that this approach defines a more general paradigm for unsupervised and semi-supervised learning. Our future effort will thus be devoted to extending our findings to other tasks, such as speech recognition.
We would like to thank Devon Hjelm, Titouan Parcollet, and Maurizio Omologo, for their helpful comments. This research was enabled by support provided by Calcul Québec and Compute Canada.
-  D. Yu and L. Deng, Automatic Speech Recognition - A Deep Learning Approach, Springer, 2015.
-  G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
-  M. Ravanelli, Deep learning for Distant Speech Recognition, PhD Thesis, Unitn, 2017.
-  M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “A network of deep neural networks for distant speech recognition,” in Proc. of ICASSP, 2017, pp. 4880–4884.
-  M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep neural network approaches to speaker recognition,” in Proc. of ICASSP, 2015, pp. 4814–4818.
-  F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
-  Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Proc. of NIPS, 2007, pp. 153–160.
-  G.E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, pp. 1527–1554, 2006.
-  D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” CoRR, vol. abs/1312.6114, 2013.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. of NIPS, pp. 2672–2680. 2014.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096–2030, Jan. 2016.
-  D. Serdyuk, P. Brakel, B. Ramabhadran, S. Thomas, Y. Bengio, and K. Audhkhasi, “Invariant representations for noisy speech recognition,” arXiv e-prints, vol. abs/1612.01928, 2016.
-  P. Brakel and Y. Bengio, “Learning independent features with adversarial nets for non-linear ica,” arXiv e-prints, vol. 1710.05050, 2017.
-  M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mutual information neural estimation,” in Proc. of ICML, 2018, pp. 531–540.
-  A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018.
-  R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv e-prints, vol. 1808.06670, 2018.
-  D. Applebaum, Probability and Information: An Integrated Approach, Cambridge University Press, 2nd edition, 2008.
-  J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
-  L. Paninski, “Estimation of entropy and mutual information,” Neural Comput., vol. 15, no. 6, pp. 1191–1253, June 2003.
-  M. Ravanelli and Y. Bengio, “Speaker Recognition from raw waveform with SincNet,” in Proc. of SLT, 2018.
-  M. Ravanelli and Y. Bengio, “Interpretable Convolutional Filters with SincNet,” in Proc. of NIPS@IRASL, 2018.
-  P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” CoRR, vol. abs/1809.10341, 2018.
-  D. Palaz, M. Magimai-Doss, and R. Collobert, “Analysis of CNN-based speech recognition system using raw speech as input,” in Proc. of Interspeech, 2015.
-  T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. of Interspeech, 2015.
-  Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” in Proc. of Interspeech, 2014.
-  G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in Proc. of ICASSP, 2016, pp. 5200–5204.
-  H. Muckenhirn, M. Magimai-Doss, and S. Marcel, “Towards directly modeling raw speech signal for speaker verification using CNNs,” in Proc. of ICASSP, 2018.
-  M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi Speech Recognition Toolkit,” submitted to ICASSP, 2019.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM,” 1993.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
-  J. Ba, R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of ICML, 2015, pp. 448–456.
-  M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Batch-normalized joint training for DNN-based distant speech recognition,” in Proc. of SLT, 2016.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. of ICML, 2013.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. of AISTATS, 2010, pp. 249–256.
-  E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. of ICASSP, 2014, pp. 4052–4056.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Proc. of Interspeech, 2017.
-  A. K. Sarkar, D Matrouf, P.M. Bousquet, and J.F. Bonastre, “Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification,” in Proc. of Interspeech, 2012, pp. 2662–2665.