The task of a text-independent speaker recognition system is to identify a speaker from a given voice recording, irrespective of its linguistic content. This is accomplished by modelling the voice characteristics of each speaker after an enrolment phase. While the most common application of speaker recognition is verification, other important applications exist. One role that a speaker recognition system can fulfil is the selection of a speaker-dependent acoustic model for a speech recognition system. It can also be used to perform speaker segmentation, an important pre-processing step for speaker diarisation. The realisation of each application is dependent upon a high-performance speaker recognition system. The first speaker recognition system to be considered high performance modelled each speaker’s voice using a Gaussian mixture model (GMM).
One obstacle that prevented the commercial introduction of GMM speaker models was their poor performance in the presence of noise, spurring the investigation of robust approaches. A noteworthy approach was the missing feature approach, which is underpinned by evidence that speech is intelligible to humans even after it has undergone substantial spectral masking. Marginalisation, as proposed by Cooke et al., has been the most prominent missing feature approach in the literature, and is able to significantly increase the robustness of a GMM speaker model. With marginalisation, the marginal probability density function is obtained by integrating over the components of the feature vector that have been classified as unreliable representations of speech. Classification is thus performed on a partial instantiation of a given feature vector, consisting of only the components that reliably represent speech.
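As a minimal sketch of the marginalisation described above, the following Python function evaluates a diagonal-covariance GMM on only the reliable components of a feature vector. The function names, the list-based parameter layout, and the Boolean `reliable` mask are illustrative assumptions, not the implementation used in this work. With diagonal covariances, integrating a mixture component over the unreliable dimensions simply removes their univariate factors.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_marginal_loglik(x, reliable, weights, means, variances):
    """Log-likelihood of the reliable components of x under a diagonal GMM.

    x         : list of feature values (one per dimension)
    reliable  : list of bools, True where the component is reliable
    weights   : mixture weights (assumed to sum to one)
    means     : per-component lists of means
    variances : per-component lists of variances (diagonal covariance)
    """
    per_component = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for d, is_reliable in enumerate(reliable):
            if is_reliable:  # unreliable dimensions integrate to one
                ll += log_gauss(x[d], mu[d], var[d])
        per_component.append(ll)
    # Log-sum-exp over the mixture components for numerical stability.
    m = max(per_component)
    return m + math.log(sum(math.exp(v - m) for v in per_component))
```

Because the unreliable dimensions are skipped, the value placed in an unreliable component of `x` has no effect on the result.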
Recently, speaker recognition has been performed using convolutional neural networks (CNNs), which provide a significant improvement over GMM speaker models for clean conditions. One recent example is SincNet, which employs parametrised sinc functions to pre-define a bank of band-pass filters for its first layer. Another example, as proposed by Xie et al., uses a ‘thin’ residual CNN, and dictionary-based NetVLAD and GhostVLAD layers for feature aggregation (referred to as Xie2019 henceforth). Despite their high performance on clean speech, modern speaker recognition systems are still susceptible to performance degradation in the presence of noise. However, CNNs are not probabilistic models and cannot perform inference on partial instantiations of a given feature vector, meaning that they cannot employ classifier-compensation missing feature approaches like marginalisation. Currently, the most popular approach to increase their robustness is to use a front-end technique to pre-process the noisy speech [20, 21].
In 2011, Poon and Domingos proposed a model that is both a deep architecture and a tractable probabilistic graphical model, called a sum-product network (SPN). Viewed as a deep architecture, an SPN can be described as a deep neural network, restricted to using sum and product operators. Viewed as a probabilistic graphical model, an SPN can be described as a rooted directed acyclic graph, with variables as leaves. SPNs have clear semantics; each node represents an unnormalised joint probability distribution over a set of variables. As an SPN is a probabilistic model, it can perform inference on a partial instantiation of a given feature vector, making it applicable for marginalisation.
Here, SPNs utilising marginalisation are proposed as robust speaker models. SPN speaker models have both advantages and disadvantages over CNN speaker recognition systems. SPN speaker models are trained solely on clean speech, and can use marginalisation to remain robust, avoiding the need for the noisy speech to be pre-processed. When a new speaker is enrolled, their SPN speaker model is simply added to the set of existing SPN speaker models, whereas a CNN must be retrained for all speakers. Additionally, the accuracy of classifying the reliability of a spectral feature has increased in recent years, further supporting the case for robust SPN speaker models utilising marginalisation. A disadvantage is that SPN structure and weight learning algorithms, as well as libraries, are currently undeveloped, as highlighted by Jaini et al.
as its front-end. Speaker recognition accuracy is used as the evaluation metric and is found for multiple conditions, including clean speech mixed with real-world non-stationary and coloured noise sources for multiple SNR levels. The paper is organised as follows: the proposed SPN speaker models are presented in Section II; the experiment setup is described in Section III, including a description of each speaker model type; the results and discussion are presented in Section IV; conclusions are drawn in Section V.
II SPN Speaker Models
II-A Frequency Domain Representation
A frequency domain representation is used as the feature vector for marginalisation, such as the power spectral density (PSD) estimate of the short-time Fourier transform (STFT), or the spectral sub-band energies of the PSD estimate. In this work, we use the log-spectral sub-band energies (LSSEs) of a clean speech frame as the feature vector for the SPN and GMM speaker models. The LSSEs are computed from the single-sided PSD estimate (for convenience, the frame index is omitted from the notation):

$$X_b = \log\left(\sum_{k=0}^{N/2} H_b[k]\, P[k]\right), \quad b = 1, 2, \ldots, B, \tag{1}$$

where $N$ denotes the frame length in discrete-time samples, $k$ denotes the discrete-frequency index, $P[k]$, for all $k = 0, 1, \ldots, N/2$, denotes the PSD estimate for a frame, and $H_b$, for all $b = 1, 2, \ldots, B$, denotes the $b$-th filter of a bank of $B$ triangular-shaped critical band filters spaced uniformly on the mel-scale. The PSD is estimated from the STFT of the clean speech using the periodogram method.
II-B SPN Speaker Models with Gaussian Leaves
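An SPN is defined over a set of random variables, $\mathbf{X} = \{X_1, X_2, \ldots, X_B\}$, where in this case, $\mathbf{X}$ is the feature vector of LSSEs for a frame of clean speech. An observation of $\mathbf{X}$ is denoted here by $\mathbf{x}$. Hence, the SPN, $S_c$, for speaker class $c$ is a function of the observed feature vector: $S_c(\mathbf{x})$, where the value of the SPN is given by its root. An SPN consists of multiple layers of sum and product nodes, with distributions as leaves. The multivariate distribution of the $l$-th leaf is over a subset of the variables: $\mathbf{X}_l \subseteq \mathbf{X}$, and is assumed to be normally distributed: $\mathbf{X}_l \sim \mathcal{N}(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$, with mean $\boldsymbol{\mu}_l$, and diagonal covariance $\boldsymbol{\Sigma}_l$. The probability density function for the $l$-th leaf is given by

$$p(\mathbf{x}_l) = \prod_{d \in D_l} \frac{1}{\sqrt{2\pi\sigma_{l,d}^2}} \exp\left(-\frac{(x_d - \mu_{l,d})^2}{2\sigma_{l,d}^2}\right), \tag{2}$$

where $\mathbf{x}_l \subseteq \mathbf{x}$, $D_l$ indicates the random variable indices for $\mathbf{X}_l$, and $\mu_{l,d}$ and $\sigma_{l,d}^2$ are the mean and variance of the $l$-th leaf for variable $X_d$. An SPN over two variables with univariate Gaussian leaves is shown in Figure 1.

If node $i$ is a product node, its value is given by the product of the values of its children, $\mathrm{Ch}(i)$: $S_i(\mathbf{x}) = \prod_{j \in \mathrm{Ch}(i)} S_j(\mathbf{x})$, where $S_j$ is the $j$-th child of node $i$. If node $i$ is a sum node, its value is given by the weighted sum of the values of its children: $S_i(\mathbf{x}) = \sum_{j \in \mathrm{Ch}(i)} w_{ij} S_j(\mathbf{x})$, where weight $w_{ij}$ is the non-negative weighted edge between $i$ and $j$, where $j \in \mathrm{Ch}(i)$. To be a valid joint distribution, an SPN must be both decomposable and complete. The scope of a node, $\mathrm{sc}(i)$, is defined as the set of variables that are descendants of it. An SPN is said to be decomposable when the scopes of the children of its product nodes are disjoint: $\mathrm{sc}(j) \cap \mathrm{sc}(k) = \emptyset$ for all $j, k \in \mathrm{Ch}(i)$ with $j \neq k$, where $\emptyset$ indicates an empty set. An SPN is said to be complete when the scopes of the children of its sum nodes are identical: $\mathrm{sc}(j) = \mathrm{sc}(k)$ for all $j, k \in \mathrm{Ch}(i)$.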
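The node definitions above can be illustrated with a small, self-contained Python sketch. The class names (`Leaf`, `Product`, `Sum`) and helper functions are hypothetical and are not taken from any SPN library; the sketch assumes univariate Gaussian leaves and represents a missing variable by `None`, in which case the leaf integrates to one.

```python
import math

class Leaf:
    """Univariate Gaussian leaf over the variable with index `var`."""
    def __init__(self, var, mean, variance):
        self.var, self.mean, self.variance = var, mean, variance

class Product:
    def __init__(self, children):
        self.children = children

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

def scope(node):
    """Set of variable indices that appear below a node."""
    if isinstance(node, Leaf):
        return frozenset([node.var])
    return frozenset().union(*(scope(c) for c in node.children))

def is_valid(node):
    """Check decomposability (product nodes) and completeness (sum nodes)."""
    if isinstance(node, Leaf):
        return True
    scopes = [scope(c) for c in node.children]
    if isinstance(node, Product):
        # Disjoint scopes: no variable is counted twice across children.
        ok = sum(len(s) for s in scopes) == len(frozenset().union(*scopes))
    else:
        # Identical scopes across all children of a sum node.
        ok = all(s == scopes[0] for s in scopes)
    return ok and all(is_valid(c) for c in node.children)

def value(node, x):
    """Evaluate the SPN bottom-up; x[d] is None for a missing variable."""
    if isinstance(node, Leaf):
        xd = x[node.var]
        if xd is None:
            return 1.0  # the leaf density integrates to one
        norm = math.sqrt(2.0 * math.pi * node.variance)
        return math.exp(-0.5 * (xd - node.mean) ** 2 / node.variance) / norm
    if isinstance(node, Product):
        return math.prod(value(c, x) for c in node.children)
    return sum(w * value(c, x) for w, c in zip(node.weights, node.children))

# A two-variable SPN: a sum node over two product nodes of Gaussian leaves.
spn = Sum(
    [Product([Leaf(0, 0.0, 1.0), Leaf(1, 0.0, 1.0)]),
     Product([Leaf(0, 3.0, 1.0), Leaf(1, 3.0, 1.0)])],
    [0.3, 0.7],
)
```

Calling `value(spn, [0.5, None])` evaluates the network with the second variable left uninstantiated, which is the property that a CNN does not offer.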
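Table I: Speaker recognition accuracy (%) by SNR level (dB) for the real-world non-stationary noise sources. In each row, the first five accuracy values are for voice babble and the next five are for street music.
|SincNet + IRM||-||-||0.63||4.44||25.40||71.75||92.70||1.27||5.40||23.81||64.44||92.38||-17.14|
|Xie2019 + IRM||-||-||0.63||1.27||10.48||28.89||53.33||0.32||1.27||4.44||20.63||40.95||-15.46|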
II-C Marginalisation for SPNs
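For marginalisation, each component of an observed noisy speech feature vector is classified as either a reliable or an unreliable representation of the corresponding unobserved clean speech feature vector components. The noisy speech feature vector, $\mathbf{y}$, can thus be described as the union of the reliable and unreliable components: $\mathbf{y} = \mathbf{y}_r \cup \mathbf{y}_u$. Here, we not only apply marginalisation to SPNs, but also bounded marginalisation, as proposed by Cooke et al. For bounded marginalisation, the unreliable components are treated as the upper bounds to the unobserved clean speech component values. For LSSEs, the bounds are taken from $[-\infty, \mathbf{y}_u]$. Thus, the probability density function for the $l$-th leaf becomes

$$p(\mathbf{y}_l) = p(\mathbf{y}_{r,l}) \int_{-\infty}^{\mathbf{y}_{u,l}} p(\mathbf{x}_{u,l})\, d\mathbf{x}_{u,l}, \tag{3}$$

where $\mathbf{y}_{r,l}$ and $\mathbf{y}_{u,l}$ are the reliable and unreliable components of $\mathbf{y}$ within the scope of the leaf, and $\mathbf{x}_{u,l}$ denotes the corresponding unobserved clean speech components. The factorisation follows from the diagonal covariance of the leaf.

For marginalisation, the unreliable components are treated as missing. Thus, the bounds are taken from $[-\infty, \infty]$, and the integral in Equation 3 reduces to unity. This gives the marginal probability density function for the leaf: $p(\mathbf{y}_{r,l})$. When all of the components of $\mathbf{y}_l$ are unreliable, it is treated as a vector with no instantiated components, and the leaf evaluates to unity.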
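For a univariate Gaussian leaf, the three cases described above (reliable, unreliable with bounds, unreliable and missing) can be sketched as follows. The function names and the `bounded` flag are illustrative assumptions; the bounded case uses the Gaussian cumulative distribution function, which is the integral of the leaf density from minus infinity up to the observed noisy value.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gaussian_cdf(x, mean, var):
    """Integral of the Gaussian density from -infinity to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def leaf_value(y, reliable, mean, var, bounded=True):
    """Value of a univariate Gaussian leaf for one noisy observation y.

    reliable : True if y is a reliable representation of clean speech
    bounded  : for unreliable y, use y as an upper bound (bounded
               marginalisation) instead of treating it as missing
    """
    if reliable:
        return gaussian_pdf(y, mean, var)
    if bounded:
        return gaussian_cdf(y, mean, var)  # clean value assumed <= y
    return 1.0  # plain marginalisation: the integral reduces to unity
```

A multivariate leaf with diagonal covariance would multiply these per-dimension values together.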
III Experiment Setup
III-A Signal Processing
The feature vectors for the GMM and SPN speaker models were computed using the following hyperparameters. The Hamming window function was used for analysis, with a frame length of 32 ms (512 discrete-time samples) and a frame shift of 16 ms (256 discrete-time samples). The 257-point single-sided PSD estimate for a frame included both the DC and Nyquist frequency components. The LSSEs of a PSD estimate were computed using 26 triangular-shaped critical band filters spaced uniformly on the mel-scale.
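A rough NumPy sketch of this feature extraction is given below. Several details are assumptions rather than facts from this work: a 16 kHz sampling rate (implied by 512 samples per 32 ms), the common `2595 log10(1 + f/700)` mel formula, periodogram scaling by the frame length, the natural logarithm, and a small floor to avoid taking the log of zero. The function names `mel_filterbank` and `lsse` are hypothetical.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    """Triangular filters with edges spaced uniformly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft  # 257 bin frequencies
    fbank = np.zeros((n_filters, freqs.size))
    for b in range(n_filters):
        lo, centre, hi = edges_hz[b:b + 3]
        rising = (freqs - lo) / (centre - lo)
        falling = (hi - freqs) / (hi - centre)
        fbank[b] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def lsse(signal, frame_len=512, frame_shift=256, n_filters=26, fs=16000):
    """Log-spectral sub-band energies, one row of n_filters values per frame."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, n=frame_len, axis=1)
    psd = np.abs(spectrum) ** 2 / frame_len          # periodogram, 257 points
    energies = psd @ mel_filterbank(n_filters, frame_len, fs).T
    return np.log(np.maximum(energies, 1e-12))       # floor avoids log(0)
```

The input is assumed to be a mono signal at least one frame long; trailing samples that do not fill a whole frame are discarded.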
III-B Classification of Reliable Spectral Components
Here, the reliability of a spectral component is determined by its a priori SNR. A component with an a priori SNR greater than 0 dB is classified as reliable. Deep Xi is used here as the a priori SNR estimator. It is a deep learning approach to a priori SNR estimation, and is available at: https://github.com/anicolson/DeepXi. Deep Xi estimates the a priori SNR for each of the 257 frequency-domain components of a noisy speech frame. The a priori SNR estimate for each sub-band is subsequently found by applying the filterbank used to compute the LSSEs.
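The reliability decision described above can be sketched in a few lines of Python. How exactly the filterbank is applied to the per-bin a priori SNR estimates is not specified here, so the filter-weighted average below is an assumption, as are the function names and the linear-scale input. The a priori SNR values themselves would come from an estimator such as Deep Xi, which is not called in this sketch.

```python
import math

def subband_snr(xi, fbank):
    """Filter-weighted average of per-bin a priori SNR (linear scale).

    xi    : list of per-bin a priori SNR estimates (linear, not dB)
    fbank : list of filters, each a list of non-negative weights per bin
    """
    out = []
    for weights in fbank:
        total = sum(weights)  # assumed non-zero for every filter
        out.append(sum(w * v for w, v in zip(weights, xi)) / total)
    return out

def reliability_mask(xi_subband, threshold_db=0.0):
    """True where the sub-band a priori SNR exceeds the threshold in dB."""
    return [10.0 * math.log10(max(xi, 1e-12)) > threshold_db
            for xi in xi_subband]
```

The resulting Boolean mask marks which LSSE components are treated as reliable and which are marginalised.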
III-C Training and Testing Sets
The TIMIT corpus ( kHz, single-channel), which consists of speakers with utterances each, was used as the clean speech set in this work. The and subsets were used for training ( utterances) and the subset was used for testing ( utterances). Each clean speech recording from the subset was mixed additively with one of four real-world noise source recordings to create the noisy speech for testing ( clean speech recordings for each noise source). Each noisy speech recording was replicated at five SNR levels: to dB, in dB increments, forming a testing set of noisy speech recordings. The real-world noise sources included two non-stationary and two coloured. The two real-world non-stationary noise sources included voice babble from the RSG-10 noise dataset and a street music recording from the Urban Sound dataset. The two real-world coloured noise sources included F16 and factory (welding) from the RSG-10 noise dataset.
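Additive mixing at a target SNR level can be sketched as below. The procedure actually used to set the SNR is not stated here, so this is an assumption: the SNR is taken as the ratio of average powers over the whole utterance, and the noise recording is assumed to be at least as long as the clean speech. The function name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` to the requested SNR (in dB) and add it to `clean`.

    clean, noise : lists of samples; noise must be at least as long as clean
    snr_db       : target ratio of clean power to scaled-noise power, in dB
    """
    noise = noise[:len(clean)]                       # trim noise to length
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain that makes p_clean / (gain^2 * p_noise) equal 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]
```

Repeating the call with different `snr_db` values yields the same clean recording at several SNR levels.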
III-D Speaker Model Configurations
GMM: For each speaker, a GMM consisting of diagonal covariance clusters was trained on the training set using the expectation-maximisation algorithm, with the k-means++ algorithm used for parameter initialisation.
Xie2019: The model is available at: https://github.com/WeidiXie/VGG-Speaker-Recognition, and was trained using the training set with an input spectrogram size of 1 second.
SPN: Each speaker was modelled using an SPN with univariate Gaussian leaves. The SPFlow library was used to implement the SPN speaker models . A variant of the LearnSPN algorithm  that partitions and clusters variables using the Hirschfeld-Gebelein-Rényi maximum correlation coefficient  was used as the structure learning algorithm. The minimum number of instances to split was set to , and the threshold of significance was set to for the structure learning algorithm.
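Table II: Speaker recognition accuracy (%) by SNR level (dB) for the real-world coloured noise sources. In each row, the first five accuracy values are for F16 and the next five are for factory.
|SincNet + IRM||-||-||0.63||1.27||5.71||26.67||72.70||0.95||1.59||13.02||44.13||86.67||-21.52|
|Xie2019 + IRM||-||-||0.32||0.32||2.86||6.98||20.00||0.00||0.32||0.63||2.86||21.27||-14.41|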
IV Results and Discussion
IV-A Real-World Non-Stationary Noise Sources
Table I shows the speaker recognition accuracy for the real-world non-stationary noise sources: voice babble and street music. Over all of the tested conditions in Table I, the SPN speaker models demonstrated an average improvement of over the GMM speaker models. This indicates that the SPN speaker models are better able to model the joint distribution of each speaker. It can be seen that the robustness of the SPN speaker models significantly increases when either marginalisation or bounded marginalisation is used. The SPN speaker models attained an average improvement of and over the GMM speaker models when marginalisation and bounded marginalisation were used, respectively. The performance improvement that the SPN speaker models possess over the GMM speaker models is thus extended when either marginalisation or bounded marginalisation is used.
The SPN speaker models employing bounded marginalisation were able to significantly outperform SincNet + IRM, with an average improvement of . While SincNet + IRM achieved the best accuracy at 15 dB for both non-stationary noise sources, it was significantly outperformed at lower SNR levels by the SPN speaker models employing bounded marginalisation. The results presented in Table I show that the SPN speaker models are highly robust to real-world non-stationary noise sources when either marginalisation or bounded marginalisation is used, especially at lower SNR levels.
Table III: Number of parameters per speaker for each speaker recognition system.
IV-B Real-World Coloured Noise Sources
Table II shows the speaker recognition accuracy for the real-world coloured noise sources: F16 and factory. The SPN speaker models utilising bounded marginalisation again outperformed SincNet + IRM, with an average performance increase of . Over all of the tested conditions in Table II, the SPN speaker models demonstrated an average improvement of and over the GMM speaker models when marginalisation and bounded marginalisation were used, respectively. This indicates that marginalisation and bounded marginalisation are more suited to SPN speaker models than GMM speaker models.
The results presented in Tables I and II show that the SPN speaker models are highly robust to both real-world non-stationary and coloured noise sources, when either marginalisation or bounded marginalisation is used. These results are made more significant when considering the number of parameters that each speaker recognition system expends on a speaker, as specified in Table III. The SPN speaker models were more robust than SincNet, whilst employing 14.7 times fewer parameters on average per speaker. This exhibits the parameter efficiency of the SPN speaker models.
IV-C Future Direction
The SPN structure learning algorithm used here, LearnSPN (introduced in 2013), was the second-ever proposed. Several structure learning algorithms that can outperform LearnSPN have since been introduced, including Prometheus. Additionally, it is common to use backpropagation (as used to train CNNs) to fine-tune the weights of an SPN, something that was not carried out here. With a better structure learning algorithm, and the use of a weight learning algorithm, the joint distribution of a speaker could perhaps be more effectively modelled. This may improve the performance of SPN speaker models at higher SNR levels. SPN acoustic models utilising marginalisation should also be investigated for robust automatic speech recognition (ASR), and for robust speaker verification, where the universal background model framework could be employed.
V Conclusion
Here, sum-product networks (SPNs) are employed as robust speaker models. They are evaluated on real-world non-stationary and coloured noise sources at multiple SNR levels. When marginalisation is used, SPN speaker models are more robust than current speaker recognition systems that employ significantly more parameters and pre-processing techniques. With the development of better structure and weight learning algorithms, SPNs are predicted to have a bright future not only for robust speaker recognition, but also for robust ASR.
-  Z. Zhang, “Mechanics of human voice production and control,” The Journal of the Acoustical Society of America, vol. 140, no. 4, pp. 2614–2635, 2016.
-  D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19 – 41, 2000.
-  N. Singh, R. Khan, and R. Shree, “Applications of speaker recognition,” Procedia Engineering, vol. 38, pp. 3122 – 3126, 2012. International Conference On Modelling Optimization And Computing.
-  Y. Tu, J. Du, L. Dai, and C. Lee, “A speaker-dependent deep learning approach to joint speech separation and acoustic modeling for multi-talker automatic speech recognition,” in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5, Oct 2016.
-  T. J. Park and P. Georgiou, “Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks,” in Proc. Interspeech 2018, pp. 1373–1377, 2018.
-  D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91 – 108, 1995.
-  D. A. Reynolds, “Channel robust speaker verification via feature mapping,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03), vol. 2, pp. 53–56, April 2003.
-  R. J. Mammone, X. Zhang, and R. P. Ramachandran, “Robust speaker recognition: a feature-based approach,” IEEE Signal Processing Magazine, vol. 13, pp. 58–71, Sep. 1996.
-  B. Raj and R. M. Stern, “Missing-feature approaches in speech recognition,” IEEE Signal Processing Magazine, vol. 22, pp. 101–116, Sep. 2005.
-  M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Communication, vol. 34, no. 3, pp. 267 – 285, 2001.
-  R. Togneri and D. Pullella, “An overview of speaker identification: Accuracy and robustness issues,” IEEE Circuits and Systems Magazine, vol. 11, pp. 23–61, Secondquarter 2011.
-  A. Nicolson, J. Hanson, J. Lyons, and K. Paliwal, “Spectral subband centroids for robust speaker identification using marginalization-based missing feature theory,” International Journal of Signal Processing Systems, vol. 6, pp. 12–16, March 2018.
-  A. Nicolson and K. K. Paliwal, “Bidirectional long-short term memory network-based estimation of reliable spectral component locations,” in Proc. Interspeech 2018, pp. 1606–1610, 2018.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Proc. Interspeech 2018, pp. 1086–1090, 2018.
-  M. Ravanelli and Y. Bengio, “Speaker Recognition from Raw Waveform with SincNet,” arXiv e-prints, p. arXiv:1808.00158, Jul 2018.
-  W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” arXiv preprint arXiv:1902.10107, 2019.
-  R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1437–1451, June 2018.
-  Y. Zhong, R. Arandjelovic, and A. Zisserman, “GhostVLAD for set-based face recognition,” CoRR, vol. abs/1810.09951, 2018.
-  M. I. Mandasari, M. McLaren, and D. A. van Leeuwen, “The effect of noise on modern automatic speaker recognition systems,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4249–4252, March 2012.
-  I. Bisio, C. Garibotto, A. Grattarola, F. Lavagetto, and A. Sciarrone, “Smart and robust speaker recognition for context-aware in-vehicle applications,” IEEE Transactions on Vehicular Technology, vol. 67, pp. 8808–8821, Sep. 2018.
-  Z. Zhang, J. Geiger, J. Pohjalainen, A. E.-D. Mousa, W. Jin, and B. Schuller, “Deep learning for environmentally robust speech recognition: An overview of recent developments,” ACM Trans. Intell. Syst. Technol., vol. 9, pp. 1–28, Apr. 2018.
-  H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689–690, Nov 2011.
-  P. Jaini, A. Ghose, and P. Poupart, “Prometheus: Directly learning acyclic directed graph structures for sum-product networks,” in Proceedings of the Ninth International Conference on Probabilistic Graphical Models (V. Kratochvíl and M. Studený, eds.), vol. 72 of Proceedings of Machine Learning Research, (Prague, Czech Republic), pp. 181–192, PMLR, 11–14 Sep 2018.
-  D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing, vol. 3, pp. 72–83, Jan 1995.
-  J. Chen and D. Wang, “Long short-term memory for speaker generalization in supervised speech separation,” The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705–4714, 2017.
-  J. Barker, L. Josifovski, M. Cooke, and P. Green, “Soft decisions in missing data techniques for robust automatic speech recognition,” in Sixth International Conference on Spoken Language Processing, 2000.
-  D. Wang, On ideal binary mask as the computational goal of auditory scene analysis, pp. 181–197. Boston, MA: Springer US, 2005.
-  A. Nicolson and K. K. Paliwal, “Deep learning for minimum mean-square error approaches to speech enhancement,” Speech Communication, vol. 111, pp. 44 – 55, 2019.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
-  H. J. Steeneken and F. W. Geurtsen, “Description of the RSG-10 noise database,” Report IZF 1988-3, TNO Institute for Perception, Soesterberg, The Netherlands, 1988.
-  J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, (New York, NY, USA), pp. 1041–1044, ACM, 2014.
-  A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
-  D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” Technical Report 13, Stanford InfoLab, June 2006.
-  A. Nicolson and K. K. Paliwal, “Deep Xi as a front-end for robust automatic speech recognition,” in Submitted to Interspeech 2019, 2019.
-  A. Molina, A. Vergari, K. Stelzner, R. Peharz, P. Subramani, N. D. Mauro, P. Poupart, and K. Kersting, “SPFlow: An easy and extensible library for deep probabilistic learning using sum-product networks,” CoRR, vol. abs/1901.03704, 2019.
-  R. Gens and P. Domingos, “Learning the structure of sum-product networks,” in Proceedings of the 30th International Conference on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine Learning Research, (Atlanta, Georgia, USA), pp. 873–880, PMLR, 17–19 Jun 2013.
-  A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting, “Mixed sum-product networks: A deep architecture for hybrid domains,” in Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3828–3835, 2018.