I-vector Based Features Embedding for Heart Sound Classification

by   Mohammad Adiban, et al.

Cardiovascular disease (CVD) is one of the main causes of death worldwide. Accordingly, scientists look for methods to recognize normal and abnormal heart patterns. In recent years, researchers have become interested in investigating CVDs based on heart sounds. The PhysioNet 2016 corpus provides a standard database for researchers in this field. In this study, we propose an approach for normal/abnormal heart sound detection based on i-vector features on the PhysioNet 2016 corpus. In this method, a fixed-length vector, namely the i-vector, is extracted from each record, and then Principal Component Analysis (PCA) is applied. A Variational AutoEncoder (VAE) is also used to reduce the dimensionality of the obtained i-vector. The i-vector and its versions transformed by PCA and VAE are then used to train two Gaussian Mixture Models (GMMs). Finally, the test set is scored using these trained GMMs, and a simple global threshold is applied to classify the obtained scores. We report the results in terms of Equal Error Rate (EER) and Modified Accuracy (MAcc). Experimental results show the obtained Accuracy by our proposed system could improve the results reported on the baseline system by 16




1 Introduction

Cardiovascular disease (CVD) is the most common cause of death in most countries of the world and is the leading cause of disability ref1 .

Based on information provided by the World Heart Association in 2017, 17.7 million people die every year due to CVDs, which is approximately 31% of all global deaths. The most prevalent CVDs are heart attacks and strokes ref1 .

In 2013, all 194 members of the World Health Organization agreed to implement the Global Action Plan for the Prevention and Control of Non-communicable Diseases, a plan for 2013 to 2020, to be prepared against CVDs. By implementing the nine voluntary global goals in this plan, the number of premature deaths due to non-communicable diseases is expected to decrease. Two of these goals focus specifically on the prevention and control of CVDs ref1 .

Accordingly, in recent years, researchers have shown great interest in detecting heart diseases based on heart sounds ref2 . Most approaches in this context rely on sound segmentation, feature extraction, and machine learning classification on different datasets.

In recent years, various studies have been conducted on normal/abnormal heart sound detection using segmentation methods.

In ref3 , the Shannon energy envelope of the local spectrum is calculated by a new method that applies the S-transform to every sound in the heart sound signal. Sensitivity and positive predictivity were evaluated on 80 heart sound recordings (40 normal and 40 pathological), and both were reported to be over 95%. In ref4 , an approach based on the Hilbert transform was proposed for automatic segmentation. Features for this study included the envelopes near the peaks of S1 and S2 and the transition points T12 from S1 to S2, and vice versa. The database consisted of 7730 s of heart sounds from pathological patients, 600 s from normal subjects, and 1496.8 s from the Michigan MHSDB database. The average accuracy was 96.69% for sounds with mixed S1 and S2, and 97.37% for those with separated S1 and S2.

Another envelope extraction method used for heart sound segmentation is the Cardiac Sound Characteristic Waveform (CSCW). The work presented in ref5 applied this method to only a small set of heart sounds, comprising 9 recordings, and reported an accuracy of 99.0%. No train-test split was performed for evaluation in this study.

The work in ref6 achieved an accuracy of 92.4% for S1 and 93.5% for S2 segmentation using homomorphic filtering and an HMM on the PASCAL database ref7 . The work in ref8 used the same approach with wavelet analysis on the same database and reported an accuracy of 90.9% for S1 segmentation and 93.3% for S2 segmentation. Another study modeled the expected duration of heart sounds using an HMM and a Hidden Semi-Markov Model (HSMM). In this study, the positions of the S1 and S2 sounds were first labeled in 113 recordings. Gaussian distributions for the expected duration of each of the four states (S1, systole, S2 and diastole) were then calculated from the average duration of these sounds and from autocorrelation analysis of the systolic and diastolic durations. The features included the homomorphic envelope and three frequency-band features (in the 25-50, 50-100 and 100-150 Hz ranges). Gaussian distributions were then fitted for the HMM states and emission probabilities. Finally, a backward and forward Viterbi algorithm was used for decoding, and 98.8% sensitivity and 98.6% positive predictivity were reported. An HSMM alongside logistic regression (for emission probability estimation) was also proposed to accurately segment noisy, real-world heart sound recordings ref10 . This work also used the Viterbi algorithm to decode state sequences. For evaluation, a database of 10172 s of heart sounds recorded from 112 patients was used. The reported F1 score was 95.63%, improving on the previous state of the art of 86.28% on the same test set.

Other works rely on feature extraction followed by a machine learning classifier such as an ANN, SVM, HMM or kNN.

To distinguish the spectral energy of normal and pathological recordings, the work introduced in ref11 extracted five frequency bands and fed their spectral energy to an ANN. Results on a dataset of 50 recorded sounds showed 95% sensitivity and 93.33% specificity.

In ref12 , a discrete wavelet transform combined with fuzzy logic was used for a three-class problem: normal, pulmonary stenosis, and mitral stenosis. An ANN was employed to classify a dataset of 120 subjects with a 50/50 split between the train and test sets. The reported results were 100% sensitivity, 95.24% specificity, and 98.33% average accuracy. The same author used time-frequency features as input to an ANN in ref13 . This work reported 90.4% sensitivity, 97.44% specificity, and 95% accuracy on the same dataset for the same problem (three-class classification of normal, pulmonary stenosis and mitral stenosis heart valve diseases).

The work in ref14 classified normal and pathological cases using a Least Square Support Vector Machine (LSSVM) with wavelet-based features. The method was evaluated on heart sounds from 64 patients (32 cases for training and 32 for testing) and achieved 86.72% accuracy. The work in ref15 used the same classifier with wavelet packets, taking extracted features such as sample entropy and energy fraction as input. The dataset consisted of 40 normal persons and 67 pathological patients, and the reported results were 97.17% accuracy, 93.48% sensitivity and 98.55% specificity. Another study, ref16 , also used an LSSVM classifier, with tunable-Q wavelet transform features as input. Its evaluation showed 98.8% sensitivity and 99.3% specificity on a dataset comprising 4628 cycles from 163 heart sound recordings, with an unknown number of patients. As another SVM-based work, ref17 used frequency power over varying-length frames in systole as input features and a Growing Time SVM (GTSVM) to classify pathological and normal murmurs. Results on 56 persons (26 murmurs and 30 normal) were 86.4% sensitivity and 89.3% specificity.

Another HMM-based work was performed by ref18 , where an HMM was fitted to the frequency spectrum of the heart cycle and four HMMs were used to evaluate the posterior probability of the features for classification. To improve the results, Principal Component Analysis (PCA) was used for dimensionality reduction, and the reported results were 95% sensitivity, 98.8% specificity and 97.5% accuracy on a dataset with 60 samples.

As a nearest-neighbor approach, ref19 employed K-Nearest Neighbor (K-NN) on features obtained from various time-frequency representations extracted from a subset of 22 persons (16 normal persons and 6 pathological patients). They reported 98% accuracy for this problem, where the likelihood of over-training was used to set the K-NN parameters. The work investigated in ref19 also chose K-NN to classify the samples as normal or pathological. This study also employed two approaches for dimensionality reduction of the extracted time-frequency features: linear decomposition and tiling partition of the feature plane. Results were obtained on a total of 45 recordings (19 pathological and 26 normal) and were reported as 99.0% average accuracy with 11-fold cross-validation.

Because of the lack of a standard dataset in this context, and to organize these studies, the PhysioNet/CinC Challenge 2016 and its database were introduced ref2 . This database was collected over a decade from a total of 9 independent databases with different numbers and types of patients and different recording quality. Some of the related works on PhysioNet 2016 are reviewed below.

The work presented in ref21 employed a set of 54 features extracted from heart sound timing information, selected using the mutual-information-based minimum Redundancy Maximum Relevance (mRMR) technique, and used a non-linear radial basis function Support Vector Machine (SVM) as the classifier. This work reported a sensitivity of 0.7749 and a specificity of 0.7891 on the hidden test set.

In ref22 , time, frequency, and time-frequency domain features are employed without any segmentation. An ensemble of 20 feedforward ANNs was used for classification and achieved an overall score of 91.50% (94.23% sensitivity and 88.76% specificity) on the train set and 85.90% (86.91% sensitivity and 84.90% specificity) on the blind test set.

| Author | Database | Method | Se% | Sp% | P+% | Acc% |
|---|---|---|---|---|---|---|
| Moukadem et al. (2013) | - | Segmentation | 96/97 | - | 95 | - |
| Sun et al. (2014) | - | Segmentation | - | - | - | 96.69 |
| Yan et al. (2010) | - | Segmentation | - | - | - | 99 |
| Sedighian et al. (2014) | PASCAL | Segmentation | - | - | - | 92.4/93.5 |
| Castro et al. (2013) | PASCAL | Segmentation | - | - | - | 90.9/93.3 |
| Schmidt et al. (2010a) | - | Segmentation | 98.8 | - | 98.6 | - |
| Sepehri et al. (2008) | 36 normal and 54 pathological | Frequency + ANN | 95 | 93.3 | - | - |
| Uguz (2012) | 40 normal, 40 pulmonary and 40 mitral stenosis | Time-frequency + ANN | 90.48 | 97.44 | - | 95 |
| Ari et al. (2010) | 64 patients (normal and pathological) | Wavelet + SVM | - | - | - | 86.72 |
| Zheng et al. (2015) | 40 normal and 67 pathological | Wavelet + SVM | 98.8 | 99.3 | - | 98.9 |
| Gharehbaghi et al. (2015) | 30 normal, 26 innocent and 30 AS | Frequency + SVM | 86.4 | 89.3 | - | - |
| Saracoglu (2012) | 40 normal, 40 pulmonary and 40 mitral stenosis | DFT and PCA + HMM | 95 | 98.8 | - | 97.5 |
| Quiceno-Manrique et al. (2010) | 16 normal and 6 pathological | Time-frequency + kNN | - | - | - | 98 |
| Avendano-Valencia et al. (2010) | 16 normal and 6 pathological | Time-frequency + kNN | 99.56 | 98.54 | - | 99 |
| Puri et al. (2016) | PhysioNet 2016 | mRMR + SVM | 77.49 | 78.91 | - | - |
| Zabihi et al. (2016) | PhysioNet 2016 | Time-frequency + ANN | 85.9 | 86.91 | - | 84.9 |
| Potes et al. (2016) | PhysioNet 2016 | Time-frequency and AdaBoost + CNN | 94.24 | 77.8 | - | - |
| Rubin et al. (2016) | PhysioNet 2016 | MFCC + CNN | 75 | 100 | - | 88 |

Table 1: Summary of previous heart sound works, with their methods, databases and results ref2 .

The work presented in ref23 reports a sensitivity of 0.9424, a specificity of 0.7781 and an overall score of 0.8602 on the blind data set, using a total of 124 time-frequency features and applying a variant of AdaBoost together with a convolutional neural network (CNN) classifier.

The work in ref24 employed a CNN to classify normal and abnormal heart sounds based on MFCC features. The experimental results were reported in two phases that used different training sets. In phase one, the sensitivity, specificity and overall score on the hidden set were 75%, 100% and 88%, respectively. In phase two, they were 76.5%, 93.1% and 84.8%, respectively. Table 1 summarizes the works reviewed in this section. In this study, we focus on detecting heart diseases from heart sounds based on the PhysioNet/CinC Challenge 2016, and we aim to provide an approach that relies on the identity vector (i-vector).

Although the i-vector was originally used for speaker recognition applications ref25 , it is currently used in various fields such as language identification ref26 ; ref27 , accent identification ref28 , gender recognition, age estimation, emotion recognition ref29 ; ref30 , and audio scene classification ref31 . In this study, we apply the i-vector to normal/abnormal heart sound detection.

Our motivation for using this method in this context is that human heart sounds can be considered physiological traits of a person ref32 , which are distinctive and permanent unless accidents, illnesses, genetic defects, or aging have altered or destroyed them ref32 .

In this work, we use two features, Mel-Frequency Cepstral Coefficients (MFCCs) and i-vectors, and we use Gaussian Mixture Models (GMMs) as the classifier. To distinguish a normal heart sound signal from an abnormal one, we extract MFCC features from the given heart sound signal and then obtain the i-vector of each heart sound signal from its MFCCs.

Furthermore, to classify a heart sound as normal or abnormal, we train GMMs and then apply the i-vectors to them. The rest of this paper is organized as follows. Section 2 introduces the features and the classifier. The experimental setup and the experimental results are reported in Sections 3 and 4, respectively. Finally, the conclusion is presented in Section 5.

2 Methodology

In this paper, we propose a method that uses the i-vector for normal/abnormal heart sound detection. In this method, we first train a GMM using all heart sounds (i.e. both normal and abnormal heart sounds) in our training set; this GMM serves as the Universal Background Model (UBM). After training the UBM, the zeroth- and first-order statistics of the training features are extracted. Using these statistics, we train an i-vector extractor with several iterations of the EM algorithm explained in Section 2.4.2. After training the i-vector extractor, we extract i-vectors from all records in our training set. At this stage, we have extracted several i-vectors with different dimensions for each record in the training set, and we use them to train the intra-class variation reduction methods. Specifically, we train the VAE using the original heart sounds in our training set and use it to transform the i-vectors into the new space. After training, we extract i-vectors from the heart sounds and transform them using PCA and VAE. We thus obtain a representative i-vector for each record, which we use for scoring.

Fig. 1 briefly illustrates our proposed system.

Figure 1: Our proposed system structure.

2.1 Mel-frequency Cepstral Coefficients

MFCCs have been used for years as one of the most important features for speaker recognition ref33 . The MFCC attempts to model human auditory perception by focusing on low frequencies (0-1 kHz) ref34 . In other words, the differences in critical bandwidth in the human ear are the basis of MFCCs. In addition, the Mel frequency scale is applied to extract critical features of speech, especially its pitch.

2.1.1 MFCC Extraction

In the following, we explain how the MFCC features are extracted. Initially, the given signal is pre-emphasized. Pre-emphasis reinforces the high-frequency components by passing the signal through a high-pass filter ref33 . In its standard form, the output of the filter is

$$y(n) = x(n) - \alpha\, x(n-1),$$

where $x(n)$ is the input signal and $\alpha$ is the pre-emphasis coefficient.

In the next step, the pre-emphasized signal is divided into short-time frames and a Hamming window is applied to each frame. The Hamming window is given by

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$

where $N$ is the number of samples in each frame.

To analyze the windowed frames in the frequency domain, an $N$-point Fast Fourier Transform (FFT) is applied to each of them:

$$X(k) = \sum_{n=0}^{N-1} y(n)\, w(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1.$$

A logarithmic power spectrum is then obtained on a Mel scale using a filter bank consisting of $L$ filters:

$$S(l) = \log\left( \sum_{k=k_l^{\mathrm{low}}}^{k_l^{\mathrm{up}}} |X(k)|^2\, H_l(k) \right), \quad l = 1, \dots, L,$$

where $H_l$ is the $l$-th triangular filter, and $k_l^{\mathrm{low}}$ and $k_l^{\mathrm{up}}$ are the lower and upper limits of the $l$-th filter, respectively.

A frequency $f$ in hertz is converted to the Mel scale as follows:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right).$$

Finally, the MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to $S(l)$:

$$c(m) = \sum_{l=1}^{L} S(l) \cos\left(\frac{\pi m\,(l - 0.5)}{L}\right),$$

where $m$ indexes the features obtained from the frequency components $S(l)$. The steps for extracting the MFCCs are depicted in Fig. 2.

Figure 2: Block diagram of MFCC feature extraction ref26 .
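The first steps of this pipeline can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the function names, the pre-emphasis coefficient 0.97, and the frame length and hop size passed by the caller are all assumptions, since the text does not give the values used. The FFT, filter bank and DCT stages are omitted.

```python
import math

def pre_emphasis(x, alpha=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1], with y[0] = x[0].
    alpha=0.97 is a common default, assumed here."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(N):
    """Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window to each."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * w[n] for n in range(frame_len)])
    return frames

def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_filters, f_low, f_high):
    """Edge frequencies (Hz) of n_filters triangular filters that are
    evenly spaced on the Mel scale between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
```

Each triangular filter spans three consecutive edges returned by `mel_band_edges`, which is why the function returns `n_filters + 2` values.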

2.2 i-Vector

The i-vector method extracts a compact, fixed-length representation from the input signal. Similarity can then be measured with vector distances on the extracted feature vector, and the vector can also serve as input to further feature transformations. To extract the i-vector, Baum-Welch statistics are computed from MFCC features extracted from the input signal ref37 . The steps of this process are explained in the following.

2.2.1 Universal Background Model (UBM) Training

First, a global model called the UBM is created ref38 . One of the most popular UBMs is the GMM, which is used for text-independent speaker verification ref25 ; ref39 . Some approaches use an HMM, which is a text-dependent model and is therefore not suitable for applications that need identical features for each individual as a signature ref40 ; ref38 ; ref42 . In normal/abnormal heart sound detection tasks, the GMM is trained on features from all individuals in a development set that is large enough to cover the whole feature space. A GMM is defined as a weighted sum of multivariate Gaussian distributions:

$$p(x) = \sum_{c=1}^{C} \pi_c\, \mathcal{N}(x;\, \mu_c, \Sigma_c).$$

In this equation, $x$ is a $D$-dimensional vector, $\pi_c$ is the weight of component $c$, and $\mathcal{N}(x; \mu_c, \Sigma_c)$ is a Gaussian distribution with mean $\mu_c$ and covariance $\Sigma_c$. For simplicity, the covariance is taken to be a diagonal matrix in this work.

2.2.2 Extraction of Baum–Welch Statistics

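Next, the zeroth- and first-order Baum-Welch statistics are extracted using the UBM (in our case, a GMM) ref38 ; ref45 .

Let $X(s) = \{x_1, \dots, x_{T_s}\}$ denote all features collected for signature $s$. The zeroth- and first-order statistics, $N_c(s)$ and $F_c(s)$, for the $c$-th component of the UBM are calculated as follows:

$$N_c(s) = \sum_{t} \gamma_t(c), \qquad F_c(s) = \sum_{t} \gamma_t(c)\,(x_t - \mu_c).$$

Here $x_t$ is the $t$-th feature vector of signature $s$, $\mu_c$ is the mean of the $c$-th component, and $\gamma_t(c)$ is the posterior probability of $x_t$ under the $c$-th component:

$$\gamma_t(c) = \frac{\pi_c\, \mathcal{N}(x_t;\, \mu_c, \Sigma_c)}{\sum_{j=1}^{C} \pi_j\, \mathcal{N}(x_t;\, \mu_j, \Sigma_j)},$$

where $\pi_c$ and $\Sigma_c$ are the weight and covariance of the $c$-th UBM component.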
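As an illustration of this step, the following sketch accumulates the zeroth-order statistics and the mean-centered first-order statistics of one record against a diagonal-covariance GMM. It is a minimal standard-library example under stated assumptions: the function names and the list-based data layout are hypothetical, and the UBM parameters are passed in directly rather than trained.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component for one feature vector."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order stats N[c] and centered first-order stats F[c] of one record."""
    C, D = len(weights), len(means[0])
    N = [0.0] * C
    F = [[0.0] * D for _ in range(C)]
    for x in frames:
        gamma = posteriors(x, weights, means, variances)
        for c in range(C):
            N[c] += gamma[c]
            for d in range(D):
                F[c][d] += gamma[c] * (x[d] - means[c][d])
    return N, F
```

Because the posteriors of each frame sum to one, the entries of `N` always sum to the number of frames in the record, which is a convenient sanity check.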


2.2.3 i-Vector

We denote by $M(s)$ the mean supervector of record $s$, which depends on the individual and represents the feature vectors of that record. The supervector is a $DC$-dimensional vector obtained by concatenating the $D$-dimensional mean vectors of the GMM obtained for each signature. In the i-vector framework, the supervector is modeled as follows ref25 :

$$M(s) = m + T\, w(s).$$

Here $m$ is the mean supervector that is independent of the individual and is computed from the UBM, $T$ is a low-rank matrix, and $w(s)$ is a random latent variable with a standard normal distribution.

The i-vector is the MAP point estimate of $w(s)$, that is, the mean of its posterior distribution given the input signature. Under these assumptions, $M(s)$ follows a Gaussian distribution with mean $m$ and covariance $T T^{\top}$.

2.2.4 Model Parameters

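The mean supervector of the UBM is denoted by $m$; it is obtained by concatenating the mean vectors $\mu_c$ of all UBM components ref45 . In this study, we use the expectation maximization (EM) algorithm to train $T$. Assuming that the UBM has $C$ components and that the feature vectors have dimension $D$, the matrix $\Sigma$ can be written as ref40

$$\Sigma = \begin{bmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_C \end{bmatrix},$$

where $\Sigma_c$ is the covariance matrix of UBM component $c$. If the collection of feature vectors for record $s$ is denoted by $X(s)$, and the likelihood of $X(s)$ under a GMM specified by the supervector $M(s)$ and the super-covariance matrix $\Sigma$ is denoted by $P(X(s) \mid M(s), \Sigma)$, then the EM optimization proceeds in two steps. First, the current value of $T$ is used to find the vector $w(s)$ that maximizes this likelihood ref40 :

$$w(s) = \arg\max_{w} P\big(X(s) \mid m + T w,\ \Sigma\big).$$

Second, $T$ is updated by maximizing

$$\prod_{s} P\big(X(s) \mid m + T\, w(s),\ \Sigma\big).$$

The log-likelihood of each record in this product is ref40

$$\log P\big(X(s) \mid M(s), \Sigma\big) = \sum_{c=1}^{C} \left( N_c(s) \log \frac{1}{(2\pi)^{D/2} |\Sigma_c|^{1/2}} - \frac{1}{2} \sum_{t} \gamma_t(c)\, \big(x_t - \mu_c - T_c\, w(s)\big)^{\top} \Sigma_c^{-1} \big(x_t - \mu_c - T_c\, w(s)\big) \right),$$

where $c$ iterates over all components of the model and $t$ iterates over all feature vectors. Here, $T_c$ is the submatrix of $T$ for component $c$. Given the zeroth- and first-order statistics $N_c(s)$ and $F_c(s)$ of Section 2.2.2, we compute the posterior covariance matrix $L^{-1}(s)$, the mean $E[w(s)]$, and the second moment $E[w(s)\, w^{\top}(s)]$ of $w(s)$ as

$$L(s) = I + \sum_{c=1}^{C} N_c(s)\, T_c^{\top} \Sigma_c^{-1} T_c,$$

$$E[w(s)] = L^{-1}(s) \sum_{c=1}^{C} T_c^{\top} \Sigma_c^{-1} F_c(s),$$

$$E[w(s)\, w^{\top}(s)] = L^{-1}(s) + E[w(s)]\, E[w(s)]^{\top}.$$

Finally, maximizing the product of record likelihoods yields the updated value of $T$ ref40 :

$$T_c = \left( \sum_{s} F_c(s)\, E[w(s)]^{\top} \right) \left( \sum_{s} N_c(s)\, E[w(s)\, w^{\top}(s)] \right)^{-1}.$$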

2.2.5 i-vector Extraction

The i-vector is extracted as the MAP point estimate of $w$. Here, $w$ is a random hidden variable with a standard normal prior distribution, and the i-vector is the mean of the posterior distribution of $w$ given the input record, that is, the posterior mean $E[w(s)]$ computed in Section 2.2.4.

2.3 Important Information Extraction and Dimension Reduction Methods

There are different approaches to extracting important information. In i-vector based tasks, methods such as nuisance attribute projection (NAP) ref25 ; ref45 ; ref46 ; ref47 , within-class covariance normalization (WCCN) ref25 ; ref48 ; ref49 , principal component analysis (PCA) ref49 , and linear discriminant analysis (LDA) ref50 are widely used. Here, we use PCA and a newer method, the Variational AutoEncoder (VAE) ref51 , for extracting important information and reducing dimensionality, as explained in the following sections.

2.3.1 PCA

In this method, important information is extracted from the data as new orthogonal variables, which are referred to as the principal components ref52 .

To achieve this, assume a zero-mean data matrix $X \in \mathbb{R}^{n \times p}$, where $n$ and $p$ indicate the number of repetitions of the experiment and the number of features, respectively. The transformation maps each row vector $x_{(i)}$ of $X$, through a set of $p$-dimensional weight vectors $w_{(k)}$, to a new vector of principal component scores $t_{(i)}$ as follows:

$$t_{k(i)} = x_{(i)} \cdot w_{(k)}, \quad i = 1, \dots, n, \quad k = 1, \dots, l.$$

In other words, the variables $t_1, \dots, t_l$ of $t$ successively inherit the maximum possible variance from $X$, with each weight vector $w_{(k)}$ constrained to be a unit vector ref53 .
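To make the projection concrete, the sketch below computes the first principal component of a small data matrix and projects the data onto it. It is a minimal standard-library illustration, not the implementation used in the paper: it finds the leading eigenvector of the sample covariance matrix by power iteration (a technique chosen here for brevity), and all function names are hypothetical.

```python
import math

def center(X):
    """Subtract the column means so that the data matrix has zero mean."""
    p = len(X[0])
    mean = [sum(row[j] for row in X) / len(X) for j in range(p)]
    return [[row[j] - mean[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of zero-mean data Xc (n rows, p columns)."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(S, iters=200):
    """Unit-norm leading eigenvector of the symmetric matrix S by power iteration.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    p = len(S)
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        v = [sum(S[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(a * a for a in v))
        w = [a / norm for a in v]
    return w

def project(Xc, w):
    """Principal component scores: dot product of each row with w."""
    return [sum(a * b for a, b in zip(row, w)) for row in Xc]
```

Further components could be obtained by removing the projection onto `w` from the data and repeating the procedure; in practice a library eigendecomposition or SVD would normally be used instead.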

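2.3.2 VAE

As one of the most important approaches to extracting informative features, Variational AutoEncoders (VAEs) belong to the family of generative models. Their architecture is as follows: they consist of an odd number of hidden layers, and the weights are shared between the top and bottom layers, which both have the same number of nodes.

A VAE tries to reconstruct its input. Given an input $x$, it encodes the input into latent variables $z$ and then reconstructs the input from these latent variables. For this purpose, training minimizes a cost function (the Mean Square Error (MSE) between input and output). In the optimal case, the input and output are the same. A schematic of a VAE is depicted in Fig. 3.

Figure 3: Block diagram of VAE.

The encoded variable $z$ can be used as an enhanced, significant feature for a better description of the input $x$. It is noteworthy that VAEs are a good solution for different problems such as missing data imputation ref51 .

To obtain the vector $z$, we define a probability function on $x$, called $p(x)$, and try to maximize its likelihood ref54 . Let $\mathbb{E}_{q}[\cdot]$ denote the expectation of a random variable over a probability function $q$. As we have no information about the true posterior $p(z \mid x)$, we compute an approximation of it, called $q(z \mid x)$. Based on Bayes' rule we have ref54

$$\log p(x) = \mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(x \mid z)\, p(z)}{p(z \mid x)} \right].$$

Here, we multiply and divide the term by $q(z \mid x)$, the approximation of $p(z \mid x)$:

$$\log p(x) = \mathbb{E}_{q(z \mid x)}\left[ \log \left( \frac{p(x \mid z)\, p(z)}{p(z \mid x)} \cdot \frac{q(z \mid x)}{q(z \mid x)} \right) \right].$$

So we can conclude that

$$\log p(x) = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathbb{E}_{q(z \mid x)}\left[ \log \frac{q(z \mid x)}{p(z)} \right] + \mathbb{E}_{q(z \mid x)}\left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right].$$

And finally

$$\log p(x) = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - D_{KL}\big(q(z \mid x)\,\|\,p(z)\big) + D_{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big).$$

The term $D_{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big)$ is intractable, and we know that it is greater than or equal to zero.

So we try to maximize the term $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x)\,\|\,p(z)\big)$ as a tractable lower bound. The log-likelihood term $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]$ is a good indicator of how well samples from $q(z \mid x)$ describe the data $x$.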
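The two terms of this lower bound can be written down compactly under common modeling choices. The sketch below is an assumption-laden illustration rather than the configuration used in this work: it assumes a diagonal Gaussian encoder with mean `mu` and log-variance `log_var`, a standard normal prior on the latent variable, and a unit-variance Gaussian decoder, so that the reconstruction term reduces to half the squared error up to an additive constant. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def negative_lower_bound(x, x_recon, mu, log_var):
    """Quantity to minimize: reconstruction error plus KL term.
    The reconstruction term assumes a unit-variance Gaussian decoder
    and drops the additive constant of the Gaussian log density."""
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)
```

When the encoder output matches the prior exactly (zero mean, unit variance), the KL term vanishes and only the reconstruction error remains.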

2.4 Gaussian Mixture Models

In this work, Gaussian mixture models (GMMs) are used as the classifier for the extracted features. GMMs are probabilistic models that are suitable for general distributions consisting of sub-populations ref650 . GMMs use an iterative process to determine which data points belong to each sub-population, without any knowledge of the data point labels. Hence, GMMs are considered unsupervised learning models.

2.4.1 Gaussian model

A GMM is parameterized by two types of values: the weights of the Gaussian mixture components, and the means and variances of those components. The probability density function (PDF) of a $K$-component GMM, with mean $\mu_k$ and covariance matrix $\Sigma_k$ for the $k$-th component, is defined as

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

where $x$ is a feature vector and $\pi_k$ is the weight of mixture component $k$.

| Subset | #Patients | #Records | Abnormal (%) | Normal (%) | Unsure (%) | $w_{a1}$ | $w_{a2}$ | $w_{n1}$ | $w_{n2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Training | 746 | 3153 | 18.1 | 73.03 | 8.8 | 0.8602 | 0.1398 | 0.9252 | 0.0748 |
| Eval. | - | 301 | - | - | - | 0.7888 | 0.2119 | 0.9467 | 0.0533 |
| Test | 308 | 1277 | 12.1 | 77.1 | 10.9 | - | - | - | - |

Table 2: Statistics of the 2016 PhysioNet/CinC dataset, with the proportion of recordings per class and the weight parameters ref2 .

2.4.2 Learning the model

If the number of components is known, Expectation Maximization (EM) is the method most often used to estimate the parameters of the mixture model. In frequentist probability theory, models are usually learned using maximum likelihood estimation, which seeks the model parameters that maximize the probability, or likelihood, of the observed data ref55 . EM is a numerical method for maximum likelihood estimation. It is an iterative algorithm with the property that the likelihood of the data does not decrease from one iteration to the next, which means that it converges to a maximum or a local maximum ref55 .

2.4.3 Maximum Likelihood Estimation of GMMs

Maximum likelihood estimation of Gaussian mixture models consists of two steps. The first step, known as "expectation", computes for each data point the expected assignment to each component $k$ under the current model parameters $\pi_k$, $\mu_k$ and $\Sigma_k$. The second step, known as "maximization", maximizes the expectation calculated in the previous step with respect to the model parameters. This step updates the values of $\pi_k$, $\mu_k$ and $\Sigma_k$.

The entire process is repeated until the algorithm converges, which gives the maximum likelihood estimate. More details are available in ref55 .
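The alternation between the two steps can be illustrated on one-dimensional data. The following sketch is a simplified standard-library example, not the training code used in this work: it handles scalar observations only, runs a fixed number of iterations instead of testing convergence, and applies a small variance floor (an assumption added here to avoid degenerate components).

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, weights, means, variances, iters=50, var_floor=1e-6):
    """Fit a 1-D GMM by EM, starting from the given initial parameters."""
    weights, means, variances = list(weights), list(means), list(variances)
    K, n = len(weights), len(data)
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k])
                 for k in range(K)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, data)) / nk
            variances[k] = max(var_floor, var)
    return weights, means, variances
```

For example, starting from rough initial means on either side of two well-separated clusters, the estimated means move to the cluster centers within a few iterations.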

3 Experimental Setup

3.1 Dataset

The 2016 PhysioNet/CinC challenge was introduced to provide a standard database containing normal and abnormal heart sounds ref2 . The dataset presented in this challenge is a set of heart sound recordings of subjects/patients collected under a variety of environmental conditions (including noisy conditions with low signal quality), as described in ref2 . Many heart sounds were therefore corrupted during recording by noise such as speech, stethoscope motion, breathing and intestinal activity ref2 . This noise makes it difficult to classify normal and abnormal heart sounds. Accordingly, the organizers allowed participants to classify some of the recordings as 'unsure' ref2 , which shows the difficulty of the challenge. The corpus consists of three subsets: training, validation and test. The training subset comprises six labeled databases (each identified by a letter prefix) containing 3153 sound recordings from 764 subjects/patients, with durations of 5-120 s.

The validation subset comprises 150 normal and 151 abnormal heart sounds (with file names prefixed alphabetically), and the test data includes 1277 heart sound trials from 308 subjects/patients. Note that the 301 validation recordings were selected from the training set.

The Challenge test set consists of six databases with 1277 heart sound recordings from 308 subjects.

It should be noted that the test set is not publicly available and will remain private for the purpose of scoring ref2 . The statistics of each subset are summarized in Table 2. More details about the corpus and the 2016 PhysioNet/CinC challenge can be found in ref2 .

In this work, we report our results on the PhysioNet/CinC 2016 dataset. It is worth mentioning that the validation subset consists of 301 records that are copies of training records. Accordingly, in order to report valid results, we first removed the validation records from the training set and then split the remaining training set into two parts, in five phases. In each phase, we randomly assigned 80% of the training set as our training set and the remaining 20% as our validation set, which is used for tuning the parameters. In addition, we used the PhysioNet/CinC 2016 validation set as our test set.

3.2 Evaluation Metrics

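In this task, performance is reported in terms of Equal Error Rate (EER) and Modified Accuracy (MAcc). To compute the EER, we assign a score to each trial and define $P_{fa}(\theta)$ as the false alarm rate and $P_{miss}(\theta)$ as the miss rate at threshold $\theta$. With higher scores indicating normal heart sounds, as in Section 3.3, these rates are

$$P_{fa}(\theta) = \frac{\#\{\text{abnormal trials with score} > \theta\}}{\#\{\text{abnormal trials}\}}, \qquad P_{miss}(\theta) = \frac{\#\{\text{normal trials with score} \le \theta\}}{\#\{\text{normal trials}\}}.$$

The EER is then computed as ref64

$$\mathrm{EER} = P_{fa}(\theta_{EER}) = P_{miss}(\theta_{EER}),$$

where $\theta_{EER}$ is the value of the threshold at which $P_{fa}$ equals $P_{miss}$.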
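A simple way to estimate the EER from a finite set of scores is to sweep the threshold over the observed scores and pick the point where the two error rates are closest. The sketch below illustrates this under stated assumptions: higher scores are taken to indicate normal heart sounds, a false alarm is an abnormal trial scored above the threshold, a miss is a normal trial scored at or below it, and the function names are hypothetical.

```python
def error_rates(normal_scores, abnormal_scores, threshold):
    """False alarm and miss rates at a given threshold."""
    p_fa = sum(s > threshold for s in abnormal_scores) / len(abnormal_scores)
    p_miss = sum(s <= threshold for s in normal_scores) / len(normal_scores)
    return p_fa, p_miss

def equal_error_rate(normal_scores, abnormal_scores):
    """Approximate EER: sweep thresholds over all observed scores and return
    (eer, threshold) where the two error rates are closest."""
    best = None
    for th in sorted(set(normal_scores) | set(abnormal_scores)):
        p_fa, p_miss = error_rates(normal_scores, abnormal_scores, th)
        gap = abs(p_fa - p_miss)
        if best is None or gap < best[0]:
            best = (gap, (p_fa + p_miss) / 2, th)
    return best[1], best[2]
```

With finitely many trials the two rates may never be exactly equal, so the sketch reports the average of the two rates at the closest threshold.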

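For the MAcc computation, recordings are classified into three classes (normal, abnormal or unsure), with two reference categories (good and poor signal quality) for each of the normal and abnormal classes. The modified sensitivity ($Se$) and specificity ($Sp$) are computed according to

$$Se = w_{a1} \frac{Aa_1}{Aa_1 + Aq_1 + An_1} + w_{a2} \frac{Aa_2 + Aq_2}{Aa_2 + Aq_2 + An_2},$$

$$Sp = w_{n1} \frac{Nn_1}{Na_1 + Nq_1 + Nn_1} + w_{n2} \frac{Nn_2 + Nq_2}{Na_2 + Nq_2 + Nn_2},$$

where $w_{a1}$ and $w_{a2}$ are the percentages of abnormal recordings with good and poor signal quality, respectively, and $w_{n1}$ and $w_{n2}$ are the percentages of normal recordings with good and poor signal quality, respectively. In the counts, the first letter gives the reference label (A for abnormal, N for normal), the second letter gives the decision (a for abnormal, q for unsure, n for normal), and the subscript gives the signal quality (1 for good, 2 for poor).

For all 3153 training set recordings, the weight parameters $w_{a1}$, $w_{a2}$, $w_{n1}$ and $w_{n2}$ are equal to 0.8602, 0.1398, 0.9252 and 0.0748, respectively. These parameters were also calculated for the validation set, where they were reported as 0.78881, 0.2119, 0.9467 and 0.0533, respectively. The "Score" for this challenge is computed using the following equation:

$$MAcc = \frac{Se + Sp}{2}.$$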
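The computation can be sketched as follows, assuming the modified sensitivity and specificity take the weighted form used for scoring in the PhysioNet/CinC 2016 challenge, in which "unsure" decisions count as errors for good-quality recordings and as correct for poor-quality ones. The dictionary keys and the function name are hypothetical conventions chosen for this example: the first letter is the reference label (A abnormal, N normal), the second the decision (a abnormal, q unsure, n normal), and the digit the signal quality (1 good, 2 poor).

```python
def modified_accuracy(counts, w_a1, w_a2, w_n1, w_n2):
    """Return (Se, Sp, MAcc) from a dict of recording counts such as
    counts["Aa1"] = abnormal, good-quality recordings classified as abnormal."""
    c = counts
    se = (w_a1 * c["Aa1"] / (c["Aa1"] + c["Aq1"] + c["An1"])
          + w_a2 * (c["Aa2"] + c["Aq2"]) / (c["Aa2"] + c["Aq2"] + c["An2"]))
    sp = (w_n1 * c["Nn1"] / (c["Na1"] + c["Nq1"] + c["Nn1"])
          + w_n2 * (c["Nn2"] + c["Nq2"]) / (c["Na2"] + c["Nq2"] + c["Nn2"]))
    return se, sp, (se + sp) / 2
```

When no recording is labeled "unsure", as in the experiments reported here, all `q` counts are simply zero.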


3.3 Scoring and Decision Making

To assign a score to a given heart sound with the GMM classifier, we proceed as follows. First, we extract the i-vectors from our training set, project them into the new space using PCA or VAE, and use them to train two GMMs (one for normal and one for abnormal heart sounds) with different numbers of components via EM iterations. In the next step, the score for each trial is obtained by computing the log-likelihood ratio:

$$\mathrm{score}(w) = \log p(w \mid \lambda_{normal}) - \log p(w \mid \lambda_{abnormal}),$$

where $w$ is the i-vector corresponding to the test record, and $\lambda_{normal}$ and $\lambda_{abnormal}$ denote the GMMs for normal and abnormal heart sounds, respectively. After the score is computed, a simple global threshold is applied to it to make the final normal/abnormal decision. If the score is higher than the threshold, the test heart sound is labeled as normal; otherwise, it is labeled as abnormal. In this paper, we used a global threshold so that we could plot the detection error trade-off (DET) and detection accuracy trade-off (DAT) curves.
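The scoring and decision rule can be sketched as below. This is a minimal standard-library illustration under stated assumptions: each GMM is represented as a hypothetical `(weights, means, variances)` tuple with diagonal covariances, the models are given rather than trained, and the default threshold of 0 is a placeholder, since the threshold value used in the experiments is not stated here.

```python
import math

def log_gmm_diag(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    mx = max(logs)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def llr_score(x, gmm_normal, gmm_abnormal):
    """Log-likelihood ratio: log p(x | normal) - log p(x | abnormal)."""
    return log_gmm_diag(x, *gmm_normal) - log_gmm_diag(x, *gmm_abnormal)

def decide(score, threshold=0.0):
    """Global-threshold decision: scores above the threshold are 'normal'."""
    return "normal" if score > threshold else "abnormal"
```

Sweeping `threshold` over a range of values and recording the resulting error rates yields the points of a DET curve.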

4 Experimental Results

In this section, we first briefly introduce the baseline system and then present the experimental results in two parts. In Section 4.2, we investigate the effect of the number of GMM components and of the i-vector dimensionality using the whole training set. In Section 4.3, we study the effect of different training set sizes on our proposed approach.

4.1 Baseline System

In this paper, we consider the approach proposed in ref56 as the baseline system. The baseline system uses the physionet 2016 dataset in the same way as our system. The baseline method is based on asynchronous frames ref56 ; accordingly, 103228 frames were extracted from the physionet 2016 dataset. To report the results, the authors repeated their experiments five times and reported the average of the obtained results. The sensitivity, specificity and mean accuracy were reported as 0.845, 0.785 and 0.815, respectively.

4.2 Effects of GMM Components Count and i-vectors Dimensionality

The first part of our experiments investigates the effect of the number of GMM components, the effect of the i-vector dimension Without Applying (W.A) VAE or PCA, and finally the effect of reducing the i-vector dimension by applying PCA and VAE. Table 3 presents the EERs and MAccs obtained on the test set with these approaches. It is worth mentioning that we did not label any data as “unsure”; we assigned a “normal” or “abnormal” label to every test recording. Furthermore, in this part, the whole training set was used to train our system.

| Initial i-vector dim. | Dim. after PCA/VAE | 64 comp. EER% | 64 comp. Se% | 64 comp. Sp% | 64 comp. MAcc% | 128 comp. EER% | 128 comp. Se% | 128 comp. Sp% | 128 comp. MAcc% | 256 comp. EER% | 256 comp. Se% | 256 comp. Sp% | 256 comp. MAcc% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | W.A | 12 | 85.3 | 91.3 | 88.3 | 9.1 | 94 | 88.07 | 91.03 | 12.22 | 92.66 | 83.4 | 8.05 |
| 64 | 16 | 12.4/11.8 | 82.6/88 | 92.7/88.7 | 87.56/88.35 | 7.2/5.9 | 91.3/93.3 | 94.7/94.7 | 93/94 | 10.01/7.55 | 94/94 | 89.4/96.02 | 93/95.01 |
| 64 | 32 | 10.1/11.05 | 85.3/88 | 94.03/90.7 | 89.66/89.35 | 8.1/6.5 | 91.3/93.3 | 92.71/93.3 | 92/93.3 | 8.8/7.06 | 92/86.6 | 90/90 | 91/88.3 |
| 64 | 64 | 10.21/11.3 | 88.33/87.33 | 92.05/90 | 89.69/88.66 | 8.7/6.15 | 93.3/95.3 | 89.4/93.37 | 91.35/94.33 | 11.8/8.33 | 92/90 | 84.1/92.05 | 88.05/91.02 |
| 128 | W.A | 14.72 | 84 | 86.75 | 85.37 | 8.38 | 94.66 | 88.74 | 91.7 | 11.07 | 90.66 | 86.09 | 89.37 |
| 128 | 16 | 11.8/10.2 | 84/89.3 | 92.71/91.39 | 88.35/90.3 | 7.41/5.64 | 90.66/4.66 | 95.36/95.36 | 93.01/95.01 | 7.23/5.91 | 96.6/94 | 89.4/96.02 | 93/95.01 |
| 128 | 32 | 8.35/9.15 | 89.3/90 | 94.03/93.37 | 91.66/91.68 | 6.11/5.88 | 91.33/94 | 96.02/96.02 | 93.07/95.01 | 10.85/5.75 | 93.33/94.66 | 88.74/95.36 | 91.03/95.01 |
| 128 | 64 | 9/8.48 | 89.33/90 | 92.71/93.37 | 91.01/91.68 | 6.16/5.23 | 96.66/96 | 90.72/92.71 | 93.69/94.35 | 10.49/6.18 | 94/92.66 | 87.41/94.7 | 90.70/93.68 |
| 256 | W.A | 12.55 | 82 | 92.05 | 87.02 | 10.81 | 87.33 | 90.72 | 89.02 | 11.30 | 97.33 | 80.13 | 88.73 |
| 256 | 16 | 12.7/8.33 | 80/90.66 | 96.68/91.39 | 88.34/91.02 | 7.53/6.48 | 90.63/94 | 94.77/94.03 | 92.65/94.01 | 7.81/5.59 | 96/96.6 | 88.74/94.7 | 92.37/95.65 |
| 256 | 32 | 12.23/8.76 | 86.6/94.66 | 93.37/94.03 | 89.98/94.34 | 7.40/5.14 | 92.66/96 | 94.03/94.7 | 93.34/95.35 | 7.66/5.36 | 96.6/94 | 89.4/96.6 | 93/95.3 |
| 256 | 64 | 10.28/9.4 | 84.66/94.66 | 93.37/94.7 | 89.01/94.68 | 8.38/4.1 | 94/97.33 | 90.06/94.70 | 92.03/96.01 | 10.5/4.77 | 96/95.3 | 85.4/96.6 | 90.7/95.95 |
| 512 | W.A | 11.65 | 86 | 92.71 | 89.35 | 8.42 | 95.33 | 88.74 | 92.03 | 11.68 | 94.66 | 86.09 | 90.37 |
| 512 | 16 | 12.2/6.83 | 86.66/92.66 | 92.71/94.03 | 89.68/93.34 | 5.87/5.23 | 94/94.66 | 96.02/94.70 | 95.01/94.68 | 8.29/8.53 | 95.33/95.33 | 89.40/87.41 | 92.36/91.37 |
| 512 | 32 | 10.25/9.66 | 90/94.66 | 93.37/94.70 | 91.68/94.68 | 5.5/4.63 | 93.33/95.33 | 95.35/95.36 | 94.34/95.34 | 9.14/5.31 | 94/95.33 | 88.74/95.36 | 91.37/95.34 |
| 512 | 64 | 8.83/6.44 | 89.33/93.33 | 92.71/93.37 | 91.02/93.35 | 8.87/5.1 | 93.33/94.66 | 90.72/94.70 | 92.01/94.68 | 14.02/8.66 | 91.33/94 | 81.45/92.71 | 86.39/93.35 |
| 1024 | W.A | 18.92 | 78.66 | 84.1 | 81.38 | 10.34 | 92 | 86.09 | 89.04 | 13.77 | 93.33 | 78.8 | 86.06 |
| 1024 | 16 | 12.30/3.46 | 80.66/96 | 94.7/96.68 | 87.68/96.34 | 5.44/2.80 | 91.3/97.33 | 98.67/96.68 | 95/97.05 | 5.79/4.81 | 98.6/96.67 | 93.37/97.3 | 95.98/96.98 |
| 1024 | 32 | 9.59/3.08 | 83.33/97.33 | 96.02/98.68 | 89.67/97 | 6.18/2.95 | 90.66/98.86 | 98.01/96.02 | 94.33/97.34 | 7.74/3.12 | 99.33/95.33 | 87.41/98.67 | 93.37/97 |
| 1024 | 64 | 11.98/2.71 | 87.33/97.33 | 91.39/97.35 | 89.36/97.34 | 6.3/2.80 | 96.66/96 | 88.74/98.67 | 93.7/97.33 | 8.84/3.55 | 98.66/96.6 | 83.44/97.35 | 91.05/96.97 |

Table 3: EER and MAcc comparison based on UBM components count, raw i-vector dimension and dimension of the i-vector after PCA and VAE for the proposed method. Here “W.A” means “Without Applying PCA or VAE”.

In each cell of Table 3, there are two result values (separated by a slash) that represent the effect of using the PCA and VAE techniques, respectively. In addition, the number of components used in the GMMs is specified separately in the table.

Table 3 shows that the i-vector followed by VAE (right side of the slash) performs better than the other configurations. As shown in this table, the best results are achieved with higher-dimensional i-vectors followed by VAE.

The left side of the slash denotes the results of the i-vector followed by PCA. The results obtained by applying PCA are generally not as good as those obtained by applying VAE.

Discussion: This is because the VAE is trained to minimize a cost function defined as the MSE between its input (the whole feature vector) and its output (the reconstructed features). PCA only tries to retain the most important information, whereas the VAE tries to extract features that are able to reproduce the original data. The VAE can therefore keep as much as possible of the information needed to reconstruct the original data, which is why the EER is reduced. On the other hand, increasing the raw i-vector dimension may add useless, sparse features to the feature vector, which leads to classification errors and lower accuracy.

As shown in Table 3, the best EER and MAcc values are generally obtained by the GMMs trained with 128 components. In our proposed system, the GMMs are not well trained with 64 components. Conversely, using 256 components causes over-fitting when the parameters are trained, because of the small amount of training data.

The best results extracted from Table 3 are depicted in Fig. 4 and Fig. 5. The red point-line in Fig. 4 and Fig. 5 represents the best values achieved by the different dimensions of i-vectors without applying PCA or VAE. The blue and green point-lines represent the best values obtained by the different dimensions of i-vectors after applying PCA and VAE, respectively.

As shown in Fig. 4 and Fig. 5, once VAE or PCA has been applied, the EER values generally decrease as the dimension of the i-vectors increases, and the MAcc values increase accordingly. This pattern does not hold for raw i-vectors, which yield irregular EER and MAcc results.

Discussion: A higher-dimensional i-vector includes more detailed information. On the other hand, this information may include useless details and common information. Therefore, the PCA and VAE methods are used to make the representation more informative. As shown in Table 3, applying PCA and VAE can significantly improve the results relative to using raw i-vectors.

Figure 4: DET curve comparison for raw i-vector and its PCA and VAE. In each case, results are reported using the best parameter configuration.
Figure 5: DAT curve comparison for raw i-vector and its PCA and VAE. In each case, results are reported using the best parameter configuration.

Table 4 shows the results obtained by the baseline system and the best results obtained by our proposed systems. The best accuracy achieved by our proposed system is 97.34%, an absolute improvement of 15.84 percentage points over the 81.5% accuracy of the baseline system.

| System | EER% | Se% | Sp% | MAcc% |
|---|---|---|---|---|
| baseline | - | 84.5 | 78.5 | 81.5 |
| i-vector | 8.38 | 88.74 | 95.33 | 92.03 |
| i-vector + PCA | 5.44 | 93.37 | 98.6 | 95.98 |
| i-vector + VAE | 2.71 | 96.02 | 98.86 | 97.34 |

Table 4: Best results for our proposed approaches and the baseline system.

Discussion: In the baseline system presented in ref56 , the extracted features are mostly based on frequency and sub-band features, such as MFCC and Mel-spectrogram. These features are suitable for robust speech or sound detection, but in other applications such as heart sound detection we need features that capture identity, because the heart sound is unique to every individual. As a result, i-vectors can be better features for heart sound identification, and hence can reduce the classification error and improve the accuracy more than approaches based on robust feature extraction.

4.3 Effect of Training the System using Different Size of Training Set

In this section, we evaluate the effect of the training set size on the proposed method. To this end, we randomly divided the training data into 5 folds (each fold includes 20% of the training set). We then enlarged the training set fold by fold and observed the impact on the EER. Table 5 shows the effect of applying different training set sizes to our system with a fixed number of GMM components, namely the number that showed the better results in the first part of our experiments. The values reported in this table are the best results obtained over the different sizes of raw i-vectors and over applying PCA and VAE to them (in each case, results are reported using the parameter configuration that gave the best results).
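The fold-by-fold growth of the training set can be sketched as below. This is only an illustration of the splitting scheme: the function name, the fixed random seed and the placeholder record identifiers are assumptions, and the actual training and evaluation at each size are left out.

```python
import random


def cumulative_training_subsets(records, n_folds=5, seed=0):
    """Shuffle the records, split them into folds, and return nested subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Fold i takes every n_folds-th record starting at position i.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    subsets = []
    current = []
    for fold in folds:
        # Each step adds one more fold (20% of the data when n_folds is 5).
        current = current + fold
        subsets.append(list(current))
    return subsets


# Placeholder record identifiers standing in for training recordings.
subsets = cumulative_training_subsets(range(100))
sizes = [len(s) for s in subsets]
print(sizes)
```

Because every subset contains the previous one, differences between consecutive rows of a table built this way reflect the added fold rather than a fresh random draw.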

As summarized in Table 5, the classification performance improves as the amount of training data increases. The results suggest that increasing the size of the training data beyond 80% leads to a smaller improvement than the increases at smaller training set sizes. According to Table 5, our proposed system performed similarly to the baseline system when only 60% of the training set was used.

| Size of training set | System | EER% | Se% | Sp% | MAcc% |
|---|---|---|---|---|---|
| 20% | Raw i-vector | 37.85 | 86.00 | 34.44 | 60.22 |
| 20% | i-vector + PCA | 30.12 | 95.33 | 52.32 | 73.82 |
| 20% | i-vector + VAE | 31.31 | 40.00 | 88.74 | 64.37 |
| 40% | Raw i-vector | 24.75 | 60.00 | 74.83 | 70.41 |
| 40% | i-vector + PCA | 27.44 | 60.00 | 83.44 | 71.72 |
| 40% | i-vector + VAE | 28.95 | 65.33 | 72.85 | 69.09 |
| 60% | Raw i-vector | 20.38 | 65.33 | 94.7 | 80.10 |
| 60% | i-vector + PCA | 17.82 | 82.00 | 83.44 | 82.70 |
| 60% | i-vector + VAE | 18.95 | 64.67 | 99.34 | 82.00 |
| 80% | Raw i-vector | 11.94 | 89.33 | 87.42 | 88.75 |
| 80% | i-vector + PCA | 8.27 | 88.00 | 92.05 | 90.02 |
| 80% | i-vector + VAE | 4.12 | 93.33 | 98.01 | 95.67 |
| 100% | Raw i-vector | 8.38 | 88.74 | 95.33 | 92.03 |
| 100% | i-vector + PCA | 5.44 | 91.30 | 98.67 | 95.00 |
| 100% | i-vector + VAE | 2.80 | 96.02 | 98.86 | 97.34 |

Table 5: The effect of using different training set sizes on the performance of the proposed system.

Fig. 6 and Fig. 7 show the EER and MAcc of the proposed system, respectively, as a function of the training set size.

As shown in these figures, the MAcc and EER gradually improve as the training set size increases. The number of samples is clearly important for improving the results, since more samples improve generalization and help the system adapt to new samples. The most important question, however, is how the three approaches compare, that is, whether feature reduction is useful or not. First, with a larger training set the raw i-vector shows a smaller improvement than PCA. Dimension reduction generally yields better performance than the raw i-vector on both large and small training sets. The most important point is that VAE gives the best performance when the whole training set is used. The main reason is that the VAE, as a Deep Neural Network (DNN), requires more data to generalize, so its results improve as the amount of data increases. It therefore yields the best results among all approaches.

Figure 6: DET curve comparison for raw i-vector and its PCA and VAE for different training set sizes. In each case, results are reported using the best parameter configuration.
Figure 7: DAT curve comparison for raw i-vector and its PCA and VAE for different training set sizes. In each case, results are reported using the best parameter configuration.

5 Conclusions

This paper proposes a novel method for automatic heart sound classification based on i-vector embedding of MFCC features, in which MFCCs are extracted from heart sounds to represent the characteristics of the subject’s heart sound. The experiments on a public dataset demonstrate the effectiveness of the proposed method. The method is based on a fixed-size i-vector and is therefore insensitive to the length of the input sounds. The combination of MFCC and i-vector is stable and can reflect the key features needed to discriminate the two types of subject accurately. The i-vector feature is more suitable for describing the characteristics of a heart sound than variable-length features, since the sound is always regarded as a whole when producing the i-vector. The proposed method has a low computational cost and can work well even on wearable devices, and it also works well even when the amount of training data is small. In conclusion, the proposed method outperforms the baseline system, reaching an MAcc of 97.34% against 81.5%.

6 Acknowledgment

We thank Mr. Mohammad Elmi and Mr. Majid Osati for comments that greatly improved the manuscript.



  • (1) W. G. MEMBERS, E. J. Benjamin, M. J. Blaha, S. E. Chiuve, M. Cushman, S. R. Das, R. Deo, S. D. de Ferranti, J. Floyd, M. Fornage, et al., Heart disease and stroke statistics—2017 update: a report from the american heart association, Circulation 135 (10) (2017) e146.
  • (2) C. Liu, D. Springer, Q. Li, B. Moody, R. A. Juan, F. J. Chorro, F. Castells, J. M. Roig, I. Silva, A. E. Johnson, et al., An open access database for the evaluation of heart sound algorithms, Physiological Measurement 37 (12) (2016) 2181.
  • (3) A. Moukadem, A. Dieterlen, N. Hueber, C. Brandt, Localization of heart sounds based on s-transform and radial basis function neural network, in: 15th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics (NBC 2011), Springer, 2011, pp. 168–171.
  • (4) S. Sun, Z. Jiang, H. Wang, Y. Fang, Automatic moment segmentation and peak detection analysis of heart sound pattern via short-time modified hilbert transform, Computer methods and programs in biomedicine 114 (3) (2014) 219–230.
  • (5) Z. Yan, Z. Jiang, A. Miyamoto, Y. Wei, The moment segmentation analysis of heart sound pattern, Computer methods and programs in biomedicine 98 (2) (2010) 140–150.
  • (6) P. Sedighian, A. W. Subudhi, F. Scalzo, S. Asgari, Pediatric heart sound segmentation using hidden markov model, in: Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, IEEE, 2014, pp. 5490–5493.
  • (7) P. Bentley, G. Nordehn, M. Coimbra, S. Mannor, R. Getz, The pascal classifying heart sounds challenge 2011 (chsc2011) results, See http://www.peterjbentley.com/heartchallenge/index.html.
  • (8) A. Castro, T. T. Vinhoza, S. S. Mattos, M. T. Coimbra, Heart sound segmentation of pediatric auscultations using wavelet analysis, in: Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, IEEE, 2013, pp. 3909–3912.
  • (9) S. E. Schmidt, C. Holst-Hansen, C. Graff, E. Toft, J. J. Struijk, Segmentation of heart sound recordings by a duration-dependent hidden markov model, Physiological measurement 31 (4) (2010) 513.
  • (10) D. B. Springer, L. Tarassenko, G. D. Clifford, Logistic regression-hsmm-based heart sound segmentation, IEEE Transactions on Biomedical Engineering 63 (4) (2016) 822–832.
  • (11) A. A. Sepehri, J. Hancq, T. Dutoit, A. Gharehbaghi, A. Kocharian, A. Kiani, Computerized screening of children congenital heart diseases, Computer methods and programs in biomedicine 92 (2) (2008) 186–192.
  • (12) H. Uğuz, Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy, Neural Computing and applications 21 (7) (2012) 1617–1628.
  • (13) H. Uğuz, A biomedical system based on artificial neural network and principal component analysis for diagnosis of the heart valve diseases, Journal of medical systems 36 (1) (2012) 61–72.
  • (14) S. Ari, K. Hembram, G. Saha, Detection of cardiac abnormality from pcg signal using lms based least square svm classifier, Expert Systems with Applications 37 (12) (2010) 8019–8026.
  • (15) Y. Zheng, X. Guo, X. Ding, A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification, Expert Systems with Applications 42 (5) (2015) 2710–2721.
  • (16) S. Patidar, R. B. Pachori, N. Garg, Automatic diagnosis of septal defects based on tunable-q wavelet transform of cardiac sound signals, Expert Systems with Applications 42 (7) (2015) 3315–3326.
  • (17) A. Gharehbaghi, I. Ekman, P. Ask, E. Nylander, B. Janerot-Sjoberg, Assessment of aortic valve stenosis severity using intelligent phonocardiography, International journal of cardiology 198 (2015) 58–60.
  • (18) R. Saraçoğlu, Hidden markov model-based classification of heart valve disease with pca for dimension reduction, Engineering Applications of Artificial Intelligence 25 (7) (2012) 1523–1528.
  • (19) A. Quiceno-Manrique, J. Godino-Llorente, M. Blanco-Velasco, G. Castellanos-Dominguez, Selection of dynamic features based on time–frequency representations for heart murmur detection from phonocardiographic signals, Annals of biomedical engineering 38 (1) (2010) 118–137.
  • (20) C. Puri, A. Ukil, S. Bandyopadhyay, R. Singh, A. Pal, A. Mukherjee, D. Mukherjee, Classification of normal and abnormal heart sound recordings through robust feature selection, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 1125–1128.
  • (21) M. Zabihi, A. B. Rad, S. Kiranyaz, M. Gabbouj, A. K. Katsaggelos, Heart sound anomaly and quality detection using ensemble of neural networks without segmentation, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 613–616.
  • (22) C. Potes, S. Parvaneh, A. Rahman, B. Conroy, Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 621–624.
  • (23) J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, K. Sricharan, Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 813–816.
  • (24) N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (4) (2011) 788–798.
  • (25) N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, R. Dehak, Language recognition via i-vectors and dimensionality reduction, in: Twelfth annual conference of the international speech communication association, 2011.
  • (26) D. Martinez, O. Plchot, L. Burget, O. Glembek, P. Matějka, Language recognition in ivectors space, in: Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • (27) M. H. Bahari, R. Saeidi, D. van Leeuwen, et al., Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech.
  • (28) R. Xia, Y. Liu, Using i-vector space model for emotion recognition, in: Thirteenth Annual Conference of the International Speech Communication Association, 2012.
  • (29) H. Khaki, E. Erzin, Continuous emotion tracking using total variability space, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • (30) H. Eghbal-Zadeh, B. Lehner, M. Dorfer, G. Widmer, Cp-jku submissions for dcase-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE).
  • (31) R. Wahid, N. I. Ghali, H. S. Own, T.-h. Kim, A. E. Hassanien, A gaussian mixture models approach to human heart signal verification using different feature extraction algorithms, in: Computer Applications for Bio-technology, Multimedia, and Ubiquitous City, Springer, 2012, pp. 16–24.
  • (32) M. R. Hasan, M. Jamil, M. Rahman, et al., Speaker identification using mel frequency cepstral coefficients, variations 1 (4).
  • (33) V. Tiwari, Mfcc and its applications in speaker recognition, International journal on emerging technologies 1 (1) (2010) 19–22.
  • (34) P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE transactions on speech and audio processing 13 (3) (2005) 345–354.
  • (35) D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing 10 (1-3) (2000) 19–41.
  • (36) H. Zeinali, A. Mirian, H. Sameti, B. BabaAli, Non-speaker information reduction from cosine similarity scoring in i-vector based speaker verification, Computers & Electrical Engineering 48 (2015) 226–238.
  • (37) H. Zeinali, B. BabaAli, H. Hadian, Online signature verification using i-vector representation, IET Biometrics.
  • (38) P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16 (5) (2008) 980–988.
  • (39) W. M. Campbell, D. E. Sturim, D. A. Reynolds, A. Solomonoff, Svm based speaker verification using a gmm supervector kernel and nap variability compensation, in: Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, Vol. 1, IEEE, 2006, pp. I–I.
  • (40) A. Solomonoff, C. Quillen, W. M. Campbell, Channel compensation for svm speaker recognition., in: Odyssey, Vol. 4, Citeseer, 2004, pp. 219–226.
  • (41) A. Solomonoff, W. M. Campbell, I. Boardman, Advances in channel compensation for svm speaker recognition, in: Acoustics, Speech, and Signal Processing, 2005. Proceedings.(ICASSP’05). IEEE International Conference on, Vol. 1, IEEE, 2005, pp. I–629.
  • (42) A. O. Hatch, S. Kajarekar, A. Stolcke, Within-class covariance normalization for svm-based speaker recognition, in: Ninth international conference on spoken language processing, 2006.
  • (43) N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, F. Castaldo, Support vector machines and joint factor analysis for speaker verification.
  • (44) C. R. Rao, The utilization of multiple measurements in problems of biological classification, Journal of the Royal Statistical Society. Series B (Methodological) 10 (2) (1948) 159–203.
  • (45) L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative, J Mach Learn Res 10 (2009) 66–71.
  • (46) H. Abdi, L. J. Williams, Principal component analysis, Wiley interdisciplinary reviews: computational statistics 2 (4) (2010) 433–459.
  • (47) D. Jang, H. Park, G. Choi, Estimation of leakage ratio using principal component analysis and artificial neural network in water distribution systems, Sustainability 10 (3) (2018) 750.
  • (48) D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.
  • (49) D. Reynolds, Gaussian mixture models, Encyclopedia of biometrics (2015) 827–832.
  • (50) D. A. Reynolds, Automatic speaker recognition using gaussian mixture speaker models, in: The Lincoln Laboratory Journal, Citeseer, 1995.
  • (51) M. Adiban, H. Sameti, N. Maghsoodi, S. Shahsavari, Sut system description for anti-spoofing 2017 challenge, in: Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.
  • (52) B. Bozkurt, I. Germanakis, Y. Stylianou, A study of time-frequency features for cnn-based automatic heart sound classification for pathology detection, Computers in biology and medicine.