1 Introduction
Cardiovascular disease (CVD) is the most common cause of death in most countries of the world and the leading cause of disability ref1.
Based on information provided by the World Heart Association in 2017, 17.7 million people die every year due to CVDs, approximately 31% of all global deaths. The most prevalent CVDs are heart attacks and strokes ref1.
In 2013, all 194 members of the World Health Organization agreed to implement the Global Action Plan for the Prevention and Control of Noncommunicable Diseases, a plan covering 2013 to 2020 to prepare against CVDs. Through the implementation of the plan's nine voluntary global targets, the number of premature deaths due to noncommunicable diseases is to be reduced. Two of these targets particularly focus on the prevention and control of CVDs ref1.
Accordingly, in recent years researchers have shown great interest in detecting heart diseases based on heart sounds ref2. Most approaches in this context rely on sound segmentation, feature extraction, and machine-learning classification on different datasets.
Various studies have been conducted on normal/abnormal heart sound detection using segmentation methods.
In ref3, the Shannon energy envelope of the local spectrum is calculated by a new method, which applies the S-transform to every sound produced in the heart sound signal. Sensitivity and positive predictivity were evaluated on 80 heart sound recordings (40 normal and 40 pathological), and both values were reported above 95%. In a study by ref4, an approach was proposed for automatic segmentation using the Hilbert transform. Features for this study included the envelopes near the peaks of S1 and S2 and the transition points from S1 to S2 and vice versa. The database for this study consisted of 7730 s of heart sounds from pathological patients, 600 s from normal subjects, and 1496.8 s from the Michigan MHSDB database. The average accuracy for sounds with mixed S1 and S2 was 96.69%, and for those with separated S1 and S2 it was reported as 97.37%.
Another envelope extraction method used for heart sound segmentation is the Cardiac Sound Characteristic Waveform (CSCW). The work presented in ref5 used this method on only a small set of heart sounds, comprising 9 recordings, and the accuracy was reported as 99.0%. No train-test split was performed for evaluation in this study.
The work in ref6 achieved accuracies of 92.4% for S1 and 93.5% for S2 segmentation by employing homomorphic filtering and an HMM on the PASCAL database ref7. The work investigated in ref8 used the same approach with wavelet analysis on the same database; the reported accuracy was 90.9% for S1 segmentation and 93.3% for S2 segmentation. There is also a study on the expected duration of heart sounds using an HMM and a Hidden Semi-Markov Model (HSMM), introduced in
ref9. In this study, the positions of the S1 and S2 sounds were first labeled in 113 recordings. The authors then calculated Gaussian distributions for the expected duration of each of the four states (S1, systole, S2 and diastole), using the average duration of the mentioned sounds together with an autocorrelation analysis of systolic and diastolic durations. The homomorphic envelope plus three other frequency features (in the 25-50, 50-100 and 100-150 Hz ranges) were among the features used. Gaussian distributions were then calculated to train the HMM states and emission probabilities. Finally, for the decoding process, the forward-backward and Viterbi algorithms were employed, and 98.8% sensitivity and 98.6% positive predictivity were reported. This work was extended with an HSMM alongside logistic regression (for emission probability estimation) to accurately segment noisy, real-world heart sound recordings
ref10. This work also used the Viterbi algorithm to decode state sequences. For evaluation, a database of 10172 s of heart sounds recorded from 112 patients was used. The reported F1 score was 95.63%, improving on the previous state of the art of 86.28% on the same test set.
Other works were developed using methods based on feature extraction and classification with machine-learning classifiers such as ANN, SVM, HMM and kNN.
To distinguish the spectral energy of normal and pathological recordings, the work introduced in ref11 extracted five frequency bands and fed their spectral energy to an ANN. Results on a dataset with 50 recorded sounds showed 95% sensitivity and 93.33% specificity.
In a study by ref12, a discrete wavelet transform combined with fuzzy logic was used for a three-class problem covering normal, pulmonary stenosis, and mitral stenosis. An ANN was employed to classify a dataset of 120 subjects with a 50/50 split between train and test sets. The reported results were 100% sensitivity, 95.24% specificity, and 98.33% average accuracy. Moreover, the same author used time-frequency features as ANN input in ref13. That work reported 90.4% sensitivity, 97.44% specificity, and 95% accuracy on the same dataset for the same problem (three-class classification of normal, pulmonary and mitral stenosis heart valve diseases).
The work in ref14 also performed a study to classify normal and pathological cases using a Least-Squares Support Vector Machine (LS-SVM), employing wavelets to extract features. The method was evaluated on a dataset with heart sounds of 64 patients (32 cases for the train and 32 cases for the test set), and 86.72% accuracy was reported. In ref15, the same classifier was used with wavelet packets, and extracted features such as sample entropy and energy fraction served as input. The dataset for this problem consisted of 40 normal persons and 67 pathological patients, and the authors reported 97.17% accuracy, 93.48% sensitivity and 98.55% specificity. Another study ref16 also used an LS-SVM as classifier with tunable-Q wavelet transform features as input. Evaluation in this study showed 98.8% sensitivity and 99.3% specificity on a dataset comprising 4628 cycles from 163 heart sound recordings, with an unknown number of patients. As another work on SVMs, ref17 employed frequency power with varying-length frames over systole as input features and used a Growing Time SVM (GTSVM) to classify pathological and normal murmurs. Results on 56 persons (26 murmurs and 30 normal) were reported as 86.4% sensitivity and 89.3% specificity.
Another work on HMMs was performed by ref18
where an HMM was fit to the frequency spectrum of the heart cycle, and four HMMs were used to evaluate the posterior probability of the features given to the model for classification. For better results, Principal Component Analysis (PCA) was used as a dimensionality reduction procedure, and the results were reported as 95% sensitivity, 98.8% specificity and 97.5% accuracy on a dataset with 60 samples.
As an approach based on clustering, ref19 employed K-Nearest Neighbor (KNN) on features obtained from various time-frequency representations, extracted from a subset of 22 persons including 16 normal persons and 6 pathological patients. 98% accuracy was reported for this problem, where the likelihood of overtraining guided the choice of KNN parameters. The work investigated in ref19 also chose KNN for clustering the samples as normal and pathological. This study employed two approaches for dimensionality reduction of the extracted time-frequency features: linear decomposition and tiling partition of the mentioned feature plane. Results were obtained on a total of 45 recordings, including 19 pathological and 26 normal, and were reported as 99.0% average accuracy with 11-fold cross-validation.
In the following, to organize these studies and to address the lack of a standard dataset in this context, the PhysioNet/CinC Challenge 2016 and its related database are introduced ref2. This database was collected from a total of 9 independent databases with different numbers and types of patients and different recording quality, over a decade. Some of the related works on PhysioNet 2016 are investigated below:
The work presented in ref21 employed a feature set of 54 features extracted from heart sound timing information using the mutual-information-based minimum Redundancy Maximum Relevance (mRMR) technique, and used a nonlinear radial-basis-function Support Vector Machine (SVM) as classifier. In this work, 77.49% sensitivity and 78.91% specificity were reported on the hidden test set.
In the work investigated in ref22, time, frequency, and time-frequency domain features were employed without any segmentation. To classify these features, an ensemble of 20 feed-forward ANNs was used, achieving an overall score of 91.50% (94.23% sensitivity and 88.76% specificity) on the train set and 85.90% (86.91% sensitivity and 84.90% specificity) on the blind test set.
Table 1. Summary of the works investigated in this section.

| Author | Database | Method | Se% | Sp% | P+% | Acc% |
|---|---|---|---|---|---|---|
| Moukadem et al. (2013) | – | Segmentation | 96/97 | – | 95 | – |
| Sun et al. (2014) | – | Segmentation | – | – | – | 96.69 |
| Yan et al. (2010) | – | Segmentation | – | – | – | 99 |
| Sedighian et al. (2014) | PASCAL | Segmentation | – | – | – | 92.4/93.5 |
| Castro et al. (2013) | PASCAL | Segmentation | – | – | – | 90.9/93.3 |
| Schmidt et al. (2010a) | – | Segmentation | 98.8 | – | 98.6 | – |
| Sepehri et al. (2008) | 36 normal and 54 pathological | Frequency + ANN | 95 | 93.3 | – | – |
| Uguz (2012) | 40 normal, 40 pulmonary and 40 mitral stenosis | Time-frequency + ANN | 90.48 | 97.44 | – | 95 |
| Ari et al. (2010) | 64 patients (normal and pathological) | Wavelet + SVM | – | – | – | 86.72 |
| Zheng et al. (2015) | 40 normal and 67 pathological | Wavelet + SVM | 98.8 | 99.3 | – | 98.9 |
| Gharehbaghi et al. (2015) | 30 normal, 26 innocent and 30 AS | Frequency + SVM | 86.4 | 89.3 | – | – |
| Saracoglu (2012) | 40 normal, 40 pulmonary and 40 mitral stenosis | DFT and PCA + HMM | 95 | 98.8 | – | 97.5 |
| Quiceno-Manrique et al. (2010) | 16 normal and 6 pathological | Time-frequency + kNN | – | – | – | 98 |
| Avendano-Valencia et al. (2010) | 16 normal and 6 pathological | Time-frequency + kNN | 99.56 | 98.54 | – | 99 |
| Puri et al. (2016) | PhysioNet 2016 | mRMR + SVM | 77.49 | 78.91 | – | – |
| Zabihi et al. (2016) | PhysioNet 2016 | Time-frequency + ANN | 86.91 | 84.90 | – | 85.90 |
| Potes et al. (2016) | PhysioNet 2016 | Time-frequency and AdaBoost + CNN | 94.24 | 77.8 | – | 86.02 |
| Rubin et al. (2016) | PhysioNet 2016 | MFCC + CNN | 75 | 100 | – | 88 |
The work presented in ref23 reports 0.9424 sensitivity, 0.7781 specificity and an overall score of 0.8602 on the blind data set, using a total of 124 time-frequency features and applying a variant of AdaBoost together with convolutional neural network (CNN) classifiers.
The work in ref24 employed a CNN for classification of normal and abnormal heart sounds based on MFCC features. Experimental results were reported for two phases, corresponding to different training sets. The sensitivity, specificity and overall scores on the hidden set for phase one were 75%, 100% and 88%, respectively; for phase two they were 76.5%, 93.1% and 84.8%, respectively. Table 1 summarizes the works investigated in this section.
In this study, we focus on detecting heart diseases using heart sounds based on the PhysioNet/CinC Challenge 2016, and we aim to provide an approach relying on the identity vector (i-vector).
Although the i-vector was originally used for speaker recognition applications ref25, it is currently used in various fields such as language identification ref26; ref27, accent identification ref28, gender recognition, age estimation, emotion recognition ref29; ref30, audio scene classification ref31, etc. In this study, we adopt the i-vector for normal/abnormal heart sound detection. Our motivation for using this method in this context is the fact that human heart sounds can be considered physiological traits of a person ref32, which are distinctive and permanent unless accidents, illnesses, genetic defects, or aging have altered or destroyed them ref32.
In this work, we utilized two features, comprising Mel-Frequency Cepstral Coefficients (MFCCs) and the i-vector, and we used Gaussian Mixture Models (GMMs) as classifier. To detect a normal heart sound signal from an abnormal one, we extracted MFCC features from the given heart sound signal and then obtained the i-vector of each heart sound signal using the MFCCs.
Furthermore, to classify a normal heart sound from an abnormal one, we trained GMMs and then applied the i-vectors to them. The rest of this paper is organized as follows: in Section 2, the features and the classifier are introduced. The experimental setup and the experimental results are reported in Sections 3 and 4, respectively. Eventually, the conclusion is presented in Section 5.
2 Methodology
In this paper, we propose a method that uses the i-vector for normal/abnormal heart sound detection. In this method, we first train a GMM, serving as the universal background model (UBM), using all heart sounds (i.e. both normal and abnormal heart sounds) in our training set. After training this UBM, the zero- and first-order statistics of the training features are extracted accordingly. Then, using these statistics, we train an i-vector extractor with several iterations of the EM algorithm explained in Section 2.4.2. After training the i-vector extractor, we extract i-vectors from all records in our training set. At this stage, we have extracted several i-vectors with different dimensions for each record in the training set, and we use them to train the intra-class variation reduction methods. Specifically, we train a VAE on the original heart sounds in our training set and use it to transform the i-vectors into the new space. After training, we extract i-vectors from the heart sounds and transform them using PCA and VAE. Therefore, we have a representative i-vector for each record, which we use for scoring.
Fig. 1 briefly illustrates our proposed system.
2.1 Mel-frequency Cepstral Coefficients
MFCCs have been employed over the years as one of the most important features for speaker recognition ref33. The MFCC attempts to model human hearing perception by focusing on low frequencies (0-1 kHz) ref34. In other words, the varying critical bandwidths of the human ear are the basis of what we know as MFCCs. In addition, the Mel frequency scale is applied to extract critical features of speech, especially its pitch.
2.1.1 MFCC Extraction
In the following, we explain how the MFCC features are extracted. Initially, the given signal is pre-emphasized. "Pre-emphasis" means the reinforcement of high-frequency components passed by a high-pass filter ref33. The output of the filter is as follows (with pre-emphasis coefficient a, typically close to 0.97):

y(n) = x(n) − a·x(n−1)    (1)
In the next step, the pre-emphasized signal is divided into short-time frames and a Hamming window is applied to each frame. The Hamming window is defined as

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    (2)

where N is the number of samples in each frame.
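These two steps can be sketched in plain Python; the pre-emphasis coefficient of 0.97 and the toy frame below are common illustrative choices, not values specified in the text:

```python
import math

def pre_emphasis(signal, a=0.97):
    """High-pass filter the signal: y(n) = x(n) - a*x(n-1)."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming_window(N):
    """Hamming window of length N: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# apply both steps to one toy frame
frame = [0.1, 0.3, -0.2, 0.5, 0.0, -0.4]
emphasized = pre_emphasis(frame)
windowed = [s * w for s, w in zip(emphasized, hamming_window(len(emphasized)))]
```

The window tapers the frame edges toward zero, which reduces spectral leakage in the FFT of the next step.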
To analyze the windowed frames in the frequency domain, an N-point Fast Fourier Transform (FFT) is applied to convert them into the frequency domain:

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N},  k = 0, ..., N − 1    (3)
A logarithmic power spectrum is obtained on a Mel scale using a filter bank consisting of L filters:

E(l) = log( Σ_{k=L_l}^{U_l} |X(k)|² H_l(k) ),  l = 1, ..., L    (4)

where H_l(k) is the l-th triangular filter, and L_l and U_l are the lower and upper limits of the l-th filter, respectively.
A given frequency f in hertz can be converted to the Mel scale as follows:

Mel(f) = 2595 log10(1 + f / 700)    (5)
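A small sketch of this mapping and its inverse, which is what places the triangular filter edges uniformly on the Mel axis; the 8 kHz upper edge and the 5-filter spacing below are illustrative assumptions, not values from the text:

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale (Eq. 5)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular filter edges back in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# filter edges spaced uniformly in Mel between 0 Hz and an assumed 8 kHz
edges_mel = [i * hz_to_mel(8000.0) / 5 for i in range(6)]
edges_hz = [mel_to_hz(m) for m in edges_mel]
```

Because the mapping is logarithmic, the resulting Hz edges are dense at low frequencies and sparse at high ones, mimicking the ear's critical bands.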
Eventually, the MFCC coefficients are obtained by applying the Discrete Cosine Transform (DCT) to the log filter-bank energies E(l):

c(m) = Σ_{l=1}^{L} E(l) cos( πm(l − 0.5) / L ),  m = 1, ..., M    (6)

where c(m) is the m-th feature obtained from the frequency components E(l). The steps for extracting the MFCCs are depicted in Fig. 2.
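The final DCT step can be sketched as follows; the 5 filter-bank energies and the 3 requested coefficients are toy values (here the m = 0 energy term is included for simplicity):

```python
import math

def dct_mfcc(log_energies, n_coeffs):
    """DCT-II over the log filter-bank energies (cf. Eq. 6)."""
    L = len(log_energies)
    return [sum(e * math.cos(math.pi * m * (l + 0.5) / L)
                for l, e in enumerate(log_energies))
            for m in range(n_coeffs)]

# toy log filter-bank energies -> first 3 cepstral coefficients
mfcc = dct_mfcc([1.2, 0.8, 0.5, 0.3, 0.2], 3)
```

The DCT decorrelates the strongly correlated neighboring filter-bank energies, which is what makes diagonal-covariance GMMs a reasonable model for MFCCs later on.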
2.2 i-Vector
The i-vector procedure mainly involves extracting a compact, fixed-length vector from the input signal. Similarity is then measured using distances between the extracted vectors, and the representation also allows transforming the input into other feature spaces. In order to extract the i-vector, Baum-Welch statistics are computed using MFCC features extracted from the input signal ref37. In the following, the steps of this process are explained.
2.2.1 Universal Background Model (UBM) Training
First of all, a global model named the UBM is created ref38. One of the most popular UBM models is the GMM, which supports text-independent speaker verification ref25; ref39. Some approaches use an HMM, which is a text-dependent model and hence not suitable for applications that need identical features for each individual as a signature ref40; ref38; ref42. In normal/abnormal heart sound detection tasks, the GMM is trained on features from all individuals in a development set, which is large enough to cover the whole feature space. The GMM is defined as a weighted sum of multivariate Gaussian distributions:

p(x) = Σ_{c=1}^{C} w_c N(x; μ_c, Σ_c)    (7)

In this equation, x describes a D-dimensional feature vector, w_c is the weight of each component, and N(x; μ_c, Σ_c) is a Gaussian distribution with mean μ_c and covariance Σ_c. For simplicity, the covariance is often defined as a diagonal matrix, and in this work it is considered diagonal.
2.2.2 Extraction of Baum–Welch Statistics
Next, the zero- and first-order Baum-Welch statistics are extracted using the UBM (in our case, a GMM) ref38; ref45.
Suppose X^s = {x_1^s, ..., x_T^s} is the whole set of features collected for the s-th signature; the zero- and first-order statistics, named N_c(s) and F_c(s), for the c-th component of the UBM (here a GMM) are calculated as follows:

N_c(s) = Σ_{t=1}^{T} P(c | x_t^s)    (8)

F_c(s) = Σ_{t=1}^{T} P(c | x_t^s) (x_t^s − μ_c)    (9)

Here x_t^s is the t-th feature vector of signature s, μ_c indicates the mean of the c-th component, and P(c | x_t^s) is the posterior probability of x_t^s under the c-th component, described as:

P(c | x_t^s) = w_c N(x_t^s; μ_c, Σ_c) / Σ_{j=1}^{C} w_j N(x_t^s; μ_j, Σ_j)    (10)
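The statistics of Eqs. 8-10 can be sketched in plain Python for a toy diagonal-covariance UBM; all numbers below are illustrative, not values from the paper:

```python
import math

def gauss_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2 * math.pi * vi)
    return p

def baum_welch_stats(frames, weights, means, variances):
    """Zero-order N_c (Eq. 8) and centered first-order F_c (Eq. 9) statistics."""
    C, D = len(weights), len(means[0])
    N = [0.0] * C
    F = [[0.0] * D for _ in range(C)]
    for x in frames:
        likes = [w * gauss_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for c in range(C):
            post = likes[c] / total          # Eq. 10: posterior P(c | x_t)
            N[c] += post                     # Eq. 8
            for d in range(D):
                F[c][d] += post * (x[d] - means[c][d])  # Eq. 9 (centered)
    return N, F

# toy 2-component, 2-D UBM and three feature frames
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [2.9, 3.1], [0.0, 0.1]]
N, F = baum_welch_stats(frames, weights, means, variances)
```

Since the posteriors of each frame sum to one, the zero-order statistics always sum to the number of frames, which is a quick sanity check on any implementation.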
2.2.3 i-Vector
We consider M as the mean supervector for each individual, which is dependent on that individual and represents the feature vectors of each record. The supervector is a DC-dimensional vector acquired by concatenating the D-dimensional mean vectors of the GMM obtained from each signature. For the i-vector, we can model a supervector as follows ref25:

M = m + Tw    (11)

Here, m is the mean supervector calculated from the UBM, which is independent of the individual; T is a low-rank matrix; and w indicates a random latent variable with a standard normal distribution.
The i-vector is obtained as the MAP point estimate of the variable w and is the mean of the posterior probability of w given the specific input signature. Under these assumptions, we assume w has a Gaussian distribution with zero mean and covariance equal to the identity matrix I.

2.2.4 Model Parameters
The UBM supervector's mean is always shown as m; therefore, if one appends all component means μ_c, the supervector m is obtained ref45. In this study, we have used the expectation-maximization (EM) algorithm to train T. Assuming that the UBM has C components and the feature vectors have dimension D, the supercovariance matrix can be described as ref40

Σ = diag(Σ_1, Σ_2, ..., Σ_C)    (12)
where Σ_c is the covariance matrix of the c-th UBM component. If the collection of feature vectors for record s is denoted X_s, and the probability of X_s under a GMM specified by the supervector m + Tw and the supercovariance matrix Σ is defined by p(X_s | m + Tw, Σ), then the EM optimization can be realized in two steps. First, the current value of the matrix T is used to find the vector w_s maximizing the probability. Eq. 13 shows this procedure ref40

w_s = argmax_w p(X_s | m + Tw, Σ)    (13)
Second, the value of T is updated by maximizing the following relation:

T = argmax_T Σ_s log p(X_s | m + T w_s, Σ)    (14)
The log-likelihood for each record is computed as ref40

log p(X_s | m + Tw, Σ) = Σ_t Σ_c γ_c(x_t) log( w_c N(x_t; μ_c + T_c w, Σ_c) )    (15)
where c iterates over all components of the model and t iterates over all feature vectors. Here, T_c is the c-th component submatrix of T. Assuming the zero- and first-order statistics have been computed employing Eq. 8 and Eq. 9, respectively, we now compute the posterior precision matrix L_s (whose inverse is the posterior covariance), the mean E[w_s], and the second moment E[w_s w_s^T] for w_s as

L_s = I + Σ_c N_c(s) T_c^T Σ_c^{−1} T_c    (16)

E[w_s] = L_s^{−1} Σ_c T_c^T Σ_c^{−1} F_c(s)    (17)

E[w_s w_s^T] = L_s^{−1} + E[w_s] E[w_s]^T    (18)
Ultimately, by maximizing Eq. 14, the updated value of T can be calculated as ref40

T_c = ( Σ_s F_c(s) E[w_s]^T ) ( Σ_s N_c(s) E[w_s w_s^T] )^{−1}    (19)
2.2.5 i-Vector Extraction
The MAP point estimate of w is computed to extract the i-vector; that is, the i-vector is given by Eq. 17, where w is a random hidden variable with a standard normal distribution. Note that the i-vector is the mean of the posterior probability of w for the input record.
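As an illustration, the posterior-mean computation of Eq. 17 can be written out for the simplified case of a single-component UBM with diagonal covariance and a rank-2 total-variability matrix (the closed-form 2x2 inverse keeps the example dependency-free; this is a sketch, not the full multi-component extractor):

```python
def extract_ivector(T, sigma, N, F):
    """MAP i-vector w = (I + N * T' S^-1 T)^-1 * T' S^-1 F for a
    single-component UBM with diagonal covariance sigma, zero-order
    statistic N (a scalar here) and centered first-order statistic F.
    The latent dimension is fixed to 2 so the inverse stays closed-form."""
    D = len(T)
    # A = T' S^-1 T  (2x2) and b = T' S^-1 F  (2-vector)
    A = [[sum(T[d][i] * T[d][j] / sigma[d] for d in range(D))
          for j in range(2)] for i in range(2)]
    b = [sum(T[d][i] * F[d] / sigma[d] for d in range(D)) for i in range(2)]
    # precision matrix L = I + N * A (Eq. 16), inverted in closed form
    L = [[(1.0 if i == j else 0.0) + N * A[i][j] for j in range(2)]
         for i in range(2)]
    det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
    inv = [[L[1][1] / det, -L[0][1] / det],
           [-L[1][0] / det, L[0][0] / det]]
    # posterior mean (Eq. 17) is the i-vector
    return [inv[0][0] * b[0] + inv[0][1] * b[1],
            inv[1][0] * b[0] + inv[1][1] * b[1]]

# toy numbers: 2-D features, identity-like T, unit variances
T = [[1.0, 0.0], [0.0, 1.0]]
ivec = extract_ivector(T, [1.0, 1.0], 1.0, [2.0, 3.0])
```

With these toy inputs L reduces to 2I, so the i-vector is simply half of T'·Σ⁻¹·F, which makes the example easy to verify by hand.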
2.3 Important Information Extraction and Dimension Reduction Methods
There are different approaches to extract important information. In i-vector based tasks, methods such as nuisance attribute projection (NAP) ref25; ref45; ref46; ref47, within-class covariance normalization (WCCN) ref25; ref48; ref49, principal component analysis (PCA) ref49, and linear discriminant analysis (LDA) ref50 are widely used. Here, we used PCA and a newer method called Variational Auto-Encoders (VAE) ref51 to extract important information and reduce dimensionality; both are explained in the following sections.
2.3.1 PCA
In this method, important information is extracted from the data as new orthogonal variables, which are referred to as the principal components ref52 .
To achieve this, assume a given zero-mean data matrix X with n rows and p columns (n and p indicate the number of experiment repetitions and the number of features, respectively). Accordingly, to define the transformation, consider a row vector x_(i) of X, which is mapped by a set of p-dimensional weight vectors w_(k) to a new vector of principal component scores t_(i), as follows:

t_k(i) = x_(i) · w_(k)    (20)

In other words, the vector t (consisting of the scores t_1, ..., t_l) inherits the maximum variance from X, with each weight vector w_(k) constrained to be a unit vector ref53.

2.3.2 VAE
As one of the most important approaches to extracting informative features, Variational Auto-Encoders (VAEs) are among the generative models. Architecturally, they are made of an odd number of hidden layers ref51, and the weights can be shared between the corresponding top and bottom layers, which have the same number of nodes.
A VAE tries to reconstruct its input. Consider X as the input of a VAE: the encoder maps the input to latent variables z, and the reconstructed input is then produced from the latent variables by the decoder. For this purpose, the training process minimizes a cost function (the Mean Square Error (MSE) between input and output). In the optimal situation, the input and output are the same. A schematic of a VAE is depicted in Fig. 3.
The encoded variable z can be used as an enhanced, significant feature for a better description of the input X. It is noteworthy that VAEs are a good solution for different problems such as missing-data imputation ref51.
To obtain the vector z, we define a probability P(X) over the data and try to maximize its likelihood, P(X) = E_{z~P(z)}[P(X|z)] ref54, where E_{z~P(z)} denotes the expectation of the random variable over the probability function P(z). As we have no information about the true posterior P(z|X), we compute an approximation of it, called Q(z|X). Hence, starting from the Kullback-Leibler divergence between the two, we have ref54

D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[ log Q(z|X) − log P(z|X) ]    (21)

Here, applying Bayes' rule, P(z|X) = P(X|z)P(z)/P(X), to the term inside the expectation gives

D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[ log Q(z|X) − log P(X|z) − log P(z) ] + log P(X)    (22)

So we can conclude that

log P(X) − D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[ log P(X|z) ] − D_KL[Q(z|X) || P(z)]    (23)

And finally

log P(X) ≥ E_{z~Q}[ log P(X|z) ] − D_KL[Q(z|X) || P(z)]    (24)

The term D_KL[Q(z|X) || P(z|X)] is intractable, but we know it is greater than or equal to zero. So we instead maximize the tractable lower bound on the right-hand side of Eq. 24. The log-likelihood is a good indicator of how well samples from z can describe the data X.
2.4 Gaussian Mixture Models
In this work, Gaussian mixture models (GMMs) are used as the classifier for the extracted features. GMMs are probabilistic models, and they are suitable for general distributions consisting of subpopulations ref650. GMMs use an iterative process to determine which data points belong to each subpopulation, without any knowledge of the data point labels. Hence, GMMs are considered unsupervised learning models.
2.4.1 Gaussian model
The GMM is specified by two types of values: the weights of the Gaussian mixture components, and the means and variances of the components. The probability distribution function (PDF) of a C-component GMM, with mean μ_c and covariance matrix Σ_c for the c-th component, is defined as

p(x) = Σ_{c=1}^{C} w_c g(x | μ_c, Σ_c)    (25)

g(x | μ_c, Σ_c) = (2π)^{−D/2} |Σ_c|^{−1/2} exp( −(1/2)(x − μ_c)^T Σ_c^{−1} (x − μ_c) )    (26)
Table 2. Statistics of the dataset subsets (proportions of recordings in %, and the quality weight parameters).

| Subset | #Patients | #Records | Abnormal % | Normal % | Unsure % | wa | ua | wn | un |
|---|---|---|---|---|---|---|---|---|---|
| Training | 746 | 3153 | 18.1 | 73.03 | 8.8 | 0.8602 | 0.1398 | 0.9252 | 0.0748 |
| Eval. | – | 301 | – | – | – | 0.7888 | 0.2119 | 0.9467 | 0.0533 |
| Test | 308 | 1277 | 12.1 | 77.1 | 10.9 | – | – | – | – |
where x is a feature vector and w_c is the weight of the mixture component c.
2.4.2 Learning the model
If the number of components is defined, Expectation-Maximization (EM) is the method most often used to estimate the parameters of the mixture model. In frequentist probability theory, models are usually learned using maximum likelihood estimation, which maximizes the likelihood of the observed data with respect to the model parameters ref55. EM is a numerical method for maximum likelihood estimation. It is an iterative algorithm with the property that the likelihood of the data increases with each subsequent iteration, meaning that it converges to a maximum or a local maximum ref55.

2.4.3 Maximum Likelihood Estimation of GMMs
Maximum likelihood estimation of Gaussian mixture models includes two steps. The first step, known as "expectation", includes calculating the expected component assignment γ for each data point given the model parameters w_c, μ_c and Σ_c. The second step, known as "maximization", includes maximizing the expectation calculated in the previous step with respect to the model parameters; this step involves updating the values of w_c, μ_c and Σ_c.
The entire process is repeated until the algorithm converges, yielding a maximum likelihood estimate. More details are available in ref55.
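A minimal sketch of these two steps for a two-component, one-dimensional GMM; the synthetic data below (two Gaussian clusters) is purely illustrative, not the heart sound features:

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """EM for a two-component 1-D GMM: the E-step computes responsibilities,
    the M-step re-estimates weights, means and variances from them."""
    w = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[c] * math.exp(-0.5 * (x - mu[c]) ** 2 / var[c]) /
                 math.sqrt(2 * math.pi * var[c]) for c in range(2)]
            s = sum(p)
            resp.append([pc / s for pc in p])
        # M-step: update parameters from the responsibilities
        for c in range(2):
            nc = sum(r[c] for r in resp)
            w[c] = nc / len(data)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var[c] = sum(r[c] * (x - mu[c]) ** 2 for r, x in zip(resp, data)) / nc
            var[c] = max(var[c], 1e-6)      # variance floor to avoid collapse
    return w, mu, var

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(5, 1) for _ in range(200)])
w, mu, var = em_gmm_1d(data)
```

On this toy mixture the estimated means converge near the true cluster centers (0 and 5), and the weights stay close to the true 50/50 split.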
3 Experimental Setup
3.1 Dataset
The 2016 PhysioNet/CinC challenge was introduced to provide a standard database containing normal and abnormal heart sounds ref2. The dataset presented in this challenge is a set of heart sound recordings of subjects/patients collected under a variety of environmental conditions (including noisy conditions with low signal quality), as described in ref2. Therefore, many heart sounds incurred different noises during recording, such as speech, stethoscope motion, breathing and intestinal activity ref2. These noises make it difficult to classify normal and abnormal heart sounds. Accordingly, the organizers allowed the participants to classify some of the recordings as 'unsure' ref2, which shows the difficulty level of the challenge. This corpus consists of three subsets: training, validation and test. For training purposes, six labeled databases (with alphabetical name prefixes) containing 3153 sound recordings from 764 subjects/patients, with durations of 5 to 120 s, are provided.
The validation subset comprises 150 normal and 151 abnormal heart sounds (with alphabetically prefixed file names), and the test data includes 1277 heart sound trials generated from 308 subjects/patients. It is necessary to mention that the 301 recordings used for validation were selected from the train set.
The challenge test set consists of six databases with 1277 heart sound recordings from 308 subjects.
It should be noted that the test set is publicly unavailable and will remain private for the purpose of scoring ref2 . The statistics of each subset are summarized and illustrated in Table 2. More details about the corpus and the 2016 PhysioNet/CinC challenge can be found in ref2 .
In this work, we report our results based on the PhysioNet/CinC 2016 dataset. It is worth mentioning that the validation subset consists of 301 records, which are copies of records in the training data. Accordingly, in order to report valid results, we first removed the validation records from the training set and then divided the remaining training set into two parts, in five phases. In each phase, we randomly assigned 80% of the training set as our training set and the remaining 20% as our validation set, which is used for tuning the parameters. In addition, we used the PhysioNet/CinC 2016 validation set as our test set.
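The five-phase 80/20 partitioning described above can be sketched as follows; the record names and the seed are illustrative placeholders, not the actual file identifiers:

```python
import random

def five_phase_splits(records, seed=0):
    """Five random 80/20 train/validation partitions of the training records."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        shuffled = records[:]
        rng.shuffle(shuffled)          # independent shuffle per phase
        cut = int(0.8 * len(shuffled))
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

# hypothetical record identifiers standing in for the real file names
splits = five_phase_splits([f"rec{i:04d}" for i in range(100)])
```

Fixing the random seed makes the five partitions reproducible across runs, which matters when parameters are tuned on the 20% folds.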
3.2 Evaluation Metrics
In this task, evaluation is reported based on the Equal Error Rate (EER) and the Modified Accuracy (MAcc). To compute the EER, we assign a score to each trial; then let P_fa(θ) denote the false alarm rate and P_miss(θ) the miss rate at threshold θ:

P_miss(θ) = #{target trials with score < θ} / #{target trials}    (27)

P_fa(θ) = #{non-target trials with score ≥ θ} / #{non-target trials}    (28)

Now the EER is computed as ref64:

EER = P_fa(θ_EER) = P_miss(θ_EER)    (29)

where θ_EER is the value of the threshold at which P_fa equals P_miss.
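A simple threshold-sweep implementation of this metric, assuming higher scores indicate the target class; the score lists below are made-up examples:

```python
def eer(target_scores, nontarget_scores):
    """Sweep a threshold over all observed scores and return the operating
    point where the miss rate (Eq. 27) and the false-alarm rate (Eq. 28)
    are as close to equal as possible."""
    thresholds = sorted(set(target_scores + nontarget_scores))
    best = (1.0, 0.0)  # (|P_miss - P_fa|, eer estimate)
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2)
    return best[1]

# made-up scores: target trials tend to score high, non-targets low
targets = [2.1, 1.8, 2.5, 0.2, 1.9]
nontargets = [-1.5, -0.8, 0.5, -2.0, -1.1]
```

With finite score lists the two rates rarely cross exactly, so the common convention of averaging P_miss and P_fa at the closest crossing is used here.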
For the MAcc computation, classified data fall into three classes (normal, abnormal or unsure), with two quality references in each category. The modified sensitivity (Se) and specificity (Sp) can be computed according to:

Se = wa · Se_good + ua · Se_poor    (30)

Sp = wn · Sp_good + un · Sp_poor    (31)

where wa and ua are the percentages of the abnormal recordings with good and poor signal quality, respectively, and wn and un are the corresponding percentages of the normal recordings with good and poor signal quality. Here Se_good and Se_poor (Sp_good and Sp_poor) denote the sensitivity (specificity) measured on the good- and poor-quality recordings.
For all 3153 training set recordings, the values of the weight parameters wa, ua, wn and un are 0.8602, 0.1398, 0.9252 and 0.0748, respectively. These parameters were also calculated for the validation set and reported as 0.7888, 0.2119, 0.9467 and 0.0533, respectively. The "Score" (MAcc) for this challenge is computed using the following equation:

MAcc = (Se + Sp) / 2    (32)
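A sketch of the weighted computation; the quality-wise sensitivity/specificity inputs (0.90, 0.80, 0.95, 0.85) are made-up values, while the weights are the training-set values quoted in the text:

```python
def modified_se(se_good, se_poor, wa, ua):
    """Weighted sensitivity over good/poor quality abnormal recordings (Eq. 30)."""
    return wa * se_good + ua * se_poor

def modified_sp(sp_good, sp_poor, wn, un):
    """Weighted specificity over good/poor quality normal recordings (Eq. 31)."""
    return wn * sp_good + un * sp_poor

def macc(se, sp):
    """Challenge score: mean of modified sensitivity and specificity (Eq. 32)."""
    return (se + sp) / 2.0

# made-up quality-wise rates, training-set weights from the text
se = modified_se(0.90, 0.80, 0.8602, 0.1398)
sp = modified_sp(0.95, 0.85, 0.9252, 0.0748)
score = macc(se, sp)
```

Because the good-quality weights dominate, performance on good-quality recordings carries most of the final score.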
3.3 Scoring and Decision Making
To assign a score to a given heart sound based on the GMM classifier, we proceed as follows. First, we extract i-vectors from our training set, project them into the new space using PCA or VAE, and apply them to two GMMs (one GMM for the normal heart sounds and the other for the abnormal heart sounds) with different numbers of components, learning the models with EM iterations (training the GMMs). In the next step, the score for each trial is obtained by computing the log-likelihood ratio:

score(w) = log p(w | λ_normal) − log p(w | λ_abnormal)    (33)

where w is the i-vector corresponding to the test record, and λ_normal and λ_abnormal denote the GMMs for normal and abnormal heart sounds, respectively. After finding the score, a simple global threshold is applied to make the final normal/abnormal decision: if the score is higher than the threshold, the test heart sound is labeled as normal; otherwise, it is labeled as abnormal. In this paper, we used a global threshold to be able to plot the detection error trade-off (DET) and detection accuracy trade-off (DAT) curves.
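A minimal sketch of the scoring and decision rule; the log-likelihood values are made up, and the zero threshold is illustrative rather than the tuned global threshold:

```python
def llr_score(loglike_normal, loglike_abnormal):
    """Log-likelihood ratio of the two class models (cf. Eq. 33)."""
    return loglike_normal - loglike_abnormal

def decide(score, threshold=0.0):
    """Global threshold: scores above it are labeled normal."""
    return "normal" if score > threshold else "abnormal"

# hypothetical log-likelihoods produced by the two GMMs for one test i-vector
score = llr_score(-12.4, -15.9)
label = decide(score)
```

Sweeping the threshold instead of fixing it is what traces out the DET/DAT curves mentioned above: each threshold value yields one (P_miss, P_fa) operating point.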
4 Experimental Results
In this section, we first briefly introduce the baseline system, and then we present the experimental results in two parts. In Section 4.2, we investigate the effect of the number of GMM components and the i-vector dimensionality using the whole training set. In Section 4.3, we study the effect of applying different sizes of the training set to our proposed approach.
4.1 Baseline System
In this paper, we consider the approach proposed in ref56 as the baseline system. The PhysioNet 2016 dataset is used in the baseline system in the same way that we use it in our system. The proposed method in the baseline system is based on asynchronous frames ref56; accordingly, 103228 frames were extracted from the PhysioNet 2016 dataset. To report the results, the authors repeated their experiments for five iterations and reported the average of the obtained results. The attained results in terms of sensitivity, specificity and mean accuracy were reported as 0.845, 0.785 and 0.815, respectively.
4.2 Effects of the Number of GMM Components and the i-vector Dimensionality
The first part of our experiments was performed to investigate the effects of the number of GMM components, the effects of the i-vector dimensionality Without Applying (W.A) VAE or PCA, and finally the effects of i-vector dimension reduction by applying PCA and VAE. Table 3 presents the EERs and MAccs obtained on the test set using the mentioned approaches. It is worth mentioning that we did not label any data as "unsure"; we assigned "normal" or "abnormal" labels to all test data. Furthermore, in this part, we applied the whole training set to our system.
Table 3. EER%, Se%, Sp% and MAcc% on the test set for different i-vector dimensions and GMM component counts. "W.A" denotes i-vectors used without applying PCA or VAE; slashed value pairs are PCA/VAE results.

| Initial dim | Reduced dim | EER% (64) | Se% (64) | Sp% (64) | MAcc% (64) | EER% (128) | Se% (128) | Sp% (128) | MAcc% (128) | EER% (256) | Se% (256) | Sp% (256) | MAcc% (256) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | W.A | 12 | 85.3 | 91.3 | 88.3 | 9.1 | 94 | 88.07 | 91.03 | 12.22 | 92.66 | 83.4 | 8.05 |
| 64 | 16 | 12.4/11.8 | 82.6/88 | 92.7/88.7 | 87.56/88.35 | 7.2/5.9 | 91.3/93.3 | 94.7/94.7 | 93/94 | 10.01/7.55 | 94/94 | 89.4/96.02 | 93/95.01 |
| 64 | 32 | 10.1/11.05 | 85.3/88 | 94.03/90.7 | 89.66/89.35 | 8.1/6.5 | 91.3/93.3 | 92.71/93.3 | 92/93.3 | 8.8/7.06 | 92/86.6 | 90/90 | 91/88.3 |
| 64 | 64 | 10.21/11.3 | 88.33/87.33 | 92.05/90 | 89.69/88.66 | 8.7/6.15 | 93.3/95.3 | 89.4/93.37 | 91.35/94.33 | 11.8/8.33 | 92/90 | 84.1/92.05 | 88.05/91.02 |
| 128 | W.A | 14.72 | 84 | 86.75 | 85.37 | 8.38 | 94.66 | 88.74 | 91.7 | 11.07 | 90.66 | 86.09 | 89.37 |
| 128 | 16 | 11.8/10.2 | 84/89.3 | 92.71/91.39 | 88.35/90.3 | 7.41/5.64 | 90.66/4.66 | 95.36/95.36 | 93.01/95.01 | 7.23/5.91 | 96.6/94 | 89.4/96.02 | 93/95.01 |
| 128 | 32 | 8.35/9.15 | 89.3/90 | 94.03/93.37 | 91.66/91.68 | 6.11/5.88 | 91.33/94 | 96.02/96.02 | 93.07/95.01 | 10.85/5.75 | 93.33/94.66 | 88.74/95.36 | 91.03/95.01 |
| 128 | 64 | 9/8.48 | 89.33/90 | 92.71/93.37 | 91.01/91.68 | 6.16/5.23 | 96.66/96 | 90.72/92.71 | 93.69/94.35 | 10.49/6.18 | 94/92.66 | 87.41/94.7 | 90.70/93.68 |
| 256 | W.A | 12.55 | 82 | 92.05 | 87.02 | 10.81 | 87.33 | 90.72 | 89.02 | 11.30 | 97.33 | 80.13 | 88.73 |
| 256 | 16 | 12.7/8.33 | 80/90.66 | 96.68/91.39 | 88.34/91.02 | 7.53/6.48 | 90.63/94 | 94.77/94.03 | 92.65/94.01 | 7.81/5.59 | 96/96.6 | 88.74/94.7 | 92.37/95.65 |
| 256 | 32 | 12.23/8.76 | 86.6/94.66 | 93.37/94.03 | 89.98/94.34 | 7.40/5.14 | 92.66/96 | 94.03/94.7 | 93.34/95.35 | 7.66/5.36 | 96.6/94 | 89.4/96.6 | 93/95.3 |
| 256 | 64 | 10.28/9.4 | 84.66/94.66 | 93.37/94.7 | 89.01/94.68 | 8.38/4.1 | 94/97.33 | 90.06/94.70 | 92.03/96.01 | 10.5/4.77 | 96/95.3 | 85.4/96.6 | 90.7/95.95 |
| 512 | W.A | 11.65 | 86 | 92.71 | 89.35 | 8.42 | 95.33 | 88.74 | 92.03 | 11.68 | 94.66 | 86.09 | 90.37 |
| 512 | 16 | 12.2/6.83 | 86.66/92.66 | 92.71/94.03 | 89.68/93.34 | 5.87/5.23 | 94/94.66 | 96.02/94.70 | 95.01/94.68 | 8.29/8.53 | 95.33/95.33 | 89.40/87.41 | 92.36/91.37 |
| 512 | 32 | 10.25/9.66 | 90/94.66 | 93.37/94.70 | 91.68/94.68 | 5.5/4.63 | 93.33/95.33 | 95.35/95.36 | 94.34/95.34 | 9.14/5.31 | 94/95.33 | 88.74/95.36 | 91.37/95.34 |
| 512 | 64 | 8.83/6.44 | 89.33/93.33 | 92.71/93.37 | 91.02/93.35 | 8.87/5.1 | 93.33/94.66 | 90.72/94.70 | 92.01/94.68 | 14.02/8.66 | 91.33/94 | 81.45/92.71 | 86.39/93.35 |
| 1024 | W.A | 18.92 | 78.66 | 84.1 | 81.38 | 10.34 | 92 | 86.09 | 89.04 | 13.77 | 93.33 | 78.8 | 86.06 |
| 1024 | 16 | 12.30/3.46 | 80.66/96 | 94.7/96.68 | 87.68/96.34 | 5.44/2.80 | 91.3/97.33 | 98.67/96.68 | 95/97.05 | 5.79/4.81 | 98.6/96.67 | 93.37/97.3 | 95.98/96.98 |
| 1024 | 32 | 9.59/3.08 | 83.33/97.33 | 96.02/98.68 | 89.67/97 | 6.18/2.95 | 90.66/98.86 | 98.01/96.02 | 94.33/97.34 | 7.74/3.12 | 99.33/95.33 | 87.41/98.67 | 93.37/97 |
| 1024 | 64 | 11.98/2.71 | 87.33/97.33 | 91.39/97.35 | 89.36/97.34 | 6.3/2.80 | 96.66/96 | 88.74/98.67 | 93.7/97.33 | 8.84/3.55 | 98.66/96.6 | 83.44/97.35 | 91.05/96.97 |
In each cell of Table 3, the two values separated by a slash correspond to the results obtained with the PCA and VAE techniques, respectively. The number of GMM components is indicated separately in the table header.
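As a point of reference, the evaluation metrics reported throughout Table 3 can be computed as in the minimal sketch below. The label convention (1 = abnormal) and the simple threshold-sweep EER estimate are illustrative assumptions, not the paper's actual evaluation code:

```python
import numpy as np

def se_sp_macc(y_true, y_pred):
    """Sensitivity, specificity and mean accuracy, MAcc = (Se + Sp) / 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    se = tp / (tp + fn)          # fraction of abnormal recordings detected
    sp = tn / (tn + fp)          # fraction of normal recordings detected
    return se, sp, (se + sp) / 2

def eer(y_true, scores):
    """Equal error rate: the operating point where the false acceptance
    rate and false rejection rate are (approximately) equal."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    best_gap, point = np.inf, 0.5
    for t in np.unique(scores):  # sweep every distinct score as a threshold
        pred = (scores >= t).astype(int)
        far = np.mean(pred[y_true == 0] == 1)  # false acceptance rate
        frr = np.mean(pred[y_true == 1] == 0)  # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, point = abs(far - frr), (far + frr) / 2
    return point
```

Note that MAcc averages sensitivity and specificity rather than counting raw correct decisions, so it is not biased by the class imbalance of the test set.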
Table 3 shows that i-vectors followed by VAE (right side of the slash) perform best; the best results are achieved with higher-dimensional i-vectors combined with VAE. The left side of the slash gives the results of i-vectors followed by PCA; the values obtained with PCA are not as good as those obtained with VAE.
Discussion: This is because the VAE minimizes a cost function defined as the MSE between its input (the full feature vector) and its output (the reconstructed features). PCA only extracts the directions of greatest variance, whereas the VAE learns features that are able to reproduce the original data. The VAE therefore retains as much of the information needed to reconstruct the original data as possible, which is why the EER decreases. On the other hand, increasing the raw i-vector dimensionality may add useless, sparse features to the feature vector, which leads to classification errors and reduced accuracy.
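To make the contrast concrete, the sketch below reduces toy "i-vectors" with PCA and measures the reconstruction MSE, i.e. the same quantity a VAE decoder is trained to minimize (a VAE would replace the linear projection with a learned nonlinear encoder/decoder). The data and dimensions here are illustrative assumptions, not the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
ivectors = rng.normal(size=(200, 256))  # toy stand-in for 256-dim i-vectors

# --- PCA: keep the top-k directions of greatest variance ---
k = 64
mean = ivectors.mean(axis=0)
centered = ivectors - mean
# SVD of the centered data: rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:k].T            # 200 x 64 compressed i-vectors
reconstructed = reduced @ Vt[:k] + mean  # rank-k linear reconstruction

# Reconstruction MSE: the quantity a VAE is explicitly trained to minimize,
# while PCA only minimizes it within the family of linear projections.
mse = np.mean((ivectors - reconstructed) ** 2)
```

Among all rank-k linear maps, this projection already gives the smallest reconstruction error; the discussion above argues that the VAE improves further because its nonlinear decoder is not restricted to that linear family.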
As shown in Table 3, the best EER and MAcc values are generally obtained with GMMs trained with 128 components. With 64 components, the GMMs in our proposed system are not well trained. Conversely, using 256 components causes overfitting while training their parameters, due to the small amount of training data.
The best results from Table 3 are depicted in Fig. 4 and Fig. 5. The red point-lines in Fig. 4 and Fig. 5 represent the best values achieved with different i-vector dimensions without applying PCA or VAE, while the blue and green point-lines represent the best values obtained with different i-vector dimensions after applying PCA and VAE, respectively.
As shown in Fig. 4 and Fig. 5, the EER values generally decrease as the i-vector dimensionality increases, and after applying VAE or PCA the MAcc values increase accordingly. This pattern does not hold for raw i-vectors, which yield different EER and MAcc trends.
Discussion: A higher-dimensional i-vector contains more detailed information; on the other hand, this information may include useless details and redundancies. The PCA and VAE methods are therefore used to make the representation more informative. As shown in Table 3, applying PCA or VAE significantly improves the results relative to raw i-vectors.
Table 4 shows the results obtained by the baseline system together with the best results obtained by our proposed systems. The best accuracy achieved by our proposed system is 97.34%, which improves on the accuracy of the baseline system by 15.84% (absolute).
System | EER% | Se% | Sp% | MAcc%
baseline | - | 84.5 | 78.5 | 81.5
i-vector | 8.38 | 88.74 | 95.33 | 92.03
i-vector + PCA | 5.44 | 93.37 | 98.6 | 95.98
i-vector + VAE | 2.71 | 96.02 | 98.86 | 97.34
Discussion: In the baseline system of ref56 , the extracted features are mostly frequency and subband features such as MFCCs and Mel-spectrograms. These features are suitable for robust speech or sound detection, but applications such as heart sound detection require features matched to the task, since each individual's heart sound is unique. As a result, i-vectors are better-suited features for heart sound identification, and hence improve the classification error and accuracy over approaches based on generic robust feature extraction.
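For reference, MFCC extraction of the kind discussed above can be sketched in plain NumPy as below. All parameters (sampling rate, FFT size, filterbank size, coefficient count) are assumptions for illustration, not the configuration of the baseline or of our system:

```python
import numpy as np

def mfcc(signal, sr=2000, n_fft=256, hop=128, n_mels=26, n_ceps=13):
    """Minimal MFCC extraction: pre-emphasis, framing, power spectrum,
    mel filterbank, log compression, and DCT decorrelation."""
    # 1. Pre-emphasis boosts high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming window
    n_frames = 1 + (len(sig) - n_fft) // hop
    frames = np.stack([sig[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # 3. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filterbank between 0 Hz and sr/2
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fb.T + 1e-10)
    # 5. DCT-II decorrelates the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T  # shape: (n_frames, n_ceps)
```

In the pipeline discussed here, such frame-level MFCCs are the input statistics from which the fixed-size i-vector of a recording is then derived.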
4.3 Effect of Training the System with Different Sizes of the Training Set
In this section, we evaluate the effect of the training set size on the proposed method. To this end, we randomly divided the training data into 5 folds (each fold containing 20% of the training set). We then enlarge the training set fold by fold and observe the impact on the EER. Table 5 shows the effect of different training set sizes on our system, with the number of GMM components fixed at the value that gave the best results in the first part of our experiments. The values reported in this table are the best results obtained from raw i-vectors of different dimensions and after applying PCA and VAE to them (in each case, results are reported using the parameter configuration that yielded the best result).
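The fold-wise protocol can be sketched as follows; `cumulative_folds` is a hypothetical helper name, and the actual i-vector training/evaluation pipeline is omitted:

```python
import numpy as np

def cumulative_folds(n_samples, n_folds=5, seed=0):
    """Randomly split sample indices into n_folds folds, then return the
    cumulative training subsets: 20%, 40%, ..., 100% of the data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    # Growing subsets: fold 1, folds 1-2, folds 1-3, ...
    return [np.concatenate(folds[:k]) for k in range(1, n_folds + 1)]

# Example with 100 training recordings; each subset would be used to
# retrain the system before evaluating on the fixed test set.
subsets = cumulative_folds(100)
sizes = [len(s) for s in subsets]  # [20, 40, 60, 80, 100]
```

Because the folds are drawn randomly, each enlargement keeps the earlier samples and only adds new ones, so the EER changes in Table 5 reflect data quantity rather than a reshuffled training set.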
As summarized in Table 5, classification performance improves as the amount of training data increases. The results suggest that enlarging the training set beyond 80% yields a smaller improvement than at smaller training set sizes. According to Table 5, our proposed system performs on par with the baseline system when only 60% of the training set is used.
Size of training set | System | EER% | Se% | Sp% | MAcc%
20% | raw i-vector | 37.85 | 86.00 | 34.44 | 60.22
 | i-vector + PCA | 30.12 | 95.33 | 52.32 | 73.82
 | i-vector + VAE | 31.31 | 40.00 | 88.74 | 64.37
40% | raw i-vector | 24.75 | 60.00 | 74.83 | 70.41
 | i-vector + PCA | 27.44 | 60.00 | 83.44 | 71.72
 | i-vector + VAE | 28.95 | 65.33 | 72.85 | 69.09
60% | raw i-vector | 20.38 | 65.33 | 94.7 | 80.10
 | i-vector + PCA | 17.82 | 82.00 | 83.44 | 82.70
 | i-vector + VAE | 18.95 | 64.67 | 99.34 | 82.00
80% | raw i-vector | 11.94 | 89.33 | 87.42 | 88.75
 | i-vector + PCA | 8.27 | 88.00 | 92.05 | 90.02
 | i-vector + VAE | 4.12 | 93.33 | 98.01 | 95.67
100% | raw i-vector | 8.38 | 88.74 | 95.33 | 92.03
 | i-vector + PCA | 5.44 | 91.30 | 98.67 | 95.00
 | i-vector + VAE | 2.80 | 96.02 | 98.86 | 97.34
Fig. 6 and Fig. 7 depict the effect of varying the training set size on the classification EER and MAcc of the proposed system, respectively.
Discussion: As shown in these figures, the MAcc and EER gradually improve as the training set size increases. The number of samples clearly matters, since more samples improve generalization and help the system adapt to new data. The more important question, however, is how the three approaches compare, i.e. whether feature reduction pays off. First, enlarging the dataset yields a smaller improvement for raw i-vectors than for PCA: dimensionality reduction outperforms raw i-vectors on both small and large training sets. Most importantly, the VAE performs best. As a deep neural network, the VAE needs more data to generalize, so its results keep improving as the amount of data grows, and it ultimately yields the best results among all approaches.
5 Conclusions
This paper proposes a novel method for automatic heart sound classification based on i-vector embedding of MFCC features, in which MFCCs are extracted from heart sounds to represent the characteristics of the subject's heart sound. Experiments on a public dataset demonstrate the effectiveness of the proposed method. The method is based on fixed-size i-vectors and is therefore insensitive to the length of the input sounds. The combination of MFCCs and i-vectors is stable and reflects the key features needed to discriminate the two types of subjects accurately. The i-vector is better suited to describing the characteristics of heart sound than variable-length features, since the sound is always treated as a whole when the i-vector is produced. The proposed method has low computational cost, can run even on wearable devices, and works well even when the amount of training data is small. In conclusion, the proposed method outperforms state-of-the-art approaches.
6 Acknowledgment
We thank Mr. Mohammad Elmi and Mr. Majid Osati for comments that greatly improved the manuscript.
References
 (1) Writing Group Members, E. J. Benjamin, M. J. Blaha, S. E. Chiuve, M. Cushman, S. R. Das, R. Deo, S. D. de Ferranti, J. Floyd, M. Fornage, et al., Heart disease and stroke statistics—2017 update: a report from the American Heart Association, Circulation 135 (10) (2017) e146.
 (2) C. Liu, D. Springer, Q. Li, B. Moody, R. A. Juan, F. J. Chorro, F. Castells, J. M. Roig, I. Silva, A. E. Johnson, et al., An open access database for the evaluation of heart sound algorithms, Physiological Measurement 37 (12) (2016) 2181.
 (3) A. Moukadem, A. Dieterlen, N. Hueber, C. Brandt, Localization of heart sounds based on S-transform and radial basis function neural network, in: 15th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics (NBC 2011), Springer, 2011, pp. 168–171.
 (4) S. Sun, Z. Jiang, H. Wang, Y. Fang, Automatic moment segmentation and peak detection analysis of heart sound pattern via short-time modified Hilbert transform, Computer methods and programs in biomedicine 114 (3) (2014) 219–230.
 (5) Z. Yan, Z. Jiang, A. Miyamoto, Y. Wei, The moment segmentation analysis of heart sound pattern, Computer methods and programs in biomedicine 98 (2) (2010) 140–150.
 (6) P. Sedighian, A. W. Subudhi, F. Scalzo, S. Asgari, Pediatric heart sound segmentation using hidden markov model, in: Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, IEEE, 2014, pp. 5490–5493.
 (7) P. Bentley, G. Nordehn, M. Coimbra, S. Mannor, R. Getz, The pascal classifying heart sounds challenge 2011 (chsc2011) results, see http://www.peterjbentley.com/heartchallenge/index.html.
 (8) A. Castro, T. T. Vinhoza, S. S. Mattos, M. T. Coimbra, Heart sound segmentation of pediatric auscultations using wavelet analysis, in: Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, IEEE, 2013, pp. 3909–3912.
 (9) S. E. Schmidt, C. Holst-Hansen, C. Graff, E. Toft, J. J. Struijk, Segmentation of heart sound recordings by a duration-dependent hidden markov model, Physiological measurement 31 (4) (2010) 513.
 (10) D. B. Springer, L. Tarassenko, G. D. Clifford, Logistic regression-HSMM-based heart sound segmentation, IEEE Transactions on Biomedical Engineering 63 (4) (2016) 822–832.
 (11) A. A. Sepehri, J. Hancq, T. Dutoit, A. Gharehbaghi, A. Kocharian, A. Kiani, Computerized screening of children congenital heart diseases, Computer methods and programs in biomedicine 92 (2) (2008) 186–192.
 (12) H. Uğuz, Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy, Neural Computing and applications 21 (7) (2012) 1617–1628.
 (13) H. Uğuz, A biomedical system based on artificial neural network and principal component analysis for diagnosis of the heart valve diseases, Journal of medical systems 36 (1) (2012) 61–72.
 (14) S. Ari, K. Hembram, G. Saha, Detection of cardiac abnormality from pcg signal using lms based least square svm classifier, Expert Systems with Applications 37 (12) (2010) 8019–8026.
 (15) Y. Zheng, X. Guo, X. Ding, A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification, Expert Systems with Applications 42 (5) (2015) 2710–2721.
 (16) S. Patidar, R. B. Pachori, N. Garg, Automatic diagnosis of septal defects based on tunable-Q wavelet transform of cardiac sound signals, Expert Systems with Applications 42 (7) (2015) 3315–3326.
 (17) A. Gharehbaghi, I. Ekman, P. Ask, E. Nylander, B. Janerot-Sjoberg, Assessment of aortic valve stenosis severity using intelligent phonocardiography, International journal of cardiology 198 (2015) 58–60.
 (18) R. Saraçoğlu, Hidden markov model-based classification of heart valve disease with pca for dimension reduction, Engineering Applications of Artificial Intelligence 25 (7) (2012) 1523–1528.
 (19) A. Quiceno-Manrique, J. Godino-Llorente, M. Blanco-Velasco, G. Castellanos-Dominguez, Selection of dynamic features based on time–frequency representations for heart murmur detection from phonocardiographic signals, Annals of biomedical engineering 38 (1) (2010) 118–137.
 (20) C. Puri, A. Ukil, S. Bandyopadhyay, R. Singh, A. Pal, A. Mukherjee, D. Mukherjee, Classification of normal and abnormal heart sound recordings through robust feature selection, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 1125–1128.
 (21) M. Zabihi, A. B. Rad, S. Kiranyaz, M. Gabbouj, A. K. Katsaggelos, Heart sound anomaly and quality detection using ensemble of neural networks without segmentation, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 613–616.
 (22) C. Potes, S. Parvaneh, A. Rahman, B. Conroy, Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 621–624.
 (23) J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, K. Sricharan, Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients, in: Computing in Cardiology Conference (CinC), 2016, IEEE, 2016, pp. 813–816.
 (24) N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (4) (2011) 788–798.
 (25) N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, R. Dehak, Language recognition via i-vectors and dimensionality reduction, in: Twelfth Annual Conference of the International Speech Communication Association, 2011.
 (26) D. Martinez, O. Plchot, L. Burget, O. Glembek, P. Matějka, Language recognition in i-vectors space, in: Twelfth Annual Conference of the International Speech Communication Association, 2011.
 (27) M. H. Bahari, R. Saeidi, D. van Leeuwen, et al., Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech.
 (28) R. Xia, Y. Liu, Using i-vector space model for emotion recognition, in: Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 (29) H. Khaki, E. Erzin, Continuous emotion tracking using total variability space, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 (30) H. Eghbal-Zadeh, B. Lehner, M. Dorfer, G. Widmer, CP-JKU submissions for DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE).
 (31) R. Wahid, N. I. Ghali, H. S. Own, T.h. Kim, A. E. Hassanien, A gaussian mixture models approach to human heart signal verification using different feature extraction algorithms, in: Computer Applications for Biotechnology, Multimedia, and Ubiquitous City, Springer, 2012, pp. 16–24.
 (32) M. R. Hasan, M. Jamil, M. Rahman, et al., Speaker identification using mel frequency cepstral coefficients, variations 1 (4).
 (33) V. Tiwari, Mfcc and its applications in speaker recognition, International journal on emerging technologies 1 (1) (2010) 19–22.
 (34) P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE transactions on speech and audio processing 13 (3) (2005) 345–354.
 (35) D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing 10 (13) (2000) 19–41.
 (36) H. Zeinali, A. Mirian, H. Sameti, B. BabaAli, Non-speaker information reduction from cosine similarity scoring in i-vector based speaker verification, Computers & Electrical Engineering 48 (2015) 226–238.
 (37) H. Zeinali, B. BabaAli, H. Hadian, Online signature verification using i-vector representation, IET Biometrics.
 (38) P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16 (5) (2008) 980–988.
 (39) W. M. Campbell, D. E. Sturim, D. A. Reynolds, A. Solomonoff, Svm based speaker verification using a gmm supervector kernel and nap variability compensation, in: Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, Vol. 1, IEEE, 2006, pp. I–I.
 (40) A. Solomonoff, C. Quillen, W. M. Campbell, Channel compensation for svm speaker recognition., in: Odyssey, Vol. 4, Citeseer, 2004, pp. 219–226.
 (41) A. Solomonoff, W. M. Campbell, I. Boardman, Advances in channel compensation for svm speaker recognition, in: Acoustics, Speech, and Signal Processing, 2005. Proceedings.(ICASSP’05). IEEE International Conference on, Vol. 1, IEEE, 2005, pp. I–629.
 (42) A. O. Hatch, S. Kajarekar, A. Stolcke, Within-class covariance normalization for svm-based speaker recognition, in: Ninth international conference on spoken language processing, 2006.
 (43) N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, F. Castaldo, Support vector machines and joint factor analysis for speaker verification.
 (44) C. R. Rao, The utilization of multiple measurements in problems of biological classification, Journal of the Royal Statistical Society. Series B (Methodological) 10 (2) (1948) 159–203.
 (45) L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative, J Mach Learn Res 10 (2009) 66–71.
 (46) H. Abdi, L. J. Williams, Principal component analysis, Wiley interdisciplinary reviews: computational statistics 2 (4) (2010) 433–459.
 (47) D. Jang, H. Park, G. Choi, Estimation of leakage ratio using principal component analysis and artificial neural network in water distribution systems, Sustainability 10 (3) (2018) 750.
 (48) D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.
 (49) D. Reynolds, Gaussian mixture models, Encyclopedia of biometrics (2015) 827–832.
 (50) D. A. Reynolds, Automatic speaker recognition using gaussian mixture speaker models, in: The Lincoln Laboratory Journal, Citeseer, 1995.
 (51) M. Adiban, H. Sameti, N. Maghsoodi, S. Shahsavari, SUT system description for anti-spoofing 2017 challenge, in: Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.
 (52) B. Bozkurt, I. Germanakis, Y. Stylianou, A study of time-frequency features for cnn-based automatic heart sound classification for pathology detection, Computers in biology and medicine.