Quantifying Cochlear Implant Users' Ability for Speaker Identification using CI Auditory Stimuli

07/31/2019 ∙ by Nursadul Mamun, et al. ∙ 0

Speaker recognition is a biometric modality that uses underlying speech information to determine the identity of the speaker. Speaker identification (SID) under noisy conditions is one of the challenging topics in speech processing, particularly for individuals with cochlear implants (CI). This study analyzes and quantifies the ability of CI users to perform speaker identification based on direct electric auditory stimuli. CI users rely on a limited number of frequency bands (8 to 22), with electrodes directly stimulating the basilar membrane/cochlea, to recognize the speech signal. The sparsity of electric stimulation within the CI frequency range is a prime reason for the loss in human speech recognition as well as SID performance. It is therefore assumed that CI users might be unable to recognize and distinguish a speaker, because speaker-dependent information such as formant frequencies and pitch is lost to unstimulated electrodes. To quantify this assumption, the input speech signal is processed with the CI Advanced Combined Encoder (ACE) signal processing strategy to construct the CI auditory electrodogram. The study uses 50 speakers from each of three different databases; the system is trained with two different classifiers under quiet conditions and tested under both quiet and noisy conditions. The objective results show that CI users can effectively identify a limited number of speakers. However, their performance decreases when more speakers are added to the system, as well as when noisy conditions are introduced. This information could therefore be used to improve CI signal processing techniques and, in turn, human SID.




1 Introduction

A cochlear implant is an implantable electric device that allows people with sensorineural hearing loss to recover hearing abilities, especially speech recognition. Efficient encoding of temporal information in current CI signal processing strategies allows most CI users to achieve over 80% speech understanding in quiet acoustic conditions [1, 2]. However, other aspects of auditory processing (such as understanding speech in noise and distinguishing a speaker’s identity, gender, or emotion) remain difficult for CI users, although they are important for quality of life [3, 4, 5]. These difficulties could be due to the limited number of frequency channels available through their electrodes compared with a normal hearing person. Research also shows that formant frequencies, or spectral peaks of the human voice, which are critical for speech recognition, also reflect individual anatomical and physiological properties and thus carry information about the speaker’s identity [6, 7].

The cochlear implant provides the sense of sound by directly stimulating the auditory nerve. The implant comprises two main components coupled by a powerful magnet: an external sound processor and a surgically implanted electrode array (16 to 22 electrodes long) connected to the auditory nerve. Only a limited number of electrodes in the array can be stimulated at a time, depending on the mechanism and design of the implanted array [8].

Figure 1: (a) Cross-section of the human ear highlighting the implanted electrodes in the cochlea. (b) Comparison of the SID mechanism between a normal hearing person and a CI user.

Figure 1 illustrates the basic speaker identification approaches for normal hearing subjects and CI users. As shown in Fig. 1, speaker identification ability for normal hearing subjects is quantified by extracting Mel Frequency Cepstral Coefficient (MFCC) features from the speech signal and using a Gaussian mixture model (GMM) or probabilistic linear discriminant analysis (PLDA) as the classifier. For CI users, performance in speaker identification is instead quantified by extracting electrodograms from the speech signal using a CI processor. A normal hearing person can perform efficient speaker recognition when the individual-speaker features are as widely separated as possible [9, 10, 11, 12, 13]. Based on this feature extraction mechanism, MFCCs have been widely used in automated speaker recognition systems in an attempt to mimic the speaker recognition capabilities of humans. The MFCC algorithm has a Mel filter bank comprising 40 Mel filters, which are used to generate the MFCC coefficients and collect the speaker-related information effectively [14]. In comparison, a CI uses an algorithm that generates electrodograms containing electrical stimulation information for a maximum of only 22 channels (usually 8 to 22), which is almost half the number of filter banks used for the normal speech recognition process. Our assumption is therefore that, because much speaker-specific feature information is lost, the speaker identification capabilities of a CI user will deteriorate extensively, and even more so in noisy conditions.

In this study, we focus on quantifying the ability of CI users to recognize a specific feature of the human voice, speaker identity. Although a CI user can understand a good percentage of speech (around 30%) from lip movements, this study analyzes how well a CI user can identify a distant speaker, such as a voice from a radio or someone over the phone. To perform this study, CI auditory stimuli, represented as electrodograms, are used as the feature for CI users. Figure 2 shows an electrodogram used to train and test the speaker model in this study. Electrodograms are similar to spectrograms, except that electrical stimulation information is simulated and presented as a function of time.

Figure 2: Cochlear Implant electrode stimulation response as an electrodogram

Electrodograms reflect the CI auditory stimulation as acoustic time versus frequency-to-electrode allocation [15, 16]. i-Vector features are extracted from the electrodograms and used to train Gaussian mixture model-universal background model (GMM-UBM) and probabilistic linear discriminant analysis (PLDA) based speaker models. An in-set data set of unlabeled speakers is then used in the testing phase under quiet and noisy environments.

The paper is organized as follows. Sec. 2 briefly explains the proposed speaker identification system. The results under quiet and noisy conditions are presented in Sec. 3, followed by the conclusion and acknowledgement.

2 Methodology

This section briefly explains the proposed method for quantifying the speaker identification capability of CI users. The basic block diagram of the proposed system is depicted in Fig. 3. The input speech signal is sent to a voice activity detector (VAD) to remove silent periods. The processed signal is then fed to the CI signal processing encoder [17, 18] to generate electrodograms for the different speech tokens from each speaker. The well-known GMM-UBM and PLDA classifiers are trained on the electrodograms of each speaker. Finally, features extracted from a test speaker are scored against the speaker models using a maximum likelihood function, and the best-scoring model identifies the speaker. The performance of the proposed system is evaluated under quiet and noisy conditions.
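The final maximum likelihood decision can be illustrated with a minimal sketch. It assumes each speaker model is a diagonal-covariance GMM stored as a `(weights, means, variances)` tuple; the function names `gmm_log_likelihood` and `identify_speaker` and this storage layout are hypothetical and are not taken from the study.

```python
import numpy as np

def gmm_log_likelihood(features, weights, means, variances):
    """Average per-frame log-likelihood of features (T, D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D). Illustrative layout only.
    """
    diff = features[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)    # (T, K)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over components, done manually for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(log_frame.mean())

def identify_speaker(features, speaker_models):
    """Return the name of the model with the highest likelihood, plus all scores."""
    scores = {name: gmm_log_likelihood(features, *params)
              for name, params in speaker_models.items()}
    return max(scores, key=scores.get), scores
```

In a closed-set test, `features` would hold the frames of one test token and `speaker_models` one entry per enrolled speaker; the returned name is the predicted identity.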

2.1 Pre-processing

Clean signals are submitted to the VAD algorithm to remove the unnecessary silent periods of the signal [19]. It also detects the unvoiced part of the signal and removes it to provide the voiced part of the signal as an output of the pre-processor.
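As a rough illustration of silence removal, the sketch below uses a simple frame-energy threshold. This is a stand-in and not the VAD from [19]; the function name `energy_vad`, the 200-sample frame length, and the -40 dB threshold are assumptions chosen for the example.

```python
import numpy as np

def energy_vad(signal, frame_len=200, threshold_db=-40.0):
    """Keep frames whose energy is within threshold_db of the loudest frame.

    Returns the concatenated retained samples and the per-frame keep mask.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    # Energy relative to the loudest frame, in dB; the constant avoids log(0).
    energy_db = 10.0 * np.log10(energy + 1e-12) - 10.0 * np.log10(energy.max() + 1e-12)
    keep = energy_db > threshold_db
    return frames[keep].reshape(-1), keep
```

An energy threshold alone does not separate voiced from unvoiced speech, so a fuller pre-processor would need an additional voicing decision.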

Figure 3: Block diagram of proposed Speaker Identification system using CI auditory stimuli features.

2.2 Feature Extraction

In this study, electrodograms are used as the auditory stimuli feature for the CI system. An electrodogram (shown in Fig. 2) is a 2-D time-electrode representation constructed by combining the current levels from 22 electrodes. The ACE strategy is used to simulate the received signal and generate electrodograms from speech data. ACE is designed to customize sounds by combining the benefits of the pitch information of the SPEAK strategy [20] with the higher rates of stimulation offered by the CIS strategy. The result is an advanced strategy that can be customized to meet each CI user’s hearing needs.

Figure 4: Basic block diagram of the ACE processing strategy used in this study to simulate the CI user’s signal.

The basic block diagram of feature extraction using the ACE processing strategy is shown in Fig. 4 [8]. The incoming signal is pre-processed to emphasize its higher frequency components. The pre-emphasized signal is then divided into frames of 128 samples (10 ms) using a Hamming window with 87.5% overlap between adjacent frames, and the envelope (ENV) of each frame is calculated. The ENV of each frame is passed through a bank of 22 band-pass filters with center frequencies specified by Cochlear Corporation. The individual CI user’s mapping function is then applied to the selected bands, and electrical pulses are finally generated for a set of speaker-dependent map parameters [21, 22]. The electrodograms are then generated for each speech signal from these electrical pulses.
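The processing chain above can be sketched in simplified form. The code below is an ACE-like approximation and not the clinical implementation: the pre-emphasis coefficient (0.97), the equal-width FFT-bin band edges (the actual center frequencies are specified by Cochlear Corporation), and the choice of 8 selected bands per frame are all assumptions, and the user-specific mapping to current levels is omitted.

```python
import numpy as np

def ace_like_electrodogram(signal, frame_len=128, overlap=0.875,
                           n_bands=22, n_maxima=8):
    """Return an (n_bands, n_frames) matrix of band envelopes.

    Only the n_maxima largest bands per frame are kept; the rest are zero.
    """
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis: first-order difference that boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    hop = int(round(frame_len * (1.0 - overlap)))      # 16 samples by default
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    # Assumed band layout: contiguous groups of FFT bins, skipping the DC bin.
    n_bins = frame_len // 2
    edges = np.linspace(1, n_bins + 1, n_bands + 1).astype(int)
    out = np.zeros((n_bands, n_frames))
    for t in range(n_frames):
        frame = emphasized[t * hop: t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        env = np.array([np.sqrt(power[edges[b]:edges[b + 1]].sum())
                        for b in range(n_bands)])
        top = np.argsort(env)[-n_maxima:]              # indices of largest bands
        out[top, t] = env[top]
    return out
```

With the default settings, a one-second signal of 8000 samples yields a matrix with 22 rows and 493 columns, each column holding at most 8 non-zero band envelopes.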

2.3 Speaker Training Model

2.3.1 GMM-UBM Classifier

The standard training method for GMM models uses maximum a posteriori (MAP) adaptation of the means of the mixture components based on speech from a target speaker. To compensate for speaker and channel variability, we stack the means of the GMM to form a GMM mean supervector [23], an approach that has been highlighted in recent research. GMM parameters are iteratively refined by the Expectation Maximization (EM) algorithm, which monotonically increases the likelihood of the estimated model for the observed feature vectors [24], and the estimated parameters can be adapted to new data by MAP adaptation. Each speaker’s GMM is adapted from the UBM using that speaker’s training data, which makes the system faster and more stable and improves its performance [25]. In this study, a GMM-UBM classifier with 512 mixture components is trained on the proposed features to generate a model for each speaker.
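A minimal sketch of mean-only MAP adaptation, in the style described in [25], is given below. It assumes a diagonal-covariance UBM and a relevance factor of 16; the function name `map_adapt_means` and the relevance value are illustrative choices and are not reported in this study.

```python
import numpy as np

def map_adapt_means(features, weights, means, variances, relevance=16.0):
    """Adapt UBM means (K, D) toward speaker features (T, D)."""
    diff = features[:, None, :] - means[None, :, :]                   # (T, K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))        # (T, K)
    # Component posteriors (responsibilities) for every frame.
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)
    n_k = post.sum(axis=0)                                            # soft counts
    first_order = post.T @ features                                   # (K, D)
    e_k = first_order / np.maximum(n_k, 1e-10)[:, None]               # data means
    # Data-dependent mixing: components with more data move further.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * e_k + (1.0 - alpha) * means
```

Components that see little speaker data keep their UBM means, which is what makes the adapted model stable when enrollment data is limited.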

2.3.2 GMM-UBM-I-vectors-PLDA

An i-Vector is a low-dimensional vector containing both speaker and channel information acquired from a speech segment. PLDA, which is closely related to the joint factor analysis (JFA) used for speaker recognition, is a probabilistic extension of linear discriminant analysis. Unlike conventional GMMs, where correlations are weakly modelled using diagonal covariance matrices, PLDA captures the correlation of the feature vectors in subspaces without vastly expanding the model. When PLDA is applied to an i-Vector, dimension reduction is performed twice: first in the i-Vector extraction process and second in the PLDA model. Figure 5 shows the speaker modeling system using i-Vectors combined with PLDA.
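The two dimension-reduction steps can be written compactly using the commonly used total variability and simplified Gaussian PLDA formulations. These are the standard textbook forms and are given here as an assumption; the study does not state which PLDA variant or subspace dimensions were used.

```latex
% Step 1: i-vector extraction (total variability model).
% M_s : GMM mean supervector of segment s,  m : UBM mean supervector,
% T   : low-rank total variability matrix,  w_s : the i-vector.
M_s = m + T\,w_s, \qquad w_s \sim \mathcal{N}(0, I)

% Step 2: simplified Gaussian PLDA on the i-vector of utterance j from speaker i.
% \mu : global mean,  V : speaker subspace,  y_i : latent speaker factor,
% \varepsilon_{ij} : residual (channel/session) term.
w_{ij} = \mu + V\,y_i + \varepsilon_{ij}, \qquad
y_i \sim \mathcal{N}(0, I), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \Sigma)
```

The first equation reduces the high-dimensional supervector to the i-vector, and the second projects the i-vector onto a lower-dimensional speaker factor.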

Figure 5: Basic block diagram of speaker training model using I-Vector-PLDA system.

2.4 Text Corpora

2.4.1 Text-Dependent Database

A small dedicated text-dependent database, the ‘University of Malaya’ database, is used; it consists of 390 signals collected from 39 speakers (25 males and 14 females) [26]. Audio signals were recorded in a noiseless room at a sample rate of 8 kHz, with each speaker uttering ‘University Malaya’ 10 times in different sessions. For this study, a random 70% of the clean recordings from each speaker is used for training, and the remaining 30% is used to test the performance of the proposed system.

2.4.2 Text-Independent Database

Next, more extensive corpora, SRE-18 and YOHO, are used to test CI-user ability with text-independent data. This study used 50 speakers with 60 samples (8 seconds each) per speaker from both the SRE-18 and YOHO databases. The SRE-18 database [27] is a dataset provided by the National Institute of Standards and Technology (NIST) for the Speaker Recognition Evaluation (SRE) series, which NIST has conducted since 1996. The SRE-18 database is composed of telephone conversations collected outside North America, Voice over IP (VoIP) data, and audio from video (AfV). The proposed system is also evaluated with speech signals from the YOHO database. YOHO is a large-scale, high-quality dataset collected by ITT (International Telephone and Telegraph) in 1989 and is frequently used for speaker identification and verification systems [28]. Each speaker has four enrollment sessions with 24 independent utterances (each consisting of three two-digit numbers, e.g., 27-82-39) per enrollment session.

The sample rate for each dataset is 8 kHz. For this study, seventy-five percent of the speech tokens from each speaker were randomly selected for training the speaker model, and the remainder were used for testing it.
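The per-speaker random split can be sketched as follows. The function name `split_tokens`, the dictionary layout, and the fixed seed are assumptions made for the example; the study does not describe its splitting code.

```python
import random

def split_tokens(tokens_by_speaker, train_fraction=0.75, seed=0):
    """Randomly split each speaker's tokens into training and test lists."""
    rng = random.Random(seed)
    train, test = {}, {}
    for speaker, tokens in tokens_by_speaker.items():
        shuffled = list(tokens)       # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[speaker] = shuffled[:cut]
        test[speaker] = shuffled[cut:]
    return train, test
```

With 60 tokens per speaker, this gives 45 training and 15 test tokens for each speaker. Changing `seed` between runs would give the different random draws needed for repeated experiments.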

3 Results

This section presents the predicted performance of CI users in identifying speakers for both text-dependent and text-independent databases. The performance of the system was calculated for three different databases, with each experiment repeated 10 times and the average scores reported in this section. The effects of GMM parameters are also examined for further analysis.

3.1 SID performance with Text-dependent database

3.1.1 Speaker-identification under quiet conditions

Speaker identification performance of CI users was predicted for the text-dependent database based on CI auditory stimuli derived from electrodograms. The accuracy was evaluated using two different classifier systems (i.e., GMM adapted from a UBM and i-Vector-PLDA). Performance was calculated in a quiet environment for the 39 speakers from the “UM database” used in the training session. The results show that a system based on CI-user auditory stimuli can effectively predict speaker identity when the text is common to all speakers. The highest accuracy for 39 speakers was 95% with the GMM-UBM classifier and 99% with the i-Vector-PLDA classifier. Therefore, CI users are expected to have high speaker identification ability for text-dependent data.

Table 1: Speaker Identification accuracy for different GMM mixtures (%): Text-dependent Corpus.

3.1.2 Effects of GMM parameter on SID performance

Table 1 shows the accuracy of the system for an increasing number of GMM mixture components. The system was trained and tested on different speech tokens from the text-dependent database with 10 dB speech-shaped noise. Accuracy was then evaluated as the number of GMM components increased. For the GMM-UBM based method, accuracy increases as the number of GMM components is expanded. However, the i-Vector based method showed the opposite trend, which could be due to overestimation of speaker information. Moreover, with more GMM components, the computational time for the log-likelihood iterations needed for the GMM to converge also increases.

3.2 SID performance with text-independent database

3.2.1 Speaker-identification under quiet conditions

It is expected that a CI user has more difficulty as the number of speakers increases. Table 2 presents the predicted speaker identification ability of CI users for text-independent SID. The results were evaluated for 50 different speakers from two different databases (SRE-18 and YOHO). The speaker models were constructed and classified using two well-established classifiers based on GMM-UBM and i-Vector-PLDA. To evaluate performance against the number of speakers, the model was trained and tested for 4, 12, 24, 36, and 50 speakers. To check consistency, the experiments were repeated 10 times and the average results are presented in the table. In general, a system based on CI-user auditory stimuli can predict speaker identity under quiet conditions for both databases. The system can effectively identify the speaker when tested with 4 to 12 speakers. However, performance decreases as more speakers are incorporated into the training set. Performance is higher with the PLDA based classifier than with the GMM-UBM based classifier.

Table 2: Closed-set Speaker Identification accuracy for text independent data in percentage (%).

3.2.2 Speaker-identification accuracy under noise conditions

To assess CI users’ speaker identification performance under noisy conditions, system performance was evaluated by adding 10 dB noise. The speaker model was trained with clean data from both the YOHO and SRE databases and then tested on speech tokens contaminated with 10 dB white Gaussian noise (WGN) and speech-shaped noise (SSN). System performance was also evaluated for a varying number of speakers (4, 12, 24, 36, and 50). Training and test samples were randomly selected, and the experiment was repeated 10 times to reduce any system bias. Finally, the average results of the experiments are reported in Table 3.
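Contaminating a clean token with noise at a target level can be sketched as below, assuming that the 10 dB figure denotes the signal-to-noise ratio (SNR) and that the noise recording is at least as long as the speech. The function name `add_noise_at_snr` is hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db=10.0):
    """Scale noise so that speech power / noise power matches snr_db, then mix."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]   # trim to speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the required power relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

For white Gaussian noise, `noise` could be drawn from a normal distribution; for speech-shaped noise it would be a noise signal whose spectrum has been shaped to match long-term speech, which this sketch takes as given.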

Table 3: Speaker Identification accuracy under noisy conditions in percentage (%).

Although speaker identification accuracy is high in a quiet environment, it falls drastically in noisy conditions. Moreover, the proposed system has higher accuracy for SRE than for the longer test durations in the YOHO database, as SRE contains complete sentences, which suggests that the subjects obtain sufficient cues regarding speaker identity. In addition, system performance using the PLDA classifier is higher than for GMM-UBM under noisy conditions. Overall, speaker identification accuracy is very poor (approximately zero) under noise, which reflects the reality of CI-user capability.

4 Conclusion

This study quantifies the speaker identification capability of CI users based on a parameterized electrodogram feature set. Electrodograms were generated using the ACE signal processing strategy for speech signals from three different databases. To quantify speaker ID performance for normal hearing (NH) subjects versus cochlear implant (CI) subjects, two alternate time-frequency acoustic front-ends were considered to represent NH versus CI based human SID performance. Two different back-end classifiers, GMM-UBM and GMM-based i-Vector-PLDA, were used to evaluate CI-user performance. The results showed that CI-based auditory stimuli (e.g., parameterized electrodograms) are effective for speaker ID under quiet conditions (e.g., high accuracy of 90 to 99%). It is also shown that the CI-based acoustic representation is more successful within the i-Vector based speaker ID system (98%) than within the GMM-UBM based system (94%). However, CI electrodogram based SID results are completely confused under noisy conditions and speakers cannot be predicted, suggesting that CI-user auditory stimuli are not capable of representing speaker ID traits for CI listeners. An important finding is that CI users can easily predict a speaker’s identity when the text is fixed, but performance deteriorates for text-independent scenarios. For future work, it is suggested that a parallel subjective study with CI users could validate the corresponding MFCC (NH) and electrodogram (CI) based SID systems. Finally, it is suggested that the proposed systems could be applied to improve the signal processing strategies in cochlear implant processors to improve speaker characterization for CI listeners in both quiet and noisy environments, thereby improving quality of life for CI users.

5 Acknowledgement

This work was supported primarily by Grant No. R01 DC016839-02 from the National Institutes of Health (NIDCD), and partially by the University of Texas at Dallas through the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen.


  • [1] J. Rouger, S. Lagleyre, B. Fraysse, S. Deneve, O. Deguine, and P. Barone, “Evidence that cochlear-implanted deaf patients are better multisensory integrators,” Proceedings of the National Academy of Sciences, vol. 104, no. 17, pp. 7295–7300, 2007.
  • [2] H. Ali, N. Mamun, A. Bruggeman, R. C. M. Chandra Shekar, J. N. Saba, and J. H. Hansen, “The cci-mobile vocoder,” The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1872–1872, 2018.
  • [3] B. Munson and P. B. Nelson, “Phonetic identification in quiet and in noise by listeners with cochlear implants,” The Journal of the Acoustical Society of America, vol. 118, no. 4, pp. 2607–2617, 2005.
  • [4] N. Mamun, W. A. Jassim, and M. S. Zilany, “Prediction of speech intelligibility using a neurogram orthogonal polynomial measure (nopm),” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 4, pp. 760–773, 2015.
  • [5] J. Gideon, S. Khorram, Z. Aldeneh, D. Dimitriadis, and E. M. Provost, “Progressive neural networks for transfer learning in emotion recognition,” Interspeech 2017, pp. 1098–1102, 2017.
  • [6] J. M. Fellowes, R. E. Remez, and P. E. Rubin, “Perceiving the sex and identity of a talker without natural vocal timbre,” Perception & psychophysics, vol. 59, no. 6, pp. 839–849, 1997.
  • [7] M. Yousefi, N. Shokouhi, and J. Hansen, “Assessing speaker engagement in 2-person debates: Overlap detection in united states presidential debates,” in Proc. Interspeech 2018, 2018, pp. 2117–2121.
  • [8] J. Hansen, H. Ali, J. Saba, R. Chandrashekhar, N. Mamun, R. Ghosh, and A. Brueggeman, “Cci-mobile: Design and evaluation of a cochlear implant and hearing aid research platform for speech scientists and engineers.” IEEE EMBS Inter Conf. Biomedical and health informatics (BHI-19), Chicago, IL, May 19-22, 2019.
  • [9] J. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal processing magazine, vol. 32, no. 6, pp. 74–99, 2015.
  • [10] M. I. Khalil, N. Mamun, and K. Akter, “A robust text dependent speaker identification using neural responses from the model of the auditory system,” in 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE).   IEEE, 2019, pp. 1–4.
  • [11] S. Chowdhury, N. Mamun, A. A. S. Khan, and F. Ahmed, “Text dependent and independent speaker recognition using neural responses from the model of the auditory system,” in 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE).   IEEE, 2017, pp. 871–874.
  • [12] W. Xia, J. Huang, and J. H. Hansen, “Cross-lingual text-independent speaker verification using unsupervised adversarial discriminative domain adaptation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5816–5820.
  • [13] M. A. Islam, W. A. Jassim, N. S. Cheok, and M. S. A. Zilany, “A robust speaker identification system using the responses from a model of the auditory periphery,” PloS one, vol. 11, no. 7, p. e0158520, 2016.
  • [14] V. Tiwari, “Mfcc and its applications in speaker recognition,” International journal on emerging technologies, vol. 1, no. 1, pp. 19–22, 2010.
  • [15] S. W. Teoh, H. S. Neuburger, and M. A. Svirsky, “Acoustic and electrical pattern analysis of consonant perceptual cues used by cochlear implant users,” Audiology and Neurotology, vol. 8, no. 5, pp. 269–285, 2003.
  • [16] N. Mamun, S. Khorram, and J. H. Hansen, “Convolutional neural network-based speech enhancement for cochlear implant recipients,” in Interspeech, 2019.
  • [17] G. M. Clark, “The university of melbourne/cochlear corporation (nucleus) program.” Otolaryngologic Clinics of North America, vol. 19, no. 2, pp. 329–354, 1986.
  • [18] P. Arndt, “Within subject comparison of advanced coding strategies in the nucleus 24 cochlear implant,” in 1999 Conference on Implantable Auditory Prostheses, Asilomar, CA, 1999.
  • [19] M. Brookes, “Voicebox: Speech processing toolbox for matlab. software,[mar 2011],” 1997.
  • [20] M. W. Skinner, L. K. Holden, L. A. Whitford, K. L. Plant, C. Psarros, and T. A. Holden, “Speech recognition with the nucleus 24 speak, ace, and cis speech coding strategies in newly implanted adults,” Ear and hearing, vol. 23, no. 3, pp. 207–223, 2002.
  • [21] M. Vongphoe and F.-G. Zeng, “Speaker recognition with temporal cues in acoustic and electric hearing,” The Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 1055–1061, 2005.
  • [22] E. P. Wilkinson, O. Abdel-Hamid, J. J. Galvin III, H. Jiang, and Q.-J. Fu, “Voice conversion in cochlear implantation,” The Laryngoscope, vol. 123, no. S3, pp. S29–S43, 2013.
  • [23] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using gmm supervectors for speaker verification,” IEEE signal processing letters, vol. 13, no. 5, pp. 308–311, 2006.
  • [24] J. A. Bilmes et al., “A gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models,” International Computer Science Institute, vol. 4, no. 510, p. 126, 1998.
  • [25] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000.
  • [26] N. Mamun, W. A. Jassim, and M. S. Zilany, “Robust gender classification using neural responses from the model of the auditory system,” in 2014 IEEE 19th International Functional Electrical Stimulation Society Annual Conference (IFESS).   IEEE, 2014, pp. 1–4.
  • [27] G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, “The nist speaker recognition evaluation–overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, no. 2-3, pp. 225–254, 2000.
  • [28] J. Campbell and A. Higgins, “Yoho speaker verification,” Linguistic Data Consortium, Philadelphia, 1994.