I Introduction
Intelligibility is defined as the proportion of words correctly identified by a listener and is a natural measure for quantifying the effectiveness of speech-based communication systems [1]. Although listening tests can provide valid data, such tests are time-consuming to conduct. For this reason, instrumental intelligibility metrics that are correlated with intelligibility and quick to compute are often preferred.
We can distinguish two types of instrumental intelligibility metrics: intrusive and non-intrusive. Intrusive intelligibility metrics require knowledge of the clean speech and either the communication channel or degraded speech, whereas non-intrusive intelligibility metrics require only the degraded speech. In this paper we develop a new intrusive intelligibility metric based on information theory [2].
Existing intrusive intelligibility metrics include the speech intelligibility index (SII) [3], the speech transmission index (STI) [4], the coherence SII (CSII) [5], the extended SII (ESII) [6], the normalized covariance measure (NCM) [7, 8], the hearing-aid speech perception index (HASPI) [9], the short-time objective intelligibility measure (STOI) [10], the extended STOI (ESTOI) [11], the speech-based envelope power spectrum model (sEPSM) [12, 13, 14], and the glimpse proportion metric (GP) [15, 16, 17]. As a group, the above algorithms have been successful at predicting speech intelligibility in a wide range of conditions including additive noise, filtering, reverberation, and nonlinear enhancement. However, each intelligibility metric tends to perform well for only a narrow subset of conditions. This is because the above algorithms were heuristically motivated and were often designed with a specific type of distortion or data set in mind.
Information theory provides a mathematical framework for modelling communication systems. Information theoretical concepts have previously been used in the analysis of linguistics [18, 19], speech production [20], and human hearing [21]. Additionally, state-of-the-art speech enhancement algorithms [20, 22] and intelligibility metrics [23, 24, 25] that are based on information theory have been developed.
Existing information theoretic intelligibility metrics, such as the mutual information k-nearest neighbour metric (MIKNN) [23], assume that speech can be described by a memoryless stochastic process and that the energy of a speech signal at one time-frequency location is statistically independent of the energy at all other time-frequency locations. In reality neither of these assumptions is valid, which leads to an overestimate of the information shared between a talker and a listener.
In this paper we propose a conceptually simple intelligibility metric called SIIB. SIIB is a function of a clean acoustic signal produced by a talker and a degraded signal that is received by a listener. As described in Section II and Section III, the acoustic signals are converted to a representation of speech based on a crude model of the human auditory system. A nonparametric estimate of the mutual information rate of the signals is then computed. Unlike existing metrics, SIIB partially accounts for time-frequency dependencies in the speech signals using the Karhunen-Loève transform (KLT) [26] and incorporates the theory developed in [20] to account for the effect that talker variability has on the information rate. In Section IV and Section V, SIIB is evaluated by comparing its performance to STOI [10], ESTOI [11], and MIKNN [23] for speech degraded by noise and processed by enhancement algorithms.
II Model of Speech Communication
In this section we present a theoretical model of speech communication similar to that described in [20, 25], and [27]. The model considers the transmission of a message from a talker to a listener. Stochastic processes are denoted by $\{\cdot\}$, random variables are denoted by bold font, and their realisations are denoted by regular font.
II-A The Communication Channel
A message $\{\mathbf{M}_t\}$, speech signal $\{\mathbf{X}_t\}$, and degraded speech signal $\{\mathbf{Y}_t\}$ are represented by ergodic stationary discrete-time vector-valued random processes where $t$ is the time index. The message can be thought of as a sequence of latent variables that represent, for example, a sequence of sentences, phonemes, or neural states. The talker encodes the message into a speech signal according to a conditional probability distribution $p_{\mathbf{X} \mid \mathbf{M}}$. In this way, the variability of different talkers encoding the same message into a speech signal is incorporated into the model.
The speech signal is transmitted to a listener through a communication channel that may distort the signal. Examples of distortion include noise, reverberation, speech coding algorithms, and speech enhancement algorithms. Overall, the communication process is described by a Markov chain:
$$\{\mathbf{M}_t\} \rightarrow \{\mathbf{X}_t\} \rightarrow \{\mathbf{Y}_t\}. \tag{1}$$
We call $\{\mathbf{M}_t\} \rightarrow \{\mathbf{X}_t\}$ the speech production channel and $\{\mathbf{X}_t\} \rightarrow \{\mathbf{Y}_t\}$ the environmental channel.
The representation of speech used in this paper is based on a crude model of the human auditory system and was motivated using information theoretic arguments in [21] and [27]. Let $\{\mathbf{x}(i)\}$ be a real-valued random process that represents the samples of an acoustic speech signal where $i$ is the sample index and let $\hat{\mathbf{x}}_t$ be the short-time Fourier transform (STFT) of $\{\mathbf{x}(i)\}$ where $t$ is the frame index. We define $\mathbf{X}_t$ as an $\mathbb{R}^J$-valued random variable that represents auditory log-spectra given by
$$\mathbf{X}_t = \log\left(G\,|\hat{\mathbf{x}}_t|^2\right), \tag{2}$$
where $G$ is a matrix that represents an auditory filterbank, and the logarithm and squared magnitude operators are applied elementwise. To account for temporal masking in the auditory system, the masking function described in [28] is applied to $\mathbf{X}_t$. The degraded speech $\mathbf{Y}_t$ is defined similarly.
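As a rough illustration of the auditory log-spectrum in (2), the sketch below processes a single frame in plain Python. The naive DFT, the two-band rectangular filterbank `G`, and the floor `eps` are assumptions made here for illustration; they stand in for, and do not reproduce, the auditory filterbank and masking function used in the paper.

```python
import cmath
import math

def power_spectrum(frame):
    """Squared magnitude of the one-sided DFT of a real frame (naive O(N^2) DFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spec.append(abs(s) ** 2)
    return spec

def log_spectrum(frame, G, eps=1e-12):
    """Apply filterbank matrix G to the power spectrum, then take the elementwise log.

    eps is a hypothetical floor that avoids log(0); it is not part of (2).
    """
    p = power_spectrum(frame)
    return [math.log(sum(g * v for g, v in zip(row, p)) + eps) for row in G]

# Toy frame: 8 samples of a sinusoid that falls exactly in DFT bin 1.
frame = [math.cos(2 * math.pi * i / 8) for i in range(8)]

# Hypothetical two-band filterbank over the 5 one-sided bins (low band, high band).
G = [[1, 1, 1, 0, 0],
     [0, 0, 0, 1, 1]]

X_t = log_spectrum(frame, G)
print(X_t)  # the low band carries almost all of the energy
```

Each row of `G` pools the power of several DFT bins into one band, so the output vector has one entry per filter.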
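II-B Information Rate of the Communication Channel
The proposed intelligibility metric is based on the hypothesis that intelligibility is a function of the mutual information rate between the message and the degraded speech. Let $\mathbf{M}^K = [\mathbf{M}_1^{\mathsf{T}}, \mathbf{M}_2^{\mathsf{T}}, \ldots, \mathbf{M}_K^{\mathsf{T}}]^{\mathsf{T}}$, where $\mathsf{T}$ denotes the transpose, be a vector obtained by stacking $K$ consecutive message vectors and similarly for $\mathbf{Y}^K$. The mutual information rate is defined by
$$I(\{\mathbf{M}_t\}; \{\mathbf{Y}_t\}) = \lim_{K \to \infty} \frac{1}{K}\, I(\mathbf{M}^K; \mathbf{Y}^K), \tag{3}$$
where $I(\mathbf{M}^K; \mathbf{Y}^K)$ is the mutual information between $\mathbf{M}^K$ and $\mathbf{Y}^K$ given by
$$I(\mathbf{M}^K; \mathbf{Y}^K) = \iint p(m^K, y^K) \log_2 \frac{p(m^K, y^K)}{p(m^K)\, p(y^K)} \, dm^K \, dy^K. \tag{4}$$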
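To estimate (3), realisations of $\{\mathbf{M}_t\}$ and $\{\mathbf{Y}_t\}$ are needed. Estimating a realisation of $\{\mathbf{M}_t\}$ requires a chorus of speech signals (see [27]). In typical applications of intelligibility prediction, such a chorus is not available, so instead we use an upper bound on (3). By applying the data processing inequality twice we have [29]
$$I(\{\mathbf{M}_t\}; \{\mathbf{Y}_t\}) \le \min\big( I(\{\mathbf{M}_t\}; \{\mathbf{X}_t\}),\; I(\{\mathbf{X}_t\}; \{\mathbf{Y}_t\}) \big). \tag{5}$$
In the case of a distortionless environmental channel, $I(\{\mathbf{X}_t\}; \{\mathbf{Y}_t\})$ is unbounded from above and $I(\{\mathbf{M}_t\}; \{\mathbf{Y}_t\})$ saturates at the information rate of the speech production channel [20]. This maximum information rate is determined by the variability in pronunciation between different talkers. The following subsections describe how $I(\{\mathbf{X}_t\}; \{\mathbf{Y}_t\})$ and $I(\{\mathbf{M}_t\}; \{\mathbf{X}_t\})$ can be calculated.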
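For reference, the two applications of the data processing inequality behind the upper bound (5) can be written out as follows. The sketch assumes only that the message, the speech signal, and the degraded speech form a Markov chain in that order; the symbols $\{\mathbf{M}_t\}$, $\{\mathbf{X}_t\}$, and $\{\mathbf{Y}_t\}$ for these three processes are used here for illustration.

```latex
\begin{align}
I(\{\mathbf{M}_t\};\{\mathbf{Y}_t\}) &\le I(\{\mathbf{M}_t\};\{\mathbf{X}_t\})
  && \text{(the environmental channel cannot add information about the message)} \\
I(\{\mathbf{M}_t\};\{\mathbf{Y}_t\}) &\le I(\{\mathbf{X}_t\};\{\mathbf{Y}_t\})
  && \text{(the degraded speech depends on the message only through the speech)} \\
\Rightarrow\quad I(\{\mathbf{M}_t\};\{\mathbf{Y}_t\}) &\le
  \min\big( I(\{\mathbf{M}_t\};\{\mathbf{X}_t\}),\; I(\{\mathbf{X}_t\};\{\mathbf{Y}_t\}) \big)
\end{align}
```

The first term of the minimum depends only on the talker, while the second depends on the environmental channel, which is why the two can be estimated separately.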
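II-C Information Rate of the Environmental Channel
The mutual information rate of the environmental channel is given by
$$I(\{\mathbf{X}_t\}; \{\mathbf{Y}_t\}) = \lim_{K \to \infty} \frac{1}{K}\, I(\mathbf{X}^K; \mathbf{Y}^K). \tag{6}$$
Estimating the mutual information between vectors of high dimensionality is a challenging task [30], particularly when the vector elements have strong statistical dependencies [31]. For this reason we introduce an invertible transform $\mathcal{T}$ that aims to remove the dependencies between the vector elements.
Let $\tilde{\mathbf{X}}^K = \mathcal{T}(\mathbf{X}^K)$ and $\tilde{\mathbf{Y}}^K = \mathcal{T}(\mathbf{Y}^K)$. In the following we assume that the elements of $\tilde{\mathbf{X}}^K$ can be approximated as statistically independent, and likewise for $\tilde{\mathbf{Y}}^K$. Then (6) can be decomposed into a summation:
$$I(\{\mathbf{X}_t\}; \{\mathbf{Y}_t\}) = \lim_{K \to \infty} \frac{1}{K}\, I(\tilde{\mathbf{X}}^K; \tilde{\mathbf{Y}}^K) \approx \lim_{K \to \infty} \frac{1}{K} \sum_{j} I(\tilde{\mathbf{X}}^K_j; \tilde{\mathbf{Y}}^K_j), \tag{7}$$
where $j$ denotes the element index in the vector.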
Finding an invertible transform that simultaneously removes the dependencies in both $\mathbf{X}^K$ and $\mathbf{Y}^K$ is difficult. Early speech recognition systems used the discrete cosine transform (DCT), which results in Mel-frequency cepstral coefficients [32]. It can be shown that the DCT approximates the Karhunen-Loève transform (KLT) for stationary signals [33]. The KLT is the transformation that we use here and it is given by:
$$\tilde{\mathbf{X}}^K = U\left(\mathbf{X}^K - E[\mathbf{X}^K]\right) \tag{8}$$
and
$$\tilde{\mathbf{Y}}^K = U\left(\mathbf{Y}^K - E[\mathbf{Y}^K]\right), \tag{9}$$
where $U$ is a matrix with rows equal to the unit-magnitude eigenvectors of the covariance matrix of $\mathbf{X}^K$ and $E[\cdot]$ is the expected value operator. The KLT ensures that the elements of $\tilde{\mathbf{X}}^K$ are statistically uncorrelated, and if $\mathbf{X}^K$ is Gaussian, which is a reasonable approximation, then the elements are also statistically independent.
The KLT does not guarantee the same properties for $\tilde{\mathbf{Y}}^K$ unless $\mathbf{Y}^K$ is also Gaussian and has a covariance matrix equal to a scalar multiple of the covariance matrix of $\mathbf{X}^K$. In practice the environmental channel can result in non-Gaussian $\mathbf{Y}^K$ or can introduce statistical dependencies in $\mathbf{Y}^K$ that are not present in $\mathbf{X}^K$. An example of the latter is a reverberant channel. In this case, the statistical dependencies in the source are accounted for by the KLT, but the statistical dependencies in the received signal are not accounted for. The consequence is that (7) underestimates the mutual information rate. Although the KLT does not meet all of the requirements for the transform, we found that it improves performance.
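The decorrelating property of the KLT can be verified in a few lines. The sketch below assumes the names $C_X$ and $C_Y$ for the covariance matrices of the stacked clean and degraded vectors, $U$ for the matrix whose rows are the unit-magnitude eigenvectors of $C_X$, $\Lambda$ for the diagonal matrix of the corresponding eigenvalues, and $c$ for a positive scalar; these symbols are chosen here for illustration.

```latex
\begin{align}
C_X U^{\mathsf{T}} &= U^{\mathsf{T}} \Lambda
  \quad\Longrightarrow\quad C_X = U^{\mathsf{T}} \Lambda U
  && \text{(eigendecomposition, with } U U^{\mathsf{T}} = I \text{)} \\
\operatorname{Cov}\big(U \mathbf{X}^K\big) &= U C_X U^{\mathsf{T}}
  = U U^{\mathsf{T}} \Lambda\, U U^{\mathsf{T}} = \Lambda
  && \text{(diagonal: clean elements are uncorrelated)} \\
\operatorname{Cov}\big(U \mathbf{Y}^K\big) &= U C_Y U^{\mathsf{T}}
  = c\,\Lambda \quad \text{if } C_Y = c\, C_X
  && \text{(not diagonal in general)}
\end{align}
```

Subtracting the mean does not change a covariance matrix, so the same result holds for the mean-removed vectors.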
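II-D Information Rate of the Speech Production Channel
Approximating $\mathbf{M}^K$ and $\mathbf{X}^K$ as Gaussian, the information rate of the speech production channel is
$$I(\{\mathbf{M}_t\}; \{\mathbf{X}_t\}) = \lim_{K \to \infty} \frac{1}{K} \sum_{j} I(\tilde{\mathbf{M}}^K_j; \tilde{\mathbf{X}}^K_j) = \lim_{K \to \infty} \frac{1}{K} \sum_{j} -\frac{1}{2} \log_2\left(1 - r_j^2\right), \tag{10}$$
where $\tilde{\mathbf{M}}^K$ is defined similarly to $\tilde{\mathbf{X}}^K$ and $r_j$, the correlation coefficient between $\tilde{\mathbf{M}}^K_j$ and $\tilde{\mathbf{X}}^K_j$, is called the production noise correlation coefficient. The production noise correlation coefficient describes the efficiency of encoding a message into a speech signal according to $p_{\mathbf{X} \mid \mathbf{M}}$. Based on the measurements in [25] and [27], this paper uses the same value of $r_j$ for all $j$.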
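For two jointly Gaussian scalar variables with correlation coefficient $r$, the mutual information is $-\tfrac{1}{2}\log_2(1 - r^2)$ bits, which is the per-component term that the Gaussian approximation leads to. The sketch below evaluates this expression; the function name and the example values of `r` are illustrative assumptions and are not the value used in the paper.

```python
import math

def gaussian_information_bits(r):
    """Mutual information in bits between two jointly Gaussian scalars with correlation r."""
    if not -1.0 < r < 1.0:
        raise ValueError("correlation coefficient must lie strictly between -1 and 1")
    return -0.5 * math.log2(1.0 - r * r)

# Illustrative values only: the information grows without bound as |r| approaches 1.
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f} -> {gaussian_information_bits(r):.3f} bits")
```

A correlation of zero gives zero bits, which corresponds to a talker whose speech carries no information about the message.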
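III Proposed Intelligibility Metric
The proposed intelligibility metric combines (7), (10), and (5) to give an estimate of the amount of information shared between $\{\mathbf{M}_t\}$ and $\{\mathbf{Y}_t\}$ in bits per second. It is given by
$$\mathrm{SIIB} = \frac{F}{K} \sum_{j} \min\left( -\frac{1}{2} \log_2\left(1 - r_j^2\right),\; I(\tilde{\mathbf{X}}^K_j; \tilde{\mathbf{Y}}^K_j) \right), \tag{11}$$
where $F$ is the frame rate in Hz.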
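One way to read the combination of (7), (10), and (5) is as a per-component cap: the environmental-channel information of each transformed component is limited by the production-channel information, and the sum is converted from bits per stacked vector to bits per second. The sketch below assumes that reading. The names `mi_bits`, `r`, `frame_rate_hz`, and `K` are hypothetical, the clipping of negative estimates to zero is an added assumption, and the numbers are made up for illustration.

```python
import math

def combine_information(mi_bits, r, frame_rate_hz, K):
    """Cap each per-component information estimate and convert the sum to bits per second.

    mi_bits       : per-component mutual information estimates (bits per stacked vector)
    r             : production noise correlation coefficient, assumed equal for all components
    frame_rate_hz : number of frames per second
    K             : number of frames stacked into each vector
    """
    cap = -0.5 * math.log2(1.0 - r * r)  # Gaussian production-channel information
    # Assumption: negative estimates (possible with k-NN estimators) are clipped to zero.
    total = sum(min(cap, max(0.0, mi)) for mi in mi_bits)
    return frame_rate_hz / K * total

# Made-up example: one weakly informative component and one that hits the cap.
rate = combine_information([0.1, 5.0], r=0.5, frame_rate_hz=80.0, K=15)
print(f"{rate:.3f} b/s")
```

Dividing by `K` gives bits per frame, and multiplying by the frame rate gives bits per second.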
We now describe our implementation. An estimate of $I(\tilde{\mathbf{X}}^K_j; \tilde{\mathbf{Y}}^K_j)$ is computed by applying a k-nearest neighbour mutual information estimator [34] to observed sample sequences of $\tilde{\mathbf{X}}^K_j$ and $\tilde{\mathbf{Y}}^K_j$. To obtain these sequences, a clean acoustic speech signal and a degraded signal are resampled to a sampling rate of 16 kHz. An energy-based voice activity detector with a 40 dB threshold is applied to remove silent segments. Subsequently, the signals are transformed to the STFT domain using a 400-point Hann window with 50% overlap. This gives a frame rate of $F = 80$ Hz, which is sufficient for capturing the spectral modulations required for high intelligibility [35].
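To make the k-nearest neighbour estimation step concrete, the sketch below implements a basic estimator in the style of [34] for paired scalar samples, using only the Python standard library. It is a brute-force illustration under stated assumptions (first estimator variant, max-norm distances, `k = 3`, synthetic Gaussian data) and is not the paper's implementation.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma function at a positive integer n, via the harmonic-number identity."""
    return -EULER_GAMMA + sum(1.0 / m for m in range(1, n))

def knn_mutual_information(x, y, k=3):
    """k-NN estimate, in bits, of the mutual information between paired scalar samples."""
    n = len(x)
    acc = 0.0
    for i in range(n):
        # Max-norm distances from sample i to every other sample in the joint space.
        dists = sorted(max(abs(x[i] - x[j]), abs(y[i] - y[j]))
                       for j in range(n) if j != i)
        eps = dists[k - 1]  # distance to the k-th nearest neighbour
        # Count neighbours strictly closer than eps in each marginal space.
        nx = sum(1 for j in range(n) if j != i and abs(x[i] - x[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(y[i] - y[j]) < eps)
        acc += digamma_int(nx + 1) + digamma_int(ny + 1)
    nats = digamma_int(k) + digamma_int(n) - acc / n
    return nats / math.log(2)

# Synthetic check on Gaussian data with known correlation (illustrative values).
random.seed(0)
N, r = 300, 0.9
x_demo = [random.gauss(0.0, 1.0) for _ in range(N)]
y_dep = [r * a + math.sqrt(1.0 - r * r) * random.gauss(0.0, 1.0) for a in x_demo]
y_ind = [random.gauss(0.0, 1.0) for _ in range(N)]

mi_dep = knn_mutual_information(x_demo, y_dep)
mi_ind = knn_mutual_information(x_demo, y_ind)
print(f"dependent: {mi_dep:.2f} bits, independent: {mi_ind:.2f} bits")
```

For the dependent pair the estimate should lie near the Gaussian value $-\tfrac{1}{2}\log_2(1 - 0.9^2) \approx 1.2$ bits, and for the independent pair it should lie near zero. The brute-force search costs $O(N^2)$ per component, so a practical implementation would use a spatial index.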
A gammatone filterbank [36] that includes $J$ filters linearly spaced on the ERB-rate scale [37] between 100 Hz and 6500 Hz is used to obtain $x_t$ and $y_t$ according to (2). A sequence of stacked vectors for the clean speech is then formed by stacking $K$ consecutive vectors:
$$x^K_t = [x_t^{\mathsf{T}}, x_{t+1}^{\mathsf{T}}, \ldots, x_{t+K-1}^{\mathsf{T}}]^{\mathsf{T}}, \tag{12}$$
and similarly for $y^K_t$. Setting $K = 15$ means that dependencies spanning 187.5 ms are considered. For comparison, the mean duration of a phoneme is 80 ms [38]. The sample covariance matrix of $x^K_t$ is computed and the KLT in (8) and (9) is applied to obtain $\tilde{x}^K_t$ and $\tilde{y}^K_t$.
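The framing arithmetic and the stacking step can be sketched as follows. The frame rate follows from the 16 kHz sampling rate and the 400-point window with 50% overlap; `K = 15` is the stacking length implied by a 187.5 ms span at that frame rate. The function name `stack_frames` and the toy two-band frames are assumptions made for illustration.

```python
def stack_frames(frames, K):
    """Concatenate K consecutive frame vectors into one long vector per start index."""
    return [[v for frame in frames[t:t + K] for v in frame]
            for t in range(len(frames) - K + 1)]

fs_hz = 16000                        # sampling rate after resampling
window = 400                         # window length in samples
hop = window // 2                    # 50% overlap gives a hop of 200 samples
frame_rate_hz = fs_hz / hop          # frames per second
K = 15                               # number of stacked frames
span_ms = 1000 * K / frame_rate_hz   # temporal span covered by one stacked vector

# Toy log-spectra: 20 frames with 2 bands each (made-up values).
frames = [[float(t), float(-t)] for t in range(20)]
stacked = stack_frames(frames, K)

print(frame_rate_hz, span_ms, len(stacked), len(stacked[0]))
```

With `T` frames of dimension `J`, stacking yields `T - K + 1` vectors of dimension `J * K`, which is the dimensionality that the KLT then has to decorrelate.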
IV Evaluation Procedures
This section describes the procedures used to evaluate SIIB. The evaluation considered four intelligibility data sets and used two performance measures to quantify the strength of the relationship between SIIB and intelligibility.
IV-A Intelligibility Data Sets
IV-A1 JensenSCNR
The first data set consists of speech subjected to single-channel noise reduction. In [39] phrases from the Dutch version of the Hagerman test [40, 41] were degraded by speech-shaped noise (SSN) at five SNRs and processed by three noise reduction algorithms. The three algorithms compute a minimum mean-squared error estimate of the clean speech by multiplying the short-time spectral magnitude of the degraded speech with a gain function. In total there are 5 SNRs × (3 algorithms + 1 unprocessed) = 20 conditions. The stimuli were presented to 13 normal-hearing subjects for identification.
IV-A2 KleijnPRE
The second data set consists of speech subjected to pre-processing enhancement and degraded by noise. In [20] phrases from the Dutch version of the Hagerman test were subjected to three pre-processing enhancement algorithms and then degraded either by SSN at four SNRs, or by car noise at four SNRs. The three enhancement algorithms optimally redistribute the energy of the clean speech according to a distortion criterion. In total there are 2 noise types × 4 SNRs × (3 algorithms + 1 unprocessed) = 32 conditions. The stimuli were presented to nine normal-hearing listeners for identification.
IV-A3 CookePRE
The third data set also consists of speech subjected to pre-processing enhancement. In [42] Harvard sentences [43] were processed by 19 pre-processing enhancement algorithms and degraded either by SSN at three SNRs, or by speech from a competing talker at three SNRs. The stimuli were presented to 175 normal-hearing listeners for identification. For this paper, a subset of the data in [42] was considered because the entire data set was not available. Ten of the Harvard sentences and nine of the enhancement algorithms were used. The algorithms are referred to in [42] as AdaptDRC, F0shift, IWFEMD, on/offset, OptimalSII, RESSYSMOD, SBM, SEO, and SSS. In total there are 2 noise types × 3 SNRs × (9 algorithms + 1 unprocessed) = 60 conditions.
IV-A4 KjemsITFS
The fourth data set consists of speech subjected to ideal time-frequency segregation (ITFS) processing. In [44] phrases from the Dantale II corpus [45] were degraded by four types of noise: SSN, cafeteria noise, noise from a bottling factory, and car noise. For each noise type, the degraded signals were processed by two types of ITFS called an ideal binary mask and a target binary mask. Three SNRs were used (one fixed SNR, and SNRs corresponding to 20% and 50% intelligibility) and eight variants of each ITFS algorithm were considered. In total there are 168 conditions. The stimuli were presented to 15 normal-hearing subjects for identification.
IV-B Performance Measures
The most important characteristic of an intelligibility metric is that it has a strong monotonic increasing relationship with intelligibility. This paper uses two performance measures to quantify the strength of the relationship: Kendall’s tau coefficient $\tau$ [46], and Pearson’s correlation coefficient $\rho$. To use $\rho$ effectively, the relationship between the metric score and intelligibility must be linear. For this reason, a monotonic function is applied to the metric score to linearise the relationship:
(13) 
where the free parameters of the function are fit to each data set to minimise the mean squared error between the transformed metric score and intelligibility over all conditions. These free parameters are affected by the speech corpus, apparatus, and experimental procedures used during the listening test. Pearson’s correlation coefficient between the transformed metric score and intelligibility is then computed.
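The two performance measures can be sketched in plain Python as follows. Kendall's tau is computed in its simplest form (concordant minus discordant pairs, divided by the number of pairs), and Pearson's coefficient is computed directly on the scores without the fitted monotonic function, since the form of that function is not assumed here. The data are made up for illustration.

```python
import math

def pearson(u, v):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def kendall_tau(u, v):
    """Kendall's tau-a: (concordant - discordant) / total pairs; tied pairs count as zero."""
    n = len(u)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (u[i] - u[j]) * (v[i] - v[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Made-up metric scores and intelligibility (% words correct) for five conditions.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
intelligibility = [10.0, 35.0, 30.0, 80.0, 95.0]

print(f"tau = {kendall_tau(scores, intelligibility):.2f}")
print(f"rho = {pearson(scores, intelligibility):.2f}")
```

Kendall's tau depends only on the rank ordering of the conditions, so it is unaffected by any monotonic increasing transformation of the scores, whereas Pearson's coefficient is not.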
V Results
The performance of SIIB is compared to three state-of-the-art intelligibility metrics: STOI [10], ESTOI [11], and MIKNN [23]. Fig. 1 shows scatter plots for each data set and each intelligibility metric. The vertical axis shows the intelligibility and the horizontal axis shows the score computed by an intelligibility metric. Each point represents a different condition in the data set. The function in (13) that is used to linearise the relationship is also shown. Table I displays $\tau$ for each data set and metric and, similarly, Table II displays $\rho$.
The row of scatter plots corresponding to KleijnPRE shows that all of the reference metrics struggle to predict the effect that optimal energy redistribution has on intelligibility. In contrast, SIIB is strongly correlated with intelligibility for this data set ($\tau = 0.86$ and $\rho = 0.98$).
For CookePRE all of the metrics have reasonable performance except for STOI. This is in agreement with [11], which showed that STOI performs poorly for speech degraded by modulated noise sources such as interfering talkers. An assumption sometimes made by the speech processing community is that in order to predict intelligibility for modulated noise sources, statistics have to be averaged over short-time segments to capture the effect of ‘listening for glimpses of clean speech’ [15]. It is then surprising that SIIB performs well on this data set ($\tau = 0.76$ and $\rho = 0.95$) because SIIB is based on global statistics only.
Compared to the reference metrics, SIIB has excellent performance for JensenSCNR, KleijnPRE, and CookePRE, but poorer performance for KjemsITFS ($\tau = 0.73$ and $\rho = 0.88$). In [47] seventeen intelligibility metrics were evaluated using KjemsITFS and only five of them achieved a high correlation with intelligibility. SIIB may not perform as well on KjemsITFS because ITFS processing generates some stimuli with distortions that are not normally encountered in nature. For these stimuli it is plausible that humans are poor decoders. SIIB may correctly estimate the mutual information rate, but humans may be unable to efficiently use all of the information. This hypothesis could be tested by extensively training listeners to decode ITFS processed speech before conducting a listening test.
Notice that for maximum intelligibility, SIIB estimates an information rate that is higher than estimates based on linguistic models of speech communication, where the information rate is 50 to 100 b/s [48, 49, 50]. This overestimate is likely the consequence of approximating $\mathbf{X}^K$ as Gaussian. Since $\mathbf{X}^K$ is only approximately Gaussian, the KLT does not remove all statistical dependencies. Accounting for the remaining dependencies would give a lower information rate.
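TABLE I: Kendall’s tau coefficient ($\tau$) for each data set and intelligibility metric.
Data set    | MIKNN | STOI | ESTOI | SIIB
JensenSCNR  | 0.68  | 0.89 | 0.83  | 0.92
KleijnPRE   | 0.71  | 0.70 | 0.58  | 0.86
CookePRE    | 0.72  | 0.56 | 0.77  | 0.76
KjemsITFS   | 0.71  | 0.82 | 0.81  | 0.73
Mean        | 0.71  | 0.75 | 0.75  | 0.82

TABLE II: Pearson’s correlation coefficient ($\rho$) for each data set and intelligibility metric.
Data set    | MIKNN | STOI | ESTOI | SIIB
JensenSCNR  | 0.86  | 0.99 | 0.98  | 0.99
KleijnPRE   | 0.80  | 0.91 | 0.81  | 0.98
CookePRE    | 0.90  | 0.69 | 0.95  | 0.95
KjemsITFS   | 0.88  | 0.96 | 0.95  | 0.88
Mean        | 0.86  | 0.89 | 0.92  | 0.95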
VI Conclusion
In this paper we proposed an intrusive instrumental intelligibility metric called SIIB. SIIB is based on the hypothesis that intelligibility is related to the amount of information shared between a clean and degraded speech signal in bits per second. Compared to existing metrics, SIIB is conceptually simple, theoretically motivated, and has high performance. According to Occam’s razor, these properties suggest that SIIB might generalise well to new data sets. A MATLAB implementation is available at https://stevenvankuyk.com/matlab_code/
References
 [1] J. B. Allen, “Articulation and intelligibility,” Synthesis Lectures on Speech and Audio Processing, vol. 1, no. 1, pp. 1–124, 2005.
 [2] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
 [3] “American national standard methods for calculation of the speech intelligibility index,” ANSI/ASA S3.5-1997 (R2012), 2012.
 [4] T. Houtgast and H. J. M. Steeneken, “Evaluation of speech transmission channels by using artificial signals,” Acustica, vol. 25, no. 6, pp. 355–367, 1971.
 [5] J. M. Kates and K. H. Arehart, “Coherence and the speech intelligibility index,” J. Acoust. Soc. Amer., vol. 117, no. 4, pp. 2224–2237, 2005.
 [6] K. S. Rhebergen and N. J. Versfeld, “A speech intelligibility index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Amer., vol. 117, no. 4, pp. 2181–2192, 2005.
 [7] R. Koch, Auditory sound analysis for the prediction and improvement of speech intelligibility. Goettingen, Germany: University of Goettingen, 1992.
 [8] R. L. Goldsworthy and J. E. Greenberg, “Analysis of speech-based speech transmission index methods with implications for nonlinear operations,” J. Acoust. Soc. Amer., vol. 116, no. 6, pp. 3679–3689, 2004.
 [9] J. M. Kates and K. H. Arehart, “The hearing-aid speech perception index,” Speech Commun., vol. 65, pp. 75–93, 2014.
 [10] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Trans. Audio, Speech Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
 [11] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
 [12] S. Jørgensen and T. Dau, “Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing,” J. Acoust. Soc. Amer., vol. 130, no. 3, pp. 1475–1487, 2011.
 [13] S. Jørgensen, S. D. Ewert, and T. Dau, “A multi-resolution envelope-power based model for speech intelligibility,” J. Acoust. Soc. Amer., vol. 134, no. 1, pp. 436–446, 2013.
 [14] H. Relaño-Iborra, T. May, J. Zaar, C. Scheidiger, and T. Dau, “Predicting speech intelligibility based on a correlation metric in the envelope power spectrum domain,” J. Acoust. Soc. Amer., vol. 140, no. 4, pp. 2670–2679, 2016.
 [15] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Amer., vol. 119, no. 3, pp. 1562–1573, 2006.
 [16] J. Barker and M. Cooke, “Modelling speaker intelligibility in noise,” Speech Commun., vol. 49, no. 5, pp. 402–417, 2007.
 [17] Y. Tang and M. Cooke, “Glimpse-based metrics for predicting speech intelligibility in additive noise conditions,” in Proc. Interspeech, 2016, pp. 2488–2492.
 [18] C. E. Shannon, “Prediction and entropy of printed English,” Bell Labs Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
 [19] F. Pellegrino, C. Coupé, and E. Marsico, “A cross-language perspective on speech information rate,” Language, vol. 87, no. 3, pp. 539–558, 2011.
 [20] W. B. Kleijn and R. C. Hendriks, “A simple model of speech communication and its application to intelligibility enhancement,” IEEE Signal Process. Lett., vol. 22, no. 3, pp. 303–307, 2015.
 [21] E. C. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature, vol. 439, no. 7079, pp. 978–982, 2006.
 [22] S. Khademi, R. Hendriks, and W. B. Kleijn, “Intelligibility enhancement based on mutual information,” IEEE/ACM Trans. Audio, Speech Lang. Process., 2017.
 [23] J. Taghia and R. Martin, “Objective intelligibility measures based on mutual information for speech subjected to speech enhancement processing,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 22, no. 1, pp. 6–16, 2014.
 [24] J. Jensen and C. H. Taal, “Speech intelligibility prediction based on mutual information,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 22, no. 2, pp. 430–440, 2014.
 [25] S. Van Kuyk, W. B. Kleijn, and R. C. Hendriks, “An intelligibility metric based on a simple model of speech communication,” in Proc. IEEE. Int. Workshop on Acoust. Speech Enhancement (IWAENC), 2016, pp. 1–5.
 [26] K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Helsinki, Finland: Universitat Helsinki, 1947.
 [27] S. Van Kuyk, W. B. Kleijn, and R. C. Hendriks, “On the information rate of speech communication,” in Proc. IEEE. Int. Conf. Acoust. Speech. Signal Process., (ICASSP), 2017, pp. 5625–5629.
 [28] K. S. Rhebergen, N. J. Versfeld, and W. A. Dreschler, “Extended speech intelligibility index for the prediction of the speech reception threshold in fluctuating noise,” J. Acoust. Soc. Amer., vol. 120, no. 6, pp. 3988–3997, 2006.
 [29] T. M. Cover and J. A. Thomas, Elements of information theory. New York, USA: John Wiley & Sons, 2012.

 [30] G. Doquire and M. Verleysen, “A comparison of multivariate mutual information estimators for feature selection,” in Proc. Int. Conf. on Pattern Recognition Applications and Methods, 2012, pp. 176–185.
 [31] S. Gao, G. Ver Steeg, and A. Galstyan, “Efficient estimation of mutual information for strongly dependent variables,” in Proc. Int. Conf. on Artificial Intelligence and Statistics, 2015, pp. 277–286.
 [32] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, 1980.
 [33] K. R. Rao and P. Yip, Discrete cosine transform: algorithms, advantages, applications. San Diego, USA: Academic press, 1990.
 [34] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Phys. Rev. E, vol. 69, no. 6, p. 066138, 2004.
 [35] T. M. Elliott and F. E. Theunissen, “The modulation transfer function for speech intelligibility,” PLOS Comput. Biol., vol. 5, no. 3, p. e1000302, 2009.
 [36] M. Slaney, “An efficient implementation of the Patterson-Holdsworth auditory filter bank,” Apple Comp. Tech. Report, 1993.
 [37] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hear. Res., vol. 47, no. 1, pp. 103–138, 1990.
 [38] T. H. Crystal and A. S. House, “Segmental durations in connected-speech signals: Current results,” J. Acoust. Soc. Amer., vol. 83, no. 4, pp. 1553–1573, 1988.
 [39] J. Jensen and R. C. Hendriks, “Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions,” IEEE Trans. Audio, Speech Lang. Process., vol. 20, no. 1, pp. 92–102, 2012.
 [40] J. Koopman, R. Houben, W. A. Dreschler, and J. Verschuure, “Development of a speech in noise test (matrix),” in Proc. 8th EFAS Congr., 10th DGA Congr., 2007.
 [41] B. Hagerman, “Sentences for testing speech intelligibility in noise,” Scand. Audiol., vol. 11, no. 2, pp. 79–87, 1982.
 [42] M. Cooke, C. Mayo, and C. Valentini-Botinhao, “Intelligibility-enhancing speech modifications: the Hurricane Challenge,” in Proc. Interspeech, 2013.
 [43] E. H. Rothauser, W. D. Chapman, N. Guttman, H. R. Silbiger, M. H. L. Hecker, G. E. Urbanek, K. S. Nordby, and M. Weinstock, “IEEE recommended practice for speech quality measurements,” IEEE Trans. on Audio and Electroacoustics, vol. 17, pp. 225–246, 1969.
 [44] U. Kjems, J. B. Boldt, M. S. Pedersen, T. Lunner, and D. Wang, “Role of mask pattern in intelligibility of ideal binary-masked noisy speech,” J. Acoust. Soc. Amer., vol. 126, no. 3, pp. 1415–1426, 2009.
 [45] K. Wagener, J. L. Josvassen, and R. Ardenkjær, “Design, optimization and evaluation of a Danish sentence test in noise,” Int. J. Audiol., vol. 42, no. 1, pp. 10–17, 2003.
 [46] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, pp. 81–93, 1938.
 [47] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An evaluation of objective measures for intelligibility prediction of time-frequency weighted noisy speech,” J. Acoust. Soc. Amer., vol. 130, no. 5, pp. 3013–3027, 2011.
 [48] R. M. Fano, “The information theory point of view in speech communication,” J. Acoust. Soc. Amer., vol. 22, no. 6, pp. 691–696, 1950.
 [49] J. L. Flanagan, Speech analysis synthesis and perception. New York, USA: Springer, 1972.
 [50] J. Villasenor, Y. Han, D. Wen, E. Gonzalez, J. Chen, and J. Wen, “The information rate of modern speech and its implications for language evolution,” in Proc. Int. Conf. Evolution Lang., 2012, p. 376.