I Introduction
As an acoustic signal enters the ear, it passes through several processing stages before the information it carries is decoded in the brain. Numerous works have studied and modeled stages of auditory processing, from biophysical to computational models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Due to the information processing and transmission at each stage, some information loss may occur. In this study, our motivation is to quantify the information loss in the human auditory system, from the eardrum to the speech decoding stage in the brain. This is a first step towards assessing the information loss of the individual components of the human auditory system. We model a speech communication system in which a speaker utters a word from a fixed dictionary and the word waveform passes through a noisy communication channel before it is classified by a human listener.
Our key idea is to define a notion of information loss, which is related to the number of words that are not correctly recognized. Since a certain degree of information is also lost in the acoustic communication channel, we normalize the information loss so that it describes the ratio of the amount of information lost in the listener to the total amount of information that reaches the listener's eardrum.
To assess the word recognition rate of humans, we conduct a closed-vocabulary intelligibility test. This is a listening test reflecting key properties of the DANTALE II intelligibility test [12], where intelligibility is determined by presenting speech stimuli contaminated by noise to test subjects and calculating the word recognition rate.
We quantify the information flow through the acoustic channel by establishing computable lower and upper bounds on the mutual information between the words being uttered and the output of the noisy acoustic channel. Simulations reveal that the bounds are tight, and we observe that the information loss in the human auditory system increases as the signal-to-noise ratio (SNR) decreases. We also observe that the information loss has an inverse relationship with the word recognition rate of humans.
Our framework further allows us to assess whether humans are optimal in terms of speech perception in a noisy environment. It may be hypothesized that the ability to understand speech under varying acoustic conditions has provided humans with an evolutionary advantage. In particular, it has been hypothesized that animals are close to optimal at performing tasks that are important for their survival, e.g. they transfer information optimally from the sensory world to the brain [13, 14]. Rieke et al. [15]
studied the peripheral auditory system of the bullfrog. They first estimated the input stimulus (i.e., stimulus reconstruction) by linearly filtering the output (spike trains) of the auditory system. They then measured the information rate carried by the spike trains, i.e. the rate at which the spike trains remove uncertainty about the input stimulus. This information rate reaches its upper bound, i.e. the stimulus is transferred optimally, when the input stimulus is a natural sound rather than a synthetic stimulus. This indicates that the auditory system of this organism is tuned to natural stimuli. Similar studies have been done on other organisms, e.g.
[16, 17]. For example, the authors of [17] investigated a movement-sensitive neuron in the visual system of the blowfly in terms of information rate and concluded that the visual system of this organism transmits information optimally.
To answer the question of whether humans are optimal in terms of speech perception, we derive optimal classifiers and compare human and machine performance in terms of information loss and word recognition rate. It is observed that at equal SNR, humans have a higher information loss than the optimal classifiers. In fact, the machine classifier may outperform humans by as much as 8 dB, depending on the prior knowledge assumed available to the classifier. This implies that for the speech-in-stationary-noise setup considered here, the human auditory system is suboptimal for recognizing noisy words.
I-A Overview of the Paper
The rest of the paper is organized as follows. In Sec. II, we describe our speech communication model. In Sec. III, we introduce and quantify the relative information loss in the speech communication model and find lower and upper bounds for it. We also derive the optimal classifiers in this section. We explain the simulation study and report the results in Sec. IV and discuss them in Sec. V. Sec. VI concludes the paper.
I-B Notation
We denote random vectors and random scalars with boldface uppercase and italic uppercase letters, respectively. Boldface lowercase and italic lowercase letters are used for denoting deterministic vectors and deterministic scalars, respectively. We denote the expectation operation with respect to a random variable $Y$ by $\mathbb{E}_Y[\cdot]$. The information theoretic quantities of differential entropy, entropy, and mutual information are denoted by $h(\cdot)$, $H(\cdot)$, and $I(\cdot;\cdot)$, respectively. The trace operation and the matrix determinant are denoted by $\operatorname{tr}(\cdot)$ and $|\cdot|$, respectively. We denote Markov chains by two-headed arrows, e.g. $X \leftrightarrow Y \leftrightarrow Z$. The probability mass function (PMF) is denoted by $P(\cdot)$, and $f(\cdot)$ is used for the probability density function (PDF). For example, $f(\mathbf{y} \mid x)$ denotes the conditional PDF of $\mathbf{Y}$ given $X = x$. The notation $x_1^n$ represents the sequence $x_1, \dots, x_n$.

II Communication Model
Figure 1 illustrates the speech communication model that is composed of three parts: Speaker, Noisy Channel, and Listener/Classifier. We elaborate on these parts below.
II-A Speaker
The speaker constructs sequences of words by choosing words randomly from fixed dictionaries that are also known by the listener/classifier. Let us consider a dictionary of $M$ words. The waveform of the $i$th word is modeled as a random vector $\mathbf{X}_i$ that contains $N$ samples. The discrete random variable $X \in \{1, \dots, M\}$ indexes the word that is picked and uttered, and $P(X = i)$ denotes the probability that the $i$th word is chosen.

II-B Noisy Channel
The noisy channel is composed of a clean speech term multiplied by a scaling factor and an additive noise term (Fig. 2). The additive noise, $\mathbf{W}$, is zero-mean, coloured Gaussian and has a long-term spectrum similar to the average long-term spectrum of the clean words. The scale factor, $\alpha$, serves to modify the SNR, which is defined as the ratio of the average power of the scaled words to the noise power. Without loss of generality, we fix the noise power to the average word power; then $\mathrm{SNR} = \alpha^2$. The received word waveform $\mathbf{Y}$ is expressed as:
(1) $\mathbf{Y} = \alpha\, \mathbf{X}_i + \mathbf{W}, \quad \text{given } X = i.$
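As a concrete illustration of the channel in (1), the sketch below scales a clean waveform so that a requested SNR holds relative to a fixed noise realization. `scale_for_snr` is a hypothetical helper (not the authors' code), and white Gaussian noise stands in for the speech-shaped noise used in the paper:

```python
import numpy as np

def scale_for_snr(x, w, snr_db):
    # Choose alpha so that the scaled clean word alpha*x has the requested
    # SNR (in dB) relative to the noise w, then form y = alpha*x + w as in (1).
    p_x = np.mean(x ** 2)  # average clean-speech power
    p_w = np.mean(w ** 2)  # noise power
    alpha = np.sqrt(10.0 ** (snr_db / 10.0) * p_w / p_x)
    return alpha * x + w, alpha

rng = np.random.default_rng(0)
x = rng.standard_normal(20000)  # stand-in for a clean word waveform
w = rng.standard_normal(20000)  # white noise stands in for speech-shaped noise
y, alpha = scale_for_snr(x, w, -8.0)
```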
II-C Listener/Classifier
The listener receives the noisy word waveform $\mathbf{y}$ and attempts to recognize it by mapping it to one of the words in the dictionary. The random variable $\hat{X}$ specifies the word selected by the listener/classifier.
III Analysis
III-A Relative Information Loss
Consider the speech communication model in Fig. 1. Since $\hat{X}$ is a deterministic function of $\mathbf{y}$, we can write:
(2) $P(\hat{x} \mid \mathbf{y}, x) = P(\hat{x} \mid \mathbf{y}).$
Equation (2) implies that $X$, $\mathbf{Y}$, and $\hat{X}$ form a Markov chain, $X \leftrightarrow \mathbf{Y} \leftrightarrow \hat{X}$, from which, by the data processing inequality, we have [18]:
(3) $I(X; \hat{X}) \le I(X; \mathbf{Y}).$
As mentioned above, information may be lost at any stage of auditory processing. As a result, the amount of information shared between $X$ and $\hat{X}$, i.e. $I(X;\hat{X})$, is less than $I(X;\mathbf{Y})$. Therefore, the difference between $I(X;\mathbf{Y})$ and $I(X;\hat{X})$ is the amount of information that is lost in the listener/classifier part (cf. Fig. 1). Based on this argument, we define the information loss, $L$, as follows:
(4) $L = I(X; \mathbf{Y}) - I(X; \hat{X}).$
If the classifier block in Fig. 1 is performed by human listeners, $L$ quantifies the amount of information that is lost in the human auditory system from the eardrum to the decoding stage in the brain. However, the information loss in (4) does not reveal the size of the loss compared to the total information that reaches the eardrum, $I(X;\mathbf{Y})$. Thus, we introduce the relative information loss as:
(5) $L_{\mathrm{rel}} = \frac{L}{I(X;\mathbf{Y})} = 1 - \frac{I(X;\hat{X})}{I(X;\mathbf{Y})}.$
When the decoding block is performed by human listeners, $1 - L_{\mathrm{rel}}$ can be interpreted as the fraction of the information reaching the eardrum which is actually used for decoding the speech signal. From (3), it is easy to show that $0 \le L_{\mathrm{rel}} \le 1$.
III-B Bounds on $I(X;\mathbf{Y})$
To calculate $L_{\mathrm{rel}}$, let us start with the definition of $I(X;\mathbf{Y})$:
(6) $I(X; \mathbf{Y}) = h(\mathbf{Y}) - h(\mathbf{Y} \mid X).$
The PDF of $\mathbf{Y}$ can be obtained as:
(7) $f(\mathbf{y}) = \sum_{i=1}^{M} P(X = i)\, f(\mathbf{y} \mid X = i).$
As seen from (7), $f(\mathbf{y})$ is a mixture of $M$ distributions and, to the best of the authors' knowledge, there exists no closed-form expression for the entropy of a mixture distribution. However, we can find upper and lower bounds for $h(\mathbf{Y})$ and, consequently, for $I(X;\mathbf{Y})$.
We first divide $\mathbf{X}_i$ and $\mathbf{Y}$ into successive, non-overlapping frames, each of length $\nu$:
(8) $\mathbf{X}_i = \left[\mathbf{X}_{i,1}^{T}, \dots, \mathbf{X}_{i,F}^{T}\right]^{T}, \qquad \mathbf{Y} = \left[\mathbf{Y}_{1}^{T}, \dots, \mathbf{Y}_{F}^{T}\right]^{T},$
where $\mathbf{X}_{i,j}$ is the $j$th frame of the $i$th word, $\mathbf{Y}_j$ is the $j$th frame of the noisy word, and $F$ denotes the number of frames. Using the product rule, we can write:
(9) $f(\mathbf{y}) = \prod_{j=1}^{F} f\!\left(\mathbf{y}_j \mid \mathbf{y}_1^{j-1}\right).$
Note that we do not assume frames to be independent.
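The framing step can be sketched as follows; how a partial tail frame is handled is our assumption, since the text does not say:

```python
import numpy as np

def to_frames(y, frame_len):
    # Split a waveform into successive non-overlapping frames of length
    # frame_len; any incomplete tail frame is discarded (our assumption --
    # the paper does not state how partial frames are handled).
    n_frames = len(y) // frame_len
    return y[: n_frames * frame_len].reshape(n_frames, frame_len)

# 20 ms frames at 20 kHz (Sec. IV) would correspond to frame_len = 400 samples.
frames = to_frames(np.arange(10), 4)
```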
Let $I_L$ and $I_U$ denote lower and upper bounds for $I(X;\mathbf{Y})$, and let $D(f \| g)$ and $C_\lambda(f \| g)$ ($0 \le \lambda \le 1$) denote the KL-divergence and the Chernoff divergence between two distributions $f$ and $g$, respectively [19]:
(10) $D(f \| g) = \int f(\mathbf{y}) \log \frac{f(\mathbf{y})}{g(\mathbf{y})}\, d\mathbf{y}, \qquad C_\lambda(f \| g) = -\log \int f(\mathbf{y})^{\lambda}\, g(\mathbf{y})^{1-\lambda}\, d\mathbf{y}.$
Lemma 1.
The mutual information between $X$ and $\mathbf{Y}$ is lower and upper bounded by:
(11) $I_L = -\sum_{i=1}^{M} P(X = i) \log \sum_{j=1}^{M} P(X = j)\, e^{-C_\lambda(f_i \| f_j)} \;\le\; I(X;\mathbf{Y}) \;\le\; -\sum_{i=1}^{M} P(X = i) \log \sum_{j=1}^{M} P(X = j)\, e^{-D(f_i \| f_j)} = I_U,$
where $f_i$ denotes the conditional PDF $f(\mathbf{y} \mid X = i)$.
Proof.
See Appendix A. ∎
III-B1 Gaussian Case
As seen from (10) and (11), the lower and upper bounds for $I(X;\mathbf{Y})$ depend on the conditional PDF of $\mathbf{Y}$. From (1), for the $j$th frame we have:
(12) $\mathbf{Y}_j = \alpha\, \mathbf{X}_{i,j} + \mathbf{W}_j, \quad \text{given } X = i.$
At high and medium SNRs, where humans successfully recognize all the words, the information loss is zero. We thus focus on low SNRs in this work. At low SNRs ($\alpha \ll 1$), the additive Gaussian noise in (12) is dominant. Therefore, it is reasonable to assume that $\mathbf{X}_{i,j}$ approximately follows a Gaussian distribution, $\mathcal{N}(\mathbf{0}, \Sigma_{X_{i,j}})$. Consequently, $\mathbf{Y}_j$ given $X = i$ also follows a Gaussian distribution, $\mathcal{N}(\mathbf{0}, \Sigma_{Y_{i,j}})$, where:
(13) $\Sigma_{Y_{i,j}} = \alpha^{2}\, \Sigma_{X_{i,j}} + \Sigma_{W}.$
When $\mathbf{Y}_j$ given $X$ is Gaussian (i.e. for low SNRs), we obtain closed-form expressions for the KL-divergence and the Chernoff divergence of two zero-mean Gaussians, $f_i = \mathcal{N}(\mathbf{0}, \Sigma_i)$ and $f_j = \mathcal{N}(\mathbf{0}, \Sigma_j)$, to be used in (11) [19]:
(14) $D(f_i \| f_j) = \frac{1}{2} \left( \operatorname{tr}\!\left(\Sigma_j^{-1} \Sigma_i\right) - \nu + \log \frac{|\Sigma_j|}{|\Sigma_i|} \right),$
(15) $C_\lambda(f_i \| f_j) = \frac{1}{2} \log \frac{\left|\lambda \Sigma_j + (1 - \lambda) \Sigma_i\right|}{|\Sigma_i|^{1-\lambda}\, |\Sigma_j|^{\lambda}},$
where $\nu$ denotes the frame length.
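Under zero-mean Gaussian assumptions, the closed-form divergences and the pairwise-distance bounds of Lemma 1 can be computed directly. The sketch below (our own helper names, working in nats) is a minimal implementation, not the authors' code:

```python
import numpy as np

def kl_zero_mean(S_f, S_g):
    # D(f || g) for f = N(0, S_f), g = N(0, S_g); Eq. (14)-style closed form (nats).
    nu = S_f.shape[0]
    _, logdet_f = np.linalg.slogdet(S_f)
    _, logdet_g = np.linalg.slogdet(S_g)
    return 0.5 * (np.trace(np.linalg.solve(S_g, S_f)) - nu + logdet_g - logdet_f)

def chernoff_zero_mean(S_f, S_g, lam=0.5):
    # C_lambda(f || g); Eq. (15)-style closed form for zero-mean Gaussians (nats).
    _, logdet_mix = np.linalg.slogdet(lam * S_g + (1.0 - lam) * S_f)
    _, logdet_f = np.linalg.slogdet(S_f)
    _, logdet_g = np.linalg.slogdet(S_g)
    return 0.5 * (logdet_mix - (1.0 - lam) * logdet_f - lam * logdet_g)

def pairwise_mi_bounds(P, covs, lam=0.5):
    # Lemma-1-style pairwise-distance bounds on I(X; Y) in nats:
    # Chernoff divergences give the lower bound, KL divergences the upper bound.
    M = len(P)
    I_L = -sum(P[i] * np.log(sum(P[j] * np.exp(-chernoff_zero_mean(covs[i], covs[j], lam))
                                 for j in range(M))) for i in range(M))
    I_U = -sum(P[i] * np.log(sum(P[j] * np.exp(-kl_zero_mean(covs[i], covs[j]))
                                 for j in range(M))) for i in range(M))
    return I_L, I_U
```

Since $C_\lambda(f\|g) \le D(f\|g)$, the Chernoff-based sum is never smaller than the KL-based one, which guarantees `I_L <= I_U`.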
We will use these results for Gaussian signals to bound the relative information loss in the next subsection and to derive optimal classifiers in Subsection III-D.
III-C Bounds on Relative Information Loss
Suppose that $P_d$ is the word recognition rate, i.e. the probability of correctly detecting a word. From the definition of mutual information we have:
(16) $I(X; \hat{X}) = \log_2 M + P_d \log_2 P_d + (1 - P_d) \log_2 \frac{1 - P_d}{M - 1},$
where the first term on the right-hand side of (16) follows from the fact that $X$ is drawn uniformly from the set $\{1, \dots, M\}$, and the last two terms result from calculating $H(X \mid \hat{X})$ using the definition of conditional entropy.
As can be seen from (16), $I(X;\hat{X})$ depends on $P_d$. To obtain this probability when the classifier block is performed by human listeners, we perform a listening test (see Section IV for more details). We also derive optimal classifiers in the next subsection to obtain $P_d$ when the decoding is performed by an optimal machine classifier.
Using (11) and (16), lower and upper bounds for the relative information loss are obtained as follows:
(17) $L_{\mathrm{rel}}^{L} = 1 - \frac{I(X;\hat{X})}{I_L} \;\le\; L_{\mathrm{rel}} \;\le\; 1 - \frac{I(X;\hat{X})}{I_U} = L_{\mathrm{rel}}^{U},$
where $L_{\mathrm{rel}}^{U}$ and $L_{\mathrm{rel}}^{L}$ denote the upper and lower bound for the relative information loss, respectively. The tightness of these bounds depends on how tight the lower and upper bounds for the entropy of the Gaussian mixture model (GMM) are. In [20], it is shown that the pairwise-distance bounds for the entropy of a GMM are significantly tighter than well-known existing bounds [21, 22, 23]. We also observe that, in our case, the upper and lower bounds for $L_{\mathrm{rel}}$ are tight (see Section IV).
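Equations (16) and (17) are easy to evaluate numerically. The sketch below (our own helper names) computes $I(X;\hat{X})$ in bits from a recognition rate and the resulting bounds on the relative information loss; all quantities passed to `rel_loss_bounds` must be in the same units:

```python
import numpy as np

def mi_from_pd(pd, M):
    # Eq. (16)-style I(X; Xhat) in bits: uniform prior over M words, errors
    # assumed uniformly spread over the M-1 wrong words.
    if pd >= 1.0:
        return np.log2(M)
    if pd <= 0.0:
        return np.log2(M) + np.log2(1.0 / (M - 1))
    return (np.log2(M) + pd * np.log2(pd)
            + (1.0 - pd) * np.log2((1.0 - pd) / (M - 1)))

def rel_loss_bounds(i_xxhat, i_low, i_up):
    # Eq. (17)-style bounds: 1 - I(X;Xhat)/I_L <= L_rel <= 1 - I(X;Xhat)/I_U.
    return 1.0 - i_xxhat / i_low, 1.0 - i_xxhat / i_up
```

Note that at chance level (`pd = 1/M`), `mi_from_pd` returns 0, so both bounds in (17) equal 1, matching the behaviour discussed in Section V.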
III-D Optimal Classifier
In this section, we derive the optimal classifiers for our speech communication model. Since the performance of the optimal classifiers will be compared to that of humans, in order to have a fair comparison we make an assumption and impose some requirements on the optimal classifiers based on the situation that human listeners encounter.

We assume that test subjects are able to learn and store a model of the words based on the spectral envelope contents of subwords encountered during the training phase. In a similar manner, subjects create an internal noise model. This assumption is inspired by [24, 25], which suggest that humans build internal statistical models of the words based on characteristics of the spectral contents of subwords. In our classifier, this is achieved by allowing the classifier access to training data in terms of average short-term speech spectra of the clean speech and of the noise.
In addition, we impose the following requirements on the subjective listening test and the classifier:

(i) We design a classifier which maximizes the probability of correct word detection. This reflects the fact that subjects are instructed to make a best guess of each noisy word.

(ii) When listening to the stimuli, i.e. the noisy sentences, the subjects are not informed about the SNRs a priori. In a similar manner, the classifier does not rely on a priori knowledge of the SNR. Therefore, from the decoder point of view, the realization of the scale factor, $\alpha$, is uniformly drawn from an interval $[\alpha_{\min}, \alpha_{\max}]$.

(iii) Subjects do not know a priori when the words start. Similarly, the classifier has no a priori information about the temporal locations of the words within the noisy sentences.
In the following, we derive three different classifiers implying different assumptions on how humans perform the classification task. We observe, however, that all the classifiers perform almost identically (see Section IV), which means that it does not matter which one we choose for comparison with human performance.
III-D1 MAP (Lemma 2)
The classifier chooses which word was spoken by maximizing the posterior probability $P(X = i \mid \mathbf{y})$; in other words:
(18) $\hat{x} = \arg\max_{i \in \{1, \dots, M\}} P(X = i \mid \mathbf{y}).$
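A minimal sketch of the MAP rule (18), assuming zero-mean Gaussian models per word as in Section III-B1. The paper's actual classifier works frame-by-frame with trained covariances, so the helpers below are an illustration under stated assumptions, not the authors' implementation:

```python
import numpy as np

def gauss_loglik(y, cov):
    # Log-likelihood of y under a zero-mean Gaussian N(0, cov).
    nu = len(y)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (nu * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

def map_classify(y, covs, priors=None):
    # Eq. (18)-style MAP decision over M candidate words; each noisy word is
    # modeled as a zero-mean Gaussian with a trained covariance matrix.
    M = len(covs)
    priors = np.full(M, 1.0 / M) if priors is None else np.asarray(priors)
    scores = [np.log(priors[i]) + gauss_loglik(y, covs[i]) for i in range(M)]
    return int(np.argmax(scores))
```

With a uniform prior the rule reduces to maximum likelihood, as noted in Appendix B.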
III-D2 MAP (Lemma 3)
One may argue that subjects are able to identify the SNR, and thereby the scale factor $\alpha$, after having listened to a particular test stimulus, before deciding on the word. In this case, one should maximize $P(X = i, \alpha \mid \mathbf{y})$ rather than $P(X = i \mid \mathbf{y})$. This leads to the following optimization problem:
(19) $(\hat{x}, \hat{\alpha}) = \arg\max_{i,\, \alpha} P(X = i, \alpha \mid \mathbf{y}).$
Lemma 3.
The optimal pair $(\hat{x}, \hat{\alpha})$ defined in (19) is given by:¹
¹We assume that $\hat{\alpha} \in [\alpha_{\min}, \alpha_{\max}]$; otherwise the nearest endpoint ($\alpha_{\min}$ or $\alpha_{\max}$) should be chosen.
where $\hat{\alpha}$ is obtained by solving the following equation with respect to $\alpha$:
Proof.
See Appendix C. ∎
III-D3 MAP (Lemma 4)
In the version of the listening test used in this paper, a fixed, limited set of SNRs is used, and it might be reasonable to assume that the subjects can identify these SNRs through the training phase. In this case, the scale factor is a discrete random variable rather than a continuous one. Thus, we maximize the posterior over the discrete set of scale factors, $\mathcal{A}$. The optimization problem in (19) can thus be rewritten as:
(20) $(\hat{x}, \hat{\alpha}) = \arg\max_{i,\, \alpha \in \mathcal{A}} P(X = i, \alpha \mid \mathbf{y}).$
In order to take into account requirement (iii), that subjects do not know a priori when the word starts, a window of the same size as the word is shifted within the stimulus. For each shift, the likelihoods are calculated using Lemmas 2, 3, and 4. Denoting by $\mathbf{y}_\tau$ the portion of $\mathbf{y}$ captured by a shift of $\tau$, new problems corresponding to (18)-(20) are respectively formulated as:
(21) $\hat{x} = \arg\max_{i,\, \tau} P(X = i \mid \mathbf{y}_\tau),$ and analogously for (19) and (20), with an additional maximization over the shift $\tau$.
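The shifted-window search implied by requirement (iii) can be sketched generically; `loglik_fn` is a hypothetical callable standing in for any of the per-window likelihoods of Lemmas 2-4:

```python
import numpy as np

def best_shift(stimulus, word_len, loglik_fn):
    # Slide a word-sized window over the stimulus and keep the shift tau
    # with the highest score; loglik_fn maps a window to a scalar
    # log-likelihood (hypothetical interface).
    best_tau, best_score = 0, -np.inf
    for tau in range(len(stimulus) - word_len + 1):
        score = loglik_fn(stimulus[tau: tau + word_len])
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```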
IV Simulations and Experiments
IV-A Database
We use the DANTALE II database [12] for our simulations. This database contains 150 sentences sampled at 20 kHz with a resolution of 16 bits. The sentences are spoken by a native Danish speaker. Each sentence is composed of five words from five categories (name, verb, numeral, adjective, object). There are 10 different words in each of the five categories. The sentences are syntactically fixed but semantically unpredictable (nonsense), i.e. sentences have the same grammatical structure but do not necessarily make sense.
IV-B Listening Test
We perform a listening test inspired by the Danish sentence test paradigm DANTALE II [12], which has been designed to determine the speech reception threshold (SRT), i.e. the signal-to-noise ratio at which the word recognition rate is 50%. In our test, the sentences are contaminated with additive stationary Gaussian noise with the same long-term spectrum as the sentences. The listening test is composed of two phases: a training phase and a test phase. In the training phase, we ask normal-hearing subjects to listen to noisy versions of the sentences to familiarize themselves with the test. In the test phase, subjects listen to the noisy sentences at different SNRs and choose the words they hear using a GUI. The GUI displays all candidate words on a computer screen (i.e. this is a closed-set listening test), and subjects are asked to choose a candidate word for each of the 5 word categories even if they were unable to recognize the words (forced choice). In both phases, DANTALE II sentences are used and subjects listen to the noisy sentences through headphones.
Eighteen normal-hearing, native Danish-speaking subjects participated in this test. In the training phase, the subjects were exposed to 12 noisy sentences at 6 different SNRs, where each SNR was used twice. In the test phase, each subject listened to 48 sentences (6 SNRs × 8 repetitions). From this listening test, we obtained the human word recognition performance shown in Fig. 3 (blue circles and green fitted curve). The fitted curve is a Maximum Likelihood (ML) fitted logistic function.
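The ML fit of such a psychometric curve can be sketched as follows. This is our own brute-force stand-in: the function name, the grid ranges, and the plain two-parameter logistic $p(\mathrm{snr}) = 1/(1 + e^{-(\mathrm{snr} - \mathrm{srt})/s})$ are assumptions, and the $1/M$ guessing floor is ignored for simplicity:

```python
import numpy as np

def fit_logistic(snrs, n_correct, n_trials):
    # ML fit of a two-parameter logistic psychometric function via grid search
    # over the midpoint (srt) and slope parameter (s), maximizing the
    # binomial log-likelihood of the observed correct-response counts.
    snrs = np.asarray(snrs, dtype=float)
    n_correct = np.asarray(n_correct, dtype=float)
    n_trials = np.asarray(n_trials, dtype=float)
    best = (0.0, 1.0, -np.inf)
    for srt in np.linspace(snrs.min() - 5.0, snrs.max() + 5.0, 201):
        for s in np.linspace(0.2, 5.0, 100):
            p = np.clip(1.0 / (1.0 + np.exp(-(snrs - srt) / s)), 1e-9, 1 - 1e-9)
            ll = np.sum(n_correct * np.log(p) + (n_trials - n_correct) * np.log(1.0 - p))
            if ll > best[2]:
                best = (srt, s, ll)
    return best[0], best[1]
```

The recovered midpoint plays the role of the SRT when the curve crosses 50%.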
IV-C Computing Information and Relative Information Loss
To calculate $I_L$ and $I_U$ in (11), and the performance of the optimal classifiers, we need to build the covariance matrix for the frames of each word. In the DANTALE II database, each word has 15 different realizations. In our simulations, we use 14 realizations of each word for training (building the covariance matrices) and one realization for testing. Using the leave-one-out method, we obtain 15 results whose average is used as the final result. In this way, we assume that the listeners learn one statistical model (covariance matrices) of subwords for all realizations of a word through the training phase. To construct the covariance matrix for each frame, we segment each word into non-overlapping frames with a duration of 20 ms and stack the same frame of the 14 training realizations in a long vector. The vector of linear prediction (LP) coefficients of this long vector is then obtained, and the covariance matrix of the frame is calculated as described in [26]. In a similar manner, we construct the noise covariance matrix using the LP coefficients of a long vector built by stacking all realizations of all words. The frame covariance matrices of the clean words and the noise are submatrices of these full covariance matrices. Therefore, using (13)-(15), $I_L$ and $I_U$ can be calculated. In our simulations, we consider two cases. In the first case, we assume that frames are independent of each other. In the second case, we consider a first-order Markov model across frames.

Using (16) and (17), we calculate the bounds on the relative information loss in the human auditory system and the relative information loss in the optimal classifiers. The results are plotted in Fig. 4. The relationship between the relative information loss and the probability of detection, $P_d$, for humans and the optimal classifiers is plotted in Fig. 5.
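The leave-one-out protocol described above can be sketched generically; `train_and_eval` is a hypothetical callable standing in for covariance training plus evaluation on the held-out realization:

```python
import numpy as np

def leave_one_out_scores(realizations, train_and_eval):
    # Leave-one-out over the realizations of a word: train on all but one,
    # evaluate on the held-out one, and average the results.
    # train_and_eval is a user-supplied callable (hypothetical interface).
    scores = []
    for k in range(len(realizations)):
        train = realizations[:k] + realizations[k + 1:]
        scores.append(train_and_eval(train, realizations[k]))
    return float(np.mean(scores))
```

With 15 realizations per word, this yields the 15 results whose average is reported.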
V Discussion
We observe from Fig. 3 that the word recognition rates for the classifiers and humans reach 1 at high SNRs, whereas at low SNRs they are at the chance level $1/M$. This is because at high noise levels the words are completely masked by noise, and both classifiers and humans choose words randomly from the $M$ candidate words. It can also be seen that the optimal classifiers employing the first-order Markov model perform almost identically to the optimal classifiers employing an independent-frame assumption. This implies that the independence assumption does not compromise the performance significantly, and suggests that the independent-frame assumption, which is often employed in various speech processing contexts (e.g. [27]), is reasonable, at least in this context. The fact that the performances of all three classifiers are nearly identical means that, in this test, the alphabet of the SNR and prior assumptions on it are insignificant. Finally, we observe from Fig. 3 that machine performance is substantially better than human performance. In particular, the speech reception threshold for the machine classifiers is approximately 8 dB lower than that of humans. The superior performance of the classifiers in detecting noisy words contradicts our hypothesis that humans are optimal at recognizing words in noise. In other words, the human auditory system performs suboptimally in this particular task.
As can be seen from Fig. 4, the relative information loss for humans is around 0 at high SNRs, whereas at low SNRs it reaches its maximum of 1. This is because at very low SNRs humans cannot recognize the words and therefore simply guess; in this case $I(X;\hat{X}) = 0$ and $L_{\mathrm{rel}} = 1$. We can therefore conclude that not only is less and less information available at the eardrum for decreasing SNRs, but of the information that is available at the eardrum, less and less is used for identifying the word. On the contrary, at high SNRs there is no loss of the useful information needed for identifying the words in the human auditory system. It should be noted that the lower and upper bounds for the relative information loss for humans almost coincide. We also observe that the relative information loss in the optimal classifiers is less than the relative information loss in the human auditory system, which confirms that in this setup the human auditory system performs suboptimally.
From Fig. 5, it is seen that there is an inverse relationship between probability of correct decision (or speech intelligibility) and the relative information loss for humans. This monotonic relationship between intelligibility and the amount of information that reaches the brain has also been observed in [28].
VI Conclusion
In this paper, we defined and quantified the information loss in the human auditory system. We first considered a speech communication model where words are spoken and sent through a noisy channel before being received by a listener. For this setup, we defined and bounded the relative information loss in the listener. The relative information loss describes the fraction of speech information that reaches the eardrum of the listener but is not used to decode the speech. To obtain the word recognition rate for humans, we conducted a listening test. The results showed that the bounds on the relative information loss in the human auditory system are tight and that, as the SNR increases, the relative information loss decreases. We also assessed the hypothesis that humans are optimal at recognizing speech signals in noise. To do so, we derived optimal classifiers and compared their information loss and word recognition rate to those of humans. The lower information loss and higher word recognition rate of the machine classifiers implied the suboptimality of the human auditory system for recognizing noisy words, at least for speech contaminated by additive, Gaussian, speech-shaped noise.
Appendix A Proof of Lemma 1
Appendix B Proof of Lemma 2
Using Bayes' theorem, the posterior probability can be written as [29]:
(26) $P(X = i \mid \mathbf{y}) = \frac{f(\mathbf{y} \mid X = i)\, P(X = i)}{f(\mathbf{y})}.$
Since the speaker chooses words uniformly, $P(X = i) = 1/M$, and $f(\mathbf{y})$ is independent of $i$, from (26), (18) can be rewritten as:
(27) $\hat{x} = \arg\max_{i \in \{1, \dots, M\}} f(\mathbf{y} \mid X = i).$
In (23), we have obtained the PDF of $\mathbf{Y}$ given the word and the scale factor. However, the classifier does not know the SNR, i.e. the scale factor $\alpha$, so we can write:
(28) $f(\mathbf{y} \mid X = i) = \int_{\alpha_{\min}}^{\alpha_{\max}} f(\mathbf{y} \mid X = i, \alpha)\, f(\alpha)\, d\alpha.$
Using the closed-form expressions of Section III-B1, we obtain: