Information Loss in the Human Auditory System

From the eardrum to the auditory cortex, where acoustic stimuli are decoded, there are several stages of auditory processing and transmission where information may be lost. In this paper, we aim to quantify the information loss in the human auditory system using information-theoretic tools. To do so, we consider a speech communication model, where words are uttered and sent through a noisy channel, and then received and processed by a human listener. We define a notion of information loss that is related to the human word recognition rate. To assess the word recognition rate of humans, we conduct a closed-vocabulary intelligibility test. We derive upper and lower bounds on the information loss. Simulations reveal that the bounds are tight, and we observe that the information loss in the human auditory system increases as the signal-to-noise ratio (SNR) decreases. Our framework also allows us to study whether humans are optimal in terms of speech perception in a noisy environment. To that end, we derive optimal classifiers and compare human and machine performance in terms of information loss and word recognition rate. We observe a higher information loss and lower word recognition rate for humans compared to the optimal classifiers. In fact, depending on the SNR, the machine classifier may outperform humans by as much as 8 dB. This implies that for the speech-in-stationary-noise setup considered here, the human auditory system is sub-optimal for recognizing noisy words.


I Introduction

As an acoustic signal enters the ear, it passes through several processing stages until the information it carries is decoded in the brain. Numerous works have studied and modeled stages of auditory processing, from biophysical to computational models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Due to the information processing and transmission at each stage, some information loss may occur. In this study, our motivation is to quantify the information loss in the human auditory system, from the eardrum to the speech decoding stage in the brain. This is a first step towards assessing the information loss of the individual components of the human auditory system. We model a speech communication system, where a speaker utters a word from a fixed dictionary and the word waveform passes through a noisy communication channel before it is classified by a human listener.

Our key idea is to define a notion of information loss that is related to the number of words that are not correctly recognized. Since a certain amount of information is also lost in the acoustic communication channel, we normalize the information loss so that it describes the ratio between the amount of information lost in the listener's processing and the total amount of information that reaches the listener's eardrum.

To assess the word recognition rate of humans, we conduct a closed-vocabulary intelligibility test. The test reflects key properties of the DANTALE II intelligibility test [12]: intelligibility is determined by presenting noise-contaminated speech stimuli to test subjects and calculating the word recognition rate.

We quantify the information flow through the acoustic channel by establishing computable lower and upper bounds on the mutual information between the words being uttered and the output of the noisy acoustic channel. Simulations reveal that the bounds are tight, and we observe that the information loss in the human auditory system increases as the signal-to-noise ratio (SNR) decreases. We also observe that the information loss has an inverse relationship with the word recognition rate of humans.

Our framework further allows us to assess whether humans are optimal in terms of speech perception in a noisy environment. It may be hypothesized that the ability to understand speech under varying acoustic conditions has provided humans with an evolutionary advantage. In particular, it has been hypothesized that animals are close-to-optimal at performing tasks that are important for their survival, e.g. that they transfer information optimally from the sensory world to the brain [13, 14]. Rieke et al. [15] studied the peripheral auditory system of the bullfrog. They first estimated the input stimulus (i.e., stimulus reconstruction) by linearly filtering the output (spike trains) of the auditory system. They then measured the information rate carried by the spike trains, i.e. the rate at which the spike trains remove uncertainty about the sensory input. This information rate approaches its upper bound, i.e. the stimulus is transferred optimally, when the input stimulus is a natural sound rather than a synthetic stimulus. This indicates that the auditory system of this organism is tuned to natural stimuli. Similar studies have been performed on other organisms, e.g. [16, 17]. For example, the authors of [17] investigated a movement-sensitive neuron in the visual system of the blowfly in terms of information rate and concluded that the visual system of this organism transmits information optimally.

To answer the question of whether humans are optimal in terms of speech perception, we derive optimal classifiers and compare human and machine performance in terms of their information losses and word recognition rates. It is observed that at equal SNR, humans have a higher information loss than the optimal classifiers. In fact, depending on the SNR and on the prior knowledge assumed available to the classifier, the machine classifier may outperform humans by as much as 8 dB. This implies that for the speech-in-stationary-noise setup considered here, the human auditory system is sub-optimal for recognizing noisy words.

I-A Overview of the Paper

The rest of the paper is organized as follows. In Sec. II, we describe our speech communication model. In Sec. III, we introduce and quantify the relative information loss in the speech communication model and find lower and upper bounds for it. We also derive the optimal classifiers in this section. We explain the simulation study and report the results in Sec. IV and discuss them in Sec. V. Sec. VI concludes the paper.

I-B Notation

We denote random vectors and random scalars with boldface uppercase and italic uppercase letters, respectively. Boldface lowercase and italic lowercase letters are used for denoting deterministic vectors and deterministic scalars, respectively. We denote the expectation with respect to a random variable $Y$ by $\mathbb{E}_Y[\cdot]$. The information theoretic quantities of differential entropy, entropy, and mutual information are denoted by $h(\cdot)$, $H(\cdot)$, and $I(\cdot\,;\cdot)$, respectively. The trace operation and the matrix determinant are denoted by $\mathrm{tr}(\cdot)$ and $\det(\cdot)$, respectively. We denote Markov chains by two-headed arrows, e.g. $Z \leftrightarrow \mathbf{Y} \leftrightarrow \hat{Z}$. The probability mass function (PMF) is denoted by $P(\cdot)$, and $f(\cdot)$ is used for the probability density function (PDF). For example, $f_{Y\mid X}(y\mid x)$ denotes the conditional PDF of $Y$ given $X$. The notation $X_m^n$ represents the sequence $X_m, \dots, X_n$.

II Communication Model

Fig. 1: Block diagram of the speech communication model (Speaker → Noisy Acoustic Channel → Listener/Classifier).

Figure 1 illustrates the speech communication model that is composed of three parts: Speaker, Noisy Channel, and Listener/Classifier. We elaborate on these parts below.

II-A Speaker

The speaker constructs sequences of words by choosing words randomly from fixed dictionaries that are also known by the listener/classifier. Let us consider a dictionary of $M$ words. The waveform of the $i$-th word is modeled as a random vector $\mathbf{S}_i$ that contains $N$ samples. The discrete random variable $Z \in \{1, \dots, M\}$ indexes the word that is picked and uttered, and $P_Z(i)$ denotes the probability that the $i$-th word is chosen.

II-B Noisy Channel

Fig. 2: The word waveform is conveyed over a noisy acoustic channel.

The noisy channel is composed of the clean speech term multiplied by a scaling factor $g$ plus an additive noise term (Fig. 2). The additive noise, $\mathbf{W}$, is zero-mean coloured Gaussian and has a long-term spectrum similar to the average long-term spectrum of the clean words. The scale factor $g$ serves to modify the SNR, which is defined as the ratio of the average power of the scaled words to the noise power. Without loss of generality, we keep the noise power fixed, so that $g$ is determined by the desired SNR. The received word waveform $\mathbf{Y}$ is expressed as:

$\mathbf{Y} = g\,\mathbf{S}_Z + \mathbf{W}.$  (1)
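
To make the channel model in (1) concrete, the following minimal Python sketch scales a clean word waveform and adds noise so that a target SNR is met. For brevity it uses white Gaussian noise; the speech-shaped noise of the listening test would additionally be filtered to match the long-term speech spectrum. All names are illustrative.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the clean word so that the scaled-speech-to-noise power ratio is snr_db.

    Implements Y = g * S + W with the noise power kept fixed, cf. (1).
    """
    p_speech = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # g^2 * p_speech / p_noise = 10^(snr_db / 10)
    g = np.sqrt(10 ** (snr_db / 10) * p_noise / p_speech)
    return g * clean + noise

# Toy usage: a 1-second "word" at 20 kHz in white Gaussian noise at -8 dB SNR.
rng = np.random.default_rng(0)
clean_word = rng.standard_normal(20_000)
noise = rng.standard_normal(20_000)
noisy_word = add_noise_at_snr(clean_word, noise, snr_db=-8.0)
```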

II-C Listener/Classifier

The listener receives the noisy word waveform $\mathbf{y}$ and attempts to recognize it by mapping it to one of the words in the dictionary. The random variable $\hat{Z}$ specifies the word selected by the listener/classifier.

III Analysis

III-A Relative Information Loss

Consider the speech communication model in Fig. 1. Since $\hat{Z}$ is a deterministic function of $\mathbf{Y}$, we can write:

$P_{\hat{Z}\mid \mathbf{Y}, Z}(\hat{z}\mid \mathbf{y}, z) = P_{\hat{Z}\mid \mathbf{Y}}(\hat{z}\mid \mathbf{y}).$  (2)

Equation (2) implies that $Z$, $\mathbf{Y}$, and $\hat{Z}$ form a Markov chain, $Z \leftrightarrow \mathbf{Y} \leftrightarrow \hat{Z}$, from which, by the data processing inequality, we have [18]:

$I(Z;\hat{Z}) \le I(Z;\mathbf{Y}).$  (3)

As mentioned above, information may be lost at any stage of auditory processing. As a result, the amount of information shared between $Z$ and $\hat{Z}$, i.e. $I(Z;\hat{Z})$, is less than $I(Z;\mathbf{Y})$. Therefore, the difference between $I(Z;\mathbf{Y})$ and $I(Z;\hat{Z})$ is the amount of information that is lost in the listener/classifier part (cf. Fig. 1). Based on this argument, we define the information loss, $\Delta I$, as follows:

$\Delta I = I(Z;\mathbf{Y}) - I(Z;\hat{Z}).$  (4)

If the classifier block in Fig. 1 is performed by human listeners, $\Delta I$ quantifies the amount of information that is lost in the human auditory system from the eardrum to the decoding stage in the brain. However, the information loss in (4) does not reveal the size of the loss relative to the total information that reaches the eardrum, $I(Z;\mathbf{Y})$. Thus, we introduce a relative information loss as:

$\Delta I_r = \frac{I(Z;\mathbf{Y}) - I(Z;\hat{Z})}{I(Z;\mathbf{Y})} = 1 - \frac{I(Z;\hat{Z})}{I(Z;\mathbf{Y})}.$  (5)

When the decoding block is performed by human listeners, $1 - \Delta I_r$ can be interpreted as the fraction of the information reaching the eardrum which is actually used for decoding the speech signal. From (3), it is easy to show that $0 \le \Delta I_r \le 1$.

III-B Bounds on $I(Z;\mathbf{Y})$

To calculate $\Delta I_r$, let us start with the definition of $I(Z;\mathbf{Y})$:

$I(Z;\mathbf{Y}) = h(\mathbf{Y}) - h(\mathbf{Y}\mid Z).$  (6)

The PDF of $\mathbf{Y}$ can be obtained as:

$f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i=1}^{M} P_Z(i)\, f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i).$  (7)

As seen, $f_{\mathbf{Y}}$ is a mixture of $M$ distributions, and to the best of the authors' knowledge, there exists no closed-form expression for the entropy of a mixture distribution. However, we can find upper and lower bounds for $h(\mathbf{Y})$ and, consequently, for $I(Z;\mathbf{Y})$.

We first divide $\mathbf{S}_Z$ and $\mathbf{Y}$ into $K$ successive, non-overlapping frames, each of length $N_f$:

$\mathbf{S}_Z = \big[\mathbf{S}_{Z,1}^{\mathsf T}, \dots, \mathbf{S}_{Z,K}^{\mathsf T}\big]^{\mathsf T}, \qquad \mathbf{Y} = \big[\mathbf{Y}_{1}^{\mathsf T}, \dots, \mathbf{Y}_{K}^{\mathsf T}\big]^{\mathsf T},$  (8)

where $\mathbf{S}_{Z,k}$ is the $k$-th frame of the word, $\mathbf{Y}_k$ is the $k$-th frame of the noisy word, and $K$ denotes the number of frames. Using the product rule, we can write:

$f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i) = \prod_{k=1}^{K} f\big(\mathbf{y}_k \mid \mathbf{y}_1^{k-1}, i\big).$  (9)

Note that we do not assume frames to be independent.

Let $I_L(Z;\mathbf{Y})$ and $I_U(Z;\mathbf{Y})$ denote lower and upper bounds for $I(Z;\mathbf{Y})$, and let $D(f\Vert q)$ and $C_\alpha(f\Vert q)$ ($0 < \alpha < 1$) denote the KL-divergence and the Chernoff $\alpha$-divergence between two distributions $f$ and $q$, respectively [19]:

$D(f\Vert q) = \int f(\mathbf{x}) \ln\frac{f(\mathbf{x})}{q(\mathbf{x})}\, d\mathbf{x}, \qquad C_\alpha(f\Vert q) = -\ln\int f(\mathbf{x})^{\alpha}\, q(\mathbf{x})^{1-\alpha}\, d\mathbf{x}.$  (10)
Lemma 1.

The mutual information between $Z$ and $\mathbf{Y}$ is lower and upper bounded as $I_L(Z;\mathbf{Y}) \le I(Z;\mathbf{Y}) \le I_U(Z;\mathbf{Y})$, with

$I_L(Z;\mathbf{Y}) = -\sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-C_\alpha(f_i \Vert f_j)}$

and

$I_U(Z;\mathbf{Y}) = -\sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-D(f_i \Vert f_j)},$  (11)

where $f_i$ is shorthand for $f_{\mathbf{Y}\mid Z}(\cdot\mid i)$.
Proof.

See Appendix A. ∎
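
Given a matrix of pairwise divergences between the conditional densities, the bounds in Lemma 1 can be evaluated directly; a minimal Python sketch (variable names are illustrative) is shown below. Following the pairwise-distance estimators of [20], feeding the routine Chernoff $\alpha$-divergences yields the lower bound, and feeding it KL-divergences yields the upper bound.

```python
import numpy as np

def mixture_mi_bound(div: np.ndarray, prior: np.ndarray) -> float:
    """Pairwise-distance bound on I(Z;Y) for a mixture with word prior P_Z, cf. (11).

    div[i, j] is a divergence between f(y|Z=i) and f(y|Z=j); Chernoff
    alpha-divergences give a lower bound, KL-divergences an upper bound.
    The result is in nats.
    """
    inner = np.exp(-div) @ prior          # inner[i] = sum_j P_Z(j) * exp(-div[i, j])
    return float(-prior @ np.log(inner))

# Toy usage: uniform prior over M = 10 words and a synthetic divergence matrix.
M = 10
prior = np.full(M, 1.0 / M)
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 5.0, size=(M, M))
div_demo = (A + A.T) / 2
np.fill_diagonal(div_demo, 0.0)           # the divergence of a density with itself is zero
print(mixture_mi_bound(div_demo, prior))
```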

III-B1 Gaussian Case

As seen from (11), the lower and upper bounds for $I(Z;\mathbf{Y})$ depend on the conditional PDF of $\mathbf{Y}$ given $Z$. From (1), we have:

$\mathbf{Y}\mid\{Z = i\} = g\,\mathbf{S}_i + \mathbf{W}.$  (12)

At high and medium SNRs, where humans successfully recognize all the words, the information loss is zero. We thus focus on low SNRs in this work. At low SNRs, the additive Gaussian noise in (12) is dominant. Therefore, it is reasonable to assume that $\mathbf{S}_i$ approximately follows a Gaussian distribution $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{S}_i})$. Consequently, $\mathbf{Y}\mid\{Z=i\}$ also follows a Gaussian distribution, $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_i)$, where:

$\boldsymbol{\Sigma}_i = g^2\,\boldsymbol{\Sigma}_{\mathbf{S}_i} + \boldsymbol{\Sigma}_{\mathbf{W}}.$  (13)

When $\mathbf{Y}\mid Z$ is Gaussian (i.e. at low SNRs), we can use in (11) the closed-form expressions for the KL-divergence and the Chernoff $\alpha$-divergence between two zero-mean Gaussians [19]:

$D(f_i\Vert f_j) = \frac{1}{2}\left[\mathrm{tr}\!\left(\boldsymbol{\Sigma}_j^{-1}\boldsymbol{\Sigma}_i\right) - N + \ln\frac{\det\boldsymbol{\Sigma}_j}{\det\boldsymbol{\Sigma}_i}\right], \qquad C_\alpha(f_i\Vert f_j) = \frac{1}{2}\ln\frac{\det\boldsymbol{\Sigma}_{ij,\alpha}}{\left(\det\boldsymbol{\Sigma}_i\right)^{1-\alpha}\left(\det\boldsymbol{\Sigma}_j\right)^{\alpha}},$  (14)

where

$\boldsymbol{\Sigma}_{ij,\alpha} = (1-\alpha)\,\boldsymbol{\Sigma}_i + \alpha\,\boldsymbol{\Sigma}_j.$  (15)

We will use this result for Gaussian signals to bound the relative information loss in the next subsection and to derive optimal classifiers in Subsection III-D.
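
For concreteness, the closed forms in (14)-(15) can be evaluated as in the Python sketch below (for zero-mean Gaussians; in practice, log-determinants from Cholesky factors are preferable for large or ill-conditioned covariance matrices). Divergence matrices built this way feed directly into the bound routine sketched after Lemma 1.

```python
import numpy as np

def kl_zero_mean_gauss(cov_i: np.ndarray, cov_j: np.ndarray) -> float:
    """KL divergence D(N(0, cov_i) || N(0, cov_j)) in nats, cf. (14)."""
    n = cov_i.shape[0]
    cov_j_inv = np.linalg.inv(cov_j)
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    return 0.5 * (np.trace(cov_j_inv @ cov_i) - n + logdet_j - logdet_i)

def chernoff_zero_mean_gauss(cov_i: np.ndarray, cov_j: np.ndarray, alpha: float = 0.5) -> float:
    """Chernoff alpha-divergence C_alpha(N(0, cov_i) || N(0, cov_j)) in nats, cf. (14)-(15)."""
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    _, logdet_mix = np.linalg.slogdet((1 - alpha) * cov_i + alpha * cov_j)
    return 0.5 * (logdet_mix - (1 - alpha) * logdet_i - alpha * logdet_j)
```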

III-C Bounds on Relative Information Loss

Suppose that $P_d = \Pr(\hat{Z} = Z)$ is the word recognition rate. From the definition of mutual information we have:

$I(Z;\hat{Z}) = H(Z) - H(Z\mid\hat{Z}) = \ln M + P_d \ln P_d + (1 - P_d)\ln\frac{1 - P_d}{M - 1},$  (16)

where the first term on the right-hand side of (16) follows from the fact that $Z$ is drawn uniformly from the $M$-word dictionary, and the last two terms result from calculating $H(Z\mid\hat{Z})$ using the definition of conditional entropy (with erroneous decisions assumed to be spread uniformly over the remaining $M-1$ words).

As can be seen from (16), $I(Z;\hat{Z})$ depends on $P_d$. In order to obtain this probability when the classifier block is performed by human listeners, we perform a listening test (see Section IV for more details). We also derive optimal classifiers in the next subsection to obtain $P_d$ when the decoder part is performed by an optimal machine classifier.

Using (11) and (16), the upper and lower bounds for the relative information loss are obtained as follows:

$\Delta I_{r,L} = 1 - \frac{I(Z;\hat{Z})}{I_L(Z;\mathbf{Y})}, \qquad \Delta I_{r,U} = 1 - \frac{I(Z;\hat{Z})}{I_U(Z;\mathbf{Y})},$  (17)

where $\Delta I_{r,U}$ and $\Delta I_{r,L}$ denote the upper and lower bound for the relative information loss, respectively. The tightness of the lower and upper bounds for $\Delta I_r$ depends on how tight the lower and upper bounds for the entropy of the Gaussian mixture model (GMM) are. In [20], it is shown that these bounds on the entropy of a GMM are significantly tighter than well-known existing bounds [21, 22, 23]. We also observe that in our case, the upper and lower bounds for $\Delta I_r$ are tight (see Section IV).
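
Putting (16) and (17) together, these quantities follow directly from the word recognition rate and the bounds on $I(Z;\mathbf{Y})$. A short Python sketch (assuming, as above, a uniform word prior and confusions spread evenly over the wrong words) is:

```python
import numpy as np

def mi_word_decision(p_d: float, M: int) -> float:
    """I(Z; Zhat) in nats from the word recognition rate, cf. (16)."""
    h_cond = 0.0
    if 0.0 < p_d < 1.0:
        h_cond = -p_d * np.log(p_d) - (1 - p_d) * np.log((1 - p_d) / (M - 1))
    return np.log(M) - h_cond

def relative_loss_bounds(p_d: float, M: int, i_zy_lower: float, i_zy_upper: float):
    """Lower and upper bounds on the relative information loss, cf. (17)."""
    i_zzhat = mi_word_decision(p_d, M)
    return 1.0 - i_zzhat / i_zy_lower, 1.0 - i_zzhat / i_zy_upper

# Example: 60% recognition rate over a 10-word category, with I(Z;Y) in nats.
print(relative_loss_bounds(0.6, 10, i_zy_lower=1.2, i_zy_upper=1.5))
```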

III-D Optimal Classifier

In this section, we derive the optimal classifiers for our speech communication model. Since the performance of the optimal classifiers will be compared to that of the humans, in order to have a fair comparison, we state an assumption and impose some requirements on the optimal classifiers, based on the situation that human listeners encounter.

  1. We assume that test subjects are able to learn and store a model of the words based on the spectral envelope contents of sub-words encountered during the training phase. In a similar manner, subjects create an internal noise model. This assumption is inspired by [24, 25], which suggest that humans build internal statistical models of the words based on characteristics of the spectral contents of sub-words. In our classifier, this is achieved by allowing the classifier to have access to training data in terms of average short-term speech spectra of the clean speech and of the noise.

In addition, we impose the following requirements on the subjective listening test and the classifier:

  (i) We design a classifier which maximizes the probability of correct word detection. This reflects the fact that subjects are instructed to make a best guess of each noisy word.

  (ii) When listening to the stimuli, i.e. the noisy sentences, the subjects are not informed about the SNRs a priori. In a similar manner, the classifier does not rely on a priori knowledge of the SNR. Therefore, from the decoder's point of view, the realization of the SNR, i.e. of the scale factor $g$, is uniformly drawn from an interval $[g_{\min}, g_{\max}]$ covering the SNRs used in the test.

  (iii) Subjects do not know a priori when the words start. Similarly, the classifier has no a priori information about the temporal locations of the words within the noisy sentences.

In the following, we derive three different classifiers implying different assumptions on how humans perform the classification task. We observe, however, that all the classifiers perform almost identically (See Section IV), which means that it does not matter which one we choose to compare with human performance.

III-D1 MAP ($\hat{z}$)

The classifier chooses which word was spoken by maximizing the posterior probability $P_{Z\mid\mathbf{Y}}(i\mid\mathbf{y})$; in other words:

$\hat{z} = \arg\max_{i\in\{1,\dots,M\}} P_{Z\mid\mathbf{Y}}(i\mid\mathbf{y}).$  (18)
Lemma 2.

The optimal $\hat{z}$ defined in (18) is given by:

where

Proof.

See Appendix B. ∎
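
The closed-form rule of Lemma 2 is derived in Appendix B; as a purely illustrative companion, the Python sketch below scores each word by a zero-mean Gaussian likelihood, cf. (13), and marginalizes the unknown scale factor over a uniform grid, a crude numerical stand-in for marginalizing over its prior (cf. requirement (ii)). For brevity it treats the observation as a single Gaussian vector rather than working frame-by-frame; all names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_word(y: np.ndarray, word_covs: list, noise_cov: np.ndarray,
             gains: np.ndarray) -> int:
    """Numerical sketch of the MAP rule (18) under the Gaussian model (13).

    word_covs[i] is the clean-speech covariance of word i, noise_cov is the noise
    covariance, and the unknown scale factor g is marginalized over the grid
    `gains` (uniform prior). Returns the index of the selected word.
    """
    scores = []
    for cov_s in word_covs:
        # Average the Gaussian likelihoods over the gain grid.
        log_liks = [multivariate_normal.logpdf(y, mean=np.zeros(len(y)),
                                               cov=g ** 2 * cov_s + noise_cov)
                    for g in gains]
        scores.append(np.logaddexp.reduce(log_liks) - np.log(len(gains)))
    return int(np.argmax(scores))
```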

III-D2 MAP ($\hat{z}, \hat{g}$)

One may argue that subjects are able to identify the SNR, and thereby the scale factor $g$, after having listened to a particular test stimulus, before deciding on the word. In this case, one should maximize the joint posterior $f_{Z,G\mid\mathbf{Y}}(i, g\mid\mathbf{y})$ rather than $P_{Z\mid\mathbf{Y}}(i\mid\mathbf{y})$. This leads to the following optimization problem:

$(\hat{z}, \hat{g}) = \arg\max_{i,\,g}\; f_{Z,G\mid\mathbf{Y}}(i, g\mid\mathbf{y}).$  (19)
Lemma 3.

The optimal pair $(\hat{z}, \hat{g})$ defined in (19) is given by:¹ (¹We assume that $\hat{g} \in [g_{\min}, g_{\max}]$; otherwise the nearest end point, $g_{\min}$ or $g_{\max}$, should be chosen.)

where $\hat{g}$ is obtained by solving the following equation with respect to $g$:

Proof.

See Appendix C. ∎

III-D3 MAP ($\hat{z}, \hat{g}$) with discrete $g$

In the version of the listening test used in this paper, a fixed, limited set of SNRs is used, and it might be reasonable to assume that the subjects can identify these SNRs through the training phase. In this case, the scale factor is a discrete random variable, $G \in \{g_1, \dots, g_L\}$, rather than a continuous one. Thus, we maximize $P_{Z,G\mid\mathbf{Y}}(i, g\mid\mathbf{y})$. The optimization problem in (19) can thus be rewritten as:

$(\hat{z}, \hat{g}) = \arg\max_{i,\; g\in\{g_1,\dots,g_L\}} P_{Z,G\mid\mathbf{Y}}(i, g\mid\mathbf{y}).$  (20)
Lemma 4.

The optimal pair $(\hat{z}, \hat{g})$ defined in (20) is given by:

where

Proof.

See Appendix D. ∎

In order to take requirement (iii) into account, i.e. that subjects do not know a priori when the word starts, a window with the same size as the word is shifted within the stimulus. For each shift, the likelihoods are calculated using Lemma 2, Lemma 3, and Lemma 4, respectively. Denoting by $\mathbf{y}_\tau$ the portion of $\mathbf{y}$ captured by a shift of $\tau$, the problems corresponding to (18)-(20) are reformulated by additionally maximizing over the shift:

$\hat{z} = \arg\max_{i}\;\max_{\tau}\;\ell(i;\mathbf{y}_\tau),$  (21)

where $\ell(i;\mathbf{y}_\tau)$ denotes the objective of (18), (19), or (20), respectively, evaluated on $\mathbf{y}_\tau$.
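
A sliding-window wrapper in the spirit of (21) might look as follows (Python sketch; `score_word` stands for whichever of the three MAP objectives is used, e.g. the `map_word`-style Gaussian score above, and the hop size is an illustrative choice).

```python
import numpy as np

def classify_with_unknown_onset(stimulus: np.ndarray, word_len: int,
                                score_word, num_words: int, hop: int = 160) -> int:
    """Slide a word-sized window over the stimulus and keep, for each word,
    the best score over all shifts; then pick the best-scoring word, cf. (21).

    `score_word(segment, i)` must return a (log-)likelihood score for word i.
    """
    best = np.full(num_words, -np.inf)
    for start in range(0, len(stimulus) - word_len + 1, hop):
        segment = stimulus[start:start + word_len]
        for i in range(num_words):
            best[i] = max(best[i], score_word(segment, i))
    return int(np.argmax(best))
```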

IV Simulations and Experiments

IV-A Database

We use the DANTALE II database [12] for our simulations. This database contains 150 sentences sampled at 20 kHz with a resolution of 16 bits. The sentences are spoken by a native Danish speaker. Each sentence is composed of five words from five categories (name, verb, numeral, adjective, object). There are 10 different words in each of the five categories ($M = 10$). The sentences are syntactically fixed but semantically unpredictable (nonsense), i.e. the sentences have the same grammatical structure, but do not necessarily make sense.

IV-B Listening Test

We perform a listening test inspired by the Danish sentence test paradigm DANTALE II [12], which has been designed to determine the speech reception threshold (SRT), i.e. the signal-to-noise ratio at which the word recognition rate is 50%. In our test, the sentences are contaminated with additive stationary Gaussian noise with the same long-term spectrum as the sentences. The listening test is composed of two phases: a training phase and a test phase. In the training phase, we ask normal-hearing subjects to listen to versions of the noisy sentences to familiarize themselves with the test. In the test phase, subjects listen to the noisy sentences at different SNRs and choose the words they hear using a GUI. The GUI displays all candidate words on a computer screen (i.e. this is a closed-set listening test), and subjects are asked to choose a candidate word for each of the 5 word categories even if they were unable to recognize the words (forced choice). In both phases, DANTALE II sentences are used, and subjects listen to the noisy sentences through headphones.

Eighteen normal-hearing, native Danish-speaking subjects participated in this test. In the training phase, the subjects were exposed to 12 noisy sentences at 6 different SNRs, where each SNR was used twice. In the test phase, each subject listened to 48 sentences (6 SNRs × 8 repetitions). From this listening test, we obtained the human word recognition performance shown in Fig. 3 (blue circles and green fitted curve). The fitted curve is a maximum-likelihood (ML) fitted logistic function.
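
For reference, a psychometric curve of this kind can be fitted by maximum likelihood as sketched below (Python; the exact parametrization used in the paper is not reproduced here, so the sketch assumes a logistic with a chance-level floor of $1/M$, midpoint `srt`, and slope `s`, fitted to binomial word counts).

```python
import numpy as np
from scipy.optimize import minimize

def fit_psychometric(snr_db, n_correct, n_trials, M=10):
    """ML fit of P_d(snr) = 1/M + (1 - 1/M) / (1 + exp(-(snr - srt) / s)).

    Returns the fitted (srt, s).
    """
    snr_db, n_correct, n_trials = map(np.asarray, (snr_db, n_correct, n_trials))

    def neg_log_lik(theta):
        srt, s = theta
        p = 1 / M + (1 - 1 / M) / (1 + np.exp(-(snr_db - srt) / s))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(n_correct * np.log(p) + (n_trials - n_correct) * np.log(1 - p))

    res = minimize(neg_log_lik, x0=np.array([-10.0, 2.0]), method="Nelder-Mead")
    return res.x

# Toy usage: per-SNR counts of correctly recognized words (made-up numbers).
print(fit_psychometric([-21, -18, -15, -12, -9, -6],
                       [45, 90, 200, 320, 380, 398],
                       [400] * 6))
```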

IV-C Computing Information and Relative Information Loss

To calculate the divergences needed in (11), and thereby the bounds on $I(Z;\mathbf{Y})$ and the performance of the optimal classifiers, we need to build a covariance matrix for the frames of each word. In the DANTALE II database, each word has 15 different realizations. In our simulations, we use 14 realizations of each word for training (building the covariance matrices) and one realization for testing. Using the leave-one-out method, where 14 realizations are used for training and the remaining one for testing, we obtain 15 results whose average is used as the final result. In this way, we assume that the listeners learn one statistical model (covariance matrices) of sub-words for all realizations of a word through the training phase. To construct the covariance matrix for each frame, we first segment each word into non-overlapping frames with a duration of 20 ms. We then stack the corresponding frames of the 14 training realizations into a long vector, obtain the vector of linear prediction (LP) coefficients of this long vector, and calculate the covariance matrix of the frame from it as described in [26]. In a similar manner, we construct the noise covariance matrix $\boldsymbol{\Sigma}_{\mathbf{W}}$ using the LP coefficients of a long vector built by stacking all realizations of all words (recall that the noise has the same long-term spectrum as the speech). The frame covariance matrices of the clean words and of the noise are the sub-matrices that enter $\boldsymbol{\Sigma}_i$ in (13). Therefore, using the results of Section III-B1, the bounds on $I(Z;\mathbf{Y})$ can be calculated. In our simulations, we consider two cases: in the first case, we assume that frames are independent of each other; in the second case, we consider a first-order Markov model across frames.
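
The exact covariance construction follows [26]; one common variant, which the sketch below assumes, fits an autoregressive (AR) model via the LP coefficients and takes the frame covariance as the Toeplitz matrix of the implied autocorrelation. This may differ in detail from [26] and is shown only to make the pipeline concrete.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

def lp_coefficients(x: np.ndarray, order: int):
    """Autocorrelation-method LPC: solve the Yule-Walker equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order] / len(x)
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # predictor coefficients
    gain = r[0] - a @ r[1:]                       # prediction-error variance
    return a, gain

def ar_frame_covariance(x: np.ndarray, frame_len: int, order: int = 20) -> np.ndarray:
    """Toeplitz covariance of one frame implied by an AR model fitted to x."""
    a, gain = lp_coefficients(x, order)
    n_fft = 4096
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
    psd = gain / np.abs(A) ** 2                   # AR power spectrum gain / |A(e^jw)|^2
    autocorr = np.fft.irfft(psd)[:frame_len]      # autocorrelation = inverse transform of the PSD
    return toeplitz(autocorr)

# Toy usage: covariance of a 20 ms frame (400 samples at 20 kHz) from a stacked long vector.
rng = np.random.default_rng(2)
long_vector = rng.standard_normal(14 * 400)       # stand-in for 14 stacked realizations
cov = ar_frame_covariance(long_vector, frame_len=400)
```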

Using (16) and (17), together with the bounds in (11), we calculate the bounds on the relative information loss in the human auditory system and the relative information loss of the optimal classifiers. The results are plotted in Fig. 4. The relationship between the relative information loss and the probability of correct decision, $P_d$, for humans and the optimal classifier is plotted in Fig. 5.

Fig. 3: Performance of humans and the optimal classifiers for word recognition as a function of SNR. The error bars show the standard deviation of the probability of correct decision among subjects.

Fig. 4: Upper and lower bounds for the relative information loss in the human auditory system and the relative information loss in the optimal classifier as a function of SNR.
Fig. 5: The relative information loss vs. the probability of correct decision, $P_d$, for humans.

V Discussion

We observe from Fig. 3 that the word recognition rates of the classifiers and of the humans reach 1 at high SNRs, whereas at low SNRs they are at the chance level $1/M$. This is because at high noise levels the words are completely masked by noise, and both classifiers and humans choose words randomly from the $M$ candidates. It can also be seen that the optimal classifiers employing the first-order Markov model perform almost identically to those employing an independent-frame assumption. This implies that the independence assumption does not compromise the performance significantly. This result suggests that an independent-frame assumption, which is often employed in various speech processing contexts (e.g. [27]), is a reasonable assumption, at least in this context. The fact that the performances of all three classifiers are nearly identical means that, in this test, the alphabet of the SNR and the prior assumptions on it are insignificant. Finally, we observe from Fig. 3 that machine performance is substantially better than human performance; depending on the SNR, the machine classifiers outperform humans by as much as 8 dB. The superior performance of the classifiers for detecting noisy words compared to humans contradicts our hypothesis that humans are optimal at recognizing words in noise. In other words, the human auditory system performs sub-optimally in this particular task.

As can be seen from Fig. 4, the relative information loss for humans is around 0 at high SNRs, whereas at low SNRs the relative information loss reaches its maximum of 1. This is because at very low SNRs humans cannot recognize the words and therefore simply guess; in this case $P_d = 1/M$, so $I(Z;\hat{Z}) = 0$ and $\Delta I_r = 1$. We can therefore conclude that not only is less and less information available at the eardrum for decreasing SNRs, but of the information that is available at the eardrum, less and less is used for identifying the word. By contrast, at high SNRs there is no loss of the useful information needed for identifying the words in the human auditory system. It should be noted that the lower and upper bounds for the relative information loss for humans almost coincide. We also observe that the relative information loss of the optimal classifiers is less than the relative information loss in the human auditory system, which confirms that in this set-up the human auditory system performs sub-optimally.

From Fig. 5, it is seen that there is an inverse relationship between probability of correct decision (or speech intelligibility) and the relative information loss for humans. This monotonic relationship between intelligibility and the amount of information that reaches the brain has also been observed in [28].

VI Conclusion

In this paper, we defined and quantified the information loss in the human auditory system. We first considered a speech communication model where words are spoken and sent through a noisy channel, and then received by a listener. For this setup, we defined and bounded the relative information loss in the listener. The relative information loss describes the fraction of speech information that reaches the eardrum of the listener but is not used to decode the speech. To obtain the word recognition rate for humans, we conducted a listening test. The results showed that the bounds on the relative information loss in the human auditory system are tight and that, as the SNR increases, the relative information loss decreases. We also assessed the hypothesis that humans are optimal at recognizing speech signals in noise. To do so, we derived optimal classifiers and compared their information loss and word recognition rate to those of humans. The lower information loss and higher word recognition rate of the machine classifiers compared to humans imply the sub-optimality of the human auditory system for recognizing noisy words, at least for speech contaminated by additive, Gaussian, speech-shaped noise.

Appendix A Proof of Lemma 1

Upper and lower bounds for the entropy of a mixture of distributions are obtained as [20]:

$h_U(\mathbf{Y}) = \sum_{i=1}^{M} P_Z(i)\, h(\mathbf{Y}\mid Z=i) - \sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-D(f_i\Vert f_j)},$

$h_L(\mathbf{Y}) = \sum_{i=1}^{M} P_Z(i)\, h(\mathbf{Y}\mid Z=i) - \sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-C_\alpha(f_i\Vert f_j)},$  (22)

where $h_U(\mathbf{Y})$ and $h_L(\mathbf{Y})$ denote the upper and lower bounds for $h(\mathbf{Y})$, respectively. From (9), we have:

$f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i) = \prod_{k=1}^{K} f\big(\mathbf{y}_k\mid \mathbf{y}_1^{k-1}, i\big),$  (23)

where, under the Gaussian assumption of Section III-B1, each factor is a Gaussian conditional density.

So the KL-divergence and the Chernoff $\alpha$-divergence in (22) can be written as:

(24)

Using (22) and the fact that $I(Z;\mathbf{Y}) = h(\mathbf{Y}) - \sum_{i=1}^{M} P_Z(i)\, h(\mathbf{Y}\mid Z=i)$, we find lower and upper bounds for $I(Z;\mathbf{Y})$:

$-\sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-C_\alpha(f_i\Vert f_j)} \;\le\; I(Z;\mathbf{Y}) \;\le\; -\sum_{i=1}^{M} P_Z(i) \ln \sum_{j=1}^{M} P_Z(j)\, e^{-D(f_i\Vert f_j)}.$  (25)

Using the result in (24) and substituting it into (25) completes our proof.

Appendix B Proof of Lemma 2

Using Bayes' theorem, the posterior probability can be written as [29]:

$P_{Z\mid\mathbf{Y}}(i\mid\mathbf{y}) = \frac{P_Z(i)\, f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i)}{f_{\mathbf{Y}}(\mathbf{y})}.$  (26)

Since the speaker chooses words uniformly, $P_Z(i) = 1/M$, and $f_{\mathbf{Y}}(\mathbf{y})$ is independent of $i$, (18) can, from (26), be rewritten as:

$\hat{z} = \arg\max_{i\in\{1,\dots,M\}} f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i).$  (27)

In (23), we obtained the conditional PDF of $\mathbf{Y}$ given the word. However, the classifier does not know the SNR, i.e. the scale factor $g$, so we can write:

$f_{\mathbf{Y}\mid Z}(\mathbf{y}\mid i) = \int_{g_{\min}}^{g_{\max}} f_{\mathbf{Y}\mid Z, G}(\mathbf{y}\mid i, g)\, f_G(g)\, dg.$  (28)

Using the Gaussian model of Section III-B1, we obtain: