The sense of hearing provides humans with the possibility to explore and interact with their surrounding sound environment. Examples of this interaction are the ability to localize a sound object, obtain information about its identity, and communicate with others. The ability to access such information by using the hearing system is assumed to be possible due to the existence of internal processes of perceptual organization McAdams & Bigand (1993). The information used by these internal processes is sometimes referred to as “internal representations.” This term indicates that the hearing system delivers information about the sound object to the brain. The hearing system consists of a mechanical and a neural part. After the mechanical or peripheral auditory processing –that comprises the outer, middle, and inner ear– the sounds are represented as firing patterns in the auditory nerve. The neural part comprises the connectivity and involved functional mechanisms that transmit the information, i.e., firing patterns of the auditory nerve, through the central nervous system to the brain (e.g., Kohlrausch et al., 2013).
Although there is consensus that firing patterns at the level of the auditory nerve are encoded according to a frequency-to-place conversion that occurs in the inner ear (e.g., Robles & Ruggero, 2001), there is no similar consensus with respect to higher-level neural processing stages. This is reflected in computational models that consist of stages of peripheral and central processing: the former generally follow a tonotopic analysis based on a cochlear filter bank, whereas the latter employ diverging schemes to further process the simulated firing patterns or, in other words, to obtain and use internal representations.
|Central processor type||Number of representations||Peripheral stage based on|
|A. Template-based optimal detector Dau et al. (1997)||3||Dau et al. (1997)|
|B. Autocorrelator-based pitch analyzer Meddis & O’Mard (1997)||1||Meddis & Hewitt (1991)|
|C. Discriminability analyzer Fritz et al. (2007)||2||Glasberg & Moore (2002)|
|D. Envelope analyzer Jørgensen & Dau (2011)||1 (a)||Ewert & Dau (2000)|
|E. Room Acoustic Analyzer van Dorp et al. (2013); Osses et al. (2017a, 2020a)||1||Breebaart et al. (2001a)|
|F. Envelope analyzer Bianchi et al. (2019)||1||Zilany et al. (2009, 2014)|
|G. RMS difference detector Osses et al. (2019b); Verhulst et al. (2018b)||3||Verhulst et al. (2018a)|
|H. Template-based discriminability detector Maxwell et al. (2020)||2||Zilany et al. (2009, 2014)|
(a) Processor D processes “individual” speech samples in noise (i.e., one test interval), but the processor also needs to have access to the internal representation of the noise alone in order to generate its output metric.
A central processing stage can therefore be seen as a back-end module for the peripheral processing. A selected list of central processors used in published models is presented in Table 1. A central processor accounts for: (1) high-level neural processing of the hearing system (to a greater or lesser extent), and (2) coupling of the internal representation to a certain “criterion” (decision stage) that provides concrete information about the processed sound object. In general, this latter aspect is assessed either by comparing two or more internal representations (e.g., processors A, C, G, and H in Table 1) or by converting the internal representation into a metric believed to reflect some perceptual aspect of the processed sound object (e.g., processors B, D, E, and F in Table 1).
In this study, a computational model that follows the former rationale is used. We use an updated version of the perception model (PEMO) described by Dau et al. (1997) with a central processor that compares different internal representations by using the concept of an optimal detector. Therefore, our work is concerned with one possible way of comparing internal representations of different sounds. In particular, the comparison of internal representations is implemented as a three-alternative forced-choice (3-AFC) performance task and it is applied to the evaluation of perceptual similarity between piano note recordings Osses et al. (2019a). The choice of evaluating piano sounds was motivated by: (1) the complex spectro-temporal properties present in piano sounds, (2) the fact that piano sounds have been thoroughly studied in physical acoustics and we recently quantified differences psychoacoustically Chaigne et al. (2019); Osses et al. (2019a), and (3) the fact that the PEMO model has been primarily applied to study artificial sounds Dau et al. (1996a, b); Jepsen et al. (2008) and speech Jørgensen & Dau (2011); Relaño-Iborra et al. (2019) and less often to other types of sounds, including musical instrument sounds Huber & Kollmeier (2006). Although Huber & Kollmeier applied this auditory model to more diverse sets of sounds, their central processor was adapted to provide a quality metric and, therefore, the goal in their study was to assess judgments of sound quality rather than to simulate performance. In this context, the work presented in this study can be seen as an extension of the use of the unified framework offered by the PEMO model.
We start by providing an overview of each stage within the PEMO model (Sec. II.1–II.2). Subsequently, an internal representation obtained from the model is described (Sec. II.3), introducing the information-based approach we adopted to analyze the contribution from different frequency regions in the internal representations of our stimuli. In Sec. III, the dataset of piano sounds and the background noises we used are presented together with the adopted simulation protocol. The obtained results are presented in Sec. IV and discussed in Sec. V. This includes simulations using the same noises as used in the reference listening experiments, and using an alternative noise that has slightly different spectral properties. These latter simulations were used to provide insights into the weighting of frequency information used by the listeners during the reference experiment. We conclude this paper by providing perspectives on the use of the PEMO model for applications different from those for which the model was developed.
II Model description
The block diagram of the perception model (PEMO; Dau et al., 1997) is shown in Fig. 1. Each of the model stages is described in this section. [Footnote 1: A similar but much shorter description of the stages of the PEMO model, including the adopted central processor, can be found in our previous simulation paper Osses & Kohlrausch (2018).] These descriptions were compiled from previously published implementations of the PEMO model Dau et al. (1996a, 1997); Verhey et al. (1999); Derleth & Dau (2000) or variants of it Breebaart et al. (2001a, b, c); Jepsen et al. (2008). The current model configuration has been included in the AMT toolbox Søndergaard & Majdak (2013), v0.10 (Appendix A).
II.1 Processing stages
II.1.1 Outer- and middle-ear filtering
This stage accounts for the effects of the outer and middle ear on the incoming signal and it is implemented as two cascaded 512-tap finite impulse response (FIR) filters. The outer-ear filter introduces a transfer function from headphones to the tympanic membrane, emphasizing frequencies around 2750 Hz and attenuating frequencies above 6000 Hz Pralong (1996). The middle-ear filter introduces a transfer function from the tympanic membrane to the stapes, approximating the (peak-to-peak) velocity of the stapes in response to pure tones. The filter acts as a band-pass filter (BPF) with a maximum at 800 Hz (unit gain, 0 dB) and slopes of about 6 dB/octave below and above that frequency Lopez-Poveda & Meddis (2001); Goode et al. (1994). The resulting frequency response shows two local peaks, at 2750 Hz and 5000 Hz. This stage was included in the auditory model of Jepsen et al. (2008) but not in previous versions of the PEMO model.
II.1.2 Gammatone filter bank
This set of filters corresponds to a level-independent approximation of a critical-band filter bank. The Gammatone filter bank consists of 31 bands having center frequencies between 87 Hz (3 ERB) and 7819 Hz (33 ERB), spaced at 1 ERB. The model uses band-limited signals obtained from the real part of the complex-valued all-pole implementation described by Hohmann (2002). All further processing stages of the model work independently on each auditory filter output.
II.1.3 Hair-cell transduction
This stage accounts for a simplified inner hair cell processing. It simulates the transformation from mechanical oscillations in the basilar membrane into receptor potentials in the inner hair cells. The signals are first half-wave rectified and then low-pass filtered using five first-order infinite impulse response (IIR) filters with a cut-off frequency (fcut-off) of 2000 Hz. The combined effect of the cascaded low-pass filters (LPFs) is equivalent to applying a fifth-order IIR filter with an fcut-off of 770 Hz. Therefore, frequency components below 770 Hz are almost unaffected, preserving the phase information (maximum attenuation of 3 dB at 770 Hz). Frequency components between 770 and 2000 Hz are gradually attenuated (attenuations between 3 and 15 dB, respectively), meaning that the phase information is gradually lost. For frequency components above 2000 Hz almost all phase information is removed (attenuation greater than 15 dB, slope of -30 dB/octave). This way of removing phase information is consistent with the decrease of phase locking observed in the auditory nerve Breebaart et al. (2001a).
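The 770-Hz and 15-dB figures quoted above follow from cascading five identical first-order sections. The sketch below checks this arithmetic under the assumption that each section has the analog first-order magnitude response |H(f)|^2 = 1/(1 + (f/fc)^2). The digital IIR filters of the model only approximate this response, and the function names are illustrative rather than taken from any toolbox.

```python
import math

def cascade_attenuation_db(f_hz, fc_hz=2000.0, n_stages=5):
    """Attenuation in dB (positive values) of n identical first-order LPFs in cascade.

    Assumes the analog prototype |H(f)|^2 = 1 / (1 + (f/fc)^2) for each stage.
    """
    return 10.0 * n_stages * math.log10(1.0 + (f_hz / fc_hz) ** 2)

def cascade_cutoff_hz(fc_hz=2000.0, n_stages=5):
    """Frequency at which the whole cascade reaches half power (about 3 dB)."""
    return fc_hz * math.sqrt(2.0 ** (1.0 / n_stages) - 1.0)

combined_cutoff = cascade_cutoff_hz()           # close to 770 Hz
att_at_2000 = cascade_attenuation_db(2000.0)    # close to 15 dB
```

Each stage contributes about 3 dB at 2000 Hz, so five stages give about 15 dB there, while the half-power point of the cascade moves down to roughly 771 Hz.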
II.1.4 Adaptation loops
This stage simulates the non-linear adaptive properties of the hearing system at the level of the auditory nerve, where changes in the transformation characteristic (system gain) are introduced according to the input signal level (e.g., Kohlrausch et al., 1992). If the change of the input level is rapid compared to a time constant τ, the system gain increases, transforming the levels linearly. For slower variations, the system gain gradually decreases and the input levels are compressed. The adaptation loops structure is used for this purpose Püschel (1988); Münkner (1993); Dau et al. (1996a), which consists of 5 feedback loops, each of them having a different time constant (τ = 5, 50, 129, 253, 500 ms). A detailed description of this stage is given in Appendix B (see also Osses, 2018, his App. C). In short, each loop acts as a low-pass filter that gets charged or discharged (applying more or less compression, respectively) depending on the instantaneous characteristics of the incoming signals and the corresponding τ. In line with previous model implementations, we introduced an overshoot limitation, meaning that the (output) amplitudes for rapid input changes (relative to τ) are limited. However, unlike previous studies, the limiter factor was set to a lower value (lim=5 instead of 10), which led to a more severe compression of overshoot responses. Due to the relevance of the note onset in piano sounds, the choice of this new limiter factor strongly influenced the simulations that are shown later in this paper. For this reason, the effect of using the new limiter factor is described in Sec. II.3.
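The behavior described above (linear transformation of fast changes, compression of slow ones) can be illustrated with a strongly simplified sketch of a cascade of divisive feedback loops. This is not the implementation used in the paper: the overshoot limiter is omitted, the initial loop states are set to an arbitrary value of 1, and the function name and signature are hypothetical.

```python
import math

def adaptation_loops(x, fs, taus=(0.005, 0.050, 0.129, 0.253, 0.500), init_state=1.0):
    """Simplified cascade of divisive feedback loops (no overshoot limiter).

    x: list of positive input samples; fs: sampling rate in Hz.
    Each loop divides its input by a low-pass filtered copy of its own output.
    """
    coeffs = [math.exp(-1.0 / (tau * fs)) for tau in taus]
    states = [init_state] * len(taus)
    out = []
    for sample in x:
        y = sample
        for k, a in enumerate(coeffs):
            y = y / states[k]                          # divide by the current loop state
            states[k] = a * states[k] + (1.0 - a) * y  # charge or discharge the low-pass state
        out.append(y)
    return out

fs = 1000                            # Hz, low rate chosen only to keep the example fast
steady = [100.0] * 6000              # 6 s of constant input
y = adaptation_loops(steady + [200.0], fs)
steady_output = y[-2]                # approaches 100 ** (1 / 32)
onset_ratio = y[-1] / y[-2]          # a sudden doubling is passed on almost linearly
```

In this sketch a stationary input x settles at x^(1/32), because each loop approaches a square-root compression, whereas an abrupt doubling of the input is initially passed with a ratio of about 2.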
II.1.5 Modulation filter bank
The modulation filter bank processes the incoming signal in terms of changes in its envelope. In this stage the same implementation as suggested by Jepsen et al. (2008) is used. First, a reduction in the sensitivity to modulation frequencies above 150 Hz is introduced Kohlrausch et al. (2000) by applying an LPF with an fcut-off of 150 Hz. The filter bank comprises a maximum of 12 filters (Table 2) that have two different envelope frequency domains: (1) Bands 1 to 3 (mf up to 10 Hz, constant bandwidth of 5.4 Hz, Table 2), where the real-valued part of the filtered (band-limited) signals is used. This processing preserves the modulation phase information; (2) Bands 4 to 12 (mf above 10 Hz, constant quality factor Q=2), where the norm of the complex signals is used, which represents an approximation to the Hilbert envelope Hohmann (2002). [Footnote 2: Other studies that use modulation filters have alternatively used wider filters, with Q factors of 1 Ewert & Dau (2000); Nelson & Carney (2004); Jørgensen & Dau (2011); Wallaert et al. (2017). We did not investigate the effect of filter tuning.] Although this process considerably reduces the modulation phase information, the modulation energy within the respective band is maintained. An attenuation factor is applied to the resulting signals Jepsen et al. (2008).
The modulation filters for each audio frequency band are limited to filters having an mf below a quarter of the audio center frequency fc. This is motivated by results presented by Langner & Schreiner (1988), where evidence is provided that neural activity in the ascending auditory path (auditory brainstem) has best modulation frequencies limited to that frequency range (i.e., mf below fc/4). The modulation filter bank output represents an internal representation that approximates the time-frequency modulation sensitivity at the level of the inferior colliculus Dau et al. (1997). The amplitudes of the representation are scaled in arbitrary or model units (MU) Kohlrausch et al. (1992); Dau et al. (1996a).
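The quarter-of-center-frequency rule can be written as a simple selection. The sketch below is only illustrative: the modulation center frequencies are copied from Table 2, the 1000-Hz audio center frequency is an arbitrary example value, and the function name is hypothetical.

```python
def allowed_modulation_bands(mod_center_freqs_hz, audio_fc_hz, max_ratio=0.25):
    """Return the 1-based indices of modulation filters with mf below max_ratio * fc."""
    limit_hz = max_ratio * audio_fc_hz
    return [i + 1 for i, mf in enumerate(mod_center_freqs_hz) if mf < limit_hz]

# Center frequencies (Hz) of modulation bands 1 to 11 as listed in Table 2
mf_table = [2.7, 5.0, 10.0, 16.7, 27.8, 46.3, 77.2, 128.6, 214.3, 357.2, 595.4]

# Example: an audio band centered at 1000 Hz keeps filters with mf below 250 Hz
bands_at_1000_hz = allowed_modulation_bands(mf_table, 1000.0)
```

For this example value, bands 1 to 9 are retained and the bands centered at 357.2 Hz and above are discarded.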
|Band Nr.||mf [Hz]||Frequency range, inf - sup [Hz]||BW [Hz]||Q||Audio-band fc [ERB]|
|1||2.7||0.0 - 2.7||2.7||- (a)||3-33|
|2||5.0||2.7 - 8.1||5.4||0.9||3-33|
|3||10.0||7.4 - 12.8||5.4||1.9||3-33|
|4||16.7||12.8 - 20.9||8.1||2.1||3-33|
|5||27.8||20.9 - 35.0||14.1||2.0||4-33|
|6||46.3||35.0 - 58.5||23.6||2.0||4-33|
|7||77.2||57.9 - 96.9||39.0||2.0||8-33|
|8||128.6||96.9 - 160.8||63.9||2.0||11-33|
|9||214.3||160.8 - 268.5||107.7||2.0||15-33|
|10||357.2||268.5 - 446.8||178.3||2.0||19-33|
|11||595.4||446.8 - 744.2||297.4||2.0||23-33|
(a) Band 1 is a LPF, for which no Q factor is indicated.
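The tabulated center frequencies of bands 3 to 11 are consistent with a constant ratio of 5/3 between adjacent bands, which is the spacing obtained when constant-Q filters with Q = 2 are placed edge to edge. The sketch below reproduces the tabulated values under that assumption. It is a consistency check on Table 2 and not necessarily how the filter bank was designed, and the function name is hypothetical.

```python
def constant_q_centers(first_hz=10.0, q=2.0, count=10):
    """Center frequencies of adjacent constant-Q bands placed edge to edge.

    The upper edge of one band, mf * (1 + 1/(2Q)), equals the lower edge of
    the next band, mf_next * (1 - 1/(2Q)), which gives a ratio of 5/3 for Q = 2.
    """
    ratio = (1.0 + 1.0 / (2.0 * q)) / (1.0 - 1.0 / (2.0 * q))
    return [first_hz * ratio ** k for k in range(count)]

# Bands 3 to 12; the last value is an extrapolation, since band 12 is not listed in Table 2
centers = [round(mf, 1) for mf in constant_q_centers()]
```

The first nine values match the mf column for bands 3 to 11. The tenth value (about 992 Hz) is only what this spacing would predict for band 12.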
II.1.6 Central processor
In this stage, the internal representation of each of the three intervals in a trial is compared with a reference representation or “sound image” that is stored in the memory (top-down component) of the model, the template, to mimic the decision that human listeners perform in a 3-AFC task. This central processor is inspired by the concept of an optimal detector (Green & Swets, 1966, Ch. 6 and 7). The model is hence used as an artificial listener, where the template corresponds to an expected sound representation that gives a clear indication to the artificial listener about “what to listen for” Dau et al. (1996a). [Footnote 3: In the literature (and in this paper), the terms “artificial listener” and “artificial observer” are used interchangeably.] For this reason the template should be derived in a condition that is easily detectable, i.e., in a supra-threshold condition. For detection and discrimination tasks conducted in noise this is at a high signal-to-noise ratio (SNR). In our simulations we adopted an SNRsupra equal to 21 dB, which represents an SNR that is 5 dB above the initial SNR used to collect the reference (experimental) data.
Use of a template
In order to obtain a decision outcome using the internal representations of the three intervals and the stored template, the representation that has the highest similarity with the template is indicated by the artificial listener as containing the target interval. The cross-correlation value (CCV) between a difference representation and the template, both treated as digital signals with a common sampling rate and number of samples, can be used for this purpose (Eq. 1). The difference representation is obtained from the piano-plus-noise representation and the corresponding noise-alone representation, similar to how it has been done in previous tone-in-noise simulations Dau et al. (1996a). However, as explained next, our similarity task further required enabling the central processor to use two templates.
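A minimal sketch of a zero-lag cross-correlation value is given below for a single time series. The scaling by the inverse of the sampling rate is an assumption of this sketch and should be checked against the original equation. The function names are hypothetical and the representations are reduced to one dimension for brevity.

```python
def difference_representation(signal_plus_noise, noise_alone):
    """Sample-by-sample difference between two internal representations."""
    return [s - n for s, n in zip(signal_plus_noise, noise_alone)]

def cross_correlation_value(diff_repr, template, fs):
    """Zero-lag cross-correlation between a difference representation and a template.

    The 1/fs scaling is an assumption of this sketch.
    """
    if len(diff_repr) != len(template):
        raise ValueError("both signals must have the same number of samples")
    return sum(d * t for d, t in zip(diff_repr, template)) / fs

# Toy values in model units, not taken from the paper
diff = difference_representation([3.0, 5.0, 4.0], [1.0, 1.0, 1.0])
ccv = cross_correlation_value(diff, [1.0, 0.0, 1.0], fs=1.0)
```

A higher value indicates that the difference representation resembles the template more closely.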
Template in a similarity task
The template we adopted to simulate the 3-AFC similarity task described by Osses et al. (2019a) is determined by: (a) the target and “reference” sounds, and (b) two or more realizations of a noise that can efficiently mask the properties of both piano sounds, given that the 3-AFC task is implemented as a piano-in-noise discrimination. To account for the latter aspect, noises that are generated using a modified ICRA algorithm (version A, Osses et al., 2019a) are used in every piano presentation. [Footnote 4: ICRA stands for International Collegium of Rehabilitative Audiology. We decided to keep this denomination in the current study, given that our background noises had the same objective as the original ICRA noises Dreschler et al. (2001) of creating noises with the same spectro-temporal properties as the input sounds.] For the first aspect, the template is derived from the internal representations of both the target piano and the reference piano, because their discrimination threshold depends on how different they are from each other. In the course of this study, different ways of deriving the template from these two representations were evaluated. Only the adopted approach is described in this section. The interested reader is referred to (Osses, 2018, his App. E) where two other (discarded) template approaches are described.
In the adopted approach, two templates are derived, one for the target and one for the reference piano sound. For each template, an average representation of the piano sound embedded in four different ICRA noise realizations at a highly discriminable condition (SNRsupra = 21 dB) is obtained. [Footnote 5: The number of ICRA noise realizations (four) used to derive each average piano-plus-noise representation was an arbitrary choice.] The templates are normalized to unit energy Dau et al. (1996a), as expressed by Eq. 2. The energy is computed over the number of samples used by the artificial listener to make the decision, during what we refer to as the observation (listening) period tobs. [Footnote 6: In analogy to the theory of optimal detectors presented by Green & Swets (1966), we treat the target and reference templates as “expected signals” along one (temporal) dimension. In fact, there are two other dimensions: audio and modulation frequency. Considering all template dimensions and following the nomenclature of Eq. 7, the sum in Eq. 2 would extend over all three dimensions.]
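The averaging and normalization steps can be sketched as follows. The energy definition used here (sum of squared samples divided by the sampling rate) is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def average_representation(representations):
    """Average several equally long representations sample by sample."""
    n = len(representations)
    return [sum(samples) / n for samples in zip(*representations)]

def normalize_unit_energy(template, fs):
    """Scale a template so that sum(template ** 2) / fs equals 1 (assumed energy definition)."""
    energy = sum(v * v for v in template) / fs
    if energy == 0.0:
        raise ValueError("cannot normalize an all-zero template")
    scale = 1.0 / math.sqrt(energy)
    return [v * scale for v in template]

# Toy example with four "noise realizations" of a three-sample representation
realizations = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0], [2.0, 1.0, 1.0], [2.0, 3.0, -1.0]]
fs = 4.0
template = normalize_unit_energy(average_representation(realizations), fs)
template_energy = sum(v * v for v in template) / fs
```

Only the first samples that fall within the observation period would be passed to these functions when the period is shortened.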
Use of two templates
During the simulations, the target and reference templates are compared with the three intervals in each 3-AFC trial. Equation 1 is used for this purpose, with a difference representation obtained from each interval representation and the corresponding noise representation at the SNR of the trial. Six CCV values are obtained (Eq. 3), one for each combination of template and interval. [Footnote 7: The use of difference representations is relevant to obtain CCV values for the target and reference templates that are in a comparable range. This “CCV baseline” is needed because the unit energy normalization of the templates is done independently and is, therefore, an inherent problem of the use of two templates. Subtracting the noise-alone representations in the CCV calculation implies that the resulting CCV values indicate the contribution of information of the piano relative to that of the noise.]
Based on these values, the artificial listener chooses the interval that is more likely to contain the target sound. If we assume that the target sound is presented in the first interval, Eq. 4 gives the condition for a correct discrimination.
In other words, the target template is expected to elicit the maximum CCV value when being correlated with the target interval. Likewise, the reference template elicits higher CCV values when being correlated with the reference intervals and therefore the lowest CCV value is attributed to the target interval. The hat symbol indicates that the CCV values differ from the exact definition given in Eq. 3. This is caused by an internal noise, whose values are drawn from a Gaussian distribution with a mean of 0 MU. In our implementation of the internal noise, one independent number is added to each CCV value (Eq. 5).
Since the mean is zero, the standard deviation corresponds to the actual source of internal variability in the decision process. The use of this Gaussian noise leads to a reduction in the process performance when either the CCV values obtained with the target template get close to each other or when the CCV values obtained with the reference template do. The standard deviation was obtained by running an increment-discrimination task with each piano sound and tracking the amount of noise needed to produce an average performance of 70.7% for a given difference in level (Osses, 2018, his App. D).
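The decision stage can be sketched as below. The way the two sets of CCV values are combined (target-template CCV minus reference-template CCV, followed by picking the maximum) is an assumption made for illustration and is not confirmed by the text, which only states that the target interval should give the highest CCV with the target template and the lowest with the reference template. The function name and the value of sigma are hypothetical.

```python
import random

def choose_interval(ccv_target, ccv_reference, sigma, rng=None):
    """Pick the interval most likely to contain the target sound.

    ccv_target / ccv_reference: CCV of each interval with the target / reference template.
    sigma: standard deviation of the zero-mean Gaussian internal noise (model units).
    The combination rule (difference of noisy CCVs) is an assumption of this sketch.
    """
    rng = rng or random.Random()
    noisy_target = [c + rng.gauss(0.0, sigma) for c in ccv_target]
    noisy_reference = [c + rng.gauss(0.0, sigma) for c in ccv_reference]
    evidence = [t - r for t, r in zip(noisy_target, noisy_reference)]
    return max(range(len(evidence)), key=evidence.__getitem__)

# Without internal noise, the interval with the clearest target evidence is chosen
chosen = choose_interval([5.0, 1.0, 1.0], [1.0, 5.0, 5.0], sigma=0.0)
```

With a non-zero sigma, the choice becomes less reliable as the CCV values of the three intervals approach each other, which is the performance reduction described above.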
Alignment between piano representations
A final aspect that was considered in the decision criterion is that the cross-correlation between templates and interval representations should deliver the highest possible CCV values. As described in Sec. III.2.1, our piano stimuli are aligned to have the note onset at a time stamp of 0.1 s. This alignment seemed to be sufficient for listeners to perceive aligned piano-plus-noise intervals during the experimental sessions Osses et al. (2019a). However, this did not always ensure a maximum CCV value in the simulated decision process, especially when correlating the target piano representation with the reference template or the reference piano representation with the target template. The workaround we implemented was the assessment of the cross-correlation function between each template and interval for a limited range of time lags, with 1-ms steps. In this way, the CCV values of Eq. 5 are obtained from the maximum of the corresponding cross-correlation function. Another interesting approach that may have produced similar maximization results is the time stretching between representations Agus et al. (2012).
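The lag search can be sketched as follows. The lag range is a free parameter here, the 1/fs scaling of the correlation is an assumption, and the function name is hypothetical.

```python
def max_lagged_ccv(interval, template, fs, max_lag_ms):
    """Maximum cross-correlation between an interval and a template over lags in 1-ms steps.

    Returns the maximum value and the lag (in ms) at which it occurs. A positive lag
    means that the interval is delayed with respect to the template.
    """
    best_value, best_lag_ms = None, 0
    for lag_ms in range(-max_lag_ms, max_lag_ms + 1):
        lag = int(round(lag_ms * fs / 1000.0))   # lag in samples
        total = 0.0
        for i, t in enumerate(template):
            j = i + lag
            if 0 <= j < len(interval):           # samples outside the interval count as zero
                total += interval[j] * t
        value = total / fs
        if best_value is None or value > best_value:
            best_value, best_lag_ms = value, lag_ms
    return best_value, best_lag_ms

# Toy example at fs = 1000 Hz: the interval is the template delayed by 3 ms
template = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
interval = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
best_ccv, best_lag = max_lagged_ccv(interval, template, fs=1000, max_lag_ms=5)
```

In the toy example the maximum is found at a lag of 3 ms, where the two signals overlap exactly.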
II.2 Sources of internal and external variability
The perception of specific sounds is influenced by uncertainties in the stimuli and by internal variability caused, e.g., by imperfections in memory and changes in the concentration level Yost et al. (1989). In this study we differentiate between sources of variability that are internal or external. Uncertainties in the stimuli are related to an external source of variability, while the effects of human memory and concentration correspond to sources of internal variability. To (partly) account for variations in listening performance due to sources of internal variability, an internal noise is often used in models of auditory processing, which in the PEMO model is simulated as an additive Gaussian noise (Eq. 5). In threshold-detection tasks in background noise, a typical source of external variability is the use of running noises, i.e., the use of different noise realizations that have the same statistical properties within and between trials (e.g., von Klitzing & Kohlrausch, 1994). In the instrument-in-noise test, a running noise condition is approximated by using 12 different ICRA noise realizations for each piano pair Osses & Kohlrausch (2018); Osses et al. (2019a). Another source of external variability in the test is the presentation level of each interval, which is randomized (roved) by level offsets drawn from a uniform distribution.
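Level roving can be sketched as below. The roving range is left as a parameter, the 4-dB value in the example is purely illustrative, the range is assumed to be symmetric around 0 dB, and the function name is hypothetical.

```python
import random

def rove_level(samples, rove_range_db, rng=None):
    """Apply a random level offset drawn uniformly from [-rove_range_db, +rove_range_db].

    A symmetric range around 0 dB is an assumption of this sketch.
    Returns the scaled samples and the applied offset in dB.
    """
    rng = rng or random.Random()
    rove_db = rng.uniform(-rove_range_db, rove_range_db)
    gain = 10.0 ** (rove_db / 20.0)    # amplitude gain corresponding to the dB offset
    return [gain * s for s in samples], rove_db

# Illustrative call with a hypothetical range of 4 dB and a seeded generator
scaled, applied_db = rove_level([1.0, -0.5], 4.0, rng=random.Random(1))
```

An independent offset would be drawn for each of the three intervals of a trial.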
II.3 Description of internal representations
The internal representations of this study were simulated using the PEMO model with an adaptation stage that had a stronger limiter factor (lim=5) than typically used in the literature (lim=10). For this reason this section shows how this parameter choice influenced the piano representations that were later used to simulate the discrimination thresholds between pianos.
II.3.1 General description of the representations
The internal representation of pianos P1 and P3 (see Sec. III.2.1) after the modulation filter bank (output of Stage 5, Fig. 1) is shown in Fig. 2 for the frequency band that contains the fundamental frequency of the piano note (= ERB or Hz, =554 Hz). The piano sounds start at s and their onsets occur shortly thereafter. The onset of the lowest modulation filter (band 1, mf= Hz, see Table 2) occurs approximately at = s, for band 2 at = s and for the rest of the bands between =0.10 and =0.11 s. It can also be observed in the figure that after the piano onset, the amplitudes in bands 2 to 8 of P3 present more variations than those of P1, especially between =1 and s.
II.3.2 Effect of the stronger limiter factor
The effect of using a stronger limiter factor in the adaptation loops (Stage 4, Fig. 1) is illustrated for one of the piano representations (piano P1). The following description is qualitatively valid also for the other 6 piano sounds of the dataset (not shown here).
Two configurations of the adaptation loops are used: a limiter factor lim=5 (as used in this study, Fig. 2A) and a factor lim=10 (as used in the literature, Fig. 3A). The representation with lim=5 reaches a maximum amplitude of 142 MU, whereas the representation with lim=10 reaches 231.5 MU. In both cases the minimum and maximum amplitudes occur in band 2 (centered at mf=5 Hz). The difference between both representations is shown in Fig. 3B. This difference is more prominent around the sound onset, visible as light areas in Fig. 3B, indicating less compressed amplitudes for the representation with lim=10 than for the representation with lim=5. This is followed, however, by a short but strong compression immediately after the overshoot peaks in the representation with lim=10, which is indicated by the dark regions in the figure and is strongest in band 2. The largest difference between representations is found in band 2, where the representation with lim=10 reaches an amplitude 89.5 MU above the maximum of the representation with lim=5.
II.3.3 Information in the internal representations
The internal representations obtained with the PEMO model have three dimensions: time, audio frequency, and modulation frequency. Here we suggest an approach to benefit from the information available across dimensions within a representation, which can be used to compare two (or more) representations. We illustrate this method for the comparison of P1 representations using lim=5 and lim=10.
The contribution of information for each audio and modulation frequency band can be assessed using Eq. 6. This expression is similar to Eq. 3, but subindexes for the audio and modulation frequency bands have been added to indicate that the “integration of information” can be done by either deriving the contribution (1) of audio frequency bands across all modulation filter bands, or (2) of modulation frequency bands across all audio frequency bands. Both contributions can be expressed as percentages of the total information, which is given by Eq. 7.
The results of this information-based analysis in the comparison of P1 representations with lim=5 and lim=10 are shown in Fig. 4A–B for the audio and modulation bands, respectively. It can be observed that the use of a stronger limitation (lim=5) increases the relative contribution of higher audio frequency bands (Fig. 4A), while no substantial change in the information weighting is observed across modulation bands (Fig. 4B). For the representation with lim=5, the lower audio frequency bands comprise only 30.9% of the information, in contrast to 45.6% for the representation with lim=10 in the same frequency region. In terms of modulation frequency content, which is similar for both representations, bands 1 and 2 (mf up to 5 Hz) comprise about 40% of the information and the remaining 60% is distributed across bands 3 to 12.
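The percentage bookkeeping behind such an analysis can be sketched as follows. The measure of “information” used here (sum of squared amplitudes over time) is an assumption made for illustration and may differ from the expression used in the paper. The data layout and the function name are hypothetical.

```python
def information_percentages(representation):
    """Relative contribution (in %) of each audio band and each modulation band.

    representation: dict mapping (audio_band, modulation_band) to a list of time samples.
    The information of one channel is taken as its sum of squares (assumed measure).
    """
    per_audio, per_mod, total = {}, {}, 0.0
    for (audio_band, mod_band), samples in representation.items():
        info = sum(v * v for v in samples)
        per_audio[audio_band] = per_audio.get(audio_band, 0.0) + info
        per_mod[mod_band] = per_mod.get(mod_band, 0.0) + info
        total += info
    if total == 0.0:
        raise ValueError("representation contains no information")
    audio_pct = {band: 100.0 * v / total for band, v in per_audio.items()}
    mod_pct = {band: 100.0 * v / total for band, v in per_mod.items()}
    return audio_pct, mod_pct

# Toy representation with two audio bands and two modulation bands
toy = {(1, 1): [1.0, 1.0], (1, 2): [2.0], (2, 1): [2.0]}
audio_pct, mod_pct = information_percentages(toy)
```

Both sets of percentages add up to 100%, so they can be compared directly between two representations, such as those obtained with lim=5 and lim=10.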
The 3-AFC discrimination experiment described by Osses et al. (2019a) to evaluate the perceptual similarity between piano notes was simulated using the PEMO model. A general description of the experimental procedure and sound stimuli is given, indicating the adopted considerations to run our simulations. Complementary details about the experimental design can be found in our previous studies Osses et al. (2016, 2017b, 2019a); Chaigne et al. (2019).
III.1 Apparatus and procedure
The simulations were run using the AFC toolbox for MATLAB Ewert (2013). The AFC toolbox provides a framework to conduct listening experiments, including the option to enable an artificial rather than a human listener. The artificial listener consists of an auditory model (here the PEMO model) with a central processor based on signal detection theory (Sec. II.1.6).
The experiment was implemented as a 3-AFC task with the level of the ICRA noises used as adjustable parameter, expressed as signal-to-noise ratios (SNRs). The set-up of the task was almost identical to that used in the experimental sessions Osses et al. (2019a). We only introduced small deviations to the experimental procedure, aiming at reducing the simulation time. The simulation procedure was as follows: In each adaptive track two sounds were compared, the target sound (presented once) and the reference sound (presented twice). The noise level was adjusted following a two-down one-up rule until 8 reversals were reached (4 fewer than in the experiments). The step sizes were set to 4 dB, 2 dB (after the second reversal) and 1 dB (after the fourth reversal). The corresponding discrimination threshold was obtained from the median of the last 4 reversals. The sound presentation level was randomly varied (roved) by level offsets drawn from a uniform distribution. The threshold estimation was repeated 6 times for each condition.
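The adaptive track described above can be sketched as a small staircase routine. This is an illustration and not the AFC toolbox implementation: the starting SNR of 16 dB follows from the SNRsupra of 21 dB being 5 dB above the initial SNR, the listener is a hypothetical callback, and the reversal bookkeeping (storing the SNR at which the direction changes) is an assumption.

```python
import statistics

def run_staircase(respond_correctly, start_snr_db=16.0, n_reversals=8):
    """Two-down one-up track on the SNR; returns the median of the last 4 reversals.

    respond_correctly: callable taking the SNR (dB) and returning True or False.
    Step sizes: 4 dB, then 2 dB after the second reversal, then 1 dB after the fourth.
    """
    snr, step = start_snr_db, 4.0
    consecutive_correct = 0
    direction = 0                      # -1: SNR going down (harder), +1: going up (easier)
    reversals = []
    while len(reversals) < n_reversals:
        if respond_correctly(snr):
            consecutive_correct += 1
            if consecutive_correct < 2:
                continue               # a second correct response is needed to go down
            consecutive_correct = 0
            new_direction = -1
        else:
            consecutive_correct = 0
            new_direction = +1
        if direction != 0 and new_direction != direction:
            reversals.append(snr)      # SNR at which the track changes direction
            if len(reversals) == 2:
                step = 2.0
            elif len(reversals) == 4:
                step = 1.0
        direction = new_direction
        snr += new_direction * step
    return statistics.median(reversals[-4:])

# Hypothetical deterministic listener that is correct only above 0 dB SNR
threshold = run_staircase(lambda snr_db: snr_db > 0.0)
```

With a real (stochastic) listener, a two-down one-up rule converges towards the SNR that yields 70.7% correct responses.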
III.2.1 Piano sounds
A selection of non-reverberant piano notes recorded from historical Viennese pianos was used for the simulations (see Dataset 1, Osses et al., 2019a). In brief, recordings of note C# (fundamental frequency of 554 Hz) from seven pianos were used. One recording per piano was chosen, leading to a total of 7 stimuli. The waveforms had a duration of 1.3 s including a 150-ms down cosine ramp. The note onset occurred after 0.1 s, reaching a maximum loudness of about 18 sone. With 7 stimuli, 21 piano pairs can be formed. For each of these 21 piano pairs, 6 thresholds were simulated, using 3 times one of the pianos as target with the other piano as reference and vice versa.
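The pair count can be verified by enumerating the unordered combinations of the seven pianos. The labels P1 to P7 and the two-digit pair labels follow the naming used in this paper. The variable names are illustrative.

```python
from itertools import combinations

pianos = [f"P{i}" for i in range(1, 8)]         # P1 ... P7
pairs = list(combinations(pianos, 2))           # unordered piano pairs
pair_labels = [a[1] + b[1] for a, b in pairs]   # e.g., ("P1", "P3") -> "13"

thresholds_per_pair = 6                         # 3 per target/reference assignment
total_simulated_thresholds = len(pairs) * thresholds_per_pair
```

This gives 21 pairs and, with 6 thresholds per pair, 126 simulated thresholds for the whole dataset.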
III.2.2 Piano-weighted noises
As in the experimental sessions, background noises that are obtained using a modified ICRA noise algorithm (ICRA version A, Osses et al., 2019a) were used in the simulations. The resulting ICRA noises approximately follow the spectro-temporal properties of the piano sounds, but they have a gradual spectral tilt towards high frequencies. This spectral mismatch of up to 10 dB was corrected in version B of the algorithm (Osses et al., 2019a, as used for their Dataset 2). This is illustrated in Fig. 5. In this paper we first compared the experimental data with the simulations using ICRA noises version A. We then ran additional simulations using version B of the algorithm, where no experimental data were collected, to gain insights into how frequency cues with similar temporal characteristics may have been integrated by the participants of the reference experiment. A detailed description of both ICRA noise algorithms can be found in our experimental paper Osses et al. (2019a).
For the comparison between two pianos, e.g., pianos P1 and P3 (or P3 and P1) individual noises that followed the spectro-temporal properties of each piano (N1 and N3) were combined to generate a paired noise (N13).
III.3 Reference data: Experimental discrimination thresholds
Experimental thresholds thresexp using the stimuli and procedures described in this section had been previously collected from 20 participants (Osses et al., 2019a, dataset 1). From a total of 210 discrimination thresholds thresexp, 179 values were used (31 thresholds had been excluded after a data consistency check) to obtain the median thresholds that are indicated as red triangles in Fig. 6, which are shown together with their corresponding interquartile ranges (IQRs). The experimental thresholds range between a maximum thresexp,max (pair 23) and a minimum thresexp,min (pair 26), and their difference defines the dynamic range DRexp = thresexp,max - thresexp,min.
III.4.1 Exploratory simulations: Subset of piano sounds
At first, a subset of 9 (of the 21) available piano pairs was used for the simulations. These 9 pairs were chosen to be a representative sample of the range of experimental similarity between pianos, i.e., of the SNRs of thresexp (red triangles in Fig. 6). The selected piano pairs were: pair 12, 15, 16, 23, 26, 27, 37, 45, and 47.888Piano pairs 23 and 47 were taken from the most similar end (high SNRs at the threshold) of the similarity axis (abscissa of Fig. 6). Pairs 26, 27, and 37 were taken from the least similar end of the axis. The remaining pairs 12, 15, 16, and 45 were taken from the intermediate similarity range. This reduced set of sounds was used for (1) developing our template approach (Sec. II.1.6), and for (2) testing the duration of the “observation (listening) period” of the template. This latter aspect is a consequence of the lack of success (see the last column of Table 3) to simulate the discrimination thresholds when using whole-duration piano waveforms as inputs to the model. The low thresholds in that condition were attributed to a sensitive artificial listener, who had access to more information than human listeners. As a way to remove available cues within the auditory model, the piano sounds were truncated to shorter durations. This is equivalent to reducing the observation period obs of the artificial listener and can be seen as a simple way to account for a limited human-like working memory999In the artificial listener’s decision, a representation is correlated with a time-aligned template . Any slight temporal misalignment between the two would reduce the correlation value and make the artificial observer less sensitive. One could argue that the human memory is not capable of preserving such a detailed template with a duration of 1.3 s. and that latter parts of the stored template have an increasing temporal jitter, reducing their contribution to the discrimination process. 
This form of information reduction is in line with the memory-decay approach adopted by Wallaert et al. (2017), who introduced a memory noise “Emem” that increased with the length of the internal representations. Our chosen approach of a shortened observation window should be seen as the simplest implementation of this concept (see also Osses & Kohlrausch, 2018).
Under the hypothesis that listeners provided a greater weighting to the piano note onset, a truncation of the piano waveforms should give a higher correlation between the simulated and experimental results. As will be shown in Sec. IV, our simulation results provide evidence to support this hypothesis.
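The truncation described in this section amounts to keeping only the first samples of each waveform. A minimal Python sketch is given below; the function name, the sampling rate of 44100 Hz, and the use of plain lists are assumptions made for illustration and do not reproduce the implementation used in the study.

```python
def truncate_to_observation_period(samples, fs, t_obs):
    """Keep only the first t_obs seconds of a waveform sampled at fs Hz."""
    n_keep = min(len(samples), int(round(t_obs * fs)))
    return samples[:n_keep]


# Hypothetical example: a 1.3-s waveform (silence here) sampled at 44100 Hz.
fs = 44100
waveform = [0.0] * int(round(1.3 * fs))
onset_only = truncate_to_observation_period(waveform, fs, 0.25)
```

In such a scheme the same truncation would be applied to the stored template and to the representation being compared, so that both cover the same observation period.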
III.4.2 Simulations using the whole dataset of piano sounds
The simulation of discrimination thresholds thressim for the whole dataset of piano sounds (21 piano pairs) was run using the optimal observation period obtained from the exploratory simulations and the adopted template approach. These thressim values were used to evaluate the performance of the artificial listener with respect to the reference thresholds thresexp (Section III.3).
|“Observation (listening) period” (s)||0.20||0.25||0.3||0.5||0.7||0.9||1.3|
|Pearson||0.66 (a)||0.71 (b)||0.65 (c)||0.34||0.45||0.25||-0.21|
|Spearman||0.60 (c)||0.78 (b)||0.47||0.11||0.49||0.21||0.09|
Notes: (a) The 150-Hz LPF was omitted for this analysis. (b) Significant correlation. (c) Correlation that approaches significance.
III.4.3 Extra simulations using ICRA noises version B
Simulations using ICRA noises version B were run for the whole dataset of pianos using the optimal observation duration. This choice allows us to quantify the perceptual difference between ICRA noise versions, illustrated in Fig. 5B, that we hypothesized to be small based on our previous comparison between non-reverberant (using version A) and reverberant piano sounds (using version B) Osses et al. (2019a).
IV.1 Exploratory simulations: Subset of piano sounds
The simulation results for the selection of 9 piano pairs are shown in Table 3. In the table, information about the minimum (lowest median) and maximum (highest median) estimated thresholds is shown and their difference is indicated as a dynamic range (DR) in dB. The simulations with an observation period of 1.3 s delivered thresholds between thressim,max dB and thressim,min dB with a DR of dB. Such low thresholds with respect to the thresexp values indicate that, with an observation period of 1.3 s, the artificial listener had access to more information than the actual participants. As a way to remove available cues within the model, the observation period of the artificial listener was limited to shorter durations (0.20, 0.25, 0.3, 0.5, 0.7, and 0.9 s), omitting the latter tails of the representations from the CCV assessment (Eq. 3). The results for observation periods of 0.9 and 1.3 s had a constant DR of 5 dB, and for shorter durations, thressim,max increased, reaching a maximum DR of 20.5 dB for 0.25 s. For the shortest tested duration of 0.20 s the DR decreased by 6 dB. The interpretation of these results is that with an observation period of 0.25 s the piano sounds were judged by the artificial listener as most distinct. Given that this duration also had the best fit with the experimental data (Pearson correlation of 0.71, p=0.03; Spearman correlation of 0.78, p=0.02, for N=9), an observation period of 0.25 s was further used to simulate the discrimination thresholds of the remaining 12 piano pairs.
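The two correlation measures reported here can be sketched with the Python standard library alone. The sketch below computes a Pearson coefficient and a Spearman (rank-order) coefficient with average ranks for ties; the tie handling and the example values are assumptions for illustration, and no significance test is included.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical thresholds (dB): monotonic but non-linear relation.
thres_exp_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
thres_sim_demo = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(thres_exp_demo, thres_sim_demo)
r_spearman = spearman(thres_exp_demo, thres_sim_demo)
```

With these hypothetical values the rank-order correlation is perfect while the Pearson coefficient is slightly lower, which illustrates why both measures are reported side by side.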
IV.2 Simulations using the whole dataset of piano sounds
The discrimination thresholds using the whole dataset of piano sounds (21 piano pairs) were simulated using the first 0.25 s of the waveforms, based on the results of the exploratory simulations. The median thresholds thressim are indicated as magenta circle markers in Fig. 6. The thresholds are shown together with their interquartile ranges (IQRs). The simulations at this duration (0.25 s) were not only highly correlated with the experimental data but they also had a comparable DR= dB (same DR as in the exploratory analysis). The thressim values range between thressim,max dB (pair 47) and thressim,min dB (pair 16). The discrimination thresholds thressim and thresexp were significantly correlated with a Spearman (rank-order) correlation =0.63, 0.001, =21, and a Pearson correlation =0.54, =0.02, =19. (Footnote 10: For the assessment of the Pearson correlation, the data of two piano pairs (pairs 23 and 47) had to be excluded to meet the assumption of data normality. The two excluded pairs both had SNR thresholds above 12 dB.)
IV.3 Extra simulations using ICRA noises version B
The discrimination thresholds thressim,B using the whole dataset of piano sounds were simulated using =0.25 s and ICRA noises version B, following the same simulation procedure as previously described. The median thresholds thressim,B are indicated as blue diamond markers in Fig. 7. For ease of comparison, the thresholds obtained using noises version A (thressim) were replotted from Fig. 6. All thresholds are shown together with the corresponding IQRs derived from 6 simulation runs. The thressim,B values ranged between thressim,B,max dB (pair 47) and thressim,B,min dB (pair 36). The thresholds thressim and thressim,B were significantly correlated with a Spearman correlation =0.62, =0.003, =21, and a Pearson correlation =0.52, =0.02, =19.
V Data analysis and discussion
The simulated thresholds thressim of the instrument-in-noise test were significantly correlated with the experimental thresholds thresexp when only the initial part of the waveforms was used. Two aspects that affected the internal representation of the sounds leading to the obtained thressim values are addressed in this section: (1) the weighting of information in each (audio and modulation) frequency band, and (2) the concept of “optimal detector” used in the central processor stage and how its performance was affected by shortening the duration of the observation period and by the sources of variability in the model (internal) and in the stimuli (external). Finally, the effect of using ICRA noises A and B on the simulated discrimination thresholds is discussed in terms of threshold shifts for the whole set of piano sounds.
V.1 Information-based analysis of internal representations
The weighting of information for each audio and modulation frequency band within the PEMO model is primarily introduced by the dot product between internal representations and templates (Eq. 3). Following a similar approach to that used to analyze the representation of piano P1 (Fig. 4), the contribution of each audio and modulation frequency band can be assessed using Eqs. 6 and 7. The following conditions were considered here: (1) when the adaptation loops are limited using a factor lim=5 (as suggested in this study) and with lim=10, and (2) considering the total duration of the piano-plus-noise sounds (1.3 s) and when only the first 0.25 s are evaluated. In this analysis, all 21 piano pairs were included. Since our interest is in the weighting of information at threshold, the difference representation was assessed at the ICRA noise level indicated by thressim.
The information-weighted values for the comparison between limiter factors are shown in Fig. 8. These values were obtained as the median of 42 values (two values for each of the 21 pairs). The error bars indicate IQRs. Fig. 8A shows that with a stronger limiter factor (here lim=5) the information in higher audio frequency bands receives a higher relative weighting. For lim=10, the weighting seems to be very similar to the distribution of information for the piano-alone representation shown in Fig. 4A.
The information contribution of each modulation band is shown in Fig. 8B. The second modulation band (mf= Hz) had a weighting of 18.6% for the representations with lim=5, which is 2.6 percentage points below the weighting for the same band when lim=10 was used (weighting of 21.2%). The first modulation band had a low weighting despite its high value in the piano-alone representation (Fig. 4B). This result indicates that the slow envelope changes tracked in this modulation band did not differ considerably from piano to piano. Bands 6 to 9 showed a slight increase in their weighting for lim=5 (compared to lim=10), while the rest of the bands had a similar weighting with both limiter factors.
The band weighting values for the comparison between signal durations (0.25 s and 1.3 s) were very similar (mean difference of 0.00%, IQR of 0.32%, not shown here) and, therefore, they seemed to be unaffected by the duration of the piano sounds. To explain the influence of the observation period on the simulated thresholds, the performance of the artificial listener was analyzed in terms of other factors, as detailed next.
V.2 Factors that affected the simulated performance
V.2.1 Reducing the performance of the optimal detector
The central processor of the PEMO model is inspired by the concept of an optimal detector. In signal detection theory, the term optimal refers to the fact that the detector has the best possible performance given specific stimulus properties. In other words, if a cue is available in the stimulus, then the detector uses it (Green & Swets, 1966, Ch. 6). For this reason, detectors that are optimal can be used as baselines for human detection. The results of our exploratory simulations showed that the participants’ performance in the instrument-in-noise experiment is below the ideal performance, where simulated thresholds for whole-duration sounds covered a range of only dB (Table 3).
One way to bring the simulated thresholds to a range closer to that of the experimental data is the removal of “evidence” from the stimuli, which is assumed to be cumulative during the observation period of the artificial listener. Simulations with shorter durations resulted in thresholds with a higher dynamic range (DR = thresmax − thresmin), increasing from 5 dB for a duration of 1.3 s to 20.5 dB for 0.25 s. To evaluate this DR increase, an analysis based on CCV values is presented using the durations of 0.25 s and 1.3 s.
The CCV values expressed in model units (MU) for the subset of 9 piano pairs are shown in Fig. 9 at a noise level given by their corresponding thressim value, with filled and open markers indicating CCV values using durations of 0.25 s or 1.3 s, respectively. For this analysis no level roving was applied, meaning that at the exact CCV values of the figure the artificial listener would obtain discrimination scores of 70.7% or slightly above. (Footnote 11: This is due to the overall lower thresholds (i.e., better discriminability) when removing the level roving, as can be seen in no-rove thresholds of Fig. 10.) In general, at these noise levels only one of the two decision criteria of Eq. 4 failed. The criterion that failed first was labeled as “leading criterion.” The CCV values of the leading criterion for target and reference intervals are shown in Fig. 9A and Fig. 9B, respectively, and their difference CCV is shown in Fig. 9C. The CCV values ranged between MU (pair 16) and MU (pair 47) for representations with =0.25 s and between MU (pair 16) and MU (pair 47) for representations with =1.30 s. These CCV values indicate that the discriminability between pianos either remained approximately unchanged (pair 16) or improved with (pairs 12, 15, 23, 26, 27, 37, 45, and 47) and that the use of shorter internal representations compressed the CCV0.25 s values without changing significantly the relative discriminability between pianos, having a rank-order correlation of =, =, = with respect to the CCV1.30 s values. The differences CCV0.25 s of MU, the difference CCV values were also normally distributed, but with a standard deviation of MU. Eight of the 9 difference CCV0.25 s values in Fig. 9C (20 of 21 if the whole dataset were used) lie in the variability range of the internal noise ( MU). This means that the internal noise played a prominent role in the discrimination performance of the artificial listener.
For representations with =1.3 s a much larger variance of the internal noise would be needed for reaching thresholds in a similar SNR range. Although it was possible to introduce a higher variability to the internal representations, this would have strongly limited the performance of the PEMO model, reducing its predictive power for other already validated auditory tasks (e.g., Dau et al., 1997; Osses, 2018, his App. D), violating the so-called backward compatibility of the PEMO model.
V.3 Effect of the sources of variability
In order to quantify the influence of the sources of variability on the obtained thresholds thressim, simulations for the subset of 9 piano pairs using =0.25 s were run in the following conditions: (1) No level roving (no–rove condition), i.e., using only the internal noise variability and running noises, and (2) No internal noise (no–int condition), i.e., using only sources of external variability (level rove and running noise). The resulting median thresholds (of 6 estimates) with their IQRs are indicated by the blue squares and the green triangles in Fig. 10, respectively. The simulated thresholds using both sources of variability (as shown in Fig. 6) are indicated by the magenta circle markers (ext+int condition) and were used as a baseline for this analysis. The simulated thresholds in the no–rove condition followed the same trend as the ext+int-thresholds (correlation of =0.77, =0.02, =9) and differed by 3.5 dB (pair 23) or less. This was not the case for the thresholds in the no–int condition, that were much lower than the ext+int-thresholds and were not significantly correlated (=, =, =). This means that the limit in performance introduced by the sources of external variability of the instrument-in-noise task was not sufficient to explain the performance of the artificial listener. This analysis provides further evidence of the dominant role played by the internal noise in the decision of the artificial listener for representations with =0.25 s.
V.4 Simulated thresholds for different ICRA noise versions
The simulated thresholds thressim,B of the instrument-in-noise method using ICRA noises version B (Fig. 7) were significantly correlated with the thressim values obtained using noises version A (Spearman correlation of 0.62 and Pearson correlation of 0.52, see Sec. IV.3). The difference between simulated thresholds (SNR = thressim − thressim,B) is shown in Fig. 11. The median difference across all piano pairs was dB (horizontal gray dashed-dotted line in the figure) with an IQR between and dB. This means that on average, noises version A produced higher discrimination thresholds (thressim > thressim,B).
To investigate how the two ICRA noise types influenced the threshold estimations, we classified the piano pairs into three groups, defined by the shadowed area in Fig.11: (1) Pairs with SNR values that were equal to or greater than percentile 75 (SNR dB): pairs 17, 23, 27, 47, 57, 67; (2) Pairs with differences within the IQR: pairs 12, 15, 24, 25, 26, 35, 36, 37, 45, 56; and (3) Pairs with differences that were equal to or lower than percentile 25 (SNR dB): pairs 13, 14, 16, 34, and 46. One piano pair of each group was further analyzed: pair 47 (SNR dB), pair 26 (SNR dB), and pair 14 (SNR dB).
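The three-group classification described above can be sketched as follows. The threshold differences used here are hypothetical (they are not the values of Fig. 11), and the quartile convention is an assumption, so group membership near the quartile boundaries may differ from that of the study.

```python
from statistics import quantiles


def group_pairs(delta_snr):
    """Split piano pairs into three groups by the quartiles of their
    threshold difference (thressim minus thressim,B, in dB).

    Group 1: difference >= percentile 75
    Group 2: difference strictly inside the IQR
    Group 3: difference <= percentile 25
    """
    values = list(delta_snr.values())
    p25, _, p75 = quantiles(values, n=4, method="inclusive")
    groups = {1: [], 2: [], 3: []}
    for pair, diff in delta_snr.items():
        if diff >= p75:
            groups[1].append(pair)
        elif diff <= p25:
            groups[3].append(pair)
        else:
            groups[2].append(pair)
    return groups


# Hypothetical differences in dB for five made-up pairs (illustration only).
demo = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0, "e": 4.0}
groups = group_pairs(demo)
```

For the hypothetical input the quartiles are 1.0 and 3.0 dB, so pairs "d" and "e" fall in Group 1, pair "c" in Group 2, and pairs "a" and "b" in Group 3.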
The spectra of the noises related to the selected pairs (N47, N26, N14), expressed as band levels (BL) within 1 ERB, are shown in Fig. 12. This figure shows that the noise spectra (ICRA version A and B) related to paired noise N47 (Group 1) have similar BLs at around (Fig. 12A), and that for noise N14 (Group 3) this is the case at around (Fig. 12C). The spectra related to noise N26 (Group 2) have similar BLs between and (between 13.4 and 21.4 ERB in Fig. 12B). We may therefore infer that the frequency region where noises A and B produce a similar masking, i.e., when BL0 (see the open markers in Fig. 12D), is related to the most important (audio) frequency range during the simulations and that a similar frequency weighting would have been used by the participants during the experimental sessions. This analysis is based on the fact that:
Both ICRA noise versions have the same statistical temporal properties (Fig. 5A) and differ only in their weighting towards high auditory bands in version A ( dB in the highest band with respect to the -centered band, Fig. 5B). This means that the expected masking difference when using ICRA noises A and B should be at most dB, which seemed to be the case (maximum difference of dB for pair 47, Fig. 11).
The masking efficiency of both ICRA noises has been experimentally validated for “Dataset 1” (same stimuli as used in this study) and Dataset 2 (reverberant piano sounds) for version A and version B noises, respectively Osses et al. (2019a). Moreover, this paper provides simulations using noises version A, whereas simulations using noises version B were published by Osses & Kohlrausch (2018) using the PEMO model with the exact same configuration as presented in this paper. The correlation between experimental data and simulations were similar (Dataset 1 with simulations shown in Sec. IV.2: =0.63, 0.001, =21; Dataset 2 by Osses & Kohlrausch 2018: =0.61, 0.001, =21) suggesting that the PEMO model performance is the same with either ICRA noise version and, hence, the model is able to follow overall changes in measured psychoacoustic performance.
As for further evaluations using ICRA noises (Sec. III.2.2), both noise versions A and B were able to efficiently mask the spectro-temporal properties of the test piano sounds. Yet, if the noise efficiency is to be evaluated in terms of the amount of noise needed to mask the properties of the piano sounds, then noises version A perform better because for the same overall (broad-band) noise level the discrimination thresholds have on average higher SNRs (lower noise level) compared to noises version B: SNR dB (Fig. 11). This “better performance” is, however, at the expense of a gradual level mismatch towards higher frequencies of the noises with respect to the sounds to be masked. Therefore, if the efficiency of the noises is to be evaluated in terms of how well the spectro-temporal properties of the noise follow the properties of the sounds to be masked, then noises version B perform better.
In this study a long-tradition model of the auditory periphery, the perception model (PEMO) Dau et al. (1997), was used to simulate the perceptual similarity between recorded sounds of one note (C#) played on 7 different pianos. Each stage of the model was described indicating the set of configuration parameters we used. The use of sounds with strong onset characteristics required us to include an in-depth analysis of the auditory adaptation stage, the adaptation loops, whose properties are often described at a high level and the details of their implementation and properties are scarce. In this paper we showed that the overshoot limitation factor is not directly related to the ratio between onset and steady-state responses of the adaptation loops as typically claimed in the literature (Appendix B), and that the use of a limiter factor of 5 (instead of 10) produces an onset to steady-state ratio that is closer to the physiological observations by Westerman & Smith (1984) that Münkner (1993) used as reference to implement the overshoot limitation of the adaptation loops.
The simulated task for the piano comparisons was a 3-AFC discriminability task conducted in a background noise, whose experimental (reference) data were recently reported by Osses et al. (2019a). To simulate how similar two piano sounds were, the back-end stage of the model required the use of two memory templates. The simulated discriminability thresholds thressim, expressed as signal-to-noise ratios, were significantly correlated with the reference thresholds thresexp, but this required a considerable reduction of the information accessible to the artificial observer. This information reduction was achieved by limiting the observation period of the artificial observer to the sound onsets, with an optimal duration of 0.25 s. Such an approach can be interpreted as an attentional trigger that assumes non-useful information in the piano internal representations beyond the observation period. This approach is comparable to the (more elaborate) memory-noise concept employed by Wallaert et al. (2017). Further analyses were presented to quantify the influence of an additive internal noise, which dominated the model performance for the obtained observation period, and how other sources of external variability (level roving and running noises), to a lesser extent, limited the model performance in the instrument-in-noise task.
Finally, an information-based analysis at simulated discriminability thresholds using an alternative noise (ICRA version B), with similar temporal but slightly different spectral properties, was used to show how the piano spectra were likely weighted during the (reference) experimental sessions. With this analysis we identified that the spectral information between the fundamental frequency of the note =554 Hz and the partial at was most relevant to resolve the piano comparisons.
The results presented in this paper support the idea that the unified framework offered by the PEMO model can be used to evaluate perceptual tasks using complex sounds. This can be seen as an extension of the use of this type of models and their success relies on the adjustment of the central processor stage included within the model, in combination with an appropriate representation of sources of internal noise.
Acknowledgements. This research work was funded by the European Commission (EC) within the ITN Marie Skłodowska-Curie Action project BATWOMAN under the Seventh Framework Programme (EC Grant No. 605867).
Appendix A PEMO model in the AMT toolbox
|Stage / Description||Model|
|Current||Dau (1997)||Verhey (1999)||Breebaart (2001)||Jepsen (2008)|
|AMT function (*_preproc.m)||dau1997||dau1997||dau1997||breebaart2001||jepsen2008|
|1-2 /||Outer, middle ear, cochlear filters||‘gtf_osses2020’||‘gtb_dau’ (a)||‘gtb_dau’ (a)||default (b)||default|
|Outer, middle ear||Yes||No||No||Yes||Yes|
|Cochlear filter bank type (c)||GTF||GTF||GTF||GTF||DRNL|
|3 /||Half-wave rectification||‘ihc_breebaart’||‘ihc_dau’ (a)||‘ihc_dau’ (a)||‘ihc_breebaart’ (a)||‘ihc_dau’ (a)|
|LPF, Hz (filter order)||770 (5)||1000 (1)||1000 (1)||770 (5)||1000 (1)|
|4 /||Adaptation loops||‘adt_osses2020’||‘adt_dau’ (a)||‘adt_dau’ (a)||‘adt_breebaart’ (a)||‘adt_dau’ (a)|
|Set of constants (d)||A||A||A||B||A|
|Limitation (factor)||Yes (5)||Yes (10)||Yes (10)||No ()||Yes (10)|
|5 /||Modulation filter bank||‘mfb_jepsen2008’||‘mfb_dau1997’ (a)||‘mfb_verhey1999’||default (b)||‘mfb_jepsen2008’ (a)|
|Number of modulation filters||12||12||12||1||12|
|Limited to filters with mf||Yes||No||Yes||No||Yes|
|Scaling factor ()||Yes||No||No||No||Yes|
Notes: (a) Default AMT flag for the corresponding preprocessing model (*_preproc.m). (b) No flag is used within AMT as this is the only possible stage configuration for the corresponding preprocessing model. (c) Cochlear filter bank types: GTF = Gammatone (linear) filter bank; DRNL = Dual-resonance non-linear filter bank. (d) Set A of time constants: =5, 50, 129, 253, 500 ms; Set B, linearly spaced between 5 and 500 ms: =5, 128.75, 252.5, 376.25, 500 ms.
The implementation of peripheral stages of the PEMO model (Stages 1-5, Fig. 1) as used in this study is available within the AMT toolbox (v0.10) for MATLAB Søndergaard & Majdak (2013). This implementation required the attachment of the outer- and middle-ear processing (Stage 1) and the extension of configuration parameters within the dau1997_preproc routine. The model parameters and its variants are summarized in Table 4.
Appendix B Adaptation loops
The adaptation loops stage is a digital feedback structure included in the PEMO model and in other variants of this model (see Sec. II). The adaptation loops stage simulates the adaptive properties of the hearing system Westerman & Smith (1984); Kohlrausch et al. (1992). The most relevant properties of the adaptation loops are highlighted in this appendix, which were partly collected from the original implementation descriptions Püschel (1988); Münkner (1993); Dau et al. (1996a) and were further explored in our study to justify the use of a more severe limitation of the onset overshoot (limiter factor lim=5) with respect to the literature (no limiter factor or lim=10).
B.1 Adaptation and use of the RC analogy
The adaptation loops receive signals after the cochlear band-pass filtering and subsequent inner hair cell processing (after Stage 3 in the PEMO model, Fig. 1). The -th adaptation loop (with from 1 to 5) is implemented as a first-order IIR filter (time constants =5, 50, 129, 253, 500 ms) that corresponds to a resistor-capacitor (RC) circuit (Fig. 13). This digital structure acts as a low-pass filter between node in and output . Output represents the charging state of the filter and ranges between an initial charging state and 1, with shorter or longer time constants producing a faster or slower charge/discharge of the circuit, respectively. Furthermore, an uncharged RC circuit amplifies the incoming signal and a fully charged circuit does not alter the amplitude of the incoming signal. This results in a nearly logarithmic transformation of stationary input signals (see Fig. 15B, and also, Dau et al., 1996a, their Fig. 3). The initial charging state is defined by the minimum input level lvlmin to the first adaptation loop, which is set to an amplitude of (0 dB SPL for a full scale convention of 100 dB):
where =1, is one of the coefficients of the difference equation that characterizes the IIR filter between in and . The full difference equation is given by:
where =, =, and =. The output , after the five adaptation loops, is finally scaled in a way that outputs with an amplitude of (100 dB steady inputs) are mapped to 100 MU and outputs with an amplitude of = (Eq. 9) are mapped to 0 MU:
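The chain of five adaptation loops without overshoot limitation can be sketched in Python as shown below. Several details are assumptions and not taken from the equations of this appendix: the filter coefficients follow a common first-order RC discretization (a1 = exp(−1/(τ·fs)), b0 = 1 − a1), the minimum amplitude is taken as 10^−5 (0 dB SPL for a full scale of 100 dB), and the initial state of the k-th loop is taken as that amplitude raised to the power 1/2^k, which is consistent with the values 0.0032 and 0.6978 quoted in Sec. B.2. The sampling rate of 1000 Hz in the usage lines is arbitrary.

```python
from math import exp

TAUS = (0.005, 0.050, 0.129, 0.253, 0.500)  # time constants in s (set A)
LVL_MIN = 1e-5  # assumed minimum amplitude (0 dB SPL re 100 dB full scale)


def adaptation_loops(x, fs, taus=TAUS, lvl_min=LVL_MIN):
    """Five divisive feedback loops without limiter; output in model units."""
    # Assumed initial charge of loop k: steady state for a constant lvl_min input.
    states = [lvl_min ** (1.0 / 2 ** (k + 1)) for k in range(len(taus))]
    a1 = [exp(-1.0 / (tau * fs)) for tau in taus]  # assumed RC discretization
    b0 = [1.0 - a for a in a1]
    out_min = states[-1]             # steady output for lvl_min, mapped to 0 MU
    scale = 100.0 / (1.0 - out_min)  # steady output of 1 is mapped to 100 MU
    y = []
    for sample in x:
        v = max(sample, lvl_min)     # inputs are floored at the minimum level
        for k in range(len(taus)):
            v = v / states[k]                          # divide by charging state
            states[k] = a1[k] * states[k] + b0[k] * v  # low-pass the loop output
        y.append((v - out_min) * scale)
    return y


fs_demo = 1000  # Hz, arbitrary choice for this illustration
y_floor = adaptation_loops([LVL_MIN] * 100, fs_demo)  # input at minimum level
y_full = adaptation_loops([1.0] * 8000, fs_demo)      # 8 s at full scale
```

In this sketch a constant input at the minimum level stays at 0 MU, whereas a full-scale step produces a very large onset response that settles towards 100 MU only after several seconds, which illustrates the motivation for an overshoot limitation.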
B.2 Overshoot limitation
The strong onset response obtained with the adaptation loops structure of Fig. 13, “no-limit path,” motivated the introduction of a limiter at the output of each individual loop Münkner (1993). The so-called overshoot limitation is implemented as a compressor with a ratio that follows a logistic growth described by:
This equation implements a compression to the input in, obtaining an output (Fig. 13, signal path “2”). The compressor has a threshold of (non-normalized amplitude) and a limiter threshold threslim, that depends on an arbitrary limiter factor lim and on the initial charge of each adaptation loop. For convenience, a constant (to be used in Eq. 11) is also defined:
For a limiter factor lim=10, as suggested by Münkner, the limiter threslim,1 is equal to 10 (=9, =0.0032), hence producing outputs of loop 1 of at most 10 times the input amplitude in. Due to the subsequent compression in the remaining 4 loops, however, the maximum possible output amplitude threslim,5 is 5.1 (=4.1, =0.6978). For a limiter factor lim=5, as suggested in the current study, threslim,1 is equal to 5 (=4, =0.0032), resulting in a maximum possible output threslim,5 of 2.6 (with =1.6, =0.6978). The effect of the adaptation loops on normalized outputs (Eq. 10) is illustrated in Figs. 14 and 15.
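The numerical values quoted above can be reproduced with a short Python sketch. The relation used here, threslim,k = (1 − a_k²)·lim with initial charge a_k = (10^−5)^(1/2^k), is an assumption inferred from the quoted numbers and is not copied from the equations of this appendix. The logistic compressor is likewise only one possible form that satisfies the two stated properties (unity threshold and saturation at threslim).

```python
from math import exp

LVL_MIN = 1e-5  # assumed minimum amplitude (0 dB SPL re 100 dB full scale)


def initial_state(k, lvl_min=LVL_MIN):
    """Assumed initial charge of adaptation loop k (k = 1..5)."""
    return lvl_min ** (1.0 / 2 ** k)


def limiter_threshold(k, lim):
    """Assumed limiter threshold of loop k for a limiter factor lim."""
    return (1.0 - initial_state(k) ** 2) * lim


def limit_overshoot(v, thres_lim):
    """One logistic compressor: identity up to 1, saturating at thres_lim."""
    if v <= 1.0:
        return v
    c = thres_lim - 1.0  # constant defined as the threshold minus one
    return 2.0 * c / (1.0 + exp(-(v - 1.0) / c)) - (c - 1.0)


for lim in (10, 5):
    for k in (1, 5):
        thres = limiter_threshold(k, lim)
        print(f"lim={lim}, loop {k}: a={initial_state(k):.4f}, "
              f"thres={thres:.1f}, c={thres - 1.0:.1f}")
```

Under these assumptions the sketch yields thresholds of 10 and 5.1 (lim=10) and of 5 and 2.6 (lim=5) for loops 1 and 5, in agreement with the values given in the text.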
B.3 Input-output characteristic
The effect of the adaptation loops on normalized outputs is shown for pure tones with a carrier frequency of 4000 Hz (duration of 300 ms, 2.5-ms up/down ramps) using no limiter factor (equivalent to lim), lim=10, and lim=5. This stimulus choice is similar to the sounds employed by Westerman & Smith (1984, see their Fig. 6) in recordings of auditory nerve responses of the Mongolian gerbil, which served as a reference for the development of the overshoot limitation scheme Münkner (1993). The stimuli shown here have, however, levels between 10 and 100 dB SPL (Westerman & Smith used stimuli with levels up to 40 dB).
We first show in Fig. 14A–C the effect of the different limiter factors for a pure tone of 70 dB for lim (no factor), lim=10, and lim=5, respectively. Maximum onset (onsetmax) and average steady-state (steadyavg) amplitudes were calculated, with the latter being obtained from the average of the last 20 ms of the amplitudes before the signal offset. The onset amplitude onsetmax in (A) was 5401 MU, which is 91.5 times the steady-state response steadyavg of 59 MU. Such overshoot supports the need for the overshoot limitation that produced an onsetmax of 1432 MU (22.4 times the steadyavg value of 64 MU), and 614 MU (9.3 times the steadyavg value of 66 MU), for (B) lim=10 and (C) lim=5, respectively. This figure shows that lim=10 does not produce an overshoot limitation of 10 times the steady-state response as claimed in the literature. That would only be the case if a single adaptation loop were used (Eq. 12).
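The onset and steady-state measures defined above can be sketched in Python as follows. The function name, its arguments, and the synthetic response are hypothetical; only the three amplitude pairs in the last lines are taken from the text.

```python
def onset_and_steady_state(response_mu, fs, offset_index, steady_ms=20.0):
    """Return (onset_max, steady_avg, ratio) for a response in model units.

    onset_max:  maximum amplitude before the signal offset
    steady_avg: mean amplitude over the last steady_ms before the offset
    """
    n_steady = int(round(steady_ms * 1e-3 * fs))
    onset_max = max(response_mu[:offset_index])
    segment = response_mu[offset_index - n_steady:offset_index]
    steady_avg = sum(segment) / len(segment)
    return onset_max, steady_avg, onset_max / steady_avg


# Synthetic 300-ms response at 1000 Hz: one onset peak, then a plateau.
demo_response = [600.0] + [66.0] * 299 + [0.0] * 50
onset_max, steady_avg, ratio = onset_and_steady_state(demo_response, 1000, 300)

# Ratios for the amplitudes quoted in the text (no limit, lim=10, lim=5).
quoted = [(5401, 59), (1432, 64), (614, 66)]
quoted_ratios = [round(onset / steady, 1) for onset, steady in quoted]
```

The last two lines simply recompute the ratios of 91.5, 22.4, and 9.3 from the onset and steady-state amplitudes reported for Fig. 14A–C.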
The input-output characteristic functions for the onset and steady-state responses to the 4000-Hz test tones are shown in Fig. 15A and B, respectively. The tones were presented at levels between 10 and 100 dB SPL in steps of 10 dB for the three limitation configurations. The onset to steady-state ratio is shown in Fig. 15C.
In Fig. 15A, the onsets of the non-limited responses continue to grow outside the range indicated in the figure, while for the responses with lim=10 (gray squares) the onsets are (1) almost unaffected for input levels up to 20 dB, (2) compressed for levels between 20 and 50 dB, and (3) limited to approximately 1442 MU, with a minimum ratio equal to 14.2 (Fig. 15C). For responses with lim=5, the compressing range extends from 20 to 35 dB, with higher input levels being limited to 614 MU (minimum ratio of 5.8).
The steady-state responses in Fig. 15B show that they do not change considerably for the different adaptation loops configurations. In addition, the error bars shown in the figure (for clarity only for the responses with lim=5) indicate maximum and minimum amplitudes over the 20-ms integration period. This indicates that not all phase information was removed by the fifth-order 770-Hz LPF (IHC processing), i.e., there is still some temporal fine structure present in these responses.
B.4 Implications of the onset limitation in the current study
The ratios between onset and steady-state responses shown in Fig. 15C are overall higher than the target ratio of 10 observed by Westerman & Smith (1984). Adopting a factor lim=5 produced ratios that were closer to that target. The piano sounds investigated in this paper have prominent onset characteristics, which required a stronger overshoot limitation (lim=5); without it, the simulations would have been unsuccessful. This choice did not considerably affect other psychoacoustic tasks (some of them taken from Jepsen et al., 2008) that we simulated to validate the custom PEMO model configuration we used (App. D of Osses, 2018, not shown in this paper).
- Agus et al. (2012) Agus, T., Suied, C., Thorpe, S. & Pressnitzer, D. (2012). “Fast recognition of musical sounds based on timbre,” J. Acoust. Soc. Am. 131, 4124–4133.
- Bianchi et al. (2019) Bianchi, F., Carney, L., Dau, T. & Santurette, S. (2019). “Effects of musical training and hearing loss on fundamental frequency discrimination and temporal fine structure processing: Psychophysics and modeling,” J. Assoc. Res. Otolaryngol. 20, 263–277.
- Breebaart et al. (2001a) Breebaart, J., van de Par, S., & Kohlrausch, A. (2001a). “Binaural processing model based on contralateral inhibition. I. Model structure,” J. Acoust. Soc. Am. 110, 1074–1088.
- Breebaart et al. (2001b) Breebaart, J., van de Par, S., & Kohlrausch, A. (2001b). “Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters,” J. Acoust. Soc. Am. 110, 1089–1104.
- Breebaart et al. (2001c) Breebaart, J., van de Par, S., & Kohlrausch, A. (2001c). “Binaural processing model based on contralateral inhibition. III. Dependence on temporal parameters,” J. Acoust. Soc. Am. 110, 1105–1117.
- Chaigne et al. (2019) Chaigne, A., Osses, A., & Kohlrausch, A. (2019). “Similarity of piano tones: A psychoacoustical and sound analysis study,” Appl. Acoust. 149, 46–58.
- Dau et al. (1996a) Dau, T., Püschel, D., & Kohlrausch, A. (1996a). “A quantitative model of the “effective” signal processing in the auditory system. I. Model structure,” J. Acoust. Soc. Am. 99, 3615–3622.
- Dau et al. (1996b) Dau, T., Püschel, D., & Kohlrausch, A. (1996b). “A quantitative model of the “effective” signal processing in the auditory system. II. Simulations and measurements,” J. Acoust. Soc. Am. 99, 3623–3631.
- Dau et al. (1997) Dau, T., Kollmeier, B., & Kohlrausch, A. (1997). “Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers,” J. Acoust. Soc. Am. 102, 2892–2905.
- Derleth & Dau (2000) Derleth, R., & Dau, T. (2000). “On the role of envelope fluctuation processing in spectral masking,” J. Acoust. Soc. Am. 108, 285–296.
- van Dorp et al. (2013) van Dorp, J., de Vries, D., & Lindau, A. (2013). “Deriving content-specific measures of room acoustic perception using a binaural, nonlinear auditory model,” J. Acoust. Soc. Am. 133, 1572–1585.
- Dreschler et al. (2001) Dreschler, W., Verschuure, H., Ludvigsen, C., & Westermann, S. (2001). “ICRA noises: Artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment,” Int. J. Audiol. 40, 148–157.
- Ewert (2013) Ewert, S. (2013). “AFC–A modular framework for running psychoacoustic experiments and computational perception models,” In Proc. International Conference on Acoustics AIA-DAGA, pp. 1326–1329.
- Ewert & Dau (2000) Ewert, S., & Dau, T. (2000). “Characterizing frequency selectivity for envelope fluctuations,” J. Acoust. Soc. Am. 108, 1181–1196.
- Fritz et al. (2007) Fritz, C., Cross, I., Moore, B., & Woodhouse, J. (2007). “Perceptual thresholds for detecting modifications applied to the acoustical properties of a violin,” J. Acoust. Soc. Am. 122, 3640–3650.
- Glasberg & Moore (2002) Glasberg, B., & Moore, B. (2002). “A model of loudness applicable to time-varying sounds,” J. Audio Eng. Soc. 50, 331–342.
- Goode et al. (1994) Goode, R., Killion, M., Nakamura, K., & Nishihara, S. (1994). “New knowledge about the function of the human middle ear: Development of an improved analog model,” Am. J. Otol. 15, 145–154.
- Green & Swets (1966) Green, D., & Swets, J. (1966). “Signal detection theory and psychophysics,” (John Wiley & Sons Inc.)
- Hohmann (2002) Hohmann, V. (2002). “Frequency analysis and synthesis using a Gammatone filterbank,” Acust. Acta Acust. 88, 433–442.
- Huber & Kollmeier (2006) Huber, R., & Kollmeier, B. (2006). “PEMO-Q—A new method for objective audio quality assessment using a model of auditory perception,” IEEE Trans. Audio, Speech, Lang. Process. 14, 1902–1911.
- Jepsen et al. (2008) Jepsen, M., Ewert, S., & Dau, T. (2008). “A computational model of human auditory signal processing and perception,” J. Acoust. Soc. Am. 124, 422–438.
- Jørgensen & Dau (2011) Jørgensen, S., & Dau, T. (2011). “Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing,” J. Acoust. Soc. Am. 130, 1475–1487.
- von Klitzing & Kohlrausch (1994) von Klitzing, R., & Kohlrausch, A. (1994). “Effect of masker level on overshoot in running- and frozen-noise maskers,” J. Acoust. Soc. Am. 95, 2192–2201.
- Kohlrausch et al. (2013) Kohlrausch, A., Braasch, J., Kolossa, D., & Blauert, J. (2013). “An Introduction to Binaural Processing,” In The Technology of Binaural Hearing (Springer Berlin Heidelberg), pp. 1–32.
- Kohlrausch et al. (1992) Kohlrausch, A., Püschel, D., & Alphei, H. (1992). “Temporal resolution and modulation analysis in models of the auditory system,” In The auditory processing of speech, edited by M. Schouten (Mouton de Gruyter), pp. 85–98.
- Kohlrausch et al. (2000) Kohlrausch, A., Fassel, R., & Dau, T. (2000). “The influence of carrier level and frequency on modulation and beat-detection thresholds for sinusoidal carriers,” J. Acoust. Soc. Am. 108, 723–734.
- Langner & Schreiner (1988) Langner, G., & Schreiner, C. (1988). “Periodicity coding in the Inferior Colliculus of the cat. I. Neuronal mechanisms,” J. Neurophysiol. 60, 1799–1822.
- Lopez-Poveda & Meddis (2001) Lopez-Poveda, E. & Meddis, R. (2001). “A human nonlinear cochlear filterbank,” J. Acoust. Soc. Am. 110, 3107–3118.
- Maxwell et al. (2020) Maxwell, B., Richards, V. & Carney, L. (2020). “Neural fluctuation cues for simultaneous notched-noise masking and profile-analysis tasks: Insights from model midbrain responses,” J. Acoust. Soc. Am. 147, 3523–3537.
- Meddis & Hewitt (1991) Meddis, R., & Hewitt, M. (1991). “Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification,” J. Acoust. Soc. Am. 89, 2866–2882.
- Meddis & O’Mard (1997) Meddis, R. & O’Mard, L. (1997). “A unitary model of pitch perception,” J. Acoust. Soc. Am. 102, 1811–1820.
- McAdams & Bigand (1993) McAdams, S., & Bigand, E. (1993). Thinking in Sound: The Cognitive Psychology of Human Audition (Oxford University Press).
- Münkner (1993) Münkner, S. (1993). “Modellentwicklung und Messungen zur Wahrnehmung nichtstationärer akustischer Signale,” Ph.D. thesis, University of Göttingen.
- Nelson & Carney (2004) Nelson, P., & Carney, L. (2004). “A phenomenological model of peripheral and central neural responses to amplitude-modulated tones,” J. Acoust. Soc. Am. 116, 2173–2186.
- Osses et al. (2016) Osses Vecchi, A., Chaigne, A., & Kohlrausch, A. (2016). “Assessing the acoustic similarity of different pianos using an instrument-in-noise test,” Proc. International Symposium on Musical and Room Acoustics, pp. 1–10. La Plata, Argentina.
- Osses et al. (2017a) Osses Vecchi, A., Kohlrausch, A., Lachenmayr, W., & Mommertz, E. (2017a). “Predicting the perceived reverberation in different room acoustic environments using a binaural model,” J. Acoust. Soc. Am. 141, EL381–EL387.
- Osses et al. (2017b) Osses Vecchi, A., Chaigne, A., & Kohlrausch, A. (2017b). “Meten van klankverschillen in klassieke piano’s (Measurement of sound differences in classic pianos, in Dutch),” Nederlands Tijdschrift voor Natuurkunde 87, 248–251.
- Osses et al. (2019a) Osses Vecchi, A., Kohlrausch, A., & Chaigne, A. (2019a). “Perceptual similarity between piano notes: Experimental method applicable to reverberant and non-reverberant sounds,” J. Acoust. Soc. Am. 146, 1024–1035.
- Osses et al. (2019b) Osses Vecchi, A., Ernst, F., & Verhulst, S. (2019b). “Hearing-impaired sound perception: What can we learn from a biophysical model of the human auditory periphery?,” Proc. ICA, Aachen, Germany, pp. 1–10.
- Osses et al. (2020a) Osses Vecchi, A., McLachlan, G., & Kohlrausch, A. (2020a). “Assessing the perceived reverberation in different rooms for a set of musical instrument sounds,” BioRxiv 146, 1–6 doi:10.1101/2020.03.13.984542.
- Osses & Kohlrausch (2018) Osses Vecchi, A., & Kohlrausch, A. (2018). “Auditory modelling of the perceptual similarity between piano sounds,” Acta Acust. united Ac. 104, 930–934.
- Osses (2018) Osses Vecchi, A. (2018). “Prediction of perceptual similarity based on time-domain models of auditory perception,” Ph.D. thesis, Technische Universiteit Eindhoven. url: research.tue.nl/files/103868563/20180919_Osses_Vecchi.pdf.
- Pralong (1996) Pralong, D., & Carlile, S. (1996). “The role of individualized headphone calibration for the generation of high fidelity virtual auditory space,” J. Acoust. Soc. Am. 100, 3785–3793.
- Püschel (1988) Püschel, D. (1988). “Prinzipien der zeitlichen Analyse beim Hören,” Ph.D. thesis, University of Göttingen.
- Relaño-Iborra et al. (2019) Relaño-Iborra, H., Zaar, J., & Dau, T. (2019). “A speech-based computational auditory signal processing and perception model,” J. Acoust. Soc. Am. 146, 3306–3317.
- Robles & Ruggero (2001) Robles, L., & Ruggero, M. (2001). “Mechanics of the mammalian cochlea,” Physiol. Rev. 81, 1305–1352.
- Søndergaard & Majdak (2013) Søndergaard, P., & Majdak, P. (2013). “The Auditory Modeling Toolbox,” in The technology of binaural listening, edited by J. Blauert (Springer Berlin Heidelberg), Chap. 2, pp. 33–56. url: http://amtoolbox.sourceforge.net/.
- Verhulst et al. (2018a) Verhulst, S., Altoè, A., & Vasilkov, V. (2018a). “Computational modeling of the human auditory periphery: Auditory-nerve responses, evoked potentials and hearing loss,” Hear. Res. 360, 55–75.
- Verhulst et al. (2018b) Verhulst, S., Ernst, F., Garrett, M., & Vasilkov, V. (2018b). “Supra-threshold psychoacoustics and envelope-following response relations: normal-hearing, synaptopathy and cochlear gain loss,” Acta Acust. united Ac. 104, 800–803.
- Verhey et al. (1999) Verhey, J., Dau, T., & Kollmeier, B. (1999). “Within-channel cues in comodulation masking release (CMR): experiments and model predictions using a modulation-filterbank model,” J. Acoust. Soc. Am. 106, 2733–2745.
- Wallaert et al. (2017) Wallaert, N., Moore, B., Ewert, S., & Lorenzi, C. (2017). “Sensorineural hearing loss enhances auditory sensitivity and temporal integration for amplitude modulation,” J. Acoust. Soc. Am. 141, 971–980.
- Westerman & Smith (1984) Westerman, L., & Smith, R. (1984). “Rapid and short-term adaptation in auditory nerve responses,” Hear. Res. 15, 249–260.
- Yost et al. (1989) Yost, W., Braida, L., Hartmann, W., Kidd, G., Kruskal, J., Pastore, R., Sachs, M., Sorkin, R. & Warren, R. (1989). “Classification of complex nonspeech sounds,” (Tech. Rep.). Washington D.C.: National Academy.
- Zilany et al. (2009) Zilany, M., Bruce, I., Nelson, P. & Carney, L. (2009). “A phenomenological model of the synapse between the inner hair cell and auditory nerve: Long-term adaptation with power-law dynamics,” J. Acoust. Soc. Am. 126, 2390–2412.
- Zilany et al. (2014) Zilany, M., Bruce, I. & Carney, L. (2014). “Updated parameters and expanded simulation options for a model of the auditory periphery,” J. Acoust. Soc. Am. 135, 283–286.