1 Introduction
Auditory filterbanks are widely used in speech signal processing algorithms, especially in computational auditory scene analysis (CASA) [1], for applications including speech enhancement, recognition and transcription.
A typical auditory filterbank is specified by two choices: the filter type and the center frequencies of the filters. Common filter types include the gammatone, the gammachirp, and their variants [2], which simulate the auditory response of human listeners. The choice of center frequencies has evolved from the earlier critical bandwidth and critical-band-rate scale [3], to the polynomial approximation of the equivalent rectangular bandwidth (ERB) [4], and the currently well-accepted linear ERB [5], together with their corresponding ERB-rate scales (ERBS).
Although the linear ERB approximation in [5] has been found useful in practical implementations, it is based on experimental findings from psychoacoustic measurement and curve-fitting. Logarithmic frequency scales have also been applied [6, 7, 8]. However, the selection of the number of subbands for a given frequency range remains empirical for both the ERB-rate scale and the logarithmic scale.
In this paper, we further investigate frequency scaling and provide new insights, including a proposed frequency coverage metric and the derivation of a new frequency scaling function that leads to consistent frequency coverage for auditory filterbanks. Moreover, based on the proposed definition of frequency coverage, we derive an expression for the frequency coverage of the existing linear ERB.
2 Equivalent Rectangular Bandwidth Scale
The ERB of a particular filter is defined as the bandwidth of a rectangular filter that passes the same energy as that filter [4, 5]. The relationship between the ERB of the human auditory filter and the center frequency has been studied extensively, using analytical expressions to approximate measurement data from psychoacoustic experiments. An early approximation has the polynomial form [4]
$$\mathrm{ERB}(f) = a_2 f^2 + a_1 f + a_0, \tag{1}$$
where $f$ is the frequency in Hz, and $a_2$, $a_1$, $a_0$ are parameters. However, one of the most widely accepted analytical approximations over the past decades has been the linear form [5]
$$\mathrm{ERB}(f) = 24.7\,(0.00437 f + 1) \approx 0.108 f + 24.7. \tag{2}$$
Each ERB corresponds to a constant distance along the basilar membrane in the cochlea [5, 9].
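The ERB-rate scale (ERBS) has been developed to express frequency in units of the ERB, by solving the integral [4, 5]
$$\mathrm{ERBS}(f) = \int \frac{1}{\mathrm{ERB}(f)}\, df \tag{3}$$
with the boundary condition
$$\mathrm{ERBS}(0) = 0. \tag{4}$$
Using (2) in (3) and (4) yields [5]
$$\mathrm{ERBS}(f) = 21.4 \log_{10}(0.00437 f + 1). \tag{5}$$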
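As a quick numerical check of the linear ERB and its ERB-rate scale, the following Python sketch evaluates both and inverts the scale. It assumes the widely cited constants of the linear form in [5] (24.7 Hz, 4.37 per kHz, and 21.4); the function names are illustrative only and do not come from the paper.

```python
import math

def erb_linear(f_hz):
    """Linear ERB approximation in Hz, assuming the constants 24.7 and 4.37/kHz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erbs_linear(f_hz):
    """ERB-rate scale value for a frequency in Hz (zero at 0 Hz)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erbs_linear_inverse(erbs_value):
    """Frequency in Hz that maps to a given ERB-rate scale value."""
    return (10.0 ** (erbs_value / 21.4) - 1.0) * 1000.0 / 4.37

if __name__ == "__main__":
    for f in (100.0, 1000.0, 4000.0):
        print(f, round(erb_linear(f), 2), round(erbs_linear(f), 2))
```

The inverse function is what a filterbank design would typically use: pick equidistant points on the scale, then map them back to center frequencies in Hz.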
The ERB and ERBS given in (2) and (5) have been applied in numerous auditory studies for selecting the center frequencies of the auditory filterbank [10]. However, the ERB approximation is still the result of curve-fitting to experimental data, and the number of subbands for a given frequency range remains an empirical parameter.
3 Suggested Frequency Scaling and Coverage
3.1 Speaker Signal Model
Based on the source-excitation and vocal-tract models of speech production [11], as well as the amplitude-modulation (AM) and frequency-modulation (FM) structure [12], a harmonic model is used for the speaker signal:
$$s_i(t) = \sum_{k=1}^{K_i} h_{i,k}(t), \tag{6}$$
$$h_{i,k}(t) = a_{i,k}(t) \cos\big(k\,\omega_i t + \phi_{i,k}\big), \tag{7}$$
where $t$ is continuous time, $s_i(t)$ the speech signal from the $i$-th speaker, $i = 1, \dots, I$, integer $I$ the number of concurrent speakers, $h_{i,k}(t)$ the $k$-th harmonic of speaker $i$, integer $k$ the order of harmonics for a speaker, integer $K_i$ the maximum order of harmonics for speaker $i$, $a_{i,k}(t)$ the envelope of each harmonic, $\phi_{i,k}$ the phase (which is short-time constant for speech signals), and $\omega_i$ the (angular) fundamental frequency.
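To make the harmonic model concrete, the sketch below evaluates one speaker's signal at a single time instant as a sum of cosines at integer multiples of the fundamental. It assumes that the envelopes and phases are constant over the short interval considered; the function and argument names are hypothetical.

```python
import math

def harmonic_speaker_signal(t, f0_hz, envelopes, phases):
    """Sum of harmonics k = 1..K at time t (seconds).

    envelopes[k-1] and phases[k-1] belong to the k-th harmonic and are
    treated as constants here (a short-time assumption).
    """
    omega0 = 2.0 * math.pi * f0_hz  # angular fundamental frequency
    return sum(
        a * math.cos(k * omega0 * t + phi)
        for k, (a, phi) in enumerate(zip(envelopes, phases), start=1)
    )

if __name__ == "__main__":
    # Three harmonics of a 120 Hz fundamental, sampled at 8 kHz.
    samples = [
        harmonic_speaker_signal(n / 8000.0, 120.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0])
        for n in range(8)
    ]
    print([round(x, 3) for x in samples])
```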
With appropriate selection of filter center frequencies, the auditory filterbank ideally separates into subbands the harmonic components of not only a single speaker, but also multiple concurrent speakers, based on the time-frequency sparsity assumption of speech signals [13].
3.2 Logarithmic Frequency Scaling
In practice, concurrent speakers usually have different fundamental frequencies. We denote the fundamental frequencies of two speakers as $f_1$ and $f_2$ ($f_1 > 0$, $f_2 > 0$, $f_1 \neq f_2$), and their difference as
$$\Delta f = f_2 - f_1. \tag{8}$$
From (7), the frequency difference of their $k$-th harmonics is $k\,\Delta f$. This means that their harmonics (of the same order) are more distant at higher frequencies on the linear frequency scale, which makes selection of the filterbank center frequencies difficult for a regular per-speaker estimate.
We thus propose a frequency scaling function $F(f)$ that satisfies (9), so that speech components of separate speakers appear equidistantly with respect to (w.r.t.) the harmonic order $k$:
$$F(k f_2) - F(k f_1) = F(f_2) - F(f_1), \quad \forall k. \tag{9}$$
The logarithmic functions are solutions to (9):
$$F(f) = \alpha \log_b f + \beta, \tag{10}$$
where $\alpha$ and $\beta$ are constants and $b > 1$. They also offer better resolution at lower frequencies, which aligns with the fact that most speech energy falls in low frequencies (e.g. fundamental frequencies and their lower-order harmonics). We can easily verify from (10) that $F(k f_2) - F(k f_1) = \alpha \log_b (f_2 / f_1)$, which is constant with respect to $k$.
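The equidistance property of the logarithmic scaling can be checked numerically. The sketch below compares the gap between same-order harmonics of two speakers on a linear axis and on a logarithmic axis. The fundamental frequencies (100 Hz and 150 Hz), the scale factor `alpha` and the function names are illustrative assumptions.

```python
import math

def log_scale(f_hz, alpha=1.0, base=math.e):
    """Logarithmic frequency scaling; alpha and base are free parameters."""
    return alpha * math.log(f_hz, base)

def harmonic_gaps(f1_hz, f2_hz, max_order, scale=lambda f: f):
    """Gap between the k-th harmonics of two speakers on a given scale."""
    return [scale(k * f2_hz) - scale(k * f1_hz) for k in range(1, max_order + 1)]

if __name__ == "__main__":
    print("linear:", harmonic_gaps(100.0, 150.0, 4))
    print("log   :", [round(g, 6) for g in harmonic_gaps(100.0, 150.0, 4, log_scale)])
```

On the linear axis the gap grows in proportion to the harmonic order, whereas on the logarithmic axis it stays at the logarithm of the ratio of the two fundamentals.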
Denote the ratio of the center frequency to the bandwidth as $Q_n$ for filter band $n$ ($n = 1, \dots, N$), where the integer $N$ is the number of filter bands, i.e.
$$Q_n = \frac{f_n}{B_n}, \tag{11}$$
where $B_n$ and $f_n$ denote the bandwidth and center frequency of filter band $n$, respectively. $Q_n$ is also referred to as the quality factor (Q-factor) of subband $n$.
Denote the frequency range that we are interested in as , where . Assuming that the center frequencies of filter bands are equidistantly spaced in the proposed frequency range, we have
(12) |
and
(13) |
where denotes the inverse function of .
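The equidistant placement of center frequencies on a chosen scale can be sketched as follows. The sketch assumes that the first and last center frequencies coincide with the two ends of the frequency range, which is an indexing choice made here for illustration and may differ from the one used in the paper. The names `equidistant_centers`, `scale` and `inverse` are hypothetical.

```python
import math

def equidistant_centers(f_low, f_high, n_bands, scale, inverse):
    """Center frequencies equally spaced on `scale` between f_low and f_high.

    `scale` maps Hz to the scaled axis and `inverse` maps back to Hz.
    Assumes n_bands >= 2 and that both band edges are used as centers.
    """
    if n_bands < 2:
        raise ValueError("n_bands must be at least 2")
    lo, hi = scale(f_low), scale(f_high)
    step = (hi - lo) / (n_bands - 1)
    return [inverse(lo + n * step) for n in range(n_bands)]

if __name__ == "__main__":
    # Natural-log scale: consecutive centers share a constant ratio.
    centers = equidistant_centers(100.0, 8000.0, 5, math.log, math.exp)
    print([round(fc, 1) for fc in centers])
```

Passing a different `scale` and `inverse` pair (for example an ERB-rate scale and its inverse) reuses the same placement rule on another frequency axis.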
3.3 Proposed Frequency Coverage
The auditory filterbank requires sufficient frequency coverage to capture all harmonic components of concurrent speakers. Here we propose to define the frequency coverage of the filterbank on the proposed frequency scale as
(15) |
where and denote the distance between consecutive filter bands and the half of the sum of their bandwidths, as shown in (16) and (17), respectively:
(16) |
and
(17) |
Apparently gives a full coverage for ideal “brick-wall” bandpass filters with no overlap. For a practical auditory filterbank however, the filters always have finite roll-off rate, thus reasonable overlap is required for full coverage, leading to . Also depending on applications, we may have when full coverage is not required.
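A minimal sketch of the coverage idea follows. It assumes that the coverage between two neighbouring bands is the ratio of half the sum of their bandwidths to the distance between their center frequencies, both measured in Hz. This reading is consistent with the brick-wall case described above, but the exact axis on which the paper measures these quantities is an assumption here, and the function name is hypothetical.

```python
def frequency_coverage(centers_hz, bandwidths_hz):
    """Coverage for each pair of consecutive bands.

    Assumed definition: (B_n + B_{n+1}) / 2 divided by (f_{n+1} - f_n).
    A value of 1 means ideal brick-wall filters would just touch.
    """
    return [
        0.5 * (bandwidths_hz[n] + bandwidths_hz[n + 1])
        / (centers_hz[n + 1] - centers_hz[n])
        for n in range(len(centers_hz) - 1)
    ]

if __name__ == "__main__":
    # Contiguous 100 Hz wide bands: coverage of exactly 1 everywhere.
    print(frequency_coverage([100.0, 200.0, 300.0], [100.0, 100.0, 100.0]))
    # Constant-Q bands (Q = 4) with log-spaced centers: constant coverage.
    centers = [100.0 * 1.5 ** n for n in range(5)]
    print(frequency_coverage(centers, [fc / 4.0 for fc in centers]))
```

Under this assumed definition, constant-Q bands on logarithmically spaced centers give the same coverage for every pair of neighbours, which matches the notion of consistent coverage.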
3.4 Frequency Coverage of the Existing ERBS
The existing ERB function (2) does not lead to a constant , here we investigate its corresponding frequency coverage by applying the definition in (15).
Assuming the filter bandwidth is a constant scale of the ERB, which is true for some auditory filters, e.g. the gammatone filter [2], i.e.
(23) |
where is a constant. Note here that the Q-factor is not constant as .
Therefore, selecting equidistantly on the scale , similar to (14), we have
(24) | ||||
Thus from (15) and (19) we have
(25) | ||||
which is also constant over filter subbands. Thus as long as the ERB has the linear form as (19) and assuming that (23) holds, the resulting frequency coverage is constant over frequency at given , and . Thus the number of subbands for a given frequency range can be derived from the required frequency coverage using (25), and the subband center frequencies can then be calculated from (14) or (24).
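The constancy of the coverage for the linear ERB can be checked numerically. The sketch below places center frequencies equidistantly on the ERB-rate scale, sets each bandwidth to a constant multiple `c` of the ERB, and evaluates the coverage between neighbours. It assumes the commonly cited constants of the linear ERB in [5], endpoint-inclusive center placement, and coverage defined as half the sum of neighbouring bandwidths divided by the center distance in Hz; all names are illustrative.

```python
import math

def erb(f_hz):
    """Linear ERB in Hz (constants assumed from the form in [5])."""
    return 24.7 * (0.00437 * f_hz + 1.0)

def erbs(f_hz):
    """ERB-rate scale value for a frequency in Hz."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erbs_inverse(value):
    """Frequency in Hz for a given ERB-rate scale value."""
    return (10.0 ** (value / 21.4) - 1.0) / 0.00437

def erbs_coverage(f_low, f_high, n_bands, c=1.0):
    """Coverage between neighbouring bands placed equidistantly on the ERBS."""
    lo, hi = erbs(f_low), erbs(f_high)
    step = (hi - lo) / (n_bands - 1)
    centers = [erbs_inverse(lo + n * step) for n in range(n_bands)]
    bandwidths = [c * erb(fc) for fc in centers]  # bandwidth = c * ERB
    return [
        0.5 * (bandwidths[n] + bandwidths[n + 1]) / (centers[n + 1] - centers[n])
        for n in range(n_bands - 1)
    ]

if __name__ == "__main__":
    for n_bands in (16, 32, 64):
        cov = erbs_coverage(100.0, 8000.0, n_bands)
        print(n_bands, round(min(cov), 4), round(max(cov), 4))
```

For each number of bands the minimum and maximum coverage agree, and the value grows with the number of bands, so a target coverage can be met by increasing the band count.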
4 Numerical Studies
4.1 New ERB and ERBS Functions
From (10) we have a new frequency scaling function that can lead to consistent frequency coverage for the auditory filterbank, as well as a constant Q-factor. Now we calculate the parameters.
Denote the maximum inaudible frequency as , usually Hz, we use the boundary condition
(26) |
instead of (4). Thus from (10) we have
(27) |
From (3) and (10) we have a new approximation of the ERB:
(28) | ||||
Choosing natural logarithm, i.e. , where , we can get from linear fitting of experimental readings from the literature [14, 15, 16, 17, 18, 19] as shown in Fig. 1. We can see that
(29) |
where fits the data well. Then we have
(30) |
where .
Equations (29) and (30) are the proposed new ERB and ERBS functions. Note here that the ERB of human auditory system may vary with age and sound level and from one listener to another [4]. Thus the precise values of and may vary. However, the derivation from (10) to (18) shows that, as long as the ERB function has the proposed form of (28) or (29), the resulting frequency scaling always satisfies the frequency coverage as (18) shows.
The existing and proposed ERBS functions are plotted in Fig. 2. We can see that the proposed scaling follows the proposed logarithmic scaling, and is steeper at frequencies lower than about Hz. In this section we use Hz and Hz. The center frequencies that correspond to equidistant points on the respective ERBS for are plotted in . We can see that the proposed ERBS has more points at low frequencies. This can provide better frequency resolution at lower frequencies, as most speaker fundamental frequencies are below Hz, and most speech energy usually lies in the fundamental frequency or its lower-order harmonics [11].
4.2 Frequency Coverage of the Gammatone Filterbank
The frequency coverage is the property that we propose for the selection of center frequencies of an auditory filterbank. Here we use the gammatone filter to demonstrate the feature.
We can see from [2] that bandwidth of the gammatone filter is only dependent on the filter order () and the ERB, i.e.
(31) |
where
(32) |
This satisfies the assumptions in (18) and (23). Thus using the new ERB function (29) instead of (2) in (31), we have the Q-factor for the gammatone filter
(33) |
which is constant over frequency, e.g. when , we have , and . Thus given , we can get from (18), and from (25).
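The dependence of the gammatone bandwidth on the filter order can be illustrated with the standard gammatone relations. The sketch below assumes that the magnitude response of an order-n gammatone is proportional to (1 + ((f - fc)/b)^2)^(-n/2), from which the ERB and the 3-dB bandwidth follow as fixed multiples of the bandwidth parameter b. Whether the bandwidth used in (31) is the 3-dB bandwidth is an assumption, and the function names are hypothetical.

```python
import math

def gammatone_erb_over_b(order):
    """ERB of an order-n gammatone divided by its bandwidth parameter b."""
    n = order
    return (
        math.pi * math.factorial(2 * n - 2) * 2.0 ** (-(2 * n - 2))
        / math.factorial(n - 1) ** 2
    )

def gammatone_3db_over_b(order):
    """3-dB bandwidth of an order-n gammatone divided by b."""
    return 2.0 * math.sqrt(2.0 ** (1.0 / order) - 1.0)

def gammatone_3db_over_erb(order):
    """3-dB bandwidth expressed as a multiple of the ERB."""
    return gammatone_3db_over_b(order) / gammatone_erb_over_b(order)

if __name__ == "__main__":
    for order in (2, 3, 4):
        print(order, round(gammatone_3db_over_erb(order), 4))
```

Because this multiple depends only on the order, any ERB function that is proportional to frequency gives a gammatone Q-factor that does not vary with frequency.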
Fig. 3 further provides the frequency coverage of the proposed and existing ERBS over the number of subbands of the -th order gammatone auditory filterbank for the frequency range of Hz. We can see from the top panel that for frequencies above about Hz, both ERBs align well with each other. However, the existing ERB has decreasing Q-factors as frequencies decrease below about Hz, while the proposed ERB is consistent across the entire frequency range. We can also see from the bottom panel that for both ERB scaling functions, the frequency coverage is constant for a given number of subbands , and increases almost linearly with the number of subbands. The frequency coverage reaches about 1 at for both scalings. However, it can also be noted that for the same frequency range, the ERBS requires fewer subbands than the new logarithmic scale for a desired frequency coverage.
5 Conclusions
This paper investigates the frequency scaling of auditory filterbanks and proposes a novel frequency coverage metric for the selection of their center frequencies. We also propose a new ERB that aligns with the logarithmic frequency scaling, and show that equidistant frequencies on the logarithmic frequency scale provide consistent frequency coverage for the filterbanks. Moreover, we show that the existing linear ERB, and any other linear ERB, can also provide consistent frequency coverage. The proposed frequency coverage is demonstrated using the gammatone filterbank.
Acknowledgment
The author would like to acknowledge the contribution of the Australian Postgraduate Award and Australian Government Research Training Program Scholarship in supporting this research. Due thanks are given to Professor S. Nordholm and anonymous reviewers for the review comments on early revisions of the manuscript.
References
- [1] D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE Press, 2006.
- [2] J. Holdsworth, I. Nimmo-Smith, R. Patterson, and P. Rice, “Implementing a gammatone filter bank,” Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, vol. 1, pp. 1–5, 1988.
- [3] E. Zwicker and E. Terhardt, “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” The Journal of the Acoustical Society of America, vol. 68, no. 5, pp. 1523–1525, 1980.
- [4] B. C. Moore and B. R. Glasberg, “Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,” The Journal of the Acoustical Society of America, vol. 74, no. 3, pp. 750–753, 1983.
- [5] B. R. Glasberg and B. C. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1, pp. 103–138, 1990.
- [6] X. Sun, “Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio,” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1. IEEE, 2002, pp. I–333.
- [7] F. Nolan, “Intonational equivalence: an experimental evaluation of pitch scales,” in Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, vol. 39, 2003.
- [8] W. Biesmans, N. Das, T. Francart, and A. Bertrand, “Auditory-inspired speech envelope extraction methods for improved EEG-based auditory attention detection in a cocktail party scenario,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 5, pp. 402–412, 2017.
- [9] B. Moore, “Parallels between frequency selectivity measured psychophysically and in cochlear mechanics,” 1986.
- [10] R. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” in a meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2, no. 7, 1987.
- [11] J. R. Deller Jr, J. G. Proakis, and J. H. Hansen, Discrete time processing of speech signals. Prentice Hall PTR, 1993.
- [12] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy separation in signal modulations with application to speech analysis,” IEEE Transactions on Signal Processing, vol. 41, no. 10, pp. 3024–3051, 1993.
- [13] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
- [14] R. D. Patterson, “Auditory filter shapes derived with noise stimuli,” The Journal of the Acoustical Society of America, vol. 59, no. 3, pp. 640–654, 1976.
- [15] D. L. Weber, “Growth of masking and the auditory filter,” The Journal of the Acoustical Society of America, vol. 62, no. 2, pp. 424–429, 1977.
- [16] T. Houtgast, “Auditory-filter characteristics derived from direct-masking data and pulsation-threshold data with a rippled-noise masker,” The Journal of the Acoustical Society of America, vol. 62, no. 2, pp. 409–415, 1977.
- [17] R. D. Patterson, I. Nimmo-Smith, D. L. Weber, and R. Milroy, “The deterioration of hearing with age: Frequency selectivity, the critical ratio, the audiogram, and speech threshold,” The Journal of the Acoustical Society of America, vol. 72, no. 6, pp. 1788–1803, 1982.
- [18] S. Fidell, R. Horonjeff, S. Teffeteller, and D. M. Green, “Effective masking bandwidths at low frequencies,” The Journal of the Acoustical Society of America, vol. 73, no. 2, pp. 628–638, 1983.
- [19] M. J. Shailer and B. C. Moore, “Gap detection as a function of frequency, bandwidth, and level,” The Journal of the Acoustical Society of America, vol. 74, no. 2, pp. 467–473, 1983.