
PSD Estimation and Source Separation in a Noisy Reverberant Environment using a Spherical Microphone Array

by Abdullah Fahim, et al.

In this paper, we propose an efficient technique for estimating individual power spectral density (PSD) components, i.e., the PSD of each desired sound source as well as of the noise and reverberation, in a multi-source reverberant sound scene with coherent background noise. We formulate the problem in the spherical harmonics domain to take advantage of the inherent orthogonality of the spherical harmonics basis functions, and extract the PSD components from the cross-correlations between the different sound field modes. We also investigate an implementation issue that occurs at the nulls of the Bessel functions and offer an engineering solution. The performance evaluation takes place in a practical environment with a commercial microphone array in order to measure the robustness of the proposed algorithm against the deviations incurred in practice. We also exhibit an application of the proposed PSD estimator through a source separation algorithm and compare its performance with a contemporary method in terms of different objective measures.





I Introduction

The power spectral density (PSD) of an audio signal carries useful information about the signal characteristics. Many spectral enhancement techniques, most commonly the Wiener filter and spectral subtraction, use knowledge of the PSD to suppress undesired signal components such as background noise [1], late reverberation [2, 3], or both [4]. Other applications of PSD estimation include computing the direct-to-reverberant energy ratio (DRR) [5] and separating sound sources in a mixed acoustic environment [6]. Most existing spectral enhancement techniques, including the aforementioned ones, estimate PSD components under strict assumptions such as a noiseless, free-field, or single-source scenario. In this work, we develop a new technique for estimating individual signal PSDs, along with the PSDs of the reverberation and coherent noise components, in a multi-source noisy and reverberant environment using a spherical microphone array.

I-A Literature review

Several techniques have been proposed in the literature to estimate both the source and reverberation PSDs in a single-source reverberant environment. Lebart et al. [2] used a statistical model of the room impulse response (RIR) to estimate the reverberation PSD and applied it in a spectral subtraction-based speech dereverberation technique. Braun et al. [7] proposed a PSD estimator using a reference signal under the strict statistical assumption of a diffuse reverberant field. Kuklasiński et al. [4] developed a maximum likelihood-based method for estimating the speech and late reverberation PSDs in a single-source noisy reverberant environment, assuming a known noise PSD. The spatial correlation between the received microphone signals was utilized in [5] to compute the direct-to-reverberant energy ratio in a noiseless environment by estimating the PSDs of the direct and reverberant components.

Saruwatari et al. [8] proposed a method to suppress undesired signals in a multi-source environment using complementary beamformers. Hioka et al. [6] used a similar idea for PSD estimation and source separation, utilizing multiple fixed beamformers to estimate the source PSDs. While [6] offered a PSD estimation technique for an underdetermined system capable of working with a larger number of sources than [8], both algorithms were developed for the non-reverberant case and required the direction of each speech and noise source. An improvement to [6] was suggested in [9], where the property of an M-matrix was utilized to design the fixed beamformers, whereas the former chose the steering directions empirically. However, while [9] eliminated the performance issue due to ill-conditioning of the de-mixing matrix, the method was still developed under the assumption of a noiseless free-field environment.

A common use of PSD estimates is in designing a post-filter to be used in conjunction with a beamformer, e.g., the multi-channel Wiener filter [10], in order to enhance the desired speech signal in a mixed environment. Beamforming is a speech enhancement technique that has been used for decades [11, 12], and a post-filter at the beamformer output is known to further enhance system performance by boosting interference rejection [13].

More recently, several methods have been proposed in the spherical harmonics domain to estimate the PSD components of an acoustic signal [14] or to reject interference through beamforming [15, 16]. One of the major advantages of the spherical harmonics domain representation of a signal [17, 18] is the inherent orthogonality of its basis functions. Hence, spherical harmonics-based solutions are becoming popular in other fields of acoustic signal processing, such as source localization [19], speech dereverberation [20], and noise suppression [21]. However, to the best of our knowledge, no method has been developed in the harmonics domain to estimate individual PSD components in a multi-source mixed acoustic environment. A harmonics-based single-source PSD estimator was previously proposed by the authors of this paper [14] for a noiseless reverberant environment in order to estimate the DRR.

I-B Our approach and contribution

We consider a multi-source reverberant environment with coherent background noise and propose a novel technique to estimate the individual PSD of each desired source as well as the PSDs of the reverberant and noise components. The method is developed in the spherical harmonics domain to take advantage of the inherent orthogonality of the spherical harmonics basis functions, which ensures a well-posed solution without the need for any additional design criteria such as the M-matrix consideration adopted in [9]. Additionally, in contrast to the conventional beamformer-based methods [6], where only the autocorrelation coefficients of the beamformer outputs were used, we also incorporate the cross-correlations between the spherical harmonics coefficients in our solution. The latter approach was first used in [5] for estimating the DRR in a single-source environment. The larger number of correlation coefficients makes the algorithm suitable for separating a larger number of sources compared to the conventional techniques. We also derive a novel closed-form harmonics-domain expression for the spatial correlation of a coherent noise field. We carry out a detailed theoretical analysis, demonstrate the practical impact of, and offer an engineering solution to a Bessel-zero issue which, if not addressed correctly, significantly limits the performance of spherical array-based systems. Initial work on this approach was published in [22], where a limited-scope implementation of the idea was presented under a noiseless assumption. Furthermore, a detailed analysis of the implementation challenges and a benchmark in terms of perceived quality were not included in our prior work [22].

It is worth mentioning that the performance evaluation of the proposed algorithm took place in a practical environment with a commercially available microphone array, the 'Eigenmike' [23], without any prior knowledge of the source characteristics. Hence, the evaluation process incorporated all the practical deviations such as source localization error, thermal noise at the microphones, non-ideal characteristics of the sound fields, etc. We validate the performance of the algorithm by carrying out experiments in different acoustic environments with a varying number of speakers using independent mixed-gender speech signals. The improved objective measures in terms of perceptual evaluation of speech quality (PESQ) [24] and frequency-weighted segmental signal-to-noise ratio (FWSegSNR) [25] indicate that the proposed algorithm is robust against such practical distortions and produces better estimates than the competing methods.

The paper is structured as follows. Section II states the problem and defines the objective of the work. In Section III, we develop a framework for PSD estimation based on the spatial correlation of sound field coefficients in a noisy reverberant environment. We use this framework in Section IV to formulate a PSD estimation technique, and discuss and offer solutions to a practical challenge which we term the Bessel-zero issue. In Section V, we outline a practical application of the estimated PSDs. Finally, in Section VI, we evaluate and compare the performance of the proposed algorithm with other methods based on objective metrics and graphical aids.

II Problem formulation

Let us consider a microphone array consisting of microphones to capture the sound field in a noisy reverberant room with distinct sound sources. The received signal at the microphone is given by


where , , denotes the microphone position, is the RIR between the source and the microphone, is the discrete time index, denotes the convolution operation, is the source excitation for the sound source, and is the coherent noise (i.e., colored background noise, distinct from white thermal noise, which may originate from any unknown noise source such as a room air-conditioning system) at the microphone position. The RIR can be decomposed into two parts


where and are the direct and reverberant path components, respectively. Substituting (2) into (1) and converting into the frequency domain using the short-time Fourier transform (STFT), we obtain


where represent the corresponding signals of in the STFT domain, is the time frame index, , denotes the frequency, and is the speed of sound propagation. In the subsequent sections, the time frame index is omitted for brevity.

Given the measured sound pressure , we aim to estimate the individual source PSDs, , where represents the expected value over time. As a common application of PSD estimation, we further measure the signal enhancement of a beamformer output through a basic Wiener post-filter driven by the estimated PSDs, and demonstrate an improved estimation technique for source separation.
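As a concrete illustration of the quantity being estimated, the sketch below approximates a PSD by averaging squared STFT magnitudes over time frames, which plays the role of the expectation operator. The function name and parameter values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import stft

def estimate_psd(x, fs=16000, nperseg=512):
    """Approximate the PSD E{|X(f)|^2} by averaging squared STFT
    magnitudes over the time frames (the expectation over time)."""
    f, _, X = stft(x, fs=fs, nperseg=nperseg)
    return f, np.mean(np.abs(X) ** 2, axis=-1)

# 5 seconds of white noise has an approximately flat PSD across bins.
rng = np.random.default_rng(0)
f, psd = estimate_psd(rng.standard_normal(5 * 16000))
```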

III Framework for PSD estimation

In this section, we develop a spherical harmonics domain framework to establish the relationship between the sound field coefficients and the individual PSD components in a multi-source noisy and reverberant environment. We use this model in Section IV and V to estimate individual PSD components and separate sound sources from a mixed recording.

III-A Spatial domain representation of room transfer function

We model the direct and reverberant path of room transfer function (RTF) in the spatial domain as


where represents the direct path gain at the origin for the source, ,

is a unit vector towards the direction of the

source, and is the reflection gain at the origin along the direction of for the source. Hence, we obtain the spatial domain equivalent of (3) by substituting the spatial domain RTF from (4) and (5) as


III-B Spherical harmonics decomposition

In this section, we derive the spherical harmonics expansion of (6) using the existing mathematical models. Spherical harmonics are a set of orthonormal basis functions which can represent a function over a sphere. A spherical function can be expressed in the spherical harmonics domain as


where is defined over a sphere, , denotes the spherical harmonic of order and degree , and indicates corresponding coefficient. The spherical harmonics inherently possess the orthonormal property, i.e.


where is the Kronecker delta function. Using (7) and (8), sound field coefficients can be calculated as


where denotes the complex conjugate operation. As the realization of (9) requires an impractical continuous microphone array over a sphere, several array structures have been proposed in the literature to estimate , such as a spherical open/rigid array [26, 27], multiple circular arrays [28], or a planar array with differential microphones [29]. The subsequent theory is developed for a spherical microphone array; however, it is equally applicable to any array geometry that can produce (e.g., [30]).
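The discrete analysis step of (9) can be sketched numerically. The snippet below simulates pressure samples from known coefficients and recovers them by least squares, which stands in for the weighted quadrature sum of (12); the 32-point random sampling grid is an assumption for illustration, not the actual Eigenmike layout.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, az, pol):
    """Matrix of spherical harmonics Y_nm sampled at the microphone directions."""
    cols = [sph_harm(m, n, az, pol)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)            # shape (M, (order+1)^2)

# Hypothetical 32-microphone sampling (Eigenmike-like count), random directions.
rng = np.random.default_rng(1)
M, order = 32, 3
az = rng.uniform(0, 2 * np.pi, M)
pol = np.arccos(rng.uniform(-1, 1, M))       # uniform on the sphere

Y = sh_matrix(order, az, pol)
alpha_true = rng.standard_normal((order + 1) ** 2) \
           + 1j * rng.standard_normal((order + 1) ** 2)
p = Y @ alpha_true                           # simulated pressure samples
alpha_est = np.linalg.pinv(Y) @ p            # least-squares analysis, cf. (12)
```

Because the sound field is bandlimited to order 3 and 32 > 16 sampling points are used, the coefficients are recovered exactly up to numerical precision.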

The spherical harmonics decomposition of the D incident sound field of (6) is given by [17]


where is the array radius, is a unit vector towards the direction of the microphone, and is the array-independent sound field coefficient. The function depends on the array configuration and is defined as


where , and denote the order spherical Bessel and Hankel functions of the first kind, respectively, and refers to the corresponding first derivative term. Eq. (10) can be truncated at the sound field order due to the high-pass nature of the higher-order Bessel functions [31, 32], where denotes the ceiling operation. We can estimate the sound field coefficients with a spherical microphone array as [26, 27]


where is imposed to avoid spatial aliasing and are suitable microphone weights that enforce the orthonormal property of the spherical harmonics with a limited number of sampling points, i.e.


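A minimal sketch of the mode-strength computation, assuming the common open-sphere and rigid-sphere expressions with microphones on the scatterer surface; the truncation rule N = ⌈kr⌉ is one standard choice and is included here as an assumption, not necessarily the paper's exact rule.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def b_n(n, kr, rigid=True):
    """Mode strength b_n(kr) for an open or rigid spherical array
    (mic radius assumed equal to the scatterer radius)."""
    jn = spherical_jn(n, kr)
    if not rigid:
        return 4 * np.pi * (1j ** n) * jn
    jnp = spherical_jn(n, kr, derivative=True)
    hn = jn + 1j * spherical_yn(n, kr)                     # h_n^(1)
    hnp = jnp + 1j * spherical_yn(n, kr, derivative=True)  # h_n^(1)'
    return 4 * np.pi * (1j ** n) * (jn - (jnp / hnp) * hn)

def truncation_order(kr):
    """A common truncation choice N = ceil(kr) for the sound field order."""
    return int(np.ceil(kr))
```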
Similarly, the spherical harmonics decomposition of the coherent noise component of (6) is


where is the sound field coefficient due to the coherent noise sources. Finally, the spherical harmonics expansion of the Green’s function is given by [33, pp. 27–33]


Using (10), (14) and (15) in (6), we obtain the harmonics-domain representation of a noisy reverberant sound field as


Hence, the expression for the combined sound field coefficients is obtained from (16) as


where is defined as the sound field coefficients related to the direct and reverberant components of the sound signals.

It is important to note that we consider a far-field sound propagation model in (4) and (5). For near-field sound propagation, the corresponding Green's function and its spherical harmonics expansion are given by [33, pp. 31]


where is the position vector of source and denotes the Euclidean norm. In this work, we use the far-field assumption for mathematical tractability, however, the model is equally applicable for a near-field sound propagation.

III-C Spatial correlation of the sound field coefficients

In this section, we propose novel techniques to develop closed form expressions of the spatial correlation between the harmonics coefficients of reverberant and noise fields in a multi-source environment. From (17), the spatial correlation between and is


where we assume uncorrelated speech and noise sources, i.e.


III-C1 Spatial correlation of the direct and reverberant components

From (17) and (18), the spatial cross-correlation between the direct and reverberant path coefficients is


where . Due to the autonomous behavior of the reflective surfaces in a room (i.e., the reflection gains from the reflective surfaces are independent of the direct path gain), the cross-correlation between the direct and reverberant gains is negligible, i.e.,


Furthermore, we assume that the sources are uncorrelated with each other, and so are the reverberant path gains from different directions, i.e.


where denotes absolute value. Using (23), we eliminate the cross terms of the right hand side of (22) and deduce


Defining as the PSD of the source at the origin, we use (24) and (25) in (26) to obtain


Since is defined over a sphere, we can represent it using the spherical harmonics decomposition as


where is the coefficient of the power of a reverberant sound field due to source and is a non-negative integer defining corresponding order. Substituting the value of from (28) into (27), we derive


Using the definition of Wigner constants from Appendix A, we rewrite (29) as




is the total reverberant power for order and degree . Please note that the spatial correlation model developed in [14] was derived for a single source case, i.e. , and did not include background noise in the model.

III-C2 Spatial correlation model for coherent noise

In a manner similar to the derivation of (9), we obtain the expression for from (14) as


where . Hence, we deduce


The spatial correlation of the coherent noise is given by [34]


where is the complex gain of the noise sources from direction. In a reverberant room, the noise field can be assumed to be diffuse [35]; hence, (34) reduces to


where is the PSD of the noise field at . Furthermore, for simplicity, we assume that the noise field is spatially white within the small aperture of a spherical microphone array (e.g., the commercially available spherical microphone array 'Eigenmike' [23] has a radius of cm), i.e., . Hence, from (33) and (35), we get


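For intuition, the diffuse-field assumption yields the familiar sinc-shaped spatial coherence between two points. The sketch below evaluates it and shows why the field looks nearly spatially white across a small array aperture; the 8.4 cm separation is an assumed Eigenmike-scale distance, not a value taken from the paper.

```python
import numpy as np

def diffuse_coherence(f, d, c=343.0):
    """Spatial coherence sin(kd)/(kd) of an ideal diffuse field between
    two omnidirectional sensors separated by d metres."""
    kd = 2.0 * np.pi * f * d / c
    return np.sinc(kd / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)

# At 500 Hz across an 8.4 cm aperture the coherence is still ~0.9,
# which motivates the spatially-white approximation above.
coh = diffuse_coherence(500.0, 0.084)
```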
III-C3 The combined model

Finally, from (30) and (36), we obtain the complete model of the spatial correlation in a noisy reverberant environment as




The integrals of (40) can be evaluated using a numerical computing tool. An approximation of (40) can also be made through finite summations as


where and are chosen in such a way that the orthonormal property of the spherical harmonics holds. Alternatively, a closed-form expression for (40) is derived in Appendix B with the help of the addition theorem of the spherical Bessel functions [36] as


The spatial correlation model of (37) is developed considering a far-field sound propagation. Following the discussion of Section III-A, it is evident from (15) and (19) that a near-field source consideration for the direct path signals changes the direct path coefficient of (37) as


Hence, to design a system with near-field sources, we require the additional knowledge of the source distance .

IV PSD estimation

In this section, we reformulate (37) into a matrix form and solve it in the least square sense to estimate the source, reverberant and noise PSDs. We also discuss an implementation issue and offer engineering solutions to the problem.

IV-A Source PSDs



we can write (37) in a matrix form by considering the cross-correlation between all the available modes as




where denotes the transpose operation. Note that the frequency dependency is omitted in (45)-(48) to simplify the notation. We estimate the source, reverberant, and noise PSDs by


where indicates the pseudo-inverse of a matrix. In a practical implementation, half-wave rectification or a similar measure is required on (49) to avoid negative PSDs. The terms and in the vector of (49) represent the estimated source and noise PSDs at the origin, respectively. It is worth noting that (49) can readily be used for estimating source PSDs in a non-reverberant or noiseless environment by discarding, respectively, the and terms from the translation matrix in (47).
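The estimation step of (49) can be sketched as below: solve the linear system with a pseudo-inverse and half-wave rectify the result. The toy dimensions (nine correlation equations, two sources plus reverberation plus noise) are illustrative assumptions.

```python
import numpy as np

def estimate_psds(T, lam):
    """Least-squares PSD estimate: pseudo-invert the translation matrix T,
    then half-wave rectify to avoid negative PSDs."""
    psi = np.linalg.pinv(T) @ lam
    return np.maximum(np.real(psi), 0.0)

# Toy overdetermined system: recover 2 source PSDs + reverb + noise powers
# from 9 correlation observations.
rng = np.random.default_rng(2)
T = rng.standard_normal((9, 4))
psi_true = np.array([1.0, 0.5, 0.2, 0.1])
psi_hat = estimate_psds(T, T @ psi_true)
```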

IV-B Reverberant PSD

The total reverberation PSD at the origin due to all the sound sources is


Using (28), the definition of in (31), and the symmetrical property of the spherical harmonics, (50) can be written as


where is the Dirac delta function. The PSD estimation process for a single frequency bin is shown in Algorithm 1.

1 Find using (12); is manufacturer-defined;
2 Get using any suitable DOA estimation technique;
3 Calculate , , and from (38), (39) and (42), respectively;
4 Get the expected value using (58);
5 Solve (49) for using the definitions from (46)-(48).
Algorithm 1 Algorithm to estimate the PSD components

IV-C Bessel-zero issue

One of the challenges in calculating the vector is the Bessel-zero issue. We define the Bessel-zero issue as the case when of (12) takes a near-zero value, which causes noise amplification and induces estimation error. This situation arises in three distinct scenarios:

IV-C1 At low frequencies

To avoid underdetermined solutions as well as to improve the estimation accuracy of (49) by incorporating extra spatial modes, we force a minimum value of the sound field order at the lower frequency bins. For example, with , and Hz, the calculated sound field order is and the dimension of of (45) becomes , which results in an underdetermined system. Conversely, if we choose a smaller value of , we can avoid an underdetermined system, but the availability of fewer spatial modes affects the estimation accuracy. Hence, we impose a lower bound on for all frequency bins such that , where denotes the maximum value and is the lower limit of . For this work, we chose empirically. This, however, results in the aforementioned Bessel-zero issue for at the lower frequencies, as shown in Fig. 1(a). To avoid this issue, we impose a lower bound on as well, such that


where is a pre-defined floor value for .
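The flooring operation can be implemented as a magnitude clamp that preserves the phase of the mode strength; the floor value 1e-2 below is a placeholder assumption, since the value actually used would be tuned per system.

```python
import numpy as np

def floor_bn(bn, floor=1e-2):
    """Impose a magnitude floor on b_n values while preserving their phase,
    so near-zero mode strengths no longer amplify noise on inversion."""
    bn = np.asarray(bn, dtype=complex)
    out = bn.copy()
    small = np.abs(bn) < floor
    out[small] = floor * np.exp(1j * np.angle(bn[small]))
    return out
```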

(a) Before modification (b) After modification
Fig. 1: The unaltered Bessel functions and the modified version that alleviates the Bessel-zero issue. (a) Plots unaltered as a function of . The complex values are plotted as magnitudes. Solid and dashed lines denote open and rigid arrays, respectively. (b) Shows after modification. The dashed extension denotes the original value.

IV-C2 At the mode activation boundary

This scenario appears at the first few frequency bins after a higher-order mode () becomes active. As an example, for cm, order modes are activated approximately at and , where and are defined as the values of when we consider and , respectively. In the proposed algorithm, the order modes are introduced at , and we observe from Fig. 1(a) that the value of is close to zero for the first few frequency bins after the activation of the order modes. To overcome this, we introduce another lower bound criterion on as


It is important to note that the modifications proposed in (52) and (53) only affect the higher-order modes at each frequency bin, whereas the lower-order modes remain unchanged. Hence, the distortion resulting from these modifications is expected to have a less adverse impact than the Bessel-zero issue itself.

IV-C3 Zero-crossing at a particular frequency

Another case of the Bessel-zero issue occurs when the Bessel functions cross zero at higher frequencies. This is more prominent with the open array configuration, as shown in Fig. 1(a). Using a rigid microphone array avoids this issue, and we followed this approach in our experiments. Also note that the modifications we propose for the previous two scenarios also take care of this zero-crossing issue of the Bessel functions for an open array when .

Fig. 1(b) plots the magnitudes of after the modification for different values of . The impact of the Bessel-zero issue and the improvement after the proposed modifications are discussed in the result section.

V A practical application of PSD estimation in source separation

Accurate knowledge of the PSDs of different signal components is desirable in many audio processing applications such as spectral subtraction and Wiener filtering. In this section, we demonstrate a practical use of the proposed method in separating the speech signals of multiple concurrent speakers in a noisy reverberant room. This example uses the signal PSDs, estimated through the proposed algorithm, to significantly enhance the output of a conventional beamformer using a Wiener post-filter. The performance of the Wiener filter largely depends on the estimation accuracy of the source and interferer PSDs, which is where the importance of a PSD estimation algorithm lies. A block diagram of the complete source separation technique is shown in Fig. 2 and explained in the subsequent sections.

Fig. 2: The block diagram of the source separation as an application to the proposed PSD estimation method. The double-bordered block represents the proposed PSD estimation block.

V-a Estimation of the direction of arrival

The proposed algorithm (like any beamforming technique) requires a priori knowledge of the direction of arrival (DOA) of the desired speech signals. If the source positions are unknown, any suitable localization technique, e.g., multiple signal classification, commonly known as MUSIC [37], can be used to estimate the DOAs of the source signals. In the experiments where we measured the performance of the proposed algorithm, we used a frequency-smoothed variant of the MUSIC algorithm implemented in the spherical harmonics domain [38].
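For readers unfamiliar with MUSIC, the sketch below is a generic narrowband version on a uniform linear array, not the frequency-smoothed spherical harmonics variant of [38]: eigendecompose the spatial covariance, keep the noise subspace, and scan candidate steering vectors; all array parameters are illustrative assumptions.

```python
import numpy as np

def music_spectrum(R, steering, n_src):
    """MUSIC pseudospectrum: inverse projection energy of candidate
    steering vectors onto the noise subspace of the covariance R."""
    _, V = np.linalg.eigh(R)               # eigenvalues ascending
    En = V[:, : R.shape[0] - n_src]        # noise subspace
    proj = En.conj().T @ steering
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)

# Toy check: 6-element half-wavelength ULA, one source at 30 degrees.
M = 6
def steer(deg):
    return np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(deg)))

a0 = steer(30.0)
R = np.outer(a0, a0.conj()) + 0.01 * np.eye(M)   # signal + weak noise
grid = np.arange(-90.0, 90.5, 1.0)
A = np.stack([steer(g) for g in grid], axis=1)
spec = music_spectrum(R, A, n_src=1)
doa_hat = grid[np.argmax(spec)]
```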

V-B Choice of Beamformer

Several beamforming techniques are available in the literature, such as delay and sum (DS), maximum directivity (MD), and minimum variance distortionless response (MVDR). The choice of beamforming technique depends on the use case in practice. In this work, where the undesired signal includes the correlated reverberant component of the desired signal, an MVDR beamformer can result in cancellation of the desired signal if the undesired PSD components at each microphone position are unknown. Hence, a simple delay and sum beamformer or a maximum directivity beamformer is more appropriate for the current case, whose output, when steered towards the far-field source, is given by [27, 39]




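A minimal sketch of a harmonics-domain maximum-directivity (plane-wave decomposition) beamformer for a single frequency bin: divide each mode by its mode strength and steer with the spherical harmonics. Normalization conventions vary across the literature, so treat the scaling and the plane-wave model in the check as illustrative assumptions.

```python
import numpy as np
from scipy.special import sph_harm

def pwd_beamformer(alpha, bn, az_s, pol_s, order):
    """Plane-wave decomposition beamformer output for one frequency bin,
    steered towards azimuth az_s, polar angle pol_s."""
    z, i = 0.0 + 0.0j, 0
    for n in range(order + 1):
        for m in range(-n, n + 1):
            z += (alpha[i] / bn[n]) * sph_harm(m, n, az_s, pol_s)
            i += 1
    return z

# Check: for a unit plane wave from (az0, pol0), alpha_nm = b_n * conj(Y_nm),
# so steering at the source gives sum_n (2n+1)/(4*pi) by the addition theorem.
order = 2
bn = np.array([1.0 + 0j, 0.5j, -0.3 + 0j])
az0, pol0 = 1.1, 0.7
alpha = np.array([bn[n] * np.conj(sph_harm(m, n, az0, pol0))
                  for n in range(order + 1) for m in range(-n, n + 1)])
z = pwd_beamformer(alpha, bn, az0, pol0, order)
```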
V-C Wiener post-filter

Regardless of the choice of beamformer, a post-filter is found to enhance the beamformer output in most cases [40, 13, 35]. Hence, at the last stage, we apply a Wiener post-filter to the beamformer output in which the estimated PSD values are used. The transfer function of the Wiener filter for the source is given by


where all the PSD components have already been estimated by the proposed algorithm and are available in the vector . Hence, the source signal is estimated by


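The post-filtering step can be sketched as below: the Wiener gain for one source is its estimated PSD divided by the sum of all component PSDs, applied per time-frequency bin to the beamformer output. All names and the toy numbers are illustrative.

```python
import numpy as np

def wiener_postfilter(psd_source, psd_components, Z):
    """Apply a per-bin Wiener gain (source PSD over total PSD) to the
    beamformer output spectrum Z."""
    total = np.maximum(np.sum(psd_components, axis=0), 1e-12)
    return (psd_source / total) * Z

# Toy bin: source PSD 1.0, interfering reverb + noise PSDs 0.5 each,
# so the gain is 0.5 and a beamformer output of 2.0 is attenuated to 1.0.
psd_components = np.array([[1.0], [0.5], [0.5]])
s_hat = wiener_postfilter(psd_components[0], psd_components,
                          np.array([2.0 + 0j]))
```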
VI Experimental Results

In this section, we demonstrate and discuss the experimental results based on practical recordings in a noisy and reverberant room using , , and speech sources.

VI-A Experimental setup

We evaluated the performance in distinct scenarios under different reverberant environments as shown in Table I (the 8-speaker case was not tested in Rooms B and C due to logistical issues). The reverberation time and DRR in Table I were calculated based on the methods used in [41]. All the experiments included background noise from the air-conditioning system and the vibration of the electrical equipment in the lab. We created separate training and evaluation datasets consisting of male and female voices from the TIMIT database [42]. The training dataset was used to set the parameters such as , etc., whereas the algorithm performance was measured using the evaluation dataset. Each of the scenarios was evaluated times with different mixtures of mixed-gender speech signals, for a total of unique experiments. We used the far-field assumption in each case for computational tractability. We measured the performance with both the true and the estimated DOAs (denoted Proposed-GT and Proposed-EST, respectively), where the latter were found with a MUSIC-based algorithm; the true and estimated DOAs for are listed in Appendix C. We compared the performance with the multiple beamformer-based method of [6] (denoted MBF) and a conventional DS beamformer (denoted BF) since, to the best of our knowledge, no similar harmonics-based technique has been proposed in the literature. Note that, for fairness of comparison, we used all microphones of the Eigenmike for all the competing methods. Furthermore, as the experiments were designed with practical recordings instead of a simulation-based approach, the robustness of the proposed algorithm against realistic thermal noise at the microphones was inherently evaluated through the process.

Room size (m) (ms) DRR # Speakers
Room A dB m
Room B dB m
Room C dB m
TABLE I: Experimental environments. denotes source to microphone distance.

The Eigenmike consists of pressure microphones distributed on the surface of a sphere with a radius of cm. The mixed sound was recorded at a kHz sampling rate, but downsampled to kHz for computational efficiency. The recorded mixed signals were then converted to the frequency domain with a ms Hanning window, frame overlap, and a -point fast Fourier transform (FFT). All subsequent processing was performed in the STFT domain with the truncated sound field order , , and , unless mentioned otherwise. The noise PSD was assumed to have significant power up to kHz, whereas all other PSD components were estimated for the whole frequency band. The expected value of (44) was computed using an exponentially weighted moving average as


where is a smoothing factor; we chose .
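The recursive averaging above can be sketched in one line; the smoothing factor 0.8 below is a placeholder assumption, since the paper's chosen value is not reproduced here.

```python
def ema(prev, current, beta=0.8):
    """One update of the exponentially weighted moving average used as
    the expectation operator over time frames."""
    return beta * prev + (1.0 - beta) * current
```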

VI-B Selection of

represents the order of the power of a reverberant sound field. An exact harmonics analysis of the power of a reverberant sound field is a difficult task and depends on the structure, orientation, and characteristics of the reflective surfaces. Hence, unlike in free-field sound propagation, a reverberant field cannot be analytically decomposed into a linear combination of Bessel functions, the property which enables the order truncation of a non-reverberant sound field [31, 32]. Theoretically, extends to infinity; however, we need to consider several limiting factors:

  • Avoiding an underdetermined system of equations in (43), which imposes a limit on as

  • Saving computational complexity by choosing the minimum required value of .

It is also important to note that the nature of the reverberant field plays an important role in determining . For example, for a perfectly diffuse reverberant room with spatially uniform reverberant power, only the order () mode is enough. On the other hand, a room with strong directional characteristics requires the higher orders to be considered. Hence, should be tuned separately for each reverberation environment to obtain an acceptable performance. In our experiments, we chose for Rooms A, B, and C, respectively, based on the performance with the training dataset.

VI-C Evaluation metrics

We evaluate the performance through visual comparisons of the true and estimated PSDs of the sound sources. In addition to that, we also introduce an objective performance index to measure the full-band normalized PSD estimation error as