I. Introduction
The power spectral density (PSD) of an audio signal carries useful information about the signal's characteristics. Many spectral enhancement techniques, most commonly the Wiener filter and spectral subtraction, use knowledge of the PSD to suppress undesired signal components such as background noise [1], late reverberation [2, 3], or both [4]. Other applications of the PSD include computing the direct-to-reverberant energy ratio (DRR) [5] and separating sound sources in a mixed acoustic environment [6]. Most existing spectral enhancement techniques, including the aforementioned ones, estimate PSD components under strict assumptions such as a noiseless, free-field, or single-source scenario. In this work, we develop a new technique for estimating the individual source PSDs, along with the PSDs of the reverberation and coherent noise components, in a multi-source noisy and reverberant environment using a spherical microphone array.
I-A. Literature review
Several techniques have been proposed in the literature to estimate both the source and reverberation PSDs in a single-source reverberant environment. Lebart et al. [2] used a statistical model of the room impulse response (RIR) to estimate the reverberation PSD and applied it in a spectral-subtraction-based speech dereverberation technique. Braun et al. [7] proposed a PSD estimator using a reference signal under the strict statistical assumption of a diffuse reverberant field. Kuklasiński et al. [4] developed a maximum-likelihood-based method for estimating the speech and late-reverberation PSDs in a single-source noisy reverberant environment, assuming a known noise PSD. The spatial correlation between the received microphone signals was utilized in [5] to compute the direct-to-reverberant energy ratio in a noiseless environment by estimating the PSDs of the direct and reverberant components.
Saruwatari et al. [8] proposed a method to suppress undesired signals in a multi-source environment using complementary beamformers. Hioka et al. [6] used a similar idea for PSD estimation and source separation, utilizing multiple fixed beamformers to estimate the source PSDs. While [6] offered a PSD estimation technique for an underdetermined system, capable of working with a larger number of sources than [8], both algorithms were developed for the non-reverberant case and required the direction of each speech and noise source. An improvement to [6] was suggested in [9], where the property of an M-matrix was utilized to design the fixed beamformers, whereas the former chose the steering directions empirically. However, while [9] eliminated the performance issue due to ill-conditioning of the demixing matrix, the method was developed under the assumption of a noiseless free-field environment.
A common use of PSDs is in designing a post-filter to be used in conjunction with a beamformer, e.g., the multi-channel Wiener filter [10], in order to enhance the desired speech signal in a mixed environment. Beamforming is a speech enhancement technique that has been used for decades [11, 12]; however, a post-filter at the beamformer output is known to improve system performance by boosting interference rejection [13].
In the recent past, several methods have been proposed in the spherical harmonics domain to estimate the PSD components of an acoustic signal [14] or to reject interference through beamforming [15, 16]. One of the major advantages of the spherical harmonics domain representation of a signal [17, 18] is the inherent orthogonality of its basis functions. Hence, spherical harmonics-based solutions are becoming popular in other fields of acoustic signal processing, such as source localization [19], speech dereverberation [20], and noise suppression [21]. However, to the best of our knowledge, no method has been developed in the harmonics domain to estimate individual PSD components in a multi-source mixed acoustic environment. A harmonics-based single-source PSD estimator was proposed by the authors of this paper [14] for a noiseless reverberant environment in order to estimate the DRR.
I-B. Our approach and contribution
We consider a multi-source reverberant environment with coherent background noise and propose a novel technique to estimate the individual PSD of each desired source as well as the PSDs of the reverberant and noise components. The method is developed in the spherical harmonics domain to take advantage of the inherent orthogonality of the spherical harmonics basis functions, which ensures a well-posed solution without requiring any additional design criteria such as the M-matrix consideration adopted in [9]. Additionally, in contrast to the conventional beamformer-based methods [6], where only the auto-correlation coefficients of the beamformer outputs were used, we also incorporate the cross-correlations between the spherical harmonics coefficients in our solution. The latter approach was first used in [5] for estimating the DRR in a single-source environment. The larger number of correlation coefficients makes the algorithm suitable for separating a larger number of sources compared to the conventional techniques. We also derive a novel harmonics-based closed-form expression for the spatial correlation of a coherent noise field. We carry out a detailed theoretical analysis, demonstrate the practical impact of, and offer an engineering solution to, a Bessel-zero issue which, if not addressed correctly, significantly limits the performance of spherical array-based systems. Initial work on this approach was published in [22], where a limited-scope implementation of the idea was presented under a noiseless assumption; a detailed analysis of the implementation challenges and a benchmark in terms of perceived quality were not included in that prior work.
It is worth mentioning that the performance evaluation of the proposed algorithm took place in a practical environment with a commercially available microphone array, the 'Eigenmike' [23], without any prior knowledge of the source characteristics. Hence, the evaluation process incorporated all the practical deviations such as source localization error, thermal noise at the microphones, and non-ideal characteristics of the sound fields. We validate the performance of the algorithm by carrying out experiments in different acoustic environments with varying numbers of speakers using independent mixed-gender speech signals. The improved objective measures in terms of the perceptual evaluation of speech quality (PESQ) [24] and the frequency-weighted segmental signal-to-noise ratio (FWSegSNR) [25] indicate that the proposed algorithm is robust against such practical distortions and produces better estimates compared to the competing methods.
The paper is structured as follows. Section II states the problem and defines the objective of the work. In Section III, we develop a framework for PSD estimation based on the spatial correlation of the sound field coefficients in a noisy reverberant environment. We use this framework in Section IV to formulate a PSD estimation technique, and discuss and offer solutions to a practical challenge which we term the Bessel-zero issue. In Section V, we outline a practical application of the estimated PSDs. Finally, in Section VI, we evaluate and compare the performance of the proposed algorithm with other methods based on objective metrics and graphical aids.
II. Problem formulation
Let us consider a microphone array consisting of microphones to capture the sound field in a noisy reverberant room with distinct sound sources. The received signal at the microphone is given by
(1) 
where , , denotes the microphone position, is the RIR between the source and the microphone, is the discrete time index, denotes the convolution operation, is the source excitation for the sound source, and is the coherent noise¹ at the microphone position. The RIR can be decomposed into two parts
¹Here, coherent noise refers to colored background noise, distinct from white thermal noise, which can originate from any unknown noise source such as a room air-conditioning system.
(2) 
where and are the direct and reverberant path components, respectively. Substituting (2) into (1) and converting to the frequency domain using the short-time Fourier transform (STFT), we obtain
(3) 
where represent the corresponding signals of in the STFT domain, is the time frame index, , denotes the frequency, and is the speed of sound propagation. In the subsequent sections, the time frame index is omitted for brevity.
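For readers implementing the STFT conversion described above, a minimal multichannel sketch in Python is given below. The window length, overlap, and sampling rate here are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def to_stft(x, fs=16000, nperseg=512, noverlap=256):
    """Convert multichannel time-domain signals x (shape: mics x samples)
    to the STFT domain; returns frequencies, frame times, and the
    complex spectra with shape (mics, bins, frames)."""
    f, t, X = stft(x, fs=fs, window='hann', nperseg=nperseg, noverlap=noverlap)
    return f, t, X

# Toy example: two microphones observing different tones
fs = 16000
n = np.arange(fs) / fs
x = np.stack([np.sin(2 * np.pi * 440 * n), np.sin(2 * np.pi * 880 * n)])
f, t, X = to_stft(x, fs=fs)
```

Each time-frequency bin of `X` then corresponds to one instance of the frequency-domain model above.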
Given the measured sound pressure , we aim to estimate the individual source PSDs, , where represents the expectation over time. As a common application of PSDs, we further measure the signal enhancement of a beamformer output through a basic Wiener post-filter driven by the estimated PSDs, and demonstrate an improved source separation technique.
III. Framework for PSD estimation
In this section, we develop a spherical harmonics domain framework to establish the relationship between the sound field coefficients and the individual PSD components in a multisource noisy and reverberant environment. We use this model in Section IV and V to estimate individual PSD components and separate sound sources from a mixed recording.
III-A. Spatial domain representation of the room transfer function
We model the direct and reverberant path of room transfer function (RTF) in the spatial domain as
(4) 
(5) 
where represents the direct path gain at the origin for the source, ,
is a unit vector towards the direction of the
source, and is the reflection gain at the origin along the direction of for the source. Hence, we obtain the spatial domain equivalent of (3) by substituting the spatial domain RTF from (4) and (5) as
(6) 
III-B. Spherical harmonics decomposition
In this section, we derive the spherical harmonics expansion of (6) using the existing mathematical models. Spherical harmonics are a set of orthonormal basis functions which can represent a function over a sphere. A spherical function can be expressed in the spherical harmonics domain as
(7) 
where is defined over a sphere, , denotes the spherical harmonic of order and degree , and indicates corresponding coefficient. The spherical harmonics inherently possess the orthonormal property, i.e.
(8) 
where is the Kronecker delta function. Using (7) and (8), sound field coefficients can be calculated as
(9) 
where denotes the complex conjugate operation. As the realization of (9) requires an impractical continuous microphone array over a sphere, several array structures have been proposed in the literature to estimate the coefficients, such as spherical open/rigid arrays [26, 27], multiple circular arrays [28], or a planar array with differential microphones [29]. The subsequent theory is developed for a spherical microphone array; however, it is equally applicable to any array geometry that can produce the sound field coefficients (e.g., [30]).
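As an illustration of the discrete approximation of the analysis integral, the following sketch estimates sound field coefficients from pressure samples using a Gauss-Legendre quadrature grid. The grid and weight choices are our assumptions for the example; a real array such as the Eigenmike uses fixed microphone positions with precomputed weights.

```python
import numpy as np
from scipy.special import sph_harm

def gauss_grid(N):
    """Quadrature grid exact for spherical harmonics up to order N:
    Gauss-Legendre nodes in the polar angle, uniform samples in azimuth."""
    x, gw = np.polynomial.legendre.leggauss(N + 1)
    theta = np.arccos(x)                      # polar angles
    nphi = 2 * N + 1                          # >= 2N+1 azimuthal samples
    phi = 2 * np.pi * np.arange(nphi) / nphi
    T, P = np.meshgrid(theta, phi, indexing='ij')
    W = np.repeat(gw[:, None], nphi, axis=1) * (2 * np.pi / nphi)
    return T.ravel(), P.ravel(), W.ravel()

def sh_coefficients(p, theta, phi, w, N):
    """Approximate the coefficients as weighted sums of pressure samples
    against conjugated spherical harmonics (discrete analysis step)."""
    return {(n, m): np.sum(w * p * np.conj(sph_harm(m, n, phi, theta)))
            for n in range(N + 1) for m in range(-n, n + 1)}

# Toy check: a field equal to Y_1^0 should yield a unit (1, 0) coefficient
theta, phi, w = gauss_grid(2)
p = sph_harm(0, 1, phi, theta)
coeffs = sh_coefficients(p, theta, phi, w, 2)
```

Note that SciPy's `sph_harm` takes the azimuthal angle before the polar angle, which is easy to get backwards.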
The spherical harmonics decomposition of the incident sound field of (6) is given by [17]
(10) 
where is the array radius, is a unit vector towards the direction of the microphone, and is the arrayindependent sound field coefficient. The function depends on the array configuration and is defined as
(11) 
where , and denote the order spherical Bessel and Hankel functions of the first kind, respectively, and refers to the corresponding first derivative. Eq. (10) can be truncated at the sound field order due to the high-pass nature of the higher-order Bessel functions [31, 32], with denoting the ceiling operation. We can estimate the sound field coefficients with a spherical microphone array as [26, 27]
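The radial term and the truncation rule can be sketched as follows, assuming the standard open- and rigid-sphere forms from the spherical array literature; the paper's exact convention in (11) may differ, so treat this as an illustrative assumption.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def b_n(n, kr, array='open', ka=None):
    """Radial term for a spherical array: 4*pi*i^n*j_n(kr) for an open
    sphere; the rigid-sphere form subtracts the scattered component
    via Hankel functions of the first kind (standard results, used
    here as an assumed convention)."""
    if array == 'open':
        return 4 * np.pi * (1j ** n) * spherical_jn(n, kr)
    ka = kr if ka is None else ka   # rigid sphere, mics on the surface
    djn = spherical_jn(n, ka, derivative=True)
    dhn = djn + 1j * spherical_yn(n, ka, derivative=True)   # h_n^(1)'
    hr = spherical_jn(n, kr) + 1j * spherical_yn(n, kr)     # h_n^(1)
    return 4 * np.pi * (1j ** n) * (spherical_jn(n, kr) - (djn / dhn) * hr)

def truncation_order(k, r):
    """Sound field order from the ceiling truncation rule."""
    return int(np.ceil(k * r))
```

For small arguments the open-sphere zeroth-order term approaches its full magnitude, while higher orders vanish, which is the high-pass behavior that justifies the truncation.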
(12) 
where is imposed to avoid spatial aliasing and are suitable microphone weights that enforce the orthonormal property of the spherical harmonics with a limited number of sampling points, i.e.
(13) 
Similarly, the spherical harmonics decomposition of the coherent noise component of (6) is
(14) 
where is the sound field coefficient due to the coherent noise sources. Finally, the spherical harmonics expansion of the Green’s function is given by [33, pp. 27–33]
(15) 
Using (10), (14), and (15) in (6), we obtain the harmonics-domain representation of a noisy reverberant sound field as
(16) 
Hence, the expression for the combined sound field coefficients is obtained from (16) as
(17)  
(18) 
where is defined as the sound field coefficients related to the direct and reverberant components of the sound signals.
It is important to note that we consider a far-field sound propagation model in (4) and (5). For near-field sound propagation, the corresponding Green's function and its spherical harmonics expansion are defined as [33, p. 31]
(19) 
where is the position vector of the source and denotes the Euclidean norm. In this work, we use the far-field assumption for mathematical tractability; however, the model is equally applicable to near-field sound propagation.
III-C. Spatial correlation of the sound field coefficients
In this section, we develop novel closed-form expressions for the spatial correlation between the harmonics coefficients of the reverberant and noise fields in a multi-source environment. From (17), the spatial correlation between and is
(20) 
where we assume uncorrelated speech and noise sources, i.e.
(21) 
III-C1. Spatial correlation of the direct and reverberant components
From (17) and (18), the spatial crosscorrelation between the direct and reverberant path coefficients is
(22) 
where . Due to the autonomous behavior of the reflective surfaces in a room (i.e., the reflection gains from the reflective surfaces are independent of the direct path gain), the cross-correlation between the direct and reverberant gains is negligible, i.e.,
(23) 
Furthermore, we assume that the sources are uncorrelated with each other, and so are the reverberant path gains from different directions, i.e.
(24) 
(25) 
where denotes the absolute value. Using (23), we eliminate the cross terms on the right-hand side of (22) and deduce
(26) 
Defining as the PSD of the source at the origin, we use (24) and (25) in (26) to obtain
(27) 
Since is defined over a sphere, we can represent it using the spherical harmonics decomposition as
(28) 
where is the coefficient of the power of a reverberant sound field due to source and is a nonnegative integer defining corresponding order. Substituting the value of from (28) into (27), we derive
(29) 
Using the definition of Wigner constants from Appendix A, we rewrite (29) as
(30) 
where
(31) 
is the total reverberant power for order and degree . Note that the spatial correlation model developed in [14] was derived for the single-source case, i.e. , and did not include background noise in the model.
III-C2. Spatial correlation model for coherent noise
Following the derivation of (9), we obtain the expression for from (14) as
(32) 
where . Hence, we deduce
(33) 
The spatial correlation of the coherent noise is given by [34]
(34) 
where is the complex gain of the noise sources from the direction. In a reverberant room, the noise field can be assumed to be diffuse [35]; hence, (34) reduces to
(35) 
where is the PSD of the noise field at . Furthermore, for the sake of simplicity, we assume that the noise field is spatially white within the small aperture of a spherical microphone array (e.g., the commercially available spherical microphone array 'Eigenmike' [23] has a radius of cm), i.e. . Hence, from (33) and (35), we get
(36) 
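The diffuse-field assumption used above can be checked numerically: averaging a single-frequency field over many uniformly distributed plane-wave arrivals reproduces the classical sinc spatial coherence between two points. This is a standard diffuse-field result used purely for illustration; the paper works with the harmonics-domain form.

```python
import numpy as np

def diffuse_coherence_mc(k, d, n_waves=200000, seed=0):
    """Monte-Carlo spatial coherence of a diffuse field between two points
    separated by distance d along the z-axis at wavenumber k."""
    rng = np.random.default_rng(seed)
    # Uniform directions on the sphere: cos(theta) uniform in [-1, 1]
    cos_t = rng.uniform(-1.0, 1.0, n_waves)
    # Average the inter-point phase factor of each plane wave
    return np.mean(np.exp(1j * k * d * cos_t))

k, d = 2 * np.pi * 1000 / 343, 0.05      # 1 kHz, 5 cm spacing (example)
coh = diffuse_coherence_mc(k, d)
expected = np.sinc(k * d / np.pi)        # sin(kd) / (kd)
```

As the spacing shrinks toward zero, the coherence approaches unity, consistent with the spatially-white approximation over a small array aperture.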
III-C3. The combined model
Finally, from (30) and (36), we obtain the complete model of the spatial correlation in a noisy reverberant environment as
(37) 
where
(38) 
(39) 
(40) 
The integrals in (40) can be evaluated using a numerical computing tool. Alternatively, (40) can be approximated through finite summations as
(41) 
where and are chosen in such a way that the orthonormal property of the spherical harmonics holds. A closed-form expression for (40) is also derived in Appendix B with the help of the addition theorem of the spherical Bessel functions [36] as
(42) 
The spatial correlation model of (37) was developed considering far-field sound propagation. Following the discussion of Section III-A, it is evident from (15) and (19) that a near-field source model for the direct path signals changes the direct path coefficient of (37) to
(43) 
Hence, to design a system with near-field sources, we require additional knowledge of the source distance .
IV. PSD estimation
In this section, we reformulate (37) in matrix form and solve it in the least-squares sense to estimate the source, reverberant, and noise PSDs. We also discuss an implementation issue and offer engineering solutions to the problem.
IV-A. Source PSDs
Defining
(44) 
we can write (37) in a matrix form by considering the crosscorrelation between all the available modes as
(45) 
where
(46) 
(47) 
(48) 
where denotes the transpose operation. Note that the frequency dependency is omitted in (45)-(48) to simplify the notation. We estimate the source, reverberant, and noise PSDs by
(49) 
where indicates the pseudo-inverse of a matrix. In a practical implementation, half-wave rectification or a similar measure is required on (49) to avoid negative PSDs. The terms and in the vector of (49) represent the estimated source and noise PSDs at the origin, respectively. It is worth noting that (49) can readily be used for estimating source PSDs in a non-reverberant or noiseless environment by respectively discarding the and terms from the translation matrix in (47).
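The least-squares step with pseudo-inversion and half-wave rectification can be sketched as follows; the translation matrix `T` and correlation vector `c` here are hypothetical stand-ins for the quantities defined above.

```python
import numpy as np

def estimate_psds(T, c):
    """Least-squares PSD estimate via the pseudo-inverse, followed by
    half-wave rectification to avoid negative PSDs."""
    psi = np.linalg.pinv(T) @ c
    return np.maximum(psi.real, 0.0)

# Toy check: recover known non-negative PSDs from an overdetermined system
rng = np.random.default_rng(1)
T = rng.standard_normal((16, 5))   # more correlation equations than unknowns
psi_true = np.array([1.0, 0.5, 0.0, 2.0, 0.3])
c = T @ psi_true
psi_hat = estimate_psds(T, c)
```

In the noiseless toy case the rectification is inactive; with estimation noise it clips spurious negative values that have no physical meaning as powers.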
IV-B. Reverberant PSD
The total reverberation PSD at the origin due to all the sound sources is
(50) 
Using (28), the definition of in (31), and the symmetrical property of the spherical harmonics, (50) can be written as
(51) 
where is the Dirac delta function. The PSD estimation process for a single frequency bin is summarized in Algorithm 1.
IV-C. Bessel-zero issue
One of the challenges in calculating the vector is the Bessel-zero issue, which we define as the case when of (12) takes a near-zero value, causing noise amplification and inducing estimation error. This situation arises in three distinct scenarios:
IV-C1. At low frequencies
To avoid an underdetermined solution, and to improve the estimation accuracy of (49) by incorporating extra spatial modes, we force a minimum value of the sound field order at the lower frequency bins. For example, with , and Hz, the calculated sound field order is and the dimension of in (45) becomes , which results in an underdetermined system. Alternatively, choosing a smaller value of avoids an underdetermined system, but the availability of fewer spatial modes affects the estimation accuracy. Hence, we impose a lower bound on for all frequency bins such that , where denotes the maximum value and is the lower limit of . For this work, we chose empirically. This, however, results in the aforementioned Bessel-zero issue for at the lower frequencies, as shown in Fig. 1(a). To avoid this issue, we impose a lower bound on as well, such that
(52) 
where is a predefined floor value for .
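The flooring can be sketched as a magnitude clamp that preserves the phase of the radial term; the floor value used here is an arbitrary illustrative assumption.

```python
import numpy as np

def floor_bn(bn, floor=0.1):
    """Clamp the magnitude of the radial terms from below while keeping
    their phase, so near-zero values cannot amplify noise on inversion."""
    bn = np.asarray(bn, dtype=complex)
    mag = np.abs(bn)
    phase = np.exp(1j * np.angle(bn))
    return np.where(mag < floor, floor * phase, bn)

# A near-zero term is lifted to the floor; a healthy term is untouched
vals = floor_bn(np.array([1e-4 + 0j, 0.5 + 0.5j]), floor=0.1)
```

Preserving the phase matters because the floored terms still multiply complex coefficients downstream; only the problematic magnitude is altered.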
IV-C2. At the mode activation boundary
This scenario appears in the first few frequency bins after a higher-order mode () becomes active. As an example, for cm, the order modes are activated approximately at and , where and are defined as the values of when we consider and , respectively. In the proposed algorithm, the order modes are introduced at , and we observe from Fig. 1(a) that the value of is close to zero for the first few frequency bins after the activation of the order modes. To overcome this, we introduce another lower-bound criterion on as
(53) 
It is important to note that the modifications proposed in (52) and (53) only affect the higher-order modes at each frequency bin, whereas the lower-order modes remain unchanged. Hence, the distortion resulting from these modifications is expected to have a less adverse impact than the Bessel-zero issue itself.
IV-C3. Zero-crossing at a particular frequency
Another case of the Bessel-zero issue occurs when a Bessel function crosses zero at a higher frequency. This is more prominent with the open array configuration, as shown in Fig. 1(a). Using a rigid microphone array, as we did in our experiments, is one way to avoid this issue. Also note that the modifications proposed for the previous two scenarios also take care of this zero-crossing issue of the Bessel functions for an open array when .
Fig. 1(b) plots the magnitude of after the modifications for different values of . The impact of the Bessel-zero issue and the improvement due to the proposed modifications are discussed in the results section.
V. A practical application of PSD estimation in source separation
Accurate knowledge of the PSDs of different signal components is desirable in many audio processing applications such as spectral subtraction and Wiener filtering. In this section, we demonstrate a practical use of the proposed method in separating the speech signals of multiple concurrent speakers in a noisy reverberant room. This example uses the signal PSDs estimated by the proposed algorithm to significantly enhance the output of a conventional beamformer with a Wiener post-filter. The performance of the Wiener filter largely depends on the estimation accuracy of the source and interferer PSDs, which is where the importance of a PSD estimation algorithm lies. A block diagram of the complete source separation technique is shown in Fig. 2 and explained in the subsequent sections.
V-A. Estimation of the direction of arrival
The proposed algorithm (like any beamforming technique) requires a priori knowledge of the direction of arrival (DOA) of the desired speech signals. If the source positions are unknown, any suitable localization technique, e.g., multiple signal classification (MUSIC) [37], can be used to estimate the DOAs of the source signals. In the experiments where we measured the performance of the proposed algorithm, we used a frequency-smoothed variant of the MUSIC algorithm implemented in the spherical harmonics domain [38].
V-B. Choice of beamformer
Several beamforming techniques are available in the literature, such as delay-and-sum (DS), maximum directivity (MD), and minimum variance distortionless response (MVDR); the choice of technique depends on the use case in practice. In this work, where the undesired signal includes the correlated reverberant component of the desired signal, an MVDR beamformer can result in desired-signal cancellation if the undesired PSD components at each microphone position are unknown. Hence, a simple delay-and-sum or maximum directivity beamformer is more appropriate for the current case, whose output, when steered towards a far-field source, is given by [27, 39]
(54) 
where
(55) 
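A harmonics-domain steered beamformer of the plane-wave decomposition (maximum directivity) type can be sketched as follows; the steering convention and normalization are our assumptions and may differ from the paper's exact expression.

```python
import numpy as np
from scipy.special import sph_harm

def md_beamformer(alpha, bn, theta0, phi0, N):
    """Steer a plane-wave decomposition beamformer to polar angle theta0
    and azimuth phi0, given sound field coefficients alpha[(n, m)] and
    radial terms bn[n] (assumed convention)."""
    return sum((alpha[(n, m)] / bn[n]) * sph_harm(m, n, phi0, theta0)
               for n in range(N + 1) for m in range(-n, n + 1))

# Toy check: a unit plane wave from (th_s, ph_s) has coefficients
# alpha_nm = b_n * conj(Y_nm(source)) under this convention
N, th_s, ph_s = 2, 1.0, 2.0
bn = {n: 4 * np.pi * (1j ** n) * 0.3 for n in range(N + 1)}  # arbitrary nonzero
alpha = {(n, m): bn[n] * np.conj(sph_harm(m, n, ph_s, th_s))
         for n in range(N + 1) for m in range(-n, n + 1)}
z_on = md_beamformer(alpha, bn, th_s, ph_s, N)          # steered at source
z_off = md_beamformer(alpha, bn, th_s + np.pi / 2, ph_s, N)
```

By the spherical harmonic addition theorem, the on-target response equals the squared mode count over the full solid angle, and it dominates the off-target response.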
V-C. Wiener post-filter
Regardless of the choice of beamformer, a post-filter is found to enhance the beamformer output in most cases [40, 13, 35]. Hence, at the last stage, we apply a Wiener post-filter to the beamformer output using the estimated PSD values. The transfer function of the Wiener filter for the source is given by
(56) 
where all the PSD components have already been estimated by the proposed algorithm and are available in the vector . Hence, the source signal is estimated by
(57) 
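Per time-frequency bin, the Wiener post-filter reduces to a real-valued gain formed from the estimated PSDs; the sketch below uses hypothetical PSD values and a small regularizer of our own choosing to avoid division by zero.

```python
import numpy as np

def wiener_gain(phi_target, phi_undesired, eps=1e-12):
    """Per-bin Wiener gain: target PSD over total PSD (eps is a small
    regularizer added here to avoid division by zero)."""
    return phi_target / (phi_target + phi_undesired + eps)

def apply_postfilter(beamformer_out, phi_target, phi_undesired):
    """Scale the beamformer output spectrum by the Wiener gain."""
    return wiener_gain(phi_target, phi_undesired) * beamformer_out

g_equal = wiener_gain(np.array([1.0]), np.array([1.0]))   # equal powers
g_clean = wiener_gain(np.array([1.0]), np.array([0.0]))   # no interference
```

The gain lies in [0, 1]: bins dominated by the target pass nearly unchanged, while bins dominated by reverberation or noise are attenuated, which is why accurate PSD estimates directly determine the post-filter's quality.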
VI. Experimental results
In this section, we demonstrate and discuss the experimental results based on practical recordings in a noisy and reverberant room using , , and speech sources.
VI-A. Experimental setup
We evaluated the performance in distinct scenarios under different reverberant environments² as shown in Table I. The reverberation time and DRR in Table I were calculated using the methods in [41]. All the experiments included background noise from the air-conditioning system and the vibration of the electrical equipment in the lab. We created separate training and evaluation datasets consisting of male and female voices from the TIMIT database [42]. The training dataset was used to set the parameters such as , etc., whereas the algorithm performance was measured using the evaluation dataset. Each of the scenarios was evaluated times with different mixtures of mixed-gender speech signals, making a total of unique experiments. We used the far-field assumption in each case for computational tractability. We measured the performance with the true and the estimated DOAs³ (denoted ProposedGT and ProposedEST, respectively), where the latter were found with a MUSIC-based algorithm. We compared the performance with the multiple-beamformer-based method of [6] (denoted MBF) and a conventional DS beamformer (denoted BF) as, to the best of our knowledge, no similar harmonics-based technique has been proposed in the literature. Note that, for fairness of comparison, we used all microphones of the Eigenmike for all the competing methods. Furthermore, as the experiments were designed with practical recordings instead of a simulation-based approach, the robustness of the proposed algorithm against realistic thermal noise at the microphones was inherently evaluated through the process.
²The 8-speaker case was not tested in Rooms B and C due to logistical issues.
³The true and estimated DOAs for are listed in Appendix C.
Room size (m)  (ms)  DRR  # Speakers  

Room A  dB  m  
Room B  dB  m  
Room C  dB  m 
The Eigenmike consists of pressure microphones distributed on the surface of a sphere with a radius of cm. The mixed sound was recorded at a kHz sampling rate, but downsampled to kHz for computational efficiency. The recorded mixed signals were then converted to the frequency domain with a ms Hanning window, frame overlap, and a -point fast Fourier transform (FFT). All subsequent processing was performed in the STFT domain with the truncated sound field order , , and , unless mentioned otherwise. The noise PSD was assumed to have significant power up to kHz, whereas all other PSD components were estimated over the whole frequency band. The expected value in (44) was computed using an exponentially weighted moving average as
(58) 
where is a smoothing factor; we chose .
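The recursive averaging can be sketched as follows; the smoothing factor value here is an illustrative assumption rather than the paper's chosen setting.

```python
import numpy as np

def ema_update(prev, instantaneous, alpha=0.8):
    """One step of an exponentially weighted moving average used to
    approximate a time expectation: alpha weighs the running history,
    (1 - alpha) weighs the new instantaneous estimate."""
    return alpha * prev + (1 - alpha) * instantaneous

# Toy check: repeated updates converge towards a constant input
est = 0.0
for _ in range(100):
    est = ema_update(est, 1.0)
```

In the algorithm, `prev` and `instantaneous` would be correlation matrices of spherical harmonics coefficients rather than scalars; the recursion is identical elementwise.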
VI-B. Selection of 
represents the order of the power of the reverberant sound field. An exact harmonics analysis of the power of a reverberant sound field is a difficult task and depends on the structure, orientation, and characteristics of the reflective surfaces. Hence, unlike free-field sound propagation, a reverberant field cannot be analytically decomposed into a linear combination of Bessel functions, which is what enables the order truncation of a non-reverberant sound field [31, 32]. Theoretically, extends to infinity; however, we need to consider several limiting factors:
- avoiding an underdetermined system of equations in (43), which imposes a limit on as
(59) 
- saving computational complexity by choosing the minimum required value of .
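The first constraint can be sketched as a simple mode-count check. Here we assume, as a reading of the matrix dimensions in the estimation step, that the number of available equations equals the squared mode count and that the unknowns are the source PSDs, the reverberant terms, and one noise PSD; these dimensional assumptions are ours.

```python
def max_reverb_order(N, L):
    """Largest reverberation order V such that the assumed unknown count
    L + (V + 1)**2 + 1 (L source PSDs, (V+1)^2 reverberant terms, one
    noise PSD) does not exceed the ((N + 1)**2)**2 assumed available
    correlation equations."""
    n_eqs = ((N + 1) ** 2) ** 2
    V = -1
    while L + (V + 2) ** 2 + 1 <= n_eqs:
        V += 1
    return V

v_max = max_reverb_order(1, 2)   # e.g., first-order array, two sources
```

Raising the array order rapidly increases the admissible reverberation order, since the equation count grows with the fourth power of (N + 1).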
It is also important to note that the nature of the reverberant field plays an important role in determining . For example, for a perfectly diffuse reverberant room with spatially-uniform reverberant power, only the lowest-order () mode is enough. On the other hand, a room with strong directional characteristics requires higher orders to be considered. Hence, should be tuned separately for each reverberant environment to obtain an acceptable performance. In our experiments, we chose for Rooms A, B, and C, respectively, based on the performance with the training dataset.
VI-C. Evaluation metrics
We evaluate the performance through visual comparisons of the true and estimated PSDs of the sound sources. In addition, we introduce an objective performance index to measure the full-band normalized PSD estimation error as