Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained with Noise Signals

07/31/2018, by Soumitro Chakrabarty, et al.

Supervised learning based methods for source localization, being data driven, can be adapted to different acoustic conditions via training and have been shown to be robust to adverse acoustic environments. In this paper, a convolutional neural network (CNN) based supervised learning method for estimating the direction-of-arrival (DOA) of multiple speakers is proposed. Multi-speaker DOA estimation is formulated as a multi-class multi-label classification problem, where the assignment of each DOA label to the input feature is treated as a separate binary classification problem. The phase component of the short-time Fourier transform (STFT) coefficients of the received microphone signals is directly fed into the CNN, and the features for DOA estimation are learnt during training. Utilizing the assumption of disjoint speaker activity in the STFT domain, a novel method is proposed to train the CNN with synthesized noise signals. Through experimental evaluation with both simulated and measured acoustic impulse responses, the ability of the proposed DOA estimation approach to adapt to unseen acoustic conditions and its robustness to unseen noise types are demonstrated. Through additional empirical investigation, it is also shown that with an array of M microphones our proposed framework yields the best localization performance with M-1 convolution layers. The ability of the proposed method to accurately localize speakers in a dynamic acoustic scenario with a varying number of sources is also shown.


I Introduction

Many applications, such as hands-free communication, teleconferencing, robot audition and distant speech recognition, require information on the location of sound sources in the acoustic environment. Information regarding the source location can be utilized for the task of enhancing the signal coming from a specific location while suppressing the undesired signal components. In some applications, this information is used for camera steering, whereas in applications like robot audition the source location information is used for navigation purposes. The relative direction of a sound source with respect to a microphone array is generally given in terms of the direction of arrival (DOA) of the sound wave originating from the source position. In most practical scenarios, this information is not available and the DOA of the sound source needs to be estimated. However, accurate DOA estimation is a challenging task in the presence of noise and reverberation. The task becomes even more difficult when the DOAs of multiple sound sources need to be estimated.

In the literature related to DOA estimation, there exist two kinds of estimation paradigms: broadband and narrowband DOA estimation. In narrowband DOA estimation, the task of DOA estimation is performed separately for each frequency sub-band, whereas in broadband DOA estimation the task is performed for the whole input spectrum. In this work, the focus is on broadband DOA estimation.

Over the years, several approaches have been developed for the task of broadband DOA estimation. Some popular approaches are: i) subspace based approaches such as multiple signal classification (MUSIC) [1, 2], ii) time difference of arrival (TDOA) based approaches that use the family of generalized cross correlation (GCC) methods [3, 4], iii) generalizations of the cross-correlation methods such as steered response power with phase transform (SRP-PHAT) [5], and multichannel cross correlation coefficient (MCCC) [6], iv) adaptive multichannel time delay estimation using blind system identification based methods [7], v) probabilistic model based methods such as the maximum likelihood method [8] and vi) methods based on histogram analysis of narrowband DOA estimates [9, 10]. These methods are generally formulated under the assumption of free-field propagation of sound waves; however, in indoor acoustic environments this assumption is violated due to the presence of reverberation, leading to severe degradation in their performance. Additionally, these methods are also not robust to noise and generally have a high computational cost [6].

Compared to the signal processing based approaches, supervised learning approaches, being data driven, have the advantage that they can be adapted to different acoustic conditions via training. Also, if training data from diverse acoustic conditions are available, then these approaches can be made robust against noise and reverberation. Following the recent success of deep learning based supervised learning methods in various signal processing related tasks [11, 12], different methods for DOA estimation have been proposed [13, 14, 15, 16, 17, 18, 19]. A common aspect of the methods proposed in [13, 14, 15, 16, 17] is that they all involve an explicit feature extraction step. In [16, 14], GCC vectors, computed from the microphone signals, are provided as input to the learning framework. In [15, 17], similar to the computations involved in the MUSIC method for localization, the eigenvalue decomposition of the spatial correlation matrix is performed to obtain the eigenvectors corresponding to the noise subspace, which are provided as input to a neural network. In [13], a binaural setup is considered and binaural cues at different frequency sub-bands are computed and given as input. Such feature extraction steps generally lead to a high computational cost. Additionally, when features computed from the microphone signals are given as input, the neural network mainly learns the functional mapping from the features to the final DOA, which can possibly lead to a lack of robustness against adverse acoustic conditions.

One of the main reasons for the success of deep learning has been the encapsulation of the feature extraction step into the learning framework. Also, by studying the traditional signal processing based methods for DOA estimation, it can be seen that most methods exploit the phase difference information between the microphone signals to perform localization. Based on these observations, in [18], the current authors proposed a convolutional neural network (CNN) based supervised learning method for broadband DOA estimation of a single active speaker per short-time Fourier transform (STFT) time frame. Rather than involving an explicit feature extraction step, the phase component of the STFT coefficients of the input signal was directly provided as input to the neural network. Another contribution of that work was to show the possibility of training the system using synthesized noise signals, which makes the creation of training data much simpler compared to using real-world signals like speech.

Following that, in [19], the previously proposed framework was extended to estimate multiple speaker DOAs. There, a novel method was developed to generate the training data using synthesized noise signals for multi-speaker localization. One of the main challenges of using noise signals for the multi-speaker case is that, for overlapping signals, the phases of the STFT coefficients of the individual signals combine non-linearly and depend on their magnitudes. This makes the learning procedure for the CNN difficult. To overcome this problem, the property of W-disjoint orthogonality [20], which holds approximately for speech signals, was utilized. In terms of evaluation, only preliminary results with simulated data for a single acoustic setup were shown in [19].

In this paper, we further extend the initial work on DOA estimation of multiple speakers presented in [19]. The formulation of the task of multi-speaker DOA estimation as a multi-label multi-class classification problem is presented, where first the posterior probabilities of the active source DOAs are estimated at the frame level. Then, these frame-level probabilities are averaged over multiple time frames, depending on the chosen block length over which the final DOA estimates are to be obtained. From these averaged posterior probabilities, assuming the number of speakers within that block is known, the DOAs corresponding to the classes with the highest probabilities are chosen as the final DOA estimates. To build robustness to adverse acoustic conditions, multi-condition training in the form of training data from diverse acoustic scenarios is performed. A detailed description of the previously proposed method for generating training data using synthesized noise signals is also presented.

With respect to the proposed CNN architecture, we first posit that, due to the small filters chosen to learn the phase correlations between neighboring microphones, M − 1 convolution layers are required to learn from the phase correlations between all the microphone pairs, where M is the number of microphones in the array. Through experimental evaluation, the requirement of M − 1 layers is shown in terms of both localization performance and the number of trainable parameters. The influence of the distance between the sources and the microphone array is also investigated experimentally. Through further experiments with both simulated and measured room impulse responses (RIRs), the robustness of the proposed method to unseen acoustic conditions and noise types is investigated. Additionally, we also show that, even when the CNN is trained to estimate the posterior probabilities of at most two DOA classes per STFT time frame, at a block level the proposed method can be used to localize more than two speakers.

The remainder of this paper is organized as follows. In Section II, the formulation of the problem as a multi-class multi-label classification problem is described. In Section III, we review the input feature representation used in our framework. The task of obtaining the final DOA estimates in our proposed system is described in Section IV. Section V presents a detailed description of the proposed method for generating training data using synthesized noise signals. Experimental evaluation of the proposed method is presented in Section VI. Section VII concludes the paper.

II Problem Formulation

We want to utilize a CNN based supervised learning framework for estimating the DOAs of multiple simultaneously active sources by learning the mapping from the recorded microphone signals to the DOAs of the active speech sources using a large set of labeled data. The DOA estimation in this work is performed for signal blocks that consist of multiple time frames of the STFT representation of the observed signals. The block length can be chosen depending on the application scenario. For example, for dynamic sound scenes it might be preferable to choose shorter block lengths than in a scenario where the sources are known to be static.

The problem of multi-source DOA estimation is formulated as an I-class multi-label classification problem. As the first step, the whole DOA range is discretized to form a set of I possible DOA values, {θ_1, …, θ_I}. A class vector of length I is then formed, where each class corresponds to a possible DOA value in this set. In this work, we assume an independent source DOA model, i.e., the spatial locations of the sources are independent of each other. Due to this assumption, multi-label classification can be tackled using the binary relevance method [21], where the assignment of each DOA class label to the input is treated as a separate binary classification problem. As stated earlier, the aim is to obtain the DOA estimates of multiple speakers for a signal block; however, the input to the system is a feature representation of each STFT time frame separately.
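As a concrete illustration of this encoding (a minimal sketch, assuming the 5° grid over 0° to 180° used later in the experiments; the function names are hypothetical), the ground-truth DOAs of the active speakers can be mapped to a multi-hot target vector with one independent binary entry per class:

```python
import numpy as np

def make_doa_grid(start_deg=0, stop_deg=180, step_deg=5):
    """Discretize the DOA range of a ULA into I candidate classes."""
    return np.arange(start_deg, stop_deg + step_deg, step_deg)

def encode_labels(true_doas_deg, grid):
    """Multi-hot target vector: one independent binary label per DOA class."""
    target = np.zeros(len(grid), dtype=np.float32)
    for doa in true_doas_deg:
        target[np.argmin(np.abs(grid - doa))] = 1.0  # nearest grid class
    return target

grid = make_doa_grid()              # 37 classes for 0..180 deg in 5 deg steps
y = encode_labels([60, 105], grid)  # two active speakers -> two entries set to 1
```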

As shown in Fig. 1, a supervised learning framework consists of a training and a test phase. In the training phase, the CNN is trained with a training data set that consists of pairs of fixed-dimension feature representations for each STFT time frame and the corresponding true DOA class labels. In the test phase, given the input feature representation corresponding to a single STFT time frame, the first task is to estimate the posterior probability of each DOA class. Following this, depending on the chosen block length, the frame-level probabilities are averaged over all the time frames in the block. Finally, considering C sources, the DOA estimates are given by selecting the C DOA classes with the highest probabilities.

In this work, we consider the number of sources to be known. As an alternative, the number of active sources could be estimated from the number of clear peaks in the averaged posterior probabilities for a signal block, or the recorded signal from a reference microphone could be used for speaker count estimation with the method proposed in [22]. Investigating the best strategy for this problem is left for future work.

Fig. 1: Block diagram of the proposed system.

III Input Representation

Fig. 2: Proposed Architecture.

In this work, the aim is to learn the relevant features for the task of DOA estimation via training rather than have an explicit feature extraction step to compute the input to be given to the system. Therefore we use the phase map [18, 19] as the input feature representation in this work. For the sake of completeness, we give a brief description of this representation.

As described earlier, the input to the DNN framework is a feature representation corresponding to each STFT time frame. Let us consider that the received microphone signals are transformed to the STFT domain using an N_f-point discrete Fourier transform (DFT). In the STFT domain, the observed signals at each TF instance are represented by complex numbers. Therefore, the observed signal can be expressed as

Y_m(n, k) = A_m(n, k) e^{j φ_m(n, k)},     (1)

where A_m(n, k) represents the magnitude component and φ_m(n, k) denotes the phase component of the STFT coefficient of the received signal at the m-th microphone for the n-th time frame and k-th frequency bin. In this work, we directly provide the phase component of the STFT coefficients of the received signals as input to our system. Note that this phase term consists of the phase of the source signal along with the effect of the propagation path. The idea is to make the system learn the relevant features for DOA estimation from the phase component through training.

Since the aim is to compute the posterior probabilities of the DOA classes at each time frame, the input feature for the n-th time frame is formed by arranging the phase components φ_m(n, k) for each time-frequency bin and each microphone into a matrix of size K × M, where K is the total number of frequency bins, up to the Nyquist frequency, at each time frame and M is the total number of microphones in the array. We call this feature representation the phase map. For example, for a microphone array with M = 4 microphones, the input feature matrix is of size K × 4.
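A minimal sketch of how such a phase map could be computed from M time-aligned microphone signals using scipy's STFT (the window length, sampling rate and function name are illustrative assumptions, not necessarily the exact configuration used in this paper):

```python
import numpy as np
from scipy.signal import stft

def phase_maps(mics, fs, win_len=512, overlap=0.5):
    """Compute one K x M phase map per STFT time frame.

    mics: array of shape (M, num_samples) with the microphone signals.
    Returns an array of shape (num_frames, K, M) holding the STFT phases.
    """
    noverlap = int(win_len * overlap)
    # Y has shape (M, K, num_frames), with K = win_len // 2 + 1 bins up to Nyquist
    _, _, Y = stft(mics, fs=fs, nperseg=win_len, noverlap=noverlap)
    phase = np.angle(Y)                    # keep only the phase component
    return np.transpose(phase, (2, 1, 0))  # -> (num_frames, K, M)

# Example: 1 s of a 4-microphone recording at 16 kHz
mics = np.random.randn(4, 16000)
X = phase_maps(mics, fs=16000)             # each X[n] is one K x 4 network input
```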

Given the input representations, the next task is to estimate the posterior probabilities of the DOA classes for each time frame. For this, we propose a CNN based supervised learning method, described in the following section.

IV DOA estimation with CNNs

CNNs are a variant of the standard fully-connected neural network, where the architecture typically consists of one or more convolution layers followed by fully connected layers leading to the output [23]. In this work, the main motivation behind using CNNs is to learn the discriminative features for DOA estimation from the phase map input by applying small local filters to learn the phase correlations at the different frequency sub-bands.

Given the phase map as input, the CNN generates the posterior probability for each of the DOA classes. Let us denote the phase map for the n-th time frame as Φ_n. Then the posterior probability generated by the CNN at the output is given by p(θ_i | Φ_n), where θ_i is the DOA corresponding to the i-th class. In Fig. 2, the CNN architecture used in this work is shown. In the convolution layers, small filters of size 2 × 1 (spanning two neighboring microphones at a single frequency bin) are applied to learn the phase correlations between neighboring microphones at each frequency sub-band separately. This is in contrast to [18], where square filters of size 2 × 2 were used to also learn features from neighboring frequency bins. However, in the case of multiple speakers, neighboring frequency bins might contain dominant activity from different speakers; therefore, in this work we use 2 × 1 filters. These learned features for each sub-band are then aggregated by the fully connected layers for the classification task. The proposed architecture consists of at most M − 1 convolution layers, where M is the number of microphones, since after M − 1 layers performing 2D convolutions is no longer possible as the feature maps become vectors.

In terms of the design choice related to the number of convolution layers, we posit that, by using small filters of size 2 × 1, with each subsequent convolution layer after the first one, for each sub-band, the phase correlation information from different microphone pairs is aggregated due to the growing receptive field of the filters, and that M − 1 convolution layers are required to incorporate the correlation between all microphone pairs into the learned features. In Section VI-B4, we experimentally demonstrate that indeed M − 1 convolution layers are required to obtain the best DOA estimation performance for a given microphone array, and we also show the efficiency of this design choice in terms of the number of free parameters.

As stated earlier, we utilize the binary relevance method [21] to tackle the multi-label classification problem; therefore, the output layer of the CNN consists of I sigmoid units, each corresponding to a DOA class. During training, the network weights are optimized for each output neuron separately, using binary cross-entropy as the loss function.
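A minimal Keras sketch of such a network, assuming M = 4 microphones, K = 257 frequency bins and 37 DOA classes; the number of filters and the width of the fully connected layers are illustrative assumptions, as they are not specified in the text:

```python
from tensorflow.keras import layers, models

def build_doa_cnn(K=257, M=4, num_classes=37, num_filters=64, fc_units=512):
    """CNN mapping a K x M phase map to per-class DOA posteriors.

    The structure follows the text: M - 1 convolution layers with filters
    spanning 2 neighboring microphones x 1 frequency bin, fully connected
    layers with dropout, and one sigmoid output per DOA class (binary relevance).
    """
    inp = layers.Input(shape=(K, M, 1))     # phase map as a one-channel image
    x = inp
    for _ in range(M - 1):                  # microphone dimension shrinks to 1
        x = layers.Conv2D(num_filters, kernel_size=(1, 2), activation='relu')(x)
    x = layers.Dropout(0.5)(x)              # dropout at the end of the conv layers
    x = layers.Flatten()(x)
    for _ in range(2):                      # two fully connected layers
        x = layers.Dense(fc_units, activation='relu')(x)
        x = layers.Dropout(0.5)(x)
    out = layers.Dense(num_classes, activation='sigmoid')(x)
    return models.Model(inp, out)

model = build_doa_cnn()
model.summary()
```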

Here, the task of multi-source DOA estimation is performed for a signal block consisting of N time frames. The block-level posterior probability p(θ_i) is obtained by averaging the frame-level posterior probabilities for each class, given by

p(θ_i) = (1/N) Σ_{n=1}^{N} p(θ_i | Φ_n).     (2)

From these computed average posterior probabilities, the DOAs corresponding to the C classes with the highest probabilities are selected as the DOA estimates, where C is the number of sources. In this work, we chose this simple method to demonstrate the effectiveness of the proposed algorithm. Using more advanced post-processing methods, such as automatic peak detection [24], is beyond the scope of this paper.
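A small sketch of this block-level post-processing, assuming the frame-level posteriors of a block are stacked in an array of shape (N, I):

```python
import numpy as np

def block_doa_estimates(frame_probs, grid, num_speakers):
    """Average frame-level posteriors over a block and pick the top classes.

    frame_probs: array of shape (N, I) with per-frame class probabilities.
    grid: array with the I candidate DOAs in degrees.
    """
    block_probs = frame_probs.mean(axis=0)              # Eq. (2)
    top = np.argsort(block_probs)[::-1][:num_speakers]  # C most probable classes
    return np.sort(grid[top])

# Example with random posteriors for a block of 49 frames and 37 classes
probs = np.random.rand(49, 37)
print(block_doa_estimates(probs, np.arange(0, 181, 5), num_speakers=2))
```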

V Training Data Generation

In this section, we describe the training data generation method employed in this work. Please recall that, though the task of DOA estimation is performed for a segment of multiple time frames, in the proposed system the posterior probabilities of the DOA classes are estimated at each time frame. Therefore, using speech as training signals can be problematic, since we would require an extremely accurate voice activity detection method in order to avoid including silent time frames in the training data, and errors in this task can adversely affect the training. To avoid this problem, in [18], we proposed to use synthesized noise signals to generate the training data for the single-speaker scenario. However, when trying to localize simultaneously active speakers, using overlapping noise signals for the training data is not suitable, since at each TF bin the phase component of the observed microphone signals' STFT coefficient is a non-linear combination of the phases of the individual directional sources. Thus, learning the relevant features from such an input might be difficult for the CNN.

To effectively use synthesized noise signals to generate the training data, and taking into account the aim to localize speech sources, we utilize the assumption that the TF representations of two simultaneously active speech sources do not overlap. This is known as W-disjoint orthogonality and, with an appropriate choice of the time and frequency resolutions, has been shown to hold approximately for speech signals [20]. In the following, we explain the procedure for generating the training data for a scenario with two active speakers.

As a first step, we generate the training signals for the single-speaker case by convolving synthesized spectrally white noise signals with the RIRs corresponding to the different directions for each acoustic condition considered for training. Then, for a specific source-array setup, the STFT representations of two multi-channel training signals, corresponding to different DOAs, are concatenated along the time-frame axis. Following this, for each frequency sub-band separately, the time-frequency bins for all microphones are randomized to get a single training signal. This procedure is repeated for all combinations of DOAs and all different acoustic conditions considered for training. Finally, the phase map corresponding to each time frame, for all training signals, is extracted to form the complete training dataset.

While generating the training data, there are two important things to note regarding the randomization process. First, it is essential that the randomization of the TF bins is done separately for each frequency sub-band, such that the order of the frequency sub-bands remains the same for the different time frames. Since the phase correlations are frequency dependent, preserving the spectral structure across time frames can aid the feature learning. Second, it is essential that, for each frequency sub-band, the TF bins of all the microphones are randomized together, such that the phase relations between the microphones for the individual TF bins are preserved.

An illustration of this procedure is shown in Fig. 3. The figure on the left illustrates the concatenated TF representation of two directional signals originating from two different directions. Following the randomization procedure, it can be seen that at each time frame there is an approximately equal number of TF bins with activity corresponding to the two DOAs. Therefore, at each frequency sub-band of the phase map input to the CNN, the phases of the STFT coefficients for all microphones correspond to a single source. This makes the assumption of disjoint activity of the signals implicit within our framework. With this training input, the CNN can learn the relevant features for localizing multiple speakers at each time frame from the individual TF bins that contain the phase relations across the microphones for each source DOA separately.
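The core of this randomization step can be sketched as follows (shapes and function names are illustrative): the two single-source phase maps are concatenated along the time-frame axis and, for each frequency sub-band separately, the frame order is permuted with the same permutation applied to all microphones, so that the per-bin inter-microphone phase relations are preserved. Each resulting frame is then labeled with both source DOAs.

```python
import numpy as np

def mix_two_sources_disjointly(phase1, phase2, rng=None):
    """Combine two single-source phase maps into one two-source training signal.

    phase1, phase2: arrays of shape (num_frames, K, M) holding the STFT phases
    of noise signals convolved with RIRs from two different DOAs.
    """
    rng = np.random.default_rng() if rng is None else rng
    both = np.concatenate([phase1, phase2], axis=0)   # concatenate along time
    num_frames, K, M = both.shape
    for k in range(K):                                # each sub-band separately
        perm = rng.permutation(num_frames)            # one permutation shared by
        both[:, k, :] = both[perm, k, :]              # all M microphones
    return both                                       # frames now mix both DOAs

# Example: two 100-frame single-source signals, K = 257 bins, M = 4 microphones
p1 = np.random.uniform(-np.pi, np.pi, (100, 257, 4))
p2 = np.random.uniform(-np.pi, np.pi, (100, 257, 4))
train_frames = mix_two_sources_disjointly(p1, p2)     # shape (200, 257, 4)
```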

By repeating the above mentioned procedure for all possible angular combinations and acoustic conditions, we obtain the complete training dataset. The different acoustic conditions considered for the multi-condition training of the CNN are given in Table I. The different rooms, as well as the positions inside each room, are considered to develop robustness to various acoustic conditions; additionally, the network is trained with different levels of spatially white noise for robust performance in noisy scenarios.

In total, the training data consisted of around 12.4 million time frames. The CNN was trained using the Adam gradient-based optimizer [25], with mini-batches of 512 time frames and a learning rate of 0.001. During training, dropout [26] with a rate of 0.5 was applied at the end of the convolution layers and after each fully connected layer to avoid overfitting. All the implementations were done in Keras [27].
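A corresponding training sketch in Keras, using the optimizer, learning rate, loss and mini-batch size stated above (the number of epochs and the dummy arrays standing in for the phase-map dataset are placeholders):

```python
import numpy as np
from tensorflow.keras.optimizers import Adam

# Dummy stand-ins for the phase-map dataset (frames x K x M x 1) and the
# multi-hot DOA targets (frames x I); 'build_doa_cnn' is the earlier sketch.
X_train = np.random.uniform(-np.pi, np.pi, (2048, 257, 4, 1)).astype('float32')
Y_train = np.zeros((2048, 37), dtype='float32')

model = build_doa_cnn()
model.compile(optimizer=Adam(learning_rate=0.001),      # Adam, learning rate 0.001
              loss='binary_crossentropy')               # one binary task per class
model.fit(X_train, Y_train, batch_size=512, epochs=5)   # mini-batches of 512 frames
```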

Please note that, in this work, the CNN is trained to estimate the posterior probabilities of the DOAs of only two speakers given the phase map input for each STFT time frame. By following the same procedure as described above, the method can be extended to estimate the DOA class posterior probabilities of more than two speakers per time frame. In Section VI-C1, it is shown that, despite such a training procedure, the proposed method can estimate the DOAs of more than two speakers for a signal block with multiple time frames.

Simulated training data
Signal: Synthesized noise signals
Room size: R1: () m, R2: () m, R3: () m, R4: () m, R5: () m
Array positions in room: 7 different positions in each room
Source-array distance: 1 m and 2 m for each array position
RT (s): R1: 0.3, R2: 0.2, R3: 0.8, R4: 0.4, R5: 0.6
SNR: Uniformly sampled from 0 to 30 dB
TABLE I: Configuration for training data generation. All rooms are 2.7 m high.
Fig. 3: Illustration of the method used for generating the training data.
Simulated test data
Signal: Speech signals from LIBRI
Room size: Room 1: () m, Room 2: () m
Array positions in room: 4 arbitrary positions in each room
Source-array distance: 1.3 m for Room 1, 1.7 m for Room 2
RT (s): Room 1: 0.38, Room 2: 0.70
TABLE II: Configuration for generating test data for the experiments presented in Sections VI-B1 and VI-B2. All rooms are 3 m high.
TABLE III: Results for the two different rooms with varying levels of spatially white noise (input SNRs of 10, 20 and 30 dB), in terms of MAE and localization accuracy (Acc.) for SRP-PHAT, MUSIC and the proposed method, computed over 3150 speech segments of 0.8 s for each array position. For each SNR, the result is averaged over the four different array positions in the room.

VI Experimental Evaluation

In this section, different experiments with simulated and measured data are presented to objectively evaluate the performance of the proposed system. For all the experimental evaluations except the one presented in Section VI-B4, we consider a ULA with M = 4 microphones with an inter-microphone distance of 8 cm, and the input signals are transformed to the STFT domain using a DFT length of 512 (32 ms frames) with 50% overlap, resulting in K = 257. The sampling frequency of the signals is 16 kHz. To form the classes, we discretize the whole DOA range of a ULA, 0° to 180°, with a 5° resolution to get 37 DOA classes, for both training and testing. All the presented objective evaluations are for the two-speaker scenario. However, in Section VI-C1, we also demonstrate the ability of the proposed method to deal with scenarios with a varying number of speakers.

The speech signals used for evaluation are taken from the LIBRI speech corpus. With randomly selected speech utterances, five different two-speaker mixtures, each of length 2 s, were used. Since the angular space is discretized with a 5° resolution, for the experiments with simulated RIRs in Section VI-B, it was ensured that the angular distance between the two speakers in the different mixtures is at least 10°. Therefore, for a specific source-array setup in a room, each two-speaker mixture is considered for each possible angular combination. This was done to avoid the influence of signal variation on the difference in performance for different acoustic conditions.

Since the speech utterances can have different lengths of silence at the beginning, the central 0.8 s segment of each mixture was selected for evaluation. Considering an STFT window length of 32 ms with 50% overlap, this resulted in a signal block of N time frames over which the frame-level posterior probabilities are averaged for the final DOA estimation, as shown in (2).

TABLE IV: Results for the two different rooms with varying levels of babble noise (input SNRs of −5, 0 and 5 dB), in terms of MAE and localization accuracy (Acc.) for SRP-PHAT, MUSIC and the proposed method, computed over 3150 speech segments of 0.8 s for each array position. For each SNR, the result is averaged over the four different array positions in the room.

VI-A Baselines and objective measures

The performance of the proposed method is compared to two commonly used signal processing based methods: Steered Response Power with PHase Transform (SRP-PHAT) [5] and broadband MUltiple SIgnal Classification (MUSIC) [2]. For the broadband MUSIC method, to keep the comparison similar to the other methods, the MUSIC pseudo-spectrum is computed at each frequency sub-band for each STFT time frame, with an angular resolution of 5° over the whole DOA space, and then averaged over all the frequency sub-bands to get the broadband pseudo-spectrum. This is then averaged over all the time frames considered in a signal block and, similar to the proposed method, the DOAs with the highest values are selected as the final DOA estimates. Similar post-processing is also performed on the SRP-PHAT pseudo-likelihoods computed at each time frame to get the final DOA estimates for a signal block.
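For illustration, a simplified broadband MUSIC sketch of the kind described above, using a block-level spatial covariance per sub-band rather than a per-frame estimate, far-field steering vectors for a ULA, and assumed parameter values (microphone positions, sampling rate, DFT length):

```python
import numpy as np

def broadband_music(Y, mic_pos, grid_deg, num_src, fs, nfft, c=343.0):
    """Broadband MUSIC pseudo-spectrum over candidate DOAs (simplified).

    Y: STFT of the microphone signals, shape (M, K, T), K bins up to Nyquist.
    mic_pos: microphone coordinates along the array axis in metres.
    """
    M, K, T = Y.shape
    angles = np.deg2rad(np.asarray(grid_deg, dtype=float))
    delays = np.outer(mic_pos, np.cos(angles)) / c          # far-field ULA model
    spectrum = np.zeros(len(grid_deg))
    for k in range(1, K):                                   # skip the DC bin
        R = Y[:, k, :] @ Y[:, k, :].conj().T / T            # spatial covariance
        _, vecs = np.linalg.eigh(R)                         # ascending eigenvalues
        En = vecs[:, :M - num_src]                          # noise subspace
        A = np.exp(-2j * np.pi * (k * fs / nfft) * delays)  # steering vectors (M x I)
        spectrum += 1.0 / np.sum(np.abs(En.conj().T @ A) ** 2, axis=0)
    return spectrum / (K - 1)

# Example: 4-microphone ULA, 8 cm spacing; random data stands in for the STFT
grid = np.arange(0, 181, 5)
Y = np.random.randn(4, 257, 49) + 1j * np.random.randn(4, 257, 49)
spec = broadband_music(Y, np.arange(4) * 0.08, grid, num_src=2, fs=16000, nfft=512)
print(np.sort(grid[np.argsort(spec)[::-1][:2]]))            # two strongest DOAs
```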

For the objective evaluation, two different measures were used: mean absolute error (MAE) and localization accuracy (Acc.). The mean absolute error computed between the true and estimated DOAs for each evaluated acoustic condition is given by

MAE = (1 / (C · L)) Σ_{l=1}^{L} Σ_{c=1}^{C} |θ_c^(l) − θ̂_c^(l)|,     (3)

where C is the number of simultaneously active speakers and L is the total number of speech mixture segments considered for evaluation for a specific acoustic condition. The true and estimated DOAs for the c-th speaker in the l-th mixture are denoted by θ_c^(l) and θ̂_c^(l), respectively.

The localization accuracy is given by

Acc. = (L_acc / L) × 100%,     (4)

where L_acc denotes the number of speech mixtures for which the localization of the speakers is accurate. In our evaluation, the localization of the speakers for a speech segment is considered accurate if the distance between the estimated and the true DOA for all the speakers is less than or equal to 5°.
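A short sketch of both measures as defined above; sorting each row gives the error-minimizing pairing between true and estimated DOAs in one dimension:

```python
import numpy as np

def mae_and_accuracy(true_doas, est_doas, tol_deg=5.0):
    """MAE, Eq. (3), and localization accuracy, Eq. (4), over L mixtures.

    true_doas, est_doas: arrays of shape (L, C) in degrees.
    """
    t = np.sort(np.asarray(true_doas, dtype=float), axis=1)
    e = np.sort(np.asarray(est_doas, dtype=float), axis=1)
    abs_err = np.abs(t - e)
    mae = abs_err.mean()                                   # average over C * L terms
    acc = 100.0 * np.mean(np.all(abs_err <= tol_deg, axis=1))
    return mae, acc

# Two mixtures, two speakers each: per-speaker errors of 0 and 5 degrees
print(mae_and_accuracy([[60, 105], [30, 90]], [[60, 110], [30, 85]]))  # (2.5, 100.0)
```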

VI-B Experiments with simulated RIRs

In this section, first, the performance of the proposed method is evaluated for acoustic conditions different from those considered during training, in the presence of varying levels of spatially uncorrelated white noise in Section VI-B1. Then, we evaluate the performance in the presence of varying levels of diffuse babble noise, a noise type which was unseen during training, along with a constant level of spatially white noise in Section VI-B2. In Section VI-B4, we study the influence of the number of convolution layers on the performance of the proposed method and empirically demonstrate the optimal choice for the number of convolution layers for the proposed method.

Fig. 4: Array setup for experiment presented in Section VI-B4.

VI-B1 Generalization to unseen acoustic conditions

To evaluate the performance of the methods for unseen acoustic conditions, we consider two rooms with different reverberation times, as shown in Table II. In each room, the ULA is placed at four different positions, and for each of these array positions, the two speakers from each of the five considered mixtures are placed at different angular positions at the same specified source-array distance. For each array position, the total number of mixtures considered for evaluation is 5 × 630 = 3150, where 630 corresponds to the number of possible angular combinations satisfying the constraint of at least 10° angular separation between the two speakers, for each of the five mixtures.

The performance of the three methods under test is evaluated for three different levels of spatially white noise, with input SNRs of 10, 20 and 30 dB, for both rooms, and the results in terms of the two considered objective measures are presented in Table III. The shown results for each input SNR were averaged over the four different array positions considered in each room.

From the results, it can be seen that the proposed method is able to provide accurate localization performance in acoustic environments that were not part of the training data. For an input SNR of 30 dB, it manages to localize both sources accurately in 98% of the speech mixtures and shows a very low MAE. As the noise level increases, the performance worsens; however, it always provides a much better localization accuracy and a much lower error than both MUSIC and SRP-PHAT.

For the same noise level, the performance of the proposed method is relatively similar in both rooms, whereas the signal processing based methods perform considerably better in the less reverberant room (Room 1). One of the main reasons for this difference is the assumption of free-field sound propagation in the formulation of the signal processing based methods, which leads to a considerable deterioration of their performance in more reverberant conditions. In contrast, the proposed supervised learning based method is trained on a diverse set of acoustic conditions, leading to a much better robustness to adverse acoustic environments.

Overall, it can be seen that the proposed method has superior performance, in terms of both MAE and localization accuracy, compared to the traditional signal processing based methods for all the different levels of spatially white noise in both rooms. Among the two signal processing based methods, MUSIC performs much better, since the averaged broadband MUSIC pseudo-spectrum contains clearer peaks than SRP-PHAT, which tends to exhibit a flatter distribution over the DOAs.

Fig. 5: Results, in terms of (a) MAE and (b) Acc., of the experiment on the performance of the proposed method for increasing source-array distances, presented in Section VI-B3.
Fig. 6: Results, in terms of (a) MAE and (b) Acc., of the experiment on the influence of the number of convolution layers on the proposed method, presented in Section VI-B4.

VI-B2 Generalization to unseen noise type

In the previous experiment, the performance of the proposed method was evaluated for different levels of spatially white noise, which is a noise type seen by the network during training. In this section, we consider the presence of diffuse babble noise in the acoustic environment, which has different spatial as well as spectral characteristics compared to white noise and is a noise type with which the CNN was not trained. A 40 s long sample of multi-channel diffuse babble noise was generated using the acoustic noise field generator [28], assuming an isotropic spherically diffuse noise field. The generated babble noise was divided into 20 segments of 2 s each, and randomly chosen segments were added to each mixture.
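As an aside, adding a noise segment to a mixture at a prescribed input SNR can be sketched as follows (the diffuse-noise generation itself [28] and the random segment selection are assumed to have been done already; the helper name is hypothetical):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale 'noise' so that it lies snr_db below 'clean' in power, then add it.

    clean, noise: arrays of shape (M, num_samples) of equal length.
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: 2 s of 4-channel signals at 16 kHz, babble added at 0 dB input SNR
speech = np.random.randn(4, 32000)
babble = np.random.randn(4, 32000)
noisy = add_noise_at_snr(speech, babble, snr_db=0.0)
```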

The performance of the methods was evaluated for three different input SNRs of babble noise: -5 dB, 0 dB and 5 dB. Along with the diffuse babble noise, spatially white noise with an input SNR of 40 dB was also added, and the results for the two different rooms are shown in Table IV. Similar to the previous experiment, the results for each input SNR of babble noise were averaged over the four different array positions considered in each room.

Though the proposed method was not trained with diffuse babble noise, it can be seen from the results that, even at the lowest input SNR of -5 dB, the proposed method is able to accurately localize the two speakers in both rooms for approximately 90% of the speech mixtures. Since we consider an isotropic spherically diffuse noise field, the spatial coherence of the babble noise is frequency dependent, whereas white noise is incoherent at all frequencies. Despite this difference, since the proposed method is trained to localize directional sources and due to the multi-condition training, as long as the noise source is not directional the proposed method can provide very good performance.

If the results from Table III are compared to those in Table IV, it can be seen that the deterioration in performance of the proposed method, in terms of the objective measures, as the noise level increases is more prominent for white noise than for diffuse babble noise. The main reason for this difference is the spectral characteristics of the two types of noise. On one hand, spatially white noise is present across the whole spectrum; therefore, the input features at all frequency sub-bands are equally affected. On the other hand, babble noise is mostly dominant at low frequencies; since each filter kernel in the convolution layers of the CNN learns from the complete input feature space, the filters are able to extract the relevant features for localization from the high SNR regions of the input to compensate for the lack of information in the low SNR regions.

Overall, the proposed method provides a much better localization accuracy and lower error than the signal processing based methods, with the difference in performance being especially significant at low input SNRs of diffuse babble noise.

VI-B3 Influence of source-array distance

The CNN used for the earlier evaluations was trained for each room and array position for two specific source-array distances of 1 m and 2 m. To investigate the influence of source-array distance, in this part, the localization performance of the proposed method is evaluated for varying source-array distances.

For this experiment, we simulated a room with a reverberation time of 0.38 s. The test data was generated for three different array positions. For each of these array positions, the sound sources were placed at distances varying from 0.4 m to 3 m. It should be noted that both speakers were placed at the same distance in each setup. A single two-speaker mixture was used, and spatially white noise was added, resulting in an input SNR of 20 dB.

The results of this experiment, in terms of both MAE and localization accuracy, are shown in Fig. 5. Each point in the plot corresponds to a specific source-array distance. For each of these points, the measures were averaged over all possible angular combinations for the two speakers at each of the different array positions in the room.

From the result plots, it can be seen that the localization error is higher when the sources are very close to the microphone array, since the CNN was trained considering a far-field scenario, whereas for very small source-array distances the sources are essentially in the near-field of the array. The minimum error as well as the maximum localization accuracy can be observed at the two specific distances of 1 m and 2 m, which were part of the training setup. Additionally, for distances close to these training distances, the errors are also relatively low. When the sources are between the two training distances, the errors are slightly higher; however, considering the absolute value of the MAE as well as the accuracy, the degradation in performance is not significant. Similarly, for distances larger than 2 m, the localization performance deteriorates only slightly.

Overall, considering the absolute values of the objective measures, it can be seen that, although the network is trained with two specific source-array distances, there is only a small deterioration in performance for other distances, except when the sources are very close to the microphone array.

TABLE V: Results with measured RIRs for reverberation times of 0.160 s, 0.360 s and 0.610 s and source-array distances of 1 m and 2 m, in terms of MAE and localization accuracy (Acc.) for SRP-PHAT, MUSIC and the proposed method.
Fig. 7: Results for the experiment presented in Section VI-C1 with measured RIRs and a four-microphone ULA. (a) Frame-level DOA probabilities for the proposed method (top) and MUSIC (middle); the ground truth DOAs and source activities for each segment are shown in the bottom panel. (b) Normalized histogram computed from the frame-level probabilities for each segment. The reverberation time of the room is 0.36 s, with the sources placed 2 m from the array center. Spatially uncorrelated noise and diffuse babble noise were added to the mixture signal with input SNRs of 40 dB and 5 dB, respectively.

VI-B4 Influence of number of convolution layers

In the previous experiments, we considered a ULA with M = 4 microphones, and the CNN architecture used was the same as that proposed in [18, 19], which consisted of three convolution layers followed by two fully connected layers. In this section, we empirically demonstrate that, given the choice of small 2 × 1 filters for all the convolution layers, with the aim to learn the relevant features for localization from the phase correlations at neighboring microphones, a CNN architecture with three convolution layers is not always the best performing architecture. Here we show that the number of convolution layers needs to be M − 1 to obtain the best localization performance.

For this experiment, we consider a ULA with 8 microphones with an inter-microphone distance of 2 cm. From this array, we select two sub-arrays, one with 6 microphones and the other with 4 microphones, formed by selecting the respective number of middle microphones from the main eight-element array, as shown in Fig. 4, to get a ULA with M = 6 and another ULA with M = 4, respectively. All the arrays have the same inter-microphone distance and array center.

Using the same training data configuration as in the previous experiments (Table I), multiple CNNs with the number of convolution layers varying from 2 to M − 1 are trained for each of the arrays. The number of convolution layers is restricted to M − 1, since further 2D convolution layers are not possible, as the microphone dimension of the phase map input is reduced to 1 after the (M − 1)-th layer. For the eight-microphone array, 6 CNNs are trained, whereas for the six-microphone and the four-microphone array, 4 and 2 CNNs are trained, respectively. All the networks were trained with the same amount of data. To analyze the performance of the 12 different trained networks, test data corresponding to the Room 1 configuration in Table II is generated for each of the arrays. Spatially white noise is added for an input SNR of 30 dB.

The results of this experiment, in terms of both MAE and localization accuracy, are shown in Fig. 6. In the figures, the center of each circle marker corresponds to the value of the objective measure, and the area of the marker denotes the number of trainable/free parameters for that specific network.

The first trend that can be noticed from the figures is that, for each of the arrays, as the number of convolution layers is decreased from M − 1, the performance of the networks degrades in terms of both MAE and localization accuracy. This shows that, with small filters of size 2 × 1, M − 1 convolution layers are required to aggregate the phase correlation features from all the microphone pairs in an array. When fewer convolution layers are used, as the same filter size is used in each of these layers, the phase correlation information from all microphone pairs is not incorporated into the learned features, leading to a deterioration in performance.

It can also be seen from the figures that the best localization performance differs between the three arrays and is better for the arrays with a higher number of microphones. This difference in performance comes from the different apertures of the considered arrays; similar to signal processing based localization methods, here we also observe better performance for a ULA with a larger aperture.

In Fig. 6, we also observe that, as the number of convolution layers is decreased, the number of trainable/free parameters increases, as depicted by the area of the markers for each network. From Fig. 2, it can be seen that when M − 1 convolution layers are used, the size of each feature map at the end of the convolution layers is always K × 1. As the number of convolution layers is decreased, the size of each feature map at the end of the convolution layers becomes larger, leading to a larger number of trainable/free parameters for the complete network. This further demonstrates the benefit of M − 1 convolution layers, as a very large number of free parameters can lead to overfitting if the amount of available training data is not sufficient.

Since the requirement of M − 1 convolution layers is mainly related to the aggregation of information in the feature space by the slowly growing receptive field of the small filters used in our framework, techniques for a more aggressive expansion of the receptive field of the filters could also be employed. This is, however, beyond the scope of this paper and is a topic for future research.

VI-C Experiments with measured RIRs

For the experiments with measured RIRs, we used the Multichannel Impulse Response Database from Bar-Ilan University [29]. The database consists of RIRs measured in Bar-Ilan University's acoustics lab for three different reverberation times of RT = 0.160, 0.360 and 0.610 s. The recordings were done for several source positions placed on a spatial grid of semi-circular shape covering the whole angular range of a linear array, i.e., 0° to 180°, in steps of 15°, at distances of 1 m and 2 m from the center of the microphone array.

The recordings were done with a linear microphone array with three different microphone spacings. For our experiment, we chose the [8, 8, 8, 8, 8, 8, 8] cm setup [29], which consists of eight microphones where the distance between adjacent microphones is 8 cm. We selected a sub-array of the four middle microphones out of the total eight microphones used in the original setup, to have a ULA with M = 4 elements and an inter-microphone distance of 8 cm, which corresponds to the array setup used in the experiments with simulated RIRs. Therefore, the CNN trained with simulated data used for the earlier evaluations in Sections VI-B1 and VI-B2 was also used for these experiments. We used the same five mixtures as earlier, with the total number of mixtures for evaluation being 5 × 78 = 390, where 78 is the number of all possible angular combinations with a discretization of the complete DOA space of a ULA at a 15° resolution.

The results for all the different reverberation times and source-array distances are shown in Table V. For this experiment, spatially white noise was added to each mixture, resulting in an input SNR of 30 dB.

The results show that, even when trained with simulated data only, the proposed method is able to provide very good localization performance in real conditions, even when the sources are placed far from the array in reverberant conditions. The performance of all the compared methods is better when the sources are close to the array; however, the difference in performance for different distances is considerable for the signal processing based methods, since the effect of reverberation is more significant when the sources are further away from the array.

Overall, the proposed method provides significantly better performance compared to both MUSIC and SRP-PHAT, and the difference is more prominent as the acoustic environment becomes more reverberant.

VI-C1 Dynamic acoustic scenario

In all the previous experiments, we considered the two-speaker scenario for the evaluation of the performance of the proposed method. In this experiment, we show that, even though the CNN is trained to estimate the frame-level posterior probabilities of a maximum of two sources, with the proposed method it is possible to estimate the DOAs of more than two sources for a short segment. Simultaneously, it is also shown that, since the input to the CNN is the phase map of a single STFT time frame, the proposed method is able to handle dynamic acoustic scenarios where the number of speakers changes over time.

For this experiment, we consider the reverberation time of 0.36 s and a source-array distance of 2 m from the measured RIR database used in the previous experiment. A 6 s speech mixture is created where, for the first 1 s, only one source at 60° is active. For the next 2 s, an additional source at 105° is active. A third source at 135° is active for the next 2 s, along with the first two sources. For the final 1 s, only the third source is active. The source activities for each segment and the corresponding ground truth DOAs of the sources are shown in the bottom panel of Fig. 7(a). Spatially white noise and diffuse babble noise are added to the speech mixture, resulting in input SNRs of 40 dB and 5 dB, respectively.
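A sketch of how such a time-varying mixture could be assembled from three spatialized source signals (speech already convolved with the multichannel RIRs of the respective DOAs; the activity intervals follow the description above):

```python
import numpy as np

def assemble_dynamic_mixture(spatial_srcs, activity, fs=16000):
    """Build a mixture in which each source is active only in given intervals.

    spatial_srcs: list of arrays of shape (M, num_samples), one per source DOA.
    activity: list of (start_s, end_s) tuples, one per source.
    """
    mix = np.zeros_like(spatial_srcs[0])
    for sig, (t0, t1) in zip(spatial_srcs, activity):
        n0, n1 = int(t0 * fs), int(t1 * fs)
        mix[:, n0:n1] += sig[:, n0:n1]
    return mix

# Sources at 60, 105 and 135 degrees over a 6 s mixture (4 microphones)
srcs = [np.random.randn(4, 6 * 16000) for _ in range(3)]
mix = assemble_dynamic_mixture(srcs, [(0.0, 5.0), (1.0, 5.0), (3.0, 6.0)])
```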

The estimated frame-level probabilities for the proposed method and broadband MUSIC are depicted in the top and middle panels of Fig. 7(a), respectively. Since MUSIC was found in the previous experiments to be the better performing of the two considered signal processing based techniques, the results for SRP-PHAT are not presented. It can be seen that the estimated frame-level probabilities for the proposed method are much more concentrated towards the actual source DOAs than those of MUSIC.

In Fig. 7(b), the frame-level probabilities are averaged over the time frames in each segment and then normalized to a maximum value of 1. This specific normalization is done for the purpose of visualization only. From these figures, it can be seen that the proposed method exhibits much clearer peaks at the true source DOAs than MUSIC, which leads to the superior performance of the proposed method in the previously presented evaluations, even with the simple post-processing method considered in this work for obtaining the final DOA estimates. It can also be seen that in segment S3, where three sources are simultaneously active, clear peaks are visible at all three true source DOAs, even though the network is trained to estimate frame-level probabilities for two speakers. Also, when only one source is active (S1 and S4), the highest peaks correspond to the true DOA.

VII Conclusion

A convolutional neural network based supervised learning method for DOA estimation of multiple speakers was presented that is trained using synthesized noise signals. Through experimental evaluation, it was shown that the proposed method provides excellent localization performance in unseen acoustic environments as well as in the presence of unseen noise types. It was also shown to exhibit a far superior performance compared to the signal processing based localization methods, SRP-PHAT and MUSIC, for the tested conditions. The ability of the proposed method to deal with acoustic scenarios with varying number of sources was also shown.

For the design choice of the number of convolution layers in the proposed architecture, it was empirically shown that, for a microphone array with M microphones, M − 1 convolution layers are required for the best localization performance. It was also shown that this choice leads to a lower number of trainable parameters. The choice of M − 1 convolution layers is required for the aggregation of the phase correlation information from all microphone pairs into the extracted features, when using contiguous convolution operations as done in this work.

References

  • [1] R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. Antennas Propag., vol. 34, no. 3, pp. 276–280, 1986.
  • [2] J. P. Dmochowski, J. Benesty, and S. Affes, “Broadband MUSIC: Opportunities and challenges for multiple source localization,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2007, pp. 18–21.
  • [3] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 4, pp. 320–327, Aug. 1976.
  • [4] Y. A. Huang, J. Benesty, G. W. Elko, and R. M. Mersereau, “Real-time passive source localization: a practical linear-correction least-squares approach,” IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 943–956, Nov. 2001.
  • [5] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, Apr. 1997, pp. 375–378.
  • [6] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing.   Berlin, Germany: Springer-Verlag, 2008.
  • [7] J. Benesty and Y. Huang, Eds., Adaptive Signal Processing: Application to real-world problems, ser. Signals and Communication Technology.   Berlin, Germany: Springer, 2003.
  • [8] P. Stoica and K. C. Sharman, “Maximum likelihood methods for direction-of-arrival estimation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 7, pp. 1132–1143, Jul 1990.
  • [9] S. Delikaris-Manias, D. Pavlidi, A. Mouchtaris, and V. Pulkki, “Doa estimation with histogram analysis of spatially constrained active intensity vectors,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 526–530.
  • [10] A. Moore, C. Evers, and P. Naylor, “2D direction of arrival estimation of multiple moving sources using a spherical microphone array,” in Proc. European Signal Processing Conf. (EUSIPCO), 2016, pp. 1217–1221.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1106–1114.
  • [12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov 2012.
  • [13] N. Ma, T. May, and G. J. Brown, “Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments,” IEEE Trans. Audio, Speech, Lang. Process., vol. 25, no. 12, pp. 2444–2453, Dec 2017.
  • [14] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza, “A neural network based algorithm for speaker localization in a multi-room environment,” in IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1–6.
  • [15] R. Takeda and K. Komatani, “Sound source localization based on deep neural networks with directional activate function exploiting phase information,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 405–409.
  • [16] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 2814–2818.
  • [17] R. Takeda and K. Komatani, “Discriminative multiple sound source localization based on deep neural networks using independent location model,” in IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 603–609.
  • [18] S. Chakrabarty and E. A. P. Habets, “Broadband DOA estimation using convolutional neural networks trained with noise signals,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2017.
  • [19] ——, “Multi-speaker localization using convolutional neural network trained with noise,” in ML4Audio Workshop at the Neural Information Processing Systems Conf., 2017.
  • [20] S. Rickard and O. Yilmaz, “On the approximate W-disjoint orthogonality of speech,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2002, pp. 529–532.
  • [21] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Machine Learning, vol. 85, no. 3, p. 333, Jun 2011. [Online]. Available: https://doi.org/10.1007/s10994-011-5256-5
  • [22] F. Stoeter, S. Chakrabarty, B. Edler, and E. A. P. Habets, “Classification vs. regression in supervised learning for single channel speaker count estimation,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • [23] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1998, pp. 255–258. [Online]. Available: http://dl.acm.org/citation.cfm?id=303568.303704
  • [24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, 3rd ed.   The MIT Press, 2009.
  • [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, 2014.
  • [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, Jan. 2014.
  • [27] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [28] E. A. P. Habets and S. Gannot, “Generating Sensor Signals in Isotropic Noise Fields,” Journal Acoust. Soc. of America, vol. 122, no. 6, pp. 3464–3470, Dec. 2007.
  • [29] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Proc. Intl. Workshop Acoust. Echo Noise Control (IWAENC), Sept 2014, pp. 313–317.