A Survey of Sound Source Localization with Deep Learning Methods

09/08/2021, by Pierre-Amaury Grumiaux et al. (Grenoble Institute of Technology, Orange)

This article is a survey of deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environments, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.


I Introduction

Sound source localization (SSL) is the problem of estimating the position of one or several sound sources relative to some arbitrary reference position, which is generally the position of the recording microphone array, based on the recorded multichannel acoustic signals. In most practical cases, SSL is simplified to the estimation of the sources’ Direction of Arrival (DoA), i.e. it focuses on the estimation of azimuth and elevation angles, without estimating the distance to the microphone array (therefore, unless otherwise specified, in this article we use the terms SSL and DoA estimation interchangeably). Sound source localization has numerous practical applications, for instance in source separation [29], speech recognition [116], speech enhancement [236] or human-robot interaction [119]. As detailed in the following, in this paper we focus on sound sources in the audible range (typically speech and audio signals) in indoor (office or domestic) environments.

Although SSL is a longstanding and widely researched topic [43, 7, 34], it remains a very challenging problem to date. Traditional SSL methods are based on signal/channel models and signal processing techniques. Although they have shown notable advances in the domain over the years, they are known to perform poorly in difficult yet common scenarios where noise, reverberation and several simultaneously emitting sound sources may be present [16, 50]. In the last decade, the potential of data-driven deep learning (DL) techniques for addressing such difficult scenarios has raised an increasing interest. As a result, more and more SSL systems based on deep neural networks (DNNs) are proposed each year. Most of these studies have indicated the superiority of DNN models over conventional SSL methods (hereafter, the term “conventional” is used to refer to SSL systems that are based on traditional signal processing techniques, and not on DNNs), which has further fueled the expansion of scientific papers on deep learning applied to SSL. For example, in the last three years (2019 to 2021), we have witnessed a threefold increase in the number of corresponding publications. In the meantime, there has been no comprehensive survey of the existing approaches, which we deem extremely useful for researchers and practitioners in the domain. Although we can find reviews mostly focused on conventional methods, e.g. [7, 34, 50, 54], to the best of our knowledge only very few have explicitly targeted sound source localization by deep learning methods. In [4], the authors present a short survey of several existing DL models and datasets for SSL before proposing a DL architecture of their own. References [14] and [165] are very interesting overviews of machine learning applied to various problems in audio and acoustics. Nevertheless, only a short portion of each is dedicated to SSL with deep neural networks.

I-A Aim of the paper

The goal of the present paper is to fill this gap, and provide a thorough survey of the SSL literature using deep learning techniques. More precisely, we examined more than 120 more or less recent papers (published after 2013) and we classify and discuss the different approaches in terms of characteristics of the employed methods and addressed configurations (e.g. single-source vs multi-source localization setup, or neural network architecture; the exact list is given in Section I-C). In other words, we present a taxonomy of the DL-based SSL literature. At the end of the paper, we present a summary of this survey in the form of two large tables (one for the period 2013-2019 and one for 2020-2021). All methods that we reviewed are reported in those tables with a summary of their characteristics presented in different columns. This enables the reader to rapidly select the subset of methods having a given set of characteristics, should they be interested in that particular type of methods.

Note that in this survey paper, we do not aim to evaluate and compare the performance of the different systems. Due to the large number of neural-based SSL papers and the diversity of configurations, such a contribution would be very difficult and cumbersome (albeit very useful), especially because the discussed systems are often trained and evaluated on different datasets. As we will see later, listing and commenting on those different datasets is however part of our survey effort. Note also that we do not consider SSL systems exploiting other modalities in addition to sound, e.g. audio-visual systems [10]. Finally, we do consider DL-based methods for joint sound event localization and detection (SELD), but we mainly focus on their localization part.

I-B General principle of DL-based SSL

The general principle of DL-based SSL methods and systems can be schematized with a quite simple pipeline, as illustrated in Fig. 1. A multichannel input signal recorded with a microphone array is processed by a feature extraction module, to provide input features. Those input features are fed into a DNN which delivers an estimate of the source location or DoA. As discussed later in the paper, a recent trend is to skip the feature extraction module and directly feed the network with multichannel raw data. In any case, the two fundamental reasons behind the design of such SSL systems are the following.
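To make this pipeline concrete, here is a minimal sketch in Python/PyTorch (all layer sizes, the 4-microphone setup and the 72-class azimuth grid are illustrative assumptions, not taken from any particular paper): STFT phase spectrograms are extracted from the multichannel waveform and mapped by a small network to scores over candidate DoAs.

```python
import torch
import torch.nn as nn

N_MICS, N_FFT, N_CLASSES = 4, 512, 72   # assumed setup: 4 mics, 72 azimuth classes (5 degree grid)

class SimpleSSLNet(nn.Module):
    """Toy DoA classifier: multichannel STFT phase -> conv layers -> azimuth scores."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(N_MICS, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # aggregate over frequency and time
        )
        self.head = nn.Linear(32, N_CLASSES)          # one score per candidate azimuth

    def forward(self, wave):                          # wave: (batch, mics, samples)
        stft = torch.stft(wave.reshape(-1, wave.shape[-1]), n_fft=N_FFT,
                          window=torch.hann_window(N_FFT), return_complex=True)
        phase = torch.angle(stft)                     # feature extraction: phase spectrograms
        feats = phase.reshape(wave.shape[0], N_MICS, *phase.shape[-2:])
        return self.head(self.conv(feats).flatten(1))  # logits over candidate DoAs

logits = SimpleSSLNet()(torch.randn(1, N_MICS, 16000))  # 1 s of 4-channel audio at 16 kHz
print(logits.shape)  # torch.Size([1, 72])
```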

First, multichannel signals recorded with an array of microphones distributed in space contain information about the source(s) location. Indeed, when the microphones are close to each other compared to their distance to the source(s), the microphone signal waveforms, although looking similar from a distance, exhibit more or less notable and complex differences in terms of delay and amplitude, depending on the experimental setup. These interchannel differences are due to distinct propagation paths from the source to the different microphones, for both the direct path (line of sight between source and microphone) and the numerous reflections that compose the reverberation in indoor environments. In other words, a source signal is filtered by different room impulse responses (RIRs) depending on the source position, microphone positions, and acoustic environment configuration (e.g. room shape), and thus the resulting recordings contain information on the relative source-to-microphone-array position. We do not detail those foundations further in this paper. They can be found in several references on general acoustics [86, 176], room acoustics [110], array signal processing [19, 11, 87, 166], speech enhancement and audio source separation [55, 226], and many papers on conventional SSL.

The second reason for designing DNN-based SSL systems is that even if the relationship between the information contained in the multichannel signal and the source(s) location is generally complex (especially in a multisource reverberant and noisy configuration), DNNs are powerful models that are able to automatically identify this relationship and exploit it for SSL, given that they are provided with a sufficiently large and representative amount of training examples. This ability of data-driven deep learning methods to replace conventional methods based on a signal/channel model and signal processing techniques (or at least a part of them, since the feature extraction module can be based on conventional processing) makes them attractive for problems such as SSL. An appealing property of DL-based methods is their capacity to deal with real-world data, whereas conventional methods often suffer from oversimplistic assumptions compared to the complexity of real-world acoustics. The major drawback of DNN-based approaches is their lack of generality. A deep model designed for and trained in a given configuration (for example a given microphone array geometry) will not provide satisfying localization results if the setup changes [124, 136], unless some relevant adaptation method can be used, which is still an open problem in deep learning in general. In this paper, we do not consider this aspect, nor do we intend to further detail the pros and cons of DL-based methods vs conventional methods. Our goal is rather to present, in a soundly organized manner, a representative (if not exhaustive) panorama of DL-based SSL methods published in the last decade.

Fig. 1: General pipeline of a deep-learning-based SSL system.

I-C Outline of the paper

The remainder of the paper is organized as follows. In Section II, we specify the context and scope of the survey in terms of considered acoustic environment and sound source configurations. In Section III, we briefly present the most common conventional SSL methods. This is motivated by two reasons: First, they are often used as baselines for the evaluation of DL-based methods, and second, we will see that several types of features extracted by conventional methods can be used in DL-based methods. Section IV aims to classify the different neural network architectures used for SSL. Section V presents the various types of input features used for SSL with neural networks. In Section VI, we explain the two output strategies employed in DL-based SSL: classification and regression. We then discuss in Section VII the datasets used for training and evaluating the models. In Section VIII, learning paradigms such as supervised or semi-supervised learning are discussed from the SSL perspective. Section IX provides the two summary tables and concludes the paper.

II Acoustic environment and sound source configurations

SSL has been applied in different configurations, depending on the application. In this section we specify the scope of our survey, in terms of acoustic environment (noisy, reverberant, or even multi-room), and the nature of the considered sound sources (their type, number and static/mobile status).

II-A Acoustic environments

In this paper, we focus on the problem of estimating the location of sound sources in an indoor environment, i.e. when the microphone array and the sound source(s) are present in a closed room, generally of moderate size, typically an office room or a domestic environment. This implies reverberation: in addition to the direct source-to-microphone propagation path, the recorded sound contains many other multi-path components of the same source. All those components form the room impulse response which is defined for each source position and microphone array position (including orientation), and for a given room configuration.

In general, the presence of reverberation is seen as a notable perturbation that makes SSL more difficult, compared to the simpler (but somewhat unrealistic) anechoic case, which assumes the absence of reverberation, as obtained in the so-called free-field propagation setup. Another important adverse factor to take into account in SSL is noise. On the one hand, noise can come from interfering sound sources in the surrounding environment: TV, background music, pets, street noise passing through open or closed windows, etc. Often, noise is considered as diffuse, i.e. it does not originate from a clear direction. On the other hand, the imperfections of the recording devices are another source of noise, generally considered as artifacts.

Early works on using neural networks for DoA estimation considered direct-path propagation only (the anechoic setting), see e.g. [168, 61, 91, 93, 92, 52, 242, 195, 48]. Most of these works are from the pre-deep-learning era, using “shallow” neural networks with only one or two hidden layers. We do not detail these works in our survey, although we acknowledge them as pioneering contributions to the neural-based DoA estimation problem. A few more recent works, based on more “modern” neural network architectures, such as [124, 215, 12, 49, 31], also focus on anechoic propagation only, or do not consider sound sources in the audible bandwidth. We do not detail those papers either, since we focus on SSL in real-world reverberant environments.

II-B Source types

In the SSL literature, a great proportion of systems focuses on localizing speech sources, because of the importance of speech in related tasks such as speech enhancement or speech recognition. Examples of speaker localization systems can be found in [27, 65, 76, 71]. In such systems, the neural networks are trained to estimate the DoA of speech sources, so that they are somehow specialized in this type of source. Other systems consider a variety of domestic sound sources, for instance those participating in the DCASE challenge [161, 160]. Depending on the challenge task and its corresponding dataset, these methods are capable of localizing alarms, crying babies, crashes, barking dogs, female/male screams, female/male speech, footsteps, knocking on doors, ringing phones and piano sounds. Note that domestic sound source localization is not necessarily a more difficult problem than multi-speaker localization, since domestic sounds usually have distinct spectral characteristics that neural models may exploit for better detection and localization.

II-C Number of sources

The number of sources (NoS) in a recorded mixture signal is an important parameter for sound source localization. In the SSL literature, the NoS might be considered as known (as a working hypothesis), or alternatively it can be estimated along with the source location, in which case the SSL problem is a combination of detection and localization.

A lot of works consider only one source to localize, as it is the simplest scenario to address, e.g. [155, 18, 123]. We refer to this scenario as single-source SSL. In this case, the networks are trained and evaluated on datasets with at most one active source (a source is said to be active when emitting sound and inactive otherwise; in terms of number of sources, we thus have either 1 or 0 active source). The activity of the source in the processed signal, which generally contains background noise, can be artificially controlled, i.e. the knowledge of source activity is a working hypothesis. This is a reasonable approach at training time when using synthetic data, but quite unrealistic at test time on real-world data. Alternatively, it can be estimated, which is a more realistic approach at test time. In the latter case, there are two ways of dealing with the source activity detection problem. The first is to employ a source detection algorithm beforehand and to apply the SSL method only on the signal portions with an active source. For example, a voice activity detection (VAD) technique has been used in several SSL works [95, 28, 188, 121]. The other way is to detect the activity of the source at the same time as running the localization algorithm. For example, an additional neuron has been added to the output layer of the DNN used in [241], which outputs 1 when no source is active (in that case all the other localization neurons are trained to output 0), and 0 otherwise.

Multi-source localization is a much more difficult problem than single-source SSL. Current state-of-the-art DL-based methods address multi-source SSL in adverse environments. In this survey, we consider as multi-source localization the scenario in which several sources overlap in time (i.e. they are simultaneously emitting), regardless of their type (e.g. there could be several speakers or several distinct sound events). The specific case where several speakers take speech turns with or without overlap is strongly connected to the speaker diarization problem (“who speaks when?”) [213, 6, 150]. Speaker localization, diarization and (speech) source separation are intrinsically connected problems, as the information retrieved from solving each one of them can be useful for addressing the others [226, 103, 90]. An investigation of those connections is out of the scope of the present survey.

In the multi-source scenario, the source detection problem transposes to a source counting problem, but the same considerations as in the single-source scenario hold. In some works, the knowledge of the number of sources is a working hypothesis [65, 128, 75, 153, 51, 17, 64] and the sources’ DoAs can be directly estimated. If the NoS is unknown, one can apply a source counting system beforehand, e.g. a dedicated neural network [63]. For example, in [212], the author trained a separate neural network to estimate the NoS in the recorded mixture signal, then used this information along with the output of the DoA estimation neural network. Alternatively, the NoS can be estimated alongside the DoAs, as in the single-source scenario, based on the SSL network output. When using a classification paradigm, the network output generally predicts the probability of presence of a source within each discretized region of the space (see Section VI). One can thus set a threshold on the estimated probability to detect regions containing an active source, which implicitly provides source counting (note that this problem is common to DL-based multi-source SSL methods and conventional methods, for which a source activity profile is estimated and peak-picking algorithms are typically used to select the active sources). Otherwise, the ground-truth or estimated NoS is typically used to select the corresponding number of classes having the highest probability. Finally, several neural-based systems were purposefully designed to estimate the NoS alongside the DoAs. For example, the method proposed in [141] uses a neural architecture with two output branches: the first one is used to estimate the NoS (up to 4 sources; the problem is formulated as a classification task), and the second branch is used to classify the azimuth into several regions.
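As an illustration of this counting-by-thresholding principle, the following sketch (pure NumPy; the 5-degree azimuth grid, the 0.5 threshold and the synthetic probability profile are arbitrary assumptions) selects the local maxima of a per-class probability output that exceed a threshold, which jointly provides source counting and DoA estimates.

```python
import numpy as np

def detect_sources(probs, azimuth_grid, threshold=0.5):
    """Pick local maxima of the class probabilities that exceed a threshold.

    probs: probability of source presence for each candidate azimuth.
    Returns the estimated number of sources and their azimuths.
    """
    peaks = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else -np.inf
        right = probs[i + 1] if i < len(probs) - 1 else -np.inf
        if p >= threshold and p >= left and p >= right:   # local maximum above threshold
            peaks.append(azimuth_grid[i])
    return len(peaks), peaks

azimuths = np.arange(0, 360, 5)                 # assumed 5-degree classification grid
probs = np.exp(-0.5 * ((azimuths - 40) / 10) ** 2) \
      + 0.9 * np.exp(-0.5 * ((azimuths - 230) / 10) ** 2)  # synthetic two-source output
print(detect_sources(probs, azimuths))          # expected: 2 sources, near 40 and 230 degrees
```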

II-D Moving sources

Source tracking is the problem of localizing moving sources, i.e. sources whose location evolves with time. In this survey paper, we do not address the tracking problem itself, which is usually handled by a separate algorithm using the sequence of DoA estimates obtained by applying SSL on successive time windows [228]. Still, several deep-learning-based SSL systems are shown to produce more accurate localization of moving sources when they are trained on a dataset that includes this type of source [3, 41, 66, 77]. In other cases, as the number of real-world datasets with moving sources is limited and the simulation of signals with moving sources is cumbersome, a number of systems are trained on static sources, but are also shown to retain fair to good performance on moving sources [64, 198].

III Conventional SSL methods

Before the advent of deep learning, a set of signal processing techniques were developed to address SSL. A detailed review of those techniques can be found in [43]. A review in the specific robotics context is provided in [7]. In this section, we briefly present the most common conventional SSL methods. As briefly stated in the introduction, the reason for this is twofold: First, conventional SSL methods are often used as baselines for DL-based methods, and second, many DL-based SSL methods use input features extracted with conventional methods (see Section V).

When the geometry of the microphone array is known, DoA estimation can be performed by estimating the time-difference of arrival (TDoA) of the sources between the microphones [238]. One of the most employed methods to estimate the TDoA is the generalized cross-correlation with phase transform (GCC-PHAT) [100]. The latter is computed as the inverse Fourier transform of a weighted version of the cross-power spectrum (CPS) between the signals of two microphones. The TDoA estimate is then obtained by finding the time-delay between the microphone signals which maximizes the GCC-PHAT function. When an array is composed of more than two microphones, TDoA estimates can be computed for all microphone pairs, which may be exploited to improve the localization robustness [11].
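As an illustration, a minimal NumPy implementation of GCC-PHAT for one microphone pair could look as follows (the sampling rate and the 10-sample toy delay are arbitrary assumptions used only to check the estimator).

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the time delay of x2 relative to x1 using GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cps = X2 * np.conj(X1)                                # cross-power spectrum
    gcc = np.fft.irfft(cps / (np.abs(cps) + 1e-12), n=n)  # PHAT weighting
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    gcc = np.concatenate((gcc[-max_shift:], gcc[:max_shift + 1]))  # re-center around lag 0
    delay = np.argmax(np.abs(gcc)) - max_shift            # lag maximizing the GCC-PHAT function
    return delay / fs, gcc

# Toy check: microphone 2 receives the same noise signal delayed by 10 samples.
fs = 16000
rng = np.random.default_rng(0)
x1 = rng.standard_normal(fs)
x2 = np.concatenate((np.zeros(10), x1[:-10]))
tau, _ = gcc_phat(x1, x2, fs)
print(tau * fs)  # approximately 10 samples
```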

Building an acoustic power (or energy) map is another way to retrieve the DoA of one or multiple sources. The most common approach is through computation of the steered response power with phase transform (SRP-PHAT) [44]. Practically, a signal-independent beamformer [55] is steered towards candidate positions in space, in order to evaluate their corresponding weighted “energies.” The local maxima of such an acoustic map then correspond to the estimated DoAs. As an alternative to building an SRP-based acoustic map, which can be computationally prohibitive as it usually amounts to a grid search, sound intensity-based methods have been proposed [87, 209, 99]. In favorable acoustic conditions, the sound intensity is parallel to the direction of the propagating sound wave (see Section V-E), hence the DoA can be efficiently estimated. Unfortunately, its accuracy quickly degrades in the presence of acoustic reflections [38].
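Building on the GCC-PHAT sketch above, a rudimentary SRP-PHAT map can be obtained by summing, for each candidate direction, the pairwise GCC-PHAT values at the delays that this direction predicts (far-field assumption; the planar array geometry and azimuth grid are arbitrary, and gcc_phat is the function defined in the previous sketch).

```python
import itertools
import numpy as np

C = 343.0  # speed of sound (m/s)

def srp_phat_map(signals, mic_pos, fs, azimuths_deg):
    """Toy far-field SRP-PHAT over a grid of candidate azimuths (planar array)."""
    powers = np.zeros(len(azimuths_deg))
    for i, j in itertools.combinations(range(len(mic_pos)), 2):
        _, gcc = gcc_phat(signals[i], signals[j], fs)      # reuse the GCC-PHAT sketch above
        center = len(gcc) // 2                              # index of lag 0
        for a, az in enumerate(np.deg2rad(azimuths_deg)):
            u = np.array([np.cos(az), np.sin(az)])          # candidate DoA unit vector
            tau = np.dot(mic_pos[i] - mic_pos[j], u) / C    # expected delay of mic j w.r.t. mic i
            powers[a] += gcc[center + int(round(tau * fs))]
    return powers

# Usage sketch: the azimuth maximizing the map is the DoA estimate.
# azimuths = np.arange(0, 360, 5)
# est_az = azimuths[np.argmax(srp_phat_map(signals, mic_pos, fs, azimuths))]
```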

Subspace methods are another classical family of localization algorithms. These methods rely on the eigenvalue decomposition (EVD) of the multichannel (microphone) covariance matrix. Assuming that the target source signals and noise are uncorrelated, the multiple signal classification (MUSIC) method [185] applies EVD to estimate the signal and noise subspaces, whose bases are then used to probe a given direction for the presence of a source. This time-demanding search can be avoided using the Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT) algorithm [177], which exploits the structure of the source subspace to directly infer the source DoA. However, this often comes at the price of producing less accurate predictions than MUSIC [130]. MUSIC and ESPRIT assume narrowband signals, although wideband extensions have been proposed, e.g. [45, 80]. Subspace methods are robust to noise and can produce highly accurate estimates, but are sensitive to reverberation.
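As an illustration of the subspace principle, here is a toy narrowband MUSIC pseudo-spectrum for a linear array (far-field, single frequency bin; the steering model and all parameters are illustrative assumptions).

```python
import numpy as np

def music_spectrum(R, n_sources, mic_x, freq, angles_deg, c=343.0):
    """Narrowband MUSIC pseudo-spectrum for a linear array with mic positions mic_x (m)."""
    eigvals, eigvecs = np.linalg.eigh(R)            # EVD of the spatial covariance matrix
    En = eigvecs[:, :len(mic_x) - n_sources]        # noise subspace (smallest eigenvalues)
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        # Far-field steering vector for a plane wave impinging from angle theta
        a = np.exp(-2j * np.pi * freq * np.asarray(mic_x) * np.sin(theta) / c)
        denom = np.linalg.norm(En.conj().T @ a) ** 2  # projection onto the noise subspace
        spectrum.append(1.0 / (denom + 1e-12))        # peaks indicate source DoAs
    return np.array(spectrum)

# Usage sketch: R is the covariance of the STFT coefficients at one frequency bin,
# averaged over time frames; the DoA estimates are the peaks of the returned spectrum.
```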

Methods based on probabilistic generative mixture models have been proposed in, e.g., [173, 131, 133, 234, 46, 120]. Typically, the models are variants of Gaussian mixture models (GMMs), with one Gaussian component per source to be localized and/or per candidate source position. The model parameters are estimated with histogram-based or Expectation-Maximization algorithms exploiting the sparsity of sound sources in the time-frequency domain [170]. When this is done at runtime (i.e. using the test data with sources to be localized), the source localization can be computationally intensive. A variant of such a model functioning directly in regression mode (in other words, a form of Gaussian mixture regression (GMR)) has been proposed for single-source localization in [39] and extended to multi-source localization in [40]. The GMR is locally linear but globally non-linear, and the model parameter estimation is done offline on training data, hence the spirit is close to DNN-based SSL. In [39, 40], white noise signals convolved with synthetic RIRs are used for training. The method generalizes well to speech signals, which are sparser than noise in the TF domain, thanks to the use of a latent variable modeling the signal activity in each TF bin.

Finally, Independent Component Analysis (ICA) is a class of algorithms aiming to retrieve the different source signals composing a mixture by assuming and exploiting their mutual statistical independence. It has been most often used in audio processing for blind source separation, but it has also proven to be useful for multi-source SSL [183]. As briefly stated before, in the multi-source scenario, SSL is closely related to the source separation problem: localization can help separation and separation can help localization [55, 227]. A deep investigation of this topic is out of the scope of the present paper.

IV Neural network architectures for SSL

In this section, we discuss the neural network architectures that have been proposed in the literature to address the SSL problem. However, we do not present the basics of these neural networks, since they have been extensively described in the general deep learning literature, see e.g. [115, 59, 32]. The design of neural networks for a given application often requires investigating (and possibly combining) different architectures and tuning their hyperparameters. We have organized the presentation according to the type of layers used in the networks, with a progressive and “inclusive” approach in terms of complexity: a network within a given category can contain layers from another previously presented category. We thus first present systems based on feedforward neural networks (FFNNs). We then focus on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which generally incorporate some feedforward layers. Then, we review architectures combining CNNs with RNNs, namely convolutional recurrent neural networks (CRNNs). Then, we focus on neural networks with residual connections and with attention mechanisms. Finally, we present neural networks with an encoder-decoder structure.

IV-A Feedforward neural networks

The feedforward neural network was the first and simplest type of artificial neural network to be designed. In such a network, data moves in one direction from the input layer to the output layer, possibly via a series of hidden layers [59, 115]. Non-linear activation functions are usually used after each layer (possibly except for the output layer). While this definition of FFNN is very general, and may include architectures such as CNNs (discussed in the next subsection), here we mainly focus on fully-connected architectures known as the Perceptron and the Multi-Layer Perceptron (MLP) [59, 115]. A Perceptron has no hidden layer, while the notion of MLP is a bit ambiguous: some authors state that an MLP has one hidden layer while others allow more hidden layers. In this paper, we call an MLP a feedforward neural network with one or more hidden layers.
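As a minimal illustration of such an architecture in the SSL context (hypothetical feature and layer sizes, not reproducing any specific paper), an MLP mapping a fixed-size feature vector, e.g. concatenated GCC-PHAT values, to azimuth classes can be written in PyTorch as follows.

```python
import torch.nn as nn

# Hypothetical sizes: 6 microphone pairs x 51 GCC-PHAT lags as input, 72 azimuth classes.
N_FEATURES, N_HIDDEN, N_CLASSES = 6 * 51, 128, 72

mlp = nn.Sequential(
    nn.Linear(N_FEATURES, N_HIDDEN), nn.ReLU(),   # first hidden layer
    nn.Linear(N_HIDDEN, N_HIDDEN), nn.ReLU(),     # second hidden layer
    nn.Linear(N_HIDDEN, N_CLASSES),               # one output neuron per candidate DoA
)
# Typically trained with a cross-entropy loss against the ground-truth azimuth class.
```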

A few pioneering SSL methods using shallow neural networks (Perceptron or 1-hidden-layer MLP) and applied in “unrealistic” setups (e.g. assuming direct-path sound propagation only) have been briefly mentioned in Section II-A. One of the first uses of an MLP for SSL was proposed by Kim and Ling [97] in 2011. They actually consider several MLPs: one network estimates the number of sources, then a distinct network is used for SSL for each considered NoS. They evaluate their method on reverberant data even though they assume an anechoic setting. In [214], the authors showed that using a complex-valued MLP on complex two-microphone-based features led to better results than using a real-valued MLP. In [245], the authors also used an MLP to estimate the azimuth of a sound source from a binaural recording made with a robot head. The interaural time difference (ITD) and the interaural level difference (ILD) values (see Section V) are separately fed into the input layer and are each processed by a specific set of neurons. A single-hidden-layer MLP is presented in [237], taking GCC-PHAT-based features as inputs and tackling SSL as a classification problem (see Section VI), which showed an improvement over conventional methods on simulated and real data. A similar approach was proposed in [225], but the localization is done by regression in the horizontal plane.

Naturally, MLPs with deeper architectures (i.e. more hidden layers) have also been investigated for SSL. In [172], Roden et al. compared the performance of an MLP with two hidden layers and different input types, the number of hidden neurons being linked to the type of input features (see Section V for more details). In [244], an MLP with three hidden layers (tested with different numbers of neurons) is used to output source azimuth and distance estimates. An MLP with four hidden layers has been tested in [73] for multi-source localization and speech/non-speech classification, showing similar results to a 4-layer CNN (see Section IV-B).

Ma et al. [128] proposed to use a different MLP for each of several frequency sub-bands, each MLP having eight hidden layers. The output of each sub-band MLP corresponds to a probability distribution over azimuth regions, and the final azimuth estimate is obtained by integrating the probability values over the frequency bands. Another system in the same vein was proposed by Takeda et al. in [205, 204, 206, 207]. In these works, the eigenvectors of the recorded signal interchannel correlation matrix are separately fed per frequency band into specific fully-connected layers. Then several additional fully-connected layers progressively integrate the frequency-dependent outputs (see Fig. 2). The authors show that this specific architecture outperforms a more conventional 7-layer MLP and the classical MUSIC algorithm on anechoic and reverberant single- and multi-source signals. Opochinsky et al. [144] proposed a small 3-layer MLP to estimate the azimuth of a single source using the relative transfer function (RTF) of the signal (see Section V). Their approach is weakly supervised since one part of the loss function is computed without the ground-truth DoA labels (see Section VIII).

An indirect use of an MLP is explored in [145], where the authors use a 3-layer MLP to enhance the interaural phase difference (IPD) (see Section V) of the input signal, which is then used for DoA estimation.

Fig. 2: Multi-Layer Perceptron architecture used in [205, 204, 206, 207]. Multiple sub-band feedforward layers are trained to extract features from the eigenvectors of the multichannel signal correlation matrix. The obtained sub-band vectors are integrated progressively via other feedforward layers. The output layer finally classifies its input into one of the candidate DoAs.

IV-B Convolutional neural networks

Convolutional neural networks are a popular class of deep neural networks widely used for pattern recognition, due to their property of being translation invariant [114]. They have been successfully applied to various tasks such as image classification [107], natural language processing (NLP) [96] or automatic speech recognition [229]. CNNs have also been used for SSL, in various configurations (e.g. plain, dilated or 3D convolutions), as detailed below.

To our knowledge, Hirvonen [78] was the first to use a CNN for SSL. He employed this architecture to classify an audio signal containing one speech or musical source into one of eight spatial regions (see Fig. 3). This CNN is composed of four convolutional layers to extract feature maps from multichannel magnitude spectrograms (see Section V), followed by four fully-connected layers for classification. Classical pooling is not used because, according to the author, it does not seem relevant for audio representations. Instead, a 4-tap stride with a 2-tap overlap is used to reduce the number of parameters. This approach shows good performance on single-source signals and proves to be able to adapt to different configurations without hand-engineering. However, two open issues of such a system were pointed out by the author: the robustness of the network with respect to a shift in source location and the difficulty of interpreting the hidden features.

Fig. 3: The CNN architecture proposed in [78] for SSL. The magnitude spectrum of one frame is fed into a series of four convolutional layers with 500 or 600 learnable kernels. The extracted features then pass through several feedforward layers. The output layer estimates the probability of a source being present in each of eight candidate DoAs.

Chakrabarty and Habets also designed a CNN to predict the azimuth of one [24] or two speakers [27, 25] in reverberant environments. The input features are the multichannel short-time Fourier transform (STFT) phase spectrograms (see Section V). In [24], they propose to use three successive convolutional layers with small filters spanning neighbouring frequency bands and microphones. In [25], they reduce the filter size along the frequency axis, because of the W-disjoint orthogonality (WDO) assumption for speech signals, which assumes that several speakers are not simultaneously active in the same time-frequency bin [170]. In [27], they prove that the optimal number of convolutional layers is determined by the number of microphones in the array.

In [73], a 4-layer MLP and a 4-layer CNN were compared on the multi-speaker detection and localization task. The results showed similar accuracy for both architectures. A much deeper architecture was proposed in [241], with 11 to 20 convolutional layers depending on the experiments. These deeper CNNs showed robustness against noise compared to MUSIC, as well as shorter training time, but this was partly due to the presence of residual blocks (see Section IV-E). A similar architecture was presented in [74], with many convolutional layers and some residual blocks, though with a specific multi-task configuration: the end of the network is split into two convolutional branches, one for azimuth estimation, the other for speech/non-speech signal classification.

While most localization systems aim to estimate the azimuth or both the azimuth and elevation, the authors of [211] investigated the estimation of only the elevation angle using a CNN with binaural input features: the ipsilateral and contralateral head-related transfer function (HRTF) magnitude responses (see Section V). In [223], Vera-Diaz et al. chose to apply a CNN directly on raw multichannel waveforms, assembled side by side as an image, to predict the cartesian coordinates of a single static or moving speaker. The successive convolutional layers contain around a hundred filters, whose size varies between the first and last layers. In [129], Ma and Liu also used a CNN to perform regression, but they used the cross-power spectrum matrix as input feature (see Section V). To estimate both the azimuth and elevation, the authors of [139] used a relatively small CNN (two convolutional layers) in regression mode, with binaural input features. A similar approach was considered by Sivasankaran et al. in [193] for speaker localization based on a CNN. They show that injecting a speaker identifier, in particular a mask estimated for the speaker uttering a given keyword, alongside the binaural features at the input layer improves the DoA estimation.

A joint VAD and DoA estimation CNN has been developed by Vecchiotti et al. in [220]. They show that both problems can be handled jointly in a multi-room environment using the same architecture, albeit with separate input features (GCC-PHAT and log-mel spectrograms) fed into two separate input branches. These branches are then concatenated in a further layer. They extend this work in [222] by exploring several variant architectures and experimental configurations. An end-to-end auditory-inspired system based on a CNN has been developed in [221], in which Gammatone filter layers are included in the neural architecture. A method based on mask estimation is proposed in [250], in which a time-frequency mask is estimated and used either to clean the input features or to be appended to them, facilitating the DoA estimation by a CNN.

In [141], Nguyen et al. presented a multi-task CNN containing 10 convolutional layers with average pooling, inferring both the number of sources and their DoA. They evaluated their network on signals with up to four sources, showing very good performance in both simulated and real environments. A small 3-layer CNN is employed in [216] to infer both azimuth and elevation using signals decomposed with third-order spherical harmonics (see Section V). The authors tried several combinations of input features, including using only the magnitude and/or the phase of the spherical harmonic decomposition.

In the context of hearing aids, a CNN has been applied for both VAD and DoA estimation in [218]. The system is based on two input features, GCC-PHAT and periodicity degree, both fed separately into two convolutional branches. These two branches are then concatenated in a further layer which is followed by feedforward layers. In [51], Fahim et al. applied an 8-layer CNN to first-order Ambisonics modal coherence input features (see Section V) for localization of multiple sources in a reverberant environment. They propose a new method to train a multi-source DoA estimation network with only single-source training data, showing an improvement over [27], especially for signals with three speakers. A real-time investigation of SSL using a CNN is provided in [71], with a relatively small architecture (three convolutional layers).

In [105], a study of several types of convolution has been proposed. The authors found that networks using 3D convolutions (over the time, frequency and channel axes) achieve better localization accuracy than those based on 2D convolutions, complex convolutions and depth-wise separable convolutions (all of them over the time and frequency axes), but with a high computational cost. They also show that the use of depth-wise separable convolutions leads to a good trade-off between accuracy and model complexity (to our knowledge, they are the first to explore this type of convolution).

In [18], the neural network architecture includes a set of 2D convolutional layers for frame-wise feature extraction, followed by several 1D convolutional layers in the time dimension for temporal aggregation. In [41], 3D convolutional layers are applied on SRP-PHAT power maps computed for both azimuth and elevation estimation. The authors also use a couple of 1D causal convolutional layers at the end of the network to perform tracking. Their whole architecture has been designed to function in fully causal mode so that it is adapted for real-time applications.

CNNs have also been used for Task 3 of the DCASE challenge (sound event detection and localization) [161, 160]. In [33], convolutional layers with hundreds of filters are used for azimuth and elevation estimation in a regression mode. Kong et al. [102] compared different numbers of convolutional layers for SELD, while an 8-layer CNN was proposed in [142] to improve the results over the baseline.

An indirect use of a CNN is proposed in [180]. The authors trained the neural network to estimate a weight for each of the narrow-band SRP components fed at the input layer, in order to compute a weighted combination of these components. In their experiments, they show on a few test examples that this allows a better fusion of the narrow-band components and reduces the effects of noise and reverberation, leading to better localization accuracy.

In the DoA estimation literature, a few works have explored the use of dilated convolutions in deep neural networks. Dilated convolutions, also known as atrous convolutions, are a type of convolutional layer in which the convolution kernel is wider than the classical one but zeros are inserted between its taps, so that the number of parameters remains the same. Formally, a 1D dilated convolution with a dilation factor $d$ is defined by:

$$(x \ast_d w)[n] = \sum_{k} x[n - d\,k]\, w[k], \qquad (1)$$

where $x$ is the input and $w$ is the convolution kernel. The conventional linear convolution is obtained with $d = 1$. This definition extends to multidimensional convolutions.
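For instance, a stack of 1D dilated convolutions with gradually increasing dilation factors can be written in PyTorch as follows (a minimal sketch; the kernel size, channel counts and dilation factors are arbitrary).

```python
import torch
import torch.nn as nn

# Three 1D convolutional layers with the same kernel size but increasing dilation factors,
# so the receptive field grows while the number of parameters per layer stays the same.
dilated_stack = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
)
y = dilated_stack(torch.randn(1, 1, 256))   # (batch, channels, time)
print(y.shape)                              # torch.Size([1, 16, 256])
```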

In [26], Chakrabarty and Habets showed that the optimal number of convolutional layers established for the conventional convolution framework in [27] can be reduced by incorporating dilated convolutions with gradually increasing dilation factors. This leads to an architecture with similar SSL performance and lower computational cost.

IV-C Recurrent neural networks

Recurrent neural networks are neural networks designed for modeling temporal sequences of data [115, 59]. Particular types of RNNs include long short-term memory (LSTM) cells [79] and gated recurrent units (GRU) [30]. These two types of RNNs have become very popular thanks to their capability to circumvent the training difficulties that regular RNNs were facing, in particular the vanishing gradient problem [115, 59].

There are not many published works on SSL using only RNNs, as recurrent layers are often combined with convolutional layers (see Section IV-D). In [140], an RNN is used to align the sound event detection (SED) and DoA predictions, which are obtained separately for each possible sound event type. The RNN is ultimately used to find which SED prediction matches which DoA estimate. A bidirectional LSTM network is used in [232] to estimate a time-frequency (TF) mask to enhance the signal, further facilitating DoA estimation by conventional methods such as SRP or subspace methods.

IV-D Convolutional recurrent neural networks

Convolutional recurrent neural networks are neural networks containing one or more convolutional layers and one or more recurrent layers. CRNNs have been regularly exploited for SSL since 2018, because of the respective capabilities of these layers: convolutional layers have proved to be suitable for extracting relevant features for SSL, and recurrent layers are well designed for integrating information over time.
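A generic CRNN for SSL typically chains these two stages, as in the following illustrative sketch (arbitrary layer sizes and a hypothetical 4-channel, 72-class setup, not reproducing any particular system): convolutional layers extract per-frame features, and a recurrent layer integrates them over time before a frame-wise DoA classification.

```python
import torch
import torch.nn as nn

class ToyCRNN(nn.Module):
    """Conv layers extract per-frame features; a GRU integrates them over time."""
    def __init__(self, n_channels=4, n_freq=64, n_classes=72):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(32 * (n_freq // 16), 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                     # x: (batch, channels, time, freq)
        h = self.conv(x)                      # (batch, 32, time, freq / 16)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * freq / 16)
        h, _ = self.gru(h)                    # temporal integration
        return self.fc(h)                     # per-frame DoA class scores

out = ToyCRNN()(torch.randn(1, 4, 100, 64))
print(out.shape)  # torch.Size([1, 100, 72])
```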

In [3, 2, 1], Adavanne et al. used a CRNN for sound event localization and detection, in a multi-task configuration, with first-order Ambisonics (FOA) input features (see Section V). In [2], their architecture contains a series of successive convolutional layers, each one followed by a max-pooling layer, and two bidirectional GRU (BGRU) layers. Then a feedforward layer performs a spatial pseudo-spectrum (SPS) estimation, acting as an intermediary output (see Fig. 4). This SPS is then fed into the second part of the neural network, which is composed of two convolutional layers, a dense layer, two BGRU layers and a final feedforward layer for azimuth and elevation estimation by classification. The use of an intermediary output has been proposed to help the neural network learn a representation that has proved to be useful for SSL using traditional methods.

In [1] and [3], they do not use this intermediary output anymore and directly estimate the DoA using a block of convolutional layers, a block of BGRU layers and a feedforward layer. This system is able to localize and detect several sound events even if they overlap in time, provided they are of different types (e.g. speech and car, see the discussion in Subsection II-B). This CRNN was the baseline system for Task 3 of the DCASE challenge in 2019 [161] and 2020 [160]. Therefore, it has inspired many other works, and many DCASE challenge candidate systems were built over [1] with various modifications and improvements. In [122], Gaussian noise is added to the input spectrograms to train the network to be more robust to noise. The author of [125] integrates some additional convolutional layers and replaces the bidirectional GRU layers with bidirectional LSTM layers. In [126], the same architecture is reused with all combinations of cross-channel power spectra, whereas the replacement of input features with group delays is tested in [143]. GCC-PHAT features are added as input features in [132]. In [249], Zhang et al. use data augmentation during training and average the output of the network for a more stable DoA estimation. In [240], the input features are separately fed into different branches of convolutional layers, log-mel and constant-Q-transform features on the one hand, phase spectrograms and cross-power spectrum on the other hand (see Section V). In [22], the authors concatenate the log-mel spectrogram, the intensity vector and GCC-PHAT features and feed them into two separate CRNNs for SED and DoA estimation. In contrast to [1], more convolutional layers and a single BGRU layer are used. The convolutional part of the DoA network was transferred from the SED CRNN and then followed by fine-tuning of the DoA branch, hence the method is referred to as two-stage. This led to a notable improvement in localization performance over the DCASE baseline [1]. Small changes to the system of [22] have been tested in [89], such as the use of Bark spectrograms as input features, the modification of activation functions or pooling layers, and the use of data augmentation, resulting in noticeable improvements for some experiments.

The same neural architecture as in [1] is used in [94], with one separate (but identical except for the output layer) CRNN instance for each subtask: source counting (up to two sources), DoA estimation of source 1 (if applicable), DoA estimation of source 2 (if applicable) and sound type classification. They show that their method is more efficient than the baseline. In [104], different manners of splitting the SED and DoA estimation tasks in a CRNN are explored. While some configurations show improvement in SED, the localization accuracy is below the baseline for the reported experiments. A combination of a gated linear unit (GLU, a convolutional block with a gating mechanism) and a trellis network (containing convolutional and recurrent layers, see [9] for more details) is investigated in [148], showing better results than the baseline. The authors extend this work for the DCASE 2020 challenge, by improving the overall architecture and investigating other loss functions [149]. A non-direct DoA estimation scheme is also derived in [62], in which the authors estimate the TDoA using a CRNN, and then infer the DoA from it.

We also find CRNN-based systems proposed in the 2020 edition of the DCASE challenge [160]. In [192], the same CRNN as in the baseline [1] is used, except that the authors do not use two separate output branches for SED and DoA estimation. Instead, they concatenate the SED output with the output of the previous layer to estimate the DoA. In [194], Song uses separate neural networks similar to the one in [1] to address NoS estimation and DoA estimation in a sequential way. Multiple CRNNs are trained in [212]: one to estimate the NoS (up to two sources), another to estimate the DoA assuming one active source, and another (same as the baseline) to estimate the DoAs of two simultaneously active sources. In [23], Cao et al. designed an end-to-end CRNN architecture to detect and estimate the DoA of possibly two instances of the same sound event. The addition of one-dimensional convolutional filters has been investigated in [174], in order to exploit the information along the feature axes. In [181], the baseline system of [1] is improved by providing more input features (log-mel spectrograms, GCC-PHAT and intensity vector, see Section V) to the network.

Fig. 4: The CRNN architecture of [2, 3, 1]. Two bidirectional GRU layers follow a series of convolutional layers to capture the temporal evolution of the extracted features. This scheme is used to estimate the spatial pseudo-spectrum as an intermediate output feature, as well as the DoA of the sources.

Independently of the DCASE challenge, Adavanne et al.’s CRNN was adapted in [36] to receive quaternion FOA input features (see Section V), which slightly improved the CRNN performance. Perotin et al. proposed to use a CRNN with bidirectional LSTM layers on the FOA intensity vector to localize one [155] or two [153] speakers. They showed that this architecture achieves very good performance in simulated and real reverberant environments with static speakers. This work was extended in [64], in which a substantial improvement in performance over [153] was obtained by adding more convolutional layers with less max-pooling, to localize up to three simultaneous speakers.

Non-square convolutional filters and a unidirectional LSTM layer are used in the CRNN architecture of [118]. In [239], a CRNN is presented with two types of input features: the phase of the cross-power spectrum and the signal waveforms. The former is first processed by a series of convolutional layers, before being concatenated with the latter.

Another improvement of [1] has been proposed in [101], in which the authors replaced the classical convolutional blocks with GLUs, with the idea that GLUs are better suited for extracting relevant features from phase spectrograms. This has led to a notable improvement of localization performance compared to [1]. In [17], an extension of [27] was proposed in which LSTMs and temporal convolutional networks (TCNs) replace the last dense layer of the former architecture. A TCN is made of successive 1D dilated causal convolutional layers with increasing dilation factors [113]. The authors showed that taking the temporal context into account with such temporal layers actually improves the localization accuracy.

IV-E Residual neural networks

Residual neural networks were originally introduced in [72], where the authors point out that designing very deep neural networks can lead the gradients to explode or vanish due to the non-linear activation functions, as well as degrade the overall performance. Residual connections are designed to enable features to bypass a layer block, in parallel with the conventional processing through this layer block. This allows the gradients to flow directly through the network, usually leading to better training.
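A basic residual block (a sketch with arbitrary channel counts) simply adds the block input to the output of its convolutional layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutional layers whose output is added to the block input (skip connection)."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection: gradients can bypass the body

y = ResidualBlock()(torch.randn(1, 32, 100, 64))
print(y.shape)  # torch.Size([1, 32, 100, 64])
```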

To our knowledge, the first use of a network with residual connections for SSL can be found in [241]. As illustrated in Fig. 5, the network includes three residual blocks, each of them made of three convolutional layers, with the first and last layers using a different filter size than the middle one. A residual connection is used between the input and output of each residual block. The same type of residual block is used for SSL by He et al. in [75, 74], in parallel with the classification of the sound as speech or non-speech. In [199], a series of 1D convolutional layers is used with several residual connections for single-source localization, directly from the multichannel waveform.

In [163, 164], Pujol et al. integrate residual connections alongside 1D dilated convolutional layers with increasing dilation factors. They use the multichannel waveform as network input. After the input layer, the architecture is divided into several subnetworks containing the dilated convolutional layers, which function as filterbanks. In [167], a modified version of the original ResNet architecture [72] has been combined with recurrent layers for SELD. This was shown to reduce the DoA error by more than 20° compared to the baseline [1]. Similarly, in [109] Kujawski et al. have adopted the original ResNet architecture and applied it to the single-source localization problem.

Another interesting architecture containing residual connections has been proposed in [138] for the DCASE 2020 challenge. Before the recurrent layers (consisting of two bidirectional GRUs), three residual blocks successively process the input features. These residual blocks contain two residual convolutional layers, followed by a squeeze-excitation module [81]. These modules aim to improve the modeling of interdependencies between input feature channels compared to classical convolutional layers. Similar squeeze-excitation mechanisms are used in [198] for multi-source localization.

In [190, 189], Shimada et al. adapted the MMDenseLSTM architecture, originally proposed in [202] for sound source separation, to the SELD problem. This architecture consists of a series of blocks made of convolutional and recurrent layers with residual connections. Their system showed very good performance among the challenge candidates. In [231], an ensemble learning approach has been used, where several variants of residual neural networks and recurrent layers were trained to estimate the DoA, achieving the best performance in the DCASE 2020 challenge.

Fig. 5: The residual neural network architecture used in [241]. The residual connections are represented with dashed arrows.

In [66], the authors designed a neural network with TCNs in addition to classical 2D convolutions and residual connections. Instead of using recurrent layers as usually considered, the architecture is composed of TCN blocks, which are made of several residual blocks including 1D dilated convolutional layers with increasing dilation factors. They show that replacing recurrent layers with TCNs makes the hardware implementation of the network more efficient, while slightly improving the SELD performance compared to [1].

In [243], a CRNN with residual connections has been exploited in an indirect way for DoA estimation, using a FOA intensity vector input (see Section V). A CRNN is first used to remove the reverberant part of the FOA intensity vector, then another CRNN is used to estimate a time-frequency mask, which is applied to attenuate TF bins with a large amount of noise. The source DoA is finally estimated from the dereverberated and denoised intensity vector.

IV-F Attention-based neural networks

An attention mechanism is a method which allows a neural network to put emphasis on the vectors of a temporal sequence that are more relevant for a given task. Originally, attention was proposed by Bahdanau et al. [8] to improve sequence-to-sequence models for machine translation. The general principle is to allocate a different weight to the vectors of the input sequence when using a combination of those vectors for estimating a vector of the output sequence. The model is trained to compute the optimal weights that reflect both the link between vectors of the input sequence (self-attention) and the relevance of the input vectors to explain each output vector (attention at the decoder). This pioneering work has inspired the now popular Transformer architecture [219], which greatly improved machine translation performance. Attention models are now used in more and more DL applications, including SSL.
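In practice, such a self-attention layer can be applied to a sequence of frame-wise features using PyTorch's built-in multi-head attention module, as in the following sketch (the embedding size, number of heads and sequence length are arbitrary assumptions).

```python
import torch
import torch.nn as nn

embed_dim, n_heads = 128, 4
attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

frames = torch.randn(1, 100, embed_dim)       # e.g. frame-wise features from conv layers
# Self-attention: each frame is re-weighted according to its relevance to all other frames.
out, weights = attn(frames, frames, frames)
print(out.shape, weights.shape)               # (1, 100, 128) and (1, 100, 100)
```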

In [158, 157], the authors submitted an attention-based neural system for the DCASE 2020 challenge. Their architecture is made of several convolutional layers, followed by a bidirectional GRU, then a self-attention layer is used to infer the activity and the DoA of several distinct sound events at each timestep. In [187], an attention mechanism is added after the recurrent layers of a CRNN to output an estimation of the sound source activity and its azimuth/elevation. Compared to [1], the addition of attention showed a better use of temporal information for SELD.

Multi-head self-attention, which is the use of several Transformer-type attention models in parallel [219], has also inspired SSL methods. For example, the authors of [21] placed an 8-head attention layer after a series of convolutional layers to track the source location predictions over time for different sources (up to two sources in their experiments). In [186], Schymura et al. used three 4-head self-attention encoders along the time axis, after a series of convolutional layers, before estimating the activity and location of several sound events (see Fig. 6). This neural architecture showed an improvement over the baseline [1]. In [230], the authors adapted to SSL the Conformer architecture [68], originally designed for automatic speech recognition. This architecture is composed of a feature extraction module based on ResNet, and a multi-head self-attention module which learns local and global context representations. The authors showed the benefit of using a specific data augmentation technique with this model. Finally, Grumiaux et al. showed in [65] that replacing the recurrent layers of a CRNN with self-attention encoders notably reduces the computation time. Moreover, the use of multi-head self-attention slightly improves the localization performance over the baseline CRNN architecture [153], for the considered multiple-speaker localization task.

Fig. 6: The self-attention-based neural network architecture of [186]. This system is made of a feature extraction block including convolutional layers (not detailed in the figure), followed by a self-attention module identical to a Transformer encoder.

IV-G Encoder-decoder neural networks

Encoder-decoder architectures have been largely explored in the deep learning literature due to their capability to provide compact data representations in an unsupervised manner [59]. Here, we refer to as an encoder-decoder network an architecture made of two building blocks: an encoder, which is fed with the input features and outputs a specific representation of the input data, and a decoder, which transforms this new data representation into the desired output data.

IV-G1 Autoencoder

An autoencoder (AE) is an encoder-decoder neural network which is trained to output a copy of its input. Often, the dimension of the encoder's last layer output is small compared to the dimension of the data. This layer is then known as the bottleneck layer and it provides a compressed encoding of the input data. Originally, autoencoders were made of feed-forward layers, but nowadays this term is also used to designate AE networks with other types of layers, such as convolutional or recurrent layers.
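As an illustration, a minimal feed-forward autoencoder with a bottleneck layer might look as follows (layer sizes are purely illustrative):

```python
import torch.nn as nn

# A minimal feed-forward autoencoder: the 16-dimensional bottleneck forces a
# compressed encoding of the 257-dimensional input (e.g. a spectral frame).
autoencoder = nn.Sequential(
    nn.Linear(257, 64), nn.ReLU(),   # encoder
    nn.Linear(64, 16),  nn.ReLU(),   # bottleneck layer
    nn.Linear(16, 64),  nn.ReLU(),   # decoder
    nn.Linear(64, 257),              # reconstruction of the input
)
# Training would minimize e.g. nn.MSELoss() between the input and the output.
```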

To the best of our knowledge, the first use of an autoencoder for DoA estimation was reported in [247]. The authors use a simple autoencoder to estimate TF masks for each possible DoA, which are then used for source separation. An interesting AE-based method is presented in [82], where an ensemble of AEs is trained to reproduce the multichannel input signal at the output, with one AE per candidate source position. Since the common latent information among the different channels is the dry signal, each encoder approximately deconvolves the signal from a given microphone. These dry signal estimates should be similar provided that the source is indeed at the assumed position, hence localization is performed by finding the AE with the most consistent latent representation. However, it is not clear whether this model can generalize well to unseen source positions and acoustic conditions.

In [137], Le Moing et al. presented an autoencoder with a large number of convolutional and transposed convolutional layers (in an autoencoder, a transposed convolutional layer is a decoder layer that performs the inverse operation of the corresponding convolutional layer of the encoder). The network estimates the potential source activity in each subregion of a plane divided into a grid, making it possible to locate multiple sources. The authors evaluated several types of outputs (binary, Gaussian-based, and binary followed by regression refinement), which showed promising results on simulated and real data. They extended their work in [136], where they proposed to use adversarial training (see Section VIII) to improve the network performance on real data, as well as on microphone arrays unseen in the training set, in an unsupervised training scheme. To do so, they introduced a novel explicit transformation layer which helps the network to be invariant to the microphone array layout. Another encoder-decoder architecture is proposed in [77], in which a multichannel waveform is fed into a filter bank with learnable parameters, then a 1D convolutional encoder-decoder network processes the filter bank output. The output of the last decoder is then fed separately into two branches, one for SED and the other for DoA estimation.

An encoder-decoder structure with one encoder followed by two separate decoders was proposed in [235]. Signals recorded from several microphone arrays are first transformed into the short-term Fourier transform domain (see Section V) and then stacked in a 4D tensor (whose dimensions are time, frequency, microphone array, and microphone). This tensor is sent to the encoder block, which is made of a series of convolutional layers followed by several residual blocks. The output of the encoder is then fed into two separate decoders: the first one is trained to output a probability of source presence for each candidate region, while the second one is trained in the same way but with a range compensation to make the network more robust.

An indirect use of an autoencoder is proposed in [224], in which convolutional and transposed convolutional layers are used to estimate a time-delay function similar to the TDoA from GCC-based input features. The main idea is to rely on the capacity of the encoder-decoder to reduce the dimension of the input data, so that the bottleneck representation forces the decoder to output a smoother version of the time-delay function. This technique is shown to outperform the classical GCC-PHAT method in the reported experiments.

IV-G2 Variational autoencoder

A variational autoencoder (VAE) is a generative model that was originally proposed in [98] and [169], and that is now very popular in the deep learning community. A VAE can be seen as a probabilistic version of an AE: unlike a classical AE, it learns a probability distribution of the data at the output of the decoder and it also models the probability distribution of the so-called latent vector at the bottleneck layer. New data can thus be obtained with the decoder by sampling those distributions.
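A minimal sketch of the VAE principle is given below, with an encoder predicting the mean and log-variance of the latent distribution and the usual reparameterization trick; the dimensions are illustrative and the KL term of the loss is only indicated in a comment:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs the mean and log-variance of a Gaussian
    latent distribution; new data can be generated by sampling it and decoding."""
    def __init__(self, dim_in=257, dim_z=16):
        super().__init__()
        self.enc = nn.Linear(dim_in, 64)
        self.mu, self.logvar = nn.Linear(64, dim_z), nn.Linear(64, dim_z)
        self.dec = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar  # the training loss adds a KL term on (mu, logvar)
```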

To our knowledge, Bianco et al. were the first to apply a VAE for SSL [15]. Their VAE, made of convolutional layers, was trained to generate the phase of inter-microphone relative transfer functions (RTFs, see Section V-A) for multichannel speech signals, jointly with a classifier which estimates the speaker’s DoA from the RTFs. They trained this network with labeled and unlabeled data and showed that it outperforms an SRP-PHAT-based method as well as a supervised CNN in reverberant scenarios.

IV-G3 U-Net architecture

A U-Net architecture is a particular fully-convolutional neural network originally proposed in [175] for biomedical image segmentation. In U-Net, the input features are decomposed into successive feature maps through the encoder layers and then recomposed into “symmetrical” feature maps through the decoder layers, similarly to CNNs. Having feature maps of the same dimension at the same level of the encoder and decoder makes it possible to propagate information directly from an encoder level to the corresponding decoder level via residual connections. This leads to the typical U-shape schematization, see Fig. 7.
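The following sketch shows a one-level U-Net-like block in PyTorch, where the encoder feature map is concatenated to the decoder input through such a connection; channel counts and kernel sizes are illustrative, and the input spatial dimensions are assumed even:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net: the encoder feature map is concatenated to the decoder
    input via a skip connection, so fine-grained information is preserved."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Conv2d(1, ch, 3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.dec = nn.Conv2d(2 * ch, 1, 3, padding=1)   # 2*ch: upsampled + skip

    def forward(self, x):                                # x: (batch, 1, freq, time)
        e = torch.relu(self.enc(x))
        m = torch.relu(self.mid(self.down(e)))
        u = self.up(m)
        return self.dec(torch.cat([u, e], dim=1))        # skip connection
```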

Regarding SSL and DoA estimation, several works have been inspired by the original U-Net paper. In [29], Chazan et al. employed such an architecture to estimate one TF mask per considered DoA (see Fig. 7), where each time-frequency bin is associated with a single particular DoA. This spectral mask is finally applied for source separation. Another joint localization and separation system based on a U-Net architecture is proposed in [90]. In this system, the authors train a U-Net based on 1D convolutional layers and GLUs. The input is the multichannel raw waveform accompanied by an angular window which helps the network perform separation in a particular zone. If the output of the network on the window is empty, no source is detected; otherwise, one or more sources are detected and the process is repeated with a smaller angular window, until the angular window reaches the target resolution. This system shows interesting results on both synthetic and real reverberant data containing up to eight speakers.

For the DCASE 2020 challenge, a U-Net with several bidirectional GRU layers in-between the convolutional blocks was proposed for SELD in [151]. The last transposed convolutional layer of this U-Net outputs a single-channel feature map per sound event, corresponding to its activity and DoA for all frames. It showed an improvement over the baseline [1] in terms of DoA error. In [35], a U-Net architecture is used in the second part of the proposed neural network to estimate both the azimuth and the elevation. The first part, composed of convolutional layers, learns to map GCC-PHAT features to the ray space transform [13] as an intermediate output, which is then used as the input of the U-Net part.

Fig. 7: The U-Net network architecture of [29]. Several levels of encoders and decoders are used. In each encoder, several convolutional layers are used, and the output of the last layer is both transmitted to the next level's encoder and concatenated to the input of the same-level decoder via a residual connection. The output consists of one time-frequency mask per considered DoA.

V Input features

In this section, we provide an overview of the variety of input feature types found in the deep-learning-based SSL literature. Generally, the considered features can be low-level signal representations such as waveforms or spectrograms, hand-crafted features such as binaural features (ILD, ITD), or borrowed from traditional signal processing methods such as MUSIC or GCC-PHAT. We organize this section into the following feature categories: inter-channel, cross-correlation-based, spectrogram-based, Ambisonics and intensity-based, and finally the direct use of the multichannel waveforms. Note that different kinds of features are often combined at the input layer of SSL neural networks. A few publications compare different types of input features for SSL, see e.g. [172, 106].

V-A Inter-channel features

V-A1 Relative Transfer functions

The relative transfer function is a very general inter-channel feature that has been widely used for conventional (non-deep) SSL and for other spatial audio processing tasks such as source separation and beamforming [55], and that is now considered for DL-based SSL as well. The RTF is defined for a microphone pair as the ratio of the (source-to-microphone) acoustic transfer functions (ATFs) of the two microphones, hence it strongly depends on the source DoA (for a given recording set-up). In a multichannel set-up with more than two microphones, an RTF can be defined for each microphone pair. Often, one microphone is used as a reference, and the ATFs of all other microphones are divided by the ATF of this reference microphone.

Here we work in the STFT domain, since an ATF is the Fourier transform of the corresponding RIR. As an ATF ratio, an RTF is thus a vector with an entry defined for each frequency bin. In practice, an RTF estimate is obtained for each STFT frame (and each frequency bin and each considered microphone pair) by taking the ratio between the STFTs of the recorded waveforms of the two considered channels. If only one source is present in the recorded signals, this ratio is assumed to correspond to the RTF thanks to the so-called narrow-band assumption. If multiple sources are present, things become more complicated, but under the WDO assumption, i.e. at most one source is assumed to be active in each TF bin [170], the same principle as for one active source can be applied separately in each TF bin. Therefore, a set of RTFs measured at different frequencies (and possibly at different time frames if the sources are static or moving slowly) can be used for multi-source localization.

An RTF is a complex-valued vector. In practice, an equivalent real-valued pair of vectors is often used: either the real and imaginary parts, or the modulus and argument. Often, the log of the squared RTF modulus is used, i.e. the interchannel power ratio in dB, and the argument of the RTF estimate ideally corresponds to the difference of the ATF phases. Such RTF-based representations have been used in several neural-based systems for SSL. For example, in [29, 15] the input features are the arguments of the measured RTFs obtained from all microphone pairs.
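As an illustration, a possible way to compute such RTF-based features (interchannel power ratio in dB and phase difference) from two recorded channels could be the following sketch; the function and parameter values are hypothetical:

```python
import numpy as np
from scipy.signal import stft

def rtf_features(x_ref, x_other, fs=16000, nfft=512):
    """Per-frame RTF estimate between two channels: log-power ratio (dB) and phase."""
    _, _, X_ref = stft(x_ref, fs=fs, nperseg=nfft)
    _, _, X_oth = stft(x_other, fs=fs, nperseg=nfft)
    rtf = X_oth / (X_ref + 1e-8)                 # ratio of STFTs (narrow-band assumption)
    level = 20 * np.log10(np.abs(rtf) + 1e-8)    # interchannel power ratio in dB
    phase = np.angle(rtf)                        # interchannel phase difference
    return level, phase                          # each of shape (freq_bins, frames)
```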

V-A2 Binaural features

Binaural features have also been used extensively for SSL, in both conventional and deep systems. They correspond to a specific two-channel recording set-up which attempts to reproduce human hearing as realistically as possible. To this aim, a dummy head/body with in-ear microphones is used to mimic the source-to-ear propagation, and in particular the effects of the head and external ear (pinna), which are important for source localization by the human auditory system. In an anechoic environment, the (two-channel) source-to-microphone impulse response is referred to as the binaural impulse response (BIR). The frequency-domain representation of a BIR is referred to as the head-related transfer function (HRTF). Both BIRs and HRTFs are functions of the source DoA. To take into account the room acoustics in a real-world SSL application, BIRs are extended to binaural room impulse responses (BRIRs), which combine head/body effects and room effects (in particular reverberation).

Several binaural features are derived from binaural recordings. The interaural level difference (ILD) corresponds to the short-term log-power ratio between the two binaural channels in the STFT domain. The interaural phase difference (IPD) is the argument of this ratio, and the interaural time difference (ITD) is the delay which maximizes the cross-correlation between the two channels. Just like the RTF, those features are actually vectors with frequency-dependent entries. In fact, the ILD and IPD are strongly related (not to say identical) to the log-power and argument of the RTF, the difference lying more in the set-up than in the features themselves: the RTF can be seen as a more general (multichannel) concept, whereas binaural features refer to the specific two-channel binaural set-up. As for the RTF, the ILD, ITD and IPD implicitly encode the position of a source. When several sources are present, the WDO property allows ILD/ITD/IPD values to provide information on the position of several simultaneously active sources.
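A minimal sketch of how the ILD, the IPD and a broadband ITD could be computed from a two-channel binaural recording is given below (the per-TF-bin processing implied by the WDO assumption is left to the downstream model):

```python
import numpy as np
from scipy.signal import stft, correlate

def binaural_features(left, right, fs=16000, nfft=512):
    """ILD (dB) and IPD per TF bin, plus a broadband ITD from cross-correlation."""
    _, _, L = stft(left, fs=fs, nperseg=nfft)
    _, _, R = stft(right, fs=fs, nperseg=nfft)
    ild = 20 * np.log10(np.abs(L) + 1e-8) - 20 * np.log10(np.abs(R) + 1e-8)
    ipd = np.angle(L) - np.angle(R)
    cc = correlate(left, right, mode="full")          # time-domain cross-correlation
    itd = (np.argmax(cc) - (len(right) - 1)) / fs     # delay maximizing the correlation
    return ild, ipd, itd
```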

Binaural signals and ILD/ITD/IPD features have previously been used in conventional SSL systems [7]. In [245], ILD and ITD vectors are computed and fed separately into specific input branches of an MLP. In [128, 244], the cross-correlation of the two binaural channels is concatenated with the ILD before being fed into the input layer. In [139], the IPD is used as the argument of a unit-modulus complex number which is decomposed into real and imaginary parts; those parts are concatenated with the ILD for several frequency bins and several time frames, leading to a 2D tensor which is then fed into a CNN. An example of a system relying only on the IPD can be found in [145]: an MLP is trained to output a clean version of the noisy input IPD in order to better retrieve the DoA using a conventional method. In [193], the input features are a concatenation of the cosine and sine of the IPDs for several frequency bins and time frames. This choice is based on a previous work which showed performance similar to classical phase maps for this type of input features, but with a lower dimension. In an original way, Thuillier et al. employed unusual binaural features in the system presented in [211]: the ipsilateral and contralateral spectra. These features proved relevant for elevation estimation using a CNN. Finally, other neural-based systems use the ILD [172, 247], ITD [172] or IPD [190, 189, 247, 197] in addition to other types of features.

V-B Cross-correlation-based features

Another way to extract and exploit inter-channel information that depends on the source location is to use features based on the cross-correlation (CC) between the signals of different channels. In particular, as we have seen in Section III, a variant of the CC known as GCC-PHAT [100] is a common feature used in classical localization methods. It is less sensitive to speech signal variations than the standard CC, but may be adversely affected by noise and reverberation [16]. It has therefore been used within the framework of neural networks, which prove robust to this type of disturbance. In several systems [237, 225, 73], GCC-PHAT is computed for each microphone pair and several time delays, all concatenated to form a 1D vector used as the input of an MLP. Other architectures include convolutional layers to extract useful information from multi-frame GCC-PHAT features [73, 220, 222, 142, 125, 132, 89, 194, 118, 35].
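For illustration, a basic GCC-PHAT feature vector for one microphone pair, truncated to a small lag range as typically done before feeding it to an MLP or CNN, could be computed as follows (parameter values are illustrative):

```python
import numpy as np

def gcc_phat(x1, x2, nfft=1024, max_lag=32):
    """GCC-PHAT between two channels, truncated to +/- max_lag samples."""
    X1, X2 = np.fft.rfft(x1, n=nfft), np.fft.rfft(x2, n=nfft)
    cps = X1 * np.conj(X2)                      # cross-power spectrum
    cps /= np.abs(cps) + 1e-8                   # PHAT weighting (keep phase only)
    cc = np.fft.fftshift(np.fft.irfft(cps, n=nfft))
    center = nfft // 2
    return cc[center - max_lag: center + max_lag + 1]
```

In a multichannel set-up, such vectors are typically computed for every microphone pair and concatenated into a single input feature.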

Some SSL systems rely on the cross-power spectrum (CPS), already mentioned in Section III, which is linked to the cross-correlation by a Fourier transform (in practice, short-term estimates of the CPS are obtained by multiplying the STFT of one channel with the conjugate STFT of the other channel). In several works [126, 239], the CPS is fed into a CRNN architecture, improving the localization performance over the baseline [1] (see Section IV). In [62], Grondin et al. also used the cross-spectrum of each microphone pair in the convolutional block of their architecture, whereas GCC-PHAT features are concatenated in a deeper layer. In [129], the CPS is also used as an input feature.

The SRP-PHAT method, which we have also seen in Section III, has been used to derive input features for neural SSL systems as well. In [180], the authors proposed to calculate the narrowband normalized steered response power for a set of candidate TDoAs corresponding to an angular grid and to feed it into a convolutional layer. This led to a localization performance improvement compared to the traditional SRP-PHAT method. Such power maps have also been used in [41] as inputs of 3D convolutional layers.

Traditional signal processing localization methods such as MUSIC [185] or ESPRIT [177] have been widely examined in the literature, see Section III. They are based on the eigen-decomposition of the cross-correlation matrix of a multichannel recording. Several neural SSL systems have been inspired by these methods and reuse such features as input of their neural networks. In [205, 204, 206, 207], Takeda et al. extract the complex eigenvectors from the correlation matrix and inject them separately into several branches, one per eigenvector, in order to extract a directional image obtained using a proposed complex directional activation function. The remaining parts of the network progressively integrate over all branches until the output layer. In [141], the spatial pseudo-spectrum is computed based on the MUSIC algorithm and then used as input features for a CNN.

V-C Spectrogram-based features

Alternatively to inter-channel or cross-correlation-based features, which already encode relative information between channels, another approach is to directly provide an SSL system with “raw” multichannel information, i.e. without any pre-processing in the channel dimension.

This does not prevent some pre-processing in the other dimensions, and from a historical perspective, we notice that many models in this line use spectral or spectro-temporal features instead of raw waveforms (see next subsection) as inputs. In practice, (multichannel) STFT spectrograms are typically used [226]. These multichannel spectrograms are generally organized as 3D tensors, with one dimension for time (frames), one for frequency (bins) and one for channels. The general spirit of neural-based SSL methods here is that the network should be able to “see” by itself and automatically extract and exploit the differences between time-frequency spectrograms along the channel dimension, while exploiting the sparsity of the TF signal representation.

In several works, the individual spectral vectors from the different STFT frames are provided independently to the neural model, meaning that the network does not take into account their temporal correlation (and a localization result is generally obtained independently for each frame). In that case, the network input is a matrix whose dimensions are the number of microphones and the number of considered STFT frequency bins. In [78], the log-spectra of 8 channels are concatenated for each individual analysis frame and directly fed into a CNN as a 2D matrix. In the works of Chakrabarty and Habets [27, 24, 25, 26], the multichannel phase spectrogram has been used as input feature, disregarding the magnitude information. As an extension of this work, phase maps are also exploited in [17].

When several consecutive frames are considered, the STFT coefficients for multiple timesteps and multiple frequency bins form a 2D matrix for each recording channel. Usually, these spectrograms are stacked together along a third dimension to form the 3D input tensor. Several systems consider only the magnitude spectrograms, such as [241, 232, 151, 156], while others consider only the phase spectrograms [250, 197]. When considering both magnitude and phase, they can also be stacked along the third (channel) dimension. This representation has been employed in many neural-based SSL systems [76, 66, 105, 122, 132, 249, 94, 104, 186]. Other systems proposed to decompose the complex-valued spectrograms into real and imaginary parts [71, 74, 137, 108].
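A minimal sketch of how such a 3D input tensor can be built by stacking the magnitude and phase spectrograms of all channels is given below (the STFT parameters are illustrative):

```python
import numpy as np
from scipy.signal import stft

def spectrogram_tensor(waveforms, fs=16000, nfft=512):
    """Stack magnitude and phase spectrograms of all channels into a 3D tensor
    of shape (2 * n_channels, freq_bins, frames)."""
    feats = []
    for x in waveforms:                         # waveforms: (n_channels, n_samples)
        _, _, X = stft(x, fs=fs, nperseg=nfft)
        feats += [np.abs(X), np.angle(X)]
    return np.stack(feats, axis=0)
```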

While basic (STFT) spectrograms consider equally-spaced frequency bins, Mel-scale and Bark-scale spectrograms rely on a non-linear sub-band division corresponding to a perceptual scale (low-frequency sub-bands have a higher resolution than high-frequency sub-bands) [152]. Mel-spectrograms have been preferred to STFT spectrograms in several SSL neural systems [220, 102, 22, 167]. The Bark scale has also been explored for spectrograms in the SSL system of [89].

V-D Ambisonic signal representation

In the SSL literature, numerous systems utilize the Ambisonics format, i.e. the spherical harmonics decomposition coefficients [87], to represent the input signal. It is a multichannel format that is used more and more due to its capability to represent the spatial properties of a sound field while being agnostic to the microphone array configuration [87, 251]. Depending on the number of microphones available, one can consider first-order Ambisonics (FOA, consisting of 4 Ambisonics channels) or higher-order Ambisonics (HOA, with more than 4 Ambisonics channels).

As with other microphone arrays, one can use a time-frequency representation of the Ambisonics signal, obtained by applying the STFT to each Ambisonics channel: in [3, 66, 2, 1, 94, 104], the authors use FOA spectrograms decomposed into magnitude and phase components. In [216, 162], third-order Ambisonics spectrograms are used. In the latter paper, the authors even compared the performance of a CRNN with higher-order Ambisonics spectrograms of increasing order, showing that the higher the order, the better the localization accuracy of the network (but still below the performance of the so-called FOA intensity features, which we discuss in Section V-E). An interesting choice is that they use only the phase for elevation estimation, and only the magnitude for azimuth estimation. Another way of representing the Ambisonics format is proposed in [36]. Based on the FOA spectrograms, the authors proposed to consider them as quaternion-based input features, which proved to be a suitable representation in previous works [146]. To cope with this type of input features, they adapted the neural network of [1], showing improvement over the baseline.

V-E Intensity-based features

Sound intensity is an acoustic quantity defined as the product of sound pressure and particle velocity [176]. In the frequency or time-frequency domain, sound intensity is a complex vector whose real part (known as “active” intensity) is proportional to the gradient of the phase of the sound pressure, i.e. it is orthogonal to the wavefront. This is a useful property that has been extensively used for SSL, see e.g. [87, 209, 99]. The imaginary part (“reactive” intensity) is related to dissipative local energy transfers [176]; hence, it has been largely ignored in the SSL community. While the pressure is directly measurable by the microphones, the particle velocity has to be approximated. Under certain conditions, the particle velocity can be assumed proportional to the spatial gradient of the sound pressure [176, 135], which allows for its estimation by, e.g., the finite difference method [209], or using the FOA channels discussed in the previous section [87, 251].
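As an illustration, a (pseudo-)intensity vector can be estimated per TF bin from FOA channels ordered (W, X, Y, Z), up to normalization conventions, as in the following sketch:

```python
import numpy as np
from scipy.signal import stft

def foa_intensity(foa, fs=16000, nfft=512):
    """Pseudo-intensity vector from FOA channels ordered (W, X, Y, Z).
    Returns active (real) and reactive (imaginary) parts, shape (3, freq, frames)."""
    specs = [stft(ch, fs=fs, nperseg=nfft)[2] for ch in foa]
    W, X, Y, Z = specs
    intensity = np.conj(W)[None] * np.stack([X, Y, Z], axis=0)
    return intensity.real, intensity.imag       # active / reactive intensity
```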

The intensity vector has been used as an input feature of a number of recent neural models (especially those based on Ambisonics representations), and has led to very good SSL performance. The first use of an Ambisonics intensity vector, reported in [155], showed superior performance compared to the use of raw Ambisonics waveforms and to traditional Ambisonics-based methods. Interestingly, the authors demonstrated that using both the active and reactive intensity improves the SSL performance. Moreover, they normalized the intensity vector of each frequency band by its energy, which can be shown [38] to yield features similar to RTFs in the spherical harmonics domain [87]. In [243], the authors proposed to use two CRNNs to refine the input FOA intensity vector. The first CRNN is trained to estimate denoising and separation masks, under the assumption that there are two active sources and that the WDO hypothesis holds. The second CRNN estimates another mask to remove the remaining unwanted components (e.g. reverberation). The two networks hence produce an estimate of the “clean” intensity vector for each active source (the NoS is estimated by their system as well). The Ambisonics intensity vector has subsequently been used in several other recent works [65, 153, 64, 140, 22, 149, 194, 21, 208, 154].

Sound intensity has also been explored in [123] without the Ambisonics representation. The authors calculate the instantaneous complex sound intensity using an average of the sound pressure across the four considered channels, and two orthogonal particle velocity components using the differences in sound pressure for both microphone pairs. They keep only the real part of the estimated sound intensity (active intensity), and apply a PHAT weighting to improve the robustness against reverberation.

V-F Waveforms

Since 2018, several authors have proposed to directly feed their neural network models with the raw multichannel recorded waveforms. This idea relies on the capability of the neural network to find the best representation for SSL without the need for hand-crafted features or pre-processing of any kind. This is in line with the general trend of deep learning towards end-to-end approaches, observed in many other applications, including speech/audio processing. Of course, this goes together with the ever-increasing size of networks, datasets and computational power.

To our knowledge, Suvorov et al. were the first to apply this idea in [199]: they trained their neural network directly with the recorded 8-channel waveforms, stacking many 1D convolutional layers to extract high-level features for the final DoA classification. In [223, 221, 33, 23, 163, 164] raw multichannel waveforms are fed into 2D convolutional layers. In [82], the multichannel waveforms are fed into an autoencoder. In [90], the waveforms of each channel are shifted to be temporally aligned according to the TDoA before being injected into the input layer. In the same vein, Huang et al. [83, 84] proposed to time-shift the multichannel signal by calculating the time delay between the microphone position and the candidate source location, which requires to scan for all candidate locations.

A potential disadvantage of waveform-based features is that the architectures exploiting such data are often more complex, as one part of the network needs to be dedicated to feature extraction. Moreover, some studies indicate that learning “optimal” feature representations from raw data becomes more difficult when noise is present in the input signals [233], or may even harm generalization in some cases [182]. However, it is interesting to mention that in some studies [178, 127], visual inspection of the learned weights of the input layers of end-to-end (waveform-based) neural networks has revealed that they often resemble the filterbanks usually applied in the pre-processing stage of SSL (see Section V-C) and of various other classical speech/audio processing tasks.

V-G Other types of features

In the neural-based SSL literature, several systems proposed unusual types of features which do not belong to any of the categories described above. In [218], a periodicity degree feature is used alongside GCC-PHAT in a CNN. The periodicity degree is computed for a given frame and period: it is equal to the ratio between the harmonic signal power for the given period and the total signal power. This brings information about the harmonic content of the source signal to the CNN.

VI Output strategies and evaluation

In this section, we discuss the different strategies proposed in the literature to obtain a final DoA estimate. We generally divide the strategies into two categories: classification and regression. When the SSL network is designed for classification, the source location search space is generally divided into several zones, corresponding to different classes, and the neural network outputs a probability value for each class. As for regression, the goal is to directly estimate (continuous) source position/direction values, which are usually either Cartesian coordinates or spherical coordinates, i.e. azimuth and elevation angles (and very rarely the source-microphone distance). In the last subsection, we report a few non-direct methods in which the neural network does not directly provide the location of the source(s), but instead helps another algorithm to finally retrieve the desired DoA. A reader particularly interested in the comparison between the classification and regression approaches may consult [208, 154].

VI-A DoA estimation via classification

A lot of systems treat DoA estimation as a classification problem, i.e. each class represents a certain zone in the considered search space. In other words, the space is divided into several subregions, usually of similar size, and the neural network is trained to produce a probability of active source presence for each subregion. Such a classification problem is often addressed by using a feedforward layer as the last layer of the network, with as many neurons as the number of considered subregions. Two activation functions are generally associated with the final layer neurons: the softmax and sigmoid functions. Softmax ensures that the neuron outputs sum to 1, so it is suitable for a single-source localization scenario. With a sigmoid, all neuron outputs lie within [0, 1] independently from each other, which is suitable for multi-source localization. The last layer output is often referred to as the spatial spectrum, whose peaks correspond to a high probability of source activity in the corresponding zone.

As already mentioned in Section II-C, the final DoA estimate(s) is/are generally extracted using a peak picking algorithm: if the number of sources is known, selecting the highest peaks gives the multi-source DoA estimate; if the number of sources is unknown, the peaks above a certain user-defined threshold are usually selected, leading to a joint estimation of the number of sources and of their locations. Some processing such as spatial spectrum smoothing or angular distance constraints can be used for better DoA estimation. Hence, such a classification strategy can readily be used for single-source and/or multi-source localization, as the neural network is trained to estimate a probability of source activity in each zone, regardless of the number of sources.
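A simplified sketch of this output strategy, with plain thresholding in place of a full peak-picking algorithm, is given below (the grid resolution and threshold values are illustrative):

```python
import numpy as np

def pick_doas(spatial_spectrum, grid_azimuths, n_sources=None, threshold=0.5):
    """Select DoA classes from the network's spatial spectrum: either the
    n_sources highest values, or all values above a user-defined threshold."""
    if n_sources is not None:                   # number of sources known
        idx = np.argsort(spatial_spectrum)[-n_sources:]
    else:                                       # joint NoS + DoA estimation
        idx = np.where(spatial_spectrum > threshold)[0]
    return grid_azimuths[idx]

# e.g. a 5-degree azimuth grid and a sigmoid output vector of 72 probabilities
grid = np.arange(0, 360, 5)
probs = np.random.rand(72)
print(pick_doas(probs, grid, threshold=0.8))
```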

VI-A1 Spherical coordinates

Regarding the quantization of the source location space, namely the localization grid, different approaches have been proposed. Most of the early works focus on estimating only the source azimuth relative to the microphone array position, dividing the 360° azimuth space into regions of equal size, leading to a fixed grid quantization step. Without being exhaustive, we found in the literature many different values for this step, e.g. in [172], [78], [199], [221], [128], [237]. Some other works do not consider the whole 360° azimuth space; for example, in [29] the authors focus on a restricted azimuth region.

Estimating the elevation alone has not been investigated much in the literature, probably because of the lack of interesting applications in indoor scenarios. To the best of our knowledge, only one paper focuses on estimating only the elevation [211]; the authors divide the whole elevation range into nine regions of equal size. The majority of recent SSL neural networks are trained to estimate both the source azimuth and elevation. To do that, several options have been proposed in the literature. One can use two separate output layers, each one with the same number of neurons as the number of subregions in the corresponding dimension. For example, the output layer of the neural architecture proposed in [51] is divided into two branches with fully-connected layers, one for azimuth estimation and the other for elevation estimation. One can also have a single output layer where each neuron corresponds to a zone on the unit sphere, i.e. a unique azimuth/elevation pair, as in [65, 153]. Finally, one can directly design two separate neural networks, each one estimating either the azimuth or the elevation angle. This approach was adopted in [216].

However, most of the neural networks following the classification strategy for joint azimuth and elevation estimation are designed so that the output corresponds to a 2D grid on the unit sphere. For example, in [155, 153, 64], a quasi-uniform spherical grid is devised, with each class represented by a unique neuron in the output layer. In [2], Adavanne et al. sampled the unit sphere over the whole azimuth range but within a limited elevation range, yielding an output vector with one entry per class.

Distance estimation has barely been investigated in the SSL literature, highlighting the fact that it is a difficult problem. In [172], Roden et al. addressed distance estimation along with azimuth or elevation prediction, by dividing the distance range into 5 candidate classes. In [244], the distance range is quantized into four classes and is estimated along with three possible azimuth values. In [205], the azimuth is classified along with the distance and height of the source, but those last two quantities are classified into a very small set of possible pairs (in centimeters). In [18], Bologni et al. trained a CNN to classify a single-source signal into a 2D map representing the azimuth and distance dimensions.

VI-A2 Cartesian coordinates

A few works applied the classification paradigm to estimate the Cartesian coordinates. In [136, 129, 137], the horizontal plane is divided into small regions of the same size, each of them being a class in the output layer. However, this representation suffers from a decreasing angular difference between regions far from the microphone array, which is probably why regression is usually preferred for estimating Cartesian coordinates.

VI-B DoA estimation via regression

In regression SSL networks, the source location estimate is directly given by the continuous values provided by one or several output neurons (depending on whether Cartesian or spherical coordinates are considered, and on how many source coordinates are of interest). This technique offers the advantage of a potentially more accurate DoA estimation, since there is no quantization. Its drawback is twofold. First, the NoS needs to be known or assumed, as there is no way to estimate whether a source is active or not based on a localization regression alone. Second, regression-based SSL usually faces the well-known source permutation problem [197], which occurs in the multi-source localization configuration and is common with deep-learning-based source separation methods. Indeed, when computing the loss function at training time, there is an ambiguity in the association between targets and actual outputs; in other words, which estimate should be associated with which target? This issue also arises during evaluation. One possible solution is to force the SSL network training to be permutation invariant [197], in line with what was proposed for audio source separation [246].
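A minimal sketch of such a permutation-invariant regression loss is given below: it evaluates all assignments between estimated and target DoAs and keeps the best one (suitable only for a small number of sources, since the number of permutations grows factorially):

```python
import itertools
import torch

def pit_loss(est, target):
    """Permutation-invariant regression loss between estimated and target DoAs.
    est, target: tensors of shape (n_sources, n_coords)."""
    n = target.shape[0]
    losses = [torch.mean((est[list(perm)] - target) ** 2)
              for perm in itertools.permutations(range(n))]
    return torch.min(torch.stack(losses))
```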

As for classification, when using regression there is a variety of possibilities for the type of coordinates to be estimated. The choice among these possibilities is more driven by the context or the application than by design limitations, since regression generally requires only a few output neurons.

VI-B1 Spherical coordinates

In [214], Tsuzuki et al. proposed a complex-valued neural approach for SSL. The output of the network is a complex number of unit amplitude whose argument is an estimate of the azimuth of the source. A direct regression scheme is employed in [139], with a 2-neuron output layer which predicts the azimuth and elevation values in a single-source environment. In [144], the system performs only azimuth estimation. Regarding the DCASE 2019 challenge [161], a certain number of candidate systems used two neurons per event type to estimate the azimuth and elevation of the considered event [33, 22, 148], while the event activity is jointly estimated in order to decide whether to retain the corresponding coordinates.

In [132], azimuth and elevation estimation are done separately in two network branches, each containing a specific dense layer. In [198], Sundar et al. proposed a regression method relying on a preliminary classification step: the azimuth space is divided into equal subregions and, assuming there is at most one active source per subregion, 3 output neurons are associated with each of them: one neuron is trained to detect the presence of a source, and the two other neurons estimate the distance and azimuth of that source. The loss function for training is a weighted sum of the categorical cross-entropy (for the classification task) and the mean square error (for the regression task).

VI-B2 Cartesian coordinates

Another way to predict the DoA with regression is to estimate the Cartesian coordinates of the source(s). In [225], Vesperini et al. designed their network output layer with only two neurons, to estimate the x and y coordinates in the horizontal plane, with an output range normalized with respect to the room size in each dimension. Following the same idea, Vecchiotti et al. also used two neurons to estimate x and y, but added a third one to estimate the source activity [220, 222].

The estimation of the three Cartesian coordinates has been investigated in several systems. The authors of [223] designed the output layer with three neurons to estimate the coordinates of a single source with regression, as in [105]. In [3, 1], Adavanne et al. chose the same strategy. However, they perform SELD for several types of events, and thus there are three output neurons providing estimates for each event type, plus another output neuron estimating whether this event is active or not. The hyperbolic tangent activation function is used for the localization neurons to keep the output values in the range [-1, 1], leading to a DoA estimate on the unit sphere. The same strategy has been followed in an extension of this work in [36].

VI-C Non-direct DoA estimation

Neural networks have also been used in regression mode to estimate intermediate quantities which are then used by a non-neural algorithm to predict the final DoA.

In [156], Pertilä and Cakir proposed to use a CNN in regression mode to estimate a time-frequency mask. This mask is then applied to the noisy multichannel spectrogram to obtain an estimate of the clean multichannel spectrogram, and a classical SRP-PHAT method is then applied to retrieve the final DoA. Another TF mask estimation is done in [232] using a bidirectional LSTM network to improve traditional DoA estimation methods such as GCC-PHAT or MUSIC. In [145], the authors trained an MLP to remove unwanted artifacts from the IPD input features; the cleaned feature is then used to estimate the DoA with a non-neural method. In [243], Yasuda et al. proposed a method to filter out reverberation and other undesired effects from the intensity vector by TF mask estimation. The filtered intensity vector leads to a better DoA estimation than an intensity-based conventional method.

In [83, 84], Huang et al. applied neural networks to multichannel waveforms, shifted in time with delays corresponding to a candidate source location, to estimate the original dry signal. Doing this for a set of candidate locations, they then calculate the sum of the cross-correlation coefficients between the estimated dry source signals, for all candidate source locations. The final estimated location is the one leading to the maximum sum.

A joint localization and separation scheme is proposed in [90]. The neural network is trained to estimate the signal coming from a certain direction within a certain angular window, whose parameters are injected as an input to each layer. Thus, the network acts like a radar and scans through all directions, and progressively reduces the angular window up to a desired angular resolution.

Several works propose to employ neural networks for a better prediction of the TDoA, which is then used to determine the DoA as often done in traditional methods. In [62], the TDoA is estimated in regression mode using a hyperbolic tangent activation function at the output layer. In [224], Vera-Diaz et al. used an autoencoder to estimate a function from GCC-based features (similar to TDoA) that exhibits a clear peak corresponding to the estimated DoA.

VII Data

In this section, we detail the different approaches taken to deal with data for model training and testing. Because we are dealing with indoor domestic/office environments, noise and reverberation are common in real-world signals. We successively examine the use of synthetic and recorded datasets in neural-based SSL.

VII-A Synthetic data

A well-known limitation of supervised learning (see Section VIII) for SSL is the lack of labeled training data. In general, it is difficult to produce datasets of recorded signals with corresponding source position metadata in diverse spatial configurations (and possibly with diverse spectral content) that would be sufficiently large for efficient SSL neural model training. Therefore, one often has to simulate a large amount of data to obtain an efficient SSL system.

To generate realistic data that take reverberation into account, one needs to simulate the room acoustics. This is usually done by synthesizing the room impulse response that models the sound propagation for a “virtual” source-microphone pair, and this is done for all microphones of the array (and for a large number of source positions and microphone array positions, see below). A “dry” (i.e. clean, reverberation-free, monophonic) source signal is then convolved with this RIR to obtain the simulated microphone signal (this is done for every channel of the microphone array). As already stated in Section I-B, the foundation of SSL relies on the fact that the relative location of a source with respect to the microphone array is implicitly encoded in the (multichannel) RIR, and an SSL DNN learns to extract and exploit this information from examples. Therefore, such data generation has to be done with many different dry signals and a large number of simulated RIRs with different source and microphone array positions. The latter must be representative of the configurations in which the SSL system will be used in practice. Moreover, other parameters such as the room dimensions and reverberation time may have to be varied to take into account other factors of variation in SSL.

One advantage of this approach is that many dry signal datasets exist, in particular for speech [57, 111, 56]. Therefore, many SSL methods are trained with dry speech signals convolved with simulated RIRs. In [24, 25], Chakrabarty and Habets used white noise as the dry signal for training and speech signals for testing. This approach is reminiscent of the work in [39, 40] based on Gaussian mixture regression, already mentioned in Section III. Using white noise as the dry signal makes it possible to obtain training data that are “dense” in the time-frequency domain. A study [217] however showed that training on speech or music signals leads to better results than noise-based training, even when the signals were simulated with a GAN.

As for RIR simulation, there exist several methods (and variants thereof), and many acoustic simulation software packages. Detailing those methods and software implementations is out of the scope of this article; an interested reader may consult appropriate references, e.g. [171, 200, 191]. Let us only mention that simulators based on the image source method (ISM) [5] have been widely used in the SSL community, probably due to the fact that they offer a relatively good trade-off between simulation fidelity (specifically with regards to the “head” of a RIR, i.e., the direct propagation and early reflections [171]) and computational complexity. Among publicly available libraries, the RIR generator [69], the Spherical Microphone Impulse Response (SMIR) generator [88], and Pyroomacoustics [184] are very popular, e.g., they have been used in [27, 153, 64, 141, 216, 180, 118, 15], to mention only a few examples. An efficient open-source implementation of the ISM relying on Graphics Processing Unit (GPU) acceleration has recently been presented in [42], and used in [41] to simulate moving sources.
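As an illustration, the following sketch uses Pyroomacoustics to simulate a shoebox room with the image source method and convolve a dry signal with the resulting RIRs; the room dimensions, absorption coefficient and positions are arbitrary, and the dry signal is a placeholder:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
dry = np.random.randn(fs)                     # placeholder dry source signal (1 s)

# Shoebox room simulated with the image source method (illustrative parameters)
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, materials=pra.Material(0.3), max_order=10)
room.add_source([2.0, 3.0, 1.5], signal=dry)
mic_locs = np.c_[[3.0, 2.0, 1.5], [3.1, 2.0, 1.5]]   # a 2-microphone array
room.add_microphone_array(pra.MicrophoneArray(mic_locs, fs))
room.simulate()                                # convolves the dry signal with the RIRs
multichannel = room.mic_array.signals          # shape: (n_mics, n_samples)
```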

Other improved models based on the ISM have also been used to simulate impulse responses, such as the one presented in [78]. This model relies on the one in [117], which adds a diffuse reverberation model to the original ISM. In [85], Hübner et al. proposed a low-complexity model-based training data generation method which includes a deterministic model for the direct path and a statistical model for late reverberation. This method shows performance similar to the usual ISM, while being computationally more efficient. An investigation of several simulation methods was carried out in [58], with extensions of the ISM, namely ISM with directional sources and ISM with a diffuse field due to scattering. The authors of [58] compared the simulation algorithms via the training of an MLP (in both regression and classification modes) and showed that the ISM with scattering effects and directional sources leads to the best SSL performance.

Training and testing binaural SSL systems requires either to directly use signals recorded in a binaural set-up (see next subsection) or to use a dataset of two-channel binaural impulse responses and convolve those BIRs with dry (speech/audio) signals, just like for simulations in a conventional set-up. Most of the time the BIRs are recorded ones (see next subsection; there exist a few BIR simulators, but we will not detail this quite specific aspect here). To take into account the room acoustics in a real-world SSL application, BIR effects are often combined with RIR effects. This is not obtained by trivially cascading the BIR and RIR filters, since the BIR depends on the source DoA and a RIR contains many reverberation components from many directions. However, such a process is included in several RIR simulators, which are able to produce the corresponding combined response, called the binaural room impulse response, see e.g. [245]. Note that BIRs are often manipulated in the frequency domain, where they are called head-related transfer functions and are functions of both frequency and source DoA.

VII-B Real data

Collecting real labeled data, while crucial to assess the robustness of an SSL neural network in a real-world environment, is a cumbersome task. As of today, only a few datasets of such recordings exist. Among them, several impulse response datasets are publicly available and have been used to generate training and/or testing data.

The distant-speech interaction for robust home applications (DIRHA) simulated corpus presented in [37] has been used to simulate microphone speech signals based on real RIRs recorded in a multi-room environment [225, 220]. Another database, consisting of RIRs recorded in three rooms with different acoustic characteristics, is publicly available [70]; it uses three microphone array configurations to capture signals from several source azimuth positions. Other RIR datasets have been presented in [201, 47].

As for BIR dataset recordings, a physical head-and-torso simulator (HATS, aka “dummy head”) is used, with microphones plugged into the dummy head's ears. To isolate head and torso effects from other environmental effects such as reverberation, binaural recordings are generally made in an anechoic room. For example, the dataset presented in [210] was collected using four different dummy heads and used for SSL in [172].

The Surrey Binaural Room Impulse Responses database has been presented in [53] and has been used for SSL in, e.g., [128] to synthesize test examples. This database has been recorded using an HATS in four room configurations, with sound coming from loudspeakers. It thus combines binaural effects with room effects.

Several challenges have been organized over the years, and evaluation datasets with real recordings have been constituted to rank the candidate systems. The DCASE challenge datasets [161, 160, 159], created for SELD purposes, consist of recordings with static and moving sound events in reverberant and noisy environments. The recordings come in two 4-microphone spatial audio formats: tetrahedral microphone array and first-order Ambisonics. The dataset comprises 12 sound event types, including e.g. barking dog, female/male speech or ringing, with up to three simultaneous overlapping events. In the 2021 edition of the DCASE dataset, additional sound events that are not to be classified have been added to the recordings. These datasets have been used in many SSL studies, e.g. [22, 148, 62, 23, 138, 190, 231, 134]. The acoustic source localization and tracking (LOCATA) challenge [50] has been one of the most comprehensive challenges targeting the localization of speech sources. The challenge tasks include single and multiple SSL, each in settings where the sources and/or microphones are static or mobile. The recordings have been made using several types of microphone arrays, namely a planar array from [20], the em32 Eigenmike® spherical array, a hearing aid, and a set of microphones mounted on a robot head. The ground truth data include position information obtained through an optical tracking system, hand-labeled VAD metadata, and dry (or close-talking) source signals. This dataset has been used in a number of works to validate the effectiveness of a proposed method on “real-life” recordings, e.g., [64, 41, 198, 145, 216, 208]. Very recently, a SELD challenge focused on 3D sound has been announced [67], where a pair of FOA microphones was used to capture a large number of RIRs in an office room, from which the audio data have been generated.

A few audio-visual datasets have also been developed and are publicly available, in which the audio data are enriched with video information. This type of dataset is dedicated to the development and testing of audio-visual localization and tracking techniques, which are out of the scope of the present survey. Among those corpora, the AV16.3 corpus [112] and the CHIL database [196] have provided an evaluation basis for several (purely audio) SSL systems [223, 224], by considering only the audio part of the audiovisual dataset.

Finally, we also found a series of works in which the neural networks are tested using real data specifically recorded for the presented work in the researchers’ own laboratories, e.g., [29, 65, 76, 155, 64, 141, 73, 216, 137, 154].

VII-C Data augmentation techniques

To limit the massive use of simulated data, which can harm the robustness of the network on real-world data, and to overcome the limited amount of real data, several authors have proposed to resort to data augmentation techniques. Without producing more recordings, data augmentation makes it possible to create additional training examples, often leading to improved network performance.

For the DCASE challenge, many submitted systems were trained using data augmentation techniques on the training dataset. In [134], Mazzon et al. proposed and evaluated three techniques to augment the training data, taking advantage of the FOA representation used by their SSL neural network: swapping or inversion of FOA channels, label-oriented rotation (the rotation is applied so as to result in the desired label), or channel-oriented rotation (the rotation is directly applied with a desired matrix). Interestingly, the channel-oriented rotation method gave the worst results in their experiments, while the other two methods improved the neural network's performance. In [249], the authors applied the SpecAugment method [147], which creates new data examples by masking certain time frames or frequencies of a spectrogram, or both at the same time. In [89], new training material is created with the mixup method [248], which relies on convex combinations of pairs of existing training examples. In [142], Noh et al. used pitch shifting and block mixing for data augmentation [179]. The techniques from [134] and [249] have been employed in [190] to create new mixtures, along with another data augmentation method proposed in [203], which is based on the random mixing of two training signals.

In [230], four new data augmentation techniques have been applied to the DCASE dataset [159]. The first one takes advantage of the FOA format to change the location of the sources by swapping audio channels. The second method is based on the extraction of spatial and spectral information about the sources, which is then modified and recombined to create new training examples. The third one relies on mixing multiple examples, resulting in new multi-source labeled mixtures. The fourth technique is based on random time-frequency masking. The authors evaluated the benefit of these data augmentation methods both when used separately and when applied sequentially.
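A minimal sketch of two generic augmentations of this kind, mixup-style mixing of two labeled examples and random time-frequency masking, is given below (the mask sizes and beta parameter are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup augmentation: convex combination of two training examples and labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def time_freq_mask(spec, max_t=10, max_f=8):
    """SpecAugment-style masking: zero a random block of frequency bins and of frames."""
    spec = spec.copy()                           # spec: (freq_bins, frames)
    f0 = np.random.randint(0, spec.shape[0] - max_f)
    t0 = np.random.randint(0, spec.shape[1] - max_t)
    spec[f0:f0 + np.random.randint(1, max_f)] = 0.0
    spec[:, t0:t0 + np.random.randint(1, max_t)] = 0.0
    return spec
```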

VIII Learning strategies

In general, when training a neural network to accomplish a certain task, one needs to choose a training paradigm, which often depends on the type and amount of available data. In the neural-based SSL literature, most systems rely on supervised learning, though several examples of semi-supervised and weakly supervised learning can also be found.

VIII-A Supervised learning

When training a neural network with supervised learning, the training dataset must contain the output target (also known as the label, especially in the classification mode) for each corresponding input data. A cost function (or loss function) is used to quantify the error between the output target and the actual output of the neural network for a given input, and training consists in minimizing the average loss function over the training dataset. We have seen in Section VI that in a single-source SSL scenario with the classification paradigm, a softmax output function is generally used. In that case, the cost function is generally the categorical cross-entropy, see e.g., [155, 241, 24]. When dealing with multiple sources, still with the classification paradigm, sigmoid activation functions and a binary cross-entropy loss function are used, see e.g., [153, 64, 25]. With a regression scheme, the choice of cost function has been the mean square error in most systems [76, 139, 105, 180, 1, 189, 156]. We also sometimes witness the use of other cost functions, such as the angular error [154] or norm-based losses [90].
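For instance, an angular error loss between estimated and target DoA unit vectors could be sketched as follows (a simplified single-batch, single-source-per-example version):

```python
import torch

def angular_error_loss(est, target):
    """Mean angular error between estimated and target DoA unit vectors,
    an alternative to MSE for regression-based SSL. Shapes: (batch, 3)."""
    est = est / (est.norm(dim=-1, keepdim=True) + 1e-8)
    target = target / (target.norm(dim=-1, keepdim=True) + 1e-8)
    cos = (est * target).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean()
```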

The limitation of supervised training is that it relies on a great amount of labeled training data, whereas only a few real-world datasets of limited size have been collected for SSL. These datasets are not sufficient for robust training of deep learning models. To cope with this issue, one can opt for a data simulation method, as seen in Section VII-A, or for data augmentation techniques, as seen in Section VII-C. Otherwise, alternative training strategies can be employed, such as semi-supervised and weakly supervised learning, as presented hereafter.

VIII-B Semi-supervised and weakly supervised learning

Unsupervised learning refers to model training with a dataset that does not contain labels. In the present SSL framework, this means that we would have a dataset of recorded acoustic signals without knowledge of the source positions/directions; hence, standalone unsupervised learning is not applicable to SSL in practice. Semi-supervised learning refers to the case where part of the learning is done in a supervised manner and another part in an unsupervised manner. Usually the network is pre-trained with labeled data and then refined (or fine-tuned) using unsupervised learning, i.e. without resorting to labels. In the SSL literature, semi-supervised learning has been proposed to improve the performance of the neural network on conditions unseen during supervised training or on real data, compared to its performance when trained only in the supervised manner. It can be seen as an alternative way to enrich a labeled training dataset of too limited size or conditions (see Section VII).

For example, in [206] and [207], a pre-trained neural network is adapted to unseen conditions in an unsupervised way. The cross-entropy cost function is modified so that it is computed only from the estimated output, such that the overall entropy is minimized. The authors also apply a parameter selection method dedicated to avoiding overfitting, as well as early stopping. In [15], Bianco et al. combined supervised and unsupervised learning using a VAE-based system. A generative network is trained to infer the phase of the RTFs, which is used as input feature of a classifier network. The cost function directly encompasses a supervised term and an unsupervised term, and during training the examples can come with or without labels.
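For illustration, such a combined objective can be written generically as (the exact terms and weighting used in [15] may differ):

$$
\mathcal{L} = \sum_{n \in \mathcal{D}_{\mathrm{lab}}} \mathcal{L}_{\mathrm{sup}}\big(y_n, \hat{y}_n\big) + \lambda \sum_{n \in \mathcal{D}_{\mathrm{lab}} \cup \mathcal{D}_{\mathrm{unlab}}} \mathcal{L}_{\mathrm{unsup}}(x_n),
$$

where $\mathcal{L}_{\mathrm{sup}}$ is a supervised localization loss (e.g. a cross-entropy over DoA classes) computed on the labeled subset $\mathcal{D}_{\mathrm{lab}}$, $\mathcal{L}_{\mathrm{unsup}}$ is an unsupervised term computed on the input features only (e.g. the negative evidence lower bound of a VAE reconstructing the RTF phase), and $\lambda$ balances the two contributions.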

In [136], a semi-supervised approach is proposed to adapt the network to real-world data after it has been trained on a simulated dataset. This strategy is implemented with adversarial training [60]. In the present SSL context, a discriminator network is trained to label incoming data as synthetic or real, and the generator network learns to fool the discriminator. This enables the DoA estimation network to be adapted so that it can infer from real data.
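The following PyTorch sketch illustrates the general adversarial adaptation principle (a toy example with arbitrary layer sizes, not the actual architecture of [136]; in practice the two steps are alternated or implemented with a gradient reversal layer, and the supervised DoA loss on synthetic data is kept during adaptation):

    import torch
    import torch.nn as nn

    # Toy feature extractor (stands for the trunk of the DoA network)
    # and a small domain discriminator.
    feature_extractor = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
    discriminator = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
    bce = nn.BCEWithLogitsLoss()

    synthetic_batch = torch.randn(32, 256)   # input features computed from simulated data
    real_batch = torch.randn(32, 256)        # input features from unlabeled real recordings

    # Step 1: train the discriminator to tell synthetic (0) from real (1) data.
    with torch.no_grad():
        feat_syn = feature_extractor(synthetic_batch)
        feat_real = feature_extractor(real_batch)
    d_loss = bce(discriminator(feat_syn), torch.zeros(32, 1)) \
           + bce(discriminator(feat_real), torch.ones(32, 1))

    # Step 2: train the feature extractor to fool the discriminator, i.e. to make
    # real-data features indistinguishable from synthetic ones, so that the DoA
    # head trained on synthetic data also works on real recordings.
    g_loss = bce(discriminator(feature_extractor(real_batch)), torch.zeros(32, 1))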

A different kind of training, named weakly supervised learning, is used in [75] and [76]. The authors fine-tune a pre-trained neural network by adapting the cost function to account for weak labels, here the number of sources, which is assumed to be known. This helps improve the network performance by reducing the number of incoherent predictions. Another example of weak supervision can be found in [144]. Under the assumption that only a few training examples come with labels, a triplet loss function is computed. For each training step, three examples are drawn: a query sample, acting as a usual example, a positive sample recorded from a source position close to that of the query sample, and a negative sample recorded from a more remote source position. The triplet loss (so named because of these three components) is then derived so that the network learns to infer the position of the positive sample as closer to the query sample than that of the negative sample.
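A minimal sketch of such a triplet ranking objective, using the standard triplet margin loss of PyTorch on a toy network (the layer sizes and margin are assumptions, and the embedding and distance used in [144] may differ), is given below:

    import torch
    import torch.nn as nn

    # Toy localization network mapping input features to a 2D position estimate.
    net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

    query = torch.randn(8, 128)      # usual (query) examples
    positive = torch.randn(8, 128)   # examples recorded from nearby source positions
    negative = torch.randn(8, 128)   # examples recorded from more remote source positions

    # The loss penalizes cases where the prediction for the positive sample is not
    # closer to the prediction for the query than that of the negative sample,
    # by at least the given margin.
    triplet = nn.TripletMarginLoss(margin=1.0)
    loss = triplet(net(query), net(positive), net(negative))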

Author, Year, Architecture, Type, Learning, Input features, Output, Sources (NoS, Known, Mov.), Data (Train: SA, RA, SR, RR; Test: SA, RA, SR, RR)
Kim [97] 2011 MLP R S Power of multiple beams 1-5
Tsuzuki [214] 2013 MLP R S Time delay, phase delay, sound pressure diff. 1
Youssef [245] 2013 MLP R S ILD, ITD 1
Hirvonen [78] 2015 CNN C S Magnitude spectro 1
Ma [128] 2015 MLP C S Binaural cross-correlation + ILD 1-3
Roden [172] 2015 MLP C S ILD, ITD, binaural magnitude + phase spectro, binaural real + imaginary spectro //r 1
Xiao [237] 2015 MLP C S GCC-PHAT 1
Takeda [205] 2016 MLP C S Complex eigenvectors from correlation matrix ,z,r 0-1
Takeda [204] 2016 MLP C S Complex eigenvectors from correlation matrix 0-2
Vesperini [225] 2016 MLP R S GCC-PHAT x,y 1
Zermini [247] 2016 AE C S Mixing vector + ILD + IPD
Chakrabarty [24] 2017 CNN C S Phase map 1
Chakrabarty [25] 2017 CNN C S Phase map 2
Pertilä [156] 2017 CNN R S Magnitude spectro TF Mask 1
Takeda [206] 2017 MLP C SS Complex eigenvectors from correlation matrix , 1
Yalta [241] 2017 Res. CNN C S Magnitude spectro 1
Yiwere [244] 2017 MLP C S Binaural cross-correlation + ILD ,d 1
Adavanne [2] 2018 CRNN C S Magnitude + phase spectro SPS,,
He [73] 2018 MLP, CNN C S GCC-PHAT 0-2 ✗/✓
He [74] 2018 Res. CNN C S Real + imaginary spectro
Huang [83] 2018 DNN R S Waveforms dry signal 1
Li [118] 2018 CRNN C S GCC-PHAT 1
Ma [129] 2018 CNN C S CPS x,y 3
Nguyen [139] 2018 CNN R S ILD + IPD , 1
Perotin [155] 2018 CRNN C S Intensity vector , 1
Salvati [180] 2018 CNN C/R S Narrowband SRP components SRP weights 1 ?
Sivasankaran [193] 2018 CNN C S IPD 1
Suvorov [199] 2018 Res. CNN C S Waveforms 1
Takeda [207] 2018 MLP C SS Complex eigenvectors from correlation matrix 1
Thuillier [211] 2018 CNN C S Ipsilateral + contralateral ear input signal 1
Vecchiotti [220] 2018 CNN R S GCC-PHAT + mel spectro x,y 1
Vera-Diaz [223] 2018 CNN R S Waveforms x,y,z 1
Adavanne [1] 2019 CRNN R S FOA magnitude + phase spectrogram x,y,z 1
Adavanne [3] 2019 CRNN R S FOA magnitude + phase spectrogram x,y,z 1
Cao [22] 2019 CRNN R S Intensity vector + GCC-PHAT , 1
Chakrabarty [26] 2019 CNN C S Phase map 2
Chakrabarty [27] 2019 CNN C S Phase map 2
Chazan [29] 2019 U-net C S Phase map of the RTF between each mic pair
Chytas [33] 2019 CNN R S Waveforms , 1
Comminiello [36] 2019 CRNN R S Quaternion FOA x,y,z 1
Grondin [62] 2019 CRNN R S CPS + GCC-PHAT , 1
He [75] 2019 Res. CNN C WS Real + imaginary spectro 1-2
Huang [84] 2019 CNN R S Waveforms dry signal 1
Jee [89] 2019 CRNN R S GCC-PHAT + mel/bark spectro , 1
Kapka [94] 2019 CRNN R S Magnitude + phase spectro x,y,z 1-2
Kong [102] 2019 CNN R S Log-mel magnitude FOA spectro , 1
Krause [104] 2019 CRNN R S Magnitude / phase spectro , 1
Küçük [108] 2019 CNN C S Real + imaginary spectro 1
Kujawski [109] 2019 Res. CNN R S ? x,y 1
Lin [122] 2019 CRNN C S Magnitude and phase spectro , 1
Lu [125] 2019 CRNN R S GCC-PHAT , 1
Lueng [126] 2019 CRNN R S CPS , 1
Maruri [132] 2019 CRNN R S GCC-PHAT + magnitude + phase spectro , 1
Mazzon [134] 2019 CRNN R S Mel-spectrogram + GCC-PHAT/intensity vector , 1
Noh [142] 2019 CNN C S GCC-PHAT , 1
Nustede [143] 2019 CRNN R S Group delays , 1
Opochinsky [144] 2019 MLP R WS RTFs 1
Pak [145] 2019 MLP R S IPD (clean) IPD
Park [148] 2019 CRNN R S Intensity vector , 1
Perotin [153] 2019 CRNN C S FOA intensity vector , 2
Perotin [154] 2019 CRNN C/R S FOA intensity vector ,/x,y,z 1
Pujol [163] 2019 Res. CNN R S Waveforms x,y 1
Ranjan [167] 2019 Res. CRNN C S Log-mel spectro , 1
Tang [208] 2019 CRNN C/R S FOA intensity vector ,/x,y,z 1
Vecchiotti [222] 2019 CNN R S GCC-PHAT + mel spectro x,y 1
Vecchiotti [221] 2019 CNN C S Waveforms 1
Wang [232] 2019 RNN R S Magnitude spectro TF Mask 1
Xue [240] 2019 CRNN R S Log-mel spectrum + CQT + phase spectro + CPS , 1
Zhang [249] 2019 CRNN R S Magnitude and phase spectro , 1
Zhang [250] 2019 CNN C S Phase spectro 1
TABLE I: Summary of deep-learning-based SSL systems from 2013 to 2019, organized in chronological then alphabetical order. Type: R = regression, C = classification. Learning: S = supervised, SS = semi-supervised, WS = weakly supervised. Sources: NoS = considered number of sources, Known indicates if the NoS is known or not before estimating the DoA (✓= yes, ✗= no), Mov. specifies if moving sources are considered. Data: SA = synthetic anechoic, RA = real anechoic, SR = synthetic reverberant, RR = real reverberant.
Author, Year, Architecture, Type, Learning, Input features, Output, Sources (NoS, Known, Mov.), Data (Train: SA, RA, SR, RR; Test: SA, RA, SR, RR)
Bianco [15] 2020 VAE C SS RTFs 1
Cao [23] 2020 CRNN R S FOA waveforms , 0-2
Comanducci [35] 2020 CNN/U-Net C S GCC-PHAT , 1
Fahim [51] 2020 CNN C S FOA modal coherence , 1-7
Hao [71] 2020 CNN C S Real + imaginary spectro + spectral flux 1
Huang [82] 2020 AE R S Waveforms 1
Hübner [85] 2020 CNN C S Phase map 1
Jenrungrot [90] 2020 U-Net R S Waveforms 0-8
Le Moing [137] 2020 AE C,R S Real + imaginary spectro x,y 1-3
Le Moing [136] 2020 AE C SS Real + imaginary spectro x,y 1-3
Naranjo-Alcazar [138] 2020 Res. CRNN R S Log-mel magnitude spectro + GCC-PHAT x,y,z 1
Nguyen [141] 2020 CNN C S Spatial Pseudo-Spectrum 0-4
Park [149] 2020 CRNN R S Log-mel energy + intensity vector , 1
Patel [151] 2020 U-Net R S Mel magnitude spectro x,y,z 1
Phan [158] 2020 CRNN + SA R S Log-mel magnitude FOA spectro + active/reactive intensity vector, or GCC-PHAT x,y,z 1
Phan [157] 2020 CRNN + SA R S Log-mel magnitude FOA spectro + active/reactive intensity vector, or GCC-PHAT x,y,z 1
Ronchini [174] 2020 CRNN R S FOA log-mel magnitude spectro + log-mel intensity vector x,y,z 1
Sampathkumar [181] 2020 CRNN R S MIC+FOA mel spectro + active intensity vector + GCC-PHAT , 1
Shimada [189] 2020 Res. CRNN R S FOA magnitude spectrogram + IPD , 1
Shimada [190] 2020 Res. CRNN R S FOA magnitude spectro + IPD , 1
Singla [192] 2020 CRNN R S FOA log-mel magnitude spectro + log-mel intensity vector x,y,z 1
Song [194] 2020 CRNN R S GCC-PHAT + FOA active intensity vector x,y,z 1
Sundar [198] 2020 Res. CNN C/R S Waveforms d, 1-3
Tian [212] 2020 CRNN ? S Ambisonics ? ?
Varanasi [216] 2020 CNN C S 3rd spherical harmonics (phase or phase+magnitude) , 1
Varzandeh [218] 2020 CNN C S GCC-PHAT + periodicity degree 0-1
Wang [231] 2020 Res. CRNN R S FOA intensity vector + FOA log-mel spectro + GCC-PHAT x,y,z 1
Xue [239] 2020 CRNN C S CPS + waveforms + beamforming output , 1
Yasuda [243] 2020 Res. CRNN R S Log-mel FOA spectrogram + intensity vector denoised IV 2
Bohlender [17] 2021 CNN/CRNN C S Phase map 1-3
Bologni [18] 2021 CNN C S Waveforms ,d 1
Cao [21] 2021 SA R S Log-mel spectro + intensity vector x,y,z 0-2
Diaz-Guerra [41] 2021 CNN R S SRP-PHAT power map x,y,z 1
Gelderblom [58] 2021 MLP C/R S GCC-PHAT 2
Grumiaux [64] 2021 CRNN C S Intensity vector , 1-3
Grumiaux [65] 2021 CNN + SA C S Intensity vector , 1-3
Guirguis [66] 2021 TCN R S Magnitude + phase spectro x,y,z 1
He [76] 2021 Res. CNN C WS Magnitude + phase spectro 1-4 ✓/✗
He [77] 2021 CNN R S Waveforms x,y,z 1
Komatsu [101] 2021 CRNN R S FOA magnitude + phase spectro , 1
Krause [105] 2021 CNN R S Magnitude + phase spectro x,y,z 1
Krause [106] 2021 CRNN R S Misc. , 1
Liu [123] 2021 CNN C S Intensity vector 1
Nguyen [140] 2021 CRNN C S Intensity vector/GCC-PHAT , 1
Poschadel [162] 2021 CRNN C S HOA magnitude + phase spectro , 1
Pujol [164] 2021 Res. CNN R S Waveforms , 1
Schymura [186] 2021 CNN + SA R S Magnitude + phase spectrogram , 1
Schymura [187] 2021 CNN + AE + att. R S FOA magnitude + phase spectro , 1
Subramanian [197] 2021 CRNN C Phase spectro, IPD 2
Vargas [217] 2021 CNN C S Phase map 1
Vera-Diaz [224] 2021 AE R S GCC-PHAT time-delay 1
Wang [230] 2021 SA R S Mel-spectro + intensity/mel-spectro + GCC-PHAT x,y,z 1
Wu [235] 2021 AE R S Likelihood surface x,y 1
TABLE II: Summary of deep-learning-based SSL systems from 2020 to 2021, organized in chronological then alphabetical order. See the caption of Table I for the definition of the acronyms.

Ix Conclusion

In this paper, we have presented a comprehensive overview of the literature on sound source localization techniques based on deep learning methods. We attempted to categorize the many publications in this domain according to different characteristics of the methods in terms of source (mixture) configuration, neural network architecture, input data type, output strategy, training and test datasets, and learning strategy. As we have seen throughout this survey, most research is oriented towards finding an appropriate neural architecture that accommodates several constraints (number of sources, moving sources, high reverberation, real-time implementation, etc.). More recently, part of the scientific effort has been geared towards the adaptation of systems trained on synthetic data so that they perform better on real-world data.

Tables I and II summarize our survey: they gather the references of the reviewed DL-based SSL studies, with their main characteristics (the ones used in our taxonomy of the different methods) reported in separate columns. We believe these tables can be very useful for a quick search of methods with a given set of characteristics.

References

  • [1] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen (2019-03) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Topics Signal Process. 13 (1), pp. 34–48 (en). External Links: ISSN 1932-4553, 1941-0484, Document Cited by: Fig. 4, §IV-D, §IV-D, §IV-D, §IV-D, §IV-D, §IV-E, §IV-E, §IV-F, §IV-F, §IV-G3, §V-B, §V-D, §VI-B2, §VIII-A, TABLE I.
  • [2] S. Adavanne, A. Politis, and T. Virtanen (2018-09) Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Rome, Italy. Cited by: Fig. 4, §IV-D, §V-D, §VI-A1, TABLE I.
  • [3] S. Adavanne, A. Politis, and T. Virtanen (2019-04) Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network. arXiv:1904.12769 (en). Cited by: §II-D, Fig. 4, §IV-D, §IV-D, §V-D, §VI-B2, TABLE I.
  • [4] M. Ahmad, M. Muaz, and M. Adeel (2021-05) A survey of deep neural network in acoustic direction finding. In Proc. IEEE Int. Conf. Digital Futures Transf. Technol. (ICoDT2), Islamabad, Pakistan. Cited by: §I.
  • [5] J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small‐room acoustics. J. Acoust. Soc. Am. 65 (4), pp. 943–950 (en). External Links: ISSN 0001-4966, Document Cited by: §VII-A.
  • [6] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals (2012) Speaker diarization: a review of recent research. IEEE Trans. Audio, Speech, Lang. Process. 20 (2), pp. 356–370. Cited by: §II-C.
  • [7] S. Argentieri, P. Danes, and P. Souères (2015) A survey on sound source localization in robotics: from binaural to array processing methods. Computer Speech Lang. 34 (1), pp. 87–112. Cited by: §I, §III, §V-A2.
  • [8] D. Bahdanau, K. Cho, and Y. Bengio (2016-05) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Cited by: §IV-F.
  • [9] S. Bai, J. Z. Kolter, and V. Koltun (2019-03) Trellis networks for sequence modeling. arXiv:1810.06682. Cited by: §IV-D.
  • [10] Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud (2018-04) Accounting for room acoustics in audio-visual multi-speaker tracking. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Alberta, Canada, pp. 6553–6557. External Links: Document Cited by: §I-A.
  • [11] J. Benesty, J. Chen, and Y. Huang (2008) Microphone array signal processing. Springer Science & Business Media (en). External Links: ISBN 978-3-540-78611-5 978-3-540-78612-2 Cited by: §I-B, §III.
  • [12] O. Bialer, N. Garnett, and T. Tirer (2019) Performance advantages of deep neural networks for angle of arrival estimation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brighton, UK, pp. 3907–3911 (en). Cited by: §II-A.
  • [13] L. Bianchi, F. Antonacci, A. Sarti, and S. Tubaro (2016-11) The ray space transform: a new framework for wave field processing. IEEE Trans. Signal Process. 64 (21), pp. 5696–5706. External Links: ISSN 1941-0476, Document Cited by: §IV-G3.
  • [14] M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C. Deledalle (2019) Machine learning in acoustics: theory and applications. J. Acoust. Soc. Am. 146 (5), pp. 3590–3628. Cited by: §I.
  • [15] M. J. Bianco, S. Gannot, and P. Gerstoft (2020-07) Semi-supervised source localization with deep generative modeling. In Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), Eespo, Finland. Cited by: §IV-G2, §V-A1, §VII-A, §VIII-B, TABLE II.
  • [16] C. Blandin, A. Ozerov, and E. Vincent (2012) Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process. 92 (8), pp. 1950–1960. Cited by: §I, §V-B.
  • [17] A. Bohlender, A. Spriet, W. Tirry, and N. Madhu (2021) Exploiting temporal context in CNN based multisource DoA estimation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, pp. 1594–1608. External Links: ISSN 2329-9304, Document Cited by: §II-C, §IV-D, §V-C, TABLE II.
  • [18] G. Bologni, R. Heusdens, and J. Martinez (2021-06) Acoustic reflectors localization from stereo recordings using neural networks. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada, pp. 1–5. External Links: Document Cited by: §II-C, §IV-B, §VI-A1, TABLE II.
  • [19] M. Brandstein (2001) Microphone arrays: signal processing techniques and applications. Springer Science & Business Media. Cited by: §I-B.
  • [20] A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, and M. Omologo (2010) WOZ acoustic data collection for interactive TV. Lang. Resources Eval. 44 (3), pp. 205–219. Cited by: §VII-B.
  • [21] Y. Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley (2021) An improved event-independent network for polyphonic sound event localization and detection. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada. Cited by: §IV-F, §V-E, TABLE II.
  • [22] Y. Cao, T. Iqbal, Q. Kong, M. B. Galindo, W. Wang, and M. D. Plumbley (2019) Two-stage sound event localization and detection using intensity vector and generalized cross-correlation. Technical report Cited by: §IV-D, §V-C, §V-E, §VI-B1, §VII-B, TABLE I.
  • [23] Y. Cao, T. Iqbal, Q. Kong, Y. Zhong, W. Wang, and M. D. Plumbley (2020-09) Event-independent network for polyphonic sound event localization and detection. Technical report Cited by: §IV-D, §V-F, §VII-B, TABLE II.
  • [24] S. Chakrabarty and E. A. P. Habets (2017) Broadband DoA estimation using convolutional neural networks trained with noise signals. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY, pp. 136–140. Cited by: §IV-B, §V-C, §VII-A, §VIII-A, TABLE I.
  • [25] S. Chakrabarty and E. A. P. Habets (2017) Multi-speaker localization using convolutional neural network trained with noise. arXiv:1712.04276. Cited by: §IV-B, §V-C, §VII-A, §VIII-A, TABLE I.
  • [26] S. Chakrabarty and E. A. P. Habets (2019) Multi-scale aggregation of phase information for reducing computational cost of CNN based DoA estimation. In Proc. Europ. Signal Process. Conf. (EUSIPCO), A Coruña, Spain (en). Cited by: §IV-B, §V-C, TABLE I.
  • [27] S. Chakrabarty and E. A. P. Habets (2019) Multi-speaker DoA estimation using deep convolutional networks trained with noise signals. IEEE J. Sel. Topics Signal Process. 13 (1), pp. 8–21 (en). External Links: ISSN 1932-4553, 1941-0484 Cited by: §II-B, §IV-B, §IV-B, §IV-B, §IV-D, §V-C, §VII-A, TABLE I.
  • [28] S. Chang, B. Li, G. Simko, T. N. Sainath, A. Tripathi, A. van den Oord, and O. Vinyals (2018-04) Temporal modeling using dilated convolution and gating for voice-activity-detection. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Calgary, Canada, pp. 5549–5553. External Links: Document Cited by: §II-C.
  • [29] S. E. Chazan, H. Hammer, G. Hazan, J. Goldberger, and S. Gannot (2019-09) Multi-microphone speaker separation based on deep DoA estimation. In Proc. Europ. Signal Process. Conf. (EUSIPCO), A Coruña, Spain. External Links: Document Cited by: §I, Fig. 7, §IV-G3, §V-A1, §VI-A1, §VII-B, TABLE I.
  • [30] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014-09) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078. Cited by: §IV-C.
  • [31] J. Choi and J. Chang (2020-01) Convolutional neural network-based doa estimation using stereo microphones for drone. In Int. Conf. Electron., Inform., Comm. (ICEIC), Barcelona, Spain, pp. 1–5. External Links: Document Cited by: §II-A.
  • [32] F. Chollet (2017) Deep learning with python. Simon and Schuster. Cited by: §IV.
  • [33] S. P. Chytas and G. Potamianos (2019) Hierarchical detection of sound events and their localization using convolutional neural networks with adaptive thresholds. Technical report (en). Cited by: §IV-B, §V-F, §VI-B1, TABLE I.
  • [34] M. Cobos, F. Antonacci, A. Alexandridis, A. Mouchtaris, and B. Lee (2017) A survey of sound source localization methods in wireless acoustic sensor networks. Wireless Comm. Mobile Computing 2017. Cited by: §I.
  • [35] L. Comanducci, F. Borra, P. Bestagini, F. Antonacci, S. Tubaro, and A. Sarti (2020) Source localization using distributed microphones in reverberant environments based on deep learning and ray space transform. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, pp. 2238–2251. External Links: ISSN 2329-9304, Document Cited by: §IV-G3, §V-B, TABLE II.
  • [36] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini (2019) Quaternion convolutional neural networks for detection and localization of 3D sound events. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brighton, UK (en). Cited by: §IV-D, §V-D, §VI-B2, TABLE I.
  • [37] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, and A. Abad (2014) The DIRHA simulated corpus. In Int. Conf. Lang. Resources Eval. (LREC), Reykjavik, Iceland, pp. 2629–2634 (en). Cited by: §VII-B.
  • [38] J. Daniel and S. Kitić (2020) Time-domain velocity vector for retracing the multipath propagation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Barcelona, pp. 421–425. Cited by: §III, §V-E.
  • [39] A. Deleforge, F. Forbes, and R. Horaud (2013) Variational EM for binaural sound-source separation and localization. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada. Cited by: §III, §VII-A.
  • [40] A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin (2015) Co-localization of audio sources in images using binaural features and locally-linear regression. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23 (4), pp. 718–731. Cited by: §III, §VII-A.
  • [41] D. Diaz-Guerra, A. Miguel, and J. R. Beltran (2021) Robust sound source tracking using SRP-PHAT and 3D convolutional neural networks. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, pp. 300–311. External Links: ISSN 2329-9304, Document Cited by: §II-D, §IV-B, §V-B, §VII-A, §VII-B, TABLE II.
  • [42] D. Diaz-Guerra, A. Miguel, and J. R. Beltran (2021) gpuRIR: A python library for room impulse response simulation with GPU acceleration. Multimedia Tools Applic. 80 (4), pp. 5653–5671. Cited by: §VII-A.
  • [43] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein (2001) Robust localization in reverberant rooms. In Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward (Eds.), pp. 157–180 (en). External Links: ISBN 978-3-662-04619-7 Cited by: §I, §III.
  • [44] J. P. Dmochowski, J. Benesty, and S. Affes (2007-11) A generalized steered response power method for computationally viable source localization. IEEE Trans. Audio, Speech, Lang. Process. 15 (8), pp. 2510–2526. External Links: ISSN 1558-7924, Document Cited by: §III.
  • [45] J. P. Dmochowski, J. Benesty, and S. Affes (2007) Broadband MUSIC: opportunities and challenges for multiple source localization. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY, pp. 18–21. Cited by: §III.
  • [46] Y. Dorfan and S. Gannot (2015) Tree-based recursive expectation-maximization algorithm for localization of acoustic sources. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23 (10), pp. 1692–1703. Cited by: §III.
  • [47] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor (2015-10) The ACE challenge — Corpus description and performance evaluation. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY, pp. 1–5. External Links: Document Cited by: §VII-B.
  • [48] A.H. El Zooghby, C.G. Christodoulou, and M. Georgiopoulos (2000) A neural network-based smart antenna for multiple source tracking. IEEE Trans. Antennas Propag. 48 (5), pp. 768–776 (en). External Links: ISSN 0018926X, Document Cited by: §II-A.
  • [49] A. M. Elbir (2020-04) DeepMUSIC: multiple signal classification via deep learning. IEEE Sensors Lett. 4 (4), pp. 1–4 (en). External Links: Document Cited by: §II-A.
  • [50] C. Evers, H. W. Löllmann, H. Mellmann, A. Schmidt, H. Barfuss, P. A. Naylor, and W. Kellermann (2020) The LOCATA challenge: acoustic source localization and tracking. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, pp. 1620–1643. Cited by: §I, §VII-B.
  • [51] A. Fahim, P. N. Samarasinghe, and T. D. Abhayapala (2020) Multi-source DoA estimation through pattern recognition of the modal coherence of a reverberant soundfield. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, pp. 605–618. External Links: ISSN 2329-9304, Document Cited by: §II-C, §IV-B, §VI-A1, TABLE II.
  • [52] L. Falong, J. Hongbing, and Z. Xiaopeng (1993) The ML bearing estimation by using neural networks. J. Electronics (China) 10 (1), pp. 1–8 (en). External Links: ISSN 0217-9822, 1993-0615, Document Cited by: §II-A.
  • [53] J. Francombe (2017) IoSR listening room multichannel BRIR dataset. University of Surrey. Cited by: §VII-B.
  • [54] S. Gannot, M. Haardt, W. Kellermann, and P. Willett (2019) Introduction to the issue on acoustic source localization and tracking in dynamic real-life scenes. IEEE J. Sel. Topics Signal Process. 13 (1), pp. 3–7. Cited by: §I.
  • [55] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov (2017) A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (4), pp. 692–730. Cited by: §I-B, §III, §III, §V-A1.
  • [56] J. Garofolo, D. Graff, D. Paul, and D. Pallett (1993) CSR-I (WSJ0) Sennheiser LDC93S6B. https://catalog.ldc.upenn.edu/ldc93s6b. Philadelphia: Linguistic Data Consortium. Cited by: §VII-A.
  • [57] J. S. Garofolo, L. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Technical report Cited by: §VII-A.
  • [58] F. B. Gelderblom, Y. Liu, J. Kvam, and T. A. Myrvoll (2021) Synthetic data for DNN-based DoA estimation of indoor speech. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada (en). Cited by: §VII-A, TABLE II.
  • [59] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §IV-A, §IV-C, §IV-G, §IV.
  • [60] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Nets. In Proc. Advances Neural Inform. Process. Syst. (NIPS), Montréal, Canada (en). Cited by: §VIII-B.
  • [61] D. Goryn and M. Kaveh (1988) Neural networks for narrowband and wideband direction finding. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New-York, NY, pp. 2164–2167. External Links: Document Cited by: §II-A.
  • [62] F. Grondin, J. Glass, I. Sobieraj, and M. D. Plumbley (2019-10) Sound event localization and detection using CRNN on pairs of microphones. Technical report Cited by: §IV-D, §V-B, §VI-C, §VII-B, TABLE I.
  • [63] P. Grumiaux, S. Kitic, L. Girin, and A. Guerin (2020) High-resolution speaker counting in reverberant rooms using CRNN with Ambisonics features. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands (en). External Links: ISBN 978-90-827970-5-3, Document Cited by: §II-C.
  • [64] P. Grumiaux, S. Kitic, L. Girin, and A. Guérin (2021) Improved feature extraction for CRNN-based multiple sound source localization. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Dublin, Ireland (en). Cited by: §II-C, §II-D, §IV-D, §V-E, §VI-A1, §VII-A, §VII-B, §VII-B, §VIII-A, TABLE II.
  • [65] P. Grumiaux, S. Kitic, P. Srivastava, L. Girin, and A. Guérin (2021) SALADnet: self-attentive multisource localization in the Ambisonics domain. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY (en). Cited by: §II-B, §II-C, §IV-F, §V-E, §VI-A1, §VII-B, TABLE II.
  • [66] K. Guirguis, C. Schorn, A. Guntoro, S. Abdulatif, and B. Yang (2020) SELD-TCN: sound event localization & detection via temporal convolutional networks. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 16–20 (en). External Links: ISBN 978-90-827970-5-3, Document Cited by: §II-D, §IV-E, §V-C, §V-D, TABLE II.
  • [67] E. Guizzo, R. F. Gramaccioni, S. Jamili, C. Marinoni, E. Massaro, C. Medaglia, G. Nachira, L. Nucciarelli, L. Paglialunga, M. Pennese, et al. (2021) L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing. arXiv preprint arXiv:2104.05499. Cited by: §VII-B.
  • [68] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020-10) Conformer: convolution-augmented Transformer for speech recognition. In Proc. Interspeech Conf., Shanghai, China, pp. 5036–5040 (en). External Links: Document Cited by: §IV-F.
  • [69] E. A. P. Habets (2006) Room impulse response generator. Technical report Technische Universiteit Eindhoven. Cited by: §VII-A.
  • [70] E. Hadad, F. Heese, P. Vary, and S. Gannot (2014-09) Multichannel audio database in various acoustic environments. In Proc. IEEE Int. Workshop Acoustic Signal Enhanc. (IWAENC), Antibes, France, pp. 313–317 (en). External Links: ISBN 978-1-4799-6808-4, Document Cited by: §VII-B.
  • [71] Y. Hao, A. Küçük, A. Ganguly, and I. M. S. Panahi (2020) Spectral flux-based convolutional neural network architecture for speech source localization and its real-time implementation. IEEE Access 8, pp. 197047–197058. External Links: ISSN 2169-3536, Document Cited by: §II-B, §IV-B, §V-C, TABLE II.
  • [72] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision Pattern Recogn. (CVPR), Las Vegas, NV, pp. 770–778. Cited by: §IV-E, §IV-E.
  • [73] W. He, P. Motlicek, and J. Odobez (2018) Deep neural networks for multiple speaker detection and localization. In IEEE Int. Conf. Robotics Autom. (ICRA), Brisbane, Australia, pp. 74–79. External Links: Document Cited by: §IV-A, §IV-B, §V-B, §VII-B, TABLE I.
  • [74] W. He, P. Motlicek, and J. Odobez (2018) Joint localization and classification of multiple sound sources using a multi-task neural network. In Proc. Interspeech Conf., Hyderabad, India, pp. 312–316 (en). Cited by: §IV-B, §IV-E, §V-C, TABLE I.
  • [75] W. He, P. Motlicek, and J. Odobez (2019-05) Adaptation of multiple sound source localization neural networks with weak supervision and domain-adversarial training. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brighton, UK, pp. 770–774 (en). External Links: ISBN 978-1-4799-8131-1, Document Cited by: §II-C, §IV-E, §VIII-B, TABLE I.
  • [76] W. He, P. Motlicek, and J. Odobez (2021) Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, pp. 1303–1317. External Links: ISSN 2329-9304, Document Cited by: §II-B, §V-C, §VII-B, §VIII-A, §VIII-B, TABLE II.
  • [77] Y. He, N. Trigoni, and A. Markham (2021-06) SoundDet: polyphonic sound event detection and localization from raw waveform. arXiv:2106.06969. Cited by: §II-D, §IV-G1, TABLE II.
  • [78] T. Hirvonen (2015) Classification of spatial audio location and content using convolutional neural networks. In Audio Eng. Soc. Conv., (en). Cited by: Fig. 3, §IV-B, §V-C, §VI-A1, §VII-A, TABLE I.
  • [79] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Comp. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Document Cited by: §IV-C.
  • [80] A. O. Hogg, V. W. Neo, S. Weiss, C. Evers, and P. A. Naylor (2021) A polynomial eigenvalue decomposition MUSIC approach for broadband sound source localization. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY. Cited by: §III.
  • [81] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu (2020-08) Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42 (8), pp. 2011–2023. External Links: ISSN 1939-3539, Document Cited by: §IV-E.
  • [82] Y. Huang, X. Wu, and T. Qu (2020-09) A time-domain unsupervised learning based sound source localization method. In Int. Conf. Inform. Comm. Signal Process., Shanghai, China, pp. 26–32. External Links: Document Cited by: §IV-G1, §V-F, TABLE II.
  • [83] Y. Huang, X. Wu, and T. Qu (2018-12) DNN-based sound source localization method with microphone array. In Proc. Int. Conf. Inform., Electron. Comm. Eng. (IECE), Beijing, China (en). External Links: ISSN 2475-8841, Document Cited by: §V-F, §VI-C, TABLE I.
  • [84] Y. Huang, X. Wu, and T. Qu (2019-09) A time-domain end-to-end method for sound source localization using multi-task learning. In Proc. IEEE Int. Conf. Inform. Comm. Signal Process. (ICSP), Weihai, China, pp. 52–56. External Links: Document Cited by: §V-F, §VI-C, TABLE I.
  • [85] F. Hübner, W. Mack, and E. A. P. Habets (2021) Efficient training data generation for phase-based DoA estimation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada. Cited by: §VII-A, TABLE II.
  • [86] F. Jacobsen and P. M. Juhl (2013) Fundamentals of general linear acoustics. John Wiley & Sons (en). Cited by: §I-B.
  • [87] D. P. Jarrett, E. A. Habets, and P. A. Naylor (2017) Theory and applications of spherical microphone array processing. Springer. Cited by: §I-B, §III, §V-D, §V-E, §V-E.
  • [88] D. Jarrett, E. Habets, M. Thomas, and P. Naylor (2012) Rigid sphere room impulse response simulation: algorithm and applications. J. Acoust. Soc. Am. 132 (3), pp. 1462–1472. Cited by: §VII-A.
  • [89] W. J. Jee, R. Mars, P. Pratik, S. Nagisetty, and S. L. Chong (2019) Sound event localization and detection using convolutional recurrent neural network. Technical report (en). Cited by: §IV-D, §V-B, §V-C, §VII-C, TABLE I.
  • [90] T. Jenrungrot, V. Jayaram, S. Seitz, and I. Kemelmacher-Shlizerman (2020-10) The cone of silence: speech separation by localization. arXiv:2010.06007. Cited by: §II-C, §IV-G3, §V-F, §VI-C, §VIII-A, TABLE II.
  • [91] S. Jha, R. Chapman, and T.S. Durrani (1988) Bearing estimation using neural networks. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New-York, NY, pp. 2156–2159. External Links: Document Cited by: §II-A.
  • [92] S. Jha and T. Durrani (1991) Direction of arrival estimation using artificial neural networks. IEEE Trans. Systems, Man, Cybern. 21 (5), pp. 1192–1201. External Links: ISSN 2168-2909, Document Cited by: §II-A.
  • [93] S.K. Jha and T.S. Durrani (1989) Bearing estimation using neural optimisation methods. In Proc. IEE Int. Conf. Artif. Neural Networks, London, UK, pp. 129–133. Cited by: §II-A.
  • [94] S. Kapka and M. Lewandowski (2019) Sound source detection, localization and classification using consecutive ensemble of CRNN models. Technical report Cited by: §IV-D, §V-C, §V-D, TABLE I.
  • [95] J. Kim and M. Hahn (2018-08) Voice activity detection using an adaptive context attention model. IEEE Signal Process. Lett. 25 (8), pp. 1181–1185. External Links: ISSN 1558-2361, Document Cited by: §II-C.
  • [96] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv:1408.5882. Cited by: §IV-B.
  • [97] Y. Kim and H. Ling (2011) Direction of arrival estimation of humans with a small sensor array using an artificial neural network. Prog. Electromagn. Research 27, pp. 127–149 (en). External Links: ISSN 1937-6472, Document Cited by: §IV-A, TABLE I.
  • [98] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In Proc. Int. Conf. Learning Repres. (ICLR), Banff, Canada. Cited by: §IV-G2.
  • [99] S. Kitić and A. Guérin (2018) TRAMP: Tracking by a Real-time AMbisonic-based Particle filter. In IEEE-AASP Challenge on Acoustic Source Localization and Tracking (LOCATA), Cited by: §III, §V-E.
  • [100] C. Knapp and G. Carter (1976-08) The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Process. 24 (4), pp. 320–327. External Links: ISSN 0096-3518, Document Cited by: §III, §V-B.
  • [101] T. Komatsu, M. Togami, and T. Takahashi (2020) Sound event localization and detection using convolutional recurrent neural networks and gated linear units. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 41–45 (en). External Links: ISBN 978-90-827970-5-3, Document Cited by: §IV-D, TABLE II.
  • [102] Q. Kong, Y. Cao, T. Iqbal, W. Wang, and M. D. Plumbley (2019) Cross-task learning for audio tagging, sound event detection and spatial localization. Technical report (en). Cited by: §IV-B, §V-C, TABLE I.
  • [103] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud (2017) An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New Orleans, LA, pp. 16–20. Cited by: §II-C.
  • [104] D. Krause and K. Kowalczyk (2019) Arborescent neural network architectures for sound event detection and localization. Technical report (en). Cited by: §IV-D, §V-C, §V-D, TABLE I.
  • [105] D. Krause, A. Politis, and K. Kowalczyk (2020) Comparison of convolution types in CNN-based feature extraction for sound source localization. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 820–824 (en). External Links: ISBN 978-90-827970-5-3, Document Cited by: §IV-B, §V-C, §VI-B2, §VIII-A, TABLE II.
  • [106] D. Krause, A. Politis, and K. Kowalczyk (2020) Feature overview for joint modeling of sound event detection and localization using a microphone array. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 31–35 (en). External Links: ISBN 978-90-827970-5-3, Document Cited by: §V, TABLE II.
  • [107] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) ImageNet classification with deep convolutional neural networks. Comm. ACM 60 (6), pp. 84–90 (en). External Links: ISSN 0001-0782, 1557-7317, Document Cited by: §IV-B.
  • [108] A. Küçük, A. Ganguly, Y. Hao, and I. M. S. Panahi (2019) Real-time convolutional neural network-based speech source localization on smartphone. IEEE Access 7, pp. 169969–169978. External Links: ISSN 2169-3536, Document Cited by: §V-C, TABLE I.
  • [109] A. Kujawski, G. Herold, and E. Sarradj (2019-09) A deep learning method for grid-free localization and quantification of sound sources. J. Acoust. Soc. Am. 146 (3), pp. EL225–EL231. External Links: ISSN 0001-4966, Document Cited by: §IV-E, TABLE I.
  • [110] H. Kuttruff (2016) Room acoustics. Crc Press. Cited by: §I-B.
  • [111] L. Lamel, J. Gauvain, and M. Eskenazi (1991) BREF, a large vocabulary spoken corpus for French. In Proc. Europ. Conf. Speech Comm. Technol. (Eurospeech), Genove, Italy, pp. 4–7 (en). Cited by: §VII-A.
  • [112] G. Lathoud, J. Odobez, and D. Gatica-Perez (2004) AV16.3: an audio-visual corpus for speaker localization and tracking. In Proc. Int. Workshop Mach. Learn. Multimodal Interact., Martigny, Switzerland, pp. 182–195 (en). External Links: ISBN 978-3-540-24509-4 978-3-540-30568-2 Cited by: §VII-B.
  • [113] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017-07) Temporal convolutional networks for action segmentation and detection. In Proc. IEEE Conf. Computer Vision Pattern Recogn. (CVPR), Honolulu, HI, pp. 1003–1012 (en). External Links: ISBN 978-1-5386-0457-1, Document Cited by: §IV-D.
  • [114] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Comp. 1 (4), pp. 541–551. External Links: ISSN 0899-7667, Document Cited by: §IV-B.
  • [115] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §IV-A, §IV-C, §IV.
  • [116] H. Lee, J. Cho, M. Kim, and H. Park (2016-08) DNN-based feature enhancement using DoA-constrained ICA for robust speech recognition. IEEE Signal Process. Lett. 23 (8), pp. 1091–1095. External Links: ISSN 1558-2361, Document Cited by: §I.
  • [117] E. A. Lehmann and A. M. Johansson (2010-08) Diffuse reverberation model for efficient image-source simulation of room impulse responses. IEEE Trans. Audio, Speech, Lang. Process. 18 (6), pp. 1429–1439 (en). External Links: ISSN 1558-7916, Document Cited by: §VII-A.
  • [118] Q. Li, X. Zhang, and H. Li (2018-04) Online direction of arrival estimation based on deep learning. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Calgary, Canada, pp. 2616–2620 (en). External Links: ISBN 978-1-5386-4658-8, Document Cited by: §IV-D, §V-B, §VII-A, TABLE I.
  • [119] X. Li, L. Girin, F. Badeig, and R. Horaud (2016-10) Reverberant sound localization with a robot head based on direct-path relative transfer function. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Systems (IROS), Daejeon, Korea, pp. 2819–2826. External Links: Document Cited by: §I.
  • [120] X. Li, L. Girin, R. Horaud, and S. Gannot (2017) Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (10), pp. 1997–2012. Cited by: §III.
  • [121] X. Li, R. Horaud, L. Girin, and S. Gannot (2016) Voice activity detection based on statistical likelihood ratio with adaptive thresholding. In Proc. IEEE Int. Workshop Acoustic Signal Enhanc. (IWAENC), Vol. , Xi’an, China, pp. 1–5. External Links: Document Cited by: §II-C.
  • [122] Y. Lin and Z. Wang (2019) A report on sound event localization and detection. Technical report (en). Cited by: §IV-D, §V-C, TABLE I.
  • [123] N. Liu, H. Chen, K. Songgong, and Y. Li (2021-02) Deep learning assisted sound source localization using two orthogonal first-order differential microphone arrays. J. Acoust. Soc. Am. 149 (2), pp. 1069–1084. External Links: ISSN 0001-4966, Document Cited by: §II-C, §V-E, TABLE II.
  • [124] Z. Liu, C. Zhang, and P. S. Yu (2018) Direction-of-arrival estimation based on deep neural networks with robustness to array imperfections. IEEE Trans. Antennas Propag. 66 (12), pp. 7315–7327. External Links: ISSN 1558-2221, Document Cited by: §I-B, §II-A.
  • [125] Z. Lu (2019) Sound event detection and localization based on CNN and LSTM. Technical report (en). Cited by: §IV-D, §V-B, TABLE I.
  • [126] S. Lueng and Y. Ren (2019) Spectrum combination and convolutional recurrent neural networks for joint localization and detection of sound events. Technical report (en). Cited by: §IV-D, §V-B, TABLE I.
  • [127] Y. Luo and N. Mesgarani (2019) Conv-TASnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 27 (8), pp. 1256–1266. Cited by: §V-F.
  • [128] N. Ma, G. Brown, and T. May (2015) Exploiting deep neural networks and head movements for binaural localisation of multiple speakers in reverberant conditions. In Proc. Interspeech Conf., Dresden, Germany, pp. 160–164 (en). Cited by: §II-C, §IV-A, §V-A2, §VI-A1, §VII-B, TABLE I.
  • [129] W. Ma and X. Liu (2018) Phased microphone array for sound source localization with deep learning. Aerospace Syst. 2 (2), pp. 71–81. Cited by: §IV-B, §V-B, §VI-A2, TABLE I.
  • [130] E. Mabande, H. Sun, K. Kowalczyk, and W. Kellermann (2011) Comparison of subspace-based and steered beamformer-based reflection localization methods. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Barcelona, Spain, pp. 146–150. Cited by: §III.
  • [131] M. I. Mandel, R. J. Weiss, and D. P. Ellis (2009) Model-based expectation-maximization source separation and localization. IEEE Trans. Audio, Speech, Lang. Process. 18 (2), pp. 382–394. Cited by: §III.
  • [132] H. A. C. Maruri, P. L. Meyer, J. Huang, J. A. d. H. Ontiveros, and H. Lu (2019) GCC-PHAT cross-correlation audio features for simultaneous sound event localization and detection (SELD) in multiple rooms. Technical report (en). Cited by: §IV-D, §V-B, §V-C, §VI-B1, TABLE I.
  • [133] T. May, S. Van De Par, and A. Kohlrausch (2010) A probabilistic model for robust localization based on a binaural auditory front-end. IEEE Trans. Audio, Speech, Lang. Process. 19 (1), pp. 1–13. Cited by: §III.
  • [134] L. Mazzon, Y. Koizumi, M. Yasuda, and N. Harada (2019-10) First order Ambisonics domain spatial augmentation for DNN-based direction of arrival estimation. arXiv:1910.04388. Cited by: §VII-B, §VII-C, TABLE I.
  • [135] J. Merimaa (2006) Analysis, synthesis, and perception of spatial sound: binaural localization modeling and multichannel loudspeaker reproduction. Ph.D. Thesis, Helsinki Univ. Technol.. Cited by: §V-E.
  • [136] G. L. Moing, P. Vinayavekhin, D. J. Agravante, T. Inoue, J. Vongkulbhisal, A. Munawar, and R. Tachibana (2021) Data-efficient framework for real-world multiple sound source 2D localization. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada. Cited by: §I-B, §IV-G1, §VI-A2, §VIII-B, TABLE II.
  • [137] G. L. Moing, P. Vinayavekhin, T. Inoue, J. Vongkulbhisal, A. Munawar, R. Tachibana, and D. J. Agravante (2020-12) Learning multiple sound source 2D localization. In Proc. IEEE Int. Workshop Multimedia Signal Process. (MMSP), virtual Tampere, Finland. Cited by: §IV-G1, §V-C, §VI-A2, §VII-B, TABLE II.
  • [138] J. Naranjo-Alcazar, S. Perez-Castanos, J. Ferrandis, P. Zuccarello, and M. Cobos (2020-06) Sound event localization and detection using squeeze-excitation residual CNNs. Technical report Cited by: §IV-E, §VII-B, TABLE II.
  • [139] Q. Nguyen, L. Girin, G. Bailly, F. Elisei, and D. Nguyen (2018) Autonomous sensorimotor learning for sound source localization by a humanoid robot. In IEEE/RSJ IROS Workshop Crossmodal Learn. Intell. Robotics, Madrid, Spain (en). Cited by: §IV-B, §V-A2, §VI-B1, §VIII-A, TABLE I.
  • [140] T. N. T. Nguyen, N. K. Nguyen, H. Phan, L. Pham, K. Ooi, D. L. Jones, and W. Gan (2021-06) A general network architecture for sound event localization and detection using transfer learning and recurrent neural network. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Vancouver, Canada, pp. 935–939. External Links: Document Cited by: §IV-C, §V-E, TABLE II.
  • [141] T. N. T. Nguyen, W. Gan, R. Ranjan, and D. L. Jones (2020) Robust source counting and DoA estimation using spatial pseudo-spectrum and convolutional neural network. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, pp. 2626–2637. External Links: ISSN 2329-9304, Document Cited by: §II-C, §IV-B, §V-B, §VII-A, §VII-B, TABLE II.
  • [142] K. Noh, J. Choi, D. Jeon, and J. Chang (2019) Three-stage approach for sound event localization and detection. Technical report (en). Cited by: §IV-B, §V-B, §VII-C, TABLE I.
  • [143] E. J. Nustede and J. Anemüller (2019) Group delay features for sound event detection and localization. Technical report (en). Cited by: §IV-D, TABLE I.
  • [144] R. Opochinsky, B. Laufer-Goldshtein, S. Gannot, and G. Chechik (2019) Deep ranking-based sound source localization. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY, pp. 283–287 (en). External Links: ISBN 978-1-72811-123-0, Document Cited by: §IV-A, §VI-B1, §VIII-B, TABLE I.
  • [145] J. Pak and J. W. Shin (2019) Sound localization based on phase difference enhancement using deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang. Process. 27 (8), pp. 1335–1345. External Links: ISSN 2329-9304, Document Cited by: §IV-A, §V-A2, §VI-C, §VII-B, TABLE I.
  • [146] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. De Mori, and Y. Bengio (2018-06) Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition. arXiv:1806.07789 (en). Cited by: §V-D.
  • [147] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019-09) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. Interspeech Conf., Graz, Austria, pp. 2613–2617 (en). External Links: Document Cited by: §VII-C.
  • [148] S. Park, W. Lim, S. Suh, and Y. Jeong (2019) Reassembly learning for sound event localization and detection using CRNN and TRELLISNET. Technical report (en). Cited by: §IV-D, §VI-B1, §VII-B, TABLE I.
  • [149] S. Park, S. Suh, and Y. Jeong (2020) Sound event localization and detection with various loss functions. Technical report (en). Cited by: §IV-D, §V-E, TABLE II.
  • [150] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan (2021-06) A review of speaker diarization: recent advances with deep learning. arXiv:2101.09624. Cited by: §II-C.
  • [151] S. J. Patel, M. Zawodniok, and J. Benesty (2020) A single stage fully convolutional neural network for sound source localization and detection. Technical report (en). Cited by: §IV-G3, §V-C, TABLE II.
  • [152] G. Peeters (2004) A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO Project Report 54.0. Cited by: §V-C.
  • [153] L. Perotin, R. Serizel, E. Vincent, and A. Guérin (2019-03) CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings. IEEE J. Sel. Topics Signal Process. 13 (1), pp. 22–33. External Links: ISSN 1941-0484, Document Cited by: §II-C, §IV-D, §IV-F, §V-E, §VI-A1, §VI-A1, §VII-A, §VIII-A, TABLE I.
  • [154] L. Perotin, A. Défossez, E. Vincent, R. Serizel, and A. Guérin (2019) Regression versus classification for neural network based audio source localization. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), New-Paltz, NY (en). Cited by: §V-E, §VI, §VII-B, §VIII-A, TABLE I.
  • [155] L. Perotin, R. Serizel, E. Vincent, and A. Guérin (2018-09) CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector. In Proc. IEEE Int. Workshop Acoustic Signal Enhanc. (IWAENC), Tokyo, Japan, pp. 241–245. External Links: Document Cited by: §II-C, §IV-D, §V-E, §VI-A1, §VII-B, §VIII-A, TABLE I.
  • [156] P. Pertilä and E. Cakir (2017-03) Robust direction estimation with convolutional neural networks based steered response power. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New Orleans, LA, pp. 6125–6129. External Links: Document Cited by: §V-C, §VI-C, §VIII-A, TABLE I.
  • [157] H. Phan, L. Pham, P. Koch, N. Q. K. Duong, I. McLoughlin, and A. Mertins (2020-09) On multitask loss function for audio event detection and localization. arXiv:2009.05527. Cited by: §IV-F, TABLE II.
  • [158] H. Phan, L. Pham, P. Koch, N. Q. K. Duong, I. McLoughlin, and A. Mertins (2020) Audio event detection and localization with multitask regression network. Technical report Cited by: §IV-F, TABLE II.
  • [159] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen (2021-06) A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection. arXiv:2106.06999. Cited by: §VII-B, §VII-C.
  • [160] A. Politis, S. Adavanne, and T. Virtanen DCASE challenge 2020 - sound event localization and detection. External Links: Link Cited by: §II-B, §IV-B, §IV-D, §IV-D, §VII-B.
  • [161] A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen (2020-09) Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, pp. 684–698. Cited by: §II-B, §IV-B, §IV-D, §VI-B1, §VII-B.
  • [162] N. Poschadel, R. Hupke, S. Preihs, and J. Peissig (2021-03) Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher-order Ambisonics signals. arXiv:2102.09853. Cited by: §V-D, TABLE II.
  • [163] H. Pujol, E. Bavu, and A. Garcia (2019) Source localization in reverberant rooms using deep learning and microphone arrays. In Proc. Int. Congr. Acoust. (ICA), Aachen, Germany (en). Cited by: §IV-E, §V-F, TABLE I.
  • [164] H. Pujol, E. Bavu, and A. Garcia (2021-04) BeamLearning: an end-to-end deep learning approach for the angular localization of sound sources using raw multichannel acoustic pressure data. J. Acoust. Soc. Am. 149 (6), pp. 4248–4263. Cited by: §IV-E, §V-F, TABLE II.
  • [165] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S. Chang, and T. Sainath (2019) Deep learning for audio signal processing. IEEE J. Sel. Topics Signal Process. 13 (2), pp. 206–219. Cited by: §I.
  • [166] B. Rafaely (2019) Fundamentals of spherical array processing. Springer (en). External Links: ISBN 978-3-319-99560-1 978-3-319-99561-8 Cited by: §I-B.
  • [167] R. Ranjan, S. Jayabalan, T. N. T. Nguyen, and W. Lim (2019) Sound events detection and direction of arrival estimation using residual net and recurrent neural networks. Technical report (en). Cited by: §IV-E, §V-C, TABLE I.
  • [168] R. Rastogi, P. Gupta, and R. Kumaresan (1987) Array signal processing with interconnected neuron-like elements. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Dallas, TX, pp. 2328–2331. External Links: Document Cited by: §II-A.
  • [169] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proc. Int. Conf. Mach. Learn. (ICML), Beijing, China. Cited by: §IV-G2.
  • [170] S. Rickard (2002) On the approximate W-disjoint orthogonality of speech. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Orlando, Florida, pp. 529–532 (en). Cited by: §III, §IV-B, §V-A1.
  • [171] J. H. Rindel (2000) The use of computer modeling in room acoustics. J. Vibroengineer. 3 (4), pp. 219–224. Cited by: §VII-A, footnote 7.
  • [172] R. Roden, N. Moritz, S. Gerlach, S. Weinzierl, and S. Goetze (2015) On sound source localization of speech signals using deep neural networks. In Proc. Deutsche Jahrestagung Akustik (DAGA), Nuremberg, Germany (en). External Links: ISBN 978-3-939296-08-9 Cited by: §IV-A, §V-A2, §V, §VI-A1, §VI-A1, §VII-B, TABLE I.
  • [173] N. Roman and D. Wang (2008) Binaural tracking of multiple moving sources. IEEE Trans. Audio, Speech, Lang. Process. 16 (4), pp. 728–739. Cited by: §III.
  • [174] F. Ronchini, D. Arteaga, and A. Pérez-López (2020-10) Sound event localization and detection based on CRNN using rectangular filters and channel rotation data augmentation. Technical report Cited by: §IV-D, TABLE II.
  • [175] O. Ronneberger, P. Fischer, and T. Brox (2015-05) U-Net: convolutional networks for biomedical image segmentation. In Int. Conf. Medical Image Comput. Computer-Assisted Interv. (MICCAI), Munich, Germany, pp. 234–241. Cited by: §IV-G3.
  • [176] T. D. Rossing (2007) Springer handbook of acoustics. Springer. Cited by: §I-B, §V-E.
  • [177] R. Roy and T. Kailath (1989-07) ESPRIT: estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust., Speech, Signal Process. 37 (7), pp. 984–995. External Links: ISSN 0096-3518, Document Cited by: §III, §V-B.
  • [178] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, et al. (2017) Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (5), pp. 965–979. Cited by: §V-F.
  • [179] J. Salamon and J. P. Bello (2017-03) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24 (3), pp. 279–283. External Links: ISSN 1558-2361, Document Cited by: §VII-C.
  • [180] D. Salvati, C. Drioli, and G. L. Foresti (2018-04) Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions. IEEE Trans. Emerg. Topics Comput. Intell. 2 (2), pp. 103–116. External Links: ISSN 2471-285X, Document Cited by: §IV-B, §V-B, §VII-A, §VIII-A, TABLE I.
  • [181] A. Sampathkumar and D. Kowerko (2020) Sound event detection and localization using CRNN models. Technical report (en). Cited by: §IV-D, TABLE II.
  • [182] I. Sato, G. Liu, K. Ishikawa, T. Suzuki, and M. Tanaka (2021) Does end-to-end trained deep model always perform better than non-end-to-end counterpart?. Electronic Imaging 2021 (10), pp. 240–1. Cited by: §V-F.
  • [183] H. Sawada, R. Mukai, and S. Makino (2003) Direction of arrival estimation for multiple source signals using independent component analysis. In IEEE Int. Symp. Signal Process. Applic., Paris, France, pp. 411–414 (en). External Links: ISBN 978-0-7803-7946-6, Document Cited by: §III.
  • [184] R. Scheibler, E. Bezzam, and I. Dokmanić (2018-04) Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Calgary, Canada, pp. 351–355. External Links: Document Cited by: §VII-A.
  • [185] R. Schmidt (1986-03) Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34 (3), pp. 276–280. External Links: ISSN 0018-926X, Document Cited by: §III, §V-B.
  • [186] C. Schymura, B. Bönninghoff, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa (2021-06) PILOT: introducing Transformers for probabilistic sound event localization. arXiv:2106.03903. Cited by: Fig. 6, §IV-F, §V-C, TABLE II.
  • [187] C. Schymura, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa (2020) Exploiting attention-based sequence-to-sequence architectures for sound event localization. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands (en). Cited by: §IV-F, TABLE II.
  • [188] A. Sehgal and N. Kehtarnavaz (2018) A convolutional neural network smartphone app for real-time voice activity detection. IEEE Access 6, pp. 9017–9026. External Links: ISSN 2169-3536, Document Cited by: §II-C.
  • [189] K. Shimada, Y. Koyama, N. Takahashi, S. Takahashi, and Y. Mitsufuji (2020-10) ACCDOA: activity-coupled cartesian direction of arrival representation for sound event localization and detection. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Barcelona. Cited by: §IV-E, §V-A2, §VIII-A, TABLE II.
  • [190] K. Shimada, N. Takahashi, S. Takahashi, and Y. Mitsufuji (2020-06) Sound event localization and detection using activity-coupled Cartesian DoA vector and RD3net. arXiv:2006.12014. Cited by: §IV-E, §V-A2, §VII-B, §VII-C, TABLE II.
  • [191] S. Siltanen, T. Lokki, and L. Savioja (2010) Rays or waves? understanding the strengths and weaknesses of computational room acoustics modeling techniques. In Proc. Int. Symposium on Room Acoustics, Cited by: §VII-A.
  • [192] R. Singla, S. Tiwari, and R. Sharma (2020) A sequential system for sound event detection and localization using CRNN. Technical report (en). Cited by: §IV-D, TABLE II.
  • [193] S. Sivasankaran, E. Vincent, and D. Fohr (2018) Keyword-based speaker localization: localizing a target speaker in a multi-speaker environment. In Proc. Interspeech Conf., Hyderabad, India (en). Cited by: §IV-B, §V-A2, TABLE I.
  • [194] J. Song (2020) Localization and detection for moving sound sources using consecutive ensembles of 2D-CRNN. Technical report (en). Cited by: §IV-D, §V-B, §V-E, TABLE II.
  • [195] H.L. Southall, J.A. Simmers, and T.H. O’Donnell (1995) Direction finding in phased arrays with a neural network beamformer. IEEE Trans. Antennas Propag. 43 (12), pp. 1369–1374. External Links: ISSN 1558-2221, Document Cited by: §II-A.
  • [196] R. Stiefelhagen, K. Bernardin, R. Bowers, R. T. Rose, M. Michel, and J. Garofolo (2007) The CLEAR 2007 evaluation. In Proc. Multimodal Technol. Percept. Humans, Baltimore, MD, pp. 3–34 (en). Cited by: §VII-B.
  • [197] A. S. Subramanian, C. Weng, S. Watanabe, M. Yu, and D. Yu (2021-02) Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition. arXiv:2102.07955. Cited by: §V-A2, §V-C, §VI-B, TABLE II.
  • [198] H. Sundar, W. Wang, M. Sun, and C. Wang (2020-05) Raw waveform based end-to-end deep convolutional network for spatial localization of multiple acoustic sources. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Barcelona, pp. 4642–4646. External Links: Document Cited by: §II-D, §IV-E, §VI-B1, §VII-B, TABLE II.
  • [199] D. Suvorov, G. Dong, and R. Zhukov (2018-08) Deep residual network for sound source localization in the time domain. arXiv:1808.06429. Cited by: §IV-E, §V-F, §VI-A1, TABLE I.
  • [200] P. Svensson and U. R. Kristiansen (2002) Computational modelling and simulation of acoustic spaces. In Proc. Audio Eng. Soc. Conf., Espoo, Finland. Cited by: §VII-A.
  • [201] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký (2019-08) Building and evaluation of a real room impulse response dataset. IEEE J. Sel. Topics Signal Process. 13 (4), pp. 863–876. External Links: ISSN 1941-0484, Document Cited by: §VII-B.
  • [202] N. Takahashi, N. Goswami, and Y. Mitsufuji (2018-05) MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proc. IEEE Int. Workshop Acoustic Signal Enhanc. (IWAENC), Tokyo, Japan. Cited by: §IV-E.
  • [203] N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool (2016-09) Deep convolutional neural networks and data augmentation for acoustic event recognition. In Proc. Interspeech Conf., San Francisco, CA, pp. 2982–2986 (en). External Links: Document Cited by: §VII-C.
  • [204] R. Takeda and K. Komatani (2016) Discriminative multiple sound source localization based on deep neural networks using independent location model. In IEEE Spoken Language Technol. Workshop, San Diego, CA, pp. 603–609. External Links: Document Cited by: Fig. 2, §IV-A, §V-B, TABLE I.
  • [205] R. Takeda and K. Komatani (2016) Sound source localization based on deep neural networks with directional activate function exploiting phase information. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Shanghai, China, pp. 405–409 (en). External Links: ISBN 978-1-4799-9988-0, Document Cited by: Fig. 2, §IV-A, §V-B, §VI-A1, TABLE I.
  • [206] R. Takeda and K. Komatani (2017) Unsupervised adaptation of deep neural networks for sound source localization using entropy minimization. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New Orleans, LA, pp. 2217–2221. External Links: Document Cited by: Fig. 2, §IV-A, §V-B, §VIII-B, TABLE I.
  • [207] R. Takeda, Y. Kudo, K. Takashima, Y. Kitamura, and K. Komatani (2018) Unsupervised adaptation of neural networks for discriminative sound source localization with eliminative constraint. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Calgary, Canada, pp. 3514–3518. External Links: Document Cited by: Fig. 2, §IV-A, §V-B, §VIII-B, TABLE I.
  • [208] Z. Tang, J. D. Kanu, K. Hogan, and D. Manocha (2019-09) Regression and classification for direction-of-arrival estimation with convolutional recurrent neural networks. In Proc. Interspeech Conf., Graz, Austria, pp. 654–658. External Links: Document Cited by: §V-E, §VI, §VII-B, TABLE I.
  • [209] S. Tervo (2009) Direction estimation based on sound intensity vectors. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Glasgow, Scotland, pp. 700–704. Cited by: §III, §V-E.
  • [210] J. Thiemann and S. Van De Par (2015) Multiple model high-spatial resolution HRTF measurements. In Proc. Deutsche Jahrestagung Akustik (DAGA), Nuremberg, Germany (en). Cited by: §VII-B.
  • [211] E. Thuillier, H. Gamper, and I. J. Tashev (2018) Spatial audio feature discovery with convolutional neural networks. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Calgary, Canada, pp. 6797–6801 (en). External Links: ISBN 978-1-5386-4658-8, Document Cited by: §IV-B, §V-A2, §VI-A1, TABLE I.
  • [212] C. Tian (2020) Multiple CRNN for SELD. Technical report (en). Cited by: §II-C, §IV-D, TABLE II.
  • [213] S. E. Tranter and D. A. Reynolds (2006) An overview of automatic speaker diarization systems. IEEE Trans. Audio, Speech, Lang. Process. 14 (5), pp. 1557–1565. Cited by: §II-C.
  • [214] H. Tsuzuki, M. Kugler, S. Kuroyanagi, and A. Iwata (2013) An approach for sound source localization by complex-valued neural network. IEICE Trans. Inform. Syst. 96 (10), pp. 2257–2265 (en). External Links: ISSN 0916-8532, 1745-1361, Document Cited by: §IV-A, §VI-B1, TABLE I.
  • [215] M. F. Ünlerşen and E. Yaldiz (2016-11) Direction of arrival estimation by using artificial neural networks. In Proc. Euro. Modelling Symp., pp. 242–245. External Links: Document Cited by: §II-A.
  • [216] V. Varanasi, H. Gupta, and R. M. Hegde (2020) A deep learning framework for robust DoA estimation using spherical harmonic decomposition. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, pp. 1248–1259. External Links: ISSN 2329-9304, Document Cited by: §IV-B, §V-D, §VI-A1, §VII-A, §VII-B, TABLE II.
  • [217] E. Vargas, J. R. Hopgood, K. Brown, and K. Subr (2021) On improved training of CNN for acoustic source localisation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, pp. 720–732. External Links: ISSN 2329-9304, Document Cited by: §VII-A, TABLE II.
  • [218] R. Varzandeh, K. Adiloğlu, S. Doclo, and V. Hohmann (2020-05) Exploiting periodicity features for joint detection and DoA estimation of speech sources using convolutional neural networks. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Barcelona, pp. 566–570. External Links: Document Cited by: §IV-B, §V-G, TABLE II.
  • [219] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017-12) Attention is all you need. arXiv:1706.03762 (en). Cited by: §IV-F.
  • [220] P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza (2018) Deep neural networks for joint voice activity detection and speaker localization. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Roma, Italy, pp. 1567–1571. External Links: Document Cited by: §IV-B, §V-B, §V-C, §VI-B2, §VII-B, TABLE I.
  • [221] P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown (2019) End-to-end binaural sound localisation from the raw waveform. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brighton, UK, pp. 451–455 (en). Cited by: §IV-B, §V-F, §VI-A1, TABLE I.
  • [222] P. Vecchiotti, G. Pepe, E. Principi, and S. Squartini (2019) Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation. Expert Syst. with Applic. 134, pp. 53–65 (en). External Links: ISSN 0957-4174, Document Cited by: §IV-B, §V-B, §VI-B2, TABLE I.
  • [223] J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa (2018) Towards end-to-end acoustic localization using deep learning: from audio signal to source position coordinates. Sensors 18 (10), pp. 3418. External Links: ISSN 1424-8220, Document Cited by: §IV-B, §V-F, §VI-B2, §VII-B, TABLE I.
  • [224] J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa (2020) Towards domain independence in CNN-based acoustic localization using deep cross correlations. In Proc. Europ. Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 226–230 (en). Cited by: §IV-G1, §VI-C, §VII-B, TABLE II.
  • [225] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza (2016) A neural network based algorithm for speaker localization in a multi-room environment. In IEEE Int. Workshop Machine Learning for Signal Process., Salerno, Italy, pp. 1–6. External Links: Document Cited by: §IV-A, §V-B, §VI-B2, §VII-B, TABLE I.
  • [226] E. Vincent, T. Virtanen, and S. Gannot (2018) Audio source separation and speech enhancement. John Wiley & Sons. Cited by: §I-B, §II-C, §V-C.
  • [227] E. Vincent, T. Virtanen, and S. Gannot (2018) Audio source separation and speech enhancement. John Wiley & Sons. Cited by: §III.
  • [228] B. Vo, M. Mallick, Y. Bar-Shalom, S. Coraluppi, R. Osborne, R. Mahler, and B. Vo (2015) Multitarget tracking. In Wiley Encyclopedia of Electrical and Electronics Engineering (en). External Links: ISBN 978-0-471-34608-1, Document Cited by: §II-D.
  • [229] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust., Speech, Signal Process. 37 (3), pp. 328–339. External Links: ISSN 0096-3518, Document Cited by: §IV-B.
  • [230] Q. Wang, J. Du, H. Wu, J. Pan, F. Ma, and C. Lee (2021-01) A four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection. arXiv:2101.02919 (en). Cited by: §IV-F, §VII-C, TABLE II.
  • [231] Q. Wang, H. Wu, Z. Jing, F. Ma, Y. Fang, Y. Wang, T. Chen, J. Pan, J. Du, and C. Lee (2020) The USTC-IFLYTEK system for sound event localization and detection of DCASE 2020 challenge. Technical report (en). Cited by: §IV-E, §VII-B, TABLE II.
  • [232] Z. Wang, X. Zhang, and D. Wang (2019-01) Robust speaker localization guided by deep learning-based time-frequency masking. IEEE/ACM Trans. Audio, Speech, Lang. Process. 27 (1), pp. 178–188. External Links: ISSN 2329-9304, Document Cited by: §IV-C, §V-C, §VI-C, TABLE I.
  • [233] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux (2019) WHAM!: extending speech separation to noisy environments. In Proc. Interspeech Conf., Graz, Austria. Cited by: §V-F.
  • [234] J. Woodruff and D. Wang (2012) Binaural localization of multiple sources in reverberant and noisy environments. IEEE Trans. Audio, Speech, Lang. Process. 20 (5), pp. 1503–1512. Cited by: §III.
  • [235] Y. Wu, R. Ayyalasomayajula, M. J. Bianco, D. Bharadia, and P. Gerstoft (2021-02) SSLIDE: sound source localization for indoors based on deep learning. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Toronto. Cited by: §IV-G1, TABLE II.
  • [236] A. Xenaki, J. Bünsow Boldt, and M. Græsbøll Christensen (2018-06) Sound source localization and speech enhancement with sparse Bayesian learning beamforming. J. Acoust. Soc. Am. 143 (6), pp. 3912–3921. External Links: ISSN 0001-4966, Document Cited by: §I.
  • [237] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li (2015) A learning-based approach to direction of arrival estimation in noisy and reverberant environments. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Brisbane, Australia, pp. 2814–2818. External Links: Document Cited by: §IV-A, §V-B, §VI-A1, TABLE I.
  • [238] B. Xu, G. Sun, R. Yu, and Z. Yang (2013-08) High-accuracy TDOA-based localization without time synchronization. IEEE Trans. Parallel Distrib. Syst. 24 (8), pp. 1567–1576. External Links: ISSN 1558-2183, Document Cited by: §III.
  • [239] W. Xue, Y. Tong, C. Zhang, G. Ding, X. He, and B. Zhou (2020-10) Sound event localization and detection based on multiple DoA beamforming and multi-task learning. In Proc. Interspeech Conf., Shanghai, China (en). External Links: Document Cited by: §IV-D, §V-B, TABLE II.
  • [240] W. Xue, Y. Tong, C. Zhang, and G. Ding (2019) Multi-beam and multi-task learning for joint sound event detection and localization. Technical report (en). Cited by: §IV-D, TABLE I.
  • [241] N. Yalta, K. Nakadai, and T. Ogata (2017) Sound source localization using deep learning models. J. Robotics Mechatron. 29 (1), pp. 37–48. External Links: Document Cited by: §II-C, Fig. 5, §IV-B, §IV-E, §V-C, §VIII-A, TABLE I.
  • [242] W.-H. Yang, K.-K. Chan, and P.-R. Chang (1994) Complex-valued neural network for direction of arrival estimation. Electronics Lett. 30 (7), pp. 574–575. External Links: ISSN 0013-5194, Document Cited by: §II-A.
  • [243] M. Yasuda, Y. Koizumi, S. Saito, H. Uematsu, and K. Imoto (2020-05) Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), virtual Barcelona, pp. 651–655. External Links: Document Cited by: §IV-E, §V-E, §VI-C, TABLE II.
  • [244] M. Yiwere and E. J. Rhee (2017) Distance estimation and localization of sound sources in reverberant conditions using deep neural networks. Int. J. Eng. Research Applic. 12 (22), pp. 12384–12389 (en). Cited by: §IV-A, §V-A2, §VI-A1, TABLE I.
  • [245] K. Youssef, S. Argentieri, and J. Zarader (2013) A learning-based approach to robust binaural sound localization. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Systems (IROS), Tokyo, Japan, pp. 2927–2932. External Links: Document Cited by: §IV-A, §V-A2, §VII-A, TABLE I.
  • [246] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017-03) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New Orleans, LA, pp. 241–245. External Links: Document Cited by: §VI-B.
  • [247] A. Zermini, Y. Yu, Y. Xu, W. Wang, and M. D. Plumbley (2016) Deep neural network based audio source separation. In IMA Int. Conf. Math. Signal Process., Birmingham, UK. Cited by: §IV-G1, §V-A2, TABLE I.
  • [248] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018-04) Mixup: beyond empirical risk minimization. arXiv:1710.09412. Cited by: §VII-C.
  • [249] J. Zhang, W. Ding, and L. He (2019) Data augmentation and priori knowledge-based regularization for sound event localization and detection. Technical report (en). Cited by: §IV-D, §V-C, §VII-C, TABLE I.
  • [250] W. Zhang, Y. Zhou, and Y. Qian (2019-09) Robust DoA estimation based on convolutional neural network and time-frequency masking. In Proc. Interspeech Conf., Graz, Austria, pp. 2703–2707 (en). External Links: Document Cited by: §IV-B, §V-C, TABLE I.
  • [251] F. Zotter and M. Frank (2019) Ambisonics: a practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. Springer Nature (en). External Links: ISBN 978-3-030-17206-0, 978-3-030-17207-7 Cited by: §V-D, §V-E.