Data-Efficient Framework for Real-world Multiple Sound Source 2D Localization

12/10/2020
by Guillaume Le Moing, et al.

Deep neural networks have recently led to promising results for the task of multiple sound source localization. Yet, they require a lot of training data to cover a variety of acoustic conditions and microphone array layouts. One can leverage acoustic simulators to inexpensively generate labeled training data. However, models trained on synthetic data tend to perform poorly with real-world recordings due to the domain mismatch. Moreover, learning for different microphone array layouts makes the task more complicated due to the infinite number of possible layouts. We propose to use adversarial learning methods to close the gap between synthetic and real domains. Our novel ensemble-discrimination method significantly improves the localization performance without requiring any label from the real data. Furthermore, we propose a novel explicit transformation layer to be embedded in the localization architecture. It enables the model to be trained with data from specific microphone array layouts while generalizing well to unseen layouts during inference.

1 Introduction

Sound source localization (SSL) aims at estimating the pose/location of sound sources. With the increasing popularity of installing smart speakers in home environments, source location provides additional knowledge that could enable a variety of applications such as monitoring daily human activities [22], speech enhancement [13] and human-robot interaction [2]. SSL is an active research topic for which various signal-processing methods have been proposed [2, 5]. These data-independent methods work well under strict assumptions [5], e.g. high signal-to-noise ratio (SNR), known number of sources, and low reverberation. Such ideal conditions hardly hold true in real-world applications and usually require special treatments [26, 1]. Data-driven methods, and in particular deep learning, have recently outperformed classical signal-processing methods for various audio tasks [17, 11] including SSL [16, 9].

An advantage of these data-driven methods is their ability to adapt to challenging acoustic conditions. Yet, they require a large amount of training data, which is expensive to gather and label [10]. Acoustic simulators are an appealing solution as they can abundantly generate high-quality labeled datasets. However, models trained on synthetic data as a source domain can suffer from a critical drop in performance when exposed to real-world data as the target domain. This is due to acoustic conditions that are outside the distribution of the synthetic dataset [28, 4], resulting in a domain shift [3].

Recently, there have been several attempts to tackle domain adaptation for SSL. Unsupervised methods using entropy minimization of the localization output [19, 20] have been proposed. However, such methods are not suitable to our problem because entropy minimization encourages the prediction of a single source, whereas we must cater to multiple outputs. In this context, He et al. [10] proposed two adaptation methods compatible with multiple sources: first, a weakly supervised method conditioned on the expected number of sources, but this number is not known in our case; second, an unsupervised method based on Domain Adversarial Neural Networks (DANN) [6], which intends to align latent feature distributions for the synthetic and real-world domains, but it was reported that the method did not improve the localization performance. Still, adversarial methods like DANN [6] are popular outside SSL, e.g. in computer vision [25], and for other audio tasks such as acoustic scene classification [7] and sound event detection [27]. Since it is not clear whether adversarial methods are suitable for SSL, we present the results of extensive experiments with such methods.

Another advantage of these data-driven methods is that a single model can be trained to work with various microphone array layouts. However, the model only performs well with the layouts that existed in the training data [14, 23, 24]. For different layouts, an additional training dataset is required. Hence, the model cannot be used in applications with moving microphone arrays or for which the spatial layout of arrays is not known beforehand. To the best of our knowledge, no prior work has tackled this issue so far.

Figure 1: Outline of two domain adaptation methods with intermediate discrimination.

In this work, we address two critical issues of data-driven methods for SSL: the domain shift and the difficulty of generalizing to unseen layout configurations. Our contributions are two-fold:


  • We investigate two domain adaptation methods based on adversarial learning: gradient reversal and label flipping. For both methods, we propose an ensemble-discrimination method that combines discriminators at different levels of the localization model: the intermediate and output spaces. Through extensive experiments, we show that our ensemble-discrimination method outperforms the standard adversarial learning approaches and, without using any label for the real-world data, can perform nearly as well as localization models that were trained in a supervised fashion on labeled real-world data.

  • We introduce a novel neural network layer, the explicit transformation layer, which makes the localization model invariant to the microphone array layout. Specifically, it allows the network to explicitly use the relative poses between microphone arrays which, in turn, makes the network generalize better to data in unseen layouts during inference. Experiments show that the proposed layer learns more efficiently from varied layouts than baseline methods which process the layout implicitly. Moreover, only our method is able to localize sources from new layouts when trained with a fixed layout.

2 Domain Adaptation for Multiple Sound Source 2D Localization

Training a deep learning model on synthetic data and testing it on real data usually results in a performance drop due to the distributions being different, that is, the simulator is not perfect. However, improving the simulation model is often too difficult. A more practical approach is randomizing the simulator parameters [21]. This expands the source domain in the hopes of covering the target domain, e.g. parameters such as noise and reverberation can be randomized to achieve this effect. Assuming we can no longer improve the simulator and the model performance on real data is still insufficient, domain adaptation methods [3] can be used. This paper focuses on enhancing these methods, in particular those that use adversarial learning, using an ensemble of discriminators.

We start by describing the base task of learning to predict the location of multiple simultaneous sound sources in a 2D horizontal plane. To do that, we extract spectral features $x$ from the sound recorded by a set of microphones. Then, we train a localization model $f_\theta$, with parameters $\theta$, to map spectral features to localization heatmaps $y$: discretized grids with Gaussian distributions centered at the source locations. Training the model amounts to solving:

$$\theta^* = \arg\min_{\theta}\ \mathcal{L}_{loc}\big(f_\theta(x),\, y\big) \qquad (1)$$

where the localization loss $\mathcal{L}_{loc}$ is the mean squared error. More details on the framework can be found in a previous study [14].
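To make this target representation concrete, here is a minimal sketch (PyTorch) of building such a heatmap and evaluating the loss of (1). The grid size, room extent and Gaussian width below are illustrative placeholders, not the paper's values.

```python
import torch

def make_heatmap(sources_xy, grid_h=50, grid_w=50, area=6.0, sigma=0.3):
    """Discretized grid with one Gaussian blob per source.
    sources_xy: list of (x, y) positions in meters inside an `area` x `area` plane."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, area, grid_h),
        torch.linspace(0, area, grid_w),
        indexing="ij",
    )
    heatmap = torch.zeros(grid_h, grid_w)
    for sx, sy in sources_xy:
        g = torch.exp(-((xs - sx) ** 2 + (ys - sy) ** 2) / (2 * sigma ** 2))
        heatmap = torch.maximum(heatmap, g)  # keep the strongest blob at each cell
    return heatmap

# Eq. (1): mean squared error between predicted and target heatmaps.
target = make_heatmap([(1.5, 2.0), (4.0, 3.5)])      # two simultaneous sources
prediction = torch.rand_like(target)                  # stand-in for f_theta(x)
loss = torch.nn.functional.mse_loss(prediction, target)
```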

We wish to augment (1) to perform well on real data while only having labels for the synthetic data. Let $(x^s, y^s)$ denote labeled synthetic samples and $x^r$ unlabeled real-world samples. Without labels $y^r$, (1) cannot be used directly to learn the mapping from real-world sound features to location cues. This is an unsupervised domain adaptation problem. To solve this, we use adversarial learning to make features generalizable and indistinguishable between the synthetic and real domains. A discriminative model $D_\phi$, with parameters $\phi$, is plugged at a given layer of the localization neural network model (see Fig. 1). We denote the submodel of the localization model up to this layer as $g_{\theta_g}$, with parameters $\theta_g$ (a subset of $\theta$). Here, $f_\theta$ is always trained with (1), using only synthetic labeled data, and $D_\phi$ is trained to assign domain class labels ($d=1$ for synthetic and $d=0$ for real) by using a binary cross entropy (BCE) loss ($\mathcal{L}_{BCE}$). Meanwhile, $g_{\theta_g}$ attempts to generate features that cannot be distinguished by $D_\phi$. This can be formalized as a minimax objective [8]:

$$\max_{\theta_g}\ \min_{\phi}\ \mathcal{L}_{BCE}\big(D_\phi(g_{\theta_g}(x)),\, d\big) \qquad (2)$$

To implement (2), it has to be recast as a cost-function minimization. There are two methods for doing this in practice.

Gradient reversal The Gradient Reversal Layer (GRL) [6] is a weight-free layer placed between $g_{\theta_g}$ and $D_\phi$. It behaves as the identity during the forward pass and negates the gradients during the backward pass, changing the $\max_{\theta_g}$ to a $\min_{\theta_g}$ when optimizing the weights $\theta_g$.
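A minimal sketch of such a layer as a PyTorch autograd function follows; the scaling factor lam is a common optional refinement and an assumption here, not a detail from the paper.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient flowing back to g; no gradient for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: the discriminator sees the features unchanged, but g receives reversed
# gradients, so its weights descend where (2) asks them to ascend.
# domain_prob = D(grad_reverse(g(x)))
```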

Label flipping is another method that is commonly used in the adversarial learning community. It involves decomposing the minimax objective (2) into two minimization problems:

$$\min\ \mathcal{L}_{BCE}\big(D_\phi(g_{\theta_g}(x)),\, \hat{d}\big) \qquad (3)$$

where the equation changes depending on the setting for the target label $\hat{d}$. To optimize $\phi$, we set it to $\hat{d} = d$ and refer to it as (3a). To optimize $\theta_g$, we flip the labels by using the setting $\hat{d} = 1 - d$ and refer to it as (3b). To help illustrate this difference in Fig. 1, we replace the label $\mathcal{L}_{BCE}$ with $\mathcal{L}_{adv}$, an adversarial loss, when using (3b).
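The two minimizations (3a) and (3b) can be implemented as alternating update steps, sketched below in PyTorch. The module and optimizer names are placeholders; the detach in (3a) keeps the discriminator step from updating $\theta_g$.

```python
import torch
import torch.nn.functional as F

def discriminator_step(g, D, opt_D, x_syn, x_real):
    """(3a): train D with the true domain labels (synthetic = 1, real = 0)."""
    feats = torch.cat([g(x_syn).detach(), g(x_real).detach()])
    d = torch.cat([torch.ones(len(x_syn), 1), torch.zeros(len(x_real), 1)])
    loss = F.binary_cross_entropy(D(feats), d)
    opt_D.zero_grad(); loss.backward(); opt_D.step()

def generator_step(g, D, opt_g, x_syn, x_real):
    """(3b): train g on the flipped labels (synthetic = 0, real = 1)."""
    feats = torch.cat([g(x_syn), g(x_real)])
    d_flipped = torch.cat([torch.zeros(len(x_syn), 1), torch.ones(len(x_real), 1)])
    loss = F.binary_cross_entropy(D(feats), d_flipped)
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```

Note the two forward-backward passes through $D_\phi$, one per objective, as discussed next.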

Although both gradient reversal and label flipping are methods that intend to solve (2), there are important differences. First, gradient reversal requires one forward-backward pass in $D_\phi$ at each update step whereas label flipping requires two passes, that is, one for each objective (3a), (3b). Second, their gradient computation differs in magnitude at each update step. For label flipping, the weight updates for $\theta_g$ and $\phi$ follow the same dynamic with respect to their objectives, so that the update is larger the farther it is from the optimum. In contrast, the dynamic is inverted for gradient reversal for updates on $\theta_g$. This results in smaller updates farther from the optimum, slowing down convergence. This can cause $D_\phi$ to converge faster than $g_{\theta_g}$, which may result in $D_\phi$ not being able to provide sufficient gradients to improve $g_{\theta_g}$ [8]. Although we present this basic analysis, stable adversarial learning is still an active research topic. Therefore, we compare both methods with extensive empirical results.

Ensemble of Discriminators To describe our proposed ensemble-discrimination method, we must first present what discrimination levels in adversarial learning mean. A discrimination level refers to the layer of $f_\theta$ at which we conduct adversarial learning. Although this level can be moved continuously, we opt to take only the two extreme levels of discrimination. The first is intermediate discrimination (int-dist), where we place the discriminator $D_\phi$ right after the encoder of the SSL model in [14]. In this case, the submodel $g_{\theta_g}$ is the encoder. We do not go further into the encoder, to allow enough capacity for $g_{\theta_g}$ to learn domain-independent features. The second is output discrimination (out-dist), where we place $D_\phi$ after the output layer such that $g_{\theta_g}$ is all of $f_\theta$. Note that the $y^s$ and $y^r$ distributions being similar (number and 2D location of sources) is a prerequisite for the success of this strategy. If not, $D_\phi$ can learn the dissimilarities, and the generator will unwantedly distort its predictions to satisfy (2). Constraining the output is common for domain adaptation in SSL with, for example, entropy methods [19, 20] or weak supervision [10]. However, such methods can degrade the performance by boosting incorrect low-confidence predictions [10], resulting in more false positives. Using out-dist carries this same concern. Lastly, our ensemble-discrimination method uses both int-dist and out-dist. For this, the two discriminators are trained independently, together with one common localization model $f_\theta$. We expect each discriminator to enhance $f_\theta$ in different aspects and lead to improved results.
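As an illustration, one training step with both discriminators might look like the sketch below, reusing the grad_reverse helper sketched earlier; the module names, the weighting lam, and the use of gradient reversal for both terms are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def ensemble_step(encoder, decoder, D_int, D_out, x_syn, y_syn, x_real, lam=0.1):
    """One update of the shared localization model with int-dist and out-dist."""
    z_syn, z_real = encoder(x_syn), encoder(x_real)        # intermediate features
    out_syn, out_real = decoder(z_syn), decoder(z_real)    # localization heatmaps

    # Eq. (1): supervised localization loss, synthetic labels only.
    loss = F.mse_loss(out_syn, y_syn)

    # int-dist on encoder features, out-dist on heatmaps; BCE on domain labels
    # (synthetic = 1, real = 0), with gradients reversed before the shared model.
    for D, f_syn, f_real in ((D_int, z_syn, z_real), (D_out, out_syn, out_real)):
        p = torch.cat([D(grad_reverse(f_syn, lam)), D(grad_reverse(f_real, lam))])
        d = torch.cat([torch.ones(len(x_syn), 1), torch.zeros(len(x_real), 1)])
        loss = loss + F.binary_cross_entropy(p, d)
    return loss  # the two discriminators keep their own, independent optimizers
```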

3 Explicit Transformation Layer

Figure 2: Layout generalization methods: ET layer and two baselines.

We introduce a novel explicit transformation (ET) layer for layout-invariant SSL (see Fig. 2). Here, we describe the use of the ET layer in a layout with two microphone arrays, $A$ and $B$, but the extension to more arrays is straightforward. Let tensor $F^m$ of size $W \times H \times C$ be the output of the decoder for array $m \in \{A, B\}$, expressed in the coordinate system of $m$, where $W$, $H$ and $C$ are the width, height and number of feature maps. $F^A$ and $F^B$ are the outputs of the decoder after passing the raw sound of each array through the network of [14]. The role of ET is to transform $F^B$ to $F^{B \to A}$. Let $F^B_c$, of size $W \times H$, be the $c$-th feature map of tensor $F^B$. Let $(t_x, t_y, \alpha)$ be the relative pose of $B$ from $A$ in a 2D plane; the 2D Euclidean transformation from $B$ to $A$ is defined by:

$$T_{B \to A} = \begin{bmatrix} \cos\alpha & -\sin\alpha & t_x \\ \sin\alpha & \cos\alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (4)$$

such that the relation of locations can be computed as $p^A = T_{B \to A}\, p^B$, or equivalently $p^B = T_{B \to A}^{-1}\, p^A$, where $p^A$ is a position (in homogeneous coordinates) in $F^{B \to A}$ and $p^B$ is a position in $F^B$.

Now that we have this correspondence, we can compute the values of $F^{B \to A}_c$. First, for each position $p^A$ in $F^{B \to A}_c$, we find the corresponding point $p^B$ in the space of $F^B_c$ based on the transformation above. This point has floating-point coordinates. We then take the values of the closest discrete positions in $F^B_c$ and compute the value of $F^{B \to A}_c(p^A)$ using bilinear interpolation. This is done for all positions of $F^{B \to A}_c$. When applied to all feature maps of $F^B$, we obtain $F^{B \to A}$, which is the desired output of the ET layer.

ET is a specialized form of spatial transformer network [12]. It is a weight-free layer which is computed according to the known relative pose between the two microphone arrays. Intuitively, the role of the ET layer is to transform the decoder output of a target microphone array from its own coordinate system to the coordinate system of the reference microphone array. In addition, gradients can back-propagate through the ET layer during training.
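Since the ET layer amounts to resampling feature maps under a known rigid transform, it can be sketched with PyTorch's affine_grid/grid_sample, which implement exactly the bilinear-interpolation scheme above. The mapping from meters to normalized [-1, 1] grid coordinates (the area parameter) and a rotation taken about the grid center are assumptions of this sketch.

```python
import math
import torch
import torch.nn.functional as F

def et_layer(feat_b, tx, ty, alpha, area=6.0):
    """feat_b: (N, C, H, W) decoder output of array B, in B's coordinate system.
    Returns the features resampled into array A's coordinate system (F^{B->A})."""
    # grid_sample *pulls* values: for each output position p_A we sample the
    # input at p_B = T^{-1} p_A, so the grid encodes the inverse of Eq. (4).
    c, s = math.cos(alpha), math.sin(alpha)
    ntx, nty = 2 * tx / area, 2 * ty / area   # translation in normalized coords
    theta = torch.tensor(
        [[c,  s, -(c * ntx + s * nty)],       # rows of [R^T | -R^T t]
         [-s, c, (s * ntx - c * nty)]],
        dtype=feat_b.dtype, device=feat_b.device,
    ).repeat(feat_b.size(0), 1, 1)
    grid = F.affine_grid(theta, list(feat_b.shape), align_corners=False)
    return F.grid_sample(feat_b, grid, mode="bilinear", align_corners=False)
```

Positions that fall outside the grid of $F^B$ are zero-padded by default, and gradients flow through the bilinear weights, matching the back-propagation property noted above.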

4 Experiments

4.1 Dataset

Our data-collection setup was extensively described in the previous study [14], but the main points are recalled here. Experiments are conducted in an area with two microphone arrays. Music clips from the classical and funk genres are used for recording the training, validation and testing data splits. An additional testing dataset from the jazz genre is recorded to verify the generalizability of the model. One or two sources can be active simultaneously, and sources are kept a minimum distance apart to ease the evaluation.

We synthesize data with Pyroomacoustics [18], recreating the real room configuration. We generate three training datasets: TrainS, an anechoic chamber with a low-noise setting and a fixed array layout; TrainS+, same as TrainS but with randomized room size and SNR; and TrainSa, same as TrainS but with various microphone array layouts. Another synthetic dataset with various microphone array layouts, TestSa, is generated for testing. We make sure there is no overlap between TrainSa and TestSa in terms of layout configurations.
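For illustration, a minimal sketch of generating one such sample with Pyroomacoustics follows; all numeric values (sampling rate, room-size and SNR ranges, array geometry) are placeholders in the spirit of TrainS+, not the paper's settings.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
side = np.random.uniform(5.0, 8.0)                    # randomized room size (m)
room = pra.ShoeBox([side, side], fs=fs, max_order=0)  # max_order=0 -> anechoic

signal = np.random.randn(fs)                          # stand-in for a music clip
src = np.random.uniform(0.5, side - 0.5, size=2)      # random 2D source position
room.add_source(src.tolist(), signal=signal)          # src is also the label

mic_center = np.array([side / 2, side / 2])
R = pra.circular_2D_array(center=mic_center, M=4, phi0=0, radius=0.05)
room.add_microphone_array(pra.MicrophoneArray(R, fs))

room.simulate(snr=np.random.uniform(10, 30))          # randomized SNR (dB)
mic_signals = room.mic_array.signals                  # input to the STFT front end
```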

We also record real-world data. To ease labeling, we use an augmentation method [14], wherein we capture one active source at a time and use sound superposition to generate samples with multiple sources, from which fixed-length audio samples are extracted. TrainR, ValidR, TestRc and TestRj are the datasets for training, validation and testing, respectively. All of them are made of classical and funk music clips, except for TestRj, which is made of jazz.

4.2 Experimental Protocols

Our localization model $f_\theta$ has the encoder-decoder architecture of [14] and takes short-time Fourier transform (STFT) features as input. We conduct independent experiments for the adversarial adaptation methods and the layout generalization methods.

For adversarial adaptation, input features are processed in individual encoders for each array, merged, and then decoded together. Discrimination is conducted on the merged encoded features for int-dist, and on the localization heatmaps for out-dist. The int-dist discriminator consists of dense layers, while the out-dist discriminator is composed of convolutional layers followed by dense layers. A ReLU activation is used after each layer, except for the last one, which is a sigmoid.

For layout generalization, we use individual decoders for each array and we use channel concatenation followed by convolutional layers to merge the two streams after the decoding part (see Fig. 2).

Our evaluation metrics are precision, recall and F1-score. Source coordinates are extracted from the output heatmap as predicted keypoints [14], and then matched with the ground-truth keypoints with a resolution threshold of 0.3 m on synthetic data and 1 m on real data. Based on the matches, we count the true positives (TP), false positives (FP) and false negatives (FN), from which we compute the F1-score. The root mean square error (RMSE) of the matches is provided as an additional indicator of quality.
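For concreteness, a sketch of this matching-based scoring follows. The greedy nearest-neighbor matching is an assumption of the sketch; the paper does not specify the matching algorithm here.

```python
import numpy as np

def score(pred, gt, thresh=1.0):
    """pred, gt: (n, 2) arrays of 2D positions in meters. thresh is the
    matching resolution (0.3 m on synthetic data, 1 m on real data above)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    used, sq_err = set(), []
    for p in pred:
        dists = [np.linalg.norm(p - g) if i not in used else np.inf
                 for i, g in enumerate(gt)]
        if dists and min(dists) <= thresh:        # greedy nearest-neighbor match
            used.add(int(np.argmin(dists)))
            sq_err.append(min(dists) ** 2)
    tp = len(used)                                 # matched predictions
    fp, fn = len(pred) - tp, len(gt) - tp
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    rmse = float(np.sqrt(np.mean(sq_err))) if sq_err else float("nan")
    return precision, recall, f1, rmse
```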

Methods are trained for a fixed number of epochs with the Adam optimizer and a fixed initial learning rate and batch size. Adversarial methods being notoriously unstable, we report the mean and standard deviation of 5 independent trainings with different random seeds. For layout generalization methods, we report one run.


  • Lower bound method S is a supervised method trained on labeled synthetic data (TrainS). It represents the minimum performance without any knowledge of the target domain.

  • Upper bound method R is a supervised method trained on labeled real-world data (TrainR). It is an approximation of the maximum performance [3], attained when a lot of effort is dedicated to labeling real-world data, which we would like to avoid in practice.

  • Others refers to methods that use modified training data. Method R&S uses TrainR and TrainS, both with labels, to train $f_\theta$. It sacrifices data quality by mixing domains, but uses the same data quantity as the adversarial methods. The main burden is that it requires labels for the real-world data. Method S+ uses domain randomization [21], replacing TrainS with TrainS+ in method S.

  • Adversarial Adaptation methods are the variations of unsupervised domain adaptation methods using adversarial learning. These are trained on TrainS with labels and TrainR without labels. These methods are: GRint, LFint, GRout, LFout, and the ensemble methods GRintGRout and LFintLFout. GR and LF refer to gradient reversal and label flipping respectively, and int and out are short for the discrimination levels int-dist and out-dist. We also consider GRintGRout+ and LFintLFout+, which use TrainS+ instead of TrainS.

  • Layout Generalization methods are composed of two baselines and the proposed method, shown in Fig. 2. (a) Plain Encoder-Decoder (black) is the plain network without any additional input or layer. (b) Fully-connected (blue) receives the relative pose between arrays as input, processes it through a fully-connected (FC) layer, and concatenates the resulting features to the output of the encoder. (c) Explicit Transformation (red) uses the relative pose to define an ET layer which then transforms the output of the array decoder.

4.3 Results and Discussion

Figure 3: F1-Scores of adversarial adaptation methods compared with upper-bound and lower-bound methods.
(Test Data) TestRc (classical & funk)

Method                       RMSE ↓      Precision ↑   Recall ↑
Low B.       S               .51 ± .01   .80 ± .03     .48 ± .03
Up B.        R               .34 ± .01   .95 ± .03     .71 ± .03
Others       R&S             .33 ± .02   .96 ± .03     .71 ± .04
             S+              .44 ± .00   .84 ± .01     .57 ± .01
Adv. Adap.   GRint           .48 ± .03   .81 ± .05     .57 ± .04
             LFint           .50 ± .03   .71 ± .06     .44 ± .09
             GRout           .51 ± .02   .68 ± .02     .67 ± .02
             LFout           .51 ± .02   .64 ± .04     .67 ± .02
             GRintGRout      .47 ± .05   .82 ± .05     .72 ± .03
             LFintLFout      .53 ± .03   .65 ± .10     .60 ± .09
             GRintGRout+     .46 ± .02   .76 ± .07     .71 ± .05
             LFintLFout+     .50 ± .02   .72 ± .09     .65 ± .02

Low B., Up B. and Adv. Adap. refer to lower bound, upper bound and adversarial adaptation respectively. ↑ / ↓ denote that higher or lower is better. Values are mean ± standard deviation over 5 runs.

Table 1: Domain adaptation results on additional metrics.

Adversarial learning Localization performance across all methods for TestRc is illustrated in Fig. 3 and Table 1. For our lower bound baseline S, there are not many detections (low recall) but they are mostly correct (high precision), and correct predictions have a low quality (high RMSE). In comparison, method R which uses labeled real-world data, yielded significantly better performance across all metrics. In theory [3], our domain adaptation methods should have results within this range. Method R&S marginally improves performance compared to method R on TestRc. Method S+ shows a slight improvement over S but is still far from method R and R&S. Most domain adaptation methods, especially our ensemble-discrimination methods, improve in performance compared to the lower bound method S. In particular, the improvement in recall for some of our proposed methods compared to S is significant and leads to a higher F1-score as shown in Fig. 3.

The results of domain adaptation seem to be even better on TestRj, even though the model is trained on classical and funk excerpts. We speculate that this is due to TrainS being more similar to TestRj in terms of spectral features. This explains why the lower bound S is higher for TestRj while the upper bound R is lower.

Methods based on GR tend to have better performance and lower variance compared to LF. In particular, one of our ensemble-discrimination methods, GRintGRout, leads to the best results across all test sets and performs almost as well as method R while not using labels for the real data. From Table 1, we can see that methods with int-dist lead to a higher precision and a lower RMSE, while out-dist mainly increases recall. This shows that int-dist improves the existing detections while out-dist encourages more detections, but at a higher risk of false positives. All in all, combining them gives the best results. Randomization (using TrainS+ instead of TrainS) did not seem to further improve the ensembling results. While most unsupervised domain adaptation methods studied in this paper outperform the lower bound method, which is contrary to the results presented in [10], adversarial methods are notoriously difficult to train [15]. The high variance of LF in our results further supports this.

Layout Generalization Table 2 displays the results on layout generalization. When trained with a fixed array layout (TrainS), ET can generalize to new layouts fairly well while the FC baseline cannot. Moreover, when trained with various layouts (TrainSa), the plain network, which has to learn the pose implicitly from the sound signals, performs poorly, while ET yields the best F1-score and RMSE.

(Test Data) TestSa

(Train Data)              TrainS                TrainSa
Method                    F1-Score ↑   RMSE ↓   F1-Score ↑   RMSE ↓
Plain Encoder-Decoder     -            -        .35          .17
Fully-connected           .09          .18      .54          .17
Explicit Transformation   .62          .14      .80          .11

Table 2: Performance of layout generalization.

5 Conclusion

This paper proposed two solutions for data-efficient learning of multiple sound source 2D localization. First, we studied adversarial adaptation methods. These methods, and in particular the proposed ensembling method, yield a significant improvement compared to solely training on synthetic data, at a much lower cost than training on labeled real-world data. Second, we made the localization model invariant to the pose of the microphone arrays. The proposed method outperforms baseline approaches and greatly reduces the need for data since it can generalize from a few layouts to unseen layouts.

References

  • [1] A. Alexandridis and A. Mouchtaris (2018) Multiple sound source location estimation in wireless acoustic sensor networks using DOA estimates: the data-association problem. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2), pp. 342–356.
  • [2] S. Argentieri, P. Danes, and P. Souères (2015) A survey on sound source localization in robotics: from binaural to array processing methods. Computer Speech & Language 34 (1), pp. 87–112.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Vaughan (2010) A theory of learning from different domains. Machine Learning 79, pp. 151–175.
  • [4] D. Di Carlo, A. Deleforge, and N. Bertin (2019) Mirage: 2D source localization using microphone pair augmentation with echoes. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [5] M. Cobos, F. Antonacci, A. Alexandridis, A. Mouchtaris, and B. Lee (2017) A survey of sound source localization methods in wireless acoustic sensor networks. Wireless Communications and Mobile Computing.
  • [6] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (59), pp. 1–35.
  • [7] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen (2018) Unsupervised adversarial domain adaptation for acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 138–142.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
  • [9] W. He, P. Motlicek, and J.-M. Odobez (2018) Joint localization and classification of multiple sound sources using a multi-task neural network. In Proc. Interspeech.
  • [10] W. He, P. Motlicek, and J.-M. Odobez (2019) Adaptation of multiple sound source localization neural networks with weak supervision and domain-adversarial training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [11] T. Inoue, P. Vinayavekhin, S. Wang, D. Wood, N. Greco, and R. Tachibana (2018) Domestic activities classification based on CNN using shuffling and mixing data augmentation. Technical report, DCASE2018 Challenge.
  • [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, pp. 2017–2025.
  • [13] P. Jeyasingh and M. Mohamed Ismail (2018) Real-time multi source speech enhancement based on sound source separation using microphone array. In 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), pp. 183–187.
  • [14] G. Le Moing, P. Vinayavekhin, T. Inoue, J. Vongkulbhisal, A. Munawar, R. Tachibana, and D. J. Agravante (2019) Learning multiple sound source 2D localization. In IEEE 21st International Workshop on Multimedia Signal Processing (MMSP).
  • [15] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31, pp. 700–709.
  • [16] P. Pertilä and E. Cakir (2017) Robust direction estimation with convolutional neural networks based steered response power. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6125–6129.
  • [17] J. Salamon and J. P. Bello (2017) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24 (3), pp. 279–283.
  • [18] R. Scheibler, E. Bezzam, and I. Dokmanic (2018) Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
  • [19] R. Takeda and K. Komatani (2017) Unsupervised adaptation of deep neural networks for sound source localization using entropy minimization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [20] R. Takeda, Y. Kudo, K. Takashima, Y. Kitamura, and K. Komatani (2018) Unsupervised adaptation of neural networks for discriminative sound source localization with eliminative constraint. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [21] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • [22] M. Vacher, F. Portet, A. Fleury, and N. Noury (2011) Development of audio sensing technology for ambient assisted living: applications and challenges. International Journal of E-Health and Medical Communications (IJEHMC).
  • [23] J. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa (2018) Towards end-to-end acoustic localization using deep learning: from audio signals to source position coordinates. Sensors 18 (10), pp. 3418.
  • [24] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza (2016) A neural network based algorithm for speaker localization in a multi-room environment. In IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
  • [25] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR.
  • [26] W.-K. Ma, B.-N. Vo, S. S. Singh, and A. Baddeley (2006) Tracking an unknown time-varying number of speakers using TDOA measurements: a random finite set approach. IEEE Transactions on Signal Processing 54, pp. 3291–3304.
  • [27] W. Wei, H. Zhu, E. Benetos, and Y. Wang (2020) A-CRNN: a domain adaptation model for sound event detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 276–280.
  • [28] N. Yalta, K. Nakadai, and T. Ogata (2017) Sound source localization using deep learning models. Journal of Robotics and Mechatronics 29 (1), pp. 37–48.