Semi-supervised source localization in reverberant environments with deep generative modeling

01/26/2021 ∙ Michael J. Bianco et al. ∙ University of California, San Diego

A semi-supervised approach to acoustic source localization in reverberant environments, based on deep generative modeling, is proposed. Localization in reverberant environments remains an open challenge. Even with large data volumes, the number of labels available for supervised learning in reverberant environments is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional variational autoencoders (VAEs) on speech signals in reverberant environments. The VAE is trained to generate the phase of relative transfer functions (RTFs) between microphones, in parallel with a direction of arrival (DOA) classifier based on RTF-phase, on both labeled and unlabeled RTF samples. In learning to perform these tasks, the VAE-SSL explicitly learns to separate the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. Relative to existing semi-supervised localization methods in acoustics, VAE-SSL is effectively an end-to-end processing approach which relies on minimal preprocessing of RTF-phase features. The VAE-SSL approach is compared with the steered response power with phase transform (SRP-PHAT) and fully supervised CNNs. We find that VAE-SSL can outperform both SRP-PHAT and CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples, which shows the VAE-SSL approach learns the physics of the acoustic environment. The generative modeling in VAE-SSL thus provides a means of interpreting the learned representations.


I Introduction

Source localization is an important problem in acoustics and many related fields. The performance of localization algorithms is degraded by reverberation, which induces complex temporal arrival structure at sensor arrays. Despite recent advances, e.g. [1, 2, 3], acoustic localization in reverberant environments remains a major challenge [4]. There has been great interest in machine learning (ML)-based techniques in acoustics, including source localization and event detection [5, 6, 7, 8, 9, 10, 11, 12, 13]. A challenge for ML-based methods in acoustics is the limited amount of labeled data and the complex acoustic propagation in natural environments, despite large volumes of recordings [1, 2]. This limitation has motivated recent approaches to localization based on semi-supervised learning (SSL) [14, 15].

We approach source localization from the perspective of semi-supervised learning, with the intent of addressing real-world applications of ML. It has been shown that it is relatively easy to generate large amounts of synthetic data resembling real-world sound measurement configurations. This synthetic data has then been used to train ML-based localization models (e.g.[9]), with very good performance. However, in most real scenarios, room geometry includes irregular boundaries, scattering, and diffracting elements (e.g., furniture and uneven surfaces) which may not be convenient to model using acoustic propagation software.

In this paper we contend, in line with the semi-supervised learning paradigm, that for a microphone system with fixed geometry recording in a room environment, the reverberant signals convey physical characteristics of the room: given sufficient time and variety of source locations, the physics of the room may be well modeled using unsupervised ML [16, 14, 17]. We further observe that if a few labels are available, obtained for instance from cell-phone data, such unsupervised representations can be leveraged to localize sources in the room.

We propose an SSL localization approach based on deep generative modeling with variational autoencoders (VAEs) [18]. Deep generative models [19], e.g. generative adversarial networks (GANs) [20], have received much attention for their ability to learn high-dimensional sample distributions, including those of natural images [21]. In acoustics, deep generative models have had success in generating raw audio [22] and in speech enhancement [23]. VAEs learn explicit latent codes for generating samples from high-dimensional distributions and are inspiring examples of representation learning [24, 25].

We use VAEs to perform SSL. In our proposed approach, VAEs are used to encode and generate the phase of the relative transfer function (RTF) between two microphones [16]. The VAE is trained in parallel with a classifier network to benefit from both labeled and unlabeled examples. The resulting model estimates DOA and generates RTF-phase sequences. By learning to generate RTF phase, the VAE-SSL system learns the physical model relating the latent representation and DOA label to the RTF-phase.

This approach is a form of manifold learning [19, 26, 27]. Manifold learning has recently been proposed for semi-supervised source localization in room acoustics [14]. In that work, diffusion mapping was used to obtain a graphical model relating source location to RTF features. It was shown that a lower-dimensional latent representation of the RTFs, extracted by diffusion mapping, correlated well with source positions.

We present our VAE-SSL method as an alternative to this manifold learning approach. Recent work has indicated the capabilities of deep learning in manifold learning[26]. VAE-SSL uses the non-linear modeling capabilities of deep generative modeling to obtain a semi-supervised localization approach which is less reliant on preprocessing and hand-engineering of latent representations. Thus, the approach can be regarded as nearly end-to-end. The VAE-SSL system is designed to not rely on significant preprocessing of the RTFs. Through gradient-based learning it automatically determines the best latent and discriminative representations for the given task. Instead of spectral averaging, we input to the system a sequence of instantaneous RTFs and allow the statistical model to best utilize the patterns in the data.

Our generative modeling approach disentangles the causes of the RTF-phase (i.e., source DOA) from other signal factors including source frequency variation [28, 29]. This is accomplished by training a generative model and a classifier network in parallel. The disentanglement enables the system to learn a task-specific representation of the RTF-phase patterns in the input features that are most relevant to the DOA estimation task. Further, the trained VAE in VAE-SSL can be used to conditionally generate RTF-phase features by sampling over the latent representation and DOA label. This output can be interpreted physically, and thereby allows us to verify that the system obtains a physically meaningful representation.

We build upon our previous work in deep generative modeling for source localization [30]. In that work, the VAE-SSL approach learned to localize acoustic sources well with only a few labeled RTF-phase sequences and extensive unlabeled data. The system was trained and tested on noise signals in simulated environments.

In our current study, we extend our VAE-SSL concepts to more realistic acoustic scenarios, refine the inference and generative architectures, and demonstrate and characterize the implicit acoustic model learned by the generative model by using the trained VAE-SSL system to conditionally generate RTF-phase sequences. This includes consideration of the appropriate likelihood functions for the generative model. We train and validate the learning-based approaches on speech data in two reverberant environments. As part of our study, we have obtained a new acoustic dataset at the Technical University of Denmark (DTU). This acoustic dataset consists of reverberant acoustic impulse responses (IRs) recorded in a classroom at DTU from several source locations. In this dataset, we also considered off-grid and off-range measurements to test the generalization of the learning-based methods. As far as we are aware, our paper represents the first approach to modeling the physics of acoustic propagation using deep generative modeling with real acoustic data.

The VAE-SSL method is implemented using convolutional neural networks (CNNs). The performance of the convolutional VAE-SSL in reverberant environments is assessed against the steered response power with phase transform (SRP-PHAT) [31] approach and a supervised convolutional neural network (CNN).

We train and validate the learning-based approaches (VAE-SSL and supervised CNN) using reverberant speech. We use speech data from the LibriSpeech corpus [32]. Reverberant speech is obtained by convolving dry speech with estimated IRs from two real-world datasets: the Multichannel Impulse Response (MIR) database [33] and a room-acoustics dataset recently measured at the Technical University of Denmark (DTU).

We show that the VAE-SSL can outperform conventional source localization approaches, as well as fully supervised approaches, in label-limited scenarios. This includes scenarios where the source may be off-range or off-grid relative to the design case, as well as variations in reverberation time and speech signals. We further show that the implicit physical model obtained with VAE-SSL can be used to generate RTF-phase features. The physical characteristics of the generated RTF-phase are assessed by analyzing the phase wrap of the generated phase and the corresponding phase-delay of the generated RTF time-domain representation.

II Theory

We use RTFs [16], specifically the RTF-phase, as the acoustic feature for our VAE-SSL approach. Since the RTF is independent of the source, this feature helps to focus ML on physically relevant features, and thereby reduces the sample complexity of the model. We encode a temporal sequence of instantaneous RTF-phases as a function of source azimuth (direction of arrival, DOA). We choose the instantaneous RTF-phase, calculated using a single STFT frame (i.e. no averaging applied), to minimize the intervention of feature preprocessing. This allows the VAE-SSL method to extract, end-to-end, the important features for source localization and RTF-phase generation. This formulation may facilitate the localization of moving sources, namely speaker tracking problems. This extension is left for future work.

II-A Relative transfer function (RTF)

We consider short-time Fourier transform (STFT) domain acoustic recordings of the form

x_{m,k} = a_{m,k} s_k + v_{m,k},    (1)

with s_k the source signal (m the microphone index and k the frequency index), a_{m,k} the acoustic transfer function relating the source and each of the microphones, and v_{m,k} noise signals which are independent of the source. Then, the relative transfer function (RTF) is defined as [14, 16]

h_k = a_{2,k} / a_{1,k},    (2)

with k the frequency index. With microphone 1 as reference, the instantaneous RTF is calculated using a single STFT frame as

ĥ_k = x_{2,k} / x_{1,k}.    (3)

This estimator is biased since we neglect the noise spectra [Ref. [14], Eq. (5)]. An unbiased estimator can be obtained, but we observe here that the biased estimate works well; see also [34, 35]. For each STFT frame, a vector of RTFs ĥ = [ĥ_1, ..., ĥ_K]^T is obtained, with K the number of frequencies used.

The input to the VAE-SSL and the supervised CNN is a temporally ordered sequence of RTF-phase vectors. The RTF-phase sequence is x = [∠ĥ_1, ..., ∠ĥ_T], with T the number of RTF-phase frames in the sequence (here T = 31). We use the wrapped RTF-phase, which is on the interval [−π, π] radians.
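As an illustration of the feature described above, the following sketch computes wrapped instantaneous RTF-phase sequences from a two-microphone recording per (3). The STFT parameters and array sizes here are assumptions consistent with Table I, and the code is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch (assumed parameters): instantaneous RTF-phase sequence per Eq. (3).
import numpy as np
from scipy.signal import stft

fs, nperseg = 16000, 256        # assumed sampling rate and STFT segment length
n_freqs, T = 127, 31            # frequencies kept and frames per sequence (Table I)

def rtf_phase_sequence(x1, x2):
    """Wrapped instantaneous RTF-phase sequence for a two-microphone recording."""
    _, _, X1 = stft(x1, fs=fs, window="hamming", nperseg=nperseg, noverlap=nperseg // 2)
    _, _, X2 = stft(x2, fs=fs, window="hamming", nperseg=nperseg, noverlap=nperseg // 2)
    h = X2[:n_freqs, :T] / (X1[:n_freqs, :T] + 1e-12)   # Eq. (3), microphone 1 as reference
    return np.angle(h)                                   # wrapped phase on [-pi, pi]

# Example with random signals standing in for the two reverberant microphone channels
phase_seq = rtf_phase_sequence(np.random.randn(fs), np.random.randn(fs))
print(phase_seq.shape)   # (127, 31)
```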

II-B Semi-supervised DOA estimation

In acoustics we are often faced with scenarios where we have large volumes of acoustic recordings from arrays but potentially only a few labels. The recordings themselves contain physical information, but the paucity of labels limits the task-specific value of this physical information. We here address this issue by formulating a semi-supervised learning-based approach to source localization. We use a VAE model to obtain latent distributions of the RTF-phase physics for a particular room and corresponding DOAs from labeled and unlabeled examples.

In our SSL formulation of DOA estimation, only a subset of the full dataset of RTF sequences has corresponding DOA labels. Here the DOA labels are represented by y, a one-hot encoding, with L the number of DOA classes. We thus have labeled and unlabeled sets, D_l and D_u. We further have labels for the unlabeled sequences, which are reserved to test the performance of the system only after training and validation. The sizes of the sets are N_l = |D_l| and N_u = |D_u|. We thus have N_l labeled RTF sequences and N_u unlabeled sequences.

The N_u unlabeled RTFs contain physical information, which is extracted by unsupervised learning. The N_l labeled RTFs have both physical information and corresponding DOA labels, which help give the model task-specific value for DOA estimation by training the classifier. The generative model, which is trained on all samples, helps guide the classifier training when labels are not available. Our goal is thus to well-infer the labels corresponding to labeled and unlabeled RTFs based on the trained VAE-SSL model.

Fig. 1: VAE-SSL generative (a) and inference (b) models with corresponding likelihoods. (a) Generative model likelihood parameterized by decoder, see Fig. 3(c). (b) Inference model likelihoods parameterized by encoders, see Fig. 3(a,b).

II-C Semi-supervised learning with VAEs

We formulate a principled semi-supervised learning framework based on VAEs [24, 29]. A classifier neural network (NN) and VAE are trained jointly, using both labeled and unlabeled data. The approach treats the label as either observed or latent, depending on whether the data is labeled (x ∈ D_l) or unlabeled (x ∈ D_u). This corresponds to the ‘M2’ model in [24]. To simplify notation, we disregard sample subscripts in the following derivation.

We assume each RTF-phase sequence x is generated by a random process involving the latent random variable z and the source location label y.

VAEs [24, 29] combine the rich, high-dimensional modeling capacity of NNs with the tools of variational inference (VI) [36] to learn generative models of high-dimensional distributions. We here adopt the VAE approach, and in the following the conditional probabilities are parameterized using NNs.

We assume the likelihood for each RTF-phase sequence x given label y and latent representation z is

p_θ(x|y, z) = f(x; y, z, θ),    (4)

with f a suitable likelihood function and θ the parameters of the generative model. y and z are assumed independent, with marginal densities p(y) and p(z). We thus have the generative model p_θ(x|y, z) p(y) p(z), giving the graphical model in Fig. 1(a). The parameters θ in our paper are the parameters of the NN associated with the generative model (decoder).

Now we are presented with the challenge of inferring y (when it is not specified) and the latent variable z. For labeled data x ∈ D_l, the posterior of the latent variable is

p_θ(z|x, y) = p_θ(x|y, z) p(z) / p_θ(x|y),    (5)

and for unlabeled data x ∈ D_u, the joint posterior of the latent and the label is

p_θ(z, y|x) = p_θ(x|y, z) p(y) p(z) / p_θ(x).    (6)

Direct estimation of the posterior, e.g. from (5), is nearly always intractable due to the evidence p_θ(x|y) (or p_θ(x)). Thus the posteriors are approximated using VI.

A variational approximation to the intractable posterior is q_φ(z, y|x) = q_φ(z|x, y) q_φ(y|x), with the subscript φ corresponding to distributions and functions defined using the encoder networks. The graphical model for the corresponding inference model is shown in Fig. 1(b). The inference model q_φ(y|x) for the DOA label results from the following derivations.

Starting with the model for the labeled data, see (5), per VI we seek the q_φ(z|x, y) which minimizes the KL-divergence

KL(q_φ(z|x, y) || p_θ(z|x, y)).    (7)

Considering first the labeled data, the intractable posterior p_θ(z|x, y) is approximated by q_φ(z|x, y). Assessing the KL-divergence, we obtain

KL(q_φ(z|x, y) || p_θ(z|x, y)) = log p_θ(x|y) − (E_q[log p_θ(x|y, z) p(z)] − E_q[log q_φ(z|x, y)]),    (8)

with E_q the expectation relative to q_φ(z|x, y). This reveals the dependence of the KL divergence on the evidence log p_θ(x|y), which is intractable. The other two terms in (8) form the evidence lower bound (ELBO). Since the KL divergence is non-negative, the ELBO ‘lower bounds’ the evidence: −L(x, y) ≤ log p_θ(x|y). Maximizing the ELBO is equivalent to minimizing the KL-divergence (7). We thus minimize L(x, y).

Considering the ELBO terms from (8), we formulate the objective for labeled data, with y and z independent (in the generative model, see Fig. 1(a)),

L(x, y) = −E_{q_φ(z|x,y)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(z|x, y)].    (9)

This follows [Ref. [24], Eq. (6)]. Next, an objective for unlabeled data is derived. The intractable posterior from (6) is approximated by q_φ(z, y|x) = q_φ(z|x, y) q_φ(y|x). From the KL-divergence, we find the objective (negative ELBO), using terms from (6), as

U(x) = −E_{q_φ(z,y|x)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(z, y|x)],    (10)

with the expectation relative to q_φ(z, y|x). Further expanding (10) we obtain

U(x) = Σ_y q_φ(y|x) L(x, y) − H(q_φ(y|x)).    (11)

This follows [Ref. [24], Eq. (7)]. More details of the derivation are given in Appendix A. Assessing the terms in the supervised learning objective (9), it does not condition the label y on the sample x, since for the supervised case the label is assumed known per (5). The classifier term q_φ(y|x) is only present in the unsupervised learning objective (11). In this configuration, the classifier network learns only from the unlabeled sequences. It is important for the classifier to also learn from the labeled sequences, and we enforce this by adding an auxiliary term to the supervised objective. This is a typical procedure [24].

An overall objective for training the VAE and classifier models using labeled and unlabeled data is derived by combining (9) and (11) with the auxiliary term:

J = Σ_{(x,y) ∈ D_l} [L(x, y) − α log q_φ(y|x)] + Σ_{x ∈ D_u} U(x),    (12)

with α a scaling term, selected by hyperparameter optimization. This follows [Ref. [24], Eqs. (8, 9)]. The optimization of the VAE-SSL based on the objective (12) is performed using stochastic variational inference [18], implemented with the probabilistic programming package Pyro [37].
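To make the structure of (9)–(12) concrete, the sketch below computes the labeled, unlabeled, and auxiliary terms for a batch. It is a minimal illustration under simplifying assumptions (small stand-in linear networks and an untruncated normal likelihood); the actual VAE-SSL uses the CNNs of Fig. 3/Table I, a truncated normal likelihood, and Pyro's stochastic variational inference.

```python
# Minimal sketch of the semi-supervised objective, Eq. (12); not the authors' code.
import torch
import torch.nn.functional as F

D, L, Z = 3937, 19, 50                                  # input size, DOA classes, latent dim
classifier = torch.nn.Linear(D, L)                      # parameterizes q_phi(y|x)
encoder = torch.nn.Linear(D + L, 2 * Z)                 # q_phi(z|x,y): mean and log-variance
decoder = torch.nn.Linear(Z + L, 2 * D)                 # p_theta(x|y,z): mean and log-variance

def labeled_loss(x, y):
    """L(x, y), Eq. (9): negative ELBO for labeled sequences (constants dropped)."""
    mu, logvar = encoder(torch.cat([x, y], -1)).chunk(2, -1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
    x_mu, x_logvar = decoder(torch.cat([z, y], -1)).chunk(2, -1)
    log_px = -0.5 * (((x - x_mu) ** 2) / x_logvar.exp() + x_logvar).sum(-1)
    kl_z = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)    # KL(q(z|x,y) || N(0,I))
    log_py = torch.log(torch.tensor(1.0 / L))                     # uniform prior over DOAs
    return -(log_px + log_py) + kl_z

def unlabeled_loss(x):
    """U(x), Eq. (11): marginalize L(x, y) over y, minus the classifier entropy."""
    q_y = F.softmax(classifier(x), -1)
    losses = torch.stack(
        [labeled_loss(x, F.one_hot(torch.full((x.shape[0],), c, dtype=torch.long), L).float())
         for c in range(L)], -1)
    entropy = -(q_y * q_y.clamp_min(1e-12).log()).sum(-1)
    return (q_y * losses).sum(-1) - entropy

def total_loss(x_l, y_l, x_u, alpha=1000.0):
    """Eq. (12): labeled and unlabeled terms plus the auxiliary classification term."""
    aux = F.cross_entropy(classifier(x_l), y_l.argmax(-1), reduction="none")
    return (labeled_loss(x_l, y_l) + alpha * aux).sum() + unlabeled_loss(x_u).sum()
```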

Fig. 2: Classroom and measurement configuration at Technical University of Denmark where room IRs were collected.

II-D Distributions and neural network (NN) parameters

The VAE-SSL model consists of three NNs which, for training and conditional generation, parameterize the probability distributions in (10), (11), and (12). See Fig. 3 and Table I for the NN configurations. The networks are: 1) the label inference (classifier) network, whose output is used to parameterize q_φ(y|x); 2) the latent inference network (encoder), whose outputs parameterize q_φ(z|x, y); and 3) the decoder (generative) network, whose outputs parameterize p_θ(x|y, z).

The densities parameterized by the NNs are, for label inference, q_φ(y|x) = Cat(y|π_φ(x)), with Cat the categorical (multinomial) distribution, and, for latent inference, the normal distribution q_φ(z|x, y) = N(z|μ_φ(x, y), σ²_φ(x, y)). Since the wrapped RTF phase is on the interval [−π, π], we use as the likelihood function (see (4)) the truncated normal distribution. This is parameterized by the decoder outputs μ_θ(y, z) and σ²_θ(y, z).

Since the decoder parameterizes a truncated normal likelihood, a hyperbolic tangent activation is applied to the outputs of the decoder mean to constrain the mean, and a scaled sigmoid activation is applied to the decoder variance. The input to VAE-SSL is normalized by π, thus the mean corresponds to the interval (−π, π). The variance is limited to improve the training speed, since the truncated normal is implemented using rejection sampling.

The marginal densities (see (11) for terms) are defined as p(y) = Cat(y|π), with π the prior probabilities of the classes, which are assumed equal and normalized such that they sum to one; and p(z) = N(z|0, I).
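A small sketch of how the decoder output constraints described above could be implemented; the variance cap used here is an assumed value, not taken from the paper.

```python
# Sketch (assumed cap value): constrain the decoder mean with tanh and limit the variance
# with a scaled sigmoid, matching a truncated-normal likelihood on [-1, 1] (inputs are
# normalized by pi, so the phase interval (-pi, pi) maps to (-1, 1)).
import torch

def decoder_output(mean_logits: torch.Tensor, var_logits: torch.Tensor, var_max: float = 0.1):
    mu = torch.tanh(mean_logits)                 # mean constrained to (-1, 1)
    var = var_max * torch.sigmoid(var_logits)    # limited variance speeds rejection sampling
    return mu, var
```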

II-E Label estimation and conditional RTF generation

From the trained inference model, the DOA is estimated by the indicator function

ŷ_l = 1[l = l̂],    (13)

with

l̂ = argmax_l q_φ(y_l | x),    (14)

and l and l̂ the discrete DOA indices. The DOA angle represented by the one-hot encoding with active index l is θ_l, and its estimate is θ_l̂.

RTFs can be conditionally generated from the trained generative model for a given label y using the prior p(z) and likelihood p_θ(x|y, z):

x_new ∼ p_θ(x|y, z),  z ∼ p(z),    (15)

with p(z) = N(z|0, I) (as before), and p_θ(x|y, z) the truncated normal likelihood.
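As an illustration of (13)–(15), the sketch below estimates the DOA from the classifier output and conditionally generates RTF-phase sequences from the decoder for a fixed label. The tiny stand-in networks, the DOA grid, and the function names are assumptions for the example, not the authors' code.

```python
# Minimal sketch of DOA estimation, Eqs. (13)-(14), and conditional generation, Eq. (15).
import torch
import torch.nn.functional as F

D, L, Z = 3937, 19, 50
classifier = torch.nn.Linear(D, L)                    # parameterizes q_phi(y|x)
decoder = torch.nn.Linear(Z + L, D)                   # returns the mean of p_theta(x|y,z)
doa_grid = torch.arange(-90.0, 91.0, 10.0)            # assumed 19-point DOA grid (degrees)

def estimate_doa(x):
    """Eqs. (13)-(14): pick the DOA with the highest classifier probability."""
    q_y = F.softmax(classifier(x), dim=-1)
    return doa_grid[q_y.argmax(dim=-1)]

def generate_rtf_phase(doa_index, n_samples=100):
    """Eq. (15): sample z ~ N(0, I) and decode with the fixed one-hot label."""
    y = F.one_hot(torch.full((n_samples,), doa_index, dtype=torch.long), L).float()
    z = torch.randn(n_samples, Z)                         # prior p(z)
    x_mean = torch.tanh(decoder(torch.cat([z, y], -1)))   # decoder mean, in (-1, 1)
    return torch.pi * x_mean                              # undo the normalization by pi

phases = generate_rtf_phase(doa_index=9)                  # e.g. the broadside class
```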

Fig. 3: Neural network configurations. (a) Label inference network (classifier), (b) Encoder, (c) Decoder.

III Experiments

We assess the DOA estimation performance of the VAE-SSL approach in moderately reverberant environments against two alternative techniques: SRP-PHAT [38] and a supervised CNN baseline. The performance of the methods, summarized in Tables II and III, is quantified in terms of mean absolute error (MAE) and sequence-level accuracy (Acc.). We further assess the generative modeling capabilities of the trained VAE-SSL system, as shown in Figs. 4–11.

In the following, we define the MAE as

MAE = (1/N) Σ_{n=1}^{N} |θ_n − θ̂_n|,    (16)

with |·| denoting the absolute value. The sequence-level accuracy (Acc.) is defined by

Acc. = (100%/N) Σ_{n=1}^{N} 1[θ_n = θ̂_n].    (17)
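A short sketch (our formulation, with hypothetical inputs) of the two metrics:

```python
# MAE, Eq. (16), over the estimated DOA angles, and sequence-level accuracy, Eq. (17).
import numpy as np

def mae_deg(theta_true, theta_est):
    return np.mean(np.abs(np.asarray(theta_true) - np.asarray(theta_est)))

def sequence_accuracy(theta_true, theta_est):
    return 100.0 * np.mean(np.asarray(theta_true) == np.asarray(theta_est))

print(mae_deg([0, 10, 20, 30, 40], [0, 10, 30, 30, 40]))             # 2.0 degrees
print(sequence_accuracy([0, 10, 20, 30, 40], [0, 10, 30, 30, 40]))   # 80.0 percent
```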

We consider training and validation of the VAE-SSL approach using speech. This is to obtain real-world application performance. We use measured IRs experimentally obtained at the Technical University of Denmark (DTU). We also compare performance with the Multichannel Impulse Response (MIR) Database[33].

The VAE-SSL system and the fully supervised CNN were implemented using PyTorch [39], with the Pyro package [37] used for stochastic variational inference and optimization of the VAE-SSL. The NNs were optimized using Adam.

III-A Measured Impulse Responses

We use measured IRs from two different datasets: a dataset recently recorded at DTU and the MIR database[33]. We here briefly describe the experimental configuration for the DTU dataset. The data were recorded in a classroom at DTU in June 2020 (Fig. 2). The classroom was approximately rectangular and fully furnished. One of the sidewalls is irregular, with a  cm extrusion (for heating and ventilation) across the entire wall, in addition to support columns. All walls have scattering elements mounted on them (whiteboards, blackboards, diffusers and windowpanes), and the sound reflections are not specular. The nominal source-array range was 1.5 m. IRs were obtained from 19 DOAs ( resolution []). The reverberation time in the classroom was estimated to be . There were two microphones, with 8.5 cm spacing. The sampling rate was 48 kHz. The IRs were truncated to 0.5 s and downsampled to 16 kHz for this study.

Name Type Input shape Output shape Kernel Activation
Reshape1 Reshape [3937] [127,31,1]
Conv1 Convolution [127,31,1] [63,15,32] [3,3] ReLU
Conv2 Convolution [63,15,32] [31,7,64] [3,3] ReLU
Flatten Reshape [31,7,64] [13888]
FC1x Fully connected [13888 ] [200] ReLU
FC1y Fully connected [T] [200] ReLU
FC2 Fully connected [200] [T] Softmax
FC3 Fully connected [200] [50] None
FC4 Fully connected [50] [200] ReLU
FC5 Fully connected [200] [13888] ReLU
Reshape2 Reshape [13888] [31,7,64]
Conv1T Transpose convolution [31,7,64] [63,15,32] [3,3] ReLU
Conv2Tmean Transpose convolution [63,15,32] [127,31,1] [3,3] Tanh
Conv2Tvar Transpose convolution [63,15,32] [127,31,1] [3,3] Sigmoid
Reshape2 Reshape [127,31,1] [3937]
TABLE I: Network parameters corresponding to configuration in Fig. 3.

In addition to the nominal source grid for the DTU dataset, several off-range IRs (3 cases) and off-grid IRs (6 cases) were obtained to test the generalization of learning-based localization methods. The off-grid source DOAs were with  m range. The off-range source DOAs (ranges) were ( m), ( m), ( m), ( m), ( m), and ( m).

The MIR database was recorded in a  m room with reverberation time controlled by acoustic panels [33]. The dataset consists of source DOAs on a resolution [], giving 13 DOAs. Each of the DOAs was obtained at two ranges (1 and 2 m) with three reverberation times ( ms). The IRs were obtained for 8 microphones located in the center of the room. There were several configurations, with different microphone spacing. We used data with 3 and 8 cm spacing. The sampling rate for the MIR database was also 48 kHz. For this study we use the two center microphones, with  cm spacing. Further, from MIR we only use the  m source range, with reverberation times  ms. The MIR IRs were processed in the same manner as the DTU IRs. For more details of the MIR database, see [33].

III-B RTF calculations

The signal at the microphones is given in (1). We obtain the RTFs from the data by (3). The RTFs are estimated using single STFT frames with Hamming windowing and 50% overlap. The VAE and the supervised CNN inputs use sequences of T = 31 RTF vectors with K = 127 frequencies, giving an input size of 3937 (neglecting the highest frequencies, to support strided transpose convolution without padding; see Table I). For a fair comparison, SRP-PHAT used the same T STFT frames to estimate the RTFs. Thus, the temporal length of the sequences was the same for all methods.

III-C Speech data processing

We used speech data from the LibriSpeech corpus [32] development set, which contained 5.4 hours of recorded speech at 16 kHz from public domain audiobooks. The dataset contains equal numbers of male and female speakers (20 each).

The speech segments from LibriSpeech were convolved with the recorded room IRs to obtain reverberant speech. Voice activity detection (VAD) was performed on the dry speech before convolution with the IRs. We used the WebRTC project VAD system [40], a popular open-source VAD system based on pretrained Gaussian mixture models. The VAD setting used was mode 3 (most aggressive) with a 10 ms analysis window.
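A minimal sketch of this preparation step, under assumed frame and noise handling; it uses the open-source webrtcvad package and a direct convolution, and is an illustration rather than the authors' processing code.

```python
# Sketch: WebRTC VAD (mode 3, 10 ms frames) on dry 16 kHz speech, then convolution with
# measured room IRs and addition of 20 dB SNR sensor noise, per Eq. (1).
import numpy as np
import webrtcvad
from scipy.signal import fftconvolve

def keep_voiced(speech_int16, fs=16000, frame_ms=10, mode=3):
    """Return only the 10 ms frames that the WebRTC VAD marks as speech."""
    vad = webrtcvad.Vad(mode)
    frame_len = fs * frame_ms // 1000
    voiced = [speech_int16[i:i + frame_len]
              for i in range(0, len(speech_int16) - frame_len + 1, frame_len)
              if vad.is_speech(speech_int16[i:i + frame_len].tobytes(), fs)]
    return np.concatenate(voiced) if voiced else np.empty(0, dtype=np.int16)

def reverberant_pair(dry, ir_mic1, ir_mic2, snr_db=20.0, seed=0):
    """Convolve dry speech with the two measured IRs and add sensor noise."""
    x1, x2 = fftconvolve(dry, ir_mic1), fftconvolve(dry, ir_mic2)
    noise_std = np.sqrt(np.mean(x1 ** 2) / (10 ** (snr_db / 10)))
    rng = np.random.default_rng(seed)
    return (x1 + noise_std * rng.standard_normal(len(x1)),
            x2 + noise_std * rng.standard_normal(len(x2)))
```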

VAE-SSL CNN
(# labels) MAE Acc. MAE Acc.
114 6.34 56.2 31.5 40.3
247 3.38 75.5 25.4 49.9
494 1.85 84.9 20.4 60.1
988 1.36 88.4 17.8 63.4
1995 1.29 89.3 15.0 70.1
3990 1.03 91.6 12.5 74.6
7999 0.73 94.0 10.0 81.2
SRP-PHAT 5.84 58.4
(a) DTU training data
VAE-SSL CNN
(# labels) MAE Acc. MAE Acc.
114 7.81 52.0 33.1 37.8
247 5.05 68.7 27.6 46.5
494 3.56 76.0 23.2 55.6
988 2.99 79.5 20.3 59.1
1995 3.01 79.9 17.8 64.8
3990 2.92 81.4 16.2 68.0
7999 3.00 80.5 14.3 73.7
SRP-PHAT 6.57 57.0
(b) DTU validation data
VAE-SSL CNN
(# labels) MAE Acc. MAE Acc.
117 9.65 82.0 17.0 51.1
247 1.89 92.4 9.32 67.8
494 1.50 92.8 6.11 77.1
988 1.41 93.4 4.69 82.3
1989 1.37 93.8 4.04 84.2
3991 1.38 93.9 3.01 88.5
7995 1.12 95.1 2.09 91.9
SRP-PHAT 6.40 72.1
(c) MIR training data
VAE-SSL CNN
(# labels) MAE Acc. MAE Acc.
117 12.9 69.3 16.3 47.6
247 3.93 81.4 9.99 60.9
494 3.46 82.1 7.68 67.9
988 3.25 83.3 6.36 72.9
1989 3.32 83.5 5.71 74.4
3991 3.61 82.2 5.31 76.2
7995 3.11 84.4 4.96 77.1
SRP-PHAT 5.18 75.0
(d) MIR validation data
TABLE II: Localization performance of VAE-SSL, fully supervised CNN and SRP-PHAT on (a, b) the DTU dataset and (c, d) the MIR database. Training and validation DOA estimation performance is given for the unlabeled data D_u. Performance is quantified in terms of mean absolute error (MAE) and sequence-level accuracy (Acc.).

VAD was applied to the entire LibriSpeech development corpus, and a total of 40 speech segments 2–3 s in duration were randomly selected for training and validation, 20 segments each. This yielded RTF-phase sequences for the nominal DTU IRs and for the MIR IRs, as well as sequences for the DTU off-grid and off-range measurements using the validation speech. Given one active-speaker recording segment, with 50% STFT overlap, the number of RTF-phase sequences in the segment is

(18)

For training and validation, labeled sequences were drawn uniformly from the concatenated reverberant speech sequences (for each DOA). The remaining sequences from the training set were used for unsupervised learning. Sensor noise with 20 dB SNR was added to the microphone signals (see (1)). We consider a range of values for the number of labeled sequences N_l, to evaluate its effect on the performance of the learning-based approaches.

III-D Learning-based model parameters

The VAE-SSL networks (classifier, inference, and generative networks) were implemented using strided CNNs, with a stride of 2. The network architectures are given in Fig. 3, and the corresponding parameters are given in Table I. Several subnetworks reuse the same architecture, hence the repeated names. However, the weights for each implementation are trained independently, i.e. there is no weight sharing. The latent variable dimension for all experiments was 50 (see FC3 in Table I), assuming that the hidden representation must account for temporal variation of the RTFs, dynamics of the speech signal, and speaker-to-speaker variation. Dropout with probability 0.5 is applied to the large fully connected layers FC1x and FC5 (Fig. 3, Table I).
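As an illustration of how the classifier branch in Fig. 3(a)/Table I could be realized, the sketch below reproduces the listed layer shapes with stride-2, 3×3 convolutions; it is our reading of the table (using PyTorch's channels-first layout), not the released implementation.

```python
# Sketch of the label inference (classifier) CNN of Fig. 3(a) / Table I.
import torch
import torch.nn as nn

class ClassifierCNN(nn.Module):
    def __init__(self, num_doas: int = 19):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2)   # [1,127,31] -> [32,63,15]
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2)  # [32,63,15] -> [64,31,7]
        self.fc1 = nn.Linear(64 * 31 * 7, 200)                   # "FC1x": 13888 -> 200
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(200, num_doas)                      # "FC2"
        self.relu = nn.ReLU()

    def forward(self, x):                          # x: [batch, 3937] RTF-phase sequence
        x = x.view(-1, 1, 127, 31)                 # "Reshape1"
        x = self.relu(self.conv2(self.relu(self.conv1(x))))
        x = self.drop(self.relu(self.fc1(x.flatten(1))))
        return torch.softmax(self.fc2(x), dim=-1)  # q_phi(y|x)

probs = ClassifierCNN()(torch.randn(4, 3937))      # example forward pass, shape [4, 19]
```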

For comparison, we used as a baseline the ‘supervised-only’ CNN, i.e. the classifier network from the VAE-SSL model, trained without unsupervised learning and without integration with the generative model.

For all cases, the learning rate was 5e-5 and the batch size was 256. The default momentum and decay values (0.9, 0.999) of the PyTorch implementation of Adam were used. The VAE-SSL has one tuning parameter, the auxiliary multiplier α, which weights the supervised classification term (see (12)). It was found in the experiments that the performance of VAE-SSL was not very sensitive to α, though the best value per validation accuracy was found by grid search over the interval [500, 10000].

III-E Non-learning: SRP-PHAT configuration

The STFT frames used to calculate the RTF-phase sequences for VAE-SSL and CNN were input to the SRP-PHAT approach. There were T = 31 STFT frames in the VAE-SSL and CNN input sequences. SRP-PHAT used 13 candidate DOAs for the MIR IRs and 19 for the DTU IRs. We used the SRP-PHAT implementation from the Pyroomacoustics toolbox [41].
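A sketch of how this baseline could be run with the Pyroomacoustics toolbox [41]; the STFT length, array coordinates, and DOA grid below are assumptions for the example, and argument names may differ between package versions.

```python
# Sketch (assumed parameters): SRP-PHAT DOA estimation with pyroomacoustics.
import numpy as np
import pyroomacoustics as pra

fs, nfft = 16000, 256                               # assumed STFT length
mics = np.array([[-0.0425, 0.0425], [0.0, 0.0]])    # two mics 8.5 cm apart (x; y coordinates)
grid = np.deg2rad(np.linspace(-90, 90, 19))         # assumed 19 candidate DOAs

def srp_phat_doa(x1, x2):
    """Estimate the source azimuth (degrees) from the two microphone signals."""
    X = np.array([pra.transform.stft.analysis(x, nfft, nfft // 2).T for x in (x1, x2)])
    doa = pra.doa.SRP(mics, fs, nfft, c=343.0, num_src=1, azimuth=grid)
    doa.locate_sources(X)
    return np.rad2deg(doa.azimuth_recon[0])
```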

Fig. 4: Reconstruction of RTF-phase sequences using VAE-SSL (trained using labels). 10 sequences are reconstructed for each DOA using the DTU IR data convolved with speech. (a) Input RTF-phase sequences. (b) Reconstructed RTF-phase sequences. (c) RTF-phase sequence mean. (d) Free-space RTF per (21). The phase-wrap function is plotted with white dashed lines. The color scale is RTF phase in [−π, π] radians. Theory and other parameters are given in Sec. II-D–E and Sec. III-D.

III-F Training and localization performance

For VAE-SSL and supervised CNN training, labeled sequences were drawn from the training set and from the validation set. The model was chosen based on the validation accuracy. For VAE-SSL, the additional unlabeled sequences were used to train the networks with unsupervised learning. During each training epoch, unlabeled data batches were interleaved with the labeled batches at a fixed ratio. Only the supervised samples were used to train the supervised CNN. Since the DOAs are technically unavailable for the unlabeled examples, there are unsupervised RTF-phase sequences which contain frames with different DOA labels. For the supervised examples, only input sequences where each frame has the same DOA are used. All the RTF-phase sequences were normalized by π for the VAE-SSL and CNN approaches.

Method Off-grid Off-range
SRP-PHAT 6.80 5.17
CNN 8.96 47.7
VAE-SSL 6.15 2.72
RTF-1NN 7.33 38.0
TABLE III: Localization error (MAE) of VAE-SSL, fully supervised CNN and SRP-PHAT on off-grid and off-range measurements from DTU dataset.

The performance of VAE-SSL and the competing approaches for the DTU IR dataset is given in Table II(a, b). VAE-SSL and the fully supervised CNN were trained using 20 speech signals and validated using an additional 20 speech signals (see Sec. III-C). We use numbers of supervised sequences which are multiples of the number of candidate DOAs (19), to ensure an equal number of labeled samples for each DOA. It is observed that the VAE-SSL approach generalizes to the validation data, with better performance than SRP-PHAT with as few as 247 labeled sequences (13 per DOA). For both the validation and training data, the performance increases significantly with additional labels. The VAE-SSL outperforms the CNN for all the experiments in this paper. It is apparent that, in using the minimally processed RTF sequences, our semi-supervised learning approach can learn to identify the relevant features from the data.

Similar trends are observed for the MIR database IRs, given in Table II(c, d). Here VAE-SSL and the fully supervised CNN were trained and validated using the same 20 speech signals with two different reverberation times, one used for training and the other for validation. We use numbers of supervised sequences which are multiples of the number of candidate DOAs (13). It is observed that VAE-SSL significantly outperforms the fully supervised CNN and SRP-PHAT methods for both the training and validation cases when the number of supervised examples is sufficient. A similar number of labels to the DTU dataset was required to outperform SRP-PHAT: 247 labeled sequences, or in this case 19 labels per DOA. It is shown that the VAE-SSL generalizes well to different reverberation times.

Fig. 5: Conditionally generated RTF-phase sequences (100 sequences, 31 frames per sequence, 100*31 = 3100 frames) using the VAE-SSL generative model trained with DTU IR data convolved with speech, for a fixed DOA (trained using labels). (a) Conditionally generated RTF sequences. (b) Generated RTF-phase sequence mean. (c) Free-space RTF per (21), same as Fig. 4.
Fig. 6: Conditionally generated RTF-phase sequences for a second DOA (trained using labels). Same configuration as Fig. 5.
Fig. 7: Inverse fast Fourier transforms (IFFT) of RTF (from RTF phase), see Sec. III-H1. (a) Simulated anechoic time delay based on DTU dataset source-receiver parameters. (b) Time delays from reconstructed RTF-phase mean from reverberant DTU dataset (see Fig. 4(b) for RTFs), calculated from one RTF realization per DOA. Hypothesized time delay shown as black dashed line.
Fig. 8: IFFTs of VAE-SSL conditionally generated RTFs (time delay), see Sec. III-H1. Time delays from conditionally generated RTF-phase means from the reverberant DTU dataset for (a) the DOA of Fig. 5 (see Fig. 5(b) for RTFs) and (b) the DOA of Fig. 6 (see Fig. 6(b) for RTFs). Calculated for 10 RTF realizations (sampling random z). The hypothesized time delay is shown as a black dashed line.

III-G Off-grid and off-range generalization

We previously quantified the performance of the localization methods for different speakers and reverberation times. We now assess the off-grid and off-range localization performance of the methods in a real-room environment with the DTU dataset. This important test examines whether the representations learned by the ML approaches, VAE-SSL and the fully supervised CNN, can generalize to source locations that were not present in the training data. In real applications, it can be expected that the source locations will not precisely correspond to the training locations. The off-grid DOAs are offset from the training DOAs, and the off-range sources lie at ranges other than the nominal 1.5 m; see Sec. III-A.

For the learning-based approaches, results are given for systems trained with labels. For an additional comparison, we consider a basic one-nearest-neighbors approach, which uses spectrally averaged RTF features. This manual preprocessing should reduce the noise in the RTF features and give a reasonable baseline for a hand-engineered, but still data-driven, approach.

For the nearest-neighbors approach, which we deem RTF-1NN, we perform spectral averaging over the full RTFs [14] using the STFT frames used to estimate the instantaneous RTF sequence (Sec. II-A). The spectrally averaged RTFs are given by

h̃_k = Σ_{t ∈ F} x_{2,k,t} x*_{1,k,t} / Σ_{t ∈ F} x_{1,k,t} x*_{1,k,t},    (19)

with (·)* the complex conjugate and F the set of STFT frames used. The averaged RTF from the labeled data corresponding to each DOA is h̃^(l), with l the DOA index. The averaged RTF from each sequence in the unlabeled data, which is to be classified, is h̃. The DOA for h̃ is estimated by

l̂ = argmin_l ||h̃ − h̃^(l)||,    (20)

with ||·|| the norm. We note that this formulation is related to matched-filter beamforming as suggested by [42].
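A short sketch of the RTF-1NN baseline as we read (19)–(20); the function names and inputs are hypothetical.

```python
# Sketch: spectrally averaged RTFs, Eq. (19), and nearest-neighbor DOA assignment, Eq. (20).
import numpy as np

def averaged_rtf(X1, X2):
    """Spectral averaging over STFT frames; X1, X2 are complex arrays [freqs, frames]."""
    return (X2 * np.conj(X1)).sum(axis=1) / (X1 * np.conj(X1)).sum(axis=1)

def rtf_1nn(rtf, labeled_rtfs):
    """Return the DOA index of the closest labeled averaged RTF."""
    return int(np.argmin([np.linalg.norm(rtf - ref) for ref in labeled_rtfs]))
```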

The results from the different methods are shown in Table III. We find that the VAE-SSL approach generalizes well to off-grid and off-range sources. It is shown that the trained VAE-SSL outperforms the other approaches. Again, this performance is achieved using minimal preprocessing of the input features. Particularly, for the off-range scenario, the CNN and RTF-1NN generalize quite poorly.

III-H RTF generation with VAE-SSL

The trained generative model from VAE-SSL can be used to generate new RTF sequences. We demonstrate RTF-phase generation and discuss its physical interpretation. Generated RTF-phase sequences are shown in Figs. 4–6. For display, the generated RTF-phase sequence vectors are reshaped to K × T (127 × 31). We use the phase-wrap of the RTF (at ±π rad), which is a function of the sensor separation d and DOA θ, to help qualify the physics learned by VAE-SSL, with d the microphone spacing. This is plotted along with the RTF-phase frames from the design-case room configuration. We also compare the generated RTF-phases with the corresponding free-space RTF, which is calculated by

h_k = e^{−j2π f_k d sin θ / c},    (21)

with f_k the frequency of bin k and c the speed of sound.

We use the VAE-SSL model trained on the DTU IRs to reconstruct and conditionally generate RTFs. In Fig. 4, 10 RTF sequences for each DOA (19 DOAs, giving 190 sequences) are reconstructed. It is observed that the RTF phase is well-reconstructed, and that the reconstructed RTF phase (Fig. 4(b)) and phase mean (Fig. 4(c)) conform to the phase-wrap function. We further show the free-space RTF phase (per (21)) in Fig. 4(d). The generated mean RTF phase correlates well with the free-space RTF phase.

Using the trained generative model from VAE-SSL with DTU IRs and speech, we conditionally sample (see Sec. II-E) for a fixed label y and randomly drawn z. In Fig. 5, we use y corresponding to one DOA and sample 100 RTF-phase sequences. Similarly, in Fig. 6 we use y corresponding to a second DOA. It is observed that the generated phase and its corresponding mean in each case are well-correlated with the predicted RTF phase-wrap function. The conditionally generated RTFs also correlate well with their corresponding free-space RTFs. The output sampled from the generative model (see (15)) is multiplied by π, since normalization was applied during training.

III-H1 RTF time delay

We further evaluate the generated RTF-phase obtained by VAE-SSL by considering the time delay corresponding to the generated RTFs.

The free-space propagation delays for the two sensors are

τ_1 = r / c,    (22)
τ_2 = (r + d sin θ) / c,    (23)

with d the sensor spacing and r the source range. We assume sensor 1 the reference (with displacement 0) and sensor 2 at the negative displacement −d. Thus the free-space RTF per (3) is

h_k = a_{2,k} / a_{1,k} = e^{−j2π f_k (τ_2 − τ_1)}    (24)
    = e^{−j2π f_k τ},    (25)

with τ = d sin θ / c the time delay.

Given the derivation of the RTF time delay in free space, synthetic free-space RTFs are generated per (25) for the same source-microphone geometry as the DTU dataset (Sec. III-A). The inverse fast Fourier transforms (IFFTs) of the RTFs are obtained, Fig. 7(a). These time delays are compared to those from the IFFT of the VAE-generated RTF mean values from the reverberant DTU dataset (Sec. III-A). In the previous section, the DTU RTF phase was reconstructed based on input to the VAE; the results are shown in Fig. 4. One generated RTF from each of the 19 DOAs in the dataset (corresponding to the phases in Fig. 4) was used to obtain time delays, and these delays are plotted in Fig. 7(b). The maximum value of the IFFT is indicated in each plot.
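The following sketch illustrates this check under assumed STFT parameters: it builds a free-space RTF per (25), takes its IFFT, and reads the delay off the peak. Generated reverberant RTF phases can be analyzed the same way by substituting exp(1j·phase) for the synthetic RTF; this is our construction, not the authors' code.

```python
# Sketch: time delay from the IFFT peak of an RTF (free-space case per Eq. (25)).
import numpy as np

fs, nfft = 16000, 256          # assumed sampling rate and STFT length (127 bins kept)
c, d = 343.0, 0.085            # speed of sound (m/s) and DTU microphone spacing (m)
df = fs / nfft                 # frequency-bin spacing (Hz)

def freespace_rtf(doa_deg, n_freqs=127):
    """Free-space RTF, Eq. (25), for a far-field source at the given DOA."""
    freqs = np.arange(n_freqs) * df
    tau = d * np.sin(np.deg2rad(doa_deg)) / c
    return np.exp(-2j * np.pi * freqs * tau)

def delay_from_rtf(rtf, n_ifft=4096):
    """Delay (s) at the IFFT peak; zero-padding refines the delay resolution."""
    h = np.abs(np.fft.ifft(rtf, n=n_ifft))
    peak = int(np.argmax(h))
    if peak > n_ifft // 2:     # indices in the upper half correspond to negative delays
        peak -= n_ifft
    return peak / (n_ifft * df)

print(delay_from_rtf(freespace_rtf(30.0)) * 1e6)   # ~124 microseconds for a 30 deg source
```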

Overall, the peak of the IFFT of the generated reverberant RTF means corresponds to the free-space RTF delay. In Fig. 7(b) the peak correlates well with the hypothesized time delay, shown as a black dashed line in both subfigures.

We also consider the time delays corresponding to the conditionally generated RTF phase (see Figs. 5 and 6) from VAE-SSL trained on the DTU dataset. The IFFTs of the RTFs corresponding to the conditionally generated phases are shown in Fig. 8. Again, the peak correlates well with the hypothesized time delay.

As a final comparison, we consider similar analyses using VAE-SSL trained using the conventional normal distribution likelihood for the generative model. Results are shown in Figs. 9–11. In Fig. 9, we reconstruct the input RTF-phase sequences, similar to Fig. 4. In Fig. 10 we compare the RTF-phase distributions generated using the normal and truncated normal distributions. In Fig. 11, we show the IFFTs of the RTFs corresponding to the RTF-phase and mean RTF-phase generated by VAE-SSL using the normal and truncated normal distributions, similar to Fig. 7. Since the wrapped RTF phase is on the interval [−π, π], a distribution truncated to the same range is more physically meaningful. Further, the large peak of the RTF phase generated with the full normal distribution (Fig. 10(b)) is related to the bias in the IFFT of the RTF corresponding to the generated phase (see Fig. 11), since zero RTF phase implies a broadside source. In all cases, the physical characteristics of the wrapped RTF-phase are better modeled and interpreted using the truncated normal likelihood.

IV Conclusions

We have proposed a semi-supervised approach to acoustic source localization in reverberant environments based on deep generative modeling with VAEs, which we deem VAE-SSL. This study shows that VAE-SSL can outperform both SRP-PHAT and the CNN in label-limited scenarios, and generalizes well to source locations that were not in the training data. We demonstrate this performance using real IRs, obtained from two different environments, and speech signals from the LibriSpeech corpus. By learning to generate RTF-phase from minimally pre-processed input data, VAE-SSL models, end-to-end, the reverberant acoustic environment and exploits the structure in the unlabeled data to improve localization performance over what can be achieved simply using the available labels. In learning to perform these tasks, the VAE-SSL explicitly learns to separate the physical causes of the RTF-phase (i.e. source location) from signal characteristics such as noise and speech activity which are not salient to the task. We further showed that the trained VAE-SSL system can be used to generate new RTF-phase samples. We interpreted the generated RTF phase and verified that the VAE-SSL approach learns the physics of the acoustic environment well. The generative modeling used in VAE-SSL provides interpretable features.

We thus observe that deep generative modeling can improve ML model interpretability in the context of acoustics. Such models can operate on only lightly processed input features and can automatically obtain the appropriate task-specific representation in an end-to-end fashion. In our future work, we will extend this approach to multi-source and moving-source scenarios. We will further consider the effect of acoustic feature processing on the VAE-SSL generalization and sample complexity.

Fig. 9: Reconstruction of RTF-phase sequences using VAE-SSL (trained using labels) with the normal likelihood. Same arrangement as Fig. 4.
Fig. 10: Histograms of RTF-phase at DOA = for 100 RTF frames from (a) input [Fig. 11(a)], (b) generated with normal likelihood [Fig. 11(b)], (c) generated with truncated normal likelihood (Fig. 5(b), Sec. II D–E, Sec. III D). The truncated normal likelihood (c) better approximates the input RTF phase statistics (a).
Fig. 11: For comparison, we here show the IFFTs of the input RTFs (a) from the DTU dataset against the generated RTFs based on the full (b, c) and truncated (d, e) normal likelihoods for the VAE-SSL generative model. See Fig. 4 for the input and generated RTF-phase using the truncated normal likelihood, and Fig. 9 for those generated with the normal likelihood. IFFT of the RTF from the mean phase and from the full generated RTF phase for the normal likelihood (b and c) and the truncated normal likelihood (d and e). Calculated for one RTF realization per DOA (19 DOAs). The results using the full normal likelihood are biased, and those from the truncated normal are not. The hypothesized time delay is shown as a black dashed line.

Appendix A Detailed derivation of the unsupervised objective

We here give additional details of the derivation of the objective for unlabeled data (11). The steps also clarify the derivation of the supervised objective, which follows a very similar development. Starting with Bayes' rule, we have for unlabeled data

p_θ(z, y|x) = p_θ(x|y, z) p(y) p(z) / p_θ(x).    (26)

Direct estimation of the posterior is nearly always intractable due to the evidence p_θ(x). The subscript θ indicates distributions and functions defined using the decoder network, i.e. the generative model in the VAE.

VAEs [24, 29] approximate posterior distributions using variational inference (VI) [36], a family of methods for approximating conditional densities which relies on optimization instead of Markov chain Monte Carlo (MCMC) sampling. In VAEs the conditional densities are modeled with NNs. A variational approximation to the intractable posterior is defined by the encoder networks (two encoder networks, corresponding to q_φ(y|x) and q_φ(z|x, y)) as q_φ(z, y|x) = q_φ(z|x, y) q_φ(y|x), with the subscript φ indicating distributions and functions defined using the encoder networks. The networks constituting the VAE-SSL model are shown in Fig. 1.

Per VI we seek an approximate density q_φ(z, y|x) which minimizes the KL-divergence

KL(q_φ(z, y|x) || p_θ(z, y|x)).    (27)

The KL divergence for two arbitrary distributions q(a) and p(a|b) is (and can be factored as)

KL(q(a) || p(a|b)) = E_q[log q(a)] − E_q[log p(a|b)],    (28)

with E_q the expectation with respect to q(a).

With the definitions in (26) and (28), the KL-divergence (27) is assessed as

KL(q_φ(z, y|x) || p_θ(z, y|x)) = log p_θ(x) − (E_q[log p_θ(x|y, z) p(y) p(z)] − E_q[log q_φ(z, y|x)]),    (29)

with the expectation relative to q_φ(z, y|x). This reveals the dependence of the KL divergence on the evidence log p_θ(x), which is intractable. The other two terms in (29) form the evidence lower bound (ELBO). Since the KL divergence is non-negative, the ELBO ‘lower bounds’ the evidence: −U(x) ≤ log p_θ(x). Maximizing the ELBO is equivalent to minimizing the KL-divergence. We thus minimize U(x).

Considering the ELBO terms from (29), we find the objective (negative ELBO) for the unlabeled data as

U(x) = −E_{q_φ(z,y|x)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(z, y|x)].    (30)

We can factorize the expectation in (30) using q_φ(z, y|x) = q_φ(z|x, y) q_φ(y|x). Since y has discrete states, we have for an arbitrary density g(x, y, z)

E_{q_φ(z,y|x)}[g(x, y, z)] = Σ_y q_φ(y|x) E_{q_φ(z|x,y)}[g(x, y, z)].    (31)

Using (31), we expand (30) as

U(x) = Σ_y q_φ(y|x) L(x, y) − H(q_φ(y|x)).    (32)

Thus it follows from the derivation with Bayes' rule that in the unlabeled case we marginalize over the states of y. This requires evaluating the terms in (32) over all states of y for the unlabeled data x ∈ D_u.

References

  • [1] H. Purwins, B. Li, T. Virtanen, J. Schluter, S.-Y. Chang, and T. Sainath, “Deep learning for audio signal processing,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 2, pp. 206–219, May 2019.
  • [2] M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, “Machine learning in acoustics: Theory and applications,” J. Acoust. Soc. Am., vol. 146, no. 5, pp. 3590–3628, 2019.
  • [3] S. Gannot, M. Haardt, W. Kellermann, and P. Willett, “Introduction to the issue on acoustic source localization and tracking in dynamic real-life scenes,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 3–7, 2019.
  • [4] J. Traer and J. H. McDermott, “Statistics of natural reverberation enable perceptual separation of sound and space,” Proc. Nat. Acad. Sci., vol. 113, no. 48, pp. E7856–E7865, 2016.
  • [5] H. Nakashima and T. Mukai, “3D sound source localization system based on learning of binaural hearing,” in IEEE Int. Conf. Syst., Man, Cybern. IEEE, 2005, vol. 4, pp. 3534–3539.
  • [6] A. Deleforge and R. Horaud, “2D sound-source localization on the binaural manifold,” in IEEE Int. Workshop Mach. Learn. Signal Process. IEEE, 2012, pp. 1–6.
  • [7] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in DCASE 2017, 2017.
  • [8] E. Vincent, T. Virtanen, and S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
  • [9] S. Chakrabarty and E. A. P. Habets, “Multi-speaker doa estimation using deep convolutional networks trained with noise signals,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 8–21, Mar. 2019.
  • [10] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE J. Sel. Topics Signal Process., 2019.
  • [11] G. Ping, E. Fernandez-Grande, P. Gerstoft, and Z. Chu, “Three-dimensional source localization using sparse bayesian learning on a spherical microphone array,” J. Acoust. Soc. Am., vol. 147, no. 6, pp. 3895–3904, 2020.
  • [12] E. Ozanich, P. Gerstoft, and H. Niu, “A feedforward neural network for direction-of-arrival estimation,” J. Acoust. Soc. Am., vol. 147, no. 3, pp. 2035–2048, 2020.
  • [13] X. Zhu, H. Dong, P. Salvo Rossi, and M. Landrø, “Feature selection based on principal component analysis for underwater source localization by deep learning,” arXiv preprint arXiv:2011.12754, 2020.
  • [14] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “Semi-supervised sound source localization based on manifold regularization,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 24, no. 8, pp. 1393–1407, Aug. 2016.
  • [15] R. Opochinsky, B. Laufer-Goldshtein, S. Gannot, and G. Chechik, “Deep ranking-based sound source localization,” in IEEE Workshop Appl. Signal Process. Audio Acoustic. (WASPAA), 2019, pp. 283–287.
  • [16] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614–1626, Aug. 2001.
  • [17] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 25, no. 4, pp. 692–730, 2017.
  • [18] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” Proc. Int. Conf. Learn. Represent., 2014.
  • [19] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, vol. 1, MIT press Cambridge, 2016.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. Neural Info. Process. Sys., 2014, pp. 2672–2680.
  • [21] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” arXiv preprint arXiv:1912.04958, 2019.
  • [22] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [23] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement with generative adversarial networks for robust speech recognition,” in IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP), 2018, pp. 5024–5028.
  • [24] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. Neural Info. Process. Sys., 2014, pp. 3581–3589.
  • [25] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
  • [26] E. Peterfreund, O. Lindenbaum, F. Dietrich, T. Bertalan, M. Gavish, I. G. Kevrekidis, and R. R. Coifman, “Local conformal autoencoder for standardized data coordinates,” Proc. Nat. Acad. Sci., vol. 117, no. 49, pp. 30918–30927, 2020.
  • [27] B. Laufer-Goldshtein, R. Talmon, S. Gannot, et al., “Data-driven multi-microphone speaker localization on manifolds,” Found. Trends Signal Process., vol. 14, no. 1–2, pp. 1–161, 2020.
  • [28] N. Siddharth, B. Paige, J.-W. Van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representations with semi-supervised deep generative models,” in Adv. Neural Info. Process. Sys., 2017, pp. 5925–5935.
  • [29] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Found. and Trends in Machine Learn., vol. 12, no. 4, pp. 307–392, 2019.
  • [30] M. J. Bianco, S. Gannot, and P. Gerstoft, “Semi-supervised source localization with deep generative modeling,” in IEEE Int. Workshop Mach. Learn. Signal Process. IEEE, 2020.
  • [31] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 1997, vol. 1, pp. 375–378.
  • [32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in IEEE Int. Conf. Acoust. Speech and Signal Process. (ICASSP). IEEE, 2015, pp. 5206–5210.
  • [33] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Int. Worksh. Acoust. Signal Enhance. (IWAENC), 2014, pp. 313–317.
  • [34] S. Markovich-Golan, S. Gannot, and W. Kellermann, “Performance analysis of the covariance-whitening and the covariance-subtraction methods for estimating the relative transfer function,” in EUSIPCO, 2018, pp. 2499–2503.
  • [35] Z. Koldovskỳ, J. Málek, and S. Gannot, “Spatial source subtraction based on incomplete measurements of relative transfer function,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 23, no. 8, pp. 1335–1347, 2015.
  • [36] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” J. Amer. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.
  • [37] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman, “Pyro: Deep Universal Probabilistic Programming,” J. Mach. Learn. Res., 2018.
  • [38] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays, pp. 157–180. Springer, 2001.
  • [39] A. Paszke, S. Gross, F. Massa, A. Lerer, and J. Bradbury et al., “Pytorch: An imperative style, high-performance deep learning library,” in Adv. Neural Info. Process. Sys., 2019, pp. 8024–8035.
  • [40] Google, “WebRTC,” 2011.
  • [41] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2018, pp. 351–355.
  • [42] E.-E. Jan and J. Flanagan, “Sound capture from spatial volumes: Matched-filter processing of microphone arrays having randomly-distributed sensors,” in IEEE Conf. Acoust., Speech and Sig. Proc., 1996, vol. 2, pp. 917–920.