Source localization is an important problem in acoustics and many related fields. The performance of many source localization algorithms is degraded by reverberation which induces complex temporal arrival structure at sensor arrays. Despite recent advances, e.g. [purwins_deep_2019, bianco2019machine, gannot2019introduction], acoustic source localization in reverberant environments remains a major challenge.
In recent years, there has been great interest in machine learning (ML)-based techniques for applications in acoustics, including source localization and event detection [vincent2018audio, mesaros2017dcase]. Specifically, there has been much research in using neural network (NN) architectures for acoustic source localization, e.g. [chakrabarty2017broadband, chakrabarty2019multi, adavanne2019sound], and the associated results are often considered state-of-the-art for data-driven methods.
One challenge for developing ML-based methods in acoustics is the limited amount of labeled data and the complex acoustic propagation in natural environments [purwins_deep_2019, bianco2019machine]. This limitation has motivated recent approaches for source localization based on semi-supervised learning (SSL) [laufer2016semi, opochinsky2019deep]. In SSL, ML models are trained using many examples of which only a few are labeled, with the goal of exploiting the natural structure of the data [murphy2012].
We propose an SSL localization approach based on deep generative modeling with variational autoencoders (VAEs) [kingma2014auto]. Deep generative models [goodfellow2016deep], e.g. generative adversarial networks (GANs) [goodfellow2014generative], have received much attention for their ability to learn high-dimensional sample distributions, including those of natural images [karras2019analyzing]. Generative models in acoustics have had success in generating raw audio [oord2016wavenet] and in speech enhancement [donahue2018exploring]. An alternative to GANs, VAEs learn explicit latent codes for generating samples, and are inspiring examples of representation learning [kingma2014semi, bengio2013representation].
We use VAEs to learn the relationship between the relative transfer function (RTF) [gannot2001signal] and source location with SSL. The RTF is the ratio of the acoustic transfer functions of two sensors, which describes the acoustic propagation between the sensors independent of the source signal. The VAE-SSL approach encodes the RTF phase to a latent parametric distribution, while learning to generate the phase of the RTF based on the latent distribution, and to classify samples. In VAE-SSL the classifier is trained using both the labelled and unlabeled samples. We compare the performance of VAE-SSL in reverberant environments with that of the steered response power with phase transform (SRP-PHAT) [brandstein1997robust] and convolutional neural networks (CNNs).
We use RTFs [gannot2001signal], specifically the RTF phase, as the acoustic feature for our VAE-SSL approach. The RTF is independent of the source signal and represents the physics of the acoustic system well. In this case, the RTF phase is a function of source azimuth (direction of arrival, DOA).
2.1 Relative transfer function (RTF)
We consider time domain acoustic recordings from two microphones,
with the source signal, the impulse responses (IRs) relating the source and each of the microphones, noise signals which are independent of the source, and the time index. Define the acoustic transfer functions
as the Fourier transform of the IRs. Then, the relative transfer function (RTF) is defined as[gannot2001signal]
with the frequency index. With as reference, is
with the PSD and the CPSD. This estimator is biased since we neglect the PSD of the noise
. An unbiased estimator can be obtained, but we observe in this paper that the biased estimate works well. For more details, please see [markovich2018performance, koldovsky2015spatial]. For each frame
, a vector of RTFs is obtained for frequency bins.
The input samples to the VAE are , with , , and the number of RTF frames. We use frames with 50% overlap.
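For orientation, the signal model and RTF estimator described in this subsection can be written in a standard form. This is a sketch under assumed notation: the symbols $x_i$, $s$, $h_i$, $u_i$, $H_i$, $H$, and $\hat{S}$ are chosen here for illustration and are not taken from the source.

```latex
\begin{align}
  x_i(t) &= s(t) * h_i(t) + u_i(t), \qquad i = 1, 2
    && \text{(microphone signals; $*$ is convolution)} \\
  H_i(k) &= \mathcal{F}\{h_i(t)\}
    && \text{(acoustic transfer functions)} \\
  H(k) &= \frac{H_2(k)}{H_1(k)}
    && \text{(RTF, microphone 1 as reference)} \\
  \hat{H}(k) &= \frac{\hat{S}_{x_2 x_1}(k)}{\hat{S}_{x_1 x_1}(k)}
    && \text{(biased estimate from the CPSD and PSD)}
\end{align}
```

With $\hat{S}_{x_2 x_1}(k) = \mathrm{E}[X_2(k) X_1^*(k)]$, the noise-free ratio reduces to $H_2(k)/H_1(k)$. When noise is present, the denominator also contains the noise PSD at the reference microphone, which is the source of the bias noted above.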
2.2 Semi-supervised learning with VAEs
We assume the wrapped RTF phase
are generated by a random process involving the latent random variable and source location labels . We formulate a principled semi-supervised learning framework based on VAEs, which treats the labels as either latent or observed, and trains a classifier using both labelled and unlabeled data. This corresponds to the ‘M2’ model in [kingma2014semi]. Starting with Bayes' rule we have for labelled data
and unlabeled data
Using (4) as an example, direct estimation of the posterior is nearly always intractable due to . As will later be shown, constitutes the parameters of the decoder network, i.e. the generative model in the VAE.
VAEs [kingma2014semi, kingma2019introduction] approximate posterior distributions using variational inference (VI) [blei2017variational], a family of methods for approximating conditional densities which relies on optimization instead of Markov chain Monte Carlo (MCMC) sampling. In VAEs the conditional densities are modeled with NNs. A variational approximation to the intractable posterior is defined by the encoder network as , with the parameters of the encoder network. The networks constituting the VAE-SSL model are shown in Fig. 1.
Starting with the model for the labelled data (see (4)), per VI we seek which minimizes the KL-divergence
Considering first the labelled data, the intractable posterior is approximated by . Assessing the KL-divergence, we obtain
with the expectation relative to . This reveals the dependence of the divergence on evidence , which is intractable. The other two terms in (2.2) form the evidence lower bound (ELBO). Since the KL-divergence is non-negative, the ELBO 'lower bounds' the evidence: . Maximizing the ELBO is thus equivalent to minimizing the KL-divergence (6). For optimization, we will minimize .
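The decomposition described here is the standard VI identity. It is sketched below under assumed notation, with $\theta$ the decoder parameters, $\phi$ the encoder parameters, $x$ the RTF phase sample, $y$ the label, and $z$ the latent variable. These symbols follow the usual convention of the 'M2' model in [kingma2014semi] and are not copied from the source.

```latex
\begin{align}
  \mathrm{KL}\big(q_\phi(z|x,y)\,\|\,p_\theta(z|x,y)\big)
    &= \mathrm{E}_{q_\phi(z|x,y)}\big[\log q_\phi(z|x,y) - \log p_\theta(z|x,y)\big] \\
    &= \log p_\theta(x,y)
       - \underbrace{\mathrm{E}_{q_\phi(z|x,y)}\big[\log p_\theta(x,y,z) - \log q_\phi(z|x,y)\big]}_{\text{ELBO}} ,
\end{align}
```

so that $\log p_\theta(x,y) = \mathrm{KL} + \mathrm{ELBO} \ge \mathrm{ELBO}$, since $\mathrm{KL} \ge 0$.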
Considering now the ELBO terms from (2.2), we formulate the objective for the labelled data
Next, an objective for unlabeled data is derived. The intractable posterior from (5) is again approximated variationally. From the corresponding KL-divergence, we find the objective (negative ELBO) as
with the expectation relative to . Further expanding (9) we obtain
Assessing the terms in (8), the supervised learning objective does not condition the label on the sample . The label inference (classifier) distribution
is only present in the unsupervised learning objective (11). This issue is remedied with an additional term
It is assumed that the data is explained by the generative process (see (10) for terms): , with the categorical (multinomial) distribution; (as before); and , with the decoder. The densities of the inference model are: , with and the outputs of the z-encoder; and , with the classifier network.
Thus, in this case, we have three networks: (1) the label inference (classifier) network corresponding to , (2) the latent inference (z-encoder) network corresponding to , and (3) the decoder (generative) network corresponding to . Graphical models representing the inference and generative flows are shown in Fig. 1. We evaluate the objectives using probabilistic programming [bingham2018pyro].
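How the labelled, unlabelled, and classifier terms combine can be sketched numerically. The functions below follow the 'M2' formulation of [kingma2014semi] and take per-sample negative ELBO values as given. The function names and the weight name `alpha` are assumptions made for illustration, and nothing here reproduces the Pyro implementation used in this work.

```python
import math


def unlabeled_loss(q_y, labeled_losses):
    """Unlabelled objective U(x) = sum_y q(y|x) * (L(x, y) + log q(y|x)).

    q_y: classifier probabilities q(y|x) over the candidate labels (sums to 1).
    labeled_losses: negative ELBO L(x, y) evaluated for every candidate label.
    Terms with q(y|x) = 0 contribute nothing and are skipped.
    """
    return sum(
        q * (loss + math.log(q))
        for q, loss in zip(q_y, labeled_losses)
        if q > 0.0
    )


def classification_loss(q_y, label):
    """Additional supervised term: -log q(y|x) at the true label index."""
    return -math.log(q_y[label])


def labeled_objective(neg_elbo, q_y, label, alpha):
    """Labelled negative ELBO plus the classifier term weighted by alpha (assumed name)."""
    return neg_elbo + alpha * classification_loss(q_y, label)
```

The unlabelled term marginalizes the labelled loss over the classifier output and subtracts the classifier entropy, which is why the classifier only receives a training signal from labelled data through the additional term.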
We compare the DOA estimation performance of the VAE-SSL approach in weakly and moderately reverberant environments with two alternative techniques: SRP-PHAT [dibiase2001robust] and one NN baseline, the CNN used in [chakrabarty2017broadband]. We further analyze the generated RTF from the VAE and the estimated labels from the FNN, based on the VAE latent features, to help quantify the physics learned by the generative model (see Fig. 2). The results are summarized in Tables 1–4, giving the performance of each method in terms of DOA error (RMSE) and frame-level accuracy.
The reverberant room data were generated using the Room Impulse Response (RIR) generator [habets2016]. We use two microphones with a nominal spacing of m and simulated sources with resolution for azimuth relative to the array broadside (37 candidate DOAs). We simulate only one active source for each time bin. We simulated 4 different room configurations to test the generalization of the learning-based methods in label-poor scenarios, as we expect this to closely approximate real applications. Thus, we test the learning-based methods (VAE-SSL and CNN) using few labels, with J the number of labels. For more details see Sec. 3.3.
All simulations were coded in Python, except the simulated reverberant data, which were generated using Matlab [hadad2014multichannel]. The VAE-SSL system and CNN were implemented using PyTorch [pytorch], with the Pyro package [bingham2018pyro] used for stochastic VI in the VAE. The NNs were optimized using Adam [kingma2014adam].
3.1 Learning-based model parameters
The signals at the microphones are given in (2.1). We obtain the RTFs from the data by (3). The RTFs are estimated using single FFT frames with Hamming windowing and RTF vectors for the VAE and CNN. For all experiments, the RTF features are normalized to zero mean and unit standard deviation.
The VAE-SSL model consisted of 3 fully connected neural networks (FNNs), each of which had two fully connected layers with 500 hidden units and softplus activation. The label inference (classifier) network had softmax activation on the outputs. The latent code dimension for all experiments was . The inference ( and ) and generative network () input and output units had linear activation. For all VAE-SSL training, a learning rate of 0.0001 and a batch size of 256 were used. We set (see (12)) for all experiments as this gave the best performance.
The input to the CNN was the same RTF estimates used for VAE-SSL. The CNN consisted of two convolutional layers followed by three fully connected layers. The convolutional layers were 2D with 3x3 kernels and had 6 and 16 channels, respectively, with an intervening 2x2 max-pooling layer. The three fully connected layers had 480, 120, and 60 hidden units. All activations on the hidden layers were ReLU. A learning rate of 0.001 and a batch size of 256 were used for the CNN during training.
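The single-frame estimate can be sketched without dependencies as follows. This is an illustrative sketch, not the code used in this work: the function names are hypothetical, a naive DFT stands in for an FFT, and the first channel is assumed to be the reference. For a single frame, the biased estimate is X2 * conj(X1) / |X1|^2, so its phase equals the phase of the cross-spectrum.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame):
    """Naive one-sided DFT (bins 0..n/2); adequate for a sketch, use an FFT in practice."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def rtf_phase_frames(x1, x2, nfft=256, overlap=0.5):
    """Wrapped RTF phase per frame, with channel x1 as the (assumed) reference.

    Each frame is Hamming-windowed; the phase of X2 * conj(X1) is returned
    for every one-sided frequency bin.
    """
    hop = int(nfft * (1 - overlap))
    win = hamming(nfft)
    frames = []
    for start in range(0, len(x1) - nfft + 1, hop):
        X1 = dft([w * v for w, v in zip(win, x1[start:start + nfft])])
        X2 = dft([w * v for w, v in zip(win, x2[start:start + nfft])])
        frames.append([cmath.phase(b * a.conjugate()) for a, b in zip(X1, X2)])
    return frames
```

Normalization to zero mean and unit standard deviation, as described above, would be applied to these phase vectors afterwards.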
3.2 Non-learning: SRP-PHAT configuration
The FFT features (used to calculate the RTF) were used for the SRP-PHAT approach with the same number of frames as the VAE-SSL and CNN methods. SRP-PHAT used 37 candidate DOAs over . SRP-PHAT was implemented using the Pyroomacoustics toolbox [scheibler2018pyroomacoustics]. The number of candidate DOAs was increased to candidate DOAs from to account for left-right (L-R) ambiguity. The ambiguity was corrected for in the results.
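One way such an ambiguity correction could be done is to fold full-circle estimates back onto the broadside-relative range. The sketch below is an assumption, not the procedure used in this work: it assumes the azimuth is measured from array broadside, so that two directions mirrored about the array axis produce the same inter-microphone delay, and the function name is hypothetical.

```python
import math


def fold_to_broadside(azimuth_deg):
    """Map an azimuth on the full circle to the equivalent DOA in [-90, 90] degrees.

    Assumes the azimuth is measured from array broadside. Directions mirrored
    about the array axis share the same sine, and hence the same delay between
    the two microphones, so they are mapped to the same DOA.
    """
    return math.degrees(math.asin(math.sin(math.radians(azimuth_deg))))
```

Under this convention, an estimate at 150 degrees and one at 30 degrees map to the same DOA.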
Tables 1–4: DOA RMSE and frame-level accuracy (Acc.) versus the number of labels J for the design, validation, test I, and test II cases.
3.3 Reverberant room data
The reverberant room acoustic data were generated using the Room Impulse Response (RIR) generator [habets2016]. The synthetic IRs are convolved with a Gaussian time-domain signal (see (2.1)), and FFTs and RTFs are obtained for two sensors with only one active source location for each time bin. We simulated 4 different room configurations to test the generalization of the learning-based methods in label-poor scenarios. This included off-design conditions with changes to reverberation time and perturbations in the microphone position.
The nominal room configuration, deemed the ‘design’ case (see Table 1 for results), is a square room with x-y-z dimensions with a reverberation time ms and m/s. Two omnidirectional microphones were located in the center of the room with spacing 0.26 m. The source range was m, with sources at resolution for azimuth relative to the array broadside. We used 0.5 s signals for each sensor location to obtain FFT and RTF frames with a 48 kHz sampling rate. A sensor noise level of 20 dB was assumed and 10 signal realizations were generated for each candidate DOA. This yielded 66,000 FFT (RTF) frames (NFFT=256) per room configuration, using Hamming windowing with 50% overlap.
Three off-design rooms were simulated. The first, deemed the validation case (see Table 2 for results) had the same room geometry and source configuration as the design case, but the reverberation time ms. The second, deemed test case I (see Table 3 for results), had the same room geometry, source configuration and as the design case, but the two microphone locations were displaced by 5 mm and 3 mm in opposite directions along the y-axis. The third, deemed test case II (see Table 4 for results), had the same physical configuration as test case I, but was increased to .
The VAE and CNN inputs were obtained using RTF vectors, giving an input size . The CNN used the same input data, and SRP-PHAT used FFT frames.
3.4 Training and performance
The VAE-SSL and CNN were trained using a subset of the available labeled data (66,000 frames for each room) from the design case and the validation case. The number of supervised samples for each case is given by J, with values of J that are multiples of 37. The same data is used for the CNN. In VAE-SSL, for each value of J, the remaining samples are used for unsupervised learning, assuming no labels are available for those samples. During each training epoch, the supervised batches are used at a frequency proportional to their share of the overall data (supervised and unsupervised).
The models were trained on the design case and validated on the validation case, using only labelled samples from each set. Early stopping was implemented, and models were selected based on maximum validation accuracy. The performance on the design, validation, test I, and test II cases (see Tables 1–4) was assessed using the labelled samples not available in training.
Overall, the VAE-SSL method outperforms the CNN by a large margin in these label-limited scenarios in terms of RMSE, and often in accuracy as well. In general, SRP-PHAT was outperformed by the learning-based methods. From these experiments, it is apparent that labels is insufficient for obtaining lower RMSE than SRP-PHAT in any of the cases, though the accuracy of the learning approaches always exceeded that of SRP-PHAT. With more labels, the VAE-SSL performance improves dramatically.
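One simple way to interleave supervised and unsupervised batches at such a frequency is a periodic schedule. The sketch below is illustrative only: the function name and the 'sup'/'unsup' tags are hypothetical, and the actual scheduling used in training may differ.

```python
def batch_schedule(n_sup_batches, n_unsup_batches):
    """Return one epoch's batch order as a list of 'sup' / 'unsup' tags.

    A supervised batch is drawn every `period` steps, where `period` is the
    total number of batches divided by the number of supervised batches, so
    supervised batches appear in proportion to their share of the data.
    """
    total = n_sup_batches + n_unsup_batches
    period = total // n_sup_batches if n_sup_batches else 0
    schedule = []
    sup_used = 0
    for step in range(total):
        if n_sup_batches and step % period == 0 and sup_used < n_sup_batches:
            schedule.append("sup")
            sup_used += 1
        else:
            schedule.append("unsup")
    return schedule
```

Every batch is visited exactly once per epoch, so the supervised fraction of steps equals the supervised fraction of batches.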
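The two reported metrics and the model-selection rule can be sketched as follows. The function names are hypothetical, DOAs are assumed to be given in degrees on the candidate grid, and accuracy is returned as a fraction of frames.

```python
import math


def doa_rmse(true_deg, pred_deg):
    """Root-mean-square DOA error in degrees over all frames."""
    n = len(true_deg)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_deg, pred_deg)) / n)


def frame_accuracy(true_deg, pred_deg):
    """Fraction of frames whose predicted DOA equals the true DOA."""
    return sum(t == p for t, p in zip(true_deg, pred_deg)) / len(true_deg)


def best_epoch(val_accuracies):
    """Index of the epoch with maximum validation accuracy (first one on ties)."""
    return max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
```

RMSE penalizes the size of a miss, whereas accuracy only counts exact matches on the grid, so a method can rank differently under the two metrics.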
VAE-SSL has the ability to conditionally generate RTF phase based on label input. We generate the reverberant RTF phase using the trained VAE and plot it relative to the input labels in Fig. 2(b). RTF samples are conditionally generated by , holding constant, and sampling over the DOA labels . We use the phase-wrap of the RTF () (as a function of sensor separation and DOA , ) to help qualify the physics learned by VAE-SSL. This is plotted along with the RTF frames from the design case room configuration (Fig. 2(c)). It is observed that the physics of the RTF phase are well-learned by the VAE-SSL model.
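The free-field reference curve used for this comparison can be sketched as below. This is an assumption-laden illustration: it uses a far-field, anechoic model, the sign convention and the default sound speed of 343 m/s are assumed, and the function name is hypothetical.

```python
import math


def wrapped_rtf_phase(freq_hz, doa_deg, spacing_m, c=343.0):
    """Free-field RTF phase for a two-microphone array, wrapped to [-pi, pi].

    freq_hz: frequency in Hz; doa_deg: DOA relative to broadside in degrees;
    spacing_m: microphone spacing in metres; c: sound speed in m/s (assumed).
    """
    delay = spacing_m * math.sin(math.radians(doa_deg)) / c  # inter-mic delay in s
    phase = -2.0 * math.pi * freq_hz * delay  # unwrapped phase (assumed sign)
    return math.atan2(math.sin(phase), math.cos(phase))  # wrap to the principal value
```

Evaluating this over the frequency bins and the candidate DOAs, with the 0.26 m spacing given in Sec. 3.3, gives the wrapped-phase pattern against which generated RTF phase can be compared.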
It has been demonstrated for low to medium reverberant scenarios that VAE-SSL outperforms the CNN when only a few labels are available, provided significant unlabeled data is available. Training the VAE-SSL on all reverberant frames (with and without labels) allows the VAE system to fully exploit the structure of the data. The strength of the VAE-SSL approach, which learns in a statistically principled way from both labelled and unlabeled examples, is evident. As observed in this experiment, labels is sufficient for VAE-SSL to obtain better performance than SRP-PHAT. This study shows that deep generative modeling can localize sources well in reverberant environments, relative to existing approaches, when only a few labels are available. Further, such models are robust to perturbations in the microphone locations and reverberation levels. Finally, the representations learned by such generative approaches can be used to generate new samples.