A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer

by Hu Hu, et al.

We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation as in conventional maximum a posteriori estimation, which risks a curse of dimensionality when estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach obtains good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.




1 Introduction

Recent advances in machine learning are largely due to the evolution of deep learning combined with the availability of massive amounts of data. Deep neural networks (DNNs) have demonstrated state-of-the-art results in building acoustic systems [1, 2, 3]. Nonetheless, audio and speech systems still highly depend on how closely the training data used in model building covers the statistical variation of the signals in the testing environments. Acoustic mismatches, such as changes in speakers, environmental noise, and recording devices, usually cause an unexpected and severe performance degradation [4, 5, 6]. For example, in acoustic scene classification (ASC), device mismatch is an inevitable problem in real production scenarios [7, 8, 9, 10]. Moreover, the amount of data for a specific target domain is often not sufficient to train a deep target model that achieves a performance similar to the source model. A key issue is to design an effective adaptation procedure to transfer knowledge from the source to the target domain, while avoiding the catastrophic forgetting and curse of dimensionality [11, 12, 13, 14, 15] often encountered in deep learning.

Figure 1: Illustration of the proposed knowledge transfer framework.

Bayesian learning provides a mathematical framework to model uncertainties and incorporate prior knowledge. Estimation is usually performed via either maximum a posteriori (MAP) or variational Bayesian (VB) approaches. By leveraging target data and a prior belief, a posterior belief can be obtained by optimally combining the two. The MAP solution yields a point estimate, which has proven effective in handling acoustic mismatches in hidden Markov models (HMMs) [4, 16, 17] and DNNs [12, 18, 13] by assuming a distribution on the model parameters. On the other hand, the VB approach estimates the entire posterior distribution via stochastic variational inference [19, 20, 21, 22, 23, 24]. Bayesian learning can facilitate building an adaptive system for specific target conditions in a particular environment; the mismatches between training and testing are thus reduced, and the overall system performance is greatly enhanced.

Traditional Bayesian formulations usually impose uncertainties on the model parameters. However, for commonly used DNNs, the number of parameters is usually much larger than the number of available training samples, making an accurate estimation difficult. In parallel, a feature-based knowledge transfer framework, namely teacher-student learning (TSL, also called knowledge distillation) [25, 26], has been investigated in recent years. The basic TSL transfers knowledge acquired by the source/teacher model and encoded in its softened outputs (model outputs after softmax) to the target/student model, where the target model directly mimics the final prediction of the source model through a KL divergence loss. The idea has been extended to the hidden embeddings of intermediate layers [27, 28, 29], where different embedded representations are proposed to encode and transfer knowledge. However, instead of considering the whole distribution of the latent variables, these methods only perform point estimation as in MAP, potentially leading to sub-optimal results and a loss of distributional information.

In this work, we aim at establishing a Bayesian adaptation framework based on latent variables of DNNs, where knowledge is transferred in the form of distributions of deep latent variables. To this end, a novel variational Bayesian knowledge transfer (VBKT) approach is proposed. We take the model uncertainties into account and perform distribution estimation on the latent variables. In particular, by leveraging variational inference, the distributions of the source latent variables (prior) are combined with the knowledge learned from the target data (likelihood) to yield the distributions of the target latent variables (posterior). Prior knowledge from the source domain is thus encoded and transferred to the target domain by approximating the posterior distributions of the latent variables. An extensive experimental comparison against 13 recent cutting-edge knowledge transfer methods is carried out. Experimental evidence demonstrates that our proposed VBKT approach outperforms all competing algorithms on device adaptation tasks of ASC.

2 Bayesian Inference of Latent Variables

2.1 Knowledge Transfer of Latent Variables

Suppose we are given some data observations D, and let D_S and D_T indicate the source and target domain data, respectively. Our framework requires parallel data, i.e., for each target data sample x_T there exists a paired sample x_S from the source data, where x_S and x_T share the same audio content but are recorded by different devices. Consider a DNN based discriminative model with parameters θ to be estimated, where θ usually represents the network weights. Starting from the classical Bayesian approach, a prior distribution p(θ) is defined over θ, and the posterior distribution p(θ | D) after seeing the observations can be obtained by Bayes' rule as follows,

p(θ | D) = p(D | θ) p(θ) / p(D) ∝ p(D | θ) p(θ).   (1)
Figure 1 illustrates the overall framework of our proposed knowledge transfer approach. Parallel input features of x_S and x_T are first extracted and then fed into the neural networks. In addition to the network weights, we introduce latent variables z to model the intermediate hidden embeddings of the DNNs. Here z refers to the unobserved intermediate representations, encoding transferable distributional information. We then decouple the network weights into two independent subsets, θ_b and θ_a, as illustrated by the subnets in the four squares in Figure 1, to represent the weights before and after z is generated, respectively. Thus we have

p(z, θ_a, θ_b | D) = p(θ_a | z, D) p(z | θ_b, D) p(θ_b | D).   (2)
Note that the relationship in Eq. (2) holds for both the prior p(z, θ_a, θ_b) and the posterior p(z, θ_a, θ_b | D). Here we focus on transferring knowledge in a distribution sense via the latent variables z. With parallel data, we thus assume that there exist latent variables z retaining the same distributions across the source and target domains. Specifically, for the target model we take the prior p(z) = p(z | D_S), as the prior knowledge learnt from the source is encoded in p(z | D_S).

2.2 Variational Bayesian Knowledge Transfer

Denoting the latent variables in the target model as z, the posterior p(z | D_T) is typically intractable, and an approximation is required. In this work, we propose a variational Bayesian approach to approximate the posterior; therefore, a variational distribution q(z) is introduced. For the target domain model, the optimal q*(z) is obtained by minimizing the KL divergence between the variational distribution and the true posterior, over a family Q of allowed approximate distributions:

q*(z) = argmin_{q ∈ Q} KL( q(z) ‖ p(z | D_T) ).   (3)
In this work, we focus on the latent variables z, and we assume a non-informative prior over θ_a and θ_b. Next, by substituting Eqs. (1), (2), and the prior distribution p(z) = p(z | D_S) into Eq. (3), we arrive, after re-arranging the terms, at the following variational lower bound:

L(q) = E_{q(z)}[ log p(D_T | z) ] − KL( q(z) ‖ p(z | D_S) ).   (4)
Simply put, a Gaussian mean-field approximation is used to specify the distribution forms of both the prior and the posterior over z. Specifically, each latent variable in z follows a D-dimensional isotropic Gaussian, where D is the hidden embedding size. Given a parallel data set, we can approximate the KL divergence term in Eq. (4) by establishing a mapping of each pair of Gaussian distributions across domains via sample pairs. We denote the Gaussian means and variances for the source and target domains as (μ_S, σ_S²) and (μ_T, σ_T²), respectively. A stochastic gradient variational Bayes (SGVB) estimator [21] is then used to approximate the posterior, with the network hidden outputs regarded as the means of the Gaussians. Moreover, we assign a fixed value σ² to both σ_S² and σ_T², as the variance of all individual Gaussian components. We can then obtain a closed-form solution for the KL divergence term in Eq. (4). Furthermore, by adopting Monte Carlo sampling to generate N pairs of samples, the lower bound in Eq. (4) can be approximated empirically as:

L ≈ (1/N) Σ_i log p(y_i | z_i) − Σ_i ‖μ_{T,i} − μ_{S,i}‖² / (2σ²),   (5)
where the first term is the likelihood, and the second term is deduced from the KL divergence between the prior and the posterior of the latent variables. Each instance z_i is sampled from the posterior distribution as z_i = μ_{T,i} + σ ε with ε ~ N(0, I), so the expectation in the first term can be reduced. To let the gradients of the sampling operation flow through the deep neural nets, a reparameterization trick [30, 21] is adopted during network training. In the inference stage, since it is a classification task, we directly take z = μ_T to simplify the computation.
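The two operations above can be sketched in a few lines of NumPy. This is a minimal illustration of the fixed-variance Gaussian assumption and the reparameterization trick, not the authors' implementation; all function names are our own.

```python
import numpy as np

def kl_isotropic_fixed_var(mu_t, mu_s, sigma2):
    """Closed-form KL(N(mu_t, sigma2*I) || N(mu_s, sigma2*I)).

    With a shared, fixed variance the KL divergence between the target
    and source Gaussians reduces to a scaled squared distance between
    their means: ||mu_t - mu_s||^2 / (2 * sigma2)."""
    return np.sum((mu_t - mu_s) ** 2) / (2.0 * sigma2)

def sample_latent(mu_t, sigma2, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    so gradients can flow through mu in an autodiff framework."""
    eps = rng.standard_normal(mu_t.shape)
    return mu_t + np.sqrt(sigma2) * eps
```

At inference time, sampling is skipped and z is set directly to the target mean, as described above.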

3 Experiments

3.1 Experimental Setup

We evaluate our proposed VBKT approach on the acoustic scene classification (ASC) task of the DCASE2020 challenge task1a [31]. The training set contains 10K scene audio clips recorded by the source device (device A), and 750 clips for each of the 8 target devices (devices B, C, s1-s6). Each target audio clip is paired with a source audio clip, and the only difference between the two clips is the recording device. The goal is to solve the device mismatch issue for one specific target device at a time, i.e., device adaptation, which is a common scenario in real applications. For each audio clip, log-mel filter bank (LMFB) features are extracted and scaled to [0, 1] before being fed into the classifier.
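As a rough sketch, the per-clip [0, 1] scaling of the LMFB features can be done with min-max normalization. The exact extraction settings (number of mel bins, window, hop) are not specified here, so this only illustrates the scaling step on an already-extracted feature matrix.

```python
import numpy as np

def minmax_scale(lmfb):
    """Scale a (time, mel) log-mel filter bank matrix to [0, 1] per clip."""
    lo, hi = lmfb.min(), lmfb.max()
    if hi == lo:                       # guard against a constant input
        return np.zeros_like(lmfb)
    return (lmfb - lo) / (hi - lo)
```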

Two state-of-the-art models, namely a dual-path resnet (RESNET) and a fully convolutional neural network with channel attention (FCNN), are tested according to the challenge results [31, 32]. We use the same models for both the source and target devices. Mix-up [33] and SpecAugment [34] are used in the training stage. Stochastic gradient descent (SGD) with a cosine-decay restart learning rate scheduler is used to train all models. The maximum and minimum learning rates are 0.1 and 1e-5, respectively. The latent variables are based on the hidden outputs before the last layer. Specifically, the hidden embedding after batch normalization but before the ReLU activation of the second-to-last convolutional layer is utilized. As stated in Section 2, a deterministic value is set for σ. In our experiments, we generate extra data [35] and compute the average standard deviation over each audio clip, which determines the final value of σ. For the other 13 tested cutting-edge TSL based methods, we mostly follow the recommended setups and hyper-parameter settings of their original papers. The temperature parameter is set to 1.0 for all methods when computing the KL divergence with soft labels.
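The cosine-decay restart schedule between the stated maximum (0.1) and minimum (1e-5) rates can be sketched as follows. The cycle length, and whether cycles grow after each restart, are our assumptions, as they are not specified above.

```python
import math

def cosine_restart_lr(step, cycle_len, lr_max=0.1, lr_min=1e-5):
    """Cosine-decay learning rate with warm restarts.

    Within each cycle of `cycle_len` steps the rate decays from lr_max
    to lr_min along a half cosine, then jumps back up (restart)."""
    t = (step % cycle_len) / cycle_len            # position within cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```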

Method | RESNET avg. (%) | RESNET w/ TSL avg. (%) | FCNN avg. (%) | FCNN w/ TSL avg. (%)
Source model | 37.70 | - | 37.13 | -
No transfer | 54.29 | - | 49.97 | -
One-hot | 63.76 | - | 64.45 | -
TSL [26] | 68.04 | 68.04 | 66.27 | 66.27
NLE [36] | 65.64 | 67.76 | 64.47 | 64.53
Fitnet [27] | 66.73 | 69.89 | 67.29 | 69.06
AT [28] | 63.73 | 68.06 | 64.16 | 66.35
AB [37] | 65.34 | 68.69 | 66.21 | 66.91
VID [38] | 63.90 | 68.56 | 63.79 | 65.75
FSP [39] | 64.44 | 68.94 | 65.33 | 66.01
COFD [29] | 64.92 | 68.57 | 66.69 | 68.63
SP [40] | 64.57 | 68.45 | 65.74 | 67.36
CCKD [41] | 65.59 | 69.47 | 66.52 | 68.29
PKT [42] | 64.65 | 65.43 | 64.84 | 67.25
NST [43] | 68.35 | 68.51 | 67.13 | 68.84
RKD [44] | 65.28 | 68.46 | 65.63 | 67.27
VBKT | 69.58 | 69.90 | 69.96 | 70.50
Table 1: Comparison of average evaluation accuracies (in %) on recordings of the DCASE2020 ASC data set. Each method is tested with and without the combination of the basic TSL method. Each cell reports the average over 32 experimental results (8 target devices × 4 repeated trials).

3.2 Evaluation Results on Device Adaptation

Evaluation results of device adaptation on the DCASE2020 ASC task are shown in Table 1. The source models are trained on data recorded by device A, yielding classification accuracies of 79.09% for RESNET and 79.70% for FCNN on the source test set. There are 8 target devices, i.e., devices B, C, s1-s6. The accuracy reported in each cell of Table 1 is obtained by averaging over 32 experimental results, from 8 target devices with 4 trials each. The first and third columns report results obtained by directly using the knowledge transfer methods, whereas the second and fourth columns list accuracies obtained when further combining them with the basic TSL method.

We first look at the results without the combination with TSL. The 1st row gives results obtained by directly testing the source model on the target devices. We can observe a huge degradation (from 79% to 37%) when compared with the results on the source test set. This shows that device mismatch is indeed a critical issue in acoustic scene classification, since changing the device causes a severe performance drop. The 2nd and 3rd rows of Table 1 give results of target models trained on target data, either from scratch or fine-tuned from the source model. Comparing them underlines the importance of knowledge transfer when building a target model.

The 4th to 16th rows of Table 1 show the evaluated results of 13 recent top TSL-based methods. The result of the basic TSL [25, 26], which minimizes the KL divergence between model outputs and soft labels, is shown in the 4th row. We can observe a gain obtained by TSL when compared with one-hot fine-tuning. If we compare the other methods (5th to 16th rows) with the basic TSL, they only show small advantages on this task. The bottom row of Table 1 shows the results of our proposed VBKT approach. It not only outperforms one-hot fine-tuning by a large margin (69.58% vs. 63.76% for RESNET, and 69.96% vs. 64.45% for FCNN), but also attains superior classification results to those obtained with the other algorithms.

We further investigate the combination of the proposed approach with the basic TSL. The experimental results are shown in the 2nd and 4th columns of Table 1, for the RESNET and FCNN models, respectively. Specifically, the original cross entropy (CE) loss is replaced by a weighted sum of 0.9 times the KL loss with soft labels and 0.1 times the CE loss with hard labels. There is no change to the basic TSL itself, so the results in the 4th row remain the same. For the other tests (5th to 16th rows), most methods attain further gains when combined with the basic TSL. Indeed, such a combination is recommended by several studies [27, 28, 42]. Finally, when combined with TSL, the accuracy of our proposed VBKT approach is further boosted, and it still outperforms all other tested methods under the same setup.
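The combined objective described here, a 0.9-weighted KL loss on soft labels plus a 0.1-weighted CE loss on hard labels at temperature 1, can be sketched as follows. This is an illustrative NumPy version for a single example, not the training code used in the paper.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())      # shift for numerical stability
    return e / e.sum()

def combined_tsl_loss(student_logits, teacher_logits, hard_label,
                      w_kl=0.9, w_ce=0.1):
    """w_kl * KL(teacher_soft || student_soft) + w_ce * CE(hard label), T = 1."""
    p_t = softmax(teacher_logits)          # teacher soft labels
    p_s = softmax(student_logits)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    ce = -np.log(p_s[hard_label] + 1e-12)
    return w_kl * kl + w_ce * ce
```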

Figure 2: Evaluation results for using different hidden layers of FCNN model, on two target devices: (a) Device s3 and (b) Device s5. The basic TSL is combined with all methods.

3.3 Effects of Hidden Embedding Depth

In our basic setup, the hidden embedding before the last convolutional layer is utilized for modeling the latent variables. Ablation experiments are further carried out to investigate the effects of using hidden embeddings from different layers. Experiments are performed on FCNN since it has a sequential architecture with stacked convolutional layers, namely eight convolutional layers plus a final convolutional layer for outputs. Results are shown in Figure 2. Methods that use all hidden layers, such as COFD, are not covered here. Results for RKD and NST on Conv2 are missing because they exceed the memory limit. From the results we can see that, for most of the assessed approaches, the best performance is obtained from the last layer. Moreover, hidden embeddings closer to the model output allow for a higher accuracy than those closer to the input. We can therefore argue that late features are better than early features for transferring knowledge across domains, in line with what was observed in [27, 43]. Finally, VBKT attains a very competitive ASC accuracy and consistently outperforms the other methods independently of the selected hidden layer.

3.4 Visualization of Intra-class Discrepancy

To better understand the effectiveness of the proposed VBKT approach, we compare the intra-class discrepancies between target model outputs (before softmax). The visualized heatmap results are shown in (a)-(f) of Figure 3. Here we randomly select 30 samples from the same class and compute the distance between the model outputs of each pair. Each cell in the subplots of Figure 3 thus represents the discrepancy between two outputs, where a darker color means a bigger intra-class discrepancy. From these visualization results we can argue that the heatmap obtained by our proposed VBKT approach in Figure 3(f) shows consistently smaller intra-class discrepancies than those produced by the other methods, implying that VBKT yields more discriminative information and results in a better cohesion of instances from the same class.
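The discrepancy matrices behind these heatmaps can be computed, under the assumption of a Euclidean distance (the metric is not named here), with a simple pairwise-distance sketch:

```python
import numpy as np

def intraclass_discrepancy(outputs):
    """Pairwise Euclidean distance matrix between pre-softmax outputs
    of samples from the same class; entry (i, j) is the discrepancy
    between sample i and sample j (darker = larger in the heatmap)."""
    diff = outputs[:, None, :] - outputs[None, :, :]   # (n, n, dim) differences
    return np.sqrt((diff ** 2).sum(-1))
```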

4 Conclusion

In this study, we propose a variational Bayesian approach to address cross-domain knowledge transfer issues when deep models are used. Different from previous solutions, we propose to transfer knowledge via prior distributions of deep latent variables from the source domain, casting the problem as learning distributions of latent variables in deep neural networks. In contrast to conventional maximum a posteriori estimation, a variational Bayesian inference algorithm is formulated to approximate the posterior distribution in the target domain. We assess the effectiveness of our proposed VB approach on device adaptation tasks for the DCASE2020 ASC data set. Experimental evidence clearly demonstrates that the target model obtained with our proposed approach outperforms all other tested methods in all tested conditions.

Figure 3: Visualized heatmaps of the intra-class discrepancy between target model outputs for (a) TSL, (b) Fitnets, (c) AT, (d) SP, (e) CCKD, and (f) VBKT. FCNN on target device s5 is used.