Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

by Dawei Liang, et al.

Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. Towards this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [8] with different types of inputs. Our main observations are that: 1) Joint learning of audio and voice inputs improves the AED performance (mean average precision) for both a CNN baseline (0.292 vs 0.134 mAP) and a TALNet [35] baseline (0.361 vs 0.351 mAP); 2) Augmenting the extra voice features is critical to maximize the model performance with dual inputs.








1 Introduction

With the development of modern smart devices with listening capabilities, such as voice assistants, smartphones, and wearable devices, audio has been increasingly used as a modality for inferring human activities and contexts [17, 19]. Acoustic event detection (AED) is the process of detecting the type and temporal onset/offset of acoustic events within an audio stream. While existing AED models have advanced the modeling of target audio, studies have shown that knowledge transfer from a relevant domain is beneficial for further boosting the learning process and the model capabilities [3, 15]. For example, knowledge transfer is especially useful when data in the target domain is not sufficient for model generalization [30], which is a typical case when dealing with real-world audio.

Many acoustic event types captured in daily life are related to human voice or contain voice elements, such as conversations, TV/radio sounds, music, or sounds from a crowd. To the best of our knowledge, however, very few prior attempts have studied the opportunities of transferring and incorporating voice knowledge into a general AED process. This paper investigates this opportunity by jointly training audio and high-level voice representations for an AED model based on a dual-branch neural network architecture. We observe the benefits of adding extra voice inputs to a convolutional neural network (CNN) baseline and a TALNet [35] baseline, using AudioSet [8] as the test dataset. Specifically, our study demonstrates a few strategies that bridge the learning gap between the audio and the voice features.

2 Related work

The general process of AED is to build a classification model where the existence of an acoustic class is determined by the output class probability of the model. Conventional approaches use statistical models based on hand-crafted features [33, 32]. Recent work has increasingly focused on neural networks for audio modeling [9, 7, 36, 23], given their success in computer vision.

Due to the increasing scale of audio data, modern audio datasets are typically weakly labeled by annotators: the dataset only provides labels for whole recordings without detailed frame-level annotation, even though frame-level labeling is often required in practice. Several studies [35, 37, 16, 5] have addressed this issue. TALNet [35] is one of the state-of-the-art models for AED with weakly labeled audio inputs and has demonstrated strong performance for acoustic event tagging and localization at the same time.

Transferring knowledge from a source domain to a target task can be a useful way to enrich the learning of the target dataset [30]. It is particularly meaningful in real-world audio analysis, where access to target audio can be limited by challenges such as scalability [4] and privacy constraints [18]. For acoustic classification, transfer learning has been successfully applied both across tasks [14, 6, 21, 31, 39] and across modalities [3, 11]. Specifically, extracting and leveraging pre-trained neural network embeddings is a common way of audio knowledge transfer [35, 14, 10]. Unlike conventional hand-crafted voice features such as i-vectors or mel features [40, 27, 26, 2], voice embeddings are obtained directly from a neural network trained for speaker voice classification: the embedding represents the knowledge the network has learned for identifying speaker patterns [28, 20, 34, 22]. Our study aims to leverage voice embeddings extracted from an existing speaker dataset to enrich the AED process. As far as we know, this is the first effort to incorporate knowledge from voice inputs for AED on the AudioSet corpus.

3 Architecture

3.1 Overall pipeline

Figure 1: Overall pipeline of our study. The acoustic classifier is a neural network with two input branches, taking as input the log-mel features and the extracted voice embeddings of an utterance, respectively.

Fig. 1 shows the overall pipeline of our study. The pipeline consists of two steps: feature extraction and acoustic event detection. The first step extracts the log-mel features of an input audio utterance. In addition to leveraging the log-mel features as the AED inputs, we apply an extra pre-trained model to extract voice embedding representations from the audio. Applying voice embeddings transfers the pre-trained voice knowledge of the feature extractor to the target audio, and it applies to both vocal and non-vocal inputs. In the second step, the acoustic classifier is a neural network architecture with two input branches: an audio branch with log-mel inputs and a voice branch with the generated voice feature inputs. The outputs of both branches are concatenated along the feature dimension and fed to the final fully connected layer(s). We did not apply early fusion of the features, because feature fusion at an intermediate layer leaves more flexibility to optimize the two input branches separately [12]. In our study, the voice feature extractor was trained on an existing speaker dataset. Once pre-trained, the parameters of the feature extractor were fixed, and only the audio and voice branches were trained for the target AED task.
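The late-fusion step can be sketched as below. This is a minimal PyTorch sketch, not the authors' code: the class and argument names are our own, and the 768/64 branch output sizes follow the shapes reported in Sec. 5.1.

```python
import torch
import torch.nn as nn

class DualBranchAED(nn.Module):
    """Late fusion of an audio branch and a voice branch (sketch)."""
    def __init__(self, audio_branch, voice_branch, n_classes=527,
                 audio_dim=768, voice_dim=64):
        super().__init__()
        self.audio_branch = audio_branch   # log-mel -> (batch, T, audio_dim)
        self.voice_branch = voice_branch   # voice emb -> (batch, T, voice_dim)
        self.head = nn.Linear(audio_dim + voice_dim, n_classes)

    def forward(self, logmel, voice_emb):
        a = self.audio_branch(logmel)            # (batch, T, 768)
        v = self.voice_branch(voice_emb)         # (batch, T, 64)
        fused = torch.cat([a, v], dim=-1)        # concat on feature dim: (batch, T, 832)
        return torch.sigmoid(self.head(fused))   # frame-level class probabilities
```

Only the two branches and the head would be trained here; the pre-trained voice feature extractor that produces `voice_emb` stays frozen, as described above.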

3.2 Audio branch architecture

The audio branch was developed for the log-mel inputs. In our study, we started with a shallow CNN baseline and then TALNet. The CNN architecture is as follows:

Input → Conv1[64] → Conv2[128] → Conv3[256] → Conv4[256] → FC[2048] → FC[1024] → FC[527]

where ConvX[c] denotes a 2D convolutional layer with ReLU [1] activation and c channels. The kernel size, padding, and stride were (3×3), (1×1), and (1×1), respectively. A max-pooling of size (2×2), (2×2), and (1×2) was added after Conv1, Conv2, and Conv4, respectively. FC[n] denotes a fully connected layer of size n with ReLU activation. We adopted the model architectures of both baselines, excluding their fully connected layer(s), as our audio branch. We then added back the fully connected layer(s) after feature fusion.
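A minimal PyTorch sketch of the shallow CNN baseline as we read the spec above. The input shape (batch, 1, 400, 64) and the per-frame application of the FC layers (256 channels × 8 remaining frequency bins = 2048 features per frame) are our assumptions, not stated in the text.

```python
import torch
import torch.nn as nn

class CNNBaseline(nn.Module):
    """Sketch of the shallow CNN baseline: 4 conv blocks + 3 FC layers."""
    def __init__(self, n_classes=527):
        super().__init__()
        def block(c_in, c_out, pool=None):
            layers = [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(pool, ceil_mode=True))
            return nn.Sequential(*layers)
        self.conv1 = block(1, 64, (2, 2))     # time 400->200, freq 64->32
        self.conv2 = block(64, 128, (2, 2))   # time 200->100, freq 32->16
        self.conv3 = block(128, 256)          # no pooling
        self.conv4 = block(256, 256, (1, 2))  # freq 16->8, time stays 100
        # per-frame FC head: 256 channels * 8 freq bins = 2048 features
        self.fc = nn.Sequential(
            nn.Linear(256 * 8, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):                     # x: (batch, 1, 400, 64)
        x = self.conv4(self.conv3(self.conv2(self.conv1(x))))
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, 100, 2048)
        return self.fc(x)                     # (batch, 100, 527) frame logits
```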

3.3 Voice branch architecture

Figure 2: Architecture of the voice branch. It consists of 1D convolutional layers on the time (t) dimension followed by a uni-directional GRU layer.

Unlike the audio branch, we did not apply convolution on the feature dimension of the voice inputs since adjacent elements of an embedding may not have spatial correlation as the log-mel vectors do. Hence, our voice branch consists of 1D convolutional layers along the temporal dimension of the voice embeddings. The feature dimension of the embeddings is mapped to the channel dimension of the convolutional layers. Fig. 2 shows such a process. In such a design, reducing the number of network channels of each convolutional layer essentially reduces the size of the feature dimension. We added an extra uni-directional GRU layer following the convolutional layers to improve the learning performance.
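The voice branch described above can be sketched as follows. This is our own PyTorch sketch; the channel counts (256, 64), kernel/padding/stride (3, 1, 1), batch normalization before activation, and GRU hidden size 64 follow the deployment details given in Sec. 5.1.

```python
import torch
import torch.nn as nn

class VoiceBranch(nn.Module):
    """1D convolutions over time on voice embeddings, then a uni-directional GRU."""
    def __init__(self, emb_dim=1024):
        super().__init__()
        # the embedding feature dimension maps to the conv channel dimension
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, 256, 3, stride=1, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 64, 3, stride=1, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
        )
        self.gru = nn.GRU(64, 64, batch_first=True)

    def forward(self, emb):                   # emb: (batch, 100, 1024)
        x = self.convs(emb.transpose(1, 2))   # conv along time: (batch, 64, 100)
        out, _ = self.gru(x.transpose(1, 2))  # (batch, 100, 64)
        return out
```

Note how shrinking the channel count (1024 → 256 → 64) is what reduces the feature dimension, since convolution is applied only along time.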

4 Voice representations

4.1 Pre-training of voice models

To develop the voice feature extractor, we constructed a speaker recognition task in which a network was trained to classify a given set of speaker voice classes. The task was built on the public VoxCeleb1 [22] speaker dataset, which consists of audio utterances of varying lengths from over 1K celebrities, extracted from public YouTube videos. Specifically, we used 1,211 speakers in the dataset for model training and validation. For each speaker, ten utterances were randomly selected for validation, and the rest were used for training, resulting in an average of 109 utterances per speaker in our training set.

The audio utterances were sampled at 16kHz and truncated or padded to 10 seconds. We then extracted 64D log-mel features using a frame length of 64ms and a frame shift of 25ms. The resulting log-mel features for a minibatch of inputs to our voice models had a shape of (batch×1×400×64). The features were normalized per dimension by the mean and standard deviation calculated over the entire training set. Two voice models were then applied, inspired by [22] (Arch1) and by a simplified version of TALNet (Arch2). We slightly modified the model parameters to fit our requirements, as shown in Table 1. The models were implemented in PyTorch [25]. ReLU activation and batch normalization were applied except for the last fc layer. Besides, the "ceil" mode was enabled for the max-pooling layers. For Arch1, the output of the last conv2D layer was re-shaped from (batch×1024×100×1) to (batch×100×1024); the temporal dimension (100) was then aggregated by average pooling. For Arch2, the temporal dimension was aggregated after the fc layer.
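As a rough sketch of the featurization step (pad/truncate to 10 s at 16 kHz, then 64 ms frames with a 25 ms shift), the framing can be written as below. We assume a center-padded framing convention so that 10 s yields exactly 400 frames; the mel filterbank itself is omitted here.

```python
import numpy as np

SR = 16000
FRAME_LEN = int(0.064 * SR)   # 64 ms -> 1024 samples
HOP = int(0.025 * SR)         # 25 ms -> 400 samples

def frame_utterance(wave, seconds=10):
    """Pad/truncate to a fixed length and slice into overlapping frames."""
    target = seconds * SR
    wave = np.pad(wave[:target], (0, max(0, target - len(wave))))
    # center-pad so the number of frames is exactly target // HOP
    wave = np.pad(wave, FRAME_LEN // 2)
    n_frames = target // HOP                  # 400 frames for 10 s
    idx = np.arange(FRAME_LEN)[None, :] + HOP * np.arange(n_frames)[:, None]
    return wave[idx]                          # (400, 1024)
```

Applying a 64-filter log-mel transform to each frame would then give the (400, 64) features per utterance described above.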

Arch1                          Arch2
conv2D (96, 3×3, 1, 1)         conv2D (32, 3×3, 1, 1)
mpool (2×2)                    mpool (2×2)
conv2D (256, 3×3, 1, 1)        conv2D (32, 3×3, 1, 1)
mpool (2×2)                    mpool (2×2)
conv2D (384, 3×3, 1, 1)        conv2D (64, 3×3, 1, 1)
conv2D ×2 (256, 3×3, 1, 1)     mpool (2×2)
mpool (1×2)                    conv2D (64, 3×3, 1, 1)
conv2D (1024, 1×8, 1, 0)       flatten (batch×100×1024)
fc ×2 (1024 / 1211)            biGRU (512×2)
                               fc (1211)
Table 1: Architecture of our voice models. conv2D: 2D convolutional layers (channels, kernel size, stride, padding); mpool: max pooling; fc: fully connected layers; ×2 denotes two stacked layers of the same configuration.

To set up training, we used an initial learning rate of 2 and a batch size of 25. We applied the same class balancing strategy as in [35] to account for class imbalance. The learning rate was shrunk by a factor of 0.9 when the validation accuracy plateaued, and training was stopped when the learning rate reached 1. We used the Adam [13] optimizer and the cross-entropy loss.
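The plateau-based schedule above can be sketched in plain Python. Only the 0.9 decay factor and the accuracy-plateau trigger come from the text; the `patience` and `min_lr` values here are our own placeholder assumptions.

```python
class PlateauDecay:
    """Shrink the learning rate by `factor` when validation accuracy stops
    improving; signal a stop once the rate falls below `min_lr`."""
    def __init__(self, lr, factor=0.9, patience=1, min_lr=1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_acc):
        if val_acc > self.best:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor    # decay by 0.9 on plateau
                self.bad_epochs = 0
        return self.lr >= self.min_lr     # False -> stop training
```

PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same idea and is what one would typically use in practice.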

4.2 Voice feature extraction

The best validation accuracy we obtained was 95.8% for Arch1 and 93.4% for Arch2. These high accuracy values were expected given the large number of utterances per speaker. The voice embeddings were then extracted before the average pooling for Arch1 and after the GRU layer for Arch2, which maintains the temporal information of the input utterance at a resolution of 10Hz. The resulting embeddings were of shape (batch×100×1024) for a batch. In the following sections, we refer to the embeddings from the two models as emb_1 and emb_2 for convenience.

5 Experiments

5.1 Training setup

We leveraged AudioSet for our AED study. The dataset consists of over 2 million 10-second audio utterances covering 527 annotated acoustic classes, including vocal and non-vocal sounds extracted from public YouTube videos. We used the standard evaluation set containing 24,832 utterances.

We followed our speaker recognition steps to derive the same type of log-mel features as inputs for the AED part. We then tested the two baselines of the audio branch independently. In deployment, we removed Conv4 of the CNN baseline for our dual-branch tests to maintain a similar model size with and without voice inputs. Besides, we used a hidden size of 768 for the GRU layer of TALNet. For both baselines, we enabled the "ceil" mode of the max-pooling layers in PyTorch. The output of the audio branch was consistently of shape (batch×100×768), where the temporal size was 100 (0.1s resolution).

For the voice branch, we applied two 1D convolutional layers so that the parameter size of the voice branch (about 1M) remained small compared to the baseline models. The kernel size, padding, and stride were 3, 1, and 1, respectively, with ReLU activation; batch normalization was added before the activation. The numbers of channels of the two convolutional layers were 256 and 64, respectively. For the GRU layer, the hidden size was 64. Hence, the outputs of the voice branch were of shape (batch×100×64).

The final shape of the features was (batch×100×832) after fusion. The last fully connected layer was of size 527 with sigmoid activation. The per-class binary predictions were aggregated per utterance by linear softmax pooling over the frame-level probability outputs. We used the same learning rate and optimization setup as in the speaker recognition task, but the validation metric was switched to the mean average precision (mAP) score, which is also used by AudioSet. Besides, we used the binary cross-entropy loss.
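Linear softmax pooling weights each frame's probability by itself, so confident frames dominate the clip-level score. A small numpy sketch:

```python
import numpy as np

def linear_softmax_pool(frame_probs, eps=1e-8):
    """Aggregate frame-level probabilities (T, C) to clip level:
    y_c = sum_t p_tc^2 / sum_t p_tc."""
    p = np.asarray(frame_probs, dtype=float)
    return (p ** 2).sum(axis=0) / (p.sum(axis=0) + eps)

# a confident frame pulls the clip score toward it, unlike plain averaging
probs = np.array([[0.9, 0.1],
                  [0.1, 0.1]])
clip = linear_softmax_pool(probs)  # ~[0.82, 0.1] vs mean [0.5, 0.1]
```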

5.2 Experiments and result discussions

We first examined the maximum performance of the models with augmented voice inputs. Inspired by common augmentation strategies for acoustic features, we processed the voice embeddings with three strategies: time masking [24], mixup [38], and dropout [29] on the voice branch. Specifically, we randomly masked 40 of the 100 voice embedding frames of each input utterance. Mixup was applied at the batch level by mixing labels and the corresponding input features; we applied it to both branches at the same time with an alpha value of 1. Besides, we dropped out the voice features with a probability of 0.5.
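The time-masking and mixup steps can be sketched in numpy as below. This is an illustrative sketch, not the authors' implementation; dropout on the voice features would simply be a standard dropout layer with p=0.5 on the voice branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_mask(emb, n_mask=40):
    """Zero out n_mask of the T embedding frames (emb: (T, D))."""
    out = emb.copy()
    idx = rng.choice(len(emb), size=n_mask, replace=False)
    out[idx] = 0.0
    return out

def mixup(x, y, alpha=1.0):
    """Mix a batch with a shuffled copy of itself; labels mix the same way."""
    lam = rng.beta(alpha, alpha)           # alpha=1 -> uniform mixing weight
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```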

Combination    mAP    mAUC   d-prime
CNN            0.134  0.903  1.840
CNN+emb1       0.292  0.951  2.338
CNN+emb2       0.256  0.950  2.325
TALNet         0.351  0.966  2.584
TALNet+emb1    0.360  0.962  2.506
TALNet+emb2    0.361  0.962  2.517
Table 2: Overall AED results with different input types and base architectures of the audio branch.

Table 2 shows the overall AED results based on combinations of the input types and the audio branch architectures. In addition to the best mAP values, we also report the best mean area under the curve (mAUC) and d-prime metrics for a fuller picture of performance. The results with both baselines show that adding the extra voice branch improves the AED performance on the mAP metric, and the observation was consistent for both types of voice inputs. As expected, the performance jump was much bigger for the CNN baseline than for TALNet, since TALNet is a more sophisticated model for the AudioSet AED task, and the supplementary information carried by the voice inputs can be relatively marginal. Interestingly, the mAUC and d-prime metrics degraded for TALNet with the voice inputs, possibly because the training was only optimized for the validation mAP. We also tried doubling the number of convolutional layers for the voice branch (with 1/2 of the channel size per layer to maintain the same output shape). However, no improvement was observed for either type of embedding (0.357 mAP for both emb1 and emb2), indicating that 2 layers of CNN were sufficient for learning from the high-level voice features.

We then studied the AED effects with and without augmentation applied to the voice inputs. Table 3 shows the best validation epochs, the corresponding loss values, and the mAP for the test cases. From the comparisons, it can be clearly seen that adding augmentation is critical for improving the joint training of the two branches and maximizing the model performance. Table 4 further demonstrates the effects of the individual augmentation strategies with the TALNet-based audio branch. While all techniques improved the joint training process, mixup augmentation yielded the biggest improvement. For a better comparison, we also ran a test of the original TALNet baseline with the same mixup setup applied to the log-mel inputs. However, no improvement was observed in our test (0.342 mAP), indicating that the augmentation process contributed more significantly with the extra voice inputs than with the original baseline model.

By checking the class-wise predictions of the TALNet-based models, we found that, compared to the baseline model, the models with voice embeddings led by the largest margins (at least 0.06 mAP) for the following 10 classes: Hoot, Dental_drill, Yodeling, Air_horn, Gobble, Roar, Ringtone, Television, Snoring, Video_game_music. An interesting finding was that adding the extra voice inputs did not guarantee better predictions for the voice-related AudioSet classes. For example, the mAP for Speech was 0.776 / 0.778 / 0.782 with the original TALNet / TALNet+emb1 / TALNet+emb2, but the values were 0.320 / 0.314 / 0.336 for Singing and 0.209 / 0.186 / 0.206 for Conversation. A possible explanation is that the two branches were trained as an integrated process and the models were optimized for global performance, sometimes at the expense of such individual classes.

Combination Epoch Train / Val loss mAP
CNN+emb1, no aug 11 9.7 / 13.3 0.229
CNN+emb1, with aug 39 12.7 / 12.2 0.264
CNN+emb2, no aug 13 8.9 / 12.8 0.256
CNN+emb2, with aug 45 12.6 / 11.7 0.292
TALNet+emb1, no aug 25 6.7 / 11.8 0.328
TALNet+emb1, with aug 56 11.4 / 10.6 0.360
TALNet+emb2, no aug 19 7.6 / 11.5 0.331
TALNet+emb2, with aug 71 11.4 / 10.6 0.361
Table 3: Results with and without augmentation on the voice features. Feature augmentation is a critical factor for better joint training of the two input branches.
Strategy Epoch Train / Val loss mAP
emb1+dp 21 7.1 / 11.4 0.334
emb1+tmask 25 6.8 / 11.7 0.331
emb1+mixup 62 10.6 / 10.8 0.350
emb2+dp 20 7.3 / 11.4 0.334
emb2+tmask 19 7.8 / 11.5 0.334
emb2+mixup 68 11.6 / 10.7 0.356
Table 4: Results with a single augmentation strategy applied at a time. dp: dropout; tmask: time-mask augmentation; mixup: mixup augmentation. Mixup augmentation is the most effective approach to improve the training.

6 Conclusions

This paper explored a novel approach for acoustic event detection by incorporating pre-trained voice embeddings into an AED pipeline. Towards this end, we developed a dual-branch neural network architecture for joint training of the inputs. We then reported the overall and class-wise performance with a CNN baseline and a strong TALNet baseline developed on AudioSet. Our results showed the benefits of adding extra voice inputs to the tested models (0.292 vs 0.134 mAP for the CNN baseline and 0.361 vs 0.351 mAP for TALNet baseline). Furthermore, we showed that adding augmentation and dropout on the voice inputs is critical to maximize the model performance with dual inputs.


  • [1] A. F. Agarap (2018) Deep learning using rectified linear units (relu). arXiv:1803.08375. Cited by: §3.2.
  • [2] M. T. Al-Kaltakchi, W. L. Woo, et al. (2017) Comparison of i-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments. In EUSIPCO, Cited by: §2.
  • [3] Y. Aytar, C. Vondrick, et al. (2016) Soundnet: learning sound representations from unlabeled video. NIPS. Cited by: §1, §2.
  • [4] Y. Chon et al. (2013) Understanding the coverage and scalability of place-centric crowdsensing. In UbiComp, Cited by: §2.
  • [5] S. Chou, J. R. Jang, and Y. Yang (2018) Learning to recognize transient sound events using attentional supervision.. In IJCAI, Cited by: §2.
  • [6] V. Dissanayake et al. (2020) Speech emotion recognition 'in the wild' using an autoencoder. In INTERSPEECH, Cited by: §2.
  • [7] M. Espi et al. (2015) Exploiting spectro-temporal locality in deep learning based acoustic event detection. EURASIP Journal on Audio, Speech, and Music Processing. Cited by: §2.
  • [8] J. F. Gemmeke et al. (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, Cited by: Transferring voice knowledge for Acoustic event detection: An empirical study, §1.
  • [9] A. Gorin, N. Makhazhanov, and N. Shmyrev (2016) DCASE 2016 sound event detection system based on convolutional neural network. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. Cited by: §2.
  • [10] S. Hershey et al. (2017) CNN architectures for large-scale audio classification. In ICASSP, Cited by: §2.
  • [11] H. Kaya, F. Gürpınar, and A. A. Salah (2017) Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image and Vision Computing. Cited by: §2.
  • [12] B. Khaleghi et al. (2013) Multisensor data fusion: a review of the state-of-the-art. Information fusion. Cited by: §3.1.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §4.1.
  • [14] Q. Kong, Y. Cao, et al. (2019) Cross-task learning for audio tagging, sound event detection and spatial localization: dcase 2019 baseline systems. arXiv:1904.03476. Cited by: §2.
  • [15] Q. Kong, Y. Cao, et al. (2020) PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894. Cited by: §1.
  • [16] Q. Kong et al. (2018) Audio set classification with attention model: a probabilistic perspective. In ICASSP, Cited by: §2.
  • [17] N. D. Lane, P. Georgiev, and L. Qendro (2015) Deepear: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In UbiComp, Cited by: §1.
  • [18] D. Liang, W. Song, and E. Thomaz (2020) Characterizing the effect of audio degradation on privacy perception and inference performance in audio-based human activity recognition. In MobileHCI, Cited by: §2.
  • [19] D. Liang and E. Thomaz (2019) Audio-based activities of daily living (adl) recognition with large-scale acoustic embeddings from online videos. IMWUT. Cited by: §1.
  • [20] Y. X. Lukic et al. (2017) Learning embeddings for speaker clustering based on voice equality. In IEEE MLSP, Cited by: §2.
  • [21] N. Moritz, G. Wichern, et al. (2020) All-in-one transformer: unifying speech recognition, audio tagging, and event detection. Interspeech. Cited by: §2.
  • [22] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv:1706.08612. Cited by: §2, §4.1, §4.1.
  • [23] G. Parascandolo, H. Huttunen, and T. Virtanen (2016) Recurrent neural networks for polyphonic sound event detection in real life recordings. In ICASSP, Cited by: §2.
  • [24] D. S. Park et al. (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv:1904.08779. Cited by: §5.2.
  • [25] A. Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In NIPS, Cited by: §4.1.
  • [26] L. Schmidt, M. Sharifi, and I. L. Moreno (2014) Large-scale speaker identification. In ICASSP, Cited by: §2.
  • [27] M. Schmidt and H. Gish (1996) Speaker identification via support vector classifiers. In ICASSP, Cited by: §2.
  • [28] D. Snyder et al. (2018) X-vectors: robust dnn embeddings for speaker recognition. In ICASSP, Cited by: §2.
  • [29] N. Srivastava et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. Cited by: §5.2.
  • [30] C. Tan et al. (2018) A survey on deep transfer learning. In ICANN, Cited by: §1, §2.
  • [31] N. Tonami et al. (2019) Joint analysis of acoustic events and scenes based on multitask learning. In IEEE WASPAA, Cited by: §2.
  • [32] B. Uzkent, B. D. Barkana, et al. (2012) Non-speech environmental sound classification using svms with a new set of features. International Journal of Innovative Computing, Information and Control 8 (5), pp. 3511–3524. Cited by: §2.
  • [33] L. Vuegen, B. Broeck, et al. (2013) An mfcc-gmm approach for event detection and classification. In WASPAA, Cited by: §2.
  • [34] L. Wan et al. (2018) Generalized end-to-end loss for speaker verification. In ICASSP, Cited by: §2.
  • [35] Y. Wang, J. Li, and F. Metze (2019) A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP, Cited by: Transferring voice knowledge for Acoustic event detection: An empirical study, §1, §2, §2, §4.1.
  • [36] Y. Wang, L. Neves, and F. Metze (2016) Audio-based multimedia event detection using deep recurrent neural networks. In ICASSP, pp. 2742–2746. Cited by: §2.
  • [37] C. Yu et al. (2018) Multi-level attention model for weakly supervised audio classification. arXiv:1803.02353. Cited by: §2.
  • [38] H. Zhang et al. (2017) Mixup: beyond empirical risk minimization. arXiv:1710.09412. Cited by: §5.2.
  • [39] R. Zhang, W. Zou, and X. Li (2019) Cross-task pre-training for on-device acoustic scene classification. arXiv:1910.09935. Cited by: §2.
  • [40] X. Zhao and D. Wang (2013) Analyzing noise robustness of mfcc and gfcc features in speaker identification. In ICASSP, pp. 7204–7208. Cited by: §2.