With the proliferation of modern smart devices with listening capabilities, such as voice assistants, smartphones, and wearable devices, audio has been increasingly used as a modality for inferring human activities and contexts [17, 19]. Acoustic event detection (AED) is the process of detecting the type and temporal onset/offset of acoustic events within an audio stream. While existing AED models have advanced at modeling the target audio, studies have shown that knowledge transfer from a relevant domain can further boost the learning process and the model capabilities [3, 15]. For example, knowledge transfer is especially useful when data in the target domain is insufficient for model generalization, a typical situation when dealing with real-world audio.
Many acoustic event types captured in daily life are related to the human voice or contain voice elements, such as conversations, TV/radio sounds, music, or sounds from a crowd. To the best of our knowledge, however, very few prior attempts have studied the opportunity of transferring and incorporating voice knowledge into a general AED process. This paper investigates this opportunity by jointly training audio and high-level voice representations for an AED model based on a dual-branch neural network architecture. We observe the benefits of adding extra voice inputs to a convolutional neural network (CNN) baseline and a TALNet baseline, using AudioSet as the test dataset. Specifically, our study demonstrates a few strategies that bridge the learning gap between the audio and the voice features.
2 Related work
The general process of AED is to build a classification model where the existence of an acoustic class is determined by the output class probability of the model. Conventional approaches use statistical models based on hand-crafted features [33, 32]. Recent work has increasingly focused on using neural networks for modeling audio [9, 7, 36, 23], given their success in computer vision.
Due to the increasing scale of audio data, modern audio datasets are typically weakly labeled by annotators: the dataset only gives labels for a whole recording, without detailed frame-level annotation, even though frame-level labeling is required in practice. A line of work [35, 37, 16, 5] has addressed this issue. TALNet is one of the state-of-the-art efforts for AED with weakly labeled audio inputs, and has demonstrated strong performance for acoustic event tagging and localization at the same time.
Transferring knowledge from a source domain to a target task can be a useful way to enrich learning on the target dataset. It is particularly meaningful in real-world audio analysis, where access to target audio can be limited by challenges such as scalability and privacy constraints. For acoustic classification, transfer learning has been successfully applied both across tasks [14, 6, 21, 31, 39] and across modalities [3, 11]. Specifically, extracting and leveraging pre-trained neural network embeddings is a common way of transferring audio knowledge [35, 14, 10]. Compared to conventional hand-crafted voice features such as i-vectors or mel features [40, 27, 26, 2], voice embeddings are obtained directly from a neural network trained for speaker voice classification; the embedding represents the knowledge the network has learned in order to identify speaker patterns [28, 20, 34, 22]. Our study aims to leverage voice embeddings extracted with a model pre-trained on an existing speaker dataset to enrich the AED process. As far as we know, this is the first effort to incorporate knowledge from voice inputs for AED on the AudioSet corpus.
3.1 Overall pipeline
Fig. 1 shows the overall pipeline of our study. The pipeline consists of two steps: feature extraction and acoustic event detection. The first step extracts the log-mel features of an input audio utterance. In addition to using the log-mel features as AED inputs, we apply an extra pre-trained model to extract voice embedding representations from the audio. Applying the voice embedding transfers the pre-trained voice knowledge of the feature extractor to the target audio, and it applies to both vocal and non-vocal inputs. In the second step, the acoustic classifier is a neural network with two input branches: an audio branch with log-mel inputs and a voice branch with the generated voice feature inputs. The outputs of both branches are concatenated along the feature dimension and fed to the final fully connected layer(s). We did not apply early fusion of the features, because fusing at an intermediate layer gives us more flexibility to optimize the two input branches separately. In our study, the voice feature extractor was trained on an existing speaker dataset. Once pre-trained, the parameters of the feature extractor were fixed, and only the audio and voice branches were trained for the target AED task.
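As a sketch, the dual-branch late fusion described above can look like the following in PyTorch. The class name and constructor arguments are ours, and the two branch modules stand in for the architectures described in the next subsections; this is an illustration of the fusion scheme, not the authors' code.

```python
import torch
import torch.nn as nn


class DualBranchAED(nn.Module):
    """Illustrative dual-branch classifier: the audio and voice branches are
    trained jointly, their frame-level outputs are concatenated along the
    feature dimension, and a shared fully connected layer produces per-frame
    class probabilities."""

    def __init__(self, audio_branch, voice_branch, audio_dim, voice_dim, num_classes):
        super().__init__()
        self.audio_branch = audio_branch  # log-mel -> (batch, T, audio_dim)
        self.voice_branch = voice_branch  # voice embeddings -> (batch, T, voice_dim)
        self.classifier = nn.Linear(audio_dim + voice_dim, num_classes)

    def forward(self, log_mel, voice_emb):
        a = self.audio_branch(log_mel)
        v = self.voice_branch(voice_emb)
        fused = torch.cat([a, v], dim=-1)  # fusion at an intermediate layer
        return torch.sigmoid(self.classifier(fused))  # frame-level probabilities
```

In this design the pre-trained voice feature extractor sits outside the module: its outputs (`voice_emb`) are computed with fixed parameters and only the two branches and the classifier receive gradients.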
3.2 Audio branch architecture
The audio branch was developed for the log-mel inputs. In our study, we started with a shallow CNN baseline and then TALNet. The CNN architecture is as follows:

Input → Conv1 → Conv2 → Conv3 → Conv4 → FC → FC → FC

where each Conv denotes a 2D convolutional layer with ReLU activation, and a max-pooling of size (2×2), (2×2), and (1×2) was added after Conv1, Conv2, and Conv4, respectively. FC denotes a fully connected layer with ReLU activation. We adopted the model architectures of both baselines, excluding their fully connected layer(s), as our audio branch, and added the fully connected layer(s) back after feature fusion.
3.3 Voice branch architecture
Unlike the audio branch, we did not apply convolution on the feature dimension of the voice inputs since adjacent elements of an embedding may not have spatial correlation as the log-mel vectors do. Hence, our voice branch consists of 1D convolutional layers along the temporal dimension of the voice embeddings. The feature dimension of the embeddings is mapped to the channel dimension of the convolutional layers. Fig. 2 shows such a process. In such a design, reducing the number of network channels of each convolutional layer essentially reduces the size of the feature dimension. We added an extra uni-directional GRU layer following the convolutional layers to improve the learning performance.
4 Voice representations
4.1 Pre-training of voice models
To develop the voice feature extractor, we constructed a speaker recognition task in which a network was trained to classify given speaker voice classes. The task was built on the public VoxCeleb1 speaker dataset, which consists of audio utterances of varying lengths from over 1K celebrities, extracted from public YouTube videos. Specifically, we leveraged 1,211 speakers in the dataset for our model training and validation. For each speaker, ten utterances were randomly selected for model validation, and the rest were used for training, resulting in an average of 109 utterances per speaker in our training set.
The audio utterances were sampled at 16 kHz and truncated or padded to 10 seconds. We then extracted 64-D log-mel features using a frame length of 64 ms and a frame shift of 25 ms, so the log-mel features of a minibatch of inputs to our voice models had a shape of (batch × 1 × 400 × 64). The features were normalized per dimension by the mean and standard deviation calculated over the entire training set. Two voice models were then applied, one inspired by an existing CNN architecture (Arch1) and one by a simplified version of TALNet (Arch2). We slightly modified the model parameters to fit our requirements, as shown in Table 1. The models were implemented in PyTorch. ReLU activation and batch normalization were applied except for the last fc layer, and the "ceil" mode was enabled for the max-pooling layers. For Arch1, the output of the last conv2D layer was re-shaped from (batch × 1024 × 100 × 1) to (batch × 100 × 1024); the temporal dimension (100) was then aggregated by average pooling. For Arch2, the temporal dimension was aggregated after the fc layer.
| Arch1 | Arch2 |
|---|---|
| conv2D (96, 3×3, 1, 1) | conv2D (32, 3×3, 1, 1) |
| mpool (2×2) | mpool (2×2) |
| conv2D (256, 3×3, 1, 1) | conv2D (32, 3×3, 1, 1) |
|  | conv2D (64, 3×3, 1, 1) |
| conv2D (384, 3×3, 1, 1) | mpool (2×2) |
| conv2D ×2 (256, 3×3, 1, 1) | conv2D (64, 3×3, 1, 1) |
| mpool (1×2) | flatten (batch × 100 × 1024) |
| conv2D (1024, 1×8, 1, 0) | biGRU (512 × 2) |
| fc ×2 (1024 / 1211) | fc (1211) |
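The per-dimension feature normalization described above can be sketched as follows; the function name and the tensor sizes here are illustrative, with statistics computed once over the whole training set and reused for every minibatch.

```python
import torch


def normalize_per_dim(feats, mean, std, eps=1e-8):
    """Normalize each mel dimension by mean/std computed on the training set."""
    return (feats - mean) / (std + eps)


# hypothetical training tensor: (utterances, frames, 64 mel dimensions)
train = torch.randn(100, 400, 64)
mean = train.reshape(-1, 64).mean(dim=0)  # per-dimension mean, shape (64,)
std = train.reshape(-1, 64).std(dim=0)    # per-dimension std, shape (64,)
batch = normalize_per_dim(train[:25], mean, std)  # a minibatch of 25, as in training
```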
To set up training, we used an initial learning rate of 2 and a batch size of 25. We applied a class balancing strategy to account for class imbalance, following prior work. The learning rate was shrunk by a factor of 0.9 when the validation accuracy plateaued, and training was stopped when the learning rate reached 1. We used the Adam optimizer and the cross-entropy loss.
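This schedule maps directly onto PyTorch's `ReduceLROnPlateau`. A minimal sketch follows; the model is a stand-in, and the initial and stopping learning-rate values (2e-4 and 1e-6) are assumed placeholders, since the exponents of the rates above are unclear in the text.

```python
import torch

model = torch.nn.Linear(64, 1211)  # stand-in for the voice model
# assumed initial learning rate; Adam optimizer as in the paper
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
# shrink the learning rate by 0.9 when the validation accuracy plateaus
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", factor=0.9)

min_lr = 1e-6  # assumed stopping threshold
for epoch in range(100):
    val_acc = 0.5  # placeholder: validation accuracy from an eval pass
    sched.step(val_acc)
    if opt.param_groups[0]["lr"] <= min_lr:
        break  # stop training once the learning rate has decayed enough
```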
4.2 Voice feature extraction
The best validation accuracy we obtained was 95.8% for Arch1 and 93.4% for Arch2. These high accuracy values were expected given the large utterance size we used. The voice embeddings were then extracted before the average pooling for Arch1 and after the GRU layer for Arch2, which preserves the temporal information of the input utterance at a resolution of 10 Hz. The resulting embeddings had a shape of (batch × 100 × 1024). In the following sections, we refer to the embeddings from the two models as emb1 and emb2 for convenience.
5.1 Training setup
We leveraged AudioSet for our AED study. The dataset consists of over 2 million 10-second audio utterances covering 527 annotated acoustic classes, including vocal and non-vocal sounds, extracted from public YouTube videos. We used the same evaluation set as prior work, containing 24,832 utterances.
We followed the same steps as in our speaker recognition task to derive the log-mel input features for the AED part. We then tested the two audio branch baselines independently. In deployment, we removed Conv4 of the CNN baseline for our dual-branch tests to maintain a similar model size with and without voice inputs. Besides, we used a hidden size of 768 for the GRU layer of TALNet. For both baselines, we enabled the "ceil" mode of the max-pooling layers in PyTorch. The output of the audio branch was consistently of shape (batch × 100 × 768), where the temporal size was 100 (0.1 s resolution).
For the voice branch, we applied two 1D convolutional layers so that the parameter size of the voice branch stays small (about 1M) compared to the baseline models. The kernel size, padding, and stride were 3, 1, and 1, respectively, with ReLU activation, and batch normalization was added before the activation. The two convolutional layers had 256 and 64 channels, respectively, and the GRU layer had a hidden size of 64. Hence, the outputs of the voice branch were of shape (batch × 100 × 64).
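Putting the layer parameters above together, the voice branch can be sketched as the following module; the class name is ours, but the channel counts, kernel settings, GRU size, and the placement of batch normalization before the activation follow the description above.

```python
import torch
import torch.nn as nn


class VoiceBranch(nn.Module):
    """Voice branch sketch: two 1D convolutions along the temporal dimension of
    the (batch, 100, 1024) voice embeddings, then a uni-directional GRU."""

    def __init__(self, emb_dim=1024, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
        )
        self.gru = nn.GRU(64, hidden, batch_first=True)

    def forward(self, emb):                   # emb: (batch, T, 1024)
        x = self.conv(emb.transpose(1, 2))    # Conv1d expects (batch, C, T)
        out, _ = self.gru(x.transpose(1, 2))  # back to (batch, T, 64)
        return out
```

Mapping the 1024-D embedding to the channel dimension of `Conv1d` is what avoids convolving across adjacent embedding elements, which, unlike log-mel bins, carry no spatial correlation.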
After fusion, the final feature shape was (batch × 100 × 832). The last fully connected layer was of size 527 with sigmoid activation. The per-class binary frame predictions were aggregated per utterance by linear softmax pooling on the frame-level probability outputs. We used the same learning rate and optimization setup for training as in the speaker recognition task, but switched the validation metric to the mean average precision (mAP) score also used by AudioSet, and used the binary cross-entropy loss.
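Linear softmax pooling weights each frame's probability by itself, so confident frames dominate the utterance-level score: for class c, y_c = Σ_t p_tc² / Σ_t p_tc. A minimal sketch (function name ours):

```python
import torch


def linear_softmax_pool(p, eps=1e-7):
    """Linear softmax pooling over the time axis.

    p: frame-level probabilities of shape (batch, T, classes), values in [0, 1].
    Returns utterance-level probabilities of shape (batch, classes)."""
    return (p * p).sum(dim=1) / (p.sum(dim=1) + eps)
```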
5.2 Experiments and result discussions
We first examined the maximum performance of the models with augmented voice inputs. Inspired by common augmentation strategies for acoustic features, we processed the voice embeddings with three strategies: time masking, mixup, and dropout on the voice branch. Specifically, we randomly masked 40 of the 100 voice embedding frames of each input utterance. Mixup was applied by mixing labels and the corresponding input features at the batch level; we applied it to both branches at the same time with an alpha value of 1. Besides, we dropped out the voice features with a probability of 0.5.
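The first two strategies can be sketched as below; the function names are ours, and the dropout component is simply `nn.Dropout(p=0.5)` applied to the voice features inside the branch.

```python
import torch


def time_mask(emb, n_mask=40):
    """Randomly zero n_mask of the T voice embedding frames per utterance."""
    b, t, _ = emb.shape
    out = emb.clone()
    for i in range(b):
        idx = torch.randperm(t)[:n_mask]  # distinct frame indices to mask
        out[i, idx] = 0.0
    return out


def mixup(log_mel, emb, labels, alpha=1.0):
    """Batch-level mixup applied to both branches and the labels at once."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(log_mel.size(0))  # pair each item with a shuffled one
    mix = lambda x: lam * x + (1 - lam) * x[perm]
    return mix(log_mel), mix(emb), mix(labels)
```

Using one mixing coefficient and one permutation for both branches keeps the audio features, voice features, and labels of each mixed pair consistent.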
Table 2 shows the overall AED results for combinations of input types and audio branch architectures. In addition to the best mAP values, we also report the best mean area under the curve (mAUC) and d-prime metrics for a fuller picture of performance. The results with both baselines show that adding the extra voice branch improves AED performance in terms of mAP, and the observation was consistent for both types of voice inputs. As expected, the performance jump was much bigger for the CNN baseline than for TALNet, since TALNet is a more sophisticated model for the AudioSet AED task, so the supplementary information carried by the voice inputs is relatively marginal. Interestingly, the mAUC and d-prime metrics degraded for TALNet with the voice inputs, possibly because the training was only optimized for the validation mAP. We also tried doubling the number of convolutional layers in the voice branch (halving the channel size per layer to maintain the same output shape). However, no improvement was observed for either type of embedding (0.357 mAP for both emb1 and emb2), indicating that two CNN layers were sufficient for learning from the high-level voice features.
We then studied the AED effects with and without augmentation applied to the voice inputs. Table 3 shows the best validation epochs, the corresponding loss values, and the mAP for the test cases. The comparisons clearly show that augmentation is critical to improving the joint training of the two branches and maximizing model performance. Table 4 further demonstrates the effects of the individual augmentation strategies with the TALNet-based audio branch. While all techniques improved the joint training process, mixup yielded the biggest improvement. For a better comparison, we also ran a test of the original TALNet baseline with the same mixup setup applied to the log-mel inputs. However, no improvement was observed in that test (0.342 mAP), indicating that the augmentation process contributed more significantly with the extra voice inputs than with the original baseline model.
Checking the class-wise predictions of the TALNet-based models, we found that, compared to the baseline model, prediction performance with the voice embeddings led by the largest margins (0.06 mAP or more) for the following 10 classes: Hoot, Dental_drill, Yodeling, Air_horn, Gobble, Roar, Ringtone, Television, Snoring, Video_game_music. An interesting finding was that adding the extra voice inputs did not guarantee better predictions for the voice-related AudioSet classes. For example, the mAP for Speech was 0.776 / 0.778 / 0.782 with the original TALNet / TALNet+emb1 / TALNet+emb2, but the values were 0.320 / 0.314 / 0.336 for Singing and 0.209 / 0.186 / 0.206 for Conversation. A possible explanation is that the two branches were trained as an integrated process and the models were optimized for global performance, sometimes at the expense of such individual classes.
| Combination | Epoch | Train / Val loss | mAP |
|---|---|---|---|
| CNN+emb1, no aug | 11 | 9.7 / 13.3 | 0.229 |
| CNN+emb1, with aug | 39 | 12.7 / 12.2 | 0.264 |
| CNN+emb2, no aug | 13 | 8.9 / 12.8 | 0.256 |
| CNN+emb2, with aug | 45 | 12.6 / 11.7 | 0.292 |
| TALNet+emb1, no aug | 25 | 6.7 / 11.8 | 0.328 |
| TALNet+emb1, with aug | 56 | 11.4 / 10.6 | 0.360 |
| TALNet+emb2, no aug | 19 | 7.6 / 11.5 | 0.331 |
| TALNet+emb2, with aug | 71 | 11.4 / 10.6 | 0.361 |
| Strategy | Epoch | Train / Val loss | mAP |
|---|---|---|---|
| emb1+dp | 21 | 7.1 / 11.4 | 0.334 |
| emb1+tmask | 25 | 6.8 / 11.7 | 0.331 |
| emb1+mixup | 62 | 10.6 / 10.8 | 0.350 |
| emb2+dp | 20 | 7.3 / 11.4 | 0.334 |
| emb2+tmask | 19 | 7.8 / 11.5 | 0.334 |
| emb2+mixup | 68 | 11.6 / 10.7 | 0.356 |
This paper explored a novel approach to acoustic event detection that incorporates pre-trained voice embeddings into an AED pipeline. To this end, we developed a dual-branch neural network architecture for joint training on the two inputs. We then reported the overall and class-wise performance with a CNN baseline and a strong TALNet baseline developed on AudioSet. Our results showed the benefits of adding extra voice inputs to the tested models (0.292 vs. 0.134 mAP for the CNN baseline and 0.361 vs. 0.351 mAP for the TALNet baseline). Furthermore, we showed that augmentation and dropout on the voice inputs are critical to maximizing model performance with dual inputs.
- (2018) Deep learning using rectified linear units (ReLU). arXiv:1803.08375.
- Comparison of i-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments. In EUSIPCO.
- (2016) SoundNet: learning sound representations from unlabeled video. In NIPS.
- (2013) Understanding the coverage and scalability of place-centric crowdsensing. In UbiComp.
- (2018) Learning to recognize transient sound events using attentional supervision. In IJCAI.
- Speech emotion recognition 'in the wild' using an autoencoder. In INTERSPEECH.
- (2015) Exploiting spectro-temporal locality in deep learning based acoustic event detection. EURASIP Journal on Audio, Speech, and Music Processing.
- (2017) Audio Set: an ontology and human-labeled dataset for audio events. In ICASSP.
- (2016) DCASE 2016 sound event detection system based on convolutional neural network. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events.
- (2017) CNN architectures for large-scale audio classification. In ICASSP.
- (2017) Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image and Vision Computing.
- (2013) Multisensor data fusion: a review of the state-of-the-art. Information Fusion.
- (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
- (2019) Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems. arXiv:1904.03476.
- PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894.
- Audio Set classification with attention model: a probabilistic perspective. In ICASSP.
- (2015) DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In UbiComp.
- (2020) Characterizing the effect of audio degradation on privacy perception and inference performance in audio-based human activity recognition. In MobileHCI.
- (2019) Audio-based activities of daily living (ADL) recognition with large-scale acoustic embeddings from online videos. IMWUT.
- (2017) Learning embeddings for speaker clustering based on voice equality. In IEEE MLSP.
- (2020) All-in-one transformer: unifying speech recognition, audio tagging, and event detection. In INTERSPEECH.
- (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv:1706.08612.
- (2016) Recurrent neural networks for polyphonic sound event detection in real life recordings. In ICASSP.
- SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv:1904.08779.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In NIPS.
- (2014) Large-scale speaker identification. In ICASSP.
- (1996) Speaker identification via support vector classifiers. In ICASSP.
- (2018) X-vectors: robust DNN embeddings for speaker recognition. In ICASSP.
- Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.
- (2018) A survey on deep transfer learning. In ICANN.
- (2019) Joint analysis of acoustic events and scenes based on multitask learning. In IEEE WASPAA.
- (2012) Non-speech environmental sound classification using SVMs with a new set of features. International Journal of Innovative Computing, Information and Control 8 (5), pp. 3511–3524.
- (2013) An MFCC-GMM approach for event detection and classification. In WASPAA.
- (2018) Generalized end-to-end loss for speaker verification. In ICASSP.
- (2019) A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP.
- (2016) Audio-based multimedia event detection using deep recurrent neural networks. In ICASSP, pp. 2742–2746.
- (2018) Multi-level attention model for weakly supervised audio classification. arXiv:1803.02353.
- (2017) Mixup: beyond empirical risk minimization. arXiv:1710.09412.
- Cross-task pre-training for on-device acoustic scene classification. arXiv:1910.09935.
- (2013) Analyzing noise robustness of MFCC and GFCC features in speaker identification. In ICASSP, pp. 7204–7208.