Feature representations learned with deep architectures from large amounts of training data have been shown to be powerful for a variety of tasks beyond those they were trained for. The pre-trained features from the BERT model, for example, have been successfully used in multiple Natural Language Processing (NLP) tasks such as question answering, text classification, and language inference. In the image domain, large-scale models pre-trained on ImageNet, such as AlexNet and VGG, have achieved competitive results in image classification and assist in many visual feature extraction tasks.
There exist several pre-trained models in the audio domain, such as L3-Net, VGGish, and Jukebox. L3-Net leverages both the audio and the visual information provided by video to train an audio feature extraction network, while the VGGish model is pre-trained on a large-scale audio dataset, AudioSet. Jukebox adopts a language model similar to those used in NLP and is pre-trained on music audio only. These features have been successfully applied to various Music Information Retrieval (MIR) tasks, including weakly-supervised instrument recognition, cross-modal representation learning [32, 33], music auto-tagging [18, 2], music emotion recognition [17, 1], and music genre classification [25, 14]. This successful application to a multitude of different tasks, combined with the comparably compact representation, implies that these features are able to capture many task-agnostic properties of audio and music signals.
Although these transfer learning approaches are used successfully for such a wide range of tasks, there are potential drawbacks to using pre-trained features directly as the input representation. First, the extracted compact representation might lack task-specific information, which, in contrast, a task-specific representation learned from spectrogram input will include. While it is possible to update and fine-tune the pre-trained model with task-specific datasets to capture this relevant information, the computational burden during training is increased due to the high complexity of the feature extractor. Second, the time resolution of pre-trained features is fixed, forcing any system utilizing the features to the same time resolution. VGGish features, for example, have a time resolution of approx. 1 s. Tasks that need a higher resolution, such as beat tracking, might not be able to utilize pre-trained feature representations. Third, extractors such as VGGish and L3-Net rely on fixed input representations such as Mel spectrograms to extract features. Several recent studies, however, have shown that other input representations such as harmonic representations or multi-rate PCEN achieve superior performance on several audio-related tasks. Fourth, the deep architectures of pre-trained feature extractors also increase the computational workload and the execution time at inference. Small real-time devices with limited computational power, for example, might not be able to benefit from the pre-trained representations.
To address these drawbacks, we propose to incorporate the task-agnostic information of pre-trained representations into a new, smaller-scale model without using them as input to the model. Our approach shows some similarity to feature-based knowledge distillation approaches used in teacher-student learning, where the pre-trained representations are used to regularize the embedding space during training. Thus, the pre-trained representations provide condensed, task-agnostic information to help shape the embedding, but are not required during inference. The architecture of the original model does not need to change, so that the input representation and its time resolution can be chosen freely for the task at hand. Such feature-based knowledge transfer approaches are rarely explored in the music/audio domain. Moreover, extending commonly-used regularization loss functions, we propose two loss functions based on cosine distance to match the embeddings. Cosine distance has been commonly used in several metric learning scenarios, such as few-shot learning [29, 9] and recommender systems.
To summarize, the main contributions of this paper are: (i) the introduction of two regularization methods that integrate the information of pre-trained feature representations during training without increasing the computational complexity during inference, and (ii) an investigation of the suitability of the two representations VGGish and OpenL3, and their combination, for this regularization, along with a detailed study of the proposed regularization methods for state-of-the-art (SOTA) models.
We propose two approaches to incorporate pre-trained feature representations into existing model architectures. In the following sections, we first introduce the general training pipeline of our regularization methods and the two pre-trained representations we use in our experiment. Then, details of two proposed regularization methods will be presented.
II-A System Description
Figure 1 shows the general structure of our training pipeline. For the regularization to be applicable, we assume the deep-learning model to be composed of a similar pipeline with an encoder extracting an intermediate representation and a decoder predicting the output based on this representation. The intermediate representation, also referred to as the embedding, is a 2-dimensional matrix spanned by the number of time frames and the dimensionality of the embedding.
The pre-trained features are likewise a 2-dimensional representation spanned by the number of time frames and the dimensionality of the learned features. Although we only investigate two widely used representations extracted with deep architectures, it is conceivable that any kind of custom feature in this format can be used.
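As a minimal numpy sketch of the assumed pipeline structure (the encoder and decoder stand-ins, the downsampling factor, the embedding width, and the number of tags are all hypothetical placeholders, not the actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(spec):
    # Stand-in for a convolutional encoder: maps an input spectrogram
    # (n_bins, n_frames) to a 2-D embedding of shape (T, D).
    T, D = spec.shape[1] // 4, 64           # hypothetical downsampling / width
    return rng.standard_normal((T, D))

def decoder(emb):
    # Stand-in for the decoder: pools over time and predicts tag scores.
    n_tags = 50
    W = rng.standard_normal((emb.shape[1], n_tags))
    return emb.mean(axis=0) @ W             # (n_tags,)

spec = rng.standard_normal((128, 400))      # mel-like input, 400 frames
emb = encoder(spec)                         # intermediate representation (T, D)
scores = decoder(emb)
print(emb.shape, scores.shape)              # (100, 64) (50,)
```

The regularization described below acts only on `emb`; the decoder and the rest of the pipeline stay unchanged.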
II-B Learned Features
We investigate three regularization inputs in this study: VGGish, OpenL3, and the combination of both. VGGish and OpenL3 features are chosen due to their proven success in several audio-related downstream tasks [11, 33, 18, 25]. Both representations are extracted by models trained on large-scale audio datasets. VGGish features are extracted by the VGGish model pre-trained on AudioSet to perform audio classification; they have a time resolution of approx. 1 s (no overlap) and a feature dimensionality of 128. The features are PCA-transformed (with whitening) and quantized to 8 bits. The OpenL3 features are extracted from L3-Net pre-trained on a subset of AudioSet. These features are also 2-dimensional representations, with a time resolution of 0.1 s (no overlap) and a dimensionality of 512.
The combination of these two features is achieved by a simple concatenation along the feature dimension. To match the temporal resolution of OpenL3, the VGGish features are repeated 10 times along the temporal dimension. The resulting combined representation has a time resolution of 0.1 s and a dimensionality of 640.¹
¹ More sophisticated interpolation methods for upsampling to match the time resolutions will be explored in future work.
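The combination step can be sketched in a few lines of numpy (the feature dimensionalities of 128 for VGGish and 512 for OpenL3 are the commonly reported values; random arrays stand in for the actual extracted features):

```python
import numpy as np

clip_seconds = 10
vggish = np.random.rand(clip_seconds, 128)        # 1 s resolution (VGGish)
openl3 = np.random.rand(clip_seconds * 10, 512)   # 0.1 s resolution (OpenL3)

# Repeat each VGGish frame 10x so both features share a 0.1 s grid,
# then concatenate along the feature dimension.
vggish_up = np.repeat(vggish, 10, axis=0)         # (100, 128)
combined = np.concatenate([vggish_up, openl3], axis=1)
print(combined.shape)                             # (100, 640)
```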
[Table I: Results of the baseline and proposed methods on DCASE 17 (F1) and MTAT (PR-AUC); the baseline by Won et al. achieves 0.547 (F1) and 0.465 (PR-AUC).]
II-C Proposed Feature Integration Methods
Since VGGish, OpenL3, and the combined features have a different time resolution than the embeddings of most existing models, we repeat or average-pool the features to match the number of embedding frames per 1 s or 0.1 s, respectively. This approach is based on the assumption that a slight misalignment in time will not influence the result too much.
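A minimal sketch of this alignment step (the helper name and the branching logic are our illustration, not the paper's exact implementation):

```python
import numpy as np

def match_resolution(feat, n_emb_frames):
    # Repeat or average-pool pre-trained features along time so that
    # they align with the model's embedding frames (hypothetical helper).
    n_feat = feat.shape[0]
    if n_emb_frames >= n_feat:                 # embeddings are finer: repeat
        reps = int(np.ceil(n_emb_frames / n_feat))
        return np.repeat(feat, reps, axis=0)[:n_emb_frames]
    # embeddings are coarser: average-pool groups of feature frames
    group = n_feat // n_emb_frames
    return feat[:group * n_emb_frames].reshape(n_emb_frames, group, -1).mean(axis=1)

feat = np.random.rand(100, 512)                # e.g. OpenL3 at 0.1 s
print(match_resolution(feat, 400).shape)       # (400, 512)
print(match_resolution(feat, 25).shape)        # (25, 512)
```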
The design of the regularization methods is based on two assumptions. First, pre-trained features might contain information that is useful for various tasks but cannot be adequately represented in the unregularized embedding space (e.g., due to insufficient training data). Second, pre-trained features have strong discriminative power. The proposed regularization methods attempt to transfer the knowledge from the pre-trained feature space into the embedding vectors by adding structure to them, resulting in a more separable embedding space.
Both methods add an extra loss term $\mathcal{L}_\mathrm{reg}$ for network training:
$$\mathcal{L} = \mathcal{L}_\mathrm{task} + \lambda \mathcal{L}_\mathrm{reg},$$
with the hyperparameter $\lambda$ adjusting the contribution of $\mathcal{L}_\mathrm{reg}$ to the overall loss $\mathcal{L}$. The following methods are proposed to incorporate learned features into the embeddings:
Con-Reg: Con-Reg aims at regularizing the embedding space so that its layout becomes more similar to the space of the pre-trained features. To do so, we utilize the features extracted from the audio and add an extra loss term based on cosine distance² to minimize the distance between embeddings and pre-trained features:
$$\mathcal{L}_\mathrm{reg} = \frac{1}{T} \sum_{t=1}^{T} d_{\cos}\big(e_t, h(f_t)\big),$$
where $e_t$ and $f_t$ denote the embedding and the pre-trained feature at time frame $t$, and $h$ represents a 1D CNN with kernel size 1 to transform the feature dimensionality to match the embedding dimensionality, allowing us to compute the cosine distance between the two.
² Pilot experiments using an alternative distance function did not lead to competitive results.
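A numpy sketch of Con-Reg (a 1x1 convolution over time is equivalent to a per-frame linear map, so a single weight matrix `W` stands in for the learnable transform; all dimensions are illustrative):

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    # 1 - cosine similarity, computed per time frame
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return 1.0 - num / den

def con_reg_loss(emb, feat, W):
    # Project the pre-trained features to the embedding dimensionality
    # and penalize the mean cosine distance to the embeddings.
    proj = feat @ W                              # (T, D_emb)
    return cosine_distance(emb, proj).mean()

rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 64))             # embeddings, T=100 frames
feat = rng.standard_normal((100, 640))           # pre-trained features
W = rng.standard_normal((640, 64)) * 0.01        # learnable projection weights
print(round(con_reg_loss(emb, feat, W), 4))
```

In training, `W` would be optimized jointly with the network, and the loss would be added to the task loss with weight λ.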
Dis-Reg: Dis-Reg is a distance-based regularization. Similar to Con-Reg, the embedding space is regularized with an additional loss. In this case, however, the additional loss term aims at forcing the distances between pairs of embedding vectors to be similar to the distances of the corresponding pairs of pre-trained features:
$$\mathcal{L}_\mathrm{reg} = \frac{1}{|P|} \sum_{(i,j) \in P} \big( d_{\cos}(e_i, e_j) - d_{\cos}(f_i, f_j) \big)^2,$$
where $d_{\cos}(e_i, e_j)$ represents the cosine distance between the embeddings of samples $i$ and $j$ ($i \neq j$), $d_{\cos}(f_i, f_j)$ the distance of the two corresponding learned features, and $P$ the set of sample pairs.
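A numpy sketch of Dis-Reg over a batch of clip-level vectors (batch size and dimensions are illustrative):

```python
import numpy as np

def pairwise_cosine_distance(X, eps=1e-8):
    # (N, N) matrix of cosine distances between all row pairs of X
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    return 1.0 - Xn @ Xn.T

def dis_reg_loss(emb, feat):
    # Match the pairwise cosine-distance structure of the embeddings
    # to that of the pre-trained features, over all pairs with i != j.
    d_emb = pairwise_cosine_distance(emb)
    d_feat = pairwise_cosine_distance(feat)
    n = emb.shape[0]
    off = ~np.eye(n, dtype=bool)
    return np.mean((d_emb[off] - d_feat[off]) ** 2)

rng = np.random.default_rng(0)
emb = rng.standard_normal((16, 64))        # clip-level embeddings
feat = rng.standard_normal((16, 640))      # clip-level pre-trained features
print(round(dis_reg_loss(emb, feat), 4))
```

Note that, unlike Con-Reg, no learned projection is needed here, since only distances within each space are compared.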
II-D Model Architecture
The harmonic CNN model proposed by Won et al. is chosen as our experimental system. This model utilizes learnable harmonic filters to capture the inherent harmonic structure of the input audio and achieves SOTA results on music tagging, sound event tagging, and keyword spotting. The model is composed of seven residual blocks acting as an encoder to extract embeddings from the harmonic representation. Two linear layers are then used as a decoder to predict the output from the embeddings.
III-A Baseline Feature Integration Methods
To provide a baseline reference for comparison with the proposed regularization methods, we also present the results of three baseline methods incorporating pre-trained features:
Features-only: The simplest way of incorporating pre-trained features is to use them directly as the input to the decoder. The decoder of the original model is adjusted to fit the dimensionality of the features.
Concat: Concat simply concatenates embeddings and pre-trained features along the feature dimension. The pre-trained features therefore act as a supplement to the embeddings, and the embedding space is transformed into a joint feature space. The decoder then has the flexibility to decide how to leverage the information from the joint feature space. Since the concatenated embeddings have a larger feature dimension than the original embeddings, the first layer of the decoder is adjusted to fit the dimensionality of the new intermediate representation.
FiLM: A Feature-wise Linear Modulation (FiLM) layer was originally proposed as a general-purpose conditioning method to assist visual reasoning, which is difficult for standard deep-learning methods. This success has led to FiLM being adopted in several audio-related tasks [27, 16]. The FiLM layer influences neural network computation via a simple, feature-wise affine transformation based on conditioning information:
$$\tilde{e} = \gamma(f) \odot e + \beta(f),$$
where the scale $\gamma$ and shift $\beta$ are calculated based on the pre-trained features $f$. We use a simple linear layer for each of the functions $\gamma(\cdot)$ and $\beta(\cdot)$ to compute the transformation parameters. The resulting embeddings after the transformation are $\tilde{e}$.
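The FiLM baseline can be sketched as follows in numpy (the weight matrices model the two linear layers and are illustrative; in practice they would be learned):

```python
import numpy as np

def film(emb, feat, W_gamma, b_gamma, W_beta, b_beta):
    # Feature-wise affine transform: scale (gamma) and shift (beta)
    # are predicted from the conditioning features by linear layers.
    gamma = feat @ W_gamma + b_gamma             # (T, D_emb)
    beta = feat @ W_beta + b_beta                # (T, D_emb)
    return gamma * emb + beta

rng = np.random.default_rng(0)
T, d_emb, d_feat = 100, 64, 640
emb = rng.standard_normal((T, d_emb))
feat = rng.standard_normal((T, d_feat))
W_gamma = rng.standard_normal((d_feat, d_emb)) * 0.01
W_beta = rng.standard_normal((d_feat, d_emb)) * 0.01
out = film(emb, feat, W_gamma, np.ones(d_emb), W_beta, np.zeros(d_emb))
print(out.shape)                                 # (100, 64)
```

With zero weights, gamma = 1 and beta = 0, so FiLM reduces to the identity; the network can thus learn to ignore the conditioning if it is unhelpful.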
We investigate our proposed methods on two different tasks to demonstrate the influence of the regularization: music tagging and sound event tagging as also used by Won et al. .
III-B1 Music Auto-Tagging
Music auto-tagging is a multi-label classification problem aiming to predict tags for a musical piece. The dataset used is MagnaTagATune (MTAT). This dataset has approx. 21k audio clips, each around 30 s long. The dataset contains a variety of tags, including genre, mood, and instrumentation. The top 50 tags are chosen for label prediction. The results are reported using the Area Under the Precision-Recall Curve (PR-AUC). The OpenL3 features are extracted in the music subset mode.
III-B2 Sound Event Tagging
Sound event tagging is an important task that has been included in the DCASE challenges for many years. The goal of this task is to detect audio events in a sound excerpt. Following Won et al., we choose “Task 4: Large-scale weakly supervised sound event detection for smart cars” from the DCASE 2017 challenge as our target task. The dataset is a subset of AudioSet and contains approx. 53k audio excerpts covering 17 classes. Each excerpt is around 10 s long. The results are reported using the average of instance-level F1-scores with a threshold value of 0.1. In this case, the OpenL3 features are extracted in the environmental subset mode.
III-C Experimental Setup
The training setup, including dataset cleaning and splitting as well as the selection of optimizer and learning rate, is mirrored from Won et al.'s work. All methods are evaluated on each task using the VGGish, OpenL3, and combined pre-trained features. The best model is selected for testing based on the validation metric of each task. The regularization weight is tuned separately for Con-Reg and Dis-Reg on the validation set.
IV Results and Discussion
The experimental results are shown in Table I. In the following, we discuss different aspects of these results separately.
IV-1 Learned Features
It can be observed that VGGish features generally outperform the OpenL3 features. The combined features tend to perform on par with or better than the individual features. This result suggests that, although VGGish and OpenL3 were both trained on AudioSet, they encode somewhat complementary information. This is most likely due to different training strategies and hints at the possibility of combining them for other downstream tasks.
IV-2 Baseline Methods
Among all the baseline methods, FiLM and Concat generally perform better than using only pre-trained features for prediction, but are comparable to or only slightly better than Won et al.'s original system. This result indicates that some task-specific information exists in the harmonic representations but not in the pre-trained features. FiLM, in general, tends to outperform Concat.
IV-3 Proposed Methods
Both proposed methods tend to give higher results than the baselines, demonstrating the effectiveness of our proposed regularization. Since the pre-trained features alone could not achieve better performance than learning from the harmonic representation, directly incorporating the features into the embedding can act as noise during inference and degrade performance. In contrast, indirectly leveraging the pre-trained features does not have this limitation: the model is able to combine the task-specific information learned through the harmonic representation with the complementary, task-agnostic but highly meaningful information of the pre-trained features.
Con-Reg generally outperforms Dis-Reg. We observe that the Dis-Reg loss term takes larger values than the Con-Reg loss term during training. The larger loss indicates that optimizing the model to predict embedding vectors matching the pairwise distances of the pre-trained features is harder than directly minimizing the distance between embedding vectors and pre-trained features. As a result, Dis-Reg could not structure the embedding space as well as Con-Reg.
Furthermore, Con-Reg achieves better results than the SOTA model proposed by Won et al. on both the DCASE 17 and MTAT datasets.
IV-4 Amount of Training Data
We can observe that the improvement on the MTAT dataset is very limited compared to the DCASE 17 dataset. This might be due to the fact that the MTAT dataset is larger in size; the model is therefore able to learn good embeddings directly from the training data, and the additional information from the learned features is unnecessary. To study the effect of insufficient training data, we present the results of an ablation study in which we randomly select a percentage of the audio files in the training set to form training subsets of decreasing size. We use the best setting, Con-Reg with combined features, for the experiment and compare the performance with the SOTA system. The results are shown in Figure 2. We can observe that with decreasing training data, the improvement from the proposed regularization method becomes more pronounced, especially below 30% of the training data.
[Table II: Number of training and inference parameters for each method.]
IV-A Model Complexity
Table II shows the number of training and inference parameters for both the proposed and the baseline methods. Since audio files can be pre-processed to extract the learned features, the Features-only method only needs the decoder during training and therefore has the smallest number of trainable parameters. In contrast, the same method requires a large number of parameters during inference due to the large-scale VGGish/OpenL3 extractor. A similar imbalance between the number of training and inference parameters can be observed for the Concat and FiLM methods.
Comparing the baseline results from Table I with the number of baseline parameters listed in Table II, we can conclude that better results come at the cost of more extractor parameters. For example, VGGish features achieve better performance than OpenL3, but their extractor is also more complex. As mentioned above, the combination of pre-trained features can achieve better performance, but the parameter count at inference increases considerably since both extractors (VGGish and OpenL3) are required.
The proposed regularization methods have a similar number of parameters as the original model proposed by Won et al. (approx. 3.6M), for both training and inference. This is because our proposed methods only use the features in the additional regularization loss function without modifying the model architecture.
In this work, we proposed two novel regularization methods that incorporate the information from pre-trained features during training by adding a loss term restructuring the embedding space. The proposed regularization methods show improvements over baseline feature integration methods, which either use pre-trained features directly or influence the model through concatenation or a feature-wise linear transform. Furthermore, the regularized models can outperform SOTA audio classification models, with a more pronounced performance increase in the case of limited training data.
References
- Emotion and themes recognition in music utilising convolutional and recurrent neural networks. CEUR Workshop Proceedings, Vol. 2670.
- Codified audio language modeling learns useful representations for music information retrieval. Proceedings of the International Society for Music Information Retrieval Conference, 2021.
- Look, listen, and learn more: design choices for deep audio embeddings. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 3852–3856.
- Evaluation methods for musical audio beat tracking algorithms. Queen Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009.
- BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186.
- Jukebox: a generative model for music. arXiv preprint arXiv:2005.00341, 2020.
- Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2214–2220.
- Audio Set: an ontology and human-labeled dataset for audio events. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776–780.
- Dynamic few-shot visual learning without forgetting. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4367–4375.
- Knowledge distillation: a survey. International Journal of Computer Vision 129 (6), 2021, pp. 1789–1819.
- An attention mechanism for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference, 2019, pp. 83–90.
- Contextual and sequential user embeddings for large-scale music recommendation. Proceedings of the ACM Conference on Recommender Systems, 2020, pp. 53–62.
- CNN architectures for large-scale audio classification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
- Large-scale weakly-supervised content embeddings for music recommendation and tagging. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 8364–8368.
- Sound event detection in urban audio with single and multi-rate PCEN. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 880–884.
- Neural music synthesis for flexible timbre control. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 176–180.
- Comparison and analysis of deep audio embeddings for music emotion recognition. AAAI Workshop on Affective Content Analysis, 2021.
- Receptive field regularization techniques for audio classification and tagging with deep convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1987–2000.
- ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
- Evaluation of algorithms using games: the case of music tagging. Proceedings of the International Society for Music Information Retrieval Conference, 2009, pp. 387–392.
- Fully convolutional networks for semantic segmentation. Proceedings of the Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- DCASE 2017 challenge setup: tasks, datasets and baseline system. DCASE 2017 Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.
- FiLM: visual reasoning with a general conditioning layer. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
- Musicnn: pre-trained convolutional neural networks for music audio tagging. Proceedings of the International Society for Music Information Retrieval Conference, 2019.
- Machine learning for music genre: multifaceted review and experimentation with audioset. Journal of Intelligent Information Systems 55 (3), 2020, pp. 469–499.
- Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
- End-to-end sound source separation conditioned on instrument labels. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 306–310.
- DCASE 2019 challenge task 5: CNN+VGGish. Technical Report, DCASE2019 Challenge, 2019.
- Matching networks for one shot learning. Advances in Neural Information Processing Systems 29, 2016.
- Data-driven harmonic filters for audio representation learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 536–540.
- The Aalto system based on fine-tuned AudioSet features for DCASE2018 task 2: general purpose audio tagging. Technical Report, DCASE2018 Challenge, 2018.
- Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 15 (1), 2019, pp. 1–16.
- Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA. International Symposium on Multimedia, 2018, pp. 143–150.