We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.
Speech and music activity detection (SMAD) is a long-studied problem and has been included in the Music Information Retrieval Evaluation eXchange (MIREX) competition for several years. Like other Music Information Retrieval (MIR) tasks, the most recent improvements to SMAD rely on data-driven approaches [12, 5, 2, 4, 3, 7]. However, due to copyright issues, many of these systems were trained on private datasets, which impedes the reproducibility of their results (e.g., the radio datasets used in [5] and [12]). Although several publicly available datasets have been proposed for training or evaluating SMAD systems, as shown in Table 2, they suffer from several drawbacks. GTZAN [11], MUSAN [10], SSMSC [8], and Muspeak [13] only have non-overlapping speech or music segments. OpenBMAT [6] and ORF TV [9] only provide music labels, while the original AVASpeech dataset [1] only has speech labels. In real-world audio, where speech and music regularly co-occur, these datasets might therefore not be suitable training sources.
To address this data limitation, we propose a supplementary dataset, AVASpeech-SMAD. The dataset is an extension of the AVASpeech dataset proposed by Chaudhuri et al. [1]. The original AVASpeech dataset contains only speech activity labels; we extend it by manually labelling the music activities. We expect the proposed dataset to be used for training and evaluation of future SMAD systems. The dataset is open-sourced in the GitHub repository: https://github.com/biboamy/AVASpeech_Music_Labels
Table 1: Statistics of the speech, music, and overlapping labels per clip.

| | Avg (%) | Min (%) | Max (%) |
|---|---|---|---|
| Speech | | | |
| Music | | | |
| Both speech and music | | | |
Table 2: Comparison of publicly available datasets for speech and music activity detection.

| Dataset | Music Labels | Speech Labels | Overlap | # Instances | Duration (hrs) |
|---|---|---|---|---|---|
| GTZAN Speech and Music [11] | ✓ | ✓ | No | | 1.1 |
| MUSAN [10] | ✓ | ✓ | No | | 109 |
| Scheirer & Slaney Music Speech (SSMSC) [8] | ✓ | ✓ | No | | 1 |
| Muspeak [13] | ✓ | ✓ | No | | 5 |
| OpenBMAT [6] | ✓ | ✗ | Yes | | 27.5 |
| ORF TV [9] | ✓ | ✗ | Yes | | 9 |
| AVASpeech-SMAD (Proposed) | ✓ | ✓ | Yes | | 45 |
An overview of the proposed dataset, compared with other publicly available datasets, is shown in Table 2. Ours is the only one containing overlapping speech and music frame-level labels. The dataset includes a variety of content, languages, genres, and production quality. The audio data, labels, and annotation process are discussed in the following sections.
The original dataset contains excerpts of -minute clips taken from YouTube videos, with a total duration of 45 hours. Each audio file is stereo and was sampled at with per sample. Due to copyright restrictions, we cannot distribute the audio files of this dataset. Instead, we include scripts for downloading and preprocessing the audio files in our GitHub repository.
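Since the audio cannot be redistributed, each user must reconstruct it locally. The sketch below illustrates one way to do this with yt-dlp and ffmpeg; the video IDs, output directory, and target sample rate are placeholders and not necessarily the exact parameters used by the scripts in our repository.

```python
# Minimal sketch: fetch the audio track of a YouTube clip with yt-dlp and
# convert it to stereo WAV with ffmpeg. The output directory and sample rate
# are assumptions for illustration, not the exact values used by our scripts.
import subprocess
from pathlib import Path

OUT_DIR = Path("audio")   # hypothetical output directory
TARGET_SR = 44100         # assumed target sample rate (Hz)

def fetch_clip(video_id: str) -> Path:
    """Download a clip's audio stream and convert it to a stereo WAV file."""
    OUT_DIR.mkdir(exist_ok=True)
    raw_path = OUT_DIR / f"{video_id}.m4a"
    wav_path = OUT_DIR / f"{video_id}.wav"

    # Download the best available audio-only stream.
    subprocess.run(
        ["yt-dlp", "-f", "bestaudio", "-o", str(raw_path),
         f"https://www.youtube.com/watch?v={video_id}"],
        check=True,
    )
    # Resample and convert to stereo WAV.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(raw_path),
         "-ac", "2", "-ar", str(TARGET_SR), str(wav_path)],
        check=True,
    )
    return wav_path
```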
The speech labels were taken as-is from the original AVASpeech dataset, while the new music labels were manually annotated. The statistics of the labels are shown in Table 1. The dataset covers a variety of cases: some clips contain barely any music, while others consist mostly of music. Some clips have no regions with overlapping music and speech, while others have over of their duration containing both music and speech. Detailed statistics for each clip can be found in our GitHub repository.
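As an illustration of how the per-clip figures in Table 1 can be derived, the sketch below computes the percentage of frames containing speech, music, and both from frame-level boolean labels; the array-based representation and variable names are assumptions for this example, not a description of the released label format.

```python
# Illustrative computation of per-clip label statistics (cf. Table 1), assuming
# each clip's labels are frame-level boolean arrays of equal length.
import numpy as np

def clip_statistics(speech: np.ndarray, music: np.ndarray) -> dict:
    """Percentage of frames containing speech, music, and both."""
    assert speech.shape == music.shape
    n_frames = len(speech)
    return {
        "speech_pct": 100.0 * speech.sum() / n_frames,
        "music_pct": 100.0 * music.sum() / n_frames,
        "both_pct": 100.0 * (speech & music).sum() / n_frames,
    }

# Aggregating over all clips yields the average / min / max values of Table 1, e.g.:
# stats = [clip_statistics(s, m) for s, m in all_clips]
# avg_music = np.mean([c["music_pct"] for c in stats])
```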
The annotation process of the proposed dataset is shown in Fig. 1 and consists of four steps: music detection, a first manual annotation check, cross-validation, and a second manual annotation check. Steps involving human annotators are highlighted in orange. Seven MIR students and researchers, all of whom are also musicians with varying levels of experience, volunteered as annotators. An internal algorithm was first used to pseudo-label the music regions. The annotators were then asked to manually annotate regions with music activity, with the pseudo-labels acting as a rough guide. Each annotator was assigned between and audio clips in which to annotate the active music regions. Sonic Visualiser (https://www.sonicvisualiser.org, last accessed 04/30/2021) was used to mark the active regions, and all annotators used headphones during the annotation process. To ensure the consistency of the labels and avoid ambiguity, the guidelines for distinguishing music from speech are listed in the GitHub repository.
After the annotations were completed, all labels were cross-validated against the speech labels provided by the original AVASpeech dataset. Since the original labels contain "Speech with music" and "Clean speech" classes, we expect the former to contain music and the latter to contain no music. Any discrepancies between our labels and the original labels were detected algorithmically: regions labeled as containing music in the original labels but not in ours, and regions labeled as containing no music in the original labels but labeled as music in ours. If a discrepancy was shorter than three seconds, we automatically modified our labels to match the original labels. If a discrepancy was longer than three seconds, we conducted a manual review of the region: each such region was randomly assigned to two additional annotators, and a majority vote determined its final labels.
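A minimal sketch of this automatic cross-check is given below, assuming the labels are represented as frame-level boolean arrays; the frame rate, function names, and resolution strategy are assumptions for illustration, not the implementation used to build the dataset.

```python
# Sketch of the automatic cross-check between the original AVASpeech speech
# classes and our music labels. Labels are assumed to be frame-level boolean
# arrays; the frame rate and the three-second tolerance mirror the text above.
import numpy as np

def find_discrepancies(original_music: np.ndarray,
                       new_music: np.ndarray,
                       frame_rate: float = 10.0,
                       tolerance_sec: float = 3.0):
    """Split disagreeing regions into (auto_fix, needs_review) lists.

    original_music: True where the original label implies music
                    ("Speech with music"), False for "Clean speech".
    new_music:      True where our annotators marked music activity.
    """
    disagreement = original_music != new_music

    # Group contiguous disagreeing frames into (onset, offset) regions.
    regions, start = [], None
    for i, d in enumerate(disagreement):
        if d and start is None:
            start = i
        elif not d and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(disagreement)))

    auto_fix, needs_review = [], []
    for onset, offset in regions:
        duration = (offset - onset) / frame_rate
        # Short discrepancies are resolved automatically in favour of the
        # original labels; longer ones are flagged for manual review.
        (auto_fix if duration < tolerance_sec else needs_review).append((onset, offset))
    return auto_fix, needs_review
```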
We chose two existing state-of-the-art (SOTA) systems to evaluate on our proposed dataset. The first is the CRNN-based detector proposed in [12], which was trained on a combination of synthetic data and radio broadcasts. The second is inaSpeechSegmenter, proposed in [3]; the segmenter has a CNN architecture and splits audio into homogeneous music, speech, and noise regions. The sed_eval toolbox is used to perform segment-level evaluation, as in the MIREX 2018 competition (https://www.music-ir.org/mirex/wiki/2018:Music_and/or_Speech_Detection, last accessed 04/30/2021). The results are shown in Table 3 for future reference.
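For reproducibility, the snippet below sketches how segment-level metrics can be computed with sed_eval; the event lists, file name, and 10 ms time resolution are illustrative assumptions, not the exact evaluation configuration used by MIREX 2018.

```python
# Sketch of segment-level evaluation with the sed_eval toolbox. The event
# lists, file name, and time resolution below are illustrative placeholders.
import sed_eval
import dcase_util

reference = dcase_util.containers.MetaDataContainer([
    {'event_label': 'music', 'event_onset': 0.0, 'event_offset': 12.3, 'file': 'clip_001.wav'},
    {'event_label': 'speech', 'event_onset': 5.0, 'event_offset': 20.0, 'file': 'clip_001.wav'},
])
estimated = dcase_util.containers.MetaDataContainer([
    {'event_label': 'music', 'event_onset': 0.4, 'event_offset': 11.9, 'file': 'clip_001.wav'},
    {'event_label': 'speech', 'event_onset': 4.8, 'event_offset': 19.6, 'file': 'clip_001.wav'},
])

metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=['music', 'speech'],
    time_resolution=0.01,  # assumed resolution for this example
)
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
print(metrics.result_report_class_wise())  # per-class precision/recall/F-measure
```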
In this work, we proposed a supplementary dataset for SMAD. The dataset not only contains overlapping speech and music frame-level labels, but also covers a variety of content. In our benchmark experiments, one of the SOTA systems achieved F1-scores (80% and 77% for music and speech, respectively) that are slightly lower than the 85% reported for both music and speech on the MIREX test sets in [12]. This result suggests that our proposed dataset presents new challenges to existing models and can complement existing datasets (e.g., the MIREX test sets) with diverse content and frame-level labels. We expect this dataset to serve as a new resource and reference for future SMAD research.
K. N. Watcharasupat and J. Lee acknowledge the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore. We also gratefully acknowledge Professor Alexander Lerch and the Music Informatics Group for supporting this research by providing a Titan X GPU for the experiments.
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. EURASIP Journal on Audio, Speech, and Music Processing, 2019(1), pp. 1–18.
Music detection from broadcast contents using convolutional neural networks with a mel-scale kernel. EURASIP Journal on Audio, Speech, and Music Processing, 2019(1), pp. 1–12.