
AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence

by   Yun-Ning Hung, et al.
Georgia Institute of Technology

We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.





1 Introduction

Speech and music activity detection (SMAD) is a long-studied problem and has been included in the Music Information Retrieval Evaluation eXchange (MIREX) competition for several years. Like other Music Information Retrieval (MIR) tasks, the most recent improvements to SMAD rely on data-driven approaches [12, 5, 2, 4, 3, 7]. However, due to copyright issues, many of these systems were trained on private datasets, which impedes the reproducibility of their results (e.g., the radio datasets used in [5] and [12]). Although several publicly available datasets have been proposed for training or evaluating SMAD systems, as shown in Table 2, they suffer from several drawbacks. GTZAN [11], MUSAN [10], SSMSC [8], and Muspeak [13] only contain non-overlapping speech or music segments. OpenBMAT [6] and ORF TV [9] only provide music labels, while the original AVASpeech dataset [1] only provides speech labels. In the real world, where speech and music regularly co-occur, these datasets might not be suitable training sources.

To address this data limitation, we propose a supplementary dataset, AVASpeech-SMAD. The dataset is an extension of the AVASpeech dataset proposed by Chaudhuri et al. [1]. The original AVASpeech dataset only contains speech activity labels; we extend it by manually labelling the music activities. We expect the proposed dataset to be used for training or evaluating future SMAD systems. The dataset is open-sourced in our GitHub repository.

Table 1: Minimum, maximum, and average percentages of regions with speech only, with music only, and with both speech and music for each audio sample.
| Dataset | Music Labels | Speech Labels | Overlap | Duration (hrs) |
|---|---|---|---|---|
| GTZAN Speech and Music [11] | Yes | Yes | No | 1.1 |
| MUSAN [10] | Yes | Yes | No | 109 |
| Scheirer & Slaney Music Speech (SSMSC) [8] | Yes | Yes | No | 1 |
| Muspeak [13] | Yes | Yes | No | 5 |
| OpenBMAT [6] | Yes | No | Yes | 27.5 |
| ORF TV [9] | Yes | No | Yes | 9 |
| AVASpeech-SMAD (Proposed) | Yes | Yes | Yes | 45 |

Table 2: Metadata of the existing datasets compared to our proposed AVASpeech-SMAD dataset.

2 Dataset

The statistics of the dataset are shown in Table 2. Compared to other publicly available datasets, ours is the only one containing overlapping speech and music frame labels. The dataset includes a variety of content, languages, genres and production quality. The audio data, labels, and annotation process are discussed in the following sections.

2.1 Audio & Labels

The original dataset contains excerpts taken from YouTube videos with a total duration of 45 hours. Each audio file is stereo. Due to copyright restrictions, we cannot distribute the audio files ourselves. Instead, we include the scripts for downloading and preprocessing the audio files in our GitHub repository.

The speech labels were taken as-is from the original AVASpeech dataset, while the new music labels were manually annotated. The statistics of the labels are shown in Table 1. The dataset covers a variety of cases: some samples barely contain any music while others consist mostly of music, and some have no regions with overlapping music and speech while others are dominated by regions containing both. Detailed statistics for each clip can be found in our GitHub repository.
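The per-clip statistics in Table 1 can be obtained by rasterising the interval labels onto a common time grid and counting frames of each region type. The sketch below is a minimal illustration of that bookkeeping, not the dataset's actual tooling; the `coverage` helper, the 10 ms grid, and the example intervals are our own assumptions.

```python
# Sketch: per-clip overlap statistics (as in Table 1) from interval labels.
# The (start, end) interval format and the example clip are illustrative
# assumptions, not the dataset's actual label files.

def coverage(speech, music, duration, hop=0.01):
    """Fraction of the clip with speech only, music only, and both,
    computed on a 10 ms frame grid."""
    n = round(duration / hop)
    s = [False] * n
    m = [False] * n
    for start, end in speech:
        for i in range(round(start / hop), min(round(end / hop), n)):
            s[i] = True
    for start, end in music:
        for i in range(round(start / hop), min(round(end / hop), n)):
            m[i] = True
    speech_only = sum(a and not b for a, b in zip(s, m)) / n
    music_only = sum(b and not a for a, b in zip(s, m)) / n
    both = sum(a and b for a, b in zip(s, m)) / n
    return speech_only, music_only, both

# Hypothetical 60 s clip: speech in [0, 30), music in [20, 50)
so, mo, bo = coverage([(0, 30)], [(20, 50)], duration=60)
print(round(so, 2), round(mo, 2), round(bo, 2))  # 0.33 0.33 0.17
```

Averaging these three fractions over all clips yields the kind of summary reported in Table 1.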

Figure 1: Overview of the annotation process.

2.2 Annotation Process

The annotation process of the proposed dataset is shown in Fig. 1. It consists of four steps: music detection, a first manual annotation check, cross-validation, and a second manual annotation check. Steps involving human annotators are highlighted in orange. Seven MIR students and researchers, all of whom are also musicians of varying experience levels, volunteered as annotators. An internal algorithm was first used to pseudo-label the music regions. The annotators were then asked to manually annotate the regions with music activity, with the pseudo-labels acting as a rough guide. Each annotator was assigned a subset of the audio clips in which to annotate active music regions. Sonic Visualiser was used to mark the active regions, and all annotators used headphones during the annotation process. To ensure label consistency and avoid ambiguity, the guidelines for distinguishing music from speech are listed in our GitHub repository.

After the annotations were completed, all labels were cross-validated against the speech labels provided by the original AVASpeech dataset. Since the original labels contain "Speech with music" and "Clean speech" classes, we can expect the former to contain music and the latter not to. Discrepancies between our labels and the original labels were detected algorithmically: regions labeled as containing music in the original labels but not in ours, and regions labeled as containing no music in the original labels but labeled as music in ours. If a discrepancy was shorter than three seconds, we automatically modified our labels based on the original labels. If a discrepancy was longer than three seconds, we conducted a manual review of the region: each such region was randomly assigned to two additional annotators, and a majority vote determined its final labels.
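The automatic part of this cross-check amounts to interval bookkeeping: rasterise both label sources, find the regions where they disagree, and triage each region against the three-second threshold. The sketch below illustrates the idea under assumed function names and a (start, end) interval format; the actual scripts live in the paper's GitHub repository.

```python
# Sketch of the automatic cross-check described above. The helper names,
# interval format, and 10 ms grid are illustrative assumptions.

def mismatch_regions(original_music, our_music, duration, hop=0.01):
    """Return intervals where music evidence derived from the original
    AVASpeech labels and our music labels disagree (on a 10 ms grid)."""
    n = round(duration / hop)

    def mask(intervals):
        m = [False] * n
        for start, end in intervals:
            for i in range(round(start / hop), min(round(end / hop), n)):
                m[i] = True
        return m

    a, b = mask(original_music), mask(our_music)
    regions, start = [], None
    for i in range(n):
        if a[i] != b[i] and start is None:
            start = i  # disagreement begins
        elif a[i] == b[i] and start is not None:
            regions.append((start * hop, i * hop))
            start = None
    if start is not None:
        regions.append((start * hop, n * hop))
    return regions

def triage(regions, threshold=3.0):
    """Short discrepancies are fixed automatically from the original
    labels; long ones go to manual review by two extra annotators."""
    auto = [r for r in regions if r[1] - r[0] < threshold]
    manual = [r for r in regions if r[1] - r[0] >= threshold]
    return auto, manual

# "Speech with music" implies music in [10, 20); our labels say [10, 15)
regions = mismatch_regions([(10, 20)], [(10, 15)], duration=30)
auto, manual = triage(regions)
print(regions, len(auto), len(manual))  # [(15.0, 20.0)] 0 1
```

In this example the five-second disagreement exceeds the threshold, so it would be routed to manual review rather than auto-corrected.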

Table 3: The segment-level evaluation (%) on our AVASpeech-SMAD dataset using two SOTA systems (F1, precision, and recall for both music and speech).

3 Benchmark

We chose two existing SOTA systems to evaluate on the proposed dataset. The first is the CRNN-based detector proposed in [12], which was trained on a combination of synthetic data and radio broadcasts. The second is the inaSpeechSegmenter proposed in [3]; this segmenter has a CNN architecture and can split audio into homogeneous music, speech, and noise regions. The sed_eval toolbox was used to perform segment-level evaluation, as in the MIREX 2018 competition. The results are shown in Table 3 for future reference.
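To make the metric concrete without depending on the toolbox, the following is a minimal pure-Python sketch in the spirit of sed_eval's segment-based metrics: a segment counts as active for a class if any part of it overlaps an annotated interval, and precision/recall/F1 are computed over segment activations. The 1 s segment size and the example intervals are assumptions; the benchmark itself used the actual sed_eval implementation.

```python
# Minimal sketch of segment-based evaluation for a single class (e.g.
# "music"). Segment size and example intervals are assumptions.

def segment_metrics(reference, estimate, duration, seg=1.0):
    """Segment-level precision, recall, and F1 for one class: a segment
    is 'active' if any part of it overlaps an annotated interval."""
    n = int(duration // seg)

    def active(intervals):
        flags = []
        for k in range(n):
            s, e = k * seg, (k + 1) * seg
            flags.append(any(a < e and b > s for a, b in intervals))
        return flags

    ref, est = active(reference), active(estimate)
    tp = sum(r and p for r, p in zip(ref, est))
    fp = sum(p and not r for r, p in zip(ref, est))
    fn = sum(r and not p for r, p in zip(ref, est))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical 10 s clip: reference music [0, 6), predicted [2, 8)
p, r, f = segment_metrics([(0, 6)], [(2, 8)], duration=10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Running the same computation per class (music and speech) over all clips gives numbers of the form reported in Table 3.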

4 Conclusion

In this work, we proposed a supplementary dataset for SMAD. The dataset not only contains overlapping speech and music frame labels but also includes a variety of content. In our benchmark experiments, one of the SOTA systems achieved F1-scores (80% and 77% for music and speech, respectively) that are slightly lower than the 85% reported for both music and speech on the MIREX test sets in [12]. This result suggests that our proposed dataset presents new challenges to the model and can potentially complement existing datasets (e.g., the MIREX test sets) with diverse content and frame-level labels. We expect this dataset to serve as a new resource and reference for future SMAD research.

5 Acknowledgement

K. N. Watcharasupat and J. Lee acknowledge the support of the CN Yang Scholars Programme, Nanyang Technological University, Singapore. We also gratefully acknowledge Professor Alexander Lerch and the Music Informatics Group for supporting this research by providing a Titan X GPU for the experiments.


  • [1] S. Chaudhuri, J. Roth, D. P. Ellis, A. Gallagher, L. Kaver, R. Marvin, C. Pantofaru, N. Reale, L. G. Reid, K. Wilson, et al. (2018) AVA-speech: a densely labeled dataset of speech activity in movies. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Cited by: §1, §1.
  • [2] D. de Benito-Gorron, A. Lozano-Diez, D. T. Toledano, and J. Gonzalez-Rodriguez (2019) Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. Journal on Audio, Speech, and Music Processing 2019 (1), pp. 1–18. Cited by: §1.
  • [3] D. Doukhan, E. Lechapt, M. Evrard, and J. Carrive (2018) INA’s MIREX 2018 music and speech detection system. In 14th Music Information Retrieval Evaluation eXchange, Cited by: §1, Table 3, §3.
  • [4] B. Jang, W. Heo, J. Kim, and O. Kwon (2019) Music detection from broadcast contents using convolutional neural networks with a mel-scale kernel. Journal on Audio, Speech, and Music Processing 2019 (1), pp. 1–12. Cited by: §1.
  • [5] Q. Lemaire and A. Holzapfel (2019) Temporal convolutional networks for speech and music detection in radio broadcast. In Proceedings of the 20th International Society for Music Information Retrieval Conference, pp. 229–236. Cited by: §1.
  • [6] B. Meléndez-Catalán, E. Molina, and E. Gómez (2019) Open broadcast media audio from tv: a dataset of tv broadcast audio with relative music loudness annotations. Transactions of the International Society for Music Information Retrieval 2 (1). Cited by: Table 2, §1.
  • [7] M. Papakostas and T. Giannakopoulos (2018) Speech-music discrimination using deep visual feature extractors. Expert Systems with Applications 114, pp. 334–344. Cited by: §1.
  • [8] E. Scheirer and M. Slaney (1997) Construction and evaluation of a robust multifeature speech/music discriminator. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 1331–1334. Cited by: Table 2, §1.
  • [9] K. Seyerlehner, T. Pohle, M. Schedl, and G. Widmer (2007) Automatic music detection in television productions. In Proceedings of the 10th International Conference on Digital Audio Effects (DAFx), Cited by: Table 2, §1.
  • [10] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484. Cited by: Table 2, §1.
  • [11] G. Tzanetakis and P. Cook (2000) Marsyas: a framework for audio analysis. Organised Sound 4 (3), pp. 169–175. Cited by: Table 2, §1.
  • [12] S. Venkatesh, D. Moffat, and E. R. Miranda (2021) Investigating the effects of training set synthesis for audio segmentation of radio broadcast. Electronics 10 (7), pp. 827. Cited by: §1, Table 3, §3, §4.
  • [13] D. Wolff, T. Weyde, E. Benetos, and D. Tidhar (2015) MIREX muspeak sample dataset. Last accessed 2020-09-30. Cited by: Table 2, §1.