
User Specific Adaptation in Automatic Transcription of Vocalised Percussion

by António Ramires, et al.

The goal of this work is to develop an application that enables music producers to use their voice to create drum patterns when composing in Digital Audio Workstations (DAWs). An easy-to-use and user-oriented system capable of automatically transcribing vocalisations of percussion sounds, called LVT - Live Vocalised Transcription, is presented. LVT is developed as a Max for Live device which follows the “segment-and-classify” methodology for drum transcription, and includes three modules: i) an onset detector to segment events in time; ii) a module that extracts relevant features from the audio content; and iii) a machine-learning component that implements the k-Nearest Neighbours (kNN) algorithm for the classification of vocalised drum timbres. Due to the wide differences in vocalisations from distinct users for the same drum sound, a user-specific approach to vocalised transcription is proposed. In this perspective, a given end-user trains the algorithm with their own vocalisations for each drum sound before inputting their desired pattern into the DAW. The user adaptation is achieved via a new Max external which implements Sequential Forward Selection (SFS) for choosing the most relevant features for a given set of input drum sounds.



1 Introduction

Advances in computing performance, and the consequent possibility of real-time Digital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations (DAWs), making the creation of computer music available to the general public. Following these advances, many new instruments and interfaces for creating electronic music have surfaced. With changes in music culture, music production and the way musicians work with their instruments have also changed. In other words, the ability to invent and reinvent the way music is produced is key to progress. Consequently, new proposals are necessary, such as new techniques for the composition of music.

Within the genre of Electronic Music, sequencing drum patterns plays a critical role. However, inputting drum patterns into DAWs often requires high technical skill on the part of the user, either by physically performing the patterns on MIDI drum pads or by manually entering events in music editing software. For non-expert users both options can be very challenging, and can thus present a barrier to entry. The voice, however, is a powerful instrument of rhythm production, and can be used to express or “perform” drum patterns in a very intuitive way - so-called “beatboxing”. To leverage this concept within a computational system, our goal is to build a system that helps users (both expert musicians and amateur enthusiasts) input the rhythm patterns they have in mind into a sequencer via the automatic transcription of vocalised percussion. Our proposed tool is beneficial both from the perspective of workflow optimisation (by providing accurate real-time transcriptions) and as a means to encourage users to engage with technology in the pursuit of creative activities. From a technical standpoint, we seek to build on state-of-the-art techniques from the domain of music information retrieval (MIR) for drum transcription [Miron et al.(2013)Miron, Davies, and Gouyon, Gillet and Richard(2008)], actively targeted towards end-users and real-world music content production scenarios.

2 Methodology

A vocalised drum transcription system, LVT, which can be trained with a user's own vocalisations, is proposed. LVT is developed as a Max for Live device – a visual programming environment, based on Max, which allows users to build instruments and effects for use within the Ableton Live DAW. To develop LVT, a dataset of vocalised percussion was compiled. A group of 20 participants (11 male, 9 female) were asked to record two short vocalised percussion tracks, one identical for all participants, and the other an improvised pattern. These percussion tracks were recorded three times: on a low-quality laptop microphone, on an iPad microphone, and on a studio-quality microphone (AKG c4000b). All recorded audio tracks were manually annotated using Sonic Visualiser, a free application for viewing and analysing the contents of music audio files. The participants spanned a wide range of experience in beatboxing (from beatboxing experts to those who had never vocalised drum patterns before), and covered a wide age range. We therefore consider the annotated dataset representative of a wide range of potential users of the system, and highly heterogeneous in terms of the types of drum sounds.

Our proposed vocalised percussion transcription system was developed following a user-specific approach. LVT follows the “segment and classify” method for drum transcription [Gillet and Richard(2008)] and integrates three main elements: i) an onset detector, to identify when each drum sound occurs; ii) a component that extracts features for each event; and iii) a machine learning component to classify the drum sounds. In the Max for Live environment, onset detection is performed with AubioOnset. Feature extraction is performed in real time using existing Max objects: Zsa.mfcc characterises the timbre; Zsa.descriptors [Malt and Jourdan(2008)] provides the spectral centroid, spread, slope, decrease and rolloff features; and the zero crossing rate and number of zero crossings are calculated with the zerox object. The machine learning component is trained on the user’s preferred vocalisations, and the features which give the best results for the provided input are selected. This is achieved using the Sequential Forward Selection (SFS) method [Whitney(1971)] together with a k-Nearest Neighbours (kNN) classification algorithm, with the most significant features selected according to the accuracy obtained from testing on the training data (in our case, the annotated improvised patterns from each participant). SFS works by repeatedly selecting the most significant feature according to a specific criterion (here, the classification accuracy) and adding it to an initially empty set, until there is no improvement or no features remain.

The kNN algorithm was implemented using timbreID [Brent(2010)], and a new external for Max was developed to implement SFS. A user interface was created in Max for Live to facilitate the use of the application by end-users. A screenshot of the LVT interface is shown in Fig. 1. It demonstrates the user-specific training stage, in which a user inputs a set number of the drum timbres they intend to use, after which their vocalised percussion is transcribed and rendered as a MIDI file for subsequent synthesis.
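The greedy SFS loop described above can be sketched outside Max. The following Python fragment is an illustrative sketch only, not the timbreID/Max external implementation: function names and the leave-one-out accuracy criterion are our assumptions, standing in for the paper's accuracy measure on the training data.

```python
import numpy as np

def knn_accuracy(X, y, feat_idx, k=3):
    """Leave-one-out accuracy of a kNN classifier restricted to feat_idx."""
    Xs = X[:, feat_idx]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)  # distances to every sample
        d[i] = np.inf                           # exclude the query itself
        nn = np.argsort(d)[:k]                  # indices of the k nearest neighbours
        votes = np.bincount(y[nn])
        correct += int(np.argmax(votes) == y[i])
    return correct / len(y)

def sfs(X, y, k=3):
    """Sequential Forward Selection driven by kNN classification accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_acc = [], 0.0
    while remaining:
        # score every candidate feature added to the current set
        scores = [(knn_accuracy(X, y, selected + [f], k), f) for f in remaining]
        acc, f = max(scores)
        if acc <= best_acc:   # stop when no feature improves the accuracy
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc
```

Given a feature matrix `X` (one row per detected onset) and labels `y` (one class per drum timbre), `sfs(X, y)` returns the subset of feature columns that best separates that particular user's vocalisations, mirroring the per-user training stage.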

Figure 1: User interface of the LVT device.

To operate LVT, a user loads the device in Ableton Live and then vocalises the set of desired drum sounds they intend to use, e.g. five kick sounds followed by five snare sounds, followed by five hi-hat sounds. Once the expected number of drum sounds have been detected, the SFS algorithm then identifies the subset of features which best separate the drum sounds for the user. After training, the user can then vocalise rhythmic patterns which are automatically converted from audio to a MIDI representation in the DAW for later synthesis and editing.
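Two of the low-level descriptors mentioned above can be illustrated in plain NumPy. This is a minimal sketch under our own framing, not the Zsa.descriptors or zerox implementation, assuming a single windowed frame of audio samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of the frame's spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)
```

Percussive vocalisations differ markedly in such descriptors: a noisy “hi-hat” hiss yields a high zero crossing rate and centroid, while a low “kick” vocalisation yields low values for both, which is what makes a simple kNN over these features viable.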

3 Results

The evaluation of LVT was designed to serve two purposes: first, to understand how a user-specifically trained system performs against state-of-the-art drum transcription systems (which have been optimised over large datasets without any user-specific training), and second, to explore how LVT could improve a producer’s workflow. We compared LVT against two existing drum transcription algorithms: LDT [Miron et al.(2013)Miron, Davies, and Gouyon], and Ableton Live’s built-in “Convert Drums to MIDI” function. For validation data we used the non-improvised vocalised patterns from our annotated dataset.

To compare the accuracy of the systems we use the F-measure of the transcriptions. Then, to investigate how our system could improve a producer’s workflow, the “effort” required to obtain an accurate transcription was calculated by counting the number of editing operations needed to arrive at the desired pattern: modifying, adding, or removing a MIDI note.
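As a rough illustration of the accuracy metric, the sketch below computes precision, recall and F-measure by matching detected onset times against reference annotations. The function name and the ±50 ms tolerance window are our assumptions for illustration; the paper does not specify its matching procedure.

```python
def onset_f_measure(reference, detected, tol=0.05):
    """Precision, recall and F-measure of detected onset times (in seconds),
    matching each reference onset to at most one detection within +/- tol."""
    det = sorted(detected)
    matched, used = 0, set()
    for r in sorted(reference):
        for j, d in enumerate(det):
            if j not in used and abs(d - r) <= tol:
                matched += 1      # true positive: detection within tolerance
                used.add(j)       # each detection may match only once
                break
    precision = matched / len(det) if det else 0.0
    recall = matched / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Spurious detections lower precision and missed events lower recall, so a system that over-detects events (as discussed for the baselines below) is penalised in the F-measure even when its matched onsets are accurate.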

             Edit Operations        F-measure
           Modify  Add  Remove   Kick   Snare  Hi-hat
  Ableton    33    12    296    0.518  0.470  0.297
  LDT        52    24    206    0.538  0.204  0.419
  LVT        39     7     15    0.914  0.691  0.802

Table 1: Number of edit operations and F-measure per drum sound for the AKG microphone.

Table 1 summarises the results for the three drum transcription systems: the total number of editing operations needed to obtain the desired pattern for the testing data recorded on the studio-quality AKG c4000b microphone, and the corresponding F-measure per vocalised drum sound. The results demonstrate that, for the studio-quality microphone, vocalised drum transcription accuracy for LVT is substantially higher than for the other systems, and far fewer modifications were required to obtain the desired patterns when editing the automatic transcriptions.

To illustrate the effect of user-specific training on the performance of LVT, an example is provided in which LVT is trained on one user and tested on another, and vice-versa. When LVT is trained on a different person with different vocalisations, the accuracy of the transcription decreases, as shown in Fig. 2. The upper part of each screenshot shows the transcription for a user when trained on their own vocalisations, while the lower part corresponds to the transcription when trained on the other user. As can be seen, without user-specific training many misclassifications occur.

From these results we infer that LVT can provide a transcription closer to the ground truth than the existing state-of-the-art systems, as shown by the higher F-measure. Beyond LVT being trained per individual user, these results may also derive from the fact that LVT does not try to detect polyphonic events (more than one drum vocalisation at the same time) as the other systems do. Furthermore, LVT does not detect as many events as the other systems, which strongly reduces the number of false positives and hence raises the F-measure. The number of edit operations needed to achieve the desired transcription, presented in Table 1, shows that the end-user of the system does not have to perform as many actions when producing music, which has a positive impact on the workflow, leaving more time for creative experimentation.

Figure 2: (top) First user vocalisations trained with the second user. (bottom) Second user vocalisations trained with the first user.

4 Conclusions

In this paper, we have presented LVT – a new interface for assistive music content creation. LVT allows Ableton Live users to sequence MIDI patterns for designing and performing rhythms with their voice. Existing state-of-the-art systems, including one built into Ableton Live, are not able to transcribe vocalised percussion as effectively, because these tools are trained on general recorded drum sounds, which are typically not vocalised. Indeed, because different people vocalise drum sounds in different ways, LVT explicitly seeks to model and capture this behaviour via user-specific training. Our evaluation shows LVT to be effective for a wide range of users and vocalisations, outperforming the existing systems. Furthermore, we believe LVT can be applied to arbitrary non-pitched percussive sounds – provided that the training sound types are sufficiently different from one another, and can thus be well separated in the audio feature space using SFS.

LVT is implemented as a Max for Live device, and thus fully integrates into Ableton Live, allowing users of all ability ranges to experiment with music sequencing driven by their own personal percussion vocalisations within an easy-to-use graphical user interface.

5 Acknowledgements

This work is financed by the ERDF - European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project «POCI-01-0145-FEDER-006961», and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.

Project TEC4Growth-Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020 is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).


  • [Brent(2010)] W. Brent. A timbre analysis and classification toolkit for pure data. In Proc. of ICMC, pages 224–229, 2010.
  • [Gillet and Richard(2008)] O. Gillet and G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):529–540, March 2008.
  • [Malt and Jourdan(2008)] M. Malt and E. Jourdan. Zsa. descriptors: a library for real-time descriptors analysis. In Proc. of 5th SMC Conference, pages 134–137, 2008.
  • [Miron et al.(2013)Miron, Davies, and Gouyon] M. Miron, M. E. P. Davies, and F. Gouyon. An open-source drum transcription system for Pure Data and Max MSP. In Proc. of ICASSP, pages 221–225, May 2013.
  • [Whitney(1971)] A. W. Whitney. A direct method of nonparametric measurement selection. IEEE Trans. Comput., 20(9):1100–1103, September 1971.