The growth in computing power, and the consequent possibility of real-time Digital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations (DAWs), making the creation of computer music accessible to the general public. Following these advances, many new instruments and interfaces for creating electronic music have surfaced. With changes in music culture, music production and the way musicians work with their instruments have also changed. In other words, the ability to invent and reinvent how music is produced is key to progress. Consequently, new proposals are necessary, such as the design of new techniques for music composition.
Within the genre of electronic music, sequencing drum patterns plays a critical role. However, inputting drum patterns into DAWs often requires high technical skill on the part of the user, either by physically performing the patterns by tapping them on MIDI drum pads, or by manually entering events via music editing software. For non-expert users both options can be very challenging, and can thus present a barrier to entry. The voice, however, is an important and powerful instrument for rhythm production, and can be used to express or “perform” drum patterns in a very intuitive way – so-called “beatboxing”. To leverage this concept within a computational system, our goal is to build a system which helps users (both expert musicians and amateur enthusiasts) input the rhythm patterns they have in mind into a sequencer via the automatic transcription of vocalised percussion. Our proposed tool is beneficial both from the perspective of workflow optimisation (by providing accurate real-time transcriptions) and as a means to encourage users to engage with technology in the pursuit of creative activities. From a technical standpoint, we seek to build on state-of-the-art techniques from the domain of music information retrieval (MIR) for drum transcription [Miron et al.(2013)Miron, Davies, and Gouyon, Gillet and Richard(2008)], actively targeted towards end-users and real-world music content production scenarios.
We propose LVT, a vocalised drum transcription system which can be trained on the user’s own vocalisations. LVT is developed as a Max for Live device, built in the Max 7 visual programming environment (www.cycling74.com), which allows users to create instruments and effects for use within the Ableton Live DAW (http://www.ableton.com/en/). To develop LVT, a dataset of vocalised percussion was compiled. A group of 20 participants (11 male, 9 female) were asked to record two short vocalised percussion tracks: one identical for all participants, and the other an improvised pattern. These input percussion tracks were recorded three times: on a low quality laptop microphone, on an iPad microphone, and using a studio quality microphone (AKG c4000b). All recorded audio tracks were manually annotated using Sonic Visualiser (http://www.sonicvisualiser.org/), a free application for viewing and analysing the contents of music audio files. The participants spanned a wide range of experience in beatboxing (from beatboxing experts to those who had never vocalised drum patterns before) and covered a wide age range. Thus, we consider the annotated dataset to be representative of a wide range of potential users of the system, and highly heterogeneous in terms of the types of drum sounds.
Our proposed vocalised percussion transcription system was developed following a user-specific approach. LVT follows the “segment and classify” method for drum transcription [Gillet and Richard(2008)] and integrates three main elements: i) an onset detector – to identify when each drum sound occurs, ii) a component that extracts features for each event, and iii) a machine learning component to classify the drum sounds. In the Max for Live environment, onset detection was performed with aubioonset (https://aubio.org/manpages/latest/aubioonset.1.html). Feature extraction was performed in real-time using existing Max objects: Zsa.mfcc – to characterise the timbre, Zsa.descriptors [Malt and Jourdan(2008)] – to provide spectral centroid, spread, slope, decrease and rolloff features, and the zerox object – to compute the zero crossing rate and number of zero crossings. The machine learning component is trained on the user’s preferred vocalisations, and the features which give the best results for the provided input are selected. This is achieved using the Sequential Forward Selection (SFS) method [Whitney(1971)]
along with a k-Nearest Neighbours (k-NN) classification algorithm, with the most significant features selected according to the accuracy obtained when testing on the training data (in our case, the annotated improvised patterns from each participant). SFS works by repeatedly selecting the most significant feature according to a specific criterion (in this case the classification accuracy) and adding it to an initially empty set, until no improvement is obtained or no features remain. The k-NN algorithm was implemented using timbreID [Brent(2010)], and a new Max external was developed to implement SFS. A user interface was created in Max for Live to facilitate the use of the application by end-users. A screenshot of the LVT interface is shown in Fig. 1. It demonstrates the user-specific training stage – where a user inputs a set number of the drum timbres they intend to use, after which their vocalised percussion is transcribed and rendered as a MIDI file for subsequent synthesis.
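The SFS and k-NN combination described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the timbreID object or the Max external used in LVT, and the leave-one-out accuracy criterion is our simplification of “testing on the training data”:

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    vote = Counter(labels[i] for i in order[:k])
    return vote.most_common(1)[0][0]

def accuracy(train, labels, feats, k=3):
    """Leave-one-out k-NN accuracy using only the feature indices in `feats`."""
    correct = 0
    for i in range(len(train)):
        tr = [[train[j][f] for f in feats] for j in range(len(train)) if j != i]
        lb = [labels[j] for j in range(len(train)) if j != i]
        x = [train[i][f] for f in feats]
        correct += knn_predict(tr, lb, x, k) == labels[i]
    return correct / len(train)

def sfs(train, labels, n_features, k=3):
    """Sequential Forward Selection: greedily add the feature that most
    improves classification accuracy, stopping when no feature helps
    or none remain."""
    selected, best_acc = [], 0.0
    remaining = set(range(n_features))
    while remaining:
        cand, cand_acc = max(
            ((f, accuracy(train, labels, selected + [f], k)) for f in remaining),
            key=lambda t: t[1])
        if cand_acc <= best_acc:
            break
        selected.append(cand)
        remaining.remove(cand)
        best_acc = cand_acc
    return selected, best_acc
```

Because the selection is greedy, a feature that only helps in combination with another may never be chosen; this keeps the training stage fast enough to run interactively.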
To operate LVT, a user loads the device in Ableton Live and then vocalises the set of desired drum sounds they intend to use, e.g. five kick sounds, followed by five snare sounds, followed by five hi-hat sounds. Once the expected number of drum sounds has been detected, the SFS algorithm identifies the subset of features which best separates the drum sounds for that user. After training, the user can vocalise rhythmic patterns, which are automatically converted from audio to a MIDI representation in the DAW for later synthesis and editing.
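The guided training stage above can be sketched as follows. The energy-based onset picker and the zero-crossing-rate feature are illustrative stand-ins for the aubioonset and zerox objects used in LVT; all function names, frame sizes, and thresholds here are our own assumptions:

```python
import math

def detect_onsets(signal, frame=256, threshold=0.1):
    """Toy energy-based onset picker: report the first frame of each
    contiguous run whose RMS energy exceeds the threshold.
    (A stand-in for the aubioonset object used in LVT.)"""
    onsets, active = [], False
    for start in range(0, len(signal) - frame + 1, frame):
        rms = math.sqrt(sum(s * s for s in signal[start:start + frame]) / frame)
        if rms > threshold and not active:
            onsets.append(start)
        active = rms > threshold
    return onsets

def zero_crossing_rate(frame_samples):
    """Fraction of adjacent sample pairs with opposite sign; noisier
    sounds (e.g. vocalised hi-hats) tend to score higher."""
    crossings = sum(1 for a, b in zip(frame_samples, frame_samples[1:])
                    if (a < 0) != (b < 0))
    return crossings / (len(frame_samples) - 1)

def label_training_onsets(onsets, classes, per_class):
    """Guided training stage: the user vocalises `per_class` examples of
    each class in order (e.g. 5 kicks, then 5 snares, then 5 hi-hats),
    so the i-th detected onset belongs to class i // per_class."""
    expected = len(classes) * per_class
    if len(onsets) != expected:
        raise ValueError(f"expected {expected} onsets, got {len(onsets)}")
    return [(pos, classes[i // per_class]) for i, pos in enumerate(onsets)]
```

Each labelled onset would then have its features extracted and passed to the feature-selection and classification stage.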
The evaluation of LVT was designed to serve two purposes: first, to understand how a user-specific trained system performs against state-of-the-art drum transcription systems (which have been optimised over large datasets without any user-specific training), and second, to explore how LVT could improve a producer’s workflow. We compared LVT against two existing drum transcription algorithms: LDT [Miron et al.(2013)Miron, Davies, and Gouyon], and Ableton Live’s built-in “Convert Drums to MIDI” function. For validation data we used the non-improvised vocalised patterns from our annotated dataset.
To compare the accuracy of the systems we use the F-measure of the transcriptions. Then, to investigate how our system could improve a producer’s workflow, the “effort” required to obtain an accurate transcription was calculated by counting the number of editing operations needed to arrive at the desired pattern: modifying, adding, or removing a MIDI note.
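The two evaluation measures can be sketched as below. Representing transcriptions as quantised (time, label) pairs is a simplifying assumption of ours; in practice onset times must be matched within a tolerance window:

```python
def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision and recall, computed from
    true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def edit_effort(transcribed, ground_truth):
    """Count the MIDI editing operations needed to reach the desired
    pattern: events at the right time with the wrong label are modified,
    missed events added, and spurious events removed.
    Events are (time, label) pairs with quantised times."""
    trans = dict(transcribed)
    truth = dict(ground_truth)
    modify = sum(1 for t in trans if t in truth and trans[t] != truth[t])
    add = sum(1 for t in truth if t not in trans)
    remove = sum(1 for t in trans if t not in truth)
    return modify + add + remove
```

Under this scheme a perfect transcription requires zero operations, and every false positive or false negative costs exactly one edit.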
Table 1 summarises, for the three drum transcription systems, the total number of operations needed to obtain the desired pattern for the testing data recorded on the studio quality AKG c4000b microphone, together with the corresponding F-measure per vocalised drum sound. The results demonstrate that, for the studio quality microphone, vocalised drum transcription accuracy for LVT is substantially higher than for the other systems, and far fewer modifications were required to obtain the desired patterns when editing the automatic transcriptions.
To illustrate the effect of user-specific training on the performance of LVT, an example is provided where LVT is trained on one user and tested on another – and vice-versa. When LVT is trained on a different person with different vocalisations, the accuracy of the transcription decreases, as shown in Fig. 2. The upper part of each screenshot shows the transcription for a user when trained on their own vocalisations, while the bottom part corresponds to the transcription when trained on the other user. As can be seen, without user-specific training, many misclassifications occur.
By examining the results above, we infer that LVT provides a transcription closer to the ground truth than the existing state-of-the-art systems, as shown by the higher F-measure. Beyond LVT being trained per individual user, these results may also derive from the fact that LVT does not try to detect polyphonic events (more than one drum vocalisation at the same time) as the other systems do. Furthermore, LVT does not detect as many events as the other systems, which strongly reduces the number of false positives and hence raises the F-measure. The number of editing operations needed to achieve the desired transcription, presented in Table 1, shows that the end-user of the system does not have to perform as many actions when producing music, which has a positive impact on the workflow, leaving more time for creative experimentation.
In this paper, we have presented LVT – a new interface for assistive music content creation. LVT allows Ableton Live users to design and perform rhythms with their voice and to sequence them as MIDI patterns. Existing state-of-the-art systems, including one built into Ableton Live, are not able to transcribe vocalised percussion as effectively, because these tools are trained on general recorded drum sounds which are typically not vocalised. Indeed, because different people vocalise drum sounds in different ways, LVT explicitly seeks to model and capture this behaviour via user-specific training. Our evaluation shows LVT to be very effective for a wide range of users and vocalisations, outperforming existing systems. Furthermore, we believe LVT can be applied to arbitrary non-pitched percussive sounds – provided that the training sound types are sufficiently different from one another, and can thus be well separated in the audio feature space using SFS.
LVT is implemented as a Max for Live device, and thus fully integrates into Ableton Live, allowing users of all ability levels to experiment with music sequencing driven by their own personal percussion vocalisations within an easy-to-use graphical user interface.
This work is financed by the ERDF - European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project «POCI-01-0145-FEDER-006961», and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.
Project TEC4Growth-Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020 is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).
- [Brent(2010)] W. Brent. A timbre analysis and classification toolkit for pure data. In Proc. of ICMC, pages 224–229, 2010.
- [Gillet and Richard(2008)] O. Gillet and G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):529–540, March 2008.
- [Malt and Jourdan(2008)] M. Malt and E. Jourdan. Zsa. descriptors: a library for real-time descriptors analysis. In Proc. of 5th SMC Conference, pages 134–137, 2008.
- [Miron et al.(2013)Miron, Davies, and Gouyon] M. Miron, M. E. P. Davies, and F. Gouyon. An open-source drum transcription system for Pure Data and Max MSP. In Proc. of ICASSP, pages 221–225, May 2013.
- [Whitney(1971)] A. W. Whitney. A direct method of nonparametric measurement selection. IEEE Trans. Comput., 20(9):1100–1103, September 1971.