While most people have an innate sense of and appreciation for music, comparatively few are able to participate meaningfully in its creation. A non-musician could endeavor to achieve proficiency on an instrument, but the time and financial requirements may be prohibitive. Alternatively, a non-musician could operate a system which automatically generates complete songs at the push of a button, but this would remove any sense of ownership over the result. We seek to sidestep these obstacles by designing an intelligent interface which takes high-level specifications provided by a human and maps them to plausible musical performances.
The practice of “air guitar” offers hope that non-musicians can provide such specifications [godoy2005playing] — performers strum fictitious strings with rhythmical coherence and even move their hands up and down an imaginary fretboard in correspondence with melodic contours, i.e. rising and falling movement in the melody. This suggests a pair of attributes which may function as an effective communication protocol between non-musicians and generative music systems: 1) rhythm, and 2) melodic contours. In addition to air guitar, rhythm games such as Guitar Hero [guitarhero] also make use of these two attributes. However, both experiences only allow for the imitation of experts and provide no mechanism for the creation of music.
In this work, we present Piano Genie, an intelligent controller allowing non-musicians to improvise on the piano while retaining ownership over the result. In our web demo (Figure 1), a participant improvises on eight buttons, and their input is translated into a piano performance by a neural network running in the browser in real time (Video: goo.gl/Bex4xn, Web Demo: goo.gl/j5yEjg, Code: goo.gl/eKvSVP). Piano Genie has similar performance mechanics to those of a real piano: pressing a button will trigger a note that sounds until the button is released. Multiple buttons can be pressed simultaneously to achieve polyphony. The mapping between buttons and pitch is non-deterministic, but the performer can control the overall form by pressing higher buttons to play higher notes and lower buttons to play lower notes.
Because we lack examples of people performing on 8-button “pianos”, we adopt an unsupervised strategy for learning the mappings. Specifically, we use an autoencoder setup, where an encoder learns to map 88-key piano sequences to 8-button sequences, and a decoder learns to map the button sequences back to piano music (Figure 2). The system is trained end-to-end to minimize reconstruction error. At performance time, we replace the encoder’s output with a user’s button presses, evaluating the decoder in real time.
2 Related Work
Perception of melodic contour is a skill acquired in infancy [trehub1984infants]. This perception is important for musical memory and is somewhat invariant to transposition and changes in intervals [dowling1978scale, huron1996melodic]. The act of sound tracing—moving one’s hands in the air while listening to music—has been studied in music information retrieval [parsons1975directory, godoy2009body, nymoen2011analyzing, kelkar2017representation, olivier2018soundtracer]. It has been suggested that the relationship between sound tracings and pitch is non-linear [eitan2014lower, kelkar2018evaluating]. Like Piano Genie, some systems use user-provided contours to compose music [roy2014trap, kitahara2017jamsketch], though these systems generate complete songs rather than allowing for real-time improvisation. An early game by Harmonix called The Axe [theaxe] allowed users to improvise in real time by manipulating contours which indexed pre-programmed melodies.
There is extensive prior work [lee1992neural, bevilacqua2005mnm, fiebrink2009meta, gillian2011machine] on supervised learning of mappings from different control modalities to musical gestures. These approaches require users to provide a training set of control gestures and associated labels. There has been less work on unsupervised approaches, where gestures are automatically extracted from arbitrary performances. Scurto and Fiebrink [scurto2016grab] describe an approach to a “grab-and-play” paradigm, where gestures are extracted from a performance on an arbitrary control surface, and mapped to inputs for another. Our approach differs in that the controller is fixed and integrated into our training methodology, and we require no example performances on the controller.
We wish to learn a mapping from button sequences, i.e. amateur performances of presses on eight buttons, to piano sequences, i.e. professional performances on an 88-key piano. To preserve a one-to-one mapping between buttons pressed and notes played, we assume that both the button sequences and the piano sequences are monophonic. (This does not prevent our method from working on polyphonic piano music; we just consider each note to be a separate event.) Given that we lack examples of button sequences, we propose using the autoencoder framework on examples of piano sequences. Specifically, we learn a deterministic mapping from piano sequences to button sequences (the encoder), and a stochastic inverse mapping from button sequences back to piano sequences (the decoder).
Figure: In the VQ-VAE strategy (left), an encoder output vector (grey circle) is quantized to its nearest centroid (red circle) before decoding. Our IQAE strategy (right) quantizes a scalar encoder output to one of a fixed number of evenly spaced buckets.
We use LSTM recurrent neural networks (RNNs) [hochreiter1997long] for both the encoder and the decoder. For each input piano note, the encoder outputs a real-valued scalar, forming a real-valued sequence with one value per note. To discretize this into a button sequence, we quantize each value to one of eight buckets equally spaced over a fixed range (Figure 2(b)), and use the straight-through estimator [bengio2013estimating] to bypass this non-differentiable operation in the backwards pass. We refer to this contribution as the integer-quantized autoencoder (IQAE); it is inspired by two papers from the image compression literature that also use autoencoders with discrete bottlenecks [balle2016end, theis2017lossy]. We train this system end-to-end to minimize a combination of three loss terms: a reconstruction term, a range penalty on the encoder output, and a contour regularizer.
Together, the reconstruction term and the range penalty constitute our proposed IQAE objective. The former minimizes the reconstruction loss of the decoder. To agree with our discretization strategy, the latter discourages the encoder from producing values outside of the quantization range. We also contribute a musically salient regularization strategy inspired by the hinge loss [wahba1999support] which gives the model an awareness of melodic contour. By comparing the intervals of the musical input to the finite differences of the real-valued encoder output, the contour term encourages the encoder to produce “button contours” that match the shape of the input melodic contours.
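To make the quantization step and the two encoder-side penalties more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The range bounds `low=-1.0` and `high=1.0`, the squared-hinge form of both penalties, and all function names are assumptions made for illustration. The straight-through estimator is not shown, since it only matters inside an automatic-differentiation framework.

```python
import math


def quantize(value, num_buttons=8, low=-1.0, high=1.0):
    """Return the index of the bucket nearest to a real-valued encoder output.

    The bounds low/high are assumed values for illustration.
    """
    clipped = min(max(value, low), high)
    # Scale to [0, num_buttons - 1], then round half up to the nearest bucket.
    return math.floor((clipped - low) / (high - low) * (num_buttons - 1) + 0.5)


def range_penalty(encoder_outputs, low=-1.0, high=1.0):
    """Mean squared distance by which encoder outputs fall outside [low, high]."""
    excess = [max(e - high, low - e, 0.0) for e in encoder_outputs]
    return sum(x ** 2 for x in excess) / len(encoder_outputs)


def contour_penalty(pitches, encoder_outputs):
    """Squared hinge on (note interval) * (encoder finite difference).

    The term is zero when the encoder output moves in the same direction as
    the melody with a large enough product, and grows when the two disagree.
    """
    terms = []
    for i in range(1, len(pitches)):
        note_interval = pitches[i] - pitches[i - 1]
        button_delta = encoder_outputs[i] - encoder_outputs[i - 1]
        terms.append(max(1.0 - note_interval * button_delta, 0.0) ** 2)
    return sum(terms) / len(terms) if terms else 0.0
```

For instance, a rising melody paired with a rising encoder output incurs no contour penalty under this sketch, while a rising melody paired with a falling output is penalized.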
4 Experiments and Analysis
We train our model on the Piano-e-Competition data [pianoe], which contains performances by skilled pianists. We flatten each polyphonic performance into a single sequence of notes ordered by start time, breaking ties by listing the notes of a chord in ascending pitch order. We split the data into training, validation and testing subsets. To keep the latency low at inference time, we use relatively small two-layer RNNs. We use a bidirectional RNN for the encoder, and a unidirectional RNN for the decoder since it will be evaluated in real time. Our training examples consist of fixed-length note subsequences, each randomly transposed within a fixed range of semitones. We perform early stopping based on the reconstruction error on the validation set.
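The flattening step described above amounts to a sort on two keys. A minimal sketch follows, assuming a hypothetical `Note` record with an onset time in seconds and a MIDI pitch number. The actual data representation used in training is not specified here.

```python
from typing import NamedTuple


class Note(NamedTuple):
    """Hypothetical note record: onset time in seconds and MIDI pitch number."""
    start: float
    pitch: int


def flatten_performance(notes):
    """Order notes by start time, breaking ties by ascending pitch."""
    return sorted(notes, key=lambda note: (note.start, note.pitch))


# A three-note chord at time 0.0 followed by a single lower note.
performance = [Note(0.0, 64), Note(0.0, 60), Note(0.5, 55), Note(0.0, 67)]
flat_pitches = [note.pitch for note in flatten_performance(performance)]
```

Here the chord is listed from its lowest to its highest note before the later note, so `flat_pitches` is `[60, 64, 67, 55]`.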
As a baseline, we consider an LSTM language model (equivalent to the decoder portion of our IQAE without button inputs) trained to simply predict the next note given previous notes. This is a challenging sequence modeling task because the monophonic sequences will frequently jump between the left and the right hand. To allow the network to reason about polyphony, we add a timing feature to the input, representing the amount of time since the previous note, quantized into evenly spaced buckets over a fixed time range.
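The timing feature can be sketched as follows. The defaults `num_buckets=32` and `max_seconds=1.0` are placeholder values chosen for illustration, and treating the first note as having a gap of zero is likewise an assumption.

```python
import math


def delta_time_bucket(delta_seconds, num_buckets=32, max_seconds=1.0):
    """Quantize a time gap into one of num_buckets evenly spaced buckets.

    num_buckets and max_seconds are placeholder values for illustration.
    """
    clipped = min(max(delta_seconds, 0.0), max_seconds)
    return math.floor(clipped / max_seconds * (num_buckets - 1) + 0.5)


def delta_time_features(onsets, **kwargs):
    """Bucketed time since the previous note, for onsets in ascending order."""
    features = []
    previous = None
    for onset in onsets:
        # Assumption: the first note has no predecessor, so its gap is 0.
        gap = 0.0 if previous is None else onset - previous
        features.append(delta_time_bucket(gap, **kwargs))
        previous = onset
    return features
```

A gap of zero lands in the lowest bucket, which is how a flattened chord remains distinguishable from a run of separate notes.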
We also compare to the VQ-VAE strategy [van2017neural], another approach that learns discrete autoencoders. The VQ-VAE strategy discretizes based on proximity to learned centroids within an embedding space (Figure 2(a)) as opposed to the fixed buckets in our IQAE (Figure 2(b)). Accordingly, it is not possible to apply the same contour regularization strategy to the VQ-VAE, and the meaning of the mapping between the buttons and the resultant notes is less interpretable.
To evaluate our models, we calculate two metrics on the test set: 1) the perplexity (PPL) of the model, and 2) the ratio of contour violations (CVR), i.e. the proportion of timesteps where the sign of the button interval disagrees with the sign of the note interval. We also manually create “gold standard” button sequences for eight familiar melodies (e.g. Frère Jacques), and measure the mean squared error in button space between these gold standard button sequences and the output of the encoder for those melodies (Gold). We report these metrics for all models in Table 1.
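Both metrics are simple to compute from per-note quantities. The sketch below is one plausible reading rather than the evaluation code used for Table 1. It assumes that a repeated note or repeated button counts as an interval of sign zero, and that perplexity is the exponential of the mean negative log-likelihood the model assigns to the true notes.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def contour_violation_ratio(buttons, pitches):
    """Fraction of steps where button and note intervals differ in sign."""
    steps = len(pitches) - 1
    if steps < 1:
        return 0.0
    violations = 0
    for i in range(1, len(pitches)):
        button_sign = sign(buttons[i] - buttons[i - 1])
        note_sign = sign(pitches[i] - pitches[i - 1])
        if button_sign != note_sign:
            violations += 1
    return violations / steps


def perplexity(true_note_probabilities):
    """exp of the mean negative log-probability assigned to the true notes."""
    log_probs = [math.log(p) for p in true_note_probabilities]
    return math.exp(-sum(log_probs) / len(log_probs))
```

Under this reading, a button sequence that holds steady while the melody rises counts as a violation, which is a stricter choice than ignoring zero intervals.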
As expected, all of the autoencoder models outperformed the language model in terms of reconstruction perplexity. The VQ-VAE models achieved better reconstruction costs than their IQAE counterparts, but produced non-intuitive button sequences as measured by comparison to gold standards. In Figure 4, we show a qualitative comparison between the button sequences learned for a particular input by the VQ-VAE and our IQAE with contour regularization. The sequences learned by our IQAE model are visually more similar to the input.
Interestingly, the IQAE model regularized with the contour penalty had better reconstruction than its unregularized counterpart. It is possible that the contour penalty is making the decoder’s job easier by limiting the space of mappings that the encoder can learn. The penalty was effective at aligning the button contours with melodic contours; the regularized encoder rarely violates the melodic contour (CVR, Table 1). The timing features improved reconstruction for all models.
5 User Study
While our above analysis is useful as a sanity check, it offers limited intuition about how Piano Genie behaves in the hands of users. Accordingly, we designed a user study to compare three mappings between eight buttons and a piano:
(G-maj) the eight buttons are deterministically mapped to a G-major scale
(language model) pressing any button triggers a prediction by our baseline musical language model; i.e. the player provides the rhythm but not the contour
(Piano Genie) our IQAE model with contour regularization
Eight participants were given up to three minutes to improvise with each mapping. The length of time they spent on each mapping was recorded as an implicit feedback signal. After each mapping, participants were asked to what extent they agreed with the following statements:
“I enjoyed the experience of performing this instrument”
“I enjoyed the music that was produced while I played”
“I was able to control the music that was produced”
This survey was conducted on a five-level Likert scale [likert1932technique], and we convert the responses to a numerical scale in order to compare averages (Table 2).
When asked about their enjoyment of the performance experience, all eight users preferred Piano Genie to G-maj, while seven preferred Piano Genie to the language model. Five out of eight users enjoyed the music produced by Piano Genie more than that produced by G-maj. As expected, no participants said that Piano Genie gave them more control than the G-maj scale. However, all eight said that Piano Genie gave them more control than the language model.
Participants also offered comments about the experience. Many of them were quite enthusiastic about Piano Genie. One participant said “there were some times when [Piano Genie] felt like it was reading my mind”. Another participant said “how you can cover the entire keyboard with only [eight] buttons is pretty cool.” One mentioned that the generative component helped them overcome stage fright; they could blame Piano Genie for perceived errors and take credit for perceived successes. Several participants cited their inability to produce the same notes when playing the same button pattern as a potential drawback; enabling these patterns of repetition is a promising avenue for future work. The participants with less piano experience said they would have liked some more instruction about types of gestures to perform.
6 Web Demo Details
For the models that use the timing features, we have to wait until the user presses a key to run a forward pass of the neural network, because the time since the previous note is not known until then. For the models that do not use these features, we can run the computation for all possible buttons in advance. This allows us to both reduce the latency and display a helpful visualization of the possible model outputs contingent upon the user pressing any of the buttons.
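The precomputation strategy for models without timing features can be sketched as a small cache keyed by button index. The example is written in Python for illustration only; the demo itself runs in the browser. `toy_decoder_step` is a hypothetical stand-in for one step of the decoder and does not reflect the real model's interface.

```python
def precompute_button_outputs(decoder_step, state, num_buttons=8):
    """Evaluate the decoder once per possible button, before any press occurs."""
    return {button: decoder_step(state, button) for button in range(num_buttons)}


def toy_decoder_step(state, button):
    """Hypothetical stand-in for a decoder step: returns (pitch, next_state)."""
    pitch = 21 + 11 * button + (state % 3)
    return pitch, state + 1


state = 0
# Done ahead of time, while waiting for the performer.
cache = precompute_button_outputs(toy_decoder_step, state)

# At press time, the result is a dictionary lookup rather than a forward pass.
pressed_button = 5
pitch, state = cache[pressed_button]
```

The same cache can drive a visualization of what each button would play next, at the cost of one decoder evaluation per button for every note.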
To build an interface for Piano Genie that would be more inviting than a computer keyboard, we 3D-printed enclosures for eight arcade buttons which communicate with the computer via USB (Figure 1). Due to technical limitations of our USB microcontroller, we ended up building two boxes with four buttons instead of one with eight. This resulted in multiple unintended but interesting control modalities. Several users rearranged the boxes from a straight line to different 2D configurations. Another user—a flutist—picked up the controllers and held them to their mouth. A pair of users each took a box and performed a duet on the low and high parts of the piano.
We have proposed Piano Genie, an intelligent controller which allows non-musicians to improvise on the piano. Piano Genie has an immediacy not shared by other work in this space; sound is produced the moment a player interacts with our software rather than requiring laborious configuration. Additionally, the player is kept in the improvisational loop as they respond to the generative procedure in real time. We believe that the autoencoder framework is a promising approach for learning mappings between complex interfaces and simpler ones, and hope that this work encourages future investigation of this space.
We would like to thank Adam Roberts, Anna Huang, Ashish Vaswani, Ben Poole, Colin Raffel, Curtis Hawthorne, Doug Eck, Jesse Engel, Jon Gillick, Yingtao Tian and others at Google AI for helpful discussions and feedback throughout this work. We would also like to thank Julian McAuley, Stephen Merity, and Tejaswinee Kelkar for helpful conversations. Additional thanks to all participants of our user study.