Many attempts have been made to model counterpoint (e.g., first species counterpoint [doi:10.1080/17513472.2012.738554]), to formalize its rules via computational methods [Song2015], or to generate compositions in the style of four-part Bach chorales [Hadjeres2016DeepBach, Huang2017CounterpointBC]. Unlike these approaches, we focus on the pairwise interaction between musical parts in a piece and frame it as an NMT task: one of the parts is the input to the encoder (i.e., the source sentence), and the other part is generated as the output of the decoder (i.e., the target sentence). While the use of NMT is certainly not new in the context of generative music systems, to our knowledge we are the first to formulate two-part counterpoint as a translation task, where “translation” means “generating the other part”.
To this end, we collated and edited a bespoke Baroque music dataset, and trained a Transformer model [NIPS2017_7181]. We show that our model is able to generate target sentences that have some of the musical signatures of the Baroque style it was trained on (e.g., imitation and phrasal development), although it still falls short on many fronts (e.g., harmonic coherence). We provide examples of these behaviors and discuss possible strategies for improving the model in future implementations.
2 Related Work
Many computational approaches have been used to model counterpoint and polyphony. Among these, rule- and constraint-based methods have been explored extensively [10.2307/3680335, Tsang1991HarmonizingMA], along with grammars [Gilbert2007APC, Quick:2013:GAM:2505341.2505345] and statistical methods, from Hidden Markov Models [Farbood2001AnalysisAS, Allan2004HarmonisingCB] to their combination with pattern-matching models [10.2307/3680717]. In a more recent experiment [IJIMAI2649], a Markov model is again combined with a pattern template.
Besides these approaches, music counterpoint has also been modeled via the increasingly ubiquitous machine learning paradigm [Adiloglu:2007:MLA:1232940.1232984], and specifically by applying artificial neural networks with modern deep learning techniques, as in [Hadjeres2016DeepBach, Liang2017AutomaticSC, Huang2017CounterpointBC]. The works most closely related to ours are the Music Transformer [huang2018music], which introduces a measure of distance between any two tokens (relative attention), and OpenAI’s MuseNet [musenet], based on the GPT-2 model [gpt2], which uses sparse attention, whereby each of the output positions computes weightings from a subset of input positions. Both of these models encode music left-to-right and generate similarly, and are able to produce what are deemed state-of-the-art pieces. Unlike these uses of the Transformer, we instead formulate the generation of two-part counterpoint as an NMT task, and claim this to be a novel approach to polyphonic music generation.
3 The Data
To train the model with suitable data, we decided to collate a bespoke dataset, motivated by the often partial and noisy nature of the readily available ones. This is an ongoing endeavor, as more pieces are continuously added, referenced against the original scores, and curated to ensure the quality of the transcription and the absence of duplicates. Composers are chosen exclusively within the Baroque idiom (e.g., J.S. Bach, A. Vivaldi, G.F. Handel, G.P. Telemann, S.L. Weiss), with most pieces being sourced from the Werner Icking Music Archive (http://www.icking-music-archive.org/index.php) and the Center for Computer Assisted Research in the Humanities (http://kern.ccarh.org/). All pieces are re-formatted as MIDI files. At the time of writing, our dataset comprised 707 two- and three-part pieces, and 597 pieces with more than three parts, including orchestral works.
3.1 Data Encoding
Each MIDI file is divided into tracks, one per instrument (with the exception of keyboard instruments, which often have two). First, we removed all polyphony from each individual track, such as doubling at the octave or occasional chords for string instruments, discarding a track altogether if this was impractical or musicologically unfeasible. This resulted in tracks per file, where indexes the files. Because we are building a model of two-part music, we built all combinations of pairs of tracks. At the time of the latest version of our dataset, we had track pairs.
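The pairing step can be illustrated with a minimal sketch. It assumes unordered pairs drawn from within a single file; whether each pair is additionally used in both source/target directions is not stated here. The function name, file names, and track labels are hypothetical.

```python
from itertools import combinations


def track_pairs(tracks_per_file):
    """Return every unordered pair of tracks drawn from the same file.

    `tracks_per_file` maps a (hypothetical) file name to the list of its
    monophonic track identifiers.
    """
    pairs = []
    for file_name, tracks in tracks_per_file.items():
        # combinations(..., 2) yields each unordered pair exactly once
        for first, second in combinations(tracks, 2):
            pairs.append((file_name, first, second))
    return pairs


# Toy data (assumed, not from the dataset): a three-part and a two-part piece.
example = {
    "trio_sonata.mid": ["violin", "flute", "continuo"],
    "invention.mid": ["right_hand", "left_hand"],
}
pairs = track_pairs(example)
print(len(pairs))
```

A file with k monophonic tracks contributes k(k-1)/2 pairs under this assumption, so pieces with many parts dominate the pair count.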
Next, each track pair was arbitrarily segmented into four-measure chunks, and segments with fewer than 10 notes in any given part were filtered out in an effort to keep only the data points that are genuinely written in a polyphonic style. This yielded four-bar segments. Of these, were selected to make up a training set, and these segments were then transposed in all keys, augmenting the dataset to training segments and remaining segments for validation.
To encode the MIDI data for use in a neural network, we consider a vector of three elements for each note in the segment: the MIDI pitch number, the duration (floating point rounded to three decimal places), and the number of beats (also floating point, rounded) from the beginning of the segment. We treat each piece of information as a “word”, assigning a unique string to each element. The union of these strings defines the vocabulary of the data. Each note or rest in the music is represented by a sequence of three words. This representation affords a simple implementation of the model using existing NLP systems.
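A minimal sketch of this three-word encoding follows. The token spellings (`P…`, `D…`, `B…`), the use of `None` for a rest, and the function names are assumptions made for illustration; the actual strings used in the dataset are not specified here.

```python
def encode_segment(notes):
    """Flatten (pitch, duration, beat_offset) triples into a list of words.

    `pitch` is a MIDI note number, or None for a rest (assumed convention).
    Durations and offsets are in beats, rounded to three decimal places.
    """
    words = []
    for pitch, duration, offset in notes:
        words.append("P_rest" if pitch is None else f"P{pitch}")
        words.append(f"D{round(duration, 3):.3f}")
        words.append(f"B{round(offset, 3):.3f}")
    return words


def build_vocabulary(encoded_segments):
    """The vocabulary is the union of all words seen in the encoded data."""
    return sorted({word for segment in encoded_segments for word in segment})


# Toy segment: a quarter note, an eighth rest, and a triplet eighth.
segment = [(60, 1.0, 0.0), (None, 0.5, 1.0), (62, 1 / 3, 1.5)]
encoded = encode_segment(segment)
vocabulary = build_vocabulary([encoded])
print(encoded)
```

Because every note contributes exactly three words in a fixed order, the pitch, duration, and beat-position streams can later be recovered by simple slicing.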
3.2 Beat position
Encoding the beat position as a language token might seem unnecessary: after all, the word embeddings fed to the Transformer are already combined with a global position embedding. Indeed, we could have used a beat-position embedding instead of encoding it as a token. However, we found it useful to force the model to output the correct beat position after each (variable-length) note, and noticed improved performance when the model is required to explicitly model the passage of musical time. For generating output MIDI files and for calculating BLEU scores (see below), these beat position outputs were discarded. In a separate experiment (mod-beat-position, below) we relied on the global position embedding from the Transformer and modified the beat position token to represent the metric position within a single measure, relative to the downbeat (e.g., the downbeat of any measure would be encoded as position 0, the position after three eighth notes have sounded would be position 1.5, etc.).
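The two variants of the beat token, and the discarding of beat tokens before evaluation, can be sketched as follows. This assumes a constant meter, a segment that starts on a downbeat, and beat tokens occupying every third position of the word stream; the function names and the `beats_per_measure` parameter are hypothetical.

```python
def measure_relative_beat(offset_in_beats, beats_per_measure=4.0):
    """Map a beat offset from the segment start to a position in its measure.

    Assumes a constant meter and a segment that begins on a downbeat.
    """
    return round(offset_in_beats % beats_per_measure, 3)


def strip_beat_tokens(words):
    """Drop every third word, assumed to be the beat-position token."""
    return [word for index, word in enumerate(words) if index % 3 != 2]


# In 4/4, three eighth notes into the second measure is offset 5.5 from the
# segment start and position 1.5 within the measure.
print(measure_relative_beat(5.5))
print(strip_beat_tokens(["P60", "D1.000", "B0.000", "P62", "D0.500", "B1.000"]))
```

Under these assumptions the segment-relative token grows monotonically over the four measures, whereas the measure-relative token resets to 0 on every downbeat.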
4 The Model
We used the OpenNMT (http://opennmt.net/) implementation of the Transformer in PyTorch as the basis of our model (with modifications to the beam search code). The Transformer is made of connected encoding and decoding networks; their main components are attention and self-attention layers, preceded by a positional encoding and followed by standard feed-forward layers. An attention layer has three inputs: a query matrix $Q$ and a pair of key-value matrices $K$ and $V$. In our case, for example, each row of the query matrix represents a token from the target music phrase, while each key-value pair is taken from the source music phrase. The output of the layer is a measure of how important each key is in determining the nature of the query. Its exact mathematical implementation can vary. The most typical one is the (modified) dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (1)$$

where $K^{T}$ indicates the transpose of $K$ and $d_k$ is the (common) dimension of the representation for each of the queries and keys (the other dimension being respectively the number of tokens in the query phrase and in the source phrase). In practice, the output of the attention layer for each query is a weighted sum over all the values, where the weights are given by a function measuring the mutual connection between the query and each of the keys.
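The weighted-sum view of attention can be made concrete with a small plain-Python sketch. It assumes the standard scaled dot-product form from [NIPS2017_7181], softmax(Q K^T / sqrt(d_k)) V, and is purely illustrative; the model itself relies on the OpenNMT implementation, and the function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # One score per key: dot product scaled by sqrt(d_k)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # Output row = weighted sum of the value vectors
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: the query matches the first key more closely, so the first
# value vector receives the larger weight.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(result)
```

In the translation setting described above, `queries` would come from the target part and `keys`/`values` from the source part; in self-attention all three derive from the same phrase.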
In a self-attention layer, the keys, queries, and values all come from the same music phrase. If $X$ is the representation of the phrase (each row $x_i$ thus being the embedding of one token), then those three matrices are calculated as

$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V} \qquad (2)$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are three different trainable weight matrices. Once the output of the decoder is finally calculated, a feed-forward network followed by a softmax is used to choose the generated token out of the available ones.
The other important part of the model is the positional encoding, which determines the correct embedding of each token by storing all the information about the relative ordering of the tokens. Indeed, the matrix multiplications in Eq. 1 work independently on every element and disregard the ordering, treating the musical phrase “ABCDE” equivalently to “ADBEC”. This is solved by adding to the order-independent embedding a part that depends only on the position of the token in the input sequence, so that $x_i = e_i + p_i$, where $e_i$ is the order-independent embedding of the $i$-th token and $p_i$ is its positional encoding. We refer to [NIPS2017_7181] for details of the implementation.
The motivation for this work is that self-attention layers in the Transformer can learn musical structure by studying the relations between the different notes, for example discovering cadences, repetitions, and so forth.
5 Results & Discussion
We evaluated our model via both NLP metrics and domain-expert opinion.
Extending the NMT analogy all the way from the model’s architecture to the assessment of its output, we employed the BiLingual Evaluation Understudy (BLEU) score, a metric used to evaluate a generated sequence against a reference sequence. BLEU is a modified precision metric over n-grams [P02-1040] (with, typically, $n = 4$), but it has been criticized on the grounds that a sentence can be translated in many different ways. A similar argument could be made considering degree equivalence in music. For example, in the key of , in a melodic phrase anchored on the pre-dominant, a (note) is contextually just as appropriate as an (note). Notwithstanding these considerations, we opted for BLEU, in the awareness that this needs to be mediated by musicological concerns. In the results below, the Pitch and Duration scores are calculated by extracting the MIDI pitch tokens and duration tokens, respectively, from the output stream, and then computing the BLEU score using smoothing method 2 from [chencherry2014]. The Combined score was computed using the output sequence of interleaved Pitch and Duration tokens, and increasing $n$ from 4 to 8.
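A minimal standard-library sketch of how the Pitch, Duration, and Combined streams could be separated and scored is given below. The function names, the token layout (pitch, duration, beat position in each triple), and sentence-level scoring are assumptions. The smoothing adds one to the numerator and denominator of the n-gram precisions for n > 1, which is our reading of smoothing method 2 in [chencherry2014]; the sketch is not confirmed to reproduce the scores reported here.

```python
import math
from collections import Counter


def split_streams(words):
    """Split an interleaved [pitch, duration, beat, ...] word list."""
    pitch = words[0::3]
    duration = words[1::3]
    combined = [word for index, word in enumerate(words) if index % 3 != 2]
    return pitch, duration, combined


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n > 1 (assumed)."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: each n-gram counts at most as often as in the reference
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for hypotheses shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(log_precisions) / max_n)


words = ["P60", "D1.000", "B0.000", "P62", "D0.500", "B1.000"]
pitch, duration, combined = split_streams(words)
print(bleu(combined, combined, max_n=8))
```

Raising `max_n` to 8 for the combined stream, as described above, keeps the number of notes covered by the longest n-gram comparable to the 4-gram case on a single stream.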
|No beat-position token||22.4||26.9||56.1||30.6||35.8||32.1|
As a sanity check, we tested whether the model is prone to “memorizing” target sequences in the training data. We calculated the edit distance [Navarro:2001:GTA:375360.375365] between all possible pairs composed of a target sequence in the training set and a generated response. Edit distance was considered zero if all the pitches and their durations were identical in both sequences. We did not find any cases of direct copying behavior. For the mod-beat-position condition, the edit distance was on average, while for beat-position it was .
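The memorization check can be sketched with a textbook Levenshtein distance over sequences of (pitch, duration) items. The choice of notes rather than individual tokens as the unit of comparison, the absence of normalization, and the function names are assumptions.

```python
def edit_distance(first, second):
    """Levenshtein distance between two sequences of comparable items."""
    previous = list(range(len(second) + 1))
    for i, item_a in enumerate(first, start=1):
        current = [i]
        for j, item_b in enumerate(second, start=1):
            substitution = previous[j - 1] + (0 if item_a == item_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closest_training_distance(generated, training_targets):
    """Smallest distance from a generated sequence to any training target."""
    return min(edit_distance(generated, target) for target in training_targets)


# Toy sequences of (MIDI pitch, duration in beats); a distance of 0 would
# indicate that a training target was copied verbatim.
generated = [(60, 1.0), (62, 0.5), (64, 0.5)]
training = [
    [(60, 1.0), (62, 0.5), (65, 0.5)],
    [(67, 2.0), (65, 2.0)],
]
print(closest_training_distance(generated, training))
```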
The scores shown in Table 1 suggest that the mod-beat-position version does better on the pitch-only and combined pitch+duration metrics, whereas beat-position does better on the duration-only metric. We also computed scores for the baseline case of not using a beat position token at all; in this case, the duration-only results are much worse.
These BLEU score results, however, do not directly translate to a measure of musical quality. We now examine, from a musicological viewpoint, some examples of the model’s output using both variants of the beat-position token.
5.2 Musical Analysis
In Figure 2, for example, we examine the model’s mod-beat-position generated response. This shows accomplished voice leading and is a good example of anticipation of the query’s material, namely bars 77-80 in Handel’s Messiah, movement 44 (the famous Hallelujah chorus). Moreover, the generated part imitates in contrary motion, although by diatonic steps rather than by tertian arpeggio. When comparing this to the model’s beat-position behavior (see Figure 3), one can notice that the rhythmic and melodic contours are more varied, comprising five (instead of two) duration values and a wider selection of intervals, respectively. Indeed, the numerical findings previously reported seem to be corroborated by a brief musical analysis.
Despite the positive traits shown above, our model fails to exhibit certain fundamental elements of what is considered valid contrapuntal motion. Leaving aside strict applications of the Gradus ad Parnassum [nla.cat-vn2555130] rules (e.g., step-wise voice motion, avoiding hidden fifths and octaves, etc.), it is evident that our model produces target sequences of dubious musical appropriateness (e.g., parallel fifths) or little harmonic coherence (e.g., missing cadences, tonal pivots and secondary dominant leading tones). Furthermore, the model’s output does not exhibit sufficient stylistic authenticity, being, at times, more typical of the modal idiom. The most notable absence is that of canon, a more formalized type of imitation, which follows stricter rules and which comes in several guises (simple, interval, inversion, retrograde, mensuration, etc.). Sample output MIDI files are available online (https://gitlab.com/skalo/baroque-nmt/-/tree/master/selected_examples).
5.3 Future Work
An issue we foresee working on is that of hierarchical structure. While the Transformer has successfully addressed long-term structure [huang2018music] (which had been the crux of many generative approaches), hierarchical modeling remains an open problem not only in NLP, where it has been shown [DBLP:journals/corr/abs-1803-03585] that RNNs still outperform attention networks, but also in music. We posit that the modeling of hierarchical dependencies can be improved by conditioning the model on boundary segmentation of the dataset, and we intend to train the Transformer on segments obtained with perceptual [10.2307/843503, Cambouropoulos01thelocal], musicological [GTTM], and statistical [marcus2005a] methods, rather than using arbitrary, fixed-length segments.
We presented a novel approach to Baroque counterpoint modeling, using NMT. From this perspective, counterpoint is seen as a nearly synchronous translation task. We collated a bespoke dataset to train a Transformer model, adding a beat position token to better model musical time. We concluded that, whilst able to generate reasonable responses at times, our model still struggles with issues that have long been resolved in systems built on different architectures (e.g., rule- or constraint-based systems). Notwithstanding its current limitations, we believe that our framing of two-part polyphony is an original viewpoint worth investigating further, and we endeavor to do so in the near future.