We have reached an age where natural language interfaces are increasingly available for our technology. These interfaces provide opportunities to interact more naturally with our devices. For example, it is becoming common for people to speak directly to mobile devices to accomplish tasks with no human on the other end of the interactions. However, to date there has been little work on the support of conversation and interactions involving musical concepts and scores.
Deep learning with neural networks is currently a popular strategy in the domain of natural language processing (NLP), and has been applied to developing dialog systems where a machine attempts to carry on a conversation with a human . However, neural nets remain difficult to analyze and extend once trained. In contrast, the classic approach of explicit knowledge representation symbolically represents ideas and thought processes that can be directly analyzed and extended, and provides a natural framework for explaining decisions.
Falling under the general strategy of knowledge representation, elementary composable ideas, or ECIs, are simple, atomic concepts that can be combined, or composed, to form more complex concepts and relationships. For example, there is some evidence to suggest that “surface” is one such concept that infants have and use to build other concepts, such as collections of surfaces forming objects and internal spaces . This style of representation has been proposed to underly human natural language processing tasks such as describing and referring to items and events in visual environments [3, 4].
Programming languages already follow an ECI-like model: there are primitives, such as integer and floating-point numbers, and a collection of basic operations and data structures over those entities that are used to represent more complex concepts. Object-oriented programming seeks to model computational problems much in the way we view our environment as composed of objects, each of which consists of parts that interact and can be manipulated in particular ways. However, not all ideas fit neatly into this approach and there are many languages that use very different strategies, such as functional programming languages. Additionally, appropriate language features for a computer are not necessarily straightforward derivations from verbal descriptions—there is a reason why programming a computer is a complex skill that requires specialized training.
We propose a basic collection of ECIs for music with an accompanying Python implementation called MusECI 111Our implementation is available at musica.ml4ai.org/.. In this paper we describe ECI-style musical primitives and structures for grouping these primitives. We also provide basic operations to facilitate conversational manipulation of music through a query-then-operation pattern of interaction. This lays the foundation for eventually building larger platforms that would facilitate real-time conversations with machines about music. The goal of MusECI is not to replace traditional score editing software, but rather to enable the creation of new systems to augment those environments and open new avenues for human-computer interaction through music.
2 Programming Languages
A central goal of the MusECI project is to create a language for score-level music representation that can be used in an algorithmic or automated composition setting while still adhering closely to models of music cognition and expression in natural language. We adopt the constraints that the data structures used to represent music both have a straightforward interpretation for editing and playback and that they can also be easily queried to identify features based on natural language descriptions and manipulated according to instructions.
Many musical programming languages focus on the sound, or signal-level aspects of music generation. Examples of these include Csound , SuperCollider222SuperCollider also represents another language paradigm known as live coding, which allows compilation and execution of fragments of a larger program on demand. However, while gaining traction in music performance, this kind of language paradigm has not been interfaced with NLP systems., and Max/MSP. While these languages are extremely useful in the context of music creation, they do not offer a rich variety of representations for higher level features in music. On the other extreme are languages that focus on creating visually appealing musical scores. Languages such as LilyPond  focus on type-setting aspects of a score, with emphasis on details of visual presentation and formatting and less on relationships between musical features that would be useful for modeling music cognition. We seek a representation between the extremes of visual detail and sound synthesis, focusing on common natural language descriptors people use to talk about music: note, melody, chord, phrase, and so on.
Other languages for representing score-level information favor grouping of musical features, such as melodies and harmonies, over the details of their visual presentation. MusicXML is one such language. It is a markup language that describes score-level features in music in an application-independent way such that a composition can be passed between different score rendering programs. Musical features are highly nested and groupings occur by visual features in the score. Unlike MIDI format, time is measured in MusicXML by measures and beats, which are relative to a particular tempo. MusECI’s implementation supports MusicXML as an input format to specify musical structures; the method for converting MusicXML’s standard into our musical ECI data structures is described in section 3.3.
2.1 Algorithmic Score Analysis and Generation
None of the languages discussed so far support operations for searching or manipulating the music. MuSQL  is a structured query language for operating on musical databases. It permits creation of, selection from, and alteration to musical structures. Its representations of musical features are generalized to address multiple kinds of pitch data, although time representations are limited to seconds—there is no method for describing more abstract metrical structures and containers such as measures and beats. MuSQL uses standard SQL-style commands, which, although intuitive to programmers with the right training, are not necessarily easily derived from natural language statements. One of MusECI’s goals is to bridge that gap.
Music21  is a Python library that has become popular for hypothesis testing on large musical copora. It is able to read a variety of formats, including MIDI and MusicXML, and uses an object-oriented style that is very close to the structures used by MusicXML. Music21 has been used in several computational musicology projects, including analysis of the large Bach chorales MusicXML corpus. However, Music21 is primarily structured based on music theoretic concepts, which, although potentially useful to analyzing questions about music cognitive models, is not always natural as a vehicle for mapping from natural language descriptions to operations on music.
Representations for score-level structures in programming languages intended for algorithmic composition typically revolve around the two classic computer science data structures of lists and trees. The Jython Music library , a Python version of JMusic , favors list representations for constructing melodies and chords using notes and rests. Melodies are lists of notes and rests with the assumption that one element’s onset is determined by the previous element’s end time. Chords are also represented as lists of notes having the same start time and duration. Python’s support for nested lists allows for chords to be created within a larger sequential, melodic structure. While useful in pedagogical and algorithmic composition settings, this list-based approach makes representation of features such as arpeggiated chords tricky and its style of nesting musical concepts like parts and phrases is very rigid.
In the Euterpea library  for representing musical structures in Haskell, musical ideas are represented as binary trees, where leaves are notes and rests that store duration information but not onset time. The onset for a particular leaf is determined by its placement in the larger tree, whose intermediate nodes consist of two types of connectors: parallel and sequential. Sequentially composed subtrees are interpreted as following each other in time, and those composed in parallel are assumed to have the same start time (but not necessarily the same end time). Any pattern of notes that can be represented in the standard MIDI file format (which follows a list-based representation of time-stamped events) can also be represented as one of Euterpea’s trees.333This is true for notes as they would appear on a score. Performance features, such as pitch bends and aftertouch, do not have a corresponding representation in Euterpea. However, it is frequently the case that a given musical concept can have more than one tree representation, making comparison of equality between tree structures and identification of shared features problematic. It may also not be cognitively natural to nest relationships in a strictly binary way—sometimes we may think of three or more items as having the same level of hierarchy, more like a list. Here, we use an n-ary tree representation (each node essentially being a list) to mitigate this issue in developing a framework to connect to NLP systems.
The structures and operations we use here are also very similar to some representations used in Kulitta , which is a Haskell-based framework for automated composition in multiple styles. This lends our implementation to being useful for automated composition tasks, while also allowing querying of data structures in a NLP-compatible way.
2.2 Modeling Natural Language Semantics
A key constraint for MusECI is that it must accommodate the semantics of how music is described in natural language. To accomplish this, we take inspiration from computational languages designed for representing the natural language semantics of visual and physical environments, such as VoxML .
The VoxML modeling language encodes semantic knowledge of real-world objects represented as three-dimensional physical models along with events and attributes that are related and change through actions and interactions. The framework encodes background knowledge about how interactions between objects affect their properties, allowing simulation to fill in information missing from language input. With this representation, natural language expressions describing scenes, events, and actions on objects can be specified in enough detail to be simulated and visualized. We similarly seek a language in which musical concepts can be expressed in building-blocks that stand in close correspondence to natural language expressions. The framework must capture enough detail to support expressing background music knowledge typically assumed in order to interpret natural language descriptions of music as well-defined musical structures that can be represented as scores and played—a kind of “music simulation”.
3 Musical Primitives
Our lowest level of representations for Musical ECIs are called musical primitives: concepts that represent fundamental musical units. Because we are focused on music at the level of a paper score, our primitives are based on features that appear on scores: notes and rests.
3.1 Symbolic Primitives
MusECI’s atomic musical concepts are related to pitch and duration, although these, by themselves, cannot be directly represented on a musical score. Currently we only consider Western tonal systems based on the chromatic scale and therefore represent pitch information using integers. Since some of the atomic concepts are best represented as numbers, we make use of two number types: and . We begin with the following two pitch-related atomic primitives, both of which are simply musical interpretations of numbers:
These then yield the following derived primitives, which are built on atomic primitives.
where refers to a collection of environmental information and other labels used for resolving ambiguity. For , the Context would need to include information about the current Scale. The pitch number for a pitch can be computed by , where has different standards in different areas of the literature.444Octave zero on a piano keyboard is not standardized and varies between musical programming languages. Sometimes such that (C,0) is pitch number 0, but it is also common to have and occasionally even depending on the particular application.
The following two atomic primitives are building-blocks for metrical information:
Derived primitives also exist for metrical information:
Pitch and metrical primitives are composed to produce the concepts of Note and a Rest, which are similar to the lowest levels of representation in Euterpea, JythonMusic and Music21. We define the set of all notes and rests to include the following primitives:
We denote the set of all such symbols as . For brevity, in examples in this paper we will use to refer to the Python-style constructor for and, in later sections, both pitch and metrical values will be shown only as tuples. An important difference between our approach and many other musical representations it that we do not require values to be declared for all properties of these concepts. The high degree of optional information in these constructs is needed to reflect the various ways in which we speak about music, which can involve many levels of abstraction and also may be ambiguous. For example, using “_” to indicate a blank field, we may capture the concept of “the C in measure 3” as follows555In MusECI, measure and beat numbers both index from zero. Therefore, “measure 3” produces a value of 2.:
This specifies a template that can be matched against a musical representation with more concrete information to identify the particular note being referred to. In MusECI, we represent “_” using Python’s None value. This variable degree of information is useful for specifying queries over musical structure and also for communicating musical ideas at a variety of levels of abstraction that may not have a uniform level of detail over the entire score.
The Contexts field of Note and Rest indicates additional labels or other contextual information that may be useful for querying. Our implementation also supports volume as a note property, but it is omitted from the descriptions and examples in this paper.
3.2 Connecting Primitives
MusECI uses Euterpea-inspired representations for connecting musical structures in sequence and in parallel. To minimize the issue of having multiple tree structures associated with Euterpea’s binary trees, we use n-ary trees. This permits notes in a melody or chord that have the same level of conceptual hierarchy to appear at the same level within the representation. We also define normal forms for grouping structures from a score format such as MusicXML. The set of all possible sequential and parallel structures, which we will call , is defined recursively.
We will use the square bracket notation, , to denote lists of items. Therefore, “the three note melody in part B” could be represented as:
We will use the term Music value to refer to any value that is in . In other words, a Music value is some entity that could be drawn on a score, whether a single note, a rest, a melody, a chord, etc. Later we describe a collection of operations that can be applied to any of these values.
3.3 MIDI and MusicXML Conversion
MusECI is able to parse both MIDI and MusicXML files into a normal form with the representations discussed so far. We use the following algorithm to convert MIDI and MusicXML information into our system’s representations:
Greedily group Notes with the same onset and duration with .
Greedily group temporally adjacent structures (potentially including chords from step 1) with .
Group any remaining temporally separated, but still sequential structures with , using to fill temporal gaps.
Group any leftover items under a global .
Importantly, this normal form for reading MusicXML is not the only way to represent musical features. Other normal forms are possible, and we hope to improve our normal forms through learning in later work, such that nested structures within melodies and phrases are identified in genre-specific ways.
An important operation for composition by conversation is selection. Much like the SELECT statement from SQL, which searches a database of tables for entries matching a query, we wish to scan over a musical data structure and return references to the correct portions of it.
We use pattern matching as an integral part of our selection process, which involves checking to see whether concrete definitions match features present in an incomplete or abstract definition.
MusECI defines the operation as a function that takes two values: one to use as a pattern to search for (a query) and the other to pattern match against. The “_” values match any concrete value. For example, the statement will select all notes from the value,, and return a collection of references to the notes. Returning references in an object-oriented style allows for , to be updated immediately after relevant parts of it are located:
where turns a parse of a natural language sentence into a query and an operation to perform on its result.
By leveraging Python’s functional language features, MusECI also permits conditionals and classifiers as part of query statements. For example, finding notes above octave 3, can be expressed as. Selection can also be done for more complex queries using and .
Support for music creation and generation requires that certain standard operators be defined to manipulate the musical data structures. The following are some examples of operations in MusECI that we use in examples later on:
and : generalized musical inversion (flipping upside down along the pitch axis) at the first pitch and a specified pitch respectively.
: reverse a musical structure.
: transpose a Music value by some number of half steps, either chromatically or according to scale degrees depending on the tonal context given.
6 Integration with TRIPS Parser
In our work we use the TRIPS natural language parser. TRIPS is a best-first bottom-up chart parser that integrates a grammar and lexicon to encode both syntactic and semantic features. These features are used to disambiguate and produce alogical form representation that captures the semantic roles of terms in the utterance . Figure 1 shows the logical form graph that TRIPS produces after parsing the phrase, “Move the B in measure 2 up an octave.” (Many details of the TRIPS logical form representation have been removed for presentation.) Labels on arcs (surrounded by ‘’) represent semantic roles, and bold-face labels represent lexical terms preceded by their semantic type (e.g., B is a MusicNote).
The logical form produced by TRIPS allows us to map from graphical patterns in the logical form to classes of operations on musical representations. From the example in Figure 1, a Move action in the direction Up affecting a Note corresponds to a transposition operation applied to our music representation. The source location of the Note specifies the query for a note with a pitch class of B and an onset within measure 2. Finally, moving up one octave corresponds to transposing up 12 half steps. Together, this information specifies the following operation:
6.1 Resolving Ambiguities
Natural language is extremely sensitive to conversational history. Sentences such as “give that to him” are impossible to semantically resolve without context for mapping pronouns to entities. In a musical setting, references such as “the C next to it” also suffer from this issue.
Even with sufficient context to resolve references like pronouns, there may be insufficient detail to locate a precise portion of a musical concept. Consider the pitch classes in the opening refrain for “Twinkle Twinkle Little Star:” C, C, G, G, A, A, G. Requesting to “move the G up a half step” creates an ambiguity about which G should be moved, since will find three matches. However, “the G” implies we only want to operate on one of them.
A system for addressing referential ambiguity requires two features: a working memory of recent references, and an assumer algorithm or module that explores the working memory using features from a parse tree to resolve ambiguities. That resolution step may be straightforward or it may involve requesting additional information from the user before proceeding. Disambiguation, therefore, has the following workflow, where is the target Music value.
The creation of an appropriate assumer module is an area of ongoing work in conjunction with the TRIPS parser integration. Because of this, the examples presented in the next section are designed to avoid the two types of referential ambiguity described here. This means that all of the examples can be parsed directly to code statements of the form with no need for ambiguity resolution before applying the operation derived from the natural language sentence.
Consider the two-measure melody shown in Figure 2, and suppose we wish to transform it into the melody shown in Figure 4 through natural language commands. A precise way to describe this
would be “move the F up a whole step” followed by “move the C on the first beat of measure two down a half step.” This will create the following interactions between the computer (C) and user (U).
C: = the melody in Figure 2.
U: “Move the F up a whole step.”
U: “Move the C on the first beat of measure two down a
Applied sequentially, these commands produce the melodies shown in Figures 3 and 4 respectively. The data structure for the starting melody is shown in Figure 7, along with the selection statements created by each of the commands and the resulting modification to the data structure.
Statements may also operate over collections of notes. The following conversation illustrates this with inversion and reversal of sections of the music (retrograde). Figures 5 and 6 show the result of each command in this conversation, and the data structure transformation is shown in Figure 8.
C: = the melody in Figure 2.
U: “Invert the notes in measure two around G4.”
U: “Reverse the notes in measure one.”
7 Conclusion and Future Work
We have proposed the MusECI framework for modeling music in a way that allows querying based on parsing of natural language statements. Our Python implementation of the musical data structures supports normal algorithmic composition in a text editor in addition to the composition by conversation use case in a way that is easily integrated with parsing systems like TRIPS.
Although our data structures are robust for representing complex musical structures, the overall system is currently a prototype limited to a fairly narrow collection of musical operations. This collection of operations needs to be expanded to feature more tree manipulation algorithms, such as different strategies for removing and re-arranging notes within a larger structure. Operations such as removal of individual notes can have several potential interpretations, such as replacement with a rest of the same duration and removal followed by time-shifting other elements of the data structure to close the gap. Both operations are valid, but which is most appropriate depends on the larger context in which it is used. We are currently working on a more diverse array of operations such as this to support a wider range of common score manipulations.
Expansion of this framework for NLP-based music composition and manipulation will require adaptation of NLP toolkits like the TRIPS parser. Standard parsers often do not have a suitable dictionary of musically-relevant definitions to draw on when assigning semantics to nouns and verbs. While terms like “transpose” and “reverse” can be interpreted correctly due to their usage in non-musical settings, correct interpretation of other terms is trickier. Without incorporation of music-specific terminology, the TRIPS parser identifies“C” and “F” as temperature scales rather than as pitch classes. Just as humans learning about music must have their vocabulary expanded, we are developing a music-specific ontology that can be incorporated in the TRIPS parser to support a more diverse range of musical concepts and operations to achieve a correct parse tree for sentences in a composition by conversation setting.
The addition of new representations for contexts is important for the creation of an assumer module. This requires a way to infer the referents of ambiguous words like “it,” “this,” and so on as well as keeping track of what spaces or metrics are currently in use.
Our prototype framework is the beginning of an integrated approach for handling natural language and musical concepts. Currently, MusECI and its integration into NLP systems is still a work in progress. However, we hope to eventually achieve real-time interactive programs capable of interacting with real musicians in a musical setting, similar to how AI on computers and mobile devices is currently able to respond to basic spoken commands in other domains. We also aim for this communication to ultimately become bi-directional and incorporate other aspects of musical artificial intelligence, perhaps even allowing the machine to critique its human user’s musical ideas or offer its own suggestions to help complete a musical project.
-  I. V. Serban et al., “Building End-to-end Dialogue Systems Using Generative Hierarchical Neural Network Models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 3776–3783.
-  J. Mandler, The foundations of mind: Origins of conceptual thought. Oxford University Press, 2004.
-  R. St.Amant et al., “An Image Schema Language,” in Proceedings of the International Conference on Cognitive Modeling, 2006.
-  J. Pustejovsky and N. Krishnaswamy, “VoxML: A Vistualization Modeling Language,” in Language Resources Evaluation Conference (LREC), 2016.
-  R. Boulanger, The Csound Book: Perspectives in Software Synthesis, Sound Design, Signal Processing,and Programming. MIT Press, 2000.
-  J. McCartney, “Rethinking the Computer Music Language: SuperCollider,” Computer Music Journal, vol. 26, no. 4, pp. 61–68, 2002.
-  A. R. Brown and A. Sorensen, “Interacting with Generative Music through Live Coding,” Contemporary Music Review, vol. 28, no. 1, pp. 17–29, 2009.
-  Cycling ’74. (2017) Tools for Sound, Graphics, and Interactivity. [Online]. Available: https://cycling74.com/
-  H. Nienhuys and J. Nieuwenhuizen, “LilyPond, a system for automated music engraving,” in Proceedings of the Colloquium on Musical Informatics, 2003.
-  M. Good, “MusicXML for Notation and Analysis,” in The Virtual Score: Representation, Retrieval, Restoration. Cambridge, MA: MIT Press, 2001, pp. 113–124.
-  C. Wang et al., “MuSQL: A Music Structured Query Language,” in Proceedings of the International Multimedia Modeling Conference, 2006, pp. 216–225.
-  Music21. [Online]. Available: http://web.mit.edu/music21/
-  B. Manaris and A. Brown, Making Music with Computers: Creative Programming in Python. New York, NY, USA: Chapman and Hall, 2014.
-  A. Brown, Making Music with Java. lulu.com, 2009.
-  P. Hudak et al. (2017) Euterpea. [Online]. Available: http://hackage.haskell.org/package/Euterpea
-  D. Quick, “Composing with Kulitta,” in Proceedings of International Computer Music Conference, 2015, pp. 306–309.
-  J. F. Allen, M. Swift, and W. de Beaumont, “Deep Semantic Analysis for Text Processing,” in Proceedings of the Symposium on Semantics in Systems for Text Processing, 2008.