Learning the helix topology of musical pitch

10/22/2019 ∙ by Vincent Lostanlen, et al.

To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.






1 Introduction

Listening to a sequence of two pure tones elicits a sensation of pitch going “up” or “down”, correlating with changes in fundamental frequency ($f_0$). However, contrary to pure tones, natural pitched sounds contain a rich spectrum of components in addition to $f_0$. Neglecting inharmonicity, these components are tuned to an ideal Fourier series whose modes resonate at integer multiples of the fundamental: $2f_0$, $3f_0$, and so forth. By the change of variable $f_0' = 2f_0$, it appears that all even-numbered partials $2f_0 = f_0'$, $4f_0 = 2f_0'$, $6f_0 = 3f_0'$, and so forth make up the whole Fourier series of a periodic signal whose fundamental frequency is $f_0'$. Figure 1 illustrates, in the case of a synthetic signal with perfect harmonicity, that filtering out all odd-numbered partials $(2k+1)f_0$ for integer $k$ results in a perceived pitch that morphs from $f_0$ to $2f_0$, i.e., up one octave [deutsch2008jasa].
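The change of variable above can be checked numerically. The following is a minimal sketch, assuming unit-amplitude sine partials with zero phase; the function name `harmonic_tone`, the sampling rate, and the fundamental frequency are illustrative choices, not taken from the article.

```python
import numpy as np

def harmonic_tone(f0, n_partials, sr, duration, keep=lambda k: True):
    """Sum of unit-amplitude sine partials at integer multiples k * f0."""
    t = np.arange(int(sr * duration)) / sr
    partials = [np.sin(2 * np.pi * k * f0 * t)
                for k in range(1, n_partials + 1) if keep(k)]
    return np.sum(partials, axis=0)

# Hypothetical sampling rate (Hz) and fundamental frequency (Hz).
sr, f0 = 8000, 100.0

# Keep only the even-numbered partials of f0: 200, 400, 600, 800 Hz.
even_only = harmonic_tone(f0, 8, sr, 0.1, keep=lambda k: k % 2 == 0)

# Full harmonic series of a tone one octave higher: also 200, 400, 600, 800 Hz.
octave_up = harmonic_tone(2 * f0, 4, sr, 0.1)

# Both constructions yield the same waveform.
same_signal = np.allclose(even_only, octave_up)

# The waveform repeats every sr / (2 * f0) samples, i.e., at twice the original rate.
period = int(sr / (2 * f0))
is_periodic = np.allclose(even_only[period:], even_only[:-period])
```

Under these assumptions, removing the odd-numbered partials leaves a signal that is indistinguishable from a harmonic tone at twice the fundamental frequency.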

Such ambivalence brings about well-known auditory paradoxes: tones that have a definite pitch class but lack a pitch register [deutsch2010acoustics]; glissandi that seem to ascend or descend endlessly [risset1969jasa]; and tritone intervals whose pitch directionality depends on prior context [pelofi2017philtrans]. To explain them, one may roll up the frequency axis onto a spiral or helix which makes a full turn at each octave, thereby aligning power-of-two harmonics onto the same radii.
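As an illustration of this rolling-up operation, the sketch below maps frequencies onto a helix of unit radius that makes one full turn per octave. The function name `helix_coordinates` and the reference frequency of 440 Hz are assumptions made for the example; any reference frequency would serve.

```python
import numpy as np

def helix_coordinates(freq_hz, ref_hz=440.0):
    """Map frequencies onto a unit-radius helix with one full turn per octave.

    ref_hz is an arbitrary (hypothetical) reference that sets azimuth zero.
    Returns an array of shape (n, 3): x, y (pitch class) and z (pitch height).
    """
    octaves = np.log2(np.asarray(freq_hz, dtype=float) / ref_hz)
    azimuth = 2 * np.pi * octaves
    return np.stack([np.cos(azimuth), np.sin(azimuth), octaves], axis=-1)

# Three frequencies related by octaves (power-of-two ratios).
points = helix_coordinates([220.0, 440.0, 880.0])
```

In this parametrization, the three points share the same horizontal coordinates and differ only in height, which is the geometric counterpart of octave equivalence.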

The consonance of octaves has implications in several disciplines. In music theory, it allows for pitches to be grouped into pitch classes: e.g., in European solfège, do re mi etc. eventually “circle back” to do. In ethnomusicology, it transcends the boundaries between cultures, to the point of being held as a nearly universal attribute of music [burns1999chapter]. In neurophysiology, it explains the functional organization of the central auditory cortex [warren2003pnas]. In music information research, it motivates the design of chroma features, i.e., a representation of harmonic content that is meant to be equivariant to parallel chord progressions but invariant to chord inversions [muller2015fundamentals, chapter 5].

Despite the wealth of evidence for the crucial role of octaves in music, there is, to this day, no data-driven criterion for assessing whether a given audio corpus exhibits a property of octave equivalence. Rather, the disentanglement of pitch chroma and pitch height in time–frequency representations relies on domain-specific knowledge about music [tymoczko2006science]. More generally, although the induction of priors on the topology of pitch is widespread in symbolic music analysis [bigo2016book], few of them directly apply to audio signal processing [chuan2005icme].

Yet, in recent years, the systematic use of machine learning methods has progressively reduced the need for domain-specific knowledge in several other aspects of auditory perception, including mel-frequency spectrum [zeghidour2018icassp] and adaptive gain control [wang2017trainable]. It remains to be known whether octave equivalence can, in turn, be discovered by a machine learning algorithm, instead of being engineered ad hoc. Furthermore, it is unclear whether octave equivalence is an exclusive characteristic of music or whether it may extend to other kinds of sounds, such as speech or environmental soundscapes.

Figure 1: Two continuous trajectories in pitch space, either by octave glissando (left) or by attenuation of odd-numbered partials (right). Darker shades indicate larger magnitudes of the constant-Q transform. The two axes respectively correspond to time and log-frequency. Vertical ticks denote octaves. See Section 1 for details.

In this article, we conduct an unsupervised manifold learning experiment, inspired by the protocol of “learning the 2-D topology of images” [leroux2008neurips], to visualize the helix topology of constant-Q spectra. Starting from a dataset of isolated notes from various musical instruments, we run the Isomap algorithm to represent each of these frequency subbands as a dot in a 3-D space, wherein spatial neighborhoods denote high correlation in loudness. Contrary to natural images, we find a mismatch between the physical dimensionality of natural acoustic spectra (i.e., 1-D) and their statistical dimensionality (i.e., 3-D or greater).

The companion website of this paper (https://wp.nyu.edu/birdvox/lostanlen2019icassp) contains a Python package to visualize octave equivalence in audio data.

Figure 2: Functional diagram of the proposed method, comprising: constant-Q transform, data matrix, extraction of Pearson correlations, shortest path distance matrix, and Isomap eigenbasis for the set of vertices. Darker shades in the correlation and distance matrices indicate larger absolute values of the Pearson correlation and distance respectively. The hue of colored dots and the solid grey line respectively denote pitch chroma and pitch height. The first three dimensions in the Isomap embedding explain 36%, 35%, and 9% of the total variance respectively. Note that our method is unsupervised: neither pitch chroma nor pitch height is directly supplied to the Isomap manifold learning algorithm. See Section 2 for details.

2 Isomap embedding of subband correlations

2.1 Constant-Q transform and loudness mapping

Given a corpus of audio signals in arbitrary order, we define their constant-Q transforms (CQT) as the convolutions with wavelets , where is a scale parameter:


The wavelet filterbank in the operator covers a range of octaves. In the case of musical notes, we define a region of interest in each as the frame of highest short-term energy at a scale of . We compute their scalogram representation as the vector of CQT modulus responses at some , for discretized values :


where the integer ranges from to . Then, we apply a pointwise logarithmic compression to map each magnitude coefficient onto a decibel-like scale, and clip it to :


Unless stated otherwise, we set to a Hann window, the quality factor to , and the number of octaves to 3 in the following. We compute constant-Q transforms with librosa v0.6.1 [mcfee2018librosa]. Section 3.4 will discuss the effect of alternative choices for the loudness mapping function.
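The loudness mapping step can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation: the decibel reference (the maximum magnitude of the input), the clipping floor `floor_db`, and the function names are hypothetical. The text only specifies a pointwise logarithmic compression onto a decibel-like scale followed by clipping, and Section 3.4 mentions a cubic root alternative.

```python
import numpy as np

def log_loudness(magnitudes, floor_db=-80.0):
    """Pointwise log compression to decibels, clipped from below.

    Assumptions (not from the article): decibels are measured relative to
    the largest magnitude in the input, which must be positive, and
    floor_db is a placeholder value for the clipping threshold.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    ref = magnitudes.max()
    with np.errstate(divide="ignore"):  # log10(0) = -inf is clipped below
        db = 20.0 * np.log10(magnitudes / ref)
    return np.maximum(db, floor_db)

def cubic_root_loudness(magnitudes):
    """Alternative pointwise compression by cubic root."""
    return np.cbrt(np.asarray(magnitudes, dtype=float))

# Toy magnitudes: full scale, one tenth of full scale, and silence.
compressed = log_loudness([1.0, 0.1, 0.0], floor_db=-80.0)
```

Here, the silent coefficient is mapped to the floor instead of minus infinity, which keeps the subsequent correlation estimates finite.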

2.2 Pearson autocorrelation of log-magnitude spectra

In accordance with [leroux2008neurips], we express the similarity between two features and in terms of their squared Pearson correlation . We begin by recentering each feature to null mean, yielding the matrix


and then compute squared cosine similarities on all pairs



Let be the set of all features in . Adopting a manifold learning perspective, we may regard as the values of a standard Gaussian kernel . Inverting the identity leads to a pseudo-Euclidean distance
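The two steps above can be sketched with NumPy. The sketch assumes a data matrix `X` with one row per recording and one column per frequency subband, and assumes the kernel convention rho2 = exp(-d**2 / 2). Both are our assumptions for illustration. A different scaling of the kernel would change the distance values but not the nearest-neighbor sets, since the mapping from correlation to distance remains monotonically decreasing.

```python
import numpy as np

def squared_pearson(X):
    """Squared Pearson correlations between the columns (features) of X.

    X has shape (n_recordings, n_subbands); the output has shape
    (n_subbands, n_subbands). Constant columns are not handled.
    """
    Xc = X - X.mean(axis=0, keepdims=True)   # recenter each feature to null mean
    norms = np.linalg.norm(Xc, axis=0)
    cosine = (Xc.T @ Xc) / np.outer(norms, norms)
    return np.clip(cosine ** 2, 0.0, 1.0)    # guard against rounding above 1

def kernel_distance(rho2, eps=1e-12):
    """Invert rho2 = exp(-d**2 / 2) (assumed kernel convention).

    eps is a hypothetical guard against log(0) for uncorrelated features.
    """
    return np.sqrt(-2.0 * np.log(np.maximum(rho2, eps)))

# Toy data: 50 "recordings" and 6 "subbands" of random values.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
rho2 = squared_pearson(X)
D = kernel_distance(rho2)
```

With this convention, perfectly correlated features lie at distance zero, and weakly correlated features lie far apart.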


2.3 Shortest path distance on the K-nearest neighbor graph

Following the Isomap manifold learning algorithm [tenenbaum2000science], we compute the K nearest neighbors of each feature as the set of features that minimize the distance , i.e., maximize the squared Pearson correlation . We construct a K-nearest neighbor graph whose vertices are and whose adjacency matrix is equal to if or if , and infinity otherwise. We run Dijkstra’s algorithm on to measure geodesics on the manifold induced by . These geodesics yield a shortest path distance function over :


If (and only if) has more than one connected component, then is infinite over pairs of mutually unreachable vertices. On some datasets, may lead to a disconnected K-nearest neighbor graph , especially for small K, thereby causing numerical aberrations in Isomap. However, we make sure that the effective bandwidth of the wavelet is large enough in comparison with to yield strong correlations for all , and thus . With this caveat in mind, we postulate that is finite in all of the following. We set in the following unless stated otherwise.
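The graph construction and the connectivity caveat can be sketched with SciPy. This is a hedged illustration: the function name `knn_geodesics` is hypothetical, the graph is symmetrized so that an edge exists whenever either vertex is among the K nearest neighbors of the other, and the dense convention of `scipy.sparse.csgraph` is used, in which a zero entry denotes the absence of an edge (hence distinct vertices are assumed to lie at nonzero distance).

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, shortest_path

def knn_geodesics(D, n_neighbors):
    """Shortest-path distances on the symmetrized K-nearest neighbor graph.

    D is a symmetric matrix of pairwise distances with a zero diagonal.
    Returns the geodesic distance matrix and the number of connected
    components; unreachable pairs are at infinite distance.
    """
    n = D.shape[0]
    W = np.zeros((n, n))  # dense csgraph convention: 0 means "no edge"
    for i in range(n):
        # The first index of the argsort is assumed to be i itself,
        # because D[i, i] = 0 is the smallest entry of row i.
        neighbors = np.argsort(D[i])[1:n_neighbors + 1]
        W[i, neighbors] = D[i, neighbors]
        W[neighbors, i] = D[neighbors, i]
    n_components, _ = connected_components(W, directed=False)
    geodesics = shortest_path(W, method="D", directed=False)  # Dijkstra
    return geodesics, n_components

# Toy example: six points evenly spaced along a line.
positions = np.arange(6.0)
D_line = np.abs(np.subtract.outer(positions, positions))
geodesics, n_components = knn_geodesics(D_line, n_neighbors=2)
```

Checking that `n_components` equals one before running classical MDS is a simple way to detect the disconnected case discussed above.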

2.4 Classical multidimensional scaling and 3-D embedding

Let . Classical multidimensional scaling (MDS) diagonalizes where [torgerson1952psychometrika]. We denote by and the respective eigenvectors and eigenvalues of , satisfying . We rank eigenvalues in decreasing order, without loss of generality. Lastly, we display the Isomap embedding as a 3-D scatter plot with Cartesian coordinates for every vertex . We compute Isomap using scikit-learn v0.21.3 [pedregosa2011scikitlearn].
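For reference, classical MDS can be sketched in a few lines of NumPy. This is a textbook formulation under our own assumptions (double centering of squared distances, coordinates scaled by the square root of each eigenvalue), and the function name `classical_mds` is hypothetical. It is not a transcription of the scikit-learn implementation used in the article, and the scaling of coordinates may differ from the article's convention.

```python
import numpy as np

def classical_mds(distances, n_components=3):
    """Embed points from a matrix of pairwise distances by classical MDS."""
    n = distances.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (distances ** 2) @ J        # double-centered squared distances
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    # Rank eigenvalues in decreasing order and keep the leading ones.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are truncated to zero.
    scale = np.sqrt(np.maximum(eigenvalues[order], 0.0))
    return eigenvectors[:, order] * scale

# Toy check: points in 3-D are recovered up to a rigid transformation.
rng = np.random.default_rng(0)
P = rng.standard_normal((10, 3))
D_true = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
Y = classical_mds(D_true, n_components=3)
D_embedded = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
```

When the input distances are geodesics on a neighborhood graph rather than Euclidean distances, the embedding is only approximate, which is why the fraction of variance explained by each dimension is worth reporting.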

(a) Trumpet in C only.
(b) Harp only.
Figure 5: Pearson correlation matrices (left) and Isomap embeddings (right) for two instruments in the SOL dataset: Trumpet in C (a) and Harp (b). In both cases, the number of neighbors is set to and the loudness mapping is logarithmic. See Section 3.2 for details.

3 Experiments with musical sounds

3.1 Main protocol

We extract from the isolated musical notes played by eight instruments in the SOL dataset [ballet1999jim]: accordion, alto saxophone, bassoon, flute, harp, trumpet in C, and cello. For each of these instruments, we include three levels of intensity dynamics: pp, mf, and ff. We include all available pitches in the tessitura (, , ). We exclude extended playing techniques, because some of them may lack a discernible pitch class [lostanlen2018dlfm], thus resulting in a total of audio recordings for the ordinario technique. We hypothesize that cross-correlations across octaves are weaker than cross-correlations along the log-frequency axis. Because the helix topology of musical pitch relies on both kinds of correlation, we set the number of neighbors to .

Figure 2 illustrates our protocol and main finding. Isomap produces a quasi-perfect cylindrical manifold in dimension three. We color each dot according to a hue of radians. Furthermore, we draw a segment between each and its upper adjacent subband . Once these visual elements are included in the display of the scatter plot , the cylindrical manifold appears to coincide with the Drobisch-Shepard helix in music psychology [shepard1964jasa, lerdahl2004book]. Indeed, hues appear to align on the same radii, whereas the grey line grows monotonically with . This result demonstrates that, even without prior knowledge about the perception of musical pitch, it is possible to discover octave equivalence in a purely data-driven fashion, by computing the graph of greatest cross-correlations between CQT magnitudes in a corpus of isolated musical notes.

3.2 Varying the instrumentarium

We reproduce the main protocol on subsets of the SOL dataset, involving a single instrument at a time. The bright timbre of brass instruments correlates with relatively loud high-order partials [poirson2005jasa], resulting in large octave equivalence and a helical topology (see Figure 5 (a)). In contrast, harp tones carry little energy at twice the fundamental; yet, they induce sympathetic resonance along the soundboard, which affects nearby strings predominantly [lecarrou2009actaacustica]. These two phenomena in combination favor semitone correlations over octave correlations, resulting in a rectilinear topology (see Figure 5 (b)).

3.3 Varying the number of neighbors

We reproduce the main protocol with varying K-nearest neighbor graphs. Setting results in multiple lobes, each corresponding to an octave, and connected at a single pitch class (see Figure 8). This confirms our hypothesis that large semitone correlations outnumber large octave correlations. Conversely, setting results in a topology that is more intricate than the Drobisch-Shepard helix, involving correlations across perfect fourths and fifths.

(a) neighbors.
(b) neighbors.
Figure 8: Isomap embeddings for varying K-nearest neighbor graphs: (left) and (right). In both cases, the Pearson correlation matrix results from all instruments in the SOL dataset and the loudness mapping is logarithmic. See Section 3.3 for details.
(a) Linear loudness mapping function.
(b) Cubic root loudness mapping function.
Figure 11: Pearson correlation matrices (left) and Isomap embeddings (right) for linear (a) and cubic root (b) loudness mappings. In both cases, the Pearson correlation matrix results from all instruments in the SOL dataset and the number of neighbors is set to . See Section 3.4 for details.

3.4 Varying the loudness mapping function

We reproduce the main protocol with varying loudness mappings (see Figure 11). On one hand, setting the loudness mapping to the identity function yields a K-nearest neighbor graph in which octave correlations are numerically negligible, except in the lower register. This results in an Isomap embedding which is circular in the bottommost octave and irregular in the topmost octaves. On the other hand, setting the loudness mapping to the cubic root function yields a helical topology. This experiment demonstrates the need for nonlinear loudness compression in the protocol, in contrast with [leroux2008neurips], which relied on raw grayscale intensities in handwritten digits.

4 Extension to speech and urban sounds

4.1 Experiment with speech data

We analyze the North Texas vowel database (NTVOW), which contains utterances of 12 English vowels from 50 American speakers, including children aged three to seven as well as male and female adults [assmann2000jasa]; resulting in a total of audio recordings. As seen in Figure 14 (a), which includes the data from all age groups, the embedding exhibits pitch circularity in the mid-frequency range, but not in the low-frequency and high-frequency ranges. This is because the distribution of fundamental frequencies in human speech is polarized around certain pitch classes, rather than distributed uniformly over the chromatic scale.

4.2 Experiment with environmental audio data

We analyze a portion of the SONYC Urban Sound Tagging dataset (SONYC-UST v0.4), which contains a collection of 3068 acoustic scenes from a network of autonomous sensors in various locations of New York City [cartwright2019sonyc]. We restrict this collection to the acoustic scenes in which the consensus of expert annotators has confirmed the absence of both human speech and music. As a result of this preliminary curation step, we obtain audio recordings from eight different sensor locations. Each of these scenes contains one or several sources of urban noise pollution, among which: engines, machinery and non-machinery impacts, powered saws, alert signals, and dog barks. Figure 14 (b) shows that no discernible correlations across octaves are observed in this dataset. This finding confirms the conclusions of a previous publication [muller2011jstsp], which stated that “music audio signal processing techniques must be informed by a deep and thorough insight into the nature of music itself”.

(a) Spoken English vowels (NTVOW dataset).
(b) Urban sounds (SONYC-UST excluding speech and music).
Figure 14: Extension of the proposed method to speech (a) and environmental soundscapes (b). In both cases, the number of neighbors is set to and the loudness mapping is logarithmic. See Section 4 for details.

5 Conclusion

The Isomap manifold learning algorithm offers an approximate visualization of nearest neighbor relationships in any non-Euclidean metric space. Thus, a previous publication [leroux2008neurips] proposed to apply Isomap to cross-correlations between grayscale intensities in natural images. In this article, we have borrowed the protocol of this publication and applied it to music data.

Despite their methodological resemblance, the two studies lead to different insights. While [leroux2008neurips] recovered a quasi-uniform raster, we do not recover a straight line from cross-correlations along the log-frequency axis. Instead, we obtain a cylindrical lattice in dimension three. Assigning pitch classes to subbands reveals that this lattice is akin to a Drobisch-Shepard helix [shepard1964jasa], which makes a full turn at each octave. Thus, whereas [leroux2008neurips] learned a 2-D topology from 2-D data, we learned a 3-D topology from 1-D data. Furthermore, after benchmarking several design choices, we deduce that the most regular helical shape results from: a diverse instrumentarium; a graph of nearest neighbors; and a logarithmic loudness mapping. Lastly, we have discussed the limitations of our findings: although spoken vowels also exhibit a quasi-helical topology in subband neighborhoods, the same cannot be said of urban noise.

Beyond the realm of manifold learning, the present article motivates the development of weight sharing architectures that foster octave equivalence in deep representations of music data. Three examples of such architectures are: spiral scattering transform [lostanlen2015dafx]; spiral convolutional networks [lostanlen2016ismir]; and harmonic constant-Q transform [bittner2017ismir]. Future work will extend this protocol to bioacoustic data, and compare the influence of species-specific vocalizations on the empirical topology of the frequency domain.

6 Acknowledgment

The authors wish to thank Stéphane Mallat for helpful discussions.