Listening to a sequence of two pure tones elicits a sensation of pitch going “up” or “down”, correlating with changes in fundamental frequency ($f_0$). However, contrary to pure tones, natural pitched sounds contain a rich spectrum of components in addition to $f_0$. Neglecting inharmonicity, these components are tuned to an ideal Fourier series whose modes resonate at integer multiples of the fundamental: $2f_0$, $3f_0$, and so forth. By the change of variable $f_0' = 2f_0$, it appears that all even-numbered partials $2f_0 = f_0'$, $4f_0 = 2f_0'$, $6f_0 = 3f_0'$, and so forth make up the whole Fourier series of a periodic signal whose fundamental frequency is $f_0'$. Figure 1 illustrates, in the case of a synthetic signal with perfect harmonicity, that filtering out all odd-numbered partials $(2p+1)f_0$ for integer $p \geq 0$ results in a perceived pitch that morphs from $f_0$ to $f_0' = 2f_0$, i.e., up one octave [deutsch2008jasa].
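The argument above can be checked numerically. The following is a minimal sketch, not the code behind Figure 1: the sample rate, the fundamental frequency, the number of partials, and the helper names `harmonic_tone` and `shortest_period` are all illustrative assumptions. It synthesizes a perfectly harmonic tone, removes the odd-numbered partials, and measures the shortest lag at which each waveform repeats.

```python
import numpy as np

def harmonic_tone(f0, partials, sr=8000, duration=0.5):
    """Sum unit-amplitude cosines at the given integer multiples of f0."""
    t = np.arange(int(sr * duration)) / sr
    return sum(np.cos(2 * np.pi * p * f0 * t) for p in partials)

def shortest_period(x, sr, max_lag):
    """Return the smallest lag (in seconds) at which x repeats itself."""
    for lag in range(1, max_lag + 1):
        if np.allclose(x[lag:], x[:-lag], atol=1e-6):
            return lag / sr
    return None

sr, f0 = 8000, 100.0  # arbitrary values chosen for illustration
full = harmonic_tone(f0, range(1, 9), sr)          # f0, 2 f0, ..., 8 f0
even_only = harmonic_tone(f0, range(2, 9, 2), sr)  # 2 f0, 4 f0, 6 f0, 8 f0

period_full = shortest_period(full, sr, max_lag=200)
period_even = shortest_period(even_only, sr, max_lag=200)
print(period_full, period_even)
```

Under these assumptions, the tone containing only even-numbered partials repeats twice as often as the full tone, which is consistent with a fundamental frequency that is one octave higher.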
Such ambivalence brings about well-known auditory paradoxes: tones that have a definite pitch class but lack a pitch register [deutsch2010acoustics]; glissandi that seem to ascend or descend endlessly [risset1969jasa]; and tritone intervals whose pitch directionality depends on prior context [pelofi2017philtrans]. To explain them, one may roll up the frequency axis onto a spiral or helix which makes a full turn at each octave, thereby aligning power-of-two harmonics onto the same radii.
The consonance of octaves has implications in several disciplines. In music theory, it allows for pitches to be grouped into pitch classes: e.g., in European solfège, do re mi etc. eventually “circle back” to do. In ethnomusicology, it transcends the boundaries between cultures, to the point of being held as a nearly universal attribute of music [burns1999chapter]. In neurophysiology, it explains the functional organization of the central auditory cortex [warren2003pnas]. In music information research, it motivates the design of chroma features, i.e., a representation of harmonic content that is meant to be equivariant to parallel chord progressions but invariant to chord inversions [muller2015fundamentals, chapter 5].
Despite the wealth of evidence for the crucial role of octaves in music, there is, to this day, no data-driven criterion for assessing whether a given audio corpus exhibits a property of octave equivalence. Rather, the disentanglement of pitch chroma and pitch height in time–frequency representations relies on domain-specific knowledge about music [tymoczko2006science]. More generally, although the induction of priors on the topology of pitch is widespread in symbolic music analysis [bigo2016book], few of them directly apply to audio signal processing [chuan2005icme].
Yet, in recent years, the systematic use of machine learning methods has progressively reduced the need for domain-specific knowledge in several other aspects of auditory perception, including mel-frequency spectrum [zeghidour2018icassp] and adaptive gain control [wang2017trainable]. It remains to be known whether octave equivalence can, in turn, be discovered by a machine learning algorithm, instead of being engineered ad hoc. Furthermore, it is unclear whether octave equivalence is an exclusive characteristic of music or whether it may extend to other kinds of sounds, such as speech or environmental soundscapes.
In this article, we conduct an unsupervised manifold learning experiment, inspired by the protocol of “learning the 2-D topology of images” [leroux2008neurips], to visualize the helix topology of constant-Q spectra. Starting from a dataset of isolated notes from various musical instruments, we run the Isomap algorithm to represent each frequency subband as a dot in a 3-D space, wherein spatial neighborhoods denote high correlation in loudness. Contrary to natural images, we find a mismatch between the physical dimensionality of natural acoustic spectra (i.e., 1-D) and their statistical dimensionality (i.e., 3-D or greater).
The companion website of this paper (https://wp.nyu.edu/birdvox/lostanlen2019icassp) contains a Python package to visualize octave equivalence in audio data.
2 Isomap embedding of subband correlations
2.1 Constant-Q transform and loudness mapping
Given a corpus of audio signals in arbitrary order, we define their constant-Q transforms (CQT) as the convolutions with wavelets , where is a scale parameter:
The wavelet filterbank in the operator covers a range of octaves. In the case of musical notes, we define a region of interest in each as the frame of highest short-term energy at a scale of
. We compute their scalogram representation as the vector of CQT modulus responses at some, for discretized values :
where the integer ranges from to . Then, we apply a pointwise logarithmic compression to map each magnitude coefficient onto a decibel-like scale, and clip it to :
Unless stated otherwise, we set to a Hann window, the quality factor to , and the number of octaves to 3 in the following. We compute constant-Q transforms with librosa v0.6.1 [mcfee2018librosa]. Section 3.4 will discuss the effect of alternative choices for .
2.2 Pearson autocorrelation of log-magnitude spectra
In accordance with [leroux2008neurips], we express the similarity between two features and in terms of their squared Pearson correlation . We begin by recentering each feature to null mean, yielding the matrix
and then compute squared cosine similarities on all pairs:
Let be the set of all features in . Adopting a manifold learning perspective, we may regard as the values of a standard Gaussian kernel . Inverting the identity leads to a pseudo-Euclidean distance
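As an illustration of this step, the sketch below computes squared Pearson correlations between the rows of a feature matrix and converts them into distances. It is not the authors' implementation: the kernel normalization $\rho^2 = \exp(-d^2)$, the small floor `eps`, the random toy data, and the function names are assumptions made for the example. A different constant inside the exponential would only rescale all distances by a common factor.

```python
import numpy as np

def squared_pearson(X):
    """Squared Pearson correlation between the rows of X.

    X has shape (n_subbands, n_recordings); rows must not be constant.
    """
    Y = X - X.mean(axis=1, keepdims=True)       # recenter each feature
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    C = (Y @ Y.T) / (norms @ norms.T)            # cosine similarity of centered rows
    return np.clip(C ** 2, 0.0, 1.0)             # guard against rounding above 1

def correlation_distance(rho2, eps=1e-12):
    """Invert rho2 = exp(-d**2), an assumed normalization, to obtain d."""
    return np.sqrt(-np.log(np.maximum(rho2, eps)))

# Toy data: six hypothetical subbands observed over 200 recordings.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 200))
X[1] = 0.5 * X[0] + 3.0   # row 1 is an affine function of row 0

rho2 = squared_pearson(X)
D = correlation_distance(rho2)
```

In this toy example, rows 0 and 1 are perfectly correlated, so their distance is numerically zero, whereas weakly correlated rows are far apart.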
2.3 Shortest path distance on the k-nearest neighbor graph
Following the Isomap manifold learning algorithm [tenenbaum2000science], we compute the nearest neighbors of each feature as the set of features that minimize the distance , i.e., maximize the squared Pearson correlation . We construct a k-nearest neighbor graph whose vertices are and whose adjacency matrix is equal to if or if , and infinity otherwise. We run Dijkstra’s algorithm on to measure geodesics on the manifold induced by . These geodesics yield a shortest path distance function over :
If (and only if) has more than one connected component, then is infinite over pairs of mutually unreachable vertices. On some datasets, may lead to a disconnected k-nearest neighbor graph , especially for small , thereby causing numerical aberrations in Isomap. However, we make sure that the effective bandwidth of the wavelet is large enough in comparison with to yield strong correlations for all , and thus . With this caveat in mind, we postulate that is finite in all of the following. We set in the following unless stated otherwise.
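The graph construction and the connectivity caveat can be sketched as follows. This is an illustrative implementation under stated assumptions, not the one used in the experiments: the function names, the four toy points on a line, and the choice of running a heap-based Dijkstra search from every vertex are all hypothetical.

```python
import heapq
import numpy as np

def knn_graph(D, k):
    """Symmetric k-nearest neighbor graph: keep D[i, j] if j is among the k
    nearest neighbors of i, or i among those of j; infinity otherwise."""
    n = D.shape[0]
    A = np.full((n, n), np.inf)
    for i in range(n):
        neighbors = [j for j in np.argsort(D[i]) if j != i][:k]
        for j in neighbors:
            A[i, j] = A[j, i] = D[i, j]
    return A

def dijkstra(A, source):
    """Shortest path distances from `source` given edge weights A."""
    dist = np.full(A.shape[0], np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in np.flatnonzero(np.isfinite(A[u])):
            if d + A[u, v] < dist[v]:
                dist[v] = d + A[u, v]
                heapq.heappush(heap, (dist[v], int(v)))
    return dist

def geodesic_distances(D, k):
    A = knn_graph(D, k)
    return np.stack([dijkstra(A, s) for s in range(A.shape[0])])

# Toy example: two well-separated pairs of points on a line.
points = np.array([0.0, 1.0, 10.0, 11.5])
D = np.abs(points[:, None] - points[None, :])
G1 = geodesic_distances(D, k=1)   # two connected components
G2 = geodesic_distances(D, k=2)   # a single connected component
```

With a single neighbor per vertex, the two pairs are mutually unreachable and the corresponding shortest path distances are infinite; with two neighbors, all distances become finite.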
2.4 Classical multidimensional scaling and 3-D embedding
Let . Classical multidimensional scaling (MDS) diagonalizes where [torgerson1952psychometrika]. We denote by and, satisfying . We rank eigenvalues in decreasing order, without loss of generality. Lastly, we display the Isomap embedding as a 3-D scatter plot with Cartesian coordinates for every vertex . We compute Isomap using scikit-learn v0.21.3 [pedregosa2011scikitlearn].
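For reference, classical MDS can be sketched in a few lines of NumPy. The code below is a generic textbook formulation (double centering of squared distances followed by an eigendecomposition), not the scikit-learn code path used in this paper. The function name `classical_mds` and the random test configuration are assumptions.

```python
import numpy as np

def classical_mds(G, n_components=3):
    """Embed points from a matrix of pairwise distances G by classical MDS."""
    n = G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (G ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # largest first
    lam = np.maximum(eigvals[order], 0.0)    # discard small negative eigenvalues
    return eigvecs[:, order] * np.sqrt(lam)

# Sanity check on a hypothetical configuration that is truly 3-D.
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
G = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(G, n_components=3)
G_embedded = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

When the input distances are Euclidean distances between points in three dimensions, the embedding preserves all pairwise distances, up to a rigid motion of the point cloud. Shortest path distances on a graph are generally not Euclidean, in which case the three leading eigenvectors only provide an approximation.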
3 Experiments with musical sounds
3.1 Main protocol
We extract from the isolated musical notes played by eight instruments in the SOL dataset [ballet1999jim]: accordion, alto saxophone, bassoon, flute, harp, trumpet in C, and cello. For each of these instruments, we include three levels of intensity dynamics: pp, mf, and ff. We include all available pitches in the tessitura (, , ). We exclude extended playing techniques, because some of them may lack a discernible pitch class [lostanlen2018dlfm], thus resulting in a total of audio recordings for the ordinario technique. We hypothesize that cross-correlations across octaves are weaker than cross-correlations along the log-frequency axis. Because the helix topology of musical pitch relies on both kinds of correlation, we set the number of neighbors to .
Figure 2 illustrates our protocol and main finding. Isomap produces a quasi-perfect cylindrical manifold in dimension three. We color each dot according to a hue of radians. Furthermore, we draw a segment between each and its upper adjacent subband . Once these visual elements are included in the display of the scatter plot , the cylindrical manifold appears to coincide with the Drobisch-Shepard helix in music psychology [shepard1964jasa, lerdahl2004book]. Indeed, hues appear to align on the same radii, whereas the grey line grows monotonically with . This result demonstrates that, even without prior knowledge about the perception of musical pitch, it is possible to discover octave equivalence in a purely data-driven fashion, by computing the graph of greatest cross-correlations between CQT magnitudes in a corpus of isolated musical notes.
3.2 Varying the instrumentarium
We reproduce the main protocol on subsets of the SOL dataset, involving a single instrument at a time. The bright timbre of brass instruments correlates with relatively loud high-order partials [poirson2005jasa], resulting in strong octave equivalence and a helical topology (see Figure 5 (a)). In contrast, harp tones carry little energy at twice the fundamental; yet, they induce sympathetic resonance along the soundboard, which affects nearby strings predominantly [lecarrou009actaacustica]. These two phenomena in combination favor semitone correlations over octave correlations, resulting in a rectilinear topology (see Figure 5 (b)).
3.3 Varying the number of neighbors
We reproduce the main protocol with varying k-nearest neighbor graphs. Setting results in multiple lobes, each corresponding to an octave, and connected at a single pitch class (see Figure 8). This confirms our hypothesis that large semitone correlations outnumber large octave correlations. Conversely, setting results in a topology that is more intricate than the Drobisch-Shepard helix, involving correlations across perfect fourths and fifths.
3.4 Varying the loudness mapping function
We reproduce the main protocol with varying loudness mappings (see Figure 11). On the one hand, setting to the identity function yields a k-nearest neighbor graph in which octave correlations are numerically negligible, except in the lower register. This results in an Isomap embedding which is circular in the bottommost octave and irregular in the topmost octaves. On the other hand, setting to the cubic root function yields a helical topology. This experiment demonstrates the need for nonlinear loudness compression in the protocol, in contrast with [leroux2008neurips], which relied on raw grayscale intensities in handwritten digits.
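To make the effect of compression concrete, the sketch below compares the three kinds of loudness mapping on two hypothetical magnitude coefficients, one strong and one weak. The function names, the magnitudes, and the decibel floor `floor_db` are arbitrary placeholders rather than the settings of the experiments.

```python
import numpy as np

def identity(u):
    return np.asarray(u, dtype=float)

def cube_root(u):
    return np.cbrt(u)

def log_db(u, floor_db=-80.0):
    """Decibel-like mapping 20 log10(u), clipped from below at floor_db."""
    u = np.asarray(u, dtype=float)
    db = 20.0 * np.log10(np.maximum(u, 1e-12))
    return np.maximum(db, floor_db)

# Hypothetical magnitudes of a strong partial and a weak partial.
magnitudes = np.array([1.0, 1e-3])

ratio_identity = identity(magnitudes)[0] / identity(magnitudes)[1]
ratio_cbrt = cube_root(magnitudes)[0] / cube_root(magnitudes)[1]
gap_db = log_db(magnitudes)[0] - log_db(magnitudes)[1]
```

With the identity mapping, the weak coefficient is three orders of magnitude below the strong one and contributes little to the variance of its subband. The cubic root reduces this ratio to one order of magnitude, and the logarithmic mapping turns it into an additive gap of 60 dB.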
4 Extension to speech and urban sounds
4.1 Experiment with speech data
We analyze the North Texas vowel database (NTVOW), which contains utterances of 12 English vowels from 50 American speakers, including children aged three to seven as well as male and female adults [assmann2000jasa], resulting in a total of audio recordings. As seen in Figure 14 (a), which includes the data from all age groups, some pitch circularity is visible in the mid-frequency range, but not in the low- and high-frequency ranges. This is because the distribution of $f_0$ in human speech is polarized around certain pitch classes, rather than distributed uniformly over the chromatic scale.
4.2 Experiment with environmental audio data
We analyze a portion of the SONYC Urban Sound Tagging dataset (SONYC-UST v0.4), which contains a collection of 3068 acoustic scenes from a network of autonomous sensors in various locations of New York City [cartwright2019sonyc]. We restrict this collection to the acoustic scenes in which the consensus of expert annotators has confirmed the absence of both human speech and music. As a result of this preliminary curation step, we obtain audio recordings from eight different sensor locations. Each of these scenes contains one or several sources of urban noise pollution, among which: engines, machinery and non-machinery impacts, powered saws, alert signals, and dog barks. Figure 14 (b) shows that no discernible correlations across octaves are observed in this dataset. This finding confirms the conclusions of a previous publication [muller2011jstsp], which stated that “music audio signal processing techniques must be informed by a deep and thorough insight into the nature of music itself”.
5 Conclusion
The Isomap manifold learning algorithm offers an approximate visualization of nearest neighbor relationships in any non-Euclidean metric space. Thus, a previous publication [leroux2008neurips] proposed to apply Isomap to cross-correlations between grayscale intensities in natural images. In this article, we have borrowed the protocol of this publication and applied it to music data.
Despite their methodological resemblance, the two studies lead to different insights. While [leroux2008neurips] recovered a quasi-uniform raster, we do not recover a straight line from cross-correlations along the log-frequency axis. Instead, we obtain a cylindrical lattice in dimension three. Assigning pitch classes to subbands reveals that this lattice is akin to a Drobisch-Shepard helix [shepard1964jasa], which makes a full turn at each octave. Thus, whereas [leroux2008neurips] learned a 2-D topology from 2-D data, we learned a 3-D topology from 1-D data. Furthermore, after benchmarking several design choices, we deduce that the most regular helical shape results from: a diverse instrumentarium; a graph of nearest neighbors; and a logarithmic loudness mapping. Lastly, we have discussed the limitations of our findings: although spoken vowels also exhibit a quasi-helical topology in subband neighborhoods, the same cannot be said of urban noise.
Beyond the realm of manifold learning, the present article motivates the development of weight sharing architectures that foster octave equivalence in deep representations of music data. Three examples of such architectures are: spiral scattering transform [lostanlen2015dafx]; spiral convolutional networks [lostanlen2016ismir]; and harmonic constant-Q transform [bittner2017ismir]. Future work will extend this protocol to bioacoustic data, and compare the influence of species-specific vocalizations on the empirical topology of the frequency domain.
The authors wish to thank Stéphane Mallat for helpful discussions.