One or Two Components? The Scattering Transform Answers

03/02/2020 ∙ by Vincent Lostanlen, et al. ∙ NYU college IRIT

With the aim of constructing a biologically plausible model of machine listening, we study the representation of a multicomponent stationary signal by a wavelet scattering network. First, we show that renormalizing second-order nodes by their first-order parents gives a simple numerical criterion to assess whether two neighboring components will interfere psychoacoustically. Secondly, we run a manifold learning algorithm (Isomap) on scattering coefficients to visualize the similarity space underlying parametric additive synthesis. Thirdly, we generalize the "one or two components" framework to three sine waves or more, and prove that the effective scattering depth of a Fourier series grows in logarithmic proportion to its bandwidth.




I Introduction

In the mammalian auditory system, cochlear hair cells operate like band-pass filters whose equivalent rectangular bandwidth (ERB) grows in proportion to their center frequency. Given two sine waves $y_1$ and $y_2$ of respective frequencies $f_1$ and $f_2$, we perceive their mixture as a musical chord insofar as $y_1$ and $y_2$ belong to disjoint critical bands. However, if $a_2 \ll a_1$ or $f_2 \approx f_1$ (where $a_1$ and $a_2$ denote the amplitudes of $y_1$ and $y_2$), then the tone $y_2$ is said to be masked by $y_1$. In lieu of two pure tones, we hear a “beating tone”: i.e., a locally sinusoidal wave whose carrier frequency is $\frac{1}{2}(f_1 + f_2)$ and whose modulation frequency is $\frac{1}{2}|f_1 - f_2|$. In humans, the resolution of beating tones involves physiological processes beyond the cochlea, i.e., in the primary auditory cortex.
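The beating phenomenon described above follows from a product-to-sum trigonometric identity. As a minimal numerical check, assuming two unit-amplitude cosines with zero phase (the names `f1`, `f2` and their values are illustrative choices, not taken from the paper), the sketch below verifies that the mixture equals a carrier at the mean frequency multiplied by a slow envelope at half the frequency difference.

```python
import math

f1, f2 = 440.0, 444.0              # illustrative frequencies in Hz (assumed values)
carrier = 0.5 * (f1 + f2)          # carrier frequency of the beating tone
modulation = 0.5 * abs(f1 - f2)    # modulation frequency of the envelope


def mixture(t):
    """Sum of two pure tones with unit amplitude and zero phase."""
    return math.cos(2 * math.pi * f1 * t) + math.cos(2 * math.pi * f2 * t)


def beating(t):
    """Same signal, written as a slow envelope times a fast carrier."""
    envelope = 2 * math.cos(2 * math.pi * modulation * t)
    return envelope * math.cos(2 * math.pi * carrier * t)


# Compare both expressions over one second sampled at 8 kHz.
times = [k / 8000.0 for k in range(8000)]
max_error = max(abs(mixture(t) - beating(t)) for t in times)
```

Because the magnitude of the envelope peaks twice per modulation cycle, the loudness fluctuation of this mixture repeats at the rate `abs(f1 - f2)`.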

The scattering transform is a deep convolutional operator which alternates constant-$Q$ wavelet decompositions and the application of pointwise complex modulus, up to some time scale $T$. Broadly speaking, its first two layers resemble the functioning of the cochlea and the primary auditory cortex, respectively. In the context of audio classification, scattering transforms have been successfully employed to represent speech [anden2014deep], environmental sounds [lostanlen2018jasmp], urban sounds [salamon2015eusipco], musical instruments [lostanlen2018dlfm], rhythms [haider2019cmmr], and playing techniques [wang2019ismir]. Therefore, the scattering transform simultaneously enjoys a diverse range of practical motivations, a firm rooting in wavelet theory, and a plausible correspondence with neurophysiology.

This article discusses the response of the scattering transform operator to a complex tone input $y = y_1 + y_2$, depending on the sinusoidal parameters of $y_1$ and $y_2$. In this respect, we follow a well-established methodology in nonstationary signal processing, colloquially known as: “One or two frequencies? The X Answers”, where X is the nonlinear operator of interest. The key idea is to identify transitional regimes in the response of X with respect to variations in relative amplitude ($a_2/a_1$), relative frequency ($f_2/f_1$), and relative phase ($\varphi_2 - \varphi_1$). Prior publications have done so for X being the empirical mode decomposition [rilling2008tsp], the synchrosqueezing transform [wu2011aada], and the singular spectrum analysis operator [harmouche2015gretsi]. We extend this line of research to the case where X is the scattering transform in dimension one.

II Wavelet-based recursive interferometry

Let $\psi$ be a Hilbert-analytic filter with null average, unit center frequency, and an ERB equal to $1/Q$. We define a constant-$Q$ wavelet filterbank as the family $\psi_\lambda : t \mapsto \lambda \psi(\lambda t)$. Each wavelet $\psi_\lambda$ has a center frequency of $\lambda$, an ERB of $\lambda/Q$, and an effective receptive field proportional to $Q/\lambda$ in the time domain. In practice, the frequency variable $\lambda$ gets discretized according to a geometric progression of common ratio $2^{1/Q}$. Consequently, every continuous signal that is bandlimited to $[f_{\min}, f_{\max}]$ activates a number of wavelets of at most $Q \log_2 (f_{\max} / f_{\min})$.
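To make the discretization concrete, here is a minimal sketch of a geometric frequency grid. It assumes that the common ratio is $2^{1/Q}$ (that is, $Q$ wavelets per octave) and that the ERB of each wavelet equals its center frequency divided by $Q$; the function name `center_frequencies` and the numerical range are hypothetical, chosen only for illustration.

```python
import math


def center_frequencies(f_min, f_max, q):
    """Geometric grid of center frequencies, starting at f_max and going down.

    Assumes q wavelets per octave, i.e. a common ratio of 2 ** (1 / q).
    The grid stops before reaching f_min.
    """
    n_wavelets = int(math.floor(q * math.log2(f_max / f_min)))
    return [f_max * 2.0 ** (-k / q) for k in range(n_wavelets)]


q = 12                                        # quality factor (assumed value)
grid = center_frequencies(55.0, 7040.0, q)    # seven octaves (assumed range)
erbs = [f / q for f in grid]                  # bandwidth grows with center frequency
```

With these assumed values, the number of wavelets equals the quality factor times the number of octaves, which is the logarithmic dependency on bandwidth that the rest of the paper relies upon.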

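We define the scalogram of a signal $x$ as the squared complex modulus of its constant-$Q$ transform (CQT):

$$U_1 x(t, \lambda_1) = \left| x * \psi_{\lambda_1} \right|^2 (t). \tag{1}$$

Likewise, we define a second layer of nonlinear transformation for $x$ as the “scalogram of its scalogram”:

$$U_2 x(t, \lambda_1, \lambda_2) = \left| U_1 x * \psi_{\lambda_2} \right|^2 (t, \lambda_1), \tag{2}$$

where the asterisk denotes a convolution product. This construct may be iterated for every integer $m$ by “scattering” the multivariate signal $U_m x$ into all wavelet subbands $\lambda_{m+1}$:

$$U_{m+1} x(t, \lambda_1, \ldots, \lambda_{m+1}) = \left| U_m x * \psi_{\lambda_{m+1}} \right|^2 (t, \lambda_1, \ldots, \lambda_m). \tag{3}$$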

Note that the original definition of the scattering transform adopts the complex modulus ($|\cdot|$) rather than its square ($|\cdot|^2$) as its activation function. This is to ensure that the resulting operator is a non-expansive map in terms of Lipschitz regularity. However, to simplify our calculation and spare an intermediate stage of linearization of the square root, we choose to employ a measure of power rather than amplitude. This idea was initially proposed by [balestriero2017arxiv] in the context of marine bioacoustics.

Every layer in this deep convolutional network composes an invariant linear system (namely, the CQT) and a pointwise operation (the squared complex modulus). Thus, by recurrence over the depth variable $m$, every tensor $U_m x$ is equivariant to the action of delay operators. In order to replace this equivariance property by an invariance property, we integrate each $U_m x$ over some predefined time scale $T$, yielding the invariant scattering transform:

$$S_m x(t, p) = \left( U_m x * \phi_T \right)(t, p), \tag{4}$$

where the $m$-tuple $p = (\lambda_1, \ldots, \lambda_m)$ is a scattering path and the signal $\phi_T$ is a real-valued low-pass filter of time scale $T$.
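The recursion described in this section can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the implementation used in the paper: the band-pass filters are Gaussian bumps in the Fourier domain (rather than Gammatone or Shannon wavelets), convolutions are circular, and the low-pass filter is replaced by a global average over time. The helper names `bandpass_hat` and `scalogram` are hypothetical.

```python
import numpy as np


def bandpass_hat(n, center, q):
    """Frequency response of an analytic band-pass filter (Gaussian bump).

    `center` is in cycles per sample and the bandwidth scales as center / q.
    Negative frequencies and the DC bin are zeroed, so that the filter is
    analytic and has a null average.
    """
    freqs = np.fft.fftfreq(n)
    sigma = center / q
    psi_hat = np.exp(-0.5 * ((freqs - center) / sigma) ** 2)
    psi_hat[freqs <= 0] = 0.0
    return psi_hat


def scalogram(u, centers, q):
    """Squared modulus of band-pass responses along the last (time) axis.

    The output has shape (len(centers),) + u.shape.
    """
    n = u.shape[-1]
    u_hat = np.fft.fft(u, axis=-1)
    bands = [np.fft.ifft(u_hat * bandpass_hat(n, c, q), axis=-1) for c in centers]
    return np.abs(np.stack(bands)) ** 2


n, q = 1024, 4
t = np.arange(n)
centers1 = [68 / n]                                 # one first-order subband
centers2 = [2 / n, 4 / n, 8 / n, 16 / n, 32 / n]    # second-order subbands

one_tone = np.cos(2 * np.pi * 64 * t / n)
two_tones = one_tone + np.cos(2 * np.pi * 72 * t / n)

# Second layer: "scalogram of the scalogram", indexed as [lambda2, lambda1, t].
u2_one = scalogram(scalogram(one_tone, centers1, q), centers2, q)
u2_two = scalogram(scalogram(two_tones, centers1, q), centers2, q)

# Global time averaging stands in for the low-pass filter (assumption).
s2_one = u2_one.mean(axis=-1)
s2_two = u2_two.mean(axis=-1)
```

In this toy setting, a single sine wave yields a constant scalogram and therefore a vanishing second layer, whereas the two-tone mixture concentrates its second-order energy in the subband closest to the frequency difference (here `8 / n`).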

III Auditory masking in a scattering network

Fig. 1: Heatmap of second-order masking coefficient after a scattering transform of two sine waves and , measured around the frequency , as a function of relative amplitude and relative frequency difference . The color of each blot denotes the resolution at the second layer. Wavelets have an asymmetric profile (Gammatone wavelets) and a quality factor . The second layer covers an interval of nine octaves below . For the sake of clarity, we only display one interference pattern per octave.

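Given $\lambda_1 > 0$, the convolution between every sine wave $y_k(t) = a_k \cos(2\pi f_k t + \varphi_k)$ and every wavelet $\psi_{\lambda_1}$ writes as a multiplication in the Fourier domain. Because $\psi$ is Hilbert-analytic, only the analytic part of the real signal $y_k$ is preserved in the CQT:

$$(y_k * \psi_{\lambda_1})(t) = \frac{1}{2} a_k \, \widehat{\psi}\!\left(\frac{f_k}{\lambda_1}\right) \exp\big(\mathrm{i}(2\pi f_k t + \varphi_k)\big). \tag{5}$$

By linearity of the CQT, we expand the interference between $y_1$ and $y_2$ by heterodyning:

$$\left| (y_1 + y_2) * \psi_{\lambda_1} \right|^2 (t) = \frac{1}{4} a_1^2 \left| \widehat{\psi}\!\left(\frac{f_1}{\lambda_1}\right) \right|^2 + \frac{1}{4} a_2^2 \left| \widehat{\psi}\!\left(\frac{f_2}{\lambda_1}\right) \right|^2 + \frac{1}{2} a_1 a_2 \, \mathrm{Re}\left[ \widehat{\psi}\!\left(\frac{f_1}{\lambda_1}\right) \overline{\widehat{\psi}\!\left(\frac{f_2}{\lambda_1}\right)} \exp\big(\mathrm{i}(2\pi (f_1 - f_2) t + \varphi_1 - \varphi_2)\big) \right]. \tag{6}$$

Because the wavelet $\psi$ has a null average, the two constant terms in the equation above are absorbed by the first layer of the scattering network, and disappear at deeper layers. However, the cross term, proportional to $a_1 a_2$, is a “difference tone” of fundamental frequency $|f_2 - f_1|$.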
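The heterodyne expansion can be checked numerically. The sketch below assumes that each sine wave has the form $a_k \cos(2\pi f_k t)$ with zero phase, and that the wavelet frequency response at each frequency is an arbitrary complex gain; the names `h1`, `h2` and all numerical values are hypothetical. It compares the squared modulus of the subband signal against the sum of two constant square terms and one cross term oscillating at the frequency difference.

```python
import cmath
import math

a1, a2 = 1.0, 0.5                   # amplitudes (illustrative values)
f1, f2 = 100.0, 108.0               # frequencies in Hz (illustrative values)
h1, h2 = 0.9 + 0.1j, 0.7 - 0.2j     # hypothetical wavelet gains at f1 and f2


def band_response(t):
    """Analytic subband signal: one complex exponential per sine wave."""
    return (0.5 * a1 * h1 * cmath.exp(2j * math.pi * f1 * t)
            + 0.5 * a2 * h2 * cmath.exp(2j * math.pi * f2 * t))


def scalogram_closed_form(t):
    """Two constant square terms plus one cross term at frequency f1 - f2."""
    squares = 0.25 * (a1 * abs(h1)) ** 2 + 0.25 * (a2 * abs(h2)) ** 2
    rotation = cmath.exp(2j * math.pi * (f1 - f2) * t)
    cross = 0.5 * a1 * a2 * (h1 * h2.conjugate() * rotation).real
    return squares + cross


times = [k / 4000.0 for k in range(4000)]
max_error = max(abs(abs(band_response(t)) ** 2 - scalogram_closed_form(t))
                for t in times)
period = 1.0 / abs(f2 - f1)         # period of the difference tone
```

The closed form repeats with period `1 / abs(f2 - f1)`, which is the sense in which the squared modulus converts two frequencies into one.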

The authors of a previous publication [anden2012dafx] have remarked that this difference tone elicits a peak in second-order scattering coefficients for the path $(\lambda_1, \lambda_2) = (f_1, |f_2 - f_1|)$. In the following, we generalize their study to include the effect of the relative amplitude $a_2/a_1$, the wavelet shape $\psi$, the quality factor $Q$, and the time scale of local stationarity $T$.

Equation 6 illustrates how the scalogram operator converts a complex tone (two frequencies $f_1$ and $f_2$) into a simple tone (one frequency $|f_2 - f_1|$). For this simple tone to carry a nonnegligible amplitude at the second layer, three conditions must be satisfied. First, the rectangular term $a_1 a_2$ must be nonnegligible in comparison to the square terms $a_1^2$ and $a_2^2$. Secondly, there must exist a wavelet $\psi_{\lambda_1}$ whose spectrum encompasses both frequencies $f_1$ and $f_2$. Said otherwise, both $f_1$ and $f_2$ must fall within the passband of $\psi_{\lambda_1}$, i.e., within a bandwidth of $\lambda_1/Q$ around the center frequency $\lambda_1$. Thirdly, the frequency difference $|f_2 - f_1|$ must belong to the passband of some second-order wavelet $\psi_{\lambda_2}$. Yet, in practice, to guarantee the temporal localization of scattering coefficients and restrict the filterbank to a finite number of octaves, the scaling factor of every $\psi_{\lambda_2}$ is upper-bounded by the temporal constant $T$. Therefore, the period of the difference tone should be under the pseudo-period of the wavelet with support $T$; i.e., a pseudo-period of $T/Q$. Hence the third condition: $|f_2 - f_1| > Q/T$.

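One simple way of quantifying the amount of mutual interference between signals $y_1$ and $y_2$ is to renormalize second-order coefficients by their first-order “parent” coefficients:

$$\widetilde{S}_2 y(t, \lambda_1, \lambda_2) = \frac{S_2 y(t, \lambda_1, \lambda_2)}{S_1 y(t, \lambda_1)}. \tag{7}$$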

This operation, initially proposed by [anden2014deep], is conceptually analogous to classical methods in adaptive gain control, notably per-channel energy normalization (PCEN) [lostanlen2019spl].
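As a minimal sketch of this renormalization, the function below divides each second-order coefficient by the first-order coefficient of its parent subband. The dictionary layout, the function name `renormalize`, the guard `eps` against silent subbands, and the numerical values are all assumptions made for illustration; they are not taken from the paper or from any library.

```python
def renormalize(first_order, second_order, eps=1e-12):
    """Divide each second-order coefficient by its first-order parent.

    `first_order` maps a first-layer subband index j1 to an averaged energy.
    `second_order` maps a scattering path (j1, j2) to an averaged energy.
    `eps` is a hypothetical guard against division by zero.
    """
    return {
        path: value / (first_order[path[0]] + eps)
        for path, value in second_order.items()
    }


# Illustrative values only.
first = {0: 2.0, 1: 0.5}
second = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.25}
ratios = renormalize(first, second)
```

Because the ratio is taken path by path, two subbands with very different loudness can still be compared in terms of how much of their energy is modulated.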

In accordance with the “one or two frequencies” methodology, Figure 1 illustrates the value of this ratio of energies in the subband , for different values of relative amplitude and relative frequency difference . We fixed without loss of generality. As expected, we observe that, for and a relative frequency difference between and , second-layer wavelets resonate with the difference tone as a result of the interference between signals and .

IV Application to manifold learning

To demonstrate the ability of the scattering transform to characterize auditory masking, we build a dataset of complex tones according to the following additive synthesis model:


where is a Hann window of duration . This additive synthesis model depends upon two parameters: the Fourier decay and the relative odd-to-even amplitude difference. Figure 2 displays the CQT log-magnitude spectrum of for different values of and . In practice, we set to samples, to harmonics, and between and cycles.

Fig. 2: Constant-Q transform (CQT) log-magnitudes of synthetic musical tones, as a function of wavelet log-frequency (). Spectral parameters and denote the Fourier decay exponent and the relative odd-to-even amplitude difference respectively. Note that all visualizations are unsupervised: See Equation 8 for details.

Our synthetic dataset comprises audio signals in total, corresponding to values of between and and values of between and , while is an integer chosen uniformly at random between and . We extract the scattering transform of each signal up to order , with and , by means of the Kymatio Python package [andreux2020jmlr]. Concatenating first-order and second-order coefficients yields a representation in dimension .

For visualization purposes, we bring the -dimensional space of scattering coefficients to dimension three by means of the Isomap algorithm for unsupervised manifold learning [tenenbaum2000science]. The appeal behind Isomap is that pairwise Euclidean distances in the 3-D point cloud approximate the corresponding geodesic distances over the $k$-nearest neighbor graph associated with the dataset. Throughout this paper, we set the number of neighbors to and measure neighboring relationships by comparing high-dimensional distances. Crucially, in the case of the scattering transform, these distances are provably stable (i.e., Lipschitz-continuous) to the action of diffeomorphisms [mallat2012cpam, Theorem 2.12].

Fig. 3: Isomap embedding of synthetic musical notes, as described by their scattering transform coefficients (top); their Open-L3 coefficients (center); and their mel-frequency cepstral coefficients (MFCC, bottom). The color of a dot, ranging from red to blue via white, denotes the fundamental frequency (left), the Fourier decay exponent (center), and the relative odd-to-even amplitude difference (right) respectively. Note that all methods are unsupervised: triplets (, , ) are not directly supplied to the models, but only serve for color grading post hoc. See Section IV for details.

Figure 3 (top) illustrates our findings. We observe that, after scattering transform and Isomap dimensionality reduction, the dataset appears as a 3-D Cartesian mesh whose principal components align with , , and respectively. This result demonstrates that the scattering transform is capable of disentangling and linearizing multiple factors of variability in the spectral envelope of periodic signals, even if those factors are not directly amenable to diffeomorphisms in the time domain.

As a point of comparison, Figure 3 presents the outcome of Isomap on alternative feature representations: Open-L3 embedding (center) and mel-frequency cepstral coefficients (MFCC, bottom). The former results from training a deep convolutional network (convnet) on a self-supervised task of audiovisual correspondence, and yields coefficients [cramer2019icassp]. The latter results from a log-mel-spectrogram representation, followed by a discrete cosine transform (DCT) over the mel-frequency axis, and yields coefficients. We compute MFCC with librosa v0.7 [mcfee2020librosa] default parameters.

We observe that the Open-L3 embedding correctly disentangles boundary conditions () from fundamental frequency (), but fails to disentangle Fourier decay () from . Instead, correlations between and are positive for low-pitched sounds ( to cycles) and negative for high-pitched sounds ( to cycles). Although this failure deserves a more formal inquiry, we hypothesize that it stems from the small convolutional receptive field of the L3-Net: mel subbands, i.e., roughly half an octave around kHz.

Fig. 4: Energy decay as a function of wavelet scattering depth , for mixtures of components with equal amplitudes, equal phases, and evenly spaced frequencies. The color of each line plot denotes the integer part of . In this numerical experiment, wavelets have a sine cardinal profile (Shannon wavelets) and a quality factor equal to . Each filterbank covers seven octaves.

Moreover, in the case of MFCC, we find that the variability in fundamental frequency () dominates the variability in spectral shape parameters ( and ), thus yielding a rectilinear embedding (bottom). This observation is in line with a previous publication [lostanlen2016ismir], which showed statistically that MFCCs are overly sensitive to frequency transposition in complex tones.

From this qualitative benchmark, it appears that the scattering transform is a more interpretable representation of periodic signals than Open-L3, while incurring a smaller computational cost. However, in the presence of aperiodic signals such as environmental sounds, Open-L3 outperforms the scattering transform in terms of classification accuracy with linear support vector machines [arandjelovic2017look]. To remain competitive, the scattering transform must not only capture heterodyne interference, but also joint spectrotemporal modulations [anden2019joint]. In this context, future work will strive to combine insights from multiresolution analysis and deep self-supervised learning.

V Beyond pairwise interference: full-depth scattering networks

In speech and music processing, pitched sounds are rarely approximable as a mixture of merely two components. More often than not, they contain ten components or more, and span multiple octaves in the Fourier domain. Thus, computing the masking coefficient at the second layer only provides a crude description of the timbral content within each critical band. Indeed, this coefficient encodes pairwise interference between sinusoidal components but fails to characterize more intricate structures in the spectral envelope of the signal.

To address this issue, we propose to study the scattering transform beyond order two, thus encompassing heterodyne structures of greater multiplicity. For the sake of mathematical tractability, we consider the following mother wavelet, hereafter called “complex Shannon wavelet” after [mallat2008book, Section 7.2.2]:


The definition of a scattering transform with complex Shannon wavelets requires resorting to the theory of tempered distributions. We refer to [strichartz2003book] for further mathematical details.

The following theorem, proven in the Appendix, describes the response of a deep scattering network in the important particular case of a periodic signal with finite bandwidth.

Theorem V.1.

Let a periodic signal of fundamental frequency . Let the complex Shannon wavelet as in Equation 9 and its associated scalogram operator as in Equation 1. If has a finite bandwidth of octaves, then its scattering coefficients are zero for any .

This result is in agreement with the theorem of exponential decay of scattering coefficients [waldspurger2017exponential]. Note, however, that [waldspurger2017exponential] expresses an upper bound on the energy at fixed depth for integrable signals, while we express an upper bound on the depth at fixed bandwidth for periodic signals.

We apply the theorem above to the case of a signal containing components of equal amplitudes, equal phases, and evenly spaced frequencies: . Figure 4 illustrates the decay of scattered energy as a function of depth. The conceptual analogy between depth and scale was originally proposed by [mallat2016understanding] in a theoretical effort to clarify the role of hierarchical symmetries in convnets.

Although our findings support this analogy, we note that computing a scattering transform with layers is often impractical. However, if the Fourier series in satisfies a self-similarity assumption, it is possible to match the representational capacity of a full-depth scattering network while keeping the depth to . Indeed, spiral scattering performs wavelet convolutions over time, over log-frequency, and across octaves, thereby capturing the spectrotemporal periodicity of Shepard tones and Shepard-Risset glissandos [lostanlen2015dafx]. Further research is needed to integrate broadband demodulation into deep convolutional architectures for machine listening.

VI Conclusion

In this article, we have studied the role of every layer in a scattering network by means of a well-established methodology, colloquially known as “one or two components” [rilling2008tsp]. We have come up with a numerical criterion of psychoacoustic masking; demonstrated that the scattering transform disentangles multiple factors of variability in the spectral envelope; and proven that the effective scattered depth of Fourier series is bounded by the logarithm of its bandwidth, thus emphasizing the importance of capturing geometric regularity across temporal scales.

Appendix: proof of Theorem V.1


We reason by induction over the depth variable . The base case () leads to if and zero otherwise. Because the wavelet has one vanishing moment, it follows that is zero, and likewise at deeper layers. To prove the induction step at depth , we decompose into a low-pass approximation spanning the subband and a high-pass detail spanning the subband . Denoting by the complex-valued Fourier coefficients of , we have at every time :


On one hand, the coarse term has a bandwidth of octaves. Therefore, by the induction hypothesis, we have for , and a fortiori for . On the other hand, we consider the complex Shannon scalogram of in some subband :


In the double sum above, all integer differences of the form range between and . Thus, is a periodic signal of fundamental frequency spanning octaves. Furthermore, because , has a smaller bandwidth than ; i.e., octaves or less. By the induction hypothesis, we have:


In the equation above, we recognize the scattering path of . Finally, because the scattering transform is a contractive operator [mallat2012cpam], we have the inequality:


which implies , and likewise at deeper layers. We conclude by induction that the theorem holds for any . ∎