In the mammalian auditory system, cochlear hair cells operate like band-pass filters whose equivalent rectangular bandwidth (ERB) grows in proportion to their center frequency. Given two sine waves and of respective frequencies and , we perceive their mixture as a musical chord insofar as and belong to disjoint critical bands. However, if or , then the tone is said to be masked by . In lieu of two pure tones, we hear a “beating tone”: i.e., a locally sinusoidal wave whose carrier frequency is and whose modulation frequency is . In humans, the resolution of beating tones involves physiological processes beyond the cochlea, i.e., in the primary auditory cortex.
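The beating phenomenon described above follows from the sum-to-product identity for sines. The sketch below checks this identity numerically; the names `f1` and `f2`, the frequency values, and the sampling grid are illustrative assumptions rather than quantities taken from the text.

```python
import math

def two_tone(t, f1, f2):
    """Sum of two unit-amplitude sine waves of frequencies f1 and f2 (in Hz)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def beating_tone(t, f1, f2):
    """Same signal, rewritten as a carrier multiplied by a slow modulation."""
    carrier = math.sin(2 * math.pi * 0.5 * (f1 + f2) * t)         # mean frequency
    modulation = 2 * math.cos(2 * math.pi * 0.5 * (f1 - f2) * t)  # half difference
    return modulation * carrier

# Hypothetical values: two tones that are 4 Hz apart, sampled at 8 kHz for 1 s.
f1, f2 = 440.0, 444.0
times = [n / 8000.0 for n in range(8000)]
max_error = max(abs(two_tone(t, f1, f2) - beating_tone(t, f1, f2)) for t in times)
```

Both expressions agree up to floating-point error, which is why a pair of close frequencies may equivalently be described as a single amplitude-modulated tone.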
The scattering transform () is a deep convolutional operator which alternates constant-Q wavelet decompositions and the application of pointwise complex modulus, up to some time scale . Broadly speaking, its first two layers ( and ) resemble the functioning of the cochlea and the primary auditory cortex, respectively. In the context of audio classification, scattering transforms have been successfully employed to represent speech [anden2014deep], environmental sounds [lostanlen2018jasmp], urban sounds [salamon2015eusipco], musical instruments [lostanlen2018dlfm], rhythms [haider2019cmmr], and playing techniques [wang2019ismir]. Therefore, the scattering transform simultaneously enjoys a diverse range of practical motivations, a firm rooting in wavelet theory, and a plausible correspondence with neurophysiology.
This article discusses the response of the scattering transform operator to a complex tone input , depending on the sinusoidal parameters of and . In this respect, we follow a well-established methodology in nonstationary signal processing, colloquially known as: “One or two frequencies? The X Answers”, where X is the nonlinear operator of interest. The key idea is to identify transitional regimes in the response of X with respect to variations in relative amplitude (), relative frequency (), and relative phase (). Prior publications have done so for X being the empirical mode decomposition [rilling2008tsp], the synchrosqueezing transform [wu2011aada], and the singular spectrum analysis operator [harmouche2015gretsi]. We extend this line of research to the case where X is the scattering transform in dimension one.
II Wavelet-based recursive interferometry
Let be a Hilbert-analytic filter with null average, unit center frequency, and an ERB equal to . We define a constant-Q wavelet filterbank as the family . Each wavelet has a center frequency of , an ERB of , and an effective receptive field of in the time domain. In practice, the frequency variable gets discretized according to a geometric progression of common ratio . Consequently, every continuous signal that is bandlimited to activates a number of wavelets at most.
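As an illustration of this discretization, the sketch below enumerates the center frequencies of a constant-Q filterbank. It rests on assumptions that the text does not spell out: a common ratio of `2 ** (1 / Q)` (that is, `Q` wavelets per octave), an ERB of `f / Q` for a wavelet centered at `f`, and a wavelet count of `Q * log2(f_max / f_min)`. The function name and the numerical values are hypothetical.

```python
import math

def constant_q_frequencies(f_min, f_max, Q):
    """Center frequencies in geometric progression, from f_max downwards.

    Assumes Q wavelets per octave, i.e. a common ratio of 2 ** (1 / Q).
    """
    num_wavelets = math.ceil(Q * math.log2(f_max / f_min))
    return [f_max * 2.0 ** (-k / Q) for k in range(num_wavelets)]

# Hypothetical band: seven octaves between 55 Hz and 7040 Hz, with Q = 12.
Q = 12
freqs = constant_q_frequencies(55.0, 7040.0, Q)
bandwidths = [f / Q for f in freqs]  # assumed ERB, proportional to frequency
```

Under these assumptions, the number of wavelets grows with the number of octaves rather than with the bandwidth in Hertz, which is the point of the geometric progression.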
We define the scalogram of as the squared complex modulus of its constant-Q transform (CQT):
Likewise, we define a second layer of nonlinear transformation for as the “scalogram of its scalogram”:
where the asterisk denotes a convolution product. This construct may be iterated for every integer by “scattering” the multivariate signal into all wavelet subbands :
Note that the original definition of the scattering transform adopts the complex modulus ($|\cdot|$) rather than its square ($|\cdot|^2$) as its activation function. This is to ensure that the scattering transform is a non-expansive map in terms of Lipschitz regularity. However, to simplify our calculation and spare an intermediate stage of linearization of the square root, we choose to employ a measure of power rather than amplitude. This idea was initially proposed by [balestriero2017arxiv] in the context of marine bioacoustics.
Every layer in this deep convolutional network composes a linear time-invariant system (namely, the CQT) and a pointwise operation (the squared complex modulus). Thus, by recurrence over the depth variable , every tensor is equivariant to the action of delay operators. In order to replace this equivariance property by an invariance property, we integrate each over some predefined time scale , yielding the invariant scattering transform:
where the -tuple is a scattering path and the signal is a real-valued low-pass filter of time scale .
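For reference, the layer-wise construction described in this section may be summarized as follows. This is a sketch in assumed notation: the symbols $x$, $\psi_{\lambda}$, $\mathbf{U}_m$, $\mathbf{S}_m$, and $\phi_T$ are chosen here for illustration and may differ from those of the displayed equations.

```latex
\begin{align}
\mathbf{U}_1 x(t, \lambda_1)
  &= \big| x \ast \psi_{\lambda_1} \big|^2 (t), \\
\mathbf{U}_2 x(t, \lambda_1, \lambda_2)
  &= \big| \mathbf{U}_1 x(\cdot, \lambda_1) \ast \psi_{\lambda_2} \big|^2 (t), \\
\mathbf{U}_{m+1} x(t, \lambda_1, \ldots, \lambda_{m+1})
  &= \big| \mathbf{U}_m x(\cdot, \lambda_1, \ldots, \lambda_m) \ast \psi_{\lambda_{m+1}} \big|^2 (t), \\
\mathbf{S}_m x(t, \lambda_1, \ldots, \lambda_m)
  &= \big( \mathbf{U}_m x(\cdot, \lambda_1, \ldots, \lambda_m) \ast \phi_T \big)(t).
\end{align}
```

The first three lines restate the scalogram, the "scalogram of the scalogram", and the general recursion; the last line is the local averaging which trades equivariance for invariance up to the chosen time scale.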
III Auditory masking in a scattering network
Given , the convolution between every sine wave and every wavelet can be written as a multiplication in the Fourier domain. Because is Hilbert-analytic, only the analytic part of the real signal is preserved in the CQT:
By linearity of the CQT, we expand the interference between and by heterodyning:
Because the wavelet has a null average, the two constant terms in the equation above are absorbed by the first layer of the scattering network, and disappear at deeper layers. However, the cross term, proportional to , is a “difference tone” of fundamental frequency .
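The heterodyne expansion can be verified numerically. The sketch below compares the squared modulus of a sum of two complex exponentials with the sum of two constant terms and one cross term oscillating at the frequency difference. The names `a1`, `a2`, `f1`, `f2`, and `phi` are assumptions: here, `a1` and `a2` stand for the amplitudes of each component after weighting by the frequency response of the wavelet, which the sketch does not model.

```python
import cmath
import math

def analytic_two_tone(t, a1, f1, a2, f2, phi):
    """Analytic part of a two-component tone within one wavelet subband."""
    first = a1 * cmath.exp(2j * math.pi * f1 * t)
    second = a2 * cmath.exp(1j * (2 * math.pi * f2 * t + phi))
    return first + second

def heterodyne_expansion(t, a1, f1, a2, f2, phi):
    """Two constant terms plus a cross term at the difference frequency."""
    cross = 2 * a1 * a2 * math.cos(2 * math.pi * (f2 - f1) * t + phi)
    return a1 ** 2 + a2 ** 2 + cross

# Hypothetical parameters for the check.
params = dict(a1=1.0, f1=440.0, a2=0.5, f2=452.0, phi=0.3)
times = [n / 8000.0 for n in range(4000)]
max_error = max(
    abs(abs(analytic_two_tone(t, **params)) ** 2 - heterodyne_expansion(t, **params))
    for t in times
)
```

The cross term is the only time-varying part of the expansion, which is why it alone survives a subsequent convolution with a zero-mean wavelet.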
The authors of a previous publication [anden2012dafx] have remarked that this difference tone elicits a peak in second-order scattering coefficients for the path . In the following, we generalize their study to include the effect of the relative amplitude , the wavelet shape , the quality factor , and the time scale of local stationarity .
Equation 6 illustrates how the scalogram operator converts a complex tone (two frequencies and ) into a simple tone (one frequency ). For this simple tone to carry a nonnegligible amplitude in , three conditions must be satisfied. First, the rectangular term must be nonnegligible in comparison to the square terms and . Secondly, there must exist a wavelet whose spectrum encompasses both frequencies and . In other words, must satisfy the inequalities , both for and for . Thirdly, the frequency difference must belong to the passband of some second-order wavelet . Yet, in practice, to guarantee the temporal localization of scattering coefficients and restrict the filterbank to a finite number of octaves, the scaling factor of every is upper-bounded by the temporal constant . Therefore, the period of the difference tone should be shorter than the pseudo-period of the wavelet with support ; i.e., a pseudo-period of . Hence the third condition: .
One simple way of quantifying the amount of mutual interference between signals and is to renormalize second-order coefficients by their first-order “parent” coefficients:
This operation, initially proposed by [anden2014deep], is conceptually analogous to classical methods in adaptive gain control, notably per-channel energy normalization (PCEN) [lostanlen2019spl].
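A minimal sketch of this renormalization is given below, assuming that coefficients are stored in dictionaries indexed by scattering path. The function name, the toy values, and the safeguard `eps` are hypothetical and serve only to illustrate the parent-child relationship between paths.

```python
def normalize_second_order(first_order, second_order, eps=1e-12):
    """Divide each second-order coefficient by its first-order parent.

    first_order maps a wavelet index lambda1 to a nonnegative coefficient.
    second_order maps a path (lambda1, lambda2) to a nonnegative coefficient.
    eps is a hypothetical safeguard against division by zero.
    """
    return {
        (lam1, lam2): value / (first_order[lam1] + eps)
        for (lam1, lam2), value in second_order.items()
    }

# Toy coefficients, not measured on any signal.
first_order = {0: 2.0, 1: 0.5}
second_order = {(0, 3): 0.5, (0, 4): 0.1, (1, 4): 0.25}
ratios = normalize_second_order(first_order, second_order)
```

Every path `(lambda1, lambda2)` is divided by the coefficient of its parent `lambda1`, so that paths sharing a first-order subband are rescaled by the same factor.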
In accordance with the “one or two frequencies” methodology, Figure 1 illustrates the value of this ratio of energies in the subband , for different values of relative amplitude and relative frequency difference . We fixed without loss of generality. As expected, we observe that, for and a relative frequency difference between and , second-layer wavelets resonate with the difference tone as a result of the interference between signals and .
IV Application to manifold learning
To demonstrate the ability of the scattering transform to characterize auditory masking, we build a dataset of complex tones according to the following additive synthesis model:
where is a Hann window of duration . This additive synthesis model depends upon two parameters: the Fourier decay and the relative odd-to-even amplitude difference . Figure 2 displays the CQT log-magnitude spectrum of for different values of and . In practice, we set to samples, to harmonics, and between and cycles.
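The sketch below shows one plausible form of such an additive synthesis model. It is an assumption, not a transcription of the synthesis equation: harmonic `p` receives the amplitude `(1 + (-1) ** p * r) / p ** alpha`, where `alpha` denotes the Fourier decay and `r` the relative odd-to-even amplitude difference, and the fundamental frequency `f1_cycles` is counted in cycles over the window. All names and numerical values are hypothetical.

```python
import math

def hann(n, num_samples):
    """Periodic Hann window evaluated at sample n."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / num_samples)

def complex_tone(num_samples, f1_cycles, num_harmonics, alpha, r):
    """Windowed sum of harmonics with power-law decay and odd/even imbalance."""
    signal = []
    for n in range(num_samples):
        t = n / num_samples  # normalized time within the window
        partials = sum(
            (1 + (-1) ** p * r) / p ** alpha
            * math.cos(2 * math.pi * p * f1_cycles * t)
            for p in range(1, num_harmonics + 1)
        )
        signal.append(hann(n, num_samples) * partials)
    return signal

# Toy settings, much smaller than those of an actual dataset.
flat_tone = complex_tone(64, 2, 4, alpha=0.0, r=0.0)
even_only = complex_tone(64, 2, 4, alpha=1.0, r=1.0)
```

In this parametrization, `alpha = 0` yields partials of equal amplitude, whereas `r = 1` cancels every odd-numbered partial and `r = -1` cancels every even-numbered one.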
Our synthetic dataset comprises audio signals in total, corresponding to values of between and and values of between and , while is an integer chosen uniformly at random between and . We extract the scattering transform of each signal up to order , with and , by means of the Kymatio Python package [andreux2020jmlr]. Concatenating first-order and second-order coefficients yields a representation in dimension .
For visualization purposes, we reduce the -dimensional space of scattering coefficients to three dimensions by means of the Isomap algorithm for unsupervised manifold learning [tenenbaum2000science]. The appeal behind Isomap is that pairwise Euclidean distances in the 3-D point cloud approximate the corresponding geodesic distances over the -nearest neighbor graph associated with the dataset. Throughout this paper, we set the number of neighbors to and measure neighboring relationships by comparing high-dimensional distances. Crucially, in the case of the scattering transform, these distances are provably stable (i.e., Lipschitz-continuous) to the action of diffeomorphisms [mallat2012cpam, Theorem 2.12].
Figure 3 (top) illustrates our findings. We observe that, after scattering transform and Isomap dimensionality reduction, the dataset appears as a 3-D Cartesian mesh whose principal components align with , , and respectively. This result demonstrates that the scattering transform is capable of disentangling and linearizing multiple factors of variability in the spectral envelope of periodic signals, even if those factors are not directly amenable to diffeomorphisms in the time domain.
As a point of comparison, Figure 3 presents the outcome of Isomap on alternative feature representations: Open-L3 embedding (center) and mel-frequency cepstral coefficients (MFCC, bottom). The former results from training a deep convolutional network (convnet) on a self-supervised task of audiovisual correspondence, and yields coefficients [cramer2019icassp]. The latter results from a log-mel-spectrogram representation, followed by a discrete cosine transform (DCT) over the mel-frequency axis, and yields coefficients. We compute MFCC with the default parameters of librosa v0.7 [mcfee2020librosa].
We observe that Open-L3 embeddings correctly disentangle boundary conditions () from fundamental frequency (), but fail to disentangle Fourier decay () from . Instead, correlations between and are positive for low-pitched sounds ( to cycles) and negative for high-pitched sounds ( to cycles). Although this failure deserves a more formal inquiry, we hypothesize that it stems from the small convolutional receptive field of the L3-Net: mel subbands, i.e., roughly half an octave around kHz.
Moreover, in the case of MFCC, we find that the variability in fundamental frequency () dominates the variability in spectral shape parameters ( and ), thus yielding a rectilinear embedding (bottom). This observation is in line with a previous publication [lostanlen2016ismir], which showed statistically that MFCCs are overly sensitive to frequency transposition in complex tones.
From this qualitative benchmark, it appears that the scattering transform is a more interpretable representation of periodic signals than Open-L3, while incurring a smaller computational cost. However, in the presence of aperiodic signals such as environmental sounds, Open-L3 outperforms the scattering transform in terms of classification accuracy with linear support vector machines [arandjelovic2017look]. To remain competitive, the scattering transform must not only capture heterodyne interference, but also joint spectrotemporal modulations [anden2019joint]. In this context, future work will strive to combine insights from multiresolution analysis and deep self-supervised learning.
V Beyond pairwise interference: full-depth scattering networks
In speech and music processing, pitched sounds are rarely approximable as a mixture of merely two components. More often than not, they contain ten components or more, and span across multiple octaves in the Fourier domain. Thus, computing the masking coefficient at the second layer only provides a crude description of the timbral content within each critical band. Indeed, encodes pairwise interference between sinusoidal components but fails to characterize more intricate structures in the spectral envelope of .
To address this issue, we propose to study the scattering transform beyond order two, thus encompassing heterodyne structures of greater multiplicity. For the sake of mathematical tractability, we consider the following mother wavelet, hereafter called “complex Shannon wavelet” after [mallat2008book, Section 7.2.2]:
The definition of a scattering transform with complex Shannon wavelets requires resorting to the theory of tempered distributions. We refer to [strichartz2003book] for further mathematical details.
The following theorem, proven in the Appendix, describes the response of a deep scattering network in the important particular case of a periodic signal with finite bandwidth.
This result is in agreement with the theorem of exponential decay of scattering coefficients [waldspurger2017exponential]. Note, however, that [waldspurger2017exponential] expresses an upper bound on the energy at fixed depth for integrable signals, while we express an upper bound on the depth at fixed bandwidth for periodic signals.
We apply the theorem above to the case of a signal containing components of equal amplitudes, equal phases, and evenly spaced frequencies: . Figure 4 illustrates the decay of scattered energy as a function of depth. The conceptual analogy between depth and scale was originally proposed by [mallat2016understanding] in a theoretical effort to clarify the role of hierarchical symmetries in convnets.
Although our findings support this analogy, we note that computing a scattering transform with layers is often impractical. However, if the Fourier series in satisfies a self-similarity assumption, it is possible to match the representational capacity of a full-depth scattering network while keeping the depth to . Indeed, spiral scattering performs wavelet convolutions over time, over log-frequency, and across octaves, thereby capturing the spectrotemporal periodicity of Shepard tones and Shepard-Risset glissandos [lostanlen2015dafx]. Further research is needed to integrate broadband demodulation into deep convolutional architectures for machine listening.
In this article, we have studied the role of every layer in a scattering network by means of a well-established methodology, colloquially known as “one or two components” [rilling2008tsp]. We have come up with a numerical criterion of psychoacoustic masking; demonstrated that the scattering transform disentangles multiple factors of variability in the spectral envelope; and proven that the effective scattered depth of Fourier series is bounded by the logarithm of its bandwidth, thus emphasizing the importance of capturing geometric regularity across temporal scales.
Appendix: proof of Theorem V.1
We reason by induction over the depth variable . The base case () leads to if and zero otherwise. Because has one vanishing moment, it follows that is zero, and likewise at deeper layers. To prove the induction step at depth , we decompose into a low-pass approximation spanning the subband and a high-pass detail spanning the subband . Denoting by the complex-valued Fourier coefficients of , we have at every time :
On the one hand, the coarse term has a bandwidth of octaves. Therefore, by the induction hypothesis, we have for , and a fortiori for . On the other hand, we consider the complex Shannon scalogram of in some subband :
In the double sum above, all integer differences of the form range between and . Thus, is a periodic signal of fundamental frequency spanning octaves. Furthermore, because , has a smaller bandwidth than ; i.e., octaves or less. By the induction hypothesis, we have:
In the equation above, we recognize the scattering path of . Finally, because the scattering transform is a contractive operator [mallat2012cpam], we have the inequality:
which implies , and likewise at deeper layers. We conclude by induction that the theorem holds for any . ∎