In this paper a feature extractor based on Gabor frames and Mallat’s scattering transform, called Gabor scattering, is introduced. This feature extractor is applied to a simple signal model for audio signals, i.e. a class of tones consisting of a fundamental frequency, its multiples and a corresponding envelope. Within different layers, different invariances to certain signal features occur. We give a mathematical explanation for the first and the second layer, illustrated by numerical examples. Deformation stability of this feature extractor is shown by using a decoupling technique previously suggested for the scattering transform of cartoon functions. Here it is used to examine whether the feature extractor is robust to changes in spectral shape and to frequency modulation.
Popular machine learning techniques such as (deep) convolutional networks (CNN) have led to impressive results in several applications, e.g., classification tasks. In audio, data usually undergo some pre-processing, often called feature extraction, before being fed into the trainable machine. CNNs as feature extractors have been given a rigorous mathematical analysis by Mallat, leading to the so-called scattering transform, based on a wavelet transform in each network layer. Invariance and deformation stability properties of the resulting feature extractors were investigated in [6, 2, 1].
Wiatowski and Bölcskei extended the scattering transform by allowing different semi-discrete frames in each network layer, general Lipschitz-continuous non-linearities and pooling operators, see [9, 8]. In this general network, the authors proved vertical translation invariance and deformation stability for band-limited functions.
In the current contribution we introduce a feature extractor called Gabor scattering, previously presented in a conference contribution, which is a scattering transform based on Gabor frames in each layer. Gabor frames are closely related to the short-time Fourier transform (STFT), see Section 2.2, which is commonly used in practice to analyse audio data. We then study invariance properties of Gabor scattering when applied to a common signal model for audio signals (Section 3) and use the same signal model to establish deformation bounds for a feature extractor based on Gabor scattering (Section 4). This approach is motivated by the observation that restrictions more specific than band-limitation lead to more precise deformation estimates when dealing with signals with certain known properties. We follow the idea presented in Grohs et al. and use a decoupling technique, which proves deformation stability by combining stability of the signal class introduced in Section 2.3 with a general property, namely contractivity, which is entirely due to the network architecture.
2.1 Deep Convolutional Neural Networks (DCNNs) and Invariance
In order to understand the mathematical construction used within this paper, we briefly introduce the principal idea and structure of a DCNN. DCNNs enable the computer to learn from observation data. Their structure is inspired by neurones and the human process of learning. Usually a network consists of several layers, namely an input layer, several hidden layers (a deep CNN has more than one) and one output layer. As input we use the data that should be classified or, more generally, whose properties should be learnt. A hidden layer consists of several ingredients: first, the convolution of the data with a small weighting matrix, which can be interpreted as localization of certain properties of the input data. Similarly, neurones in our brain act only in a local environment when a certain stimulation appears. The next ingredient of the hidden layer is the application of a non-linearity function, also called activation function, which signals whether the information of this neurone is relevant enough to be transmitted. Furthermore, in order to reduce redundancy and increase invariance, pooling is applied. Due to these ingredients of the hidden layer, i.e. convolution with a weighting matrix, non-linearity function and pooling, certain invariances are generated. In order to make concrete statements about the invariances generated by certain layers for a certain class of input data, one needs to develop a signal model, as we do in Section 2.3. Within our simple model for tones, convolution is performed with a low-pass filter and thus, in combination with pooling, temporal fine structure is averaged out, while frequency information is maintained. Hence in our case we have some invariance w.r.t. the envelope in the first layer. Moreover, after further convolutions, more temporal fine structure is removed and information on large scales is captured in higher layers. In our model the second layer is invariant w.r.t. the pitch and information about the envelopes is visible, cp. Sections 3 and 5. Note that in a neural network, in particular in CNNs, the output, e.g. classification labels, is obtained after several concatenated hidden layers. In the case of a scattering network, the outputs of each layer are stacked together into a feature vector and further processing is necessary to obtain the desired result. Usually, after some kind of dimensionality reduction, this vector can be fed into a support vector machine or a general NN, which performs the classification task.
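The three ingredients of a hidden layer described above can be sketched in a few lines. This is only an illustration under stated assumptions: the kernel is a fixed averaging filter rather than a learned weighting matrix, the modulus is used as non-linearity (as is common in scattering networks), and the function name `hidden_layer` is hypothetical.

```python
import numpy as np

def hidden_layer(x, kernel, pool):
    """Convolve, apply a pointwise non-linearity, then average-pool."""
    y = np.convolve(x, kernel, mode="same")      # localized filtering of the input
    y = np.abs(y)                                # non-linearity (modulus, an assumption)
    n = (len(y) // pool) * pool                  # drop an incomplete last block
    return y[:n].reshape(-1, pool).mean(axis=1)  # average pooling with factor `pool`

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                      # toy input signal
kernel = np.ones(5) / 5.0                        # small low-pass weighting kernel (fixed, not learned)
out = hidden_layer(x, kernel, pool=4)
```

With a pooling factor of 4, the 64 input samples are reduced to 16 non-negative values; the averaging discards temporal fine structure, which is the mechanism behind the invariances discussed in the text.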
2.2 Gabor Scattering
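Since Wiatowski and Bölcskei used general semi-discrete frames to obtain a wider class of window functions for the scattering transform (cp. [9, 8]), it seems natural to consider a specific model used for audio data, to analyse Gabor frames for the scattering transform and to study the corresponding properties. We next introduce the basics of Gabor frames and refer to the literature for more details. A sequence $(g_k)_{k}$ of elements in a Hilbert space $\mathcal{H}$ is called a frame if there exist positive frame bounds $A, B > 0$ such that for all $f \in \mathcal{H}$
$$A \|f\|^2 \le \sum_{k} |\langle f, g_k \rangle|^2 \le B \|f\|^2.$$
If $A = B$, then we call $(g_k)_{k}$ a tight frame. In order to define Gabor frames we need to introduce two operators, i.e. the translation and the modulation operator of a function $f$.
Let $x \in \mathbb{R}$; then the translation (time shift) operator is defined as $T_x f(t) := f(t - x)$ for all $t \in \mathbb{R}$.
Let $\omega \in \mathbb{R}$; then the modulation (frequency shift) operator is defined as $M_\omega f(t) := e^{2\pi i t \omega} f(t)$ for all $t \in \mathbb{R}$.
Moreover, we can use these operators to express the short-time Fourier transform (STFT) of a function $f$ with respect to a given window function $g$ as $V_g f(x, \omega) = \langle f, M_\omega T_x g \rangle$. In order to reduce redundancy, we sample $V_g f$ on a separable lattice $\Lambda = \alpha\mathbb{Z} \times \beta\mathbb{Z}$, in time by $\alpha > 0$ and in frequency by $\beta > 0$. The resulting samples correspond to the coefficients of $f$ w.r.t. a Gabor system.
Given a window function $g \in \mathcal{H}$, where the Hilbert space $\mathcal{H}$ is either $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$, and lattice parameters $\alpha, \beta > 0$, the set of time-frequency shifted versions of $g$,
$$G(g, \alpha, \beta) = \{ M_{\beta j} T_{\alpha k} g : (\alpha k, \beta j) \in \Lambda \},$$
is called a Gabor system.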
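The sampled STFT and the frame inequality can be checked numerically in a finite, periodic setting. The following is a minimal sketch, not the setting of the paper: it assumes signals of finite length with circular shifts, a periodic Hann window, and hypothetical parameter names (`a` for the time step, `M` for the number of frequency channels). For this particular choice (window length equal to `M`, time step a quarter of the window length) the system is a tight frame, so the energy ratio below is the common frame bound $A = B$.

```python
import numpy as np

def gabor_coefficients(f, g, a, M):
    """Sampled STFT of f with window g, time step a and M frequency channels.

    Shifts are circular. Entry [j, k] equals <f, M_{j/M} T_{ak} g> up to a
    unimodular phase factor, so the moduli agree with the Gabor coefficients.
    """
    L, Lg = len(f), len(g)
    assert L % a == 0 and Lg <= M
    n = np.arange(Lg)
    coeffs = np.empty((M, L // a), dtype=complex)
    for k in range(L // a):
        segment = f[(n + a * k) % L] * np.conj(g)   # windowed piece starting at a*k
        coeffs[:, k] = np.fft.fft(segment, n=M)     # frequency samples j / M
    return coeffs

Lg, a, M, L = 64, 16, 64, 256
g = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(Lg) / Lg)    # periodic Hann window
rng = np.random.default_rng(1)
f = rng.standard_normal(L) + 1j * rng.standard_normal(L)

c = gabor_coefficients(f, g, a, M)
ratio = np.sum(np.abs(c) ** 2) / np.sum(np.abs(f) ** 2)   # frame bound, here M * 3/2 = 96

tone = np.exp(2j * np.pi * 10 * np.arange(L) / M)         # pure tone on channel 10
peak_channel = int(np.argmax(np.abs(gabor_coefficients(tone, g, a, M))[:, 0]))
```

The ratio does not depend on the random signal, which is exactly the tight-frame property; the pure tone is localized on its own frequency channel.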
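We proceed to introduce a scattering transform for Gabor frames. We base our considerations on the framework of Wiatowski and Bölcskei by using a triplet-sequence $\Omega = \big( (\Psi_\ell, \sigma_\ell, S_\ell) \big)_{\ell \in \mathbb{N}}$ consisting of a frame, a non-linearity and a pooling factor; here $\ell$ is associated to the $\ell$-th layer of the network.
Note that in this contribution we will deal with Hilbert spaces $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$; more precisely, in the input layer, i.e. the $0$th layer, we have $\mathcal{H}_0 = L^2(\mathbb{R})$ and, due to the discretization inherent in the Gabor transform, $\mathcal{H}_\ell = \ell^2(\mathbb{Z})$ for all $\ell > 0$.
We briefly review the elements of the triplet:
$\Psi_\ell = \{ g_{\lambda_\ell} \}_{\lambda_\ell \in \Lambda_\ell}$ with $g_{\lambda_\ell} = M_{\beta_\ell j} T_{\alpha_\ell k} g_\ell$, $\lambda_\ell = (\alpha_\ell k, \beta_\ell j)$, is a Gabor frame indexed by a lattice $\Lambda_\ell$.
Pooling depends on a pooling factor $S_\ell > 0$, which leads to dimensionality reduction. Max- and average-pooling are used most often; more examples can be found in the literature. In our context, pooling is covered by choosing specific lattices $\Lambda_\ell$ in each layer.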
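In order to explain the interpretation of Gabor scattering as a CNN, we write $g^-(t) = \overline{g(-t)}$ and have $|\langle f, M_{\beta j} T_{\alpha k} g \rangle| = |(f * M_{\beta j} g^-)(\alpha k)|$. Thus the Gabor coefficients can be interpreted as the samples of a convolution.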
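The convolution interpretation can be verified numerically. The sketch below works on a finite cyclic group instead of the real line and assumes the involution $g^-(t) = \overline{g(-t)}$ together with a Gaussian window; all variable names are illustrative. It compares the modulus of one inner product with a time-frequency shifted window against the corresponding sample of a circular convolution.

```python
import numpy as np

L = 128
rng = np.random.default_rng(2)
f = rng.standard_normal(L) + 1j * rng.standard_normal(L)
t = np.arange(L)
g = np.exp(-0.5 * ((t - L // 2) / 8.0) ** 2).astype(complex)   # Gaussian window (assumed)

j, x = 5, 37                                          # frequency index and time shift
modulation = np.exp(2j * np.pi * j * t / L)
atom = modulation * np.roll(g, x)                     # time-frequency shifted window
inner = np.vdot(atom, f)                              # inner product <f, atom>

g_minus = np.conj(g[(-t) % L])                        # involution: conj(g(-t))
h = modulation * g_minus                              # modulated, reflected window
conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(h))     # circular convolution f * h

lhs, rhs = abs(inner), abs(conv[x])                   # agree up to rounding
```

The two quantities differ only by a unimodular phase factor before the modulus is taken, which is why a modulus non-linearity makes the two viewpoints interchangeable.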
Let be a triplet-sequence, with ingredients explained above. Then the -th layer of the Gabor scattering transform is defined as the output of the operator
where is the output-vector of the previous layer and
Taking the calculation steps of the previous layers into account, we can extend (2) to paths on index sets and obtain
Similar to  for each layer, we use one atom of the Gabor frame in the subsequent layer as output-generating atom, i.e. Since this element is the -th convolution, it is an element of the -th frame, but because it belongs to the -th layer, its index is We also want to introduce a countable set and the space of sets for all Now we can define the feature extractor of a signal as in [9, Def. 3].
Let be a triplet-sequence and the output generating atom for each layer. Then the feature extractor is defined as
In the following section we are going to introduce the signal model which we consider in this paper.
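The layer structure described in this section can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes finite real signals with circular shifts, Hann windows in both layers, the modulus as non-linearity, and one fixed averaging filter `phi` standing in for the output-generating atom of every layer. All function and parameter names are hypothetical.

```python
import numpy as np

def gabor_modulus(f, g, a, M):
    """Modulus of the sampled STFT with circular shifts, shape (M, len(f) // a)."""
    n = np.arange(len(g))
    frames = np.stack([f[(n + a * k) % len(f)] * g for k in range(len(f) // a)], axis=1)
    return np.abs(np.fft.fft(frames, n=M, axis=0))

def lowpass_output(u, phi, stride):
    """Circular low-pass filtering along the last axis, followed by subsampling."""
    Phi = np.fft.fft(phi, n=u.shape[-1])
    smooth = np.real(np.fft.ifft(np.fft.fft(u, axis=-1) * Phi, axis=-1))
    return smooth[..., ::stride]

def gabor_scattering(f, g1, a1, M1, g2, a2, M2, phi, stride):
    """Outputs of layers 0, 1 and 2 for a real input signal f."""
    u1 = gabor_modulus(f, g1, a1, M1)                               # shape (M1, K1)
    u2 = np.stack([gabor_modulus(row, g2, a2, M2) for row in u1])   # shape (M1, M2, K2)
    return [lowpass_output(u, phi, stride) for u in (f, u1, u2)]

def hann(n):
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)

t = np.arange(1024)
f = np.cos(2 * np.pi * 8 * t / 64)             # real test tone on channel 8 of layer 1
phi = np.ones(8) / 8.0                         # averaging filter as low-pass atom
features = gabor_scattering(f, hann(64), 8, 64, hann(32), 4, 32, phi, stride=4)
shapes = [x.shape for x in features]
```

Each path through the network contributes one low-pass filtered time series, and the list `features` plays the role of the stacked feature vector that is passed on to a classifier.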
2.3 Musical Signal Model
Tones are one of the smallest units and simple models of an audio signal, consisting of one fundamental frequency, corresponding harmonics and a shaping envelope for each harmonic, providing a specific timbre.
Further, since our ears are limited to frequencies below , we develop our model over finitely many harmonics, i.e. .
The general model has the following form
where and For one single tone we choose Moreover we can create a space of tones and assume
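A single tone of this kind can be synthesized directly. The sketch below is an illustration under assumptions that are not taken from the paper: a sampling rate of 8000 Hz, a fundamental frequency of 200 Hz, four harmonics, and smooth bump envelopes whose height decays like $1/n$ (chosen only to mimic the decrease of the harmonics' energy with frequency). The function name `tone` is hypothetical.

```python
import numpy as np

def tone(t, xi0, envelopes):
    """Sum of harmonics n * xi0, each shaped by its own envelope A_n(t)."""
    return sum(A(t) * np.exp(2j * np.pi * n * xi0 * t)
               for n, A in enumerate(envelopes, start=1))

fs = 8000                                   # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                      # one second of samples
xi0 = 200.0                                 # fundamental frequency in Hz (assumed)
# Smooth bump envelopes with height 1/n (an illustrative choice).
envelopes = [lambda t, n=n: np.exp(-((t - 0.5) / 0.15) ** 2) / n for n in range(1, 5)]
f = tone(t, xi0, envelopes)

spectrum = np.abs(np.fft.fft(f))
freqs = np.fft.fftfreq(len(f), d=1 / fs)
peaks = []
for n in range(1, 5):
    band = np.abs(freqs - n * xi0) < xi0 / 2          # neighbourhood of the n-th harmonic
    peaks.append(freqs[band][np.argmax(spectrum[band])])
```

Within each band the spectrum peaks at the corresponding multiple of the fundamental frequency, while the width of each peak is governed by the Fourier transform of the envelope.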
3 Gabor Scattering of Music Signals
In  it was already stated that due to the structure of the scattering transform the energy of the signal is pushed towards low frequencies, where it then is captured by a low-pass filter as output generating atom. In the current section we explain how Gabor scattering separates relevant structures of signals described by our signal model . Due to the smoothing action of the output generating atom, each layer expresses certain invariances, which will be illustrated by numerical examples in Section 5. In Proposition 1, inspired by , we add some assumptions on the analysis window in the first layer for some and
Let with For fixed is chosen such that Then we obtain
where is the indicator function.
Equation (5) shows that for slowly varying amplitude functions, the first layer mainly captures the contributions near the frequencies of the tone’s harmonics. For time sections during which the envelopes change faster, such as during a tone’s onset, energy will also be found outside a small interval around the harmonics’ frequencies, and thus the error estimate (6) becomes less stringent.
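This qualitative behaviour can be observed in a small experiment. The sketch compares a single harmonic with a slowly varying envelope to the same harmonic with an abrupt onset and measures which share of the STFT energy lies away from the harmonic's frequency channel. The sampling rate, the Hann window, the 500 Hz carrier and the neighbourhood of three channels are assumptions made for the illustration; the function names are hypothetical.

```python
import numpy as np

def stft_energy(f, g, a):
    """Squared modulus of the STFT, using only frames that fit inside the signal."""
    Lg = len(g)
    frames = np.stack([f[s:s + Lg] * g for s in range(0, len(f) - Lg + 1, a)], axis=1)
    return np.abs(np.fft.fft(frames, axis=0)) ** 2

def off_harmonic_fraction(f, g, a, channel, halfwidth):
    """Share of STFT energy more than `halfwidth` channels away from `channel`."""
    E = stft_energy(f, g, a)
    j = np.arange(E.shape[0])
    dist = np.minimum(np.abs(j - channel), E.shape[0] - np.abs(j - channel))
    return E[dist > halfwidth].sum() / E.sum()

fs, Lg, a = 8000, 256, 64
t = np.arange(fs) / fs
g = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(Lg) / Lg)   # periodic Hann window
carrier = np.exp(2j * np.pi * 500.0 * t)                 # 500 Hz falls on channel 16

slow = (0.5 - 0.5 * np.cos(2 * np.pi * t)) * carrier     # slowly varying envelope
sharp = (t >= 0.5) * carrier                             # abrupt onset at t = 0.5 s

frac_slow = off_harmonic_fraction(slow, g, a, channel=16, halfwidth=3)
frac_sharp = off_harmonic_fraction(sharp, g, a, channel=16, halfwidth=3)
```

For the slow envelope almost all energy stays within a few channels of the harmonic, whereas the frames that contain the onset spread energy over a broad frequency range, so `frac_sharp` is considerably larger than `frac_slow`.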
Step 1 – Using the signal model for tones as input, interchanging the finite sum with the integral and performing a substitution we obtain
After performing a Taylor series expansion for where we have
Hence we choose set
and split the sum to obtain
Step 2 – We now bound (7):
For the second bound, i.e. the bound of Equation (8), we use the decay condition on thus
Next we split the sum into and We estimate the error term for
Since we have and also using we obtain
Due to symmetry we get
Summing up the error terms, we obtain (6). ∎
The error bound of Equation (10) gets bigger for lower frequencies. This makes sense, since the separation of the fundamental frequency and corresponding harmonics by the analysis window deteriorates. For higher frequencies separation improves and hence the error term gets smaller.
We now introduce two more operators, first the sampling operator
and second the periodization operator These operators have the following relation In order to see how the second layer captures relevant signal structures, depending on the first layer, we propose the following Corollary 1. Recall that
Let and Then the elements of the second layer can be expressed as
Using the outcome of Proposition 1 we obtain
For the error we use the global estimate and, using the notation above we proceed as follows:
Since the values are maximal in a neighborhood of the center frequency we consider the case separately and obtain
It remains to bound the sum, i.e. the second term of Equation (12):
In the last Equation (13) we applied the triangle inequality twice and the modulation term can be ignored because of the modulus. Now we can use our assumption and also the assumption on the Fourier transform of the analysis window
We rewrite the first term in (12):
The last Equation (15) uses Plancherel’s theorem. Rewriting the last term we obtain
The sum decreases very fast, i.e. taking the summand is already smaller as for
Note that, since the envelopes are expected to change slowly except around transients, their Fourier transforms concentrate their energy in the low frequency range. In Section 5 it will be shown by means of the analysis of example signals, how the second layer distinguishes tones which have a smooth onset (transient) from those which have a sharp attack, which leads to broadband characteristics of around this attack. Similarly, if undergoes an amplitude modulation, the frequency of this modulation can be clearly discerned, cf. Figure 1 and the corresponding example. This observation is clearly reflected in expression (12). Since decays fast, the aliasing terms in (12), for are sufficiently small.
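The interplay between sampling and periodization that produces the aliasing terms has a simple discrete analogue, sketched here as an illustration rather than as the relation used in the paper: subsampling a finite signal by a factor `a` folds its discrete Fourier transform into `a` blocks that are averaged. Signal length and subsampling factor are arbitrary assumptions.

```python
import numpy as np

L, a = 240, 4                        # signal length and subsampling factor (assumed)
rng = np.random.default_rng(3)
x = rng.standard_normal(L)

X = np.fft.fft(x)
Y = np.fft.fft(x[::a])               # spectrum of the sampled signal

# Periodization: fold the spectrum into `a` blocks of length L // a and average them.
X_periodized = X.reshape(a, L // a).sum(axis=0) / a
```

`Y` and `X_periodized` coincide up to rounding. When the spectrum of `x` decays fast, all blocks except the one containing the low frequencies are negligible, which is the discrete counterpart of small aliasing terms.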
To obtain the Gabor scattering coefficients, we need to apply the output generating atom as in (3).
Let then the output of the first layer is
and with the second layer output is
Note that the convolution is a low-pass filter for sufficient smoothness of Hence, in dependence on the pooling factor , the temporal fine-structure of is averaged out. In the second layer, applying the output generating atom removes the fine temporal structure and thus, the second layer reveals information contained in the envelopes .
We show the calculations for the first layer, for the subsequent layer it is the same:
where The factor will be absorbed by the constants in the error terms, i.e.
Calculations are similar for the second layer. ∎
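The statement that the second layer reveals information contained in the envelopes can be illustrated with an amplitude-modulated harmonic. The sketch below is a simplification under explicit assumptions: a single 500 Hz harmonic, an 8 Hz amplitude modulation, a Hann window in the first layer, and a plain Fourier analysis along time standing in for the second-layer Gabor transform. All names and parameter values are illustrative.

```python
import numpy as np

fs, Lg, a, M = 8000, 256, 64, 256
t = np.arange(fs) / fs
g = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(Lg) / Lg)     # periodic Hann window
f = (1.0 + 0.5 * np.cos(2 * np.pi * 8.0 * t)) * np.exp(2j * np.pi * 500.0 * t)

# First layer: modulus of the sampled STFT (circular frames), shape (M, K).
n = np.arange(Lg)
K = len(f) // a
frames = np.stack([f[(n + a * k) % len(f)] * g for k in range(K)], axis=1)
u1 = np.abs(np.fft.fft(frames, n=M, axis=0))

# Second layer, simplified: one Fourier analysis along time of the carrier channel.
channel = round(500.0 * M / fs)                            # channel 16
u2 = np.abs(np.fft.fft(u1[channel] - u1[channel].mean()))
frame_rate = fs / a                                        # first-layer samples per second
am_estimate = np.argmax(u2[: K // 2]) * frame_rate / K     # modulation frequency in Hz
```

The carrier frequency has disappeared from the first-layer channel because of the modulus, and the dominant component of its temporal evolution sits at the modulation frequency of the envelope.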
4 Deformation Stability
In this section we study to which extent Gabor scattering is stable with respect to certain deformations. We consider changes in spectral shape as well as frequency modulations. The method we apply is inspired by  and uses the decoupling technique, i.e. in order to prove stability of the feature extractor we first take the structural properties of the signal class into account and search for an error bound of deformations of the signals in In combination with the contractivity property of , see [9, Prop. 4], which follows from where is the upper frame bound of the Gabor frame this yields deformation stability of the feature extractor.
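The contractivity that drives the decoupling argument can be observed for a single layer. The sketch assumes a finite, circular setting and a Hann window rescaled so that the Gabor system is a tight frame with upper frame bound equal to 1; under this assumption the modulus of the Gabor coefficients cannot increase distances between signals. The function name and all parameters are hypothetical.

```python
import numpy as np

def gabor_modulus(f, g, a, M):
    """Modulus of the sampled STFT with circular shifts, shape (M, len(f) // a)."""
    n = np.arange(len(g))
    frames = np.stack([f[(n + a * k) % len(f)] * np.conj(g)
                       for k in range(len(f) // a)], axis=1)
    return np.abs(np.fft.fft(frames, n=M, axis=0))

Lg, a, M, L = 64, 16, 64, 256
g = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(Lg) / Lg)    # periodic Hann window
g = g / np.sqrt(1.5 * M)          # rescaled so that the frame is tight with bound 1

rng = np.random.default_rng(4)
f = rng.standard_normal(L)
h = f + 0.1 * rng.standard_normal(L)                      # a perturbed copy of f

feature_dist = np.linalg.norm(gabor_modulus(f, g, a, M) - gabor_modulus(h, g, a, M))
signal_dist = np.linalg.norm(f - h)                       # feature_dist <= signal_dist
```

Because of this non-expansiveness, any bound on the distance between a signal and its deformed version carries over to the features, which is why the analysis can concentrate on the signal class.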
4.1 Envelope Changes
The simplest deformation of a tone is a deformation of its envelope. This corresponds to a change in timbre, for example by playing a note on a different instrument. Mathematically this can be expressed as:
Let and for constants and Moreover let Then
for depending only on
Setting we obtain
We apply the mean value theorem for a continuous function and get
Applying the norm on and the assumption on we obtain:
Splitting the integral into and and using the monotonicity of we have
Moreover for we have This leads to
In Equation (16) we performed a change of variables, i.e. Setting and summing up we obtain
Harmonics’ energy decreases with increasing frequency, hence
4.2 Frequency Modulation
Another kind of sound deformation results from frequency modulation. This corresponds to, for example, playing a higher or lower pitch, or producing a vibrato. This can be formulated as: