Gabor frames and deep scattering networks in audio processing

06/27/2017 ∙ by Roswitha Bammer, et al. ∙ 0





1 Abstract

In this paper a feature extractor based on Gabor frames and Mallat’s scattering transform, called Gabor scattering, is introduced. This feature extractor is applied to a simple signal model for audio signals, i.e. a class of tones consisting of a fundamental frequency, its multiples and a corresponding envelope. Different layers exhibit invariances to different signal features. We give a mathematical explanation for the first and the second layer, illustrated by numerical examples. Deformation stability of this feature extractor is shown by using a decoupling technique previously suggested for the scattering transform of cartoon functions. Here it is used to determine whether the feature extractor is robust to changes in spectral shape and to frequency modulation.

2 Introduction

Popular machine learning techniques such as (deep) convolutional networks (CNN) have led to impressive results in several applications, e.g., classification tasks. In audio, data usually undergo some pre-processing, often called feature extraction, before being fed into the trainable machine. CNNs as feature extractors have been given a rigorous mathematical analysis by Mallat [6], leading to the so-called scattering transform, based on a wavelet transform in each network layer. Invariance and deformation stability properties of the resulting feature extractors were investigated in [6, 2, 1].
Wiatowski and Bölcskei extended the scattering transform by allowing different semi-discrete frames in each network layer, general Lipschitz-continuous non-linearities and pooling operators, see [9, 8]. In this general network, the authors proved vertical translation invariance and deformation stability for band-limited functions.
In the current contribution we introduce a feature extractor called Gabor scattering, first presented in the conference contribution [3], which is a scattering transform based on Gabor frames in each layer. Gabor frames are closely related to the short-time Fourier transform (STFT), see Section 2.2, which is commonly used in practice to analyse audio data. We then study invariance properties of Gabor scattering when applied to a common signal model for audio signals (Section 3) and use the same signal model to establish deformation bounds for a feature extractor based on Gabor scattering (Section 4). This approach is motivated by the observation that restrictions more specific than band-limitation lead to more precise deformation estimates when dealing with signals with certain known properties. Thereby, we follow the idea presented in Grohs et al. [5] and use a decoupling technique, which proves deformation stability by exploiting stability of the signal class introduced in Section 2.3 in combination with a general property, namely contractivity, which is entirely due to the network architecture.

2.1 Deep Convolutional Neural Networks (DCNNs) and Invariance

In order to understand the mathematical construction used within this paper, we briefly introduce the principal idea and structure of a DCNN. DCNNs enable the computer to learn from observation data. Their structure is inspired by neurones and the human process of learning. Usually a network consists of several layers, namely an input layer, several hidden layers (a deep CNN has more than one hidden layer) and one output layer. As input we use the data that should be classified or, more generally, whose properties should be learnt. A hidden layer consists of several ingredients: first, the convolution of the data with a small weighting matrix, which can be interpreted as localization of certain properties of the input data. Similarly, neurones in our brain act only in a local environment when a certain stimulation appears. The next ingredient of the hidden layer is the application of a non-linearity function, also called activation function, which signals whether the information of this neurone is relevant enough to be transmitted. Furthermore, in order to reduce redundancy and increase invariance, pooling is applied. Due to these ingredients of the hidden layer, i.e. convolution with a weighting matrix, non-linearity function and pooling, certain invariances are generated [7]. In order to make concrete statements about invariances generated by certain layers for a certain class of input data, one needs to develop a signal model, as we do in Section 2.3. Within our simple model for tones, convolution is performed with a low-pass filter and thus, in combination with pooling, temporal fine structure is averaged out, while frequency information is maintained. Hence in our case we have some invariance w.r.t. the envelope in the first layer. Moreover, after further convolutions, more temporal fine structure is removed and information on large scales is captured in higher layers. In our model the second layer is invariant w.r.t. the pitch and information about the envelopes is visible, cp. Sections 3 and 5. Note that in a neural network, in particular in CNNs, the output, e.g. classification labels, is obtained after several concatenated hidden layers. In the case of a scattering network, the outputs of each layer are stacked together into a feature vector and further processing is necessary to obtain the desired result. Usually, after some kind of dimensionality reduction, this vector can be fed into a support vector machine or a general NN, which performs the classification task.
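The three ingredients of a hidden layer described above (convolution, non-linearity, pooling) can be sketched in a few lines. This is only an illustration under assumptions: the function name `hidden_layer`, the Hann-shaped weighting vector and the pooling factor are hypothetical choices, and the weights are fixed here rather than learnt.

```python
import numpy as np

def hidden_layer(x, kernel, pool=4):
    """One illustrative layer: convolution, modulus non-linearity, average pooling."""
    y = np.convolve(x, kernel, mode="same")      # local weighting of the input
    y = np.abs(y)                                # modulus as non-linearity
    n = (len(y) // pool) * pool                  # drop a tail that does not fill a pool
    return y[:n].reshape(-1, pool).mean(axis=1)  # average pooling by factor `pool`

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                      # stand-in for input data
kernel = np.hanning(9) / np.hanning(9).sum()     # small low-pass weighting vector
out = hidden_layer(x, kernel)                    # 64 samples reduced to 16 features
```

Because the weighting vector is a low-pass filter and pooling averages neighbouring values, small time shifts of `x` change `out` only mildly, which is the kind of invariance discussed in this section.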

2.2 Gabor Scattering

Since Wiatowski and Bölcskei used general semi-discrete frames to obtain a wider class of window functions for the scattering transform (cp. [9, 8]), it seems natural to consider a specific model used for audio data, to analyse Gabor frames for the scattering transform, and to study the corresponding properties. We next introduce the basics of Gabor frames and refer to [4] for more details. A sequence of elements $(g_k)_{k \in \mathbb{N}}$ in a Hilbert space $\mathcal{H}$ is called a frame if there exist positive frame bounds $A, B > 0$ such that for all $f \in \mathcal{H}$

$A \|f\|^2 \le \sum_{k} |\langle f, g_k \rangle|^2 \le B \|f\|^2.$

If $A = B$, then we call $(g_k)_{k \in \mathbb{N}}$ a tight frame. In order to define Gabor frames we need to introduce two operators, i.e. the translation and modulation operator of a function $f$. Let

  • $x \in \mathbb{R}$, then the translation (time shift) operator is defined as $T_x f(t) := f(t - x)$ for all $t \in \mathbb{R}$;

  • $\omega \in \mathbb{R}$, then the modulation (frequency shift) operator is defined as $M_\omega f(t) := e^{2\pi i \omega t} f(t)$ for all $t \in \mathbb{R}$.

Moreover, we can use these operators to express the short-time Fourier transform (STFT) of a function $f$ with respect to a given window function $g$ as $V_g f(x, \omega) = \langle f, M_\omega T_x g \rangle$. In order to reduce redundancy, we sample $V_g f$ on a separable lattice $\Lambda = \alpha\mathbb{Z} \times \beta\mathbb{Z}$, in time by $\alpha > 0$ and in frequency by $\beta > 0$. The resulting samples correspond to the coefficients of $f$ w.r.t. a Gabor system.

Definition 1.

(Gabor System)
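Given a window function $g \in \mathcal{H}$, where the Hilbert space $\mathcal{H}$ is either $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$, and lattice parameters $\alpha, \beta > 0$, the set of time-frequency shifted versions of $g$

$G(g, \alpha, \beta) = \{ M_{\beta j} T_{\alpha k} g : j, k \in \mathbb{Z} \}$

is called a Gabor system.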
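To make the sampled STFT concrete, the following minimal sketch computes Gabor coefficients of a finite, periodically extended signal. The function name `gabor_coefficients`, the Gaussian window and the parameters `a` (time step in samples) and `M` (number of frequency channels) are assumptions chosen for illustration; the finite circular setting replaces the spaces used in the text.

```python
import numpy as np

def gabor_coefficients(f, g, a, M):
    """Sampled STFT of f w.r.t. window g with circular indexing.

    Returns c[j, k] = <f, M_{j/M} T_{a k} g>, i.e. the inner product of f with
    the window shifted by a*k samples and modulated to frequency j/M.
    """
    L = len(f)
    t = np.arange(L)
    K = L // a
    c = np.empty((M, K), dtype=complex)
    for k in range(K):
        shifted = np.roll(g, a * k)                          # T_{a k} g
        for j in range(M):
            atom = np.exp(2j * np.pi * j * t / M) * shifted  # M_{j/M} T_{a k} g
            c[j, k] = np.vdot(atom, f)                       # vdot conjugates `atom`
    return c

L = 256
t = np.arange(L)
d = np.minimum(t, L - t)                   # circular distance to index 0
g = np.exp(-np.pi * (d / 16.0) ** 2)       # Gaussian window centred at 0
f = np.exp(2j * np.pi * 8 * t / 64)        # pure tone at frequency 8/64
c = gabor_coefficients(f, g, a=16, M=64)   # array of shape (64, 16)
dominant = np.argmax(np.abs(c), axis=0)    # strongest channel per time step
```

For the pure tone used here, the strongest channel at every time step is the one matching the tone's frequency. The double loop is written for readability; an FFT-based implementation would be preferable for longer signals.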

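We proceed to introduce a scattering transform for Gabor frames. We base our considerations on [9] by using a triplet-sequence $\Omega = ((\Psi_\ell, \sigma_\ell, S_\ell))_{\ell \in \mathbb{N}}$; here $\ell$ is associated to the $\ell$-th layer of the network. Note that in this contribution we will deal with Hilbert spaces $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$; more precisely, in the input layer, i.e. the $0$th layer, we have $\mathcal{H}_0 = L^2(\mathbb{R})$ and, due to the discretization inherent in the Gabor transform, $\mathcal{H}_\ell = \ell^2(\mathbb{Z})$ for $\ell > 0$.
We briefly review the elements of the triplet:

  • $\Psi_\ell := \{g_{\lambda_\ell}\}_{\lambda_\ell \in \Lambda_\ell}$ with $g_{\lambda_\ell} = M_{\beta_\ell j} T_{\alpha_\ell k} g_\ell$, $\lambda_\ell = (\alpha_\ell k, \beta_\ell j)$, is a Gabor frame indexed by a lattice $\Lambda_\ell$.

  • A non-linearity function $\sigma_\ell$ (e.g. rectified linear units, modulus function, see [9]) is applied pointwise and is chosen to be Lipschitz-continuous, i.e. $\|\sigma_\ell f - \sigma_\ell h\|_2 \le L_\ell \|f - h\|_2$ for all $f, h \in \mathcal{H}_\ell$. In this paper we only use the modulus function $|\cdot|$ with Lipschitz constant $L_\ell = 1$ for all $\ell \in \mathbb{N}$.

  • Pooling depends on a pooling factor $S_\ell > 0$, which leads to dimensionality reduction. Max- and average-pooling are used most often; more examples can be found in [9]. In our context, pooling is covered by choosing specific lattices $\Lambda_\ell$ in each layer.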

In order to explain the interpretation of Gabor scattering as CNN, we write and have Thus the Gabor coefficients can be interpreted as the samples of a convolution.
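The convolution interpretation can be checked by a short computation. The step below is a sketch under assumed notation: $g^*(t) := \overline{g(-t)}$ denotes the involution of the window, and the standard conventions $T_x g(t) = g(t-x)$ and $M_\omega g(t) = e^{2\pi i \omega t} g(t)$ are used; the phase convention of the original display may differ.

```latex
\[
\begin{aligned}
\langle f, M_\omega T_x g \rangle
  &= \int_{\mathbb{R}} f(t)\, \overline{g(t - x)}\, e^{-2\pi i \omega t}\, dt \\
  &= e^{-2\pi i \omega x} \int_{\mathbb{R}} f(t)\, e^{2\pi i \omega (x - t)}\, g^*(x - t)\, dt \\
  &= e^{-2\pi i \omega x}\, \bigl(f * M_\omega g^*\bigr)(x).
\end{aligned}
\]
```

Evaluating at $x = \alpha k$ and $\omega = \beta j$ gives the Gabor coefficient as a sample of the convolution $f * M_{\beta j} g^*$, up to a unimodular phase factor that disappears once the modulus is applied.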

Definition 2.

(Gabor Scattering)
Let be a triplet-sequence, with ingredients explained above. Then the -th layer of the Gabor scattering transform is defined as the output of the operator


where is the output-vector of the previous layer and

Taking the calculation steps of the previous layers into account, we can extend (2) to paths on index sets and obtain

Similar to [9] for each layer, we use one atom of the Gabor frame in the subsequent layer as output-generating atom, i.e. Since this element is the -th convolution, it is an element of the -th frame, but because it belongs to the -th layer, its index is We also want to introduce a countable set and the space of sets for all Now we can define the feature extractor of a signal as in [9, Def. 3].

Definition 3.

(Feature Extractor)
Let be a triplet-sequence and the output generating atom for each layer. Then the feature extractor is defined as


In the following section we are going to introduce the signal model which we consider in this paper.
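The layered construction of Definitions 2 and 3 can be sketched numerically for two layers. The sketch assumes a finite, circular setting, Gaussian windows, the modulus as non-linearity, and pooling realised by the time steps `a1`, `a2` and the subsampling steps of the output-generating atoms; all function names and parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian(L, width):
    """Gaussian window of length L, centred at index 0 (circularly)."""
    d = np.minimum(np.arange(L), L - np.arange(L))
    return np.exp(-np.pi * (d / width) ** 2)

def gabor_modulus(f, g, a, M):
    """|<f, M_{j/M} T_{a k} g>| for j < M, k < len(f)//a; requires M to divide len(f)."""
    cols = []
    for k in range(len(f) // a):
        windowed = f * np.conj(np.roll(g, a * k))
        folded = windowed.reshape(-1, M).sum(axis=0)  # alias length L down to length M
        cols.append(np.fft.fft(folded))               # M frequency channels
    return np.abs(np.array(cols).T)                   # shape (M, K)

def lowpass_output(x, phi, step):
    """Circular convolution along the last axis with atom phi, then subsampling."""
    phi = phi / phi.sum()
    y = np.fft.ifft(np.fft.fft(x, axis=-1) * np.fft.fft(phi), axis=-1)
    return np.real(y)[..., ::step]

def gabor_scattering(f, a1=8, M1=64, a2=4, M2=16):
    U1 = gabor_modulus(f, gaussian(len(f), 32), a1, M1)      # layer 1, shape (M1, K1)
    K1 = U1.shape[1]
    U2 = np.stack([gabor_modulus(u, gaussian(K1, 8), a2, M2)
                   for u in U1])                             # layer 2, shape (M1, M2, K2)
    K2 = U2.shape[2]
    S1 = lowpass_output(U1, gaussian(K1, 16), step=4)        # first-layer features
    S2 = lowpass_output(U2, gaussian(K2, 4), step=2)         # second-layer features
    return S1, S2

t = np.arange(1024)
f = np.cos(2 * np.pi * 8 * t / 64)       # stationary test tone
S1, S2 = gabor_scattering(f)
```

Stacking `S1` and `S2` (and, if desired, a low-pass filtered copy of the input) yields a feature vector in the spirit of Definition 3.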

2.3 Musical Signal Model

Tones are among the smallest units and simplest models of an audio signal, consisting of one fundamental frequency, corresponding harmonics and a shaping envelope for each harmonic, which provides the specific timbre. Further, since human hearing is limited to frequencies below about 20 kHz, we develop our model over finitely many harmonics.
The general model has the following form


where and For one single tone we choose Moreover we can create a space of tones and assume
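A tone of the kind described in this section can be synthesised as follows. Since the displayed formula is not reproduced above, the sketch assumes the common harmonic form $f(t) = \sum_{n=1}^{N} A_n(t) e^{2\pi i n \xi_0 t}$; the names `tone`, `xi0` and `envelopes`, as well as the exponentially decaying envelopes, are illustrative assumptions.

```python
import numpy as np

def tone(t, xi0, envelopes):
    """Sum of harmonics n*xi0, n = 1..N, each shaped by its own envelope A_n(t)."""
    return sum(A(t) * np.exp(2j * np.pi * n * xi0 * t)
               for n, A in enumerate(envelopes, start=1))

fs = 8000.0                       # sampling rate in Hz (assumed)
t = np.arange(4000) / fs          # half a second of samples
xi0 = 440.0                       # fundamental frequency in Hz
# Hypothetical envelopes: exponential decay, weaker and faster for higher harmonics.
envelopes = [lambda t, n=n: np.exp(-3.0 * n * t) / n for n in range(1, 6)]
f = tone(t, xi0, envelopes)

spectrum = np.abs(np.fft.fft(f))
freqs = np.fft.fftfreq(len(f), 1 / fs)
peak = freqs[np.argmax(spectrum)]  # frequency of the strongest spectral line
```

With these envelopes the strongest spectral line sits at the fundamental, and the remaining harmonics appear at integer multiples of `xi0` with decreasing energy.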

3 Gabor Scattering of Music Signals

In [2] it was already stated that due to the structure of the scattering transform the energy of the signal is pushed towards low frequencies, where it then is captured by a low-pass filter as output generating atom. In the current section we explain how Gabor scattering separates relevant structures of signals described by our signal model . Due to the smoothing action of the output generating atom, each layer expresses certain invariances, which will be illustrated by numerical examples in Section 5. In Proposition 1, inspired by [2], we add some assumptions on the analysis window in the first layer for some and

Proposition 1.

Let with For fixed is chosen such that Then we obtain


where is the indicator function.

Equation (5) shows that for slowly varying amplitude functions , the first layer mainly captures the contributions near the frequencies of the tone’s harmonics. Obviously, for time-sections during which the envelopes undergo faster changes, such as during a tone’s onset, energy will also be found outside a small interval around the harmonics’ frequencies and thus the error estimate (6) becomes less stringent.


Step 1 – Using the signal model for tones as input, interchanging the finite sum with the integral and performing a substitution we obtain

After performing a Taylor series expansion for where we have

Hence we choose set


and split the sum to obtain

Step 2 – We now bound (7):

For the second bound, i.e. the bound of Equation (8), we use the decay condition on thus

Next we split the sum into and We estimate the error term for

Since we have and also using we obtain

Due to symmetry we get


Summing up the error terms, we obtain (6). ∎

Remark 1.

The error bound of Equation (10) gets bigger for lower frequencies. This makes sense, since the separation of the fundamental frequency and corresponding harmonics by the analysis window deteriorates. For higher frequencies separation improves and hence the error term gets smaller.

We now introduce two more operators, first the sampling operator
and second the periodization operator These operators have the following relation In order to see how the second layer captures relevant signal structures, depending on the first layer, we propose the following Corollary 1. Recall that

Corollary 1.

Let and Then the elements of the second layer can be expressed as



Using the outcome of Proposition 1 we obtain

For the error we use the global estimate and, using the notation above we proceed as follows:


Since the values are maximal in a neighborhood of the center frequency we consider the case separately and obtain


It remains to bound the sum, i.e. the second term of Equation (12):


In the last Equation (13) we applied the triangle inequality twice and the modulation term can be ignored because of the modulus. Now we can use our assumption and also the assumption on the Fourier transform of the analysis window


We rewrite the first term in (12):


The last Equation (15) uses Plancherel’s theorem. Rewriting the last term we obtain

Remark 2.

The sum decreases very fast, i.e. taking the summand is already smaller as for

Remark 3.

Note that, since the envelopes are expected to change slowly except around transients, their Fourier transforms concentrate their energy in the low frequency range. In Section 5 it will be shown by means of the analysis of example signals, how the second layer distinguishes tones which have a smooth onset (transient) from those which have a sharp attack, which leads to broadband characteristics of around this attack. Similarly, if undergoes an amplitude modulation, the frequency of this modulation can be clearly discerned, cf. Figure 1 and the corresponding example. This observation is clearly reflected in expression (12). Since decays fast, the aliasing terms in (12), for are sufficiently small.
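The statement that slowly varying envelopes concentrate their Fourier energy at low frequencies, while a sharp attack produces broadband content, can be illustrated numerically. Both envelopes, the 20 Hz cutoff and the helper `high_freq_fraction` are hypothetical choices made for this sketch.

```python
import numpy as np

fs = 1000.0
t = np.arange(1000) / fs
smooth = np.sin(np.pi * t) ** 2                             # slow onset and decay
sharp = np.where(t >= 0.1, np.exp(-5.0 * (t - 0.1)), 0.0)   # abrupt attack at 0.1 s

def high_freq_fraction(A, cutoff_hz=20.0):
    """Fraction of the envelope's spectral energy above cutoff_hz."""
    spec = np.abs(np.fft.rfft(A)) ** 2
    freqs = np.fft.rfftfreq(len(A), 1 / fs)
    return spec[freqs > cutoff_hz].sum() / spec.sum()

frac_smooth = high_freq_fraction(smooth)
frac_sharp = high_freq_fraction(sharp)
```

The smooth envelope has essentially no energy above the cutoff, whereas the envelope with the abrupt attack retains a clearly non-negligible share, which is the broadband behaviour around the attack described in Remark 3.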

To obtain the Gabor scattering coefficients, we need to apply the output generating atom as in (3).

Corollary 2.

Let then the output of the first layer is


and with the second layer output is


Remark 4.

Note that the convolution is a low-pass filter for sufficient smoothness of Hence, in dependence on the pooling factor , the temporal fine-structure of is averaged out. In the second layer, applying the output generating atom removes the fine temporal structure and thus, the second layer reveals information contained in the envelopes .


We show the calculations for the first layer, for the subsequent layer it is the same:

where The factor will be absorbed by the constants in the error terms, i.e.

Calculations are similar for the second layer. ∎

4 Deformation stability

In this section we study to which extent Gabor scattering is stable with respect to certain deformations. We consider changes in spectral shape as well as frequency modulations. The method we apply is inspired by [5] and uses the decoupling technique, i.e. in order to prove stability of the feature extractor we first take the structural properties of the signal class into account and search for an error bound of deformations of the signals in In combination with the contractivity property of , see [9, Prop. 4], which follows from where is the upper frame bound of the Gabor frame this yields deformation stability of the feature extractor.
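The decoupling argument can be summarised in two inequalities. The notation is assumed for this sketch: $\Phi$ denotes the feature extractor, $B_\ell$ the upper frame bound of the Gabor frame in layer $\ell$, $F_\tau$ a deformation acting on the signal class, and $C(\tau)$ a bound depending on the deformation; the precise condition on the frame bounds is the one stated in [9, Prop. 4].

```latex
\[
\begin{aligned}
\|\Phi(f) - \Phi(h)\| &\le \|f - h\|_2
  && \text{(contractivity, assuming } B_\ell \le 1 \text{ in every layer)}, \\
\|\Phi(f) - \Phi(F_\tau f)\| &\le \|f - F_\tau f\|_2 \le C(\tau)
  && \text{(error bound for deformations within the signal class)}.
\end{aligned}
\]
```

The first inequality depends only on the network architecture, the second only on the signal class, which is why the two parts can be treated separately.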

4.1 Envelope Changes

The simplest deformation of a tone is a deformation of its envelopes. This corresponds to a change in timbre, for example when a note is played on a different instrument. Mathematically this can be expressed as:

Lemma 1.

Let and for constants and Moreover let Then

for depending only on


Setting we obtain

We apply the mean value theorem for a continuous function and get

Applying the norm on and the assumption on we obtain:

Splitting the integral into and and using the monotonicity of we have

Moreover for we have This leads to


In Equation (16) we performed a change of variables, i.e. Setting and summing up we obtain

Remark 5.

Harmonics’ energy decreases with increasing frequency, hence

4.2 Frequency modulation

Another kind of sound deformation results from frequency modulation of the signal. This corresponds, for example, to playing a higher or lower pitch, or to producing a vibrato. This can be formulated as: