Overcomplete Frame Thresholding for Acoustic Scene Analysis

12/25/2017 ∙ by Romain Cosentino, et al.

In this work, we derive a generic overcomplete frame thresholding scheme based on risk minimization. Overcomplete frames being favored for analysis tasks such as classification, regression or anomaly detection, we provide a way to leverage those optimal representations in real-world applications through the use of thresholding. We validate the method on a large scale bird activity detection task via the scattering network architecture performed by means of continuous wavelets, known for being an adequate dictionary in audio environments.

1 Introduction

From anomaly prediction in medical data to image recognition, the input signal is often corrupted with noise [26, 15, 45]. A classical approach in signal processing is based on the following analysis-synthesis framework: the data are analyzed (change of basis), denoised in their new basis, and then either fed to a machine learning classifier or synthesized (inverse change of basis) before being processed by another algorithm. There is a wide variety of analysis-synthesis operators, each well defined mathematically by its specific properties. In this work, we focus on overcomplete frame operators. Those operators have been applied and studied in a wide range of areas such as wireless communications [40], signal processing [17, 28], quantum computing [24], and functional analysis [30]. They yield interesting properties such as robustness [10], stability and compactness [2]. A simple intuition behind those properties is given by an analogy with human vocabulary. Consider the atoms of a dictionary as the vocabulary of a person. The larger the dictionary, the more precise and concise the person can be in their speech. As such, the analysis of their discourse is eased thanks to its sparsity. In signal processing, sparse representations provide a fruitful pre-processing step allowing simple machine learning classifiers to produce high accuracy results. We focus on overcomplete frames yielding a time-frequency representation, since an efficient representation of a signal amounts to extracting both its time and frequency content [37, 16]. In particular, wavelet frames are known to provide a robust time-frequency representation for non-stationary signals, as they are localized in both time and frequency. In most cases, wavelets are used for three different tasks. First, compression: it has been demonstrated that wavelet transforms provide sparse representations particularly adapted to the compression of natural signals with non-stationary components [33, 14]. Second, denoising: due to the sparsity of the induced representation and the ability to derive analytical solutions, optimized denoising solutions with reconstruction guarantees have been developed [19, 20]. Finally, analysis, in which one aims at detecting subtle patterns of the input signal by using overcomplete dictionaries. The analysis task can be considered crucial today, especially given the need for pre-processing in current machine learning tasks. Unfortunately, thresholding methods for overcomplete frames have only been tackled for special cases.

Therefore, we propose to fill this gap by deriving a general analytical thresholding scheme for overcomplete frames. The scheme is grounded in empirical risk minimization, is computationally efficient, and makes it possible to leverage overcomplete frames in a corrupted environment.

The paper is organized as follows: Sections 1.1 and 1.2 are devoted to related work and to the contributions of the paper. Section 2 presents the theoretical results, including the derivation of the overcomplete hard-thresholding technique, the derivation of error bounds, and the presentation of the computational complexity and pseudo-code. Finally, in Section 3 we provide experiments demonstrating the ability of the proposed scheme via the introduction of a Sparse Deep Scattering Network (SDSN), an extension of the Deep Scattering Network (DSN), in Section 3.3. The task comparing the DSN and SDSN is a bird detection task using the Freefield1010 audio scenes dataset (http://machine-listening.eecs.qmul.ac.uk/bird-audio-detection-challenge/). The appendix is divided into two parts: Appendix A provides the prerequisites and details about the wavelet families used in our experiments; Appendix B reviews the orthogonal denoising framework proposed by Donoho et al. and gives the mathematical details and proofs for the overcomplete thresholding non-linearity that we introduce.

1.1 Related Work

During the last decades, the properties of wavelets have been exploited to perform denoising on a wide range of signals [13, 41, 39]. The decomposition of a signal by means of a wavelet transform concentrates the signal of interest in a few coefficients having high energy, while the noise is uniformly spread throughout all the time-frequency components [42]. This property is the direct consequence of the number of vanishing moments of the considered wavelet, which is defined as their orthogonality with respect to a certain order of regular signals. This property makes them an edge-detector basis (in the broad sense of the term) capable of providing sparse representations. Donoho et al., via a statistical signal processing approach to the problem, were able to provide powerful orthogonal basis denoising techniques holding theoretical guarantees [21, 19, 20, 22]. However, orthogonal wavelet decompositions provide poor time-frequency resolution and are more prone to instabilities. Since overcomplete frames provide extremely precise and often sparse representations, different approaches have been taken to tackle special cases of an overcomplete dictionary. This ability to fit the signal so closely should allow for more precise and robust denoising. An extension of the work of Donoho et al., one step toward overcomplete frames, was achieved by Berkner et al. in [6] using the Translation Invariant Discrete Wavelet Transform (TIDWT) and the Biorthogonal DWT. Then, using Independent Component Analysis, Marusic et al. [36] were able to tackle denoising in the case of multiple DWTs. Others exploit the tools available in sparse coding-based models to perform denoising in learned “overcomplete frames” [46, 9]. However, these works rely on specific structures of overcomplete dictionaries, often based on orthogonal filters, which drastically limits their applications.

1.2 Contributions

The contribution of the paper is the derivation of a general analytical thresholding technique for overcomplete frames. This technique, based on empirical risk minimization, is studied by providing error bounds, computational complexity, and pseudo-code allowing one to implement the method. We demonstrate the generality and robustness of our approach by means of redundant continuous wavelet dictionaries. First, we propose the visualization of the thresholded scalograms as well as the evaluation of the sparsity before and after denoising. Second, we propose quantitative experiments validating the proposed scheme. The latter are achieved by introducing our thresholding technique into the Deep Scattering Network (DSN), which leads to the development of the Sparse Deep Scattering Network (SDSN). We compare the performance of the SDSN and the DSN on a bird activity detection task. As we will see, bringing denoising into this state-of-the-art pre-processing technique further increases its linearization capability and improves the accuracy of the classifier on a corrupted dataset composed of extremely diverse audio scenes.

2 Overcomplete Frames Denoising

In the case of an analysis-synthesis framework with an overcomplete dictionary, the first question that arises is whether the mapping from the domain to the co-domain preserves distinctness. For the sake of generality, we assume that such an operator has a nontrivial kernel, so that the mapping is not injective. Therefore, the dual of the dictionary is obtained via the pseudo-inverse transform. Notice that Berkner et al. in [6] proposed the use of the Moore-Penrose inverse to build the reconstruction dictionary of biorthogonal wavelet and TIDWT dictionaries. Let us assume the observed signal, denoted by

, is corrupted with white noise such that

where is the signal of interest and . We now denote by the matrix composed of the filters at each time and scale (i.e., the explicit convolution matrix of the operator ) such that , is the wavelet transform of

. In other words, for each time position and each scale, the filter vectors

are the row vectors of . Then, we denote by the generalized inverse of such that . The columns of are the inverse filters

. The estimate of

is given by where is a diagonal binary operator such that,

(1)

with and denoting respectively the set of selected and unselected wavelet coefficients. We also define such that . Our thresholding estimate is based on the mean-square error between the signal of interest and the synthesized thresholded observed signal :

(2)

which can be rewritten in the following explicit form,

(3)

where are the co-domain’s coefficients. The correlation inherent to the redundant information contained in the dictionary implies an intractable factorial optimization problem, as the ideal risk now depends on all the possible pairs along the frequency axis. Since this ideal risk also requires the intervention of an oracle, we develop an upper bound on the ideal risk that yields an explicit expression for the thresholding operator, one that is adapted to any overcomplete transformation and is both tractable and analytic. We propose to use a min-max formulation as an upper bound of this ideal risk, explicitly derived in Appendix B.2. The upper bound on the optimal risk is denoted by and defined as,

(4)

where we denote by the upper bound error term corresponding to unselected coefficients:

(5)

which means that if we do not select a coefficient (e.g: ), the risk is equal to a weighted value of this coefficient. The weights are all the coefficients () weighted by the correlation between the reconstruction filter () and each of the reconstruction filters (). Now, let be the upper bound error term corresponding to the selected coefficients:

(6)

In the case where the coefficient is selected, the error corresponds to the energy of the noise propagated by the redundant frame. The noise is weighted by the correlation of both the filter and the reconstruction filter , with all filters and reconstruction filters , . One way to evaluate this upper bound is to assume an orthogonal basis and to compare it with the optimal risk in the orthogonal case, which leads to the following proposition.

Proposition 1.

Assuming an orthogonal filter matrix , the upper bound on the ideal risk coincides with the orthogonal ideal risk:

The proof is derived in Appendix B.3. Therefore, even with the upper bound, our optimal risk can still be used in the case of a bijective transform, and the optimal solution can be recovered. In order to apply the ideal risk derived above, one needs an oracle decision regarding the signal of interest. In real applications, the signal of interest is unknown. We thus propose the following empirical risk:

(7)

This risk corresponds to the empirical version of the ideal risk, where the observed signal is evaluated in the left part of the minimization function. In order to compare this empirical risk with the ideal version, we propose a bound analysis with respect to the optimal risks in the two possible extreme cases:

Proposition 2.

In the case where , the empirical risk coincides with the upper bound ideal risk:

Proposition 3.

In the case where , the following bound shows the distance between the empirical risk and the upper bound ideal risk:

(8)

where,

Refer to Appendix B.4 for proofs. As the empirical risk introduces the noise in the left part of the risk expression, this term represents the propagation of this noise throughout the decomposition. Naturally, we see from Proposition 3 that the more redundant the dictionary is, the less tight the bound is.
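As a minimal sketch of the analysis, masking and synthesis chain described in this section, the following NumPy code builds a random overcomplete analysis matrix, applies a binary selection to its coefficients, and synthesizes with the Moore-Penrose pseudo-inverse. The names `W`, `y`, `keep` and `threshold_and_synthesize`, the matrix sizes, and the random frame are illustrative assumptions, not the notation or the wavelet filter-bank of the paper. The mask is supplied by hand here rather than obtained from the risk minimization.

```python
import numpy as np

def threshold_and_synthesize(W, y, keep):
    """Analysis, binary masking and pseudo-inverse synthesis.

    W    : (K, N) analysis matrix with K >= N (rows are analysis filters).
    y    : (N,) observed signal.
    keep : (K,) boolean mask, True for the selected coefficients.
    """
    W_pinv = np.linalg.pinv(W)        # Moore-Penrose pseudo-inverse, shape (N, K)
    coeffs = W @ y                    # analysis coefficients
    return W_pinv @ (keep * coeffs)   # zero the unselected ones, then synthesize

rng = np.random.default_rng(0)
N, K = 16, 48                         # hypothetical sizes: a 3x redundant frame
W = rng.standard_normal((K, N))       # random frame, stand-in for a wavelet filter-bank
x = rng.standard_normal(N)            # stand-in for the signal of interest
y = x + 0.1 * rng.standard_normal(N)  # observation corrupted by white noise

# Keeping every coefficient reproduces y (up to round-off), because
# pinv(W) @ W is the identity when W has full column rank.
y_rec = threshold_and_synthesize(W, y, np.ones(K, dtype=bool))
# Keeping nothing returns the zero signal.
x_zero = threshold_and_synthesize(W, y, np.zeros(K, dtype=bool))
```

Any useful selection sits between these two extremes; the check with all coefficients kept only confirms that the pseudo-inverse acts as a left inverse of the analysis matrix.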

2.1 Algorithm and Complexity

In this section we present a practical guide and a discussion concerning the application of the method to large-scale tasks. In terms of complexity, the proposed local method is asymptotically linear in time, hence in the number of signals to treat, and quadratic in the number of filters present in the overcomplete filter-bank. However, the latter cost can be “pushed out” of the computations by performing it once for all signals. In fact, most of the costly terms to compute are constant across all treated signals (the pseudo-inverse of the filter correlation). As a result, the most demanding operation remains the computation of the correlation matrix, to be done for each column of the scalogram. Again, this is quadratic in the number of filters but linear in time, and the computations are completely independent from one another, allowing very efficient parallelization or map-reduce schemes.

Result: Denoised Signals
W
corrW=         Add small noise for round-off errors
M         Compute Moore-Penrose pseudo-inverse
corrM=
for  each signal  do
      
       for each window  do
                      Generate indices of the corresponding window
                      Perform noise level estimation (MAD in this case)
             for each column of the window do
                   ,
                   ,
                   mask         Returns or , the index of the
             end for
            
       end for
      mask
end for
Algorithm 1: Non-orthogonal denoising. Practical implementations do not use the internal for loops but a vectorized version of this algorithm for efficiency. is the total number of windows, its size.
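The noise level estimation step of Algorithm 1 can be illustrated as follows. This is a sketch under stated assumptions: the MAD estimator with the Gaussian consistency constant 0.6745 is the standard robust estimate, but the selection rule used below is the classical universal hard threshold (sigma times the square root of 2 log n), swapped in as a simple stand-in for the risk-based selection of the algorithm. The names `mad_sigma` and `window_mask` and the toy window values are hypothetical.

```python
import math
import statistics

def mad_sigma(coeffs):
    """Robust noise-level estimate: median absolute deviation / 0.6745."""
    med = statistics.median(coeffs)
    mad = statistics.median([abs(c - med) for c in coeffs])
    return mad / 0.6745

def window_mask(coeffs):
    """Keep/drop mask for one window of real-valued coefficients.

    Stand-in rule: universal hard threshold sigma * sqrt(2 log n),
    NOT the risk-based comparison performed in Algorithm 1.
    """
    sigma = mad_sigma(coeffs)
    thresh = sigma * math.sqrt(2.0 * math.log(len(coeffs)))
    return [abs(c) > thresh for c in coeffs]

# Toy window: small background values and one large foreground coefficient.
window = [0.1, -0.2, 0.05, 9.0, -0.15, 0.12, -0.08, 0.2]
mask = window_mask(window)
denoised = [c if keep else 0.0 for c, keep in zip(window, mask)]
```

Because the median is insensitive to the single large coefficient, the estimated noise level reflects the background only, and the mask keeps the foreground value while zeroing the rest.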

3 Experimental Validation with Continuous Wavelet Transforms: Bird Activity Detection Task

We propose to validate the two contributions on a large-scale audio dataset. The dataset consists of field recordings from freefield1010 (http://machine-listening.eecs.qmul.ac.uk/bird-audio-detection-challenge/), collected via the Freesound project (https://arxiv.org/abs/1309.5275). This collection represents a wide range of audio scenes such as birdsong, city, nature, people, train, voice, water… The focus in this paper is the bird audio detection task, which can be formally defined as a binary classification task where each label corresponds to the presence or absence of birds. Each signal is sec. long and has been sampled at kHz. The evaluation of the results is performed via the Area Under Curve metric on of the data. The experiments are repeated 50 times. The total audio length of this dataset is thus slightly less than hours of audio recordings. For comparison, it is about larger than CIFAR10 in terms of the number of scalar values in the dataset. The results comparing our algorithm to the DSN are obtained with two different wavelets and are provided in Table 1. For all the experiments, the octave and quality parameters of the layers are . As the features of interest are bird songs, only the high-frequency content requires high resolution; the thresholding is applied per window of representing .
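The evaluation protocol above (AUC over repeated runs, summarized as min, mean and max) can be sketched with the standard library only. The rank-based formula below computes the AUC as the probability that a positive example outscores a negative one; the function names and the toy labels and scores are hypothetical and carry no result from the paper.

```python
import statistics

def auc(labels, scores):
    """Area under the ROC curve: probability that a positive example
    outscores a negative one (ties count for one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(aucs):
    """min / mean / max over repeated runs."""
    return min(aucs), statistics.mean(aucs), max(aucs)

labels = [1, 0, 1, 0, 1]             # hypothetical: 1 = bird present
scores = [0.9, 0.2, 0.6, 0.7, 0.8]   # hypothetical classifier scores
value = auc(labels, scores)          # 5 of the 6 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based implementation would be preferable at dataset scale.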

3.1 Continuous Wavelet Transform

We first denote the collection of scaling factors by,

(9)

The hyper-parameters are and , which represent respectively the number of octaves to decompose and the quality coefficient, a.k.a. the number of wavelets per octave. Based on the collection , the wavelet filter-bank can be derived by scaling the mother wavelet , leading to the filter-bank denoted as

(10)

where we omit the superscript over the parameter to avoid redundancy. Also, we denote this filter-bank as . We define the continuous wavelet transform as,

(11)
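The construction of this subsection (a geometric collection of scales, a filter-bank obtained by dilating a mother wavelet, and a transform computed by convolution) can be sketched in NumPy. Several choices below are assumptions made for illustration and are not taken from the paper: the scales are set to 2^(j/Q) for j = 0, …, JQ − 1, the mother wavelet is a Gaussian bump in the Fourier domain (a Morlet-like filter) with hypothetical center frequency `xi` and width `sigma`, and the values of J and Q are arbitrary.

```python
import numpy as np

def scales(J, Q):
    """Geometric progression of scaling factors 2**(j/Q), j = 0 .. J*Q - 1."""
    return 2.0 ** (np.arange(J * Q) / Q)

def morlet_bank_hat(N, J, Q, xi=0.5 * np.pi, sigma=5.0):
    """Fourier-domain filter-bank with one Gaussian bump per scale.

    xi    : assumed center frequency of the mother wavelet (rad/sample).
    sigma : assumed time-domain width of the mother wavelet (samples).
    """
    omega = 2 * np.pi * np.fft.fftfreq(N)   # angular frequency grid
    lam = scales(J, Q)[:, None]             # shape (J*Q, 1)
    # Dilating by lam in time compresses the spectrum: psi_hat(lam * omega).
    return np.exp(-0.5 * (sigma * (lam * omega - xi)) ** 2)

def cwt(x, bank_hat):
    """Convolve x with every filter via pointwise products in Fourier."""
    return np.fft.ifft(np.fft.fft(x)[None, :] * bank_hat, axis=1)

N, J, Q = 1024, 5, 8                        # hypothetical sizes
bank_hat = morlet_bank_hat(N, J, Q)
t = np.arange(N)
x = np.cos(2 * np.pi * 64 * t / N)          # pure tone at pi/8 rad/sample
scalogram = np.abs(cwt(x, bank_hat))        # shape (J*Q, N)
best = scalogram.mean(axis=1).argmax()      # scale index responding most to the tone
```

With these settings the tone at π/8 matches the filter dilated by a factor of 4, that is, the scale index j = 2Q, which gives a quick check that the dilation and the frequency axis are consistent.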

3.2 Qualitative Analysis

We first provide some examples of denoising over different audio scenes. In Fig. 1, we present the visualization and evaluation of the sparsity within the wavelet domain for several signals. The evaluation of the sparsity is achieved by the following sparsity ratio:

(12)

where denotes the norm. The closer the ratio is to , the sparser the representation is. On the other hand, if the ratio equals zero, none of the coefficients within the representation are null.
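A sparsity ratio with the behaviour described above can be sketched as the fraction of exactly-zero coefficients. This is an assumed form chosen to match the description (zero when no coefficient is null, larger for sparser representations); the exact expression of Equation (12) is not reproduced here, and the name `sparsity_ratio` is hypothetical.

```python
import numpy as np

def sparsity_ratio(coeffs):
    """Fraction of exactly-zero coefficients: 1 - (number of non-zeros) / size."""
    coeffs = np.asarray(coeffs)
    return 1.0 - np.count_nonzero(coeffs) / coeffs.size

dense = np.ones((4, 8))        # no null coefficient
sparse = np.zeros((4, 8))
sparse[0, 0] = 3.0             # a single non-zero coefficient out of 32

ratio_dense = sparsity_ratio(dense)
ratio_sparse = sparsity_ratio(sparse)
```

On a thresholded scalogram the coefficients set to zero by the mask are exact zeros, so no tolerance is needed; on a raw scalogram the ratio would typically be zero.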

Figure 1: Noisy scalograms (panels a, c, e, g, i) vs. sparsity evaluation of the denoised scalograms (panels b, d, f, h, j) for the recordings 118596, 111092, 178878, 15657 and 126154 - the wav name and the sparsity ratio are shown in each representation’s title

Given that the dataset contains recordings made via a mobile application in diverse audio scenes, many signals contain energy within all frequency bands. As such, the wavelet decomposition, even if non-orthogonal, is not sparse in most cases. Our denoising technique has the ability to remove the background noise and preserve only the foreground activity of each signal. In the special case of noise activity, one can see in Fig. 1(d) that our algorithm preserves the bird song’s chirp while removing almost all the other information.

3.3 Quantitative Analysis: Sparse Deep Scattering Network

We now propose a quantitative analysis by solving the supervised classification task using a variant of the standard scattering network that includes our proposed denoising scheme. The Deep Scattering Network, first developed in [34] and first successfully applied in [8, 1], is based on a cascade of linear and non-linear operators on the input signal. The linear transformation is a wavelet transform, and the non-linear transformation is a complex modulus. At each layer, the convolution of the representation with a scaling function leads to the so-called scattering coefficients. This network is stable (Lipschitz-continuous) and suitable for machine learning tasks as it removes spatio-temporal nuisances by building space/time-invariant features. The translation invariance property is provided by the scaling function, which acts as an averaging operator on each layer of the transform, leading to an exponential decay of the scattering coefficients [44]. Since the continuous wavelet transform increases the number of dimensions, the complex modulus is used, as its contractive property reduces the variance of the projected space [35].

The first layer of a DSN corresponds to a standard scalogram; the second layer considers each frequency dimension of this scalogram as its input signal. All the features are computed by applying a low-pass filter on each of these representations. In the SDSN, a non-linear thresholding operator is applied to each representation before computing both the next layer and the scattering coefficients of the current layer.

3.3.1 Sparse Deep Scattering Network

We denote by the application of the filter-bank of the first layer onto a signal , corresponding to the first layer of the scattering network. As previously mentioned, the output of this first layer consists of a tensor of shape , with the length of the input signal. We omit here boundary conditions and sub-sampling, and consider a constant shape of throughout the representations. We thus obtain,

(13)

where the operator corresponds to an element-wise complex modulus. As previously mentioned, we define the convolution operation between those two objects as,

(14)

Then, we denote by the thresholding operator minimizing the empirical risk,

(15)

This non-linearity is applied to the latent representation producing the following sparse representation: . From this sparse representation, the scattering coefficients can be computed leading to

(16)

being a “low-pass” filter. The latter is applied to each row of the representation. This application of a low-frequency band-pass filter allows for invariance to symmetries, inversely proportional to the cut-off frequency of . From this, the second layer is computed. We now denote the second layer representation as being a tensor. It is defined as

(17)

Then, the thresholding operator is applied to each new representation independently. The final sparse representation is thus denoted by . Given this representation, the second layer scattering coefficients are defined as

(18)

We present an illustration of the network computation in Fig. 2.
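The chain described in this subsection (wavelet transform, complex modulus, thresholding, low-pass averaging) can be sketched for one layer as follows. All names, filter shapes and sizes are illustrative assumptions, and `hard_threshold` with a fixed level is only a stand-in for the operator minimizing the empirical risk.

```python
import numpy as np

def sdsn_layer(x, bank_hat, phi_hat, threshold):
    """One sparse scattering layer on a 1-D signal.

    x         : (N,) input signal.
    bank_hat  : (F, N) band-pass filters sampled in the Fourier domain.
    phi_hat   : (N,) low-pass filter sampled in the Fourier domain.
    threshold : callable mapping an (F, N) array to a thresholded (F, N) array.
    Returns (V, S): thresholded representation and scattering coefficients.
    """
    # Modulus of the wavelet transform, computed via products in Fourier.
    U = np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * bank_hat, axis=1))
    V = threshold(U)                                   # sparsifying non-linearity
    # Low-pass filtering of each row gives the scattering coefficients.
    S = np.fft.ifft(np.fft.fft(V, axis=1) * phi_hat[None, :], axis=1).real
    return V, S

def hard_threshold(U, level=0.1):
    """Stand-in for the empirical-risk thresholding operator."""
    return np.where(U > level, U, 0.0)

N = 256
omega = 2 * np.pi * np.fft.fftfreq(N)
centers = np.array([np.pi / 4, np.pi / 2])[:, None]        # two hypothetical band-pass filters
bank_hat = np.exp(-0.5 * (5.0 * (omega - centers)) ** 2)
phi_hat = np.exp(-0.5 * (20.0 * omega) ** 2)               # unit gain at zero frequency
x = np.cos(np.pi / 4 * np.arange(N))                       # tone matching the first filter
V, S = sdsn_layer(x, bank_hat, phi_hat, hard_threshold)
```

A second layer would call `sdsn_layer` again on each row of `V`, with its own filter-bank, so that every frequency channel of the first layer is decomposed and thresholded independently.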

As can be seen in the proposed example, while the first layer provides time-frequency information, the second layer characterizes transients, as demonstrated in [3]. With this extended framework, we now turn to the problem of thresholding redundant frames, i.e., cases where the quality factor, , is greater than , which are needed in practice to provide enough frequency precision. We provide in Fig. 2 an illustration showing the effect of each SDSN layer. Even if the first layer representation has been denoised, the second layer’s latent representation, obtained by decomposing each frequency dimension, still contains inherent noise. This noise is removed, as can be observed in the second thresholded layer representation .

Figure 2: Sparse Deep Scattering Network -Gammatone Wavelet representation - J1= 5, Q1=8, J2=4, Q2=1

As we will demonstrate below, our method, as well as each contribution taken independently and jointly, leads to a significant increase in the final performance. We compare our results against the DSN. In all cases, the scattering coefficients are then fed into a random forest [7] with parameters (n_estimator: , min_samples_split: , class_weights: ’balanced_subsample’) based on the sklearn library [38].

Model                            Wavelet     min     mean    max
Sparse Deep Scattering Network   Gammatone   72.25   74.23   77.18
Sparse Deep Scattering Network   Morlet
Deep Scattering Network          Gammatone
Deep Scattering Network          Morlet
Table 1: Classification Results - Bird Detection - Area Under Curve metric (AUC)

3.3.2 Controlled Denoising Experiment

We based our implementation on [5], leveraging the Fourier-based computation of localized filters in the frequency domain. For all input signals we perform a renormalization to have unit energy. We propose complementary experiments in this section to further demonstrate the need for thresholding even in the context of the stable representations offered by the scattering network. To do so, we simulate an environment in which we combine denoised and noisy samples in both the training and test sets with different mixture proportions. Table 2 reports this experiment, carried out with the Gammatone wavelet only, for both the SDSN and DSN models. We evaluate the performance of the classifier for the different mixture configurations shown in Table 2. The rows correspond to the percentage of noisy data in the training set, and the columns to the percentage of noisy data in the test set. The percentage of denoised data in each set is one minus the percentage of noisy data. Then, in Table 3, we show the difference between the mean accuracy of the SDSN-Gammatone and the mean accuracy of each mixture.
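One way to build such mixtures is sketched below, assuming that each example is available in both a noisy and a denoised version and that the mixture is drawn per example. The function `mix`, its arguments and the toy data are hypothetical; the exact sampling procedure used in the experiment is not specified in the text.

```python
import random

def mix(noisy, denoised, noisy_fraction, seed=0):
    """Pick, for each example, its noisy or its denoised version.

    noisy, denoised : equal-length sequences holding the two versions.
    noisy_fraction  : proportion in [0, 1] of examples taken from `noisy`.
    """
    n = len(noisy)
    n_noisy = round(noisy_fraction * n)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # reproducible choice of the noisy subset
    use_noisy = set(idx[:n_noisy])
    return [noisy[i] if i in use_noisy else denoised[i] for i in range(n)]

noisy = ["n"] * 10                          # placeholders for noisy feature vectors
denoised = ["d"] * 10                       # placeholders for denoised feature vectors
mixed = mix(noisy, denoised, 0.3)           # 30% noisy, 70% denoised
```

Applying `mix` separately to the training and test sets, with independent proportions, yields one cell of the train/test grid described above.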

Train Test
74.2
73.3 .9
73.1 .9
72.7 .9
72.5 .9
72.77 .8
Table 2: DSN evaluation with respect to mixture noise-denoised - AUC with mean and standard deviation over runs.
Train Test
+0.9
+1.13
+1.53
+1.73
+1.5
Table 3: SDSN - Gain in terms of mean values - AUC

The configurations on the diagonal of Tables 2 and 3 correspond to the “ideal” case of identical proportions of noise in the train and test sets. However, in many practical applications more volatile regimes can be encountered. Those correspond either to the upper off-diagonal part, for which the proportion of noisy samples is larger in the test set than in the train set, or to the lower off-diagonal part, which corresponds to the symmetric case. We can see that even in the ideal setting, the introduction of the SDSN brings reasonable robustness to the classifier pipeline. More importantly, for the off-diagonal cases, a significant gain is achieved via the SDSN.

References

  • [1] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification.
  • [2] Radu Balan, Peter G Casazza, Christopher Heil, and Zeph Landau. Density, overcompleteness, and localization of frames. i. theory. Journal of Fourier Analysis and Applications, 12(2):105–143, 2006.
  • [3] Randall Balestriero and Behnaam Aazhang. Robust unsupervised transient detection with invariant representation based on the scattering network. arXiv preprint arXiv:1611.07850, 2016.
  • [4] Randall Balestriero and Hervé Glotin. Scattering decomposition for massive signal classification: from theory to fast algorithm and implementation with validation on international bioacoustic benchmark. In Data Mining Workshop (ICDMW), 2015 IEEE International Conference on, pages 753–761. IEEE, 2015.
  • [5] Randall Balestriero and Herve Glotin. Linear time complexity deep fourier scattering network and extension to nonlinear invariants. arXiv preprint arXiv:1707.05841, 2017.
  • [6] Kathrin Berkner. A correlation-dependent model for denoising via nonorthogonal wavelet transforms. 1998.
  • [7] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [8] Joan Bruna and Stéphane Mallat. Classification with scattering operators. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1561–1566. IEEE, 2011.
  • [9] Diego Carrera, Giacomo Boracchi, Alessandro Foi, and Brendt Wohlberg. Sparse overcomplete denoising: aggregation versus global optimization. IEEE Signal Processing Letters, 24(10):1468–1472, 2017.
  • [10] P.G. Casazza and G. Kutyniok. Finite Frames: Theory and Applications. Applied and Numerical Harmonic Analysis. Birkhäuser Boston, 2012.
  • [11] Satinder Chopra and Kurt J Marfurt. Choice of mother wavelets in cwt spectral decomposition. In SEG Technical Program Expanded Abstracts 2015, pages 2957–2961. Society of Exploration Geophysicists, 2015.
  • [12] Leon Cohen. Time-frequency Analysis: Theory and Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.
  • [13] Rami Cohen. Signal denoising using wavelets. 2012.
  • [14] R. Cosentino, R. Balestriero, and B. Aazhang. Best basis selection using sparsity driven multi-family wavelet transform. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 252–256, Dec 2016.
  • [15] Ethan E Danahy, Sos S Agaian, and Karen A Panetta. Non-linear algorithms for noise removal from medical signals using the logical transform.
  • [16] Carlos E D’Attellis and Elena M Fernandez-Berdaguer. Wavelet theory and harmonic analysis in applied sciences. Springer Science & Business Media, 2012.
  • [17] Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE transactions on information theory, 36(5):961–1005, 1990.
  • [18] Ingrid Daubechies, Alex Grossmann, and Yves Meyer. Painless nonorthogonal expansions. Journal of Mathematical Physics, 27(5):1271–1283, 1986.
  • [19] David L Donoho and Iain M Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association, 90(432):1200–1224, 1995.
  • [20] David L Donoho, Iain M Johnstone, et al. Ideal denoising in an orthonormal basis chosen from a library of bases.
  • [21] David L Donoho, Iain M Johnstone, et al. Minimax estimation via wavelet shrinkage. The annals of Statistics, 26(3):879–921, 1998.
  • [22] David L Donoho, Iain M Johnstone, Gérard Kerkyacharian, and Dominique Picard. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society. Series B (Methodological), pages 301–369, 1995.
  • [23] Costanza D’Avanzo, Vincenza Tarantino, Patrizia Bisiacchi, and Giovanni Sparacino. A wavelet methodology for eeg time-frequency analysis in a time discrimination task.
  • [24] Yonina C Eldar and G David Forney. Optimal tight frames and quantum measurement. IEEE Transactions on Information Theory, 48(3):599–610, 2002.
  • [25] James L Flanagan. Models for approximating basilar membrane displacement. Bell Labs Technical Journal, 39(5):1163–1191, 1960.
  • [26] Gartheeban Ganeshapillai, Jessica F Liu, and John Guttag. Reconstruction of ecg signals in presence of corruption. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 3764–3767. IEEE, 2011.
  • [27] Pierre Goupillaud, Alex Grossmann, and Jean Morlet. Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23(1):85–102, 1984.
  • [28] Vivek K Goyal, Martin Vetterli, and Nguyen T Thao. Quantized overcomplete expansions in R^N: analysis, synthesis, and algorithms. IEEE Transactions on Information Theory, 44(1):16–31, 1998.
  • [29] Alexander Grossmann and Jean Morlet. Decomposition of hardy functions into square integrable wavelets of constant shape. SIAM journal on mathematical analysis, 15(4):723–736, 1984.
  • [30] Michael Lacey and Christoph Thiele. Lp estimates for the bilinear hilbert transform. Proceedings of the National Academy of Sciences of the United States of America, 94(1):33, 1997.
  • [31] Vincent Lostanlen. Opérateurs convolutionnels dans le plan temps-fréquence. PhD thesis, Paris Sciences et Lettres, 2017.
  • [32] Vincent Lostanlen and Joakim Andén. Binaural scene classification with wavelet scattering.
  • [33] Stéphane Mallat. A wavelet tour of signal processing. Academic press, 1999.
  • [34] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
  • [35] Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc. A, 374(2065):20150203, 2016.
  • [36] Slaven Marusic, Guang Deng, and David BH Tay. Image denoising using over-complete wavelet representations. In Signal Processing Conference, 2005 13th European, pages 1–4. IEEE, 2005.
  • [37] Yves Meyer. Uncertainty principle, hilbert bases and algebras of operators. Fundamental Papers in Wavelet Theory, page 216, 2006.
  • [38] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [39] Raghuram Rangarajan, Ramji Venkataramanan, and Siddharth Shah. Image denoising using wavelets. 2002.
  • [40] Thomas Strohmer and Scott Beaver. Optimal ofdm design for time-frequency dispersive channels. IEEE Transactions on Communications, 51(7):1111–1122, 2003.
  • [41] Carl Taswell. The what, how, and why of wavelet shrinkage denoising. Computing in science & engineering, 2(3):12–19, 2000.
  • [42] A. Teolis. Computational Signal Processing with Wavelets. Applied and Numerical Harmonic Analysis. Birkhäuser Boston, 1998.
  • [43] Arun Venkitaraman, Aniruddha Adiga, and Chandra Sekhar Seelamantula. Auditory-motivated gammatone wavelet transform. Signal Processing, 94:608–619, 2014.
  • [44] Irene Waldspurger. Exponential decay of scattering coefficients. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 143–146. IEEE, 2017.
  • [45] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341–349, 2012.
  • [46] Haohua Zhao, Jun Luo, Zhiheng Huang, Takefumi Nagumo, Jun Murayama, and Liqing Zhang. Image denoising based on overcomplete topographic sparse coding. In International Conference on Neural Information Processing, pages 266–273. Springer, 2013.

Appendix A Building a Deep Scattering Network

a.1 Continuous Wavelet Transform

“By oscillating it resembles a wave, but by being localized it is a wavelet.”

Yves Meyer

Wavelets were first introduced for high-resolution seismology [27, 29] and then developed theoretically by Meyer et al. [18]. Formally, a wavelet is a function such that:

(19)

it is normalized such that . Wavelets are often categorized into two distinct groups: discrete and continuous. Discrete wavelets are constructed from a system of linear equations that encode the atom’s properties. These wavelets, when scaled in a dyadic fashion, form an orthonormal basis. Continuous wavelets, on the other hand, have an explicit formulation and build an overcomplete dictionary when successively scaled. In this work, we focus on continuous wavelets, as they provide a more complete tool for signal analysis. To perform a time-frequency transform of a signal, we first build a filter bank based on the mother wavelet. This wavelet is named the mother wavelet because it is dilated and translated to create the filters that constitute the filter bank. Notice that wavelets have a constant-Q property: the ratio of bandwidth to center frequency of the child wavelets is identical to that of the mother. Consequently, the higher the frequency of a wavelet atom, the more localized it is in time. The usual dilation parameters follow a geometric progression and belong to the following set:

where the integers and denote, respectively, the number of octaves and the number of wavelets per octave. Given this geometric progression, the dilated versions of the mother wavelet in the time domain are computed as follows:

and can be calculated in the Fourier domain as follows:

Notice that in practice the wavelets are computed in the Fourier domain, since the wavelet transform is based on a convolution, which can be carried out more efficiently in the frequency domain. By construction, the child wavelets have the same properties as the mother wavelet. As a result, in the Fourier domain:

Thus, to create a filter bank that covers the whole frequency support, one needs a function that captures the low-frequency content. This function is called the scaling function and satisfies the following criteria:
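The filter-bank construction described in this subsection can be sketched numerically. This is a minimal illustration under stated assumptions, not the construction used in the experiments: the scale set is taken to be 2^(j/Q) for j = 0, ..., JQ-1 (a common convention), the dilation acts in the Fourier domain as psi_hat(lam * omega), the mother wavelet is a Gaussian bump, and the names `dilation_scales`, `mother_hat`, `filter_bank_hat` and `center_and_bandwidth` are hypothetical.

```python
import numpy as np

def dilation_scales(J, Q):
    """Geometric progression of scales: J octaves, Q wavelets per octave (assumed convention)."""
    return 2.0 ** (np.arange(J * Q) / Q)

def mother_hat(omega, xi=np.pi / 2, sigma=0.2):
    """Assumed mother wavelet in the Fourier domain: a Gaussian bump centered at xi."""
    return np.exp(-((omega - xi) ** 2) / (2.0 * sigma ** 2))

def filter_bank_hat(omega, J, Q):
    """Stack the dilated filters psi_hat(lam * omega), one row per scale."""
    scales = dilation_scales(J, Q)
    bank = np.stack([mother_hat(lam * omega) for lam in scales])
    return bank, scales

def center_and_bandwidth(filt, omega):
    """Energy-weighted center frequency and standard deviation of one filter."""
    weights = filt ** 2 / np.sum(filt ** 2)
    center = np.sum(omega * weights)
    bandwidth = np.sqrt(np.sum((omega - center) ** 2 * weights))
    return center, bandwidth

omega = np.linspace(0.0, np.pi, 4096)
bank, scales = filter_bank_hat(omega, J=4, Q=8)

# Constant-Q check: bandwidth / center frequency should not depend on the scale.
q_ratios = np.array(
    [bw / c for c, bw in (center_and_bandwidth(f, omega) for f in bank)]
)
```

The array `q_ratios` is numerically constant across the rows of `bank`, which is the constant-Q property stated above: dilating the mother wavelet shrinks its center frequency and its bandwidth by the same factor.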

a.2 Wavelet Families

Among the continuous wavelets, different families exist. Each possesses different properties, such as bandwidth and center frequency. This section presents the families that are important for the analysis of diverse signals.

a.2.1 The Morlet wavelet

The Morlet wavelet (Fig. 3) is built by modulating a complex exponential with a Gaussian window. It is defined in the time domain by,

(20)

where defines the frequency plane. In the frequency domain, we denote it by ,

(21)

Thus, it is clear that defines the center frequency of the mother wavelet. The associated center frequency and standard deviation, denoted respectively by and , are:

Notice that for the admissibility criteria , however one can easily impose the zero-mean condition in the Fourier domain. Usually, this parameter is assigned to control the center frequency of the mother wavelet. We can then vary the parameter to obtain Morlet wavelets with different supports.
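Imposing the zero-mean condition in the Fourier domain can be sketched as follows. The parametrization of the Morlet wavelet below (center frequency `xi`, Gaussian width `sigma`, no normalization constant) is an assumption made for illustration and is not necessarily that of Eqs. (20) and (21); the helper names are hypothetical.

```python
import numpy as np

def morlet(t, xi, sigma):
    """Complex exponential modulated by a Gaussian window (assumed parametrization)."""
    return np.exp(1j * xi * t) * np.exp(-t ** 2 / (2.0 * sigma ** 2))

def impose_zero_mean(psi):
    """Zero the DC Fourier coefficient, which forces the samples to sum to zero."""
    psi_hat = np.fft.fft(psi)
    psi_hat[0] = 0.0
    return np.fft.ifft(psi_hat)

n, dt = 1024, 0.01
t = (np.arange(n) - n // 2) * dt          # time axis centered on zero
xi = 3.0                                  # assumed center frequency, in rad/s
psi = morlet(t, xi, sigma=1.0)            # raw Morlet: small but nonzero mean
psi_zm = impose_zero_mean(psi)            # admissible (zero-mean) version

# Locate the spectral peak of the corrected wavelet, converted to rad/s.
freqs_hz = np.fft.fftfreq(n, d=dt)
peak_rad = 2 * np.pi * freqs_hz[np.argmax(np.abs(np.fft.fft(psi_zm)))]
```

With a low value of `xi * sigma`, as chosen here, the uncorrected wavelet has a visibly nonzero mean; removing the DC coefficient fixes this while leaving the spectral peak close to `xi`.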

Figure 3: On the left a Morlet wavelet in the time domain where the dashed line is the imaginary part, the solid line is the real part, and the black envelope is the complex modulus, on the right a Morlet wavelet in the frequency domain.

The Morlet wavelet is optimal from the point of view of the uncertainty principle [33]. For a given time-frequency atom, the uncertainty principle concerns the area of the rectangle describing its joint time-frequency resolution. For wavelets, since the ratio of bandwidth to center frequency is constant, this area is the same for the mother wavelet and its scaled versions. Because of its time-frequency versatility, this wavelet is widely used for signals such as bio-acoustic [4], seismic [11], and EEG [23] data.
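The invariance of the time-frequency area under dilation can be checked numerically. The sketch below assumes a Gaussian-windowed complex exponential (without the zero-mean correction) and measures the spreads as the standard deviations of |psi|^2 in time and of |psi_hat|^2 in angular frequency; with that convention the Gaussian attains the lower bound 1/2. The function names and parameter values are hypothetical.

```python
import numpy as np

def spreads(psi, t):
    """Standard deviations of |psi|^2 in time and of |psi_hat|^2 in angular frequency."""
    dt = t[1] - t[0]
    p_t = np.abs(psi) ** 2
    p_t /= p_t.sum()
    mu_t = np.sum(t * p_t)
    sigma_t = np.sqrt(np.sum((t - mu_t) ** 2 * p_t))

    omega = 2 * np.pi * np.fft.fftfreq(t.size, d=dt)
    p_w = np.abs(np.fft.fft(psi)) ** 2
    p_w /= p_w.sum()
    mu_w = np.sum(omega * p_w)
    sigma_w = np.sqrt(np.sum((omega - mu_w) ** 2 * p_w))
    return sigma_t, sigma_w

def gabor_atom(t, xi=5.0, sigma=1.0, lam=1.0):
    """Gaussian-windowed complex exponential, dilated by the scale lam."""
    u = t / lam
    return np.exp(1j * xi * u) * np.exp(-u ** 2 / (2.0 * sigma ** 2))

t = np.linspace(-20.0, 20.0, 4096, endpoint=False)
areas = []
for lam in (1.0, 2.0, 4.0):
    s_t, s_w = spreads(gabor_atom(t, lam=lam), t)
    areas.append(s_t * s_w)   # time spread grows by lam, frequency spread shrinks by lam
```

Each entry of `areas` is close to 0.5: dilation trades time resolution for frequency resolution while leaving their product unchanged.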

a.2.2 The Gammatone wavelet

The Gammatone wavelet is a complex-valued wavelet developed by [43] via a transformation of the real-valued Gammatone auditory filter, which provides a good approximation of the basilar membrane filter [25]. Because of its origin and properties, this wavelet has been successfully applied to the classification of acoustic scenes [32]. The Gammatone wavelet (Fig. 4) is defined in the time domain by,

(22)

and in the frequency domain by,

(23)

A detailed study of this wavelet by V. Lostanlen [31] provides an explicit formulation of the parameter such that the wavelet can be scaled while respecting the admissibility criteria:

where is the center frequency and is the bandwidth parameter. Notice that with induces a quasi-orthogonal filter bank. The associated center frequency and standard deviation, denoted respectively by and , are thus:

For this wavelet, thanks to the derivation in [31], we can manually select for each order the center frequency and bandwidth of the mother wavelet, which eases the filter bank design.

Figure 4: On the upper (bottom) left a () Gammatone wavelet in the time domain where the dashed line is the imaginary part and the solid line is the real part, on the upper (bottom) right a () wavelet in the frequency domain.

An important property, directly related to the auditory response system, is the asymmetric envelope: the Gammatone wavelet is not invariant to time reversal, unlike the Morlet wavelet, whose envelope is Gaussian. Thus, for tasks such as sound classification, this wavelet provides an efficient filter that is well suited to capturing the attacks of a sound. Besides this property, which suits specific analyses, this wavelet is near optimal with respect to the uncertainty principle. Notice that, when it yields the Gabor wavelet [12]. Another interesting property of this wavelet is causality: because it takes into account only past and present information, no bias is introduced by future information, which makes it suitable for real-time signal analysis.
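The causality and asymmetry properties can be illustrated with a simplified atom. The form used below, t^(m-1) exp(-sigma t) exp(i xi t) for t >= 0 and zero otherwise, is a plain complex gammatone chosen for illustration; it is an assumption and not necessarily the wavelet of Eq. (22). The parameter values and the name `gammatone_atom` are hypothetical.

```python
import numpy as np

def gammatone_atom(t, xi, sigma, m):
    """Simplified complex gammatone: t^(m-1) exp(-sigma t) exp(i xi t) for t >= 0, else 0."""
    tp = np.clip(t, 0.0, None)                       # avoid negative bases in the power
    envelope = tp ** (m - 1) * np.exp(-sigma * tp)
    return np.where(t >= 0, envelope * np.exp(1j * xi * t), 0.0)

t = np.linspace(-1.0, 1.0, 2001)
gam = gammatone_atom(t, xi=40.0, sigma=10.0, m=4)
gauss_env = np.exp(-t ** 2 / (2.0 * 0.1 ** 2))       # Gaussian (Morlet-like) envelope

is_causal = bool(np.all(gam[t < 0] == 0))            # no response before t = 0
gam_env = np.abs(gam)
is_asymmetric = not np.allclose(gam_env, gam_env[::-1])
gauss_is_symmetric = bool(np.allclose(gauss_env, gauss_env[::-1]))
attack_time = t[np.argmax(gam_env)]                  # envelope peaks at (m - 1) / sigma
```

The envelope rises quickly to its maximum at (m - 1) / sigma and then decays slowly, which is the sharp-attack shape described above, whereas the Gaussian envelope is unchanged by time reversal.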

Appendix B Thresholding

b.1 Thresholding in an orthogonal basis

Assuming that the observed signal , is corrupted with white noise,

(24)

where is a vector of i.i.d. centered normal random variables . We now define the estimate of by such that:

(25)

where denotes the orthogonal basis and is a diagonal binary operator such that,

(26)

where and denote, respectively, the sets of selected and unselected wavelet coefficients. We also define such that . This estimate corresponds to a thresholding operation in the new basis followed by the inverse transform of the truncated representation.
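The analysis, selection and synthesis steps of this estimate can be sketched as follows. The notation is assumed for illustration: `W` is an orthonormal analysis operator whose rows are the atoms, `selected` is the diagonal of the binary operator, and the selection rule (keep coefficients larger than three noise standard deviations) is a hypothetical stand-in for the optimal selection derived below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

# Orthonormal analysis operator W (rows are atoms), obtained from a QR factorization.
W = np.linalg.qr(rng.standard_normal((n, n)))[0].T

def threshold_estimate(y, W, selected):
    """Analyze y, keep only the selected coefficients, and synthesize the estimate."""
    coeffs = W @ y
    return W.T @ (selected * coeffs)

# A signal that is sparse in the basis: five nonzero coefficients.
c = np.zeros(n)
c[:5] = [8.0, -6.0, 5.0, 4.0, -3.0]
x = W.T @ c

sigma = 0.5
y = x + sigma * rng.standard_normal(n)                  # noisy observation

# Hypothetical selection rule: keep coefficients above three noise standard deviations.
selected = (np.abs(W @ y) > 3.0 * sigma).astype(float)
x_hat = threshold_estimate(y, W, selected)

err_noisy = np.sum((y - x) ** 2)
err_denoised = np.sum((x_hat - x) ** 2)
```

Selecting every coefficient returns the observation unchanged, since the basis is orthonormal; discarding the coefficients dominated by noise is what reduces the error from `err_noisy` to `err_denoised`.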

We define the denoising problem as the minimization of the following mean-square error:

(27)
(28)
(29)
(30)
(31)
(32)

Therefore, the optimal and are given by the following values:

(33)
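As a numerical sanity check, the sketch below uses the classical oracle result for diagonal estimators in an orthonormal basis: keep a coefficient when its squared magnitude is at least the noise variance, which gives a risk equal to the sum over coefficients of min(|c_k|^2, sigma^2). That this matches the notation of Eq. (33) is an assumption; the variable names and the test signal are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, trials = 64, 0.5, 20000

# Orthonormal analysis operator (rows are atoms) and a signal with known coefficients.
W = np.linalg.qr(rng.standard_normal((n, n)))[0].T
c = np.concatenate([rng.uniform(1.0, 4.0, 8), rng.uniform(0.0, 0.3, n - 8)])
x = W.T @ c

# Oracle selection: keep coefficient k when |c_k|^2 >= sigma^2.
selected = (c ** 2 >= sigma ** 2).astype(float)
closed_form_risk = np.sum(np.minimum(c ** 2, sigma ** 2))

# Monte Carlo estimate of the mean-square error of the thresholded estimate.
noise = sigma * rng.standard_normal((trials, n))
coeffs = (x + noise) @ W.T                 # each row holds W @ y for one noisy draw
x_hat = (coeffs * selected) @ W            # each row holds W.T @ (selected * coeffs)
empirical_risk = np.mean(np.sum((x_hat - x) ** 2, axis=1))

keep_all_risk = n * sigma ** 2             # risk when no coefficient is discarded
```

The empirical risk agrees with the closed form up to Monte Carlo error, and both are well below `keep_all_risk`: each selected coefficient costs sigma^2 (noise kept), each unselected one costs |c_k|^2 (signal lost).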

b.2 Upper-bound Non-orthogonal Risk & Empirical Risk

(34)
(35)
(36)
(37)
(38)

Developing the previous expression and denoting by the wavelet coefficient vector, we have:

(39)

We first use the triangle inequality,

(40)

Now let,

(41)

and,

(42)

Then, based on the following min-max formulation, we obtain an upper bound on the ideal risk that, when minimized, approximates the ideal risk in the overcomplete case:

(43)
(44)
(45)

Now, let us denote by the error term corresponding to the unselected coefficients:

(46)

and by for the selected ones:

(47)

we have that,

(48)
(49)

b.3 Comparison Upper Bound Ideal Risk with Orthogonal Ideal Risk

Proposition 1.

Proof.

We compare the upper-bound risk, evaluated for an orthogonal dictionary, with the ideal risk derived in the orthogonal case:

If the basis is orthogonal, we have,

(50)

and,

(51)

Therefore, the derived upper bound recovers the ideal risk in the orthogonal case.

b.4 Comparison Upper Bound Ideal Risk with Empirical Risk

Proposition 2.

Proof.

If , the empirical risk is equal to:

and the upper bound risk is:

Thus both coincide as this restriction on the support of the risk makes it independent of both and .

Proposition 3.

Proof.

In the case where ,

(52)
(53)

by the triangle inequality, we have that:

(54)

by the monotonicity of expectation and Fubini's theorem, we have almost surely:

where is equal to,