## 1 Introduction

Hybrid generative and discriminative classifiers are powerful when there is a large database of unlabeled examples and a much smaller set of labeled examples [10]. Building such classifiers requires addressing two outstanding problems: estimating and representing a high-dimensional probability distribution from unlabeled examples, and integrating this representation in a supervised classifier.

Deep neural networks are remarkable implementations of this strategy, which has produced state-of-the-art results in many fields including image, video, music, speech and bio-medical data [3]. Most deep neural networks cascade linear operators followed by “pooling non-linearities” which aggregate multiple variables [9, 5, 6, 7, 11]. The hidden network variables are first estimated by unsupervised learning and are then updated together with the optimization of a supervised classifier from labeled examples. Multiple regularization criteria such as contraction and sparsity have been shown to play an important role in deep networks [11]. Despite the multiplicity of architectures, algorithms and results, there is currently a lack of mathematical models to understand their behavior.

This paper introduces a mathematical and algorithmic framework to analyze the properties of high-dimensional unsupervised and supervised classification problems with deep networks. It is built on a general scattering model of deep networks with pooling, which iterates on contraction operators obtained as the complex modulus of unitary linear operators. It relies on many ideas developed in the deep network literature [3]. Section 2 proves that an expected scattering transform defines a converging deep network whose output defines a representation of the underlying probability distribution. Section 3 shows that such representations are optimized by adaptively contracting the space while preserving the volume where the distribution of unlabeled examples is concentrated. The optimization establishes a relation with sparsity, which explains mathematically why sparse regularizations are efficient for deep network learning [3]. Section 4 explains how scattering models can be estimated from a single realization with averaged scattering transforms, whose properties are analyzed. Given the large body of numerical experiments in the deep network literature, the paper concentrates on mathematical models, algorithms and proofs, which are currently lacking [3, 5, 7, 9].

Notation: The modulus of is written . If then we write .

## 2 Expected Scattering

A scattering transform provides a model for feed-forward deep networks with pooling [6, 7]. It iterates on linear unitary operators followed by a complex modulus. Scattering transforms were initially introduced with wavelet operators to build invariants to translations, which are stable to deformations, with applications to image and audio classification [1, 2]. The following generalization only imposes the use of unitary operators, which we shall optimize from examples. It covers both convolutional and non-convolutional deep networks.

Let be a random vector defined in . We initialize and . An expected scattering computes each network layer by transforming the previous layer with an operator from in such that . We typically have so is represented by a complex valued matrix whose rows are linearly dependent complex vectors. With an abuse of language, we still say that these operators are unitary. The pooling is implemented with a complex modulus along each coordinate:

(1) |

Let be the set of all propagated layers. An expected scattering transform outputs

(2) |

This provides a representation of the probability distribution of .

The operators encode the weights of the feed-forward network. Each unitary operator can be written with , where is a tight frame of . It groups pairs of random variables and , whose variabilities are reduced by the contractive aggregation of the complex modulus . Ideally, groups pairs of non-correlated random variables having the same variance, so that it reduces the process variability without suppressing correlation information. A scattering can also pool and compute the norm of variables instead of just , by cascading each time more scattering contractions. This pooling is defined by operators which aggregate variables by pairs with , where if and otherwise, to progressively build each pool of variables.

The operator typically performs a rotation of the space to optimize the contraction, which is related to sparsity, as explained in the next section. Redundancy, and hence an increase of the space dimension, is important to improve sparsity. Redundancy also prevents losing information when calculating the modulus. With a redundancy factor , one can indeed build operators from to such that any has a stable recovery from [4]. The following theorem proves that for any set of unitary operators, the expected scattering transform is contractive and preserves energy. The proof is in Appendix A.
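As a small numerical illustration of these operators, the sketch below builds a redundant operator with orthonormal columns, which is the sense in which the text calls a rectangular operator unitary, and checks that the coordinate-wise modulus of its output is contractive and norm preserving. The QR-based construction, the dimensions, the redundancy factor 2, and the helper name `random_isometry` are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def random_isometry(n_out, n_in, rng):
    """Return a complex (n_out, n_in) matrix W with W^H W = Id (requires n_out >= n_in)."""
    a = rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in))
    q, _ = np.linalg.qr(a)  # reduced QR: q has orthonormal columns
    return q

rng = np.random.default_rng(0)
n_in, n_out = 8, 16  # redundancy factor 2, an illustrative choice
W = random_isometry(n_out, n_in, rng)

x = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)
y = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)

# W^H W should be the identity even though W is rectangular
isometry_error = np.linalg.norm(W.conj().T @ W - np.eye(n_in))

# the modulus of W is contractive: || |Wx| - |Wy| || <= || x - y ||
pooled_distance = np.linalg.norm(np.abs(W @ x) - np.abs(W @ y))
input_distance = np.linalg.norm(x - y)

# the modulus preserves the norm: || |Wx| || = || x ||
norm_gap = abs(np.linalg.norm(np.abs(W @ x)) - np.linalg.norm(x))
```

The contraction follows from the reverse triangle inequality applied coordinate by coordinate, so it holds for any operator built this way, not only for this random draw.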

###### Theorem 2.1.

The scattering operator is contractive

(3) |

and preserves the mean-square norm

(4) |

If is not bounded with probability then

(5) |

The proof of the theorem shows that the energy of network layers converges to as increases, despite the fact that the dimension of these layers may increase to . The convergence is exponential at first and then enters a slowly decaying asymptotic regime, which explains why is not summable. Expected scattering transforms computed with wavelet transform operators define deep convolution networks [9], which are highly effective for a number of image and audio classification problems [1, 2]. The operator decomposes into complex wavelet signals of size , so . The operator transforms the modulus of each of these wavelet signals into yet again wavelet signals of size , so . The wavelet layer is thus of size which grows exponentially to when increases. For processes which are not bounded, (5) proves that the decay of scattering coefficients is asymptotically very slow. During the first iterations decays exponentially, but it then slows down. In this slow regime, scattering coefficients characterize the tail of the probability distribution.
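The energy bookkeeping behind Theorem 2.1 can be checked numerically on samples. The sketch below assumes that each layer is obtained by centering the previous layer, applying a unitary operator and taking the coordinate-wise modulus, and it replaces expected values by sample averages. The Gaussian toy data, the fixed dimension, the random unitary operators and all function names are hypothetical choices for illustration, not the setting of the paper.

```python
import numpy as np

def random_unitary(n, rng):
    """Random complex unitary matrix obtained from a QR factorization."""
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    q, _ = np.linalg.qr(a)
    return q

def scattering_layers(samples, operators):
    """Propagate samples of shape (K, N); return the layer means and the last layer."""
    layer = samples.astype(complex)
    means = []
    for W in operators:
        mu = layer.mean(axis=0)             # sample estimate of the expected value
        means.append(mu)
        layer = np.abs((layer - mu) @ W.T)  # unitary operator, then modulus pooling
    return means, layer

rng = np.random.default_rng(0)
K, N, depth = 500, 6, 5
samples = rng.standard_normal((K, N)) + 1.0   # toy data with a non-zero mean
operators = [random_unitary(N, rng) for _ in range(depth)]
means, last = scattering_layers(samples, operators)

input_energy = np.mean(np.sum(np.abs(samples) ** 2, axis=1))
captured = sum(np.sum(np.abs(mu) ** 2) for mu in means)    # energy of the output means
residual = np.mean(np.sum(np.abs(last) ** 2, axis=1))      # energy left in the last layer
```

With sample averages the decomposition `input_energy = captured + residual` holds exactly up to rounding, because each layer passes on the energy of the centered previous layer. Energy preservation in the theorem corresponds to `residual` tending to zero as the depth grows.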

An expected scattering transform specifies a unique probability distribution of maximum entropy. Let us write . Given an expected scattering transform , the Boltzmann theorem proves that the probability density of maximum entropy which satisfies

(6) |

can be written

(7) |

where the are the Lagrange multipliers associated to the constraints (6), and is the normalization partition function. Complex audio textures are efficiently synthesized with such models [2, 1], by using wavelet operators .

## 3 Unsupervised Learning by Optimizing Contractions

A scattering transform progressively squeezes the space. Heuristics are most often used to regularize unsupervised optimizations [3]. Because all operators are unitary, we show that optimizing this contraction amounts to minimizing the decay of scattering coefficients and leads to sparse representations. Estimators of expected scattering coefficients are given with error bounds.

Unsupervised learning considers a mixture of unknown classes . We want to optimize each to then be able to discriminate each mixture component (class) at the supervised classification stage. To avoid confusing the scattering representations of different classes, we would like to find operators which maximize the average scattering distance between classes:

(8) |

where is the probability of in the mixture . However, unsupervised learning cannot maximize this average distance since we do not know the class labels .

Following the greedy layerwise unsupervised learning strategy introduced by Hinton [5], we build the scattering transform layers one after the other, for increasing depth. We thus suppose that all operators are defined for , before optimizing . Since the average distance (8) of mixture components cannot be computed, it is replaced by a maximization of the mixture variance . Since is unitary, and , we derive that

Given , maximizing the variance of is equivalent to finding a unitary operator which minimizes

(9) |

Minimizing the norm of this expected value enforces the sparsity of coefficients across realizations. It creates a deep network which filters realizations of so that their energy propagates across the network, as opposed to other signals which are not sparsified by the operators and will thus be attenuated much faster.
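The variance argument above can be checked on samples. The sketch assumes that the next layer is the modulus of a unitary operator applied to the centered current layer, and that the criterion being minimized is the squared norm of the mean of that next layer. Both are assumptions about formulas that are not spelled out here, and the sparse toy data and function names are hypothetical.

```python
import numpy as np

def next_layer(samples, W):
    """Center the samples, apply W to each row, and take the modulus."""
    centered = samples - samples.mean(axis=0)
    return np.abs(centered @ W.T)

def contraction_cost(samples, W):
    """Squared norm of the sample mean of the next layer (assumed form of the criterion)."""
    return float(np.sum(next_layer(samples, W).mean(axis=0) ** 2))

def total_variance(samples):
    """Sample average of the squared distance to the sample mean."""
    centered = samples - samples.mean(axis=0)
    return float(np.mean(np.sum(np.abs(centered) ** 2, axis=1)))

rng = np.random.default_rng(1)
K, N = 400, 8
# hypothetical sparse data: each sample has a single active coordinate
samples = np.zeros((K, N))
samples[np.arange(K), rng.integers(0, N, size=K)] = rng.standard_normal(K)

W_identity = np.eye(N, dtype=complex)
a = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
W_random, _ = np.linalg.qr(a)  # a random unitary operator that spreads the energy

results = {}
for name, W in (("identity", W_identity), ("random", W_random)):
    cost = contraction_cost(samples, W)
    variance_out = total_variance(next_layer(samples, W))
    variance_in_minus_cost = total_variance(samples) - cost
    results[name] = (cost, variance_out, variance_in_minus_cost)
```

For any unitary operator the two last entries of each tuple agree, so lowering the cost raises the variance of the next layer by the same amount. On this toy data the identity, which keeps the coefficients sparse, gives a lower cost than the random operator.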

Expected scattering coefficients are estimated from independent examples . A scattering transform of each is calculated by initializing . For each , given we compute an estimator of with an empirical average

(10) |

The scattering iteration replaces by in (1) which defines

(11) |

Section 4 generalizes this estimated scattering by introducing an averaged scattering operator. Appendix D proves the following upper bound of the mean-square estimation error.

###### Theorem 3.1.

(12) |

Numerically, the first upper bound is typically of the order of . The estimation error is therefore small relative to if .

To optimize the operator we estimate (9) with a summation across examples:

(13) |

Minimization of such a convex functional under the unitary condition is a Procrustes type optimization, which admits a convex relaxation formulation as a semidefinite positive (SDP) optimization [8]. Nearly optimal solutions can thus be computed, although solving these SDP problems is numerically expensive. Stochastic gradient descent algorithms are typically used in applications [6].

Through this minimization, the complex modulus in (13) tends to define which groups non-correlated random variables and with the same variance. The summation over defines an norm across different realizations. Minimizing this norm enforces the sparsity of this sequence. It produces few large coefficients and many small ones. The sparsity across realizations also implies a sparsity across for most realizations because the overall family of coefficients is sparse.
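One ingredient that iterative schemes of this kind may rely on can be illustrated concretely: after an unconstrained update, an operator can be mapped back to the set of operators with orthonormal columns by a polar (Procrustes) projection computed from an SVD. This is a generic sketch of that projection step, not the optimization procedure of the paper; the function name, dimensions and perturbation size are hypothetical.

```python
import numpy as np

def project_to_isometry(A):
    """Closest matrix to A in Frobenius norm with orthonormal columns (polar factor of A)."""
    U, _, Vh = np.linalg.svd(A, full_matrices=False)
    return U @ Vh

rng = np.random.default_rng(2)
n_out, n_in = 12, 6
A = rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in))
W = project_to_isometry(A)

# emulate an unconstrained gradient-like update, then restore the constraint
perturbation = 0.1 * rng.standard_normal((n_out, n_in))
W_updated = project_to_isometry(W + perturbation)

constraint_error = np.linalg.norm(W_updated.conj().T @ W_updated - np.eye(n_in))
distance_to_projection = np.linalg.norm(A - W)
```

The projection only enforces the unitary constraint. Choosing the update direction, for example from a stochastic gradient of the criterion, is a separate step that is not shown here.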

## 4 Averaged Scattering

We now explain how to compute hybrid generative and discriminative classifiers from scattering transforms, which integrate unsupervised learning with a supervised classification. It gives a mathematical model to explain the refined supervised training of deep neural networks, from an initial unsupervised training. Section 3 explains how to use unlabeled examples to optimize the unitary operators , in order to preserve the discriminability property of the expected scattering transform. Given a few labeled examples for each class , (11) computes an estimation of whose risk is bounded by (12). To classify a signal , which is the realization of any unknown class , we introduce an averaged scattering transform, which provides an estimator of .

An averaged scattering transform of a vector is initialized with . Each expected value is estimated by a block averaging applied to the network layer . It averages over blocks of size , which define a partition of :

The next layer of scattering coefficients is computed by applying the unitary operator :

(14) |

The averaged scattering transform outputs the block averages of all layers :

(15) |

###### Theorem 4.1.

The averaged scattering operator is contractive

(16) |

If averages over blocks of size at most then and

(17) |
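The averaged scattering iteration can be sketched on a single vector. The code assumes that each layer is obtained by subtracting the block average of the previous layer, applying a unitary operator and taking the modulus, and that the block averaging uses consecutive blocks of equal size. The block layout, the dimensions, the random unitary operators and the function names are hypothetical choices for illustration.

```python
import numpy as np

def block_average(x, block):
    """Replace each consecutive block of `block` coordinates by its mean."""
    return np.repeat(x.reshape(-1, block).mean(axis=1), block)

def averaged_scattering(x, operators, block):
    """Return the concatenated block averages of all layers and the last layer."""
    layer = np.asarray(x, dtype=complex)
    outputs = []
    for W in operators:
        avg = block_average(layer, block)  # block average standing in for an expected value
        outputs.append(avg)
        layer = np.abs(W @ (layer - avg))  # unitary operator, then modulus pooling
    return np.concatenate(outputs), layer

def random_unitary(n, rng):
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    q, _ = np.linalg.qr(a)
    return q

rng = np.random.default_rng(3)
N, block, depth = 12, 4, 4
operators = [random_unitary(N, rng) for _ in range(depth)]
x = rng.standard_normal(N)
y = rng.standard_normal(N)

Sx, last_x = averaged_scattering(x, operators, block)
Sy, last_y = averaged_scattering(y, operators, block)

output_distance = np.linalg.norm(Sx - Sy)
input_distance = np.linalg.norm(x - y)
energy_gap = abs(np.sum(x ** 2) - (np.sum(np.abs(Sx) ** 2) + np.sum(last_x ** 2)))
```

Because block averaging over a partition is an orthogonal projection, the output distance never exceeds the input distance, and the input energy splits exactly between the block averages and the last layer.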

The unitary operators are optimized at the unsupervised stage, by maximizing a variance criterion which tends to increase the average distance between the unknown scattering vectors of each class. At the supervised stage each is estimated with an error bounded by Theorem 3.1. A generative classifier needs to optimize the block averages so that gives an accurate estimator of . As a result, the class of a signal can simply be estimated by

The error rate of such a classifier depends upon the estimation error of by when is a realization of . The following proposition computes an upper bound on this error.

###### Proposition 4.2.

(18) |

The estimation error is small if the averaging bias and the variance are small for all and . It means that should average the largest possible groups of coefficients where has a small variation for all . Prior information is usually available to constrain the . In sounds or images for example, the averaging is partly done in time or space (but not only), over intervals that must be adjusted according to the unknown local stationarity property of the . This is the case for audio and image texture discrimination with wavelet scattering networks [2, 1].
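The generative classification rule discussed before Proposition 4.2 can be sketched as a nearest-representation search. The code assumes that a signal is assigned to the class whose estimated scattering vector is closest in Euclidean norm to the averaged scattering of the signal. This form of the rule is an assumption, and the vectors below are hypothetical placeholders rather than actual scattering coefficients.

```python
import numpy as np

def nearest_class(s_x, class_vectors):
    """Index of the class whose representation vector is closest to s_x."""
    distances = [np.linalg.norm(s_x - s_k) for s_k in class_vectors]
    return int(np.argmin(distances))

# hypothetical estimated representations of two classes
class_vectors = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]

# hypothetical averaged scattering of a signal to classify
s_x = np.array([0.9, 0.2, 0.4])
predicted = nearest_class(s_x, class_vectors)
```

The error of such a rule is governed by how far the averaged scattering of a signal can lie from the representation of its own class, which is the quantity bounded in the proposition.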

Discriminative classifiers typically outperform generative classifiers, and can be computed directly from as opposed to . Let us consider a binary linear classifier such as an SVM, which applies classification thresholds to for an optimized vector . Since is a linear projector, with . The linear classification can thus be directly applied on . Results may be improved by using only if is a way to incorporate prior information such as local stationarity properties.

Since is computed in (14) by iterating on , the supervised classifier still needs to optimize the choice of the . It amounts to replacing the calculated by unsupervised learning by a new operator which is optimized to minimize the classification error. Deep neural networks perform such an update of the network parameters, with greedy layerwise supervised optimizations of the neuron weights [3]. This last step depends upon the type of discriminative classifier which is used.

## 5 Conclusion

A scattering transform provides a flexible model for general deep networks with pooling. Imposing that linear operators are unitary preserves information and stability, and defines a network whose properties can be analyzed mathematically. It provides new models for high-dimensional probability distributions, with precise bounds on estimation errors from samples. Network parameters are optimized from unlabeled examples by adjusting the space contraction, which admits an SDP convex relaxation. Supervised classifiers are computed with an averaged scattering, which is initialized by the unsupervised estimation and refined from labeled examples.

## Appendix A Proof of Theorem 2.1

Observe that is a contractive operator because is unitary. So

It results that

Letting go to proves (3). If we set then (A) is an equality so we get

(20) |

To prove (4), we must show that tends to when goes to . For all ,

(21) |

From (20), when . From the following lemma we shall derive that when .

###### Lemma A.1.

.

Inserting the lemma result in (21) proves that

It results that for all

Since , the above limit tends to when so .

Let us now prove Lemma A.1. Let be a positive number. Let be a partition of in measurable non-empty sets such that, for all , the diameter of is less than . For all , we fix . If have positive coordinates then

We write . Since and , each is Lipschitz with constant . It results that and hence

because has positive coordinates. It results from (20) that so

Letting go to proves that .