1 Introduction
The geometric structure of a data domain can be described with a graph [11]
, where neighbor data points are represented by vertices related by an edge. For sensor networks, this connectivity depends upon the sensor physical locations, but in social networks it may correspond to strong interactions or similarities between two nodes. In many applications, the connectivity graph is unknown and must therefore be estimated from data. We introduce an unsupervised learning algorithm to classify signals defined on an unknown graph.
An important source of variability on graphs results from displacement of signal values. It may be due to movements of physical sources in a sensor network, or to propagation phenomena in social networks. Classification problems are often invariant to such displacements. Image pattern recognition or characterization of communities in social networks are examples of invariant problems. They require to compute locally or globally invariant descriptors, which are sufficiently rich to discriminate complex signal classes.
Section 2 introduces a Haar scattering transform which builds an invariant representation of graph data, by cascading additions, subtractions and absolute values in a deep network. It can be factorized as a product of Haar wavelet transforms on the graph. Haar wavelet transforms are flexible representations which characterize multiscale signal patterns on graphs [6, 10, 11]. Haar scattering transforms are extensions on graphs of wavelet scattering transforms, previously introduced for uniformly sampled signals [1].
For unstructured signals defined on an unknown graph, recovering the full graph geometry is an NP complete problem. We avoid this complexity by only learning connected multiresolution graph approximations. This is sufficient to compute Haar scattering representations. Multiscale neighborhoods are calculated by minimizing an average total signal variation over training examples. It involves a pair matching algorithm of polynomial complexity. We show that this unsupervised learning algorithms computes sparse scattering representations.
For classification, the dimension of unsupervised Haar scattering representations are reduced with supervised partial least square regressions [12]. It amounts to computing a last layer of reduced dimensionality, before applying a Gaussian kernel SVM classifier. The performance of a Haar scattering classification is tested on scrambled images, whose graph geometry is unknown. Results are provided for MNIST and CIFAR10 image data bases. Classification experiments are also performed on scrambled signals whose samples are on an irregular grid of a sphere. All computations can be reproduced with a software available at www.di.ens.fr/data/scattering/haar.
2 Orthogonal Haar Scattering on a Graph
2.1 Deep Networks of Permutation Invariant Operators
We consider signals defined on an unweighted graph , with . Edges relate neighbor vertices. We suppose that is a power of to simplify explanations. A Haar scattering is calculated by iteratively applying the following permutation invariant operator
(1) 
Its values are not modified by a permutation of and , and both values are recovered by
(2) 
An orthogonal Haar scattering transform computes progressively more invariant signal descriptors by applying this invariant operator at multiple scales. This is implemented along a deep network illustrated in Figure 1. The network layer is a twodimensional array of coefficients, where is a node index and is a feature type.
The input network layer is . We compute by regrouping the nodes of in pairs , and applying the permutation invariant operator (1) to each pair :
(3) 
and
(4) 
This transform is iterated up to a maximum depth . It computes with additions, subtractions and absolute values. Since for , one can put an absolute value on the sum in (3) without changing . It results that is calculated from the previous layer
by applying a linear operator followed by a nonlinearity as in most deep neural network architectures. In our case this nonlinearity is an absolute value as opposed to rectifiers used in most deep networks
[4].For each , the scattering coefficients are calculated from the values of in a vertex set of size . One can verify by induction on (3) and (4) that for , and for any
(5) 
The embedded subsets form a multiresolution approximation of the vertex set . At each scale , different pairings define different multiresolution approximations. A small graph displacement propagates signal values from a node to its neighbors. To build nearly invariant representations over such displacements, a Haar scattering transform must regroup connected vertices. It is thus computed over multiresolution vertex sets which are connected in the graph . It results from (5) that a necessary and sufficient condition is that each pair regroups two connected sets and .
Figure 2 shows two examples of connected multiresolution approximations. Figure 2(a) illustrates the graph of an image grid, where pixels are connected to neighbors. In this example, each regroups two subsets and which are connected horizontally if is even and connected vertically if
is odd. Figure
2(b) illustrates a second example of connected multiresolution approximation on an irregular graph. There are many different connected multiresolution approximations resulting from different pairings at each scale . Different multiresolution approximations correspond to different Haar scattering transforms. In the following, we compute several Haar scattering transforms of a signal , by defining different multiresolution approximations.The following theorem proves that a Haar scattering preserves the norm and that it is contractive up to a normalization factor . The contraction is due to the absolute value which suppresses the sign and hence reduces the amplitude of differences. The proof is in Appendix A.
Theorem 2.1.
For any , and any defined on
and
2.2 Iterated Haar Wavelet Transforms
We show that a Haar scattering transform can be written as a cascade of orthogonal Haar wavelet transforms and absolute value nonlinearities. It is a particular example of scattering transforms introduced in [1]. It computes coefficients measuring signal variations at multiple scales and multiple orders. We prove that the signal can be recovered from Haar scattering coefficients computed over enough multiresolution approximations.
A scattering operator is contractive because of the absolute value. When coefficients have an arbitrary sign, suppressing the sign reduces by a factor the volume of the signal space. We say that is a coefficient of order if its computation includes absolute values of differences. The amplitude of scattering coefficients typically decreases exponentially when the scattering order increases, because of the contraction produced by the absolute value. We verify from (3) and (4) that is a coefficient of order if and of order if
It results that there are coefficients of order .
We now show that Haar scattering coefficients of order are obtained by cascading orthogonal Haar wavelet tranforms defined on the graph . A Haar wavelet at a scale is defined over each by
For any , one can verify [10, 6] that
is a nonnormalized orthogonal Haar basis of the space of signals defined on . Let us denote . Order scattering coefficients sum the values of in each
Order scattering coefficients are sums of absolute values of orthogonal Haar wavelet coefficients. They measure the variation amplitude at each scale , in each :
Appendix B proves that second order scattering coefficients are computed by applying a second orthogonal Haar wavelet transform to first order scattering coefficients. A coefficient is an averaged second order increment over , calculated from the variations at the scale , of the increments of at the scale . More generally, Appendix B also proves that order coefficients measure multiscale variations of at the order , and are obtained by applying a Haar wavelet transform on scattering coefficients of order .
A single Haar scattering transform loses information since it applies a cascade of permutation invariant operators. However, the following theorem proves that can be recovered from scattering transforms computed over different multiresolution approximations.
Theorem 2.2.
There exist multiresolution approximations such that almost all can be reconstructed from their scattering coefficients on these multiresolution approximations.
This theorem is proved in Appendix C. The key idea is that Haar scattering transforms are computed with permutation invariants operators. Inverting these operators allows to recover values of signal pairs but not their locations. However, recombining these values on enough overlapping sets allows one to recover their locations and hence the original signal . This is done with multiresolutions which are interlaced at each scale , in the sense that if a multiresolution is pairing and then another multiresolution approximation is pairing . Connectivity conditions are needed on the graph to guarantee the existence of “interlaced” multiresolution approximations which are all connected.
3 Learning
3.1 Sparse Unsupervised Learning of Multiscale Connectivity
Haar scattering transforms compute multiscale signal variations of multiple orders, over nonoverlapping sets of size . To build signal descriptors which are nearly invariant to signal displacements on the graph, we want to compute scattering transforms over connected sets in the graph, which a priori requires to know the graph connectivity. However, in many applications, the graph connectivity is unknown. For piecewise regular signals, the graph connectivity implies some form of correlation between neighbor signal values, and may thus be estimated from a training set of unlabeled examples [7].
Instead of estimating the full graph geometry, which is an NP complete problem, we estimate multiresolution approximations which are connected. This is a hierarchical clustering problem
[19]. A multiresolution approximation is connected if at each scale , each pair regroups two vertex sets which are connected. This connection is estimated by minimizing the total variation within each set , which are clusters of size [19]. It is done with a fine to coarse aggregation strategy. Given , we compute at the next scale, by finding an optimal pairingwhich minimizes the total variation of scattering vectors, averaged over the training set
:(6) 
This is a weighted matching problem which can be solved by the Blossom Algorithm of Edmonds [8] with operations. We use the implementation in [9]. Iterating on this algorithm for thus computes a multiresolution approximation at the scale , with a hierarchical aggregation of graph vertices.
Observe that
Given , it results that the minimization of (6) is equivalent to the minimization of
. This can be interpreted as finding a multiresolution approximation which yields an optimally sparse scattering transform. It operates with a greedy layerwise strategy across the network layers, similarly to sparse autoencoders for unsupervised deep learning
[4].As explained in the previous section, several Haar scattering transforms are needed to obtain a complete signal representation. The unsupervised learning computes multiresolution approximations by dividing the training set in nonoverlapping subsets, and learning a different multiresolution approximation from each training subset.
3.2 Supervised Feature Selection and Classification
The unsupervised learning computes a vector of scattering coefficients which is typically much larger than the dimension of
. However, only a subset of these invariants are needed for any particular classification task. The classification is improved by a supervised dimension reduction which selects a subset of scattering coefficients. In this paper, the feature selection is implemented with a partial least square regression
[12, 13, 14]. The final supervised classifier is a Gaussian kernel SVM.Let us denote by the set of all scattering coefficients at a scale , computed from multiresolution approximations. We perform a feature selection adapted to each class , with a partial least square regression of the oneversusall indicator function
A partial least square greedily selects and orthogonalizes each feature, one at a time. At the iteration, it selects a , and a GramSchmidt orthogonalization yields a normalized , which is uncorrelated relatively to all previously selected features:
The feature
is selected so that the linear regression of
on has a minimum meansquare error, computed on the training set. This is equivalent to finding so that is maximum.The partial least square regression thus selects and computes decorrelated scattering features for each class . For a total of classes, the union of all these feature sets defines a dictionary of size . They are linear combinations of the original Haar scattering coefficients . This dimension reduction can thus be interpreted as a last fully connected network layer, which outputs a vector of size . The parameter
allows one to optimize the bias versus variance tradeoff. It can be adjusted from the decay of the regression error of each
[12]. In our numerical experiments, it is set to a fixed size for all data bases.4 Numerical Experiments
Unsupervised Haar scattering representations are tested on classification problems, over scrambled images and scrambled data on a sphere, for which the geometry is therefore unknown. Classification results are compared with a Haar scattering algorithm computed over the known signal geometry, and with state of the art algorithms.
A Haar scattering representation involves few parameters which are reviewed. The scattering scale is the invariance scale. Scattering coefficients are computed up to the a maximum order , which is set to in all experiments. Indeed, higher order scattering coefficient have a negligible relative energy, which is below . The unsupervised learning algorithm computes multiresolution approximations, corresponding to different scattering transforms. Increasing decreases the classification error but it increases computations. The error decay becomes negligible for . The supervised dimension reduction selects a final set of orthogonalized scattering coefficients. We set in all numerical experiments.
For signals defined on an unknown graph, the unsupervised learning computes an estimation of connected multiresolution sets by minimizing an average total variation. For each data basis of scrambled signals, the precision of this estimation is evaluated by computing the percentage of multiscale sets which are indeed connected in the original topology (an image grid or a grid on the sphere).
4.1 MNIST Digit Recognition
MNIST is a data basis with handwritten digit images of size , with images for training and for testing. Examples of MNIST images before and after pixel scrambling are shown in Figure 3. The best classification results are obtained with a maximum invariance scale . The classification error is , with an unsupervised learning of multiresolution approximations. Table 1
shows that it is below but close to state of the art results obtained with fully supervised deep convolution, which are optimized with supervised backpropagation algorithms.
The unsupervised learning computes multiresolution sets from scrambled images. At scales , of these multiresolution sets are connected in the original image grid, which proves that the geometry is well estimated at these scales. This is only evaluated on meaningful pixels which do not remain zero on all training images. For and the percentages of connected sets are respectively and . The percentage of connected sets decreases because long range correlations are weaker.
One can reduce the Haar scattering classification error from to with a known image geometry. The Haar scattering transform is then computed over multiresolution approximations which are directly constructed from the image grid as in Figure 2(a). Rotations and translations define different connected multiresolution approximations, which yield a reduced error of . State of the art classification errors on MNIST, for nonaugmented data basis (without elastic deformations), are respectively with a Gabor scattering [2] and with a supervised training of deep convolution networks [5]. This shows that without any learning, a Haar scattering using geometry is close to the state of the art.
4.2 CIFAR10 Images
CIFAR10 images are color images of pixels, which are much more complex than MNIST digit images. It includes classes, such as “dogs”, “cars”, “ships” with a total of training examples and testing examples. The color bands are represented with channels and scattering coefficients are computed independently in each channel.
The Haar scattering is first applied to scrambled CIFAR images whose geometry is unknown. The minimum classification error is obtained at the scale which is below the maximum scale . It maintains some localization information on the image features. With multiresolution approximations, a Haar scattering transform has an error of . It is below previous results obtained on this data basis, given in Table 2.
Nearly of the multiresolution sets computed from scrambled images are connected in the original image grid, for , which shows that the multiscale geometry is well estimated at these fine scales. For and , the proportions of connected sets are , and respectively. As for MNIST images, the connectivity is not as precisely estimated at large scales.
Fastfood [18]  Random Kitchen Sinks [18]  Haar Scattering 
36.9  37.6  27.3 
The Haar scattering classification error is reduced from to if the image geometry is known. Same as for MNIST, we compute multiresolution approximations obtained by translating and rotating. After dimension reduction, the classification error is . This error is above the state of the art obtained by a supervised convolutional network [15] (), but the Haar scattering representation involves no learning.
4.3 Signals on a Sphere
A data basis of irregularly sampled signals on a sphere is constructed in [3], by projecting the MNIST image digits on
points randomly sampled on the 3D sphere, and by randomly rotating these images on the sphere. The random rotation is either uniformly distributed on the sphere or restricted with a smaller variance (small rotations)
[3]. The digit ‘9’ is removed from the data set because it can not be distinguished from a ‘6’ after rotation. Examples of the dataset are shown in Figure 4.The classification algorithms introduced in [3] use the known distribution of points on the sphere, by computing a representation based on the graph Laplacian. Table 3 gives the results reported in [3], with a fully connected neural network, and a spectral graph Laplacian network.
As opposed to these algorithms, the Haar scattering algorithm uses no information on the positions of points on the sphere. Computations are performed from a scrambled set of signal values, without any geometric information. Scattering transforms are calculated up to the maximum scale . A total of multiresolution approximations are estimated by unsupervised learning, and the classification is performed from selected coefficients. Despite the fact that the geometry is unknown, the Haar scattering reduces the error rate both for small and large 3D random rotations.
In order to evaluate the precision of our geometry estimation, we use the neighborhood information based on the 3D coordinates of the 4096 points on the sphere of radius 1. We say that two points are connected if their geodesic distance is smaller than 0.1. Each point on the sphere has on average connected points. For small rotations, the percentage of learned multiresolution sets which are connected is 92%, 92%, 88% and 83% for going from to . It is computed on meaningful points with nonneglegible energy. For large rotations, it is 97%, 96%, 95% and 95%. This shows that the multiscale geometry on the sphere is well estimated.
Nearest  Fully  Spectral  Haar  
Neighbors  Connect.  Net.[3]  Scattering  
Small rotations  19  5.6  6  2.2 
Large rotations  80  52  50  47.7 
5 Conclusion
A Haar scattering transform computes invariant data representations by iterating over a hierarchy of permutation invariant operators, calculated with additions, subtractions and absolute values. The geometry of unstructured signals is estimated with an unsupervised learning algorithm, which minimizes the average total signal variation over multiscale neighborhoods. This shows that unsupervised deep learning can be implemented with a polynomial complexity algorithm. The supervised classification includes a feature selection implemented with a partial least square regression. State of the art results have been shown on scrambled images as well as random signals sampled on a sphere. The two important parameters of this architecture are the network depth, which corresponds to the invariance scale, and the dimension reduction of the final layer, set to in all experiments. It can thus easily be applied to any data set.
This paper concentrates on scattering transforms of real valued signals. For a boolean vector , a boolean scattering transform is computed by replacing the operator (1) by a boolean permutation invariant operator which transforms into . Iteratively applying this operator defines a boolean scattering transform having similar properties.
References
 [1] S. Mallat, “Recursive interferometric representations”. Proc. of EUSICO Conf. 2010, Denmark.
 [2] J. Bruna, S. Mallat, “Invariant Scattering Convolution Networks,” IEEE Trans. PAMI, 35(8): 18721886, 2013.
 [3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral Networks and Deep Locally Connected Networks on Graphs,” ICLR 2014.
 [4] Y. Bengio, A. Courville, P. Vincent, “Representation Learning: A Review and New Perspectives”, IEEE Trans. on PAMI, no.8, vol. 35, pp 17981828, 2013.
 [5] Y. LeCun, K. Kavukvuoglu, and C. Farabet, “Convolutional Networks and Applications in Vision,” Proc. IEEE Int. Sump. Circuits and Systems 2010.

[6]
M. Gavish, B. Nadler, and R. R. Coifman. “Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning”, in ICML, pages 367374, 2010.
 [7] N. L. Roux, Y. Bengio, P. Lamblin, M. Joliveau and B. Kégl, “Learning the 2D topology of images”, in NIPS, pages 841848, 2008.
 [8] J. Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 1965.
 [9] E. Rothberg of H. Gabow’s “An Efficient Implementation of Edmond’s Algorithm for Maximum Matching on Graphs.” JACM, 23, 1v976.
 [10] R. Rustamov, L. Guibas, “Wavelets on Graphs via Deep Learning,” NIPS 2013.
 [11] D. Shuman, S. Narang, P. Frossard, A. Ortega, P. Vanderghenyst, “The Emmerging Field of Signal Processing on Graphs,” IEEE Signal Proc. Magazine, May 2013.
 [12] T. Mehmood, K. H. Liland, L. Snipen and S. Sæbø, “A Review of Variable Selection Methods in Partial Least Squares Regression”, Chemometrics and Intelligent Laboratory Systems, vol. 118, pages 6269, 2012.

[13]
H. Zhang, S. Kiranyaz and M. Gabbouj, “Cardinal Sparse Partial Least Square Feature Selection and its Application in Face Recognition”, Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22st European, Sep. 2014.

[14]
W. R. Schwartz, A. Kembhavi, D. Harwood and L. S. Davis, “Human Detection Using Partial Least Squares Analysis”, Computer vision, ICCV 2009.
 [15] I. J. Goodfellow, D. WardeFarley, M. Mirza, A. Courville and Y. Benjio, “Maxout Networks”, Arxiv preprint, arxiv:1302.4389, 2013.
 [16] D. Yu and L. Deng, “Deep Convex Net: A Scalable Architecture for Speech Pattern Classification”,in Proc. INTERSPEECH, 2011, pp.22852288.
 [17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing coadaptation of feature detectors”, Technical report, arXiv:1207.0580, 2012.
 [18] Q. Le, T. Sarlos and A. Smola,“Fastfood  Approximating Kernel Expansions in Loglinear Time”, ICML, 2013.

[19]
M Hein and S. Setzer, “Beyond Spectral Clustering  Tight Relaxations of Balanced Graph Cuts,” NIPS 2011.
Appendix
Appendix A Proof of Theorem 2.1
Proof.
Observe that the permutation invariant operator which associates to the values
satisfies
Moreover, if then
Since is computed by applying this operator to pairs of values of , we derive that
Since and , iterating on these two equations proves Theorem 2.1.
∎
Appendix B Haar Scattering from Haar Wavelets
The following proposition proves that order scattering coefficients are computed by applying an orthogonal Haar wavelet transform to order scattering coefficients. We also prove by induction on that a scattering coefficient is of order if and only if with
for some . This property is valid for and the following proposition shows that if it is valid for then it is also valid for in the sense that an order coefficient is indexed by , and it is computed by applying an orthogonal Haar transform to order scattering coefficients indexed by .
Proposition B.1.
For any and we write
For any , any and ,
(B.1) 
Proof.
We derive from the definition of a scattering transform in equations (3,4) in the text that
where . Observe that
thus is calculated from the coefficients of the previous layer with
(B.2) 
Since , the coefficient is calculated from by times additions, and thus
(B.3) 
Combining equations (B.3) and (B.2) gives
(B.4) 
We go from the depth to the depth by computing
Together with (B.4) it proves the equation (B.1) of the proposition. The summation over comes from the inner product . This also proves that is the index of a coefficient of order . ∎
Since , the proposition inductively proves that the coefficients at th level for are of order . The expression in the proposition shows that an order scattering coefficient at scale is obtained by computing the Haar wavelet coefficients of several order coefficients at the scale , taking an absolute value, and then averaging their amplitudes over . It thus measures the averaged variations at the scale of the th order scattering coefficients.
Appendix C Proof of Theorem 2.2
To prove Theorem 2.2, we first define an “interlaced pairings”. We say that two pairings of
are interlaced for if there exists no strict subset of such that and are pairing elements within . The following lemma shows that a singlelayer scattering operator is invertible with two interlaced pairings.
Lemma C.1.
Suppose that takes more than different values, and two pairings and of are interlaced, then can be recovered from
Proof.
By Eq. (2), for a triplet if is a pair in and a pair in then the pair of values are determined (with a possible switch of the two) from
and those of are determined similarly. Then unless and the three values are recovered. The interlacing condition implies that pairs to an index which can not be or . Thus, the four values of are specified unless . This interlacing argument can be used to extend to the set of all indices for which is specified, unless takes only two values. ∎
Proof of Theorem 2.2.
Suppose that the multiresolution approximations are associated to the hierarchical pairings where , where for each , and are two interlaced pairings of elements. The sequence is a binary vector taking different values.
The constraint on the signal is that each of the intermediate scattering coefficients takes more than distinct values, which holds for
except for a union of hyperplanes which has zero measure. Thus for almost every
, the theorem follows from applying Lemma C.1 recursively to the th level scattering coefficients for . ∎