Unsupervised Deep Haar Scattering on Graphs

06/09/2014, by Xu Chen, et al.

The classification of high-dimensional data defined on graphs is particularly difficult when the graph geometry is unknown. We introduce a Haar scattering transform on graphs, which computes invariant signal descriptors. It is implemented with a deep cascade of additions, subtractions and absolute values, which iteratively compute orthogonal Haar wavelet transforms. Multiscale neighborhoods of unknown graphs are estimated by minimizing an average total variation, with a pair matching algorithm of polynomial complexity. Supervised classification with dimension reduction is tested on data bases of scrambled images, and for signals sampled on unknown irregular grids on a sphere.


1 Introduction

The geometric structure of a data domain can be described with a graph [11], where neighbor data points are represented by vertices related by an edge. For sensor networks, this connectivity depends upon the physical locations of the sensors, but in social networks it may correspond to strong interactions or similarities between two nodes. In many applications, the connectivity graph is unknown and must therefore be estimated from data. We introduce an unsupervised learning algorithm to classify signals defined on an unknown graph.

An important source of variability on graphs results from displacement of signal values. It may be due to movements of physical sources in a sensor network, or to propagation phenomena in social networks. Classification problems are often invariant to such displacements. Image pattern recognition or characterization of communities in social networks are examples of invariant problems. They require computing locally or globally invariant descriptors, which are sufficiently rich to discriminate between complex signal classes.

Section 2 introduces a Haar scattering transform which builds an invariant representation of graph data, by cascading additions, subtractions and absolute values in a deep network. It can be factorized as a product of Haar wavelet transforms on the graph. Haar wavelet transforms are flexible representations which characterize multiscale signal patterns on graphs [6, 10, 11]. Haar scattering transforms are extensions on graphs of wavelet scattering transforms, previously introduced for uniformly sampled signals [1].

For unstructured signals defined on an unknown graph, recovering the full graph geometry is an NP-complete problem. We avoid this complexity by only learning connected multiresolution graph approximations. This is sufficient to compute Haar scattering representations. Multiscale neighborhoods are calculated by minimizing an average total signal variation over training examples. It involves a pair matching algorithm of polynomial complexity. We show that this unsupervised learning algorithm computes sparse scattering representations.

For classification, the dimension of unsupervised Haar scattering representations is reduced with supervised partial least square regressions [12]. It amounts to computing a last layer of reduced dimensionality, before applying a Gaussian kernel SVM classifier. The performance of a Haar scattering classification is tested on scrambled images, whose graph geometry is unknown. Results are provided for the MNIST and CIFAR-10 image data bases. Classification experiments are also performed on scrambled signals whose samples lie on an irregular grid on a sphere. All computations can be reproduced with a software available at www.di.ens.fr/data/scattering/haar.

2 Orthogonal Haar Scattering on a Graph

Figure 1: A Haar scattering network computes each coefficient of a layer $S_{j+1} x$ by adding or subtracting a pair of coefficients in the previous layer $S_j x$.

2.1 Deep Networks of Permutation Invariant Operators

We consider signals $x$ defined on an unweighted graph $G = (V, E)$, with $V = \{1, \ldots, d\}$. Edges relate neighbor vertices. We suppose that $d$ is a power of $2$ to simplify explanations. A Haar scattering is calculated by iteratively applying the following permutation invariant operator:

$$(\alpha, \beta) \;\longmapsto\; \big(\alpha + \beta,\; |\alpha - \beta|\big). \tag{1}$$

Its values are not modified by a permutation of $\alpha$ and $\beta$, and both values are recovered by

$$\max(\alpha, \beta) = \frac{1}{2}\big(\alpha + \beta + |\alpha - \beta|\big) \quad \text{and} \quad \min(\alpha, \beta) = \frac{1}{2}\big(\alpha + \beta - |\alpha - \beta|\big). \tag{2}$$

An orthogonal Haar scattering transform computes progressively more invariant signal descriptors by applying this invariant operator at multiple scales. This is implemented along a deep network illustrated in Figure 1. The network layer $S_j x$ is a two-dimensional array $S_j x(n, q)$ of $d$ coefficients, where $1 \le n \le 2^{-j} d$ is a node index and $0 \le q < 2^{j}$ is a feature type.

The input network layer is $S_0 x(n, 0) = x(n)$. We compute $S_{j+1} x$ by regrouping the $2^{-j} d$ nodes of $S_j x$ in pairs $(a_n, b_n)$, and applying the permutation invariant operator (1) to each pair $\big(S_j x(a_n, q),\, S_j x(b_n, q)\big)$:

$$S_{j+1} x(n, 2q) = S_j x(a_n, q) + S_j x(b_n, q) \tag{3}$$

and

$$S_{j+1} x(n, 2q + 1) = \big| S_j x(a_n, q) - S_j x(b_n, q) \big|. \tag{4}$$

This transform is iterated up to a maximum depth $J \le \log_2 d$. It computes $S_J x$ with $J d$ additions, subtractions and absolute values. Since $S_j x(n, q) \ge 0$ for $q > 0$, one can put an absolute value on the sum in (3) without changing $S_{j+1} x$. It results that $S_{j+1} x$ is calculated from the previous layer $S_j x$ by applying a linear operator followed by a non-linearity, as in most deep neural network architectures. In our case this non-linearity is an absolute value, as opposed to the rectifiers used in most deep networks [4].
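To make the cascade concrete, here is a minimal NumPy sketch of equations (1), (3) and (4); it is our illustration rather than the released software, and the function name, array layout and toy pairings are assumptions:

```python
import numpy as np

def haar_scattering(x, pairings):
    """Iterate the permutation invariant operator (1) over given pairings.

    x        : signal of length d (a power of 2).
    pairings : list of J index arrays; pairings[j] has shape (2**-(j+1) * d, 2)
               and lists the pairs (a_n, b_n) of layer-j nodes.
    Returns S_J x as an array of shape (2**-J * d, 2**J).
    """
    S = np.asarray(x, dtype=float).reshape(-1, 1)   # S_0: d nodes, 1 feature
    for pairs in pairings:
        a, b = pairs[:, 0], pairs[:, 1]
        sums = S[a] + S[b]              # equation (3): feature 2q
        diffs = np.abs(S[a] - S[b])     # equation (4): feature 2q + 1
        # interleave features so that 2q is the sum and 2q+1 the difference
        S = np.stack([sums, diffs], axis=2).reshape(len(pairs), -1)
    return S

# toy example: d = 8, pairing consecutive nodes at every layer
d = 8
x = np.random.randn(d)
pairings = [np.arange(d >> j).reshape(-1, 2) for j in range(3)]
S_J = haar_scattering(x, pairings)      # shape (1, 8): one node, 2^3 features
```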

For each $n$, the scattering coefficients $S_j x(n, q)$ are calculated from the values of $x$ in a vertex set $V_{j,n}$ of size $2^{j}$. One can verify by induction on (3) and (4) that $V_{0,n} = \{n\}$ for $1 \le n \le d$, and for any $j \ge 0$

$$V_{j+1,n} = V_{j,a_n} \cup V_{j,b_n}. \tag{5}$$

The embedded subsets $\{V_{j,n}\}_{n}$ form a multiresolution approximation of the vertex set $V$. At each scale $2^{j}$, different pairings define different multiresolution approximations. A small graph displacement propagates signal values from a node to its neighbors. To build nearly invariant representations over such displacements, a Haar scattering transform must regroup connected vertices. It is thus computed over multiresolution vertex sets $V_{j,n}$ which are connected in the graph $G$. It results from (5) that a necessary and sufficient condition is that each pair $(a_n, b_n)$ regroups two connected sets $V_{j,a_n}$ and $V_{j,b_n}$.

Figure 2 shows two examples of connected multiresolution approximations. Figure 2(a) illustrates the graph of an image grid, where pixels are connected to their neighbors. In this example, each $V_{j+1,n}$ regroups two subsets $V_{j,a_n}$ and $V_{j,b_n}$ which are connected horizontally if $j$ is even and connected vertically if $j$ is odd. Figure 2(b) illustrates a second example of connected multiresolution approximation on an irregular graph. There are many different connected multiresolution approximations resulting from different pairings at each scale $2^{j}$. Different multiresolution approximations correspond to different Haar scattering transforms. In the following, we compute several Haar scattering transforms of a signal $x$ by defining different multiresolution approximations.

Figure 2: A connected multiresolution is a partition of vertices with embedded connected sets $V_{j,n}$ of size $2^{j}$. (a): Example of partition for the graph of a square image grid. (b): Example on an irregular graph.

The following theorem proves that a Haar scattering preserves the norm and that it is contractive up to a normalization factor $2^{j/2}$. The contraction is due to the absolute value, which suppresses the sign and hence reduces the amplitude of differences. The proof is in Appendix A.

Theorem 2.1.

For any $j \ge 0$, and any $x, x'$ defined on $V$,

$$\|S_j x - S_j x'\| \le 2^{j/2}\, \|x - x'\|$$

and

$$\|S_j x\| = 2^{j/2}\, \|x\|.$$
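With the sketch given after equation (4), the norm identity of Theorem 2.1 can be checked numerically (the $2^{j/2}$ factor is the normalization of the theorem):

```python
# check ||S_J x|| = 2^(J/2) ||x|| on the toy example above
J = len(pairings)
assert np.isclose(np.linalg.norm(S_J), 2 ** (J / 2) * np.linalg.norm(x))
```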

2.2 Iterated Haar Wavelet Transforms

We show that a Haar scattering transform can be written as a cascade of orthogonal Haar wavelet transforms and absolute value non-linearities. It is a particular example of scattering transforms introduced in [1]. It computes coefficients measuring signal variations at multiple scales and multiple orders. We prove that the signal can be recovered from Haar scattering coefficients computed over enough multiresolution approximations.

A scattering operator is contractive because of the absolute value. When coefficients have an arbitrary sign, suppressing the sign reduces by a factor $2$ the volume of the signal space. We say that $S_J x(n, q)$ is a coefficient of order $m$ if its computation includes $m$ absolute values of differences. The amplitude of scattering coefficients typically decreases exponentially when the scattering order $m$ increases, because of the contraction produced by the absolute value. We verify from (3) and (4) that $S_J x(n, q)$ is a coefficient of order $m = 0$ if $q = 0$, and of order $m$ if $q = \sum_{k=1}^{m} 2^{\,J - j_k}$ with $0 < j_1 < \cdots < j_m \le J$. It results that there are $2^{-J} d \binom{J}{m}$ coefficients $S_J x(n, q)$ of order $m$.
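As a worked example, take $d = 4$ with the (arbitrarily chosen) pairings $\{(1,2), (3,4)\}$ and then $\{(1,2)\}$. The four coefficients of $S_2 x$, with their orders equal to the number of nonzero binary digits of $q$, are

$$\begin{aligned} S_2 x(1, 0) &= x_1 + x_2 + x_3 + x_4 && q = 0: \text{ order } 0,\\ S_2 x(1, 1) &= \big| (x_1 + x_2) - (x_3 + x_4) \big| && q = 2^{2-2}: \text{ order } 1,\\ S_2 x(1, 2) &= |x_1 - x_2| + |x_3 - x_4| && q = 2^{2-1}: \text{ order } 1,\\ S_2 x(1, 3) &= \big|\, |x_1 - x_2| - |x_3 - x_4| \,\big| && q = 2^{2-1} + 2^{2-2}: \text{ order } 2. \end{aligned}$$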

We now show that Haar scattering coefficients of order $m$ are obtained by cascading $m$ orthogonal Haar wavelet transforms defined on the graph $G$. A Haar wavelet at a scale $2^{j+1}$ is defined over each $V_{j+1,n} = V_{j,a_n} \cup V_{j,b_n}$ by

$$\psi_{j+1,n} = \mathbf{1}_{V_{j,a_n}} - \mathbf{1}_{V_{j,b_n}},$$

where $\mathbf{1}_{V}$ is the indicator function of a vertex set $V$. For any $2^{J} \le d$, one can verify [10, 6] that

$$\big\{ \mathbf{1}_{V_{J,n}} \big\}_{1 \le n \le 2^{-J} d} \;\cup\; \big\{ \psi_{j,n} \big\}_{1 \le n \le 2^{-j} d,\; 1 \le j \le J}$$

is a non-normalized orthogonal Haar basis of the space of signals defined on $V$. Order $m = 0$ scattering coefficients sum the values of $x$ in each $V_{J,n}$:

$$S_J x(n, 0) = \sum_{v \in V_{J,n}} x(v) = \langle x,\, \mathbf{1}_{V_{J,n}} \rangle.$$

Order $m = 1$ scattering coefficients are sums of absolute values of orthogonal Haar wavelet coefficients. They measure the variation amplitude of $x$ at each scale $2^{j_1}$, in each $V_{J,n}$:

$$S_J x\big(n,\, 2^{\,J - j_1}\big) = \sum_{p \,:\, V_{j_1,p} \subset V_{J,n}} \big| \langle x,\, \psi_{j_1,p} \rangle \big|.$$

Appendix B proves that second order scattering coefficients are computed by applying a second orthogonal Haar wavelet transform to first order scattering coefficients. A coefficient $S_J x\big(n,\, 2^{\,J-j_1} + 2^{\,J-j_2}\big)$ is an averaged second order increment over $V_{J,n}$, calculated from the variations at the scale $2^{j_2}$ of the increments of $x$ at the scale $2^{j_1}$. More generally, Appendix B also proves that order $m$ coefficients measure multiscale variations of $x$ at the order $m$, and are obtained by applying a Haar wavelet transform on scattering coefficients of order $m - 1$.

A single Haar scattering transform loses information since it applies a cascade of permutation invariant operators. However, the following theorem proves that $x$ can be recovered from scattering transforms computed over $2^{J}$ different multiresolution approximations.

Theorem 2.2.

There exist $2^{J}$ multiresolution approximations such that almost all $x \in \mathbb{R}^d$ can be reconstructed from their scattering coefficients $S_J x$ computed over these multiresolution approximations.

This theorem is proved in Appendix C. The key idea is that Haar scattering transforms are computed with permutation invariant operators. Inverting these operators allows one to recover values of signal pairs but not their locations. However, recombining these values on enough overlapping sets allows one to recover their locations, and hence the original signal $x$. This is done with multiresolutions which are interlaced at each scale $2^{j}$, in the sense that the pairs of one multiresolution overlap the pairs of the other, so that no strict subset of vertices is paired within itself by both. Connectivity conditions are needed on the graph $G$ to guarantee the existence of “interlaced” multiresolution approximations which are all connected.
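For instance, on $d = 4$ vertices the pairings below are interlaced, since no strict subset of $\{1, 2, 3, 4\}$ is paired within itself by both (a hypothetical toy configuration, not taken from the paper):

$$\pi = \{(1, 2),\, (3, 4)\}, \qquad \pi' = \{(2, 3),\, (4, 1)\}.$$

From $\pi$ one knows the unordered pairs $\{x(1), x(2)\}$ and $\{x(3), x(4)\}$, and from $\pi'$ the pairs $\{x(2), x(3)\}$ and $\{x(4), x(1)\}$; intersecting overlapping pairs assigns each recovered value to its vertex for almost every $x$.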

3 Learning

3.1 Sparse Unsupervised Learning of Multiscale Connectivity

Haar scattering transforms compute multiscale signal variations of multiple orders, over non-overlapping sets $V_{j,n}$ of size $2^{j}$. To build signal descriptors which are nearly invariant to signal displacements on the graph, we want to compute scattering transforms over connected sets in the graph, which a priori requires knowing the graph connectivity. However, in many applications, the graph connectivity is unknown. For piecewise regular signals, the graph connectivity implies some form of correlation between neighbor signal values, and may thus be estimated from a training set of unlabeled examples [7].

Instead of estimating the full graph geometry, which is an NP-complete problem, we estimate multiresolution approximations which are connected. This is a hierarchical clustering problem [19]. A multiresolution approximation is connected if at each scale $2^{j}$, each pair $(a_n, b_n)$ regroups two vertex sets $V_{j,a_n}$ and $V_{j,b_n}$ which are connected. This connection is estimated by minimizing the total variation within each set $V_{j,n}$, which are clusters of size $2^{j}$ [19]. It is done with a fine-to-coarse aggregation strategy. Given $\{V_{j,n}\}_{n \le 2^{-j} d}$, we compute $\{V_{j+1,n}\}_{n \le 2^{-j-1} d}$ at the next scale by finding an optimal pairing $\{(a_n, b_n)\}_{n}$ which minimizes the total variation of scattering vectors, averaged over the training set $\{x_i\}_{i}$:

$$\sum_{n = 1}^{2^{-j-1} d} \; \sum_{q = 0}^{2^{j} - 1} \; \sum_{i} \big| S_j x_i(a_n, q) - S_j x_i(b_n, q) \big|. \tag{6}$$

This is a weighted matching problem, which can be solved by the Blossom Algorithm of Edmonds [8] with $O(d^3)$ operations. We use the implementation in [9]. Iterating on this algorithm for $0 \le j < J$ thus computes a multiresolution approximation at the scale $2^{J}$, with a hierarchical aggregation of graph vertices.
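The following sketch implements one aggregation step of this pairing; it uses networkx's blossom-based max_weight_matching rather than the implementation [9] used by the authors, and the function name and array layout are our own choices:

```python
import numpy as np
import networkx as nx

def learn_layer_pairing(S):
    """Minimize (6): pair the layer-j nodes by a minimum total-variation matching.

    S : array of shape (N, nodes, features) holding S_j x_i(n, q)
        for the N training signals. Returns a (nodes // 2, 2) array of pairs.
    """
    N, nodes, _ = S.shape
    G = nx.Graph()
    for a in range(nodes):
        for b in range(a + 1, nodes):
            # total variation between nodes a and b, summed over examples i and features q
            cost = np.abs(S[:, a, :] - S[:, b, :]).sum()
            # a minimum-weight perfect matching is a maximum matching on negated costs
            G.add_edge(a, b, weight=-cost)
    match = nx.max_weight_matching(G, maxcardinality=True)
    return np.array([sorted(pair) for pair in match])
```

Each learned pairing defines the next layer through (3) and (4), and the procedure is iterated from fine to coarse scales.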

Observe that

$$\|S_{j+1} x\|_1 = \|S_j x\|_1 + \sum_{n} \sum_{q} \big| S_j x(a_n, q) - S_j x(b_n, q) \big|.$$

Given $S_j x_i$, it results that the minimization of (6) is equivalent to the minimization of $\sum_i \|S_{j+1} x_i\|_1$. This can be interpreted as finding a multiresolution approximation which yields an optimally sparse scattering transform. It operates with a greedy layerwise strategy across the network layers, similarly to sparse autoencoders for unsupervised deep learning [4].
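For non-negative signals, this $\ell^1$ identity is easy to check numerically with the haar_scattering sketch given earlier (a one-layer check, with our toy pairing):

```python
# check ||S_1 x||_1 = ||S_0 x||_1 + total variation of the pairing, for x >= 0
x = np.random.rand(8)                     # non-negative signal
pairs = np.arange(8).reshape(4, 2)
S0 = x.reshape(-1, 1)
S1 = haar_scattering(x, [pairs])
tv = np.abs(S0[pairs[:, 0]] - S0[pairs[:, 1]]).sum()
assert np.isclose(np.abs(S1).sum(), np.abs(S0).sum() + tv)
```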

As explained in the previous section, several Haar scattering transforms are needed to obtain a complete signal representation. The unsupervised learning computes $T$ multiresolution approximations by dividing the training set in $T$ non-overlapping subsets, and learning a different multiresolution approximation from each training subset.

3.2 Supervised Feature Selection and Classification

The unsupervised learning computes a vector of scattering coefficients which is typically much larger than the dimension $d$ of $x$. However, only a subset of these invariants is needed for any particular classification task. The classification is improved by a supervised dimension reduction which selects a subset of scattering coefficients. In this paper, the feature selection is implemented with a partial least square regression [12, 13, 14]. The final supervised classifier is a Gaussian kernel SVM.

Let us denote by $\Phi x = \{\phi_p x\}_{p}$ the set of all scattering coefficients at a scale $2^{J}$, computed from the $T$ multiresolution approximations. We perform a feature selection adapted to each class $c$, with a partial least square regression of the one-versus-all indicator function

$$f_c(x) = \begin{cases} 1 & \text{if } x \text{ belongs to class } c, \\ 0 & \text{otherwise.} \end{cases}$$

A partial least square greedily selects and orthogonalizes each feature, one at a time. At the $k$-th iteration, it selects a $\phi_{p_k} x$, and a Gram-Schmidt orthogonalization yields a normalized $\tilde\phi_{p_k} x$, which is uncorrelated relatively to all previously selected features:

$$\forall r < k, \quad \sum_{i} \tilde\phi_{p_k}(x_i)\, \tilde\phi_{p_r}(x_i) = 0 \quad \text{and} \quad \sum_{i} \big| \tilde\phi_{p_k}(x_i) \big|^2 = 1.$$

The feature $\phi_{p_k} x$ is selected so that the linear regression of $f_c(x)$ on $\{\tilde\phi_{p_r} x\}_{1 \le r \le k}$ has a minimum mean-square error, computed on the training set. This is equivalent to finding $\phi_{p_k}$ so that $\big| \sum_{i} f_c(x_i)\, \tilde\phi_{p_k}(x_i) \big|$ is maximum.

The partial least square regression thus selects and computes $k$ decorrelated scattering features $\{\tilde\phi_{p_r} x\}_{r \le k}$ for each class $c$. For a total of $C$ classes, the union of all these feature sets defines a dictionary of size $M = k\,C$. They are linear combinations of the original Haar scattering coefficients $\{\phi_p x\}_{p}$. This dimension reduction can thus be interpreted as a last fully connected network layer, which outputs a vector of size $M$. The parameter $M$ allows one to optimize the bias versus variance trade-off. It can be adjusted from the decay of the regression error of each $f_c$ [12]. In our numerical experiments, it is set to a fixed size $M = 1000$ for all data bases.
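A minimal sketch of this greedy selection (our own loop, not the authors' code); it assumes a feature matrix `Phi` with one row per training signal, and the one-versus-all labels `f`:

```python
import numpy as np

def pls_select(Phi, f, k):
    """Greedily select k features by partial least square regression.

    Phi : (N, P) matrix of scattering features phi_p(x_i).
    f   : (N,) one-versus-all indicator f_c(x_i).
    Returns the selected column indices and the (N, k) orthonormalized features.
    """
    residual = np.array(Phi, dtype=float)   # features, progressively orthogonalized
    selected, basis = [], []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=0)
        norms[norms < 1e-12] = np.inf       # ignore exhausted features
        U = residual / norms                # normalized candidate features
        p = int(np.argmax(np.abs(f @ U)))   # maximal correlation with the class
        u = U[:, p]
        selected.append(p)
        basis.append(u)
        # Gram-Schmidt: decorrelate every remaining feature from the chosen one
        residual -= np.outer(u, u @ residual)
    return np.array(selected), np.stack(basis, axis=1)
```

The features selected for all classes are concatenated into the final dictionary of size $M$, which is fed to the Gaussian kernel SVM.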

4 Numerical Experiments

Unsupervised Haar scattering representations are tested on classification problems, over scrambled images and scrambled data on a sphere, for which the geometry is therefore unknown. Classification results are compared with a Haar scattering algorithm computed over the known signal geometry, and with state of the art algorithms.

A Haar scattering representation involves few parameters, which we now review. The scattering scale $2^{J}$ is the invariance scale. Scattering coefficients are computed up to a maximum order, which is set to $4$ in all experiments; indeed, higher order scattering coefficients have a negligible relative energy, below $1\%$. The unsupervised learning algorithm computes $T$ multiresolution approximations, corresponding to $T$ different scattering transforms. Increasing $T$ decreases the classification error, but it increases computations. The error decay becomes negligible for $T \ge 40$. The supervised dimension reduction selects a final set of $M$ orthogonalized scattering coefficients. We set $M = 1000$ in all numerical experiments.

For signals defined on an unknown graph, the unsupervised learning computes an estimation of connected multiresolution sets by minimizing an average total variation. For each data basis of scrambled signals, the precision of this estimation is evaluated by computing the percentage of multiscale sets which are indeed connected in the original topology (an image grid or a grid on the sphere).

4.1 MNIST Digit Recognition

Figure 3: MNIST images (left) and images after random pixel permutations (right).

MNIST is a data basis of hand-written digit images of size $28 \times 28$, with $6 \times 10^4$ images for training and $10^4$ for testing. Examples of MNIST images before and after pixel scrambling are shown in Figure 3. The best classification results are obtained at the maximum invariance scale $2^{J}$. The classification error is $0.90\%$, with an unsupervised learning of $T$ multiresolution approximations. Table 1 shows that it is below, but close to, state of the art results obtained with fully supervised deep networks, which are optimized with supervised backpropagation algorithms.

The unsupervised learning computes multiresolution sets $V_{j,n}$ from scrambled images. At fine scales, nearly all of these multiresolution sets are connected in the original image grid, which proves that the geometry is well estimated at these scales. This is only evaluated on meaningful pixels, which do not remain zero on all training images. At the largest scales, the percentage of connected sets decreases because long range correlations are weaker.

One can reduce the Haar scattering classification error from $0.90\%$ to $0.59\%$ with a known image geometry. The Haar scattering transform is then computed over multiresolution approximations which are directly constructed from the image grid, as in Figure 2(a). Rotations and translations define different connected multiresolution approximations, which yield this reduced error of $0.59\%$. State of the art classification errors on MNIST, for a non-augmented data basis (without elastic deformations), are respectively $0.43\%$ with a Gabor scattering [2] and $0.53\%$ with a supervised training of deep convolution networks [5]. This shows that without any learning, a Haar scattering using geometry is close to the state of the art.

Maxout MLP + dropout [15] | Deep convex net. [16] | DBM + dropout [17] | Haar Scattering
0.94                      | 0.83                  | 0.79               | 0.90

Table 1: Percentage of errors for the classification of scrambled MNIST images, obtained by different algorithms.

4.2 CIFAR-10 Images

CIFAR-10 images are color images of $32 \times 32$ pixels, which are much more complex than MNIST digit images. It includes $10$ classes, such as "dogs", "cars", "ships", with a total of $5 \times 10^4$ training examples and $10^4$ testing examples. The color bands are represented with Y, U, V channels, and scattering coefficients are computed independently in each channel.

The Haar scattering is first applied to scrambled CIFAR images whose geometry is unknown. The minimum classification error is obtained at a scale $2^{J}$ which is below the maximum scale $d$; it maintains some localization information on the image features. With $T$ multiresolution approximations, a Haar scattering transform has an error of $27.3\%$. It is below previous results obtained on this data basis, given in Table 2.

Nearly all of the multiresolution sets computed from scrambled images are connected in the original image grid at the finest scales, which shows that the multiscale geometry is well estimated there. As for MNIST images, the proportion of connected sets decreases at large scales, where the connectivity is not as precisely estimated.

Fastfood [18] | Random Kitchen Sinks [18] | Haar Scattering
36.9          | 37.6                      | 27.3

Table 2: Percentage of errors for the classification of scrambled CIFAR-10 images, with different algorithms.

The Haar scattering classification error is reduced from $27.3\%$ to $21.3\%$ if the image geometry is known. As for MNIST, we compute multiresolution approximations obtained by translating and rotating. After dimension reduction, the classification error is $21.3\%$. This error is above the state of the art obtained by a supervised convolutional network [15] ($11.68\%$), but the Haar scattering representation involves no learning.

4.3 Signals on a Sphere

A data basis of irregularly sampled signals on a sphere is constructed in [3], by projecting the MNIST image digits on $4096$ points randomly sampled on the 3D sphere, and by randomly rotating these images on the sphere. The random rotation is either uniformly distributed on the sphere or restricted with a smaller variance (small rotations) [3]. The digit ‘9’ is removed from the data set because it cannot be distinguished from a ‘6’ after rotation. Examples of the dataset are shown in Figure 4.

Figure 4: Images of digits mapped on a sphere.

The classification algorithms introduced in [3] use the known distribution of points on the sphere, by computing a representation based on the graph Laplacian. Table 3 gives the results reported in [3], with a fully connected neural network, and a spectral graph Laplacian network.

As opposed to these algorithms, the Haar scattering algorithm uses no information on the positions of points on the sphere. Computations are performed from a scrambled set of signal values, without any geometric information. Scattering transforms are calculated up to the maximum scale $2^{J} = d = 4096$. A total of $T$ multiresolution approximations are estimated by unsupervised learning, and the classification is performed from the $M$ selected coefficients. Despite the fact that the geometry is unknown, the Haar scattering reduces the error rate both for small and large 3D random rotations.

In order to evaluate the precision of our geometry estimation, we use the neighborhood information based on the 3D coordinates of the 4096 points on the sphere of radius 1. We say that two points are connected if their geodesic distance is smaller than 0.1, which gives each point on the sphere a small number of connected points on average. For small rotations, the percentage of learned multiresolution sets which are connected is 92%, 92%, 88% and 83% for $j$ going from $1$ to $4$. It is computed on meaningful points with non-negligible energy. For large rotations, it is 97%, 96%, 95% and 95%. This shows that the multiscale geometry on the sphere is well estimated.

                | Nearest Neighbors | Fully Connect. [3] | Spectral Net. [3] | Haar Scattering
Small rotations | 19                | 5.6                | 6                 | 2.2
Large rotations | 80                | 52                 | 50                | 47.7

Table 3: Percentage of errors for the classification of MNIST images rotated and sampled on a sphere [3], with a nearest neighbor classifier, a fully connected two layer neural network, a spectral network [3], and a Haar scattering.

5 Conclusion

A Haar scattering transform computes invariant data representations by iterating over a hierarchy of permutation invariant operators, calculated with additions, subtractions and absolute values. The geometry of unstructured signals is estimated with an unsupervised learning algorithm, which minimizes the average total signal variation over multiscale neighborhoods. This shows that unsupervised deep learning can be implemented with a polynomial complexity algorithm. The supervised classification includes a feature selection implemented with a partial least square regression. State of the art results have been shown on scrambled images as well as random signals sampled on a sphere. The two important parameters of this architecture are the network depth, which corresponds to the invariance scale, and the dimension $M$ of the reduced final layer, set to the same value in all experiments. It can thus easily be applied to any data set.

This paper concentrates on scattering transforms of real valued signals. For a boolean vector $x$, a boolean scattering transform is computed by replacing the operator (1) by a boolean permutation invariant operator, which transforms $(\alpha, \beta)$ into $(\alpha \,\mathrm{or}\, \beta,\; \alpha \,\mathrm{xor}\, \beta)$. Iteratively applying this operator defines a boolean scattering transform having similar properties.
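A two-line check of this boolean analogue; the (or, xor) choice of operator above is our assumption of one admissible variant:

```python
# boolean analogue of (1): (a, b) -> (a or b, a xor b); recovery mirrors (2)
for a in (False, True):
    for b in (False, True):
        s, t = a or b, a != b               # "sum" and "difference" bits
        assert max(a, b) == s               # or recovers the max, as in (2)
        assert min(a, b) == (s and not t)   # and the min follows from (s, t)
```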

References

  • [1] S. Mallat, “Recursive Interferometric Representations,” Proc. of EUSIPCO Conf. 2010, Denmark.
  • [2] J. Bruna, S. Mallat, “Invariant Scattering Convolution Networks,” IEEE Trans. PAMI, 35(8): 1872-1886, 2013.
  • [3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral Networks and Deep Locally Connected Networks on Graphs,” ICLR 2014.
  • [4] Y. Bengio, A. Courville, P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE Trans. on PAMI, vol. 35, no. 8, pp. 1798-1828, 2013.
  • [5] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional Networks and Applications in Vision,” Proc. IEEE Int. Symp. Circuits and Systems 2010.
  • [6] M. Gavish, B. Nadler, and R. R. Coifman, “Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi-Supervised Learning,” in ICML, pages 367-374, 2010.
  • [7] N. L. Roux, Y. Bengio, P. Lamblin, M. Joliveau and B. Kégl, “Learning the 2-D Topology of Images,” in NIPS, pages 841-848, 2008.
  • [8] J. Edmonds, “Paths, Trees, and Flowers,” Canadian Journal of Mathematics, 1965.
  • [9] E. Rothberg's implementation of H. Gabow's “An Efficient Implementation of Edmonds' Algorithm for Maximum Matching on Graphs,” JACM, 23, 1976.
  • [10] R. Rustamov, L. Guibas, “Wavelets on Graphs via Deep Learning,” NIPS 2013.
  • [11] D. Shuman, S. Narang, P. Frossard, A. Ortega, P. Vandergheynst, “The Emerging Field of Signal Processing on Graphs,” IEEE Signal Proc. Magazine, May 2013.
  • [12] T. Mehmood, K. H. Liland, L. Snipen and S. Sæbø, “A Review of Variable Selection Methods in Partial Least Squares Regression,” Chemometrics and Intelligent Laboratory Systems, vol. 118, pages 62-69, 2012.
  • [13] H. Zhang, S. Kiranyaz and M. Gabbouj, “Cardinal Sparse Partial Least Square Feature Selection and its Application in Face Recognition,” Signal Processing Conference (EUSIPCO), Proceedings of the 22nd European, Sep. 2014.
  • [14] W. R. Schwartz, A. Kembhavi, D. Harwood and L. S. Davis, “Human Detection Using Partial Least Squares Analysis,” Computer Vision, ICCV 2009.
  • [15] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio, “Maxout Networks,” arXiv preprint arXiv:1302.4389, 2013.
  • [16] D. Yu and L. Deng, “Deep Convex Net: A Scalable Architecture for Speech Pattern Classification,” in Proc. INTERSPEECH, 2011, pp. 2285-2288.
  • [17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving Neural Networks by Preventing Co-adaptation of Feature Detectors,” Technical report, arXiv:1207.0580, 2012.
  • [18] Q. Le, T. Sarlos and A. Smola, “Fastfood – Approximating Kernel Expansions in Loglinear Time,” ICML, 2013.
  • [19] M. Hein and S. Setzer, “Beyond Spectral Clustering – Tight Relaxations of Balanced Graph Cuts,” NIPS 2011.

Appendix

Appendix A Proof of Theorem 2.1

Proof.

Observe that the permutation invariant operator which associates to $(\alpha, \beta)$ the values $(\alpha + \beta,\, |\alpha - \beta|)$ satisfies

$$(\alpha + \beta)^2 + |\alpha - \beta|^2 = 2\,(\alpha^2 + \beta^2).$$

Moreover, if $(\alpha', \beta')$ is another pair of values then

$$\big| (\alpha + \beta) - (\alpha' + \beta') \big|^2 + \big|\, |\alpha - \beta| - |\alpha' - \beta'| \,\big|^2 \;\le\; 2\, \big( |\alpha - \alpha'|^2 + |\beta - \beta'|^2 \big).$$

Since $S_{j+1} x$ is computed by applying this operator to pairs of values of $S_j x$, we derive that

$$\|S_{j+1} x\|^2 = 2\, \|S_j x\|^2 \quad \text{and} \quad \|S_{j+1} x - S_{j+1} x'\|^2 \le 2\, \|S_j x - S_j x'\|^2.$$

Since $S_0 x = x$ and $S_0 x' = x'$, iterating on these two equations proves Theorem 2.1. ∎

Appendix B Haar Scattering from Haar Wavelets

The following proposition proves that order $m + 1$ scattering coefficients are computed by applying an orthogonal Haar wavelet transform to order $m$ scattering coefficients. We also prove by induction on $j$ that a scattering coefficient $S_j x(n, q)$ is of order $m$ if and only if $q = \sum_{k=1}^{m} 2^{\,j - j_k}$ for some $0 < j_1 < \cdots < j_m \le j$. This property is valid for $j = 0$, and the following proposition shows that if it is valid for $j$ then it is also valid at all larger depths, in the sense that an order $m + 1$ coefficient is indexed by $q' = 2^{l} q + 2^{\,l - l'}$, and it is computed by applying an orthogonal Haar transform to the order $m$ scattering coefficients indexed by $q$.

Proposition B.1.

For any $j \ge 0$ and any $q = \sum_{k=1}^{m} 2^{\,j - j_k}$ of order $m$, we write $y_q(n) = S_j x(n, q)$, a signal indexed by the depth-$j$ nodes, and denote by $\overline{V}_{l,p}$ and $\overline{\psi}_{l,p}$ the multiresolution sets and Haar wavelets defined over these nodes by the pairings of the deeper layers. For any $l \ge 1$, any $n$ and any $1 \le l' \le l$,

$$S_{j+l}\, x\big(n,\; 2^{l} q + 2^{\,l - l'}\big) \;=\; \sum_{p \,:\, \overline{V}_{l',p} \subset \overline{V}_{l,n}} \big| \langle y_q,\, \overline{\psi}_{l',p} \rangle \big|. \tag{B.1}$$
Proof.

We derive from the definition of a scattering transform in equations (3, 4) in the text that the difference coefficients satisfy

$$S_{j+l'}\, x(n,\, 2r + 1) = \big| S_{j+l'-1}\, x(a_n,\, r) - S_{j+l'-1}\, x(b_n,\, r) \big|,$$

where $r = 2^{\,l'-1} q$. Observe that $\overline{V}_{l',n} = \overline{V}_{l'-1,a_n} \cup \overline{V}_{l'-1,b_n}$, thus $S_{j+l'}\, x(n,\, 2^{l'} q + 1)$ is calculated from the coefficients of the previous layer with

$$S_{j+l'}\, x(n,\, 2^{l'} q + 1) = \big| S_{j+l'-1}\, x(a_n,\, 2^{\,l'-1} q) - S_{j+l'-1}\, x(b_n,\, 2^{\,l'-1} q) \big|. \tag{B.2}$$

Since each addition (3) maps the feature index $q$ to $2q$, the coefficient $S_{j+l'-1}\, x(n,\, 2^{\,l'-1} q)$ is calculated from $y_q$ by $l' - 1$ times additions, and thus

$$S_{j+l'-1}\, x(n,\, 2^{\,l'-1} q) = \sum_{v \in \overline{V}_{l'-1,n}} y_q(v) = \langle y_q,\, \mathbf{1}_{\overline{V}_{l'-1,n}} \rangle. \tag{B.3}$$

Combining equations (B.3) and (B.2) gives

$$S_{j+l'}\, x(n,\, 2^{l'} q + 1) = \big| \langle y_q,\, \overline{\psi}_{l',n} \rangle \big|, \tag{B.4}$$

with $\overline{\psi}_{l',n} = \mathbf{1}_{\overline{V}_{l'-1,a_n}} - \mathbf{1}_{\overline{V}_{l'-1,b_n}}$. We go from the depth $j + l'$ to the depth $j + l$ by computing $l - l'$ further additions (3) of these non-negative coefficients, which multiplies the feature index by $2^{\,l - l'}$ and sums them over all $p$ such that $\overline{V}_{l',p} \subset \overline{V}_{l,n}$. Together with (B.4) it proves the equation (B.1) of the proposition. The summation over $p$ comes from the inner products $\langle y_q,\, \overline{\psi}_{l',p} \rangle$. This also proves that $2^{l} q + 2^{\,l - l'}$ is the index of a coefficient of order $m + 1$. ∎

Since $S_0 x = x$, the proposition inductively proves that the coefficients at the $(j + l)$-th level, for $q' = 2^{l} q + 2^{\,l - l'}$, are of order $m + 1$. The expression in the proposition shows that an order $m + 1$ scattering coefficient at scale $2^{\,j+l}$ is obtained by computing the Haar wavelet coefficients of several order $m$ coefficients at the scale $2^{\,j+l'}$, taking an absolute value, and then averaging their amplitudes over $\overline{V}_{l,n}$. It thus measures the averaged variations at the scale $2^{l'}$ of the $m$-th order scattering coefficients.

Appendix C Proof of Theorem 2.2

To prove Theorem 2.2, we first define “interlaced pairings”. We say that two pairings $\pi$ and $\pi'$ of $\{1, \ldots, d\}$ are interlaced if there exists no strict subset $\Omega$ of $\{1, \ldots, d\}$ such that $\pi$ and $\pi'$ are both pairing elements within $\Omega$. The following lemma shows that a single-layer scattering operator is invertible with two interlaced pairings.

Lemma C.1.

Suppose that $x$ takes more than two different values, and that two pairings $\pi$ and $\pi'$ of $\{1, \ldots, d\}$ are interlaced. Then $x$ can be recovered from the two single-layer scattering transforms computed with $\pi$ and $\pi'$.

Proof.

By Eq. (2), for a triplet $(a, b, c)$, if $(a, b)$ is a pair in $\pi$ and $(b, c)$ a pair in $\pi'$, then the pair of values $\{x(a), x(b)\}$ is determined (with a possible switch of the two) from the scattering computed with $\pi$, and those of $\{x(b), x(c)\}$ are determined similarly from $\pi'$. Then, unless $x(a) = x(c) \ne x(b)$, the three values $x(a)$, $x(b)$ and $x(c)$ are recovered. The interlacing condition implies that $\pi$ pairs $c$ to an index $e$ which can not be $a$ or $b$. Thus, the four values of $x$ at $a$, $b$, $c$ and $e$ are specified, unless the previous ambiguity occurs. This interlacing argument can be used to extend the recovery to the set of all indices on which $x$ is specified, unless $x$ takes only two values. ∎

Proof of Theorem 2.2.

Suppose that the multiresolution approximations are associated to the hierarchical pairings $\{\pi^{\varepsilon}\}_{\varepsilon}$, where $\pi^{\varepsilon} = (\pi_1^{\varepsilon_1}, \ldots, \pi_J^{\varepsilon_J})$, and where for each $j$, $\pi_j^{1}$ and $\pi_j^{2}$ are two interlaced pairings of the $2^{-j+1} d$ nodes of the layer $j - 1$. The sequence $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_J)$ is a binary vector taking $2^{J}$ different values.

The constraint on the signal $x$ is that each vector of intermediate scattering coefficients takes more than two distinct values, which holds for all $x \in \mathbb{R}^d$ except for a union of hyperplanes, which has zero measure. Thus, for almost every $x$, the theorem follows from applying Lemma C.1 recursively to the $j$-th level scattering coefficients, for $j$ going from $J - 1$ down to $0$. ∎