Learning Local Receptive Fields and their Weight Sharing Scheme on Graphs

06/08/2017 ∙ by Jean-Charles Vialatte, et al. ∙ City Zen Data 0

We propose a simple and generic layer formulation that extends the properties of convolutional layers to any domain that can be described by a graph. Namely, we use the support of its adjacency matrix to design learnable weight sharing filters able to exploit the underlying structure of signals in the same fashion as for images. The proposed formulation makes it possible to learn the weights of the filter as well as a scheme that controls how they are shared across the graph. We perform validation experiments with image datasets and show that these filters offer performances comparable with convolutional ones.


I Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art accuracy in many supervised learning challenges [1, 2, 3, 4, 5, 6]. Because they can absorb huge amounts of data with less overfitting, deep learning [7] models are the gold standard when a lot of data is available. CNNs benefit from the ability to create stationary and multi-resolution low-level features from raw data, independently of their location in the training images. Some authors draw a parallel between these features and scattering transforms [8].

CNNs rely on the ability to define a convolution operator (or a translation) on signals. On images, this amounts to learning local receptive fields [9] that are convolved with training images. Considering images to be defined on a grid graph, we point out that the receptive fields of vertices are included in their neighbors or, more generally, in a neighborhood.

Conversely, convolution requires more than the neighborhoods of vertices in the underlying graph, as the operator is able to match specific neighbors of distinct vertices together. For instance, performing convolution on images requires knowledge of the coordinates of pixels, which is not directly accessible when considering a grid graph (cf. [10, 11]). In this paper we are interested in demonstrating that the underlying graph is nevertheless enough to achieve comparable results.

The convolution of a signal can be formalized as its multiplication with a convolution matrix. In the case of images and for small convolution kernels, it is interesting to note that this convolution matrix has the same support as a lattice graph. Using this idea, we propose to introduce a type of layer based on a graph that connects neurons to their neighbors. Moreover, convolution matrices are entirely determined by a single row, since the same weights appear on each one. To imitate this process, we introduce a weight sharing learning procedure, which consists in using a limited pool of weights that each row of the obtained operator can draw from.

Section II presents related work. Section III describes our methodology and the links with existing architectures. Section IV contains experimental results. Section V is a conclusion.

II Related Work

Due to the effectiveness of CNNs on image datasets, models have been proposed to adapt them to other kinds of data, e.g. for shapes and manifolds [12, 13], molecular datasets [14], or graphs [15, 16, 13]. A review is given in [17]. In particular, CNNs have also been adapted to graph signals, such as in [18, 19] where the convolution is formalized in the spectral domain of the graph defined by its Laplacian [20]. This approach has been improved in [21], with a localized and fast approximated formulation, and has been used back in vision to produce isometry invariant representations [22].

For non-spectral approaches, feature correspondences in the input domain make it possible to define how the weights are tied across the layer, such as for images or manifolds. For graphs and graph signals, such correspondences do not necessarily exist. For example, in [16] (where the convolution is based on multiplications with powers of the probability transition matrix) weights are tied according to the power to which they are attached, in [15] an ordering of the nodes is used, and in [13] an embedding is learned from the degrees of the nodes. These choices are arbitrary and dissimilar to what is done by regular convolutions. In contrast, we propose a generic layer formulation that also makes it possible to learn how the weights are linearly distributed over the local receptive field.

Our model is first designed for the task of graph signal classification, but another common task is node classification, such as in [16, 23, 24, 13]. Models learning part of their structures have also been proposed, such as in [25, 26]. Moreover, because our model strongly resembles regular convolutions, it can also resemble some of their variants, such as group equivariant convolutions [27].

III Methodology

We first recall the basic principles of Deep Neural Networks (DNNs) and CNNs, then introduce our proposed graph layer.

III-A Background

DNNs [28] consist of a composition of layers, each one parametrized by a learnable weight kernel $W$ and a nonlinear function $f$. Given an input $x$ of such a layer, the corresponding output is:

$$y = f(W \cdot x + b)$$

where $\cdot$ is the matrix product operator, $f$ is typically applied component-wise and $b$ is a learnable bias vector.

The weight kernels are learned using an optimization routine usually based on gradient descent, so that the DNN is able to approximate an objective function. A DNN containing only this type of layer is called Multi-Layer Perceptron (MLP).

In the case of CNNs [29], some of the layers have the particular form of convolution filters. In this case, the convolutional operation can also be written as the product of the input signal with a matrix $W$, where $W$ is a Toeplitz matrix. Previous works [30, 31, 32, 33] have shown that to obtain the best accuracy in vision challenges, it is usually better to use very small kernels, resulting in a sparse $W$. Figure 1 depicts a convolutional layer.

Figure 1: Depiction of a 1D-convolutional layer and its associated matrix $W$.
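As an illustration of the matrix view of convolution described above, the following minimal sketch builds the banded Toeplitz matrix of a small 1D kernel and checks that multiplying by it matches a direct sliding-window computation. The helper name `conv_matrix_1d`, the zero padding at the borders and the example kernel are assumptions made for this example, not taken from the paper.

```python
import numpy as np

def conv_matrix_1d(kernel, n):
    """Return the n x n banded Toeplitz matrix of a zero-padded 1D cross-correlation."""
    k = len(kernel)
    half = k // 2
    W = np.zeros((n, n))
    for i in range(n):
        for d in range(k):
            j = i + d - half          # input position read by output i at kernel offset d
            if 0 <= j < n:            # positions outside the signal are treated as zeros
                W[i, j] = kernel[d]
    return W

kernel = np.array([1.0, -2.0, 0.5])   # hypothetical kernel of size 3
x = np.arange(5, dtype=float)

W = conv_matrix_1d(kernel, len(x))
y_matrix = W @ x                                   # convolution as a matrix product
y_direct = np.correlate(x, kernel, mode="same")    # sliding-window reference

print(np.allclose(y_matrix, y_direct))
print((W != 0).astype(int))           # support: a path graph with self-loops
```

The non-zero pattern of `W` is tridiagonal, which is the adjacency pattern of a 1D lattice in which every vertex is linked to itself and to its two neighbors.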

III-B Proposed Method

We propose to introduce another type of layer, that we call receptive graph layer. It is based on an adjacency matrix and aims at extending the principle of convolutional layers to any domain that can be described using a graph.

Consider an adjacency matrix $A$ that is well fitted to the signals to be learned, in the sense that it describes an underlying graph structure between the input features. We define the receptive graph layer associated with $A$ using the product between a third-rank tensor $S$ and a weight kernel $\Theta$. For now, the tensor $\Theta$ is of rank one, containing the weights of the layer, and $S$ is of shape $n \times n \times \omega$, where $n \times n$ is the shape of the adjacency matrix and $\omega$ is the shape of $\Theta$.

On the first two ranks, the support of $S$ must not exceed that of $A$, such that $A_{ij} = 0 \Rightarrow \forall k,\ S_{ijk} = 0$.

Overall, we obtain:

$$y = f(\Theta \otimes S \cdot x + b)$$

where $\otimes$ here denotes the tensor product.
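A minimal NumPy sketch of this forward pass is given below, under the following assumptions: `A` denotes the adjacency matrix, `S` the third-rank tensor, `theta` the weight kernel, the tensor product is read as a contraction over the last rank of `S`, and ReLU stands in for the nonlinearity. The small path graph and all variable names are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Path graph on 4 vertices with self-loops (hypothetical example graph).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
n = A.shape[0]
omega = 3                                        # size of the weight pool

theta = rng.normal(size=omega)                   # weight kernel, shape (omega,)
S = rng.random((n, n, omega)) * A[:, :, None]    # third-rank tensor, zero wherever A is zero
b = np.zeros(n)

def receptive_graph_layer(x, theta, S, b):
    """Contract theta with S to get an n x n operator, then apply it to x."""
    W = np.einsum("ijk,k->ij", S, theta)         # W[i, j] = sum_k theta[k] * S[i, j, k]
    return np.maximum(W @ x + b, 0.0), W         # ReLU is an assumed choice of nonlinearity

x = rng.normal(size=n)
y, W = receptive_graph_layer(x, theta, S, b)
print(y.shape, np.all(W[A == 0] == 0))
```

Because `S` vanishes outside the support of `A`, the resulting operator `W` only connects vertices that are neighbors in the graph.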

Intuitively, the values of the weight kernel $\Theta$ are linearly distributed to pairs of neighbours in $A$ with respect to the values of $S$. For this reason, we call $S$ the scheme (or weight sharing scheme) of the receptive graph. In a sense, this scheme tensor is to the receptive graph what the adjacency matrix is to the graph. An example is depicted in Figure 2.

Figure 2: Depiction of a graph, the corresponding receptive graph of the propagation and its associated weight sharing scheme $S$. Note that the $S_{ij}$ are vector slices of $S$ along the first two ranks; $S_{ij}$ determines how much of each weight in $\Theta$ is allocated to the edge linking vertex $i$ to vertex $j$.

As with convolution on images, $\Theta$ is extended to a third-rank tensor to include multiple input and output channels (also known as feature maps). It is worth mentioning that an implementation must be memory efficient to handle a possibly large sparse $S$.
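The multi-channel extension can be sketched with a single contraction. The axis order assumed here for the weight kernel, (pool size, input channels, output channels), is a choice made for this example, as are the names `Theta`, `S`, `n_in` and `n_out`. The contraction is also compared with the naive route that first materializes the full operator, which is what a memory-conscious implementation would want to avoid.

```python
import numpy as np

rng = np.random.default_rng(1)
n, omega, n_in, n_out = 5, 3, 2, 4

A = (rng.random((n, n)) < 0.5).astype(float)     # hypothetical random graph
np.fill_diagonal(A, 1.0)                         # keep self-loops
S = rng.random((n, n, omega)) * A[:, :, None]    # scheme restricted to the support of A
Theta = rng.normal(size=(omega, n_in, n_out))    # assumed axis order: pool, inputs, outputs
X = rng.normal(size=(n, n_in))                   # one signal with n_in channels per vertex

# Direct contraction: Y[i, q] = sum_{j, k, p} S[i, j, k] * Theta[k, p, q] * X[j, p]
Y = np.einsum("ijk,kpq,jp->iq", S, Theta, X)

# Reference: materialize the full (n, n, n_in, n_out) operator first.
W_full = np.einsum("ijk,kpq->ijpq", S, Theta)
Y_ref = np.einsum("ijpq,jp->iq", W_full, X)

print(Y.shape, np.allclose(Y, Y_ref))
```

In this sketch `S` is stored densely for readability; a sparse representation indexed by the edges of the graph would be the natural next step for large inputs.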

III-C Training

The proposed formulation allows both $\Theta$ and $S$ to be learned. We learn the two jointly. Learning $\Theta$ amounts to learning weights as in regular CNNs, whereas learning $S$ amounts to learning how these weights are tied over the receptive fields. We also experiment with a fine-tuning step, which consists in freezing $S$ in the last epochs. Indeed, when a weight sharing scheme can be decided directly from the underlying structure, it is not necessary to train $S$.

Because of our inspiration from CNNs, we propose constraints on the parameters of $S$. Namely, we constrain them to lie between 0 and 1 and to sum to 1 along the third dimension. Therefore, the vectors on the third rank of $S$ can be interpreted as performing a weighted average of the parameters in $\Theta$.
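The text does not say how these constraints are enforced during training. One simple possibility, shown here purely as an assumption, is to project the scheme tensor after each update: clip negative entries, zero out entries outside the graph support, and renormalize each vector along the third dimension. The function name `project_scheme` and the uniform fallback for all-zero vectors are hypothetical.

```python
import numpy as np

def project_scheme(S, A, eps=1e-12):
    """Clip to non-negative values, mask by A, and normalize along the third axis."""
    mask = A[:, :, None]
    S = np.clip(S, 0.0, None) * mask                   # non-negative, zero off the support of A
    totals = S.sum(axis=2, keepdims=True)
    # Edges whose vector was clipped to all zeros fall back to a uniform average.
    uniform = np.full_like(S, 1.0 / S.shape[2]) * mask
    return np.where(totals > eps, S / np.maximum(totals, eps), uniform)

rng = np.random.default_rng(2)
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
S_raw = rng.normal(size=(3, 3, 4))                     # unconstrained values after an update
S_proj = project_scheme(S_raw, A)
print(S_proj.sum(axis=2))                              # 1 on edges of A, 0 elsewhere
```

After projection, each vector attached to an edge is a set of averaging coefficients over the weight pool, which matches the weighted-average interpretation given above.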

We test two types of initialization for $S$. The first one consists in distributing one-hot-bit vectors along the third rank. We impose that, for each receptive field, a particular one-hot-bit vector can be distributed at most once more than any other. We refer to it as one-hot-bit initialization. The second one consists in using a uniform random distribution with limits as described in [34].
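The one-hot-bit initialization can be sketched as follows. The balancing rule is implemented by cycling through a random ordering of the pool indices within each receptive field, which is one way (an assumption of this example) to guarantee that no one-hot vector is used more than once more than any other. The function name `one_hot_bit_init` is hypothetical.

```python
import numpy as np

def one_hot_bit_init(A, omega, rng):
    """Give each edge of A a one-hot vector, balanced within each receptive field."""
    n_out, n_in = A.shape
    S = np.zeros((n_out, n_in, omega))
    for i in range(n_out):
        nbrs = np.flatnonzero(A[i])                    # receptive field of output vertex i
        # Cycle through a random ordering of the pool so usage counts differ by at most 1.
        labels = rng.permutation(omega)[np.arange(len(nbrs)) % omega]
        rng.shuffle(labels)                            # random assignment to neighbors
        S[i, nbrs, labels] = 1.0
    return S

rng = np.random.default_rng(3)
A = np.ones((4, 7))
A[0, :3] = 0                                           # vertex 0 has a smaller receptive field
S = one_hot_bit_init(A, omega=3, rng=rng)
print(S.sum(axis=1))                                   # per-vertex usage count of each pool index
```

The printed counts show, for each output vertex, how many of its neighbors were assigned each pool index.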

III-D Genericity

For simplicity we restricted our explanation to square adjacency matrices. In the case of oriented graphs, one could remove the rows and columns of zeros and obtain a receptive graph with a different number of neurons in the input ($n$) than in the output ($m$). As a result, receptive graph layers extend usual ones, as explained here:

  1. To obtain a fully connected layer, one can choose $\omega$ to be of size $nm$ and $S$ to be the matrix of vectors that contains all possible one-hot-bit vectors.

  2. To obtain a convolutional layer, one can choose $\omega$ to be the size of the kernel. $S$ would be one-hot-bit encoded along its third rank and circulant along the first two ranks. A stride larger than 1 can be obtained by removing the corresponding rows.

  3. Similarly, most of the layers presented in related works can be obtained for an appropriate definition of $S$.

In our case, $S$ is more similar to that obtained when considering convolutional layers, with the noticeable differences that we do not force which weight to allocate for which neighbor along its third rank and it is not necessarily circulant along the first two ranks.
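The first two special cases listed above can be checked numerically on toy sizes. The sketch below assumes a ring graph (so that the convolution is circular) and reads the tensor product as a contraction over the third rank; the names `S_conv`, `S_fc` and the example weights are hypothetical.

```python
import numpy as np

# Convolutional case: one-hot along the third rank, circulant along the first two.
n, omega = 6, 3
theta = np.array([0.5, -1.0, 2.0])                     # hypothetical kernel of size 3
S_conv = np.zeros((n, n, omega))
for i in range(n):
    for k in range(omega):
        S_conv[i, (i + k - 1) % n, k] = 1.0            # kernel tap k reads neighbor i + k - 1
W_conv = np.einsum("ijk,k->ij", S_conv, theta)

x = np.arange(n, dtype=float)
# Reference circular cross-correlation: y[i] = sum_k theta[k] * x[(i + k - 1) mod n]
y_ref = sum(theta[k] * np.roll(x, -(k - 1)) for k in range(omega))
print(np.allclose(W_conv @ x, y_ref))

# Fully connected case: pool of size n_out * n_in, every one-hot vector used exactly once.
n_out, n_in = 2, 3
theta_fc = np.arange(n_out * n_in, dtype=float)
S_fc = np.eye(n_out * n_in).reshape(n_out, n_in, n_out * n_in)
W_fc = np.einsum("ijk,k->ij", S_fc, theta_fc)
print(np.array_equal(W_fc, theta_fc.reshape(n_out, n_in)))
```

Keeping only every second row of `W_conv` would give the operator of the same circular convolution with stride 2, in line with the remark on strides.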

III-E Discussion

Although we train $S$ and $\Theta$, the layer propagation is ultimately handled by their tensor product. That is, its output is determined by $W \cdot x$ where $W = \Theta \otimes S$. For the weight sharing to make sense, we must then not over-parameterize $S$ and $\Theta$ over $W$. If we call $l$ the number of non-zeros in $A$ and $\omega \times N \times M$ the shape of $\Theta$, then the former assumption requires $l\omega + \omega N M \leq l N M$, or equivalently $\frac{1}{\omega} \geq \frac{1}{NM} + \frac{1}{l}$. It implies that the number of weights per filter $\omega$ must be lower than the total number of filters $NM$ and than the number of edges $l$.
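One way to read this counting argument is sketched below. It assumes that the scheme carries one vector of pool size per non-zero entry of the adjacency matrix, that the weight kernel carries one pool of weights per pair of input and output channels, and that an unshared operator would need one free weight per edge and channel pair. The function name and the example sizes are hypothetical.

```python
def sharing_is_economical(l, omega, n_in, n_out):
    """Check that scheme plus kernel hold no more parameters than an unshared operator."""
    params_scheme = l * omega                  # one omega-vector per non-zero of the adjacency
    params_kernel = omega * n_in * n_out       # pool of omega weights per channel pair
    params_dense = l * n_in * n_out            # one free weight per edge and channel pair
    return params_scheme + params_kernel <= params_dense

print(sharing_is_economical(l=1000, omega=9, n_in=1, n_out=50))   # True: 9450 <= 50000
print(sharing_is_economical(l=1000, omega=60, n_in=1, n_out=1))   # False: 60060 > 1000
```

Under these assumptions, sharing only pays off when the pool is small compared with both the number of edges and the number of filters, which agrees with the conclusion stated in the paragraph above.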

Note that without the constraint that the support of $S$ must not exceed that of $A$ (or if the graph used is complete), the proposed formulation could also be applied to structure learning of the input feature space [35, 36]. That is, operations on $S$ along the third rank might be exploitable in some way, e.g. dropping connections during training [37] or discovering some sort of structural correlations. However, even if this can be done for toy image datasets, such an $S$ would not be sparse and would lead to memory issues in higher dimensions. We therefore leave these avenues outside the scope of this paper.

IV Experiments

IV-A Description

We are interested in comparing various receptive graph layers with convolutional ones. For this purpose, we use image datasets, but restrict the priors we use about the underlying structure.

We first present experiments on MNIST [38]. It contains 10 classes of gray-level images (28x28 pixels) with 60’000 examples for training and 10’000 for testing. We also run experiments on a scrambled version to hide the underlying structure, as done in previous work [39]. Then we present experiments on Cifar10 [40]. It contains 10 classes of RGB images (32x32 pixels) with 50’000 examples for training and 10’000 for testing.

Because receptive graph layers are wider than their convolutional counterparts (they carry additional parameters from $S$), experiments are done on shallow (but wide) networks for this introductory paper. Also note that they require more multiply operations than a convolution lowered to a matrix multiplication [41]. In practice, they took roughly 2 to 2.5 times longer.

IV-B Experiments with grid graphs on MNIST

Here we use models composed of a single receptive graph (or convolutional) layer made of 50 feature maps, without pooling, followed by a fully connected layer of 300 neurons, and terminated by a softmax layer of 10 neurons. Rectified Linear Units [42] are used for the activations and a dropout [43] of 0.5 is applied on the fully connected layer. Input layers are regularized with a weight penalty [44]. We optimize with ADAM [45] for up to 100 epochs and fine-tune (while $S$ is frozen) for up to 50 additional epochs.

We consider a grid graph that connects each pixel to itself and its 4 nearest neighbors (or fewer on the borders). We also use the square of this graph (pixels are connected to their 13 nearest neighbors, including themselves), the cube of this graph (25 nearest neighbors), up to 10 powers (211 nearest neighbors). Here we use one-hot-bit initialization. We test the model under two setups: either the ordering of the nodes is unknown, in which case the one-hot-bit vectors are distributed randomly and modified during training; or an ordering of the nodes is known, in which case the one-hot-bit vectors are distributed in a circulant fashion in the third rank of $S$, which is frozen in this state. We use the number of nearest neighbors as the dimension of the third rank of $S$. We also compare with a convolutional layer of size 5x5, thus containing as many weights as the cube of the grid graph. Table I summarizes the obtained results. For each entry, the first result is obtained when the ordering is unknown, and the second one, in parentheses, when it is known.

Conv5x5          Grid             Grid             Grid
(0.87%)          1.24% (1.21%)    1.02% (0.91%)    0.93% (0.91%)

Grid             Grid             Grid             Grid
0.90% (0.87%)    0.93% (0.80%)    1.00% (0.74%)    0.93% (0.84%)

Table I: Error rates on powers of the grid graphs on MNIST.
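The neighborhood sizes quoted for the first powers of the grid graph can be checked with a short computation. The sketch below builds the adjacency matrix of a small grid with self-loops and counts the non-zeros in the row of a central pixel for the first three powers; the 9x9 grid size and the helper names are assumptions chosen so that the neighborhoods are not truncated by the borders.

```python
import numpy as np

def grid_adjacency(h, w):
    """Adjacency of an h x w grid: each pixel linked to itself and its 4 nearest neighbors."""
    A = np.zeros((h * w, h * w), dtype=np.int64)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            A[i, i] = 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:      # fewer neighbors on the borders
                    A[i, rr * w + cc] = 1
    return A

def power_support(A, k):
    """Support (0/1 pattern) of the k-th power of A."""
    P = A.copy()
    for _ in range(k - 1):
        P = ((P @ A) > 0).astype(np.int64)
    return P

h = w = 9
A = grid_adjacency(h, w)
center = (h // 2) * w + (w // 2)                     # index of the central pixel
sizes = [int(power_support(A, k)[center].sum()) for k in (1, 2, 3)]
print(sizes)                                         # [5, 13, 25]
```

The cube of the grid graph gives a receptive field of 25 vertices, the same number of weights as a 5x5 kernel, although its shape is a diamond rather than a square.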

We observe that even without knowledge of the underlying Euclidean structure, receptive grid graph layers obtain performance comparable to that of convolutional ones, and when the ordering is known, they match convolutions. We also noticed that after training, even though the one-hot-bit vectors used for initialization had changed to floating point values, their most significant dimension was always the same. This suggests there is room to improve the initialization and the optimization.

In Figure 3, we plot the test error rate for various normalizations when using the square of the grid graph, as a function of the number of epochs of training. We observe that they have little influence on the performance and sometimes improve it a bit. Thus, we use them as optional hyperparameters.

[Plot: test error rate versus epoch, with curves for l2, l2 + Pos, None, Norm + Pos, and l2 + Pos + Norm]

Figure 3: Evolution of the test error rate when learning MNIST using the square of a grid graph and for various normalizations, as a function of the epoch of training. The legend reads: “l2” means normalization of weights is used, “Pos” means parameters in $S$ are forced to be positive, and “Norm” means that the norm of each vector in the third dimension of $S$ is forced to 1.

IV-C Experiments with covariance graphs on Scrambled MNIST

We use a thresholded covariance matrix obtained by using all the training examples. We choose the threshold so that the number of remaining edges corresponds to a certain density (a 5x5 convolution connects each pixel to at most 25 of the 784 pixels, i.e. a density of roughly 3%). We also infer a graph based on the $k$ nearest neighbors of the inverse of the values of this covariance matrix ($k$-NN). Neither of these two graphs uses any prior about the underlying structure of the signal. The pixels of the input images are shuffled and the same re-ordering of the pixels is used for every image. The dimension of the third rank of $S$ is fixed, and its weights are initialized uniformly at random [34]. The receptive graph layers are also compared with models obtained when replacing the first layer by a fully connected or convolutional one. The architecture is the same as in the previous section. Results are reported in Table II.

MLP      Conv5x5   Thresholded   k-NN
1.44%    1.39%     1.06%         0.96%

Table II: Error rates when topology is unknown on scrambled MNIST.

We observe that the receptive graph layers outperform the CNN and the MLP on scrambled MNIST. This is remarkable because it suggests that they have been able to exploit information about the underlying structure thanks to their graph.
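The two graph constructions used in this subsection can be sketched on synthetic data as follows. Several details are assumptions of this example rather than statements from the paper: covariances are compared in absolute value, self-loops are always kept, the k-NN graph is left directed, and "nearest" is read as "largest covariance" since the distance is the inverse of the covariance. The function names are hypothetical.

```python
import numpy as np

def covariance_strength(X):
    """Absolute feature covariance, with an infinite diagonal so self-loops are always kept."""
    C = np.abs(np.cov(X, rowvar=False))
    np.fill_diagonal(C, np.inf)
    return C

def thresholded_graph(X, density):
    """Keep the strongest covariances so that about `density` of all entries remain."""
    C = covariance_strength(X)
    n_keep = int(round(density * C.size))
    thresh = np.sort(C, axis=None)[-n_keep]
    return (C >= thresh).astype(float)

def knn_graph(X, k):
    """Link each feature to the k features with the largest covariance (itself included)."""
    C = covariance_strength(X)
    idx = np.argsort(-C, axis=1)[:, :k]
    A = np.zeros_like(C)
    np.put_along_axis(A, idx, 1.0, axis=1)
    return A

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 16))               # 200 synthetic examples with 16 features
perm = rng.permutation(X.shape[1])
X_scrambled = X[:, perm]                     # same re-ordering applied to every example

A_thr = thresholded_graph(X_scrambled, density=0.25)
A_knn = knn_graph(X_scrambled, k=5)
print(A_thr.mean(), A_knn.sum(axis=1))
```

Both constructions depend only on statistics between features, so scrambling the feature order simply relabels the vertices of the inferred graph.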

IV-D Experiments with shallow architectures on Cifar10

On Cifar10, we ran experiments on shallow CNN architectures and replaced convolutions by receptive graphs. We report results on a variant of AlexNet [3] using little distortion on the input, which we borrowed from a TensorFlow tutorial [46]. It is composed of two 5x5 convolutional layers of 64 feature maps, with max pooling and local response normalization, followed by two fully connected layers of 384 and 192 neurons. We compare two different graph supports: the one obtained by using the underlying graph of a regular 5x5 convolution, and the support of the square of the grid graph. Optimization is done with stochastic gradient descent over 375 epochs, where $S$ is frozen for the last 125. Circulant one-hot-bit initialization is used. These are weak classifiers for Cifar10 but they are enough to analyse the usefulness of the proposed layer. Exploring deeper architectures is left for further work. Experiments are run five times each. Means and standard deviations of accuracies are reported in Table III. “Pos” means parameters in $S$ are forced to be positive, “Norm” means that the norm of each vector in the third dimension of $S$ is forced to 1, “Both” means both constraints are applied, and “None” means none are used.

Support Learn None Pos Norm Both
Conv5x5 No / / /
Conv5x5 Yes
Grid Yes
Table III: Accuracies (in %) of shallow networks on CIFAR10.

The receptive graph layers are able to outperform the corresponding CNNs by a small amount in the tested configurations, opening the way for more complex architectures.

V Conclusion

We introduced a new class of layers for deep neural networks which consists in using the support of a graph operator and linearly distributing a pool of weights over the defined edges. The linear distribution is learned jointly with the pool of weights. Thanks to these structural dependencies, we showed it is possible to share weights in a fashion similar to Convolutional Neural Networks (CNNs).

We performed experiments on vision datasets where the receptive graph layer obtains performance similar to that of convolutional ones, even when the underlying image structure is hidden. We believe that with further work, the proposed layer could fully extend the performance of CNNs to many other domains described by a graph.

Future work will also include the exploration of more advanced graph inference techniques. One example is using gradient descent from the supervised task at hand [19]. We can also notice that in our case, this amounts to selecting receptive fields, which opens another avenue [47].

Acknowledgments

This work was funded in part by the CominLabs project Neural Communications, and by the ANRT (Agence Nationale de la Recherche et de la Technologie) through a CIFRE (Convention Industrielle de Formation par la REcherche).

References