Geometric Deep Learning Extension Library for PyTorch
We present Spline-based Convolutional Neural Networks (SplineCNNs), a variant of deep neural networks for irregularly structured and geometric input, e.g., graphs or meshes. Our main contribution is a novel convolution operator based on B-splines that makes the computation time independent of the kernel size, owing to the local support property of the B-spline basis functions. As a result, we obtain a generalization of the traditional CNN convolution operator by using continuous kernel functions parametrized by a fixed number of trainable weights. In contrast to related approaches that filter in the spectral domain, the proposed method aggregates features purely in the spatial domain. As a main advantage, SplineCNN allows end-to-end training of deep architectures using only the geometric structure as input, instead of handcrafted feature descriptors. For validation, we apply our method to tasks from the fields of image graph classification, shape correspondence and graph node classification, and show that it outperforms or is on par with state-of-the-art approaches while being significantly faster and having favorable properties like domain-independence.
Most achievements obtained by deep learning methods in recent years rely heavily on properties of the convolution operation in convolutional neural networks: local connectivity, weight sharing and shift invariance. Since those layers are defined on inputs with a grid-like structure, they are not trivially portable to non-Euclidean domains like discrete manifolds or (embedded) graphs. However, a large amount of data in practical tasks naturally comes in the form of such irregular structures, e.g., graph data or meshes. Transferring the high performance of traditional convolutional neural networks to this kind of data holds the potential for large improvements in several relevant tasks.
Recently, a set of methods brought together under the term geometric deep learning emerged, which aim to achieve this transfer by defining convolution operations for deep neural networks that can handle irregular input data. Existing work in this field can loosely be divided into two different subsets: the spectral and the spatial filtering approaches. The former is based on spectral graph theory, where eigenvalues of a graph’s Laplacian matrix are interpreted as frequencies of node signals. They are filtered in the spectral domain, analogously to Fourier domain filtering of traditional signals. The latter subset, the spatial approaches, performs convolution in local Euclidean neighborhoods w.r.t. local positional relations between points, represented for example as polar, spherical or Cartesian coordinates, as illustrated in Figure 1.
We present Spline-based Convolutional Neural Networks (SplineCNNs), a variant of deep neural networks for irregular structured data. The main contribution is a trainable, spatial, continuous convolution kernel that leverages properties of B-spline bases to efficiently filter geometric input of arbitrary dimensionality. We show that our method
can be applied on different kinds of irregular structured data, e.g., arbitrary (embedded) graphs and meshes,
uses spatial geometric relations of the input,
allows for end-to-end training without using handcrafted feature descriptors, and
improves on or matches the state of the art in geometric learning tasks.
In addition, we provide an efficient GPGPU algorithm and implementation that allows for fast training and inference computation.
The history of geometric deep learning began with attempts to generalize convolutional neural networks for graph inputs. A large number of successful approaches are based on spectral graph theory. Bruna et al. introduced convolution-like operators on spectral graphs, interpreting the eigenvectors of the Laplacian as a Fourier basis. As an extension, Henaff et al. suggest using spline interpolation for smoothing kernels in the spectral domain. Defferrard et al. approximate spectral filters with Chebyshev polynomials, providing a more efficient filtering algorithm, whose kernel size determines the range of aggregated local $K$-neighborhoods. This approach was further simplified by Kipf and Welling, who consider only the one-neighborhood for one filter application. A filter based on the Cayley transform was proposed as an alternative to the Chebyshev approximation by Levie et al. Together with a trainable zooming parameter, this results in a more stable and flexible spectral filter.
It should be noted that all these spectral approaches assume that information is only encoded in the connectivity, edge weights and node features of the input. While this is true for general graphs, it does not hold for embedded graphs or meshes, where additional information is given by relative positions of nodes, which we consider with our method.
A downside of many spectral approaches is the fact that they use domain-dependent Fourier bases, which restricts generalization to inputs with identical graph connectivity. Yi et al. tackle this problem by applying a spectral transformer network that synchronizes the spectral domains. Since our approach works directly in the spatial domain, it is not prone to this problem.
For the shape correspondence task on meshes, which we also analyze in this work, Litany et al.  present a siamese network using a soft error criterion based on geodesic distances between nodes. We compare our method against this specialized method.
The issue of not representing local positional relations can be tackled by using methods that extract representations for local Euclidean neighborhoods from discrete manifolds.
Based on the intrinsic shape descriptors of Kokkinos et al. , Masci et al.  present such a method for extraction of two-dimensional Euclidean neighborhoods from meshes and propose a convolution operation locally applied on these neighborhoods. Boscaini et al.  improve this approach by introducing a patch rotation method to align extracted patches based on the local principal curvatures of the input mesh.
Our convolution operator can but does not have to receive those local representations as inputs. Therefore, our approach is orthogonal to improvements in this field.
While the first continuous convolution kernels for graphs work in the spectral domain (e.g. [9, 6, 20]), spatial continuous convolution kernels for irregular structured data were introduced recently as a special case in the fields of neural message passing and self-attention mechanisms [8, 23, 18]. Furthermore, Monti et al. presented the MoNet framework for interpreting different kinds of inputs as directed graphs, which we build upon in our work. We show that our kernels achieve the same or better accuracy than the trainable Gaussian mixture model (GMM) kernels of MoNet, while being able to be trained directly on the geometric structure.
We define SplineCNNs as a class of deep neural networks that are built using a novel type of spline-based convolutional layer. This layer receives irregular structured data, which is mapped to a directed graph, as input. In the spatial convolutional layer, node features are aggregated using a trainable, continuous kernel function, which we define in this section.
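Similar to the work of Monti et al., we expect the input of our convolution operator to be a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{U})$ with $\mathcal{V} = \{1, \dots, N\}$ being the set of nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ the set of edges, and $\mathbf{U} \in [0, 1]^{N \times N \times d}$ containing $d$-dimensional pseudo-coordinates $\mathbf{u}(i, j) \in [0, 1]^d$ for each directed edge $(i, j) \in \mathcal{E}$. Note that $\mathbf{U}$ can be interpreted as an adjacency matrix with $d$-dimensional, normalized entries $\mathbf{u}(i, j) \in [0, 1]^d$ if $(i, j) \in \mathcal{E}$ and $\mathbf{0}$ otherwise. Also, $\mathbf{U}$ is usually sparse with $E = |\mathcal{E}| \ll N^2$ entries. For a node $i \in \mathcal{V}$ its neighborhood set is denoted by $\mathcal{N}(i)$.
Let $\mathbf{f}: \mathcal{V} \to \mathbb{R}^{M_{in}}$, with $\mathbf{f}(i) = (f_1(i), \dots, f_{M_{in}}(i))^\top$, denote a vector of $M_{in}$ input features for each node $i \in \mathcal{V}$. For each $1 \le l \le M_{in}$ we reference the set $\{f_l(i) \mid i \in \mathcal{V}\}$ as input feature map.
In addition to the input graph and node features, let $((N^m_{1,i})_{1 \le i \le k_1}, \dots, (N^m_{d,i})_{1 \le i \le k_d})$ denote $d$ open B-spline bases of degree $m$, based on uniform, i.e. equidistant, knot vectors (cf. Piegl et al.), with $\mathbf{k} = (k_1, \dots, k_d)$ defining our $d$-dimensional kernel size.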
Figure 3: The heights of the red dots are the trainable parameters for a single input feature map. They are multiplied by the elements of the B-spline tensor product basis before influencing the kernel value.
Our convolution operator aggregates node features in local neighborhoods weighted by a trainable, continuous kernel function. The node features $\mathbf{f}(i)$ represent features on an irregular geometric structure, whose spatial relations are locally defined by the pseudo-coordinates in $\mathbf{U}$. Therefore, when locally aggregating feature values in a node’s neighborhood, the content of $\mathbf{U}$ is used to determine how the features are aggregated and the content of $\mathbf{f}(i)$ defines what is aggregated. We argue that common inputs for geometric deep learning tasks can be mapped to this model while preserving relevant information:
For graphs, $\mathcal{V}$ and $\mathcal{E}$ are given and $\mathbf{U}$ can contain edge weights or, for example, features like the node degree of the target nodes.
For discrete manifolds, $\mathcal{V}$ contains points of the discrete manifold, $\mathcal{E}$ represents connectivity in local Euclidean neighborhoods and $\mathbf{U}$ can contain local relational information like polar, spherical or Cartesian coordinates of the target point with respect to the origin point for each edge.
We state no restriction on the values of $\mathbf{U}$, except that they lie in a fixed interval range. Therefore, meshes, for example, can be interpreted either as embedded three-dimensional graphs or as two-dimensional manifolds, using local Euclidean neighborhood representations such as those obtained by the work of Boscaini et al. Also, either polar/spherical coordinates or Cartesian coordinates can be used, as shown in Figure 2. Independent of the type of coordinates stored in $\mathbf{U}$, our trainable, continuous kernel function, which we define in the following section, maps each $\mathbf{u}(i, j)$ to a scalar that is used as a weight for feature aggregation.
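To make the mapping to pseudo-coordinates concrete, the following minimal sketch computes normalized Cartesian and polar pseudo-coordinates for a single edge of a two-dimensional embedded graph. The function names, the `max_offset` and `max_radius` scaling arguments and the choice of $[0, 1]$ as the fixed interval are illustrative assumptions, not part of the reference implementation.

```python
import math

def cartesian_pseudo(pos_i, pos_j, max_offset):
    """Map the offset pos_j - pos_i from [-max_offset, max_offset] to [0, 1] per axis."""
    return [0.5 + (b - a) / (2.0 * max_offset) for a, b in zip(pos_i, pos_j)]

def polar_pseudo(pos_i, pos_j, max_radius):
    """Return [rho, theta] with radius and angle both scaled to [0, 1] (2D only)."""
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    rho = math.hypot(dx, dy) / max_radius
    theta = (math.atan2(dy, dx) % (2.0 * math.pi)) / (2.0 * math.pi)
    return [rho, theta]

# Hypothetical edge from an origin node at (0, 0) to two different target nodes.
u_cart = cartesian_pseudo((0.0, 0.0), (1.0, -1.0), max_offset=1.0)
u_pol = polar_pseudo((0.0, 0.0), (0.0, 2.0), max_radius=2.0)
```

Either representation yields one vector per directed edge, which is all the kernel function needs as input.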
We begin with the definition of a continuous kernel function using B-spline bases, which is parametrized by a constant number of trainable control values. The local support property of B-spline basis functions, which states that basis functions evaluate to zero for all inputs outside of a known interval, proves to be advantageous for efficient computation and scalability.
Figure 3 visualizes the following kernel construction method for differing B-spline basis degree $m$. We introduce a trainable parameter $w_{\mathbf{p},l} \in \mathbb{R}$ for each element $\mathbf{p}$ from the Cartesian product $\mathcal{P} = (N^m_{1,i})_i \times \cdots \times (N^m_{d,i})_i$ of the B-spline bases and each of the $M_{in}$ input feature maps, indexed by $l$. With $K = \prod_{i=1}^{d} k_i$ denoting the number of elements of $\mathcal{P}$, this results in $M_{in} \cdot K$ trainable parameters.
We define our continuous convolution kernel as functions $g_l: [a_1, b_1] \times \cdots \times [a_d, b_d] \to \mathbb{R}$ with
$$g_l(\mathbf{u}) = \sum_{\mathbf{p} \in \mathcal{P}} w_{\mathbf{p},l} \cdot B_{\mathbf{p}}(\mathbf{u}), \qquad (1)$$
with $B_{\mathbf{p}}$ being the product of the basis functions in $\mathbf{p}$:
$$B_{\mathbf{p}}(\mathbf{u}) = \prod_{i=1}^{d} N^m_{i,p_i}(u_i). \qquad (2)$$
One way to interpret this kernel is to see the trainable parameters $w_{\mathbf{p},l}$ as control values for the height of a $(d+1)$-dimensional B-spline surface, from which a weight is sampled for each neighboring point $j$, depending on $\mathbf{u}(i, j)$. However, in contrast to traditional $(d+1)$-dimensional B-spline approximation, we only have one-dimensional control points and approximate functions $g_l: [a_1, b_1] \times \cdots \times [a_d, b_d] \to \mathbb{R}$ instead of curves. The definition range of $g_l$ is the interval in which the partition of unity property of the B-spline bases holds. Therefore, $a_i$ and $b_i$ depend on B-spline degree $m$ and kernel size $(k_1, \dots, k_d)$. We scale the spatial relation vectors $\mathbf{u}(i, j)$ to exactly match this interval, cf. Figure 3.
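The kernel construction can be illustrated with a small sketch that evaluates the weighted sum of tensor-product basis values for one input feature map. It assumes B-spline degree 1 (hat functions), at least two control points per dimension and pseudo-coordinates already scaled to $[0, 1]$; the names `linear_basis` and `kernel_value` and the toy weights are hypothetical.

```python
from itertools import product

def linear_basis(u, k):
    """Values of the k degree-1 B-spline (hat) functions on [0, 1] at scalar u."""
    x = u * (k - 1)
    return [max(0.0, 1.0 - abs(x - i)) for i in range(k)]

def kernel_value(u, weights, kernel_size):
    """Sum over all index vectors p of w_p * B_p(u), B_p being a product of 1D values."""
    bases = [linear_basis(u_i, k_i) for u_i, k_i in zip(u, kernel_size)]
    total = 0.0
    for p in product(*(range(k) for k in kernel_size)):
        b_p = 1.0
        for dim, idx in enumerate(p):
            b_p *= bases[dim][idx]
        total += weights[p] * b_p
    return total

kernel_size = (3, 3)
# Toy control values: one trainable scalar per element of the Cartesian product.
weights = {p: float(p[0] + 10 * p[1]) for p in product(range(3), range(3))}
g_knot = kernel_value((0.5, 1.0), weights, kernel_size)  # lies exactly on control point (1, 2)
g_mid = kernel_value((0.25, 0.0), weights, kernel_size)  # halfway between (0, 0) and (1, 0)
```

This naive version loops over all control values; it is meant only to show what is being computed, not how to compute it efficiently.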
Given our kernel functions $\mathbf{g} = (g_1, \dots, g_{M_{in}})$ and input node features $\mathbf{f}$, we define our spatial convolution operator for a node $i$ as
$$(\mathbf{f} \star \mathbf{g})(i) = \frac{1}{|\mathcal{N}(i)|} \sum_{l=1}^{M_{in}} \sum_{j \in \mathcal{N}(i)} f_l(j) \cdot g_l(\mathbf{u}(i, j)). \qquad (3)$$
Similar to traditional CNNs, the convolution operator can be used as a module in a deep neural network architecture, which we do in our SplineCNNs. To this end, the operator is applied $M_{out}$ times on the same input data with different trainable parameters, to obtain a convolutional layer that produces $M_{out}$ output feature maps. It should be highlighted that, in contrast to self-attention methods, we train an individual set of weights for each combination of input and output feature map.
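As a minimal sketch of the operator for a single output feature map, the following code aggregates neighbor features weighted by the kernel value and divides by the neighborhood size. It assumes one-dimensional pseudo-coordinates in $[0, 1]$ and B-spline degree 1; the data layout (`neighbors`, `pseudo`, `weights`) is a hypothetical choice made for readability.

```python
def hat_basis(u, k):
    """Degree-1 B-spline basis values at scalar u in [0, 1] for k control points."""
    x = u * (k - 1)
    return [max(0.0, 1.0 - abs(x - i)) for i in range(k)]

def spline_conv(features, neighbors, pseudo, weights):
    """features[j][l]: feature l of node j; neighbors[i]: list of neighbor ids;
    pseudo[(i, j)]: scalar pseudo-coordinate; weights[l][p]: control value p of map l."""
    out = []
    for i, nbrs in enumerate(neighbors):
        acc = 0.0
        for j in nbrs:
            for l, w_l in enumerate(weights):
                basis = hat_basis(pseudo[(i, j)], len(w_l))
                g = sum(w * b for w, b in zip(w_l, basis))  # kernel value for this edge
                acc += features[j][l] * g
        out.append(acc / len(nbrs) if nbrs else 0.0)        # neighborhood normalization
    return out

features = [[1.0], [2.0], [4.0]]
neighbors = [[1, 2], [0], [0, 1]]
pseudo = {(0, 1): 0.0, (0, 2): 1.0, (1, 0): 0.5, (2, 0): 0.25, (2, 1): 1.0}
weights = [[1.0, 0.0, -1.0]]
out = spline_conv(features, neighbors, pseudo, weights)
```

A full layer would repeat this with a separate set of control values for every output feature map.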
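Due to the local support property of B-splines, $B_{\mathbf{p}} \neq 0$ only holds true for $s := (m+1)^d$ of the $K = \prod_{i=1}^{d} k_i$ different vectors $\mathbf{p} \in \mathcal{P}$. Therefore, $g_l(\mathbf{u})$ only depends on $M_{in} \cdot s$ of the $M_{in} \cdot K$ trainable parameters for each neighbor $j$, where $d$, $s$ and $m$ are constant and usually small. In addition, for each pair of nodes $(i, j) \in \mathcal{E}$, the vectors $\mathbf{p} \in \mathcal{P}$ with $B_{\mathbf{p}} \neq 0$, which we denote as $\mathcal{P}(\mathbf{u}(i, j))$, can be found in constant time, given constant $m$ and $d$.
This allows for an alternative representation of the inner sums of our convolution operation, cf. Equation 3, as
$$(f_l \star g_l)(i) = \sum_{j \in \mathcal{N}(i)} \; \sum_{\mathbf{p} \in \mathcal{P}(\mathbf{u}(i, j))} f_l(j) \cdot w_{\mathbf{p},l} \cdot B_{\mathbf{p}}(\mathbf{u}(i, j)), \qquad (4)$$
and $K$ can be replaced by $s$ in the time complexity of the operation. Also, $B_{\mathbf{p}}(\mathbf{u}(i, j))$ does not depend on $l$ and can therefore be computed once for all input features. Figure 4 shows a scheme of the computation. The gradient flow for the backward pass can also be derived by following the solid arrows backwards.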
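The constant-time lookup of the nonzero basis products can be sketched as follows for B-spline degree 1, where exactly two basis functions per dimension can be nonzero. The function name `nonzero_basis` is hypothetical, and the sketch assumes pseudo-coordinates in $[0, 1]$ and at least two control points per dimension.

```python
from itertools import product
import math

def nonzero_basis(u, kernel_size):
    """Return [(p, B_p(u))] for the 2**d index vectors p whose basis product can be
    nonzero, assuming degree 1 and open uniform knots on [0, 1]."""
    per_dim = []
    for u_i, k_i in zip(u, kernel_size):
        x = u_i * (k_i - 1)
        lo = min(int(math.floor(x)), k_i - 2)  # clamp so that lo + 1 stays a valid index
        frac = x - lo
        per_dim.append([(lo, 1.0 - frac), (lo + 1, frac)])
    result = []
    for combo in product(*per_dim):
        p = tuple(idx for idx, _ in combo)
        b = 1.0
        for _, value in combo:
            b *= value
        result.append((p, b))
    return result

# The cost depends on d and the degree only, not on the kernel size (5, 3).
entries = nonzero_basis((0.25, 1.0), (5, 3))
```

Because the returned list never depends on the feature index, it can be computed once per edge and reused for all input features.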
Depending on the type of coordinate in vectors $\mathbf{u}$, we use closed B-spline approximation in some dimensions. One frequently occurring example of such a situation is when $\mathbf{u}$ contains angle attributes of polar coordinates. Using closed B-spline approximation in the angle dimension naturally enforces the angle $0$ to be evaluated to the same weight as the angle $2\pi$ or, for higher $m$, the kernel function to be continuously differentiable at those points.
The proposed kernels can easily be modified so that they use closed approximation in an arbitrary subset of the $d$ dimensions, by mapping different $\mathbf{p} \in \mathcal{P}$ to the same trainable control value $w_{\mathbf{p},l}$. This leads to a reduction of trainable parameters and B-spline basis functions. Referring to Figure 3, this approach can be interpreted as periodic repetition of the function surface along the corresponding axis.
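The effect of a closed approximation can be sketched in one dimension by wrapping the control-value index, so that the last interval interpolates back to the first control value. This is a simplified, assumed parametrization for degree 1 (the text above describes the same idea as sharing control values between index vectors); `closed_linear_weight` is a hypothetical name.

```python
import math

def closed_linear_weight(u, control):
    """Evaluate a degree-1 closed (periodic) spline on [0, 1] with len(control) values."""
    k = len(control)
    x = (u % 1.0) * k                 # k intervals instead of k - 1: the ends are joined
    lo = int(math.floor(x)) % k
    frac = x - math.floor(x)
    return (1.0 - frac) * control[lo] + frac * control[(lo + 1) % k]

control = [1.0, 3.0, 5.0, 7.0]
w_start = closed_linear_weight(0.0, control)
w_end = closed_linear_weight(1.0, control)      # scaled angle of a full turn
w_wrap = closed_linear_weight(0.875, control)   # halfway between control[3] and control[0]
```

With an angle scaled to $[0, 1]$, the start and end of the interval therefore receive the same weight.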
Up to now, we did not consider the node $i$ of neighborhood $\mathcal{N}(i)$ in our convolution operator. It is not aggregated together with all $j \in \mathcal{N}(i)$, as would be the case in traditional CNNs. If Cartesian coordinates are used, we can simply define $\mathcal{N}(i)$ to include $i$. However, when using polar/spherical pseudo-coordinates, problems arise since the point with zero radius is not well defined. Therefore, we introduce an additional trainable parameter for each feature of the root node and add the product of this parameter and the corresponding feature to the result.
Except for a normalization factor, our spline-based convolution operator is a generalization of the traditional convolutional layer in CNNs with odd filter size in each dimension. For example, if we assume the input to be a two-dimensional grid-graph with diagonal, horizontal and vertical edges, B-spline degree $m = 1$, kernel size $(3, 3)$, and the vectors $\mathbf{u}$ to contain Cartesian relations between adjacent nodes, then our convolution operator is equivalent to a discrete convolution of an image with a kernel of size $3 \times 3$. This also holds for larger discrete kernels if the neighborhoods of the grid-graph are modified accordingly.
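The grid-graph case can be checked numerically with a short sketch: for degree 1 and three control points per axis, every neighbor offset falls exactly on a control point, so the sampled kernel weight equals the corresponding filter tap. The toy `image` and `W` values are made up, the normalization factor is omitted, and the center pixel is left out because it would be handled by the separate root-node parameter.

```python
def hat(u, k):
    """Degree-1 B-spline basis values at scalar u in [0, 1] for k control points."""
    x = u * (k - 1)
    return [max(0.0, 1.0 - abs(x - i)) for i in range(k)]

def kernel(u, W):
    """Tensor-product kernel with 3 x 3 control values W."""
    bx, by = hat(u[0], 3), hat(u[1], 3)
    return sum(W[a][b] * bx[a] * by[b] for a in range(3) for b in range(3))

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
image = [[0.5, 1.0, 1.5], [2.0, 2.5, 3.0], [3.5, 4.0, 4.5]]
offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

# Spline aggregation at the center pixel (1, 1), offsets scaled from [-1, 1] to [0, 1].
spline_sum = sum(image[1 + dx][1 + dy] * kernel(((dx + 1) / 2, (dy + 1) / 2), W)
                 for dx, dy in offsets)
# Discrete 3 x 3 cross-correlation at the same pixel, leaving out the center tap.
grid_sum = sum(image[1 + dx][1 + dy] * W[dx + 1][dy + 1] for dx, dy in offsets)
```

Both sums agree, which is the equivalence stated above for the eight-neighborhood.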
For the spline-based convolutional layer defined in the last section, we introduce a GPU algorithm which allows efficient training and inference with SplineCNNs. For simplicity, we use a tensor indexing notation with, e.g., $\mathbf{A}[x, y, z]$ describing the element at position $(x, y, z)$ of a tensor $\mathbf{A}$ with rank three. The forward operation of our convolution operator is outlined in Algorithm 1.
We achieve parallelization over the edges $\mathcal{E}$ by first gathering edge-wise input features $\mathbf{F}^E_{in} \in \mathbb{R}^{E \times M_{in}}$ from the input matrix $\mathbf{F}_{in} \in \mathbb{R}^{N \times M_{in}}$, using the target node of each edge as index. Then, we compute edge-wise output features $\mathbf{F}^E_{out} \in \mathbb{R}^{E \times M_{out}}$, as shown in Figure 4, before scatter-adding them back to node-wise features $\mathbf{F}_{out} \in \mathbb{R}^{N \times M_{out}}$, performing the actual neighborhood aggregation. Our algorithm has a parallel time complexity of $\mathcal{O}(s \cdot M_{in})$, with small $s$, using $\mathcal{O}(E \cdot M_{out})$ processors, assuming that scatter-add is a parallel operation with constant time complexity.
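The gather, edge-wise compute and scatter-add steps can be sketched sequentially as below. This is not Algorithm 1 itself: the loop over edges stands in for the parallel execution, and the argument names (`basis_idx`, `basis_val`, `weight`) and the nested-list layout are assumptions made for illustration.

```python
def spline_conv_forward(f_in, edges, basis_idx, basis_val, weight, num_nodes):
    """f_in[n][l]: node features; edges[e] = (i, j) with j the target node;
    basis_idx[e] / basis_val[e]: the s parameter indices and basis products of edge e;
    weight[p][l][o]: trainable value for parameter p, input map l, output map o."""
    m_out = len(weight[0][0])
    f_out = [[0.0] * m_out for _ in range(num_nodes)]
    degree = [0] * num_nodes
    for e, (i, j) in enumerate(edges):              # independent per edge
        x = f_in[j]                                 # gather features of the target node
        edge_out = [0.0] * m_out
        for p, b in zip(basis_idx[e], basis_val[e]):   # only s terms, not all K
            for l, x_l in enumerate(x):
                for o in range(m_out):
                    edge_out[o] += x_l * weight[p][l][o] * b
        for o in range(m_out):
            f_out[i][o] += edge_out[o]              # scatter-add into node i
        degree[i] += 1
    return [[v / max(d, 1) for v in row] for row, d in zip(f_out, degree)]

f_in = [[3.0], [5.0]]
edges = [(0, 1), (1, 0)]
basis_idx = [[0, 1], [1, 2]]
basis_val = [[0.5, 0.5], [1.0, 0.0]]
weight = [[[1.0, 0.0]], [[0.0, 1.0]], [[2.0, 2.0]]]
f_out = spline_conv_forward(f_in, edges, basis_idx, basis_val, weight, num_nodes=2)
```

On a GPU, the per-edge work would typically map to threads and the accumulation into node rows to an atomic add.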
We achieve independence from the number of trainable weights by computing matrices $\mathbf{P} \in \mathbb{N}^{E \times s}$ and $\mathbf{B} \in \mathbb{R}^{E \times s}$. $\mathbf{P}$ contains the indices of parameters with $B_{\mathbf{p}} \neq 0$ while $\mathbf{B}$ contains the basis products for these parameters. $\mathbf{B}$ and $\mathbf{P}$ can be preprocessed for a given graph structure or can be computed directly in the kernel. For the GPU evaluation of the basis functions required for $\mathbf{B}$ we use explicit low-degree polynomial formulations of those functions for each $m$. For further details we refer to our PyTorch implementation, which is available on GitHub.
For batch learning, parallelization over a mini-batch can be achieved by creating sparse block diagonal matrices out of all $\mathbf{U}$ of one batch and concatenating matrices $\mathbf{F}_{in}$ in the node dimension. For matrices $\mathbf{F}^E_{in}$, $\mathbf{B}$ and $\mathbf{P}$, this results in example-wise concatenation in the edge dimension. Note that this composition allows differing numbers of nodes and edges over examples in one batch without introducing redundant computational overhead.
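Mini-batch composition can be sketched by concatenating node features and shifting edge indices by the number of nodes seen so far, which corresponds to a block diagonal adjacency structure in edge-list form. The dictionary keys and the extra `batch` vector (recording which example each node belongs to) are assumptions for illustration.

```python
def collate(graphs):
    """graphs: list of dicts with 'x' (one feature row per node), 'edges' (list of
    (i, j) pairs) and 'u' (one pseudo-coordinate entry per edge)."""
    x, edges, u, batch = [], [], [], []
    offset = 0
    for g_id, g in enumerate(graphs):
        x.extend(g["x"])                                          # node dimension
        edges.extend((i + offset, j + offset) for i, j in g["edges"])
        u.extend(g["u"])                                          # edge dimension
        batch.extend([g_id] * len(g["x"]))
        offset += len(g["x"])
    return x, edges, u, batch

g1 = {"x": [[1.0], [2.0]], "edges": [(0, 1)], "u": [[0.5]]}
g2 = {"x": [[3.0], [4.0], [5.0]], "edges": [(0, 2), (2, 1)], "u": [[0.1], [0.9]]}
x, edges, u, batch = collate([g1, g2])
```

No padding is needed, so examples with different numbers of nodes and edges add no redundant work.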
We perform experiments with different SplineCNN architectures on three distinct tasks from the fields of image graph classification (Section 5.1), graph node classification (Section 5.2) and shape correspondence on meshes (Section 5.3). For each of the tasks, we create a SplineCNN using the spline-based convolution operator, which we denote as SConv($\mathbf{k}$, $M_{in}$, $M_{out}$) for a convolutional layer with kernel size $\mathbf{k}$, $M_{in}$ input feature maps and $M_{out}$ output feature maps. In addition, we denote fully connected layers as FC($o$), with $o$ as number of output neurons.
For validation on two-dimensional regular and irregular structured input data, we apply our method to the widely known MNIST dataset of 60,000 training and 10,000 test images containing grayscale, handwritten digits from 10 different classes. We conduct two different experiments on MNIST. For both experiments, we strictly follow the experimental setup of Defferrard et al. and Monti et al. [6, 18] to provide comparability. For the first experiment, the MNIST images are represented as a set of equal grid graphs, where each node corresponds to one pixel in the original image, resulting in grids of size $28 \times 28$ with $N = 784$ nodes. For the second experiment, the MNIST superpixel dataset of Monti et al. is used, where each image is represented as an embedded graph of 75 nodes defining the centroids of superpixels, cf. Figure 4(a), with each graph having different node positions and connectivities. This experiment is an ideal choice to validate the capabilities of our approach on irregular structured, image-based data.
The pooling operation is able to obtain a coarsened graph by deriving a clustering on the graph nodes, aggregating nodes in one cluster and computing new pseudo-coordinates for each of those new nodes. We denote a max-pooling layer using this algorithm with MaxP($c$), with $c$ being the cluster size (and approximate downscaling factor).
For the grid graph experiments, Cartesian coordinates and a B-spline basis degree of $m = 1$ are used to reach equivalence to the traditional convolution operator in CNNs, cf. Section 3.3. In contrast, we compare all configurations of $m$ and possible pseudo-coordinates against each other on the superpixel dataset.
For classification on the grid data, we make use of a LeNet5-like network architecture : SConv(,,) MaxP(4) SConv((),,) MaxP(4) FC() FC(). The initial learning rate was chosen as
and dropout probability as. Note that we used neighborhoods of size from the grid graph, to mirror the LeNet5 architecture with its filters.
The superpixel dataset is evaluated using the SplineCNN architecture SConv(,,) MaxP(4) SConv((),,) MaxP(4) AvgP FC() FC(), where AvgP denotes a layer that averages features in the node dimension. We use the Exponential Linear Unit (ELU) as non-linearity after each SConv layer and the first FC layer. For Cartesian coordinates, we choose the kernel size to be and for polar coordinates and
. Training was done for 20 epochs with a batch size of, initial learning rate and dropout probability . Both networks were trained for 30 epochs using the Adam method .
All results of the MNIST experiments are shown in Table 1 and Figure 4(b). The grid graph experiment results in approximately the same accuracy as LeNet5 and the MoNet method. For the superpixel dataset, we improve previous results by percentage points in accuracy. Since we are using a similar architecture and the same input data as MoNet, the better results are an indication that our operator is able to capture more relevant information in the structure of the input. This can be explained by the fact that, in contrast to the MoNet kernels, our kernel function has individual trainable weights for each combination of input and output feature maps, just like the filters in traditional CNNs.
Results for different configurations are shown in Figure 4(b). We only notice small differences in accuracy for varying $m$ and pseudo-coordinates. However, lower $m$ and Cartesian coordinates perform slightly better than the other configurations.
In addition, we visualized the learned kernels of the first SConv layers from the grid and superpixel experiments in Figure 6. It can be observed that edge detecting patterns are learned in both approaches, whether being trained on regular or irregular structured data.
Table 2: Graph node classification on the Cora dataset for different learning methods (ChebNet, GCN, CayleyNet and SplineCNN). The presented accuracy means and standard deviations are computed over 100 experiments, where for each experiment the network was trained for 200 epochs.
As a second experiment, we address the problem of graph node classification using the Cora citation graph. We validate that our method also performs strongly on datasets where no Euclidean relations are given. Cora consists of 2,708 nodes and 5,429 undirected unweighted edges, representing scientific publications and citation links respectively. Each document is represented individually by a 1,433-dimensional sparse binary bag-of-words feature vector and is assigned exactly one of 7 classes. Similar to the experimental setup in Levie et al., we split the dataset into 1,708 nodes for training and 500 nodes for testing, to simulate labeled and unlabeled information.
We use a SplineCNN similar to the network architecture introduced in [15, 12, 18]: SConv(,,) SConv(,,), with ELU activation after the first SConv layer and . For pseudo-coordinates, we choose the globally normalized degree of the target nodes , leading to filtering based on the number of cites of neighboring publications. Training was done using the Adam optimization method  for 200 epochs with learning rate , dropout probability and L2 regularization
. As loss function, the cross entropy between the network’s softmax output and a one-hot target distribution was used.
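The degree-based pseudo-coordinates used for this experiment can be sketched as follows. The sketch assumes that each undirected link is stored as two directed edges and that "globally normalized" means division by the maximum degree in the graph; both are assumptions, as is the function name `degree_pseudo`.

```python
def degree_pseudo(edges, num_nodes):
    """Return one scalar per directed edge (i, j): the degree of the target node j
    divided by the largest degree in the graph."""
    deg = [0] * num_nodes
    for i, _ in edges:
        deg[i] += 1            # equals the node degree when both directions are stored
    max_deg = max(deg)
    return [deg[j] / max_deg for _, j in edges]

# Hypothetical toy graph: node 0 is linked to nodes 1 and 2.
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]
u = degree_pseudo(edges, num_nodes=3)
```

Each edge then carries a one-dimensional pseudo-coordinate in $(0, 1]$, so the kernel is a function of how well connected the neighboring node is.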
Results of our and related methods are shown in Table 2 and report the mean classification accuracy averaged over 100 experiments. It can be seen that SplineCNNs improve the state-of-the-art in this experiment by approximately percentage points. We attribute this improvement to the filtering based on $\mathbf{U}$, which contains node degrees as additional information to learn more complex kernel functions. This indicates that SplineCNNs can be successfully applied to irregular but non-geometric data and that they are able to improve previous results in this domain.
As our last and largest experiment, we validate our method on a collection of three-dimensional meshes, solving the task of shape correspondence similar to [18, 2, 17, 16]. Shape correspondence refers to the task of labeling each node of a given shape with the corresponding node of a reference shape. We use the FAUST dataset, containing 10 scanned human shapes in 10 different poses, resulting in a total of 100 non-watertight meshes with 6,890 nodes each. The first 80 subjects in FAUST were used for training and the remaining 20 subjects for testing, following the dataset splits introduced in . Ground truth correspondence of FAUST meshes is given implicitly, as nodes are sorted in the exact same order for every example. Correspondence quality is measured according to the Princeton benchmark protocol, counting the percentage of derived correspondences that lie within a geodesic radius around the correct node.
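The evaluation criterion described above can be sketched as a simple counting function. The precomputed pairwise `geodesic` distance table and the toy values are assumptions; how geodesic distances are obtained on the reference mesh is outside the scope of the sketch.

```python
def correspondence_accuracy(pred, target, geodesic, radius):
    """Fraction of nodes whose predicted reference node lies within the given
    geodesic radius of the correct reference node."""
    hits = sum(1 for p, t in zip(pred, target) if geodesic[p][t] <= radius)
    return hits / len(pred)

# Hypothetical reference shape with three nodes on a line.
geodesic = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 1.0],
            [2.0, 1.0, 0.0]]
pred, target = [0, 2, 0], [0, 1, 2]
acc_exact = correspondence_accuracy(pred, target, geodesic, radius=0.0)
acc_near = correspondence_accuracy(pred, target, geodesic, radius=1.0)
```

Sweeping the radius from zero upward yields a curve of accuracy over geodesic error, with the value at radius zero being the fraction of exactly correct correspondences.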
In contrast to similar approaches, e.g. [18, 2, 17, 16], we go without handcrafted feature descriptors as inputs, like the local histogram of normal vectors known as SHOT descriptors, and force the network to learn from the geometry (i.e., the spatial relations encoded in $\mathbf{U}$) itself. Therefore, input features are trivially given by . Also, we validate our method on three-dimensional meshes as inputs instead of generating two-dimensional geodesic patches for each node. These simplifications reduce the computation time and memory consumption required to preprocess the data by a wide margin, making training and inference completely end-to-end and very efficient.
We apply a SplineCNN architecture with 6 convolutional layers: SConv(,,) SConv(,,) SConv(,,) Lin() Lin(), where Lin() denotes a convolutional layer to
output features per node. As non-linear activation function, ELU is used after each SConv and the first Lin layer. For Cartesian coordinates we choose the kernel size to beand for polar coordinates and . We evaluate our method on multiple choices of . Training was done for 100 epochs with a batch size of 1, initial learning rate and dropout probability , using the Adam optimizer  and cross entropy loss.
Obtained accuracies for different geodesic errors are plotted in Figure 7. The results for different SplineCNN parameters match the observations from before, where only small differences could be seen but Cartesian coordinates and small B-spline degrees seemed to be slightly better. Our SplineCNN outperforms all other approaches with of predictions on the test set having zero geodesic error. However, the global behavior over larger geodesic error bounds is slightly worse in comparison to FMNet. In Figure 6(c) it can be seen that most nodes are classified correctly but that the few false classifications have a high geodesic error. We attribute this difference to the varying loss formulations. While we train against a one-hot binary vector using the cross entropy loss, FMNet trains using a specialized soft error loss, which is a more geometrically meaningful criterion that punishes geodesically far-away predictions more strongly than predictions near the correct node. However, it is worth highlighting that, unlike all other approaches we compare against, we do not use SHOT descriptors as input features. Instead, we train only on the geometric structure of the meshes.
We report an average forward step runtime of seconds for a single FAUST example processed by the suggested SplineCNN architecture (, ) on a single NVIDIA GTX 1080 Ti. We train this network in approximately minutes. Regarding scalability, we are able to stack up to 160 SConv(,,) layers before running out of memory on the mentioned GPU, while the runtime scales linearly with the number of layers. However, for this task we do not observe significant improvement in accuracy when using deeper networks.
We introduced SplineCNN, a spline-based convolutional neural network with a novel trainable convolution operator, which learns on irregular structured, geometric input data. Our convolution filter operates in the spatial domain and aggregates local features, applying a trainable continuous kernel function parametrized by trainable B-spline control values. We showed that SplineCNN is able to improve state-of-the-art results in several benchmark tasks, including image graph classification, graph node classification and shape correspondence on meshes, while allowing very fast training and inference computation. To conclude, SplineCNN is the first architecture that allows deep end-to-end learning directly from geometric data while providing strong results. Because no preprocessing is required, data can be processed even faster.
In the future we plan to enhance SplineCNNs by concepts known from traditional CNNs, namely recurrent neurons for geometric, spatio-temporal data or dynamic graphs, and un-pooling layers to allow encoder-decoder or generative architectures.
This work has been supported by the German Research Association (DFG) within the Collaborative Research Center SFB 876, Providing Information by Resource-Constrained Analysis, projects B2 and A6. We also thank Pascal Libuschewski for proofreading and helpful advice.