Image Classification with Hierarchical Multigraph Networks

07/21/2019 ∙ by Boris Knyazev, et al.

Graph Convolutional Networks (GCNs) are a class of general models that can learn from graph structured data. Despite being general, GCNs are admittedly inferior to convolutional neural networks (CNNs) when applied to vision tasks, mainly due to the lack of domain knowledge that is hardcoded into CNNs, such as spatially oriented translation invariant filters. However, a great advantage of GCNs is the ability to work on irregular inputs, such as superpixels of images. This could significantly reduce the computational cost of image reasoning tasks. Another key advantage inherent to GCNs is the natural ability to model multirelational data. Building upon these two promising properties, in this work, we show best practices for designing GCNs for image classification, in some cases even outperforming CNNs on the MNIST, CIFAR-10 and PASCAL image datasets.


1 Introduction

In image recognition, input data fed to models tend to be high dimensional. Even for tiny MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] images, the input is 784-dimensional, and for larger PASCAL [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] images it explodes to around 150,000 dimensions (Figure 1). Learning from such a high-dimensional input is challenging and requires a lot of labelled data and regularization. Convolutional Neural Networks (CNNs) successfully address these challenges by exploiting the properties of shift-invariance, locality and compositionality of images [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst]. We consider an alternative approach and instead reduce the input dimensionality. One simple way to achieve that is downsampling, but then we may lose vital structural information, so to better preserve it, we extract superpixels [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk, Liang et al.(2016)Liang, Shen, Feng, Lin, and Yan]. Representing a set of superpixels such that CNNs could digest and learn from them is non-trivial. To that end, we adopt a strategy from chemistry, physics and social networks, where structured data are expressed by graphs [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst, Hamilton et al.(2017)Hamilton, Ying, and Leskovec, Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.].
By defining operations on graphs analogous to spectral [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun] or spatial [Kipf and Welling(2017)] convolution, Graph Convolutional Networks (GCNs) extend CNNs to graph-based data, and show successful applications in graph/node classification [Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio, Simonovsky and Komodakis(2017), Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Fey et al.(2018)Fey, Lenssen, Weichert, and Müller] and link prediction [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling].

The challenge of generalizing convolution to graphs is to have anisotropic filters (such as edge detectors). Anisotropic models, such as MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and SplineCNN [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], rely on coordinate structure, work well for various vision tasks, but are often too computationally expensive and suboptimal for graph problems, in which the coordinates are not well defined [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor]. While these and other general models exist [Gilmer et al.(2017)Gilmer, Schoenholz, Riley, Vinyals, and Dahl, Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.], we rely on widely used graph convolutional networks (GCNs) [Kipf and Welling(2017)] and their multiscale extension, Chebyshev GCNs (ChebyNets) [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] that enjoy an explicit control of receptive field size.

Our focus is on multigraphs, those graphs that are permitted to have multiple edges, e.g. two objects can be connected by edges of different types [Lu et al.(2016)Lu, Krishna, Bernstein, and Fei-Fei]: (is_part_of, action, distance, etc.). Multigraphs enable us to model spatial and hierarchical structure inherent to images. By exploiting multiple relationships, we can capture global patterns in input graphs, which is hard to achieve with a single relation, because most GCNs aggregate information in a small local neighbourhood, and simply increasing its size, as in ChebyNet, can quickly make the representation too entangled (due to averaging over too many features). Hence, methods such as Deep Graph Infomax [Veličković et al.(2019)Veličković, Fedus, Hamilton, Liò, Bengio, and Hjelm] were proposed. Using multigraph networks is another approach to increase the receptive field size and disentangle the representation in a principled way, which we show to be promising.

In this work, we model images as sets of superpixels (Figure 1) and formulate image classification as a multigraph classification problem (Section 4). We first overview graph convolution (Section 2) and then adopt and extend relation type fusion methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] to improve the expressive power of our GCNs (Section 3). To improve classification accuracy, we represent an image as a multigraph and propose learnable (Section 4.1.1) and hierarchical (Section 4.1.2) relation types that we fuse to obtain a final rich representation of an image. On a number of experiments on the MNIST, CIFAR-10, and PASCAL image datasets, we evaluate our model and show a significant increase in accuracy, outperforming CNNs in some tasks (Section 5).

(a) 28×28px (b) 32×32px (c) full resolution
(d) N=75 (e) N=150 (f) N=1000
Figure 1: Examples of the original images (a-c), defined on a regular grid, and their superpixel representations (d-f) for MNIST (a,d), CIFAR-10 (b,e) and PASCAL (c,f); $N$ is the number of superpixels (nodes in our graphs). GCNs can learn both from images and superpixels due to their flexibility, whereas standard CNNs can learn only from images defined on a regular grid (a-c).

2 Graph Convolution

We consider undirected graphs with $N$ nodes and edges with values in the range $[0, 1]$, represented as an adjacency matrix $A \in \mathbb{R}^{N \times N}$. Nodes usually represent specific semantic concepts such as objects in images [Prabhu and Venkatesh Babu(2015)]. Nodes can also denote abstract blocks of information with common properties, such as superpixels in our case. Edges ($A_{ij}$) define the relationships between nodes and the scope over which node effects propagate.

Convolution is an essential computational block in graph networks, since it permits the gradual aggregation of information from neighbouring nodes. Following [Kipf and Welling(2017)], for some $C$-dimensional features $X \in \mathbb{R}^{N \times C}$ over $N$ nodes and trainable filters $\Theta \in \mathbb{R}^{C \times F}$, convolution on a graph can be defined as:

$$\bar{X}\Theta = \hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}X\Theta, \qquad (1)$$

where $\bar{X} = \hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}X$ are features averaged over one-hop ($K=1$) neighbourhoods, $\hat{A} = A + I$ is an adjacency matrix with self-loops, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$ is a diagonal matrix with node degrees, and $I$ is an identity matrix. We employ this convolution to build graph convolutional networks (GCNs) in our experiments. This formulation is a particular case of a more general approximate spectral graph convolution [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst], in which case $\bar{X}$ are multiscale features averaged over $K$-hop neighbourhoods. Multiscale (multihop) graph convolution is used in our experiments with ChebyNet.
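As a concrete illustration, the propagation rule of Eq. 1 can be sketched in a few lines of NumPy (this is our own sketch, not the authors' released code; the toy graph, feature sizes and random weights are made up):

```python
import numpy as np

def gcn_layer(A, X, theta):
    """One graph convolution: D^{-1/2} (A + I) D^{-1/2} X Theta (Eq. 1)."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                        # adjacency with self-loops
    d = A_hat.sum(axis=1)                        # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # D^{-1/2}
    X_bar = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X  # features averaged over 1-hop neighbourhoods
    return X_bar @ theta                         # apply trainable filters

# toy 3-node path graph, 2-dimensional features, 4 filters
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.random.randn(3, 2)
theta = np.random.randn(2, 4)
out = gcn_layer(A, X, theta)   # shape (3, 4): one F-dimensional vector per node
```

Stacking such layers (with nonlinearities in between) grows the receptive field by one hop per layer, which is why a single-relation GCN aggregates only very local information.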

Figure 2: An example of the “PC” relation type fusion based on a trainable projection (Eq. 3). We first project features onto a common multirelational space, where we fuse them using a fusion operator, such as summation or concatenation. In this work, relation types 1 and 2 can denote spatial and hierarchical (or learned) edges. We also allow for three or more relation types.

3 Multigraph Convolution

In graph convolution (Eq. 1), the graph adjacency matrix encodes a single ($R=1$) relation type between nodes. We extend Eq. 1 to a multigraph: a graph with multiple ($R>1$) edges (relations) between the same nodes, represented as a set of adjacency matrices $\{A^{(r)}\}_{r=1}^{R}$. Previous work used concatenation- or decomposition-based schemes [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling, Chen et al.(2018)Chen, Li, Fei-Fei, and Gupta] to fuse multiple relations. Instead, to capture more complex multirelational features, we adopt and extend recently proposed fusion methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].

Two of the methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] have certain limitations preventing us from adopting them directly in this work. In particular, the approach based on multidimensional Chebyshev polynomials is often infeasible to compute and was not shown to be superior in downstream tasks. In our experience, we also found the multiplicative fusion unstable to train. To that end, motivated by the success of multilayer projections in Graph Isomorphism Networks [Xu et al.(2019)Xu, Hu, Leskovec, and Jegelka], we propose two simple yet powerful fusion methods (Figure 2).

Given features $\bar{X}^{(r)} \in \mathbb{R}^{N \times C}$ corresponding to a relation type $r$, in the first approach we concatenate features for all $R$ relation types and then transform them using a two layer fully-connected network $f$ with $F$ hidden units and the ReLU activation:

$$\bar{X} = f\big(\big[\bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(R)}\big]\big), \qquad (2)$$

where the first and second layers of $f$ have $RCF$ and $F^2$ trainable parameters respectively, ignoring a bias. The second approach, illustrated in Figure 2, is similar to multiplicative/additive fusion [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor], but instead of multiplication/addition we use concatenation:

$$\bar{X} = \big[g_1(\bar{X}^{(1)}), g_2(\bar{X}^{(2)}), \dots, g_R(\bar{X}^{(R)})\big], \qquad (3)$$

where $r = 1, 2, \dots, R$ and $g_r$ is a single layer fully-connected network with $F$ output units followed by a nonlinearity, so that each $g_r$ and the full set $\{g_r\}$ have $CF$ and $RCF$ trainable parameters respectively, ignoring a bias. Hereafter, we denote the first approach as CP (concatenation followed by projection) and the second as PC (projection followed by concatenation).
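To make the two fusion operators concrete, here is a minimal NumPy sketch (our own illustration; the weight shapes and toy sizes are assumptions, and biases are omitted as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def fuse_cp(X_list, W1, W2):
    """CP fusion (Eq. 2): concatenate per-relation features, then a 2-layer projection."""
    Z = np.concatenate(X_list, axis=1)   # (N, R*C)
    return relu(Z @ W1) @ W2             # (N, F)

def fuse_pc(X_list, W_list):
    """PC fusion (Eq. 3): project each relation into a common space, then concatenate."""
    return np.concatenate([relu(X @ W) for X, W in zip(X_list, W_list)], axis=1)

N, C, F, R = 5, 3, 8, 2
X_list = [rng.standard_normal((N, C)) for _ in range(R)]           # one feature matrix per relation
cp = fuse_cp(X_list, rng.standard_normal((R * C, F)), rng.standard_normal((F, F)))
pc = fuse_pc(X_list, [rng.standard_normal((C, F)) for _ in range(R)])
# cp has shape (N, F); pc has shape (N, R*F)
```

Note the different output widths: CP mixes relations inside the projection, while PC keeps one projected block per relation and leaves the mixing to the next layer.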

4 Multigraph Convolutional Networks

Image classification was recently formulated as a graph classification problem in [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], who considered small-scale image classification problems such as MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner]. In this work, we present a model that scales to more complex and larger image datasets, such as PASCAL VOC 2012 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. We follow [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and compute SLIC [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] superpixels for each image and build a graph, in which each node corresponds to a superpixel and edges ($A_{ij}$) are computed based on the Euclidean distance between the coordinates $p_i$ and $p_j$ of their centres of mass, using a Gaussian with some fixed width $\sigma$:

$$A_{ij} = \exp\left(-\frac{\|p_i - p_j\|^2}{\sigma^2}\right). \qquad (4)$$
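A minimal sketch of this spatial adjacency (our own code; the `sigma` value and toy coordinates are made up for illustration):

```python
import numpy as np

def spatial_adjacency(coords, sigma=0.1):
    """Gaussian spatial edges A_ij = exp(-||p_i - p_j||^2 / sigma^2) from superpixel centroids (Eq. 4)."""
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 2) pairwise coordinate differences
    dist2 = (diff ** 2).sum(-1)                      # squared Euclidean distances
    A = np.exp(-dist2 / sigma ** 2)
    np.fill_diagonal(A, 0.0)                         # self-loops are added later, in Eq. 1
    return A

# three superpixel centroids in normalized [0, 1] coordinates
coords = np.array([[0.1, 0.1], [0.15, 0.1], [0.9, 0.9]])
A = spatial_adjacency(coords)
# nearby superpixels get edge weights near 1, distant ones near 0
```

The resulting matrix is symmetric and dense; Section 5.2 discusses sparsifying it with little loss of accuracy.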

A frequent assumption of current GCNs is that there is at most one edge between any pair of nodes in a graph. This restriction is usually implied by datasets with such structure, so that in many datasets, graphs are annotated with the single most important relation type. Meanwhile, data is often complex and nodes tend to have multiple relationships of different semantic, physical, or abstract meanings. Therefore, we argue that there could be other relationships captured by relaxing this restriction and allowing for multiple kinds of edges, beyond those hardcoded in the data (e.g. spatial in Eq. 4).

4.1 Learning Flat vs. Hierarchical Edges

Prior work, e.g. [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling, Bordes et al.(2013)Bordes, Usunier, Garcia-Duran, Weston, and Yakhnenko], proposed methods to learn from multiple edges, but similarly to the methods with a single edge type [Kipf and Welling(2017)], they leveraged only predefined edges in the data. We formulate a more flexible model, which, in addition to learning from an arbitrary number of relations between nodes (see Section 3), learns abstract edges jointly with a GCN.

4.1.1 Flat Learnable Edges

We combine ideas from [Henaff et al.(2015)Henaff, Bruna, and LeCun, Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio, Simonovsky and Komodakis(2017), Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.] and propose to learn a new edge from any node $i$ to node $j$ with coordinates $p_i$ and $p_j$ using a trainable similarity function:

$$A^{(r)}_{ij} = f_r(p_i, p_j), \qquad j \in \mathcal{N}(i), \qquad (5)$$

where $r$ indexes relation types; $R_{\text{learn}}$ is the number of learned relation types; and $\mathcal{N}(i)$ is a spatial neighbourhood of some radius around node $i$. Using a relatively small $\mathcal{N}(i)$ limits a model's predictive power, but is important for regularization and to meet computational requirements for larger images from which we extract many superpixels, such as in the PASCAL dataset. We also experimented with feeding node features, such as mean pixel intensities, to this function, but it did not give positive outcomes. Instead, we further regularize the model by constraining the input to be the absolute coordinate differences in lieu of raw coordinates and applying the softmax on top of the predictions:

$$A^{(r)}_{ij} = \underset{j \in \mathcal{N}(i)}{\mathrm{softmax}}\big(f(|p_i - p_j|)_r\big), \qquad (6)$$

where $f$ is a small two layer fully-connected network with $R_{\text{learn}}$ output units and $f(\cdot)_r$ denotes taking the $r$-th output. Using the absolute value makes the filters symmetric (Figure 5), but still sensitive to orientation, as opposed to the spatial edge defined in Eq. 4. The softmax is used to encourage sparse (more localized) edges.
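The edge-prediction step can be sketched as follows (our own illustration, not the paper's implementation; the network widths, the k-nearest-neighbour form of the neighbourhood and the toy sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def learned_edges(coords, W1, W2, k=2):
    """Predict R_learn adjacency matrices from |p_i - p_j| within a k-NN neighbourhood (Eq. 6)."""
    N = coords.shape[0]
    R = W2.shape[1]                                    # number of learned relation types
    A = np.zeros((R, N, N))
    d = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    for i in range(N):
        nbrs = np.argsort(d[i])[1:k + 1]               # k spatially nearest nodes (excluding i)
        h = np.maximum(np.abs(coords[nbrs] - coords[i]) @ W1, 0)  # 2-layer MLP on |Δp|
        logits = h @ W2                                # (k, R_learn)
        A[:, i, nbrs] = softmax(logits.T)              # softmax over the neighbourhood
    return A

coords = rng.random((6, 2))                            # toy superpixel centroids
A = learned_edges(coords, rng.standard_normal((2, 8)), rng.standard_normal((8, 3)), k=2)
```

Because the softmax normalizes over each node's neighbourhood, every predicted relation distributes a unit of edge mass over a few neighbours, which matches the sparsity argument above.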

Our approach to predict edges can be viewed as a particular case of generative models of graphs, such as [Simonovsky and Komodakis(2018)]. However, the objectives of our model and the latter are orthogonal. Generative models typically make probabilistic predictions to induce diversity of generated graphs, and generation of each edge, node and their attributes should respect other (already generated) edges and nodes in the graph, so that the entire graph becomes plausible. In our work, we aim to scale our model to larger graphs and assume locality of relationships in visual data, therefore we design a 1) simple deterministic function 2) that makes predictions depending only on local information.

(a) spatial ($A$) (b) spatial ($A^2$) (c) hierarchy ($A_{\text{hier}}$) (d) hierarchy ($A^2_{\text{hier}}$)
Figure 3: (top) We compute superpixels at several scales and combine all of them into a single set. (bottom) We then build a graph, where each node corresponds to a superpixel from this set and has features, such as mean RGB color and coordinates of the centres of mass. Using Eq. 4 and 7, we compute spatial (a) and hierarchical (c) edges. Nodes 0 to 300 correspond to the first level of the hierarchy (first scale of superpixels), nodes 300 to 400 correspond to the second level, and so forth. Notice that spatial edges (a) are created both within and between levels, while hierarchical (c) edges exist only between hierarchical levels. (c, d) Powers of the adjacency matrices used in a multiscale ChebyNet allow information to diffuse over the graph, making it possible to learn filters with more global support.

4.1.2 Hierarchical Edges

Depending on the number of SLIC superpixels, we can build different levels of image abstraction. Low levels of abstraction maintain fine-grained details, whereas higher levels mainly preserve global structure. To leverage useful information from multiple levels, we first compute superpixels at several different scales. Then, spatial and learnable relations are computed using Eq. 4 and 6, treating nodes at different scales as a joint set of superpixels. This creates a graph of multiscale image representation (Figure 3 (a,b)). However, this representation is still flat and, thus, limited.

To alleviate this, we introduce a novel hierarchical graph model, where child-parent relations are based on intersection over union (IoU) between superpixels $S_i$ and $S_j$ at different scales:

$$A^{\text{hier}}_{ij} = \mathrm{IoU}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \qquad (7)$$

where $A^{\text{hier}}$ is an adjacency matrix of hierarchical edges and $A^{\text{hier}}_{ij} = 0$ for nodes at the same scale (Figure 3 (c,d)). Using IoU means that child nodes can have multiple parents. To guarantee a single parent, hierarchical superpixel algorithms can be considered [Arbelaez et al.(2010)Arbelaez, Maire, Fowlkes, and Malik].
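A small sketch of the IoU-based hierarchical edges (our own code; the superpixel label maps here are synthetic stand-ins for the multi-scale SLIC output):

```python
import numpy as np

def hierarchical_edges(labels_fine, labels_coarse):
    """A_hier[i, j] = IoU of superpixel i (fine scale) with superpixel j (coarse scale), as in Eq. 7."""
    nf = labels_fine.max() + 1
    nc = labels_coarse.max() + 1
    A = np.zeros((nf, nc))
    for i in range(nf):
        mask_i = labels_fine == i
        for j in range(nc):
            mask_j = labels_coarse == j
            inter = np.logical_and(mask_i, mask_j).sum()
            union = np.logical_or(mask_i, mask_j).sum()
            A[i, j] = inter / union if union else 0.0
    return A

# toy 4x4 "image": 4 fine superpixels (quadrants) vs. 2 coarse superpixels (halves)
fine = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 2, 0), 2, 1)
coarse = np.repeat([[0, 1]], 4, 0).repeat(2, 1)
A = hierarchical_edges(fine, coarse)   # each fine superpixel overlaps exactly one coarse one
```

In the toy example, each quadrant intersects only one half of the image, so each row of the resulting matrix has a single non-zero entry; with irregular SLIC boundaries a child can overlap several parents, which is exactly the multi-parent behaviour noted above.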

4.2 Layer vs. Global pooling

Inspired by convolutional networks, previous works [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun, Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Simonovsky and Komodakis(2017), Fey et al.(2018)Fey, Lenssen, Weichert, and Müller, Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec] built an analogy of pooling layers in graphs, for example, using the Graclus clustering algorithm [Dhillon et al.(2007)Dhillon, Guan, and Kulis]. In CNNs, pooling is an effective way to reduce memory and computation, particularly for large inputs. It also provides additional robustness to local deformations and leads to faster growth of receptive fields. However, a convolutional network without any pooling layers can achieve similar performance on a downstream task [Springenberg et al.(2014)Springenberg, Dosovitskiy, Brox, and Riedmiller]; it will just be relatively slow, since pooling, which is extremely cheap on regular grids such as images, reduces the cost of subsequent layers. In graph classification tasks, the input dimensionality, which corresponds to the number of nodes $N$, is often very small (on the order of tens to hundreds of nodes), and the benefits of pooling are less clear. Graph pooling, such as in [Dhillon et al.(2007)Dhillon, Guan, and Kulis], is also computationally intensive, since we need to run the clustering algorithm for each graph independently, which limits the scale of problems we can address. Aiming to simplify the model while maintaining classification accuracy, we exclude pooling layers between graph convolutional layers and perform global maximum pooling over nodes following the last convolutional layer. This fixes the size of the penultimate feature vector irrespective of $N$ (Figure 4).
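The global pooling choice above amounts to a one-liner; this sketch (ours, with made-up sizes) shows how graphs with different numbers of nodes map to fixed-size embeddings:

```python
import numpy as np

def global_max_pool(X):
    """Pool over the node dimension: (N, F) node features -> (F,) graph embedding."""
    return X.max(axis=0)

# two graphs with different numbers of nodes produce same-size vectors,
# so a single fully-connected classifier can follow
g1 = np.random.randn(75, 512)    # e.g. a MNIST-75sp graph after the last conv layer
g2 = np.random.randn(103, 512)   # e.g. a multiscale graph with more nodes
e1 = global_max_pool(g1)
e2 = global_max_pool(g2)
assert e1.shape == e2.shape == (512,)
```

This is what makes the classifier head in Figure 4 independent of the superpixel count, without any clustering between layers.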

5 Experiments

We evaluate our model on three image-based graph classification datasets: MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], CIFAR-10 [Krizhevsky(2009)] and PASCAL Visual Object Classes 2012 (PASCAL) [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. For each dataset, there is a set of graphs with different numbers of nodes (superpixels), and each graph has a single categorical label that is to be predicted. For baselines, we use a single ($R=1$) spatial relation type defined using Eq. 4. For PASCAL, we predict multiple categorical labels per image and report mean average precision (mAP).

Figure 4: Image classification pipeline using our model. Each graph convolutional layer in our model takes the graph and returns a graph with the same nodes and edges. Node features become increasingly global after each subsequent layer as the receptive field increases, while edges are propagated without changes. As a result, after several graph convolutional layers, each node in the graph contains information about its immediate neighbours and a large neighbourhood around it. By pooling over nodes we summarize the information collected by each node. Fully-connected layers follow global pooling to perform classification.

MNIST consists of 70k greyscale 28×28px images of handwritten digits. CIFAR-10 has 60k coloured 32×32px images of 10 categories of animals and vehicles. PASCAL is a more challenging dataset with realistic high resolution images (typically around 500×300px) of 20 object categories (people, animals, vehicles, indoor objects). We use standard classification splits, training our model on 5,717 images and reporting results on 5,823 validation images. We note that CIFAR-10 and PASCAL have not been previously considered for graph-based image classification, and in this work we scale our method to these datasets. In fact, during experimentation we found some other graph convolutional methods unable to scale (see Section 5.3).

5.1 Architectural and Experimental Details

GCNs. In all experiments, we train GCNs, ChebyNets and MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] with three graph convolutional layers, having 32, 64 and 512 filters with the ReLU activation after each layer, followed by global max pooling and a fully-connected classification layer (Figure 4). For MNIST, we use a dropout rate [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] of 0.5 before the classification layer, while for CIFAR-10 and PASCAL we instead employ batch normalization (BN) [Ioffe and Szegedy(2015)] after each convolutional layer. For edge fusion, the projection $f$ in Eq. 2 is modelled by a two layer network with $F$ hidden units and the ReLU activation between layers. Similarly, the projections $g_r$ in Eq. 3 are single layer neural networks with $F$ output units followed by the activation. The edge prediction function $f$ in Eq. 6 is a two layer neural network with $R_{\text{learn}}$ output units, and the neighbourhood $\mathcal{N}(i)$ is set to the spatially nearest nodes.

ConvNets (CNNs). To allow fair comparison, we train CNNs with the same number of filters in convolutional layers as GCNs, with filters of size 3 for MNIST, 5 for CIFAR-10 and 7 for PASCAL, max pooling between layers and global average pooling after the last convolutional layer. Deeper and larger CNNs and GCNs can be trained to further improve results. Since images fed to CNNs are defined on a regular grid, adding coordinate features is uninformative, because these features are exactly the same for all examples in the dataset.

Low resolution ConvNets. GCNs take a relatively low resolution input compared to CNNs: 75 vs. 784 for MNIST, 150 vs. 1024 for CIFAR-10 and 1000 vs. 150,000 for PASCAL. The factors by which it is reduced are 10.5, 6.8 and 150 respectively. Therefore, direct comparison of GCNs to CNNs is unfair. To provide an alternative (yet not perfect) form of comparison, we design experiments with low resolution inputs fed to CNNs. In particular, to match the spatial dimensions of inputs to GCNs and CNNs, for MNIST we reduce the size to 9×9px, for CIFAR-10 to 12×12px and for PASCAL to 32×32px, taking into account that the average number of superpixels returned by the SLIC algorithm is often smaller than requested. Admittedly, downsampling using SLIC superpixels is more structural than bilinear downsampling, so GCNs receive a stronger signal, but we believe it is still an interesting experiment. Principal component analysis could be used as a more adequate way to implicitly reduce the input dimensionality for CNNs, but this method is infeasible for large images, such as in PASCAL, whereas superpixels can be easily computed in such cases. For comparison, we also report results on low resolution images using GCNs.

Training. We train all models using Adam [Kingma and Ba(2015)] with a fixed initial learning rate, weight decay, and a batch size of 32. For MNIST, the learning rate is decayed after 20 and 25 epochs and models are trained for 30 epochs. For CIFAR-10 and PASCAL, the learning rate is decayed after 35 and 45 epochs and models are trained for 50 epochs. We train models using four edge fusion methods: the concatenation-based baseline, additive fusion [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] and the two methods (CP and PC) introduced in Section 3 (see Table 1 and Figure 6 (a) for a summary).

In each case, we train a model 5 times with different random seeds and report average results and standard deviation. In Table 2, we report the best results over different fusion methods, superscripting the best fusion method.

Spatial edge | MNIST | CIFAR-10

Figure 5: Examples of predicted edges for MNIST and CIFAR-10 using the models with learned edges (Eq. 6). These predictions can be interpreted as filters centred at some node $i$, so that values along both axes denote distances from node $i$ to other nodes and the intensity denotes the strength of a connection with that node. As we can see, our models learn filters with variable intensity depending on the direction to better capture features in images. For comparison, a spatial edge (Eq. 4) is shown on the left, which has the same value across all directions, limiting the model's flexibility.

Graph formation for images. For MNIST, we compute a hierarchy of 75, 21, and 7 SLIC [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] superpixels and use mean superpixel intensity and coordinates as node features, so $N \le 103$. For CIFAR-10, we add one more level of 150 superpixels and also use coordinate features, so $N \le 253$. For PASCAL, due to its more challenging images, we further add two more levels of 1,000 and 300 superpixels, so $N \le 1{,}553$. Note, the numbers of superpixels provided above are upper bounds we impose on the SLIC algorithm. For low resolution experiments with GCNs, we build graphs from pixels and their coordinates, that is, the graph structure is the same across all examples in the dataset.
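As an illustrative sketch of the node-feature extraction (our own code; in practice the label maps come from SLIC at several scales, which we replace here with a synthetic labelling):

```python
import numpy as np

def superpixel_nodes(image, labels):
    """Node features for each superpixel: mean intensity (or colour) + normalized centroid."""
    n = labels.max() + 1
    feats = []
    for s in range(n):
        mask = labels == s
        ys, xs = np.nonzero(mask)
        colour = image[mask].mean(axis=0)          # mean intensity; for RGB this is a 3-vector
        centre = [ys.mean() / image.shape[0],      # centre of mass, normalized to [0, 1]
                  xs.mean() / image.shape[1]]
        feats.append(np.concatenate([np.atleast_1d(colour), centre]))
    return np.stack(feats)                          # (N, C) node feature matrix

img = np.arange(16.0).reshape(4, 4)                 # toy greyscale "image"
labels = np.repeat([[0, 1]], 4, 0).repeat(2, 1)     # two superpixels: left and right halves
X = superpixel_nodes(img, labels)                   # shape (2, 3): intensity + (y, x)
```

Stacking the feature matrices from all scales into one node set, together with the spatial (Eq. 4), learned (Eq. 6) and hierarchical (Eq. 7) adjacency matrices, yields the multigraph fed to the network.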

5.2 Results

The image classification results on the MNIST, CIFAR-10 and PASCAL datasets are presented in Table 2. We first observe that among GCN baselines, MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] show comparable performance, significantly outperforming GCN [Kipf and Welling(2017)], which can be explained by very local filters in the latter. Next, the hierarchical (H) and learned (L or L4, i.e. models with learnable edges) connections proposed in this work are shown to be complementary, substantially improving both GCNs and ChebyNets with a single spatial edge type. Additional qualitative and quantitative analysis is provided in Table 1 and Figures 5 and 6. By combining hierarchical and learned edge types (H-L, H-L4) and adding multiscale filters, we achieve a further gain in accuracy while keeping the number of trainable parameters low compared to ConvNets, MoNet and ChebyNets with large $K$. Importantly, our multirelational GCNs also show better results than ConvNets with a low resolution input.

In Table 2, for ChebyNet we report results using the best $K$ chosen from the range [2, 20] for each of MNIST, CIFAR-10 and PASCAL, both for the baseline ChebyNet and for H-L-ChebyNet. Among evaluated fusion methods, CP and additive (S) work best for MNIST, whereas PC and CP dominate for CIFAR-10 and PASCAL.

We also evaluate a baseline GCN on low resolution images to highlight the importance of SLIC superpixels compared to downsampled pixels (Table 2). Surprisingly, SLIC features provide only a moderate improvement compared to pixels, leading us to two conclusions. First, average intensity values of superpixels and their coordinates are rather weak features that carry limited geometry and texture information. Stronger features could be the boundaries of superpixels, but efficient and effective modeling of such information remains an open question. Second, we hypothesize that on full resolution images GCNs could potentially reach the performance of ConvNets or even outperform them, while having appealing properties, such as in [Khasanova and Frossard(2017)]. However, to scale GCNs to such large graphs, two approaches should be considered: 1) fast and effective pooling methods [Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec, Gao and Ji(2019)]; 2) sparsification of graphs, which we evaluated and compared to the complete graphs used in our experiments (Figure 6 (c)).

Fusion method | K=1 | K=2 | K=3
(Sp) Spatial edge (Eq. 1,4) | 86.36±0.95 | 97.08±0.11 | 98.17±0.02
(SpM) Spatial multiscale (Eq. 1,4) | 91.74±0.28 | 97.38±0.09 | 98.10±0.05
(H) Hier. edge (Eq. 1,7) | 93.65±0.07 | 96.94±0.07 | 97.07±0.07
(C) Concat [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] | 97.07±0.06 | 97.95±0.07 | 98.28±0.11
(S) Sum [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] | 97.93±0.05 | 98.30±0.01 | 98.44±0.10
(CP) C-proj (Eq. 2) | 97.77±0.08 | 98.31±0.09 | 98.52±0.09
(PC) Proj-c (Eq. 3) | 97.87±0.04 | 98.28±0.04 | 98.47±0.09
Max gain | 4.28 | 0.93 | 0.35
Table 1: Accuracy (%) gain achieved on MNIST-75sp (MNIST with 75 superpixels) by using both spatial and hierarchical edges ($R=2$) versus a single edge ($R=1$), depending on edge fusion methods, for different numbers of hops, $K$, of filters.
Figure 6: (a) Number of trainable parameters, # params, in a graph convolutional layer as a function of the number of relations, . Fusion methods based on trainable projections, including those proposed in our work, have “# params” comparable to the baseline concatenation method while being more powerful in terms of classification (see Tables 1 and 2). (b) Comparison of single edge types, where learned and hierarchical edges outperform spatial edges. (c) Effect of graph sparsification on classification accuracy. Numeric labels over markers denote sparsity of a graph (%). Notice little or no decrease in accuracy for sparse graphs, even if they have only 2-10% non-zero edges.
Model | R | MNIST | CIFAR-10 | PASCAL | # params
Input dimensionality (ConvNets, full res) | | 784 | 1024 | 150,000 |
Input dimensionality (ConvNets/GCNs, low res) | | 81 | 144 | 1024 |
Input dimensionality (GCNs) | | 75 | 150 | 1000 |
Pixels, baselines:
ConvNet [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], full res | | 99.42±0.01 | 83.77±0.21 | 41.42±0.51 | 320-360k
ConvNet [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], low res | | 97.09±0.10 | 72.68±0.40 | 32.69±0.43 | 320-330k
GCN [Kipf and Welling(2017)], low res | 1 | 81.36±0.41 | 50.57±0.14 | 18.57±0.10 | 42k
SLIC, baselines:
GCN [Kipf and Welling(2017)] | 1 | 86.29±0.44 | 51.51±0.55 | 19.24±0.06 | 42k
MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] | 1 | 96.64±2.01 | 72.62±0.57 | – | 880k
ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] | 1 | 98.24±0.03 | 68.92±0.23 | 32.38±0.38 | 80-350k
SLIC, ours:
L-GCN | 2 | 97.64±0.12 (1.05) | 70.14±0.29 (1.00) | 32.39±0.19 (0.08) | 100k
H-GCN | 2 | 97.93±0.05 (0.86) | 69.05±0.14 (0.60) | 32.18±0.29 (0.53) | 100k
H-L-GCN | 3 | 98.35±0.09 (0.63) | 71.44±0.30 (1.81) | 31.75±0.74 (0.54) | 145k
L4-GCN | 5 | 98.42±0.12 (0.10) | 72.67±0.36 (0.00) | 33.01±0.49 (0.38) | 185k
H-L4-GCN | 6 | 98.66±0.03 (0.22) | 72.89±0.70 (0.89) | 32.61±0.45 (2.18) | 220k
H-L-ChebyNet | 3 | 98.68±0.05 (0.18) | 73.18±0.52 (2.40) | 34.46±0.47 (1.46) | 200k
Table 2: Image classification results: accuracy for MNIST and CIFAR-10 and mAP for PASCAL (%). Superscripts denote the best fusion method among the four studied methods (see Table 1 for notation); "–" means unable to evaluate on our infrastructure due to high computational cost. # params is the total number of trainable parameters in a model (largest across rows). Values in parentheses show the absolute gain or loss in accuracy compared to the baseline concatenation fusion.

5.3 Discussion

Our method relies on Graph Convolutional Networks (GCNs) [Kipf and Welling(2017)] and their multiscale variant, ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst]. While ChebyNet is superior to GCN in the case of a single relation type due to a larger receptive field size, adding multiple relations makes both methods comparable. In fact, we show that adding hierarchical edges is generally more advantageous than adding multihop ones, because hierarchy is a strong inductive bias that facilitates capturing features of spatially remote, but hierarchically close, nodes (Table 1). Learned edges also improve on spatial ones, which are defined heuristically in Eq. 4 and therefore might be suboptimal.

Closely related to our work, [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] formulated the generalized graph convolution model (MoNet) based on a trainable transformation to pseudo-coordinates, giving rise to anisotropic kernels and excellent results in visual tasks. However, we found our models with multiple relations to be better (Table 2). Notably, the computational cost (both memory and speed) of MoNet is higher than for any of our models due to the costly patch operator in [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein], so we could not perform experiments on PASCAL with 1000 superpixels due to limited GPU memory. The argument previously made in favour of MoNet against spectral methods, including ChebyNet, was the sensitivity of spectral convolution methods to changes in graph size and structure. We contradict this argument and show strong performance of ChebyNet.

Another method, SplineCNN [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], is similar to MoNet in its use of pseudo-coordinates; we leave studying it for future work. Note that on general graph classification problems, where coordinates are not well defined, the performance of MoNet and SplineCNN is inferior to that of ChebyNet [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].

Finally, a family of methods based on graph kernels [Shervashidze et al.(2011)Shervashidze, Schweitzer, Leeuwen, Mehlhorn, and Borgwardt, Kriege et al.(2016)Kriege, Giscard, and Wilson] shows strong results on some non-visual graph classification datasets, but their application is limited to small-scale graph problems with discrete node features, whereas our features are real-valued. Scalable extensions of kernel methods to graphs with continuous features have been proposed [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov, Yanardag and Vishwanathan(2015)], but they still tend to be less competitive than methods based on GCN and ChebyNet [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].

6 Conclusion

We address several limitations of current graph convolutional networks and show improved graph classification results on a number of image datasets. First, we formulate the classification problem in terms of multigraphs and extend edge fusion methods based on trainable projections. Second, we propose hierarchical edges and a way to learn new edges jointly with a graph classification model. Our results show that spatial, hierarchical, learned and multihop edges are complementary in nature, improving accuracy when combined. We show that our models can outperform standard convolutional networks in experiments with low-resolution images, which should pave the way for future research in this direction.
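As an illustration of learning new edges jointly with a classifier, a common attention-style formulation predicts a dense, row-normalized adjacency from node features via trainable projections; since every operation is differentiable, the projection parameters can be trained end-to-end with the classification loss. The NumPy code below is a hedged sketch of that general idea, not our exact edge-learning module; the projection sizes and random initializations are illustrative assumptions.

```python
import numpy as np

def learned_edges(X, W_q, W_k):
    # Predict a dense adjacency from node features: pairwise scores
    # from two trainable projections, row-normalized with a softmax
    # so each row forms a distribution over potential neighbours.
    scores = (X @ W_q) @ (X @ W_k).T
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# Toy example: 5 superpixel nodes with 4 features each.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
W_q = rng.standard_normal((4, 4))  # illustrative projection sizes
W_k = rng.standard_normal((4, 4))
A = learned_edges(X, W_q, W_k)
print(A.shape, bool(np.allclose(A.sum(axis=1), 1.0)))  # (5, 5) True
```

The resulting matrix can be used as an additional relation type alongside the spatial and hierarchical adjacencies, with gradients from the classification loss flowing back into W_q and W_k.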

Acknowledgments

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation.

References

  • [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
  • [Arbelaez et al.(2010)Arbelaez, Maire, Fowlkes, and Malik] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2010.
  • [Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [Bordes et al.(2013)Bordes, Usunier, Garcia-Duran, Weston, and Yakhnenko] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
  • [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
  • [Chen et al.(2018)Chen, Li, Fei-Fei, and Gupta] Xinlei Chen, Li-Jia Li, Li Fei-Fei, and Abhinav Gupta. Iterative visual reasoning beyond convolutions. In Proc. CVPR, 2018.
  • [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
  • [Dhillon et al.(2007)Dhillon, Guan, and Kulis] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11), 2007.
  • [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
  • [Gao and Ji(2019)] Hongyang Gao and Shuiwang Ji. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • [Gilmer et al.(2017)Gilmer, Schoenholz, Riley, Vinyals, and Dahl] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
  • [Hamilton et al.(2017)Hamilton, Ying, and Leskovec] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
  • [Henaff et al.(2015)Henaff, Bruna, and LeCun] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
  • [Ioffe and Szegedy(2015)] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
  • [Khasanova and Frossard(2017)] Renata Khasanova and Pascal Frossard. Graph-based isometry invariant representation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1847–1856. JMLR. org, 2017.
  • [Kingma and Ba(2015)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [Kipf and Welling(2017)] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] Boris Knyazev, Xiao Lin, Mohamed R Amer, and Graham W Taylor. Spectral multigraph networks for discovering and fusing relationships in molecules. In NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018.
  • [Kriege et al.(2016)Kriege, Giscard, and Wilson] Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
  • [Krizhevsky(2009)] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [Liang et al.(2016)Liang, Shen, Feng, Lin, and Yan] Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph LSTM. In European Conference on Computer Vision, pages 125–143. Springer, 2016.
  • [Lu et al.(2016)Lu, Krishna, Bernstein, and Fei-Fei] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, 2016.
  • [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. CVPR, volume 1, page 3, 2017.
  • [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2014–2023, 2016.
  • [Prabhu and Venkatesh Babu(2015)] Nikita Prabhu and R Venkatesh Babu. Attribute-graph: A graph based approach to image ranking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1071–1079, 2015.
  • [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
  • [Shervashidze et al.(2011)Shervashidze, Schweitzer, Leeuwen, Mehlhorn, and Borgwardt] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
  • [Simonovsky and Komodakis(2017)] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
  • [Simonovsky and Komodakis(2018)] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
  • [Springenberg et al.(2014)Springenberg, Dosovitskiy, Brox, and Riedmiller] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • [Veličković et al.(2019)Veličković, Fedus, Hamilton, Liò, Bengio, and Hjelm] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations (ICLR), 2019.
  • [Xu et al.(2019)Xu, Hu, Leskovec, and Jegelka] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
  • [Yanardag and Vishwanathan(2015)] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
  • [Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.