1 Introduction
In image recognition, input data fed to models tend to be high-dimensional. Even for tiny MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] images, the input is 784-dimensional, and for larger PASCAL [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] images it explodes to around 150,000 dimensions (Figure 1). Learning from such a high-dimensional input is challenging and requires a lot of labelled data and regularization. Convolutional Neural Networks (CNNs) successfully address these challenges by exploiting the shift-invariance, locality and compositionality of images [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst]. We consider an alternative approach and instead reduce the input dimensionality. One simple way to achieve that is downsampling, but in this case we may lose vital structural information; to better preserve it, we extract superpixels [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk, Liang et al.(2016)Liang, Shen, Feng, Lin, and Yan]. Representing a set of superpixels such that CNNs could digest and learn from them is nontrivial. To that end, we adopt a strategy from chemistry, physics and social networks, where structured data are expressed by graphs [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst, Hamilton et al.(2017)Hamilton, Ying, and Leskovec, Battaglia et al.(2018)Battaglia, Hamrick, Bapst, SanchezGonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.].
By defining operations on graphs analogous to spectral [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun] or spatial [Kipf and Welling(2017)] convolution, Graph Convolutional Networks (GCNs) extend CNNs to graph-based data, and show successful applications in graph/node classification [Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio, Simonovsky and Komodakis(2017), Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Fey et al.(2018)Fey, Lenssen, Weichert, and Müller] and link prediction [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling].
The challenge of generalizing convolution to graphs is to obtain anisotropic filters (such as edge detectors). Anisotropic models, such as MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and SplineCNN [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], rely on coordinate structure and work well for various vision tasks, but are often too computationally expensive and suboptimal for graph problems in which the coordinates are not well defined [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor]. While these and other general models exist [Gilmer et al.(2017)Gilmer, Schoenholz, Riley, Vinyals, and Dahl, Battaglia et al.(2018)Battaglia, Hamrick, Bapst, SanchezGonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.], we rely on widely used graph convolutional networks (GCNs) [Kipf and Welling(2017)] and their multiscale extension, Chebyshev GCNs (ChebyNets) [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst], which enjoy explicit control of the receptive field size.
Our focus is on multigraphs: graphs that are permitted to have multiple edges, e.g. two objects can be connected by edges of different types (is_part_of, action, distance, etc.) [Lu et al.(2016)Lu, Krishna, Bernstein, and FeiFei]. Multigraphs enable us to model the spatial and hierarchical structure inherent to images. By exploiting multiple relationships, we can capture global patterns in input graphs, which is hard to achieve with a single relation, because most GCNs aggregate information in a small local neighbourhood, and simply increasing its size, as in ChebyNet, can quickly make the representation too entangled (due to averaging over too many features). Hence, methods such as Deep Graph Infomax [Veličković et al.(2019)Veličković, Fedus, Hamilton, Liò, Bengio, and Hjelm] were proposed. Using multigraph networks is another approach to increase the receptive field size and disentangle the representation in a principled way, which we show to be promising.
In this work, we model images as sets of superpixels (Figure 1) and formulate image classification as a multigraph classification problem (Section 4). We first overview graph convolution (Section 2) and then adopt and extend relation type fusion methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] to improve the expressive power of our GCNs (Section 3). To improve classification accuracy, we represent an image as a multigraph and propose learnable (Section 4.1.1) and hierarchical (Section 4.1.2) relation types that we fuse to obtain a final rich representation of an image. In a number of experiments on the MNIST, CIFAR-10, and PASCAL image datasets, we evaluate our model and show a significant increase in accuracy, outperforming CNNs in some tasks (Section 5).
2 Graph Convolution
We consider undirected graphs with $N$ nodes and edges taking values in the range $[0, 1]$, represented as an adjacency matrix $A \in \mathbb{R}^{N \times N}$. Nodes usually represent specific semantic concepts such as objects in images [Prabhu and Venkatesh Babu(2015)]. Nodes can also denote abstract blocks of information with common properties, such as superpixels in our case. Edges ($A_{ij}$) define the relationships between nodes and the scope over which node effects propagate.
Convolution is an essential computational block in graph networks, since it permits the gradual aggregation of information from neighbouring nodes. Following [Kipf and Welling(2017)], for $C$-dimensional features $X \in \mathbb{R}^{N \times C}$ over $N$ nodes and trainable filters $\Theta \in \mathbb{R}^{C \times F}$, convolution on a graph can be defined as:

$$Y = \bar{X} \Theta, \quad \bar{X} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X, \qquad (1)$$

where $\bar{X}$ are features averaged over one-hop ($K{=}1$) neighbourhoods, $\hat{A} = A + I$ is an adjacency matrix with self-loops, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$ is a diagonal matrix with node degrees, and $I$ is an identity matrix. We employ this convolution to build graph convolutional networks (GCNs) in our experiments. This formulation is a particular case of a more general approximate spectral graph convolution [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst], in which case $\bar{X} = [\bar{X}^{(1)}, \dots, \bar{X}^{(K)}]$ are multiscale features averaged over $1$- to $K$-hop neighbourhoods. Multiscale (multi-hop) graph convolution is used in our experiments with ChebyNet.

3 Multigraph Convolution
In graph convolution (Eq. 1), the graph adjacency matrix $A$ encodes a single ($R{=}1$) relation type between nodes. We extend Eq. 1 to a multigraph: a graph with multiple ($R > 1$) edges (relations) between the same nodes, represented as a set of adjacency matrices $\{A^{(r)}\}_{r=1}^{R}$. Previous work used concatenation or decomposition-based schemes [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling, Chen et al.(2018)Chen, Li, FeiFei, and Gupta] to fuse multiple relations. Instead, to capture more complex multi-relational features, we adopt and extend recently proposed fusion methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].
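As an illustration, the normalized propagation of Eq. 1 can be applied independently to each relation's adjacency matrix before fusion. Below is a dense, didactic NumPy sketch (function names are ours, not from any released code):

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize an adjacency matrix with self-loops (Eq. 1)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops: A_hat = A + I
    d = A_hat.sum(axis=1)                     # node degrees D_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def multigraph_conv(X, adjacencies, thetas):
    """One convolution step per relation type: X_bar^(r) Theta^(r) for each A^(r)."""
    return [normalize_adjacency(A) @ X @ Th for A, Th in zip(adjacencies, thetas)]
```

The resulting list of per-relation features is what the fusion methods of this section combine into a single output.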
Two of the methods from [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] have certain limitations preventing us from adopting them directly in this work. In particular, the approach based on multidimensional Chebyshev polynomials is often infeasible to compute and was not shown to be superior in downstream tasks. In our experience, we also found the multiplicative fusion unstable to train. To that end, motivated by the success of multilayer projections in Graph Isomorphism Networks [Xu et al.(2019)Xu, Hu, Leskovec, and Jegelka], we propose two simple yet powerful fusion methods (Figure 2).
Given features $\bar{X}^{(r)}$ corresponding to relation type $r$, in the first approach we concatenate the features for all $R$ relation types and then transform them using a two-layer fully-connected network $f$ with $F$ hidden units and the ReLU activation:

$$Y = f\big([\bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(R)}]\big), \qquad (2)$$

where the first and second layers of $f$ have $RCF$ and $F^2$ trainable parameters respectively, ignoring a bias. The second approach, illustrated in Figure 2, is similar to multiplicative/additive fusion [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor], but instead of multiplication/addition we use concatenation:

$$Y = \big[f^{(1)}(\bar{X}^{(1)}), f^{(2)}(\bar{X}^{(2)}), \dots, f^{(R)}(\bar{X}^{(R)})\big], \qquad (3)$$

where $r = 1, \dots, R$ and $f^{(r)}$ is a single-layer fully-connected network with $F/R$ output units followed by a nonlinearity, so that each $f^{(r)}$ and the entire fusion layer have $CF/R$ and $CF$ trainable parameters respectively, ignoring a bias. Hereafter, we denote the first approach as CP (concatenation followed by projection) and the second as PC (projection followed by concatenation).
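The two fusion schemes can be sketched as follows, assuming per-relation features $\bar{X}^{(r)}$ are already computed; the random weights stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

N, C, F, R = 5, 3, 8, 2                       # nodes, in-features, out-features, relations
X_bar = [rng.standard_normal((N, C)) for _ in range(R)]   # per-relation features

# CP: concatenate all relation features, then a two-layer projection (Eq. 2).
W1 = rng.standard_normal((R * C, F))          # RCF parameters
W2 = rng.standard_normal((F, F))              # F^2 parameters
Y_cp = relu(np.concatenate(X_bar, axis=1) @ W1) @ W2

# PC: project each relation to F // R units, then concatenate (Eq. 3).
W_r = [rng.standard_normal((C, F // R)) for _ in range(R)]  # CF/R parameters each
Y_pc = np.concatenate([relu(Xr @ Wr) for Xr, Wr in zip(X_bar, W_r)], axis=1)

assert Y_cp.shape == Y_pc.shape == (N, F)
```

Both variants yield an $N \times F$ output, so they are drop-in replacements for one another in the network.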
4 Multigraph Convolutional Networks
Image classification was recently formulated as a graph classification problem in [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], who considered small-scale image classification problems such as MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner]. In this work, we present a model that scales to more complex and larger image datasets, such as PASCAL VOC 2012 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. We follow [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and compute SLIC [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] superpixels for each image and build a graph, in which each node corresponds to a superpixel and edges ($A_{ij}$) are computed based on the Euclidean distance between the coordinates $p_i$ and $p_j$ of their centres of masses using a Gaussian with some fixed width $\sigma$:

$$A^{(1)}_{ij} = \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma^2}\right). \qquad (4)$$
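A minimal sketch of Eq. 4, assuming superpixel centres are given as an $N \times 2$ array of normalized coordinates:

```python
import numpy as np

def spatial_adjacency(coords, sigma=0.1):
    """Eq. 4: Gaussian of pairwise Euclidean distances between superpixel centres."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)  # squared distances
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)   # self-loops are (re)introduced during normalization, Eq. 1
    return A
```

Nearby superpixels receive edge weights close to 1, distant ones close to 0, with the fall-off controlled by $\sigma$.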
A frequent assumption of current GCNs is that there is at most one edge between any pair of nodes in a graph. This restriction is usually implied by the structure of existing datasets, in which graphs are annotated with only the single most important relation type. Meanwhile, data is often complex and nodes tend to have multiple relationships of different semantic, physical, or abstract meanings. Therefore, we argue that there could be other relationships captured by relaxing this restriction and allowing for multiple kinds of edges, beyond those hard-coded in the data (e.g. the spatial edges in Eq. 4).
4.1 Learning Flat vs. Hierarchical Edges
Prior work, e.g. [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling, Bordes et al.(2013)Bordes, Usunier, GarciaDuran, Weston, and Yakhnenko], proposed methods to learn from multiple edges, but, similarly to methods with a single edge type [Kipf and Welling(2017)], they leveraged only predefined edges in the data. We formulate a more flexible model which, in addition to learning from an arbitrary number of relations between nodes (see Section 3), learns abstract edges jointly with a GCN.
4.1.1 Flat Learnable Edges
We combine ideas from [Henaff et al.(2015)Henaff, Bruna, and LeCun, Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio, Simonovsky and Komodakis(2017), Battaglia et al.(2018)Battaglia, Hamrick, Bapst, SanchezGonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.] and propose to learn a new edge from any node $i$ to node $j$ with coordinates $p_i$ and $p_j$ using a trainable similarity function:

$$A^{(r)}_{ij} = g^{(r)}(p_i, p_j), \quad j \in \mathcal{N}(i), \qquad (5)$$

where $r$ indexes the $R_L$ learned relation types and $\mathcal{N}(i)$ is a spatial neighbourhood of fixed radius around node $i$. Using a relatively small $\mathcal{N}(i)$ limits the model's predictive power, but is important for regularization and to meet computational requirements for larger images from which we extract many superpixels, such as in the PASCAL dataset. We also experimented with feeding node features, such as mean pixel intensities, to this function, but it did not give positive outcomes. Instead, we further regularize the model by constraining the input to be the absolute coordinate differences in lieu of raw coordinates and applying a softmax on top of the predictions:

$$A^{(r)}_{ij} = \mathrm{softmax}\big(g(|p_i - p_j|)\big)^{(r)}, \quad j \in \mathcal{N}(i), \qquad (6)$$

where $g$ is a small two-layer fully-connected network with $R_L$ output units, $(\cdot)^{(r)}$ denotes taking the $r$-th output, and the softmax is taken over the neighbourhood $\mathcal{N}(i)$. Using the absolute value makes the filters symmetric (Figure 5), but still sensitive to orientation as opposed to the spatial edge defined in Eq. 4. The softmax is used to encourage sparse (more localized) edges.
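The learned-edge computation can be sketched as below; the MLP weights and the neighbour lists are hypothetical stand-ins for the trained network and the radius-based neighbourhood, and the softmax is taken over each node's neighbours to keep edges localized:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def learned_edges(coords, W1, b1, W2, b2, neighbours):
    """Sketch of learned edges: a two-layer MLP g on absolute coordinate
    differences, normalized by a softmax over each node's neighbourhood."""
    N = coords.shape[0]
    R_L = W2.shape[1]                          # number of learned relation types
    A = np.zeros((R_L, N, N))
    for i in range(N):
        nb = neighbours[i]                     # indices of spatially nearest nodes
        diff = np.abs(coords[nb] - coords[i])  # |p_i - p_j|: symmetric in sign
        h = np.maximum(diff @ W1 + b1, 0.0)    # hidden layer with ReLU
        logits = h @ W2 + b2                   # shape (|nb|, R_L)
        A[:, i, nb] = softmax(logits, axis=0).T  # normalize over neighbours
    return A
```

Each learned relation thus distributes a unit of edge mass over a node's spatial neighbourhood, which keeps the resulting adjacency matrices sparse and localized.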
Our approach to predicting edges can be viewed as a particular case of generative models of graphs, such as [Simonovsky and Komodakis(2018)]. However, the objectives of our model and the latter are orthogonal. Generative models typically make probabilistic predictions to induce diversity among generated graphs, and the generation of each edge, node and their attributes should respect other (already generated) edges and nodes, so that the entire graph becomes plausible. In our work, we aim to scale our model to larger graphs and assume locality of relationships in visual data; therefore, we design a function that is 1) simple and deterministic and 2) makes predictions depending only on local information.
[Figure 3: (a, b) spatial edges; (c, d) hierarchical edges.]
4.1.2 Hierarchical Edges
Depending on the number of SLIC superpixels, we can build different levels of image abstraction. Low levels of abstraction maintain fine-grained details, whereas higher levels mainly preserve global structure. To leverage useful information from multiple levels, we first compute superpixels at several different scales. Then, spatial and learnable relations are computed using Eq. 4 and 6, treating nodes at different scales as a joint set of superpixels. This creates a graph of a multiscale image representation (Figure 3 (a, b)). However, this representation is still flat and, thus, limited.
To alleviate this, we introduce a novel hierarchical graph model, where child-parent relations are based on the intersection over union (IoU) between superpixels $S_i$ and $S_j$ at different scales:

$$A^{\mathrm{hier}}_{ij} = \mathrm{IoU}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \qquad (7)$$

where $A^{\mathrm{hier}}$ is an adjacency matrix of hierarchical edges and $A^{\mathrm{hier}}_{ij} = 0$ for nodes $i, j$ at the same scale (Figure 3 (c, d)). Using IoU means that child nodes can have multiple parents. To guarantee a single parent, hierarchical superpixel algorithms can be considered [Arbelaez et al.(2010)Arbelaez, Maire, Fowlkes, and Malik].
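A sketch of Eq. 7, computing child-parent IoU edges from two superpixel label maps of the same image (the label maps are assumed inputs, e.g. produced by SLIC at two scales):

```python
import numpy as np

def hierarchical_edges(labels_fine, labels_coarse):
    """Eq. 7: child-parent edges as IoU between superpixels at two scales.
    Entries between nodes of the same scale are zero by construction."""
    fine_ids = np.unique(labels_fine)
    coarse_ids = np.unique(labels_coarse)
    A = np.zeros((len(fine_ids), len(coarse_ids)))
    for a, i in enumerate(fine_ids):
        mask_i = labels_fine == i
        for b, j in enumerate(coarse_ids):
            mask_j = labels_coarse == j
            inter = np.logical_and(mask_i, mask_j).sum()
            union = np.logical_or(mask_i, mask_j).sum()
            A[a, b] = inter / union if union else 0.0
    return A
```

Because a fine superpixel can overlap several coarse ones, a row of this matrix may have multiple non-zero entries, i.e. multiple parents, as noted above.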
4.2 Layer vs. Global pooling
Inspired by convolutional networks, previous works [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun, Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein, Simonovsky and Komodakis(2017), Fey et al.(2018)Fey, Lenssen, Weichert, and Müller, Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec] built an analogy of pooling layers in graphs, for example, using the Graclus clustering algorithm [Dhillon et al.(2007)Dhillon, Guan, and Kulis]. In CNNs, pooling is an effective way to reduce memory and computation, particularly for large inputs. It also provides additional robustness to local deformations and leads to faster growth of receptive fields. However, a convolutional network can be built without any pooling layers and achieve similar performance on a downstream task [Springenberg et al.(2014)Springenberg, Dosovitskiy, Brox, and Riedmiller]; it will just be relatively slow, since pooling is extremely cheap on regular grids, such as images. In graph classification tasks, the input dimensionality, which corresponds to the number of nodes $N$, is often very small, and the benefits of pooling are less clear. Graph pooling, such as in [Dhillon et al.(2007)Dhillon, Guan, and Kulis], is also computationally intensive, since we need to run the clustering algorithm for each graph independently, which limits the scale of problems we can address. Aiming to simplify the model while maintaining classification accuracy, we exclude pooling layers between graph convolutional layers and perform global maximum pooling over nodes following the last convolutional layer. This fixes the size of the penultimate feature vector irrespective of $N$ (Figure 4).
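The global readout described above reduces to a single max over the node axis, which makes the penultimate representation independent of graph size (a minimal sketch):

```python
import numpy as np

def global_max_pool(X):
    """Pool node features of shape (N, F) to a fixed-size vector (F,)."""
    return X.max(axis=0)

# Graphs with different node counts yield equally sized representations,
# so a single classification layer serves all graphs (cf. Figure 4's readout).
rng = np.random.default_rng(0)
z_small = global_max_pool(rng.standard_normal((7, 512)))
z_large = global_max_pool(rng.standard_normal((1000, 512)))
assert z_small.shape == z_large.shape == (512,)
```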
5 Experiments
We evaluate our model on three image-based graph classification datasets: MNIST [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], CIFAR-10 [Krizhevsky(2009)] and PASCAL Visual Object Classes 2012 (PASCAL) [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. For each dataset, there is a set of graphs with different numbers of nodes (superpixels). For MNIST and CIFAR-10, each graph has a single categorical label that is to be predicted; for PASCAL, we predict multiple categorical labels per image and report mean average precision (mAP). For baselines, we use a single ($R{=}1$) spatial relation type defined using Eq. 4.
MNIST consists of 70k greyscale 28×28 px images of handwritten digits. CIFAR-10 has 60k coloured 32×32 px images of 10 categories of animals and vehicles. PASCAL is a more challenging dataset with realistic high-resolution images of 20 object categories (people, animals, vehicles, indoor objects). We use standard classification splits, training our model on 5,717 images and reporting results on 5,823 validation images. We note that CIFAR-10 and PASCAL have not been previously considered for graph-based image classification, and in this work we scale our method to these datasets. In fact, during experimentation we found some other graph convolutional methods unable to scale (see Section 5.3).
5.1 Architectural and Experimental Details
GCNs. In all experiments, we train GCNs, ChebyNets and MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] with three graph convolutional layers, having 32, 64 and 512 filters, with the ReLU activation after each layer, followed by global max pooling and a fully-connected classification layer (Figure 4). For MNIST, we use a dropout rate [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] of 0.5 before the classification layer, while for CIFAR-10 and PASCAL we instead employ batch normalization (BN) [Ioffe and Szegedy(2015)] after each convolutional layer. For edge fusion, the projection $f$ in Eq. 2 is modelled by a two-layer network with $F$ hidden units and the ReLU activation between layers. Similarly, the projections $f^{(r)}$ in Eq. 3 are single-layer neural networks with $F/R$ output units followed by the activation. The edge prediction function $g$ (Eq. 6) is a two-layer neural network, and the neighbourhood $\mathcal{N}(i)$ is set to the spatially nearest nodes.

ConvNets (CNNs). To allow a fair comparison, we train CNNs with the same number of filters in convolutional layers as GCNs, with filters of size 3 for MNIST, 5 for CIFAR-10 and 7 for PASCAL, max pooling between layers and global average pooling after the last convolutional layer. Deeper and larger CNNs and GCNs can be trained to further improve results. Since images fed to CNNs are defined on a regular grid, adding coordinate features is uninformative, because these features are exactly the same for all examples in the dataset.
Low resolution ConvNets. GCNs take a relatively low resolution input compared to CNNs: 75 vs. 784 for MNIST, 150 vs. 1024 for CIFAR-10 and 1000 vs. 150,000 for PASCAL; the factors by which the input is reduced are 10.5, 6.8 and 150 respectively. Therefore, a direct comparison of GCNs to CNNs is unfair. To provide an alternative (yet not perfect) form of comparison, we design experiments with low resolution inputs fed to CNNs. In particular, to match the spatial dimensions of inputs to GCNs and CNNs, for MNIST we reduce the size to 9×9 px, for CIFAR-10 to 12×12 px and for PASCAL to 32×32 px, taking into account that the average number of superpixels returned by the SLIC algorithm is often smaller than requested. Admittedly, downsampling using SLIC superpixels is more structural than bilinear downsampling, so GCNs receive a stronger signal, but we believe it is still an interesting experiment. Principal component analysis could be used as a more adequate way to implicitly reduce the input dimensionality for CNNs, but this method is infeasible for large images, such as in PASCAL, while superpixels can be easily computed in such cases. For comparison, we also report results on low resolution images using GCNs.
Training. We train all models using Adam [Kingma and Ba(2015)] with weight decay and a batch size of 32. For MNIST, the learning rate is decayed after 20 and 25 epochs and models are trained for 30 epochs. For CIFAR-10 and PASCAL, the learning rate is decayed after 35 and 45 epochs and models are trained for 50 epochs. We train models using four edge fusion methods: the concatenation-based baseline, additive fusion [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor], and the two methods (CP and PC) introduced in Section 3 (see Table 1 and Figure 6 (a) for a summary). In each case, we train a model 5 times with different random seeds and report average results and standard deviations. In Table 2, we report the best results over the different fusion methods, marking the best fusion method with a superscript.
Graph formation for images. For MNIST, we compute a hierarchy of 75, 21, and 7 SLIC [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] superpixels and use mean superpixel intensity and coordinates as node features. For CIFAR-10, we add one more level of 150 superpixels and also use coordinate features. For PASCAL, due to its more challenging images, we further add two more levels of 1,000 and 300 superpixels. Note that the numbers of superpixels provided above are upper bounds we impose on the SLIC algorithm. For low resolution experiments with GCNs, we build graphs from pixels and their coordinates, so that the graph structure is the same across all examples in the dataset.
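Node construction from a superpixel labelling can be sketched as follows; the labelling itself would come from SLIC, and the feature layout (mean intensity plus normalized centroid coordinates) follows the description above:

```python
import numpy as np

def superpixel_nodes(image, labels):
    """Build node features from a superpixel label map (e.g. produced by SLIC):
    mean intensity plus normalized centroid coordinates per superpixel."""
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = []
    for s in np.unique(labels):
        m = labels == s
        feats.append([image[m].mean(),     # mean superpixel intensity
                      ys[m].mean() / H,    # normalized row coordinate
                      xs[m].mean() / W])   # normalized column coordinate
    return np.array(feats)                 # shape (num_superpixels, 3)
```

The coordinate columns of this matrix are exactly the $p_i$ used to build spatial (Eq. 4) and learned (Eq. 6) edges.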
5.2 Results
The image classification results on the MNIST, CIFAR-10 and PASCAL datasets are presented in Table 2. We first observe that among the GCN baselines, MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] and ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] show comparable performance, significantly outperforming GCN [Kipf and Welling(2017)], which can be explained by the very local filters in the latter. Next, the hierarchical (H) and learned (L or L4, i.e. models with one or four learnable edge types) connections proposed in this work are shown to be complementary, substantially improving both GCNs and ChebyNets with a single spatial edge type. Additional qualitative and quantitative analysis is provided in Table 1 and Figures 5 and 6. By combining hierarchical and learned edge types (HL, HL4) and adding multiscale filters, we achieve a further gain in accuracy while keeping the number of trainable parameters low compared to ConvNets, MoNet and ChebyNets with large $K$. Importantly, our multi-relational GCNs also show better results than ConvNets with a low resolution input.
In Table 2, for ChebyNet we report results using the best $K$ chosen from the range [2, 20] for each of MNIST, CIFAR-10 and PASCAL, both for the baseline ChebyNet and for HL-ChebyNet. Among the evaluated fusion methods, CP and additive (S) work best for MNIST, whereas PC and CP dominate for CIFAR-10 and PASCAL.
We also evaluate a baseline GCN on low resolution images to highlight the importance of SLIC superpixels compared to downsampled pixels (Table 2). Surprisingly, SLIC features provide only a moderate improvement compared to pixels, bringing us to two conclusions. First, average intensity values of superpixels and their coordinates are rather weak features that carry limited geometry and texture information. Stronger features could be the boundaries of superpixels, but efficient and effective modelling of such information remains an open question. Second, we hypothesize that on full resolution images GCNs can potentially reach the performance of ConvNets or even outperform them, while having appealing properties, such as in [Khasanova and Frossard(2017)]. However, to scale GCNs to such large graphs, two approaches should be considered: 1) fast and effective pooling methods [Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec, Gao and Ji(2019)]; 2) sparsification of graphs, which we evaluated and compared to the complete graphs used in our experiments (Figure 6 (c)).
Fusion method (GCN, accuracy ± std)
(Sp)  Spatial edge (Eq. 1, 4)  86.36±0.95  97.08±0.11  98.17±0.02
(SpM)  Spatial multiscale (Eq. 1, 4)  91.74±0.28  97.38±0.09  98.10±0.05
(H)  Hier. edge (Eq. 1, 7)  93.65±0.07  96.94±0.07  97.07±0.07
(C)  Concat [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor]  97.07±0.06  97.95±0.07  98.28±0.11
(S)  Sum [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor]  97.93±0.05  98.30±0.01  98.44±0.10
(CP)  C-proj (Eq. 2)  97.77±0.08  98.31±0.09  98.52±0.09
(PC)  Proj-c (Eq. 3)  97.87±0.04  98.28±0.04  98.47±0.09
Max gain  4.28  0.93  0.35
Model  R  MNIST  CIFAR-10  PASCAL  # params
Input:  ConvNets, full res  -  784  1024  150 000  -
ConvNets/GCNs, low res  -  81  144  1024  -
GCNs  -  75  150  1000  -
Pixels: baselines  ConvNet [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], full res  -  99.42±0.01  83.77±0.21  41.42±0.51  320-360k
ConvNet [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner], low res  -  97.09±0.10  72.68±0.40  32.69±0.43  320-330k
GCN [Kipf and Welling(2017)], low res  1  81.36±0.41  50.57±0.14  18.57±0.10  42k
SLIC: baselines  GCN [Kipf and Welling(2017)]  1  86.29±0.44  51.51±0.55  19.24±0.06  42k
MoNet [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein]  1  96.64±2.01  72.62±0.57  -  880k
ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst]  1  98.24±0.03  68.92±0.23  32.38±0.38  80-350k
SLIC: ours  L-GCN  2  97.64±0.12 (1.05)  70.14±0.29 (1.00)  32.39±0.19 (0.08)  100k
H-GCN  2  97.93±0.05 (0.86)  69.05±0.14 (0.60)  32.18±0.29 (0.53)  100k
HL-GCN  3  98.35±0.09 (0.63)  71.44±0.30 (1.81)  31.75±0.74 (0.54)  145k
L4-GCN  5  98.42±0.12 (0.10)  72.67±0.36 (0.00)  33.01±0.49 (0.38)  185k
HL4-GCN  6  98.66±0.03 (0.22)  72.89±0.70 (0.89)  32.61±0.45 (2.18)  220k
HL-ChebyNet  3  98.68±0.05 (0.18)  73.18±0.52 (2.40)  34.46±0.47 (1.46)  200k
5.3 Discussion
Our method relies on Graph Convolutional Networks (GCNs) [Kipf and Welling(2017)] and their multiscale variant, ChebyNet [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst]. While ChebyNet is superior to GCN in the case of a single relation type due to its larger receptive field size, adding multiple relations makes both methods comparable. In fact, we show that adding hierarchical edges is generally more advantageous than adding multi-hop ones, because hierarchy is a strong inductive bias that facilitates capturing features of spatially remote, but hierarchically close, nodes (Table 1). Learned edges also improve on spatial ones, which are defined heuristically in Eq. 4 and therefore might be suboptimal.

Closely related to our work, [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] formulated the generalized graph convolution model (MoNet) based on a trainable transformation to pseudo-coordinates, giving rise to anisotropic kernels and excellent results in visual tasks. However, we found our models with multiple relations to be better (Table 2). Notably, the computational cost (both memory and speed) of MoNet is higher than for any of our models due to the costly patch operator in [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein], so we could not perform experiments on PASCAL with 1000 superpixels due to limited GPU memory. An argument previously made in favour of MoNet against spectral methods, including ChebyNet, was the sensitivity of spectral convolution to changes in graph size and structure. Our results contradict this argument and show strong performance of ChebyNet.
Another method, SplineCNN [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller], is similar to MoNet and is also based on pseudo-coordinates, but we leave studying this method for future work. Note that the performance of MoNet and SplineCNN on general graph classification problems, where coordinates are not well defined, is inferior to that of ChebyNet [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].
Finally, a family of methods based on graph kernels [Shervashidze et al.(2011)Shervashidze, Schweitzer, Leeuwen, Mehlhorn, and Borgwardt, Kriege et al.(2016)Kriege, Giscard, and Wilson] shows strong results on some non-visual graph classification datasets, but their application is limited to small-scale graph problems with discrete node features, whereas we have real-valued features. Scalable extensions of kernel methods to graphs with continuous features were proposed [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov, Yanardag and Vishwanathan(2015)], but they still tend to be less competitive than methods based on GCN and ChebyNet [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor].
6 Conclusion
We address several limitations of current graph convolutional networks and show improved graph classification results on a number of image datasets. First, we formulate the classification problem in terms of multigraphs and extend edge fusion methods based on trainable projections. Second, we propose hierarchical edges and a way to learn new edges in a graph jointly with a graph classification model. Our results show that spatial, hierarchical, learned and multi-hop edges have a complementary nature, improving accuracy when combined. We show that our models can outperform standard convolutional networks in experiments with low resolution images, which should pave the way for future research in this direction.
Acknowledgments
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation.
References
 [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
 [Arbelaez et al.(2010)Arbelaez, Maire, Fowlkes, and Malik] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2010.
 [Battaglia et al.(2018)Battaglia, Hamrick, Bapst, SanchezGonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, et al.] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro SanchezGonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 [Bordes et al.(2013)Bordes, Usunier, GarciaDuran, Weston, and Yakhnenko] Antoine Bordes, Nicolas Usunier, Alberto GarciaDuran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.

 [Bronstein et al.(2017)Bronstein, Bruna, LeCun, Szlam, and Vandergheynst] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [Bruna et al.(2014)Bruna, Zaremba, Szlam, and LeCun] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
 [Chen et al.(2018)Chen, Li, FeiFei, and Gupta] Xinlei Chen, LiJia Li, Li FeiFei, and Abhinav Gupta. Iterative visual reasoning beyond convolutions. In Proc. CVPR, 2018.
 [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 [Dhillon et al.(2007)Dhillon, Guan, and Kulis] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11), 2007.
 [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International journal of computer vision, 88(2):303–338, 2010.
 [Fey et al.(2018)Fey, Lenssen, Weichert, and Müller] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
 [Gao and Ji(2019)] Hongyang Gao and Shuiwang Ji. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
 [Gilmer et al.(2017)Gilmer, Schoenholz, Riley, Vinyals, and Dahl] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
 [Hamilton et al.(2017)Hamilton, Ying, and Leskovec] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
 [Henaff et al.(2015)Henaff, Bruna, and LeCun] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 [Ioffe and Szegedy(2015)] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
 [Khasanova and Frossard(2017)] Renata Khasanova and Pascal Frossard. Graph-based isometry invariant representation learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1847–1856. JMLR.org, 2017.
 [Kingma and Ba(2015)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 [Kipf and Welling(2017)] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 [Knyazev et al.(2018)Knyazev, Lin, Amer, and Taylor] Boris Knyazev, Xiao Lin, Mohamed R Amer, and Graham W Taylor. Spectral multigraph networks for discovering and fusing relationships in molecules. In NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018.
 [Kriege et al.(2016)Kriege, Giscard, and Wilson] Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.
 [Krizhevsky(2009)] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [LeCun et al.(1998)LeCun, Bottou, Bengio, and Haffner] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Liang et al.(2016)Liang, Shen, Feng, Lin, and Yan] Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph LSTM. In European Conference on Computer Vision, pages 125–143. Springer, 2016.
 [Lu et al.(2016)Lu, Krishna, Bernstein, and Fei-Fei] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, 2016.
 [Monti et al.(2017)Monti, Boscaini, Masci, Rodola, Svoboda, and Bronstein] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3, 2017.
 [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2014–2023, 2016.
 [Prabhu and Venkatesh Babu(2015)] Nikita Prabhu and R Venkatesh Babu. Attribute-graph: A graph-based approach to image ranking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1071–1079, 2015.
 [Schlichtkrull et al.(2018)Schlichtkrull, Kipf, Bloem, van den Berg, Titov, and Welling] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
 [Shervashidze et al.(2011)Shervashidze, Schweitzer, Leeuwen, Mehlhorn, and Borgwardt] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 [Simonovsky and Komodakis(2017)] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
 [Simonovsky and Komodakis(2018)] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
 [Springenberg et al.(2014)Springenberg, Dosovitskiy, Brox, and Riedmiller] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [Velickovic et al.(2018)Velickovic, Cucurull, Casanova, Romero, Lio, and Bengio] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
 [Veličković et al.(2019)Veličković, Fedus, Hamilton, Liò, Bengio, and Hjelm] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations (ICLR), 2019.
 [Xu et al.(2019)Xu, Hu, Leskovec, and Jegelka] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
 [Yanardag and Vishwanathan(2015)] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
 [Ying et al.(2018)Ying, You, Morris, Ren, Hamilton, and Leskovec] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.