GraphFPN: Graph Feature Pyramid Network for Object Detection

08/02/2021 ∙ by Gangming Zhao, et al. ∙ Fudan University

Feature pyramids have been proven powerful in image understanding tasks that require multi-scale features. State-of-the-art methods for multi-scale feature learning focus on performing feature interactions across space and scales using neural networks with a fixed topology. In this paper, we propose graph feature pyramid networks that are capable of adapting their topological structures to varying intrinsic image structures and supporting simultaneous feature interactions across all scales. We first define an image-specific superpixel hierarchy for each input image to represent its intrinsic image structures. The graph feature pyramid network inherits its structure from this superpixel hierarchy. Contextual and hierarchical layers are designed to achieve feature interactions within the same scale and across different scales. To make these layers more powerful, we introduce two types of local channel attention for graph neural networks by generalizing global channel attention for convolutional neural networks. The proposed graph feature pyramid network can enhance the multiscale features from a convolutional feature pyramid network. We evaluate our graph feature pyramid network in the object detection task by integrating it into the Faster R-CNN algorithm. The modified algorithm outperforms not only previous state-of-the-art feature pyramid-based methods by a clear margin but also other popular detection methods on both the MS-COCO 2017 validation and test datasets.

1 Introduction

Deep convolutional neural networks exploit local connectivity and weight sharing, and have led to a series of breakthroughs in computer vision tasks, including image recognition [23, 46, 11, 47], object detection [8, 41, 33, 39, 4, 29, 45], and semantic segmentation [32, 54, 28, 17, 52, 48]. In the backbone of a deep convolutional neural network, feature maps closer to the input image encode low-level features of the image, while those further away from the input image encode high-level features that are more semantically meaningful. In the meantime, in a layer further away from the input image, the feature map tends to have a smaller spatial resolution while the neurons tend to have a larger receptive field. Since objects in an image may have varying scales, it is highly desirable to obtain multiscale feature maps that fuse high-level and low-level features with sufficient spatial resolution at every distinct scale. This motivated the feature pyramid network (FPN [30]) and its improved versions, such as the path aggregation network (PANet [32]) and the feature pyramid transformer (FPT [52]), as well as other methods [21, 18, 7, 50, 10].

Although these methods represent the state of the art in multiscale feature learning, they overlook the intrinsic structures of images. Every image has multiscale intrinsic structures, including the grouping of pixels into object parts, the further grouping of parts into objects, as well as the spatial layout of objects in the image space. Such multiscale intrinsic structures differ from image to image, and can provide important clues for image understanding and object recognition. However, FPN and its related methods always use a fixed multiscale network topology (i.e., 2D grids of neurons) independent of the intrinsic image structures. Such a fixed network topology may not be optimal for multiscale feature learning. According to psychological evidence [14], humans parse visual scenes into part-whole hierarchies and model part-whole relationships dynamically for different images. Motivated by this, researchers have developed a series of "capsule" models [43, 13, 22] that describe the occurrence of a particular type of entity in a particular region of an image. Hierarchical segmentation can recursively group superpixels according to their locations and similarities to generate a superpixel hierarchy [38, 34]. Such a part-whole hierarchy can assist object detection and semantic segmentation by bridging the semantic gap between pixels and objects [34].

It is known that multiscale features in a feature pyramid can be enhanced through cross-scale interactions [30, 32, 25, 52] in addition to interactions within the same scale. Another limitation of existing feature pyramid methods is that only features from adjacent scales interact directly, while features from non-adjacent scales interact indirectly through intermediate scales. This is partly because it is easiest to match the resolutions of two adjacent scales, and partly because existing interaction mechanisms are designed to handle only two scales at a time. Interactions between adjacent scales usually follow a top-down or bottom-up sequential order. In existing schemes, the highest-level features at the top of the pyramid have to propagate through multiple intermediate scales, and interact with the features at those scales, before reaching the features at the bottom of the pyramid. During such propagation and interaction, essential feature information may be lost or weakened.

In this paper, we propose graph feature pyramid networks to overcome the aforementioned limitations because graph networks are capable of adapting their topological structures to varying intrinsic structures of input images, and they also support simultaneous feature interactions across all scales. We first define a superpixel hierarchy for an input image. This superpixel hierarchy has a number of levels, each of which consists of a set of nonoverlapping superpixels defining a segmentation of the input image. The segmentations at all levels of the hierarchy are extracted from the same hierarchical segmentation of the input image. Thus the superpixels at two adjacent levels of the hierarchy are closely related. Every superpixel on the coarser level is a union of superpixels on the finer level. Such one-to-many correspondences between superpixels on two levels define the aforementioned part-whole relationships, which can also be called ancestor-descendant relationships. The hierarchical segmentation and the superpixel hierarchy derived from it reveal intrinsic image structures.

To effectively exploit intrinsic image structures, the actual structure of our graph feature pyramid network is determined on the fly by the above superpixel hierarchy of the input image. In fact, the graph feature pyramid network inherits its structure from the superpixel hierarchy by mapping superpixels to graph nodes. Graph edges are set up between neighboring superpixels in the same level as well as corresponding superpixels in ancestor-descendant relationships. Correspondences are also set up between the levels in our graph feature pyramid network and a subset of layers in the feature extraction backbone. Initial features at all graph nodes are first mapped from the features at their corresponding positions in the backbone. Contextual and hierarchical graph neural network layers are designed to promote feature interactions within the same scale and across different scales, respectively. Hierarchical layers make corresponding features from all different scales interact directly. Final features at all levels of the graph feature pyramid are fused with the features in a conventional feature pyramid network to produce enhanced multi-scale features.

Our contributions in this paper are summarized below.

We propose a novel graph feature pyramid network to exploit intrinsic image structures and support simultaneous feature interactions across all scales. This graph feature pyramid network inherits its structure from a superpixel hierarchy of the input image. Contextual and hierarchical layers are designed to promote feature interactions within the same scale and across different scales, respectively.

We further introduce two types of local channel attention mechanisms for graph neural networks by generalizing existing global channel attention mechanisms for convolutional neural networks.

Extensive experiments on the MS-COCO 2017 validation and test datasets [31] demonstrate that our graph feature pyramid network helps achieve clearly better performance than existing state-of-the-art object detection methods, regardless of whether they are feature pyramid based or not. The reported ablation studies further verify the effectiveness of the proposed network components.

Figure 1: The proposed graph feature pyramid network (GraphFPN) is a graph neural network built on a superpixel hierarchy. GraphFPN receives mapped multi-scale features from the convolutional backbone. These features pass through a number of contextual and hierarchical layers in the GraphFPN before being mapped back to rectangular feature maps, which are then fused with the feature maps from the convolutional FPN for subsequent object detection.

2 Related Work

Feature Pyramids. Feature pyramids present high-level feature maps across a range of scales, and work together with backbone networks to achieve improved and more balanced performance across multiple scales in object detection [30, 32, 26, 55, 52] and semantic segmentation [32, 54, 28, 17, 52, 48]. Recent work on feature pyramids can be categorized into three groups: top-down networks [42, 44, 30, 54, 3, 37], top-down/bottom-up networks [27, 32], and attention based methods [52]. The feature pyramid network (FPN [30]) exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional neural networks, and builds a top-down architecture with lateral connections to obtain high-level semantic feature maps at all scales. The path aggregation network (PANet [32]) shortens the information path between lower layers and the topmost features with bottom-up path augmentation to enhance the feature hierarchy. ZigZagNet [28] enriches multi-level contextual information not only through dense top-down and bottom-up aggregation, but also through zig-zag crossing between different levels of the top-down and bottom-up hierarchies. The feature pyramid transformer [52] performs active feature interaction across both space and scales with three transformers: the self-transformer enables non-local interactions within individual feature maps, and the grounding/rendering transformers enable successive top-down/bottom-up interactions between adjacent levels of the feature pyramid.

In this paper, we aim to fill the semantic gaps between feature maps at different pyramid levels. The most distinctive characteristic of our graph feature pyramid network, in comparison to the above-mentioned work, is that its topological structure dynamically adapts to the intrinsic structures of the input image. Furthermore, we build a graph neural network across all scales, making simultaneous feature interactions across all scales possible.

Graph Neural Networks. Graph neural networks [24, 49, 51, 9, 12] can model dependencies among nodes flexibly, and can be applied to scenarios with irregular data structures. Graph convolutional networks (GCN [20]) perform spectral convolutions on graphs to propagate information among nodes. Graph attention networks (GAT [49]) leverage local self-attention layers to assign weights to neighboring nodes, and have gained popularity in many tasks. Gao and Ji [6] proposed the graph U-Net with graph pooling and unpooling operations. A graph pooling layer relies on trainable similarity measures to adaptively select a subset of nodes to form a coarser graph, while the graph unpooling layer uses saved information to restore a graph to the structure it had before its paired pooling operation.

We adopt the self-attention mechanism in GAT [49] in our GraphFPN. To further increase the discriminative power of node features, we introduce local channel attention mechanisms for GNNs by generalizing existing global channel attention mechanisms for CNNs. In comparison to Graph U-Net [6], our graph pyramid is built on a superpixel hierarchy. Its node merge and split operations are not just based on local similarity ranking, but also depend on intrinsic image structures, which makes our GraphFPN more effective in image understanding tasks.

Hierarchical Segmentation and GLOM. Understanding images by building part-whole hierarchies has been a long-standing open problem in computer vision [35, 2, 1, 36]. The hierarchical segmentation algorithms in MCG [38] and COB [34] can group the pixels of an image into superpixels using detected boundaries. These superpixels are formed hierarchically to describe objects in a bottom-up manner. Hinton [15] proposed GLOM, an imaginary system that aims to use a neural network with a fixed structure to parse images into image-specific part-whole hierarchies.

Given an input image, we use the hierarchical segmentation in COB [34] to build an image-specific superpixel hierarchy, on top of which we further build our graph feature pyramid network. One of the contributions of this paper lies in using image-specific part-whole hierarchies to enhance multiscale feature learning, which could benefit image understanding tasks including object detection.

Figure 2: Mapping between CNN grid cells and superpixels. Each grid cell is assigned to the superpixel it overlaps most. Each superpixel thus has a small collection of grid cells assigned to it.

3 Graph Feature Pyramid Networks

Our graph feature pyramid network (GraphFPN) aims to enhance a convolutional feature pyramid network by building a multi-scale graph neural network on top of a superpixel hierarchy.

3.1 Superpixel Hierarchy

In a hierarchical segmentation, pixels (or smaller superpixels) are recursively grouped into larger ones according to a similarity measure [38, 34]. Given an image $I$, we rely on convolutional oriented boundaries (COB [34]) to obtain a hierarchical segmentation, which is a family of image partitions $\{P_0, P_1, \ldots, P_n\}$. Note that each superpixel in $P_0$ is a single pixel of the original input image, $P_n$ only has one superpixel representing the entire image, and the numbers of superpixels in $P_i$ and $P_{i+1}$ only differ by one (that is, one of the superpixels in $P_{i+1}$ is a union of two superpixels in $P_i$).

In this paper, we select a subset of partitions from $\{P_0, P_1, \ldots, P_n\}$ to define a superpixel hierarchy $\mathcal{S} = \{\mathcal{S}^1, \mathcal{S}^2, \ldots, \mathcal{S}^5\}$, where the superscript of $\mathcal{S}^l$ stands for the partition level in the segmentation hierarchy, $\mathcal{S}^1$ is the finest set of superpixels in the hierarchy, and superpixels in $\mathcal{S}^{l+1}$ are unions of superpixels in $\mathcal{S}^l$. To match the downsampling rate of convolutional neural networks, the partitions are chosen such that the number of superpixels in $\mathcal{S}^{l+1}$ is 1/4 of that in $\mathcal{S}^l$. The superpixel hierarchy can then be used to represent the part-whole hierarchy of the input image and to track the ancestor-descendant relationships between superpixels.
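To make the level selection concrete, the following sketch picks partitions whose superpixel counts decrease by roughly a factor of 4 per level. It is only an illustration under our own assumptions (a partition is represented as a list of superpixels, and the finest kept level is the first partition passed in); the function name and the exact selection rule are ours, not the authors' released code.

```python
def select_hierarchy_levels(partitions, num_levels=5, ratio=4):
    """Select partitions S^1,...,S^num_levels whose superpixel counts shrink by ~1/ratio.

    partitions: hierarchical segmentation as a list of partitions ordered fine to coarse,
                where each partition is a list of superpixels (e.g., lists of pixel indices).
    """
    selected = [partitions[0]]                       # S^1: the finest level we keep
    for _ in range(num_levels - 1):
        target = max(1, len(selected[-1]) // ratio)  # aim for 1/ratio as many superpixels
        closest = min(partitions, key=lambda p: abs(len(p) - target))
        selected.append(closest)
    return selected
```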

3.2 Multi-scale Graph Pyramid

We construct a graph pyramid, $\mathcal{G} = \{\mathcal{G}^1, \mathcal{G}^2, \ldots, \mathcal{G}^5\}$, whose levels correspond to the levels of the superpixel hierarchy. Every superpixel in the superpixel hierarchy has a corresponding graph node at the corresponding level of the graph pyramid. Thus the number of nodes also decreases by a factor of 4 when we move from one level of the graph pyramid to the next higher level. We define two types of edges for the graph pyramid, called contextual edges and hierarchical edges. A contextual edge connects two adjacent nodes at the same level, while a hierarchical edge connects two nodes at different levels if there is an ancestor-descendant relationship between their corresponding superpixels. Contextual edges are used to propagate contextual information within the same level, while hierarchical edges are used for bridging semantic gaps between different levels. Note that hierarchical edges are dense because there is such an edge between every node and each of its ancestors and descendants. These dense connections incur a large computational and memory cost. Hence, every hierarchical edge is associated with the cosine similarity between its node features, and we prune hierarchical edges according to these similarities: among all hierarchical edges incident to a node, those ranked in the bottom 50% are removed.
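The per-node pruning of hierarchical edges can be sketched as follows. The representation of edges as index pairs and the function name are our assumptions; only the rule described above (keep the top half of each node's hierarchical edges by cosine feature similarity) is taken from the text.

```python
import torch
import torch.nn.functional as F

def prune_hierarchical_edges(node_feats, hierarchical_pairs, keep_ratio=0.5):
    """Keep, for every node, the top keep_ratio fraction of its hierarchical edges
    ranked by the cosine similarity of the features at the two endpoints.

    node_feats:         (N, C) features of all graph nodes (all levels stacked).
    hierarchical_pairs: list of (i, j) pairs where j is an ancestor or descendant of i.
    """
    edges_by_node = {}
    for i, j in hierarchical_pairs:
        sim = F.cosine_similarity(node_feats[i], node_feats[j], dim=0).item()
        edges_by_node.setdefault(i, []).append((sim, j))

    kept = []
    for i, edges in edges_by_node.items():
        edges.sort(key=lambda e: e[0], reverse=True)      # most similar first
        n_keep = max(1, int(len(edges) * keep_ratio))
        kept.extend((i, j) for _, j in edges[:n_keep])
    return kept
```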

3.3 Graph Neural Network Layers

A graph neural network called GraphFPN is constructed on the basis of the graph pyramid. There are two types of layers in GraphFPN, contextual layers and hierarchical layers. These two types of layers use the same set of nodes in the graph pyramid, but different sets of graph edges: contextual layers use contextual edges only, while hierarchical layers use pruned hierarchical edges only. Our GraphFPN has $L_1$ contextual layers at the beginning, $L_2$ hierarchical layers in the middle, and $L_3$ contextual layers at the end. More importantly, each of these layers has its own learnable parameters, which are not shared with any of the other layers. For simplicity, $L_1$, $L_2$ and $L_3$ are always equal in our experiments, and the choice of their specific value is discussed in the ablation studies. The detailed configuration of GraphFPN is given in the supplementary materials.
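A minimal sketch of this layer layout is given below; make_layer stands for a hypothetical factory that creates one GNN layer (such as the attention layer described next), and the three groups keep separate, unshared parameters as stated above.

```python
import torch.nn as nn

def build_graphfpn_layers(make_layer, n):
    """Sketch of the GraphFPN layout: n contextual layers, n hierarchical layers,
    then n more contextual layers, each layer with its own unshared parameters."""
    return nn.ModuleDict({
        "contextual_1": nn.ModuleList([make_layer() for _ in range(n)]),  # contextual edges
        "hierarchical": nn.ModuleList([make_layer() for _ in range(n)]),  # pruned hierarchical edges
        "contextual_2": nn.ModuleList([make_layer() for _ in range(n)]),  # contextual edges
    })
```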

Although contextual and hierarchical layers use different edges, the GNN operations in these two types of layers are exactly the same, and both share the same spatial and channel attention mechanisms. We simply adopt the self-attention mechanism of graph attention networks [49] as our spatial attention. Given node $i$ and its set of neighbors $\mathcal{N}_i$, the spatial attention updates the feature at node $i$ as follows,

$\tilde{h}_i = \mathcal{A}\big(h_i, \{h_j\}_{j \in \mathcal{N}_i}\big)$,   (1)

where $\mathcal{A}(\cdot,\cdot)$ is the single-head self-attention from [49], $\{h_j\}_{j \in \mathcal{N}_i}$ is the set of feature vectors collected from the neighbors of node $i$, and $h_i$ and $\tilde{h}_i$ are respectively the feature vector of node $i$ before and after the update.
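A minimal single-head graph-attention layer in the spirit of Eq. (1) could look as follows. This is our own dense-adjacency sketch with GAT-style LeakyReLU scoring [49], not the authors' implementation; the class name and the adjacency-matrix interface are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionLayer(nn.Module):
    """Single-head GAT-style spatial attention (a sketch of Eq. (1)): every node
    aggregates its neighbors' projected features with learned attention weights."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj):
        # h:   (N, C) node features of one GraphFPN layer
        # adj: (N, N) binary adjacency matrix built from contextual edges or pruned
        #      hierarchical edges; self-loops should be included
        z = self.proj(h)                                          # (N, C)
        n = z.size(0)
        pair = torch.cat([z.unsqueeze(1).expand(n, n, -1),        # features of node i
                          z.unsqueeze(0).expand(n, n, -1)], -1)   # features of node j
        scores = F.leaky_relu(self.attn(pair).squeeze(-1), 0.2)   # (N, N) raw scores
        scores = scores.masked_fill(adj == 0, float("-inf"))      # keep neighbors only
        alpha = torch.softmax(scores, dim=-1)                     # attention coefficients
        return alpha @ z                                          # updated features
```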

The channel attention mechanism is composed of a local channel-wise attention module based on average pooling and a local channel self-attention module. In the average pooling based local channel-wise attention, the feature vectors of node $i$ and its neighbors are first averaged to obtain the feature vector $\bar{h}_i$. We pass the averaged feature vector through a fully connected layer with a sigmoid activation, and perform element-wise multiplication between the result and $\tilde{h}_i$,

$\hat{h}_i = \sigma(W \bar{h}_i) \odot \tilde{h}_i$,   (2)

where $\sigma$ refers to the sigmoid function, $W$ is the learnable weight matrix of the fully connected layer, and $\odot$ stands for element-wise multiplication. In the local channel self-attention module, we first obtain the feature vector collection $H_i$ of node $i$ and its neighbors, and reshape $H_i$ to $\mathbb{R}^{(K_i+1) \times C}$, where $K_i$ is the size of the neighborhood of node $i$ and $C$ is the number of feature channels. Next we obtain the channel similarity matrix $M_i = H_i^{\top} H_i \in \mathbb{R}^{C \times C}$, and apply the softmax function to every row of $M_i$. The output of the local channel self-attention module is

$h^{\prime}_i = \hat{h}_i + \gamma\, \hat{h}_i\, \mathrm{softmax}(M_i)$,   (3)

where $\gamma$ is a learnable weight initialized to 0 as in [5].

Our local channel-wise attention and local channel self-attention are inspired by SENet [16] and the Dual Attention Network [5]. The main difference is that our channel attention is defined within local neighborhoods and is thus spatially varying from node to node, while SENet and the Dual Attention Network apply the same channel attention to the features at all spatial locations. Advantages of local channel attention in a graph neural network include much lower computational cost and higher spatial adaptivity, which make it well suited for large networks such as our GraphFPN. The ablation study in Table 5 demonstrates that our dual local channel attention is rather effective in our GraphFPN.
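The two local channel attentions of Eqs. (2) and (3) could be sketched as below, assuming each node's neighborhood is given as a list of index tensors; the class name and this interface are our assumptions, not the released code.

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """Sketch of the two local channel attentions (Eqs. (2)-(3)): a sigmoid gate driven
    by the neighborhood-averaged feature, followed by a local channel self-attention
    with a zero-initialized residual weight."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)               # W in Eq. (2)
        self.gamma = nn.Parameter(torch.zeros(1))   # residual weight in Eq. (3), starts at 0

    def forward(self, h, neighbors):
        # h:         (N, C) node features after spatial attention
        # neighbors: list of LongTensors; neighbors[i] holds the indices of node i's neighbors
        outputs = []
        for i, nbr in enumerate(neighbors):
            idx = torch.cat([torch.tensor([i]), nbr])          # node i plus its neighbors
            local = h[idx]                                     # (K_i + 1, C)
            # Eq. (2): channel-wise gate from the averaged local feature
            gate = torch.sigmoid(self.fc(local.mean(dim=0)))   # (C,)
            h_hat = gate * h[i]                                # (C,)
            # Eq. (3): channel self-attention over the C x C similarity matrix
            sim = torch.softmax(local.t() @ local, dim=-1)     # (C, C), row-wise softmax
            outputs.append(h_hat + self.gamma * (h_hat @ sim))
        return torch.stack(outputs)                            # (N, C)
```

Starting gamma at zero lets the gated features pass through unchanged at initialization, so the self-attention branch is blended in gradually during training.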

3.4 Feature Mapping between GNN and CNN

Convolutional neural networks can preserve the position information of parts and objects, which clearly benefits object detection, while graph neural networks can flexibly model dependencies among parts and objects across multiple semantic scales. Note that the backbone and FPN in a convolutional neural network are respectively responsible for multiscale encoding and decoding, while our GraphFPN is primarily responsible for multiscale decoding. Thus features from the backbone serve as the input to the GraphFPN. To take advantage of both types of feature pyramid networks, we also fuse the final features from both the GraphFPN and the convolutional FPN. Therefore, we need to map features from the backbone to initialize the GraphFPN, and also map final features from the GraphFPN to the convolutional FPN before feature fusion. The multi-scale feature maps in the backbone and the convolutional FPN are denoted as $\mathcal{C} = \{C_1, C_2, \ldots, C_5\}$ and $\mathcal{P} = \{P_1, P_2, \ldots, P_5\}$, respectively. Note that the feature maps in $\mathcal{C}$ are the final feature maps of the five convolutional stages in the backbone.

Mapping from CNN to GNN ($\mathcal{C} \rightarrow \mathcal{G}$): We map the $l$-th feature map $C_l$ of the backbone to the $l$-th level $\mathcal{G}^l$ of the graph pyramid. Features in $C_l$ lie on a rectangular grid, where each grid cell corresponds to a rectangular region of the original input image, while superpixels in $\mathcal{S}^l$ usually have irregular shapes. If multiple superpixels partially overlap with the same grid cell in $C_l$, as shown in Figure 1(c), we assign the grid cell to the superpixel with the largest overlap. Such assignments result in a small collection of grid cells $R_i^l$ assigned to the same superpixel $s_i^l$ in $\mathcal{S}^l$. We perform both max pooling and min pooling over this collection, and feed the concatenated pooling results to a fully connected layer with ReLU activation. The mapped feature of $s_i^l$ can be written as

$g_i^l = \mathrm{ReLU}\left(W_m\left[\mathrm{MaxPool}(R_i^l); \mathrm{MinPool}(R_i^l)\right]\right)$,   (4)

where $\mathrm{ReLU}$ stands for the ReLU activation, $W_m$ is the learnable weight matrix of the fully connected layer, $[\cdot\,;\cdot]$ refers to the concatenation operator, and $\mathrm{MaxPool}$ and $\mathrm{MinPool}$ stand for the max-pooling and min-pooling operators respectively.
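A sketch of the mapping in Eq. (4) is given below, assuming the grid-cell-to-superpixel assignments have been precomputed as flattened index lists; the class name and the interface are ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNToGNN(nn.Module):
    """Sketch of Eq. (4): initialize each graph node from the backbone feature map
    by max- and min-pooling the grid cells assigned to its superpixel."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(2 * in_dim, out_dim)   # W_m applied to [max ; min] pooling

    def forward(self, feat_map, assignments):
        # feat_map:    (C, H, W) backbone feature map C_l of one level
        # assignments: list of LongTensors; assignments[i] holds the flattened grid-cell
        #              indices (into H*W) assigned to superpixel i
        c = feat_map.size(0)
        flat = feat_map.reshape(c, -1)                      # (C, H*W)
        node_feats = []
        for cells in assignments:
            region = flat[:, cells]                         # (C, |cells|)
            pooled = torch.cat([region.max(dim=1).values,   # max pooling over the cells
                                region.min(dim=1).values])  # min pooling over the cells
            node_feats.append(F.relu(self.fc(pooled)))
        return torch.stack(node_feats)                      # (num_superpixels, out_dim)
```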


Method Training Strategy AP AP_50 AP_75 AP_S AP_M AP_L
Faster R-CNN [40] Baseline 33.1 53.8 34.6 12.6 35.3 49.5
Faster R-CNN+FPN [30] Baseline 36.2 59.1 39.0 18.2 39.0 52.4
Faster R-CNN+FPN [30] MT+AH 37.9 59.6 40.1 19.6 41.0 53.5
PAN [32] Baseline 37.3 60.4 39.9 18.9 39.7 53.0
PAN [32] MT+AH 39.0 60.8 41.7 20.2 41.5 54.1
ZigZagNet [28] Baseline 39.5
ZigZagNet [28] MT+AH 40.1 61.2 42.6 21.9 42.4 54.3
Faster R-CNN+FPN+FPT [52] Baseline 41.6 60.9 44.0 23.4 41.5 53.1
Faster R-CNN+FPN+FPT [52] AH 41.1 62.0 46.6 24.2 42.1 53.3
Faster R-CNN+FPN+FPT [52] MT 41.2 62.1 46.0 24.1 41.9 53.2
Faster R-CNN+FPN+FPT [52] MT+AH 42.6 62.4 46.9 24.9 43.0 54.5
Ours Baseline 42.1 61.3 46.1 23.6 41.1 53.3
Ours AH 42.7 63.0 47.2 25.6 43.1 53.3
Ours MT 42.4 62.7 46.9 24.3 43.1 53.6
Ours MT+AH 43.7 (+1.1) 64.0 (+1.6) 48.2 (+1.3) 27.2 (+2.3) 43.4 (+0.4) 54.2 (−0.3)
Table 1: Comparison with state-of-the-art feature pyramid based methods on MS-COCO 2017 test-dev [31]. “AH” and “MT” stand for augmented head and multi-scale training strategies [32] respectively. The backbone of all listed methods is ResNet101 [11].

Method Detection Framework AP AP_50 AP_75 AP_S AP_M AP_L
RetinaNet + FPN [29] RetinaNet 40.4 60.2 43.2 24.0 44.3 52.2
Faster R-CNN+FPN [30] Faster R-CNN 42.0 62.5 45.9 25.2 45.6 54.6
DETR [4] Set Prediction 44.9 64.7 47.7 23.7 49.5 62.3
Deformable DETR [56] Set Prediction 43.8 62.6 47.7 26.4 47.1 58.0
Sparse R-CNN+FPN [45] Sparse R-CNN 45.6 64.6 49.5 28.3 48.3 61.6
Ours Faster R-CNN 46.7 (+1.1) 65.1 (+0.5) 50.1 (+0.6) 29.2 (+0.9) 49.1 (+0.8) 61.8 (+0.2)
Table 2: Comparison with other popular object detectors on MS-COCO 2017 val set [31]. The backbone of all listed methods is ResNet101 [11].

Mapping from GNN to CNN ($\mathcal{G} \rightarrow \mathcal{P}$): Once we run a forward pass through the GraphFPN, we map the features of its last layer to the convolutional feature pyramid $\mathcal{P}$. Let $R_i^l$ be the collection of grid cells in $C_l$ assigned to the superpixel $s_i^l$ in $\mathcal{S}^l$. We simply copy the final feature at $s_i^l$ to every grid cell in $R_i^l$. In this way, we obtain a new feature map $\tilde{P}_l$ for the $l$-th level of the convolutional FPN. We concatenate $\tilde{P}_l$ with $P_l$, and feed the concatenated feature map to a convolutional layer with $1 \times 1$ kernels to ensure the fused feature map $\hat{P}_l$ has the same number of channels as $P_l$. Finally, the fused feature pyramid is $\hat{\mathcal{P}} = \{\hat{P}_1, \hat{P}_2, \ldots, \hat{P}_5\}$.
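The reverse mapping and fusion could be sketched as follows, under the same assumed assignment format; copying node features back to grid cells, concatenating with P_l, and reducing channels with a 1x1 convolution follow the description above, while the class name and interface are ours.

```python
import torch
import torch.nn as nn

class GNNToCNNFusion(nn.Module):
    """Sketch of the GNN-to-CNN mapping and fusion: copy each node's final feature to
    the grid cells of its superpixel, then fuse with the FPN map via a 1x1 convolution."""

    def __init__(self, node_dim, fpn_dim):
        super().__init__()
        self.fuse = nn.Conv2d(node_dim + fpn_dim, fpn_dim, kernel_size=1)

    def forward(self, node_feats, assignments, fpn_map):
        # node_feats:  (num_superpixels, node_dim) final GraphFPN features at one level
        # assignments: list of LongTensors with flattened grid-cell indices per superpixel
        # fpn_map:     (fpn_dim, H, W) feature map P_l of the convolutional FPN
        d, (_, h, w) = node_feats.size(1), fpn_map.shape
        grid = node_feats.new_zeros(d, h * w)
        for i, cells in enumerate(assignments):
            grid[:, cells] = node_feats[i].unsqueeze(1)    # copy node feature to its cells
        grid = grid.view(d, h, w)
        fused = torch.cat([grid, fpn_map], dim=0)          # (node_dim + fpn_dim, H, W)
        return self.fuse(fused.unsqueeze(0)).squeeze(0)    # (fpn_dim, H, W), the fused P_l
```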

3.5 Object Detection

The proposed graph feature pyramid network can be integrated into the object detection pipeline in [30] by replacing the conventional FPN with the above fused feature pyramid. We adopt Faster R-CNN as our detection algorithm and perform the same end-to-end training. In the following section, we conduct extensive experiments on object detection to validate the effectiveness of the proposed method.

4 Experiments

Datasets. We evaluate the proposed method on the MS-COCO 2017 detection dataset [31], which contains 118k training images, 5k validation images and 20k testing images. The metrics for performance evaluation include the standard average precision (AP), AP_50, AP_75, AP_S, AP_M, and AP_L. We report ablation study results on the validation set, and results on the standard test set for comparison with state-of-the-art algorithms.

Implementation details.

We have fully implemented our GraphFPN using PyTorch, and all models used in this paper are trained on 8 NVidia TITAN 2080Ti GPUs. As a common practice [30, 28], all backbone networks are pretrained on the ImageNet1k image classification dataset [23], and then fine-tuned on the training set of the detection dataset. Faster R-CNN [40] is adopted as our object detection framework, and we follow the settings in FPT [52] to set up the detection heads. During training, we adopt Adam [19] as our optimizer, and set the weight decay and momentum to 0.0001 and 0.9 respectively. Every mini-batch contains 16 images distributed over the 8 GPUs with synchronized batch normalization (SBN [53]). For fair comparison, input images are resized to 800/1,000 pixels along the shorter/longer edge. The models used in all experiments are trained for 36 epochs on the detection training set. The initial learning rate is set to 0.001, and is decreased by a factor of 10 at two later epochs. It takes 38 hours to train a Faster R-CNN model integrated with our GraphFPN on the COCO dataset.
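For reference, a hedged sketch of the optimizer and schedule described above is shown below; the "momentum" of 0.9 is interpreted as Adam's first beta coefficient, and the epoch indices of the two learning-rate drops are hypothetical placeholders since the text does not specify them.

```python
import torch

def make_optimizer(model):
    # Adam with lr 0.001 and weight decay 0.0001; momentum 0.9 interpreted as beta1 (assumption)
    return torch.optim.Adam(model.parameters(), lr=0.001,
                            betas=(0.9, 0.999), weight_decay=0.0001)

def make_lr_scheduler(optimizer, milestones=(24, 33)):
    # Two 10x learning-rate drops; the milestone epochs here are placeholders, not from the paper
    return torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                milestones=list(milestones), gamma=0.1)
```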

We use the code provided by the COB project (https://cvlsegmentation.github.io/cob/) [34] to compute hierarchical segmentations, and build a superpixel hierarchy for each image during data preparation. It takes 0.120 seconds on average to build the superpixel hierarchy of an image, which is reasonable for an object detection task. Note that the machine learning models used in COB are always trained on the same training set as the detection task.

Figure 3: Sample detection results from FPN [30], FPT [52], and our GraphFPN based method.
Methods Params GFLOPs Test Speed (s)
Faster R-CNN [40] 34.6 M 172.3 0.139
Faster R-CNN + FPN [30] 64.1 M 240.6 0.051
Faster R-CNN + FPN + FPT [52] 88.2 M 346.2 0.146
Faster R-CNN + FPN + GraphFPN 100.0 M 380.0 0.157
COB + Faster R-CNN + FPN + GraphFPN 121.0 M 393.1 0.277
Table 3: The number of learnable parameters, the total computational cost, and the average test speed of a few detection models. All experiments are run on an NVidia TITAN 2080Ti GPU.

4.1 Comparison with State-of-the-Art Methods

We compare the object detection performance of our method (GraphFPN+FPN) with existing state-of-the-art feature pyramid based methods, including feature pyramid networks (FPN [30]), path aggregation networks (PANet [32]), ZigZagNet [28] and feature pyramid transformers (FPT [52]), using Faster R-CNN as the detection framework, to verify the effectiveness of the feature interactions in both the contextual layers and the hierarchical layers.

Table 1 shows experimental results achieved with the above mentioned state-of-the-art methods on MS-COCO 2017 test-dev [31] under various settings. Our method achieves the highest AP (43.7%), outperforming the other state-of-the-art algorithms by at least 1.1%, and maintains a lead on AP_50, AP_75, AP_S, and AP_M. When compared with the Faster R-CNN baseline [40], the AP of our method is 10.6% higher, which indicates that multi-scale high-level feature learning is crucial for object detection. When our method is compared with FPN alone [30], the improvement in AP reaches 7.5%, which further indicates that GraphFPN significantly enhances the original multi-scale feature learning conducted with FPN, and that multi-scale feature interaction and fusion are very effective for object detection. Such improvements also illustrate that graphs built on top of superpixel hierarchies are capable of capturing intrinsic structures of images, and are helpful in high-level image understanding tasks. When compared with FPT [52], our method achieves better performance on five evaluation metrics, namely AP, AP_50, AP_75, AP_S and AP_M, but not on AP_L. We attribute this performance to three factors. First, graph neural networks propagate information across different semantic scales more efficiently by connecting nodes dynamically, while FPT has to broadcast information in a cascaded manner through its top-down and bottom-up combination. Second, the superpixel hierarchy captures the intrinsic structures of images, which benefits the detection of small-scale objects; indeed, our method achieves a 2.3% improvement on AP_S in comparison to FPT. Third, superpixel hierarchies are not as well suited for the detection of large-scale objects, which can be seen from the inferior result on AP_L.

4.2 Comparison with Other Object Detectors

In addition to the comparison with feature pyramid based detection methods, we further compare our method with other popular detectors. As shown in Table 2, our method based on Faster R-CNN + FPN + GraphFPN outperforms all such detectors, including RetinaNet [29], DETR [4], Deformable DETR [56] and Sparse R-CNN+FPN [45], by a clear margin when they use the same backbone as our method. Our method achieves compelling performance under all six performance metrics. This demonstrates that our GraphFPN is capable of significantly enhancing the feature representation of a detection network, which in turn leads to superior detection performance.

4.3 Learnable Parameters and Computational Cost

Table 3 provides the number of learnable parameters, the total computational cost, and the average test speed of a few detection models. Faster R-CNN [40] serves as our baseline; it has 34.6 million learnable parameters and 172.3 GFLOPs, and takes 0.139 seconds on average to process one image. Our GraphFPN works on top of Faster R-CNN and FPN, and the whole pipeline has 1.89 times more learnable parameters, 1.21 times more GFLOPs, and a 12.9% longer test time than the baseline. If we take the construction of the superpixel hierarchy into consideration, the COB [34] models add 21 million (+21%) parameters, 13.1 (+3.4%) GFLOPs, and 0.120 seconds (+76.4%) of time cost. This is because COB [34] needs to detect contours in an image and build a hierarchical segmentation on the CPU. In fact, hierarchical segmentation could be implemented in CUDA and run on the GPU, which would significantly reduce the test time.

4.4 Ablation Studies

CGL-1 HGL CGL-2 AP AP_S AP_M AP_L
39.1 22.4 38.9 56.7
38.2 22.1 38.7 56.1
38.7 22.1 38.9 56.6
36.2 19.2 36.3 54.4
37.2 22.1 35.1 55.6
Table 4: Ablation study on the contextual and hierarchical layers in GraphFPN. “CGL-1” stands for the first group of contextual layers before the hierarchical layers, “HGL” stands for the hierarchical layers, and “CGL-2” stands for the second group of contextual layers after the hierarchical layers. “✓” and “✗” indicate whether a module is present or not. Detection results are reported on the MS-COCO 2017 val set [31].
SA LCA LSA AP AP_S AP_M AP_L
39.1 22.4 38.9 56.7
37.8 21.9 37.4 56.2
37.9 21.6 37.3 56.4
37.6 21.8 37.7 55.1
37.1 21.1 36.7 54.1
Table 5: Ablation study on the attention mechanisms. “SA” stands for the spatial attention module, “LCA” stands for the local channel-wise attention module and “LSA” stands for the local channel self-attention module. “✓” and “✗” indicate whether a module is present or not. Detection results are reported on the MS-COCO 2017 val set [31].
N AP AP_50 AP_75 AP_S AP_M AP_L
1 36.1 56.3 35.4 19.3 37.9 55.4
2 37.2 57.6 38.5 21.2 38.3 55.8
3 39.1 58.3 39.4 22.4 38.9 56.7
4 38.1 57.8 38.9 22.2 38.6 56.3
5 37.1 57.1 38.0 21.9 37.9 55.4
Table 6: Ablation study on the number of layers in GraphFPN. N is the number of layers in each of the three groups of layers. Hence, the total number of layers is 3N. Detection results are reported on the MS-COCO 2017 val set [31].

To investigate the effectiveness of individual components in our GraphFPN, we conduct ablation studies by replacing or removing a single component from our pipeline. We have specifically designed ablation studies for the configuration of GNN layers (the combination and ordering of different types of GNN layers), the total number of GNN layers, and the spatial and channel attention mechanisms.

GNN Layer Configuration. In our final pipeline, the layers are arranged as follows: a first group of contextual layers, a group of hierarchical layers, and a second group of contextual layers. The number of layers is the same in all groups. Table 4 shows the results of the ablation study on this configuration. When we remove the first group of contextual layers, the AP drops by 0.9%, which means that it is necessary to propagate contextual information within the same scale before cross-scale operations. When we remove the second group of contextual layers instead, the AP drops by 0.4%, which indicates that contextual information propagation is still helpful after the first group of contextual layers followed by a group of hierarchical layers. If we keep only one group of contextual layers or only the hierarchical layers, the AP drops by 2.9% and 1.9% respectively, which indicates that the two types of layers are truly complementary to each other.

Number of GNN Layers. The total number of layers in a GNN affects its overall discriminative ability. Table 6 shows experimental results with different numbers of layers in each group. We find that when N = 3, which means each of the three groups has 3 layers and the total number of layers is 9, our method achieves the best results on all the performance metrics. Too few or too many layers make the performance worse.

Attention Mechanism. In the ablation study shown in Table 5, we verify the effectiveness of the spatial self-attention and the two local channel attention mechanisms. When we remove the spatial self-attention, the AP drops by 1.3%, which means the spatial attention is powerful in modeling neighborhood dependencies. If we remove the local average-pooling based channel-wise attention or the local channel self-attention, the AP drops by 1.2% and 1.5% respectively. This demonstrates that the two local channel attention mechanisms are complementary to each other, and significantly improve the discriminative ability of deep features. If we completely remove both channel attention mechanisms, the AP drops by 2.0%.

5 Conclusions

In this paper, we have presented graph feature pyramid networks that are capable of adapting their topological structures to varying intrinsic structures of input images, and supporting simultaneous feature interactions across all scales. Our graph feature pyramid network inherits its structure from a superpixel hierarchy constructed according to a hierarchical segmentation. Contextual and hierarchical graph neural network layers are defined to achieve feature interactions within the same scale and across different scales, respectively. To make these layers more powerful, we further introduce two types of local channel attention for graph neural networks. Extensive experiments demonstrate that Faster R-CNN+FPN integrated with our graph feature pyramid network outperforms existing state-of-the-art object detection methods on MS-COCO 2017 validation and test datasets.

References

  • [1] S. Belongie, J. Malik, and J. Puzicha (2002) Shape matching and object recognition using shape contexts. IEEE transactions on pattern analysis and machine intelligence 24 (4), pp. 509–522. Cited by: §2.
  • [2] E. Bienenstock, S. Geman, and D. Potter (1997) Compositionality, mdl priors, and object recognition. Advances in neural information processing systems, pp. 838–844. Cited by: §2.
  • [3] P. Bilinski and V. Prisacariu (2018) Dense decoder shortcut connections for single-pass semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6596–6605. Cited by: §2.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §1, Table 2, §4.2.
  • [5] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §3.3, §3.3.
  • [6] H. Gao and S. Ji (2019) Graph u-nets. In Proceedings of the 36th International Conference on Machine Learning, Cited by: §2, §2.
  • [7] G. Ghiasi, T. Y. Lin, and Q. V. Le (2019) NAS-fpn: learning scalable feature pyramid architecture for object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • [9] L. Gong and Q. Cheng (2019-06) Exploiting edge features for graph neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [10] C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan (2020) AugFPN: improving multi-scale feature learning for object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, Table 1, Table 2.
  • [12] (2019) Heterogeneous graph attention network. In The World Wide Web Conference, Cited by: §2.
  • [13] G. E. Hinton, S. Sabour, and N. Frosst (2018) Matrix capsules with em routing. In International conference on learning representations, Cited by: §1.
  • [14] G. Hinton (1979) Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science 3 (3), pp. 231–250. Cited by: §1.
  • [15] G. Hinton (2021) How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627. Cited by: §2.
  • [16] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §3.3.
  • [17] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen (2020) Doubleu-net: a deep convolutional neural network for medical image segmentation. In 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 558–564. Cited by: §1, §2.
  • [18] Y. Kim, B. N. Kang, and D. Kim (2018) SAN: learning relationship between convolutional features for multi-scale object detection. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §1.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [20] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
  • [21] T. Kong, F. Sun, W. Huang, and H. Liu (2018) Deep feature pyramid reconfiguration for object detection. Cited by: §1.
  • [22] A. R. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton (2019) Stacked capsule autoencoders. arXiv preprint arXiv:1906.06818. Cited by: §1.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §4.
  • [24] G. Li, M. Muller, A. Thabet, and B. Ghanem (2019) Deepgcns: can gcns go as deep as cnns?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9267–9276. Cited by: §2.
  • [25] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6054–6063. Cited by: §1.
  • [26] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) Detnet: design backbone for object detection. In Proceedings of the European conference on computer vision (ECCV), pp. 334–350. Cited by: §2.
  • [27] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang (2018) Multi-scale context intertwining for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 603–619. Cited by: §2.
  • [28] D. Lin, D. Shen, S. Shen, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang (2019) Zigzagnet: fusing top-down and bottom-up context for object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7490–7499. Cited by: §1, §2, Table 1, §4.1, §4.
  • [29] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99), pp. 2999–3007. Cited by: §1, Table 2, §4.2.
  • [30] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §1, §2, §3.5, Table 1, Table 2, Figure 3, §4.1, §4.1, Table 3, §4.
  • [31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, Table 1, Table 2, §4.1, Table 4, Table 5, Table 6, §4.
  • [32] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759–8768. Cited by: §1, §1, §2, Table 1, §4.1.
  • [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • [34] K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool (2017) Convolutional oriented boundaries: from image segmentation to high-level tasks. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 819–833. Cited by: §1, §2, §2, §3.1, §4.3, §4.
  • [35] D. Marr (1982) Vision: a computational investigation into the human representation and processing of visual information. Cited by: §2.
  • [36] C. Pantofaru, C. Schmid, and M. Hebert (2008) Object recognition by integrating multiple image segmentations. In European conference on computer vision, pp. 481–494. Cited by: §2.
  • [37] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018) Megdet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189. Cited by: §2.
  • [38] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik (2016) Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE transactions on pattern analysis and machine intelligence 39 (1), pp. 128–140. Cited by: §1, §2, §3.1.
  • [39] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: Table 1, §4.1, §4.3, Table 3, §4.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1.
  • [42] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  • [43] S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. arXiv preprint arXiv:1710.09829. Cited by: §1.
  • [44] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta (2016) Beyond skip connections: top-down modulation for object detection. arXiv preprint arXiv:1612.06851. Cited by: §2.
  • [45] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, and C. Wang (2020) Sparse r-cnn: end-to-end object detection with learnable proposals. Cited by: §1.
  • [46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
  • [47] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. Cited by: §1.
  • [48] A. Tao, K. Sapra, and B. Catanzaro (2020) Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821. Cited by: §1, §2.
  • [49] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. ICLR. Cited by: §2, §2, §3.3.
  • [50] H. Xu, L. Yao, Z. Li, X. Liang, and W. Zhang (2020) Auto-fpn: automatic network architecture adaptation for object detection beyond classification. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1.
  • [51] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. Cited by: §2.
  • [52] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun (2020) Feature pyramid transformer. In European Conference on Computer Vision, pp. 323–339. Cited by: §1, §1, §2, Table 1, Table 2, Figure 3, §4.1, §4.1, §4.2, Table 3, §4.
  • [53] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7151–7160. Cited by: §4.
  • [54] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun (2018) Exfuse: enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–284. Cited by: §1, §2.
  • [55] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling (2019) M2det: a single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9259–9266. Cited by: §2.
  • [56] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: Table 2, §4.2.