Object detection is one of the most prominent problems in computer vision. Many real-world applications, including image retrieval, advanced driver assistance systems and video surveillance, rely on high-performance object detection. In the past decades, impressive progress has been made in both the accuracy and the speed of object detection. In most prevalent object detection baselines[1, 2], exploiting the feature of each proposal individually is still the primary way to improve performance, rather than taking the correlation among proposals into account, although it is widely accepted that context information helps image understanding and the relationship among proposals is widely applied in many other computer vision tasks.
Analysing the frameworks that bring the correlation among proposals into object detection, we notice several reasons why these methods are not popular:
- Extra models are needed for the context information, which reduces time and memory efficiency.
- It is difficult to model the complex relationship among proposals properly, which requires modelling spatial and semantic, local and global information simultaneously.
To circumvent the limitations mentioned above, we depict the relationship among proposals with a graph convolutional neural network (GCN), which seems a more reasonable model than the alternatives. Proposals in one image are treated as the vertices, and the Intersection over Union (IoU) is the weight of the edges between proposal vertices. Thanks to the low computational complexity of GCN, our model improves AP efficiently and effectively, without a significant increase in time consumption.
Another contribution of this work is a graph cut layer that introduces global features to each proposal. Inspired by the successful application of non-local information in object detection, we extract global features through the graph. Specifically, we coarsen the graph hierarchically to get a sparse and global description of all proposals, using a pooling method based on a graph cut algorithm.
We evaluate our model on the COCO dataset and achieve a stable improvement over the baselines. Our method and experimental results are presented in the remainder of this paper.
2 Related Works
Before CNNs became prevalent in object detection, DPM was a widely studied model that detects objects by sliding windows over image pyramids. R-CNN achieved great success by applying CNNs to object proposals. By computing feature maps globally instead of locally and sharing computation, Fast R-CNN and SPPNet enhanced the performance of the model. Faster R-CNN further introduced a network that generates high-quality object proposals, namely the region proposal network, which increases the computation speed even further. Other recently proposed methods[13, 14, 15] improve object detection performance by altering the network structure.
Relation and Contextual Information
Many object detection methods based on deep neural networks only use internal features to classify the proposals. However, the relation with the surroundings and contextual information are also important for object detection. Bell et al. adopted skip pooling and spatial recurrent neural networks to construct an Inside-Outside Net, which integrates both the inside and outside information of the regions of interest. Li et al. first constructed an attention-based global contextualized sub-network, which uses multi-layer long short-term memory networks to generate an attention map representing the global contextual information, together with a multi-scale local contextualized sub-network that captures the surrounding local context, and then imported the global and local contextual information into region-based object detectors. Ma et al. developed a feature extractor by packing recurrent neural networks into convolutional neural networks, which combines global and local information in object detection. All these efforts extend the features with local or global contextual information, which demonstrates the importance of relation and contextual information in object detection.
Graph Neural Network
Graph neural networks, or GNNs, are deep learning based methods that operate on the graph domain. Early studies on graph neural networks by Gori et al. and Scarselli et al.[20, 21] adopted recurrent network architectures, which transfer information between neighbouring nodes to learn a new representation of the target node.
Inspired by convolutional neural networks, Bruna et al. defined graph convolution based on spectral graph theory and applied it to node classification. Further studies such as Edwards et al. and Defferrard et al. followed with improvements and extensions of spectral-based graph convolutional networks. Duvenaud et al., Atwood et al. and Hamilton et al. define convolutions directly on the graph, namely non-spectral graph convolutional networks, which operate on neighbours spatially. One kind of GCN is illustrated in Fig. 1.
A graph is $G = (V, E, A)$, where $V$ is the set of nodes, $E$ is the set of edges, and $A$ is the adjacency matrix. In a graph, let $v_i \in V$ denote a node and $e_{ij} = (v_i, v_j) \in E$ denote an edge. The adjacency matrix $A$ is an $N \times N$ matrix with $A_{ij} = w_{ij} > 0$ if $e_{ij} \in E$ and $A_{ij} = 0$ if $e_{ij} \notin E$. The degree of a node is the number of edges connected to it. For each node $v_i$ in the graph $G$, the graph convolution can be formulated as learning a function $f$ that takes node $v_i$'s feature $x_i$ and its neighbours' features $x_j$ as input and outputs $v_i$'s new representation, where $j \in N(v_i)$ ranges over the neighbours of $v_i$. Many other graph neural networks take graph convolution as their core part, such as auto-encoder based models and spatial-temporal networks.
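The definition above can be made concrete with a minimal sketch of one spatial graph convolution step. It assumes mean aggregation over a node and its neighbours followed by a linear map and a ReLU; the function name `graph_conv` and this particular aggregation are illustrative choices, not the formulation of any specific method cited here.

```python
import numpy as np

def graph_conv(X, A, W):
    """One illustrative graph convolution step.

    X: (N, F) node features, A: (N, N) adjacency matrix,
    W: (F, F_out) weight matrix (learned in practice).
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops so a node keeps its own feature
    deg = A_hat.sum(axis=1, keepdims=True)  # weighted degree of every node
    H = (A_hat / deg) @ X                   # average over the node and its neighbours
    return np.maximum(H @ W, 0.0)           # linear map followed by ReLU

# Path graph 0 - 1 - 2 with scalar features and an identity weight.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1.], [2.], [3.]])
out = graph_conv(X, A, np.eye(1))
```

With the identity weight, each output is simply the average of a node's own feature and those of its direct neighbours.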
Based on the attention mechanism, Velickovic et al. developed graph attention networks, or GATs, which are in fact spatial-based graph convolutional networks. The key difference is that GATs assign larger weights to more important nodes through an attention mechanism that is learned jointly with the model, as illustrated in Fig. 2.
There are also other notable efforts on GNNs. For example, Henaff et al. extended graph convolutional networks to large-scale datasets such as ImageNet object recognition, text categorization, and bioinformatics. Meanwhile, Niepert et al. proposed PATCHY-SAN, which defines operations for node sequence selection, neighborhood assembly, and graph normalization. As we will show later, these models successfully made CNNs work in the graph setting, but they still lack careful consideration of the particularities of graph structures in the network design.
3 Our Approach
In the object detection task, correlations between proposals always exist, and they deserve to be exploited better to enrich the representation of proposals. However, there are two difficulties in modelling the correlation of proposals. First, the structure of the proposals and the relations among them are not well defined. Second, the module that extracts relational and contextual information should be efficient.
In this section, we propose the relational proposal graph network (RepGN) to leverage the coherent relationships between objects, in terms of both spatial layout and co-occurrence probability, that occur naturally in the real world. The use of spatial and semantic relations helps to estimate the class and location of an object. We build a graph convolution based module that refines the location and class probability estimates of the proposals from the object detector by performing spatial and semantic reasoning. Specifically, this module takes all object proposals generated by the region proposal network as nodes in an undirected simple graph and creates an edge between every two proposals that overlap. The following parts describe the proposed object detection pipeline in detail.
In the typical object detection pipeline, detectors or region proposal networks are first used to produce a set of detected objects or regions of interest (RoIs). As shown in Figure 3, the features of the RoIs generated by the RPN are fed into RepGN. Thanks to RoI Pooling or RoI Align, the feature maps of all RoIs have a uniform shape, and we treat these features as nodes. Formally, we define the input of RepGN, i.e. the output of the RPN, as features $f_i \in \mathbb{R}^{H \times W \times C}$ and positions $p_i$, where $i$ ranges from 1 to $N$ and is the index of the RoI, and $H$, $W$ and $C$ are the height, width and number of channels of the feature respectively. Furthermore, by treating each image region as one vertex, we can construct the proposal graph $G = (V, E)$, where $E$ denotes the set of spatial relations between region vertices. In this work, we treat the Intersection over Union (IoU) as the weight of the edges between proposal vertices.
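The graph construction described above can be sketched in a few lines, assuming boxes are given as (x1, y1, x2, y2) tuples. The helper names `iou` and `build_proposal_graph` are hypothetical and only illustrate the idea of IoU-weighted edges between overlapping proposals.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def build_proposal_graph(boxes):
    """Return a symmetric weight matrix with W[i][j] = IoU(i, j) for i != j."""
    n = len(boxes)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = iou(boxes[i], boxes[j])
    return W

# Two overlapping proposals and one isolated proposal.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
W = build_proposal_graph(boxes)
```

A zero entry means that the two proposals do not overlap, so no edge is created between them.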
After the construction of the proposal graph on the spatial domain, each vertex is represented by its multi-dimensional feature. As shown in Fig. 4, for each node, we concatenate its feature matrix with those of its neighbours on the graph. Then we apply a convolution and an average pooling operation to this set of concatenated feature matrices to produce the semantic similarity between the node and every neighbour. The larger the value is, the higher the similarity and the stronger the correlation. For $N$ proposals, this operation gives us an $N \times N$ affinity matrix.
In this matrix, each row represents the correlation between the corresponding node and all the nodes, including itself. The matrix is then normalized with a softmax. We then multiply the normalized attention with the original nodes, that is, we compute a weighted sum of all nodes according to the correlation weights, and finally get the refined vertex, which integrates the information of all nodes.
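The refinement step can be illustrated with a simplified sketch. It assumes that each RoI feature is flattened to a vector and that a single scoring vector `w` stands in for the convolution and average pooling applied to a concatenated pair; restricting the softmax to overlapping proposals (plus the node itself) is also an assumption made here for illustration. The function name `refine_proposals` is hypothetical.

```python
import numpy as np

def refine_proposals(X, W_iou, w):
    """X: (N, D) flattened RoI features, W_iou: (N, N) IoU weights,
    w: (2 * D,) scoring vector standing in for the conv + average-pool step."""
    N, D = X.shape
    # scores[i, j] = w . concat(X[i], X[j]), computed without building all pairs
    scores = (X @ w[:D])[:, None] + (X @ w[D:])[None, :]
    # keep only overlapping proposals and the node itself
    mask = (W_iou > 0) | np.eye(N, dtype=bool)
    scores = np.where(mask, scores, -np.inf)
    # row-wise softmax (subtract the row maximum for numerical stability)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # each refined node is the attention-weighted sum of the original nodes
    return attn @ X, attn

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_iou = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
refined, attn = refine_proposals(X, W_iou, rng.normal(size=8))
```

In this toy example the third proposal overlaps with no other proposal, so its attention row puts all weight on itself and its feature is left unchanged.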
Thus far, we have obtained a relation-aware representation for every proposal with a graph attention module. For computational efficiency, however, we cannot afford to stack multiple graph attention layers to expand the receptive field of every node. Under this limitation, we cannot get non-local and global contextual information through a single graph attention layer. This motivates us to propose a new graph pooling method that extracts global information in a hierarchical way. This graph coarsening method is based on the normalized cut of a graph, which is introduced in the next part.
3.2 Graph Cut Pool
In this section, we propose a new pooling method that aims to get a general feature for each ground truth and helps to diminish the differences between the local semantic features of proposals.
Nevertheless, two factors hinder this goal: unbalanced proposals and densely distributed ground truths. Previous experiments have demonstrated that the majority of negative proposals have little effect on improving performance, so it is necessary to filter out easy, isolated negative proposal candidates first. Also, the proposals for clustered ground truths are always densely distributed and difficult to separate.
3.2.1 Normalized Cut
Normalized Cut (NCut) is a classical graph cut algorithm designed to avoid cutting off small sets of isolated nodes by measuring the cut cost as a fraction of the total edge connections to all the nodes in the graph.
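Given a graph $G = (V, E)$, NCut aims to divide it into disjoint sets so as to maximize the weights within each sub-graph and minimize the weights of the cut:
$$\mathrm{Ncut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)}$$
where $V_1 \cup V_2 = V$ and $V_1 \cap V_2 = \emptyset$, $\mathrm{cut}(V_1, V_2) = \sum_{u \in V_1, v \in V_2} w(u, v)$ is the total weight of the removed edges, and $\mathrm{assoc}(V_1, V) = \sum_{u \in V_1, t \in V} w(u, t)$ means the total connection from the nodes in $V_1$ to all nodes in the graph.
By minimizing the cost $\mathrm{Ncut}(V_1, V_2)$, the graph gets a balanced partition with little bias towards small isolated sets. More details can be found in the original work of Shi and Malik.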
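As an illustration, a two-way normalized cut can be approximated with the spectral relaxation of Shi and Malik, which takes the second-smallest generalized eigenvector of $(D - W) y = \lambda D y$ and thresholds it. The sketch below assumes a connected graph with strictly positive degrees and a simple threshold at zero; the function names are hypothetical.

```python
import numpy as np

def ncut_value(W, part):
    """Normalized cut cost of the split given by the boolean vector `part`."""
    part = np.asarray(part, dtype=bool)
    cut = W[part][:, ~part].sum()      # weight of edges crossing the split
    assoc_a = W[part].sum()            # connections from side A to all nodes
    assoc_b = W[~part].sum()           # connections from side B to all nodes
    return cut / assoc_a + cut / assoc_b

def ncut_bipartition(W):
    """Approximate the minimum normalized cut with the spectral relaxation."""
    d = W.sum(axis=1)                  # node degrees, assumed strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # symmetric normalized Laplacian D^{-1/2} (D - W) D^{-1/2}
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]        # second-smallest generalized eigenvector
    return y > 0                       # threshold at zero to split the nodes

# Two triangles joined by one weak edge between nodes 2 and 3.
W = np.zeros((6, 6))
for group in [(0, 1, 2), (3, 4, 5)]:
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1
part = ncut_bipartition(W)
cost = ncut_value(W, part)
```

On this toy graph the weak edge is the one that gets cut, so the two triangles end up in different partitions.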
3.2.2 Graph Cut Pool
To address the restrictions mentioned at the beginning of this section, we adopt a two-step graph cut pooling approach. The proposal candidates in one image compose a graph in which the proposals are the vertices and the adjacency matrix holds the IoU between them. In the first stage, we remove the connected components whose number of vertices is below a threshold, in order to refine the proposal candidates. Fig. 5 (b) shows the filtered connected components.
In the second stage, we apply the classical normalized cut algorithm to each connected component in order to separate its proposals into partitions. Fig. 5 (d) shows the partitions produced by normalized cut; isolated node sets are filtered again as in step one.
After conducting graph cut hierarchically and iteratively, the proposals are divided into several parts and assigned pseudo labels. Proposals with the same pseudo label comprise a sub-graph $G_k$. In the last and most important stage, we conduct average graph pooling on each sub-graph:
$$g_k = \frac{1}{N_k} \sum_{v_i \in G_k} f_i \tag{3}$$
In Eq. 3, $N_k$ is the number of vertices in $G_k$, $f_i$ is the feature of vertex $v_i$, and $g_k$ is the output of the Graph Pool Layer.
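The filtering and pooling stages can be sketched as follows. For brevity the sketch pools directly over connected components and omits the normalized cut refinement; the threshold `min_size` and the function names are hypothetical, since the actual threshold value is not given here.

```python
def connected_components(W):
    """Connected components of the graph whose edges are the positive entries of W."""
    n = len(W)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                      # depth-first traversal from `start`
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if W[u][v] > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components

def graph_cut_pool(features, W, min_size=2):
    """Drop components smaller than `min_size`, then average-pool each remaining one."""
    pooled = {}
    for label, comp in enumerate(connected_components(W)):
        if len(comp) < min_size:
            continue                      # isolated proposals are filtered out
        rows = [features[i] for i in comp]
        pooled[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return pooled

features = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
W = [[0.0, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
pooled = graph_cut_pool(features, W)
```

Here the first two proposals overlap and are pooled into one feature, while the third proposal forms a component of size one and is discarded.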
3.3 Identical Normalization
The identical shape of its input and output makes RepGN easy to integrate into an existing object detection model. Furthermore, to reuse pretrained weights and simplify training, we add an identity connection and an identical normalization operation to RepGN, as depicted in Eq. 4:
where the coefficient controls the influence of relational and contextual information and a small constant exists for numerical stability. Neglecting this constant, the refined, local and global context-aware representation of each proposal has the same mean and variance as the original representation without relational information. This normalization weakens the disturbance that RepGN causes to the original detection model.
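One way to obtain the property described above, an output with the same mean and variance as the original representation, can be sketched as follows. The exact form is an assumption made for illustration: the mixing coefficient `lam`, the constant `eps` and the function name `identical_norm` are hypothetical, and the statistics are taken over a single feature vector.

```python
import numpy as np

def identical_norm(f, r, lam=0.5, eps=1e-5):
    """f: original proposal feature, r: relational feature of the same shape."""
    mixed = f + lam * r                                    # identity connection
    normed = (mixed - mixed.mean()) / (mixed.std() + eps)  # zero mean, roughly unit std
    return normed * f.std() + f.mean()                     # restore the statistics of f

rng = np.random.default_rng(0)
f = rng.normal(2.0, 3.0, size=256)
r = rng.normal(size=256)
out = identical_norm(f, r)
```

Because the output statistics match those of the input, layers that follow and were pretrained on the original features see inputs on a familiar scale.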
We verify the validity of the proposed method on the COCO detection dataset with 80 categories, including 118k images for training and 5k for validation. The default COCO detection evaluation metrics are used here, mainly AP at IoU=.50:.95.
To present the general effect of graph neural networks on the object detection task, we take Faster RCNN and Cascade RCNN as baselines. ResNet-50 equipped with FPN layers is selected as the default backbone for most experiments, and the others are based on ResNet-101 to show generality. We implemented the baselines mentioned above and obtained performance comparable to that reported in their original papers. Because the input and output shapes of our RepGN module are identical, it is convenient to insert RepGN after the region proposal network and into the bounding box regressor and classifier.
In the training stage, we select SGD as the optimizer with weight_decay = 0.0001 and adjust the learning rate at epochs 8 and 10. All models are trained for 12 epochs on NVIDIA Tesla V100 GPUs. Note that the learning rate needs to be set proportional to the number of GPUs, for example when training with 4 or 8 GPUs.
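The schedule described above can be sketched as a step schedule whose base rate grows linearly with the number of GPUs. The per-GPU rate `lr_per_gpu`, the decay factor `decay` and the zero-based epoch counting are placeholder assumptions, not values taken from the experiments.

```python
def learning_rate(epoch, num_gpus, lr_per_gpu=0.0025, decay=0.1, steps=(8, 10)):
    """Step schedule with a base rate proportional to the number of GPUs.

    `lr_per_gpu` and `decay` are placeholder values for illustration only.
    `epoch` is counted from 0 here.
    """
    lr = lr_per_gpu * num_gpus   # linear scaling with the number of GPUs
    for step in steps:
        if epoch >= step:        # apply one decay per milestone already reached
            lr *= decay
    return lr

schedule_8_gpus = [learning_rate(e, num_gpus=8) for e in range(12)]
```

With these placeholder values, the rate stays constant for the first eight epochs and is then reduced at each of the two milestones.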
4.2 Experiments Setup
In Faster-RCNN and the other baseline frameworks, the RPN generates a feature map for each proposal. Next, the feature map is converted to an embedding vector. Two fully-connected layers generate the final logits for classification and regression. Additionally, we explicitly treat a vector built from (x1, y1, x2, y2), the top-left and bottom-right normalized coordinates of the proposal, as a spatial representation.
In this section, we conduct a series of incremental experiments to investigate whether the information from neighbouring proposals and the global context, including spatial and semantic representations, can boost classification and regression performance.
First, we select a conservative method: keeping the baseline framework and adding a sequence of graph convolution layers in parallel with the shared fully connected layers, in order to clarify the impact of spatial relationships in object detection. The spatial information of a proposal is a concatenation of its location and size, which also serves as the input vector of the vertex. We employ a two-layer multi-head graph attention network with 8 heads, and the output dimension is 1024, aligned with the baseline.
After that, we fuse the features of every proposal according to their relationship information, both spatial and semantic. The RepGN module can be inserted into the detection pipeline either after the RPN or into the bounding box regression and classification head. To evaluate these two options separately, we denote the first one RepGN(a) and the second one RepGN(b). Furthermore, we conduct experiments on the RepGN module without the GCPool branch to investigate the influence of global contextual information on the final performance.
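A spatial input vector of the kind described above, a concatenation of location and size, could be built as in the following sketch. The exact layout of the vector and the normalization by image width and height are assumptions; `spatial_vector` is a hypothetical helper.

```python
def spatial_vector(box, img_w, img_h):
    """Concatenate normalized corner coordinates with normalized width and height."""
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / img_w, y1 / img_h   # top-left corner in [0, 1]
    nx2, ny2 = x2 / img_w, y2 / img_h   # bottom-right corner in [0, 1]
    return [nx1, ny1, nx2, ny2, nx2 - nx1, ny2 - ny1]

vec = spatial_vector((100, 50, 300, 250), img_w=400, img_h=500)
```

Normalizing by the image size keeps the vertex inputs on the same scale for images of different resolutions.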
4.3 Experiment Result
[Table rows: Train Time (with GAT layers), Inference Time (with GAT layers)]
Table 1 compares the performance of GAT in parallel with the original shared fully connected layers, and proves the feasibility of graph network based methods for enriching the representation of proposals.
As shown in Table 3, our RepGN module improves mAP compared with the original detection pipeline and with the GAT-integrated method. In detail, GCPool provides the most remarkable improvement. These results prove the effectiveness of local and global context information extraction for the object detection task.
Furthermore, Table 2 indicates that RepGN makes a significant improvement in AP without a significant increase in time consumption. The low computation cost makes our RepGN module comparable to other relation and context modeling methods in the object detection field.
In this work, we devise the relational proposal graph network (RepGN) architecture to enrich the features and embeddings of proposals for the object detection task. By organizing the proposals into a graph in the spatial domain, we can propagate information among proposals with a graph network. In this manner, we get a better representation of proposals by incorporating both semantic and spatial object relationships. In detail, we propose a graph pooling method that retrieves a high-level description of all proposals to obtain global contextual information efficiently. We evaluate our model on the COCO dataset and achieve a stable improvement over the baselines.
-  Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
-  Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6154–6162, 2018.
-  Xinlei Chen and Abhinav Gupta. Spatial memory for context reasoning in object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4106–4116, 2017.
-  Lu Qi, Shu Liu, Jianping Shi, and Jiaya Jia. Sequential context encoding for duplicate removal. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 2053–2062, 2018.
-  Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3588–3597, 2018.
-  Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
-  Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3588–3597, 2018.
-  Zhiming Luo, Akshaya Kumar Mishra, Andrew Achkar, Justin A. Eichel, Shaozi Li, and Pierre-Marc Jodoin. Non-local deep features for salient object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6593–6601, 2017.
-  Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
-  Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587, 2014.
-  Ross B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1440–1448, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, pages 346–361, 2014.
-  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 379–387, 2016.
-  Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 764–773, 2017.
-  Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8759–8768, 2018.
-  Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross B. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2874–2883, 2016.
-  Jianan Li, Yunchao Wei, Xiaodan Liang, Jian Dong, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Attentive contexts for object detection. IEEE Trans. Multimedia, 19(5):944–954, 2017.
-  Liangzhuang Ma, Xin Kan, Qianjiang Xiao, Wenlong Liu, and Peiqin Sun. Yes-net: An effective detector based on global information. CoRR, abs/1706.09180, 2017.
-  Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018.
-  M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734 vol. 2, July 2005.
-  Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. Computational capabilities of graph neural networks. IEEE Trans. Neural Networks, 20(1):81–102, 2009.
-  Michael Edwards and Xianghua Xie. Graph based convolutional neural network. CoRR, abs/1609.08965, 2016.
-  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3837–3845, 2016.
-  David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2224–2232, 2015.
-  James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1993–2001, 2016.
-  William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1025–1035, 2017.
-  Thomas N. Kipf and Max Welling. Variational graph auto-encoders. CoRR, abs/1611.07308, 2016.
-  Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 3634–3640, 2018.
-  Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
-  Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163, 2015.
-  Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2014–2023, 2016.
-  Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4141–4150, 2017.
-  Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97), June 17-19, 1997, San Juan, Puerto Rico, pages 731–737, 1997.
-  Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
-  Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 936–944, 2017.