Intrinsic Relationship Reasoning for Small Object Detection

09/02/2020 · by Kui Fu, et al.

The small objects in images and videos are usually not independent individuals. Instead, they more or less present some semantic and spatial layout relationships with each other. Modeling and inferring such intrinsic relationships can thereby be beneficial for small object detection. In this paper, we propose a novel context reasoning approach for small object detection which models and infers the intrinsic semantic and spatial layout relationships between objects. Specifically, we first construct a semantic module to model the sparse semantic relationships based on the initial regional features, and a spatial layout module to model the sparse spatial layout relationships based on their position and shape information, respectively. Both of them are then fed into a context reasoning module for integrating the contextual information with respect to the objects and their relationships, which is further fused with the original regional visual features for classification and regression. Experimental results reveal that the proposed approach can effectively boost the small object detection performance.


1. Introduction

In recent years, various object detection approaches have boomed, which can be attributed to the great success of deep convolutional neural networks (CNNs) (Girshick et al., 2014; Ren et al., 2015; Liu et al., 2016). However, the performance of most CNN-based detectors (He et al., 2017; Redmon et al., 2016) on small objects is still far from satisfactory, since they extract semantically strong features by stacking deep convolutional layers, which is usually accompanied by non-negligible spatial information attenuation. Therefore, a crucial challenge for small object detection is how to capture semantically strong features while minimizing spatial information attenuation.

Figure 1.

Comparison of different strategies for small object detection: (a) Using a super-resolution network to up-sample a blurry low-resolution image, on which a baseline detector is first applied, into a fine-scale high-resolution one, on which the detection results are refined. (b) Our proposed intrinsic relationship graph construction. Each region-to-region object pair is fed into the semantic encoder to calculate its semantic relatedness. Simultaneously, the spatial layout is exploited to calculate the spatial relatedness. The intrinsic relationship can then be well modeled by integrating the semantic and spatial layout relatedness. Note that the two strategies can be complementary to each other.

There is an increasing concern about small object detection. Bai et al. (Bai et al., 2018a, b) propose an intuitive and effective solution, as illustrated in Fig. 1 (a), which employs a super-resolution network to up-sample a blurry low-resolution image into a fine-scale high-resolution one, on which the detection results are refined. Such an approach fundamentally solves the spatial information attenuation problem, but at the cost of a high computational burden. In a complex scene with multiple small objects, small objects belonging to an identical category tend to have similar semantic co-occurrence information, and simultaneously tend to have similar aspect ratios and scales and to appear in clusters in the spatial layout. Accordingly, human beings do not treat each region individually but integrate the inter-object relationships, semantic or spatial, between regions. Such a phenomenon inspires us to explore how to model and infer the intrinsic semantic and spatial layout relationships for boosting small object detection.

To answer this question, we focus on recent works on modeling relationships and find that it is common practice to introduce global contextual information into networks. For instance, PSP-Net (Zhao et al., 2017) and DenseASPP (Yang et al., 2018b) enlarge the receptive field of convolutional layers by combining multi-scale features to model global relationships. Deformable CNN (Dai et al., 2017b) learns offsets for the convolution sampling locations so that the scales or receptive field sizes can be adaptively determined. Moreover, Squeeze-and-Excitation Networks (SE-Net) (Hu et al., 2018b) encode global information via a global average pooling operation to incorporate an image-level descriptor at every stage. However, these methods rely solely on convolutions in the coordinate space to implicitly model and communicate information between different regions, and better performance could be squeezed out if this problem were handled more effectively. In contrast, a Graph Convolutional Network (GCN) is usually regarded as a composition of feature aggregation/propagation and feature transformation (Veličković et al., 2017), thus enabling a global reasoning power that allows regions far apart to directly communicate information with each other. As such, GCNs are suitable for modeling and reasoning about pair-wise high-order object relationships from the image itself, which is expected to be helpful for boosting small object detection.

In this paper, we propose a context reasoning approach based on GCN for small object detection, which encodes the implicit pair-wise regional relationships and propagates the semantic and spatial layout contextual information between regions. The flowchart of relationship construction is illustrated in Fig. 1 (b). It involves three modules: a semantic module for modeling the sparse semantic relationships from the initial regional features, a spatial layout module for modeling the sparse spatial layout relationships from the position and shape information of objects, and a context reasoning module for integrating the sparse semantic and spatial layout contextual information to generate a dynamic scene graph and propagate the contextual information between objects. Experimental results show that the proposed approach can effectively boost small object detection.

The contributions of this work are summarized as follows: 1) We propose a context reasoning approach that can effectively propagate contextual information between regions and update the initial regional features for boosting small object detection. 2) We design a semantic module and a spatial layout module for modeling the semantic and spatial layout relationships, respectively, from the image itself without introducing external handcrafted linguistic knowledge. Such relationships are beneficial for identifying small objects that fall into an identical category in the same scenario. 3) Comprehensive experiments illustrate that our proposed approach can effectively boost small object detection.

2. Related Work

Object Detection.

Object detection is a fundamental problem in the computer vision field, and it is popularized by both two-stage and single-stage detectors. Two-stage detectors are developed from the R-CNN architecture (Girshick et al., 2014), which first generates RoIs (Regions of Interest) via some low-level computer vision algorithm (Zitnick and Dollár, 2014; Uijlings et al., 2013), and then classifies and locates them. SPPNet (He et al., 2015) and Fast R-CNN (Girshick, 2015) exploit spatial pyramid pooling to compute the shared feature map once and then generate region features via RoI pooling. In this manner, the redundant feature extraction computation in R-CNN can be effectively reduced. Faster R-CNN (Ren et al., 2015) further improves efficiency by introducing a region proposal network (RPN) to replace the original stand-alone, time-consuming region proposal methods. To avoid heavy RoI-wise head computation, R-FCN (Dai et al., 2016) constructs position-sensitive score maps through a fully convolutional network. Moreover, the RoI Align layer proposed in Mask R-CNN (He et al., 2017) effectively addresses the coarse spatial quantization problem. FPN (Lin et al., 2017a) integrates low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections to address scale variance. Conventionally, two-stage detectors achieve impressive performance but often at a high computational cost, making it hard to meet the requirements of real-time applications. To alleviate this dilemma, single-stage detectors avoid the time-consuming proposal generation step and classify predefined anchors with CNNs directly, as popularized by YOLO (Redmon et al., 2016; Redmon and Farhadi, 2017) and SSD (Liu et al., 2016). RetinaNet (Lin et al., 2017b) proposes the Focal Loss to reduce the loss weight of easy samples, leading to a smaller performance gap between single-stage and two-stage detectors.

However, existing object detectors suffer from a performance bottleneck in complex scenes with multiple small objects since it is hard for them to strike a balance between capturing semantically strong features and retaining spatial information. Moreover, they treat each region individually and ignore the relationships between objects, which leaves room for further performance improvements.

Small Object Detection.

Small object detection is one of the common problems for existing detection frameworks. In the field of tiny face detection, Bai et al. (Bai et al., 2018a) proposed to employ a super-resolution network to up-sample a blurry low-resolution image into a fine-scale high-resolution one, in the hope of supplementing the spatial information in advance. Later, in (Bai et al., 2018b), Bai et al. proposed a multi-task generative adversarial network to recover detailed information for more accurate detection.

Regardless of their impressive performance, these methods suffer from a high computational burden since they introduce an additional super-resolution network. They also fail to mine the correlation between regions, which limits their small object detection improvements.

Figure 2. The overview of the proposed context reasoning framework. It consists of three modules. A semantic module encodes the intrinsic semantic relationships from the initial regional features. A spatial layout module encodes the intrinsic spatial layout relationships from the position and shape information of objects. A context reasoning module integrates the contextual information between the objects and the sparse relationships, and updates the initial regional features.

Relationship Mining. Relationship mining aims to reasonably interact, propagate, and transform information between objects and scenes. It has been applied in some common visual tasks, such as classification (Marino et al., 2016), object detection (Chen et al., 2018) and visual relationship detection (Dai et al., 2017a). A common practice in previous works (Akata et al., 2013; Almazán et al., 2014; Lampert et al., 2009; Misra et al., 2017) is to consider manually designed relationships and shared attributes among objects. For example, some works (Frome et al., 2013; Mao et al., 2015; Reed et al., 2016) try to reason via modeling similarity, such as attributes in the linguistic space. The graph structure (Chen et al., 2018; Dai et al., 2017a; Kipf and Welling, 2016; Marino et al., 2016) also demonstrates an impressive ability to incorporate external knowledge. In (Deng et al., 2014), Deng et al. construct a relation graph from labels to guide classification. Similarly, Chen et al. (Chen et al., 2018) design an iterative reasoning framework that leverages both local region-based reasoning and global reasoning to facilitate object recognition.

However, these works rely on external handcrafted linguistic knowledge, which requires laborious annotation work. Moreover, such handcrafted knowledge graphs are often less suitable since a gap exists between the linguistic and the visual context. Some works (Hu et al., 2018a; Liu et al., 2018; Norcliffe-Brown et al., 2018) propose to construct implicit relations from the image itself. In particular, Liu et al. (Liu et al., 2018) encode the relations via a Structure Inference Network (SIN), which learns a fully-connected graph implicitly with stacked GRU cells. However, the redundant information and inefficiency brought by a fully-connected graph make this method stagnant. We hope to imitate the human visual mechanism and construct a dynamic scene graph by mining the intrinsic semantic and spatial layout relationships from each image to facilitate small object detection.

3. Proposed Approach

In this section, we present our approach in detail. We first briefly overview the whole approach, then elaborate on the semantic module and the spatial layout module, respectively, and finally present the details of the context reasoning module.

3.1. Overview

We start with an overview of the context reasoning framework before going into detail below. The system framework of our approach is shown in Fig. 2. Note that our context reasoning approach is flexible and can be easily injected into any two-stage detection pipelines. The human visual system tends to assign objects that have similar semantic co-occurrence information, aspect ratios, and scales to an identical category, which is beneficial for recognizing small objects in complex scenarios. Our approach mimics such a human visual mechanism and captures the inter-object relationships (both semantic and spatial layout) between small objects. It aims at inferring the existence of hard-to-detect small objects by measuring their relatedness to other easy-to-detect ones. In this paper, we explore whether mining the semantic and spatial layout relationships can boost small object detection.

We first construct a semantic module for encoding the intrinsic semantic relationships from the initial regional features and a spatial layout module for encoding the spatial layout relationships from the position and shape information of objects. Then both the semantic and the spatial layout relationships are fed into a context reasoning module to generate a region-to-region undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the nodes $\mathcal{V}$ correspond to regions and each edge in $\mathcal{E}$ encodes both semantic and spatial layout relationships between nodes. Finally, the context reasoning module integrates the contextual information between the objects and the sparse relationships, which is further fused with the original regional features.

3.2. Semantic Module

Figure 3. Flowchart of semantic relatedness calculation. The initial regional features from the proposals are fed into a semantic encoder to yield latent representations, which are used to calculate the relatedness with a learnable semantic relatedness function. Proposals falling into the same category tend to have similar semantic co-occurrence information and thus high relatedness, and low relatedness otherwise.

This module is learnable and aims to imitate the human visual mechanism to model the intrinsic semantic relationships between objects. As shown in Fig. 3, proposals falling into an identical category tend to have similar semantic co-occurrence information, leading to high relatedness, and low relatedness otherwise. More intuitively, a hard-to-detect small object with ambiguous semantic information is more likely to be a clock if it has top semantic similarities to some easy-to-detect clocks in the same scenario. The semantic context information of these easy-to-detect clocks tends to be beneficial for recognizing such a hard-to-detect object. We define a dynamic undirected graph $\mathcal{G}_{sem} = (\mathcal{V}, \mathcal{E}_{sem})$ to encode the semantic relationships in each image. Note that each node in $\mathcal{V}$ corresponds to a region proposal, while each edge in $\mathcal{E}_{sem}$ represents the relationship between nodes. Given $N$ proposal nodes, we first construct a fully-connected graph that contains all possible edges between them. However, most of the connections are invalid due to regularities in real-world object interactions. A direct solution to this problem is to calculate the semantic relatedness over the fully-connected graph and then retain the relationships with high relatedness while pruning those with low relatedness. The flowchart of relatedness calculation is illustrated in Fig. 3.

Inspired by (Yang et al., 2018a), given the initial regional feature pool $F = \{\mathbf{f}_1, \dots, \mathbf{f}_N\}$ with $\mathbf{f}_i \in \mathbb{R}^{D}$, where $D$ is the dimension of the initial regional features, we define a learnable semantic relatedness function to calculate the semantic relatedness of each pair of initial regional features in the original fully-connected graph. The semantic relatedness can be formulated as

$S^{sem}_{ij} = \mathbb{1}(i, j) \cdot \langle \Phi(\mathbf{f}_i), \Phi(\mathbf{f}_j) \rangle$,   (1)

where $\mathbb{1}(i, j)$ is an indicator function that equals 0 if the $i$-th and $j$-th regions are highly overlapped with each other and 1 otherwise, and $\Phi(\cdot)$ is a projection function that projects the initial regional features into latent representations. Since different regions are parallel and there is no subject and object division, we set $\Phi(\cdot)$ to a multi-layer perceptron (MLP) to encode undirected relationships in this paper. A sigmoid function is applied to the score matrix $S^{sem}$ to normalize all the scores into the range from 0 to 1. Then we sort the score matrix by rows and preserve the top K values in each row. The pair-wise regional relationships corresponding to the preserved values are set as the selected relationships. The value of the adjacency edge $e_{ij} \in \mathcal{E}_{sem}$ is set to 1 if the corresponding region-to-region relationship is selected and 0 otherwise.

The semantic module maps the original region features, which involve rich semantic and location information, into a new feature space via an MLP architecture and preserves the region pairs whose corresponding features are highly similar. During training, the location information tends to be ignored and the semantic information tends to be preserved, since high similarity of location information would result in retaining regions with a high overlap ratio, and such regions are suppressed by the NMS algorithm. Thus, the module encodes the semantic information. In this manner, we obtain sparse semantic relationships in which the most informative edges are retained and the noisy edges are pruned.
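To make the construction above concrete, the following is a minimal PyTorch sketch of the semantic relatedness calculation and top-K pruning. The inner-product form of the relatedness function, the layer sizes, and the names (SemanticModule, overlap_mask, latent_dim) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    def __init__(self, feat_dim=1024, latent_dim=256, top_k=64):
        super().__init__()
        # MLP that projects initial regional features into a latent space (Phi in Eq. 1)
        self.project = nn.Sequential(
            nn.Linear(feat_dim, latent_dim),
            nn.ReLU(inplace=True),
            nn.Linear(latent_dim, latent_dim),
        )
        self.top_k = top_k

    def forward(self, feats, overlap_mask):
        # feats:        (N, feat_dim) initial regional features from the RoI head
        # overlap_mask: (N, N) tensor with 0 for highly overlapped pairs, 1 otherwise
        z = self.project(feats)                            # latent representations
        scores = torch.sigmoid(z @ z.t()) * overlap_mask   # pair-wise relatedness in [0, 1]
        # keep only the top-K relationships per row; prune the rest
        k = min(self.top_k, scores.size(1))
        topk_idx = scores.topk(k, dim=1).indices
        adj = torch.zeros_like(scores)
        adj.scatter_(1, topk_idx, 1.0)                     # sparse semantic adjacency matrix
        return adj
```

The overlap_mask can be obtained by thresholding the pair-wise IoU of the proposals, so that highly overlapped pairs, which NMS would suppress anyway, never form edges.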

3.3. Spatial Layout Module

Figure 4. (a) Flowchart of spatial layout relatedness calculation. The spatial layout of each pair-wise region is fed into the spatial layout module to compute the spatial similarity and spatial distance weight for calculating the spatial layout relatedness. (b) An example of a spatial layout relationship graph.

Conventionally, small objects falling into an identical category in a scene tend to have similar spatial aspect ratios and scales; for instance, the two chairs in Fig. 4 (b) have a high spatial similarity, which is not the case between the chairs and the majority of the birds. Meanwhile, this is not a one-size-fits-all rule, and we can easily find some failure cases in Fig. 4 (b): a few birds have high spatial similarity with the chairs but belong to different categories. This suggests that we should revisit the question of how to effectively model the spatial layout relationships between small objects for better recognition. We can find that the chairs are closer to each other than they are to most birds, and the birds are in a similar situation. This phenomenon generalizes to the majority of scenarios; that is, small objects of an identical category tend to appear in clusters in the spatial layout. Inspired by this, we construct the spatial layout module to model the intrinsic spatial layout relationships from both spatial similarity and spatial distance. Its flowchart is shown in Fig. 4 (a).

We define a spatial layout dynamic undirected graph $\mathcal{G}_{spa} = (\mathcal{V}, \mathcal{E}_{spa})$ to encode the spatial layout relationships. Similar to the semantic module, we define a spatial layout relatedness function to calculate the relatedness over the original fully-connected graph. The spatial layout relatedness can be formulated as

$S^{spa}_{ij} = \mathrm{sim}(b_i, b_j) \cdot w^{d}_{ij}$,   (2)

where $b_i$ and $b_j$ are the region coordinates corresponding to regions $i$ and $j$, respectively, and $\mathrm{sim}(b_i, b_j)$ and $w^{d}_{ij}$ are the spatial similarity and the spatial distance weight, respectively. The spatial similarity measures how consistent the aspect ratios and scales of the two regions are, while the spatial distance weight decays with the spatial distance $d_{ij}$ between the centers of the two regions and is controlled by a scale parameter $\gamma$ that is set empirically in this paper. We sort the score matrix by rows and preserve the top K values in each row. The pair-wise regional relationships corresponding to the preserved values are set as the selected relationships. Finally, we set the adjacency edges in the same manner as in the semantic module. A constructed spatial layout graph is illustrated in Fig. 4 (b).
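A similarly compact sketch of the spatial layout relatedness is given below, assuming boxes in (x1, y1, x2, y2) format. Since the exact similarity and distance-weight formulas are not reproduced here, the ratio-based similarity, the exponential distance decay, and the gamma value are plausible stand-ins rather than the paper's definitions.

```python
import torch

def spatial_relatedness(boxes, gamma=100.0, top_k=64):
    # boxes: (N, 4) proposal coordinates in (x1, y1, x2, y2) format; gamma is an assumed value
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    # spatial similarity: region pairs with similar scales and aspect ratios score high
    sim_w = torch.minimum(w[:, None], w[None, :]) / torch.maximum(w[:, None], w[None, :])
    sim_h = torch.minimum(h[:, None], h[None, :]) / torch.maximum(h[:, None], h[None, :])
    similarity = sim_w * sim_h

    # spatial distance weight: decays with the distance between region centers
    dist = torch.sqrt((cx[:, None] - cx[None, :]) ** 2 + (cy[:, None] - cy[None, :]) ** 2)
    weight = torch.exp(-dist / gamma)

    scores = similarity * weight                 # Eq. (2): product of the two terms
    k = min(top_k, scores.size(1))
    topk_idx = scores.topk(k, dim=1).indices
    adj = torch.zeros_like(scores)
    adj.scatter_(1, topk_idx, 1.0)               # sparse spatial layout adjacency matrix
    return adj
```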

3.4. Context Reasoning Module

The context reasoning module is constructed to integrate the contextual information between the objects and the sparse relationships. Given the initial regional features and the encoded semantic and spatial layout relationships, we need to select the relationships whose regions are highly related to each other, either semantically or in spatial layout. We therefore fuse the semantic and spatial layout relationships via

$\mathcal{E} = \mathcal{E}_{sem} \cup \mathcal{E}_{spa}$.   (5)

The connections between regions are non-Euclidean data and highly irregular, which in general cannot be systematically and reliably processed by CNNs. A Graph Convolutional Network (GCN) is capable of better estimating the edge strengths between the vertices of the fused relationship graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, thus leading to more accurate connections between individuals. Intuitively, information communication between regions with high relatedness can provide more effective contextual information, which effectively boosts small object detection. As a result, we construct a light-weight GCN for regional context reasoning. Its flowchart is illustrated in Fig. 5. It consists of $T$ layers, each with the same propagation rule, defined as follows. We define $H^{(l)}$ as the hidden feature matrix of the $l$-th layer and $H^{(0)} = F$. Then $H^{(l+1)}$ can be formulated as

$H^{(l+1)} = \sigma\!\left( D^{-\frac{1}{2}} L D^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$,   (6)

where $D$ is the degree matrix of the adjacency matrix $A$ of the fused graph $\mathcal{G}$, $L$ is the combinatorial Laplacian matrix of $A$, $W^{(l)}$ denotes the trainable weight matrix of the $l$-th layer, and $\sigma(\cdot)$ is the LeakyReLU activation function. The initial regional features $F$ are updated with the output of the GCN as

$\tilde{F} = F \oplus H^{(T)}$,   (7)

where $\tilde{F}$ and $\oplus$ represent the updated features and the element-wise addition operation, respectively.

Figure 5. The context reasoning flowchart. The semantic and spatial layout relationships are fused for propagating both the semantic and spatial layout contextual information via a GCN. The original regional features are updated with the output of the GCN.

In this manner, both co-occurrence semantic and spatial layout information can effectively propagate between objects, which gives the model a better self-correction ability than before and alleviates the problems of false and omissive detection.
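The following is a minimal PyTorch sketch of the context reasoning step: the two sparse adjacencies are fused by element-wise OR (one reading of Eq. (5)), propagated through a light-weight GCN, and added back to the regional features as in Eq. (7). The symmetric normalization of the fused adjacency, the layer count, and the feature dimension are assumptions and may differ from the exact propagation rule in Eq. (6).

```python
import torch
import torch.nn as nn

class ContextReasoning(nn.Module):
    def __init__(self, feat_dim=1024, num_layers=2):
        super().__init__()
        # one trainable weight matrix per GCN layer
        self.layers = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(num_layers)]
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, feats, adj_sem, adj_spa):
        # fuse the semantic and spatial layout relationships (element-wise OR of adjacencies)
        adj = torch.clamp(adj_sem + adj_spa, max=1.0)
        # build a symmetrically normalized propagation matrix from the degree matrix
        deg = adj.sum(dim=1).clamp(min=1e-6)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        prop = d_inv_sqrt @ adj @ d_inv_sqrt
        # light-weight GCN: propagate contextual information between related regions
        h = feats
        for layer in self.layers:
            h = self.act(prop @ layer(h))
        # residual update of the initial regional features (element-wise addition)
        return feats + h
```

In a two-stage detector, the returned features would simply replace the original RoI features before the classification and regression heads, which is what makes the module easy to inject into existing pipelines.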

Method | Backbone | AP | AP50 | AP75 | APS | APM | APL

One-stage:
YOLOv2 (Redmon et al., 2016) | DarkNet-19 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5
SSD513 (Fu et al., 2017) | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8
YOLOv3 (Redmon and Farhadi, 2018) | Darknet-53 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9
DSSD513 (Fu et al., 2017) | ResNet-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1
RefineDet512 (Zhang et al., 2018) | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4
RetinaNet (Lin et al., 2017b) | ResNet-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
CornerNet511 (Law and Deng, 2018)* | Hourglass-104 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9

Two-stage:
Faster R-CNN+++ (He et al., 2016)* | ResNet-101 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9
Faster R-CNN by G-RMI (Huang et al., 2017) | Inc-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0
Faster R-CNN w FPN (Lin et al., 2017a) | ResNet-101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
Faster R-CNN w TDM (Shrivastava et al., 2016) | Inc-ResNet-v2 | 36.8 | 57.7 | 39.2 | 16.2 | 39.8 | 52.1
Deformable R-FCN (Dai et al., 2017b)* | Aligned-Inc-ResNet | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5
Mask R-CNN (He et al., 2017) | ResNet-101 | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2
Regionlets (Hu et al., 2018a) | ResNet-101 | 39.3 | 59.8 | – | 21.7 | 43.7 | 50.9
Fitness NMS (Tychsen-Smith and Petersson, 2018) | ResNet-101 | 41.8 | 60.9 | 44.9 | 21.5 | 45.0 | 57.5
FRCNN-FD-WT (Peng et al., 2019) | ResNet-101 | 42.1 | 63.4 | 45.7 | 21.8 | 45.1 | 57.1
IR R-CNN (ours) | ResNet-50 | 37.6 | 60.0 | 40.6 | 21.9 | 39.7 | 47.0
IR R-CNN (ours) | ResNet-101 | 39.7 | 62.0 | 43.2 | 22.9 | 42.4 | 50.2

* Models used bells and whistles at inference.

Table 1. Comparison with state-of-the-art detectors on COCO test-dev. We show results for our IR R-CNN with backbone ResNet-50 and ResNet-101. Our module achieves top results in small object detection, outperforming most one-stage and two-stage models.

4. Experiments

In this section, experiments are conducted to evaluate the effectiveness of our proposed approach. We begin with our experimental settings, then present the implementation details and benchmark against state-of-the-art models, and finally present a detailed performance analysis.

4.1. Experimental Settings

We evaluate our proposed approach on the bounding box detection track of the challenging COCO benchmark (Lin et al., 2014), which has more small objects than large/medium objects: approximately 41% of the objects are small (area < 32² pixels). Following prior investigations (Bell et al., 2016; Lin et al., 2017a), we train on the COCO trainval35k split (the union of the 80k train images and a random 35k subset of the val images). We report the ablation studies by evaluating on the minival split (the remaining 5k images from the val images). For a fair comparison, we report the performance on the test-dev split, which has no public labels and requires the use of the evaluation server.

According to the scale of objects, the COCO dataset can be divided into three subsets: small, medium and large. In detail, large objects have an area larger than 96² pixels, small objects have an area smaller than 32² pixels, and medium objects have an area in between. In this paper, we focus on the performance of small object detection. The standard COCO metrics are reported, including AP (averaged over IoU thresholds from 0.5 to 0.95), AP50, AP75, and APS, APM, APL (AP at different scales).
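For reference, the size-stratified metrics can be obtained with the pycocotools evaluation API, whose default area ranges match the small/medium/large definitions above; the annotation and result file paths below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_minival.json')  # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes('detections.json')          # detector outputs in COCO result format
coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75 and the small/medium/large (APS, APM, APL) breakdown
```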

4.2. Implementation Details

We re-implement Faster R-CNN (Ren et al., 2015), with ResNet-50 and ResNet-101 as backbones, as our baseline methods in PyTorch (Paszke et al., 2017). Note that our network backbone is pre-trained on ImageNet (Russakovsky et al., 2015) and then fine-tuned on the detection dataset. The parameters in the MLP architecture and the context reasoning module are randomly initialized and trained from scratch. The overall network is trained in an end-to-end manner, and its input images are resized to a fixed short side. It is trained with synchronized stochastic gradient descent (SGD) over multiple GPUs, with 4 images per GPU per minibatch. The learning rate is decayed twice over the course of training, and standard weight decay and momentum are used. We empirically set K = 64 in the relationship graph construction for the context reasoning module.

Figure 6. Qualitative results of our IR R-CNN with ResNet-101 as backbone. The model is trained on the COCO trainval35k split (the union of the 80k train images and a random 35k subset of the val images).

4.3. Comparison with the State-of-the-art Models

We evaluate our proposed approach on the bounding box detection task of the challenging COCO test-dev set. We compare it with several state-of-the-art models, including both one-stage and two-stage models, and their performance is shown in Tab. 1. From this table, we find that our proposed approach achieves better accuracy than the popular models in small object detection. This reveals that our approach can strongly improve the original small object regional features, and supports the idea that modeling the semantic and spatial layout relationships boosts small object detection, with only a 6.9% parameter increment (60.6 million → 64.8 million parameters). Note that our approach is designed for complex scenes with multiple small objects, making it flexible and portable for diverse detection systems to improve small object detection performance. Some qualitative examples of detection results generated by our IR R-CNN are illustrated in Fig. 6. We observe that our approach can detect most of the objects that conform to the human visual cognitive system, even when there are very small objects in the scene. This indicates the effectiveness of our approach in modeling the relationships, semantic and spatial layout, between small objects. However, we can also find some failure cases, which shows that our method still has room for improvement in small object detection.

4.4. Detailed Performance Analysis

We conduct several experiments on COCO minival to verify the effectiveness of the proposed approach. Unless otherwise stated, all models in detailed performance analysis are implemented on Faster R-CNN with ResNet-50 as the backbone.

Parameter Analysis. We conduct an experiment to evaluate the parameter K over {16, 32, 64, 96}. The performance of the proposed approach with different K is summarized in Tab. 2. From this table, we find that the overall detection performance remains relatively stable, while the performance of small object detection improves substantially as K grows, peaking at K=64. However, as K continues to grow, the performance of small object detection decays.

This can be interpreted as follows: a low K prevents the proposed semantic and spatial layout modules from encoding sufficient semantic and spatial layout relationships, respectively. This constricts the semantic and spatial layout context information that can be propagated between regions and leads to inferior small object detection performance. On the contrary, a large K increases the risk of encoding unnecessary relationships. In other words, noise may be introduced, which has a negative impact on the improvements of small object detection. In summary, the performance improvements are maximized when an appropriate K enables sufficient relationships to be encoded and effectively propagates context information between regions while avoiding the introduction of noise.

K | AP | AP50 | AP75 | APS | APM | APL
16 | 36.7 | 58.6 | 39.7 | 21.3 | 40.0 | 47.5
32 | 37.2 | 59.4 | 39.8 | 22.1 | 40.1 | 48.0
64 | 37.3 | 59.5 | 40.5 | 22.9 | 40.5 | 48.5
96 | 37.4 | 59.5 | 40.5 | 22.0 | 40.6 | 48.4
Table 2. Parameter analysis on minival subset.
Sem | Spa | AP | AP50 | AP75 | APS | APM | APL
 |  | 36.8 | 58.7 | 39.6 | 21.0 | 39.9 | 47.7
✓ |  | 36.6 | 58.5 | 39.6 | 22.3 | 40.0 | 47.3
 | ✓ | 37.0 | 59.0 | 40.2 | 21.9 | 40.2 | 47.8
✓ | ✓ | 37.3 | 59.5 | 40.5 | 22.9 | 40.5 | 48.5
Table 3. Ablation study on minival subset.

Ablation Studies. Ablation studies, which mainly consist of two different settings, are conducted to verify the effectiveness of the proposed semantic and spatial layout modules. In the first setting, we only consider the semantic relationships and ignore the spatial layout relationships for context reasoning. In this manner, only the regions with high semantic similarity propagate context information to each other. In the second setting, similarly, we ignore the semantic relationships between regions and only feed the spatial layout relationships into the context reasoning module for further reasoning. Tab. 3 summarizes the performance of the ablation studies on the minival subset.

From this table, we find that both the semantic and the spatial layout module can boost small object detection to some extent, but their respective improvements are quite limited compared to the full model. This can be interpreted as follows: the semantic module is capable of encoding semantic relations from semantic similarity, enabling the context reasoning module to propagate high-order semantic co-occurrence contextual information between objects, which leads to a performance gain. However, it is less beneficial for small objects from which it is hard to extract semantically strong features, even when they fall into an identical category. The spatial layout module sets aside the semantic similarity and constructs relations from the spatial layout, giving small objects that have high spatial similarity and appear in clusters in the spatial layout an opportunity to propagate spatial layout contextual information to each other. This can alleviate the problems of the semantic module but at a high risk of introducing noise. Since the two modules complement each other, their fusion naturally maximizes the performance gains. Specifically, Tab. 3 reveals that our context reasoning approach boosts the performance of small object detection by 1.9 points on the minival subset.

5. Conclusion

We present a novel context reasoning approach for small object detection which models and infers the intrinsic semantic and spatial layout relationships between objects. It constructs sparse semantic relationships from the semantic similarity and sparse spatial layout relationships from the spatial similarity and spatial distance. A context reasoning module takes the semantic and spatial layout relations as input and propagates the semantic and spatial layout contextual information to update the initial regional features, which enables object detectors to alleviate the problem of false and omissive detection for small objects. The experimental results on COCO validate the effectiveness of the proposed approach. We believe that IR R-CNN can benefit current small object detection with relationship modeling and inference.

In future work, we will tentatively explore the feasibility of introducing orientation information into the context reasoning module, as well as combining both intrinsic relationships and external handcrafted linguistic knowledge for further small object detection performance improvements.

References

  • Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2013) Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826. Cited by: §2.
  • J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence 36 (12), pp. 2552–2566. Cited by: §2.
  • Y. Bai, Y. Zhang, M. Ding, and B. Ghanem (2018a) Finding tiny faces in the wild with generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–30. Cited by: §1, §2.
  • Y. Bai, Y. Zhang, M. Ding, and B. Ghanem (2018b) Sod-mtgan: small object detection via multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221. Cited by: §1, §2.
  • S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2874–2883. Cited by: §4.1.
  • X. Chen, L. Li, L. Fei-Fei, and A. Gupta (2018) Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7239–7248. Cited by: §2.
  • B. Dai, Y. Zhang, and D. Lin (2017a) Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3076–3086. Cited by: §2.
  • J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017b) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §1, Table 1.
  • J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam (2014) Large-scale object classification using label relation graphs. In European conference on computer vision, pp. 48–64. Cited by: §2.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §2.
  • C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: Table 1.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1, §2.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §2, Table 1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1.
  • H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018a) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597. Cited by: §2, Table 1.
  • J. Hu, L. Shen, and G. Sun (2018b) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1.
  • J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7310–7311. Cited by: Table 1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
  • C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §2.
  • H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: Table 1.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2, Table 1, §4.1.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §2, Table 1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §2.
  • Y. Liu, R. Wang, S. Shan, and X. Chen (2018) Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6985–6994. Cited by: §2.
  • J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille (2015) Learning like a child: fast novel visual concept learning from sentence descriptions of images. In Proceedings of the IEEE international conference on computer vision, pp. 2533–2541. Cited by: §2.
  • K. Marino, R. Salakhutdinov, and A. Gupta (2016) The more you know: using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844. Cited by: §2.
  • I. Misra, A. Gupta, and M. Hebert (2017) From red wine to red tomato: composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1792–1801. Cited by: §2.
  • W. Norcliffe-Brown, S. Vafeias, and S. Parisot (2018) Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8334–8343. Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • J. Peng, M. Sun, Z. Zhang, T. Tan, and J. Yan (2019) POD: practical object detection with scale-sensitive network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9607–9616. Cited by: Table 1.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1, §2, Table 1.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §2.
  • J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Table 1.
  • S. Reed, Z. Akata, H. Lee, and B. Schiele (2016) Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58. Cited by: §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2, §4.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.2.
  • A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta (2016) Beyond skip connections: top-down modulation for object detection. arXiv preprint arXiv:1612.06851. Cited by: Table 1.
  • L. Tychsen-Smith and L. Petersson (2018) Improving object localization with fitness nms and bounded iou loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885. Cited by: Table 1.
  • J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders (2013) Selective search for object recognition. International journal of computer vision 104 (2), pp. 154–171. Cited by: §2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018a) Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–685. Cited by: §3.2.
  • M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018b) Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3684–3692. Cited by: §1.
  • S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212. Cited by: Table 1.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1.
  • C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In European conference on computer vision, pp. 391–405. Cited by: §2.