Multi-patch Feature Pyramid Network for Weakly Supervised Object Detection in Optical Remote Sensing Images

08/18/2021 ∙ by Pourya Shamsolmoali, et al. ∙ Grenoble Institute of Technology ∙ University of Leicester

Object detection is a challenging task in remote sensing because objects only occupy a few pixels in the images, and the models are required to simultaneously learn object locations and detection. Even though established approaches perform well on objects of regular sizes, they perform weakly on small objects or get stuck in local minima (e.g. false object parts). Two issues stand in their way. First, existing methods struggle to perform stably on the detection of small objects because of the complicated background. Second, most standard methods use hand-crafted features and do not work well on objects with missing parts. We address these issues and propose a new architecture with a multiple patch feature pyramid network (MPFP-Net). Unlike current models, which pursue only the most discriminative patches during training, MPFP-Net divides the patches into class-affiliated subsets in which the patches are related. Based on the primary loss function, a sequence of smooth loss functions is determined for the subsets to improve the model's ability to collect small object parts. To enhance the feature representation for patch selection, we introduce an effective method to regularize the residual values and make the fusion transition layers strictly norm-preserving. The network contains bottom-up and crosswise connections to fuse the features of different scales and achieves better accuracy than several state-of-the-art object detection models. The developed architecture is also more efficient than the baselines.




I Introduction

High-resolution remote sensing images (RSIs) are now available to facilitate a wide variety of applications, such as traffic management [kalantar2017multiple] and environment monitoring [marin2014building]. Recently, the application of object detection in RSIs has been extended from rural development to other areas, such as urban inspection [cheng2016survey], and RSIs are increasing rapidly in both quantity and quality. In [cheng2016survey], the authors presented a comprehensive survey and compared the performance of different machine learning methods for remote sensing image interpretation. Machine learning and deep learning-based models have been widely used for RSI object detection and classification [marin2014building, hong2020graph, hong2018augmented, shamsolmoali2019novel, shamsolmoali2019convolutional]. However, supervised learning models require large annotated datasets to achieve satisfactory detection, and RSI annotation needs to be undertaken by trained professionals. In contrast, weakly supervised learning is an alternative technique that augments datasets for object detection [triguero2015self, zhao2019towards]. One of the popular weakly supervised object detection methods is the combination of multiple instance learning (MIL) and deep neural networks [shamsolmoali2020amil]. However, such methods tend to capture object parts rather than the full object body, which is due to the non-convexity of the loss functions.

In practice, standard machine learning models have two main stages for object detection: feature extraction followed by classification. For example, extracted features have been used as the input to a support vector machine that classifies the predicted targets. Standard machine learning approaches are influenced by the quality of the handcrafted and light learning-based features. Despite their promising results, standard machine learning systems fail to provide robust outcomes in challenging circumstances, for example, changes in the visual appearance of objects and complex background clutter.


In recent years, object detection has progressed thanks to advances in convolutional neural networks (CNNs). CNNs are able to learn feature representations from large amounts of data [sharif2014cnn]. An enhanced CNN introduces a process called selective search [uijlings2013selective], which adapts segmentation as a selective search strategy to improve detection accuracy at higher speeds. In comparison with the analysis of natural images, the main difficulty in RSI object detection is the size of the objects. For small objects, extracting significant and discriminative features is difficult because of the low resolution; recent works are therefore driven towards exploring ways of extracting discriminative features [hong2020invariant]. For example, a rotational R-CNN was proposed by Guo et al. [guo2020rotational] for supervised object detection. In addition, learning feature representations is a major problem in image processing tasks, and detecting multi-scale objects is challenging. To overcome this problem, pyramidal feature representations have been introduced to represent an image through multi-scale features in object detectors [lin2017feature, liu2018path, ghiasi2019fpn]. Feature Pyramid Network (FPN) [lin2017feature] is among the most representative approaches for producing pyramidal feature representations of objects. Typically, pyramid models adopt a backbone network that is built for image segmentation or classification and create feature pyramids by successively merging two or three consecutive layers of the backbone network with top-down and adjacent connections. High-level features have lower resolutions but are semantically strong, and can be upscaled and merged with higher-resolution features to create more discriminative features. Researchers working on object detection in RSIs have observed the powerful ability of FPN and applied this method in their work. Li et al. [li2018hsf] proposed a feature-based method for identifying ships in RSIs, using a region-based network to detect ships from the generated feature maps. In spite of its detection performance, its computation is inefficient.
In [li2019nested], a deep network based upon a two-stream pyramid module was proposed for detecting multi-scale salient objects in RSIs. This network has demonstrated promising performance on detecting objects of a regular size but failed to perform accurately on small or incomplete objects. In [yang2019cdnet], the authors proposed a detector based on an encoder-decoder FPN for detecting clouds in RSIs. The network is simple and efficient, but not effective in crowded environments. Although FPN is effective and simple, it may not be an efficient architectural design. Recently, in [ghiasi2019fpn, li2018hsf], the authors added extra bottom-up and skip connection pathways to FPN to enhance feature representations. Nonetheless, these methods only take into account one of the three dimensions (image size, depth, and width). By analyzing more dimensions and adopting fusion methods [hong2020more, yokoya2017hyperspectral], a network can be trained to achieve better performance and efficiency. Shamsolmoali et al. [shamsolmoali2008road] recently introduced a new architecture named SPN that improves the performance of FPN by extracting effective features from all the layers of the network to describe objects at different scales. This architecture learns from the multi-level feature maps and improves the semantics of the features.

To tackle the above problems, in this paper we introduce a scalable architecture to build effective pyramidal representations, named MPFP-Net. More specifically, the proposed model mainly consists of two components. First, we propose a multiple patch learning (MPL) scheme to deal with the non-convexity and the lack of full object representation. MPL treats input images as collections of patches. During training, MPL learns patch subsets whose members are mutually correlated. Patch subsets with accurate trailer parameters can activate semantic regions to describe a full object. Second, we propose a new pyramid network with cross-scale connections for producing multi-scale representative features. An additional advantage of the modular pyramidal architecture is its capability for efficient multi-scale object detection. This paper makes three contributions.

  • Our proposed MPL strategy is built on a CNN to accurately describe an object, such that the network performs robustly with different backbone models, such as ResNet [he2016deep] and SPN [shamsolmoali2008road]. With SPN as the backbone of the proposed model, the detection accuracy and speed of MPFP-Net are better than those of the other state-of-the-art models.

  • We incorporate scale-wise image fusion within the FPN architecture. The cross-scale connections are used to adapt the number of the channels and the size of the feature maps to maintain the norm of the gradients. Also, we propose a computationally efficient method to extract and normalize a multi-level feature response to the corresponding level. By using a layer for extracting multi-scale feature maps, the network weights are updated once per iteration, which considerably improves the training efficiency of the proposed model.

  • We fully evaluate several recent deep CNNs for object detection in RSIs and the overall performance is reported.

The rest of this paper is structured as follows. Section II gives a brief review of the existing object detection methods for RSIs. In Section III, we describe our proposed architecture in detail. In Section IV, we report the experimental results on the RSI datasets. Section V contains the ablation study, and we conclude the paper in Section VI.

II Related Work

In the last ten years, object detection in RSIs has achieved significant progress. In this section, we will discuss the available object detection technologies for natural and RS images. In particular, we focus on the discussion on weakly supervised models for object detection.

II-A Object Detection in Natural Scene Images

Deep CNNs have significantly improved the performance of object detection models in recent years. In [girshick2015region], the authors proposed a detector named R-CNN. R-CNN has a high detection rate, but its speed is limited. With the intention of increasing the accuracy and speed of object detection, Fast R-CNN [girshick2015fast] was proposed to use bounding-box regression with end-to-end training. Later, Faster R-CNN [ren2015faster] was introduced to combine object proposals and identification within a unified network with impressive efficiency. Generally, there are two approaches to deal with the scale-variation problem. The first approach is to use feature-wise image pyramids to generate semantic multi-scale features. Features from the images of different scales return individual predictions, which are combined to generate the final detection. With regard to localization precision and detection accuracy, features from images of different sizes outperform features that are based only on images of a single size [guo2020rotational].

Fig. 1: The overall architecture of the proposed Multiple Patch Feature Pyramid Network. It contains patch detection, subset selection, feature extraction and instance classification.

The second approach is to detect objects in the feature pyramid (FP) extracted from different layers of the network, while only a single-scale image is used as input. This methodology requires less memory than the first one. Additionally, the standard FP unit can be modified and inserted into state-of-the-art CNN-based detectors. Lin et al. [lin2017feature] proposed a feature pyramid network with double-path connections to seek more representative features. Ding et al. [ding2020weakly] proposed a weakly supervised pyramid CNN for object detection. The model consists of a hierarchical configuration with both top-down and bottom-up connections to learn high-level semantics as well as low-level features.

II-B Object Detection in RSI

Compared to object detection in natural scene images, object detection in RSIs has additional challenges. This topic has been intensively studied over the last few years. Conventional object detection models have ranking systems to categorize objects and background [qiu2017automatic]. For example, in [wu2019fourier, wu2019orsim], the authors proposed an efficient detection framework based on rotation-invariant feature aggregation and adopted a learning method to acquire significant and meaningful features for small object detection.

With the development of deep learning, there has been significant improvement in object detection. In [cheng2016learning], the authors proposed an equivariant CNN method by introducing a regularization method for object detection in RSIs. Dong et al. [dong2019sig] proposed a deep transfer learning method on the basis of R-CNN to minimize the loss of tiny objects in RSIs. Moreover, transfer learning can be used to label RSIs by annotating both object positions and classes. In [deng2018multi], the authors proposed multi-scale object detection models for RSIs using inspection modules to train the sub-networks. The proposed model has significant detection performance on multi-scale objects but is not efficient due to the depth of the network. Varying conditions and object variations in RSIs bring various challenges to object detection in such images. Consequently, it is hard to obtain satisfactory results by deploying the available object detection models. Motivated by the shortage of training data and the complexity of the network architectures needed to handle objects with multiple scales and complex backgrounds, we introduce a novel multi-scale deep learning model for object detection in RSIs.

II-C Weakly Supervised Object Detection

Multiple instance learning (MIL) is a weakly supervised learning method that, in the training phase, treats each image as a "bag" and repeatedly picks high-scored instances from the bags. It performs similarly to an Expectation-Maximization algorithm, estimating instances and training detectors concurrently. However, because of the local minima of its non-convex loss functions, such a model does not have a smooth training process [wan2018min]. To mitigate the non-convexity issue, bundling was used as one of the pre-processing approaches to simplify the selection of relevant instances [song2014learning]. A box-level supervision method was presented to decrease the solution space through a recurrent constraint network [zhao2019towards]. In [bilen2016weakly], MIL-Net was proposed, in which the convolution operator and filters are used as detectors. Nevertheless, MIL-Net's loss functions remain non-convex. To deal with this problem, spatial regularization is used [bilen2016weakly] within the cascaded convolutional networks. Current approaches mainly use selective regions (patches) as ground truth to gradually improve the classifiers [tang2017multiple]. Existing methods that use spatial regularization and gradual refinement are successful at enhancing object detection. Even so, a systematic approach to deal with the local minimum problem is still lacking.

III Proposed Model and Methodology

In this section, we present the details of our proposed model that consists of two main components: global multiple patch (image regions) learning and a variation of multi-scale feature pyramid networks with a novel fusion scheme to robustly detect an object region to further improve the detection performance. Fig. 1 shows the architecture of our proposed model.

III-A Multiple Patch Global Learning (MPL)

In MPL, images are treated as bags and image regions as patches, and labels are allocated only to bags of patches. Each training image forms one bag, and the bag label indicates whether the bag contains positive patches. A positive bag holds at least a single positive patch, whereas in a negative bag all the patches are negative. Each bag contains a number of patches, each with its own patch label. MIL methods have two steps for object detection: patch selection (by a patch controller) as the initial step, followed by object estimation [wan2019c, zhao2019towards]. In the patch selection phase, the patch controller calculates the object score of every patch in a bag and selects the highest-scored one.


in which the patch selector has its own parameters and returns the highest-scored patch. With the selected patches, a detector with its own set of parameters is trained. In standard MIL methods [zhao2019towards, bilen2016weakly], the above processes are combined and the selector and detector parameters are optimized jointly as follows:


in which the standard loss of patch selection is determined as


and the loss of detector estimation is determined as


in which is determined based on the metric introduced in [everingham2010pascal]:


for or the Delta function or respectively.
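For orientation, one common way to write this standard MIL formulation is sketched below. The notation is assumed for illustration and is not taken from this text: $B_i$ is a bag with label $y_i \in \{+1, -1\}$, $B_{i,j}$ is its $j$-th patch, $f(\cdot, w_f)$ is the patch selector score, and $w_g$ are the detector parameters. The hinge form of the selection loss and the 0.5 overlap threshold follow common practice (e.g. the PASCAL VOC criterion) and are likewise assumptions.

```latex
% Sketch of a standard MIL objective under assumed notation.
\begin{align}
  B_{i,j^*} &= \arg\max_{j} \, f(B_{i,j}, w_f)
    && \text{(select the highest-scored patch)} \\
  \{w_f^*, w_g^*\} &= \arg\min_{w_f, w_g} \sum_i
    \Big[ \mathcal{F}_f(B_i, w_f) + \mathcal{F}_g(B_i, B_{i,j^*}, w_g) \Big]
    && \text{(joint objective)} \\
  \mathcal{F}_f(B_i, w_f) &= \max\!\Big(0,\; 1 - y_i \max_j f(B_{i,j}, w_f)\Big)
    && \text{(hinge-type selection loss, assumed)} \\
  y_{i,j} &=
    \begin{cases}
      +1, & \mathrm{IoU}(B_{i,j}, B_{i,j^*}) \ge 0.5 \\
      -1, & \text{otherwise}
    \end{cases}
    && \text{(patch labels from overlap, assumed threshold)}
\end{align}
```

The inner maximum over patches is what makes the selection loss non-convex, which is the issue the subset-based formulation in the following paragraphs is designed to address.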


To enhance the performance of the patch selector, we introduce a novel optimization technique. This approach splits the patches into subsets whilst handling the non-convexity of the loss function shown in Eq.(2). Our model includes a sequence of smooth loss functions that leads from an initial point to the final solution, indexed by a continuous parameter. Thus, we determine a range of values for this parameter and accordingly extend Eq.(2) to a persistent loss function, as follows:


in which the patches of a bag are grouped into indexed patch subsets, controlled by a continuous parameter. The loss comprises the persistent loss function of patch selection and the persistent loss function of the detector. To learn the patch selector, a bag (image) is divided into patch subclasses. In each subclass, the objects are spatially related and may overlap with each other, and each subclass contains partially similar objects. Each subclass is non-empty. The patches in an image (bag) are sorted by their object scores and the following two steps are carried out: 1) Create a subclass from the patch with the highest object score. 2) Identify the patches that overlap with this highest-scored patch and include them in the subclass. The successive patch selection is then made over the subclasses, with the loss function formulated as follows:


where the score of a patch subclass is determined as follows:


in which the normalizer is the total number of patches in the subclass. During model learning, all the patches in a subclass contribute equally to adjusting the network parameters. At one extreme of the continuous parameter, each bag has only one subset that contains all the patches. For other values, the bag is divided into several subsets, where each subset contains at least a single patch and consequently Eq.(8) is not satisfied. Referring to Eq.(9), the score of a patch is in general similar to the mean score of the patches in its subset. Therefore, our loss function in Eq.(7) is convex and grows linearly, in contrast to the standard MIL loss in Eq.(3). In supervised learning, the subclass with the maximum average score is taken for object detection. If bounding-box annotation is not feasible, the patch selector may not be accurate and the selected subclass may contain the background or only part of the objects. To fully outline valid patches and learn to detect objects, the patches are divided into positive and negative categories with a threshold parameter. Given the learned patch subclass and its highest-scored patch, the patches are separated into positive and negative categories on the basis of their relations, as follows:

Fig. 2: Proposed network design, where the top-down paths are shown in blue and the bottom-up paths in red: (a) FPN [lin2017feature] uses a top-down path for fusing multi-scale features (F3 - F7); (b) PANet [liu2018path] adds bottom-up paths to the FPN; (c) NAS-FPN [ghiasi2019fpn] adopts neural architecture search to identify a feature network; (d), (e) and (f) show different architectures of ESS-FPN learned by multi-scale feature fusion, with (f) being the architecture of ESS-FPN used in the experiments.

in which IoU stands for the intersection over union between two patches. Eq.(10) states that patches whose IoU is above the threshold are positive and patches whose IoU is below it are negative. Following Eq.(10), the remaining patches are gradually identified as positive or negative, and the detector, in accordance with these patches, predicts the objects with the following loss function:


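The two-step subset construction and the overlap-based labeling described in Section III-A can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names (`iou`, `build_subsets`, `subset_score`, `label_patches`), the box format `(x1, y1, x2, y2)`, and the use of a single overlap threshold in place of the continuous parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_subsets(boxes: Sequence[Box], scores: Sequence[float],
                  overlap_thr: float) -> List[List[int]]:
    """Greedy grouping: seed each subset with the highest-scored remaining
    patch, then add every remaining patch that overlaps the seed."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    subsets: List[List[int]] = []
    while order:
        seed, rest = order[0], order[1:]
        members = [seed] + [i for i in rest
                            if iou(boxes[seed], boxes[i]) >= overlap_thr]
        subsets.append(members)
        order = [i for i in rest if i not in members]
    return subsets


def subset_score(scores: Sequence[float], members: Sequence[int]) -> float:
    """Score of a subset: mean score of its patches."""
    return sum(scores[i] for i in members) / len(members)


def label_patches(boxes: Sequence[Box], top: int, thr: float) -> List[int]:
    """+1 for patches overlapping the top patch by at least thr, else -1."""
    return [1 if iou(boxes[top], b) >= thr else -1 for b in boxes]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.6, 0.3]
subsets = build_subsets(boxes, scores, overlap_thr=0.5)
best = max(subsets, key=lambda m: subset_score(scores, m))
```

With a threshold of 0 every patch joins one subset, and with a threshold of 1 each distinct patch forms its own subset, which mirrors the two extremes of the continuous parameter described above.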
III-B Feature Pyramid Network for Patch Estimation

Before applying classification to the whole image, we aim to generate image regions (patches) by randomly extracting patches from the images. In the next step, the collected patches go through the proposed FPN to estimate the probability of the objects. Consider an image of a given size in pixels. A sliding window of a fixed size is moved over the image with a fixed stride, dividing the image into patches, each indexed by its position along the vertical and horizontal axes. These indexes report the location information of the patches. Each bag (image) contains a different number of patches, and the patches can be resized to match the network's input size. For the patch-level estimation, we propose a novel FPN model that contains two main components: a backbone network for constructive feature extraction, and a scale-wise feature fusion module to efficiently integrate high-level semantic features with low-level ones. The proposed model aims to optimize feature fusion in a more principled way to improve object detection in RSIs. In the next section, we target the problem of multi-scale feature fusion and propose a novel model with multiple cross-scale connections and weighted feature fusion. Our efficient scale-wise feature pyramid architecture is named ESS-FPN.
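The sliding-window division into indexed patches can be sketched as below. This is an illustrative sketch only: the function name `sliding_window_patches`, the square window, and the choice to drop any remainder at the image border are assumptions, not details given in the text.

```python
from typing import List, Tuple

# (row index, column index, top, left, bottom, right)
Patch = Tuple[int, int, int, int, int, int]


def sliding_window_patches(height: int, width: int,
                           window: int, stride: int) -> List[Patch]:
    """Enumerate square windows over an image of the given size.

    Each patch carries its vertical and horizontal index together with its
    pixel coordinates. Border pixels that do not fit a full window are
    ignored (an assumption made for simplicity).
    """
    patches: List[Patch] = []
    for r, top in enumerate(range(0, height - window + 1, stride)):
        for c, left in enumerate(range(0, width - window + 1, stride)):
            patches.append((r, c, top, left, top + window, left + window))
    return patches


patches = sliding_window_patches(height=8, width=8, window=4, stride=2)
```

A stride smaller than the window size produces overlapping patches, which is what later allows spatially related patches to be grouped.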

III-B1 Multi-scale Feature Fusion

The goal of multi-scale feature fusion is to combine features of different scales. The input is a list of multi-scale features, one per level, and our objective is to find a transformation that effectively combines them and generates a new list of features as the output. Fig. 2(a) represents the methodology of the standard top-down FPN [lin2017feature]. This model takes the input features of levels 3-7, where the feature at level i has a resolution of 1/2^i of the input image. As an example, the third-level feature has 1/8 of the input resolution along each side, whereas the seventh-level feature has 1/128. The standard FPN combines the features in a hierarchical fashion:


in which rescaling is often a down/up scaling operation to match the image resolution, and Conv denotes a convolutional operation for the feature processing.
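The hierarchical top-down combination can be illustrated with a toy example. This is a simplified sketch under explicit assumptions: features are 1-D lists instead of 2-D maps, rescaling is nearest-neighbour upsampling by a factor of 2, and the convolution is replaced by the identity; the names `upsample2` and `top_down_fuse` are hypothetical.

```python
from typing import List

Feature = List[float]


def upsample2(x: Feature) -> Feature:
    """Nearest-neighbour upsampling by a factor of 2 (stand-in for rescaling)."""
    return [v for v in x for _ in range(2)]


def top_down_fuse(feats: List[Feature]) -> List[Feature]:
    """Top-down pathway over features ordered from finest to coarsest level.

    The coarsest level is passed through unchanged; every finer level is the
    sum of its input and the upsampled output of the level above it. A real
    FPN would apply a convolution after each sum; it is omitted here.
    """
    outs: List[Feature] = [[] for _ in feats]
    outs[-1] = list(feats[-1])
    for i in range(len(feats) - 2, -1, -1):
        up = upsample2(outs[i + 1])
        outs[i] = [a + b for a, b in zip(feats[i], up)]
    return outs


# Three toy levels whose resolution halves at each step.
fused = top_down_fuse([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0], [3.0]])
```

Information flows in one direction only, from coarse to fine, which is the single-path limitation discussed next.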

III-B2 Cross-Wise Feature Fusion

The standard top-down FPN cannot widely reuse the features that are extracted in the previous layers, as it only contains single-path information flows. Several approaches have been proposed for handling this problem. In PANet [liu2018path], the authors proposed to use both the top-down and reversed paths in the network architecture, as presented in Fig. 2(b). NAS-FPN [ghiasi2019fpn] introduced an FPN based on neural architecture search to develop a scale-wise network architecture; however, searching for the best path is not efficient and the resulting architecture is challenging to modify, as indicated in Fig. 2(c). Our evaluation of the accuracy and efficiency of these three networks shows that NAS-FPN achieves higher accuracy than FPN and PANet, but at an expensive computational cost. To enhance the performance and efficiency of our model, the nodes that have a single input edge are eliminated. If a particular node has a single input edge, then its contribution to the feature network is subtle and can be used in a fusion scheme. This results in a multi-path network.

Unlike PANet and NAS-FPN, which only have top-down and bottom-up paths, in our bottom-up and cross-wise model each path is treated as a single feature network layer. The same layer repeats several times in various directions, and at the final stage the features at different scales are fused to generate high-level feature fusion. To fuse a number of input features with various resolutions, a simple yet effective method rescales them to equal resolutions and then merges them. PANet [liu2018path] proposed global upscaling for improving pixel localization. Standard feature fusion methods treat all the input features equally. Nevertheless, as the input features have various resolutions, they do not contribute equally to the output feature. To deal with this problem, we propose that each input has an additional weight during the feature fusion. In light of this idea, the following three weighted fusion approaches are considered:

Weighted Feature Fusion: as demonstrated in [hu2019dynamic], a learnable weight can improve the accuracy at a very small computational cost. Nevertheless, because the scalar weight is unbounded, it may lead to unstable training. For this reason, weight normalization is generally used to limit the range of values of the weights. As discussed earlier, the outputs of the top layers are much closer to the ground truth. Using the same weights for all the locations compels the network to learn fusion weights that unfortunately miss the contributions from the low-level features. This prevents low-level features from delivering adequate edge information, which is useful for identifying object boundaries. Thus, before feature fusion, we deal with the scale variation of the responses at different levels by normalizing their scales. Thereby, a robust weight learner can avoid scale variation and better learn the fusion weights. A systematic multi-level feature extractor with a normalizer is proposed here to extract and normalize multi-level responses to the same scale (see Fig. 2(e) and 2(f)). In particular, the feature normalization unit in the module is in charge of feature map normalization in each layer. In the proposed dynamic fusion model, two different schemes for predicting adaptive fusion weights are devised to determine the location fusion weights and ensure location-invariance. The model treats all the feature maps equally, and global fusion weights are learned subsequently. Next, the model adjusts the multi-level cross outputs A to acquire a fused output by combining multi-level and multi-scale outcomes. The fusion process is as follows:

Fig. 3: The proposed network contains patch detection, which adopts ResNet-50 + SPN [shamsolmoali2008road] as the backbone, ESS-FPN as the multi-scale feature network, and a class prediction network.


in which the parameters of the convolution layer act as the fusion weights for the levels. The above equation can be generalized to the following form:


in which a single function summarizes the operation of Eq.(13), parameterized by the fusion weights. In addition, an adaptive weight learner is proposed to learn the fusion weights based on the feature condition as follows:


where the conditioning input is the feature map. The above equations illustrate the key differences between our designed adaptive weight fusion model and the current static weight fusion model. The fusion weight depends strongly on the feature map. Each input feature map yields different parameters and consequently adjusts this adaptive weight learner. Thus, the model can quickly adjust to the input image and satisfactorily learn the multi-level outcome fusion weights in an end-to-end mode. The adaptive weight learner covers two fusion weight strategies: a location-invariant weight learner and a location-adaptive weight learner. The location-invariant weight learner learns global fusion weights that are shared by every location of the fused feature maps:


On the other hand, the location-adaptive weight learner generates fusion weights in accordance with different spatial locations, which in sum leads to weighting parameters.


In this model, the location-invariant weight learner generates global fusion weights, whereas the location-adaptive weight learner aggregates the generated fusion weights for each location based on the spatial variations.

Softmax fusion: If multiclass objects exist, we apply softmax to all the weights, so that all the weights are normalized to the range between 0 and 1, indicating the importance of each input. Then, the weighted feature maps are aggregated across all the scales to build the fused feature maps. The multi-scale features are indexed by location, channel and scale, and an attention model produces the corresponding weights. The fused feature map is formulated as:


Unlike previous work [wan2019c], where sigmoid is used for selecting features from different scales, our model uses softmax weights. Since the proposed feature fusion model is performed across all the layers instead of only the final layer [liu2018path], we found that softmax performs better in fusing multi-scale features across different layers. However, softmax incurs additional cost. To reduce this extra cost, we introduce an instant fusion approach.
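As a rough illustration of softmax-weighted fusion across scales, the sketch below fuses equally sized feature maps with one scalar weight per scale. This is a simplification: the text describes weights indexed by location, channel and scale, whereas here a single weight per input is assumed, features are flat lists, and the names `softmax` and `softmax_fuse` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(weights: Sequence[float]) -> List[float]:
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_fuse(features: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Weighted sum of equally sized feature maps, one scalar weight per map."""
    norm = softmax(weights)
    size = len(features[0])
    return [sum(norm[k] * features[k][p] for k in range(len(features)))
            for p in range(size)]


# Two toy feature maps with equal raw weights: the result is their average.
fused = softmax_fuse([[1.0, 2.0], [3.0, 4.0]], weights=[0.0, 0.0])
```

The exponentials are the source of the extra cost mentioned above, since they must be evaluated for every weight at every fusion node.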

Instant fusion: In this approach, softmax is not used, and the process is therefore faster. In this operation, the value of each normalized weight falls between 0 and 1. The instant fusion is formulated as:


in which each weight is obtained using the standard ReLU activation function, and a small constant is used to prevent numerical instability. Our experiments illustrate that our instant fusion approach obtains results equal to those of the sigmoid-based fusion, but runs up to 25% faster. In the proposed model, we combine the proposed multi-directional connections and the instant fusion to obtain highly semantic features. Here, we demonstrate the details of the feature fusion in the fifth layer, as illustrated in Fig. 2(f):


in which one term is the transitional feature at the fifth level of the cross route, and the other is the result of the fifth layer on the upward route. All the remaining features are constructed similarly. It is worth mentioning that multi-scale divisible convolutions are adopted in the network to improve the efficiency.
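A softmax-free normalization of the kind described for instant fusion can be sketched as follows. The exact formula is an assumption: each raw weight is passed through ReLU and divided by the sum of all rectified weights plus a small constant. The value `eps=1e-4`, the scalar-per-input weights, and the names `normalize_weights` and `instant_fuse` are hypothetical choices for illustration.

```python
from typing import List, Sequence


def normalize_weights(weights: Sequence[float], eps: float = 1e-4) -> List[float]:
    """ReLU each weight, then divide by (eps + sum of rectified weights).

    Every normalized weight lies in [0, 1); eps (an assumed value) guards
    against division by zero when all weights are non-positive.
    """
    rectified = [max(0.0, w) for w in weights]
    denom = eps + sum(rectified)
    return [w / denom for w in rectified]


def instant_fuse(features: Sequence[Sequence[float]],
                 weights: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Weighted sum of equally sized feature maps without any exponentials."""
    norm = normalize_weights(weights, eps)
    size = len(features[0])
    return [sum(norm[k] * features[k][p] for k in range(len(features)))
            for p in range(size)]


norm = normalize_weights([1.0, -2.0, 3.0])
fused = instant_fuse([[4.0], [100.0], [8.0]], weights=[1.0, -2.0, 3.0])
```

In this example the negative weight is clipped to zero, so the second input does not contribute, and the remaining weights are close to 0.25 and 0.75.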

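Table I: Input size, width W (channels) and depth D (layers) of ESS-FPN.

Input size    Width W (channels)    Depth D (layers)
512           32                    3
640           64                    4
768           88                    5
896           112                   6
1024          160                   7
1280          224                   7
1280          288                   8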

III-B3 ESS-FPN Architecture

Fig. 3 presents the architecture of ESS-FPN. SPN [shamsolmoali2008road] is used as the backbone. The proposed ESS-FPN acts as the network for multi-scale feature extraction: it receives the features from levels 3-7 of the backbone and repeatedly executes top-down, bottom-up and cross-wise feature fusion. To improve both accuracy and efficiency, current approaches scale up a baseline by adopting larger backbone networks, using a larger dataset, or adding more FPN layers [yang2019cdnet, he2016deep, tan2020efficientdet]. Such models are generally ineffective as they can only process one or a few scaling dimensions (i.e. depth, width, and input resolution). We propose an efficient scaling method for object detection, which adopts a multiplex factor to rescale all the dimensions of the backbone network. Unlike image classification models [real2019regularized], object detectors have various scaling dimensions, so searching over all the dimensions is considerably expensive. Thus, we scale up the dimensions by using a heuristic method. We increase the ESS-FPN's depth D (layers) and width W (channels) to improve the proposed detector's efficiency. More precisely, a grid search is applied over a set of candidate values, and we choose the best value as the adjustment factor for the width. The width and depth of ESS-FPN are determined as follows:


We use feature levels 3-7 in ESS-FPN; therefore, the input resolution should be divisible by $2^7 = 128$, and the following equation is used:


where, with various , we have performed a wide range of evaluations in order to find the best parameter from to as listed in Table I, in which has a higher resolution than . For each processed image patch, the final sigmoid layer generates a probability distribution that is used for further patch selection. Compared to the other FPNs [lin2017feature, ghiasi2019fpn], in the proposed MPFP-Net, inspired by SPN [shamsolmoali2008road], a U-shaped network is adopted to build the pyramid network. In this model, the upward path takes the outputs of multiple layers as its reference sets.
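To make the resolution constraint on feature levels 3-7 concrete, the following minimal sketch checks the input sizes of Table I. It assumes that feature level l has a stride of 2^l pixels, which is the usual FPN convention but is not stated explicitly in the text; the names `TABLE_I` and `feature_map_sizes` are illustrative only.

```python
# Rows of Table I as (input_size, width_W, depth_D), copied from the table.
TABLE_I = [
    (512, 32, 3),
    (640, 64, 4),
    (768, 88, 5),
    (896, 112, 6),
    (1024, 160, 7),
    (1280, 224, 7),
    (1280, 288, 8),
]

# Feature levels used in ESS-FPN according to the text.
LEVELS = range(3, 8)


def feature_map_sizes(input_size):
    """Return {level: side length of the feature map}.

    Assumption (not from the paper): level l has stride 2**l.
    Raises ValueError if the input size is not divisible by a stride.
    """
    sizes = {}
    for level in LEVELS:
        stride = 2 ** level
        if input_size % stride != 0:
            raise ValueError(f"{input_size} is not divisible by {stride}")
        sizes[level] = input_size // stride
    return sizes


if __name__ == "__main__":
    for input_size, width, depth in TABLE_I:
        print(input_size, width, depth, feature_map_sizes(input_size))
```

Under this assumption every input size in Table I is a multiple of 128, so each of the five levels receives an integer-sized feature map.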

Further, to enhance the performance and preserve the smoothness of the features, convolution layers are inserted after each up-scaling step. In total, 5 stacks of U-shaped networks are used to build the multi-scale features. In our model, after the patch-level estimation, visual domain aggregation is applied to connect all the detected patch-level probabilities into a detected image map. In the present patch, visual features are first aggregated using a pooling layer and then transformed to a semantic domain.

IV Experiments

In this section, we first describe the benchmark datasets for object detection in RSIs and then the evaluation metrics, training procedure, and implementation details of our proposed model. Next, we compare the performance of the proposed MPFP-Net with that of several state-of-the-art approaches. MPFP-Net is implemented in PyTorch and all the experiments are conducted on a workstation equipped with Tesla P40 GPUs. To validate that the proposed MPFP-Net can learn effective features for object detection with different appearance variations and scales, the activation values of the classification convolution layers, including scales and level dimensions, are shown in Fig. 4. The input image contains four harbors, two ships and two cars. The sizes of the harbors, ships and cars are different.

It is worth mentioning that: (1) compared with the smaller harbors, the larger harbors have larger activation values at the feature map of a large scale, and the same holds for the larger ship relative to the smaller one; (2) the smaller harbors and the smaller ship have higher activation values at the feature maps of the same scale. This sample illustrates that MPFP-Net learns effective features to deal with different scales and appearance complexity across object patches. MPFP-Net is evaluated on three public datasets: NWPU VHR-10 [dong2019sig], LEVIR [zou2017random], and DOTA [xia2018dota].

Table II: Comparison of MPFP-Net with other detectors on LEVIR (latency given as mean ± standard deviation).
Model | mAP | Params | Ratio | FLOPs | Ratio | GPU LAT (ms) | Speedup | CPU LAT (s) | Speedup
MPFP-Net-S (512) | 62.39 | 4.1M | 1 | 3.8B | 1 | 19±2.1 | 1 | 0.35±0.003 | 1
LARGE-RAM [zou2017random] | 62.42 | 18.3M | 4.3 | 53B | 13.6 | 52 | 2.7 | 3.9±0.034 | 9.6
MPFP-Net-S (640) | 65.07 | 7.6M | 1 | 8.1B | 1 | 23±0.8 | 1 | 0.78±0.004 | 1
RIRBM [diao2016efficient] | 64.31 | 67M | 8.5 | 368B | 46.1 | 279±0.3 | 12.1 | 57±0.017 | 63.1
MPFP-Net-S (768) | 72.55 | 8.7M | 1 | 16.3B | 1 | 26±1.1 | 1 | 1.3±0.003 | 1
SUCNN [hu2019sample] | 70.11 | 103M | 11.4 | 1028B | 64.3 | 362±0.4 | 13.8 | 57±0.028 | 35.2
MPFP-Net-S (896) | 75.71 | 13.4M | 1 | 35B | 1 | 44.8±0.5 | 1 | 2.7±0.005 | 1
HSF-Net [li2018hsf] | 75.33 | 152M | 11.3 | 2389B | 68.4 | 566±0.8 | 12.5 | 98±0.043 | 32
ResNet-50 + NAS-FPN (1280) [ghiasi2019fpn] | 73.75 | 59.6M | 4.5 | 350B | 10 | 267±0.4 | 5.8 | 57±0.170 | 20.2
ResNet-50 (1280) [he2016deep] | - | 25.6M | - | 125B | - | 117±0.3 | - | 24±0.05 | -
MPFP-Net-S (1024) | 79.05 | 21.1M | 1 | 56B | 1 | 76±0.4 | 1 | 5.1±0.002 | 1
Sig-NMS [dong2019sig] | 78.11 | 71.7M | 3.8 | 479B | 8.3 | 290±1.3 | 3.8 | 60.14±0.007 | 11.9
ResNet-50 + NAS-FPN (128084) [ghiasi2019fpn] | 74.28 | 72.6M | 3.6 | 546B | 9.5 | 327±0.1 | 4.3 | 63.71±0.031 | 12.5
MPFP-Net-S (1280) | 83.42 | 34.2M | 1 | 142B | 1 | 145±2.1 | 1 | 11.4±0.014 | 1
LV-Net [li2019nested] | 82.34 | 114M | 3.3 | 1259B | 8.8 | 438±1.6 | 3.2 | 84.31±0.034 | 7.5
MPFP-Net-S (1280) | 86.73 | 51.2M | 1 | 228B | 1 | 198±0.7 | 1 | 17±0.002 | 1
HSP [xu2020hierarchical] | 85.32 | 125M | 2.6 | 1428B | 6.2 | 416±1.8 | 2.25 | 85±0.096 | 4.4
MS-OPN [deng2018multi] | 85.21 | 101.6M | 2.2 | 1019B | 4.4 | 380±1.5 | 1.9 | 72±0.081 | 4.2
Fig. 4: Exemplar activation values of multi-scale multilevel features. Best viewed in color.

IV-A Dataset

IV-A1 NWPU VHR-10

For evaluating the performance of the proposed MPFP-Net model, we use the challenging ten class Northwestern Polytechnical University Very High-Spatial Resolution (NWPU VHR-10) dataset [dong2019sig]. This dataset contains 650 VHR optical RSIs, in which 565 images were obtained from Google Earth, where each image has the size of pixels with the resolution ranging from m, and 85 pansharpened infrared images with resolution. The dataset includes ten manually annotated classes.

IV-A2 LEVIR

This dataset contains 22k high resolution Google Earth images, where each image has the size of pixels and a resolution of m/pixel [zou2017random]. The dataset contains three annotated classes with small objects of pixels and minimum objects of pixels.

IV-A3 DOTA

DOTA [xia2018dota] is a large RSI dataset for object detection. It comprises 2806 images of different sizes ( to pixels) and 188282 instances in 15 object categories: plane, baseball diamond (BD), bridge, ground track field (GTF), small vehicle (SV), large vehicle (LV), ship, tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor, swimming pool (SP), and helicopter (HC). Each object is labeled with an arbitrary quadrilateral.

IV-B Evaluation Metrics

The target detection results contain two components: the bounding boxes (BBs) that enclose the detected targets, and their labels. In general, intersection over union (IoU) is used as the evaluation metric in object detection; it denotes the ratio between the area of overlap and the area of union of the estimated and the ground-truth BBs. The IoU is formulated as follows:


in which and are the predicted areas and the ground truth boxes, respectively. Here, we use precision-recall curves (PRCs) and the average precision (AP) of a single object class over recall from 0 to 1 to assess the object detection framework; a higher AP represents better performance. For multi-class detection, mAP is the mean of the APs of all the classes.
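The metrics described above can be sketched in a few lines. This is only an illustration under stated assumptions: boxes are axis-aligned and given as `(x_min, y_min, x_max, y_max)`, and AP is computed as the area under the precision-recall curve with all-point interpolation, which is one common convention and not necessarily the one used in the experiments. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Intersection is empty when the boxes do not overlap on either axis.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(recalls, precisions):
    """Area under a precision-recall curve (all-point interpolation).

    `recalls` must be sorted in ascending order and paired with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP as the plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold; the threshold itself is not specified in this section.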

Fig. 5: Selected detection samples by MPFP-Net on NWPU VHR-10 (zoom in for a better view).

IV-C Training Procedures and Implementation Details

NWPU-VHR-10 and LEVIR labels are in the conventional axis-aligned BB form, while the labels of DOTA objects are in quadrilateral form. To accommodate both settings, the proposed MPFP-Net uses both horizontal and oriented BBs (HBB, OBB) as ground truth, where HBB:, OBB:, here , denote width, height and is within to create ground truth for each object. In training, the OBB ground truth is produced by a group of rotated rectangles that properly overlap with the given quadrilateral labels. For the NWPU-VHR-10 and LEVIR datasets, MPFP-Net produces only HBB results, as OBB ground truth does not exist in these datasets.
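The relation between quadrilateral labels, HBBs and OBBs can be illustrated with a small sketch. The parameterizations are assumptions, since the exact definitions are not recoverable from the text: an HBB is taken as `(x_min, y_min, x_max, y_max)` and an OBB as `(cx, cy, w, h, theta)` with `theta` in radians. The sketch does not fit an OBB to a quadrilateral; it only shows how an HBB encloses a quadrilateral and how an OBB expands to its corner points.

```python
import math


def quad_to_hbb(quad):
    """Smallest axis-aligned box enclosing four (x, y) corner points."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def obb_to_corners(cx, cy, w, h, theta):
    """Corner points of a w-by-h rectangle centred at (cx, cy).

    `theta` is a rotation in radians about the centre
    (assumed parameterization, not taken from the paper).
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_offsets = [(-w / 2, -h / 2), (w / 2, -h / 2),
                    (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
            for dx, dy in half_offsets]


if __name__ == "__main__":
    quad = [(1, 2), (4, 1), (5, 3), (2, 4)]
    print(quad_to_hbb(quad))
    print(quad_to_hbb(obb_to_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2)))
```

Composing the two functions gives the HBB of an oriented box, which is one simple way to derive horizontal ground truth from oriented annotations.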

For DOTA, however, MPFP-Net produces both HBB and OBB outputs, as presented in Fig. 6. In the training phase, for the LEVIR and NWPU VHR-10 datasets, we resize the original images to pixels. Because the number of NWPU VHR-10 images is insufficient, we enlarge the training set by rotation and mirroring. For the DOTA dataset, we split the images into patches with a 200-pixel overlap using the development toolkit. In our experiments, 75% of LEVIR is selected for training and the remaining 25% for testing. We select 60% of NWPU VHR-10 for training, 10% for validation, and the rest for testing, and 60% of DOTA for training, 20% for validation, and the remainder for testing. We employ ResNet-50 + SPN [shamsolmoali2008road] as the backbone.

For the DOTA and LEVIR datasets, we train the model for 150k iterations with batch size 1 on 2 Tesla P40 GPUs, which took around 24 hours. The initial learning rate is set to 1e-4 and is divided by 10 after every 30k iterations. The weight decay is set to 0.0005 and the momentum to 0.9; batch normalization and Swish activation (to enhance the backpropagation) are used after each convolution layer. Adam optimization [kingma2014adam] is used to speed up the training. For the NWPU dataset, we train the model for 30k iterations; the initial learning rate is set to 1e-3 and changed to 1e-4 and 1e-5 at 10k and 20k iterations, respectively, which took around 3 hours. Since the number of NWPU images is limited, we also use data augmentation, including rotation and random flip, during training.
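Two parts of this setup lend themselves to a short sketch: cutting large images into overlapping patches and the step-wise learning-rate schedules. The code below is a hedged illustration, not the development toolkit or the training script. The patch size of 1024 in the usage example is a placeholder, because the actual patch size is not recoverable from the text, and the choice to apply each learning-rate change exactly at the stated iteration is also an assumption.

```python
def tile_origins(length, patch, overlap):
    """Start offsets of sliding windows of size `patch` along one image axis.

    Consecutive windows share `overlap` pixels; the last window is shifted
    back so that it ends exactly at the image border.
    """
    if patch <= overlap:
        raise ValueError("patch must be larger than overlap")
    if length <= patch:
        return [0]
    stride = patch - overlap
    origins = list(range(0, length - patch, stride))
    origins.append(length - patch)
    return origins


def step_lr(iteration, base_lr=1e-4, step=30_000, factor=0.1):
    """DOTA/LEVIR schedule as described: divide by 10 every 30k iterations."""
    return base_lr * factor ** (iteration // step)


def nwpu_lr(iteration):
    """NWPU schedule as described: 1e-3, then 1e-4 at 10k and 1e-5 at 20k."""
    if iteration < 10_000:
        return 1e-3
    if iteration < 20_000:
        return 1e-4
    return 1e-5


if __name__ == "__main__":
    # Hypothetical 2000-pixel axis, placeholder patch size, 200-pixel overlap.
    print(tile_origins(2000, patch=1024, overlap=200))
    print([step_lr(i) for i in (0, 30_000, 60_000)])
    print([nwpu_lr(i) for i in (0, 10_000, 20_000)])
```

Applying `tile_origins` to both image axes and taking the Cartesian product of the offsets yields the top-left corners of all patches.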

Fig. 6: Example of detection results by MPFP-Net on DOTA.
Fig. 7: Precision-recall comparisons between MPFP-Net and other methods on the LEVIR dataset for the classes airplane, ship, and oil-tank, and for the mean AP.

In Table II, we compare MPFP-Net with several object detection methods on LEVIR. MPFP-Net achieves better accuracy at lower computational cost than the other models across a broad range of resource constraints. MPFP-Net-S (512) adds only 4.1M parameters on top of the backbone parameters and obtains almost the same accuracy as LARGE-RAM [zou2017random] (62.39 vs. 62.42 mAP) with fewer FLOPs (3.8B vs. 53B). In comparison with RIRBM [diao2016efficient], MPFP-Net-S (640) obtains a better detection rate (65.07 vs. 64.31 mAP) with lower FLOPs (8.1B vs. 368B) and fewer parameters (7.6M vs. 67M). In our proposed model, by increasing the input size from to the result is improved by 3 mAP; on the other hand, the model's size and FLOPs are increased by 1.8% and 2.1%, respectively.

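Table III: HBB detection results on NWPU VHR-10.
Class | HSF-Net [li2018hsf] | Sig-NMS [dong2019sig] | LV-Net [li2019nested] | MS-OPN [deng2018multi] | HSP [xu2020hierarchical] | MPFP-Net
Airplane | 88.64 | 90.84 | 93.40 | 99.74 | 99.80 | 99.84
Ship | 80.54 | 80.58 | 81.80 | 89.02 | 92.47 | 92.63
ST | 59.13 | 59.25 | 73.57 | 80.13 | 96.99 | 96.98
BD | 88.35 | 90.89 | 98.35 | 98.10 | 98.58 | 98.49
TC | 79.38 | 80.86 | 84.89 | 86.09 | 90.39 | 89.83
BC | 89.87 | 90.94 | 88.86 | 92.57 | 91.49 | 91.96
GTF | 96.41 | 99.85 | 96.87 | 98.65 | 99.08 | 99.73
Harbor | 81.81 | 90.39 | 91.64 | 94.61 | 88.93 | 94.82
Bridge | 65.59 | 67.86 | 82.91 | 94.37 | 87.11 | 92.30
Vehicle | 78.89 | 78.16 | 79.87 | 82.23 | 89.09 | 89.15
mAP | 80.84 | 82.93 | 87.19 | 91.53 | 93.40 | 94.57
Parameters | 153M | 71.7M | 115M | 103M | 126M | 52.3M
FLOPs | 2389B | 479B | 1270B | 1022B | 1525B | 230B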

IV-D Comparison with other State-of-the-Art Methods

Fig. 5 presents some detected objects by our proposed model on the evaluation datasets.

IV-D1 Performance evaluation on the NWPU VHR dataset

As seen in Fig. 5, the detected objects belong to various classes, for example, airplanes, ships, harbors, and oil tanks. The proposed MPFP-Net performs well in object detection, and even objects that lie very close together are properly detected.

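Table IV: Detection comparison on LEVIR.
Class | HSF-Net [li2018hsf] | Sig-NMS [dong2019sig] | LV-Net [li2019nested] | MS-OPN [deng2018multi] | HSP [xu2020hierarchical] | MPFP-Net
Airplane | 81.04 | 86.83 | 82.58 | 85.09 | 86.92 | 87.24
Ship | 77.29 | 79.42 | 80.57 | 83.76 | 83.75 | 85.57
Oil-tank | 67.62 | 68.28 | 83.59 | 86.80 | 87.42 | 87.35
mAP | 75.33 | 78.11 | 82.34 | 85.21 | 85.32 | 86.73
Parameters | 152M | 71.7M | 114M | 101.6M | 125M | 51.2M
FLOPs | 2389B | 479B | 1259B | 1019B | 1428B | 228B

Table V: HBB detection performance on DOTA.
Class | HSF-Net [li2018hsf] | Sig-NMS [dong2019sig] | LV-Net [li2019nested] | MS-OPN [deng2018multi] | HSP [xu2020hierarchical] | MPFP-Net
Plane | 82.34 | 83.67 | 85.38 | 88.61 | 90.36 | 90.49
BD | 79.66 | 80.29 | 84.63 | 86.11 | 86.85 | 86.79
Bridge | 48.74 | 50.12 | 52.81 | 54.76 | 62.51 | 62.68
GTF | 76.38 | 77.51 | 79.49 | 79.96 | 79.83 | 79.71
SV | 65.42 | 68.37 | 71.64 | 75.29 | 78.07 | 78.22
LV | 64.92 | 69.74 | 70.83 | 78.34 | 81.79 | 81.97
Ship | 75.37 | 77.46 | 80.72 | 83.17 | 85.27 | 85.24
TC | 84.68 | 86.69 | 88.48 | 90.91 | 90.82 | 90.88
BC | 79.42 | 81.45 | 84.39 | 86.12 | 87.24 | 87.21
ST | 74.19 | 79.76 | 81.73 | 84.96 | 85.90 | 85.98
SBF | 60.33 | 61.94 | 65.75 | 68.58 | 69.93 | 70.18
RA | 65.53 | 68.44 | 70.51 | 71.27 | 72.08 | 72.02
Harbor | 69.58 | 71.45 | 73.40 | 76.12 | 84.11 | 84.24
SP | 68.92 | 68.64 | 70.85 | 71.59 | 80.92 | 81.13
HC | 64.60 | 66.39 | 66.42 | 67.77 | 69.81 | 69.78
mAP | 70.67 | 72.65 | 75.23 | 77.38 | 80.36 | 80.43
Parameters | 155M | 72.8M | 116M | 105M | 128M | 54.8M
FLOPs | 2394B | 481B | 1286B | 1028B | 1529B | 242B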

For the evaluation, other state-of-the-art RS detection approaches, namely Sig-NMS [dong2019sig], HSF-Net [li2018hsf], LV-Net [li2019nested], MS-OPN [deng2018multi], and HSP [xu2020hierarchical], are applied to NWPU VHR-10 and their object detection performance is compared with that of MPFP-Net. Table III lists the HBB detection results. As the results show, MPFP-Net obtains the highest mAP among the compared methods and achieves better results in the airplane and ship classes. These two classes contain tiny targets, indicating that our proposed method performs satisfactorily in detecting small objects. MPFP-Net achieves a higher mAP with far fewer parameters and FLOPs than the other approaches on the NWPU-VHR-10 dataset.

IV-D2 Performance evaluation on LEVIR

Detailed object detection comparisons between MPFP-Net and the other state-of-the-art models on the LEVIR dataset are reported in Table IV. MPFP-Net obtains the best mAP and the best results in the airplane and ship classes, while on oil-tank it is marginally below HSP (87.35 vs. 87.42). In particular, our model surpasses HSP and MS-OPN in the airplane and ship classes, which contain a large number of small objects; this indicates that MPFP-Net is able to detect small objects successfully. Fig. 7 shows the PRCs of the three classes of LEVIR and the mean AP for HSF-Net, Sig-NMS, LV-Net, MS-OPN, HSP and MPFP-Net, respectively. The blue curves, which represent the performance of our proposed model, are higher than or similar to those of the other methods across the classes.

IV-D3 Performance evaluation on DOTA

In Table V, we evaluate the HBB detection performance of MPFP-Net on the 15 classes of the DOTA dataset in comparison with the other approaches. The results demonstrate that the proposed method achieves the highest mAP and per-class performance that is better than or comparable to that of the other object detection approaches.

To show the advantages of MPFP-Net over the other methods, we perform a qualitative comparison of different methods on the DOTA dataset. The results are shown in Fig. 8. They show that the other approaches cannot robustly detect the objects in the images, and that the background is mis-detected as foreground. Moreover, in the other methods the bounding boxes do not fit the detected objects well, and Sig-NMS [dong2019sig] only generates horizontal boxes. In contrast, our method stably produces precise results.

Fig. 8: Qualitative detection comparison by different models on the DOTA dataset. The green boxes show the ground truth. The red, blue and yellow boxes indicate the detected planes, large vehicles and small vehicles, respectively.

IV-E Experiment on Large-Scale Images

In Fig. 9, we test the performance of MPFP-Net on a large-scale RSI. The pre-trained MPFP-Net adapts adequately to different image sources and conditions, which shows the effect of multiple patches and feature pyramid learning.

Fig. 9: Detection results on a large-scale RSI from Geomatica-CGI System. Green and orange respectively show ships and harbors.
Table VI: Impact of each module on the performance of MPFP-Net.
Configuration | mAP | Parameters | FLOPs
MPFP-Net-S + FPN | 78.41 | 122M | 1253B
MPFP-Net-S + SPN | 83.69 | 84.9M | 387B
MPFP-Net-S + SPN + ESS | 86.73 | 51.2M | 228B
Fig. 10: Comparison of model size and inference latency. Latency is measured with batch size 1 on a machine equipped with a P40 GPU. The proposed MPFP-Net models are 2.2× to 3.8× smaller and 1.9× to 3.7× faster than the other detectors.
Fig. 11: (a) Instant fusion vs. softmax fusion and dynamic fusion. (b) Comparison of different multi-patch scaling methods.

V Ablation Study

This section ablates different components of MPFP-Net on the LEVIR test set. We evaluate different parameter sizes, FLOPs and latency on a P40 GPU. Each model is run 5 times with a batch size of 1, and the mean and standard deviation are reported in Table II. Fig. 10 shows the model size, FLOPs and GPU latency. In comparison with the other models, MPFP-Net is up to 3.7× faster. We conduct a set of evaluations to measure the contribution of the backbone network and ESS-FPN to the detection accuracy and efficiency improvements of the proposed MPFP-Net. Table VI reports the impact of each module on the overall performance of our proposed model. Starting from MPFP-Net-S with FPN [lin2017feature] as the backbone, we first replace FPN with SPN [shamsolmoali2008road], which improves the detection accuracy by more than 5 mAP with fewer parameters and FLOPs. We then add the proposed ESS-FPN, which improves the detection performance by a further 3 mAP, again with fewer parameters and FLOPs.
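The step-by-step gains quoted above can be recomputed directly from the rows of Table VI. The snippet below is only a bookkeeping aid; the tuple layout and the function name `stepwise_gains` are illustrative, and the numbers are copied from the table.

```python
# Rows of Table VI: (configuration, mAP, parameters in millions, FLOPs in billions).
ABLATION = [
    ("MPFP-Net-S + FPN", 78.41, 122.0, 1253.0),
    ("MPFP-Net-S + SPN", 83.69, 84.9, 387.0),
    ("MPFP-Net-S + SPN + ESS", 86.73, 51.2, 228.0),
]


def stepwise_gains(rows):
    """Compare each configuration with the one listed before it.

    Returns (name, mAP gain, parameter reduction factor, FLOPs reduction factor).
    """
    gains = []
    for previous, current in zip(rows, rows[1:]):
        gains.append((
            current[0],
            round(current[1] - previous[1], 2),
            round(previous[2] / current[2], 2),
            round(previous[3] / current[3], 2),
        ))
    return gains


if __name__ == "__main__":
    for name, map_gain, param_factor, flops_factor in stepwise_gains(ABLATION):
        print(f"{name}: +{map_gain} mAP, "
              f"{param_factor}x fewer parameters, {flops_factor}x fewer FLOPs")
```

The first step yields a gain slightly above 5 mAP and the second a gain of about 3 mAP, in line with the statements in the text.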

V-A Instant Fusion vs Softmax and Dynamic Fusion

As discussed earlier, we propose an instant feature fusion approach that preserves the benefits of normalized weights while reducing the cost of the softmax computation. In Fig. 11(a), we examine the performance of MPFP-Net when adopting dynamic fusion, softmax fusion and the proposed instant fusion. The proposed instant fusion achieves accuracy and learning behavior similar to those of the softmax fusion, but runs faster. During training, the normalized weights change quickly, suggesting that the various features contribute unequally to the feature fusion.
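The difference between softmax-weighted fusion and a cheaper normalized-weight fusion can be sketched as follows. The second function is an assumption: it follows the fast normalized fusion used in EfficientDet [tan2020efficientdet] (non-negative weights divided by their sum plus a small epsilon) and is only a stand-in for the instant fusion, whose exact formula is not given in this section. Features are plain Python lists here to keep the sketch dependency-free.

```python
import math


def softmax_weights(raw):
    """Softmax over the raw fusion weights (needs one exponential per input)."""
    peak = max(raw)
    exps = [math.exp(value - peak) for value in raw]
    total = sum(exps)
    return [value / total for value in exps]


def normalized_weights(raw, eps=1e-4):
    """Assumed stand-in for instant fusion: clip weights at zero and normalize.

    No exponential is required, which is where the speed-up would come from.
    """
    clipped = [max(0.0, value) for value in raw]
    total = eps + sum(clipped)
    return [value / total for value in clipped]


def fuse(features, weights):
    """Weighted element-wise sum of equally sized feature vectors."""
    length = len(features[0])
    return [sum(w * feature[i] for w, feature in zip(weights, features))
            for i in range(length)]


if __name__ == "__main__":
    raw = [1.0, 3.0]
    features = [[1.0, 2.0], [3.0, 4.0]]
    print(fuse(features, softmax_weights(raw)))
    print(fuse(features, normalized_weights(raw)))
```

Both variants produce weights that lie between 0 and 1, so each input's contribution stays bounded; they differ only in how the normalization is computed.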

V-B Multi-Patch and Combined Multi-Scaling

As discussed in Section III-B, we propose multiple-patch learning to deal with the lack of full object instance labelling, and a combined multiple scaling method that increases all the dimensions of ESS-FPN for better learning and, consequently, better detection performance. In Fig. 11(b), we compare our approach with other models that only use patch learning or scale up a single dimension (resolution, depth or width). As the results illustrate, our model is more efficient than the other baselines, which demonstrates the advantages of patch-wise and joint scale-wise learning.

VI Conclusion

In this paper, we have proposed a novel weakly supervised model for detecting multi-scale objects in RSIs using a multi-patch feature pyramid network. First, we integrated automatic patch selection, feature aggregation and semantic domain projection within a single unified framework. Second, we proposed a weighted feature pyramid network that uses multi-directional connections for fast and efficient scale-wise feature fusion to further optimize object detection in RSIs. Moreover, a joint loss was used to train the whole network end-to-end. On the basis of these methodologies, we introduced a new detector, named MPFP-Net, which obtains better accuracy and efficiency than the other state-of-the-art methods on RSIs. To evaluate the performance of MPFP-Net for multi-scale object detection, three publicly available datasets were used. On these datasets, we compared our proposed method with several CNN-based object detection models. Experimental results demonstrate that MPFP-Net can effectively and efficiently detect multi-scale objects.