Generally speaking, optical remote sensing imagery is collected from airborne or satellite sources in the optical wavelength range. As a large number of multispectral images or very-high-resolution RGB images are freely available on a large scale, there is growing interest in various applications, such as dimensionality reduction [1, 2], segmentation [3, 4], unmixing [5, 6, 7, 8], data fusion [9, 10, 11], object detection and tracking [12, 13, 14], and classification or recognition [15, 16, 17, 18]. In recent years, geospatial object detection has attracted much attention due to its importance in environmental monitoring, ecological protection, hazard response, etc. However, optical remote sensing imagery inevitably suffers from all kinds of deformations, e.g., variability in viewpoint, scale, and direction, which results in performance degradation of detection algorithms. In addition, objects in optical remote sensing imagery [19, 20, 21, 22, 23], such as cars and airplanes in Fig. 1, are generally small relative to the Ground Sampling Distance (GSD) and embedded in cluttered backgrounds. To overcome these challenges, object detection has been extensively studied in the remote sensing community since the 1980s.
Many publicly available benchmarks, e.g., the TAS aerial car detection dataset (http://ai.stanford.edu/~gaheitz/Research/TAS/tas.v0.tgz) and the NWPU VHR-10 dataset [24, 25], have contributed to spurring interest and progress in remote sensing object detection. Driven by the diversity of these databases, many robust methods have emerged one after another to further improve detection performance. Existing detection methods can be roughly categorized as follows: template matching-based, knowledge-based, object-based, and machine learning-based methods, along with other variants. These approaches mostly fail to describe object features in a complete space with a dense set of scales. In our case, the so-called complete space should cover different properties that are robust against various deformations, e.g., shift, rotation, etc. Moreover, a good image descriptor should be able to capture substantial image patterns with a coarsely sampled image pyramid. We will detail the methods closest to our work and clarify the similarities and differences, as well as the pros and cons, in the next section: Related Work.
I-A Motivation and Objectives
Object deformation (e.g., rotation, translation) in recognition or detection tasks is a common but still challenging problem. In particular, remote sensing imagery is prone to more complex rotation behavior (see Fig. 1), due to its “bird's-eye” perspective. Although learning-based methods, such as deep neural networks (DNNs) and deep convolutional neural networks (deep CNNs), have been proposed to learn rotation-invariant features by manually augmenting the training set with different rotations, they are inevitably limited by the preset rotation angles. This makes it difficult to adaptively handle rotations at fractional angles, thereby yielding a performance bottleneck. Another important factor that has a great effect on detection performance is the feature itself, which can be manually designed or extracted by a DNN. However, such powerful learning approaches fail to provide richer representations without the strong support of large-scale labeled training samples.
Consequently, in this paper we mainly make our efforts to manually develop or optimize features towards more discriminative rotation-invariant representations under the seminal object detection framework presented by Viola and Jones (VJ), rather than relying on learning-based methods.
I-B Method Overview and Contributions
To effectively address the aforementioned issues, self-adaptive rotation-invariant channel features are first constructed in polar coordinates, which has been theoretically proven to fit rotations of arbitrary angles. Furthermore, shift-invariant channel features in Cartesian coordinates (e.g., color, gradient magnitude) are also extracted as channel extensions in order to fully explore the potential of the feature representation, yielding a joint spatial-frequency channel feature (SFCF). We then step towards feature learning or refinement (e.g., subspace learning, aggregated channel features (ACF)) to further refine the representations. Such features are finally fed into a boosting classifier built on a series of depth-3 decision trees.
For geospatial object detection in remote sensing, we propose a variant of the VJ object detection framework, called the optical remote sensing imagery detector (ORSIm detector). Unlike previous models in [27, 28] that are sensitive to translations and rotations, the ORSIm detector is a more general and powerful framework that is robust against various variabilities, particularly for remote sensing imagery. Additionally, a fast pyramid method is adopted to effectively handle multi-scale objects without sacrificing detection performance. Fig. 2 outlines the basic framework of the ORSIm detector. The main highlights of our work are threefold.
We propose a novel ORSIm detector that follows the basic VJ framework and integrates spatial-frequency channel features (SFCF), feature learning or refinement, fast image pyramid estimation, and ensemble classifier learning (AdaBoost);
A spatial-frequency channel feature is designed by simultaneously considering invariance to rotation and shift in order to handle the complex object deformation behaviors in remote sensing imagery;
An image pyramid generative model is embedded into the proposed framework in a simple but effective way by quickly estimating a scaling factor in the image domain.
The remainder of this paper is organized as follows. Section II briefly reviews the previous work closely related to ours. Section III describes the proposed framework, including multi-domain feature extraction, feature stacking, feature learning, training, and testing. The experimental results on two datasets are reported in Section IV. Section V concludes our work and briefly discusses future work.
II Related Work
In this section, several advanced techniques in object detection are introduced with their applications to remote sensing imagery. We also clarify the advantages of our approach over three kinds of similar methods partly associated with our work.
II-A Channel Features
Channel features refer to a collection of spatially discriminative features obtained by linear or non-linear transformations of the input image. Over the past decades, channel feature extraction techniques have received increasing interest, with successful applications in pedestrian detection [28, 30] and face detection [31, 32, 33]. Owing to their high representation ability, a variety of channel features have been widely used in geospatial object detection. Tuermer et al. utilized the Histogram of Oriented Gradients (HOG) as orientation channel features for airborne vehicle detection in a dense urban scene. Unfortunately, using orientation features alone tends to hinder further improvement of detection performance. Inspired by the aggregate channel features (ACF), Zhao et al. extended the channel features by additionally considering color channel features (e.g., gray-scale, RGB, HSV, and LUV) to detect aircraft in remote sensing images. However, these methods usually fail to achieve desirable performances due to their sensitivity to object rotation. Although many tentative works have been proposed to model the object's rotation behavior [37, 38], the performance gain is still limited by the discrete spatial coordinate system.
With a theoretical guarantee, Liu et al. proposed the Fourier Histogram of Oriented Gradients (FourierHOG) with a rigorous mathematical proof. It models the rotation-invariant descriptor in a continuous frequency domain rather than in the discrete spatial domain, using a Fourier-based, convolutionally manipulated, tensor-valued transformation function. This function maps the tensor-valued vectorized features (e.g., HOG) to a scalar-valued representation, so as to make the features invariant with a maximized information gain. In contrast with HOG-like approaches that discretely compute the features (or descriptors) in locally estimated coordinates obtained from pose normalization, FourierHOG uses a smooth continuous function for fitting the statistical features in a continuous coordinate system, as illustrated in Fig. 3. Furthermore, such a strategy can also avoid artifacts in the gradient binning and pose sampling of the HOG descriptor.
Despite its superiority in representing rotation invariance, FourierHOG ignores the importance of feature diversity. To this end, the proposed ORSIm detector extends the single-domain channel features towards joint spatial-frequency ones, thereby further enriching the representations. On the other hand, FourierHOG in fact simplifies the challenging problem of object detection to one of object recognition. More specifically, the task of detecting the bounding box of an object is converted into that of recognizing whether the central pixel belongs to an object or not, as illustrated in Fig. 4(a).
II-B Feature Channel Scaling
Image multi-resolution decomposition is one of the essential techniques in high-level image analysis, such as object detection and tracking. The VJ framework is a seminal work for real-time object detection [40, 26, 41], running in real time on a 700 MHz Intel Pentium III processor. Following this framework, HOG yields higher detection accuracy. Nevertheless, these two representative algorithms evenly sample scales in log-space and construct a feature pyramid for every scale. This is very time-consuming, and keeping the computational cost low is a significant challenge. Inspired by the fractal statistics of natural images, Dollár et al. proposed a fast pyramid generative model that only estimates a scale factor, essentially enabling pyramid feature extraction in parallel. The key technique used in the model can be summarized as feature channel scaling, as illustrated in Fig. 4(b), the goal of which is to compute finely sampled feature pyramids at a fraction of the cost by exploiting the fractal statistics of images. Furthermore, the features are computed at octave-spaced scale intervals in order to sufficiently approximate features on a finely-sampled pyramid. These benefits allowed the model to be successfully applied to pedestrian detection at over 30 fps on an 8-core Intel Core i7-870 PC. Similarly, it has also been proven effective for aircraft detection in remote sensing images. There is, however, an important assumption in the model: the feature channels are supposed to be low-level and shift-invariant in order to fit the sliding-window operation, which makes the fast detection framework sensitive to angle variation or rotation-induced deformations. For this reason, Yang et al. attempted to relax this constraint by learning varied face properties from multi-view images.
The expensive cost of collecting multi-view remote sensing images still hinders Yang's algorithm from generalizing well. Conveniently, both color channel features and FourierHOG are compatible with the fast pyramid generative model, so their joint use (our SFCF) naturally is as well.
II-C Boosting Decision Tree
In the field of machine learning, boosting methods have been widely used with great success for decades in various applications, e.g., object detection [12, 44, 45], face detection, and pose detection [46, 47]. Unlike other powerful classifiers (e.g., rotation-based SVM, structured SVM, rotation forest), boosting-based classifiers iteratively select weak learners from a pool of candidate weak classifiers to deal with the hard examples from the previous round, which can be treated as an enhanced model that integrates former results and greedily minimizes an exponential loss function. Each round reweights the samples, so that later weak learners focus more on examples misclassified by earlier ones. In this way, a strong classifier can be learned with higher generalization ability and parameter adaptiveness.
The performance of boosting-based classifiers mainly relies on the discriminative ability of the features and the number of weak classifiers. In the next section, we will introduce the proposed unified framework (ORSIm detector) in terms of semantically meaningful feature extraction, feature stacking and learning, as well as parameter selection for the boosting classifier.
III ORSIm Detector
The proposed ORSIm detector starts with a feature extractor. At this stage, spatial-frequency channel features are jointly extracted, including color and gradient magnitude channels from the spatial domain and rotation-invariant features from the frequency domain. The features can be further refined by subspace learning or ACF, and then fed into a boosting decision tree for better training and detection. Algorithm 1 details the main procedures of the ORSIm detector.
III-A Spatial-Frequency Channel Features (SFCF)
Commonly, features are represented in only a single domain; this motivates the joint extraction of more discriminative features from both the spatial and frequency domains to enrich the feature diversity.
Given an RGB remote sensing image as the input, we denote the SFCF as mainly including the RGB channels, the first-order gradient magnitude (GM) channel, and the rotation-invariant (RI) channels, defined as
where the index stands for the different feature sets.
1) Pixel-wise Spatial Channel Feature: In many tasks related to remote sensing, a color channel, i.e., RGB, shows a strong ability to identify certain materials sensitive to color (e.g., tree, grass, soil, etc.), which can be denoted as
where represents the channel features. Moreover, the normalized GM of the RGB image can be regarded as another important spatial channel feature, since it not only sharpens object edges but also highlights small variations that could be visually overlooked in the smooth areas of the image; this has shown its effectiveness in detecting aerial or spaceborne objects. The resulting expression is
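As a rough illustration of the gradient magnitude channel (our own sketch, using simple central differences; the paper's exact normalization may differ), one can take, per pixel, the magnitude of the color channel with the strongest gradient response:

```python
# Illustrative sketch: per-pixel gradient magnitude channel for an RGB image.
# Assumptions (not from the paper): central-difference gradients via
# np.gradient and a per-pixel max over the three color channels.
import numpy as np

def gradient_magnitude_channel(img):
    """img: H x W x 3 float array -> H x W gradient magnitude channel."""
    gy, gx = np.gradient(img, axis=(0, 1))   # per-channel spatial gradients
    mag = np.sqrt(gx ** 2 + gy ** 2)         # H x W x 3 magnitudes
    return mag.max(axis=2)                   # strongest channel response

img = np.zeros((8, 8, 3))
img[:, 4:, 0] = 1.0                          # vertical edge in the red channel
gm = gradient_magnitude_channel(img)
print(gm.shape)                              # (8, 8)
```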
2) Pixel-wise Frequency Channel Feature: The objects in remote sensing images, more often than not, suffer from various complex deformations. It should be noted that object rotation is one of the major factors that sharply degrades detection performance. Compared to extracting features in Cartesian coordinates, rotation invariance has been shown to be more effectively analyzed in polar coordinates, where the feature can be separated into angular information and a radial basis, respectively. Let r and φ be the magnitude and the phase of the complex number g = g_x + i·g_y, where g_x and g_y are the horizontal and vertical gradients of a pixel in Cartesian coordinates, respectively. Coincidentally, the Fourier basis e^{imφ} is an optimal choice for modeling the angular part (φ), as theoretically proven in the literature, where m stands for the Fourier order. These basis functions form harmonics on a circle, called circular harmonics. The rotation behavior in the Fourier domain can be modeled by a multiplication or convolution operator. More specifically, given two m-th order Fourier representations in polar coordinates, we have
where the operator denotes a coordinate transform with a relative rotation.
Given any pixel x, its m-th order Fourier representation F_m(x) can be further deduced by
where δ is the distribution function of the current pixel, which can be modeled by an impulse function with unit integral.
When Eq. (5) is rotated by an angle α, according to the rotation behavior, we have
In order to make the feature rotation-invariant, i.e., independent of α, we can construct a set of filters (convolution kernels) with the same rotation behavior. Using Eqs. (4-6), this can be formulated as
As long as the filter orders satisfy the matching condition, we can get
so that the resulting convolutional features can be seen as the final rotation-invariant representation.
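The rotation behavior above can be checked numerically. The following minimal sketch (our own illustration, not the paper's code) computes an m-th order circular-harmonic coefficient of an angular signal and verifies that rotating the signal by α only multiplies the coefficient by e^{-imα}, leaving its magnitude unchanged:

```python
# Numerical check of circular-harmonic rotation behavior: for a signal f(phi)
# on the circle, the m-th Fourier coefficient picks up a phase e^{-i*m*alpha}
# under rotation by alpha, so |c_m| is a rotation-invariant quantity.
import numpy as np

def harmonic_coeff(f, m, phis):
    """m-th order circular Fourier coefficient of samples f at angles phis."""
    return np.mean(f * np.exp(-1j * m * phis))

phis = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
f = np.cos(2 * phis) + 0.5 * np.sin(3 * phis)            # toy angular profile
alpha = 0.7                                              # rotation angle
f_rot = np.cos(2 * (phis - alpha)) + 0.5 * np.sin(3 * (phis - alpha))

c = harmonic_coeff(f, 2, phis)
c_rot = harmonic_coeff(f_rot, 2, phis)

print(abs(c), abs(c_rot))                 # equal magnitudes
print(np.angle(c_rot) - np.angle(c))      # phase shift of -2*alpha
```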
Inspired by the above theoretical derivation of rotation invariance, we construct the rotation-invariant features from the following three parts.
Using the Fourier transformation on the input remote sensing images, the magnitude channel image of the m-th Fourier order is naturally a kind of invariant feature, denoted in a pixel-wise form.
To make the representation absolutely rotation-invariant, we remove the rotation information from the phase by using Eqs. (7) and (8). That is, we generate a series of Fourier bases with equal and opposite orders and apply them to the Fourier representations by a multiplication or convolution operation.
We also consider a relative rotation-invariant feature representation by effectively utilizing the relative phase information. Accordingly, this can be developed as a special rotation-invariant feature by coupling the convolutional features of two neighbouring kernel radii, where the coupled terms correspond to different convolutional kernels (please refer to the original reference for more details).
Therefore, the pixel-based frequency channel feature can be written as
and thus we obtain the image-level representation by collecting all pixel-based features:
3) Region-based Channel Feature Representation: Due to the low spatial resolution of remote sensing imagery, the detection performance is largely limited by pixel-wise features. To better capture semantically contextual information, we group pixel-wise channel features into region-based ones using kernel functions of different sizes. As visualized in Fig. 5, we use triangular convolution kernels, including an isotropic triangle kernel and a local normalization kernel, to extract region-based channel features in both the spatial and frequency domains. Besides that, we additionally design a set of Fourier-based convolution kernels to construct the region-based rotation-invariant descriptors in the frequency domain (please refer to the reference for the specific parameter settings of the convolution kernels). Therefore, the resulting final SFCF is
where the added term denotes the region-based features obtained with the k-th convolution kernel.
III-B Feature Learning or Refinement
To effectively bridge the feature gap between the two different domains while improving robustness and representative ability, we can learn or refine the feature cube (see Fig. 6) along the spatial and channel directions using the following two strategies.
Module 1: Subspace-based learning (e.g., Principal Component Analysis (PCA)). The extracted SFCF features can be further transformed to reduce the computational and storage cost as well as to improve the feature representation ability to some extent.
Module 2: Aggregation-based pooling. The SFCF can also be refined by a pooling-like operation (ACF) to dynamically adjust support regions of different sizes while maintaining structural consistency with the overall image. Subsequently, the two-dimensional ACF is stretched into a one-dimensional feature vector, making it fit better into ensemble classifier learning. Inspired by this structurally encoding pattern, we select the ACF during the feature refinement process.
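A minimal sketch of Module 2, assuming non-overlapping block sums as the pooling-like operation (the block size of 4 is our illustrative choice, not a value from the paper):

```python
# Hedged sketch of aggregation-based pooling in the ACF style: each channel is
# divided into non-overlapping blocks, pixels in every block are summed, and
# the pooled maps are flattened into one feature vector.
import numpy as np

def aggregate_channels(channels, block=4):
    """channels: H x W x C array -> flattened vector of block-summed features."""
    H, W, C = channels.shape
    h, w = H // block, W // block
    pooled = (channels[:h * block, :w * block, :]
              .reshape(h, block, w, block, C)
              .sum(axis=(1, 3)))             # sum within each block
    return pooled.ravel()                    # 1-D vector for the classifier

chans = np.ones((32, 32, 10))                # toy SFCF cube
vec = aggregate_channels(chans)
print(vec.shape)                             # (640,) = 8 * 8 * 10
```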
III-C Training Phase with Ensemble Classifier Learning
Up to the present, boosting remains one of the most popular learning techniques; it integrates a large number of weak learners to generate a stronger one. Boosting-based methods (e.g., AdaBoost) are built on the fact that the selected weak classifiers should minimize the training error while keeping or reducing the test error. For this reason, we apply a soft-cascade boosting structure with depth-3 decision trees, which is capable of discriminating intra- and inter-class samples more effectively while simultaneously performing feature selection. Significantly, this learning strategy is robust against background interference in object detection, especially in the more complex scenes of remote sensing imagery.
III-D Test Phase with Feature Channel Scaling
Sliding windows are a commonly used detection technique in the testing phase, applied after extracting a finely-sampled image pyramid. However, this implies a heavy computational cost, which limits real-world applicability. The fast image pyramid model introduced in Section II-B is implemented in our framework by automatically estimating the scaling factor of the feature channels, which is expressed as
Ω(I_s) ≈ R(Ω(I), s) · s^{−λ_Ω},    (12)
where I is the input image, I_s = R(I, s) is a re-sampled version of I by scale s, Ω(·) denotes the channel computation, and λ_Ω is the channel-specific scaling factor to be estimated. The corresponding channel image at a scale s can thus be approximated by Eq. (12). The different channels can be computed with a linear or non-linear transformation of the original image in the spatial and frequency domains. Using Eq. (12), we can quickly obtain the channel features of all pyramid images using the λ_Ω calculated in the training phase.
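A toy numerical sketch of this channel scaling law (our own illustration; a horizontal sinusoid is used so the gradient channel follows a clean power law, whereas for natural images λ is typically a small positive constant estimated over many images):

```python
# Hedged sketch of power-law channel scaling: estimate lambda once from the
# channel means at scales 1 and 2, then predict the channel mean at scale 4
# without recomputing it. The toy image and naive subsampling are assumptions
# for illustration, not the paper's setup.
import numpy as np

def channel_mean(img):
    """Toy 'channel': gradient magnitude averaged over the image."""
    gy, gx = np.gradient(img)
    return np.sqrt(gx ** 2 + gy ** 2).mean()

def downsample(img, s):
    """Naive subsampling by integer factor s (stand-in for proper resampling)."""
    return img[::s, ::s]

x = np.arange(256)
img = np.tile(np.sin(2 * np.pi * x / 64), (256, 1))  # horizontal sinusoid

# mean(C(I_s)) ≈ mean(C(I)) * s^{-lambda}  =>  lambda = -log2(ratio at s=2)
ratio = channel_mean(downsample(img, 2)) / channel_mean(img)
lam = -np.log2(ratio)

# Predict the channel mean at scale 4 from scale 1 instead of recomputing.
predicted = channel_mean(img) * 4 ** (-lam)
actual = channel_mean(downsample(img, 4))
print(lam, predicted, actual)
```

For this particular toy image λ comes out negative (subsampling sharpens per-pixel gradients); the mechanism of estimate-once, reuse-everywhere is the point.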
IV Experimental Results and Analysis
IV-A Optical Remote Sensing Datasets
In this section, two well-known public optical remote sensing datasets, the car targets in the satellite dataset (http://ai.stanford.edu/~gaheitz/Research/TAS/tas.v0.tgz) and the airplane targets in the NWPU VHR-airplane dataset (http://www.ifp.uni-stuttgart.de/dgpf/DKEPAllg.html), are used to quantitatively evaluate the performance of the proposed method. In our work, 60% of the samples are assigned to the training set and the rest to the testing set for both datasets. The main focus of this paper is to create a more robust and discriminative feature representation that ensures rotation and translation invariance. Generally, it is very expensive and time-consuming to collect a large number of training samples, particularly when labeling remote sensing data. Therefore, it is very meaningful yet challenging to assess the generalization performance of the classifier with a limited training set. To stably evaluate the performance of the proposed method, we conduct 5-fold cross-validation and report the average result across folds below.
1) Satellite dataset: This dataset was acquired from Google Earth. The low resolution and the varying illumination conditions caused by building shadows make this dataset very challenging. In detail, it contains manually labeled cars from 30 images. At the training stage, all car windows are rescaled to a fixed size close to the average car window. Their mirror images are also used to double the number of positive images as data augmentation in all experiments, which helps avoid over-fitting and improves generalization ability. Meanwhile, the negative images are cropped at random positions from natural images without any car objects.
2) NWPU VHR-10 dataset: This dataset consists of 10 different object detection subsets acquired from Google Earth (spatial resolution 0.5 m-2 m) and the Vaihingen dataset (spatial resolution 0.08 m), the latter provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) (please refer to [24, 25] for more details). To meet our experimental assumption that we mainly aim at detecting objects with strong rotation behavior, airplanes are a proper research object and are selected as our second experimental dataset to effectively evaluate our method. More specifically, the positive image set without any outliers is composed of 650 airplane images, each of which includes at least one target. The negative image set consists of 150 images without any class-relevant targets. The original maximal and minimal window sizes are fixed in pixels. Additionally, the number of positive images in the training set is doubled by mirroring, while the negative images are randomly selected from the 100 images without any airplanes.
IV-B Experimental Setup
All the experiments in this paper were implemented in Matlab 2016 on a Windows 7 operating system and conducted on an Intel Xeon 2.6 GHz PC (CPU) with 128 GB memory. Moreover, there are several important modules in the proposed ORSIm framework, such as SFCF extraction, sampling window, smoothing, feature pyramid, and classifier setting. We detail them step by step in the following.
SFCF extraction: The channel features used in our case mainly consist of two parts: spatial channel features and frequency channel features. The former involves color channels and the corresponding gradient magnitude channels, while the latter comprises the rotation-invariant feature channels. More specifically, RGB (red, green, blue), LUV (luminance, chromaticity coordinates), and HSV (hue, saturation, value) are selected as the candidate color spaces. The gradient magnitude channel is set as the magnitude of the channel with the maximal gradient amplitude response. There are three parts in the rotation-invariant channels: the true rotation-invariant features (equal Fourier orders), the magnitude features, and the coupling features across different radii (please refer to the reference for more details). During this process, two parameters need to be considered, namely the radii of the convolutional kernels and the number of Fourier orders. We assign five scales with six different half-widths to the kernel radii, while the number of Fourier orders is set as suggested in the reference.
Sampling window: Since objects in a scene have different resolutions, it is necessary to upsample or downsample objects (e.g., vehicles, airplanes) to a consistent size. Therefore, we attempt to search for an optimal length-width ratio in a proper range by resizing the cars on the satellite dataset and the airplanes on the NWPU VHR-airplane dataset to several candidate window sizes.
Smoothing: The smoothing operation has been proven effective in improving the representation ability of the features [31, 28]. Accordingly, we perform smoothing with a binomial filter before feature computation (pre-smoothing) and after feature learning or refinement (post-smoothing). The filter radius is set to 1 in our setting.
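For concreteness, a radius-1 binomial filter corresponds to the separable kernel [1, 2, 1]/4; a minimal sketch (our own illustration, with edge padding assumed) applies it along each image axis:

```python
# Radius-1 binomial smoothing, applied separably (horizontal then vertical).
# Edge padding is an assumption; the exact boundary handling may differ.
import numpy as np

def binomial_smooth(channel):
    """Apply the [1, 2, 1] / 4 binomial filter along both axes of a 2-D array."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(channel, 1, mode='edge')
    # Horizontal pass over the padded array, then vertical pass.
    horiz = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * horiz[:-2, :] + k[1] * horiz[1:-1, :] + k[2] * horiz[2:, :]

ch = np.zeros((5, 5))
ch[2, 2] = 16.0                    # unit impulse scaled to 16 for exact values
sm = binomial_smooth(ch)
print(sm[2, 2], sm[2, 1])          # prints 4.0 2.0 (mass spreads to neighbors)
```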
Feature pyramid: The fast feature pyramid is applied in the proposed ORSIm framework by coarsely sampling feature channels in order to speed up hard negative mining and the test phase without additional loss of detection precision. We sample the objects at four different scales (s = 1, 2, 4, 8). The smallest pyramid image is determined by the size of the sampling window, and the largest one has the same size as the original image.
Classifier setting: AdaBoost, a boosting-based ensemble classifier learning method, is used to train the classifier. To train a stronger learner, we use weighted majority voting to generate the boosting decision tree by combining the hypotheses obtained from the diversified weak learners. To avoid over-fitting, we gradually increase the number of weak learners from 32 to 2048. It is worth noting that the negative samples used in the training and testing phases are selected using a sliding window and a coarsely sampled image pyramid instead of the point-based operators presented in previous work.
Evaluation criteria: Four criteria, the Precision-Recall (PR) curve, Average Precision (AP), Average Recall (AR), and Average F1-score (AF), are adopted to quantitatively evaluate the detection performance. More precisely, when the overlap ratio between the detection bounding box and the ground-truth box exceeds the threshold, the detection is counted as a true positive (TP); otherwise it is a false positive (FP), and unmatched ground-truth boxes are counted as false negatives (FN). Therefore, the final Precision (P) is computed by P = TP / (TP + FP), and the Recall (R) is R = TP / (TP + FN), while the F1-score is computed by F1 = 2PR / (P + R). AP is used as a global indicator to assess the performance of the algorithm.
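These criteria can be sketched as follows (our own illustration, with an IoU threshold of 0.5 assumed and a simple greedy matching of detections to ground truth):

```python
# Illustrative computation of Precision, Recall, and F1 from box overlaps.
# The 0.5 IoU threshold and greedy matching are assumptions for this sketch.

def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall_f1(detections, ground_truth, thr=0.5):
    matched, tp = set(), 0
    for det in detections:                   # greedily match each detection
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(det, gt) >= thr:
                matched.add(i)
                tp += 1
                break
    fp = len(detections) - tp                # unmatched detections
    fn = len(ground_truth) - tp              # unmatched ground-truth boxes
    p = tp / (tp + fp) if detections else 0.0
    r = tp / (tp + fn) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

dets = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gts = [(1, 1, 11, 11), (20, 20, 30, 30)]
print(precision_recall_f1(dets, gts))
```

Here two detections match (IoU above 0.5) and one is a false positive, so P = 2/3, R = 1, and F1 = 0.8.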
IV-C Experimental Results
Table I (fragment):
Method | Image pyramid | Satellite (%) | NWPU VHR-Airplane (%)
ORSIm Detector | fast pyramid | 91.26 / 93.01 / 94.83 / 4.94 | 91.12 / 93.21 / 95.39 / 4.72
Table II (fragment):
Dataset | Method | HOG | ACF [30, 28] | FourierHOG | our SFCF
NWPU VHR-Airplane | Linear SVM | 74.38 | 76.67 | 80.01 | 85.12
IV-C1 Discussion on classifier selection
As listed in Table II, the methods based on support vector machines (SVMs) or the random forest (RF) classifier also achieve good performances. This motivates us to investigate classifier selection. To this end, three different classifiers (i.e., linear SVM, RF, and AdaBoost, also known as AdaBoost-DTree, which is used in our framework) are used to evaluate the detection performance under four different feature descriptors, namely HOG, ACF, FourierHOG, and our SFCF, as detailed in Table II. For a fair comparison, the parameters used in the three classifiers are optimally tuned by cross-validation on the training set. Overall, the linear SVM yields relatively poor performance compared to RF, since RF is more robust than a linear SVM to some extent, especially when the training samples are limited. Furthermore, AdaBoost performs better than the other two classifiers. Two possible factors could explain these results. On one hand, AdaBoost is a boosting-based ensemble classifier learning method that can generate a more robust strong classifier by weighting a large number of weak classifiers; consequently, it outperforms the linear SVM in recognition and classification. On the other hand, although RF and AdaBoost are both ensemble strategies, RF puts equal weight on each sub-classifier, whereas AdaBoost adaptively weights each weak classifier by iteratively updating the sample weights. This makes the final classifier generated by AdaBoost more suitable for the current dataset, thereby yielding better performance.
IV-C2 Overview of performance comparison
To quantitatively assess the detection performance of the proposed method, we compare it with several state-of-the-art methods related to our framework, such as Exemplar-SVMs, rotation-aware features, COPD-based detection, BOW-SVM, fast feature pyramids, You Only Look Once (YOLO2) (for which data augmentation by rotation and translation of the training samples is performed), and FourierHOG (for which we select positive and negative samples by sliding windows rather than points for a fair comparison). Fig. 7 shows the PR curves of the different algorithms on the two datasets, and Table I correspondingly lists the quantitative results in terms of average precision and mean running time. Accordingly, we can make the following observations. The Exemplar-SVMs and rotation-aware methods have similar performances, as both use standard HOG features and discrete grid sampling. Not surprisingly, BOW-SVM and ACF yield the worst performances because they ignore the spatial contextual relationships among the local features and are limited in their rotation-related representation ability. Although its detection performance might be improved by modeling a deeper network and embedding anchor boxes, YOLO2 is not robust to tiny objects or to pairs of objects separated by only a tiny distance. FourierHOG achieves a slightly lower performance than ours but much better than the others on the two datasets, which indicates that the point-based feature representation is insensitive to resolution. As expected, the proposed ORSIm detector largely outperforms the other investigated methods on both datasets, which shows its effectiveness and superiority. This is also demonstrated in Table II, where the precision of the ORSIm detector is dramatically higher than that of the others owing to the well-designed SFCF and the use of AdaBoost. It is worth noting in Table I that the methods with the fast feature pyramid allow faster detection than those without it.
Despite its lower speed (relative to ACF and YOLO2, whose code runs on TensorFlow with a GPU and is available at https://github.com/simo23/tinyYOLOv2), the proposed ORSIm detector achieves the highest detection precision.
Visually, a few roofs are wrongly identified as cars, and there are also some missed detections of transport cars, as shown in the first row of Fig. 8. This might result from the limited number of training samples and the unbalanced class distribution. In addition, a weak visible edge might mislead the classifier, since the transport cars are white. Compared to car detection in a complex urban scene, false detections of airplanes also occur when background and target have similar shape and color, e.g. around the tail of the airplane (see Fig. 9(a)). This issue can be well fixed by a two-step non-maximum suppression (NMS) algorithm. The improved results can be found in Fig. 9(b).
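The exact two-step NMS procedure is not spelled out in this section, but a plausible sketch is: a standard greedy IoU-based suppression, followed by a containment test that removes boxes lying almost entirely inside a surviving higher-scoring box (such as an airplane tail fired on inside the full-airplane box). All thresholds below are illustrative assumptions:

```python
import numpy as np

def _inter(a, b):
    """Intersection area of two [x1, y1, x2, y2] boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def _area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def two_step_nms(boxes, scores, iou_thr=0.5, contain_thr=0.9):
    """Step 1: greedy IoU-based NMS.  Step 2: drop boxes that are
    almost entirely contained in a surviving higher-scoring box."""
    order = [int(i) for i in np.argsort(-np.asarray(scores))]
    keep = []
    for i in order:                      # step 1: standard NMS
        if all(_inter(boxes[i], boxes[j])
               / (_area(boxes[i]) + _area(boxes[j]) - _inter(boxes[i], boxes[j]))
               < iou_thr for j in keep):
            keep.append(i)
    final = []
    for i in keep:                       # step 2: containment test
        if all(_inter(boxes[i], boxes[j]) / _area(boxes[i]) < contain_thr
               for j in final):
            final.append(i)
    return final

boxes = [(0, 0, 10, 10),    # full airplane, high score
         (6, 6, 9, 9),      # "tail" box inside it, low score
         (20, 20, 30, 30)]  # separate object
print(two_step_nms(boxes, [0.95, 0.6, 0.9]))  # -> [0, 2]
```

The tail box survives plain IoU-based NMS (its IoU with the full-airplane box is only 0.09) but is removed by the second, containment-based pass.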
IV-D Sensitivity Analysis
We experimentally analyze and discuss the potential influence of the different configurations of the proposed ORSIm detector, so that it can generalize well to more datasets. The optimal combination is finally determined by 5-fold cross-validation on the training set.
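The 5-fold model selection can be sketched generically as follows; `train_eval` stands in for training and scoring the detector under one parameter combination (the function names, parameter grid, and scoring are all hypothetical placeholders, not the paper's actual code):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Split n sample indices into k shuffled folds."""
    idx = np.random.RandomState(seed).permutation(n)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_eval, X, y, grid, k=5):
    """Return the parameter combination with the best mean score
    over k folds.  `train_eval(params, X_tr, y_tr, X_va, y_va)`
    must return a scalar score (e.g. average precision)."""
    folds = kfold_indices(len(y), k)
    best, best_score = None, -np.inf
    for params in grid:
        scores = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_eval(params, X[tr], y[tr], X[va], y[va]))
        if np.mean(scores) > best_score:
            best, best_score = params, float(np.mean(scores))
    return best, best_score

# toy usage: a synthetic score that peaks at C = 3
grid = [{"C": c} for c in (1, 3, 10)]
best, score = cross_validate(lambda p, *_: -abs(p["C"] - 3),
                             np.zeros((20, 2)), np.zeros(20), grid)
```

In practice the grid would enumerate the choices discussed below (color space, Fourier orders, smoothing radii, pyramid factor, number of weak classifiers).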
IV-D1 Towards Parameter Setting
Figs. 10 and 11 show the performance comparison of the different parameter settings on the two datasets. More specifically, the LUV color space performs better than the other two on both datasets, and the gap becomes even more obvious when combining the color channels with the gradient magnitude channel. Interestingly, a similar trend appears after adding the rotation-invariant feature channels, as shown in Figs. 10 and 11 (c).
We also investigate the effects of the radial profiles (the size of the convolution kernel) and the Fourier orders (), as well as the size of the sampling windows. As observed from Figs. 10 and 11 (d-e), they are relatively insensitive within a proper range; as a result, we select them as for both datasets, and , for the satellite dataset (, for the NWPU VHR-airplane dataset). Following the same strategy as traditional detection frameworks, pre-smoothing and post-smoothing are carried out before and after running the detection algorithm in order to make the features locally and globally smooth. Different filters are selected for smoothing, and the experimental results are given in Figs. 10 and 11 (g-h). We simply set the radius for both pre-smoothing and post-smoothing to 1, as the results are relatively insensitive to the choice of radius. In the test phase, the pyramid factor plays an important role, as displayed in Figs. 10 and 11 (i). Eight scales per octave give the best result, which is basically consistent with . Notably, the final detection precision increases with the number of weak classifiers, but so does the computational cost. As a trade-off, this value is set to 2048 in our case.
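For intuition about the pyramid factor, the scale factors of an image pyramid with a given number of scales per octave (a factor-of-two size reduction) can be enumerated as in the following sketch; the function and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def pyramid_scales(min_size, img_size, per_octave=8):
    """Scale factors of an image pyramid with `per_octave` scales per
    octave, stopping once the image would fall below `min_size`."""
    n_octaves = np.log2(img_size / min_size)
    n_scales = int(np.floor(n_octaves * per_octave)) + 1
    return [2.0 ** (-i / per_octave) for i in range(n_scales)]

# e.g. a 512-pixel image down to a 64-pixel model window spans
# 3 octaves, so 3 * 8 + 1 = 25 scale factors with 8 scales/octave
scales = pyramid_scales(min_size=64, img_size=512, per_octave=8)
```

The fast feature pyramid idea referenced in Table I avoids computing features at all 25 scales: features are computed only at the octave scales, and the intermediate scales are approximated via a power-law extrapolation.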
IV-D2 Towards Spatial Resolution
The image resolution is another important factor that can degrade detection performance; therefore, we specifically evaluate the effects of different resolutions to find a proper boundary condition for the use of the proposed ORSIm detector. In detail, we apply different sampling rates to the two datasets to investigate the sensitivity of the detection precision. As can be seen from Fig. 12, performance begins to degenerate at around a 0.5 sampling rate and gradually decreases after that. It should be noted that the feature pyramid is usually an indispensable step in the test phase. Therefore, these detection approaches are, in fact, not very sensitive to different spatial resolutions, although lower resolutions inevitably suffer from information loss. Furthermore, Fig. 8 shows a visual example clarifying that objects at different scales can largely be detected, demonstrating the effectiveness of the ORSIm detector on multi-resolution images. That is not to say, however, that the proposed detector can handle all kinds of variations. To illustrate this, we highlight a scene with some failure cases in Fig. 9, where the detector confuses the real airplanes and their tails with a small shadow, leading to some extra false alarms marked in red. This is actually a comparatively common phenomenon in object detection, rather than being due to the model's sensitivity to the spatial resolution of the input image. A feasible solution for this issue is to use a two-step NMS, as illustrated in Fig. 9.
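The sampling-rate sweep can be illustrated with a minimal nearest-neighbour resampling helper (a hypothetical stand-in for proper image resampling, shown here only to make the experimental protocol concrete):

```python
import numpy as np

def downsample(img, rate):
    """Nearest-neighbour resampling of a 2-D image by `rate` (0 < rate <= 1)."""
    h, w = img.shape
    rows = (np.arange(int(h * rate)) / rate).astype(int)
    cols = (np.arange(int(w * rate)) / rate).astype(int)
    return img[np.ix_(rows, cols)]

# sweep sampling rates as in the sensitivity experiment: each rate
# would be followed by re-running detection and recording the AP
img = np.arange(100.0).reshape(10, 10)
shapes = {r: downsample(img, r).shape for r in (1.0, 0.7, 0.5, 0.3)}
```

Plotting the recorded AP against the sampling rate yields curves like those in Fig. 12, where degradation sets in near a rate of 0.5.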
Object rotation is a common but challenging issue for object detection and recognition in optical remote sensing. To this end, we propose a more complete object detection framework for optical remote sensing imagery, called the ORSIm detector, by introducing discriminative rotation-invariant channel features (spatial-frequency channel features), learning-based feature refining, a fast feature channel scaling technique, and boosting-based classifier learning. Extensive experimental results indicate that the ORSIm detector performs better and is more robust to various deformations than previous state-of-the-art methods. In future work, we will focus on tiny object detection and extend the proposed framework to an end-to-end learning framework (e.g. deep learning). Additionally, we will expand the binary classification to multi-target detection.
The authors would like to thank the Key Laboratory of Information Fusion Technology, Ministry of Education, at Northwestern Polytechnical University and the Electrical Engineering Department at Stanford University for providing the NWPU VHR-airplane dataset and the satellite dataset. The authors would also like to express their appreciation to Prof. Piotr Dollár and Dr. K. Liu for providing the MATLAB codes for the fast feature pyramid and FourierHOG algorithms.
-  D. Hong, N. Yokoya, and X. Zhu, “Learning a robust local manifold representation for hyperspectral dimensionality reduction,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 6, pp. 2960–2975, Jul 2017.
-  D. Hong, N. Yokoya, J. Xu, and X. Zhu, “Joint & progressive learning from high-dimensional data for multi-label classification,” in Proc. IEEE. Conf. European Conference on Computer Vision (ECCV), Sep 2018, pp. 478–493.
-  Z. Hu, Q. Li, Q. Zou, Q. Zhang, and G. Wu, “A bilevel scale-sets model for hierarchical representation of large remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7366–7377, Aug 2016.
-  Z. Hu, Q. Li, Q. Zhang, Q. Zou, and Z. Wu, “Unsupervised simplification of image hierarchies via evolution analysis in scale-sets framework,” IEEE Trans. Image Process., vol. 26, no. 5, pp. 2394–2407, Mar 2017.
-  S. Henrot, J. Chanussot, and C. Jutten, “Dynamical spectral unmixing of multitemporal hyperspectral images,” IEEE Trans. Image Process., vol. 25, no. 7, pp. 3219–3232, Jul 2016.
-  D. Hong, N. Yokoya, J. Chanussot, and X. Zhu, “Learning a low-coherence dictionary to address spectral variability for hyperspectral unmixing,” in Proc. IEEE. Conf. International Conference on Image Processing (ICIP), Feb 2017, pp. 235–239.
-  D. Hong and X. Zhu, “SULoRA: Subspace unmixing with low-rank attribute embedding for hyperspectral data analysis,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 6, pp. 1351–1363, 2018.
-  D. Hong, N. Yokoya, J. Chanussot, and X. Zhu, “An augmented linear mixing model to address spectral variability for hyperspectral unmixing,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1923–1938, 2019.
-  M. A. Veganzones, M. Simoes, G. Licciardi, N. Yokoya, J. Bioucas-Dias, and J. Chanussot, “Hyperspectral super-resolution of locally low rank images from complementary multisource data,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 274–288, Oct 2016.
-  D. Hong, N. Yokoya, J. Chanussot, and X. Zhu, “Cospace: Common subspace learning from hyperspectral-multispectral correspondences,” arXiv preprint arXiv:1812.11501, 2018.
-  D. Hong, N. Yokoya, N. Ge, J. Chanussot, and X. Zhu, “Learnable manifold alignment (LeMA): A semi-supervised cross-modality learning framework for land cover and land use classification,” ISPRS J. Photogramm. Remote Sens., vol. 147, pp. 193–205, 2019.
-  N. Yokoya and A. Iwasaki, “Object detection based on sparse representation and hough voting for optical remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2053–2062, May 2015.
-  G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul 2016.
-  G. Tochon, J. Chanussot, M. Dalla Mura, and A. Bertozzi, “Object tracking by hierarchical decomposition of hyperspectral video sequences: Application to chemical gas plume tracking,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 8, pp. 4567–4585, May 2017.
-  D. Hong, Z. Pan, and X. Wu, “Improved differential box counting with multi-scale and multi-direction: A new palmprint recognition method,” OPTIK, vol. 125, no. 15, pp. 4154–4160, 2014.
-  M. Zhang, W. Li, and Q. Du, “Diverse region-based cnn for hyperspectral image classification,” IEEE Trans. Image Process., vol. 27, no. 6, pp. 2623–2634, Jun 2018.
-  X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, and B. Zhang, “Multisource remote sensing data classification based on convolutional neural network,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 937–949, Feb 2018.
-  D. Hong, W. Liu, X. Wu, Z. Pan, and J. Su, “Robust palmprint recognition based on the fast variation vese–osher model,” Neurocomputing, vol. 174, pp. 999–1012, 2016.
-  F. Maussang, J. Chanussot, A. Hétet, and M. Amate, “Higher-order statistics for the detection of small objects in a noisy background application on sonar imaging,” EUR. J. Appl. Sig. Process., vol. 1, no. 3, pp. 1–17, Feb 2007.
-  H. Grabner, T. Nguyen, B. Gruber, and H. Bischof, “On-line boosting based car detection from aerial images,” ISPRS J. Photogramm. Remote Sens., vol. 63, no. 3, pp. 382–396, May 2008.
-  T. Moranduzzo and F. Melgani, “Automatic car counting method for unmanned aerial vehicle images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 3, pp. 1635–1647, May 2014.
-  C. Corbane, L. Najman, E. Pecoul, L. Demagistri, and M. Petit, “A complete processing chain for ship detection using optical satellite imagery,” Int. J. Remote Sens., vol. 31, no. 22, pp. 5837–5854, Jul 2010.
-  C. Zhu, H. Zhou, R. Wang, and J. Guo, “A novel hierarchical method of ship detection from spaceborne optical image based on shape and texture features,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 9, pp. 3446–3456, Apr 2010.
-  G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogramm. Remote Sens., vol. 98, no. 1, pp. 119–132, Dec 2014.
-  G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Sep 2016.
-  P. Viola and M. Jones, “Robust real-time face detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
-  K. Liu, H. Skibbe, T. Schmidt, T. Blein, K. Palme, T. Brox, and O. Ronneberger, “Rotation-invariant hog descriptors using fourier analysis in polar and spherical coordinates,” Int. J. Comput. Vis., vol. 106, no. 3, pp. 342–364, Feb 2014.
-  P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Jan 2014.
-  Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug 1997.
-  P. Dollar, Z. Tu, and P. Perona, “Integral channel features,” in Proc. Int. Conf. on British Machine Vision Conference (BMVC), Sep 2009, pp. 1–12.
-  B. Yang, J. Yan, Z. Lei, and S. Li, “Aggregate channel features for multi-view face detection,” in Proc. IEEE. Int. Conf. International Joint Conference on Biometrics (IJCB), Sep 2014, pp. 1–8.
-  S. Wang, B. Yang, Z. Lei, J. Wan, and S. Li, “A convolutional neural network combined with aggregate channel feature for face detection,” in Proc. Conf. International Conference on Wireless, Mobile Multimedia Networks (ICWMMN), Nov 2015, pp. 304–308.
-  R. Brehar, C. Vancea, and S. Nedevschi, “Pedestrian detection in infrared images using aggregated channel features,” in Proc. IEEE. Conf. International Conference on Image Processing (ICIP), Sep 2014, pp. 127–132.
-  S. Tuermer, F. Kurz, P. Reinartz, and U. Stilla, “Airborne vehicle detection in dense urban areas using hog features and disparity maps,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 6, pp. 2327–2337, Feb 2013.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Jul 2005, pp. 886–893.
-  A. Zhao, K. Fu, H. Sun, X. Sun, F. Li, D. Zhang, and H. Wang, “An effective method based on acf for aircraft detection in remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 744–748, May 2017.
-  W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, “Object detection in high-resolution remote sensing images using rotation invariant parts based model,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, May 2014.
-  G. Wang, X. Wang, B. Fan, and C. Pan, “Feature extraction by rotation-invariant matrix representation for object detection in aerial image,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 6, pp. 851–855, Apr 2017.
-  D. Hong, W. Liu, J. Su, Z. Pan, and G. Wang, “A novel hierarchical approach for multispectral palmprint recognition,” Neurocomputing, vol. 151, pp. 511–521, 2015.
-  P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Dec 2001, p. 511.
-  P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” Int. J. Comput. Vis., vol. 63, no. 2, pp. 153–161, Jul 2005.
-  D. Ruderman and W. Bialek, “Statistics of natural images: Scaling in the woods,” Phys. Rev. Lett., vol. 73, no. 6, pp. 814–817, Aug 1994.
-  R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, “Pedestrian detection at 100 frames per second,” in Proc. CVPR. IEEE, Jul 2012, pp. 2903–2910.
-  S. Ram and J. Rodriguez, “Vehicle detection in aerial images using multiscale structure enhancement and symmetry,” in Proc. IEEE. Conf. International Conference on Image Processing (ICIP), Sep 2016, pp. 3817–3821.
-  J. Zhang, C. Tao, Z. Zou, and H. Pan, “A vehicle detection method taking shadow areas into account for high resolution aerial imagery,” in Proc. IEEE. Int. Conf. Geoscience and Remote Sensing Symposium (IGARSS), Nov 2016, pp. 669–672.
-  M. Alb, P. Alotto, G. Capasso, M. Guarnieri, C. Magele, and W. Renhart, “Real-time pose detection for magnetic-assisted medical applications by means of a hybrid deterministic/stochastic optimization method,” IEEE Trans. Magnetics, vol. 52, no. 3, pp. 1–4, Sep 2016.
-  X. Nie, J. Feng, J. Xing, S. Xiao, and S. Yan, “Hierarchical contextual refinement networks for human pose estimation,” IEEE Trans. Image Process., vol. 28, no. 2, pp. 924–936, 2019.
-  J. Xia, J. Chanussot, P. Du, and X. He, “Rotation-based support vector machine ensemble in classification of hyperspectral data with limited training samples,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1519–1531, Oct 2016.
-  A. Vedaldi, M. Blaschko, and A. Zisserman, “Learning equivariant structured output svm regressors,” in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Jan 2011, pp. 959–966.
-  J. Xia, P. Du, X. He, and J. Chanussot, “Hyperspectral remote sensing image classification based on rotation forest,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 239–243, Jan 2014.
-  P. Dollár, S. Belongie, and P. Perona, “The fastest pedestrian detector in the west,” in Proc. Int. Conf. on British Machine Vision Conference (BMVC), Dec 2010, pp. 1–11.
-  G. B. Giannakis, “Signal reconstruction from multiple correlations: frequency-and time-domain approaches,” J. Opt. Soc. Am. A, vol. 6, no. 5, pp. 682–697, Feb 1989.
-  S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec 2000.
-  G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” in Proc. IEEE. Conf. European Conference on Computer Vision (ECCV), Oct 2008, vol. 5302, pp. 30–43.
-  J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” The Annals of Statistics, vol. 38, no. 2, pp. 337–374, Apr 2000.
-  T. Malisiewicz, A. Gupta, and A. Efros, “Ensemble of exemplar-svms for object detection and beyond,” in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Jan 2011, pp. 89–96.
-  U. Schmidt and S. Roth, “Learning rotation-aware features: From invariant priors to equivariant descriptors,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Jul 2012, vol. 157, pp. 2050–2057.
-  S. Xu, T. Fang, D. Li, and S. Wang, “Object classification of aerial images with bag of visual words,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366–370, Apr 2010.
-  J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Jul 2017, pp. 6517–6525.
-  X. Wu, D. Hong, P. Ghamisi, W. Li, and R. Tao, “MsRi-CCF: Multi-scale and rotation-insensitive convolutional channel features for geospatial object detection,” Remote Sens., vol. 10, no. 12, pp. 1990, 2018.