I Introduction
Recently, geospatial object detection (GOD) has received a lot of attention in the remote sensing community. The main difficulty is that objects in optical remote sensing imagery usually suffer from various deformations caused by scaling, offset, and rotation, which inevitably degrade detection performance. Much related work has been proposed over the past decades. These methods can be roughly categorized as template matching-based, knowledge-based, object-based, and machine learning-based methods [1]. Unfortunately, these approaches mostly fail to capture rotation-related properties when only small-scale training samples are available.
Multi-resolution object rotation is a common but challenging problem in GOD, which can be split into two subproblems: rotation-invariant feature extraction and image pyramid generation. In the first phase, the features can be learned from the data [2, 3] or artificially designed [4, 5]. The former learns a robust and discriminative feature representation from an augmented training set generated by manually rotating or shifting samples, and its performance is limited to a great extent by the quantity and diversity of samples. The latter extracts rotation-invariant features in a dense sampling fashion. Although manual feature design (e.g., Histogram of Oriented Gradients (HOG) [6]) has proven effective in constructing rotation-invariant descriptors, its high computational cost hinders efficiency, particularly for large-scale datasets. Moreover, such hand-crafted descriptors yield relatively limited performance, since they are usually constructed in a locally discrete coordinate system. For this reason, Liu et al. [7] mathematically proved the rotation-invariant behavior and proposed the FourierHOG descriptor by converting a discrete coordinate system to a continuous one. They applied the features to a recognition-like detection problem in which each pixel or subpixel is represented as an object or material, so this is actually pixel-wise classification rather than real object detection. In the second phase, the features have to be repeatedly extracted from each layer of the image pyramid, leading to a large computational cost. To address this problem, inspired by the fractal statistics of natural images, Dollár et al. [8] proposed a fast pyramid generative model (FPGM) that only estimates a scale factor, essentially achieving pyramid feature extraction in parallel.
I-A Motivation
Object rotation in GOD is an important factor that degrades detection performance. Most previously proposed methods fail to extract continuous rotation-invariant features, since both manual feature extraction [6] and deep learning-based strategies [9, 10] model the rotation behavior in discrete coordinates, for example by dividing the angles into several discrete bins or by rotating the training samples with different angles for data augmentation.
On the other hand, FPGM has proven effective at achieving very fast pyramid feature extraction without additional performance loss [8]. It should be noted that FPGM requires low-level shift-invariant input; the original reference [8] uses RGB, gray, and gradient channels. However, these features are relatively weakly discriminative and sensitive to object rotation. Therefore, we expect to develop a more discriminative and robust feature descriptor and embed it into FPGM.
Motivated by the aforementioned two points, we seek a mathematically rotation-invariant descriptor (FourierHOG in our case) that is robust to rotations of arbitrary continuous angle. Meanwhile, FourierHOG satisfies the low-level shift-invariance requirement and can therefore be embedded into FPGM to achieve an effective object detection framework.
I-B Contributions
For this purpose, we propose a novel geospatial object detection framework that effectively integrates FourierHOG channel features, aggregate channel features (ACF) [8], FPGM, and boosting learning. To the best of our knowledge, this is the first time that FPGM and boosting learning have been jointly applied in a unified geospatial object detection framework. With FourierHOG and ACF further embedded, we demonstrate the superiority and effectiveness of the proposed FRIFB detector on two subsets (baseball diamonds and airplanes) of the NWPU VHR-10 dataset. More specifically, the main contributions of this letter are as follows.

An efficient geospatial object detection framework is proposed, called Fourier-based rotation-invariant feature boosting (FRIFB), encompassing feature extraction (rotation-invariant FourierHOG), feature refining (ACF), feature pyramid (FPGM), and boosting learning (decision tree ensembles).

The complementary advantages of FPGM and FourierHOG improve object detection in a fast and robust fashion. On the one hand, the robustness of FourierHOG against rotation and shift fits the assumption of FPGM; on the other hand, FPGM provides faster pyramid feature computation.

The proposed FRIFB is applied to geospatial object detection in remote sensing imagery, evaluated qualitatively and quantitatively on two different datasets, and shows competitive performance against previous state-of-the-art algorithms.
II Methodology
II-A Overview
Fig. 1 illustrates the workflow of FRIFB, which consists of five main steps: setting the sliding window, generating rotation-invariant channel features, refining the channel features, training with multiple rounds of bootstrapping, and testing on an image pyramid with octave-spaced scale intervals. Step by step:

We first fix a bounding box for all training samples. Generally, its size is set to the average size of all training samples. Targets at different scales are accordingly upsampled or downsampled to the same size.

Next, the corresponding rotation-invariant channel maps are obtained using the FourierHOG algorithm.

The ACF is subsequently used to structurally refine the extracted rotation-invariant channel features.

The obtained ACF is then fed into the bootstrapping procedure for training.

Finally, FPGM is used to quickly generate pyramid features during testing.
Algorithm 1 details the specific procedures for the FRIFB.
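The first step of the workflow (fixing the window to the average target size and resampling every target to it) can be sketched as follows. This is only an illustration under assumptions: the box format `(x_min, y_min, x_max, y_max)`, the helper names `mean_window_size` and `resize_nearest`, and the use of nearest-neighbour resampling are not taken from the letter.

```python
import numpy as np

def mean_window_size(boxes):
    """Average (height, width) of boxes given as (x_min, y_min, x_max, y_max)."""
    boxes = np.asarray(boxes, dtype=float)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return int(round(heights.mean())), int(round(widths.mean()))

def resize_nearest(patch, out_hw):
    """Nearest-neighbour resize of an H x W (or H x W x C) patch to out_hw."""
    in_h, in_w = patch.shape[:2]
    out_h, out_w = out_hw
    # For each output row/column, pick the source index it falls on.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return patch[rows][:, cols]

# Two hypothetical annotated targets and one cropped target patch.
boxes = [(0, 0, 70, 80), (10, 10, 90, 80)]
window = mean_window_size(boxes)          # shared sliding-window size
patch = np.zeros((50, 100))
resized = resize_nearest(patch, window)   # target brought to the window size
```

In practice a smoother interpolation (bilinear or bicubic) would usually be preferred; nearest-neighbour is used here only to keep the sketch dependency-free.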
II-B Rotation-invariant Feature Generation (FourierHOG)
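Unlike a vector-valued function, a scalar-valued function is invariant to rotation or shift behavior in the sense that only its coordinates transform while its values stay unchanged. Given an image $I:\mathbb{R}^{2}\rightarrow\mathbb{R}$, $\mathbf{x}\in\mathbb{R}^{2}$ denotes the location of a given pixel. The rotation of a scalar-valued function is a coordinate transform [7]:
$$(g_{\theta}I)(\mathbf{x}) = I(\mathbf{R}_{\theta}^{-1}\mathbf{x}), \qquad (1)$$
where $g_{\theta}I$ is the image $I$ rotated by an angle $\theta$, and $\mathbf{R}_{\theta}$ is the corresponding rotation matrix. Generally, a function of samples that carries direction information, e.g., a gradient field or SIFT, is a tensor-valued function $\mathbf{d}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$. Both the coordinates and the tensor values have to rotate, which can be expressed by
$$(g_{\theta}\mathbf{d})(\mathbf{x}) = \mathbf{R}_{\theta}\,\mathbf{d}(\mathbf{R}_{\theta}^{-1}\mathbf{x}). \qquad (2)$$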
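The difference between scalar and vector rotation can be checked numerically. The sketch below is an illustration rather than part of the method: it uses a 90-degree rotation (`np.rot90`) because that is exact on a pixel grid, and finite-difference gradients from `np.gradient`. The gradient magnitude (a scalar field) simply moves with the pixels, whereas the gradient components match only after the values are rotated as well.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))
rotated = np.rot90(image)            # counter-clockwise rotation by 90 degrees

# np.gradient returns derivatives along axis 0 (rows) and axis 1 (columns).
g_row, g_col = np.gradient(image)
r_row, r_col = np.gradient(rotated)

# Scalar field: the gradient magnitude only needs the coordinate transform.
scalar_ok = np.allclose(np.hypot(r_row, r_col), np.rot90(np.hypot(g_row, g_col)))

# Vector field, coordinate transform only: the components do not match.
naive_vector_ok = (np.allclose(r_row, np.rot90(g_row))
                   and np.allclose(r_col, np.rot90(g_col)))

# Vector field, coordinates AND values rotated: (row, col) -> (-col, row).
full_vector_ok = (np.allclose(r_row, -np.rot90(g_col))
                  and np.allclose(r_col, np.rot90(g_row)))
```

Here `scalar_ok` and `full_vector_ok` hold, while `naive_vector_ok` does not, which is the distinction between a pure coordinate transform and a transform that also rotates the field values.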
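Therefore, as long as the vector rotation degenerates into a scalar rotation, rotation-invariant feature maps can be obtained. Rotation invariance is analyzed more effectively in polar coordinates, where the features can be separated into an angular part $\varphi$ and a radial part $r$. The angular information is optimally represented by the Fourier basis $e^{im\varphi}$, where $m$ stands for the rotation order. In [7], the rotation behavior in the Fourier domain is modeled by a multiplication or convolution operator as follows:
$$(g_{\theta}F_{m_1})(g_{\theta}F_{m_2}) = e^{-i(m_1+m_2)\theta}\,(F_{m_1}F_{m_2})\circ\mathbf{R}_{\theta}^{-1},\qquad (g_{\theta}F_{m})*U_{k} = e^{-i(m-k)\theta}\,(F_{m}*U_{k})\circ\mathbf{R}_{\theta}^{-1}, \qquad (3)$$
where $F_{m}$ is a field of order-$m$ Fourier coefficients that transforms as $g_{\theta}F_{m}=e^{-im\theta}\,F_{m}\circ\mathbf{R}_{\theta}^{-1}$, and $U_{k}(r,\varphi)=P(r)\,e^{ik\varphi}$ is a kernel defined by its Fourier representation in polar coordinates with radial profile $P(r)$.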
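A pixel-wise amplitude and phase value, denoted as $\|\mathbf{d}\|$ and $\Phi(\mathbf{d})$, is obtained by computing the gradient $\mathbf{d}$ on the discrete grid. The orientation distribution at a pixel can be seen as a continuous impulse function $h(\varphi)=\|\mathbf{d}\|\,\delta(\varphi-\Phi(\mathbf{d}))$ [7]. Therefore, the Fourier representation of $h$ can be formulated as
$$F_{m} = \int_{0}^{2\pi} h(\varphi)\,e^{-im\varphi}\,d\varphi = \|\mathbf{d}\|\,e^{-im\Phi(\mathbf{d})}. \qquad (4)$$
Relying on the shift property of the Fourier transform in polar coordinates, $F_{m}$ under a relative rotation $\theta$ becomes
$$(g_{\theta}F_{m})(\mathbf{x}) = e^{-im\theta}\,F_{m}(\mathbf{R}_{\theta}^{-1}\mathbf{x}). \qquad (5)$$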
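The behavior of the Fourier coefficients under rotation can be verified with a few lines of NumPy. In this sketch the function name `orientation_coefficients`, the unnormalized form $\|\mathbf{d}\|e^{-im\Phi(\mathbf{d})}$, and the random gradient samples are assumptions made for illustration. Rotating every gradient vector by an arbitrary continuous angle multiplies the order-$m$ coefficient by $e^{-im\theta}$ and leaves its magnitude untouched.

```python
import numpy as np

def orientation_coefficients(dx, dy, max_order):
    """Fourier coefficients ||d|| * exp(-i m Phi(d)) for m = -max_order..max_order."""
    magnitude = np.hypot(dx, dy)
    phase = np.arctan2(dy, dx)
    orders = np.arange(-max_order, max_order + 1)
    # Outer product gives one row of coefficients per rotation order m.
    coeffs = magnitude * np.exp(-1j * np.multiply.outer(orders, phase))
    return orders, coeffs

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=(2, 100))    # hypothetical gradient samples

theta = 0.37                          # arbitrary rotation angle in radians
dx_rot = np.cos(theta) * dx - np.sin(theta) * dy
dy_rot = np.sin(theta) * dx + np.cos(theta) * dy

orders, coeffs = orientation_coefficients(dx, dy, max_order=4)
_, coeffs_rot = orientation_coefficients(dx_rot, dy_rot, max_order=4)

# Prediction from the shift property: a phase factor exp(-i m theta) per order.
predicted = np.exp(-1j * orders * theta)[:, None] * coeffs
```

Because the angle is not restricted to a set of bins, the check holds for any `theta`, which is the practical meaning of working in a continuous coordinate system.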
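In order to make the feature rotation-invariant, namely to obtain a total rotation order of zero, we can construct a set of self-steerable convolution kernels $U_{k}(r,\varphi)=P(r)\,e^{ik\varphi}$ with the same rotation order as the Fourier coefficient field $F_{m}$ by inverse Fourier transformation. According to Eq. (3), once $k=m$ is satisfied, we get
$$(g_{\theta}F_{m})*U_{m} = (F_{m}*U_{m})\circ\mathbf{R}_{\theta}^{-1}, \qquad (6)$$
that is, the filtered response transforms like the scalar field in Eq. (1).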
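The invariance condition can be illustrated at a single location. The sketch below is a simplification under stated assumptions: it evaluates one regional response at the rotation centre on scattered sample points instead of running a full convolution over a pixel grid, and the Gaussian radial profile, the function names, and the random data are hypothetical. Each sample contributes its order-$m$ coefficient $\|\mathbf{d}\|e^{-im\Phi(\mathbf{d})}$ weighted by a kernel $P(r)e^{ik\varphi}$; rotating the whole scene changes the response by $e^{-i(m-k)\theta}$, so it is unchanged when $k=m$.

```python
import numpy as np

def rotate(vectors, theta):
    """Rotate an (N, 2) array of 2-D vectors by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return vectors @ rotation.T

def regional_feature(points, gradients, m, k, sigma=1.0):
    """Sum of order-m gradient coefficients weighted by an order-k polar kernel."""
    r = np.hypot(points[:, 0], points[:, 1])
    phi = np.arctan2(points[:, 1], points[:, 0])
    magnitude = np.hypot(gradients[:, 0], gradients[:, 1])
    grad_phase = np.arctan2(gradients[:, 1], gradients[:, 0])
    f_m = magnitude * np.exp(-1j * m * grad_phase)
    # Assumed radial profile: a Gaussian; only the angular part matters here.
    kernel = np.exp(-r**2 / (2 * sigma**2)) * np.exp(1j * k * phi)
    return np.sum(f_m * kernel)

rng = np.random.default_rng(1)
points = rng.normal(size=(60, 2))      # sample locations around the centre
gradients = rng.normal(size=(60, 2))   # gradient vector at each location
theta = 1.1                            # arbitrary rotation of the whole scene

points_rot, gradients_rot = rotate(points, theta), rotate(gradients, theta)

same_order = regional_feature(points, gradients, m=2, k=2)
same_order_rot = regional_feature(points_rot, gradients_rot, m=2, k=2)

mixed_order = regional_feature(points, gradients, m=2, k=1)
mixed_order_rot = regional_feature(points_rot, gradients_rot, m=2, k=1)
```

With matching orders the two responses agree, while the mixed-order response picks up the phase factor $e^{-i\theta}$ and is therefore not invariant by itself.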
II-C Aggregate Channel Features (ACF)
The ACF is simply computed by subsampling the rotation-invariant channel maps with a preset scaling factor, as shown in Fig. 1. This is a pooling-like operation, which has been shown to be robust to shift and rotation deformations to some extent. As the factor increases, the representation goes from fine to coarse, while the feature structure is gradually enhanced.
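A minimal sketch of this pooling-like aggregation is given below. It assumes block averaging over non-overlapping `factor` x `factor` cells and a channel layout of height x width x channels; the function name `aggregate_channels` and both of these choices are illustrative assumptions, since the letter only specifies subsampling with a preset factor.

```python
import numpy as np

def aggregate_channels(channels, factor):
    """Average each non-overlapping factor x factor block of every channel."""
    channels = np.asarray(channels, dtype=float)
    height, width, depth = channels.shape
    out_h, out_w = height // factor, width // factor
    # Drop border pixels that do not fill a complete block.
    cropped = channels[:out_h * factor, :out_w * factor]
    blocks = cropped.reshape(out_h, factor, out_w, factor, depth)
    return blocks.mean(axis=(1, 3))

# A 4 x 4 single-channel map aggregated with factor 2 gives a 2 x 2 map.
channel_map = np.arange(16).reshape(4, 4, 1)
aggregated = aggregate_channels(channel_map, factor=2)
```

A larger `factor` yields a coarser map with fewer features per window, which matches the fine-to-coarse behavior described above.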
II-D Fast Pyramid Generative Model (FPGM)
In [11], Ruderman and Bialek showed that the ensemble of scenes (natural images) has statistics that are invariant to scale. Building on this, Dollár et al. [8] extended the theory and proposed a fast image feature pyramid with an application to pedestrian detection. This technique effectively achieves feature channel scaling. That is, the features at any image scale ($s_1$) can be obtained directly as the product of a scale-based ratio factor, defined by $(s_1/s_2)^{-\lambda}$, and the features extracted at a given (known) scale ($s_2$) (more details can be found in [11] and [8]), which is formulated as
$$F_{s_1} = F_{s_2}\left(\frac{s_1}{s_2}\right)^{-\lambda}, \qquad (7)$$
where $F_{s_1}$ and $F_{s_2}$ denote the rotation-invariant feature maps extracted from the different image scales [8]. Given $F_{s_1}$ and $F_{s_2}$ and the corresponding scale ratio ($s_1/s_2$), the scaling factor $\lambda$ is simply estimated by Eq. (7) before training and testing the model.
The FPGM can compute finely sampled feature pyramids by feature scaling with octave-spaced scale intervals without loss of performance. Nevertheless, the input to FPGM needs to satisfy low-level feature invariance; hence the Fourier-based rotation-invariant feature maps can be embedded into this framework to correct the bias and variance of the trained classifier caused by various deformations (e.g., rotation, shift).
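The power-law relation used by FPGM can be sketched as follows. The example is a simplified illustration under assumptions: the inputs are taken to be positive per-channel mean responses measured at two scales, spatial resampling of the feature maps is omitted, and the function names and the value 0.11 are hypothetical rather than taken from the letter or from [8].

```python
import numpy as np

def estimate_lambda(features_s1, features_s2, s1, s2):
    """Estimate the exponent from F_s1 / F_s2 = (s1 / s2) ** (-lambda)."""
    f1 = np.asarray(features_s1, dtype=float)
    f2 = np.asarray(features_s2, dtype=float)
    # Average the log-ratio over channels, then divide by the log scale ratio.
    return -np.mean(np.log(f1 / f2)) / np.log(s1 / s2)

def approximate_features(features_s2, s1, s2, lam):
    """Predict features at scale s1 from features computed at scale s2."""
    return np.asarray(features_s2, dtype=float) * (s1 / s2) ** (-lam)

# Synthetic check: generate responses that follow the power law exactly.
true_lambda = 0.11
features_full = np.array([1.0, 2.0, 3.0])              # measured at scale 1.0
features_half = features_full * (0.5 / 1.0) ** (-true_lambda)

lam = estimate_lambda(features_half, features_full, s1=0.5, s2=1.0)
predicted_quarter = approximate_features(features_full, s1=0.25, s2=1.0, lam=lam)
```

Once the exponent is estimated, features at intermediate scales are obtained by a multiplication instead of a fresh extraction, which is where the speed-up comes from.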
III Experiments
III-A Data Description and Experiment Setup
The NWPU VHR-10 dataset [12] is used as the benchmark for assessing the performance of the mentioned algorithms. It consists of images collected from Google Earth with a spatial resolution of 0.5 m to 2 m and infrared images with a 0.08 m spatial resolution obtained from the Vaihingen dataset provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF). We selected two scenes, airplanes and baseball diamonds, from this dataset to analyze and discuss the effectiveness of the proposed FRIFB in depth. In our experiments, 60% of the scene images are randomly selected for training and the rest for testing. Moreover, the positive samples in the training set are simply augmented by mirroring, while the negative ones are randomly selected from 150 images without any targets. The average sizes of airplanes and baseball diamonds are 75 × 75 and 90 × 90 pixels, respectively. For a fair comparison, we conducted 5-fold cross-validation and report the average result.
Similarly to classical object detection methods, we employ the same indices, the precision-recall curve (PRC) and average precision (AP), to quantitatively evaluate the performance in geospatial object detection. More precisely, if the intersection over union (IoU) ratio between a detection bounding box and the ground-truth box exceeds 0.5, the detection is counted as a true positive (TP); otherwise, it is counted as a false positive (FP).
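The IoU criterion can be sketched as below. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the rule that each ground-truth box may be matched at most once (a common convention in detection benchmarks) are assumptions of this sketch; the letter itself only states the 0.5 threshold. A detection that fails the criterion is labeled as a false positive, following the usual convention.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_detections(detections, ground_truths, threshold=0.5):
    """Label detections (sorted by descending score) as 'TP' or 'FP'."""
    matched = set()
    labels = []
    for det in detections:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue  # assumed rule: one detection per ground-truth box
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > threshold:
            matched.add(best_idx)
            labels.append("TP")
        else:
            labels.append("FP")
    return labels

detections = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
ground_truths = [(0, 0, 10, 10)]
labels = label_detections(detections, ground_truths)
```

Precision and recall at each score threshold then follow from the running counts of TP and FP labels together with the number of ground-truth boxes.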
Table I: AP (%) and mean running time per image (s) of the compared methods.

Method          Airplane AP  Baseball diamond AP  Mean time (s)
BOW-SVM               25.12                35.28           0.85
Rotation-aware        79.35                84.55           1.63
Exemplar-SVMs         80.37                79.16           1.49
COPD                  85.13                88.93           0.97
R-DFPN                88.60                90.80           0.7
RICNN                 86.64                88.23           8.77
YOLO2 (GPU)           89.75                94.32           0.13
FPGM                  65.59                79.17           0.16
FRIFB                 90.20                97.34           0.88
III-B Detection on NWPU VHR-10 Dataset
Fig. 3 shows the detection results of FRIFB, where the green and blue boxes indicate correct localizations and false alarms, respectively. As expected, the proposed method detects most of the targets with few false positives, demonstrating its robustness and effectiveness under various rotation behaviors. However, the feature discrimination still remains limited, particularly when detecting the airplane's tail fin and the edges or corners of the ground track field. Note that the robustness of the proposed descriptor mainly depends on the resolution of the training samples and the setting of the Fourier basis functions. That is, as long as the training samples and the basis functions are sufficiently sampled, the robustness against object rotations and complex noise can be theoretically guaranteed.
III-C Comparison with State-of-the-Art Algorithms
To evaluate the performance of the proposed method (FRIFB), we compare it with several state-of-the-art algorithms: BOW-SVM [13], rotation-aware features [14], exemplar-SVMs [15], COPD [12], R-DFPN [10], rotation-invariant CNN (RICNN) [3], you only look once (YOLO2) [16] (the code we used, including data augmentation, is available from the website: https://github.com/ringringyi/DOTA_YOLOv2), and FPGM [8]. Similarly to RICNN, data augmentation by rotating or translating the training samples with various angles is performed in all compared methods. For the parameter settings of these compared algorithms, please refer to the corresponding references.
We visually observe the trends of PRC and AP values for the seven different methods in the two different scenes, as shown in Fig. 2. Correspondingly, Table I lists quantitative comparison results in terms of AP values and running time per image. More specifically, BOW-SVM yields poor performance, since it does not model spatial contextual relationships, leading to a weakly discriminative feature representation. By considering the rotation behavior of the objects, the exemplar-SVMs and rotation-aware methods perform better, but the features constructed on a discrete grid still hinder their performance. Besides, the computational costs of the above methods are high, in particular for computing image pyramid features. The detection method based on the fast feature pyramid (FPGM) runs effectively in real time. Furthermore, the fast feature pyramid consists of 10 channels, namely LUV color channels (three channels), normalized gradient magnitude (one channel), and histogram of oriented gradients (six channels), which achieves better performance on baseball diamond samples and a faster running speed. Owing to its well-designed network architecture and high-performance GPU computing, YOLO2 achieves the fastest running speed with competitive detection accuracy. Nevertheless, the AP of YOLO2 is still lower than that of the proposed FRIFB by about 0.45 and 3.02 percentage points on the two datasets (Table I), respectively, as YOLO2's sensitivity to tiny and arbitrary pairs of objects hurts its performance to some extent. As expected, the proposed FRIFB outperforms the others in terms of precision. Without relying on sample augmentation in the training process, FourierHOG focuses more on modeling the intrinsic rotation property by simultaneously considering the local and global information of the image.
Please note that although YOLO-like methods have a lower computational cost, many applications, such as precision agriculture and urban planning that needs to accurately collect building information, prefer higher detection accuracy with an acceptable running time. As a result, our proposed FRIFB might be applicable to some practical cases.
IV Conclusion
In this letter, we revisit the fast pyramid feature method, which is sensitive to rotation, and provide an effective remedy by introducing a rotation-invariant descriptor. This descriptor is tightly integrated into the power law, which can fundamentally correct the bias and variance of the trained classifier caused by rotation. Furthermore, we develop a novel and efficient framework for geospatial object detection by integrating multiple techniques that have complementary advantages. Extensive experimental results indicate that the proposed method is robust to rotation and can effectively improve the detection performance. In future work, we will focus on tiny object detection by developing an end-to-end learning framework (e.g., deep learning) or introducing auxiliary data (e.g., hyperspectral or multispectral data [17]).
References
[1] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[2] D. Hong, N. Yokoya, J. Xu, and X. Zhu, “Joint & progressive learning from high-dimensional data for multi-label classification,” in Proc. ECCV, 2018, pp. 469–484.
[3] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, 2016.
[4] D. Hong, W. Liu, J. Su, Z. Pan, and G. Wang, “A novel hierarchical approach for multispectral palmprint recognition,” Neurocomputing, vol. 151, pp. 511–521, 2015.
[5] X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “ORSIm Detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” arXiv preprint arXiv:1901.07925, 2019.
[6] D. Hong, W. Liu, X. Wu, Z. Pan, and J. Su, “Robust palmprint recognition based on the fast variation Vese–Osher model,” Neurocomputing, vol. 174, pp. 999–1012, 2016.
[7] K. Liu, H. Skibbe, T. Schmidt, T. Blein, K. Palme, T. Brox, and O. Ronneberger, “Rotation-invariant HOG descriptors using Fourier analysis in polar and spherical coordinates,” Int. J. Comput. Vis., vol. 106, no. 3, pp. 342–364, 2014.
[8] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, 2014.
[9] X. Wu, D. Hong, P. Ghamisi, W. Li, and R. Tao, “MsRi-CCF: Multi-scale and rotation-insensitive convolutional channel features for geospatial object detection,” Remote Sens., vol. 10, no. 12, p. 1990, 2018.
[10] X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo, “Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks,” Remote Sens., vol. 10, no. 1, p. 132, 2018.
[11] D. L. Ruderman and W. Bialek, “Statistics of natural images: Scaling in the woods,” Phys. Rev. Lett., vol. 73, no. 6, pp. 814–817, 1994.
[12] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogramm. Remote Sens., vol. 98, no. 1, pp. 119–132, 2014.
[13] S. Xu, T. Fang, D. Li, and S. Wang, “Object classification of aerial images with bag of visual words,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366–370, 2010.
[14] U. Schmidt and S. Roth, “Learning rotation-aware features: From invariant priors to equivariant descriptors,” in Proc. CVPR. IEEE, 2012, pp. 2050–2057.
[15] T. Malisiewicz, A. Gupta, and A. Efros, “Ensemble of exemplar-SVMs for object detection and beyond,” in Proc. ICCV. IEEE, 2011, pp. 89–96.
[16] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR. IEEE, 2017, pp. 7263–7271.

[17] D. Hong, N. Yokoya, J. Chanussot, and X. Zhu, “An augmented linear mixing model to address spectral variability for hyperspectral unmixing,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1923–1938, 2019.