1 Introduction
Aerial image categorization is an important
component for many applications in artificial intelligence and
remote sensing add1 ; add2 ; add3 , such as visual
surveillance, navigation, and robot path planning. However, successful
aerial image categorization remains challenging for two reasons. On one
hand, the aerial image
components (e.g., house roofs and
grounds) as well as their spatial configurations are complex and
highly variable, making it difficult to extract features sufficiently
discriminative for aerial image representation. On the other
hand, the efficiency of the existing aerial image categorization
methods is far from practical due to the
huge number of various components as well as their bilateral
relationships. Therefore, a discriminative and concise aerial
image representation has become
increasingly imperative for a successful categorization system.
In the literature of designing discriminative image representations for visual recognition, many features have been proposed. They can be categorized into two groups: global features and local features. Global features, such as histograms, eigenspace
eigenspace , and skeletal shape skeletalsharp , summarize the entire image with a single vector and are the standard input for statistical models such as SVM. However, global features are sensitive to occlusion and clutter. Besides, these representations typically rely on a preliminary segmentation of objects in images. These two limitations result in unstable categorization performance. In contrast to global features, local features such as the scale invariant feature transform (SIFT) sift are developed to increase discrimination. Each local feature describes a localized image region and is calculated around an interest point. Thus, local features are robust to partial occlusion and clutter. To take advantage of this property, local features handwritten ; parsing ; hierarchical (e.g., junction junction , gradient gradient , contour, etc.) have recently been widely used for aerial image parsing. However, different images typically contain different numbers of local features, which makes it difficult to integrate the local features within an image into the fixed-length input required by standard classifiers. In many cases, they are integrated into an orderless bag-of-features as a global representation, so that the similarity between images is determined by the orderless bag-of-features. It is worth emphasizing that, as a nonstructural representation, the bag-of-features ignores the geometric property of an image (i.e., the spatial distribution of the local image patches), which prevents it from being highly discriminative. For example, the bag-of-features representations of a zebra skin and a chessboard are similar. That is to say, the bag-of-features representation is not sufficiently descriptive to distinguish the zebra from the chessboard, although the geometric properties of the two images are significantly different.
In order to encode image geometric properties into a categorization model, several image geometric features have been proposed. In beyond , the spatial pyramid matching kernel is obtained by clustering the local features into a few geometric types. However, the spatial pyramid matching kernel is not flexible enough, since it depends heavily on human prior knowledge. The RGB-domain spin image spin describes the spatial context by exploring the chain structure of pixels in each RGB channel. However, the chain structure usually fails to describe spatial context with complicated structures. The walk kernel walk_kernel is proposed to capture the walk structures among image local features. However, the unavoidable tottering phenomenon (i.e., one vertex may occur several times in a walk) introduces noise and hence limits its discrimination. To obtain better discrimination, parameters are provided to tune the length of the chain spin or walk walk_kernel . This operation leads to highly redundant structures: both the time consumption and the memory cost increase remarkably as the number of structures goes up. Therefore, a concise image structure representation is desired for accurate aerial image categorization.
Recently, many graph-based models have been applied in intelligent systems and multimedia. They can be used as geometric image descriptors zhang1 ; zhang2 ; zhang3 ; zhang4 to enhance image categorization. Besides, these methods can be used as image high-order potential descriptors of superpixels zhang5 ; zhang6 ; zhang7 ; zhang8 ; zhang9 . Further, graph-based descriptors can be used as general image aesthetic descriptors to improve image aesthetics ranking, photo retargeting and cropping zhang10 ; zhang11 ; zhang12 ; zhang13 .
In this paper, we propose a novel aerial image categorization system, which enables the exploration of the geometric property embedded in local features. An aerial image is represented by a graph, since a graph is a natural and descriptive tool to express the complicated relationships among objects. By defining the region connected graph (RCG), we decompose an aerial image into a set of discriminative subgraphs. To capture discriminative relationships among RCGs, a structure refinement strategy is carried out to select highly discriminative and low redundant structures. Based on the refined structures, we extract the corresponding subRCGs, and all the subRCGs from an aerial image form the discriminative spatial context. Finally, a quantization operation transforms the discriminative spatial context into a feature vector for categorization.
The major contributions of this paper are as follows: 1) the region connected graph (RCG), a graph-based representation that describes the local patches and their topology for an aerial image; 2) a structure refinement algorithm that selects highly discriminative and low redundant structures among the training RCGs; and 3) an efficient isomorphic subgraph extraction component that acquires the corresponding subRCGs.
2 Region Connected Graph (RCG)
An aerial image usually contains millions of pixels. If we treat
each pixel as a local feature, the high computational complexity
will make aerial image recognition intractable.
Fortunately, an aerial image can be
represented by a collection of clusters because pixels are usually
highly correlated with their neighboring ones. Each cluster
consists of neighboring pixels with consistent color intensities.
Thus, given an aerial image, we can represent it by a set of
regions instead of millions of pixels. The
neighboring relationships between regions define the spatial
context of an aerial image. Naturally, we can model this
representation as a labeled graph. The
labels denote the local features of each region and each edge
connects pairwise neighboring regions. In our work, we call this
representation region connected graph (RCG).
To obtain the RCG from an aerial image,
a segmentation algorithm (i.e., fuzzy
clustering fuzzy in our implementation) groups pixels into
different clusters according to their color intensity. Note that
the pixels in the same cluster are not necessarily spatially
adjacent. As shown in Fig. 1, we use different
grayscale values to identify different clusters. Pixels in the
face and the lower half of Snoopy’s body are grouped into the
same cluster. However, it is more reasonable if
they are categorized into different
groups, since the face and the lower half
of Snoopy are spatially isolated. To this end, a region growing
algorithm ip_matlab is employed to divide an image into
regions iteratively. In each iteration, the region growing
algorithm initializes the current region with a random pixel
that has not yet been assigned to a region.
It then keeps adding spatially
neighboring pixels to this region as long as they come from
the same cluster as the pixels already in the region. The iteration
terminates once all pixels have been assigned to a region.
The region growing
result is shown on the right of Fig. 1.
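The region growing step described above can be sketched as a flood fill over the cluster map. This is a minimal illustration rather than the implementation used in the experiments: the function name `grow_regions`, the use of 4-connectivity, and the choice of scanning seeds in raster order instead of at random are all assumptions made here (the seed order does not change the resulting partition, only the region ids).

```python
from collections import deque


def grow_regions(cluster_map):
    """Split a 2D grid of cluster labels into singly connected regions.

    Returns (region_map, region_count), where region_map holds one region
    id per pixel. Assumes 4-connectivity and raster-order seeds.
    """
    rows, cols = len(cluster_map), len(cluster_map[0])
    region_map = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region_map[r][c] != -1:
                continue  # pixel already belongs to a region
            # Start a new region from this unassigned seed pixel.
            region_map[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region_map[ny][nx] == -1
                            and cluster_map[ny][nx] == cluster_map[y][x]):
                        # Same cluster and spatially adjacent: grow the region.
                        region_map[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region_map, next_id


# Two spatially isolated parts of cluster 0 end up in different regions.
example_clusters = [[0, 1, 0],
                    [0, 1, 0],
                    [1, 1, 0]]
example_regions, example_count = grow_regions(example_clusters)
```

In the example, the left and right columns share cluster label 0 but are separated by cluster 1, so they receive different region ids, which mirrors the Snoopy case discussed above.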
On the basis of the singly connected regions, the RCG of an aerial image can be obtained as shown in Fig. 2. Given an aerial image (Fig. 2(a)), we segment it into singly connected regions (Fig. 2(b)). Then, each singly connected region is treated as a vertex (the red solid point), and each pair of spatially neighboring regions is linked by an edge (the green line). Finally, an RCG is defined by a collection of vertices and a set of edges, where the vertices are the singly connected regions and the edges are the spatially neighboring relationships (Fig. 2(c)). The size of an RCG refers to its number of vertices. The number of neighbors of a vertex is called the vertex degree. A useful attribute of an RCG is that its vertex degree is upper bounded. That is to say, each region has a limited number of neighbors. It is observed that the average vertex degree of each RCG is less than four and the maximum vertex degree is no more than 15.
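As an illustration of the construction above, the following sketch builds the adjacency structure of an RCG from a map of region ids and reports the average and maximum vertex degree. The names `build_rcg` and `degree_stats` are hypothetical, 4-connectivity between pixels is assumed, and the per-region feature labels (e.g., color histograms) are omitted for brevity.

```python
def build_rcg(region_map):
    """Return {region_id: set of neighboring region ids} for a 2D region map."""
    rows, cols = len(region_map), len(region_map[0])
    adjacency = {}
    for r in range(rows):
        for c in range(cols):
            a = region_map[r][c]
            adjacency.setdefault(a, set())
            # Looking only down and right visits every adjacent pixel pair once.
            for dy, dx in ((1, 0), (0, 1)):
                ny, nx = r + dy, c + dx
                if ny < rows and nx < cols:
                    b = region_map[ny][nx]
                    if a != b:
                        adjacency[a].add(b)
                        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_stats(adjacency):
    """Return (average vertex degree, maximum vertex degree)."""
    degrees = [len(neighbors) for neighbors in adjacency.values()]
    return sum(degrees) / len(degrees), max(degrees)


example_region_map = [[0, 1, 2],
                      [0, 1, 2],
                      [1, 1, 2]]
example_rcg = build_rcg(example_region_map)
example_avg_degree, example_max_degree = degree_stats(example_rcg)
```

A dictionary of neighbor sets is used here because the vertex degree is bounded and small, so each vertex stores only a handful of neighbors.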
3 Discriminative Structures Selection
It is natural to recognize an aerial image by matching its RCG to a labeled one. However, comparing the structures of a pair of graphs is computationally expensive isomorphism : no polynomial-time algorithm is known for graph isomorphism, and subgraph isomorphism is NP-complete. That means it is impractical to compare pairwise RCGs directly. Alternatively, we represent an aerial image by a set of subRCGs. Thereby, aerial image categorization can be conducted by matching its subRCGs to those of the labeled aerial images. Notably, the RCG of an aerial image may contain tens to hundreds of vertices, and the number of subRCGs grows exponentially with the number of vertices, which makes it impractical to represent an aerial image by enumerating all its subRCGs (Fig. 3(a)). Toward a discriminative and concise representation for aerial image recognition, only subRCGs with highly discriminative and low redundant structures should be selected for aerial image categorization (Fig. 3(b)).
3.1 Frequent Structures Mining
Each subRCG reflects the structure of a subset of connected
vertices in the RCG. In other words, a subRCG models the spatial
context of an aerial image. Different types
of aerial images have different spatial contexts, and so do the
structures of their subRCGs. It is therefore natural to use the structure of
a subRCG to determine the aerial image type. For instance, as shown
in Fig. 3(c), all the three
subRCGs share the same structure but have slightly different color
intensity distributions. However, it is impractical to enumerate
all the possible subRCGs. Moreover, only frequently
occurring subRCGs contribute to the
recognition task, while the others are redundant. Motivated by
these observations, we select the frequent
structures.
In our implementation, an efficient
frequent subgraph discovery algorithm
called FSG fsg is employed. Note that the
vertex values of subRCGs may differ even though they share the
same structure. This prevents us from mining the frequent
structures accurately. Therefore, we ignore
the difference of vertex values. In particular, given a subRCG,
its structure is obtained by setting all the vertex labels of the
subRCG to the same value, e.g., one.
FSG counts the occurrences of each structure and outputs the probabilities of all the structures in the training RCGs; a structure does not necessarily exist in every training RCG. The probability of a structure represents its frequency. As the number of original candidate structures is exponential, only a structure whose probability is higher than a threshold is output as a frequent one. Therefore, the number of frequent structures is greatly reduced.
3.2 Measures for Structure Selection
The number of frequent structures is still too large (typically 100 to 300), though it is much smaller than that of the candidate structures. In addition, a structure with high frequency may not be highly discriminative. Thus, we carry out a further selection among the frequent structures to preserve only the highly discriminative and low redundant ones. We first define a distance to describe the similarity between subRCGs and with the same size:
(1) 
where is th vertex of and the local regions’ feature vector. is the Euclidean norm. More specifically, for structure and in and respectively, if , we define the structure distance between and as follows:
(2) 
where is the subRCG corresponding to . is a factor that normalize to and it is not a tuning parameter. That is, , where and denote the number of subRCGs in RCG and , respectively. By extending Eq.( 2) to the situations when , we define a more generic form of the structure distance between and . It is based on the probability by taking into account of different situations.
(3) 
The probability for structure existing in is
denoted by . It is straightforward to obtain the first line
of Eq.(3) by multiplying with the
structure distance wherein denotes the
probability for existing in and existing in .
This is similar to the second line and the
third line of Eq.(3). As is a subset of when
, the function outputs the enumerated
structures with the same size to in by FSG fsg in
the second line of Eq.(3), and vice versa in the third
line. denotes the probability for neither
existing in nor existing in . An
in the last line is the probability for
either existing in either or existing
in . .
Based on the structure distance
between and , a
measure of structure discrimination (MSD)
is defined to quantify a structure’s discrimination. Inspired by the
definition of discriminative ability in LDA klda , MSD
computes the ratio between the distances of RCGs with different labels and
those of RCGs with the same label:
(4) 
and are functions indicating whether and belong to the same class. If and belong to different classes, , otherwise . (Pairwise RCGs and belonging to the same class means that their corresponding aerial images belong to the same class. Similarly, two RCGs and belonging to different classes means that their corresponding aerial images belong to different classes.) A larger indicates a more discriminative structure . However, a structure set with high discrimination is not necessarily a concise one. Aiming at a concise set of structures, it is necessary to make a further structure selection. Motivated by the fact that high correlation leads to high redundancy speech , we believe that one of two structures should be removed if they are highly correlated. In order to calculate the correlation between structures, an approach to quantify the redundancy between structures, called measure of structure correlation (MSC), is defined based on the distance between structures:
(5) 
where the denominator functions as a normalization step. A larger leads to a lower correlation between structure and , and vice versa. Eq.(5) also can be explained by analogy with the three vertices of a triangle in Fig. 4. , and act as the distance between the three vertices. When becomes larger, the correlation between and becomes lower (Fig. 4(c)), and vice versa (Fig. 4(b)).
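To make the MSD idea from earlier in this subsection concrete, the sketch below computes a between-class to within-class distance ratio for one structure. It is an illustration only: it assumes MSD is a plain ratio of summed pairwise structure distances, which may differ from the exact weighting in Eq.(4), and `distance` is a hypothetical callable standing in for the structure distance between two training RCGs.

```python
from itertools import combinations


def msd(labels, distance):
    """Ratio of summed distances between different-class RCG pairs to
    summed distances between same-class RCG pairs, for one structure.

    labels:   class label of each training RCG
    distance: callable (i, j) -> structure distance between RCG i and RCG j
              (a stand-in for the structure distance defined in the text)
    """
    between = 0.0
    within = 0.0
    for i, j in combinations(range(len(labels)), 2):
        d = distance(i, j)
        if labels[i] != labels[j]:
            between += d
        else:
            within += d
    # A structure with zero within-class distance is treated as maximally
    # discriminative; this convention is an assumption of the sketch.
    return between / within if within > 0 else float("inf")


# Toy example: one scalar "feature" per RCG, absolute difference as distance.
toy_features = [0.0, 1.0, 5.0, 6.0]
toy_labels = ["a", "a", "b", "b"]
toy_msd = msd(toy_labels, lambda i, j: abs(toy_features[i] - toy_features[j]))
```

In the toy example the two classes are well separated, so the ratio is large; structures whose distances do not separate the classes would yield a ratio close to one.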
3.3 MSD and MSC based Structure Refinement
Based on the two structure measures MSD and MSC, we construct a
novel concise and discriminative structure refinement algorithm.
The stepwise operations of the proposed structure selection are
summarized in Algorithm 1. The
algorithm can be divided into two steps. First, the MSD values of
all the candidate structures are computed and sorted in descending
order. Each candidate structure whose MSD value is higher than a
threshold is initially preserved in a list.
Second, the MSC value between each pair of preserved structures
is computed to evaluate their redundancy. The removal of
redundant structures is carried out iteratively. During the first
round of iteration, we take the preserved structure with the
largest MSD value as a finally selected one. Then, we compute the MSC
values between the finally selected structure and the rest of the
preserved structures. Each structure whose MSC value is higher than
a threshold is removed, and the preserved structure list is
updated accordingly. After one round of iteration, we move to the
remaining preserved structure with the next lower MSD value. The iteration terminates
when no preserved structure is left to examine. The finally
preserved structures are deemed the refined ones.
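Algorithm 1: MSD and MSC based structure refinement
Input: training data set (RCGs with class labels) and the candidate structures;
 the thresholds for MSD and MSC
Output: a set of refined structures
// step 1: preserve the discriminative structures
for each candidate structure do begin
 calculate the MSD value of the structure;
 if the MSD value is higher than the MSD threshold
 preserve the structure into the preserved list;
end;
order the preserved list by descending MSD value;
// step 2: remove redundant structures
do begin
 take the first structure in the preserved list as the current structure;
 do begin
 take the next remaining structure in the preserved list;
 if its MSC value with the current structure is higher than the MSC threshold
 remove it from the preserved list;
 else
 keep it and move on to the next one;
 end until the end of the preserved list is reached;
 add the current structure to the refined set and remove it from the preserved list;
end until the preserved list is empty;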
Denote as the number of training RCGs and as the number of candidate structures, we assume that the structure distance between RCGs can be computed in constant time. As the distance between RCGs is required for calculating MSD and MSC, the computational cost of calculating MSD and MSC are both . As shown in Algorithm 1, the structure refinement step contains a double loop and the time complexity of each is . Therefore, the time complexity of the whole selection process is .
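The two steps of the refinement procedure can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the original Matlab implementation: `refine_structures` is a hypothetical name, the MSD values are assumed to be precomputed, `msc` is a hypothetical callable returning the MSC value of a pair of structures, and the removal rule follows the textual description in Section 3.3 (a preserved structure is removed when its MSC value with a selected structure exceeds the threshold).

```python
def refine_structures(msd_values, msc, msd_threshold, msc_threshold):
    """Select discriminative, low redundant structures.

    msd_values: dict mapping each candidate structure to its MSD value
    msc:        callable (s, t) -> MSC value between structures s and t
    """
    # Step 1: keep structures above the MSD threshold, highest MSD first.
    preserved = sorted(
        (s for s, value in msd_values.items() if value > msd_threshold),
        key=lambda s: msd_values[s],
        reverse=True,
    )
    # Step 2: repeatedly select the best remaining structure and drop the
    # preserved structures that are redundant with respect to it.
    refined = []
    while preserved:
        current = preserved.pop(0)
        refined.append(current)
        preserved = [s for s in preserved if msc(current, s) <= msc_threshold]
    return refined


# Toy example with four candidate structures and a symmetric MSC table.
toy_msd = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.05}
toy_msc_table = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.1,
}
toy_refined = refine_structures(
    toy_msd,
    lambda s, t: toy_msc_table[frozenset((s, t))],
    msd_threshold=0.1,
    msc_threshold=0.65,
)
```

The threshold values 0.1 and 0.65 in the example are the ones reported in Section 6.2; the MSD and MSC numbers themselves are made up for illustration. Each selected structure is compared against every remaining preserved structure, which gives the double loop mentioned in the complexity discussion above.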
4 Geometric Discriminative Feature
4.1 Geometric Discriminative Feature Extraction
As the refined structures are both concise
and discriminative, they are adopted to extract the geometric
discriminative features. Guided by the refined structures, we
extract subRCGs with the same structures and then use them as the
geometric discriminative features. As RCGs are
low degree graphs (vertex degree no more than
15), the computational complexity increases nearly linearly
with the number of vertices walk_kernel .
To achieve an efficient subRCG extraction process, we
propose an algorithm to locate the subRCGs efficiently. Given a
refined structure and an RCG , the proposed algorithm
outputs a collection of subRCGs with structure
. There are three steps in the proposed geometric discriminative feature extraction. First, the vertices of
are checked to determine whether . If , then an iterative process will be carried out. Otherwise, the algorithm terminates. Next, for each vertex in , we treat it as the reference point and compare to the structures of its correlated subRCGs. A depth-first-search strategy dfs is employed for graph matching. Only the subRCGs with the same structure as are preserved. By traversing all the vertices in RCG , we perform the matching process and collect all the qualified subRCGs. Finally, a collection of qualified subRCGs is obtained.
4.2 Quantizing SubRCGs into Feature Vectors
Given an aerial image, it can be represented by a set of subRCGs as described above. It is worth emphasizing that the subRCGs are planar (2D) visual features. Conventional classifiers such as the support vector machine (SVM)
ksvm can only handle 1D vectors. Further, the number of extracted subRCGs differs from one aerial image to another. Therefore, it is impractical for a conventional classifier like SVM to carry out classification directly. To tackle this problem, a quantization method is developed to convert each aerial image into a 1D vector.
The proposed quantization method is based on the distances
between the test aerial images and the training ones.
The distance is computed using the
extracted geometric discriminative features. Given an aerial
image, we first extract its geometric discriminative features,
each corresponding to a refined structure. Then, as shown in
Fig. 5, an aerial image is encoded into a vector
, where is
the number of training
aerial images and each element of is computed as:
(6) 
where is a free parameter to be tuned. In our implementation, we fix to 0.5 by using cross validation.
5 System Overview
Our aerial image categorization system can be divided into the
training and the test stages. In the training phase, structure
refinement for geometric discriminative feature extraction is
conducted. First, each aerial image is segmented into connected
regions for building the corresponding RCGs. Then, a frequent
structure mining algorithm is employed to discover the highly
frequent structures in the training RCGs. Next, MSD and MSC are
computed for each structure toward a concise set of structures.
Structure refinement is carried out to acquire the highly
discriminative and low redundant ones. Third, the geometric
discriminative features are obtained by extracting the subRCGs
corresponding to the refined structures. To convert the extracted
2D geometric discriminative features into 1D vectors, a
quantization scheme computes the distance between the given
aerial image and the training samples. Finally, we train an SVM
classifier on the vectors from the encoded training samples.
The test phase is illustrated on the right. Given a test
aerial image, we first obtain its RCG. Then, the geometric
discriminative features are extracted to represent the given
aerial image. Similarly, a quantization operation is carried out
to convert the aerial image into a vector using the geometric
discriminative features. This vector is fed into the trained SVM
for aerial image categorization.
6 Experiments and Results Analysis
Experiments are carried out on two data sets. The first data set contains the aerial images from the Lotus Hill (LHI) data set lotus . It consists of five categories, each containing 20 aerial images. Each image is associated with a standard segmentation map. The second data set is our own compiled data set, which includes aerial images from ten categories. The whole data set contains 2,096 aerial images crawled from Google Earth. The experimental system is equipped with an Intel E8500 CPU and 4GB RAM. All the algorithms are implemented on the Matlab platform.
6.1 Comparative Study
In our experiment, the validation of the proposed geometric
discriminative feature is conducted on both the LHI and our own
data sets. We compare our geometric discriminative feature with
several representative discriminative visual
features, i.e., the global RGB histogram, the
intensity-domain spin images spin , the walk/tree
kernel walk_kernel , the sparse coding spatial pyramid
matching (SCSPM) scSPM , the locality-constrained spatial
pyramid matching (LLCSPM) llcSPM , and the object
bank ob . As the spatial pyramid matching
kernel beyond heavily relies on the prior knowledge, we do
not employ it for comparison. In our implementation, the
geometric discriminative features are extracted to encode both the
color intensity distribution and the spatial property. In each
segmented region, a 4096-dimensional RGB histogram is extracted as
its representation. A few example
aerial images and their geometric discriminative features are presented.
Category  Walk kernel  Tree kernel  SPM(200)  SCSPM(256)  LLCSPM(256)  OBSPM(LR1)  SPM(400)  SCSPM(512)
Airport  0.882±0.023  0.901±0.032  0.723±0.017  0.721±0.026  0.723±0.017  0.799±0.021  0.811±0.043  0.843±0.021
Commer.  0.545±0.034  0.532±0.012  0.441±0.023  0.443±0.031  0.334±0.027  0.517±0.036  0.521±0.022  0.456±0.012
Indust.  0.642±0.021  0.611±0.032  0.521±0.021  0.499±0.041  0.413±0.015  0.512±0.056  0.454±0.033  0.576±0.018
Inter.  0.645±0.067  0.685±0.011  0.611±0.018  0.643±0.023  0.322±0.031  0.675±0.034  0.674±0.026  0.634±0.011
Park.  0.523±0.039  0.487±0.017  0.443±0.011  0.512±0.037  0.412±0.021  0.536±0.012  0.512±0.057  0.496±0.025
Railway  0.556±0.076  0.578±0.056  0.502±0.032  0.511±0.022  0.521±0.033  0.514±0.013  0.521±0.038  0.596±0.052
Seaport  0.859±0.051  0.843±0.036  0.774±0.021  0.745±0.034  0.721±0.034  0.766±0.016  0.632±0.043  0.814±0.009
Soccer  0.646±0.021  0.655±0.006  0.576±0.021  0.589±0.023  0.578±0.023  0.568±0.032  0.521±0.045  0.624±0.032
Temple  0.503±0.029  0.454±0.031  0.521±0.042  0.567±0.038  0.511±0.031  0.603±0.021  0.534±0.024  0.565±0.045
Univer.  0.241±0.045  0.265±0.009  0.289±0.017  0.301±0.021  0.223±0.044  0.304±0.041  0.498±0.03  0.321±0.012
Average  0.524±0.041  0.601±0.024  0.540±0.022  0.553±0.030  0.4770±0.033  0.579±0.028  0.568±0.037  0.593±0.024
Category  LLCSPM(512)  OBSPM(LRG)  SPM(800)  SCSPM(1024)  LLCSPM(1024)  OBSPM(LRG1)  SPM(HC)  SCSPM(HC)
Airport  0.801±0.021  0.889±0.035  0.799±0.033  0.912±0.015  0.899±0.019  0.872±0.051  0.813±0.045  0.916±0.023
Commer.  0.567±0.034  0.565±0.032  0.512±0.032  0.601±0.034  0.521±0.021  0.617±0.034  0.519±0.043  0.584±0.042
Indust.  0.521±0.025  0.613±0.013  0.585±0.043  0.557±0.032  0.593±0.019  0.576±0.054  0.598±0.058  0.564±0.039
Inter.  0.766±0.036  0.705±0.015  0.644±0.022  0.788±0.014  0.622±0.035  0.676±0.013  0.668±0.041  0.791±0.019
Park.  0.489±0.032  0.486±0.016  0.503±0.043  0.489±0.043  0.489±0.055  0.512±0.009  0.511±0.057  0.487±0.025
Railway  0.553±0.042  0.532±0.053  0.602±0.017  0.601±0.037  0.599±0.009  0.589±0.010  0.614±0.026  0.609±0.044
Seaport  0.751±0.036  0.779±0.045  0.815±0.031  0.745±0.034  0.798±0.032  0.811±0.013  0.822±0.039  0.751±0.039
Soccer  0.625±0.026  0.646±0.014  0.634±0.028  0.689±0.036  0.655±0.014  0.668±0.043  0.643±0.037  0.693±0.045
Temple  0.567±0.024  0.587±0.027  0.577±0.041  0.689±0.027  0.556±0.032  0.612±0.025  0.587±0.046  0.649±0.034
Univer.  0.409±0.042  0.389±0.018  0.311±0.013  0.582±0.035  0.281±0.042  0.304±0.011  0.324±0.031  0.537±0.033
Average  0.605±0.032  0.620±0.027  0.606±0.029  0.654±0.033  0.600±0.027  0.636±0.025  0.610±0.042  0.658±0.032
Category  LLCSPM(HC)  Our proposed method
Airport  0.904±0.031  0.864±0.051
Commer.  0.534±0.029  0.677±0.024
Indust.  0.598±0.023  0.555±0.034
Inter.  0.634±0.046  0.812±0.021
Park.  0.493±0.064  0.501±0.061
Railway  0.604±0.005  0.606±0.033
Seaport  0.803±0.046  0.771±0.025
Soccer  0.659±0.026  0.663±0.065
Temple  0.574±0.041  0.665±0.019
Univer.  0.287±0.049  0.551±0.034
Average  0.609±0.036  0.667±0.037
Table 1: Recognition rate with standard deviation on our own data set (the experiment was repeated 10 times; HC is the HOG+color moment with a 1024-sized codebook; the number in each bracket denotes the codebook size; and LR2 and LRG are different regularizers as described in ob ).
First, we present a set of discovered
discriminative subgraphs. From a horizontal glance, we can roughly
discriminate aerial images from the five categories, especially
for the intersections and the marines. This demonstrates the
necessity to exploit the relationships among aerial image patches
for categorization.
Further, to make comparison among the global histogram,
the spin images, the walk kernel, and the proposed geometric
discriminative feature, we select half of
the images for training and leave the rest for testing. As shown
in Table 1, the proposed feature
achieves the best accuracy on average.
6.2 Discussion on different parameter settings
We notice that the influence of the
segmentation operation in the RCG construction is non-negligible.
To evaluate the performance under different segmentation settings
(i.e., the number of singly connected regions), we
perform aerial image recognition on the LHI
data set, since its off-the-shelf segmentation benchmark
enables a fair comparison.
Different segmentation settings are employed in our
evaluation, i.e., deficient segmentation and over
segmentation. The MSD values of each aerial image corresponding
to different segmentation settings are computed. We observed that
the benchmark segmentation setting achieves the largest MSD value of
6.3, while the deficient segmentation and over segmentation obtain
4.9 and 5.7, respectively. Comparatively, more regions are
obtained in over segmentation, which means it is rarer for one
region to span several objects. Therefore, when building an RCG
from overly segmented regions, fewer discriminative objects are
neglected than with deficient segmentation. Further, it is unavoidable that the unsupervised
clustering is less accurate than
the benchmark segmentation.
Category  Bench.  Defic.  Overly  Multi.
Intersection  0.8  0.3  0.8  0.8
Marine  0.4  0.8  0.8  0.9
Parking  0.9  0.5  0.6  0.6
Residential  0.5  0.7  0.6  0.7
School  0.6  0.3  0.3  0.6 
Average rate  0.64  0.54  0.62  0.72 
Total topology #  73  125  177  143 
Selected structure #  8  8  8  8 
Average RAG edge #  37  26  57  41 
Average RAG vertex #  19  16  31  19 
We compare the categorization
accuracy under the benchmark segmentation, the over segmentation
and the deficient segmentation. As shown in
Table 2, over segmentation obtains 2% lower accuracy
than the benchmark segmentation on average. Deficient
segmentation performs worse than over segmentation, providing
the lowest accuracy. The overall recognition result is consistent
with what the MSD reflects.
In the structure selection stage, the thresholds of both MSD and MSC influence the obtained structures. Toward an easy parameter tuning process, we set the threshold of MSD to a small value, which allows a large number of candidate structures to qualify. Then, we tune the threshold of MSC to carefully remove the redundant structures. As shown in Fig. 7, we set the threshold of MSD to 0.1 and tune the threshold of MSC. It is observed that the categorization accuracy increases and then becomes stable once the threshold of MSC reaches 0.65. Thus, we set the thresholds of MSD and MSC to 0.1 and 0.65 in our implementation.
6.3 The compilation of our aerial image data set
We compiled our data set by searching for aerial images on Google Earth. The whole data set contains 2,096 aerial images from ten categories. Since the aerial images of cities are usually clearer than those of remote areas, we collected most of our images from metropolises, such as New York, Tokyo and Beijing. Because the difficulty of crawling images differs across categories, the number of images in each category varies, as detailed in Table 3.
Category  Air.  Comme.  Industrial  Inter.  Park
Number  306  262  206  302  129
Category  Rail.  Seaport  Soccer  Temp.  Univ.
Number  115  126  128  218  305
7 Conclusions
Aerial image categorization is an important component in artificial intelligence and remote sensing add4 ; add5 . In this paper, a new geometric discriminative feature is proposed for aerial image recognition. Both the local features and their geometric property are taken into account to describe an aerial image. A region connected graph (RCG) is defined to encode the geometric property and the color intensity of an aerial image. Then, the frequent structures are mined statistically from the training RCGs. The refined structures are further selected from the frequent structures so as to be highly discriminative and low redundant. Given a new aerial image, its geometric discriminative features are extracted under the guidance of the refined structures. They are further quantized into a vector for SVM ksvm classification. We evaluated the effectiveness of our approach on both the public and our own data sets.
8 Appendix
Ideally, we want a perfect segmentation algorithm with two merits. First, each segmented region represents a semantic object/component. Second, the segmentation algorithm is parameter-free, so that we can apply it to segment thousands of training images once and for all, without human-interactive parameter tuning. Unfortunately, for the first merit, the high-level features in semantics-exploiting segmentation methods are usually designed manually and are data set dependent, which is not consistent with the fully automated and data set independent framework of the proposed method. Besides, to learn semantics, such methods typically require well-annotated training images; however, the large number of training aerial images used in our experiment are crawled online, and human annotation is laborious. For the second merit, semantics-exploiting segmentation methods are usually complicated and involve several important user-controlled parameters. Therefore, we can only use data-driven segmentation methods, which explore no semantics and typically contain one tuning parameter. The well-known data-driven segmentation algorithms can be divided into two groups. Algorithms in the first group need the number of segmented regions as input, such as k-means and normalized cut; however, there is no uniform number of segmented regions across images, because different images usually contain different numbers of components. Algorithms in the second group require a tolerance bound as input, such as the similarity tolerance between spatially neighboring segmented regions. Compared with the number of segmented regions, we empirically found that the tolerance bound is more flexible to tune. Therefore, in our approach, we chose the second group of data-driven segmentation methods. After some experimental comparison, we found that the unsupervised fuzzy clustering (Matlab codes: https://mywebspace.wisc.edu/pwang6/personal/) outperforms several tolerance bound-based segmentation algorithms, such as graph-based segmentation (C++ codes: http://www.cs.brown.edu/ pff/segment/). Thus, we choose unsupervised fuzzy clustering in our approach.
References
 (1) X. Yuan, H. Zhu, S. Yang, “A Robust Framework For Eigenspace Image Reconstruction,” IEEE Workshop on Appl. of Comp. Vis., pp. 54–59, 2005.
 (2) H. Blum, “Biological shape and visual science”, Journal of Theoretical Biology, pages 205–287, 1973.
 (3) J. Porway, K. Wang, B. Yao, S.C. Zhu, “Scale-invariant shape features for recognition of object categories”, in Proc. IEEE Int. Comp. Vis., pp. 90–96, 2004.
 (4) M. A. Maloof, P. Langley, T. O. Binford, R. Nevatia, S. Sage, “Improved Rooftop Detection in Aerial Images with Machine Learning”, Machine Learning, pages 157–191, 2003.
 (5) T. Zhao, R. Nevatia, “Car detection in low resolution aerial image”, in Proc. IEEE Int. Comp. Vis., 2001.
 (6) L. Zhang, Y. Gao, Y. Xia, Q. Dai, X. Li, A Fine-Grained Image Categorization System by Cellet-Encoded Spatial Pyramid Modeling, IEEE Transactions on Industrial Electronics (TIE), 2014 (accepted).
 (7) L. Zhang, Y. Gao, C. Hong, Y. Feng, J. Zhu, D. Cai, Feature Correlation Hypergraph: Exploiting High-order Potentials for Multimodal Recognition, IEEE Transactions on Cybernetics (TCYB), 2013 (accepted).
 (8) L. Zhang, Y. Gao, R. Ji, L. Ke, J. Shen, Representative Discovery of Structure Cues for Weakly-Supervised Image Segmentation, IEEE Transactions on Multimedia (TMM), 16(2): 470–479, 2014.
 (9) L. Zhang, M. Song, Y. Yang, Q. Zhao, Z. Chen, N. Sebe, Weakly Supervised Photo Cropping, IEEE Transactions on Multimedia (TMM), 16(1): 94–107, 2014.
 (10) L. Zhang, M. Song, Z. Liu, X. Liu, J. Bu, C. Chen, Probabilistic Graphlet Cut: Exploring Spatial Structure Cue for Weakly Supervised Image Segmentation, IEEE Computer Vision and Pattern Recognition (CVPR), pages: 1908–1915, 2013.
 (11) H. Moissinac, H. Maitre, I. Bloch, “Urban aerial image understanding using symbolic data”, in Proc. SPIE Image and signal proce. for remote sensing, 1994.
 (12) A. C. Berg, F. Grabler, J. Malik, “Parsing images of architectural scenes”, in Proc. IEEE Int. Comp. Vis., pp. 1–8, 2007.
 (13) J. Porway, K. Wang, B. Yao, S.C. Zhu, “A hierarchical and contextual model for aerial image understanding”, in Proc. IEEE Int. Comp. Vis., pp. 1–8, 2008.
 (14) S. Lazebnik, C. Schmid, J. Ponce, “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories”, in Proc. IEEE Int. Comp. Vis., pp. 2169–2178, 2006.
 (15) N. Cristianini, B. Scholkopf. “Using spin images for efficient object recognition in cluttered 3D scenes”, IEEE Trans. on Pattern Analysis and Mach. Intell., vol. 21, no. 5, pp. 433–449, 1999.
 (16) Z. Harchaoui, F. Bach, “Image classification with segmentation graph kernels”, in Proc. IEEE Int. Comp. Vis., pp. 1–8, 2007.
 (17) R. Gonzalez, R. Woods, S. Eddins, “Digital Image Processing Using Matlab”. Prentice Hall, Dec 26, 2003.
 (18) S. Jia, Z. Zhu, L. Shen, Q. Li, “A two-stage feature selection framework for hyperspectral image classification using few labeled samples”, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 4, pp. 1023–1035, 2014.
 (19) C. Chen, W. Li, E. W. Trame, M. Cui, S. Prasad, J. E. Fowler, “Spectral-spatial preprocessing using multihypothesis prediction for noise-robust hyperspectral image classification”, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 4, pp. 1047–1059, 2014.
 (20) N. Shervashidze, S.V.N. Vishwanathan, T.H. Petri, K. Mehlhorn, K.M. Borgwardt, “Efficient graphlet kernels for large graph comparison”, Int. Conf. on Artif. Intell. and Stat., pp. 488–495, 2009.
 (21) M. Kuramochi, G. Karypis. “An efficient algorithm for discovering frequent subgraphs”, IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1038–1051, 2004.
 (22) J. R. Ullmann, “An algorithm for subgraph isomorphism”, Journal of the ACM, vol. 23, no. 1, pp. 31–42, 1976.
 (23) L. Zhang, M. Song, N. Li, J. Bu, C. Chen, “Feature selection for fast speech emotion recognition”, ACM Multimedia, pp. 753–756, 2009.
 (24) N. Cristianini, B. Scholkopf, “Support vector machines and kernel methods: the new generation of learning machines”, AI Magazine, vol. 23, no. 3, pp. 31–41, 2002.
 (25) Y. Li, S. Gong, H. Liddell, “Kernel discriminant analysis”, ACM Trans. Program. Lang. Syst., vol. 15, no. 5, pp. 745–770, 1998.
 (26) H.M. Chen, C. Lin, S.Y. Chen, C.H. Wen, C.C. Chen, Y.C. Ouyang, C.I Chang, “PPI-SVM-Iterative FLDA Approach to Unsupervised Multispectral Image Classification”, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 4, pp. 1834–1842, 2013.
 (27) L. Zhang, Y. Gao, R. Ji, Q. Dai, X. Li, Actively Learning Human Gaze Shifting Paths for Photo Cropping, IEEE Transactions on Image Processing (TIP), 23(5), pages: 2235–2245, 2014.
 (28) L. Zhang, Y. Gao, R. Zimmermann, Q. Tian, X. Li, Fusion of Multi-Channel Local and Global Structural Cues for Photo Aesthetics Evaluation, IEEE Transactions on Image Processing (TIP), 23(3): 1419–1429, 2014.
 (29) L. Zhang, Y. Yang, C. Wang, X. Li, A Probabilistic Associative Model for Segmenting Weakly Supervised Images, IEEE Transactions on Image Processing (TIP), 2014 (accepted).
 (30) L. Zhang, R. Ji, Y. Xia, X. Li, Learning a Probabilistic Topology Discovering Model for Scene Categorization, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2014 (accepted).
 (31) X. Liu, M. Song, D. Tao, L. Zhang, J. Bu, C. Chen, Learning to Track Multiple Objects, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2014 (accepted).
 (32) L. Zhang, Y. Gao, Y. Xia, R. Ji, X. Li, Spatial-Aware Object-Level Saliency Prediction by Learning Graphlet Hierarchies, IEEE Transactions on Industrial Electronics (TIE), 2014 (accepted).
 (33) B. Luo, S. Jiang, L. Zhang, “Indexing of Remote Sensing Images With Different Resolutions by Multiple Features”, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 4, pp. 1899–1912, 2013.
 (34) A. Makarau, G. Palubinskas, P. Reinartz, “Alphabet-Based Multisensory Data Fusion and Classification Using Factor Graphs”, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 2, pp. 969–990, 2013.
 (35) J. Shi, J. Malik, “Normalized cuts and image segmentation”, IEEE Trans. on Pattern Analysis and Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
 (36) B. Yao, X. Yang, S.C. Zhu, “Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks”, EMMCVPR, 2007.
 (37) X. Xiong, K. L. Chan, “Towards an unsupervised optimal fuzzy clustering algorithm for image database organization”, in Proc. IEEE Conf. Pattern Recognit, pp. 3909, 2000.
 (38) L. Zhang, Y. Han, Y. Yang, M. Song, S. Yan, Q. Tian, Discovering Discriminative Graphlets for Aerial Image Categories Recognition, IEEE Transactions on Image Processing (TIP), 22(12):5071–5084, 2013.
 (39) L. Zhang, M. Song, Q. Zhao, X. Liu, J. Bu, C. Chen, Probabilistic Graphlet Transfer for Photo Cropping, IEEE Transactions on Image Processing (TIP), 21(5): 2887–2897, 2013.
 (40) T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, “Introduction to Algorithms”, MIT Press and McGraw-Hill, pp. 540–549, 2001.
 (41) J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, “Locality-constrained linear coding for image classification”, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3360–3367, 2010.
 (42) J. Yang, K. Yu, Y. Gong, T. Huang, “Linear spatial pyramid matching using sparse coding for image classification”, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2169–2178, 2009.
 (43) L.J. Li, H. Su, E. P. Xing, F.F. Li, Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification, in Proc. Adv. Neural Inf. Process. Syst, pp. 1378–1386, 2010.