Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities

05/03/2020
by   Gong Cheng, et al.
Remote sensing image scene classification, which aims at labeling remote sensing images with a set of semantic categories based on their contents, has broad applications in a range of fields. Propelled by the powerful feature learning capabilities of deep neural networks, remote sensing image scene classification driven by deep learning has drawn remarkable attention and achieved significant breakthroughs. However, to the best of our knowledge, a comprehensive review of recent achievements regarding deep learning for scene classification of remote sensing images is still lacking. Considering the rapid evolution of this field, this paper provides a systematic survey of deep learning methods for remote sensing image scene classification, covering more than 140 papers. Specifically, we discuss the main challenges of scene classification and survey (1) Autoencoder-based scene classification methods, (2) Convolutional Neural Network-based scene classification methods, and (3) Generative Adversarial Network-based scene classification methods. In addition, we introduce the benchmarks used for scene classification and summarize the performance of more than two dozen representative algorithms on three commonly used benchmark data sets. Finally, we discuss promising opportunities for further research.



I Introduction

Remote sensing images, a valuable data source for earth observation, help us measure and observe detailed structures on the Earth's surface. Thanks to advances in earth observation technology [45, 31], the volume of aerial and satellite images is growing drastically. This has lent particular urgency to the question of how to make full use of the ever-increasing volume of remote sensing images for intelligent earth observation [28, 55]. Hence, it is extremely important to understand huge and complex collections of satellite images.


Fig. 1: Illustration of remote sensing image scene classification.

As a key and challenging problem in the effective interpretation of aerial imagery, scene classification of remote sensing images has been an active research area. Remote sensing image scene classification aims to correctly label given scene images with predefined semantic categories, as shown in Fig. 1. Over the last few decades, extensive research on satellite image scene classification has been undertaken, driven by real-world applications such as urban planning [70, 104], remote sensing image retrieval [64, 121, 111], natural hazard detection [79, 10, 75], environment monitoring [47, 132], vegetation mapping [62, 83], and geospatial object detection [19, 63, 14, 15, 58, 57, 20, 11].

With the improvement of the spatial resolution of remote sensing images, satellite image classification has passed through three stages: pixel-level, object-level, and scene-level classification, as shown in Fig. 2. Specifically, in the early literature, researchers mainly focused on classifying satellite images at the pixel or subpixel level [51, 106, 107] by labeling each pixel with a semantic class, because the spatial resolution of aerial images was very low: the size of a pixel was similar to the sizes of the objects of interest [49]. However, with the advance of satellite imaging, the spatial resolution of satellite images has become increasingly finer than common objects of interest, such that single pixels lose their semantic meaning. In this case it is not feasible to recognize scene images at the pixel level alone, and per-pixel analysis began to be viewed with increasing dissatisfaction. In 2001, Blaschke and Strobl [4] questioned the dominance of the per-pixel paradigm and concluded that analyzing remote sensing images at the object level is more efficient than per-pixel analysis. They suggested that researchers pay attention to object-level analysis, where the term "object" refers to meaningful semantic entities or scene units. Subsequently, a series of approaches analyzing remote sensing images at the object level dominated satellite image analysis for the following two decades [5, 117, 6, 3]. Pixel-level and object-level classification algorithms have accomplished impressive results on certain specific land-use identification tasks. However, as the resolution of remote sensing images increases, a single image may contain many different and distinct object classes, and pixel-level and object-level methods are no longer sufficient to classify such images correctly. Under these circumstances, it is of considerable interest to understand the contents and meanings of satellite images as a whole, and a new paradigm of semantic-level analysis of remote sensing images has recently been suggested.


Fig. 2: The evolution of remote sensing image classification from pixel level to object level, and then to scene level.

Semantic-level aerial image scene classification, namely remote sensing image scene classification, seeks to correctly classify given satellite images into semantic classes by capitalizing on variations in the spatial arrangement and structural patterns of ground objects. Here the term "scene" denotes a local area cropped from a large-scale satellite image that contains clear semantic information about the Earth's surface [12, 43]. Scene-level classification yields a better understanding of satellite images than pixel-level and object-level classification.

Being able to represent visual data with discriminative features is a significant step in almost all computer vision tasks, and the remote sensing domain is no exception. During the previous decade, extensive efforts were devoted to developing discriminative visual features. A majority of early scene classification methods relied on human-engineered descriptors, e.g., Scale-Invariant Feature Transform (SIFT) [71], Texture Descriptors (TD) [35, 48, 86], Color Histogram (CH) [102], Histogram of Oriented Gradients (HOG) [22], and GIST [87]. Because they represent an entire image with a single feature vector, CH, GIST, and TD can be applied directly to aerial image scene classification. However, SIFT and HOG cannot represent an entire image directly because of their local character. To make handcrafted local descriptors represent an entire scene image, they are aggregated by feature encoding methods, e.g., the Improved Fisher Kernel (IFK) [92], the Vector of Locally Aggregated Descriptors (VLAD) [50], Spatial Pyramid Matching (SPM) [53], and the popular Bag-Of-Visual-Words (BoVW) [118]. Thanks to their simplicity and efficiency, these feature encoding methods have been broadly applied to aerial image scene classification [119, 96, 84, 136, 134, 144], although the representation capability of handcrafted features is limited.

Against this backdrop, unsupervised learning methods such as k-means clustering, Principal Component Analysis (PCA) [112], and sparse coding [88], which automatically learn features from unlabeled images, became an appealing alternative to human-engineered features. A considerable number of unsupervised learning-based scene classification methods emerged [21, 81, 93, 97, 139, 141, 74, 26, 94] and made substantial progress in scene classification. Nevertheless, these unsupervised learning approaches cannot make full use of class information in the data.
Fortunately, owing to advances in deep learning theory and the increased availability of remote sensing data and parallel computing resources, deep learning-based algorithms have increasingly prevailed in the area of aerial image scene classification. In 2006, Hinton and Salakhutdinov [42] created an approach to initialize the weights for training multilayer neural networks, which laid a solid foundation for the later development of deep learning. Between 2006 and 2012, simple deep learning models were developed (e.g., deep belief nets [41], the autoencoder [42], and the stacked autoencoder [108]).


Fig. 3: The number of publications in remote sensing image scene classification from 2012 to 2019. Data from a Google Scholar advanced search: allintitle:("remote sensing" or "aerial" or "satellite" or "land use") and "scene classification".

The feature description capabilities of these simple deep learning models have been demonstrated in many fields, including satellite image scene classification. Since AlexNet, a deep Convolutional Neural Network (CNN) designed by Krizhevsky et al. [52], obtained the best results in the 2012 Large-Scale Visual Recognition Challenge (LSVRC) [23], a great many advanced deep CNNs have come forth and broken records in many fields. In the wake of these successes, CNN-based methods have emerged in satellite image scene classification [82, 17, 110] and achieved state-of-the-art classification accuracy.

Nevertheless, CNN-based methods generally demand massive amounts of annotated training data, which greatly limits their application scenarios. More recently, Generative Adversarial Networks (GANs) [32], a promising unsupervised learning method, have achieved significant success in many applications. To remedy the above-mentioned limitation, GANs have begun to be employed by some researchers in the field of aerial scene classification [25, 65].
Currently, driven by deep learning, a great number of aerial image scene classification methods have sprung up (see Fig. 3). The number of papers on satellite image scene classification increased dramatically after 2014 and again after 2017, for two reasons. On the one hand, around 2014, deep learning techniques began to be applied to remote sensing data analysis. On the other hand, in 2017, large-scale satellite image scene classification benchmarks appeared, which greatly facilitated the development of deep learning-based scene classification.

In the past several years, numerous reviews of remote sensing image classification methods have been published, which are summarized in Table I. For example, Tuia et al. [107] surveyed, tested, and compared three active learning-based aerial image scene classification approaches: committee-based, large-margin, and posterior-probability methods. Gómez-Chova et al. [31] surveyed multimodal remote sensing image classification and summarized the leading algorithms in this field. In [80], Maulik et al. conducted a review of aerial image scene classification algorithms based on the support vector machine (SVM). Li et al. [59] surveyed pixel-level, subpixel-level, and object-based image classification methods and emphasized the contribution of spatio-contextual information to aerial scene classification.

As an alternative way to extract robust, abstract, and high-level features from images, deep learning models have made remarkable progress on a broad range of image, video, speech, and audio processing tasks. Following this, a large number of deep learning-based scene classification algorithms were proposed, such as CNN-based and GAN-based methods, and a number of reviews of scene classification approaches have been published. Penatti et al. [91] assessed the generalization ability of pre-trained CNNs in the classification of aerial images. In [43], Hu et al. investigated how to apply CNNs trained on the ImageNet data set to aerial image scene classification. Zhu et al. [146] presented a tutorial on deep learning-based remote sensing data analysis. In order to make full use of pre-trained CNNs, Nogueira et al. [85] analyzed the performance of CNNs for satellite image scene classification under different learning strategies: full training, fine-tuning, and using CNNs as feature extractors. In [131], Zhang et al. reviewed recent deep learning-based remote sensing data analysis. Considering the limited number of scene categories and the accuracy saturation of existing scene classification data sets, Cheng et al. [13] released a large-scale scene classification benchmark named NWPU-RESISC45 and surveyed advances in aerial image scene classification before 2017. In [113], Xia et al. proposed a new benchmark, called AID, for aerial image classification and reviewed the scene classification methods existing before 2017. Ma et al. [77] provided a review of the applications of deep learning in aerial image analysis. In addition, there have been several hyperspectral image classification surveys [29, 37, 60].

However, a thorough survey of deep learning for scene classification is still lacking. This motivates us to analyze in depth the main challenges faced by aerial scene classification, systematically review deep learning-based scene classification approaches, most of which were published during the last five years, introduce the mainstream scene classification benchmarks, and discuss several promising future directions of scene classification.

The remainder of this paper is organized as follows. Section II discusses the current main challenges of remote sensing image scene classification. A brief review of deep learning models and a comprehensive survey of deep learning-based scene classification methods are provided in Section III. The scene classification data sets are introduced in Section IV. In Section V, the performance of deep learning-based scene classification methods on three widely used benchmarks is compared and discussed. In Section VI, we discuss promising future directions of scene classification. Finally, we conclude this paper in Section VII.

II Main Challenges of Remote Sensing Image Scene Classification

The ideal goal of scene classification of satellite images is to correctly label given scene images with their corresponding semantic classes according to their contents, for example, categorizing an urban remote sensing image as a residential, commercial, or industrial area. Generally speaking, a remote sensing image contains a variety of ground objects; for instance, roads, trees, and buildings may all be included in an industrial scene. Different from object-oriented classification, scene classification is a considerably challenging problem because of the variance and complex spatial distributions of the ground objects existing in the scenes. Historically, extensive studies of satellite image scene classification have been made, yet no algorithm has achieved the goal of classifying aerial image scenes with fully satisfactory accuracy. The challenges of scene classification include (1) large intraclass diversity, (2) high interclass similarity (also known as low between-class separability), and (3) large variance of object/scene scales, as shown in Fig. 4.

No. Survey Title Year Publication Content
1 A survey of active learning algorithms for supervised remote sensing image classification [107] 2011 IEEE JSTSP Surveying and testing the main families of active learning methods
2 A review of remote sensing image classification techniques: the role of spatio-contextual information [59] 2014 EuJRS Review of pixel-wise, subpixel-wise and object-based methods for remote sensing image classification and exploring the contribution of spatio-contextual information to scene classification
3 Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery [43] 2015 Remote Sensing Investigating how to exploit pre-trained CNNs for aerial image scene classification
4 Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? [91] 2015 CVPR Workshop Evaluating the generalization ability of pre-trained CNNs in remote sensing image classification
5 Multimodal classification of remote sensing images: a review and future directions [31] 2015 Proceedings of the IEEE Offering a taxonomical view of the field of multimodal remote sensing image classification
6 Deep learning for remote sensing data: A technical tutorial on the state of the art [131] 2016 IEEE GRSM Reviewing deep learning-based remote sensing data analysis techniques before 2016
7 Deep learning in remote sensing: A comprehensive review and list of resources [146] 2017 IEEE GRSM Reviewing the progress of deep learning-based remote sensing data analysis before 2017
8 Advanced spectral classifiers for hyperspectral images: A review [29] 2017 IEEE GRSM Review and comparison of different supervised hyperspectral classification methods
9 Remote sensing image classification: a survey of support-vector-machine-based advanced techniques [80] 2017 IEEE GRSM Review of remote sensing image classification based on SVM
10 Towards better exploiting convolutional neural networks for remote sensing scene classification [85] 2017 Pattern Recognition Analyzing the performance of three CNN-based scene classification strategies
11 AID: a benchmark data set for performance evaluation of aerial scene classification [113] 2017 IEEE TGRS Review of aerial image scene classification methods before 2017 and proposing the AID data set
12 Remote sensing image scene classification: benchmark and state of the art [13] 2017 Proceedings of the IEEE Reviewing the progress of scene classification of remote sensing images before 2017 and proposing the NWPU-RESISC45 data set
13 Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines [37] 2017 IEEE TGRS Survey of the progress in the classification of spectral–spatial hyperspectral images
14 Deep learning for hyperspectral image classification: An overview [60] 2019 IEEE TGRS Review of hyperspectral image classification based on deep learning
15 Deep learning in remote sensing applications: A meta-analysis and review [77] 2019 ISPRS JPRS Providing a review of the applications of deep learning in aerial image analysis
16 Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities 2020 IEEE JSTARS A systematic review of recent advances in remote sensing image scene classification driven by deep learning
TABLE I: Summary of surveys of remote sensing image analysis.

In terms of within-class diversity, the challenge mainly stems from the large variations in the appearance of ground objects within the same semantic class. Ground objects commonly vary in style, shape, scale, and distribution, which makes it difficult to classify scene images correctly. For example, in Fig. 4 (a), the churches appear in different building styles, and the airports and railway stations appear in different shapes. In addition, when airborne or space platforms capture remote sensing images, there may be large differences in color and radiation intensity within the same semantic class on account of imaging conditions, which can be influenced by factors such as weather, cloud, and mist. Variations in scene illumination may also cause within-class diversity; for example, the appearance of scenes labeled "beach" shows large differences under different imaging conditions, as shown in Fig. 4 (a).


Fig. 4: Challenges of remote sensing image scene classification which include (a) big within-class diversity, (b) high between-class similarity (also known as low between-class separability), and (c) large variance of object/scene scales. These images are from the NWPU-RESISC45 data set [13].

For between-class similarity, the challenge is chiefly caused by the presence of the same objects within different scene classes or by high semantic overlap between scene categories. For instance, in Fig. 4 (b), the scene classes of bridge and overpass both contain the same ground object, namely a bridge, and basketball courts and tennis courts share much semantic information. Moreover, the ambiguous definition of scene classes reduces interclass dissimilarity, and some complex scenes are also similar to each other in their visual content. As a result, it can be extremely difficult to distinguish these scene classes.
The large variance of object/scene scales is also a non-negligible challenge for aerial image scene classification. In satellite imaging, sensors operate in orbits at various altitudes, from a few hundred kilometers to more than ten thousand kilometers, which leads to imaging-altitude variation. As the examples in Fig. 4 (c) illustrate, the scenes of airplane, storage tank, and thermal power station show huge scale differences under different imaging altitudes. In addition, because of some intrinsic factors, variations in size also exist within each object/scene category; for example, the rivers shown in Fig. 4 (c) are presented in several different sub-scenes: stream, brook, and creek.

III Survey on Deep Learning-Based Remote Sensing Image Scene Classification Methods

In the past decades, driven by its wide applications, many researchers have committed to scene classification of remote sensing images, and a number of advanced scene classification systems and approaches have been proposed, especially driven by deep learning. Before deep learning came to the attention of this field, scene classification methods mainly relied on handcrafted features (e.g., Color Histogram (CH), texture descriptors (TD), GIST) or on representations generated by encoding local features via BoVW, IFK, SPM, etc. Later, considering that handcrafted features extract only low-level information, many researchers turned to unsupervised learning methods (e.g., sparse coding, PCA, and k-means). By automatically learning discriminative features from unlabeled data, unsupervised learning-based methods have obtained good results in the scene classification of aerial images. Yet, unsupervised learning-based algorithms do not adequately exploit class information in the data, which limits their ability to discriminate between different scene classes. Now, thanks to the availability of enormous amounts of labeled data, advances in machine learning theory, and the increased availability of computational resources, deep learning models (e.g., autoencoders, CNNs, and GANs) have shown powerful abilities to learn rich features and have permeated many research fields, including aerial image scene classification. Numerous deep learning-based scene classification algorithms have emerged and have yielded the best classification accuracy to date. In this section, we systematically survey the milestone algorithms for scene classification since deep learning permeated the field of remote sensing; that was one small step for deep learning theory, but one giant leap for the scene classification of satellite images [128]. From autoencoders, to CNNs, and then to GANs, deep learning algorithms have constantly updated scene classification records. To sum up, most deep learning-based scene classification algorithms can be broadly divided into three main categories: autoencoder-based methods, CNN-based methods, and GAN-based methods. In what follows, we discuss the three categories of methods at length.

Fig. 5: The architectures of (a) autoencoder and (b) stacked autoencoder.

III-A Autoencoder-Based Remote Sensing Image Scene Classification

1) Brief introduction of autoencoder

Autoencoder [42] is an unsupervised feature learning model consisting of a shallow, symmetrical neural network (see Fig. 5 (a)). It contains two units: an encoder and a decoder. The encoding process can be formulated as equation (1), where $h$ is the output of the latent layer, $f(\cdot)$ denotes a nonlinear mapping, $W_1$ stands for the encoding weight matrix, $x$ denotes the input of the autoencoder, and $b_1$ is the bias vector. Generally, the dimension of $h$ is smaller than that of $x$. Decoding is the inverse of encoding and can be formulated as equation (2), where $\hat{x}$ represents the reconstructed output, $W_2$ denotes the decoding weight matrix, and $b_2$ stands for the bias vector.

$$h = f(W_1 x + b_1) \tag{1}$$

$$\hat{x} = f(W_2 h + b_2) \tag{2}$$

The autoencoder is able to compress high-dimensional features by minimizing a cost function that usually consists of a reconstruction error term and a regularization term. Using gradient descent with backpropagation, the autoencoder can learn the network parameters. In real applications, multilayer stacked autoencoders are used for feature learning (see Fig. 5 (b)). The key to training stacked autoencoders is how to initialize the network: the way the parameters are initialized influences network convergence, especially in the early layers, as well as the stability of training. Fortunately, Hinton et al. [42] provided a good solution for initializing the network weights by using restricted Boltzmann machines.
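To make the encode, decode, and reconstruct cycle of equations (1) and (2) concrete, the following is a minimal sketch of a single-hidden-layer autoencoder in PyTorch. The layer sizes, activation choice, learning rate, and random toy data are illustrative assumptions, not values taken from any of the surveyed methods.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Single-hidden-layer autoencoder: h = f(W1 x + b1), x_hat = f(W2 h + b2)."""
    def __init__(self, input_dim=4096, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)       # equation (1)
        x_hat = self.decoder(h)   # equation (2)
        return x_hat

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()          # the reconstruction error term of the cost function

x = torch.rand(32, 4096)          # toy batch of flattened, unlabeled image patches
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), x) # minimize reconstruction error
    loss.backward()               # gradient descent with backpropagation
    optimizer.step()
```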


Fig. 6: The architecture of CNNs.

2) Autoencoder-based scene classification methods

The autoencoder is able to automatically learn mid-level visual representations from unlabeled data. Mid-level features played an important role in aerial image scene classification before deep learning took off in the remote sensing community. Zhang et al. [129] introduced the sparse autoencoder to scene classification. Cheng et al. [cheng2015learning] used a single-hidden-layer neural network and an autoencoder to train more effective sparselets [30] for efficient scene classification and object detection. In [89], Othman et al. proposed an aerial image scene classification algorithm that relies on convolutional features and a sparse autoencoder. Han et al. [34] provided a scene classification method based on a hierarchical convolutional sparse autoencoder. Cheng et al. [18] demonstrated that the mid-level visual features learned by an autoencoder-based method are discriminative and facilitate scene classification tasks. In light of the limited feature representation of a single autoencoder, some researchers stacked multiple autoencoders together. Du et al. [24] came up with stacked convolutional denoising autoencoder networks; extensive experiments showed that their framework achieves superior classification performance. Yao et al. [120] integrated pairwise constraints into a stacked sparse autoencoder to learn more discriminative features for land-use scene classification and semantic annotation tasks.

The autoencoder and the algorithms derived from it are unsupervised learning methods and have obtained good results in scene classification of remote sensing images. However, most of the above-mentioned autoencoder-based methods cannot learn the most discriminative features for distinguishing scene classes because they do not fully exploit scene class information.


Fig. 7: The architecture of GANs.

III-B CNN-Based Remote Sensing Image Scene Classification

1) Brief introduction of CNN

CNNs have shown powerful feature learning ability in the visual domain. Since Krizhevsky et al. proposed AlexNet [52] in 2012, a deep CNN that obtained the best accuracy in the LSVRC, an array of advanced CNN models has appeared, such as VGGNet [98], GoogLeNet [103], ResNet [36], DenseNet [46], SENet [44], and SKNet [61]. CNNs are multi-layer networks with learning ability that consist of convolutional layers, pooling layers, and fully connected layers (see Fig. 6).

(1) Convolutional layers

Convolutional layers play an important role in extracting features from images. The input $X$ of a convolutional layer consists of $C_1$ two-dimensional feature maps of size $H_1 \times W_1$. The output $Y$ consists of $C_2$ two-dimensional feature maps of size $H_2 \times W_2$, produced via convolutional kernels $K$, which are trainable filters of size $k \times k$ (typically 1, 3, or 5). The entire process of convolution is described by equation (3), where $*$ denotes the two-dimensional convolution operation and $b_j$ denotes the bias term of the $j$-th output map. In general, a nonlinear activation function $f(\cdot)$ is applied after the convolution operation. As the convolutional structure deepens, the convolutional layers can capture features of different levels (e.g., edges, lines, corners, structures, and shapes) from the input feature maps.

$$Y_j = f\Big(\sum_{i=1}^{C_1} K_{ij} * X_i + b_j\Big) \tag{3}$$

(2) Pooling layers

Pooling layers execute a max or average operation over a small area of each input feature map, which can be defined as equation (4), where $\mathrm{pool}(\cdot)$ represents the pooling function (e.g., average pooling, max pooling, or stochastic pooling), and $X$ and $Y$ denote the input and output of the pooling layer, respectively. Usually, pooling layers are applied between two successive convolutional layers. The pooling operation creates invariance to small shifts and distortions. In object detection and scene classification tasks, this invariance provided by pooling layers is very important.

$$Y = \mathrm{pool}(X) \tag{4}$$

(3) Fully connected layers

Fully connected layers usually appear at the top of CNNs, where they summarize the features extracted by the layers below. A fully connected layer processes its input $x$ with a linear transformation given by a weight matrix $W$ and a bias $b$, and then maps the result through a nonlinear activation function $f(\cdot)$. The entire process can be formulated as equation (5). In classification tasks, a softmax layer is generally connected to the last fully connected layer to output the probability of each class. The dropout method [81] is applied to the fully connected layers to avoid overfitting, because a fully connected layer usually contains a large number of parameters.

$$y = f(Wx + b) \tag{5}$$
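Putting the three layer types together, the sketch below assembles a toy CNN classifier in PyTorch. The channel counts, kernel sizes, input resolution, and the 45-class output (matching NWPU-RESISC45) are illustrative assumptions, not a specific architecture from the surveyed literature.

```python
import torch
import torch.nn as nn

class ToySceneCNN(nn.Module):
    """Convolution -> pooling blocks followed by a fully connected classifier."""
    def __init__(self, num_classes=45):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # equation (3)
            nn.ReLU(),                                   # nonlinear activation f
            nn.MaxPool2d(2),                             # equation (4)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                             # dropout against overfitting
            nn.Linear(64 * 64 * 64, num_classes),        # equation (5); softmax is
        )                                                # folded into the loss below

    def forward(self, x):
        return self.classifier(self.features(x))

model = ToySceneCNN()
x = torch.rand(8, 3, 256, 256)     # toy batch of 256x256 RGB scene images
logits = model(x)                  # shape: (8, 45)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 45, (8,)))
```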

2) CNN-based scene classification methods

In the wake of CNNs being successfully applied to large-scale visual classification tasks, the use of CNNs finally took off in the aerial image analysis field around 2015 [131, 146]. Compared with traditional advanced methods, e.g., SIFT [71], HOG [22], and BoVW [118], CNNs have the advantage of end-to-end feature learning and can extract high-level visual features that handcrafted feature-based methods cannot. Using different strategies for exploiting CNNs, a variety of CNN-based scene classification methods [130, 140, 16, 123, 67, 142, 17] have emerged. Penatti et al. [91] introduced CNNs into satellite image scene classification in 2015 and evaluated the generalization capability of off-the-shelf CNNs in the classification of remote sensing images; their experiments showed that CNNs can obtain better results than low-level descriptors. Later, Hu et al. [43] treated CNNs as feature extractors and investigated how to make full use of pre-trained CNNs for scene classification. In [78], Marmanis et al. introduced a two-stage CNN scene classification framework that used pre-trained CNNs to derive a set of representations from images, which were then fed into shallow CNN classifiers. Chaib et al. [8] fused deep features extracted with VGGNet to enhance scene classification performance. In [56], Li et al. fused pre-trained CNN features; the fused features show better discrimination than raw CNN features in scene classification. Cheng et al. [16] designed the Bag of Convolutional Features (BoCF) for aerial image scene classification by using off-the-shelf CNN features to replace traditional local descriptors such as SIFT. Yuan et al. [125] rearranged the local features extracted by an already trained VGG19Net for aerial image scene classification. In [38], He et al. proposed a multilayer stacked covariance pooling algorithm (MSCP) for satellite image scene classification, which automatically combines multilayer feature maps extracted from a pre-trained CNN. Lu et al. [73] introduced a feature aggregation CNN (FACNN) for scene classification, which learns scene representations by exploring semantic label information. These methods all use pre-trained CNNs as feature extractors and then fuse or combine the extracted features. It is worth noting that the strategy of using off-the-shelf CNNs as feature extractors is simple and effective on small-scale data sets.
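As a hedged illustration of the feature-extractor strategy just described, the snippet below uses a pre-trained CNN from torchvision as a frozen feature extractor and trains only a simple classifier on top. The choice of ResNet-50, a recent torchvision weights API, and the scikit-learn classifier are assumptions for the sketch, not the specific pipelines of the cited papers.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

# Pre-trained CNN with its classification head removed -> frozen feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()        # keep the 2048-d globally pooled features
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: float tensor of shape (N, 3, 224, 224), ImageNet-normalized."""
    return backbone(images).numpy()

# Toy stand-ins for a labeled scene data set (replace with real images/labels).
train_imgs = torch.rand(64, 3, 224, 224)
train_labels = torch.randint(0, 21, (64,))
clf = LogisticRegression(max_iter=1000).fit(extract_features(train_imgs),
                                            train_labels.numpy())
```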
However, when the amount of training data is not adequate to train a new CNN from scratch, fine-tuning an already trained CNN on the target data set is a good choice; a sketch follows after this paragraph. Castelluccio et al. [7] delved into the use of CNNs for aerial image scene classification by experimenting with three learning approaches: using pre-trained CNNs as feature extractors, fine-tuning, and training from scratch. They concluded that fine-tuning gives better results than full training when the data sets are small. This made researchers interested in fine-tuning scene classification networks or optimizing their loss functions. Cheng et al. [17] designed a novel objective function for learning discriminative CNNs (D-CNNs), which show better discriminability in scene classification. In [69], Liu et al. coupled a CNN with a hierarchical Wasserstein loss function (HW-CNNs) to improve its discriminative ability. Minetto et al. [82] devised a new satellite image scene classification framework, named Hydra, which is an ensemble of CNNs and achieves the best results on the NWPU-RESISC45 data set. Wang et al. [110] introduced an attention mechanism into CNNs and designed ARCNet (attention recurrent convolutional network) for scene classification, which is capable of highlighting key areas and discarding noncritical information. In [68], to handle the problem of object scale variation in scene classification, Liu et al. formulated the multiscale CNN (MCNN). Fang et al. [27] designed a robust space-frequency joint representation (RSFJR) for scene classification by adding a frequency-domain branch to CNNs; by fusing features from the space and frequency domains, the method provides more discriminative feature representations. Xie et al. [115] designed a scale-free CNN (SF-CNN) for scene classification, which can accept images of arbitrary size as input without any resizing operation. Sun et al. [100] proposed a gated bidirectional network (GBN) for scene classification, which can filter out interfering information and aggregate the interdependent information among different CNN layers. In the above-mentioned methods, CNNs learn discriminative features and obtain better performance through adjustments to their structures, optimization of their objective functions, or fine-tuning of the modified CNNs on the target data sets. In [9], Chen et al. introduced knowledge distillation into scene classification to boost the performance of lightweight CNNs. Zhang et al. [127] presented a lightweight and effective CNN that introduces dilated convolution and channel attention into MobileNetV2 [95] for scene classification.
In addition, it is of considerable interest to design more effective and robust CNNs for scene classification. In [130], Zhang et al. presented a gradient boosting random convolutional network (GBRCN) for scene classification that assembles different deep neural networks.


These CNN-based methods have obtained astonishing scene classification results. However, they generally require numerous annotated samples to fine-tune already trained CNNs or to train a network from scratch.

III-C GAN-Based Remote Sensing Image Scene Classification

1) Brief introduction of GAN

Generative Adversarial Network (GAN) [32] is another important and promising machine learning method. As its name implies, a GAN models the distribution of data via adversarial learning, based on a minimax two-player game, and generates realistic data. GANs contain a pair of components: a discriminator $D$ and a generator $G$. As shown in Fig. 7, $G$ is analogous to a group of counterfeiters whose role is to produce fake currency, while $D$ can be thought of as the police, who determine whether a given note was made by $G$ or by the bank. $G$ and $D$ constantly pit against each other in this game until $D$ cannot distinguish the counterfeit currency from the genuine article. GANs treat the competition between $G$ and $D$ as the sole training criterion. $G$ takes an input $z$, a latent variable obeying a prior distribution $p_z(z)$, and maps it into the data space via a differentiable function $G(z; \theta_g)$, where $\theta_g$ denotes the generator's parameters. $D$ outputs the probability that its input comes from the real data rather than from the generator, through a mapping $D(x; \theta_d)$, where $\theta_d$ denotes the discriminator's parameters. The entire two-player minimax game is described by equation (6), where $p_{data}(x)$ is the distribution of the real data and $V(D, G)$ is the objective function. From $D$'s perspective, given input data generated by $G$, $D$ aims to minimize its output, while if a sample comes from the real data, $D$ aims to maximize its output; this is why the term $\log(1 - D(G(z)))$ appears in equation (6). Meanwhile, to fool $D$, $G$ strives to maximize $D$'s output when generated data are fed to $D$. Thus the adversarial relationship in which $D$ seeks to maximize $V(D, G)$ and $G$ struggles to minimize it is formed.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{6}$$
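The following is a minimal sketch of the alternating minimax training of equation (6) in PyTorch, using the common non-saturating convention for the generator step. The network sizes, learning rates, and random stand-in data are toy assumptions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, data_dim)      # stand-in for flattened real image patches
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

for _ in range(100):
    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(32, latent_dim)
    d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: the non-saturating variant maximizes log D(G(z)).
    z = torch.randn(32, latent_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```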
Data set Image number per class Number of scene classes Total image number Image size Year
UC Merced [118] 100 21 2100 256×256 2010
WHU-RS19 [114] 50–61 19 1005 600×600 2012
SIRI-WHU [135] 200 12 2400 200×200 2016
RSSCN7 [147] 400 7 2800 400×400 2015
RSC11 [137] about 100 11 1232 512×512 2016
Brazilian Coffee Scene [91] 1438 2 2876 64×64 2015
AID [113] 220–420 30 10000 600×600 2017
NWPU-RESISC45 [13] 700 45 31500 256×256 2017
OPTIMAL-31 [110] 60 31 1860 256×256 2018
EuroSAT [40] 2000–3000 10 27000 64×64 2019
TABLE II: 10 publicly available data sets for remote sensing image scene classification.

Fig. 8: Some example images from the UC-Merced data set.

Fig. 9: Some example images from the AID data set.

Fig. 10: Some example images from the NWPU-RESISC45 data set.

2) GAN-based scene classification methods

As a key method for unsupervised learning, since their introduction by Goodfellow et al. [32] in 2014, GANs have gradually been applied to many tasks such as image-to-image translation, sample generation, and image super-resolution. Facing the tremendous volume of remote sensing images, CNN-based methods need massive labeled samples to train their models, but annotating samples is labor-intensive. Some researchers therefore began to apply GANs to scene classification. In 2017, Lin et al. [65] proposed a multiple-layer feature-matching generative adversarial network (MARTA GANs) for scene classification. Duan et al. [25] used an adversarial net to assist in mining inherent and discriminative features from aerial images; the mined features are able to enhance classification accuracy. Bashmal et al. [2] provided a GAN-based method, called Siamese-GAN, to handle aerial vehicle image classification under cross-domain conditions. In [116], to generate high-quality satellite images for scene classification, Xu et al. added scaled exponential linear units to GANs. Ma et al. [76] designed SiftingGAN, which can generate a large variety of authentic annotated samples for scene classification. Teng et al. [105] presented a classifier-constrained adversarial network for cross-domain semi-supervised scene classification. Han et al. [33] introduced a generative framework, named SSGF, for scene classification. Yu et al. [124] devised an attention GAN for scene classification, which achieves better performance by enhancing the representation power of the discriminator.

In the area of aerial image classification, most GAN-based methods use GANs either to generate samples or to learn features by training networks in an adversarial manner. Compared with CNN-based scene classification methods, only a small number of studies have been reported so far, but the powerful self-supervised feature learning capacity of GANs makes them a promising direction for scene classification.

IV Survey on Remote Sensing Image Scene Classification Benchmarks

Data sets play an irreplaceable role in the advancement of scene classification and are crucial for developing and evaluating scene classification methods. As the number of high-resolution satellites increases, access to massive high-resolution satellite imagery has made it possible to build large-scale scene classification benchmarks. In the past few years, researchers from different groups have proposed several publicly available high-resolution benchmark data sets for scene classification of aerial images [118, 91, 13, 113, 114, 147, 137, 135, 40, 110] to push this field forward. Starting with the UC-Merced data set [118], representative data sets include WHU-RS19 [114], Brazilian Coffee Scene [91], RSSCN7 [147], RSC11 [137], SIRI-WHU [135], AID [113], NWPU-RESISC45 [13], OPTIMAL-31 [110], and EuroSAT [40]. The features of these data sets are listed in Table II. Among them, the UC-Merced data set [118], the AID data set [113], and the NWPU-RESISC45 data set [13] are the three most commonly used benchmarks, which are introduced in detail below.

IV-A UC-Merced Data Set

The UC-Merced data set [118] (http://weegee.vision.ucmerced.edu/datasets/form.html) was released in 2010 and contains 21 scene classes, each consisting of 100 land-use images. In total, the data set comprises 2100 scene images with a pixel resolution of 0.3 m. These images were obtained from the United States Geological Survey National Map of 21 U.S. regions and are fixed at 256×256 pixels. Fig. 8 shows samples of each category from the data set. The data set continues to be broadly employed for scene classification. When conducting algorithm evaluation, the two widely used training ratios are 50% and 80%, with the remaining 50% and 20% used for testing.

Iv-B AID Data Set

The AID data set [113] (www.lmars.whu.edu.cn/xia/AID-project.html) is a relatively large-scale data set for aerial scene classification. It was published in 2017 by Wuhan University and consists of 30 scene classes. Each scene class contains 220 to 420 images, which were cropped from Google Earth imagery and are fixed at 600×600 pixels. In total, the data set comprises 10000 scene images. Fig. 9 shows samples of each category from the data set. Different from the UC-Merced data set, the AID data set is multi-sourced, because its aerial images were captured by different satellites. Moreover, the data set is multi-resolution: the pixel resolution varies from about 8 m to about 0.5 m across scene categories. When conducting algorithm evaluation, the two widely used training ratios are 20% and 50%, with the remaining 80% and 50% used for testing.

Iv-C NWPU-RESISC45 Data Set

To the best of our knowledge, the NWPU-RESISC45 data set [13] (http://www.escience.cn/people/gongcheng/NWPU-RESISC45.html), released by Northwestern Polytechnical University, is currently the largest scene classification data set. It consists of 45 scene categories, each containing 700 images obtained from Google Earth and fixed at 256×256 pixels. In total, the data set comprises 31500 scene images, collected from more than 100 countries and regions. Apart from some classes with lower spatial resolution (e.g., island, lake, mountain, and iceberg), the pixel resolution of most scene categories varies from about 30 m to 0.2 m. Fig. 10 shows samples of each category from the data set. The release of the NWPU-RESISC45 data set has allowed deep learning models to develop their full potential. When conducting algorithm evaluation, the two widely used training ratios are 10% and 20%, with the remaining 90% and 80% used for testing.

Method Year Publication Training ratio
50% 80%
Autoencoder-based SGUFL [129] 2014 IEEE TGRS - 82.72±1.18
partlets-based method [12] 2015 IEEE TGRS 88.76±0.79 -
SCDAE [24] 2016 IEEE TCYB - 93.7±1.3
CNN-based GBRCN [130] 2015 IEEE TGRS - 94.53
LPCNN [140] 2016 JARS - 89.90
Fusion by Addition [8] 2017 IEEE TGRS - 97.42±1.79
ARCNet-VGG16 [110] 2018 IEEE TGRS 96.81±0.14 99.12±0.40
MSCP [38] 2018 IEEE TGRS - 98.36±0.58
D-CNNs [17] 2018 IEEE TGRS - 98.93±0.10
MCNN [68] 2018 IEEE TGRS - 96.66±0.90
ADSSM [143] 2018 IEEE TGRS - 99.76±0.24
FACNN [73] 2019 IEEE TGRS - 98.81±0.24
SF-CNN [115] 2019 IEEE TGRS - 99.05±0.27
RSFJR [27] 2019 IEEE TGRS 97.21±0.65 -
GBN [100] 2019 IEEE TGRS 97.05±0.19 98.57±0.48
ADFF [145] 2019 Remote Sensing 96.05±0.56 97.53±0.63
CNN-CapsNet [133] 2019 Remote Sensing 97.59±0.16 99.05±0.24
Siamese ResNet50 [66] 2019 IEEE GRSL 90.95 94.29
GAN-based MARTA GANs [65] 2017 IEEE GRSL 85.5±0.69 94.86±0.80
Attention GANs [124] 2019 IEEE TGRS 89.06±0.50 97.69±0.69
TABLE III: Overall accuracy (%) comparison of 20 scene classification methods on the UC-Merced data set.
Method Year Publication Training ratio
20% 50%
CNN-based Fusion by Addition [8] 2017 IEEE TGRS - 91.87±0.36
ARCNet-VGG16 [110] 2018 IEEE TGRS 88.75±0.40 93.10±0.55
MSCP [38] 2018 IEEE TGRS 91.52±0.21 94.42±0.17
D-CNNs [17] 2018 IEEE TGRS 90.82±0.16 96.89±0.10
MCNN [68] 2018 IEEE TGRS - 91.80±0.22
HW-CNNs [69] 2018 IEEE TGRS - 96.98±0.33
FACNN [73] 2019 IEEE TGRS - 95.45±0.11
SF-CNN [115] 2019 IEEE TGRS 93.60±0.12 96.66±0.11
RSFJR [27] 2019 IEEE TGRS - 96.81±1.36
GBN [100] 2019 IEEE TGRS 92.20±0.23 95.48±0.12
ADFF [145] 2019 Remote Sensing 93.68±0.29 94.75±0.25
CNN-CapsNet [133] 2019 Remote Sensing 93.79±0.13 96.32±0.12
GAN-based MARTA GANs [65] 2017 IEEE GRSL 75.39±0.49 81.57±0.33
Attention GANs [124] 2019 IEEE TGRS 78.95±0.23 84.52±0.18
TABLE IV: Overall accuracy (%) comparison of 14 scene classification methods on the AID data set.

V Performance Comparison and Discussion

V-A Performance Comparison

In recent years, a variety of scene classification algorithms have been published. Here, 24 deep learning-based scene classification methods are selected for performance comparison on three widely used benchmark data sets: 3 autoencoder-based methods, 19 CNN-based methods, and 2 GAN-based methods.

Method Year Publication Training ratio
10% 20%
CNN-based BoCF [16] 2017 IEEE GRSL 82.65±0.31 84.32±0.17
MSCP [38] 2018 IEEE TGRS 88.07±0.18 90.81±0.13
D-CNNs [17] 2018 IEEE TGRS 89.22±0.50 91.89±0.22
HW-CNNs [69] 2018 IEEE TGRS - 94.38±0.17
IORN [109] 2018 IEEE GRSL 87.83±0.16 91.30±0.17
ADSSM [143] 2018 IEEE TGRS 91.69±0.22 94.29±0.14
SF-CNN [115] 2019 IEEE TGRS 89.89±0.16 92.55±0.14
ADFF [145] 2019 Remote Sensing 90.58±0.19 91.91±0.23
CNN-CapsNet [133] 2019 Remote Sensing 89.03±0.21 89.03±0.21
Hydra [82] 2019 IEEE TGRS 92.44±0.34 94.51±0.21
Siamese ResNet50 [66] 2019 IEEE GRSL - 92.28
GAN-based MARTA GANs [65] 2017 IEEE GRSL 68.63±0.22 75.03±0.28
Attention GANs [124] 2019 IEEE TGRS 72.21±0.21 77.99±0.19
TABLE V: Overall accuracy (%) comparison of 13 scene classification methods on the NWPU-RESISC45 data set.

Tables III, IV, and V report the classification accuracy of deep learning-based scene classification methods on the UC-Merced, AID, and NWPU-RESISC45 data sets, respectively, measured in terms of overall accuracy (OA). OA is a commonly used criterion for evaluating the performance of scene classification methods for aerial images: it is the total number of accurately classified samples divided by the total number of tested samples.
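Written out, with $N_c$ denoting the number of correctly classified test samples and $N_t$ the total number of test samples (notation introduced here for clarity, not taken from the cited papers):

$$\mathrm{OA} = \frac{N_c}{N_t} \times 100\%$$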

V-B Discussion

As can be seen from Tables III, IV, and V, the performance of aerial image scene classification has advanced steadily. In the early days, deep learning-based scene classification approaches were mainly based on autoencoders, and researchers usually used the UC-Merced data set to evaluate them. As an early unsupervised deep learning method, the autoencoder has a relatively simple structure, so its feature learning capability is also limited, and the accuracy of autoencoder-based approaches soon plateaued on the standard benchmarks.

Fortunately, after 2012, CNNs, a powerful supervised learning method, proved to be capable of learning abstract features from raw images. Despite this potential, it took until 2015 for CNNs to take off in the satellite image scene classification domain. At first, CNN-based algorithms mainly used CNNs as feature extractors, which already outperformed autoencoder-based methods; however, using CNNs only as feature extractors does not exploit their full potential. Thanks to the release of two large-scale scene classification benchmarks, AID and NWPU-RESISC45, in 2017, fine-tuning off-the-shelf CNNs has shown better generalization ability in scene classification than using CNNs as feature extractors alone.


Generally, CNN-based methods require large-scale labeled remote sensing images to train CNNs. To deal with this issue, GANs, a novel unsupervised learning method, were introduced into aerial image scene classification. Through adversarial training, GANs can model the distribution of real samples and generate new samples. GAN-based methods have been successful in scene classification of aerial images when the aerial data sets lack human-annotated labels. According to the accuracies reported in Tables III, IV, and V, the development of autoencoder-based methods has reached a bottleneck, CNN-based methods still dominate and retain some upside potential, and the performance of GAN-based methods is relatively low on the three benchmarks, so there remains much room for further improvement of GAN-based methods.
Moreover, learning discriminative feature representations is one of the critical driving forces for improving scene classification performance. Fusing multiple features [8, 27], designing effective cost functions [82, 69], modifying deep learning models [82, 115], and data augmentation [65] are all beneficial for attaining better performance. Meanwhile, with access to large-scale benchmark data sets, the gap between scene classification approaches based on supervised learning and those based on unsupervised learning will become smaller.
The release of publicly available benchmarks, such as the UC-Merced, AID, and NWPU-RESISC45 data sets, makes it easier to compare scene classification algorithms. From the perspective of the data sets, the UC-Merced data set is relatively simple, and the CNN-driven results on it have reached saturation (above 99% classification accuracy at a training ratio of 80%). The AID data set is of moderate difficulty: classification accuracy on it can reach about 96% using 50% of the samples for training. For NWPU-RESISC45, some advanced CNN-based methods have reached about 94% classification accuracy with the training ratio fixed at 20%. At present, the NWPU-RESISC45 data set remains challenging compared with the UC-Merced and AID data sets.
The performance of CNN-based methods depends very much on the quantity of training data, so developing larger-scale and more challenging aerial image scene classification benchmarks can further promote the development of data-driven algorithms.

VI Future Opportunities

Scene classification is an important and challenging problem for remote sensing image interpretation. Driven by its wide application, it has aroused extensive research attention. Thanks to the advancement of deep learning techniques and the establishment of large-scale data sets for scene classification, scene classification has been seeing dramatic improvement. In spite of the amazing successes obtained in the past several years, there still exists a giant gap between the current understanding level of machines and human-level performance. Thus, there is still much work that needs to be done in the field of scene classification. By investigating the current scene classification algorithms and the available data sets, this paper discusses several potential future directions for scene classification in remote sensing imagery.
(1) Developing larger-scale scene classification data sets. An ideal scene classification system would be capable of accurately and efficiently recognizing all scene types in the open world. Recent scene classification methods are still trained on relatively limited data sets, so they can classify the scene categories within the training data but are blind, in principle, to scene classes outside them. A compelling scene classification system should be able to accurately label a novel scene image with a semantic category. The existing data sets [118, 13, 113] contain dozens of scene classes, far fewer than humans can distinguish. Moreover, a typical deep CNN has millions of parameters and tends to over-fit the tens of thousands of training samples in such training sets. Hence, fully training a deep classification model is almost impracticable with currently available scene classification data sets, and most advanced scene classification algorithms rely on fine-tuning already trained CNNs on the target data sets or on using pre-trained CNNs as feature extractors. Although these transfer solutions behave fairly well on target data sets with limited classes and samples, they are not optimal compared with fully training a deep CNN, because a model trained from scratch can extract more specific features adapted to the target domain when enough training samples are available. Considering this, developing a new large-scale data set with considerably more scene classes is very promising.
(2) Unsupervised learning for scene classification. Currently, the most advanced scene classification algorithms generally use fully supervised models learned from data annotated with semantic categories and have achieved remarkable results. However, such fully supervised learning is extremely expensive and time-consuming, because data annotation must be done manually by researchers with expert knowledge of remote sensing image understanding. When the number of scene classes is huge, annotation may become very difficult because of the massive diversity and variation in remote sensing images. Meanwhile, labeled data are generally full of noise and errors, especially in large-scale data sets, since the diverse knowledge levels of different specialists result in different understandings of the same scene classes. Fully supervised learning can hardly work well without a large data set with clean labels. As a promising unsupervised learning method, generative adversarial networks have been used for tackling scene classification on data sets that lack annotations [25, 65, 124]. Consequently, it is valuable to explore unsupervised learning for scene classification.
(3) Compact and efficient scene classification models. Another key factor in the outstanding progress in scene classification over the past few years is the evolution of powerful deep CNNs. To achieve high classification accuracy, the number of layers in CNNs has increased from several to hundreds. Most advanced CNN models have millions of parameters and require a massive labeled data set for training as well as high-performance GPUs, which severely limits the deployment of scene classification algorithms on airborne and satellite-borne embedded systems. In response, some researchers are working to design compact and lightweight scene classification models [9, 127]. In this area, there is much work to be done.
(4) Learning discriminative feature representations. Two key factors that influence the performance of scene classification are the intraclass diversity and interclass similarity of remote sensing images. Even though CNNs can extract rich semantic features from a given satellite image, and several methods for learning discriminative CNN feature representations have been proposed [82, 39, 138], the challenges of large intraclass variation and small interclass separability are still not fully solved and continue to limit classification performance. In the future, learning more discriminative feature representations to address these challenges deserves further study.
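
A common recipe for more discriminative features is to regularize the softmax cross-entropy with a metric-learning term that pulls same-class embeddings together and pushes different-class embeddings apart. The sketch below uses a standard triplet margin loss and is illustrative rather than the specific formulation of [82, 39, 138].

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

def discriminative_loss(logits, labels, anchor, positive, negative,
                        weight=0.1):
    """Cross-entropy plus a triplet term on penultimate-layer embeddings:
    (anchor, positive) share a scene class while negative does not, so the
    margin term shrinks intraclass distances and enlarges interclass ones."""
    return (ce_loss(logits, labels)
            + weight * triplet_loss(anchor, positive, negative))

# Toy batch: 4 images, 10 classes, one mined triplet of 32-dim embeddings.
loss = discriminative_loss(torch.randn(4, 10), torch.tensor([1, 3, 3, 7]),
                           torch.randn(1, 32), torch.randn(1, 32),
                           torch.randn(1, 32))
```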
(5) Scene classification with limited samples. CNNs have achieved huge success in scene classification. However, most models demand large-scale labeled data and numerous training iterations to fit their parameters. This severely limits their scalability to novel categories, given the high cost of labeling, and fundamentally confines their applicability to rare scene categories (e.g., missile positions and military zones) that are difficult to capture. In contrast, humans are adept at recognizing scenes with little supervision, or none at all, as in few-shot [101] or zero-shot learning [122]. For instance, children can quickly and accurately recognize a scene type from a single image on TV or in a book, or even from a verbal description. The best current scene classification approaches are still far from matching this human ability to classify scene types from a few labeled samples. Few-shot and zero-shot learning approaches for scene classification [66, 126, 54] therefore deserve further development.
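
As a flavor of how few-shot scene classification can work, the sketch below implements the prototypical-network idea of classifying query images by their distance to class mean embeddings; it is a generic illustration, not the specific approach of [66, 126, 54].

```python
import torch

def prototypical_episode(support_feats, support_labels, query_feats,
                         n_classes):
    """One few-shot episode in the style of prototypical networks: each class
    is represented by the mean ("prototype") of its support embeddings, and
    queries are assigned to the nearest prototype.

    support_feats: (n_support, d) CNN embeddings of labeled examples
    support_labels: (n_support,) integer class ids in [0, n_classes)
    query_feats:   (n_query, d) embeddings of unlabeled query images
    """
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(n_classes)])                   # (n_classes, d)
    dists = torch.cdist(query_feats, prototypes)      # (n_query, n_classes)
    return (-dists).softmax(dim=1)                    # class probabilities

# Toy 3-way 2-shot episode with 16-dimensional embeddings.
sup = torch.randn(6, 16)
lab = torch.tensor([0, 0, 1, 1, 2, 2])
probs = prototypical_episode(sup, lab, torch.randn(4, 16), n_classes=3)
```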

(6) Cross-domain scene classification. Current research has confirmed that CNNs are powerful tools for scene classification. Pre-trained CNN models show good generalization to remote sensing scene data sets, whether used to extract discriminative holistic features or fine-tuned on target data sets. However, most methods implicitly assume that the training set and test set come from the same domain. This assumption often does not hold, because satellite images are acquired under differing conditions, and simple transfer learning or fine-tuning is insufficient for aerial images captured by different satellites or over different locations. In the past few years, some researchers have explored cross-domain scene classification to enhance the generalization of CNN models and reduce the distribution gap between source and target domains [1, 90, 72, 99]. There is much potential for improving domain adaptation-based methods for scene classification, such as mapping the feature representations from source and target domains onto a uniform space while preserving the original data structures, designing additional adaptation layers, and optimizing the loss functions.
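
As a minimal illustration of mapping source and target features onto a uniform space, the sketch below penalizes the maximum mean discrepancy (MMD) between feature batches from the two domains under a Gaussian kernel; the kernel choice and weighting are illustrative assumptions, not the exact objectives of [1, 90, 72, 99].

```python
import torch

def gaussian_mmd(source, target, sigma=1.0):
    """Maximum mean discrepancy between two feature batches of shape (n, d)
    and (m, d) under a Gaussian kernel; adding this term to the classification
    loss encourages domain-invariant representations."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2 * kernel(source, target).mean())

# Typical use during training (lambda_mmd is a tunable trade-off weight):
#   total_loss = cross_entropy(source_logits, source_labels) \
#                + lambda_mmd * gaussian_mmd(source_feats, target_feats)
mmd = gaussian_mmd(torch.randn(8, 128), torch.randn(8, 128) + 0.5)
```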

VII Conclusions

Scene classification of aerial images has made major strides through several decades of development. The volume of literature on aerial image scene classification, especially on deep learning-based methods, is growing at a breathtaking pace. In view of this rapid progress, this paper first discussed the main challenges currently facing satellite image scene classification. Then, we surveyed three families of deep learning-based methods in detail and introduced the mainstream scene classification benchmarks. Next, we summarized the performance of deep learning-based methods on three widely used data sets in tabular form and analyzed the results. Finally, we discussed a set of promising opportunities for further research.

References

  • [1] N. Ammour, L. Bashmal, Y. Bazi, M. M. Al Rahhal, and M. Zuair (2018) Asymmetric adaptation of deep features for cross-domain classification in remote sensing imagery. IEEE Geoscience and Remote Sensing Letters 15 (4), pp. 597–601. Cited by: §VI.
  • [2] L. Bashmal, Y. Bazi, H. AlHichri, M. M. AlRahhal, N. Ammour, and N. Alajlan (2018) Siamese-gan: learning invariant representations for aerial vehicle image categorization. Remote Sensing 10 (2), pp. 351. Cited by: §III-C.
  • [3] T. Blaschke, S. Lang, and G. Hay (2008) Object-based image analysis: spatial concepts for knowledge-driven remote sensing applications. Springer Science & Business Media. Cited by: §I.
  • [4] T. Blaschke and J. Strobl (2001) What’s wrong with pixels? some recent developments interfacing remote sensing and gis. Zeitschrift für Geoinformationssysteme, pp. 12–17. Cited by: §I.
  • [5] T. Blaschke (2003) Object-based contextual image classification built on image segmentation. In IEEE Workshop on Advances in Techniques for Analysis of Remotely Sensed Data, 2003, pp. 113–119. Cited by: §I.
  • [6] T. Blaschke (2010) Object based image analysis for remote sensing. ISPRS journal of photogrammetry and remote sensing 65 (1), pp. 2–16. Cited by: §I.
  • [7] M. Castelluccio, G. Poggi, C. Sansone, and L. Verdoliva (2015) Land use classification in remote sensing images by convolutional neural networks. arXiv preprint arXiv:1508.00092. Cited by: §III-B.
  • [8] S. Chaib, H. Liu, Y. Gu, and H. Yao (2017) Deep feature fusion for vhr remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 55 (8), pp. 4775–4784. Cited by: §III-B, TABLE III, TABLE IV, §V-B.
  • [9] G. Chen, X. Zhang, X. Tan, Y. Cheng, F. Dai, K. Zhu, Y. Gong, and Q. Wang (2018) Training small networks for scene classification of remote sensing images via knowledge distillation. Remote Sensing 10 (5), pp. 719. Cited by: §III-B, §VI.
  • [10] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang (2013) Automatic landslide detection from remote-sensing imagery using a scene classification method based on bovw and plsa. International Journal of Remote Sensing 34 (1-2), pp. 45–59. Cited by: §I.
  • [11] G. Cheng, J. Han, L. Guo, and T. Liu (2015) Learning coarse-to-fine sparselets for efficient object detection and scene classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1181. Cited by: §I.
  • [12] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren (2015) Effective and efficient midlevel visual elements-oriented land-use classification using vhr remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 53 (8), pp. 4238–4249. Cited by: §I, TABLE III.
  • [13] G. Cheng, J. Han, and X. Lu (2017) Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105 (10), pp. 1865–1883. Cited by: §I, Fig. 4, TABLE I, TABLE II, §IV-C, §IV, §VI.
  • [14] G. Cheng, J. Han, P. Zhou, and D. Xu (2018) Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Transactions on Image Processing 28 (1), pp. 265–278. Cited by: §I.
  • [15] G. Cheng and J. Han (2016) A survey on object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 117, pp. 11–28. Cited by: §I.
  • [16] G. Cheng, Z. Li, X. Yao, L. Guo, and Z. Wei (2017) Remote sensing image scene classification using bag of convolutional features. IEEE Geoscience and Remote Sensing Letters 14 (10), pp. 1735–1739. Cited by: §III-B, TABLE V.
  • [17] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han (2018) When deep learning meets metric learning: remote sensing image scene classification via learning discriminative cnns. IEEE transactions on geoscience and remote sensing 56 (5), pp. 2811–2821. Cited by: §I, §III-B, TABLE III, TABLE IV, TABLE V.
  • [18] G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han (2015) Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images. IET Computer Vision 9 (5), pp. 639–647. Cited by: §III-A.
  • [19] G. Cheng, P. Zhou, and J. Han (2016) Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54 (12), pp. 7405–7415. Cited by: §I.
  • [20] G. Cheng, P. Zhou, and J. Han (2016) Rifd-cnn: rotation-invariant and fisher discriminative convolutional neural networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2884–2893. Cited by: §I.
  • [21] A. M. Cheriyadat (2013) Unsupervised feature learning for aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing 52 (1), pp. 439–451. Cited by: §I.
  • [22] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), Vol. 1, pp. 886–893. Cited by: §I, §III-B.
  • [23] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I.
  • [24] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao (2016) Stacked convolutional denoising auto-encoders for feature representation. IEEE transactions on cybernetics 47 (4), pp. 1017–1027. Cited by: §III-A, TABLE III.
  • [25] Y. Duan, X. Tao, M. Xu, C. Han, and J. Lu (2018) GAN-nl: unsupervised representation learning for remote sensing image classification. In 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 375–379. Cited by: §I, §III-C, §VI.
  • [26] J. Fan, T. Chen, and S. Lu (2017) Unsupervised feature learning for land-use scene recognition. IEEE Transactions on Geoscience and Remote Sensing 55 (4), pp. 2250–2261. Cited by: §I.
  • [27] J. Fang, Y. Yuan, X. Lu, and Y. Feng (2019) Robust space–frequency joint representation for remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing 57 (10), pp. 7492–7502. Cited by: §III-B, TABLE III, TABLE IV, §V-B.
  • [28] P. Gamba (2012) Human settlements: a global challenge for eo data processing and interpretation. Proceedings of the IEEE 101 (3), pp. 570–581. Cited by: §I.
  • [29] P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza (2017) Advanced spectral classifiers for hyperspectral images: a review. IEEE Geoscience and Remote Sensing Magazine 5 (1), pp. 8–32. Cited by: §I, TABLE I.
  • [30] R. Girshick, H. O. Song, and T. Darrell (2013) Discriminatively activated sparselets. In International Conference on Machine Learning, pp. 196–204. Cited by: §III-A.
  • [31] L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls (2015) Multimodal classification of remote sensing images: a review and future directions. Proceedings of the IEEE 103 (9), pp. 1560–1584. Cited by: §I, §I, TABLE I.
  • [32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §I, §III-C, §III-C.
  • [33] W. Han, R. Feng, L. Wang, and Y. Cheng (2018) A semi-supervised generative framework with deep learning features for high-resolution remote sensing image scene classification. ISPRS Journal of Photogrammetry and Remote Sensing 145, pp. 23–43. Cited by: §III-C.
  • [34] X. Han, Y. Zhong, B. Zhao, and L. Zhang (2017) Scene classification based on a hierarchical convolutional sparse auto-encoder for high spatial resolution imagery. International Journal of Remote Sensing 38 (2), pp. 514–536. Cited by: §III-A.
  • [35] R. M. Haralick, K. Shanmugam, and I. H. Dinstein (1973) Textural features for image classification. IEEE Transactions on systems, man, and cybernetics (6), pp. 610–621. Cited by: §I.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-B.
  • [37] L. He, J. Li, C. Liu, and S. Li (2017) Recent advances on spectral–spatial hyperspectral image classification: an overview and new guidelines. IEEE Transactions on Geoscience and Remote Sensing 56 (3), pp. 1579–1597. Cited by: §I, TABLE I.
  • [38] N. He, L. Fang, S. Li, A. Plaza, and J. Plaza (2018) Remote sensing scene classification using multilayer stacked covariance pooling. IEEE Transactions on Geoscience and Remote Sensing 56 (12), pp. 6899–6910. Cited by: §III-B, TABLE III, TABLE IV, TABLE V.
  • [39] N. He, L. Fang, S. Li, J. Plaza, and A. Plaza (2019) Skip-connected covariance network for remote sensing scene classification. IEEE transactions on neural networks and learning systems. Cited by: §VI.
  • [40] P. Helber, B. Bischke, A. Dengel, and D. Borth (2019) Eurosat: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (7), pp. 2217–2226. Cited by: TABLE II, §IV.
  • [41] G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural computation 18 (7), pp. 1527–1554. Cited by: §I.
  • [42] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §I, §III-A, §III-A.
  • [43] F. Hu, G. Xia, J. Hu, and L. Zhang (2015) Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing 7 (11), pp. 14680–14707. Cited by: §I, §I, TABLE I, §III-B.
  • [44] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §III-B.
  • [45] Q. Hu, W. Wu, T. Xia, Q. Yu, P. Yang, Z. Li, and Q. Song (2013) Exploring the use of google earth imagery and object-based methods in land use/cover mapping. Remote Sensing 5 (11), pp. 6026–6042. Cited by: §I.
  • [46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §III-B.
  • [47] X. Huang, D. Wen, J. Li, and R. Qin (2017) Multi-level monitoring of subtle urban changes for the megacities of china using high-resolution multi-view satellite imagery. Remote sensing of environment 196, pp. 56–75. Cited by: §I.
  • [48] A. K. Jain, N. K. Ratha, and S. Lakshmanan (1997) Object detection using gabor filters. Pattern recognition 30 (2), pp. 295–309. Cited by: §I.
  • [49] L. L. Janssen and H. Middelkoop (1992) Knowledge-based crop classification of a landsat thematic mapper image. International Journal of Remote Sensing 13 (15), pp. 2827–2837. Cited by: §I.
  • [50] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid (2011) Aggregating local image descriptors into compact codes. IEEE transactions on pattern analysis and machine intelligence 34 (9), pp. 1704–1716. Cited by: §I.
  • [51] M. Ji and J. R. Jensen (1999) Effectiveness of subpixel analysis in detecting and quantifying urban imperviousness from landsat thematic mapper imagery. Geocarto International 14 (4), pp. 33–41. Cited by: §I.
  • [52] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I, §III-B.
  • [53] S. Lazebnik, C. Schmid, and J. Ponce (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 2169–2178. Cited by: §I.
  • [54] A. Li, Z. Lu, L. Wang, T. Xiang, and J. Wen (2017) Zero-shot scene classification for high spatial resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 55 (7), pp. 4157–4167. Cited by: §VI.
  • [55] D. Li, M. Wang, Z. Dong, X. Shen, and L. Shi (2017) Earth observation brain (eob): an intelligent earth observation system. Geo-spatial information science 20 (2), pp. 134–140. Cited by: §I.
  • [56] E. Li, J. Xia, P. Du, C. Lin, and A. Samat (2017) Integrating multilayer features of convolutional neural networks for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 55 (10), pp. 5653–5665. Cited by: §III-B.
  • [57] K. Li, G. Cheng, S. Bu, and X. You (2017) Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 56 (4), pp. 2337–2348. Cited by: §I.
  • [58] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han (2020) Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 296–307. Cited by: §I.
  • [59] M. Li, S. Zang, B. Zhang, S. Li, and C. Wu (2014) A review of remote sensing image classification techniques: the role of spatio-contextual information. European Journal of Remote Sensing 47 (1), pp. 389–411. Cited by: §I, TABLE I.
  • [60] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson (2019) Deep learning for hyperspectral image classification: an overview. IEEE Transactions on Geoscience and Remote Sensing 57 (9), pp. 6690–6709. Cited by: §I, TABLE I.
  • [61] X. Li, W. Wang, X. Hu, and J. Yang (2019) Selective kernel networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 510–519. Cited by: §III-B.
  • [62] X. Li and G. Shao (2013) Object-based urban vegetation mapping with high-resolution aerial photography as a single data source. International journal of remote sensing 34 (3), pp. 771–789. Cited by: §I.
  • [63] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille (2018) Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 146, pp. 182–196. Cited by: §I.
  • [64] Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma (2017) Large-scale remote sensing image retrieval by deep hashing neural networks. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 950–965. Cited by: §I.
  • [65] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun (2017) MARTA gans: unsupervised representation learning for remote sensing image classification. IEEE Geoscience and Remote Sensing Letters 14 (11), pp. 2092–2096. Cited by: §I, §III-C, TABLE III, TABLE IV, §V-B, TABLE V, §VI.
  • [66] X. Liu, Y. Zhou, J. Zhao, R. Yao, B. Liu, and Y. Zheng (2019) Siamese convolutional neural networks for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 16 (8), pp. 1200–1204. Cited by: TABLE III, TABLE V, §VI.
  • [67] Y. Liu, Y. Zhong, F. Fei, Q. Zhu, and Q. Qin (2018) Scene classification based on a deep random-scale stretched convolutional neural network. Remote Sensing 10 (3), pp. 444. Cited by: §III-B.
  • [68] Y. Liu, Y. Zhong, and Q. Qin (2018) Scene classification based on multiscale convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 56 (12), pp. 7109–7121. Cited by: §III-B, TABLE III, TABLE IV.
  • [69] Y. Liu, C. Y. Suen, Y. Liu, and L. Ding (2018) Scene classification using hierarchical wasserstein cnn. IEEE Transactions on Geoscience and Remote Sensing 57 (5), pp. 2494–2509. Cited by: §III-B, TABLE IV, §V-B, TABLE V.
  • [70] N. Longbotham, C. Chaapel, L. Bleiler, C. Padwick, W. J. Emery, and F. Pacifici (2011) Very high resolution multiangle urban classification analysis. IEEE Transactions on Geoscience and Remote Sensing 50 (4), pp. 1155–1170. Cited by: §I.
  • [71] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §I, §III-B.
  • [72] X. Lu, T. Gong, and X. Zheng (2019) Multisource compensation network for remote sensing cross-domain scene classification. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §VI.
  • [73] X. Lu, H. Sun, and X. Zheng (2019) A feature aggregation convolutional neural network for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 57 (10), pp. 7894–7906. Cited by: §III-B, TABLE III, TABLE IV.
  • [74] X. Lu, X. Zheng, and Y. Yuan (2017) Remote sensing scene classification by unsupervised representation learning. IEEE Transactions on Geoscience and Remote Sensing 55 (9), pp. 5148–5157. Cited by: §I.
  • [75] Z. Y. Lv, W. Shi, X. Zhang, and J. A. Benediktsson (2018) Landslide inventory mapping from bitemporal high-resolution remote sensing images using change detection and multiscale segmentation. IEEE journal of selected topics in applied earth observations and remote sensing 11 (5), pp. 1520–1532. Cited by: §I.
  • [76] D. Ma, P. Tang, and L. Zhao (2019) SiftingGAN: generating and sifting labeled samples to improve the remote sensing image scene classification baseline in vitro. IEEE Geoscience and Remote Sensing Letters 16 (7), pp. 1046–1050. Cited by: §III-C.
  • [77] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson (2019) Deep learning in remote sensing applications: a meta-analysis and review. ISPRS journal of photogrammetry and remote sensing 152, pp. 166–177. Cited by: §I, TABLE I.
  • [78] D. Marmanis, M. Datcu, T. Esch, and U. Stilla (2015) Deep learning earth observation classification using imagenet pretrained networks. IEEE Geoscience and Remote Sensing Letters 13 (1), pp. 105–109. Cited by: §III-B.
  • [79] T. R. Martha, N. Kerle, C. J. van Westen, V. Jetten, and K. V. Kumar (2011) Segment optimization and data-driven thresholding for knowledge-based landslide detection by object-based image analysis. IEEE Transactions on Geoscience and Remote Sensing 49 (12), pp. 4928–4943. Cited by: §I.
  • [80] U. Maulik and D. Chakraborty (2017) Remote sensing image classification: a survey of support-vector-machine-based advanced techniques. IEEE Geoscience and Remote Sensing Magazine 5 (1), pp. 33–52. Cited by: §I, TABLE I.
  • [81] M. L. Mekhalfi, F. Melgani, Y. Bazi, and N. Alajlan (2015) Land-use classification with compressive sensing multifeature fusion. IEEE Geoscience and Remote Sensing Letters 12 (10), pp. 2155–2159. Cited by: §I, §III-B.
  • [82] R. Minetto, M. P. Segundo, and S. Sarkar (2019) Hydra: an ensemble of convolutional neural networks for geospatial land classification. IEEE Transactions on Geoscience and Remote Sensing 57 (9), pp. 6530–6541. Cited by: §I, §III-B, §V-B, TABLE V, §VI.
  • [83] N. B. Mishra and K. A. Crews (2014) Mapping vegetation morphology types in a dry savanna ecosystem: integrating hierarchical object-based image analysis with random forest. International Journal of Remote Sensing 35 (3), pp. 1175–1198. Cited by: §I.
  • [84] R. Negrel, D. Picard, and P. Gosselin (2014) Evaluation of second-order visual features for land-use classification. In 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–5. Cited by: §I.
  • [85] K. Nogueira, O. A. Penatti, and J. A. Dos Santos (2017) Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognition 61, pp. 539–556. Cited by: §I, TABLE I.
  • [86] T. Ojala, M. Pietikainen, and T. Maenpaa (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on pattern analysis and machine intelligence 24 (7), pp. 971–987. Cited by: §I.
  • [87] A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §I.
  • [88] B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision research 37 (23), pp. 3311–3325. Cited by: §I.
  • [89] E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani (2016) Using convolutional features and a sparse autoencoder for land-use scene classification. International Journal of Remote Sensing 37 (10), pp. 2149–2167. Cited by: §III-A.
  • [90] E. Othman, Y. Bazi, F. Melgani, H. Alhichri, N. Alajlan, and M. Zuair (2017) Domain adaptation network for cross-scene classification. IEEE Transactions on Geoscience and Remote Sensing 55 (8), pp. 4441–4456. Cited by: §VI.
  • [91] O. A. Penatti, K. Nogueira, and J. A. Dos Santos (2015) Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 44–51. Cited by: §I, TABLE I, §III-B, TABLE II, §IV.
  • [92] F. Perronnin, J. Sánchez, and T. Mensink (2010) Improving the fisher kernel for large-scale image classification. In European conference on computer vision, pp. 143–156. Cited by: §I.
  • [93] V. Risojević and Z. Babić (2016) Unsupervised quaternion feature learning for remote sensing image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (4), pp. 1521–1531. Cited by: §I.
  • [94] A. Romero, C. Gatta, and G. Camps-Valls (2015) Unsupervised deep feature extraction for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 54 (3), pp. 1349–1362. Cited by: §I.
  • [95] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §III-B.
  • [96] W. Shao, W. Yang, G. Xia, and G. Liu (2013) A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization. In International Conference on Computer Vision Systems, pp. 324–333. Cited by: §I.
  • [97] G. Sheng, W. Yang, T. Xu, and H. Sun (2012) High-resolution satellite scene classification using a sparse coding based multiple feature combination. International journal of remote sensing 33 (8), pp. 2395–2412. Cited by: §I.
  • [98] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §III-B.
  • [99] S. Song, H. Yu, Z. Miao, Q. Zhang, Y. Lin, and S. Wang (2019) Domain adaptation for convolutional neural networks-based remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 16 (8), pp. 1324–1328. Cited by: §VI.
  • [100] H. Sun, S. Li, X. Zheng, and X. Lu (2019) Remote sensing scene classification by gated bidirectional network. IEEE Transactions on Geoscience and Remote Sensing 58 (1), pp. 82–96. Cited by: §III-B, TABLE III, TABLE IV.
  • [101] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §VI.
  • [102] M. J. Swain and D. H. Ballard (1991) Color indexing. International journal of computer vision 7 (1), pp. 11–32. Cited by: §I.
  • [103] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §III-B.
  • [104] A. Tayyebi, B. C. Pijanowski, and A. H. Tayyebi (2011) An urban growth boundary model using neural networks, gis and radial parameterization: an application to tehran, iran. Landscape and Urban Planning 100 (1-2), pp. 35–44. Cited by: §I.
  • [105] W. Teng, N. Wang, H. Shi, Y. Liu, and J. Wang (2019) Classifier-constrained deep adversarial domain adaptation for cross-domain semisupervised classification in remote sensing images. IEEE Geoscience and Remote Sensing Letters. Cited by: §III-C.
  • [106] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery (2009) Active learning methods for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 47 (7), pp. 2218–2232. Cited by: §I.
  • [107] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari (2011) A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing 5 (3), pp. 606–617. Cited by: §I, §I, TABLE I.
  • [108] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §I.
  • [109] J. Wang, W. Liu, L. Ma, H. Chen, and L. Chen (2018) IORN: an effective remote sensing image scene classification framework. IEEE Geoscience and Remote Sensing Letters 15 (11), pp. 1695–1699. Cited by: TABLE V.
  • [110] Q. Wang, S. Liu, J. Chanussot, and X. Li (2018) Scene classification with recurrent attention of vhr remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 57 (2), pp. 1155–1167. Cited by: §I, §III-B, TABLE II, TABLE III, TABLE IV, §IV.
  • [111] Y. Wang, L. Zhang, X. Tong, L. Zhang, Z. Zhang, H. Liu, X. Xing, and P. T. Mathiopoulos (2016) A three-layered graph-based learning approach for remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing 54 (10), pp. 6020–6034. Cited by: §I.
  • [112] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: §I.
  • [113] G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu (2017) AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing 55 (7), pp. 3965–3981. Cited by: §I, TABLE I, TABLE II, §IV-B, §IV, §VI.
  • [114] G. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître (2010) Structural high-resolution satellite image indexing. In ISPRS TC VII Symposium – 100 Years ISPRS. Cited by: TABLE II, §IV.
  • [115] J. Xie, N. He, L. Fang, and A. Plaza (2019) Scale-free convolutional neural network for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 57 (9), pp. 6916–6928. Cited by: §III-B, TABLE III, TABLE IV, §V-B, TABLE V.
  • [116] S. Xu, X. Mu, D. Chai, and X. Zhang (2018) Remote sensing image scene classification based on generative adversarial networks. Remote sensing letters 9 (7), pp. 617–626. Cited by: §III-C.
  • [117] G. Yan, J. Mas, B. Maathuis, Z. Xiangmin, and P. Van Dijk (2006) Comparison of pixel-based and object-oriented image classification approaches?a case study in a coal fire area, wuda, inner mongolia, china. International Journal of Remote Sensing 27 (18), pp. 4039–4055. Cited by: §I.
  • [118] Y. Yang and S. Newsam (2010) Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, pp. 270–279. Cited by: §I, §III-B, TABLE II, §IV-A, §IV, §VI.
  • [119] Y. Yang and S. Newsam (2011) Spatial pyramid co-occurrence for image classification. In 2011 International Conference on Computer Vision, pp. 1465–1472. Cited by: §I.
  • [120] X. Yao, J. Han, G. Cheng, X. Qian, and L. Guo (2016) Semantic annotation of high-resolution satellite images via weakly supervised learning. IEEE Transactions on Geoscience and Remote Sensing 54 (6), pp. 3660–3671. Cited by: §III-A.
  • [121] F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min (2018) Remote sensing image retrieval using convolutional neural network features and weighted distance. IEEE Geoscience and Remote Sensing Letters 15 (10), pp. 1535–1539. Cited by: §I.
  • [122] M. Ye and Y. Guo (2017) Zero-shot classification with discriminative semantic representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7140–7148. Cited by: §VI.
  • [123] Y. Yu, Z. Gong, C. Wang, and P. Zhong (2017) An unsupervised convolutional feature fusion network for deep representation of remote sensing images. IEEE Geoscience and Remote Sensing Letters 15 (1), pp. 23–27. Cited by: §III-B.
  • [124] Y. Yu, X. Li, and F. Liu (2019) Attention gans: unsupervised deep feature learning for aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing 58 (1), pp. 519–531. Cited by: §III-C, TABLE III, TABLE IV, TABLE V, §VI.
  • [125] Y. Yuan, J. Fang, X. Lu, and Y. Feng (2018) Remote sensing image scene classification using rearranged local features. IEEE Transactions on Geoscience and Remote Sensing 57 (3), pp. 1779–1792. Cited by: §III-B.
  • [126] M. Zhai, H. Liu, and F. Sun (2019) Lifelong learning for scene recognition in remote sensing images. IEEE Geoscience and Remote Sensing Letters 16 (9), pp. 1472–1476. Cited by: §VI.
  • [127] B. Zhang, Y. Zhang, and S. Wang (2019) A lightweight and discriminative model for remote sensing scene classification with multidilation pooling module. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (8), pp. 2636–2653. Cited by: §III-B, §VI.
  • [128] B. Zhang, Z. Chen, D. Peng, J. A. Benediktsson, B. Liu, L. Zou, J. Li, and A. Plaza (2019) Remotely sensed big data: evolution in model development for information extraction [point of view]. Proceedings of the IEEE 107 (12), pp. 2294–2301. Cited by: §III.
  • [129] F. Zhang, B. Du, and L. Zhang (2014) Saliency-guided unsupervised feature learning for scene classification. IEEE Transactions on Geoscience and Remote Sensing 53 (4), pp. 2175–2184. Cited by: §III-A, TABLE III.
  • [130] F. Zhang, B. Du, and L. Zhang (2015) Scene classification via a gradient boosting random convolutional network framework. IEEE Transactions on Geoscience and Remote Sensing 54 (3), pp. 1793–1802. Cited by: §III-B, TABLE III.
  • [131] L. Zhang, L. Zhang, and B. Du (2016) Deep learning for remote sensing data: a technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine 4 (2), pp. 22–40. Cited by: §I, TABLE I, §III-B.
  • [132] T. Zhang and X. Huang (2018) Monitoring of urban impervious surfaces using time series of high-resolution remote sensing images in rapidly urbanized areas: a case study of shenzhen. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2692–2708. Cited by: §I.
  • [133] W. Zhang, P. Tang, and L. Zhao (2019) Remote sensing image scene classification using cnn-capsnet. Remote Sensing 11 (5), pp. 494. Cited by: TABLE III, TABLE IV, TABLE V.
  • [134] Y. Zhang, X. Sun, H. Wang, and K. Fu (2013) High-resolution remote-sensing image classification via an approximate earth mover’s distance-based bag-of-features model. IEEE Geoscience and Remote Sensing Letters 10 (5), pp. 1055–1059. Cited by: §I.
  • [135] B. Zhao, Y. Zhong, G. Xia, and L. Zhang (2015) Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 54 (4), pp. 2108–2123. Cited by: TABLE II, §IV.
  • [136] L. Zhao, P. Tang, and L. Huo (2014) Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (12), pp. 4620–4631. Cited by: §I.
  • [137] L. Zhao, P. Tang, and L. Huo (2016) Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. Journal of Applied Remote Sensing 10 (3), pp. 035004. Cited by: TABLE II, §IV.
  • [138] X. Zheng, Y. Yuan, and X. Lu (2019) A deep scene representation for aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4799–4809. Cited by: §VI.
  • [139] X. Zheng, X. Sun, K. Fu, and H. Wang (2012) Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint. IEEE Geoscience and Remote Sensing Letters 10 (4), pp. 652–656. Cited by: §I.
  • [140] Y. Zhong, F. Fei, and L. Zhang (2016) Large patch convolutional neural networks for the scene classification of high spatial resolution imagery. Journal of Applied Remote Sensing 10 (2), pp. 025006. Cited by: §III-B, TABLE III.
  • [141] Y. Zhong, Q. Zhu, and L. Zhang (2015) Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 53 (11), pp. 6207–6222. Cited by: §I.
  • [142] Q. Zhu, Y. Zhong, Y. Liu, L. Zhang, and D. Li (2018) A deep-local-global feature fusion framework for high spatial resolution imagery scene classification. Remote Sensing 10 (4), pp. 568. Cited by: §III-B.
  • [143] Q. Zhu, Y. Zhong, L. Zhang, and D. Li (2018) Adaptive deep sparse semantic modeling framework for high spatial resolution image scene classification. IEEE Transactions on Geoscience and Remote Sensing 56 (10), pp. 6180–6195. Cited by: TABLE III, TABLE V.
  • [144] Q. Zhu, Y. Zhong, B. Zhao, G. Xia, and L. Zhang (2016) Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geoscience and Remote Sensing Letters 13 (6), pp. 747–751. Cited by: §I.
  • [145] R. Zhu, L. Yan, N. Mo, and Y. Liu (2019) Attention-based deep feature fusion for the scene classification of high-resolution remote sensing images. Remote Sensing 11 (17), pp. 1996. Cited by: TABLE III, TABLE IV, TABLE V.
  • [146] X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5 (4), pp. 8–36. Cited by: §I, TABLE I, §III-B.
  • [147] Q. Zou, L. Ni, T. Zhang, and Q. Wang (2015) Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 12 (11), pp. 2321–2325. Cited by: TABLE II, §IV.