Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities
Remote sensing image scene classification, which aims at labeling remote sensing images with a set of semantic categories based on their contents, has broad applications in a range of fields. Propelled by the powerful feature learning capabilities of deep neural networks, remote sensing image scene classification driven by deep learning has drawn remarkable attention and achieved significant breakthroughs. However, to the best of our knowledge, a comprehensive review of recent achievements regarding deep learning for scene classification of remote sensing images is still lacking. Considering the rapid evolution of this field, this paper provides a systematic survey of deep learning methods for remote sensing image scene classification, covering more than 140 papers. To be specific, we discuss the main challenges of scene classification and survey (1) autoencoder-based scene classification methods, (2) Convolutional Neural Network-based scene classification methods, and (3) Generative Adversarial Network-based scene classification methods. In addition, we introduce the benchmarks used for scene classification and summarize the performance of more than two dozen representative algorithms on three commonly used benchmark data sets. Finally, we discuss promising opportunities for further research.
Remote sensing images, a valuable data source for earth observation, help us to measure and observe detailed structures on the Earth's surface. Thanks to advances in earth observation technology [45, 31], the volume of aerial and satellite images is growing drastically. This has lent particular urgency to the question of how to make full use of the ever-increasing supply of remote sensing images for intelligent earth observation [28, 55]. Hence, understanding huge and complex collections of satellite images is extremely important.
As a key and challenging problem in the effective interpretation of aerial imagery, scene classification of remote sensing images has been an active research area. Remote sensing image scene classification aims to correctly label given scene images with predefined semantic categories, as shown in Fig. 1. Over the last few decades, extensive research on satellite image scene classification has been driven by its real-world applications, such as urban planning [70, 104], remote sensing image retrieval [64, 121, 111], natural hazard detection [79, 10, 75], environment monitoring [47, 132], vegetation mapping [62, 83], and geospatial object detection [19, 63, 14, 15, 58, 57, 20, 11].

With the improvement of the spatial resolution of remote sensing images, satellite image classification has passed through three stages: pixel-level, object-level, and scene-level classification, as shown in Fig. 2.
To be specific, early literature mainly focused on classifying satellite images at the pixel or subpixel level [51, 106, 107] by labeling each pixel in a satellite image with a semantic class, because the spatial resolution of aerial images was very low—the size of a pixel was similar to the sizes of the objects of interest [49]. However, with the advance of satellite imaging, the spatial resolution of satellite images has become increasingly finer than common objects of interest, such that single pixels lose their semantic meaning. In such cases, recognizing scene images solely at the pixel level is no longer feasible, and per-pixel analysis began to be viewed with increasing dissatisfaction. In 2001, Blaschke and Strobl [4] questioned the dominance of the per-pixel paradigm and concluded that analyzing remote sensing images at the object level is more efficient than per-pixel analysis. They suggested that researchers pay attention to object-level analysis, where the term "object" refers to meaningful semantic entities or scene units. Subsequently, a series of approaches that analyze remote sensing images at the object level dominated satellite image analysis for the following two decades [5, 117, 6, 3]. Pixel-level and object-level classification algorithms have accomplished impressive results on certain specific land-use identification tasks. However, as the resolution of remote sensing images keeps increasing, a single image may contain many different and distinct object classes, and pixel-level and object-level methods are no longer sufficient to classify such images correctly. Under these circumstances, it is of considerable interest to understand the contents and meanings of satellite images as a whole, and a new paradigm of semantic-level analysis of remote sensing images has recently been suggested.

Semantic-level aerial image scene classification, namely remote sensing image scene classification, seeks to correctly classify given satellite images into semantic classes by capitalizing on variations in the spatial arrangement and structural patterns of ground objects. Here the term "scene" denotes a local area cropped from a large-scale satellite image that contains clear semantic information about the earth's surface [12, 43]. Scene-level classification yields a better understanding of satellite images than pixel-level and object-level classification.
The ability to represent visual data with discriminative features is a significant step in almost all computer vision tasks, and the remote sensing domain is no exception. During the previous decade, extensive efforts were devoted to developing discriminative visual features. A majority of early scene classification methods relied on human-engineered descriptors, e.g., the Scale-Invariant Feature Transform (SIFT) [71], Texture Descriptors (TD) [35, 48, 86], Color Histograms (CH) [102], Histograms of Oriented Gradients (HOG) [22], and GIST [87]. Because CH, GIST, and TD can represent an entire image with a single feature vector, they can be applied directly to aerial image scene classification. However, SIFT and HOG cannot represent an entire image directly because of their local character. To make handcrafted local descriptors represent an entire scene image, they are aggregated by feature encoding methods (e.g., the Improved Fisher Kernel (IFK) [92], Vectors of Locally Aggregated Descriptors (VLADs) [50], Spatial Pyramid Matching (SPM) [53], and the popular Bag-Of-Visual-Words (BoVW) [118]). Thanks to their simplicity and efficiency, these feature encoding methods have been broadly applied to aerial image scene classification [119, 96, 84, 136, 134, 144]; the representation capability of handcrafted features, however, remains limited.

In this situation, unsupervised learning methods, such as k-means clustering, Principal Component Analysis (PCA) [112], and sparse coding [88], which automatically learn features from unlabeled images, became an appealing alternative to human-engineered features. A considerable number of unsupervised learning-based scene classification methods emerged [21, 81, 93, 97, 139, 141, 74, 26, 94] and made substantial progress. Nevertheless, unsupervised learning approaches cannot make full use of class information.

Deep learning models address this limitation, and their feature description capabilities have been demonstrated in many fields, including satellite image scene classification. Since AlexNet, a deep Convolutional Neural Network (CNN) designed by Krizhevsky et al. [52] in 2012, obtained the best results in the Large-Scale Visual Recognition Challenge (LSVRC) [23], a great many advanced deep CNNs have come forth and broken records in many fields. In the wake of these successes, CNN-based methods have emerged in satellite image scene classification [82, 17, 110] and achieved state-of-the-art classification accuracy.

Nevertheless, CNN-based methods generally demand massive annotated training data, which greatly limits their application scenarios. More recently, Generative Adversarial Networks (GANs) [32], a promising unsupervised learning method, have achieved significant success in many applications. To remedy the above-mentioned limitation, GANs have begun to be employed by some researchers in the field of aerial scene classification [25, 65].
Currently, driven by deep learning, a great number of aerial image scene classification methods have sprung up (see Fig. 3). The number of papers on satellite image scene classification increased sharply after 2014 and again after 2017, for two reasons. On the one hand, around 2014, deep learning techniques began to be applied to remote sensing data analysis. On the other hand, in 2017, large-scale satellite image scene classification benchmarks appeared, which greatly facilitated the development of deep learning-based scene classification.
In the past several years, numerous reviews of remote sensing image classification methods have been published, as summarized in Table I. For example, Tuia et al. [107] surveyed, tested, and compared three families of active learning methods for remote sensing image classification: committee-based, large-margin, and posterior-probability-based. Gómez-Chova et al. [31] surveyed multimodal remote sensing image classification and summarized the leading algorithms in this field. In [80], Maulik et al. reviewed aerial image scene classification algorithms based on support vector machines (SVMs). Li et al. [59] surveyed pixel-level, subpixel-level, and object-based image classification methods and emphasized the contribution of spatio-contextual information to aerial scene classification.

As an alternative way to extract robust, abstract, and high-level features from images, deep learning models have made amazing progress on a broad range of tasks in image, video, speech, and audio processing. A large number of deep learning-based scene classification algorithms have since been proposed, such as CNN-based and GAN-based methods, along with several reviews of these approaches. Penatti et al. [91] assessed the generalization ability of pre-trained CNNs in the classification of aerial images. In [43], Hu et al. surveyed how to apply CNNs trained on the ImageNet data set to aerial image scene classification. Zhu et al. [146] presented a tutorial on deep learning-based remote sensing data analysis. In order to make full use of pre-trained CNNs, Nogueira et al. [85] analyzed the performance of CNNs for satellite image scene classification under different learning strategies: full training, fine-tuning, and using CNNs as feature extractors. In [131], Zhang et al. reviewed recent deep learning-based remote sensing data analysis. Considering the limited number of scene categories and the accuracy saturation of existing scene classification data sets, Cheng et al. [13] released a large-scale scene classification benchmark, named NWPU-RESISC45, and provided a survey of advances in aerial image scene classification before 2017. In [113], Xia et al. proposed a novel benchmark, called AID, for aerial image classification and reviewed the scene classification methods existing before 2017. Ma et al. [77] provided a review of the applications of deep learning in aerial image analysis. In addition, there have been several hyperspectral image classification surveys [29, 37, 60].

However, a thorough survey of deep learning for scene classification is still lacking. This motivates us to deeply analyze the main challenges faced by aerial scene classification, systematically review deep learning-based scene classification approaches, most of which were published during the last five years, introduce the mainstream scene classification benchmarks, and discuss several promising future directions for scene classification.
The remainder of this paper is organized as follows. Section II discusses the current main challenges of remote sensing image scene classification. Section III provides a brief review of deep learning models and a comprehensive survey of deep learning-based scene classification methods. Section IV introduces the scene classification data sets. Section V compares and discusses the performance of deep learning-based scene classification methods on three widely used benchmarks. Section VI discusses promising future directions of scene classification. Finally, Section VII concludes the paper.
The ideal goal of scene classification of satellite images is to correctly label given scene images with their corresponding semantic classes according to their contents, for example, categorizing an urban remote sensing image as a residential, commercial, or industrial area. Generally speaking, a remote sensing image contains a variety of ground objects; for instance, roads, trees, and buildings may all be included in an industrial scene. Different from object-oriented classification, scene classification is a considerably challenging problem because of the variance and complex spatial distributions of the ground objects existing in scenes. Historically, extensive studies of satellite image scene classification have been conducted. However, no algorithm yet achieves the goal of classifying aerial image scenes with satisfactory accuracy. The challenges of scene classification include (1) large intraclass diversity, (2) high interclass similarity (also known as low between-class separability), and (3) large variance of object/scene scales, as shown in Fig. 4.
Table I. Summary of published reviews related to remote sensing image classification.

| No. | Survey Title | Year | Publication | Content |
| --- | --- | --- | --- | --- |
| 1 | A survey of active learning algorithms for supervised remote sensing image classification [107] | 2011 | IEEE JSTSP | Surveying and testing the main families of active learning methods |
| 2 | A review of remote sensing image classification techniques: the role of spatio-contextual information [59] | 2014 | EuJRS | Review of pixel-wise, subpixel-wise, and object-based methods for remote sensing image classification, exploring the contribution of spatio-contextual information to scene classification |
| 3 | Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery [43] | 2015 | Remote Sensing | Investigating how to exploit pre-trained CNNs for aerial image scene classification |
| 4 | Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? [91] | 2015 | CVPR Workshop | Evaluating the generalization ability of pre-trained CNNs in remote sensing image classification |
| 5 | Multimodal classification of remote sensing images: a review and future directions [31] | 2015 | Proceedings of the IEEE | Offering a taxonomical view of the field of multimodal remote sensing image classification |
| 6 | Deep learning for remote sensing data: A technical tutorial on the state of the art [131] | 2016 | IEEE GRSM | Reviewing deep learning-based remote sensing data analysis techniques before 2016 |
| 7 | Deep learning in remote sensing: A comprehensive review and list of resources [146] | 2017 | IEEE GRSM | Reviewing the progress of deep learning-based remote sensing data analysis before 2017 |
| 8 | Advanced spectral classifiers for hyperspectral images: A review [29] | 2017 | IEEE GRSM | Review and comparison of different supervised hyperspectral classification methods |
| 9 | Remote sensing image classification: a survey of support-vector-machine-based advanced techniques [80] | 2017 | IEEE GRSM | Review of remote sensing image classification based on SVM |
| 10 | Towards better exploiting convolutional neural networks for remote sensing scene classification [85] | 2017 | Pattern Recognition | Analyzing the performance of three CNN-based scene classification strategies |
| 11 | AID: a benchmark data set for performance evaluation of aerial scene classification [113] | 2017 | IEEE TGRS | Review of aerial image scene classification methods before 2017 and proposal of the AID data set |
| 12 | Remote sensing image scene classification: benchmark and state of the art [13] | 2017 | Proceedings of the IEEE | Reviewing the progress of scene classification of remote sensing images before 2017 and proposing the NWPU-RESISC45 data set |
| 13 | Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines [37] | 2017 | IEEE TGRS | Survey of the progress in the classification of spectral–spatial hyperspectral images |
| 14 | Deep learning for hyperspectral image classification: An overview [60] | 2019 | IEEE TGRS | Review of hyperspectral image classification based on deep learning |
| 15 | Deep learning in remote sensing applications: A meta-analysis and review [77] | 2019 | ISPRS JPRS | Providing a review of the applications of deep learning in aerial image analysis |
| 16 | Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities | 2020 | IEEE JSTARS | A systematic review of recent advances in remote sensing image scene classification driven by deep learning |
In terms of within-class diversity, the challenge mainly stems from large variations in the appearance of ground objects within the same semantic class. Ground objects commonly vary in style, shape, scale, and distribution, which makes it difficult to classify scene images correctly. For example, in Fig. 4 (a), the churches appear in different building styles, and the airports and railway stations appear in different shapes. In addition, when airborne or space platforms capture remote sensing images, large differences in color and radiation intensity may appear within the same semantic class on account of imaging conditions, which are influenced by factors such as weather, cloud, and mist. Variations in scene illumination may also cause within-class diversity; for example, the appearance of scenes labeled "beach" differs greatly under different imaging conditions, as shown in Fig. 4 (a).
For between-class similarity, the challenge chiefly arises from the presence of the same objects in different scene classes or from high semantic overlap between scene categories. For instance, in Fig. 4 (b), the bridge and overpass classes both contain the same ground object, namely a bridge, and basketball courts and tennis courts share high-level semantic information. Moreover, the ambiguous definition of scene classes reduces interclass dissimilarity, and some complex scenes are also visually similar to each other. It can therefore be extremely difficult to distinguish such scene classes.
The large variance of object/scene scales is also a non-negligible challenge for aerial image scene classification. In satellite imaging, sensors operate in orbits at various altitudes, from a few hundred kilometers to more than ten thousand kilometers, which leads to variation in imaging altitude. As the examples in Fig. 4 (c) illustrate, the scenes of airplane, storage tank, and thermal power station show huge scale differences under different imaging altitudes. In addition, because of intrinsic factors, variations in size also exist within each object/scene category; for example, the rivers shown in Fig. 4 (c) appear as several different sub-scenes—stream, brook, and creek.
In the past decades, many researchers have devoted themselves to scene classification of remote sensing images, driven by its wide applications, and a number of advanced scene classification systems and approaches have been proposed, especially driven by deep learning. Before deep learning came to the attention of this field, scene classification methods mainly relied on handcrafted features (e.g., Color Histograms (CH), Texture Descriptors (TD), GIST) or on representations generated by encoding local features via BoVW, IFK, SPM, etc. Later, considering that handcrafted features only capture low-level information, many researchers turned to unsupervised learning methods (e.g., sparse coding, PCA, and k-means). By automatically learning discriminative features from unlabeled data, unsupervised learning-based methods obtained good results in the scene classification of aerial images. Yet unsupervised learning-based algorithms do not adequately exploit class information, which limits their ability to discriminate between different scene classes. Now, thanks to the availability of enormous labeled data sets, advances in machine learning theory, and the increased availability of computational resources, deep learning models (e.g., autoencoders, CNNs, and GANs) have shown powerful abilities to learn fruitful features and have permeated many research fields, including aerial image scene classification. Numerous deep learning-based scene classification algorithms have emerged and yielded the best classification accuracy. In this section, we systematically survey the milestone algorithms for scene classification since deep learning permeated the field of remote sensing—one small step for deep learning theory, but one giant leap for the scene classification of satellite images [128]. From autoencoders, to CNNs, and then to GANs, deep learning algorithms have constantly updated scene classification records. Most deep learning-based scene classification algorithms can be broadly divided into three main categories: autoencoder-based methods, CNN-based methods, and GAN-based methods. In what follows, we discuss these three categories of methods at length.

The autoencoder [42] is an unsupervised feature learning model consisting of a shallow, symmetric neural network (see Fig. 5 (a)). It contains two units—an encoder and a decoder. The encoding process can be formulated as equation (1):

$$h = f(W_e x + b_e) \tag{1}$$

where $h$ is the output of the latent layer, $f(\cdot)$ denotes a nonlinear mapping, $W_e$ stands for the encoding weight matrix, $x$ denotes the input of the autoencoder, and $b_e$ is the bias vector. Generally, the dimension of $h$ is smaller than that of $x$. Decoding is the inverse of encoding and can be formulated as equation (2):

$$\hat{x} = g(W_d h + b_d) \tag{2}$$

where $\hat{x}$ represents the reconstructed output, $W_d$ denotes the decoding weight matrix, and $b_d$ stands for the bias vector.
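To make equations (1) and (2) concrete, the following minimal PyTorch sketch implements a single-hidden-layer autoencoder trained with a plain reconstruction loss; the layer sizes, sigmoid nonlinearities, and MSE loss are illustrative assumptions rather than choices prescribed by the surveyed papers:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=64):
        super().__init__()
        # Eq. (1): h = f(W_e x + b_e), with f chosen here as a sigmoid
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        # Eq. (2): x_hat = g(W_d h + b_d), with g chosen here as a sigmoid
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)        # latent code; dim(h) < dim(x)
        return self.decoder(h)     # reconstruction x_hat

model = Autoencoder()
x = torch.rand(32, 784)                     # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error term
loss.backward()                             # gradients via back-propagation
```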
The autoencoder compresses high-dimensional features by minimizing a cost function that usually consists of a reconstruction error term and a regularization term; the network parameters are learned by gradient descent with back-propagation. In real applications, multilayer stacked autoencoders are used for feature learning (see Fig. 5 (b)). The key to training stacked autoencoders is how to initialize the network: the way the parameters are initialized influences network convergence, especially in the early layers, as well as the stability of training. Fortunately, Hinton et al. [42] provided a good solution for initializing the network weights by using restricted Boltzmann machines.
Autoencoders can automatically learn mid-level visual representations from unlabeled data, and such mid-level features played an important role in aerial image scene classification before deep learning took off in the remote sensing community. Zhang et al. [129] introduced the sparse autoencoder to scene classification. Cheng et al. [cheng2015learning] used a single-hidden-layer neural network and an autoencoder to train more effective sparselets [30] for efficient scene classification and object detection. In [89], Othman et al. proposed an aerial image scene classification algorithm that relies on convolutional features and a sparse autoencoder. Han et al. [34] provided a scene classification method based on a hierarchical convolutional sparse autoencoder. Cheng et al. [18] demonstrated that the mid-level visual features learned by autoencoder-based methods are discriminative and facilitate scene classification tasks. In light of the limited feature representation of a single autoencoder, some researchers stacked multiple autoencoders together. Du et al. [24] came up with stacked convolutional denoising autoencoder networks; after extensive experiments, their framework showed superior classification performance. Yao et al. [120] integrated pairwise constraints into a stacked sparse autoencoder to learn more discriminative features for land-use scene classification and semantic annotation.

The autoencoder and the algorithms derived from it are unsupervised learning methods and have obtained good results in scene classification of remote sensing images. However, most of the above-mentioned autoencoder-based methods cannot learn the most discriminative features for distinguishing different scene classes because they do not fully exploit scene class information.
CNNs have shown powerful feature learning ability in the visual domain. Since Krizhevsky et al. proposed AlexNet [52] in 2012, the deep CNN that obtained the best accuracy in the LSVRC, an array of advanced CNN models has appeared, such as VGGNet [98], GoogLeNet [103], ResNet [36], DenseNet [46], SENet [44], and SKNet [61]. CNNs are multi-layer learnable networks consisting of convolutional layers, pooling layers, and fully connected layers (see Fig. 6).
(1) Convolutional layers
Convolutional layers play an important role in extracting features from images. A convolutional layer's input $X$ consists of $C$ two-dimensional feature maps of size $H \times W$, and its output $Y$ consists of $C'$ two-dimensional feature maps of size $H' \times W'$, produced via convolutional kernels $K$, which are trainable filters of size $k \times k$ (typically $k = 1$, 3, or 5). The convolution process is described by equation (3):

$$Y_j = f\Big(\sum_{i=1}^{C} X_i * K_{i,j} + b_j\Big), \quad j = 1, \dots, C' \tag{3}$$

where $*$ denotes the two-dimensional convolution operation, $b_j$ denotes the $j$-th bias term, and $f(\cdot)$ is a nonlinear activation function performed after the convolution operation. As the convolutional structure deepens, convolutional layers can capture features of different levels (e.g., edges, lines, corners, structures, and shapes) from the input feature maps.
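As an illustration of equation (3), the brief PyTorch sketch below applies one convolutional layer followed by a ReLU activation; the channel counts, kernel size, and input shape are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One convolutional layer as in Eq. (3): 3 input maps -> 16 output maps
# via 3x3 trainable kernels, followed by the activation f (here ReLU).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.rand(1, 3, 64, 64)   # one RGB-like input of size 64x64
y = torch.relu(conv(x))        # f applied after the convolution
print(y.shape)                 # torch.Size([1, 16, 64, 64])
```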
(2) Pooling layers
Pooling layers execute a max or average operation over a small area of each input feature map, which can be defined as equation (4):

$$y = \mathrm{pool}(x) \tag{4}$$

where $\mathrm{pool}(\cdot)$ represents the pooling function (e.g., average pooling, max pooling, or stochastic pooling), and $x$ and $y$ denote the input and output of the pooling layer, respectively. Usually, pooling layers are applied between two successive convolutional layers. The pooling operation creates invariance to small shifts and distortions, a characteristic that is very important in object detection and scene classification tasks.
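A corresponding sketch of equation (4), comparing max and average pooling over 2 × 2 regions (the feature-map shapes are again illustrative):

```python
import torch
import torch.nn as nn

# Eq. (4): pooling over small 2x2 regions of each input feature map.
x = torch.rand(1, 16, 64, 64)
print(nn.MaxPool2d(kernel_size=2)(x).shape)   # torch.Size([1, 16, 32, 32])
print(nn.AvgPool2d(kernel_size=2)(x).shape)   # torch.Size([1, 16, 32, 32])
```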
(3) Fully connected layers
Fully connected layers usually appear at the top of a CNN and summarize the features extracted by the lower layers. A fully connected layer processes its input $x$ with a linear transformation given by a weight matrix $W$ and bias $b$, and then maps the result through a nonlinear activation function $f(\cdot)$, as formulated in equation (5):

$$y = f(Wx + b) \tag{5}$$

In classification tasks, a softmax layer is generally connected to the last fully connected layer to output the probability of each class. Because a fully connected layer usually contains a large number of parameters, the dropout method [81] is applied to fully connected layers to avoid overfitting.
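The following sketch combines equation (5) with the dropout and softmax steps described above; the feature-map shape is illustrative, and the 45 output units merely echo the number of NWPU-RESISC45 classes:

```python
import torch
import torch.nn as nn

# Eq. (5): a fully connected head, y = f(Wx + b), preceded by dropout
# against overfitting and followed by softmax for class probabilities.
head = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(16 * 32 * 32, 45),   # 45 outputs, e.g. 45 scene classes
)
logits = head(torch.rand(8, 16, 32, 32))  # pooled feature maps in, logits out
probs = torch.softmax(logits, dim=1)      # per-class probabilities
print(probs.shape)                        # torch.Size([8, 45])
```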
In the wake of CNNs' successful application to large-scale visual classification tasks, around 2015 the use of CNNs finally took off in the aerial image analysis field [131, 146]. Compared with traditional methods, e.g., SIFT [71], HOG [22], and BoVW [118], CNNs have the advantage of end-to-end feature learning and can extract high-level visual features that handcrafted-feature-based methods cannot. Using different strategies for exploiting CNNs, a variety of CNN-based scene classification methods [130, 140, 16, 123, 67, 142, 17] have emerged.
Penatti et al. [91] introduced CNNs into satellite image scene classification in 2015 and evaluated the generalization capability of off-the-shelf CNNs in the classification of remote sensing images; their experiments show that CNNs can obtain better results than low-level descriptors. Later, Hu et al. [43] treated CNNs as feature extractors and investigated how to make full use of pre-trained CNNs for scene classification. In [78], Marmanis et al. introduced a two-stage CNN scene classification framework that uses pre-trained CNNs to derive a set of representations from images, which are then fed into shallow CNN classifiers. Chaib et al. [8] fused the deep features extracted with VGGNet to enhance scene classification performance. In [56], Li et al. fused pre-trained CNN features; the fused features show better discrimination than raw CNN features in scene classification. Cheng et al. [16] designed a bag of convolutional features (BoCF) for aerial image scene classification by using off-the-shelf CNN features to replace traditional local descriptors such as SIFT. Yuan et al. [125] rearranged the local features extracted by an already trained VGG19Net for aerial image scene classification. In [38], He et al. proposed a novel multilayer stacked covariance pooling algorithm (MSCP) for satellite image scene classification, which automatically combines multilayer feature maps extracted from a pre-trained CNN. Lu et al. [73] introduced a feature aggregation CNN (FACNN) for scene classification, which learns scene representations by exploring semantic label information. These methods all used pre-trained CNNs as feature extractors and then fused or combined the extracted features. It is worth noting that the strategy of using off-the-shelf CNNs as feature extractors is simple and effective on small-scale data sets.
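A hedged sketch of this feature-extractor strategy, assuming a recent torchvision with pre-trained weights available; the choice of VGG16 and the downstream shallow classifier are illustrative, not a reproduction of any specific surveyed method:

```python
import torch
from torchvision import models

# Use a pre-trained VGG16's convolutional trunk as a fixed feature
# extractor for scene images.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

with torch.no_grad():
    feats = extractor(torch.rand(4, 3, 224, 224))  # deep features: [4, 25088]
print(feats.shape)
# feats can now feed a shallow classifier (e.g., an SVM or logistic regression).
```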
However, when the amount of training data is not adequate to train a new CNN from scratch, fine-tuning an already trained CNN on the target data set is a good choice. Castelluccio et al. [7] delved into the use of CNNs for aerial image scene classification by experimenting with three learning approaches: using pre-trained CNNs as feature extractors, fine-tuning, and training from scratch. They concluded that fine-tuning gives better results than full training when the data set is small. This made researchers interested in fine-adjusting scene classification networks or optimizing their loss functions. Cheng et al. [17] designed a novel objective function for learning discriminative CNNs (D-CNNs), which show better discriminability in scene classification. In [69], Liu et al. coupled a CNN with a hierarchical Wasserstein loss function (HW-CNNs) to improve its discriminative ability. Minetto et al. [82] devised a new satellite image scene classification framework, named Hydra, an ensemble of CNNs that achieves the best results on the NWPU-RESISC45 data set. Wang et al. [110] introduced an attention mechanism into CNNs and designed the attention recurrent convolutional network (ARCNet) for scene classification, which is capable of highlighting key areas and discarding noncritical information. In [68], to handle the problem of object scale variation in scene classification, Liu et al. formulated the multiscale CNN (MCNN). Fang et al. [27] designed a robust space-frequency joint representation (RSFJR) for scene classification by adding a frequency-domain branch to CNNs; by fusing features from the space and frequency domains, the method provides more discriminative feature representations. Xie et al. [115] designed a scale-free CNN (SF-CNN) for scene classification, which accepts images of arbitrary size as input without any resizing operation. Sun et al. [100] proposed a gated bidirectional network (GBN) for scene classification, which discards interfering information and aggregates interdependent information among different CNN layers. In the above-mentioned methods, CNNs learn discriminative features and obtain better performance through fine-adjusting their structures, optimizing their objective functions, or fine-tuning modified CNNs on the target data sets. In [9], Chen et al. introduced knowledge distillation into scene classification to boost the performance of lightweight CNNs. Zhang et al. [127] presented a lightweight and effective CNN that introduces dilated convolution and channel attention into MobileNetV2 [95] for scene classification. In [130], Zhang et al. presented a gradient boosting random convolutional network (GBRCN) for scene classification that assembles different deep neural networks.
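A minimal fine-tuning sketch in PyTorch, assuming a recent torchvision; the backbone, class count (30, echoing AID), and the two learning rates are illustrative assumptions rather than settings taken from the surveyed papers:

```python
import torch
from torchvision import models

# Replace the ImageNet head with a new one for the target scene classes
# and fine-tune: a small learning rate for pre-trained layers, a larger
# one for the freshly initialized head.
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Linear(net.fc.in_features, 30)   # e.g., 30 scene classes

optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in net.named_parameters()
                    if not n.startswith("fc")], "lr": 1e-4},
        {"params": net.fc.parameters(), "lr": 1e-2},
    ],
    lr=1e-4,
    momentum=0.9,
)
criterion = torch.nn.CrossEntropyLoss()

images = torch.rand(8, 3, 224, 224)    # stand-in for a training batch
labels = torch.randint(0, 30, (8,))
optimizer.zero_grad()
loss = criterion(net(images), labels)
loss.backward()
optimizer.step()
```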
The Generative Adversarial Network (GAN) [32] is another important and promising machine learning method. As its name implies, a GAN models the distribution of data via adversarial learning based on a minimax two-player game and generates real-like data. GANs contain a pair of components—a discriminator $D$ and a generator $G$. As shown in Fig. 7, $G$ is analogous to a group of counterfeiters who generate fake currency, while $D$ can be thought of as the police who determine whether currency was made by $G$ or by a bank. $G$ and $D$ constantly pit themselves against each other in this game until $D$ cannot distinguish the counterfeit currency from the genuine article. GANs treat the competition between $G$ and $D$ as the sole training criterion. $G$ takes an input $z$, a latent variable obeying a prior distribution $p_z(z)$, and maps the noise into data space using a differentiable function $G(z; \theta_g)$, where $\theta_g$ denotes the generator's parameters. $D$ outputs the probability that its input comes from the real data rather than from the generator, through a mapping $D(x; \theta_d)$, where $\theta_d$ denotes the discriminator's parameters. The entire two-player minimax game is described by equation (6):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{6}$$

where $p_{\mathrm{data}}(x)$ is the distribution of the data and $V(D, G)$ is the objective function. From $D$'s perspective, given an input generated by $G$, $D$ tries to minimize its output, while for a real sample $D$ tries to maximize it; this is why the term $\log(1 - D(G(z)))$ appears in equation (6). Meanwhile, to fool $D$, $G$ strives to maximize $D$'s output when generated data are fed to $D$. Thus the adversarial relationship between $G$ and $D$ is formed.
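One alternating training step of the minimax game in equation (6) can be sketched as follows; the MLP architectures, batch size, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimax game of Eq. (6): D maximizes log D(x) + log(1 - D(G(z))),
# while G tries to fool D. Architectures and sizes are illustrative.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784)   # stand-in for a batch of real images
z = torch.randn(64, 100)     # z ~ p_z(z), the latent prior

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
opt_d.zero_grad()
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
loss_d.backward()
opt_d.step()

# Generator step: push D(G(z)) toward 1, i.e. fool the discriminator.
opt_g.zero_grad()
loss_g = bce(D(G(z)), torch.ones(64, 1))
loss_g.backward()
opt_g.step()
```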
Table II. Publicly available scene classification data sets.

| Data set | Images per class | Scene classes | Total images | Image size | Year |
| --- | --- | --- | --- | --- | --- |
| UC Merced [118] | 100 | 21 | 2100 | 256 × 256 | 2010 |
| WHU-RS19 [114] | 50–61 | 19 | 1005 | 600 × 600 | 2012 |
| SIRI-WHU [135] | 200 | 12 | 2400 | 200 × 200 | 2016 |
| RSSCN7 [147] | 400 | 7 | 2800 | 400 × 400 | 2015 |
| RSC11 [137] | about 100 | 11 | 1232 | 512 × 512 | 2016 |
| Brazilian Coffee Scene [91] | 1438 | 2 | 2876 | 64 × 64 | 2015 |
| AID [113] | 220–420 | 30 | 10000 | 600 × 600 | 2017 |
| NWPU-RESISC45 [13] | 700 | 45 | 31500 | 256 × 256 | 2017 |
| OPTIMAL-31 [110] | 60 | 31 | 1860 | 256 × 256 | 2018 |
| EuroSAT [40] | 2000–3000 | 10 | 27000 | 64 × 64 | 2019 |
As a key method for unsupervised learning, since their introduction by Goodfellow et al. [32] in 2014, GANs have gradually been applied to many tasks, such as image-to-image translation, sample generation, and image super-resolution. Facing the tremendous volume of remote sensing images, CNN-based methods need massive labeled samples to train models, yet annotating samples is labor-intensive; some researchers therefore began to employ GANs for scene classification. In 2017, Lin et al. [65] proposed multiple-layer feature-matching generative adversarial networks (MARTA GANs) for scene classification. Duan et al. [25] used an adversarial net to assist in mining the inherent and discriminative features of aerial images; the mined features can enhance classification accuracy. Bashmal et al. [2] provided a GAN-based method, called Siamese-GAN, to handle the aerial vehicle image classification problem under cross-domain conditions. In [116], to generate high-quality satellite images for scene classification, Xu et al. added scaled exponential linear units to GANs. Ma et al. [76] designed SiftingGAN, which can generate a large variety of authentic annotated samples for scene classification. Teng et al. [105] presented a classifier-constrained adversarial network for cross-domain semi-supervised scene classification. Han et al. [33] introduced a generative framework, named SSGF, for scene classification. Yu et al. [124] devised an attention GAN for scene classification, which achieves better performance by enhancing the representation power of the discriminator.

In the area of aerial image classification, most GAN-based methods use GANs for generating samples or for feature learning by training networks in an adversarial manner. Compared with CNN-based scene classification methods, only a small number of studies have been reported so far, but the powerful self-supervised feature learning capacity of GANs provides a promising future direction for scene classification.
Data sets play an irreplaceable role in the advance of scene classification and are crucial for developing and evaluating various scene classification methods. As the number of high-resolution satellites increases, access to massive high-resolution satellite images makes it possible to build large-scale scene classification benchmarks. In the past few years, researchers from different groups have proposed several publicly available high-resolution benchmark data sets for scene classification of aerial images [118, 91, 13, 113, 114, 147, 137, 135, 40, 110] to push this field forward. Starting with the UC-Merced data set [118], representative data sets include WHU-RS19 [114], Brazilian Coffee Scene [91], RSSCN7 [147], RSC11 [137], SIRI-WHU [135], AID [113], NWPU-RESISC45 [13], OPTIMAL-31 [110], and EuroSAT [40]. The features of these data sets are listed in Table II. Among them, the UC-Merced data set [118], the AID data set [113], and the NWPU-RESISC45 data set [13] are three commonly used benchmarks, which are introduced in detail below.
The UC-Merced data set (http://weegee.vision.ucmerced.edu/datasets/form.html) [118] was released in 2010 and contains 21 scene classes, each consisting of 100 land-use images. In total, the data set comprises 2100 scene images with a pixel resolution of 0.3 m. The images were extracted from the United States Geological Survey National Map of 21 U.S. regions and fixed at 256 × 256 pixels. Fig. 8 shows samples of each category from the data set. To this day, the data set continues to be broadly employed for scene classification. For algorithm evaluation, the two widely used training ratios are 50% and 80%, with the remaining 50% and 20% used for testing.
The AID data set (www.lmars.whu.edu.cn/xia/AID-project.html) [113] is a relatively large-scale data set for aerial scene classification. It was published in 2017 by Wuhan University and consists of 30 scene classes, each containing 220 to 420 images cropped from Google Earth imagery and fixed at 600 × 600 pixels. In total, the data set comprises 10000 scene images. Fig. 9 shows samples of each category from the data set. Unlike the UC-Merced data set, the AID data set is multi-source, because its aerial images were captured by different satellites, and multi-resolution, with the pixel resolution of each scene category varying from about 8 m to about 0.5 m. For algorithm evaluation, the two widely used training ratios are 20% and 50%, with the remaining 80% and 50% used for testing.
To the best of our knowledge, the NWPU-RESISC45 data set (http://www.escience.cn/people/gongcheng/NWPU-RESISC45.html) [13], released by Northwestern Polytechnical University, is currently the largest scene classification data set. It consists of 45 scene categories, each containing 700 images obtained from Google Earth and fixed at 256 × 256 pixels. In total, the data set comprises 31500 scene images selected from more than 100 countries and regions. Apart from some specific classes with lower spatial resolution (e.g., island, lake, mountain, and iceberg), the pixel resolution of most scene categories varies from about 30 m to 0.2 m. Fig. 10 shows samples of each category from the data set. The release of the NWPU-RESISC45 data set has allowed deep learning models to develop their full potential. For algorithm evaluation, the two widely used training ratios are 10% and 20%, with the remaining 90% and 80% used for testing.
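The evaluation protocol shared by these three benchmarks—randomly splitting each data set at a fixed training ratio—can be sketched as follows, assuming an ImageFolder-style directory layout (one sub-folder per class; the path is hypothetical):

```python
import torch
from torchvision import datasets, transforms

# Split an ImageFolder-style data set at a 20% training ratio; the
# remaining 80% is kept for testing.
data = datasets.ImageFolder("NWPU-RESISC45", transform=transforms.ToTensor())
ratio = 0.2
n_train = int(len(data) * ratio)
train_set, test_set = torch.utils.data.random_split(
    data, [n_train, len(data) - n_train],
    generator=torch.Generator().manual_seed(0),   # reproducible split
)
print(len(train_set), len(test_set))
```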
Table III. Overall accuracy (%) of deep learning-based scene classification methods on the UC-Merced data set.

| Type | Method | Year | Publication | OA (50% training) | OA (80% training) |
| --- | --- | --- | --- | --- | --- |
| Autoencoder-based | SGUFL [129] | 2014 | IEEE TGRS | - | 82.72±1.18 |
| | Partlets-based method [12] | 2015 | IEEE TGRS | 88.76±0.79 | - |
| | SCDAE [24] | 2016 | IEEE TCYB | - | 93.7±1.3 |
| CNN-based | GBRCN [130] | 2015 | IEEE TGRS | - | 94.53 |
| | LPCNN [140] | 2016 | JARS | - | 89.90 |
| | Fusion by Addition [8] | 2017 | IEEE TGRS | - | 97.42±1.79 |
| | ARCNet-VGG16 [110] | 2018 | IEEE TGRS | 96.81±0.14 | 99.12±0.40 |
| | MSCP [38] | 2018 | IEEE TGRS | - | 98.36±0.58 |
| | D-CNNs [17] | 2018 | IEEE TGRS | - | 98.93±0.10 |
| | MCNN [68] | 2018 | IEEE TGRS | - | 96.66±0.9 |
| | ADSSM [143] | 2018 | IEEE TGRS | - | 99.76±0.24 |
| | FACNN [73] | 2019 | IEEE TGRS | - | 98.81±0.24 |
| | SF-CNN [115] | 2019 | IEEE TGRS | - | 99.05±0.27 |
| | RSFJR [27] | 2019 | IEEE TGRS | 97.21±0.65 | - |
| | GBN [100] | 2019 | IEEE TGRS | 97.05±0.19 | 98.57±0.48 |
| | ADFF [145] | 2019 | Remote Sensing | 96.05±0.56 | 97.53±0.63 |
| | CNN-CapsNet [133] | 2019 | Remote Sensing | 97.59±0.16 | 99.05±0.24 |
| | Siamese ResNet50 [66] | 2019 | IEEE GRSL | 90.95 | 94.29 |
| GAN-based | MARTA GANs [65] | 2017 | IEEE GRSL | 85.5±0.69 | 94.86±0.80 |
| | Attention GANs [124] | 2019 | IEEE TGRS | 89.06±0.50 | 97.69±0.69 |
Table IV. Overall accuracy (%) of deep learning-based scene classification methods on the AID data set.

| Type | Method | Year | Publication | OA (20% training) | OA (50% training) |
| --- | --- | --- | --- | --- | --- |
| CNN-based | Fusion by Addition [8] | 2017 | IEEE TGRS | - | 91.87±0.36 |
| | ARCNet-VGG16 [110] | 2018 | IEEE TGRS | 88.75±0.40 | 93.10±0.55 |
| | MSCP [38] | 2018 | IEEE TGRS | 91.52±0.21 | 94.42±0.17 |
| | D-CNNs [17] | 2018 | IEEE TGRS | 90.82±0.16 | 96.89±0.10 |
| | MCNN [68] | 2018 | IEEE TGRS | - | 91.80±0.22 |
| | HW-CNNs [69] | 2018 | IEEE TGRS | - | 96.98±0.33 |
| | FACNN [73] | 2019 | IEEE TGRS | - | 95.45±0.11 |
| | SF-CNN [115] | 2019 | IEEE TGRS | 93.60±0.12 | 96.66±0.11 |
| | RSFJR [27] | 2019 | IEEE TGRS | - | 96.81±1.36 |
| | GBN [100] | 2019 | IEEE TGRS | 92.20±0.23 | 95.48±0.12 |
| | ADFF [145] | 2019 | Remote Sensing | 93.68±0.29 | 94.75±0.25 |
| | CNN-CapsNet [133] | 2019 | Remote Sensing | 93.79±0.13 | 96.32±0.12 |
| GAN-based | MARTA GANs [65] | 2017 | IEEE GRSL | 75.39±0.49 | 81.57±0.33 |
| | Attention GANs [124] | 2019 | IEEE TGRS | 78.95±0.23 | 84.52±0.18 |
In recent years, a variety of scene classification algorithms have been published. Here, 24 deep learning-based scene classification methods are selected for performance comparison on three widely used benchmark data sets. Among them, 3 are autoencoder-based, 19 are CNN-based, and 2 are GAN-based.
Table V. Overall accuracy (%) of deep learning-based scene classification methods on the NWPU-RESISC45 data set.

| Type | Method | Year | Publication | OA (10% training) | OA (20% training) |
| --- | --- | --- | --- | --- | --- |
| CNN-based | BoCF [16] | 2017 | IEEE GRSL | 82.65±0.31 | 84.32±0.17 |
| | MSCP [38] | 2018 | IEEE TGRS | 88.07±0.18 | 90.81±0.13 |
| | D-CNNs [17] | 2018 | IEEE TGRS | 89.22±0.50 | 91.89±0.22 |
| | HW-CNNs [69] | 2018 | IEEE TGRS | - | 94.38±0.17 |
| | IORN [109] | 2018 | IEEE GRSL | 87.83±0.16 | 91.30±0.17 |
| | ADSSM [143] | 2018 | IEEE TGRS | 91.69±0.22 | 94.29±0.14 |
| | SF-CNN [115] | 2019 | IEEE TGRS | 89.89±0.16 | 92.55±0.14 |
| | ADFF [145] | 2019 | Remote Sensing | 90.58±0.19 | 91.91±0.23 |
| | CNN-CapsNet [133] | 2019 | Remote Sensing | 89.03±0.21 | 89.03±0.21 |
| | Hydra [82] | 2019 | IEEE TGRS | 92.44±0.34 | 94.51±0.21 |
| | Siamese ResNet50 [66] | 2019 | IEEE GRSL | - | 92.28 |
| GAN-based | MARTA GANs [65] | 2017 | IEEE GRSL | 68.63±0.22 | 75.03±0.28 |
| | Attention GANs [124] | 2019 | IEEE TGRS | 72.21±0.21 | 77.99±0.19 |
Tables III, IV, and V report the classification accuracies of deep learning-based scene classification methods on the UC-Merced, AID, and NWPU-RESISC45 data sets, respectively, measured in terms of overall accuracy (OA). OA is a commonly used criterion for evaluating methods for scene classification of aerial images; it is defined as the total number of accurately classified samples divided by the total number of tested samples.
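A minimal sketch of the OA computation (the example predictions and labels are illustrative):

```python
import torch

# OA = correctly classified samples / total tested samples.
def overall_accuracy(preds: torch.Tensor, labels: torch.Tensor) -> float:
    return (preds == labels).float().mean().item()

preds = torch.tensor([1, 0, 2, 2])
labels = torch.tensor([1, 0, 1, 2])
print(overall_accuracy(preds, labels))   # 0.75
```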
As can be seen from Tables III–V, the performance of aerial image scene classification has been successively advanced. In the early days, deep learning-based scene classification approaches were mainly based on autoencoders, and researchers usually used the UC-Merced data set to evaluate them. As an early unsupervised deep learning method, the autoencoder has a relatively simple structure, so its feature learning capability is limited, and the accuracies of autoencoder-based approaches plateaued on the standard benchmarks.
Fortunately, after 2012, CNNs, a powerful supervised learning approach, proved capable of learning abstract features from raw images. Despite this potential, it took until around 2015 for CNNs to take off in the satellite image scene classification domain. Early CNN-based algorithms mainly used CNNs as feature extractors and outperformed autoencoder-based methods; however, using CNNs only as feature extractors did not exploit their full potential. Thanks to the release of two large-scale scene classification benchmarks, AID and NWPU-RESISC45, in 2017, fine-tuning off-the-shelf CNNs has shown better generalization ability in scene classification than using CNNs solely as feature extractors.
Scene classification is an important and challenging problem in remote sensing image interpretation. Driven by its wide applications, it has attracted extensive research attention. Thanks to the advancement of deep learning techniques and the establishment of large-scale data sets, scene classification has seen dramatic improvement. In spite of the amazing successes of the past several years, there still exists a giant gap between the current understanding level of machines and human-level performance, so much work remains to be done. By investigating the current scene classification algorithms and the available data sets, this paper discusses several potential future directions for scene classification in remote sensing imagery.
(1) Developing larger-scale scene classification data sets. An ideal scene classification system would be capable of accurately and efficiently recognizing all scene types in all open-world scenes. Recent scene classification methods are still trained with relatively limited data sets, so they can classify the scene categories within the training data sets but are blind, in principle, to scene classes outside them. A compelling scene classification system should therefore be able to accurately label a novel scene image with a semantic category. The existing data sets [118, 13, 113] contain dozens of scene classes, far fewer than humans can distinguish. Moreover, a common deep CNN has millions of parameters and tends to over-fit the tens of thousands of samples in a training set. Hence, fully training a deep classification model is almost impracticable with currently available scene classification data sets. A majority of advanced scene classification algorithms instead rely on fine-tuning already trained CNNs on target data sets or on utilizing pre-trained CNNs as feature extractors. Although these transfer solutions behave fairly well on target data sets with limited classes and samples, they are not optimal compared with fully training a deep CNN, because a model trained from scratch can extract more specific features adapted to the target domain when the number of training samples is large enough. Considering this, developing a new large-scale data set with considerably more scene classes is very promising.
(2) Unsupervised learning for scene classification. Currently, the most advanced scene classification algorithms generally use fully supervised models learned from data annotated with semantic categories and have achieved amazing results. However, such fully supervised learning is extremely expensive and time-consuming, because data annotation must be done manually by researchers with expert knowledge of remote sensing image understanding. When the number of scene classes is huge, data annotation may become very difficult owing to the massive diversity and variation in remote sensing images. Meanwhile, labeled data are generally full of noise and errors, especially in large-scale data sets, since the diverse knowledge levels of different specialists result in different understandings of the same scene classes. Fully supervised learning can hardly work well without a large data set with clean labels. As a promising unsupervised learning method, generative adversarial networks have been used for tackling scene classification with data sets that lack annotations [25, 65, 124]. Consequently, it is valuable to explore unsupervised learning for scene classification.
(3) Compact and efficient scene classification models. Another key factor in the outstanding progress of scene classification over the past few years is the evolution of powerful deep CNNs. To achieve high classification accuracy, the number of layers in CNNs has increased from several to hundreds. Most advanced CNN models have millions of parameters and require massive labeled data sets and high-performance GPUs for training, which severely limits the deployment of scene classification algorithms on airborne and satellite-borne embedded systems. In response, some researchers are working to design compact and lightweight scene classification models [9, 127]. In this area, there is much work to be done.
(4) Learning discriminative feature representations. Two key factors that influence the performance of scene classification are the intraclass diversity and interclass similarity existing in remote sensing images. Even though CNNs can extract abundant semantic features from a given satellite image, and some methods for learning discriminative CNN feature representations have been proposed [82, 39, 138], the challenges of high intraclass variation and small interclass separability are still not fully solved and seriously affect classification performance. In the future, learning more discriminative feature representations to handle these challenges needs to be addressed.
(5) Scene classification with limited samples. CNNs have obtained huge successes in the field of scene classification. However, most of these models demand large-scale labeled data and numerous iterations to train their parameters. This severely limits their scalability to novel categories because of the high cost of labeling, and fundamentally confines their applicability to rare scene categories (e.g., missile positions and military zones), which are difficult to capture. In contrast, humans are adept at distinguishing scenes with little supervision, or none at all, as in few-shot [101] or zero-shot learning [122]. For instance, children can quickly and accurately recognize scene types from a single image on TV, in a book, or from hearing a description. The current best scene classification approaches are still far from matching the human ability to classify scene types from a few labeled samples. Exploring few-shot/zero-shot learning approaches for scene classification [66, 126, 54] still needs further development.
(6) Cross-domain scene classification. Current research has confirmed that CNNs are powerful tools for scene classification. Pre-trained CNN models have shown good generalization to remote sensing scene data sets, either by extracting discriminative holistic features or by fine-tuning off-the-shelf CNNs on target data sets. However, it is often assumed that the training and test sets do not come from different domains. This assumption is frequently unwarranted, because satellite images are acquired under differing conditions; simple transfer learning or fine-tuning is insufficient for aerial images captured by different satellites or over different locations. In the past few years, some researchers have explored cross-domain scene classification to enhance the generalization of CNN models and reduce the distribution gap between target and source domains [1, 90, 72, 99]. There is much potential for improving domain adaptation-based methods for scene classification, such as mapping the feature representations from target and source domains onto a uniform space while preserving the original data structures, designing additional adaptation layers, and optimizing loss functions.

Scene classification of aerial images has achieved major improvements through several decades of development, and the number of papers on the topic, especially on deep learning-based methods, is breathtaking. Taking into account the rapid rate of progress in scene classification, in this paper we first discussed the main challenges currently facing satellite image scene classification. Then we surveyed three kinds of deep learning-based methods in detail and introduced the mainstream scene classification benchmarks. Next, we summarized the performance of deep learning-based methods on three widely used data sets in tabular form and provided an analysis of the results. Finally, we discussed a set of promising opportunities for further research.
Mapping vegetation morphology types in a dry savanna ecosystem: integrating hierarchical object-based image analysis with random forest. International Journal of Remote Sensing 35 (3), pp. 1175–1198.

Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 12 (11), pp. 2321–2325.